PyTorch
PyTorch is a popular Python deep learning/autodifferentiation/optimization library with excellent GPU and CPU support. It features flexible eager-mode execution, just-in-time (“JIT”) compilation support, and domain-specific companion packages (e.g., torchvision for image-based learning tasks). It can be loaded in a Python environment, and the presence of GPU accelerators can be tested as follows:
Python 3.10.9 (main, Jan 11 2023, 15:21:40) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> for i in range(torch.cuda.device_count()):
...     print(torch.cuda.get_device_properties(i).name)
...
NVIDIA A100-SXM4-80GB
NVIDIA A100-SXM4-80GB
NVIDIA A100-SXM4-80GB
NVIDIA A100-SXM4-80GB
Extensions
The anaconda3/2023.09 module’s Python distribution also contains some useful extensions to PyTorch:
PyTorch Lightning - Powerful, HPC-friendly, boilerplate-removing library for training, logging, and reproducibility with deep learning models (a minimal usage sketch follows this list).
PyTorch Geometric - Flexible graph neural network package for use in molecular/materials science, network science, and many other application domains of graph theory.
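To give an impression of how PyTorch Lightning strips the boilerplate out of a training run, the following minimal sketch is illustrative only; it is not taken from the example repository used below, and the model, random stand-in data, and hyperparameters are placeholders:

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class TinyClassifier(pl.LightningModule):
    """A small fully connected classifier; layer sizes are arbitrary."""

    def __init__(self, n_features: int = 32, n_classes: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_classes)
        )
        self.loss = nn.CrossEntropyLoss()

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = self.loss(self.net(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    # Random stand-in data; the examples below use torchvision datasets instead.
    x = torch.randn(512, 32)
    y = torch.randint(0, 4, (512,))
    loader = DataLoader(TensorDataset(x, y), batch_size=64)

    # The Trainer takes care of device placement, logging, and checkpointing.
    trainer = pl.Trainer(max_epochs=2, accelerator="auto", devices=1)
    trainer.fit(TinyClassifier(), loader)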
Examples
Examples of CPU, (multi-)GPU, and multi-node training tasks for HPC environments can be found here. The examples reproduced below train convolutional neural network image classification models on the Fashion-MNIST dataset.
Setup (on login node):
The following commands load the Python environment, fetch the example repository, and install it for the current user:
$ module load anaconda3/2023.09
$ conda activate base
$ git clone https://github.com/Ruunyox/pytorch-hpc
$ cd pytorch-hpc
$ pip install --user .
1. Single node, single GPU:
We start with a training YAML file (fashion_mnist_conv_gpu.yaml) appropriate for PyTorch Lightning. Note that similar training jobs can be set up without PyTorch Lightning; see the official PyTorch tutorials for more granular examples.
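For comparison, a plain-PyTorch version of such a job reduces to an explicit training loop. The following is a rough sketch only; the small CNN and the hyperparameters are illustrative and do not correspond to the model configured in fashion_mnist_conv_gpu.yaml:

# Sketch of a single-GPU Fashion-MNIST training loop in plain PyTorch.
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

train_data = datasets.FashionMNIST(
    root="data", train=True, download=True, transform=transforms.ToTensor()
)
loader = DataLoader(train_data, batch_size=128, shuffle=True, num_workers=4)

# Arbitrary small CNN: 28x28x1 -> 14x14x16 -> 7x7x32 -> 10 classes.
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),
).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    running_loss = 0.0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f"epoch {epoch}: mean loss {running_loss / len(loader):.4f}")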
Since only 1 GPU is needed, it is better to use the gpu-a100:shared partition and request just one GPU (gres=gpu:A100:1) rather than queuing for a full node with 4 GPUs. The following SLURM submission script details the options:
#! /bin/bash
#SBATCH -J pyt_cli_test_conv_gpu
#SBATCH -o pyt_cli_test_conv_gpu.out
#SBATCH --time=00:30:00
#SBATCH --partition=gpu-a100:shared
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:A100:1
#SBATCH --mem-per-cpu=1G
#SBATCH --cpus-per-task=4

module load cuda/11.8
module load anaconda3/2023.09
conda activate base

srun pythpc --config fashion_mnist_conv_gpu.yaml fit
and can be run using:
$ sbatch cli_test_conv_gpu.sh
The results can be inspected using the TensorBoard package (also included in the anaconda3/2023.09 module):
$ tensorboard --logdir ./fashion_mnist_conv_gpu/tensorboard --port 8877
which can be viewed on your local machine via SSH tunneling:
$ ssh -NL 8877:localhost:8877 your_hlrn_username@your_login_address
Note: you may change the port 8877 to something else if needed. Alternatively, you may copy your events* logfiles to your local machine and inspect them with tensorboard there.
2. Single node, multiple GPUs
Adding more GPUs with PyTorch Lightning is as simple as setting:
fit:
  trainer:
    devices: 4
in the training YAML (see fashion_mnist_conv_multi_gpu.yaml) and requesting a non-shared partition in the SBATCH options:
#SBATCH --partition=gpu-a100
#SBATCH --gres=gpu:A100:4
Remember that the number of nodes/GPUs requested through SLURM must match those requested in the PyTorch Lightning training YAML.
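If you want to catch such a mismatch early, a small sanity check at the start of the job can compare the SLURM allocation with the training YAML. The snippet below is an optional sketch that assumes the fit.trainer layout shown above and that PyYAML is importable in the environment:

# Optional sanity check (sketch): compare the SLURM allocation with the
# Lightning YAML before training starts.
import os

import torch
import yaml

with open("fashion_mnist_conv_multi_gpu.yaml") as f:
    trainer_cfg = yaml.safe_load(f)["fit"]["trainer"]

slurm_nodes = int(os.environ.get("SLURM_JOB_NUM_NODES", "1"))
visible_gpus = torch.cuda.device_count()

assert trainer_cfg.get("num_nodes", 1) == slurm_nodes, "node counts disagree"
assert trainer_cfg.get("devices", 1) == visible_gpus, "GPU counts disagree"
print(f"OK: {slurm_nodes} node(s) with {visible_gpus} visible GPU(s) each")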
3. Multiple nodes, multiple GPUs
Training across multiple nodes with multiple GPUs on a cluster is seamless with PyTorch Lightning. Simply change the training YAML to include:
fit:
  trainer:
    devices: 4
    num_nodes: 2
    strategy: ddp
This configuration expects 2 nodes with 4 GPUs each, for a total of 8 GPUs, using a distributed data parallel strategy (see here for alternative PyTorch Lightning distributed training strategies). Accordingly, the SLURM submission script must now be changed to include:
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:A100:4
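For reference, the same distributed setup expressed directly against the PyTorch Lightning Trainer API (rather than through the YAML/CLI) looks roughly like this; with the ddp strategy, each GPU is driven by its own process and gradients are averaged across all 8 of them:

# Rough Python equivalent of the multi-node YAML above; the YAML/CLI route
# shown in these examples fills the same Trainer arguments from the
# trainer section of the config file.
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,        # GPUs per node; must match --gres=gpu:A100:4
    num_nodes=2,      # must match #SBATCH --nodes=2
    strategy="ddp",   # distributed data parallel
)
# trainer.fit(model, datamodule) is then called exactly as in the
# single-GPU case; the training code itself does not change.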