PyTorch

PyTorch is a popular Python library for deep learning, automatic differentiation, and optimization, with excellent GPU and CPU support. It features flexible eager-mode execution, just-in-time (“JIT”) compilation, and domain-specific tools (e.g., torchvision for image-based learning tasks). It can be loaded in a Python environment, and the presence of GPU accelerators can be tested as follows:

Python 3.10.9 (main, Jan 11 2023, 15:21:40) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> import torch
>>> for i in range(torch.cuda.device_count()):
...    print(torch.cuda.get_device_properties(i).name)
...
NVIDIA A100-SXM4-80GB
NVIDIA A100-SXM4-80GB
NVIDIA A100-SXM4-80GB
NVIDIA A100-SXM4-80GB
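
For scripts that should run on both GPU and CPU-only nodes, the same check can drive device selection. The following is a minimal sketch and not part of the original examples: the helper name pick_device is hypothetical, and it falls back to "cpu" when PyTorch is not importable or no GPU is visible.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" if PyTorch can see at least one GPU, else "cpu"."""
    # Hypothetical helper: also covers environments where PyTorch is not
    # installed, e.g. before the anaconda3 module has been loaded.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

The returned string can be passed to torch.device(...) or to the .to(...) method of a tensor or module.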

Extensions

The Python distribution in the anaconda3/2023.09 module also contains some useful extensions to PyTorch:

PyTorch Lightning - Powerful, HPC-friendly, boilerplate-removing library for training, logging, and reproducibility with deep learning models.
PyTorch Geometric - Flexible graph neural network package for use in molecular/materials science, network science, and many other application domains of graph theory.

Examples

Examples of CPU, (multi-)GPU, and multi-node training tasks for HPC environments can be found here. The examples reproduced below train convolutional neural network image classification models on the Fashion-MNIST dataset.

Setup (on login node):

This sets up some simple packages:

$ module load anaconda3/2023.09
$ conda activate base
$ git clone https://github.com/Ruunyox/pytorch-hpc
$ cd pytorch-hpc
$ pip install --user .

1. Single node, single GPU:

We start with a training YAML file (fashion_mnist_conv_gpu.yaml) appropriate for PyTorch Lightning. Note that similar training jobs can be set up without PyTorch Lightning; see the official PyTorch tutorials for more granular examples.

Since only 1 GPU is needed, it is better to use the gpu-a100:shared partition and request just one GPU (gres=gpu:A100:1) rather than queuing for a full node with 4 GPUs. The following SLURM submission script details the options:

#!/bin/bash
#SBATCH -J pyt_cli_test_conv_gpu
#SBATCH -o pyt_cli_test_conv_gpu.out
#SBATCH --time=00:30:00
#SBATCH --partition=gpu-a100:shared
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:A100:1
#SBATCH --mem-per-cpu=1G
#SBATCH --cpus-per-task=4

module load cuda/11.8
module load anaconda3/2023.09

conda activate base

srun pythpc --config fashion_mnist_conv_gpu.yaml fit

The script can be submitted using:

$ sbatch cli_test_conv_gpu.sh

The results can be inspected using the TensorBoard package (also included in the anaconda3/2023.09 module):

$ tensorboard --logdir ./fashion_mnist_conv_gpu/tensorboard --port 8877

which can be viewed on your local machine via SSH tunneling:

ssh -NL 8877:localhost:8877 your_hlrn_username@your_login_address

Note: you may change the port 8877 to something else if needed. Alternatively, you may copy your events* logfiles to your local machine and inspect them with tensorboard there.
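
If the port changes often, the tunnel command can be generated rather than edited by hand. This is an illustrative sketch only: the function tunnel_command and its parameters are hypothetical, and the username and login address are the same placeholders used above.

```python
import shlex
from typing import Optional


def tunnel_command(user: str, host: str, remote_port: int = 8877,
                   local_port: Optional[int] = None) -> str:
    """Build the `ssh -NL` command that forwards a local port to TensorBoard."""
    # By default, reuse the remote port number on the local machine.
    if local_port is None:
        local_port = remote_port
    parts = ["ssh", "-NL", f"{local_port}:localhost:{remote_port}", f"{user}@{host}"]
    # shlex.join quotes any argument that needs shell escaping.
    return shlex.join(parts)


command = tunnel_command("your_hlrn_username", "your_login_address")
print(command)
```

With the defaults this reproduces the command shown above; passing local_port=9000, for example, would make TensorBoard reachable at localhost:9000 on the local machine.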

2. Single node, multiple GPUs

Adding more GPUs with PyTorch Lightning is as simple as setting:

fit:
    trainer:
        devices: 4

in the training YAML (see fashion_mnist_conv_multi_gpu.yaml) and requesting a non-shared partition with all four GPUs in the SBATCH options:

#SBATCH --partition=gpu-a100
#SBATCH --gres=gpu:A100:4

Remember that the number of nodes/GPUs requested through SLURM must match those requested in the PyTorch Lightning training YAML.
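
A mismatch between the two can be caught before submission with a small consistency check. The sketch below is an assumption-laden illustration, not a tool shipped with the examples: the function names are hypothetical, only long-form `#SBATCH --key=value` directives are parsed, and the gres string is assumed to follow the gpu:TYPE:COUNT pattern used on this page.

```python
import re


def parse_sbatch(script: str) -> dict:
    """Collect long-form `#SBATCH --key=value` directives into a dict."""
    options = {}
    for line in script.splitlines():
        match = re.match(r"#SBATCH\s+--([\w-]+)=(\S+)", line.strip())
        if match:
            options[match.group(1)] = match.group(2)
    return options


def requested_gpus(options: dict) -> int:
    """Read the per-node GPU count from a gres string such as gpu:A100:4."""
    gres = options.get("gres", "")
    if not gres.startswith("gpu"):
        return 0
    last = gres.split(":")[-1]
    # A gres string without an explicit count (e.g. gpu:A100) means one GPU.
    return int(last) if last.isdigit() else 1


def check_resources(script: str, devices: int, num_nodes: int = 1) -> list:
    """Return a list of mismatches between SLURM and trainer settings."""
    options = parse_sbatch(script)
    problems = []
    gpus = requested_gpus(options)
    nodes = int(options.get("nodes", 1))
    if gpus != devices:
        problems.append(f"SLURM requests {gpus} GPU(s) per node, trainer expects {devices}")
    if nodes != num_nodes:
        problems.append(f"SLURM requests {nodes} node(s), trainer expects {num_nodes}")
    return problems


script = """#SBATCH --partition=gpu-a100
#SBATCH --nodes=1
#SBATCH --gres=gpu:A100:4
"""
print(check_resources(script, devices=4))  # empty list: settings agree
print(check_resources(script, devices=1))  # one reported mismatch
```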

3. Multiple nodes, multiple GPUs

Training across multiple nodes with multiple GPUs on a cluster is seamless with PyTorch Lightning. Simply change the training YAML to include:

fit:
    trainer:
        devices: 4
        strategy: ddp
        num_nodes: 2

This configuration expects 2 nodes with 4 GPUs each, for a total of 8 GPUs, using a distributed data parallel (DDP) strategy (see here for alternative PyTorch Lightning distributed training strategies). Accordingly, the SLURM submission script must now be changed to include:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:A100:4
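
To make the effect of this layout concrete, the arithmetic behind distributed data parallel training can be sketched as follows. This is a rough illustration under stated assumptions: the helper ddp_layout is hypothetical, every process is assumed to receive an equal (padded) shard of the data, the dataset size is the 60,000 images of the Fashion-MNIST training set, and the per-GPU batch size of 64 is an arbitrary choice rather than a value taken from the example YAML.

```python
import math


def ddp_layout(num_nodes: int, devices: int, dataset_size: int, batch_size: int) -> dict:
    """Summarize how a dataset is divided among DDP processes."""
    # One process per GPU: the world size is nodes times GPUs per node.
    world_size = num_nodes * devices
    # Each process sees an equal shard, rounded up so no sample is dropped.
    samples_per_rank = math.ceil(dataset_size / world_size)
    return {
        "world_size": world_size,
        "samples_per_rank": samples_per_rank,
        # Gradients are averaged across processes, so one optimizer step
        # effectively covers batch_size samples from every process.
        "effective_batch_size": batch_size * world_size,
        "steps_per_epoch": math.ceil(samples_per_rank / batch_size),
    }


layout = ddp_layout(num_nodes=2, devices=4, dataset_size=60_000, batch_size=64)
for key, value in layout.items():
    print(f"{key}: {value}")
```

Because the effective batch size grows with the number of GPUs, hyperparameters such as the learning rate may need retuning when moving from the single-GPU example to the multi-node one.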