PyTorch is a popular Python deep learning, automatic differentiation, and optimization library with excellent GPU and CPU support. It features flexible eager-mode execution, just-in-time (“JIT”) compilation, and domain-specific tools (e.g., torchvision for image-based learning tasks). It can be loaded in a Python environment, and the presence of GPU accelerators can be tested as follows:
Python 3.10.9 (main, Jan 11 2023, 15:21:40) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> for i in range(torch.cuda.device_count()):
...     print(torch.cuda.get_device_properties(i).name)
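For use inside a batch script rather than an interactive session, the same check can be wrapped in a small function. This is only a minimal sketch: the name `list_cuda_devices` is illustrative and not part of PyTorch, and it assumes that an empty list is an acceptable result when torch cannot be imported (for example, because the module was not loaded) or when no GPU is visible.

```python
def list_cuda_devices():
    """Return the names of the CUDA devices visible to PyTorch (possibly empty)."""
    try:
        import torch  # may fail if the anaconda module has not been loaded
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        torch.cuda.get_device_properties(i).name
        for i in range(torch.cuda.device_count())
    ]


if __name__ == "__main__":
    names = list_cuda_devices()
    if names:
        for index, name in enumerate(names):
            print(f"GPU {index}: {name}")
    else:
        print("No CUDA devices visible; PyTorch would run on the CPU.")
```

On a login node without GPUs this would be expected to report no devices, so the check is most informative when run inside a job on a GPU partition.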
The anaconda3/2023.09 module’s Python distribution also contains some useful extensions to PyTorch:
PyTorch Lightning - Powerful, HPC-friendly, boilerplate-removing library for training, logging, and reproducibility with deep learning models.
PyTorch Geometric - Flexible graph neural network package for use in molecular/materials science, network science, and many other application domains of graph theory.
Examples of CPU, (multi) GPU, and multi-node training tasks for HPC environments can be found here. The examples reproduced below train convolutional neural network image classification models on the Fashion-MNIST dataset.
Setup (on login node):
This sets up some simple packages:
$ module load anaconda3/2023.09
$ conda activate base
$ git clone
$ cd pytorch-hpc
$ pip install --user .
1. Single node, single GPU:
We start with a training YAML file (fashion_mnist_conv_gpu.yaml) appropriate for PyTorch Lightning (note that similar training jobs can be set up without PyTorch Lightning; see the official PyTorch tutorials for more granular examples):
Since only one GPU is needed, it is better to use the gpu-a100:shared partition and request just one GPU (gres=gpu:A100:1) rather than queuing for a full node with 4 GPUs. The following SLURM submission script details the options:
#! /bin/bash
#SBATCH -J pyt_cli_test_conv_gpu
#SBATCH -o pyt_cli_test_conv_gpu.out
#SBATCH --time=00:30:00
#SBATCH --partition=gpu-a100:shared
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:A100:1
#SBATCH --mem-per-cpu=1G
#SBATCH --cpus-per-task=4
module load cuda/11.8
module load anaconda3/2023.09
conda activate base
srun pythpc --config fashion_mnist_conv_gpu.yaml fit
and can be run using:
The results can be inspected using the TensorBoard package (also included in the anaconda3/2023.09 module), which can be viewed on your local machine via SSH tunneling:
Note: you may change the port 8877 to something else if needed. Alternatively, you may copy your events* logfiles to your local machine and inspect them with tensorboard.
2. Single node, multiple GPUs
Adding more GPUs with PyTorch Lightning is as simple as setting:
In the training YAML (see fashion_mnist_conv_multi_gpu.yaml), and requesting a non-shared partition in the SBATCH directives.
Remember that the number of nodes and GPUs requested through SLURM must match those requested in the PyTorch Lightning training YAML.
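The matching requirement above can be checked when a job starts, before any training begins. The sketch below is illustrative only: the function `check_slurm_matches_trainer` is hypothetical, `num_nodes` and `devices` stand for the corresponding PyTorch Lightning trainer settings, and it assumes the common convention of one SLURM task per GPU (so `--ntasks-per-node` should equal `devices`). It reads the standard SLURM environment variables `SLURM_JOB_NUM_NODES` and `SLURM_NTASKS_PER_NODE`.

```python
import os


def check_slurm_matches_trainer(num_nodes, devices, env=None):
    """Return a list of mismatch messages (empty if the settings agree).

    num_nodes and devices are the values given to the Lightning trainer;
    env defaults to the environment of the running SLURM job.
    """
    env = os.environ if env is None else env
    problems = []

    # Number of nodes allocated to the job (from --nodes).
    slurm_nodes = env.get("SLURM_JOB_NUM_NODES")
    if slurm_nodes is not None and int(slurm_nodes) != num_nodes:
        problems.append(
            f"SLURM allocated {slurm_nodes} node(s), trainer expects {num_nodes}"
        )

    # Tasks per node (from --ntasks-per-node); assumed to be one per GPU.
    slurm_tasks = env.get("SLURM_NTASKS_PER_NODE")
    if slurm_tasks is not None and int(slurm_tasks) != devices:
        problems.append(
            f"SLURM runs {slurm_tasks} task(s) per node, trainer expects {devices}"
        )

    return problems


if __name__ == "__main__":
    # Example values for a single node with 4 GPUs.
    for message in check_slurm_matches_trainer(num_nodes=1, devices=4):
        print("WARNING:", message)
```

Outside of a SLURM job the environment variables are unset, so the function reports no problems rather than failing.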
3. Multiple nodes, multiple GPUs
Training across multiple nodes with multiple GPUs on a cluster is seamless with PyTorch Lightning. Simply change the training YAML to include:
This configuration expects 2 nodes with 4 GPUs each, for a total of 8 GPUs, using a distributed data parallel strategy (see here for alternative PyTorch Lightning distributed training strategies). Accordingly, the SLURM submission script must now be changed to include:
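To make the correspondence between the trainer settings and the SLURM request concrete, the sketch below derives resource directives from a node count and a per-node GPU count. It is not the submission script from the example repository: the helper `sbatch_resource_lines` is hypothetical, the one-task-per-GPU mapping is an assumption, and the `gpu:A100:N` form is taken from the single-GPU script above. The partition directive is left out because the name of the non-shared partition is not given here.

```python
def sbatch_resource_lines(num_nodes, devices, gpu_type="A100"):
    """Build #SBATCH resource directives for num_nodes nodes with `devices` GPUs each."""
    if num_nodes < 1 or devices < 1:
        raise ValueError("num_nodes and devices must both be at least 1")
    return [
        f"#SBATCH --nodes={num_nodes}",
        # Assumption: one SLURM task is launched per GPU on each node.
        f"#SBATCH --ntasks-per-node={devices}",
        # --gres is a per-node request, so this asks for `devices` GPUs on every node.
        f"#SBATCH --gres=gpu:{gpu_type}:{devices}",
    ]


if __name__ == "__main__":
    nodes, gpus_per_node = 2, 4
    for line in sbatch_resource_lines(nodes, gpus_per_node):
        print(line)
    print("Total GPUs:", nodes * gpus_per_node)  # 2 nodes x 4 GPUs = 8
```

For the 2-node, 4-GPU case described above, this yields `--nodes=2`, `--ntasks-per-node=4`, and `--gres=gpu:A100:4`.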