PyTorch

PyTorch is a popular Python library for deep learning, automatic differentiation, and optimization, with excellent GPU and CPU support. It features flexible eager-mode execution, just-in-time (JIT) compilation, and domain-specific companion packages (e.g., torchvision for image-based learning tasks). It can be loaded in a Python environment, and the presence of GPU accelerators can be checked as follows:

Python 3.10.9 (main, Jan 11 2023, 15:21:40) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> for i in range(torch.cuda.device_count()):
...     print(torch.cuda.get_device_properties(i).name)
...
NVIDIA A100-SXM4-80GB
NVIDIA A100-SXM4-80GB
NVIDIA A100-SXM4-80GB
NVIDIA A100-SXM4-80GB

Extensions

The Python distribution in the anaconda3/2023.09 module also contains some useful extensions to PyTorch:

  • PyTorch Lightning - Powerful, HPC-friendly, boilerplate-removing library for training, logging, and reproducibility with deep learning models (see the sketch after this list).

  • PyTorch Geometric - Flexible graph neural network package for use in molecular/materials science, network science, and many other application domains of graph theory.
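To illustrate what PyTorch Lightning takes off your hands, here is a minimal, self-contained sketch. It is not part of the pythpc package used below; the model and the random stand-in data are purely illustrative:

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class TinyClassifier(pl.LightningModule):
    """A minimal LightningModule: model, loss, and optimizer in one place."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
        self.loss = nn.CrossEntropyLoss()

    def training_step(self, batch, batch_idx):
        x, y = batch
        return self.loss(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# Random stand-in data shaped like Fashion-MNIST (1x28x28 images, 10 classes)
data = TensorDataset(torch.randn(256, 1, 28, 28), torch.randint(0, 10, (256,)))

# The Trainer owns the training loop, device placement, logging, and checkpointing
trainer = pl.Trainer(max_epochs=1, accelerator="auto", devices=1)
trainer.fit(TinyClassifier(), DataLoader(data, batch_size=32))

Because the Trainer handles device placement and the training loop, the SLURM examples below only need to describe resources and a YAML configuration.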

Examples

Examples of CPU, (multi) GPU, and multi-node training tasks for HPC environments can be found in the pytorch-hpc repository (cloned below). Reproduced below are examples of training convolutional neural network image classifiers on the Fashion-MNIST dataset.

Setup (on login node):

The following commands clone the example repository and install it into your user environment:

$ module load anaconda3/2023.09
$ conda activate base
$ git clone https://github.com/Ruunyox/pytorch-hpc
$ cd pytorch-hpc
$ pip install --user .

1. Single node, single GPU:

We start with a training YAML file (fashion_mnist_conv_gpu.yaml) suitable for PyTorch Lightning (note that a similar training job can be set up without PyTorch Lightning; see the official PyTorch tutorials for more granular examples):
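The full file ships with the pytorch-hpc repository; a heavily abridged sketch of its structure (the class paths and hyperparameters below are illustrative, not the actual values) looks like:

trainer:
  accelerator: gpu
  devices: 1
  max_epochs: 10
  logger:
    class_path: pytorch_lightning.loggers.TensorBoardLogger
    init_args:
      save_dir: ./logs
model:
  class_path: pythpc.models.ConvClassifier        # hypothetical class path
data:
  class_path: pythpc.data.FashionMNISTDataModule  # hypothetical class path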

Since only 1 GPU is needed, it is better to use the gpu-a100:shared partition and request just one GPU (gres=gpu:A100:1) rather than queuing for a full node with 4 GPUs. The following SLURM submission script details the options:

#! /bin/bash
#SBATCH -J pyt_cli_test_conv_gpu
#SBATCH -o pyt_cli_test_conv_gpu.out
#SBATCH --time=00:30:00
#SBATCH --partition=gpu-a100:shared
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:A100:1
#SBATCH --mem-per-cpu=1G
#SBATCH --cpus-per-task=4

module load cuda/11.8
module load anaconda3/2023.09
conda activate base

srun pythpc --config fashion_mnist_conv_gpu.yaml fit

and can be run using:
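(assuming the script above is saved as fashion_mnist_conv_gpu.slurm; the filename is illustrative)

$ sbatch fashion_mnist_conv_gpu.slurm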

The results can be inspected with the TensorBoard package (also included in the anaconda3/2023.09 module):
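(the log directory below assumes the default location used by Lightning's TensorBoard logger; point --logdir at wherever your run writes its event files)

$ tensorboard --logdir ./lightning_logs --port 8877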

which can be viewed on your local machine via SSH tunneling:
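(username and hostname are placeholders for your own account and the cluster login node)

$ ssh -N -L 8877:localhost:8877 <username>@<login-node>

and then opening http://localhost:8877 in a local browser.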

Note: you may change the port 8877 to something else if needed. Alternatively, you may copy your events* log files to your local machine and inspect them with TensorBoard there.
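For example, with scp (the hostname and log path are placeholders):

$ scp -r <username>@<login-node>:<path-to-logdir> ./
$ tensorboard --logdir ./<path-to-logdir>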

2. Single node, multiple GPUs:

Adding more GPUs with PyTorch Lightning is as simple as setting:
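(the key names follow the standard Lightning Trainer options; check fashion_mnist_conv_multi_gpu.yaml for the exact settings)

trainer:
  accelerator: gpu
  devices: 4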

in the training YAML (see fashion_mnist_conv_multi_gpu.yaml) and requesting a non-shared partition in the SBATCH options:
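For example (the partition and GPU resource names mirror the single-GPU script above and may need adjusting for your system):

#SBATCH --partition=gpu-a100
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:A100:4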

Remember that the number of nodes/GPUs requested through SLURM must match those requested in the PyTorch Lightning training YAML.

3. Multiple nodes, multiple GPUs:

Training across multiple nodes with multiple GPUs on a cluster is seamless with PyTorch Lightning. Simply change the training YAML to include:
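(key names again follow the standard Lightning Trainer options)

trainer:
  num_nodes: 2
  devices: 4
  strategy: ddp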

This expects 2 nodes with 4 GPUs each, for a total of 8 GPUs, using a distributed data parallel strategy (see the PyTorch Lightning documentation for alternative distributed training strategies). Accordingly, the SLURM submission script must now be changed to include:
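For example (again mirroring the resource names used above):

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:A100:4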

 
