The compute partitions available at NHR@ZIB provide several popular artificial intelligence (AI) frameworks and tools for use with both CPU and GPU resources. These packages can be accessed through the custom anaconda3/2023.09 module and its system-wide Python distribution. For a list of installed Python packages, run conda list after the module is loaded. The full list of anaconda packages can be found here.
PyTorch
PyTorch is a popular Python deep learning/autodifferentiation/optimization library with excellent GPU and CPU support. It features flexible eager-mode execution, just-in-time (JIT) compilation, and domain-specific tools (e.g., torchvision for image-based learning tasks). It can be loaded in a Python environment, and the presence of GPU accelerators can be tested as follows:
Codeblock:
Python 3.10.9 (main, Jan 11 2023, 15:21:40) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> import torch
>>> for i in range(torch.cuda.device_count()):
... print(torch.cuda.get_device_properties(i).name)
...
NVIDIA A100-SXM4-80GB
NVIDIA A100-SXM4-80GB
NVIDIA A100-SXM4-80GB
NVIDIA A100-SXM4-80GB
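As a quick illustration of the JIT support mentioned above, a function can be compiled with TorchScript in the same session (a minimal sketch; the function and tensor shapes are arbitrary):
Codeblock:
import torch

@torch.jit.script
def scaled_tanh(x: torch.Tensor, alpha: float) -> torch.Tensor:
    # TorchScript compiles this Python function into an optimized graph
    return alpha * torch.tanh(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4, 4, device=device)
print(scaled_tanh(x, 2.0))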
Extensions
The anaconda3/2023.09 module’s Python distribution also contains some useful extensions to PyTorch:
PyTorch Lightning - Powerful, HPC-friendly, boilerplate-removing library for training, logging, and reproducibility with deep learning models.
PyTorch Geometric - Flexible graph neural network package for use in molecular/materials science, network science, and many other application domains of graph theory.
Examples
Examples of CPU, (multi) GPU, and multi-node training tasks for HPC environments can be found here. The examples below train convolutional neural network image classifiers on the Fashion-MNIST dataset.
Setup (on login node):
This installs the example package:
Codeblock:
$ module load anaconda3/2023.09
$ conda activate base
$ git clone https://github.com/Ruunyox/pytorch-hpc
$ cd pytorch-hpc
$ pip install --user .
1. Single node, single GPU:
We start with a training YAML file (fashion_mnist_conv_gpu.yaml) appropriate for PyTorch Lightning. Note that similar training jobs can be set up without PyTorch Lightning; see the official PyTorch tutorials for more granular examples.
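For orientation, a bare-bones training loop without PyTorch Lightning could look roughly like the following (a minimal sketch; the model architecture, hyperparameters, and data directory are illustrative and not taken from the pytorch-hpc package):
Codeblock:
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# Fashion-MNIST with a small convolutional classifier
train_data = datasets.FashionMNIST("./data", train=True, download=True,
                                   transform=transforms.ToTensor())
loader = DataLoader(train_data, batch_size=64, shuffle=True, num_workers=4)

model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(32 * 7 * 7, 10),
).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")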
Since only one GPU is needed, it is better to use the gpu-a100:shared partition and request just one GPU (gres=gpu:A100:1) rather than queuing for a full node with 4 GPUs. The following SLURM submission script details the options:
Codeblock:
#! /bin/bash
#SBATCH -J pyt_cli_test_conv_gpu
#SBATCH -o pyt_cli_test_conv_gpu.out
#SBATCH --time=00:30:00
#SBATCH --partition=gpu-a100:shared
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:A100:1
#SBATCH --mem-per-cpu=1G
#SBATCH --cpus-per-task=4
module load cuda/11.8
module load anaconda3/2023.09
conda activate base
srun pythpc --config fashion_mnist_conv_gpu.yaml fit
and can be run using:
Codeblock:
$ sbatch cli_test_conv_gpu.sh
The results can be inspected using the TensorBoard package (also included in the anaconda3/2023.09 module):
Codeblock:
$ tensorboard --logdir ./fashion_mnist_conv_gpu/tensorboard --port 8877
which can be viewed on your local machine via SSH tunneling:
Codeblock:
ssh -NL 8877:localhost:8877 your_hlrn_username@your_login_address
Note: you may change the port 8877 to something else if needed. Alternatively, you may copy your events* logfiles to your local machine and inspect them with tensorboard there.
2. Single node, multiple GPUs
Adding more GPUs with PyTorch Lightning is as simple as setting:
Codeblock:
fit:
  trainer:
    devices: 4
in the training YAML (see fashion_mnist_conv_multi_gpu.yaml) and requesting a non-shared partition in the SBATCH options:
Codeblock:
#SBATCH --partition=gpu-a100
#SBATCH --gres=gpu:A100:4
Info: Remember that the number of nodes/GPUs requested through SLURM must match those requested in the training YAML.
3. Multiple nodes, multiple GPUs
Training across multiple nodes with multiple GPUs on a cluster is seamless with PyTorch Lightning. Simply change the training YAML to include:
Codeblock:
fit:
  trainer:
    devices: 4
    strategy: ddp
    num_nodes: 2
This expects 2 nodes with 4 GPUs each, for a total of 8 GPUs, using a distributed data parallel strategy (see here for alternative PyTorch Lightning distributed training strategies). Accordingly, the SLURM submission script must now be changed to include:
Codeblock:
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:A100:4
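For reference, the YAML entries above map onto the corresponding PyTorch Lightning Trainer arguments when the trainer is constructed directly in Python (a minimal sketch; only the trainer setup is shown):
Codeblock:
import pytorch_lightning as pl

# Equivalent of the YAML above: 2 nodes x 4 GPUs with distributed data parallel
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,        # GPUs per node; must match --gres=gpu:A100:4
    num_nodes=2,      # must match #SBATCH --nodes=2
    strategy="ddp",   # distributed data parallel
)
# trainer.fit(...) would then be called with your own LightningModule/DataModule.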
TensorFlow
TensorFlow is a powerful deep learning/autodifferentiation/optimization Python package that supports eager execution and JIT compilation on both CPU and GPU accelerators. It can be loaded in a Python environment, and the presence of GPU accelerators can be tested as follows:
Codeblock:
Python 3.9.18 (main, Sep 11 2023, 13:41:44)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> dl = tf.config.list_physical_devices()
>>> for d in dl:
... print(d)
...
PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')
PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')
PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU')
PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')
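The JIT compilation mentioned above is exposed through tf.function; a function can be XLA-compiled in the same session (a minimal sketch with an arbitrary toy function; jit_compile requires a reasonably recent TensorFlow 2.x):
Codeblock:
import tensorflow as tf

@tf.function(jit_compile=True)  # request XLA compilation
def squared_norm(x):
    return tf.reduce_sum(tf.square(x))

x = tf.random.normal((1024,))
print(squared_norm(x).numpy())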
Extensions
The anaconda3/2023.09 module also contains some useful TensorFlow-related packages:
Keras - Python API for building and training TensorFlow models with less boilerplate.
Horovod - Python package for distributed, multi-node training with TensorFlow (as well as other deep learning frameworks).
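As a flavor of the reduced boilerplate, a small Keras classifier for 28x28 grayscale images (such as Fashion-MNIST) can be defined and compiled in a few lines (a minimal, illustrative sketch, unrelated to the tf-hpc configurations used below):
Codeblock:
import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),   # 28x28 grayscale input
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(10),                       # 10 Fashion-MNIST classes
])
model.compile(
    optimizer="adam",
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.summary()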
Examples
Examples of CPU and (multi) GPU training tasks for HPC environments can be found here. The examples below train convolutional neural network image classifiers on the Fashion-MNIST dataset.
Info: Currently, there is no …
Setup (on login node):
This installs the example package:
Codeblock:
$ module load anaconda3/2023.09
$ conda activate base
$ git clone https://github.com/Ruunyox/tf-hpc
$ cd tf-hpc
$ pip install --user .
1. Single node, single GPU:
We start with a training YAML file (config_conv_gpu.yaml) appropriate for Keras. Since only one GPU is needed, it is better to use the gpu-a100:shared partition and request just one GPU (gres=gpu:A100:1) rather than queuing for a full node with 4 GPUs. The following SLURM submission script details the options:
Codeblock:
#! /bin/bash
#SBATCH -J tf_cli_conv_test_gpu
#SBATCH -o tf_cli_conv_test_gpu.out
#SBATCH --time=00:30:00
#SBATCH --partition=gpu-a100:shared
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:A100:1
#SBATCH --mem-per-cpu=1G
#SBATCH --cpus-per-task=4
module load sw.a100
module load cuda/11.8
module load anaconda3/2023.09
conda activate base
export TF_CPP_MIN_LOG_LEVEL=2
export XLA_FLAGS=--xla_gpu_cuda_data_dir=/sw/compiler/cuda/11.8/a100/install
tfhpc --config config_conv_gpu.yaml
and can be run using:
Codeblock:
$ sbatch cli_test_conv_gpu.sh
The results can be inspected using the TensorBoard package (also included in the anaconda3/2023.09 module):
Codeblock:
$ tensorboard --logdir ./fashionmnist_conv_gpu/tensorboard --port 8877
which can be viewed on your local machine via SSH tunneling:
Codeblock:
ssh -NL 8877:localhost:8877 your_hlrn_username@your_login_address
Note: you may change the port 8877 to something else if needed. Alternatively, you may copy your events* logfiles to your local machine and inspect them with tensorboard there.
2. Single node, multiple GPUs
Adding more GPUs with Keras is as simple as setting:
Codeblock:
strategy:
  name: mirrored_strategy
  opts:
    devices: ["/gpu:0", "/gpu:1", "/gpu:2", "/gpu:3"]
    cross_device_ops:
      op: hierarchical_copy_all_reduce
      opts: null
in the training YAML (see config_conv_multi_gpu.yaml) and requesting a non-shared partition in the SBATCH options:
Codeblock:
#SBATCH --partition=gpu-a100
#SBATCH --gres=gpu:A100:4
Info: Remember that the number of GPUs requested through SLURM must match those requested in the training YAML.
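For reference, the strategy block above corresponds roughly to the following tf.distribute calls in plain TensorFlow/Keras (a minimal sketch; the model itself is a placeholder):
Codeblock:
import tensorflow as tf

# Roughly what the mirrored_strategy / hierarchical_copy_all_reduce YAML maps onto
strategy = tf.distribute.MirroredStrategy(
    devices=["/gpu:0", "/gpu:1", "/gpu:2", "/gpu:3"],
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce(),
)
with strategy.scope():
    # Build and compile the model inside the strategy scope so that its
    # variables are mirrored across all four GPUs
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))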
3. Multiple nodes, multiple GPUs
For training across multiple nodes with TensorFlow, we direct users to the Horovod examples.
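As rough orientation, a Horovod-based Keras script typically follows the pattern below (a minimal sketch of the generic Horovod recipe, not a ready-made configuration for the tf-hpc package):
Codeblock:
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one process per GPU, launched e.g. via srun/mpirun

# Pin each process to a single local GPU
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(10),
])

# Scale the learning rate by the number of workers and wrap the optimizer
# (depending on the TensorFlow version, tf.keras.optimizers.legacy.Adam may be required)
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(1e-3 * hvd.size()))
model.compile(optimizer=opt,
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]  # sync initial weights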
JAX
JAX is a Python package that combines composable NumPy transforms and accelerated linear algebra (XLA) routines. Although not formally a deep learning framework, it can be used to great effect for any problem that requires fast autodifferentiation. It offers good support for vectorization and parallel computing, and when combined with the extensions below, it can be used to train general machine learning models.
Info: JAX is a functionally pure framework; this may be unfamiliar to users of PyTorch or TensorFlow, which are more object-oriented in nature. See here for solutions to common problems and other tips for getting started with JAX.
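The core idea is that pure functions are transformed explicitly: differentiation, vectorization, and JIT compilation compose freely (a minimal sketch with a toy loss function):
Codeblock:
import jax
import jax.numpy as jnp

def loss(w, x):
    # A pure function: the output depends only on its inputs
    return jnp.sum((x @ w) ** 2)

grad_loss = jax.jit(jax.grad(loss))          # JIT-compiled gradient w.r.t. w
batched = jax.vmap(loss, in_axes=(None, 0))  # vectorize over a batch of x

w = jnp.ones((3,))
x = jnp.arange(12.0).reshape(4, 3)
print(grad_loss(w, x[0]))
print(batched(w, x))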
Extensions
There are several useful JAX-related Python packages included in the anaconda3/2023.09 module:
Haiku - Python package for building object-oriented-like machine learning models in JAX.
Optax - Gradient-based optimization library for training models in JAX.
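A short sketch of how these two packages fit together (illustrative only; the network size, input shape, and learning rate are arbitrary):
Codeblock:
import haiku as hk
import jax
import jax.numpy as jnp
import optax

def forward(x):
    # A small MLP defined with Haiku's object-oriented-style modules
    return hk.nets.MLP([128, 10])(x)

model = hk.transform(forward)       # turn it into a pure (init, apply) pair
rng = jax.random.PRNGKey(0)
x = jnp.zeros((1, 784))             # e.g., a flattened 28x28 image
params = model.init(rng, x)

optimizer = optax.adam(1e-3)
opt_state = optimizer.init(params)

def loss_fn(params, x, y):
    logits = model.apply(params, None, x)  # None: no RNG needed at apply time
    return optax.softmax_cross_entropy_with_integer_labels(logits, y).mean()

grads = jax.grad(loss_fn)(params, x, jnp.zeros((1,), dtype=jnp.int32))
updates, opt_state = optimizer.update(grads, opt_state)
params = optax.apply_updates(params, updates)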
Examples
Examples of CPU, (multi) GPU, and multi-node training tasks for HPC environments can be found here. The examples below train convolutional neural network image classifiers on the Fashion-MNIST dataset.
Setup (on login node):
This installs the example package:
Codeblock:
$ module load anaconda3/2023.09
$ conda activate base
$ git clone https://github.com/Ruunyox/jax-hpc
$ cd jax-hpc
$ pip install --user .
1. Single node, single GPU:
We start with a training YAML file (config_local_gpu.yaml). Since only one GPU is needed, it is better to use the gpu-a100:shared partition and request just one GPU (gres=gpu:A100:1) rather than queuing for a full node with 4 GPUs. The following SLURM submission script details the options:
Codeblock:
#! /bin/bash
#SBATCH -J jax_cli_test_gpu
#SBATCH -o jax_cli_test_gpu.out
#SBATCH --time=00:30:00
#SBATCH --partition=gpu-a100:shared
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:A100:1
#SBATCH --mem-per-cpu=1G
#SBATCH --cpus-per-task=4
module load sw.a100
module load cuda/11.8
module load anaconda3/2023.09
conda activate base
export XLA_FLAGS=--xla_gpu_cuda_data_dir=/sw/compiler/cuda/11.8/a100/install
export JAX_PLATFORM_NAME=gpu
export PYTHONUNBUFFERED=on
jaxhpc --config config_local_gpu.yaml
and can be run using:
Codeblock:
$ sbatch cli_test_conv_gpu.sh
The results can be inspected in the associated output log.
2. Multiple GPUs
We direct users to the documentation for parallel execution using pmap here.
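The basic pmap pattern is to replicate a function across the local devices and feed it an array whose leading axis matches the device count (a minimal sketch with a toy computation):
Codeblock:
import jax
import jax.numpy as jnp

n_dev = jax.local_device_count()   # e.g., 4 on a full A100 node
print("local devices:", n_dev)

# Replicate the function across devices; the leading axis is split over them
parallel_square = jax.pmap(lambda v: jnp.sum(v ** 2))

x = jnp.arange(n_dev * 8.0).reshape(n_dev, 8)
print(parallel_square(x))          # one partial result per device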
XGBoost
XGBoost is a Python package for building gradient-boosted decision trees. It has excellent GPU support. For more information, visit this tutorial page.
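A brief, illustrative sketch of GPU-accelerated training with random toy data (the exact GPU options depend on the installed XGBoost version; newer releases prefer tree_method="hist" together with device="cuda" over the older gpu_hist):
Codeblock:
import numpy as np
import xgboost as xgb

# Random toy data; replace with your own features/labels
X = np.random.rand(10000, 20)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "binary:logistic",
    "tree_method": "gpu_hist",   # GPU-accelerated histogram algorithm
    "max_depth": 6,
    "eta": 0.1,
}
booster = xgb.train(params, dtrain, num_boost_round=100)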