For efficient use of MPI-distributed GPU codes, a CUDA-aware (GPU-aware) installation of Open MPI is available via the openmpi/gcc.11/4.1.4 environment module. Open MPI respects the resource requests made to Slurm, so no special arguments to mpiexec/mpirun are needed. Nevertheless, please check that your application is bound correctly to CPU cores and GPUs; the --report-bindings option of mpiexec/mpirun prints the binding that is actually applied.
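
A minimal batch-script sketch is shown below. The job geometry (two nodes, one MPI rank per GPU) and the executable name ./my_gpu_app are assumptions for illustration; the module name, partition, and GPU request follow the examples in this section.

Slurm batch script sketch (bash):
#!/bin/bash
#SBATCH --partition=gpu-a100
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4   # assumption: one MPI rank per GPU
#SBATCH --gres=gpu:4

module load openmpi/gcc.11/4.1.4

# Open MPI picks up the Slurm allocation; --report-bindings prints the
# CPU cores each rank is bound to, so the mapping can be verified.
mpirun --report-bindings ./my_gpu_app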

Container

Apptainer is provided as a module and can be used to download, build, and run containers, e.g. Nvidia containers:

Apptainer example (bash):
bgnlogin1 ~ $ module load apptainer
Module for Apptainer 1.1.6 loaded.

#pull a tensorflow image from nvcr.io - it must be compatible with the local driver
bgnlogin1 ~ $ apptainer pull tensorflow-22.01-tf2-py3.sif docker://nvcr.io/nvidia/tensorflow:22.01-tf2-py3
...

#example: single-node interactive job using 4 GPUs, calling python from the container
bgnlogin1 ~ $ srun -pgpu-a100 --gres=gpu:4 --nodes=1 --pty --interactive --preserve-env ${SHELL}
...
bgn1003 ~ $ apptainer run --nv tensorflow-22.01-tf2-py3.sif python
...
Python 3.8.10 (default, Nov 26 2021, 20:14:08) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> tf.config.list_physical_devices("GPU")
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]

#optional: cleanup apptainer cache
bgnlogin1 ~ $ apptainer cache list
...
bgnlogin1 ~ $ apptainer cache clean
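
For non-interactive use, the same container can also be started from a regular batch job. The following is a minimal sketch: the script name train.py is a placeholder, and the resource request mirrors the interactive example above.

Apptainer batch job sketch (bash):
#!/bin/bash
#SBATCH --partition=gpu-a100
#SBATCH --nodes=1
#SBATCH --gres=gpu:4

module load apptainer

# --nv makes the host's Nvidia driver and GPUs visible inside the container
apptainer run --nv tensorflow-22.01-tf2-py3.sif python train.py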

Job accounting

As of Q4 2023, one GPU-hour is accounted as the equivalent of 150 core-hours; a full GPU node (4 GPUs) therefore costs 600 core-hours per hour of use.
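
As a worked example, a job that occupies a full GPU node (4 GPUs) for 12 hours is accounted as 4 × 150 × 12 = 7200 core-hours; the 12-hour runtime is only an illustrative figure.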