General information on the usage of all Lise partitions can be found in the Quickstart guide, especially for the topics
- login via ssh,
- file systems, and
- general Slurm usage.
Software and environment modules
- Login and compute nodes of the A100 GPU partition are running under Rocky Linux (currently version 8.6).
- Software for the A100 GPU partition provided by NHR@ZIB can be found using the module command, see Quickstart.
- Please note the presence of the sw.a100 environment module. It controls the software selection for the GPU A100 partition.
Example: Show the currently available software and access compilers
bgnlogin1 $ module avail
...
bgnlogin1 $ module load gcc
...
bgnlogin1 $ module list
Currently Loaded Modulefiles:
 1) HLRNenv   2) sw.a100   3) slurm   4) gcc/11.3.0(default)
Program build and execution
- Each node of the GPU A100 system combines a host CPU with its four attached A100 GPUs. A wide range of software is available to support this hardware.
- We recommend using the GPU A100 login nodes for program builds. If a build requires the presence of the CUDA drivers, compilation is also possible on a compute node within a Slurm job session.
- We restrict our presentation to examples; a minimal build sketch is shown below. For further details, please visit our manual on
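As a minimal sketch of a build on a login node, a combined MPI + CUDA program could be compiled as follows. The cuda module name and the source file names are assumptions for illustration; check module avail for the modules actually installed.
bgnlogin1 $ module load gcc openmpi/gcc.11/4.1.4 cuda   # module names are assumptions, see "module avail"
bgnlogin1 $ nvcc -c gpu_kernels.cu -o gpu_kernels.o     # compile the CUDA device code
bgnlogin1 $ mpicc main.c gpu_kernels.o -lcudart -o mycode.bin   # link the MPI host code against the CUDA runtime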
Job monitoring
A running job can be monitored interactively, directly on each of its compute nodes. Once you know the names of the job's nodes, you can log in and monitor the host CPU as well as the GPUs.
Job monitoring
bgnlogin1 $ squeue -u myaccount
   JOBID PARTITION     NAME      USER ST  TIME NODES NODELIST(REASON)
 7748370  gpu-a100 a100_mpi myaccount  R  1:23     2 bgn[1007,1017]
bgnlogin1 $ ssh bgn1007
bgn1007 $ top
bgn1007 $ nvidia-smi
bgn1007 $ module load nvtop
bgn1007 $ nvtop
Using the Slurm batch system
The GPU A100 partition shares the same Slurm batch system with all other partitions of System Lise.
- A general introduction to the batch system can be found on the page Slurm usage.
- The page Slurm partitions GPU A100 describes the specific Slurm properties of the GPU A100 partition. The main Slurm partition for the A100 GPU partition is named "gpu-a100". An example job script and its submission are shown below.
GPU job script
#!/bin/bash
#SBATCH --partition=gpu-a100
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --gres=gpu:4

module load openmpi/gcc.11/4.1.4
mpirun ./mycode.bin
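The job script is submitted from a login node with sbatch and can be inspected with squeue; the script file name below is a placeholder.
bgnlogin1 $ sbatch mycode.slurm
Submitted batch job 7748370
bgnlogin1 $ squeue -u myaccount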
Container
Apptainer is provided as a module and can be used to download, build, and run containers, e.g. from Nvidia; an interactive example and a batch-job sketch are shown below:
Apptainer example
bgnlogin1 ~ $ module load apptainer
Module for Apptainer 1.1.6 loaded.

# pulling a tensorflow image from nvcr.io - needs to be compatible to local driver
bgnlogin1 ~ $ apptainer pull tensorflow-22.01-tf2-py3.sif docker://nvcr.io/nvidia/tensorflow:22.01-tf2-py3
...

# example: single node run calling python from the container in interactive job using 4 GPUs
bgnlogin1 ~ $ srun -pgpu-a100 --gres=gpu:4 --nodes=1 --pty --interactive --preserve-env ${SHELL}
...
bgn1003 ~ $ apptainer run --nv tensorflow-22.01-tf2-py3.sif python
...
Python 3.8.10 (default, Nov 26 2021, 20:14:08)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> tf.config.list_physical_devices("GPU")
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]

# optional: cleanup apptainer cache
bgnlogin1 ~ $ apptainer cache list
...
bgnlogin1 ~ $ apptainer cache clean
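For non-interactive use, the pulled image can also be started from a batch job, analogous to the GPU job script above. This is a minimal sketch; the Python script name is a placeholder.
#!/bin/bash
#SBATCH --partition=gpu-a100
#SBATCH --nodes=1
#SBATCH --gres=gpu:4

module load apptainer
# --nv makes the host GPUs and driver libraries visible inside the container
apptainer run --nv tensorflow-22.01-tf2-py3.sif python my_training.py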