GPU A100 partition
General information that applies to all Lise partitions can be found on the following pages:
- login via SSH on the Usage Guide,
- file systems on the Usage Guide, and
- general Slurm properties on Slurm usage.
Hardware
The GPU A100 partition offers access to two login nodes and 42 compute nodes equipped with Nvidia A100 GPUs. Each compute node has the following properties:
- 2x Intel Xeon "Ice Lake" Platinum 8360Y (36 cores per socket, 2.4 GHz, 250 W)
- 1 TB RAM (DDR4-3200)
- 4x Nvidia A100 (80GB HBM2, SXM), two attached to each CPU socket
- 7.68 TB NVMe local SSD
- 200 GBit/s InfiniBand Adapter (Mellanox MT28908)
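On a compute node, the attachment of the four GPUs to the two CPU sockets can be inspected with nvidia-smi. This is only a quick check; the exact output layout depends on the installed driver version.
```
# show GPU/CPU affinity and the interconnect between GPUs
bgn1007 $ nvidia-smi topo -m
```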
Login nodes
The hardware of the login nodes is similar to that of the compute nodes. Notable differences to the compute nodes are
- reduced main memory (512 GB instead of 1 TB RAM) and
- no GPUs and no CUDA drivers.
Login authentication is possible via SSH keys only; please visit the Usage Guide. A minimal login example is shown below the table.
| Generic login name | List of login nodes |
|---|---|
| bgnlogin.nhr.zib.de | bgnlogin1.nhr.zib.de, bgnlogin2.nhr.zib.de |
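For illustration, a minimal SSH login via the generic login name; myaccount is a placeholder for your actual user name, and your public SSH key is assumed to be registered already.
```
ssh myaccount@bgnlogin.nhr.zib.de
```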
Software and environment modules
- Login and compute nodes of the A100 GPU partition are running under Rocky Linux (currently version 8.6).
- Software for the A100 GPU partition provided by NHR@ZIB can be found using the module command, see Usage Guide.
- Please note the presence of the sw.a100 environment module. It controls the software selection for the GPU A100 partition.
```
bgnlogin1 $ module avail
...
bgnlogin1 $ module load gcc
...
bgnlogin1 $ module list
Currently Loaded Modulefiles:
 1) HLRNenv   2) sw.a100   3) slurm   4) gcc/11.3.0(default)
```
Program build and execution
- Each node of the GPU A100 system is a combination of the host CPUs and their four attached device GPUs.
- We recommend using the GPU A100 login nodes for program builds. If a build requires the presence of CUDA drivers, compilation is also possible on a compute node within a Slurm job session (see the sketch after this list). For build examples, please visit our manual.
- GPU-aware MPI: For efficient use of MPI-distributed GPU codes, a GPU/CUDA-aware installation of Open MPI is available via the environment module openmpi/gcc.11/4.1.4. Open MPI respects the resource requests made to Slurm, so no special arguments to mpiexec/mpirun are required. Nevertheless, please check that your application is bound correctly to CPU cores and GPUs; the --report-bindings option of mpiexec/mpirun reports the binding (see the second sketch after this list).
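As a sketch of building on a compute node when CUDA drivers are needed, an interactive Slurm session can be requested first. The module name cuda and the source file mycode.cu are placeholders, not fixed names on the system.
```
# request an interactive session with one GPU on the gpu-a100 partition
bgnlogin1 $ srun --partition=gpu-a100 --nodes=1 --gres=gpu:1 --pty bash
# on the allocated compute node, load a CUDA toolkit module (placeholder name) and build
bgn1007 $ module load cuda
bgn1007 $ nvcc -o mycode.bin mycode.cu
```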
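To check the binding mentioned above, the --report-bindings option can simply be added to the mpirun call inside a job script. A minimal sketch, reusing the module and binary names from the job script at the end of this page:
```
module load openmpi/gcc.11/4.1.4
# print the CPU-core binding of each MPI rank to stderr
mpirun --report-bindings ./mycode.bin
```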
Job monitoring
A running job can be monitored interactively, directly on each of its compute nodes. Once you know the names of the nodes your job is running on, you can log in to them and monitor the host CPUs as well as the GPUs.
```
bgnlogin1 $ squeue -u myaccount
  JOBID PARTITION     NAME      USER ST  TIME NODES NODELIST(REASON)
7748370  gpu-a100 a100_mpi myaccount  R  1:23     2 bgn[1007,1017]
bgnlogin1 $ ssh bgn1007
bgn1007 $ top
bgn1007 $ nvidia-smi
bgn1007 $ module load nvtop
bgn1007 $ nvtop
```
Using the slurm batch system
The GPU A100 partition shares the same Slurm batch system with all other partitions of System Lise.
- A general introduction to the batch system can be found on Slurm usage.
- Slurm partition GPU A100 describes the Slurm properties specific to the GPU A100 partition. The main Slurm partition for the A100 GPU nodes is named "gpu-a100". An example job script is shown below.
```
#!/bin/bash
#SBATCH --partition=gpu-a100
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --gres=gpu:4

module load openmpi/gcc.11/4.1.4

mpirun ./mycode.bin
```
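Assuming the script above is saved as a100_mpi.slurm (the file name is arbitrary), it can be submitted and checked from a login node:
```
bgnlogin1 $ sbatch a100_mpi.slurm
bgnlogin1 $ squeue -u myaccount
```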