The following GPU partitions are available on the GPU A100 cluster of system Lise.
...
GPU A100 shares the Slurm batch system with all other partitions of system Lise. The following Slurm partitions are specific to the GPU A100 partition.
Slurm partition | Number of nodes | CPU | Main memory (GB) | GPUs per node | GPU hardware | Walltime (hh:mm:ss) | Description |
---|---|---|---|---|---|---|---|
gpu-a100 | 34 | Ice Lake 8360Y | 1000 | 4 | NVIDIA Tesla A100 80GB | 24:00:00 | full node exclusive |
gpu-a100:shared | 5 | Ice Lake 8360Y | 1000 | 4 | NVIDIA Tesla A100 80GB | 24:00:00 | shared node access, exclusive use of the requested GPUs |
gpu-a100:shared:mig | 1 | Ice Lake 8360Y | 1000 | 28 (4 x 7) | 1 to 28 1g.10gb A100 MIG slices | 24:00:00 | shared node access, shared GPU devices via Multi-Instance GPU (MIG). Each of the four GPUs is logically split into seven usable slices, each with 10 GB of GPU memory |
gpu-a100:test | 2 | Ice Lake 8360Y | 1000 | 4 | NVIDIA Tesla A100 80GB | 01:00:00 | nodes reserved for short test jobs before scheduling longer jobs with more resources |
See Slurm usage for how to run beyond the 24 h walltime limit by chaining jobs with job dependencies.
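A chain of dependent jobs can be sketched as follows. This is a minimal illustration, not the site's recommended procedure: the function name `submit_chain` is hypothetical, and it assumes that the application can resume from its own checkpoint when the same job script runs again. It relies only on the standard `sbatch` options `--parsable` and `--dependency=afterok:<jobid>`.

```shell
#!/bin/bash
# Sketch: submit the same job script several times so that each part
# starts only after the previous part has finished successfully.
# Assumption: the job script restarts the application from a checkpoint.

submit_chain() {
    local script=$1 parts=$2
    local jobid="" i
    for (( i = 1; i <= parts; i++ )); do
        if [[ -z $jobid ]]; then
            # The first part has no dependency.
            jobid=$(sbatch --parsable "$script") || return 1
        else
            # afterok:<id> releases this job only if job <id> exited with code 0.
            jobid=$(sbatch --parsable --dependency=afterok:"$jobid" "$script") || return 1
        fi
        # --parsable prints "jobid" or "jobid;cluster"; keep the job ID only.
        jobid=${jobid%%;*}
        echo "part $i: job $jobid"
    done
}

# Example call on a login node (three parts of up to 24 h each):
# submit_chain example.slurm 3
```

Each part waits in the queue with the reason `Dependency` until its predecessor completes. If a part fails, the later parts are not started.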
...
The charge rates for the Slurm partitions are listed in Accounting.
Examples
Assuming a job script example.slurm with the content

```shell
#!/bin/bash
# Request 2 full nodes with 4 GPUs each (--gres counts per node)
# and 8 MPI tasks in total.
#SBATCH --partition=gpu-a100
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --gres=gpu:4

module load openmpi/gcc.11/4.1.4
mpirun ./mycode.bin
```
you can submit the job to the Slurm batch system and check its status with:

```shell
bgnlogin2 $ sbatch example.slurm
Submitted batch job 7748544
bgnlogin2 $ squeue -u myaccount
...
```
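The job script above starts eight MPI tasks on two nodes with four GPUs each. If the application does not select a GPU on its own, a small wrapper can pin each task to one GPU. The following is a sketch under assumptions: the file name `gpu_bind.sh` and the function `gpu_for_local_rank` are hypothetical, and it presumes an exclusive `gpu-a100` node on which all four GPUs are visible to the job. It uses the node-local rank that Open MPI's `mpirun` exports as `OMPI_COMM_WORLD_LOCAL_RANK` and that `srun` exports as `SLURM_LOCALID`.

```shell
#!/bin/bash
# gpu_bind.sh (hypothetical helper): pin each MPI task on a node to one GPU.

GPUS_PER_NODE=4   # full gpu-a100 nodes provide 4 GPUs

# Print the GPU index for the calling task, derived from its node-local rank.
# Falls back to GPU 0 when no launcher variable is set.
gpu_for_local_rank() {
    local rank=${OMPI_COMM_WORLD_LOCAL_RANK:-${SLURM_LOCALID:-0}}
    echo $(( rank % GPUS_PER_NODE ))
}

# When executed as a wrapper with a command, expose one GPU and start the command.
if [[ "${BASH_SOURCE[0]}" == "$0" && $# -gt 0 ]]; then
    export CUDA_VISIBLE_DEVICES=$(gpu_for_local_rank)
    exec "$@"
fi
```

In the job script, the last line would then read `mpirun ./gpu_bind.sh ./mycode.bin`. Many GPU codes already choose their device themselves, in which case the wrapper is unnecessary.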
```shell
$ srun --nodes=2 --gres=gpu:4 --partition=gpu-a100 example_cmd
```
...
```shell
$ srun --gpus=1 --partition=gpu-a100:shared:mig example_cmd
```
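To check how many MIG slices a job on `gpu-a100:shared:mig` actually received, the device list printed by `nvidia-smi -L` can be counted. This is a sketch only: the function name `count_mig_devices` is hypothetical, and it assumes that `nvidia-smi -L` lists each MIG device on an indented line that starts with `MIG`.

```shell
#!/bin/bash
# Sketch: count the MIG devices in the output of `nvidia-smi -L` (read from stdin).
# Assumption: each MIG device appears on its own line beginning with "MIG".

count_mig_devices() {
    # grep -c prints the number of matching lines; "|| true" keeps the
    # exit status at 0 when no MIG device is listed.
    grep -c -E '^[[:space:]]*MIG[[:space:]]' || true
}

# Example call inside a job:
# nvidia-smi -L | count_mig_devices
```

For the `srun --gpus=1` request above, the expected count would be one slice.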
Hardware configuration
NHR@ZIB offers access to compute nodes equipped with Nvidia A100 GPUs. The GPU A100 partition consists of two login nodes and 42 compute nodes with the following properties for a single node:
- 2x Intel Xeon "Ice Lake" Platinum 8360Y (36 cores per socket, 2.4 GHz, 250 W)
- 1 TB RAM (DDR4-3200)
- 4x Nvidia A100 (80GB HBM2, SXM), two attached to each CPU socket
- 7.68 TB NVMe local SSD
- 200 GBit/s InfiniBand Adapter (Mellanox MT28908).
...