...
General information on the usage of all Lise partitions can be found on the Quickstart page, in particular for the topics
- login via SSH, see the Usage Guide,
- file systems, see the Usage Guide, and
- general Slurm properties, see Slurm usage.
Hardware
The GPU A100 partition offers access to two login nodes and 42 compute nodes equipped with Nvidia A100 GPUs. Each compute node has the following properties:
- 2x Intel Xeon "Ice Lake" Platinum 8360Y (36 cores per socket, 2.4 GHz, 250 W)
- 1 TB RAM (DDR4-3200)
- 4x Nvidia A100 (80GB HBM2, SXM), two attached to each CPU socket
- 7.68 TB NVMe local SSD
- 200 GBit/s InfiniBand Adapter (Mellanox MT28908)
Login nodes
The hardware of the login nodes is similar to that of the compute nodes. Notable differences to the compute nodes are
- reduced main memory (512 GB instead of 1 TB RAM) and
- no GPUs and no CUDA drivers.
Login authentication is possible via SSH keys only. Please see the Usage Guide.
Generic login name | List of login nodes |
---|---|
bgnlogin.nhr.zib.de | bgnlogin1.nhr.zib.de, bgnlogin2.nhr.zib.de |
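For example, a login from your local machine could look as follows. This is a minimal sketch; replace myaccount with your own account name and adjust the path to your SSH key.

```bash
# connect to one of the A100 login nodes via the generic login name (SSH keys only)
ssh -i ~/.ssh/id_ed25519 myaccount@bgnlogin.nhr.zib.de
```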
Software and environment modules
- Login and compute nodes of the A100 GPU partition are running under Rocky Linux (currently version 8.6).
- Software for the A100 GPU partition provided by NHR@ZIB can be found using the module command, see Quickstart and the Usage Guide.
- Please note the presence of the sw.a100 environment module. It controls the software selection for the GPU A100 partition.
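A minimal sketch of how this typically looks on a login node (the module list shown by module avail will differ depending on the current software installation):

```bash
# select the software stack for the GPU A100 partition
module load sw.a100
# list the software modules that become visible for this partition
module avail
```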
...
- A general introduction to the batch system can be found on the Slurm usage page.
- The page Slurm partition GPU A100 describes the specific Slurm properties of the GPU A100 partition. The main Slurm partition for the A100 GPU partition is named "gpu-a100". An example job script is shown below.
```bash
#!/bin/bash
#SBATCH --partition=gpu-a100
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --gres=gpu:4

module load openmpi/gcc.11/4.1.4
mpirun ./mycode.bin
```
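Assuming the script above is saved as gpu_job.slurm (a hypothetical file name), it can be submitted and monitored as follows:

```bash
# submit the job script to the gpu-a100 partition
sbatch gpu_job.slurm
# check the state of your queued and running jobs
squeue -u $USER
```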
Container
Apptainer is provided as a module and can be used to download, build, and run e.g. Nvidia containers:
```bash
bgnlogin1 ~ $ module load apptainer
Module for Apptainer 1.1.6 loaded.

# pull a TensorFlow image from nvcr.io - it needs to be compatible with the local driver
bgnlogin1 ~ $ apptainer pull tensorflow-22.01-tf2-py3.sif docker://nvcr.io/nvidia/tensorflow:22.01-tf2-py3
...

# example: single-node run calling python from the container in an interactive job using 4 GPUs
bgnlogin1 ~ $ srun -pgpu-a100 --gres=gpu:4 --nodes=1 --pty --interactive --preserve-env ${SHELL}
...
bgn1003 ~ $ apptainer run --nv tensorflow-22.01-tf2-py3.sif python
...
Python 3.8.10 (default, Nov 26 2021, 20:14:08)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> tf.config.list_physical_devices("GPU")
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]

# optional: clean up the apptainer cache
bgnlogin1 ~ $ apptainer cache list
...
bgnlogin1 ~ $ apptainer cache clean
```
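The pulled image can also be used in a regular batch job. The following is a minimal sketch, assuming the image file created above and a hypothetical script my_training_script.py in the working directory:

```bash
#!/bin/bash
#SBATCH --partition=gpu-a100
#SBATCH --nodes=1
#SBATCH --gres=gpu:4

module load apptainer
# --nv makes the host GPUs and driver libraries available inside the container
apptainer run --nv tensorflow-22.01-tf2-py3.sif python my_training_script.py
```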