...

  • intel_pytorch_2.1.0a0

  • intel_tensorflow_2.14.0

  • intel_jax_0.4.20

Note

Please note that PVC nodes currently run Rocky Linux 8, so only Python versions <= 3.9 are supported.

Info

NumPy 2.0.0 breaks binary backwards compatibility. If NumPy-related runtime errors are encountered, please consider downgrading to a version < 2.0.0.
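
For example, a downgrade inside the affected conda environment could look like this (a minimal sketch, not a required step):

Code block
pip install "numpy<2.0.0"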

Pytorch

Load the Intel OneAPI module and create a new conda environment within your Intel python distribution:

Code block
module load intel/2024.0.0

conda create -n intel_pytorch_gpu python=3.9
conda activate intel_pytorch_gpu
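
Once PyTorch and its Intel extension (intel-extension-for-pytorch) have been installed into this environment, a short check like the following (a minimal sketch) confirms that the XPU devices are visible:

Code block
python -c "import torch, intel_extension_for_pytorch; print(torch.xpu.is_available(), torch.xpu.device_count())"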

...

Code block
pip install tensorflow==2.14.0
pip install --upgrade intel-extension-for-tensorflow[xpu]==2.14.0

This installs TensorFlow together with its Intel extension, which is necessary to run non-CUDA operations on Intel GPUs. On a compute node, the presence of GPUs can be assessed:
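
A minimal sketch of such a check (assuming the TensorFlow environment is active; with intel-extension-for-tensorflow, Intel GPUs are exposed as XPU devices):

Code block
python -c "import tensorflow as tf; print(tf.config.list_physical_devices())"

XPU entries in the output indicate that the Intel GPUs are visible.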

...

Note

As of version 0.4.20, Intel XPU support for JAX is still experimental.

Like Pytorch and TensorFlow, JAX also has an extension via OpenXLA. To prepare a JAX environment for use with Intel GPUs, first create a new conda environment:
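
A minimal sketch, mirroring the PyTorch setup above (the environment name intel_jax_gpu is only an example):

Code block
module load intel/2024.0.0

conda create -n intel_jax_gpu python=3.9
conda activate intel_jax_gpu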

...

Once the environment is activated, the following commands install JAX:

Code block
pip install numpy==1.24.4
pip install jax==0.4.20 jaxlib==0.4.20
pip install --upgrade intel-extension-for-openxla==0.2.1

This installs JAX together with its Intel extension, which is necessary to run non-CUDA operations on Intel GPUs. On a compute node, the presence of GPUs can be assessed:
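
A minimal sketch of such a check (assuming the JAX environment above is active):

Code block
python -c "import jax; print(jax.devices())"

If the Intel extension is working, the Intel GPUs should appear in the output rather than only CPU devices.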

...

Examples for using the Intel extension for JAX can be found here.

Distributed Training

Multi-GPU and multi-node jobs can be executed using the following strategy in a job submission script:

Code block
module load intel/2024.0.0
module load impi

export CCL_ROOT=/sw/compiler/intel/oneapi/ccl/2021.12
export LD_LIBRARY_PATH=$I_MPI_ROOT/lib:$LD_LIBRARY_PATH
hnode=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$(scontrol getaddrs $hnode | cut -d' ' -f 2 | cut -d':' -f 1)
export MASTER_PORT=29500

It is advantageous to define the GPU tile usage (each Intel Max 1550 has two compute “tiles”) using affinity masks, where the format GPU_ID.TILE_ID (zero-based indices) specifies which GPU(s) and tile(s) to use. E.g., to use two GPUs and four tiles, one can specify:

Code block
export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE
export ZE_AFFINITY_MASK=0.0,0.1,1.0,1.1

To use four GPUs and eight tiles, one would specify:

Code block
export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE
export ZE_AFFINITY_MASK=0.0,0.1,1.0,1.1,2.0,2.1,3.0,3.1

These specifications are applied to all nodes of a job. For more information and alternative modes, please see the Intel Level Zero documentation.

Intel MPI can then be used to distribute and run your job, e.g.:

Code block
mpirun -np 8 -ppn 8 your_exe your_exe_flags
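
As an illustration only, a PyTorch script launched this way could pick up MASTER_ADDR/MASTER_PORT and the ranks provided by Intel MPI roughly as follows (a sketch assuming torch, intel-extension-for-pytorch and the oneCCL bindings, oneccl_bindings_for_pytorch, are installed in the environment):

Code block
import os
import torch
import torch.distributed as dist
import intel_extension_for_pytorch as ipex  # enables the torch.xpu backend
import oneccl_bindings_for_pytorch          # registers the "ccl" backend

# Intel MPI exposes rank/size via PMI_* variables; torch.distributed expects
# RANK/WORLD_SIZE plus the MASTER_ADDR/MASTER_PORT set in the job script.
os.environ.setdefault("RANK", os.environ.get("PMI_RANK", "0"))
os.environ.setdefault("WORLD_SIZE", os.environ.get("PMI_SIZE", "1"))

dist.init_process_group(backend="ccl")
rank = dist.get_rank()
device = torch.device(f"xpu:{rank % torch.xpu.device_count()}")
print(f"rank {rank}/{dist.get_world_size()} using {device}")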