Preface
Starting September 2024, the underlying hardware of Lise’s global file systems HOME and WORK will be replaced (see the schedule at the end of this page). This affects all login nodes and all compute partitions, namely the CPU partition, the GPU-A100 partition, and the GPU-PVC partition. In addition, the operating system on Lise’s login nodes and CPU compute partition is being migrated from CentOS 7 to Rocky Linux 9.
It is important for users to follow the action items specified below. Rocky Linux 9 introduces new versions of various system tools and libraries. Some codes compiled earlier under CentOS 7 might no longer work under Rocky Linux 9. Thus, legacy versions of environment modules offered under CentOS 7 were not transferred to the new OS environment or have been replaced by more recent versions.
The migration to the new OS is organised in three consecutive phases. It is expected to be complete by the end of July.
The first phase starts with 2 login nodes and 112 compute nodes already migrated to Rocky Linux 9 for testing. The other nodes remain available under CentOS 7 for continued production.
After the test phase, a major fraction of nodes will be switched to Rocky Linux 9 to allow for general job production under the new OS.
During the last phase, only a few nodes still remain under CentOS 7. At the very end, they will be migrated to Rocky Linux 9, too.
During the migration phase the use of Rocky Linux 9 "clx" compute nodes will be free of charge.
Current migration state
| nodes | CentOS 7 | Rocky Linux 9 |
---|---|---|
| login | blogin[1-6] | blogin[7-8] |
| compute (384 GB RAM) | 832 | 112 |
| compute (768 GB RAM) | 32 | 0 |
| compute (1536 GB RAM) | 2 | 0 |
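Whether a particular node has already been migrated can also be checked directly after login. A minimal sketch (the full hostname is an assumption; use your usual login address):

```bash
# Log in to one of the already migrated login nodes, e.g. blogin7
ssh blogin7.nhr.zib.de          # full hostname assumed; adjust to your usual login address

# Print the installed OS release
grep PRETTY_NAME /etc/os-release
# CentOS 7 nodes report "CentOS Linux 7", migrated nodes report "Rocky Linux 9"
```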
Latest news
| date | subject |
---|---|
| 2024-07-03 | official start of the migration phase with 2 login and 112 compute nodes running Rocky Linux 9 |
What has changed
SLURM partitions
| old partition name (CentOS 7) | new partition name (Rocky Linux 9) | current job limits |
---|---|---|
| ● standard96 | ● cpu-clx | 40 nodes, 12 h wall time |
| ● standard96:test | ● cpu-clx:test | 16 nodes, 1 h wall time |
| ● standard96:ssd | ● cpu-clx:ssd | |
| ● large96 | ● cpu-clx:large | |
| ● large96:test | | |
| ● large96:shared | | |
| ● huge96 | ● cpu-clx:huge | |
( ● available ● closed/not available yet )
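In existing batch scripts, the partition rename typically only requires updating the --partition option. A minimal sketch (node count, wall time, and the binary name are placeholders):

```bash
#!/bin/bash
# Old (CentOS 7) partition name:
##SBATCH --partition=standard96

# New (Rocky Linux 9) partition name:
#SBATCH --partition=cpu-clx
#SBATCH --nodes=2              # placeholder, within the 40-node limit
#SBATCH --time=02:00:00        # placeholder, within the 12 h wall-time limit

srun ./mybinary
```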
Software and environment modules
| | CentOS 7 | Rocky Linux 9 |
---|---|---|
| OS components | glibc 2.17 | glibc 2.34 |
| | Python 3.6 | Python 3.9 |
| | GCC 4.8 | GCC 11.4 |
| | bash 4.2 | bash 5.1 |
| check disk quota | hlrnquota | show-quota |
| Environment modules version | 4.8 (Tmod) | 5.4 (Tmod) |
| Modules loaded initially | HLRNenv | NHRZIBenv |
| | slurm | slurm |
| | sw.skl | sw.clx.el9 |
| compiler modules | intel ≤ 2022.2.1 | intel ≥ 2024.2 |
| | gcc ≤ 13.2.0 | gcc ≥ 13.3.0 |
| MPI modules | impi ≤ 2021.7.1 | impi ≥ 2021.13 |
| | openmpi ≤ 4.1.4 | openmpi ≥ 5.0.3 |
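A quick way to inspect the new environment on a migrated login node; the module versions follow the table above:

```bash
# Default modules after login on Rocky Linux 9: NHRZIBenv, slurm, sw.clx.el9
module list

# Load the Intel toolchain in the versions listed above ...
module load intel/2024.2 impi/2021.13
# ... or the GNU toolchain instead:
# module load gcc/13.3.0 openmpi/5.0.3

# The disk quota command has been renamed:
show-quota        # replaces the former hlrnquota
```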
Shell environment variables
| CentOS 7 | Rocky Linux 9 |
---|---|
| SLURM_MPI_TYPE=pmi2 | SLURM_MPI_TYPE=pmix |
(see the special remarks on srun below)
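The change most relevant for MPI start-up is the new default of SLURM_MPI_TYPE. It can be verified on a migrated node, for example:

```bash
# On a Rocky Linux 9 node, the default srun MPI plugin is PMIx:
echo "$SLURM_MPI_TYPE"    # expected: pmix (was pmi2 under CentOS 7)
srun --mpi=list           # lists the MPI plugin types supported by srun
```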
What remains unchanged
node hardware and node names
communication network (Intel Omnipath)
file systems (HOME, WORK, PERM) and disk quotas
environment modules system (still based on Tcl, a.k.a. “Tmod”)
access credentials (user IDs, SSH keys) and project IDs
charge rates and CPU time accounting (early migrators' jobs are free of charge)
Lise’s Nvidia-A100 and Intel-PVC partitions
Special remarks
For users of SLURM’s srun job launcher:
Open MPI 5.x has dropped support for the PMI-2 API; it solely depends on PMIx to bootstrap MPI processes. For this reason, the environment setting was changed from SLURM_MPI_TYPE=pmi2 to SLURM_MPI_TYPE=pmix, so binaries linked against Open MPI can be started as usual “out of the box” using “srun mybinary”. This also works for binaries linked against Intel MPI, provided a recent version (≥ 2021.11) of Intel MPI has been used. If an older version of Intel MPI has been used and relinking/recompiling is not possible, one can follow the workaround for PMI-2 with srun as described in the Q&A section below. Switching from srun to mpirun instead should also be considered.

Using more processes per node than available physical cores (PPN > 96; hyperthreads) with the OPX provider:
The OPX provider currently does not support hyperthreads/PPN > 96 on the clx partitions. Doing so may result in segmentation faults in libfabric during process startup. If a high number of PPN is really required, the libfabric provider has to be changed to PSM2 by setting FI_PROVIDER=psm2. Note that the use of hyperthreads may not be advisable. We encourage users to test performance before using more threads than there are physical cores.
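A sketch of a batch script that relies on the new PMIx default and, only if really needed, switches libfabric to the PSM2 provider; resource values and the binary name are placeholders:

```bash
#!/bin/bash
#SBATCH --partition=cpu-clx
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=96     # all physical cores, no hyperthreads
#SBATCH --time=00:30:00

# SLURM_MPI_TYPE=pmix is already set system-wide, so Open MPI 5.x and
# Intel MPI >= 2021.11 binaries start "out of the box":
srun ./mybinary

# Only if more than 96 processes per node (hyperthreads) are required:
# export FI_PROVIDER=psm2        # avoid OPX segmentation faults in libfabric
```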
Action items for users
All users of Lise are recommended to
log in to an already migrated login node (see the current state table) and get familiar with the new environment
check self-compiled software for continued operability
relink/recompile software as needed
adapt and test job scripts and workflows
submit test jobs to the new "cpu-clx:test" SLURM partition (a sketch of such a test workflow follows this list)
read the Q&A section and ask for support in case of further questions, problems, or software requests (support@nhr.zib.de)
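As a sketch of the recommended workflow, a self-compiled MPI code (here a placeholder called "mycode", with a placeholder job script "test-job.slurm") can be rebuilt with the new toolchain and submitted to the test partition:

```bash
# Rebuild with the Rocky Linux 9 toolchain (module versions from the table above)
module load gcc/13.3.0 openmpi/5.0.3
cd "$HOME/mycode"            # placeholder path of a self-compiled code
make clean && make

# Submit a short test job to the new test partition (16 nodes, 1 h limit)
sbatch --partition=cpu-clx:test --nodes=1 --time=00:10:00 test-job.slurm
```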
Questions and answers
title | Why is the environment module "mycode/1.2.3" not available under Rocky Linux 9 anymore? |
---|
There can be several reasons. Maybe our installation of Rocky Linux 9 already includes the “mycode” package in a version newer than “1.2.3”. Or maybe we provide an updated environment module “mycode/2.0” instead. Or maybe we have not yet considered continuing “mycode” under Rocky Linux 9; in this case, please submit a support request.
title | Has software that was installed under CentOS 7 been deleted? |
---|
No. Though environment modules prepared under CentOS 7 might not be available anymore, the actual software they have been pointing at is still available under the “/sw” file system.
title | I have loaded the "intel/2024.2" environment module, but still neither the icc nor the icpc compiler is found. Why is that? |
---|
Starting with the 2022.2 release of Intel’s oneAPI toolkits, the icc and icpc “classic” C/C++ compilers have been marked as “deprecated”, see here. The corresponding user warning reads:

icc: remark #10441: The Intel(R) C++ Compiler Classic (ICC) is deprecated and will be removed from product release in the second half of 2023. The Intel(R) oneAPI DPC++/C++ Compiler (ICX) is the recommended compiler moving forward. Please transition to use this compiler. Use '-diag-disable=10441' to disable this message.

With the 2024.x releases, the classic compilers have been removed entirely, so the intel/2024.2 module provides only the LLVM-based compilers icx (C) and icpx (C++).
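Switching a build from the classic to the current LLVM-based compilers usually only requires exchanging the compiler names; the source file names below are placeholders:

```bash
module load intel/2024.2

# icc and icpc are no longer shipped; their replacements are:
#   icc  -> icx   (C)
#   icpc -> icpx  (C++)
icx  -O2 -o myprog   myprog.c       # myprog.c / myprog.cpp are placeholder sources
icpx -O2 -o myprogxx myprog.cpp
```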
Replacement of the HOME and WORK file systems
Starting September 2024, the underlying hardware of Lise’s global file systems HOME and WORK will be replaced. Please be aware of the following activities.
One-week downtime for maintenance, preparation steps, and for preliminary transfer of HOME data. NHR@ZIB staff copies all data from the old to the new HOME storage. No user action is required.
One-day maintenance of the entire system for a final synchronization step between old and new HOME. Old HOME goes offline, new HOME goes online. No user action is required.
Two-month Migration phase for WORK. During this period, both the old and new WORK file systems are available. Users transfer their data from the old to the new WORK storage.
Schedule
step | estimated start date | subject | status |
---|---|---|---|
1 | September 30 | one-week downtime of the GPU clusters (A100 and PVC), CPU CLX cluster remains available | ● |
2 | End of November | one-day maintenance for all compute systems | ● |
3 | End of November | two-month migration phase for users to copy their data from the old to the new WORK storage | ● |
● completed
● open
Migration phase for WORK
In step 3 of the schedule, data migration for WORK will be organized as follows.
Active migration phase:
a two-month period starting in October
simultaneous user access to the old and new WORK file systems
data transfer by users from the old to the new WORK storage (no data transfer by NHR@ZIB staff); a sketch of such a transfer is shown after this list
Post migration phase:
only the new WORK file system is available
the old WORK file system is switched off, its data is deleted
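A sketch of such a user-driven transfer with rsync; the old and new WORK mount points below are assumptions, please use the paths announced by NHR@ZIB for your account:

```bash
# Placeholder paths - replace with the announced old and new WORK locations
OLD_WORK=/scratch-old/usr/$USER
NEW_WORK=/scratch/usr/$USER

# Copy only data still needed for computations ("hot" data); -a preserves
# permissions and timestamps, --progress reports the transfer status
rsync -a --progress "$OLD_WORK/projectX/" "$NEW_WORK/projectX/"
```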
Remarks
WORK is a scratch file system, which means no backups are available! Data can get lost at any time, due to user mistakes or due to system failures. Users need to copy important data (job results) to a safe place.
WORK is a file system shared by all users. It is important that only data actively used in computations (“hot” data) reside here. WORK is not intended to store backups, software installations, and other kinds of “cold” data.
The PERM file system is not affected by this maintenance. During all times of Lise operation, PERM will be available to each user.