...
- Write intermediate results and checkpoints as seldom as possible.
- Try to write/read larger data volumes (>1 MiB) and reduce the number of files concurrently managed in WORK.
- For inter-process communication use proper protocols (e.g. MPI) instead of files in WORK.
- If you want to control your jobs externally, consider using POSIX signals instead of files that your program frequently opens, reads and closes. You can send signals to batch jobs with, e.g., "scancel --signal..."
- Use MPI-IO to coordinate your I/O instead of each MPI task doing individual POSIX I/O (HDF5 and netCDF may help you with this).
- Instead of recursive `chmod`/`chown`/`chgrp`, please use a combination of `lfs find` and `xargs`, e.g. `lfs find /path/to/folder | xargs chgrp $project`, as this puts less stress on the metadata servers and is much faster.
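The advice on POSIX signals above can be illustrated with a minimal sketch. It assumes a bash loop that should stop cleanly when it receives SIGUSR1; the names `on_usr1`, `run_steps`, `stop_requested` and `steps_done` are hypothetical, and the script sends the signal to itself only to simulate an external request.

```shell
#!/bin/bash
# Flag set by the signal handler; the work loop only reads it.
stop_requested=0

# Hypothetical handler: remember that a stop was requested.
on_usr1() {
    stop_requested=1
}
trap on_usr1 USR1

# Hypothetical work loop: runs at most max_steps iterations and
# leaves early once the flag is set. No control file is polled.
run_steps() {
    local max_steps=$1
    steps_done=0
    while [ "$steps_done" -lt "$max_steps" ] && [ "$stop_requested" -eq 0 ]; do
        steps_done=$((steps_done + 1))
        # Placeholder for one unit of real work.
        # Simulate an external signal arriving after the third step.
        if [ "$steps_done" -eq 3 ]; then
            kill -USR1 $$
        fi
    done
}

run_steps 10
echo "finished after $steps_done steps"
```

In a real job the signal would come from outside, for example via `scancel --signal=USR1` followed by the job ID. Whether the signal reaches the batch shell or only the job steps depends on further scancel options, so please check the scancel manual page for your Slurm version. A compiled application would install its own handler instead of relying on a shell `trap`.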
Analysis of meta data
An existing application can be investigated with respect to its metadata usage. Let us assume an example job script for the MPI-parallel application `myexample.bin` with 16 MPI tasks.
```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --time=01:00:00
#SBATCH --partition=standard96
srun ./myexample.bin
```
For this example 16 MPI tasks are executed. Once you add the Linux command `strace` to the job, two files are created per Linux process (MPI task), so this example creates 32 trace files. Large MPI jobs can create a huge number of trace files, e.g. a 128-node job with 128 x 96 MPI tasks creates 24576 files. That is why we strongly recommend reducing the number of MPI tasks as far as possible.
```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --time=01:00:00
#SBATCH --partition=standard96
srun strace -ff -t -o trace -e open,openat ./myexample.bin
```
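The file counts quoted above follow from a simple product, which the small sketch below reproduces. The helper name `expected_trace_files` is hypothetical, and the factor of two is the "two files per process" figure stated in the text.

```shell
#!/bin/bash
# Hypothetical helper: estimate the number of trace files for a job.
# Arguments: number of nodes, MPI tasks per node.
expected_trace_files() {
    local nodes=$1 tasks_per_node=$2
    # Two files per Linux process (MPI task), as stated in the text.
    echo $(( nodes * tasks_per_node * 2 ))
}

small_job=$(expected_trace_files 2 8)
large_job=$(expected_trace_files 128 96)
echo "2 nodes x 8 tasks:    $small_job trace files"
echo "128 nodes x 96 tasks: $large_job trace files"
```

Running such an estimate before submitting the traced job helps to decide how far the task count should be reduced.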
Analysing one trace file shows all file `open` activity of one process (MPI task).
```
> ls -l trace.*
-rw-r----- 1 bzfbml bzfbml 21741 Mar 10 13:10 trace.445215
...
> wc -l trace.445215
258 trace.445215
> cat trace.445215
13:10:37 open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
13:10:37 open("/lib64/libfabric.so.1", O_RDONLY|O_CLOEXEC) = 3
...
13:10:38 open("/scratch/usr/bzfbml/mpiio-filesystem/mpiio_zxyblock.dat", O_RDWR) = 8
13:10:43 +++ exited with 0 +++
```
...
When interpreting the trace file, expect a number of `open` entries that originate from the Linux system, independently of your code. In the example above, the code `myexample.bin` creates only one file, named `mpiio_zxyblock.dat`. The 258 lines of the trace file include only one `open` from the application code, which indicates very desirable metadata activity.
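Separating application opens from system opens can be automated. The following is a minimal sketch, assuming that application files live below a known path prefix (here `/scratch`, taken from the example path above). The function name `count_opens` is hypothetical, and the sample input reuses lines from the trace excerpt shown earlier.

```shell
#!/bin/bash
# Count open/openat entries in one strace output file and split them
# by path prefix. Usage: count_opens TRACE_FILE APP_PREFIX
count_opens() {
    local trace_file=$1 app_prefix=$2
    awk -v prefix="$app_prefix" '
        # With "strace -t" the second field holds the system call name.
        $2 ~ /^open(at)?\(/ {
            total++
            # The first double-quoted string on the line is the path.
            if (match($0, /"[^"]*"/)) {
                path = substr($0, RSTART + 1, RLENGTH - 2)
                if (index(path, prefix) == 1) app++
            }
        }
        END { printf "total=%d app=%d system=%d\n", total, app, total - app }
    ' "$trace_file"
}

# Small sample built from the trace excerpt in the text.
sample=$(mktemp)
cat > "$sample" <<'EOF'
13:10:37 open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
13:10:37 open("/lib64/libfabric.so.1", O_RDONLY|O_CLOEXEC) = 3
13:10:38 open("/scratch/usr/bzfbml/mpiio-filesystem/mpiio_zxyblock.dat", O_RDWR) = 8
13:10:43 +++ exited with 0 +++
EOF

result=$(count_opens "$sample" /scratch)
echo "$result"
rm -f "$sample"
```

Applied to a real trace, the call would look like `count_opens trace.445215 /scratch`. A low `app` count relative to the run time of the job suggests modest metadata load from the application itself.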
Known issues
For some of the codes we are aware of certain issues:
...