WORK is a shared, distributed resource that uses many IO servers and hundreds of storage devices in parallel, so it has to be used fairly.
In particular, hundreds of thousands of metadata operations such as open, close and stat can slow down the filesystem.
Therefore some general advice:
- Write intermediate results and checkpoints as rarely as possible.
- Try to use large IO sizes (>1 MiB) and arrange your IO as sequentially as possible. WORK is hard-disk based.
- For inter-process communication use proper protocols (e.g. MPI) instead of files in WORK.
- To control your jobs from the outside you can use POSIX signals, which can be sent to batch jobs via "scancel --signal..."
- Use MPI-IO to coordinate your IO instead of each MPI task doing individual POSIX IO (HDF5 and netCDF may help you with this).
- OpenFOAM: always set 'runTimeModifiable false' and 'fileHandler collated', with sensible values for purgeWrite and writeInterval (see: https://www.hlrn.de/doc/display/PUB/OpenFOAM)
- NAMD: Christian T.? i.a.
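
The advice on checkpoints, large sequential IO and POSIX signals can be combined. The following is a minimal Python sketch, not a site-provided tool. It assumes that the job receives SIGUSR1 from outside (the choice of signal is an assumption) and writes a checkpoint only on request, in 4 MiB chunks (also an assumed value). The names `request_checkpoint`, `write_checkpoint` and `run` are hypothetical.

```python
import os
import signal

# Flag set by the signal handler; the main loop polls it between steps.
checkpoint_requested = False


def request_checkpoint(signum, frame):
    """Only record the request; the actual IO happens in the main loop."""
    global checkpoint_requested
    checkpoint_requested = True


# Assumption: SIGUSR1 is the signal sent to the job from outside.
signal.signal(signal.SIGUSR1, request_checkpoint)


def write_checkpoint(path, payload, chunk_size=4 * 1024 * 1024):
    """Write payload sequentially in large chunks (4 MiB is an assumed size)."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb", buffering=chunk_size) as fh:
        for offset in range(0, len(payload), chunk_size):
            fh.write(payload[offset:offset + chunk_size])
    # Replace the previous checkpoint with a single rename.
    os.replace(tmp_path, path)


def run(steps, checkpoint_path, state_bytes):
    """Hypothetical main loop: compute, and checkpoint only when asked."""
    global checkpoint_requested
    written = 0
    for _ in range(steps):
        # ... one simulation step would go here ...
        if checkpoint_requested:
            write_checkpoint(checkpoint_path, state_bytes())
            checkpoint_requested = False
            written += 1
    return written
```

Doing the IO in the main loop rather than in the handler keeps the handler trivial and ensures the checkpoint is written between steps, when the state is consistent.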
If you have questions or are unsure about your individual scenario, please contact your consultant.
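
To illustrate the point about metadata operations: many small result files cost one open and one close each, whereas a single container file costs one of each in total. The following is a minimal Python sketch under that assumption. The length-prefixed record format and the function names are hypothetical; in practice a library such as HDF5 or netCDF would usually take this role.

```python
import struct


def write_records_single_file(path, records, buffer_size=4 * 1024 * 1024):
    """Store all records in one file: one open and one close in total."""
    with open(path, "wb", buffering=buffer_size) as fh:
        for record in records:
            fh.write(struct.pack("<I", len(record)))  # 4-byte length prefix
            fh.write(record)


def read_records_single_file(path):
    """Read the length-prefixed records back in one sequential pass."""
    records = []
    with open(path, "rb") as fh:
        while True:
            header = fh.read(4)
            if not header:
                break  # end of file
            (length,) = struct.unpack("<I", header)
            records.append(fh.read(length))
    return records
```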