Unspecific error messages when reading huge input files
Problem
In a job that stages new, large input files at runtime (8 GB in 650 files), the job fails with error messages such as "invalid file format". Inspecting the files afterwards reveals no errors; the input files are intact.
cp repository/* input_area
mpirun ...
This appears to be a Lustre cache-related problem: the parallel processes start faster than Lustre can synchronise itself across all nodes.
Solution
Add some delay after copying large file sets:
cp repository/* input_area
sleep 20
mpirun ...
sleep 20
Alternatively, the tool nocache serves as a workaround for this issue (thanks John):
nocache cp repository/* input_area
mpirun ...
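The fixed delay of 20 seconds is an empirical value. As a minimal sketch, and not a verified fix for the Lustre behaviour described above, the job script could instead poll until every staged file is visible with the same size as its source, and only then start the parallel run. The function name wait_for_staging and its parameters are hypothetical. The check only reflects what the node running the script sees, so other nodes may still lag behind.

```shell
#!/bin/sh
# Hypothetical helper (not part of any cluster tooling):
# wait until every regular file in SRC_DIR exists in DST_DIR with the same size.
# Usage: wait_for_staging SRC_DIR DST_DIR [MAX_TRIES] [DELAY_SECONDS]
# Returns 0 once all sizes match, 1 if MAX_TRIES checks were not enough.
wait_for_staging() {
    src=$1
    dst=$2
    tries=${3:-10}   # assumed default: 10 checks
    delay=${4:-5}    # assumed default: 5 seconds between checks
    i=0
    while [ "$i" -lt "$tries" ]; do
        ok=1
        for f in "$src"/*; do
            [ -f "$f" ] || continue
            g="$dst/$(basename "$f")"
            # Compare byte counts; a missing or shorter copy means "not ready yet".
            if [ ! -f "$g" ] || [ "$(wc -c < "$f")" -ne "$(wc -c < "$g")" ]; then
                ok=0
                break
            fi
        done
        if [ "$ok" -eq 1 ]; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

In a job script this could sit between the copy and the parallel run, for example `cp repository/* input_area && wait_for_staging repository input_area && mpirun ...`. Because the check cannot see the state of other nodes, keeping a short sleep in addition may still be prudent.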