Problem
A job that stages a large set of new input files at runtime (8 GB in 650 files) fails with error messages like "invalid file format". Inspecting the files afterwards reveals no errors; the input files are intact.
cp repository/* input_area
mpirun ...
This appears to be a Lustre cache-related problem: the parallel processes start up before Lustre has synchronised the newly copied files across all nodes.
Solution
Add some delay after copying large file sets:
cp repository/* input_area
sleep 20
mpirun ......
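The copy-then-wait step could be wrapped in a small helper so the delay is adjustable per job. The following is only a sketch: the function name stage_inputs and its arguments are hypothetical, and the size check only catches an incomplete copy on the node running the script. It does not prove that other nodes already see the files, so the delay remains the actual workaround.

```shell
#!/bin/sh
# stage_inputs SRC_DIR DEST_DIR [DELAY_SECONDS]
# Hypothetical helper: copy the input files, check them, then wait.
stage_inputs() {
    src=$1
    dest=$2
    delay=${3:-20}   # default matches the 20 seconds used above

    # Copy the input set; stop if cp reports an error.
    cp "$src"/* "$dest"/ || return 1

    # Ask the local node to flush dirty buffers to the file system.
    sync

    # Sanity check: every regular source file exists in the
    # destination with the same size in bytes.
    for f in "$src"/*; do
        [ -f "$f" ] || continue
        name=$(basename "$f")
        [ -f "$dest/$name" ] || return 1
        if [ "$(wc -c < "$f")" -ne "$(wc -c < "$dest/$name")" ]; then
            return 1
        fi
    done

    # Give the file system time to settle before the parallel start.
    sleep "$delay"
}

# Usage in a job script (mpirun arguments elided as in the original):
#   stage_inputs repository input_area 20 || exit 1
#   mpirun ...
```

If 20 seconds turns out to be too short or needlessly long for a given input set, only the third argument needs to change.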