Last year, my colleagues and I presented a paper on giga model simulations in an SPE conference: Giga-Model Simulations In A Commercial Simulator – Challenges & Solutions. During this talk, we talked about the complexity of I/O for such simulations. We had ordered data as input that we needed to split in chunks to send them on the relevant MPI ranks, and then the same process was required for writing the results, gathering the chunks and then writing them down to the disk.
The central point is that some clusters have parallel file systems, and these works well when you try to access big blobs of aligned data. In fact, as they are the bottleneck of the whole system, you need to limit the number of accesses to what you actually require. For instance in HDF5, you can specify the alignment of datasets, so you can say that all HDF5 datasets will be aligned on the filesystem specifications (so for instance 1MB if your Lustre/GPFS has a chunk size of 1MB) and read or write chunks that are multiple of these values.