While following a discussion on KVR, I thought about adding support for denormals handling in Audio Toolkit.
Last year, my colleagues and I presented a paper on giga-model simulations at an SPE conference: Giga-Model Simulations In A Commercial Simulator – Challenges & Solutions. During this talk, we discussed the complexity of I/O for such simulations. Our input was ordered data that we needed to split into chunks and send to the relevant MPI ranks, and the same process was required in reverse for the results: gathering the chunks and then writing them to disk.
The central point is that some clusters have parallel file systems, and these work well when you access big blobs of aligned data. In fact, as they are the bottleneck of the whole system, you need to limit the number of accesses to what you actually require. For instance in HDF5, you can specify the alignment of datasets, so you can say that all HDF5 datasets will be aligned on the filesystem specifications (for instance 1MB if your Lustre/GPFS has a chunk size of 1MB) and read or write chunks that are multiples of these values.
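To make the alignment arithmetic concrete, here is a minimal sketch in plain Python (no HDF5 involved) of how a requested byte range can be widened to the filesystem's block boundaries. The names `aligned_range` and `block_count` and the 1MB default are assumptions for illustration only; in HDF5 itself, the alignment is set on the file access property list (`H5Pset_alignment` in the C API).

```python
ALIGNMENT = 1 << 20  # assumed filesystem block size: 1MB


def aligned_range(offset, length, alignment=ALIGNMENT):
    """Return (start, end) byte offsets, both multiples of `alignment`,
    that cover the half-open range [offset, offset + length)."""
    if length <= 0:
        raise ValueError("length must be positive")
    start = (offset // alignment) * alignment
    end = -(-(offset + length) // alignment) * alignment  # ceiling division
    return start, end


def block_count(offset, length, alignment=ALIGNMENT):
    """Number of aligned blocks the filesystem has to touch for this request."""
    start, end = aligned_range(offset, length, alignment)
    return (end - start) // alignment
```

With these assumptions, a 100,000-byte request starting at offset 1,500,000 stays inside one 1MB block, while the same request starting at offset 1,000,000 straddles a boundary and touches two blocks. This is why keeping datasets and chunk sizes on multiples of the block size limits the number of accesses.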
After my post on HPCToolkit, I felt that I preferred QCacheGrind as a GUI to explore profiling results. So here is a gist with a Python script to convert XML HPCToolkit experiments to callgrind format: https://gist.github.com/mbrucher/6cad31e38beca770523b
For instance, this is a display of an Audio Toolkit test of L2 cache misses:
Sometimes, it’s so easy to rewrite some existing code because it doesn’t exactly fit your bill.
I just saw an example with an All To All communication that was written by hand. The goal was to share how many elements would be sent from one MPI process to another, and these elements were stored on one process in different structure instances, one for each MPI process. So in the end, you had n structures on each of the n MPI processes.
MPI_Alltoall cannot map directly to this scattered structure, so it sounds fair to assume that using MPI_Isend and MPI_Irecv would be simpler to implement. The issue is that this pattern uses buffers on each process for each other process it will send values to or receive values from. Many MPI libraries allocate these buffers when needed, but never release the memory until the end. So you end up with a memory consumption that doesn’t scale. In my case, when using more than 1000 cores, the MPI library uses more than 1GB per MPI process when it hits these calls, just for these additional hidden buffers. This is just not manageable.
Now, if you use MPI_Alltoall, two things happen:
- there is no additional buffer allocated, so this scales nicely when you increase the number of cores
- it is actually faster than your custom implementation
Now with the MPI 3 standard having non-blocking collective operations, there is absolutely no reason to try to outsmart the library when you need a collective operation. The library has heuristics that it can apply when it knows that it is doing a collective call, so let them work. You won’t be smarter if you try, but you will if you use them.
In my case, the code to retrieve all values and store them in an intermediate buffer was smaller than the one with the Isend/Irecv.
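As an illustration of that intermediate buffer, here is a minimal sketch in plain Python. It is not the original code: `PeerData`, `pack_counts`, `make_rank` and `simulated_alltoall` are hypothetical names, and the exchange is simulated in a single process so that it runs without an MPI installation. It only shows the packing step and the semantics of the collective.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PeerData:
    """Hypothetical structure holding the elements destined for one rank."""
    elements: List[float] = field(default_factory=list)


def pack_counts(peers):
    """Flatten the scattered structures into one contiguous buffer of counts,
    indexed by destination rank: this is the send buffer for the all-to-all."""
    return [len(peer.elements) for peer in peers]


def simulated_alltoall(send_buffers):
    """Single-process stand-in for the collective: rank `dst` receives, at
    index `src`, the entry that rank `src` stored at index `dst`."""
    n = len(send_buffers)
    if any(len(buf) != n for buf in send_buffers):
        raise ValueError("each rank needs exactly one entry per rank")
    return [[send_buffers[src][dst] for src in range(n)] for dst in range(n)]


def make_rank(rank, n):
    """Hypothetical data: rank r holds 10 * r + dst elements for rank dst."""
    return [PeerData([0.0] * (10 * rank + dst)) for dst in range(n)]


# With mpi4py, each rank would instead make one call such as:
#     recv_counts = comm.alltoall(pack_counts(peers))
send = [pack_counts(make_rank(rank, 3)) for rank in range(3)]
recv = simulated_alltoall(send)
```

After the exchange, `recv[dst][src]` holds the number of elements that rank `src` will send to rank `dst`, which is all the information the hand-written Isend/Irecv version was sharing.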
We know now that we won’t have the same increase in serial computing performance that we had in the last decades. We have to cope with optimizing serial codes, and programming parallel and concurrent ones, and this means that all coders have to cope with this paradigm shift. If computer scientists are aware of the tools to use, it is not the same for the “average” scientist or engineer. And this is the purpose of this book: to educate the average coder.
We now have several petaflop-scale clusters in the Top500. Of course, we are trying to get the most out of their peak computational power, but I think we should sometimes also look at optimal resource allocation.
I’ve been thinking about this for several months now, for work that has thousands of tasks, each task being massively data parallel. Traditionally, one launches a job through one’s favorite batch scheduler (favorite or mandatory…) with fixed resources and for an estimated amount of time. This may work well in research, but in the industrial world, there is often a new job that arises and that needs part of your scarce resources. You may have to stop your work, lose your current progress and/or restart the job with fewer resources. And then the cycle goes on.
I came across the issue of how to teach a trainee how to write a parallel finite-difference time-domain (FDTD) method. There are a lot of books on the FDTD, but only a few on parallel ones. So I’ve decided to go for this book, knowing that some chapters won’t apply to our job (wave equations). My goal was to seek a book that would explain the basics of my issues.
For each algorithm and program, there are architectures that are better than others. Some computations may need a lot of FLOPS, but FLOPS are not the only thing to consider. Communication and memory bandwidth and latency are as important as computational power, especially since memory speed and CPU speed are decoupled.
I had this discussion with one of my Ph.D. advisors some months ago when we talked about correctly using the computers we had then (dual cores), and I had almost the same one in my new job here: applied maths (finite differences, signal processing, …) graduate students are not taught how to use current computers, so how could they develop an HPC program correctly?
I think it goes even further than that, and it will be a part of this post. What I see is that trainees and newly-hired people (to some extent myself included) lack a lot of basic Computer Science knowledge, and even IT knowledge.