We know now that we won’t see the same increase in serial computing performance that we enjoyed in past decades. We have to cope with optimizing serial code and with writing parallel and concurrent programs, and this paradigm shift affects every coder. Computer scientists may be aware of the tools to use, but the same cannot be said of the “average” scientist or engineer. And that is the purpose of this book: educating the average coder.
Content and opinions
There are several issues to address when designing a high-performance application. First, there is optimizing a serial application on our complex CPUs. Then there are parallel computers: how they work, how they have to be used, and finally how to design a parallel application so that it benefits from the architecture at hand. With these three main topics, the basics of HPC for scientists and engineers who are not computer scientists are covered. What I appreciated in this book is that each time, the authors start from the hardware and then derive the consequences for the software.
The book can be split into three main parts: serial, shared-memory parallel, and distributed-memory parallel. It starts with an introduction to current processor trends, Moore’s law, and what designers had to invent to keep up with the pace of the transistor count. It gives a nice overview of what awaits in the core of the book.
So the first part is all about designing serial code. This is the basis of all HPC code, but it tends to be forgotten in the rush to write parallel code directly. Standard tips and tricks as well as adequate profiling open the first chapter of this part, which goes on with compilers and what you can expect of them. Although I do not agree with the authors on the entire C++ subchapter (they tend to forget that C++ has features that can make code more efficient than C or Fortran), their advice is sound for the general public. The second chapter in this part tackles the difficult problem of data access, and I have to say that it achieves its goal in a remarkable way. The performance analysis is impressive, and it gives some warnings I didn’t even know about (and there are people I know who should have known about them, but don’t).
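To give a flavor of the kind of data-access issue that chapter deals with, here is a minimal sketch of my own (not an example from the book) showing how loop order alone changes the memory-access pattern on a row-major C array:

```c
#include <stddef.h>

#define N 4096

/* Row-major traversal: the inner loop walks contiguous memory,
 * so each cache line that is fetched is fully used. */
double sum_row_major(const double a[N][N])
{
    double s = 0.0;
    for (size_t i = 0; i < N; ++i)
        for (size_t j = 0; j < N; ++j)
            s += a[i][j];
    return s;
}

/* Column-major traversal of the same array: the inner loop strides
 * by N doubles, touching a new cache line at almost every access. */
double sum_col_major(const double a[N][N])
{
    double s = 0.0;
    for (size_t j = 0; j < N; ++j)
        for (size_t i = 0; i < N; ++i)
            s += a[i][j];
    return s;
}
```

Both functions compute the same sum; only the traversal order, and thus the cache behavior, differs.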
Let’s talk about parallel code. To understand what parallelism is, the next chapter covers the different solutions (shared memory and distributed memory), with good coverage of the different types of each. I was pleased to see that the common pitfalls of parallel machines were all covered in depth. A second introductory chapter on parallel code then tackles data parallelism versus task parallelism, as well as the usual parallel performance measures (which are mandatory but too often overlooked).
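As a reminder of what those measures look like, here is a small sketch of my own (not code from the book) of the textbook speedup and efficiency formulas for a program with a serial fraction s, in the spirit of Amdahl’s law:

```c
#include <stdio.h>

/* Amdahl-style speedup for a code whose serial fraction is s,
 * run on p workers, and the corresponding parallel efficiency. */
static double speedup(double s, int p)
{
    return 1.0 / (s + (1.0 - s) / p);
}

static double efficiency(double s, int p)
{
    return speedup(s, p) / p;
}

int main(void)
{
    /* Even with only 5% serial work, efficiency drops quickly. */
    double s = 0.05;
    for (int p = 1; p <= 64; p *= 2)
        printf("p=%2d  speedup=%6.2f  efficiency=%5.1f%%\n",
               p, speedup(s, p), 100.0 * efficiency(s, p));
    return 0;
}
```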
The next three chapters cover shared-memory computers. These are, IMHO, the most complicated computers to code for, as everything is shared between all CPUs. The authors use OpenMP as the multithreading library, which may be the best choice for the targeted audience (scientists and engineers). A case study with profiling is also offered before the really big issue: cache usage. The last chapter of this part is about NUMA architectures, which have a tremendous impact on performance if badly used. It raised issues I had overlooked in the past, so it was really a good read, full of solutions to problems I have faced (some were fixed by the libraries we use, but others may still be looming in my code). The only thing missing may be a note that, with this approach, one has to pay attention to the dimensions of the nD arrays, for instance in the Jacobi solver (a recurring example in the book): if the largest dimension is not a multiple of the cache line size, one may get a lot of cache-line invalidations and thus a slowdown instead of a speedup.
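To illustrate that padding concern, here is a minimal sketch of my own (not from the book; the 64-byte cache line and the helper names are assumptions) of one way to keep the leading dimension of a Jacobi-style grid aligned to the cache line size:

```c
#include <stdlib.h>

#define CACHE_LINE 64                       /* bytes, typical x86 value */
#define DOUBLES_PER_LINE (CACHE_LINE / sizeof(double))

/* Round the leading dimension up to a multiple of the cache line, so
 * each row starts on its own line and threads updating neighbouring
 * rows do not keep invalidating each other's cache lines. */
static size_t padded_dim(size_t n)
{
    return (n + DOUBLES_PER_LINE - 1) / DOUBLES_PER_LINE * DOUBLES_PER_LINE;
}

double *alloc_grid(size_t rows, size_t cols, size_t *ld)
{
    *ld = padded_dim(cols);                 /* padded leading dimension */
    /* aligned_alloc needs a size that is a multiple of the alignment;
     * ld is a multiple of 8 doubles, i.e. of 64 bytes, so this holds. */
    return aligned_alloc(CACHE_LINE, rows * *ld * sizeof(double));
}

/* Element (i, j) is then accessed as grid[i * ld + j]; the extra
 * ld - cols entries per row are never touched by the solver. */
```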
The last part may be the easiest, with its complete coverage of MPI (although MPI itself may be difficult to use, as its API is sometimes verbose). The first chapter of the triad covers the simplest MPI layer; with this alone, one can write any program for a distributed-memory computer. The tricky part is in the second chapter, with asynchronous and blocking functions. Deadlocks and contention are also introduced, and a lot of graphs describe the performance one can expect. The last chapter of the triad, and of the book, covers the hybrid OpenMP and MPI approach with its own peculiarities.
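To make the deadlock issue concrete, here is a minimal sketch of my own (not an example from the book): two ranks that both call a blocking send first can deadlock once the message no longer fits in the implementation’s internal buffers, while posting a non-blocking receive first avoids the problem:

```c
#include <mpi.h>

#define N (1 << 20)

/* Risky pattern: both ranks post a blocking send first. If the MPI
 * implementation cannot buffer N doubles, both sends wait for a
 * matching receive that is never posted: a classic deadlock. */
void exchange_blocking(double *sendbuf, double *recvbuf, int peer)
{
    MPI_Send(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
    MPI_Recv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
}

/* Safe pattern: post a non-blocking receive first, then send, then
 * wait. Progress no longer depends on internal buffering. */
void exchange_nonblocking(double *sendbuf, double *recvbuf, int peer)
{
    MPI_Request req;
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req);
    MPI_Send(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```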
Conclusion
When one finishes the book, the trip is not over; it has just started. By then, someone not familiar with all the small computer issues should have a better understanding of how a CPU works, how it can be used at its maximum efficiency (and also how this efficiency has to be computed, which is not as straightforward as a simple division), and how several CPUs work together… Putting everything into practice is not an easy task, but at least all the common and uncommon pitfalls are known and covered. Congratulations to the authors; they did a very good job.