Sometimes, it’s so easy to rewrite some existing code because it doesn’t fit exactly your bill.
I just so the example with an All To All communication that was written by hand. The goal was to share how many elements would be sent from one MPI process to another, and these elements were stored on one process in different structure instances, one for each MPI process. So in the end, you had n structures on each of the n MPI processes.
The MPI_Alltoall cannot map directly to this scattered structure, so it sounds fair to assume that using MPI_Isend and MPI_Irecv would be simpler to implement. The issue is that this pattern uses buffers on each process for each other process it will send values to or receive values from. A lot of MPI library allocate their buffer when needed, but will never let go of the memory until the end. So you end up with a memory consumption that doesn’t scale. In my case, when using more than 1000 cores, the MPI library uses more than 1GB per MPI process when it hits these calls, just for these additional hidden buffers. This is just no manageable.
Now, if you use MPI_Alltoall, two things happen:
- there are no additional buffer allocated, so this scales nicely when you increase the number of cores
- it is actually faster than your custom implementation
Now with MPI 3 standard having non-blocking collective operations, there is absolutely no reason to try to outsmart the library when you need a collective operation. It has heuristics when it knows that it is doing a collective call, so let them work. You won’t be smarter if you try, but you will if you use them.
In my case, the code to retrieve all values and store them in an intermediate buffer was smaller that the one with the Isend/Irecv.
We know now that we won’t have the same serial computing increase we had in the last decades. We have to cope with optimizing serial codes, and programming parallel and concurrent ones, and this means that all coders have to cope with this paradigm shift. If computer scientists are aware of the tools to use, it is not the same for the “average” scientist or engineer. And this is the purpose of this book: educate the average coder.
Continue reading Book review: Introduction to High Performance Computing for Scientists and Engineers
Due to the end of the free lunch, manufacturers started to provide differents processing units and developers started to go parallel. It’s kind of back to the future, as accelerators existed before today (the x87 FPU started as a coprocessor, for instance). If those accelerators were integrated into the CPU, their instruction set were also.
Today’s accelerators are not there yet. The tools are not ready yet (code translators) and usual programming practices may not be adequate. All the ecosystem will evolve, accelerators will change (GPUs are the main trend, but they will be different in a few years), so what you will do today needs to be shaped with these changes in mind. How is it possible to do so? Is it even possible?
Continue reading Thinking of good practices when developing with accelerators
Some months ago, I had a TotalView tutorial, thanks to my job. Now, I’ve actually used it to debug one of my parallel applications and I would like to share my experience with fantastic tool.
First TotalView is not only a parallel debugger available on several Linux and Unix platforms. It also is a memory checker (MemoryScape and the TotalView plugin) as well as a reverse debugger, that is, you can roll back the execution of a program, even after it crashed (where it would be useless with a standard debugger like GDB).
Continue reading Overview of TotalView, a parallel debugger
I came across the issue of how to teach a trainee how to write a parallel finite-difference time-domain (FDTD) method. There are a lot of books on the FDTD, but only a few on parallel ones. So I’ve decided to go for this book, knowing that some chapters won’t apply to our job (wave equations). My goal was to seek a book that would explain the basics of my issues.
Continue reading Book review: Parallel Finite-Difference Time-Domain Method
I had this discussion with one of my Ph.D. advisors some months ago when we talked about correctly using the computers we had then (dual cores), and I had almost the same one in my new job here: applied maths (finite differences, signal processing, …) graduate students are not taught how to use current computers, so how could they develop an HPC program correctly?
I think it goes even further than that, and it will be a part of this post. What I see is that trainees and newly-hired people (to some extent myself included) lack a lot of basic Computer Science knowledge, and even IT knowledge.
Continue reading How to promote High Performance Computing ?
In my lab, we frequently process huge amounts of data, each process can take hours or days. The problem is that we don’t have a usable tool to do this.
Our legacy software is in C and we plan on moving to Python in the next weeks. We could use some commercial software, but it is not optimal.
This is where P2P comes into the game. We have a lot of unused computers or dual cores that are not used even at 50% because we are not trained in parallel computing (and we won’t in the near future). By “we”, I mainly mean PhD students. Our background is signal or image processing, not Computer Science and even less parallel computing. Those unused computers could be used for our computations, but this implies that the computer is only used if nobody works on it, that we only use what is available at a precise moment, and that some computers may get used during the computations. That’s why P2P seems an elegant idea, as a grid computing tool.
P2P computation is not new in the lab (we developed P2P-MPI in Java for instance), but for our team, it is. For the time being, I did not find much about the tools that we could use, but the JXTA protocol seems a good start. I hope I will be able to talk more about this subject in the near future.