Several petaflop clusters are now available in the Top500. Of course, we try to get the most out of their peak computational power, but I think we should sometimes also look at optimal resource allocation.
I’ve been thinking about this for several months now, for work that consists of thousands of tasks, each task being massively data parallel. Traditionally, one launches a job through one’s favorite batch scheduler (favorite or mandatory…) with fixed resources and for an estimated amount of time. This may work well in research, but in the industrial world, a new job often arises that needs part of your scarce resources. You may have to stop your work, lose your current progress, and/or restart the job with fewer resources. And then the cycle goes on.
Static resource allocation
How can resource allocation work? Let’s start with a simple case where you have two applications with different priorities. One of them has a priority of 70 (it’s supposed to finish in three days) whereas the other one has a priority of 50 (four days left). They share the cluster so that 66% is allocated to the first application and 33% to the second one.
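As a back-of-the-envelope illustration, here is a minimal C sketch of a purely proportional policy. The node count is a made-up assumption, and this is not any real scheduler’s algorithm; note that a strictly proportional split of priorities 70 and 50 gives roughly 58/42, so the 66/33 split above corresponds to a policy that weights the higher priority more heavily.

```c
#include <stdio.h>

/* Sketch: split a cluster's nodes between applications in
 * proportion to their priorities (hypothetical policy). */
int main(void)
{
    const int total_nodes = 120;          /* hypothetical cluster size */
    const int priority[] = { 70, 50 };
    const int n_apps = sizeof priority / sizeof priority[0];

    int sum = 0;
    for (int i = 0; i < n_apps; ++i)
        sum += priority[i];

    for (int i = 0; i < n_apps; ++i)
        printf("application %d: priority %d -> %d nodes\n",
               i, priority[i], total_nodes * priority[i] / sum);
    return 0;
}
```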

What happens if a third application must be launched with a higher priority, because it has to be finished by tomorrow? You may stop the other two programs, losing a lot of work if you didn’t implement checkpoints (besides, one of them may be an off-the-shelf program you bought yesterday), or suspend them. Either way, this is what you will get:

In fact, even if you use dynamic resource allocation, this is the allocation you must end up with to have your results by the time you need them, but obviously, you have lost your two other applications. Some batch schedulers allow applications to be suspended, but this is a double-edged sword:
- your cluster must support job suspension, and thus have access to drives to save the job state (which is not possible on medium to large-scale clusters)
- if your application does not scale to your entire cluster (it happens), one of the other two applications could go on using the remaining nodes, but that is not possible: all their processes are put to sleep
So all things considered, you have to implement dynamic resource allocation.
Dynamic resource allocation
How does this work? Each application must be aware that it can be allocated more resources, or have some deallocated, at any time. To be portable across clusters, you cannot suspend part of your program; it must really go away. The batch scheduler must also notice that your application has freed some of its resources. You therefore have to allocate small jobs that communicate together (this can be done with MPI-2).
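As a sketch of the mechanism, MPI-2’s dynamic process management lets separately submitted jobs join a running one. The skeleton below assumes the master publishes its port name out of band (through a file, or MPI_Publish_name); error handling is omitted:

```c
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;

    if (argc > 1 && strcmp(argv[1], "master") == 0) {
        /* Master job: open a port and wait for a worker job to join. */
        MPI_Open_port(MPI_INFO_NULL, port);
        printf("port name: %s\n", port);   /* hand this to the workers */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        MPI_Close_port(port);
    } else if (argc > 1) {
        /* Worker job, launched later by the scheduler with the
         * port name as its argument: join the running master. */
        MPI_Comm_connect(argv[1], MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
    } else {
        fprintf(stderr, "usage: %s master | %s <port-name>\n",
                argv[0], argv[0]);
        MPI_Finalize();
        return 1;
    }

    /* ... exchange work over the inter-communicator 'inter' ... */
    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}
```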
This means that you will have hundreds or thousands of small jobs. Not all of them have to be connected to the scheduler; only one master must be. Of course, this can easily be done by using a specific queue. Each application on this queue will receive orders from the batch scheduler and act upon them. Another advantage is that even if the application gets no resources at some point, it still has a saved state that enables the run to continue.
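A minimal sketch of such a saved state, assuming the work is a list of task ids (the file name run.state and the layout are made up for illustration):

```c
#include <stdio.h>

#define N_TASKS 1000

/* Write the ids of the tasks still pending; called whenever the
 * scheduler tells the master its resources are being taken away. */
static void save_state(const char *path, const int *done)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return;
    for (int i = 0; i < N_TASKS; ++i)
        if (!done[i])
            fprintf(f, "%d\n", i);
    fclose(f);
}

/* Reload the pending task ids when resources come back; on a first
 * run (no state file yet) every task is pending. */
static int load_state(const char *path, int *pending)
{
    FILE *f = fopen(path, "r");
    int n = 0, id;
    if (!f) {
        for (int i = 0; i < N_TASKS; ++i)
            pending[n++] = i;
        return n;
    }
    while (fscanf(f, "%d", &id) == 1)
        pending[n++] = id;
    fclose(f);
    return n;
}

int main(void)
{
    int pending[N_TASKS], done[N_TASKS];
    int n = load_state("run.state", pending);

    for (int i = 0; i < N_TASKS; ++i)
        done[i] = 1;
    for (int i = 0; i < n; ++i)
        done[pending[i]] = 0;

    printf("%d tasks left to run\n", n);
    /* ... run tasks, setting done[id] = 1 as they finish ... */
    save_state("run.state", done);
    return 0;
}
```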

Of course, this is not easy to do. How can this be applied to an off-the-shelf application? Well, in this case, you may create a bogus application on the master queue that will at least allow other applications to be allocated resources beside it.
You do not have to implement this on top of MPI. It can be really hard to do (handling data movement between processors, changing the decomposition, …), and you may prefer another solution. In my case, I have thousands of different tasks that can each run on very few cores, so this is my elementary unit. I don’t need the tasks to communicate with each other, so each time I create brand-new independent jobs, and I can also tell the scheduler to kill jobs that are not responding before the next allocation phase.
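Here is a rough sketch of that task-farm approach, assuming one result file per finished task; bsub is LSF’s submission command, but the queue name, run_task, and the file layout are all hypothetical:

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

enum { N_TASKS = 1000 };

/* Submit one independent job for a task ('bsub' is LSF's submit
 * command; substitute your scheduler's). */
static void submit_task(int id)
{
    char cmd[256];
    snprintf(cmd, sizeof cmd, "bsub -q tasks ./run_task %d", id);
    system(cmd);
}

/* A task is finished once its job has left a result file behind. */
static int task_has_result(int id)
{
    char path[256];
    snprintf(path, sizeof path, "results/task_%d.done", id);
    FILE *f = fopen(path, "r");
    if (f) {
        fclose(f);
        return 1;
    }
    return 0;
}

int main(void)
{
    static int submitted[N_TASKS];

    for (;;) {
        int pending = 0;
        for (int id = 0; id < N_TASKS; ++id) {
            if (task_has_result(id))
                continue;
            pending++;
            if (!submitted[id]) {
                submit_task(id);
                submitted[id] = 1;
            }
        }
        if (!pending)
            break;
        /* Between passes, a real implementation would ask the
         * scheduler which jobs were killed and clear submitted[]
         * for them, so they get resubmitted. */
        sleep(60);
    }
    return 0;
}
```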
Conclusion
To finish, I’ll say that I know LSF allows plugins that help dispatch jobs on specific hosts of your cluster (to get the best communication locality). There seems to be a way of implementing the needs gathering and the resource assignment, but the documentation is not clear (at all); a specific daemon may be needed. I don’t know whether other batch schedulers allow plugins to modify their behavior; if you know of any, and their API, please do tell 😉