Some months ago I had this discussion with one of my Ph.D. advisors about correctly using the computers we had at the time (dual cores), and I had almost the same discussion in my new job here: applied maths graduate students (finite differences, signal processing, …) are not taught how to use current computers, so how could they develop an HPC program correctly?
I think it goes even further than that, and that will be part of this post. What I see is that trainees and newly-hired people (to some extent myself included) lack a lot of basic Computer Science knowledge, and even IT knowledge.
A crucial issue is knowledge of computer architecture:
- What endianness do my computers use? A lot of scientific code was written for big-endian machines, whereas current mainstream computers are little endian (save for the Power architecture, and thus the Cell, IBM clusters, …). Although this is taught in the first programming courses, students tend to forget about it.
- How should I loop over data? If you loop over non-contiguous data blocks, the CPU cache is likely too small to hold all your data. This leads to cache misses, and those can cost a lot.
- More generally, how much memory can I use, and how fast is it (bandwidth and latency)? On current computers you have access to several GB of RAM and a few MB of cache. On GPUs (which are more and more used for HPC), you have hundreds of MB, but access to the graphics card's global RAM is costly (the same applies to the Cell).
At the very least, students must learn that they will have to think about these issues. I don't know the memory sizes of every processor by heart (I know the rough orders of magnitude, and if I need more details, I search the Internet for the answer).
Then there is the parallel part. Even for scientific research on small datasets, the advent of multicores is a challenge for students (and for a lot of teachers as well). We can't expect them to know how to deal with multithreading or multiprocessing if they don't even know the difference between an Intel chip and an IBM one.
In medical imaging, we could simply split our workflow into tasks (we had several registrations to do, so we could split the work and run one registration per core), but even that was not easy. Some algorithms were not thread-safe (who said global variables should be avoided?), so we had to launch several separate processes instead. But when it came to parallelizing the registration itself, because it was slow and because the problem size was getting too big for a 32-bit application (yes, we were still in 32 bits because of legacy GTK1 applications), nobody could do it: nobody had the time and the knowledge to use all the CPU power.
In bigger applications, like seismic ones, it's even more obvious. Copy-and-paste can become a sport if someone once parallelized their program efficiently. All things considered, that is not a problem in itself, but if the paste is not done smartly and the developer did not learn from analyzing the first program, it's useless: he didn't learn anything, AND the pasted program will not be optimal and may even be buggy.
Some additional thoughts
I found a post on Intel's Community page about questions that will be discussed at SuperComputing '08. It shows that there is no clear answer to this problem (at the time of writing). We should teach parallelism, HPC, … to students. But are the tools ready? Is every student even able to cope with parallelism?