Due to the end of the free lunch, manufacturers started to provide different processing units and developers started to go parallel. It’s kind of back to the future, as accelerators existed before today (the x87 FPU started as a coprocessor, for instance). When those accelerators were integrated into the CPU, their instruction sets were integrated as well.
Today’s accelerators are not there yet. The tools (code translators) are not ready, and usual programming practices may not be adequate. The whole ecosystem will evolve, and accelerators will change (GPUs are the main trend, but they will look different in a few years), so what you do today needs to be shaped with these changes in mind. How can this be done? Is it even possible?
Available code translators
Code translators are the easiest path to a solution. I know of two.
The first is the PGI compiler. It only supports CUDA, and only the Fortran and C99 languages. I haven’t used it yet, but I plan to test it in the near future. It is based on pragmas: the compiler generates the CUDA microcode.
The second solution is HMPP. It supports more than just CUDA (also CAL/IL and OpenCL) and more than Fortran/C (now Java as well). Like the PGI compiler, it is based on pragmas, and an excellent feature is that it detects the available accelerators and launches the appropriate kernel (if you authorized it), falling back to the original code otherwise. You can also replace the generated code with your own (tuning the kernel by hand, for instance, may buy you an additional 2x factor). Unfortunately, it is not possible to call functions inside the parallelized kernels, which means that only simple or badly written kernels (too many lines, duplicated code) can be generated. I think the same holds for the PGI compiler.
It seems that code translators still need work:
- only a few accelerators are supported (CUDA, and sometimes CAL/IL or OpenCL),
- almost no languages (Fortran/C/Java; many virtual machines should be able to use accelerators natively, without developers needing specific tools),
- only one function can be parallelized at a time.
The last point is currently the biggest issue. You need to cut your function into pieces to keep the code clean and to preserve its portability and ability to evolve.
This is why one still needs to program a lot for these accelerators, and thus we need to adapt our programming practices and develop in the accelerators’ native languages (even though we know they may disappear in a few years).
Developing your own “tool chain” for accelerators
For accelerators, there are a lot of things that need to be done every time: copying data in, computing, and copying the results back. These are the steps that code translators automate; using tools to automate repetitive work is, after all, common practice. The issue is that complex kernels are not supported by those translators. So what can we do?
Generating the functions that copy the data you need is in fact a very common use of metaprogramming. Coding the kernel itself on an accelerator is not that difficult: the manufacturers provide the necessary compilers (that’s what nVidia does, and the success of its tool chain cannot be denied), and this is really the cornerstone. One has to write more code, and some parts are less portable (because they are written in one of the accelerator’s languages), but in the end, with metaprogramming, the code can be better tuned, enhanced and read. This is how you get leverage from the accelerators.
Why do we care about developing for accelerators? We know that they will eventually go away. But until they do, they are the only way to speed up our software. Code translators are the best tools for developing in a portable way, but they need time to support more accelerators, languages and programming styles. When CPUs are on a par with accelerators, their progress will help compilers target them correctly. It’s just a matter of time.
Meanwhile, metaprogramming is the next best solution for automating the processes that code translators cannot support yet.