Installation is a bit of a pain, but it is simplified somewhat because all external dependencies are available in a single repository, except for PAPI, which you have to install by hand.
What is also handy is that all the GUIs are available on all platforms and can be used right away. The not-so-handy part is that HPC Toolkit can be used on different platforms, but PAPI cannot. And as I like PAPI…
Usage is really simple. For multithreaded applications, the results of all threads, and even of MPI processes, are collected separately, and the viewer shows them with a specific [thread, process] index. This is really neat.
But before you can do that, there are a couple of commands to execute. The first one is hpcstruct, which just creates an XML file listing all the known functions, named binary.hpcstruct:
hpcstruct binary
After this, the profiling can be done:
hpcrun -t -e PAPI_TOT_CYC@15000000 -e PAPI_L1_TCM@400000 -e PAPI_L2_TCM@400000 -e WALLCLOCK@5000 -o outputfolder binary arguments
Here there are 3 PAPI counters and one general counter (WALLCLOCK is relevant because current processors have a variable frequency, meaning that the total number of cycles is not enough to infer the time actually spent). The number after the @ is the sampling period. I haven’t said what kind of profiling this is, but it’s now clear: it’s sampling. Sampling means that from time to time, the application is stopped and the profiler looks at its state. If the sampling period is too short (that is, the sampling frequency is too high), the profile will be biased and will show profiling errors, or profiling can even break if the period is so small that the next sample hits while the previous one is still being reported…
A list of available counters can be obtained with hpcrun -L, or with papi_avail -a for the PAPI counters.
Once this is done, the profile has to be converted into a “database”:
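To get a feeling for what these periods mean, here is a small back-of-the-envelope sketch. It rests on two assumptions that are mine and not part of the command above: a core running at a constant 3 GHz (which, as just said, real processors don't do), and a WALLCLOCK period expressed in microseconds.

```python
# Rough arithmetic on the sampling periods passed to hpcrun.
# ASSUMPTION: a constant 3 GHz clock (hypothetical, real cores vary).
# ASSUMPTION: the WALLCLOCK period is given in microseconds.
ASSUMED_CLOCK_HZ = 3.0e9
MICROSECONDS_PER_SECOND = 1_000_000


def samples_per_second(events_per_second, period):
    """One sample is taken every `period` events, so the rate is events / period."""
    return events_per_second / period


# PAPI_TOT_CYC@15000000: one sample every 15 million cycles.
cycle_rate = samples_per_second(ASSUMED_CLOCK_HZ, 15_000_000)

# WALLCLOCK@5000: one sample every 5000 microseconds.
wall_rate = samples_per_second(MICROSECONDS_PER_SECOND, 5_000)

print(f"PAPI_TOT_CYC: about {cycle_rate:.0f} samples per second")
print(f"WALLCLOCK:    about {wall_rate:.0f} samples per second")
```

Under these assumptions both counters interrupt the application a few hundred times per second. Dividing the period by 100 would multiply the number of interruptions by 100, which is where the bias mentioned above starts to creep in.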
hpcprof outputfolder -S binary.hpcstruct -I include_folder
The include folder is relevant if you want to browse through the source files while analyzing your profile. There are two applications that can be used to review profiles, both available on all platforms: hpcviewer and hpctraceviewer.
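As there are three commands to chain each time, a small helper can assemble them. This is only a sketch: build_commands and its parameters are names I made up, the hpcstruct invocation assumes that the tool takes the binary as its only argument and writes binary.hpcstruct in the current folder, and nothing is executed here, the commands are just printed.

```python
import shlex


def build_commands(binary, arguments, events, output_folder, include_folder):
    """Return the hpcstruct, hpcrun and hpcprof command lines as argument lists.

    `events` is a list of (counter name, sampling period) pairs.
    """
    # ASSUMPTION: hpcstruct writes <binary>.hpcstruct in the current folder.
    struct_file = f"{binary}.hpcstruct"
    hpcstruct = ["hpcstruct", binary]

    hpcrun = ["hpcrun", "-t"]
    for name, period in events:
        hpcrun += ["-e", f"{name}@{period}"]
    hpcrun += ["-o", output_folder, binary, *arguments]

    hpcprof = ["hpcprof", output_folder, "-S", struct_file, "-I", include_folder]
    return [hpcstruct, hpcrun, hpcprof]


events = [
    ("PAPI_TOT_CYC", 15_000_000),
    ("PAPI_L1_TCM", 400_000),
    ("PAPI_L2_TCM", 400_000),
    ("WALLCLOCK", 5_000),
]
commands = build_commands("binary", ["arguments"], events, "outputfolder", "include_folder")
for command in commands:
    # shlex.join quotes arguments so each line can be pasted into a shell.
    print(shlex.join(command))
```

Each list could be handed to subprocess.run once the tools are installed; I keep them as lists rather than strings so that arguments containing spaces survive.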
Here is an example of hpcviewer:
There are three ways of displaying results:
- Top-down, starting with the top functions and diving into more specific functions, following the call tree
- Bottom-up, starting with all the leaves of the call tree, a really informative view that captures the biggest offenders in the blink of an eye
- A flat view, file by file, then function by function and finally line by line
There is one main caveat with these trees: if you compiled in optimized mode with GCC, some layers will be missing where GCC inlined functions. So it takes some agility to understand what actually happens. Another solution is to use the Intel compiler, which has an option to properly generate debug info. The case that bugs me the most is lambda functions, as they get displayed as an operator() function! Without the source code, there is no way of knowing where this function actually comes from.
Another thing to remember when checking values is what the third view gives. Each sample is taken at some point in the execution of the program, and the whole sampling period is attributed to the line where the program was stopped. It’s like ergodicity for a random variable: we assume that over enough samples, averaging over the samples gives the same result as averaging over the timings. So the total profiling time has to be large compared to the number of functions profiled and to the sampling period.
Also, when comparing with other tools like Parallel Studio or Valgrind, I miss at least a bar allowing me to compare the cost of different functions (be it for implicit or explicit counters). I think a callgrind-like profile can be created out of the HPC Toolkit profile. For the moment, it’s an XML file, but they plan on moving to a binary format…
I thought I could do something with the CSV export, but it only exports timings for unfolded lines, without line numbers or file names and without any hierarchy. It is doable from the original database, though, as it is stored as XML and the callgrind file format is open.
One key advantage of HPC Toolkit over Callgrind is the way the profile is computed (sampling vs emulation): profiles are faster to create, and they give a good picture of what happens within the processor, which an emulation profile can’t exactly show (it depends on the processor model, which is never exactly the right one). But a sampled profile is not reproducible, as it depends on the CPU load, so more care is required when analyzing these profiles. Also it is only usable on Linux…
It is a great complement to Valgrind/Callgrind, and the fact that it is open source means that there is no excuse for not profiling your application for bottlenecks.
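This attribution rule is easy to play with in a few lines. The sketch below is a toy model, not how hpcrun works internally: the "program" is a made-up timeline of three functions (solve, io, setup) with integer durations, and every sample charges one full period to whichever function is running at that instant.

```python
def true_times(timeline):
    """Sum the real duration spent in each function."""
    totals = {}
    for name, duration in timeline:
        totals[name] = totals.get(name, 0) + duration
    return totals


def sampled_times(timeline, period):
    """Stop every `period` time units and charge the whole period
    to the function that is running at that instant."""
    totals = {}
    next_sample = period
    start = 0
    for name, duration in timeline:
        end = start + duration
        while next_sample <= end:
            totals[name] = totals.get(name, 0) + period
            next_sample += period
        start = end
    return totals


# Hypothetical program: one iteration spends 7 units in solve, 1 in io, 2 in setup.
iteration = [("solve", 7), ("io", 1), ("setup", 2)]

short_run = iteration * 1    # only 3 samples with a period of 3
long_run = iteration * 300   # 1000 samples with the same period

short_estimate = sampled_times(short_run, 3)
long_estimate = sampled_times(long_run, 3)

print("short run:", short_estimate, "vs real", true_times(short_run))
print("long run: ", long_estimate, "vs real", true_times(long_run))
```

In the short run, io is never hit at all, so it does not even appear in the profile. The long run lands on the real timings, although such an exact match is an artifact of these toy numbers; on a real program you only get close. Changing the period to 5 in this model is also instructive: the samples then fall in step with the 10-unit iteration and io is never seen, however long the run is.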
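The output side of such a conversion is the easy part. Here is a minimal sketch that writes per-line costs in the callgrind format, flat, without any call information. The input records (file, function, line, cost) and the to_callgrind name are hypothetical: extracting them from the HPC Toolkit database is the part I have not written, and I make no claim here about how that XML is laid out.

```python
from collections import defaultdict


def to_callgrind(records, event="PAPI_TOT_CYC"):
    """Build a flat callgrind profile from (file, function, line, cost) records.

    Only self costs are written; call edges (cfn= and calls= lines) are left out.
    """
    # Sum the costs of records that land on the same line of the same function.
    by_function = defaultdict(lambda: defaultdict(int))
    for file_name, function, line, cost in records:
        by_function[(file_name, function)][line] += cost

    lines = ["# callgrind format", f"events: {event}", ""]
    for (file_name, function), costs in sorted(by_function.items()):
        lines.append(f"fl={file_name}")
        lines.append(f"fn={function}")
        for line, cost in sorted(costs.items()):
            lines.append(f"{line} {cost}")
        lines.append("")
    return "\n".join(lines)


# Made-up records, only there to show the shape of the output.
records = [
    ("filter.cpp", "process", 42, 1500),
    ("filter.cpp", "process", 42, 500),
    ("filter.cpp", "process", 57, 300),
    ("main.cpp", "main", 12, 100),
]
profile = to_callgrind(records)
print(profile)
```

Saved as a callgrind.out.* file, this should be enough to get the flat view in KCachegrind. The hierarchy would need the cfn= and calls= lines of the format, fed from the call tree stored in the database.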