Recently, I got access to the latest release of Parallel Studio with an updated version of Advisor. Six years after my last review, let’s dive into it again!
First, lots of things have changed since the first release of Parallel Studio. Many of the dedicated tools were merged into bigger ones (Amplifier is now VTune, Composer is now the Debugger…) but kept their nicer GUIs and workflows. They have also evolved quite a lot in a decade, focusing now on how to extract the maximum performance from Intel CPUs through vectorization and threading.
Intel provides a large range of tools, from a compiler to a sampling profiler (based on hardware counters), and from a debugger to tools that analyze program behavior. Of course, all these applications have one goal: selling more chips. As it’s not just about the chip, this is a fair fight: you need to be able to make the most of your hardware.
Of course, I’ll use Audio Toolkit for the demo and the screenshots.
Let’s start at the beginning: set up a new project (or open an old one), which leads you to the following screenshot.
Survey hotspots analysis is basically what you require; you may also want to tick Collect information about FLOPS under Survey trip count analysis if that’s the kind of analysis you are looking for. In future versions, it will be required for the roofline analysis, which is not yet commercially available.
Once the configuration is done, let’s run the 1. Survey target, which leads to the next screenshot.
I suggest saving snapshots (with the camera icon) after each run, as each run will overwrite e000, and bundling at least the source code with them.
Now it is possible to see the results:
I guess it is now time for a quick demo of how we can decrypt such results and improve on them.
The first interesting bit is that it is indeed the IIR filter that takes most of the relevant time. Advisor only works on loops, but as audio processing is all about loops, everything is fine. Each loop has different annotations, and the ones in the IIR filter have the note “Compiler lacks sufficient information to vectorize the loop”. The issue here is that the Visual Studio compiler can’t vectorize it properly, so let’s use the Intel compiler (in the CMake GUI, use -t “Intel C++ Compiler 17.0”).
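To see why the compiler struggles here, consider a minimal sketch of a direct-form IIR filter (this is hypothetical illustrative code, not Audio Toolkit’s actual implementation). The FIR part only reads past inputs, but the recursive part feeds previous *outputs* back in, a loop-carried dependency that blocks straightforward vectorization:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of a Direct Form I IIR filter (a[0] assumed to be 1).
// The FIR part (b coefficients) has no loop-carried dependency, but the
// recursive part (a coefficients) reads out[n - i], i.e. values the same
// loop just produced, so the compiler cannot vectorize it blindly.
std::vector<double> process(const std::vector<double>& in,
                            const std::vector<double>& b,  // FIR coefficients
                            const std::vector<double>& a)  // IIR coefficients
{
  std::vector<double> out(in.size(), 0.);
  for (std::size_t n = 0; n < in.size(); ++n)
  {
    double acc = 0;
    for (std::size_t i = 0; i < b.size() && i <= n; ++i)  // FIR part: vectorizable
      acc += b[i] * in[n - i];
    for (std::size_t i = 1; i < a.size() && i <= n; ++i)  // IIR part: depends on out[n-i]
      acc -= a[i] * out[n - i];
    out[n] = acc;
  }
  return out;
}
```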
I added a pragma in the source code to force the vectorization, and the results are quite interesting. We get a good speed-up compared to the previous version (6.9s to 4.7s), but the numbers are skewed: the order of the filter is odd, so there is an even number of coefficients for this loop (the FIR part of the filter), which works great for SSE2. Only one loop is vectorized here; its icon is orange instead of blue.
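A pragma of this kind asserts to the compiler that the iterations are safe to vectorize, overriding its conservative dependence analysis. Here is a hedged sketch using the portable `#pragma omp simd` form (the Intel compiler also accepts its own `#pragma ivdep`); the function names and the history-padding convention are illustrative assumptions, not the library’s actual code:

```cpp
#include <cstddef>

// Hypothetical sketch of the FIR part of the filter. The inner loop runs
// over the coefficients; the pragma tells the compiler the reduction is
// safe to vectorize. `in` is assumed to point past nb_coeffs-1 samples of
// history, so in[n - i] is always valid.
void fir_part(const float* in, float* out, const float* coeffs,
              std::size_t nb_coeffs, std::size_t nb_samples)
{
  for (std::size_t n = 0; n < nb_samples; ++n)
  {
    float acc = 0;
#pragma omp simd reduction(+ : acc)
    for (std::size_t i = 0; i < nb_coeffs; ++i)
      acc += coeffs[i] * in[n - i];
    out[n] = acc;
  }
}
```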
If I push further and ask for AVX instructions, we start seeing indications that the loop may be inefficient. In the following screenshot, I reordered the FIR loop so that we vectorize over the number of samples being processed rather than the number of coefficients (there are usually only a handful of coefficients but up to hundreds of samples, so far more opportunities for vectorization). That loop is therefore not marked as inefficient. But the second one (the IIR part) is, as we can’t reorder that loop straight away.
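The loop interchange described above can be sketched as follows (again a hypothetical illustration, not Audio Toolkit’s code): the inner loop now runs over the samples, giving the vectorizer a long trip count instead of a handful of iterations:

```cpp
#include <cstddef>

// Hypothetical sketch of the reordered FIR loop: the inner loop runs over
// the (many) samples rather than the (few) coefficients, so the vectorized
// body processes long, contiguous runs. `in` is assumed to point past
// nb_coeffs-1 samples of history.
void fir_reordered(const float* in, float* out, const float* coeffs,
                   std::size_t nb_coeffs, std::size_t nb_samples)
{
  for (std::size_t n = 0; n < nb_samples; ++n)
    out[n] = 0;
  for (std::size_t i = 0; i < nb_coeffs; ++i)
  {
    const float c = coeffs[i];
    for (std::size_t n = 0; n < nb_samples; ++n)  // long trip count: good vector body
      out[n] += c * in[n - i];
  }
}
```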
Here, we see that Advisor tags all the calls to the loop as Remainder (or Vectorized Remainder), which is the part where the vectorized loop finishes (the sequence is Peel, before the data is aligned; then Body, when the data is aligned and the full content of the vector register is used; then Remainder, when the data is still aligned but only the first parts of the vector registers can be used). And the efficiency of this loop is poor, only 9%, compared to the 76% of the reordered loop.
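The Peel/Body/Remainder split can be made concrete with a little arithmetic. A conceptual sketch (this mirrors the general compiler strategy, not Advisor’s exact accounting): for a loop of `n` iterations, vector width `w`, starting `misalign` elements past an aligned boundary, the three phases break down as:

```cpp
#include <algorithm>
#include <cstddef>
#include <tuple>

// Conceptual sketch of how a compiler decomposes a vectorized loop:
// - Peel: scalar iterations until the data pointer reaches an aligned boundary
// - Body: full-width vector iterations on aligned data
// - Remainder: the leftover tail, where only part of the vector register is used
std::tuple<std::size_t, std::size_t, std::size_t>
loop_split(std::size_t n, std::size_t w, std::size_t misalign)
{
  const std::size_t peel = misalign ? std::min(n, w - misalign) : 0;
  const std::size_t body = ((n - peel) / w) * w;
  const std::size_t remainder = n - peel - body;
  return {peel, body, remainder};
}
```

With very short trip counts, almost every iteration falls into Peel or Remainder, which is exactly why a loop over a handful of coefficients shows such poor vectorization efficiency.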
This was a small tutorial on Advisor. I also added alignment to the arrays in a filter so that the Peel phase would be reduced, along with other optimizations. I didn’t cover the rest of the analytics Advisor provides, but you get the idea, and part of the fun of these tools is exploring them.
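The alignment trick mentioned above can be sketched like this (a hypothetical helper, assuming C++17 and a platform that provides `std::aligned_alloc`; it is not what Audio Toolkit actually uses internally):

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical sketch: allocate a sample buffer on a 32-byte boundary (the
// AVX register width) so vectorized loops start aligned and the Peel phase
// disappears. std::aligned_alloc requires the total size to be a multiple
// of the alignment, so nb_samples * sizeof(float) must be a multiple of 32.
float* allocate_aligned_samples(std::size_t nb_samples)
{
  return static_cast<float*>(std::aligned_alloc(32, nb_samples * sizeof(float)));
}
```

The buffer is released with `std::free`. Note that `std::aligned_alloc` is missing on MSVC, where `_aligned_malloc`/`_aligned_free` play the same role.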
One final note: Advisor doesn’t like huge applications; it thrives on small ones (with a small number of loops), so try to extract your kernels and feed them representative data.