Audio Toolkit: Zero Latency Efficient Convolution

Convolution is an algorithm that is often used for reverberations. If the equation is easy to understand and to implement, the implementation is costly. The other way of doing it is to use Fast Fourier Transform (FFT), but the direct/crude implementation requires latency. If it is possible to optimize the basic convolution code, it is sometimes more interesting to use a different algorithm, as it is the case here.

Basic theory

Let’s start with the convolution equation, with x[n] the input signal, y[n] the output signal and i[n] the impulse we want to convolve the input signal with:

Convolution equation
Convolution equation

The cost of the operation is the size of the impulse, O(K), which can be quite high. For N output elements, the cost is thus O(KN).

The other way of doing this is in the frequency domain, where the convolution will be a simple multiplication. The issue is that your require from the start N elements. The transform in the frequency domain is in O(M), with M the number of elements in the FFT. As we are talking about convolution, M=N+K-1. So as we have more output elements after the convolution, we will need to add them to the following convolution.

The issue is the selection of the size N. With a big N, we get efficiency, but we also require N elements before being able to output any signal at all. This is worrying, as there is no way of going around this limitation.

Rewriting the equation

Ok, now let’s rewrite the equation as a double sum. Let’s split the impulse in pieces with K elements (let’s say 64 elements, and the original length is a multiple of 64) with S pieces:

Split convolution equation
Split convolution equation

What is interesting is we can also say that the x[k] in the equation can also be regrouped by K elements. This means that we have S convolutions of size K elements. There are several advantages to think about these input samples as pieces. The original algorithm had to have a memory of S*K elements. Either we use a circular buffer, or we have to shift the entire signal. Both solutions have their drawbacks. By regrouping them in pieces, there is no buffer management and no shifting the signal (or more precisely far less).

Unfortunately, we still have latency. We need K elements if we use FFT. Another approach is to process the first convolution in the usual way and the rest with FFTs. Each time we processed K samples, we create a new chunk, transform it in the frequency domain and push it in the list of FFT chunks. Then, we can directly multiply the partial impulses (transformed in the frequency domain) by their respective input chunks, sum them together and then transform the FFT back in the time domain. This means that after the transform we get 2*K elements, the first K elements will be added to the next K elements after the discreet convolution. The last K elements will have to be put aside and added to the first K elements of the next convolution.

Implementation in Audio Toolkit

This is a pretty basic approach to partitioned convolution. It’s just a rewrite of the convolution equation, and no fancy science behind.

The reference implementation will be from the FIR filter (just reverse the impulse). One of the nice features of ATK is that you can make sure that you always get K input elements available in the buffer. Even if you are asked to process 1 element, the framework has a way of always keeping the last K values. This way, there is no additional need for bookkeeping. Once we processed K elements, we can directly create a new chunk. This is done through the following code in the main process method:

    int processed_size = 0;
    do
    {
      // We can only process split_size elements at a time, but if we already have some elements in the buffer,
      // we need to take them into account.
      int64_t size_to_process = std::min(split_size - split_position, size - processed_size);

      // Process first part of the impulse
      process_impulse_beginning(processed_size, size_to_process);

      split_position += size_to_process;
      processed_size += size_to_process;
      if(split_position == split_size)
      {
        process_new_chunk(processed_size);
        split_position = 0;
      }
    }while(processed_size != size);

process_new_chunk() triggers the copy of the last K input samples and computes a FFT on it. Once this is finished, the FFT result (2*K elements) is stored, and the last K elements of the FFT result are added to the buffer. This buffer part will be added to the convolution in process_impulse_beginning.

Some profiling

I always like to profile my code, so I used a small 10000 samples impulse (a ramp) and convoluted it with a triangle signal and compared the performance to the reference algorithm (the FIR filter). With pieces of 64 elements, the new implementation is 10 times faster, and it improves more if the pieces are bigger or the impulse longer:

First, a simple block convolution already improves timings by a factor of 3. Then, using FFTs to compute all the convolutions expect for the first, we can go further. Let’s see a profile of my original implementation:

Original profile with FFT convolution
Original profile with FFT convolution

Now, all the complex and vector calls inside the two convolution main methods can be removed to streamline the code:

Final profile after optimizations
Final profile after optimizations

Besides the optimization of the FFT code, the classic IIR code was also optimized somewhat, leading to a 15% performance increase. The conversion code was also enhanced leading to a gain for all plugins of more than 50% for the input/output handling.

Conclusion

There is still room for improvement, as there are several copies that are still required, but a basic implementation is actually easy to achieve. There is a compromise to be achieved between the computation of the convolution through FFT, dominated by the complex multiplication and accumulation before the inverse FFT, and the convolution through a FIR filter of the first block. If the impulse is long (several seconds), it seems that a size of 256 to 512 elements is the best compromise whereas for shorter impulse (1 second), a block size of 64 elements is the best.

Of course, by improving the complex multiplication and addition (SIMD? FMA?), the compromise would shift towards smaller block sizes.

I didn’t put any big equations on how the block convolution is achieved. This would almost require a paper or a book chapter…

The full implementation is available on Github.

Buy Me a Coffee!
Other Amount:
Your Email Address:

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.