Wednesday, September 14, 2016

Benchmarking - FFT Speed

My goal is to find a good microcontroller board for doing audio processing.  Speed is a very important concern so, in my last post, I looked at the speeds for different boards when doing FIR filters.  While time-domain FIR filters are an important audio processing task, I am also curious how suitable these boards are for frequency-domain processing.  In other words, I need to know how fast they can do FFTs.  Let's find out!


Why FFT?  An FFT is a "Fast Fourier Transform", which is not a helpful name if you're not already familiar with the idea.  FFTs are most often used when one wants to look at the frequency content (the spectrum) of audio data.  Intriguingly, you can also do an inverse FFT (an "IFFT") to convert that frequency spectrum data back into audio data.  By pairing the FFT with the IFFT, therefore, one can now manipulate (or mangle!) audio data in the frequency domain, which can be a more natural and easy way to construct one's audio processing algorithms.  They key to it all is the FFT (and its computationally similar IFFT).

Microcontroller Boards:  To understand which hardware might be capable of this frequency domain approach to audio processing, I'm evaluating a range of different boards.  In addition to spanning a range of clock speeds (16 MHz - 180 MHz), they also vary in other computationally important ways:  one is 8-bit while the others are 32-bit, some have DSP extensions while others do not, and one has a floating-point unit (FPU) while the others do not.  Though my testing, I'll see what's important.


My Test Software:  For all of the boards except for the FRDM-K66F, I used the Arduino IDE to write my test software.  It's a simple program that does an FFT on dummy data of a given length.  The program uses Arduino's "micros()" command to measure the time to complete a fixed number of FFTs.  Easy.  For the FRDM-K66F board, which cannot be programmed from the Arduino IDE, I had to use NXP's IDE (Kinetis Design Studio).  The FFT functions were identical, however, regardless of which IDE I used.  All of my software is available on my GitHub (for Arduino, for Kinetis).

KissFFT Function:  The only difficult part of the software is the FFT function itself.  Since I wanted to test across a variety of hardware, I wanted to start with an FFT routine that was written in generic C.  From the FFTW website, I found an interesting comparison of different FFT routines.  From their list, one of the most generic routines appeared to be the "KissFFT".  After downloading it, I refactored the KissFFT code to enable different data types (Int16 vs Int32 vs Float32) and to remove the dynamic allocation of memroy via malloc().  With these changes, it was much more suitable for use on a microcontroller.

Raw Results:  The raw results of my speed tests are here, from which I made the summary table shown below.  The table shows the number of FFTs per second that each board could complete.  I performed the tests for different data types (Int16, Int32, Float32) and for different FFT lengths (N=32 to N=512).  This table is too dense to read, so let's skip ahead.


Speed vs FFT Length:  First, let's look at the overall flow of the data.  The plot below shows the FFT speed for different FFT lengths.  For FFTs using more data points, I would expect slower performance.  This graph confirms that expectation.  Good.  This graph also shows that the relative ranking of the different boards is the same, regardless of the length of the FFT.  This allows me to greatly simplify the rest of the plots.


Speed for Integer Data:  As the length of the FFT doesn't matter for the relative rankings, I chose to focus on an FFT size of 128.  I chose this value because, when operating on audio data sampled at 44100 Hz, N=128 yields a frequency resolution of 344 Hz, which is a reasonably useful value.  The graph below compares the speeds of the different boards for doing 128-point FFT on Int32 data.  The Arduino Uno didn't have enough memory to complete this test, so it is excluded.  Otherwise, I see that the Arduino M0 is the slowest and that the FRDM-K66F is, by far, the fastest.  The only surprise in this data is the slowness of the Arduino M0.  It is much slower compared to the Teensy or Maple than is expected based on their relative clock speeds.  My data shows that Teensy can do ~4x the number of FFTs even though its clock speed is only ~2x higher.  Clearly the M0 is not optimized for these kinds of calculations.


Can They Do Floating Point?  For audio processing projects, I hope to do all of processing using floating point data types (Float32).  Being able to utilize floating point math (Float32) instead of fixed point math (Int16 and Int32) makes the algorithms much easier to design, debug, and optimize.  The difficulty is that microcontrollers tend to be very slow with floating point calculations.  So, let's do some tests with Floats and see if any of my boards are fast enough.

FFT Speed for Float32:  The graph below shows the FFT speeds that I measured using Float32 audio data.  As can be seen by the red bars, the M0, Maple, and Teensy are very slow on Floats.  They can only do 1/3rd as many Float32 FFTs and they can Int32 FFTs.  That is a major speed penalty.  The major exception to this trend is the FRDM-K66F.  It is actually faster using Float32 than Int32.  Presumably, this is due to the float-point unit (FPU) included in the K66F chip.  Even with the FPU, I did not expect it to be faster on Floats than Ints.  Very surprising.


DSP Acceleration:  Because an FFT is such a common digital signal processing (DSP) task, some processors include internal features to accelerate this kind of math.  The Teensy 3.2 and the FRDM-K66F are both based on the Arm Cortex M4 processor core.  The Cortex M4 includes DSP acceleration.  I learned that I can invoke the ARM's DSP accelerators by calling the FFT functions from ARM's "CMSIS" library.  It took me some effort to figure it out, but I eventually got it to work.  And, boy, did it work well...

FFT Speed Using CMSIS:  My results using the CMSIS library are shown below.  Using the CMSIS routines (and the underlying hardware acceleration) really speeds up the FFTs.


The effect of the CMSIS / DSP accelerators is so dramatic that I quantified the improvement in the table below.  The speed improvement is different for the different data types.  Using Int16 data, the CMSIS routines are 4-5x faster than the Generic C "KissFFT" routines.  Wow.  For the In32 data, CMSIS is 3x faster and for Float32 data, CMSIS is 2x faster.  This extra speed is definitely a good reward for the effort spent to figure out how to use the CMSIS library.


How Fast is Fast Enough?  I want to know which boards are fast enough to enable frequency-domain processing of audio signals.  As I discussed earlier, frequency-domain processing requires that I perform an FFT to get into the frequency domain and then an IFFT (which takes the same amount of time as an FFT) to get back out of the frequency domain.  Given my measured FFT speed values, I can estimate the maximum audio sample rate that each board can sustain:

max_sample_rate_Hz = (FFTs_per_second * N_FFT / 2) * (1 - overlap)

In this equation, the "2" accounts for the need to do both an FFT and an IFFT.  The "overlap" term accounts for the fact that most people do frequency-domain processing using blocks of audio samples that overlap by 50% (0.5) in order to smooth out any artifacts at the ends of the audio blocks.

Maximum Sample Rate for Frequency-Domain Processing:  Using this equation, assuming an N=128, and assuming an overlap of 0.5, the table below shows the maximum sample rate that each board can support for frequency-domain processing.  I've highlighted in green those boards that can sustain this kind of processing at sample rates appropriate for audio (ie, greater than 44100 Hz).


Conclusion:  If my goal is to do frequency-domain processing using Float32 data, only the FRDM-K66F is fast enough.  That is my primary conclusion.  My secondary conclusions are that, if I use Int32 data, the CMSIS acceleration means that the Teensy 3.2 is a viable option.  Furthermore, if I can tolerate Int16 data, I can choose from the Maple, Teensy, or K66.  But I don't want to do that.  I want to use Floats.  So, the K66F is for me.

Looking Forward:  While the FRDM-K66F board is very powerful for doing FFT operations, it is difficult for me to program.  I prefer the simplicity of the Arduino IDE, yet the FRDM-K66F is not supported by Ardiuno.  Looking forward however, the folks who do Teensy are about to release the "Teensy 3.6", which uses the same (or similar) processor as is used in the FRDM-K66F.  Since Teensy is programmable from the Arduino IDE, I am hoping that we'll soon have the power of the K66F combined with the ease-of-use of the Teensy.  That will be a truly winning combination for open source audio processing.

3 comments:

  1. I have found your page to be interesting and informative. I happened upon it while attempting to determine if the new Teensy 3.6 would have sufficient power to do realtime convolution of audio signals, specifically for guitar speaker cabinet emulation. Is it fair to assume, from your article, that you're doing something similar? Can you point me to any articles to learn more? Thanks!

    ReplyDelete
  2. I don't recall if I turned on notification- so please reply to this, if you reply. Thanks!

    ReplyDelete
    Replies
    1. Convolution can be done either in the frequency domain via FFT-IFFT (as discussed in the blog post above) or in the time-domain via FIR. I did a post earlier regarding FIR speeds here:

      http://openaudio.blogspot.com/2016/09/benchmarking-fir-filtering.html

      The Teensy 3.6 will give a speed very similar to the FRDM-K66F board, whose results are shown in the post above.

      Looking at the last table in the link above, it says that you can do a fairly high-resolution FIR filter (resolution of ~250 Hz) at a sample rate of over 58 kHz. Since the Teensy Audio Library runs at a sample rate of 44 kHz, you should be able be all set.

      The hard part is turning your desired convolution into either an FIR or FFT operation within the context of the Teensy Audio Library (or within whatever Teensy framework that you choose to use).

      Chip

      Delete