Wednesday, September 14, 2016

Benchmarking - FFT Speed

My goal is to find a good microcontroller board for doing audio processing.  Speed is a very important concern so, in my last post, I looked at the speeds for different boards when doing FIR filters.  While time-domain FIR filters are an important audio processing task, I am also curious how suitable these boards are for frequency-domain processing.  In other words, I need to know how fast they can do FFTs.  Let's find out!

Why FFT?  An FFT is a "Fast Fourier Transform", which is not a helpful name if you're not already familiar with the idea.  FFTs are most often used when one wants to look at the frequency content (the spectrum) of audio data.  Intriguingly, you can also do an inverse FFT (an "IFFT") to convert that frequency spectrum data back into audio data.  By pairing the FFT with the IFFT, therefore, one can now manipulate (or mangle!) audio data in the frequency domain, which can be a more natural and easy way to construct one's audio processing algorithms.  They key to it all is the FFT (and its computationally similar IFFT).

Microcontroller Boards:  To understand which hardware might be capable of this frequency domain approach to audio processing, I'm evaluating a range of different boards.  In addition to spanning a range of clock speeds (16 MHz - 180 MHz), they also vary in other computationally important ways:  one is 8-bit while the others are 32-bit, some have DSP extensions while others do not, and one has a floating-point unit (FPU) while the others do not.  Though my testing, I'll see what's important.

My Test Software:  For all of the boards except for the FRDM-K66F, I used the Arduino IDE to write my test software.  It's a simple program that does an FFT on dummy data of a given length.  The program uses Arduino's "micros()" command to measure the time to complete a fixed number of FFTs.  Easy.  For the FRDM-K66F board, which cannot be programmed from the Arduino IDE, I had to use NXP's IDE (Kinetis Design Studio).  The FFT functions were identical, however, regardless of which IDE I used.  All of my software is available on my GitHub (for Arduino, for Kinetis).

KissFFT Function:  The only difficult part of the software is the FFT function itself.  Since I wanted to test across a variety of hardware, I wanted to start with an FFT routine that was written in generic C.  From the FFTW website, I found an interesting comparison of different FFT routines.  From their list, one of the most generic routines appeared to be the "KissFFT".  After downloading it, I refactored the KissFFT code to enable different data types (Int16 vs Int32 vs Float32) and to remove the dynamic allocation of memroy via malloc().  With these changes, it was much more suitable for use on a microcontroller.

Raw Results:  The raw results of my speed tests are here, from which I made the summary table shown below.  The table shows the number of FFTs per second that each board could complete.  I performed the tests for different data types (Int16, Int32, Float32) and for different FFT lengths (N=32 to N=512).  This table is too dense to read, so let's skip ahead.

Speed vs FFT Length:  First, let's look at the overall flow of the data.  The plot below shows the FFT speed for different FFT lengths.  For FFTs using more data points, I would expect slower performance.  This graph confirms that expectation.  Good.  This graph also shows that the relative ranking of the different boards is the same, regardless of the length of the FFT.  This allows me to greatly simplify the rest of the plots.

Speed for Integer Data:  As the length of the FFT doesn't matter for the relative rankings, I chose to focus on an FFT size of 128.  I chose this value because, when operating on audio data sampled at 44100 Hz, N=128 yields a frequency resolution of 344 Hz, which is a reasonably useful value.  The graph below compares the speeds of the different boards for doing 128-point FFT on Int32 data.  The Arduino Uno didn't have enough memory to complete this test, so it is excluded.  Otherwise, I see that the Arduino M0 is the slowest and that the FRDM-K66F is, by far, the fastest.  The only surprise in this data is the slowness of the Arduino M0.  It is much slower compared to the Teensy or Maple than is expected based on their relative clock speeds.  My data shows that Teensy can do ~4x the number of FFTs even though its clock speed is only ~2x higher.  Clearly the M0 is not optimized for these kinds of calculations.

Can They Do Floating Point?  For audio processing projects, I hope to do all of processing using floating point data types (Float32).  Being able to utilize floating point math (Float32) instead of fixed point math (Int16 and Int32) makes the algorithms much easier to design, debug, and optimize.  The difficulty is that microcontrollers tend to be very slow with floating point calculations.  So, let's do some tests with Floats and see if any of my boards are fast enough.

FFT Speed for Float32:  The graph below shows the FFT speeds that I measured using Float32 audio data.  As can be seen by the red bars, the M0, Maple, and Teensy are very slow on Floats.  They can only do 1/3rd as many Float32 FFTs and they can Int32 FFTs.  That is a major speed penalty.  The major exception to this trend is the FRDM-K66F.  It is actually faster using Float32 than Int32.  Presumably, this is due to the float-point unit (FPU) included in the K66F chip.  Even with the FPU, I did not expect it to be faster on Floats than Ints.  Very surprising.

DSP Acceleration:  Because an FFT is such a common digital signal processing (DSP) task, some processors include internal features to accelerate this kind of math.  The Teensy 3.2 and the FRDM-K66F are both based on the Arm Cortex M4 processor core.  The Cortex M4 includes DSP acceleration.  I learned that I can invoke the ARM's DSP accelerators by calling the FFT functions from ARM's "CMSIS" library.  It took me some effort to figure it out, but I eventually got it to work.  And, boy, did it work well...

FFT Speed Using CMSIS:  My results using the CMSIS library are shown below.  Using the CMSIS routines (and the underlying hardware acceleration) really speeds up the FFTs.

The effect of the CMSIS / DSP accelerators is so dramatic that I quantified the improvement in the table below.  The speed improvement is different for the different data types.  Using Int16 data, the CMSIS routines are 4-5x faster than the Generic C "KissFFT" routines.  Wow.  For the In32 data, CMSIS is 3x faster and for Float32 data, CMSIS is 2x faster.  This extra speed is definitely a good reward for the effort spent to figure out how to use the CMSIS library.

How Fast is Fast Enough?  I want to know which boards are fast enough to enable frequency-domain processing of audio signals.  As I discussed earlier, frequency-domain processing requires that I perform an FFT to get into the frequency domain and then an IFFT (which takes the same amount of time as an FFT) to get back out of the frequency domain.  Given my measured FFT speed values, I can estimate the maximum audio sample rate that each board can sustain:

max_sample_rate_Hz = (FFTs_per_second * N_FFT / 2) * (1 - overlap)

In this equation, the "2" accounts for the need to do both an FFT and an IFFT.  The "overlap" term accounts for the fact that most people do frequency-domain processing using blocks of audio samples that overlap by 50% (0.5) in order to smooth out any artifacts at the ends of the audio blocks.

Maximum Sample Rate for Frequency-Domain Processing:  Using this equation, assuming an N=128, and assuming an overlap of 0.5, the table below shows the maximum sample rate that each board can support for frequency-domain processing.  I've highlighted in green those boards that can sustain this kind of processing at sample rates appropriate for audio (ie, greater than 44100 Hz).

Conclusion:  If my goal is to do frequency-domain processing using Float32 data, only the FRDM-K66F is fast enough.  That is my primary conclusion.  My secondary conclusions are that, if I use Int32 data, the CMSIS acceleration means that the Teensy 3.2 is a viable option.  Furthermore, if I can tolerate Int16 data, I can choose from the Maple, Teensy, or K66.  But I don't want to do that.  I want to use Floats.  So, the K66F is for me.

Looking Forward:  While the FRDM-K66F board is very powerful for doing FFT operations, it is difficult for me to program.  I prefer the simplicity of the Arduino IDE, yet the FRDM-K66F is not supported by Ardiuno.  Looking forward however, the folks who do Teensy are about to release the "Teensy 3.6", which uses the same (or similar) processor as is used in the FRDM-K66F.  Since Teensy is programmable from the Arduino IDE, I am hoping that we'll soon have the power of the K66F combined with the ease-of-use of the Teensy.  That will be a truly winning combination for open source audio processing.

Tuesday, September 6, 2016

Benchmarking - FIR Filtering

I want to do real-time audio processing.  And, I want to do it using small electronics without having to fight with an operating system like Windows or Linux.  Most of my previous experience with electronics hacking has been with Arduino (or Arduino-like) platforms.  Audio processing, however, is pretty computationally demanding.  I need to find an easy-to-use board that is fast for typical audio tasks.  So, as a first step, I'm going to get a bunch of different boards and see which can do FIR filtering at audio rates.  Let's see which boards are up to the task!

Top: Arduino Uno, Arduino M0, LeafLabs Maple.  Bottom: Teensy 3.2 and FRDM-K66F.

The Competitors:  I chose to test six different boards -- many of which I already had kicking around the house.  The six boards that I tested are summarized in the table below.  As you can see, I tried everything from the lowly Arduino Uno up to the mighty Teensy 3.2 and the even-mightier FRDM-K66F.  While the FRDM-K66F is a bit obscure, I'm using it as a proxy for the up-coming Teensy 3.6, which uses the same K66F chip.  Its fast 180 MHz clock speed and its floating point unit (FPU) should make the FRDM-K66F / Teensy 3.6 great for processing audio.

Why FIR Filters?  If you want to manipulate the frequency content of an audio stream, you need a filter.  Boosting the bass?  Cutting some mids?  Boosting the treble?  Apply the appropriate filter.  There are many kinds of filters, typically divided into either IIR filters or FIR filters.  I'm not yet ready to dive into the differences between IIR vs FIR, but I tend to prefer FIR filters for their linear phase and unconditional stability (a good discussion of FIR filters is here and here).  Regardless, FIR filters are a good, basic audio processing task that make for a widely-applicable benchmark.

Lots of Multiplies and Adds:  The challenge with FIR filters is that they can require a processor to do a lot of computation -- a lot of multiply and addition operations.  The finer the frequency resolution desired, the more multiplies and adds are needed to do the filter.   The resolution of an FIR filter scales with the length of the filter (the "N" of the filter).  As a simple rule-of-thumb, an FIR filter's workload scales as:

N_FIR = sample_rate_Hz / freq_res_Hz;      //approximate
num_multiplies = N_FIR * sample_rate_Hz;   //same for num_adds

For example, if you want a frequency resolution of 250 Hz, and if your sample rate is 44 kHz, then you need an FIR filter length N = (44000/250) =  176.  To actually filter the audio, you need to apply this 176-point filter to every audio sample in your 44 kHz audio stream.  To keep up, your processor will need to do at least (176*44000) = 7.7 million multiplications plus 7.7 million additions per second.  That's a lot of work to do!  Which of my boards are capable of this?

FIR Software:  To test the FIR speed of each board, I used the Arduino IDE and wrote a very naive implementation of an FIR filter (yes, faster results could definitely be achieved, but this test is just trying to get a sense of relative speeds of the platforms).  My code uses the Arduino's "micros()" command to measure the time to repeat the FIR filter numerous times.  For the K66 board, which could not be programmed through the Arduino IDE (until the Teensy 3.6 comes out!), I had to use NXP's "Kinetis Design Studio" to write the software, but the FIR function itself is the same.  All of the code is available in my OpenAudio repository on GitHub.

Results, All Data:  My raw data (here) consist of the time required to perform FIR filters on the different platforms.  To ease the presentation of the data, I invert the values so that it tells me the number of FIR filters that can be completed per second.  In this perspective, a bigger number means it can do more FIRs per second, which is good.  My results are shown in the table below.  It shows the FIR speeds for different filter lengths (16-256) and for two different data types (Int32 and Float32).  Because I find tables difficult to read, let's jump over the table and make some plots that better illustrate the results.

Results, Effect of "N":  Longer filters require more computations, so I would expect longer filters to be slower.  Using data from the big table above, the plot below confirms that expectation.  Also, note that all of the lines show the same slope and that the lines never cross each other.  This means that the relative ranking of the different boards stays the same across all FIR filter lengths, which allows me to greatly simplify the rest of the plots.

Results, Speed of Each Board (Int32):  Since the relative ranking of the boards stays the same throughout, let's illustrate the relative speed of each board by picking just one filter length.  The plot below picks N=128.  It shows the speed when using Int32 data.  On the left side of the plot, note that the Arduino Uno is very slow on Int32 values.  Presumably its 8-bit processor has difficulty with the 32-bit data type.  On the right side of the plot, the fastest board is the K66F, which is 100 times faster than the Uno.  That's a huge difference!

Results, Floating Point:  The below below is the same, but I add in the results for floating point data (Float32).  Writing audio processing algorithms using Float operations is much easier than using Int operations, so these are the results that I'm most interested in.  As can be seen in the plot, these boards are *much* slower on Floats than on Ints.  The major exception to this result is the K66F board, which is basically as fast on Floats as it is on Ints.  That's amazing!  This result clearly reflects the fact that the K66 is the only processor in this comparison which has an FPU.

How Fast is Fast Enough?  Deciding what FIR speed is "fast enough" for audio processing is not a simple question.  One approach is to return back to the example at the top of this post: a hypothetical 176-point FIR filter.  This filter length was chosen because it would yield a frequency resolution of 250 Hz when run at an audio sample rate of 44 kHz.  Which of these boards can support such a long filter at this fast sample rate?  Well, by scaling the N=128 speed values to N=176, the table below shows the sample rate that could be sustained by each board.  As can be seen the Teensy 3.2 is fast enough for audio processing using Ints.  But, if I want to do Floats, only the K66F is fast enough.

Programming the K66F:  Unfortunately, the FRDM-K66F board is not programmable from the Arduino IDE.  This makes it much harder to setup and debug by non-professionals.  This a real hurdle.  Luckily, the upcoming Teensy 3.6 also uses the K66 processor.  Since the Teensy products are all compatible with the Arduino IDE, it means that the power of the K66 will soon be far more accessible.  That's the solution that I'm really looking for.  So, I supported the Teensy 3.6 kickstarter.  I can't wait to get my Teensy 3.6!

Next Steps:  FIR filtering is not the only audio processing task that one might like to do.  An FFT is another important type of operation that is computationally intense.  In my next post, I'll look at the FFT speeds to see which boards are capable of real-time, frequency-domain processing.  Until then, have some happy hacking!

Update: FFT Benchmarking Results are here!