Tuesday, September 6, 2016

Benchmarking - FIR Filtering

I want to do real-time audio processing.  And, I want to do it using small electronics without having to fight with an operating system like Windows or Linux.  Most of my previous experience with electronics hacking has been with Arduino (or Arduino-like) platforms.  Audio processing, however, is pretty computationally demanding.  I need to find an easy-to-use board that is fast for typical audio tasks.  So, as a first step, I'm going to get a bunch of different boards and see which can do FIR filtering at audio rates.  Let's see which boards are up to the task!

Top: Arduino Uno, Arduino M0, LeafLabs Maple.  Bottom: Teensy 3.2 and FRDM-K66F.

The Competitors:  I chose to test six different boards -- many of which I already had kicking around the house.  The six boards that I tested are summarized in the table below.  As you can see, I tried everything from the lowly Arduino Uno up to the mighty Teensy 3.2 and the even-mightier FRDM-K66F.  While the FRDM-K66F is a bit obscure, I'm using it as a proxy for the up-coming Teensy 3.6, which uses the same K66F chip.  Its fast 180 MHz clock speed and its floating point unit (FPU) should make the FRDM-K66F / Teensy 3.6 great for processing audio.


Why FIR Filters?  If you want to manipulate the frequency content of an audio stream, you need a filter.  Boosting the bass?  Cutting some mids?  Boosting the treble?  Apply the appropriate filter.  There are many kinds of filters, typically divided into either IIR filters or FIR filters.  I'm not yet ready to dive into the differences between IIR vs FIR, but I tend to prefer FIR filters for their linear phase and unconditional stability (a good discussion of FIR filters is here and here).  Regardless, FIR filters are a good, basic audio processing task that make for a widely-applicable benchmark.

Lots of Multiplies and Adds:  The challenge with FIR filters is that they can require a processor to do a lot of computation -- a lot of multiply and addition operations.  The finer the frequency resolution desired, the more multiplies and adds are needed to do the filter.   The resolution of an FIR filter scales with the length of the filter (the "N" of the filter).  As a simple rule-of-thumb, an FIR filter's workload scales as:

N_FIR = sample_rate_Hz / freq_res_Hz;      //approximate
num_multiplies = N_FIR * sample_rate_Hz;   //same for num_adds

For example, if you want a frequency resolution of 250 Hz, and if your sample rate is 44 kHz, then you need an FIR filter length N = (44000/250) =  176.  To actually filter the audio, you need to apply this 176-point filter to every audio sample in your 44 kHz audio stream.  To keep up, your processor will need to do at least (176*44000) = 7.7 million multiplications plus 7.7 million additions per second.  That's a lot of work to do!  Which of my boards are capable of this?

FIR Software:  To test the FIR speed of each board, I used the Arduino IDE and wrote a very naive implementation of an FIR filter (yes, faster results could definitely be achieved, but this test is just trying to get a sense of relative speeds of the platforms).  My code uses the Arduino's "micros()" command to measure the time to repeat the FIR filter numerous times.  For the K66 board, which could not be programmed through the Arduino IDE (until the Teensy 3.6 comes out!), I had to use NXP's "Kinetis Design Studio" to write the software, but the FIR function itself is the same.  All of the code is available in my OpenAudio repository on GitHub.

Results, All Data:  My raw data (here) consist of the time required to perform FIR filters on the different platforms.  To ease the presentation of the data, I invert the values so that it tells me the number of FIR filters that can be completed per second.  In this perspective, a bigger number means it can do more FIRs per second, which is good.  My results are shown in the table below.  It shows the FIR speeds for different filter lengths (16-256) and for two different data types (Int32 and Float32).  Because I find tables difficult to read, let's jump over the table and make some plots that better illustrate the results.


Results, Effect of "N":  Longer filters require more computations, so I would expect longer filters to be slower.  Using data from the big table above, the plot below confirms that expectation.  Also, note that all of the lines show the same slope and that the lines never cross each other.  This means that the relative ranking of the different boards stays the same across all FIR filter lengths, which allows me to greatly simplify the rest of the plots.


Results, Speed of Each Board (Int32):  Since the relative ranking of the boards stays the same throughout, let's illustrate the relative speed of each board by picking just one filter length.  The plot below picks N=128.  It shows the speed when using Int32 data.  On the left side of the plot, note that the Arduino Uno is very slow on Int32 values.  Presumably its 8-bit processor has difficulty with the 32-bit data type.  On the right side of the plot, the fastest board is the K66F, which is 100 times faster than the Uno.  That's a huge difference!


Results, Floating Point:  The below below is the same, but I add in the results for floating point data (Float32).  Writing audio processing algorithms using Float operations is much easier than using Int operations, so these are the results that I'm most interested in.  As can be seen in the plot, these boards are *much* slower on Floats than on Ints.  The major exception to this result is the K66F board, which is basically as fast on Floats as it is on Ints.  That's amazing!  This result clearly reflects the fact that the K66 is the only processor in this comparison which has an FPU.


How Fast is Fast Enough?  Deciding what FIR speed is "fast enough" for audio processing is not a simple question.  One approach is to return back to the example at the top of this post: a hypothetical 176-point FIR filter.  This filter length was chosen because it would yield a frequency resolution of 250 Hz when run at an audio sample rate of 44 kHz.  Which of these boards can support such a long filter at this fast sample rate?  Well, by scaling the N=128 speed values to N=176, the table below shows the sample rate that could be sustained by each board.  As can be seen the Teensy 3.2 is fast enough for audio processing using Ints.  But, if I want to do Floats, only the K66F is fast enough.


Programming the K66F:  Unfortunately, the FRDM-K66F board is not programmable from the Arduino IDE.  This makes it much harder to setup and debug by non-professionals.  This a real hurdle.  Luckily, the upcoming Teensy 3.6 also uses the K66 processor.  Since the Teensy products are all compatible with the Arduino IDE, it means that the power of the K66 will soon be far more accessible.  That's the solution that I'm really looking for.  So, I supported the Teensy 3.6 kickstarter.  I can't wait to get my Teensy 3.6!

Next Steps:  FIR filtering is not the only audio processing task that one might like to do.  An FFT is another important type of operation that is computationally intense.  In my next post, I'll look at the FFT speeds to see which boards are capable of real-time, frequency-domain processing.  Until then, have some happy hacking!

Update: FFT Benchmarking Results are here!

5 comments:

  1. Hi! nice job!
    I'm planning to use FRDM-K66F board for processing signals from fmcw radar. I expect to adquire analog signals, digitalize them and then, store in the sd card of the board. Do you use the ADC or the line in of the board for the benchmark?. I wonder if I could use teensy 3.6 for this processing (similar to audio signal). Any hints will be great. Thanks!

    ReplyDelete
  2. This comment has been removed by a blog administrator.

    ReplyDelete
  3. This comment has been removed by a blog administrator.

    ReplyDelete
  4. Very interesting results!!
    Do you have any idea if IIR filters will be (al lot) faster?
    Would be very interesting to test these as well.

    ReplyDelete
  5. Why havent you tested with an stm32 M4F device. These devices are very fast and have a single precision FPU...

    ReplyDelete