Wednesday, February 8, 2017

For Speedy Float Math, Specify the Float Version

When programming my audio-processing algorithms on the Teensy 3.6, I sometimes find that the math operations are much slower than I was expecting.  Yes, some things are super-fast: arithmetic, FIR filters, FFT.  But why did my code with the logarithm run so slowly?  Well, it turns out that I was using the wrong function call.  If you use the float-specific version of your function, you get tremendously faster floating-point speeds.  Don't rely on the compiler; be explicit and call it yourself!
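To make that concrete, here's a minimal sketch of the difference (the variable names are just placeholders):

    float x = 10.0f;

    // Generic form: the argument gets promoted to double and the
    // double-precision library routine runs, entirely in software.
    float slow = log(x);

    // Float-specific form: the math stays in single precision, where
    // the Teensy's FPU can do the heavy lifting.  Much faster.
    float fast = logf(x);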
As you can see in the graph above, I measured how long several different math functions take to execute on floating-point data.  The interesting part is that each math function can be called in two different ways: (1) using its generic form, such as sqrt(x), or (2) using its explicitly floating-point form, such as sqrtf(x).  In all cases, the explicitly floating-point form was much faster.
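For those who want to reproduce the measurements, the timing loop was something like this (a simplified sketch, not my exact benchmark code; the real code is linked in the Tech Info note below):

    const int N = 1000;
    float x[N];  // assume this gets filled with test values

    unsigned long t_start = micros();
    for (int i = 0; i < N; i++) {
      x[i] = sqrtf(x[i]);   // swap in sqrt(x[i]) to time the generic form
    }
    unsigned long t_stop = micros();

    // Average microseconds per call.  Writing the result back into x[]
    // keeps the compiler from optimizing the loop away.
    float usec_per_call = (float)(t_stop - t_start) / (float)N;
    Serial.println(usec_per_call);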

Being a Matlab programmer, I generally type out the generic form of each function.  Being naive, I had assumed that the compiler would detect my data type and automatically substitute the exact same floating-point function that I would have called myself.  Apparently, I was wrong.  Way wrong.

The table above shows that the square root function benefits the most when using the explicitly floating-point form; the sqrtf(x) version is over 30 times faster!  The Teensy (well, every ARM Cortex-M4F) has hardware acceleration for computing square roots.  I'm guessing that the sqrt(x) form does not use the acceleration whereas the sqrtf(x) form does.  A 30x increase in speed is huge.

Interestingly, the logarithm and exponential/power functions do not have hardware acceleration in the Teensy.  Yet, when using the explicitly floating-point version, they see a 10x increase in speed.  Stunning.
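For reference, here are the float-specific spellings of the functions I tested (note the "f" suffix on both the function names and the literals; a bare literal like 0.5 is a double and would quietly drag the math back to double precision):

    float x = 2.5f;           // the "f" suffix keeps the literal single-precision

    float a = sqrtf(x);       // instead of sqrt(x)
    float b = logf(x);        // instead of log(x)
    float c = expf(x);        // instead of exp(x)
    float d = powf(x, 0.5f);  // instead of pow(x, 0.5)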

Why are the explicitly floating-point versions so much faster?  I don't know (though see the Follow-Up below).  But I sure as heck am going to make sure that all of my future code uses them.

Tech Info: This data was measured using a Teensy 3.6 running at 180 MHz.  It was programmed from the Arduino IDE (1.6.13) via the Teensyduino (1.35) add-on.  My code is on my GitHub here.

Follow-Up:  Why is the float-specific version ("logf()") so much faster than the generic version ("log()")?  I posted this question to the Teensy Forum.  Their replies (here) were pretty definitive.  Under the hood, the generic version (i.e., "log()") does its calculations in double-precision floating-point math, which is all done in software.  By contrast, the float-specific version ("logf()") uses single-precision floating-point math, which on the Teensy is done with hardware acceleration.  That explains why the float-specific version is so much faster.
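A practical gotcha that follows from this explanation: in C/C++, a bare literal like 2.5 is a double, so it can silently promote your float math back to double even when every variable is a float.  A quick illustration (my own example, not from the forum thread):

    float x = 3.0f;

    // 2.5 is a double literal, so x gets promoted to double, the multiply
    // runs in software double-precision math, and the result is rounded
    // back down to float.
    float slow = x * 2.5;

    // The "f" suffix keeps the whole expression in hardware-accelerated
    // single precision.
    float fast = x * 2.5f;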
