Wednesday, February 15, 2017

Faster Log10 and Pow

After my previous post, which showed a tremendous increase in speed from choosing float-specific math functions (e.g., log10f()) over their generic counterparts (e.g., log10()), I still wasn't satisfied with the speed of my audio processing on the Teensy 3.6.  I had some calculations that went into and out of dB space, which meant lots of log10f(x) and powf(10,x) function calls.  I needed them to be faster.  Here's how I did it...

They Should be Faster:  The figure above shows the number of CPU cycles that I measured for each math function.  The red bars are for the standard math functions.  What caught my eye was that log10f() is slower than logf().  They should differ by only one multiplication (i.e., 1-2 cycles).  Similarly, I was surprised that powf(10.0,x) was so much slower than expf(x).  It seemed like I ought to be able to accelerate these functions.

Faster powf():  I started with powf() since it was most in need of acceleration.  Because I did not need a general powf() that could handle any base, I could optimize specifically for powf(10.0,x).  Using the rules of exponentials and logarithms, I wrote my own macro that expresses powf(10,x) in terms of expf():

//powf(10.0f,x) is exactly expf(logf(10.0f)*x)
#define pow10f(x) expf(2.302585092994046f*(x))

Note that the reformulation above is not a numerical approximation.  If you had all of the digits for that 2.302 constant, this would be an exact substitution for powf(10,x).  Yet, as you'll see in a moment, it is 3 times faster.  Way faster with no loss in accuracy.  Wow!

Faster log10f():  Similarly, for log10f(), I started by reformulating log10() in terms of logf().  While that was effective, it gave only a modest increase in speed.  I wanted it even faster.  Then, in a post on the ARM community forum (here), I found that someone (thanks, Dr. Beckmann!) had reformulated log10f() using log base 2 and then further accelerated it with an approximation for log base 2 that exploits the way single-precision numbers are represented in memory.  It's a pretty neat solution.
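For reference, that first logf-based reformulation (the macro name here is my own) was simply the change-of-base identity:

```c
#include <math.h>

// log10(x) = ln(x)/ln(10) = ln(x) * 0.43429448...f
// Exact up to rounding, but it still pays for a full logf() call.
#define log10f_via_logf(x) (logf(x) * 0.4342944819032518f)
```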

So, using this, my complete substitution for log10f() is shown below.  The log2 approximation function is at the end of this post.

//log10f(x) is exactly log2(x)/log2(10.0f)
#define log10f_fast(x)  (log2f_approx(x)*0.3010299956639812f)

Because this reformulation uses an approximation for log2(), it is not an exact substitution.  But, over the range of input values that I explored, the resulting error seemed to be less than 0.05%, which is good enough for my needs.

Faster by 3x!  Looking at the speed of these two reformulated functions, I saw that my log10f() approximation was 2.8x faster than the standard log10f().  Similarly, my pow10f() function was 3.0x faster than the standard powf(10,x) function call.  That's a pretty nice acceleration!  I'm pleased.

As promised, here's the fast log2 approximation (source):

// This is a fast approximation to log2()
// Y = C[0]*F*F*F + C[1]*F*F + C[2]*F + C[3] + E;
float log2f_approx(float X) {
  float Y, F;
  int E;

  F = frexpf(fabsf(X), &E);
  Y = 1.23149591368684f;
  Y *= F;
  Y += -4.11852516267426f;
  Y *= F;
  Y += 6.02197014179219f;
  Y *= F;
  Y += -3.13396450166353f;
  Y += E;

  return Y;
}

Wednesday, February 8, 2017

For Speedy Float Math, Specify the Float Version

When programming up my audio processing algorithms on the Teensy 3.6, I sometimes find that the math operations are much slower than I was expecting.  Yes, some things are super-fast: arithmetic, FIR filters, FFT.  But why did my code with the logarithm run so slowly?  Well, it turns out that I was using the wrong function call.  If you use the float-specific version of your function, you get tremendously faster floating-point speeds.  Don't rely on the compiler; be explicit and call it yourself!

As you can see in the graph above, I measured how long it took to complete several different math functions when using floating-point data.  The interesting part is that each math function can be called in two different ways: (1) using its generic form such as sqrt(x) or (2) using its explicitly floating-point form such as sqrtf(x).  In all cases, the explicit floating-point form was much faster.

Being a Matlab programmer, my fingers generally type out the generic form of the function.  Being naive, I had assumed that the compiler would detect my data type and automatically substitute the exact same floating-point function that I would have called myself.  Apparently, I was wrong.  Way wrong.

The table above shows that the square root function benefits the most from the explicitly floating-point form.  The explicitly floating-point version is over 30 times faster!  The Teensy (well, every ARM Cortex-M4F) has hardware acceleration for square roots.  I'm guessing the sqrt(x) form does not use the acceleration whereas the sqrtf(x) form does.  30x is a huge increase in speed.

Interestingly, the logarithm and exponential/power functions do not have hardware acceleration in the Teensy.  Yet, when using the explicitly floating-point version, they see a 10x increase in speed.  Stunning.

Why are the explicitly floating-point versions so much faster?  I don't know.  But I sure as heck am going to make sure that all my future code uses them.

Tech Info: This data was measured using a Teensy 3.6 running at 180 MHz.  It was programmed from the Arduino IDE (1.6.13) via the Teensyduino (1.35) add-on.  My code is on my GitHub here.

Follow-Up:  Why is the float-specific version ("logf()") so much faster than the generic version ("log()")?  I posted this to the Teensy Forum.  Their replies (here) were pretty definitive.  Under the hood, the generic version (i.e., "log()") does its calculations as double-precision floating-point math, which is all in software.  By contrast, the float-specific version ("logf()") does single-precision floating-point math, which on the Teensy is done with hardware acceleration.  That explains why the float-specific version is so much faster.
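You can see the double-precision promotion for yourself with sizeof (standard C; the 8 and 4 in the comments are sizeof(double) and sizeof(float) on typical platforms):

```c
#include <math.h>
#include <stdio.h>

void show_promotion(void) {
    float x = 100.0f;
    // Generic log() takes and returns double, so the float argument is
    // promoted and all the work happens in double precision (software).
    printf("sizeof(log(x))  = %zu\n", sizeof(log(x)));   // 8: a double
    // logf() keeps everything in single precision, which the FPU handles.
    printf("sizeof(logf(x)) = %zu\n", sizeof(logf(x)));  // 4: a float
}
```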