After my previous post, which showed a tremendous increase in speed from choosing float-specific math functions (e.g., log10f()) over their generic counterparts (e.g., log10()), I still wasn't satisfied with the speed of my audio processing on the Teensy 3.6. I had some calculations that went into and out of dB space, which means I needed lots of log10f(x) and powf(10,x) function calls. I needed them to be faster. Here's how I did it...
Faster powf(): I started with powf() because it was the one most in need of acceleration. Because I did not need a general powf() that could handle any base, I could optimize specifically for powf(10.0,x). Using the rules of exponentials and logarithms, I wrote my own macro that computes the power-of-ten in terms of exp():
//powf(10.f,x) is exactly exp(log(10.0f)*x)
#define pow10f(x) expf(2.302585092994046f*(x))   //parentheses around x keep the macro safe for expressions
Note that the reformulation above is not a numerical approximation. If you had all of the digits for that 2.302 constant, this would be an exact substitution for pow(10,x). Yet, as you'll see in a moment, it is 3 times faster. Way faster with no loss in accuracy. Wow!
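To show how this gets used in practice, here's a tiny dB-to-linear helper of the kind I use in my audio code. (The db_to_linear() name is just my illustration here, not part of any library; it assumes the pow10f() macro above is already defined and that math.h is included.)

//example use of pow10f(): convert a gain in dB to a linear gain factor
//(db_to_linear() is just an illustrative helper, built on the pow10f() macro above)
float db_to_linear(float gain_dB) {
  return pow10f(gain_dB / 20.0f);   //10^(dB/20)
}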
Faster log10f(): Similarly, for log10f(), I started by reformulating log10() in terms of logf(). While that was effective, it gave only a modest increase in speed. I wanted it even faster. So, in a post on the ARM community forum (here), I found that someone (thanks, Dr. Beckmann!) had reformulated log10f() using log base 2 and then further accelerated it with an approximation for log base 2 that exploits the way single-precision numbers are represented in memory. It's a pretty neat solution.
So, using this, my complete substitution for log10f() is shown below. The log2 approximation function is at the end of this post.
//log10f is exactly log2(x)/log2(10.0f)
#define log10f_fast(x) (log2f_approx(x)*0.3010299956639812f)
Because this reformulation uses an approximation for log2(), it is not an exact substitution. But, over the range of input values that I explored, the resulting error seemed to be less than 0.05%, which is good enough for my needs.
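If you want to check that error number yourself, a sweep like the one below compares the approximation against the standard log10f(). (This is just an illustrative test function; the name check_log10f_fast(), the range swept, and the use of Serial are my choices here, and it assumes log10f_fast() and log2f_approx() are already defined as shown in this post.)

//illustrative accuracy check: sweep a range of inputs and report the worst-case
//relative error of log10f_fast() versus the standard log10f()
void check_log10f_fast(void) {
  float worst_rel_err = 0.0f;
  for (float x = 1.0e-3f; x < 1.0e4f; x *= 1.01f) {
    float exact = log10f(x);
    float approx = log10f_fast(x);
    if (fabsf(exact) > 1.0e-3f) {  //skip x near 1.0, where relative error is ill-defined
      float rel_err = fabsf(approx - exact) / fabsf(exact);
      if (rel_err > worst_rel_err) worst_rel_err = rel_err;
    }
  }
  Serial.print("worst-case relative error (%): ");
  Serial.println(worst_rel_err * 100.0f, 4);
}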
Faster by 3x! In looking at the speed of these two reformulated functions, I saw that my log10f approximation was 2.8x faster than the standard log10f(). Similarly, my pow10f() macro was 3.0x faster than the standard powf(10,x) function call. That's a pretty nice acceleration! I'm pleased.
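For reference, timing numbers like these come from simple micro-benchmarks of this flavor. (This is an illustrative Arduino-style sketch, not the exact code I ran; the loop count and test values are arbitrary, and the volatile accumulator just keeps the compiler from optimizing the loops away.)

//illustrative timing loop: N calls of the standard powf() versus the pow10f() macro
void benchmark_pow10f(void) {
  const int N = 10000;
  volatile float acc = 0.0f;  //volatile so the loops don't get optimized away

  unsigned long t0 = micros();
  for (int i = 0; i < N; i++) acc += powf(10.0f, 0.001f * (float)i);
  unsigned long t_powf = micros() - t0;

  t0 = micros();
  for (int i = 0; i < N; i++) acc += pow10f(0.001f * (float)i);
  unsigned long t_pow10f = micros() - t0;

  Serial.print("powf(10,x): "); Serial.print(t_powf);   Serial.println(" usec");
  Serial.print("pow10f(x):  "); Serial.print(t_pow10f); Serial.println(" usec");
}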
As promised, here's the fast log2 approximation (source):
#include <math.h>   //for frexpf() and fabsf()

// This is a fast approximation to log2()
// Y = C[0]*F*F*F + C[1]*F*F + C[2]*F + C[3] + E;
float log2f_approx(float X) {
  float Y, F;
  int E;

  //split X into mantissa F (in [0.5, 1.0)) and exponent E, so log2(X) = log2(F) + E
  F = frexpf(fabsf(X), &E);

  //evaluate the cubic polynomial in F (Horner's method) to approximate log2(F)
  Y = 1.23149591368684f;
  Y *= F;
  Y += -4.11852516267426f;
  Y *= F;
  Y += 6.02197014179219f;
  Y *= F;
  Y += -3.13396450166353f;

  //add back the exponent
  Y += E;

  return(Y);
}