Fast polynomial evaluation
I guess everybody knows what a polynomial is.
Polynomials are really cool and used all over the place in programming. Evaluating a polynomial is really important, so how should we do it?
First of all lets see how the polynomial function looks like:
$a_n x^n + a_{n-1} x^{n-1} + \dots + a_2 x^2 + a_1 x + a_0$
We can see that a polynomial is basically defined by it’s coefficients $a_i$. So we can actually store all we need for the evaluation by storing the coefficients into an array.
We can use a shorter notation $\sum\limits_{i=1}^n a_i x^i$ that actually implies the first algorithm.
We implement this in Javascript
function polyEvalSimple(x,coeff)
{
var xacc = 1;
var finalValue = 0.0
for(var i=0; i< coeff.length; i++)
{
finalValue = finalValue + xacc * coeff[i];
xacc = xacc * x;
}
return finalValue;
}
and then in Python
def polyeval(x,coeff):
xacc = 1;
finalValue = 0.0
for i in xrange(0,len(coeff)):
finalValue = finalValue + coeff[i] * xacc
xacc = xacc * x
return finalValue
and of course C
double polyEvalSimple(double x, double *coeff, unsigned int numCoeff)
{
double xacc = 1;
double finalValue = 0.0f;
for (unsigned int i = 0; i < numCoeff; i++)
{
finalValue = finalValue + coeff[i] * xacc;
xacc = xacc * x;
}
return finalValue;
}
Ok, now we can do better than this. We just use a little math trickery and split the polynomial into linear parts (monomial basis):
$p(x) = a_0 + x(a_1+x(a_2+ \dots +x(a_{n-1} + a_n x)))$
This is called the Horner method and it’s been around ages (even before Horner of course :) )
Now lets see the implementation:
Javascript
function polyEvalHorner(x,coeff)
{
var finalValue = 0
for(var i=coeff.length -1; i>= 0 ; i--)
{
finalValue = finalValue * x + coeff[i];
}
return finalValue;
}
and then in Python
def polyevalHorner(x,coeff):
finalValue = 0.0
for i in xrange(len(coeff)-1 , -1,-1):
finalValue = finalValue * x+ coeff[i]
return finalValue
and of course C
double polyEvalHorner(double x, double *coeff, unsigned int numCoeff)
{
double finalValue = 0.0f;
for (int i = numCoeff-1; i >= 0 ; i--)
{
finalValue = finalValue* x + coeff[i];
}
return finalValue;
}
As we can see, we basically avoid a multiplication by $x$ each iteration.
Now that’s all fine and dandy but will that actually matter on todays computers?
Let’s do some profiling.
We note the G = $\frac{t_{horner}}{t_{simple}}$ .
Python | Javascript | C | |
---|---|---|---|
G | 0.71 | 0.375 | 1.11 |
I did my tests on my home CPU: i7-2600 at 3.4GHZ (they could vary on your computers so feel free to run them and tell me your results)
We can see that there are some gains in speed when using the Horner method in Python and in Javascript.
We also see that the Horner method is actually worst in C . Why is that? We should be having better times, because we avoid doing a multiplication ! :)
We can also check the assembly code (I took the debug code so we can actually understand what is happening, the debug times are still better for the simple evaluation):
Simple method
finalValue = finalValue + coeff[i] * xacc;
00007FF69D321BCE mov eax,dword ptr [rbp+44h]
00007FF69D321BD1 mov rcx,qword ptr [coeff]
00007FF69D321BD8 movsd xmm0,mmword ptr [rcx+rax*8]
00007FF69D321BDD mulsd xmm0,mmword ptr [xacc]
00007FF69D321BE2 movsd xmm1,mmword ptr [finalValue]
00007FF69D321BE7 addsd xmm1,xmm0
00007FF69D321BEB movaps xmm0,xmm1
00007FF69D321BEE movsd mmword ptr [finalValue],xmm0
xacc = xacc * x;
00007FF69D321BF3 movsd xmm0,mmword ptr [xacc]
00007FF69D321BF8 mulsd xmm0,mmword ptr [x]
00007FF69D321C00 movsd mmword ptr [xacc],xmm0
Horner
finalValue = finalValue* x + coeff[i];
00007FF69D321B10 movsd xmm0,mmword ptr [finalValue]
00007FF69D321B15 mulsd xmm0,mmword ptr [x]
00007FF69D321B1D movsxd rax,dword ptr [rbp+24h]
00007FF69D321B21 mov rcx,qword ptr [coeff]
00007FF69D321B28 addsd xmm0,mmword ptr [rcx+rax*8]
00007FF69D321B2D movsd mmword ptr [finalValue],xmm0
The problem is that the Horner scheme is really bad for the cpu pipeline, because one iteration needs to wait for the other one to finish. Horner is starving the poor pipeline.
In higher level languages (that don’t get converted to machine code) you don’t really feel this, because everything is slower and you are waiting for the language interpreter to do it’s job (which is not starving anytime soon :) ).
If we want to improve the c polynomial evaluation we could actually use a Divede et impera method like Estrin’s scheme.
In practice, if we know the polynomial we sometimes can rewrite it using Estrin’s scheme, without coding the whole algorithm, as it has overhead that is insignificant only when the polynomial is huge.
if you tldr and just want to see the test code click here