Tim Prince
Tim Prince has a B.A. in physics from Harvard and a Ph.D. in mechanical engineering from the University of Cincinnati. He has 25 years of experience in aerodynamic design and computational analysis. He can be reached at Box 3224, Rancho Santa Fe CA 92075.
With the publication of P.J. Plauger's book The Standard C Library, satisfactory source code is readily available for C library functions. While Plauger's code does not treat accuracy as the only criterion for success, most of the functions can be made faster on most computers. Since most of the speedup techniques are the same for any code, I concentrate on those functions for which both better performance and accuracy can be obtained.
Accuracy in the math functions is affected most by the quality of the numerical approximations used and by organizing the arithmetic to minimize roundoff error. In both of these areas, compromises can be made to improve speed at the expense of accuracy. Using one less term in a polynomial or skipping circuitous schemes to get more precision in the arithmetic will still give satisfactory results most of the time. This route is easier to follow than the route which attempts to get the last available bit of precision. Unless the math functions are taking a large percentage of the time in an application, the slight extra time consumed will be much less annoying than the possibility of incorrect results.
Just as a floating-point number must be specified to 17 decimal digits to assure that it can be converted to the closest IEEE P754 standard double, polynomial approximations should be good to 17 decimal digits. More accurate approximations make no difference, but series which may be off even by half of the least-significant bit will produce more roundoff errors.
A slightly different approach can be taken for functions of float precision, using series which are accurate to nine decimal digits, with double arithmetic avoiding the need for any special precautions against roundoff error. For the common case of applications which need both sinf and cosf of the same argument, schemes using the half-angle formulae can produce both results in little more time than cosf alone. These schemes are much more favorable for vectorization. This method can be used internally in coprocessors that have efficient long double arithmetic, but it limits the accuracy to a little over 15 decimal digits if used in standard double precision. Software implementations of long double are so slow that their existence may as well be ignored. They don't even compete with programs like bc, which are useful for getting the series coefficients.
Rational approximations, even more than polynomial approximations, demand that the formulae be organized so that the roundoff error in the division is washed out by later addition of larger terms. This means that the code should be written so that the final operation is addition rather than multiplication or division. For an odd function (one for which f(-x) == -f(x)), this often produces a += assignment.
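As a minimal sketch of this arrangement (the coefficients are the first Taylor terms of sin, stand-ins for a properly fitted series, and the function is not Listing 2 or 3), the small odd correction is computed first and the exact argument is added last with +=:

static double odd_approx(double x)
{
    double g = x * x;
    double y = (g / 120.0 - 1.0 / 6.0) * g * x;  /* small odd correction */
    y += x;    /* final operation: addition of the exact, larger term x */
    return y;
}

Because the += addition of x comes last, the roundoff made in the division and multiplications is buried below the least-significant bit of the final sum.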
Review of Plauger's sin function showed that the basic approach is excellent, but the accuracy of the polynomial approximation can be improved slightly. The bc program, Listing 1, takes three minutes on an HP720 to calculate the coefficients for approximations up to 28 digits precision. Replacing the approximated function s(x)/x by c(x) will produce the other set of coefficients. These series give more accuracy than Plauger's with one less term in the sin series.
In certain applications, the most critical point of accuracy is whether the functions satisfy the algebraic identity
cos(x) * cos(x) + sin(x) * sin(x) == 1
With the function written as shown, errors in range reduction do not affect this test, which is good, except that other tests are needed to check range reduction. With these coefficients and the change to make the sin series use the += operator, the identity is satisfied more often than with the library functions provided by Sun and HP.
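A test of this identity can be as simple as the following sketch; the sample points and the half-ulp tolerance are arbitrary choices made here for illustration, not the published test conditions:

#include <float.h>
#include <math.h>
#include <stdio.h>

/* Count arguments for which cos(x)*cos(x) + sin(x)*sin(x) differs
 * from 1 by more than half a unit in the last place. */
int main(void)
{
    long i, failures = 0;

    for (i = 0; i < 100000L; ++i) {
        double x = i * 0.0001;
        double err = cos(x) * cos(x) + sin(x) * sin(x) - 1.0;

        if (fabs(err) > 0.5 * DBL_EPSILON)
            ++failures;
    }
    printf("%ld failures\n", failures);
    return 0;
}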
Unless one is trying to simulate IEEE handling of NaN (Not a Number) and Inf (infinity) on a system which does not support them, there is no need in a function like this to write code which looks for these special bit patterns. The basic system must trap and handle them appropriately without our wasting time on them. Since we aren't performing anything but normal arithmetic, any NaN or Inf will be propagated automatically. These exceptions should happen so seldom that code which shortcuts the full formulae would take more time in the normal case than it could save in the exceptional cases. Likewise, any modern system can handle underflows appropriately without specific user code.
I show two ways of handling arguments with magnitude greater than INT_MAX faster than by using Plauger's _Dint. In _Sin (Listing 2), as Plauger did, I add or subtract multiples of 2π, but I use the value of DBL_EPSILON from <float.h> to write arithmetic directly to get an appropriate multiple. In the cases of FLT_ROUNDS <= 0 or FLT_ROUNDS == 3, the results should be identical to _Dint. For FLT_ROUNDS == 1, the integer will be one greater about half the time, but this only slightly worsens the accuracy for cases in which not much accuracy is possible anyway.
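The arithmetic in question follows the familiar pattern of adding and then subtracting a large power of two; the sketch below shows the general idea, not the exact expression in Listing 2, and assumes |y| stays well below 1/(2 * DBL_EPSILON):

#include <float.h>

/* Adding 1.5/DBL_EPSILON pushes the fraction bits of y out of the
 * significand; subtracting it back leaves a whole number selected
 * according to FLT_ROUNDS.  The volatile temporary keeps an
 * aggressive compiler from folding the two operations away. */
static double whole_part(double y)
{
    volatile double t = y + 1.5 / DBL_EPSILON;
    return t - 1.5 / DBL_EPSILON;
}

In Listing 2, y would be the argument divided by 2π, and the whole number found this way determines the multiple of 2π to subtract.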
In Listing 3, I save time by not testing the size of the argument, but going straight to a remaindering scheme which works for arguments with magnitude up to π / (2 * DBL_EPSILON). Outside of this range, there is no accuracy anyway, and we should in principle report ERANGE. [The C Standard says we can't, however. – pjp] The code is set up so that the argument sign can be tested in parallel with the multiplication by 2 / π, saving time on pipelined machines. Plauger's scheme of testing the sign bit directly with integer operators requires the argument to be copied to an int register, which on many architectures includes an intermediate copy to memory, so it is faster as well as more portable simply to use a double comparison operator. Rounding should work correctly for any valid FLT_ROUNDS, including those which vary at run time, but more efficient code will be generated if FLT_ROUNDS is constant.
Some systems have trouble casting to int a floating-point value that exceeds INT_MAX. (You need the least-significant integer bits to determine the correct octant.) I have included code that works correctly even for large values, then used a #if to skip over it, because this approach is quite slow on many machines (although not as slow as the Listing 2 code). We avoid any attempts to cast from double to int and back to double, which few compilers optimize.
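One portable way to recover those low-order bits from a whole-number double that may exceed INT_MAX is fmod, sketched below; it is correct but, as noted, slow on many machines:

#include <math.h>

/* Sketch only: y is assumed to already be a whole number (the
 * rounded quotient from the range reduction).  fmod is exact, so the
 * low three bits come out right even when y is far above INT_MAX. */
static int octant_of(double y)
{
    int r = (int)fmod(y, 8.0);
    return r < 0 ? r + 8 : r;
}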
One might think that the boundaries could be widened slightly by using casts to unsigned long, possibly with fabs. But typical architectures do not have hardware support for casts directly from double to unsigned long, so our goal of saving time is defeated. More general ways of rounding to an integer, such as a FORTRAN ANINT function or a PA-RISC FRND pseudo-instruction, must trap to a function which spends a lot of time checking for cases that we don't care about. These shortcuts have never caused me any trouble, but they aren't totally satisfactory for black-box code, which isn't under the control of the user.
Finally, Listing 3 breaks the polynomials into groups to make more efficient use of architectures that can perform double add or multiply operations in pipeline or parallel fashion. When the separate polynomial chunks are combined, there is a roundoff that is different from a straight Horner-polynomial evaluation. The change is swamped, as mentioned above, by arranging for it to occur before the largest term is added; this roundoff is slightly more favorable than the one in the Horner evaluation. Tests on a SPARC 2 show that the pipelineable polynomial takes 10% longer to evaluate, as would be expected from the requirement for two additional multiplications. On an HP720, Listing 3 runs 30% faster than it would without the parallelization, as the compiler makes effective use of the parallel multiply-and-add instruction.
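The contrast between the two arrangements looks roughly like this; the Taylor coefficients of sin are shown only as placeholders for a fitted series, and the grouping illustrates the idea rather than the exact Listing 3 code:

/* Both functions evaluate x + x*(c1*g + c2*g^2 + ... + c5*g^5) with
 * g = x*x.  Horner's form is one long dependency chain; the grouped
 * form lets two shorter chains overlap on a pipelined FPU at the
 * cost of the extra multiplications that form g2. */
static const double c1 = -1.0 / 6.0, c2 = 1.0 / 120.0, c3 = -1.0 / 5040.0,
                    c4 = 1.0 / 362880.0, c5 = -1.0 / 39916800.0;

static double poly_horner(double x)
{
    double g = x * x;
    return ((((c5 * g + c4) * g + c3) * g + c2) * g + c1) * g * x + x;
}

static double poly_grouped(double x)
{
    double g = x * x, g2 = g * g;
    double p = ((c5 * g + c4) * g2 + (c3 * g + c2)) * g2 + c1 * g;
    return p * x + x;
}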
The most serious criticism of Plauger's _Exp function is that it deals with negative arguments by unthinking application of the symmetry
exp(-x) = 1 / exp(x)
This adds the roundoff error of a final division whenever the argument is negative. [It wasn't unthinking. The acceptance tests I had to pass worried more about this symmetry than the roundoff error. – pjp] The corrected code (Listing 4) is simpler, but it involves casting the argument to int with rounding, both for positive and negative arguments.
As the argument has already been forced to memory by passing it in by reference, the sign bit can be isolated quickly and used as an index into an array {.5, -.5} of rounding constants. This array can be float or double. On 32-bit architectures, it may be faster to get a float from memory and cast it to double. With the indexing scheme, no conditionals are required, avoiding breaks in the instruction pipeline.
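The idea looks roughly like the sketch below. It assumes 64-bit IEEE doubles stored as two 32-bit words and a 32-bit long; which word holds the sign depends on byte order, so WORD0 here is a hypothetical configuration macro, not one from Plauger's headers:

/* Select .5 for a positive argument, -.5 for a negative one, with no
 * branch: the sign bit of the word holding sign and exponent is used
 * directly as the array index. */
#define WORD0 1                    /* assumed index of the sign word */

static const double round_const[2] = { 0.5, -0.5 };

static double rounding_term(const double *px)
{
    const unsigned long *pw = (const unsigned long *)px;
    return round_const[(pw[WORD0] >> 31) & 1];
}

Adding the selected constant before truncating to int then rounds to the nearest integer for either sign, with no break in the pipeline.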
Plauger's exp function can be speeded up, as he suggests in his book, by changing _Dtest to a macro, and avoiding operations that require the argument to be copied through memory. This macro will even detect a NaN if IEEE arithmetic is implemented:
#define _Dtest(x) ((x) > DBL_MAX || (x) < -DBL_MAX ? INF : \
                   (x) != (x) ? NAN : \
                   (x) != 0 ? FINITE : 0)
Caution: in this context, x != x is not the same as !(x == x).
If VAX compatibility is not required, _Dscale and similar functions can be implemented using long instead of short, as most architectures use the same byte order for storage of all 32-bit quantities. For full compatibility with Plauger's book, change all long and int declarations to short.
We can't use 21 digits of precision, so an approximation with one less term than Plauger's gives identical results. I set up the coefficients thinking that I could get more speed by making the smallest coefficient 1, but this introduces an additional roundoff error that can be avoided by calculating the numerator with += as shown. The scale of the coefficients used by Plauger was set for best accuracy on an IBM 360 with base-16 arithmetic, a handicap which we will do without. The change presented here is slightly faster on architectures that can perform add and multiply in parallel.
The formula which ends with the addition of 0.5 is algebraically equivalent to one of the form (even + odd) / (even - odd), which has the same symmetry as exp, as mentioned earlier. This form will run faster on certain pipelined machines but doesn't have the desired swamping effect on the division roundoff error. The speed of the formula stems from this identity of terms in the numerator and denominator.
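For illustration only, here is that structure with the [3/3] Padé approximant to exp standing in for a properly fitted series over the reduced range:

/* exp(f) is approximately (120 + 60f + 12f^2 + f^3) /
 * (120 - 60f + 12f^2 - f^3) for small f.  Only the odd part changes
 * sign with f, so the computed value has the same exp(-f) == 1/exp(f)
 * symmetry discussed above. */
static double exp_reduced(double f)
{
    double g = f * f;
    double odd = f * (60.0 + g);        /* 60f + f^3 */
    double even = 120.0 + 12.0 * g;     /* 120 + 12f^2 */
    return (even + odd) / (even - odd);
}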
Awareness of standards seems to have brought about a sameness in implementations of exp, and we find little difference between the performance of these functions from various vendors' libraries.
pow is so complicated that there is at least one way per programmer to implement it. With just one change, Plauger's code is as accurate as any. The problem, as with his _Exp function, is the inversion which is used to take care of negative powers. Instead of inverting and then multiplying, the code must be changed to divide in this situation. [Same comment as before. – pjp] This requires elimination of the conditionals which appear to be intended to save time in the cases of integral powers and bases which are exact powers of 2. In most applications where these special cases occur, use of pow may be avoided entirely, so there is no point in trying to speed up these cases at the expense of the general case.
In terms of performance, the greatest problem is the excessive time spent in the low-level support functions. I suggest using the _Dtest macro to isolate the special cases, with appropriate changes in the pow code (Listing 5). The only remaining special case that would be dealt with in _Dunscale is the subnormal, which can be handled quicker by detecting it and multiplying by 1 / DBL_EPSILON (that is, 2 to the power DBL_MANT_DIG - 1).
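A sketch of what that detection might look like is shown below; the function name and the way the correction is returned are illustrative, not the Listing 5 code:

#include <float.h>

/* Pre-scale a subnormal into the normal range so that later exponent
 * extraction works unchanged.  Multiplying by 1/DBL_EPSILON (an exact
 * power of two) loses no bits; the return value is the amount to
 * subtract from the exponent recovered later. */
static int prescale_subnormal(double *px)
{
    if (*px != 0.0 && *px < DBL_MIN && *px > -DBL_MIN) {
        *px *= 1.0 / DBL_EPSILON;
        return DBL_MANT_DIG - 1;
    }
    return 0;
}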
The union _Dvar is identical to _Dconst except that the const qualifier is omitted. On most current architectures, working on a union of double and int requires copying through memory, so the generated code is the same as if it were done with pointer casts. The MIPS is an exception to this, and it may be possible, using a variation on the _Dvar union, to get the compiler to copy directly between float and general registers.
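The shape of such a union is roughly as follows; the member names are illustrative and not those used in Plauger's headers:

/* A non-const union giving word-level access to a double.  With
 * 32-bit longs (as suggested above where VAX compatibility is not
 * needed), which element of w[] holds the sign and exponent depends
 * on the machine's byte order. */
typedef union {
    double d;
    long   w[2];
} Dvar_like;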
The call to _Dscale can be eliminated by using the scaling which Plauger built into _Exp. Casts from double to int to double have been collected so that they should be recognized by a compiler's peephole optimizer. (This will be discussed later.) Code for raising the fractional part to an integer power has been revised to reduce the number of comparisons. As most C compilers will not register-allocate a variable whose address has been taken, care has been taken to copy to register-eligible doubles where this will improve the code.
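The underlying technique for the integer-power step is ordinary exponentiation by squaring; the sketch below shows the technique itself, not the revised Listing 5 code:

/* Raise base to a non-negative integer power with one squaring per
 * bit of the exponent and a multiply only where a bit is set. */
static double int_power(double base, unsigned long n)
{
    double result = 1.0;

    while (n != 0) {
        if (n & 1)
            result *= base;
        base *= base;
        n >>= 1;
    }
    return result;
}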
Some commercial libraries achieve better speed by separating completely the integer power case and using exp(log(x) * y) for the usual case. The more cautious approach which we have taken from Plauger's book should give more accurate results for |y| > 4, but many applications confine their use of pow to the range where these precautions are not needed.
The basic scheme employed in the 4.3BSD library (Listing 6) has proven more accurate than the one shown in Plauger's text. Making the rotations by angles that are inverse tangents of multiples of 0.5 reduces roundoff error. The corresponding angle adjustments are made by splitting the adjustment in a way analogous to the precautions taken by Plauger in some of the other math functions. A small adjustment is made to the third-power term of the series, such that the large adjustment at the end can be done by an "exact" number. The roundoff error in the third-power term is swamped by the later additions of larger "exact" quantities. The 1985 Berkeley code had these adjustments in certain of the ranges; I found that they should be extended to the other ranges.
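The rotation step itself can be sketched as below for the breakpoint at 0.5; this is the idea, not the Listing 6 code, and in the real code the constant atan(0.5) would be carried as a split high/low pair so the final addition involves only "exact" values:

#include <math.h>

/* For x near the breakpoint c = 0.5,
 *     atan(x) = atan(0.5) + atan(t),  t = (x - 0.5)/(1 + 0.5*x),
 * so the series is evaluated only on the small argument t and the
 * larger constant is added last. */
static double atan_by_rotation(double x)
{
    static const double atan_half = 0.46364760900080611621; /* atan(0.5) */
    double t = (x - 0.5) / (1.0 + 0.5 * x);

    return atan_half + atan(t);  /* atan(t) stands for the short series */
}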
I changed the coefficients to a series obtained by a Chebyshev fit to atan(x) / x. Tests in 53-bit IEEE precision showed no difference from the results obtained with the coefficients in the BSD version, but it should not be necessary to add a term to obtain 56-bit precision. I omitted precautions taken in the Berkeley code which are unnecessary when using correctly rounded arithmetic, as well as precautions taken against old-style compilers which may ignore parentheses.
As with the sin function, greater speed is obtained on pipelined machines with the series written to allow some parallel computation. While more parallel computation is available by further grouping of terms, no more speed is possible on the HP720. Grouping terms at the level shown makes no difference in roundoff error.
The normal way to get atan would be with a masking macro
#define atan(x) atan2(x, 1.)
This costs little time compared to a separate atan function. In those cases where a division occurs in the argument of atan, it is faster and more accurate to use atan2 so that the division can be postponed to the internals of the math function.
asin and acos can be written as macros, using atan2 and sqrt. As most applications make little use of these functions, this gives satisfactory results when there is a hardware square root. It should also put you on the road to simplifying expressions such as cos(asin(x)), which cost a surprising amount of time and accuracy.
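One possible form of such macros is shown below (a sketch, not code from any of the libraries discussed; as macros they evaluate the argument twice, so it should be free of side effects):

#include <math.h>

/* asin(x) = atan2(x, sqrt(1 - x*x)) and acos(x) = atan2(sqrt(1 - x*x), x)
 * for x in [-1, 1], since sqrt(1 - x*x) is the cosine (respectively
 * the sine) of the result. */
#define asin(x) atan2((x), sqrt(1.0 - (x) * (x)))
#define acos(x) atan2(sqrt(1.0 - (x) * (x)), (x))

The cos(asin(x)) example then collapses directly to sqrt(1.0 - x * x).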
The current SPARC, MIPS, and PA-RISC compilers generate redundant code for casts from double to int to double, which occur frequently in the math functions. These architectures first produce an int result in a floating-point register, which must be copied to a general register before it can be used for integer arithmetic. When it occurs in isolation, the cast from int to double must copy the int value to a floating-point register, where it is not otherwise allowed to go. When the int value has just come from truncating a floating-point value, the necessary value already exists in a floating-point register, but the compilers don't take advantage of it.
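The pattern in question looks like this sketch (the constants are the obvious single values; a real reduction would carry 2/π and π/2 as split high/low pairs and would round rather than truncate):

/* The quotient is needed both as an int, for the quadrant tests, and
 * immediately afterwards as a double, for the subtraction.  A
 * peephole pattern can reuse the truncated value already sitting in
 * a floating-point register. */
static double reduce_demo(double x, int *quadrant)
{
    static const double two_over_pi = 0.63661977236758134308;
    static const double pi_over_two = 1.57079632679489661923;
    int n = (int)(x * two_over_pi);        /* double -> int */

    *quadrant = n & 3;
    return x - (double)n * pi_over_two;    /* int -> double again */
}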
In source code for the Free Software Foundation gcc compiler, the Machine Description file includes the ingredients to produce a define_peephole pattern which will match these pairs of casts and generate code without the redundant copy from general to floating-point register. On the SPARC and PA-RISC architectures, this copy has to go through memory, as there is no register-to-register move between general and floating-point registers. At least two instructions and one register allocation can be eliminated. In addition, the floating-point processor can continue on without waiting for results to arrive in the general register, so four or more clock cycles may be saved.
When striving for maximum accuracy, any operation that produces roundoff errors at the level of the least-significant bit of the result should be avoided. The best way to accomplish this is to make the final operation the addition of an exact larger quantity, such as the original argument or an exact constant.
Making execution of small sections of code conditional should be avoided when executing them unconditionally is harmless; skipping them doesn't save time in the majority of cases. Modern architectures handle underflow automatically and don't require special code.
Most current machines use coprocessor architectures, which are slow in transferring data between floating-point and general registers. Mixed int and double operations on the same data should be avoided. Such operations which depend on bit patterns are non-portable and difficult to read, so should be used sparingly.
- Press, Flannery, et al. 1988. Numerical Recipes in C. Cambridge University Press. (The bc code actually was adapted from the 1986 FORTRAN edition.)
- Plauger, P.J. 1992. The Standard C Library. Prentice-Hall.