It is achievable to write fast and precise mathematical functions.
There are cutting-edge implementations in computers, phones, calculators,
and game consoles. Some of them are open source, like glibc and musl,
from which we can learn. I have also been working on mathematics in
metallic, as a shameless plug.

It may seem that mathematical functions are hardware instructions. We
usually code them in software instead. The trend is to have the
hardware deal with only basic operations after decades of evolution. We
can perform mathematics with only operations required in IEEE 754 and
integer operations via type punning.

## Target system

### Instruction set

Which instructions are on the target system? C operators are probably
supported. Other operations are not as available even if they are
*required* by IEEE 754. For example, `fmod`

rarely compiles to a single
instruction. It is usually done by long division, which then translates
to a series of integer operations or partial remainder instructions.
This C function is also an example where operators in other programming
languages can be more expensive than expected, like `%`

in JavaScript or
Python.

### Programming language

I suggest programming mathematical functions in C. It is fast and
precise to evaluate complicated expressions with floating-point
contraction. On supported platforms, `a * b + c`

can compile to a fused
multiply–add.

```
// Nearest integer for a nonnegative argument
float nearest(float x)
{
const float rectifier = 0x1p23f;
float y = x + rectifier;
return y - rectifier;
// Wrong: can be optimized to (x + 0)
return x + rectifier - rectifier;
}
```

## Rounding

*Round half to even* is the default rounding in IEEE 754. The roundoff
error of a series of `n` operations by this unbiased rounding
is only in O(√`n`). This rounding will be the implied rounding
method throughout this article unless otherwise specified.

### Double rounding

The default rounding is non-associative. If we round 1.49 to the
nearest tenth and then to the nearest integer, it becomes 2. Rounding
to midpoints must be avoided in intermediate roundings.

*Round to odd* is a helpful intermediate rounding for binary numbers.
It rounds irrepresentable numbers to the nearest representation with an
odd significand. Only numbers with even significands can be midpoints
or exact in a coarser precision. Therefore, *round to odd* does not
interfere with subsequent roundings. *Round to odd* is also
associative like directed roundings as it rounds all values between
representations to either of them.

### Exact addition

We can obtain the exact error of addition with the default rounding.
This technique is useful for storing precise intermediates to stop the
propagation of error. The idea is to find `s` + `e` =
`a` + `b`, where `s` is the rounded
sum, and `e` is the error term. The error term is
defined when `s` does not overflow.

Compensated summation produces precise results with preconditions. The
base of the floating-point system (`FLT_RADIX`

) can be at most 3, and
logb(`a`) ≥ logb(`b`).

```
s = a + b;
e = a - s + b;
```

We can generalize this algorithm by comparison, as |`a`| ≥
|`b`| implies logb(`a`) ≥ logb(`b`).
Branching is nevertheless inefficient. There is another unbranched
algorithm working most of the time. This algorithm overflows only if an
operand is the largest finite representation or its negative.

```
s = a + b;
x = a - s;
y = b - s;
e = (a + x) + (b + y);
```

### Exact multiplication

First, a `double`

is large enough to store the exact product of any two
`float`

s. Therefore, it is preferred to cast and multiply. This method
is straightforward and fast, and double-precision multiplication is
widely supported.

```
double multiply(float a, float b)
{
return (double)a * b;
}
```

Then, we try to find `s` + `e` =
`a``b`. If the FMA instructions are available, use
them. Probe this feature with `FP_FAST_FMA[FL]?`

macros.

```
s = a * b;
e = fma(a, b, -s);
```

Next, without all the hardware witchcraft, we can still count on
schoolbook multiplication. With an even number of significant digits,
equally split them and obtain the higher part with a bitwise AND.

Even if the number of significant bits is odd, we can split a binary
significand with the default rounding. Take IEEE 754 double-precision
as an example, which has 53 significant bits. Its magic multiplier is
2^{27} + 1, where 27 = (53 + 1) / 2.

```
double split(double x)
{
double s = (0x1p27 + 1) * x;
double c = x - s;
return s + c;
}
```

The above function returns the higher half of the argument. We can
extract the lower half by subtracting the higher half. Each half is
guaranteed to have at most 26 significant bits. The possibility that
the halves can have opposite signs recovers the seemingly lost bit.

### Table maker’s dilemma

The cost of a correctly rounded transcendental function is unknown
unless probed with brute force. Faithfully rounded versions are much
faster and generally preferable, though they may round up or down. No
matter how precise intermediate results are, they can be close to a
turning point of the given rounding. For example, `x` is a
mathematical result equal to `x`^{*} + 0.49 ulp, where
`x`^{*} is an exact floating-point. An exquisite
approximation gives `x`^{*} + 0.51 ulp, which is only
0.02 ulp off. Nevertheless, it becomes `x`^{ *} + 1
ulp after default rounding, which is 1 ulp off from the correctly
rounded `x`^{*}.

We can correctly round an algebraic function by solving its polynomial
equation at the turning point and compare the results. However, this
extra cost is unwelcome if faithful rounding is enough. It is unlikely
that a correctly rounded program solves a real-world problem that a
faithfully rounded one does not. IEEE 754 does not require correct
rounding for the cubic root. Therefore, I made `cbrt`

faithfully
rounded in metallic. Its error can be even larger in glibc and other C
libraries.

## Approximation

Eventually, we break down mathematical functions to basic arithmetics.
This section covers how to turn mathematics into source code.

### Argument reduction

Sometimes we can shrink the domain to a short interval with an identity.
For example, to compute `exp`

for a binary format, we can divide its
argument `x` by ln 2.

\begin{align*}
x &= n \ln 2 + r \\
\exp x &= 2^n \exp r
\end{align*}

Where `n` is an integer, bit twiddling takes care of
multiplication by 2^{n}. If we pick `n` as
an integer nearest to `x`, we simultaneously restrict `
r` into [-0.5 ln 2, 0.5 ln 2].

Approximation to exp `r` is fast because [-0.5 ln 2, 0.5 ln 2]
is a short interval. We approximate exp `r` with few terms to
achieve the desired precision.

It is also precise because `r` is small. Computations
involving small numbers are accurate. Floating-points are dense near
the origin since they are essentially scientific notation. In IEEE 754
binary formats, there is the same number of representations in (0, 1.5)
and in (1.5, ∞). Therefore, it is wise to shift the domain close to 0.

Most mathematical functions we compute are well-behaved. We can save
computations by taking advantage of continuity, differentiability, or
symmetry.

When a function `f` **passes through and is monotone at the
origin**, divide it by the identity function and approximate the
quotient `g` instead.

\begin{align*}
f(0) &= 0 \\
f(x) &= x g(x)
\end{align*}

This transformation explicitly omits the constant term and reduces the
overall relative error. The value of `g`(0) can be any finite
number. We define `g` as a continuous extension for rigor.

\[
g(x) = \begin{cases}
\displaystyle \frac{f(x)}{x} & \mbox{if } x \ne 0 \\
\displaystyle \lim_{t \to 0} \frac{f(t)}{t} & \mbox{if } x = 0
\end{cases}
\]

Given an approximant `ĝ` of `g`, the overall
absolute error `x` |`ĝ` − `g`|
tends to 0 when `x` also approaches 0. This transformation
enables approximating `g` without a weight function and
simplifies calculation.

When `f` is an **even function**, view it as another function
`g` of a squared variable.

\begin{align*}
f(x) &= f (-x) \\
f(x) &= g \left( x^2 \right) \\
g(x) &= f \left( \sqrt x \right)
\end{align*}

This transformation explicitly omits odd terms and halves the degree of
the approximant.

An **odd function** is a combination of the above two. It is a product
of the identity function and an even function.

\begin{align*}
f(x) &= -f (-x) \\
f(x) &= x g \left( x^2 \right) \\
g(x) &= \frac{f \left( \sqrt x \right)}{\sqrt x}
\end{align*}

The value of `g`(0) does not affect the approximation of
`g` as it creates no hole in the domain. In practice, set the
lower bound to a tiny positive number like 2^{-200}, and
everything is fine.

### Remez algorithm

Remez exchange algorithm is an iterative minimax algorithm. It
minimizes the error of a rational approximation of a function. The best
explanation of this algorithm I found is
from the Boost libraries. I recommend Remez.jl, a
public module in the Julia language. It works out of the box after
installation.

For example, the following snippet finds a cubic approximation of
cos(√·) in [0, (π / 4)^{2}] with minimax absolute errors.
The last argument is 0 because we want a polynomial, whose denominator
is a constant (of degree 0) if regarded as a rational function.

```
import Remez
N, D, E, X = Remez.ratfn_minimax(x -> cos(√x), (0, (big(π) / 4)^2), 3, 0)
```

The variables `N`

, `D`

, `E`

, `X`

are the numerator, the denominator, the
maximum error, and coordinates of the extrema, respectively. In this
case, we are interested in `N`

and `E`

only. If we run the snippet in
the REPL, it is straightforward to inspect variables.

```
julia> N
4-element Array{BigFloat,1}:
0.9999999724233229210670051040057597041917874465747537951681676248240168483719746
-0.4999985669584884771720232450657038606385147149244782395789475085368551172067715
0.04165502688425152443762347668780274316867072837392713367475023020736799395672903
-0.001358590851011329858521158876238716265345398772374942259275377959127201806930143
julia> E
2.757667707893299489599424029580821255342524620485822487536621483700643103529491e-08
```

The resulting coefficients are in ascending order. For example, the
first element of `N`

is the constant term.

### Polynomial evaluation

The best polynomial evaluation method depends on the system. For
example, pipelining influences execution time. Luckily, there
are well-known evaluation schemes that provide decent performance and
reasonable errors.

**Horner’s scheme** produces the fewest operations. It is the fastest
if its argument is already a vector. It is also usually the most
accurate method for a polynomial approximant of a well-conditioned
function. However, its dependency chain is also the longest. It
underuses the pipeline because all operations except one depend on
another. Hence, it is less than ideal on single-threaded systems.

On the other hand, **Estrin’s scheme** tries to be as parallel as
possible. It groups terms in a binary fashion to achieve the shallowest
dependency tree at the expense of O(log(`n`)) extra squaring
ops.

There are also other evaluation schemes with different pros and cons.
Benchmark to find the most suited one if their difference is critical.