1.3. Floating-point Arithmetic

As shown in the previous section, fractional numbers are a common occurrence in scientific computing, which naturally raises the question of how much precision is sufficient for a particular application. Digital computers use the floating-point number system to represent the set of real numbers \(\mathbb R\) using scientific notation. For any decimal number \(x\) (assuming \(x\) is a terminating decimal, i.e. has finitely many non-zero digits) we can write:

\[\begin{split}x=a\times 10^b, \enspace\enspace\enspace\mbox{where}\enspace\enspace1\leq\vert a\vert<10\end{split}\]

The only exception is \(x=0\), where we simply set \(a=b=0\). Every decimal (or base \(10\)) number can be written as:

\[a_k a_{k-1} \ldots a_2 a_1 a_0 . a_{-1} a_{-2} a_{-3}\ldots a_{-m} = \sum_{i=-m}^k a_i 10^i\]

Note how the decimal point moves, or floats, as the power of \(10\) changes. Some examples are given below:

\[\begin{split}\begin{array}{|c|c|c|c|c|c|c|c|} \hline x & a_3 & a_2 & a_1 & a_0. & a_{-1} & a_{-2} & a_{-3} \\ \hline 3.14 & & & & 3 & 1 & 4 & \\ 0.037 & & & & & & 3 & 7 \\ 2012 & 2 & 0 & 1 & 2 & & & \\ \hline \end{array}\end{split}\]

Similarly, binary (base \(2\)) fractional numbers can be written as:

\[b_k b_{k-1} \ldots b_2 b_1 b_0 . b_{-1} b_{-2} b_{-3}\ldots b_{-m} = \sum_{i=-m}^k b_i 2^i\]

where every digit \(b_i\) is now only allowed to equal \(0\) or \(1\). For example:

  • \(5.75_{(10)} = 4+1+0.5+0.25 = 1\times2^2 + 1\times 2^0 + 1\times 2^{-1} +1\times 2^{-2} = 101.11_{(2)}\)
  • \(17.5_{(10)} = 16 + 1 + 0.5 = 1\times 2^4 + 1\times 2^0 + 1\times 2^{-1} = 10001.1_{(2)}\)
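
To make such conversions mechanical, the following small helper (a sketch written for this section, not a library function) extracts the integer part with Python's built-in bin and then peels off fractional bits one at a time by repeated doubling; it assumes a non-negative input and simply stops after a fixed number of fractional bits:

>>> def to_binary(x, frac_bits=8):
...     whole, frac = divmod(x, 1)          # split into integer and fractional parts
...     bits = bin(int(whole))[2:] + '.'    # integer part via the built-in bin
...     for _ in range(frac_bits):
...         frac *= 2                       # the next bit is the new integer part
...         bits += '1' if frac >= 1 else '0'
...         frac -= int(frac)
...     return bits
...
>>> to_binary(5.75)
'101.11000000'
>>> to_binary(17.5)
'10001.10000000'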

Note that certain numbers that are finite (terminating) decimals are actually periodic (or repeating) in binary, for example,

\[0.4_{(10)} = 0.01100110011\ldots_{(2)} = 0.011\overline{0011}_{(2)}\]
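
This has a visible consequence in Python, whose floats are (double-precision) binary numbers: the stored value is only the machine number closest to \(0.4\), and sums of such repeating fractions pick up small errors:

>>> format(0.4, '.20f')      # 0.4 is stored as the nearest binary machine number
'0.40000000000000002220'
>>> 0.1 + 0.2                # repeating binary fractions cause tiny errors
0.30000000000000004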

1.3.1. Machine Numbers

The term machine numbers is an abbreviation for binary floating-point numbers, which is how real numbers are stored on modern digital computers. In scientific notation, \(x=\pm a\times 2^b\), where \(a\) is called the mantissa and \(b\) the exponent. We also follow the convention that \(1\leq a<2\); the idea being that any nonzero number \(x\) can always be divided by an appropriate power of \(2\) so that the result falls within \([1,2)\). For example:

\[x = 5_{(10)} = 1.25_{(10)}\times 2^2 = 1.01_{(2)}\times 2^2\]
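
In Python, the built-in float.hex method displays a (double-precision) number in exactly this normalized form: the mantissa is printed in hexadecimal (each hex digit encodes four binary digits) and the exponent is a power of \(2\):

>>> (5.0).hex()      # 1.25 x 2^2, since 0x1.4 in hexadecimal equals 1.25
'0x1.4000000000000p+2'
>>> (0.4).hex()      # the nearest machine number to 0.4, roughly 1.6 x 2^-2
'0x1.999999999999ap-2'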

Thus, a machine number is stored as:

\[x=\pm 1.a_1 a_2\ldots a_{k-1}a_k\times 2^b\]
  • In single precision we store \(k=23\) binary digits, and the exponent \(b\) ranges between \(-126\leq b\leq 127\). The largest number we can thus represent is \((2-2^{-23})\times 2^{127}\approx 3.4\times 10^{38}\).
  • In double precision we store \(k=52\) binary digits, and the exponent \(b\) ranges between \(-1022\leq b\leq 1023\). The largest number we can thus represent is \((2-2^{-52})\times 2^{1023}\approx 1.8\times 10^{308}\).

In other words, single precision provides \(23\) binary significant digits. In order to translate it to familiar decimal terms we note that \(2^{10}\approx 10^3\), thus, \(10\) binary significant digits are roughly equivalent to \(3\) decimal significant digits. Using this, we can say that single precision provides approximately \(7\) decimal significant digits, while double precision offers slightly more than \(15\).
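
These limits are easy to verify in the Python interpreter, whose built-in float is a double-precision machine number; the first two expressions simply evaluate the formulas for the largest representable numbers given above:

>>> (2 - 2**-23) * 2.0**127        # largest single-precision number
3.4028234663852886e+38
>>> (2 - 2**-52) * 2.0**1023       # largest double-precision number
1.7976931348623157e+308
>>> import sys
>>> sys.float_info.max             # Python floats are double precision
1.7976931348623157e+308
>>> sys.float_info.dig             # decimal digits that are always preserved
15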

1.3.2. Absolute and Relative Error

As we saw before, all computations on a computer are approximate by nature, due to the limited precision available. As a consequence, some amount of error has to be tolerated in all computations. In order to better understand these errors, the absolute and relative error measures are used. Let \(q\) denote the exact (analytic) quantity that we expect out of a computation, and let \(\bar q\) denote the (likely compromised) value actually generated by the computer.

The absolute error is \(e=\vert q - \bar q\vert\). This is useful for framing the result within a certain interval, since \(e\leq\delta\) implies \(q\in[\bar q-\delta,\bar q + \delta]\). The relative error is \(e=\vert q - \bar q\vert / \vert q\vert\). The result may be expressed as a percentage and is useful for assessing the error relative to the magnitude of the exact quantity. For example, an absolute error of \(10^{-3}\) may be insignificant when the intended value of \(q\) is on the order of \(10^6\), but would be very severe if \(q\approx 10^{-2}\).
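
As a small illustration (with a made-up computed value), the two error measures for approximating \(q=1/3\) by \(\bar q=0.333\) can be evaluated directly:

>>> q = 1/3                                  # exact (analytic) value
>>> q_bar = 0.333                            # value produced by some computation
>>> print(f"{abs(q - q_bar):.2e}")           # absolute error
3.33e-04
>>> print(f"{abs(q - q_bar) / abs(q):.2e}")  # relative error, about 0.1%
1.00e-03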

1.3.3. Rounding and Truncation

When storing a number on the computer, if it contains more digits than can be represented by a machine number, an approximation is made via rounding or truncation. With truncation, the machine number is constructed by simply discarding the significant digits that cannot be stored; with rounding, the quantity is approximated by the closest machine-precision number. For example, when approximating \(\pi=3.1415926535\ldots\) to \(5\) decimal significant digits, truncation gives \(\pi\approx 3.1415\) while rounding gives \(\pi\approx 3.1416\). Rounding and truncation are defined similarly for binary numbers. For example, \(x = 0.1011011101110\ldots_{(2)}\) would be approximated to \(5\) binary significant digits as \(x\approx 0.10110_{(2)}\) using truncation, and \(x\approx 0.10111_{(2)}\) using rounding.

Python has a built-in function round for rounding decimal numbers, as shown below:

>>> from math import pi
>>> pi
3.141592653589793
>>> round(pi,3)
3.142
>>> round(pi,5)
3.14159

There is no built-in function for truncating a number to a given number of digits. However, the following truncate function can readily be defined, allowing us to truncate numbers similarly to the round function shown above.

>>> import math
>>> def truncate(number, digits):
...     stepper = pow(10.0, digits)                    # shift the desired digits to the left of the decimal point
...     return math.trunc(stepper * number) / stepper  # drop the remaining fraction, then shift back
...
>>> truncate(pi,3)
3.141
>>> truncate(pi,5)
3.14159

1.3.4. Machine Epsilon

A concept that is useful in quantifying the error caused by rounding or truncation is the notion of machine \(\varepsilon\) (epsilon). There are a number of (slightly different) definitions in the literature, depending on whether truncation or rounding is used, specific rounding rules, etc. Here, we will define the machine \(\varepsilon\) as the smallest positive machine number, such that:

\[1 + \varepsilon \neq 1,\enspace\enspace\enspace\enspace\mbox{(on the computer)}\]

Why isn’t the above inequality always true, for any \(\varepsilon > 0\)? The reason is that when subject to limitations imposed by the computer precision, some numbers are “too small” to affect the result of an operation, for example:

\[\begin{split}1 &=& \enspace 1.\underbrace{000\ldots 000}_{\mbox{23 digits}}\times 2^0 \\ 2^{-25} &=& \enspace 0.\underbrace{000\ldots 000}_{\mbox{23 digits}}01\times 2^0 \\ \Rightarrow 1 + 2^{-25} &=& \enspace 1.\underbrace{000\ldots 000}_{\mbox{23 digits}}01\times 2^0\end{split}\]

When rounding (or truncating) the last number to \(23\) binary significant digits corresponding to single precision, the result would be exactly the same as the representation of the number \(x=1\)! Thus, on the computer, we have \(1+2^{-25} = 1\), and consequently \(2^{-25}\) is smaller than the machine epsilon. Notice that the smallest positive number that would actually achieve \(1+\varepsilon\neq 1\) with single precision machine numbers is \(\varepsilon=2^{-24}\) (and we are even relying on a “round upwards” convention for tie breaking to come up with a value this small), which will be called the machine \(\varepsilon\) in this case. For double precision, the machine \(\varepsilon\) is \(2^{-53}\).
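
This absorption effect is easy to observe with NumPy's single-precision float32 type. Note that actual hardware follows the IEEE round-to-nearest-even rule by default, so the borderline value \(2^{-24}\) is also absorbed, in line with the caveat about tie breaking above:

>>> import numpy as np
>>> one = np.float32(1.0)
>>> print(one + np.float32(2**-25) == one)   # 2^-25 is absorbed completely
True
>>> print(one + np.float32(2**-24) == one)   # tie case: also absorbed under round-to-even
True
>>> print(one + np.float32(2**-23) == one)   # 2^-23 does change the result
False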

The significance of the machine \(\varepsilon\) is that it provides an upper bound for the relative error of representing any number to the precision available on the computer. Thus, if \(q>0\) is the intended numerical quantity, and \(\bar q\) is the closest machine-precision approximation, then:

(1)\[\left|\frac{q-\bar q}{q}\right| \leq \varepsilon\]

where \(\varepsilon\) is the machine epsilon for the degree of precision used. By letting \(\delta = (\bar q-q)/q\), we can write inequality (1) in the following form:

(2)\[\bar q = q(1+\delta),\enspace\enspace\enspace |\delta|\leq\varepsilon\]

NumPy provides the function finfo to retrieve the machine epsilon, as shown below. Note, however, that NumPy follows a slightly different convention: the reported eps is the gap between \(1\) and the next larger machine number, i.e. \(2^{-23}\) in single precision and \(2^{-52}\) in double precision, which is twice the value of \(\varepsilon\) as defined above.

>>> import numpy as np
>>> print(np.finfo(np.float32).eps)     # single precision
1.1920929e-07
>>> print(np.finfo(float).eps)          # double precision
2.220446049250313e-16
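
Continuing the session above, a quick sanity check of inequality (1): storing \(q=0.4\) in single precision incurs a relative error well below the single-precision \(\varepsilon\):

>>> q = 0.4                            # intended value
>>> q_bar = float(np.float32(q))       # closest single-precision machine number
>>> rel_err = abs(q - q_bar) / abs(q)
>>> print(f"{rel_err:.2e}")
1.49e-08
>>> print(rel_err <= np.finfo(np.float32).eps)
True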