Floating-point Arithmetic
-------------------------

As shown in the previous section, fractional numbers are a common occurrence in
scientific computing, which naturally brings up the question of how much
*precision* is sufficient for a particular application. Digital computers use the
*floating-point* number system to represent the set of real numbers
:math:`\mathbb R` using *scientific notation*. For any decimal number :math:`x`
(assuming :math:`x` is a terminating decimal, i.e. it has finitely many non-zero
digits) we can write:

.. math:: x=a\times 10^b, \enspace\enspace\enspace\mbox{where}\enspace\enspace 1\leq\vert a\vert<10

The only exception is :math:`x=0`, where we simply set :math:`a=b=0`. Every
decimal (or base :math:`10`) number can be written as:

.. math:: a_k a_{k-1} \ldots a_2 a_1 a_0 . a_{-1} a_{-2} a_{-3}\ldots a_{-m} = \sum_{i=-m}^k a_i 10^i

Note how the decimal point moves, or *floats*, as the power of :math:`10`
changes. Some examples are given below:

.. math::

   \begin{array}{|c|c|c|c|c|c|c|c|}
   \hline
   x & a_3 & a_2 & a_1 & a_0. & a_{-1} & a_{-2} & a_{-3} \\
   \hline
   3.14  &   &   &   & 3 & 1 & 4 &   \\
   0.037 &   &   &   &   &   & 3 & 7 \\
   2012  & 2 & 0 & 1 & 2 &   &   &   \\
   \hline
   \end{array}

Similarly, binary (base :math:`2`) fractional numbers can be written as:

.. math:: b_k b_{k-1} \ldots b_2 b_1 b_0 . b_{-1} b_{-2} b_{-3}\ldots b_{-m} = \sum_{i=-m}^k b_i 2^i

where every digit :math:`b_i` is now only allowed to equal :math:`0` or
:math:`1`. For example:

* :math:`5.75_{(10)} = 4+1+0.5+0.25 = 1\times2^2 + 1\times 2^0 + 1\times 2^{-1} +1\times 2^{-2} = 101.11_{(2)}`
* :math:`17.5_{(10)} = 16 + 1 + 0.5 = 1\times 2^4 + 1\times 2^0 + 1\times 2^{-1} = 10001.1_{(2)}`

Note that certain numbers that are finite (terminating) decimals are actually
periodic (or repeating) in binary, for example:

.. math:: 0.4_{(10)} = 0.01100110011\ldots_{(2)} = 0.011\overline{0011}_{(2)}

Machine Numbers
~~~~~~~~~~~~~~~

This is an abbreviation for binary floating-point numbers, which is how real
numbers are stored on modern digital computers. In scientific notation,
:math:`x=\pm a\times 2^b`, where :math:`a` is called the *mantissa* and :math:`b`
the *exponent*. We also follow the convention that :math:`1\leq a<2`; the idea
being that for any nonzero number :math:`x`, we can always divide it by an
appropriate power of :math:`2`, such that the result will be within
:math:`[1,2)`. For example:

.. math:: x = 5_{(10)} = 1.25_{(10)}\times 2^2 = 1.01_{(2)}\times 2^2

Thus, a machine number is stored as:

.. math:: x=\pm 1.a_1 a_2\ldots a_{k-1}a_k\times 2^b

* In *single precision* we store :math:`k=23` binary digits, and the exponent
  :math:`b` ranges between :math:`-126\leq b\leq 127`. The largest number we can
  thus represent is :math:`(2-2^{-23})\times 2^{127}\approx 3.4\times 10^{38}`.
* In *double precision* we store :math:`k=52` binary digits, and the exponent
  :math:`b` ranges between :math:`-1022\leq b\leq 1023`. The largest number we
  can thus represent is :math:`(2-2^{-52})\times 2^{1023}\approx 1.8\times 10^{308}`.

In other words, single precision provides :math:`23` binary significant digits.
In order to translate this into familiar decimal terms, we note that
:math:`2^{10}\approx 10^3`; thus, :math:`10` binary significant digits are
roughly equivalent to :math:`3` decimal significant digits. Using this, we can
say that single precision provides approximately :math:`7` decimal significant
digits, while double precision offers slightly more than :math:`15`.
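
The figures quoted above are easy to verify interactively. The short session
below is only a sketch and assumes NumPy is available (its ``finfo`` function is
used again in the Machine Epsilon subsection further down); the ``nmant``
attribute reports the number of stored mantissa bits and ``max`` the largest
representable number, and the last two lines show how many decimal digits of
:math:`\pi` survive the conversion to single precision: ::

    >>> import numpy as np
    >>> np.finfo(np.float32).nmant          # stored mantissa bits, single precision
    23
    >>> float(np.finfo(np.float32).max)     # largest single-precision number
    3.4028234663852886e+38
    >>> np.finfo(np.float64).nmant          # stored mantissa bits, double precision
    52
    >>> float(np.finfo(np.float64).max)     # largest double-precision number
    1.7976931348623157e+308
    >>> np.pi                               # pi in double precision: ~16 correct decimal digits
    3.141592653589793
    >>> float(np.float32(np.pi))            # pi rounded to single precision: ~7 correct digits
    3.1415927410125732
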
Absolute and Relative Error
~~~~~~~~~~~~~~~~~~~~~~~~~~~

As we saw before, all computations on a computer are approximate by nature, due
to the limited precision available. As a consequence, some amount of *error* has
to be tolerated in any computation. In order to better understand these errors,
the absolute and relative error measures are used. Let :math:`q` denote the
exact (analytic) quantity that we expect out of a computation, and let
:math:`\bar q` denote the (likely compromised) value actually generated by the
computer.

The *absolute error* is :math:`e=\vert q - \bar q\vert`. This is useful for
framing the result within a certain interval, since :math:`e\leq\delta` implies
:math:`q\in[\bar q-\delta,\bar q + \delta]`.

The *relative error* is :math:`e=\vert q - \bar q\vert / \vert q\vert`. The
result may be expressed as a percentage and is useful for assessing the error
relative to the magnitude of the exact quantity. For example, an absolute error
of :math:`10^{-3}` may be insignificant when the intended value of :math:`q` is
on the order of :math:`10^6`, but would be very severe if :math:`q\approx 10^{-2}`.

Rounding and Truncation
~~~~~~~~~~~~~~~~~~~~~~~

When storing a number on the computer, if the number contains more digits than a
machine number can hold, an approximation is made via *rounding* or
*truncation*. Truncation constructs the machine number by simply discarding the
significant digits that cannot be stored; rounding approximates the quantity
with the *closest* machine-precision number. For example, when approximating
:math:`\pi=3.1415926535\ldots` to :math:`5` decimal significant digits,
truncation would give :math:`\pi\approx 3.1415` while the rounded result would
be :math:`\pi\approx 3.1416`.

Rounding and truncation are similarly defined for binary numbers. For example,
:math:`x=0.1011011101110\ldots_{(2)}` would be approximated to :math:`5` binary
significant digits as :math:`x\approx 0.10110_{(2)}` using truncation, and
:math:`x\approx 0.10111_{(2)}` when rounded.

Python has a built-in function ``round`` for rounding decimal numbers, as shown
below: ::

    >>> from math import pi
    >>> pi
    3.141592653589793
    >>> round(pi, 3)
    3.142
    >>> round(pi, 5)
    3.14159

There is no built-in function for truncating a number to a given number of
digits. However, the following ``truncate`` function can be readily defined and
allows us to truncate numbers in a manner similar to the ``round`` function
shown above: ::

    >>> import math
    >>> def truncate(number, digits):
    ...     stepper = pow(10.0, digits)                  # shift the desired digits left of the point
    ...     return math.trunc(stepper*number)/stepper    # drop the rest, then shift back
    ...
    >>> truncate(pi, 3)
    3.141
    >>> truncate(pi, 5)
    3.14159
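
Combining the two subsections above, the absolute and relative error of such a
rounded value can be computed directly. The session below is only a small
illustrative sketch; the trailing digits of the errors are themselves affected
by round-off, so the output is formatted to two significant figures: ::

    >>> from math import pi
    >>> q_bar = round(pi, 4)                 # approximation of pi, rounded to 3.1416
    >>> abs_error = abs(pi - q_bar)          # absolute error |q - q_bar|
    >>> rel_error = abs_error/abs(pi)        # relative error |q - q_bar|/|q|
    >>> print("%.1e  %.1e" % (abs_error, rel_error))
    7.3e-06  2.3e-06
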
Machine Epsilon
~~~~~~~~~~~~~~~

A concept that is useful in quantifying the error caused by rounding or
truncation is the notion of *machine* :math:`\varepsilon` *(epsilon)*. There are
a number of (slightly different) definitions in the literature, depending on
whether truncation or rounding is used, the specific rounding rules, etc. Here,
we will define the machine :math:`\varepsilon` as the smallest positive machine
number such that:

.. math:: 1 + \varepsilon \neq 1,\enspace\enspace\enspace\enspace\mbox{(on the computer)}

Why isn't the above inequality always true, for any :math:`\varepsilon > 0`? The
reason is that, subject to the limitations imposed by the computer precision,
some numbers are "too small" to affect the result of an operation, for example:

.. math::

   1 &=& \enspace 1.\underbrace{000\ldots 000}_{\mbox{23 digits}}\times 2^0 \\
   2^{-25} &=& \enspace 0.\underbrace{000\ldots 000}_{\mbox{23 digits}}01\times 2^0 \\
   \Rightarrow 1 + 2^{-25} &=& \enspace 1.\underbrace{000\ldots 000}_{\mbox{23 digits}}01\times 2^0

When rounding (or truncating) the last number to the :math:`23` binary
significant digits of single precision, the result would be exactly the same as
the representation of the number :math:`x=1`! Thus, on the computer, we have
:math:`1+2^{-25} = 1`, and consequently :math:`2^{-25}` is smaller than the
machine epsilon. Notice that the smallest positive number that would actually
achieve :math:`1+\varepsilon\neq 1` with single precision machine numbers is
:math:`\varepsilon=2^{-24}` (and we are even relying on a "round upwards"
convention for tie breaking to come up with a value this small), which will be
called the machine :math:`\varepsilon` in this case. For double precision, the
machine :math:`\varepsilon` is :math:`2^{-53}`.

The significance of the machine :math:`\varepsilon` is that it provides an upper
bound for the relative error of representing any number to the precision
available on the computer. Thus, if :math:`q>0` is the intended numerical
quantity, and :math:`\bar q` is the closest machine-precision approximation,
then:

.. math:: \left|\frac{q-\bar q}{q}\right| \leq \varepsilon
   :label: relative-error

where :math:`\varepsilon` is the machine epsilon for the degree of precision
used. By letting :math:`\delta = (\bar q-q)/q`, we can write inequality
:eq:`relative-error` in the following form:

.. math:: \bar q = q(1+\delta),\enspace\enspace\enspace |\delta|\leq\varepsilon
   :label: machine-error

NumPy provides the function ``finfo`` to retrieve the machine epsilon, as shown
below: ::

    >>> import numpy as np
    >>> print(np.finfo(np.float32).eps)   # single precision
    1.19209e-07
    >>> print(np.finfo(float).eps)        # double precision
    2.22044604925e-16

Note that NumPy defines ``eps`` as the gap between :math:`1` and the next larger
representable number (:math:`2^{-23}` in single precision and :math:`2^{-52}` in
double precision), which is twice the value obtained from the
:math:`1+\varepsilon\neq 1` definition used above; this is one of the slightly
different conventions mentioned earlier.
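
The absorption effect derived above can also be checked directly. The following
is a brief sketch, again assuming NumPy is available; it performs the addition
in single precision, first with :math:`2^{-25}` and then with NumPy's value of
the machine epsilon: ::

    >>> import numpy as np
    >>> one = np.float32(1.0)
    >>> print(one + np.float32(2.0**-25) == one)       # 2**-25 is absorbed by the rounding
    True
    >>> print(one + np.finfo(np.float32).eps == one)   # adding eps does change the result
    False
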