Floating-point Arithmetic
-------------------------

As shown in the previous section, fractional numbers are a common occurrence in
scientific computing, which naturally brings up the question of how much
*precision* is sufficient for a particular application. Digital computers use the
*floating-point* number system to represent the set of real numbers
:math:`\mathbb R` using *scientific notation*. For any decimal number :math:`x`
(assuming :math:`x` is a terminating decimal, i.e. it has finitely many non-zero
digits) we can write:

.. math:: x=a\times 10^b, \enspace\enspace\enspace\mbox{where}\enspace\enspace 1\leq\vert a\vert<10

The only exception is :math:`x=0`, where we simply set :math:`a=b=0`. Every
decimal (or base :math:`10`) number can be written as:

.. math:: a_k a_{k-1} \ldots a_2 a_1 a_0 . a_{-1} a_{-2} a_{-3}\ldots a_{-m} = \sum_{i=-m}^k a_i 10^i

Note how the decimal point moves, or *floats*, as the power of :math:`10`
changes. Some examples are given below:

.. math::

   \begin{array}{|c|c|c|c|c|c|c|c|}
   \hline
   x & a_3 & a_2 & a_1 & a_0. & a_{-1} & a_{-2} & a_{-3} \\
   \hline
   3.14  &   &   &   & 3 & 1 & 4 &   \\
   0.037 &   &   &   &   &   & 3 & 7 \\
   2012  & 2 & 0 & 1 & 2 &   &   &   \\
   \hline
   \end{array}

Similarly, binary (base :math:`2`) fractional numbers can be written as:

.. math:: b_k b_{k-1} \ldots b_2 b_1 b_0 . b_{-1} b_{-2} b_{-3}\ldots b_{-m} = \sum_{i=-m}^k b_i 2^i

where every digit :math:`b_i` is now only allowed to equal :math:`0` or
:math:`1`. For example:

* :math:`5.75_{(10)} = 4+1+0.5+0.25 = 1\times2^2 + 1\times 2^0 + 1\times 2^{-1} +1\times 2^{-2} = 101.11_{(2)}`
* :math:`17.5_{(10)} = 16 + 1 + 0.5 = 1\times 2^4 + 1\times 2^0 + 1\times 2^{-1} = 10001.1_{(2)}`

Note that certain numbers that are finite (terminating) decimals are actually
periodic (or repeating) in binary, for example:

.. math:: 0.4_{(10)} = 0.01100110011\ldots_{(2)} = 0.011\overline{0011}_{(2)}

Machine Numbers
~~~~~~~~~~~~~~~

This is an abbreviation for binary floating-point numbers, which is how real
numbers are stored on modern digital computers. In scientific notation,
:math:`x=\pm a\times 2^b`, where :math:`a` is called the *mantissa* and :math:`b`
the *exponent*. We also follow the convention that :math:`1\leq a<2`; the idea
being that for any nonzero number :math:`x`, we can always divide it by an
appropriate power of :math:`2`, such that the result will be within
:math:`[1,2)`. For example:

.. math:: x = 5_{(10)} = 1.25_{(10)}\times 2^2 = 1.01_{(2)}\times 2^2

Thus, a machine number is stored as:

.. math:: x=\pm 1.a_1 a_2\ldots a_{k-1}a_k\times 2^b

* In *single precision* we store :math:`k=23` binary digits, and the exponent
  :math:`b` ranges between :math:`-126\leq b\leq 127`. The largest number we can
  thus represent is :math:`(2-2^{-23})\times 2^{127}\approx 3.4\times 10^{38}`.
* In *double precision* we store :math:`k=52` binary digits, and the exponent
  :math:`b` ranges between :math:`-1022\leq b\leq 1023`. The largest number we
  can thus represent is :math:`(2-2^{-52})\times 2^{1023}\approx 1.8\times 10^{308}`.

In other words, single precision provides :math:`23` binary significant digits.
In order to translate this into familiar decimal terms, we note that
:math:`2^{10}\approx 10^3`; thus, :math:`10` binary significant digits are
roughly equivalent to :math:`3` decimal significant digits. Using this, we can
say that single precision provides approximately :math:`7` decimal significant
digits, while double precision offers slightly more than :math:`15`.
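
The figures quoted above are easy to verify interactively. The short session
below is only a sketch and assumes NumPy is available (its ``finfo`` function is
used again in the Machine Epsilon subsection further down); the ``nmant``
attribute reports the number of stored mantissa bits and ``max`` the largest
representable number, and the last two lines show how many decimal digits of
:math:`\pi` survive the conversion to single precision: ::

    >>> import numpy as np
    >>> np.finfo(np.float32).nmant          # stored mantissa bits, single precision
    23
    >>> float(np.finfo(np.float32).max)     # largest single-precision number
    3.4028234663852886e+38
    >>> np.finfo(np.float64).nmant          # stored mantissa bits, double precision
    52
    >>> float(np.finfo(np.float64).max)     # largest double-precision number
    1.7976931348623157e+308
    >>> np.pi                               # pi in double precision: ~16 correct decimal digits
    3.141592653589793
    >>> float(np.float32(np.pi))            # pi rounded to single precision: ~7 correct digits
    3.1415927410125732
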
Absolute and Relative Error
~~~~~~~~~~~~~~~~~~~~~~~~~~~

As we saw before, all computations on a computer are approximate by nature, due
to the limited precision available. As a consequence, some amount of *error* has
to be tolerated in any computation. In order to better understand these errors,
the absolute and relative error measures are used. Let :math:`q` denote the
exact (analytic) quantity that we expect out of a computation, and let
:math:`\bar q` denote the (likely compromised) value actually generated by the
computer.

The *absolute error* is :math:`e=\vert q - \bar q\vert`. This is useful for
framing the result within a certain interval, since :math:`e\leq\delta` implies
:math:`q\in[\bar q-\delta,\bar q + \delta]`.

The *relative error* is :math:`e=\vert q - \bar q\vert / \vert q\vert`. The
result may be expressed as a percentage and is useful for assessing the error
relative to the magnitude of the exact quantity. For example, an absolute error
of :math:`10^{-3}` may be insignificant when the intended value of :math:`q` is
on the order of :math:`10^6`, but would be very severe if :math:`q\approx 10^{-2}`.

Rounding and Truncation
~~~~~~~~~~~~~~~~~~~~~~~

When storing a number on the computer, if the number contains more digits than a
machine number can hold, an approximation is made via *rounding* or
*truncation*. Truncation constructs the machine number by simply discarding the
significant digits that cannot be stored; rounding approximates the quantity
with the *closest* machine-precision number. For example, when approximating
:math:`\pi=3.1415926535\ldots` to :math:`5` decimal significant digits,
truncation would give :math:`\pi\approx 3.1415` while the rounded result would
be :math:`\pi\approx 3.1416`.

Rounding and truncation are similarly defined for binary numbers. For example,
:math:`x=0.1011011101110\ldots_{(2)}` would be approximated to :math:`5` binary
significant digits as :math:`x\approx 0.10110_{(2)}` using truncation, and
:math:`x\approx 0.10111_{(2)}` when rounded.

Python has a built-in function ``round`` for rounding decimal numbers, as shown
below: ::

    >>> from math import pi
    >>> pi
    3.141592653589793
    >>> round(pi, 3)
    3.142
    >>> round(pi, 5)
    3.14159

There is no built-in function for truncating a number to a given number of
digits. However, the following ``truncate`` function can be readily defined and
allows us to truncate numbers in a manner similar to the ``round`` function
shown above: ::

    >>> import math
    >>> def truncate(number, digits):
    ...     stepper = pow(10.0, digits)                  # shift the desired digits left of the point
    ...     return math.trunc(stepper*number)/stepper    # drop the rest, then shift back
    ...
    >>> truncate(pi, 3)
    3.141
    >>> truncate(pi, 5)
    3.14159
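
Combining the two subsections above, the absolute and relative error of such a
rounded value can be computed directly. The session below is only a small
illustrative sketch; the trailing digits of the errors are themselves affected
by round-off, so the output is formatted to two significant figures: ::

    >>> from math import pi
    >>> q_bar = round(pi, 4)                 # approximation of pi, rounded to 3.1416
    >>> abs_error = abs(pi - q_bar)          # absolute error |q - q_bar|
    >>> rel_error = abs_error/abs(pi)        # relative error |q - q_bar|/|q|
    >>> print("%.1e  %.1e" % (abs_error, rel_error))
    7.3e-06  2.3e-06
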
Machine Epsilon
~~~~~~~~~~~~~~~

A concept that is useful in quantifying the error caused by rounding or
truncation is the notion of *machine* :math:`\varepsilon` *(epsilon)*. There are
a number of (slightly different) definitions in the literature, depending on
whether truncation or rounding is used, the specific rounding rules, etc. Here,
we will define the machine :math:`\varepsilon` as the smallest positive machine
number such that:

.. math:: 1 + \varepsilon \neq 1,\enspace\enspace\enspace\enspace\mbox{(on the computer)}

Why isn't the above inequality always true, for any :math:`\varepsilon > 0`? The
reason is that, subject to the limitations imposed by the computer precision,
some numbers are "too small" to affect the result of an operation, for example:

.. math::

   1 &=& \enspace 1.\underbrace{000\ldots 000}_{\mbox{23 digits}}\times 2^0 \\
   2^{-25} &=& \enspace 0.\underbrace{000\ldots 000}_{\mbox{23 digits}}01\times 2^0 \\
   \Rightarrow 1 + 2^{-25} &=& \enspace 1.\underbrace{000\ldots 000}_{\mbox{23 digits}}01\times 2^0

When rounding (or truncating) the last number to the :math:`23` binary
significant digits of single precision, the result would be exactly the same as
the representation of the number :math:`x=1`! Thus, on the computer, we have
:math:`1+2^{-25} = 1`, and consequently :math:`2^{-25}` is smaller than the
machine epsilon. Notice that the smallest positive number that would actually
achieve :math:`1+\varepsilon\neq 1` with single precision machine numbers is
:math:`\varepsilon=2^{-24}` (and we are even relying on a "round upwards"
convention for tie breaking to come up with a value this small), which will be
called the machine :math:`\varepsilon` in this case. For double precision, the
machine :math:`\varepsilon` is :math:`2^{-53}`.

The significance of the machine :math:`\varepsilon` is that it provides an upper
bound for the relative error of representing any number to the precision
available on the computer. Thus, if :math:`q>0` is the intended numerical
quantity, and :math:`\bar q` is the closest machine-precision approximation,
then:

.. math:: \left|\frac{q-\bar q}{q}\right| \leq \varepsilon
   :label: relative-error

where :math:`\varepsilon` is the machine epsilon for the degree of precision
used. By letting :math:`\delta = (\bar q-q)/q`, we can write inequality
:eq:`relative-error` in the following form:

.. math:: \bar q = q(1+\delta),\enspace\enspace\enspace |\delta|\leq\varepsilon
   :label: machine-error

NumPy provides the function ``finfo`` to retrieve the machine epsilon, as shown
below: ::

    >>> import numpy as np
    >>> print(np.finfo(np.float32).eps)   # single precision
    1.19209e-07
    >>> print(np.finfo(float).eps)        # double precision
    2.22044604925e-16

Note that NumPy defines ``eps`` as the gap between :math:`1` and the next larger
representable number (:math:`2^{-23}` in single precision and :math:`2^{-52}` in
double precision), which is twice the value obtained from the
:math:`1+\varepsilon\neq 1` definition used above; this is one of the slightly
different conventions mentioned earlier.
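
The absorption effect derived above can also be checked directly. The following
is a brief sketch, again assuming NumPy is available; it performs the addition
in single precision, first with :math:`2^{-25}` and then with NumPy's value of
the machine epsilon: ::

    >>> import numpy as np
    >>> one = np.float32(1.0)
    >>> print(one + np.float32(2.0**-25) == one)       # 2**-25 is absorbed by the rounding
    True
    >>> print(one + np.finfo(np.float32).eps == one)   # adding eps does change the result
    False
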