Overdetermined Systems
----------------------

So far, we have considered linear systems :math:`Ax=b` with the *same* number of equations and unknowns (i.e., :math:`A\in\mathbb R^{n\times n}`). In the case where :math:`A\in\mathbb R^{m\times n}`, with :math:`m>n` (more equations than unknowns), the existence of a true solution is not guaranteed; in this case we look for the "best possible" substitute for a solution. Before analyzing what that means, let's look at how such problems arise.

As an example, suppose that in an experiment we measure the pressure of a gas in a closed container as a function of the temperature :math:`T` (in degrees Fahrenheit). From physics, we know that

.. math::

   pV &=& \enspace nR\frac{5}{9}(T+459.67) \\
   \Rightarrow p &=& \enspace\alpha T+\beta, \enspace\enspace\enspace \alpha=\frac{5nR}{9V}, \enspace \beta=\frac{5nR\cdot 459.67}{9V}

What are :math:`\alpha` and :math:`\beta`? Experimentally, the measured pairs should ideally lie on a straight line :math:`y=c_1x+c_0`, but they do not, due to measurement error. If we have :math:`n` measurement pairs :math:`(x_1,y_1),\ldots,(x_n,y_n)`, we would have wanted:

.. math::

   \left.
   \begin{array}{ccccc}
   y_1 & = & c_1x_1 & + & c_0 \\
   y_2 & = & c_1x_2 & + & c_0 \\
       & \vdots &   &   &     \\
   y_n & = & c_1x_n & + & c_0
   \end{array}\right\}\Rightarrow\left[
   \begin{array}{cc}
   x_1 & 1 \\
   x_2 & 1 \\
   \vdots & \vdots \\
   x_n & 1
   \end{array}
   \right]\left[
   \begin{array}{c}
   c_1 \\
   c_0
   \end{array}
   \right]=\left[
   \begin{array}{c}
   y_1 \\
   y_2 \\
   \vdots \\
   y_n
   \end{array}
   \right]

Here, :math:`A_{n\times2}x_{2\times1}=b_{n\times1}` is a rectangular system. We cannot hope to find a true solution to this system. Instead, let's try to find an "approximate" solution, such that :math:`Ax\approx b`. Let's look at the residual of this approximation. The residual at each data point is:

.. math::

   r_i=y_i-f(x_i)=y_i-c_1x_i-c_0

If we write the vector of all residuals:

.. math::

   r=\left[
   \begin{array}{c}
   r_1 \\
   r_2 \\
   \vdots \\
   r_n
   \end{array}
   \right]=\left[
   \begin{array}{c}
   y_1-c_1x_1-c_0 \\
   y_2-c_1x_2-c_0 \\
   \vdots \\
   y_n-c_1x_n-c_0
   \end{array}
   \right]=\left[
   \begin{array}{c}
   y_1 \\
   y_2 \\
   \vdots \\
   y_n
   \end{array}
   \right]-\left[
   \begin{array}{cc}
   x_1 & 1 \\
   x_2 & 1 \\
   \vdots & \vdots \\
   x_n & 1
   \end{array}
   \right]\left[
   \begin{array}{c}
   c_1 \\
   c_0
   \end{array}
   \right]=b-Ax

Although we can't find an :math:`x` such that :math:`Ax=b` (and thus :math:`r=0`), we can at least try to make :math:`r` *small*.

As another example, consider the problem of finding the best parabola :math:`f(x)=c_2x^2+c_1x+c_0` that fits measurements :math:`(x_1,y_1),\ldots,(x_n,y_n)`. We would like

.. math::

   \left.
   \begin{array}{c}
   f(x_1)\approx y_1 \\
   f(x_2)\approx y_2 \\
   \vdots \\
   f(x_n)\approx y_n
   \end{array}
   \right\}\Leftrightarrow
   \left.
   \begin{array}{c}
   c_2x_1^2+c_1x_1+c_0\approx y_1 \\
   c_2x_2^2+c_1x_2+c_0\approx y_2 \\
   \vdots \\
   c_2x_n^2+c_1x_n+c_0\approx y_n
   \end{array}
   \right\}\Rightarrow
   \underbrace{
   \left[
   \begin{array}{ccc}
   x_1^2 & x_1 & 1 \\
   x_2^2 & x_2 & 1 \\
   \vdots & \vdots & \vdots \\
   x_n^2 & x_n & 1
   \end{array}
   \right]}_A\underbrace{\left[
   \begin{array}{c}
   c_2 \\
   c_1 \\
   c_0
   \end{array}
   \right]}_x\approx\underbrace{\left[
   \begin{array}{c}
   y_1 \\
   y_2 \\
   \vdots \\
   y_n
   \end{array}
   \right]}_b

Once again, we would like to make :math:`r=b-Ax` as small as possible. How do we quantify :math:`r` being small? :math:`\Rightarrow` using a norm! We could ask that :math:`\lVert r\rVert_1`, :math:`\lVert r\rVert_2`, or :math:`\lVert r\rVert_\infty` be as small as possible.
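To make the construction concrete, here is a minimal NumPy sketch (the measurement values are made up purely for illustration) that assembles the rectangular system :math:`Ax\approx b` for the line fit and evaluates the residual vector :math:`r=b-Ax`, and its norms, for a guessed pair of coefficients:

.. code-block:: python

   import numpy as np

   # Hypothetical measurements (x_i, y_i), roughly following y = 2x + 1.
   x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
   y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

   # Rectangular system for the line fit y ~ c1*x + c0:
   # each row of A is [x_i, 1]; the unknown vector is (c1, c0).
   A = np.column_stack([x, np.ones_like(x)])
   b = y

   # For any candidate coefficients c = (c1, c0), the residual is r = b - A c.
   c = np.array([2.0, 1.0])                 # the guessed line y = 2x + 1
   r = b - A @ c
   print(r)                                 # one residual per measurement
   print(np.linalg.norm(r, 1),              # ||r||_1
         np.linalg.norm(r, 2),              # ||r||_2
         np.linalg.norm(r, np.inf))         # ||r||_inf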
Any of these norms would be a reasonable choice for minimization (the :math:`1`- and :math:`\infty`-norms are arguably the most intuitive). However, we typically use the :math:`2`-norm for this purpose, because it is the easiest to work with in this problem.

.. topic:: Definition

   The *least squares solution* of the overdetermined system :math:`Ax\approx b` is the vector :math:`x` that minimizes :math:`\lVert r\rVert_2=\lVert b-Ax\rVert_2`.

Define :math:`Q(x)=Q(x_1,x_2,\ldots,x_n)=\lVert b-Ax\rVert_2^2`, where :math:`x=(x_1,\ldots,x_n)` and :math:`A\in\mathbb R^{m\times n}, b\in\mathbb R^m` (:math:`m>n`). The least squares solution is the set of values :math:`x_1,\ldots,x_n` that *minimize* :math:`Q(x_1,x_2,\ldots,x_n)`.

.. math::

   Q(x_1,\ldots,x_n) &=& \enspace \lVert b-Ax\rVert_2^2=\lVert r\rVert_2^2=\sum_{i=1}^m r_i^2 \\
   r = b-Ax \Rightarrow r_i &=& \enspace b_i-(Ax)_i=b_i-\sum_{j=1}^n a_{ij}x_j \\
   \Rightarrow Q(x_1,\ldots,x_n) &=& \sum_{i=1}^m\left(b_i-\sum_{j=1}^n a_{ij}x_j\right)^2

If :math:`x_1,\ldots,x_n` are the values that *minimize* :math:`Q`, then they must satisfy the first-order conditions

.. math::

   \frac{\partial Q}{\partial x_1}=0,\frac{\partial Q}{\partial x_2}=0,\ldots,\frac{\partial Q}{\partial x_n}=0

We compute each partial derivative:

.. math::

   \frac{\partial Q}{\partial x_k} &=& \frac{\partial}{\partial x_k}\left(\sum_{i=1}^m \left(b_i-\sum_{j=1}^n a_{ij}x_j\right)^2\right) \\
   &=& \sum_{i=1}^m \frac{\partial}{\partial x_k}\left(b_i-\sum_{j=1}^n a_{ij}x_j\right)^2 \\
   &=& \sum_{i=1}^m 2\underbrace{\left(b_i-\sum_{j=1}^n a_{ij}x_j\right)}_{r_i}\frac{\partial}{\partial x_k}\left(b_i-\sum_{j=1}^n a_{ij}x_j\right) \\
   &=& \sum_{i=1}^m 2r_i(-a_{ik}) = -2\sum_{i=1}^m [A^T]_{ki}r_i = -2[A^Tr]_k=0 \\
   &\Rightarrow& \enspace [A^Tr]_k=0

Thus,

.. math::

   \left.
   \begin{array}{ccccccc}
   \partial Q/\partial x_1 & = & 0 & \Rightarrow & [A^Tr]_1 & = & 0 \\
   \partial Q/\partial x_2 & = & 0 & \Rightarrow & [A^Tr]_2 & = & 0 \\
   & & & \vdots & & & \\
   \partial Q/\partial x_n & = & 0 & \Rightarrow & [A^Tr]_n & = & 0
   \end{array}
   \right\}\Rightarrow
   \boxed{A^Tr=0}

Since :math:`r=b-Ax`, we have:

.. math::

   0=A^Tr=A^T(b-Ax)=A^Tb-A^TAx\Rightarrow\boxed{A^TAx=A^Tb}

The system above is called the *normal equations* system; it is a *square* system whose solution is the least squares solution of :math:`Ax\approx b`.

.. math::

   \underbrace{A^T_{n\times m}A_{m\times n}}_{n\times n} \underbrace{x_{n\times 1}}_{n\times 1}=\underbrace{A^T_{n\times m}b_{m\times 1}}_{n\times 1}

The normal equations *always* have a solution; moreover, if the columns of :math:`A` are linearly independent (which is usually the case in practice), :math:`A^TA` is non-singular and the least squares solution is unique.
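Below is a short NumPy sketch (reusing the made-up line-fit data from the earlier example) that forms and solves the :math:`2\times 2` normal equations and checks the result against NumPy's built-in least squares routine:

.. code-block:: python

   import numpy as np

   # Made-up measurements, as in the earlier sketch.
   x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
   y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
   A = np.column_stack([x, np.ones_like(x)])
   b = y

   # Normal equations: (A^T A) x = A^T b is a small square (2x2) system.
   coeffs = np.linalg.solve(A.T @ A, A.T @ b)
   c1, c0 = coeffs
   print(c1, c0)              # slope and intercept of the least squares line

   # Sanity check against NumPy's built-in least squares solver.
   coeffs_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
   print(np.allclose(coeffs, coeffs_ref))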
:math:`QR` factorization
~~~~~~~~~~~~~~~~~~~~~~~~

While the normal equations can adequately compute the least squares solution, the condition number of :math:`A^TA` is the *square* of the condition number of :math:`A` (as it would be defined if :math:`A` were a square matrix). An alternative method that does not suffer from this problematic conditioning is the :math:`QR` factorization.

.. topic:: Definition

   An :math:`n\times n` matrix :math:`Q` is called *orthonormal* if and only if

   .. math::

      Q^TQ=QQ^T=I

.. topic:: Theorem

   Let :math:`A\in\mathbb R^{m\times n}` (:math:`m>n`) have linearly independent columns. Then a decomposition :math:`A=QR` exists, such that :math:`Q\in\mathbb R^{m\times m}` is orthonormal and :math:`R\in\mathbb R^{m\times n}` is upper triangular, i.e.,

   .. math::

      R=\left(
      \begin{array}{c}
      \hat R \\
      O
      \end{array}
      \right)

   where :math:`\hat R` is an :math:`n\times n` upper triangular matrix.

Additionally, since :math:`A` has linearly independent columns, all diagonal elements of :math:`\hat R` satisfy :math:`r_{ii}\neq 0`. Now, let us write

.. math::

   Q=\left[
   \begin{array}{c|c}
   \hat Q & Q^\star
   \end{array}
   \right]

where :math:`\hat Q\in\mathbb R^{m\times n}` contains the first :math:`n` columns of :math:`Q` and :math:`Q^\star\in\mathbb R^{m\times (m-n)}` contains the last :math:`(m-n)` columns. Correspondingly, we write:

.. math::

   R=\left(
   \begin{array}{c}
   \hat R \\
   O
   \end{array}
   \right)

where :math:`\hat R\in\mathbb R^{n\times n}` (upper triangular) contains the first :math:`n` rows of :math:`R`. :math:`\hat R` is also *non-singular*, since its diagonal entries are non-zero. We can verify the following:

.. math::

   \hat Q^T\hat Q=I_{n\times n}\enspace\enspace\enspace(\mbox{although } \hat Q\hat Q^T\neq I_{m\times m}!)

Moreover, since the last :math:`m-n` rows of :math:`R` are zero, :math:`A=QR=\hat Q\hat R`. The factorization :math:`A=\hat Q\hat R` is the so-called *economy size* :math:`QR` factorization. Once we have computed :math:`\hat Q` and :math:`\hat R`, we observe that the normal equations can be written as:

.. math::

   A^TAx &=& \enspace A^Tb \\
   \Rightarrow \hat R^T\underbrace{\hat Q^T\hat Q}_{=I_{n\times n}}\hat Rx &=& \enspace \hat R^T\hat Q^Tb \\
   \Rightarrow \hat R^T\hat Rx &=& \enspace \hat R^T\hat Q^Tb \\
   \Rightarrow \boxed{\hat Rx=\hat Q^Tb}

The last step follows because :math:`\hat R^T` is invertible, so it can be cancelled from both sides. The benefit of using the :math:`QR` factorization is that :math:`\textsf{cond}(A^TA)=[\textsf{cond}(\hat R)]^2`, so the system :math:`\hat Rx=\hat Q^Tb` is *much better* conditioned than the normal equations.
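Here is a matching NumPy sketch (again with the made-up line-fit data used above); ``np.linalg.qr`` with ``mode='reduced'`` returns exactly the economy-size factors :math:`\hat Q` and :math:`\hat R`, after which only the small triangular system :math:`\hat Rx=\hat Q^Tb` needs to be solved:

.. code-block:: python

   import numpy as np

   # Made-up measurements, as in the earlier sketches.
   x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
   y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
   A = np.column_stack([x, np.ones_like(x)])
   b = y

   # Economy-size QR: Q_hat is m x n with orthonormal columns,
   # R_hat is n x n and upper triangular.
   Q_hat, R_hat = np.linalg.qr(A, mode='reduced')

   # Solve R_hat x = Q_hat^T b.  (A dedicated back-substitution routine
   # would exploit the triangular structure; np.linalg.solve is used here
   # only to keep the sketch self-contained.)
   coeffs = np.linalg.solve(R_hat, Q_hat.T @ b)
   print(coeffs)

   # Same answer as the normal equations, but without ever forming A^T A.
   print(np.allclose(coeffs, np.linalg.solve(A.T @ A, A.T @ b)))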