2.4. Matrix Norms and Condition Numbers

Just as with vectors, norms can be defined for (square) matrices. These norms let us measure how well-behaved a linear system is, in the sense of admitting stable numerical algorithms. They will also allow us to precisely quantify the convergence behavior of many iterative schemes, as we will see in later sections.

2.4.1. Matrix Norms

Definition

A matrix norm is a function \(\lVert\cdot\rVert : \mathbb R^{n\times n}\rightarrow \mathbb R\) that satisfies:

  1. \(\lVert M\rVert\geq 0\) for all \(M\in\mathbb R^{n\times n}\), with \(\lVert M\rVert = 0\) if and only if \(M=O\) (the zero matrix).
  2. \(\lVert \alpha M\rVert = \vert\alpha\vert\cdot\lVert M\rVert\)
  3. \(\lVert M+N\rVert\leq \lVert M\rVert + \lVert N\rVert\)
  4. \(\lVert M\cdot N\rVert \leq \lVert M\rVert\cdot\lVert N\rVert\)

Properties \((1)-(3)\) mirror the corresponding properties of vector norms, while property \((4)\) (submultiplicativity) has no vector norm counterpart. Although many other types of matrix norms exist, one common category is that of matrix norms induced by vector norms.

Definition

If \(\lVert\cdot\rVert_\star\) is a valid vector norm, its induced matrix norm is defined as:

\[\lVert M\rVert_\star = \max_{x\in\mathbb R^n,x\neq 0} \frac{\lVert Mx\rVert_\star}{\lVert x\rVert_\star}\]

or equivalently,

\[\lVert M\rVert_\star = \max_{x\in\mathbb R^n,\lVert x\rVert_\star=1} \lVert Mx\rVert_\star\]

Proof: Given any \(x\neq 0\), choose the vector \(y = \alpha x\), where \(\alpha = 1/\lVert x\rVert_\star\). Then,

\[\lVert y\rVert_\star = \lVert \alpha x\rVert_\star = |\alpha|\cdot\lVert x\rVert_\star = \frac{1}{\lVert x\rVert_\star}\cdot\lVert x\rVert_\star = 1\]

Also, from the definition of norms, the following relation holds

\[\lVert My\rVert_\star = \frac{\lVert My\rVert_\star}{1} = \frac{\lVert My\rVert_\star}{\lVert y\rVert_\star} = \frac{\lVert M\alpha x\rVert_\star}{\lVert \alpha x\rVert_\star} = \frac{|\alpha |}{|\alpha |}\cdot\frac{\lVert Mx\rVert_\star}{\lVert x\rVert_\star} = \frac{\lVert Mx\rVert_\star}{\lVert x\rVert_\star}\]

Thus, the two definitions above are equivalent.
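
As a quick numerical sanity check of this definition (a sketch using an arbitrary random matrix and the 1-norm; for a 2-D array, np.linalg.norm(M, 1) returns the induced matrix 1-norm), the ratio \(\lVert Mx\rVert_1/\lVert x\rVert_1\) never exceeds the induced norm, no matter which nonzero \(x\) is sampled:

import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))      # arbitrary test matrix

# Induced 1-norm as reported by NumPy (maximum absolute column sum).
induced_1 = np.linalg.norm(M, 1)

# Ratios ||Mx||_1 / ||x||_1 over many random nonzero vectors x.
xs = rng.standard_normal((10000, 4))
ratios = [np.linalg.norm(M @ x, 1) / np.linalg.norm(x, 1) for x in xs]

print(max(ratios) <= induced_1 + 1e-12)   # True: no sampled ratio exceeds the induced norm
print(max(ratios), induced_1)             # the induced norm is the supremum of these ratios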

Note again that not all valid matrix norms are induced by vector norms. One notable example is the very commonly used Frobenius norm:

\[\lVert M\rVert_F = \sqrt{\sum_{i,j=1}^n M_{ij}^2}\]
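
For instance, the Frobenius norm can be evaluated either directly from this definition or with NumPy's built-in np.linalg.norm (a small sketch; the matrix below is an arbitrary example):

import numpy as np

M = np.array([[1.0, 2.0], [3.0, 4.0]])    # arbitrary example matrix

# Frobenius norm from the definition: square root of the sum of squared entries.
frob_definition = np.sqrt(np.sum(M**2))

# NumPy's built-in Frobenius norm ('fro' is the default matrix norm in np.linalg.norm).
frob_numpy = np.linalg.norm(M, 'fro')

print(frob_definition, frob_numpy)        # both equal sqrt(30), approximately 5.477

In particular, \(\lVert I\rVert_F=\sqrt{n}\), which already hints that the Frobenius norm is not induced by any vector norm for \(n\geq 2\) (compare with the lemma on \(\lVert I\rVert\) later in this section).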

We can easily show that induced norms satisfy properties \((1)\) through \((4)\). Properties \((1)-(3)\) follow directly from the corresponding vector norm properties; for example, for the triangle inequality \((3)\),

\[\begin{split}\lVert M+N\rVert &=& \max_{x\neq 0} \frac{\lVert (M+N)x\rVert}{\lVert x\rVert} \leq \max_{x\neq 0} \frac{\lVert Mx\rVert + \lVert Nx\rVert}{\lVert x\rVert} \\ &\leq& \max_{x\neq 0} \frac{\lVert Mx\rVert}{\lVert x\rVert} + \max_{x\neq 0} \frac{\lVert Nx\rVert}{\lVert x\rVert} = \lVert M\rVert + \lVert N\rVert\end{split}\]

Property \((4)\) is slightly trickier to show. First, a lemma:

Lemma

If \(\lVert\cdot\rVert\) is the matrix norm induced by the vector norm \(\lVert\cdot\rVert\) (the same symbol is used for both), then

(1)\[\lVert Ax\rVert \leq \lVert A\rVert\cdot \lVert x\rVert\]

Proof: Since \(\lVert A\rVert = \max_{x\neq 0}\lVert Ax\rVert/\lVert x\rVert\), we have that for an arbitrary \(y\in\mathbb R^n (y\neq 0)\)

\[\lVert A\rVert = \max_{x\neq 0}\frac{\lVert Ax\rVert}{\lVert x\rVert}\geq \frac{\lVert Ay\rVert}{\lVert y\rVert} \Rightarrow \lVert Ay\rVert \leq \lVert A\rVert\cdot\lVert y\rVert\]

This was derived for \(y\neq 0\), but the inequality also holds trivially for \(y=0\), since both sides are then zero.

Now property \((4)\) can be easily proved using the above lemma:

\[\begin{split}\lVert MN\rVert &=& \max_{\lVert x\rVert=1} \lVert MNx\rVert \leq \max_{\lVert x\rVert=1} \lVert M\rVert\cdot \lVert Nx\rVert \\ &=& \lVert M\rVert\cdot \max_{\lVert x\rVert=1} \lVert Nx\rVert = \lVert M\rVert\cdot\lVert N\rVert \\ \Rightarrow \lVert MN\rVert &\leq& \lVert M\rVert\cdot\lVert N\rVert\end{split}\]
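
Submultiplicativity is also easy to spot-check numerically (a sketch with arbitrary random matrices and the induced 1-norm):

import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))     # arbitrary test matrices
N = rng.standard_normal((5, 5))

lhs = np.linalg.norm(M @ N, 1)                       # ||MN||_1
rhs = np.linalg.norm(M, 1) * np.linalg.norm(N, 1)    # ||M||_1 * ||N||_1
print(lhs <= rhs)                                    # True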

Note that when writing an expression such as (1), the matrix norm \(\lVert A\rVert\) is understood to be the norm induced by the vector norm used in \(\lVert Ax\rVert\) and \(\lVert x\rVert\). Thus,

\[\lVert Ax\rVert_1 \leq \lVert A\rVert_1\cdot \lVert x\rVert_1\]

and

\[\lVert Ax\rVert_\infty \leq \lVert A\rVert_\infty\cdot \lVert x\rVert_\infty\]

are both valid, but we cannot mix and match norms; for example, the following is not guaranteed to hold:

\[\lVert Ax\rVert_\infty \leq \lVert A\rVert_2\cdot \lVert x\rVert_1\]

Although the definition of an induced norm allowed us to prove certain properties, it does not necessarily provide a convenient formula for evaluating the matrix norm. Fortunately, such formulas do exist for the \(L_1\) and \(L_\infty\) induced matrix norms; they are given here without proof:

\[\begin{split}\lVert A\rVert_1 &=& \max_j \sum_{i=1}^n \vert A_{ij}\vert \enspace\enspace\enspace\mbox{(maximum absolute column sum)} \\ \lVert A\rVert_\infty &=& \max_i \sum_{j=1}^n \vert A_{ij}\vert \enspace\enspace\enspace\mbox{(maximum absolute row sum)}\end{split}\]

The formula for the \(L_2\) induced matrix norm is more complicated. We will see it when we study eigenvalues.
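
These formulas are easy to check numerically: for 2-D arrays, np.linalg.norm with ord=1 returns the maximum absolute column sum and with ord=np.inf the maximum absolute row sum (the matrix below is an arbitrary example):

import numpy as np

A = np.array([[ 1.0, -7.0,  2.0],
              [ 4.0,  5.0, -6.0],
              [-3.0,  0.0,  8.0]])    # arbitrary example matrix

col_sum = np.abs(A).sum(axis=0).max()  # maximum absolute column sum
row_sum = np.abs(A).sum(axis=1).max()  # maximum absolute row sum

print(col_sum, np.linalg.norm(A, 1))        # both give ||A||_1
print(row_sum, np.linalg.norm(A, np.inf))   # both give ||A||_inf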

2.4.2. Condition Numbers

When solving a linear system \(Ax=b\), computer algorithms typically provide only an approximation \(x_\textsf{approx}\) to the exact solution \(x_\textsf{exact}\). This is due to factors such as finite precision, round-off error, or even approximate solution algorithms. In any case, we have an error (in fact, an error vector) defined as:

\[e = x_\textsf{approx} - x_\textsf{exact}\]

Naturally, we would like to have an understanding of the magnitude of this error (for example, some appropriate norm \(\lVert e\rVert\)). However, the error cannot be directly measured because the exact solution \(x_\textsf{exact}\) is unknown. One remedy is offered via the residual vector defined as:

\[r = b-Ax_\textsf{approx}\]

The vector \(r\) is something we can compute practically since it involves only the known quantities \(b, A, x_\textsf{approx}\). Moreover, we have:

(2)\[\begin{split}r \enspace &=& \enspace b-Ax_\textsf{approx} \nonumber \\ \enspace &=& \enspace Ax_\textsf{exact} - Ax_\textsf{approx} \\ \enspace &=& \enspace -A(x_\textsf{approx} - x_\textsf{exact}) \\ \enspace &=& \enspace -Ae \\ \Rightarrow e \enspace &=& \enspace -A^{-1}r\end{split}\]

Equation (2) links the error with the residual. This allows us to write:

\[\lVert e\rVert = \lVert A^{-1}r\rVert \leq \lVert A^{-1}\rVert\cdot\lVert r\rVert\]

This inequality bounds the error as a function of \(\lVert A^{-1}\rVert\) and the norm of the computable vector \(r\); a small numerical illustration is given after the following remarks. Note that:

  • We can obtain this estimate without knowing the exact solution, but
  • We need \(\lVert A^{-1}\rVert\), and generally, computing \(\lVert A^{-1}\rVert\) is just as difficult as (if not more difficult than) finding \(x_\textsf{exact}\). However, there are some special cases where an estimate of \(\lVert A^{-1}\rVert\) can be obtained.
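
The sketch below illustrates the bound \(\lVert e\rVert \leq \lVert A^{-1}\rVert\cdot\lVert r\rVert\) on a small, arbitrary example; here \(\lVert A^{-1}\rVert\) is computed directly, which is only feasible because the system is tiny:

import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])         # arbitrary small system
b = np.array([1.0, 2.0])

x_exact = np.linalg.solve(A, b)
x_approx = x_exact + np.array([1e-3, -2e-3])   # pretend this came from an inexact solver

e = x_approx - x_exact    # error (unknown in practice, since x_exact is unknown)
r = b - A @ x_approx      # residual (computable in practice)

bound = np.linalg.norm(np.linalg.inv(A), np.inf) * np.linalg.norm(r, np.inf)
print(np.linalg.norm(e, np.inf) <= bound)   # True: the error is within the bound
print(np.linalg.norm(e, np.inf), bound)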

2.4.3. A Different Source of Error

Sometimes, the right hand side \(b\) of the system of equations \(Ax=b\) has errors that make it deviate from its intended value. For example, if each component of \(b\) was defined as \(b_i=f(x_i)\), where \(f\) is an unknown function, an error in the measuring device that was supposed to sample \(f(x)\) could lead to erroneous readings \(b_i^\star (\neq b_i)\). Such measuring inaccuracies cause the right hand side vector \(b\) to be misrepresented as \(b^\star\); instead of the intended solution \(x=A^{-1}b\), we then compute the solution to a different system, \(x^\star=A^{-1}b^\star\). How important is the error \(\delta x=x^\star-x\) caused by this misrepresentation of \(b\)? Let \(\delta b=b^\star-b\), and note that \(Ax=b\) and \(Ax^\star=b^\star\). Then, we have:

\[\begin{split}A(x^\star-x) \enspace &=& \enspace b^\star-b \\ A\delta x \enspace &=& \enspace \delta b \\ \Rightarrow \delta x \enspace &=& \enspace A^{-1}\delta b\end{split}\]

Taking norms,

(3)\[\lVert \delta x\rVert = \lVert A^{-1}\delta b\rVert \leq \lVert A^{-1}\rVert\cdot\lVert \delta b\rVert\]

Thus, the error \(\delta x\) in the computed solution is bounded in proportion to the error in \(b\). An even more relevant question is: how does the relative error \(\lVert \delta x\rVert/\lVert x\rVert = \lVert x^\star-x\rVert/\lVert x\rVert\) compare to the relative error in \(b\), i.e. \(\lVert\delta b\rVert/\lVert b\rVert\)? This is often the more useful quantity, since the absolute error \(\lVert\delta b\rVert\) may be impossible to compute (if we do not know the true \(b\)). For this, we write:

(4)\[\begin{split}Ax = b &\Rightarrow& \lVert b\rVert = \lVert Ax\rVert \leq \lVert A\rVert\cdot\lVert x\rVert \\ &\Rightarrow& \frac{1}{\lVert x\rVert} \leq \lVert A\rVert\cdot\frac{1}{\lVert b\rVert}\end{split}\]

Multiplying inequalities (3) and (4) gives

\[\frac{\lVert\delta x\rVert}{\lVert x\rVert} \leq \underbrace{\lVert A\rVert\cdot\lVert A^{-1}\rVert}_{\kappa (A)}\cdot\frac{\lVert \delta b\rVert}{\lVert b\rVert}\]

Thus, the relative error in \(x\) is bounded by a multiple of the relative error in \(b\). This multiplicative constant \(\kappa (A)=\lVert A\rVert\cdot\lVert A^{-1}\rVert\) is called the condition number of \(A\), and is an important measure of the sensitivity of a linear system \(Ax=b\) to being solved on a computer in the presence of inaccurate values.
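
The bound can be demonstrated numerically by perturbing \(b\) slightly and comparing the relative change in the solution with \(\kappa(A)\) times the relative change in \(b\) (a sketch; the matrix, the perturbation, and the choice of the \(\infty\)-norm are arbitrary):

import numpy as np

A = np.array([[1.0, 2.0], [3.0, 6.001]])   # a moderately ill-conditioned matrix
b = np.array([3.0, 9.0])

x = np.linalg.solve(A, b)

delta_b = 1e-6 * np.array([1.0, -1.0])     # small perturbation of the right hand side
x_star = np.linalg.solve(A, b + delta_b)

rel_x = np.linalg.norm(x_star - x, np.inf) / np.linalg.norm(x, np.inf)
rel_b = np.linalg.norm(delta_b, np.inf) / np.linalg.norm(b, np.inf)
kappa = np.linalg.cond(A, np.inf)

print(rel_x <= kappa * rel_b)   # True: the relative error in x respects the bound
print(rel_x, kappa * rel_b)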

2.4.4. Why is all this relevant?

Simply put, any \(b\) will carry some small relative error, simply because it is represented on a computer only up to machine precision. Due to round-off, this relative error is at least on the order of the machine epsilon:

\[\frac{\lVert\delta b\rVert_\infty}{\lVert b\rVert_\infty} \geq\varepsilon\approx 10^{-7} \enspace\enspace\enspace\enspace\mbox{(in single precision)}\]
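
The machine epsilon of a given floating-point format can be queried directly in NumPy:

import numpy as np

print(np.finfo(np.float32).eps)   # ~1.19e-07 (single precision)
print(np.finfo(np.float64).eps)   # ~2.22e-16 (double precision)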

But how bad can the condition number get? Very bad at times. For example, Hilbert matrices \(H_n\in\mathbb R^{n\times n}\) are defined as:

\[(H_n)_{ij} = \frac{1}{i+j-1}\]

Considering a specific instance for \(n=5\),

\[\begin{split}H_5=\left[\begin{array}{ccccc} 1 & 1/2 & 1/3 & 1/4 & 1/5 \\ 1/2 & 1/3 & 1/4 & 1/5 & 1/6 \\ 1/3 & 1/4 & 1/5 & 1/6 & 1/7 \\ 1/4 & 1/5 & 1/6 & 1/7 & 1/8 \\ 1/5 & 1/6 & 1/7 & 1/8 & 1/9 \end{array}\right], \enspace\enspace\enspace\enspace\enspace \kappa_\infty(H_5) = \lVert H_5\rVert_\infty\cdot\lVert H_5^{-1}\rVert_\infty\approx 10^{6}\end{split}\]
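
This value is easy to reproduce (a sketch that builds \(H_5\) directly from the definition; scipy.linalg.hilbert could be used instead):

import numpy as np

n = 5
# Hilbert matrix from the definition (H_n)_ij = 1 / (i + j - 1), with 1-based indices.
H = np.array([[1.0 / (i + j - 1) for j in range(1, n + 1)] for i in range(1, n + 1)])

print(np.linalg.cond(H, np.inf))   # roughly 9.4e5, i.e. on the order of 10^6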

Thus, any attempt at solving \(H_5 x = b\) would be subject to a relative error of up to \(10\%\) just due to round-off errors in \(b\)! Another case is that of near-singular matrices, for example:

\[\begin{split}A=\left[\begin{array}{cc} 1 & 2 \\ 3 & 6+\varepsilon \end{array}\right]\end{split}\]

As \(\varepsilon\rightarrow 0\), the matrix \(A\) becomes singular (non-invertible). In this case, \(\kappa (A)\rightarrow\infty\).
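
Numerically, the blow-up of the condition number as \(\varepsilon\) shrinks is easy to observe (a sketch; for this particular matrix \(\kappa_\infty(A)\) grows roughly like \(72/\varepsilon\) for small \(\varepsilon\)):

import numpy as np

for eps in [1e-1, 1e-3, 1e-6, 1e-9]:
    A = np.array([[1.0, 2.0], [3.0, 6.0 + eps]])
    print(eps, np.linalg.cond(A, np.inf))   # grows roughly like 72 / eps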

So far, we have looked at how large the condition number can become. One may also ask: what is the best case scenario for \(\kappa (A)\)? Before we answer this question, a lemma:

Lemma

For any vector-induced matrix norm, \(\lVert I\rVert = 1\).

Proof: From the definition,

\[\lVert I\rVert = \max_{x\neq 0}\frac{\lVert Ix\rVert}{\lVert x\rVert} = \max_{x\neq 0} \frac{\lVert x\rVert}{\lVert x\rVert} = 1\]

Using property (4) of matrix norms, we have:

\[I = A\cdot A^{-1} \Rightarrow 1 = \lVert I\rVert = \lVert A\cdot A^{-1}\rVert \leq \lVert A\rVert\cdot\lVert A^{-1}\rVert\]

Thus, \(\kappa (A)\geq 1\). Matrices of the form \(A=c\cdot I\), where \(c\in\mathbb R\) is a non-zero constant, achieve this best case, with \(\kappa (A)=1\). NumPy's linalg module provides a built-in function, cond, for computing the condition number of a square matrix, as shown below. The second argument specifies the particular norm to use for evaluating the condition number.

>>> import numpy as np
>>> a = np.array([[1, 0, -1], [0, 1, 0], [1, 0, 1]])
>>> a
array([[ 1,  0, -1],
       [ 0,  1,  0],
       [ 1,  0,  1]])
>>> np.linalg.cond(a, 1)
2.0
>>> np.linalg.cond(a, 2)
1.4142135623730951
>>> np.linalg.cond(a, np.inf)
2.0