All The Linear Algebra You Need For Grad School

Themistoklis Haris, August 25, 2023

Introduction

Knowing Linear Algebra well is almost essential for anyone serious about research in the mathematical sciences. This is simply because of the universality of the ideas and tools Linear Algebra provides. Linear transformations are everywhere around us, from taking derivatives to Fourier transforms. A solid Linear Algebra foundation lets one approach these topics with the mathematical maturity to connect dots between seemingly completely different subjects. I think that maturity is a very important skill to have as a researcher.
For this blog, I assume the reader already has prior exposure to linear algebra and is comfortable with the concept of a vector space at the very least. The purpose of this exposition is to give a high-level picture of Linear Algebra, with a particular focus on Data Science, so low-level rigorous proofs will often be omitted, with proof sketches taking their place.
The contents of this blog post are the result of studying Chapter 1 of the book Linear Algebra and Learning from Data, by Gilbert Strang.

Vector Spaces, Linear Transformations and Matrices

Linear Algebra at its core is the study of linear transformations. These are transformations that map sums to sums and scalar products to scalar products:
T(x+y) = T(x) + T(y)
T(\alpha x) = \alpha T(x)
The objects these transformations handle are called vectors. Sets of vectors equipped with the addition and scalar multiplication operations are called vector spaces. Vector spaces within vector spaces are called subspaces. Vector spaces appear everywhere in the mathematical sciences - in quantum mechanics the set of all admissible wavefunctions is a vector space, and so is the set of all 256 x 256 images on a computer. Amazingly, the set of rules for dissecting and analyzing vector spaces that Linear Algebra studies applies to every single vector space, from physics to computer graphics. It is astonishing to realize that the same principles govern the structure of such different objects.
While a more traditional treatment of Linear Algebra starts with the properties of vector spaces and linear transformations and eventually considers bases and dimensions, we will instead jump directly to the main protagonist of the show: matrices. Matrices are objects through which we can study any linear transformation between any two finite-dimensional vector spaces. The connection comes from how a linear transformation maps a basis of one vector space into the other space; that mapping is exactly what a matrix records. Indeed, this is one of the first core concepts one should grasp in linear algebra:
\boxed{\textbf{A matrix } A \in \mathbb{R}^{m\times n} \textbf{ represents a linear transformation from } V\cong \mathbb{R}^n \textbf{ to } W\cong \mathbb{R}^m}

Matrix Transformations

How do matrices transform vectors? Via matrix multiplication! Ax is the vector in \mathbb{R}^m to which matrix A transforms the vector x \in \mathbb{R}^n. The typical example to think of is rotations in a plane via a 2 x 2 matrix. The way everyone learns to multiply matrices is via row-column "dot-product" multiplication, but there is another way: one that lets us think of matrix multiplication in terms of linear combinations of columns.
Ax can be computed as a linear combination of the columns of A. Consider this example:
\left[\begin{array}{cc} 2&3 \\4&5 \\6&7 \end{array}\right]\cdot\left[\begin{array}{c}x_1\\x_2\end{array}\right] = \left[\begin{array}{c}2x_1 + 3x_2\\4x_1+5x_2\\6x_1+7x_2\end{array}\right] = x_1\left[\begin{array}{c}2\\4\\6\end{array}\right] + x_2\left[\begin{array}{c}3\\5\\7\end{array}\right]
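To see this concretely, here is a minimal NumPy sketch (using the matrix from the example above and an arbitrary x) comparing the usual product with the column-combination view:

```python
import numpy as np

A = np.array([[2, 3],
              [4, 5],
              [6, 7]])
x = np.array([1.0, 2.0])  # any x1, x2 will do

# Usual matrix-vector product
Ax = A @ x

# Same result as a linear combination of the columns of A
combo = x[0] * A[:, 0] + x[1] * A[:, 1]

print(Ax)                      # [ 8. 14. 20.]
print(np.allclose(Ax, combo))  # True
```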
This leads us to the concept of the column space C(A) of A: the set of all linear combinations of the columns of a matrix. Similarly, we can define the row space of a matrix as the column space of its transpose, C(A^T). We should of course introduce the concept of linear independence for a set of vectors x_1,...,x_n: no vector can be written as a linear combination of the others. More compactly, \sum \alpha_i x_i = 0 \Leftrightarrow \alpha_i = 0\,\,\forall i
A basis of a vector space is a set of linearly independent vectors that can produce any vector in the space via linear combination. The number of elements in a basis is called the dimension of the vector space. We shall work with finite-dimensional vector spaces in this document. We can prove that even though a vector space may have many different bases, their cardinality (aka the dimension) is always the same.
The rank of a matrix is the dimension of its column space; it counts independent columns. It is an important fact in linear algebra that the number of independent rows is equal to the number of independent columns:
\textbf{column rank} = \textbf{row rank} \text{ or } \dim(C(A)) = \dim(C(A^T))
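This fact is easy to check numerically; a quick sketch on an arbitrary low-rank matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 7))  # rank is at most 3

print(np.linalg.matrix_rank(A))    # 3
print(np.linalg.matrix_rank(A.T))  # 3 -- row rank equals column rank
```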

Matrix Multiplication

Multiplying two matrices A \in \mathbb{R}^{m\times n}, B\in\mathbb{R}^{n\times p} can also be thought of in terms of linear combinations, but this time of rank-1, column-times-row matrices. When we multiply a column by a row, we get a matrix of rank 1, meaning that all its rows are multiples of one row, and the same is true for its columns:
\left[\begin{array}{c}2\\2\\1\end{array}\right]\left[\begin{array}{ccc}3&4&6\end{array}\right] = \left[\begin{array}{ccc}6 & 8 & 12\\6 & 8 & 12 \\3 & 4 & 6\end{array}\right]
To multiply two arbitrary matrices, we just add rank-1 pieces. We view the first matrix as a row of columns and the second as a column of rows:
AB = \left[a_1 \cdots a_n \right]\cdot\left[\begin{array}{c}b_1^*\\ \vdots \\ b_n^* \end{array}\right] = a_1b_1^* + a_2b_2^{*} + \cdots + a_n b_n^*
where a_i are columns and b_i^* are rows. For example:
\left[\begin{array}{cc}1&0\\3&1\end{array}\right]\cdot\left[\begin{array}{cc}2&4\\0&5\end{array}\right] = \left[\begin{array}{c}1\\3\end{array}\right]\left[\begin{array}{cc}2&4\end{array}\right] + \left[\begin{array}{c}0\\1\end{array}\right]\left[\begin{array}{cc}0&5\end{array}\right]
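The same example in NumPy, assembling AB as a sum of rank-1 outer products (a sketch; np.outer builds each column-times-row piece):

```python
import numpy as np

A = np.array([[1, 0],
              [3, 1]])
B = np.array([[2, 4],
              [0, 5]])

# Sum of rank-1 pieces: (column k of A) times (row k of B)
rank1_sum = sum(np.outer(A[:, k], B[k, :]) for k in range(A.shape[1]))

print(rank1_sum)   # [[2 4] [6 17]]
print(A @ B)       # the same matrix
```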
But why does this way of multiplying matrices work? Let's see the short proof:
(a_kb_k^*)_{ij} = a_{ik}b_{kj}, \text{ so } \left(\sum\limits_{k=1}^n a_kb_k^*\right)_{ij} = \sum\limits_{k=1}^n a_{ik}b_{kj} = (\text{row } i \text{ of } A)\cdot(\text{column } j \text{ of } B)
Now, let's understand why column rank equals row rank by putting this new way of multiplying matrices into practice. Suppose we are given a matrix A \in \mathbb{R}^{m\times n}. We construct a matrix C by going through the columns of A one by one and rejecting any column that depends on the previously kept ones. This gives us a matrix C \in \mathbb{R}^{m\times r}, together with another matrix R \in \mathbb{R}^{r\times n} that describes how the columns of C must be linearly combined to give the columns of A. Since A = CR, every row of A is a combination of the r rows of R, so the row rank is at most the column rank; applying the same argument to A^T gives the reverse inequality, so the two ranks are equal. We have also arrived at our first major factorization:
\boxed{\boxed{A = CR}}
As an example:
\left[\begin{array}{ccc}1&3&8\\1&2&6\\0&1&2\end{array}\right] = \left[\begin{array}{cc}1&3\\1&2\\0&1\end{array}\right]\cdot \left[\begin{array}{ccc}1&0&2\\0&1&2\end{array}\right]
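Here is one way to build C and R numerically, using the example matrix above (a sketch, assuming exact ranks are detected correctly in floating point: keep each column that raises the rank, then solve for the combination weights by least squares):

```python
import numpy as np

A = np.array([[1., 3., 8.],
              [1., 2., 6.],
              [0., 1., 2.]])

# C: keep the columns of A that are independent of the previously kept ones
cols = []
for j in range(A.shape[1]):
    candidate = A[:, cols + [j]]
    if np.linalg.matrix_rank(candidate) > len(cols):
        cols.append(j)
C = A[:, cols]

# R: how the kept columns combine to reproduce every column of A
R, *_ = np.linalg.lstsq(C, A, rcond=None)

print(C)                      # the first two columns of A
print(np.round(R, 10))        # [[1 0 2] [0 1 2]]
print(np.allclose(C @ R, A))  # True
```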
To test our understanding of column spaces and matrix multiplication, let's explore the following important facts:
  1. Multiplying two matrices does not increase rank
  2. rank(A+B) \leq rank(A) + rank(B)
The first fact comes from C(AB) \subseteq C(A). Why is that? Look at matrix multiplication: every column of AB is a linear combination of columns of A, and every row of AB is a linear combination of rows of B. The second fact comes from combining bases for C(A) and C(B): every column of A+B is the sum of the corresponding columns of A and B, so C(A+B) is spanned by the union of those two bases.

Four Fundamental Subspaces

There are 4 subspaces that we need to know, because their relationships and significance will play a huge role in the theorems to come.
  1. The column space C(A): linear combinations of columns
  2. The row space C(A^T): linear combinations of rows
  3. The nullspace N(A): solutions to Ax = 0
  4. The left nullspace N(A^T): solutions to A^T x = 0
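SciPy can produce orthonormal bases for all four subspaces; a short sketch on the rank-2 example matrix from before:

```python
import numpy as np
from scipy.linalg import orth, null_space

A = np.array([[1., 3., 8.],
              [1., 2., 6.],
              [0., 1., 2.]])   # rank 2

col_space = orth(A)            # basis of C(A),   dimension r
row_space = orth(A.T)          # basis of C(A^T), dimension r
nullspace = null_space(A)      # basis of N(A),   dimension n - r
left_null = null_space(A.T)    # basis of N(A^T), dimension m - r

print(col_space.shape, row_space.shape, nullspace.shape, left_null.shape)
# (3, 2) (3, 2) (3, 1) (3, 1) -- the dimensions add up to n and m
```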

Five Fundamental Matrix Factorizations

One major common theme in Linear Algebra is to factorize a matrix and then examine the rank-1 pieces of that factorization. That analysis allows us to inspect which parts of a matrix have a larger or smaller influence on its properties. To that end, Linear Algebra seeks to factorize matrices in as many ways, and as simply, as possible. Sometimes a factorization only applies to a certain class of matrices, but the goal is definitely to be as broad as possible. Below is a list of 5 major factorizations in linear algebra. We will examine each one of them in this post.
  1. \mathbf{A = LU}\rightarrow Elimination. U is upper triangular, L is lower triangular
  2. \mathbf{A = QR}\rightarrow Orthogonalization of the columns of A
  3. \mathbf{S = Q\Lambda Q^T}\rightarrow Spectral decomposition. S is symmetric.
  4. \mathbf{A = X\Lambda X^{-1}}\rightarrow Diagonalization
  5. \mathbf{A = U\Sigma V^T}\rightarrow Singular Value Decomposition (SVD)

Elimination and A = LU

The method of elimination or "pivots" is a well-known method for solving linear systems of equations. We subtract multiples of a row from the other rows to eliminate variables. At the end, we use back substitution to find the solutions to our equations. The method essentially converts a matrix to an upper triangular matrix U with the pivots on the diagonal:
\left[\begin{array}{cccc}x&x&x&x\\x&x&x&x\\x&x&x&x\\x&x&x&x\end{array}\right] \longrightarrow \left[\begin{array}{cccc}p_1&x&x&x\\0&p_2&x&x\\0&0&p_3&x\\0&0&0&p_4\end{array}\right]
The numbers we used to create U form a lower-triangular matrix L. In the i-th column of L we put all the numbers we used to eliminate rows (i+1) and below:
L = \left[\begin{array}{cccc}1&0&0&0\\l_{21}&1&0&0\\l_{31}&l_{32}&1&0\\l_{41}&l_{42}&l_{43}&1\end{array}\right]
where l_{ij} (for i > j) is the multiplier used at step j: the current (i,j) entry of the partially eliminated matrix divided by the j-th pivot.
Every time we subtract multiples of a row from the rows below it, we essentially remove a rank-1 matrix from A. The first matrix we remove is \ell_1 u_1^* (multiples of row 1). Then we remove \ell_2 u_2^* (multiples of row 2). In the end, we get
\boxed{\boxed{A = \ell_1 u_1^* + \ell_2 u_2^* + \cdots + \ell_n u_n^* = LU}}
The method of elimination gives us the A = LU factorization. This factorization can be used to solve Ax = b:
Ax = b \Leftrightarrow (LU)x = b \Leftrightarrow Ux = L^{-1}b \Leftrightarrow \boxed{x = U^{-1}L^{-1}b}
Of course, the matrix A will not always be invertible, so the system might have no solutions or infinitely many solutions, but the main idea is there. One pesky obstacle when solving these systems in general is the appearance of zero pivots. For example, if A_{11} = 0 then we cannot define the first pivot. What we do is permute the rows so that A_{11} \neq 0. We can permute the rows of a matrix by multiplying with a permutation matrix P. In this case we can always write PA = LU:
\left[\begin{array}{ccc}0&0&1\\1&0&0\\0&1&0\end{array}\right]\cdot\left[\begin{array}{ccc}0&1&1\\1&3&7\\2&4&8\end{array}\right] = \left[\begin{array}{ccc}2&4&8\\0&1&1\\1&3&7\end{array}\right] = \left[\begin{array}{ccc}1&0&0\\0&1&0\\1/2&1&1\end{array}\right]\cdot\left[\begin{array}{ccc}2&4&8\\0&1&1\\0&0&2\end{array}\right]
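In practice we never form U^{-1}L^{-1} explicitly; libraries factor once (handling row exchanges automatically) and then solve by forward and back substitution. A sketch with SciPy's lu_factor / lu_solve, using the matrix from the example above:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[0., 1., 1.],
              [1., 3., 7.],
              [2., 4., 8.]])   # note the zero in the (1,1) pivot position
b = np.array([1., 2., 3.])

lu, piv = lu_factor(A)      # one factorization, with partial pivoting...
x = lu_solve((lu, piv), b)  # ...then a cheap triangular solve per right-hand side

print(np.allclose(A @ x, b))  # True
```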

Orthogonality

Orthogonal vectors play a huge role in linear algebra. This is because orthogonal vectors are independent, so we can talk about orthogonal bases. Orthogonality comes from defining another operation between vectors: the inner product \langle x,y \rangle. We'll use the standard inner product for this section: \langle x,y \rangle = x^T y. For complex-valued vectors we use the conjugate of x, but for simplicity we will assume real components here.
Two vectors are orthogonal if x^T y = 0. For a vector x we define its L2-norm as ||x|| = \sqrt{x^T x}. One consequence of orthogonality is the Pythagorean theorem:
||x-y||^2 = ||x||^2 + ||y||^2,\quad x\perp y
More generally, we have the law of cosines:
||x-y||^2 = ||x||^2 + ||y||^2 - 2||x||\cdot||y||\cdot\cos \theta

An orthogonal basis is a basis in which every pair of vectors is orthogonal; if each vector also has unit length, the basis is orthonormal. Let \{q_1,...,q_n\} be an orthonormal basis. Then, for any vector v, we have v = c_1 q_1 + \cdots + c_n q_n. This amounts to projecting v onto the basis vectors, and we can get the projection coefficients c_i simply by taking the inner product of v with each basis vector:
\boxed{\boxed{c_i = q_i^T v}}
Why is that? Just examine: q_1^T v = c_1 q_1^T q_1 + \cdots + c_n q_1^T q_n = c_1.
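A small numerical illustration (a sketch: the orthonormal basis comes from a QR factorization of a random matrix, and all the coefficients are recovered at once as Q^T v):

```python
import numpy as np

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # columns q_1..q_4 are orthonormal
v = rng.standard_normal(4)

c = Q.T @ v        # c_i = q_i^T v, all coefficients at once
v_rebuilt = Q @ c  # v = c_1 q_1 + ... + c_4 q_4

print(np.allclose(v, v_rebuilt))  # True
```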

A pair of subspaces V, W is called orthogonal if for every pair of vectors x \in V, y\in W we have \langle x,y\rangle = 0. By writing out A^T x = 0 row by row (each row of A^T is a column of A), and similarly for Ax = 0, it is easy to see that:
\boxed{C(A) \perp N(A^T),\quad C(A^T) \perp N(A)}
We can also consider orthogonal matrices. A matrix Q is called orthogonal if its columns are orthonormal, meaning that they are orthogonal to each other and each has norm 1. In other words,
Q^T Q = I
You can think of these matrices as rotations (and reflections) of space. An important property of these matrices is that they don't change the lengths of the vectors they transform:
||Qx||^2 = (Qx)^T (Qx) = x^T Q^T Q x = x^T x = ||x||^2
If a square matrix is orthogonal, its transpose is its inverse!
Q^{-1} = Q^T
Also, if Q_1, Q_2 are orthogonal, then so is Q_1 Q_2. We'll leave this proof to the reader.
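These properties are easy to confirm numerically; a sketch, again taking the orthogonal matrices from QR factorizations of random matrices:

```python
import numpy as np

rng = np.random.default_rng(2)
Q1, _ = np.linalg.qr(rng.standard_normal((3, 3)))
Q2, _ = np.linalg.qr(rng.standard_normal((3, 3)))
x = rng.standard_normal(3)

print(np.allclose(Q1.T @ Q1, np.eye(3)))                        # Q^T Q = I
print(np.allclose(np.linalg.norm(Q1 @ x), np.linalg.norm(x)))   # lengths preserved
print(np.allclose(Q1.T, np.linalg.inv(Q1)))                     # Q^{-1} = Q^T

Q3 = Q1 @ Q2
print(np.allclose(Q3.T @ Q3, np.eye(3)))                        # the product is orthogonal too
```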

Eigenvalues and Eigenvectors

The eigenvectors of a matrix don't change direction when multiplied by it:
\boxed{\boxed{Ax = \lambda x}}
These vectors are, as we will find out, very important for analyzing matrices, because they pick out directions that are invariant under the matrix's linear transformation. Let's first explore some of their properties (without proof, but the reader is encouraged to work them out on her own):
  1. If Ax = \lambda x, then A^k x = \lambda^k x
  2. If x_1,...,x_n are n independent eigenvectors, then they form a basis of \mathbb{R}^n, so
    v = c_1 x_1 + \cdots + c_n x_n \implies A^k v = c_1\lambda_1^k x_1 + \cdots + c_n \lambda_n^k x_n
  3. The trace of a matrix (the sum of its diagonal values) is equal to the sum of its eigenvalues
  4. The determinant of a matrix is equal to the product of its eigenvalues.
  5. If S is symmetric, then its eigenvalues are real. We'll see why soon.
  6. If AA^T = A^T A and \lambda_1 \neq \lambda_2, then x_1 \perp x_2
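A quick numerical check of a few of these properties on a random matrix (a sketch; np.linalg.eig returns the eigenvalues and the eigenvectors as columns of X):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))
lam, X = np.linalg.eig(A)   # eigenvalues lam, eigenvectors as columns of X

print(np.allclose(np.trace(A), lam.sum()))                 # trace = sum of eigenvalues
print(np.allclose(np.linalg.det(A), lam.prod()))           # det = product of eigenvalues
print(np.allclose(A @ A @ X[:, 0], lam[0]**2 * X[:, 0]))   # A^2 x = lambda^2 x
```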

Let's examine how to calculate the eigenvalues of a matrix. It boils down to computing a determinant!
Ax = \lambda x \text{ (for some } x \neq 0) \Leftrightarrow (A - \lambda I)x = 0 \Leftrightarrow \boxed{\boxed{\det(A - \lambda I) = 0}}

Two matrices are similar if they represent the same transformation in different bases: if B is invertible, then A and BAB^{-1} are similar. Similar matrices have the same eigenvalues:
(BAB^{-1})(Bx) = \lambda (Bx) \Leftrightarrow Ax = \lambda x
We can use similar matrices to compute the eigenvalues of a matrix. The idea is to transform the original matrix into a similar diagonal matrix \Lambda whose diagonal holds the eigenvalues. This process is known as diagonalization. If A has a full set of independent eigenvectors x_1,...,x_n, we can write:
A\left[x_1\cdots x_n\right] = \left[Ax_1\cdots Ax_n\right] = \left[\lambda_1 x_1 \cdots \lambda_n x_n\right] = \left[x_1\cdots x_n\right]\cdot \left[\begin{array}{ccc}\lambda_1&&\\&\ddots&\\&&\lambda_n\end{array}\right]
This gives us another factorization known as eigendecomposition:
AX = X\Lambda \Leftrightarrow \boxed{\boxed{A = X\Lambda X^{-1}}}
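The factorization is easy to verify numerically, assuming the random matrix has n independent eigenvectors (which is almost always the case):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 4))
lam, X = np.linalg.eig(A)

A_rebuilt = X @ np.diag(lam) @ np.linalg.inv(X)  # X Lambda X^{-1}
print(np.allclose(A, A_rebuilt))                 # True
```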
It is important to note at this point that not all matrices are diagonalizable! In particular, a repeated eigenvalue may not come with enough independent eigenvectors, but that is beyond the scope of this document.

Symmetric Positive Definite Matrices

There are some matrices that are more convenient to work with than others. The crown goes to symmetric matrices, and specifically to symmetric matrices with positive eigenvalues. Those matrices play a very important role in science and appear almost everywhere, so we need to know their basic properties:
  1. All the eigenvalues of a symmetric matrix are real numbers.
  2. All n eigenvectors of a symmetric matrix can be chosen to be orthonormal to each other!
Now, if we put all the orthonormal eigenvectors in a matrix Q, then we must have Q^{T}Q = I, and this in fact gives us the powerful spectral decomposition theorem:
\boxed{\boxed{S = Q\Lambda Q^T}}
which is simply the eigendecomposition for symmetric matrices. The power lies in the fact that we don't have to compute the inverse of Q; we just take the transpose.
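NumPy has a dedicated routine for the symmetric case, np.linalg.eigh, which returns real eigenvalues and orthonormal eigenvectors; a short sketch:

```python
import numpy as np

rng = np.random.default_rng(5)
B = rng.standard_normal((4, 4))
S = B + B.T                  # a symmetric matrix

lam, Q = np.linalg.eigh(S)   # real eigenvalues, orthonormal eigenvectors

print(np.allclose(Q.T @ Q, np.eye(4)))         # Q^T Q = I
print(np.allclose(S, Q @ np.diag(lam) @ Q.T))  # S = Q Lambda Q^T
```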

Let's talk a little bit about proving / justifying the two statements above. To explain why all eigenvalues are real, take the equation Sx = \lambda x of a single eigenvalue and multiply both sides by the conjugate transpose \bar{x}^T to get \bar{x}^T S x = \lambda \bar{x}^T x. A little algebra shows that the quantity on the left equals its own conjugate, so it is real, and \bar{x}^T x = ||x||^2 is real and positive, forcing \lambda to be real.
As for why the eigenvectors can be chosen orthogonal to each other, again take Sx = \lambda x for some vector x, but now also consider a different eigenvector y for which Sy = \alpha y, with \alpha \neq \lambda. Clearly y belongs to the nullspace of S-\alpha I. But x belongs to the column space of S-\alpha I, since (S-\alpha I)x = (\lambda-\alpha)x with \lambda - \alpha \neq 0. Because S-\alpha I is symmetric, its column space equals its row space, and the nullspace is orthogonal to the row space, so x and y are orthogonal. The case of a repeated eigenvalue (\alpha = \lambda) needs separate handling, but we'll leave that to the reader.

Now to talk about the star of the show: symmetric positive definite matrices. These are symmetric matrices whose eigenvalues are all positive. We write S \succ 0. There are multiple criteria to test whether a matrix is positive definite (PD):
  1. All the eigenvalues are positive.
  2. For every nonzero vector x, the energy E(x) = x^T S x is positive.
  3. S = A^T A for some matrix A with independent columns, and vice-versa!
  4. All leading determinants are positive. These are the determinants of the matrices that result if we delete the last kk rows and kk columns. (Sylvester's criterion)
  5. All pivots of S are positive.
Let's explore some of the reasoning behind the facts above. The energy, or quadratic form, E(x) = x^T S x is the central theme connecting all the criteria. The energy test works because it is easy to verify on eigenvectors (where it reduces to the positivity of the eigenvalues), and the eigenvectors form a basis for \mathbb{R}^n. The criterion S = A^T A works because of the energy calculation:
E(x) = x^T A^T A x = (Ax)^T (Ax) = ||Ax||^2 > 0
The last inequality is strict (for x \neq 0) because the columns of A are independent. As for the pivot criterion, it connects the LU decomposition of S with its symmetry, giving what is known as the Cholesky decomposition. Briefly, if we write
S = LU = LDL^T
by pulling the pivots into a diagonal matrix D, then we can split D = \sqrt{D}\sqrt{D} and write
\boxed{\boxed{S = (L\sqrt{D})(\sqrt{D}L^T)}}
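NumPy's Cholesky routine returns a lower-triangular factor directly, playing the role of L\sqrt{D} above; a sketch on a positive definite matrix built as A^T A:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((5, 3))   # independent columns (almost surely)
S = A.T @ A                       # symmetric positive definite

L = np.linalg.cholesky(S)         # lower triangular, S = L L^T

print(np.allclose(S, L @ L.T))            # True
print(np.all(np.linalg.eigvalsh(S) > 0))  # all eigenvalues positive
```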
We should note that PD matrices and the spectral decomposition play a huge role in optimization. For example, we can check whether a function is convex by checking whether its Hessian matrix is positive semidefinite (positive definite gives strict convexity).

Singular Value Decomposition

The SVD is the crowning jewel of Linear Algebra. The reason is its versatility: it works for every matrix, unlike the other factorizations. We built a beautiful theory for symmetric matrices, and we want to use it as broadly as possible. So, how do we create a symmetric matrix from an arbitrary matrix A \in \mathbb{R}^{m\times n}?
The idea is to use S_1 = A^T A \in \mathbb{R}^{n\times n} and S_2 = AA^T \in \mathbb{R}^{m\times m}. Since these matrices are symmetric, we can pick a set of r orthonormal eigenvectors of S_1 with nonzero eigenvalues, where r \leq n is the rank of A. Let those be v_1, v_2,...,v_r, let \lambda_1,\lambda_2,...,\lambda_r be the corresponding eigenvalues, and let \sigma_i = \sqrt{\lambda_i}.
Now let u_k = \frac{Av_k}{\sigma_k}. These are orthonormal eigenvectors of S_2 (a small calculation), chosen carefully to satisfy the "eigen"-equation:
\boxed{Av_k = u_k \sigma_k}
In matrix form, after completing v_1,...,v_r to an orthonormal basis v_1,...,v_n of \mathbb{R}^n and u_1,...,u_r to an orthonormal basis u_1,...,u_m of \mathbb{R}^m, we can write:
A[v_1\,\cdots\,v_r\,\,v_{r+1} \cdots v_n] = [u_1\,\cdots\,u_r\,\,u_{r+1} \cdots u_m]\left[\begin{array}{cccc}\sigma_1&&&\\&\ddots&&\\&&\sigma_r&\\&&&0\end{array}\right]\Leftrightarrow \boxed{\boxed{AV = U\Sigma \Leftrightarrow A = U\Sigma V^T}}
where V \in \mathbb{R}^{n\times n}, U \in \mathbb{R}^{m\times m} and \Sigma \in \mathbb{R}^{m\times n}. Importantly, we can do what we always do when factorizing a matrix: break it into rank-1 pieces. If we order the singular values in decreasing order, the first k terms of the summation give the best possible rank-k approximation of A. This is very valuable in Data Science and is known as the Eckart-Young theorem, which is the foundation of PCA (Principal Component Analysis):
\boxed{A = \sigma_1 u_1 v_1^T + \cdots +\sigma_r u_r v_r^T}
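Here is a sketch of the rank-k approximation in NumPy (np.linalg.svd returns the singular values already sorted in decreasing order, so we simply keep the first k rank-1 pieces):

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((6, 4))

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)  # A = U diag(sigma) V^T

k = 2
A_k = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]  # sum of the first k rank-1 pieces

print(np.linalg.matrix_rank(A_k))    # 2
print(np.linalg.norm(A - A_k, 2))    # equals sigma[2] ...
print(sigma[2])                      # ... the first discarded singular value
```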
Finally, it is worth noting that the SVD has a beautiful geometric interpretation that I would highly suggest looking up online for yourself. It basically tells us that every linear transformation can be decomposed into two rotations and one stretching! That, I believe, is amazing to see.

This concludes our quick tour across the Linear Algebra universe. There are many more things to explore, but they will be left for when the need arises. PCA, Jordan canonical forms, and dual spaces are some of the ideas to look at next. I hope this has been helpful!