All The Linear Algebra You Need For Grad School

Themistoklis Haris, August 25, 2023

Introduction

Knowing Linear Algebra well is almost essential for anyone serious about research in the mathematical sciences. This is simply because of the universality of the ideas and tools Linear Algebra provides. Linear transformations are everywhere around us, from taking derivatives to Fourier transforms. A solid Linear Algebra foundation lets one approach these topics with the mathematical maturity to connect dots between seemingly completely different subjects. I think that maturity is a very important skill to have as a researcher.
For this blog, I assume the reader already has prior exposure to linear algebra and is comfortable with the concept of a vector space at the very least. The purpose of this exposition is to give a high-level picture of Linear Algebra, with a particular focus on Data Science, so low-level rigorous proofs will often be omitted, with proof sketches taking their place.
The contents of this blog post are the result of studying Chapter 1 of the book Linear Algebra and Learning from Data, by Gilbert Strang.

Vector Spaces, Linear Transformations and Matrices

Linear Algebra at its core is the study of linear transformations. These are transformations that map sums to sums and scalar products to scalar products:
T(x+y) = T(x) + T(y)
T(\alpha x) = \alpha T(x)
The objects these transformations handle are called vectors. Sets of vectors equipped with the addition and scalar multiplication operations are called vector spaces. Vector spaces within vector spaces are called subspaces. Vector spaces appear everywhere in the mathematical sciences - in quantum mechanics the set of all admissible wavefunctions is a vector space, and so is the set of all 256 x 256 images on a computer. Amazingly, the set of rules for dissecting and analyzing vector spaces that Linear Algebra studies applies to every single vector space, from physics to computer graphics. It is astonishing to realize that the same principles govern the structure of such different objects.
While a more traditional treatment of Linear Algebra starts with the properties of vector spaces and linear transformations and eventually considers bases and dimensions, we will instead jump directly to the main protagonist of the show: matrices. Matrices are objects through which we can study any linear transformation between any two finite-dimensional vector spaces. The connection comes from how a linear transformation maps a basis of one vector space into the other space; that mapping is exactly what a matrix records. Indeed, this is one of the first core concepts one should grasp in linear algebra:
\boxed{\textbf{A matrix } A \in \mathbb{R}^{m\times n} \textbf{ represents a linear transformation from } V\cong \mathbb{R}^n \textbf{ to } W\cong \mathbb{R}^m}

Matrix Transformations

How do matrices transform vectors? Via matrix multiplication! Ax is the vector in \mathbb{R}^m to which matrix A transforms the vector x \in \mathbb{R}^n. The typical example to think of is rotations in a plane via a 2 x 2 matrix. The way everyone learns to multiply matrices is via row-column "dot-product" multiplication, but there is another way: one that lets us think of matrix multiplication in terms of linear combinations of columns.
Ax can be computed as a linear combination of the columns of A. Consider this example:
\left[\begin{array}{cc} 2&3 \\4&5 \\6&7 \end{array}\right]\cdot\left[\begin{array}{c}x_1\\x_2\end{array}\right] = \left[\begin{array}{c}2x_1 + 3x_2\\4x_1+5x_2\\6x_1+7x_2\end{array}\right] = x_1\left[\begin{array}{c}2\\4\\6\end{array}\right] + x_2\left[\begin{array}{c}3\\5\\7\end{array}\right]
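To see this concretely, here is a minimal NumPy sketch (using the matrix from the example above and an arbitrary x) comparing the usual product with the column-combination view:

```python
import numpy as np

A = np.array([[2, 3],
              [4, 5],
              [6, 7]])
x = np.array([1.0, 2.0])  # any x1, x2 will do

# Usual matrix-vector product
Ax = A @ x

# Same result as a linear combination of the columns of A
combo = x[0] * A[:, 0] + x[1] * A[:, 1]

print(Ax)                      # [ 8. 14. 20.]
print(np.allclose(Ax, combo))  # True
```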
This leads us to the concept of the column space C(A) of A: the set of all linear combinations of the columns of a matrix. Similarly, we can define the row space of a matrix as the column space of its transpose, C(A^T). We should of course introduce the concept of linear independence for a set of vectors x_1,...,x_n: no vector can be written as a linear combination of the others. More compactly, \sum \alpha_i x_i = 0 \Leftrightarrow \alpha_i = 0\,\,\forall i
A basis of a vector space is a set of linearly independent vectors that can produce any vector in the space via linear combination. The number of elements in a basis is called the dimension of the vector space. We shall work with finite-dimensional vector spaces in this document. We can prove that even though a vector space may have many different bases, their cardinality (aka the dimension) is always the same.
The rank of a matrix is the dimension of its column space; it counts independent columns. It is an important fact in linear algebra that the number of independent rows is equal to the number of independent columns:
\textbf{column rank} = \textbf{row rank} \text{ or } \dim(C(A)) = \dim(C(A^T))
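This fact is easy to check numerically; a quick sketch on an arbitrary low-rank matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 7))  # rank is at most 3

print(np.linalg.matrix_rank(A))    # 3
print(np.linalg.matrix_rank(A.T))  # 3 -- row rank equals column rank
```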

Matrix Multiplication

Multiplying two matrices A \in \mathbb{R}^{m\times n}, B\in\mathbb{R}^{n\times p} can also be thought of in terms of linear combinations, but this time of rank-1, column-times-row matrices. When we multiply a column by a row, we get a matrix of rank 1, meaning that all its rows are multiples of one row, and the same is true for its columns:
\left[\begin{array}{c}2\\2\\1\end{array}\right]\left[\begin{array}{ccc}3&4&6\end{array}\right] = \left[\begin{array}{ccc}6 & 8 & 12\\6 & 8 & 12 \\3 & 4 & 6\end{array}\right]
To multiply two arbitrary matrices, we just add rank-1 pieces. We view the first matrix as a row of columns and the second as a column of rows:
AB = \left[a_1 \cdots a_n \right]\cdot\left[\begin{array}{c}b_1^*\\ \vdots \\ b_n^* \end{array}\right] = a_1b_1^* + a_2b_2^{*} + \cdots + a_n b_n^*
where a_i are columns and b_i^* are rows. For example:
\left[\begin{array}{cc}1&0\\3&1\end{array}\right]\cdot\left[\begin{array}{cc}2&4\\0&5\end{array}\right] = \left[\begin{array}{c}1\\3\end{array}\right]\left[\begin{array}{cc}2&4\end{array}\right] + \left[\begin{array}{c}0\\1\end{array}\right]\left[\begin{array}{cc}0&5\end{array}\right]
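The same example in NumPy, assembling AB as a sum of rank-1 outer products (a sketch; np.outer builds each column-times-row piece):

```python
import numpy as np

A = np.array([[1, 0],
              [3, 1]])
B = np.array([[2, 4],
              [0, 5]])

# Sum of rank-1 pieces: (column k of A) times (row k of B)
rank1_sum = sum(np.outer(A[:, k], B[k, :]) for k in range(A.shape[1]))

print(rank1_sum)   # [[2 4] [6 17]]
print(A @ B)       # the same matrix
```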
But why does this way of multiplying matrices work? Let's see the short proof:
(a_kb_k^*)_{ij} = a_{ik}b_{kj}, \text{ so } \left(\sum\limits_{k=1}^n a_kb_k^*\right)_{ij} = \sum\limits_{k=1}^n a_{ik}b_{kj} = (\text{row } i \text{ of } A)\cdot(\text{column } j \text{ of } B)
Now, let's understand why column rank equals row rank by putting this new way of multiplying matrices into practice. Suppose we are given a matrix A \in \mathbb{R}^{m\times n}. We construct a matrix C by going through the columns of A one by one and rejecting any column that depends on the previously kept ones. This gives us a matrix C \in \mathbb{R}^{m\times r}, together with another matrix R \in \mathbb{R}^{r\times n} that describes how the columns of C must be linearly combined to give the columns of A. Since A = CR, every row of A is a combination of the r rows of R, so the row rank is at most the column rank; applying the same argument to A^T gives the reverse inequality, so the two ranks are equal. We have also arrived at our first major factorization:
\boxed{\boxed{A = CR}}
As an example:
\left[\begin{array}{ccc}1&3&8\\1&2&6\\0&1&2\end{array}\right] = \left[\begin{array}{cc}1&3\\1&2\\0&1\end{array}\right]\cdot \left[\begin{array}{ccc}1&0&2\\0&1&2\end{array}\right]
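Here is one way to build C and R numerically, using the example matrix above (a sketch, assuming exact ranks are detected correctly in floating point: keep each column that raises the rank, then solve for the combination weights by least squares):

```python
import numpy as np

A = np.array([[1., 3., 8.],
              [1., 2., 6.],
              [0., 1., 2.]])

# C: keep the columns of A that are independent of the previously kept ones
cols = []
for j in range(A.shape[1]):
    candidate = A[:, cols + [j]]
    if np.linalg.matrix_rank(candidate) > len(cols):
        cols.append(j)
C = A[:, cols]

# R: how the kept columns combine to reproduce every column of A
R, *_ = np.linalg.lstsq(C, A, rcond=None)

print(C)                      # the first two columns of A
print(np.round(R, 10))        # [[1 0 2] [0 1 2]]
print(np.allclose(C @ R, A))  # True
```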
To test our understanding of column spaces and matrix multiplication, let's explore the following important facts:
  1. Multiplying two matrices does not increase rank
  2. rank(A+B) \leq rank(A) + rank(B)
The first fact comes from C(AB) \subseteq C(A). Why is that? Look at matrix multiplication: every column of AB is a linear combination of columns of A, and every row of AB is a linear combination of rows of B. The second fact comes from combining bases for C(A) and C(B): every column of A+B is the sum of the corresponding columns of A and B, so C(A+B) is spanned by the union of those two bases.

Four Fundamental Subspaces

There are 4 subspaces that we need to know, because their relationships and significance will play a huge role in the theorems to come.
  1. The column space C(A): linear combinations of columns
  2. The row space C(A^T): linear combinations of rows
  3. The nullspace N(A): solutions to Ax = 0
  4. The left nullspace N(A^T): solutions to A^T x = 0
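SciPy can produce orthonormal bases for all four subspaces; a short sketch on the rank-2 example matrix from before:

```python
import numpy as np
from scipy.linalg import orth, null_space

A = np.array([[1., 3., 8.],
              [1., 2., 6.],
              [0., 1., 2.]])   # rank 2

col_space = orth(A)            # basis of C(A),   dimension r
row_space = orth(A.T)          # basis of C(A^T), dimension r
nullspace = null_space(A)      # basis of N(A),   dimension n - r
left_null = null_space(A.T)    # basis of N(A^T), dimension m - r

print(col_space.shape, row_space.shape, nullspace.shape, left_null.shape)
# (3, 2) (3, 2) (3, 1) (3, 1) -- the dimensions add up to n and m
```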

Five Fundamental Matrix Factorizations

One major common theme in Linear Algebra is to factorize a matrix and then examine the rank-1 pieces of that factorization. That analysis allows us to inspect which parts of a matrix have a larger or smaller influence on its properties. To that end, Linear Algebra seeks to factorize matrices in as many ways, and as simply, as possible. Sometimes a factorization only applies to a certain class of matrices, but the goal is definitely to be as broad as possible. Below is a list of 5 major factorizations in linear algebra. We will examine each one of them in this post.
  1. \mathbf{A = LU}\rightarrow Elimination. U is upper triangular, L is lower triangular
  2. \mathbf{A = QR}\rightarrow Orthogonalization of the columns of A
  3. \mathbf{S = Q\Lambda Q^T}\rightarrow Spectral decomposition. S is symmetric.
  4. \mathbf{A = X\Lambda X^{-1}}\rightarrow Diagonalization
  5. \mathbf{A = U\Sigma V^T}\rightarrow Singular Value Decomposition (SVD)

Elimination and A = LU

The method of elimination or "pivots" is a well-known method for solving linear systems of equations. We subtract multiples of a row from the other rows to eliminate variables. At the end, we use back substitution to find the solutions to our equations. The method essentially converts a matrix to an upper triangular matrix U with the pivots on the diagonal:
\left[\begin{array}{cccc}x&x&x&x\\x&x&x&x\\x&x&x&x\\x&x&x&x\end{array}\right] \longrightarrow \left[\begin{array}{cccc}p_1&x&x&x\\0&p_2&x&x\\0&0&p_3&x\\0&0&0&p_4\end{array}\right]
The numbers we used to create U form a lower-triangular matrix L. In the i-th column of L we put all the numbers we used to eliminate rows (i+1) and below:
L = \left[\begin{array}{cccc}1&0&0&0\\l_{21}&1&0&0\\l_{31}&l_{32}&1&0\\l_{41}&l_{42}&l_{43}&1\end{array}\right]
where l_{ij} (for i > j) is the multiplier used at step j: the current (i,j) entry of the partially eliminated matrix divided by the j-th pivot.
Every time we subtract multiples of a row from the rows below it, we essentially remove a rank-1 matrix from A. The first matrix we remove is \ell_1 u_1^* (multiples of row 1). Then we remove \ell_2 u_2^* (multiples of row 2). In the end, we get
\boxed{\boxed{A = \ell_1 u_1^* + \ell_2 u_2^* + \cdots + \ell_n u_n^* = LU}}
The method of elimination gives us the A = LU factorization. This factorization can be used to solve Ax = b:
Ax = b \Leftrightarrow (LU)x = b \Leftrightarrow Ux = L^{-1}b \Leftrightarrow \boxed{x = U^{-1}L^{-1}b}
Of course, the matrix A will not always be invertible, so the system might have no solutions or infinitely many solutions, but the main idea is there. One pesky obstacle when solving these systems in general is the appearance of zero pivots. For example, if A_{11} = 0 then we cannot define the first pivot. What we do is permute the rows so that A_{11} \neq 0. We can permute the rows of a matrix by multiplying with a permutation matrix P. In this case we can always write PA = LU:
\left[\begin{array}{ccc}0&0&1\\1&0&0\\0&1&0\end{array}\right]\cdot\left[\begin{array}{ccc}0&1&1\\1&3&7\\2&4&8\end{array}\right] = \left[\begin{array}{ccc}2&4&8\\0&1&1\\1&3&7\end{array}\right] = \left[\begin{array}{ccc}1&0&0\\0&1&0\\1/2&1&1\end{array}\right]\cdot\left[\begin{array}{ccc}2&4&8\\0&1&1\\0&0&2\end{array}\right]
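In practice we never form U^{-1}L^{-1} explicitly; libraries factor once (handling row exchanges automatically) and then solve by forward and back substitution. A sketch with SciPy's lu_factor / lu_solve, using the matrix from the example above:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[0., 1., 1.],
              [1., 3., 7.],
              [2., 4., 8.]])   # note the zero in the (1,1) pivot position
b = np.array([1., 2., 3.])

lu, piv = lu_factor(A)      # one factorization, with partial pivoting...
x = lu_solve((lu, piv), b)  # ...then a cheap triangular solve per right-hand side

print(np.allclose(A @ x, b))  # True
```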

Orthogonality

Orthogonal vectors play a huge role in linear algebra. This is because orthogonal vectors are independent, so we can talk about orthogonal bases. Orthogonality comes from defining another operation between vectors: the inner product \langle x,y \rangle. We'll use the standard inner product for this section: \langle x,y \rangle = x^T y. For complex-valued vectors we use the conjugate of x, but for simplicity we will assume real components here.
Two vectors are orthogonal if x^T y = 0. For a vector x we define its L2-norm as ||x|| = \sqrt{x^T x}. One consequence of orthogonality is the Pythagorean theorem:
||x-y||^2 = ||x||^2 + ||y||^2,\quad x\perp y
More generally, we have the law of cosines:
||x-y||^2 = ||x||^2 + ||y||^2 - 2||x||\cdot||y||\cdot\cos \theta

An orthogonal basis is a basis in which every pair of vectors is orthogonal; if each vector also has unit length, the basis is orthonormal. Let \{q_1,...,q_n\} be an orthonormal basis. Then, for any vector v, we have v = c_1 q_1 + \cdots + c_n q_n. This amounts to projecting v onto the basis vectors, and we can get the projection coefficients c_i simply by taking the inner product of v with each basis vector:
\boxed{\boxed{c_i = q_i^T v}}
Why is that? Just examine: q_1^T v = c_1 q_1^T q_1 + \cdots + c_n q_1^T q_n = c_1.
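A small numerical illustration (a sketch: the orthonormal basis comes from a QR factorization of a random matrix, and all the coefficients are recovered at once as Q^T v):

```python
import numpy as np

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # columns q_1..q_4 are orthonormal
v = rng.standard_normal(4)

c = Q.T @ v        # c_i = q_i^T v, all coefficients at once
v_rebuilt = Q @ c  # v = c_1 q_1 + ... + c_4 q_4

print(np.allclose(v, v_rebuilt))  # True
```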

A pair of subspaces V, W is called orthogonal if for every pair of vectors x \in V, y\in W we have \langle x,y\rangle = 0. By writing out A^T x = 0 row by row (each row of A^T is a column of A), and similarly for Ax = 0, it is easy to see that:
\boxed{C(A) \perp N(A^T),\quad C(A^T) \perp N(A)}
We can also consider orthogonal matrices. A matrix Q is called orthogonal if its columns are orthonormal, meaning that they are orthogonal to each other and each has norm 1. In other words,
Q^T Q = I
You can think of these matrices as rotations (and reflections) of space. An important property of these matrices is that they don't change the lengths of the vectors they transform:
||Qx||^2 = (Qx)^T (Qx) = x^T Q^T Q x = x^T x = ||x||^2
If a square matrix is orthogonal, its transpose is its inverse!
Q^{-1} = Q^T
Also, if Q_1, Q_2 are orthogonal, then so is Q_1 Q_2. We'll leave this proof to the reader.
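These properties are easy to confirm numerically; a sketch, again taking the orthogonal matrices from QR factorizations of random matrices:

```python
import numpy as np

rng = np.random.default_rng(2)
Q1, _ = np.linalg.qr(rng.standard_normal((3, 3)))
Q2, _ = np.linalg.qr(rng.standard_normal((3, 3)))
x = rng.standard_normal(3)

print(np.allclose(Q1.T @ Q1, np.eye(3)))                        # Q^T Q = I
print(np.allclose(np.linalg.norm(Q1 @ x), np.linalg.norm(x)))   # lengths preserved
print(np.allclose(Q1.T, np.linalg.inv(Q1)))                     # Q^{-1} = Q^T

Q3 = Q1 @ Q2
print(np.allclose(Q3.T @ Q3, np.eye(3)))                        # the product is orthogonal too
```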

Eigenvalues and Eigenvectors

The eigenvectors of a matrix don't change direction when multiplied by it:
\boxed{\boxed{Ax = \lambda x}}
These vectors are, as we will find out, very important for analyzing matrices, because they pick out directions that are invariant under the matrix's linear transformation. Let's first explore some of their properties (without proof, but the reader is encouraged to work them out on her own):
  1. If Ax = \lambda x, then A^k x = \lambda^k x
  2. If x_1,...,x_n are n independent eigenvectors, then they form a basis of \mathbb{R}^n, so
    v = c_1 x_1 + \cdots + c_n x_n \implies A^k v = c_1\lambda_1^k x_1 + \cdots + c_n \lambda_n^k x_n
  3. The trace of a matrix (the sum of its diagonal values) is equal to the sum of its eigenvalues
  4. The determinant of a matrix is equal to the product of its eigenvalues.
  5. If S is symmetric, then its eigenvalues are real. We'll see why soon.
  6. If AA^T = A^T A and \lambda_1 \neq \lambda_2, then x_1 \perp x_2
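A quick numerical check of a few of these properties on a random matrix (a sketch; np.linalg.eig returns the eigenvalues and the eigenvectors as columns of X):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))
lam, X = np.linalg.eig(A)   # eigenvalues lam, eigenvectors as columns of X

print(np.allclose(np.trace(A), lam.sum()))                 # trace = sum of eigenvalues
print(np.allclose(np.linalg.det(A), lam.prod()))           # det = product of eigenvalues
print(np.allclose(A @ A @ X[:, 0], lam[0]**2 * X[:, 0]))   # A^2 x = lambda^2 x
```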

Let's examine how to calculate the eigenvalues of a matrix. It boils down to computing a determinant!
Ax = \lambda x \text{ (for some } x \neq 0) \Leftrightarrow (A - \lambda I)x = 0 \Leftrightarrow \boxed{\boxed{\det(A - \lambda I) = 0}}

Two matrices are similar if they represent the same transformation in different bases: if B is invertible, then A and BAB^{-1} are similar. Similar matrices have the same eigenvalues:
(BAB^{-1})(Bx) = \lambda (Bx) \Leftrightarrow Ax = \lambda x
We can use similar matrices to compute the eigenvalues of a matrix. The idea is to transform the original matrix into a similar diagonal matrix \Lambda whose diagonal holds the eigenvalues. This process is known as diagonalization. If A has a full set of independent eigenvectors x_1,...,x_n, we can write:
A\left[x_1\cdots x_n\right] = \left[Ax_1\cdots Ax_n\right] = \left[\lambda_1 x_1 \cdots \lambda_n x_n\right] = \left[x_1\cdots x_n\right]\cdot \left[\begin{array}{ccc}\lambda_1&&\\&\ddots&\\&&\lambda_n\end{array}\right]
This gives us another factorization known as eigendecomposition:
AX = X\Lambda \Leftrightarrow \boxed{\boxed{A = X\Lambda X^{-1}}}
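The factorization is easy to verify numerically, assuming the random matrix has n independent eigenvectors (which is almost always the case):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 4))
lam, X = np.linalg.eig(A)

A_rebuilt = X @ np.diag(lam) @ np.linalg.inv(X)  # X Lambda X^{-1}
print(np.allclose(A, A_rebuilt))                 # True
```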
It is important to note at this point that not all matrices are diagonalizable! In particular, a repeated eigenvalue may not come with enough independent eigenvectors, but that is beyond the scope of this document.

Symmetric Positive Definite Matrices

There are some matrices that are more convenient to work with than others. The crown goes to symmetric matrices, and specifically to symmetric matrices with positive eigenvalues. Those matrices play a very important role in science and appear almost everywhere, so we need to know their basic properties:
  1. All the eigenvalues of a symmetric matrix are real numbers.
  2. All n eigenvectors of a symmetric matrix can be chosen to be orthonormal to each other!
Now, if we put all the orthonormal eigenvectors in a matrix Q, then we must have Q^{T}Q = I, and this in fact gives us the powerful spectral decomposition theorem:
\boxed{\boxed{S = Q\Lambda Q^T}}
which is simply the eigendecomposition for symmetric matrices. The power lies in the fact that we don't have to compute the inverse of Q; we just take the transpose.
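NumPy has a dedicated routine for the symmetric case, np.linalg.eigh, which returns real eigenvalues and orthonormal eigenvectors; a short sketch:

```python
import numpy as np

rng = np.random.default_rng(5)
B = rng.standard_normal((4, 4))
S = B + B.T                  # a symmetric matrix

lam, Q = np.linalg.eigh(S)   # real eigenvalues, orthonormal eigenvectors

print(np.allclose(Q.T @ Q, np.eye(4)))         # Q^T Q = I
print(np.allclose(S, Q @ np.diag(lam) @ Q.T))  # S = Q Lambda Q^T
```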

Let's talk a little bit about proving / justifying the two statements above. To explain why all eigenvalues are real, take the equation Sx = \lambda x of a single eigenvalue and multiply both sides by the conjugate transpose \bar{x}^T to get \bar{x}^T S x = \lambda \bar{x}^T x. A little algebra shows that the quantity on the left equals its own conjugate, so it is real, and \bar{x}^T x = ||x||^2 is real and positive, forcing \lambda to be real.
As for why the eigenvectors can be chosen orthogonal to each other, again take Sx = \lambda x for some vector x, but now also consider a different eigenvector y for which Sy = \alpha y, with \alpha \neq \lambda. Clearly y belongs to the nullspace of S-\alpha I. But x belongs to the column space of S-\alpha I, since (S-\alpha I)x = (\lambda-\alpha)x with \lambda - \alpha \neq 0. Because S-\alpha I is symmetric, its column space equals its row space, and the nullspace is orthogonal to the row space, so x and y are orthogonal. The case of a repeated eigenvalue (\alpha = \lambda) needs separate handling, but we'll leave that to the reader.

Now to talk about the star of the show: symmetric positive definite matrices. These are symmetric matrices whose eigenvalues are all positive. We write S \succ 0. There are multiple criteria to test whether a matrix is positive definite (PD):
  1. All the eigenvalues are positive.
  2. For every nonzero vector x, the energy E(x) = x^T S x is positive.
  3. S = A^T A for some matrix A with independent columns, and vice-versa!
  4. All leading determinants are positive. These are the determinants of the matrices that result if we delete the last kk rows and kk columns. (Sylvester's criterion)
  5. All pivots of S are positive.
Let's explore some of the reasoning behind the facts above. The energy, or quadratic form, E(x) = x^T S x is the central theme connecting all the criteria. The energy test works because it is easy to verify on eigenvectors (where it reduces to the positivity of the eigenvalues), and the eigenvectors form a basis for \mathbb{R}^n. The criterion S = A^T A works because of the energy calculation:
E(x) = x^T A^T A x = (Ax)^T (Ax) = ||Ax||^2 > 0
The last inequality is strict (for x \neq 0) because the columns of A are independent. As for the pivot criterion, it connects the LU decomposition of S with its symmetry, giving what is known as the Cholesky decomposition. Briefly, if we write
S = LU = LDL^T
by pulling the pivots into a diagonal matrix D, then we can split D = \sqrt{D}\sqrt{D} and write
\boxed{\boxed{S = (L\sqrt{D})(\sqrt{D}L^T)}}
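NumPy's Cholesky routine returns a lower-triangular factor directly, playing the role of L\sqrt{D} above; a sketch on a positive definite matrix built as A^T A:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((5, 3))   # independent columns (almost surely)
S = A.T @ A                       # symmetric positive definite

L = np.linalg.cholesky(S)         # lower triangular, S = L L^T

print(np.allclose(S, L @ L.T))            # True
print(np.all(np.linalg.eigvalsh(S) > 0))  # all eigenvalues positive
```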
We should note that PD matrices and the spectral decomposition play a huge role in optimization. For example, we can check whether a function is convex by checking whether its Hessian matrix is positive semidefinite (positive definite gives strict convexity).

Singular Value Decomposition

The SVD is the crowning jewel of Linear Algebra. The reason is its versatility: it works for every matrix, unlike the other factorizations. We built a beautiful theory for symmetric matrices, and we want to use it as broadly as possible. So, how do we create a symmetric matrix from an arbitrary matrix A \in \mathbb{R}^{m\times n}?
The idea is to use S_1 = A^T A \in \mathbb{R}^{n\times n} and S_2 = AA^T \in \mathbb{R}^{m\times m}. Since these matrices are symmetric, we can pick a set of r orthonormal eigenvectors of S_1 with nonzero eigenvalues, where r \leq n is the rank of A. Let those be v_1, v_2,...,v_r, let \lambda_1,\lambda_2,...,\lambda_r be the corresponding eigenvalues, and let \sigma_i = \sqrt{\lambda_i}.
Now let u_k = \frac{Av_k}{\sigma_k}. These are orthonormal eigenvectors of S_2 (a small calculation), chosen carefully to satisfy the "eigen"-equation:
\boxed{Av_k = u_k \sigma_k}
In matrix form, after completing v_1,...,v_r to an orthonormal basis v_1,...,v_n of \mathbb{R}^n and u_1,...,u_r to an orthonormal basis u_1,...,u_m of \mathbb{R}^m, we can write:
A[v_1\,\cdots\,v_r\,\,v_{r+1} \cdots v_n] = [u_1\,\cdots\,u_r\,\,u_{r+1} \cdots u_m]\left[\begin{array}{cccc}\sigma_1&&&\\&\ddots&&\\&&\sigma_r&\\&&&0\end{array}\right]\Leftrightarrow \boxed{\boxed{AV = U\Sigma \Leftrightarrow A = U\Sigma V^T}}
where V \in \mathbb{R}^{n\times n}, U \in \mathbb{R}^{m\times m} and \Sigma \in \mathbb{R}^{m\times n}. Importantly, we can do what we always do when factorizing a matrix: break it into rank-1 pieces. If we order the singular values in decreasing order, the first k terms of the summation give the best possible rank-k approximation of A. This is very valuable in Data Science and is known as the Eckart-Young theorem, which is the foundation of PCA (Principal Component Analysis):
\boxed{A = \sigma_1 u_1 v_1^T + \cdots +\sigma_r u_r v_r^T}
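Here is a sketch of the rank-k approximation in NumPy (np.linalg.svd returns the singular values already sorted in decreasing order, so we simply keep the first k rank-1 pieces):

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((6, 4))

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)  # A = U diag(sigma) V^T

k = 2
A_k = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]  # sum of the first k rank-1 pieces

print(np.linalg.matrix_rank(A_k))    # 2
print(np.linalg.norm(A - A_k, 2))    # equals sigma[2] ...
print(sigma[2])                      # ... the first discarded singular value
```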
Finally, it is worth noting that the SVD has a beautiful geometric interpretation that I would highly suggest looking up online for yourself. It basically tells us that every linear transformation can be decomposed into two rotations and one stretching! That, I believe, is amazing to see.

This concludes our quick tour across the Linear Algebra universe. There are many more things to explore, but they will be left for when the need arises. PCA, Jordan canonical forms, and dual spaces are some of the ideas to look at next. I hope this has been helpful!