
A Straight Line Through Linear Algebra

Published at 07:27 PM · 9 min read

Useful resources:


Operations and Properties

Basic Properties

$$
\begin{align*}
\mathbf{A} + \mathbf{B} &= \mathbf{B} + \mathbf{A} \\
\mathbf{A} + (\mathbf{B} + \mathbf{C}) &= (\mathbf{A} + \mathbf{B}) + \mathbf{C} \\
\mathbf{A}(\mathbf{B} + \mathbf{C}) &= \mathbf{A}\mathbf{B} + \mathbf{A}\mathbf{C} \\
\alpha(\mathbf{A} + \mathbf{B}) &= \alpha\mathbf{A} + \alpha\mathbf{B} \\
(\alpha + \beta)\mathbf{A} &= \alpha\mathbf{A} + \beta\mathbf{A} \\
\mathbf{AB} &\neq \mathbf{BA} \quad \text{(in general)} \\
\mathbf{ABC} &= (\mathbf{AB})\mathbf{C} = \mathbf{A}(\mathbf{BC})
\end{align*}
$$

Transpose Identities

The transpose is defined as follows:

$$
(\mathbf{A}^T)_{ij} = \mathbf{A}_{ji}
$$

Identities:

$$
\begin{align*}
(\mathbf{A}^T)^T &= \mathbf{A} \\
(\mathbf{A} + \mathbf{B})^T &= \mathbf{A}^T + \mathbf{B}^T \\
(\mathbf{AB})^T &= \mathbf{B}^T \mathbf{A}^T \\
(\mathbf{ABC})^T &= \mathbf{C}^T \mathbf{B}^T \mathbf{A}^T
\end{align*}
$$

The Trace

The trace of a square matrix is the sum of its diagonal elements:

$$
\text{tr}\mathbf{A} = \sum_{i=1}^n \mathbf{A}_{ii}
$$

Identities:

$$
\begin{align*}
\text{tr}\mathbf{A} &= \text{tr}\mathbf{A}^T \\
\text{tr}(\mathbf{AB}) &= \text{tr}(\mathbf{BA}) = \text{tr}(\mathbf{B}^T\mathbf{A}^T) \\
\text{tr}(\mathbf{ABC}) &= \text{tr}(\mathbf{CAB}) = \text{tr}(\mathbf{BCA}) \\
\text{tr}(\mathbf{A} + \mathbf{B}) &= \text{tr}\mathbf{A} + \text{tr}\mathbf{B} \\
\text{tr}(\mathbf{a}\mathbf{a}^T) &= \mathbf{a}^T\mathbf{a} \\
\text{tr}\mathbf{A} &= \sum_{i=1}^n \lambda_i , \quad \lambda_i = \text{eig}(\mathbf{A})_i
\end{align*}
$$
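A quick numerical sanity check of a few of these identities, as a minimal NumPy sketch on an arbitrary random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))

# tr(A) = tr(A^T)
assert np.isclose(np.trace(A), np.trace(A.T))

# tr(AB) = tr(BA)
assert np.isclose(np.trace(A @ B), np.trace(B @ A))

# tr(A) = sum of eigenvalues (the sum of a real matrix's eigenvalues is real)
assert np.isclose(np.trace(A), np.linalg.eigvals(A).sum().real)
```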

The Inverse

The inverse of a square matrix $\mathbf{A}$ is denoted $\mathbf{A}^{-1}$ and satisfies the following property:

$$
\mathbf{A}^{-1}\mathbf{A} = \mathbf{I} = \mathbf{A}\mathbf{A}^{-1}
$$

Not all matrices have inverses. A square matrix is invertible if and only if its determinant is non-zero; equivalently, it has full rank, its columns are linearly independent, and none of its eigenvalues are zero.

Identities:

$$
\begin{align*}
(\mathbf{A}^{-1})^{-1} &= \mathbf{A} \\
(\mathbf{A}^T)^{-1} &= (\mathbf{A}^{-1})^T \\
(\mathbf{AB})^{-1} &= \mathbf{B}^{-1} \mathbf{A}^{-1}
\end{align*}
$$

Determinants

For a square $n \times n$ matrix $\mathbf{A}$, the determinant is denoted $|\mathbf{A}|$ (or $\text{det}\mathbf{A}$) and satisfies the following properties:

$$
\begin{align*}
\text{det}\mathbf{A} &= \prod_{i=1}^n \lambda_i , \quad \lambda_i = \text{eig}(\mathbf{A})_i \\
\text{det}\mathbf{A} &= \text{det}\mathbf{A}^T \\
\text{det}(\mathbf{AB}) &= \text{det}(\mathbf{A}) \text{det}(\mathbf{B}) \\
\text{det}(\mathbf{A}^n) &= \text{det}(\mathbf{A})^n \\
\text{det}(c\mathbf{A}) &= c^n \text{det}(\mathbf{A})
\end{align*}
$$

Common Vector & Matrix Derivatives

In these examples $b$ is a constant scalar, $\mathbf{b}$ is a constant vector, $\mathbf{B}$ is a constant matrix, $x$ is a scalar, and $\mathbf{x}$ is a vector.

$$
\begin{array}{|c|c|}
\hline
\textbf{Scalar derivative} & \textbf{Vector derivative} \\
\hline
f(x) \to \frac{df}{dx} & f(\mathbf{x}) \to \frac{df}{d\mathbf{x}} \\
\hline
bx \to b & \mathbf{x}^T \mathbf{B} \to \mathbf{B} \\
\hline
bx \to b & \mathbf{x}^T \mathbf{b} \to \mathbf{b} \\
\hline
x^2 \to 2x & \mathbf{x}^T \mathbf{x} \to 2\mathbf{x} \\
\hline
bx^2 \to 2bx & \mathbf{x}^T \mathbf{B} \mathbf{x} \to 2\mathbf{B} \mathbf{x} \\
\hline
\end{array}
$$

(The last identity assumes $\mathbf{B}$ is symmetric; in general $\frac{\partial}{\partial \mathbf{x}} \mathbf{x}^T \mathbf{B} \mathbf{x} = (\mathbf{B} + \mathbf{B}^T)\mathbf{x}$.)
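To see the quadratic-form identity in action, here is a minimal finite-difference sketch in NumPy (the helper `numerical_grad` is just an illustrative central-difference check, and $\mathbf{B}$ is made symmetric so the $2\mathbf{Bx}$ form applies):

```python
import numpy as np

def numerical_grad(f, x, eps=1e-6):
    """Central-difference approximation of df/dx for a scalar-valued f."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))
B = B + B.T                      # symmetric, so the "2Bx" form applies
x = rng.standard_normal(3)

f = lambda x: x @ B @ x          # f(x) = x^T B x
assert np.allclose(numerical_grad(f, x), 2 * B @ x, atol=1e-4)
```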

Vector Derivatives

Scalar-valued Functions

For a scalar-valued function $f: \mathbb{R}^n \to \mathbb{R}$ with $y = \mathbf{\alpha}^T \mathbf{x} = \alpha_1 x_1 + \cdots + \alpha_n x_n$, the derivative is:

$$
\frac{\partial y}{\partial \mathbf{x}} = \frac{\partial \mathbf{x}^T \mathbf{\alpha}}{\partial \mathbf{x}} = \frac{\partial \mathbf{\alpha}^T \mathbf{x}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial \mathbf{\alpha}^T \mathbf{x}}{\partial x_1} \\ \vdots \\ \frac{\partial \mathbf{\alpha}^T \mathbf{x}}{\partial x_n} \end{bmatrix} = \mathbf{\alpha}
$$

Vector-valued Functions

For a vector-valued function $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$, the derivative $\frac{\partial \mathbf{f}}{\partial \mathbf{x}}$ is the $m \times n$ Jacobian matrix with entries $\frac{\partial f_i}{\partial x_j}$. Some related identities for derivatives with respect to a matrix $\mathbf{X}$ (where $\mathbf{E}_{ij}$ is the single-entry matrix with a one in position $(i, j)$ and zeros elsewhere):

$$
\begin{align*}
\frac{\partial \mathbf{a}^T \mathbf{Xb}}{\partial \mathbf{X}} &= \mathbf{ab}^T \\
\frac{\partial \mathbf{a}^T \mathbf{X}^T \mathbf{b}}{\partial \mathbf{X}} &= \mathbf{ba}^T \\
\frac{\partial \mathbf{a}^T \mathbf{X} \mathbf{a}}{\partial \mathbf{X}} &= \mathbf{aa}^T \\
\frac{\partial \mathbf{X}}{\partial X_{ij}} &= \mathbf{E}_{ij}
\end{align*}
$$

Eigendecomposition

An eigenvector of a square matrix $\mathbf{A}$ is a non-zero vector $\mathbf{v}$ such that

$$
\mathbf{Av} = \lambda \mathbf{v}
$$

for some scalar $\lambda$, called the eigenvalue corresponding to $\mathbf{v}$. The eigendecomposition of $\mathbf{A}$ is then given by:

$$
\mathbf{A} = \mathbf{V} \text{diag}(\mathbf{\lambda}) \mathbf{V}^{-1}
$$

where $\mathbf{V}$ is the matrix whose columns are the eigenvectors of $\mathbf{A}$, and $\text{diag}(\mathbf{\lambda})$ is the diagonal matrix of eigenvalues.

Not all matrices have an eigendecomposition. A matrix is diagonalizable if and only if it has $n$ linearly independent eigenvectors. For real symmetric matrices, however, the eigendecomposition always exists, and it can be written using only real-valued eigenvectors and eigenvalues:

$$
\mathbf{A} = \mathbf{Q} \mathbf{\Lambda} \mathbf{Q}^T
$$

where $\mathbf{Q}$ is an orthogonal matrix whose columns are the eigenvectors of $\mathbf{A}$, and $\mathbf{\Lambda}$ is a diagonal matrix of eigenvalues.
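A minimal NumPy sketch of the symmetric case, using `np.linalg.eigh` on an arbitrary symmetric matrix to verify $\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^T$:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M + M.T                      # make a real symmetric matrix

# eigh is specialized for symmetric/Hermitian matrices: it returns real
# eigenvalues and an orthogonal eigenvector matrix Q
eigvals, Q = np.linalg.eigh(A)

assert np.allclose(Q @ np.diag(eigvals) @ Q.T, A)   # A = Q Λ Q^T
assert np.allclose(Q.T @ Q, np.eye(4))              # Q is orthogonal
```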

Solving for Eigenvectors and Eigenvalues

Given a square matrix $\mathbf{A}$, we can solve for its eigenvectors and eigenvalues by solving the characteristic equation for non-zero $\mathbf{v}$:

$$
\mathbf{Av} = \lambda \mathbf{v} \implies \mathbf{Av} - \lambda \mathbf{v} = \mathbf{0} \implies (\mathbf{A} - \lambda \mathbf{I})\mathbf{v} = \mathbf{0}
$$

Since we are looking for non-zero $\mathbf{v}$, the matrix $\mathbf{A} - \lambda \mathbf{I}$ must be singular, giving us the characteristic polynomial:

$$
\text{det}(\mathbf{A} - \lambda \mathbf{I}) = 0
$$

Solving this equation gives us the eigenvalues $\lambda$ of $\mathbf{A}$. Substituting these back into the original equation, we can solve for the eigenvectors $\mathbf{v}$.

Example

Consider the matrix

$$
\mathbf{A} = \begin{bmatrix} 3 & 1 \\ 0 & 2 \end{bmatrix}
$$

The characteristic equation is

$$
\text{det}(\mathbf{A} - \lambda \mathbf{I}) = \text{det}\begin{bmatrix} 3 - \lambda & 1 \\ 0 & 2 - \lambda \end{bmatrix} = (3 - \lambda)(2 - \lambda) = 0
$$

because the determinant of a $2 \times 2$ matrix is $ad - bc$, and here $bc = 1 \cdot 0 = 0$. Solving this gives the eigenvalues $\lambda = 3, 2$. Substituting these back into $(\mathbf{A} - \lambda \mathbf{I})\mathbf{v} = \mathbf{0}$, we can solve for the eigenvectors:

$$
\begin{align*}
(\mathbf{A} - 3\mathbf{I})\mathbf{v} &= \begin{bmatrix} 0 & 1 \\ 0 & -1 \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \implies v_2 = 0, \quad \mathbf{v} = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \\
(\mathbf{A} - 2\mathbf{I})\mathbf{v} &= \begin{bmatrix} 1 & 1 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \implies v_1 = -v_2, \quad \mathbf{v} = \begin{bmatrix} 1 \\ -1 \end{bmatrix}
\end{align*}
$$
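A quick check of this example with NumPy (eigenvectors come back normalized and in no guaranteed order, so the sketch just verifies $\mathbf{Av} = \lambda\mathbf{v}$ for each pair):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 2.0]])

eigvals, V = np.linalg.eig(A)
print(eigvals)   # eigenvalues 3 and 2 (order not guaranteed)

# Each column of V is a unit-norm eigenvector; check A v = λ v for each pair
for lam, v in zip(eigvals, V.T):
    assert np.allclose(A @ v, lam * v)
```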

Matrix Inversion via Eigendecomposition

You can compute the inverse of a matrix by computing its eigendecomposition, inverting the eigenvalues, and transforming back. Since the eigenvalues sit in a diagonal matrix, inverting them amounts to taking the reciprocal of each one.

If a matrix $\mathbf{A}$ can be eigendecomposed and none of its eigenvalues are zero, then $\mathbf{A}$ is invertible and its inverse is given by:

$$
\mathbf{A}^{-1} = \mathbf{V} \text{diag}(\mathbf{\lambda}^{-1}) \mathbf{V}^{-1}
$$

If $\mathbf{A}$ is a symmetric matrix, the eigenvector matrix can be chosen orthogonal (the $\mathbf{Q}$ above), so $\mathbf{V}^{-1} = \mathbf{V}^T$. Furthermore, because $\text{diag}(\mathbf{\lambda})$ is a diagonal matrix, its inverse is simply the reciprocal of its diagonal elements:

$$
\left[ \text{diag}(\mathbf{\lambda})^{-1} \right]_{ii} = \frac{1}{\lambda_i}
$$
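A minimal sketch of this, assuming a symmetric positive definite matrix so that the eigenvector matrix is orthogonal and no eigenvalue is zero:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M @ M.T + np.eye(4)          # symmetric positive definite: all eigenvalues > 0

eigvals, Q = np.linalg.eigh(A)
A_inv = Q @ np.diag(1.0 / eigvals) @ Q.T   # V diag(λ^{-1}) V^T for symmetric A

assert np.allclose(A_inv, np.linalg.inv(A))
```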

Eigenvalue Facts

  1. The product of the eigenvalues of a matrix is equal to its determinant: $\text{det}(\mathbf{A}) = \prod_{i=1}^{N_{\lambda}} \lambda_i^{n_i}$, where $n_i$ is the multiplicity of the eigenvalue $\lambda_i$.
  2. The sum of the eigenvalues of a matrix is equal to its trace: $\text{tr}(\mathbf{A}) = \sum_{i=1}^{N_{\lambda}} n_i \lambda_i$.
  3. If the eigenvalues of $\mathbf{A}$ are $\lambda_i$, then the eigenvalues of $\mathbf{A}^{-1}$ are simply $\lambda_i^{-1}$.

Eigenvector Facts

  1. The eigenvectors of $\mathbf{A}^{-1}$ are the same as the eigenvectors of $\mathbf{A}$.

Singular Value Decomposition (SVD)

Since not all matrices have an eigendecomposition, we can use the singular value decomposition (SVD) to decompose any matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ into three matrices:

$$
\mathbf{A} = \mathbf{U} \mathbf{D} \mathbf{V}^T
$$

where $\mathbf{U} \in \mathbb{R}^{m \times m}$ is an orthogonal matrix whose columns are the left-singular vectors of $\mathbf{A}$, $\mathbf{D} \in \mathbb{R}^{m \times n}$ is a rectangular diagonal matrix whose entries are the singular values, and $\mathbf{V} \in \mathbb{R}^{n \times n}$ is an orthogonal matrix whose columns are the right-singular vectors.
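
A minimal NumPy sketch of the decomposition on an arbitrary non-square matrix (padding $\mathbf{D}$ out to an $m \times n$ matrix is only needed for the reconstruction check):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))            # works for any (even non-square) matrix

U, s, Vt = np.linalg.svd(A)                # s holds the singular values
D = np.zeros_like(A)
D[:len(s), :len(s)] = np.diag(s)           # embed them in an m x n "diagonal" matrix

assert np.allclose(U @ D @ Vt, A)          # A = U D V^T
```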

The Moore-Penrose Pseudoinverse

Since matrix inversion is not defined for non-square matrices, we can use the SVD to compute the Moore-Penrose pseudoinverse of a matrix $\mathbf{A}$:

$$
\mathbf{A}^+ = \mathbf{V} \mathbf{D}^+ \mathbf{U}^T
$$

where $\mathbf{D}^+$ is the pseudoinverse of $\mathbf{D}$, obtained by taking the reciprocal of the non-zero singular values and transposing the resulting matrix. $\mathbf{A}^+$ minimizes the Frobenius norm of the residual $||\mathbf{A}\mathbf{A}^+ - \mathbf{I}||_F$.

For an equation $\mathbf{Ax} = \mathbf{b}$, the least-squares solution is given by $\mathbf{x} = \mathbf{A}^+ \mathbf{b}$ (when there are multiple least-squares solutions, this is the one with minimum norm).
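A minimal sketch, assuming a random overdetermined system; `np.linalg.pinv` computes the pseudoinverse via the SVD, and the result matches NumPy's least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))            # tall matrix: overdetermined system
b = rng.standard_normal(6)

x = np.linalg.pinv(A) @ b                  # least-squares solution via the pseudoinverse

# Same solution as the dedicated least-squares solver
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x, x_lstsq)
```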

Principal Component Analysis (PCA)

PCA is a technique for dimensionality reduction that decomposes a dataset into its principal components. Given a dataset $\mathbf{X} \in \mathbb{R}^{n \times d}$, the goal is to find an orthogonal matrix $\mathbf{W} \in \mathbb{R}^{d \times d}$ whose columns (the principal components) define directions such that projecting the data onto the leading ones maximizes the variance of the projected data.

Computing PCA

There are several ways to compute Principal Component Analysis, but the most common method is to use the eigendecomposition of the covariance matrix of the data. Given a dataset $\mathbf{X} \in \mathbb{R}^{n \times d}$, the steps are as follows:

  1. Center the data: subtract the mean of each feature from the dataset (`xc = x - x.mean(dim=0)`).
  2. Compute the covariance matrix of the centered data: $\mathbf{C} = \frac{1}{n-1} \mathbf{X}_c^T \mathbf{X}_c$ (`C = 1/(xc.shape[0]-1) * xc.T @ xc`).
  3. Compute the eigenvectors and eigenvalues of $\mathbf{C}$: $\mathbf{C} = \mathbf{W} \mathbf{\Lambda} \mathbf{W}^T$.

The eigenvectors of $\mathbf{C}$ are the principal components of the data, and the eigenvalues are the variances of the data along each principal component.
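A minimal NumPy sketch of these steps on a toy random dataset (the variable names mirror the steps above; keeping the top $k$ columns of $\mathbf{W}$ gives the projection):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))   # toy correlated data

Xc = X - X.mean(axis=0)                       # 1. center each feature
C = Xc.T @ Xc / (Xc.shape[0] - 1)             # 2. covariance matrix
eigvals, W = np.linalg.eigh(C)                # 3. eigendecomposition (C is symmetric)

# Sort components by decreasing variance and project onto the top k
order = np.argsort(eigvals)[::-1]
eigvals, W = eigvals[order], W[:, order]
k = 2
X_proj = Xc @ W[:, :k]                        # k-dimensional representation
```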

Interpretation: the principal components are ordered by how much variance they explain ($\lambda_i / \sum_j \lambda_j$ is the fraction of total variance captured by component $i$), so keeping only the top $k$ components and projecting the centered data onto them gives a $k$-dimensional representation that preserves as much of the variance as possible.

Least Squares and the Normal Equation

Given a training set, define the design matrix $\mathbf{X}$ to be the $n \times d$ matrix (actually $n \times (d+1)$ if we include the bias term) whose rows are the input vectors $\mathbf{x}_i$:

$$
\mathbf{X} = \begin{bmatrix} \text{---} \; \mathbf{x}_1^T \; \text{---} \\ \text{---} \; \mathbf{x}_2^T \; \text{---} \\ \vdots \\ \text{---} \; \mathbf{x}_n^T \; \text{---} \end{bmatrix}
$$

Also, let $\mathbf{y}$ be the $n$-dimensional vector of target values from the training set, and $\mathbf{w}$ be the $d$-dimensional vector of weights.

$$
\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
$$

The least squares objective is to minimize the sum of squared errors:

$$
L(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^n (y_i - \mathbf{w}^T \mathbf{x}_i)^2 = \frac{1}{2} ||\mathbf{y} - \mathbf{X}\mathbf{w}||^2
$$

Finally, to minimize $L(\mathbf{w})$, we first compute the gradient with respect to $\mathbf{w}$:

$$
\begin{align*}
\nabla_{\mathbf{w}} L(\mathbf{w}) &= \nabla_{\mathbf{w}} \frac{1}{2} ||\mathbf{Xw} - \mathbf{y}||^2 \\
&= \nabla_{\mathbf{w}} \frac{1}{2} (\mathbf{Xw} - \mathbf{y})^T (\mathbf{Xw} - \mathbf{y}) \\
&= \nabla_{\mathbf{w}} \frac{1}{2} (\mathbf{w}^T \mathbf{X}^T - \mathbf{y}^T)(\mathbf{Xw} - \mathbf{y}) \\
&= \nabla_{\mathbf{w}} \frac{1}{2} (\mathbf{w}^T \mathbf{X}^T \mathbf{Xw} - \mathbf{w}^T \mathbf{X}^T \mathbf{y} - \mathbf{y}^T \mathbf{Xw} + \mathbf{y}^T \mathbf{y}) \\
&= \nabla_{\mathbf{w}} \frac{1}{2} (\mathbf{w}^T \mathbf{X}^T \mathbf{Xw} - 2\mathbf{w}^T \mathbf{X}^T \mathbf{y} + \mathbf{y}^T \mathbf{y}) \\
&= \frac{1}{2} (2\mathbf{X}^T \mathbf{Xw} - 2\mathbf{X}^T \mathbf{y}) \\
&= \mathbf{X}^T \mathbf{Xw} - \mathbf{X}^T \mathbf{y}
\end{align*}
$$

Now, setting the gradient to zero, we have the normal equation:

$$
\begin{align*}
\mathbf{X}^T \mathbf{Xw} &= \mathbf{X}^T \mathbf{y} \\
\mathbf{w} &= (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}
\end{align*}
$$

$$
\boxed{\mathbf{w} = \text{argmin}_{\mathbf{w}} L(\mathbf{w}) = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}}
$$

This assumes $\mathbf{X}^T \mathbf{X}$ is invertible; otherwise the Moore-Penrose pseudoinverse from above can be used instead.
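A minimal NumPy sketch of the normal equation on synthetic data (solving the linear system rather than forming the inverse explicitly, which is the numerically safer route):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = np.hstack([np.ones((n, 1)), rng.standard_normal((n, d))])   # design matrix with bias column
w_true = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ w_true + 0.1 * rng.standard_normal(n)                   # noisy targets

# Normal equation: solve (X^T X) w = X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)

# Matches NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w, w_lstsq)
print(w)   # close to w_true
```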
