
A Straight Line Through Linear Algebra

Published at 07:27 PM · 9 min read

Useful resources:


Operations and Properties

Basic Properties

$$
\begin{align*}
\mathbf{A} + \mathbf{B} &= \mathbf{B} + \mathbf{A} \\
\mathbf{A} + (\mathbf{B} + \mathbf{C}) &= (\mathbf{A} + \mathbf{B}) + \mathbf{C} \\
\mathbf{A}(\mathbf{B} + \mathbf{C}) &= \mathbf{A}\mathbf{B} + \mathbf{A}\mathbf{C} \\
\alpha(\mathbf{A} + \mathbf{B}) &= \alpha\mathbf{A} + \alpha\mathbf{B} \\
(\alpha + \beta)\mathbf{A} &= \alpha\mathbf{A} + \beta\mathbf{A} \\
\mathbf{AB} &\neq \mathbf{BA} \quad \text{(in general)} \\
\mathbf{ABC} &= (\mathbf{AB})\mathbf{C} = \mathbf{A}(\mathbf{BC})
\end{align*}
$$

Transpose Identities

The transpose is defined as follows:

$$
(\mathbf{A}^T)_{ij} = \mathbf{A}_{ji}
$$

Identities:

$$
\begin{align*}
(\mathbf{A}^T)^T &= \mathbf{A} \\
(\mathbf{A} + \mathbf{B})^T &= \mathbf{A}^T + \mathbf{B}^T \\
(\mathbf{AB})^T &= \mathbf{B}^T \mathbf{A}^T \\
(\mathbf{ABC})^T &= \mathbf{C}^T \mathbf{B}^T \mathbf{A}^T
\end{align*}
$$

The Trace

The trace of a square matrix is the sum of its diagonal elements:

$$
\text{tr}\mathbf{A} = \sum_{i=1}^n \mathbf{A}_{ii}
$$

Identities:

$$
\begin{align*}
\text{tr}\mathbf{A} &= \text{tr}\mathbf{A}^T \\
\text{tr}(\mathbf{AB}) &= \text{tr}(\mathbf{BA}) = \text{tr}(\mathbf{B}^T\mathbf{A}^T) \\
\text{tr}(\mathbf{ABC}) &= \text{tr}(\mathbf{CAB}) = \text{tr}(\mathbf{BCA}) \\
\text{tr}(\mathbf{A} + \mathbf{B}) &= \text{tr}\mathbf{A} + \text{tr}\mathbf{B} \\
\text{tr}(\mathbf{a}\mathbf{a}^T) &= \mathbf{a}^T\mathbf{a} \\
\text{tr}\mathbf{A} &= \sum_{i=1}^n \lambda_i , \quad \lambda_i = \text{eig}(\mathbf{A})_i
\end{align*}
$$
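A quick numerical sanity check of a few of these identities, as a minimal NumPy sketch on an arbitrary random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))

# tr(A) = tr(A^T)
assert np.isclose(np.trace(A), np.trace(A.T))

# tr(AB) = tr(BA)
assert np.isclose(np.trace(A @ B), np.trace(B @ A))

# tr(A) = sum of eigenvalues (the sum of a real matrix's eigenvalues is real)
assert np.isclose(np.trace(A), np.linalg.eigvals(A).sum().real)
```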

The Inverse

The inverse of a square matrix $\mathbf{A}$ is denoted $\mathbf{A}^{-1}$ and satisfies the following property:

$$
\mathbf{A}^{-1}\mathbf{A} = \mathbf{I} = \mathbf{A}\mathbf{A}^{-1}
$$

Not all matrices have inverses. A square matrix is invertible if and only if its determinant is non-zero; equivalently, it has full rank, its columns are linearly independent, and none of its eigenvalues are zero.

Identities:

$$
\begin{align*}
(\mathbf{A}^{-1})^{-1} &= \mathbf{A} \\
(\mathbf{A}^T)^{-1} &= (\mathbf{A}^{-1})^T \\
(\mathbf{AB})^{-1} &= \mathbf{B}^{-1} \mathbf{A}^{-1}
\end{align*}
$$

Determinants

For a square $n \times n$ matrix $\mathbf{A}$, the determinant is denoted $|\mathbf{A}|$ (or $\text{det}\mathbf{A}$) and satisfies the following properties:

$$
\begin{align*}
\text{det}\mathbf{A} &= \prod_{i=1}^n \lambda_i , \quad \lambda_i = \text{eig}(\mathbf{A})_i \\
\text{det}\mathbf{A} &= \text{det}\mathbf{A}^T \\
\text{det}(\mathbf{AB}) &= \text{det}(\mathbf{A}) \text{det}(\mathbf{B}) \\
\text{det}(\mathbf{A}^n) &= \text{det}(\mathbf{A})^n \\
\text{det}(c\mathbf{A}) &= c^n \text{det}(\mathbf{A})
\end{align*}
$$

Common Vector & Matrix Derivatives

In these examples $b$ is a constant scalar, $\mathbf{b}$ is a constant vector, $\mathbf{B}$ is a constant matrix, $x$ is a scalar, and $\mathbf{x}$ is a vector.

$$
\begin{array}{|c|c|}
\hline
\textbf{Scalar derivative} & \textbf{Vector derivative} \\
\hline
f(x) \to \frac{df}{dx} & f(\mathbf{x}) \to \frac{df}{d\mathbf{x}} \\
\hline
bx \to b & \mathbf{x}^T \mathbf{B} \to \mathbf{B} \\
\hline
bx \to b & \mathbf{x}^T \mathbf{b} \to \mathbf{b} \\
\hline
x^2 \to 2x & \mathbf{x}^T \mathbf{x} \to 2\mathbf{x} \\
\hline
bx^2 \to 2bx & \mathbf{x}^T \mathbf{B} \mathbf{x} \to 2\mathbf{B} \mathbf{x} \\
\hline
\end{array}
$$

(The last identity assumes $\mathbf{B}$ is symmetric; in general $\frac{\partial}{\partial \mathbf{x}} \mathbf{x}^T \mathbf{B} \mathbf{x} = (\mathbf{B} + \mathbf{B}^T)\mathbf{x}$.)
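To see the quadratic-form identity in action, here is a minimal finite-difference sketch in NumPy (the helper `numerical_grad` is just an illustrative central-difference check, and $\mathbf{B}$ is made symmetric so the $2\mathbf{Bx}$ form applies):

```python
import numpy as np

def numerical_grad(f, x, eps=1e-6):
    """Central-difference approximation of df/dx for a scalar-valued f."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))
B = B + B.T                      # symmetric, so the "2Bx" form applies
x = rng.standard_normal(3)

f = lambda x: x @ B @ x          # f(x) = x^T B x
assert np.allclose(numerical_grad(f, x), 2 * B @ x, atol=1e-4)
```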

Vector Derivatives

Scalar-valued Functions

For a scalar-valued function $f: \mathbb{R}^n \to \mathbb{R}$ with $y = \mathbf{\alpha}^T \mathbf{x} = \alpha_1 x_1 + \cdots + \alpha_n x_n$, the derivative is:

$$
\frac{\partial y}{\partial \mathbf{x}} = \frac{\partial \mathbf{x}^T \mathbf{\alpha}}{\partial \mathbf{x}} = \frac{\partial \mathbf{\alpha}^T \mathbf{x}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial \mathbf{\alpha}^T \mathbf{x}}{\partial x_1} \\ \vdots \\ \frac{\partial \mathbf{\alpha}^T \mathbf{x}}{\partial x_n} \end{bmatrix} = \mathbf{\alpha}
$$

Vector-valued Functions

For a vector-valued function $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$, the derivative $\frac{\partial \mathbf{f}}{\partial \mathbf{x}}$ is the $m \times n$ Jacobian matrix with entries $\frac{\partial f_i}{\partial x_j}$. Some related identities for derivatives with respect to a matrix $\mathbf{X}$ (where $\mathbf{E}_{ij}$ is the single-entry matrix with a one in position $(i, j)$ and zeros elsewhere):

$$
\begin{align*}
\frac{\partial \mathbf{a}^T \mathbf{Xb}}{\partial \mathbf{X}} &= \mathbf{ab}^T \\
\frac{\partial \mathbf{a}^T \mathbf{X}^T \mathbf{b}}{\partial \mathbf{X}} &= \mathbf{ba}^T \\
\frac{\partial \mathbf{a}^T \mathbf{X} \mathbf{a}}{\partial \mathbf{X}} &= \mathbf{aa}^T \\
\frac{\partial \mathbf{X}}{\partial X_{ij}} &= \mathbf{E}_{ij}
\end{align*}
$$

Eigendecomposition

An eigenvector of a square matrix $\mathbf{A}$ is a non-zero vector $\mathbf{v}$ such that

$$
\mathbf{Av} = \lambda \mathbf{v}
$$

for some scalar $\lambda$, called the eigenvalue corresponding to $\mathbf{v}$. The eigendecomposition of $\mathbf{A}$ is then given by:

$$
\mathbf{A} = \mathbf{V} \text{diag}(\mathbf{\lambda}) \mathbf{V}^{-1}
$$

where $\mathbf{V}$ is the matrix whose columns are the eigenvectors of $\mathbf{A}$, and $\text{diag}(\mathbf{\lambda})$ is the diagonal matrix of eigenvalues.

Not all matrices have an eigendecomposition. A matrix is diagonalizable if and only if it has $n$ linearly independent eigenvectors. For real symmetric matrices, however, the eigendecomposition always exists, and it can be written using only real-valued eigenvectors and eigenvalues:

$$
\mathbf{A} = \mathbf{Q} \mathbf{\Lambda} \mathbf{Q}^T
$$

where $\mathbf{Q}$ is an orthogonal matrix whose columns are the eigenvectors of $\mathbf{A}$, and $\mathbf{\Lambda}$ is a diagonal matrix of eigenvalues.
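A minimal NumPy sketch of the symmetric case, using `np.linalg.eigh` on an arbitrary symmetric matrix to verify $\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^T$:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M + M.T                      # make a real symmetric matrix

# eigh is specialized for symmetric/Hermitian matrices: it returns real
# eigenvalues and an orthogonal eigenvector matrix Q
eigvals, Q = np.linalg.eigh(A)

assert np.allclose(Q @ np.diag(eigvals) @ Q.T, A)   # A = Q Λ Q^T
assert np.allclose(Q.T @ Q, np.eye(4))              # Q is orthogonal
```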

Solving for Eigenvectors and Eigenvalues

Given a square matrix $\mathbf{A}$, we can solve for its eigenvectors and eigenvalues by solving the characteristic equation for non-zero $\mathbf{v}$:

$$
\mathbf{Av} = \lambda \mathbf{v} \implies \mathbf{Av} - \lambda \mathbf{v} = \mathbf{0} \implies (\mathbf{A} - \lambda \mathbf{I})\mathbf{v} = \mathbf{0}
$$

Since we are looking for non-zero $\mathbf{v}$, the matrix $\mathbf{A} - \lambda \mathbf{I}$ must be singular, giving us the characteristic polynomial:

$$
\text{det}(\mathbf{A} - \lambda \mathbf{I}) = 0
$$

Solving this equation gives us the eigenvalues $\lambda$ of $\mathbf{A}$. Substituting these back into the original equation, we can solve for the eigenvectors $\mathbf{v}$.

Example

Consider the matrix

$$
\mathbf{A} = \begin{bmatrix} 3 & 1 \\ 0 & 2 \end{bmatrix}
$$

The characteristic equation is

$$
\text{det}(\mathbf{A} - \lambda \mathbf{I}) = \text{det}\begin{bmatrix} 3 - \lambda & 1 \\ 0 & 2 - \lambda \end{bmatrix} = (3 - \lambda)(2 - \lambda) = 0
$$

because the determinant of a $2 \times 2$ matrix is $ad - bc$, and here $bc = 1 \cdot 0 = 0$. Solving this gives the eigenvalues $\lambda = 3, 2$. Substituting these back into $(\mathbf{A} - \lambda \mathbf{I})\mathbf{v} = \mathbf{0}$, we can solve for the eigenvectors:

$$
\begin{align*}
(\mathbf{A} - 3\mathbf{I})\mathbf{v} &= \begin{bmatrix} 0 & 1 \\ 0 & -1 \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \implies v_2 = 0, \quad \mathbf{v} = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \\
(\mathbf{A} - 2\mathbf{I})\mathbf{v} &= \begin{bmatrix} 1 & 1 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \implies v_1 = -v_2, \quad \mathbf{v} = \begin{bmatrix} 1 \\ -1 \end{bmatrix}
\end{align*}
$$
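A quick check of this example with NumPy (eigenvectors come back normalized and in no guaranteed order, so the sketch just verifies $\mathbf{Av} = \lambda\mathbf{v}$ for each pair):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 2.0]])

eigvals, V = np.linalg.eig(A)
print(eigvals)   # eigenvalues 3 and 2 (order not guaranteed)

# Each column of V is a unit-norm eigenvector; check A v = λ v for each pair
for lam, v in zip(eigvals, V.T):
    assert np.allclose(A @ v, lam * v)
```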

Matrix Inversion via Eigendecomposition

You can compute the inverse of a matrix by computing its eigendecomposition, inverting the eigenvalues, and transforming back. Since the eigenvalues sit in a diagonal matrix, inverting them amounts to taking the reciprocal of each one.

If a matrix $\mathbf{A}$ can be eigendecomposed and none of its eigenvalues are zero, then $\mathbf{A}$ is invertible and its inverse is given by:

$$
\mathbf{A}^{-1} = \mathbf{V} \text{diag}(\mathbf{\lambda}^{-1}) \mathbf{V}^{-1}
$$

If $\mathbf{A}$ is a symmetric matrix, the eigenvector matrix can be chosen orthogonal (the $\mathbf{Q}$ above), so $\mathbf{V}^{-1} = \mathbf{V}^T$. Furthermore, because $\text{diag}(\mathbf{\lambda})$ is a diagonal matrix, its inverse is simply the reciprocal of its diagonal elements:

$$
\left[ \text{diag}(\mathbf{\lambda})^{-1} \right]_{ii} = \frac{1}{\lambda_i}
$$
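A minimal sketch of this, assuming a symmetric positive definite matrix so that the eigenvector matrix is orthogonal and no eigenvalue is zero:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M @ M.T + np.eye(4)          # symmetric positive definite: all eigenvalues > 0

eigvals, Q = np.linalg.eigh(A)
A_inv = Q @ np.diag(1.0 / eigvals) @ Q.T   # V diag(λ^{-1}) V^T for symmetric A

assert np.allclose(A_inv, np.linalg.inv(A))
```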

Eigenvalue Facts

  1. The product of the eigenvalues of a matrix is equal to its determinant: $\text{det}(\mathbf{A}) = \prod_{i=1}^{N_{\lambda}} \lambda_i^{n_i}$, where $n_i$ is the multiplicity of the eigenvalue $\lambda_i$.
  2. The sum of the eigenvalues of a matrix is equal to its trace: $\text{tr}(\mathbf{A}) = \sum_{i=1}^{N_{\lambda}} n_i \lambda_i$.
  3. If the eigenvalues of $\mathbf{A}$ are $\lambda_i$, then the eigenvalues of $\mathbf{A}^{-1}$ are simply $\lambda_i^{-1}$.

Eigenvector Facts

  1. The eigenvectors of $\mathbf{A}^{-1}$ are the same as the eigenvectors of $\mathbf{A}$.

Singular Value Decomposition (SVD)

Since not all matrices have an eigendecomposition, we can use the singular value decomposition (SVD) to decompose any matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ into three matrices:

$$
\mathbf{A} = \mathbf{U} \mathbf{D} \mathbf{V}^T
$$

where $\mathbf{U} \in \mathbb{R}^{m \times m}$ is an orthogonal matrix whose columns are the left-singular vectors of $\mathbf{A}$, $\mathbf{D} \in \mathbb{R}^{m \times n}$ is a rectangular diagonal matrix whose entries are the singular values, and $\mathbf{V} \in \mathbb{R}^{n \times n}$ is an orthogonal matrix whose columns are the right-singular vectors.
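
A minimal NumPy sketch of the decomposition on an arbitrary non-square matrix (padding $\mathbf{D}$ out to an $m \times n$ matrix is only needed for the reconstruction check):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))            # works for any (even non-square) matrix

U, s, Vt = np.linalg.svd(A)                # s holds the singular values
D = np.zeros_like(A)
D[:len(s), :len(s)] = np.diag(s)           # embed them in an m x n "diagonal" matrix

assert np.allclose(U @ D @ Vt, A)          # A = U D V^T
```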

The Moore-Penrose Pseudoinverse

Since matrix inversion is not defined for non-square matrices, we can use the SVD to compute the Moore-Penrose pseudoinverse of a matrix $\mathbf{A}$:

$$
\mathbf{A}^+ = \mathbf{V} \mathbf{D}^+ \mathbf{U}^T
$$

where $\mathbf{D}^+$ is the pseudoinverse of $\mathbf{D}$, obtained by taking the reciprocal of the non-zero singular values and transposing the resulting matrix. $\mathbf{A}^+$ minimizes the Frobenius norm of the residual $||\mathbf{A}\mathbf{A}^+ - \mathbf{I}||_F$.

For an equation $\mathbf{Ax} = \mathbf{b}$, the least-squares solution is given by $\mathbf{x} = \mathbf{A}^+ \mathbf{b}$ (when there are multiple least-squares solutions, this is the one with minimum norm).
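A minimal sketch, assuming a random overdetermined system; `np.linalg.pinv` computes the pseudoinverse via the SVD, and the result matches NumPy's least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))            # tall matrix: overdetermined system
b = rng.standard_normal(6)

x = np.linalg.pinv(A) @ b                  # least-squares solution via the pseudoinverse

# Same solution as the dedicated least-squares solver
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x, x_lstsq)
```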

Principal Component Analysis (PCA)

PCA is a technique for dimensionality reduction that decomposes a dataset into its principal components. Given a dataset $\mathbf{X} \in \mathbb{R}^{n \times d}$, the goal is to find an orthogonal matrix $\mathbf{W} \in \mathbb{R}^{d \times d}$ whose columns (the principal components) define directions such that projecting the data onto the leading ones maximizes the variance of the projected data.

Computing PCA

There are several ways to compute Principal Component Analysis, but the most common method is to use the eigendecomposition of the covariance matrix of the data. Given a dataset $\mathbf{X} \in \mathbb{R}^{n \times d}$, the steps are as follows:

  1. Center the data: subtract the mean of each feature from the dataset (`xc = x - x.mean(dim=0)`).
  2. Compute the covariance matrix of the centered data: $\mathbf{C} = \frac{1}{n-1} \mathbf{X}_c^T \mathbf{X}_c$ (`C = 1/(xc.shape[0]-1) * xc.T @ xc`).
  3. Compute the eigenvectors and eigenvalues of $\mathbf{C}$: $\mathbf{C} = \mathbf{W} \mathbf{\Lambda} \mathbf{W}^T$.

The eigenvectors of $\mathbf{C}$ are the principal components of the data, and the eigenvalues are the variances of the data along each principal component.
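A minimal NumPy sketch of these steps on a toy random dataset (the variable names mirror the steps above; keeping the top $k$ columns of $\mathbf{W}$ gives the projection):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))   # toy correlated data

Xc = X - X.mean(axis=0)                       # 1. center each feature
C = Xc.T @ Xc / (Xc.shape[0] - 1)             # 2. covariance matrix
eigvals, W = np.linalg.eigh(C)                # 3. eigendecomposition (C is symmetric)

# Sort components by decreasing variance and project onto the top k
order = np.argsort(eigvals)[::-1]
eigvals, W = eigvals[order], W[:, order]
k = 2
X_proj = Xc @ W[:, :k]                        # k-dimensional representation
```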

Interpretation: the principal components are ordered by how much variance they explain ($\lambda_i / \sum_j \lambda_j$ is the fraction of total variance captured by component $i$), so keeping only the top $k$ components and projecting the centered data onto them gives a $k$-dimensional representation that preserves as much of the variance as possible.

Least Squares and the Normal Equation

Given a training set, define the design matrix $\mathbf{X}$ to be the $n \times d$ matrix (actually $n \times (d+1)$ if we include the bias term) whose rows are the input vectors $\mathbf{x}_i$:

$$
\mathbf{X} = \begin{bmatrix} \text{---} \; \mathbf{x}_1^T \; \text{---} \\ \text{---} \; \mathbf{x}_2^T \; \text{---} \\ \vdots \\ \text{---} \; \mathbf{x}_n^T \; \text{---} \end{bmatrix}
$$

Also, let $\mathbf{y}$ be the $n$-dimensional vector of target values from the training set, and $\mathbf{w}$ be the $d$-dimensional vector of weights.

$$
\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
$$

The least squares objective is to minimize the sum of squared errors:

$$
L(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^n (y_i - \mathbf{w}^T \mathbf{x}_i)^2 = \frac{1}{2} ||\mathbf{y} - \mathbf{X}\mathbf{w}||^2
$$

Finally, to minimize $L(\mathbf{w})$, we first compute the gradient with respect to $\mathbf{w}$:

$$
\begin{align*}
\nabla_{\mathbf{w}} L(\mathbf{w}) &= \nabla_{\mathbf{w}} \frac{1}{2} ||\mathbf{Xw} - \mathbf{y}||^2 \\
&= \nabla_{\mathbf{w}} \frac{1}{2} (\mathbf{Xw} - \mathbf{y})^T (\mathbf{Xw} - \mathbf{y}) \\
&= \nabla_{\mathbf{w}} \frac{1}{2} (\mathbf{w}^T \mathbf{X}^T - \mathbf{y}^T)(\mathbf{Xw} - \mathbf{y}) \\
&= \nabla_{\mathbf{w}} \frac{1}{2} (\mathbf{w}^T \mathbf{X}^T \mathbf{Xw} - \mathbf{w}^T \mathbf{X}^T \mathbf{y} - \mathbf{y}^T \mathbf{Xw} + \mathbf{y}^T \mathbf{y}) \\
&= \nabla_{\mathbf{w}} \frac{1}{2} (\mathbf{w}^T \mathbf{X}^T \mathbf{Xw} - 2\mathbf{w}^T \mathbf{X}^T \mathbf{y} + \mathbf{y}^T \mathbf{y}) \\
&= \frac{1}{2} (2\mathbf{X}^T \mathbf{Xw} - 2\mathbf{X}^T \mathbf{y}) \\
&= \mathbf{X}^T \mathbf{Xw} - \mathbf{X}^T \mathbf{y}
\end{align*}
$$

Now, setting the gradient to zero, we have the normal equation:

$$
\begin{align*}
\mathbf{X}^T \mathbf{Xw} &= \mathbf{X}^T \mathbf{y} \\
\mathbf{w} &= (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}
\end{align*}
$$

$$
\boxed{\mathbf{w} = \text{argmin}_{\mathbf{w}} L(\mathbf{w}) = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}}
$$

This assumes $\mathbf{X}^T \mathbf{X}$ is invertible; otherwise the Moore-Penrose pseudoinverse from above can be used instead.
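A minimal NumPy sketch of the normal equation on synthetic data (solving the linear system rather than forming the inverse explicitly, which is the numerically safer route):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = np.hstack([np.ones((n, 1)), rng.standard_normal((n, d))])   # design matrix with bias column
w_true = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ w_true + 0.1 * rng.standard_normal(n)                   # noisy targets

# Normal equation: solve (X^T X) w = X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)

# Matches NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w, w_lstsq)
print(w)   # close to w_true
```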
