Chapter 12

Projections & Least Squares

00 · Symbol Glossary

$\hat{\mathbf{x}}$ (x hat): least squares solution

The least squares solution to $A\mathbf{x}=\mathbf{b}$ — the value of $\mathbf{x}$ that minimises $\|A\mathbf{x}-\mathbf{b}\|^2$ . The hat accent is standard notation for estimated quantities throughout statistics. $\hat{\mathbf{x}}$ is the solution to the normal equations $A^TA\hat{\mathbf{x}}=A^T\mathbf{b}$ .

$\hat{\mathbf{b}}$ (b hat): projection onto column space

The projection of $\mathbf{b}$ onto the column space of $A$ : $\hat{\mathbf{b}} = A\hat{\mathbf{x}}$ . It is the closest point in $\text{Col}(A)$ to $\mathbf{b}$ . The residual vector $\mathbf{e} = \mathbf{b}-\hat{\mathbf{b}}$ is orthogonal to $\text{Col}(A)$ .

$\mathbf{e}$ (e): residual vector

The residual $\mathbf{e}=\mathbf{b}-\hat{\mathbf{b}}=\mathbf{b}-A\hat{\mathbf{x}}$ . Measures how far the best approximation $\hat{\mathbf{b}}$ is from the target $\mathbf{b}$ . The fundamental orthogonality condition is $A^T\mathbf{e}=\mathbf{0}$ — the residual is orthogonal to every column of $A$ .

$P$ (P): projection matrix

The matrix that projects any vector orthogonally onto $\text{Col}(A)$: $P = A(A^TA)^{-1}A^T$. Satisfies $P^2=P$ (idempotent) and $P^T=P$ (symmetric). Applying $P$ twice gives the same result as applying it once, because a vector already in the subspace is left unchanged.

$A^+ = (A^TA)^{-1}A^T$ (A plus): pseudoinverse of $A$

The Moore-Penrose pseudoinverse of a matrix with independent columns: $A^+ = (A^TA)^{-1}A^T$. Satisfies $[(A^TA)^{-1}A^T]A = I$, so it is a left inverse of $A$. The notation $A^+$ is also used in the general case, where the columns may be dependent and this formula no longer applies.

01 · Projection onto a Subspace

This section generalises the projection from Chapter 11: instead of projecting onto a single vector $\mathbf{u}$, we project onto the entire subspace spanned by the columns of a matrix.

Definition — Projection Onto the Column Space of $A$

Let $A \in \mathbb{R}^{m \times n}$ have linearly independent columns. The orthogonal projection of $\mathbf{b} \in \mathbb{R}^m$ onto $\text{Col}(A)$ is:

$$\hat{\mathbf{b}} = A(A^TA)^{-1}A^T\mathbf{b}$$

The matrix $P = A(A^TA)^{-1}A^T$ is the projection matrix onto $\text{Col}(A)$ . It satisfies:

$P^2 = P$ (idempotent: projecting twice gives the same result)
$P^T = P$ (symmetric)
$P\mathbf{b} \in \text{Col}(A)$ for all $\mathbf{b}$
$\mathbf{b} - P\mathbf{b} \perp \text{Col}(A)$ for all $\mathbf{b}$

✓ Example — Projection onto a Line in $\mathbb{R}^3$

Project $\mathbf{b}=\begin{pmatrix}1\\1\\1\end{pmatrix}$ onto the line spanned by $\mathbf{a}=\begin{pmatrix}1\\2\\0\end{pmatrix}$ .

Here $A = \begin{pmatrix}1\\2\\0\end{pmatrix}$ . $A^TA = \begin{pmatrix}1&2&0\end{pmatrix}\begin{pmatrix}1\\2\\0\end{pmatrix} = 1+4+0=5$ .

$(A^TA)^{-1} = \frac{1}{5}$ .

$\hat{\mathbf{b}} = A\cdot\frac{1}{5}\cdot A^T\mathbf{b} = \frac{1}{5}\begin{pmatrix}1\\2\\0\end{pmatrix}\cdot(1+2+0) = \frac{3}{5}\begin{pmatrix}1\\2\\0\end{pmatrix} = \begin{pmatrix}3/5\\6/5\\0\end{pmatrix}$ .

Residual: $\mathbf{e}=\begin{pmatrix}1-3/5\\1-6/5\\1\end{pmatrix}=\begin{pmatrix}2/5\\-1/5\\1\end{pmatrix}$ . Check: $\mathbf{e}\cdot\mathbf{a} = \frac{2}{5}-\frac{2}{5}+0=0$ ✓.
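The arithmetic in this example can be reproduced with a short script. This is a minimal sketch using exact fractions from Python's standard library; the helper names `dot` and `project_onto_line` are illustrative and not part of the chapter.

```python
from fractions import Fraction

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def project_onto_line(b, a):
    """Project b onto span{a}: b_hat = (a.b / a.a) * a."""
    coeff = Fraction(dot(a, b), dot(a, a))   # x_hat = (A^T A)^{-1} A^T b
    return [coeff * ai for ai in a]

a = [1, 2, 0]
b = [1, 1, 1]
b_hat = project_onto_line(b, a)              # entries 3/5, 6/5, 0
e = [bi - hi for bi, hi in zip(b, b_hat)]    # entries 2/5, -1/5, 1
print(b_hat, e, dot(e, a))                   # the last value is e . a
```

The final dot product is exactly zero, which is the orthogonality check performed by hand above.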

02 · Derivation of the Normal Equations

When $A\mathbf{x}=\mathbf{b}$ has no exact solution (overdetermined: more equations than unknowns), the best we can do is find $\hat{\mathbf{x}}$ making $\|A\hat{\mathbf{x}}-\mathbf{b}\|^2$ as small as possible.

The geometric insight: the minimum of $\|A\mathbf{x}-\mathbf{b}\|$ over all $\mathbf{x}$ is achieved when $A\mathbf{x}$ is the point in $\text{Col}(A)$ closest to $\mathbf{b}$ — i.e., the projection $\hat{\mathbf{b}}=A\hat{\mathbf{x}}$ . The error $\mathbf{e}=\mathbf{b}-A\hat{\mathbf{x}}$ must be orthogonal to every column of $A$ .

Definition — Normal Equations

The least squares solution $\hat{\mathbf{x}}$ satisfies the normal equations:

$$A^TA\,\hat{\mathbf{x}} = A^T\mathbf{b}$$

Derivation: Orthogonality condition $\mathbf{e}\perp\text{Col}(A)$ means $A^T\mathbf{e}=\mathbf{0}$ , i.e. $A^T(\mathbf{b}-A\hat{\mathbf{x}})=\mathbf{0}$ , which rearranges to $A^TA\hat{\mathbf{x}}=A^T\mathbf{b}$ .

Unique solution: When $A$ has linearly independent columns, $A^TA$ is invertible and $\hat{\mathbf{x}} = (A^TA)^{-1}A^T\mathbf{b}$ .

Step-by-step — Deriving the normal equations from the orthogonality condition

State the residual: $\mathbf{e} = \mathbf{b} - A\hat{\mathbf{x}}$ . This is the difference between the target $\mathbf{b}$ and our approximation $A\hat{\mathbf{x}}$ .

Apply the orthogonality requirement: $\hat{\mathbf{b}}=A\hat{\mathbf{x}}$ is the closest point in $\text{Col}(A)$ to $\mathbf{b}$ $\iff$ $\mathbf{e}\perp\text{Col}(A)$ $\iff$ $\mathbf{a}_j\cdot\mathbf{e}=0$ for every column $\mathbf{a}_j$ of $A$ .

Written in matrix form: $A^T\mathbf{e}=\mathbf{0}$ .

Substitute the residual: $A^T(\mathbf{b}-A\hat{\mathbf{x}})=\mathbf{0}$ .

Expand: $A^T\mathbf{b}-A^TA\hat{\mathbf{x}}=\mathbf{0}$ .

Rearrange: $A^TA\hat{\mathbf{x}}=A^T\mathbf{b}$ . These are the normal equations.

Solve (when $A^TA$ is invertible):

$$\hat{\mathbf{x}} = (A^TA)^{-1}A^T\mathbf{b}$$

$A^TA$ is a square $n\times n$ matrix; it is invertible $\iff$ the columns of $A$ are linearly independent.
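As a cross-check that is separate from the geometric argument above, the same equations also follow from calculus. The sketch below writes the squared error as a function $f(\mathbf{x})$ (a name introduced here for illustration), expands it, and sets its gradient to zero.

```latex
\begin{aligned}
f(\mathbf{x}) &= \|A\mathbf{x}-\mathbf{b}\|^2
  = \mathbf{x}^TA^TA\mathbf{x} - 2\,\mathbf{b}^TA\mathbf{x} + \mathbf{b}^T\mathbf{b} \\
\nabla f(\mathbf{x}) &= 2A^TA\mathbf{x} - 2A^T\mathbf{b} \\
\nabla f(\hat{\mathbf{x}}) = \mathbf{0}
  &\iff A^TA\hat{\mathbf{x}} = A^T\mathbf{b}
\end{aligned}
```

Because $\mathbf{x}^TA^TA\mathbf{x}=\|A\mathbf{x}\|^2\ge 0$, the function is convex, so a point where the gradient vanishes is a minimiser.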

❌ What Breaks — $A^TA$ Is Singular When Columns Are Dependent

If the columns of $A$ are linearly dependent, then $A^TA$ is singular. The normal equations $A^TA\hat{\mathbf{x}}=A^T\mathbf{b}$ are still consistent, because $A^T\mathbf{b}$ always lies in the column space of $A^TA$, but they have infinitely many solutions. The projection $\hat{\mathbf{b}}$ still exists and is unique; what is not unique is the coefficient vector $\hat{\mathbf{x}}$ that represents it. This happens in regression when predictors are perfectly collinear, that is, when one is a linear combination of the others. The fix: either remove dependent predictors or use regularisation (ridge/LASSO, Chapter 13).
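A small numeric illustration may help here. The sketch below uses a hypothetical $3\times2$ matrix whose second column is twice the first (this matrix and the vector `b` are made up for the example) and shows two different coefficient vectors that satisfy the normal equations and give the same projection.

```python
from fractions import Fraction

def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]

def gram(A):
    """Return A^T A for a matrix stored as a list of rows."""
    cols = list(zip(*A))
    return [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]

# Hypothetical design matrix: the second column is exactly 2x the first.
A = [[1, 2], [1, 2], [1, 2]]
b = [1, 2, 3]

G = gram(A)                                            # A^T A
Atb = [sum(p * q for p, q in zip(col, b)) for col in zip(*A)]
det_G = G[0][0] * G[1][1] - G[0][1] * G[1][0]          # zero: A^T A is singular

# Two different coefficient vectors, both solving G x = A^T b.
x1 = [Fraction(2), Fraction(0)]
x2 = [Fraction(0), Fraction(1)]
print(det_G, matvec(G, x1) == Atb, matvec(G, x2) == Atb)
print(matvec(A, x1) == matvec(A, x2))                  # same projection A x
```

Both candidates produce the same fitted vector, so the projection is well defined even though the coefficients are not.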

03 · Why $A^TA$ Is Invertible Exactly When Columns Are Independent

Definition — $A^TA$ and Null Spaces

$A^TA$ is invertible $\iff$ $A$ has linearly independent columns $\iff$ $\ker(A)=\{\mathbf{0}\}$ .

Proof: Suppose $A^TA\mathbf{x}=\mathbf{0}$ . Then $\mathbf{x}^TA^TA\mathbf{x}=0$ , which equals $\|A\mathbf{x}\|^2=0$ , so $A\mathbf{x}=\mathbf{0}$ . If columns of $A$ are independent, the only solution is $\mathbf{x}=\mathbf{0}$ , so $\ker(A^TA)=\{\mathbf{0}\}$ and $A^TA$ is invertible.

Conversely, if some $\mathbf{x}\neq\mathbf{0}$ satisfies $A\mathbf{x}=\mathbf{0}$ , then $A^TA\mathbf{x}=A^T\mathbf{0}=\mathbf{0}$ , so $A^TA$ is singular.

Step-by-step — Least squares fit for data $\{(1,1),(2,3),(3,2),(4,4)\}$ with model $y=\beta_0+\beta_1 x$

Write the design matrix: each row is $(1, x_i)$ for the model $y=\beta_0+\beta_1 x$ .

$$A = \begin{pmatrix}1&1\\1&2\\1&3\\1&4\end{pmatrix}, \quad \mathbf{b} = \begin{pmatrix}1\\3\\2\\4\end{pmatrix}$$

4 observations, 2 parameters $(\beta_0, \beta_1)$ .

Compute $A^TA$ :

$$A^TA = \begin{pmatrix}1&1&1&1\\1&2&3&4\end{pmatrix}\begin{pmatrix}1&1\\1&2\\1&3\\1&4\end{pmatrix} = \begin{pmatrix}4&10\\10&30\end{pmatrix}$$

$(1,1)$ : $1+1+1+1=4$ . $(1,2)=(2,1)$ : $1+2+3+4=10$ . $(2,2)$ : $1+4+9+16=30$ .

Compute $A^T\mathbf{b}$ :

$$A^T\mathbf{b} = \begin{pmatrix}1&1&1&1\\1&2&3&4\end{pmatrix}\begin{pmatrix}1\\3\\2\\4\end{pmatrix} = \begin{pmatrix}10\\29\end{pmatrix}$$

Row 1: $1+3+2+4=10$. Row 2: $1(1)+2(3)+3(2)+4(4)=1+6+6+16=29$.

Solve $A^TA\hat{\mathbf{x}}=A^T\mathbf{b}$ :

$\det(A^TA) = 4(30)-10(10) = 120-100 = 20$ .

$(A^TA)^{-1} = \frac{1}{20}\begin{pmatrix}30&-10\\-10&4\end{pmatrix}$ .

$\hat{\mathbf{x}} = \frac{1}{20}\begin{pmatrix}30&-10\\-10&4\end{pmatrix}\begin{pmatrix}10\\29\end{pmatrix} = \frac{1}{20}\begin{pmatrix}300-290\\-100+116\end{pmatrix} = \frac{1}{20}\begin{pmatrix}10\\16\end{pmatrix} = \begin{pmatrix}0.5\\0.8\end{pmatrix}$ .

Interpret: $\hat{y} = 0.5 + 0.8x$ . The fitted line has intercept $\hat{\beta}_0=0.5$ and slope $\hat{\beta}_1=0.8$ . Predicted values: $\hat{y}_1=1.3$ , $\hat{y}_2=2.1$ , $\hat{y}_3=2.9$ , $\hat{y}_4=3.7$ . Residuals: $e_1=-0.3$ , $e_2=0.9$ , $e_3=-0.9$ , $e_4=0.3$ . Sum of residuals: $0$ ✓ (always true for OLS with intercept).
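The steps above can be sketched in code. This is a minimal illustration for the two-parameter line model only, using the closed-form $2\times2$ inverse and exact fractions; the function name `fit_line` is an assumption made for this example.

```python
from fractions import Fraction

def fit_line(xs, ys):
    """Least squares fit of y = b0 + b1*x via the 2x2 normal equations."""
    m = len(xs)
    sx, sxx = sum(xs), sum(x * x for x in xs)
    sy, sxy = sum(ys), sum(x * y for x, y in zip(xs, ys))
    det = m * sxx - sx * sx          # det(A^T A); zero if all x are equal
    if det == 0:
        raise ValueError("A^T A is singular: x-values are all identical")
    b0 = Fraction(sxx * sy - sx * sxy, det)
    b1 = Fraction(m * sxy - sx * sy, det)
    return b0, b1

xs, ys = [1, 2, 3, 4], [1, 3, 2, 4]
b0, b1 = fit_line(xs, ys)                               # 1/2 and 4/5
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(b0, b1, residuals, sum(residuals))
```

Exact fractions are used so that the intercept $1/2$, the slope $4/5$, and the zero residual sum can be compared with the hand calculation without rounding.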

04 · The Projection Matrix

The matrix $P=A(A^TA)^{-1}A^T$ can be applied to any $\mathbf{b}$ to get its projection. Its properties follow algebraically from the formula.

✓ Example — Idempotency of the Projection Matrix

$P^2 = [A(A^TA)^{-1}A^T][A(A^TA)^{-1}A^T] = A(A^TA)^{-1}[A^TA](A^TA)^{-1}A^T = A(A^TA)^{-1}A^T = P$ ✓.

The bracketed middle simplifies to $[A^TA](A^TA)^{-1} = I$ , eliminating the inner two factors. Projecting twice is the same as projecting once — once you are in the subspace, you stay there.
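The algebraic identity can also be checked numerically. The sketch below builds $P$ for the $4\times2$ design matrix from the line-fit example in section 03 and tests both properties with exact fractions; the helper functions are illustrative and handle only the $2\times2$ inverse needed here.

```python
from fractions import Fraction

def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*X)]

def inverse_2x2(M):
    """Exact inverse of a 2x2 integer matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[Fraction(d, det), Fraction(-b, det)],
            [Fraction(-c, det), Fraction(a, det)]]

A = [[1, 1], [1, 2], [1, 3], [1, 4]]           # design matrix from the line fit
At = transpose(A)
P = matmul(matmul(A, inverse_2x2(matmul(At, A))), At)   # 4x4 projection matrix

print(matmul(P, P) == P)        # idempotency check
print(transpose(P) == P)        # symmetry check
```

Applying this `P` to the data vector $(1,3,2,4)^T$ returns the fitted values $(1.3, 2.1, 2.9, 3.7)^T$ from the same example.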

Complementary Projection

The matrix $I-P$ projects onto the orthogonal complement of $\text{Col}(A)$ — the subspace of vectors orthogonal to all columns of $A$ . Any vector $\mathbf{b}$ decomposes as $\mathbf{b} = P\mathbf{b} + (I-P)\mathbf{b}$ with $P\mathbf{b} \in \text{Col}(A)$ and $(I-P)\mathbf{b} \in \text{Col}(A)^\perp$ . The residual $\mathbf{e}=\mathbf{b}-A\hat{\mathbf{x}} = (I-P)\mathbf{b}$ is the $(I-P)$ projection.
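As a worked instance of this decomposition, take the example from section 01 with $\mathbf{a}=(1,2,0)^T$ and $\mathbf{b}=(1,1,1)^T$. The two pieces are orthogonal, and their squared lengths add up to $\|\mathbf{b}\|^2$ (Pythagoras):

```latex
\begin{aligned}
P\mathbf{b} &= \begin{pmatrix}3/5\\6/5\\0\end{pmatrix}, \qquad
(I-P)\mathbf{b} = \mathbf{b}-P\mathbf{b} = \begin{pmatrix}2/5\\-1/5\\1\end{pmatrix} \\
(P\mathbf{b})\cdot\big((I-P)\mathbf{b}\big) &= \tfrac{6}{25}-\tfrac{6}{25}+0 = 0 \\
\|P\mathbf{b}\|^2 + \|(I-P)\mathbf{b}\|^2 &= \tfrac{9}{5}+\tfrac{6}{5} = 3 = \|\mathbf{b}\|^2
\end{aligned}
```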

05 · Quant Application — OLS Regression as Projection

Ordinary Least Squares regression is the projection of the response vector $\mathbf{y}$ onto the column space of the design matrix $X$ .

Given $T$ observations, $k$ predictors, and the model $\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ :

$$\hat{\boldsymbol{\beta}} = (X^TX)^{-1}X^T\mathbf{y}$$

is the OLS estimator. The fitted values $\hat{\mathbf{y}} = X\hat{\boldsymbol{\beta}} = X(X^TX)^{-1}X^T\mathbf{y} = P\mathbf{y}$ are the projection of $\mathbf{y}$ onto $\text{Col}(X)$ .

Geometric interpretation: $\hat{\mathbf{y}}$ is the closest point in $\text{Col}(X)$ to $\mathbf{y}$. Minimising $\|\mathbf{y}-X\boldsymbol{\beta}\|^2$ is equivalent to finding the foot of the perpendicular from $\mathbf{y}$ to the subspace spanned by the regressors (a plane when there are two of them).

Why $X^TX$ must be invertible: $X$ must have no perfectly collinear columns. In finance: if two factor exposures are identical across all assets, $\text{rank}(X)<k$ and $X^TX$ is singular — the factor contribution is not identified.
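For more than two regressors the closed-form $2\times2$ inverse no longer applies, so the normal equations have to be solved as a linear system. The sketch below is one possible way to do that, using Gauss-Jordan elimination with exact fractions; the function name `ols` and the collinear matrix in the usage lines are hypothetical. It raises an error in exactly the situation described above, when $X^TX$ is singular.

```python
from fractions import Fraction

def ols(X, y):
    """Solve the normal equations X^T X beta = X^T y by Gauss-Jordan elimination.

    X is a list of rows, y a list of responses. Raises ValueError when
    X^T X is singular, i.e. when the columns of X are linearly dependent.
    """
    cols = list(zip(*X))
    k = len(cols)
    # Augmented matrix [X^T X | X^T y] with exact rational entries.
    M = [[Fraction(sum(p * q for p, q in zip(ci, cj))) for cj in cols]
         + [Fraction(sum(p * q for p, q in zip(ci, y)))] for ci in cols]
    for j in range(k):
        # Find a row at or below j with a non-zero entry in column j.
        pivot = next((r for r in range(j, k) if M[r][j] != 0), None)
        if pivot is None:
            raise ValueError("X^T X is singular: regressors are collinear")
        M[j], M[pivot] = M[pivot], M[j]
        pivot_value = M[j][j]
        M[j] = [v / pivot_value for v in M[j]]
        # Clear column j in every other row.
        for r in range(k):
            if r != j:
                factor = M[r][j]
                M[r] = [vr - factor * vj for vr, vj in zip(M[r], M[j])]
    return [row[k] for row in M]

print(ols([[1, 1], [1, 2], [1, 3], [1, 4]], [1, 3, 2, 4]))   # intercept 1/2, slope 4/5
try:
    # Hypothetical exposures: the last two columns are identical.
    ols([[1, 2, 2], [1, 3, 3], [1, 5, 5]], [1, 2, 3])
except ValueError as err:
    print(err)
```

In floating-point practice a QR or SVD based solver is usually preferred over forming $X^TX$ explicitly; exact fractions are used here only to keep the textbook numbers recognisable.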

06 · Practice Exercises

EXERCISE 12.1

$A^TA$ is a $1\times1$ scalar here since $A$ has one column. The formula simplifies to $\hat{\mathbf{b}}=\frac{A^T\mathbf{b}}{A^TA}A$, which is the projection from Chapter 11.

$A=\begin{pmatrix}2\\1\\-1\end{pmatrix}$ , $\mathbf{b}=\begin{pmatrix}1\\2\\3\end{pmatrix}$ .

$A^TA = 4+1+1=6$ .

$A^T\mathbf{b} = 2(1)+1(2)+(-1)(3)=2+2-3=1$ .

$\hat{\mathbf{x}} = (A^TA)^{-1}A^T\mathbf{b} = \frac{1}{6}$ .

$\hat{\mathbf{b}} = A\hat{\mathbf{x}} = \frac{1}{6}\begin{pmatrix}2\\1\\-1\end{pmatrix} = \begin{pmatrix}1/3\\1/6\\-1/6\end{pmatrix}$ .

$\mathbf{e}=\mathbf{b}-\hat{\mathbf{b}}=\begin{pmatrix}2/3\\11/6\\19/6\end{pmatrix}$ .

Verify: $A^T\mathbf{e}=2(\frac{2}{3})+1(\frac{11}{6})+(-1)(\frac{19}{6})=\frac{4}{3}+\frac{11}{6}-\frac{19}{6}=\frac{8}{6}+\frac{11}{6}-\frac{19}{6}=0$ ✓.

Project $\mathbf{b}=\begin{pmatrix}1\\2\\3\end{pmatrix}$ onto the column space of $A=\begin{pmatrix}2\\1\\-1\end{pmatrix}$ . Compute $\hat{\mathbf{b}}=A(A^TA)^{-1}A^T\mathbf{b}$ and the residual $\mathbf{e}=\mathbf{b}-\hat{\mathbf{b}}$ . Verify $A^T\mathbf{e}=0$ .

EXERCISE 12.2

Form the design matrix with one column of 1s (intercept) and one column of $x$ -values. Compute $A^TA$ , $A^T\mathbf{b}$ , and solve the $2\times2$ normal equations.

Data: $(x,y)=(1,2),(2,2),(3,4)$ .

$A=\begin{pmatrix}1&1\\1&2\\1&3\end{pmatrix}$ , $\mathbf{b}=\begin{pmatrix}2\\2\\4\end{pmatrix}$ .

$A^TA=\begin{pmatrix}3&6\\6&14\end{pmatrix}$ . $A^T\mathbf{b}=\begin{pmatrix}8\\18\end{pmatrix}$ .

$\det(A^TA)=42-36=6$ . $(A^TA)^{-1}=\frac{1}{6}\begin{pmatrix}14&-6\\-6&3\end{pmatrix}$ .

$\hat{\mathbf{x}}=\frac{1}{6}\begin{pmatrix}14&-6\\-6&3\end{pmatrix}\begin{pmatrix}8\\18\end{pmatrix}=\frac{1}{6}\begin{pmatrix}112-108\\-48+54\end{pmatrix}=\frac{1}{6}\begin{pmatrix}4\\6\end{pmatrix}=\begin{pmatrix}2/3\\1\end{pmatrix}$ .

Fitted line: $\hat{y}=\frac{2}{3}+x$ .

Predictions: $\hat{y}_1=5/3$ , $\hat{y}_2=8/3$ , $\hat{y}_3=11/3$ .

Residuals: $e_1=2-5/3=1/3$ , $e_2=2-8/3=-2/3$ , $e_3=4-11/3=1/3$ . Sum: $1/3-2/3+1/3=0$ ✓.

Fit the line $y=\beta_0+\beta_1 x$ to the data points $(1,2),(2,2),(3,4)$ using least squares. Set up the normal equations, solve them, write the fitted line, and compute residuals.

EXERCISE 12.3

Compute $P=A(A^TA)^{-1}A^T$ explicitly. Check $P^2=P$ by matrix multiplication and $P^T=P$ by inspection.

$A=\begin{pmatrix}1\\1\\1\end{pmatrix}$ .

$A^TA=3$ . $(A^TA)^{-1}=\frac{1}{3}$ .

$P=\frac{1}{3}\begin{pmatrix}1\\1\\1\end{pmatrix}\begin{pmatrix}1&1&1\end{pmatrix}=\frac{1}{3}\begin{pmatrix}1&1&1\\1&1&1\\1&1&1\end{pmatrix}$ .

Symmetry: $P^T=P$ since every entry equals $\frac{1}{3}$, so the $(i,j)$ and $(j,i)$ entries agree ✓.

Idempotency: $P^2=\frac{1}{9}\begin{pmatrix}1&1&1\\1&1&1\\1&1&1\end{pmatrix}^2$ . Each entry of $\begin{pmatrix}1&1&1\\1&1&1\\1&1&1\end{pmatrix}^2$ is $1+1+1=3$ . So $P^2=\frac{1}{9}\cdot3\cdot\begin{pmatrix}1&1&1\\1&1&1\\1&1&1\end{pmatrix}=\frac{1}{3}\begin{pmatrix}1&1&1\\1&1&1\\1&1&1\end{pmatrix}=P$ ✓.

Geometric meaning: $P$ projects onto the line spanned by $(1,1,1)^T$ — the "constant" direction in $\mathbb{R}^3$ . $P\mathbf{b}$ is the vector $\begin{pmatrix}\bar{b}\\\bar{b}\\\bar{b}\end{pmatrix}$ where $\bar{b}=\frac{b_1+b_2+b_3}{3}$ is the mean.

For $A=\begin{pmatrix}1\\1\\1\end{pmatrix}$ , compute the projection matrix $P=A(A^TA)^{-1}A^T$ explicitly. Verify $P^2=P$ and $P^T=P$ . Describe geometrically what $P$ does to a vector $\mathbf{b}\in\mathbb{R}^3$ .
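The geometric description can be tried on a concrete vector. The following sketch builds $P=\frac{1}{m}\mathbf{1}\mathbf{1}^T$ for a vector of length $m$ and applies it; the function name and the test vector are made up for illustration.

```python
from fractions import Fraction

def project_onto_ones(b):
    """Apply P = (1/m) * ones(m, m) to b: every entry becomes the mean of b."""
    m = len(b)
    P = [[Fraction(1, m)] * m for _ in range(m)]
    return [sum(pij * bj for pij, bj in zip(row, b)) for row in P]

b = [1, 2, 6]                          # hypothetical vector with mean 3
once = project_onto_ones(b)            # every entry equals the mean
twice = project_onto_ones(once)        # projecting again changes nothing
print(once, twice == once)
```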

EXERCISE 12.4

$R^2=1-\frac{\|\mathbf{e}\|^2}{\|\mathbf{b}-\bar{b}\mathbf{1}\|^2}$ where $\bar{b}$ is the mean of $\mathbf{b}$ . Compute numerator $\|\mathbf{e}\|^2=\sum e_i^2$ and denominator $\sum(b_i-\bar{b})^2$ .

From Exercise 12.2: residuals $\mathbf{e}=(1/3,-2/3,1/3)^T$ , $\mathbf{b}=(2,2,4)^T$ .

$\|\mathbf{e}\|^2 = \frac{1}{9}+\frac{4}{9}+\frac{1}{9}=\frac{6}{9}=\frac{2}{3}$ .

$\bar{b}=\frac{2+2+4}{3}=\frac{8}{3}$ .

$\|\mathbf{b}-\bar{b}\mathbf{1}\|^2 = (2-8/3)^2+(2-8/3)^2+(4-8/3)^2 = \frac{4}{9}+\frac{4}{9}+\frac{16}{9}=\frac{24}{9}=\frac{8}{3}$ .

$R^2=1-\frac{2/3}{8/3}=1-\frac{1}{4}=0.75$ .

Interpretation: the line $\hat{y}=\frac{2}{3}+x$ explains $75\%$ of the variance in $y$ . The remaining $25\%$ is in the residuals — the component of $\mathbf{b}$ orthogonal to $\text{Col}(A)$ .

Using the least squares fit from Exercise 12.2, compute the coefficient of determination $R^2 = 1 - \|\mathbf{e}\|^2/\|\mathbf{b}-\bar{b}\mathbf{1}\|^2$ . Interpret $R^2$ geometrically in terms of projections.
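The $R^2$ computation can be checked with a few lines of code. This is a minimal sketch that assumes the fit includes an intercept, so that the centred total sum of squares is the right denominator; the function name `r_squared` is illustrative.

```python
from fractions import Fraction

def r_squared(y, y_hat):
    """R^2 = 1 - ||e||^2 / ||y - mean(y)*1||^2 for a fit that includes an intercept."""
    mean = Fraction(sum(y), len(y))
    sse = sum((yi - hi) ** 2 for yi, hi in zip(y, y_hat))   # squared residual norm
    sst = sum((yi - mean) ** 2 for yi in y)                 # centred total sum of squares
    return 1 - sse / sst

y = [2, 2, 4]
y_hat = [Fraction(5, 3), Fraction(8, 3), Fraction(11, 3)]   # fitted values from Exercise 12.2
print(r_squared(y, y_hat))     # 3/4
```

Equivalently, $R^2$ is the squared length of the centred fitted vector divided by the squared length of the centred data vector, which is the projection reading of the same number.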

EXERCISE 12.5

Show that the residuals from an OLS regression always satisfy $\sum e_i = 0$ (when an intercept is included) using the normal equations. The column of ones in $A$ ensures $\mathbf{1}^T\mathbf{e}=0$ .

The design matrix $A$ includes a column of all ones: $\mathbf{a}_1=\mathbf{1}=(1,1,\ldots,1)^T$ .

The normal equations require $A^T\mathbf{e}=\mathbf{0}$ , which means in particular that $\mathbf{a}_1^T\mathbf{e}=0$ .

$\mathbf{a}_1^T\mathbf{e} = \sum_{i=1}^n 1\cdot e_i = \sum_{i=1}^n e_i = 0$ .

Consequence: $\sum e_i=0$ is not an assumption — it is a theorem that follows from including an intercept in the model. Geometrically: the residual $\mathbf{e}$ is orthogonal to the intercept column $\mathbf{1}$ , which means the residuals are mean-zero.

For each additional column $\mathbf{a}_j$ : $\mathbf{a}_j^T\mathbf{e}=0$ means the residuals are uncorrelated with each regressor $X_j$ — the fundamental OLS property.

Prove algebraically that OLS residuals sum to zero ( $\sum_{i=1}^n e_i=0$ ) whenever the design matrix includes a column of ones (intercept). Use the normal equations. State what this means about the residual vector's relationship to the column space of $A$ .

EXERCISE 12.6

Regress the asset's returns on the factor return with an intercept: stack a column of ones next to the factor column $F$ to form the design matrix $A$, so the OLS coefficients are $(A^TA)^{-1}A^T\mathbf{r}$, where $\mathbf{r}$ is the asset return vector. The slope coefficient is the market beta. Then use $R^2$ to measure how much the factor explains.

Market factor returns $F=(0.01,0.02,-0.01,0.03)^T$ , asset returns $\mathbf{r}_A=(0.015,0.025,-0.005,0.04)^T$ .

Design matrix with intercept: $A=\begin{pmatrix}1&0.01\\1&0.02\\1&-0.01\\1&0.03\end{pmatrix}$ .

$A^TA=\begin{pmatrix}4&0.05\\0.05&0.0015\end{pmatrix}$ (using $0.01+0.02-0.01+0.03=0.05$ and $0.0001+0.0004+0.0001+0.0009=0.0015$ ).

$A^T\mathbf{r}_A=\begin{pmatrix}0.075\\0.0019\end{pmatrix}$. Row 1: sum of returns $=0.015+0.025-0.005+0.04=0.075$. Row 2: $0.01(0.015)+0.02(0.025)+(-0.01)(-0.005)+0.03(0.04)=0.00015+0.0005+0.00005+0.0012=0.0019$.

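Solving the normal equations: $\det(A^TA)=4(0.0015)-0.05^2=0.0035$, so the intercept is $\hat{\beta}_0=\frac{0.0015(0.075)-0.05(0.0019)}{0.0035}=0.005$ and the slope is $\hat{\beta}_1=\frac{4(0.0019)-0.05(0.075)}{0.0035}=1.1$, which equals $\frac{\text{Cov}(F,r_A)}{\text{Var}(F)}$. The market beta is therefore $1.1$: for each 1% market move, asset A moves about $1.1\%$ on average, so it amplifies market moves (beta $>1$).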

The $R^2$ measures how much return variance the single factor explains; the residual is idiosyncratic (firm-specific) risk not captured by market exposure.

A factor model regresses each asset's returns $\mathbf{r}$ on a single market factor $F$ (with intercept). Factor returns are $F=(0.01, 0.02, -0.01, 0.03)^T$ over four periods; asset A returns are $\mathbf{r}_A=(0.015, 0.025, -0.005, 0.04)^T$ . Set up the normal equations and describe what the estimated slope (market beta) and $R^2$ measure for this asset.
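The normal equations for this exercise can be solved numerically as a check. The sketch below parses the returns as exact decimals and uses the closed-form $2\times2$ solution; the variable names (`alpha` for the intercept, `beta` for the slope) are choices made for this example.

```python
from fractions import Fraction

F = [Fraction(s) for s in ("0.01", "0.02", "-0.01", "0.03")]     # factor returns
r = [Fraction(s) for s in ("0.015", "0.025", "-0.005", "0.04")]  # asset A returns

m = len(F)
sF, sFF = sum(F), sum(f * f for f in F)
sr, sFr = sum(r), sum(f * ri for f, ri in zip(F, r))

det = m * sFF - sF * sF                 # det(A^T A)
alpha = (sFF * sr - sF * sFr) / det     # intercept
beta = (m * sFr - sF * sr) / det        # slope = market beta

resid = [ri - (alpha + beta * f) for f, ri in zip(F, r)]
mean_r = sr / m
r2 = 1 - sum(e * e for e in resid) / sum((ri - mean_r) ** 2 for ri in r)

# alpha = 1/200, beta = 11/10, R^2 = 847/855 (about 0.9906)
print(float(alpha), float(beta), float(r2))
```

With only four observations these numbers illustrate the mechanics rather than a reliable estimate of the asset's beta.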

07 · Chapter Summary

Concept	Formula / Rule
Projection onto $\text{Col}(A)$	$\hat{\mathbf{b}}=A(A^TA)^{-1}A^T\mathbf{b}$
Projection matrix	$P=A(A^TA)^{-1}A^T$ ; $P^2=P$ ; $P^T=P$
Orthogonality of residual	$A^T\mathbf{e}=\mathbf{0}$ ; $\mathbf{e}\perp\text{Col}(A)$
Normal equations	$A^TA\hat{\mathbf{x}}=A^T\mathbf{b}$
Least squares solution	$\hat{\mathbf{x}}=(A^TA)^{-1}A^T\mathbf{b}$ when $A$ has independent cols
$A^TA$ invertible iff	Columns of $A$ are linearly independent
OLS regression	$\hat{\boldsymbol{\beta}}=(X^TX)^{-1}X^T\mathbf{y}$ ; $\hat{\mathbf{y}}=P\mathbf{y}$
Residuals mean-zero	$\sum e_i=0$ whenever $A$ has intercept column
$R^2$ geometric	Fraction of $\|\mathbf{b}-\bar{b}\mathbf{1}\|^2$ explained by the projection (model with intercept)

Next: Chapter 13 — Norms & Distance Metrics introduces L1, L2, and L-infinity norms, shows how different norm choices change what "small residual" means, and connects to LASSO (L1) and ridge regression (L2) in quantitative finance.