The least squares solution to $A\mathbf{x}=\mathbf{b}$: the value of $\mathbf{x}$ that minimises $\|A\mathbf{x}-\mathbf{b}\|^2$. The hat accent is standard notation for estimated quantities throughout statistics. $\hat{\mathbf{x}}$ is the solution to the normal equations $A^TA\hat{\mathbf{x}}=A^T\mathbf{b}$.
$\hat{\mathbf{b}}$ (b hat): projection onto column space
The projection of $\mathbf{b}$ onto the column space of $A$: $\hat{\mathbf{b}}=A\hat{\mathbf{x}}$. It is the closest point in $\operatorname{Col}(A)$ to $\mathbf{b}$. The residual vector $\mathbf{e}=\mathbf{b}-\hat{\mathbf{b}}$ is orthogonal to $\operatorname{Col}(A)$.
$\mathbf{e}$: residual vector
The residual $\mathbf{e}=\mathbf{b}-\hat{\mathbf{b}}=\mathbf{b}-A\hat{\mathbf{x}}$. It measures how far the best approximation $\hat{\mathbf{b}}$ is from the target $\mathbf{b}$. The fundamental orthogonality condition is $A^T\mathbf{e}=\mathbf{0}$: the residual is orthogonal to every column of $A$.
$P$: projection matrix
The matrix that projects any vector onto a subspace (here $\operatorname{Col}(A)$): $P=A(A^TA)^{-1}A^T$. It satisfies $P^2=P$ (idempotent) and $P^T=P$ (symmetric). Applying $P$ twice gives the same result as applying it once, because a vector already in the subspace is left unchanged.
$(A^TA)^{-1}A^T$ (A plus): pseudoinverse of $A$
The Moore-Penrose pseudoinverse for a matrix with independent columns: $(A^TA)^{-1}A^T$. It satisfies $[(A^TA)^{-1}A^T]A=I$, so it is a left inverse. Also written $A^+$ in the general case.
01 · Projection onto a Subspace
The projection from Chapter 11 generalised: instead of projecting onto a single vector $\mathbf{u}$, project onto an entire subspace spanned by a matrix's columns.
Definition — Projection Onto the Column Space of $A$
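Let $A\in\mathbb{R}^{m\times n}$ have linearly independent columns. The orthogonal projection of $\mathbf{b}\in\mathbb{R}^m$ onto $\operatorname{Col}(A)$ is:
$$\hat{\mathbf{b}}=A(A^TA)^{-1}A^T\mathbf{b}$$
The matrix $P=A(A^TA)^{-1}A^T$ is the projection matrix onto $\operatorname{Col}(A)$. It satisfies:
$P^2=P$ (idempotent: projecting twice gives the same result)
$P^T=P$ (symmetric)
$P\mathbf{b}\in\operatorname{Col}(A)$ for all $\mathbf{b}$
$\mathbf{b}-P\mathbf{b}\perp\operatorname{Col}(A)$ for all $\mathbf{b}$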
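The four properties above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper name `projection_matrix` and the small matrix `A` and vector `b` are illustrative choices, not taken from the text.

```python
import numpy as np

def projection_matrix(A: np.ndarray) -> np.ndarray:
    """Return P = A (A^T A)^{-1} A^T, assuming A has independent columns."""
    # Solve (A^T A) Z = A^T rather than forming the inverse explicitly.
    return A @ np.linalg.solve(A.T @ A, A.T)

# Hypothetical 3x2 matrix with independent columns and a target vector.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

P = projection_matrix(A)
b_hat = P @ b                      # (5, 2, -1) for this A and b

idempotent = np.allclose(P @ P, P)                      # P^2 = P
symmetric = np.allclose(P.T, P)                         # P^T = P
# Pb lies in Col(A): appending it to A does not raise the rank.
in_column_space = np.linalg.matrix_rank(np.column_stack([A, b_hat])) == 2
# b - Pb is orthogonal to every column of A.
residual_orthogonal = np.allclose(A.T @ (b - b_hat), 0.0)
```

Using `np.linalg.solve` instead of `np.linalg.inv` is a design choice for numerical stability; the result is the same matrix $P$.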
✓ Example — Projection onto a Line in $\mathbb{R}^3$
Project $\mathbf{b}=(1,1,1)^T$ onto the line spanned by $\mathbf{a}=(1,2,0)^T$.
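A sketch of the arithmetic, reading the vectors as $\mathbf{b}=(1,1,1)^T$ and $\mathbf{a}=(1,2,0)^T$ and taking $\mathbf{a}$ as the only column of $A$, so that $A^TA$ reduces to the scalar $\mathbf{a}^T\mathbf{a}$:

$$
\mathbf{a}^T\mathbf{a}=1^2+2^2+0^2=5,\qquad \mathbf{a}^T\mathbf{b}=1+2+0=3
$$

$$
\hat{\mathbf{b}}=\frac{\mathbf{a}^T\mathbf{b}}{\mathbf{a}^T\mathbf{a}}\,\mathbf{a}=\tfrac{3}{5}\,(1,2,0)^T=(0.6,\;1.2,\;0)^T,\qquad \mathbf{e}=\mathbf{b}-\hat{\mathbf{b}}=(0.4,\;-0.2,\;1)^T
$$

The residual is orthogonal to the line, as required: $\mathbf{a}^T\mathbf{e}=0.4-0.4+0=0$.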
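When $A\mathbf{x}=\mathbf{b}$ has no exact solution (overdetermined: more equations than unknowns), the best we can do is find $\hat{\mathbf{x}}$ making $\|A\hat{\mathbf{x}}-\mathbf{b}\|^2$ as small as possible.
The geometric insight: the minimum of $\|A\mathbf{x}-\mathbf{b}\|$ over all $\mathbf{x}$ is achieved when $A\mathbf{x}$ is the point in $\operatorname{Col}(A)$ closest to $\mathbf{b}$, i.e., the projection $\hat{\mathbf{b}}=A\hat{\mathbf{x}}$. The error $\mathbf{e}=\mathbf{b}-A\hat{\mathbf{x}}$ must be orthogonal to every column of $A$.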
Definition — Normal Equations
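The least squares solution $\hat{\mathbf{x}}$ satisfies the normal equations:
$$A^TA\hat{\mathbf{x}}=A^T\mathbf{b}$$
Derivation: The orthogonality condition $\mathbf{e}\perp\operatorname{Col}(A)$ means $A^T\mathbf{e}=\mathbf{0}$, i.e. $A^T(\mathbf{b}-A\hat{\mathbf{x}})=\mathbf{0}$, which rearranges to $A^TA\hat{\mathbf{x}}=A^T\mathbf{b}$.
Unique solution: When $A$ has linearly independent columns, $A^TA$ is invertible and $\hat{\mathbf{x}}=(A^TA)^{-1}A^T\mathbf{b}$.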
Step-by-step — Deriving the normal equations from the orthogonality condition
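1. State the residual: $\mathbf{e}=\mathbf{b}-A\hat{\mathbf{x}}$. This is the difference between the target $\mathbf{b}$ and our approximation $A\hat{\mathbf{x}}$.
2. Apply the orthogonality requirement: $\hat{\mathbf{b}}=A\hat{\mathbf{x}}$ is the closest point in $\operatorname{Col}(A)$ to $\mathbf{b}$ $\iff$ $\mathbf{e}\perp\operatorname{Col}(A)$ $\iff$ $\mathbf{a}_j\cdot\mathbf{e}=0$ for every column $\mathbf{a}_j$ of $A$.
Written in matrix form: $A^T\mathbf{e}=\mathbf{0}$.
3. Substitute the residual: $A^T(\mathbf{b}-A\hat{\mathbf{x}})=\mathbf{0}$.
Expand: $A^T\mathbf{b}-A^TA\hat{\mathbf{x}}=\mathbf{0}$.
Rearrange: $A^TA\hat{\mathbf{x}}=A^T\mathbf{b}$. These are the normal equations.
4. Solve (when $A^TA$ is invertible):
$$\hat{\mathbf{x}}=(A^TA)^{-1}A^T\mathbf{b}$$
$A^TA$ is a square $n\times n$ matrix; it is invertible $\iff$ the columns of $A$ are linearly independent.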
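The derivation can be checked numerically by solving the normal equations and confirming that the residual is orthogonal to the columns of $A$. This is a minimal sketch, assuming NumPy; the function name `solve_normal_equations` and the data are hypothetical, and `np.linalg.lstsq` is included only as a cross-check.

```python
import numpy as np

def solve_normal_equations(A: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Solve A^T A x = A^T b, assuming A has linearly independent columns."""
    return np.linalg.solve(A.T @ A, A.T @ b)

# Hypothetical overdetermined system: 4 equations, 2 unknowns.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([2.0, 1.0, 4.0, 3.0])

x_hat = solve_normal_equations(A, b)      # (1.0, 0.6) for this data
e = b - A @ x_hat                         # step 1: the residual
orthogonality_gap = np.abs(A.T @ e).max() # step 2: A^T e should be ~0

# Cross-check against NumPy's least squares routine.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
```

For badly conditioned $A$, forming $A^TA$ squares the condition number, so a QR- or SVD-based routine such as `np.linalg.lstsq` is usually the safer choice in practice.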
❌ What Breaks — $A^TA$ Is Singular When Columns Are Dependent
If the columns of $A$ are linearly dependent, then $A^TA$ is singular. The normal equations $A^TA\hat{\mathbf{x}}=A^T\mathbf{b}$ are still consistent (because $A^T\mathbf{b}$ always lies in $\operatorname{Col}(A^T)=\operatorname{Col}(A^TA)$), but they have infinitely many solutions. The least squares problem still has a geometric solution (the projection $\hat{\mathbf{b}}$ exists and is unique), but the representation in terms of $\hat{\mathbf{x}}$ is not unique. This happens in regression when predictors are perfectly collinear, i.e. one is a linear combination of the others. The fix: either remove dependent predictors or use regularisation (ridge/LASSO, Chapter 13).
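The non-uniqueness can be seen on a small example. This is a sketch, assuming NumPy; the matrix below is a hypothetical design whose third column is the sum of the first two, and `null_vec` is a null-space vector found by inspection for this particular matrix.

```python
import numpy as np

# Hypothetical design matrix: column 3 = column 1 + column 2 (perfect collinearity).
A = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 3.0],
              [1.0, 3.0, 4.0],
              [1.0, 4.0, 5.0]])
b = np.array([1.0, 3.0, 2.0, 4.0])

rank_A = np.linalg.matrix_rank(A)            # 2, although A has 3 columns
rank_gram = np.linalg.matrix_rank(A.T @ A)   # 2, so the 3x3 matrix A^T A is singular

# lstsq still returns a least squares solution (the minimum-norm one).
x_min, *_ = np.linalg.lstsq(A, b, rcond=None)

# (1, 1, -1) satisfies A @ null_vec = 0, so adding any multiple of it
# produces another solution of the normal equations.
null_vec = np.array([1.0, 1.0, -1.0])
x_other = x_min + 2.5 * null_vec

same_projection = np.allclose(A @ x_min, A @ x_other)
both_satisfy_normal_eqs = np.allclose(A.T @ A @ x_other, A.T @ b)
```

The coefficient vectors differ, but both give the same fitted vector $A\hat{\mathbf{x}}$, which is the projection of $\mathbf{b}$ onto $\operatorname{Col}(A)$.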
03 · Why $A^TA$ Is Invertible Exactly When Columns Are Independent
Definition — $A^TA$ and Null Spaces
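$A^TA$ is invertible $\iff$ $A$ has linearly independent columns $\iff$ $\ker(A)=\{\mathbf{0}\}$.
Proof: Suppose $A^TA\mathbf{x}=\mathbf{0}$. Then $\mathbf{x}^TA^TA\mathbf{x}=0$. Since $\mathbf{x}^TA^TA\mathbf{x}=(A\mathbf{x})^T(A\mathbf{x})=\|A\mathbf{x}\|^2$, this gives $\|A\mathbf{x}\|^2=0$, so $A\mathbf{x}=\mathbf{0}$. If the columns of $A$ are independent, the only solution is $\mathbf{x}=\mathbf{0}$, so $\ker(A^TA)=\{\mathbf{0}\}$ and $A^TA$ is invertible.
Conversely, if some $\mathbf{x}\neq\mathbf{0}$ satisfies $A\mathbf{x}=\mathbf{0}$, then $A^TA\mathbf{x}=A^T\mathbf{0}=\mathbf{0}$, so $A^TA$ is singular.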
Step-by-step — Least squares fit for data $\{(1,1),(2,3),(3,2),(4,4)\}$ with model $y=\beta_0+\beta_1 x$
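1. Write the design matrix: each row is $(1,x_i)$ for the model $y=\beta_0+\beta_1x$, so
$$A=\begin{pmatrix}1&1\\1&2\\1&3\\1&4\end{pmatrix},\qquad \mathbf{y}=\begin{pmatrix}1\\3\\2\\4\end{pmatrix}$$
2. Form the normal equations: $A^TA=\begin{pmatrix}4&10\\10&30\end{pmatrix}$ and $A^T\mathbf{y}=\begin{pmatrix}10\\29\end{pmatrix}$.
3. Solve $4\hat\beta_0+10\hat\beta_1=10$ and $10\hat\beta_0+30\hat\beta_1=29$: subtracting $2.5$ times the first equation from the second gives $5\hat\beta_1=4$, so $\hat\beta_1=0.8$ and $\hat\beta_0=0.5$.
4. Interpret: $\hat y=0.5+0.8x$. The fitted line has intercept $\hat\beta_0=0.5$ and slope $\hat\beta_1=0.8$. Predicted values: $\hat y_1=1.3$, $\hat y_2=2.1$, $\hat y_3=2.9$, $\hat y_4=3.7$. Residuals: $e_1=-0.3$, $e_2=0.9$, $e_3=-0.9$, $e_4=0.3$. Sum of residuals: $0$ ✓ (always true for OLS with intercept).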
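The fit above can be reproduced in a few lines. This is a minimal sketch, assuming NumPy; the variable names are illustrative.

```python
import numpy as np

# Data from the worked example: (1,1), (2,3), (3,2), (4,4).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 2.0, 4.0])

# Design matrix with rows (1, x_i) for the model y = beta0 + beta1 * x.
A = np.column_stack([np.ones_like(x), x])

gram = A.T @ A                      # [[4, 10], [10, 30]]
rhs = A.T @ y                       # [10, 29]
beta = np.linalg.solve(gram, rhs)   # [0.5, 0.8]

fitted = A @ beta                   # [1.3, 2.1, 2.9, 3.7]
residuals = y - fitted              # [-0.3, 0.9, -0.9, 0.3]
residual_sum = residuals.sum()      # 0 up to rounding
```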
04 · The Projection Matrix
The matrix $P=A(A^TA)^{-1}A^T$ can be applied to any $\mathbf{b}$ to get its projection. Its properties follow algebraically from the formula.
Idempotency: $P^2=A(A^TA)^{-1}A^T\,A(A^TA)^{-1}A^T=A(A^TA)^{-1}\,[A^TA](A^TA)^{-1}\,A^T$. The bracketed middle simplifies to $[A^TA](A^TA)^{-1}=I$, eliminating the inner two factors and leaving $P^2=A(A^TA)^{-1}A^T=P$. Projecting twice is the same as projecting once: once you are in the subspace, you stay there.
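Symmetry follows from the formula in the same way. A short derivation, using only the facts that $(XY)^T=Y^TX^T$, that $(M^{-1})^T=(M^T)^{-1}$, and that $A^TA$ is itself symmetric:

$$
P^T=\bigl(A(A^TA)^{-1}A^T\bigr)^T=A\,\bigl((A^TA)^{-1}\bigr)^T A^T=A\,\bigl((A^TA)^T\bigr)^{-1}A^T=A(A^TA)^{-1}A^T=P
$$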
Complementary Projection
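The matrix $I-P$ projects onto the orthogonal complement of $\operatorname{Col}(A)$, the subspace of vectors orthogonal to all columns of $A$. Any vector $\mathbf{b}$ decomposes as $\mathbf{b}=P\mathbf{b}+(I-P)\mathbf{b}$ with $P\mathbf{b}\in\operatorname{Col}(A)$ and $(I-P)\mathbf{b}\in\operatorname{Col}(A)^\perp$. The residual $\mathbf{e}=\mathbf{b}-A\hat{\mathbf{x}}=(I-P)\mathbf{b}$ is the $(I-P)$ projection.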
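The decomposition can be illustrated on a small case. This is a sketch, assuming NumPy; `A` and `b` are hypothetical, and `Q` is simply a local name for $I-P$.

```python
import numpy as np

# Hypothetical matrix with independent columns and a vector to decompose.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 2.0, 6.0])

P = A @ np.linalg.solve(A.T @ A, A.T)   # projector onto Col(A)
Q = np.eye(3) - P                       # complementary projector I - P

in_col = P @ b      # component in Col(A): (2, 3, 5)
in_perp = Q @ b     # residual component: (-1, -1, 1)

recombines = np.allclose(in_col + in_perp, b)     # b = Pb + (I-P)b
q_idempotent = np.allclose(Q @ Q, Q)              # I - P is also a projector
perp_to_columns = np.allclose(A.T @ in_perp, 0)   # (I-P)b is orthogonal to Col(A)
# Pythagoras: ||b||^2 = ||Pb||^2 + ||(I-P)b||^2  (41 = 38 + 3 here)
pythagoras = np.isclose(b @ b, in_col @ in_col + in_perp @ in_perp)
```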
05 · Quant Application — OLS Regression as Projection
Ordinary Least Squares regression is the projection of the response vector $\mathbf{y}$ onto the column space of the design matrix $X$.
Given $T$ observations, $k$ predictors, and the model $\mathbf{y}=X\boldsymbol\beta+\boldsymbol\varepsilon$:
$$\hat{\boldsymbol\beta}=(X^TX)^{-1}X^T\mathbf{y}$$
is the OLS estimator. The fitted values $\hat{\mathbf{y}}=X\hat{\boldsymbol\beta}=X(X^TX)^{-1}X^T\mathbf{y}=P\mathbf{y}$ are the projection of $\mathbf{y}$ onto $\operatorname{Col}(X)$.
Geometric interpretation: $\hat{\mathbf{y}}$ is the closest point in $\operatorname{Col}(X)$ to $\mathbf{y}$. Minimising $\|\mathbf{y}-X\boldsymbol\beta\|^2$ is equivalent to finding the foot of the perpendicular from $\mathbf{y}$ to the subspace spanned by the regressors.
Why $X^TX$ must be invertible: $X$ must have no perfectly collinear columns. In finance: if two factor exposures are identical across all assets, $\operatorname{rank}(X)<k$ and $X^TX$ is singular, so the individual factor contributions are not identified.
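The identity $\hat{\mathbf{y}}=P\mathbf{y}$ and the rank condition can both be checked on simulated data. This is a sketch, assuming NumPy; the sample size, coefficients, and noise level are arbitrary synthetic choices, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: T observations, k columns (intercept plus two factors).
T, k = 50, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
beta_true = np.array([0.1, 1.5, -0.7])            # arbitrary illustrative values
y = X @ beta_true + 0.05 * rng.normal(size=T)

# OLS estimate and fitted values.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat

# The same fitted values via the projection matrix P = X (X^T X)^{-1} X^T.
P = X @ np.linalg.solve(X.T @ X, X.T)
fitted_is_projection = np.allclose(y_hat, P @ y)
residual_orthogonal = np.allclose(X.T @ (y - y_hat), 0.0, atol=1e-10)

# Duplicating a factor column makes the design rank deficient.
X_dup = np.column_stack([X, X[:, 1]])
rank_dup = np.linalg.matrix_rank(X_dup)           # 3, although X_dup has 4 columns
```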
06 · Practice Exercises
EXERCISE 12.1
$A^TA$ is a $1\times1$ scalar here since $A$ has one column. The formula simplifies to $\hat{\mathbf{b}}=\dfrac{A^T\mathbf{b}}{A^TA}\,A$, which is the projection from Chapter 11.
EXERCISE 12.2
Fit the line $y=\beta_0+\beta_1x$ to the data points $(1,2)$, $(2,2)$, $(3,4)$ using least squares. Set up the normal equations, solve them, write the fitted line, and compute residuals.
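A sketch of the solution, following the same steps as the worked example in section 03. The residuals agree with those quoted in Exercise 12.4.

$$
A=\begin{pmatrix}1&1\\1&2\\1&3\end{pmatrix},\quad
\mathbf{b}=\begin{pmatrix}2\\2\\4\end{pmatrix},\quad
A^TA=\begin{pmatrix}3&6\\6&14\end{pmatrix},\quad
A^T\mathbf{b}=\begin{pmatrix}8\\18\end{pmatrix}
$$

$$
3\hat\beta_0+6\hat\beta_1=8,\qquad 6\hat\beta_0+14\hat\beta_1=18
\;\;\Longrightarrow\;\; \hat\beta_1=1,\quad \hat\beta_0=\tfrac{2}{3}
$$

$$
\hat y=\tfrac{2}{3}+x,\qquad
\hat{\mathbf{b}}=\left(\tfrac{5}{3},\tfrac{8}{3},\tfrac{11}{3}\right)^T,\qquad
\mathbf{e}=\mathbf{b}-\hat{\mathbf{b}}=\left(\tfrac{1}{3},-\tfrac{2}{3},\tfrac{1}{3}\right)^T
$$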
EXERCISE 12.3
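Compute $P=A(A^TA)^{-1}A^T$ explicitly. Check $P^2=P$ by matrix multiplication and $P^T=P$ by inspection.
$A=\begin{pmatrix}1\\1\\1\end{pmatrix}$.
$A^TA=3$. $(A^TA)^{-1}=\tfrac{1}{3}$.
$P=\tfrac{1}{3}\begin{pmatrix}1\\1\\1\end{pmatrix}\begin{pmatrix}1&1&1\end{pmatrix}=\tfrac{1}{3}\begin{pmatrix}1&1&1\\1&1&1\\1&1&1\end{pmatrix}$.
Symmetry: $P^T=P$ since every entry of the matrix is the same, so transposing leaves it unchanged ✓.
Idempotency: $P^2=\tfrac{1}{9}\begin{pmatrix}1&1&1\\1&1&1\\1&1&1\end{pmatrix}^2$. Each entry of $\begin{pmatrix}1&1&1\\1&1&1\\1&1&1\end{pmatrix}^2$ is $1+1+1=3$. So $P^2=\tfrac{1}{9}\cdot3\cdot\begin{pmatrix}1&1&1\\1&1&1\\1&1&1\end{pmatrix}=\tfrac{1}{3}\begin{pmatrix}1&1&1\\1&1&1\\1&1&1\end{pmatrix}=P$ ✓.
Geometric meaning: $P$ projects onto the line spanned by $(1,1,1)^T$, the "constant" direction in $\mathbb{R}^3$. $P\mathbf{b}$ is the vector $(\bar b,\bar b,\bar b)^T$ where $\bar b=\tfrac{b_1+b_2+b_3}{3}$ is the mean.
For $A=(1,1,1)^T$, compute the projection matrix $P=A(A^TA)^{-1}A^T$ explicitly. Verify $P^2=P$ and $P^T=P$. Describe geometrically what $P$ does to a vector $\mathbf{b}\in\mathbb{R}^3$.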
EXERCISE 12.4
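$R^2=1-\dfrac{\|\mathbf{e}\|^2}{\|\mathbf{b}-\bar b\mathbf{1}\|^2}$ where $\bar b$ is the mean of $\mathbf{b}$. Compute the numerator $\|\mathbf{e}\|^2=\sum e_i^2$ and the denominator $\sum(b_i-\bar b)^2$.
From Exercise 12.2: residuals $\mathbf{e}=(1/3,-2/3,1/3)^T$, $\mathbf{b}=(2,2,4)^T$. Then $\|\mathbf{e}\|^2=\tfrac{1}{9}+\tfrac{4}{9}+\tfrac{1}{9}=\tfrac{2}{3}$, $\bar b=\tfrac{8}{3}$, and $\sum(b_i-\bar b)^2=\tfrac{4}{9}+\tfrac{4}{9}+\tfrac{16}{9}=\tfrac{8}{3}$, so $R^2=1-\dfrac{2/3}{8/3}=0.75$.
Interpretation: the line $\hat y=\tfrac{2}{3}+x$ explains 75% of the variance in $y$. The remaining 25% is in the residuals, the component of $\mathbf{b}$ orthogonal to $\operatorname{Col}(A)$.
Using the least squares fit from Exercise 12.2, compute the coefficient of determination $R^2=1-\|\mathbf{e}\|^2/\|\mathbf{b}-\bar b\mathbf{1}\|^2$. Interpret $R^2$ geometrically in terms of projections.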
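One way to see the geometric reading is to compute $R^2$ twice: once from the residual, and once as the share of the centred vector $\mathbf{b}-\bar b\mathbf{1}$ captured by the centred projection $\hat{\mathbf{b}}-\bar b\mathbf{1}$. This is a sketch, assuming NumPy; the variable names are illustrative.

```python
import numpy as np

# Data from Exercise 12.2: (1,2), (2,2), (3,4).
x = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 2.0, 4.0])
A = np.column_stack([np.ones_like(x), x])

beta, *_ = np.linalg.lstsq(A, b, rcond=None)   # (2/3, 1)
b_hat = A @ beta
e = b - b_hat

sse = e @ e                                  # ||e||^2 = 2/3
sst = np.sum((b - b.mean()) ** 2)            # ||b - mean||^2 = 8/3
r2_from_residual = 1.0 - sse / sst           # 0.75

# Same number as a ratio of squared lengths of centred vectors.
explained = np.sum((b_hat - b.mean()) ** 2)  # ||b_hat - mean||^2 = 2
r2_from_projection = explained / sst         # 0.75
```

The two values agree because, with an intercept in the model, $\mathbf{b}-\bar b\mathbf{1}$ splits into the orthogonal pieces $\hat{\mathbf{b}}-\bar b\mathbf{1}$ and $\mathbf{e}$.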
EXERCISE 12.5
Show that the residuals from an OLS regression always satisfy $\sum e_i=0$ (when an intercept is included) using the normal equations. The column of ones in $A$ ensures $\mathbf{1}^T\mathbf{e}=0$.
The design matrix $A$ includes a column of all ones: $\mathbf{a}_1=\mathbf{1}=(1,1,\dots,1)^T$.
The normal equations require $A^T\mathbf{e}=\mathbf{0}$, which means in particular that $\mathbf{a}_1^T\mathbf{e}=0$.
$\mathbf{a}_1^T\mathbf{e}=\sum_{i=1}^n 1\cdot e_i=\sum_{i=1}^n e_i=0$.
Consequence: $\sum e_i=0$ is not an assumption; it is a theorem that follows from including an intercept in the model. Geometrically: the residual $\mathbf{e}$ is orthogonal to the intercept column $\mathbf{1}$, which means the residuals are mean-zero.
For each additional column $\mathbf{a}_j$: $\mathbf{a}_j^T\mathbf{e}=0$ means the residuals are uncorrelated with the regressor in that column, which is the fundamental OLS property.
Prove algebraically that OLS residuals sum to zero ($\sum_{i=1}^n e_i=0$) whenever the design matrix includes a column of ones (intercept). Use the normal equations. State what this means about the residual vector's relationship to the column space of $A$.
EXERCISE 12.6
Fit separate OLS regressions of each asset's returns on the factor return. The OLS beta is $(F^TF)^{-1}F^T\mathbf{r}$ where $F$ is the factor column and $\mathbf{r}$ is the asset return vector; with an intercept, $F$ is replaced by the design matrix $A$ whose columns are $\mathbf{1}$ and $F$. Then use $R^2$ to measure how much the factor explains.
Design matrix with intercept: $A=\begin{pmatrix}1&0.01\\1&0.02\\1&-0.01\\1&0.03\end{pmatrix}$.
$A^TA=\begin{pmatrix}4&0.05\\0.05&0.0015\end{pmatrix}$ (using $0.01+0.02-0.01+0.03=0.05$ and $0.0001+0.0004+0.0001+0.0009=0.0015$).
$A^T\mathbf{r}_A=\begin{pmatrix}0.075\\0.0019\end{pmatrix}$. Row 1: sum of returns $=0.015+0.025-0.005+0.04=0.075$. Row 2: $0.01(0.015)+0.02(0.025)+(-0.01)(-0.005)+0.03(0.04)=0.00015+0.0005+0.00005+0.0012=0.0019$.
$\hat\beta=\dfrac{\operatorname{Cov}(F,r_A)}{\operatorname{Var}(F)}$: solving the normal equations gives $\hat\beta=\dfrac{4(0.0019)-(0.05)(0.075)}{4(0.0015)-(0.05)^2}=\dfrac{0.00385}{0.0035}=1.1$, with intercept $\hat\alpha=\tfrac{0.075-1.1(0.05)}{4}=0.005$. This means for each 1% market move, asset A moves about 1.1%: it is more volatile than the market (beta $>1$).
The $R^2$ measures how much return variance the single factor explains; the residual is idiosyncratic (firm-specific) risk not captured by market exposure.
A factor model regresses each asset's returns $\mathbf{r}$ on a single market factor $F$ (with intercept). Factor returns are $F=(0.01,0.02,-0.01,0.03)^T$ over four periods; asset A returns are $\mathbf{r}_A=(0.015,0.025,-0.005,0.04)^T$. Set up the normal equations and describe what the estimated slope (market beta) and $R^2$ measure for this asset.
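The normal equations for this exercise can be set up and solved directly. This is a minimal sketch, assuming NumPy; `alpha_hat`, `beta_hat`, and the other variable names are illustrative labels for the intercept, the market beta, and intermediate quantities.

```python
import numpy as np

# Factor and asset returns from the exercise statement.
F = np.array([0.01, 0.02, -0.01, 0.03])
r_A = np.array([0.015, 0.025, -0.005, 0.04])

# Design matrix with an intercept column and the factor column.
A = np.column_stack([np.ones_like(F), F])

gram = A.T @ A      # [[4, 0.05], [0.05, 0.0015]]
rhs = A.T @ r_A     # [0.075, 0.0019]

# Solve the 2x2 normal equations for (intercept, market beta).
alpha_hat, beta_hat = np.linalg.solve(gram, rhs)

# R^2: share of the asset's return variation explained by the factor.
resid = r_A - (alpha_hat + beta_hat * F)
r2 = 1.0 - (resid @ resid) / np.sum((r_A - r_A.mean()) ** 2)
```

With only four observations the estimates are purely illustrative; in practice a beta would be estimated on a much longer return history.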
07 · Chapter Summary
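| Concept | Formula / Rule |
| --- | --- |
| Projection onto $\operatorname{Col}(A)$ | $\hat{\mathbf{b}}=A(A^TA)^{-1}A^T\mathbf{b}$ |
| Projection matrix | $P=A(A^TA)^{-1}A^T$; $P^2=P$; $P^T=P$ |
| Orthogonality of residual | $A^T\mathbf{e}=\mathbf{0}$; $\mathbf{e}\perp\operatorname{Col}(A)$ |
| Normal equations | $A^TA\hat{\mathbf{x}}=A^T\mathbf{b}$ |
| Least squares solution | $\hat{\mathbf{x}}=(A^TA)^{-1}A^T\mathbf{b}$ when $A$ has independent columns |
| $A^TA$ invertible iff | Columns of $A$ are linearly independent |
| OLS regression | $\hat{\boldsymbol\beta}=(X^TX)^{-1}X^T\mathbf{y}$; $\hat{\mathbf{y}}=P\mathbf{y}$ |
| Residuals mean-zero | $\sum e_i=0$ whenever $A$ has an intercept column |
| $R^2$ geometric | Fraction of $\lVert\mathbf{b}-\bar b\mathbf{1}\rVert^2$ (squared length of the centred response) explained by the projection |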
Next: Chapter 13 — Norms & Distance Metrics introduces L1, L2, and L-infinity norms, shows how different norm choices change what "small residual" means, and connects to LASSO (L1) and ridge regression (L2) in quantitative finance.