The Lp norm of $\mathbf{v}$: $\|\mathbf{v}\|_p=\left(\sum_{i=1}^{n}|v_i|^p\right)^{1/p}$ for $p\ge 1$. The subscript $p$ selects which norm. The cases $p=1$, $p=2$, and $p=\infty$ are the three most important in applications. Without a subscript, $\|\mathbf{v}\|$ denotes the L2 norm.
$\|\mathbf{v}\|_1$ ("v norm one") — L1 norm
The sum of absolute values: $\|\mathbf{v}\|_1=|v_1|+|v_2|+\cdots+|v_n|$. Promotes sparse solutions in optimisation: penalising the L1 norm drives many components exactly to zero. Used in LASSO regression and compressed sensing.
$\|\mathbf{v}\|_\infty$ ("v norm infinity") — max norm
The maximum absolute component: $\|\mathbf{v}\|_\infty=\max_i|v_i|$. The unit ball in this norm is a hypercube. Relevant for worst-case analysis and numerical stability bounds.
$d(\mathbf{u},\mathbf{v})$ ("d u v") — distance
The distance between $\mathbf{u}$ and $\mathbf{v}$ induced by a norm: $d(\mathbf{u},\mathbf{v})=\|\mathbf{u}-\mathbf{v}\|$. Different norms give different notions of distance. The Euclidean distance $\|\mathbf{u}-\mathbf{v}\|_2$ is the straight-line distance; the Manhattan distance $\|\mathbf{u}-\mathbf{v}\|_1$ is the grid-path distance.
$\|A\|$ ("A norm") — matrix norm
The induced matrix norm: $\|A\|=\max_{\mathbf{x}\ne\mathbf{0}}\dfrac{\|A\mathbf{x}\|}{\|\mathbf{x}\|}$. Measures the maximum factor by which $A$ stretches a vector. The L2-induced matrix norm equals the largest singular value of $A$. Used in bounding numerical errors.
01 · What Is a Norm?
The Euclidean length $\|\mathbf{v}\|_2=\sqrt{\sum_i v_i^2}$ from Chapter 1 measures length in one specific way: it weights all coordinates equally and combines them as the square root of a sum of squares. Other definitions of "length" arise naturally in applications: a trader minimising total position size uses the L1 norm (sum of absolute values); a risk manager bounding the worst single exposure uses the L-infinity norm (maximum). All three satisfy the same axioms.
Definition — Norm
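A norm on $\mathbb{R}^n$ is a function $\|\cdot\|:\mathbb{R}^n\to\mathbb{R}$ satisfying:
N1. Non-negativity: $\|\mathbf{v}\|\ge 0$, with $\|\mathbf{v}\|=0\iff\mathbf{v}=\mathbf{0}$.
N2. Homogeneity: $\|c\mathbf{v}\|=|c|\,\|\mathbf{v}\|$ for all $c\in\mathbb{R}$.
N3. Triangle inequality: $\|\mathbf{u}+\mathbf{v}\|\le\|\mathbf{u}\|+\|\mathbf{v}\|$.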
The triangle inequality is the key structural constraint — it says the direct distance is no greater than the sum of two leg distances.
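The axioms can be spot-checked numerically. The sketch below is an illustration rather than a proof: it samples random vectors and tests N1 to N3 for the three norms named in this chapter. The helper names (`l1`, `l2`, `linf`, `passes_axiom_spot_check`) and the sampling ranges are assumptions made for this example.

```python
import math
import random

def l1(v):
    return sum(abs(x) for x in v)

def l2(v):
    return math.sqrt(sum(x * x for x in v))

def linf(v):
    return max(abs(x) for x in v)

def passes_axiom_spot_check(norm, n=4, trials=1000, tol=1e-9, seed=0):
    """Return True if N1-N3 hold on random samples (evidence, not a proof)."""
    rng = random.Random(seed)
    if norm([0.0] * n) != 0.0:                 # N1: zero vector has norm zero
        return False
    for _ in range(trials):
        u = [rng.uniform(-5, 5) for _ in range(n)]
        v = [rng.uniform(-5, 5) for _ in range(n)]
        c = rng.uniform(-3, 3)
        if norm(v) <= 0:                       # N1: nonzero vector, positive norm
            return False
        scaled = [c * x for x in v]
        if abs(norm(scaled) - abs(c) * norm(v)) > tol:   # N2: homogeneity
            return False
        total = [a + b for a, b in zip(u, v)]
        if norm(total) > norm(u) + norm(v) + tol:        # N3: triangle inequality
            return False
    return True

for name, norm in [("L1", l1), ("L2", l2), ("L-infinity", linf)]:
    print(name, passes_axiom_spot_check(norm))
```

A function such as the squared Euclidean length fails the same check, because scaling by $c$ multiplies it by $c^2$ rather than $|c|$.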
02 · The Three Principal Norms
Definition — L1, L2, and L-Infinity Norms
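For $\mathbf{v}=(v_1,\dots,v_n)^T\in\mathbb{R}^n$:
L1 (Manhattan/taxicab):
$$\|\mathbf{v}\|_1=\sum_{i=1}^{n}|v_i|$$
L2 (Euclidean):
$$\|\mathbf{v}\|_2=\sqrt{\sum_{i=1}^{n}v_i^2}$$
L-infinity (max/Chebyshev):
$$\|\mathbf{v}\|_\infty=\max_{1\le i\le n}|v_i|$$
The Lp family: $\|\mathbf{v}\|_p=\left(\sum_i|v_i|^p\right)^{1/p}$. As $p\to\infty$, the largest $|v_i|$ dominates the sum, which is why the limit is called the infinity norm.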
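To see the limit numerically, the following sketch evaluates the Lp norm of the vector $(3,-4,0,2)$ for increasing $p$. The function name `lp_norm` and the chosen values of $p$ are assumptions for illustration; dividing by the largest component first is a common trick to avoid overflow when $p$ is large.

```python
def lp_norm(v, p):
    """Lp norm for finite p >= 1, computed relative to the largest component."""
    m = max(abs(x) for x in v)
    if m == 0:
        return 0.0
    # Each ratio abs(x)/m is at most 1, so raising it to a large p cannot overflow.
    return m * sum((abs(x) / m) ** p for x in v) ** (1.0 / p)

v = [3, -4, 0, 2]
for p in (1, 2, 4, 10, 50, 200):
    print(p, lp_norm(v, p))
```

The printed values decrease from the L1 norm toward the largest absolute component, 4.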
Step-by-step — Computing all three norms for $\mathbf{v}=\begin{pmatrix}3\\-4\\0\\2\end{pmatrix}$
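1. Take absolute values: $|v_1|=3$, $|v_2|=4$ (negative sign removed), $|v_3|=0$, $|v_4|=2$.
2. Compute L1: sum the absolute values. $\|\mathbf{v}\|_1=3+4+0+2=9$.
3. Compute L2: square the components, sum, and take the square root. $\|\mathbf{v}\|_2=\sqrt{9+16+0+4}=\sqrt{29}\approx 5.39$.
4. Compute L-infinity: find the largest absolute value. $\max(3,4,0,2)=4$, so $\|\mathbf{v}\|_\infty=4$.
5. Compare: $\|\mathbf{v}\|_\infty=4\le\|\mathbf{v}\|_2\approx 5.39\le\|\mathbf{v}\|_1=9$. This ordering always holds in $\mathbb{R}^n$: $\|\mathbf{v}\|_\infty\le\|\mathbf{v}\|_2\le\|\mathbf{v}\|_1\le\sqrt{n}\,\|\mathbf{v}\|_2\le n\,\|\mathbf{v}\|_\infty$.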
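The same computation can be reproduced in a few lines. This is a minimal sketch using only the standard library; the variable names are chosen for this example.

```python
import math

v = [3, -4, 0, 2]
n = len(v)

norm_1 = sum(abs(x) for x in v)               # sum of absolute values
norm_2 = math.sqrt(sum(x * x for x in v))     # square root of sum of squares
norm_inf = max(abs(x) for x in v)             # largest absolute component

# The full chain: inf <= 2 <= 1 <= sqrt(n) * 2 <= n * inf
chain = [norm_inf, norm_2, norm_1, math.sqrt(n) * norm_2, n * norm_inf]
is_sorted = all(a <= b for a, b in zip(chain, chain[1:]))
print(chain, is_sorted)
```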
03 · Unit Balls and Geometry
The unit ball $B_p=\{\mathbf{v}\in\mathbb{R}^n : \|\mathbf{v}\|_p\le 1\}$ visualises how a norm defines "nearness to the origin." Different norms have dramatically different shapes.
In $\mathbb{R}^2$:
L1 ball: $|v_1|+|v_2|\le 1$, a diamond (square rotated 45°) with vertices at $(\pm1,0)$ and $(0,\pm1)$. The corners of the diamond lie on the axes, which is why L1-penalised optimisation drives solutions to corners where components are exactly zero.
L2 ball: $v_1^2+v_2^2\le 1$, a circle. The round boundary means no direction is special; the optimum can lie anywhere on the boundary.
L-infinity ball: $\max(|v_1|,|v_2|)\le 1$, a square aligned with the axes, $[-1,1]\times[-1,1]$. Corners lie at $(\pm1,\pm1)$.
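A quick membership test makes the nesting of the three shapes concrete: the diamond sits inside the circle, which sits inside the square. The sketch below is illustrative; `p_norm`, `in_unit_ball`, and the sample points are assumptions chosen for this example.

```python
import math

def p_norm(v, p):
    """Lp norm for p >= 1; pass math.inf for the max norm."""
    if p == math.inf:
        return max(abs(x) for x in v)
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

def in_unit_ball(v, p, tol=1e-12):
    """True if v lies in the closed unit ball of the Lp norm."""
    return p_norm(v, p) <= 1.0 + tol

points = [(0.5, 0.4), (0.6, 0.6), (0.8, 0.8), (1.0, 0.0)]
for pt in points:
    membership = {name: in_unit_ball(pt, p)
                  for name, p in [("L1", 1), ("L2", 2), ("Linf", math.inf)]}
    print(pt, membership)
```

The point $(0.6, 0.6)$ lies outside the diamond but inside the circle, and $(0.8, 0.8)$ lies outside the circle but inside the square; the axis point $(1, 0)$ is on the boundary of all three.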
✓ Example — Why L1 Promotes Sparsity
In LASSO regression, we minimise $\|X\beta-\mathbf{y}\|_2^2$ subject to $\|\beta\|_1\le t$. The constraint set is the L1 ball, a diamond in $\mathbb{R}^2$. The unconstrained OLS solution $\hat\beta$ typically lies off the axes and, when the constraint binds, outside the diamond. The constrained minimum then lies on the boundary of the diamond, where the elliptical contours of the loss first touch it, and this contact tends to occur at a corner: a point where one component is exactly zero. No such tendency exists for the L2 ball (circle), whose boundary is smooth.
❌ What Breaks — The L0 Pseudo-Norm Is Not a Norm
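$\|\mathbf{v}\|_0=$ number of nonzero components is sometimes called the L0 norm in compressed sensing and sparse regression. It is not a true norm: $\|c\mathbf{v}\|_0=\|\mathbf{v}\|_0$ for any $c\ne 0$, violating homogeneity $\|c\mathbf{v}\|=|c|\,\|\mathbf{v}\|$ (e.g. $\|2\mathbf{v}\|_0=\|\mathbf{v}\|_0\ne 2\|\mathbf{v}\|_0$ for any $\mathbf{v}\ne\mathbf{0}$). Optimising over $\|\cdot\|_0$ is NP-hard in general; the L1 norm is the tightest convex relaxation that still promotes sparsity.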
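The homogeneity failure is easy to confirm on a small vector. In this sketch the helper names `l0` and `l1` and the test vector are assumptions for illustration.

```python
def l0(v):
    """Count of nonzero components (the L0 'pseudo-norm')."""
    return sum(1 for x in v if x != 0)

def l1(v):
    return sum(abs(x) for x in v)

v = [3, 0, -4, 0]
scaled = [2 * x for x in v]

# Doubling the vector doubles the L1 norm but leaves the nonzero count unchanged.
print("L0:", l0(v), l0(scaled))
print("L1:", l1(v), l1(scaled))
```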
04 · Norm Equivalence
All norms on $\mathbb{R}^n$ are equivalent: each one bounds any other from above and below, up to constant factors that depend only on $n$.
Definition — Norm Equivalence
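Norms $\|\cdot\|_\alpha$ and $\|\cdot\|_\beta$ on $\mathbb{R}^n$ are equivalent if there exist constants $c_1,c_2>0$ such that:
$$c_1\|\mathbf{v}\|_\alpha\le\|\mathbf{v}\|_\beta\le c_2\|\mathbf{v}\|_\alpha\quad\text{for all }\mathbf{v}\in\mathbb{R}^n$$
In $\mathbb{R}^n$, all norms are equivalent. Explicit bounds between L1, L2, and L-infinity:
$$\|\mathbf{v}\|_\infty\le\|\mathbf{v}\|_2\le\sqrt{n}\,\|\mathbf{v}\|_\infty,\qquad\|\mathbf{v}\|_2\le\|\mathbf{v}\|_1\le\sqrt{n}\,\|\mathbf{v}\|_2$$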
Consequence: convergence in one norm implies convergence in all norms — the choice of norm does not affect which sequences converge, only how fast.
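The explicit bounds between the L1, L2, and L-infinity norms can be checked on random vectors, and the extreme cases show that the constants cannot be improved. The sketch below is illustrative only; `norms`, `equivalence_bounds_hold`, the dimension 6, and the Gaussian sampling are assumptions for this example.

```python
import math
import random

def norms(v):
    """Return (L1, L2, L-infinity) norms of v."""
    return (sum(abs(x) for x in v),
            math.sqrt(sum(x * x for x in v)),
            max(abs(x) for x in v))

def equivalence_bounds_hold(v, tol=1e-9):
    n1, n2, ninf = norms(v)
    root_n = math.sqrt(len(v))
    return (ninf <= n2 + tol and n2 <= root_n * ninf + tol
            and n2 <= n1 + tol and n1 <= root_n * n2 + tol)

rng = random.Random(1)
samples = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(500)]
all_ok = all(equivalence_bounds_hold(v) for v in samples)
print("bounds hold on all samples:", all_ok)

ones = [1.0] * 6          # attains both upper bounds (factor sqrt(n))
e1 = [1.0] + [0.0] * 5    # attains both lower bounds (all three norms equal)
print(norms(ones), norms(e1))
```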
05 · Matrix Norms
Norms extend to matrices. The most useful is the induced (operator) norm: $\|A\|_p=\max_{\|\mathbf{x}\|_p=1}\|A\mathbf{x}\|_p$, the maximum stretch factor.
Definition — Induced Matrix Norms
For $A\in\mathbb{R}^{m\times n}$:
Induced L1: $\|A\|_1=\max_j\sum_i|a_{ij}|$, the maximum absolute column sum.
Induced L2 (spectral norm): $\|A\|_2=\sigma_{\max}(A)$, the largest singular value.
Induced L-infinity: $\|A\|_\infty=\max_i\sum_j|a_{ij}|$, the maximum absolute row sum.
Frobenius norm (not induced): $\|A\|_F=\sqrt{\sum_{i,j}a_{ij}^2}$, the Euclidean norm of the entries.
✓ Example — Matrix Norm Computation
$A=\begin{pmatrix}2&-1\\3&4\end{pmatrix}$.
$\|A\|_1$: column sums $=|2|+|3|=5$ and $|-1|+|4|=5$. Max $=5$.
$\|A\|_\infty$: row sums $=|2|+|-1|=3$ and $|3|+|4|=7$. Max $=7$.
$\|A\|_F=\sqrt{4+1+9+16}=\sqrt{30}\approx 5.48$.
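The example can be reproduced with a few short functions, taking the matrix with rows $(2,-1)$ and $(3,4)$ as implied by the column and row sums above. The spectral norm is not worked out in the example; the sketch adds it as an illustration, using the closed-form largest eigenvalue of the symmetric $2\times2$ matrix $A^TA$. All function names are assumptions for this example, and `spectral_2x2` only handles $2\times2$ input.

```python
import math

def induced_1(A):
    """Maximum absolute column sum."""
    return max(sum(abs(row[j]) for row in A) for j in range(len(A[0])))

def induced_inf(A):
    """Maximum absolute row sum."""
    return max(sum(abs(x) for x in row) for row in A)

def frobenius(A):
    """Euclidean norm of the entries."""
    return math.sqrt(sum(x * x for row in A for x in row))

def spectral_2x2(A):
    """Largest singular value of a 2x2 matrix via eigenvalues of A^T A."""
    (a, b), (c, d) = A
    p = a * a + c * c          # (A^T A)[0][0]
    q = a * b + c * d          # (A^T A)[0][1] = (A^T A)[1][0]
    r = b * b + d * d          # (A^T A)[1][1]
    lam_max = (p + r) / 2 + math.sqrt(((p - r) / 2) ** 2 + q * q)
    return math.sqrt(lam_max)

A = [[2, -1], [3, 4]]
print(induced_1(A), induced_inf(A), frobenius(A), spectral_2x2(A))
```

The spectral norm never exceeds the Frobenius norm, which gives a cheap upper bound when singular values are expensive to compute.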
06 · Quant Application — Regularisation: LASSO and Ridge
OLS minimises $\|X\beta-\mathbf{y}\|_2^2$ with no constraint on $\beta$. When predictors are correlated or $n<p$ (fewer observations than predictors), OLS is unstable. Adding a norm penalty stabilises the solution:
Ridge regression (L2 penalty):
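$$\hat\beta_{\text{ridge}}=\arg\min_\beta\ \|X\beta-\mathbf{y}\|_2^2+\lambda\|\beta\|_2^2$$
Closed form: $\hat\beta_{\text{ridge}}=(X^TX+\lambda I)^{-1}X^T\mathbf{y}$. Adding $\lambda I$ with $\lambda>0$ ensures invertibility; this is the fix for singular $X^TX$ from Chapter 12. Ridge shrinks all coefficients toward zero but generally keeps all of them nonzero.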
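A tiny two-predictor case shows the effect of adding $\lambda I$. In the sketch below the data are hypothetical: the two columns of `X` are identical, so $X^TX$ is singular and the unpenalised normal equations have no unique solution. The helper `ridge_2d` is an illustrative name and solves the $2\times2$ system by Cramer's rule.

```python
def ridge_2d(X, y, lam):
    """Solve (X^T X + lam*I) beta = X^T y for exactly two predictors."""
    s11 = sum(r[0] * r[0] for r in X) + lam
    s12 = sum(r[0] * r[1] for r in X)
    s22 = sum(r[1] * r[1] for r in X) + lam
    t1 = sum(r[0] * yi for r, yi in zip(X, y))
    t2 = sum(r[1] * yi for r, yi in zip(X, y))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("X^T X + lam*I is singular")
    return [(s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det]

# Hypothetical data: duplicated predictor, y = 2 * x exactly.
X = [[1, 1], [2, 2], [3, 3]]
y = [2, 4, 6]

beta = ridge_2d(X, y, lam=1.0)
print(beta)   # the two coefficients come out equal, splitting the effect

try:
    ridge_2d(X, y, lam=0.0)
except ValueError as err:
    print("lam = 0:", err)
```

With $\lambda=0$ the determinant vanishes, while any $\lambda>0$ yields a unique solution that splits the effect evenly between the duplicated predictors.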
LASSO (L1 penalty):
$$\hat\beta_{\text{LASSO}}=\arg\min_\beta\ \|X\beta-\mathbf{y}\|_2^2+\lambda\|\beta\|_1$$
There is no closed form in general; the solution must be computed by convex optimisation. LASSO sets many $\hat\beta_j$ exactly to zero, producing a sparse model. In factor investing: LASSO automatically selects a small number of relevant factors from a large candidate pool.
The difference: ridge penalises large coefficients; LASSO penalises non-sparsity. The L1 unit ball's corners — aligned with coordinate axes — make exact zero solutions geometrically likely at the optimum.
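One special case makes the contrast explicit. Under the simplifying assumption that the predictors are orthonormal ($X^TX=I$, an assumption not made in the text), both objectives separate by coordinate. Writing $z_j$ for the OLS coefficient, ridge gives $z_j/(1+\lambda)$ and LASSO gives soft-thresholding at $\lambda/2$. The sketch below uses hypothetical OLS coefficients and illustrative function names.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_orthonormal(z, lam):
    # Coordinate-wise minimiser of (b - z_j)^2 + lam * |b|   (assumes X^T X = I)
    return [soft_threshold(zj, lam / 2) for zj in z]

def ridge_orthonormal(z, lam):
    # Coordinate-wise minimiser of (b - z_j)^2 + lam * b^2   (assumes X^T X = I)
    return [zj / (1 + lam) for zj in z]

z = [3.0, 0.4, -1.2, 0.1]      # hypothetical OLS coefficients
lam = 1.0
lasso = lasso_orthonormal(z, lam)
ridge = ridge_orthonormal(z, lam)
print("LASSO:", lasso)
print("ridge:", ridge)
```

The small coefficients are set exactly to zero by the L1 penalty, while the L2 penalty only rescales them.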
07 · Practice Exercises
EXERCISE 13.1
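Compute $\|\mathbf{v}\|_1$, $\|\mathbf{v}\|_2$, and $\|\mathbf{v}\|_\infty$ for $\mathbf{v}=\begin{pmatrix}-6\\2\\-3\\0\end{pmatrix}$. Verify the ordering $\|\mathbf{v}\|_\infty\le\|\mathbf{v}\|_2\le\|\mathbf{v}\|_1$.
Hint: Absolute-value each component, then: L1 = sum; L2 = square root of the sum of squares; L-infinity = maximum.
Solution:
Absolute values: $6, 2, 3, 0$.
$\|\mathbf{v}\|_1=6+2+3+0=11$.
$\|\mathbf{v}\|_2=\sqrt{36+4+9+0}=\sqrt{49}=7$.
$\|\mathbf{v}\|_\infty=\max(6,2,3,0)=6$.
Ordering: $6=\|\mathbf{v}\|_\infty\le\|\mathbf{v}\|_2=7\le\|\mathbf{v}\|_1=11$ ✓.
Bound check: $\|\mathbf{v}\|_2\le\|\mathbf{v}\|_1=11$ ✓ and $\|\mathbf{v}\|_2\le\sqrt{4}\,\|\mathbf{v}\|_\infty=2(6)=12$ ✓.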
EXERCISE 13.2
Verify that the L1 norm $\|\mathbf{v}\|_1=\sum_i|v_i|$ satisfies all three norm axioms (N1 non-negativity, N2 homogeneity, N3 triangle inequality) for vectors in $\mathbb{R}^n$.
Hint: Verify axioms N1, N2, N3 for the L1 norm directly. For N3, use $|u_i+v_i|\le|u_i|+|v_i|$ component-wise.
Solution:
N1 (non-negativity): $\|\mathbf{v}\|_1=\sum_i|v_i|\ge 0$ since each $|v_i|\ge 0$. It equals zero $\iff$ all $|v_i|=0\iff$ all $v_i=0\iff\mathbf{v}=\mathbf{0}$ ✓.
N2 (homogeneity): $\|c\mathbf{v}\|_1=\sum_i|cv_i|=\sum_i|c|\,|v_i|=|c|\sum_i|v_i|=|c|\,\|\mathbf{v}\|_1$ ✓ (since $|c|$ is a common factor in every term).
N3 (triangle inequality): $\|\mathbf{u}+\mathbf{v}\|_1=\sum_i|u_i+v_i|$. For each $i$: $|u_i+v_i|\le|u_i|+|v_i|$ (standard absolute value inequality). Summing over $i$: $\sum_i|u_i+v_i|\le\sum_i|u_i|+\sum_i|v_i|=\|\mathbf{u}\|_1+\|\mathbf{v}\|_1$ ✓.
All three axioms hold. $\|\cdot\|_1$ is a norm.
EXERCISE 13.3
For the induced L1 matrix norm: compute the absolute column sums and take the maximum. For L-infinity: compute the absolute row sums and take the maximum.
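EXERCISE 13.4
Prove the norm ordering $\|\mathbf{v}\|_\infty\le\|\mathbf{v}\|_2\le\|\mathbf{v}\|_1$ for $\mathbf{v}\in\mathbb{R}^n$. Also state (with brief justification) the tight bounds $\|\mathbf{v}\|_2\le\sqrt{n}\,\|\mathbf{v}\|_\infty$ and $\|\mathbf{v}\|_1\le\sqrt{n}\,\|\mathbf{v}\|_2$.
Solution:
Proof that $\|\mathbf{v}\|_2\le\|\mathbf{v}\|_1$:
$\|\mathbf{v}\|_1^2=\left(\sum_i|v_i|\right)^2=\sum_i v_i^2+2\sum_{i<j}|v_i|\,|v_j|\ge\sum_i v_i^2=\|\mathbf{v}\|_2^2$, since $|v_i|\,|v_j|\ge 0$. Taking square roots (both sides non-negative): $\|\mathbf{v}\|_1\ge\|\mathbf{v}\|_2$.
Proof that $\|\mathbf{v}\|_\infty\le\|\mathbf{v}\|_2$:
$\|\mathbf{v}\|_\infty=\max_i|v_i|$. Since $\max_i|v_i|^2\le\sum_i v_i^2$, taking square roots gives $\|\mathbf{v}\|_\infty\le\|\mathbf{v}\|_2$.
Tight bounds: $\|\mathbf{v}\|_2\le\sqrt{n}\,\|\mathbf{v}\|_\infty$ because $\sum_i v_i^2\le n\max_i v_i^2$; take square roots. Similarly $\|\mathbf{v}\|_1\le\sqrt{n}\,\|\mathbf{v}\|_2$ by Cauchy-Schwarz applied to $\sum_i|v_i|\cdot 1\le\left(\sum_i v_i^2\right)^{1/2}\left(\sum_i 1^2\right)^{1/2}$.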
EXERCISE 13.5
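Prove that the ridge-regression matrix $A^TA+\lambda I$ is always invertible for $\lambda>0$, even when $A^TA$ is singular. Use the quadratic form definition of positive definiteness.
Hint: Ridge adds $\lambda I$ to $A^TA$. Show that $A^TA+\lambda I$ is always positive definite for $\lambda>0$ using the definition $\mathbf{x}^T(A^TA+\lambda I)\mathbf{x}>0$.
Solution:
For any $\mathbf{x}\ne\mathbf{0}$ and $\lambda>0$:
$\mathbf{x}^T(A^TA+\lambda I)\mathbf{x}=\mathbf{x}^TA^TA\mathbf{x}+\lambda\mathbf{x}^T\mathbf{x}=\|A\mathbf{x}\|_2^2+\lambda\|\mathbf{x}\|_2^2$.
$\|A\mathbf{x}\|_2^2\ge 0$ always. $\lambda\|\mathbf{x}\|_2^2>0$ since $\lambda>0$ and $\mathbf{x}\ne\mathbf{0}$.
Therefore $\mathbf{x}^T(A^TA+\lambda I)\mathbf{x}>0$ for all $\mathbf{x}\ne\mathbf{0}$: positive definite $\Rightarrow$ invertible.
This holds regardless of whether $A$ has dependent columns: even if $A^TA$ is singular, $A^TA+\lambda I$ is always invertible for any $\lambda>0$. Ridge regression always has a unique solution.
Eigenvalue interpretation: the eigenvalues of $A^TA+\lambda I$ are $\sigma_i^2+\lambda$ (where $\sigma_i$ are the singular values of $A$). Since $\sigma_i^2\ge 0$ and $\lambda>0$, all eigenvalues are positive, confirming positive definiteness.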
EXERCISE 13.6
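A four-factor return model has two candidate coefficient vectors: $\beta^{(1)}=(0.8,0,0.3,0)^T$ (sparse) and $\beta^{(2)}=(0.5,0.3,0.2,0.1)^T$ (dense), with equal residual sum of squares (RSS $=2.0$). With $\lambda=0.5$, compute both the LASSO objective ($\text{RSS}+\lambda\|\beta\|_1$) and the ridge objective ($\text{RSS}+\lambda\|\beta\|_2^2$) for each candidate. Which does each penalty favour, and why?
Hint: Compute $\|\beta\|_1$ and $\|\beta\|_2^2$ for each candidate. The LASSO objective is $\text{RSS}+\lambda\|\beta\|_1$ and the ridge objective is $\text{RSS}+\lambda\|\beta\|_2^2$. Compare total objectives for each candidate.
Solution:
RSS (sum of squared residuals): both candidates have the same RSS $=2.0$, so only the penalty differs.
L1 norms: $\|\beta^{(1)}\|_1=0.8+0.3=1.1$ and $\|\beta^{(2)}\|_1=0.5+0.3+0.2+0.1=1.1$.
Squared L2 norms: $\|\beta^{(1)}\|_2^2=0.64+0.09=0.73$ and $\|\beta^{(2)}\|_2^2=0.25+0.09+0.04+0.01=0.39$.
LASSO objectives: $2.0+0.5(1.1)=2.55$ for both candidates.
Ridge objectives: $2.0+0.5(0.73)=2.365$ for candidate 1 and $2.0+0.5(0.39)=2.195$ for candidate 2.
For ridge: candidate 2 is better (smaller L2 norm despite the same L1 norm); ridge prefers spreading coefficients evenly. For LASSO: both have equal objectives (same L1 norm), so the penalty alone does not separate them here, but in practice LASSO's geometry drives toward sparse solutions like candidate 1.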
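The four objective values can be checked with a short script. This is a minimal sketch using the numbers given in the exercise; the function and dictionary names are chosen for illustration.

```python
def lasso_objective(beta, rss, lam):
    """RSS plus lam times the L1 norm of beta."""
    return rss + lam * sum(abs(b) for b in beta)

def ridge_objective(beta, rss, lam):
    """RSS plus lam times the squared L2 norm of beta."""
    return rss + lam * sum(b * b for b in beta)

rss, lam = 2.0, 0.5
candidates = {
    "beta1 (sparse)": [0.8, 0.0, 0.3, 0.0],
    "beta2 (dense)": [0.5, 0.3, 0.2, 0.1],
}

results = {
    name: (lasso_objective(beta, rss, lam), ridge_objective(beta, rss, lam))
    for name, beta in candidates.items()
}
for name, (lasso_val, ridge_val) in results.items():
    print(f"{name}: LASSO = {lasso_val:.3f}, ridge = {ridge_val:.3f}")
```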
08 · Chapter Summary
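| Concept | Key Formula |
|---|---|
| Norm axioms | Non-negativity, homogeneity, triangle inequality |
| L1 norm | $\lVert\mathbf{v}\rVert_1=\sum_i\lvert v_i\rvert$ (sum of absolute values) |
| L2 norm | $\lVert\mathbf{v}\rVert_2=\sqrt{\sum_i v_i^2}$ (Euclidean length) |
| L-infinity norm | $\lVert\mathbf{v}\rVert_\infty=\max_i\lvert v_i\rvert$ (largest absolute component) |
| Norm ordering | $\lVert\mathbf{v}\rVert_\infty\le\lVert\mathbf{v}\rVert_2\le\lVert\mathbf{v}\rVert_1\le\sqrt{n}\,\lVert\mathbf{v}\rVert_2\le n\,\lVert\mathbf{v}\rVert_\infty$ |
| Unit ball shapes | L1: diamond; L2: circle; L-infinity: square |
| Norm equivalence | All norms on $\mathbb{R}^n$ equivalent up to constants |
| Induced matrix norm | $\lVert A\rVert_p=\max_{\lVert\mathbf{x}\rVert_p=1}\lVert A\mathbf{x}\rVert_p$ |
| Ridge regression | Add $\lambda\lVert\beta\rVert_2^2$; always invertible; keeps all predictors |
| LASSO regression | Add $\lambda\lVert\beta\rVert_1$; sparse solutions; L1 ball corners |
| L0 pseudo-norm | Counts nonzeros; not a true norm (fails homogeneity) |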
Next: Chapter 14 — Positive Definite Matrices establishes four equivalent characterisations of positive definiteness and connects them to eigenvalues, Cholesky decomposition, and the requirement that all valid covariance matrices must be positive semi-definite.