Chapter 13

Norms & Distance Metrics

00 · Symbol Glossary

$\|\mathbf{v}\|_p$ (read "v norm p"): p-norm

The $L^p$ norm of $\mathbf{v}$ : $\|\mathbf{v}\|_p = \left(\sum_{i=1}^n |v_i|^p\right)^{1/p}$ for $p \geq 1$ . The subscript $p$ selects which norm. The cases $p=1$ , $p=2$ , $p=\infty$ are the three most important in applications. Without a subscript, $\|\mathbf{v}\|$ denotes the L2 norm.

$\|\mathbf{v}\|_1$ (read "v norm one"): L1 norm

The sum of absolute values: $\|\mathbf{v}\|_1 = |v_1|+|v_2|+\cdots+|v_n|$ . Promotes sparse solutions in optimisation — penalising the L1 norm drives many components exactly to zero. Used in LASSO regression and compressed sensing.

$\|\mathbf{v}\|_\infty$ (read "v norm infinity"): max norm

The maximum absolute component: $\|\mathbf{v}\|_\infty = \max_i |v_i|$ . The unit ball in this norm is a hypercube. Relevant for worst-case analysis and numerical stability bounds.

$d(\mathbf{u},\mathbf{v})$ (read "d u v"): distance

The distance between $\mathbf{u}$ and $\mathbf{v}$ induced by a norm: $d(\mathbf{u},\mathbf{v}) = \|\mathbf{u}-\mathbf{v}\|$ . Different norms give different notions of distance. The Euclidean distance $\|\mathbf{u}-\mathbf{v}\|_2$ is the straight-line distance; the Manhattan distance $\|\mathbf{u}-\mathbf{v}\|_1$ is the grid-path distance.

$\|A\|$ (read "A norm"): matrix norm

The induced matrix norm: $\|A\| = \max_{\mathbf{x}\neq\mathbf{0}}\frac{\|A\mathbf{x}\|}{\|\mathbf{x}\|}$ . Measures the maximum factor by which $A$ stretches a vector. The L2-induced matrix norm equals the largest singular value of $A$ . Used in bounding numerical errors.
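To make the distance entry in this glossary concrete, here is a minimal sketch that measures the gap between two points under each of the three principal norms. The points `u` and `v` and the helper name `distance` are illustrative assumptions, not taken from the text.

```python
import math

def distance(u, v, p):
    """Distance ||u - v||_p for p = 1, 2, or math.inf."""
    diffs = [abs(a - b) for a, b in zip(u, v)]
    if p == math.inf:
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

u = [1.0, 2.0]
v = [4.0, 6.0]   # differs from u by (3, 4)

manhattan = distance(u, v, 1)         # grid-path distance: 3 + 4
euclidean = distance(u, v, 2)         # straight-line distance: sqrt(9 + 16)
chebyshev = distance(u, v, math.inf)  # largest single-coordinate gap: max(3, 4)
print(manhattan, euclidean, chebyshev)
```

The same pair of points is 7, 5, or 4 units apart depending on which norm induces the distance.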

01 · What Is a Norm?

The Euclidean length $\|\mathbf{v}\|_2=\sqrt{\sum v_i^2}$ from Chapter 1 is one specific way to measure length: it weights every coordinate equally and combines them by squaring, summing, and taking a square root. Other definitions of "length" arise naturally in applications: a trader minimising total position size uses the L1 norm (sum of absolute values); a risk manager bounding the worst single exposure uses the L-infinity norm (maximum). All three satisfy the same axioms.

Definition — Norm

A norm on $\mathbb{R}^n$ is a function $\|\cdot\|: \mathbb{R}^n \to \mathbb{R}$ satisfying:

N1. Non-negativity: $\|\mathbf{v}\| \geq 0$ , with $\|\mathbf{v}\|=0 \iff \mathbf{v}=\mathbf{0}$ .

N2. Homogeneity: $\|c\mathbf{v}\| = |c|\|\mathbf{v}\|$ for all $c \in \mathbb{R}$ .

N3. Triangle inequality: $\|\mathbf{u}+\mathbf{v}\| \leq \|\mathbf{u}\| + \|\mathbf{v}\|$ .

The triangle inequality is the key structural constraint — it says the direct distance is no greater than the sum of two leg distances.

02 · The Three Principal Norms

Definition — L1, L2, and L-Infinity Norms

For $\mathbf{v} = (v_1, \ldots, v_n)^T \in \mathbb{R}^n$ :

L1 (Manhattan/taxicab):

$$\|\mathbf{v}\|_1 = \sum_{i=1}^n |v_i|$$

L2 (Euclidean):

$$\|\mathbf{v}\|_2 = \sqrt{\sum_{i=1}^n v_i^2}$$

L-infinity (max/Chebyshev):

$$\|\mathbf{v}\|_\infty = \max_{1 \leq i \leq n} |v_i|$$

The $L^p$ family: $\|\mathbf{v}\|_p = \left(\sum_i |v_i|^p\right)^{1/p}$ . As $p\to\infty$ , the maximum dominates — that is why the limit is called the infinity norm.
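The limiting behaviour of the $L^p$ family can be watched numerically. The sketch below is one possible implementation, not the text's own: the function name `p_norm` and the choice to factor out the largest entry (which avoids overflow for large $p$) are assumptions made for illustration.

```python
import math

def p_norm(v, p):
    """L^p norm for p >= 1, or the max norm when p is math.inf."""
    m = max(abs(x) for x in v)
    if m == 0 or p == math.inf:
        return m
    # Factor out the largest entry so that large p does not overflow.
    return m * sum((abs(x) / m) ** p for x in v) ** (1.0 / p)

v = [3.0, -4.0, 0.0, 2.0]
for p in (1, 2, 4, 10, 100):
    print(p, p_norm(v, p))
print("inf", p_norm(v, math.inf))
```

As $p$ grows, the printed values fall from the L1 norm toward the largest absolute component, which is the value returned for `math.inf`.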

Step-by-step — Computing all three norms for $\mathbf{v}=\begin{pmatrix}3\\-4\\0\\2\end{pmatrix}$

Take absolute values: $|v_1|=3$ , $|v_2|=4$ (negative signs removed), $|v_3|=0$ , $|v_4|=2$ .

Compute L1: sum the absolute values. $\|\mathbf{v}\|_1 = 3+4+0+2 = 9$ .

Compute L2: square each, sum, take root. $3^2=9$ , $4^2=16$ , $0^2=0$ , $2^2=4$ . Sum: $9+16+0+4=29$ . $\|\mathbf{v}\|_2=\sqrt{29}\approx5.39$ .

Compute L-infinity: find the largest absolute value. $\max(3,4,0,2)=4$ . $\|\mathbf{v}\|_\infty=4$ .

Compare: $\|\mathbf{v}\|_\infty = 4 \leq \|\mathbf{v}\|_2 \approx 5.39 \leq \|\mathbf{v}\|_1 = 9$ . This ordering always holds in $\mathbb{R}^n$ : $\|\mathbf{v}\|_\infty \leq \|\mathbf{v}\|_2 \leq \|\mathbf{v}\|_1 \leq \sqrt{n}\,\|\mathbf{v}\|_2 \leq n\,\|\mathbf{v}\|_\infty$ .
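The worked example above can be double-checked in a few lines. This is a minimal sketch using only the standard library; the variable names are illustrative.

```python
import math

v = [3.0, -4.0, 0.0, 2.0]
n = len(v)

l1 = sum(abs(x) for x in v)            # sum of absolute values
l2 = math.sqrt(sum(x * x for x in v))  # root of the sum of squares
linf = max(abs(x) for x in v)          # largest absolute component

# The full chain: Linf <= L2 <= L1 <= sqrt(n) * L2 <= n * Linf
chain = [linf, l2, l1, math.sqrt(n) * l2, n * linf]
print(chain)
# Every entry should be no larger than the one after it.
assert all(a <= b for a, b in zip(chain, chain[1:]))
```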

03 · Unit Balls and Geometry

The unit ball $B_p = \{\mathbf{v} \in \mathbb{R}^n \mid \|\mathbf{v}\|_p \leq 1\}$ visualises how a norm defines "nearness to the origin." Different norms have dramatically different shapes.

In $\mathbb{R}^2$ :

L1 ball: $|v_1|+|v_2|\leq1$ — a diamond (square rotated 45°) with vertices at $(\pm1,0)$ and $(0,\pm1)$ . The corners of the diamond lie on the axes — this is why L1-penalised optimisation drives solutions to corners where many components are exactly zero.
L2 ball: $v_1^2+v_2^2\leq1$ — a circle. The round boundary means no direction is special; the optimum can lie anywhere on the boundary.
L-infinity ball: $\max(|v_1|,|v_2|)\leq1$ — a square aligned with the axes, $[-1,1]\times[-1,1]$ . Corners lie at $(\pm1,\pm1)$ .
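The three shapes are nested: the diamond sits inside the circle, which sits inside the square. A small membership test makes this visible. The helper `in_unit_ball` and the sample points are assumptions chosen for illustration.

```python
def in_unit_ball(point, norm):
    """True if the 2-D point lies in the closed unit ball of the given norm."""
    x, y = abs(point[0]), abs(point[1])
    if norm == "l1":
        return x + y <= 1.0
    if norm == "l2":
        return x * x + y * y <= 1.0
    if norm == "linf":
        return max(x, y) <= 1.0
    raise ValueError(f"unknown norm: {norm}")

# Three points on the diagonal, moving away from the origin.
for point in [(0.4, 0.4), (0.6, 0.6), (0.8, 0.8)]:
    print(point, {name: in_unit_ball(point, name) for name in ("l1", "l2", "linf")})
```

Moving outward along the diagonal, a point leaves the L1 ball first, then the L2 ball, and is still inside the L-infinity ball at $(0.8, 0.8)$.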

✓ Example — Why L1 Promotes Sparsity

In LASSO regression, we minimise $\|X\boldsymbol{\beta}-\mathbf{y}\|_2^2$ subject to $\|\boldsymbol{\beta}\|_1 \leq t$ . The constraint is the L1 ball — a diamond in $\mathbb{R}^2$ . The unconstrained OLS solution $\hat{\boldsymbol{\beta}}$ typically lies off the axes. The feasible region is the diamond, and the minimum of the loss function on the diamond surface tends to occur at a corner — a point where one component is exactly zero. No such tendency exists for the L2 ball (circle), whose boundary is smooth.

❌ What Breaks — The L0 Pseudo-Norm Is Not a Norm

$\|\mathbf{v}\|_0 = \text{number of nonzero components}$ is sometimes called the L0 norm in compressed sensing and sparse regression. It is not a true norm: $\|c\mathbf{v}\|_0 = \|\mathbf{v}\|_0$ for any $c\neq0$, violating homogeneity $\|c\mathbf{v}\|=|c|\|\mathbf{v}\|$ (e.g. for $\mathbf{v}\neq\mathbf{0}$, $\|2\mathbf{v}\|_0 = \|\mathbf{v}\|_0 \neq 2\|\mathbf{v}\|_0$). Optimising over $\|\cdot\|_0$ is NP-hard in general; the L1 norm is its tightest convex relaxation on the unit cube $\|\mathbf{v}\|_\infty\leq1$, and it still promotes sparsity.
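The failure of homogeneity is easy to see numerically. In this sketch the helper names `l0` and `l1` and the test vector are illustrative assumptions.

```python
def l0(v):
    """Number of nonzero components (the so-called L0 'norm')."""
    return sum(1 for x in v if x != 0)

def l1(v):
    """Sum of absolute values."""
    return sum(abs(x) for x in v)

v = [3.0, -4.0, 0.0, 2.0]
scaled = [2.0 * x for x in v]

print(l0(v), l0(scaled))   # the count does not change under scaling
print(l1(v), l1(scaled))   # the L1 norm doubles, as homogeneity requires
```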

04 · Norm Equivalence

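All norms on $\mathbb{R}^n$ are equivalent: any one of them bounds any other from above and below, up to constant factors. Equivalent norms need not rank vectors in the same order, but they agree on which sequences of vectors shrink to zero.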

Definition — Norm Equivalence

Norms $\|\cdot\|_\alpha$ and $\|\cdot\|_\beta$ on $\mathbb{R}^n$ are equivalent if there exist constants $c_1, c_2 > 0$ such that:

$$c_1 \|\mathbf{v}\|_\alpha \leq \|\mathbf{v}\|_\beta \leq c_2 \|\mathbf{v}\|_\alpha \quad \text{for all } \mathbf{v} \in \mathbb{R}^n$$

In $\mathbb{R}^n$ , all norms are equivalent. Explicit bounds between L1, L2, and L-infinity:

$$\|\mathbf{v}\|_\infty \leq \|\mathbf{v}\|_2 \leq \sqrt{n}\,\|\mathbf{v}\|_\infty, \qquad \|\mathbf{v}\|_2 \leq \|\mathbf{v}\|_1 \leq \sqrt{n}\,\|\mathbf{v}\|_2$$

Consequence: convergence in one norm implies convergence in all norms. The choice of norm does not affect which sequences converge; it changes only the constants in the error bounds, and those constants can grow with $n$.
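The explicit bounds can be spot-checked on random vectors, and two special vectors show that the constants cannot be improved. This is a sketch only: the helper names, the seed, and the sample size are arbitrary assumptions.

```python
import math
import random

def norms(v):
    """Return (L1, L2, L-infinity) norms of v."""
    l1 = sum(abs(x) for x in v)
    l2 = math.sqrt(sum(x * x for x in v))
    linf = max(abs(x) for x in v)
    return l1, l2, linf

def bounds_hold(v, tol=1e-12):
    """Check Linf <= L2 <= sqrt(n) Linf and L2 <= L1 <= sqrt(n) L2."""
    root_n = math.sqrt(len(v))
    l1, l2, linf = norms(v)
    return (linf <= l2 + tol and l2 <= root_n * linf + tol
            and l2 <= l1 + tol and l1 <= root_n * l2 + tol)

random.seed(0)  # arbitrary seed, only for reproducibility
samples = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(1000)]
print(all(bounds_hold(v) for v in samples))

# Extremal cases for n = 4: a coordinate vector and the all-ones vector.
print(norms([1.0, 0.0, 0.0, 0.0]))  # all three norms coincide
print(norms([1.0, 1.0, 1.0, 1.0]))  # L2 = sqrt(n) * Linf and L1 = sqrt(n) * L2
```

A coordinate vector attains the lower bounds with equality, and the all-ones vector attains the upper bounds, so both ends of each inequality are sharp.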

05 · Matrix Norms

Norms extend to matrices. The most useful is the induced (operator) norm: $\|A\|_p = \max_{\|\mathbf{x}\|_p=1}\|A\mathbf{x}\|_p$ — the maximum stretch factor.

Definition — Induced Matrix Norms

For $A \in \mathbb{R}^{m\times n}$ :

Induced L1: $\|A\|_1 = \max_j \sum_i |a_{ij}|$ — maximum absolute column sum.

Induced L2 (spectral norm): $\|A\|_2 = \sigma_{\max}(A)$ — largest singular value.

Induced L-infinity: $\|A\|_\infty = \max_i \sum_j |a_{ij}|$ — maximum absolute row sum.

Frobenius norm (not induced): $\|A\|_F = \sqrt{\sum_{i,j}a_{ij}^2}$ — Euclidean norm on the entries.

✓ Example — Matrix Norm Computation

$A = \begin{pmatrix}2&-1\\3&4\end{pmatrix}$ .

$\|A\|_1$ : column sums $= |2|+|3|=5$ and $|-1|+|4|=5$ . Max $= 5$ .

$\|A\|_\infty$ : row sums $= |2|+|-1|=3$ and $|3|+|4|=7$ . Max $= 7$ .

$\|A\|_F = \sqrt{4+1+9+16} = \sqrt{30} \approx 5.48$ .
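The three computations above translate directly into code. The sketch below uses plain nested lists; the function names are illustrative assumptions. It also checks the "maximum stretch" reading of the induced L1 norm on one vector.

```python
import math

def induced_l1(A):
    """Maximum absolute column sum."""
    return max(sum(abs(row[j]) for row in A) for j in range(len(A[0])))

def induced_linf(A):
    """Maximum absolute row sum."""
    return max(sum(abs(a) for a in row) for row in A)

def frobenius(A):
    """Euclidean norm of the entries."""
    return math.sqrt(sum(a * a for row in A for a in row))

def matvec(A, x):
    """Matrix-vector product A x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

A = [[2.0, -1.0],
     [3.0, 4.0]]

print(induced_l1(A), induced_linf(A), frobenius(A))

# The induced L1 norm is a stretch factor: ||Ax||_1 <= ||A||_1 ||x||_1,
# with equality at the coordinate vector for a column of largest sum.
x = [1.0, 0.0]
stretched = sum(abs(t) for t in matvec(A, x))
print(stretched, induced_l1(A) * sum(abs(t) for t in x))
```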

06 · Quant Application — Regularisation: LASSO and Ridge

OLS minimises $\|X\boldsymbol{\beta}-\mathbf{y}\|_2^2$ with no constraint on $\boldsymbol{\beta}$. When predictors are strongly correlated, or when there are fewer observations than predictors, OLS is unstable. Adding a norm penalty stabilises the solution:

Ridge regression (L2 penalty):

$$\hat{\boldsymbol{\beta}}_{\text{ridge}} = \arg\min_{\boldsymbol{\beta}} \|X\boldsymbol{\beta}-\mathbf{y}\|_2^2 + \lambda\|\boldsymbol{\beta}\|_2^2$$

Closed form: $\hat{\boldsymbol{\beta}}_{\text{ridge}} = (X^TX + \lambda I)^{-1}X^T\mathbf{y}$. For $\lambda>0$, adding $\lambda I$ ensures invertibility; this is the fix for singular $X^TX$ from Chapter 12. Ridge shrinks all coefficients toward zero but typically leaves all of them nonzero.
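The effect of adding $\lambda I$ is easiest to see on a tiny collinear design. The sketch below is restricted to two predictors so the inverse can be written out by hand; the data, the function names, and the value $\lambda=1$ are hypothetical choices for illustration.

```python
def gram_plus_ridge(X, lam):
    """Return entries (a, b, d) of the symmetric 2x2 matrix X^T X + lam * I."""
    a = sum(r[0] * r[0] for r in X) + lam
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X) + lam
    return a, b, d   # matrix is [[a, b], [b, d]]

def ridge_2col(X, y, lam):
    """Ridge coefficients (X^T X + lam I)^{-1} X^T y via the 2x2 inverse."""
    a, b, d = gram_plus_ridge(X, lam)
    det = a * d - b * b
    if det == 0:
        raise ZeroDivisionError("X^T X + lam I is singular")
    g0 = sum(r[0] * yi for r, yi in zip(X, y))   # first entry of X^T y
    g1 = sum(r[1] * yi for r, yi in zip(X, y))   # second entry of X^T y
    return [(d * g0 - b * g1) / det, (a * g1 - b * g0) / det]

# Hypothetical data: the second column is exactly twice the first.
X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
y = [1.0, 2.0, 3.0]

a, b, d = gram_plus_ridge(X, 0.0)
print("det of X^T X:", a * d - b * b)   # zero, so OLS has no unique solution
beta = ridge_2col(X, y, 1.0)
print("ridge coefficients:", beta)
```

With $\lambda=0$ the determinant vanishes, while with $\lambda=1$ the system is solvable and both coefficients come out small but nonzero.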

LASSO (L1 penalty):

$$\hat{\boldsymbol{\beta}}_{\text{LASSO}} = \arg\min_{\boldsymbol{\beta}} \|X\boldsymbol{\beta}-\mathbf{y}\|_2^2 + \lambda\|\boldsymbol{\beta}\|_1$$

In general there is no closed form, so the solution is computed by convex optimisation. For sufficiently large $\lambda$, LASSO sets many $\hat{\beta}_j$ exactly to zero, producing a sparse model. In factor investing: LASSO automatically selects a small number of relevant factors from a large candidate pool.

The difference: ridge penalises large coefficients; LASSO penalises non-sparsity. The L1 unit ball's corners — aligned with coordinate axes — make exact zero solutions geometrically likely at the optimum.
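One standard way to solve the LASSO problem numerically is proximal gradient descent (often called ISTA), sketched below. The text does not prescribe a solver, so this choice is an assumption, as are the function names, the iteration count, and the toy data. The step size uses the bound $\sigma_{\max}(X)^2 \leq \|X\|_F^2$, which is conservative but avoids computing singular values; the sketch assumes $X$ is not the zero matrix.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_ista(X, y, lam, iterations=500):
    """Minimise ||X b - y||_2^2 + lam * ||b||_1 by proximal gradient steps."""
    n_rows, n_cols = len(X), len(X[0])
    # The gradient of the squared loss is 2 X^T (X b - y); its Lipschitz
    # constant is 2 * sigma_max(X)^2, which 2 * ||X||_F^2 safely over-estimates.
    lipschitz = 2.0 * sum(a * a for row in X for a in row)
    beta = [0.0] * n_cols
    for _ in range(iterations):
        residual = [sum(X[i][j] * beta[j] for j in range(n_cols)) - y[i]
                    for i in range(n_rows)]
        grad = [2.0 * sum(X[i][j] * residual[i] for i in range(n_rows))
                for j in range(n_cols)]
        # Gradient step on the smooth part, then the L1 proximal step.
        beta = [soft_threshold(beta[j] - grad[j] / lipschitz, lam / lipschitz)
                for j in range(n_cols)]
    return beta

# Hypothetical toy problem: with X = I the exact answer is known to be
# soft_threshold(y_j, lam / 2) in each coordinate.
X = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
y = [3.0, 0.1, -2.0, 0.2]
beta = lasso_ista(X, y, lam=1.0)
print(beta)
```

The soft-thresholding step is where exact zeros come from: any coordinate whose update lands within the threshold is set to zero outright rather than merely shrunk, which is the algebraic counterpart of the L1 ball's corners.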

07 · Practice Exercises

EXERCISE 13.1

Compute $\|\mathbf{v}\|_1$, $\|\mathbf{v}\|_2$, and $\|\mathbf{v}\|_\infty$ for $\mathbf{v}=\begin{pmatrix}-6\\2\\-3\\0\end{pmatrix}$. Verify the ordering $\|\mathbf{v}\|_\infty \leq \|\mathbf{v}\|_2 \leq \|\mathbf{v}\|_1$.

Hint: Take the absolute value of each component, then: L1 = sum; L2 = square root of the sum of squares; L-infinity = maximum.

Solution:

Absolute values: $6, 2, 3, 0$.

$\|\mathbf{v}\|_1 = 6+2+3+0=11$.

$\|\mathbf{v}\|_2 = \sqrt{36+4+9+0}=\sqrt{49}=7$.

$\|\mathbf{v}\|_\infty = \max(6,2,3,0)=6$.

Ordering: $6 = \|\mathbf{v}\|_\infty \leq \|\mathbf{v}\|_2 = 7 \leq \|\mathbf{v}\|_1 = 11$ ✓.

Bound check: $\|\mathbf{v}\|_2 \leq \|\mathbf{v}\|_1 = 11$ ✓ and $\|\mathbf{v}\|_2 \leq \sqrt{4}\,\|\mathbf{v}\|_\infty = 2(6)=12$ ✓.

EXERCISE 13.2

Verify that the L1 norm $\|\mathbf{v}\|_1=\sum_i|v_i|$ satisfies all three norm axioms (N1 non-negativity, N2 homogeneity, N3 triangle inequality) for vectors in $\mathbb{R}^n$.

Hint: Verify axioms N1, N2, N3 for the L1 norm directly. For N3, use $|u_i+v_i|\leq|u_i|+|v_i|$ component-wise.

Solution:

N1 (non-negativity): $\|\mathbf{v}\|_1=\sum|v_i|\geq0$ since each $|v_i|\geq0$. Equals zero $\iff$ all $|v_i|=0$ $\iff$ all $v_i=0$ $\iff$ $\mathbf{v}=\mathbf{0}$ ✓.

N2 (homogeneity): $\|c\mathbf{v}\|_1=\sum|cv_i|=\sum|c||v_i|=|c|\sum|v_i|=|c|\|\mathbf{v}\|_1$ ✓ (since $|c|$ is a common factor in every term).

N3 (triangle inequality): $\|\mathbf{u}+\mathbf{v}\|_1=\sum_i|u_i+v_i|$. For each $i$: $|u_i+v_i|\leq|u_i|+|v_i|$ (standard absolute value inequality). Summing over $i$: $\sum_i|u_i+v_i|\leq\sum_i|u_i|+\sum_i|v_i|=\|\mathbf{u}\|_1+\|\mathbf{v}\|_1$ ✓.

All three axioms hold. $\|\cdot\|_1$ is a norm.

EXERCISE 13.3

Compute $\|A\|_1$ (max absolute column sum), $\|A\|_\infty$ (max absolute row sum), and $\|A\|_F$ (Frobenius norm) for $A=\begin{pmatrix}1&-2&3\\4&0&-1\end{pmatrix}$.

Hint: For the induced L1 matrix norm: compute the absolute column sums and take the maximum. For L-infinity: compute the absolute row sums and take the maximum.

Solution:

$\|A\|_1$ (max absolute column sum):

Column 1: $|1|+|4|=5$. Column 2: $|-2|+|0|=2$. Column 3: $|3|+|-1|=4$.

$\|A\|_1=\max(5,2,4)=5$.

$\|A\|_\infty$ (max absolute row sum):

Row 1: $|1|+|-2|+|3|=6$. Row 2: $|4|+|0|+|-1|=5$.

$\|A\|_\infty=\max(6,5)=6$.

$\|A\|_F$ (Frobenius):

$\|A\|_F=\sqrt{1+4+9+16+0+1}=\sqrt{31}\approx5.57$.

EXERCISE 13.4

Prove the norm ordering $\|\mathbf{v}\|_\infty \leq \|\mathbf{v}\|_2 \leq \|\mathbf{v}\|_1$ for $\mathbf{v}\in\mathbb{R}^n$. Also state (with brief justification) the reverse bounds $\|\mathbf{v}\|_2\leq\sqrt{n}\|\mathbf{v}\|_\infty$ and $\|\mathbf{v}\|_1\leq\sqrt{n}\|\mathbf{v}\|_2$.

Hint: Show $\|\mathbf{v}\|_2 \leq \|\mathbf{v}\|_1$ by squaring both sides: you need $\sum v_i^2 \leq (\sum|v_i|)^2$. The right side expands to include all cross terms $|v_i||v_j|\geq0$.

Solution:

Proof that $\|\mathbf{v}\|_2 \leq \|\mathbf{v}\|_1$:

$\|\mathbf{v}\|_1^2 = \left(\sum_i|v_i|\right)^2 = \sum_i v_i^2 + 2\sum_{i < j}|v_i||v_j| \geq \sum_i v_i^2 = \|\mathbf{v}\|_2^2$

since $|v_i||v_j|\geq0$. Taking square roots (both sides non-negative): $\|\mathbf{v}\|_1 \geq \|\mathbf{v}\|_2$.

Proof that $\|\mathbf{v}\|_\infty \leq \|\mathbf{v}\|_2$:

$\|\mathbf{v}\|_\infty = \max_i|v_i|$. Since $\max_i|v_i|^2 \leq \sum_i v_i^2$ (the largest square is one of the terms in the sum), taking square roots gives $\|\mathbf{v}\|_\infty \leq \|\mathbf{v}\|_2$.

Reverse bounds: $\|\mathbf{v}\|_2 \leq \sqrt{n}\|\mathbf{v}\|_\infty$ because $\sum_i v_i^2 \leq n\max_i v_i^2$ (each of the $n$ terms is at most the largest). Similarly $\|\mathbf{v}\|_1 \leq \sqrt{n}\|\mathbf{v}\|_2$ by Cauchy-Schwarz: $\sum_i|v_i|\cdot1 \leq \left(\sum_i v_i^2\right)^{1/2}\left(\sum_i 1^2\right)^{1/2} = \sqrt{n}\,\|\mathbf{v}\|_2$.

EXERCISE 13.5

Prove that the ridge-regression matrix $A^TA+\lambda I$ is always invertible for $\lambda>0$, even when $A^TA$ is singular. Use the quadratic form definition of positive definiteness.

Hint: Ridge adds $\lambda I$ to $A^TA$. Show that $A^TA+\lambda I$ is always positive definite for $\lambda>0$ using the definition $\mathbf{x}^T(A^TA+\lambda I)\mathbf{x}>0$ for all $\mathbf{x}\neq\mathbf{0}$.

Solution:

For any $\mathbf{x}\neq\mathbf{0}$ and $\lambda>0$:

$\mathbf{x}^T(A^TA+\lambda I)\mathbf{x} = \mathbf{x}^TA^TA\mathbf{x}+\lambda\mathbf{x}^T\mathbf{x} = \|A\mathbf{x}\|_2^2 + \lambda\|\mathbf{x}\|_2^2$.

$\|A\mathbf{x}\|_2^2\geq0$ always. $\lambda\|\mathbf{x}\|_2^2>0$ since $\lambda>0$ and $\mathbf{x}\neq\mathbf{0}$.

Therefore $\mathbf{x}^T(A^TA+\lambda I)\mathbf{x}>0$ for all $\mathbf{x}\neq\mathbf{0}$, so the matrix is positive definite. It is therefore invertible: a singular matrix would have some $\mathbf{x}\neq\mathbf{0}$ with $(A^TA+\lambda I)\mathbf{x}=\mathbf{0}$, which would make the quadratic form zero.

This holds regardless of whether $A$ has dependent columns: even if $A^TA$ is singular, $A^TA+\lambda I$ is always invertible for any $\lambda>0$. Ridge regression always has a unique solution.

Eigenvalue interpretation: the eigenvalues of $A^TA+\lambda I$ are $\sigma_i^2+\lambda$, where $\sigma_i^2\geq0$ are the eigenvalues of $A^TA$ (the squared singular values of $A$, together with extra zeros when $A$ has more columns than rows). Since $\sigma_i^2\geq0$ and $\lambda>0$, all eigenvalues are positive, confirming positive definiteness.
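The quadratic-form argument in this exercise can be tried on a concrete rank-deficient matrix. In this sketch the matrix `A`, the vector `x`, and the value $\lambda=0.5$ are hypothetical choices, picked so that $A\mathbf{x}=\mathbf{0}$.

```python
def quadratic_form(A, x, lam):
    """Evaluate x^T (A^T A + lam I) x as ||A x||_2^2 + lam * ||x||_2^2."""
    Ax = [sum(a * xi for a, xi in zip(row, x)) for row in A]
    return sum(t * t for t in Ax) + lam * sum(xi * xi for xi in x)

# Hypothetical rank-deficient matrix: column 2 is twice column 1,
# so x = (2, -1) satisfies A x = 0.
A = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
x = [2.0, -1.0]

print(quadratic_form(A, x, 0.0))   # zero: A^T A alone is not positive definite
print(quadratic_form(A, x, 0.5))   # lam * ||x||^2 = 0.5 * 5, strictly positive
```

The vector that makes $A^TA$ fail positive definiteness is exactly the one rescued by the $\lambda\|\mathbf{x}\|_2^2$ term.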

EXERCISE 13.6

A four-factor return model has two candidate coefficient vectors: $\boldsymbol{\beta}^{(1)}=(0.8, 0, 0.3, 0)^T$ (sparse) and $\boldsymbol{\beta}^{(2)}=(0.5, 0.3, 0.2, 0.1)^T$ (dense), with equal residual sum of squares (RSS = 2.0). With $\lambda=0.5$, compute both the LASSO objective ($\text{RSS}+\lambda\|\boldsymbol{\beta}\|_1$) and ridge objective ($\text{RSS}+\lambda\|\boldsymbol{\beta}\|_2^2$) for each candidate. Which does each penalty favour, and why?

Hint: Compute $\|\boldsymbol{\beta}\|_1$ and $\|\boldsymbol{\beta}\|_2^2$ for each candidate. The LASSO objective is $\text{RSS}+\lambda\|\boldsymbol{\beta}\|_1$ and the ridge objective is $\text{RSS}+\lambda\|\boldsymbol{\beta}\|_2^2$. Compare total objectives for each candidate.

Solution:

Setup: $\lambda = 0.5$, and both candidates have the same RSS (sum of squared residuals) $= 2.0$, so only the penalty terms differ.

Candidate 1: $\|\boldsymbol{\beta}^{(1)}\|_1=0.8+0+0.3+0=1.1$; $\|\boldsymbol{\beta}^{(1)}\|_2^2=0.64+0+0.09+0=0.73$.

LASSO objective: $2.0+0.5(1.1)=2.55$. Ridge objective: $2.0+0.5(0.73)=2.365$.

Candidate 2: $\|\boldsymbol{\beta}^{(2)}\|_1=0.5+0.3+0.2+0.1=1.1$; $\|\boldsymbol{\beta}^{(2)}\|_2^2=0.25+0.09+0.04+0.01=0.39$.

LASSO objective: $2.0+0.5(1.1)=2.55$. Ridge objective: $2.0+0.5(0.39)=2.195$.

For ridge: candidate 2 is better ($2.195 < 2.365$). Its squared L2 norm is smaller even though its L1 norm is the same, because ridge prefers spreading weight across many small coefficients. For LASSO: both have equal objectives (same L1 norm), so the penalty alone does not separate them here; in practice the geometry of the L1 ball drives LASSO toward sparse solutions like candidate 1.
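The arithmetic in this exercise can be reproduced with a short script. The function names are illustrative; the numbers are the ones given in the problem statement.

```python
def lasso_objective(rss, beta, lam):
    """RSS plus lam times the L1 norm of beta."""
    return rss + lam * sum(abs(b) for b in beta)

def ridge_objective(rss, beta, lam):
    """RSS plus lam times the squared L2 norm of beta."""
    return rss + lam * sum(b * b for b in beta)

rss, lam = 2.0, 0.5
candidates = {
    "sparse": [0.8, 0.0, 0.3, 0.0],
    "dense": [0.5, 0.3, 0.2, 0.1],
}

for name, beta in candidates.items():
    print(name,
          round(lasso_objective(rss, beta, lam), 4),
          round(ridge_objective(rss, beta, lam), 4))
```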

08 · Chapter Summary

Concept	Key Formula
Norm axioms	Non-negativity, homogeneity, triangle inequality
L1 norm	$\|\mathbf{v}\|_1=\sum\lvert v_i\rvert$ (sum of absolute values)
L2 norm	$\|\mathbf{v}\|_2=\sqrt{\sum v_i^2}$ (Euclidean length)
L-infinity norm	$\|\mathbf{v}\|_\infty=\max_i\lvert v_i\rvert$ (largest absolute component)
Norm ordering	$\|\mathbf{v}\|_\infty\leq\|\mathbf{v}\|_2\leq\|\mathbf{v}\|_1\leq\sqrt{n}\|\mathbf{v}\|_2\leq n\|\mathbf{v}\|_\infty$
Unit ball shapes	L1: diamond; L2: circle; L-infinity: square
Norm equivalence	All norms on $\mathbb{R}^n$ equivalent up to constants
Induced matrix norm	$\|A\|_p=\max_{\|\mathbf{x}\|_p=1}\|A\mathbf{x}\|_p$
Ridge regression	Add $\lambda\|\boldsymbol{\beta}\|_2^2$; always invertible; keeps all predictors
LASSO regression	Add $\lambda\|\boldsymbol{\beta}\|_1$; sparse solutions; L1 ball corners
L0 pseudo-norm	Counts nonzeros; not a true norm (fails homogeneity)

Next: Chapter 14 — Positive Definite Matrices establishes four equivalent characterisations of positive definiteness and connects them to eigenvalues, Cholesky decomposition, and the requirement that all valid covariance matrices must be positive semi-definite.