In approximating a function $f$ with a polynomial $p \in \Pi_{n}$, interpolation forces the polynomial to go through a number of data points.
An alternative is to find a polynomial that *minimizes the "distance" between it and $f$, without fixing the points that it passes through*. The course only focuses on minimizing a weighted squared error of the form $\int _{I}w \cdot|f-p_{n}|^{2} \, dx$ over some interval $I$. Nevertheless, other well-studied notions of "distance" exist, e.g. the **maximum norm** $\max_{I} |f-p_{n}|$.
### Inner Product Spaces
A real vector space $V$ equipped with a positive-definite, symmetric bilinear form $\left< *,* \right>$ is called a **(real) inner product space**.
* Positive-definiteness: $\left< v,v \right> \ge 0$, and equality only when $v=\mathbf{0}$.
* Symmetry: $\left< v,w \right> = \left< w,v \right>$
* Bilinearity: $\left< u+v,w \right> = \left< u,w \right> + \left< v,w \right>$, $\left< \lambda u,v \right> = \lambda\left< u,v \right>$ (linearity in the second argument then follows from symmetry).
Vectors $u,v$ are **orthogonal** if $\left< u,v \right> = 0$.
* The inner product $\left< *,* \right>$ defines the **induced norm** $\| v \| \equiv \sqrt{ \left< v,v \right> }$.
* The **Cauchy-Schwarz inequality** states that for any $u,v$ in an inner product space $(V,\left< *,* \right>)$ with induced norm $\| * \|$, $\| u \|\| v \| \ge |\left< u,v \right>|.$
> [!proof]-
> Consider the quadratic map $\lambda \mapsto \| u+\lambda v \|^{2} = \| u \|^{2}+2\lambda\left< u,v \right>+\lambda^{2}\| v \|^{2}$. It is non-negative for every $\lambda$, so it has a repeated real root or no real roots at all, and its discriminant is non-positive: $\Delta=4\left< u,v \right>^{2}-4\| u \|^{2}\| v \|^{2} \le 0,$ which gives the result.
### The $L^{2}$ Inner Product Space
For least squares approximation, the natural choice of distance is the $L^{2}$ norm: $\| v \|_{2}\equiv \sqrt{ \int _{I} v^{2} \, dx }$
More generally, define the inner product $\left< f,g \right> = \int _{I} w(x)f(x)g(x) \, dx ,$where $w(x)$ is the **weight function**.
* Most common weights include $w \equiv 1$, $w(x)=(1-x^{2})^{-1 / 2}$, and $w(x)=e^{-x}$.
The inner product induces the **2-norm** (with weight $w$): $\| f \|_{2}=\sqrt{ \int _{I}w(x)[f(x)]^{2} \, dx },$ and is defined over the space $L^{2}_{w} \equiv \{ f: I \to \mathbb{R}\,|\, \| f \|_{2} < \infty \},$ i.e. the functions whose norm $\| \cdot\|_{2}$ is finite.
The least squares problem is then: given $f \in L^{2}_{w}$, find the $p_{n} \in \Pi_{n}$ that minimizes the norm of the error: $p_{n} = \underset{{q \in \Pi_{n}}}{\arg\min}~\| f-q \|_{2}.$
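To make the objective concrete, here is a minimal numerical sketch (assuming NumPy/SciPy, $w \equiv 1$ on $I=[-1,1]$, $f=e^{x}$, and a hand-picked candidate $q \in \Pi_{2}$, all illustrative choices) that evaluates the weighted inner product and the error norm $\| f-q \|_{2}$ being minimized:
```python
import numpy as np
from scipy.integrate import quad

# Weighted L2 inner product <f, g> = ∫_I w(x) f(x) g(x) dx.
def inner(f, g, w=lambda x: 1.0, I=(-1.0, 1.0)):
    value, _ = quad(lambda x: w(x) * f(x) * g(x), *I)
    return value

def norm2(f, **kwargs):
    return np.sqrt(inner(f, f, **kwargs))

f = np.exp                              # illustrative target function
q = lambda x: 1.0 + x + 0.5 * x**2      # an arbitrary candidate in Π_2 (Taylor truncation)

# This is the objective ||f - q||_2 that least squares minimizes over q ∈ Π_n.
print(norm2(lambda x: f(x) - q(x)))
```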
### Least Squares as a Matrix Equation
> [!bigidea] The least squares problem can be reduced to a matrix equation: the normal equations.
The naive choice of basis of $\Pi_{n}$ is the **canonical basis** $\mathbf{x}=(1,x,x^{2},\dots,x^{n})^{T}$. With this basis, $p_{n} \in \Pi_{n}$ can be expressed as $p_{n}(x)=\sum^{n}_{k=0}\alpha_{k}x^{k}=\mathbf{a}^{T}\mathbf{x},$ where $\mathbf{a}=(\alpha_{0},\dots,\alpha_{n})^{T}$ is the vector of its coefficients in this basis.
> [!exposition] Squared error as a matrix
> The squared error is $\begin{align*}
> \| f-p_{n} \|_{2}^{2} &= \left< f-\mathbf{a}^{T}\mathbf{x} , f-\mathbf{a}^{T}\mathbf{x} \right>\\
> &= \| f \|_{2}^{2} - 2\left< f, \mathbf{a}^{T}\mathbf{x} \right> + \| \mathbf{a}^{T}\mathbf{x} \|_{2}^{2}
> \end{align*}$ which is minimized by taking the partial derivatives with respect to $\mathbf{a}$ and setting them to $\mathbf{0}$, giving $\begin{bmatrix}
> \left< 1,1 \right> & \left< 1,x \right> & \dots & \left< 1,x^{n} \right> \\
> \left< x,1 \right> & \left< x,x \right> & \dots & \left< x,x^{n} \right> \\
> \vdots & \vdots & & \vdots \\
> \left< x^{n},1 \right> & \left< x^{n},x \right> & \dots & \left< x^{n},x^{n} \right>
> \end{bmatrix}\mathbf{a}=\begin{bmatrix}
> \left< f, 1 \right> \\ \left< f,x \right> \\ \vdots \\ \left< f, x^{n} \right>
> \end{bmatrix}.$
>
> If we work with a general basis $\{ \phi_{0},\dots,\phi_{n} \}$ of $\Pi_{n}$, the same calculations yield the **normal equations** $A\mathbf{a}=\varphi$: $\underbrace{\begin{bmatrix}
> \left< \phi_{0},\phi_{0} \right> & \left< \phi_{0},\phi_{1} \right> & \dots & \left< \phi_{0},\phi_{n} \right> \\
> \left< \phi_{1},\phi_{0} \right> & \left< \phi_{1},\phi_{1} \right> & \dots & \left< \phi_{1},\phi_{n} \right> \\
> \vdots & \vdots & & \vdots \\
> \left< \phi_{n},\phi_{0} \right> & \left< \phi_{n}, \phi_{1} \right> & \dots & \left< \phi_{n},\phi_{n} \right>
> \end{bmatrix}}_{A}\mathbf{a}=\underbrace{\begin{bmatrix}
> \left< f, \phi_{0} \right> \\ \left< f,\phi_{1} \right> \\ \vdots \\ \left< f, \phi_{n} \right>
> \end{bmatrix}}_{\varphi}$
For any basis $\{ \phi_{0},\dots,\phi_{n} \}$, the matrix $A$ in the normal equations is non-singular, hence invertible. Therefore the normal equations always have a unique solution $\mathbf{a}$, and the least-squares approximation $p_{n}$ always exists.
> [!proof]-
> Define column-vector $\Phi=(\phi_{0},\dots,\phi_{n})^{T}$. Then abuse notation to define $\left< f,\Phi \right> \equiv \Big(\left<f,\phi_{0} \right>, \dots,\left< f, \phi_{n} \right> \Big)^{T}$.
> With this abuse, the matrix is $A=\Big(
> \left< \phi_{0},\Phi \right> , \left< \phi_{1},\Phi \right> , \dots , \left< \phi_{n}, \Phi \right>
> \Big)$ so if it is singular, there exists $\mathbf{b}=(\beta_{0},\dots,\beta_{n})^{T} \ne \mathbf{0}$ with $A\mathbf{b}=\mathbf{0}$, giving $\begin{align*}
> \mathbf{b}^{T}A\mathbf{b}&= \mathbf{b}^{T}\sum_{k=0}^{n}\beta_{k}\left< \phi_{k},\Phi \right> \\
> &= \mathbf{b}^{T}\left< \sum_{k}\beta_{k} \phi_{k} , \Phi\right>\\
> &= \left< \sum_{k}\beta_{k} \phi_{k} , \sum_{k}\beta_{k} \phi_{k}\right> = 0
> \end{align*}$ Hence positive-definiteness of the inner product forces $\sum_{k} \beta_{k} \phi_{k}=0$, contradicting the linear independence of $\{ \phi_{i} \}$.
Inverting $A$ recovers the coefficients $\mathbf{a}$ of the least-squares approximant $p_{n}$, but doing so directly is a bad idea because:
* Computing and inverting the matrix is expensive.
* Solving linear systems with full (as opposed to upper triangular or diagonal) matrices is unstable.
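A minimal sketch of solving the normal equations directly in the canonical basis (assuming NumPy/SciPy, $w \equiv 1$, $I=[0,1]$, and $f=e^{x}$ as illustrative choices). With these choices $\left< x^{i},x^{j} \right> = \int_{0}^{1}x^{i+j}\,dx=1/(i+j+1)$, so $A$ is the notoriously ill-conditioned Hilbert matrix, which illustrates the instability above:
```python
import numpy as np
from scipy.integrate import quad

n = 6                      # polynomial degree (illustrative)
I = (0.0, 1.0)             # with w ≡ 1 on [0,1], <x^i, x^j> = 1/(i+j+1): the Hilbert matrix
f = np.exp                 # illustrative target function

# Gram matrix A_ij = <x^i, x^j> and right-hand side φ_i = <f, x^i>
A = np.array([[1.0 / (i + j + 1) for j in range(n + 1)] for i in range(n + 1)])
phi = np.array([quad(lambda x, i=i: f(x) * x**i, *I)[0] for i in range(n + 1)])

a = np.linalg.solve(A, phi)            # coefficients of p_n in the canonical basis
print("cond(A) =", np.linalg.cond(A))  # grows explosively with n -> the instability above
print("coefficients:", a)
```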
### Least Squares in Orthogonal Bases
> [!bigidea] With an orthogonal basis, the normal equations (and hence the least squares problem) become trivial to solve.
The issues mentioned above disappear if the basis $\{ \phi_{i} \}$ makes the matrix $A$ diagonal; namely, if it is an **orthogonal basis**: $\left< \phi_{i}, \phi_{j} \right> \,\, \begin{cases}
=0 & \text{if }i \ne j \\
\ne 0 & \text{if } i = j
\end{cases}$ Even better, $A=I_{n+1}$ (the identity) if it is an **orthonormal basis**: $\left< \phi_{i}, \phi_{j} \right> = \delta_{ij} = \begin{cases}
0 & \text{if }i \ne j \\
1 & \text{if } i = j
\end{cases}$
Then *under an orthonormal basis $\{ \phi_{0},\dots, \phi_{n} \}$, $A=I_{n+1}$ in the normal equations*, which immediately reveals the coefficients: $(\alpha_{0},\dots,\alpha_{n})=(\left< f,\phi_{0} \right> ,\dots, \left< f,\phi_{n} \right> ),$ and the least squares approximant is $p_{n}=\sum^{n}_{k=0}\alpha_{k}\phi_{k}=\sum^{n}_{k=0}\left< f,\phi_{k} \right> \phi_{k}.$
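A minimal sketch of this shortcut (assuming NumPy/SciPy, $w \equiv 1$ on $I=[-1,1]$, $f=e^{x}$, and degree $4$, all illustrative), using the normalized Legendre polynomials $\phi_{k}=\sqrt{ (2k+1)/2 }\,P_{k}$ as the orthonormal basis, so each coefficient is just an inner product and no linear system needs solving:
```python
import numpy as np
from numpy.polynomial.legendre import Legendre
from scipy.integrate import quad

I = (-1.0, 1.0)
f = np.exp                      # illustrative target function
n = 4                           # degree (illustrative)

def inner(g, h):                # <g, h> with w ≡ 1 on [-1, 1]
    return quad(lambda x: g(x) * h(x), *I)[0]

# Orthonormal basis: φ_k = sqrt((2k+1)/2) · P_k, since ∫ P_k^2 dx = 2/(2k+1)
phis = [lambda x, k=k: np.sqrt((2 * k + 1) / 2) * Legendre.basis(k)(x) for k in range(n + 1)]

alphas = [inner(f, phi) for phi in phis]        # α_k = <f, φ_k>
p = lambda x: sum(a * phi(x) for a, phi in zip(alphas, phis))

print("coefficients:", np.round(alphas, 6))
print("||f - p||_2 =", np.sqrt(inner(lambda x: f(x) - p(x), lambda x: f(x) - p(x))))
```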
The **Gram-Schmidt process** with normalization can produce such an orthonormal basis from a general one (a numerical sketch follows the bullet below): $\{ 1,x,x^{2},\dots,x^{n} \}\xmapsto{G-S}\underset{\text{ orthogonal}}{\{ \dots\}} \xmapsto{\text{normalize}} \underset{\text{ orthonormal}}{\{ \phi_{0},\dots,\phi_{n}\}}$
* Although starting with $\{ 1,x,\dots,x^{n} \}$ seems arbitrary, the outcome is essentially independent of this choice. This is because polynomial bases $\{ \phi_{i} \}$ are assumed to satisfy $\phi_{i} \in \Pi_{i}$, and in particular $\phi_{0}$ is constant; the orthonormal basis $\{ \phi_{0},\dots, \phi_{n} \}$ satisfying this is then determined up to the sign of each $\phi_{i}$.
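A minimal sketch of weighted Gram-Schmidt on the monomials (assuming NumPy/SciPy; the weight $w\equiv 1$ on $[-1,1]$ and the degree cut-off are illustrative choices), checking that the result matches the normalized Legendre polynomials up to sign:
```python
import numpy as np
from numpy.polynomial.legendre import Legendre
from scipy.integrate import quad

I = (-1.0, 1.0)
w = lambda x: 1.0               # weight (illustrative; swap for other weights)

def inner(g, h):                # weighted inner product <g, h>
    return quad(lambda x: w(x) * g(x) * h(x), *I)[0]

def gram_schmidt(funcs):
    """Orthonormalize a list of functions w.r.t. the weighted inner product."""
    ortho = []
    for f in funcs:
        # projection coefficients onto the already-built orthonormal functions
        coeffs = [inner(f, p) for p in ortho]
        g = lambda x, f=f, cs=coeffs, os=list(ortho): f(x) - sum(c * p(x) for c, p in zip(cs, os))
        nrm = np.sqrt(inner(g, g))
        ortho.append(lambda x, g=g, nrm=nrm: g(x) / nrm)
    return ortho

monomials = [lambda x, k=k: x**k for k in range(4)]
phis = gram_schmidt(monomials)

# With w ≡ 1 on [-1,1] this matches the normalized Legendre polynomials (up to sign)
print(phis[2](0.5), np.sqrt(5 / 2) * Legendre.basis(2)(0.5))
```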
Naturally, since orthogonality is defined through the inner product, different weight functions give rise to different orthogonal bases (a numerical check follows the list below):
* $w\equiv 1$ on $I=[-1,1]$ gives the **Legendre polynomials**.
* $w(x)=(1-x^{2})^{-1 / 2}$ on $I=[-1,1]$ gives the **Chebyshev polynomials**.
* $w(x)=e^{-x}$ on $I=[0,\infty)$ gives the **Laguerre polynomials**.
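For instance, a quick numerical check of the Chebyshev case (assuming NumPy; the degree cut-off and the 64-point Gauss-Chebyshev rule, which builds the weight $(1-x^{2})^{-1/2}$ into its quadrature weights, are illustrative choices):
```python
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev, chebgauss

# Gauss-Chebyshev quadrature: nodes/weights for ∫_{-1}^{1} g(x) (1 - x^2)^{-1/2} dx
x, wq = chebgauss(64)

def inner_cheb(g, h):
    return np.sum(wq * g(x) * h(x))

T = [Chebyshev.basis(k) for k in range(5)]
gram = np.array([[inner_cheb(T[i], T[j]) for j in range(5)] for i in range(5)])
print(np.round(gram, 10))   # diagonal: [π, π/2, π/2, π/2, π/2]; off-diagonal ≈ 0
```
The Gram matrix comes out diagonal, confirming orthogonality of the Chebyshev polynomials under this weight.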
### Least Squares Approximation as Projection
The least squares error is perpendicular to $\Pi_{n}$: $p_{n}$ is the least squares approximant if and only if $\forall q \in \Pi_{n},\left< f-p_{n},q \right> = 0$.
![[OrthogonalApproximation.png#invert|center|w60]]
> [!proof]-
> ($\Rightarrow$) Sheet 3: consider $\| f-(p_{n}+\epsilon q) \|_{2}^{2}$ for any $q \in \Pi_{n}$ and small $\epsilon$.
> ($\Leftarrow$) For any $q \in \Pi_{n}$, $\begin{align*}
> \| f-q \|_{2}^{2} &= \| (f-p_{n})-(q-p_{n}) \|_{2}^{2} \\
> &= \| f-p_{n} \| _{2}^{2} -2\cancel{\left< f-p_{n}, \underset{\in \Pi_{n}}{q-p_{n}} \right>} +\| q-p_{n} \| _{2}^{2}\\
> &\ge \| f-p_{n} \| _{2}^{2}
> \end{align*}$
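A quick numerical check of this characterization (assuming NumPy/SciPy, $w \equiv 1$ on $[-1,1]$, $f=e^{x}$, and degree $2$, all illustrative choices): build $p_{2}$ from the orthonormal Legendre basis as in the earlier sketch, then verify that the error is orthogonal to $1, x, x^{2}$ and that other candidates do worse:
```python
import numpy as np
from numpy.polynomial.legendre import Legendre
from scipy.integrate import quad

I = (-1.0, 1.0)
f = np.exp                                        # illustrative target

def inner(g, h):                                  # <g, h> with w ≡ 1
    return quad(lambda x: g(x) * h(x), *I)[0]

# Least squares p_2 built from the orthonormal Legendre basis
phis = [lambda x, k=k: np.sqrt((2 * k + 1) / 2) * Legendre.basis(k)(x) for k in range(3)]
alphas = [inner(f, phi) for phi in phis]
p = lambda x: sum(a * phi(x) for a, phi in zip(alphas, phis))

# The error f - p_2 is orthogonal to 1, x, x^2, i.e. to all of Π_2 ...
residual = lambda x: f(x) - p(x)
print([round(inner(residual, lambda x, k=k: x**k), 8) for k in range(3)])   # ≈ [0, 0, 0]

# ... and any other q ∈ Π_2 has a strictly larger error, e.g. a small perturbation of p_2:
q = lambda x: p(x) + 0.05 * x
err = lambda g: np.sqrt(inner(lambda x: f(x) - g(x), lambda x: f(x) - g(x)))
print(err(p), err(q))
```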
> [!connection]
> Compare with [[Orthogonal Projection, Confounding, and Missing Variable in OLS|OLS]], which can be interpreted as projecting the response $\mathbf{y}$ onto the column space of predictors $\mathbf{x}$.