In approximating a function $f$ with a polynomial $p \in \Pi_{n}$, interpolation forces the polynomial to pass through a number of data points. An alternative is to find a polynomial that *minimizes the "distance" between it and $f$, without fixing the points it passes through*. The course focuses only on minimizing a squared error of the form $\int _{I}w \cdot|f-p_{n}|^{2} \, dx$ over some interval $I$. Nevertheless, other well-studied notions of "distance" exist, e.g. the **maximum norm** $\max |f-p_{n}|$.

### Inner Product Spaces

A real vector space $V$ equipped with a positive-definite, symmetric bilinear form $\left< *,* \right>$ is called a **(real) inner product space**.
* Positive-definiteness: $\left< v,v \right> \ge 0$, with equality only when $v=\mathbf{0}$.
* Symmetry: $\left< v,w \right> = \left< w,v \right>$.
* Bilinearity: $\left< u+v,w \right> = \left< u,w \right> + \left< v,w \right>$ and $\left< \lambda u,v \right> = \lambda\left< u,v \right>$.

Vectors $u,v$ are **orthogonal** if $\left< u,v \right> = 0$.
* The inner product $\left< *,* \right>$ defines the **induced norm** $\| v \| \equiv \sqrt{ \left< v,v \right> }$.
* The **Cauchy-Schwarz inequality** states that for any $u,v \in (V,\left< *,* \right>)$ with induced norm $\| * \|$, $\| u \|\| v \| \ge \left| \left< u,v \right> \right|$.

> [!proof]-
> Consider the quadratic map $\lambda \mapsto \| u+\lambda v \|^{2} = \| u \|^{2} + 2\lambda\left< u,v \right> + \lambda^{2}\| v \|^{2} \ge 0$. It has a repeated root or no real roots at all, so the discriminant is non-positive: $\Delta=4\left< u,v \right>^{2}-4\| u \|^{2}\| v \|^{2} \le 0,$ which gives the result.

### The $L^{2}$ Inner Product Space

For least squares approximation, the natural choice of distance is the $L^{2}$ norm: $\| v \|_{2}\equiv \sqrt{ \int _{I} v^{2} \, dx }.$
More generally, define the inner product $\left< f,g \right> = \int _{I} w(x)f(x)g(x) \, dx,$ where $w(x)$ is the **weight function**.
* The most common weights include $w \equiv 1$, $w(x)=(1-x^{2})^{-1 / 2}$, and $w(x)=e^{-x}$.

The inner product induces the **2-norm** (with weight $w$): $\| f \|_{2}=\sqrt{ \int _{I}w(x)[f(x)]^{2} \, dx },$ and is defined over the space $L^{2}_{w} \equiv \{ f: I \to \mathbb{R}\,|\, \| f \|_{2} < \infty \},$ i.e. the functions whose norm $\| \cdot \|_{2}$ is finite.

Then the least squares problem is: given $f \in L^{2}_{w}$, find the $p_{n} \in \Pi_{n}$ that minimizes the norm of the error, $p_{n} = \underset{{q \in \Pi_{n}}}{\arg\min}~\| f-q \|_{2}.$
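To make the weighted inner product concrete, the following is a minimal numerical sketch, assuming $I=[-1,1]$ and $w \equiv 1$; the helper names `inner` and `norm2` and the choice $f=e^{x}$ are illustrative, not from the course.

```python
import numpy as np
from scipy.integrate import quad

# Weighted L^2 inner product on I = [-1, 1]. Here w ≡ 1; swapping in, e.g.,
# w = lambda x: 1 / np.sqrt(1 - x**2) would give the Chebyshev weight instead.
def inner(f, g, w=lambda x: 1.0, a=-1.0, b=1.0):
    """<f, g> = ∫_I w(x) f(x) g(x) dx, evaluated by numerical quadrature."""
    value, _ = quad(lambda x: w(x) * f(x) * g(x), a, b)
    return value

def norm2(f, **kwargs):
    """Induced 2-norm ||f||_2 = sqrt(<f, f>)."""
    return np.sqrt(inner(f, f, **kwargs))

f = np.exp
p = lambda x: 1.0 + x + 0.5 * x**2              # a candidate approximant in Π_2

print(norm2(lambda x: f(x) - p(x)))             # the quantity least squares minimizes over Π_n
print(abs(inner(f, p)) <= norm2(f) * norm2(p))  # Cauchy-Schwarz sanity check: True
```

Changing the weight $w$ changes which functions count as orthogonal, and hence which $p_{n}$ is "closest" to $f$.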
### Least Squares as a Matrix Equation

> [!bigidea] The least squares problem can be reduced to a matrix equation called the normal equations.

The naive choice of basis of $\Pi_{n}$ is the **canonical basis** $\mathbf{x}=\{ 1,x,x^{2},\dots,x^{n} \}$. With this basis, $p_{n} \in \Pi_{n}$ can be expressed as $p_{n}(x)=\sum^{n}_{k=0}\alpha_{k}x^{k}=\mathbf{a}^{T}\mathbf{x},$ where $\mathbf{a}$ is the vector of its coefficients in this basis.

> [!exposition] Squared error as a matrix
> The squared error is $\begin{align*} \| f-p_{n} \|_{2}^{2} &= \left< f-\mathbf{a}^{T}\mathbf{x} , f-\mathbf{a}^{T}\mathbf{x} \right>\\ &= \| f \|_{2}^{2} - 2\left< f, \mathbf{a}^{T}\mathbf{x} \right> + \| \mathbf{a}^{T}\mathbf{x} \|_{2}^{2}, \end{align*}$ which is minimized by taking the partial derivative with respect to $\mathbf{a}$ and setting it to $\mathbf{0}$, giving $\begin{bmatrix} \left< 1,1 \right> & \left< 1,x \right> & \dots & \left< 1,x^{n} \right> \\ \left< x,1 \right> & \left< x,x \right> & \dots & \left< x,x^{n} \right> \\ \vdots & \vdots & & \vdots \\ \left< x^{n},1 \right> & \left< x^{n},x \right> & \dots & \left< x^{n},x^{n} \right> \end{bmatrix}\mathbf{a}=\begin{bmatrix} \left< f, 1 \right> \\ \left< f,x \right> \\ \vdots \\ \left< f, x^{n} \right> \end{bmatrix}.$
>
> If we work with a general basis $\{ \phi_{0},\dots,\phi_{n} \}$ of $\Pi_{n}$, the same calculation yields the **normal equations** $A\mathbf{a}=\varphi$: $\underbrace{\begin{bmatrix} \left< \phi_{0},\phi_{0} \right> & \left< \phi_{0},\phi_{1} \right> & \dots & \left< \phi_{0},\phi_{n} \right> \\ \left< \phi_{1},\phi_{0} \right> & \left< \phi_{1},\phi_{1} \right> & \dots & \left< \phi_{1},\phi_{n} \right> \\ \vdots & \vdots & & \vdots \\ \left< \phi_{n},\phi_{0} \right> & \left< \phi_{n}, \phi_{1} \right> & \dots & \left< \phi_{n},\phi_{n} \right> \end{bmatrix}}_{A}\mathbf{a}=\underbrace{\begin{bmatrix} \left< f, \phi_{0} \right> \\ \left< f,\phi_{1} \right> \\ \vdots \\ \left< f, \phi_{n} \right> \end{bmatrix}}_{\varphi}$

For any basis $\{ \phi_{0},\dots,\phi_{n} \}$, the matrix $A$ in the normal equations is non-singular, hence invertible, so the normal equations always have a unique solution $\mathbf{a}$ and the least-squares approximant $p_{n}$ always exists.

> [!proof]-
> Define the column vector $\Phi=(\phi_{0},\dots,\phi_{n})^{T}$, and abuse notation to define $\left< f,\Phi \right> \equiv \Big(\left<f,\phi_{0} \right>, \dots,\left< f, \phi_{n} \right> \Big)^{T}$.
> With this abuse, the matrix is $A=\Big( \left< \phi_{0},\Phi \right> , \left< \phi_{1},\Phi \right> , \dots , \left< \phi_{n}, \Phi \right> \Big),$ so if it were singular, there would exist $\mathbf{b}=(\beta_{0},\dots,\beta_{n})^{T} \ne \mathbf{0}$ with $A\mathbf{b}=\mathbf{0}$, giving $\begin{align*} \mathbf{b}^{T}A\mathbf{b}&= \mathbf{b}^{T}\sum_{k=0}^{n}\beta_{k}\left< \phi_{k},\Phi \right> \\ &= \mathbf{b}^{T}\left< \sum_{k}\beta_{k} \phi_{k} , \Phi\right>\\ &= \left< \sum_{k}\beta_{k} \phi_{k} , \sum_{k}\beta_{k} \phi_{k}\right> = 0. \end{align*}$ Positive-definiteness of the inner product then forces $\sum_{k} \beta_{k} \phi_{k}=0$, contradicting the linear independence of $\{ \phi_{0},\dots,\phi_{n} \}$.

Inverting $A$ finds the coefficients $\mathbf{a}$ of the least-squares approximant $p_{n}$, but it is a bad idea because:
* Computing and then inverting the matrix is expensive.
* Solving linear systems with full (as opposed to upper-triangular or diagonal) matrices is unstable; see the sketch after this list.
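To make the warning concrete, here is a small sketch of the normal equations for the canonical basis, assuming $I=[0,1]$ and $w \equiv 1$, so that $\left< x^{i},x^{j} \right> = \int_{0}^{1}x^{i+j}\,dx = \frac{1}{i+j+1}$ and $A$ is the notoriously ill-conditioned Hilbert matrix; the target $f=e^{x}$ and the quadrature routine are illustrative choices.

```python
import numpy as np
from scipy.integrate import quad

# Normal equations A a = φ on I = [0, 1] with w ≡ 1 and the canonical basis
# {1, x, ..., x^n}: the entries <x^i, x^j> = 1/(i + j + 1) form the Hilbert matrix.
n = 5
f = np.exp                                     # example target function

A = np.array([[1.0 / (i + j + 1) for j in range(n + 1)] for i in range(n + 1)])
varphi = np.array([quad(lambda x, k=k: f(x) * x**k, 0.0, 1.0)[0]
                   for k in range(n + 1)])     # right-hand side entries <f, x^k>

a = np.linalg.solve(A, varphi)                 # coefficients of p_n in the canonical basis
print("coefficients:", a)
print("cond(A):", np.linalg.cond(A))           # grows rapidly with n
```

Even for modest $n$, solving this system amplifies rounding errors roughly by the condition number of $A$, which is the quantitative version of the second bullet above.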
### Least Squares in Orthogonal Bases

> [!bigidea] By using an orthogonal basis, the normal equations (and hence the least squares problem) become trivial.

The issues mentioned above are resolved if the basis $\{ \phi_{i} \}$ makes the matrix $A$ diagonal; namely, if it is an **orthogonal basis**: $\left< \phi_{i}, \phi_{j} \right> \,\, \begin{cases} =0 & \text{if }i \ne j \\ \ne 0 & \text{if } i = j \end{cases}$ Even better, $A$ is the identity matrix $I_{n+1}$ if it is an **orthonormal basis**: $\left< \phi_{i}, \phi_{j} \right> = \delta_{ij} = \begin{cases} 0 & \text{if }i \ne j \\ 1 & \text{if } i = j \end{cases}$

Then, *under an orthonormal basis $\{ \phi_{0},\dots, \phi_{n} \}$, $A=I_{n+1}$ in the normal equations*, which immediately reveals the coefficients: $(\alpha_{0},\dots,\alpha_{n})=(\left< f,\phi_{0} \right> ,\dots, \left< f,\phi_{n} \right> ),$ and the least squares approximant is $p_{n}=\sum^{n}_{k=0}\alpha_{k}\phi_{k}=\sum^{n}_{k=0}\left< f,\phi_{k} \right> \phi_{k}.$

The **Gram-Schmidt process** with normalization can produce such an orthonormal basis from a general one: $\{ 1,x,x^{2},\dots,x^{n} \}\xmapsto{G-S}\underset{\text{ orthogonal}}{\{ \dots\}} \xmapsto{\text{normalize}} \underset{\text{ orthonormal}}{\{ \phi_{0},\dots,\phi_{n}\}}$
* Although starting with $\{ 1,\dots,x^{n} \}$ seems arbitrary, the outcome is independent of the choice of starting basis. This is because $\phi_{i} \in \Pi_{i}$ is assumed for polynomial bases $\{ \phi_{i} \}$, and in particular $\phi_{0} \in \mathbb{R}$; the rest of the basis $\{ \phi_{1},\dots, \phi_{n} \}$ is then determined up to sign.

Naturally, since orthogonality is defined through the inner product, different weight functions give rise to different orthogonal bases:
* $w\equiv 1$ gives the **Legendre polynomials**.
* $w(x)=(1-x^{2})^{-1 / 2}$ gives the **Chebyshev polynomials**.
* $w(x)=e^{-x}$ gives the **Laguerre polynomials**.

### Least Squares Approximation as Projection

The least squares error is perpendicular to $\Pi_{n}$: $p_{n}$ is the least squares approximant if and only if $\forall q \in \Pi_{n},\ \left< f-p_{n},q \right> = 0$.

![[OrthogonalApproximation.png#invert|center|w60]]

> [!proof]-
> ($\Rightarrow$) Sheet 3: consider $\| f-(p_{n}+\epsilon q) \|_{2}^{2}$ for any $q \in \Pi_{n}$.
> ($\Leftarrow$) For any $q \in \Pi_{n}$, $\begin{align*} \| f-q \|_{2}^{2} &= \| (f-p_{n})-(q-p_{n}) \|_{2}^{2} \\ &= \| f-p_{n} \| _{2}^{2} -2\cancel{\left< f-p_{n}, \underset{\in \Pi_{n}}{q-p_{n}} \right>} +\| q-p_{n} \| _{2}^{2}\\ &\ge \| f-p_{n} \| _{2}^{2} \end{align*}$

> [!connection]
> Compare with [[Orthogonal Projection, Confounding, and Missing Variable in OLS|OLS]], which can be interpreted as projecting the response $\mathbf{y}$ onto the column space of the predictors $\mathbf{x}$.
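Putting the last two sections together, the following is a sketch of Gram-Schmidt on the canonical basis followed by the projection formula $p_{n}=\sum_{k}\left< f,\phi_{k} \right> \phi_{k}$, assuming $I=[-1,1]$ and $w\equiv 1$ (so the $\phi_{k}$ are, up to scaling, the Legendre polynomials); the target $f=e^{x}$, the use of `numpy.polynomial.Polynomial`, and the quadrature are illustrative choices, not part of the course material.

```python
import numpy as np
from scipy.integrate import quad

# <f, g> on I = [-1, 1] with w ≡ 1, by numerical quadrature.
def inner(f, g):
    return quad(lambda x: f(x) * g(x), -1.0, 1.0)[0]

n = 4
monomials = [np.polynomial.Polynomial.basis(k) for k in range(n + 1)]  # 1, x, ..., x^n

# Gram-Schmidt with normalization: subtract components along earlier φ_j, then rescale.
phis = []
for q in monomials:
    for phi in phis:
        q = q - inner(q, phi) * phi
    phis.append(q / np.sqrt(inner(q, q)))      # the φ_k are scaled Legendre polynomials

# Least squares approximant via the projection formula p_n = Σ <f, φ_k> φ_k.
f = np.exp
coeffs = [inner(f, phi) for phi in phis]       # α_k = <f, φ_k>
p_n = sum(c * phi for c, phi in zip(coeffs, phis))

# Orthogonality of the residual: <f - p_n, φ_k> ≈ 0 for every basis element.
print([round(inner(lambda x: f(x) - p_n(x), phi), 12) for phi in phis])
```

The printed inner products are all (numerically) zero, which is exactly the perpendicularity characterization above; by linearity, checking $\left< f-p_{n},\phi_{k} \right> = 0$ for each basis element suffices for all of $\Pi_{n}$.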