### RKHS in Machine Learning

### Representer Theorem

The main benefit of functional optimization in an RKHS is that it can be reduced to optimizing $n$ coefficients (rather than optimizing over the infinite-dimensional function space).

> [!theorem|*] Representer Theorem
> Any solution to the norm-penalized ERM problem $\min_{f \in \mathcal{H}(\kappa)} \sum_{i}L(y_{i}, f(x_{i}))+\lambda\| f \|_{\mathcal{H}(\kappa)}^{2}$ is a linear combination of $\{ \kappa(\cdot,x_{i}) ~|~ i = 1,\dots,n \}$, say $\sum_{i}\beta_{i}\kappa(\cdot,x_{i})$.
>
> In other words, the optimization can be restricted to the (parametrized) subspace $\mathrm{span}\{ \kappa(\cdot, x_{i}) ~|~ i=1,\dots,n \} \le \mathcal{H}(\kappa)$, and the objective rewritten as $\min_{\beta} \sum_{i}L( y_{i}, (K\beta)_{i}) + \lambda \beta^{T}K\beta$, where $K_{ij}=\kappa(x_{i},x_{j})$ is the Gram matrix.

^772653

One application is in [[Support Vector Machines#Support Vector Machines||support vector machines]], where it guarantees that the optimal decision function is a kernel-weighted combination of the responses $\mathbf{y}$, with nonzero weights only at the support vectors.

Another example is **kernel ridge regression**, where $L$ is the squared $l_{2}$ norm: $\min_{f\in \mathcal{H}(\kappa)}~ \mathrm{RSS}(\mathbf{y}, f(\mathbf{X}))+\lambda \| f \|_{\mathcal{H}(\kappa)}^{2}.$ Using the theorem gives the finite-dimensional quadratic problem $\min_{\beta} \| \mathbf{y}- K\beta \|^{2}+\lambda \beta^{T}K\beta,$ whose solution (when $K$ is invertible) is $\hat{\beta} = (K+\lambda I)^{-1}\mathbf{y}$. ^3f51e8
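
As an illustration, here is a minimal kernel ridge regression sketch in NumPy. The Gaussian (RBF) kernel, the bandwidth `gamma`, the penalty `lam`, and the synthetic data are all illustrative assumptions; the essential step is the one the representer theorem licenses, namely solving the finite-dimensional system $(K+\lambda I)\beta = \mathbf{y}$ instead of optimizing over all of $\mathcal{H}(\kappa)$.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2 * A @ B.T)
    return np.exp(-gamma * sq_dists)

# Synthetic 1-D regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)

# Representer theorem: search only over f = sum_i beta_i * k(., x_i),
# which turns the RKHS problem into the quadratic problem
# min_beta ||y - K beta||^2 + lam * beta^T K beta,
# with solution (K + lam * I) beta = y (assuming K is invertible).
lam = 0.1
K = rbf_kernel(X, X)
beta = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Predict at new points via f(x) = sum_i beta_i * k(x, x_i).
X_new = np.linspace(-3, 3, 200)[:, None]
y_pred = rbf_kernel(X_new, X) @ beta
```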