## Gradient Descent
If a parametrized model $\hat{y}(\theta)$ needs to optimize a differentiable objective $J(\theta)$ (which also depends on other inputs like a dataset), the obvious choice is to solve for $\hat{\theta}: \nabla_{\theta}J(\hat{\theta})=\mathbf{0},$where $\nabla_{\theta}J=\left( \frac{ \partial J }{ \partial \theta_{1} } ,\dots,\frac{ \partial J }{ \partial \theta_{p} }\right)$ is the **gradient**. Usually this equation has no closed-form solution or is expensive to solve, so gradient descent offers a way of iteratively approximating the solution $\hat{\theta}$.
> [!algorithm] Gradient Descent
> $[0]$ Initialize with some guess $\theta^{(0)}$.
> For $[k=0,1,\dots]$:
> - Compute the gradient $g^{(k)}=\nabla_{\theta}J | _{\theta=\theta^{(k)}}$.
> - Produce a new estimate by nudging the original: $\theta^{(k+1)}=\theta^{(k)}-\epsilon g^{(k)}$.
Here $\epsilon>0$ is called the **step size** or **learning rate**, which can be constant or adaptive.
- Since $\epsilon$ scales each coordinate equally, it implicitly assumes that each predictor has the same scale -- if violated, the shared rate can be too large for large-scaled inputs (divergence/zig-zagging) or too small for small-scaled inputs (slow convergence).
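As a minimal sketch of the update rule (hypothetical names; NumPy assumed, with a least-squares objective for illustration):

```python
import numpy as np

def gradient_descent(grad, theta0, eps=0.01, n_iters=5000):
    """Iterate theta <- theta - eps * grad(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        theta = theta - eps * grad(theta)
    return theta

# Example objective: J(theta) = ||A @ theta - b||^2, with gradient 2 A^T (A theta - b).
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])
grad_J = lambda theta: 2 * A.T @ (A @ theta - b)
# eps must respect the largest curvature of J -- cf. the scaling caveat above.
theta_hat = gradient_descent(grad_J, theta0=np.zeros(2), eps=0.005)
```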
### Newton-Raphson Algorithm
Newton-Raphson optimizes $J$ more aggressively by considering a quadratic expansion around $\theta$: $J(\theta+\delta)\approx J(\theta)+\delta^{T}\nabla_{\theta}J+\frac{1}{2}\delta^{T}\nabla^{2}_{\theta}J \delta,$where $\nabla^{2}_{\theta}J$ is the **Hessian**. Solving for the $\delta$ that minimizes the right-hand side (assuming the Hessian is positive definite) gives $\delta =-(\nabla^{2}_{\theta}J)^{-1}\nabla_{\theta}J,$and the algorithm increments the previous approximation $\theta^{(k)}$ by exactly this amount:
> [!algorithm] Newton-Raphson
> $[0]$ Initialize with some guess $\theta^{(0)}$.
> For $[k=0,1,\dots]$:
> - Compute the NR step $\delta^{(k)} =-(\nabla^{2}_{\theta}J)^{-1}\nabla_{\theta}J |_{\theta=\theta^{(k)}}$.
> - Produce a new estimate by nudging the original: $\theta^{(k+1)}=\theta^{(k)}+\delta^{(k)}$.
> [!idea] Gradient descent is equivalent to replacing the computationally expensive Hessian with $I_{p} / \epsilon$.
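A minimal sketch in the same style (hypothetical names; `np.linalg.solve` used in place of an explicit matrix inverse):

```python
import numpy as np

def newton_raphson(grad, hess, theta0, n_iters=20):
    """Iterate theta <- theta + delta, where delta = -H^{-1} g."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        g = grad(theta)
        H = hess(theta)
        delta = -np.linalg.solve(H, g)  # solve H @ delta = -g instead of inverting H
        theta = theta + delta
    return theta
```

Passing `hess = lambda theta: np.eye(len(theta)) / eps` reproduces plain gradient descent, matching the idea above; and since the quadratic expansion is exact for a quadratic $J$, Newton-Raphson then converges in a single step.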
### Stochastic Gradient Descent
Computing the gradient of $L(\mathbf{y},\hat{\mathbf{y}})$ over the entire dataset at every iteration is expensive when the dataset is large.
Assume the objective has the additive form $J(\theta)=\sum_{i}J_{i}(\theta)+\text{penalty},$where the summation is over each observation, and $J_{i}$ is the loss due to the $i$th observation.
**Stochastic gradient descent (SGD)** replaces $\nabla_{\theta}J$ with a cheap approximation on a random (index) subset $I \subseteq \{ 1,\dots,n \}$ of the data, called a **minibatch**: $\text{SGD gradient}= \sum_{i \in I}\nabla_{\theta}J_{i} \approx \nabla_{\theta}J.$
- The **batch size** $n_{b}:= | I |$ is usually fixed, and can be as small as $1$; the learning rate $\epsilon$ is usually adaptive for SGD.
- Up to the constant factor $n_{b} / n$ (absorbable into the learning rate), the SGD gradient is an unbiased estimator of the true gradient; see the derivation after this list.
- Due to its randomness, the SGD gradient is also less greedy than the true gradient, helping the algorithm escape suboptimal local minima.
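To see the unbiasedness, note that for a minibatch drawn uniformly without replacement (ignoring the penalty term), each index is included with probability $n_{b} / n$, so $\mathbb{E}\left[ \sum_{i \in I}\nabla_{\theta}J_{i} \right]=\sum_{i=1}^{n}\mathbb{P}(i \in I)\,\nabla_{\theta}J_{i}=\frac{n_{b}}{n}\sum_{i=1}^{n}\nabla_{\theta}J_{i}.$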
> [!algorithm] Stochastic Gradient Descent
> $[0]$ Initialize with some guess $\theta^{(0)}$.
> For $[k=0,1,\dots]$:
> - Sample a new minibatch $I^{(k)} \subseteq \{ 1,\dots,n \}$.
> - Compute the gradient $g^{(k)}=\sum_{i \in I^{(k)}}\nabla_{\theta} J_{i}\,|_{\theta=\theta^{(k)}}$.
> - Produce a new estimate by nudging the original: $\theta^{(k+1)}=\theta^{(k)}-\epsilon g^{(k)}$.
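A minimal sketch (hypothetical names; a least-squares toy problem assumed, with per-observation loss $J_{i}(\theta)=(y_{i}-x_{i}^{T}\theta)^{2}$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = X @ theta_true + noise.
n, p = 1000, 3
X = rng.normal(size=(n, p))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=n)

def minibatch_grad(theta, I):
    """Summed gradient of the per-observation losses over minibatch I."""
    XI, yI = X[I], y[I]
    return 2 * XI.T @ (XI @ theta - yI)

theta, eps, batch_size = np.zeros(p), 1e-3, 32
for k in range(2000):
    I = rng.choice(n, size=batch_size, replace=False)  # fresh minibatch each step
    theta -= eps * minibatch_grad(theta, I)
```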
## Gradients as Generalized Residuals
For quadratic ($l_{2}$) loss $L_{2}(\mathbf{y}, \hat{\mathbf{y}}):=\| \mathbf{y}-\hat{\mathbf{y}} \|^{2}_{2}$, the residual $\mathbf{y}-\hat{\mathbf{y}}$ is (proportional to) the negative gradient $-\nabla_{\hat{\mathbf{y}}}\,L_{2}(\mathbf{y}, \hat{\mathbf{y}})$.
Stepwise, residual-based methods like gradient boosting trees can be generalized to arbitrary losses $L$ by defining the **generalized residual** or **pseudo-residual** $\mathbf{r}:=- \frac{ \partial L(\mathbf{y},\hat{\mathbf{y}}) }{ \partial \hat{\mathbf{y}} },$and then computing each step using this residual. *This makes residual-based methods equivalent to gradient descent* on the risk-minimization problem $\hat{\theta}=\underset{\theta}{\arg\min}\,L(\mathbf{y},\hat{\mathbf{y}}_{\theta}),$where $\theta$ is a generic parameter that defines the model, and $\hat{\mathbf{y}}_{\theta}$ its predictions.
For the case of gradient boosting, this can be done by fitting a tree $g(\cdot~;\theta)$ with parameters (i.e. choice of predictors and splits) $\theta$ that models the pseudo-residual $\mathbf{r}$. Descending along the gradient with step size $\epsilon$ then amounts to adding $\epsilon \cdot g(\cdot~;\theta)$ to the existing model. ^c704ca
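A minimal sketch of this functional gradient step (hypothetical names; scikit-learn's `DecisionTreeRegressor` assumed as the base learner):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, neg_grad, n_steps=100, eps=0.1, max_depth=2):
    """Each step fits a small tree g to the pseudo-residual r = -dL/d(y_hat),
    then descends in function space: y_hat <- y_hat + eps * g(X)."""
    y_hat = np.zeros(len(y))  # start from the zero model
    trees = []
    for _ in range(n_steps):
        r = neg_grad(y, y_hat)  # generalized residual of the current model
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        y_hat += eps * tree.predict(X)
        trees.append(tree)
    return trees

# For l2 loss, the pseudo-residual is (proportional to) the plain residual.
neg_grad_l2 = lambda y, y_hat: y - y_hat
```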
## Residuals and MLE
*Maximum likelihood estimators take a different approach, similar to generative models*: they prescribe the responses' distribution $f(\mathbf{y} \,|\, \mathbf{x}, \theta)$ and minimize the [[Deviance]], given by $D(\mathbf{y};\mathbf{x},\theta):= -2\operatorname{loglik}(\mathbf{y};\mathbf{x},\theta)=-2\log f(\mathbf{y}\,|\,\mathbf{x},\theta).$Assuming independence, this simplifies to $D(\mathbf{y};\mathbf{x}, \theta)=-2\sum_{i}\log f(y_{i}\,|\,x_{i}, \theta).$If we interpret [[Distance between Data Point and Distribution|deviance as the distance between the distribution and the data]], MLE is just finding the distribution closest to the dataset.
However, *with the correct choice of loss, residual-based learners can be recast to solve the MLE*: namely with $\begin{align*}
L(\mathbf{y},\hat{\mathbf{y}})&:= D(\mathbf{y};\mathbf{x},\theta),\\[0.8em]
\nabla_{\theta}L(\mathbf{y},\hat{\mathbf{y}})&= -2\sum_{i}\nabla _{\theta}\{\log f(y_{i}\,|\,x_{i},\theta)\}\\
&= -2\sum_{i} \frac{\nabla _{\theta}f(y_{i}\,|\,x_{i},\theta)}{f(y_{i}\,|\,x_{i},\theta)}.
\end{align*}$
When the [[Deviance#Deviance and Regression Losses|deviance can be made equivalent to common losses]], this recovers the classic loss optimizations via gradient descent.
- For example, regression using MSE is equivalent to finding the MLE model under an additive, homoscedastic Gaussian noise (e.g. [[Inference in OLS#The Gauss-Markov Model|the Gauss-Markov model]]).
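    - To spell this out (assuming the model $y_{i}=\hat{y}_{i}+\varepsilon_{i}$ with $\varepsilon_{i}\overset{iid}{\sim}N(0,\sigma^{2})$): $D(\mathbf{y};\mathbf{x},\theta)=-2\sum_{i}\log f(y_{i}\,|\,x_{i},\theta)=\frac{1}{\sigma^{2}}\sum_{i}(y_{i}-\hat{y}_{i})^{2}+n\log(2\pi\sigma^{2}),$so for fixed $\sigma$, minimizing the deviance is exactly minimizing the squared-error loss.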