## Gradient Descent
If a parametrized model $\hat{y}(\theta)$ needs to optimize a differentiable objective $J(\theta)$ (which also depends on other inputs like a dataset), the obvious choice is to solve for $\hat{\theta}: \nabla_{\theta}J(\hat{\theta})=\mathbf{0},$where $\nabla_{\theta}J=\left( \frac{ \partial J }{ \partial \theta_{1} } ,\dots,\frac{ \partial J }{ \partial \theta_{p} }\right)$ is the **gradient**. Usually this equation has no closed-form solution or is expensive to solve, so gradient descent offers a way of iteratively approximating the solution $\hat{\theta}$.
> [!algorithm] Gradient Descent
> $[0]$ Initialize with some guess $\theta^{(0)}$.
> For $[k=0,1,\dots]$:
> - Compute the gradient $g^{(k)}=\nabla_{\theta}J | _{\theta=\theta^{(k)}}$.
> - Produce a new estimate by nudging the original: $\theta^{(k+1)}=\theta^{(k)}-\epsilon g^{(k)}$.
Here $\epsilon>0$ is called the **step size** or **learning rate**, which can be constant or adaptive.
- Since $\epsilon$ scales each coordinate equally, it implicitly assumes that each predictor has the same scale -- if violated, the shared rate can be too large for large-scaled inputs (divergence/zig-zagging) or too small for small-scaled inputs (slow convergence).
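As a minimal sketch of the update rule (hypothetical names; NumPy assumed, with a least-squares objective for illustration):

```python
import numpy as np

def gradient_descent(grad, theta0, eps=0.01, n_iters=5000):
    """Iterate theta <- theta - eps * grad(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        theta = theta - eps * grad(theta)
    return theta

# Example objective: J(theta) = ||A @ theta - b||^2, with gradient 2 A^T (A theta - b).
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])
grad_J = lambda theta: 2 * A.T @ (A @ theta - b)
# eps must respect the largest curvature of J -- cf. the scaling caveat above.
theta_hat = gradient_descent(grad_J, theta0=np.zeros(2), eps=0.005)
```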
### Newton-Raphson Algorithm
Newton-Raphson optimizes $J$ more aggressively by considering a quadratic expansion around $\theta$: $J(\theta+\delta)\approx J(\theta)+\delta^{T}\nabla_{\theta}J+\frac{1}{2}\delta^{T}\nabla^{2}_{\theta}J \delta,$where $\nabla^{2}_{\theta}J$ is the **Hessian**. Solving for the $\delta$ that minimizes the right-hand side (assuming the Hessian is positive definite) gives $\delta =-(\nabla^{2}_{\theta}J)^{-1}\nabla_{\theta}J,$and the algorithm increments the previous approximation $\theta^{(k)}$ by exactly this amount:
> [!algorithm] Newton-Raphson
> $[0]$ Initialize with some guess $\theta^{(0)}$.
> For $[k=0,1,\dots]$:
> - Compute the NR step $\delta^{(k)} =-(\nabla^{2}_{\theta}J)^{-1}\nabla_{\theta}J |_{\theta=\theta^{(k)}}$.
> - Produce a new estimate by nudging the original: $\theta^{(k+1)}=\theta^{(k)}+\delta^{(k)}$.
> [!idea] Gradient descent is equivalent to replacing the computationally expensive Hessian with $I_{p} / \epsilon$.
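A minimal sketch in the same style (hypothetical names; `np.linalg.solve` used in place of an explicit matrix inverse):

```python
import numpy as np

def newton_raphson(grad, hess, theta0, n_iters=20):
    """Iterate theta <- theta + delta, where delta = -H^{-1} g."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        g = grad(theta)
        H = hess(theta)
        delta = -np.linalg.solve(H, g)  # solve H @ delta = -g instead of inverting H
        theta = theta + delta
    return theta
```

Passing `hess = lambda theta: np.eye(len(theta)) / eps` reproduces plain gradient descent, matching the idea above; and since the quadratic expansion is exact for a quadratic $J$, Newton-Raphson then converges in a single step.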
### Stochastic Gradient Descent
Computing the gradient of $L(\mathbf{y},\hat{\mathbf{y}})$ over the entire dataset at every iteration is expensive when the dataset is large.
Assume the objective has the additive form $J(\theta)=\sum_{i}J_{i}(\theta)+\text{penalty},$where the summation is over each observation, and $J_{i}$ is the loss due to the $i$th observation.
**Stochastic gradient descent (SGD)** replaces $\nabla_{\theta}J$ with a cheap approximation on a random (index) subset $I \subseteq \{ 1,\dots,n \}$ of the data, called a **minibatch**: $\text{SGD gradient}= \sum_{i \in I}\nabla_{\theta}J_{i} \approx \nabla_{\theta}J.$
- The **batch size** $n_{b}:= | I |$ is usually fixed, and can be as small as $1$; the learning rate $\epsilon$ is usually adaptive for SGD.
- Up to the constant factor $n_{b} / n$ (absorbable into the learning rate), the SGD gradient is an unbiased estimator of the true gradient; see the derivation after this list.
- Due to its randomness, the SGD gradient is also less greedy than the true gradient, helping the algorithm escape suboptimal local minima.
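To see the unbiasedness, note that for a minibatch drawn uniformly without replacement (ignoring the penalty term), each index is included with probability $n_{b} / n$, so $\mathbb{E}\left[ \sum_{i \in I}\nabla_{\theta}J_{i} \right]=\sum_{i=1}^{n}\mathbb{P}(i \in I)\,\nabla_{\theta}J_{i}=\frac{n_{b}}{n}\sum_{i=1}^{n}\nabla_{\theta}J_{i}.$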
> [!algorithm] Stochastic Gradient Descent
> $[0]$ Initialize with some guess $\theta^{(0)}$.
> For $[k=0,1,\dots]$:
> - Sample a new minibatch $I^{(k)} \subseteq \{ 1,\dots,n \}$.
> - Compute the gradient $g^{(k)}=\sum_{i \in I^{(k)}}\nabla_{\theta} J_{i}\,|_{\theta=\theta^{(k)}}$.
> - Produce a new estimate by nudging the original: $\theta^{(k+1)}=\theta^{(k)}-\epsilon g^{(k)}$.
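A minimal sketch (hypothetical names; a least-squares toy problem assumed, with per-observation loss $J_{i}(\theta)=(y_{i}-x_{i}^{T}\theta)^{2}$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = X @ theta_true + noise.
n, p = 1000, 3
X = rng.normal(size=(n, p))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=n)

def minibatch_grad(theta, I):
    """Summed gradient of the per-observation losses over minibatch I."""
    XI, yI = X[I], y[I]
    return 2 * XI.T @ (XI @ theta - yI)

theta, eps, batch_size = np.zeros(p), 1e-3, 32
for k in range(2000):
    I = rng.choice(n, size=batch_size, replace=False)  # fresh minibatch each step
    theta -= eps * minibatch_grad(theta, I)
```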
## Gradients as Generalized Residuals
For quadratic ($l_{2}$) loss $L_{2}(\mathbf{y}, \hat{\mathbf{y}}):=\| \mathbf{y}-\hat{\mathbf{y}} \|^{2}_{2}$, the residual $\mathbf{y}-\hat{\mathbf{y}}$ is (proportional to) the negative gradient $-\nabla_{\hat{\mathbf{y}}}\,L_{2}(\mathbf{y}, \hat{\mathbf{y}})$.
Stepwise, residual-based methods like gradient boosting trees can be generalized to arbitrary losses $L$ by defining the **generalized residual** or **pseudo-residual** $\mathbf{r}:=- \frac{ \partial L(\mathbf{y},\hat{\mathbf{y}}) }{ \partial \hat{\mathbf{y}} },$and then computing each step using this residual. *This makes residual-based methods equivalent to gradient descent* on the risk-minimization problem $\hat{\theta}=\underset{\theta}{\arg\min}\,L(\mathbf{y},\hat{\mathbf{y}}_{\theta}),$where $\theta$ is a generic parameter that defines the model, and $\hat{\mathbf{y}}_{\theta}$ its predictions.
For the case of gradient boosting, this can be done by fitting a tree $g(\cdot~;\theta)$ with parameters (i.e. choice of predictors and splits) $\theta$ that models the pseudo-residual $\mathbf{r}$. Descending along the gradient with step size $\epsilon$ then amounts to adding $\epsilon \cdot g(\cdot~;\theta)$ to the existing model. ^c704ca
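A minimal sketch of this functional gradient step (hypothetical names; scikit-learn's `DecisionTreeRegressor` assumed as the base learner):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, neg_grad, n_steps=100, eps=0.1, max_depth=2):
    """Each step fits a small tree g to the pseudo-residual r = -dL/d(y_hat),
    then descends in function space: y_hat <- y_hat + eps * g(X)."""
    y_hat = np.zeros(len(y))  # start from the zero model
    trees = []
    for _ in range(n_steps):
        r = neg_grad(y, y_hat)  # generalized residual of the current model
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        y_hat += eps * tree.predict(X)
        trees.append(tree)
    return trees

# For l2 loss, the pseudo-residual is (proportional to) the plain residual.
neg_grad_l2 = lambda y, y_hat: y - y_hat
```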
## Residuals and MLE
*Maximum likelihood estimators take a different approach, similar to generative models*: they prescribe the responses' distribution $f(\mathbf{y} \,|\, \mathbf{x}, \theta)$ and minimize the [[Deviance]], given by $D(\mathbf{y};\mathbf{x},\theta):= -2\operatorname{loglik}(\mathbf{y};\mathbf{x},\theta)=-2\log f(\mathbf{y}\,|\,\mathbf{x},\theta).$Assuming independence, this simplifies to $D(\mathbf{y};\mathbf{x}, \theta)=-2\sum_{i}\log f(y_{i}\,|\,x_{i}, \theta).$If we interpret [[Distance between Data Point and Distribution|deviance as the distance between the distribution and the data]], MLE is just finding the distribution closest to the dataset.
However, *with the correct choice of loss, residual-based learners can be recast to solve the MLE*: namely with $\begin{align*}
L(\mathbf{y},\hat{\mathbf{y}})&:= D(\mathbf{y};\mathbf{x},\theta),\\[0.8em]
\nabla_{\theta}L(\mathbf{y},\hat{\mathbf{y}})&= -2\sum_{i}\nabla _{\theta}\{\log f(y_{i}\,|\,x_{i},\theta)\}\\
&= -2\sum_{i} \frac{\nabla _{\theta}f(y_{i}\,|\,x_{i},\theta)}{f(y_{i}\,|\,x_{i},\theta)}.
\end{align*}$
When the [[Deviance#Deviance and Regression Losses|deviance can be made equivalent to common losses]], this recovers the classic loss optimizations via gradient descent.
- For example, regression using MSE is equivalent to finding the MLE model under an additive, homoscedastic Gaussian noise (e.g. [[Inference in OLS#The Gauss-Markov Model|the Gauss-Markov model]]).
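    - To spell this out (assuming the model $y_{i}=\hat{y}_{i}+\varepsilon_{i}$ with $\varepsilon_{i}\overset{iid}{\sim}N(0,\sigma^{2})$): $D(\mathbf{y};\mathbf{x},\theta)=-2\sum_{i}\log f(y_{i}\,|\,x_{i},\theta)=\frac{1}{\sigma^{2}}\sum_{i}(y_{i}-\hat{y}_{i})^{2}+n\log(2\pi\sigma^{2}),$so for fixed $\sigma$, minimizing the deviance is exactly minimizing the squared-error loss.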