Linesearch algorithms solve the unconstrained optimization problem $\min_{x}f(x)$ by iteratively updating a solution $x_{k}$ along a descent direction $s_{k}$ with a step size $\alpha_{k}$, until sufficient convergence (e.g. when $\| \nabla f(x_{k})\| \le \epsilon$ for some predetermined tolerance $\epsilon > 0$).
## Choosing the Step Size $\alpha$
Note that if $\nabla f$ is [[Lipschitz Continuity|Lipschitz continuous]] with constant $L$, then the step size $\alpha = L^{-1}$ with descent direction $s=-\nabla f$ gives a descent of at least $f(x_{k})-f(x_{k+1}) \ge \frac{1}{2L}\| \nabla f(x_{k}) \|^{2}, $
using [[Lipschitz Continuity#^214462|this lemma]]. However, this is rather useless, because for most applications we do not know $L$.
### Exact Linesearch
In very rare cases like quadratic optimisation, it is possible to do **exact linesearch**, i.e. finding the optimal $\alpha$ given $x,s$: for a quadratic objective given by $q(x)=g^{T}x+ \frac{1}{2}x^{T}Hx,$
the exact linesearch tries to solve $\begin{align*}
\alpha^{\ast}&= \underset{\alpha}{\arg\min}~q(x+\alpha s)\\
&= \underset{\alpha}{\arg\min}~g^{T}(x+\alpha s)+ \frac{1}{2}(x+\alpha s)^T H(x+\alpha s)\\
&= \underset{\alpha}{\arg\min}~ \alpha(g+Hx)^{T}s+ \frac{\alpha^{2}}{2}s^T Hs\\
&= \underset{\alpha}{\arg\min}~\alpha\,\nabla q(x)^{T} s+ \frac{\alpha^{2}}{2}s^{T}Hs,
\end{align*}$by dropping constant terms that do not depend on $\alpha$ (using the symmetry of $H$, so that $\nabla q(x)=g+Hx$).
Taking the objective's derivative wrt. $\alpha$ and setting it to 0, we have $\begin{align*}
\nabla q(x)^{T}s+\alpha s^{T}Hs=0,
\end{align*}$and solving it gives the desired optimal step size as $\alpha ^{\ast}= -\frac{\nabla q(x)^{T}s}{s^T Hs},$which is positive whenever $s$ is a descent direction and $H \succ 0$.
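As a quick numerical sanity check of this formula (a minimal sketch assuming NumPy; the names `q`, `grad_q`, `exact_linesearch` and the particular $H$, $g$ are illustrative, not from these notes):
```python
import numpy as np

# Quadratic objective q(x) = g^T x + 1/2 x^T H x with H symmetric positive definite.
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
g = np.array([1.0, -1.0])

def q(x):
    return g @ x + 0.5 * x @ H @ x

def grad_q(x):
    return g + H @ x

def exact_linesearch(x, s):
    # alpha* = -(grad q(x)^T s) / (s^T H s), as derived above.
    return -(grad_q(x) @ s) / (s @ H @ s)

x = np.array([2.0, 2.0])
s = -grad_q(x)                      # a descent direction (here: steepest descent)
alpha = exact_linesearch(x, s)

# alpha* minimises q along the ray, so nearby step sizes cannot do better.
assert q(x + alpha * s) <= min(q(x + 0.9 * alpha * s), q(x + 1.1 * alpha * s))
print(alpha, q(x), q(x + alpha * s))
```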
### Backtracking and Armijo Condition
Without exact solutions, we still want to select step sizes that are neither too large (overshooting the minimiser) nor too small (slow or incorrect convergence).
One idea is to take a large step, then reduce it until it provides a satisfactory amount of descent: for descending from $x:=x_{k}$ in the direction of $s:= s_{k}$,
> [!algorithm] Backtracking
> Select an $\alpha=\alpha^{(0)}$ large enough, and backtracking rate $\tau \in (0,1)$.
> While $f(x+\alpha s)$ is "too large", $\alpha \leftarrow \tau \alpha$.
> Return $\alpha$.
We can make "too large" precise by requiring the descent to be non-negligible, for example by requiring the step to achieve at least a fraction $\beta$ of the decrease predicted by the first-order model: $f(x+\alpha s) \overset{?}{ \le} f(x) + \beta\cdot \alpha \nabla f(x)^{T}s;$usual choices of $\beta$ are small, say $10^{-1}\sim 10^{-3}$. This is known as the **Armijo condition**.
![[Armijo.excalidraw|center]]
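Below is a minimal sketch of backtracking with the Armijo test (assuming NumPy; the function name `backtracking_armijo` and the default values $\alpha^{(0)}=1$, $\tau=0.5$, $\beta=10^{-2}$ are illustrative choices, not prescribed here):
```python
import numpy as np

def backtracking_armijo(f, grad_f, x, s, alpha0=1.0, tau=0.5, beta=1e-2, max_halvings=50):
    """Shrink alpha from alpha0 by a factor tau until the Armijo condition
    f(x + alpha s) <= f(x) + beta * alpha * grad_f(x)^T s holds."""
    fx = f(x)
    slope = grad_f(x) @ s            # should be negative for a descent direction
    alpha = alpha0
    for _ in range(max_halvings):
        if f(x + alpha * s) <= fx + beta * alpha * slope:
            return alpha
        alpha *= tau
    return alpha                     # fallback: return the last (tiny) step size

# Example: one steepest descent step on f(x) = ||x||^2 from (1, 1).
f = lambda x: x @ x
grad_f = lambda x: 2 * x
x = np.array([1.0, 1.0])
s = -grad_f(x)
alpha = backtracking_armijo(f, grad_f, x, s)
print(alpha, f(x), f(x + alpha * s))
```
Here $\tau=0.5$ simply halves the step each time the test fails; any $\tau\in(0,1)$ works, trading off the number of function evaluations against how close the accepted step is to the largest admissible one.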
When $\nabla f$ is Lipschitz with constant $L$, there is a range $[0, \alpha^{(L)}]$ on which the Armijo condition always holds, so that we are guaranteed to find a step: $\alpha^{(L)}:= \frac{(\beta-1)s^{T}\nabla f}{L\| s \|^{2} };$with $\beta = \frac{1}{2}$, this provides a lower bound on the exact linesearch step for a quadratic (approximation of the) objective.
> [!proof]-
> Consider the function $\phi(\alpha):= f(x+\alpha s)$. Then $\phi'$ is Lipschitz continuous with constant $\tilde{L}:= L \| s \|^{2}$ (i.e. the denominator of $\alpha^{(L)}$), as $\begin{align*}
| \phi'(\alpha_{1})-\phi'(\alpha_{2}) | &= | s^{T}(\nabla f(x+\alpha_{1}s)-\nabla f(x+\alpha_{2}s)) |\\
&\le \| s \|\cdot \| \nabla f(x+\alpha_{1}s)-\nabla f(x+\alpha_{2}s) \|&&[\text{C-S}] \\
&\le \| s \| \cdot L \cdot\| \alpha_{1}s-\alpha_{2}s\| &&[\nabla\! f\text{ is Lipschitz}]\\
&= \tilde{L}| \alpha_{1}-\alpha_{2} |.
\end{align*}$
>
> Therefore, $\phi'(\alpha) \le \phi'(0)+\tilde{L} \alpha$ for $\alpha \ge 0$. Solving for the largest $\alpha$ at which this upper bound still satisfies $\phi'(0)+\tilde{L}\alpha \le \beta \nabla f^{T}s$ gives the bound $\alpha^{(L)}$.
>
> To rigorously prove that any $\alpha\in [0, \alpha^{(L)}]$ satisfies Armijo, we use the mean value theorem to find $\xi\in (0,\alpha)$ such that $\phi(\alpha)-\phi(0)=\alpha\phi'(\xi)$; since $\xi \le \alpha^{(L)}$, we know $\phi'(\xi)\le \beta \nabla f^{T}s$, so $f(x+\alpha s)\le f(x)+\beta\alpha \nabla f^{T}s$, i.e. the descent satisfies the Armijo condition.

> [!lemma|*] Backtracking with Armijo
> If the objective $f$ has $\nabla f$ Lipschitz continuous with constant $L$, then the step size $\alpha_{k}$ selected by backtracking with the Armijo condition satisfies $\alpha_{k} \ge \min(\alpha^{(0)}, \tau \alpha^{(L)}_{k}),$where $\tau$ is the backtracking rate, $\alpha^{(0)}$ the initial step size in linesearch, and $\alpha^{(L)}_{k}$ the $\alpha^{(L)}$ of the $k$th step.
When we use step sizes $\alpha_{k}$ that satisfy the Armijo condition, each step has a guaranteed amount of descent: $\begin{align*}
f(x_{1}) -f(x_{0})&\le \beta \alpha_{0} \cdot s_{0}^{T}\nabla f(x_{0}) \\
f(x_{2})-f(x_{1}) &\le \beta \alpha_{1}\cdot s_{1}^{T}\nabla f(x_{1}) \\
&~~\vdots\\
f(x_{k})-f(x_{k-1}) &\le \beta\cdot \alpha_{k-1}s_{k-1}^{T}\nabla f(x_{k-1})\\
\end{align*}$
and summing them together gives
$f(x_{k}) -f(x_{0}) \le \beta \sum_{j < k}\alpha_{j} s_{j}^{T}\nabla f(x_{j}), $recalling that $s_{j}^{T}\nabla f(x_{j}) < 0$ due to the choice of $s_{j}$ as descent directions.
> [!theorem|*] Convergence of Linesearch Gradients
> Suppose $f$ is bounded below by $f(x^{\ast})$, and $\nabla f$ is Lipschitz continuous with constant $L$.
>
> If we do not terminate the linesearch iterations (e.g. with $\epsilon=0$), then either (1) the algorithm converges to some $x:\nabla f(x)=0$ in finitely many iterations, or (2) the "amount" of descent goes to $0$ in the sense that $\min \left( | s^{T}_{k}\nabla f(x_{k}) |, \frac{| s^{T}_{k}\nabla f(x_{k}) |}{\| s_{k} \| } \right) \to 0.$
>
> > [!proof]-
> >
> > If (1) does not happen, then the total amount of descent is $\lim_{k \to \infty}f(x_{k})-f(x_{0}) \le \beta\sum_{k}\alpha_{k}s^{T}_{k}\nabla f(x_{k}),$
> > which is bounded below by $f(x^{\ast})-f(x_{0})$. Since every term is negative, taking their negative gives
> > $\sum_{k}\alpha_{k}| s_{k}^{T}\nabla f(x_{k}) | < \infty.$
> >
> > Consider the following two cases: $\begin{align*}
> > \mathcal{K}_{0}&:= \{ k: \alpha^{(0)} \le \alpha^{(L)}_{k} \},\\
> > \mathcal{K}_{L}&:=\{ k: \alpha^{(0)} > \alpha^{(L)}_{k} \}.
> > \end{align*}$
> > For the former, we simply have $\alpha_{k}=\alpha^{(0)}$, so
> > $\sum_{k \in \mathcal{K}_{0}}\alpha_{k}| s_{k}^{T}\nabla f(x_{k}) | = \alpha^{(0)}\sum_{k\in \mathcal{K}_{0}}| s_{k}^{T}\nabla f(x_{k}) |<\infty,$
> > and $| s_{k}^{T}\nabla f(x_{k}) | \to{0}$ along $k \in \mathcal{K}_{0}$.
> >
> > For the latter, we have $\alpha_{k} \ge \tau \alpha^{(L)}_{k}=\frac{\tau(\beta-1) s_{k}^{T}\nabla f(x_{k}) }{L\| s_{k} \|^{2}}$, so substituting, we have $\infty>\sum_{k \in \mathcal{K}_{L}}\alpha_{k}| s_{k}^{T}\nabla f(x_{k}) | \ge \frac{\tau (1-\beta)} {L}\sum_{k\in \mathcal{K}_{L}} \frac{| s_{k}^{T}\nabla f(x_{k}) |^{2}}{\| s_{k} \| ^{2}},$so $\frac{| s_{k}^{T}\nabla f(x_{k}) |^{2}}{\| s_{k} \| ^{2}} \to 0$ along $k \in \mathcal{K}_{L}$.
> >
> > Combining the two sets, we conclude that $\min\left( | s_{k}^{T}\nabla f(x_{k}) |, \frac{| s_{k}^{T}\nabla f(x_{k}) |}{\| s_{k} \| } \right)\to 0$.
^3b76ba
> [!proof]
> We know that the step size is at least $\alpha_{k}\ge \min (\alpha^{(0)}, \tau\alpha^{(L)}_{k})= \min\left( \alpha^{(0)}, \frac{\tau(\beta-1) \nabla f(x_{k})^{T}s_{k}}{L\| s_{k} \|^{2}} \right)$, and the Armijo condition guarantees a descent of at least $\beta\alpha_{k} | \nabla f(x_{k})^{T}s_{k} |$, so the descent at each step satisfies $\begin{align*}
f(x_{k})-f(x_{k+1})&\ge \beta \alpha_{k} | \nabla f(x_{k})^{T}s_{k} |\\ &\ge \beta\min\left( \alpha^{(0)} | \nabla f(x_{k})^{T}s_{k} |, \frac{\tau (1-\beta)}{L} \left( \frac{| \nabla f(x_{k})^{T}s_{k} |}{\| s_{k} \| } \right)^{2} \right).
\end{align*}$
> Since $f$ is bounded below, the sum of these descents over $k$ is finite, and since they are all non-negative, the terms must go to $0$. Because multiplying by positive constants and squaring does not change convergence to $0$, we have
> $\min\left(| \nabla f(x_{k})^{T}s_{k} |, \frac{| \nabla f(x_{k})^{T}s_{k} |}{\| s_{k} \| }\right) \to 0.$
Equivalently, defining $\theta_{k}$ by $\cos(\theta_{k})= \frac{| s_{k}^{T}\nabla f(x_{k}) |}{\| s_{k} \|\cdot\| \nabla f(x_{k}) \|}$, the angle between $\nabla f$ and $s$, we have $\| \nabla f(x_{k}) \|\cdot \cos(\theta_{k})\cdot \min(\| s_{k} \|,1 )\to 0. $Since we wish to drive $\| \nabla f \|$ to $0$, we now want to bound $\cos(\theta_{k})$ away from $0$ (and also the $\min$ term, though that is usually not an issue) by choosing a proper direction $s$.
## Choosing the Descent Direction $s$
### Steepest Descent
The most obvious (and naive) choice is $s_{\mathrm{SD}}:= -\nabla f(x)$, the steepest descent direction, so that $\theta_{k}=0$ every step. It can be thought of as a first-order greedy algorithm that solves $s_{\mathrm{SD}}:= \underset{\| s \|=\| \nabla f(x) \| }{\arg\min}~f(x)+s^{T}\nabla f,$i.e. the best direction to descend down a hyperplane approximation of $f$.
With this choice of $s$, we have $| s_{k}^{T}\nabla f(x_{k}) |=\| \nabla f(x_{k}) \|^{2}$ and $| s_{k}^{T}\nabla f(x_{k}) |/\| s_{k} \|=\| \nabla f(x_{k}) \|$, so the Armijo convergence result guarantees that the gradient converges: either (1) it vanishes after finitely many steps, or (2) its norm $\| \nabla f(x_{k}) \|\to 0$.
Unfortunately, this convergence is only linear, and potentially very slow:
> [!theorem|*] Steepest Descent Convergence
> With steepest descent, if $x_{k}\to x^{\ast}$ for some local minimizer $x^{\ast}$ with $\nabla^{2} \!f(x^{\ast}) \succ 0$, then the convergence happens linearly:
> $\| x_{k}-x^{\ast} \| \le \rho \| x_{k-1}-x^{\ast} \|, $and the rate $\rho$ has upper bound $\rho \le \frac{\kappa(x^{\ast})-1}{\kappa(x^{\ast})+1}=:\rho_{\mathrm{SD}},$where $\kappa(x):= \kappa_{2}(\nabla^{2}f(x))$ is the [[Numerical Stability#^af2303|condition number]] of the Hessian.
- In practice, we often have $\rho \approx \rho_{\mathrm{SD}}$.
- So if $\kappa$ is huge, $\rho_{\mathrm{SD}}\approx 1$, and the convergence can be very slow: e.g. $\kappa=10^{4}$ gives $\rho_{\mathrm{SD}}\approx 0.9998$, so roughly $10^{4}$ iterations are needed for each additional digit of accuracy.
- This happens for poorly scaled problems, where curvature is large in one direction ($\lambda_{\max}(\nabla^{2}f)$ is large) and very shallow in another ($\lambda_{\min}(\nabla^{2}f)$ is tiny), e.g. the [[Rosenbrock Function|Rosenbrock function]] whose Hessian is ill conditioned in the valley:
![[RosenbrockFunctionGraph.png#invert|center|w50]]
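As a rough illustration (a sketch under assumed parameter choices, not a benchmark), running steepest descent with the backtracking/Armijo rule from above on the Rosenbrock function typically takes many thousands of iterations to drive the gradient norm below even a modest tolerance, consistent with $\rho_{\mathrm{SD}}\approx 1$:
```python
import numpy as np

def rosenbrock(x):
    # f(x, y) = (1 - x)^2 + 100 (y - x^2)^2, with minimiser (1, 1).
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def rosenbrock_grad(x):
    return np.array([
        -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
        200 * (x[1] - x[0]**2),
    ])

def steepest_descent(x0, tol=1e-4, max_iter=200_000, alpha0=1.0, tau=0.5, beta=1e-2):
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        grad = rosenbrock_grad(x)
        if np.linalg.norm(grad) <= tol:          # termination test from the intro
            return x, k
        s = -grad                                 # steepest descent direction
        alpha, fx, slope = alpha0, rosenbrock(x), grad @ s
        while rosenbrock(x + alpha * s) > fx + beta * alpha * slope:
            alpha *= tau                          # backtracking with Armijo
        x = x + alpha * s
    return x, max_iter

x_final, iters = steepest_descent([-1.2, 1.0])
print(x_final, iters)   # approaches (1, 1), but only after a very large number of iterations
```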