Linesearch algorithms solve the unconstrained optimisation problem $\min_{x}f(x)$ by iteratively updating a solution $x_{k}$ in some (descending) direction $s_{k}$ with a step size $\alpha_{k}$, until sufficient convergence (e.g. when $\| \nabla_{\!x} f(x_{k})\|\le \epsilon$ for some predetermined tolerance $\epsilon > 0$).

## Choosing the Step Size $\alpha$

Note that if $\nabla f$ is [[Lipschitz Continuity|Lipschitz continuous]] with constant $L$, then the step size $\alpha = L^{-1}$ with descent direction $s=-\nabla f$ gives a descent of at least $f(x_{k})-f(x_{k+1}) \ge \frac{1}{2L}\| \nabla f(x_{k}) \|^{2},$ using [[Lipschitz Continuity#^214462|this lemma]]. However, this is of limited practical use, because for most applications we do not know $L$.

### Exact Linesearch

In rare cases, such as a quadratic objective, it is possible to do **exact linesearch**, i.e. finding the optimal $\alpha$ given $x,s$: for a quadratic objective given by $q(x)=g^{T}x+ x^{T}Hx,$ exact linesearch solves $\begin{align*} \alpha&= \underset{\alpha}{\arg\min}~q(x+\alpha s)\\ &= \underset{\alpha}{\arg\min}~g^{T}(x+\alpha s)+ (x+\alpha s)^T H(x+\alpha s)\\ &= \underset{\alpha}{\arg\min}~ \alpha(g+2Hx)^{T}s+ \alpha^{2}s^T Hs\\ &= \underset{\alpha}{\arg\min}~\alpha \nabla q(x)^{T} s+ \alpha^{2}s^{T}Hs, \end{align*}$ by dropping constant terms that do not depend on $\alpha$. Taking the objective's derivative wrt. $\alpha$ and setting it to $0$, we have $\begin{align*} (\nabla q)^{T}s+2\alpha s^{T}Hs=0, \end{align*}$ and solving gives the desired optimal step $\alpha ^{\ast}= -\frac{(\nabla q)^{T}s}{2s^{T}Hs}.$
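As a concrete check of this formula, here is a minimal NumPy sketch (the function name and the small example quadratic are illustrative, not from the notes):

```python
import numpy as np

def exact_step_quadratic(g, H, x, s):
    """Exact linesearch step for q(x) = g^T x + x^T H x along direction s:
    alpha* = -(grad q(x)^T s) / (2 s^T H s), assuming s^T H s > 0.
    """
    grad_q = g + 2.0 * H @ x        # grad q(x) = g + 2 H x  (H symmetric)
    return -(grad_q @ s) / (2.0 * s @ H @ s)

# Tiny example: a 2D quadratic, stepping along the steepest-descent direction.
H = np.array([[2.0, 0.0], [0.0, 1.0]])
g = np.array([1.0, -1.0])
x = np.array([1.0, 1.0])
s = -(g + 2.0 * H @ x)              # s = -grad q(x), a descent direction
alpha_star = exact_step_quadratic(g, H, x, s)
```

Note that $s$ being a descent direction makes the numerator $-(\nabla q)^{T}s$ positive, so $\alpha^{\ast}>0$ whenever $s^{T}Hs>0$.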
### Backtracking and Armijo Condition

Without exact solutions, we still want to select step sizes that are neither too large (overshooting the minimiser) nor too small (slow or stalled convergence). One idea is to take a large step, then reduce it until it provides a satisfactory amount of descent: for descending from $x:=x_{k}$ in the direction $s:= s_{k}$,

> [!algorithm] Backtracking (rough idea)
> Select an $\alpha=\alpha^{(0)}$ large enough, and a backtracking rate $\tau \in (0,1)$.
> While $f(x+\alpha s)$ is "too large", $\alpha \leftarrow \tau \alpha$.
> Return $\alpha$.

We can make "too large" precise by requiring the descent to be non-negligible, for example, requiring the average slope of the descent to be at least as steep as $\beta \cdot \nabla f(x)^{T}s$ for some $\beta \in (0,1)$: $f(x+\alpha s) \overset{?}{ \le} f(x) + \beta\cdot \alpha \nabla f(x)^{T}s;$ usual choices of $\beta$ are very small, say $10^{-1}\sim 10^{-3}$. This is known as the **Armijo condition**.

![[Armijo.excalidraw|center]]

> [!theorem|*] Guaranteed interval for the Armijo condition
> When $\nabla f$ is Lipschitz with constant $L$, there is an interval $[0, \alpha^{(L)}]$ on which the Armijo condition always holds, so there exists a step that satisfies it: $\alpha^{(L)}(x):= \frac{(\beta-1)s^{T}\nabla f}{L\| s \|^{2} }.$
> It depends on $x$ through $\nabla f$, and potentially the choice of $s$.
>
> > [!proof]-
> > Consider the function $\phi(\alpha):= f(x+\alpha s)$. Then $\phi'$ is Lipschitz continuous with constant $\tilde{L}:= L \| s \|^{2}$ (i.e. the denominator of $\alpha^{(L)}$), as $\begin{align*}
> > | \phi'(\alpha_{1})-\phi'(\alpha_{2}) | &= | s^{T}(\nabla f(x+\alpha_{1}s)-\nabla f(x+\alpha_{2}s)) |\\
> > &\le \| s \|\cdot \| \nabla f(x+\alpha_{1}s)-\nabla f(x+\alpha_{2}s) \|&&[\text{C-S}] \\
> > &\le \| s \| \cdot L \cdot\| \alpha_{1}s-\alpha_{2}s\| &&[\nabla\! f\text{ is Lipschitz}]\\
> > &= \tilde{L}| \alpha_{1}-\alpha_{2} |.
> > \end{align*}$
> >
> > Therefore, $\phi'(\alpha) \le \phi'(0)+\tilde{L} \alpha$. Solving $\phi'(0)+\tilde{L}\alpha \le \beta \nabla f^{T}s$ for the largest $\alpha$ (which is sufficient for $\phi'(\alpha) \le \beta \nabla f^{T}s$) gives the bound $\alpha^{(L)}$.
> >
> > To rigorously prove that any $\alpha\in [0, \alpha^{(L)}]$ satisfies Armijo, we use the MVT to find $\xi\in (0,\alpha)$ such that $\phi(\alpha)-\phi(0)=\alpha\phi'(\xi)$; since $\phi'(\xi)\le \beta \nabla f^{T}s$, the descent satisfies the Armijo condition.

- Recall that $s$ is chosen to be a descent direction, so $s^{T}\nabla f<0$; also $\beta<1$, so the numerator is positive.
- With $\beta = \frac{1}{2}$ and $L=\| H \|$ (which is the case for quadratics), we recover the exact linesearch solution for a quadratic (approximation of the) objective.

> [!lemma|*] Backtracking with Armijo
> Suppose $\nabla f$ is Lipschitz continuous with constant $L$, and $\alpha_{k}^{(L)}:= \alpha^{(L)}(x_{k})$ bounds the interval on which Armijo is guaranteed to hold. Then the step size $\alpha_{k}$ selected by backtracking with Armijo satisfies $\alpha_{k} \ge \min(\alpha^{(0)}, \tau \alpha^{(L)}_{k}),$
> where $\tau$ is the backtracking rate and $\alpha^{(0)}$ the initial step size in the linesearch.
>
> > [!proof]
> > Either $\alpha^{(0)}\le \alpha_{k}^{(L)}$, so the first guess already satisfies the Armijo condition, or we shrink it.
> >
> > In the latter case, the last step size that failed Armijo must lie outside the guaranteed interval, i.e. it is $> \alpha_{k}^{(L)}$. After the final shrink, the accepted step size is therefore $> \tau \alpha_{k}^{(L)}$.
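In code, the backtracking loop with the Armijo test is only a few lines; here is a minimal sketch, assuming the caller supplies `f`, `grad_f` and a descent direction `s` (all names, parameter defaults, and the usage example are illustrative):

```python
import numpy as np

def backtracking_armijo(f, grad_f, x, s, alpha0=1.0, tau=0.5, beta=1e-2,
                        max_shrinks=50):
    """Shrink alpha from alpha0 by a factor tau until the Armijo condition
    f(x + alpha*s) <= f(x) + beta * alpha * grad_f(x)^T s holds.

    Assumes s is a descent direction, i.e. grad_f(x)^T s < 0.
    """
    fx = f(x)
    slope = grad_f(x) @ s           # grad f(x)^T s, negative for descent directions
    alpha = alpha0
    for _ in range(max_shrinks):
        if f(x + alpha * s) <= fx + beta * alpha * slope:
            break                   # Armijo satisfied
        alpha *= tau                # backtrack
    return alpha

# Usage sketch: f(x) = ||x||^2 with the steepest-descent direction s = -grad f(x).
f = lambda x: x @ x
grad_f = lambda x: 2.0 * x
x = np.array([3.0, -4.0])
alpha = backtracking_armijo(f, grad_f, x, s=-grad_f(x))
```

The lemma above guarantees that, for Lipschitz $\nabla f$, only finitely many shrinks are needed and the accepted step is at least $\min(\alpha^{(0)}, \tau\alpha^{(L)}_{k})$; the `max_shrinks` cap is only a safety limit.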
When we use step sizes $\alpha_{k}$ that satisfy the Armijo condition, each step has a guaranteed amount of descent: $\begin{align*} f(x_{0}) -f(x_{1})&\ge -\beta \alpha_{0} \cdot s_{0}^{T}\nabla f(x_{0}) \\ f(x_{1})-f(x_{2}) &\ge -\beta \alpha_{1}\cdot s_{1}^{T}\nabla f(x_{1}) \\ &~~\vdots\\ f(x_{k-1})-f(x_{k}) &\ge -\beta \alpha_{k-1}\cdot s_{k-1}^{T}\nabla f(x_{k-1})\\ \end{align*}$ and summing them together gives $f(x_{0}) -f(x_{k}) \ge -\beta \sum_{j < k}\alpha_{j} s_{j}^{T}\nabla f(x_{j}),$ recalling that $s_{j}^{T}\nabla f(x_{j}) < 0$ due to the choice of $s_{j}$ as descent directions.

> [!theorem|*] Convergence of Linesearch Gradients
> Suppose $f$ is bounded below by $f(x^{\ast})$, and $\nabla f$ is Lipschitz continuous with constant $L$.
>
> If we do not terminate the linesearch iterations (e.g. with $\epsilon=0$), then either (1) the algorithm converges to some $x:\nabla f(x)=0$ in finitely many iterations, or (2) the "amount" of descent goes to $0$ in the sense that $\min \left( | s^{T}_{k}\nabla f(x_{k}) |, \frac{| s^{T}_{k}\nabla f(x_{k}) |}{\| s_{k} \| } \right) \to 0.$
>
> > [!proof]-
> > If (1) does not happen, then the total amount of descent is $\lim_{k \to \infty}f(x_{0})-f(x_{k}) \ge -\beta\sum_{k}\alpha_{k}s^{T}_{k}\nabla f(x_{k}),$
> > which is bounded above by $f(x_{0})-f(x^{\ast})$. Since every term $\alpha_{k}s_{k}^{T}\nabla f(x_{k})$ is negative, the series converges absolutely with
> > $\sum_{k}\alpha_{k}| s_{k}^{T}\nabla f(x_{k}) | < \infty.$
> >
> > Consider the following two cases in determining Armijo step sizes: $\begin{align*}
> > \mathcal{K}_{0}&:= \{ k: \alpha^{(0)} \le \alpha^{(L)}_{k} \},\\
> > \mathcal{K}_{L}&:=\{ k: \alpha^{(0)} > \alpha^{(L)}_{k} \}.
> > \end{align*}$
> > For the former, we simply have $\alpha_{k}=\alpha^{(0)}$, so
> > $\sum_{k \in \mathcal{K}_{0}}\alpha_{k}| s_{k}^{T}\nabla f(x_{k}) | = \alpha^{(0)}\sum_{k\in \mathcal{K}_{0}}| s_{k}^{T}\nabla f(x_{k}) |<\infty,$
> > and $| s_{k}^{T}\nabla f(x_{k}) | \to 0$ along $k \in \mathcal{K}_{0}$.
> >
> > For the latter, we have $\alpha_{k} \ge \tau \alpha^{(L)}_{k}=\frac{\tau(\beta-1) s_{k}^{T}\nabla f(x_{k}) }{L\| s_{k} \|^{2}}$, so substituting, we have $\frac{\tau (1-\beta)} {L}\sum_{k\in \mathcal{K}_{L}} \frac{| s_{k}^{T}\nabla f(x_{k}) |^{2}}{\| s_{k} \| ^{2}} \le \sum_{k \in \mathcal{K}_{L}}\alpha_{k}| s_{k}^{T}\nabla f(x_{k}) | <\infty,$ so $\frac{| s_{k}^{T}\nabla f(x_{k}) |^{2}}{\| s_{k} \| ^{2}} \to 0$ along $k \in \mathcal{K}_{L}$; this is equivalent to $| s_{k} ^{T} \nabla f(x_{k})| / \| s_{k} \| \to 0$ (without the square).
> >
> > Combining the two sets, we conclude that $\min\left( | \nabla f(x_{k})^{T}s_{k} |, \frac{| \nabla f(x_{k})^{T}s_{k} |}{\| s_{k} \| } \right)\to 0.$

^3b76ba

> [!proof]
> Alternatively: we know that the step size is at least $\alpha_{k}\ge \min (\alpha^{(0)}, \tau\alpha^{(L)}_{k})= \min\left( \alpha^{(0)}, \tau(\beta-1) \frac{\nabla f(x_{k})^{T}s_{k}}{L\| s_{k} \|^{2}} \right)$, and the Armijo condition guarantees a descent of at least $-\beta\alpha_{k}\nabla f(x_{k})^{T}s_{k}$, so the descent at each step satisfies $\begin{align*} f(x_{k})-f(x_{k+1})&\ge -\beta \alpha_{k} \nabla f(x_{k})^{T}s_{k}\\ &\ge \beta\min\left( \alpha^{(0)} | \nabla f(x_{k})^{T}s_{k} |, \frac{\tau (1-\beta)}{L} \frac{| \nabla f(x_{k})^{T}s_{k}| ^{2}}{\| s_{k} \|^{2} } \right). \end{align*}$
> Since $f$ is bounded below, the sum of these descents over $k$ is finite, and since they are all non-negative, the terms must go to $0$. Because multiplying by positive constants and squaring does not change convergence to $0$, we have
> $\min\left(| \nabla f(x_{k})^{T}s_{k} |, \frac{| \nabla f(x_{k})^{T}s_{k} |}{\| s_{k} \| }\right) \to 0.$

Equivalently, defining $\theta_{k}$ by $\cos(\theta_{k})= \frac{| s_{k}^{T}\nabla f(x_{k}) |}{\| s_{k} \|\cdot\| \nabla f(x_{k}) \|}$, the angle between $s_{k}$ and the steepest-descent direction $-\nabla f(x_{k})$, we have $\| \nabla f(x_{k}) \|\cdot \cos(\theta_{k})\cdot \min(\| s_{k} \|,1 )\to 0.$ Since we wish to drive $\| \nabla f \|$ to $0$, we now want $\cos(\theta_{k})$ (and also the $\min$ term, though that is usually not an issue) to be bounded away from $0$ by choosing a proper direction $s$.

## Choosing the Descent Direction $s$

A wide class of descent directions is derived as [[Second-Order Search Directions]].

### Steepest Descent

The most obvious (and naive) choice is $s_{\mathrm{SD}}:= -\nabla f(x)$, the steepest-descent direction, so that $\theta_{k}=0$ at every step. It can be thought of as a first-order greedy algorithm that solves $s_{\mathrm{SD}}:= \underset{\| s \|=\| \nabla f(x) \| }{\arg\min}~f(x)+s^{T}\nabla f,$ i.e. the best direction to descend down a hyperplane (first-order) approximation of $f$.

With this choice of $s$, the Armijo convergence result guarantees the gradient to converge: either (1) it vanishes after finitely many steps, or (2) its norm $\| \nabla f(x_{k}) \|\to 0$.
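Putting the pieces together, here is a minimal sketch of the full loop (steepest descent with backtracking-Armijo linesearch and the gradient-norm stopping test from the introduction; all parameter defaults are illustrative):

```python
import numpy as np

def steepest_descent(f, grad_f, x0, alpha0=1.0, tau=0.5, beta=1e-2,
                     eps=1e-8, max_iter=10_000):
    """Steepest descent with backtracking-Armijo linesearch.

    Iterates x_{k+1} = x_k + alpha_k * s_k with s_k = -grad_f(x_k),
    stopping once ||grad_f(x_k)|| <= eps.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:              # gradient-norm stopping test
            break
        s = -g                                     # steepest-descent direction
        fx, slope, alpha = f(x), g @ s, alpha0     # slope = grad f(x)^T s < 0
        while f(x + alpha * s) > fx + beta * alpha * slope:
            alpha *= tau                           # backtrack until Armijo holds
        x = x + alpha * s
    return x

# On a strictly convex quadratic the iterates approach the unique minimiser.
f = lambda x: x[0] ** 2 + 10 * x[1] ** 2
grad_f = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
x_min = steepest_descent(f, grad_f, x0=[1.0, 1.0])
```

On this mildly conditioned quadratic the loop exits via the gradient-norm test after a modest number of iterations; the next result explains why the same loop can be much slower when the Hessian is ill conditioned.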
Unfortunately, this convergence is only linear, and potentially very slow:

> [!theorem|*] Steepest Descent Convergence
> With steepest descent and exact Armijo linesearch, suppose $f\in \mathcal{C}^{2}$ with $x_{k}\to x^{\ast}$ for some local minimiser $x^{\ast}$ with $\nabla^{2}\! f(x^{\ast}) \succ 0$ (in particular the Hessian is not singular; being a minimiser only requires $\nabla^{2} f(x^{\ast}) \succeq 0$).
>
> Then the convergence is linear:
> $\| x_{k}-x^{\ast} \| \le \rho \| x_{k-1}-x^{\ast} \|,$
> and the rate $\rho$ has upper bound
> $\rho \le \frac{\kappa(x^{\ast})-1}{\kappa(x^{\ast})+1}=:\rho_{\mathrm{SD}},$
> where $\kappa(x):= \kappa_{2}(\nabla^{2}f(x))$ is the [[Numerical Stability#^af2303|condition number]] of the Hessian.

- In practice, we often have $\rho \approx \rho_{\mathrm{SD}}$.
- So if $\kappa$ is huge, $\rho_{\mathrm{SD}}\approx 1$, and the convergence can be very slow; see the numerical sketch below.
- This happens for poorly scaled problems, where the curvature is large in one direction ($\lambda_{\max}(\nabla^{2}f)$ is large) and very shallow in another ($\lambda_{\min}(\nabla^{2}f)$ is tiny), e.g. the [[Rosenbrock Function|Rosenbrock function]], whose Hessian is ill conditioned in the valley:

![[RosenbrockFunctionGraph.png#invert|center|w50]]
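As a quick numerical illustration (a sketch, not part of the notes): on the quadratic $f(x)=\frac{1}{2}x^{T}Hx$ with $\kappa(H)=100$ and a deliberately bad starting point, steepest descent with exact linesearch shrinks the error by only about $\rho_{\mathrm{SD}}=99/101\approx 0.98$ per step:

```python
import numpy as np

# Steepest descent on the ill-conditioned quadratic f(x) = 1/2 x^T H x
# (minimiser x* = 0) with kappa(H) = 100, so rho_SD = 99/101 ~ 0.98.
kappa = 100.0
H = np.diag([1.0, kappa])
rho_sd = (kappa - 1.0) / (kappa + 1.0)

x = np.array([kappa, 1.0])               # a deliberately bad starting point
for k in range(5):
    g = H @ x                            # grad f(x) = H x
    alpha = (g @ g) / (g @ H @ g)        # exact linesearch step along s = -g
    x_new = x - alpha * g
    ratio = np.linalg.norm(x_new) / np.linalg.norm(x)
    print(f"step {k}: error ratio = {ratio:.4f}   (rho_SD = {rho_sd:.4f})")
    x = x_new
```

At this rate roughly 35 iterations are needed just to halve the error, and the situation only worsens as $\kappa$ grows.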