$\alpha > \alpha_{k}^{(L)}$. After the last shrink, the final step size is $\alpha_{k} > \tau \alpha_{k}^{(L)}$.

When we use step sizes $\alpha_{k}$ that satisfy the Armijo condition, each step has a guaranteed amount of descent:
$\begin{align*} f(x_{0}) -f(x_{1})&\ge -\beta \alpha_{0} \cdot s_{0}^{T}\nabla f(x_{0}) \\ f(x_{1})-f(x_{2}) &\ge -\beta \alpha_{1}\cdot s_{1}^{T}\nabla f(x_{1}) \\ &~~\vdots\\ f(x_{k-1})-f(x_{k}) &\ge -\beta \alpha_{k-1}\cdot s_{k-1}^{T}\nabla f(x_{k-1}) \end{align*}$
and summing them together gives
$f(x_{0}) -f(x_{k}) \ge -\beta \sum_{j < k}\alpha_{j} s_{j}^{T}\nabla f(x_{j}), $
recalling that $s_{j}^{T}\nabla f(x_{j}) < 0$ due to the choice of the $s_{j}$ as descent directions.

> [!theorem|*] Convergence of Linesearch Gradients
> Suppose $f$ is bounded below by $f(x^{\ast})$, and $\nabla f$ is Lipschitz continuous with constant $L$.
>
> If we do not terminate the linesearch iterations (e.g. we set the tolerance $\epsilon=0$), then either (1) the algorithm reaches some $x:\nabla f(x)=0$ after finitely many iterations, or (2) the "amount" of descent goes to $0$ in the sense that $\min \left( | s^{T}_{k}\nabla f(x_{k}) |, \frac{| s^{T}_{k}\nabla f(x_{k}) |}{\| s_{k} \| } \right) \to 0.$
>
> > [!proof]-
> > If (1) does not happen, then the total amount of descent is
> > $\lim_{k \to \infty}f(x_{0})-f(x_{k}) \ge -\beta\sum_{k}\alpha_{k}s^{T}_{k}\nabla f(x_{k}),$
> > which is bounded above by $f(x_{0})-f(x^{\ast})$. Since every term $\alpha_{k}s_{k}^{T}\nabla f(x_{k})$ in the sum is negative, the series converges absolutely with
> > $\sum_{k}\alpha_{k}| s_{k}^{T}\nabla f(x_{k}) | < \infty.$
> >
> > Consider the following two cases in determining the Armijo step sizes:
> > $\begin{align*}
> > \mathcal{K}_{0}&:= \{ k: \alpha^{(0)} \le \alpha^{(L)}_{k} \},\\
> > \mathcal{K}_{L}&:=\{ k: \alpha^{(0)} > \alpha^{(L)}_{k} \}.
> > \end{align*}$
> > For the former, we simply have $\alpha_{k}=\alpha^{(0)}$, so
> > $\sum_{k \in \mathcal{K}_{0}}\alpha_{k}| s_{k}^{T}\nabla f(x_{k}) | = \alpha^{(0)}\sum_{k\in \mathcal{K}_{0}}| s_{k}^{T}\nabla f(x_{k}) |<\infty,$
> > and hence $| s_{k}^{T}\nabla f(x_{k}) | \to 0$ along $k \in \mathcal{K}_{0}$.
> >
> > For the latter, we have $\alpha_{k} \ge \tau \alpha^{(L)}_{k}=\frac{\tau(\beta-1) s_{k}^{T}\nabla f(x_{k}) }{L\| s_{k} \|^{2}}$, so substituting,
> > $\infty > \sum_{k \in \mathcal{K}_{L}}\alpha_{k}| s_{k}^{T}\nabla f(x_{k}) | \ge \frac{\tau | \beta-1|} {L}\sum_{k\in \mathcal{K}_{L}} \frac{| s_{k}^{T}\nabla f(x_{k}) |^{2}}{\| s_{k} \| ^{2}},$
> > so $\frac{| s_{k}^{T}\nabla f(x_{k}) |^{2}}{\| s_{k} \| ^{2}} \to 0$ along $k \in \mathcal{K}_{L}$; this is equivalent to $| s_{k} ^{T} \nabla f(x_{k})| / \| s_{k} \| \to 0$ (without the square).
> >
> > Combining the two sets, we conclude that $\min\left( | s_{k}^{T}\nabla f(x_{k}) |, \frac{| s_{k}^{T}\nabla f(x_{k}) |}{\| s_{k} \|} \right)\to 0.$
^3b76ba

> [!proof]
> We know that the step size is at least
> $\alpha_{k}\ge \min (\alpha^{(0)}, \tau\alpha^{(L)}_{k})= \min\left( \alpha^{(0)}, \tau(\beta-1) \frac{\nabla f(x_{k})^{T}s_{k}}{L\| s_{k} \|^{2}} \right),$
> and the Armijo condition guarantees a decrease of at least $-\beta\alpha_{k}\nabla f(x_{k})^{T}s_{k}=\beta\alpha_{k}| \nabla f(x_{k})^{T}s_{k} |$, so the descent at each step is at least
> $\begin{align*} f(x_{k})-f(x_{k+1})&\ge \beta\alpha_{k} | \nabla f(x_{k})^{T}s_{k} |\\ &\ge \beta\min\left( \alpha^{(0)} | \nabla f(x_{k})^{T}s_{k} |, \frac{\tau (1-\beta)}{L} \left( \frac{| \nabla f(x_{k})^{T}s_{k} |}{\| s_{k} \| } \right)^{2} \right). \end{align*}$
> Since $f$ is bounded below, the sum of these decreases over $k$ is finite, and since they are all positive, the terms must go to $0$. Because multiplying by positive constants and squaring does not change whether a sequence converges to $0$, we conclude
> $\min\left(| \nabla f(x_{k})^{T}s_{k} |, \frac{| \nabla f(x_{k})^{T}s_{k} |}{\| s_{k} \| }\right) \to 0.$

Equivalently, defining $\theta_{k}$ via $\cos(\theta_{k})= \frac{| s_{k}^{T}\nabla f(x_{k}) |}{\| s_{k} \|\cdot\| \nabla f(x_{k}) \|}$, the angle between $s_{k}$ and the steepest-descent direction $-\nabla f(x_{k})$, we have
$\| \nabla f(x_{k}) \|\cdot \cos(\theta_{k})\cdot \min(\| s_{k} \|,1 )\to 0.$
Since we wish to drive $\| \nabla f \|$ to $0$, we now want to bound $\cos(\theta_{k})$ (and also the $\min$ term, though that is usually not an issue) away from $0$ by choosing a proper direction $s$.
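To make the backtracking construction above concrete, here is a minimal Python sketch (my own illustration; the function names and the default values of $\alpha^{(0)}$, $\tau$, $\beta$ are assumptions, not taken from the notes). Starting from $\alpha^{(0)}$, the trial step is shrunk by the factor $\tau$ until the Armijo condition holds, so the accepted step is either $\alpha^{(0)}$ itself or at least $\tau$ times the last rejected step.

```python
import numpy as np

def armijo_backtracking(f, grad_f, x, s, alpha0=1.0, tau=0.5, beta=0.1):
    """Shrink the trial step by tau until the Armijo condition
    f(x + a*s) <= f(x) + beta * a * s^T grad_f(x) holds."""
    fx, slope = f(x), s @ grad_f(x)   # slope = s^T grad f(x) < 0 for a descent direction
    alpha = alpha0
    while f(x + alpha * s) > fx + beta * alpha * slope:
        alpha *= tau                  # every rejected trial step is multiplied by tau
    return alpha

def linesearch_method(f, grad_f, x0, direction, eps=1e-8, max_iter=10_000):
    """Generic linesearch iteration x_{k+1} = x_k + alpha_k * s_k with Armijo steps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:  # practical termination with tolerance eps > 0
            break
        s = direction(x, g)           # any descent direction, i.e. s^T g < 0
        x = x + armijo_backtracking(f, grad_f, x, s) * s
    return x
```

For instance, passing `direction=lambda x, g: -g` gives the steepest-descent method discussed next.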
## Choosing the Descent Direction $s$
A wide class of descent directions is derived as [[Second-Order Search Directions]].

### Steepest Descent
The most obvious (and naive) choice is $s_{\mathrm{SD}}:= -\nabla f(x)$, the steepest-descent direction, so that $\theta_{k}=0$ at every step. It can be thought of as a first-order greedy algorithm that solves
$s_{\mathrm{SD}}:= \underset{\| s \|=\| \nabla f(x) \| }{\arg\min}~f(x)+s^{T}\nabla f(x),$
i.e. the best direction to descend down a hyperplane (first-order) approximation of $f$.

With this choice of $s$, the Armijo convergence result guarantees that the gradient converges: either (1) it vanishes after finitely many steps, or (2) its norm $\| \nabla f(x_{k}) \|\to 0$.

Unfortunately, this convergence is only linear, and potentially very slow:

> [!theorem|*] Steepest Descent Convergence
> With steepest descent and exact Armijo linesearch, suppose $f\in \mathcal{C}^{2}$ and $x_{k}\to x^{\ast}$ for some local minimiser $x^{\ast}$ with $\nabla^{2} f(x^{\ast}) \succ 0$ (in particular the Hessian is not singular; being a minimiser only requires $\nabla^{2} f(x^{\ast}) \succeq 0$).
>
> Then the convergence happens linearly:
> $\| x_{k}-x^{\ast} \| \le \rho \| x_{k-1}-x^{\ast} \|, $
> and the rate $\rho$ has upper bound
> $\rho \le \frac{\kappa(x^{\ast})-1}{\kappa(x^{\ast})+1}=:\rho_{\mathrm{SD}},$
> where $\kappa(x):= \kappa_{2}(\nabla^{2}f(x))$ is the [[Numerical Stability#^af2303|condition number]] of the Hessian.

- In practice, we often have $\rho \approx \rho_{\mathrm{SD}}$.
- So if $\kappa$ is huge, $\rho_{\mathrm{SD}}\approx 1$, and the convergence can be very slow; see the numerical sketch after the figure below.
- This happens for poorly scaled problems, where curvature is large in one direction ($\lambda_{\max}(\nabla^{2}f)$ is large) and very shallow in another ($\lambda_{\min}(\nabla^{2}f)$ is tiny), e.g. the [[Rosenbrock Function|Rosenbrock function]], whose Hessian is ill-conditioned in the valley:

![[RosenbrockFunctionGraph.png#invert|center|w50]]
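To illustrate how slow $\rho_{\mathrm{SD}}\approx 1$ really is, the following sketch (my own; the value of $\kappa$, the starting point, and the iteration count are arbitrary choices for the demonstration) runs steepest descent with exact linesearch on the ill-conditioned quadratic $f(x)=\tfrac{1}{2}x^{T}Ax$ with $A=\operatorname{diag}(1,\kappa)$, where the exact step has the closed form $\alpha_{k}=\frac{g_{k}^{T}g_{k}}{g_{k}^{T}Ag_{k}}$, and compares the observed per-step contraction of $\| x_{k} \|$ with the bound $\rho_{\mathrm{SD}}$.

```python
import numpy as np

kappa = 100.0                        # condition number of the Hessian (arbitrary choice)
A = np.diag([1.0, kappa])            # f(x) = 1/2 x^T A x, minimiser x* = 0
grad = lambda x: A @ x

x = np.array([kappa, 1.0])           # a deliberately badly scaled starting point
ratios = []
for _ in range(50):
    g = grad(x)
    alpha = (g @ g) / (g @ A @ g)    # exact linesearch step for a quadratic
    x_next = x - alpha * g           # steepest descent: s = -g
    ratios.append(np.linalg.norm(x_next) / np.linalg.norm(x))
    x = x_next

rho_sd = (kappa - 1) / (kappa + 1)
print(f"observed contraction {np.mean(ratios):.4f} vs rho_SD {rho_sd:.4f}")
# With kappa = 100 both numbers are about 0.98: each step removes only ~2% of the
# remaining error, so hundreds of steps are needed for a few digits of accuracy.
```

The Rosenbrock function shows the same behaviour inside its curved valley, where the Hessian's condition number is large, which is why steepest descent zig-zags and crawls there.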