> [!tldr] Power Iterations
> Power iterations are a family of iterative algorithms for the [[Eigenvalue Problems|eigenvalue problem]]. They are mainly of theoretical interest or serve as building blocks of more sophisticated algorithms.
We assume that $A$ has a basis of eigenvectors $v_{i}$ (all unit-length), corresponding to eigenvalues $\lambda_{i}$, where the $\lambda_{i}$ are distinct and ordered by decreasing magnitude.
### The Rayleigh Quotient
The general strategy for computing an eigenvalue $\lambda$ is to approximate its eigenvector $v$ with some estimate $x$, then compute the **Rayleigh quotient function** $r(x):= x^{T}Ax / x^{T}x$.
It has the neat property that *the estimate $\hat{\lambda}:=r(x)$ is quadratically accurate:*
> [!exposition] Accuracy of the Rayleigh Quotient
> For symmetric $A$, the quotient has gradient $\nabla r(x)=\frac{2}{x^{T}x}(A-r(x)I)x$, so in particular $\nabla r(v)=0$ for any eigenvector $v$. Therefore, if the estimate is $x=v+\delta x$ for an eigenvector $v$ with eigenvalue $\lambda$, a Taylor expansion gives $r(x)=r(v)+O(\| \delta x \|^{2} )=\lambda+O(\| \delta x \|^{2} ).$
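As a quick sanity check, here is a minimal NumPy sketch (the matrix, random seed, and perturbation sizes are arbitrary choices) that perturbs an eigenvector and watches the Rayleigh quotient error shrink quadratically:
```python
import numpy as np

rng = np.random.default_rng(0)

# Symmetric test matrix with a known eigenpair (lambda, v).
A = rng.standard_normal((5, 5))
A = (A + A.T) / 2
eigvals, eigvecs = np.linalg.eigh(A)
lam, v = eigvals[-1], eigvecs[:, -1]   # largest eigenvalue and its eigenvector

def rayleigh(A, x):
    """Rayleigh quotient r(x) = x^T A x / x^T x."""
    return (x @ A @ x) / (x @ x)

# Shrink the perturbation by 10x each time; the error in r(x) should shrink ~100x.
for delta in (1e-1, 1e-2, 1e-3):
    x = v + delta * rng.standard_normal(5)
    print(f"delta = {delta:.0e},  |r(x) - lambda| = {abs(rayleigh(A, x) - lam):.2e}")
```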
### Vanilla Power Iteration
> [!bigidea]
> In its vanilla form, power iteration estimates the eigenvector with the largest-magnitude eigenvalue.
Let's assume that $| \lambda_{1} | > | \lambda_{i} |$ for all $i > 1$, and let $v_{1}$ be the corresponding eigenvector.
> [!algorithm] Power Iteration
> $[0]$ Initialize some vector $x=x_{0}$.
> For $[k=1,\dots]$ until convergence:
> - Left apply $A$ to get $x \leftarrow Ax$;
> - Normalize to get $x \leftarrow x / \| x \|_{2}$.
>
> $[\mathrm{END}]$ Return the estimate $\hat{\lambda}_{1}=x^{T}Ax$.
- Theoretically we could skip the normalization step and just return $r(x)$ at the end, but in practice normalizing prevents over/underflow.
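Here is a minimal NumPy sketch of this loop; the tolerance, iteration cap, and the test matrix at the bottom are illustrative choices, not part of the algorithm:
```python
import numpy as np

def power_iteration(A, x0, tol=1e-10, max_iter=1000):
    """Estimate the dominant eigenpair of A by repeatedly applying A."""
    x = x0 / np.linalg.norm(x0)
    lam = x @ A @ x
    for _ in range(max_iter):
        x = A @ x                      # left-apply A
        x /= np.linalg.norm(x)         # normalize to avoid over/underflow
        lam_new = x @ A @ x            # Rayleigh quotient estimate (x is unit-length)
        if abs(lam_new - lam) < tol:
            return lam_new, x
        lam = lam_new
    return lam, x

# Example: dominant eigenvalue of a small symmetric matrix.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
lam1, v1 = power_iteration(A, np.ones(2))
print(lam1, np.linalg.eigvalsh(A)[-1])   # the two values should agree
```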
If we decompose the initial vector $x=x_{0}$ in terms of the eigenvector basis:
$x=\sum_{i=1}^{n}\alpha_{i}v_{i},$
then applying $A$ repeatedly drives it toward (a multiple of) $v_{1}$:
$\begin{align*}
A^{k}x&= A^{k}\sum_{i=1}^{n}\alpha_{i}v_{i}\\
&=\sum_{i=1}^{n}\alpha_{i}\lambda_{i}^{k}v_{i}\\
&= \lambda_{1}^{k}\left[\alpha_{1}v_{1}+
\underbrace{\sum_{i=2}^{n}\alpha_{i} \left( \frac{\lambda_{i}}{\lambda_{1}} \right)^{k} v_{i}}_{\to 0}
\right]\\
&\underset{\approx}{\propto} v_{1}
\end{align*}$
where each ratio satisfies $| \lambda_{i} / \lambda_{1} | < 1$ by assumption.
However, this algorithm has many drawbacks:
* It is limited to one eigenvector (the one with the largest eigenvalue).
* It only has **linear convergence** with rate $| \lambda_{2} / \lambda_{1} |$, which could be slow if $| \lambda_{2} | \approx | \lambda_{1} |$ (see the sketch after this list).
* Each iteration requires a matrix-vector product $x \mapsto Ax$, which costs $O(n^{2})$ for a dense $A$.
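To see the linear rate concretely, here is a small self-contained comparison of power iteration on two diagonal matrices, one with well-separated eigenvalues and one with $| \lambda_{2} |\approx| \lambda_{1} |$ (the specific numbers are arbitrary):
```python
import numpy as np

def error_after(A, v1, k):
    """Error of power iteration (vs. the known eigenvector v1) after k steps."""
    x = np.ones(A.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(k):
        x = A @ x
        x /= np.linalg.norm(x)
    return np.linalg.norm(x - np.sign(x @ v1) * v1)   # compare up to sign

v1 = np.array([1.0, 0.0, 0.0])
well_separated = np.diag([1.0, 0.5, 0.1])    # rate |lambda_2 / lambda_1| = 0.5
clustered      = np.diag([1.0, 0.99, 0.1])   # rate |lambda_2 / lambda_1| = 0.99

for k in (10, 20, 40):
    print(k, error_after(well_separated, v1, k), error_after(clustered, v1, k))
# The first error shrinks roughly like 0.5^k, the second only like 0.99^k.
```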
If we decompose $A^{k}=QR$, then since $R$ is upper triangular, $QRe_{1}=r_{11}Qe_{1}$, so the first column of $Q$ satisfies $q_{1}=Qe_{1}\propto QRe_{1}=A^{k}e_{1}\to v_{1}$: it is exactly what power iteration started from $e_{1}$ produces.
### Shifted Inverse Power Iteration
We can improve on the algorithm by applying $(A-s I)^{-1}$ at each step, where $s$ is a parameter controlling the amount of **shift**.
> [!algorithm] Shifted Inverse Power Iteration
> $[0]$ Initialize some vector $x=x_{0}$.
> For $[k=1,\dots]$ until convergence:
> - $x \leftarrow (A-s I)^{-1}x$;
> - Normalize to get $x \leftarrow x / \| x \|_{2}$.
>
> $[\mathrm{END}]$ Return $\hat{\lambda}=x^{T}Ax$ as the estimate of the eigenvalue closest to $s$.
The matrix $(A-sI)^{-1}$ has eigenvalues $\left( \frac{1}{\lambda_{i}-s} \right)_{i=1}^{n},$ so if $s = \lambda_{i}+\epsilon$ for some $i$ and small $\epsilon$, the iteration still converges linearly, but with the much smaller contraction factor
$\underset{\text{with shift}}{\left| \frac{\epsilon}{\lambda_{\dots}-s} \right|^{k} } \ll \underset{\text{without shift}}{\left| \frac{\lambda_{2}}{\lambda_{1}} \right|^{k}},$
where $\lambda_{\dots}$ is the eigenvalue next closest to $s$.
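A minimal sketch with a fixed shift $s$; for brevity it calls `np.linalg.solve` on the shifted matrix in every iteration, whereas in practice one would factor $A-sI$ once and reuse the factorization (the test matrix and shift below are arbitrary):
```python
import numpy as np

def shifted_inverse_iteration(A, s, x0, tol=1e-12, max_iter=100):
    """Estimate the eigenpair of A whose eigenvalue is closest to the shift s."""
    n = A.shape[0]
    M = A - s * np.eye(n)              # shifted matrix (factor once in practice)
    x = x0 / np.linalg.norm(x0)
    lam = x @ A @ x
    for _ in range(max_iter):
        x = np.linalg.solve(M, x)      # apply (A - sI)^{-1}
        x /= np.linalg.norm(x)
        lam_new = x @ A @ x            # Rayleigh quotient of the current iterate
        if abs(lam_new - lam) < tol:
            return lam_new, x
        lam = lam_new
    return lam, x

A = np.diag([3.0, 1.0, -2.0])
print(shifted_inverse_iteration(A, s=0.8, x0=np.ones(3))[0])   # converges to 1.0
```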
The shift can also be dynamic; for example, the **Rayleigh quotient shift** uses $s_{k}:= x_{k-1}^{T}Ax_{k-1} \approx \lambda_{i}$ on the $k$th step, where $x_{k-1}$ is the (normalized) eigenvector estimate from the $(k-1)$th step.
In each iteration, we spend $O(n^{3})$ time computing $(A-s_{k} I)^{-1}x$ (by solving a linear system), but in return we get much better convergence:
> [!theorem|*] Cubic Convergence with Rayleigh Shifts
>
> *Using Rayleigh shifts gives cubic convergence* in the sense that
> $\begin{align*}\| x^{(k+1)}-v_{i} \| &= O(\| x^{(k)}-v_{i} \|^{3} ) .\end{align*} $
>
>
> > [!proof]-
> >
> > Let $\delta x:=x^{(k)}-v_{i}$. In the $(k+1)$th iteration, $\begin{align*}
> > x^{(k+1)}&\propto(A-s_{k+1}I)^{-1}x^{(k)}\\
> > &= \frac{1}{\lambda_{i}-s_{k+1}}v_{i}+O(\| \delta x \| ) \\
> > &= \frac{1}{\lambda_{i}-\hat{\lambda}^{(k)}}v_{i}+O(\| \delta x \| ).
> > \end{align*}$
> > After normalizing, the error is $\| x^{(k+1)}-v_{i} \| =O(| \hat{\lambda}^{(k)}-\lambda_{i} |\cdot \| \delta x \| )=O(\| \delta x\|^{3} ),$
> > since the shift is the Rayleigh quotient $\hat{\lambda}^{(k)}=r(x^{(k)})$, whose error is $| \hat{\lambda}^{(k)}-\lambda_{i} |=| r(x^{(k)})-r(v_{i}) |=O(\| x^{(k)}-v_{i} \|^{2} )$.
>
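The following sketch runs the Rayleigh-shifted iteration on a random symmetric matrix (an arbitrary test case, assuming symmetric $A$ as elsewhere in this note) and prints how quickly the eigenvalue error collapses:
```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
A = (A + A.T) / 2                         # symmetric test matrix
true_eigs = np.linalg.eigvalsh(A)         # reference eigenvalues for measuring error

n = A.shape[0]
x = rng.standard_normal(n)
x /= np.linalg.norm(x)
for k in range(4):
    s = x @ A @ x                                  # Rayleigh quotient shift
    x = np.linalg.solve(A - s * np.eye(n), x)      # one step of shifted inverse iteration
    x /= np.linalg.norm(x)
    lam_hat = x @ A @ x
    err = np.min(np.abs(true_eigs - lam_hat))      # distance to the nearest true eigenvalue
    print(f"step {k + 1}: eigenvalue error = {err:.2e}")
# The error should collapse to machine precision within a handful of steps.
```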
### Simultaneous Iterations
> [!bigidea]
> Applying the power iteration $A^{k}$ to multiple vectors and then orthogonalizing them with Gram-Schmidt yields convergence to the leading eigenvectors of $A$ simultaneously.
Although the power iteration only operates on one vector, there is nothing preventing us from applying it to multiple vectors $x_{1},\dots,x_{s}$: $X \mapsto AX \mapsto A^{2}X \mapsto \dots,$ where $x_{1},\dots,x_{s}$ are the columns of $X$.
However, all columns of $A^{k}X$ converge to something $\propto v_{1}$, so to extract more eigenvectors, we need to "remove" the components proportional to $v_{1}$.
Making $A^{k}x_{2}$ orthogonal to $v_{1}$ reveals its component proportional to $v_{2}$: $\begin{align*}
A^{k}x_{2}&= \sum_{i=1}^{n}\alpha_{i}\lambda_{i}^{k}v_{i}
\\ &= \lambda_{2}^k
\Bigg[\,
\underbrace{\alpha_{1}
\left(\frac{\lambda_{1}}{\lambda_{2}}\right)^{k}v_{1}}_{\substack{\text{removed by}\\\text{orthogonalization}}} +
\alpha_{2}v_{2}+
\underbrace{\sum_{i=3}^{n}\alpha_{i}
\left(\frac{\lambda_{i}}{\lambda_{2}}\right)^{k}v_{i}}_{\to 0}
\Bigg]
\\ &\underset{\approx}{\propto} v_{2}
\end{align*}$ and the rate of convergence is $O(| \lambda_{3} / \lambda_{2} |^{k})$.
In general, removing the components proportional to $v_{1},\dots,v_{s-1}$ from $A^{k}x_{s}$ gives something proportional to $v_{s}$: $\begin{align*}
A^{k}x_{s}&= \sum_{i=1}^{n}\alpha_{i}\lambda_{i}^{k}v_{i}
\\ &= \lambda_{s}^k
\Bigg[\,
\underbrace{\sum_{i=1}^{s-1}\alpha_{i}
\left(\frac{\lambda_{i}}{\lambda_{s}}\right)^{k}v_{i}}_{\text{removed}} +
\alpha_{s}v_{s}+
\underbrace{\sum_{i=s+1}^{n}\alpha_{i}
\left(\frac{\lambda_{i}}{\lambda_{s}}\right)^{k}v_{i}}_{\to 0}
\Bigg]
\\ &\underset{\approx}{\propto} v_{s}
\end{align*}$ assuming $| \lambda_{s} | > | \lambda_{j} |$ for all $j>s$, so that the trailing terms converge to $0$; the rate of convergence is then $O(| \lambda_{s+1} / \lambda_{s} |^{k})$.
So if we apply the Gram-Schmidt process followed by normalization to the columns of $A^{k}X$, mapping $A^{k}x_{i} \mapsto q_{i}$:
* For $i=1$, the power iteration gives $A^{k}x_{1}\underset{\approx}{\propto} v_{1}$; G-S only rescales the first column, so $q_{1} \underset{\approx}{\propto} v_{1}$.
* For $i=2$, G-S makes $A^{k}x_{2}$ orthogonal to $A^{k}x_{1}\underset{\approx}{\propto} v_{1}$, hence $q_{2}\underset{\approx}{\propto} v_{2}$.
* For general $s$, $A^{k}x_{s}$ is made orthogonal to $q_{1},\dots,q_{s-1}\approx \pm v_{1},\dots,\pm v_{s-1}$, hence $q_{s}\underset{\approx}{\propto}v_{s}$.
* Lastly, since both $q_{s}$ and $v_{s}$ are unit vectors, $q_{s}\approx \pm v_{s}$. Altogether, $\{ A^{k}x_{i} \} \xmapsto{\text{G-S}}\{ q_{i} \} \approx \{ \pm v_{i} \},$ where the $\pm$ doesn't really matter, because both $\pm v_{s}$ are eigenvectors of $A$ with eigenvalue $\lambda_{s}$. A code sketch of this procedure follows below.
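Here is a minimal sketch of simultaneous iteration; the reduced QR factorization plays the role of Gram-Schmidt plus normalization, and the block size $s$, iteration count, and test matrix are illustrative choices:
```python
import numpy as np

def simultaneous_iteration(A, s, num_iter=200, seed=0):
    """Approximate the s dominant eigenpairs of A by power iteration + orthogonalization."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((A.shape[0], s))
    Q, _ = np.linalg.qr(X)                 # orthonormal starting block
    for _ in range(num_iter):
        Q, _ = np.linalg.qr(A @ Q)         # apply A, then re-orthonormalize (Gram-Schmidt step)
    eigvals = np.diag(Q.T @ A @ Q)         # Rayleigh quotient estimates of lambda_1, ..., lambda_s
    return eigvals, Q

A = np.diag([5.0, 3.0, 1.0, 0.5])
vals, Q = simultaneous_iteration(A, s=2)
print(vals)            # approximately [5., 3.]
print(np.round(Q, 3))  # columns approximately +/- e_1 and +/- e_2
```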