> [!tldr] Power Iterations
> Power iterations are a family of iterative algorithms for the [[Eigenvalue Problems|eigenvalue problem]]. They are mainly of theoretical interest or serve as building blocks of more sophisticated algorithms.
We assume that $A$ has a basis of eigenvectors $v_{i}$ (all unit-length), corresponding to eigenvalues $\lambda_{i}$, where the $\lambda_{i}$ are distinct and ordered by decreasing magnitude.
### The Rayleigh Quotient
The general strategy for computing an eigenvalue $\lambda$ is to approximate its eigenvector $v$ with some estimate $x$, then compute the **Rayleigh quotient function** $r(x):= x^{T}Ax / x^{T}x$.
It has the neat property that *the estimate $\hat{\lambda}:=r(x)$ is quadratically accurate:*
> [!exposition] Accuracy of the Rayleigh Quotient
> For symmetric $A$, the quotient has gradient $\nabla r(x)=\frac{2}{x^{T}x}(A-r(x)I)x$, so in particular $\nabla r(v)=0$ for any eigenvector $v$. Therefore, if the estimate is $x=v+\delta x$ for an eigenvector $v$ with eigenvalue $\lambda$, a Taylor expansion gives $r(x)=r(v)+O(\| \delta x \|^{2} )=\lambda+O(\| \delta x \|^{2} ).$
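As a quick sanity check, here is a minimal NumPy sketch (the matrix, random seed, and perturbation sizes are arbitrary choices) that perturbs an eigenvector and watches the Rayleigh quotient error shrink quadratically:
```python
import numpy as np

rng = np.random.default_rng(0)

# Symmetric test matrix with a known eigenpair (lambda, v).
A = rng.standard_normal((5, 5))
A = (A + A.T) / 2
eigvals, eigvecs = np.linalg.eigh(A)
lam, v = eigvals[-1], eigvecs[:, -1]   # largest eigenvalue and its eigenvector

def rayleigh(A, x):
    """Rayleigh quotient r(x) = x^T A x / x^T x."""
    return (x @ A @ x) / (x @ x)

# Shrink the perturbation by 10x each time; the error in r(x) should shrink ~100x.
for delta in (1e-1, 1e-2, 1e-3):
    x = v + delta * rng.standard_normal(5)
    print(f"delta = {delta:.0e},  |r(x) - lambda| = {abs(rayleigh(A, x) - lam):.2e}")
```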
### Vanilla Power Iteration
> [!bigidea]
> In its vanilla form, power iteration estimates the eigenvector with the largest-magnitude eigenvalue.
Let's assume that $| \lambda_{1} | > | \lambda_{i} |$ for all $i > 1$, and let $v_{1}$ be the corresponding eigenvector.
> [!algorithm] Power Iteration
> $[0]$ Initialize some vector $x=x_{0}$.
> For $[k=1,\dots]$ until convergence:
> - Left apply $A$ to get $x \leftarrow Ax$;
> - Normalize to get $x \leftarrow x / \| x \|_{2}$.
>
> $[\mathrm{END}]$ Return the estimate $\hat{\lambda}_{1}=x^{T}Ax$.
- Theoretically we could skip the normalization step and just return $r(x)$ at the end, but in practice normalizing prevents over/underflow.
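Here is a minimal NumPy sketch of this loop; the tolerance, iteration cap, and the test matrix at the bottom are illustrative choices, not part of the algorithm:
```python
import numpy as np

def power_iteration(A, x0, tol=1e-10, max_iter=1000):
    """Estimate the dominant eigenpair of A by repeatedly applying A."""
    x = x0 / np.linalg.norm(x0)
    lam = x @ A @ x
    for _ in range(max_iter):
        x = A @ x                      # left-apply A
        x /= np.linalg.norm(x)         # normalize to avoid over/underflow
        lam_new = x @ A @ x            # Rayleigh quotient estimate (x is unit-length)
        if abs(lam_new - lam) < tol:
            return lam_new, x
        lam = lam_new
    return lam, x

# Example: dominant eigenvalue of a small symmetric matrix.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
lam1, v1 = power_iteration(A, np.ones(2))
print(lam1, np.linalg.eigvalsh(A)[-1])   # the two values should agree
```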
If we decompose the initial vector $x=x_{0}$ in terms of the eigenvector basis:
$x=\sum_{i=1}^{n}\alpha_{i}v_{i},$
then applying $A$ repeatedly drives it toward (a multiple of) $v_{1}$:
$\begin{align*}
A^{k}x&= A^{k}\sum_{i=1}^{n}\alpha_{i}v_{i}\\
&=\sum_{i=1}^{n}\alpha_{i}\lambda_{i}^{k}v_{i}\\
&= \lambda_{1}^{k}\left[\alpha_{1}v_{1}+
\underbrace{\sum_{i=2}^{n}\alpha_{i} \left( \frac{\lambda_{i}}{\lambda_{1}} \right)^{k} v_{i}}_{\to 0}
\right]\\
&\underset{\approx}{\propto} v_{1}
\end{align*}$
where each ratio satisfies $| \lambda_{i} / \lambda_{1} | < 1$ by assumption.
However, this algorithm has many drawbacks:
* It is limited to one eigenvector (the one with the largest eigenvalue).
* It only has **linear convergence** with rate $| \lambda_{2} / \lambda_{1} |$, which could be slow if $| \lambda_{2} | \approx | \lambda_{1} |$ (see the sketch after this list).
* Each iteration requires a matrix-vector product $x \mapsto Ax$, which costs $O(n^{2})$ for a dense $A$.
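To see the linear rate concretely, here is a small self-contained comparison of power iteration on two diagonal matrices, one with well-separated eigenvalues and one with $| \lambda_{2} |\approx| \lambda_{1} |$ (the specific numbers are arbitrary):
```python
import numpy as np

def error_after(A, v1, k):
    """Error of power iteration (vs. the known eigenvector v1) after k steps."""
    x = np.ones(A.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(k):
        x = A @ x
        x /= np.linalg.norm(x)
    return np.linalg.norm(x - np.sign(x @ v1) * v1)   # compare up to sign

v1 = np.array([1.0, 0.0, 0.0])
well_separated = np.diag([1.0, 0.5, 0.1])    # rate |lambda_2 / lambda_1| = 0.5
clustered      = np.diag([1.0, 0.99, 0.1])   # rate |lambda_2 / lambda_1| = 0.99

for k in (10, 20, 40):
    print(k, error_after(well_separated, v1, k), error_after(clustered, v1, k))
# The first error shrinks roughly like 0.5^k, the second only like 0.99^k.
```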
If we decompose $A^{k}=QR$, then since $R$ is upper triangular, $QRe_{1}=r_{11}Qe_{1}$, so the first column of $Q$ satisfies $q_{1}=Qe_{1}\propto QRe_{1}=A^{k}e_{1}\to v_{1}$: it is exactly what power iteration started from $e_{1}$ produces.
### Shifted Inverse Power Iteration
We can improve on the algorithm by applying $(A-s I)^{-1}$ at each step, where $s$ is a parameter controlling the amount of **shift**.
> [!algorithm] Shifted Inverse Power Iteration
> $[0]$ Initialize some vector $x=x_{0}$.
> For $[k=1,\dots]$ until convergence:
> - $x \leftarrow (A-s I)^{-1}x$;
> - Normalize to get $x \leftarrow x / \| x \|_{2}$.
>
> $[\mathrm{END}]$ Return $\hat{\lambda}=x^{T}Ax$ as the estimate of the eigenvalue closest to $s$.
The matrix $(A-sI)^{-1}$ has eigenvalues $\left( \frac{1}{\lambda_{i}-s} \right)_{i=1}^{n},$ so if $s = \lambda_{i}+\epsilon$ for some $i$ and small $\epsilon$, the iteration still converges linearly, but with the much smaller contraction factor
$\underset{\text{with shift}}{\left| \frac{\epsilon}{\lambda_{\dots}-s} \right|^{k} } \ll \underset{\text{without shift}}{\left| \frac{\lambda_{2}}{\lambda_{1}} \right|^{k}},$
where $\lambda_{\dots}$ is the eigenvalue next closest to $s$.
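A minimal sketch with a fixed shift $s$; for brevity it calls `np.linalg.solve` on the shifted matrix in every iteration, whereas in practice one would factor $A-sI$ once and reuse the factorization (the test matrix and shift below are arbitrary):
```python
import numpy as np

def shifted_inverse_iteration(A, s, x0, tol=1e-12, max_iter=100):
    """Estimate the eigenpair of A whose eigenvalue is closest to the shift s."""
    n = A.shape[0]
    M = A - s * np.eye(n)              # shifted matrix (factor once in practice)
    x = x0 / np.linalg.norm(x0)
    lam = x @ A @ x
    for _ in range(max_iter):
        x = np.linalg.solve(M, x)      # apply (A - sI)^{-1}
        x /= np.linalg.norm(x)
        lam_new = x @ A @ x            # Rayleigh quotient of the current iterate
        if abs(lam_new - lam) < tol:
            return lam_new, x
        lam = lam_new
    return lam, x

A = np.diag([3.0, 1.0, -2.0])
print(shifted_inverse_iteration(A, s=0.8, x0=np.ones(3))[0])   # converges to 1.0
```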
The shift can also be dynamic; for example, the **Rayleigh quotient shift** uses $s_{k}:= x_{k-1}^{T}Ax_{k-1} \approx \lambda_{i}$ on the $k$th step, where $x_{k-1}$ is the (normalized) eigenvector estimate from the $(k-1)$th step.
In each iteration, we spend $O(n^{3})$ time computing $(A-s_{k} I)^{-1}x$ (by solving a linear system), but in return we get much better convergence:
> [!theorem|*] Cubic Convergence with Rayleigh Shifts
>
> *Using Rayleigh shifts gives cubic convergence* in the sense that
> $\begin{align*}\| x^{(k+1)}-v_{i} \| &= O(\| x^{(k)}-v_{i} \|^{3} ) .\end{align*} $
>
>
> > [!proof]-
> >
> > Let $\delta x:=x^{(k)}-v_{i}$. In the $(k+1)$th iteration, $\begin{align*}
> > x^{(k+1)}&\propto(A-s_{k+1}I)^{-1}x^{(k)}\\
> > &= \frac{1}{\lambda_{i}-s_{k+1}}v_{i}+O(\| \delta x \| ) \\
> > &= \frac{1}{\lambda_{i}-\hat{\lambda}^{(k)}}v_{i}+O(\| \delta x \| ).
> > \end{align*}$
> > After normalizing, the error is $\| x^{(k+1)}-v_{i} \| =O(| \hat{\lambda}^{(k)}-\lambda_{i} |\cdot \| \delta x \| )=O(\| \delta x\|^{3} ),$
> > since the shift is the Rayleigh quotient $\hat{\lambda}^{(k)}=r(x^{(k)})$, whose error is $| \hat{\lambda}^{(k)}-\lambda_{i} |=| r(x^{(k)})-r(v_{i}) |=O(\| x^{(k)}-v_{i} \|^{2} )$.
>
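The following sketch runs the Rayleigh-shifted iteration on a random symmetric matrix (an arbitrary test case, assuming symmetric $A$ as elsewhere in this note) and prints how quickly the eigenvalue error collapses:
```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
A = (A + A.T) / 2                         # symmetric test matrix
true_eigs = np.linalg.eigvalsh(A)         # reference eigenvalues for measuring error

n = A.shape[0]
x = rng.standard_normal(n)
x /= np.linalg.norm(x)
for k in range(4):
    s = x @ A @ x                                  # Rayleigh quotient shift
    x = np.linalg.solve(A - s * np.eye(n), x)      # one step of shifted inverse iteration
    x /= np.linalg.norm(x)
    lam_hat = x @ A @ x
    err = np.min(np.abs(true_eigs - lam_hat))      # distance to the nearest true eigenvalue
    print(f"step {k + 1}: eigenvalue error = {err:.2e}")
# The error should collapse to machine precision within a handful of steps.
```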
### Simultaneous Iterations
> [!bigidea]
> Applying the power iteration $A^{k}$ to multiple vectors and then orthogonalizing them with Gram-Schmidt yields convergence to the leading eigenvectors of $A$ simultaneously.
Although the power iteration only operates on one vector, there is nothing preventing us from applying it to multiple vectors $x_{1},\dots,x_{s}$: $X \mapsto AX \mapsto A^{2}X \mapsto \dots,$ where $x_{1},\dots,x_{s}$ are the columns of $X$.
However, all columns of $A^{k}X$ converge to something $\propto v_{1}$, so to extract more eigenvectors, we need to "remove" the components proportional to $v_{1}$.
Making $A^{k}x_{2}$ orthogonal to $v_{1}$ reveals its component proportional to $v_{2}$: $\begin{align*}
A^{k}x_{2}&= \sum_{i=1}^{n}\alpha_{i}\lambda_{i}^{k}v_{i}
\\ &= \lambda_{2}^k
\Bigg[\,
\underbrace{\alpha_{1}
\left(\frac{\lambda_{1}}{\lambda_{2}}\right)^{k}v_{1}}_{\substack{\text{removed by}\\\text{orthogonalization}}} +
\alpha_{2}v_{2}+
\underbrace{\sum_{i=3}^{n}\alpha_{i}
\left(\frac{\lambda_{i}}{\lambda_{2}}\right)^{k}v_{i}}_{\to 0}
\Bigg]
\\ &\underset{\approx}{\propto} v_{2}
\end{align*}$ and the rate of convergence is $O(| \lambda_{3} / \lambda_{2} |^{k})$.
In general, removing the components proportional to $v_{1},\dots,v_{s-1}$ from $A^{k}x_{s}$ gives something proportional to $v_{s}$: $\begin{align*}
A^{k}x_{s}&= \sum_{i=1}^{n}\alpha_{i}\lambda_{i}^{k}v_{i}
\\ &= \lambda_{s}^k
\Bigg[\,
\underbrace{\sum_{i=1}^{s-1}\alpha_{i}
\left(\frac{\lambda_{i}}{\lambda_{s}}\right)^{k}v_{i}}_{\text{removed}} +
\alpha_{s}v_{s}+
\underbrace{\sum_{i=s+1}^{n}\alpha_{i}
\left(\frac{\lambda_{i}}{\lambda_{s}}\right)^{k}v_{i}}_{\to 0}
\Bigg]
\\ &\underset{\approx}{\propto} v_{s}
\end{align*}$ assuming $| \lambda_{s} | > | \lambda_{j} |$ for all $j>s$, so that the trailing terms converge to $0$; the rate of convergence is then $O(| \lambda_{s+1} / \lambda_{s} |^{k})$.
So if we apply the Gram-Schmidt process followed by normalization to the columns of $A^{k}X$, mapping $A^{k}x_{i} \mapsto q_{i}$:
* For $i=1$, the power iteration gives $A^{k}x_{1}\underset{\approx}{\propto} v_{1}$; G-S only rescales the first column, so $q_{1} \underset{\approx}{\propto} v_{1}$.
* For $i=2$, G-S makes $A^{k}x_{2}$ orthogonal to $A^{k}x_{1}\underset{\approx}{\propto} v_{1}$, hence $q_{2}\underset{\approx}{\propto} v_{2}$.
* For general $s$, $A^{k}x_{s}$ is made orthogonal to $q_{1},\dots,q_{s-1}\approx \pm v_{1},\dots,\pm v_{s-1}$, hence $q_{s}\underset{\approx}{\propto}v_{s}$.
* Lastly, since both $q_{s}$ and $v_{s}$ are unit vectors, $q_{s}\approx \pm v_{s}$. Altogether, $\{ A^{k}x_{i} \} \xmapsto{\text{G-S}}\{ q_{i} \} \approx \{ \pm v_{i} \},$ where the $\pm$ doesn't really matter, because both $\pm v_{s}$ are eigenvectors of $A$ with eigenvalue $\lambda_{s}$. A code sketch of this procedure follows below.
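Here is a minimal sketch of simultaneous iteration; the reduced QR factorization plays the role of Gram-Schmidt plus normalization, and the block size $s$, iteration count, and test matrix are illustrative choices:
```python
import numpy as np

def simultaneous_iteration(A, s, num_iter=200, seed=0):
    """Approximate the s dominant eigenpairs of A by power iteration + orthogonalization."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((A.shape[0], s))
    Q, _ = np.linalg.qr(X)                 # orthonormal starting block
    for _ in range(num_iter):
        Q, _ = np.linalg.qr(A @ Q)         # apply A, then re-orthonormalize (Gram-Schmidt step)
    eigvals = np.diag(Q.T @ A @ Q)         # Rayleigh quotient estimates of lambda_1, ..., lambda_s
    return eigvals, Q

A = np.diag([5.0, 3.0, 1.0, 0.5])
vals, Q = simultaneous_iteration(A, s=2)
print(vals)            # approximately [5., 3.]
print(np.round(Q, 3))  # columns approximately +/- e_1 and +/- e_2
```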