> [!tldr] The SVD
> The SVD is a decomposition of a matrix into an orthogonal-diagonal-orthogonal product; it exists for all matrices, including non-square ones.
Given a matrix $A \in \mathbb{R}^{m \times n}$, $m \ge n$, there are orthonormal bases $\mathcal{V}=\{ v_{1},\dots,v_{n} \} \subset \mathbb{R}^{n}$ and $\mathcal{U}=\{ u_{1},\dots,u_{m} \} \subset \mathbb{R}^{m}$ such that $ \mathcal{V} \xrightarrow{A}\{ \sigma_{1}u_{1},\dots,\sigma_{n}u_{n} \}$, i.e. $Av_{i}=\sigma_{i}u_{i}$ with $\sigma_{i} \ge 0$.
The factors $\sigma_{1},\dots,\sigma_{n}$ are the **singular values** of $A$; they can be non-distinct, but are usually arranged in descending order (both as a list and in the matrix).
In matrix form, let $V$ be the matrix with columns from $\mathcal{V}$, $\hat{U}$ the matrix with columns being the first $n$ elements of $\mathcal{U}$, and $\hat{\Sigma}=\mathrm{diag}(\sigma_{1},\dots,\sigma_{n})$: $AV=\hat{U}\hat{\Sigma}$
Since the columns of $V$ are orthonormal, right-applying its inverse $V^{T}$ gives the **singular value decomposition** of $A$: $A=\hat{U}\hat{\Sigma} V^{T}$
![[Singular_value_decomposition.gif#invert|center|w50]]
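A quick numerical check of this defining property, sketched with `numpy` (the matrix `A` below is just a random example):
```python
import numpy as np

# Random example matrix with m >= n
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

# Thin SVD: U is 5x3, s holds the singular values, Vt is 3x3
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Defining property: A maps each v_i to sigma_i * u_i
for i in range(3):
    assert np.allclose(A @ Vt[i], s[i] * U[:, i])
```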
## Forms of the SVD
As $\mathcal{V}$ is only mapped into the first $n$ elements of $\mathcal{U}$, the expression above omits the remaining vectors of $\mathcal{U}$; it is called the **thin/reduced SVD**: $A_{m \times n}=\begin{bmatrix}
\Bigg\uparrow & &\Bigg\uparrow \\
u_{1} & \dots & u_{n} \\
\Bigg\downarrow & & \Bigg\downarrow
\end{bmatrix}_{m \times n}
\begin{bmatrix}
\sigma_{11} \\
& \ddots \\
& & \sigma_{nn}
\end{bmatrix}_{n\times n}
\begin{bmatrix}
\longleftarrow & v_{1} & \longrightarrow \\
& \vdots \\
\longleftarrow & v_{n} & \longrightarrow
\end{bmatrix}_{n\times n}$
The **full SVD** includes the additional vectors of $\mathcal{U}$, and extra zero rows in $\Sigma$ to preserve the product: $A_{m \times n}=\begin{bmatrix}
\Bigg\uparrow &
&\Bigg\uparrow \\
u_{1} & \dots \, \dots & u_{m} \\
\Bigg\downarrow & & \Bigg\downarrow
\end{bmatrix}_{m \times m}
\begin{bmatrix}
\sigma_{11} \\
& \ddots \\
& & \sigma_{nn} \\
\\ \\
\end{bmatrix}_{m\times n}
\begin{bmatrix}
\longleftarrow & v_{1} & \longrightarrow \\
& \vdots \\
\longleftarrow & v_{n} & \longrightarrow
\end{bmatrix}_{n\times n}$and these full matrices are denoted $A=U\Sigma V^{T}$ (without the hats).
In general, for complex matrices, the singular value decomposition of $A \in \mathbb{C}^{m\times n}$ is the product $A=U\Sigma V^{*}$ where $U,V$ are unitary and $\Sigma$ is diagonal, with non-negative, non-increasing diagonal entries.
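Both forms are available in `numpy` through the `full_matrices` flag of `np.linalg.svd`; a small sketch (shapes assume $m \ge n$):
```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))   # m = 6, n = 4

# Full SVD: U is 6x6, Vt is 4x4; Sigma must be padded with zero rows
U, s, Vt = np.linalg.svd(A, full_matrices=True)
Sigma = np.zeros((6, 4))
Sigma[:4, :4] = np.diag(s)
assert np.allclose(U @ Sigma @ Vt, A)

# Thin/reduced SVD: U_hat is 6x4 and Sigma_hat = diag(s) is 4x4
U_hat, s, Vt = np.linalg.svd(A, full_matrices=False)
assert np.allclose(U_hat @ np.diag(s) @ Vt, A)
```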
### Proofs of Existence
The SVD of $A$ is closely related to $A^{T}A$, which is symmetric and positive semi-definite. Hence all its eigenvalues are non-negative.
> [!proof]-
> *Symmetry* is immediate.
> *Positive semi-definiteness*: for any $v \in \mathbb{R}^{n}$, $v^{T}(A^{T}A)v=\|Av\|^{2} \ge 0$.
> *Non-negativity*: for any eigenvalue $\lambda$ with eigenvector $v$, $0 \le v^{T} (A^{T}A)v=v^{T}\lambda v=\lambda\|v\|^{2}$, so $\lambda$ must be non-negative.
Hence by the spectral theorem, it can be diagonalized as $A^{T}A=VCV^{T}=VD^{2}V^{T}$ where $V$ is the orthogonal matrix whose columns are the eigenvectors of $A^{T}A$, and $D$ (with $C=D^{2}$ containing the eigenvalues) is well-defined since the eigenvalues are non-negative.
If $\mathrm{rank}(A)\equiv r=n$, then the thin SVD exists.
> [!proof]-
> Consider $A^{T}A=VD^{2}V^{T}$.
> Since $A$ has full rank, $D$ has no zeros on its diagonal, so it can be inverted to give $\hat{U} \equiv AVD^{-1}$, which has orthonormal columns: $\hat{U}^{T}\hat{U}=D^{-1}V^{T}A^{T}AVD^{-1}=D^{-1}D^{2}D^{-1}=I$. Hence $A=\hat{U}DV^{T}$ is the thin SVD.
In general, if $\mathrm{rank}(A)=r \le n$, the thin SVD still exists.
> [!proof]
> In this case $D$ is not invertible; we will invert part of it instead. Denote $B=AV$ so that $B^{T}B=D^{2}$, and $B,D$ are of the form $D=\begin{bmatrix}
D_{r} & \\
& 0_{n-r}
\end{bmatrix},\, B=\begin{bmatrix}
\Bigg\uparrow & &\Bigg\uparrow &\Bigg\uparrow & &\Bigg\uparrow \\
b_{1} & \dots & b_{r} & 0 & \dots & 0 \\
\Bigg\downarrow & & \Bigg\downarrow & \Bigg\downarrow & & \Bigg\downarrow
\end{bmatrix}$ where $D_{r}$ contains the non-zero entries; the last $n-r$ columns of $B$ are zero since the diagonal entries of $B^{T}B=D^{2}$ are the squared norms of the columns of $B$, and the last $n-r$ of them are zero.
> We can then invert $D_{r}$ to define $\begin{bmatrix}\hat{U}_{1} & 0\end{bmatrix}=B \begin{bmatrix}D_{r}^{-1} & \\ & I_{n-r}\end{bmatrix}$. Hence $A=\begin{bmatrix}\hat{U}_{1} & 0\end{bmatrix}\begin{bmatrix}D_{r} & \\ & I_{n-r}\end{bmatrix}V^{T}$
> Extending $\hat{U}_{1}$ by $(n-r)$ orthonormal vectors to $\hat{U}=\begin{bmatrix}\hat{U}_{1}&\hat{U}_{2}\end{bmatrix}$ and changing the trailing $I_{n-r}$ block to $0_{n-r}$ (which preserves the product, since the new columns $\hat{U}_{2}$ then contribute nothing) gives $A=\hat{U}\underbrace{\begin{bmatrix}D_{r} & \\ & 0_{n-r}\end{bmatrix}}_{\hat{\Sigma}}V^{T}$
Given the thin SVD of a matrix $A=\hat{U}\hat{\Sigma}V^{T}$, any extension of $\hat{U}$ to an orthonormal basis of $\mathbb{R}^{m}$ (i.e. to $m$ columns) gives the full SVD.
> [!proof]-
> It is straightforward to see that if $U$ is the extension of $\hat{U}$, and $\hat{U}_{\perp}$ is the matrix containing the extra vectors, then $A=\hat{U}\hat{\Sigma}V^{T}=
\underbrace{\begin{bmatrix}\hat{U} & \hat{U}_{\perp}\end{bmatrix}\vphantom{\begin{bmatrix}\hat{1}\\1_{2}\end{bmatrix}}}_{U}
\underbrace{\begin{bmatrix}\hat{\Sigma} \\ 0_{(m-n)\times n}\end{bmatrix}}_{\Sigma}V^{T}$ is a full SVD.
## Algebraic Properties
Importantly, SVD determines the [[Matrix Norms|$l_2$ norm of matrices]]:
> [!lemma|*] Largest Singular Value is Matrix $l_2$ Norm
> For any matrix $A$ with largest singular value $\sigma_{1}(A)$, $\| A \|_{2}:= \max_{x} \frac{\| Ax \| _{2}}{\| x \| _{2}}=\sigma_{1}(A).$
This is simply because $\| UDV^{T}x \|_{2}=\| DV^{T}x \|_{2}$ as $U$ is orthogonal; then defining $y:=V^{T}x$ (which satisfies $\| y \|=\| x \|$) gives $\max_{x} \frac{\| Ax \| }{\| x \| }=\max_{y} \frac{\| Dy \| }{\| y \| },$ which is maximized by $y=e_{1}=(1,0,\dots,0)^T$, i.e. $x=Ve_{1}=v_{1}$, and the maximum is $d_{11}=\sigma_{1}$.
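A quick numerical check, sketched with `numpy` (whose `np.linalg.norm(A, 2)` computes the spectral norm):
```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((7, 4))

sigma_1 = np.linalg.svd(A, compute_uv=False)[0]  # largest singular value
assert np.isclose(np.linalg.norm(A, 2), sigma_1)

# The maximizer x = v_1 attains the ratio ||Ax|| / ||x|| = sigma_1
v1 = np.linalg.svd(A)[2][0]
assert np.isclose(np.linalg.norm(A @ v1) / np.linalg.norm(v1), sigma_1)
```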
### SVD as Low-Rank Approximations
Written in vector/column forms, where $\hat{U}=[u_{1},\dots,u_{r}]$ and $\hat{V}=[v_{1},\dots,v_{r}]$, the SVD is equivalent to $A=\sum_{i=1}^{r}u_{i}\sigma_{i}v_{i}^{T}$where each term is a rank-1 $m \times n$ matrix; its contribution is determined by $\sigma_{i}$.
The **truncated SVD** $A_{k}$ are rank-$k$ approximations of $A$, where $k \le r$: $A_{k}=\sum_{i=1}^{k}u_{i}\sigma_{i}v^{T}_{i}.$
In fact, this is the best rank-$k$ approximation:
> [!theorem|*] SVD is Optimal Rank-$k$ Approximation
> Under the spectral norm, *the truncated SVD is the best rank-$k$ approximation*: for any $B \in \mathbb{R}^{m\times n}$ with $\mathrm{rank}(B) \le k$, $\|A-A_{k}\|_{2}=\sigma_{k+1} \le \|A-B\|_{2}.$
>
> > [!proof]-
> > The equality is immediate, since $A-A_{k}=\sum_{i=k+1}^{r}\sigma_{i}u_{i}v_{i}^{T}$, whose largest singular value is $\sigma_{k+1}$.
> > For the inequality, suppose for contradiction that there is $B:\|A-B\|_{2} < \sigma_{k+1}$. Then $\mathrm{rank}(B) \le k$ means there is a subspace $W \le \ker B$ with $\dim(W)=n-k$, and for any $w \in W$, $\|Aw\|_{2}=\|(A-B)w\|_{2} \le \|A-B\|_{2}\|w\|<\sigma_{k+1}\|w\|.$ Hence $W$ intersects $V_{k+1}=\mathrm{span}( v_{1},\dots,v_{k+1})$ only at $0$, since $\|Av\|_{2}\ge \sigma_{k+1}\|v\|$ for any $v \in V_{k+1}$.
> > But this means $\mathbb{R}^{n}$ contains two subspaces of dimensions $n-k$ and $k+1$ that intersect only at $0$; their dimensions sum to $n+1>n$, a contradiction.
- Note that $\sigma_{1}=\| A \|_{2}$ is a special case of this theorem with $k=0$, an approximation with rank $0$ (i.e. the zero matrix).
In terms of information, the truncated SVD is the low-rank matrix that best summarizes the information in $A$.
- If $A$ is an image, the truncated SVD compresses a high-rank image matrix into a few vector products.
* If $A$ is data, the truncated SVD finds linear combinations of the features that are the most "interesting" -- this gives the [[Principal Component Analysis|PCA]].
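As a sketch of both points above, the following `numpy` snippet forms a rank-$k$ truncation of a random matrix (standing in for an image or a data matrix) and checks the error formula $\|A-A_{k}\|_{2}=\sigma_{k+1}$ from the theorem:
```python
import numpy as np

def truncated_svd(A, k):
    """Best rank-k approximation A_k = sum_{i<=k} sigma_i u_i v_i^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 30))
k = 5

A_k = truncated_svd(A, k)
s = np.linalg.svd(A, compute_uv=False)

assert np.linalg.matrix_rank(A_k) == k
# Spectral-norm error of the truncation is exactly sigma_{k+1}
assert np.isclose(np.linalg.norm(A - A_k, 2), s[k])   # s[k] is sigma_{k+1}
```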
### Courant-Fischer Min-Max Theorem
> [!theorem|*] Courant-Fischer
> For a matrix $A \in \mathbb{R}^{m\times n}$, $\sigma_{i}(A)=\max_{\mathcal{S}:\dim \mathcal{S}=i}\min_{x \in \mathcal{S}} \frac{\| Ax \|_{2} }{\| x \|_{2} }.$
> The maximum is achieved by $\mathcal{S}=\mathrm{span}\{ v_{1},\dots,v_{i} \}$ (where $v_{i}$ is the $i$th column of $V$ in the SVD), and its inner minimum is achieved by $v_{i}$.
>
>
> > [!proof]-
> > Since $\| Av_{i} \|=\| UD V^{T}v_{i} \|=\| De_{i} \|=\sigma_{i}$ and $\| v_{i} \|=1$, the choice $x=v_{i}$ attains the ratio $\sigma_{i}$ exactly.
> > - For any $\mathcal{S}$ with $\dim \mathcal{S}=i$, we have $\mathcal{S} \cap \mathrm{span}\{ v_{i},\dots,v_{n} \} \ne \{ 0 \}$, since the dimensions sum to $i+(n-i+1)=n+1>n$; any non-zero $x$ in the intersection satisfies $\| Ax \| \le \sigma_{i}\| x \|$, so $\min_{x \in \mathcal{S}} \| Ax \| / \| x \| \le \sigma_{i}$.
> > - Taking $\mathcal{S}=\mathrm{span}\{ v_{1},\dots,v_{i} \}$ limits that intersection to $\mathrm{span}\{ v_{i} \}$, and then $\min_{x \in \mathcal{S}} \| Ax \| / \| x \| = \sigma_{i}$.
> >
> > Therefore, $\sigma_{i}$ is the maximum of $\min_{x \in \mathcal{S}} \| Ax \| / \| x \|$, attained by $\mathcal{S}=\mathrm{span}\{ v_{1},\dots,v_{i} \}$.
^bed618
- Special cases: $i=1$ recovers $\sigma_{1}=\| A \|_{2}$ (the lemma above), while $i=n$ gives $\sigma_{n}=\min_{x} \| Ax \|_{2} / \| x \|_{2}$.
An important consequence is [[Numerical Stability#^6a3c0e|Weyl's theorem]], which guarantees the numerical stability of computing singular values.
Other corollaries include:
- $\sigma_{i}\left( \begin{bmatrix}A_{1} \\ A_{2}\end{bmatrix} \right) \ge \max(\sigma_{i}(A_{1}),\sigma_{i}(A_{2}))$.
- $\sigma_{i}(AB) \le \sigma_{i}(A)\sigma_{1}(B)$.
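Both inequalities are easy to check numerically on random matrices; a sketch (with small tolerances for floating-point error):
```python
import numpy as np

def sv(M):
    """Singular values of M, in non-increasing order."""
    return np.linalg.svd(M, compute_uv=False)

rng = np.random.default_rng(0)
A1 = rng.standard_normal((5, 4))
A2 = rng.standard_normal((6, 4))
B  = rng.standard_normal((4, 4))

# sigma_i([A1; A2]) >= max(sigma_i(A1), sigma_i(A2))
assert np.all(sv(np.vstack([A1, A2])) >= np.maximum(sv(A1), sv(A2)) - 1e-12)

# sigma_i(A1 B) <= sigma_i(A1) * sigma_1(B)
assert np.all(sv(A1 @ B) <= sv(A1) * sv(B)[0] + 1e-12)
```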
## Computing the SVD
One can compute the SVD as derived, using the Gram matrix $A^{T}A$ and its decomposition $A^{T}A=V \Lambda V^{T}$. Then normalizing each column of $AV$ gives $\hat{U}$, and the normalization factors are the singular values.
However, this is expensive and numerically unstable for small singular values, since forming $A^{T}A$ squares the condition number.
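For reference, a sketch of this naive recipe in `numpy`, assuming $A$ has full column rank so that every normalization factor is non-zero:
```python
import numpy as np

def svd_via_gram(A):
    """Naive SVD via the Gram matrix A^T A (assumes full column rank).
    Squaring A squares its condition number, so small singular values
    lose accuracy -- this is the instability mentioned above."""
    lam, V = np.linalg.eigh(A.T @ A)        # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]           # reorder to non-increasing
    lam, V = lam[order], V[:, order]
    AV = A @ V
    sigma = np.linalg.norm(AV, axis=0)      # normalization factors = singular values
    U = AV / sigma                          # normalize each column of AV
    return U, sigma, V.T

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
U, s, Vt = svd_via_gram(A)
assert np.allclose(U @ np.diag(s) @ Vt, A)
assert np.allclose(s, np.linalg.svd(A, compute_uv=False))
```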
### Golub-Kahan Algorithm
Golub-Kahan is a pre-processing algorithm (similar to [[QR Algorithm#Preprocessing into Hessenberg Form|preprocessing into Hessenbergs for QR]]) that speeds up the subsequent computations.
By applying [[QR Factorization#QR with Householder Reflections|Householder reflections]] on both sides of $A$ (though different ones, unlike the reduction to Hessenberg form), we can reduce it to bidiagonal form (two non-zero diagonals): $A \mapsto H_{L,n}\cdots H_{L,1}AH_{R,1}\cdots H_{R,n}=:B$ where $B$ has the form $B=\begin{bmatrix}
* & * & & \\
& * & \ddots & \\
& & \ddots & * \\
& & & *
\end{bmatrix}.$
Now we can simply work with $B^{T}B$ and apply the [[QR Algorithm]] (which is cheaper to compute since *$B^{T}B$ is tridiagonal*), or use cleverer algorithms to find the SVD of $B=U_{B}\Sigma V_{B}^{T}$. In the end, we get $A=(H_{L,1}^{T}\cdots H_{L,n}^{T}U_{B}) ~\Sigma~(H_{R,1}\cdots H_{R,n}V_{B})^{T}.$
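A sketch of the bidiagonalization step in `numpy`, accumulating the Householder reflections as explicit matrices (a production routine stores them implicitly, but this makes the factorization easy to verify):
```python
import numpy as np

def house(x):
    """Unit Householder vector v such that (I - 2 v v^T) x = -sign(x_1) ||x|| e_1."""
    v = x.astype(float).copy()
    v[0] += (1.0 if x[0] >= 0 else -1.0) * np.linalg.norm(x)
    nv = np.linalg.norm(v)
    return v / nv if nv > 0 else v

def golub_kahan_bidiagonalize(A):
    """Reduce A (m >= n) to upper-bidiagonal B with A = QL @ B @ QR.T,
    alternating left and right Householder reflections."""
    m, n = A.shape
    B = A.astype(float).copy()
    QL, QR = np.eye(m), np.eye(n)
    for k in range(n):
        # Left reflection: zero out column k below the diagonal
        v = house(B[k:, k])
        B[k:, :] -= 2.0 * np.outer(v, v @ B[k:, :])
        QL[:, k:] -= 2.0 * np.outer(QL[:, k:] @ v, v)
        # Right reflection: zero out row k beyond the superdiagonal
        if k < n - 2:
            w = house(B[k, k+1:])
            B[:, k+1:] -= 2.0 * np.outer(B[:, k+1:] @ w, w)
            QR[:, k+1:] -= 2.0 * np.outer(QR[:, k+1:] @ w, w)
    return QL, B, QR

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
QL, B, QR = golub_kahan_bidiagonalize(A)

# B is upper bidiagonal, the product reconstructs A, and the
# singular values are unchanged by the orthogonal transformations
assert np.allclose(np.tril(B, -1), 0) and np.allclose(np.triu(B, 2), 0)
assert np.allclose(QL @ B @ QR.T, A)
assert np.allclose(np.linalg.svd(B, compute_uv=False),
                   np.linalg.svd(A, compute_uv=False))
```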
### Randomized SVD