> [!tldr] The SVD
> The SVD is a matrix decomposition into an orthogonal-diagonal-orthogonal product; it exists for all matrices, including non-square ones.
>
> Given a matrix $A \in \mathbb{R}^{m \times n}$, $m \ge n$, there are orthonormal bases $\mathcal{V}=\{ v_{1},\dots,v_{n} \} \subset \mathbb{R}^{n}$ and $\mathcal{U}=\{ u_{1},\dots,u_{m} \} \subset \mathbb{R}^{m}$ such that
> $\mathcal{V} \overset{A}{\mapsto}\{ \sigma_{1}u_{1},\dots,\sigma_{n}u_{n} \},$
> i.e. $Av_{1}=\sigma_{1}u_{1}$, etc. The factors $\sigma_{1},\dots,\sigma_{n}$ are the **singular values** of $A$; they need not be distinct, but are usually arranged in non-increasing order.
>
> In matrix form, let $V$ be the matrix with columns from $\mathcal{V}$; $U$ the matrix with columns being the first $n$ elements of $\mathcal{U}$; and finally $\Sigma=\mathrm{diag}(\sigma_{1},\dots,\sigma_{n})$: the above relationship can be written as
> $AV=U\Sigma.$
> Since $V$ is orthogonal, right-applying its inverse $V^{T}$ gives the **singular value decomposition** of $A$:
> $A=U\Sigma V^{T};$
> equivalently, $A=\sum_{i=1}^{n}\sigma_{i}u_{i}v_{i}^{T}.$

## Forms of the SVD

As $\mathcal{V}$ is only mapped onto (multiples of) the first $n$ elements of $\mathcal{U}$, the expression above omits the remaining vectors of $\mathcal{U}$; it is called the **thin/reduced SVD**:
$A_{m \times n}=\begin{bmatrix} \Bigg\uparrow & &\Bigg\uparrow \\ u_{1} & \dots & u_{n} \\ \Bigg\downarrow & & \Bigg\downarrow \end{bmatrix}_{m \times n} \begin{bmatrix} \sigma_{1} \\ & \ddots \\ & & \sigma_{n} \end{bmatrix}_{n\times n} \begin{bmatrix} \longleftarrow & v_{1} & \longrightarrow \\ & \vdots \\ \longleftarrow & v_{n} & \longrightarrow \end{bmatrix}_{n\times n}.$

The **full SVD** includes the additional vectors of $\mathcal{U}$, and extra zero rows in $\Sigma$ to preserve the product:
$A_{m \times n}=\begin{bmatrix} \Bigg\uparrow & &\Bigg\uparrow \\ u_{1} & \dots \, \dots & u_{m} \\ \Bigg\downarrow & & \Bigg\downarrow \end{bmatrix}_{m \times m} \begin{bmatrix} \sigma_{1} \\ & \ddots \\ & & \sigma_{n} \\ \\ \\ \end{bmatrix}_{m\times n} \begin{bmatrix} \longleftarrow & v_{1} & \longrightarrow \\ & \vdots \\ \longleftarrow & v_{n} & \longrightarrow \end{bmatrix}_{n\times n}.$

In general, for complex matrices, the singular value decomposition of $A \in \mathbb{C}^{m\times n}$ is the product
$A=U\Sigma V^{*},$
where $U,V$ are unitary, and $\Sigma$ is diagonal with non-negative, non-increasing entries.

### Proofs of Existence

The SVD of $A$ is closely related to $A^{T}A$, which is symmetric and positive semi-definite; hence all its eigenvalues are non-negative.

> [!proof]-
> *Symmetry* is immediate.
> *Positive semi-definiteness*: for any $v \in \mathbb{R}^{n}$, $v^{T}(A^{T}A)v=\|Av\|^{2} \ge 0$.
> *Non-negativity of eigenvalues*: for any eigenvalue $\lambda$ with eigenvector $v$, $0 \le v^{T} (A^{T}A)v=v^{T}\lambda v=\lambda\|v\|^{2}$, so $\lambda$ must be non-negative.

For demonstration, we first assume that $\mathrm{rank}(A)=n$, and find the SVD of $A$:

> [!proof]
> Consider the eigendecomposition $A^{T}A=V\Lambda V^{T}$ and let
> $\Sigma=\Lambda^{1/2}:= \mathrm{diag}(\sqrt{ \lambda_{1} },\dots,\sqrt{ \lambda_{n} }),$
> where by permuting the columns of $V$, we assume $\sigma_{1}\ge\dots\ge \sigma_{n}>0$.
> Since $A$ has full rank, $A^{T}A$ and hence $\Sigma$ is invertible, so we can define $U:= AV\Sigma^{-1}$, which has orthonormal columns since
> $U^{T} U=\Sigma^{-1}V^{T}A^{T}AV\Sigma^{-1}=I.$
> Then $A=U\Sigma V^{T}$ is the SVD of $A$.
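As a quick numerical companion to this construction (a minimal NumPy sketch, assuming the random $A$ has full column rank; all variable names are ad hoc), we can build $V$ and $\Sigma$ from the eigendecomposition of $A^{T}A$, set $U:=AV\Sigma^{-1}$, and verify $A=U\Sigma V^{T}$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 4
A = rng.standard_normal((m, n))          # a generic random A has full column rank

# Eigendecomposition of the Gram matrix: A^T A = V Λ V^T
lam, V = np.linalg.eigh(A.T @ A)         # eigh returns eigenvalues in ascending order
lam, V = lam[::-1], V[:, ::-1]           # reorder so that σ_1 ≥ ... ≥ σ_n

sigma = np.sqrt(lam)                     # σ_i = √λ_i
U = (A @ V) / sigma                      # U := A V Σ^{-1} (divide column i by σ_i)

assert np.allclose(U.T @ U, np.eye(n))   # U has orthonormal columns
assert np.allclose((U * sigma) @ V.T, A) # A = U Σ V^T
```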
We now prove existence in general: $\mathrm{rank}(A)=r \le n$.

> [!proof]
> In this case $\Sigma=\begin{bmatrix}\Sigma_{r} & \\& 0_{n-r}\end{bmatrix}$ is not invertible, but it has an invertible block $\Sigma_{r}$. Define $V_{r}\in \mathbb{R}^{n\times r}$ as the first $r$ columns of $V(=\begin{bmatrix}V_{r}& V_{r}^{\perp}\end{bmatrix})$. Now set $U_{r}:=AV_{r}\Sigma_{r}^{-1}$, which has orthonormal columns since
> $U_{r}^{T}U_{r}=\Sigma_{r}^{-1}V_{r}^{T} A^{T}AV_{r}\Sigma_{r}^{-1}=I.$
> Extend $U_{r}$ by further orthonormal columns to get $U=\begin{bmatrix}U_{r}& U_{r}^{\perp}\end{bmatrix}\in \mathbb{R}^{m\times n}$. Since each column $v$ of $V_{r}^{\perp}$ is an eigenvector of $A^{T}A$ with eigenvalue $0$, we have $\| Av \|^{2}=v^{T}A^{T}Av=0$, i.e. $AV_{r}^{\perp}=0$. This gives the SVD of $A$, since
> $U\Sigma V^{T}=\begin{bmatrix}U_{r}& U_{r}^{\perp}\end{bmatrix}\begin{bmatrix}\Sigma_{r} & \\& 0_{n-r}\end{bmatrix}\begin{bmatrix}V_{r}^{T} \\ V_{r}^{\perp T}\end{bmatrix}=U_{r}\Sigma_{r}V_{r}^{T}=AV_{r}V_{r}^{T}=A(V_{r}V_{r}^{T}+V_{r}^{\perp}V_{r}^{\perp T})=A.$

From this construction, collecting $\{ u_{i} \}_{i=1}^{n}$ and $\{ v_{i} \}_{i=1}^{n}$ into matrices gives the thin SVD, and further extending $\{ u_{1},\dots ,u_{n} \}$ to an orthonormal basis of $\mathbb{R}^{m}$ gives the full SVD.

## Algebraic Properties

Importantly, the SVD determines the [[Matrix Norms|$l_2$ norm of matrices]]:

> [!lemma|*] Largest Singular Value is Matrix $l_2$ Norm
> For any matrix $A$ with largest singular value $\sigma_{1}(A)$,
> $\| A \|_{2}:= \max_{x} \frac{\| Ax \| _{2}}{\| x \| _{2}}=\sigma_{1}(A).$
>
> > [!proof]-
> > Since $U$ is orthogonal, $\| Ax \|_{2}=\| U\Sigma V^{T}x \|_{2}=\| \Sigma V^{T}x \|_{2}$. The reparametrisation $y:=V^{T}x$ is bijective and norm-preserving since $V$ is orthogonal, so
> > $\max_{x} \frac{\| Ax \| }{\| x \| }=\max_{y} \frac{\| \Sigma y \| }{\| y \| }.$
> > The right-hand side is maximised by $y=(1,0,\dots,0)^{T}$, i.e. $x=v_{1}$, and the maximum is $\sigma_{1}$.

^0600ec

### Low-Rank Approximation

Recall that the SVD can be written as
$A=\sum_{i=1}^{r}u_{i}\sigma_{i}v_{i}^{T},$
where each term is an $m \times n$ matrix of rank $1$, and its contribution is determined by $\sigma_{i}$. The **truncated SVD** $A_{k}$ is the rank-$k$ approximation of $A$ obtained by keeping the first $k \le r$ terms:
$A_{k}=\sum_{i=1}^{k}u_{i}\sigma_{i}v^{T}_{i}.$
In fact, this is the best rank-$k$ approximation:

> [!theorem|*] Truncated SVD minimises $l_{2}$ norm error
> Under the $l_{2}$ norm, the truncated SVD is the best approximation:
> $\|A-A_{k}\|_{2}=\sigma_{k+1} = \min_{B:~ \mathrm{rank}(B)\le k} \| A-B \| _{2}.$
>
> > [!proof]-
> > The first equality is immediate from $A-A_{k}=\sum_{i=k+1}^{n}\sigma_{i}u_{i}v_{i}^{T}$, which is the SVD of $A-A_{k}$; so $\sigma_{1}(A-A_{k})=\sigma_{k+1}$ (of $A$).
> >
> > For the second equality, take any $B$ with $\mathrm{rank}(B)\le k$. Then $\dim \ker B\ge n-k$, so $\ker B$ has a non-trivial intersection with $\mathrm{span}\{ v_{1},\dots,v_{k+1} \}$.
> >
> > Take $x=\sum_{i=1}^{k+1}\beta_{i}v_{i}\in \ker B \cap \mathrm{span}\{ v_{1},\dots,v_{k+1} \}$, with coefficients (WLOG) satisfying $\| \beta \|_{2}=1$; so $x=V \begin{pmatrix}\beta \\ 0\end{pmatrix}$. Now
> > $\begin{align*} \| A-B \|_{2}&\ge \| (A-B)x \|_{2}= \| Ax \|_{2}\\&= \left\| (U\Sigma V^{T}) V \begin{pmatrix}\beta \\ 0\end{pmatrix} \right\|_{2} = \left\| \Sigma \begin{pmatrix}\beta \\ 0\end{pmatrix} \right\|_{2}= \| \Sigma_{k+1}\beta \|_{2} \\ &\ge \sigma_{k+1}(A), \end{align*}$
> > where $\Sigma_{k+1}=\mathrm{diag}(\sigma_{1},\dots,\sigma_{k+1})$ is the upper-left $(k+1)\times(k+1)$ block of $\Sigma$.
> >
> > Moreover, the first equality shows that this lower bound is achieved by $B=A_{k}$.

- Note that $\sigma_{1}=\| A \|_{2}$ is a special case of this theorem with $k=0$, an approximation with rank $0$ (i.e. the zero matrix). In terms of information, the truncated SVD is the low-rank matrix that best summarises the information in $A$.
- If $A$ is an image, the truncated SVD compresses a high-rank image matrix into a few vector products.
- If $A$ is data, the truncated SVD finds linear combinations of the features that are the most "interesting" -- this gives the [[Principal Component Analysis|PCA]].
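As a numerical sanity check of this theorem (a minimal NumPy sketch; the rank-$k$ competitor `B` below is just an arbitrary random example, not a worst case):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 8, 5, 2
A = rng.standard_normal((m, n))

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # thin SVD, s in non-increasing order
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]              # truncated SVD, rank k

# ||A - A_k||_2 equals the (k+1)-th singular value
assert np.isclose(np.linalg.norm(A - A_k, 2), s[k])

# any other rank-k matrix does at least as badly (up to rounding)
B = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
assert np.linalg.norm(A - B, 2) >= s[k] - 1e-12
```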
### Courant-Fisher Minmax Theorem

> [!theorem|*] Courant-Fisher
> For a matrix $A \in \mathbb{R}^{m\times n}$, and $i\le \min(m,n)$,
> $\sigma_{i}(A)=\max_{\substack{\mathcal{S}\le \mathbb{R}^{n}\\\dim \mathcal{S}=i}}\min_{0 \ne x \in \mathcal{S}} \frac{\| Ax \|_{2} }{\| x \|_{2} }.$
> The maximum is achieved by $\mathcal{S}=\mathrm{span}\{ v_{1},\dots,v_{i} \}$, and its inner minimum is achieved by $v_{i}$.
>
> > [!proof]
> > For any such $\mathcal{S}$, the intersection $\mathcal{S} \cap \mathrm{span}\{ v_{i},\dots,v_{n} \}$ is non-trivial, since $\dim \mathcal{S}+\dim \mathrm{span}\{ v_{i},\dots,v_{n} \}=i+(n-i+1)>n$. Writing any non-zero $x$ in the intersection as $x=V\begin{pmatrix}0_{i-1} \\ \beta\end{pmatrix}$ (where $\beta\in \mathbb{R}^{n-i+1}$ are its coefficients), we have $\| Ax \| / \| x \| =\| \Sigma_{i:n} \beta\| / \| \beta \|\le \| \Sigma_{i:n} \|_{2}=\sigma_{i}(A)$, so the inner minimum never exceeds $\sigma_{i}(A)$.
> >
> > Moreover, we attain this upper bound with $\mathcal{S}=\mathrm{span}\{ v_{1},\dots,v_{i} \}$, since any $x=V\begin{pmatrix}\beta \\ 0_{n-i}\end{pmatrix}\in \mathcal{S}$ has
> > $\| Ax \|= \left\| U\Sigma V^{T}V \begin{pmatrix}\beta \\ 0_{n-i}\end{pmatrix}\right\|= \left\| \Sigma\begin{pmatrix}\beta \\ 0_{n-i}\end{pmatrix}\right\|\ge \sigma_{i}(A)\| \beta \|,$
> > and $\| \beta \|=\| V^{T}x \|=\| x \|$.

^bed618

- Special cases $i=1$ and $i=\min(m,n)$ give $\sigma_{1}(A)= \max_{x} \| Ax \|_{2} / \| x \|_{2}$ and $\sigma_{\min}(A)=\min_{x} \| Ax \|_{2} / \| x \|_{2}$ respectively.
- An important consequence is [[Numerical Stability#^6a3c0e|Weyl's theorem]], which guarantees that the singular values of a matrix are well-conditioned.

Other corollaries include:
- $\sigma_{i}\left( \begin{bmatrix}A_{1} \\ A_{2}\end{bmatrix} \right) \ge \max(\sigma_{i}(A_{1}),\sigma_{i}(A_{2}))$.
- $\sigma_{i}(A)\sigma_{\min}(B)\le \sigma_{i}(AB) \le \sigma_{i}(A)\sigma_{1}(B)$. In particular, $\sigma_{1}(AB)\le \sigma_{1}(A)\sigma_{1}(B)$.
    - *This does not imply $\sigma_{\min}(A)\sigma_{\min}(B)\le \sigma_{\min}(AB)$! The two $\min$s can range over different dimensions*: for example, when $B\in \mathbb{R}^{n\times s}$ is wide, taking a unit $x\in \ker B$ gives $\sigma_{\min}(AB)\le \| ABx \|=0$, yet $A,B$ can both have all-positive singular values.

## Computing the SVD

One can compute the SVD as derived above: form the Gram matrix $A^{T}A$ and compute its eigendecomposition $A^{T}A=V \Lambda V^{T}$ with the [[QR Algorithm]]. Lastly, normalising each column of $AV$ gives $U$, while the normalisation factors are the singular values. However, this is inaccurate for small singular values: forming $A^{T}A$ squares the singular values, so those below roughly $\sqrt{ \varepsilon_{\text{machine}} }\,\| A \|_{2}$ are lost to rounding.
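To see the accuracy issue concretely, here is a small NumPy experiment (a sketch, not part of the original note; the test singular values are arbitrary, and `np.linalg.eigvalsh` stands in for the eigensolver) comparing the Gram-matrix route with a standard SVD routine:

```python
import numpy as np

# Build a matrix with known singular values via random orthonormal factors.
rng = np.random.default_rng(2)
m, n = 50, 5
U0, _ = np.linalg.qr(rng.standard_normal((m, n)))
V0, _ = np.linalg.qr(rng.standard_normal((n, n)))
true_sigma = np.array([1.0, 1e-2, 1e-5, 1e-8, 1e-10])
A = (U0 * true_sigma) @ V0.T

# Gram-matrix route: eigenvalues of A^T A are the squared singular values.
lam = np.linalg.eigvalsh(A.T @ A)[::-1]
sigma_gram = np.sqrt(np.clip(lam, 0, None))   # clip: rounding can make tiny λ negative

sigma_direct = np.linalg.svd(A, compute_uv=False)

print(sigma_gram)    # singular values below ~√ε ≈ 1e-8 are swamped by rounding in A^T A
print(sigma_direct)  # accurate to absolute error ~ε‖A‖₂, so even 1e-10 is recovered well
```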
### Golub-Kahan Algorithm

Golub-Kahan is a pre-processing algorithm (similar to [[QR Algorithm#Preprocessing into Hessenberg Form|preprocessing into Hessenberg form for QR]]) that speeds up the subsequent computations.

By applying [[QR Factorisation#QR with Householder Reflections|Householder reflections]] on both sides of $A$, we can reduce it to two diagonals:
$A \mapsto H_{L,n}\cdots H_{L,1}AH_{R,1}\cdots H_{R,n}=:B,$
where $B$ has the **bidiagonal** form
$B=\begin{bmatrix} * & * & & \\ & * & * \\ & & \ddots \\ & & & * & * \\ & & & & * \\ \\ \end{bmatrix}.$

- We can apply different reflections on each side, unlike the reduction to Hessenberg form, because the SVD does not require $U,V$ to be the same.
- Note that we apply $n(=\min(m,n))$ transforms on both sides because the left-applied matrices also clear the bottom part of the matrix, so there is no need to clear it with additional right-applied matrices.
- Since each Householder application is $O(mn)$ work (computing $m$ terms for $n$ columns, or $n$ terms for $m$ rows -- recall that the Householder map can be written as $I-2vv^{T}$, so no matrix-matrix products are needed), and we apply $2n$ of them, the total complexity of this preprocessing is $O(mn^{2})$.

Now we can simply work with $B^{T}B$ and apply the [[QR Algorithm]] (which only takes $O(n^{2})$ work since *$B^{T}B$ is tridiagonal*), or use cleverer algorithms to find the SVD of $B=U_{B}\Sigma V_{B}^{T}$. In the end, we get
$A=(H_{L,1}^{T}\cdots H_{L,n}^{T}U_{B}) ~\Sigma~(H_{R,1}\cdots H_{R,n}V_{B})^{T}.$
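The bidiagonalisation step can be sketched in a few lines of NumPy (an illustrative sketch only; the helper names `householder` and `golub_kahan_bidiagonalise` are made up here, and a practical routine would also accumulate the reflections rather than only forming $B$):

```python
import numpy as np

def householder(x):
    """Unit vector v with (I - 2 v v^T) x = ∓‖x‖ e_1 (sign chosen to avoid cancellation)."""
    v = x.astype(float).copy()
    v[0] += np.copysign(np.linalg.norm(x), x[0])
    norm_v = np.linalg.norm(v)
    return v / norm_v if norm_v > 0 else v

def golub_kahan_bidiagonalise(A):
    """Reduce A (m ≥ n) to upper-bidiagonal form by alternating left/right Householder reflections."""
    B = A.astype(float).copy()
    m, n = B.shape
    for k in range(n):
        # Left reflection: zero out B[k+1:, k]
        v = householder(B[k:, k])
        B[k:, k:] -= 2.0 * np.outer(v, v @ B[k:, k:])
        # Right reflection: zero out B[k, k+2:] (nothing to clear for the last two columns)
        if k < n - 2:
            w = householder(B[k, k + 1:])
            B[k:, k + 1:] -= 2.0 * np.outer(B[k:, k + 1:] @ w, w)
    return B

rng = np.random.default_rng(3)
A = rng.standard_normal((7, 4))
B = golub_kahan_bidiagonalise(A)

print(np.round(B, 6))  # only the diagonal and first superdiagonal survive
# Orthogonal transforms preserve singular values:
assert np.allclose(np.linalg.svd(A, compute_uv=False),
                   np.linalg.svd(B, compute_uv=False))
```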