PCA computes a latent representation of a data matrix $\mathbf{X}$ by projecting it onto the columns of an orthogonal matrix $V$: $\mathbf{X} \mapsto \mathbf{Z}:=\mathbf{X}V.$
The columns of $V$ are also known as the **principal components (PCs)**, and $V$ itself as the **loading matrix**. $\mathbf{Z}$ collects the projections of the rows of $\mathbf{X}$ onto the PCs: $z_{ij}=\text{projection of }\mathbf{X}_{i,:}\text{ onto }v_{j}.$
In particular, the $V$ used by PCA is the matrix $V$ in the full [[Singular Value Decomposition|SVD]] of $\mathbf{X}=UD V^{T}$ (equivalently, the eigenvector matrix in the spectral decomposition of $\mathbf{X}^{T}\mathbf{X}$); therefore, $\mathbf{Z}=\mathbf{X}V=UDV^{T}V=UD.$
If we perform **dimension reduction** via a **truncated PCA**, keeping only the information encoded by the first $k$ PCs, the reduced data is $\mathbf{Z}_{1:k}=\mathbf{X}V_{1:k}$, and the compressed (rank-$k$) reconstruction is $\hat{\mathbf{X}}_{k}=\mathbf{Z}_{1:k}V^{T}_{1:k}=\mathbf{X}V_{1:k}V_{1: k}^{T}.$
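A minimal NumPy sketch of these identities (on synthetic data; the variable names are illustrative): compute $V$ from the SVD, check that $\mathbf{Z}=\mathbf{X}V=UD$, and form the rank-$k$ reconstruction $\hat{\mathbf{X}}_{k}$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)                 # center each column

# Thin SVD: X = U D V^T (numpy returns V^T and the singular values d)
U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

# Scores: Z = X V = U D
Z = X @ V
assert np.allclose(Z, U * d)           # U * d scales column j of U by d[j]

# Truncated PCA: keep only the first k PCs, then reconstruct
k = 2
Z_k = X @ V[:, :k]                     # reduced data, shape (n, k)
X_hat_k = Z_k @ V[:, :k].T             # rank-k reconstruction of X
print("relative reconstruction error:",
      np.linalg.norm(X - X_hat_k) / np.linalg.norm(X))
```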
## Derivation of the Principal Components
### PCA as Variance-Maximizing Projection
Assuming $\mathbf{X}$ has its columns centered at $0$, PCA can be derived by *finding the most "interesting" way of orthogonally projecting the data* in the sense that it finds unit vectors $\nu_{1},\dots$ that:
- Each maximizes $\| \mathbf{X}\nu_{i} \|^{2}$ (equivalently, the sample variance along $\nu_{i}$, since $\mathbf{X}$ is centered), subject to being orthogonal to all $\nu_{j}$, $j<i$, that come before it.
- Here the unknowns are written $\nu$ (the Greek letter nu) to distinguish them from the columns of $V$; the solution turns out to be exactly $\nu_{i}=v_{i}$ (up to signs).
As an illustration, the figure below shows 2D data: the population size and advertising spending of $100$ cities.
![[PCA2D.png#invert|center|w80]]
Therefore, the principal components are directions along which $\mathbf{X}$ varies the most. This has a few caveats:
- It treats each column of $\mathbf{X}$ "equally", so its results are not preserved if one column is scaled up/down (e.g. via a unit conversion). A common practice is to *normalize each column to have a standard deviation of $1$*, which is equivalent to replacing $S=\mathrm{Cov}(\mathbf{X})=\mathbf{X}^{T}\mathbf{X}/(n-1)$ with the correlation matrix $R=\mathrm{Corr}(\mathbf{X})$; see the sketch after this list.
- If the projection is used as a preprocessing for supervised tasks like regression/classification, *there is no guarantee that large variance in a predictor $\mathbf{X}_{:, j}$ is relevant for the response*.
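To see the scaling caveat concretely, here is a small sketch on synthetic data (the $\times 1000$ rescaling stands in for a unit conversion): the covariance-based first PC changes after rescaling a column, while standardizing the columns first, i.e. correlation-matrix PCA, gives the same direction either way.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))

def first_pc(A):
    """Leading right singular vector of the centered data = first PC."""
    A = A - A.mean(axis=0)
    return np.linalg.svd(A, full_matrices=False)[2][0]

def standardize(A):
    """Center and scale each column to unit standard deviation."""
    return (A - A.mean(axis=0)) / A.std(axis=0)

X_rescaled = X.copy()
X_rescaled[:, 0] *= 1000                   # e.g. metres -> millimetres in column 0

print(first_pc(X))                         # covariance-based PC1 ...
print(first_pc(X_rescaled))                # ... now dominated by the rescaled column

print(first_pc(standardize(X)))            # correlation-based PC1 ...
print(first_pc(standardize(X_rescaled)))   # ... essentially identical to the line above
```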
> [!theorem|*] PCA as Variance-Maximizing Orthogonal Map
> The maximum-variance orthogonal projection problem $\nu_{i}=\underset{\nu:\| \nu \| =1}{\arg\max}~ \nu^{T}S\nu , \quad\text{subject to }\forall j <i:\nu \perp \nu_{j}$
> is solved by the columns of $V$, i.e. $\text{for }i=1,\dots,p:~\nu_{i}=v_{i}.$
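As a quick numerical sanity check of the theorem (a sketch on synthetic data, independent of the proof below): the sample variance $\nu^{T}S\nu$ attained by $v_{1}$ should be at least that of any other unit vector, e.g. randomly drawn directions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 4
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # correlated columns
X = X - X.mean(axis=0)
S = X.T @ X / (n - 1)                                    # sample covariance

v1 = np.linalg.svd(X, full_matrices=False)[2][0]         # first principal component
var_v1 = v1 @ S @ v1

# Compare against many random unit vectors
random_dirs = rng.normal(size=(10_000, p))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
var_random = np.einsum('ij,jk,ik->i', random_dirs, S, random_dirs)

assert var_v1 >= var_random.max() * (1 - 1e-9)
print(f"variance along v1: {var_v1:.3f}, best random direction: {var_random.max():.3f}")
```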
> [!exposition]- Derivation/proof of the result
> This involves the sample covariance matrix $S=\mathbf{X}^{T}\mathbf{X} / (n-1)$, as an orthogonal projection $\mathbf{X} \nu$ has variance $\nu^{T}S\nu$.
>
> Since $S$ is positive semi-definite, the optimization can be carried out along its eigenvectors: the full SVD $\mathbf{X}=UDV^{T}$ gives the spectral decomposition $(n-1)S=\mathbf{X}^{T}\mathbf{X}= VD^{T}DV^{T}$, and the constant factor $1/(n-1)$ does not affect the maximizers. The first objective then becomes $\underset{\nu:\| \nu \| =1}{\arg\max}~ \nu^{T} V D^{T}DV^{T}\nu = \underset{\nu:\| \nu \| =1}{\arg\max}~\| DV^{T}\nu \|^{2} =\cdots$
>
> Now, since *$\nu \mapsto V^{T}\nu=: \tilde{\nu}$ preserves both orthogonality and unit length*, we can equivalently optimize over $\tilde{\nu}:= V^{T}\nu$; undoing the transformation (by left-applying $V$) then gives the optimizers of the original objective:
> $\cdots=V\left(\underset{\tilde{\nu}:\| \tilde{\nu} \|=1 }{\arg\max}~\left\| D\tilde{\nu} \right\|^{2}\right).$
>
> *First principal component*: since $D$ is diagonal, the first maximizer is just $\tilde{\nu}_{1}=e_{1}=(1,0,\dots,0)$ (assuming the singular values in $D$ are in descending order). Transforming back by left-applying $V$ gives $\nu_{1}=v_{1}$, the first column of $V$.
>
> *Successive principal components*: imposing orthogonality of $\tilde{\nu}_{i}$ to the previous maximizers $\tilde{\nu}_{1}=e_{1},\dots,\tilde{\nu}_{i-1}=e_{i-1}$ forces its first $i-1$ coordinates to be $0$, so the constrained maximizer of $\| D\tilde{\nu} \|^{2}$ is $\tilde{\nu}_{i}=e_{i}$. Transforming back by left-applying $V$ gives $\nu_{i}=v_{i}$, as claimed.
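The conclusion can also be checked numerically. Below is a minimal sketch (synthetic data; power iteration with Hotelling deflation is just one concrete way to carry out the greedy variance maximization): extracting variance-maximizing directions one at a time recovers the columns of $V$ up to sign.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 300, 4
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # correlated columns
X = X - X.mean(axis=0)
S = X.T @ X / (n - 1)

def greedy_pcs(S, k, iters=1000):
    """Sequentially maximize nu^T S nu, orthogonally to earlier directions,
    via power iteration on the deflated covariance matrix."""
    S_defl = S.copy()
    pcs = []
    for _ in range(k):
        nu = rng.normal(size=S.shape[0])
        for _ in range(iters):                       # power iteration
            nu = S_defl @ nu
            nu /= np.linalg.norm(nu)
        lam = nu @ S_defl @ nu
        S_defl = S_defl - lam * np.outer(nu, nu)     # Hotelling deflation
        pcs.append(nu)
    return np.column_stack(pcs)

V_greedy = greedy_pcs(S, p)
V = np.linalg.svd(X, full_matrices=False)[2].T       # columns of V from the SVD

print(np.abs(V_greedy.T @ V).round(3))               # ~ identity: same directions up to sign
```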