Consider the least squares problem $x^{\ast}:=\underset{x}{\arg\min}~\| Ax-b \|_{2},$ with an overdetermined system and a very tall-and-skinny $A \in \mathbb{R}^{m\times n}$ (so $m \gg n$). For very large $m$, we want to find an approximate solution $\hat{x} \approx x^{\ast}$ cheaply. The most naive way to decrease complexity is **row subset selection**: select some $s \ll m$ rows of $A$ and the corresponding entries of $b$, say $\tilde{A}$ and $\tilde{b}$, then solve $\hat{x}:=\underset{x}{\arg\min}~\| \tilde{A}x-\tilde{b} \| \approx \underset{x}{\arg\min}~ \| Ax-b \|. $

Deterministic subsetting has terrible worst cases. WLOG suppose we select the first $s$ rows of $A$; this is countered by the problem $A=\begin{bmatrix} \epsilon I_{s} \\ I_{s} \\ \vdots \\ I_{s} \end{bmatrix},~~~b=\begin{bmatrix}b_{0} \\ b_{0} \\ \vdots \\ b_{0}\end{bmatrix},$ for which the least squares solution is $x^{\ast} \approx b_{0}$ (as $m \to \infty$), but subsetting produces $\hat{x}=b_{0} / \epsilon$.

### Sketching Matrices

The subsetting idea can be generalized to pre-applying a **sketch matrix** $G\in \mathbb{R}^{s\times m}$ to the problem: $\underset{x}{\arg\min}~\| G(Ax -b) \| \approx \underset{x}{\arg\min}~\| Ax-b \|, $ where subsetting corresponds to $g_{ij}:=\mathbf{1}\{ \text{row }j \text{ is the } i \text{th row sampled}\}$. But we don't have to throw away most of the rows using a sparse $G$ -- instead we can combine information from all rows by mixing them. In particular, we can *use a random matrix $G$ to get a good mix with high probability.* If $\hat{x}$ is the solution to the sketched problem $\underset{x}{\arg\min}~\| G(Ax-b) \|$ with a random Gaussian $G \in \mathbb{R}^{s\times m}$, then the error satisfies $\| A\hat{x}-b \| \le \frac{\sqrt{ s }+ \sqrt{ n+1 }}{\sqrt{ s }-\sqrt{ n+1 }} \| Ax^{\ast}-b \|$ with high probability.
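This is easy to check numerically. Below is a minimal NumPy sketch (the sizes $m, n, s$ and the seed are arbitrary illustrative choices): solve a tall least squares problem exactly, solve the Gaussian-sketched version, and compare residuals against the bound.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tall-and-skinny least squares problem (m >> n); sizes are illustrative.
m, n, s = 20_000, 10, 200
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n) + 0.1 * rng.standard_normal(m)

# Exact solution of the full problem.
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)

# Gaussian sketch: G is s x m with i.i.d. N(0, 1) entries;
# solve argmin_x ||G A x - G b||.
G = rng.standard_normal((s, m))
x_hat, *_ = np.linalg.lstsq(G @ A, G @ b, rcond=None)

# Compare the true residuals of both solutions.
r_star = np.linalg.norm(A @ x_star - b)
r_hat = np.linalg.norm(A @ x_hat - b)
bound = (np.sqrt(s) + np.sqrt(n + 1)) / (np.sqrt(s) - np.sqrt(n + 1))
print(r_hat / r_star, bound)  # ratio is >= 1 and well below the bound
```

Note that $\hat{x}$ is found from an $s \times n$ problem, so after the sketch multiply the solve itself no longer depends on $m$.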
Every residual $Ax-b$ lies in the $(n+1)$-dimensional subspace $\operatorname{range}([A~~b])$, and restricted to this subspace a random Gaussian $G$ is well-conditioned, with singular values within $(\sqrt{ s } \pm \sqrt{ n+1 })$ with high probability (by Marchenko-Pastur). Hence $\frac{\| G(Ax-b) \|}{\| Ax-b \| } \in (\sigma_{\min}, \sigma_{\max}) \subset(\sqrt{ s } \pm \sqrt{ n+1 }),~~~(\dagger)$ where $\sigma_{\min}, \sigma_{\max}$ are the extreme singular values of $G$ restricted to this subspace.

Since $\hat{x}$ minimizes the sketched objective, substituting $x=x^{\ast}$ in $(\dagger)$ gives $\| G(A\hat{x}-b) \| \le\| G(Ax^{\ast}-b) \| \le (\sqrt{ s }+\sqrt{ n+1 })\| Ax^{\ast}-b \|. $ But for the LHS, substituting $x=\hat{x}$ in $(\dagger)$ gives $\| G(A\hat{x}-b) \| \ge \sigma_{\min}\| A\hat{x}-b \| \ge (\sqrt{ s }-\sqrt{ n+1 })\| A\hat{x}-b \| . $ Combining the two gives $\| A\hat{x}-b \| \le \frac{\sqrt{ s }+\sqrt{ n+1 }}{\sqrt{ s }-\sqrt{ n+1 }}\| Ax^{\ast}-b \|. $
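The containment $(\dagger)$ can also be checked empirically. A small NumPy sketch (sizes and seed again arbitrary): draw $G$ once, then compare $\| G(Ax-b) \| / \| Ax-b \|$ for many random $x$ against $\sqrt{s} \pm \sqrt{n+1}$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sizes; residuals Ax - b live in the (n+1)-dim
# subspace range([A  b]).
m, n, s = 20_000, 10, 300
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
G = rng.standard_normal((s, m))

# Predicted range for ||G v|| / ||v|| on that subspace.
lo, hi = np.sqrt(s) - np.sqrt(n + 1), np.sqrt(s) + np.sqrt(n + 1)

# Sample many residual directions v = Ax - b and record the ratio.
ratios = []
for _ in range(100):
    v = A @ rng.standard_normal(n) - b
    ratios.append(np.linalg.norm(G @ v) / np.linalg.norm(v))

print(lo, min(ratios), max(ratios), hi)  # all ratios stay inside (lo, hi)
```

In practice the ratios concentrate tightly around $\sqrt{s}$, well inside the interval; the $\pm\sqrt{n+1}$ slack is what makes the bound hold uniformly over the whole subspace.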