> [!tldr] Effective Sample Size
> When obtaining a dataset $\mathbf{Y}=(Y_{1},\dots ,Y_{n})$ where $Y_{1},\dots,Y_{n}$ are identically distributed (but not necessarily independent) instances of $Y \sim p$, the **effective sample size** measures the actual amount of information contained in those possibly correlated data.

- One application is in [[Markov Chain Monte Carlo]], which generates correlated data.

Suppose we are estimating a function $f(Y)$ with its sample mean $\hat{f}(\mathbf{Y}):= \frac{1}{n}\sum_{i}f(Y_{i})$, which has estimation variance $\mathrm{Var}(\hat{f}(\mathbf{Y}))$. Then the effective sample size is
$$\mathrm{ESS} := \mathrm{Var}(f(Y)) / \mathrm{Var}(\hat{f}(\mathbf{Y})),$$
which is simply $n$ in the case of i.i.d. data.

### ESS in Weighted Averages

Assume that we have independent samples $Y_{i}$, but are computing a weighted average (with fixed weights $w$) given by
$$\hat{f} ~|~ w=\frac{\sum_{i}w_{i}f(Y_{i})}{\sum_{i}w_{i}}.$$
Let $\bar{w}$ denote the actual weights in the estimator $\hat{\mu}:= \sum_{i}\bar{w}_{i}f(Y_{i})$, e.g. $\bar{w}_{i}=w_{i}/n$ for importance sampling (IS) and $\bar{w}_{i}=w_{i}/\sum_{j}w_{j}$ for normalised importance sampling (NIS), if we pretend their weights are fixed. The average has variance
$$\mathrm{Var}(\hat{f})= \sum_{i}\bar{w}_{i}^{2}\cdot \mathrm{Var}(f(Y)),$$
so the effective sample size evaluates to
$$\mathrm{ESS}= \frac{1}{\| \bar{w} \|^{2} }= \frac{\left( \sum_{i}w_{i} \right)^{2}}{\| w \|^{2} }.$$
Note that $\sum_{i}w_{i}=w^{T}\mathbf{1}$, where $\mathbf{1}:= (1,\dots,1)$, so bounding the Rayleigh quotient by the largest singular value gives
$$\mathrm{ESS}= \left( \frac{w^{T}\mathbf{1}}{\| w \| } \right)^{2}=\frac{w^{T}\mathbf{1}\mathbf{1}^{T}w}{\| w \|^{2}}\le \| \mathbf{1}\mathbf{1}^{T} \|=n,$$
where $\| \cdot \|$ is the matrix 2-norm, and $\mathbf{1}\mathbf{1}^{T}$ is the matrix with every entry equal to $1$. Therefore, the maximum ESS is achieved when $w$ is proportional to the first singular vector of $\mathbf{1}\mathbf{1}^{T}$, namely $\mathbf{1}$, corresponding to an equally weighted average.

For some applications like [[Importance Sampling#Normalised Importance Sampling|normalised importance sampling]], however, the weights are not independent of $Y$, so this approximation is crude. Let $\mu= \mathbb{E}_{p}[f(Y)]$ be the value being estimated, and let the NIS estimator be
$$\hat{\mu}= \frac{\frac{1}{n}\sum_{i}w_{i}f(Y_{i})}{\frac{1}{n}\sum_{i}w_{i}}.$$
When the sample size is large, the numerator and denominator are roughly Gaussian by the CLT. Applying the delta method gives the approximate variance
$$\mathrm{Var}(\hat{\mu})\approx \frac{1}{n}\mathbb{E}_{q}\left[ \left( \frac{p}{q} \right)^2(f-\mu)^2 \right],$$
where $q$ is the proposal, and $p(y) / q(y)=c\cdot w(y)$ are the true IS weights, known only up to the normalising constant $c$. This leads to the finite-sample estimator
$$\widehat{\mathrm{Var}}(\hat{\mu})=\frac{\sum_{i}w(y_{i})^{2}(f(y_{i})-\hat{\mu})^2}{\left( \sum_{i}w(y_{i}) \right)^2},$$
where we estimated $c$ using $\mathbb{E}_{q}[w(Y)]=\mathbb{E}_{q}[p / (cq)]=1/c$.
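To make the formulas above concrete, here is a minimal NumPy sketch; the target $p=\mathcal{N}(0,1)$, proposal $q=\mathcal{N}(0,2^{2})$, and $f(y)=y^{2}$ are illustrative choices, not part of the note. It computes the NIS estimate, the fixed-weight ESS, and the delta-method variance estimator, and previews the $\sigma^{2}/\mathrm{ESS}$ approximation discussed next.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target p = N(0, 1), proposal q = N(0, 2^2); estimate mu = E_p[Y^2] = 1.
n = 100_000
y = rng.normal(0.0, 2.0, size=n)             # draws from the proposal q
log_w = -0.5 * y**2 + 0.5 * (y / 2.0) ** 2   # log(p/q) up to an additive constant
w = np.exp(log_w - log_w.max())              # unnormalised weights w = (p/q) / c

f = y**2
mu_hat = np.sum(w * f) / np.sum(w)           # NIS estimator

ess = np.sum(w) ** 2 / np.sum(w**2)          # ESS = (sum_i w_i)^2 / ||w||^2

# Finite-sample delta-method variance estimate from the text.
var_hat = np.sum(w**2 * (f - mu_hat) ** 2) / np.sum(w) ** 2

# Crude approximation: replacing each (f_i - mu_hat)^2 by its weighted
# average sigma^2 gives var_hat ≈ sigma^2 / ESS.
sigma2 = np.sum(w * (f - mu_hat) ** 2) / np.sum(w)
print(mu_hat, ess, var_hat, sigma2 / ess)    # mu_hat ≈ 1
```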
Now approximating each $(f(y_{i})-\hat{\mu})^{2}$ by their average $\sigma^2$ (or $\frac{n-1}{n}\sigma^2$), we recover the ESS-based estimate
$$\widehat{\mathrm{Var}}(\hat{\mu})=\frac{\sigma^{2}}{\mathrm{ESS}}.$$

### ESS in Correlated Samples

In practice, we cannot directly compute $\mathrm{Var}(\hat{f}(\mathbf{Y}))$, which equals
$$\mathrm{Var}(\hat{f}(\mathbf{Y}))=\frac{1}{n^{2}}\sum_{i,j} \mathrm{Cov}(f(Y_{i}),f(Y_{j})).$$
Denoting $\Sigma_{ij}:= \mathrm{Cov}(f(Y_{i}),f(Y_{j}))$ and $\mathrm{Var}(f(Y))=\sigma^{2}$, we can first rewrite this covariance with [[Autocorrelations|lag-$k$ autocorrelations]] $\rho_{k}:= \mathrm{Corr}(f(Y_{t}),f(Y_{t+k}))$ (assuming stationarity, so $\Sigma_{ij}=\sigma^{2}\rho_{|i-j|}$), giving the approximation
$$\mathrm{Var}(\hat{f}(\mathbf{Y}))= \frac{1}{n^{2}}\sum_{ij}\Sigma_{ij}=\frac{\sigma^{2}}{n^{2}}\sum_{ij}\rho_{|i-j|}\approx \frac{\sigma^{2}}{n}\left( 1+2\sum_{k=1}^{n-1} \rho_{k} \right),$$
where we (1) used $\rho_{0}=1$, (2) used the fact that each $\rho_{k\ge 1}$ appears both above and below the diagonal, hence the factor $2$, and (3) approximated the count of each $\rho_{k}$ ($n-k$ terms) by $n$; since $\rho_{k}\to 0$ quickly for large $k$, the over-estimation is slight. Equivalently, we approximated the sum over the matrix $\Sigma$ by $n$ copies of the first row/column sum $\Sigma_{11}+\sum_{i\ge 2} \Sigma_{i1}+ \sum_{j \ge 2} \Sigma_{1j}$.

We can define the **integrated autocorrelation time (IACT)**
$$\begin{align*} \tau_{f}&:= 1+2\sum_{k=1}^{n-1}\rho_{k},\\[0.4em] \mathrm{Var}(\hat{f}) &\approx\sigma^{2}\cdot \frac{\tau_{f}}{n}, \\ \mathrm{ESS} &\approx \frac{n}{\tau_{f}}, \end{align*}$$
so *the IACT measures how many correlated samples are needed to match the information in one independent sample*.

> [!connection]
> This is equivalent to the [[Effective Number of Stocks]] for a portfolio with equal weights $\mathbf{w}=\left( \frac{1}{n},\dots, \frac{1}{n} \right)$ of stocks with covariance matrix $\Sigma$.

Lastly, replacing $\rho_{k}$ with their sample estimates $r_{k}$, and truncating the sum at some $M < n$ (since $r_{k}$ for large $k$ are estimated from very few samples and are basically noise), we have the estimate
$$\begin{align*} \hat{\tau}_{f}&= 1+2 \sum_{k=1}^{M}r_{k},\\ \widehat{\mathrm{ESS}}&= \frac{n}{\hat{\tau}_{f}}. \end{align*}$$
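A sketch of this final estimator on a synthetic chain. The AR(1) process and the truncate-before-the-first-negative-$r_{k}$ rule for choosing $M$ are my illustrative choices (more careful truncation windows exist); the note itself only requires some $M < n$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative correlated chain: AR(1), x_t = phi * x_{t-1} + eps_t, whose
# true IACT is 1 + 2 * sum_k phi^k = (1 + phi) / (1 - phi) = 19 for phi = 0.9.
phi, n = 0.9, 50_000
x = np.empty(n)
x[0] = rng.normal()
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal()

def ess_from_chain(x, max_lag=1_000):
    """ESS-hat = n / tau_hat, with tau_hat = 1 + 2 * sum_{k=1}^{M} r_k."""
    n = len(x)
    xc = x - x.mean()
    var = xc @ xc / n
    # Sample lag-k autocorrelations r_k for k = 1..max_lag.
    r = np.array([(xc[:-k] @ xc[k:]) / (n * var) for k in range(1, max_lag + 1)])
    # Truncation rule (one simple choice): stop before the first negative r_k,
    # since those tail estimates are mostly noise.
    neg = np.nonzero(r < 0)[0]
    M = neg[0] if neg.size else max_lag
    tau_hat = 1.0 + 2.0 * r[:M].sum()
    return n / tau_hat, tau_hat

ess, tau_hat = ess_from_chain(x)
print(tau_hat, ess)  # tau_hat ≈ 19, ess ≈ n / 19
```

For $\phi=0.9$ the chain's true $\tau_{f}$ is $19$, so roughly $19$ correlated draws carry the information of one independent draw, and $\widehat{\mathrm{ESS}}$ should come out near $n/19$.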