Importance sampling estimates $\theta=\mathbb{E}[\phi(X)]$ for some random variable $X \sim p$ using a proposal $Y \sim q$, similar to [[rejection sampling]].
Instead of accepting a proposal $X \leftarrow Y=y$ with probability $\frac{p(y)}{Mq(y)}$, *importance sampling always accepts the proposal, but penalizes bad proposals with the weight function.*
* Hence unlike rejection sampling, importance sampling can only estimate the expectation $\mathbb{E}[\phi(X)]$, but not simulate $X \sim p$ itself.
> [!definition|*] Importance Sampling
> The **(importance) weight function** is the ratio of the target to the proposal density: $w(y):=\frac{p(y)}{q(y)},$ and the estimator is a weighted average over the proposals: given a function $\phi$ and $X \sim p$ with finite mean $\theta=\mathbb{E}[\phi(X)]$, the **importance estimator** of $\theta$ is $\hat{\theta}:=\frac{1}{n}\sum_{i=1}^{n}\phi(Y_{i})w(Y_{i}),$ where $Y_{1},\dots,Y_{n}\overset{\mathrm{iid.}}{\sim} q$.
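A minimal Python sketch of this estimator (the target $N(0,1)$, the wider proposal $N(0,2^{2})$, and $\phi(x)=x^{2}$ are illustrative choices, not from the text, so $\theta=\mathbb{E}[X^{2}]=1$):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

p = norm(0, 1)        # target N(0, 1)
q = norm(0, 2)        # proposal N(0, 2^2), wider than the target
phi = lambda x: x**2  # phi(x) = x^2, so theta = E[X^2] = 1 under the target

n = 100_000
y = q.rvs(size=n, random_state=rng)  # proposals Y_i ~ q
w = p.pdf(y) / q.pdf(y)              # importance weights w(Y_i) = p(Y_i) / q(Y_i)
theta_hat = np.mean(phi(y) * w)      # (1/n) * sum_i phi(Y_i) w(Y_i)
print(theta_hat)                     # ~ 1.0
```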
### Properties of the Importance Estimator
The **importance sampling identity** shows that the importance estimator is unbiased: assuming $p >0 \Rightarrow q > 0$ (so that $w$ is well-defined wherever $p$ has mass), $\mathbb{E}_{X\sim p}[\phi(X)]=\mathbb{E}_{Y \sim q}\left[ \phi(Y) \frac{p(Y)}{q(Y)} \right].$
- The importance estimator is also strongly consistent: $\hat{\theta} \to \theta\,\, \mathrm{a.s.}$, since it is an average of i.i.d. copies of $\phi(Y)w(Y)$ with $Y \sim q$.
> [!exposition] Variance of the Importance Estimator
> The variance of the estimator is $\begin{align*}
> \mathrm{Var}(\hat{\theta})&= \frac{1}{n}\mathrm{Var}_{Y\sim q}(\phi(Y)w(Y))\\
> &= \frac{1}{n}\left( \mathbb{E}_{Y \sim q}[\phi(Y)^{2}w(Y)^{2}]-\mathbb{E}_{Y\sim q}[\phi(Y)w(Y)]^{2} \right)\\
> &= \frac{1}{n}\left(\mathbb{E}_{X \sim p}[\phi(X)^{2}w(X)]-\theta^{2}\right),
> \end{align*}$which $\to 0$ as $n \to \infty$, provided $\mathbb{E}_{p}[\phi(X)^{2}w(X)]$ is finite (the mean $\theta=\mathbb{E}_{p}[\phi(X)]$ is finite by assumption).
>
> When $\phi(X)$ has a finite mean and variance under $p$, a bounded weight ($w \le M$ for some constant $M$) is sufficient for $\hat{\theta}$ to have finite variance.
> * A bounded $w=p / q$ requires the tails of $q$ to decay no faster than those of $p$; i.e., the proposal should have heavier tails than the target.
>
> > [!proof]- Proof for Bounded Weight Implying Finite Variance
> > In the expression above, $\mathbb{E}_{p}[\phi^{2}w] \le M\mathbb{E}_{p}[\phi^{2}]=M(\mathrm{Var}(\phi)+\mathbb{E}[\phi]^{2})<\infty,$ and the other term $\theta^{2}$ is also finite, since $\theta=\mathbb{E}[\phi]$ is finite by assumption.
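A rough numerical illustration of the tail condition (all distributions here are illustrative choices, not from the text): against a $N(0,1)$ target, a Student-$t$ proposal has heavier tails, so the weights stay bounded, whereas a narrower $N(0, 0.5^{2})$ proposal makes $w=p/q$ explode in the tails and the per-sample variance much larger.

```python
import numpy as np
from scipy.stats import norm, t

rng = np.random.default_rng(1)
p = norm(0, 1)  # target N(0, 1); estimate E[X^2] = 1

def is_stats(q, n=50_000):
    """Return the importance estimate and the empirical per-sample variance."""
    y = q.rvs(size=n, random_state=rng)
    terms = y**2 * p.pdf(y) / q.pdf(y)   # phi(Y_i) * w(Y_i)
    return np.mean(terms), np.var(terms)

print(is_stats(t(df=3)))       # heavier tails than the target: bounded weights
print(is_stats(norm(0, 0.5)))  # thinner tails: w = p/q is unbounded, variance blows up
```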
### Choice of the Proposal Distribution
Unlike rejection sampling, *importance sampling does not require a bounded weight $p / q$.* In fact, the proposal $q$ only needs to cover "where it matters": $p(x)\phi(x) \ne 0 \Longrightarrow q(x) \ne 0$ is sufficient for estimating $\mathbb{E}[\phi(X)]$.
- For example, to estimate $\mathbb{P}[X \in (0,1)]=\mathbb{E}[\mathbb{1}_{X \in (0,1)}]$, the proposal can be $q=U[0,1]$, because $\phi=\mathbb{1}_{X \in (0,1)}$ is non-zero only in that interval.
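A sketch of that example, assuming (for concreteness) the target is $X \sim N(0,1)$: since $q \equiv 1$ on $(0,1)$ and $\phi(Y_{i})=1$ for every proposal, the estimator reduces to the average of $p(Y_{i})$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 100_000

y = rng.uniform(0.0, 1.0, size=n)   # proposals Y_i ~ Unif[0, 1], density q = 1 on (0, 1)
w = norm.pdf(y)                     # w(Y_i) = p(Y_i) / q(Y_i) = p(Y_i)
theta_hat = np.mean(w)              # phi(Y_i) = 1 for every proposal
print(theta_hat, norm.cdf(1) - norm.cdf(0))  # both ~ 0.3413
```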
Theoretically, the variance of the estimator is minimised by
$q_{\mathrm{opt}}(x)\propto p(x)\,| \phi(x) |,$
but normalising this density requires knowing the integral $\mathbb{E}_{p}[| \phi(X) |]$, which is generally as hard as the original integral.
> [!proof] Proof of optimality
> Each term of the estimator has variance $\mathrm{Var}_{q}(\phi(Y)w(Y))=\mathbb{E}_{q}[\phi(Y)^{2}w(Y)^{2}]- \theta^{2}\ge \mathbb{E}_{q}[| \phi(Y) |\, w(Y)]^{2}-\theta^{2}=\mathbb{E}_{p}[| \phi(X) |]^{2}-\theta^{2}.$
> The inequality is Jensen's inequality, and the lower bound no longer depends on $q$; it is attained if and only if the integrand is constant, i.e. $| \phi(Y) |\, w(Y)=\mathrm{const.}$, which holds exactly when $q \propto | \phi |\cdot p$.
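Continuing the indicator example above (same illustrative target $N(0,1)$ and $\phi=\mathbb{1}_{(0,1)}$): the optimal proposal is the standard normal truncated to $(0,1)$, and every term $\phi(Y_{i})w(Y_{i})$ then equals $\Phi(1)-\Phi(0)$ exactly, i.e. a zero-variance estimator.

```python
import numpy as np
from scipy.stats import norm, truncnorm

rng = np.random.default_rng(3)

q_opt = truncnorm(a=0.0, b=1.0)           # N(0,1) truncated to (0,1): proportional to p * |phi|
y = q_opt.rvs(size=10, random_state=rng)
terms = norm.pdf(y) / q_opt.pdf(y)        # phi(Y_i) * w(Y_i), with phi = 1 on (0, 1)
print(terms)                              # every term equals Phi(1) - Phi(0) ~ 0.3413
```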
### Exponentially Tilted Proposals for Rare Events
> [!idea] Besides using proposals that are easy to sample from, importance sampling can also use proposals that focus on rare events.
Plain Monte Carlo sampling is terrible for rare events: a huge sample size is needed to observe even a moderate number of occurrences. In importance sampling, however, *a well-chosen proposal can sample heavily from the rare-event region (each such sample then carries a tiny weight)*.
> [!definition|*] Exponentially Tilted Proposal
> The **exponentially tilted proposal** for $X \sim p$ is the distribution $q_{t}(x)=\frac{p(x)\, e^{tx}}{M_{X}(t)},$ where $t \in \mathbb{R}$ is the tilting parameter and $M_{X}(t)=\mathbb{E}[e^{tX}]$ is the MGF of $X$; the corresponding weight function is $w(x)=\frac{M_{X}(t)}{e^{tx}}=\frac{\mathbb{E}[e^{tX}]}{e^{tx}}.$
For example, the tilted proposal for $N(\mu, \sigma^{2})$ is just a shift: $q_{t}=N(\mu+t \sigma^{2},\sigma^{2}).$ This generates many samples around $(\mu + t \sigma^{2}) \pm 2\sigma$, a region that simple Monte Carlo rarely reaches once $t \ge 3$.
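A sketch for a rare event, with illustrative numbers: estimating $\mathbb{P}[X>4]$ for $X\sim N(0,1)$ with the tilted proposal $q_{t}=N(t,1)$ at $t=4$, using $w(x)=M_{X}(t)e^{-tx}=e^{t^{2}/2-tx}$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
t_tilt = 4.0   # tilting parameter; centres the proposal on the rare region
n = 100_000

y = rng.normal(loc=t_tilt, scale=1.0, size=n)  # proposals from q_t = N(t, 1)
w = np.exp(t_tilt**2 / 2 - t_tilt * y)         # w(y) = M_X(t) / e^{t y} = exp(t^2/2 - t*y)
theta_hat = np.mean((y > 4.0) * w)             # phi = 1{x > 4}
print(theta_hat, norm.sf(4.0))                 # both ~ 3.2e-05
```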
The optimal $t$, i.e. the amount of tilting, should minimize the variance.
* It's good enough if the proposal $q_{t}$ is centered around the area of interest.
* Otherwise, it's possible to numerically find the $t$ that minimizes $\mathrm{Var}(\hat{\theta}_{t})$.
### Normalised Importance Sampling
As in rejection sampling, it's common to know the unnormalised densities $\tilde{p}=p \cdot c_{p}$ and $\tilde{q}=q \cdot c_{q}$, but not the constants $c_{p}$ and $c_{q}$, so the weight function $w=p / q$ cannot be evaluated.
In these cases, the unnormalised weight function $\tilde{w}:=\tilde{p} / \tilde{q}$ gives the **normalized importance estimator**: $\hat{ \theta}:=\frac{\sum_{i}\tilde{w}(Y_{i})\phi(Y_{i})}{\sum_{i}\tilde{w}(Y_{i})}.$
* If $p \ne 0 \Rightarrow q \ne 0$, then *the normalized estimator is strongly consistent, but it is no longer unbiased* (see the code sketch after the proof below).
> [!proof]-
> Note that by the strong law of large numbers, the numerator divided by $n$ is an unbiased and strongly consistent estimator of $\mathbb{E}[\tilde{w}(Y)\phi(Y)]=\frac{c_{p}}{c_{q}}\theta$, and the denominator divided by $n$ is one of $\mathbb{E}[\tilde{w}(Y)]=\frac{c_{p}}{c_{q}}$.
>
> *Consistency* follows from the above fact, where the normalizing $c_{p} / c_{q}$ cancel out in the ratio's limit.
>
> *Bias* arises because, without taking the limit, a ratio of unbiased estimators is not in general an unbiased estimator of the ratio; in this case $\theta = \frac{\mathbb{E}[\tilde{w}(Y)\phi(Y)]}{\mathbb{E}[\tilde{w}(Y)]}\ne \mathbb{E}\left[ \frac{\sum_{i}\tilde{w}(Y_{i})\phi(Y_{i})}{\sum_{i}\tilde{w}(Y_{i})} \right]=\mathbb{E}[\hat{\theta}]$ whenever the sample size $n$ is finite.
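A sketch of the normalized estimator, with illustrative unnormalised densities: $\tilde{p}(x)=e^{-x^{2}/2}$ (a standard normal missing its constant) and $\tilde{q}(x)=(1+x^{2})^{-1}$ (a standard Cauchy missing its $1/\pi$), estimating $\mathbb{E}[X^{2}]=1$ under the target.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

y = rng.standard_cauchy(size=n)            # proposals Y_i ~ Cauchy (easy to sample exactly)
w_tilde = np.exp(-y**2 / 2) * (1 + y**2)   # unnormalised weights p~(Y_i) / q~(Y_i)
theta_hat = np.sum(w_tilde * y**2) / np.sum(w_tilde)   # self-normalised estimate of E[X^2]
print(theta_hat)                           # ~ 1.0
```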
When we know $q=\tilde{q}(y) / c_{q}$ exactly, but not $c_{p}$ in $p=\tilde{p}(x) / c_{p}$, we can estimate the normalizing constant via the identity $c_{p}=\mathbb{E}_{Y\sim q}[\tilde{p}(Y) / q(Y)]$: $\hat{c}_{p}= \frac{1}{n} \sum_{i} \frac{\tilde{p}(Y_{i})}{q(Y_{i})}.$
When we don't know $c_{q}$ either, we can combine this with the estimate $\hat{c}_{q}^{-1}= \frac{1}{n}\sum_{i} \tilde{q}(Y_{i})^{-1}$, giving the normalized estimate $\hat{c}_{p}= \frac{\frac{1}{n}\sum_{i}\tilde{p}(Y_{i}) / \tilde{q}(Y_{i})}{\frac{1}{n}\sum_{i}\tilde{q}(Y_{i})^{-1}}=\frac{\sum_{i}\tilde{p}(Y_{i}) / \tilde{q}(Y_{i})}{\sum_{i}\tilde{q}(Y_{i})^{-1}},$which can be read as an NIS estimator, with weights $\tilde{q}(Y_{i})^{-1}$, of $\mathbb{E}_{Y\sim \mathrm{Unif} (\mathcal{Y})}[\tilde{p}(Y)]=c_{p}$ (assuming the support $\mathcal{Y}$ has unit volume, so that the uniform density is $1$).
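A sketch of the first case (fully known $q$), with illustrative choices $\tilde{p}(x)=e^{-x^{2}/2}$ and $q=N(0,2^{2})$, so the true constant is $c_{p}=\sqrt{2\pi}$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
n = 100_000

q = norm(0, 2)                        # fully known proposal
y = q.rvs(size=n, random_state=rng)
c_p_hat = np.mean(np.exp(-y**2 / 2) / q.pdf(y))   # (1/n) * sum_i p~(Y_i) / q(Y_i)
print(c_p_hat, np.sqrt(2 * np.pi))    # both ~ 2.5066
```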
We can reproduce the [[Rejection Sampling]] estimator using random weights: in particular, $w_{i}:= \mathbf{1}\left\{ U_{i}\le \frac{p(Y_{i})}{Mq(Y_{i})} \right\},$
where $U_{i}\overset{\mathrm{iid.}}{\sim} \mathrm{Unif}[0,1]$ and $Y_{i} \sim q$. The sum of the weights is then just the number of accepted proposals, so over $n$ iterations (not $n$ accepted samples) the normalized estimator reduces to the RS estimator: $\hat{\mathbb{E}}_{X\sim p}[\phi(X)]= \frac{\sum_{i=1}^{n}\phi(Y_{i}) \, w_{i}}{\sum_{i}w_{i}}= \frac{\sum_{i \text{ accepted}}\phi(Y_{i})}{| \{ i \text{ accepted} \} |}.$
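A sketch of this reduction with an illustrative target $\mathrm{Beta}(2,2)$ (density $6x(1-x)\le 1.5$ on $[0,1]$), proposal $q=\mathrm{Unif}[0,1]$, and $M=1.5$, estimating $\mathbb{E}[X]=0.5$.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
M = 1.5                                # max of p on [0, 1], so p <= M * q with q = 1

y = rng.uniform(size=n)                # proposals Y_i ~ Unif[0, 1]
u = rng.uniform(size=n)
w = (u <= 6 * y * (1 - y) / M)         # random 0/1 weights = acceptance indicators
theta_hat = np.sum(y * w) / np.sum(w)  # normalised IS estimate = average over accepted samples
print(theta_hat)                       # ~ 0.5
```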