Importance sampling estimates $\theta=\mathbb{E}[\phi(X)]$ for a function $\phi$ of a random variable $X \sim p$, using a proposal $Y \sim q$, similarly to [[rejection sampling]].
Instead of accepting a proposal $X \leftarrow Y=y$ with probability $p(y) / (Mq(y))$, *importance sampling always accepts the proposals, but penalizes bad proposals through a weight function.*
* Hence unlike rejection sampling, importance sampling can only estimate the expectation $\mathbb{E}[\phi(X)]$, but not simulate $X \sim p$ itself.
> [!definition|*] Importance Sampling
> The **(importance) weight function** is the ratio of the target density to the proposal density: $w(y):=\frac{p(y)}{q(y)},$ and the estimator is the weighted average of the proposals: given a function $\phi$ and $X \sim p$, **the importance estimator** of its finite mean $\theta=\mathbb{E}[\phi(X)]$ is $\hat{ \theta}:=\frac{1}{n}\sum_{i=1}^{n}\phi(Y_{i})w(Y_{i}),$ where $Y_{1},\dots,Y_{n} \overset{\text{iid}}{\sim} q$.
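A minimal numerical sketch of the estimator (the target, proposal, and $\phi$ below are illustrative choices, not part of the definition): target $p = N(0,1)$, proposal $q = N(0, 2^{2})$, and $\phi(x)=x^{2}$, so $\theta = \mathbb{E}[X^{2}] = 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def p_pdf(x):                      # target density p: N(0, 1)
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def q_pdf(x):                      # proposal density q: N(0, 2^2), wider than p
    return np.exp(-x**2 / 8) / np.sqrt(8 * np.pi)

phi = lambda x: x**2               # theta = E[phi(X)] = Var(X) = 1

y = rng.normal(0, 2, size=n)       # Y_i ~ q
w = p_pdf(y) / q_pdf(y)            # importance weights w(Y_i) = p(Y_i) / q(Y_i)
theta_hat = np.mean(phi(y) * w)    # (1/n) * sum_i phi(Y_i) w(Y_i)
print(theta_hat)                   # close to 1
```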
### Properties of the Importance Estimator
The **importance sampling identity** states that the importance estimator is unbiased: assuming $p >0 \Rightarrow q > 0$ (hence $w$ is well-defined wherever $p$ has mass), $\mathbb{E}_{X}[\phi(X)]=\mathbb{E}_{Y}[\phi(Y)w(Y)],$ since $\mathbb{E}_{Y}[\phi(Y)w(Y)]=\int \phi(y)\frac{p(y)}{q(y)}q(y)\,dy=\int \phi(y)p(y)\,dy$.
- The importance estimator is also strongly consistent: $\hat{\theta} \to \theta\,\, \mathrm{a.s.}$
> [!proof]
> Apply the strong law of large numbers to $\hat{\theta}$, the sample average of the i.i.d. terms $\phi(Y_{i})w(Y_{i})$, whose common mean is $\theta$ by the importance sampling identity.
> [!exposition] Variance of the Importance Estimator
> To have a finite variance (so that the estimate converges), note that $\begin{align*}
> \mathrm{Var}(\hat{\theta})&= \frac{1}{n}\mathrm{Var}(\phi(Y)w(Y))\\
> &= \frac{1}{n}\left( \mathbb{E}_{Y}[\phi(Y)^{2}w(Y)^{2}]-\mathbb{E}_{Y}[\phi(Y)w(Y)]^{2} \right)\\
> &= \frac{1}{n}(\mathbb{E}_{X}[\phi(X)^{2}w(X)]-\theta^{2}),
> \end{align*}$ which is finite and $\to 0$ as $n \to \infty$ only when both $\mathbb{E}_{X}[\phi(X)^{2}w(X)]$ and $\theta=\mathbb{E}[\phi(X)]$ are finite.
>
> When $\phi(X)$ has finite expectation and variance, a bounded weight function ($\exists M: |w| \le M$) is sufficient for $\hat{ \theta}$ to have finite variance.
> * A bounded $w=p / q$ requires the tails of $q$ to decay no faster than those of $p$; i.e., the proposal should have heavier tails than the target.
>
> > [!proof]- Proof for Bounded Weight Implying Finite Variance
> > In the expression above, $\mathbb{E}_{X}[\phi^{2}w] \le M\mathbb{E}_{X}[\phi^{2}]=M(\mathrm{Var}(\phi)+\mathbb{E}[\phi]^{2})<\infty,$ and the other term $\theta^{2}$ is also finite, since $\theta=\mathbb{E}[\phi]$ is finite by assumption.
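To illustrate the tail condition numerically, here is a small sketch (the target and proposals are illustrative assumptions): with a Student-$t_{3}$ target, a $N(0,1)$ proposal has lighter tails, so the weights are unbounded, while a Cauchy proposal has heavier tails and keeps the weights bounded.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 100_000
p = stats.t(df=3)                            # target: Student-t with 3 degrees of freedom

for q in (stats.norm(), stats.cauchy()):     # light-tailed vs heavy-tailed proposal
    y = q.rvs(size=n, random_state=rng)      # Y_i ~ q
    w = p.pdf(y) / q.pdf(y)                  # importance weights
    print(q.dist.name, "largest weight:", w.max())
# the normal proposal typically yields isolated weights in the hundreds (growing with n),
# while the Cauchy proposal keeps every weight below roughly 1.3
```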
### Choice of the Proposal Distribution
Unlike rejection sampling, *importance sampling does not require a finite weight $p / q$.* In fact, the proposal $q$ only needs to cover "where it matters": the condition $p(x)\phi(x) \ne 0 \Longrightarrow q(x) \ne 0$ is sufficient for estimating $\mathbb{E}[\phi(X)]$.
- For example, to estimate $\mathbb{P}[X \in (0,1)]=\mathbb{E}[\mathbb{1}_{X \in (0,1)}]$, the proposal can be $q=U[0,1]$, because $\phi=\mathbb{1}_{X \in (0,1)}$ is non-zero only in that interval.
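As a concrete sketch of the bullet above, assume the target is $X \sim N(0,1)$; then $\theta = \Phi(1)-\Phi(0) \approx 0.3413$, and a $U[0,1]$ proposal suffices even though it misses most of $p$'s support.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

p_pdf = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # target density: N(0, 1)
y = rng.uniform(0.0, 1.0, size=n)                           # Y_i ~ U[0, 1], so q(y) = 1
theta_hat = np.mean(p_pdf(y) / 1.0)                         # phi(Y_i) = 1 on all of q's support
print(theta_hat)                                            # ~ 0.3413 = P[0 < X < 1]
```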
Theoretically, the variance becomes $0$ with the **optimal zero-variance** proposal density (for non-negative $\phi$) $q_{\mathrm{opt}}(x)=\frac{p(x)\phi(x)}{\mathbb{E}[\phi(X)]},$ but computing this density requires $\mathbb{E}[\phi(X)]$ itself, which is circular.
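Continuing the sketch above (still assuming $X \sim N(0,1)$ and $\phi = \mathbb{1}_{(0,1)}$), the zero-variance proposal is the normal truncated to $(0,1)$, and every term $\phi(Y_{i})w(Y_{i})$ then equals $\theta$ exactly:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
theta = stats.norm.cdf(1) - stats.norm.cdf(0)    # E[phi(X)] = P[0 < X < 1]

q_opt = stats.truncnorm(0, 1)                    # density p(x) * phi(x) / theta on (0, 1)
y = q_opt.rvs(size=5, random_state=rng)          # Y_i ~ q_opt
w = stats.norm.pdf(y) / q_opt.pdf(y)             # each phi(Y_i) w(Y_i) equals theta exactly
print(w, theta)                                  # all entries ~ 0.3413
```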
### Exponentially Tilted Proposals for Rare Events
> [!idea] Besides using proposals that are easy to sample from, importance sampling can also use proposals that focus on rare events.
Plain Monte Carlo sampling is terrible for rare events, as a huge sample size is needed to observe even a moderate number of occurrences. In importance sampling, however, *a well-chosen proposal distribution can sample heavily from the rare event (although each such sample receives a tiny weight)*.
> [!definition|*] Exponentially Tilted Proposal
> The **exponentially tilted proposal** for $X \sim p$ is the distribution $q_{t}(x)=\frac{p(x) e^{tx}}{M_{X}(t)},$ where $t \in \mathbb{R}$ is the tilting parameter and $M_{X}(t)=\mathbb{E}[e^{tX}]$ is the MGF of $X$; the weight function is then $w(x)=\frac{M_{X}(t)}{e^{tx}}=\frac{\mathbb{E}[e^{tX}]}{e^{tx}}.$
For example, the tilted proposal for $N(\mu, \sigma^{2})$ is just a shift: $q_{t}= N(\mu+t \sigma^{2},\sigma^{2}),$ which generates many samples in the region $(\mu + t \sigma^{2}) \pm 2\sigma$, a region that plain Monte Carlo rarely reaches when $t \ge 3$.
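A sketch of the tilted-normal proposal for a rare-event probability, assuming the target $X \sim N(0,1)$ and $\phi = \mathbb{1}_{X > 4}$, so $\theta = \mathbb{P}[X > 4] \approx 3.2 \times 10^{-5}$ (here $M_{X}(t) = e^{t^{2}/2}$):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
t = 4.0                                  # tilt so that q_t = N(t, 1) is centered on the event

y = rng.normal(t, 1.0, size=n)           # Y_i ~ q_t
w = np.exp(t**2 / 2 - t * y)             # w(y) = M_X(t) / e^{t y} = e^{t^2/2 - t y}
theta_hat = np.mean((y > 4.0) * w)       # average of phi(Y_i) w(Y_i)
print(theta_hat)                         # ~ 3.17e-5; plain MC would need millions of samples
```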
The optimal $t$, i.e. the amount of tilting, should minimize the variance.
* It's often good enough to choose $t$ so that the proposal $q_{t}$ is centered around the area of interest.
* Otherwise, it's possible to numerically find the $t$ that minimizes $\mathrm{Var}(\hat{\theta}_{t})$, as sketched below.
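A crude numerical search for the tilt, continuing the rare-event sketch above (the grid and sample size are arbitrary choices): for each candidate $t$, estimate $\mathrm{Var}(\phi(Y)w(Y))$ empirically and keep the minimizer.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
candidates = np.linspace(2.0, 6.0, 17)            # grid of tilting parameters

def empirical_var(t):
    y = rng.normal(t, 1.0, size=n)                # Y_i ~ q_t = N(t, 1)
    terms = (y > 4.0) * np.exp(t**2 / 2 - t * y)  # phi(Y_i) w(Y_i)
    return terms.var()                            # sample variance of the terms

t_best = min(candidates, key=empirical_var)
print(t_best)   # typically lands near t = 4 (the event threshold); the minimum is shallow
```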
### Normalised Importance Sampling
As in rejection sampling, it's common to only know that $\tilde{p}=p \cdot c_{p}$ and $\tilde{q}=q \cdot c_{q}$, but not $c_{p}$ and $c_{q}$, so the weight function $w=p / q$ is unknown.
In these cases, the weight function $\tilde{w}=\tilde{p} / \tilde{q}$ gives the **normalized importance estimator**: $\hat{ \theta}=\frac{\sum_{i}\tilde{w}(Y_{i})\phi(Y_{i})}{\sum_{i}\tilde{w}(Y_{i})}$
* If $p \ne 0 \Rightarrow q \ne 0$, then *the normalized estimator is strongly consistent, but it is not unbiased*.
> [!proof]-
> Note that by the strong law of large numbers, the numerator (divided by $n$) is an unbiased and strongly consistent estimator of $\mathbb{E}[\tilde{w}(Y)\phi(Y)]=\frac{c_{p}}{c_{q}}\theta$, and the denominator (divided by $n$) of $\mathbb{E}[\tilde{w}(Y)]=\frac{c_{p}}{c_{q}}$.
>
> *Consistency* follows from the above fact: the ratio converges a.s. to the ratio of the limits, in which the unknown factor $c_{p} / c_{q}$ cancels out.
>
> *Bias* arises because, without taking the limit, a ratio of unbiased estimators is not in general an unbiased estimator of the ratio; in this case, $\theta = \frac{\mathbb{E}[\tilde{w}(Y)\phi(Y)]}{\mathbb{E}[\tilde{w}(Y)]}\ne \mathbb{E}\left[ \frac{\sum_{i}\tilde{w}(Y_{i})\phi(Y_{i})}{\sum_{i}\tilde{w}(Y_{i})} \right]=\mathbb{E}[\hat{\theta}]$ when the sample size $n$ is finite.
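A minimal sketch of the normalized estimator, with illustrative choices: the target is known only up to a constant, $\tilde{p}(x) = e^{-x^{2}/2}$ (a standard normal without its constant), proposals come from a standard Cauchy with unnormalized density $\tilde{q}(x) = 1/(1+x^{2})$, and $\phi(x) = x^{2}$, so $\theta = 1$.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000

y = rng.standard_cauchy(size=n)                        # Y_i ~ q (standard Cauchy)
w_tilde = np.exp(-y**2 / 2) * (1 + y**2)               # ptilde / qtilde, off by the unknown c_p / c_q
theta_hat = np.sum(w_tilde * y**2) / np.sum(w_tilde)   # normalized importance estimator
print(theta_hat)                                       # ~ 1.0; the unknown constants cancel in the ratio
```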
When we know $q=\tilde{q}(y) / c_{q}$ but not $c_{p}$ in $p=\tilde{p}(x) / c_{p}$, we can estimate the normalizing constant via the identity $c_{p}=\mathbb{E}_{Y}[\tilde{p}(Y) / q(Y)]$: $\hat{c}_{p}= \frac{1}{n} \sum_{i} \frac{\tilde{p}(Y_{i})}{q(Y_{i})}.$
When we don't know $c_{q}$ either, we can combine this with the estimate $\widehat{c_{q}^{-1}}= \frac{1}{n}\sum_{i} \tilde{q}(Y_{i})^{-1}$, giving the normalized estimate $\hat{c}_{p}= \frac{\frac{1}{n}\sum_{i}\tilde{p}(Y_{i}) / \tilde{q}(Y_{i})}{\frac{1}{n}\sum_{i}\tilde{q}(Y_{i})^{-1}}=\frac{\sum_{i}\tilde{p}(Y_{i}) / \tilde{q}(Y_{i})}{\sum_{i}\tilde{q}(Y_{i})^{-1}},$ which can be viewed as an NIS estimator, with weights $\tilde{q}(Y_{i})^{-1}$, of $\mathbb{E}_{Y\sim \mathrm{Unif}(\mathcal{Y})}[\tilde{p}(Y)]=c_{p}$ (assuming the support $\mathcal{Y}$ has unit volume, so that $\mathbb{E}_{Y}[\tilde{q}(Y)^{-1}]=1 / c_{q}$).
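A sketch of the first identity above, with illustrative choices: $\tilde{p}(x) = e^{-x^{2}/2}$, whose true normalizing constant is $c_{p} = \sqrt{2\pi} \approx 2.5066$, with a fully known standard Cauchy proposal $q$.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000

y = rng.standard_cauchy(size=n)                 # Y_i ~ q
q_pdf = 1.0 / (np.pi * (1.0 + y**2))            # known proposal density (standard Cauchy)
c_p_hat = np.mean(np.exp(-y**2 / 2) / q_pdf)    # (1/n) * sum_i ptilde(Y_i) / q(Y_i)
print(c_p_hat, np.sqrt(2 * np.pi))              # both ~ 2.5066
```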