> [!tldr]
> Bootstrap methods offer ways of estimating functionals of the original distribution $X \sim F$ by replacing it with an estimate $\hat{F}$, usually indirectly via resampling.
The most common form of bootstrap is the **nonparametric bootstrap**, which treats the original sample $\mathcal{T}$ as a population. It draws **bootstrap samples** $\mathcal{T}_{1},\dots,\mathcal{T}_{B}$ with replacement from it (usually with $| \mathcal{T}_{b} |=| \mathcal{T} |$) and uses them to compute the **bootstrap estimates** $\hat{\theta}_{b}$.
- **Bagging (Bootstrap AGGregatING)** them gives the estimate $\hat{\theta}^{\mathrm{boot}} := \frac{1}{B}\sum_{b=1}^{B}\hat{\theta}_{b}$.
- Alternatively, we can treat $\mathcal{T}_{b}$ as "new" samples and use the $\hat{\theta}_{b}$ as if they were independent realizations of $\hat{\theta}$ to study its variability.
> [!idea]
>
> Schematically, $\begin{align*}
&\text{True value}: &&F \Longrightarrow \theta;\\
&\text{Point estimate}:&\mathcal{T} \Longrightarrow \,&[\hat{F}=\mathrm{Unif}(\mathcal{T})] \Longrightarrow \hat{\theta};\\
&\text{Bootstrapping}: &\mathcal{T} \Longrightarrow \,&[\hat{F} =\mathrm{Unif}( \mathcal{T})] \Longrightarrow \mathcal{T}_{b}.
\end{align*}$
>
> For example, in estimating the population mean $\theta = \int X \, dF$, the frequentist estimate uses the sample mean $\hat{\theta}=\int X \, d\hat{F}$.
>
> Note that estimating $F$ with the empirical distribution $\hat{F}$ of the sample is not new; *the only new thing is using $\hat{F}$ for other purposes*:
> - Computing metrics like $\mathrm{Var}(\hat{\theta})$;
> - Drawing new samples $\mathcal{T}_{1},\dots, \mathcal{T}_{B}$;
> - Bagging by computing $\hat{\theta}_{b}$ from $\mathcal{T}_{b}$.
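As a minimal illustration of the resampling and bagging steps, here is a sketch in NumPy; the data, the statistic `np.median`, and $B$ are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.standard_normal(100)      # original sample T, drawn from F = N(0, 1) here
B = 1000                          # number of bootstrap samples

theta_hat = np.median(T)          # point estimate via F_hat = Unif(T)

# Draw T_1, ..., T_B with replacement from T and compute theta_hat_b on each
theta_boot = np.array([
    np.median(rng.choice(T, size=len(T), replace=True)) for _ in range(B)
])

theta_bagged = theta_boot.mean()  # bagged estimate: (1/B) * sum_b theta_hat_b
```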
## Bootstrap Distribution of Point Estimators
### Bootstrap Variance Estimation
Here the value of interest is $\mathrm{Var}_{F}(\hat{\theta})$, i.e. the variance of the point estimate $\hat{\theta}$ computed from iid. samples drawn from $F$.
Bootstrap estimates this value in two steps: first approximate $F$ with $\hat{F}$, then approximate $\mathrm{Var}_{\hat{F}}(\hat{\theta})$ via Monte Carlo simulation.
> [!algorithm] Nonparametric Bootstrap Variance Estimate
> For $[b=1,\dots,B]$ :
> - Draw bootstrap sample $\mathcal{T}_{b} \overset{\mathrm{iid.}}{\sim} \hat{F}$.
> - Compute the point estimate $\hat{\theta}_{b}:=\hat{\theta}(\mathcal{T}_{b})$ on the bootstrap sample.
>
> Estimate $\mathrm{Var}_{\hat{F}}(\hat{\theta})$ by treating $\hat{\theta}_{1},\dots,\hat{\theta}_{B}$ as Monte Carlo samples of $\hat{\theta}$:
> - Compute their mean $\bar{\theta}:= \frac{1}{B}\sum_{b}\hat{\theta}_{b}$.
> - Then return their sample variance $\hat{\sigma}^{2}_{\hat{\theta}}:= \frac{1}{B}\sum_{b}(\hat{\theta}_{b}-\bar{\theta})^{2}.$
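A minimal sketch of this algorithm in NumPy; the statistic (here the sample median) and the data are placeholders:

```python
import numpy as np

def bootstrap_variance(T, statistic, B=1000, rng=None):
    """Nonparametric bootstrap estimate of Var_F(theta_hat)."""
    if rng is None:
        rng = np.random.default_rng()
    # Compute theta_hat_b on each bootstrap sample T_b
    theta_boot = np.array([
        statistic(rng.choice(T, size=len(T), replace=True)) for _ in range(B)
    ])
    # Sample variance of the Monte Carlo draws (1/B normalization, as in the text)
    return theta_boot.var()

T = np.random.default_rng(0).standard_normal(50)
var_hat = bootstrap_variance(T, np.median)
```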
### Bootstrap Confidence Intervals
The crudest way of using the bootstrap to construct a CI is to assume (asymptotic/approximate) normality of $\hat{\theta}$ and then follow the schema $\hat{F} \Longrightarrow (\mathcal{T}_{1},\dots,\mathcal{T}_{B}) \Longrightarrow (\hat{\theta}_{1}, \dots, \hat{\theta}_{B}) \Longrightarrow \hat{\sigma}_{\hat{\theta}} \Longrightarrow \mathrm{CI}.$
> [!exposition]- Bootstrap Gaussian CI
> The normality approximation is
> $\frac{\hat{\theta}-\theta}{\sqrt{ \mathrm{Var}_{F}(\hat{\theta}) }}\overset{D}{\approx}N(0,1).$
> Replacing the variance with its bootstrap approximation $\hat{\sigma}^{2}_{\hat{\theta}}$ gives
> $\frac{\hat{\theta}-\theta}{\hat{\sigma}_{\hat{\theta}}}\overset{D}{\approx}N(0,1),$
> and solving for $\theta$ gives its confidence interval:
> $\theta \overset{\alpha}{\in}(\hat{\theta} \pm z_{1-\alpha / 2}\cdot \hat{\sigma}_{\hat{\theta}}),$
> where $1-\alpha$ is the confidence level and $z_{1-\alpha / 2}$ the $(1-\alpha /2)$ quantile of the standard Gaussian.
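A sketch of the resulting Gaussian CI (SciPy is assumed only for the Gaussian quantile; the data, the statistic, and $\alpha=0.05$ are arbitrary):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
T = rng.standard_normal(80)            # original sample
B, alpha = 2000, 0.05

theta_hat = np.mean(T)                 # point estimate (sample mean)
theta_boot = np.array([
    np.mean(rng.choice(T, size=len(T), replace=True)) for _ in range(B)
])
se_hat = theta_boot.std()              # bootstrap estimate sigma_hat of sd(theta_hat)

z = norm.ppf(1 - alpha / 2)            # z_{1 - alpha/2}
ci = (theta_hat - z * se_hat, theta_hat + z * se_hat)
```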
Alternatively, we can directly approximate $G$, the distribution of $\hat{\theta}-\theta$, with some $\hat{G}$, e.g. the empirical distribution of the $\hat{\theta}_{b}-\hat{\theta}$: $\hat{F} \Longrightarrow (\mathcal{T}_{1},\dots,\mathcal{T}_{B}) \Longrightarrow (\hat{\theta}_{1}, \dots, \hat{\theta}_{B}) \Longrightarrow \hat{G} \Longrightarrow \mathrm{CI}.$
> [!exposition]- Bootstrap Quantile CI
> The empirical distribution of $(\hat{\theta}_{b}-\hat{\theta})$ is $\hat{G}(t):= \frac{1}{B}\sum_{b}\mathbf{1}\{\hat{\theta}_{b}-\hat{\theta} \le t\}=\text{proportion of }\hat{\theta}_{b}\le t+\hat{\theta}.$
> Define the "quantiles" of $\hat{G}$ as $\hat{t}_{\alpha}:= \inf \{ t~|~ \hat{G}(t) \ge \alpha \},$i.e. the lowest $t$ where at least $\alpha B$ of the bootstrapped $\hat{\theta}_{b}-\hat{\theta}$ are below it.
>
> Then use these as approximations of the quantiles of $\hat{\theta}-\theta \sim G$ to get $\hat{\theta}-\theta\overset{\alpha}{\in} (\hat{t}_{\alpha / 2}, \hat{t}_{1- \alpha / 2}),$and solving for $\theta$ gives its CI: $\theta \overset{\alpha}{\in}(\hat{\theta}-\hat{t}_{1-\alpha / 2}, \hat{\theta}-\hat{t}_{\alpha /2}).$
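A sketch of the quantile CI under the same kind of setup; note that the upper quantile $\hat{t}_{1-\alpha/2}$ is subtracted to get the lower end of the CI, and vice versa:

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.standard_normal(80)
B, alpha = 2000, 0.05

theta_hat = np.mean(T)
theta_boot = np.array([
    np.mean(rng.choice(T, size=len(T), replace=True)) for _ in range(B)
])

# Quantiles t_hat of G_hat, the empirical distribution of theta_hat_b - theta_hat
deltas = theta_boot - theta_hat
t_lo, t_hi = np.quantile(deltas, [alpha / 2, 1 - alpha / 2])

# theta in (theta_hat - t_{1-alpha/2}, theta_hat - t_{alpha/2})
ci = (theta_hat - t_hi, theta_hat - t_lo)
```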
### Bootstrap Bias Estimation
Similar to the variance, the bias of $\hat{\theta}$ (i.e. $\mathbb{E}_{F}[\hat{\theta}]-\theta$) can be estimated with $\mathbb{E}_{\hat{F}}[\hat{\theta}_{b}]-\hat{\theta}$: $\begin{align*}
\hat{F} &\Longrightarrow (\mathcal{T}_{1},\dots,\mathcal{T}_{B}) \Longrightarrow (\hat{\theta}_{1}, \dots, \hat{\theta}_{B}) \\[0.4em]
&\Longrightarrow \frac{1}{B}\sum_{b}\hat{\theta}_{b}-\hat{\theta}=: \widehat{\text{bias}}(\hat{\theta}).
\end{align*}$
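A sketch with a deliberately biased statistic (the $1/n$-normalized sample variance, which underestimates the population variance), so the bias estimate has something to pick up; the data are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.standard_normal(60)
B = 2000

# A statistic with known downward bias: the 1/n-normalized sample variance
statistic = lambda x: np.mean((x - x.mean()) ** 2)

theta_hat = statistic(T)
theta_boot = np.array([
    statistic(rng.choice(T, size=len(T), replace=True)) for _ in range(B)
])

bias_hat = theta_boot.mean() - theta_hat   # E_{F_hat}[theta_hat_b] - theta_hat
theta_debiased = theta_hat - bias_hat      # optional bias-corrected estimate
```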
## Forms of Bootstraps
The most common form of bootstrap is the **nonparametric bootstrap**, i.e. drawing the samples as $\mathcal{T}_{b} \overset{\mathrm{iid.}}{\sim}\mathrm{Unif}\,\mathcal{T}$. This is also known as the **paired bootstrap**, since each observation $(x_{i},y_{i})$ stays paired and is not separated in the process.
An alternative is the **semi-parametric residual bootstrap**, where instead of sampling from $\mathcal{T}=\{ (x_{i}, y_{i}) \,|\,i=1,\dots \}$, a model $\hat{\mathbf{y}}(\mathbf{x})$ is fit, and the residuals are sampled: $\begin{align*}
\text{residuals} & & \mathbf{e}&:= \mathbf{y}-\hat{\mathbf{y}}(\mathbf{x})\\[0.4em]
\text{bootstrap} & &\mathbf{e}_{b}&\sim \mathrm{Unif}\{ e_{1},\dots,e_{N} \},\\[0.2em]
&& \mathbf{y}_{b}&\,= \hat{\mathbf{y}}(\mathbf{x})+\mathbf{e}_{b},
\end{align*}$i.e. keep the fitted values $(\mathbf{x}, \hat{\mathbf{y}})$ unchanged, but resample the residuals $\mathbf{e}$ on top of them. This produces $\mathcal{T}_{b}=(\mathbf{x}, \mathbf{y}_{b})$.
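A sketch for ordinary least squares in NumPy (the design, noise level, and the choice of summary at the end are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 100, 1000
x = rng.uniform(0, 1, size=n)
y = 1.0 + 2.0 * x + 0.3 * rng.standard_normal(n)    # data for illustration only

X = np.column_stack([np.ones(n), x])                # design matrix with intercept
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]     # fit the model y_hat(x)
y_fit = X @ beta_hat
resid = y - y_fit                                   # residuals e

beta_boot = np.empty((B, 2))
for b in range(B):
    e_b = rng.choice(resid, size=n, replace=True)   # resample the residuals
    y_b = y_fit + e_b                               # T_b = (x, y_b)
    beta_boot[b] = np.linalg.lstsq(X, y_b, rcond=None)[0]

se_beta = beta_boot.std(axis=0)                     # e.g. bootstrap SEs of the coefficients
```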
### Bootstrap as Reweighting
Equivalently, a bootstrap can be seen as putting a random set of weights $\mathbf{W}=(w_{1},\dots,w_{n})$ on the original sample $\mathcal{T}$, where the counts $n\mathbf{W} \sim \mathrm{Multinom}\left( \frac{\mathbb{1}}{n}, n \right)$ record how many times each observation is drawn (so the weights sum to $1$).
- This generalizes to other resampling distributions on the **resampling simplex** $\mathcal{S}_{n}:= \left\{ \mathbf{W} \in( \mathbb{R}^{+})^{n} \,|\, \sum_{i}w_{i} = 1 \right\}.$
- One example is **Bayesian resampling**, which uses weights proportional to $G_{1},\dots,G_{n} \overset{\mathrm{iid.}}{\sim}\mathrm{Exp}(1)$ (so each weight is normalized by $\sum_{i}G_{i} \sim \mathrm{Gamma}(n,1)$); a small sketch of both weighting schemes is given below.
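A small sketch of both weighting schemes (the weighted mean at the end is just one example of a weighted estimate):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
T = rng.standard_normal(n)

# Nonparametric bootstrap as weights: multinomial counts, normalized to sum to 1
counts = rng.multinomial(n, np.full(n, 1 / n))
w_boot = counts / n

# Bayesian resampling: Exp(1) draws normalized by their sum, i.e. Dirichlet(1, ..., 1)
g = rng.exponential(1.0, size=n)
w_bayes = g / g.sum()

# Either weight vector defines a weighted estimate, e.g. a weighted mean of T
theta_b = np.dot(w_boot, T)
theta_b_bayes = np.dot(w_bayes, T)
```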
### Parametric Bootstraps
Instead of using the non-parametric estimate $\hat{F} = \mathrm{Unif}(\mathcal{T})$ of the original population, we can impose a model $F \in \mathcal{F}:= \{ F_{\theta} \,|\, \theta \in \Theta \},$ where $F_{\theta}$ are candidate versions of $F$. We can then plug in some estimate $\hat{\theta}$ to get $\hat{F}=F_{\hat{\theta}}$ and resample from that. Schematically, $\begin{align*}
\text{parametric bootstrap}: &&\mathcal{T} \Longrightarrow \hat{\theta} \Longrightarrow [\hat{F}=F_{\hat{\theta}}] \Longrightarrow \mathcal{T}_{b}\cdots
\end{align*}$
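A sketch where the assumed family is Gaussian, $\mathcal{F}=\{ N(\mu,\sigma^{2}) \}$, and the parameter of interest is the mean; both choices are purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 3 + 2 * rng.standard_normal(50)     # original sample
B = 1000

# Plug-in fit of F_theta = N(mu, sigma^2)
mu_hat, sigma_hat = T.mean(), T.std()

# Draw parametric bootstrap samples T_b ~ F_{theta_hat} and re-estimate on each
theta_boot = np.array([
    rng.normal(mu_hat, sigma_hat, size=len(T)).mean() for _ in range(B)
])
se_mu = theta_boot.std()                # e.g. bootstrap SE of the mean
```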
## Bootstrap Error Estimates
The bootstrap can help estimate the prediction error of a model $\hat{f}$. The **naive bootstrap estimate** fits a model $\hat{f}_{b}$ on each $\mathcal{T}_{b}$, $b=1,\dots,B$, and computes its prediction error on $\mathcal{T}$: $\widehat{\mathrm{Err}}_{b}:=\frac{1}{|\mathcal{T}|}\sum_{(x_{i},y_{i})\in\mathcal{T}} L(y_{i},\hat{f}_{b}(x_{i})),$ and the overall estimate is the average over bootstrap samples $\widehat{\mathrm{Err}}_{\mathrm{naive}}=\frac{1}{B}\sum_{b=1}^{B}\widehat{\mathrm{Err}}_{b}.$ However, *this estimate is overly optimistic*, as the (bootstrap) test set $\mathcal{T}$ overlaps with the bootstrap training sets $\mathcal{T}_{b}$.
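A sketch of the naive estimate with a placeholder model (polynomial least squares) and squared-error loss; the data and degree are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 100, 200
x = rng.uniform(0, 1, size=n)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)

def fit_predict(x_train, y_train, x_test, deg=5):
    """Placeholder model f_hat: polynomial least-squares fit."""
    coefs = np.polyfit(x_train, y_train, deg)
    return np.polyval(coefs, x_test)

err_naive = 0.0
for b in range(B):
    idx = rng.integers(0, n, size=n)             # bootstrap sample T_b (as indices)
    y_pred = fit_predict(x[idx], y[idx], x)      # fit f_hat_b on T_b, predict on all of T
    err_naive += np.mean((y - y_pred) ** 2) / B  # average the per-sample errors over b
```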
To tackle this problem, **leave-one-out bootstrap estimate** computes an error estimate for each $(x_{i},y_{i}) \in\mathcal{T}$, but only using models fitted on bootstrap training sets $C_{-i}:=\{ b\,|\, (x_{i},y_{i}) \not \in \mathcal{T}_{b} \}$ that do not contain the observation: $\widehat{\mathrm{Err}}_{i}:=\frac{1}{|C_{-i}|}\sum_{b \in C_{-i}}L(y_{i}, \hat{f}_{b}(x_{i}))$
A similar idea is to first aggregate the predictions then compute the error: $\begin{align*}
\hat{y}_{i}^{\mathrm{oob}}&:= \frac{1}{| C_{-i} |}\sum_{b \in C_{-i}}\hat{f}_{b}(x_{i}),\\
\widehat{\mathrm{Err}}_{i}&= L(y_{i},~\hat{y}_{i}^\mathrm{oob}).
\end{align*}$Lastly, the overall error estimate is the average over all observations: $\widehat{\mathrm{Err}}:=\frac{1}{n}\sum_{i=1}^{n}\widehat{\mathrm{Err}}_{i}.$
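A sketch of the leave-one-out and aggregated variants, continuing the same placeholder model and loss; it assumes $B$ is large enough that every observation is out-of-bag for at least one $b$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 100, 200
x = rng.uniform(0, 1, size=n)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)

def fit_predict(x_train, y_train, x_test, deg=5):
    coefs = np.polyfit(x_train, y_train, deg)
    return np.polyval(coefs, x_test)

preds = np.empty((B, n))                   # preds[b, i] = f_hat_b(x_i)
in_bag = np.zeros((B, n), dtype=bool)      # in_bag[b, i]: is (x_i, y_i) in T_b?

for b in range(B):
    idx = rng.integers(0, n, size=n)
    in_bag[b, idx] = True
    preds[b] = fit_predict(x[idx], y[idx], x)

err_i = np.empty(n)
y_oob = np.empty(n)
for i in range(n):
    out = ~in_bag[:, i]                    # C_{-i}: bootstrap fits not trained on i
    err_i[i] = np.mean((y[i] - preds[out, i]) ** 2)   # leave-one-out bootstrap error
    y_oob[i] = preds[out, i].mean()                   # out-of-bag aggregated prediction

err_loo = err_i.mean()                     # average over observations
err_oob = np.mean((y - y_oob) ** 2)        # error of the aggregated predictions
```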
Nevertheless, there is the issue that each bootstrap sample $\mathcal{T}_{b}$ contains on average only about $0.63|\mathcal{T}|$ distinct observations, so the leave-one-out estimate usually overestimates the error.
- $0.63$ comes from $\mathbb{P}[i\text{th observation in }\mathcal{T}_{b}]=1-\left( 1-\frac{1}{n} \right)^{n} \to 1-\frac{1}{e}\approx 0.63.$
- A common adjustment for non-overfitting situations (a rounded form of the **.632 estimator**) is $\widehat{\mathrm{Err}}_{(0.63)}:= 0.63\cdot\widehat{\mathrm{Err}}+0.37\cdot \overline{\mathrm{err}},$ where $\overline{\mathrm{err}}$ is the training error.