Consider the standard setup of computing a sample statistic: let $\mathbf{X}:= (X_{1},\dots,X_{n})$ be i.i.d. samples from some distribution $F$, and we compute a statistic $T(\mathbf{X})$ (e.g. the sample mean).

### Motivation

When we discuss the [[Robustness]] of a sample statistic, we usually use the example of the (arithmetic) mean being susceptible to outliers: consider the two datasets $\begin{align*} \mathbf{x}_{1}&= (1,2,3,4)\\ \mathbf{x}_{2}&= (1,2,3,100). \end{align*}$ If $T$ is the sample mean, then $T(\mathbf{x}_{1})=2.5$, while $T(\mathbf{x}_{2})=26.5$, so the mean is heavily influenced by outliers (e.g. data-entry mistakes). In contrast, the median of both datasets is $2.5$.

What we are essentially doing is *manipulating the value of one entry ($X_{4}$) and observing how that affects the statistic*. To avoid arbitrarily fixing the values of the other entries, we take the expectation over them, giving the general form $\mathbb{E}[T(\mathbf{X}) ~|~ X_{1}=x],$ where WLOG we take the perturbed entry to be the first. Subtracting what the expectation "used to be", we get an unscaled influence $\mathbb{E}[T(\mathbf{X}) ~|~ X_{1}=x]-\mathbb{E}[T(\mathbf{X})].$

> [!definition|*] Influence/Susceptibility
> To control for the different scales and values of different statistics, we "normalize" this quantity to $\mathrm{Inf}(x,T):= \frac{\mathbb{E}[T ~|~X_{1}=x]-\mathbb{E}[T]}{\sqrt{ \mathrm{Var}(T ~|~ X_{1}=x) }},$ which resembles the z-statistic.

Indeed, if $T$ is the sample mean, then $\mathbb{E}[T ~|~ X_{1}=x]-\mathbb{E}[T]=(x-\mu_{X})/n$ and $\mathrm{Var}(T ~|~ X_{1}=x)=(n-1)\sigma_{X}^{2}/n^{2}$, so $\mathrm{Inf}(x,T)=\frac{1}{\sqrt{ n-1 }}\frac{x-\mu_{X}}{\sigma_{X}},$ where $\mu_{X},\sigma_{X}$ are the population mean and standard deviation of $X\sim F$.
- The $\sqrt{ n-1 }$ term becomes $\sqrt{ n }$ if we replace the denominator with the unconditional $\sqrt{ \mathrm{Var}(T) }$ (instead of the conditional variance).

> [!idea]
> Treating this as a function of $x$, we can measure how different $x$-values, especially extreme ones, influence the value of $T$.

### Monte Carlo Estimation

Of course, most statistics (especially more robust ones like the median) do not admit such simple closed forms, so we resort to Monte Carlo simulation. Consider a truncated Gaussian distribution (truncated for computational purposes), i.e. a pdf $f_{X}(x;r) \propto \mathbf{1}_{| x | < r}\cdot\phi(x),$ $\phi$ being the pdf of the standard Gaussian, with $f_{X}$ normalized to have total mass 1. We take $r=4$; any value larger than $3$ should give similar results.
- As an example, this is the distribution with $r=3$: ![[TruncatedGaussian.png|center|w80]]

We estimate the influence at 100 evenly spaced $x$-values, using 1000 Monte Carlo simulations at each value. We consider the sample mean, the median, and winsorized means with $10\%$ and $40\%$ truncation; a sketch of the simulation is given at the end. The result:
![[InfluenceMonteCarlo.png]]
- The lines are the Monte Carlo estimates of the influence;
- The shaded regions are $\pm 1.96 \sqrt{ \widehat{\mathrm{Var}}(T ~|~ X_{1}=x) }$, where the conditional variance is estimated by the sample variance of the Monte Carlo replicates at each $x$.

> [!idea]
> As expected, the robust statistics (median and winsorized means) cap the influence of extreme values at modest levels, while for the sample mean it grows without bound.

- Remark: a winsorized mean truncating $x\%$ interpolates between the mean ($x=0$) and the median ($x=50$), so its susceptibility landing between the two is as expected.
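A minimal sketch of the simulation in Python/NumPy, for reference. The sample size ($n=20$ here), the rejection sampler, and the quantile-clipping implementation of the winsorized mean are assumptions not fixed above; plotting is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_truncated_gaussian(shape, r=4.0):
    """Rejection-sample the standard Gaussian truncated to |x| < r."""
    n_total = int(np.prod(shape))
    out = np.empty(0)
    while out.size < n_total:
        draws = rng.standard_normal(n_total)
        out = np.concatenate([out, draws[np.abs(draws) < r]])
    return out[:n_total].reshape(shape)

def winsorized_mean(sample, frac):
    """Mean after clipping values below/above the frac and (1 - frac)
    quantiles to those quantiles (one common form of winsorizing)."""
    lo, hi = np.quantile(sample, [frac, 1 - frac])
    return np.clip(sample, lo, hi).mean()

def influence_curve(statistic, xs, n=20, n_mc=1000):
    """Monte Carlo estimate of Inf(x, T): fix X_1 = x, redraw X_2..X_n
    in each replicate, and normalize the shift of E[T | X_1 = x] from
    the unconditional E[T] by the conditional standard deviation."""
    baseline = np.mean([statistic(sample_truncated_gaussian(n))
                        for _ in range(n_mc)])  # estimate of E[T]
    infs = []
    for x in xs:
        rest = sample_truncated_gaussian((n_mc, n - 1))
        samples = np.column_stack([np.full(n_mc, x), rest])
        ts = np.apply_along_axis(statistic, 1, samples)
        infs.append((ts.mean() - baseline) / ts.std(ddof=1))
    return np.array(infs)

xs = np.linspace(-4, 4, 100)
stats = {
    "mean": np.mean,
    "median": np.median,
    "winsorized 10%": lambda s: winsorized_mean(s, 0.10),
    "winsorized 40%": lambda s: winsorized_mean(s, 0.40),
}
for name, stat in stats.items():
    inf = influence_curve(stat, xs)
    print(f"{name}: max |Inf| = {np.abs(inf).max():.3f}")
```

As a sanity check, the curve for the sample mean should track the closed form $\frac{1}{\sqrt{ n-1 }}\frac{x-\mu_{X}}{\sigma_{X}}$, with $\mu_{X}=0$ by symmetry and $\mu_{X},\sigma_{X}$ those of the truncated distribution.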