> [!definition|*] Order Statistics
> Given data $X_{i}=x_{i}$ for $i=1, \dots, n$, their **$r$th order statistic** is $X_{(r)}=x_{(r)}$, the $r$th smallest observed value.
> - In particular, $X_{(1)}=\min \{ X_{i} \}$, and $X_{(n)}=\max \{ X_{i} \}$.
> - The **median** is $$M=\begin{cases} X_{(m+1)} & \text{if $n=2m+1$ is odd} \\ \frac{1}{2}(X_{(m)}+X_{(m+1)}) & \text{if $n=2m$ is even} \end{cases}$$

> [!theorem|*] Distribution of the Order Statistics
> If the iid. data have cdf and pdf $X_{i} \sim F,f$, then their order statistics have distributions: $$X_{(r)} \sim f_{(r)}(x)=\frac{n!}{(r-1)!(n-r)!}F(x)^{r-1} [1-F(x)]^{n-r}f(x)$$
>
> > [!proof]-
> > Note that $F_{(r)}(x)= \Bbb{P}(X_{(r)} \le x)=\Bbb{P}(\text{at least } r \text{ of }X_{i}\le x)$.
> > Hence
> > $$\begin{align*}F_{(r)}(x)-F_{(r+1)}(x)&= \Bbb{P}(\text{exactly } r \text{ of }X_{i}\le x,\, (n-r)\text{ of }X_{i} > x)\\ &=\begin{pmatrix}n \\ r\end{pmatrix}F(x)^{r}[1-F(x)]^{n-r}
> > \end{align*}$$
> > Then differentiate and use backward induction on $r$, starting from $F_{(n)}(x)=F(x)^{n}$, to prove the result.

> Order statistics of the uniform distribution $U[0,1]$: for an iid. sample $U_{1}, \dots, U_{n} \sim U[0,1]$, the order statistic $U_{(r)}$ follows $\text{Beta}(r,n-r+1)$, which has $$\begin{align*} \mathbb{E}[U_{(r)}]&=\frac{r}{r+s}=\frac{r}{n+1}\\[0.4em] \mathrm{Var}(U_{(r)})&= \frac{rs}{(r+s)^{2}(r+s+1)} \end{align*}$$ where $s=n-r+1$ is the second parameter.

## Q-Q Plots

> [!bigidea]
> Q-Q plots compare the order statistics of the observed sample to the expected order statistics of a known distribution; if the sample indeed follows that distribution, the plot should look like a straight line.

> [!theorem|*] Distribution of $F(X)$
> If $X \sim F$, where $F(x)$ is a continuous, strictly increasing cdf., then $Y=F(X) \sim U[0,1]$.
>
> > [!proof]-
> > For $y \in (0,1)$,
> > $$\begin{align*}
> > \Bbb{P}(Y \le y)&= \Bbb{P}(F(X) \le y)\\
> > &= \Bbb{P}(X \le F^{-1}(y))\\
> > &= F(F^{-1}(y))=y
> > \end{align*}$$
> > where the inverse exists since $F$ is continuous and strictly increasing. Hence $Y \sim U[0,1]$.

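
The two facts above (that $F(X)$ is uniform, and that uniform order statistics have mean $\frac{r}{n+1}$) can be checked together by simulation. The following is a minimal sketch, assuming $X_{i} \sim \mathrm{Exp}(1)$ so that $F(x)=1-e^{-x}$; the function name `uniform_order_stat_means`, the sample size, and the seed are illustrative choices, not part of the notes.

```python
import random
from math import exp


def uniform_order_stat_means(n, reps, seed=0):
    """Monte Carlo estimate of E[U_(r)], r = 1..n, where U_i = F(X_i).

    Assumption for illustration: X_i ~ Exp(1), so F(x) = 1 - exp(-x).
    """
    rng = random.Random(seed)
    totals = [0.0] * n
    for _ in range(reps):
        # Draw X_i, push each through its own cdf, then sort to get U_(1..n).
        u_sorted = sorted(1.0 - exp(-rng.expovariate(1.0)) for _ in range(n))
        for r, u in enumerate(u_sorted):
            totals[r] += u
    return [t / reps for t in totals]


n = 5
estimates = uniform_order_stat_means(n, reps=20000)
expected = [r / (n + 1) for r in range(1, n + 1)]  # Beta(r, n-r+1) means
for r, (est, theory) in enumerate(zip(estimates, expected), start=1):
    print(f"r={r}: simulated {est:.3f}, theory r/(n+1) = {theory:.3f}")
```

The simulated means should sit close to $\frac{r}{n+1}$ up to Monte Carlo error, which shrinks as `reps` grows.
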
Therefore, using the delta method to approximate $X_{(r)}=F^{-1}(U_{(r)})$ from $U_{(r)}$ gives
$$\mathbb{E}[F^{-1}(U_{(r)})]\approx F^{-1}(\mathbb{E}[U_{(r)}])=F^{-1}\left( \frac{r}{n+1} \right)$$
with variance
$$\mathrm{Var}(F^{-1}(U_{(r)}))\approx\underbrace{\mathrm{Var}(U_{(r)})}_{\to 0}\cdot {\left[\frac{dF^{-1}(u)}{du}\Big|_{u=\frac{r}{n+1}}\right]^{2}} \to 0 \quad \text{as } n \to \infty.$$
Hence we expect $X_{(r)}$ to be close to $F^{-1}\left( \frac{r}{n+1} \right)$ when $n$ is large, so the two *form a near-perfect line when plotted against each other*.

> [!definition|*] Q-Q Plot
> The **Q-Q plot** plots the observed order statistics $x_{(1)}, \dots, x_{(n)}$ against $F^{-1}\left( \frac{1}{n+1} \right), \dots, F^{-1}\left( \frac{n}{n+1} \right)$, the approximate expected order statistics of a distribution $F$.
>
> If the points are roughly linear, the data are consistent with the distribution; otherwise, the distribution is a bad model.

### Q-Q Plots as Distribution Comparisons

Suppose we have a sample $X=(X_{1},\dots,X_{n}) \overset{\mathrm{iid.}}{\sim} F_{X}$, and we want to check if some theoretical distribution $F_{0}$ is a good model.

We can use the order statistics $\{ X_{(r)} \}$ as approximations of the quantiles of $F_{X}$:
$$F_{X}^{-1}\left( \frac{r}{n} \right) \approx X_{(r)}.$$
Plotting the observed order statistics $\{ X_{(r)} \}$ against the quantiles $\{ F_{0}^{-1}(z_{r}) \}$ at some probabilities $\mathbf{z}$, e.g. $\mathbf{z}=(1 / n, 2 / n, \dots, 1)$ (or $z_{r}=\frac{r}{n+1}$ as above, which avoids an infinite quantile at $z_{n}=1$), we can approximate the plot $(F_{0}^{-1}(p), F^{-1}_{X}(p))$ parametrized by $p \in (0,1)$.

Since $\frac{d F^{-1}(p)}{dp}=\frac{1}{f(F^{-1}(p))}$ by the inverse function rule, the slope of the approximated plot is
$$\begin{align*} \frac{ d F_{X}^{-1}(p) }{ d F_{0}^{-1}(p) }&= \frac{ d F_{X}^{-1} }{ d p } \Big/ \frac{ d F_{0}^{-1} }{ d p } \\ &= \frac{f_{0}(F_{0}^{-1}(p))}{f_{X}(F_{X}^{-1}(p))}. \end{align*}$$
In the case of Gaussians $F_{0} = N(0,1)$ and $F_{X} = N(0, \sigma^{2})$, the slope simplifies to the constant $\sigma$.

### Q-Q Plots in Practice

If the cdf. $F$ is known, we can directly compute $d_{(r)}=F^{-1}\left( \frac{r}{n+1} \right)$.
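
When $F$ is fully specified, the theoretical quantiles $d_{(r)}$ are a one-line computation. The sketch below assumes $F$ is Gaussian and uses Python's standard-library `statistics.NormalDist`; the helper `plotting_positions` and the value $\sigma=2$ are hypothetical choices made for illustration. It also checks numerically that the quantiles of $N(0,\sigma^{2})$ plotted against those of $N(0,1)$ have slope $\sigma$.

```python
from statistics import NormalDist


def plotting_positions(n, dist=NormalDist()):
    """Theoretical quantiles d_(r) = F^{-1}(r / (n + 1)) for r = 1..n."""
    return [dist.inv_cdf(r / (n + 1)) for r in range(1, n + 1)]


n = 9
sigma = 2.0  # illustrative value, not from the notes

d = plotting_positions(n)                              # quantiles of F_0 = N(0, 1)
q_x = plotting_positions(n, NormalDist(0.0, sigma))    # quantiles of F_X = N(0, sigma^2)

# Slopes between consecutive points of the curve (F_0^{-1}(p), F_X^{-1}(p)).
slopes = [(q_x[i + 1] - q_x[i]) / (d[i + 1] - d[i]) for i in range(n - 1)]
print(slopes)
```

Every entry of `slopes` should equal `sigma` up to floating-point rounding, matching the constant slope derived for the Gaussian case.
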
However, *in practice, we want to test a family of distributions $\{ F(\theta) \}$ indexed by an unknown parameter $\theta$, so we do not have* $F$. Instead, we need *a linear relation that holds true for any parameter* $\theta$ in the family of distributions $F(\theta)$.

We start with the fact that $F(x_{(r)}, \theta) \approx\frac{r}{n+1}$, and rearrange to find the linear relation
$$x_{(r)}\approx\alpha(\theta)\cdot g\left(\frac{r}{n+1}\right)+\beta(\theta)$$
for some $g$ independent of $\theta$; different $\theta$ should only affect the slope $\alpha(\theta)$ and intercept $\beta(\theta)$ of the line.

> [!examples] Normal Q-Q Plot
> The **normal Q-Q plot**: if $X_{1,\dots,n} \overset{iid.}{\sim} N(\mu, \sigma^2)$, standardizing each observation gives $\frac{X_{i}-\mu}{\sigma}\sim N(0,1)$, so $\Phi\left(\frac{X_{(r)}-\mu}{\sigma}\right)$ behaves like the uniform order statistic $U_{(r)}$:
> $$\Phi\left(\frac{X_{(r)}-\mu}{\sigma}\right) \approx \frac{r}{n+1}$$
> giving the linear relationship
> $$x_{(r)} \approx \sigma\Phi^{-1}\left( \frac{r}{n+1} \right)+\mu$$
> Therefore plotting $x_{(r)}$ against $\Phi^{-1}\left( \frac{r}{n+1} \right)$ should give a line with slope $\sigma$ and intercept $\mu$.

> [!examples] Exponential Q-Q Plot
> For exponential distributions $X_{1,\dots,n} \overset{iid.}{\sim} \mathrm{Exp}(\lambda)$, the equation $F(x_{(r)},\lambda)\approx \frac{r}{n+1}$ becomes
> $$1-e^{-\lambda x_{(r)}} \approx \frac{r}{n+1}$$
> so solving for $x_{(r)}$ gives
> $$x_{(r)} \approx -\frac{1}{\lambda}\log\left( 1-\frac{r}{n+1} \right)$$
> Therefore plotting $x_{(r)}$ against $\log\left( 1-\frac{r}{n+1} \right)$ should give a line through the origin with slope $-\frac{1}{\lambda}$.
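
The two examples can be tried on simulated data. This is a rough sketch using only the Python standard library: the helpers `normal_qq`, `exponential_qq` and `fit_line`, along with the parameter values, sample size and seed, are hypothetical choices for illustration. It builds the Q-Q coordinates and fits a least-squares line, whose slope and intercept should roughly recover $(\sigma, \mu)$ in the normal case and $(-1/\lambda, 0)$ in the exponential case.

```python
import random
from math import log
from statistics import NormalDist, fmean


def normal_qq(sample):
    """Return (Phi^{-1}(r/(n+1)), x_(r)) coordinates for a normal Q-Q plot."""
    n = len(sample)
    theo = [NormalDist().inv_cdf(r / (n + 1)) for r in range(1, n + 1)]
    return theo, sorted(sample)


def exponential_qq(sample):
    """Return (log(1 - r/(n+1)), x_(r)) coordinates for an exponential Q-Q plot."""
    n = len(sample)
    theo = [log(1 - r / (n + 1)) for r in range(1, n + 1)]
    return theo, sorted(sample)


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept of ys against xs."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


rng = random.Random(1)
mu, sigma, lam = 3.0, 2.0, 0.5  # illustrative parameter values

normal_sample = [rng.gauss(mu, sigma) for _ in range(2000)]
exp_sample = [rng.expovariate(lam) for _ in range(2000)]

slope_n, intercept_n = fit_line(*normal_qq(normal_sample))
slope_e, intercept_e = fit_line(*exponential_qq(exp_sample))

print(f"normal:      slope {slope_n:.2f} (sigma = {sigma}), intercept {intercept_n:.2f} (mu = {mu})")
print(f"exponential: slope {slope_e:.2f} (-1/lambda = {-1 / lam}), intercept {intercept_e:.2f}")
```

Neither plot needs the true parameters to be known: the theoretical axis depends only on $r$ and $n$, and the parameters show up only in the fitted slope and intercept.
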