Given samples $\mathbf{X}=(X_{1},\dots,X_{n})$ drawn from a distribution $f(x;\theta)$, we may compute statistics $T(\mathbf{X})$ to describe the sample. *This poses a "dilemma": we want to condense the information* (**minimality**), *but not by too much* (**sufficiency**).

> [!idea] Statistics induce partitions
> Instead of functions, we can interpret statistics as groupings of the sample space. They induce a binary relation for $x,y \in \mathcal{X}$:
> $$x\sim y \iff T(x)=T(y),$$
> or equivalently, the [[Partitions and Information Chunks|partition into groups that are indistinguishable by the value of $T$]]:
> $$\mathcal{X}=\bigcup_{t \in T(\mathcal{X})}\{ x \in \mathcal{X} ~|~ T(x)=t \}.$$
> In this context, sufficiency means that the distribution is shared within each equivalence class, and minimality means that the partition is not unnecessarily fine-grained.

However, the **Pitman–Koopman–Darmois theorem** states that minimal sufficient finite-dimensional statistics are rare: if an iid sample has such a statistic, it must follow some exponential family.

- *Finite-dimensional* means that the dimension of the statistic does not grow with the sample size $n$.

## Sufficiency

> [!definition|*] Sufficiency
> A statistic $T$ is **sufficient** if conditioning on its value makes the distribution of the sample independent of $\theta$:
> $$f(x;\theta~|~T=t)=f(x~|~ T=t).$$
> In particular, given $T=t$, the conditional expectation of any $g(X)$ no longer depends on $\theta$:
> $$\mathbb{E}_{\theta}[g(X)~|~ T=t]=\text{const. for all }\theta \in \Theta.$$

It is usually difficult to show sufficiency directly from the definition; the **factorization criterion** gives an equivalent characterization.

> [!theorem|*] Factorization Criterion (Fisher–Neyman)
> A statistic $T$ is sufficient if and only if the density admits a factorization
> $$\exists g,h:~f(x;\theta)=g(T(x),\theta)\,h(x).$$
>
> > [!proof]-
> > $[\text{Sufficiency} \Rightarrow \text{Factorization}]$ Let $h(x) := f(x ~|~T(x))$, which is free of $\theta$ by sufficiency, and let $g(t,\theta)$ be the density of $T$; conditioning on $T$ then gives the factorization.
> > $[\text{Sufficiency} \Leftarrow \text{Factorization}]$ is very hard for general distributions. See lecture notes for the discrete case.

One important corollary of the factorization criterion is that

![[Information and Bounding Errors#^3cd3db]]

See [[Information and Bounding Errors#Information and Statistics|the note about information]] for more.

## Minimality

> [!definition|*] Minimal Sufficiency
> A sufficient statistic $T$ is **minimal sufficient** if $T(X)$ can be written as a function of any other sufficient statistic.

In terms of partitions, this is equivalent to saying that the partition $\Pi_{T}$ induced by $T$ is **coarser** than that of any other sufficient statistic $S$.

- That is, any $A \in \Pi_{T}$ can be written as a union of sets in $\Pi_{S}$. This means that $[S(x)=S(y)] \Rightarrow [T(x)=T(y)]$.

Minimality also has a criterion involving densities:

> [!lemma|*] Equivalent criterion of Minimal Sufficiency
> $T$ is minimal sufficient if and only if, for all $x,y \in \mathcal{X}$,
> $$\left[ \underset{(\mathrm{i.e.}~x \sim y)}{T(x)= T(y)} \iff \frac{f(x;\theta)}{f(y;\theta)} \text{ is indep. of }\theta \right],$$
> or equivalently, $\theta \mapsto f(x;\theta) / f(y;\theta)$ is constant in $\theta$ exactly when $x$ and $y$ lie in the same induced class.

- Roughly, $[\Rightarrow]$ is sufficiency: each class is small enough that the distribution is shared within it.
- $[\Leftarrow]$ is minimality: no finer splits are made once the likelihood ratio is independent of $\theta$.
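Both criteria can be seen in action in a standard worked example (added here for illustration; the Gaussian location family is just one convenient choice):

> [!example] Worked example: Gaussian location family
> Let $\mathbf{X}=(X_{1},\dots,X_{n})$ be iid $N(\theta,1)$. The joint density factors as
> $$f(\mathbf{x};\theta)=\underbrace{\exp\left( \theta \sum_{i}x_{i} - \frac{n\theta^{2}}{2} \right)}_{g(T(\mathbf{x}),\theta)}\underbrace{(2\pi)^{-n/2}\exp\left( -\frac{1}{2}\sum_{i}x_{i}^{2} \right)}_{h(\mathbf{x})},$$
> so $T(\mathbf{x})=\sum_{i}x_{i}$ is sufficient by the factorization criterion. For minimality, the ratio
> $$\frac{f(\mathbf{x};\theta)}{f(\mathbf{y};\theta)}=\exp\left( \theta\left( \sum_{i}x_{i}-\sum_{i}y_{i} \right) - \frac{1}{2}\sum_{i}\left( x_{i}^{2}-y_{i}^{2} \right) \right)$$
> is independent of $\theta$ if and only if $\sum_{i}x_{i}=\sum_{i}y_{i}$, i.e. $T(\mathbf{x})=T(\mathbf{y})$, so $T$ is minimal sufficient.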
## Minimal Sufficiency in Exponential Families

[[Exponential Families|Exponential families]] give a very convenient way of finding minimal sufficient statistics:

> [!theorem|*] Minimality and Sufficiency in Exponential Families
> The canonical observations $\mathbf{T}=(T_{1},\dots,T_{k})$ of an exponential family are sufficient statistics for the parameter(s) $\theta$.
>
> Moreover, if $\mathcal{P}$ is strictly $k$-parameter, then the canonical observations $\mathbf{T}$ are in fact minimal sufficient.
>
> > [!proof]-
> > - Sufficiency is a corollary of the factorization criterion.
> > - Minimality follows from
> > $$\frac{f(x;\theta)}{f(y;\theta)}=\frac{h(x)}{h(y)}\exp\left[ \eta(\theta) \cdot (T(x)-T(y)) \right]$$
> > being independent of $\theta$ if and only if $T(x)=T(y)$.

By extension, an iid sample $\mathbf{X}=(X_{1},\dots,X_{n})$ has the minimal sufficient statistic
$$\mathbf{T}^{(n)}=\left( \sum_{i=1}^{n}T_{1}(X_{i}),\dots, \sum_{i=1}^{n}T_{k}(X_{i}) \right).$$

If $\eta$ is the canonical parameter, its MLE $\hat{\eta}$ solves the zero-gradient equation
$$\nabla_{\eta}\left( \sum_{i}l(\eta;x_{i}) \right)=\mathbf{T}^{(n)}-\nabla_{\eta}B=\mathbf{T}^{(n)}-\mathbb{E}_{\eta}[\mathbf{T}^{(n)}]\equiv 0.$$
Therefore, *the MLE $\hat{\eta}$ is exactly the parameter that matches the expectation $\mathbb{E}_{\hat{\eta}}[\mathbf{T}^{(n)}]$ to the observed value of $\mathbf{T}^{(n)}$.* ^8d3447
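To make the moment-matching property concrete, here is a minimal numerical sketch (not part of the derivation above), assuming NumPy and SciPy are available. It uses the Poisson family, where each observation has canonical statistic $T(x)=x$, canonical parameter $\eta=\log\lambda$, and log-partition $B(\eta)=e^{\eta}$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative setup (assumed): Poisson(lambda) as a 1-parameter exponential
# family with T(x) = x, eta = log(lambda), and B(eta) = exp(eta) per observation.
rng = np.random.default_rng(0)
x = rng.poisson(lam=3.5, size=1_000)   # iid sample
n, T_n = len(x), x.sum()               # T^(n): the minimal sufficient statistic

def neg_log_lik(eta):
    # Up to terms constant in eta: sum_i l(eta; x_i) = eta * T_n - n * B(eta)
    return -(eta * T_n - n * np.exp(eta))

eta_hat = minimize_scalar(neg_log_lik).x

# Moment matching: E_{eta_hat}[T^(n)] = n * exp(eta_hat) equals the observed T^(n).
print(n * np.exp(eta_hat), T_n)        # both are approx. n times the sample mean
```

The optimizer recovers $\hat{\eta}=\log \bar{x}$, so $\mathbb{E}_{\hat{\eta}}[\mathbf{T}^{(n)}]=ne^{\hat{\eta}}$ reproduces the observed $\mathbf{T}^{(n)}$ up to numerical tolerance, as the zero-gradient equation predicts.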