**Exponential families** are families of probability distributions, parameterized over some parameter space $\Theta$.
A family is defined by the choice of the following ingredients:
> [!definition|*] Exponential Families
> - The number of parameters $k \in \mathbb{Z}^{+}$,
> - **Canonical observations/statistics** $\mathbf{T}=(T_{1},\dots,T_{k})$, where $T_{i}:\mathcal{X} \to \mathbb{R}$,
> - **Canonical parameters** $\pmb{\eta}=(\eta_{1},\dots,\eta_{k})$, where $\eta_{i}:\Theta \to \mathbb{R}$,
> - Functions $B:\Theta \to \mathbb{R}$ and $h: \mathcal{X} \to \mathbb{R}^{+}$.
>
> Given those components, the **exponential family** is the set of distributions $\mathcal{P}=\{P_{\theta}~|~ \theta \in \Theta\}$, where $P_{\theta}$ has pdf/pmf $p(x;\theta)$ of the form $\begin{align*}
p(x;\theta)&= \exp\left[\mathbf{T}(x)\cdot \pmb{\eta}(\theta)-B(\theta)\right]h(x)\\
&= \exp\left[ \sum_{i=1}^{k}T_{i}(x)\eta_{i}(\theta) - B(\theta)\right]h(x)
\end{align*}$where $B(\theta)$ is a scaling factor chosen so that the density $p(x;\theta)$ integrates to $1$.
Alternatively, reparametrizing with $\eta_{i}:=\eta_{i}(\theta)$ gives the **canonical form** $\begin{align*}
p(x;\pmb{\eta})&= \exp\left[\mathbf{T}(x)\cdot \pmb{\eta}-B(\pmb{\eta})\right]h(x)\\
&= \exp\left[ \sum_{i=1}^{k}\eta_{i}T_{i}(x) - B(\pmb{\eta})\right]h(x)
\end{align*}$where $\pmb{\eta}=(\eta_{1},\dots,\eta_{k})$, and $B(\pmb{\eta})$ is understood to equal $B(\theta)$.
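For instance, $\mathrm{Po}(\lambda)$ has pmf $\frac{\lambda^{x}e^{-\lambda}}{x!}=\exp[x\log\lambda-\lambda]\cdot \frac{1}{x!}$, so $T(x)=x$, $\eta(\lambda)=\log\lambda$, $B(\lambda)=\lambda$, $h(x)=1/x!$. A minimal sanity check (standard library only) that this canonical form reproduces the usual pmf:

```python
import math

def poisson_canonical_pmf(x, lam):
    """Poisson pmf in exponential-family form:
    T(x) = x, eta = log(lam), B(eta) = exp(eta) = lam, h(x) = 1/x!."""
    eta = math.log(lam)
    B = math.exp(eta)                  # the scaling factor, equals lam
    h = 1.0 / math.factorial(x)
    return math.exp(x * eta - B) * h

def poisson_pmf(x, lam):
    """Standard Poisson pmf, for comparison."""
    return lam**x * math.exp(-lam) / math.factorial(x)

for x in range(10):
    assert abs(poisson_canonical_pmf(x, 3.5) - poisson_pmf(x, 3.5)) < 1e-12
```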
> [!examples] Examples of Exponential Families
> $1$-parameter families include:
> - Poisson distributions $\mathrm{Po}(\lambda)$,
> - Binomial distributions $\mathrm{Binom}(n,p)$ with fixed $n$,
> - Gaussians $N(\mu,\sigma^{2}_{0})$ with fixed $\sigma_{0}^{2}$.
>
> $2$-parameter families include:
> - Gamma distributions $\mathrm{Gamma}(\alpha,\beta)$,
> - Gaussians $N(\mu,\sigma^{2})$ with both parameters free.
>
> Not exponential families:
> - Uniform distributions $U[0,\theta]$,
> - In general anything with non-constant support -- see [[Exponential Families#Exponential Families and Supports|the last section]].
Alternatively, exponential families can be interpreted as distributions obtained from a base distribution: for one distribution $f_{\pmb{\eta}_{0}}$ corresponding to $\pmb{\eta}=\pmb{\eta}_{0}$, any other $f_{\pmb{\eta}}$ from the family has the likelihood ratio $\frac{f_{\pmb{\eta}}}{f_{\pmb{\eta}_{0}}}\propto \exp[(\pmb{\eta}-\pmb{\eta}_{0})\cdot \mathbf{T}],$so $f_{\pmb{\eta}}$ can be obtained by applying **exponential tilting** to the base $f_{\pmb{\eta}_{0}}$, then rescaling for unit mass.
- For example, the family $\{ \mathrm{Po}(\lambda) ~|~ \lambda \in \mathbb{R}^{+} \}$ can be obtained by applying $\exp(ax), ~a=\log \frac{\lambda}{\lambda_{0}} \in \mathbb{R}$ to any $\mathrm{Po}(\lambda_{0})$.
![[PoissonExpTilting.png#invert|center|w80]]
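The tilting construction is easy to verify numerically. A minimal sketch (standard library only, truncating the support at $x<60$): tilting $\mathrm{Po}(\lambda_{0}=2)$ by $e^{ax}$ with $a=\log \frac{\lambda}{\lambda_{0}}$, then renormalizing, recovers $\mathrm{Po}(\lambda=5)$:

```python
import math

def tilt_poisson(lam0, a, support=range(60)):
    """Apply exponential tilting exp(a*x) to Po(lam0), then rescale to unit mass."""
    base = [lam0**x * math.exp(-lam0) / math.factorial(x) for x in support]
    tilted = [p * math.exp(a * x) for x, p in zip(support, base)]
    Z = sum(tilted)                    # renormalizing constant
    return [p / Z for p in tilted]

lam0, lam = 2.0, 5.0
a = math.log(lam / lam0)
tilted = tilt_poisson(lam0, a)
target = [lam**x * math.exp(-lam) / math.factorial(x) for x in range(60)]
assert max(abs(p - q) for p, q in zip(tilted, target)) < 1e-9
```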
### Properties of Exponential Families
*Sampling distributions from exponential families*:
- If $(X_{i} \sim P_{i})_{i=1}^{n}$ are independent ($P_{i}$ being distributions from the same exponential family), then $\mathbf{X}=(X_{1},\dots,X_{n})$ is also from an exponential family.
- Furthermore, if $X_{1},\dots,X_{n} \overset{\mathrm{iid.}}{\sim} f(\cdot~;\theta)$ with canonical observations $\mathbf{T}$, then $\mathbf{X}$ has canonical observations $\mathbf{T}_{(n)}(\mathbf{X})=\sum_{i}\mathbf{T}(X_{i})$.
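Concretely, for an iid sample the joint density again has exponential-family form, with statistic $\sum_{i}\mathbf{T}(X_{i})$ and scaling factor $nB$. A quick numerical check with Poisson data (standard library only; the sample values are arbitrary):

```python
import math

def log_pmf_poisson(x, lam):
    """log of the Poisson pmf: x*log(lam) - lam - log(x!)."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)

lam = 3.0
sample = [1, 4, 2, 0, 3]
n = len(sample)

# Joint log-pmf, summed term by term:
joint = sum(log_pmf_poisson(x, lam) for x in sample)

# Exponential-family form of the joint model: canonical statistic
# T_(n) = sum_i x_i, same eta = log(lam), scaling factor n*B = n*lam.
eta = math.log(lam)
T_n = sum(sample)
log_h = -sum(math.lgamma(x + 1) for x in sample)
assert abs(joint - (eta * T_n - n * lam + log_h)) < 1e-12
```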
*Moments of exponential families*: the scaling factor $B(\eta)$ summarizes the behavior of the canonical observations $\mathbf{T}(X)$: assuming $\eta \in \mathrm{int}(\Xi)$, $\Xi$ being the [[Exponential Families#Parameter Spaces|natural parameter space]],
- Moments $\mathbb{E}_{\eta}[T_{i}(X)^{m}] < \infty$ for all $m \ge 1$.
- The mean and covariance of $\mathbf{T}$ are $\begin{align*}
\mathbb{E}_{\eta}[T_{i}(X)]&= \frac{ \partial B }{ \partial \eta_{i} } \\[0.2em]
\mathrm{Cov}_{\eta}(T_{i}, T_{j})&= \frac{ \partial^{2} B }{ \partial \eta_{i}\partial\eta_{j}}
\end{align*}$
> [!warning] Curved Exponential Families
> If $\{ P_{\theta} \}$ is [[#^4dfdc4|curved]], $B$ may simplify to another function $B^{\ast}$ of $\pmb{\eta}$ -- this formula does not work for $B^{\ast}$.
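For the Poisson family in canonical form, $B(\eta)=e^{\eta}$, so both $\partial B/\partial \eta$ and $\partial^{2}B/\partial \eta^{2}$ equal $\lambda$ -- matching the known mean and variance. A finite-difference check (standard library only):

```python
import math

def B(eta):
    """Scaling factor of the Poisson family in canonical form: B(eta) = e^eta."""
    return math.exp(eta)

eta = math.log(4.0)                    # i.e. lam = 4
h = 1e-4
dB = (B(eta + h) - B(eta - h)) / (2 * h)              # ~ E[T(X)] = lam
d2B = (B(eta + h) - 2 * B(eta) + B(eta - h)) / h**2   # ~ Var(T(X)) = lam

assert abs(dB - 4.0) < 1e-5
assert abs(d2B - 4.0) < 1e-4
```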
The canonical statistics $\mathbf{T}(\mathbf{X})=(T_{1}(\mathbf{X}),\dots, T_{k}(\mathbf{X}))$ themselves follow an exponential family with the same canonical parameters $\eta(\theta)=(\eta_{1}(\theta),\dots,\eta_{k}(\theta))$.
If $\pmb{\eta}$ is the canonical parameter vector, its [[Maximum Likelihood Estimator|MLE]] $\hat{\eta}$ solves the zero-gradient equations $\nabla_{\eta}\left( \sum_{i}l(\eta;x_{i}) \right)=\mathbf{T}(\mathbf{x})-\nabla_{\!\eta}B=\mathbf{T}(\mathbf{x})-\mathbb{E}_{\eta}[\mathbf{T}]=\mathbf{0},$where $\mathbf{T}(\mathbf{x})$ and $B$ refer to the joint model of the sample.
Therefore, *the MLE $\hat{\eta}$ must match the observed $\mathbf{T}(\mathbf{x})$ with its expectation $\mathbb{E}_{\eta}[\mathbf{T}]$.* ^e3137c
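The moment-matching can be seen numerically: for Poisson data, minimizing the negative log-likelihood over the canonical parameter (here by a crude grid search rather than a proper solver; standard library only) lands where $\mathbb{E}_{\eta}[T]=e^{\eta}$ equals the sample mean:

```python
import math

def neg_log_lik(eta, sample):
    """Negative joint log-likelihood of iid Poisson data in canonical form."""
    return -(eta * sum(sample) - len(sample) * math.exp(eta)
             - sum(math.lgamma(x + 1) for x in sample))

sample = [2, 5, 1, 3, 4]
# Crude grid search over eta in [-2, 3):
etas = [i / 10000 for i in range(-20000, 30000)]
eta_hat = min(etas, key=lambda e: neg_log_lik(e, sample))

# At the MLE, the model mean E_eta[T] = exp(eta_hat) matches the sample mean:
assert abs(math.exp(eta_hat) - sum(sample) / len(sample)) < 1e-3
```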
## Minimal Representations
> [!tldr]
> A representation is **minimal** if it does not use more parameters than what’s necessary. This is equivalent to canonical functions $\{ \eta_{i} \},\{ T_{i} \}$ both being **affinely independent**.
The choice of $\eta_{i},T_{i},B,h$ forms the **representation** of an exponential family, but such representation is not unique.
- For example, we may arbitrarily split $\eta_{i}$ into two terms, or switch their orders.
> [!definition|*] Number of parameters
> A representation with canonical functions $\eta_{i},T_{i}$, $i=1,\dots,k$, is called a **$k$-parameter** representation.
>
> If all representations of a family $\mathcal{P}$ need at least $k$ parameters, then the family is **strictly $k$-parameter**.
>
> If $\mathcal{P}$ is strictly $k$-parameter, then a $k$-parameter representation of it is **minimal**.
*Minimality of a representation is equivalent to* [[Affine Independence|affine independence]] among $\{ \eta_{1},\dots,\eta_{k} \}$ and $\{ T_{1},\dots,T_{k} \}$: $\begin{align*}
&\not\exists c_{1},\dots,c_{k+1} \in \mathbb{R}\text{ not all zero}: \forall x \in \mathcal{X}, \sum_{i=1}^{k}c_{i}T_{i}(x)=c_{k+1}\\
&\not\exists d_{1},\dots,d_{k+1} \in \mathbb{R}\text{ not all zero}: \forall \theta \in \Theta, \sum_{i=1}^{k}d_{i}\eta_{i}(\theta)=d_{k+1}
\end{align*}$This is because affine dependence allows us to express one of $\eta_{i},T_{i}$ in terms of other canonical functions, thereby reducing the number of parameters to $k-1$.
- In particular, all affinely independent representations are $k$-parameter if $\mathcal{P}$ is strictly $k$-parameter.
> [!proof]-
> Lol who cares. Go check the lecture notes.
> The big idea is that the canonical statistics $\mathbf{T}$ live in the vector space of (things proportional to the) log-likelihoods.
>
> Now any affinely independent representation $\mathbf{T}$, together with the constant function $\mathbf{1}_{\mathcal{X}}$, forms a basis $\{ \mathbf{1}_{\mathcal{X}}, T_{1},\dots,T_{k} \}$ of this space. Since different affinely independent representations all span the same space, they must have the same dimension.
Affine independence among $T_{i}$ is equivalent to $\mathrm{Cov}_{\theta}(\mathbf{T})$ being positive definite.
> [!proof]- Sketch Proof
> Take any nonzero $\eta \in \mathbb{R}^{k}$, then $\eta^{T}\mathrm{Cov}_{\theta}(\mathbf{T})\eta=\mathrm{Var}_{\theta}(\eta \cdot \mathbf{T}(X)),$where positive definiteness requires $\mathrm{RHS}>0$.
>
> This holds for all $\eta$ if and only if there is no linear combination of $\mathbf{T}$ that has zero variance, i.e. a constant. This is equivalent to $\mathbf{T}$ being affinely independent.
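A concrete instance of this equivalence (a toy example, not from the notes above): take Bernoulli($p$) with the redundant statistics $\mathbf{T}=(X,\,1-X)$. Since $T_{1}+T_{2}=1$, $\mathbf{T}$ is affinely dependent, and indeed $\mathrm{Cov}(\mathbf{T})$ is singular:

```python
# Bernoulli(p) with redundant statistics T = (X, 1 - X): since T1 + T2 = 1,
# T is affinely dependent, so Cov(T) should fail to be positive definite.
p = 0.3
outcomes = [(0, (0, 1)), (1, (1, 0))]          # (x, T(x))
probs = {0: 1 - p, 1: p}

mean = [sum(probs[x] * t[i] for x, t in outcomes) for i in range(2)]
cov = [[sum(probs[x] * (t[i] - mean[i]) * (t[j] - mean[j]) for x, t in outcomes)
        for j in range(2)] for i in range(2)]

# Determinant is zero: the quadratic form vanishes along eta = (1, 1).
det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
assert abs(det) < 1e-12
```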
## Parameter Spaces
> [!definition|*] Parameter Spaces
>
> For an exponential family $\mathcal{P}$, its **parameter space** is the set of $\theta$ that gives a density with a finite integral: $\Theta:=\left\{ \theta~: \int \exp[\eta(\theta)\cdot \mathbf{T}(x)]\cdot h(x) ~ dx <\infty \right\}$*If the integral is not finite, no choice of $B(\theta)$ can rescale the density to a proper distribution.*
>
> The **natural parameter space** $\Xi$ is the set of $\eta \in \mathbb{R}^{k}$ that gives a finite integral: $\Xi:=\left\{ \eta \in \mathbb{R}^{k}: \int \exp[\eta\cdot \mathbf{T}(x)]\cdot h(x) ~ dx <\infty \right\}$
> > [!info]- Minor remarks
> > - Naturally $\eta(\Theta) \subseteq \Xi$, although the reverse inclusion is not true in general.
> > - Note that in the parameter space $\Theta$, $\eta$ are treated as functions. In the natural parameter space, they are treated as constants.
*The natural parameter space is always easy to work with*: if $\mathcal{P}$ is strictly $k$-parameter, then $\Xi$ is always **convex** and contains a $k$-dimensional open ball.
- Convexity makes optimization easier, and in particular averages of valid parameters are still valid.
- The $k$-dimensional ball means that $\Xi$ does not restrict the dimensions of possible parameters.
> [!proof]-
> Convexity follows from Hölder's inequality -- lots of $\mathcal{L}^{p}$ space algebra bashing.
>
> $\eta(\Theta)$ is $k$-dimensional since $\eta$ is affinely independent (it's $k$-parameter representation of a strictly $k$-parameter family). Hence $\Xi \supseteq \eta(\Theta)$ is also $k$-dimensional. Then convexity guarantees the existence of a ball as a subset.
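A toy illustration of $\Xi$ (an example chosen here, not from the section above): take $T(x)=x$, $h\equiv 1$ on $x=0,1,2,\dots$. The normalizer is a geometric series, finite exactly when $\eta<0$, so $\Xi=(-\infty,0)$ -- an interval, hence convex:

```python
import math

def normalizer(eta, terms=10_000):
    """Partial sum of sum_x exp(eta*x) over x = 0, 1, 2, ... (truncated)."""
    return sum(math.exp(eta * x) for x in range(terms))

# For eta < 0 the series converges to the geometric-series limit 1/(1 - e^eta):
assert abs(normalizer(-0.5) - 1 / (1 - math.exp(-0.5))) < 1e-9
# For eta >= 0 the partial sums keep growing: the integral is infinite, eta not in Xi.
assert normalizer(0.1, terms=200) > normalizer(0.1, terms=100)
```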
If $\eta(\Theta)$ also contains a $k$-dimensional open ball, then the family is **full rank**, or **regular**.
If the family $\mathcal{P}$ is strictly $k$-parameter but there exists $q < k$ where $\Theta \subseteq \mathbb{R}^{q}$, then it is not full rank ($\eta(\Theta)$ is a $q$-dimensional manifold). In that case, it is a **curved exponential family**.
- *Curved exponential families can arise if we impose some non-linear relationship between $\theta$* -- which restricts the space $\eta(\Theta)$. ^4dfdc4
- For example, the family $\mathcal{P}=\{ N(\theta, \theta^{2})~|~ \theta \in \mathbb{R} \}$ has $q=1$, $k=2$. Its natural parameter space is $\Xi=\mathbb{R} \times \mathbb{R}^{>0}$, but $\eta(\Theta)$ is just a quadratic curve (a 1-dimensional manifold).
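A short derivation for this example, using the representation $\mathbf{T}(x)=(x,\,-x^{2})$ (one of several equivalent sign conventions, chosen here so that $\eta_{2}>0$): $\begin{align*}
\log p(x;\mu,\sigma^{2})&= \frac{\mu}{\sigma^{2}}x-\frac{1}{2\sigma^{2}}x^{2}-\left( \frac{\mu^{2}}{2\sigma^{2}}+\log \sigma \right)-\frac{1}{2}\log 2\pi\\
\implies \pmb{\eta}&= \left( \frac{\mu}{\sigma^{2}},~ \frac{1}{2\sigma^{2}} \right) \in \mathbb{R}\times \mathbb{R}^{>0}
\end{align*}$and substituting $\mu=\theta,~\sigma^{2}=\theta^{2}$ gives $\pmb{\eta}(\theta)=\left( \frac{1}{\theta},~ \frac{1}{2\theta^{2}} \right)$, i.e. $\eta_{2}=\eta_{1}^{2}/2$: a parabola (minus the origin) inside the $2$-dimensional $\Xi$.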
## Exponential Families and Supports
*Big idea: all members of an exponential family share the same support.*
The **support** of a distribution $P$ with density $f:\mathcal{X} \to \mathbb{R}^{+}$ is $\mathrm{supp}(P):=\mathrm{supp}(f):=\{ x \in \mathcal{X}~|~f(x)>0 \}$and *for exponential families, the support only depends on $h(x)$* (the exponential factor is always positive), so
> [!theorem|*] Support of Exponential Families
> All distributions in an exponential family share a common support.
As a consequence, families of distributions where the support depends on the parameter cannot be exponential families.
- Examples include Pareto, uniform $U[0,\theta]$, shifted exponentials.
Furthermore, all $p_{\theta},p_{\tilde{\theta}}$ from the same exponential family $\mathcal{P}$ are **equivalent**, in the sense that $\forall \text{ measurable } N\subseteq \mathcal{X},~ ~\theta,\tilde{\theta} \in \Theta,~~~ \mathbb{P}_{\theta}(N)=0 \iff \mathbb{P}_{\tilde{\theta}}(N)=0$
> [!proof]-
> Writing the probabilities as integrals gives $0=\mathbb{P}_{\theta}(N)\propto\int \exp[\mathbf{T}(x)\cdot \eta]\cdot h(x) \cdot \mathbb{1}_{N}~ dx$ and since the exponential term is $>0$, we must have $h(x) \cdot \mathbb{1}_{N}=0~\mathrm{a.e.}$ in $\mathcal{X}$, and this is independent of $\theta$. Then the integral of $\mathbb{P}_{\tilde{\theta}}(N)$ must also evaluate to $0$ for any $\tilde{\theta} \in \Theta$.