**Exponential families** are families of probability distributions, parameterized over some parameter space $\Theta$.
A family is defined by the choice of the following ingredients:
> [!definition|*] Exponential Families
> - The number of parameters $k \in \mathbb{Z}^{+}$,
> - **Canonical observations/statistics** $\mathbf{T}=(T_{1},\dots,T_{k})$, where $T_{i}:\mathcal{X} \to \mathbb{R}$,
> - **Canonical parameters** $\pmb{\eta}=(\eta_{1},\dots,\eta_{k})$, where $\eta_{i}:\Theta \to \mathbb{R}$,
> - Functions $B:\Theta \to \mathbb{R}$ and $h: \mathcal{X} \to \mathbb{R}^{+}$.
>
> Given those components, the **exponential family** is the set of distributions $\mathcal{P}=\{P_{\theta}~|~ \theta \in \Theta\}$, where $P_{\theta}$ has pdf/pmf $p(x;\theta)$ of the form $\begin{align*}
p(x;\theta)&= \exp\left[\mathbf{T}(x)\cdot \pmb{\eta}(\theta)-B(\theta)\right]h(x)\\
&= \exp\left[ \sum_{i=1}^{k}T_{i}(x)\eta_{i}(\theta) - B(\theta)\right]h(x)
\end{align*}$where $B(\theta)$ is a scaling factor chosen so that the density $p(x;\theta)$ integrates to $1$.
Alternatively, reparametrizing with $\eta_{i}:=\eta_{i}(\theta)$ gives the **canonical form** $\begin{align*}
p(x;\pmb{\eta})&= \exp\left[\mathbf{T}(x)\cdot \pmb{\eta}-B(\pmb{\eta})\right]h(x)\\
&= \exp\left[ \sum_{i=1}^{k}\eta_{i}T_{i}(x) - B(\pmb{\eta})\right]h(x)
\end{align*}$where $\pmb{\eta}=(\eta_{1},\dots,\eta_{k})$, and $B(\pmb{\eta})$ is understood to equal $B(\theta)$.
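For instance, $\mathrm{Po}(\lambda)$ has pmf $\frac{\lambda^{x}e^{-\lambda}}{x!}=\exp[x\log\lambda-\lambda]\cdot \frac{1}{x!}$, so $T(x)=x$, $\eta(\lambda)=\log\lambda$, $B(\lambda)=\lambda$, $h(x)=1/x!$. A minimal sanity check (standard library only) that this canonical form reproduces the usual pmf:

```python
import math

def poisson_canonical_pmf(x, lam):
    """Poisson pmf in exponential-family form:
    T(x) = x, eta = log(lam), B(eta) = exp(eta) = lam, h(x) = 1/x!."""
    eta = math.log(lam)
    B = math.exp(eta)                  # the scaling factor, equals lam
    h = 1.0 / math.factorial(x)
    return math.exp(x * eta - B) * h

def poisson_pmf(x, lam):
    """Standard Poisson pmf, for comparison."""
    return lam**x * math.exp(-lam) / math.factorial(x)

for x in range(10):
    assert abs(poisson_canonical_pmf(x, 3.5) - poisson_pmf(x, 3.5)) < 1e-12
```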
> [!examples] Examples of Exponential Families
> $1$-parameter families include:
> - Poisson distributions $\mathrm{Po}(\lambda)$,
> - Binomial distributions $\mathrm{Binom}(n,p)$ with fixed $n$,
> - Gaussians $N(\mu,\sigma^{2}_{0})$ with fixed $\sigma_{0}^{2}$.
>
> $2$-parameter families include:
> - Gamma distributions $\mathrm{Gamma}(\alpha,\beta)$,
> - Gaussians $N(\mu,\sigma^{2})$ with both parameters free.
>
> Not exponential families:
> - Uniform distributions $U[0,\theta]$,
> - In general anything with non-constant support -- see [[Exponential Families#Exponential Families and Supports|the last section]].
Alternatively, exponential families can be interpreted as distributions obtained from a base distribution: for one distribution $f_{\pmb{\eta}_{0}}$ corresponding to $\pmb{\eta}=\pmb{\eta}_{0}$, any other $f_{\pmb{\eta}}$ from the family has the likelihood ratio $\frac{f_{\pmb{\eta}}}{f_{\pmb{\eta}_{0}}}\propto \exp[(\pmb{\eta}-\pmb{\eta}_{0})\cdot \mathbf{T}],$so $f_{\pmb{\eta}}$ can be obtained by applying **exponential tilting** to the base $f_{\pmb{\eta}_{0}}$, then rescaling for unit mass.
- For example, the family $\{ \mathrm{Po}(\lambda) ~|~ \lambda \in \mathbb{R}^{+} \}$ can be obtained by applying $\exp(ax), ~a=\log \frac{\lambda}{\lambda_{0}} \in \mathbb{R}$ to any $\mathrm{Po}(\lambda_{0})$.
![[PoissonExpTilting.png#invert|center|w80]]
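The tilting construction is easy to verify numerically. A minimal sketch (standard library only, truncating the support at $x<60$): tilting $\mathrm{Po}(\lambda_{0}=2)$ by $e^{ax}$ with $a=\log \frac{\lambda}{\lambda_{0}}$, then renormalizing, recovers $\mathrm{Po}(\lambda=5)$:

```python
import math

def tilt_poisson(lam0, a, support=range(60)):
    """Apply exponential tilting exp(a*x) to Po(lam0), then rescale to unit mass."""
    base = [lam0**x * math.exp(-lam0) / math.factorial(x) for x in support]
    tilted = [p * math.exp(a * x) for x, p in zip(support, base)]
    Z = sum(tilted)                    # renormalizing constant
    return [p / Z for p in tilted]

lam0, lam = 2.0, 5.0
a = math.log(lam / lam0)
tilted = tilt_poisson(lam0, a)
target = [lam**x * math.exp(-lam) / math.factorial(x) for x in range(60)]
assert max(abs(p - q) for p, q in zip(tilted, target)) < 1e-9
```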
### Properties of Exponential Families
*Sampling distributions from exponential families*:
- If $(X_{i} \sim P_{i})_{i=1}^{n}$ are independent ($P_{i}$ being distributions from the same exponential family), then $\mathbf{X}=(X_{1},\dots,X_{n})$ is also from an exponential family.
- Furthermore, if $X_{1},\dots,X_{n} \overset{\mathrm{iid.}}{\sim} f(\cdot~;\theta)$ with canonical observations $\mathbf{T}$, then $\mathbf{X}$ has canonical observations $\mathbf{T}_{(n)}(\mathbf{X})=\sum_{i}\mathbf{T}(X_{i})$.
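Concretely, for an iid sample the joint density again has exponential-family form, with statistic $\sum_{i}\mathbf{T}(X_{i})$ and scaling factor $nB$. A quick numerical check with Poisson data (standard library only; the sample values are arbitrary):

```python
import math

def log_pmf_poisson(x, lam):
    """log of the Poisson pmf: x*log(lam) - lam - log(x!)."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)

lam = 3.0
sample = [1, 4, 2, 0, 3]
n = len(sample)

# Joint log-pmf, summed term by term:
joint = sum(log_pmf_poisson(x, lam) for x in sample)

# Exponential-family form of the joint model: canonical statistic
# T_(n) = sum_i x_i, same eta = log(lam), scaling factor n*B = n*lam.
eta = math.log(lam)
T_n = sum(sample)
log_h = -sum(math.lgamma(x + 1) for x in sample)
assert abs(joint - (eta * T_n - n * lam + log_h)) < 1e-12
```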
*Moments of exponential families*: the scaling factor $B(\eta)$ summarizes the behavior of the canonical observations $\mathbf{T}(X)$: assuming $\eta \in \mathrm{int}(\Xi)$, $\Xi$ being the [[Exponential Families#Parameter Spaces|natural parameter space]],
- Moments $\mathbb{E}_{\eta}[T_{i}(X)^{m}] < \infty$ for all $m \ge 1$.
- The mean and covariance of $\mathbf{T}$ are $\begin{align*}
\mathbb{E}_{\eta}[T_{i}(X)]&= \frac{ \partial B }{ \partial \eta_{i} } \\[0.2em]
\mathrm{Cov}_{\eta}(T_{i}, T_{j})&= \frac{ \partial^{2} B }{ \partial \eta_{i}\partial\eta_{j}}
\end{align*}$
> [!warning] Curved Exponential Families
> If $\{ P_{\theta} \}$ is [[#^4dfdc4|curved]], $B$ may simplify to another function $B^{\ast}$ of $\pmb{\eta}$ -- this formula does not work for $B^{\ast}$.
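For the Poisson family in canonical form, $B(\eta)=e^{\eta}$, so both $\partial B/\partial \eta$ and $\partial^{2}B/\partial \eta^{2}$ equal $\lambda$ -- matching the known mean and variance. A finite-difference check (standard library only):

```python
import math

def B(eta):
    """Scaling factor of the Poisson family in canonical form: B(eta) = e^eta."""
    return math.exp(eta)

eta = math.log(4.0)                    # i.e. lam = 4
h = 1e-4
dB = (B(eta + h) - B(eta - h)) / (2 * h)              # ~ E[T(X)] = lam
d2B = (B(eta + h) - 2 * B(eta) + B(eta - h)) / h**2   # ~ Var(T(X)) = lam

assert abs(dB - 4.0) < 1e-5
assert abs(d2B - 4.0) < 1e-4
```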
The canonical statistics $\mathbf{T}(\mathbf{X})=(T_{1}(\mathbf{X}),\dots, T_{k}(\mathbf{X}))$ themselves follow an exponential family with the same canonical parameters $\eta(\theta)=(\eta_{1}(\theta),\dots,\eta_{k}(\theta))$.
If $\pmb{\eta}$ is the canonical parameter vector, its [[Maximum Likelihood Estimator|MLE]] $\hat{\eta}$ solves the zero-gradient equations $\nabla_{\eta}\left( \sum_{i}l(\eta;x_{i}) \right)=\mathbf{T}(\mathbf{x})-\nabla_{\!\eta}B=\mathbf{T}(\mathbf{x})-\mathbb{E}_{\eta}[\mathbf{T}]=\mathbf{0},$where $\mathbf{T}(\mathbf{x})$ and $B$ refer to the joint model of the sample.
Therefore, *the MLE $\hat{\eta}$ must match the observed $\mathbf{T}(\mathbf{x})$ with its expectation $\mathbb{E}_{\eta}[\mathbf{T}]$.* ^e3137c
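The moment-matching can be seen numerically: for Poisson data, minimizing the negative log-likelihood over the canonical parameter (here by a crude grid search rather than a proper solver; standard library only) lands where $\mathbb{E}_{\eta}[T]=e^{\eta}$ equals the sample mean:

```python
import math

def neg_log_lik(eta, sample):
    """Negative joint log-likelihood of iid Poisson data in canonical form."""
    return -(eta * sum(sample) - len(sample) * math.exp(eta)
             - sum(math.lgamma(x + 1) for x in sample))

sample = [2, 5, 1, 3, 4]
# Crude grid search over eta in [-2, 3):
etas = [i / 10000 for i in range(-20000, 30000)]
eta_hat = min(etas, key=lambda e: neg_log_lik(e, sample))

# At the MLE, the model mean E_eta[T] = exp(eta_hat) matches the sample mean:
assert abs(math.exp(eta_hat) - sum(sample) / len(sample)) < 1e-3
```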
## Minimal Representations
> [!tldr]
> A representation is **minimal** if it does not use more parameters than what’s necessary. This is equivalent to canonical functions $\{ \eta_{i} \},\{ T_{i} \}$ both being **affinely independent**.
The choice of $\eta_{i},T_{i},B,h$ forms the **representation** of an exponential family, but such representation is not unique.
- For example, we may arbitrarily split $\eta_{i}$ into two terms, or switch their orders.
> [!definition|*] Number of parameters
> A representation with canonical functions $\eta_{i},T_{i}$, $i=1,\dots,k$, is called a **$k$-parameter** representation.
>
> If all representations of a family $\mathcal{P}$ need at least $k$ parameters, then the family is **strictly $k$-parameter**.
>
> If $\mathcal{P}$ is strictly $k$-parameter, then a $k$-parameter representation of it is **minimal**.
*Minimality of a representation is equivalent to* [[Affine Independence|affine independence]] among $\{ \eta_{1},\dots,\eta_{k} \}$ and $\{ T_{1},\dots,T_{k} \}$: $\begin{align*}
&\not\exists c_{1},\dots,c_{k+1} \in \mathbb{R}\text{ not all zero}: \forall x \in \mathcal{X}, \sum_{i=1}^{k}c_{i}T_{i}(x)=c_{k+1}\\
&\not\exists d_{1},\dots,d_{k+1} \in \mathbb{R}\text{ not all zero}: \forall \theta \in \Theta, \sum_{i=1}^{k}d_{i}\eta_{i}(\theta)=d_{k+1}
\end{align*}$This is because affine dependence allows us to express one of $\eta_{i},T_{i}$ in terms of other canonical functions, thereby reducing the number of parameters to $k-1$.
- In particular, all affinely independent representations are $k$-parameter if $\mathcal{P}$ is strictly $k$-parameter.
> [!proof]-
> Lol who cares. Go check the lecture notes.
> The big idea is that the canonical statistics $\mathbf{T}$ live in the vector space of (things proportional to the) log-likelihoods.
>
> Now any affinely independent representation $\mathbf{T}$, together with the constant function $\mathbf{1}_{\mathcal{X}}$, forms a basis $\{ \mathbf{1}_{\mathcal{X}}, T_{1},\dots,T_{k} \}$ of this space. Since different affinely independent representations all span the same space, they must have the same dimension.
Affine independence among $T_{i}$ is equivalent to $\mathrm{Cov}_{\theta}(\mathbf{T})$ being positive definite.
> [!proof]- Sketch Proof
> Take any nonzero $\eta \in \mathbb{R}^{k}$, then $\eta^{T}\mathrm{Cov}_{\theta}(\mathbf{T})\eta=\mathrm{Var}_{\theta}(\eta \cdot \mathbf{T}(X)),$where positive definiteness requires $\mathrm{RHS}>0$.
>
> This holds for all $\eta$ if and only if there is no linear combination of $\mathbf{T}$ that has zero variance, i.e. a constant. This is equivalent to $\mathbf{T}$ being affinely independent.
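A concrete instance of this equivalence (a toy example, not from the notes above): take Bernoulli($p$) with the redundant statistics $\mathbf{T}=(X,\,1-X)$. Since $T_{1}+T_{2}=1$, $\mathbf{T}$ is affinely dependent, and indeed $\mathrm{Cov}(\mathbf{T})$ is singular:

```python
# Bernoulli(p) with redundant statistics T = (X, 1 - X): since T1 + T2 = 1,
# T is affinely dependent, so Cov(T) should fail to be positive definite.
p = 0.3
outcomes = [(0, (0, 1)), (1, (1, 0))]          # (x, T(x))
probs = {0: 1 - p, 1: p}

mean = [sum(probs[x] * t[i] for x, t in outcomes) for i in range(2)]
cov = [[sum(probs[x] * (t[i] - mean[i]) * (t[j] - mean[j]) for x, t in outcomes)
        for j in range(2)] for i in range(2)]

# Determinant is zero: the quadratic form vanishes along eta = (1, 1).
det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
assert abs(det) < 1e-12
```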
## Parameter Spaces
> [!definition|*] Parameter Spaces
>
> For an exponential family $\mathcal{P}$, its **parameter space** is the set of $\theta$ that gives a density with a finite integral: $\Theta:=\left\{ \theta~: \int \exp[\eta(\theta)\cdot \mathbf{T}(x)]\cdot h(x) ~ dx <\infty \right\}$*If the integral is not finite, no choice of $B(\theta)$ can rescale the density to a proper distribution.*
>
> The **natural parameter space** $\Xi$ is the set of $\eta \in \mathbb{R}^{k}$ that gives a finite integral: $\Xi:=\left\{ \eta \in \mathbb{R}^{k}: \int \exp[\eta\cdot \mathbf{T}(x)]\cdot h(x) ~ dx <\infty \right\}$
> > [!info]- Minor remarks
> > - Naturally $\eta(\Theta) \subseteq \Xi$, although the reverse inclusion is not true in general.
> > - Note that in the parameter space $\Theta$, $\eta$ are treated as functions. In the natural parameter space, they are treated as constants.
*The natural parameter space is always easy to work with*: if $\mathcal{P}$ is strictly $k$-parameter, then $\Xi$ is always **convex** and contains a $k$-dimensional open ball.
- Convexity makes optimization easier, and in particular averages of valid parameters are still valid.
- The $k$-dimensional ball means that $\Xi$ does not restrict the dimensions of possible parameters.
> [!proof]-
> Convexity follows from Hölder's inequality -- lots of $\mathcal{L}^{p}$ space algebra bashing.
>
> $\eta(\Theta)$ is $k$-dimensional since $\eta$ is affinely independent (it's $k$-parameter representation of a strictly $k$-parameter family). Hence $\Xi \supseteq \eta(\Theta)$ is also $k$-dimensional. Then convexity guarantees the existence of a ball as a subset.
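A toy illustration of $\Xi$ (an example chosen here, not from the section above): take $T(x)=x$, $h\equiv 1$ on $x=0,1,2,\dots$. The normalizer is a geometric series, finite exactly when $\eta<0$, so $\Xi=(-\infty,0)$ -- an interval, hence convex:

```python
import math

def normalizer(eta, terms=10_000):
    """Partial sum of sum_x exp(eta*x) over x = 0, 1, 2, ... (truncated)."""
    return sum(math.exp(eta * x) for x in range(terms))

# For eta < 0 the series converges to the geometric-series limit 1/(1 - e^eta):
assert abs(normalizer(-0.5) - 1 / (1 - math.exp(-0.5))) < 1e-9
# For eta >= 0 the partial sums keep growing: the integral is infinite, eta not in Xi.
assert normalizer(0.1, terms=200) > normalizer(0.1, terms=100)
```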
If $\eta(\Theta)$ also contains a $k$-dimensional open ball, then the family is **full rank**, or **regular**.
If the family $\mathcal{P}$ is strictly $k$-parameter but there exists $q < k$ where $\Theta \subseteq \mathbb{R}^{q}$, then it is not full rank ($\eta(\Theta)$ is a $q$-dimensional manifold). In that case, it is a **curved exponential family**.
- *Curved exponential families can arise if we impose some non-linear relationship between $\theta$* -- which restricts the space $\eta(\Theta)$. ^4dfdc4
- For example, the family $\mathcal{P}=\{ N(\theta, \theta^{2})~|~ \theta \in \mathbb{R} \}$ has $q=1$, $k=2$. Its natural parameter space is $\Xi=\mathbb{R} \times \mathbb{R}^{>0}$, but $\eta(\Theta)$ is just a quadratic curve (a 1-dimensional manifold).
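A short derivation for this example, using the representation $\mathbf{T}(x)=(x,\,-x^{2})$ (one of several equivalent sign conventions, chosen here so that $\eta_{2}>0$): $\begin{align*}
\log p(x;\mu,\sigma^{2})&= \frac{\mu}{\sigma^{2}}x-\frac{1}{2\sigma^{2}}x^{2}-\left( \frac{\mu^{2}}{2\sigma^{2}}+\log \sigma \right)-\frac{1}{2}\log 2\pi\\
\implies \pmb{\eta}&= \left( \frac{\mu}{\sigma^{2}},~ \frac{1}{2\sigma^{2}} \right) \in \mathbb{R}\times \mathbb{R}^{>0}
\end{align*}$and substituting $\mu=\theta,~\sigma^{2}=\theta^{2}$ gives $\pmb{\eta}(\theta)=\left( \frac{1}{\theta},~ \frac{1}{2\theta^{2}} \right)$, i.e. $\eta_{2}=\eta_{1}^{2}/2$: a parabola (minus the origin) inside the $2$-dimensional $\Xi$.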
## Exponential Families and Supports
*Big idea: all members of an exponential family share the same support.*
The **support** of a distribution $P$ with density $f:\mathcal{X} \to \mathbb{R}^{+}$ is $\mathrm{supp}(P):=\mathrm{supp}(f):=\{ x \in \mathcal{X}~|~f(x)>0 \}$and *for exponential families, the support only depends on $h(x)$* (the exponential factor is always positive), so
> [!theorem|*] Support of Exponential Families
> All distributions in an exponential family share a common support.
As a consequence, families of distributions where the support depends on the parameter cannot be exponential families.
- Examples include Pareto, uniform $U[0,\theta]$, shifted exponentials.
Furthermore, all $p_{\theta},p_{\tilde{\theta}}$ from the same exponential family $\mathcal{P}$ are **equivalent**, in the sense that $\forall \text{ measurable } N\subseteq \mathcal{X},~ ~\theta,\tilde{\theta} \in \Theta,~~~ \mathbb{P}_{\theta}(N)=0 \iff \mathbb{P}_{\tilde{\theta}}(N)=0$
> [!proof]-
> Writing the probabilities as integrals gives $0=\mathbb{P}_{\theta}(N)\propto\int \exp[\mathbf{T}(x)\cdot \eta]\cdot h(x) \cdot \mathbb{1}_{N}~ dx$ and since the exponential term is $>0$, we must have $h(x) \cdot \mathbb{1}_{N}=0~\mathrm{a.e.}$ in $\mathcal{X}$, and this is independent of $\theta$. Then the integral of $\mathbb{P}_{\tilde{\theta}}(N)$ must also evaluate to $0$ for any $\tilde{\theta} \in \Theta$.