```dataview
table without id
File as "Topics",
dateformat(file.mtime, "yyyy-MM-dd") as "Last Modified"
from ""
FLATTEN "[[" + file.path + "|" + truncate(file.name, 30) + "]]" as File
FLATTEN course as course
where
(course and (course = this.file.link) and (file.name != this.file.name)) and !contains(file.path, "2 - Snippets")
sort file.mtime desc
```
```dataview
table without id
File as "Snippets",
dateformat(file.mtime, "yyyy-MM-dd") as "Last Modified"
from "2 - Snippets"
FLATTEN "[[" + file.path + "|" + truncate(file.name, 30) + "]]" as File
FLATTEN course as course
where
(course and (course = this.file.link) and (file.name != this.file.name))
sort file.mtime desc
```
## Joint Distributions
* The **joint cumulative distribution function** (cdf) of two random variables $X,Y$ is $F_{X,Y}(x,y)=\mathbb P(X\le x,Y\le y)$
* The two variables are **jointly continuous** if this function can be written as a double integral: $\exists f(x,y): F(x,y)=\int_{-\infty}^{y} \int_{-\infty}^{x} f(r,s)\,dr\,ds$ and the function $f$ is the **joint probability density function** (pdf).
* Given the joint pdf, for any $A\subseteq \mathbb R^2$: $\mathbb P\big((X,Y)\in A\big)=\iint_{A}f \, dA$
* The **marginal distribution** of $X$ is obtained by integrating out $Y$: $f_{X}(x)=\int_{-\infty}^\infty f_{X,Y}(x,y) \, dy$
* If variables $X,Y$ are **independent**, then $f_{X,Y}(x,y)=f_{X}(x)f_{Y}(y)$. Hence, given independence, the joint density factors into the product of the marginals.
* Uncorrelatedness alone is not sufficient for independence; multivariate Gaussians are an exception (see the covariance matrix discussion below).
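* A quick numerical sanity check of the marginalization formula (my own sketch, not from the lectures; assumes `numpy`/`scipy`, and the correlated bivariate normal is an arbitrary example): integrating the joint pdf over $y$ recovers the known $N(0,1)$ marginal of $X$.
```python
# Marginalization check: integrate a joint pdf over y to recover the
# marginal of X. Example joint: bivariate normal with correlation rho,
# whose X-marginal is exactly N(0, 1).
import numpy as np
from scipy import integrate
from scipy.stats import norm, multivariate_normal

rho = 0.5
joint = multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]])

x = 0.7
# f_X(x) = integral over y of f_{X,Y}(x, y)
marginal, _ = integrate.quad(lambda y: joint.pdf([x, y]), -np.inf, np.inf)
print(marginal, norm.pdf(x))  # both ≈ 0.312
```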
### Conditional Densities
* We can define the **conditional cdf** for an event $A$: $F_{X|A}(x)=\frac{\mathbb{P}(\{X\le x\} \cap A)}{\mathbb P(A)}$ and the **conditional pdf** $f_{X|A}(x)$ is its derivative.
* Then the conditional probability and expectation are $\begin{align*}
\mathbb{P}(X\in C\,|\,A)&=\int_{C}f_{X|A}(x) \, dx\\
\mathbb{E}(X|A)&=\int x\cdot f_{X|A}(x) \, dx
\end{align*}$
- The **conditional density function** of $Y$ given $X=x$ is: $f_{Y|X=x}(y)=\frac{f_{X,Y}(x,y)}{f_{X}(x)}$ for $x$ with $f_{X}(x)\ne 0$. Hence when $X,Y$ are independent, $f_{Y|X=x}(y)=f_{Y}(y)$: conditioning on $X$ does not affect the distribution of $Y$.
### Transformation Formula in 2D
* Consider the transformation of random variables $X,Y$ under $T(x,y)=\big(U(x,y),V(x,y)\big)$. Suppose $(X,Y)$ has non-zero density on $R \subseteq \mathbb{R}^{2}$.
* The **transformation formula** gives the distribution of $U,V$: in the region $T(R)$, their density is $f_{U,V}(u,v)=f_{X,Y}\Big(x(u,v),y(u,v)\Big)\Bigg|\det\frac{\partial(x,y)}{\partial(u,v)}\Bigg|$ where $\frac{\partial(x,y)}{\partial(u,v)}$ is the **Jacobian (matrix)** $J=\frac{\partial(x,y)}{\partial(u,v)}=\begin{pmatrix}
x_{u} & x_{v} \\
y_{u} & y_{v}
\end{pmatrix}$ and outside $T(R)$, $f_{U,V}(u,v)=0$.
> Proof: for any $A\subseteq R$ we have $T(A)\subseteq T(R)$, so we can write $\mathbb P\big((X,Y)\in A\big)=\mathbb P\big((U,V)\in T(A)\big)$ as a double integral and apply the change-of-variables formula.
* Example: deriving the distribution of the sum of two random variables.
> Transforming from $(X,Y)$ to $(U,V)=(X+Y,Y)$, the Jacobian has determinant $1$; substituting $x=u-v$ gives $f_{X+Y,\,Y}(u,v)=f_{X,Y}(u-v,v)\cdot 1$, so integrating out $v$ gives the marginal density $f_{X+Y}(u)=\int _{-\infty}^\infty f_{X,Y}(u-v,v)\, dv$
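* A numerical check of this convolution formula (my own sketch, assuming `numpy`/`scipy`; iid standard normals are used since their sum is known to be $N(0,2)$):
```python
# Check f_{X+Y}(u) = integral of f_{X,Y}(u - v, v) dv for X, Y iid N(0,1);
# by independence the joint pdf factors, and the sum should be N(0, 2).
import numpy as np
from scipy import integrate
from scipy.stats import norm

f = norm.pdf  # f_{X,Y}(x, y) = f(x) f(y) by independence

u = 1.3
val, _ = integrate.quad(lambda v: f(u - v) * f(v), -np.inf, np.inf)
print(val, norm.pdf(u, scale=np.sqrt(2)))  # both ≈ 0.1849
```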
### Transformations and Gaussians in Higher Dimensions
* The same formula holds for higher dimensions, where the Jacobian has the entries defined as $\Bigg( \frac{\partial(x_{1},\dots,x_{n})}{\partial(y_{1},\dots, y_{n})}\Bigg )_{i,j}=\frac{\partial x_{i}}{\partial y_{j}}$
* For invertible linear transformations $\mathbf{W}=A\mathbf{Z}+\mathbf{\mu}$, the Jacobian $\partial \mathbf{z}/\partial \mathbf{w}$ is just $A^{-1}$, hence $f_{\mathbf{W}}(\mathbf{w})=\frac{f_{\mathbf{Z}}(\mathbf{z}(\mathbf{w}))}{|\det A|}$
* For example, given iid. standard normal variables $\mathbf{Z}$, the transformed $\mathbf{W}=A\mathbf{Z}+\underline{\mu}$ would have distribution $f_{\mathbf{W}}(\mathbf{w})=\frac{1}{(2\pi)^{n/2}|\det A|}\exp\left( -\frac{1}{2}(\mathbf{ w-\underline{\mu}})^T(AA^T)^{-1}(\mathbf{ w-\underline{\mu}}) \right)$*(You probably don't need to memorize this...)* so they follow the multivariate normal distribution $N(\underline{\mu},A A^T)$, where $A A^{T}$ is the covariance matrix.
* The **covariance matrix** of random variables $X_{1},\dots,X_{n}$ is the matrix $\Sigma_{n\times n}$ where $(\Sigma)_{ij}=\mathrm{Cov}(X_{i},X_{j})$. So independent variables have a diagonal covariance matrix. While the converse is not true in general, it does hold for Gaussians...
* If the multivariate Gaussian $\mathbf{W}=(W_{1},\dots,W_{n})=A\mathbf{Z}+\underline{\mu}$ has $A$ orthogonal, i.e. $A A^{T}=I$, then its pdf can be factored as $f_{\mathbf{W}}(\mathbf{w})=\prod_{i=1}^{n}f_{W_{i}}(w_{i})$, so $W_{1},\dots,W_{n}$ are indeed independent.
* Strictly, it is $A$ (not the covariance matrix) that is called "orthogonal"; a similar result holds for $AA^{T}=\mathrm{diag}(\dots)$, but the algebra is even messier.
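* An empirical sanity check of the covariance claim (a sketch of mine, assuming `numpy`; $A$ and $\mu$ below are arbitrary examples): samples of $\mathbf{W}=A\mathbf{Z}+\mu$ should have covariance close to $AA^{T}$.
```python
# W = A Z + mu with Z iid standard normal should have covariance A A^T.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.0], [1.0, 1.5]])
mu = np.array([1.0, -2.0])

Z = rng.standard_normal((100_000, 2))
W = Z @ A.T + mu  # each row is one sample A z + mu

print(np.cov(W, rowvar=False))  # ≈ A @ A.T
print(A @ A.T)                  # [[4, 2], [2, 3.25]]
```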
### A Random Section on Constructing Normal Variables
* Goal: construct two random normal variables $X,Y$ following ${N}(m,s^{2})$ and ${N}(\mu,\sigma^{2})$ respectively (so $s,\sigma$ are standard deviations), having a correlation of $\rho$.
> Idea: decompose $Y$ into a $Z_{1}$ that it shares with $X$, plus an independent part $Z_{2}$, with $Z_{1,2}$ iid. $N(0,1)$: $\begin{align*}
X &= sZ_{1}+m\\
Y &= c_{1} Z_{1} + c_{2}Z_{2} + \mu
\end{align*}$The coefficient $c_{1}$ of $Z_{1}$ would control the correlation, and we adjust the coefficient $c_{2}$ of $Z_{2}$ to make sure the variance of $Y$ is correct.
* Step 1: construct $X$ with $Z_{1}$.
> To have $X\sim N(m,s^{2}),$ we just choose $Z_{1}\sim N(0,1),\, X=sZ_{1}+m$.
* Step 2: fit $Z_{1}$ into $Y$ to get the correlation right.
> Since $Z_{1}$ is independent of $Z_{2}$ (and $\mu$ is a constant), its coefficient $c_{1}$ determines the correlation: $\text{corr}(X,Y)=\rho=c_{1} / \sigma$, so we can set $c_{1}=\sigma \rho$.
* Step 3: adjust the coefficient of $Z_{2}$ to fix the variance of $Y$.
> $\mathrm{Var}(Y)=c_{1}^{2}+c_{2}^{2}=\sigma^{2}$, so solving gives $c_{2}=\sigma \sqrt{ 1-\rho^{2} }$.
> Since $Z_{1},Z_{2}$ are independent, adding this term does not change the correlation.
* Plugging in the coefficients, $\begin{align*}
X&=sZ_{1}+m\\
Y&= \sigma \rho Z_{1}+ \sigma \sqrt{ 1-\rho^{2} }Z_{2} + \mu &&\text{where } Z_{1},Z_{2} \sim N(0,1) \text{ iid.}
\end{align*}$has the desired correlation and variances.
* To tie back to conditional distributions, writing $Y$ in terms of $X$ gives the conditional distribution of $Y$ given $X=x$.
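* The recipe is easy to verify empirically (a sketch assuming `numpy`; the parameter values are arbitrary examples):
```python
# Construct X ~ N(m, s^2), Y ~ N(mu, sigma^2) with correlation rho
# from two iid N(0,1) samples, following the recipe above.
import numpy as np

rng = np.random.default_rng(1)
m, s, mu, sigma, rho = 1.0, 2.0, -3.0, 0.5, 0.8

z1 = rng.standard_normal(100_000)
z2 = rng.standard_normal(100_000)
x = s * z1 + m
y = sigma * rho * z1 + sigma * np.sqrt(1 - rho**2) * z2 + mu

print(np.corrcoef(x, y)[0, 1])  # ≈ 0.8
print(x.std(), y.std())         # ≈ 2.0, 0.5
```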
---
## Markov Chains
* Consider a sequence of random variables $(X_{n})$ taking values in a countable set $I$ called **the state space**.
* Usually the sequence is a change over time, e.g. the location of an elevator at time $n$, and the state space is the set of floors $\{0,1,2,\dots k\}$
* In some cases it can use a more abstract sense of time, e.g. the $n$th character in an email, with state space $\{ a,\dots, z\}$.
* If we enumerate the states as $I=(I_{1},I_{2},\dots)$, the distribution of $X_{n}$ can be written as a row vector $\lambda(n)$, where $(\lambda (n))_{i}=\mathbb{P}(X_{n}=I_{i})$.
* In particular, $X_{0} \sim \lambda_{0}$ is the **initial distribution**.
### Homogeneous Markov Chains
*Big idea: homogeneous Markov chains are memoryless random processes.*
* A sequence of random variables $X=(X_{0},X_{1},\dots)$ all taking values in some state space $I$ is called a **Markov chain** if for any $k\ge 1$ and any states $i_{0},\dots,i_{k}$: $\mathbb{P}(X_{k}=i_{k}|X_{k-1}=i_{k-1})= \mathbb{P}(X_{k}=i_{k}|X_{k-1}=i_{k-1},\dots,X_{0}=i_{0})$*That is, the next state of the chain $X_{k}$ only depends on the present state $X_{k-1}$, not the past.*
* A Markov chain is **time-homogeneous** if the **transition probability** $\mathbb{P}(X_{k}=j|X_{k-1}=i)=p_{ij}$ is constant, i.e. independent of $k$. <u>All chains in this course are homogeneous.</u>
* *Hence transition probability $p_{ij}$ is the probability of going from state $i$ to state $j$.*
* Homogeneous Markov chains satisfy the **Markov property**: for subsets $A_{1},\dots,A_{m+n}\subseteq I$ and $i \in I$, $\begin{align*}
\mathbb{P}(X_{m+1}\in A_{m+1},&\dots, X_{m+n}\in A_{m+n}|X_{m}=i, X_{m-1}\in A_{m-1},\dots, X_{1}\in A_{1})\\ \\
&=\mathbb{P}(X_{m+1}\in A_{m+1},\dots, X_{m+n}\in A_{m+n}|X_{m}=i) \\\\
&=\mathbb{P}(X_{1}\in A_{m+1},\dots, X_{n}\in A_{m+n}|X_{0}=i)
\end{align*}$
- *The first equality: Markov chains do not remember what happened before the present.*
- *The second equality: homogeneous Markov chains are **translation-invariant**.*
- A sufficient criterion for a Markov chain: a sequence of random variables $Y=(Y_{0},Y_{1},\dots)$ is a Markov chain if there exist a function $f$ and a sequence $X=(X_{1},X_{2},\dots)$ such that $Y_{n+1}=f(Y_{n},X_{n+1})$, with $X_{n+1}$ independent of $Y_{0},\dots,Y_{n}$ (see the sketch below).
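* For instance, a simple random walk satisfies this criterion with $f(y,x)=y+x$ and iid $\pm 1$ steps. A minimal simulation sketch (mine, assuming `numpy`):
```python
# A simple random walk is Markov: Y_{n+1} = f(Y_n, X_{n+1}) = Y_n + X_{n+1},
# where each step X_{n+1} is a fresh +/-1 draw independent of the past.
import numpy as np

rng = np.random.default_rng(2)

def simulate_walk(n_steps: int, y0: int = 0) -> np.ndarray:
    steps = rng.choice([-1, 1], size=n_steps)  # X_1, ..., X_n, iid
    return y0 + np.concatenate(([0], np.cumsum(steps)))  # Y_0, ..., Y_n

print(simulate_walk(10))
```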
### N-Step Transition Probabilities
* The **$n$-step transition probability** is defined to be $p_{ij}^{(n)}=\mathbb{P}(X_{n}=j|X_{0}=i)$
* For homogeneous chains, the **transition matrix** $P\in \mathbb{R}^{|I|\times|I|}:\, P_{ij}=p_{ij}$ records the **transition probabilities**. The sequence $X$ can be described as $X\sim \text{Markov}(\lambda_{0},P)$.
* The matrix can be (countably) infinite if $I$ is.
* The **Chapman-Kolmogorov equations** describe $n$-step transitions via the transition matrix: for all homogeneous Markov chains, $\begin{align*}
p_{ij}^{(n)}&=(P^n)_{ij}\\
p_{ij}^{(m+n)}&=(P^mP^n)_{ij}=\sum_{k \in I} p_{ik}^{(m)} p_{kj}^{(n)}
\end{align*}$
* The first equation: applying the transition matrix represents a 1-step move in time.
* The second equation: the probability of going from $i$ to $j$ in $m+n$ steps can be broken down by conditioning on it passing $k$ at the $m$th step.
> Proof:
> <u>First equation</u>: induction.
> <u>Second equation</u>: immediate from the first, or use the law of total probability.
- By the C-K equations, Markov chains can be advanced by right-applying the transition matrix: for $n\ge m$, $\lambda(n)=\lambda(m)P^{n-m}$
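* Both equations are easy to check numerically (my sketch, assuming `numpy`; the $3$-state matrix is an arbitrary example):
```python
# Chapman-Kolmogorov check on a small chain: P^(m+n) = P^m P^n,
# and lambda(n) = lambda(0) P^n gives the distribution of X_n.
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.4, 0.6]])
lam0 = np.array([1.0, 0.0, 0.0])  # start in state 0

m, n = 2, 3
lhs = np.linalg.matrix_power(P, m + n)
rhs = np.linalg.matrix_power(P, m) @ np.linalg.matrix_power(P, n)
print(np.allclose(lhs, rhs))                # True
print(lam0 @ np.linalg.matrix_power(P, 5))  # distribution of X_5
```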
## Structures of Markov Chains
### Class Structures
* Two states $i,j$ of a Markov chain **communicate**, denoted $i\leftrightarrow j$ if: $\exists m,n:p_{ij}^{(m)}\ne 0,\, p_{ji}^{(n)}\ne 0$
* *That is, there is a chance to go from $i$ to $j$, and vice versa.*
* As an equivalence relation, communication partitions the state space into **communicating classes**. A Markov chain is **irreducible** if it only has one communicating class.
* A **closed class** $C$ is a communicating class such that: $\forall c \in C, d \notin C, n\in \mathbb{N}:\, \,p_{cd}^{(n)}=0$. If $C$ has only one element, that single state is called an **absorbing state**.
* *That is, it is impossible to leave $C$ once inside it.*
* An **open class** is a communicating class that is not closed.
### Periodicity
* The **period** of a state $i$ is $\gcd \big\{n:p_{ii}^{(n)}>0\big\}$and the state is **aperiodic** if its period is $1$.
* For example, a random walk on a cycle of length $n$ has period $2$ if $n$ is even; if $n$ is odd, the chain has period $1$ and hence is aperiodic.
* <u>The period is a class property</u>: communicating states have the same period.
> Proof: suppose $i \leftrightarrow j$, with $m,n: p_{ij}^{(m)}\ne 0,\, p_{ji}^{(n)}\ne 0$.
> Define $S_{i}=\big\{k:p_{ii}^{(k)}>0\big\}$, similarly for $S_{j}$.
> Then for any common divisor $d$ of $S_{i}$: $p_{ii}^{(m+n)}\ge p_{ij}^{(m)}p_{ji}^{(n)}>0 \Longrightarrow (m+n) \in S_{i},\ d\,|\,(m+n)$ and for any $a \in S_{j}$, $p_{ii}^{(m+a+n)}\ge p_{ij}^{(m)}p_{jj}^{(a)}p_{ji}^{(n)}>0\Longrightarrow (m+n+a) \in S_{i},\ d\,|\,(m+n+a)$. Hence $d\,|\,a$, and $d$ is also a common divisor of the set $S_{j}$.
> By symmetry, $S_{i}, S_{j}$ have the same divisors, hence the same $\mathrm{gcd}$, and so states $i,j$ have the same period.
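* The period can also be computed straight from the definition (a sketch assuming `numpy`, with the gcd truncated at a finite horizon; the deterministic $4$-cycle below is an arbitrary example, so every state has period $4$):
```python
# Period of state i: gcd of all n <= n_max with p_ii^(n) > 0.
import numpy as np
from math import gcd
from functools import reduce

P = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)  # deterministic 4-cycle

def period(P: np.ndarray, i: int, n_max: int = 50) -> int:
    Pn = np.eye(len(P))
    return_times = []
    for n in range(1, n_max + 1):
        Pn = Pn @ P
        if Pn[i, i] > 0:
            return_times.append(n)
    return reduce(gcd, return_times)

print(period(P, 0))  # 4
```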
### Hitting Probability
* The **hitting probability** of a subset $A \subseteq I$ from state $i$ is $h_{i}^A=\mathbb{P}(X_{n} \in A \text{ for some }n \ge 0 \,|\, X_{0}=i)$ and if $A$ is closed, it is also called the **absorption probability**.
* The hitting probabilities $(h_{i}^A)_{i\in I}$ of a Markov chain are the <u>minimal non-negative solution</u> to the equations $h_{i}^A=\begin{cases}
1 &\text{if }i\in A \\
\sum_{j\in I}p_{ij}h_{j}^A & \text{if } i\notin A
\end{cases}$where minimality means that $(h_{i}^A)$ is term-wise no larger than any other solution.
> Proof: we can verify that $(h_{i}^A)$ satisfies the equations by conditioning on $X_{0}=i,X_{1}=j$.
> To prove minimality, take any non-negative solution $(x_{i})$, and we shall show that $\forall i \in I: x_{i}\ge h_{i}^{A}$.
> Define the "probability of hitting $A$ by step $N$": $q_{i}^{(N)}=\mathbb{P}(\exists n\le N: X_{n}\in A\,|\, X_{0}=i)$, so for $i\notin A$, $q_{i}^{(N)}=\sum_{j}p_{ij}q_{j}^{(N-1)}$, and induction on $N$ gives $\forall N: q_{i}^{(N)}\le x_{i}$.
> Hence $\begin{align*}
x_{i}\ge \lim_{ N \to \infty }q_{i}^{(N)}&=\mathbb{P}\Big(\bigcup_{N}\{\exists n\le N: X_{n}\in A\} \,\Big|\, X_{0}=i\Big)
\\&=\mathbb{P}(\exists n: X_{n}\in A\,|\, X_{0}=i)\\
&=h_{i}^{A}
\end{align*}$
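* For a finite chain the equations become a linear system. A sketch (mine, assuming `numpy`) for a gambler's-ruin chain on $\{0,\dots,5\}$ with both barriers absorbing, computing the probability of hitting $A=\{0\}$; minimality forces $h_{5}=0$ since state $5$ never reaches $0$, so both boundary values can be imposed directly:
```python
# Solve h_i = p h_{i+1} + q h_{i-1} for 0 < i < N, with h_0 = 1, h_N = 0.
import numpy as np

p, q, N = 0.6, 0.4, 5
M = np.zeros((N - 1, N - 1))  # unknowns h_1, ..., h_{N-1}
b = np.zeros(N - 1)
for i in range(1, N):
    M[i - 1, i - 1] = 1.0
    if i + 1 < N:
        M[i - 1, i] = -p      # -p * h_{i+1}
    if i - 1 > 0:
        M[i - 1, i - 2] = -q  # -q * h_{i-1}
    else:
        b[i - 1] = q          # boundary term q * h_0 = q

h = np.linalg.solve(M, b)
print(h)  # matches ((q/p)^i - (q/p)^N) / (1 - (q/p)^N)
```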
### Recurrence and Transience
*Big idea: recurrence and transience describe whether a Markov chain will keep returning to a particular state indefinitely.*
* For convenience, we will denote the **return probability** as $r_{i}=\mathbb{P}(\exists n\ge{1}: X_{n}=i\,|\, X_{0}=i)$*but this notation is not standard.*
* A state $i\in I$ is **transient** if $r_{i}<1$, which means that $\mathbb{P}(\text{the chain hits } i \text{ infinitely many times}\,|\, X_0=i)=\lim_{ k \to \infty }r_{i}^k =0$
* A state $i\in I$ is **recurrent** if $r_{i}=1$, which means that $\mathbb{P}(\text{the chain hits } i \text{ infinitely many times}\,|\, X_0=i)=\lim_{ k \to \infty }r_{i}^k =1$
* A state $i$ is recurrent if and only if $\sum_{n\ge 1} p_{ii}^{(n)}=\infty$.
> Proof: the total number of hits has expectation $\sum_{n} \mathbb{E}[\mathbf{1}(X_{n}=i)\,|\, X_{0}=i]=\sum_{n}\mathbb{P}(X_{n}=i\,|\, X_{0}=i)=\sum_{n}p_{ii}^{(n)}$Then (the state is recurrent) $\iff$ (the expected number of hits is infinite) $\iff$ (the sum is infinite).
* <u>Recurrence/transience is a class property</u>: states in the same communicating class are either all recurrent or all transient.
* Recurrent classes are all closed; all finite, closed classes are recurrent.
> The two propositions are proved in problem sheet 3.
### Hitting and Return Times
* The (first) **hitting time** of set $A \subseteq I$ is the random variable $H^A=\inf\{n\ge 0: X_{n}\in A\}$
* The **mean hitting time** of $A$ from $i$ is $k_{i}^{A}=\mathbb{E}[H^{A}\,|\, X_{0}=i]$
* The mean hitting times $(k_{i}^A)_{i\in I}$ of the Markov chain are the minimal non-negative solution to the equations $k_{i}^A=\begin{cases}
0 &\text{ if }i\in A \\
1+\sum_{j}p_{ij}k_{j}^{A} & \text{ if } i\notin A
\end{cases}$
> The proof uses the same strategy as the proof for hitting probabilities.
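* For a finite chain these equations are again a linear system. A sketch (mine, assuming `numpy`): mean time to absorption in $A=\{0,N\}$ for the two-barrier gambler's-ruin chain, where $k_{0}=k_{N}=0$:
```python
# Solve k_i = 1 + p k_{i+1} + q k_{i-1} for 0 < i < N, with k_0 = k_N = 0.
import numpy as np

p, q, N = 0.6, 0.4, 5
M = np.zeros((N - 1, N - 1))  # unknowns k_1, ..., k_{N-1}
b = np.ones(N - 1)            # the "+1" for each step taken
for i in range(1, N):
    M[i - 1, i - 1] = 1.0
    if i + 1 < N:
        M[i - 1, i] = -p      # boundary terms vanish since k_0 = k_N = 0
    if i - 1 > 0:
        M[i - 1, i - 2] = -q

k = np.linalg.solve(M, b)
print(k)  # mean absorption times from states 1, ..., N-1
```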
* The **mean return time** of a state $i$ is $m_{i}=\mathbb{E}[\,\inf\{n\ge 1: X_{n}=i\}\,|\, X_{0}=i\,]$*That is, the expected time to return to the state after starting from it.*
* A recurrent state with a finite mean return time is called **positive recurrent**; else it is **null recurrent**.
* <u>A transient state always has a mean return time of</u> $\infty$, since it has a positive probability of never returning.
### Gambler's Ruin as a Markov Chain
* Model
> Let the gambler's wealth at time $t$ be $X_{t}$, and the winning probability be $p$. The gambler starts with $X_{0}=\textdollar n$.
> The transition probabilities are $p_{ij}=\begin{cases}
p &\text{ if } j=i+1,\ i\ne 0 \\
q=1-p &\text{ if } j=i-1,\ i\ne 0 \\
1 &\text{ if } i=j=0 \\
0 &\text{ otherwise}
\end{cases}$ so state $0$ is absorbing.
* Hitting probabilities
> The probability of hitting zero starting from state $i$, $(h_{i})$, satisfies the equations$\begin{align*}
h_{0}&=1 \\
h_{i\ne 0} &= \sum_{j}p_{ij}h_{j}=ph_{i+1}+qh_{i-1}
\end{align*}$which for $p\ne q$ has the general solution $h_{i}=A+B\big( \frac{q}{p}\big)^i$, where $h_{0}=1$ requires $A+B=1$. (For $p=q$ the general solution is $A+Bi$, and the same argument gives $h_{i}=1$.)
> Then non-negativity and minimality require $(A,B)=\begin{cases}
(1,0) &\text{ if }p\le q \\
(0,1) &\text{ if }p>q \\
\end{cases}$So the hitting probability is $1$ in the first case, and $(q / p)^{i}$ in the second case.
* Transience/Recurrence
> The state $0$ is obviously recurrent, since it is an absorbing state.
>
> The remaining states all communicate, so it suffices to determine transience/recurrence for state $1$. Its return probability satisfies $r_{1}\le p<1$, so it is transient. Hence the entire class $\{1,2,\dots \}$ is transient.
>
> *The lecture notes study the modified chain where $p_{01}>0$, which makes the chain irreducible and lets us find the return probability and mean return time to $0$.*
* Mean hitting time of $0$
> If $p > q$, the hitting probability is $(q / p)^i<1$, so there is a positive probability of never hitting $0$, and the mean hitting time is infinite.
>
> If $p\le q$, the mean hitting times $(k_{i})$ solve the equations $\begin{align*}
k_{0}&=0 \\
k_{i\ne 0} &= 1+\sum_{j}p_{ij}k_{j}=1+pk_{i+1}+qk_{i-1}
\end{align*}$*The lecture notes involve some Olympic-level hand-waving to claim $k_{i}=\beta i$, and the answer follows. To see why, note that $\begin{align*}
k_{i}&= \mathbb{E}[\text{time to hit }(i-1) \text{ from }i]+k_{i-1}\\
&= k_{1} +k_{i-1}
\end{align*}$by the Markov property and translation invariance. Then $\beta=k_{1}$.*
>
> When $p<q$, solving gives $k_{i}=i / (q-p)$.
>
> When $p=q$, any finite $\beta=k_{1}$ fails, hence the mean hitting times $(k_{i})$ must all be infinite.
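* A Monte Carlo sanity check of the hitting probability (a sketch assuming `numpy`; a long finite horizon stands in for "ever", which slightly undercounts):
```python
# With p > q, the probability of ever hitting 0 from wealth i is (q/p)^i.
import numpy as np

rng = np.random.default_rng(3)
p, q, i0 = 0.6, 0.4, 3
trials, horizon = 10_000, 1_000

steps = rng.choice([1, -1], size=(trials, horizon), p=[p, q])
paths = i0 + np.cumsum(steps, axis=1)
hit0 = (paths <= 0).any(axis=1)
print(hit0.mean(), (q / p) ** i0)  # both ≈ 0.296
```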
---
## Equilibrium and Convergence of Markov Chains
* Recall that the distribution of the Markov chain variable $X_{n}$ can be expressed as a row vector $\lambda(n)$, and <u>right-applying the transition matrix $P$ moves the distribution forward in time</u>: for $n\ge m$, $\lambda(n)=\lambda(m)P^{n-m}$. In particular, if $\lambda_{0}$ is the initial distribution, then $X_{n} \sim\lambda_{0} P^n$.
- A distribution $\pi$ is a **stationary distribution** of a Markov chain with transition matrix $P$ if it is a left-eigenvector of $P$ with eigenvalue $1$: $\pi=\pi P$. Note that it is a <u>left</u>-eigenvector, not the usual right-eigenvector.
* *That is, the transition matrix does not change the stationary distribution over time.*
### Convergence Theorems
* <u>Existence and uniqueness of the stationary distribution</u>: if the chain is irreducible, then $\text{the chain is positive-recurrent} \iff\exists!\pi \text{ stationary, where }\,\pi_{i}=\frac{1}{m_{i}}$where $m_{i}$ is the mean return time to state $i$.
* <u>Convergence to the stationary distribution</u>: if $P$ is irreducible and aperiodic with stationary distribution $\pi$, then for any initial distribution and all $i,j$: $\begin{align*}
\mathbb{P}(X_{n}=j)\to \pi_{j}\\
p_{ij}^{(n)}\to \pi_{j}
\end{align*}$
- *Aperiodicity is important, as a periodic chain "remembers" its initial distribution.*
* <u>Ergodic Theorem</u>: for irreducible chains, the proportion of time spent in state $i$ converges to $1 / m_{i}$ almost surely: $\begin{align*}
\frac{\sum_{t=0}^{n-1}\mathbb{1}_{(X_{t}=i)}}{n}\to \frac{1}{m_{i}} \text{ a.s.}
\end{align*}$where for null-recurrent and transient states ($m_{i}=\infty$), we interpret $1 / m_{i}$ as $0$.
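* A numerical illustration of these results (my sketch, assuming `numpy`; the $3$-state chain is the same arbitrary example as before, and it is irreducible and aperiodic):
```python
# Find pi with pi = pi P (left eigenvector for eigenvalue 1), then check
# the ergodic theorem: long-run time fractions approach pi.
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.4, 0.6]])

vals, vecs = np.linalg.eig(P.T)  # left eigenvectors of P
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi /= pi.sum()
print(pi)      # ≈ [0.151, 0.377, 0.472]
print(pi @ P)  # equals pi (stationarity)

rng = np.random.default_rng(4)
x, counts, n = 0, np.zeros(3), 100_000
for _ in range(n):
    x = rng.choice(3, p=P[x])
    counts[x] += 1
print(counts / n)  # ≈ pi, as the ergodic theorem predicts
```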
---