```dataview
table without id
File as "Topics",
dateformat(file.mtime, "yyyy-MM-dd") as "Last Modified"
from ""
FLATTEN "[[" + file.path + "|" + truncate(file.name, 30) + "]]" as File
FLATTEN course as course
where
(course and (course = this.file.link) and (file.name != this.file.name)) and !contains(file.path, "2 - Snippets")
sort file.mtime desc
```
```dataview
table without id
File as "Snippets",
dateformat(file.mtime, "yyyy-MM-dd") as "Last Modified"
from "2 - Snippets"
FLATTEN "[[" + file.path + "|" + truncate(file.name, 30) + "]]" as File
FLATTEN course as course
where
(course and (course = this.file.link) and (file.name != this.file.name))
sort file.mtime desc
```
## Joint Distributions
* The **joint cumulative distribution function** (cdf) of two random variables $X,Y$ is $F_{X,Y}(x,y)=\mathbb P(X\le x,Y\le y)$
* The two variables are **jointly continuous** if this function can be written as a double integral: $\exists f(x,y): F(x,y)=\int_{-\infty}^{y} \int_{-\infty}^{x} f(r,s)\,dr\,ds$ and the function $f$ is the **joint probability density function** (pdf).
* Given the joint pdf, for any $A\subseteq \mathbb R^2$: $\mathbb P\big((X,Y)\in A\big)=\iint_{A}f \, dA$
* The **marginal distribution** of $X$ is obtained by integrating out $Y$: $f_{X}(x)=\int_{-\infty}^\infty f_{X,Y}(x,y) \, dy$
* If variables $X,Y$ are **independent**, then $f_{X,Y}(x,y)=f_{X}(x)f_{Y}(y)$. Hence, given independence, the joint density factors into the product of the marginals.
* Uncorrelatedness alone is not sufficient for independence; multivariate Gaussians are an exception (see the covariance matrix discussion below).
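* A quick numerical sanity check of the marginalization formula (my own sketch, not from the lectures; assumes `numpy`/`scipy`, and the correlated bivariate normal is an arbitrary example): integrating the joint pdf over $y$ recovers the known $N(0,1)$ marginal of $X$.
```python
# Marginalization check: integrate a joint pdf over y to recover the
# marginal of X. Example joint: bivariate normal with correlation rho,
# whose X-marginal is exactly N(0, 1).
import numpy as np
from scipy import integrate
from scipy.stats import norm, multivariate_normal

rho = 0.5
joint = multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]])

x = 0.7
# f_X(x) = integral over y of f_{X,Y}(x, y)
marginal, _ = integrate.quad(lambda y: joint.pdf([x, y]), -np.inf, np.inf)
print(marginal, norm.pdf(x))  # both ≈ 0.312
```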
### Conditional Densities
* We can define the **conditional cdf** for an event $A$: $F_{X|A}(x)=\frac{\mathbb{P}(\{X\le x\} \cap A)}{\mathbb P(A)}$ and the **conditional pdf** $f_{X|A}(x)$ is its derivative.
* Then the conditional probability and expectation are $\begin{align*}
\mathbb{P}(X\in C\,|\,A)&=\int_{C}f_{X|A}(x) \, dx\\
\mathbb{E}(X|A)&=\int x\cdot f_{X|A}(x) \, dx
\end{align*}$
- The **conditional density function** of $Y$ given $X=x$ is: $f_{Y|X=x}(y)=\frac{f_{X,Y}(x,y)}{f_{X}(x)}$ for $x$ with $f_{X}(x)\ne 0$. Hence when $X,Y$ are independent, $f_{Y|X=x}(y)=f_{Y}(y)$: conditioning on $X$ does not affect the distribution of $Y$.
### Transformation Formula in 2D
* Consider the transformation of random variables $X,Y$ under $T(x,y)=\big(U(x,y),V(x,y)\big)$. Suppose $(X,Y)$ has non-zero density on $R \subseteq \mathbb{R}^{2}$.
* The **transformation formula** gives the distribution of $U,V$: in the region $T(R)$, their density is $f_{U,V}(u,v)=f_{X,Y}\Big(x(u,v),y(u,v)\Big)\Bigg|\det\frac{\partial(x,y)}{\partial(u,v)}\Bigg|$ where $\frac{\partial(x,y)}{\partial(u,v)}$ is the **Jacobian (matrix)** $J=\frac{\partial(x,y)}{\partial(u,v)}=\begin{pmatrix}
x_{u} & x_{v} \\
y_{u} & y_{v}
\end{pmatrix}$ and outside $T(R)$, $f_{U,V}(u,v)=0$.
> Proof: for any $A\subseteq R$ we have $T(A)\subseteq T(R)$, so we can write $\mathbb P\big((X,Y)\in A\big)=\mathbb P\big((U,V)\in T(A)\big)$ as a double integral and apply the change-of-variables formula.
* Example: deriving the distribution of the sum of two random variables.
> Transforming from $(X,Y)$ to $(U,V)=(X+Y,Y)$, the Jacobian has determinant $1$; substituting $x=u-v$ gives $f_{X+Y,\,Y}(u,v)=f_{X,Y}(u-v,v)\cdot 1$, so integrating out $v$ gives the marginal density $f_{X+Y}(u)=\int _{-\infty}^\infty f_{X,Y}(u-v,v)\, dv$
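* A numerical check of this convolution formula (my own sketch, assuming `numpy`/`scipy`; iid standard normals are used since their sum is known to be $N(0,2)$):
```python
# Check f_{X+Y}(u) = integral of f_{X,Y}(u - v, v) dv for X, Y iid N(0,1);
# by independence the joint pdf factors, and the sum should be N(0, 2).
import numpy as np
from scipy import integrate
from scipy.stats import norm

f = norm.pdf  # f_{X,Y}(x, y) = f(x) f(y) by independence

u = 1.3
val, _ = integrate.quad(lambda v: f(u - v) * f(v), -np.inf, np.inf)
print(val, norm.pdf(u, scale=np.sqrt(2)))  # both ≈ 0.1849
```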
### Transformations and Gaussians in Higher Dimensions
* The same formula holds for higher dimensions, where the Jacobian has the entries defined as $\Bigg( \frac{\partial(x_{1},\dots,x_{n})}{\partial(y_{1},\dots, y_{n})}\Bigg )_{i,j}=\frac{\partial x_{i}}{\partial y_{j}}$
* For invertible linear transformations $\mathbf{W}=A\mathbf{Z}+\mathbf{\mu}$, the Jacobian $\partial \mathbf{z}/\partial \mathbf{w}$ is just $A^{-1}$, hence $f_{\mathbf{W}}(\mathbf{w})=\frac{f_{\mathbf{Z}}(\mathbf{z}(\mathbf{w}))}{|\det A|}$
* For example, given iid. standard normal variables $\mathbf{Z}$, the transformed $\mathbf{W}=A\mathbf{Z}+\underline{\mu}$ would have distribution $f_{\mathbf{W}}(\mathbf{w})=\frac{1}{(2\pi)^{n/2}|\det A|}\exp\left( -\frac{1}{2}(\mathbf{ w-\underline{\mu}})^T(AA^T)^{-1}(\mathbf{ w-\underline{\mu}}) \right)$*(You probably don't need to memorize this...)* so they follow the multivariate normal distribution $N(\underline{\mu},A A^T)$, where $A A^{T}$ is the covariance matrix.
* The **covariance matrix** of random variables $X_{1},\dots,X_{n}$ is the matrix $\Sigma_{n\times n}$ where $(\Sigma)_{ij}=\mathrm{Cov}(X_{i},X_{j})$. So independent variables have a diagonal covariance matrix. While the converse is not true in general, it does hold for Gaussians...
* If the multivariate Gaussian $\mathbf{W}=(W_{1},\dots,W_{n})=A\mathbf{Z}+\underline{\mu}$ has $A$ orthogonal, i.e. $A A^{T}=I$, then its pdf can be factored as $f_{\mathbf{W}}(\mathbf{w})=\prod_{i=1}^{n}f_{W_{i}}(w_{i})$, so $W_{1},\dots,W_{n}$ are indeed independent.
* Strictly, it is $A$ (not the covariance matrix) that is called "orthogonal"; a similar result holds for $AA^{T}=\mathrm{diag}(\dots)$, but the algebra is even messier.
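* An empirical sanity check of the covariance claim (a sketch of mine, assuming `numpy`; $A$ and $\mu$ below are arbitrary examples): samples of $\mathbf{W}=A\mathbf{Z}+\mu$ should have covariance close to $AA^{T}$.
```python
# W = A Z + mu with Z iid standard normal should have covariance A A^T.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.0], [1.0, 1.5]])
mu = np.array([1.0, -2.0])

Z = rng.standard_normal((100_000, 2))
W = Z @ A.T + mu  # each row is one sample A z + mu

print(np.cov(W, rowvar=False))  # ≈ A @ A.T
print(A @ A.T)                  # [[4, 2], [2, 3.25]]
```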
### A Random Section on Constructing Normal Variables
* Goal: construct two random normal variables $X,Y$ following ${N}(m,s^{2})$ and ${N}(\mu,\sigma^{2})$ respectively (so $s,\sigma$ are standard deviations), having a correlation of $\rho$.
> Idea: decompose $Y$ into a $Z_{1}$ that it shares with $X$, plus an independent part $Z_{2}$, with $Z_{1,2}$ iid. $N(0,1)$: $\begin{align*}
X &= sZ_{1}+m\\
Y &= c_{1} Z_{1} + c_{2}Z_{2} + \mu
\end{align*}$The coefficient $c_{1}$ of $Z_{1}$ would control the correlation, and we adjust the coefficient $c_{2}$ of $Z_{2}$ to make sure the variance of $Y$ is correct.
* Step 1: construct $X$ with $Z_{1}$.
> To have $X\sim N(m,s^{2}),$ we just choose $Z_{1}\sim N(0,1),\, X=sZ_{1}+m$.
* Step 2: fit $Z_{1}$ into $Y$ to get the correlation right.
> Since $Z_{1}$ is independent of $Z_{2}$ (and $\mu$ is a constant), its coefficient $c_{1}$ determines the correlation: $\text{corr}(X,Y)=\rho=c_{1} / \sigma$, so we can set $c_{1}=\sigma \rho$.
* Step 3: adjust the coefficient of $Z_{2}$ to fix the variance of $Y$.
> $\mathrm{Var}(Y)=c_{1}^{2}+c_{2}^{2}=\sigma^{2}$, so solving gives $c_{2}=\sigma \sqrt{ 1-\rho^{2} }$.
> Since $Z_{1},Z_{2}$ are independent, adding this term does not change the correlation.
* Plugging in the coefficients, $\begin{align*}
X&=sZ_{1}+m\\
Y&= \sigma \rho Z_{1}+ \sigma \sqrt{ 1-\rho^{2} }Z_{2} + \mu &&\text{where } Z_{1},Z_{2} \sim N(0,1) \text{ iid.}
\end{align*}$has the desired correlation and variances.
* To tie back to conditional distributions, writing $Y$ in terms of $X$ gives the conditional distribution of $Y$ given $X=x$.
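* The recipe is easy to verify empirically (a sketch assuming `numpy`; the parameter values are arbitrary examples):
```python
# Construct X ~ N(m, s^2), Y ~ N(mu, sigma^2) with correlation rho
# from two iid N(0,1) samples, following the recipe above.
import numpy as np

rng = np.random.default_rng(1)
m, s, mu, sigma, rho = 1.0, 2.0, -3.0, 0.5, 0.8

z1 = rng.standard_normal(100_000)
z2 = rng.standard_normal(100_000)
x = s * z1 + m
y = sigma * rho * z1 + sigma * np.sqrt(1 - rho**2) * z2 + mu

print(np.corrcoef(x, y)[0, 1])  # ≈ 0.8
print(x.std(), y.std())         # ≈ 2.0, 0.5
```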
---
## Markov Chains
* Consider a sequence of random variables $(X_{n})$ taking values in a countable set $I$ called **the state space**.
* Usually the sequence is a change over time, e.g. the location of an elevator at time $n$, and the state space is the set of floors $\{0,1,2,\dots k\}$
* In some cases it can use a more abstract sense of time, e.g. the $n$th character in an email, with state space $\{ a,\dots, z\}$.
* If we enumerate the states as $I=(I_{1},I_{2},\dots)$, the distribution of $X_{n}$ can be written as a row vector $\lambda(n)$, where $(\lambda (n))_{i}=\mathbb{P}(X_{n}=I_{i})$.
* In particular, $X_{0} \sim \lambda_{0}$ is the **initial distribution**.
### Homogeneous Markov Chains
*Big idea: homogeneous Markov chains are memoryless random processes.*
* A sequence of random variables $X=(X_{0},X_{1},\dots)$ all taking values in some state space $I$ is called a **Markov chain** if for any $k\ge 1$ and any states $i_{0},\dots,i_{k}$: $\mathbb{P}(X_{k}=i_{k}|X_{k-1}=i_{k-1})= \mathbb{P}(X_{k}=i_{k}|X_{k-1}=i_{k-1},\dots,X_{0}=i_{0})$*That is, the next state of the chain $X_{k}$ only depends on the present state $X_{k-1}$, not the past.*
* A Markov chain is **time-homogeneous** if the **transition probability** $\mathbb{P}(X_{k}=j|X_{k-1}=i)=p_{ij}$ is constant, i.e. independent of $k$. <u>All chains in this course are homogeneous.</u>
* *Hence transition probability $p_{ij}$ is the probability of going from state $i$ to state $j$.*
* Homogeneous Markov chains satisfy the **Markov property**: for subsets $A_{1},\dots,A_{m+n}\subseteq I$ and $i \in I$, $\begin{align*}
\mathbb{P}(X_{m+1}\in A_{m+1},&\dots, X_{m+n}\in A_{m+n}|X_{m}=i, X_{m-1}\in A_{m-1},\dots, X_{1}\in A_{1})\\ \\
&=\mathbb{P}(X_{m+1}\in A_{m+1},\dots, X_{m+n}\in A_{m+n}|X_{m}=i) \\\\
&=\mathbb{P}(X_{1}\in A_{m+1},\dots, X_{n}\in A_{m+n}|X_{0}=i)
\end{align*}$
- *The first equality: Markov chains do not remember what happened before the present.*
- *The second equality: homogeneous Markov chains are **translation-invariant**.*
- A sufficient criterion for a Markov chain: a sequence of random variables $Y=(Y_{0},Y_{1},\dots)$ is a Markov chain if there exist a function $f$ and a sequence $X=(X_{1},X_{2},\dots)$ such that $Y_{n+1}=f(Y_{n},X_{n+1})$, with $X_{n+1}$ independent of $Y_{0},\dots,Y_{n}$ (see the sketch below).
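* For instance, a simple random walk satisfies this criterion with $f(y,x)=y+x$ and iid $\pm 1$ steps. A minimal simulation sketch (mine, assuming `numpy`):
```python
# A simple random walk is Markov: Y_{n+1} = f(Y_n, X_{n+1}) = Y_n + X_{n+1},
# where each step X_{n+1} is a fresh +/-1 draw independent of the past.
import numpy as np

rng = np.random.default_rng(2)

def simulate_walk(n_steps: int, y0: int = 0) -> np.ndarray:
    steps = rng.choice([-1, 1], size=n_steps)  # X_1, ..., X_n, iid
    return y0 + np.concatenate(([0], np.cumsum(steps)))  # Y_0, ..., Y_n

print(simulate_walk(10))
```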
### N-Step Transition Probabilities
* The **$n$-step transition probability** is defined to be $p_{ij}^{(n)}=\mathbb{P}(X_{n}=j|X_{0}=i)$
* For homogeneous chains, the **transition matrix** $P\in \mathbb{R}^{|I|\times|I|}:\, P_{ij}=p_{ij}$ records the **transition probabilities**. The sequence $X$ can be described as $X\sim \text{Markov}(\lambda_{0},P)$.
* The matrix can be (countably) infinite if $I$ is.
* The **Chapman-Kolmogorov equations** describe $n$-step transitions via the transition matrix: for all homogeneous Markov chains, $\begin{align*}
p_{ij}^{(n)}&=(P^n)_{ij}\\
p_{ij}^{(m+n)}&=(P^mP^n)_{ij}=\sum_{k \in I} p_{ik}^{(m)} p_{kj}^{(n)}
\end{align*}$
* The first equation: applying the transition matrix represents a 1-step move in time.
* The second equation: the probability of going from $i$ to $j$ in $m+n$ steps can be broken down by conditioning on it passing $k$ at the $m$th step.
> Proof:
> <u>First equation</u>: induction.
> <u>Second equation</u>: immediate from the first, or use the law of total probability.
- By the C-K equations, Markov chains can be advanced by right-applying the transition matrix: for $n\ge m$, $\lambda(n)=\lambda(m)P^{n-m}$
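* Both equations are easy to check numerically (my sketch, assuming `numpy`; the $3$-state matrix is an arbitrary example):
```python
# Chapman-Kolmogorov check on a small chain: P^(m+n) = P^m P^n,
# and lambda(n) = lambda(0) P^n gives the distribution of X_n.
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.4, 0.6]])
lam0 = np.array([1.0, 0.0, 0.0])  # start in state 0

m, n = 2, 3
lhs = np.linalg.matrix_power(P, m + n)
rhs = np.linalg.matrix_power(P, m) @ np.linalg.matrix_power(P, n)
print(np.allclose(lhs, rhs))                # True
print(lam0 @ np.linalg.matrix_power(P, 5))  # distribution of X_5
```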
## Structures of Markov Chains
### Class Structures
* Two states $i,j$ of a Markov chain **communicate**, denoted $i\leftrightarrow j$ if: $\exists m,n:p_{ij}^{(m)}\ne 0,\, p_{ji}^{(n)}\ne 0$
* *That is, there is a chance to go from $i$ to $j$, and vice versa.*
* As an equivalence relation, communication partitions the state space into **communicating classes**. A Markov chain is **irreducible** if it only has one communicating class.
* A **closed class** $C$ is a communicating class such that: $\forall c \in C, d \notin C, n\in \mathbb{N}:\, \,p_{cd}^{(n)}=0$. If $C$ has only one element, that single state is called an **absorbing state**.
* *That is, it is impossible to leave $C$ once inside it.*
* An **open class** is a communicating class that is not closed.
### Periodicity
* The **period** of a state $i$ is $\gcd \big\{n:p_{ii}^{(n)}>0\big\}$and the state is **aperiodic** if its period is $1$.
* For example, a random walk on a cycle of length $n$ has period $2$ if $n$ is even; if $n$ is odd, the chain has period $1$ and hence is aperiodic.
* <u>The period is a class property</u>: communicating states have the same period.
> Proof: suppose $i \leftrightarrow j$, with $m,n: p_{ij}^{(m)}\ne 0,\, p_{ji}^{(n)}\ne 0$.
> Define $S_{i}=\big\{k:p_{ii}^{(k)}>0\big\}$, similarly for $S_{j}$.
> Then for any common divisor $d$ of $S_{i}$: $p_{ii}^{(m+n)}\ge p_{ij}^{(m)}p_{ji}^{(n)}>0 \Longrightarrow (m+n) \in S_{i},\ d\,|\,(m+n)$ and for any $a \in S_{j}$, $p_{ii}^{(m+a+n)}\ge p_{ij}^{(m)}p_{jj}^{(a)}p_{ji}^{(n)}>0\Longrightarrow (m+n+a) \in S_{i},\ d\,|\,(m+n+a)$. Hence $d\,|\,a$, and $d$ is also a common divisor of the set $S_{j}$.
> By symmetry, $S_{i}, S_{j}$ have the same divisors, hence the same $\mathrm{gcd}$, and so states $i,j$ have the same period.
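* The period can also be computed straight from the definition (a sketch assuming `numpy`, with the gcd truncated at a finite horizon; the deterministic $4$-cycle below is an arbitrary example, so every state has period $4$):
```python
# Period of state i: gcd of all n <= n_max with p_ii^(n) > 0.
import numpy as np
from math import gcd
from functools import reduce

P = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)  # deterministic 4-cycle

def period(P: np.ndarray, i: int, n_max: int = 50) -> int:
    Pn = np.eye(len(P))
    return_times = []
    for n in range(1, n_max + 1):
        Pn = Pn @ P
        if Pn[i, i] > 0:
            return_times.append(n)
    return reduce(gcd, return_times)

print(period(P, 0))  # 4
```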
### Hitting Probability
* The **hitting probability** of a subset $A \subseteq I$ from state $i$ is $h_{i}^A=\mathbb{P}(X_{n} \in A \text{ for some }n \ge 0 \,|\, X_{0}=i)$ and if $A$ is closed, it is also called the **absorption probability**.
* The hitting probabilities $(h_{i}^A)_{i\in I}$ of a Markov chain are the <u>minimal non-negative solution</u> to the equations $h_{i}^A=\begin{cases}
1 &\text{if }i\in A \\
\sum_{j\in I}p_{ij}h_{j}^A & \text{if } i\notin A
\end{cases}$where minimality means that $(h_{i}^A)$ is term-wise no larger than any other solution.
> Proof: we can verify that $(h_{i}^A)$ satisfies the equations by conditioning on $X_{0}=i,X_{1}=j$.
> To prove minimality, take any non-negative solution $(x_{i})$, and we shall show that $\forall i \in I: x_{i}\ge h_{i}^{A}$.
> Define the "probability of hitting $A$ by step $N$": $q_{i}^{(N)}=\mathbb{P}(\exists n\le N: X_{n}\in A\,|\, X_{0}=i)$, so for $i\notin A$, $q_{i}^{(N)}=\sum_{j}p_{ij}q_{j}^{(N-1)}$, and induction on $N$ gives $\forall N: q_{i}^{(N)}\le x_{i}$.
> Hence $\begin{align*}
x_{i}\ge \lim_{ N \to \infty }q_{i}^{(N)}&=\mathbb{P}\Big(\bigcup_{N}\{\exists n\le N: X_{n}\in A\} \,\Big|\, X_{0}=i\Big)
\\&=\mathbb{P}(\exists n: X_{n}\in A\,|\, X_{0}=i)\\
&=h_{i}^{A}
\end{align*}$
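* For a finite chain the equations become a linear system. A sketch (mine, assuming `numpy`) for a gambler's-ruin chain on $\{0,\dots,5\}$ with both barriers absorbing, computing the probability of hitting $A=\{0\}$; minimality forces $h_{5}=0$ since state $5$ never reaches $0$, so both boundary values can be imposed directly:
```python
# Solve h_i = p h_{i+1} + q h_{i-1} for 0 < i < N, with h_0 = 1, h_N = 0.
import numpy as np

p, q, N = 0.6, 0.4, 5
M = np.zeros((N - 1, N - 1))  # unknowns h_1, ..., h_{N-1}
b = np.zeros(N - 1)
for i in range(1, N):
    M[i - 1, i - 1] = 1.0
    if i + 1 < N:
        M[i - 1, i] = -p      # -p * h_{i+1}
    if i - 1 > 0:
        M[i - 1, i - 2] = -q  # -q * h_{i-1}
    else:
        b[i - 1] = q          # boundary term q * h_0 = q

h = np.linalg.solve(M, b)
print(h)  # matches ((q/p)^i - (q/p)^N) / (1 - (q/p)^N)
```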
### Recurrence and Transience
*Big idea: recurrence and transience describe whether a Markov chain will keep returning to a particular state indefinitely.*
* For convenience, we will denote the **return probability** as $r_{i}=\mathbb{P}(\exists n\ge{1}: X_{n}=i\,|\, X_{0}=i)$*but this notation is not standard.*
* A state $i\in I$ is **transient** if $r_{i}<1$, which means that $\mathbb{P}(\text{the chain hits } i \text{ infinitely many times}\,|\, X_0=i)=\lim_{ k \to \infty }r_{i}^k =0$
* A state $i\in I$ is **recurrent** if $r_{i}=1$, which means that $\mathbb{P}(\text{the chain hits } i \text{ infinitely many times}\,|\, X_0=i)=\lim_{ k \to \infty }r_{i}^k =1$
* A state $i$ is recurrent if and only if $\sum_{n\ge 1} p_{ii}^{(n)}=\infty$.
> Proof: the total number of hits has expectation $\sum_{n} \mathbb{E}[\mathbf{1}(X_{n}=i)\,|\, X_{0}=i]=\sum_{n}\mathbb{P}(X_{n}=i\,|\, X_{0}=i)=\sum_{n}p_{ii}^{(n)}$Then (the state is recurrent) $\iff$ (the expected number of hits is infinite) $\iff$ (the sum is infinite).
* <u>Recurrence/transience is a class property</u>: states in the same communicating class are either all recurrent or all transient.
* Recurrent classes are all closed; all finite, closed classes are recurrent.
> The two propositions are proved in problem sheet 3.
### Hitting and Return Times
* The (first) **hitting time** of set $A \subseteq I$ is the random variable $H^A=\inf\{n\ge 0: X_{n}\in A\}$
* The **mean hitting time** of $A$ from $i$ is $k_{i}^{A}=\mathbb{E}[H^{A}\,|\, X_{0}=i]$
* The mean hitting times $(k_{i}^A)_{i\in I}$ of the Markov chain are the minimal non-negative solution to the equations $k_{i}^A=\begin{cases}
0 &\text{ if }i\in A \\
1+\sum_{j}p_{ij}k_{j}^{A} & \text{ if } i\notin A
\end{cases}$
> The proof uses the same strategy as the proof for hitting probabilities.
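* For a finite chain these equations are again a linear system. A sketch (mine, assuming `numpy`): mean time to absorption in $A=\{0,N\}$ for the two-barrier gambler's-ruin chain, where $k_{0}=k_{N}=0$:
```python
# Solve k_i = 1 + p k_{i+1} + q k_{i-1} for 0 < i < N, with k_0 = k_N = 0.
import numpy as np

p, q, N = 0.6, 0.4, 5
M = np.zeros((N - 1, N - 1))  # unknowns k_1, ..., k_{N-1}
b = np.ones(N - 1)            # the "+1" for each step taken
for i in range(1, N):
    M[i - 1, i - 1] = 1.0
    if i + 1 < N:
        M[i - 1, i] = -p      # boundary terms vanish since k_0 = k_N = 0
    if i - 1 > 0:
        M[i - 1, i - 2] = -q

k = np.linalg.solve(M, b)
print(k)  # mean absorption times from states 1, ..., N-1
```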
* The **mean return time** of a state $i$ is $m_{i}=\mathbb{E}[\,\inf\{n\ge 1: X_{n}=i\}\,|\, X_{0}=i\,]$*That is, the expected time to return to the state after starting from it.*
* A recurrent state with a finite mean return time is called **positive recurrent**; else it is **null recurrent**.
* <u>A transient state always has a mean return time of</u> $\infty$, since it has a positive probability of never returning.
### Gambler's Ruin as a Markov Chain
* Model
> Let the gambler's wealth at time $t$ be $X_{t}$, and the winning probability be $p$. The gambler starts with $X_{0}=\textdollar n$.
> The transition probabilities are $p_{ij}=\begin{cases}
p &\text{ if } j=i+1,\ i\ne 0 \\
q=1-p &\text{ if } j=i-1,\ i\ne 0 \\
1 &\text{ if } i=j=0 \\
0 &\text{ otherwise}
\end{cases}$ so state $0$ is absorbing.
* Hitting probabilities
> The probability of hitting zero starting from state $i$, $(h_{i})$, satisfies the equations$\begin{align*}
h_{0}&=1 \\
h_{i\ne 0} &= \sum_{j}p_{ij}h_{j}=ph_{i+1}+qh_{i-1}
\end{align*}$which for $p\ne q$ has the general solution $h_{i}=A+B\big( \frac{q}{p}\big)^i$, where $h_{0}=1$ requires $A+B=1$. (For $p=q$ the general solution is $A+Bi$, and the same argument gives $h_{i}=1$.)
> Then non-negativity and minimality require $(A,B)=\begin{cases}
(1,0) &\text{ if }p\le q \\
(0,1) &\text{ if }p>q \\
\end{cases}$So the hitting probability is $1$ in the first case, and $(q / p)^{i}$ in the second case.
* Transience/Recurrence
> The state $0$ is obviously recurrent, since it is an absorbing state.
>
> The remaining states all communicate, so it suffices to determine transience/recurrence for state $1$. Its return probability satisfies $r_{1}\le p<1$, so it is transient. Hence the entire class $\{1,2,\dots \}$ is transient.
>
> *The lecture notes study the modified chain where $p_{01}>0$, which makes the chain irreducible and lets us find the return probability and mean return time to $0$.*
* Mean hitting time of $0$
> If $p > q$, the hitting probability is $(q / p)^i<1$, so there is a positive probability of never hitting $0$, and the mean hitting time is infinite.
>
> If $p\le q$, the mean hitting times $(k_{i})$ solve the equations $\begin{align*}
k_{0}&=0 \\
k_{i\ne 0} &= 1+\sum_{j}p_{ij}k_{j}=1+pk_{i+1}+qk_{i-1}
\end{align*}$*The lecture notes involve some Olympic-level hand-waving to claim $k_{i}=\beta i$, and the answer follows. To see why, note that $\begin{align*}
k_{i}&= \mathbb{E}[\text{time to hit }(i-1) \text{ from }i]+k_{i-1}\\
&= k_{1} +k_{i-1}
\end{align*}$by the Markov property and translation invariance. Then $\beta=k_{1}$.*
>
> When $p<q$, solving gives $k_{i}=i / (q-p)$.
>
> When $p=q$, any finite $\beta=k_{1}$ fails, hence the mean hitting times $(k_{i})$ must all be infinite.
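* A Monte Carlo sanity check of the hitting probability (a sketch assuming `numpy`; a long finite horizon stands in for "ever", which slightly undercounts):
```python
# With p > q, the probability of ever hitting 0 from wealth i is (q/p)^i.
import numpy as np

rng = np.random.default_rng(3)
p, q, i0 = 0.6, 0.4, 3
trials, horizon = 10_000, 1_000

steps = rng.choice([1, -1], size=(trials, horizon), p=[p, q])
paths = i0 + np.cumsum(steps, axis=1)
hit0 = (paths <= 0).any(axis=1)
print(hit0.mean(), (q / p) ** i0)  # both ≈ 0.296
```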
---
## Equilibrium and Convergence of Markov Chains
* Recall that the distribution of the Markov chain variable $X_{n}$ can be expressed as a row vector $\lambda(n)$, and <u>right-applying the transition matrix $P$ moves the distribution forward in time</u>: for $n\ge m$, $\lambda(n)=\lambda(m)P^{n-m}$. In particular, if $\lambda_{0}$ is the initial distribution, then $X_{n} \sim\lambda_{0} P^n$.
- A distribution $\pi$ is a **stationary distribution** of a Markov chain with transition matrix $P$ if it is a left-eigenvector of $P$ with eigenvalue $1$: $\pi=\pi P$. Note that it is a <u>left</u>-eigenvector, not the usual right-eigenvector.
* *That is, the transition matrix does not change the stationary distribution over time.*
### Convergence Theorems
* <u>Existence and uniqueness of the stationary distribution</u>: if the chain is irreducible, then $\text{the chain is positive-recurrent} \iff\exists!\pi \text{ stationary, where }\,\pi_{i}=\frac{1}{m_{i}}$where $m_{i}$ is the mean return time to state $i$.
* <u>Convergence to the stationary distribution</u>: if $P$ is irreducible and aperiodic with stationary distribution $\pi$, then for any initial distribution and all $i,j$: $\begin{align*}
\mathbb{P}(X_{n}=j)\to \pi_{j}\\
p_{ij}^{(n)}\to \pi_{j}
\end{align*}$
- *Aperiodicity is important, as a periodic chain "remembers" its initial distribution.*
* <u>Ergodic Theorem</u>: for irreducible chains, the proportion of time spent in state $i$ converges to $1 / m_{i}$ almost surely: $\begin{align*}
\frac{\sum_{t=0}^{n-1}\mathbb{1}_{(X_{t}=i)}}{n}\to \frac{1}{m_{i}} \text{ a.s.}
\end{align*}$where for null-recurrent and transient states ($m_{i}=\infty$), we interpret $1 / m_{i}$ as $0$.
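* A numerical illustration of these results (my sketch, assuming `numpy`; the $3$-state chain is the same arbitrary example as before, and it is irreducible and aperiodic):
```python
# Find pi with pi = pi P (left eigenvector for eigenvalue 1), then check
# the ergodic theorem: long-run time fractions approach pi.
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.4, 0.6]])

vals, vecs = np.linalg.eig(P.T)  # left eigenvectors of P
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi /= pi.sum()
print(pi)      # ≈ [0.151, 0.377, 0.472]
print(pi @ P)  # equals pi (stationarity)

rng = np.random.default_rng(4)
x, counts, n = 0, np.zeros(3), 100_000
for _ in range(n):
    x = rng.choice(3, p=P[x])
    counts[x] += 1
print(counts / n)  # ≈ pi, as the ergodic theorem predicts
```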
---