Consider the linear system $X_{t}=A_{t}X_{t-1}+B_{t}u_{t}+W_{t},$ where $X_{t}$ is the random state at time $t$, $u_{t}$ is the controlled (and known) input at time $t$, and $W_{t}$ is unknown white noise, independent of everything else. However, we cannot directly observe $X_{t}$; instead we have observations $Y_{t}$ determined by $Y_{t}=C_{t}X_{t} + D_{t}u_{t} + N_{t},$ where $N_{t}$ is another white noise (e.g. observation error). Assume $A,B,C,D$ are known for all $t$.

> [!tldr]
> The **Kalman filter** is a method for estimating $(X_{t})$, optimal under the $l_{2}$ loss (so it is essentially the posterior mean, as we will see).
>
> It involves two steps: the **prediction** step simply yields $\hat{X}_{t ~|~ t-1}=\mathbb{E}[X_{t}~|~\mathcal{F}_{t-1}]$, where $\mathcal{F}_{t-1}$ is the information available up to time $t-1$ (e.g. $\sigma(Y_{1},\dots,Y_{t-1})$ in measure-theoretic language).
>
> The **update** step computes $\hat{X}_{t}:= \hat{X}_{t~|~t}=\mathbb{E}[X_{t} ~|~ \mathcal{F}_{t}]$, i.e. it incorporates the latest observation $Y_{t}$.

In general, these computations are prohibitive (requiring storing all of $Y_{1},\dots,Y_{t}$ and working with a covariance matrix whose size grows as $O(t^{2})$), but we will see that if we *assume Gaussian errors and drifts (so all $X, Y$ are jointly Gaussian), we can do the computations recursively.*

Let $P_{t ~|~ t-1}$ denote the covariance matrix of $X_{t}$ given $\mathcal{F}_{t-1}$, i.e. $P_{t ~|~ t-1}:= \mathrm{Cov}(X_{t} ~|~ \mathcal{F}_{t-1}):= \mathbb{E}[(X_{t}-\hat{X}_{t ~|~ t-1})(X_{t}-\hat{X}_{t ~|~ t-1})^{T} ~|~ \mathcal{F}_{t-1}],$ where $\hat{X}_{t ~|~ t-1}$ as defined above is the conditional mean of $X_{t} ~|~ \mathcal{F}_{t-1}$. Similarly, we define $P_{t ~|~ t}:= \mathrm{Cov}(X_{t} ~|~ \mathcal{F}_{t})$.

- The covariance $P$ does not appear in the predictions themselves, but it is necessary for the recursive computations.

Schematically, the Kalman filter has the recursive relationship
$\begin{align*} \cdots \xrightarrow{\text{predict}} (\hat{X}_{t ~|~ t-1}, P_{t ~|~ t-1}) \xrightarrow{\text{update}} (\hat{X}_{t ~|~ t}, P_{t ~|~ t} )\xrightarrow{\text{predict}} (\hat{X}_{t+1 ~|~ t}, P_{t+1 ~|~ t}) \xrightarrow{\text{update}} \cdots \end{align*}$

## Prediction Step

For the prediction step, we compute $\mathbb{E}[X_{t} ~|~ \mathcal{F}_{t-1}]$ to be
$\begin{align*} \mathbb{E}[X_{t} ~|~ \mathcal{F}_{t-1}]&= A_{t}\hat{X}_{t-1 ~|~ t-1} + B_{t}u_{t} + \underset{=0,~ W_{t} \perp \mathcal{F}_{t-1}}{\cancel{\mathbb{E}[W_{t} ~|~ \mathcal{F}_{t-1}]}}\\ &= A_{t}\hat{X}_{t-1 ~|~ t-1} + B_{t}u_{t}, \end{align*}$
so we obtain a recursive formula for computing $\hat{X}_{t ~|~ t-1}$. Similarly, we can solve for (not just estimate) $P_{t ~|~ t-1}$:
$\begin{align*} P_{t ~|~ t-1}&:= \mathrm{Cov}(X_{t} ~|~ \mathcal{F}_{t-1})\\ &~= \mathrm{Cov}(A_{t}X_{t-1}+B_{t}u_{t}+W_{t} ~|~ \mathcal{F}_{t-1})\\ &~= A_{t}\mathrm{Cov}(X_{t-1} ~|~ \mathcal{F}_{t-1}) A_{t}^{T} + 0 + \mathrm{Cov}(W_{t})\\ &= A_{t}P_{t-1 ~|~ t-1} A_{t}^{T} + \mathrm{Cov}(W_{t}). \end{align*}$
Hence again, we find a recursive formula for $P_{t ~|~ t-1}$.
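To make the recursion concrete, here is a minimal Python/NumPy sketch of the prediction step. The names are illustrative assumptions rather than notation from the derivation: `Q` stands in for $\mathrm{Cov}(W_{t})$, and `x_post`, `P_post` for $\hat{X}_{t-1 ~|~ t-1}$, $P_{t-1 ~|~ t-1}$.

```python
import numpy as np

def predict(x_post, P_post, A, B, u, Q):
    """One prediction step: (X̂_{t-1|t-1}, P_{t-1|t-1}) -> (X̂_{t|t-1}, P_{t|t-1})."""
    # E[X_t | F_{t-1}] = A_t X̂_{t-1|t-1} + B_t u_t  (the W_t term has mean zero)
    x_prior = A @ x_post + B @ u
    # P_{t|t-1} = A_t P_{t-1|t-1} A_t^T + Cov(W_t)
    P_prior = A @ P_post @ A.T + Q
    return x_prior, P_prior
```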
## Update Step

The update step is harder, and we need the relationship between $Y_{t}$ and $X_{t}$ to find their joint distribution. In particular, we know $Y_{t} ~|~ \mathcal{F}_{t-1}$ is still Gaussian, and using the linear relationship $C_{t}X_{t} + D_{t}u_{t} + N_{t}=Y_{t}$, we find the recursive formulas
$\begin{align*} \mathbb{E}[Y_{t} ~|~ \mathcal{F}_{t-1}]&= C_{t}\hat{X}_{t ~|~ t-1} + D_{t}u_{t},\\ \mathrm{Cov}(Y_{t} ~|~ \mathcal{F}_{t-1}) &= C_{t}P_{t ~|~ t-1} C_{t}^{T} + \mathrm{Cov}(N_{t}),\\ \mathrm{Cov}(X_{t}, Y_{t} ~|~ \mathcal{F}_{t-1})&= \mathrm{Cov}(X_{t}, C_{t}X_{t}+D_{t}u_{t}+N_{t} ~|~ \mathcal{F}_{t-1})\\ &= \mathrm{Cov}(X_{t} ~|~ \mathcal{F}_{t-1})C_{t}^{T} + \mathbf{0} + \mathbf{0}\\ &= P_{t ~|~ t-1} C_{t}^{T}. \end{align*}$

### Updating $\hat{X}$

Now we incorporate the observed value $Y_{t}$, i.e. condition on $\mathcal{F}_{t}$, to get
$\begin{align*} \mathbb{E}[X_{t} ~|~ \mathcal{F}_{t}]&= \mathbb{E}[X_{t} ~|~ \mathcal{F}_{t-1}]\\ &~~~~~+\mathrm{Cov}(X_{t}, Y_{t} ~|~ \mathcal{F}_{t-1}) *\mathrm{Cov}(Y_{t} ~|~ \mathcal{F}_{t-1})^{-1}*(Y_{t}-\mathbb{E}[Y_{t} ~|~ \mathcal{F}_{t-1}]). \end{align*}$
This is the conditional-mean part of the Gaussian conditioning identity (stated in full in the next subsection). Here $*$ is just standard matrix multiplication, included for clarity in the next step; the second term, for scalar-valued Gaussians in simplified notation, is just $\sigma_{X} \cdot \rho_{XY} \cdot (Y- \bar{Y}) / \sigma_{Y}$: the surprise of seeing the deviation $Y-\bar{Y}$, adjusted for $\sigma_{Y}$, then rescaled to its influence on $X$ by $\sigma_{X}\cdot\rho_{XY}$.

Now plugging in the values found above, we have
$\begin{align*} \mathbb{E}[X_{t} ~|~ \mathcal{F}_{t}]= \hat{X}_{t ~|~ t-1} ~+~ &P_{t ~|~ t-1}C_{t}^{T}\\ &*(C_{t}P_{t ~|~ t-1}C_{t}^{T} + \mathrm{Cov}(N_{t}))^{-1}\\ & *(Y_{t}-C_{t}\hat{X}_{t ~|~ t-1}-D_{t}u_{t}), \end{align*}$
where the $*$'s separate the same matrices as above, just with the values substituted. Note that the second term can be written as
$\begin{align*} K_{t} &* \underbrace{(Y_{t}-C_{t}\hat{X}_{t ~|~ t-1}-D_{t}u_{t})}_{\text{residual/prediction error}},\\[0.8em] & \text{where } K_{t}:= P_{t ~|~ t-1}C_{t}^{T}*(C_{t}P_{t ~|~ t-1}C_{t}^{T}+\mathrm{Cov}(N_{t}))^{-1}. \end{align*}$
Hence we call $K_{t}$ defined above the **Kalman gain** matrix, and the update boils down to $\hat{X}_{t~|~t}=\hat{X}_{t ~|~t-1}+K_{t}* \text{residual}.$

### Updating $P$

The important identity of multivariate Gaussians: if $Z=(Z_{1},Z_{2})$ (concatenated vectors) follows $N\left(\mu_{Z},\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}^{T} & \Sigma_{22} \end{pmatrix}\right),$ then $Z_{1} ~|~ Z_{2} \sim N(\mu_{Z_{1} ~|~ Z_{2}}, \Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}^{T}),$ where $\mu_{Z_{1} ~|~ Z_{2}} = \mu_{Z_{1}}+\Sigma_{12}\Sigma_{22}^{-1}(Z_{2}-\mu_{Z_{2}})$ is exactly the conditional mean used in the previous subsection.

Using the identity above, we can find $P_{t ~|~ t}$ to be
$\begin{align*} P_{t ~|~ t}&= \mathrm{Cov}(X_{t} ~|~ \mathcal{F}_{t})\\ &= \mathrm{Cov}(X_{t} ~|~ \mathcal{F}_{t-1}) \\ & ~~~~~~~~- \mathrm{Cov}(X_{t}, Y_{t} ~|~ \mathcal{F}_{t-1})*\mathrm{Cov}(Y_{t} ~|~ \mathcal{F}_{t-1})^{-1}*\mathrm{Cov}(X_{t}, Y_{t}~|~ \mathcal{F}_{t-1})^{T}, \end{align*}$
and plugging in the values gives
$\begin{align*} P_{t~|~t}&= P_{t~|~t-1}-K_{t}*(C_{t}P_{t ~|~ t-1}) \\ &= (I-K_{t}C_{t})P_{t ~|~ t-1}, \end{align*}$
i.e. a decrease proportional to the gain, with $C_{t}$ translating from the scale of the state $X$ to that of the observation $Y$ so the gain can be applied.
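A matching sketch of the update step, continuing the hypothetical NumPy conventions from the prediction sketch above (`R` stands in for $\mathrm{Cov}(N_{t})$); this is an illustration of the formulas, not a reference implementation.

```python
import numpy as np

def update(x_prior, P_prior, y, C, D, u, R):
    """One update step: (X̂_{t|t-1}, P_{t|t-1}) and Y_t -> (X̂_{t|t}, P_{t|t})."""
    residual = y - (C @ x_prior + D @ u)               # Y_t - E[Y_t | F_{t-1}]
    S = C @ P_prior @ C.T + R                          # Cov(Y_t | F_{t-1})
    K = P_prior @ C.T @ np.linalg.inv(S)               # Kalman gain K_t
    x_post = x_prior + K @ residual                    # X̂_{t|t} = X̂_{t|t-1} + K_t * residual
    P_post = (np.eye(len(x_prior)) - K @ C) @ P_prior  # (I - K_t C_t) P_{t|t-1}
    return x_post, P_post
```

Filtering a whole series is then just alternating `predict` and `update` while carrying $(\hat{X}_{t ~|~ t}, P_{t ~|~ t})$ forward, exactly the predict/update chain in the schematic above. (In practice one would solve the linear system involving `S` rather than forming `np.linalg.inv(S)` explicitly; the inverse is used here only because it mirrors the formula for $K_{t}$ most directly.)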