Consider the linear system $X_{t}=A_{t}X_{t-1}+B_{t}u_{t}+W_{t},$ where $X_{t}$ is the random state at time $t$, $u_{t}$ is the controlled (and known) input at time $t$, and $W_{t}$ is unknown white noise, independent of everything else.
However, we cannot directly observe $X_{t}$, and instead have observations $Y_{t}$ determined by
$Y_{t}=C_{t}X_{t} + D_{t}u_{t} + N_{t},$
where $N_{t}$ is another white noise (e.g. observation error). Assume $A,B,C,D$ are known for all $t$.
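As a concrete illustration of this setup, here is a minimal Python/numpy sketch that simulates a made-up 2D constant-velocity model, where the state is (position, velocity) and only a noisy position is observed. All matrices, noise levels, and the helper name `simulate` are illustrative assumptions, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (made-up) constant-velocity model: state X_t = (position, velocity).
dt = 1.0
A = np.array([[1.0, dt],
              [0.0, 1.0]])   # A_t (taken constant in t for simplicity)
B = np.array([[0.0],
              [dt]])         # B_t: a scalar control u_t acts on the velocity
C = np.array([[1.0, 0.0]])   # C_t: only the position is observed
D = np.zeros((1, 1))         # D_t
Q = 0.01 * np.eye(2)         # Cov(W_t): process noise
R = np.array([[0.25]])       # Cov(N_t): observation noise

def simulate(T, u):
    """Draw one trajectory (X_1..X_T, Y_1..Y_T) from the state-space model."""
    X, Y = np.zeros((T, 2)), np.zeros((T, 1))
    x = np.zeros(2)  # X_0
    for t in range(T):
        x = A @ x + B @ u[t] + rng.multivariate_normal(np.zeros(2), Q)     # X_t
        Y[t] = C @ x + D @ u[t] + rng.multivariate_normal(np.zeros(1), R)  # Y_t
        X[t] = x
    return X, Y

X_true, Y_obs = simulate(T=50, u=np.zeros((50, 1)))
```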
> [!tldr]
> The **Kalman filter** is a method for estimating $(X_{t})$, optimal under the $l_{2}$ loss (so it is essentially the posterior mean, as we will see).
>
> It involves two steps: the **prediction** step simply yields $\hat{X}_{t ~|~ t-1}=\mathbb{E}[X_{t}~|~\mathcal{F}_{t-1}]$, where $\mathcal{F}_{t-1}$ is the information available till time $t-1$ (e.g. $\sigma(Y_{1},\dots,Y_{t-1})$ in measure theory language).
>
> The **update** step updates $\hat{X}_{t}:= \hat{X}_{t~|~t}=\mathbb{E}[X_{t} ~|~ \mathcal{F}_{t}]$, i.e. incorporating the latest observation $Y_{t}$.
In general, those computations are prohibitive (requiring storing all of $Y_{1},\dots,Y_{t}$ and computing a covariance matrix of size $O(t^{2})$), but we will see that if we *assume Gaussian errors and drifts (so all $X, Y$ are jointly Gaussian), we can do the computations recursively.*
Let $P_{t ~|~ t-1}$ denote the covariance matrix of $X_{t}$ given $\mathcal{F}_{t-1}$, i.e. $P_{t ~|~ t-1}:= \mathrm{Cov}(X_{t} ~|~ \mathcal{F}_{t-1}):= \mathbb{E}[(X_{t}-\hat{X}_{t ~|~ t-1})(X_{t}-\hat{X}_{t ~|~ t-1})^{T} ~|~ \mathcal{F}_{t-1}],$ where $\hat{X}_{t ~|~ t-1}$ as defined above is the conditional mean of $X_{t}$ given $\mathcal{F}_{t-1}$. Similarly, we define $P_{t ~|~ t}:= \mathrm{Cov}(X_{t} ~|~ \mathcal{F}_{t})$.
- The covariance $P$ does not appear in the predictions themselves, but it is necessary for the recursive computations.
Schematically, the Kalman filter has the recursive relationship
$\begin{align*} \cdots \xrightarrow{\text{predict}}
(\hat{X}_{t ~|~ t-1}, P_{t ~|~ t-1}) \xrightarrow{\text{update}} (\hat{X}_{t ~|~ t}, P_{t ~|~ t} )\xrightarrow{\text{predict}} (\hat{X}_{t+1 ~|~ t}, P_{t+1 ~|~ t}) \xrightarrow{\text{update}} \cdots
\end{align*}$
## Prediction Step
For the prediction step, we compute $\mathbb{E}[X_{t} ~|~ \mathcal{F}_{t-1}]$ to be
$
\begin{align*}
\mathbb{E}[X_{t} ~|~ \mathcal{F}_{t-1}]&= A_{t}\hat{X}_{t-1 ~|~ t-1} + B_{t}u_{t} + \underset{=0,~ W_{t} \perp \mathcal{F}_{t-1}}{\cancel{\mathbb{E}[W_{t} ~|~ \mathcal{F}_{t-1}]}}\\
&= A_{t}\hat{X}_{t-1 ~|~ t-1} + B_{t}u_{t},
\end{align*}$so we obtain a recursive formula for computing $\hat{X}_{t ~|~ t-1}$.
Similarly, we can compute exactly (not just estimate) $P_{t ~|~ t-1}$:
$\begin{align*}
P_{t ~|~ t-1}&:= \mathrm{Cov}(X_{t} ~|~ \mathcal{F}_{t-1})\\
&~= \mathrm{Cov}(A_{t}X_{t-1}+B_{t}u_{t}+W_{t} ~|~ \mathcal{F}_{t-1})\\
&~= A_{t}\mathrm{Cov}(X_{t-1} ~|~ \mathcal{F}_{t-1}) A_{t}^{T} + 0 + \mathrm{Cov}(W_{t})\\
&= A_{t}P_{t-1 ~|~ t-1} A_{t}^{T} + \mathrm{Cov}(W_{t})
\end{align*}$
Hence again, we find a recursive formula for $P_{t ~|~ t-1}$.
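Continuing the hypothetical numpy setup above, the whole prediction step is a two-line sketch (the helper name `predict` is just an assumption for illustration):

```python
import numpy as np

def predict(x_prev, P_prev, A_t, B_t, u_t, Q_t):
    """Prediction step: (x_{t-1|t-1}, P_{t-1|t-1}) -> (x_{t|t-1}, P_{t|t-1})."""
    x_pred = A_t @ x_prev + B_t @ u_t       # E[X_t | F_{t-1}]
    P_pred = A_t @ P_prev @ A_t.T + Q_t     # Cov(X_t | F_{t-1}), with Q_t = Cov(W_t)
    return x_pred, P_pred
```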
## Update Step
The update step is harder, and we need the relationship between $Y_{t}$ and $X_{t}$ to find their joint distribution.
In particular, we know $Y_{t} ~|~ \mathcal{F}_{t-1}$ is still Gaussian, and using the linear relationship $C_{t}X_{t} + D_{t}u_{t} + N_{t}=Y_{t}$, we find the recursive formulas
$\begin{align*}
\mathbb{E}[Y_{t} ~|~ \mathcal{F}_{t-1}]&= C_{t}\hat{X}_{t ~|~ t-1} + D_{t}u_{t},\\
\mathrm{Cov}(Y_{t} ~|~ \mathcal{F}_{t-1}) &= C_{t}P_{t ~|~ t-1} C_{t}^{T} + \mathrm{Cov}(N_{t}),\\
\mathrm{Cov}(X_{t}, Y_{t} ~|~ \mathcal{F}_{t-1})&= \mathrm{Cov}(X_{t}, C_{t}X_{t}+D_{t}u_{t}+N_{t} ~|~ \mathcal{F}_{t-1})\\
&= \mathrm{Cov}(X_{t} ~|~ \mathcal{F}_{t-1})C_{t}^{T} + \mathbf{0} + \mathbf{0}\\
&= P_{t ~|~ t-1} C_{t}^{T}.
\end{align*}$
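In code (same hypothetical numpy sketch, with `observation_moments` an illustrative helper name), these three moments translate directly:

```python
import numpy as np

def observation_moments(x_pred, P_pred, C_t, D_t, u_t, R_t):
    """Moments of Y_t given F_{t-1}, with R_t = Cov(N_t)."""
    y_pred = C_t @ x_pred + D_t @ u_t    # E[Y_t | F_{t-1}]
    S = C_t @ P_pred @ C_t.T + R_t       # Cov(Y_t | F_{t-1})
    cross = P_pred @ C_t.T               # Cov(X_t, Y_t | F_{t-1})
    return y_pred, S, cross
```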
### Updating $\hat{X}$
Now we incorporate the newly observed value $Y_{t}$, i.e. condition on $\mathcal{F}_{t}$, which gives $\begin{align*}
\mathbb{E}[X_{t} ~|~ \mathcal{F}_{t}]&= \mathbb{E}[X_{t} ~|~ \mathcal{F}_{t-1}]\\
&~~~~~+\mathrm{Cov}(X_{t}, Y_{t} ~|~ \mathcal{F}_{t-1}) *\mathrm{Cov}(Y_{t} ~|~ \mathcal{F}_{t-1})^{-1}*(Y_{t}-\mathbb{E}[Y_{t} ~|~ \mathcal{F}_{t-1}]).\\
\end{align*}$
Here $*$ is just standard matrix multiplication, included for clarity in the next step. For scalar-valued Gaussians in simplified notation, the second term is just $\sigma_{X} \cdot \rho_{XY} \cdot (Y- \bar{Y}) / \sigma_{Y}$: the surprise of seeing the deviation $Y-\bar{Y}$, normalized by $\sigma_{Y}$, then rescaled to its influence on $X$ by $\sigma_{X}\cdot\rho_{XY}$.
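For a made-up numerical illustration: with $\sigma_{X}=2$, $\sigma_{Y}=4$, $\rho_{XY}=0.5$, observing a deviation $Y-\bar{Y}=2$ shifts the estimate of $X$ by $2 \cdot 0.5 \cdot 2/4 = 0.5$.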
Now, plugging in the values found above, we have
$ \begin{align*}
\mathbb{E}[X_{t} ~|~ \mathcal{F}_{t}]= \hat{X}_{t ~|~ t-1} ~+~ &P_{t ~|~ t-1}C_{t}^{T}\\
&*(C_{t}P_{t ~|~ t-1}C_{t}^{T} + \mathrm{Cov}(N_{t}))^{-1}\\
& *(Y_{t}-C_{t}\hat{X}_{t ~|~ t-1}-D_{t}u_{t})
\end{align*},$ where $*$ separates the same matrices as above, just with the values substituted.
Note that the second term can be written as $\begin{align*}
K_{t} &* \underbrace{(Y_{t}-C_{t}\hat{X}_{t ~|~ t-1}-D_{t}u_{t})}_{\text{residual/prediction error}},\\[0.8em]
& \text{where } K_{t}:= P_{t ~|~ t-1}C_{t}^{T}*(C_{t}P_{t ~|~ t-1}C_{t}^{T}+\mathrm{Cov}(N_{t}))^{-1}.
\end{align*}$
Hence we define $K_{t}$ given above to be the **Kalman gain** matrix, and the update boils down to $\hat{X}_{t~|~t}=\hat{X}_{t ~|~t-1}+K_{t}* \text{residual}.$
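A sketch of the mean update in the same hypothetical numpy notation (the helper name `update_mean` is an assumption; a numerically careful implementation would use `np.linalg.solve` rather than an explicit inverse):

```python
import numpy as np

def update_mean(x_pred, P_pred, y, C_t, D_t, u_t, R_t):
    """Update step for the mean: x_{t|t-1} -> x_{t|t}."""
    S = C_t @ P_pred @ C_t.T + R_t            # Cov(Y_t | F_{t-1})
    K = P_pred @ C_t.T @ np.linalg.inv(S)     # Kalman gain K_t
    residual = y - C_t @ x_pred - D_t @ u_t   # residual / prediction error
    return x_pred + K @ residual, K
```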
### Updating $P$
Recall the important conditioning identity for multivariate Gaussians: if $Z=(Z_{1},Z_{2})$ (concatenated vectors) follows $N\left(\mu_{Z},\begin{pmatrix}
\Sigma_{11} & \Sigma_{12} \\
\Sigma_{12}^{T} & \Sigma_{22}
\end{pmatrix}\right),$then $Z_{1} ~|~ Z_{2} \sim N(\mu_{Z_{1} ~|~ Z_{2}}, \Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}^{T}).$
Using the identity above, we can find $P_{t ~|~ t}$ to be $\begin{align*}
P_{t ~|~ t}&= \mathrm{Cov}(X_{t} ~|~ \mathcal{F}_{t})\\
&= \mathrm{Cov}(X_{t} ~|~ \mathcal{F}_{t-1}) \\
& ~~~~~~~~- \mathrm{Cov}(X_{t}, Y_{t} ~|~ \mathcal{F}_{t-1})*\mathrm{Cov}(Y_{t} ~|~ \mathcal{F}_{t-1})^{-1}*\mathrm{Cov}(X_{t}, Y_{t}~|~ \mathcal{F}_{t-1})^{T},
\end{align*}$ and plugging in the values found above gives
$\begin{align*}
P_{t~|~t}&= P_{t~|~t-1}-K_{t}*(C_{t}P_{t ~|~ t-1}) \\
&= (I-K_{t}C_{t})P_{t ~|~ t-1},
\end{align*}$ i.e. the covariance decreases in proportion to the gain, with $C_{t}$ mapping the state scale of $X$ to the observation scale of $Y$ so that $K_{t}C_{t}$ acts on the scale of $X$.
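Putting the two steps together, here is a minimal end-to-end sketch of the recursion under the assumptions above (constant matrices for brevity; time-varying $A_{t},B_{t},C_{t},D_{t}$ work the same way, and a production implementation would prefer `np.linalg.solve` and the Joseph-form covariance update for numerical stability). The function name `kalman_filter` and its signature are illustrative choices.

```python
import numpy as np

def kalman_filter(ys, us, A, B, C, D, Q, R, x0, P0):
    """Run the predict/update recursion over observations y_1, ..., y_T."""
    x, P = x0, P0
    n = len(x0)
    means, covs = [], []
    for y, u in zip(ys, us):
        # Predict: (x_{t-1|t-1}, P_{t-1|t-1}) -> (x_{t|t-1}, P_{t|t-1})
        x = A @ x + B @ u
        P = A @ P @ A.T + Q
        # Update: incorporate Y_t
        S = C @ P @ C.T + R                  # Cov(Y_t | F_{t-1})
        K = P @ C.T @ np.linalg.inv(S)       # Kalman gain K_t
        x = x + K @ (y - C @ x - D @ u)      # x_{t|t}
        P = (np.eye(n) - K @ C) @ P          # P_{t|t} = (I - K_t C_t) P_{t|t-1}
        means.append(x)
        covs.append(P)
    return np.array(means), np.array(covs)
```

For example, with the simulated data from the first sketch, `kalman_filter(Y_obs, np.zeros((50, 1)), A, B, C, D, Q, R, x0=np.zeros(2), P0=np.eye(2))` returns the filtered means $\hat{X}_{t~|~t}$ and covariances $P_{t~|~t}$.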