General Kalman Setup
The system is defined by the evolution of a state:
- state $x_{k-1}$ propagated through the state transition matrix $F_k$
- control vector $u_k$ applied through the control input matrix $B_k$
- process noise $w_k \sim N(0,Q_k)$ on the evolution
with observations:
- $z_k$ is the measurement of the state $x_k$,
- made through the observation matrix $H_k$ applied to $x_k$
- with noise $v_k \sim N(0,R_k)$
\begin{eqnarray} x_k &=& F_k x_{k-1} + B_k u_k + w_k \\ z_k &=& H_k x_k + v_k \end{eqnarray}
The Kalman prediction of the state and its covariance is given by:
\begin{eqnarray} x_{k|k-1} &=& F_k x_{k-1|k-1} + B_k u_k \\ P_{k|k-1} &=& F_k P_{k-1|k-1} F_k^t + Q_k \end{eqnarray}
The update rule is:
- innovation residual: $y_k = z_k - H_k x_{k|k-1}$
- residual covariance: $S_k = H_k P_{k|k-1} H^t_k + R_k$
- optimal Kalman gain: $K_k = P_{k|k-1} H^t_k S^{-1}_k$
- updated state estimate: $x_{k|k} = x_{k|k-1} + K_k y_k$
- updated covariance estimate: $P_{k|k} = (I - K_k H_k) P_{k|k-1}$
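As a concrete illustration, here is a minimal numpy sketch of one predict/update cycle; the function names `kalman_predict` and `kalman_update` are purely illustrative, and all matrices are assumed to be supplied as numpy arrays of compatible shapes.

```python
import numpy as np

def kalman_predict(x, P, F, Q, B=None, u=None):
    """Prediction step: propagate the state and its covariance."""
    x_pred = F @ x if B is None or u is None else F @ x + B @ u
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred

def kalman_update(x_pred, P_pred, z, H, R):
    """Update step: innovation, gain, corrected state and covariance."""
    y = z - H @ x_pred                          # innovation residual
    S = H @ P_pred @ H.T + R                    # residual covariance
    K = P_pred @ H.T @ np.linalg.inv(S)         # optimal Kalman gain
    x = x_pred + K @ y                          # updated state estimate
    P = (np.eye(len(x_pred)) - K @ H) @ P_pred  # updated covariance estimate
    return x, P
```

Iterating the two steps over the observation stream $z_k$ yields the filtered estimates $x_{k|k}$ and $P_{k|k}$.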
Exponential Moving Average as a 1D Kalman filter
See the description of Holt-Winters time series filtering for the interpretation of the Kalman filter as an EWMA and the relation between the EWMA parameter $\alpha$ and the signal-to-noise ratio $\frac{P+Q}{P+Q+R}$.
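As a quick numerical check of this equivalence, here is a hedged sketch assuming a 1D random-walk state ($F = H = 1$) with constant noise variances; the values of $Q$ and $R$ below are illustrative. The gain $K_k$ converges to a fixed $\alpha$, and the state update $x_k = x_{k-1} + K_k (z_k - x_{k-1})$ is exactly the EWMA recursion $x_k = (1-\alpha) x_{k-1} + \alpha z_k$.

```python
import numpy as np

# 1D Kalman filter on a random-walk state: F = H = 1 (illustrative Q, R values).
Q, R = 0.1, 1.0
x, P = 0.0, 1.0
rng = np.random.default_rng(0)

for k in range(200):
    z = rng.normal()              # any observation stream will do here
    P_pred = P + Q                # predicted covariance P_{k|k-1}
    K = P_pred / (P_pred + R)     # gain (P+Q)/(P+Q+R)
    x = x + K * (z - x)           # identical to the EWMA x = (1-K)*x + K*z
    P = (1 - K) * P_pred          # updated covariance

print("steady-state EWMA parameter alpha ~", K)
```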
$p$-Factor Regression as a $p$-factor Kalman filter
Regression model definition: \begin{eqnarray} \beta_i &=& \beta_{i-1} + w_i \\ Y_i &=& X_i \beta_i + \varepsilon_i \end{eqnarray}
To map to the previous notation: $F = Id(p)$, $B = 0$, $x_k = \beta_k$, $H_k = X_k$, $z_k = Y_k$.
Kalman updating rule: \begin{eqnarray} K_i &=& \frac{1}{R+X_i (P_{i-1}+Q) X_i^t} (P_{i-1}+Q) X_i^t \\ \beta_i &=& \beta_{i-1} + K_i (Y_i - X_i \beta_{i-1}) \\ P_i &=& (Id(p) - K_i X_i)(P_{i-1} + Q) \end{eqnarray}
with dimensionalities: $X_i \in M(1,p)$, $\beta_i \in M(p,1)$, $K_i \in M(p,1)$, $Q \in M(p,p)$, $P_i \in M(p,p)$, and $Y_i$ scalar.
In theory $Q_i = cov(w_i)$ and $R_i = cov(\varepsilon_i)$; in practice a more stable setup uses the covariance estimates $Q = Id(p)$, $P_0 = 10^7 \, Id(p)$, $R = 5$.
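A minimal numpy sketch of the recursive coefficient update, with $X_i$ passed as a length-$p$ array of factors and $Y_i$ as a scalar; the function name `kalman_regression_step` is illustrative, and the priors below are the stable setup quoted above.

```python
import numpy as np

def kalman_regression_step(beta, P, X, Y, Q, R):
    """One Kalman update of the regression coefficients beta.

    X: factors, shape (p,); Y: scalar observation.
    """
    P_pred = P + Q                              # P_{i-1} + Q
    S = R + X @ P_pred @ X                      # scalar residual variance
    K = P_pred @ X / S                          # gain, shape (p,)
    beta = beta + K * (Y - X @ beta)            # coefficient update
    P = (np.eye(len(beta)) - np.outer(K, X)) @ P_pred
    return beta, P

# stable setup from the text (p = 3 is illustrative)
p = 3
beta, P = np.zeros(p), 1e7 * np.eye(p)
Q, R = np.eye(p), 5.0
```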
Non-Kalman view: Smoothed 1D regression as a Penalized $L^2$ Problem
We keep the model setup: \begin{eqnarray} \beta_i &=& \beta_{i-1} + w_i \\ Y_i &=& X_i \beta_i + \varepsilon_i \end{eqnarray}
Given the whole history, this can be solved as a quadratic program with the loss function $$ \sum_i (Y_i - X_i \beta_i)^2 + \mu \sum_i (\beta_i - \beta_{i-1})^2 $$ For $\mu = 0$ the regression is exact at each date, whereas for $\mu$ sufficiently high $\beta$ does not change with time.
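As an illustration, a small numpy sketch of the 1D case, solving the normal equations of this quadratic loss directly (all names and the toy data are illustrative; for long histories a banded solver would be preferable).

```python
import numpy as np

def smoothed_regression_1d(X, Y, mu):
    """Minimize sum_i (Y_i - X_i*beta_i)^2 + mu * sum_i (beta_i - beta_{i-1})^2."""
    n = len(X)
    # Path Laplacian (second-difference matrix) encoding the smoothness penalty.
    L = np.zeros((n, n))
    for i in range(1, n):
        L[i - 1, i - 1] += 1
        L[i, i] += 1
        L[i - 1, i] -= 1
        L[i, i - 1] -= 1
    A = np.diag(X ** 2) + mu * L   # normal equations: A beta = X * Y
    return np.linalg.solve(A, X * Y)

# toy data: beta drifts slowly from 1 to 2
rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=n)
Y = X * np.linspace(1.0, 2.0, n) + 0.1 * rng.normal(size=n)
beta_hat = smoothed_regression_1d(X, Y, mu=10.0)
```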
One problem is that the algorithm is not adaptive and requires the whole history to optimize. Moreover, the choice of the optimal $\mu$ is a compromise between the precision of the estimate and its stability, and it is best estimated using only information available up to time $i-1$.