Sparse GP (Titsias VFE)

A sparse GP replaces the full \(N \times N\) kernel matrix \(K_{ff} = K_{XX}\) with an approximation involving \(M \ll N\) inducing points \(Z = (z_1, \dots, z_M)\).

Titsias (2009) derives the variational free energy (VFE) bound by introducing auxiliary inducing-function values \(u = f(Z)\) and maximizing a variational lower bound on \(\log p(y)\).

Kernel approximation

Define the cross-covariances

\[K_{uu} = K(Z, Z) \in \mathbb{R}^{M \times M}, \qquad K_{uf} = K(Z, X) \in \mathbb{R}^{M \times N}.\]

The full \(K_{ff}\) is approximated by

\[Q_{ff} := K_{fu}\, K_{uu}^{-1}\, K_{uf}.\]

In effect, the data covariance is projected through the \(M\)-dim inducing-point subspace.

VFE bound

With per-data diagonal residual

\[\Lambda = \operatorname{diag}\!\bigl(K_{ff} - Q_{ff}\bigr) + \sigma_n^2 I,\]

and the \(M \times M\) “Sigma” matrix

\[\Sigma = K_{uu} + K_{uf}\, \Lambda^{-1}\, K_{fu},\]

the variational lower bound is

\[\mathcal{L}(\theta, Z) = -\tfrac{1}{2}\, y^\top (Q_{ff} + \Lambda)^{-1} y - \tfrac{1}{2}\, \log\lvert Q_{ff} + \Lambda \rvert - \tfrac{N}{2}\, \log(2\pi) - \tfrac{1}{2\sigma_n^2}\, \operatorname{tr}\!\bigl(K_{ff} - Q_{ff}\bigr).\]

The last term is the regularization that distinguishes VFE from the older FITC approximation — it penalizes inducing-point placements that leave residual variance on the training data.

Cost

Storage is \(O(N + M^2)\). Training cost per fit is \(O(NM^2 + M^3)\) — linear in \(N\). Predict mean is \(O(M)\) per test point; predict variance is \(O(M^2)\).

LightGP’s GPSparse initializes inducing locations by farthest-point sampling on the training inputs. This is deterministic, requires no random seed, and gives strong defaults on real data:

import lightgp as gp

model = gp.GPSparse(noise_var=0.1)
model.fit(X, y, num_inducing=100)
pred = model.predict(X_test)

When to use a sparse GP

The decision tree from the user docs:

  • \(N \le 5{,}000\): exact Cholesky is usually fine

  • \(5{,}000 \lesssim N \lesssim 50{,}000\), D ≤ 3: try SKI (KISS-GP)

  • \(N \gtrsim 5{,}000\), D > 3: sparse VFE with \(M \approx \sqrt{N}\) is a strong starting point

For real datasets, a useful sanity check is to plot inducing-point locations against the data:

import matplotlib.pyplot as plt
plt.scatter(X[:, 0], y, alpha=0.2, s=4)
plt.scatter(model.Z[:, 0], np.zeros(len(model.Z)), color="red",
            marker="x", s=50)

Inducing points should be roughly uniform over the input support — if they cluster, your length scale is mis-calibrated and you should model.optimize() first.