Sparse GP (Titsias VFE)
A sparse GP replaces the full \(N \times N\) kernel matrix \(K_{ff} = K_{XX}\) with an approximation involving \(M \ll N\) inducing points \(Z = (z_1, \dots, z_M)\).
Titsias (2009) derives the variational free energy (VFE) bound by introducing auxiliary inducing-function values \(u = f(Z)\) and maximizing a variational lower bound on \(\log p(y)\).
Kernel approximation
Define the cross-covariances
The full \(K_{ff}\) is approximated by
In effect, the data covariance is projected through the \(M\)-dim inducing-point subspace.
VFE bound
With per-data diagonal residual
and the \(M \times M\) “Sigma” matrix
the variational lower bound is
The last term is the regularization that distinguishes VFE from the older FITC approximation — it penalizes inducing-point placements that leave residual variance on the training data.
Cost
Storage is \(O(N + M^2)\). Training cost per fit is \(O(NM^2 + M^3)\) — linear in \(N\). Predict mean is \(O(M)\) per test point; predict variance is \(O(M^2)\).
LightGP’s GPSparse initializes inducing locations by farthest-point
sampling on the training inputs. This is deterministic, requires no
random seed, and gives strong defaults on real data:
import lightgp as gp
model = gp.GPSparse(noise_var=0.1)
model.fit(X, y, num_inducing=100)
pred = model.predict(X_test)
When to use a sparse GP
The decision tree from the user docs:
\(N \le 5{,}000\): exact Cholesky is usually fine
\(5{,}000 \lesssim N \lesssim 50{,}000\), D ≤ 3: try SKI (KISS-GP)
\(N \gtrsim 5{,}000\), D > 3: sparse VFE with \(M \approx \sqrt{N}\) is a strong starting point
For real datasets, a useful sanity check is to plot inducing-point locations against the data:
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], y, alpha=0.2, s=4)
plt.scatter(model.Z[:, 0], np.zeros(len(model.Z)), color="red",
marker="x", s=50)
Inducing points should be roughly uniform over the input support — if
they cluster, your length scale is mis-calibrated and you should
model.optimize() first.