Sparse GP (Titsias VFE) ======================= A sparse GP replaces the full :math:`N \times N` kernel matrix :math:`K_{ff} = K_{XX}` with an approximation involving :math:`M \ll N` **inducing points** :math:`Z = (z_1, \dots, z_M)`. Titsias (2009) derives the **variational free energy** (VFE) bound by introducing auxiliary inducing-function values :math:`u = f(Z)` and maximizing a variational lower bound on :math:`\log p(y)`. Kernel approximation -------------------- Define the cross-covariances .. math:: K_{uu} = K(Z, Z) \in \mathbb{R}^{M \times M}, \qquad K_{uf} = K(Z, X) \in \mathbb{R}^{M \times N}. The full :math:`K_{ff}` is approximated by .. math:: Q_{ff} := K_{fu}\, K_{uu}^{-1}\, K_{uf}. In effect, the data covariance is projected through the :math:`M`-dim inducing-point subspace. VFE bound --------- With per-data diagonal residual .. math:: \Lambda = \operatorname{diag}\!\bigl(K_{ff} - Q_{ff}\bigr) + \sigma_n^2 I, and the :math:`M \times M` "Sigma" matrix .. math:: \Sigma = K_{uu} + K_{uf}\, \Lambda^{-1}\, K_{fu}, the variational lower bound is .. math:: \mathcal{L}(\theta, Z) = -\tfrac{1}{2}\, y^\top (Q_{ff} + \Lambda)^{-1} y - \tfrac{1}{2}\, \log\lvert Q_{ff} + \Lambda \rvert - \tfrac{N}{2}\, \log(2\pi) - \tfrac{1}{2\sigma_n^2}\, \operatorname{tr}\!\bigl(K_{ff} - Q_{ff}\bigr). The last term is the **regularization** that distinguishes VFE from the older FITC approximation — it penalizes inducing-point placements that leave residual variance on the training data. Cost ---- Storage is :math:`O(N + M^2)`. Training cost per fit is :math:`O(NM^2 + M^3)` — linear in :math:`N`. Predict mean is :math:`O(M)` per test point; predict variance is :math:`O(M^2)`. LightGP's ``GPSparse`` initializes inducing locations by **farthest-point sampling** on the training inputs. This is deterministic, requires no random seed, and gives strong defaults on real data: .. code-block:: python import lightgp as gp model = gp.GPSparse(noise_var=0.1) model.fit(X, y, num_inducing=100) pred = model.predict(X_test) When to use a sparse GP ----------------------- The decision tree from the user docs: * :math:`N \le 5{,}000`: exact Cholesky is usually fine * :math:`5{,}000 \lesssim N \lesssim 50{,}000`, **D ≤ 3**: try :doc:`ski` * :math:`N \gtrsim 5{,}000`, **D > 3**: sparse VFE with :math:`M \approx \sqrt{N}` is a strong starting point For real datasets, a useful sanity check is to plot inducing-point locations against the data: .. code-block:: python import matplotlib.pyplot as plt plt.scatter(X[:, 0], y, alpha=0.2, s=4) plt.scatter(model.Z[:, 0], np.zeros(len(model.Z)), color="red", marker="x", s=50) Inducing points should be roughly uniform over the input support — if they cluster, your length scale is mis-calibrated and you should ``model.optimize()`` first.