GP Regression Basics
A Gaussian Process defines a distribution over functions:
It is fully characterized by a mean function \(m(x)\) and a kernel (covariance function) \(k(x, x')\). For any finite set of inputs \(X = (x_1, \dots, x_N)\), the function values \(f(X) = (f(x_1), \dots, f(x_N))\) are jointly Gaussian:
Regression posterior
Given noisy observations \(y = f(X) + \varepsilon\) with \(\varepsilon \sim \mathcal{N}(0, \sigma_n^2 I)\), the predictive distribution at a test point \(x_*\) is also Gaussian:
with
In LightGP, the matrix \(K_y := K_{XX} + \sigma_n^2 I\) is factorized
once at fit() time (Cholesky, CG, or SKI depending on the solver); the
factor is then reused for every test point at predict() time.
Log marginal likelihood
To learn hyperparameters \(\theta\) (length scale, signal variance, noise variance), maximize the log marginal likelihood:
The three terms have intuitive interpretations:
\(-\tfrac{1}{2}\, y^\top K_y^{-1} y\) — data fit. Penalizes prediction errors weighted by the inverse covariance.
\(-\tfrac{1}{2}\, \log\lvert K_y \rvert\) — complexity penalty. Discourages overfitting by penalizing models that allow too many functions.
\(-\tfrac{N}{2}\, \log(2\pi)\) — constant, drops in optimization.
LightGP optimizes \(\log p\) via Adam over the log-hyperparameters
\([\log \ell, \log \sigma^2, \log \sigma_n^2]\). The legacy API uses
analytical gradients; the new Kernel-object API uses central
finite differences so it supports arbitrary kernel compositions.
Why O(N³)?
The cost of an exact GP fit is dominated by the Cholesky factorization of \(K_y\), which is \(O(N^3)\). Predict variance also needs a forward solve, \(O(N^2 M)\) for \(M\) test points.
In practice, exact GP fits comfortably for \(N \lesssim 5{,}000\) on a laptop. The next three pages describe how LightGP scales past that:
Sparse GP (Titsias VFE) — approximate \(K_y\) with \(M \ll N\) inducing points
SKI (KISS-GP) — when \(D \le 3\), replace \(K_y\) with a Toeplitz-structured grid kernel and use FFTs
Matrix-Free Conjugate Gradients — solve the linear system iteratively without ever forming \(K_y\) explicitly