GP Regression Basics

A Gaussian Process defines a distribution over functions:

\[f \sim \mathcal{GP}\!\bigl(m(x),\; k(x, x')\bigr)\]

It is fully characterized by a mean function \(m(x)\) and a kernel (covariance function) \(k(x, x')\). For any finite set of inputs \(X = (x_1, \dots, x_N)\), the function values \(f(X) = (f(x_1), \dots, f(x_N))\) are jointly Gaussian:

\[f(X) \sim \mathcal{N}\!\bigl(m(X),\; K_{XX}\bigr), \qquad (K_{XX})_{ij} = k(x_i, x_j).\]

Regression posterior

Given noisy observations \(y = f(X) + \varepsilon\) with \(\varepsilon \sim \mathcal{N}(0, \sigma_n^2 I)\), the predictive distribution at a test point \(x_*\) is also Gaussian:

\[p(f_* \mid X, y, x_*) = \mathcal{N}(\mu_*, \sigma_*^2),\]

with

\[\begin{split}\mu_* &= m(x_*) + K_{*X}\, (K_{XX} + \sigma_n^2 I)^{-1} (y - m(X)), \\ \sigma_*^2 &= k(x_*, x_*) - K_{*X}\, (K_{XX} + \sigma_n^2 I)^{-1} K_{X*}.\end{split}\]

In LightGP, the matrix \(K_y := K_{XX} + \sigma_n^2 I\) is factorized once at fit() time (Cholesky, CG, or SKI depending on the solver); the factor is then reused for every test point at predict() time.

Log marginal likelihood

To learn hyperparameters \(\theta\) (length scale, signal variance, noise variance), maximize the log marginal likelihood:

\[\log p(y \mid X, \theta) = -\tfrac{1}{2}\, y^\top K_y^{-1} y - \tfrac{1}{2}\, \log\lvert K_y \rvert - \tfrac{N}{2}\, \log(2\pi).\]

The three terms have intuitive interpretations:

\(-\tfrac{1}{2}\, y^\top K_y^{-1} y\) — data fit. Penalizes prediction errors weighted by the inverse covariance.
\(-\tfrac{1}{2}\, \log\lvert K_y \rvert\) — complexity penalty. Discourages overfitting by penalizing models that allow too many functions.
\(-\tfrac{N}{2}\, \log(2\pi)\) — constant, drops in optimization.

LightGP optimizes \(\log p\) via Adam over the log-hyperparameters \([\log \ell, \log \sigma^2, \log \sigma_n^2]\). The legacy API uses analytical gradients; the new Kernel-object API uses central finite differences so it supports arbitrary kernel compositions.

Why O(N³)?

The cost of an exact GP fit is dominated by the Cholesky factorization of \(K_y\), which is \(O(N^3)\). Predict variance also needs a forward solve, \(O(N^2 M)\) for \(M\) test points.

In practice, exact GP fits comfortably for \(N \lesssim 5{,}000\) on a laptop. The next three pages describe how LightGP scales past that:

Sparse GP (Titsias VFE) — approximate \(K_y\) with \(M \ll N\) inducing points
SKI (KISS-GP) — when \(D \le 3\), replace \(K_y\) with a Toeplitz-structured grid kernel and use FFTs
Matrix-Free Conjugate Gradients — solve the linear system iteratively without ever forming \(K_y\) explicitly