Matrix-Free Conjugate Gradients

The GP posterior mean and variance both require solving \((K + \sigma_n^2 I)\, \alpha = y\). The textbook approach forms the \(N \times N\) matrix \(K_y\) and Cholesky-factorizes it, costing \(O(N^3)\) time and \(O(N^2)\) memory.

Conjugate gradients is an alternative that never forms \(K_y\) explicitly — it only needs a function that returns \(K_y\, v\) for arbitrary \(v\).

The CG algorithm

Starting from \(x_0 = 0\) and \(r_0 = y\), repeat for \(k = 0, 1, \dots\):

\[\begin{split}\alpha_k &= \frac{r_k^\top r_k}{p_k^\top (K_y p_k)} \\ x_{k+1} &= x_k + \alpha_k p_k \\ r_{k+1} &= r_k - \alpha_k (K_y p_k) \\ \beta_k &= \frac{r_{k+1}^\top r_{k+1}}{r_k^\top r_k} \\ p_{k+1} &= r_{k+1} + \beta_k p_k\end{split}\]

Each iteration performs one matrix-vector product \(K_y p_k\) plus a handful of vector dots and axpys. After \(k\) iterations the residual \(\lVert r_k \rVert\) typically drops by a factor of \(\mathcal{O}(\kappa(K_y)^{-1/2})\) per step, where \(\kappa\) is the condition number.

In LightGP, Solver.CG runs until \(\lVert r_k \rVert / \lVert y \rVert < 10^{-4}\) or 1000 iterations, whichever comes first. Convergence is typically 30–100 iterations for well-conditioned RBF kernels.

The matrix-free \(K_y v\) shader

The trick that makes this fast on Metal is implementing \(K_y\, v\) without materializing \(K_y\). The kernel matrix-vector product

\[(K_y v)_i = \sum_j k(x_i, x_j)\, v_j + \sigma_n^2\, v_i\]

can be computed in a single GPU shader pass: each output element \(i\) loops over all training points \(j\), evaluates the kernel function, multiplies by \(v_j\), and accumulates — all without storing the intermediate \(N \times N\) matrix.

The result: memory stays \(O(N)\) regardless of how large the kernel matrix would be. At \(N = 20{,}000\) the explicit kernel matrix is 1.6 GB; the matrix-free path needs ~80 KB for \(v\) and \(w\), plus the input \(X\) itself.

LightGP’s matrix-free shader lives at kernels/metal/rbf_matvec.metal and kernels/cuda/rbf_matvec.cu. The host-side CG loop uses these as the matvec callback:

auto K_matvec = [&](const Tensor& v) {
    return rbf_matvec_metal(X, v, length_scale, signal_var, noise_var);
};
cg_solve(K_matvec, y, alpha);

Hutchinson variance estimation

The predictive variance for a test point \(x_*\) is

\[\sigma_*^2 = k(x_*, x_*) - K_{*X}\, K_y^{-1} K_{X*}.\]

The middle term needs another solve. In matrix-free CG mode, LightGP uses Hutchinson probes: sample \(p\) Rademacher vectors \(z_j \in \{-1, +1\}^N\), solve \(K_y\, u_j = z_j\) once per probe, and estimate

\[K_y^{-1}\, K_{*X}^\top \;\approx\; \frac{1}{p} \sum_{j=1}^p z_j \odot u_j.\]

This gives a low-rank approximation of \(K_y^{-1}\) that we then contract against \(K_{*X}\). The variance estimate is biased low but the bias shrinks as \(O(1/p)\).

Log-marginal-likelihood via SLQ

The \(\log\lvert K_y \rvert\) term in the log marginal likelihood can also be computed without ever factorizing \(K_y\), using stochastic Lanczos quadrature (Ubaru et al., 2017): run \(k\) steps of Lanczos with random initial vectors, extract the tridiagonal matrix \(T_k\), and approximate

\[\log\lvert K_y \rvert \approx N \cdot \mathbb{E}_z\!\left[ z^\top \log(K_y) z \right] \approx \frac{N}{p}\sum_{j=1}^p z_j^\top\, \operatorname{e}_1^\top Q_k^\top \log(T_k)\, Q_k\, \operatorname{e}_1.\]

The \(\log(T_k)\) is cheap because \(T_k\) is small (\(k \times k\), typically \(k = 30\)). LightGP’s solvers/cpu/slq_cpu.cpp implements this and is shared between CG and SKI modes.

When CG (and matrix-free) wins

The kernel matrix would exceed RAM (\(N \gtrsim 20{,}000\), GB scale)
You have a GPU with much higher memory bandwidth than CPU
Variance estimates can tolerate Hutchinson noise

For dense \(N < 5{,}000\) problems, exact Cholesky is faster and gives noiseless variance — use GP Regression Basics directly.