Matrix-Free Conjugate Gradients
The GP posterior mean and variance both require solving \((K + \sigma_n^2 I)\, \alpha = y\). The textbook approach forms the \(N \times N\) matrix \(K_y\) and Cholesky-factorizes it, costing \(O(N^3)\) time and \(O(N^2)\) memory.
Conjugate gradients is an alternative that never forms \(K_y\) explicitly — it only needs a function that returns \(K_y\, v\) for arbitrary \(v\).
The CG algorithm
Starting from \(x_0 = 0\) and \(r_0 = y\), repeat for \(k = 0, 1, \dots\):
Each iteration performs one matrix-vector product \(K_y p_k\) plus a handful of vector dots and axpys. After \(k\) iterations the residual \(\lVert r_k \rVert\) typically drops by a factor of \(\mathcal{O}(\kappa(K_y)^{-1/2})\) per step, where \(\kappa\) is the condition number.
In LightGP, Solver.CG runs until \(\lVert r_k \rVert / \lVert y \rVert
< 10^{-4}\) or 1000 iterations, whichever comes first. Convergence is
typically 30–100 iterations for well-conditioned RBF kernels.
The matrix-free \(K_y v\) shader
The trick that makes this fast on Metal is implementing \(K_y\, v\) without materializing \(K_y\). The kernel matrix-vector product
can be computed in a single GPU shader pass: each output element \(i\) loops over all training points \(j\), evaluates the kernel function, multiplies by \(v_j\), and accumulates — all without storing the intermediate \(N \times N\) matrix.
The result: memory stays \(O(N)\) regardless of how large the kernel matrix would be. At \(N = 20{,}000\) the explicit kernel matrix is 1.6 GB; the matrix-free path needs ~80 KB for \(v\) and \(w\), plus the input \(X\) itself.
LightGP’s matrix-free shader lives at
kernels/metal/rbf_matvec.metal and kernels/cuda/rbf_matvec.cu. The
host-side CG loop uses these as the matvec callback:
auto K_matvec = [&](const Tensor& v) {
return rbf_matvec_metal(X, v, length_scale, signal_var, noise_var);
};
cg_solve(K_matvec, y, alpha);
Hutchinson variance estimation
The predictive variance for a test point \(x_*\) is
The middle term needs another solve. In matrix-free CG mode, LightGP uses Hutchinson probes: sample \(p\) Rademacher vectors \(z_j \in \{-1, +1\}^N\), solve \(K_y\, u_j = z_j\) once per probe, and estimate
This gives a low-rank approximation of \(K_y^{-1}\) that we then contract against \(K_{*X}\). The variance estimate is biased low but the bias shrinks as \(O(1/p)\).
Log-marginal-likelihood via SLQ
The \(\log\lvert K_y \rvert\) term in the log marginal likelihood can also be computed without ever factorizing \(K_y\), using stochastic Lanczos quadrature (Ubaru et al., 2017): run \(k\) steps of Lanczos with random initial vectors, extract the tridiagonal matrix \(T_k\), and approximate
The \(\log(T_k)\) is cheap because \(T_k\) is small
(\(k \times k\), typically \(k = 30\)). LightGP’s
solvers/cpu/slq_cpu.cpp implements this and is shared between CG and
SKI modes.
When CG (and matrix-free) wins
The kernel matrix would exceed RAM (\(N \gtrsim 20{,}000\), GB scale)
You have a GPU with much higher memory bandwidth than CPU
Variance estimates can tolerate Hutchinson noise
For dense \(N < 5{,}000\) problems, exact Cholesky is faster and gives noiseless variance — use GP Regression Basics directly.