Benchmarks

LightGP is benchmarked end-to-end against GPyTorch on the same hardware, and against itself across backends. All numbers below are wall-clock medians in fp32 on an Apple M4 (8 GPU cores, 16 GB unified memory), in a local environment with Python 3.13, NumPy 1.26, GPyTorch 1.11, and PyTorch 2.2.

Methodology

Each cell is the median of 3 runs (5 for component micro-benchmarks), warmup discarded.
“LightGP CPU” uses Accelerate (CBLAS + LAPACK + AMX coprocessor).
“LightGP Metal” uses the Metal compute shaders.
“GPyTorch MPS” uses PyTorch’s Apple Silicon backend; rows marked (gap) fall back to CPU because the underlying op is missing on MPS (e.g. aten::_linalg_eigh.eigenvalues for exact-GP variance).
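
The median-with-warmup protocol described above can be sketched as a small timing helper. This is a minimal illustration, not the harness the benchmarks actually use; the function name `median_wall_clock_ms` and its parameters are assumptions made for the example.

```python
import statistics
import time


def median_wall_clock_ms(fn, runs=3, warmup=1):
    """Median wall-clock time of fn() in milliseconds, with warmup calls discarded.

    Hypothetical helper: `runs` mirrors the "median of 3 runs" rule
    (use runs=5 for component micro-benchmarks).
    """
    for _ in range(warmup):
        fn()  # warmup call: neither result nor timing is kept

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)  # seconds -> ms
    return statistics.median(samples)


if __name__ == "__main__":
    # Toy workload standing in for a fit + predict call.
    print(median_wall_clock_ms(lambda: sum(i * i for i in range(100_000))))
```

When the timed call dispatches work asynchronously to a GPU, `fn` should block until the device has finished (for PyTorch on MPS, by calling `torch.mps.synchronize()` at the end of `fn`), otherwise only the launch cost is measured.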

Inference scaling

Fit-time scaling per inference method, measured on the local M4 by the included build script (docs/build_benchmark_figure.py):

Figure: log-log plot of fit time vs N for Cholesky, CG, SKI, and Sparse VFE.

Cholesky is exact and capped at N=2000, where O(N³) growth dominates. The matrix-free CG path scales as O(N²k), with k the number of CG iterations; on CPU it remains useful up to ~20k, and on Metal it reaches further. SKI and Sparse VFE are the large-N regimes: both still finish in tens of milliseconds at N=50,000.

End-to-end vs GPyTorch

Fit + predict on synthetic y = sin(x) 1-D data:

Config	LightGP CPU	LightGP Metal	GPyTorch CPU	GPyTorch MPS	best ratio
Exact RBF, N=2048, D=4	23.6 ms	195 ms	89 ms	(gap*)	3.8× faster
Exact Matérn-5/2, N=2048, D=4	42 ms	191 ms	106 ms	(gap*)	2.5× faster
Sparse RBF, N=10000, M=200	18.5 ms	42 ms	42 ms	69 ms	2.3× faster
Sparse RBF, N=50000, M=200	97.4 ms	156 ms	196 ms	98 ms	2.0× faster vs CPU; on par with MPS
Matrix-free \(K\mathbf v\), N=20000	n/a	22 ms	n/a	(no equiv)	32× over explicit

*GPyTorch on MPS is missing an op needed for exact-GP variance, so it falls back to CPU.

LightGP CPU is faster than GPyTorch CPU across the measured exact and small-to-mid sparse configurations: both run on Accelerate underneath, and LightGP has less Python dispatch overhead. The matrix-free \(K\mathbf v\) path has no GPyTorch-on-MPS equivalent.

Component micro-benchmarks

Cholesky factorization at increasing N, fp32:

N	LightGP CPU (Accelerate)	LightGP Metal
1024	0.84 ms	12.0 ms
2048	4.6 ms	26.0 ms
4096	41.5 ms	88.0 ms

Apple’s AMX matrix coprocessor wins the dense Cholesky regime on Apple Silicon. This is a hardware result, not a software gap: at moderate N, the integrated GPU reached through Metal has lower fp32 throughput than CPU+AMX. For dense Cholesky on Apple Silicon, Backend.Auto correctly picks Backend.CPU.
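
One way to read the Cholesky table against the expected O(N³) growth is to compute the log-log slope between adjacent rows. The sketch below does that arithmetic on the numbers copied from the table; the helper name `loglog_slopes` is an assumption for illustration and is not part of LightGP.

```python
import math

# Cholesky medians copied from the table above (milliseconds).
sizes = [1024, 2048, 4096]
cpu_ms = [0.84, 4.6, 41.5]
metal_ms = [12.0, 26.0, 88.0]


def loglog_slopes(ns, times):
    """Slope of log(time) vs log(N) between each pair of adjacent rows."""
    pairs = list(zip(ns, times))
    return [
        math.log(t1 / t0) / math.log(n1 / n0)
        for (n0, t0), (n1, t1) in zip(pairs, pairs[1:])
    ]


if __name__ == "__main__":
    print("CPU slopes:  ", [round(s, 2) for s in loglog_slopes(sizes, cpu_ms)])
    print("Metal slopes:", [round(s, 2) for s in loglog_slopes(sizes, metal_ms)])
```

A slope of 3 is the asymptotic expectation for dense Cholesky. On these three rows the CPU slopes come out at roughly 2.5 and 3.2, while the Metal slopes stay between 1 and 2. One possible reading is that fixed per-call costs still make up much of the Metal time at these sizes, though three data points are too few to say more.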

Matrix-free RBF kernel-vector product on Metal vs explicit materialization through Accelerate sgemm:

N	Explicit (form K, then matmul)	Matrix-free (Metal)	Memory: explicit / free
5,000	41.7 ms	4 ms	100 MB / 80 KB
10,000	194 ms	9 ms	400 MB / 160 KB
20,000	707 ms	22 ms	1.6 GB / 320 KB

The ~32× speedup at N=20k comes from memory traffic. The explicit path is bandwidth-bound: it forms the 1.6 GB kernel matrix and streams it through sgemm once. The matrix-free path fuses kernel construction and the matvec into a single Metal shader pass, so the kernel matrix is never stored.
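
The idea behind the matrix-free product can be illustrated on the CPU with NumPy: build the kernel one row block at a time, multiply it into the vector, and discard it. This is only a sketch of the technique under assumed conventions (an RBF kernel exp(-‖x−x′‖² / 2ℓ²), a hypothetical function name `rbf_matvec_blocked`, and an arbitrary block size). It is not the Metal shader and will not reproduce the timings above.

```python
import numpy as np


def rbf_matvec_blocked(X, v, lengthscale=1.0, block=1024):
    """Compute K @ v for an RBF kernel without materializing the N x N matrix K.

    Peak extra memory is one (block, N) slab instead of the full (N, N) matrix.
    """
    X = np.asarray(X, dtype=np.float32)
    v = np.asarray(v, dtype=np.float32)
    n = X.shape[0]
    sq_norms = np.sum(X * X, axis=1)  # ||x_i||^2, reused for every block
    out = np.empty(n, dtype=np.float32)

    for start in range(0, n, block):
        stop = min(start + block, n)
        # Squared distances between this row block and all points: (stop-start, N).
        d2 = sq_norms[start:stop, None] + sq_norms[None, :] - 2.0 * (X[start:stop] @ X.T)
        np.maximum(d2, 0.0, out=d2)  # clamp tiny negatives caused by round-off
        # Kernel slab times vector; the slab is dropped on the next iteration.
        out[start:stop] = np.exp(-0.5 * d2 / lengthscale**2) @ v
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((5000, 4)).astype(np.float32)
    v = rng.standard_normal(5000).astype(np.float32)
    print(rbf_matvec_blocked(X, v).shape)
```

The explicit column of the memory table matches 4·N² bytes for an fp32 matrix (for N=20,000, 4 × 20,000² = 1.6 GB). A fused GPU kernel goes one step further than this blocked version, since it never writes even a slab of K to memory.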

Reproducing the numbers

The C++ benchmark binaries live in benchmarks/ and emit one JSON object per line:

./build.sh
./build/bench_paper > paper_results.json    # ~2 min full sweep
./build/bench_ski > ski_results.json         # SKI specifically
./build/bench_matvec > matvec_results.json   # matrix-free Kv

The Python comparison against GPyTorch:

source .venv/bin/activate
pip install torch gpytorch
python3 benchmarks/bench_gpytorch.py > gpytorch_results.json

Both sides use identical input shapes and the same fp32 dtype, so the numbers join cleanly on (method, N, M, D).
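
Joining the two result files on (method, N, M, D) might look like the sketch below. The key fields come from the sentence above, but everything else is an assumption for illustration: the timing field name `median_ms`, the method strings, and the helper names `load_jsonl` and `join_results` are hypothetical, so adjust them to whatever the benchmark binaries actually emit.

```python
import json

KEY_FIELDS = ("method", "N", "M", "D")  # join key named in the text above


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


def join_results(lightgp, gpytorch, time_field="median_ms"):
    """Pair records sharing the join key and report the GPyTorch / LightGP time ratio.

    `time_field` is a hypothetical field name, not a documented one.
    """
    def key_of(rec):
        return tuple(rec.get(k) for k in KEY_FIELDS)  # missing fields become None

    other = {key_of(r): r for r in gpytorch}
    rows = []
    for rec in lightgp:
        match = other.get(key_of(rec))
        if match is None:
            continue  # no counterpart on the GPyTorch side
        rows.append({
            **dict(zip(KEY_FIELDS, key_of(rec))),
            "lightgp_ms": rec[time_field],
            "gpytorch_ms": match[time_field],
            "ratio": match[time_field] / rec[time_field],  # >1 means LightGP is faster
        })
    return rows
```

With the files produced above, usage would be along the lines of `join_results(load_jsonl("paper_results.json"), load_jsonl("gpytorch_results.json"))`. Records without a counterpart are silently skipped, which suits configurations that exist on only one side, such as the matrix-free product.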