Benchmarks
==========

LightGP is benchmarked end-to-end against GPyTorch on the same hardware,
and against itself across backends. All numbers below are wall-clock
medians on Apple M4 (8 GPU cores, 16 GB unified memory), fp32, with the
local environment running ``python3.13`` + ``numpy 1.26`` + ``gpytorch
1.11`` + ``torch 2.2``.

Methodology
-----------

- Each cell is the median of 3 runs (5 for component micro-benchmarks),
  warmup discarded.
- "LightGP CPU" uses Accelerate (CBLAS + LAPACK + AMX coprocessor).
- "LightGP Metal" uses the Metal compute shaders.
- "GPyTorch MPS" uses PyTorch's Apple Silicon backend; rows marked
  *(gap)* fall back to CPU because the underlying op is missing on MPS
  (e.g. ``aten::_linalg_eigh.eigenvalues`` for exact-GP variance).

Inference scaling
-----------------

Fit-time scaling per inference method, measured on the local M4 by the
included build script (``docs/build_benchmark_figure.py``):

.. figure:: ../_static/figures/scaling.png
   :alt: Log-log plot of fit time vs N for Cholesky, CG, SKI, and Sparse VFE
   :align: center
   :width: 100%

   Cholesky is exact and capped at N=2000 (O(N³) growth dominates).
   The matrix-free CG path scales as O(N²k); on CPU it remains
   useful up to ~20k, on Metal it reaches further. SKI and Sparse
   VFE are the long-N regimes — both still finish in tens of
   milliseconds at N=50,000.

End-to-end vs GPyTorch
----------------------

Fit + predict on synthetic ``y = sin(x)`` 1-D data:

.. list-table::
   :header-rows: 1
   :widths: 32 14 16 14 16 8

   * - Config
     - LightGP CPU
     - LightGP Metal
     - GPyTorch CPU
     - GPyTorch MPS
     - best ratio
   * - Exact RBF, N=2048, D=4
     - **23.6 ms**
     - 195 ms
     - 89 ms
     - (gap*)
     - 3.8× faster
   * - Exact Matérn-5/2, N=2048, D=4
     - **42 ms**
     - 191 ms
     - 106 ms
     - (gap*)
     - 2.5× faster
   * - Sparse RBF, N=10000, M=200
     - **18.5 ms**
     - 42 ms
     - 42 ms
     - 69 ms
     - 2.3× faster
   * - Sparse RBF, N=50000, M=200
     - **97.4 ms**
     - 156 ms
     - 196 ms
     - **98 ms**
     - 2.0× faster vs CPU; on par with MPS
   * - Matrix-free :math:`K\mathbf v`, N=20000
     - n/a
     - **22 ms**
     - n/a
     - (no equiv)
     - 32× over explicit

\*GPyTorch MPS missing op for exact-GP variance — falls back to CPU.

LightGP CPU is faster than GPyTorch CPU across the measured exact and
small-to-mid sparse configurations — same Accelerate underneath, less
Python dispatch overhead. The matrix-free :math:`K\mathbf v` path has no
GPyTorch-on-MPS equivalent.

Component micro-benchmarks
--------------------------

Cholesky factorization at increasing N, fp32:

.. list-table::
   :header-rows: 1
   :widths: 20 30 30

   * - N
     - LightGP CPU (Accelerate)
     - LightGP Metal
   * - 1024
     - 0.84 ms
     - 12.0 ms
   * - 2048
     - 4.6 ms
     - 26.0 ms
   * - 4096
     - 41.5 ms
     - 88.0 ms

Apple's AMX matrix coprocessor wins the dense Cholesky regime on Apple
Silicon. This is a hardware result, not a software gap — Metal's
integrated GPU has lower fp32 throughput than CPU+AMX at moderate N.
For dense Cholesky on Apple Silicon, ``Backend.Auto`` correctly picks
``Backend.CPU``.

Matrix-free RBF kernel-vector product on Metal vs explicit
materialization through Accelerate ``sgemm``:

.. list-table::
   :header-rows: 1
   :widths: 16 26 26 26

   * - N
     - Explicit (form K, then matmul)
     - Matrix-free (Metal)
     - Memory: explicit / free
   * - 5,000
     - 41.7 ms
     - 4 ms
     - 100 MB / 80 KB
   * - 10,000
     - 194 ms
     - 9 ms
     - 400 MB / 160 KB
   * - 20,000
     - 707 ms
     - **22 ms**
     - 1.6 GB / 320 KB

The ~32× speedup at N=20k is bandwidth-bound: explicit forms the
1.6 GB kernel matrix and streams it through ``sgemm`` once;
matrix-free fuses kernel construction and matvec into a single Metal
shader pass.

Reproducing the numbers
-----------------------

The C++ benchmark binaries live in ``benchmarks/`` and emit JSON-per-line:

.. code-block:: bash

   ./build.sh
   ./build/bench_paper > paper_results.json    # ~2 min full sweep
   ./build/bench_ski > ski_results.json         # SKI specifically
   ./build/bench_matvec > matvec_results.json   # matrix-free Kv

The Python comparison against GPyTorch:

.. code-block:: bash

   source .venv/bin/activate
   pip install torch gpytorch
   python3 benchmarks/bench_gpytorch.py > gpytorch_results.json

Both sides use identical input shapes and the same fp32 dtype, so the
numbers join cleanly on ``(method, N, M, D)``.