Backends

class lightgp.Backend

Compute backend selector. Each model takes a backend= constructor argument; the default is Backend.Auto.

CPU: Reference C++ paths, or Apple Accelerate (macOS) / OpenBLAS (Linux) when compiled in. Cholesky always goes through CBLAS/LAPACK when available — on Apple Silicon this means the AMX matrix coprocessor, which is faster than the integrated GPU for moderate-N dense Cholesky.

Metal: Apple Metal compute shaders for RBF, Matérn (function-constant specialized), GEMM (naive / tiled / simdgroup-matrix), triangular solve, blocked Cholesky, and the matrix-free RBF kernel-vector product. Only available on Apple Silicon builds.

CUDA: cuBLAS GEMM, cuSOLVER Cholesky, cuFFT (for SKI), plus custom CUDA kernels for RBF / Matérn matrix construction and matrix-free matvec. Only available in CUDA-enabled builds.

Auto: Resolves at fit time based on the problem shape. See the dispatch heuristic in lightgp.resolve_auto_backend() (and the rules below).

Auto dispatch rules

On a CUDA-enabled build, Auto routes Cholesky solves to CUDA once \(N \ge 1024\), and routes CG and SKI solves to CUDA unconditionally.

On Apple Silicon (no CUDA), the rules are:

Condition	Resolution	Why
`Solver.SKI`	`CPU` (vDSP)	The Toeplitz FFT path lives only in Accelerate vDSP.
`Solver.CG` and \(N > 2000\)	`Metal`	Matrix-free RBF matvec on Metal is ~32× faster than the explicit-matrix matvec.
\(D \ge 16\) and \(N \ge 2000\)	`Metal`	Tiled / float4 kernel construction beats CPU once the kernel matrix is the dominant cost.
otherwise	`CPU`	Dense Cholesky on AMX is the fastest path for low-D, moderate-N.

These rules are empirical — see the benchmarks for the underlying measurements.
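The Apple Silicon table can be restated as a small function. This is a minimal sketch, not the library's implementation. It assumes the rows are checked top to bottom and that the first match wins. The `Backend` and `Solver` enums are local stand-ins, so the snippet runs without lightgp installed. `Solver.Cholesky` and `sketch_resolve_auto_apple` are hypothetical names used only for illustration.

```python
from enum import Enum, auto


class Backend(Enum):
    # Local stand-in mirroring the documented Backend members.
    CPU = auto()
    Metal = auto()
    CUDA = auto()
    Auto = auto()


class Solver(Enum):
    # Local stand-in; the member name Cholesky is an assumption.
    Cholesky = auto()
    CG = auto()
    SKI = auto()


def sketch_resolve_auto_apple(solver, n, d):
    """Restate the Apple Silicon (no CUDA) rules, assuming first match wins.

    n is the number of training points, d the input dimension.
    """
    if solver is Solver.SKI:
        # The Toeplitz FFT path exists only in Accelerate vDSP.
        return Backend.CPU
    if solver is Solver.CG and n > 2000:
        # Matrix-free RBF matvec on Metal.
        return Backend.Metal
    if d >= 16 and n >= 2000:
        # Kernel-matrix construction dominates the cost.
        return Backend.Metal
    # Low-D, moderate-N: dense Cholesky on AMX.
    return Backend.CPU


if __name__ == "__main__":
    for solver, n, d in [
        (Solver.SKI, 100_000, 2),
        (Solver.CG, 5_000, 2),
        (Solver.Cholesky, 2_000, 16),
        (Solver.Cholesky, 1_500, 3),
    ]:
        print(solver.name, n, d, "->", sketch_resolve_auto_apple(solver, n, d).name)
```

The thresholds are copied from the table as written, including the strict \(N > 2000\) for CG against the inclusive \(N \ge 2000\) for the high-D rule. For the actual behaviour, consult lightgp.resolve_auto_backend().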