SKI (KISS-GP)
Structured Kernel Interpolation (Wilson & Nickisch, 2015) takes the sparse GP idea further: place inducing points on a regular grid so the inducing-inducing kernel matrix \(K_{uu}\) becomes Toeplitz-structured (in 1D) or Kronecker-Toeplitz (in 2D / 3D). Matrix-vector products against Toeplitz matrices can then be done with FFTs in \(O(M \log M)\).
The approximation
Let \(Z\) be a regular grid of size \(M\) covering the input range. Define \(W \in \mathbb{R}^{N \times M}\), a sparse cubic interpolation matrix that interpolates each training point onto its 4 (or 16, in 2D) nearest grid neighbors. Then:
Each row of \(W\) has 4 nonzero entries (in 1D), so \(W\) is sparse with \(4N\) nonzeros total. The product \(W \cdot v\) costs \(O(N)\).
Fast matrix-vector products
The key trick: a Toeplitz matrix \(T\) embeds into a circulant matrix \(C\) of double the size, and circulant matrix-vector products are diagonalized by the FFT. Concretely:
where \(c\) is the first column of \(C\) and \(\tilde v\) is \(v\) zero-padded.
Putting it together, the SKI matrix-vector product for a test vector \(v\) is:
with the middle \(K_{uu}\,(W^\top v)\) evaluated by FFT. Each matvec costs \(O(N + M \log M)\).
Inference
With a fast \(K_y \, v\), the inference is a standard conjugate-gradient solve. LightGP uses \(O(30)\)–\(O(50)\) CG iterations to converge to a target residual; total fit cost is \(O(k \cdot (N + M \log M))\).
Variance estimation in SKI mode uses Hutchinson probes through the same CG-with-FFT-matvec pipeline.
Where the FFT happens
LightGP dispatches the inner FFT by backend:
Backend.CUDA→ cuFFTBackend.CPUon macOS → Apple AcceleratevDSP_DFT_ExecuteBackend.CPUelsewhere → reference dense Toeplitz (no FFT yet)
Backend.Auto correctly resolves to CUDA when available and to CPU
otherwise. On Apple Silicon, the vDSP backend is fast enough that SKI
reaches the same regime that needs CUDA elsewhere.
When SKI works (and when it doesn’t)
SKI is the right tool when:
\(D \le 3\) (the grid has \(M^D\) cells)
The data is stationary or near-stationary
You want O(M log M) matvecs at very large N
SKI is not the right tool when:
\(D > 3\): grid size \(M^D\) makes the FFT too expensive
Inputs are irregularly distributed in regions of very different density: the grid wastes coverage
You’re using a non-stationary kernel
For high-dimensional problems, prefer Sparse GP (Titsias VFE) with farthest-point or k-means inducing initialization.