SKI (KISS-GP)

Structured Kernel Interpolation (Wilson & Nickisch, 2015) takes the sparse GP idea further: place inducing points on a regular grid so the inducing-inducing kernel matrix \(K_{uu}\) becomes Toeplitz-structured (in 1D) or Kronecker-Toeplitz (in 2D / 3D). Matrix-vector products against Toeplitz matrices can then be done with FFTs in \(O(M \log M)\).

The approximation

Let \(Z\) be a regular grid of size \(M\) covering the input range. Define \(W \in \mathbb{R}^{N \times M}\), a sparse cubic interpolation matrix that interpolates each training point onto its 4 (or 16, in 2D) nearest grid neighbors. Then:

\[K_{ff} \approx W\, K_{uu}\, W^\top.\]

Each row of \(W\) has 4 nonzero entries (in 1D), so \(W\) is sparse with \(4N\) nonzeros total. The product \(W \cdot v\) costs \(O(N)\).

Fast matrix-vector products

The key trick: a Toeplitz matrix \(T\) embeds into a circulant matrix \(C\) of double the size, and circulant matrix-vector products are diagonalized by the FFT. Concretely:

\[T v = \operatorname{first\;half\;of}\bigl(\, \mathcal{F}^{-1}\!\bigl(\mathcal{F}(c) \odot \mathcal{F}(\tilde v)\bigr) \,\bigr)\]

where \(c\) is the first column of \(C\) and \(\tilde v\) is \(v\) zero-padded.

Putting it together, the SKI matrix-vector product for a test vector \(v\) is:

\[K_{ff}\,v \;\approx\; W\,\bigl(K_{uu}\,\bigl(W^\top v\bigr)\bigr)\]

with the middle \(K_{uu}\,(W^\top v)\) evaluated by FFT. Each matvec costs \(O(N + M \log M)\).

Inference

With a fast \(K_y \, v\), the inference is a standard conjugate-gradient solve. LightGP uses \(O(30)\)–\(O(50)\) CG iterations to converge to a target residual; total fit cost is \(O(k \cdot (N + M \log M))\).

Variance estimation in SKI mode uses Hutchinson probes through the same CG-with-FFT-matvec pipeline.

Where the FFT happens

LightGP dispatches the inner FFT by backend:

Backend.CUDA → cuFFT
Backend.CPU on macOS → Apple Accelerate vDSP_DFT_Execute
Backend.CPU elsewhere → reference dense Toeplitz (no FFT yet)

Backend.Auto correctly resolves to CUDA when available and to CPU otherwise. On Apple Silicon, the vDSP backend is fast enough that SKI reaches the same regime that needs CUDA elsewhere.

When SKI works (and when it doesn’t)

SKI is the right tool when:

\(D \le 3\) (the grid has \(M^D\) cells)
The data is stationary or near-stationary
You want O(M log M) matvecs at very large N

SKI is not the right tool when:

\(D > 3\): grid size \(M^D\) makes the FFT too expensive
Inputs are irregularly distributed in regions of very different density: the grid wastes coverage
You’re using a non-stationary kernel

For high-dimensional problems, prefer Sparse GP (Titsias VFE) with farthest-point or k-means inducing initialization.