SKI (KISS-GP)

Structured Kernel Interpolation (Wilson & Nickisch, 2015) takes the sparse GP idea further: place inducing points on a regular grid so the inducing-inducing kernel matrix \(K_{uu}\) becomes Toeplitz-structured (in 1D) or Kronecker-Toeplitz (in 2D / 3D). Matrix-vector products against Toeplitz matrices can then be done with FFTs in \(O(M \log M)\).

The approximation

Let \(Z\) be a regular grid of size \(M\) covering the input range. Define \(W \in \mathbb{R}^{N \times M}\), a sparse cubic interpolation matrix that interpolates each training point onto its 4 (or 16, in 2D) nearest grid neighbors. Then:

\[K_{ff} \approx W\, K_{uu}\, W^\top.\]

Each row of \(W\) has 4 nonzero entries (in 1D), so \(W\) is sparse with \(4N\) nonzeros total. The product \(W \cdot v\) costs \(O(N)\).

Fast matrix-vector products

The key trick: a Toeplitz matrix \(T\) embeds into a circulant matrix \(C\) of double the size, and circulant matrix-vector products are diagonalized by the FFT. Concretely:

\[T v = \operatorname{first\;half\;of}\bigl(\, \mathcal{F}^{-1}\!\bigl(\mathcal{F}(c) \odot \mathcal{F}(\tilde v)\bigr) \,\bigr)\]

where \(c\) is the first column of \(C\) and \(\tilde v\) is \(v\) zero-padded.

Putting it together, the SKI matrix-vector product for a test vector \(v\) is:

\[K_{ff}\,v \;\approx\; W\,\bigl(K_{uu}\,\bigl(W^\top v\bigr)\bigr)\]

with the middle \(K_{uu}\,(W^\top v)\) evaluated by FFT. Each matvec costs \(O(N + M \log M)\).

Inference

With a fast \(K_y \, v\), the inference is a standard conjugate-gradient solve. LightGP uses \(O(30)\)\(O(50)\) CG iterations to converge to a target residual; total fit cost is \(O(k \cdot (N + M \log M))\).

Variance estimation in SKI mode uses Hutchinson probes through the same CG-with-FFT-matvec pipeline.

Where the FFT happens

LightGP dispatches the inner FFT by backend:

  • Backend.CUDA → cuFFT

  • Backend.CPU on macOS → Apple Accelerate vDSP_DFT_Execute

  • Backend.CPU elsewhere → reference dense Toeplitz (no FFT yet)

Backend.Auto correctly resolves to CUDA when available and to CPU otherwise. On Apple Silicon, the vDSP backend is fast enough that SKI reaches the same regime that needs CUDA elsewhere.

When SKI works (and when it doesn’t)

SKI is the right tool when:

  • \(D \le 3\) (the grid has \(M^D\) cells)

  • The data is stationary or near-stationary

  • You want O(M log M) matvecs at very large N

SKI is not the right tool when:

  • \(D > 3\): grid size \(M^D\) makes the FFT too expensive

  • Inputs are irregularly distributed in regions of very different density: the grid wastes coverage

  • You’re using a non-stationary kernel

For high-dimensional problems, prefer Sparse GP (Titsias VFE) with farthest-point or k-means inducing initialization.