SKI (KISS-GP)
=============

Structured Kernel Interpolation (Wilson & Nickisch, 2015) takes the sparse
GP idea further: place inducing points on a **regular grid** so the
inducing-inducing kernel matrix :math:`K_{uu}` becomes
**Toeplitz-structured** (in 1D) or **Kronecker-Toeplitz** (in 2D / 3D).
Matrix-vector products against Toeplitz matrices can then be done with
**FFTs** in :math:`O(M \log M)`.

The approximation
-----------------

Let :math:`Z` be a regular grid of size :math:`M` covering the input
range. Define :math:`W \in \mathbb{R}^{N \times M}`, a **sparse cubic
interpolation matrix** that interpolates each training point onto its
4 (or 16, in 2D) nearest grid neighbors. Then:

.. math::

   K_{ff} \approx W\, K_{uu}\, W^\top.

Each row of :math:`W` has 4 nonzero entries (in 1D), so :math:`W` is
sparse with :math:`4N` nonzeros total. The product :math:`W \cdot v`
costs :math:`O(N)`.

Fast matrix-vector products
---------------------------

The key trick: a Toeplitz matrix :math:`T` embeds into a circulant matrix
:math:`C` of double the size, and **circulant matrix-vector products are
diagonalized by the FFT**. Concretely:

.. math::

   T v = \operatorname{first\;half\;of}\bigl(\,
        \mathcal{F}^{-1}\!\bigl(\mathcal{F}(c) \odot \mathcal{F}(\tilde v)\bigr)
        \,\bigr)

where :math:`c` is the first column of :math:`C` and :math:`\tilde v` is
:math:`v` zero-padded.

Putting it together, the SKI matrix-vector product for a test vector
:math:`v` is:

.. math::

   K_{ff}\,v \;\approx\; W\,\bigl(K_{uu}\,\bigl(W^\top v\bigr)\bigr)

with the middle :math:`K_{uu}\,(W^\top v)` evaluated by FFT. Each matvec
costs :math:`O(N + M \log M)`.

Inference
---------

With a fast :math:`K_y \, v`, the inference is a standard conjugate-gradient
solve. LightGP uses :math:`O(30)`–:math:`O(50)` CG iterations to converge
to a target residual; total fit cost is :math:`O(k \cdot (N + M \log M))`.

Variance estimation in SKI mode uses Hutchinson probes through the same
CG-with-FFT-matvec pipeline.

Where the FFT happens
---------------------

LightGP dispatches the inner FFT by backend:

* ``Backend.CUDA`` → cuFFT
* ``Backend.CPU`` on macOS → Apple Accelerate ``vDSP_DFT_Execute``
* ``Backend.CPU`` elsewhere → reference dense Toeplitz (no FFT yet)

``Backend.Auto`` correctly resolves to CUDA when available and to CPU
otherwise. On Apple Silicon, the vDSP backend is fast enough that SKI
reaches the same regime that needs CUDA elsewhere.

When SKI works (and when it doesn't)
------------------------------------

**SKI is the right tool when**:

* :math:`D \le 3` (the grid has :math:`M^D` cells)
* The data is stationary or near-stationary
* You want O(M log M) matvecs at very large N

**SKI is not the right tool when**:

* :math:`D > 3`: grid size :math:`M^D` makes the FFT too expensive
* Inputs are irregularly distributed in regions of very different density:
  the grid wastes coverage
* You're using a non-stationary kernel

For high-dimensional problems, prefer :doc:`sparse_gp` with farthest-point
or k-means inducing initialization.