Matrix-Free Conjugate Gradients
===============================

The GP posterior mean and variance both require solving
:math:`(K + \sigma_n^2 I)\, \alpha = y`. The textbook approach forms the
:math:`N \times N` matrix :math:`K_y` and Cholesky-factorizes it, costing
:math:`O(N^3)` time and :math:`O(N^2)` memory.

Conjugate gradients is an alternative that **never forms** :math:`K_y`
explicitly — it only needs a function that returns :math:`K_y\, v` for
arbitrary :math:`v`.

The CG algorithm
----------------

Starting from :math:`x_0 = 0` and :math:`r_0 = y`, repeat for
:math:`k = 0, 1, \dots`:

.. math::

   \alpha_k    &= \frac{r_k^\top r_k}{p_k^\top (K_y p_k)} \\
   x_{k+1}     &= x_k + \alpha_k p_k \\
   r_{k+1}     &= r_k - \alpha_k (K_y p_k) \\
   \beta_k     &= \frac{r_{k+1}^\top r_{k+1}}{r_k^\top r_k} \\
   p_{k+1}     &= r_{k+1} + \beta_k p_k

Each iteration performs **one** matrix-vector product :math:`K_y p_k` plus
a handful of vector dots and axpys. After :math:`k` iterations the
residual :math:`\lVert r_k \rVert` typically drops by a factor of
:math:`\mathcal{O}(\kappa(K_y)^{-1/2})` per step, where :math:`\kappa`
is the condition number.

In LightGP, ``Solver.CG`` runs until :math:`\lVert r_k \rVert / \lVert y \rVert
< 10^{-4}` or 1000 iterations, whichever comes first. Convergence is
typically 30–100 iterations for well-conditioned RBF kernels.

The matrix-free :math:`K_y v` shader
------------------------------------

The trick that makes this fast on Metal is implementing :math:`K_y\, v`
**without materializing** :math:`K_y`. The kernel matrix-vector product

.. math::

   (K_y v)_i = \sum_j k(x_i, x_j)\, v_j + \sigma_n^2\, v_i

can be computed in a single GPU shader pass: each output element :math:`i`
loops over all training points :math:`j`, evaluates the kernel function,
multiplies by :math:`v_j`, and accumulates — all without storing the
intermediate :math:`N \times N` matrix.

The result: memory stays :math:`O(N)` regardless of how large the kernel
matrix would be. At :math:`N = 20{,}000` the explicit kernel matrix is
1.6 GB; the matrix-free path needs ~80 KB for :math:`v` and :math:`w`,
plus the input :math:`X` itself.

LightGP's matrix-free shader lives at
``kernels/metal/rbf_matvec.metal`` and ``kernels/cuda/rbf_matvec.cu``. The
host-side CG loop uses these as the matvec callback:

.. code-block:: cpp

   auto K_matvec = [&](const Tensor& v) {
       return rbf_matvec_metal(X, v, length_scale, signal_var, noise_var);
   };
   cg_solve(K_matvec, y, alpha);

Hutchinson variance estimation
------------------------------

The predictive variance for a test point :math:`x_*` is

.. math::

   \sigma_*^2 = k(x_*, x_*) - K_{*X}\, K_y^{-1} K_{X*}.

The middle term needs another solve. In matrix-free CG mode, LightGP uses
**Hutchinson probes**: sample :math:`p` Rademacher vectors :math:`z_j \in
\{-1, +1\}^N`, solve :math:`K_y\, u_j = z_j` once per probe, and estimate

.. math::

   K_y^{-1}\, K_{*X}^\top \;\approx\;
       \frac{1}{p} \sum_{j=1}^p z_j \odot u_j.

This gives a low-rank approximation of :math:`K_y^{-1}` that we then
contract against :math:`K_{*X}`. The variance estimate is biased low but
the bias shrinks as :math:`O(1/p)`.

Log-marginal-likelihood via SLQ
-------------------------------

The :math:`\log\lvert K_y \rvert` term in the log marginal likelihood can
also be computed without ever factorizing :math:`K_y`, using **stochastic
Lanczos quadrature** (Ubaru et al., 2017): run :math:`k` steps of
Lanczos with random initial vectors, extract the tridiagonal matrix
:math:`T_k`, and approximate

.. math::

   \log\lvert K_y \rvert \approx N \cdot \mathbb{E}_z\!\left[
       z^\top \log(K_y) z
   \right]
   \approx \frac{N}{p}\sum_{j=1}^p z_j^\top\, \operatorname{e}_1^\top
       Q_k^\top \log(T_k)\, Q_k\, \operatorname{e}_1.

The :math:`\log(T_k)` is cheap because :math:`T_k` is small
(:math:`k \times k`, typically :math:`k = 30`). LightGP's
``solvers/cpu/slq_cpu.cpp`` implements this and is shared between CG and
SKI modes.

When CG (and matrix-free) wins
------------------------------

* The kernel matrix would exceed RAM (:math:`N \gtrsim 20{,}000`, GB scale)
* You have a GPU with much higher memory bandwidth than CPU
* Variance estimates can tolerate Hutchinson noise

For dense :math:`N < 5{,}000` problems, exact Cholesky is faster and
gives noiseless variance — use :doc:`gp_basics` directly.