GP Regression Basics
====================

A Gaussian Process defines a distribution over functions:

.. math::

   f \sim \mathcal{GP}\!\bigl(m(x),\; k(x, x')\bigr)

It is fully characterized by a **mean function** :math:`m(x)` and a
**kernel** (covariance function) :math:`k(x, x')`. For any finite set of
inputs :math:`X = (x_1, \dots, x_N)`, the function values
:math:`f(X) = (f(x_1), \dots, f(x_N))` are jointly Gaussian:

.. math::

   f(X) \sim \mathcal{N}\!\bigl(m(X),\; K_{XX}\bigr),
   \qquad (K_{XX})_{ij} = k(x_i, x_j).

Regression posterior
--------------------

Given noisy observations :math:`y = f(X) + \varepsilon` with
:math:`\varepsilon \sim \mathcal{N}(0, \sigma_n^2 I)`, the predictive
distribution at a test point :math:`x_*` is also Gaussian:

.. math::

   p(f_* \mid X, y, x_*) = \mathcal{N}(\mu_*, \sigma_*^2),

with

.. math::

   \mu_*       &= m(x_*) + K_{*X}\, (K_{XX} + \sigma_n^2 I)^{-1} (y - m(X)), \\
   \sigma_*^2  &= k(x_*, x_*) - K_{*X}\, (K_{XX} + \sigma_n^2 I)^{-1} K_{X*}.

In LightGP, the matrix :math:`K_y := K_{XX} + \sigma_n^2 I` is factorized
once at ``fit()`` time (Cholesky, CG, or SKI depending on the solver); the
factor is then reused for every test point at ``predict()`` time.

Log marginal likelihood
-----------------------

To learn hyperparameters :math:`\theta` (length scale, signal variance,
noise variance), maximize the log marginal likelihood:

.. math::

   \log p(y \mid X, \theta) =
       -\tfrac{1}{2}\, y^\top K_y^{-1} y
       - \tfrac{1}{2}\, \log\lvert K_y \rvert
       - \tfrac{N}{2}\, \log(2\pi).

The three terms have intuitive interpretations:

* :math:`-\tfrac{1}{2}\, y^\top K_y^{-1} y` — **data fit**. Penalizes
  prediction errors weighted by the inverse covariance.
* :math:`-\tfrac{1}{2}\, \log\lvert K_y \rvert` — **complexity penalty**.
  Discourages overfitting by penalizing models that allow too many
  functions.
* :math:`-\tfrac{N}{2}\, \log(2\pi)` — constant, drops in optimization.

LightGP optimizes :math:`\log p` via Adam over the log-hyperparameters
:math:`[\log \ell, \log \sigma^2, \log \sigma_n^2]`. The legacy API uses
analytical gradients; the new ``Kernel``-object API uses central
finite differences so it supports arbitrary kernel compositions.

Why O(N³)?
----------

The cost of an exact GP fit is dominated by the Cholesky factorization of
:math:`K_y`, which is :math:`O(N^3)`. Predict variance also needs a
forward solve, :math:`O(N^2 M)` for :math:`M` test points.

In practice, exact GP fits comfortably for :math:`N \lesssim 5{,}000` on a
laptop. The next three pages describe how LightGP scales past that:

* :doc:`sparse_gp` — approximate :math:`K_y` with :math:`M \ll N` inducing
  points
* :doc:`ski` — when :math:`D \le 3`, replace :math:`K_y` with a
  Toeplitz-structured grid kernel and use FFTs
* :doc:`matrix_free` — solve the linear system iteratively without ever
  forming :math:`K_y` explicitly