Architecture ============ LightGP is organized in five layers. Each one only knows about the layers beneath it; the inference layer at the top is portable across all backends. :: ┌───────────────────────────────────────────────────────────┐ │ Inference GPExact, GPSparse │ │ (inference/*) optimize(), fit(), predict() │ ├───────────────────────────────────────────────────────────┤ │ Solvers Cholesky, CG, SLQ, SKI │ │ (solvers/cpu, /metal, matrix-free matvec callback │ │ /cuda) │ ├───────────────────────────────────────────────────────────┤ │ Kernels + Mean Kernel ABC + concrete kernels │ │ (kernels/*, core/mean.h) + Sum/Product/Scale composition │ ├───────────────────────────────────────────────────────────┤ │ Dispatch routes by Backend + KernelType │ │ (core/dispatch.cpp) to CPU / Metal / CUDA paths │ ├───────────────────────────────────────────────────────────┤ │ Backends Accelerate (CBLAS / LAPACK / │ │ (core/blas_accel, AMX / vDSP), Metal shaders, │ │ kernels/metal, cuBLAS / cuSOLVER / cuFFT │ │ kernels/cuda) │ └───────────────────────────────────────────────────────────┘ Tensor ------ The Tensor type (``core/tensor.h``) is the basic data container — a 2D row-major fp32 matrix with a small set of operations: - ``matmul`` — routes to Accelerate ``cblas_sgemm`` (macOS) or OpenBLAS on Linux, fallback reference implementation otherwise - ``add_jitter`` — diagonal addition used by Cholesky retries - ``transpose``, ``add``, elementwise ops - ``randn``, ``zeros`` factory methods All higher-level code allocates and passes ``Tensor`` values. The class intentionally has no shape-polymorphic broadcasting machinery — every kernel or solver that needs reshaping does it explicitly. Kernel hierarchy ---------------- The abstract ``Kernel`` base in ``kernels/kernel_base.h`` declares: .. code-block:: cpp virtual Tensor compute(const Tensor& X1, const Tensor& X2, Backend) const = 0; virtual Tensor compute_diag(const Tensor& X) const = 0; virtual int num_params() const = 0; virtual std::vector get_log_params() const = 0; virtual void set_log_params(const std::vector& p) = 0; virtual std::string name() const = 0; Concrete kernels (``RBFKernel``, ``MaternKernel``, ``PeriodicKernel``, ``LinearKernel``) implement ``compute`` by calling into the dispatch layer. Composite kernels (``SumKernel``, ``ProductKernel``, ``ScaleKernel``) wrap ``std::shared_ptr`` children and forward to them. The parameter ordering for a composite is **stable and left-to-right**: ``(k1 + k2).get_log_params()`` returns ``k1.get_log_params() ++ k2.get_log_params()``. ``Scale(base)`` prepends its single ``log_scale`` parameter in front of the base's parameters. Dispatch -------- ``core/dispatch.cpp`` is the single place where ``Backend`` enums are translated into actual function calls: .. code-block:: cpp Tensor dispatch_kernel(const Tensor& X1, const Tensor& X2, float length_scale, float signal_variance, KernelType type, Backend backend); bool dispatch_cholesky_with_jitter(const Tensor& K, Tensor& L, float& jitter_used, Backend backend); Tensor dispatch_cholesky_solve(const Tensor& L, const Tensor& b, Backend backend); Tensor dispatch_forward_solve(const Tensor& L, const Tensor& b, Backend backend); Each function inspects the ``Backend`` argument, optionally consults a shape-driven heuristic for ``Backend::Auto``, and forwards to the platform-specific implementation. Backends that aren't compiled in fall back to CPU. Solvers ------- - **Cholesky** (``solvers/cpu/cholesky_cpu``, ``solvers/metal/cholesky_metal``, ``solvers/cuda/cholesky_cuda``) — exact :math:`O(N^3)` factorization. - **Conjugate Gradients** (``solvers/cpu/cg_cpu`` plus a matrix-free variant on Metal and CUDA) — iterative :math:`O(N^2 k)`. - **Stochastic Lanczos Quadrature** (``solvers/cpu/slq_cpu``) — log determinant estimate for CG mode. - **SKI** (``inference/ski.cpp`` + ``ski_accel.cpp`` for vDSP / ``ski_cuda.cu`` for cuFFT) — FFT-accelerated matvec on a grid. Inference --------- ``GPExact`` and ``GPSparse`` are the user-facing classes. ``GPExact`` carries either the legacy ``GPHyperparams`` (kept for binary compatibility) or the new ``std::shared_ptr`` + ``std::shared_ptr`` constructor. ``fit()`` and ``predict()`` branch on which path was taken; ``optimize()`` uses analytical gradients in the legacy path and central finite differences in the new path so it supports arbitrary kernel compositions. Python bindings --------------- ``python/bindings.cpp`` (pybind11) is a thin shim that: 1. Exposes the ``Kernel`` hierarchy with ``std::shared_ptr`` ownership and ``__add__``/``__mul__`` Python operators. 2. Converts ``numpy.ndarray`` <-> ``Tensor`` with ``forcecast`` to fp32 in ``numpy_to_tensor``. 3. Wraps ``GPExact::predict`` so it returns a ``dict`` with ``'mean'`` and ``'var'`` keys instead of mutating two output parameters. The Python package name is ``lightgp``; the import is ``import lightgp as gp``.