Book Notes: Gaussian Processes for Machine Learning -- Ch2 Gaussian Process Regression

Overview of Gaussian processes (GP)

  • The problem of learning in GP is exactly the problem of finding suitable properties for the covariance function

  • In this book, the design matrix is defined slightly differently from common statistical textbooks: each column of the design matrix is a case, and each row is a covariate

Weight-space View

A regression model with basis functions

  • Basis function \(\boldsymbol\phi(\mathbf{x})\): maps a \(D\)-dimensional input vector \(\mathbf{x}\) into an \(N\)-dimensional feature space

  • \(\boldsymbol\Phi(\mathbf{X}) \in \mathbb{R}^{N \times n}\): the aggregation of columns \(\boldsymbol\phi(\mathbf{x})\) for all \(n\) cases in the training data

  • A regression model \[ f(\mathbf{x}) = \boldsymbol\phi(\mathbf{x})^\top \mathbf{w}, \quad y = f(\mathbf{x}) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2_n) \]

  • We use a zero mean Gaussian prior on the \(N\)-dimensional unknown weights \(\mathbf{w}\) (aka regression coefficients) \[ \mathbf{w} \sim \mathcal{N}(\mathbf{0}, \boldsymbol\Sigma_p) \]

Predictive distribution

  • For a new test point \(\mathbf{x}_*\), the predictive distribution is \[\begin{align*} f_* \mid \mathbf{x}_*, \mathbf{X}, \mathbf{y} & \sim \mathcal{N}\left(\frac{1}{\sigma^2_n} \boldsymbol\phi_*^\top \mathbf{A}^{-1}\boldsymbol\Phi \mathbf{y},\quad \boldsymbol\phi_*^\top \mathbf{A}^{-1} \boldsymbol\phi_* \right),\\ \boldsymbol\phi_* &= \boldsymbol\phi(\mathbf{x}_*), \quad \boldsymbol\Phi = \boldsymbol\Phi(\mathbf{X}), \quad \mathbf{A} = \frac{1}{\sigma^2_n} \boldsymbol\Phi \boldsymbol\Phi^\top + \boldsymbol\Sigma_p^{-1} \end{align*}\]

  • When making predictions, we need to invert the \(N\times N\) matrix \(\mathbf{A}\), which may be inconvenient if \(N\), the dimension of the feature space, is large; a numerical sketch of this form follows this list
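
A minimal sketch of the weight-space predictive distribution, assuming a hypothetical quadratic basis \(\boldsymbol\phi(x) = (1, x, x^2)^\top\), an identity prior covariance \(\boldsymbol\Sigma_p = \mathbf{I}\), and synthetic toy data; none of these choices come from the book.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    """Hypothetical basis: map a scalar input to an N = 3 dimensional feature vector."""
    return np.array([1.0, x, x**2])

# Synthetic toy data: y = f(x) + noise
X = np.array([-1.5, -0.5, 0.3, 1.2])
y = 0.5 * X**2 - X + 0.3 + 0.1 * rng.standard_normal(X.shape)
sigma_n2 = 0.1**2                                  # noise variance sigma_n^2
Sigma_p = np.eye(3)                                # prior covariance of the weights

Phi = np.stack([phi(x) for x in X], axis=1)        # N x n matrix of features
A = Phi @ Phi.T / sigma_n2 + np.linalg.inv(Sigma_p)

phi_star = phi(0.8)                                # features of the test point x_* = 0.8

# Predictive mean and variance of f_* in the weight-space form
mean_f_star = phi_star @ np.linalg.solve(A, Phi @ y) / sigma_n2
var_f_star = phi_star @ np.linalg.solve(A, phi_star)
print(mean_f_star, var_f_star)
```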

Rewriting the predictive distribution using the matrix inversion lemma

  • Matrix inversion lemma (verified numerically after this list): \(\mathbf{Z} \in \mathbb{R}^{n \times n}\), \(\mathbf{W} \in \mathbb{R}^{m \times m}\), \(\mathbf{U}, \mathbf{V}\in \mathbb{R}^{n \times m}\) \[ \left( \mathbf{Z} + \mathbf{U} \mathbf{W} \mathbf{V}^\top \right)^{-1} = \mathbf{Z}^{-1} - \mathbf{Z}^{-1} \mathbf{U} \left( \mathbf{W}^{-1} + \mathbf{V}^\top \mathbf{Z}^{-1} \mathbf{U}\right)^{-1} \mathbf{V}^\top\mathbf{Z}^{-1} \]

  • We can rewrite the predictive distribution on the previous page as \[\begin{align}\label{eq:weight_space_prediction} f_* \mid \mathbf{x}_*, \mathbf{X}, \mathbf{y} & \sim \mathcal{N}\left( \boldsymbol\phi_*^\top \boldsymbol\Sigma_p \boldsymbol\Phi \left(\mathbf{K} + \sigma^2_n \mathbf{I} \right)^{-1}\mathbf{y}, \right.\\ \nonumber & ~~~~~~~~~\left. \boldsymbol\phi_*^\top \boldsymbol\Sigma_p \boldsymbol\phi_* - \boldsymbol\phi_*^\top \boldsymbol\Sigma_p \boldsymbol\Phi \left(\mathbf{K} + \sigma^2_n \mathbf{I} \right)^{-1} \boldsymbol\Phi^\top \boldsymbol\Sigma_p \boldsymbol\phi_* \right),\\ \nonumber \mathbf{K} &= \boldsymbol\Phi^\top \boldsymbol\Sigma_p \boldsymbol\Phi \end{align}\]
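
As a quick sanity check of the lemma itself (independent of the GP setting), the sketch below verifies the identity on randomly generated matrices; the sizes \(n = 5\), \(m = 3\) are arbitrary, and \(\mathbf{Z}\), \(\mathbf{W}\) are made symmetric positive definite so that all inverses exist.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 3
Z = rng.standard_normal((n, n)); Z = Z @ Z.T + n * np.eye(n)   # SPD n x n
W = rng.standard_normal((m, m)); W = W @ W.T + m * np.eye(m)   # SPD m x m
U = rng.standard_normal((n, m))
V = rng.standard_normal((n, m))

lhs = np.linalg.inv(Z + U @ W @ V.T)
rhs = (np.linalg.inv(Z)
       - np.linalg.inv(Z) @ U
         @ np.linalg.inv(np.linalg.inv(W) + V.T @ np.linalg.inv(Z) @ U)
         @ V.T @ np.linalg.inv(Z))

print(np.allclose(lhs, rhs))   # True
```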

Kernel and the kernel trick

  • In the predictive distribution on the previous page, the feature space always enters in the form of the kernel \(k(\cdot, \cdot)\): \[ k(\mathbf{x}, \mathbf{x}^\prime) = \boldsymbol\phi(\mathbf{x})^\top\boldsymbol\Sigma_p \boldsymbol\phi(\mathbf{x}^\prime), \] where \(\mathbf{x}, \mathbf{x}^\prime\) are in either the training or the test sets

  • Moreover, we can define \[ \boldsymbol\psi(\mathbf{x}) = \boldsymbol\Sigma_p^{1/2} \boldsymbol\phi(\mathbf{x}), \] so that the kernel has a simple dot product representation \[ k(\mathbf{x}, \mathbf{x}^\prime) = \boldsymbol\psi(\mathbf{x}) \cdot \boldsymbol\psi(\mathbf{x}^\prime) \]

  • Kernel trick: if an algorithm is defined solely in terms of inner products in input space, then it can be lifted into feature space by replacing occurrences of those inner products by \(k(\mathbf{x}, \mathbf{x}^\prime)\); a small numerical check follows
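
A small check of the dot-product representation, assuming the same hypothetical quadratic basis as before and an illustrative diagonal \(\boldsymbol\Sigma_p\); the Cholesky factor is used here as one convenient square root of \(\boldsymbol\Sigma_p\).

```python
import numpy as np

def phi(x):
    """Hypothetical quadratic basis."""
    return np.array([1.0, x, x**2])

Sigma_p = np.diag([1.0, 0.5, 0.25])            # illustrative prior covariance
L = np.linalg.cholesky(Sigma_p)                # Sigma_p = L L^T; any square root works

def k(x, xp):
    """Kernel k(x, x') = phi(x)^T Sigma_p phi(x')."""
    return phi(x) @ Sigma_p @ phi(xp)

def psi(x):
    """Feature map psi(x) such that k(x, x') = psi(x) . psi(x')."""
    return L.T @ phi(x)

print(np.isclose(k(0.3, -1.2), psi(0.3) @ psi(-1.2)))   # True
```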

Function-space View

Gaussian process: definition

  • A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution

  • A GP is completely specified by its mean function \(m(\mathbf{x})\) and covariance function \(k(\mathbf{x}, \mathbf{x}^\prime)\) \[ f(\mathbf{x}) \sim \mathcal{GP}\left(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}^\prime) \right) \]

  • Usually the prior mean function is set to zero

  • Bayesian linear regression as a Gaussian process (a Monte Carlo check follows this list) \[ f(\mathbf{x}) = \boldsymbol\phi(\mathbf{x})^\top \mathbf{w}, \quad \mathbf{w} \sim \mathcal{N}(\mathbf{0}, \boldsymbol\Sigma_p) \] Here, the GP mean function and the covariance function are \[ m(\mathbf{x}) = 0, \quad k(\mathbf{x}, \mathbf{x}^\prime) = \boldsymbol\phi(\mathbf{x})^\top \boldsymbol\Sigma_p \boldsymbol\phi(\mathbf{x}^\prime) \]
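
A Monte Carlo sanity check of this correspondence, assuming the same hypothetical quadratic basis and an illustrative diagonal \(\boldsymbol\Sigma_p\): the empirical covariance of \(f(\mathbf{x}) = \boldsymbol\phi(\mathbf{x})^\top \mathbf{w}\) over many draws of \(\mathbf{w}\) should approach \(\boldsymbol\phi(\mathbf{x})^\top \boldsymbol\Sigma_p \boldsymbol\phi(\mathbf{x}^\prime)\).

```python
import numpy as np

rng = np.random.default_rng(3)

def phi(x):
    """Hypothetical quadratic basis."""
    return np.array([1.0, x, x**2])

Sigma_p = np.diag([1.0, 0.5, 0.25])                   # illustrative prior covariance
xs = np.array([-1.0, 0.0, 0.5, 2.0])                  # a few input locations
Phi = np.stack([phi(x) for x in xs], axis=1)          # N x n feature matrix

# Draw many weight vectors from the prior and evaluate f at the inputs xs
W = rng.multivariate_normal(np.zeros(3), Sigma_p, size=200_000)
F = W @ Phi                                           # each row is one function draw

empirical_cov = np.cov(F, rowvar=False)
analytic_cov = Phi.T @ Sigma_p @ Phi                  # k(x, x') = phi^T Sigma_p phi

print(np.max(np.abs(empirical_cov - analytic_cov)))   # small (Monte Carlo error)
```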

The squared exponential covariance function

  • In this chapter, the squared exponential (SE) covariance function will be used \[ \text{cov}\left(f(\mathbf{x}), f(\mathbf{x}^\prime) \right) = k(\mathbf{x}, \mathbf{x}^\prime) = \exp\left(-\frac{1}{2}\left|\mathbf{x}- \mathbf{x}^\prime \right|^2 \right) \]

    • By replacing \(\left|\mathbf{x}- \mathbf{x}^\prime \right|\) by \(\left|\mathbf{x}- \mathbf{x}^\prime \right|/\ell\) for some positive constant \(\ell\), we can change the characteristic length-scale of the process

    • Note that the covariance between the outputs is written as a function of the inputs

    • The squared exponential covariance function corresponds to a Bayesian linear regression model with an infinite number of basis functions

    • Actually for every positive definite covariance function \(k(\cdot, \cdot)\), there exists a (possibly infinite) expansion in terms of basis functions

Three functions drawn at random from a GP prior (left) and their posteriors (right)

  • In both plots, the shaded area is the pointwise mean plus and minus two times the standard deviation at each input value; a sketch of drawing such prior samples follows
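
A sketch of how such prior draws can be generated, using the SE covariance with unit length-scale on a grid of test inputs; the grid, random seed, and jitter term are incidental choices.

```python
import numpy as np

rng = np.random.default_rng(42)

def se_kernel(A, B, ell=1.0):
    """Squared exponential covariance matrix between 1-D input sets A and B."""
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

xs = np.linspace(-5, 5, 101)                       # grid of test inputs
K = se_kernel(xs, xs) + 1e-8 * np.eye(len(xs))     # small jitter for numerical stability

# Three functions drawn at random from the zero-mean GP prior
draws = rng.multivariate_normal(np.zeros(len(xs)), K, size=3)
print(draws.shape)                                 # (3, 101)
```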

Prediction with noise-free observations

  • Suppose we have noise-free observations \(\{\left(\mathbf{x}_i, f_i\right): i = 1, \ldots, n\}\)

  • According to the GP prior, the joint distribution of the training outputs \(\mathbf{f}\) and the test outputs \(\mathbf{f}_*\) is \[ \left[ \begin{array}{l} \mathbf{f}\\ \mathbf{f}_* \end{array} \right] \sim \mathcal{N}\left(\mathbf{0}, \left[ \begin{array}{ll} K(\mathbf{X}, \mathbf{X}) & K(\mathbf{X}, \mathbf{X}_*)\\ K(\mathbf{X}_*, \mathbf{X}) & K(\mathbf{X}_*, \mathbf{X}_*) \end{array} \right] \right) \]

  • By conditioning the joint Gaussian prior on the observations, we get the posterior distribution \[\begin{align*} \mathbf{f}_* \mid \mathbf{X}_*, \mathbf{X}, \mathbf{f} &\sim \mathcal{N}\left( K(\mathbf{X}_*, \mathbf{X}) K(\mathbf{X}, \mathbf{X})^{-1} \mathbf{f}, \right.\\ &~~~~~~~~~~\left. K(\mathbf{X}_*, \mathbf{X}_*) - K(\mathbf{X}_*, \mathbf{X}) K(\mathbf{X}, \mathbf{X})^{-1}K(\mathbf{X}, \mathbf{X}_*) \right) \end{align*}\]
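
A minimal sketch of this conditioning step with the SE kernel above; the training pairs \((\mathbf{X}, \mathbf{f})\) are made-up values.

```python
import numpy as np

def se_kernel(A, B, ell=1.0):
    """Squared exponential covariance matrix between 1-D input sets A and B."""
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

X = np.array([-4.0, -1.0, 0.5, 2.0])     # training inputs
f = np.array([-1.0, 1.0, 0.5, -0.5])     # noise-free training outputs
X_star = np.linspace(-5, 5, 101)         # test inputs

K_xx = se_kernel(X, X)                   # K(X, X)
K_sx = se_kernel(X_star, X)              # K(X_*, X)
K_ss = se_kernel(X_star, X_star)         # K(X_*, X_*)

# Posterior mean and covariance of f_* given the noise-free observations
mean_star = K_sx @ np.linalg.solve(K_xx, f)
cov_star = K_ss - K_sx @ np.linalg.solve(K_xx, K_sx.T)
print(mean_star.shape, cov_star.shape)   # (101,) (101, 101)
```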

Prediction with noisy observations

  • With noisy observations \(y = f(\mathbf{x}) + \epsilon\), the covariance becomes \[ \text{cov}(\mathbf{y}) = K(\mathbf{X}, \mathbf{X}) + \sigma^2_n \mathbf{I} \]

  • Thus, the joint prior distribution becomes \[ \left[ \begin{array}{l} \mathbf{y}\\ \mathbf{f}_* \end{array} \right] \sim \mathcal{N}\left(\mathbf{0}, \left[ \begin{array}{cc} K(\mathbf{X}, \mathbf{X}) + \sigma^2_n \mathbf{I} & K(\mathbf{X}, \mathbf{X}_*)\\ K(\mathbf{X}_*, \mathbf{X}) & K(\mathbf{X}_*, \mathbf{X}_*) \end{array} \right] \right) \]

  • Key predictive equation for GP regression \[\begin{align}\label{eq:function_space_prediction} \mathbf{f}_* \mid \mathbf{X}_*, \mathbf{X}, \mathbf{y} &\sim \mathcal{N}\left( \bar{\mathbf{f}}_*, \text{cov}(\mathbf{f}_*)\right), \quad \text{where}\\ \nonumber \bar{\mathbf{f}}_* & = K(\mathbf{X}_*, \mathbf{X}) \left[K(\mathbf{X}, \mathbf{X})+ \sigma^2_n \mathbf{I}\right]^{-1} \mathbf{y}\\ \nonumber \text{cov}(\mathbf{f}_*) & = K(\mathbf{X}_*, \mathbf{X}_*) - K(\mathbf{X}_*, \mathbf{X}) \left[K(\mathbf{X}, \mathbf{X})+ \sigma^2_n \mathbf{I}\right]^{-1}K(\mathbf{X}, \mathbf{X}_*) \end{align}\]
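
In code, the only change from the noise-free case is adding \(\sigma^2_n \mathbf{I}\) to the training covariance; a sketch on synthetic data (a sinusoid plus noise, an arbitrary choice) follows.

```python
import numpy as np

rng = np.random.default_rng(0)

def se_kernel(A, B, ell=1.0):
    """Squared exponential covariance matrix between 1-D input sets A and B."""
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

X = np.linspace(-4, 4, 20)                          # training inputs
sigma_n2 = 0.1**2                                   # noise variance
y = np.sin(X) + np.sqrt(sigma_n2) * rng.standard_normal(X.shape)
X_star = np.linspace(-5, 5, 101)                    # test inputs

K_noisy = se_kernel(X, X) + sigma_n2 * np.eye(len(X))
K_sx = se_kernel(X_star, X)
K_ss = se_kernel(X_star, X_star)

f_bar_star = K_sx @ np.linalg.solve(K_noisy, y)                 # predictive mean
cov_f_star = K_ss - K_sx @ np.linalg.solve(K_noisy, K_sx.T)     # predictive covariance
print(f_bar_star.shape, cov_f_star.shape)                       # (101,) (101, 101)
```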

Correspondence with weight-space view

  • Connection between the function-space view, Eq. \(\eqref{eq:function_space_prediction}\), and the weight-space view, Eq. \(\eqref{eq:weight_space_prediction}\): \[ K(C, D) = \boldsymbol\Phi(C)^\top \boldsymbol\Sigma_p \boldsymbol\Phi(D) \] where \(C, D\) stand for either \(\mathbf{X}\) or \(\mathbf{X}_*\)

  • Thus, for any set of basis functions, we can compute the corresponding covariance function as \[ k(\mathbf{x}, \mathbf{x}^\prime) = \boldsymbol\phi(\mathbf{x})^\top \boldsymbol\Sigma_p \boldsymbol\phi(\mathbf{x}^\prime) \]

  • On the other hand, for every positive definite covariance function \(k\), there exists a possibly infinite expansion in terms of basis functions

Predictive distribution for a single test point \(\mathbf{x}_*\)

  • Denote \(K = K(\mathbf{X}, \mathbf{X})\) and \(\mathbf{k}_* = K(\mathbf{X}, \mathbf{x}_*)\), then the mean and variance of the posterior predictive distribution are \[\begin{align}\label{eq:predictive_mean} \bar{\mathbf{f}}_* & = \mathbf{k}_*^\top\left(K + \sigma^2_n \mathbf{I} \right)^{-1}\mathbf{y},\\ \label{eq:predictive_covariance} \mathbb{V}(\mathbf{f}_*) & = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^\top\left(K + \sigma^2_n \mathbf{I} \right)^{-1}\mathbf{k}_* \end{align}\]

Predictive distribution mean as a linear predictor

  • The mean prediction, Eq. \(\eqref{eq:predictive_mean}\), is a linear predictor, i.e., it is a linear combination of the observations \(\mathbf{y}\)

  • Another way to look at this equation is to see it as a linear combination of \(n\) kernel functions \[ \bar{f}(\mathbf{x}_*) = \sum_{i=1}^n \alpha_i k(\mathbf{x}_i, \mathbf{x}_*), \quad \boldsymbol\alpha = \left(K + \sigma^2_n \mathbf{I} \right)^{-1}\mathbf{y} \]
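
In practice this means \(\boldsymbol\alpha\) can be solved for once and reused for every test point; a tiny sketch with the same synthetic data as above.

```python
import numpy as np

rng = np.random.default_rng(0)

def se_kernel(A, B, ell=1.0):
    """Squared exponential covariance matrix between 1-D input sets A and B."""
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

X = np.linspace(-4, 4, 20)
sigma_n2 = 0.1**2
y = np.sin(X) + np.sqrt(sigma_n2) * rng.standard_normal(X.shape)

# alpha depends only on the training data, so it can be precomputed
alpha = np.linalg.solve(se_kernel(X, X) + sigma_n2 * np.eye(len(X)), y)

x_star = 0.7
k_star = se_kernel(X, np.array([x_star]))[:, 0]    # k(x_i, x_*) for i = 1..n
f_bar_star = k_star @ alpha                        # sum_i alpha_i k(x_i, x_*)
print(f_bar_star)
```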

About the predictive distribution variance

  • The predictive variance, Eq. \(\eqref{eq:predictive_covariance}\), does not depend on the observed targets \(\mathbf{y}\), but only on the inputs; this is a property of the Gaussian distribution

  • The noisy prediction of \(\mathbf{y}_*\): simply add \(\sigma^2_n \mathbf{I}\) to the variance \[ \mathbf{y}_* \mid \mathbf{x}_*, \mathbf{X}, \mathbf{y} \sim \mathcal{N} \left(\bar{\mathbf{f}}_*, \mathbb{V}(\mathbf{f}_*) + \sigma^2_n \mathbf{I} \right) \]

Cholesky decomposition and GP regression algorithm

Cholesky decomposition

  • Cholesky decomposition of a symmetric, positive definite matrix \(\mathbf{A}\) \[ \mathbf{A} = \mathbf{L}\mathbf{L}^\top, \] where \(\mathbf{L}\) is a lower triangular matrix, called the Cholesky factor

  • Cholesky decomposition is useful for solving linear systems with symmetric, positive definite coefficient matrix: to solve \(\mathbf{A}\mathbf{x} = \mathbf{b}\)
    • First solve the triangular system \(\mathbf{L}\mathbf{y} = \mathbf{b}\) by forward substitution
    • Then the triangular system \(\mathbf{L}^\top\mathbf{x} = \mathbf{y}\) by back substitution
  • Backslash operator: \(\mathbf{A}\backslash\mathbf{b}\) is the vector \(\mathbf{x}\) which solves \(\mathbf{A}\mathbf{x} = \mathbf{b}\)

    • Under Cholesky decomposition, \[ \mathbf{x} = \mathbf{A}\backslash\mathbf{b} = \mathbf{L}^\top \backslash \left( \mathbf{L} \backslash \mathbf{b}\right) \]
  • The computation of the Cholesky factor \(\mathbf{L}\) is considered numerically extremely stable and takes time \(n^3/6\); a sketch of this solve pattern follows
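
A sketch of the backslash pattern with a Cholesky factor, using `numpy` for the factorization and `scipy.linalg.solve_triangular` for the forward and back substitutions; the matrix below is a random symmetric positive definite example.

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))
A = A @ A.T + n * np.eye(n)                      # symmetric positive definite
b = rng.standard_normal(n)

L = np.linalg.cholesky(A)                        # A = L L^T, L lower triangular
z = solve_triangular(L, b, lower=True)           # forward substitution: L z = b
x = solve_triangular(L.T, z, lower=False)        # back substitution:   L^T x = z

print(np.allclose(A @ x, b))                     # True
```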

Algorithm: predictions and log marginal likelihood for GP regression

  • Input: \(\mathbf{X}, \mathbf{y}, k, \sigma^2_n, \mathbf{x}_*\)
  1. \(\mathbf{L} = \text{cholesky} \left(K + \sigma^2_n \mathbf{I} \right)\)

  2. \(\boldsymbol\alpha = \mathbf{L}^\top \backslash \left( \mathbf{L} \backslash \mathbf{y}\right)\)

  3. \(\bar{f}_* = \mathbf{k}_*^\top \boldsymbol\alpha\)

  4. \(\mathbf{v} = \mathbf{L} \backslash \mathbf{k}_*\)

  5. \(\mathbb{V}(\mathbf{f}_*) = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{v}^\top \mathbf{v}\)

  6. \(\log p(\mathbf{y}\mid \mathbf{X}) = -\frac{1}{2}\mathbf{y}^\top \boldsymbol\alpha - \sum_i \log L_{ii} - \frac{n}{2}\log 2\pi\)

  • Return: \(\bar{f}_*, \mathbb{V}(\mathbf{f}_*), \log p(\mathbf{y}\mid \mathbf{X})\)

  • Computational complexity: \(n^3/6\) for the Cholesky decomposition in Line 1, and \(n^2/2\) for solving each triangular system in Lines 2 and 4; a transcription of the algorithm in code follows
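
A direct transcription of the algorithm for a single test point; the SE kernel and the sinusoidal toy data are the same illustrative choices used earlier in these notes, not the book's examples.

```python
import numpy as np
from scipy.linalg import solve_triangular

def se_kernel(A, B, ell=1.0):
    """Squared exponential covariance matrix between 1-D input sets A and B."""
    d = np.asarray(A)[:, None] - np.asarray(B)[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def gp_regression(X, y, k, sigma_n2, x_star):
    """Predictive mean, variance, and log marginal likelihood for one test point."""
    n = len(X)
    L = np.linalg.cholesky(k(X, X) + sigma_n2 * np.eye(n))                # line 1
    alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True),
                             lower=False)                                 # line 2
    k_star = k(X, [x_star])[:, 0]
    f_bar_star = k_star @ alpha                                           # line 3
    v = solve_triangular(L, k_star, lower=True)                           # line 4
    var_f_star = k([x_star], [x_star])[0, 0] - v @ v                      # line 5
    log_marginal = (-0.5 * y @ alpha - np.sum(np.log(np.diag(L)))
                    - 0.5 * n * np.log(2 * np.pi))                        # line 6
    return f_bar_star, var_f_star, log_marginal

rng = np.random.default_rng(0)
X = np.linspace(-4, 4, 20)
y = np.sin(X) + 0.1 * rng.standard_normal(X.shape)
print(gp_regression(X, y, se_kernel, 0.1**2, 0.7))
```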

Hyperparameters

  • One-dimensional squared-exponential covariance function \[ k_y(x_p, x_q) = \sigma^2_f \exp\left[ -\frac{1}{2\ell^2}(x_p - x_q)^2 \right] + \sigma^2_n \delta_{pq} \]

  • It has three hyperparameters
    • Length-scale \(\ell\)
    • Signal variance \(\sigma^2_f\)
    • Noise variance \(\sigma^2_n\)
  • After selecting \(\ell\), the remaining two hyperparameters are set by optimizing the marginal likelihood (a sketch follows) \[ \log p(\mathbf{y}\mid \mathbf{X}) = -\frac{1}{2}\mathbf{y}^\top \left(K + \sigma^2_n \mathbf{I} \right)^{-1}\mathbf{y} - \frac{1}{2}\log \left| K + \sigma^2_n \mathbf{I} \right| - \frac{n}{2}\log 2\pi \]
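
A sketch of this step on synthetic data, fixing \(\ell = 1\) (an arbitrary choice) and maximizing the log marginal likelihood over \((\sigma^2_f, \sigma^2_n)\) in log space with `scipy.optimize.minimize`; the tiny jitter added to the diagonal is only for numerical safety.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.linspace(-4, 4, 30)
y = np.sin(X) + 0.1 * rng.standard_normal(X.shape)
ell = 1.0                                           # length-scale fixed in advance

def neg_log_marginal_likelihood(log_params):
    """Negative log marginal likelihood as a function of log sigma_f^2, log sigma_n^2."""
    sigma_f2, sigma_n2 = np.exp(log_params)
    d = X[:, None] - X[None, :]
    K = sigma_f2 * np.exp(-0.5 * (d / ell) ** 2) + (sigma_n2 + 1e-8) * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # -log p(y | X) = 1/2 y^T alpha + sum_i log L_ii + n/2 log(2 pi)
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(X) * np.log(2 * np.pi)

result = minimize(neg_log_marginal_likelihood, x0=np.zeros(2), method="L-BFGS-B")
sigma_f2, sigma_n2 = np.exp(result.x)
print(sigma_f2, sigma_n2)
```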

Smoothing, Weight Functions, and Equivalent Kernels

TO BE CONTINUED
