Book Notes: Gaussian Processes for Machine Learning -- Ch2 Gaussian Process Regression

For the pdf slides, click here

Overview of Gaussian processes (GP)

  • The problem of learning in GP is exactly the problem of finding suitable properties for the covariance function

  • In this book, the design matrix is defined slightly differently from common statistical textbooks: each column of the design matrix is a case, and each row is a covariate

Weight-space View

A regression model with basis functions

  • Basis function $\phi(x)$: maps a $D$-dimensional input vector $x$ into an $N$-dimensional feature space

  • $\Phi(X) \in \mathbb{R}^{N \times n}$: the aggregation of the columns $\phi(x)$ for all $n$ cases in the training data

  • A regression model: $f(x) = \phi(x)^\top w$, $\; y = f(x) + \varepsilon$, $\; \varepsilon \sim \mathcal{N}(0, \sigma_n^2)$

  • We use a zero-mean Gaussian prior on the $N$-dimensional unknown weights $w$ (aka regression coefficients): $w \sim \mathcal{N}(0, \Sigma_p)$

Predictive distribution

  • For a new test point $x_*$, the predictive distribution is
    $$f_* \mid x_*, X, y \sim \mathcal{N}\left(\tfrac{1}{\sigma_n^2}\,\phi_*^\top A^{-1} \Phi\, y,\; \phi_*^\top A^{-1} \phi_*\right), \quad \phi_* = \phi(x_*),\; \Phi = \Phi(X),\; A = \tfrac{1}{\sigma_n^2}\Phi\Phi^\top + \Sigma_p^{-1}$$

  • When making predictions, we need to invert the $N \times N$ matrix $A$, which may not be convenient if $N$, the dimension of the feature space, is large (a sketch of this computation follows below)
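
A minimal NumPy sketch of this weight-space prediction, assuming a toy 1-D dataset and a hypothetical cubic-polynomial basis map `phi` (neither appears in the book); it builds $A = \sigma_n^{-2}\Phi\Phi^\top + \Sigma_p^{-1}$ and inverts it directly.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    """Hypothetical basis map: 1-D input -> N = 4 polynomial features."""
    return np.array([1.0, x, x**2, x**3])

# Toy training data (n = 5 cases) and one test point.
X = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
y = np.sin(3 * X) + 0.1 * rng.standard_normal(X.shape)
x_star = 0.25

sigma_n2 = 0.1**2                # noise variance sigma_n^2
Sigma_p = np.eye(4)              # prior covariance of the weights

Phi = np.column_stack([phi(x) for x in X])   # N x n design matrix (columns are cases)
phi_star = phi(x_star)

# A = sigma_n^{-2} Phi Phi^T + Sigma_p^{-1}   (an N x N matrix)
A = Phi @ Phi.T / sigma_n2 + np.linalg.inv(Sigma_p)
A_inv = np.linalg.inv(A)

mean = phi_star @ A_inv @ Phi @ y / sigma_n2   # predictive mean
var = phi_star @ A_inv @ phi_star              # predictive variance
print(mean, var)
```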

Rewriting the predictive distribution using the matrix inversion lemma

  • Matrix inversion lemma: for $Z \in \mathbb{R}^{n \times n}$, $W \in \mathbb{R}^{m \times m}$, and $U, V \in \mathbb{R}^{n \times m}$,
    $$(Z + UWV^\top)^{-1} = Z^{-1} - Z^{-1} U (W^{-1} + V^\top Z^{-1} U)^{-1} V^\top Z^{-1}$$

  • We can rewrite the predictive distribution on the previous page (a numerical check of the equivalence is sketched below) as
    $$f_* \mid x_*, X, y \sim \mathcal{N}\left(\phi_*^\top \Sigma_p \Phi (K + \sigma_n^2 I)^{-1} y,\; \phi_*^\top \Sigma_p \phi_* - \phi_*^\top \Sigma_p \Phi (K + \sigma_n^2 I)^{-1} \Phi^\top \Sigma_p \phi_*\right), \quad K = \Phi^\top \Sigma_p \Phi$$
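
A sketch, under the same toy assumptions as before (hypothetical polynomial basis, arbitrary $\Sigma_p$ and noise level), checking numerically that the rewritten form, which only inverts the $n \times n$ matrix $K + \sigma_n^2 I$, matches the weight-space form that inverts the $N \times N$ matrix $A$.

```python
import numpy as np

def phi(x):
    """Hypothetical basis map: 1-D input -> N = 4 polynomial features."""
    return np.array([1.0, x, x**2, x**3])

X = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
y = np.sin(3 * X)
x_star = 0.25
sigma_n2, Sigma_p = 0.01, np.eye(4)

Phi = np.column_stack([phi(x) for x in X])   # N x n
phi_star = phi(x_star)

# Weight-space form: invert the N x N matrix A.
A = Phi @ Phi.T / sigma_n2 + np.linalg.inv(Sigma_p)
mean_w = phi_star @ np.linalg.solve(A, Phi @ y) / sigma_n2
var_w = phi_star @ np.linalg.solve(A, phi_star)

# Kernel form (via the matrix inversion lemma): invert the n x n matrix K + sigma_n^2 I.
K = Phi.T @ Sigma_p @ Phi
B = K + sigma_n2 * np.eye(len(X))
mean_k = phi_star @ Sigma_p @ Phi @ np.linalg.solve(B, y)
var_k = (phi_star @ Sigma_p @ phi_star
         - phi_star @ Sigma_p @ Phi @ np.linalg.solve(B, Phi.T @ Sigma_p @ phi_star))

assert np.isclose(mean_w, mean_k) and np.isclose(var_w, var_k)
```

Both forms give the same prediction; the second is preferable when the feature dimension $N$ is much larger than the number of cases $n$.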

Kernel and the kernel trick

  • In the predictive distribution on the previous page, the feature space always enters in the form of the kernel $k(\cdot,\cdot)$: $k(x, x') = \phi(x)^\top \Sigma_p \phi(x')$, where $x, x'$ are in either the training or the test sets

  • Moreover, we can define $\psi(x) = \Sigma_p^{1/2} \phi(x)$, so that the kernel has a simple dot product representation $k(x, x') = \psi(x) \cdot \psi(x')$

  • Kernel trick: if an algorithm is defined solely in terms of inner products in input space, then it can be lifted into feature space by replacing occurrences of those inner products by $k(x, x')$ (see the sketch below)
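
A small check, using an arbitrary feature map and a randomly generated positive definite $\Sigma_p$ (both assumptions for illustration), that $k(x, x') = \phi(x)^\top \Sigma_p \phi(x')$ has the dot-product form $\psi(x) \cdot \psi(x')$ with $\psi(x) = \Sigma_p^{1/2}\phi(x)$.

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(1)

def phi(x):
    """Hypothetical basis map into a 3-dimensional feature space."""
    return np.array([1.0, np.sin(x), np.cos(x)])

# A random symmetric positive definite prior covariance Sigma_p.
M = rng.standard_normal((3, 3))
Sigma_p = M @ M.T + 3 * np.eye(3)
Sigma_p_half = np.real(sqrtm(Sigma_p))   # symmetric square root Sigma_p^{1/2}

def k(x, xp):
    return phi(x) @ Sigma_p @ phi(xp)

def psi(x):
    return Sigma_p_half @ phi(x)

x, xp = 0.3, -1.2
assert np.isclose(k(x, xp), psi(x) @ psi(xp))   # dot-product representation of the kernel
```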

Function-space View

Gaussian process: definition

  • A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution

  • A GP is completely specified by its mean function $m(x)$ and covariance function $k(x, x')$: $f(x) \sim \mathcal{GP}(m(x), k(x, x'))$

  • Usually the prior mean function is set to zero

  • Bayesian linear regression as a Gaussian process: $f(x) = \phi(x)^\top w$, $w \sim \mathcal{N}(0, \Sigma_p)$. Here, the GP mean function and covariance function are $m(x) = 0$, $k(x, x') = \phi(x)^\top \Sigma_p \phi(x')$

The squared exponential covariance function

  • In this chapter, the squared exponential (SE) covariance function will be used: $\mathrm{cov}(f(x), f(x')) = k(x, x') = \exp\left(-\tfrac{1}{2}|x - x'|^2\right)$

    • By replacing $|x - x'|$ by $|x - x'| / \ell$ for some positive constant $\ell$, we can change the characteristic length-scale of the process (see the sketch after this list)

    • Note that the covariance between the outputs is written as a function of the inputs

    • The squared exponential covariance function corresponds to a Bayesian linear regression model with an infinite number of basis functions

    • Actually, for every positive definite covariance function $k(\cdot,\cdot)$, there exists a (possibly infinite) expansion in terms of basis functions
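
A sketch of the SE covariance function with the length-scale $\ell$ written in explicitly; the vectorized distance computation and the example grid are implementation choices, not from the book.

```python
import numpy as np

def se_kernel(X1, X2, ell=1.0):
    """Squared exponential covariance: k(x, x') = exp(-|x - x'|^2 / (2 ell^2)).

    X1, X2 have shapes (n1, D) and (n2, D); returns an (n1, n2) covariance matrix.
    """
    sq_dists = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dists / ell**2)

X = np.linspace(-3, 3, 5).reshape(-1, 1)
print(se_kernel(X, X, ell=0.5))   # a shorter length-scale makes the covariance decay faster
```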

Three functions drawn at random from a GP prior (left) and their posteriors (right)

  • In both plots, the shaded area is the pointwise mean plus and minus two times the standard deviation at each input value (a sketch of drawing such prior samples follows)
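
A sketch in the spirit of the left panel: draw three functions at random from a zero-mean GP prior with the SE kernel. The input grid, jitter value, and random seed are arbitrary choices.

```python
import numpy as np

def se_kernel(X1, X2, ell=1.0):
    sq = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq / ell**2)

rng = np.random.default_rng(0)
Xs = np.linspace(-5, 5, 200).reshape(-1, 1)        # input grid
K = se_kernel(Xs, Xs) + 1e-6 * np.eye(len(Xs))     # small jitter for numerical stability

# Three functions drawn at random from the zero-mean GP prior.
samples = rng.multivariate_normal(np.zeros(len(Xs)), K, size=3)

# Pointwise +/- 2 standard deviation band of the prior (roughly +/- 2, since k(x, x) = 1).
band = 2 * np.sqrt(np.diag(K))
print(samples.shape, band[:3])
```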

Prediction with noise-free observations

  • Suppose we have noise-free observations $\{(x_i, f_i) : i = 1, \dots, n\}$

  • According to the GP prior, the joint distribution of the training outputs $f$ and the test outputs $f_*$ is
    $$\begin{bmatrix} f \\ f_* \end{bmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} K(X, X) & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix}\right)$$

  • By conditioning the joint Gaussian prior on the observations, we get the posterior distribution (sketched in code below)
    $$f_* \mid X_*, X, f \sim \mathcal{N}\left(K(X_*, X) K(X, X)^{-1} f,\; K(X_*, X_*) - K(X_*, X) K(X, X)^{-1} K(X, X_*)\right)$$
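
A minimal sketch of this conditioning step, assuming the SE kernel and a handful of made-up noise-free observations of $\sin(x)$.

```python
import numpy as np

def se_kernel(X1, X2, ell=1.0):
    sq = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq / ell**2)

# Noise-free observations (x_i, f_i) and a grid of test inputs X_*.
X = np.array([[-4.0], [-2.0], [0.0], [1.0], [3.0]])
f = np.sin(X).ravel()
Xs = np.linspace(-5, 5, 100).reshape(-1, 1)

K = se_kernel(X, X)        # K(X, X)
Ks = se_kernel(X, Xs)      # K(X, X_*)
Kss = se_kernel(Xs, Xs)    # K(X_*, X_*)

# Posterior mean and covariance from conditioning the joint Gaussian on f.
post_mean = Ks.T @ np.linalg.solve(K, f)
post_cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
print(post_mean[:5])
```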

Prediction with noisy observations

  • With noisy observations $y = f(x) + \varepsilon$, the covariance becomes $\mathrm{cov}(y) = K(X, X) + \sigma_n^2 I$

  • Thus, the joint prior distribution becomes
    $$\begin{bmatrix} y \\ f_* \end{bmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} K(X, X) + \sigma_n^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix}\right)$$

  • Key predictive equations for GP regression:
    $$f_* \mid X, y, X_* \sim \mathcal{N}(\bar{f}_*, \mathrm{cov}(f_*)), \quad \text{where}$$
    $$\bar{f}_* = K(X_*, X)\,[K(X, X) + \sigma_n^2 I]^{-1} y$$
    $$\mathrm{cov}(f_*) = K(X_*, X_*) - K(X_*, X)\,[K(X, X) + \sigma_n^2 I]^{-1} K(X, X_*)$$

Correspondence with weight-space view

  • Connection between the function-space view and the weight-space view: $K(C, D) = \Phi(C)^\top \Sigma_p \Phi(D)$, where $C, D$ stand for either $X$ or $X_*$

  • Thus, for any set of basis functions, we can compute the corresponding covariance function as $k(x, x') = \phi(x)^\top \Sigma_p \phi(x')$

  • On the other hand, for every positive definite covariance function k, there exists a possibly infinite expansion in terms of basis functions

Predictive distribution for a single test point $x_*$

  • Denote $K = K(X, X)$ and $k_* = K(X, x_*)$; then the mean and variance of the posterior predictive distribution are
    $$\bar{f}_* = k_*^\top (K + \sigma_n^2 I)^{-1} y, \qquad \mathbb{V}[f_*] = k(x_*, x_*) - k_*^\top (K + \sigma_n^2 I)^{-1} k_*$$

Predictive distribution mean as a linear predictor

  • The mean prediction above is a linear predictor, i.e., it is a linear combination of the observations $y$

  • Another way to look at this equation is to see it as a linear combination of $n$ kernel functions (see the sketch below):
    $$\bar{f}(x_*) = \sum_{i=1}^{n} \alpha_i\, k(x_i, x_*), \qquad \alpha = (K + \sigma_n^2 I)^{-1} y$$
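
A small sketch, with an SE kernel and toy data as assumptions, confirming that the matrix form $k_*^\top (K + \sigma_n^2 I)^{-1} y$ equals the kernel expansion $\sum_i \alpha_i k(x_i, x_*)$.

```python
import numpy as np

def se_kernel(X1, X2, ell=1.0):
    sq = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq / ell**2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(8, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(8)
x_star = np.array([[0.7]])
sigma_n2 = 0.1**2

K = se_kernel(X, X)
k_star = se_kernel(X, x_star).ravel()      # k_* = K(X, x_*)

alpha = np.linalg.solve(K + sigma_n2 * np.eye(len(X)), y)

mean_matrix_form = k_star @ alpha          # k_*^T (K + sigma_n^2 I)^{-1} y
mean_kernel_sum = sum(a * kv for a, kv in zip(alpha, k_star))   # sum_i alpha_i k(x_i, x_*)
assert np.isclose(mean_matrix_form, mean_kernel_sum)
```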

About the predictive distribution variance

  • The predictive variance does not depend on the observed targets $y$, but only on the inputs; this is a property of the Gaussian distribution

  • The noisy prediction of $y_*$: simply add $\sigma_n^2 I$ to the variance: $y_* \mid x_*, X, y \sim \mathcal{N}(\bar{f}_*, \mathbb{V}[f_*] + \sigma_n^2 I)$

Cholesky decomposition and GP regression algorithm

Cholesky decomposition

  • Cholesky decomposition of a symmetric, positive definite matrix $A$: $A = L L^\top$, where $L$ is a lower triangular matrix, called the Cholesky factor

  • Cholesky decomposition is useful for solving linear systems with a symmetric, positive definite coefficient matrix: to solve $Ax = b$,
    • First solve the triangular system $Ly = b$ by forward substitution
    • Then solve the triangular system $L^\top x = y$ by back substitution
  • Backslash operator: $A \backslash b$ is the vector $x$ which solves $Ax = b$

    • Under the Cholesky decomposition, $x = A \backslash b = L^\top \backslash (L \backslash b)$ (see the sketch below)
  • The computation of the Cholesky factor $L$ is considered numerically extremely stable and takes time $n^3/6$
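
A sketch of this solve pattern using SciPy's triangular solver on a randomly generated symmetric positive definite matrix (the matrix and right-hand side are arbitrary).

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)        # symmetric positive definite
b = rng.standard_normal(5)

L = cholesky(A, lower=True)        # A = L L^T

# x = A \ b = L^T \ (L \ b): forward substitution, then back substitution.
z = solve_triangular(L, b, lower=True)
x = solve_triangular(L.T, z, lower=False)

assert np.allclose(A @ x, b)
```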

Algorithm: predictions and log marginal likelihood for GP regression

  • Input: $X$, $y$, $k$, $\sigma_n^2$, $x_*$
  1. $L := \mathrm{cholesky}(K + \sigma_n^2 I)$

  2. $\alpha := L^\top \backslash (L \backslash y)$

  3. $\bar{f}_* := k_*^\top \alpha$

  4. $v := L \backslash k_*$

  5. $\mathbb{V}[f_*] := k(x_*, x_*) - v^\top v$

  6. $\log p(y \mid X) := -\tfrac{1}{2} y^\top \alpha - \sum_i \log L_{ii} - \tfrac{n}{2} \log 2\pi$

  • Return: $\bar{f}_*$, $\mathbb{V}[f_*]$, $\log p(y \mid X)$ (a runnable sketch of these steps follows this list)

  • Computational complexity: $n^3/6$ for the Cholesky decomposition in line 1, and $n^2/2$ for solving the triangular systems in lines 2 and 4
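
A NumPy/SciPy sketch that follows the six numbered steps above, assuming the SE kernel and toy training data (both are illustrative choices, not the book's example).

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def se_kernel(X1, X2, ell=1.0):
    sq = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq / ell**2)

def gp_regression(X, y, k, sigma_n2, x_star):
    """Predictive mean, variance, and log marginal likelihood for one test point."""
    n = len(X)
    K = k(X, X)
    k_star = k(X, x_star).ravel()

    L = cholesky(K + sigma_n2 * np.eye(n), lower=True)                               # line 1
    alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True), lower=False)   # line 2
    f_bar = k_star @ alpha                                                           # line 3
    v = solve_triangular(L, k_star, lower=True)                                      # line 4
    var_f = k(x_star, x_star).item() - v @ v                                         # line 5
    log_ml = (-0.5 * y @ alpha - np.sum(np.log(np.diag(L)))                          # line 6
              - 0.5 * n * np.log(2 * np.pi))
    return f_bar, var_f, log_ml

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(10, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(10)
print(gp_regression(X, y, se_kernel, 0.1**2, np.array([[0.5]])))
```

Solving the two triangular systems with the Cholesky factor avoids forming $(K + \sigma_n^2 I)^{-1}$ explicitly and reuses $L$ for both the variance and the log marginal likelihood.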

Hyperparameters

  • One-dimensional squared-exponential covariance function:
    $$k_y(x_p, x_q) = \sigma_f^2 \exp\left[-\tfrac{1}{2\ell^2}(x_p - x_q)^2\right] + \sigma_n^2 \delta_{pq}$$

  • It has three hyperparameters:
    • Length-scale $\ell$
    • Signal variance $\sigma_f^2$
    • Noise variance $\sigma_n^2$
  • After selecting $\ell$, the remaining two hyperparameters are set by optimizing the marginal likelihood (see the sketch below):
    $$\log p(y \mid X) = -\tfrac{1}{2} y^\top (K + \sigma_n^2 I)^{-1} y - \tfrac{1}{2} \log |K + \sigma_n^2 I| - \tfrac{n}{2} \log 2\pi$$
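
A sketch of this procedure: with $\ell$ fixed, maximize the log marginal likelihood over $\sigma_f^2$ and $\sigma_n^2$ using `scipy.optimize.minimize`. The toy data, the log-parameterization, and the bounds are assumptions made for illustration.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))
y = np.sin(2 * X).ravel() + 0.2 * rng.standard_normal(20)
ell = 1.0                                  # length-scale selected in advance

def neg_log_marginal_likelihood(log_params):
    sigma_f2, sigma_n2 = np.exp(log_params)    # optimize in log space to keep them positive
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    Ky = sigma_f2 * np.exp(-0.5 * sq / ell**2) + sigma_n2 * np.eye(len(X))
    L = cholesky(Ky, lower=True)
    alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True), lower=False)
    log_ml = (-0.5 * y @ alpha - np.sum(np.log(np.diag(L)))
              - 0.5 * len(X) * np.log(2 * np.pi))
    return -log_ml

# Bounds on the log-parameters keep Ky well conditioned for the Cholesky factorization.
res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 0.1]),
               method="L-BFGS-B", bounds=[(-5.0, 5.0), (-8.0, 3.0)])
sigma_f2_hat, sigma_n2_hat = np.exp(res.x)
print(sigma_f2_hat, sigma_n2_hat)
```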

Smoothing, Weight Functions, and Equivalent Kernels

TO BE CONTINUED

References