For the pdf slides, click here
Overview of Gaussian processes (GP)
The problem of learning in GP is exactly the problem of finding suitable properties for the covariance function
In this book, the design matrix is defined slightly differently from common statistical textbooks: each column of the design matrix is a case, and each row is a covariate
Weight-space View
A regression model with basis functions
Basis function $\phi(\mathbf{x})$: maps a $D$-dimensional input vector $\mathbf{x}$ into an $N$-dimensional feature space
$\Phi(X) \in \mathbb{R}^{N \times n}$: the aggregation of the columns $\phi(\mathbf{x})$ for all $n$ cases in the training data
A regression model $$f(\mathbf{x}) = \phi(\mathbf{x})^\top \mathbf{w}, \quad y = f(\mathbf{x}) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma_n^2)$$
We use a zero-mean Gaussian prior on the $N$-dimensional unknown weights $\mathbf{w}$ (aka regression coefficients): $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \Sigma_p)$
Predictive distribution
For a new test point $\mathbf{x}_*$, the predictive distribution is $$f_* \mid \mathbf{x}_*, X, \mathbf{y} \sim \mathcal{N}\left(\tfrac{1}{\sigma_n^2} \phi_*^\top A^{-1} \Phi \mathbf{y},\; \phi_*^\top A^{-1} \phi_*\right), \quad \phi_* = \phi(\mathbf{x}_*), \quad \Phi = \Phi(X), \quad A = \tfrac{1}{\sigma_n^2} \Phi \Phi^\top + \Sigma_p^{-1}$$
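A minimal NumPy sketch of this weight-space prediction; the polynomial basis functions, toy data, and hyperparameter values below are assumptions chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative polynomial basis phi(x) = (1, x, x^2), so the feature dimension is N = 3
def phi(x):
    return np.array([1.0, x, x**2])

# Toy training data (n = 5 cases); noise variance and prior covariance are assumptions
X = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.sin(X) + 0.1 * rng.standard_normal(X.size)
sigma_n2 = 0.1**2
Sigma_p = np.eye(3)

Phi = np.stack([phi(x) for x in X], axis=1)        # N x n, columns are cases
A = Phi @ Phi.T / sigma_n2 + np.linalg.inv(Sigma_p)
A_inv = np.linalg.inv(A)

x_star = 0.5
phi_star = phi(x_star)
mean = phi_star @ A_inv @ Phi @ y / sigma_n2       # predictive mean
var = phi_star @ A_inv @ phi_star                  # predictive variance
print(mean, var)
```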
When making predictions, we need to invert the $N \times N$ matrix $A$, which may not be convenient if $N$, the dimension of the feature space, is large
Rewriting the predictive distribution using the matrix inversion lemma
Matrix inversion lemma: for $Z \in \mathbb{R}^{n \times n}$, $W \in \mathbb{R}^{m \times m}$, $U, V \in \mathbb{R}^{n \times m}$, $$(Z + UWV^\top)^{-1} = Z^{-1} - Z^{-1} U (W^{-1} + V^\top Z^{-1} U)^{-1} V^\top Z^{-1}$$
We can rewrite the predictive distribution on the previous page as $$f_* \mid \mathbf{x}_*, X, \mathbf{y} \sim \mathcal{N}\left(\phi_*^\top \Sigma_p \Phi (K + \sigma_n^2 I)^{-1} \mathbf{y},\; \phi_*^\top \Sigma_p \phi_* - \phi_*^\top \Sigma_p \Phi (K + \sigma_n^2 I)^{-1} \Phi^\top \Sigma_p \phi_*\right), \quad K = \Phi^\top \Sigma_p \Phi$$
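The two forms are algebraically identical, which can be checked numerically. The sketch below uses randomly generated features and an arbitrary positive definite $\Sigma_p$, so it is a consistency check rather than anything from the book:

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 4, 6                                  # feature dimension and number of cases
Phi = rng.standard_normal((N, n))
y = rng.standard_normal(n)
phi_star = rng.standard_normal(N)
sigma_n2 = 0.3
B = rng.standard_normal((N, N))
Sigma_p = B @ B.T + N * np.eye(N)            # an arbitrary positive definite prior covariance

# Form 1: via the N x N matrix A
A = Phi @ Phi.T / sigma_n2 + np.linalg.inv(Sigma_p)
A_inv = np.linalg.inv(A)
mean1 = phi_star @ A_inv @ Phi @ y / sigma_n2
var1 = phi_star @ A_inv @ phi_star

# Form 2: via the n x n matrix K + sigma_n^2 I
K = Phi.T @ Sigma_p @ Phi
M = np.linalg.inv(K + sigma_n2 * np.eye(n))
mean2 = phi_star @ Sigma_p @ Phi @ M @ y
var2 = (phi_star @ Sigma_p @ phi_star
        - phi_star @ Sigma_p @ Phi @ M @ Phi.T @ Sigma_p @ phi_star)

print(np.isclose(mean1, mean2), np.isclose(var1, var2))   # True True
```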
Kernel and the kernel trick
In the predictive distribution on the previous page, the feature space always enters in the form of the kernel $k(\cdot, \cdot)$: $k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^\top \Sigma_p \phi(\mathbf{x}')$, where $\mathbf{x}, \mathbf{x}'$ are in either the training or the test sets
Moreover, we can define $\psi(\mathbf{x}) = \Sigma_p^{1/2} \phi(\mathbf{x})$, so that the kernel has a simple dot product representation $k(\mathbf{x}, \mathbf{x}') = \psi(\mathbf{x}) \cdot \psi(\mathbf{x}')$
Kernel trick: if an algorithm is defined solely in terms of inner products in input space, then it can be lifted into feature space by replacing occurrences of those inner products with $k(\mathbf{x}, \mathbf{x}')$
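A small sketch of the dot-product representation; the feature map and $\Sigma_p$ below are arbitrary, and a Cholesky factor of $\Sigma_p$ is used in place of the symmetric square root $\Sigma_p^{1/2}$ (any factor $L$ with $LL^\top = \Sigma_p$ gives the same kernel value):

```python
import numpy as np

def phi(x):
    # An arbitrary 3-dimensional feature map, for illustration only
    return np.array([1.0, x, np.sin(x)])

Sigma_p = np.array([[2.0, 0.5, 0.0],
                    [0.5, 1.0, 0.2],
                    [0.0, 0.2, 1.5]])      # assumed positive definite prior covariance
L = np.linalg.cholesky(Sigma_p)            # Sigma_p = L L^T

def psi(x):
    return L.T @ phi(x)                    # psi(x) . psi(x') = phi(x)^T Sigma_p phi(x')

x, x_prime = 0.7, -1.2
k1 = phi(x) @ Sigma_p @ phi(x_prime)
k2 = psi(x) @ psi(x_prime)
print(np.isclose(k1, k2))                  # True
```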
Function-space View
Gaussian process: definition
A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution
A GP is completely specified by its mean function $m(\mathbf{x})$ and covariance function $k(\mathbf{x}, \mathbf{x}')$: $$f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))$$
Usually the prior mean function is set to zero
Bayesian linear regression as a Gaussian process: $$f(\mathbf{x}) = \phi(\mathbf{x})^\top \mathbf{w}, \quad \mathbf{w} \sim \mathcal{N}(\mathbf{0}, \Sigma_p)$$ Here, the GP mean function and covariance function are $$m(\mathbf{x}) = 0, \quad k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^\top \Sigma_p \phi(\mathbf{x}')$$
The squared exponential covariance function
In this chapter, the squared exponential (SE) covariance function will be used: $$\mathrm{cov}(f(\mathbf{x}), f(\mathbf{x}')) = k(\mathbf{x}, \mathbf{x}') = \exp\left(-\tfrac{1}{2} |\mathbf{x} - \mathbf{x}'|^2\right)$$
By replacing $|\mathbf{x} - \mathbf{x}'|$ with $|\mathbf{x} - \mathbf{x}'| / \ell$ for some positive constant $\ell$, we can change the characteristic length-scale of the process
Note that the covariance between the outputs is written as a function of the inputs
The squared exponential covariance function corresponds to a Bayesian linear regression model with an infinite number of basis functions
In fact, for every positive definite covariance function $k(\cdot, \cdot)$, there exists a (possibly infinite) expansion in terms of basis functions
Three functions drawn at random from a GP prior (left) and their posteriors (right)
- In both plots, the shaded area is the pointwise mean plus and minus two times the standard deviation at each input value
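A sketch of how such prior draws can be generated with the SE covariance; the grid, jitter value, and number of draws are arbitrary choices:

```python
import numpy as np

def se_kernel(x1, x2, ell=1.0):
    # Squared exponential covariance: k(x, x') = exp(-|x - x'|^2 / (2 ell^2))
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

rng = np.random.default_rng(0)
xs = np.linspace(-5, 5, 200)
K = se_kernel(xs, xs)

# Small jitter on the diagonal keeps the Cholesky factorization numerically stable
L = np.linalg.cholesky(K + 1e-8 * np.eye(xs.size))

# Three functions drawn at random from the zero-mean GP prior, evaluated on xs;
# plotting xs against each column gives a figure like the one described above
samples = L @ rng.standard_normal((xs.size, 3))
```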
Prediction with noise-free observations
Suppose we have noise-free observations $\{(\mathbf{x}_i, f_i) : i = 1, \ldots, n\}$
According to the GP prior, the joint distribution of the training outputs $\mathbf{f}$ and the test outputs $\mathbf{f}_*$ is $$\begin{bmatrix} \mathbf{f} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} K(X, X) & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix}\right)$$
By conditioning the joint Gaussian prior on the observations, we get the posterior distribution $$\mathbf{f}_* \mid X_*, X, \mathbf{f} \sim \mathcal{N}\left(K(X_*, X) K(X, X)^{-1} \mathbf{f},\; K(X_*, X_*) - K(X_*, X) K(X, X)^{-1} K(X, X_*)\right)$$
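A direct NumPy transcription of this conditioning formula on toy noise-free data (the tiny jitter term is an implementation choice for numerical stability, not part of the formula):

```python
import numpy as np

def se_kernel(x1, x2):
    # Squared exponential covariance with unit length-scale
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2)

# Toy noise-free observations (illustrative, not from the book)
X = np.array([-4.0, -2.0, 0.0, 1.0, 3.0])
f = np.sin(X)
X_star = np.linspace(-5, 5, 100)

K = se_kernel(X, X)                  # K(X, X)
K_s = se_kernel(X, X_star)           # K(X, X_*)
K_ss = se_kernel(X_star, X_star)     # K(X_*, X_*)

K_inv = np.linalg.inv(K + 1e-10 * np.eye(X.size))   # jitter for stability

post_mean = K_s.T @ K_inv @ f             # K(X_*, X) K(X, X)^{-1} f
post_cov = K_ss - K_s.T @ K_inv @ K_s     # posterior covariance
# The posterior mean interpolates the observations, and the posterior
# variance shrinks to (numerically) zero at the training inputs.
```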
Prediction with noisy observations
With noisy observations $y = f(\mathbf{x}) + \epsilon$, the covariance becomes $\mathrm{cov}(\mathbf{y}) = K(X, X) + \sigma_n^2 I$
Thus, the joint prior distribution becomes $$\begin{bmatrix} \mathbf{y} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} K(X, X) + \sigma_n^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix}\right)$$
Key predictive equations for GP regression: $$\mathbf{f}_* \mid X_*, X, \mathbf{y} \sim \mathcal{N}(\bar{\mathbf{f}}_*, \mathrm{cov}(\mathbf{f}_*)), \quad \text{where}$$ $$\bar{\mathbf{f}}_* = K(X_*, X) [K(X, X) + \sigma_n^2 I]^{-1} \mathbf{y}$$ $$\mathrm{cov}(\mathbf{f}_*) = K(X_*, X_*) - K(X_*, X) [K(X, X) + \sigma_n^2 I]^{-1} K(X, X_*)$$
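A direct transcription of these equations on toy data; the explicit matrix inverse is used here for readability, with the Cholesky-based computation given in the algorithm section below:

```python
import numpy as np

def se_kernel(x1, x2):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2)

rng = np.random.default_rng(0)
sigma_n2 = 0.1 ** 2
X = np.array([-4.0, -2.0, 0.0, 1.0, 3.0])
y = np.sin(X) + np.sqrt(sigma_n2) * rng.standard_normal(X.size)
X_star = np.linspace(-5, 5, 100)

K = se_kernel(X, X)
K_s = se_kernel(X, X_star)
K_ss = se_kernel(X_star, X_star)

M = np.linalg.inv(K + sigma_n2 * np.eye(X.size))    # [K(X,X) + sigma_n^2 I]^{-1}
f_bar_star = K_s.T @ M @ y                          # predictive mean
cov_f_star = K_ss - K_s.T @ M @ K_s                 # predictive covariance
```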
Correspondence with weight-space view
Connection between the predictive distribution in the function-space view and the one in the weight-space view: $K(C, D) = \Phi(C)^\top \Sigma_p \Phi(D)$, where $C, D$ stand for either $X$ or $X_*$
Thus, for any set of basis functions, we can compute the corresponding covariance function as $k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^\top \Sigma_p \phi(\mathbf{x}')$
On the other hand, for every positive definite covariance function k, there exists a possibly infinite expansion in terms of basis functions
Predictive distribution for a single test point x∗
- Denote $K = K(X, X)$ and $\mathbf{k}_* = K(X, \mathbf{x}_*)$; then the mean and variance of the posterior predictive distribution are $$\bar{f}_* = \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}, \quad \mathbb{V}(f_*) = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{k}_*$$
Predictive distribution mean as a linear predictor
The mean prediction $\bar{f}_*$ above is a linear predictor, i.e., it is a linear combination of the observations $\mathbf{y}$
Another way to look at this equation is to see it as a linear combination of $n$ kernel functions: $$\bar{f}(\mathbf{x}_*) = \sum_{i=1}^{n} \alpha_i k(\mathbf{x}_i, \mathbf{x}_*), \quad \boldsymbol{\alpha} = (K + \sigma_n^2 I)^{-1} \mathbf{y}$$
About the predictive distribution variance
The predictive variance does not depend on the observed targets $\mathbf{y}$, but only on the inputs. This is a property of the Gaussian distribution
The noisy prediction of $y_*$: simply add $\sigma_n^2$ to the variance, $$y_* \mid \mathbf{x}_*, X, \mathbf{y} \sim \mathcal{N}(\bar{f}_*, \mathbb{V}(f_*) + \sigma_n^2)$$
Cholesky decomposition and GP regression algorithm
Cholesky decomposition
Cholesky decomposition of a symmetric, positive definite matrix $A$: $A = LL^\top$, where $L$ is a lower triangular matrix, called the Cholesky factor
- Cholesky decomposition is useful for solving linear systems with a symmetric, positive definite coefficient matrix: to solve $A\mathbf{x} = \mathbf{b}$,
- First solve the triangular system $L\mathbf{y} = \mathbf{b}$ by forward substitution
- Then solve the triangular system $L^\top \mathbf{x} = \mathbf{y}$ by back substitution
Backslash operator: $A \backslash \mathbf{b}$ is the vector $\mathbf{x}$ which solves $A\mathbf{x} = \mathbf{b}$
- Under Cholesky decomposition, $\mathbf{x} = A \backslash \mathbf{b} = L^\top \backslash (L \backslash \mathbf{b})$
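In Python, the same backslash idiom can be sketched with SciPy's `cholesky` and `solve_triangular` (the matrix and right-hand side below are random, for illustration):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B @ B.T + 5 * np.eye(5)                 # a symmetric, positive definite matrix
b = rng.standard_normal(5)

L = cholesky(A, lower=True)                 # A = L L^T
z = solve_triangular(L, b, lower=True)      # forward substitution:  L z = b
x = solve_triangular(L.T, z, lower=False)   # back substitution:     L^T x = z

print(np.allclose(A @ x, b))                # True
```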
The computation of the Cholesky factor $L$ is considered numerically extremely stable and takes time $n^3/6$
Algorithm: predictions and log marginal likelihood for GP regression
- Input: $X, \mathbf{y}, k, \sigma_n^2, \mathbf{x}_*$
- $L = \mathrm{cholesky}(K + \sigma_n^2 I)$
- $\boldsymbol{\alpha} = L^\top \backslash (L \backslash \mathbf{y})$
- $\bar{f}_* = \mathbf{k}_*^\top \boldsymbol{\alpha}$
- $\mathbf{v} = L \backslash \mathbf{k}_*$
- $\mathbb{V}(f_*) = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{v}^\top \mathbf{v}$
- $\log p(\mathbf{y} \mid X) = -\frac{1}{2} \mathbf{y}^\top \boldsymbol{\alpha} - \sum_i \log L_{ii} - \frac{n}{2} \log 2\pi$
- Return: $\bar{f}_*, \mathbb{V}(f_*), \log p(\mathbf{y} \mid X)$
Computational complexity: $n^3/6$ for the Cholesky decomposition, and $n^2/2$ for each of the triangular system solves (computing $\boldsymbol{\alpha}$ and $\mathbf{v}$)
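A line-by-line sketch of this algorithm in Python; the SE kernel and the toy data are illustrative assumptions, not part of the algorithm itself:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def se_kernel(x1, x2):
    # Squared exponential covariance with unit length-scale
    return np.exp(-0.5 * (np.atleast_1d(x1)[:, None] - np.atleast_1d(x2)[None, :]) ** 2)

def gp_regression(X, y, kernel, sigma_n2, x_star):
    """Predictive mean, variance, and log marginal likelihood for one test point."""
    n = X.size
    K = kernel(X, X)
    k_star = kernel(X, x_star).ravel()

    L = cholesky(K + sigma_n2 * np.eye(n), lower=True)
    alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True), lower=False)

    f_bar_star = k_star @ alpha                              # predictive mean
    v = solve_triangular(L, k_star, lower=True)
    var_f_star = kernel(x_star, x_star).item() - v @ v       # predictive variance
    log_marg_lik = (-0.5 * y @ alpha
                    - np.sum(np.log(np.diag(L)))
                    - 0.5 * n * np.log(2 * np.pi))
    return f_bar_star, var_f_star, log_marg_lik

# Toy usage
rng = np.random.default_rng(0)
X = np.linspace(-4, 4, 8)
y = np.sin(X) + 0.1 * rng.standard_normal(X.size)
print(gp_regression(X, y, se_kernel, 0.1 ** 2, 0.5))
```

Computing $\boldsymbol{\alpha}$ and $\mathbf{v}$ with triangular solves avoids ever forming $(K + \sigma_n^2 I)^{-1}$ explicitly.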
Hyperparameters
One-dimensional squared-exponential covariance function: $$k_y(x_p, x_q) = \sigma_f^2 \exp\left[-\frac{1}{2\ell^2}(x_p - x_q)^2\right] + \sigma_n^2 \delta_{pq}$$
- It has three hyperparameters:
  - Length-scale $\ell$
  - Signal variance $\sigma_f^2$
  - Noise variance $\sigma_n^2$
After selecting $\ell$, the remaining two hyperparameters are set by optimizing the marginal likelihood $$\log p(\mathbf{y} \mid X) = -\frac{1}{2} \mathbf{y}^\top (K + \sigma_n^2 I)^{-1} \mathbf{y} - \frac{1}{2} \log |K + \sigma_n^2 I| - \frac{n}{2} \log 2\pi$$
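A sketch of this step, with the length-scale fixed in advance and the remaining two hyperparameters found by minimizing the negative log marginal likelihood over $(\log \sigma_f, \log \sigma_n)$; the log parameterization, toy data, starting point, and bounds are implementation choices, not from the text:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, X, y, ell):
    sigma_f2 = np.exp(2 * log_params[0])            # signal variance
    sigma_n2 = np.exp(2 * log_params[1])            # noise variance
    d = X[:, None] - X[None, :]
    Ky = sigma_f2 * np.exp(-0.5 * (d / ell) ** 2) + sigma_n2 * np.eye(X.size)
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # Negative of the log marginal likelihood given above
    return (0.5 * y @ alpha
            + np.sum(np.log(np.diag(L)))
            + 0.5 * X.size * np.log(2 * np.pi))

rng = np.random.default_rng(0)
X = np.linspace(-5, 5, 30)
y = np.sin(X) + 0.2 * rng.standard_normal(X.size)

ell = 1.0                                           # length-scale fixed in advance
result = minimize(neg_log_marginal_likelihood, x0=[0.0, -1.0],
                  args=(X, y, ell), method="L-BFGS-B",
                  bounds=[(-5, 5), (-5, 5)])
sigma_f, sigma_n = np.exp(result.x)
print(sigma_f, sigma_n)
```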
Smoothing, Weight Functions, and Equivalent Kernels
TO BE CONTINUED
References
Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press.