Introduction to GAMs
In general, the GAM model has the following structure: $$g(\mu_i) = A_i\theta + f_1(x_{1i}) + f_2(x_{2i}) + f_3(x_{3i}, x_{4i}) + \cdots$$
- $Y_i$ follows some exponential family distribution: $Y_i \sim \mathrm{EF}(\mu_i, \phi)$
- $\mu_i = E(Y_i)$
- $A_i$ is a row of the model matrix, and $\theta$ is the corresponding parameter vector
- $f_j$ are smooth functions of the covariates $x_k$
- This chapter
- Illustrates GAMs by basis expansions, each with a penalty controlling function smoothness
- Estimates GAMs by penalized regression methods
Takeaway: technically, GAMs are simply GLMs estimated subject to smoothing penalties
Univariate Smoothing
Piecewise linear basis: tent functions
Representing a function with basis expansions
Let's consider a model containing one function of one covariate: $$y_i = f(x_i) + \epsilon_i, \quad \epsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)$$
If $b_j(x)$ is the $j$th basis function, then $f$ is assumed to have a representation $$f(x) = \sum_{j=1}^{k} b_j(x)\beta_j$$ with some unknown parameters $\beta_j$
- This is clearly a linear model
The problem with polynomials
A $k$th order polynomial is $$f(x) = \beta_0 + \sum_{j=1}^{k} \beta_j x^j$$
The polynomial oscillates wildly in places, in order both to interpolate the data and to have all derivatives with respect to $x$ continuous

Figure 1: Left: the target function $f(x)$. Middle: polynomial interpolation. Right: piecewise linear interpolant
Piecewise linear basis
Suppose there are $k$ knots $x_1^* < x_2^* < \cdots < x_k^*$
The tent-function representation of the piecewise linear basis is
- For $j = 2, \dots, k-1$, $$b_j(x) = \begin{cases} \frac{x - x_{j-1}^*}{x_j^* - x_{j-1}^*}, & \text{if } x_{j-1}^* < x \le x_j^* \\ \frac{x_{j+1}^* - x}{x_{j+1}^* - x_j^*}, & \text{if } x_j^* < x \le x_{j+1}^* \\ 0, & \text{otherwise} \end{cases}$$
- For the two basis functions at the edges, $$b_1(x) = \begin{cases} \frac{x_2^* - x}{x_2^* - x_1^*}, & \text{if } x \le x_2^* \\ 0, & \text{otherwise} \end{cases} \qquad b_k(x) = \begin{cases} \frac{x - x_{k-1}^*}{x_k^* - x_{k-1}^*}, & \text{if } x > x_{k-1}^* \\ 0, & \text{otherwise} \end{cases}$$
Visualization of tent function basis
$b_j(x)$ is zero everywhere, except over the interval between the knots immediately to either side of $x_j^*$
$b_j(x)$ increases linearly from $0$ at $x_{j-1}^*$ to $1$ at $x_j^*$, and then decreases linearly to $0$ at $x_{j+1}^*$

Figure 2: Left: tent function basis, for interpolating the data shown as black dots. Right: the basis functions are each multiplied by a coefficient, before being summed
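To make the construction concrete, here is a minimal R sketch that builds the tent-function basis by linear interpolation of indicator vectors placed at the knots (the helper names tf and tf.X are illustrative, not part of mgcv):
tf <- function(x, xk, j) {
  ## j-th tent function: linearly interpolate an indicator placed at knot j
  dj <- rep(0, length(xk)); dj[j] <- 1
  approx(xk, dj, x, rule = 2)$y
}
tf.X <- function(x, xk) {
  ## model matrix: one column per tent basis function
  nk <- length(xk)
  X <- matrix(NA, length(x), nk)
  for (j in 1:nk) X[, j] <- tf(x, xk, j)
  X
}
Evaluating the fitted function at the data points then amounts to the matrix product tf.X(x, xk) %*% beta.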
Penalty to control wiggliness
Control smoothness by penalizing wiggliness
To choose the degree of smoothness, rather than selecting the number of knots $k$, we can use a relatively large $k$, but control the model's smoothness by adding a "wiggliness" penalty
- Note that a model based on $k-1$ evenly spaced knots will not be nested within a model based on $k$ evenly spaced knots
Penalized likelihood function for the piecewise linear basis: $$\|y - X\beta\|^2 + \lambda \sum_{j=2}^{k-1} \left[ f(x_{j-1}^*) - 2f(x_j^*) + f(x_{j+1}^*) \right]^2$$
- Wiggliness is measured as a sum of squared second differences of the function at the knots
- This crudely approximates the integrated squared second derivative penalty used in cubic spline smoothing
- $\lambda$ is called the smoothing parameter
Simplify the penalized likelihood
For the tent function basis, $\beta_j = f(x_j^*)$
Therefore, the penalty can be expressed as a quadratic form $$\sum_{j=2}^{k-1} (\beta_{j-1} - 2\beta_j + \beta_{j+1})^2 = \beta^T D^T D \beta = \beta^T S \beta$$
- The $(k-2) \times k$ matrix $D$ is $$D = \begin{bmatrix} 1 & -2 & 1 & 0 & \cdots \\ 0 & 1 & -2 & 1 & \cdots \\ 0 & 0 & 1 & -2 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{bmatrix}$$
- $S = D^T D$ is a $k \times k$ square matrix
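In R, the second-difference matrix and the resulting penalty matrix can be built in one line each; a small sketch (the value of k here is purely illustrative):
k <- 10                               ## illustrative number of knots
D <- diff(diag(k), differences = 2)   ## (k-2) x k matrix with rows (1, -2, 1, 0, ...)
S <- t(D) %*% D                       ## penalty matrix S = D'D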
Solution of the penalized regression
Minimizing the penalized likelihood gives $$\hat\beta = \arg\min_\beta \; \|y - X\beta\|^2 + \lambda \beta^T S \beta = (X^T X + \lambda S)^{-1} X^T y$$
The hat matrix (also called the influence matrix) $A$ is thus $$A = X(X^T X + \lambda S)^{-1} X^T$$ and the fitted expectation is $\hat\mu = Ay$
For practical computation, we can introduce imaginary data to reformulate the penalized least squares problem as an ordinary least squares problem: $$\|y - X\beta\|^2 + \lambda\beta^T S\beta = \left\| \begin{bmatrix} y \\ 0 \end{bmatrix} - \begin{bmatrix} X \\ \sqrt{\lambda}\, D \end{bmatrix} \beta \right\|^2$$
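The augmented-data formulation means any ordinary least squares routine can fit the penalized model. A rough sketch, reusing tf.X() from the earlier sketch and assuming a given smoothing parameter lambda:
prs.fit <- function(y, x, xk, lambda) {
  X <- tf.X(x, xk)                           ## tent-basis model matrix
  D <- diff(diag(length(xk)), differences = 2)
  Xa <- rbind(X, sqrt(lambda) * D)           ## augmented model matrix [X; sqrt(lambda) D]
  ya <- c(y, rep(0, nrow(D)))                ## augmented response [y; 0]
  lm(ya ~ Xa - 1)                            ## ordinary least squares on the augmented data
}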
Hyper-parameter tuning
Between the two hyper-parameters, the number of knots $k$ and the smoothing parameter $\lambda$, the choice of $\lambda$ plays the crucial role
We can always use a $k$ that is large enough, i.e. more flexible than we expect to need to represent $f(x)$
In the mgcv package, the default choice is $k = 20$, and knots are evenly spread out over the range of the observed data
Choose $\lambda$ by leave-one-out cross-validation
Under linear regression, to compute the leave-one-out cross-validation error (called the ordinary cross-validation score), we only need to fit the full model once: $$V_o = \frac{1}{n}\sum_{i=1}^n \left(y_i - \hat f_i^{[-i]}\right)^2 = \frac{1}{n}\sum_{i=1}^n \frac{(y_i - \hat f_i)^2}{(1 - A_{ii})^2}$$
- $\hat f_i^{[-i]}$ is the fitted value at $x_i$ from the model estimated using all the data except $y_i$
- $\hat f_i$ is the fitted value from the model estimated using all the data, and $A_{ii}$ is the $i$th diagonal entry of the corresponding hat matrix
In practice, the $A_{ii}$ are often replaced by their mean $\mathrm{tr}(A)/n$. This results in the generalized cross-validation (GCV) score $$V_g = \frac{n \sum_{i=1}^n (y_i - \hat f_i)^2}{\left[n - \mathrm{tr}(A)\right]^2}$$
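A sketch of the GCV score as a function of log(lambda), assuming X and S are built as in the earlier sketches; the smoothing parameter can then be chosen by a one-dimensional minimization, e.g. with optimize():
gcv_score <- function(log_lambda, y, X, S) {
  lambda <- exp(log_lambda)
  A <- X %*% solve(t(X) %*% X + lambda * S, t(X))   ## influence matrix
  n <- length(y)
  n * sum((y - A %*% y)^2) / (n - sum(diag(A)))^2   ## GCV score V_g
}
## e.g. optimize(gcv_score, interval = c(-10, 10), y = y, X = X, S = S)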
From the Bayesian perspective
The wiggliness penalty can be viewed as a normal prior distribution on $\beta$: $$\beta \sim N\!\left(0, \; \sigma^2 S^- / \lambda\right)$$
- Because $S$ is rank deficient, the prior covariance is proportional to the pseudo-inverse $S^-$
The posterior of $\beta$ is still normal: $$\beta \mid y \sim N\!\left(\hat\beta, \; (X^T X + \lambda S)^{-1}\sigma^2\right)$$
Given the model, this extra structure opens up the possibility of estimating $\sigma^2$ and $\lambda$ using marginal likelihood maximization or REML (a.k.a. empirical Bayes)
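As a small illustration of the Bayesian view, the sketch below computes the posterior covariance and pointwise 95% credible intervals for $f$, assuming X, S, and lambda from the earlier sketches and an estimate sig2 of $\sigma^2$:
Vb <- solve(t(X) %*% X + lambda * S) * sig2   ## posterior covariance of beta
beta_hat <- Vb %*% t(X) %*% y / sig2          ## posterior mean, equal to the penalized LS estimate
f_hat <- X %*% beta_hat                       ## fitted smooth at the data points
se_f <- sqrt(rowSums((X %*% Vb) * X))         ## pointwise posterior standard deviation of f
ci <- cbind(f_hat - 1.96 * se_f, f_hat + 1.96 * se_f)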
Additive Models
A simple additive model with two univariate functions
Let's consider a simple additive model $$y_i = \alpha + f_1(x_i) + f_2(v_i) + \epsilon_i, \quad \epsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)$$
The assumption of additive effects is a fairly strong one
The model now has an identifiability problem: $f_1$ and $f_2$ are each only estimable to within an additive constant
- Due to the identifiability issue, the penalized regression splines need to be constrained, as described below
Piecewise linear regression representation
Basis representations of $f_1(\cdot)$ and $f_2(\cdot)$: $$f_1(x) = \sum_{j=1}^{k_1} b_j(x)\,\delta_j, \qquad f_2(v) = \sum_{j=1}^{k_2} B_j(v)\,\gamma_j$$
- The basis functions $b_j(\cdot)$ and $B_j(\cdot)$ are tent functions, with evenly spaced knots $x_j^*$ and $v_j^*$, respectively
Matrix representations: $$\mathbf{f}_1 = [f_1(x_1), \dots, f_1(x_n)]^T = X_1\delta, \quad [X_1]_{i,j} = b_j(x_i)$$ $$\mathbf{f}_2 = [f_2(v_1), \dots, f_2(v_n)]^T = X_2\gamma, \quad [X_2]_{i,j} = B_j(v_i)$$
Sum-to-zero constraints to resolve identifiability issues
We assume $$\sum_{i=1}^n f_1(x_i) = 0 \iff \mathbf{1}^T \mathbf{f}_1 = 0$$ Requiring $\mathbf{1}^T X_1\delta = 0$ to hold for all $\delta$ implies $\mathbf{1}^T X_1 = 0$
To achieve this condition, we can center the columns of $X_1$: $$\tilde X_1 = X_1 - \frac{\mathbf{1}\,\mathbf{1}^T X_1}{n}, \qquad \tilde{\mathbf{f}}_1 = \tilde X_1 \delta$$
Column centering reduces the rank of $\tilde X_1$ to $k_1 - 1$, so that only $k_1 - 1$ elements of the $k_1$-vector $\delta$ can be uniquely estimated
- A simple identifiability constraint:
- Set a single element of $\delta$ to zero
- And delete the corresponding column of $\tilde X_1$ and of $D$
For notational simplicity, in what follows the tildes will be dropped, and we assume that the $X_j$ and $D_j$ are the constrained versions
Penalized piecewise regression additive model
We rewrite the additive model in matrix form as $$y = X\beta + \epsilon$$ where $X = (\mathbf{1}, X_1, X_2)$ and $\beta^T = (\alpha, \delta^T, \gamma^T)$
Wiggliness penalties: $$\delta^T D_1^T D_1 \delta = \delta^T \bar S_1 \delta = \beta^T S_1 \beta, \qquad S_1 = \begin{bmatrix} 0 & 0 & 0 \\ 0 & \bar S_1 & 0 \\ 0 & 0 & 0 \end{bmatrix}$$ $$\gamma^T D_2^T D_2 \gamma = \gamma^T \bar S_2 \gamma = \beta^T S_2 \beta, \qquad S_2 = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & \bar S_2 \end{bmatrix}$$
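A rough sketch of how the constrained design and penalty matrices might be assembled, reusing tf.X() from the earlier sketch; x, v, k1, and k2 are illustrative covariates and basis dimensions:
xk1 <- seq(min(x), max(x), length = k1)   ## evenly spaced knots for f1
xk2 <- seq(min(v), max(v), length = k2)   ## evenly spaced knots for f2
X1 <- tf.X(x, xk1); X2 <- tf.X(v, xk2)
D1 <- diff(diag(k1), differences = 2); D2 <- diff(diag(k2), differences = 2)
## sum-to-zero constraint: center columns, then drop one column of X and of D
X1c <- sweep(X1, 2, colMeans(X1))[, -k1]; D1c <- D1[, -k1]
X2c <- sweep(X2, 2, colMeans(X2))[, -k2]; D2c <- D2[, -k2]
X <- cbind(1, X1c, X2c)                   ## overall model matrix (1, X1, X2)
p <- ncol(X)                              ## p = k1 + k2 - 1
S1 <- matrix(0, p, p); S1[2:k1, 2:k1] <- crossprod(D1c)             ## penalty block for delta
S2 <- matrix(0, p, p); S2[(k1 + 1):p, (k1 + 1):p] <- crossprod(D2c) ## penalty block for gamma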
Fitting additive models by penalized least squares
Penalized least squares objective function: $$\|y - X\beta\|^2 + \lambda_1\beta^T S_1\beta + \lambda_2\beta^T S_2\beta$$
Coefficient estimator: $$\hat\beta = (X^T X + \lambda_1 S_1 + \lambda_2 S_2)^{-1} X^T y$$
Hat matrix: $$A = X(X^T X + \lambda_1 S_1 + \lambda_2 S_2)^{-1} X^T$$
Conditional posterior distribution: $$\beta \mid y \sim N(\hat\beta, \hat V_\beta), \qquad \hat V_\beta = (X^T X + \lambda_1 S_1 + \lambda_2 S_2)^{-1}\hat\sigma^2$$
Choosing two smoothing parameters
Since we now have two smoothing parameters $\lambda_1, \lambda_2$, grid searching for the GCV-optimal values starts to become inefficient
Instead, the R function optim() can be used to minimize the GCV score
We can optimize over log smoothing parameters to ensure that the estimated smoothing parameters are non-negative
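A sketch of this approach, assuming y and the matrices X, S1, S2 from the previous sketch; note the optimization is over the log smoothing parameters:
gcv_am <- function(log_sp, y, X, S1, S2) {
  sp <- exp(log_sp)                                  ## back-transform to (lambda1, lambda2)
  M <- t(X) %*% X + sp[1] * S1 + sp[2] * S2
  A <- X %*% solve(M, t(X))                          ## influence matrix
  n <- length(y)
  n * sum((y - A %*% y)^2) / (n - sum(diag(A)))^2    ## GCV score
}
fit <- optim(c(0, 0), gcv_am, y = y, X = X, S1 = S1, S2 = S2)
exp(fit$par)                                         ## GCV-optimal (lambda1, lambda2)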
Generalized Additive Models
Generalized additive models
Generalized additive models (GAMs): additive models + GLMs $$g(\mu_i) = \alpha + f_1(x_i) + f_2(v_i)$$
Penalized iteratively re-weighted least squares (PIRLS) algorithm: iterate the following steps until convergence
Given the current $\hat\eta$ and $\hat\mu$, compute $$w_i = \frac{1}{V(\hat\mu_i)\, g'(\hat\mu_i)^2}, \qquad z_i = g'(\hat\mu_i)(y_i - \hat\mu_i) + \hat\eta_i$$
Let $W = \mathrm{diag}(w_i)$; we obtain the new $\hat\beta$ by minimizing $$\left\|\sqrt{W}z - \sqrt{W}X\beta\right\|^2 + \lambda_1\beta^T S_1\beta + \lambda_2\beta^T S_2\beta$$
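A very rough PIRLS sketch for a log-link model, assuming X, S1, S2 and fixed smoothing parameters; the default variance function V() here corresponds to the Gamma family (an illustrative choice):
pirls <- function(y, X, S1, S2, lambda1, lambda2, V = function(mu) mu^2) {
  eta <- log(y + 0.1)                     ## crude starting values for the log link
  for (it in 1:100) {
    mu <- exp(eta)
    w <- mu^2 / V(mu)                     ## w_i = 1 / (V(mu_i) g'(mu_i)^2), with g'(mu) = 1/mu
    z <- (y - mu) / mu + eta              ## working response z_i
    M <- t(X) %*% (w * X) + lambda1 * S1 + lambda2 * S2
    beta <- solve(M, t(X) %*% (w * z))    ## penalized weighted least squares update
    eta_new <- drop(X %*% beta)
    converged <- max(abs(eta_new - eta)) < 1e-8
    eta <- eta_new
    if (converged) break
  }
  list(beta = beta, fitted = exp(eta))
}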
Introducing Package mgcv
Introducing package mgcv
Main function: gam(), very much like the glm() function
Smooth terms: s() for univariate functions and te() for tensor product smooths
A gamma regression example: $$\log(E[\texttt{Volume}_i]) = f_1(\texttt{Height}_i) + f_2(\texttt{Girth}_i), \qquad \texttt{Volume}_i \sim \text{Gamma}$$
library(mgcv)  ## load the package
data(trees)
ct1 <- gam(Volume ~ s(Height) + s(Girth),
           family = Gamma(link = log), data = trees)
- By default, the degree of smoothness of the $f_j$ is estimated by GCV
summary(ct1)
##
## Family: Gamma
## Link function: log
##
## Formula:
## Volume ~ s(Height) + s(Girth)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.27570 0.01492 219.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Height) 1.000 1.000 31.32 7.07e-06 ***
## s(Girth) 2.422 3.044 219.28 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.973 Deviance explained = 97.8%
## GCV = 0.0080824 Scale est. = 0.006899 n = 31
Partial residual plots
- Pearson residuals added to the estimated smooth terms: $$\hat\epsilon_{1i}^{\text{partial}} = \hat f_1(\texttt{Height}_i) + \hat\epsilon_i^p$$
par(mfrow = c(1, 2))
plot(ct1,residuals=TRUE)
- The number in each y-axis label is the effective degrees of freedom of that smooth term
Finer control of gam(): choice of basis functions
Default: thin plate regression splines
- It has some appealing properties, but can be somewhat computationally costly for large datasets
We can select penalized cubic regression splines by using
s(..., bs = "cr")
We can change the dimension $k$ of the basis
- The actual effective degrees of freedom for each term are usually estimated from the data by GCV or another smoothness selection criterion
- The upper bound on this estimate is $k - 1$; one degree of freedom is lost to the identifiability constraint on each smooth term
s(..., bs = "cr", k = 20)
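For example, the trees model could be refit with cubic regression splines and a larger basis dimension for Girth (the value k = 20 here is purely illustrative):
ct2 <- gam(Volume ~ s(Height, bs = "cr") + s(Girth, bs = "cr", k = 20),
           family = Gamma(link = log), data = trees)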
Finer control of gam(): the gamma parameter
GCV is known to have some tendency to overfit
Inside the gam() function, the argument gamma can increase the amount of smoothing
- The default value for gamma is 1
- We can use a higher value, e.g. gamma = 1.5, to avoid overfitting without compromising model fit
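For example, the trees model could be refit with a heavier GCV penalty:
ct3 <- gam(Volume ~ s(Height) + s(Girth),
           family = Gamma(link = log), data = trees, gamma = 1.5)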
References
- Wood, Simon N. (2017), Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC