Book Notes: Generalized Additive Models -- Ch4 Introducing GAMs


Introduction to GAMs

  • In general, the GAM model has the following structure: $$g(\mu_i) = A_i\theta + f_1(x_{1i}) + f_2(x_{2i}) + f_3(x_{3i}, x_{4i}) + \cdots$$

    • $Y_i$ follows some exponential family distribution: $Y_i \sim \mathrm{EF}(\mu_i, \phi)$
    • $\mu_i = E(Y_i)$
    • $A_i$ is a row of the model matrix, and $\theta$ is the corresponding parameter vector
    • $f_j$ are smooth functions of the covariates $x_k$
  • This chapter
    • Illustrates GAMs by basis expansions, each with a penalty controlling function smoothness
    • Estimates GAMs by penalized regression methods
  • Takeaway: technically, GAMs are simply GLMs estimated subject to smoothing penalties

Univariate Smoothing

Piecewise linear basis: tent functions

Representing a function with basis expansions

  • Let’s consider a model containing one function of one covariate: $$y_i = f(x_i) + \epsilon_i, \qquad \epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$$

  • If $b_j(x)$ is the $j$th basis function, then $f$ is assumed to have the representation $$f(x) = \sum_{j=1}^{k} b_j(x)\,\beta_j$$ with some unknown parameters $\beta_j$

    • This is clearly a linear model

The problem with polynomials

  • A $k$th order polynomial is $$f(x) = \beta_0 + \sum_{j=1}^{k} \beta_j x^j$$

  • The polynomial oscillates wildly in places, in order to both interpolate the data and to have all derivatives with respect to $x$ continuous

Figure 1: Left: the target function $f(x)$. Middle: polynomial interpolation. Right: piecewise linear interpolant

Piecewise linear basis

  • Suppose there are $k$ knots, $x_1 < x_2 < \cdots < x_k$

  • The tent function representation of the piecewise linear basis is as follows (see the R sketch after this list)

    • For $j = 2, \ldots, k-1$: $$b_j(x) = \begin{cases} \dfrac{x - x_{j-1}}{x_j - x_{j-1}}, & \text{if } x_{j-1} < x \le x_j \\[4pt] \dfrac{x_{j+1} - x}{x_{j+1} - x_j}, & \text{if } x_j < x \le x_{j+1} \\[4pt] 0, & \text{otherwise} \end{cases}$$
    • For the two basis functions on the edges: $$b_1(x) = \begin{cases} \dfrac{x_2 - x}{x_2 - x_1}, & \text{if } x \le x_2 \\[4pt] 0, & \text{otherwise} \end{cases} \qquad b_k(x) = \begin{cases} \dfrac{x - x_{k-1}}{x_k - x_{k-1}}, & \text{if } x > x_{k-1} \\[4pt] 0, & \text{otherwise} \end{cases}$$
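
To make the tent-function basis concrete, here is a minimal R sketch (not mgcv code); the helper name tf_basis and the simulated data are illustrative assumptions, and the data are assumed to lie within the range of the knots.

## Build the tent-function model matrix X, with X[i, j] = b_j(x_i), by linearly
## interpolating an indicator vector that is 1 at knot j and 0 at all other knots
tf_basis <- function(x, xk) {
  k <- length(xk)
  X <- matrix(0, length(x), k)
  for (j in 1:k) {
    yk <- rep(0, k); yk[j] <- 1
    X[, j] <- approx(xk, yk, xout = x, rule = 2)$y
  }
  X
}

## Example: fit f(x) = sum_j b_j(x) beta_j to simulated data by least squares
set.seed(1)
x  <- runif(100)
y  <- sin(2 * pi * x) + rnorm(100, sd = 0.2)
xk <- seq(0, 1, length.out = 10)      # k = 10 evenly spaced knots
X  <- tf_basis(x, xk)
beta <- coef(lm(y ~ X - 1))           # for this basis, beta_j is the fitted f at knot j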

Visualization of tent function basis

  • $b_j(x)$ is zero everywhere, except over the interval between the knots immediately to either side of $x_j$

  • $b_j(x)$ increases linearly from 0 at $x_{j-1}$ to 1 at $x_j$, and then decreases linearly to 0 at $x_{j+1}$

Figure 2: Left: tent function basis, for interpolating the data shown as black dots. Right: the basis functions are each multiplied by a coefficient, before being summed

Penalty to control wiggliness

Control smoothness by penalizing wiggliness

  • To choose the degree of smoothness, rather than selecting the number of knots $k$, we can use a relatively large $k$, but control the model’s smoothness by adding a “wiggliness” penalty

    • Note that a model based on $k-1$ evenly spaced knots will not be nested within a model based on $k$ evenly spaced knots
  • Penalized likelihood function for the piecewise linear basis: $$\|y - X\beta\|^2 + \lambda \sum_{j=2}^{k-1} \left[ f(x_{j-1}) - 2f(x_j) + f(x_{j+1}) \right]^2$$

    • Wiggliness is measured as a sum of squared second differences of the function at the knots
    • This crudely approximates the integrated squared second derivative penalty used in cubic spline smoothing
    • $\lambda$ is called the smoothing parameter

Simplify the penalized likelihood

  • For the tent function basis, $\beta_j = f(x_j)$

  • Therefore, the penalty can be expressed as a quadratic form: $$\sum_{j=2}^{k-1} (\beta_{j-1} - 2\beta_j + \beta_{j+1})^2 = \beta^T D^T D \beta = \beta^T S \beta$$

    • The $(k-2) \times k$ matrix $D$ is $$D = \begin{bmatrix} 1 & -2 & 1 & 0 & \cdots \\ 0 & 1 & -2 & 1 & \cdots \\ 0 & 0 & 1 & -2 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{bmatrix}$$
    • $S = D^T D$ is a square matrix (a quick R check follows)
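
As a check of the penalty algebra, here is a short R sketch (illustrative only) that builds $D$ from second differences and verifies that $\beta^T S \beta$ equals the sum of squared second differences of $\beta$.

k <- 10
## D is the (k-2) x k second-difference matrix with rows of the form (1, -2, 1)
D <- diff(diag(k), differences = 2)
S <- t(D) %*% D                       # the k x k penalty matrix

beta <- rnorm(k)
c(t(beta) %*% S %*% beta,             # quadratic form beta' S beta
  sum(diff(beta, differences = 2)^2)) # sum of squared second differences: identical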

Solution of the penalized regression

  • To minimize the penalized likelihood: $$\hat\beta = \underset{\beta}{\arg\min}\ \|y - X\beta\|^2 + \lambda \beta^T S \beta = (X^T X + \lambda S)^{-1} X^T y$$

  • The hat matrix (also called the influence matrix) $A$ is thus $$A = X (X^T X + \lambda S)^{-1} X^T$$ and the fitted expectation is $\hat\mu = Ay$

  • For practical computation, we can introduce imaginary data to reformulate the penalized least squares problem as an ordinary least squares problem: $$\|y - X\beta\|^2 + \lambda \beta^T S \beta = \left\| \begin{bmatrix} y \\ 0 \end{bmatrix} - \begin{bmatrix} X \\ \sqrt{\lambda}\, D \end{bmatrix} \beta \right\|^2$$ (see the R sketch below)
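
A minimal R sketch of the augmented ("imaginary data") formulation, reusing the hypothetical tf_basis() helper and simulated data from the earlier sketch; prs_fit is an illustrative name, not an mgcv function.

## Penalized regression spline fit via ordinary least squares on augmented data
prs_fit <- function(y, x, xk, lambda) {
  X  <- tf_basis(x, xk)                 # tent-function model matrix
  D  <- diff(diag(length(xk)), differences = 2)
  Xa <- rbind(X, sqrt(lambda) * D)      # augmented model matrix [X; sqrt(lambda) D]
  ya <- c(y, rep(0, nrow(D)))           # response padded with zero "imaginary data"
  lm(ya ~ Xa - 1)
}

fit <- prs_fit(y, x, xk, lambda = 1e-3)
xs  <- seq(0, 1, length.out = 200)
plot(x, y)
lines(xs, tf_basis(xs, xk) %*% coef(fit))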

Hyper-parameter tuning

  • Between the two hyper-parameters, the number of knots $k$ and the smoothing parameter $\lambda$, the choice of $\lambda$ plays the crucial role

  • We can always use a $k$ large enough that the basis is more flexible than we expect to need to represent $f(x)$

  • In the mgcv package, the default choice is $k = 20$, and knots are spread evenly over the range of the observed data

Choose $\lambda$ by leave-one-out cross validation

  • Under linear regression, to compute the leave-one-out cross-validation error (called the ordinary cross validation score), we only need to fit the full model once: $$V_o = \frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat f_i^{[-i]}\right)^2 = \frac{1}{n}\sum_{i=1}^{n} \frac{(y_i - \hat f_i)^2}{(1 - A_{ii})^2}$$

    • $\hat f_i^{[-i]}$ is the prediction of $y_i$ from the model fitted to all data except $y_i$
    • $\hat f_i$ is the prediction from the model fitted to all the data, and $A_{ii}$ is the $i$th diagonal entry of the corresponding hat matrix
  • In practice, the $A_{ii}$ are often replaced by their mean $\mathrm{tr}(A)/n$. This results in the generalized cross validation (GCV) score: $$V_g = \frac{n \sum_{i=1}^{n} (y_i - \hat f_i)^2}{[n - \mathrm{tr}(A)]^2}$$ (see the R sketch below)
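
A minimal R sketch of a GCV grid search over $\lambda$, again reusing objects from the earlier sketches; this just evaluates the $V_g$ formula above and is not mgcv's smoothness selection code.

## GCV score for one value of lambda (tent basis + second-difference penalty)
gcv_score <- function(lambda, y, X, S) {
  A   <- X %*% solve(t(X) %*% X + lambda * S, t(X))   # hat / influence matrix
  fit <- A %*% y
  n   <- length(y)
  n * sum((y - fit)^2) / (n - sum(diag(A)))^2
}

X <- tf_basis(x, xk)
S <- crossprod(diff(diag(length(xk)), differences = 2))
lambdas <- 10^seq(-6, 2, length.out = 50)
scores  <- sapply(lambdas, gcv_score, y = y, X = X, S = S)
lambdas[which.min(scores)]                             # GCV-optimal lambda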

From the Bayesian perspective

  • The wiggliness penalty can be viewed as a normal prior distribution on $\beta$: $$\beta \sim N\!\left(0,\ \sigma^2 S^- / \lambda\right)$$

    • Because $S$ is rank deficient, the prior covariance is proportional to the pseudo-inverse $S^-$
  • The posterior of $\beta$ is still normal: $$\beta \mid y \sim N\!\left(\hat\beta,\ (X^T X + \lambda S)^{-1}\sigma^2\right)$$

  • Given the model, this extra structure opens up the possibility of estimating $\sigma^2$ and $\lambda$ by marginal likelihood maximization or REML (aka empirical Bayes); a short R sketch of the posterior follows
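
A short R sketch of the resulting posterior in the Gaussian case, conditional on the GCV-chosen $\lambda$ from the previous sketch; the variable names are illustrative and $\sigma^2$ is plugged in from a residual-based estimate.

lambda   <- lambdas[which.min(scores)]
M        <- t(X) %*% X + lambda * S
beta_hat <- solve(M, t(X) %*% y)                       # posterior mean = penalized LS estimate
A        <- X %*% solve(M, t(X))
edf      <- sum(diag(A))                               # effective degrees of freedom
sigma2   <- sum((y - X %*% beta_hat)^2) / (length(y) - edf)
V_beta   <- solve(M) * sigma2                          # posterior covariance of beta
se_fit   <- sqrt(rowSums((X %*% V_beta) * X))          # pointwise SE = sqrt(diag(X V X'))
upper    <- X %*% beta_hat + 2 * se_fit                # rough 95% credible band
lower    <- X %*% beta_hat - 2 * se_fit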

Additive Models

A simple additive model with two univariate functions

  • Let’s consider a simple additive model: $$y_i = \alpha + f_1(x_i) + f_2(v_i) + \epsilon_i, \qquad \epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$$

  • The assumption of additive effects is a fairly strong one

  • The model now has an identifiability problem: $f_1$ and $f_2$ are each only estimable to within an additive constant

    • The identifiability issue is resolved by imposing constraints on the penalized regression splines (see the sum-to-zero constraints below)

Piecewise linear regression representation

  • Basis representations of $f_1(\cdot)$ and $f_2(\cdot)$: $$f_1(x) = \sum_{j=1}^{k_1} b_j(x)\,\delta_j, \qquad f_2(v) = \sum_{j=1}^{k_2} B_j(v)\,\gamma_j$$

    • The basis functions $b_j(\cdot)$ and $B_j(\cdot)$ are tent functions, with evenly spaced knots in $x$ and $v$, respectively
  • Matrix representations: $$f_1 = [f_1(x_1), \ldots, f_1(x_n)]^T = X_1\delta, \quad [X_1]_{i,j} = b_j(x_i)$$ $$f_2 = [f_2(v_1), \ldots, f_2(v_n)]^T = X_2\gamma, \quad [X_2]_{i,j} = B_j(v_i)$$

Sum-to-zero constraints to resolve identifiability issues

  • We assume $$\sum_{i=1}^{n} f_1(x_i) = 0 \quad\Longleftrightarrow\quad 1^T f_1 = 0$$ This is equivalent to $1^T X_1 \delta = 0$ for all $\delta$, which implies $1^T X_1 = 0$

  • To achieve this condition, we can center the columns of $X_1$: $$\tilde X_1 = X_1 - \frac{1\,1^T X_1}{n}, \qquad \tilde f_1 = \tilde X_1 \delta$$

  • Column centering reduces the rank of $\tilde X_1$ to $k_1 - 1$, so that only $k_1 - 1$ elements of the $k_1$-vector $\delta$ can be uniquely estimated

  • A simple identifiability constraint:
    • Set a single element of $\delta$ to zero
    • And delete the corresponding column of $\tilde X_1$ and $D$
  • For notational simplicity, in what follows the tildes will be dropped, and we assume that the $X_j$ and $D_j$ are the constrained versions (a short R sketch of the centering and dropping steps follows)
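
A minimal R sketch of the centering and column-dropping steps, assuming $X_1$ is a tent-function model matrix built as in the earlier sketches; the _con names are illustrative only.

## Column-centre X1 so that 1' X1 = 0 (sum-to-zero constraint)
X1       <- tf_basis(x, xk)
X1_tilde <- sweep(X1, 2, colMeans(X1))   # X1 - (1 1' X1) / n
round(colSums(X1_tilde), 10)             # every column now sums to (numerically) zero

## Drop one coefficient, i.e. one column of both X1_tilde and D1, so the
## remaining k1 - 1 coefficients are uniquely estimable
X1_con <- X1_tilde[, -1]
D1_con <- diff(diag(ncol(X1)), differences = 2)[, -1]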

Penalized piecewise regression additive model

  • We can write the additive model as $$y = X\beta + \epsilon$$ where $X = (1, X_1, X_2)$ and $\beta^T = (\alpha, \delta^T, \gamma^T)$

  • Wiggliness penalties: $$\delta^T D_1^T D_1 \delta = \delta^T \bar S_1 \delta = \beta^T S_1 \beta, \qquad S_1 = \begin{bmatrix} 0 & 0 & 0 \\ 0 & \bar S_1 & 0 \\ 0 & 0 & 0 \end{bmatrix}$$ $$\gamma^T D_2^T D_2 \gamma = \gamma^T \bar S_2 \gamma = \beta^T S_2 \beta, \qquad S_2 = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & \bar S_2 \end{bmatrix}$$

Fitting additive models by penalized least squares

  • Penalized least squares objective function: $$\|y - X\beta\|^2 + \lambda_1 \beta^T S_1 \beta + \lambda_2 \beta^T S_2 \beta$$

  • Coefficient estimator: $$\hat\beta = (X^T X + \lambda_1 S_1 + \lambda_2 S_2)^{-1} X^T y$$

  • Hat matrix: $$A = X (X^T X + \lambda_1 S_1 + \lambda_2 S_2)^{-1} X^T$$

  • Conditional posterior distribution: $$\beta \mid y \sim N(\hat\beta, \hat V_\beta), \qquad \hat V_\beta = (X^T X + \lambda_1 S_1 + \lambda_2 S_2)^{-1} \hat\sigma^2$$ A minimal R sketch of this fit follows.
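
The sketch below assumes the constrained matrices X1_con, D1_con from the centering sketch, with analogous X2_con, D2_con built for the second covariate; fit_am is an illustrative helper, not mgcv code.

## Assemble X = (1, X1, X2) and the zero-padded penalty matrices S1, S2
X  <- cbind(1, X1_con, X2_con)
k1 <- ncol(X1_con); k2 <- ncol(X2_con)
p  <- 1 + k1 + k2
S1 <- matrix(0, p, p); S2 <- matrix(0, p, p)
S1[1 + 1:k1, 1 + 1:k1]           <- crossprod(D1_con)   # bar-S1 block (delta)
S2[1 + k1 + 1:k2, 1 + k1 + 1:k2] <- crossprod(D2_con)   # bar-S2 block (gamma)

## Penalized least squares fit and GCV score for given log smoothing parameters
fit_am <- function(lsp, y, X, S1, S2) {
  lambda <- exp(lsp)
  M    <- t(X) %*% X + lambda[1] * S1 + lambda[2] * S2
  beta <- solve(M, t(X) %*% y)
  A    <- X %*% solve(M, t(X))                           # hat matrix
  n    <- length(y)
  gcv  <- n * sum((y - X %*% beta)^2) / (n - sum(diag(A)))^2
  list(beta = beta, gcv = gcv)
}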

Choosing two smoothing parameters

  • Since we now have two smoothing parameters, $\lambda_1$ and $\lambda_2$, grid searching for the GCV-optimal values starts to become inefficient

  • Instead, R function optim can be used to minimize the GCV score

  • We can optimize over the log smoothing parameters to ensure that the estimated smoothing parameters are non-negative, as in the sketch below
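
A minimal sketch of that optimization, reusing the hypothetical fit_am() helper above: optim() works on the log smoothing parameters, so the estimated $\lambda_1, \lambda_2$ are automatically non-negative.

## Minimize the GCV score over the log smoothing parameters
gcv_obj <- function(lsp, y, X, S1, S2) fit_am(lsp, y, X, S1, S2)$gcv
opt <- optim(c(0, 0), gcv_obj, y = y, X = X, S1 = S1, S2 = S2)
exp(opt$par)                          # estimated (lambda1, lambda2)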

Generalized Additive Models

Generalized additive models

  • Generalized additive models (GAMs): additive models $+$ GLM: $$g(\mu_i) = \alpha + f_1(x_i) + f_2(v_i), \qquad Y_i \sim \mathrm{EF}(\mu_i, \phi)$$

  • Penalized iteratively re-weighted least squares (PIRLS) algorithm: iterate the following steps to convergence

  1. Given the current $\hat\eta$ and $\hat\mu$, compute $$w_i = \frac{1}{V(\hat\mu_i)\, g'(\hat\mu_i)^2}, \qquad z_i = g'(\hat\mu_i)(y_i - \hat\mu_i) + \hat\eta_i$$

  2. Letting $W = \mathrm{diag}(w_i)$, obtain the new $\hat\beta$ by minimizing $$\left\| \sqrt{W}\,z - \sqrt{W}X\beta \right\|^2 + \lambda_1 \beta^T S_1 \beta + \lambda_2 \beta^T S_2 \beta$$ (a toy R sketch follows)
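
A toy R sketch of the PIRLS iteration for fixed smoothing parameters, assuming (for illustration) a Poisson response with log link, so that $V(\mu) = \mu$, $g'(\mu) = 1/\mu$ and $w_i = \hat\mu_i$; X, S1, S2 are the assembled matrices from the earlier sketches.

pirls_fit <- function(y, X, S1, S2, lambda1, lambda2, n_iter = 25) {
  eta <- log(y + 0.1)                  # rough starting value for the linear predictor
  for (it in 1:n_iter) {               # fixed iteration count for simplicity
    mu <- exp(eta)                     # inverse of the log link
    w  <- mu                           # 1 / (V(mu) g'(mu)^2) for Poisson / log link
    z  <- (y - mu) / mu + eta          # working response g'(mu)(y - mu) + eta
    M    <- t(X * w) %*% X + lambda1 * S1 + lambda2 * S2
    beta <- solve(M, t(X * w) %*% z)   # penalized weighted least squares step
    eta  <- drop(X %*% beta)
  }
  beta
}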

Introducing Package mgcv

Introducing package mgcv

  • Main function: gam(), very much like the glm() function

  • Smooth terms: s() for univariate smooths and te() for tensor product smooths of several covariates

  • A gamma regression example: $$\log(E[{\tt Volume}_i]) = f_1({\tt Height}_i) + f_2({\tt Girth}_i), \qquad {\tt Volume}_i \sim \text{Gamma}$$

library(mgcv)  ## load the package
data(trees)
ct1 <- gam(Volume ~ s(Height) + s(Girth),
           family=Gamma(link=log), data=trees)
  • By default, the degree of smoothness of the $f_j$ is estimated by GCV

summary(ct1)
## 
## Family: Gamma 
## Link function: log 
## 
## Formula:
## Volume ~ s(Height) + s(Girth)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.27570    0.01492   219.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##             edf Ref.df      F  p-value    
## s(Height) 1.000  1.000  31.32 7.07e-06 ***
## s(Girth)  2.422  3.044 219.28  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.973   Deviance explained = 97.8%
## GCV = 0.0080824  Scale est. = 0.006899  n = 31

Partial residual plots

  • Pearson residuals added to the estimated smooth terms: $$\hat\epsilon^{\text{partial}}_{1i} = \hat f_1({\tt Height}_i) + \hat\epsilon^p_i$$
par(mfrow = c(1, 2))
plot(ct1,residuals=TRUE)

  • The number in the y-axis label is the effective degrees of freedom of the smooth term

Finer control of gam(): choice of basis functions

  • Default: thin plate regression splines

    • They have some appealing properties, but can be somewhat computationally costly for large datasets
  • We can select a penalized cubic regression spline basis by using

s(..., bs = "cr")
  • We can change the dimension $k$ of the basis

    • The actual effective degrees of freedom for each term is usually estimated from the data by GCV or another smoothness selection criterion
    • The upper bound on this estimate is $k - 1$: one is subtracted because of the identifiability constraint on each smooth term (a complete call is shown after the snippet below)
s(..., bs = "cr", k = 20)

Finer control of gam(): the gamma parameter

  • GCV is known to have some tendency to overfit

  • Inside the gam() function, the argument gamma can be used to increase the amount of smoothing

    • The default value of gamma is 1
    • We can use a higher value, e.g. gamma = 1.5, to discourage overfitting without compromising model fit much (see the example below)
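
A short usage sketch: gamma is a genuine gam() argument, and the refit below is an illustrative variant of ct1.

## Increase the effective cost of each degree of freedom in the GCV criterion
ct3 <- gam(Volume ~ s(Height) + s(Girth),
           family = Gamma(link = log), data = trees, gamma = 1.5)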

References

  • Wood, Simon N. (2017), Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC