Book Notes: Generalized Additive Models - Ch3 Generalized Linear Models (GLM)

For the pdf slides, click here

GLM overview

  • In a GLM, a smooth monotonic link function $g(\cdot)$ connects the expectation $\mu_i = E(Y_i)$ with the linear predictor: $g(\mu_i) = \eta_i = X_i\beta$

  • In a generalized linear mixed model (GLMM), we have $g(\mu_i) = \eta_i = X_i\beta + Z_i b$, $b \sim N(0, \psi)$

Theory of GLMs

Exponential family

Exponential family of distributions

  • The density function for an exponential family distribution is $f(y) = \exp\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right\}$
    • $a, b, c$: arbitrary functions
    • $\phi$: an arbitrary scale parameter
    • $\theta$: the canonical parameter; it depends completely on the model parameters $\beta$
  • Properties of the exponential family mean and variance: $E(Y) = b'(\theta)$, $\text{var}(Y) = b''(\theta)\,a(\phi)$

    • In most practical cases, $a(\phi) = \phi/\omega$, where $\omega$ is a known constant
    • We define the function $V(\mu) = b''(\theta)/\omega$, so that $\text{var}(Y) = V(\mu)\,\phi$

Exponential family examples
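
As a concrete instance (a standard example, not reproduced from the slides), the Poisson distribution with mean $\mu$ can be written in exponential family form:

```latex
f(y) = \frac{\mu^y e^{-\mu}}{y!}
     = \exp\{\, y\log\mu - \mu - \log y! \,\},
\qquad
\theta = \log\mu,\quad b(\theta) = e^{\theta},\quad a(\phi) = \phi = 1,\quad c(y,\phi) = -\log y!
```

The identities above then give $E(Y) = b'(\theta) = e^{\theta} = \mu$ and $\text{var}(Y) = b''(\theta)\,a(\phi) = \mu$, so $V(\mu) = \mu$ and the canonical link is $g(\mu) = \log\mu$.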

Iteratively re-weighted least squares (IRLS)

Fitting GLMs

  • For the GLM defined above, assuming $a_i(\phi) = \phi/\omega_i$, the log likelihood is $l(\beta) = \sum_{i=1}^n \left\{ \omega_i[y_i\theta_i - b_i(\theta_i)]/\phi + c_i(\phi, y_i) \right\}$

  • To optimize, we use Newton's method, an iterative optimization approach (written here for a generic parameter vector $\theta$): $\theta^{(t+1)} = \theta^{(t)} - (\nabla^2 l)^{-1}\nabla l$
    • where both $\nabla^2 l$ and $\nabla l$ are evaluated at the current iterate $\theta^{(t)}$
    • Alternatively, we can use the Fisher scoring variant of Newton's method, replacing the Hessian matrix with its expectation
  • Next, we need to compute the gradient vector and expected Hessian matrix of $l$

Compute the gradient vector and expected Hessian of l

  • By the chain rule, $\frac{\partial \theta_i}{\partial \beta_j} = \frac{\partial \theta_i}{\partial \mu_i}\frac{\partial \mu_i}{\partial \eta_i}\frac{\partial \eta_i}{\partial \beta_j} = \frac{1}{b''(\theta_i)}\,\frac{1}{g'(\mu_i)}\,X_{ij}$

  • Therefore, the gradient vector of $l$ has elements $\frac{\partial l}{\partial \beta_j} = \frac{1}{\phi}\sum_{i=1}^n \omega_i\left[y_i - b_i'(\theta_i)\right]\frac{\partial\theta_i}{\partial\beta_j} = \frac{1}{\phi}\sum_{i=1}^n \frac{y_i - \mu_i}{g'(\mu_i)\,V(\mu_i)}\,X_{ij}$

  • The expected Hessian (expectation taken wrt $Y$) has elements $E\!\left(\frac{\partial^2 l}{\partial\beta_j\,\partial\beta_k}\right) = -\frac{1}{\phi}\sum_{i=1}^n \frac{X_{ij}X_{ik}}{g'(\mu_i)^2\,V(\mu_i)}$

The Fisher scoring update

  • Define the matrices $W = \text{diag}\{w_i\}$ with $w_i = \frac{1}{g'(\mu_i)^2\,V(\mu_i)}$, and $G = \text{diag}\{g'(\mu_i)\}$

  • The Fisher scoring update for $\beta$ is $\beta^{(t+1)} = \beta^{(t)} + (X^TWX)^{-1}X^TWG(y - \mu) = (X^TWX)^{-1}X^TW\underbrace{\left[G(y - \mu) + X\beta^{(t)}\right]}_{z}$

Iteratively re-weighted least squares (IRLS) algorithm

  1. Initialization: $\hat\mu_i = y_i + \delta_i$, $\hat\eta_i = g(\hat\mu_i)$

    • $\delta_i$ is usually zero, but may be a small constant ensuring that $\hat\eta_i$ is finite
  2. Compute the pseudo-data $z_i$ and iterative weights $w_i$: $z_i = g'(\hat\mu_i)(y_i - \hat\mu_i) + \hat\eta_i$, $w_i = \frac{1}{g'(\hat\mu_i)^2\,V(\hat\mu_i)}$

  3. Find $\hat\beta$ by minimizing the weighted least squares objective $\sum_{i=1}^n w_i(z_i - X_i\beta)^2$, then update $\hat\eta = X\hat\beta$ and $\hat\mu_i = g^{-1}(\hat\eta_i)$

  • Repeat Steps 2-3 until the change in deviance is near zero (a Python sketch of this loop is given below)
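
A minimal Python sketch of the IRLS loop, purely illustrative (the function name `irls` and its arguments are my own, not from the book); the link $g$, its derivative, its inverse, and the variance function $V$ are passed in as callables:

```python
import numpy as np

def irls(X, y, g, g_prime, g_inv, V, delta=0.0, tol=1e-8, max_iter=100):
    """Fit a GLM by iteratively re-weighted least squares (illustrative sketch).

    X: (n, p) model matrix; y: (n,) response.
    g, g_prime, g_inv, V: link, link derivative, inverse link, variance function.
    delta: optional shift so the initial eta = g(y + delta) is finite.
    """
    mu = y + delta                                    # Step 1: initialize mu
    eta = g(mu)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        z = g_prime(mu) * (y - mu) + eta              # Step 2: pseudo-data z_i
        w = 1.0 / (g_prime(mu) ** 2 * V(mu))          # Step 2: iterative weights w_i
        XtW = X.T * w                                 # X^T W, with W = diag(w)
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)  # Step 3: weighted least squares
        eta = X @ beta_new                            # Step 3: update eta and mu
        mu = g_inv(eta)
        if np.max(np.abs(beta_new - beta)) < tol:     # simple stopping rule on beta
            beta = beta_new                           # (the notes stop on the change in deviance)
            break
        beta = beta_new
    return beta, mu, w
```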

IRLS example 1: logistic regression

  • For logistic regression, $g(\mu) = \log\!\left(\frac{\mu}{1-\mu}\right)$, $g'(\mu) = \frac{1}{\mu(1-\mu)}$, $V(\mu) = \mu(1-\mu)$, and $\phi = 1$

  • Therefore, in Step 2 of IRLS, $z_i = \frac{y_i - \hat\mu_i}{\hat\mu_i(1-\hat\mu_i)} + \hat\eta_i$ and $w_i = \hat\mu_i(1-\hat\mu_i)$ (see the usage sketch below)
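
Plugging these quantities into the hypothetical `irls()` sketch above (again illustrative; the data are simulated here only to make the call concrete):

```python
import numpy as np

# Link, link derivative, inverse link and variance function for logistic regression
logit   = lambda mu: np.log(mu / (1 - mu))
dlogit  = lambda mu: 1.0 / (mu * (1 - mu))
expit   = lambda eta: 1.0 / (1 + np.exp(-eta))
V_binom = lambda mu: mu * (1 - mu)

# Simulated binary data (for illustration only)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = rng.binomial(1, expit(X @ np.array([-0.5, 1.0]))).astype(float)

# delta moves y away from 0/1 so the initial eta = logit(y + delta) is finite
delta = np.where(y > 0.5, -0.25, 0.25)
beta_hat, mu_hat, w_hat = irls(X, y, logit, dlogit, expit, V_binom, delta=delta)
```

Since $\phi = 1$ here, no separate dispersion estimate is needed.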

IRLS example 2: GLM with independent normal priors

  • Assume that the vector $\beta$ has independent normal priors: $\beta \sim N\!\left(0, \frac{\phi}{\lambda}I_p\right)$

  • Log posterior density (we still call it $l$, with some abuse of notation): $l(\beta) = \frac{1}{\phi}\sum_{i=1}^n \omega_i[y_i\theta_i - b_i(\theta_i)] - \frac{\lambda}{2\phi}\,\beta^T\beta + \text{const}$

  • Gradient vector and expected Hessian matrix (wrt $\beta$): $\nabla l = \frac{1}{\phi}\left[X^TWG(y - \mu) - \lambda\beta\right]$, $E(\nabla^2 l) = -\frac{1}{\phi}\left(X^TWX + \lambda I_p\right)$

    • Here, $W$ and $G$ are the same matrices as in the Fisher scoring update above

  • IRLS for a GLM with independent normal priors (sketched in code below): $\beta^{(t+1)} = \beta^{(t)} + (X^TWX + \lambda I_p)^{-1}\left[X^TWG(y - \mu) - \lambda\beta^{(t)}\right] = (X^TWX + \lambda I_p)^{-1}X^TW\underbrace{\left[G(y - \mu) + X\beta^{(t)}\right]}_{z}$
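
In code, the only change relative to plain IRLS is a ridge term in the weighted least squares solve. A minimal sketch of the modified Step 3 (hypothetical helper; `z` and `w` are assumed to come from Step 2 of IRLS):

```python
import numpy as np

def penalized_wls_step(X, z, w, lam):
    """One IRLS update with independent normal priors on beta:
    beta = (X^T W X + lam * I_p)^{-1} X^T W z, where W = diag(w)."""
    p = X.shape[1]
    XtW = X.T * w                     # X^T W
    A = XtW @ X + lam * np.eye(p)     # X^T W X + lambda * I_p
    return np.linalg.solve(A, XtW @ z)
```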

Asymptotic consistency of MLE, deviance, tests, residuals

Large sample distribution of $\hat\beta$

  • Hessian of the negative log likelihood (also called the observed information): $\hat{\mathcal{I}} = \frac{X^TWX}{\phi}$

  • Fisher information, also called the expected information: $\mathcal{I} = E(\hat{\mathcal{I}})$

  • Asymptotic normality of the MLE $\hat\beta$: $\hat\beta \sim N(\beta, \mathcal{I}^{-1})$ or $\hat\beta \sim N(\beta, \hat{\mathcal{I}}^{-1})$, approximately (standard errors are sketched in code below)
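
Given a fit (for instance from the IRLS sketch earlier), approximate standard errors and Wald intervals follow directly from $\hat{\mathcal{I}}^{-1} = \phi\,(X^TWX)^{-1}$. A small illustrative helper (names are my own):

```python
import numpy as np

def wald_summary(X, w, beta_hat, phi_hat=1.0, z_crit=1.96):
    """Approximate standard errors and ~95% Wald intervals for beta_hat,
    using cov(beta_hat) ~= (X^T W X / phi)^{-1}."""
    cov = phi_hat * np.linalg.inv((X.T * w) @ X)   # inverse observed information
    se = np.sqrt(np.diag(cov))
    return se, beta_hat - z_crit * se, beta_hat + z_crit * se
```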

Deviance

  • Deviance is the GLM counterpart of the residual sum of squares in normal linear regression: $D = 2\phi\left[l(\hat\beta_{\max}) - l(\hat\beta)\right] = \sum_{i=1}^n 2\omega_i\left[y_i(\tilde\theta_i - \hat\theta_i) - b(\tilde\theta_i) + b(\hat\theta_i)\right]$

    • Here, $l(\hat\beta_{\max})$ is the maximized log likelihood of the saturated model: the model with one parameter per data point. For exponential family distributions, it is computed by simply setting $\hat\mu = y$.
    • $\tilde\theta$ and $\hat\theta$ are the maximum likelihood estimates of the canonical parameters for the saturated model and the model of interest, respectively
  • From the second equality, we can see that the deviance does not depend on $\phi$

  • For normal linear regression, the deviance equals the residual sum of squares (a short verification follows below)
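
To see the last point (a standard check, using the definitions above): for the normal distribution with identity link, $\theta = \mu$, $b(\theta) = \theta^2/2$, and $\omega_i = 1$, so with $\tilde\theta_i = y_i$ and $\hat\theta_i = \hat\mu_i$,

```latex
D = \sum_{i=1}^n 2\left[\, y_i(y_i - \hat\mu_i) - \tfrac{1}{2}y_i^2 + \tfrac{1}{2}\hat\mu_i^2 \,\right]
  = \sum_{i=1}^n \left( y_i^2 - 2 y_i \hat\mu_i + \hat\mu_i^2 \right)
  = \sum_{i=1}^n (y_i - \hat\mu_i)^2 .
```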

Scaled deviance

  • The scaled deviance does depend on $\phi$: $D^* = D/\phi$

  • If the model is specified correctly, then approximately $D^* \sim \chi^2_{n-p}$

  • To compare two nested models (illustrated in code below),
    • If $\phi$ is known, then under $H_0$ we can use $D_0^* - D_1^* \sim \chi^2_{p_1 - p_0}$
    • If $\phi$ is unknown, then under $H_0$ we can use $F = \frac{(D_0 - D_1)/(p_1 - p_0)}{D_1/(n - p_1)} \sim F_{p_1 - p_0,\; n - p_1}$
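
A small numerical sketch of the two tests; the deviances and model dimensions below are made-up values, used only to show the mechanics:

```python
from scipy import stats

# Hypothetical nested models: null (p0 parameters, deviance D0) vs larger (p1, D1)
n, p0, p1 = 100, 2, 5
D0, D1 = 130.0, 112.0

# phi known (e.g. phi = 1 for binomial or Poisson): chi-square test on D0* - D1*
phi = 1.0
chi2_stat = (D0 - D1) / phi
p_chi2 = stats.chi2.sf(chi2_stat, df=p1 - p0)

# phi unknown: F test
F_stat = ((D0 - D1) / (p1 - p0)) / (D1 / (n - p1))
p_F = stats.f.sf(F_stat, p1 - p0, n - p1)

print(f"chi-square p-value: {p_chi2:.3f}, F-test p-value: {p_F:.3f}")
```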

GLM residuals

  • Model checking is perhaps the most important part of applied statistical modeling

  • It is usual to standardize GLM residuals so that if the model assumptions are correct,

    • the standardized residuals should have approximately equal variance, and
    • behave like residuals from an ordinary linear model
  • Pearson residuals: $\hat\epsilon_i^{\,p} = \frac{y_i - \hat\mu_i}{\sqrt{V(\hat\mu_i)}}$

    • In practice, the distribution of the Pearson residuals can be quite asymmetric around zero. So the deviance residuals (introduced next) are often preferred.

Deviance residuals

  • Denote $d_i$ as the $i$th component in the deviance definition above, so that the deviance is $D = \sum_{i=1}^n d_i$

  • By analogy with the ordinary linear model, we define the deviance residual $\hat\epsilon_i^{\,d} = \text{sign}(y_i - \hat\mu_i)\,\sqrt{d_i}$

    • The sum of squares of the deviance residuals gives the deviance itself (see the Poisson example below)
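
For example (the standard Poisson case, not reproduced from the notes), $V(\mu) = \mu$ and the deviance component is $d_i = 2[\,y_i\log(y_i/\hat\mu_i) - (y_i - \hat\mu_i)\,]$, so the two residuals are

```latex
\hat\epsilon_i^{\,p} = \frac{y_i - \hat\mu_i}{\sqrt{\hat\mu_i}},
\qquad
\hat\epsilon_i^{\,d} = \operatorname{sign}(y_i - \hat\mu_i)\,
  \sqrt{\,2\left[\, y_i \log\!\left(\tfrac{y_i}{\hat\mu_i}\right) - (y_i - \hat\mu_i) \right]} .
```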

Quasi-likelihood (GEE)

Quasi-likelihood

  • Consider an observation yi, of a random variable with mean μi and known variance function V(μi)

    • Getting the distribution of Yi exactly right is rather unimportant, as long as the mean-variance relationship V() is correct
  • Then the log quasi-likelihood for $\mu_i$, given $y_i$, is $q_i(\mu_i) = \int_{y_i}^{\mu_i} \frac{y_i - z}{\phi\,V(z)}\, dz$

    • The log quasi-likelihood for the mean vector $\mu$ of all the response data is $q(\mu) = \sum_{i=1}^n q_i(\mu_i)$
  • To obtain the maximum quasi-likelihood estimate of $\beta$, we differentiate $q$ wrt each $\beta_j$ and set the derivatives to zero: $\frac{\partial q}{\partial\beta_j} = \sum_{i=1}^n \frac{y_i - \mu_i}{\phi\,V(\mu_i)}\frac{\partial\mu_i}{\partial\beta_j} = \frac{1}{\phi}\sum_{i=1}^n \frac{y_i - \mu_i}{V(\mu_i)\,g'(\mu_i)}\,X_{ij} = 0$. This is exactly the GLM maximum likelihood equation, so the solution can be obtained through IRLS (see the worked example below)
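
As a worked example of this equivalence (a standard case, sketched here), take $V(\mu) = \mu$; the integral can be evaluated directly:

```latex
q_i(\mu_i) = \int_{y_i}^{\mu_i} \frac{y_i - z}{\phi\, z}\, dz
           = \frac{1}{\phi}\Big[\, y_i \log z - z \,\Big]_{z=y_i}^{z=\mu_i}
           = \frac{y_i \log \mu_i - \mu_i}{\phi} + \text{const},
```

which, up to the constant (which does not involve $\mu_i$) and the factor $1/\phi$, is the Poisson log likelihood, even though no Poisson assumption was made about $Y_i$.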

Generalized Linear Mixed Models (GLMM)

Generalized linear mixed models (GLMM)

  • A GLMM for an exponential family random variable $Y_i$: $g(\mu_i) = X_i\beta + Z_i b$, $b \sim N(0, \psi_\theta)$

  • Difficulty in moving from linear mixed models to GLMM: it is no longer possible to evaluate the marginal likelihood analytically

  • One effective solution is a Taylor expansion around $\hat b$, the posterior mode of $f(b \mid y, \beta)$:

$$f(y \mid \beta) \approx \int \exp\left\{ \log f(y, \hat b \mid \beta) + \frac{1}{2}(b - \hat b)^T \frac{\partial^2 \log f(y, b \mid \beta)}{\partial b\,\partial b^T}\bigg|_{b=\hat b} (b - \hat b) \right\} db$$

Laplace approximation of GLMM marginal likelihood

  • From the GLM results above, the expected Hessian of $\log f(y, b \mid \beta)$ wrt $b$ is $-\frac{Z^TWZ}{\phi} - \psi_\theta^{-1}$

    • $W$ is the diagonal matrix of IRLS weights based on the $\mu$ implied by $\hat b$ and $\beta$
  • Therefore, the approximate marginal likelihood is $f(y \mid \beta) \approx f(y, \hat b \mid \beta)\,(2\pi)^{p/2}\left|\frac{Z^TWZ}{\phi} + \psi_\theta^{-1}\right|^{-1/2}$, where $p$ is the dimension of $b$ (the Gaussian integral step is spelled out below)
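
The step from the quadratic expansion to this expression is the standard multivariate Gaussian integral: writing $H = Z^TWZ/\phi + \psi_\theta^{-1}$ for minus the expected Hessian,

```latex
\int \exp\left\{ -\tfrac{1}{2}\,(b - \hat b)^T H\, (b - \hat b) \right\} db
  = (2\pi)^{p/2}\, |H|^{-1/2},
```

and multiplying by $f(y, \hat b \mid \beta)$, which factors out of the integral, gives the approximation above.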

Penalized likelihood and penalized IRLS

  • The point estimators $\hat\beta$ and $\hat b$ are obtained by maximizing the penalized log likelihood

$$\hat\beta, \hat b = \arg\max_{\beta, b}\, \log f(y, b \mid \beta) = \arg\max_{\beta, b}\left\{ \log f(y \mid b, \beta) - b^T\psi_\theta^{-1}b/2 \right\}$$

  • To simplify notation, we denote $B^T = (b^T, \beta^T)$, $\mathcal{X} = (Z, X)$, and $S = \begin{bmatrix} \psi_\theta^{-1} & 0 \\ 0 & 0 \end{bmatrix}$

  • A penalized version of the IRLS algorithm (PIRLS): by analogy with the IRLS derivation above, a single Newton update step is $B^{(t+1)} = \left(\mathcal{X}^TW\mathcal{X} + \phi S\right)^{-1}\mathcal{X}^TW\left[G(y - \hat\mu) + \mathcal{X}B^{(t)}\right]$ (sketched in code below)
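
A minimal sketch of one PIRLS update (illustrative; `Xc` plays the role of $\mathcal{X} = (Z, X)$ and `S` is the block penalty, both assumed to be pre-built, with `eta = Xc @ B` at the current iterate):

```python
import numpy as np

def pirls_step(Xc, y, mu, eta, S, phi, g_prime, V):
    """One PIRLS update: B_new = (Xc^T W Xc + phi*S)^{-1} Xc^T W [G(y - mu) + Xc B],
    where the pseudo-data z = g'(mu)*(y - mu) + eta equals G(y - mu) + Xc B."""
    w = 1.0 / (g_prime(mu) ** 2 * V(mu))   # IRLS weights at the current mu
    z = g_prime(mu) * (y - mu) + eta       # working response
    XtW = Xc.T * w                         # Xc^T W
    A = XtW @ Xc + phi * S                 # penalized normal-equations matrix
    return np.linalg.solve(A, XtW @ z)     # new coefficient vector B = (b, beta)
```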

Penalized quasi-likelihood method

  • Since optimizing the Laplace approximate marginal likelihood can be computationally costly, it is tempting to instead perform PIRLS iterations, estimating $\theta, \phi$ at each step based on the working mixed model $z \mid b, \beta \sim N(X\beta + Zb,\; W^{-1}\phi)$, $b \sim N(0, \psi_\theta)$

References

  • Wood, Simon N. (2017), Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC