Book Notes: Generalized Additive Models - Ch3 Generalized Linear Models (GLM)

For the pdf slides, click here

GLM overview

  • In a GLM, a smooth monotonic link function \(g(\cdot)\) connects the expectation \(\mu_i = E(Y_i)\) to the linear predictor \(\eta_i = \mathbf{X}_i \boldsymbol\beta\), \[\begin{equation}\label{eq:glm_link} g(\mu_i) = \eta_i = \mathbf{X}_i \boldsymbol\beta \end{equation}\]

  • In a generalized linear mixed model (GLMM), we have \[ g(\mu_i) = \eta_i = \mathbf{X}_i \boldsymbol\beta + \mathbf{Z}_i \mathbf{b}, \quad \mathbf{b} \sim \text{N}(\mathbf{0}, \boldsymbol\psi) \]

Theory of GLMs

Exponential family

Exponential family of distributions

  • The density function for an exponential family distribution \[\begin{equation}\label{eq:exponential_family_density} f(y) = \exp\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\} \end{equation}\]
    • \(a, b, c\): arbitrary functions
    • \(\phi\): an arbitrary scale parameter
    • \(\theta\): the canonical parameter; it depends completely on the model parameters \(\boldsymbol\beta\)
  • Properties of the exponential family mean and variance \[\begin{align*} E(Y) & = b^{\prime}(\theta)\\ var(Y) & = b^{\prime\prime}(\theta) a(\phi) \end{align*}\]

    • In most practical cases, \(a(\phi) = \phi/\omega\) where \(\omega\) is a known constant
    • We define the variance function \[ V(\mu) = b^{\prime\prime}(\theta) / \omega, \quad \text{so that } var(Y) = V(\mu) \phi \]

Exponential family examples
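
  • As a quick worked example (my own addition to fill in this slide): the Poisson distribution with mean \(\mu\) has \[ f(y) = \exp\left\{ y\log\mu - \mu - \log(y!) \right\} \] so \(\theta = \log\mu\), \(b(\theta) = e^{\theta}\), \(a(\phi) = \phi = 1\), and \(c(y, \phi) = -\log(y!)\). The properties above then give \(E(Y) = b^{\prime}(\theta) = e^{\theta} = \mu\) and \(var(Y) = b^{\prime\prime}(\theta)a(\phi) = \mu\), i.e. \(V(\mu) = \mu\).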

Iteratively re-weighted least squares (IRLS)

Fitting GLMs

  • For the GLM defined by the exponential family density and the link function above, assuming \(a_i(\phi) = \phi/\omega_i\), the log likelihood is \[ l(\boldsymbol\beta) = \sum_{i=1}^n \left\{ \omega_i \left[ y_i \theta_i - b_i(\theta_i) \right]/\phi + c_i(\phi, y_i) \right\} \]

  • To optimize, we use Newton’s method, an iterative optimization approach \[ \boldsymbol\beta^{(t+1)} = \boldsymbol\beta^{(t)} - \left(\nabla^2 l\right)^{-1}\nabla l \]
    • Where both \(\nabla^2 l\) and \(\nabla l\) are evaluated at the current iterate \(\boldsymbol\beta^{(t)}\)
    • Alternatively, we can use the Fisher scoring variant of Newton’s method, replacing the Hessian matrix with its expectation
  • Next, we will need to compute the gradient vector and expected Hessian matrix of \(l\)

Compute the gradient vector and expected Hessian of \(l\)

  • By the chain rule, \[\begin{align*} \frac{\partial \theta_i}{\partial \beta_j} & = \frac{\partial \theta_i}{\partial \mu_i} \cdot \frac{\partial \mu_i}{\partial \eta_i} \cdot \frac{\partial \eta_i}{\partial \beta_j} \\ & = \frac{1}{b^{\prime\prime}(\theta_i)}\cdot \frac{1}{g^{\prime}(\mu_i)}\cdot X_{ij} \end{align*}\]

  • Therefore, the gradient vector of \(l\) is \[\begin{align*} \frac{\partial l}{\partial \beta_j} = \frac{1}{\phi} \sum_{i=1}^n \omega_i \left[y_i - b^{\prime}_i(\theta_i) \right]\frac{\partial \theta_i}{\partial \beta_j} = \frac{1}{\phi}\sum_{i=1}^n \frac{y_i - \mu_i}{g^{\prime}(\mu_i) V(\mu_i)} X_{ij} \end{align*}\]

  • The expected Hessian (expectation taken wrt \(Y\)) is \[ E\left( \frac{\partial^2 l}{\partial \beta_j \partial \beta_k} \right) = - \frac{1}{\phi}\sum_{i=1}^n \frac{X_{ij} X_{ik}}{g^{\prime}(\mu_i)^2 V(\mu_i)} \]
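
  • To fill in the step (a short derivation added here, not in the original slides): differentiating the gradient once more gives \[ \frac{\partial^2 l}{\partial \beta_j \partial \beta_k} = \frac{1}{\phi}\sum_{i=1}^n \left[ (y_i - \mu_i)\,\frac{\partial}{\partial \beta_k}\left\{\frac{1}{g^{\prime}(\mu_i) V(\mu_i)}\right\} - \frac{1}{g^{\prime}(\mu_i) V(\mu_i)}\,\frac{\partial \mu_i}{\partial \beta_k} \right] X_{ij} \] Taking expectations kills the first term (since \(E(y_i) = \mu_i\)); substituting \(\partial \mu_i/\partial \beta_k = X_{ik}/g^{\prime}(\mu_i)\) gives the expression above.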

The Fisher scoring update

  • Define the matrices \[\begin{align}\label{eq:W} \mathbf{W} & = \text{diag}\{w_i\}, \quad w_i = \frac{1}{g^{\prime}(\mu_i)^2 V(\mu_i)} \\ \label{eq:G} \mathbf{G} & = \text{diag}\left\{g^{\prime}(\mu_i)\right\} \end{align}\]

  • The Fisher scoring update for \(\boldsymbol\beta\) is \[\begin{align*} \boldsymbol\beta^{(t+1)} & = \boldsymbol\beta^{(t)} + \left(\mathbf{X}^T \mathbf{W} \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{W} \mathbf{G} (\mathbf{y} - \boldsymbol\mu)\\ & = \left(\mathbf{X}^T \mathbf{W} \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{W} \underbrace{\left[ \mathbf{G}(\mathbf{y} - \boldsymbol\mu) + \mathbf{X}\boldsymbol\beta^{(t)} \right]}_{\mathbf{z}} \end{align*}\]

Iteratively re-weighted least squares (IRLS) algorithm

  1. Initialization: \[ \hat{\mu}_i = y_i + \delta_i, \quad \hat{\eta}_i = g\left(\hat{\mu}_i\right) \]

    • \(\delta_i\) is usually zero, but may be a small constant ensuring \(\hat{\eta}_i\) is finite
  2. Compute pseudo data \(z_i\) and iterative weights \(w_i\): \[\begin{align*} z_i &= g^{\prime}\left(\hat{\mu}_i\right)\left(y_i - \hat{\mu}_i\right) + \hat{\eta}_i\\ w_i &= \frac{1}{g^{\prime}\left(\hat{\mu}_i\right)^2 V\left(\hat{\mu}_i\right)} \end{align*}\]

  3. Find \(\hat{\boldsymbol\beta}\) by minimizing the weighted least squares objective \[ \sum_{i=1}^n w_i \left(z_i -\mathbf{X}_i \boldsymbol\beta \right)^2 \] then update \[ \hat{\boldsymbol\eta} = \mathbf{X} \hat{\boldsymbol\beta}, \quad \hat{\mu}_i = g^{-1}\left(\hat{\eta}_i\right) \]

  • Repeat Steps 2-3 until the change in deviance is near zero

IRLS example 1: logistic regression

  • For logistic regression, \[\begin{align*} g(\mu) & = \log\left(\frac{\mu}{1 - \mu}\right), \quad g^{\prime}(\mu) = \frac{1}{\mu (1-\mu)}\\ V(\mu) &= \mu (1-\mu), \quad\quad\quad \phi =1 \end{align*}\]

  • Therefore, in Step 2 of IRLS, \[\begin{align*} z_i & = \frac{y_i - \hat{\mu}_i}{\hat{\mu}_i\left(1-\hat{\mu}_i\right)} + \hat{\eta}_i \\ w_i & = \hat{\mu}_i\left(1-\hat{\mu}_i\right) \end{align*}\]
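
  • As a concrete illustration of Steps 1-3 for this case, here is a minimal Python/NumPy sketch (my own addition, not the book's code; the book works in R). The function name `irls_logistic` and the initialization constant are illustrative choices:

```python
import numpy as np

def irls_logistic(X, y, max_iter=25, tol=1e-8):
    """IRLS for logistic regression, following Steps 1-3 above (phi = 1).

    X : (n, p) design matrix (include a column of ones for an intercept).
    y : (n,) array of 0/1 responses.
    """
    n, p = X.shape
    # Step 1: initialize mu away from 0 and 1 so that eta = g(mu) is finite
    mu = (y + 0.5) / 2.0
    eta = np.log(mu / (1.0 - mu))
    dev_old = np.inf
    for _ in range(max_iter):
        # Step 2: pseudo-data z and iterative weights w
        z = (y - mu) / (mu * (1.0 - mu)) + eta   # g'(mu)(y - mu) + eta
        w = mu * (1.0 - mu)                      # 1 / (g'(mu)^2 V(mu))
        # Step 3: weighted least squares, then update eta and mu
        XtW = X.T * w                            # X^T W
        beta = np.linalg.solve(XtW @ X, XtW @ z)
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))
        # stop when the change in deviance is near zero
        dev = -2.0 * np.sum(y * np.log(mu) + (1.0 - y) * np.log(1.0 - mu))
        if abs(dev - dev_old) < tol:
            break
        dev_old = dev
    return beta
```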

IRLS example 2: GLM with independent normal priors

  • Assume that the vector \(\boldsymbol\beta\) has independent normal priors \[ \boldsymbol\beta \sim \text{N}\left(\mathbf{0}, \frac{\phi}{\lambda}\mathbf{I}_p \right) \]

  • Log posterior density (we still call it \(l\), with some abuse of notation) \[ l(\boldsymbol\beta) = \frac{1}{\phi}\sum_{i=1}^n \omega_i \left[ y_i \theta_i - b_i(\theta_i) \right]- \frac{\lambda}{2\phi}\boldsymbol\beta^T \boldsymbol\beta + \text{const} \]

  • Gradient vector and expected Hessian matrix (wrt \(\boldsymbol\beta\)) \[\begin{align*} \nabla l &= \frac{1}{\phi} \left[\mathbf{X}^T \mathbf{W}\mathbf{G}(\mathbf{y} - \boldsymbol\mu) - \lambda \boldsymbol\beta\right]\\ E\left(\nabla^2 l \right) &= -\frac{1}{\phi}\left(\mathbf{X}^T \mathbf{W}\mathbf{X} + \lambda \mathbf{I}_p \right) \end{align*}\]

    • Here, \(\mathbf{W}\) and \(\mathbf{G}\) are the same matrices defined in the Fisher scoring update above

  • IRLS for GLM with independent normal priors \[\begin{align}\notag \boldsymbol\beta^{(t+1)} & = \boldsymbol\beta^{(t)} + \left(\mathbf{X}^T \mathbf{W} \mathbf{X} + \lambda \mathbf{I}_p \right)^{-1} \left[\mathbf{X}^T \mathbf{W} \mathbf{G}(\mathbf{y} - \boldsymbol\mu) - \lambda \boldsymbol\beta^{(t)}\right] \\ \label{eq:IRLS_penalty} & = \left(\mathbf{X}^T \mathbf{W} \mathbf{X} + \lambda \mathbf{I}_p\right)^{-1} \mathbf{X}^T \mathbf{W} \underbrace{\left[ \mathbf{G}(\mathbf{y} - \boldsymbol\mu) + \mathbf{X}\boldsymbol\beta^{(t)} \right]}_{\mathbf{z}} \end{align}\]
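
  • A small Python/NumPy sketch of this penalized update for the logistic case (my own illustration, not from the book; `irls_logistic_ridge` and its defaults are hypothetical choices, with `lam` playing the role of \(\lambda\) and \(\phi = 1\)):

```python
import numpy as np

def irls_logistic_ridge(X, y, lam=1.0, max_iter=50, tol=1e-8):
    """Penalized IRLS for logistic regression with independent
    N(0, phi/lambda) priors on beta (phi = 1), i.e. the update
    (X'WX + lam*I)^{-1} X'W z above. Sketch only; in practice the
    intercept column is often left unpenalized."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(max_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))
        w = mu * (1.0 - mu)              # IRLS weights, W = diag(w)
        z = (y - mu) / w + eta           # pseudo-data  G(y - mu) + X beta
        XtW = X.T * w
        beta_new = np.linalg.solve(XtW @ X + lam * np.eye(p), XtW @ z)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```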

Asymptotic consistency of MLE, deviance, tests, residuals

Large sample distribution of \(\hat{\boldsymbol\beta}\)

  • Hessian of the negative log likelihood (also called observed information) \[ \hat{\mathcal{I}} = \frac{\mathbf{X}^T \mathbf{W} \mathbf{X}}{\phi} \]

  • Fisher information, also called expected information \[ \mathcal{I} = E\left(\hat{\mathcal{I}}\right) \]

  • Asymptotic normality of the MLE \(\hat{\boldsymbol\beta}\) \[ \hat{\boldsymbol\beta} \sim \text{N}\left(\boldsymbol\beta, \mathcal{I}^{-1} \right) \quad \text{or} \quad \hat{\boldsymbol\beta} \sim \text{N}\left(\boldsymbol\beta, \hat{\mathcal{I}}^{-1} \right) \]
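
  • In practice one plugs the converged IRLS weights into \(\phi\left(\mathbf{X}^T\mathbf{W}\mathbf{X}\right)^{-1}\) to get standard errors; a tiny Python sketch of that step (my own illustration; the helper name `glm_beta_cov` is made up):

```python
import numpy as np

def glm_beta_cov(X, w, phi=1.0):
    """Approximate covariance of the MLE: phi * (X^T W X)^{-1},
    where w holds the converged IRLS weights (sketch only)."""
    XtWX = (X.T * w) @ X
    return phi * np.linalg.inv(XtWX)

# standard errors for beta_hat:
# se = np.sqrt(np.diag(glm_beta_cov(X, w, phi)))
```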

Deviance

  • Deviance is the GLM counterpart of the residual sum of squares in normal linear regression \[\begin{align}\notag D & = 2\phi \left[ l\left(\hat{\boldsymbol\beta}_{\max} \right) - l\left(\hat{\boldsymbol\beta} \right) \right]\\ \label{eq:deviance} &= \sum_{i=1}^n 2 \omega_i \left[y_i \left(\tilde{\theta}_i - \hat{\theta}_i \right) - b\left(\tilde{\theta}_i \right) + b\left(\hat{\theta}_i \right) \right] \end{align}\]

    • Here, \(l\left(\hat{\boldsymbol\beta}_{\max} \right)\) is the maximized likelihood of the saturated model: the model with one parameter per data point. For exponential family distributions, it is obtained simply by setting \(\hat{\boldsymbol\mu} = \mathbf{y}\).
    • \(\tilde{\boldsymbol\theta}\) and \(\hat{\boldsymbol\theta}\) are the maximum likelihood estimates of the canonical parameters for the saturated model and the model of interest, respectively
  • From the second equality, we can see that deviance is independent of \(\phi\)

  • For normal linear regression, deviance equals the residual sum of squares
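
  • For example (an added illustration): for the Poisson model, \(\theta = \log\mu\), \(b(\theta) = e^{\theta}\), and \(\omega_i = 1\), so \(\tilde{\theta}_i = \log y_i\), \(\hat{\theta}_i = \log\hat{\mu}_i\), and \[ D = 2\sum_{i=1}^n \left[ y_i \log\left(\frac{y_i}{\hat{\mu}_i}\right) - \left(y_i - \hat{\mu}_i\right) \right] \] with the convention \(y_i \log y_i = 0\) when \(y_i = 0\).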

Scaled deviance

  • Scaled deviance does depend on \(\phi\) \[ D^* = \frac{D}{\phi} \]

  • If the model is specified correctly, then approximately \[ D^* \sim \chi^2_{n-p} \]

  • To compare two nested models (model 0, with \(p_0\) parameters, nested within model 1, with \(p_1\) parameters):
    • If \(\phi\) is known, then under \(H_0\), we can use \[ D^*_0 - D^*_1 \sim \chi^2_{p_1 - p_0} \]
    • If \(\phi\) is unknown, then under \(H_0\), we can use \[ F = \frac{(D_0 - D_1)/(p_1 - p_0)}{D_1/(n - p_1)} \sim F_{p_1 - p_0, n - p_1} \]
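
  • A small helper implementing both tests (a Python sketch of my own, using `scipy.stats`; the name `glm_nested_test` is made up), where `D0`, `p0` refer to the simpler model and `D1`, `p1` to the larger one:

```python
from scipy import stats

def glm_nested_test(D0, D1, p0, p1, n, phi=None):
    """Compare two nested GLMs via their deviances.
    If phi is known, use the chi-square test on the scaled deviance
    difference; otherwise use the F test above.
    Returns (statistic, p-value)."""
    if phi is not None:
        stat = (D0 - D1) / phi
        return stat, stats.chi2.sf(stat, df=p1 - p0)
    F = ((D0 - D1) / (p1 - p0)) / (D1 / (n - p1))
    return F, stats.f.sf(F, p1 - p0, n - p1)
```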

GLM residuals

  • Model checking is perhaps the most important part of applied statistical modeling

  • It is usual to standardize GLM residuals so that if the model assumptions are correct,

    • the standardized residuals should have approximately equal variance, and
    • behave like residuals from an ordinary linear model
  • Pearson residuals \[ \hat{\epsilon}_i^p = \frac{y_i - \hat{\mu}_i} {\sqrt{V\left(\hat{\mu}_i\right)}} \]

    • In practice, the distribution of the Pearson residuals can be quite asymmetric around zero. So the deviance residuals (introduced next) are often preferred.

Deviance residuals

  • Denote \(d_i\) as the \(i\)th component in the deviance definition above, so that the deviance is \(D = \sum_{i=1}^n d_i\)

  • By analogy with the ordinary linear model, we define the deviance residual \[ \hat{\epsilon}_i^d = \text{sign}(y_i - \hat{\mu}_i) \sqrt{d_i} \]

    • The sum of squares of the deviance residuals gives the deviance itself
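
  • A Python sketch computing both kinds of residuals for a Poisson fit (my own illustration; `poisson_residuals` is a made-up helper), using \(V(\mu) = \mu\) and \(d_i = 2\left[y_i\log(y_i/\hat{\mu}_i) - (y_i - \hat{\mu}_i)\right]\):

```python
import numpy as np

def poisson_residuals(y, mu):
    """Pearson and deviance residuals for a Poisson GLM fit (sketch only)."""
    pearson = (y - mu) / np.sqrt(mu)
    # y*log(y/mu), with the convention 0*log(0) = 0
    ratio = np.where(y > 0, y / mu, 1.0)
    d = 2.0 * (y * np.log(ratio) - (y - mu))
    deviance = np.sign(y - mu) * np.sqrt(np.maximum(d, 0.0))
    return pearson, deviance
```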

Quasi-likelihood (GEE)

Quasi-likelihood

  • Consider an observation \(y_i\) of a random variable with mean \(\mu_i\) and known variance function \(V(\mu_i)\)

    • Getting the distribution of \(Y_i\) exactly right is rather unimportant, as long as the mean-variance relationship \(V(\cdot)\) is correct
  • Then the log quasi-likelihood for \(\mu_i\), given \(y_i\), is \[ q_i(\mu_i) = \int_{y_i}^{\mu_i} \frac{y_i - z}{\phi V(z)}~dz \]

    • The log quasi-likelihood for the mean vector \(\boldsymbol\mu\) of all the response data is \(q(\boldsymbol\mu) = \sum_{i=1}^n q_i (\mu_i)\)
  • To obtain the maximum quasi-likelihood estimate of \(\boldsymbol\beta\), we differentiate \(q\) wrt each \(\beta_j\) and set the derivative to zero \[ 0 = \frac{\partial q}{\partial \beta_j} = \sum_{i=1}^n \frac{y_i - \mu_i}{\phi V(\mu_i)} \frac{\partial \mu_i}{\partial \beta_j} \Longrightarrow \sum_{i=1}^n \frac{y_i - \mu_i}{V(\mu_i) g^{\prime}(\mu_i)} X_{ij} = 0 \] these are exactly the GLM maximum likelihood equations, so the solution can be obtained through IRLS
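
  • For instance (an added illustration), taking \(V(\mu) = \mu\) gives \[ q_i(\mu_i) = \int_{y_i}^{\mu_i} \frac{y_i - z}{\phi z}\,dz = \frac{y_i \log \mu_i - \mu_i - y_i \log y_i + y_i}{\phi} \] which, up to terms not involving \(\mu_i\), is \(1/\phi\) times the Poisson log likelihood: the quasi-Poisson model keeps the Poisson estimating equations but allows \(var(Y_i) = \phi\mu_i\).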

Generalized Linear Mixed Models (GLMM)

Generalized linear mixed models (GLMM)

  • A GLMM for an exponential family response \(Y_i\) \[ g(\mu_i) = \mathbf{X}_i \boldsymbol\beta + \mathbf{Z}_i \mathbf{b}, \quad \mathbf{b} \sim \text{N}\left(\mathbf{0}, \boldsymbol\psi_{\boldsymbol\theta} \right) \]

  • The difficulty in moving from linear mixed models to GLMMs is that the marginal likelihood can no longer be evaluated analytically

  • One effective solution is a second-order Taylor expansion of \(\log f(\mathbf{y}, \mathbf{b} \mid \boldsymbol\beta)\) around \(\hat{\mathbf{b}}\), the posterior mode of \(f(\mathbf{b}\mid \mathbf{y}, \boldsymbol\beta)\)

\[\begin{align*} f(\mathbf{y} \mid \boldsymbol\beta)\approx &\int \exp\left\{ \log f(\mathbf{y}, \hat{\mathbf{b}} \mid \boldsymbol\beta)\right.\\ &~~~~~~~~~~~~ + \left. \frac{1}{2}\left(\mathbf{b} - \hat{\mathbf{b}}\right)^T\frac{\partial^2 \log f(\mathbf{y}, \mathbf{b}\mid \boldsymbol\beta)}{\partial\mathbf{b}\partial \mathbf{b}^T}\left(\mathbf{b} - \hat{\mathbf{b}}\right)\right\} d\mathbf{b} \end{align*}\]
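
  • The first-order term in the expansion vanishes because \(\hat{\mathbf{b}}\) maximizes \(\log f(\mathbf{y}, \mathbf{b}\mid \boldsymbol\beta)\) over \(\mathbf{b}\), so the remaining integral is Gaussian (step spelled out here for completeness): \[ \int \exp\left\{ -\frac{1}{2}\left(\mathbf{b} - \hat{\mathbf{b}}\right)^T \mathbf{H} \left(\mathbf{b} - \hat{\mathbf{b}}\right) \right\} d\mathbf{b} = \frac{(2\pi)^{p/2}}{\left|\mathbf{H}\right|^{1/2}}, \quad \mathbf{H} = -\left.\frac{\partial^2 \log f(\mathbf{y}, \mathbf{b}\mid \boldsymbol\beta)}{\partial\mathbf{b}\partial \mathbf{b}^T}\right|_{\mathbf{b} = \hat{\mathbf{b}}} \] where \(p = \dim(\mathbf{b})\); replacing \(\mathbf{H}\) by its expectation gives the Laplace approximation below.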

Laplace approximation of GLMM marginal likelihood

  • For an exponential family response, the expected Hessian of \(\log f(\mathbf{y}, \mathbf{b} \mid \boldsymbol\beta)\) with respect to \(\mathbf{b}\) is \[ -\frac{\mathbf{Z}^T \mathbf{W}\mathbf{Z}}{\phi} - \boldsymbol\psi^{-1} \]

    • \(\mathbf{W}\) is the diagonal matrix of IRLS weights based on the \(\boldsymbol\mu\) implied by \(\hat{\mathbf{b}}\) and \(\boldsymbol\beta\)
  • Therefore, the approximate marginal likelihood is \[ f(\mathbf{y} \mid \boldsymbol\beta)\approx f(\mathbf{y}, \hat{\mathbf{b}} \mid \boldsymbol\beta)\frac{(2\pi)^{p/2}}{\left| \frac{\mathbf{Z}^T \mathbf{W}\mathbf{Z}}{\phi} + \boldsymbol\psi^{-1}_{\boldsymbol\theta} \right|^{1/2}} \]

Penalized likelihood and penalized IRLS

  • The point estimators \(\hat{\boldsymbol\beta}\) and \(\hat{\mathbf{b}}\) are obtained by optimizing the penalized likelihood

\[\begin{align*} \hat{\boldsymbol\beta}, \hat{\mathbf{b}} &= \arg\max_{\boldsymbol\beta, \mathbf{b}} \log f(\mathbf{y}, \mathbf{b} \mid \boldsymbol\beta)\\ &= \arg\max_{\boldsymbol\beta, \mathbf{b}} \left\{\log f(\mathbf{y} \mid \mathbf{b}, \boldsymbol\beta) - \mathbf{b}^T \boldsymbol\psi^{-1}_{\mathbf{\theta}}\mathbf{b}/2\right\} \end{align*}\]

  • To simplify notation, we denote \[\begin{align*} \mathcal{B} & = \left(\mathbf{b}^T, \boldsymbol\beta^T\right)^T\\ \mathcal{X} & = (\mathbf{Z}, \mathbf{X}), \quad \mathbf{S} = \left[ \begin{array}{cc} \boldsymbol\psi^{-1}_{\boldsymbol\theta} & \mathbf{0}\\ \mathbf{0} & \mathbf{0} \end{array} \right] \end{align*}\]

  • A penalized version of the IRLS algorithm (PIRLS): by the same argument as in the normal-prior update above, a single Newton update step is \[\begin{align*} \mathcal{B}^{(t+1)} = \left( \mathcal{X}^T \mathbf{W} \mathcal{X} + \phi \mathbf{S}\right)^{-1} \mathcal{X}^T \mathbf{W} \left[\mathbf{G}\left(\mathbf{y} - \hat{\boldsymbol\mu}\right)+ \mathcal{X}\mathcal{B}^{(t)} \right] \end{align*}\]

Penalized quasi-likelihood method

  • Because optimizing the Laplace approximate marginal likelihood can be computationally costly, it is tempting to instead iterate PIRLS, estimating \(\boldsymbol\theta\) and \(\phi\) at each step based on the working mixed model \[ \mathbf{z} \mid \mathbf{b}, \boldsymbol\beta \sim \text{N}\left(\mathbf{X}\boldsymbol\beta + \mathbf{Z}\mathbf{b}, \mathbf{W}^{-1}\phi \right), \quad \mathbf{b} \sim \text{N}\left(\mathbf{0}, \boldsymbol\psi_{\boldsymbol\theta}\right) \]

References

  • Wood, Simon N. (2017), Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC