Book Notes: Flexible Imputation of Missing Data -- Ch2 Multiple Imputation


Concepts in Incomplete Data

Notations

  • m: number of multiple imputations
  • Y: data of the sample
    • Includes both covariates and response
    • Dimension n×p
  • R: observation indicator matrix, known
    • An $n \times p$ 0-1 matrix
    • $r_{ij} = 0$ for missing and $r_{ij} = 1$ for observed
  • $Y_\text{obs}$: observed data
  • $Y_\text{mis}$: missing data
  • $Y = (Y_\text{obs}, Y_\text{mis})$: complete data

  • ψ: the parameter of the missingness mechanism
  • θ: the parameter of the full data Y
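
A tiny illustration (my own code, not the book's) of the response indicator: R can be computed directly from incomplete data, here the nhanes data shipped with the mice package (used again below).

## 1 = observed, 0 = missing, one cell per data entry
library(mice)
R <- 1 * !is.na(nhanes)
head(R)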

Concepts of MCAR, MAR, and MNAR, with notations

  • Missing completely at random (MCAR): $P(R = 0 \mid Y_\text{obs}, Y_\text{mis}, \psi) = P(R = 0 \mid \psi)$

  • Missing at random (MAR): $P(R = 0 \mid Y_\text{obs}, Y_\text{mis}, \psi) = P(R = 0 \mid Y_\text{obs}, \psi)$

  • Missing not at random (MNAR): $P(R = 0 \mid Y_\text{obs}, Y_\text{mis}, \psi)$ does not simplify
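
A minimal sketch (my own, not the book's) that generates the three mechanisms on toy data, making y2 incomplete:

set.seed(1)
n  <- 1000
y1 <- rnorm(n)                        # always observed
y2 <- y1 + rnorm(n)                   # to be made incomplete
r.mcar <- rbinom(n, 1, 0.5)           # MCAR: ignores the data entirely
r.mar  <- rbinom(n, 1, plogis(y1))    # MAR: depends only on observed y1
r.mnar <- rbinom(n, 1, plogis(y2))    # MNAR: depends on unobserved y2 itself
y2.inc <- ifelse(r.mar == 1, y2, NA)  # apply, e.g., the MAR pattern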

Ignorable

  • The missing data mechanism is ignorable for likelihood inference (on θ), if

    1. MAR, and
    2. Distinctness: the parameters θ and ψ are a priori independent (from a Bayesian viewpoint)
  • If the nonresponse is ignorable, then $P(Y_\text{mis} \mid Y_\text{obs}, R) = P(Y_\text{mis} \mid Y_\text{obs})$. Thus, if the missing data mechanism is ignorable, we can model θ using just the observed data

Why and When Multiple Imputation Works

Goal of multiple imputation

  • Note: in most multiple-imputation practice, the goal is to fit a (predictive) model whose parameter estimates have variances as small as possible while remaining valid

  • Q: estimand (the parameter to be estimated)
  • $\hat{Q}$: estimate
    • Unbiased: $E(\hat{Q} \mid Y) = Q$
    • Confidence valid: $E(U \mid Y) \geq V(\hat{Q} \mid Y)$, where U is the estimated covariance matrix of $\hat{Q}$, the expectation is over all possible samples, and $V(\hat{Q} \mid Y)$ is the variance caused by the sampling process

Within-variance and between-variance

$$E(Q \mid Y_\text{obs}) = E_{Y_\text{mis} \mid Y_\text{obs}}\left\{ E(Q \mid Y_\text{obs}, Y_\text{mis}) \right\}$$

$$V(Q \mid Y_\text{obs}) = \underbrace{E_{Y_\text{mis} \mid Y_\text{obs}}\left\{ V(Q \mid Y_\text{obs}, Y_\text{mis}) \right\}}_{\text{within-variance}} + \underbrace{V_{Y_\text{mis} \mid Y_\text{obs}}\left\{ E(Q \mid Y_\text{obs}, Y_\text{mis}) \right\}}_{\text{between-variance}}$$

  • Within-variance: average of the repeated complete-data posterior variances of Q, estimated by $\bar{U} = \frac{1}{m} \sum_{l=1}^{m} \bar{U}_l$, where $\bar{U}_l$ is the variance of $\hat{Q}_l$ in the lth imputation

  • Between-variance: variance between the complete-data posterior means of Q, estimated by $B = \frac{1}{m-1} \sum_{l=1}^{m} (\hat{Q}_l - \bar{Q})(\hat{Q}_l - \bar{Q})'$, where $\bar{Q} = \frac{1}{m} \sum_{l=1}^{m} \hat{Q}_l$

Decomposition of total variation

  • Since $\bar{Q}$ is estimated using finite m, the contribution to the variance is about $B/m$. Thus, the total posterior variance of Q can be decomposed into three parts: $$T = \bar{U} + B + B/m = \bar{U} + \left(1 + \frac{1}{m}\right)B$$

  • $\bar{U}$: the conventional variance, due to sampling rather than observing the entire population

  • $B$: the extra variance due to the missing values

  • $B/m$: the extra simulation variance because $\bar{Q}$ is estimated using finite m

    • Traditional choices are m = 3, 5, or 10, but the current advice is to use a larger m, e.g., m = 50
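
A quick sketch of these pooling rules (my own code, not the book's) for a scalar estimand; qhat and u hold the m complete-data estimates and their variances:

pool_scalar <- function(qhat, u) {
  m    <- length(qhat)
  qbar <- mean(qhat)              # pooled estimate
  ubar <- mean(u)                 # within-imputation variance
  b    <- var(qhat)               # between-imputation variance
  t    <- ubar + (1 + 1 / m) * b  # total variance
  c(qbar = qbar, ubar = ubar, b = b, t = t)
}

## Example: five imputed-data estimates of a regression slope
pool_scalar(qhat = c(-2.1, -2.4, -2.0, -2.6, -2.3),
            u    = c(0.9, 1.0, 0.8, 1.1, 0.9))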

Properness of an imputation procedure

  • An imputation procedure is confidence proper for complete-data statistics $(\hat{Q}, U)$ if it satisfies the following three conditions approximately at large m: $$E(\bar{Q} \mid Y) = \hat{Q}$$ $$E(\bar{U} \mid Y) = U$$ $$\left(1 + \frac{1}{m}\right) E(B \mid Y) \geq V(\bar{Q})$$

    • Here $\hat{Q}$ is the complete-sample estimator of Q, and U is its covariance
    • If we replace the $\geq$ by $>$ in the third formula, then the procedure is said to be proper
    • It is not always easy to check whether a procedure is proper.

Scope of the imputation model

  • Broad: one set of imputations to be used for all projects and analyses

  • Intermediate: one set of imputations per project, used for all analyses within that project

  • Narrow: a separate imputed dataset is created for each analysis

  • Which scope is best depends on the use case

Variance ratios

  • Proportion of variation attributable to the missing data: $$\lambda = \frac{B + B/m}{T}$$
    • If $\lambda > 0.5$, the influence of the imputation model on the final result is larger than that of the complete-data model
  • Relative increase in variance due to nonresponse: $$r = \frac{B + B/m}{\bar{U}} = \frac{\lambda}{1 - \lambda}$$

  • Fraction of information about Q missing due to nonresponse: $$\gamma = \frac{r + 2/(\nu + 3)}{1 + r} = \frac{\nu + 1}{\nu + 3}\,\lambda + \frac{2}{\nu + 3}$$
    • Here, ν is the degrees of freedom (see next)
    • When ν is large, γ is very close to λ
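
These ratios are one-liners in R; a sketch (my own, not the book's code), checked against the age row of the numerical example below:

lambda <- function(ubar, b, m) (1 + 1 / m) * b / (ubar + (1 + 1 / m) * b)
riv    <- function(ubar, b, m) (1 + 1 / m) * b / ubar
fmi    <- function(r, nu) (r + 2 / (nu + 3)) / (1 + r)
lambda(ubar = 0.9, b = 0.39, m = 10)  # about 0.32
riv(ubar = 0.9, b = 0.39, m = 10)     # about 0.48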

Degrees of freedom (df)

  • The degrees of freedom are the number of observations remaining after accounting for the number of parameters in the model.

  • The “old” formula (as in Rubin 1987) may produce values larger than the sample size of the complete data: $$\nu_\text{old} = (m - 1)\left(1 + \frac{1}{r}\right)^2 = \frac{m - 1}{\lambda^2}$$

  • Let $\nu_\text{com}$ be the conventional df in a complete-data inference problem. If the number of parameters in the model is k and the sample size is n, then $\nu_\text{com} = n - k$. The estimated observed-data df that accounts for the missing information is $$\nu_\text{obs} = \frac{\nu_\text{com} + 1}{\nu_\text{com} + 3}\,\nu_\text{com}\,(1 - \lambda)$$

  • Barnard-Rubin correction: the adjusted df to be used for testing in multiple imputation is $$\nu = \frac{\nu_\text{old}\,\nu_\text{obs}}{\nu_\text{old} + \nu_\text{obs}}$$
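
Putting the three df formulas together (a sketch of the correction, not the mice source code):

df_barnard_rubin <- function(lambda, m, dfcom) {
  nu_old <- (m - 1) / lambda^2
  nu_obs <- (dfcom + 1) / (dfcom + 3) * dfcom * (1 - lambda)
  nu_old * nu_obs / (nu_old + nu_obs)  # adjusted df
}
df_barnard_rubin(lambda = 0.32, m = 10, dfcom = 23)  # about 12, cf. the age row below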

A numerical example

## Load the mice package
library(mice)
imp <- mice(nhanes, print = FALSE, m = 10, seed = 24415)
fit <- with(imp, lm(bmi ~ age))
est <- pool(fit); print(est, digits = 2)
## Class: mipo    m = 10 
##          term  m estimate ubar    b   t dfcom   df  riv lambda  fmi
## 1 (Intercept) 10     30.8  3.4 2.52 6.2    23  9.2 0.82   0.45 0.54
## 2         age 10     -2.3  0.9 0.39 1.3    23 12.3 0.48   0.32 0.41
  • Columns ubar, b, and t are the within-variance $\bar{U}$, the between-variance $B$, and the total variance $T$
  • Column dfcom is $\nu_\text{com}$
  • Column df is the Barnard-Rubin corrected $\nu$

T-test for regression coefficients

  • Use the Barnard-Rubin corrected ν as the degrees of freedom of the t-distribution.
print(summary(est, conf.int = TRUE), digits = 1)
##          term estimate std.error statistic df p.value 2.5 % 97.5 %
## 1 (Intercept)       31         2        12  9   5e-07    25   36.4
## 2         age       -2         1        -2 12   7e-02    -5    0.2
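
The interval for age can be reproduced by hand from the pooled quantities (a sketch using the rounded values printed above):

qbar <- -2.3; tvar <- 1.3; nu <- 12.3              # estimate, total variance, df
qbar + c(-1, 1) * qt(0.975, df = nu) * sqrt(tvar)  # about (-4.8, 0.2)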

More about Imputation Methods

Imputation evaluation criteria

  • The following criteria are useful in simulation studies (when you know the true Q)
  1. Raw bias (RB), with an acceptable upper limit of 5%: $$\text{RB} = \left| \frac{E(\bar{Q}) - Q}{Q} \right|$$

  2. Coverage rate (CR): A CR below 90% for the nominal 95% interval is bad

  3. Average width (AW) of confidence interval

  4. Root mean squared error (RMSE): the smaller the better $$\text{RMSE} = \sqrt{\left(E(\bar{Q}) - Q\right)^2}$$
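
A minimal simulation sketch (my own, not the book's) that computes the four criteria for mice's default imputation of a regression slope under MCAR, with true Q = 1:

library(mice)
set.seed(2)
Q <- 1; nsim <- 100
res <- t(replicate(nsim, {
  x <- rnorm(100)
  y <- Q * x + rnorm(100)
  x[rbinom(100, 1, 0.3) == 1] <- NA    # 30% MCAR in x
  imp <- mice(data.frame(x, y), m = 5, print = FALSE)
  est <- summary(pool(with(imp, lm(y ~ x))), conf.int = TRUE)
  unlist(est[2, c("estimate", "2.5 %", "97.5 %")])
}))
c(RB   = abs(mean(res[, 1]) - Q) / abs(Q),   # raw bias
  CR   = mean(res[, 2] < Q & Q < res[, 3]),  # coverage of the nominal 95% CI
  AW   = mean(res[, 3] - res[, 2]),          # average CI width
  RMSE = sqrt((mean(res[, 1]) - Q)^2))       # RMSE as defined above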

Imputation is not prediction

  • Shall we evaluate an imputation method by examining how closely it can recover the missing values?
    • For example, using the RMSE to see whether the imputed values $\dot{y}_i$ are close to the true (removed) values $y_i^\text{mis}$: $$\text{RMSE} = \sqrt{\frac{1}{n_\text{mis}} \sum_{i=1}^{n_\text{mis}} \left(y_i^\text{mis} - \dot{y}_i\right)^2}$$
  • NO! This criterion is optimized by least-squares predictions of the conditional mean, which impute the same “best” value over and over; that is single imputation, and it ignores the inherent uncertainty of the missing values.

When not to use multiple imputation

  • For predictive modeling, if the missing values are in the target variable Y, then complete-case analysis and multiple imputation are equivalent.

  • Two special cases where listwise deletion is better than multiple imputation

  1. If the probability of being missing does not depend on Y

  2. If the complete-data model is logistic regression, and the missing data are confined to Y, not X

References

  • Van Buuren, S. (2018). Flexible Imputation of Missing Data, 2nd Edition. CRC Press.

  • Rubin, D. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons.