Book Notes: Flexible Imputation of Missing Data -- Ch1 Introduction

For the pdf slides, click here

Concepts of MCAR, MAR, MNAR

Concepts of MCAR, MAR, MNAR

  • Missing completely at random (MCAR): the probability of being missing is the same for all cases
    • Cause of missing is unrelated to the data
  • Missing at random (MAR): the probability of being missing only depends on the observed data
    • Cause of missing is unrelated to the missing values
  • Missing not at random (MNAR): probability of being missing depends on the missing values themselves

Ad-hoc Solutions

Listwise deletion and pairwise deletion

  • Listwise deletion (also called complete-case analysis): delete rows which contain one or more missing values
    • If data is MCAR, listwise deletion produces unbiased estimates of means, variances, and regression weights (if need to train a predictive model)
    • If data is not MCAR, listwise deletion can severely bias the above estimates.
  • Pairwise deletion (also called available-case analysis)
    • Mean and variance of variable \(X\) are based on all cases with observed data on \(X\)
    • Covariance and correlation of \(X\) and \(Y\) is based on all data which both \(X\) and \(Y\) have non-missing values

Mean imputation

  • Compared with the observed data, in the imputed data (observed + imputed values)
    • Standard deviations decrease
    • Correlation decreases
    • Means can be biased if the data is not MCAR.

Regression imputation

  1. Build a regression model from the observed data
  2. Impute the missing values in the response variable with the predicted values from the fitted regression
  • The impute values are the most likely values under the model
    • However, it decreases the variance of the target variable
    • And it increases the correlations between the target and covariates
  • Regression imputation, and its modern incarnations in machine learning is probably the most dangerous of all ad-hoc methods

Stochastic regression imputation

  1. Build a regression model from the observed data
  2. Impute a missing value in the response variable with the predicted value plus a random draw from the residual
  • Preserves variance and correlation.
  • Imputed values can exceed the range (e.g., a negative Ozone level). A more suitable model may resolve this.

LOCF and BOCF

  • Last observation carried forward (LOCF) and baseline observation carried forward (BOCF) are for longitudinal data.

  • LOCF can yield biased estimation even under MCAR.

Indicator method

  • Not for imputation, but for building predictive models
  • Only works for missing in covariates, not the target variables

Summary of ad-hoc imputation methods

  • Note: the unbiasness of regression coefficients are assess with the variable containing missing values as the target variable

Multiple Imputation in a Nutshell

Multiple imputation creates \(m>1\) complete datasets

  • Three steps of multiple imputation
    1. Imputation
    2. Analysis: train separate models
    3. Pooling: variance among \(m\) parameter estimates combines the conventional sampling variance (within-imputation variance) and the extra variance caused by the missing data (between-imputation variance)

Why using multiple imputation?

  • It provides a mechanism to deal with the inherent uncertainty of the imputations
  • It separate the solution of the missing data problem from the solution of the complete-data problem (train predictive models on complete data)

Multiple imputation example using the mice package

## Load the mice package
library(mice); 
## Impute 20 times, using preditive mean matching
imp <- mice(airquality, seed = 1, m = 20, print = FALSE)
## Fit linear regressions
fit <- with(imp, lm(Ozone ~ Wind + Temp + Solar.R))
## Pooled regression estimates
pander(summary(pool(fit)))
term estimate std.error statistic df p.value
(Intercept) -60.21 21.57 -2.791 100.3 0.006
Wind -3.174 0.644 -4.927 83.29 0
Temp 1.584 0.228 6.959 125.7 0
Solar.R 0.058 0.023 2.454 79.63 0.016

References