Concepts in Incomplete Data
Notations
- $m$: number of multiple imputations
- $Y$: data of the sample
  - Includes both covariates and response
  - Dimension $n \times p$
- $R$: observation indicator matrix, known
  - A 0-1 matrix
  - 0 for missing and 1 for observed
- $Y_{obs}$: observed data
- $Y_{mis}$: missing data
- $Y = (Y_{obs}, Y_{mis})$: complete data
- $\psi$: the parameter for the missing mechanism
- $\theta$: the parameter for the full data
Concepts of MCAR, MAR, and MNAR, with notations
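Missing completely at random (MCAR): $P(R = 0 \mid Y_{obs}, Y_{mis}, \psi) = P(R = 0 \mid \psi)$
Missing at random (MAR): $P(R = 0 \mid Y_{obs}, Y_{mis}, \psi) = P(R = 0 \mid Y_{obs}, \psi)$
Missing not at random (MNAR): $P(R = 0 \mid Y_{obs}, Y_{mis}, \psi)$ does not simplify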
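To make the three mechanisms concrete, here is a minimal base-R simulation sketch. The two-variable setup (`y1` always observed, `y2` possibly missing) and the logistic response probabilities are assumptions chosen for illustration; they are not part of the original notes.

```r
# Hypothetical example: y1 is always observed, y2 may be missing.
set.seed(1)
n  <- 10000
y1 <- rnorm(n)
y2 <- 0.5 * y1 + rnorm(n)          # true mean of y2 is 0

# Response indicator for y2 (1 = observed, 0 = missing), one per mechanism.
r_mcar <- rbinom(n, 1, 0.5)          # missingness probability is constant
r_mar  <- rbinom(n, 1, plogis(-y1))  # depends on the observed y1 only
r_mnar <- rbinom(n, 1, plogis(-y2))  # depends on y2 itself

# Mean of the observed y2 values under each mechanism.
obs_means <- c(
  MCAR = mean(y2[r_mcar == 1]),
  MAR  = mean(y2[r_mar  == 1]),
  MNAR = mean(y2[r_mnar == 1])
)
print(round(obs_means, 2))
```

In this sketch the observed-data mean of `y2` should stay close to 0 only under MCAR. Under MAR the bias can be removed by conditioning on `y1`; under MNAR it cannot be removed from the observed data alone.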
Ignorable
The missing data mechanism is ignorable for likelihood inference (on $\theta$), if
- MAR, and
- Distinctness: the parameters $\theta$ and $\psi$ are independent (from a Bayesian's view)
If the nonresponse is ignorable, then $P(Y_{mis} \mid Y_{obs}, R) = P(Y_{mis} \mid Y_{obs})$. Thus, if the missing data model is ignorable, we can model $\theta$ using the observed data alone
Why and When Multiple Imputation Works
Goal of multiple imputation
Note: in most multiple imputation practice, the goal is to train a (predictive) model whose parameter estimates have variances as small as possible
- $Q$: estimand (the parameter to be estimated)
- $\hat Q$: estimate
- Unbiased: $E(\hat Q \mid Y) = Q$
- Confidence valid: $E(U \mid Y) \ge V(\hat Q \mid Y)$, where $U$ is the estimated covariance matrix of $\hat Q$, the expectation is over all possible samples, and $V(\hat Q \mid Y)$ is the variance caused by the sampling process
Within-variance and between-variance
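Within-variance: average of the repeated complete-data posterior variances of $Q$, estimated by $\bar U = \frac{1}{m} \sum_{\ell=1}^{m} \bar U_\ell$, where $\bar U_\ell$ is the variance of $\hat Q_\ell$ in the $\ell$th imputation
Between-variance: variance between the complete-data posterior means of $Q$, estimated by $B = \frac{1}{m-1} \sum_{\ell=1}^{m} (\hat Q_\ell - \bar Q)(\hat Q_\ell - \bar Q)'$, where $\bar Q = \frac{1}{m} \sum_{\ell=1}^{m} \hat Q_\ell$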
Decomposition of total variation
Since $\bar Q$ is estimated using finite $m$, the contribution to the variance is about $B/m$. Thus, the total posterior variance $T$ of $Q$ can be decomposed into three parts: $T = \bar U + B + B/m$
- $\bar U$: the conventional variance, due to sampling rather than observing the entire population.
- $B$: the extra variance due to missing values
- $B/m$: the extra simulation variance because $\bar Q$ is estimated for finite $m$
- Traditional choices of $m$ are small, but the current advice is to use a larger $m$
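The decomposition can be sketched in a few lines of base R for a scalar estimand, assuming the usual Rubin's rules $T = \bar U + B + B/m$. The helper name `pool_scalar` and the input numbers are made up for illustration; they do not come from any dataset or from the mice package.

```r
# Pool m scalar estimates with Rubin's rules.
# q: estimates from the m imputed datasets; u: their squared standard errors.
pool_scalar <- function(q, u) {
  m    <- length(q)
  qbar <- mean(q)            # pooled estimate
  ubar <- mean(u)            # within-imputation variance
  b    <- var(q)             # between-imputation variance (divisor m - 1)
  t    <- ubar + b + b / m   # total variance
  c(qbar = qbar, ubar = ubar, b = b, t = t)
}

# Made-up inputs for illustration only.
q <- c(1.9, 2.3, 2.1, 1.7, 2.0)
u <- c(0.40, 0.38, 0.42, 0.41, 0.39)
pooled <- pool_scalar(q, u)
print(pooled)
```

With these inputs the three parts are $\bar U = 0.40$, $B = 0.05$, and $B/m = 0.01$, so $T = 0.46$.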
Properness of an imputation procedure
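An imputation procedure is confidence proper for complete-data statistics $(\hat Q, U)$ if it satisfies the following three conditions approximately at large $m$:
$$E(\bar Q \mid Y) = \hat Q$$
$$E(\bar U \mid Y) = U$$
$$\left(1 + \frac{1}{m}\right) E(B \mid Y) \ge V(\bar Q)$$
- Here $\hat Q$ is the complete-sample estimator of $Q$, and $U$ is its covariance
- If we replace the $\ge$ by $>$ in the third formula, then the procedure is said to be proper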
- It is not always easy to check whether a procedure is proper.
Scope of the imputation model
Broad: one set of imputations to be used for all projects and analyses
Intermediate: one set of imputations per project and use this for all analyses
Narrow: a separate imputed dataset is created for each analysis
Which one is better: depends on the use case
Variance ratios
- $\lambda = \frac{B + B/m}{T}$: proportion of variation attributable to the missing data
  - If $\lambda > 0.5$, then the influence of the imputation model on the final result is larger than that of the complete-data model
- $r = \frac{B + B/m}{\bar U}$: relative increase in variance due to nonresponse
- $\gamma = \frac{r + 2/(\nu + 3)}{1 + r}$: fraction of information about $Q$ missing due to nonresponse
  - Here, $\nu$ is the degrees of freedom (see next)
  - When $\nu$ is large, $\gamma$ is very close to $\lambda$
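As a quick check of these ratios, the following base-R sketch computes them from the variance components. The function name `variance_ratios` is hypothetical; the inputs are the rounded intercept values printed in the numerical example later in these notes, so the results are only approximate.

```r
# Variance ratios from the within (ubar) and between (b) variances,
# the number of imputations m, and the degrees of freedom df.
variance_ratios <- function(ubar, b, m, df) {
  extra  <- b + b / m                      # variance added by missing data
  lambda <- extra / (ubar + extra)         # proportion attributable to missingness
  riv    <- extra / ubar                   # relative increase in variance
  fmi    <- (riv + 2 / (df + 3)) / (1 + riv)  # fraction of missing information
  c(lambda = lambda, riv = riv, fmi = fmi)
}

# Rounded intercept values from the pool() output in the numerical example.
ratios <- variance_ratios(ubar = 3.4, b = 2.52, m = 10, df = 9.2)
print(round(ratios, 2))
```

Up to rounding of the inputs, this should agree with the `lambda`, `riv`, and `fmi` columns reported for the intercept.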
Degrees of freedom (df)
The degrees of freedom is the number of observations after accounting for the number of parameters in the model.
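The “old” formula (as in Rubin 1987): $\nu_{old} = (m - 1)\left(1 + \frac{1}{r}\right)^2 = \frac{m - 1}{\lambda^2}$ may produce values larger than the sample size in the complete data
Let $\nu_{com}$ be the conventional df in a complete-data inference problem. If the number of parameters in the model is $k$ and the sample size is $n$, then $\nu_{com} = n - k$. The estimated observed data df that accounts for the missing information is $\nu_{obs} = \frac{\nu_{com} + 1}{\nu_{com} + 3} \nu_{com} (1 - \lambda)$
Barnard-Rubin correction: the adjusted df to be used for testing in multiple imputation is $\nu = \frac{\nu_{old} \nu_{obs}}{\nu_{old} + \nu_{obs}}$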
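The three df formulas can be sketched in base R as follows, assuming $\nu_{old} = (m-1)/\lambda^2$, $\nu_{obs} = \frac{\nu_{com}+1}{\nu_{com}+3}\nu_{com}(1-\lambda)$, and $\nu = \nu_{old}\nu_{obs}/(\nu_{old}+\nu_{obs})$. The function name `barnard_rubin_df` is hypothetical, and the inputs are the rounded intercept values from the numerical example that follows.

```r
# Barnard-Rubin adjusted degrees of freedom.
barnard_rubin_df <- function(m, lambda, dfcom) {
  df_old <- (m - 1) / lambda^2                                # Rubin (1987)
  df_obs <- (dfcom + 1) / (dfcom + 3) * dfcom * (1 - lambda)  # observed-data df
  df_old * df_obs / (df_old + df_obs)                         # adjusted df
}

# m = 10 imputations, lambda = 0.45, complete-data df = 25 - 2 = 23.
nu <- barnard_rubin_df(m = 10, lambda = 0.45, dfcom = 23)
print(round(nu, 1))
```

Here the old formula alone gives about 44 df, more than the 25 observations, while the adjusted value is close to the 9.2 reported in the `df` column.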
A numerical example
## Load the mice package
library(mice)
## Create m = 10 imputed datasets from the nhanes data
imp <- mice(nhanes, print = FALSE, m = 10, seed = 24415)
## Fit the complete-data model on each imputed dataset
fit <- with(imp, lm(bmi ~ age))
## Pool the m fits with Rubin's rules
est <- pool(fit)
print(est, digits = 2)
## Class: mipo    m = 10
##          term  m estimate ubar    b   t dfcom   df  riv lambda  fmi
## 1 (Intercept) 10     30.8  3.4 2.52 6.2    23  9.2 0.82   0.45 0.54
## 2         age 10     -2.3  0.9 0.39 1.3    23 12.3 0.48   0.32 0.41
- Columns ubar, b, and t are the variances $\bar U$, $B$, and $T$
- Column dfcom is $\nu_{com}$
- Column df is the Barnard-Rubin correction $\nu$
T-test for regression coefficients
- Use the Barnard-Rubin corrected df $\nu$ as the degrees-of-freedom (shape) parameter of the t-distribution.
print(summary(est, conf.int = TRUE), digits = 1)
## term estimate std.error statistic df p.value 2.5 % 97.5 %
## 1 (Intercept) 31 2 12 9 5e-07 25 36.4
## 2 age -2 1 -2 12 7e-02 -5 0.2
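The summary above can be approximately reproduced by hand from the pooled estimate, the total variance, and the adjusted df. The sketch below uses only base R; the function name `pooled_t_test` is hypothetical, and the inputs are the rounded intercept values from the pooled output, so small differences from the printed summary are expected.

```r
# Wald-type t-test and confidence interval for a pooled estimate.
pooled_t_test <- function(estimate, t, df, level = 0.95) {
  se   <- sqrt(t)                         # standard error from total variance
  stat <- estimate / se                   # t statistic for H0: Q = 0
  crit <- qt(1 - (1 - level) / 2, df)     # critical value, non-integer df allowed
  c(std.error = se,
    statistic = stat,
    p.value   = 2 * pt(-abs(stat), df),
    lower     = estimate - crit * se,
    upper     = estimate + crit * se)
}

# Intercept: estimate = 30.8, total variance t = 6.2, adjusted df = 9.2.
res <- pooled_t_test(estimate = 30.8, t = 6.2, df = 9.2)
print(signif(res, 3))
```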
More about Imputation Methods
Imputation evaluation criteria
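- The following criteria are useful in simulation studies (when you know the true $Q$)
Raw bias (RB): $\mathrm{RB} = E(\bar Q) - Q$; expressed as percent bias $\mathrm{PB} = 100 \times |\mathrm{RB} / Q|$, upper limit 5%
Coverage rate (CR): A CR below 90% for the nominal 95% interval is bad
Average width (AW) of confidence interval
Root mean squared error (RMSE): $\sqrt{E(\bar Q - Q)^2}$, the smaller the better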
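A minimal base-R sketch of how these criteria might be computed from simulation output is shown below. The function name `evaluate_estimates` and the four replications are hypothetical; a real simulation study would use many more replications.

```r
# Evaluation criteria over simulation replications.
# est: pooled estimates; lower/upper: confidence limits; true_q: known truth.
evaluate_estimates <- function(est, lower, upper, true_q) {
  rb <- mean(est) - true_q
  c(RB   = rb,                                        # raw bias
    PB   = 100 * abs(rb / true_q),                    # percent bias
    CR   = mean(lower <= true_q & true_q <= upper),   # coverage rate
    AW   = mean(upper - lower),                       # average interval width
    RMSE = sqrt(mean((est - true_q)^2)))              # root mean squared error
}

# Made-up results from four replications with true value Q = 2.
est   <- c(1.8, 2.1, 2.4, 1.9)
lower <- c(1.2, 1.5, 2.1, 1.3)
upper <- c(2.4, 2.7, 2.7, 2.5)
crit  <- evaluate_estimates(est, lower, upper, true_q = 2)
print(round(crit, 3))
```

In this toy input, three of the four intervals contain the true value, so the coverage rate is 0.75.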
Imputation is not prediction
- Should we evaluate an imputation method by examining how closely it recovers the missing values?
- For example, by using the RMSE to see whether the imputed values are close to the true (removed) data?
- NO! This criterion favors least squares estimates, which produce the same imputed values over and over, so the method is effectively single imputation. It ignores the inherent uncertainty of the missing values.
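The point can be illustrated with a small base-R simulation, sketched here under assumptions that are not in the original notes: a linear model with true $\mathrm{var}(y) = 2$, MCAR deletion of half of `y`, and a simplified stochastic regression imputation that adds residual noise but ignores parameter uncertainty (so it is not a fully proper procedure).

```r
set.seed(123)
n <- 2000
x <- rnorm(n)
y <- x + rnorm(n)                    # true var(y) = 2
miss <- rbinom(n, 1, 0.5) == 1       # MCAR: about half of y is removed

fit   <- lm(y ~ x, subset = !miss)   # fit on the observed cases only
pred  <- predict(fit, newdata = data.frame(x = x[miss]))
sigma <- summary(fit)$sigma          # residual standard deviation

# Regression imputation: fill in the predicted mean (least squares).
y_pred <- y
y_pred[miss] <- pred
# Stochastic regression imputation: predicted mean plus residual noise.
y_sto <- y
y_sto[miss] <- pred + rnorm(sum(miss), sd = sigma)

rmse <- function(a, b) sqrt(mean((a - b)^2))
res <- rbind(
  predicted  = c(rmse = rmse(y_pred[miss], y[miss]), var_y = var(y_pred)),
  stochastic = c(rmse = rmse(y_sto[miss],  y[miss]), var_y = var(y_sto))
)
print(round(res, 2))
```

The predicted-mean imputations should have the smaller RMSE, yet the completed data understate the variance of `y`. The noisier stochastic imputations recover the missing values less closely but preserve the variance, which is what matters for valid inference.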
When not to use multiple imputation
For predictive modeling, if the missing values are only in the target variable $Y$, then complete-case analysis and multiple imputation are equivalent.
Two special cases where listwise deletion is better than multiple imputation:
- If the probability to be missing does not depend on $Y$
- If the complete data model is logistic regression, and the missing data are confined to $Y$, not $X$
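The first special case can be checked with a small base-R simulation. This is only a sketch under assumed settings (a correctly specified linear model with slope 2 and logistic deletion probabilities); it illustrates the bias of complete-case analysis and does not compare it with multiple imputation.

```r
set.seed(42)
n <- 20000
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)                 # true slope is 2

keep_x <- rbinom(n, 1, plogis(x)) == 1    # deletion depends on x only
keep_y <- rbinom(n, 1, plogis(y)) == 1    # deletion depends on y

# Slope estimates from the full data and from the two complete-case analyses.
slopes <- c(
  full     = unname(coef(lm(y ~ x))[2]),
  cc_dep_x = unname(coef(lm(y[keep_x] ~ x[keep_x]))[2]),
  cc_dep_y = unname(coef(lm(y[keep_y] ~ x[keep_y]))[2])
)
print(round(slopes, 2))
```

When deletion depends only on `x`, the complete-case slope should stay close to 2. When it depends on `y`, the slope is attenuated.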
References
Van Buuren, S. (2018). Flexible Imputation of Missing Data, 2nd Edition. CRC Press.
Rubin, D. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons.