Concepts in Incomplete Data
Notations
- $m$: number of multiple imputations
- $Y$: data of the sample
- Includes both covariates and response
- Dimension $n \times p$
- $R$: observation indicator matrix, known
- An $n \times p$ 0-1 matrix
- $r_{ij} = 0$ for missing and $r_{ij} = 1$ for observed
- $Y_\text{obs}$: observed data
- $Y_\text{mis}$: missing data
- $Y = (Y_\text{obs}, Y_\text{mis})$: complete data
- $\psi$: the parameter of the missing data mechanism
- $\theta$: the parameter of the full data $Y$
Concepts of MCAR, MAR, and MNAR, with notations
Missing completely at random (MCAR): $P(R = 0 \mid Y_\text{obs}, Y_\text{mis}, \psi) = P(R = 0 \mid \psi)$
Missing at random (MAR): $P(R = 0 \mid Y_\text{obs}, Y_\text{mis}, \psi) = P(R = 0 \mid Y_\text{obs}, \psi)$
Missing not at random (MNAR): $P(R = 0 \mid Y_\text{obs}, Y_\text{mis}, \psi)$ does not simplify
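As a minimal illustration (my own sketch, not from the slides), the three mechanisms can be simulated in R; all variable names here are made up, with y2 the variable subject to missingness:

set.seed(1)
n  <- 1000
y1 <- rnorm(n)                         # always observed
y2 <- y1 + rnorm(n)                    # subject to missingness
r_mcar <- rbinom(n, 1, 0.7)            # MCAR: constant probability of observation
r_mar  <- rbinom(n, 1, plogis(y1))     # MAR: depends only on the observed y1
r_mnar <- rbinom(n, 1, plogis(y2))     # MNAR: depends on the possibly missing y2 itself
y2_mar <- ifelse(r_mar == 1, y2, NA)   # observed-data version under MAR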
Ignorable
The missing data mechanism is ignorable for likelihood inference (on $\theta$) if
- MAR holds, and
- Distinctness: the parameters $\theta$ and $\psi$ are independent (from a Bayesian viewpoint)

If the nonresponse is ignorable, then $P(Y_\text{mis} \mid Y_\text{obs}, R) = P(Y_\text{mis} \mid Y_\text{obs})$. Thus, if the missing data model is ignorable, we can model $\theta$ using only the observed data.
Why and When Multiple Imputation Works
Goal of multiple imputation
Note: in most multiple imputation practice, the goal is to fit a (predictive) model whose parameter estimates have variances as small as possible
- $Q$: estimand (the parameter to be estimated)
- $\hat{Q}$: estimate
- Unbiased: $E(\hat{Q} \mid Y) = Q$
- Confidence valid: $E(U \mid Y) \ge V(\hat{Q} \mid Y)$, where $U$ is the estimated covariance matrix of $\hat{Q}$, the expectation is over all possible samples, and $V(\hat{Q} \mid Y)$ is the variance caused by the sampling process
Within-variance and between-variance
$$E(Q \mid Y_\text{obs}) = E_{Y_\text{mis} \mid Y_\text{obs}}\{E(Q \mid Y_\text{obs}, Y_\text{mis})\}$$
$$V(Q \mid Y_\text{obs}) = \underbrace{E_{Y_\text{mis} \mid Y_\text{obs}}\{V(Q \mid Y_\text{obs}, Y_\text{mis})\}}_{\text{within-variance}} + \underbrace{V_{Y_\text{mis} \mid Y_\text{obs}}\{E(Q \mid Y_\text{obs}, Y_\text{mis})\}}_{\text{between-variance}}$$
Within-variance: average of the repeated complete-data posterior variances of $Q$, estimated by $\bar{U} = \frac{1}{m} \sum_{l=1}^{m} \bar{U}_l$, where $\bar{U}_l$ is the variance of $\hat{Q}_l$ in the $l$-th imputation
Between-variance: variance between the complete-data posterior means of $Q$, estimated by $B = \frac{1}{m-1} \sum_{l=1}^{m} (\hat{Q}_l - \bar{Q})(\hat{Q}_l - \bar{Q})'$, where $\bar{Q} = \frac{1}{m} \sum_{l=1}^{m} \hat{Q}_l$
Decomposition of total variation
Since $\bar{Q}$ is estimated using a finite $m$, its contribution to the variance is about $B/m$. Thus, the total posterior variance of $Q$ can be decomposed into three parts (computed by hand in the sketch after this list): $$T = \bar{U} + B + B/m = \bar{U} + \left(1 + \frac{1}{m}\right)B$$
- $\bar{U}$: the conventional variance, due to sampling rather than observing the entire population
- $B$: the extra variance due to missing values
- $B/m$: the extra simulation variance because $\bar{Q}$ is estimated with a finite $m$
- Traditional choices are $m = 3, 5, 10$, but the current advice is to use a larger $m$, e.g., $m = 50$
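As a sketch of how these pieces are computed (not part of the slides; the mice call mirrors the numerical example later in these notes, and the names fits, Qhat, Ul are made up for illustration):

library(mice)
imp  <- mice(nhanes, m = 10, print = FALSE, seed = 24415)
fits <- lapply(1:10, function(l) lm(bmi ~ age, data = complete(imp, l)))
Qhat <- sapply(fits, function(f) coef(f)["age"])         # per-imputation estimates
Ul   <- sapply(fits, function(f) vcov(f)["age", "age"])  # per-imputation variances
m    <- 10
Ubar <- mean(Ul)                 # within-imputation variance
B    <- var(Qhat)                # between-imputation variance
Tvar <- Ubar + (1 + 1/m) * B     # total variance T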
Properness of an imputation procedure
An imputation procedure is confidence proper for complete-data statistics $(\hat{Q}, U)$ if it satisfies the following three conditions approximately at large $m$: $$E(\bar{Q} \mid Y) = \hat{Q}, \qquad E(\bar{U} \mid Y) = U, \qquad \left(1 + \frac{1}{m}\right) E(B \mid Y) \ge V(\bar{Q})$$
- Here $\hat{Q}$ is the complete-sample estimator of $Q$, and $U$ is its covariance
- If we replace the $\ge$ by $>$ in the third condition, then the procedure is said to be proper
- It is not always easy to check whether a procedure is proper.
Scope of the imputation model
Broad: one set of imputations to be used for all projects and analyses
Intermediate: one set of imputations per project, used for all analyses in that project
Narrow: a separate imputed dataset is created for each analysis
Which scope is better depends on the use case
Variance ratios
- Proportion of variation attributable to the missing data: $$\lambda = \frac{B + B/m}{T}$$
- If $\lambda > 0.5$, then the influence of the imputation model on the final result is larger than that of the complete-data model
- Relative increase in variance due to nonresponse: $$r = \frac{B + B/m}{\bar{U}} = \frac{\lambda}{1 - \lambda}$$
- Fraction of information about $Q$ missing due to nonresponse: $$\gamma = \frac{r + 2/(\nu + 3)}{1 + r} = \frac{\nu + 1}{\nu + 3}\lambda + \frac{2}{\nu + 3}$$
- Here, $\nu$ is the degrees of freedom (see next); worked numbers follow this list
- When $\nu$ is large, $\gamma$ is very close to $\lambda$
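As a quick check, plugging in the numbers from the age row of the nhanes example later in these notes ($\bar{U} = 0.9$, $B = 0.39$, $m = 10$):

Ubar <- 0.9; B <- 0.39; m <- 10
Tvar   <- Ubar + (1 + 1/m) * B     # about 1.33
lambda <- (B + B/m) / Tvar         # about 0.32, matching column lambda below
riv    <- lambda / (1 - lambda)    # about 0.48, matching column riv below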
Degrees of freedom (df)
The degrees of freedom is the number of observations minus the number of parameters estimated in the model.
The "old" formula (as in Rubin 1987) may produce values larger than the sample size of the complete data: $$\nu_\text{old} = (m - 1)\left(1 + \frac{1}{r}\right)^2 = \frac{m - 1}{\lambda^2}$$
Let $\nu_\text{com}$ be the conventional df in a complete-data inference problem. If the number of parameters in the model is $k$ and the sample size is $n$, then $\nu_\text{com} = n - k$. The estimated observed-data df that accounts for the missing information is $$\nu_\text{obs} = \frac{\nu_\text{com} + 1}{\nu_\text{com} + 3}\,\nu_\text{com}(1 - \lambda)$$
Barnard-Rubin correction: the adjusted df to be used for testing in multiple imputation is $$\nu = \frac{\nu_\text{old}\,\nu_\text{obs}}{\nu_\text{old} + \nu_\text{obs}}$$
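Continuing the worked numbers above ($\lambda \approx 0.32$, $m = 10$, and $\nu_\text{com} = 25 - 2 = 23$ for the nhanes example below):

m <- 10; lambda <- 0.323; nu_com <- 23
riv    <- lambda / (1 - lambda)                                # about 0.48
nu_old <- (m - 1) / lambda^2                                   # about 86, exceeds n = 25
nu_obs <- (nu_com + 1) / (nu_com + 3) * nu_com * (1 - lambda)  # about 14.4
nu     <- nu_old * nu_obs / (nu_old + nu_obs)                  # about 12.3, matching df below
gamma  <- (riv + 2 / (nu + 3)) / (1 + riv)                     # about 0.41, matching fmi below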
A numerical example
## Load the mice package
library(mice)
imp <- mice(nhanes, print = FALSE, m = 10, seed = 24415)
fit <- with(imp, lm(bmi ~ age))
est <- pool(fit); print(est, digits = 2)
## Class: mipo m = 10
## term m estimate ubar b t dfcom df riv lambda fmi
## 1 (Intercept) 10 30.8 3.4 2.52 6.2 23 9.2 0.82 0.45 0.54
## 2 age 10 -2.3 0.9 0.39 1.3 23 12.3 0.48 0.32 0.41
- Columns ubar, b, and t are the variances $\bar{U}$, $B$, and $T$
- Columns riv, lambda, and fmi are $r$, $\lambda$, and $\gamma$
- Column dfcom is $\nu_\text{com}$
- Column df is the Barnard-Rubin corrected $\nu$
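These relations can be checked directly: pool() returns a mipo object whose pooled element is a data frame with the columns above (assuming the current mice interface):

## T = Ubar + (1 + 1/m) B and lambda = (1 + 1/m) B / T, up to rounding
with(est$pooled, t - (ubar + (1 + 1/m) * b))   # essentially zero
with(est$pooled, lambda - (1 + 1/m) * b / t)   # essentially zero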
T-test for regression coefficients
- Use the Barnard-Rubin corrected $\nu$ as the degrees of freedom (the shape parameter) of the $t$-distribution.
print(summary(est, conf.int = TRUE), digits = 1)
## term estimate std.error statistic df p.value 2.5 % 97.5 %
## 1 (Intercept) 31 2 12 9 5e-07 25 36.4
## 2 age -2 1 -2 12 7e-02 -5 0.2
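The intervals can be reproduced by hand from the pooled quantities, using the Barnard-Rubin df in qt() (a sketch assuming the mipo structure described above):

se <- sqrt(est$pooled$t)                                     # std.error = sqrt(T)
cbind(est$pooled$estimate - qt(0.975, est$pooled$df) * se,   # 2.5 % limit
      est$pooled$estimate + qt(0.975, est$pooled$df) * se)   # 97.5 % limit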
More about Imputation Methods
Imputation evaluation criteria
- The following criteria are useful in simulation studies (when you know the true $Q$); a simulation skeleton follows this list
Raw bias (RB), upper limit 5%: $$\text{RB} = \left|\frac{E(\bar{Q}) - Q}{Q}\right|$$
Coverage rate (CR): a CR below 90% for the nominal 95% interval is bad
Average width (AW) of the confidence interval
Root mean squared error (RMSE), the smaller the better: $$\text{RMSE} = \sqrt{\left(E(\bar{Q}) - Q\right)^2}$$
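A minimal simulation skeleton for these criteria (my own sketch, not from the slides): the true $Q$ is the slope of y on x, missingness is MCAR in x, and each replicate uses mice with default settings.

library(mice)
set.seed(2)
nsim <- 100; Q <- 1                                # true estimand: the slope
res <- replicate(nsim, {
  x <- rnorm(100); y <- Q * x + rnorm(100)
  x[rbinom(100, 1, 0.3) == 1] <- NA                # MCAR missingness in x
  imp <- mice(data.frame(x, y), m = 5, print = FALSE)
  s <- summary(pool(with(imp, lm(y ~ x))), conf.int = TRUE)
  c(est = s$estimate[2], lo = s$`2.5 %`[2], hi = s$`97.5 %`[2])
})
RB   <- abs((mean(res["est", ]) - Q) / Q)          # raw bias, want below 5%
CR   <- mean(res["lo", ] <= Q & Q <= res["hi", ])  # coverage rate, want about 0.95
AW   <- mean(res["hi", ] - res["lo", ])            # average CI width
RMSE <- sqrt((mean(res["est", ]) - Q)^2)           # as defined above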
Imputation is not prediction
- Shall we evaluate an imputation method by examining how closely it can recover the missing values?
- For example, using the RMSE to see if the imputed values $\dot{y}_i$ are close to the true (removed) values $y_i^\text{mis}$? $$\text{RMSE} = \sqrt{\frac{1}{n_\text{mis}} \sum_{i=1}^{n_\text{mis}} \left(y_i^\text{mis} - \dot{y}_i\right)^2}$$
- NO! This criterion favors least squares (regression) predictions, which impute the same "best" values over and over; that is single imputation, and it ignores the inherent uncertainty of the missing values. A small demonstration follows.
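A small demonstration (my own sketch): deterministic regression imputation (mice method norm.predict) beats Bayesian linear imputation (method norm) on this RMSE, yet it collapses the variability of the imputed values:

library(mice)
set.seed(3)
x <- rnorm(200); y <- x + rnorm(200)
miss <- rbinom(200, 1, 0.3) == 1
y_true <- y; y[miss] <- NA                     # remove known values
dat <- data.frame(x, y)
pred  <- complete(mice(dat, method = "norm.predict", m = 1, print = FALSE))
bayes <- complete(mice(dat, method = "norm", m = 1, print = FALSE))
sqrt(mean((y_true[miss] - pred$y[miss])^2))    # smaller RMSE, "wins" the criterion
sqrt(mean((y_true[miss] - bayes$y[miss])^2))   # larger RMSE, but ...
var(pred$y[miss]); var(bayes$y[miss])          # ... norm preserves the spread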
When not to use multiple imputation
For predictive modeling, if the missing values are in the target variable Y, then complete-case analysis and multiple imputation are equivalent.
Two special cases where listwise deletion is better than multiple imputation:
- If the probability of being missing does not depend on $Y$
- If the complete-data model is logistic regression, and the missing data are confined to $Y$, not $X$
References
Van Buuren, S. (2018). Flexible Imputation of Missing Data, 2nd Edition. Boca Raton, FL: CRC Press.
Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons.