Concepts in Incomplete Data
Notations
- \(m\): number of multiple imputations
- \(Y\): data of the sample
- Includes both covariates and response
- Dimension \(n \times p\)
- \(R\): observation indicator matrix, known
- An \(n \times p\) 0-1 matrix
- \(r_{ij} = 0\) for missing and \(r_{ij} = 1\) for observed (see the sketch after this list)
- \(Y_{\text{obs}}\): observed data
- \(Y_{\text{mis}}\): missing data
- \(Y = (Y_{\text{obs}}, Y_{\text{mis}})\): complete data
- \(\psi\): the parameter for the missing mechanism
- \(\theta\): the parameter for the full data \(Y\)
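To ground the notation, here is a minimal R sketch (an illustration, not from the source) that builds the indicator matrix \(R\) for the nhanes data used later in this note:

## Build the 0-1 observation indicator matrix R for a dataset with NAs
library(mice)            # provides the nhanes data
R <- 1 * !is.na(nhanes)  # r_ij = 1 if observed, 0 if missing
head(R)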
Concepts of MCAR, MAR, and MNAR, with notations
Missing completely at random (MCAR) \[ P(R = 0 \mid Y_{\text{obs}}, Y_{\text{mis}}, \psi) = P(R = 0\mid \psi) \]
Missing at random (MAR) \[ P(R = 0 \mid Y_{\text{obs}}, Y_{\text{mis}}, \psi) = P(R = 0\mid Y_{\text{obs}}, \psi) \]
Missing not at random (MNAR) \[ P(R = 0 \mid Y_{\text{obs}}, Y_{\text{mis}}, \psi) \text{ does not simplify} \]
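To make the three mechanisms concrete, here is a minimal simulation sketch (illustrative only; the variable names are not from the source), where \(y_2\) is subject to missingness and \(y_1\) is always observed:

set.seed(1)
n  <- 1000
y1 <- rnorm(n)                       # always observed
y2 <- 0.5 * y1 + rnorm(n)            # subject to missingness
r_mcar <- rbinom(n, 1, 0.7)          # MCAR: constant response probability
r_mar  <- rbinom(n, 1, plogis(y1))   # MAR: depends only on the observed y1
r_mnar <- rbinom(n, 1, plogis(y2))   # MNAR: depends on the unobserved y2 itself
y2_mar <- ifelse(r_mar == 1, y2, NA) # realized incomplete data under MAR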
Ignorable
The missing data mechanism is ignorable for likelihood inference (on \(\theta\)), if
- MAR, and
- Distinctness: the parameters \(\theta\) and \(\psi\) are a priori independent (from a Bayesian viewpoint)
If the nonresponse is ignorable, then \[ P(Y_{\text{mis}} \mid Y_{\text{obs}}, R) = P(Y_{\text{mis}} \mid Y_{\text{obs}}) \] Thus, if the missing data model is ignorable, we can model \(\theta\) using only the observed data.
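To see why ignorability lets us drop \(R\), note that under MAR the missingness probability does not involve \(Y_{\text{mis}}\), so it factors out of the integral over the missing data (a standard argument; see Rubin 1987): \[\begin{align*} P(Y_{\text{obs}}, R \mid \theta, \psi) &= \int P(Y \mid \theta)\, P(R \mid Y, \psi)\, dY_{\text{mis}}\\ &= P(R \mid Y_{\text{obs}}, \psi) \int P(Y \mid \theta)\, dY_{\text{mis}} = P(R \mid Y_{\text{obs}}, \psi)\, P(Y_{\text{obs}} \mid \theta) \end{align*}\] With distinctness, the first factor carries no information about \(\theta\), so likelihood inference on \(\theta\) can proceed from \(P(Y_{\text{obs}} \mid \theta)\) alone.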
Why and When Multiple Imputation Works
Goal of multiple imputation
Note: in most multiple imputation practice, the goal is to fit a (predictive) model whose parameter estimates have variances as small as possible
- \(Q\): estimand (the parameter to be estimated)
- \(\hat{Q}\): estimate
- Unbiased: \[ E(\hat{Q} \mid Y) = Q \]
- Confidence valid: \[ E(U \mid Y) \geq V(\hat{Q}\mid Y) \] where \(U\) is the estimated covariance matrix of \(\hat{Q}\), the expectation is over all possible samples, and \(V(\hat{Q}\mid Y)\) is the variance caused by the sampling process
Within-variance and between-variance
\[\begin{align*} E(Q \mid Y_{\text{obs}}) &= E_{Y_{\text{mis}} \mid Y_{\text{obs}}}\{ E(Q \mid Y_{\text{obs}}, Y_{\text{mis}})\}\\ V(Q \mid Y_{\text{obs}}) &= \underbrace{E_{Y_{\text{mis}} \mid Y_{\text{obs}}}\{ V(Q \mid Y_{\text{obs}}, Y_{\text{mis}})\}}_{\text{within-variance}} + \underbrace{V_{Y_{\text{mis}} \mid Y_{\text{obs}}}\{ E(Q \mid Y_{\text{obs}}, Y_{\text{mis}})\}}_{\text{between-variance}} \end{align*}\]
Within-variance: average of the repeated complete-data posterior variance of \(Q\), estimated by \[ \bar{U} = \frac{1}{m} \sum_{l=1}^m \bar{U}_l, \] where \(\bar{U}_l\) is the variance of \(\hat{Q}_l\) in the \(l\)th imputation
Between-variance: variance between the complete-data posterior means of \(Q\), estimated by \[ B = \frac{1}{m-1}\sum_{l=1}^m\left(\hat{Q}_l - \bar{Q}\right)\left(\hat{Q}_l - \bar{Q}\right)', \quad \bar{Q} = \frac{1}{m} \sum_{l=1}^m \hat{Q}_l \]
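As a concrete illustration (with made-up numbers, not from the source), the pooled quantities can be computed by hand from \(m = 5\) complete-data analyses of a scalar \(Q\):

Q_hat <- c(30.5, 31.2, 29.8, 30.9, 31.5)  # complete-data estimates, one per imputation
U_l   <- c(3.1, 2.9, 3.3, 3.0, 3.2)       # complete-data variances of those estimates
m     <- length(Q_hat)
Q_bar <- mean(Q_hat)                      # pooled estimate
U_bar <- mean(U_l)                        # within-variance
B     <- var(Q_hat)                       # between-variance (var() divides by m - 1)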
Decomposition of total variation
Since \(\bar{Q}\) is estimated using finite \(m\), the contribution to the variance is about \(B/m\). Thus, the total posterior variance of \(Q\) can be decomposed into three parts: \[ T = \bar{U} + B + B/m = \bar{U} + \left(1 + \frac{1}{m}\right) B \]
\(\bar{U}\): the conventional sampling variance, caused by observing a sample rather than the entire population
\(B\): the extra variance due to missing values
\(B/m\): the extra simulation variance because \(\bar{Q}\) is estimated for finite \(m\)
- Traditional choices are \(m = 3, 5, 10\), but the current advice is to use a larger \(m\), e.g., \(m = 50\)
Properness of an imputation procedure
An imputation procedure is confidence proper for complete-data statistics \(\hat{Q}, U\), if it satisfies the following three conditions approximately for large \(m\) \[\begin{align*} E\left(\bar{Q} \mid Y\right) & = \hat{Q}\\ E\left(\bar{U} \mid Y\right) & = U\\ \left(1 + \frac{1}{m}\right) E(B \mid Y) & \geq V(\bar{Q}) \end{align*}\]
- Here \(\hat{Q}\) is the complete-sample estimator of \(Q\), and \(U\) is its covariance
- If we replace the \(\geq\) by \(>\) in the third formula, then the procedure is said to be proper
- It is not always easy to check whether a procedure is proper.
Scope of the imputation model
Broad: one set of imputations to be used for all projects and analyses
Intermediate: one set of imputations per project and use this for all analyses
Narrow: a separate imputed dataset is created for each analysis
Which scope is better depends on the use case
Variance ratios
- Proportion of variation attributable to the missing data
\[
\lambda = \frac{B + B/m}{T}
\]
- If \(\lambda > 0.5\), then the influence of the imputation model on the final result is larger than that of the complete-data model
Relative increase in variance due to nonresponse \[ r = \frac{B + B/m}{\bar{U}} = \frac{\lambda}{1-\lambda} \]
- Fraction of information about \(Q\) missing due to nonresponse
\[
\gamma = \frac{r + 2/(\nu + 3)}{1 + r} = \frac{\nu + 1}{\nu + 3}\lambda + \frac{2}{\nu + 3}
\]
- Here, \(\nu\) is the degrees of freedom (see next)
- When \(\nu\) is large, \(\gamma\) is very close to \(\lambda\)
Degrees of freedom (df)
The degrees of freedom are the number of observations remaining after accounting for the number of parameters in the model.
The “old” formula (as in Rubin 1987) may produce values larger than the sample size in the complete data: \[ \nu_{\text{old}} = (m-1) \left(1 + \frac{1}{r}\right)^2 = \frac{m-1}{\lambda^2} \]
Let \(\nu_\text{com}\) be the conventional df in a complete-data inference problem. If the number of parameters in the model is \(k\) and the sample size is \(n\), then \(\nu_\text{com} = n-k\). The estimated observed data df that accounts for the missing information is \[ \nu_{\text{obs}} = \frac{\nu_\text{com} + 1}{\nu_\text{com} + 3} \nu_\text{com} (1-\lambda) \]
Barnard-Rubin correction: the adjusted df to be used for testing in multiple imputation is \[ \nu = \frac{\nu_{\text{old}} \nu_{\text{obs}}}{\nu_{\text{old}} + \nu_{\text{obs}}} \]
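Putting the variance ratios and the df formulas together, here is a sketch with made-up pooled quantities (and an assumed complete-data problem with \(n = 25\) and \(k = 2\), so \(\nu_\text{com} = 23\)):

m      <- 5                                   # number of imputations (made up)
U_bar  <- 3.0                                 # within-variance (made up)
B      <- 0.5                                 # between-variance (made up)
T_var  <- U_bar + (1 + 1/m) * B               # total variance T
lambda <- (B + B/m) / T_var                   # proportion of variance due to missingness
r_riv  <- (B + B/m) / U_bar                   # relative increase in variance
nu_old <- (m - 1) * (1 + 1/r_riv)^2           # Rubin (1987) df; equals (m - 1)/lambda^2
nu_com <- 25 - 2                              # complete-data df: n - k
nu_obs <- (nu_com + 1) / (nu_com + 3) * nu_com * (1 - lambda)
nu     <- nu_old * nu_obs / (nu_old + nu_obs) # Barnard-Rubin corrected df
gamma  <- (r_riv + 2/(nu + 3)) / (1 + r_riv)  # fraction of missing information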
A numerical example
## Load the mice package
library(mice)
imp <- mice(nhanes, print = FALSE, m = 10, seed = 24415)
fit <- with(imp, lm(bmi ~ age))
est <- pool(fit); print(est, digits = 2)
## Class: mipo m = 10
## term m estimate ubar b t dfcom df riv lambda fmi
## 1 (Intercept) 10 30.8 3.4 2.52 6.2 23 9.2 0.82 0.45 0.54
## 2 age 10 -2.3 0.9 0.39 1.3 23 12.3 0.48 0.32 0.41
- Columns `ubar`, `b`, and `t` are the variances \(\bar{U}\), \(B\), and \(T\)
- Column `dfcom` is \(\nu_\text{com}\)
- Column `df` is the Barnard-Rubin correction \(\nu\)
t-test for regression coefficients
- Use the Barnard-Rubin corrected \(\nu\) as the degrees of freedom of the t-distribution.
print(summary(est, conf.int = TRUE), digits = 1)
## term estimate std.error statistic df p.value 2.5 % 97.5 %
## 1 (Intercept) 31 2 12 9 5e-07 25 36.4
## 2 age -2 1 -2 12 7e-02 -5 0.2
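As a check (a sketch using the rounded numbers from the pooled table above, so the result is approximate), the interval for age can be reconstructed by hand:

est_age <- -2.3                                  # pooled estimate for age
se_age  <- sqrt(1.3)                             # std.error = sqrt(t), the total variance
df_age  <- 12.3                                  # Barnard-Rubin df
est_age + c(-1, 1) * qt(0.975, df_age) * se_age
## approximately -5 and 0.2, matching the summary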
More about Imputation Methods
Imputation evaluation criteria
- The following criteria are useful in simulation studies (when you know the true \(Q\)); a toy computation follows the list
Raw bias (RB): upper limit \(5\%\) \[ \text{RB} = \left|\frac{E\left(\bar{Q}\right) - Q}{Q}\right| \]
Coverage rate (CR): A CR below \(90\%\) for the nominal \(95\%\) interval is bad
Average width (AW) of confidence interval
Root mean squared error (RMSE): the smaller the better \[ \text{RMSE} = \sqrt{E\left\{\left(\bar{Q} - Q\right)^2\right\}} \]
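Here is that toy computation (illustrative numbers only; the pooled estimates are simulated stand-ins, not real multiple imputation output):

set.seed(2)
Q_true <- 1                                     # known estimand in the simulation
n_sim  <- 1000                                  # number of simulated replications
est    <- rnorm(n_sim, mean = 1.02, sd = 0.1)   # pooled estimates, one per replication
se     <- rep(0.1, n_sim)                       # their standard errors
ci_lo  <- est - 1.96 * se                       # nominal 95% interval
ci_hi  <- est + 1.96 * se
RB   <- abs((mean(est) - Q_true) / Q_true)      # raw bias, want below 5%
CR   <- mean(ci_lo <= Q_true & Q_true <= ci_hi) # coverage rate, want about 95%
AW   <- mean(ci_hi - ci_lo)                     # average CI width
RMSE <- sqrt(mean((est - Q_true)^2))            # combines bias and variance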
Imputation is not prediction
- Shall we evaluate an imputation method by examining how closely it can recover the missing values?
- For example, using the RMSE to see if the imputed values \(\dot{y}_i\) are close to the true (removed) missing data \(y_i^{\text{mis}}\)? \[ \text{RMSE} = \sqrt{\frac{1}{n_{\text{mis}}}\sum_{i=1}^{n_{\text{mis}}}\left(y_i^{\text{mis}} - \dot{y}_i \right)^2} \]
- No! Such a criterion favors least-squares-type predictions that return the same “best” values over and over, which is effectively single imputation; it ignores the inherent uncertainty of the missing values.
When not to use multiple imputation
For predictive modeling, if the missing values are in the target variable \(Y\), then complete-case analysis and multiple imputation are equivalent.
Two special cases where listwise deletion is better than multiple imputation
If the probability of being missing does not depend on \(Y\)
If the complete data model is logistic regression, and the missing data are confined to \(Y\), not \(X\)
References
Van Buuren, S. (2018). Flexible Imputation of Missing Data, 2nd Edition. CRC Press.
Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons.