Concepts in Incomplete Data

Notations

$m$ : number of multiple imputations
$Y$ : data of the sample
- Includes both covariates and response
- Dimension $n \times p$
$R$ : observation indicator matrix, known
- An $n \times p$ 0-1 matrix
- $r_{i j} = 0$ for missing and 1 for observed
$Y_{obs}$ : observed data
$Y_{mis}$ : missing data
$Y = (Y_{obs}, Y_{mis})$ : complete data
$ψ$ : the parameter of the missing-data mechanism
$θ$ : the parameter for the full data $Y$

Concepts of MCAR, MAR, and MNAR, with notations

Missing completely at random (MCAR): $P (R = 0 ∣ Y_{obs}, Y_{mis}, ψ) = P (R = 0 ∣ ψ)$
Missing at random (MAR): $P (R = 0 ∣ Y_{obs}, Y_{mis}, ψ) = P (R = 0 ∣ Y_{obs}, ψ)$
Missing not at random (MNAR): $P (R = 0 ∣ Y_{obs}, Y_{mis}, ψ)$ does not simplify; the missingness still depends on $Y_{mis}$
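The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: the bivariate data, the coefficients standing in for $ψ$, and all variable names are assumptions, not taken from the book. It compares the mean of the partly missing variable among observed cases with the full-sample mean.

```r
# Simulate the three missingness mechanisms on a toy bivariate sample.
# y1 is always observed; y2 is the variable that may go missing.
set.seed(1)
n  <- 10000
y1 <- rnorm(n)
y2 <- 0.5 * y1 + rnorm(n)

# P(R = 0 | ...) under each mechanism (psi is fixed by the chosen coefficients)
p_mcar <- rep(0.3, n)            # depends on nothing
p_mar  <- plogis(-1 + 1.5 * y1)  # depends on the observed y1 only
p_mnar <- plogis(-1 + 1.5 * y2)  # depends on the value that goes missing

# r = 1 means observed, r = 0 means missing, as in the notation above
r_mcar <- rbinom(n, 1, 1 - p_mcar)
r_mar  <- rbinom(n, 1, 1 - p_mar)
r_mnar <- rbinom(n, 1, 1 - p_mnar)

# Mean of y2 among observed cases versus the full-sample mean
means <- c(full = mean(y2),
           mcar = mean(y2[r_mcar == 1]),
           mar  = mean(y2[r_mar == 1]),
           mnar = mean(y2[r_mnar == 1]))
print(round(means, 2))
```

Under MCAR the observed-case mean should stay close to the full-sample mean. Under MAR and MNAR it drifts, but only under MAR can the drift be explained by the observed `y1`.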

Ignorable

The missing data mechanism is ignorable for likelihood inference (on $θ$ ), if
1. MAR, and
2. Distinctness: the parameters $θ$ and $ψ$ are independent (from a Bayesian’s view)
If the nonresponse is ignorable, then $P (Y_{mis} ∣ Y_{obs}, R) = P (Y_{mis} ∣ Y_{obs})$. Thus, if the missing-data mechanism is ignorable, we can model $θ$ using only the observed data.
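The identity above follows from Bayes' rule together with MAR. A short derivation, with the parameters $θ$ and $ψ$ suppressed for readability (distinctness is what allows them to be handled separately), might read:

```latex
\begin{aligned}
P(Y_{mis} \mid Y_{obs}, R)
  &= \frac{P(R \mid Y_{obs}, Y_{mis})\, P(Y_{mis} \mid Y_{obs})}{P(R \mid Y_{obs})}
     && \text{(Bayes' rule)} \\
  &= \frac{P(R \mid Y_{obs})\, P(Y_{mis} \mid Y_{obs})}{P(R \mid Y_{obs})}
     && \text{(MAR)} \\
  &= P(Y_{mis} \mid Y_{obs})
\end{aligned}
```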

Why and When Multiple Imputation Works

Goal of multiple imputation

Note: in most multiple imputation practice, the goal is to train a (predictive) model whose parameter estimates have variances that are as small as possible
$Q$ : estimand (the parameter to be estimated)
$\hat{Q}$ : estimate
- Unbiased: $E (\hat{Q} ∣ Y) = Q$
- Confidence valid: $E (U ∣ Y) \geq V (\hat{Q} ∣ Y)$ where $U$ is the estimated covariance matrix of $\hat{Q}$ , the expectation is over all possible samples, and $V (\hat{Q} ∣ Y)$ is the variance caused by the sampling process

Within-variance and between-variance

$\begin{aligned} E (Q ∣ Y_{obs}) & = E_{Y_{mis} ∣ Y_{obs}} \{E (Q ∣ Y_{obs}, Y_{mis})\} \\ V (Q ∣ Y_{obs}) & = \underbrace{E_{Y_{mis} ∣ Y_{obs}} \{V (Q ∣ Y_{obs}, Y_{mis})\}}_{\text{within-variance}} + \underbrace{V_{Y_{mis} ∣ Y_{obs}} \{E (Q ∣ Y_{obs}, Y_{mis})\}}_{\text{between-variance}} \end{aligned}$

Within-variance: the average of the repeated complete-data posterior variances of $Q$, estimated by $\bar{U} = \frac{1}{m} \sum_{l = 1}^{m} {\bar{U}}_{l}$, where ${\bar{U}}_{l}$ is the variance of ${\hat{Q}}_{l}$ in the $l$-th imputation
Between-variance: the variance between the complete-data posterior means of $Q$, estimated by $B = \frac{1}{m - 1} \sum_{l = 1}^{m} ({\hat{Q}}_{l} - \bar{Q}) {({\hat{Q}}_{l} - \bar{Q})}^{'}$, where $\bar{Q} = \frac{1}{m} \sum_{l = 1}^{m} {\hat{Q}}_{l}$
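As a minimal sketch of the two estimators, the base R code below computes $\bar{Q}$, $\bar{U}$, and $B$ for a two-parameter estimand. The per-imputation estimates and covariance matrices are made-up illustrative numbers, and the object names (`q_hat`, `u_list`) are hypothetical.

```r
# Hypothetical per-imputation results for m = 5 imputations and a
# two-parameter estimand (values are made up for illustration).
q_hat <- rbind(c(30.1, -2.0),
               c(31.4, -2.6),
               c(29.8, -1.9),
               c(32.0, -2.8),
               c(30.7, -2.2))    # m x k matrix, row l holds Q-hat_l
u_list <- list(diag(c(3.2, 0.8)), diag(c(3.5, 0.9)), diag(c(3.3, 0.9)),
               diag(c(3.6, 1.0)), diag(c(3.4, 0.9)))  # one k x k matrix per imputation

m     <- nrow(q_hat)
q_bar <- colMeans(q_hat)            # pooled estimate Q-bar
u_bar <- Reduce(`+`, u_list) / m    # within-imputation variance U-bar
b     <- cov(q_hat)                 # between-imputation variance B (divisor m - 1)

print(q_bar)
print(u_bar)
print(b)
```

Using `cov()` on the matrix of estimates works because its sample covariance already uses the divisor $m - 1$ and centers on the column means, which are exactly $\bar{Q}$.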

Decomposition of total variance

Since $\bar{Q}$ is estimated using a finite $m$, it carries an additional simulation variance of about $B / m$. Thus, the total posterior variance of $Q$ can be decomposed into three parts: $T = \bar{U} + B + B / m = \bar{U} + (1 + \frac{1}{m}) B$
$\bar{U}$ : the conventional sampling variance, which arises because we observe a sample rather than the entire population
$B$ : the extra variance due to missing values
$B / m$ : the extra simulation variance because $\bar{Q}$ is estimated for finite $m$
- Traditional choices are $m = 3, 5, 10$, but the current advice is to use a larger $m$, e.g., $m = 50$
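The decomposition is easy to check by hand. The sketch below applies it to the rounded `ubar` and `b` values for the intercept and the age coefficient from the numerical example in these notes, and then shows how the simulation share $(B / m) / T$ shrinks as $m$ grows. The function name `total_variance` is a hypothetical helper, not part of any package.

```r
# Total variance T = U-bar + (1 + 1/m) * B for scalar estimands
total_variance <- function(u_bar, b, m) {
  u_bar + (1 + 1 / m) * b
}

# Rounded values for (Intercept) and age from the pooled fit with m = 10
t_total <- total_variance(u_bar = c(3.4, 0.9), b = c(2.52, 0.39), m = 10)
print(t_total)

# Share of T that is pure simulation variance B/m, for several choices of m
# (intercept values held fixed for illustration)
m_grid    <- c(3, 5, 10, 50)
sim_share <- sapply(m_grid, function(m) (2.52 / m) / total_variance(3.4, 2.52, m))
print(round(setNames(sim_share, m_grid), 3))
```

The two totals agree with the `t` column of the pooled output up to rounding.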

Properness of an imputation procedure

An imputation procedure is confidence proper for complete-data statistics $\hat{Q}, U$ if it satisfies the following three conditions approximately at large $m$: $\begin{aligned} E (\bar{Q} ∣ Y) & = \hat{Q} \\ E (\bar{U} ∣ Y) & = U \\ (1 + \frac{1}{m}) E (B ∣ Y) & \geq V (\bar{Q}) \end{aligned}$
- Here $\hat{Q}$ is the complete-sample estimator of $Q$ , and $U$ is its covariance
- If we replace the $\geq$ by $>$ in the third formula, then the procedure is said to be proper
- It is not always easy to check whether a procedure is proper.

Scope of the imputation model

Broad: one set of imputations to be used for all projects and analyses
Intermediate: one set of imputations per project and use this for all analyses
Narrow: a separate imputed dataset is created for each analysis
Which one is better depends on the use case

Variance ratios

Proportion of variation attributable to the missing data $λ = \frac{B + B / m}{T}$
- If $λ > 0.5$ , then the influence of the imputation model on the final result is larger than that of the complete-data model
Relative increase in variance due to nonresponse $r = \frac{B + B / m}{\bar{U}} = \frac{λ}{1 - λ}$
Fraction of information about $Q$ missing due to nonresponse $γ = \frac{r + 2 / (ν + 3)}{1 + r} = \frac{ν + 1}{ν + 3} λ + \frac{2}{ν + 3}$
- Here, $ν$ is the degrees of freedom (see next)
- When $ν$ is large, $γ$ is very close to $λ$
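A small helper makes the three ratios concrete. The sketch below assumes scalar inputs and uses the rounded intercept values from the numerical example in these notes ($\bar{U} = 3.4$, $B = 2.52$, $m = 10$, $ν ≈ 9.2$). The function name `variance_ratios` is hypothetical.

```r
# Variance ratios for a scalar estimand, following the formulas above
variance_ratios <- function(u_bar, b, m, nu) {
  t_total <- u_bar + (1 + 1 / m) * b
  lambda  <- (b + b / m) / t_total          # proportion of variance due to missingness
  r       <- (b + b / m) / u_bar            # relative increase in variance
  gamma   <- (r + 2 / (nu + 3)) / (1 + r)   # fraction of missing information
  c(lambda = lambda, r = r, gamma = gamma)
}

ratios <- variance_ratios(u_bar = 3.4, b = 2.52, m = 10, nu = 9.2)
print(round(ratios, 2))

# The two expressions for r, and for gamma, should agree
lam <- ratios[["lambda"]]
print(all.equal(ratios[["r"]], lam / (1 - lam)))
print(all.equal(ratios[["gamma"]], (9.2 + 1) / (9.2 + 3) * lam + 2 / (9.2 + 3)))
```

Up to rounding of the inputs, the results line up with the `lambda`, `riv`, and `fmi` columns of the pooled output.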

Degrees of freedom (df)

The degrees of freedom is the number of observations after accounting for the number of parameters in the model.
The “old” formula (as in Rubin 1987), which may produce values larger than the sample size in the complete data: $ν_{old} = (m - 1) {(1 + \frac{1}{r})}^{2} = \frac{m - 1}{λ^{2}}$
Let $ν_{com}$ be the conventional df in a complete-data inference problem. If the number of parameters in the model is $k$ and the sample size is $n$ , then $ν_{com} = n - k$ . The estimated observed data df that accounts for the missing information is $ν_{obs} = \frac{ν_{com} + 1}{ν_{com} + 3} ν_{com} (1 - λ)$
Barnard-Rubin correction: the adjusted df to be used for testing in multiple imputation is $ν = \frac{ν_{old} ν_{obs}}{ν_{old} + ν_{obs}}$
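The three df formulas chain together naturally. The sketch below is a hypothetical helper (`barnard_rubin_df` is not a function from any package) that takes $λ$, $m$, and $ν_{com}$ and returns the adjusted $ν$. It uses $ν_{old} = (m - 1) / λ^{2}$ and is evaluated on $λ$ values recomputed from the rounded `ubar` and `b` in the numerical example of these notes.

```r
# Barnard-Rubin adjusted degrees of freedom
barnard_rubin_df <- function(lambda, m, nu_com) {
  nu_old <- (m - 1) / lambda^2                                    # Rubin (1987)
  nu_obs <- (nu_com + 1) / (nu_com + 3) * nu_com * (1 - lambda)   # observed-data df
  nu_old * nu_obs / (nu_old + nu_obs)                             # combined df
}

# lambda recomputed from rounded ubar and b, with m = 10 and nu_com = 23
lambda_intercept <- (2.52 * 1.1) / (3.4 + 2.52 * 1.1)
lambda_age       <- (0.39 * 1.1) / (0.9 + 0.39 * 1.1)

nu <- barnard_rubin_df(c(lambda_intercept, lambda_age), m = 10, nu_com = 23)
print(round(nu, 1))
```

Because $ν$ has the form of a harmonic-type combination, it is always smaller than both $ν_{old}$ and $ν_{obs}$, and therefore never exceeds $ν_{com}$. The values differ slightly from the `df` column of the pooled output because the inputs here are rounded.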

A numerical example

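## Load the mice package
library(mice)
## Create m = 10 imputed versions of the nhanes data
imp <- mice(nhanes, print = FALSE, m = 10, seed = 24415)
## Fit the same linear model on each imputed dataset
fit <- with(imp, lm(bmi ~ age))
## Pool the m fits
est <- pool(fit)
print(est, digits = 2)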

## Class: mipo    m = 10 
##          term  m estimate ubar    b   t dfcom   df  riv lambda  fmi
## 1 (Intercept) 10     30.8  3.4 2.52 6.2    23  9.2 0.82   0.45 0.54
## 2         age 10     -2.3  0.9 0.39 1.3    23 12.3 0.48   0.32 0.41

Columns ubar, b, and t are the variances $\bar{U}$, $B$, and $T$
Column dfcom is $ν_{com}$
Column df is the Barnard-Rubin corrected $ν$
Columns riv, lambda, and fmi are $r$, $λ$, and $γ$

t-test for regression coefficients

Use the Barnard-Rubin corrected $ν$ as the degrees of freedom of the t-distribution.

print(summary(est, conf.int = TRUE), digits = 1)

##          term estimate std.error statistic df p.value 2.5 % 97.5 %
## 1 (Intercept)       31         2        12  9   5e-07    25   36.4
## 2         age       -2         1        -2 12   7e-02    -5    0.2
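The summary above can be approximated by hand from the pooled estimate, the total variance $T$, and the adjusted $ν$. The sketch below is a hypothetical helper (`pooled_t_test` is not part of mice) and plugs in the rounded intercept values from the pooled output, so the results only match the table up to rounding.

```r
# t-test and confidence interval for a pooled scalar estimate
pooled_t_test <- function(estimate, t_total, df, level = 0.95) {
  se        <- sqrt(t_total)                       # pooled standard error
  statistic <- estimate / se                       # test of H0: Q = 0
  p_value   <- 2 * pt(-abs(statistic), df = df)    # two-sided p-value
  half      <- qt(1 - (1 - level) / 2, df = df) * se
  c(std.error = se, statistic = statistic, p.value = p_value,
    lower = estimate - half, upper = estimate + half)
}

# Rounded (Intercept) values: estimate 30.8, T = 6.2, df = 9.2
res <- pooled_t_test(estimate = 30.8, t_total = 6.2, df = 9.2)
print(signif(res, 3))
```

Note that `pt()` and `qt()` accept non-integer degrees of freedom, which is what makes the Barnard-Rubin $ν$ directly usable.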

More about Imputation Methods

Imputation evaluation criteria

The following criteria are useful in simulation studies (when you know the true $Q$ )

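Raw bias (RB): $RB = E (\bar{Q}) - Q$, which should be close to zero. Expressed as percent bias, $PB = 100 \times | \frac{E (\bar{Q}) - Q}{Q} |$, the upper limit is $5\%$
Coverage rate (CR): the proportion of confidence intervals that contain the true value. A CR below $90\%$ for the nominal $95\%$ interval is bad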
Average width (AW) of confidence interval
Root mean squared error (RMSE): combines bias and variance; the smaller the better $RMSE = \sqrt{E [{(\bar{Q} - Q)}^{2}]}$
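In a simulation study these criteria are computed over many replicates. The sketch below is a hypothetical helper (`evaluate_imputation`), fed with synthetic stand-in numbers rather than real pooled estimates. Bias is reported both as the mean estimate minus the truth and as a percentage of the truth (the quantity the 5% limit applies to), and the RMSE is taken as the root of the mean squared deviation across replicates, which is an assumption about how the expectation is estimated.

```r
# Evaluation criteria from simulation replicates.
# q_bar, lower, upper: one pooled estimate and CI per replicate; q_true: known truth.
evaluate_imputation <- function(q_bar, lower, upper, q_true) {
  rb <- mean(q_bar) - q_true
  c(RB   = rb,                                       # raw bias
    PB   = 100 * abs(rb / q_true),                   # percent bias
    CR   = mean(lower <= q_true & q_true <= upper),  # coverage rate
    AW   = mean(upper - lower),                      # average CI width
    RMSE = sqrt(mean((q_bar - q_true)^2)))           # root mean squared error
}

# Synthetic stand-in for 1000 replicates (made-up numbers, not a real study)
set.seed(123)
q_true <- 2
q_bar  <- rnorm(1000, mean = 2.02, sd = 0.1)
se     <- 0.1
crit   <- evaluate_imputation(q_bar, q_bar - 1.96 * se, q_bar + 1.96 * se, q_true)
print(round(crit, 3))
```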

Imputation is not prediction

Should we evaluate an imputation method by examining how closely it recovers the missing values?
- For example, by using the RMSE to check whether the imputed values ${\dot{y}}_{i}$ are close to the true (removed) data $y_{i}^{mis}$: $RMSE = \sqrt{\frac{1}{n_{mis}} \sum_{i = 1}^{n_{mis}} {(y_{i}^{mis} - {\dot{y}}_{i})}^{2}}$
No! This criterion favors least squares predictions, which reproduce the same values over and over and thus amount to single imputation. It ignores the inherent uncertainty of the missing values.
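A toy simulation illustrates the point. The sketch below compares deterministic regression imputation (plugging in the least squares prediction) with stochastic regression imputation (prediction plus residual noise). The data-generating model and all names are assumptions for illustration, and the stochastic variant here ignores parameter uncertainty, so it is not a full multiple imputation procedure.

```r
set.seed(42)
n    <- 2000
x    <- rnorm(n)
y    <- 1 + 0.5 * x + rnorm(n)
miss <- rbinom(n, 1, 0.4) == 1                 # MCAR missingness in y

fit   <- lm(y ~ x, subset = !miss)             # fit on the observed cases
pred  <- predict(fit, newdata = data.frame(x = x[miss]))
sigma <- summary(fit)$sigma                    # residual standard deviation

imp_det <- pred                                   # least squares prediction
imp_sto <- pred + rnorm(sum(miss), sd = sigma)    # prediction plus noise

rmse <- function(truth, imputed) sqrt(mean((truth - imputed)^2))
result <- rbind(
  deterministic = c(rmse = rmse(y[miss], imp_det), var_imputed = var(imp_det)),
  stochastic    = c(rmse = rmse(y[miss], imp_sto), var_imputed = var(imp_sto)),
  truth         = c(rmse = 0,                      var_imputed = var(y[miss])))
print(round(result, 2))
```

The deterministic method should win on RMSE, yet its imputed values have far too little spread. The noisier method loses on RMSE while reproducing the variability of the true values much better.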

When not to use multiple imputation

For predictive modeling, if the missing values are in the target variable $Y$ , then complete-case analysis and multiple imputation are equivalent.
Two special cases where listwise deletion is better than multiple imputation:

If the probability of being missing does not depend on $Y$
If the complete data model is logistic regression, and the missing data are confined to $Y$ , not $X$
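The first special case can be illustrated with a small simulation. The sketch below uses an assumed linear model and assumed missingness coefficients. It drops incomplete rows and compares the complete-case regression slope when the chance of being dropped depends only on $X$ with the slope when it depends on $Y$. It shows only that listwise deletion stays unbiased in the first situation, not that it beats multiple imputation on efficiency.

```r
set.seed(7)
n <- 20000
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)                       # true slope is 0.5

keep_x <- rbinom(n, 1, plogis(0.5 - 2 * x)) == 1  # row retained with prob. depending on x
keep_y <- rbinom(n, 1, plogis(0.5 - 2 * y)) == 1  # row retained with prob. depending on y

slope_full <- coef(lm(y ~ x))[["x"]]
slope_x    <- coef(lm(y ~ x, subset = keep_x))[["x"]]   # listwise deletion, X-driven
slope_y    <- coef(lm(y ~ x, subset = keep_y))[["x"]]   # listwise deletion, Y-driven

print(round(c(full = slope_full, depends_on_x = slope_x, depends_on_y = slope_y), 3))
```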

References

Van Buuren, S. (2018). Flexible Imputation of Missing Data, 2nd Edition. CRC Press.
- https://stefvanbuuren.name/fimd/
Rubin, D. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons.

Book Notes: Flexible Imputation of Missing Data -- Ch2 Multiple Imputation