Notation in this chapter
- \(Y\): an \(n \times p\) data matrix that contains missing values
- \(Y_j\): the \(j\)th column in \(Y\)
- \(Y_{-j}\): all but the \(j\)th column of \(Y\)
- \(R\): an \(n \times p\) missingness indicator matrix with entries \(r_{ij}\)
- \(r_{ij} = 0\) indicates missing and \(r_{ij} = 1\) indicates observed
Missing Data Pattern
Missing data pattern summary statistics
- When the number of columns is small, we can use the `md.pattern` function in `mice` to get missing data counts for all combinations of columns
```r
library(mice)
md.pattern(pattern4, plot = FALSE)
##   A B C  
## 2 1 1 1 0
## 3 1 1 0 1
## 1 1 0 1 1
## 2 0 0 1 2
##   2 3 3 8
```
- When the number of columns is large, we can use the `md.pairs` function in `mice` to check the counts of each of the four pairwise missingness patterns (`rr`, `rm`, `mr`, and `mm`); see the sketch below
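As a quick sketch of the usage (the built-in `nhanes` data from `mice` is our choice of example, not one used in these slides):

```r
library(mice)

# md.pairs() returns four p x p count matrices, one per pairwise pattern:
# rr (both observed), rm (row variable observed, column variable missing),
# mr (row missing, column observed), mm (both missing)
p <- md.pairs(nhanes)
p$rr   # counts of cases where both variables are observed
p$mm   # counts of cases where both variables are missing
```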
Inbound and outbound statistics: measuring pairwise missing patterns
- Proportion of usable cases, i.e., the inbound statistic for imputing variable \(Y_j\) from variable \(Y_k\): the number of cases where \(Y_j\) is missing but \(Y_k\) is observed, divided by the total number of missing entries in \(Y_j\)
\[
I_{jk} = \frac{\sum_{i=1}^n (1 - r_{ij}) r_{ik}}{\sum_{i=1}^n (1 - r_{ij})}
\]
- When imputing \(Y_j\), we can use this statistic to quickly identify which variables may be useful predictors
- The outbound statistic measures how the observed data in \(Y_j\) connect to the missing data in \(Y_k\) (see the sketch after the formula):
\[
O_{jk} = \frac{\sum_{i=1}^n r_{ij} (1 - r_{ik})}{\sum_{i=1}^n r_{ij}}
\]
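Both statistics can be computed directly from the `md.pairs` counts; a minimal sketch, again on the built-in `nhanes` data (our example choice):

```r
library(mice)

p <- md.pairs(nhanes)

# Inbound I_jk: among cases with the row variable Y_j missing, the
# proportion where the column variable Y_k is observed
inbound <- p$mr / (p$mr + p$mm)

# Outbound O_jk: among cases with Y_j observed, the proportion where
# Y_k is missing
outbound <- p$rm / (p$rm + p$rr)

# Rows for fully observed variables (no missing cases) yield NaN
round(inbound, 3)
round(outbound, 3)
```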
Different imputation strategies for different missing patterns
Monotone data imputation
- for a monotone missing data pattern
- imputations are created by a sequence of univariate methods
Joint modeling (JM)
- for general missing patterns
- imputations are created by multivariate models
Fully conditional specification (FCS, aka chained equations)
- for general missing patterns
- imputations are drawn from iterated conditional univariate models
- In practice, FCS is often found to work better than JM
Imputation by blocks of variables: a hybrid between JM and FCS
Monotone data imputation
A monotone missing pattern: the columns of \(Y\) can be ordered such that for any row, if \(Y_j\) is missing, then all columns to the right of \(Y_j\) are also missing
Suppose the variables with missing values are ordered as \(Y_1, Y_2, \ldots, Y_p\), and the fully observed variables are collectively denoted \(X\). Then monotone imputation proceeds as follows (see the sketch after this list):
- Impute \(Y_1\) from \(X\)
- Impute \(Y_2\) from \((Y_1, X)\)
- …
- Impute \(Y_p\) from \((Y_1, \ldots, Y_{p-1}, X)\)
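In `mice`, this sequential scheme can be requested by visiting the variables in order of increasing missingness and running a single pass; a minimal sketch on simulated data (the toy dataset below is our own construction):

```r
library(mice)

# A toy monotone pattern: whenever Y1 is missing, Y2 is missing as well
set.seed(1)
d <- data.frame(X = rnorm(100), Y1 = rnorm(100), Y2 = rnorm(100))
d$Y2[61:100] <- NA   # Y2 missing in rows 61-100
d$Y1[81:100] <- NA   # Y1 missing only in rows where Y2 is also missing

# visitSequence = "monotone" imputes variables in order of increasing
# missingness; for a monotone pattern a single pass (maxit = 1) suffices
imp <- mice(d, visitSequence = "monotone", maxit = 1, seed = 1,
            printFlag = FALSE)
```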
Fully Conditional Specification (FCS)
Fully conditional specification (FCS): similar to Gibbs sampling
FCS specifies the multivariate distribution \(p(Y, X, R\mid \theta)\) through a set of conditional densities \(p(Y_j \mid X, Y_{-j}, R, \phi_j)\)
- The conditional density is used to impute \(Y_j\) given \(X, Y_{-j}, R\) (including the most recent imputed values).
\[\begin{align*}
\dot{\phi}_j & \sim p(\phi_j \mid Y_j^\text{obs}, \dot{Y}_{-j}, R)\\
\dot{Y}_j & \sim p(Y_j^\text{mis}\mid Y_j^\text{obs}, \dot{Y}_{-j}, R, \dot{\phi}_j)
\end{align*}\]
- We can use the univariate imputation methods introduced in Chapter 3 as building blocks
- To initialize, we can impute from the marginal distributions
- One iteration consists of one cycle through all \(Y_j\). The total number of iterations \(M\) can often be low, e.g., 5, 10, or 20.
For multiple imputation, repeat this process \(m\) times in parallel; see the sketch below
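In `mice`, the number of parallel imputations and the number of FCS iterations correspond to the `m` and `maxit` arguments; a minimal sketch (the built-in `nhanes` data is our example choice):

```r
library(mice)

# m = 5 parallel chains (one per imputed dataset), each cycling through
# all incomplete variables maxit = 10 times
imp <- mice(nhanes, m = 5, maxit = 10, seed = 123, printFlag = FALSE)
complete(imp, 1)   # extract the first completed dataset
```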
Convergence of FCS (and in general of a Markov chain)
Irreducible: the chain must be able to reach all interesting parts of the state space
- Usually easy to satisfy, since users have considerable control over the interesting parts
Aperiodic: the chain should not oscillate between different states
- One way to diagnose this is to stop the chain at different points and check that the stopping point does not affect statistical inferences
Recurrence: all interesting parts can be reached infinitely often, at least from almost all starting points
- May be diagnosed from traceplots (see the sketch below)
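One way to carry out these checks in `mice` (a sketch; `nhanes` is our example choice):

```r
library(mice)

# Run a short chain and inspect traceplots of the imputed-value means
# and SDs across iterations
imp <- mice(nhanes, maxit = 5, seed = 1, printFlag = FALSE)
plot(imp)

# Continue the same chains for more iterations and re-inspect; if the
# stopping point matters for inferences, the chain has not converged
imp_more <- mice.mids(imp, maxit = 15)
plot(imp_more)
```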
Compatibility
- Two conditional densities \(p(Y_1 \mid Y_2)\), \(p(Y_2 \mid Y_1)\) are compatible if
- a joint distribution \(p(Y_1, Y_2)\) exists, and
- it has \(p(Y_1 \mid Y_2)\) and \(p(Y_2 \mid Y_1)\) as its conditional densities
FCS is only guaranteed to work if the conditionals are compatible
The MICE algorithm (the FCS implementation in the `mice` package) is ignorant of the possible non-existence of a joint distribution, and imputes anyway.
- Empirical evidence suggests that estimation results may be robust against violations of compatibility
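To make compatibility concrete, here is a small worked example of an incompatible pair (our own illustration, not from these slides), using the classical factorization criterion: two conditionals are compatible only if the ratio \(p(y_1 \mid y_2)/p(y_2 \mid y_1)\) factorizes as \(u(y_1)\,v(y_2)\). Take \(Y_1 \mid Y_2 \sim N(y_2, 1)\) and \(Y_2 \mid Y_1 \sim N(y_1^3, 1)\). Then
\[
\log \frac{p(y_1 \mid y_2)}{p(y_2 \mid y_1)} = -\tfrac{1}{2}(y_1 - y_2)^2 + \tfrac{1}{2}(y_2 - y_1^3)^2 = \tfrac{1}{2}(y_1^6 - y_1^2) + y_1 y_2 (1 - y_1^2),
\]
and the cross term \(y_1 y_2 (1 - y_1^2)\) cannot be written as a function of \(y_1\) plus a function of \(y_2\), so no joint distribution has both as its conditionals.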
Number of FCS iterations
Why can the number of iterations in FCS be low (usually 5-20)?
- The imputed data \(\dot{Y}_\text{mis}\) can have a considerable amount of random noise
- Hence, if the relations between the variables are not strong, the autocorrelation across iterations may be low, and convergence can be rapid
Watch out for the following situations:
- the correlations between \(Y_j\)’s are high
- missing rates are high
- constraints on parameters across different variables exist
Example of slow convergence: design of simulation
- One completely observed covariate \(X\) and two incomplete variables \(Y_1, Y_2\)
- Data are drawn from a multivariate normal distribution with correlations \[ \rho(X, Y_1) = \rho(X, Y_2) = 0.9, \quad \rho(Y_1, Y_2) = 0.7 \]
- Total sample size \(n=10000\); the number of completely observed cases varies over \(\{1000, 500, 250, 100, 50, 0\}\)
- The imputation method is predictive mean matching (PMM), which is built on the normal linear regression model; see the sketch below
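A minimal sketch of one cell of this design; the exact missingness mechanism (which rows lose \(Y_1\) versus \(Y_2\)) is our assumption and not stated in the slides:

```r
library(MASS)   # mvrnorm() for multivariate normal draws
library(mice)

set.seed(1)
n <- 10000
n_obs <- 100    # number of completely observed cases

# Correlation structure: rho(X, Y1) = rho(X, Y2) = 0.9, rho(Y1, Y2) = 0.7
Sigma <- matrix(c(1.0, 0.9, 0.9,
                  0.9, 1.0, 0.7,
                  0.9, 0.7, 1.0), nrow = 3)
dat <- as.data.frame(mvrnorm(n, mu = rep(0, 3), Sigma = Sigma))
names(dat) <- c("X", "Y1", "Y2")

# Beyond the first n_obs rows, alternate which of Y1, Y2 is missing,
# giving high missing rates in both incomplete variables
dat$Y1[seq(n_obs + 1, n, by = 2)] <- NA
dat$Y2[seq(n_obs + 2, n, by = 2)] <- NA

# Impute with PMM and inspect convergence via traceplots
imp <- mice(dat, method = "pmm", maxit = 50, seed = 1, printFlag = FALSE)
plot(imp)   # trace lines of imputed-value means and SDs per iteration
```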
Example of slow convergence: traceplots
- A missing data problem with high correlations and high missing rates: convergence is poor