Book Notes: Flexible Imputation of Missing Data -- Ch3 Univariate Missing Data

Notations

  • In this chapter, we assume that only one variable has missing values. We call this variable $y$ the target variable.

    • $y_\text{obs}$: the $n_1$ observed values in $y$
    • $y_\text{mis}$: the $n_0$ missing values in $y$
    • $\dot{y}$: the imputed values for $y$
  • Let $X$ denote the covariates in the imputation model.

    • $X_\text{obs}$: the subset of $n_1$ rows of $X$ for which $y$ is observed
    • $X_\text{mis}$: the subset of $n_0$ rows of $X$ for which $y$ is missing

Imputation under the Normal Linear Model

Four methods to impute under the normal linear model

  1. Regression imputation: predict (bad!). Fit a linear model on the observed data to get the OLS estimates $\hat{\beta}_0, \hat{\beta}_1$, then impute with the predicted values $\dot{y} = \hat{\beta}_0 + X_\text{mis}\hat{\beta}_1$
    • In the mice package, this method is norm.predict
  2. Stochastic regression imputation: predict + noise (better, but still bad). Also add randomly drawn noise from the estimated residual distribution: $\dot{y} = \hat{\beta}_0 + X_\text{mis}\hat{\beta}_1 + \dot{\epsilon}$, with $\dot{\epsilon} \sim N(0, \hat{\sigma}^2)$
    • In the mice package, this method is norm.nob
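
A minimal sketch of calling these two methods through mice; the built-in nhanes data set stands in for a real analysis (there several columns are incomplete, not just one):

```r
library(mice)

# Deterministic regression imputation (for illustration only; biased).
imp_pred <- mice(nhanes, method = "norm.predict", m = 1, seed = 1, printFlag = FALSE)

# Stochastic regression imputation: prediction + residual noise.
imp_nob <- mice(nhanes, method = "norm.nob", m = 5, seed = 1, printFlag = FALSE)
```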

Method 3: Bayesian multiple imputation

  • Predict + noise + parameter uncertainty: $\dot{y} = \dot{\beta}_0 + X_\text{mis}\dot{\beta}_1 + \dot{\epsilon}$, with $\dot{\epsilon} \sim N(0, \dot{\sigma}^2)$

  • Under the priors $\beta \sim N(0, I_p/\kappa)$ and $p(\sigma^2) \propto 1/\sigma^2$ (where the hyper-parameter $\kappa$ is fixed at a small value, e.g., $\kappa = 0.0001$), we draw $\dot{\beta}$ (including both $\dot{\beta}_0$ and $\dot{\beta}_1$) and $\dot{\sigma}^2$ from the posterior distribution

  • In the mice package, this method is norm
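
A sketch of the corresponding posterior draw (my own function and variable names; mice's norm routine follows a recipe of this form, but this is not the package source):

```r
# One Bayesian imputation draw under the normal linear model.
impute_norm_bayes <- function(y_obs, X_obs, X_mis, kappa = 1e-4) {
  X_obs <- cbind(1, as.matrix(X_obs))   # add intercept column
  X_mis <- cbind(1, as.matrix(X_mis))
  q  <- ncol(X_obs)
  n1 <- nrow(X_obs)

  V        <- solve(crossprod(X_obs) + kappa * diag(q))  # (X'X + kappa*I)^{-1}
  beta_hat <- V %*% crossprod(X_obs, y_obs)              # ridge-stabilized OLS
  res      <- y_obs - X_obs %*% beta_hat

  sigma2_dot <- sum(res^2) / rchisq(1, df = n1 - q)      # draw sigma^2 | data
  beta_dot   <- beta_hat +                               # draw beta | sigma^2, data
    sqrt(sigma2_dot) * t(chol(V)) %*% rnorm(q)

  # predict + noise + parameter uncertainty
  as.vector(X_mis %*% beta_dot + rnorm(nrow(X_mis), sd = sqrt(sigma2_dot)))
}
```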

Method 4: Bootstrap multiple imputation

  • Predict + noise + parameter uncertainty: $\dot{y} = \dot{\beta}_0 + X_\text{mis}\dot{\beta}_1 + \dot{\epsilon}$, with $\dot{\epsilon} \sim N(0, \dot{\sigma}^2)$, where $\dot{\beta}_0$, $\dot{\beta}_1$, and $\dot{\sigma}^2$ are OLS estimates calculated from a bootstrap sample of the observed data

  • In the mice package, this method is norm.boot
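
A sketch of one bootstrap imputation (my own helper, not the package source):

```r
# One bootstrap imputation draw under the normal linear model.
impute_norm_boot <- function(y_obs, X_obs, X_mis) {
  idx  <- sample(nrow(as.data.frame(X_obs)), replace = TRUE)  # bootstrap the observed rows
  df   <- as.data.frame(X_obs)[idx, , drop = FALSE]
  fit  <- lm(y_obs[idx] ~ ., data = df)                       # OLS on the bootstrap sample
  pred <- predict(fit, newdata = as.data.frame(X_mis))
  pred + rnorm(length(pred), sd = summary(fit)$sigma)         # add residual noise
}
```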

A simulation study: imputing MCAR missing values in y

  • Missing rate of 50% in $y$, with the number of imputations $m = 5$.
    • By coverage, norm, norm.boot, and listwise deletion perform well
    • By CI width, listwise deletion is better than multiple imputation here, but that is not always the case, especially when the number of covariates is large
    • RMSE is not informative at all!
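
A minimal sketch of one replication of this kind of simulation (my own setup, not the book's exact design):

```r
library(mice)
set.seed(1)

n <- 500
x <- rnorm(n)
y <- 5 + 2 * x + rnorm(n)            # true slope = 2
dat <- data.frame(x, y)
dat$y[sample(n, n / 2)] <- NA        # 50% MCAR missingness in y

imp <- mice(dat, method = "norm", m = 5, printFlag = FALSE)
fit <- with(imp, lm(y ~ x))
summary(pool(fit), conf.int = TRUE)  # pooled slope, SE, and 95% CI
```

Repeating this many times and recording how often the 95% interval covers the true slope gives the coverage comparison above.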

A simulation study: imputing MCAR missing values in x

  • Missing rate of 50% in $x$, with the number of imputations $m = 5$.
    • norm.predict is severely biased; norm is slightly biased
    • By coverage, norm, norm.boot, and listwise deletion perform well
    • Again, RMSE is not informative at all!

Imputing from a (continuous) non-normal distribution

  • Option 1: predictive mean matching

  • Option 2: model the non-normal data directly
    • E.g., impute from a t-distribution
    • The GAMLSS package extends GLMs and GAMs

Predictive Mean Matching

Predictive mean matching (PMM), general principle

  • For each missing entry, the method forms a small set of candidate donors (typically 3, 5, or 10) from complete cases whose predicted values are closest to the predicted value for the missing entry

  • One donor is randomly drawn from the candidates, and the observed value of the donor is taken to replace the missing value
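
A bare-bones sketch of this matching step with a fixed number of donors $d$ (my own helper, not mice internals; this corresponds to Type 0 matching in the terminology below):

```r
# Minimal PMM sketch: match on predicted means, impute an observed value.
pmm_impute <- function(y_obs, X_obs, X_mis, d = 5) {
  fit      <- lm(y_obs ~ ., data = as.data.frame(X_obs))
  yhat_obs <- predict(fit)                                 # predictions for donors
  yhat_mis <- predict(fit, newdata = as.data.frame(X_mis)) # predictions for missing rows

  sapply(yhat_mis, function(yh) {
    donors <- order(abs(yhat_obs - yh))[seq_len(d)]  # d closest candidate donors
    y_obs[sample(donors, 1)]                         # draw one observed value
  })
}
```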

Advantages of predictive mean matching (PMM)

  • PMM is fairly robust to transformations of the target variable

  • PMM can also be used for discrete target variables

  • PMM is fairly robust to model misspecification
    • In the following example, the relationship between age and BMI is not linear, but PMM seems to preserve this relationship better than the normal linear model does

How to select the donors

  • Once the metric has been defined, there are four ways to select the donors.
    • Let $\hat{y}_i$ denote the predicted value for a row with observed $y_i$
    • Let $\hat{y}_j$ denote the predicted value for a row with missing $y_j$
  1. Pre-specify a threshold $\eta$, take all $i$ such that $|\hat{y}_i - \hat{y}_j| < \eta$ as donors, and randomly sample one donor to impute
  2. Choose the closest candidate as the donor (only 1 donor), also called nearest-neighbor hot deck
  3. Pre-specify a number $d$, take the $d$ closest candidates as donors, and randomly sample one donor to impute. Usually $d = 3$, $5$, or $10$
  4. Sample one donor with a probability that depends on the distance $|\hat{y}_i - \hat{y}_j|$
    • Implemented by the midastouch method in mice, and also in the midastouch package
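
The distance-weighted option is available directly in mice (a minimal call, using the built-in nhanes data as a stand-in):

```r
library(mice)
imp <- mice(nhanes, method = "midastouch", m = 5, seed = 1, printFlag = FALSE)
```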

Types of matching

  • Type 0: $\hat{y}_i = X_\text{obs}\hat{\beta}$ is matched to $\hat{y}_j = X_\text{mis}\hat{\beta}$
    • Bad: it ignores the sampling variability in $\hat{\beta}$
  • Type 1: $\hat{y}_i = X_\text{obs}\hat{\beta}$ is matched to $\dot{y}_j = X_\text{mis}\dot{\beta}$
    • Here, $\dot{\beta}$ is a random draw from the posterior distribution
    • Good. The default in mice
  • Type 2: $\dot{y}_i = X_\text{obs}\dot{\beta}$ is matched to $\dot{y}_j = X_\text{mis}\dot{\beta}$
    • Not ideal: when the sample is small, the same donors get selected too often
  • Type 3: $\dot{y}_i = X_\text{obs}\dot{\beta}$ is matched to $\ddot{y}_j = X_\text{mis}\ddot{\beta}$
    • Here, $\dot{\beta}$ and $\ddot{\beta}$ are two different random draws from the posterior distribution
    • Good

Illustration of Type 1 matching

Number of donors d

  • $d = 1$ is too low (bad!). It may select the same donor over and over again

  • The default in mice is $d = 5$; $d = 3$ or $d = 10$ are also feasible
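
In mice, the donor count is exposed as the donors argument of the pmm method (and matchtype selects among the matching types above); a minimal call on the built-in nhanes data:

```r
library(mice)
# donors is passed through to mice.impute.pmm; matchtype = 1L (Type 1) is the default.
imp <- mice(nhanes, method = "pmm", donors = 5, m = 5, seed = 1, printFlag = FALSE)
```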

Pitfalls of PMM

  • If the dataset is small, or if there is a region where the missing rate is high, the same donors may be used too many times.

  • Misspecification of the imputation model

  • PMM cannot be used to extrapolate beyond the range of the observed data, or to interpolate within regions where the data are sparse

  • PMM may not perform well with small datasets

Imputation under CART

Multiple imputation under a tree model

  • missForest performs single imputation with random forests; single imputation is bad because it ignores imputation uncertainty

  • Multiple imputation under a tree model using the bootstrap:
  1. Draw a bootstrap sample from the observed data, and fit a CART model $f(X)$
  2. For each missing value $y_j$, find its terminal node $g_j$. All the $d_j$ cases in this node are the donors
  3. Randomly select one donor to impute

    • When fitting the tree, it may be useful to pre-set the minimum node size to 5 or 10
    • We can also use a random forest instead of CART
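
In mice, this is available as the cart method (and rf for the random-forest variant); minbucket sets the minimum node size mentioned above. A sketch on the built-in nhanes data:

```r
library(mice)
imp_cart <- mice(nhanes, method = "cart", minbucket = 5, m = 5, seed = 1, printFlag = FALSE)
imp_rf   <- mice(nhanes, method = "rf", m = 5, seed = 1, printFlag = FALSE)
```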

Imputing Categorical and Other Types of Data

Imputation under Bayesian GLMs

  • Binary data: logistic regression (the logreg method in mice)
    • In case of data separation, use a more informative Bayesian prior
  • Categorical variable with $K$ unordered categories: multinomial logit model (the polyreg method in mice): $$P(y_i = k \mid X_i, \beta) = \frac{\exp(X_i\beta_k)}{\sum_{j=1}^{K}\exp(X_i\beta_j)}$$

  • Categorical variable with $K$ ordered categories: ordered logit model (the polr method in mice): $$P(y_i \le k \mid X_i, \beta, \tau_k) = \frac{\exp(\tau_k - X_i\beta)}{1 + \exp(\tau_k - X_i\beta)}$$
    • For identifiability, set $\tau_1 = 0$
  • When imputing from these GLMs, do not plug in the MLE of the parameters; use either a draw from the posterior or a bootstrap estimate.
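
A sketch of assigning these methods per variable; dat is a hypothetical data frame whose columns, in order, are a complete covariate x, a binary factor y_bin, an unordered factor y_cat, and an ordered factor y_ord:

```r
library(mice)
# One method per column of dat, in column order ("" = nothing to impute).
meth <- c("", "logreg", "polyreg", "polr")
imp  <- mice(dat, method = meth, m = 5, seed = 1, printFlag = FALSE)
```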

Categorical variables are harder to impute than continuous ones

  • Empirically, the GLM imputations do not perform well
    • when the missing rate exceeds 0.4
    • when the data are imbalanced
    • when there are many categories
  • GLM imputation has been found inferior to CART and latent class models

Imputation of count data

  • Option 1: predictive mean matching
  • Option 2: ordered categorical imputation
  • Option 3: (zero-inflated) Poisson regression
  • Option 4: (zero-inflated) negative binomial regression
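
mice has no built-in Poisson method, but options 3 and 4 can be sketched with the bootstrap recipe from earlier (my own helper; the zero-inflated variants would need a zero-inflated model in place of glm):

```r
# Bootstrap Poisson imputation sketch; helper and variable names are mine.
impute_poisson_boot <- function(y_obs, X_obs, X_mis) {
  idx    <- sample(nrow(as.data.frame(X_obs)), replace = TRUE)  # bootstrap observed rows
  df     <- as.data.frame(X_obs)[idx, , drop = FALSE]
  fit    <- glm(y_obs[idx] ~ ., family = poisson, data = df)    # Poisson GLM
  lambda <- predict(fit, newdata = as.data.frame(X_mis), type = "response")
  rpois(length(lambda), lambda)                                 # draw imputed counts
}
```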

Imputation of semi-continuous data

  • Semi-continuous data: has a high mass at one point (often zero) and a continuous distribution over the remaining values

  • Option 1: model the data in two parts: a logistic regression for the point mass vs. the rest, plus a regression for the continuous part
  • Option 2: predictive mean matching
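
A sketch of the two-part approach (my own helper; the lognormal assumption for the positive part is an illustrative choice):

```r
# Two-part imputation sketch for semi-continuous y with a point mass at 0.
impute_two_part <- function(y_obs, X_obs, X_mis) {
  df_obs <- as.data.frame(X_obs)
  df_mis <- as.data.frame(X_mis)

  # Part 1: logistic regression for the probability of a positive value.
  pos    <- as.numeric(y_obs > 0)
  fit1   <- glm(pos ~ ., family = binomial, data = df_obs)
  p1     <- predict(fit1, newdata = df_mis, type = "response")
  is_pos <- rbinom(nrow(df_mis), 1, p1) == 1

  # Part 2: normal linear model on log(y) for the positive cases.
  fit2 <- lm(log(y_obs[y_obs > 0]) ~ ., data = df_obs[y_obs > 0, , drop = FALSE])
  mu   <- predict(fit2, newdata = df_mis)
  draw <- exp(mu + rnorm(nrow(df_mis), sd = summary(fit2)$sigma))

  ifelse(is_pos, draw, 0)
}
```

For proper multiple imputation, the observed rows should additionally be bootstrapped (or the parameters drawn from their posterior), as in the normal-model methods above.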
