Complete-case (CC) analysis
Complete-case (CC) analysis: use only data points (units) where all variables are observed
Loss of information in CC analysis:
- Loss of precision (larger variance)
- Bias, when the missingness mechanism is not MCAR. In this case, the complete units are not a random sample of the population
In these notes, I will focus on the bias issue:
- Adjusting for the CC analysis bias using weights
- This idea is closely related to weighting in randomization inference for finite-population surveys
Weighted Complete-Case Analysis
Notation
- Population size \(N\), sample size \(n\)
- Number of variables (items): \(K\)
- Data: \(Y=(y_{ij})\), where \(i = 1, \ldots, N\) and \(j = 1, \ldots, K\)
- Design information (about sampling or missingness): \(Z\)
Sample indicator: \(I = (I_1, \ldots, I_N)'\); for unit \(i\), \[ I_i = \mathbf{1}_{\{\text{unit } i \text{ included in the sample}\}} \]
Sample selection processes can be characterized by a distribution for \(I\) given \(Y\) and \(Z\).
Probability sampling
Properties of probability sampling
Unconfounded: selection doesn’t depend on \(Y\), i.e., \[ f(I \mid Y, Z) = f(I \mid Z) \]
Every unit has a positive (known) probability of selection \[ \pi_i = P(I_i = 1 \mid Z) > 0, \quad \text{for all } i \]
In an equal probability sampling design, \(\pi_i\) is the same for all \(i\)
Stratified random sampling
\(Z\) is a variable defining strata. Suppose Stratum \(Z=j\) has \(N_j\) units in total, for \(j= 1, \ldots, J\)
In Stratum \(j\), stratified random sampling takes a simple random sample of \(n_j\) units
The distribution of \(I\) under stratified random sampling is \[ f(I \mid Z) = \prod_{j=1}^J {N_j \choose n_j}^{-1} \]
Example: estimating population mean \(\bar{Y}\)
An unbiased estimate is the stratified sample mean \[ \bar{y}_{\text{st}} = \frac{\sum_{j=1}^J N_j \bar{y}_j}{N} \] where \(\bar{y}_j\) is the sample mean in stratum \(j\)
Sampling variance approximation \[ v(\bar{y}_{\text{st}}) \approx \frac{1}{N^2} \sum_{j=1}^J N_j^2 \left(\frac{1}{n_j} - \frac{1}{N_j} \right)s_j^2 \] where \(s_j^2\) is the sample variance of \(Y\) in stratum \(j\)
A large sample 95% confidence interval for \(\bar{Y}\) is \[ \bar{y}_{\text{st}} \pm 1.96 \sqrt{v(\bar{y}_{\text{st}})} \]
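Below is a minimal numerical sketch of this estimator; the stratum sizes, sample sizes, and outcome distributions are all illustrative assumptions:

```python
import numpy as np

# Sketch: stratified sample mean and its 95% CI on simulated data.
rng = np.random.default_rng(0)
N_j = np.array([5000, 3000, 2000])               # stratum population sizes
samples = [rng.normal(10, 2, 50),                # stratum 1: n_1 = 50
           rng.normal(12, 3, 30),                # stratum 2: n_2 = 30
           rng.normal(15, 1, 20)]                # stratum 3: n_3 = 20

N = N_j.sum()
n_j = np.array([len(y) for y in samples])
ybar_j = np.array([y.mean() for y in samples])   # stratum sample means
s2_j = np.array([y.var(ddof=1) for y in samples])  # stratum sample variances

ybar_st = (N_j * ybar_j).sum() / N               # stratified sample mean
v = ((N_j**2) * (1 / n_j - 1 / N_j) * s2_j).sum() / N**2
ci = (ybar_st - 1.96 * np.sqrt(v), ybar_st + 1.96 * np.sqrt(v))
print(ybar_st, ci)
```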
Weighting methods
Main idea: A unit selected with probability \(\pi_i\) is “representing” \(\pi_i^{-1}\) units in the population, hence should be given weight \(\pi_i^{-1}\).
For example, in a stratified random sample:
- A selected unit \(i\) in stratum \(j\) represents \(N_j/n_j\) population units
- Thus, by the Horvitz-Thompson estimator, the population mean can be estimated by the weighted mean \[ \bar{y}_w = \frac{1}{n}\sum_{i=1}^n w_i y_i, \quad \pi_i = \frac{n_j}{N_j}, \quad w_i = n \cdot \frac{\pi_i^{-1}}{\sum_k \pi_k^{-1}} \]
- It is not hard to show that \[ \bar{y}_w = \bar{y}_{\text{st}}, \] since \(\sum_k \pi_k^{-1} = \sum_{j=1}^J n_j (N_j / n_j) = N\), so \(\bar{y}_w = \sum_{j=1}^J N_j \bar{y}_j / N\); see the numerical check below
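A quick numerical check of this identity, with assumed stratum and sample sizes:

```python
import numpy as np

# Check that the Horvitz-Thompson weighted mean equals the stratified mean.
rng = np.random.default_rng(1)
N_j = np.array([100, 200])                       # stratum population sizes
n_j = np.array([10, 40])                         # stratum sample sizes
y = [rng.normal(5, 1, n_j[0]), rng.normal(8, 2, n_j[1])]

pi = np.concatenate([np.full(n_j[j], n_j[j] / N_j[j]) for j in range(2)])
yy = np.concatenate(y)
n = n_j.sum()
w = n * (1 / pi) / (1 / pi).sum()                # normalized weights
ybar_w = (w * yy).sum() / n

ybar_st = (N_j * np.array([v.mean() for v in y])).sum() / N_j.sum()
print(np.isclose(ybar_w, ybar_st))               # True
```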
Weighting with nonresponse
If the probability of selecting unit \(i\) is \(\pi_i\), and the probability of response for unit \(i\) is \(\phi_i\), then \[ P(\text{unit } i \text{ is observed}) = \pi_i \phi_i \]
Suppose there are \(r\) units observed (respondents). Then the weighted estimate for \(\bar{Y}\) is \[ \bar{y}_w = \frac{1}{r} \sum_{i=1}^r w_i y_i, \quad w_i = r \cdot \frac{(\pi_i \phi_i)^{-1}}{\sum_k (\pi_k \phi_k)^{-1}} \]
Usually \(\phi_i\) is unknown and thus needs to be estimated
Weighting class estimator
Weighting class adjustments are used primarily to handle unit nonresponse
Suppose we partition the sample into \(J\) “weighting classes”. In the weighting class \(C = j\):
- \(n_j\): the sample size
- \(r_j\): the number of respondents
- A simple estimator for \(\phi_j\) is \(\hat{\phi}_j = \frac{r_j}{n_j}\)
For equal probability designs, where \(\pi_i\) is constant, the weighting class estimator is \[ \bar{y}_{\text{wc}} = \frac{1}{n}\sum_{j=1}^J n_j \bar{y}_{j\text{R}} \] where \(\bar{y}_{j\text{R}}\) is the respondent mean in class \(j\)
The estimator is unbiased under the following form of the MAR assumption (quasirandomization): data are MCAR within each weighting class \(j\)
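A minimal sketch of the estimator under an equal probability design, where each respondent in class \(j\) effectively receives nonresponse weight \(1/\hat{\phi}_j = n_j/r_j\); the data below are made up:

```python
import numpy as np

# Weighting class estimator: np.nan marks nonrespondents,
# `cls` assigns each sampled unit to a weighting class.
y = np.array([3.1, np.nan, 2.8, 4.0, np.nan, 5.2, 4.7, 3.9])
cls = np.array([0, 0, 0, 1, 1, 1, 1, 1])

n = len(y)
ybar_wc = 0.0
for j in np.unique(cls):
    in_j = (cls == j)
    resp = in_j & ~np.isnan(y)            # respondents in class j
    phi_hat = resp.sum() / in_j.sum()     # estimated response rate r_j / n_j
    # each respondent's y is weighted by 1 / phi_hat = n_j / r_j,
    # so this term equals n_j * ybar_jR / n
    ybar_wc += (y[resp].sum() / phi_hat) / n
print(ybar_wc)
```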
More about weighting class adjustments
Pros: handles bias with one set of weights for multivariate \(Y\)
Cons: weighting is inefficient and can increase sampling variance if \(Y\) is only weakly related to the weighting class variable \(C\)
How to choose weighting class adjustments: weighting is only effective for outcomes (\(Y\)) that are associated with the adjustment cell variable (\(C\))
Propensity weighting
The theory of propensity scores provides a prescription for choosing the coarsest reduction of \(X\) to a weighting class variable \(C\) so that quasirandomization is roughly satisfied
Let \(X\) denote the variables observed for both respondents and nonrespondents
Suppose data are MAR, with \(\phi\) denoting the unknown parameters of the missingness mechanism: \[ P(M \mid X, Y, \phi) = P(M \mid X, \phi) \] Then quasirandomization is satisfied when \(C\) is chosen to be \(X\)
Response propensity stratification
Define the response propensity for unit \(i\) as \[ \rho(x_i, \phi) = P\left(m_i = 0 \mid x_i, \phi\right) \] Under MAR, \(P\left(m_i = 0 \mid \rho(x_i, \phi), \phi\right) = \rho(x_i, \phi)\), i.e., respondents are a random subsample within strata defined by the propensity score \(\rho(X, \phi)\)
Usually \(\phi\) is unknown. So a practical procedure is
- Estimate \(\hat{\phi}\) from a binary regression of \(M\) on \(X\), based on respondent and nonrespondent data
- Let \(C\) be a grouped variable formed by coarsening \(\rho\left(X, \hat{\phi}\right)\) into 5 or 10 values
Thus, within the same adjustment class, all respondents and nonrespondents have the same value of the grouped propensity score
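A sketch of the two-step procedure on simulated data; the MAR mechanism, the sample size, and the use of scikit-learn's `LogisticRegression` are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Response propensity stratification on simulated MAR data.
rng = np.random.default_rng(2)
n = 1000
X = rng.normal(size=(n, 2))                      # fully observed covariates
p_miss = 1 / (1 + np.exp(-(X[:, 0] - 0.5)))      # assumed MAR mechanism
m = rng.binomial(1, p_miss)                      # m = 1: nonrespondent

# Step 1: estimate phi via binary (logistic) regression of M on X
fit = LogisticRegression().fit(X, m)
rho_hat = 1 - fit.predict_proba(X)[:, 1]         # estimated response propensity

# Step 2: coarsen rho_hat into 5 adjustment classes C (propensity quintiles)
C = pd.qcut(rho_hat, q=5, labels=False)
```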
An alternative procedure: propensity weighting
An alternative procedure is to weight respondents \(i\) directly by the inverse propensity score \(\rho\left(X, \hat{\phi}\right)^{-1}\)
This method removes nonresponse bias
But it may yield estimates with extremely high sampling variance because respondents with very low estimated response propensities receive large nonresponse weights
Also, weighting directly by inverse propensities places heavy reliance on correct specification of the regression model of \(M\) on \(X\)
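A small sketch illustrating the variance problem: with hypothetical estimated propensities `rho_hat`, a few tiny values dominate the weights:

```python
import numpy as np

# Direct inverse propensity weighting of respondents (toy inputs).
rng = np.random.default_rng(3)
r = 500
rho_hat = rng.uniform(0.05, 0.9, r)    # some estimated propensities are tiny
y = rng.normal(10, 2, r)               # respondents' outcomes

w = 1 / rho_hat                        # nonresponse weights
ybar_ipw = (w * y).sum() / w.sum()     # normalized (Hajek-style) IPW mean

# Diagnostic: tiny rho_hat -> extreme weights -> high sampling variance
print(w.max() / w.mean())
```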
Example: inverse probability weighted generalized estimating equations (GEE)
Let \(x_i\) be the GEE covariates, and \(z_i\) be a fully observed vector that can predict the missingness mechanism
If \(P(m_i = 1 \mid x_i, y_i, z_i, \phi) = P(m_i = 1 \mid x_i, \phi)\), then the unweighted complete-case GEE is unbiased \[ \sum_{i=1}^r D_i(x_i, \beta)\left[y_i - g(x_i, \beta)\right] = 0 \]
If \(P(m_i = 1 \mid x_i, y_i, z_i, \phi) = P(m_i = 1 \mid x_i, z_i, \phi)\), then the inverse probability weighted GEE is unbiased \[ \sum_{i=1}^r w_i(\hat{\alpha}) D_i(x_i, \beta)\left[y_i - g(x_i, \beta)\right] = 0, \quad w_i(\hat{\alpha}) = \frac{1}{p(x_i, z_i \mid \hat{\alpha})} \] where \(p(x_i, z_i \mid \hat{\alpha})\) is the probability of being a complete unit, based on a logistic regression of \(m_i\) on \(x_i, z_i\)
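As a sketch, consider the special case \(g(x_i, \beta) = x_i'\beta\), where \(D_i = x_i\) and the weighted estimating equation reduces to weighted least squares; the simulated `p_hat` below stands in for fitted complete-case probabilities:

```python
import numpy as np

# Weighted estimating equation for a linear mean model:
# sum_i w_i x_i (y_i - x_i' beta) = 0  =>  beta = (X'WX)^{-1} X'Wy.
rng = np.random.default_rng(4)
r = 400                                          # number of complete cases
X = np.column_stack([np.ones(r), rng.normal(size=r)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=r)
p_hat = rng.uniform(0.3, 0.9, r)                 # fitted P(complete | x, z)

w = 1 / p_hat                                    # inverse probability weights
Xw = X * w[:, None]                              # rows of X scaled by w_i
beta_hat = np.linalg.solve(Xw.T @ X, Xw.T @ y)
print(beta_hat)
```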
Poststratification
The weighting class estimator \[ \bar{y}_{\text{wc}} = \frac{1}{n}\sum_{j=1}^J n_j \bar{y}_{j\text{R}} \] uses the sample proportion \(n_j/n\) to estimate the population proportion \(N_j/N\).
If from an external source (e.g., a census or a large survey) we know the population proportions of the weighting classes, then we can use the poststratified mean to estimate \(\bar{Y}\): \[ \bar{y}_{\text{ps}} = \frac{1}{N}\sum_{j=1}^J N_j \bar{y}_{j\text{R}} \]
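A one-line sketch with assumed external counts \(N_j\) and respondent class means:

```python
import numpy as np

# Poststratified mean: replace sample shares n_j / n with
# known population shares N_j / N from an external source.
N_j = np.array([6000, 4000])        # class sizes known from, e.g., a census
ybar_jR = np.array([3.4, 5.1])      # respondent means by class (toy values)

ybar_ps = (N_j * ybar_jR).sum() / N_j.sum()
print(ybar_ps)
```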
Summary of weighting methods
Weighted CC estimates are often simple to compute, but the appropriate standard errors can be hard to compute (even asymptotically)
Weighting methods treat weights as fixed and known, but these nonresponse weights are computed from observed data and hence are subject to sampling uncertainty
Because weighted CC methods discard incomplete units and do not provide an automatic control of sampling variance, they are most useful when
- Number of covariates is small, and
- Sample size is large
Available-Case Analysis
Available-case (AC) analysis
Available-case analysis: for univariate analysis, include all units for which that variable is observed
- Sample changes from variable to variable according to the pattern of missing data
- This is problematic if not MCAR
- Under MCAR, AC can be used to estimate mean and variance for a single variable
Pairwise AC: estimates covariance of \(Y_j\) and \(Y_k\) based on units \(i\) where both \(y_{ij}\) and \(y_{ik}\) are observed
- Pairwise covariance estimator: \[ s_{jk}^{(jk)} = \sum_{i \in I_{jk}} \left( y_{ij} - \bar{y}_j^{(jk)} \right) \left( y_{ik} - \bar{y}_k^{(jk)} \right)/ \left( n^{(jk)} - 1 \right) \] where \(I_{jk}\) is the set of \(n^{(jk)}\) units with both \(Y_j\) and \(Y_k\) observed
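As a sketch, pandas' `DataFrame.cov()` implements exactly this pairwise available-case computation (each entry uses the units where both columns are observed); the toy data are made up:

```python
import numpy as np
import pandas as pd

# Pairwise AC covariances: each (j, k) entry is computed from the
# units where both Y_j and Y_k are observed (pairwise deletion).
df = pd.DataFrame({
    "Y1": [1.0, 2.0, np.nan, 4.0, 5.0],
    "Y2": [2.1, np.nan, 3.3, 4.2, 5.0],
    "Y3": [0.5, 1.0, 1.8, np.nan, 2.9],
})
S_pairwise = df.cov()               # s_jk^(jk) for each pair (j, k)
print(S_pairwise)
```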
Problems with pairwise AC estimators on correlation
Correlation estimator 1: \[ r_{jk}^* = \frac{s_{jk}^{(jk)}}{\sqrt{s_{jj}^{(j)} s_{kk}^{(k)}}} \]
- Problem: it can lie outside of \([-1, 1]\)
Correlation estimator 2 corrects the previous problem: \[ r_{jk}^{(jk)} = \frac{s_{jk}^{(jk)}}{\sqrt{s_{jj}^{(jk)} s_{kk}^{(jk)}}} \]
Under MCAR, all of these covariance and correlation estimators are consistent
However, when \(K \geq 3\), both correlation estimators can yield correlation matrices that are not positive definite!
- An extreme example: \(r_{12} = 1, r_{13} = 1, r_{23} = -1\)
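Writing this extreme example as a matrix and checking its eigenvalues confirms it is not positive semidefinite:

```python
import numpy as np

# The extreme example as a 3 x 3 "correlation matrix":
# r12 = r13 = 1, r23 = -1.
R = np.array([[ 1.,  1.,  1.],
              [ 1.,  1., -1.],
              [ 1., -1.,  1.]])
print(np.linalg.eigvalsh(R))        # the smallest eigenvalue is negative
```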
Compare CC and AC methods
When data are MCAR and correlations are modest, AC methods are more efficient than CC methods
When correlations are large, CC methods are usually better
References
- Little, R. J., & Rubin, D. B. (2019). Statistical Analysis with Missing Data, 3rd Edition. John Wiley & Sons.