In a TB prevalence survey, it is usually the case (based on experience to date) that age, sex, stratum, and cluster are known for all (or almost all) eligible individuals, while there will be missing data on TB symptoms, field and central chest X-ray readings, smear and culture results, and the primary outcome of pulmonary TB.
It is essential to start by exploring the extent to which data are missing, both to understand the biases that may result from an analysis restricted to survey participants and to choose imputation models that make the MAR assumption plausible. The following three quantities should be summarized: the proportion of eligible individuals who participated in the symptom and chest X-ray screening; the proportion of those with two sputum samples among people eligible for sputum examination; and the proportion with smear and culture results from 0, 1 or 2 sputum samples. These summaries should be produced overall and broken down by individual risk factors for pulmonary TB such as age group, sex and stratum, in order to identify which individual characteristics predict missingness.
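The first of these summaries (participation in screening, overall and by individual characteristic) can be sketched as follows. This is a minimal illustration in Python rather than in the Stata workflow the paper uses; the records, the field names (`age_group`, `sex`, `stratum`, `screened`) and the helper `participation_by` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical records, one per eligible individual (illustrative values only).
eligible = [
    {"age_group": "15-34", "sex": "M", "stratum": "urban", "screened": True},
    {"age_group": "15-34", "sex": "M", "stratum": "rural", "screened": False},
    {"age_group": "15-34", "sex": "F", "stratum": "urban", "screened": True},
    {"age_group": "35-54", "sex": "M", "stratum": "rural", "screened": False},
    {"age_group": "35-54", "sex": "F", "stratum": "rural", "screened": True},
    {"age_group": "35-54", "sex": "F", "stratum": "urban", "screened": True},
    {"age_group": "55+", "sex": "M", "stratum": "urban", "screened": True},
    {"age_group": "55+", "sex": "F", "stratum": "rural", "screened": True},
]


def participation_by(records, key, flag="screened"):
    """Proportion of records with `flag` true, for each level of `key`."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        if record[flag]:
            hits[record[key]] += 1
    return {level: hits[level] / totals[level] for level in totals}


# Overall participation, then the breakdown by each characteristic.
overall = sum(1 for r in eligible if r["screened"]) / len(eligible)
by_sex = participation_by(eligible, "sex")
by_age = participation_by(eligible, "age_group")
by_stratum = participation_by(eligible, "stratum")

print(f"overall: {overall:.2f}")
for name, table in [("sex", by_sex), ("age group", by_age), ("stratum", by_stratum)]:
    print(name, {level: round(p, 2) for level, p in table.items()})
```

Levels with clearly lower participation (here, men and rural residents in the toy data) point to characteristics that predict missingness and are therefore candidates for the imputation models.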
Missing value imputation is done using regression models in a procedure called “imputation by chained equations”, and can be implemented using standard statistical software packages such as Stata, SAS, and R [20–22]. For example, in the statistical package Stata this is done using the ice (“imputation by chained equations”) command. Additional file 1 explains, step-by-step, how the imputation is implemented to create a single imputed dataset. As recently set out in a paper that provides general guidance on the use of multiple imputation, key principles to observe when specifying the imputation model are: (1) it must include all explanatory variables to be investigated as risk factors at the analysis stage, and the outcome variable itself; (2) to make the MAR assumption plausible it “should include every variable that both predicts the incomplete variable and predicts whether the incomplete variable is missing”; (3) including variables that are predictors of the incomplete variable, whether or not they also predict missingness, will give better imputations; and (4) including variables that are predictors of missingness, whether or not there is statistical evidence they are predictors of the incomplete variable, helps to limit the potential for bias.
Our recommendation, following from this, is as follows. The outcome variable in a TB prevalence survey is pulmonary TB; sputum smear and culture results, the field and central chest X-ray reading, and TB symptoms are used in combination to define if an individual has pulmonary TB (see Additional file 1 for more detail). Thus all of these variables must be included in the imputation models. Individual characteristics that are established predictors of pulmonary TB (e.g. age, sex) and/or predictive of data being missing (e.g. age, sex, stratum) should be considered for inclusion in the imputation models, as illustrated in Additional file 1. The strongest predictors of pulmonary TB and/or missingness (age, sex, stratum) should always be included in the imputation models for TB symptoms, field X-ray reading, and smear and culture positivity. At the same time, the choice of additional predictors (e.g. smoking and alcohol consumption) may need to be limited so as to avoid severe collinearity, especially when imputing smear and culture results and the number of positive smear and culture results is small (though because imputation models are being used for predictive purposes, moderate collinearity is not problematic). Including cluster as an explanatory variable in the imputation model with smear positivity (yes or no) as the outcome variable is not recommended, because the number of individuals with a positive smear result is low relative to the number of clusters; this is true also for the imputation model with culture positivity (yes or no) as the outcome variable. For outcomes that are more common, such as abnormal chest X-ray result (yes or no), including cluster as an explanatory variable in the imputation model may be appropriate.
The process described in Additional file 1 is repeated to create multiple imputed datasets, for example 10–20 (hence the terminology “multiple” missing value imputation). The number of imputed datasets should be greater than or equal to the percentage of eligible individuals for whom data are missing. To date, this percentage has been in the range 4–15% in TB prevalence surveys, and we recommend that at least 20 imputed datasets are created.
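The rule of thumb above amounts to simple arithmetic, sketched here in Python. The function name `n_imputations` and its `minimum` parameter are hypothetical; the default of 20 reflects the recommendation in the text.

```python
import math


def n_imputations(pct_missing, minimum=20):
    """Number of imputed datasets: at least the percentage of eligible
    individuals with missing data, and never fewer than `minimum`."""
    return max(minimum, math.ceil(pct_missing))


# Percentages seen in surveys to date (4-15%) all lead to the recommended 20.
print(n_imputations(4), n_imputations(15), n_imputations(32.5))
```

Under this sketch, a survey with more than 20% of eligible individuals missing data would need more than 20 imputed datasets.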
The overall prevalence of pulmonary TB is calculated for each imputed dataset. The national-level pulmonary TB prevalence estimate is then calculated as the average of the pulmonary TB prevalence values from each imputed dataset, with a 95% CI that takes into account both the sampling design and the uncertainty due to missing value imputation. In Stata, this can be done using the mim or mi commands.
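The text does not spell out the combination formula. The sketch below assumes the standard combination rules for multiple imputation (Rubin's rules): the pooled estimate is the mean of the per-dataset estimates, and the total variance is the average within-imputation variance plus an inflated between-imputation variance. The input numbers are invented, the per-dataset variances are assumed to be design-based already, and the interval uses a normal approximation rather than the t reference distribution, so it should be read as an illustration only. In practice, software may pool on the log-odds scale instead.

```python
import math


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and their variances (assumed Rubin's rules).

    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    q_bar = sum(estimates) / m                      # pooled point estimate
    within = sum(variances) / m                     # average sampling variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between          # total variance
    return q_bar, total


# Invented prevalence estimates (proportions) from three imputed datasets,
# each with a design-based standard error of 0.001.
estimates = [0.004, 0.005, 0.006]
variances = [0.001 ** 2] * 3

pooled, total_var = pool_rubin(estimates, variances)
se = math.sqrt(total_var)
# Normal-approximation interval (an assumption; see the note above).
ci_low, ci_high = pooled - 1.96 * se, pooled + 1.96 * se
print(f"prevalence {pooled:.4f}, 95% CI {ci_low:.4f} to {ci_high:.4f}")
```

The between-imputation term is what widens the interval to reflect uncertainty about the missing values; with no missing data, all estimates coincide and the total variance reduces to the design-based variance.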
Multiple imputation is an efficient method for accounting for missing data, provided the imputation models are specified appropriately [8, 16, 24]. An alternative approach is to use a combination of multiple imputation (MI) and inverse probability weighting (IPW). With this approach, imputation is used to fill in missing values only among individuals who participated fully in the survey (N5 in Figure 4).
Survey participants can be divided into two groups, eligible or ineligible for sputum examination. Individuals who were ineligible for sputum examination are assumed not to have pulmonary TB, unless they had a normal field chest X-ray reading but an abnormal central chest X-ray reading. For those eligible for sputum examination (N6 in Figure 4, and additionally individuals with a normal field chest X-ray reading but abnormal central chest X-ray reading), multiple imputation is used to fill in missing data, in exactly the same way as described for Method 2 above (including using the same variables in the imputation models). Each of the imputed datasets is then combined with the data on individuals who were ineligible for sputum examination, to give (for example) 20 imputed datasets that include all individuals who participated fully in the survey.
For each imputed dataset, a point estimate and 95% CI for population pulmonary TB prevalence are then calculated, using logistic regression with robust standard errors and weights. Weights are calculated for each combination of cluster, age group, and sex. This is done by (a) counting the number of eligible individuals in each combination of cluster, age group, and sex ($N_{ijk}$, for cluster $i$, age group $j$, sex $k$) and (b) counting the number of survey participants in each combination of cluster, age group, and sex ($n_{ijk}$). The weight for each individual is then equal to $N_{ijk} / n_{ijk}$ for the particular combination of cluster, age group, and sex that they are in; $n_{ijk} / N_{ijk}$ is the estimated probability that a sampled individual in that combination participates in the survey, hence the name “inverse probability weighting”. It is essential to include either the weights or the covariates that predict the weights in the imputation model. We include age group, sex, and stratum (area of residence) in all imputation models. An average of the estimates of pulmonary TB prevalence from each of the imputed datasets is then calculated, together with a 95% CI. In Stata, this can be done using the mim and svy commands.
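The weight calculation described above can be sketched in a few lines of Python. The function name `ipw_weights` and the toy data are hypothetical; each individual is represented only by their (cluster, age group, sex) combination.

```python
from collections import Counter


def ipw_weights(eligible_cells, participant_cells):
    """Weight for each (cluster, age group, sex) combination:
    number eligible divided by number participating.

    Combinations with no participants get no weight, because there is
    nobody in them to carry it.
    """
    n_eligible = Counter(eligible_cells)
    n_participants = Counter(participant_cells)
    return {cell: n_eligible[cell] / n_participants[cell] for cell in n_participants}


# Toy data: one cluster, one age group, two sexes (invented counts).
cell_a = (1, "15-34", "M")   # 4 eligible, 2 participated
cell_b = (1, "15-34", "F")   # 5 eligible, 5 participated
eligible_cells = [cell_a] * 4 + [cell_b] * 5
participant_cells = [cell_a] * 2 + [cell_b] * 5

weights = ipw_weights(eligible_cells, participant_cells)
# Summed over participants, the weights recover the eligible population size.
total_weight = sum(weights[cell] for cell in participant_cells)
print(weights, total_weight)
```

In the toy data each participating man stands in for two eligible men, while each participating woman represents only herself. A combination with eligible individuals but no participants cannot be weighted in this way, which in practice may call for coarser groupings.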
An advantage of using IPW combined with MI, rather than just MI, is that it is relatively simple and transparent to calculate the probability of survey participation by cluster, age group and sex, compared with adjusting for non-participation through the use of a multivariable imputation model [17, 24]. However, an important assumption remains, which is that after post-stratifying on cluster, age, and sex, the prevalence of pulmonary TB is the same in survey participants and non-participants.