Using a simulation study based on the Melbourne Collaborative Cohort Study, we assessed two methods (i.e. complete-case analysis and MI) for handling up to 50% missing data in a repeatedly measured exposure of interest, in the context of a Cox proportional hazards model investigating two epidemiological associations. For both associations, complete-case analysis and MI each produced very little bias, and coverage remained close to 95%. In analysis (a), which included absolute change in waist circumference as the exposure of interest and adjusted for waist circumference at wave 1, there was no gain in precision from using MI instead of a complete-case analysis. However, there were slight gains in precision (i.e. a reduction in the standard error) for MI over a complete-case analysis in analysis (b), where there was a strong auxiliary variable in the imputation model.

The simulation study that we developed was based on a large existing cohort study, the Melbourne Collaborative Cohort Study. Basing simulation studies on real studies, an approach that is becoming increasingly common in the literature, allows researchers to incorporate complex and realistic associations within the data structure while simplifying the data generation process compared with a fully simulated scenario [3, 10–12, 15, 35, 36]. As with all simulation studies of this type, the generalisability of our findings is limited because our simulated data are based on a single cohort. Further exploration of simulation models based on other real data settings would be useful. There is also scope to investigate datasets with more than two waves of data collection to ascertain whether MI offers any gain over complete-case analysis in these scenarios.

The true HRs that we chose for the association between a 10 cm change in waist circumference and risk of colorectal cancer (i.e. 1.1 and 1.5) were based on realistic HRs that are typically observed for this anthropometric measure [25, 37, 38]. The moderate magnitude of these HRs may have minimised the bias that we observed, but our results are consistent with previous work that investigated HRs of similar magnitude: Demissie et al. [10] found little bias for a HR of one for a dichotomous exposure variable, and Marshall et al. [3] found minimal bias for a HR of one for a unit change in a continuous exposure variable.

We imposed missing data on the exposure of interest under an MCAR scenario and under standard and enhanced covariate-dependent MAR scenarios. Consistent with previously published studies, we found that under the MCAR scenario complete-case analysis produced negligible bias and good coverage of the estimate [3, 10, 11]. The covariate-dependent MAR scenarios were based on the variables observed to be predictors of non-attendance at wave 2 in the Melbourne Collaborative Cohort Study (i.e. missing waist circumference at wave 2). For the standard covariate-dependent MAR scenario, the coefficients were set to the estimated regression coefficients from the logistic regression model of the missingness indicator at wave 2. This realistic missing data scenario allowed us to evaluate the two methods for handling missing data under weak covariate-dependent MAR associations, which are more likely to be observed in real studies than the more extreme covariate-dependent MAR scenarios that are often reported in the missing data literature [11, 12, 39, 40]; for example, Donders et al. [39] set 40% of the simulated data to missing, with all of the missing data occurring in the unexposed group.
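As an illustration only (this is not the code used in our study), a covariate-dependent MAR mechanism of this kind can be sketched by passing fully observed covariates through a logistic model and drawing a missingness indicator for each participant. The function name, covariate names, intercept and coefficients below are hypothetical placeholders and are not the estimates from the Melbourne Collaborative Cohort Study.

```python
import math
import random


def impose_covariate_dependent_mar(rows, intercept, coefs, seed=0):
    """Flag which participants have the wave 2 exposure set to missing.

    rows      : list of dicts holding fully observed covariates
    intercept : intercept of the logistic missingness model (hypothetical)
    coefs     : dict of covariate name -> log odds ratio (hypothetical)
    The probability of missingness depends only on observed covariates,
    never on the exposure value itself or on the outcome.
    """
    rng = random.Random(seed)
    flags = []
    for row in rows:
        linear_predictor = intercept + sum(b * row[name] for name, b in coefs.items())
        p_missing = 1.0 / (1.0 + math.exp(-linear_predictor))
        flags.append(rng.random() < p_missing)
    return flags


# Hypothetical cohort: centred age (per 10 years) and smoking status.
data_rng = random.Random(1)
cohort = [
    {"age_per_10y": data_rng.gauss(0.0, 0.9), "smoker": int(data_rng.random() < 0.2)}
    for _ in range(20000)
]

# Weak associations, chosen purely for illustration.
missing = impose_covariate_dependent_mar(
    cohort, intercept=-0.9, coefs={"age_per_10y": 0.2, "smoker": 0.4}
)
prop_missing = sum(missing) / len(missing)
print(f"Proportion set to missing: {prop_missing:.3f}")
```

In a sketch like this, the intercept controls the overall proportion of missing data, and a more extreme scenario could be obtained by increasing the magnitude of the coefficients; how the enhanced scenario was constructed in our study is not reproduced here.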

We found very little bias in the log(HR) using complete-case analysis and MI to handle the missing data. The slight bias observed in analysis (a) may be a result of the imputation model being only semi-compatible with the analysis model (i.e. the exposure of interest in the analysis model is change in waist circumference, whereas the variable imputed in the imputation model is waist circumference at wave 2) [41]. We decided to impute waist circumference at wave 2 instead of change in waist circumference in order to represent the real epidemiological analysis (i.e. in a study where the variable is fully observed at wave 1 and has missing values at wave 2, the analyst is more likely to impute the variable with missing data and then calculate the absolute change between the two variables). The imputation model that we used included an indicator variable for whether a participant had colorectal cancer or was censored at the time of analysis, and the cumulative baseline hazard estimated using the Nelson-Aalen method. Although Marshall et al. [3] suggest that this may be a better approach than including a log transformation of the survival time and the event status, it may have introduced bias into our results [32]. Further, our MAR scenario was covariate-dependent, which may be specific to our study; MAR scenarios that depend on both the covariates and the outcome should be considered in future research. Previously published simulation studies that reported biased estimates using complete-case analysis or MI induced a missingness mechanism dependent on the exposure and outcome variables, under more extreme missingness scenarios [3, 10, 11, 15, 40].
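For readers unfamiliar with the Nelson-Aalen method, the following minimal sketch (an illustration under simplifying assumptions, not the routine used in our analysis) computes the cumulative hazard H(t) as the running sum of d_j / n_j over the distinct times t_j up to t, where d_j is the number of events at t_j and n_j is the number of participants still at risk. The function name and the toy data are hypothetical; the value of H(t) at each participant's own follow-up time, together with the event indicator, is what would be entered as a predictor in an imputation model.

```python
def nelson_aalen_at_own_time(times, events):
    """Nelson-Aalen cumulative hazard evaluated at each participant's time.

    times  : follow-up time for each participant
    events : 1 if the event was observed at that time, 0 if censored
    Uses a simple O(n^2) scan, which is adequate for a small illustration.
    """
    cumulative = 0.0
    hazard_at = {}
    for t in sorted(set(times)):
        n_at_risk = sum(1 for u in times if u >= t)
        n_events = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        cumulative += n_events / n_at_risk
        hazard_at[t] = cumulative
    return [hazard_at[t] for t in times]


# Toy data: five participants, two of them censored.
follow_up = [2, 3, 3, 5, 8]
event = [1, 0, 1, 1, 0]
cum_hazard = nelson_aalen_at_own_time(follow_up, event)
print([round(h, 2) for h in cum_hazard])  # [0.2, 0.45, 0.45, 0.95, 0.95]
```

In practice an established survival analysis routine in the analyst's statistical package would be used in place of this hand-written version.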

The auxiliary variables included in our imputation model (i.e. variables not included in the epidemiological analyses) were alcohol intake, smoking status, and physical activity at baseline. These variables had only weak to moderate associations with waist circumference at wave 2. To assess the impact of an auxiliary variable that has a strong association with the exposure of interest, we compared MI with a complete-case analysis for handling missing data in two analyses: (a) the association between change in waist circumference and risk of colorectal cancer, adjusted for waist circumference at wave 1, and (b) the association between waist circumference at wave 2 and risk of colorectal cancer, not adjusted for waist circumference at wave 1. For analysis (b), waist circumference at wave 1 was included in the imputation model as a strong auxiliary variable, whereas the imputation model for analysis (a) contained no strong auxiliary variables. MI provided no gain in precision of the estimate compared with complete-case analysis in analysis (a), where the imputation procedure only included auxiliary variables with weak and moderate associations with the variable with missing data. However, slight gains in precision were observed for the MI estimate compared with the complete-case estimate in analysis (b). Graham and Collins [42] used simulations of artificial data to show that strong auxiliary variables included in the imputation model for MI restored some of the power lost due to missing data. Real data examples are less likely to have auxiliary variables that are strongly associated with the variable subject to missing values; for example, Marshall et al. [3] reported correlations in the range of 0.3 to 0.4, and Lee and Carlin [43] reported correlations between 0.1 and 0.6 between the covariates and the variable with missing data. Incorporating auxiliary variables that have only weak or moderate associations with the variables with missing data into the imputation model will result in large between-imputation variance, leading to larger standard errors for the MI estimates and thus smaller gains (if any) from using MI compared with a complete-case analysis.
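The role of the between-imputation variance can be made concrete with Rubin's rules, under which the total variance of the pooled estimate is W + (1 + 1/m)B, where W is the average within-imputation variance, B is the between-imputation variance of the m estimates, and m is the number of imputed datasets. The sketch below is illustrative only: the function name and the numerical inputs are hypothetical and are not results from our simulations.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Pool m imputation-specific estimates using Rubin's rules.

    estimates : point estimates (e.g. log hazard ratios), one per imputed dataset
    variances : squared standard errors, one per imputed dataset
    Returns the pooled estimate, its standard error, and the
    between-imputation variance.
    """
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)
    between = statistics.variance(estimates)  # sample variance, divisor m - 1
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total), between


# Hypothetical log(HR) estimates from five imputed datasets.
log_hrs = [0.10, 0.12, 0.08, 0.11, 0.09]
squared_ses = [0.0025] * 5
pooled, se, between = pool_rubin(log_hrs, squared_ses)
print(f"pooled log(HR) = {pooled:.3f}, SE = {se:.4f}, B = {between:.5f}")
```

With these made-up inputs the within-imputation standard error is 0.05, and the between-imputation component raises the pooled standard error to about 0.053. Weakly predictive imputation models produce more variable imputations and hence a larger B, which offsets the information recovered from the incomplete cases.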

Complete-case analysis is the default method for handling missing data in statistical analyses in most software packages. However, MI is now available and easy to implement in many software packages (e.g. Stata, R, SAS and SPSS [7, 8, 44, 45]). This increased accessibility has led to an increase in the use of MI for dealing with missing data in epidemiological studies [34, 46]. MI produces unbiased estimates if the missing data mechanism is MAR, which encompasses the more specific scenario of covariate-dependent MAR [4, 47]. Data are Missing Not at Random (MNAR) when the study participants with missing data differ from the study participants with complete data in a manner that cannot be explained by the observed data [9]. We did not include an MNAR missing data scenario in our simulation study. It has been suggested that, for cohort studies that collect a large amount of information from their participants, the observed data can provide a lot of information about the missing data. Further, the imputation model may include combinations of observed variables that represent surrogate measures of the unobserved variables that are related to the missingness mechanism [48]. However, whether the data are MAR, either covariate-dependent or more generally, is untestable. Further research will therefore be important to investigate alternative approaches that explore the sensitivity of conclusions to plausible MNAR mechanisms or that simultaneously estimate the missing data model and the analysis model [49].