In this study of nearly all patients who initiated chronic dialysis treatment between 1987 and 2002 in The Netherlands, we developed a propensity score that fulfilled accepted quality criteria: it showed a reasonable fit, had a good c-statistic, calibrated well and balanced the covariates. The Cox model that solely stratified on propensity score yielded essentially identical effect estimates of PD versus HD mortality compared with the multivariable-adjusted model while having the advantage of containing only one as opposed to nine covariates, thus being more parsimonious. When excluding interaction terms, dialysis modality was not an independent predictor of mortality in either model, but both models were misspecified, because effect modification was present. After introducing both interaction terms and all corresponding main effect variables to account for effect modification, the propensity score-stratified model contained almost the same number of covariates as the multivariable-adjusted model. When the models included interaction variables, all covariates remained independent predictors, but now dialysis modality besides its interaction variables became statistically significant. Supporting our findings, the identified effect modifiers, age and diabetes as primary renal disease, correspond to those found in previous studies [12–15].
Our study informs the discussion of the utility of propensity score in outcomes research. In theory, the use of propensity score-stratified modeling may allow for a more straightforward estimation of the relative mortality risk in comparison with multivariable-adjusted modeling. However, our study shows that neglecting effect modification in propensity score-stratified models may lead to erroneous conclusions. Incorporating effect modification however removes the direct interpretability of the main treatment effect one wishes to estimate, thereby limiting one benefit of using a propensity score. Still, Sturmer and colleagues  suggest that the propensity score-adjustment approach may yet have an advantage over traditional methods when effect modification is present, because it allows for a summary effect size across all strata of the effect modifiers. This can be relevant in pharmacoepidemiology, to estimate how a total population might benefit from a particular drug. However in the setting of end-stage renal disease, the assignment of a patient to a specific dialysis modality should be tailored to a patient's specific pretreatment characteristics and should not be determined by the summary effect size across all strata of the effect modifiers.
Other studies that compared propensity score-stratified versus multivariable-adjusted modeling have been reviewed by Shah and colleagues  and Sturmer and colleagues . Similar to our findings, both studies concluded that propensity score-stratified modeling rarely led to a different result compared with multivariable-adjusted modeling. Choosing which method may depend on the quantity of data. Cepeda and colleagues report from their simulation study that with eight or more outcomes per confounder, the multivariable-adjusted logistic regression model showed better precision . However, with fewer than eight events per confounder, propensity score-stratified modeling performed better. Furthermore, the choice also depends on the research question, as suggested by Kurth and colleagues . They showed that when there is a non-uniform treatment effect, different adjustment methods can result in divergent results, which may all be correct but depend on the research question implied by the adjustment method. Propensity score, as opposed to traditional multivariate modeling, enables a direct estimation of comparability of the treatment groups by assessing the covariate balance between groups .
For the propensity score-adjustment approach, there are no accepted rules for construction and evaluation of the propensity score model. Evaluation of a propensity score model often consists of assessing discrimination with the c-statistic and calibration with goodness-of-fit tests. However, Weitzen and colleagues in their simulation study showed that neither the c-statistic nor the Hosmer-Lemeshow goodness-of-fit test was sensitive to omission of an important confounder from the propensity score model .
Failure to include important confounders can lead to biased estimates of the treatment effect . Austin and colleagues reported that propensity scores estimated on administrative data might not balance all clinical characteristics . This could be particularly relevant to our study, because the RENINE database is administrative and does not contain clinical data, in particular co-morbidity data. Information on primary renal diagnosis however is available with the most important co-morbid condition, diabetes, likely well-represented because of the strong correlation between diabetes and diabetes as primary renal disease (PRD-DM). Further, Weitzen and colleagues  report that omitting a confounder in a propensity model has little effect on the treatment effect estimate. This could imply that the propensity score is fairly robust to unobserved confounders if, as also reported by Rosenbaum , at least some of the key variables that explain treatment assignment are included in the score. Moreover, omitting a confounder also leads to biased estimates when using a multivariable-adjusted model.
To summarize: if propensity score models are constructed well and no important confounders are missing, a treatment effect with a reliable significance level can usually be estimated, with the advantage of a more parsimonious outcome model and the advantage of assessing covariate balance between treatment groups explicitly. When the outcome is rare, propensity score-adjustment yields effect size estimates with a higher precision. Reviews of studies applying propensity score-adjustment methods have shown however, that propensity score-stratified modeling was often not implemented or reported appropriately [1, 4, 10]. Researchers should carefully assess whether propensity score-adjustment methods are appropriate for their specific situation.
From our study, we conclude that although the propensity score performed well, it did not alter the treatment effect in the outcome model and lost its advantage of parsimony because effect modification was present. Thus, using a model of mortality of patients on renal replacement therapy as a special case study, our study contributes to the growing literature supporting the comparability of traditional multivariable regression and propensity score methods unless sample size is small and outcome is rare.