Skip to main content

A simulation study of regression approaches for estimating risk ratios in the presence of multiple confounders

Abstract

Background

Risk ratio is a popular effect measure in epidemiological research. Although previous research has suggested that logistic regression may provide biased odds ratio estimates when the number of events is small and there are multiple confounders, the performance of risk ratio estimation has yet to be examined in the presence of multiple confounders.

Methods

We conducted a simulation study to evaluate the statistical performance of three regression approaches for estimating risk ratios: (1) risk ratio interpretation of logistic regression coefficients, (2) modified Poisson regression, and (3) regression standardization using logistic regression. We simulated 270 scenarios with systematically varied sample size, the number of binary confounders, exposure proportion, risk ratio, and outcome proportion. Performance evaluation was based on convergence proportion, bias, standard error estimation, and confidence interval coverage.

Results

With a sample size of 2500 and an outcome proportion of 1%, both logistic regression and modified Poisson regression at times failed to converge, and the three approaches were comparably biased. As the outcome proportion or sample size increased, modified Poisson regression and regression standardization yielded unbiased risk ratio estimates with appropriate confidence intervals irrespective of the number of confounders. The risk ratio interpretation of logistic regression coefficients, by contrast, became substantially biased as the outcome proportion increased.

Conclusions

Regression approaches for estimating risk ratios should be cautiously used when the number of events is small. With an adequate number of events, risk ratios are validly estimated by modified Poisson regression and regression standardization, irrespective of the number of confounders.

Background

A cohort study is a type of observational study that aids in evaluating associations between exposures and outcomes. In such studies, regression analysis is frequently used to estimate the effect of exposure adjusted for multiple confounders. For binary outcomes, logistic regression has been widely employed to estimate adjusted odds ratios. Because of interpretation difficulties, odds ratios are often interpreted as approximates of risk ratios under the assumption of rare events [1]. Although the odds ratio approximates the risk ratio if the outcome risk is sufficiently low for all study subjects, when some of the subjects have risk higher than 10%, the odds ratio is known to be distorted away from the null value compared to the risk ratio [2, 3].

Among the generalized linear models, log-binomial regression models can be used to directly estimate adjusted risk ratios for both common and rare events [4]. However, log-binomial regression using the standard maximum likelihood estimation method often fails to converge [5, 6]. To solve this problem, modified Poisson regression has been proposed [7] and has been applied to estimate adjusted risk ratios in numerous epidemiologic studies (e.g., [8,9,10,11,12]). Moreover, regression standardization using logistic regression could be another possible workaround [13, 14].

In terms of confounding adjustment using regression analysis, the number of confounders that can be included in regression models has been investigated in relation to the number of events or subjects [15,16,17,18,19]. Although a simple criterion of ten or more events per variable is well known for logistic regression [20, 21], other factors such as the number of events and confounders per se and the effect sizes are reported to influence the valid estimation of adjusted odds ratios [17,18,19]. On the other hand, the statistical performance of risk ratio estimation has not been well examined in the presence of multiple confounders.

This study sought to evaluate statistical performance in the presence of multiple confounders of the three regression approaches for estimating risk ratios: (1) risk ratio interpretation of logistic regression coefficients, (2) modified Poisson regression, and (3) regression standardization using logistic regression. After briefly summarizing approaches for estimating risk ratios, we evaluate the statistical performance of the three approaches by means of simulation. We then discuss the interpretation of the simulation results and draw conclusions about regression approaches for estimating risk ratios in the presence of multiple confounders.

Methods

Regression approaches for estimating risk ratios

We consider a cohort study of n subjects involving binary outcome Yi (1 for event and 0 for no event), binary exposure Ai (1 for exposure and 0 for no exposure), and column vector of confounders Li for each subject i. Logistic regression is commonly used to control for confounders and assess the influence of exposure for this type of data. The logistic regression model with first-order terms of exposure and confounders is expressed as follows (we assume that the regression models are correctly specified below unless otherwise noted):

$$\text{log}\frac{E\left({Y}_{i}|{A}_{i},{L}_{i};\alpha \right)}{1-E\left({Y}_{i}|{A}_{i},{L}_{i};\alpha \right)}={\alpha }_{0}+{\alpha }_{1}{A}_{i}+{\alpha }_{2}^{T}{L}_{i}={X}_{i}^{T}\alpha ,$$

where \(\alpha ={\left({\alpha }_{0},{\alpha }_{1},{\alpha }_{2}^{T}\right)}^{T}\) is the unknown parameter vector, and \({X}_{i}={\left(1,{A}_{i},{L}_{i}^{T}\right)}^{T}\). Under this model, the exponentiated exposure coefficient, \(\text{e}\text{x}\text{p}\left({\alpha }_{1}\right)\), indicates the adjusted odds ratio, which is often interpreted as the risk ratio under the assumption of rare events [1]. Assuming that Yi follows a binomial distribution given Ai and Li, the parameter α is estimated using the standard maximum likelihood estimation method.

To estimate the risk ratio directly, the log-binomial regression model, the linear model of the log-transformed mean, may be straightforward [4]:

$$\text{log}E\left({Y}_{i}|{A}_{i},{L}_{i};\beta \right)={\beta }_{0}+{\beta }_{1}{A}_{i}+{\beta }_{2}^{T}{L}_{i}={X}_{i}^{T}\beta ,$$

where \(\beta ={\left({\beta }_{0},{\beta }_{1},{\beta }_{2}^{T}\right)}^{T}\) is the unknown parameter vector. Under this model, the exponentiated exposure coefficient, \(\text{e}\text{x}\text{p}\left({\beta }_{1}\right)\), can be interpreted as the adjusted risk ratio without the assumption of rare events. Assuming that Yi follows a binomial distribution given Ai and Li, the standard maximum likelihood estimation method yields a consistent and asymptotically efficient estimator for β. However, in real-world applications, the iterative procedures for log-binomial regression models often fail to converge [5, 6] and were thus excluded from our simulation experiments.

One proposed solution, modified Poisson regression, estimates parameter β of the log-binomial regression model by solving the following estimating equations for β [7]:

$$U\left(\beta \right)=\sum _{i=1}^{n}{X}_{i}\left\{{Y}_{i}-\text{exp}\left({X}_{i}^{T}\beta \right)\right\}=0.$$

The estimating equations of the modified Poisson regression are equivalent to the score equations of the Poisson regression. The estimator for the standard error is obtained from the robust sandwich variance estimator:

$$\begin{aligned} \widehat{Var}\left(\widehat{\beta }\right) ={\left\{\sum _{i=1}^{n}{X}_{i}\text{exp}\left({X}_{i}^{T}\widehat{\beta }\right){{X}_{i}}^{T}\right\}}^{-1}\left[\sum _{i=1}^{n}{X}_{i}{\left\{{Y}_{i}-\text{exp}\left({X}_{i}^{T}\widehat{\beta }\right)\right\}}^{2}{{X}_{i}}^{T}\right]{\left\{\sum_{i=1}^{n}{X}_{i}\text{exp}\left({X}_{i}^{T}\widehat{\beta }\right){{X}_{i}}^{T}\right\}}^{-1}. \end{aligned}$$

The estimator obtained from the estimating equation is consistent and asymptotically normal, albeit without asymptotic efficiency. For rare events, the modified Poisson regression estimators approximate maximum likelihood estimators of log-binomial and logistic regression, and the efficiency loss should be small [22]. Previous simulation results suggest that the modified Poisson regression estimates are generally close to the maximum likelihood counterparts [7, 23]. The modified Poisson regression is also reported to be less sensitive to outliers [24] and less biased when the mean structure is misspecified [25]. Potential predicted probabilities above 1 [26, 27] are not fatal if the analysis aims to estimate the adjusted risk ratio and not the individual predicted probabilities.

Another approach for estimating the risk ratio is regression standardization using logistic regression [13, 14]. Instead of directly interpreting the logistic regression coefficients, the risk ratio among the entire population is calculated based on predicted probabilities estimated from logistic regression, which are constrained to fall between 0 and 1. Using maximum likelihood estimates of logistic regression \(\widehat{\alpha }\), the predicted risk if subject i was exposed is given by

$${\widehat{\mu }}_{i1}=\frac{\text{e}\text{x}\text{p}\left({X}_{i1}^{T}\widehat{\alpha }\right)}{1+\text{e}\text{x}\text{p}\left({X}_{i1}^{T}\widehat{\alpha }\right)},$$

where \({X}_{i1}={\left(1, 1,{L}_{i}^{T}\right)}^{T}\), and the predicted risk if subject i was not exposed is given by

$${\widehat{\mu }}_{i0}=\frac{\text{e}\text{x}\text{p}\left({X}_{i0}^{T}\widehat{\alpha }\right)}{1+\text{e}\text{x}\text{p}\left({X}_{i0}^{T}\widehat{\alpha }\right)},$$

where \({X}_{i0}={\left(1, 0,{L}_{i}^{T}\right)}^{T}\). The risk ratio for exposure is computed by taking the ratio of these risks averaged over the population:

$$\widehat{\text{R}\text{R}}=\frac{\sum _{i=1}^{n}{\widehat{\mu }}_{i1}}{\sum _{i=1}^{n}{\widehat{\mu }}_{i0}}.$$

The estimator for the standard error of \(\text{log}\widehat{\text{R}\text{R}}\) is easily obtained using the delta method [28]:

$$\widehat{Var}\left(\text{log}\widehat{\text{R}\text{R}}\right)={R}^{T}\widehat{Var}\left(\widehat{\alpha }\right)R,$$

where

$$\begin{aligned}R= & \frac{1}{\sum _{i=1}^{n}{\widehat{\mu }}_{i1}}\left[\sum _{i=1}^{n}\frac{\text{e}\text{x}\text{p}\left({X}_{i1}^{T}\widehat{\alpha }\right)}{{\left\{1+\text{e}\text{x}\text{p}\left({X}_{i1}^{T}\widehat{\alpha }\right)\right\}}^{2}}{X}_{i1}\right]\\ & - \frac{1}{\sum _{i=1}^{n}{\widehat{\mu }}_{i0}}\left[\sum _{i=1}^{n}\frac{\text{e}\text{x}\text{p}\left({X}_{i0}^{T}\widehat{\alpha }\right)}{{\left\{1+\text{e}\text{x}\text{p}\left({X}_{i0}^{T}\widehat{\alpha }\right)\right\}}^{2}}{X}_{i0}\right],\end{aligned}$$

and \(\widehat{Var}\left(\widehat{\alpha }\right)\) is the estimated variance-covariance matrix of logistic regression.

Simulation methods

We conducted a simulation study to evaluate statistical performance in the presence of multiple confounders of three approaches for estimating risk ratios: (1) risk ratio interpretation of logistic regression coefficients, (2) modified Poisson regression, and (3) regression standardization using logistic regression. We simulated 270 scenarios (10,000 iterations) with settings varying systematically on sample size (2500, 5000, and 10,000), number of binary confounders (5, 10, and 20), exposure proportion (20% and 50%), risk ratio for exposure (1, 1.3, and 2), and outcome proportion (1%, 2%, 4%, 8%, and 16%). The simulation was carried out using SAS version 9.4 (SAS Institute, Inc.).

Data generation

We generated binary confounders, exposure, and outcome for each of the 2500, 5000, or 10,000 subjects. To create binary confounders, 5-, 10-, or 20-dimensional Gaussian variables with mean 0, variance 1, and pairwise correlations 0.33 were discretized into 0 and 1 at predefined points (Additional file 1: Table S1). The exposure was generated from a logistic regression model with first-order terms of confounders so that the specified exposure proportion (20% or 50%) was achieved on average (Additional file 1: Table S2). The outcome was generated from a log-binomial regression model with first-order terms of exposure and confounders using three different risk ratios for exposure (1, 1.3, or 2). Confounder-outcome associations were weakened for increased confounders so that the maximum possible individual risk did not exceed 1 at any number of confounders (Additional file 1: Table S3). We also conducted additional simulation experiments keeping the same moderate confounder-outcome associations regardless of the number of confounders. The parameter settings and the results of the additional simulation are provided in Additional file 2. The intercept was adjusted so that the specified outcome proportion (1%, 2%, 4%, 8%, or 16%) was achieved on average.

Although in some previous studies of logistic regression, the number of events was fixed across data sets assuming retrospective samplings such as in a case-control study [17, 19], we generated outcomes so that the specified proportion would be achieved only on average assuming prospective samplings such as in a cohort study. The expected number of events and events per confounder can be calculated using the sample size, outcome proportion, and the number of confounders (Table 1).

Table 1 The expected number of events and the expected number of events per confounder

Analytical approaches

For each data set, we obtained the point estimate, standard error, and 95% Wald confidence interval of the log risk ratio for exposure from the three approaches. We performed logistic and modified Poisson regression with first-order terms of exposure and all confounders using the SAS GENMOD procedure. Note that the logistic regression misspecified the mean structure, and the degree of misspecification was larger with higher outcome proportions as more subjects had relatively high risks (Additional file 1: Fig. S1). We left the default settings unchanged for the optimization algorithm, starting values, and convergence criteria. If the algorithm for logistic regression was deemed to have converged based on criteria stated below, the results for the regression standardization were computed using the SAS program written by the authors.

Performance measures

For each scenario, we summarized the results in terms of convergence proportion, bias, Monte Carlo standard error, mean estimated standard error, and 95% confidence interval coverage for the log risk ratio for exposure. Because the software may falsely report convergence, giving invalid parameter estimates [29], the convergence of logistic and modified Poisson regression was evaluated based on the estimated standard errors of coefficients instead of the procedure’s reports. If the estimated standard error of any coefficient was missing, 0, or above 1000, the algorithm was deemed not to have converged. Results from the converged data sets were used to compute the following performance measures of the three approaches. To provide an intuitive understanding of bias, the mean estimated log risk ratio was transformed back to a linear scale. The Monte Carlo standard error was calculated as the standard deviation of the estimated log risk ratios. The mean estimated standard error was calculated as the average of the standard error estimates of the log risk ratio. The 95% confidence interval coverage was calculated as the proportion of estimated confidence intervals covering the true value.

Results

Convergence proportion

Nonconvergence mainly occurred when the sample size was 2500 and the outcome proportion was 1% (Additional file 1: Fig. S2). Under such scenarios, nonconvergence occurred for up to 2.3% of the data sets without identifiable trends associated with other factors such as the number of confounders, effect size, or exposure proportion. The convergence proportion and converged data sets were identical for logistic and modified Poisson regression. All data sets converged when the sample size was 2500 and the outcome proportion was higher than 2%, when the sample size was 5000 and the outcome proportion was higher than 1%, and when the sample size was 10,000. In the additional experiments, nonconvergence was more frequent with increased confounders (Additional file 2: Fig. S4).

Bias

Figure 1 shows the results of biases for each scenario. In scenarios wherein true risk ratio was 1 (top), the direction and magnitude of the biases were comparable among the three approaches. When the outcome proportion was 1%, the three approaches underestimated the risk ratio with an exposure proportion of 20%. As the outcome proportion increased, the biases of the three approaches decreased. In scenarios wherein true risk ratio was 1.3 (middle) or 2 (bottom), the biases of the logistic regression coefficients showed a different trend compared to the others. With an exposure proportion of 20%, the three approaches underestimated the risk ratio when the outcome proportion was 1%. The largest bias amounted to approximately 25% of the true log risk ratio when the sample size was 2500. As the outcome proportion increased, the biases of the modified Poisson regression and regression standardization decreased, whereas overestimation biases were observed for logistic regression coefficients. With an exposure proportion of 50%, the three approaches overestimated the risk ratio when the outcome proportion was 1%. As the outcome proportion increased, the biases of the modified Poisson regression and regression standardization decreased, whereas the overestimation biases of logistic regression coefficients further increased. The number of confounders did not markedly affect the magnitude of bias. Although the biases associated with low outcome proportions were mitigated by the increased sample size, the biases of logistic regression coefficients associated with high outcome proportions were not. The results of biases were similar in the additional experiments (Additional file 2: Fig. S5).

Fig. 1
figure 1

Mean estimated log risk ratio transformed back to linear scale. a risk ratio 1; b risk ratio 1.3; c risk ratio 2. LO logistic regression, MP modified Poisson regression, RS regression standardization

Standard error

Figure 2 shows the results of the Monte Carlo standard error for each scenario. When the sample size was 2500 and the outcome proportion was lower than 4%, and when the sample size was 5000 and the outcome proportion was lower than 2%, the mean estimated standard error was slightly smaller than the Monte Carlo standard error for the three approaches, indicating that the three approaches underestimated the standard error (Fig. 3, Additional file 1: Fig. S3). When the outcome proportion was 2% or lower, the Monte Carlo standard error and mean estimated standard error were comparable among the three approaches. In contrast, when the outcome proportion was higher than 2%, those from the logistic regression coefficients were slightly larger than those from the other two approaches. The results of the Monte Carlo standard error, mean estimated standard error, and disparity thereof were associated with the expected number of events (Table 1). The results on standard error were similar in the additional experiments (Additional file 2: Figs. S6–S8).

Fig. 2
figure 2

Monte Carlo standard error (MCSE). a risk ratio 1; b risk ratio 1.3; c risk ratio 2. LO logistic regression, MP modified Poisson regression, RS regression standardization

Fig. 3
figure 3

Mean estimated standard error (MESE) minus Monte Carlo standard error (MCSE). a risk ratio 1; b risk ratio 1.3; c risk ratio 2. LO logistic regression, MP modified Poisson regression, RS regression standardization

Coverage proportion

Figure 4 shows the results of 95% Wald confidence interval coverage for each scenario. In scenarios wherein true risk ratio was 1 (top), the coverage proportion was comparable among the three approaches. When the outcome proportion was 1%, overcoverage occurred for the three approaches, notably with a sample size of 2500 and an exposure proportion of 20%. As the outcome proportion increased, the coverage proportion became closer to the nominal level. In scenarios wherein true risk ratio was 1.3 (middle) or 2 (bottom), the coverage proportion of logistic regression coefficients showed a different trend compared to the others. When the outcome proportion was 1%, overcoverage occurred for the three approaches, notably with a sample size of 2500, an exposure proportion of 20%, and a risk ratio of 1.3. As the outcome proportion increased, the coverage proportion of the modified Poisson regression and regression standardization became closer to the nominal level, whereas undercoverage occurred for logistic regression coefficients. The number of confounders did not markedly affect the performance of confidence intervals. With other factors fixed, undercoverage of logistic regression associated with high outcome proportions was more severe with larger samples. The results of confidence interval coverage were similar in the additional experiments (Additional file 2: Fig. S9).

Fig. 4
figure 4

Coverage probability of the 95% Wald confidence interval. a risk ratio 1; b risk ratio 1.3; c risk ratio 2. LO logistic regression, MP modified Poisson regression, RS regression standardization

Discussion

In this study, we evaluated the statistical performance of the regression approaches for estimating risk ratios in the presence of multiple confounders. In summary, with a sample size of 2500 and an outcome proportion of 1%, the three approaches were equally biased and yielded inaccurate confidence intervals. As the outcome proportion or sample size increased, modified Poisson regression and regression standardization yielded unbiased estimates with appropriate confidence intervals irrespective of the number of confounders. The risk ratio interpretation of logistic regression coefficients was substantially biased when the outcome proportion was relatively high and the true risk ratio was not 1. These results of the main simulation remained consistent in the additional simulation with different parameter settings for the outcome generation models.

In our simulation, nonconvergence occurred in identical data sets for logistic regression and modified Poisson regression. Logistic and Poisson regression may fail to converge due to separation or multicollinearity [30, 31]. Because we included multiple binary confounders in the models, quasi-complete separation was considered the dominant cause of nonconvergence. Although modified Poisson regression has been appreciated for being less prone to convergence issues compared to the log-binomial regression [7], it failed to converge for data sets wherein logistic regression failed to converge. In real-world applications, neither logistic nor modified Poisson regression is exempt from convergence issues.

When the outcome proportion was 1%, the three approaches were comparably biased (both upward and downward). It is well known that logistic regression may yield biased odds ratio estimates when the number of events is small and there are multiple confounders [15,16,17,18,19]. In such situations, one should also be careful with using modified Poisson regression and regression standardization. In our simulation, the magnitude of bias of these two approaches was associated with the expected number of events rather than the expected number of events per confounder; the increase in confounders for a fixed expected number of events did not markedly affect the magnitude of bias.

As the expected number of events increased corresponding to the increase in sample size or outcome proportion, modified Poisson regression and regression standardization yielded unbiased risk ratio estimates regardless of the number of confounders. In contrast, as the outcome proportion increased to 4% or higher, logistic regression coefficients were substantially upward biased except with true risk ratio 1. This is probably because the odds ratio no longer approximated the risk ratio. In our simulation, with a true risk ratio of 2 and an outcome proportion of 16%, the mean estimated logistic regression coefficients corresponded to an odds ratio of approximately 2.5, which may lead to an exaggeration of the exposure effect if interpreted as the risk ratio.

The three approaches underestimated the standard error when fewer than 100 events were expected (i.e., when the sample size was 2500 and the outcome proportion was lower than 4%, and when the sample size was 5000 and the outcome proportion was lower than 2%). Some previous simulation studies indicate that variance or standard error estimates of logistic regression may be unreliable when the number of events is small overall or relative to the number of confounders [15, 16]. Our simulation results suggest that similar problems in standard error estimation may arise from the different methods employed in our simulation. Several candidate methods are available for enhancing the performance of standard errors and the resulting confidence intervals. In modified Poisson regression, sandwich standard errors with small sample correction may be explored, as has been considered for linear regression [32] and modified least-squares regression for risk difference estimation [33, 34]. In regression standardization, the bootstrap method may be preferred for small sample sizes if the computational time is not critical [13, 35].

The Monte Carlo standard error and mean estimated standard error of the modified Poisson regression did not exceed those of the logistic regression. They were slightly larger for logistic regression with high outcome proportions, probably because of the larger point estimates. In theory, unlike maximum likelihood estimators, modified Poisson regression estimators are not asymptotically efficient. Nonetheless, our results suggest that efficiency loss in modified Poisson regression may be negligible for cohort studies involving rare outcomes; in such situations, the risk ratio interpretation of logistic regression coefficients will also hold good, however.

The three approaches yielded conservative 95% Wald confidence intervals when the expected number of events was 25 (i.e., when the outcome proportion was 1% and the sample size was 2500). Since this overcoverage was apparent concurrently with biased point estimates and underestimated standard errors, the overcoverage may have been caused by the non-normality of the estimated log risk ratios. These phenomena are in good agreement with a previous simulation study of logistic regression involving multiple binary confounders where coverage proportions were approximately 97% when 30 or fewer events were generated, often concurrently with biased point estimates [17]. Some caution is needed in interpreting the confidence intervals of the three approaches when the number of events is small; however, statistically significant results could be reliable because of conservatism.

As the expected number of events increased corresponding to the increase in sample size or outcome proportion, modified Poisson regression and regression standardization yielded appropriate confidence intervals regardless of the number of confounders. In contrast, as the outcome proportion increased to 8% or higher, undercoverage occurred for logistic regression coefficients except with true risk ratio 1, most likely because of the discrepancy between odds ratios and risk ratios. This undercoverage was more extreme with larger sample sizes, probably because the large sample size decreased the variability of the point estimates and the length of confidence intervals.

In our simulation, the three approaches were comparably biased and yielded inaccurate confidence intervals when only 25 events were expected. In such scenarios, the expected number of events per confounder varied between 1.25, 2.5, and 5. As long as a sufficient number of events were expected, modified Poisson regression and regression standardization yielded unbiased risk ratio estimates with appropriate confidence intervals regardless of the number of confounders. Specifically, when 50 or more events were expected, modified Poisson regression and regression standardization did not provide a problematic bias of over 15% of the true log risk ratio [17] in any of the scenarios with a risk ratio of 1.3 or 2, and the coverage of the two approaches fell within the range of 94–96% in all scenarios. This was the case even when the expected number of events per confounder was 2.5.

For logistic regression, the existing criteria and some previous simulation results emphasize the importance of the number of events per variable or confounder [15,16,17,18,19]. In contrast, according to our simulation, the performance of the modified Poisson regression and regression standardization were associated with the expected number of events per se. This result is in line with some previous simulation studies on logistic regression [17, 19]. Because simulation results for rare-outcome situations are greatly affected by convergence issues [36], further simulations including continuous confounders may help explore such criteria.

The risk ratio interpretation of logistic regression coefficients may be acceptable, assuming an adequate number of events, when the outcome proportion is low or the exposure effect is close to null. Nevertheless, the other two approaches performed equally well in such situations. Our results showed no relative merits in interpreting logistic regression coefficients as risk ratios. Of the other two approaches, modified Poisson regression may be simple and easy to implement, although the choice should ideally be based on the true mean structure expected from prior knowledge [37].

For odds ratio estimation using logistic regression, some authors have encouraged the use of propensity score analyses [16] and shrinkage techniques [19] in rare-outcome situations. These methods may also be helpful for the estimation of risk ratios when the number of events is small. Regression adjustment for the propensity score can be applied to modified Poisson regression. Shrinkage techniques may mitigate the sparse data bias of the predicted probabilities in regression standardization using logistic regression.

Our simulation study has some limitations. First, we generated outcomes from log-binomial regression models, assuming common risk ratios for all subjects. This means not only that the modified Poisson regression naturally outperforms the direct interpretation of logistic regression coefficients but also that our simulation procedures may have provided an edge to modified Poisson regression over regression standardization. Further research may help compare these approaches under different settings. Second, the distribution of individual risks and the overall outcome proportion will impact the discrepancy between odds ratios and risk ratios. Different distributions of individual risks may produce different results.

Conclusions

In this study, we evaluated the statistical performance of the three regression approaches for estimating risk ratios in the presence of multiple confounders. Regression approaches for estimating risk ratios should be cautiously used when the number of events is small. With an adequate number of events, risk ratios are validly estimated by modified Poisson regression and regression standardization, irrespective of the number of confounders.

Availability of data and materials

The codes used to generate the datasets are available from the corresponding author on reasonable request.

References

  1. 1.

    Cornfield J. A method of estimating comparative rates from clinical data; applications to cancer of the lung, breast, and cervix. J Natl Cancer Inst 1951;11:1269–75.

    PubMed  CAS  Google Scholar 

  2. 2.

    Greenland S. Interpretation and choice of effect measures in epidemiologic analyses. Am J Epidemiol 1987;125:761–8.

    Article  CAS  Google Scholar 

  3. 3.

    Schmidt CO, Kohlmann T. When to use the odds ratio or the relative risk? Int J Public Health 2008;53:165–7.

    Article  Google Scholar 

  4. 4.

    Wacholder S. Binomial regression in GLIM: estimating risk ratios and risk differences. Am J Epidemiol 1986;123:174–84.

    Article  CAS  Google Scholar 

  5. 5.

    Spiegelman D, Hertzmark E. Easy SAS calculations for risk or prevalence ratios and differences. Am J Epidemiol 2005;162:199–200.

    Article  Google Scholar 

  6. 6.

    Williamson T, Eliasziw M, Fick GH. Log-binomial models: exploring failed convergence. Emerg Themes Epidemiol 2013;10:14.

    Article  Google Scholar 

  7. 7.

    Zou G. A modified Poisson regression approach to prospective studies with binary data. Am J Epidemiol 2004;159:702–6.

    Article  Google Scholar 

  8. 8.

    Brown TT, Cole SR, Li X, Kingsley LA, Palella FJ, Riddler SA, et al. Antiretroviral therapy and the prevalence and incidence of diabetes mellitus in the multicenter AIDS cohort study. Arch Intern Med. 2005;165:1179–84.

    Article  Google Scholar 

  9. 9.

    Sheridan E, Wright J, Small N, Corry PC, Oddie S, Whibley C, et al. Risk factors for congenital anomaly in a multiethnic birth cohort: an analysis of the Born in Bradford study. Lancet 2013;382:1350–9.

    Article  Google Scholar 

  10. 10.

    Cunningham RM, Carter PM, Ranney M, Zimmerman MA, Blow FC, Booth BM, et al. Violent reinjury and mortality among youth seeking emergency department care for assault-related injury: a 2-year prospective cohort study. JAMA Pediatr 2015;169:63–70.

    Article  Google Scholar 

  11. 11.

    Batty GD, Zaninotto P, Watt RG, Bell S. Associations of pet ownership with biomarkers of ageing: population based cohort study. BMJ 2017;359:j5558.

    Article  Google Scholar 

  12. 12.

    Williams PL, Yildirim C, Chadwick EG, Van Dyke RB, Smith R, Correia KF, et al. Association of maternal antiretroviral use with microcephaly in children who are HIV-exposed but uninfected (SMARTT): a prospective cohort study. Lancet HIV 2020;7:e49–58.

    Article  Google Scholar 

  13. 13.

    Greenland S. Model-based estimation of relative risks and other epidemiologic measures in studies of common outcomes and in case-control studies. Am J Epidemiol 2004;160:301–5.

    Article  Google Scholar 

  14. 14.

    Muller CJ, MacLehose RF. Estimating predicted probabilities from logistic regression: different methods correspond to different target populations. Int J Epidemiol 2014;43:962–70.

    Article  Google Scholar 

  15. 15.

    Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol 1996;49:1373–9.

    Article  CAS  Google Scholar 

  16. 16.

    Cepeda MS, Boston R, Farrar JT, Strom BL. Comparison of logistic regression versus propensity score when the number of events is low and there are multiple confounders. Am J Epidemiol 2003;158:280–7.

    Article  Google Scholar 

  17. 17.

    Vittinghoff E, McCulloch CE. Relaxing the rule of ten events per variable in logistic and Cox regression. Am J Epidemiol 2007;165:710–8.

    Article  Google Scholar 

  18. 18.

    Courvoisier DS, Combescure C, Agoritsas T, Gayet-Ageron A, Perneger TV. Performance of logistic regression modeling: beyond the number of events per variable, the role of data structure. J Clin Epidemiol 2011;64:993–1000.

    Article  Google Scholar 

  19. 19.

    van Smeden M, de Groot JAH, Moons KGM, Collins GS, Altman DG, Eijkemans MJC, et al. No rationale for 1 variable per 10 events criterion for binary logistic regression analysis. BMC Med Res Methodol 2016;16:163.

    Article  Google Scholar 

  20. 20.

    Hosmer DW, Lemeshow S. Applied logistic regression. 2nd ed. New York: Wiley, 2000.

    Book  Google Scholar 

  21. 21.

    Steyerberg E. Clinical prediction models: a practical approach to development, validation, and updating. New York: Springer, 2009.

    Book  Google Scholar 

  22. 22.

    Lumley T, Kronmal R, Ma S. Relative risk regression in medical research: models, contrasts, estimators and algorithms. (UW Biostatistics Working Paper Series, working paper 293). Berkeley, CA: Berkeley Electronic Press; 2006. https://biostats.bepress.com/uwbiostat/paper293 (27 April 2021, date last accessed).

  23. 23.

    Petersen MR, Deddens JA. A comparison of two methods for estimating prevalence ratios. BMC Med Res Methodol 2008;8:9.

    Article  Google Scholar 

  24. 24.

    Chen W, Shi J, Qian L, Azen SP. Comparison of robustness to outliers between robust Poisson models and log-binomial models when estimating relative risks for common binary outcomes: a simulation study. BMC Med Res Methodol 2014;14:82.

    Article  Google Scholar 

  25. 25.

    Chen W, Qian L, Shi J, Franklin M. Comparing performance between log-binomial and robust Poisson regression models for estimating risk ratios under model misspecification. BMC Med Res Methodol 2018;18:63.

    Article  Google Scholar 

  26. 26.

    Deddens JA, Petersen MR. Re: “Estimating the relative risk in cohort studies and clinical trials of common outcomes” [letter]. Am J Epidemiol 2004;159:213–4.

    Article  Google Scholar 

  27. 27.

    Petersen MR, Deddens JA. Re: “Easy SAS calculations for risk or prevalence ratios and differences” [letter]. Am J Epidemiol 2006;163:1158–9.

    Article  Google Scholar 

  28. 28.

    Flanders WD, Rhodes PH. Large sample confidence intervals for regression standardized risks, risk ratios, and risk differences. J Chronic Dis 1987;40:697–704.

    Article  CAS  Google Scholar 

  29. 29.

    Allison PD. Convergence failures in logistic regression. New York: SAS Global Forum; 2008. p. 360.

    Google Scholar 

  30. 30.

    Albert A, Anderson JA. On the existence of maximum likelihood estimates in logistic regression models. Biometrika 1984;71:1–10.

    Article  Google Scholar 

  31. 31.

    Santos Silva JMC, Tenreyro S. On the existence of the maximum likelihood estimates in Poisson regression. Econ Lett 2010;107:310–2.

    Article  Google Scholar 

  32. 32.

    Long JS, Ervin LH. Using heteroscedasticity consistent standard errors in the linear regression model. Am Stat 2000;54:217–24.

    Google Scholar 

  33. 33.

    Cheung YB. A modified least-squares regression approach to the estimation of risk difference. Am J Epidemiol 2007;166:1337–44.

    Article  Google Scholar 

  34. 34.

    Hagiwara Y, Fukuda M, Matsuyama Y. The number of events per confounder for valid estimation of risk difference using modified least-squares regression. Am J Epidemiol 2018;187:2481–90.

    Article  Google Scholar 

  35. 35.

    Localio AR, Margolis DJ, Berlin JA. Relative risks and confidence intervals were easily computed indirectly from multivariable logistic regression. J Clin Epidemiol 2007;60:874–82.

    Article  Google Scholar 

  36. 36.

    Steyerberg EW, Schemper M, Harrell FE. Logistic regression modeling and the number of events per variable: selection bias dominates [letter]. J Clin Epidemiol 2011;64:1464–5.

    Article  Google Scholar 

  37. 37.

    Tian L, Liu K. Re: “Easy SAS calculations for risk or prevalence ratios and differences” [letter]. Am J Epidemiol 2006;163:1157–8.

    Article  CAS  Google Scholar 

Download references

Acknowledgements

The authors are grateful to Dr. Koji Oba and Dr. Yoshinori Takeuchi for their helpful comments on the simulation results and earlier versions of the manuscript.

Funding

This work was supported by the Center of Innovation Program from Japan Science and Technology Agency (JST) and Japan Agency for Medical Research and Development (AMED) under Grant Number JP21lk0201701t0001.

The funders had no role in the study design, data analysis and interpretation, or manuscript preparation.

Author information

Affiliations

Authors

Contributions

KF proposed the study concept, conducted the literature review, designed and conducted the simulation, interpreted the simulation results, and drafted the manuscript. YH proposed the study concept, conducted the literature review, designed the simulation, interpreted the simulation results, and revised the manuscript. YM supervised the entire project and revised the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Yasuhiro Hagiwara.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

None declared.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1. 

Simulation Experiments Presented in the Main Text.

Additional file 2.

 Additional Simulation Experiments.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Fuyama, K., Hagiwara, Y. & Matsuyama, Y. A simulation study of regression approaches for estimating risk ratios in the presence of multiple confounders. Emerg Themes Epidemiol 18, 18 (2021). https://doi.org/10.1186/s12982-021-00107-2

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12982-021-00107-2

Keywords

  • Risk ratio
  • Logistic regression
  • Modified Poisson regression
  • Standardization
  • Confounding
  • Simulation study