Sample size requirements to detect the effect of a group of genetic variants in case-control studies
© Moonesinghe et al; licensee BioMed Central Ltd. 2008
Received: 28 September 2007
Accepted: 03 December 2008
Published: 03 December 2008
Because common diseases are caused by complex interactions among many genetic variants along with environmental risk factors, very large sample sizes are usually needed to detect such effects in case-control studies. Nevertheless, many genetic variants act in well defined biologic systems or metabolic pathways. Therefore, a reasonable first step may be to detect the effect of a group of genetic variants before assessing specific variants.
We present a simple method for determining approximate sample sizes required to detect the average joint effect of a group of genetic variants in a case-control study for multiplicative models.
For a range of reasonable numbers of genetic variants, the sample size requirements for the test statistic proposed here are generally not larger than those needed for assessing marginal effects of individual variants and actually decline with increasing number of genetic variants in many situations considered in the group.
When a significant effect of the group of genetic variants is detected, subsequent multiple tests could be conducted to detect which individual genetic variants and their combinations are associated with disease risk. When testing for an effect size in a group of genetic variants, one can use our global test described in this paper, because the sample size required to detect an effect size in the group is comparatively small. Our method could be viewed as a screening tool for assessing groups of genetic variants involved in pathogenesis and etiology of common complex human diseases.
With the completion of the Human Genome Project and continuing advances in gene mapping and sequencing, there is an increasing interest in the discovery and characterization of thousands of genetic variants as potential risk factors for common diseases of public health significance. The search for genetic variants is currently hampered by numerous challenges, including the sheer number of genetic variants, the lack of replication of findings in many observational studies, and study design considerations (such as selection bias and confounding) [2–4]. Because the etiology of most common diseases, such as cancer, heart disease and diabetes, involves complex genetic and environmental factors, a particular concern in the design of epidemiologic studies is the lack of statistical power to examine the joint effects and statistical interactions of several genetic variants, especially along with environmental risk factors. For example, even if one considers that only 10 independent genetic variants are involved in a particular disease, and assuming simplistically a dichotomous classification of the susceptible genotype, this leads to more than 1,000 strata (2^10 = 1,024) in which cases and controls can be distributed. With another 10 dichotomous environmental factors, there are more than a million strata (2^20) to assess. Note that the issue of multiple strata may be addressed by using quantitative variables in place of dichotomous variables where appropriate.
Several methodologies have been suggested to reduce the complex interactions of genetic and environmental effects, most notably multifactor dimensionality reduction (MDR). In the context of screening for the importance of a biologic system in the etiology of a specific disease, however, it is often helpful to have an a priori hypothesis for the genetic effects that belong to a certain biologic pathway. For example, in studying the etiology of venous thrombosis, researchers are examining the effects of genetic variants involved in the coagulation pathway. Also, in studying the etiology of neural tube defects (NTD), because of the protective effects of dietary folates, researchers are examining the relationship between genetic variants involved in folate metabolism and the risk of NTD.
In this paper, we present a simple method for assessing the overall effect of a group of genetic variants in the context of case-control studies. Although post hoc tests have to be conducted to assess joint effects of combinations of specific genetic variants, our method enables detection of the average effect of the group of genetic variants with a reasonable sample size; it can thus be used as a screening approach for further study.
McKeown-Eyssen and Thomas explored the relationship between exposure and the differences in case-control means when the distribution of exposure is continuous. They derived sample size equations for studies with a continuous exposure, which allow the investigator to specify the strength of the relationship between disease and exposure in terms of relative risk. Given the joint distribution of exposure for controls, Rao derived the joint distribution of exposure for cases by dividing the product of the joint distribution of exposure for controls and the risk function by the sum of this product over all the possible values that the exposure variable can assume. We used this method to derive sample size formulas given a joint distribution of k genetic variants for multiplicative and additive models. The results of our investigation of multiplicative models are presented below.
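Rao's construction can be sketched numerically. The following is a minimal illustration, not the authors' code: it assumes binary exposures that are independent among controls and a multiplicative risk function proportional to the product of Ri over the variants carried. The function names and the example prevalences and relative risks are hypothetical.

```python
from itertools import product


def case_distribution(G, R):
    """Return {x: Pr(x | case)} over all binary exposure vectors x.

    G[i]: prevalence of variant i among controls (assumed independent).
    R[i]: relative risk for carriers of variant i (assumed multiplicative).
    """
    k = len(G)
    weights = {}
    for x in product((0, 1), repeat=k):
        f0 = 1.0    # probability of this stratum among controls
        risk = 1.0  # relative risk of this stratum
        for xi, g, r in zip(x, G, R):
            f0 *= g if xi else 1.0 - g
            risk *= r if xi else 1.0
        weights[x] = f0 * risk
    # Normalise by the sum of the product over all 2**k strata.
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


def case_prevalence(f1, i):
    """Marginal prevalence of variant i among cases."""
    return sum(p for x, p in f1.items() if x[i] == 1)


# Hypothetical inputs chosen only for illustration.
G = [0.3, 0.5, 0.1]
R = [2.0, 1.5, 1.0]
f1 = case_distribution(G, R)
for i in range(len(G)):
    print(i, G[i], round(case_prevalence(f1, i), 4))
```

Because the baseline risk multiplies both the numerator and the denominator, it cancels and is omitted from the sketch. A variant with R = 1 keeps the same prevalence in cases as in controls.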
If the null hypothesis, H0, is rejected, one can conduct subsequent multiple tests to detect which of the Ri are significantly different from 1, or test subsets of the Ri using the same test statistic given above. However, the level of significance of each test has to be adjusted based on the number of multiple tests conducted.
Overall, the sample size requirement declined with increasing values of k. For example, compared with the sample size requirement for k = 1 the sample size requirement for k = 10 declined by approximately 79% on average for all prevalence and risk ratios studied. Prevalences of 0.9 and 0.1 corresponded to the largest sample sizes for all the risk ratios and numbers of genetic variants in the group. There was little difference between sample size requirements for prevalence ranges between 0.3 and 0.6 for large values of k for the given risk ratios. When k is greater than 4 and R = 2.0, the difference in required sample size for the range of prevalence from 0.3 to 0.6 was less than 6 observations. Indicative of this result, the surfaces shown in all three figures have a relatively flat bottom for k greater than 4 and for the range of prevalence from 0.3 to 0.6. As expected, the sample size requirement declined with increasing R. A theoretical explanation of these results is given below.
The difference between δk+1 and δk declines with increasing k and approaches 1 for large values of k; hence, the successive difference between sample size requirements declines with increasing values of k.
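The behaviour of δk can be inspected numerically. The sketch below is an illustration under a stated assumption: it reads δk as the noncentrality parameter at which a chi-square test with k degrees of freedom reaches the target power (80% at the 5% level, the values used in the Discussion). It uses only the Python standard library; all function names are hypothetical.

```python
import math


def reg_lower_gamma(a, x):
    """Regularized lower incomplete gamma P(a, x) via its power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-16 * abs(total):
        n += 1
        term *= x / (a + n)
        total += term
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))


def chi2_cdf(x, k):
    """CDF of the central chi-square distribution with k degrees of freedom."""
    return reg_lower_gamma(k / 2.0, x / 2.0)


def ncx2_cdf(x, k, lam, terms=150):
    """Noncentral chi-square CDF as a Poisson(lam / 2) mixture of central CDFs."""
    total = 0.0
    weight = math.exp(-lam / 2.0)
    for j in range(terms):
        total += weight * chi2_cdf(x, k + 2 * j)
        weight *= (lam / 2.0) / (j + 1)
    return total


def bisect(f, lo, hi, tol=1e-6):
    """Root of an increasing function f on [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def noncentrality(k, alpha=0.05, power=0.80):
    """Noncentrality giving the requested power for a k-df chi-square test."""
    crit = bisect(lambda c: chi2_cdf(c, k) - (1.0 - alpha), 0.0, 200.0)
    # Power = 1 - F(crit; k, lam), which increases with lam.
    return bisect(lambda lam: (1.0 - ncx2_cdf(crit, k, lam)) - power, 0.0, 100.0)


deltas = [noncentrality(k) for k in range(1, 11)]
for k, d in enumerate(deltas, start=1):
    step = "" if k == 1 else f"  increment {d - deltas[k - 2]:.3f}"
    print(f"k={k:2d}  delta_k={d:.3f}{step}")
```

The printed increments show how quickly the successive differences shrink as k grows under this reading of δk.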
Table 1. Prevalence and odds ratios of five genetic variants for colorectal cancer susceptibility. Genotype comparisons: rare allele vs. others; null vs. others; α2 allele vs. others; NAT2 [imputed from phenotype] (4), fast acetylation vs. others; wild-type vs. variant (C677T).
Table 2. Sample size requirement to detect a difference in mean exposure between cases and controls for some combinations of the genetic variants given in Table 1, assuming multiplicative risk.
GSTT1 and MTHFR have the smallest odds ratios (1.37 and 1.35 respectively) in Table 1 and the largest sample size requirements (656 and 705 respectively). The higher sample size for MTHFR reflects the small difference (0.02) in R, even though the prevalence for MTHFR is greater than for GSTT1 (0.423 versus 0.376). This shows that when prevalence is closer to 0.5, the sample size requirement is more sensitive to differences in R. The smallest sample size (130) corresponds to TNF-α, which has an odds ratio of 2.02 and a prevalence of 0.392. The largest odds ratio, 2.67, for HRAS1 corresponds to a larger sample size because of the very low prevalence (0.04).
These results for individual genetic variants seem to carry over to groups of genetic variants. For example, among groups of two genetic variants drawn from the five given in Table 1, the combination GSTT1 and MTHFR corresponds to the largest sample size (417), and the combination TNF-α and NAT2, which have odds ratios of 2.02 and 1.68, respectively, corresponds to the smallest sample size (107). For a group of three genetic variants, the combination HRAS1, GSTT1 and MTHFR corresponds to the largest sample size requirement (215). These are the three genetic variants that have the largest sample size requirements when considered individually. Overall, as seen before, the sample size requirement declined as the number of genetic variants in the group increased. The sample size requirement for all five genetic variants given in Table 1 is 91.
We have presented a simple method for estimating the sample size for case-control studies required to detect a group of genetic variants using multiplicative models. We have also used the same approach for additive risk models; however, we could not show the asymptotic normality of the joint distribution of exposure for cases (Appendix A2).
In the multiplicative model, when the genetic variants are found to be jointly significant, subsequent multiple tests could be conducted to detect which of the Ri are significantly different from 1. For example, if the null hypothesis is rejected for a group of five genetic variants, and R1, R2 and R5 are significantly different from 1, we can conclude that the joint effect of G1, G2 and G5 is significantly different between cases and controls.
Consider k hypothesis tests. Under the null hypothesis, by the Bonferroni inequality, the probability that at least one of the k tests is significant at level α0 is less than or equal to kα0. To maintain an overall level of significance α, we would therefore use the significance level α0 = α/k for each of the k separate tests. Several less conservative adjustments for multiple tests of significance have been proposed, such as the procedures of Holm and Hochberg. These procedures order the test statistics from largest to smallest and then apply progressively less restrictive significance levels to the second, third, and subsequent tests. When any one test is not significant, the procedure stops and all further tests are also declared non-significant. Benjamini and Hochberg suggested that the False Discovery Rate (FDR) may be the appropriate error rate to control in many applied multiple testing problems. The FDR is the expected proportion of erroneous rejections among all rejections. They gave a simple FDR-controlling procedure for independent test statistics and showed it to be much more powerful than comparable procedures that control the traditional family-wise error rate (the probability of erroneously rejecting even one of the true null hypotheses).
One could conduct a simultaneous test of the k-parameter joint null hypothesis using the multiple tests discussed above as an alternative to our test. However, all these tests are conservative compared to the multivariate test presented here. On the other hand, multiple comparison tests can be applied in instances in which the k-statistic vector is not normally distributed, making these tests suitable for the additive model given in Appendix A2.
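The three adjustments discussed above can be compared on a small example. This is a minimal sketch using the textbook forms of the Bonferroni, Holm (step-down) and Benjamini-Hochberg (step-up) procedures; the function names and the p-values are hypothetical and are not taken from the paper.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H_i when p_i <= alpha / k."""
    k = len(pvals)
    return [p <= alpha / k for p in pvals]


def holm(pvals, alpha=0.05):
    """Holm step-down: compare the j-th smallest p-value with alpha / (k - j + 1)."""
    k = len(pvals)
    order = sorted(range(k), key=lambda i: pvals[i])
    reject = [False] * k
    for rank, i in enumerate(order):  # rank = 0 for the smallest p-value
        if pvals[i] > alpha / (k - rank):
            break  # stop at the first non-significant test
        reject[i] = True
    return reject


def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up: reject the j* smallest p-values, where
    j* is the largest j with p_(j) <= q * j / k."""
    k = len(pvals)
    order = sorted(range(k), key=lambda i: pvals[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / k:
            cutoff = rank
    reject = [False] * k
    for i in order[:cutoff]:
        reject[i] = True
    return reject


# Hypothetical p-values for k = 5 individual variants.
pvals = [0.03, 0.001, 0.2, 0.014, 0.012]
print("Bonferroni:", bonferroni(pvals))
print("Holm:      ", holm(pvals))
print("BH (FDR):  ", benjamini_hochberg(pvals))
```

On these inputs Bonferroni rejects one hypothesis, Holm three and Benjamini-Hochberg four, which illustrates the ordering from most to least conservative described in the text.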
Garcia-Closas et al. evaluated the influence of common genetic variation in the nucleotide excision repair (NER) pathway on bladder cancer risk by analyzing 22 single nucleotide polymorphisms (SNPs) in seven NER genes (XPC, RAD23B, ERCC1, ERCC2, ERCC4, ERCC5, and ERCC6). They estimated odds ratios for each individual polymorphism using logistic regression. They then performed a global test for the association between genetic variation in the NER pathway as a whole and bladder cancer risk, based on the maximum of the trend statistics of all the individual polymorphisms. The p-value for the global test was computed by the permutation method described by Westfall and Young. They found significant associations with SNPs in four of the seven NER genes. They used 1150 cases and an almost equal number of controls. The p-value for the global test for pathway effects was 0.04. Their minor allele frequencies ranged from 0.01 to 0.33 and the odds ratios ranged from 0.8 to 1.4, with an average odds ratio of 1.2. If the odds ratios and SNP frequencies were known (assuming an average odds ratio of 1.2 and a dominant model), the sample size required to achieve 80% power at the 5% level of significance in detecting the overall effect of 22 SNPs using our method would be 212 cases. In situations in which none of the genetic variants turns out to be significant, the method described in this paper could have reduced the cost of the experiment by first screening the group of genetic variants for overall significance.
The results obtained here can be easily extended to a group of k genetic variants and l environmental factors, when the exposure to the ith environmental factor can be specified as Ei = 1 (present) or Ei = 0 (absent) and the Ei are independent among themselves and independent of the genetic variants.
Our approach is limited by its inability to look at higher order interactions and by the assumption of independence between all loci. Covariance terms in the variance-covariance matrix could increase the sample size needed to detect the group of genetic variants. It is possible that we may not detect individual effects, but there may be joint effects due to interactions. Our method cannot detect these interactions. Our sample size is constrained by our assumption of the normal approximation to the binomial distribution. Another limitation is the assumption of multiplicative effects of genetic variants. True biologic interactions could be more complex, with epistasis and/or other genetic phenomena; furthermore, joint genetic effects and gene-environment interactions on risk may be neither additive nor multiplicative. Unfortunately, for statistical modeling, epidemiologic analyses have had to deal with multiplicative or additive models. The rare disease assumption in case-control studies has been discussed in many papers [17, 18]. Generally, since most diseases are infrequent, ORs are good estimators of relative risks under this "rare disease assumption". For a disease with a frequency of 10%, which is high, the difference between OR and RR is still only 10%. The only requirement in our genetic model is the ability to express exposure due to genotype as 1 (presence of genotype) or 0 (absence of genotype). Therefore, either dominant or recessive models can be used in our analysis.
A non-parametric approach to this problem is multifactor dimensionality reduction (MDR), introduced by Ritchie et al. as a method of reducing the dimensionality of multilocus information to improve the identification of polymorphism combinations associated with disease risk. This data reduction approach seeks to identify combinations of multilocus genotypes and discrete environmental factors that are associated with either high or low risk of disease, and defines a single variable that divides these combinations into high-risk and low-risk groups. When it was applied to a sporadic breast cancer case-control data set, in the absence of statistically significant independent main effects, MDR identified a statistically significant higher-order interaction among four polymorphisms from three different estrogen-metabolism genes. Limitations of MDR include its applicability only to case-control studies that are balanced, and the difficulty of interpreting MDR models. Three different strategies for improving the power of MDR to detect epistasis in imbalanced datasets have been evaluated in a recent paper.
Another recent approach that holds great promise is logic regression, introduced by Ruczinski et al. as a tool to detect interactions between binary predictors that are associated with a response variable. Logic regression is an adaptive regression methodology that attempts to construct predictors as Boolean combinations of binary covariates. According to the authors, logic regression is the only methodology that searches for Boolean combinations of predictors in the entire space of such combinations while being completely embedded in a regression framework, where the quality of the model is determined by the respective objective functions of the regression class.
Suppose there are k genetic variants in a group of genetic variants and only r of them are associated with the disease. The prevalence of each of (k-r) genetic variants that are not associated with the disease (relative risk of each genetic variant is equal to 1) is identical for cases and controls. Therefore, from equation (1), the sample size required to detect the k genetic variants is identical to the sample size required to detect the r genetic variants associated with the disease. Since our sample size is a function of the squares of the difference between prevalence of genetic variants in cases and controls, our method is valid even when we have a combination of positively and negatively associated genetic variants.
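The two points made here, that unassociated variants contribute nothing and that the sign of the association does not matter, can be checked with a short sketch. It assumes the multiplicative model with independent variants, under which the case prevalence of a variant with control prevalence g and relative risk r works out to g·r / (1 + g(r − 1)); that closed form, the function names and the example values are assumptions of this illustration rather than quantities quoted from the paper.

```python
def case_prevalence(g, r):
    """Prevalence among cases for control prevalence g and relative risk r
    (assumed multiplicative model with independent variants)."""
    return g * r / (1.0 + g * (r - 1.0))


def squared_differences(G, R):
    """Squared case-control prevalence difference for each variant."""
    return [(case_prevalence(g, r) - g) ** 2 for g, r in zip(G, R)]


# Hypothetical group: one risk variant, one null variant, one protective variant.
G = [0.3, 0.4, 0.2]
R = [2.0, 1.0, 0.5]
sq = squared_differences(G, R)
for g, r, s in zip(G, R, sq):
    print(f"G={g:.2f}  R={r:.2f}  squared difference={s:.5f}")
```

The null variant (R = 1) has a squared difference of exactly zero, while the protective variant (R < 1) contributes a positive term just as the risk variant does.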
One advantage of our method is the simultaneous test of difference of mean exposure instead of multiple testing. Thus, for a range of reasonable numbers of genetic variants, the sample size requirement declines with the increasing number of genetic variants. It is possible that the sample size required to detect a group of genetic variants could increase when adding a genetic variant to the group. However, the sample size required to detect the group with this genetic variant is still less than the sample size required to detect the genetic variant alone or to detect a subset of the genetic variants containing this genetic variant. When testing for an effect size in a group of genetic variants, one can use the global test described in this paper as a screening tool, because the sample size required to detect an effect size in the group is comparatively small. Note that we are comparing the ability to detect at least one of many genetic variants (global test) with the power to detect just one, which are different null hypotheses. If the global test is non-significant, testing for individual genetic variants that require a large sample size is not necessary.
More methodological work is needed in this area to detect joint effects of multiple genetic variants. Our method could be viewed as a screening tool for assessing groups of genetic variants involved in pathogenesis and etiology of common complex human diseases.
Let f0(X1, X2,..., Xk) be the joint probability density function among controls and f1(X1, X2,..., Xk) be the joint probability density function among cases. If D̄ denotes the controls and D denotes the cases, then
f0(X1, X2,..., Xk) = Pr[(X1, X2,..., Xk) | D̄] and f1(X1, X2,..., Xk) = Pr[(X1, X2,..., Xk) | D]
The probability density function of the exposure variables in the population at risk becomes:
f(X1, X2,..., Xk) = f0(X1, X2,..., Xk)Pr(D̄) + f1(X1, X2,..., Xk)Pr(D)
The summation is over all the possible values each Xi can assume (0 and 1).
Yang et al. defined M as the lifetime risk, in the population as a whole, of a common disease involving k genetic variants for multiplicative models.
A comparison of (A) with (C) shows that the joint distribution of exposure among cases has the same form as that of controls; however, they have different parameters for prevalence of the genetic variants and the assumption of independence of exposure variables for controls results in the independence of exposure variables for cases. The prevalence of the ith genetic variant among cases is given by .
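The independence result can be made explicit with a short derivation. This is a sketch under stated assumptions, not a reproduction of the paper's equations: the exposures Xi are taken to be independent Bernoulli variables with prevalence Gi among controls, the risk is taken to be proportional to the product of Ri^{Xi}, and the symbol G_i^{*} for the case prevalence is notation introduced here for illustration.

```latex
\[
f_1(x_1,\ldots,x_k)
  = \frac{\prod_{i=1}^{k} G_i^{x_i}(1-G_i)^{1-x_i}\,R_i^{x_i}}
         {\sum_{x_1,\ldots,x_k}\prod_{i=1}^{k} G_i^{x_i}(1-G_i)^{1-x_i}\,R_i^{x_i}}
  = \prod_{i=1}^{k}\frac{(G_i R_i)^{x_i}(1-G_i)^{1-x_i}}{1+G_i(R_i-1)},
\]
\[
G_i^{*} = \Pr(X_i = 1 \mid D) = \frac{G_i R_i}{1+G_i(R_i-1)} .
\]
```

The denominator factorizes because each Xi takes only the values 0 and 1, so the sum over all strata is the product of the terms (1 − Gi) + Gi·Ri. The result is again a product of independent Bernoulli terms, and G_i^{*} equals Gi exactly when Ri = 1.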
This test is identical to the test:
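H0: Gi equals the prevalence of the ith genetic variant among cases for i = 1, 2,..., k, versus H1: Gi differs from the prevalence among cases for at least one i (i = 1, 2,..., k).
where the critical value is the 100(1-α)% probability point of the chi-square distribution with k degrees of freedom.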
where a0 = I and ai = (Ri-1)I.
where the probability density function, f, is given by (1).
It can be shown that A = a0 +a1G1+ a2G2+...+ akGk.
This is not an identifiable probability density function. Although it can be shown that the marginal distributions have asymptotically normal distributions, this does not guarantee the asymptotic normality of the joint distribution.
The findings and conclusions in this report are those of the author(s) and do not necessarily represent the views of the Centers for Disease Control and Prevention/the Agency for Toxic Substances and Disease Registry.
- Guttmacher AE, Collins FS: Realizing the promise of genomics in biomedical research. JAMA. 2005, 294: 1399-1402. 10.1001/jama.294.11.1399
- Khoury MJ, Millikan R, Little J, Gwinn M: The emergence of epidemiology in the genomics age. Int J Epidemiol. 2004, 33: 936-44. 10.1093/ije/dyh278
- Ioannidis JPA: Genetic associations: false or true? Trends Mol Med. 2003, 9: 135-8. 10.1016/S1471-4914(03)00030-3
- Bracken MB: Genomic epidemiology of complex disease: the need for an electronic evidence-based approach to research synthesis. Am J Epidemiol. 2005, 162: 297-301. 10.1093/aje/kwi200
- Ritchie MD, Hahn LW, Roody N, Bailey LR, Dupont WD, Parl FF, Moore JH: Multifactor-Dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet. 2001, 69: 138-47. 10.1086/321276
- Endler G, Mannhalter C: Polymorphisms in coagulation factor genes and their impact on arterial and venous thrombosis. Clin Chim Acta. 2003, 330: 31-55. 10.1016/S0009-8981(03)00022-6
- Relton CL, Wilding CS, Pearce MS, Laffling AJ, Jonas PA, Lynch SA, Tawn EJ, Burn J: Gene-gene interaction in folate-related genes and risk of neural tube defects in a UK population. J Med Genet. 2004, 41: 256-60. 10.1136/jmg.2003.010694
- McKeown-Eyssen GE, Thomas DC: Sample size determination in case-control studies: The influence of the distribution of exposure. J Chron Dis. 1985, 38: 559-68. 10.1016/0021-9681(85)90044-X
- Rao BR: Joint distribution of simultaneous exposures to several carcinogens in a case-control study: sample size determination. Commun Statist-Theor Meth. 1986, 15: 3035-65. 10.1080/03610928608829294
- Lachin JM: Introduction to sample size determination and power analysis of clinical trials. Control Clin Trials. 1981, 2: 93-113. 10.1016/0197-2456(81)90001-5
- Yang Q, Khoury MJ, Friedman JM, Little J, Flanders WD: How many genes underlie the occurrence of common complex diseases in the population?. Int J Epidemiol. 2005, 34: 1129-37. 10.1093/ije/dyi130
- Holm S: A simple sequentially rejective multiple procedure. Scand J Statist. 1979, 6: 65-70.
- Hochberg Y: A sharper Bonferroni procedure for multiple tests of significance. Biometrika. 1988, 75: 800-02. 10.1093/biomet/75.4.800
- Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: a Practical and Powerful Approach to Multiple Testing. JRSS B. 1995, 57: 289-300.
- Garcia-Closas M, Malats N, Real FX, Welch R, Kogevinas M, Chatterjee N, Pfeiffer R, Silverman D, Dosemeci M, Tardon A, Serra C, Carrato A, Garcia-Closas R, Castano-Vinyals G, Chanock S, Yeager M, Rothman N: Genetic Variation in the Nucleotide Excision Repair Pathway and Bladder Cancer Risk. Cancer Epidemiol Biomarkers Prev. 2006, 15: 536-42. 10.1158/1055-9965.EPI-05-0749
- Westfall PH, Young SS: Resampling based multiple testing. New York: John Wiley & Sons, Inc; 1993.
- Greenland S, Thomas DC: On the need for the rare disease assumption in case-control studies. Am J Epidemiol. 1982, 116: 547-53.
- Yanagawa T: Designing case-control studies. Environ Health Perspect. 1979, 32: 143-56. 10.2307/3429012
- Velez DR, White BC, Motsinger AA, Bush WS, Ritchie MD, Scott MW, Moore JH: A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction. Genet Epidemiol. 2007, 31: 306-15. 10.1002/gepi.20211
- Ruczinski I, Kooperberg C, LeBlanc M: Logic Regression. J Comp Graph Stats. 2003, 12: 475-511. 10.1198/1061860032238
- Lui K: Sample size determination for multiple continuous risk factors in case-control studies. Biometrics. 1993, 49: 873-76. 10.2307/2532207
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.