Marcus R Munafò, Kate Tilling, Amy E Taylor, David M Evans, George Davey Smith.
Abstract
Large-scale cross-sectional and cohort studies have transformed our understanding of the genetic and environmental determinants of health outcomes. However, the representativeness of these samples may be limited, either through selection into studies or by attrition from studies over time. Here we explore the potential impact of this selection bias on results obtained from these studies, from the perspective that this amounts to conditioning on a collider (i.e. a form of collider bias). Whereas it is acknowledged that selection bias will have a strong effect on representativeness and prevalence estimates, it is often assumed that it should not have a strong impact on estimates of associations. We argue that because selection can induce collider bias (which occurs when two variables independently influence a third variable, and that third variable is conditioned upon), selection can lead to substantially biased estimates of associations. In particular, selection related to phenotypes can bias associations with genetic variants associated with those phenotypes. In simulations, we show that even modest influences on selection into, or attrition from, a study can generate biased and potentially misleading estimates of both phenotypic and genotypic associations. Our results highlight the value of knowing which population your study sample is representative of. If the factors influencing selection and attrition are known, they can be adjusted for. For example, having DNA available on most participants in a birth cohort study offers the possibility of investigating the extent to which polygenic scores predict subsequent participation, which in turn would enable sensitivity analyses of the extent to which bias might distort estimates.
Year: 2018 PMID: 29040562 PMCID: PMC5837306 DOI: 10.1093/ije/dyx206
Source DB: PubMed Journal: Int J Epidemiol ISSN: 0300-5771 Impact factor: 7.196
Figure 1. Illustration of collider bias. The basic premise of collider bias is shown. In this example, a bell is sounded whenever either coin comes up ‘heads’. The result of one coin toss is independent of the other. However, if we hear the bell ring (i.e. we condition on the bell ringing) and see a tail on one coin, we know there must be a head on the other: the two coin results are no longer independent, and a spurious inverse correlation has been induced. Reproduced from Gage SH, Davey Smith G, Ware JJ, Flint J, Munafò MR. G = E: What GWAS can tell us about the environment. PLoS Genet 2016;12: e1005765.
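The coin-and-bell example can be checked numerically. The following is a minimal sketch, not taken from the paper: the function names, the number of tosses, and the seed are illustrative assumptions. It tosses two independent fair coins, then compares their correlation across all tosses with their correlation among only those tosses where the bell rang (at least one head).

```python
import random


def correlation(xs, ys):
    """Pearson correlation, written out by hand to keep the sketch self-contained."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_coins(n_tosses=100_000, seed=1):
    """Return (correlation over all tosses, correlation given the bell rang).

    n_tosses and seed are arbitrary illustrative choices.
    """
    rng = random.Random(seed)
    coin_a = [rng.randint(0, 1) for _ in range(n_tosses)]  # 1 = heads, 0 = tails
    coin_b = [rng.randint(0, 1) for _ in range(n_tosses)]

    # The bell (the collider) rings whenever either coin shows heads.
    rang = [(a, b) for a, b in zip(coin_a, coin_b) if a or b]
    a_rang = [a for a, _ in rang]
    b_rang = [b for _, b in rang]

    return correlation(coin_a, coin_b), correlation(a_rang, b_rang)


r_all, r_bell = simulate_coins()
print(f"all tosses:     r = {r_all:+.3f}")
print(f"bell rang only: r = {r_bell:+.3f}")
```

Across all tosses the correlation should sit near zero. Among the tosses where the bell rang, only the outcomes head/head, head/tail and tail/head remain, each equally likely, which gives a theoretical correlation of −0.5.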
Figure 2. Illustration of selection bias simulation. In the intended study population there is no association between allele score and outcome. Selection into the study (either through voluntary participation at baseline, or attrition over time) induces an association between allele score and outcome (collider bias).
Results of simulation study showing the selection bias in estimating an association that is null in the intended study population
Columns 1 and 2 give the simulation settings; columns 3 to 5 give the results for the association between allele score and outcome.

| Association between missingness and both phenotype and outcome (OR) | Association between allele score and phenotype, r (variance explained) | Mean regression coefficient (SD) | Mean z-score (SD) | Number of 95% CIs containing zero |
|---|---|---|---|---|
| 1.8 | 0.05 (0.25%) | −0.001 (0.001) | −1.04 (1.00) | 83 |
| 1.8 | 0.10 (1.00%) | −0.003 (0.001) | −2.06 (0.98) | 45 |
| 1.8 | 0.15 (2.25%) | −0.004 (0.001) | −3.07 (0.98) | 9 |
| 1.8 | 0.20 (4.00%) | −0.006 (0.001) | −4.10 (0.98) | 0 |
| 1.8 | 0.30 (9.00%) | −0.008 (0.001) | −6.18 (1.06) | 0 |
| 1.5 | 0.05 (0.25%) | −0.001 (0.001) | −0.42 (0.95) | 94 |
| 1.5 | 0.10 (1.00%) | −0.001 (0.001) | −0.80 (0.96) | 89 |
| 1.5 | 0.15 (2.25%) | −0.001 (0.001) | −1.22 (0.96) | 77 |
| 1.5 | 0.20 (4.00%) | −0.002 (0.001) | −1.64 (0.97) | 61 |
| 1.5 | 0.30 (9.00%) | −0.003 (0.001) | −2.44 (0.94) | 35 |
| 1.2 | 0.05 (0.25%) | −0.0002 (0.001) | −0.16 (0.92) | 97 |
| 1.2 | 0.10 (1.00%) | −0.0003 (0.001) | −0.25 (0.94) | 97 |
| 1.2 | 0.15 (2.25%) | −0.0005 (0.001) | −0.38 (0.95) | 93 |
| 1.2 | 0.20 (4.00%) | −0.0006 (0.001) | −0.47 (0.95) | 91 |
| 1.2 | 0.30 (9.00%) | −0.0009 (0.001) | −0.66 (0.96) | 89 |
Each scenario was simulated 100 times.
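The mechanism behind the null-scenario table can be sketched in a few lines. This is an illustrative simulation under stated assumptions, not the authors' code: it assumes standardized normal variables, a logistic missingness model in which the odds of being missing are multiplied by the OR per SD of both phenotype and outcome, and roughly 50% missingness at the mean. The sample size, the seed, and the function names are hypothetical, so the output is not expected to reproduce the tabulated values, only the direction of the bias.

```python
import math
import random


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def simulate_selection(n=200_000, r=0.3, odds_ratio=1.8, seed=2018):
    """Return (slope in full population, slope in selected sample, n selected).

    Assumed model (illustrative, not the paper's exact specification):
      allele score ~ N(0, 1)
      phenotype    = r * score + noise, so corr(score, phenotype) = r
      outcome      ~ N(0, 1), independent of score and phenotype (null effect)
      logit P(missing) = log(OR) * phenotype + log(OR) * outcome
    """
    rng = random.Random(seed)
    log_or = math.log(odds_ratio)
    noise_sd = math.sqrt(1 - r * r)

    score_all, outcome_all = [], []
    score_sel, outcome_sel = [], []
    for _ in range(n):
        score = rng.gauss(0, 1)
        phenotype = r * score + noise_sd * rng.gauss(0, 1)
        outcome = rng.gauss(0, 1)

        # Both phenotype and outcome raise the odds of being missing,
        # which makes study participation a collider between them.
        p_missing = 1 / (1 + math.exp(-(log_or * phenotype + log_or * outcome)))

        score_all.append(score)
        outcome_all.append(outcome)
        if rng.random() >= p_missing:  # participant is retained
            score_sel.append(score)
            outcome_sel.append(outcome)

    return (
        ols_slope(score_all, outcome_all),
        ols_slope(score_sel, outcome_sel),
        len(score_sel),
    )


slope_full, slope_selected, n_selected = simulate_selection()
print(f"full population: slope = {slope_full:+.4f}")
print(f"selected sample: slope = {slope_selected:+.4f} (n = {n_selected})")
```

In the full population the slope of outcome on allele score should be close to zero, whereas in the retained sample it should be negative, matching the sign of the coefficients in the table. Lowering `odds_ratio` or `r` shrinks the induced association, in line with the pattern across rows.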
Results of simulation study showing the selection bias in estimating an association that is not null in the intended study population (regression coefficient for outcome on phenotype is 0.1)
Columns 1 and 2 give the simulation settings; columns 3 to 5 give the results for the association between allele score and outcome.

| Association between missingness and both phenotype and outcome (OR) | Association between allele score and phenotype, r (variance explained) | Mean regression coefficient (SD) | True regression coefficient | Number of 95% CIs containing true value |
|---|---|---|---|---|
| 1.8 | 0.05 (0.25%) | 0.003 (0.001) | 0.005 | 78 |
| 1.8 | 0.10 (1.00%) | 0.006 (0.001) | 0.01 | 23 |
| 1.8 | 0.15 (2.25%) | 0.010 (0.001) | 0.015 | 2 |
| 1.8 | 0.20 (4.00%) | 0.013 (0.001) | 0.02 | 0 |
| 1.8 | 0.30 (9.00%) | 0.020 (0.001) | 0.03 | 0 |
| 1.5 | 0.05 (0.25%) | 0.004 (0.001) | 0.005 | 94 |
| 1.5 | 0.10 (1.00%) | 0.009 (0.001) | 0.01 | 86 |
| 1.5 | 0.15 (2.25%) | 0.013 (0.001) | 0.015 | 69 |
| 1.5 | 0.20 (4.00%) | 0.017 (0.001) | 0.02 | 53 |
| 1.5 | 0.30 (9.00%) | 0.026 (0.001) | 0.03 | 19 |
| 1.2 | 0.05 (0.25%) | 0.005 (0.001) | 0.005 | 98 |
| 1.2 | 0.10 (1.00%) | 0.01 (0.001) | 0.01 | 96 |
| 1.2 | 0.15 (2.25%) | 0.014 (0.001) | 0.015 | 94 |
| 1.2 | 0.20 (4.00%) | 0.019 (0.001) | 0.02 | 92 |
| 1.2 | 0.30 (9.00%) | 0.029 (0.001) | 0.03 | 95 |
Each scenario was simulated 100 times.
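Two of the settings columns in the tables follow from simple arithmetic. The sketch below assumes standardized allele score and phenotype (an assumption, though one consistent with the tabulated values): the variance explained is then r squared, and the true coefficient of outcome on allele score is r multiplied by the phenotype-outcome coefficient of 0.1. The names used are illustrative.

```python
ALLELE_PHENOTYPE_R = [0.05, 0.10, 0.15, 0.20, 0.30]  # values of r used in the tables
PHENOTYPE_OUTCOME_BETA = 0.1  # regression coefficient for outcome on phenotype


def derived_settings(r, beta=PHENOTYPE_OUTCOME_BETA):
    """Return (variance explained in %, true allele score-outcome coefficient).

    Assumes standardized variables, so the allele score's effect on the
    outcome is the product of the two path coefficients.
    """
    variance_pct = round(100 * r ** 2, 2)
    true_coefficient = round(r * beta, 3)
    return variance_pct, true_coefficient


rows = {r: derived_settings(r) for r in ALLELE_PHENOTYPE_R}
for r, (variance_pct, true_coefficient) in rows.items():
    print(f"r = {r:.2f}: {variance_pct:.2f}% variance, true coefficient {true_coefficient}")
```

Comparing these true coefficients with the mean estimates in the table shows the attenuation directly: for OR = 1.8 and r = 0.30 the estimate is 0.020 against a true value of 0.03.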
Associations between a genetic risk score for smoking and maternal education, in ALSPAC and ARIES
| Variable | N | OR (95% CI) | P-value |
|---|---|---|---|
| Smoking genetic risk score | 7291 | 1.07 (1.02 to 1.12) | 0.003 |
| Smoking (ever vs never) | 13249 | 0.59 (0.52 to 0.68) | <0.001 |
| Smoking genetic risk score | 7837 | 1.00 (0.93 to 1.07) | 0.92 |
| Maternal education | 12493 | 1.86 (1.58 to 2.19) | <0.001 |
| Smoking (ever vs never) | 12118 | 0.45 (0.40 to 0.50) | <0.001 |
| Smoking genetic risk score | 7046 | 1.01 (0.95 to 1.08) | 0.74 |
| Smoking (ever vs never) | 986 | 0.61 (0.44 to 0.84) | 0.003 |
| Smoking genetic risk score | 791 | 1.20 (1.02 to 1.41) | 0.03 |
a Genetic risk score including variants reaching P < 0.05 for association with ever vs never smoking in the Tobacco and Genetics Consortium GWAS (see Supplementary Material, available at IJE online). Associations are per SD increase in genetic risk score.
b Degree vs no degree.
Figure 3. Scenarios where selection bias would occur. A. In truth, the SNP is not causally associated with the outcome; selection will induce an association (which could be positive or negative). B. In truth, the SNP is not causally associated with the outcome; selection will induce an association (which could be positive or negative). C. In truth, the SNP is causally associated with the outcome; selection could make this larger or attenuate it. D. In truth, the SNP is causally associated with the outcome; selection could make this larger or attenuate it. E. In truth, the SNP is causally associated with the outcome; selection will bias this association (which could be positive or negative). F. Note that the association between P and O is biased in the selected sample; however, the association between SNP and O is unbiased in the selected sample. P, Phenotype; O, Outcome; S, Selection; U, Other variables.