Literature DB >> 29040562

Collider scope: when selection bias can substantially influence observed associations.

Marcus R Munafò^1,2, Kate Tilling^1,3, Amy E Taylor^1,2, David M Evans^1,4, George Davey Smith^1,3.

Abstract

Large-scale cross-sectional and cohort studies have transformed our understanding of the genetic and environmental determinants of health outcomes. However, the representativeness of these samples may be limited-either through selection into studies, or by attrition from studies over time. Here we explore the potential impact of this selection bias on results obtained from these studies, from the perspective that this amounts to conditioning on a collider (i.e. a form of collider bias). Whereas it is acknowledged that selection bias will have a strong effect on representativeness and prevalence estimates, it is often assumed that it should not have a strong impact on estimates of associations. We argue that because selection can induce collider bias (which occurs when two variables independently influence a third variable, and that third variable is conditioned upon), selection can lead to substantially biased estimates of associations. In particular, selection related to phenotypes can bias associations with genetic variants associated with those phenotypes. In simulations, we show that even modest influences on selection into, or attrition from, a study can generate biased and potentially misleading estimates of both phenotypic and genotypic associations. Our results highlight the value of knowing which population your study sample is representative of. If the factors influencing selection and attrition are known, they can be adjusted for. For example, having DNA available on most participants in a birth cohort study offers the possibility of investigating the extent to which polygenic scores predict subsequent participation, which in turn would enable sensitivity analyses of the extent to which bias might distort estimates.

Entities: Chemical Disease Mutation Species

Mesh：

Year: 2018 PMID： 29040562 PMCID： PMC5837306 DOI： 10.1093/ije/dyx206

Source DB: PubMed Journal: Int J Epidemiol ISSN： 0300-5771 Impact factor: 7.196

Introduction

Understanding the impact of genetic and environmental factors on physical and mental health outcomes is critical if we are to develop effective preventive and treatment interventions. Large-scale cross-sectional and cohort studies provide an invaluable resource to support these efforts, in particular with respect to genetic influences where the small effects associated with common genetic variants require very large samples to achieve adequate statistical power. A study can be used to draw conclusions about the population it represents (the “intended study population”), but generalizability to other populations depends upon us knowing exactly what the actual study population is. However, participants who volunteer to participate in studies may not be representative of the intended study population, in which case the actual study population is unknown. Some studies may be relatively representative of the intended study population at inception through rigorous efforts to ensure representative recruitment (e.g. birth cohort studies). However, as they mature the likelihood is that attrition from the study will be non-random, so that the cohort becomes less representative of the intended population as time goes on. Alternatively, the reverse may be true—the study may be unrepresentative at inception, but with low attrition. Selection bias can also occur if a sub-set of participants within a study is selected for more detailed investigation (e.g. genotyping) on the basis of having most data available or volunteering for further follow-up. There is already clear evidence from existing large-scale population studies that they are subject to a degree of selection bias. For example, higher genetic risk scores for schizophrenia are consistently associated with non-completion of questionnaires by study mothers and children, as well as non-attendance at data collection clinics, in the Avon Longitudinal Study of Parents and Children (ALSPAC) (see Box 1). Box 1 The Avon Longitudinal Study of Parents and Children Birth cohort studies are not immune to problems of selection bias, where retention in the study may be related to a variety of participant characteristics. The Avon Longitudinal Study of Parents and Children (ALSPAC) recruited pregnant women living in the administrative county of Avon, with expected delivery dates between 1 April 1991 and 31 December 1992. These women, their partners and their offspring have been followed up ever since via questionnaires and clinics. ALSPAC originally captured data on 14 541 pregnancies (75% of eligible women) (19, 36), but inevitably retention in subsequent data collection sweeps (postal questionnaires and clinic assessments) was less than 100%. We see that higher body mass index (BMI) is associated with lower odds of subsequent retention in both mothers (N = 11 319, OR per SD increase in BMI 0.85, 95% CI 0.81 to 0.88), for retention between 2008 and 2011 using pre-pregnancy BMI as a predictor, and offspring (N = 7954, OR 0.91, 95% CI 0.87 to 0.96), for retention at age 18 using BMI at age 7 as a predictor. Similarly, among smoking mothers in ALSPAC, heaviness of smoking is associated with lower odds of retention (N = 3534, OR per additional cigarette smoked per day just prior to pregnancy 0.97, 95% CI 0.96 to 0.98). If low BMI and maternal non-smoking are both related to continuing participation in ALSPAC, this would tend to lead to the association between BMI and maternal smoking being negatively biased (i.e. we would expect to see a more negative association between smoking and BMI in ALSPAC than in the intended study population). Attrition from cohort studies may result in biased estimates of socioeconomic inequalities, and the degree of bias may worsen as participation rates decrease. However, it is often argued that representativeness is not necessary in studies of this kind, although this is not universally accepted. In particular for genetic variants, where conventional confounding is low, it has been argued even by those concerned about selection bias that any problems associated with a lack of representativeness may be modest., Here we ask: what is the impact of selection bias on the results obtained from these studies? We take the perspective that selection bias can amount to conditioning on a collider (i.e. conditioning on a variable that is independently influenced by two other variables).

Collider bias

It is widely acknowledged that selection bias will distort prevalence estimates. This can be clearly seen in differences between participants at baseline and at subsequent assessments in cohort studies, such as when we compare the original ALSPAC sample with those who attended later clinics (see Box 1). It can also be seen in differences between an actual study sample and the source population from which it is drawn (i.e., the intended study population); for example, the UK Biobank study differs relative to the general population in the UK (see Box 2). However, it is often assumed that although selection bias will have a strong effect on prevalence estimates, it should not have a strong impact on observed associations between variables. This overlooks the fact that selection can induce collider bias (see Figure 1), which can lead to biased observational and genetic associations. This bias can be towards or away from any true association, and can distort a true association or a true lack of association.

Figure 1

Illustration of collider bias. The basic premise of collider bias is shown. In this example, a bell is sounded whenever either coin come up ‘heads’. The result of one coin toss is independent of the other. However, if we hear the bell ring (i.e. we condition on the bell ringing), then if we see a tail on one coin we know there must be a head on the other–the two coin results are no longer independent and a spurious inverse correlation has been induced. Reproduced from Gage SH, Davey Smith G, Ware JJ, Flint J, Munafò MR. G = E: What GWAS can tell us about the environment. PLoS Genet 2016;12: e1005765.

Box 2 UK Biobank The UK Biobank is a cross-sectional study which recruited over 500 000 individuals aged between 40 and 69 years between 2006 and 2010 [http://www.ukbiobank.ac.uk/]. Individuals in this age group living within a 25-mile radius of any of the 22 assessment centres across the UK were identified from NHS patient registers. In total, around 9 million individuals were invited to participate. However, UK Biobank was only able to achieve a 5% response rate (∼500 000 participants recruited from ∼9 000 000 invited, personal communication, UK Biobank, 8 July 2016), and the resulting sample is not representative of the UK population as a whole. For example, the proportion of current smokers is relatively low in UK Biobank (19% in the general population vs 11% in UK Biobank, equivalent to an OR of 1.90), as is the proportion with no qualifications (25% vs 17%, equivalent to an OR of 1.63). Unsurprisingly, therefore, participants in UK Biobank have far lower rates of 5-year mortality than the UK population as a whole. Clearly, agreeing to take part in the UK Biobank study is associated with a number of characteristics that will reflect, for example, health status and social position. If non-smoking and having qualifications are both causally related to participation in UK Biobank, we would expect the association between smoking and having qualifications to be positively biased (i.e. we would expect to see a more positive association between genetic variants positively associated with smoking and whether participants had educational qualifications in UK Biobank than in the UK population). The problem is possibly compounded in genetic studies using the first release of genome-wide association data in UK Biobank, which used two genotyping arrays, one of which was applied to a nested case-control study of smoking and lung function (UK BiLEVE). The first release genetic data are therefore further subject to selection bias relative to UK Biobank as a whole (although this will no longer be the case when the full release of genome-wide association data becomes available). Illustration of collider bias. The basic premise of collider bias is shown. In this example, a bell is sounded whenever either coin come up ‘heads’. The result of one coin toss is independent of the other. However, if we hear the bell ring (i.e. we condition on the bell ringing), then if we see a tail on one coin we know there must be a head on the other–the two coin results are no longer independent and a spurious inverse correlation has been induced. Reproduced from Gage SH, Davey Smith G, Ware JJ, Flint J, Munafò MR. G = E: What GWAS can tell us about the environment. PLoS Genet 2016;12: e1005765. Collider bias occurs when two variables (X and Y) independently cause a third variable (Z). In this situation Z is a collider, and statistical adjustment for Z will bias the estimated causal association of X (exposure) on Y (outcome) (see Figure 2). Statistical adjustment of the XY association for a variable Z is equivalent to observing this association in a sub-population where all individuals share the same value of Z., Hence if both X and Y cause participation in a study (Z), then investigating associations in the selected sample (i.e. with Z = 1, indicating participation in the study) is equivalent to conditioning on Z, which in turn may induce collider bias.

Figure 2

Illustration of selection bias simulation. In the intended study population there is no association between allele score and outcome. Selection into the study (either through voluntary participation at baseline, or attrition over time) induces an association between allele score and outcome (collider bias). In other words, sample selection can bias associations between variables that influence participation or retention in a study. This can include inducing spurious associations when no such association exists in the intended study population or, if two variables are correlated in the intended study population and both cause selection, biasing the estimated correlation in the selected sample. Moreover, this selection bias will apply to the genetic correlates (or other ancestors) of these variables, unless the phenotypes are also controlled for. Therefore if genes Gx and Gy cause X (exposure) and Y (outcome), respectively, and both X and Y influence participation, then in the selected sample Gx will appear to be associated with Y (unless X is also controlled for). More complex situations can also give rise to collider bias, such as when the outcome (Y) does not directly cause selection into the study (i.e. it is a downstream consequence of something else that is causing selection into the study). If two traits influence participation (and therefore contribute to selection), selection bias amounts to implicitly conditioning on their common effect (i.e. participation)., This can in principle lead to biased associations between these two traits. There are exceptions to this, depending on the distribution of the outcome and the parametric analysis model used. For example, if the outcome (Y) is a binary phenotype, and logistic regression is used, then the odds ratio for the association between the single nucleotide polymorphism (SNP) and outcome may be unbiased even when the outcome causes selection (as is true of case-control studies). We have previously argued that these effects may be greater in case-control studies than prospective studies, and that since genetic associations have been similar across study designs, the impact of selection bias may in fact be modest. We have also previously argued that because conventional confounding is typically low for single genetic variants, problems of selection bias will be less in this context. However, given the rapid growth in studies using data from highly selected samples such as UK Biobank, and the use of genetic risk scores rather than single genetic variants, we revisited this question and used simulation to explore the potential impact of even relatively weak effects on participation. Given empirical evidence of selection in cross-sectional and cohort studies, what is the potential impact of this on observed phenotypic and genotypic associations?

Simulations

We simulated data on an allele score, a phenotype and an outcome, where both the phenotype and the outcome influence selection into the study, but there was no association between the allele score and the outcome in the intended study population (see Figure 2). The simulation scenario was based loosely on the UK Biobank, and we simulated selection into the study, so all the data on non-selected individuals are missing and therefore imputation is not a potential solution (see below), because this requires some data on which to base the imputation. All variables were Normally distributed, with a standard deviation of 1, and the sample size of the intended study population was 9 000 000. We assumed that phenotype and outcome had independent effects (i.e. no interaction on the additive scale) on the odds of selection into the sample, and for convenience we set these effects to be equal, and examined a weak association [odds ratio (OR) of 1.2 for missingness for a one standard deviation (SD) increase in phenotype/outcome] and two stronger associations (ORs of 1.5 and 1.8). These odds ratios are similar to estimates of the likelihood of participation in UK Biobank for individuals with any educational or vocational qualifications and for non–smokers, respectively (see Box 2), and indicate a difference in mean phenotype/outcome of 0.2 SD, 0.4 SD and 0.6 SD between those participating and those not participating. We varied the correlation between the allele score and the phenotype (between r = 0.05 and r = 0.30) to simulate genetic instruments explaining between 0.25% and 9% of the variance in phenotypes. These values are in the typical range for the association between common genetic variants, or polygenic risk scores comprising multiple common variants, and complex phenotypes. For example, the rs16969968 variant accounts for approximately 1% of the phenotypic variance in cigarette consumption, whereas the polygenic risk score for height captures approximately 9% of phenotypic variance. We controlled the baseline risk of selection into the sample, resulting in a selected sample of approximately 500 000 people. The analysis was an unadjusted regression of outcome on allele score (i.e. not adjusting for the phenotype). We simulated a true null association (i.e. in the intended study population, the regression coefficient for outcome on allele score is zero). We simulated each scenario 100 times. We then repeated the simulations with the addition of a causal effect of the phenotype on the outcome, with a regression coefficient of 0.1. The results of this simulation study are shown in Table 1 (no causal effect of P on O) and Table 2 (causal effect of P on O). Where there is no causal effect of P on O, the effects of selection bias are strongest for stronger independent selection effects, and also where the allele score is more strongly associated with the phenotype (Table 1). However, even for moderate associations between missingness and both phenotype and outcome (OR = 1.5 for both phenotype and outcome) and between allele score and phenotype (r = 0.1, 1% variance explained by allele score) the confidence intervals contain zero only 89% of the time, and this continues to decrease with both greater strength of association between phenotype, outcome and missingness, and stronger association between allele score and phenotype.

Table 1

Results of simulation study showing the selection bias in estimating an association that is null in the intended study population

Simulation settings		Results—association between allele score and outcome
Association between missingness and both phenotype and outcome (OR)	Association between allele score and phenotype (r)	Mean regression coefficient (SD)	Mean z-score (SD)	Number of 95% CIs containing zero
OR = 1.8	0.05	−0.001 (0.001)	−1.04 (1.00)	83
	(0.25% variance)	−0.001 (0.001)	−1.04 (1.00)	83
	0.10	−0.003 (0.001)	−2.06 (0.98)	45
	(1.00% variance)	−0.003 (0.001)	−2.06 (0.98)	45
	0.15	−0.004 (0.001)	−3.07 (0.98)	9
	(2.25% variance)	−0.004 (0.001)	−3.07 (0.98)	9
	0.20	−0.006 (0.001)	−4.10 (0.98)	0
	(4.00% variance)	−0.006 (0.001)	−4.10 (0.98)	0
	0.30	−0.008 (0.001)	−6.18 (1.06)	0
	(9.00% variance)	−0.008 (0.001)	−6.18 (1.06)	0
OR = 1.5	0.05	−0.001 (0.001)	−0.42 (0.95)	94
	(0.25% variance)	−0.001 (0.001)	−0.42 (0.95)	94
	0.10	−0.001 (0.001)	−0.80 (0.96)	89
	(1.00% variance)	−0.001 (0.001)	−0.80 (0.96)	89
	0.15	−0.001 (0.001)	−1.22 (0.96)	77
	(2.25% variance)	−0.001 (0.001)	−1.22 (0.96)	77
	0.20	−0.002 (0.001)	−1.64 (0.97)	61
	(4.00% variance)	−0.002 (0.001)	−1.64 (0.97)	61
	0.30	−0.003 (0.001)	−2.44 (0.94)	35
	(9.00% variance)	−0.003 (0.001)	−2.44 (0.94)	35
OR = 1.2	0.05	−0.0002 (0.001)	−0.16 (0.92)	97
	(0.25% variance)	−0.0002 (0.001)	−0.16 (0.92)	97
	0.10	−0.0003 (0.001)	−0.25 (0.94)	97
	(1.00% variance)	−0.0003 (0.001)	−0.25 (0.94)	97
	0.15	−0.0005 (0.001)	−0.38 (0.95)	93
	(2.25% variance)	−0.0005 (0.001)	−0.38 (0.95)	93
	0.20	−0.0006 (0.001)	−0.47 (0.95)	91
	(4.00% variance)	−0.0006 (0.001)	−0.47 (0.95)	91
	0.30	−0.0009 (0.001)	−0.66 (0.96)	89
	(9.00% variance)	−0.0009 (0.001)	−0.66 (0.96)	89

Each scenario was simulated 100 times.

Table 2

Results of simulation study showing the selection bias in estimating an association that is not null in the intended study population (regression coefficient for outcome on phenotype is 0.1)

Simulation settings		Results—association between allele score and outcome
Association between missingness and both phenotype and outcome (OR)	Association between allele score and phenotype (r)	Mean regression coefficient (SD)	True regression coefficient	Number of 95% CIs containing true value
OR = 1.8	0.05	0.003 (0.001)	0.005	78
	(0.25% variance)	0.003 (0.001)	0.005	78
	0.10	0.006 (0.001)	0.01	23
	(1.00% variance)	0.006 (0.001)	0.01	23
	0.15	0.010 (0.001)	0.015	2
	(2.25% variance)	0.010 (0.001)	0.015	2
	0.20	0.013 (0.001)	0.02	0
	(4.00% variance)	0.013 (0.001)	0.02	0
	0.30	0.020 (0.001)	0.03	0
	(9.00% variance)	0.020 (0.001)	0.03	0
OR = 1.5	0.05	0.004 (0.001)	0.005	94
	(0.25% variance)	0.004 (0.001)	0.005	94
	0.10	0.009 (0.001)	0.01	86
	(1.00% variance)	0.009 (0.001)	0.01	86
	0.15	0.013 (0.001)	0.015	69
	(2.25% variance)	0.013 (0.001)	0.015	69
	0.20	0.017 (0.001)	0.02	53
	(4.00% variance)	0.017 (0.001)	0.02	53
	0.30	0.026 (0.001)	0.03	19
	(9.00% variance)	0.026 (0.001)	0.03	19
OR = 1.2	0.05	0.005 (0.001)	0.005	98
	(0.25% variance)	0.005 (0.001)	0.005	98
	0.10	0.01 (0.001)	0.01	96
	(1.00% variance)	0.01 (0.001)	0.01	96
	0.15	0.014 (0.001)	0.015	94
	(2.25% variance)	0.014 (0.001)	0.015	94
	0.20	0.019 (0.001)	0.02	92
	(4.00% variance)	0.019 (0.001)	0.02	92
	0.30	0.029 (0.001)	0.03	95
	(9.00% variance)	0.029 (0.001)	0.03	95

Each scenario was simulated 100 times.

Results of simulation study showing the selection bias in estimating an association that is null in the intended study population Each scenario was simulated 100 times. Results of simulation study showing the selection bias in estimating an association that is not null in the intended study population (regression coefficient for outcome on phenotype is 0.1) Each scenario was simulated 100 times. Where there is a causal effect of P on O, the results are broadly similar, except that on the whole the confidence intervals had lower coverage than for the equivalent situation with no causal association. We also explored associations between known risk factors and outcomes in a representative birth cohort and a selected sub-study. We used ALSPAC as the birth cohort. Initially, 14 541 pregnant women who were expected to give birth between 1 April 1991 and 31 December 1992 were recruited into the study in the South West region of England. The study website contains details of all data available through a fully searchable data dictionary: [http://www.bris.ac.uk/alspac/researchers/data-access/data-dictionary/]. Ethics approval for the study was obtained from the ALSPAC Ethics and Law Committee and the local research ethics committees. We also used the Accessible Resource for Integrated Epigenomics Studies (ARIES), a sub-study of ALSPAC where a sub-set of 1018 mother-offspring pairs were selected based on availability of DNA samples at two time points for the mother (at an antenatal clinic and at a follow-up clinic when their offspring were mean age 15.5 years) and three time points for the offspring (at birth, childhood and adolescence. We investigated the association between a genetic risk score for smoking (ever vs never) and maternal education in ALSPAC, and in the ARIES sub-sample. The results are shown in Table 3, and indicated that the genetic risk score for smoking and maternal education are associated in ARIES, but not in the full sample.

Table 3

Associations between a genetic risk score for smoking and maternal education, in ALSPAC and ARIES

Association between genetic risk score and ever smoking in ALSPAC
	N	OR (95% CI)	P
Smoking genetic risk score ¹	7291	1.07 (1.02 to 1.12)	0.003

Association with being in the ARIES sub-study
	N	OR (95% CI)	P
Smoking (ever vs never)	13249	0.59 (0.52 to 0.68)	<0.001
Smoking genetic risk score^a	7837	1.00 (0.93 to 1.07)	0.92
Maternal education^b	12493	1.86 (1.58 to 2.19)	<0.001

Association between smoking/smoking genetic risk score and maternal education in ALSPAC and ARIES
ALSPAC	N	OR (95% CI)	P
Smoking (ever vs never)	12118	0.45 (0.40 to 0.50)	<0.001
Smoking genetic risk score^a	7046	1.01 (0.95 to 1.08)	0.74
ARIES	N	OR (95% CI)	P
Smoking (ever vs never)	986	0.61 (0.44 to 0.84)	0.003
Smoking genetic risk score^a	791	1.20 (1.02 to 1.41)	0.03

aGenetic risk score including variants reaching P < 0.05 for association with ever vs never smoking in the Tobacco and Genetics Consortium GWAS (see Supplementary Material, available at IJE online). Associations are per SD increase in genetic risk score.

bDegree vs no degree.

Associations between a genetic risk score for smoking and maternal education, in ALSPAC and ARIES aGenetic risk score including variants reaching P < 0.05 for association with ever vs never smoking in the Tobacco and Genetics Consortium GWAS (see Supplementary Material, available at IJE online). Associations are per SD increase in genetic risk score. bDegree vs no degree.

Conclusions

Our results indicate the potential for selection/attrition to generate biased and potentially misleading estimates of both phenotypic and genotypic associations. In particular, when polygenic scores (associated with a phenotype) that combine many genetic variants are used, association between the phenotype and participation will cause the score to be more strongly related to participation than each individual variant is. This, in turn, can potentially lead to serious bias. For this reason, studies using polygenic scores, genome-wide allelic scores and/or whole-genome genetic correlations, are most at risk of producing biased and potentially misleading results where there is reason to believe the actual study sample is not representative of the intended study population, but the mechanism of selection is unknown. The magnitude of effects we observed in our simulations, based on credible estimates of associations between both a phenotype or outcome and missingness, and between a polygenic score and a phenotype, are comparable to many reported associations derived from large but selected samples, such as between personality and cognitive function, and a range of physical and mental health outcomes,, and between chronotype (i.e. ‘morningness’) and years of education. Such associations could therefore plausibly be generated by selection bias. An appreciation of the potential impact of selection bias may also resolve inconsistencies in the literature, and help to explain apparently paradoxical findings. For example, genetic correlations between cognitive ability and a range of psychiatric disorders have been reported to differ in childhood and older age. One possible interpretation is that this is due to age-dependent pleiotropy, but another is that this is an artefact of different selection bias pressures at different ages. An example serves to illustrate this. Polygenic risk scores that maximally capture schizophrenia liability are associated with increased psychotic experiences in ALSPAC participants, but scores that use more stringent thresholds for including genetic variants are associated with reduced psychotic experiences. Since missing data are likely to be greater for participants who report psychotic experiences, as well as for those at higher genetic risk of a psychotic disorder, psychotic experiences may be relatively under-represented in participants with higher genetic risk, compared with those with lower genetic risk. Such collider bias could occur through initial selection, or selective drop-out, or both–for example, a study could be representative of its intended population initially, but become less representative as those of poorer health drop out due to death. The main difference between these two scenarios–initial selection and selection through attrition–is in the amount of information available on the missing individuals. Where some data are available for all participants (e.g. in the case of drop-out), then multiple imputation or inverse probability weighting can be used under some assumptions which are untestable given the observed data, to recover unbiased estimates of associations. However, where there is no information on missing individuals (e.g. we have no data on individuals who did not volunteer for participation into a study), then such methods cannot be used. External information (such as the expected proportion of males and females in the general population) could be used to investigate likely factors related to participation and to derive bias-adjusted estimates. A related issue is the use of case-control studies to examine associations with ‘secondary’ outcomes–that is, phenotypes other than the case/control outcome., In such studies, the association between genotype and secondary phenotype will be biased if both genotype and secondary phenotype are associated with case-control status. Case-control studies condition on case-control status, and thus again collider bias can bias the association between genotype and secondary phenotype. Various methods have been proposed to overcome this bias, including maximum likelihood and inverse probability weighting. This latter method requires some knowledge about the prevalence of case/control status in the intended study population, or the assumption that the disease is rare., We have discussed one important way in which selection into or out of a study can induce collider bias and spurious associations. There are other ways in which ascertainment can generate biases. For example, Figure 3 (panel B) shows a situation in which entry into a study is conditional upon the value of the phenotype (but not the outcome of interest) and where the phenotype does not cause the outcome, but the phenotype and outcome are correlated in unselected samples (i.e. due to genetic and/or environmental factors U). In this situation, collider bias occurs because conditioning on selection induces an association between SNPs related to the phenotype and the polygenic and/or environmental factors that influence the outcome. Therefore, SNPs that cause the phenotype only (i.e. do not in truth cause the outcome) may now show spurious relationships with the outcome variable. An example of the situation in Figure 3 (panel B) is when the phenotype increases mortality32–35–for example, in studies of smoking as a phenotype, where smoking is associated with premature mortality. In a cohort study which examines smoking, and then follows participants up for Alzheimer’s disease, those who die early (perhaps because of smoking-related illness) will never have the chance to be diagnosed with Alzheimer’s disease, and therefore smoking will appear to be a protective factor. Figure 3 (panels C to E) also shows examples where selection will bias the estimation of the causal effects of SNPs on the outcome. In these examples, SNPs that do cause the outcome directly via the phenotype will show either increased or decreased association in the selected sample, depending on the underlying genetic and environmental aetiology of both traits. In the situations depicted in Figure 3A, C and E, the association between phenotype and outcome (e.g. in an observational study) would also be biased. In contrast, Figure 3F shows a situation where selection will bias the association of the phenotype with the outcome, but the association of the SNP with the outcome will be unbiased. Other, more complex, situations can also lead to selection bias—we have not attempted to outline every possible case here. Algorithms for deciding whether a given causal analysis is biased by selection have been described, and could be used to decide whether bias is likely in a given case.

Figure 3

Scenarios where selection bias would occur. A. In truth, the SNP is not causally associated with the outcome; selection will induce an association (which could be positive or negative). B. In truth, the SNP is not causally associated with the outcome; selection will induce an association (which could be positive or negative). C. In truth, the SNP is causally associated with the outcome; selection could make this larger or attenuate it. D. In truth, the SNP is causally associated with the outcome; selection could make this larger or attenuate it. E. In truth, the SNP is causally associated with the outcome; selection will bias this association (which could be positive or negative). F. Note that the association between P and O is biased in the selected sample; however, the association between SNP and O is unbiased in the selected sample. P, Phenotype; O, Outcome; S, Selection; U, Other variables. Our results highlight the value of representative cohorts (including birth cohorts) where there is little or no selection into the cohort. The issue of whether the study is intended to be representative in the first place is a somewhat separate issue, albeit a related one (see Box 3). In addition, having some baseline data and DNA available on all participants at recruitment into the study at least offers the possibility of investigating the extent to which polygenic scores (and other measured factors at baseline) predict subsequent participation. Without this knowledge, studies in samples with unknown selection/attrition mechanisms run the risk of providing biased and misleading results. In our opinion these important caveats should be borne in mind when interpreting the results of such studies. Box 3 Does representativeness matter? Representativeness refers to the comparison between the population about which we wish to draw inferences (e.g. all people in the UK, all men with a prostate cancer diagnosis), and what we have described as the “intended study population” (e.g., doctors in the British Doctors Study, all men diagnosed with prostate cancer, etc.). The degree to which the intended study population can be used to draw inference about the population of interest will depend on factors influencing selection into the intended study population. For example, the biology of doctors in the British Doctors Study will be similar to the biology of all people, so this intended study population should enable us to draw inferences about many exposures and outcomes that would hold for the general population of the UK (and indeed of the world). If the outcome of interest has not influenced selection into the intended study population, and all of this population is included in the study (i.e., in what we have described as the “actual study population”), then selection bias, and therefore collider bias, should be small. However, for some factors under study, representativeness may matter to a greater degree than is appreciated – for example, if general cognitive ability and childhood socioeconomic position both influence the likelihood of becoming a doctor, and childhood socioeconomic position influences a wide range of health outcomes, bias may still be introduced when attempting to use doctors to draw inferences about the effects, say, of general cognitive ability on disease in a general population. These biases may be small in magnitude, but as we increasingly investigate small associations, their relative importance may be greater than is appreciated.

Supplementary Data

Supplementary data are available at IJE online.

Funding

The UK Medical Research Council and Wellcome Trust (Grant Ref: 102215/2/13/2) and the University of Bristol provide core support for ALSPAC. ARIES was funded by the BBSRC (BBI025751/1 and BB/I025263/1). Supplementary funding to generate DNA methylation data which is (or will be) included in ARIES has been obtained from the MRC, ESRC, NIH and other sources. ARIES is maintained under the auspices of the MRC Integrative Epidemiology Unit at the University of Bristol (MC_UU_12013/2 and MC_UU_12013/8). This work was supported by the Medical Research Council and the University of Bristol (MC_UU_12013/1, MC_UU_12013/4, MC_UU_12013/6, MC_UU_12013/9). M.R.M. and A.E.T. are members of the UK Centre for Tobacco and Alcohol Studies, a UKCRC Public Health Research Centre of Excellence. Funding from British Heart Foundation, Cancer Research UK, Economic and Social Research Council, Medical Research Council and the National Institute for Health Research, under the auspices of the UK Clinical Research Collaboration, is gratefully acknowledged. Click here for additional data file.

34 in total

1. Harnessing the information contained within genome-wide association studies to improve individual prediction of complex disease risk.

Authors: David M Evans; Peter M Visscher; Naomi R Wray
Journal: Hum Mol Genet Date: 2009-06-24 Impact factor: 6.150

2. The Obesity Paradox in Survival after Cancer Diagnosis: Tools for Evaluation of Potential Bias.

Authors: Elizabeth Rose Mayeda; M Maria Glymour
Journal: Cancer Epidemiol Biomarkers Prev Date: 2017-01 Impact factor: 4.254

3. A General and Robust Framework for Secondary Traits Analysis.

Authors: Xiaoyu Song; Iuliana Ionita-Laza; Mengling Liu; Joan Reibman; Ying We
Journal: Genetics Date: 2016-02-19 Impact factor: 4.562

4. Role of survivor bias in pancreatic cancer case-control studies.

Authors: Zhen-Huan Hu; John E Connett; Jian-Min Yuan; Kristin E Anderson
Journal: Ann Epidemiol Date: 2015-11-14 Impact factor: 3.797

5. Cohort Profile: the Avon Longitudinal Study of Parents and Children: ALSPAC mothers cohort.

Authors: Abigail Fraser; Corrie Macdonald-Wallis; Kate Tilling; Andy Boyd; Jean Golding; George Davey Smith; John Henderson; John Macleod; Lynn Molloy; Andy Ness; Susan Ring; Scott M Nelson; Debbie A Lawlor
Journal: Int J Epidemiol Date: 2012-04-16 Impact factor: 7.196

6. Cohort Profile: the 'children of the 90s'--the index offspring of the Avon Longitudinal Study of Parents and Children.

Authors: Andy Boyd; Jean Golding; John Macleod; Debbie A Lawlor; Abigail Fraser; John Henderson; Lynn Molloy; Andy Ness; Susan Ring; George Davey Smith
Journal: Int J Epidemiol Date: 2012-04-16 Impact factor: 7.196

7. Pleiotropy between neuroticism and physical and mental health: findings from 108 038 men and women in UK Biobank.

Authors: C R Gale; S P Hagenaars; G Davies; W D Hill; D C M Liewald; B Cullen; B W Penninx; D I Boomsma; J Pell; A M McIntosh; D J Smith; I J Deary; S E Harris
Journal: Transl Psychiatry Date: 2016-04-26 Impact factor: 6.222

8. Measures of socioeconomic status and self-reported glaucoma in the U.K. Biobank cohort.

Authors: Y Shweikh; F Ko; M P Y Chan; P J Patel; Z Muthy; P T Khaw; J Yip; N Strouthidis; P J Foster
Journal: Eye (Lond) Date: 2015-08-28 Impact factor: 3.775

9. Association of Genetic Risk for Schizophrenia With Nonparticipation Over Time in a Population-Based Cohort Study.

Authors: Joanna Martin; Kate Tilling; Leon Hubbard; Evie Stergiakouli; Anita Thapar; George Davey Smith; Michael C O'Donovan; Stanley Zammit
Journal: Am J Epidemiol Date: 2016-05-10 Impact factor: 4.897

10. Genome-wide association analysis identifies novel loci for chronotype in 100,420 individuals from the UK Biobank.

Authors: Jacqueline M Lane; Irma Vlasac; Simon G Anderson; Simon D Kyle; William G Dixon; David A Bechtold; Shubhroz Gill; Max A Little; Annemarie Luik; Andrew Loudon; Richard Emsley; Frank A J L Scheer; Deborah A Lawlor; Susan Redline; David W Ray; Martin K Rutter; Richa Saxena
Journal: Nat Commun Date: 2016-03-09 Impact factor: 14.919

248 in total

1. Can Survival Bias Explain the Age Attenuation of Racial Inequalities in Stroke Incidence?: A Simulation Study.

Authors: Elizabeth Rose Mayeda; Hailey R Banack; Kirsten Bibbins-Domingo; Adina Zeki Al Hazzouri; Jessica R Marden; Rachel A Whitmer; M Maria Glymour
Journal: Epidemiology Date: 2018-07 Impact factor: 4.822

2. Conditioning on a Collider May Induce Spurious Associations: Do the Results of Gale et al. (2017) Support a Health-Protective Effect of Neuroticism in Population Subgroups?

Authors: Tom G Richardson; George Davey Smith; Marcus R Munafò
Journal: Psychol Sci Date: 2019-02-22

3. Investigating causality in associations between education and smoking: a two-sample Mendelian randomization study.

Authors: Suzanne H Gage; Jack Bowden; George Davey Smith; Marcus R Munafò
Journal: Int J Epidemiol Date: 2018-08-01 Impact factor: 7.196

4. Educational Note: Paradoxical collider effect in the analysis of non-communicable disease epidemiological data: a reproducible illustration and web application.

Authors: Miguel Angel Luque-Fernandez; Michael Schomaker; Daniel Redondo-Sanchez; Maria Jose Sanchez Perez; Anand Vaidya; Mireille E Schnitzer
Journal: Int J Epidemiol Date: 2019-04-01 Impact factor: 7.196

5. Interaction Between Overweight and Genotypes of HLA, TCF7L2, and FTO in Relation to the Risk of Latent Autoimmune Diabetes in Adults and Type 2 Diabetes.

Authors: Rebecka Hjort; Josefin E Löfvenborg; Emma Ahlqvist; Lars Alfredsson; Tomas Andersson; Valdemar Grill; Leif Groop; Elin P Sørgjerd; Tiinamaija Tuomi; Bjørn Olav Åsvold; Sofia Carlsson
Journal: J Clin Endocrinol Metab Date: 2019-10-01 Impact factor: 5.958

6. Patterns of E-cigarette Use and Subsequent Cigarette Smoking Cessation Over 2 Years (2013/2014-2015/2016) in the Population Assessment of Tobacco and Health Study.

Authors: Allison M Glasser; Mahathi Vojjala; Jennifer Cantrell; David T Levy; Daniel P Giovenco; David Abrams; Raymond Niaura
Journal: Nicotine Tob Res Date: 2021-03-19 Impact factor: 4.244

7. Risk factors and appropriate therapeutic strategies for thrombotic microangiopathy after allogeneic HSCT.

Authors: Hiroyuki Matsui; Yasuyuki Arai; Hiroharu Imoto; Takaya Mitsuyoshi; Naoki Tamura; Tadakazu Kondo; Junya Kanda; Takayuki Ishikawa; Kazunori Imada; Yasunori Ueda; Yusuke Toda; Naoyuki Anzai; Kazuhiro Yago; Masaharu Nohgawa; Akihito Yonezawa; Hiroko Tsunemine; Mitsuru Itoh; Kazuyo Yamamoto; Masaaki Tsuji; Toshinori Moriguchi; Akifumi Takaori-Kondo
Journal: Blood Adv Date: 2020-07-14