| Literature DB >> 33184277 |
Gareth J Griffith1,2, Tim T Morris1,2, Matthew J Tudball1,2, Annie Herbert1,2, Giulia Mancano1,2, Lindsey Pike1,2, Gemma C Sharp1,2, Jonathan Sterne2, Tom M Palmer1,2, George Davey Smith1,2, Kate Tilling1,2, Luisa Zuccolo1,2, Neil M Davies1,2,3, Gibran Hemani4,5.
Abstract
Numerous observational studies have attempted to identify risk factors for infection with SARS-CoV-2 and COVID-19 disease outcomes. Studies have used datasets sampled from patients admitted to hospital, people tested for active infection, or people who volunteered to participate. Here, we highlight the challenge of interpreting observational evidence from such non-representative samples. Collider bias can induce associations between two or more variables which affect the likelihood of an individual being sampled, distorting associations between these variables in the sample. Analysing UK Biobank data, compared to the wider cohort the participants tested for COVID-19 were highly selected for a range of genetic, behavioural, cardiovascular, demographic, and anthropometric traits. We discuss the mechanisms inducing these problems, and approaches that could help mitigate them. While collider bias should be explored in existing studies, the optimal way to mitigate the problem is to use appropriate sampling strategies at the study design stage.Entities:
Mesh:
Year: 2020 PMID: 33184277 PMCID: PMC7665028 DOI: 10.1038/s41467-020-19478-2
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1Illustrative example of collider bias.
a A directed acyclic graph (DAG) illustrating a scenario in which collider bias would distort the estimate of the causal effect of the risk factor on the outcome. Directed arrows indicate causal effects and dotted lines indicate induced associations. Note that the risk factor and the outcome can be associated with sample selection indirectly (e.g. through unmeasured confounding variables), as shown in b. The type of collider bias induced in graph (b) is sometimes referred to as M-bias. To illustrate the example in a, consider academic ability and sporting ability to each influence selection into a prestigious school. As shown in c, these traits are negligibly correlated in the general population (blue dotted line), but because they are selected for enrolment they become strongly correlated when analysing only the selected individuals (red dotted line).
Fig. 2Collider bias induced by conditioning on a collider in three scenarios relating to COVID-19 analysis.
These are simplified Directed Acyclic Diagrams where only the main variables of interest have been represented for sake of illustrating collider bias scenarios. All assume no unspecified confounding or other biases. Rectangles represent observed variables and solid directed arrows represent causal effects. The dashed line represents an induced association when conditioning on the collider, which in these scenarios are variables that indicate whether an individual is selected into the sample. a When some hypothesised risk factor (e.g. age) and outcome (e.g. COVID-19 infection) each associate with sample selection (e.g. voluntary data collection via mobile phone apps), the hypothesised risk factor and outcome will be associated within the sample. The presence and direction of these biases are model dependent; where causes are supra-multiplicative they will be positively associated in the sample; where they are sub-multiplicative they will be negatively correlated, and where they are exactly multiplicative they will remain unassociated. We extend this scenario in b where the association between the hypothesised risk factor and the collider does not need to be causal. c When inferring the influence of some hypothesised risk factor on mortality, in an unselected sample the risk factor for infection is a causal factor for death (mediated by COVID-19 infection). However, if analysed only amongst individuals who are known to have COVID-19 (i.e. we condition on the COVID-19 infection variable) then the risk factor for infection will appear to be associated with any other variable that influences both infection and progression. In many circumstances, this can lead to a risk factor for disease onset that appears to be protective for disease progression. Each of these scenarios represents those described in the main text.
Fig. 3Quantile-Quantile plot of −log10 p-values for factors influencing being tested for COVID-19 in UK Biobank.
The x-axis represents the expected p-value for 2556 hypothesis tests and y-axis represents the observed p-values. The red line represents the expected relationship under the null hypothesis of no associations.
Fig. 4Example of large associations induced by collider bias under the null hypothesis of no causal relationship, using scenarios similar to those reported for the observed protective association of smoking on COVID-19 infection.
Assume a simple scenario in which the hypothesised exposure (A) and outcome (Y) are both binary and each influence probability of being selected into the sample (S) e.g. where is the baseline probability of being selected, is the effect of A, is the effect of Y and is the effect of the interaction between A and Y. The selection mechanism in question is represented in Fig. 1b (without the interaction term drawn). This plot shows which combinations of these parameters would be required to induce an apparent risk effect with magnitude OR > 2 (blue region) or an apparent protective effect with magnitude OR < 0.5 (red region) under the null hypothesis of no causal effect[61]. To create a simplified scenario similar to that in Miyara et al. we use a general population prevalence of smoking of 0.27 and a sample prevalence of 0.05, thus fixing at 0.22. Because the prevalence of COVID-19 is not known in the general population, we allow the sample to be over- or under-representative (y-axis). We also allow modest interaction effects. Calculating over this parameter space, 40% of all possible combinations lead to an artefactual 2-fold protective or risk association operating through this simple model of bias alone. It is important to disclose this level of uncertainty when publishing observational estimates.