Mendelian randomisation with coarsened exposures.

Matthew J Tudball, Jack Bowden, Rachael A Hughes, Amanda Ly, Marcus R Munafò, Kate Tilling, Qingyuan Zhao, George Davey Smith.

Abstract

A key assumption in Mendelian randomisation is that the relationship between the genetic instruments and the outcome is fully mediated by the exposure, known as the exclusion restriction assumption. However, in epidemiological studies, the exposure is often a coarsened approximation to some latent continuous trait. For example, latent liability to schizophrenia can be thought of as underlying the binary diagnosis measure. Genetically driven variation in the outcome can exist within categories of the exposure measurement, thus violating this assumption. We propose a framework to clarify this violation, deriving a simple expression for the resulting bias and showing that it may inflate or deflate effect estimates but will not reverse their sign. We then characterise a set of assumptions and a straight-forward method for estimating the effect of SD increases in the latent exposure. Our method relies on a sensitivity parameter which can be interpreted as the genetic variance of the latent exposure. We show that this method can be applied in both the one-sample and two-sample settings. We conclude by demonstrating our method in an applied example and reanalysing two papers which are likely to suffer from this type of bias, allowing meaningful interpretation of their effect sizes.

Keywords: Mendelian randomisation analysis; biomarkers; latent variable modelling; sensitivity analysis

Year: 2021 PMID: 33527565 PMCID: PMC8603937 DOI: 10.1002/gepi.22376

Source DB: PubMed Journal: Genet Epidemiol ISSN: 0741-0395 Impact factor: 2.344

INTRODUCTION

Mendelian randomisation proposes to use genetic variants that alter, or mirror the biological effects of, modifiable exposures to study the causal effects of such exposures on downstream outcomes. The principle underlying Mendelian randomisation is that genetic variants are randomly passed from parents to offspring at conception, resulting in a plausibly unconfounded source of variation in the exposures with which they are associated. For Mendelian randomisation estimates to inform policies or clinical practices, we must additionally assume that genetic and environmental modifiers of the exposure produce similar effects on the outcome (Davey Smith & Ebrahim, 2003). For example, Mendelian randomisation studies of pharmaceutical exposures typically use genetic variants that code for potential drug targets, assuming that similar effects would be observed if those targets were altered therapeutically (Plump & Davey Smith, 2019). One of the crucial assumptions underlying the Mendelian randomisation approach is that the relationship between the genetic instruments and the outcome is fully mediated by the exposure, known as the exclusion restriction assumption. However, it is important to draw a distinction between the true exposure experienced by an individual and our attempt at measuring it. For practical purposes, we are often restricted to coarsened approximations which do not fully encapsulate the mechanism by which the true exposure of interest affects the outcome. Consistent with existing terminology, we define an exposure measurement as coarsened if it is a discrete measure approximating a continuous latent exposure (Marshall, 2016). In the Mendelian randomisation context, coarsened exposures can violate the exclusion restriction assumption. 
If the genetic instruments are acting on a latent exposure, such as body mass index (BMI), but the measured exposure is a discretisation of it, such as obesity status, then there can exist genetically driven variation in the true exposure within categories of the measured exposure. We could imagine that counterfactually altering some BMI‐raising single‐nucleotide polymorphism (SNP) in an individual could result in a change in their BMI without necessarily changing their obesity status. This can be viewed as a form of measurement error which opens up potential pathways from the genetic instruments to the outcome that do not pass through the exposure measure, thus violating the exclusion restriction assumption. For example, Richardson et al. (2020) attempt to separate the effects of early and later life adiposity on disease risk. The adiposity variable is a three‐category self‐report measure (“thinner,” “plumper,” and “about average”). It is reasonable to conceptualise a continuous measure of body mass (e.g., BMI) underlying this coarsened categorical measure, such that genetic variation in this latent continuous measure could occur within categories of the self‐report variable. We later reanalyse Richardson et al. (2020) in Box 2 using the approach proposed in this paper. Another example is Richmond et al. (2019), who apply Mendelian randomisation to investigate the effect of sleep traits (e.g., morning preference, sleep duration) on breast cancer risk, finding large causal effects of several traits. These traits are categorical measures; for example, morning preference is measured in six categories and sleep duration is split into several groups. It is reasonable to conceptualise the true exposures on which the genetic variants are acting as latent continuous sleep traits and preferences, for which the measured exposures are discrete markers. 
An important class of latent exposures we consider in this paper is disease liabilities, for which binary disease diagnosis or case status is the typical exposure measurement. There are an increasing number of Mendelian randomisation studies investigating the effects of complex diseases such as asthma, schizophrenia and attention deficit hyperactivity disorder on various outcomes (Lawn et al., 2019; Martins‐Silva et al., 2019; Pasman et al., 2018; Sun et al., 2019). Complex diseases which result from the interaction of environment and multiple genetic variants are likely to affect outcomes of interest through pathways other than diagnosis, for example, severity of subclinical symptoms. Since genetic instruments are, in turn, likely to influence the manifestation or severity of the underlying symptoms, rather than diagnosis alone, this represents a potential violation of the exclusion restriction. This specific violation of the exclusion restriction assumption has been raised before in both the economics and political science literatures (Angrist & Imbens, 1995; Marshall, 2016). It has also been raised briefly in the Mendelian randomisation context in Burgess and Labrecque (2018), who discuss interpretation of estimates with binary exposures. The authors recommend that findings be framed in terms of this latent exposure but note that the estimates themselves have no meaningful causal interpretation. However, it remains to explore in more detail how this bias may distort estimates and clarify how to appropriately frame estimates in terms of the latent exposure, which will depend on the unobservable relationship between the latent exposure and its coarsened measurement. We attempt to provide these clarifications in this paper. In particular, we derive an expression for the bias and introduce a clear set of identifying assumptions under which one can estimate the causal effect of the latent exposure. 
We hope to allow researchers to decide whether these assumptions are plausible in the context of their study. In Section 2, we outline our technical framework, which assumes a linear single threshold model for the relationship between the latent exposure and its measurement. That is, we assume that values of the coarsened exposure are determined by whether the latent exposure is above or below some threshold, which could be individual‐specific. For example, an individual is classified as obese if their BMI is above 30 and not obese otherwise. This framework also contains the Falconer (1965) liability‐threshold model, which assumes that a disease occurs in an individual, or is sufficiently pronounced to be diagnosed, if a build‐up of underlying liability crosses some threshold. In this model, liability is assumed to capture all genetic, shared and nonshared environmental risk factors. In Section 3.1, we derive an expression for the bias from the naive approach of using the coarsened measure as the exposure directly. Then, in Section 3.2, we show that, if the latent exposure is standardised to have a SD of one, its causal effect can be identified if we have auxiliary information on the genetic variance of the latent exposure. This may be obtained from a genome‐wide association study (GWAS) or treated as a sensitivity parameter and varied over a plausible range of values. In the context of disease liabilities, we may use the coefficient of determination developed by Lee et al. (2012). Section 4 provides some generalisations to this framework, in particular, allowing two‐sample estimation. Section 5 provides a real data example by creating artificially dichotomised variables from the continuous BMI measure in UK Biobank. Boxes 1 and 2 present reanalyses of two papers which could be interpreted within the framework proposed in this paper (Pasman et al., 2018; Richardson et al., 2020). 
In sections A and B of the appendix, we examine the bias that can emerge when the assumptions of our framework are violated.

Box 1

Pasman et al. (2018) perform a two‐sample bidirectional Mendelian randomisation analysis of schizophrenia and cannabis use (Burgess et al., 2015). The gene‐exposure associations for schizophrenia are pulled from a GWAS of cases and controls and are reported on the log‐odds scale (Schizophrenia Working Group of the Psychiatric Genomics Consortium, 2015). While this avoids the problem of using the dichotomous diagnosis variable as the exposure (as discussed in Section 1), it means that the resulting estimates are interpreted as unit increases in the log‐odds, which are scaled by the unobserved parameter $\sigma_\varepsilon$, the scale of the environmental share of liability. The authors report an odds ratio (OR) of 1.16 (95% confidence interval [95% CI] = 1.06–1.27) for the effect of genetic liability to schizophrenia. While we can infer the direction of the effect from this estimate, we cannot draw any conclusions about the magnitude. We apply the two‐sample generalisation of Section 4.4. One of the strengths of this generalisation is that we do not need to re‐estimate the original inverse‐variance weighted Mendelian randomisation estimates ourselves. In addition to the estimates reported in the original paper, we need only an estimate of $\sigma_\eta/\sigma_\varepsilon$ (the SD of the genetic share of liability on the log‐odds scale), which can be computed from summary data from the schizophrenia GWAS, and some plausible choices for the sensitivity parameter $\rho$, the genetic share of the variance of liability. The schizophrenia GWAS reports that their genome‐wide significant loci explain roughly 3.4% of the variation in schizophrenia liability using the Lee et al. (2012) coefficient of determination. Using this estimate as a baseline, we select three choices for $\rho$: 0.02, 0.034, and 0.05. Our findings are consistent with a modest positive effect of schizophrenia liability on the odds of cannabis use. As shown in Figure 1, a one SD increase in schizophrenia liability corresponds to an OR of 1.15–1.26 for cannabis use, with 95% CIs spanning 1.10–1.44. It is important not to directly compare these estimates with the original estimates: the two are not on the same scale. We must interpret the estimates of Figure 1 in terms of SD increases in schizophrenia liability.

Figure 1

Effect of schizophrenia liability on risk of ever using cannabis for several choices of the sensitivity parameter. 95% confidence intervals are estimated as in section C of the appendix.

Box 2

Richardson et al. (2020) perform a two‐sample Mendelian randomisation analysis of child and adult BMI on risk of several diseases: coronary artery disease, type 2 diabetes, breast cancer and prostate cancer. The instrument‐exposure relationship is estimated in the UK Biobank cohort. However, child BMI is not measured directly in UK Biobank; instead, there is a measure of self‐reported adiposity in three discrete categories (“thinner,” “plumper,” or “about average”). In this context, the latent exposure is child BMI and the self‐report measure is a coarsening of child BMI. Since the genetic instruments will act on child BMI directly, the exclusion restriction is likely to be violated. Therefore, we apply the latent variable method of Section 3.2 to these data. We reanalyse the original univariable effect of child BMI on risk of type 2 diabetes (OR = 2.32, 95% CI = 1.76–3.05), coronary artery disease (1.49, 1.33–1.68), and breast cancer (0.59, 0.50–0.71). We apply the two‐sample generalisation of the inverse‐variance weighted estimator of Section 4.4, estimating the instrument‐exposure relationship in UK Biobank using an ordered probit model and the instrument‐outcome relationships using the MR‐Base platform (Hemani et al., 2018). We choose three values for the sensitivity parameter $\rho$ based on a large GWAS of adult BMI: 0.01, 0.02, and 0.05 (Locke et al., 2015). The genetic share of child BMI is estimated using an ordered probit model and standard errors are calculated using the formula in section C of the appendix. Figure 2 shows our results for three of the diseases analysed in the paper. Our estimates are in the same direction as the original estimates, as expected; however, the interpretation of the magnitudes is different. For example, the original paper estimates that a per‐category increase in self‐reported child adiposity corresponds to an OR for coronary artery disease of 1.49 (95% CI = 1.33–1.68), which could be inflated due to violation of the exclusion restriction. For one of these values of the sensitivity parameter, we estimate that a one SD increase in child BMI corresponds to an OR for coronary artery disease of 1.13 (95% CI = 0.99–1.28). It is difficult to directly compare the two sets of estimates since the exposures are different; however, our estimate is suggestive of a modest effect of child BMI on the risk of coronary artery disease.

Figure 2

Effect of childhood body mass index on risk of several diseases for several choices of the sensitivity parameter. 95% confidence intervals are estimated as in section C of the appendix.

FRAMEWORK

We begin by outlining some key notation. Suppose there is a genetic instrument $G$, other genetic variants $Z$ (e.g., pleiotropic, weak) and an environmental risk factor $\varepsilon$, where $\varepsilon$ is assumed to be continuously distributed with mean zero. We also assume that $G$, $Z$ and $\varepsilon$ are mutually independent. We define $\eta = \mu + \gamma G + Z'\pi$ as the genetic share of the latent exposure and define the latent exposure itself as

$$X^* = \eta - \varepsilon. \tag{1}$$

It would be equally correct to define $X^* = \eta + \varepsilon$, but the formulation in (1) simplifies some later expressions. In the Falconer framework described in Section 1, $X^*$ would represent liability to some disease. We are able to observe a coarsened exposure

$$X = 1(X^* \geq 0), \tag{2}$$

characterised by a dichotomisation of the latent exposure. If $X^*$ is disease liability, then $X$ would represent occurrence of the disease. In practice, we measure diagnosis of the disease, which does not necessarily correspond to occurrence due to under‐ or over‐diagnosis. We will treat the two as equivalent throughout and discuss violations of this equivalence in Section 6. Equation (2) is the crucial assumption underlying our approach; namely, that $X^*$ is a linear index that relates to $X$ according to a single threshold. Section A of the appendix elaborates on the importance of this structural assumption. Figure 3 illustrates our model within the Falconer framework. There is a distribution of disease liabilities and the disease occurs at the right tail of this distribution. The size of the grey region represents the prevalence of the disease in the population.

Figure 3

In the Falconer framework, liability to a disease is assumed to follow a smooth (often normal) distribution. The disease occurs at the tail of the distribution, with the grey region representing expected prevalence in the population.

We also have an observed outcome $Y$. For ease of exposition, we restrict ourselves to a simple linear structural equation model

$$Y = \beta X^* + V, \tag{3}$$

which is implicitly conditional on covariates, where the disturbance $V$ can be correlated with both the environmental share $\varepsilon$ and the other genetic variants $Z$. However, this framework can accommodate more general exposure‐outcome relationships of the form $Y = h(X^*, V)$, provided $h$ is differentiable with respect to $X^*$. We make the standard instrumental variable assumptions, namely, that $\gamma \neq 0$ and the instrument $G$ is independent of $V$ conditional on covariates. The model (3) implicitly captures the assumption described in Section 1 that genetic and environmental modifiers of the exposure produce equivalent effects on the outcome. In this setting, the marginal effect (in absolute value) of both the genetic share $\eta$ and the environmental share $\varepsilon$ is $|\beta|$. Figure 4 summarises this model in a directed acyclic graph. We can see that the exclusion restriction is violated since there exists a path from the latent exposure $X^*$ to $Y$ which does not pass through the measured exposure $X$. The structural Equation (3) assumes no effect of $X$ itself. For a disease such as schizophrenia, liability could have a harmful effect on the outcome but being diagnosed will usually lead to receiving treatment and thus could have a protective effect. We cannot separately identify the two effects in this setting, although possibilities for doing so are discussed in Section 4.2. When $X$ is believed to have a distinct effect on the outcome, we may instead identify the total effect of liability on the outcome; that is, the direct effect and the indirect effect through $X$.

Figure 4

The framework proposed in Section 2 is summarised in a directed acyclic graph. Dotted circles represent latent variables and complete circles represent observed variables

The structural assumptions made in this section can be summarised as follows:

(Single threshold) The latent exposure $X^*$ and its binary measurement $X$ are related by a single threshold model of the form $X = 1(X^* \geq 0)$.
(Additivity) $X^* = \eta - \varepsilon$, where $\eta$ and $\varepsilon$ are, respectively, the genetic and environmental shares of $X^*$.
(Linearity) $\eta$ is a linear function of the genetic instrument $G$ and other genetic variants $Z$, such that $\eta = \mu + \gamma G + Z'\pi$.
(Environmental share) $\varepsilon$ has mean zero, SD $\sigma_\varepsilon$ and is in some family of continuous distributions, with cumulative distribution function given by $F(e/\sigma_\varepsilon)$ and density $f(e/\sigma_\varepsilon)/\sigma_\varepsilon$.
(Risk factor independence) $G$, $Z$ and $\varepsilon$ are mutually independent.
(Gene–environment equivalence) The outcome model takes the form $Y = \beta X^* + V$, where $V$ is a random disturbance and $\varepsilon$ and $Z$ may be correlated with $V$.
(Instrumental variable assumptions) $G$ is independent of $\varepsilon$ and $V$.

IDENTIFICATION

Bias from the naive approach

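The naive approach to Mendelian randomisation is to use the coarsened exposure $X$ as the exposure directly. We show in this section that this results in a “multiplicative” bias which will scale the true effect up or down, but not change its direction. When the distribution of the environmental share $\varepsilon$ has a light tail (e.g., normal distribution), we will typically see inflation of effect estimates, with the degree of inflation increasing as the prevalence of $X$ becomes smaller. If $X$ is case status for a disease, for example, then effect estimates will be more inflated for rarer diseases. We see this pattern of inflation occurring in our real data examples in Section 5. We call the naive Wald estimand $\beta_{\text{naive}}$. It is illustrative to derive a closed‐form expression for $\beta_{\text{naive}}$. Suppose $G$ is binary and $\pi = 0$ (i.e., there is no $Z$). Begin by noting that

$$E[X \mid G = 1] - E[X \mid G = 0] = F_\varepsilon(\mu + \gamma) - F_\varepsilon(\mu) = \gamma f_\varepsilon(c)$$

by the mean value theorem, where $F_\varepsilon$ and $f_\varepsilon$ are the cumulative distribution function and density of $\varepsilon$ and $c$ lies between $\mu$ and $\mu + \gamma$. Thus, the estimand can be written as

$$\beta_{\text{naive}} = \frac{E[Y \mid G = 1] - E[Y \mid G = 0]}{E[X \mid G = 1] - E[X \mid G = 0]} = \frac{\beta\gamma}{\gamma f_\varepsilon(c)} = \frac{\beta}{f_\varepsilon(c)}, \tag{4}$$

meaning that $\beta_{\text{naive}}$ is equal to the true latent exposure effect $\beta$ divided by the density of $\varepsilon$ at the value $c$. $f_\varepsilon(c)$ is not identified since the distribution of $\varepsilon$ is unknown and $c$ is defined on the scale of the latent exposure.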
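As a rough numerical illustration of this bias, the sketch below evaluates the inflation factor (one over the density of the environmental share at the threshold) for a few prevalences. It is only a sketch under stated assumptions that are not part of the original analysis: the environmental share is taken to be normal, the instrument is assumed weak enough that the mean value point sits approximately at the threshold, and the function name `naive_inflation_factor` and the parameter `sd_env` are hypothetical.

```python
from statistics import NormalDist


def naive_inflation_factor(prevalence, sd_env=1.0):
    """Approximate ratio of the naive Wald estimand to the latent effect.

    Assumes a normal environmental share with SD `sd_env` (an assumption)
    and a weak instrument, so the density is evaluated at the threshold.
    """
    std_normal = NormalDist()
    # Threshold, in SD units, that leaves `prevalence` in the upper tail.
    z = std_normal.inv_cdf(1.0 - prevalence)
    # Density of the environmental share at the threshold.
    density = std_normal.pdf(z) / sd_env
    return 1.0 / density


for prevalence in (0.5, 0.1, 0.01):
    factor = naive_inflation_factor(prevalence)
    print(f"prevalence {prevalence}: naive estimand is about {factor:.1f} x latent effect")
```

Under these assumptions the factor grows as the prevalence shrinks, which matches the statement that estimates are more inflated for rarer diseases.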

The latent variable approach

The bias formula (4) indicates that the nuisance term is $f_\varepsilon$, which is the distribution of the environmental share $\varepsilon$. Although $X$ depends on this unobserved distribution, the genetic share $\eta$ does not. Our latent variable approach therefore proceeds in four steps: (1) estimate the linear predictor of a generalised linear model of $X$ on $G$ and $Z$; (2) normalise the linear predictor to have mean zero and variance one; (3) use this normalised linear predictor as the exposure in an instrumental variable model; and (4) scale the resulting effect estimate up using the genetic share of the variance of the latent exposure. Step 4 is necessary to interpret effect estimates in terms of SD increases in the latent exposure, which is typically the desired scale. To state this more precisely, define $\sigma_{X^*} = \sqrt{\sigma_\eta^2 + \sigma_\varepsilon^2}$ as the SD of $X^*$, where $\sigma_\eta^2$ and $\sigma_\varepsilon^2$ are the variances of $\eta$ and $\varepsilon$, respectively. Within the framework described in Section 2, we claim that the four steps above allow us to identify $\beta\sigma_{X^*}$ from the observed data $(Y, X, G, Z)$. The remainder of this section proves this claim given the assumptions outlined in Section 2 and discusses its implications. We begin by expressing the quantity $E[X \mid G, Z]$ within the framework of Section 2:

$$E[X \mid G, Z] = P(\varepsilon \leq \eta \mid G, Z) = F(\tilde{\mu} + \tilde{\gamma} G + Z'\tilde{\pi}), \tag{5}$$

where $\tilde{\mu} = \mu/\sigma_\varepsilon$, $\tilde{\gamma} = \gamma/\sigma_\varepsilon$, and $\tilde{\pi} = \pi/\sigma_\varepsilon$. $F$, the cumulative distribution function of the standardised environmental share $\varepsilon/\sigma_\varepsilon$, can be interpreted as the link function in a generalised linear model and $\tilde{\mu}$, $\tilde{\gamma}$, and $\tilde{\pi}$ as parameters that can be identified from the observable data. In practice, we could specify $F$ directly, for example, as a logistic or normal distribution (corresponding to logistic and probit regressions respectively). Alternatively, to avoid imposing potentially strong distributional assumptions, we could use semi‐parametric estimation methods for generalised linear models, which only require some smoothness conditions on $F$ (Ichimura, 1993; Klein & Spady, 1993). Disease liabilities are often assumed to be the product of many small, independent traits. Therefore, by the central limit theorem, a normal distribution (i.e., probit model) is a natural choice of link function in this context (Curnow, 1972). 
Step 1 is accomplished by constructing the predicted genetic share of the latent exposure $\tilde{\eta} = \tilde{\mu} + \tilde{\gamma} G + Z'\tilde{\pi} = \eta/\sigma_\varepsilon$ using parameters estimated from the generalised linear model of $X$ on $G$ and $Z$. An immediate complication is that $\sigma_\varepsilon$, the SD of the environmental share, is unobserved. Treating $\sigma_\varepsilon$ as a sensitivity parameter is not tractable since its value is defined on the scale of the latent exposure, which is unknown. However, if we standardise $\tilde{\eta}$ by its SD as in step 2, we can remove $\sigma_\varepsilon$ since

$$\frac{\tilde{\eta} - E[\tilde{\eta}]}{\mathrm{sd}(\tilde{\eta})} = \frac{\eta - E[\eta]}{\sigma_\eta}.$$

By using this standardised genetic share as our exposure, we can obtain effects in terms of SD increases in the genetic share of the latent exposure. The instrumental variable estimand of step 3 equals

$$\frac{\mathrm{cov}(Y, G)}{\mathrm{cov}\left((\eta - E[\eta])/\sigma_\eta,\, G\right)} = \beta\sigma_\eta.$$

This estimand does not often have a natural interpretation. We would prefer to interpret our effects in terms of changes in the latent exposure itself. Let $\rho = \sigma_\eta^2/\sigma_{X^*}^2$ be defined as the genetic share of the variance of the latent exposure, where $\sigma_{X^*}$ is the SD of the latent exposure $X^*$. If we have a suitable choice of $\rho$, we can simply adjust our estimand as in step 4 such that

$$\frac{\beta\sigma_\eta}{\sqrt{\rho}} = \beta\sigma_{X^*},$$

which is our desired effect. The parameter $\rho$ can be treated as a sensitivity parameter and varied over a plausible range of values or can, in some instances, be obtained from GWAS which report this measure. For disease liabilities in particular, Lee et al. (2012) uses the Falconer liability‐threshold model to develop a coefficient of determination for GWAS that is interpretable on the liability scale, which corresponds to $\rho$. Therefore, $\rho$ can be estimated using this approach or selected from GWAS which report this coefficient. For ease of interpretation, liability is often assumed to have mean zero and variance one, in which case $\sigma_{X^*} = 1$ and $\beta$ itself is identified on this scale (Lee et al., 2012).
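The four steps can be sketched on simulated data as follows. This is a minimal illustration rather than the authors' implementation, and every numerical value and name in it is hypothetical. It assumes a logistic environmental share (so that a logistic regression is the correctly specified model in step 1), one instrument `G` and one other variant `Z` that are independent of each other, a simple Wald ratio for step 3, and a genetic variance share `rho` that is known only because the data are simulated.

```python
import numpy as np


def fit_logit(design, y, n_iter=25):
    """Logistic regression by Newton-Raphson; returns the coefficient vector."""
    coef = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(design @ coef)))
        weights = p * (1.0 - p)
        gradient = design.T @ (y - p)
        hessian = design.T @ (design * weights[:, None])
        coef = coef + np.linalg.solve(hessian, gradient)
    return coef


def simulate(n=200_000, beta=0.5, seed=0):
    """Simulate the single-threshold model (all parameter values hypothetical)."""
    rng = np.random.default_rng(seed)
    G = rng.binomial(2, 0.3, n).astype(float)     # genetic instrument
    Z = rng.binomial(2, 0.4, n).astype(float)     # other genetic variant
    eps = rng.logistic(0.0, 1.0, n)               # environmental share
    eta = -1.0 + 0.4 * G + 0.3 * Z                # genetic share
    x_star = eta - eps                            # latent exposure
    X = (x_star >= 0).astype(float)               # coarsened exposure
    V = 0.5 * eps + rng.normal(0.0, 1.0, n)       # disturbance, confounded with eps
    Y = beta * x_star + V
    rho = eta.var() / x_star.var()                # known here only by construction
    return G, Z, X, Y, rho, beta * x_star.std()


def latent_effect(G, Z, X, Y, rho):
    design = np.column_stack([np.ones_like(G), G, Z])
    coef = fit_logit(design, X)                               # step 1
    lin = design @ coef
    eta_std = (lin - lin.mean()) / lin.std()                  # step 2
    # Step 3: Wald ratio, valid here because G and Z are simulated as independent.
    wald = np.cov(Y, G)[0, 1] / np.cov(eta_std, G)[0, 1]
    return wald / np.sqrt(rho)                                # step 4


G, Z, X, Y, rho, target = simulate()
naive = np.cov(Y, G)[0, 1] / np.cov(X, G)[0, 1]
latent = latent_effect(G, Z, X, Y, rho)
print(f"target {target:.3f}, latent {latent:.3f}, naive {naive:.3f}")
```

In applied work `rho` would not be known and would instead be varied over a plausible range, and the Wald ratio would typically be replaced by two‐stage least squares with covariates.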

SOME GENERALISATIONS

Individual‐specific threshold

The formalisation of the relationship between disease and liability in Equation (2) and Figure 3 assumes a fixed threshold. That is, all individuals with liability above the threshold will develop or be diagnosed with the disease and all those below the threshold will not. In reality, we might imagine that diagnosis has a random component, driven, for example, by preferences of the diagnosing clinician or imprecision of the testing procedure. It might be more realistic to assume a model such that

$$X = 1(X^* \geq \tau),$$

where $\tau$ is a random individual‐specific threshold. Provided $\tau$ is independent of the instrument $G$ and other variants $Z$, this random threshold will not affect identification of $\tilde{\mu}$, $\tilde{\gamma}$ and $\tilde{\pi}$ of Equation (5) under correct model specification. However, the link function $F$ of Equation (5) no longer corresponds to the distribution family of the environmental share $\varepsilon$; instead, it corresponds to the distribution family of $\varepsilon + \tau$. This could make correct specification of the link function more difficult and semiparametric approaches may be warranted.
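The reason the link function changes can be seen in one line. The following is a sketch in assumed notation (latent exposure $X^* = \eta - \varepsilon$ with genetic share $\eta$, environmental share $\varepsilon$ and random threshold $\tau$), and it assumes that $\varepsilon + \tau$ is independent of $G$ and $Z$:

```latex
\begin{align*}
P(X = 1 \mid G, Z)
  &= P(\eta - \varepsilon \geq \tau \mid G, Z) \\
  &= P(\varepsilon + \tau \leq \eta \mid G, Z) \\
  &= F_{\varepsilon + \tau}(\eta),
\end{align*}
```

where $F_{\varepsilon + \tau}$ denotes the cumulative distribution function of $\varepsilon + \tau$. The model is still a single‐index binary response model in $\eta$, but the relevant error term is the sum $\varepsilon + \tau$ rather than $\varepsilon$ alone.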

Identifying effects of the coarsened exposure

The structural model (3) assumes no direct effect of the binary exposure measure $X$ on the outcome. As discussed in Section 3, when $X$ is diagnosis of a disease, we might expect resulting treatment or therapy to have an effect on the outcome distinct from disease liability, suggesting a structural equation model of the form

$$Y = \beta X^* + \delta X + V.$$

The exposure measure is downstream of the latent exposure and there are assumed to be no direct pathways from the genetic instruments to the exposure measure, as illustrated in Figure 4. Therefore, we cannot use our genetic instrument to estimate the independent effect of the exposure measure on the outcome; the genetic instruments induce no unique variation in the exposure measure independent of the latent exposure. However, consider the individual‐specific threshold $\tau$ of Section 4.1. The variable $\tau$ could represent preferences of the clinician for diagnosing the disease or a change in clinical practices affecting some individuals (Brookhart & Schneeweiss, 2007; Davies et al., 2013). If $\tau$ is independent of each individual's liability, without directly affecting the outcome, then it is a potential instrument for disease diagnosis. The general rule for separately estimating the effects of the latent exposure and coarsened exposure is to have instruments which induce distinct variation in both.

Multivalued discrete exposure

This method generalises easily to the multivalued discrete exposure setting. Suppose we observe a discretised variable $X \in \{0, 1, \dots, K\}$ characterised by

$$X = k \quad \text{if} \quad \tau_k \leq X^* < \tau_{k+1}, \tag{8}$$

where $-\infty = \tau_0 < \tau_1 < \dots < \tau_K < \tau_{K+1} = \infty$ are latent thresholds. $X$ could represent number of years in education and $X^*$ could represent time in education as a continuous measure. Similar to how the dichotomous exposure can be formulated as a binary response model as in Equation (5), exposures of the form (8) can be formulated as an ordered response model and the parameters $\tilde{\mu}$, $\tilde{\gamma}$ and $\tilde{\pi}$ are still identified, allowing the method to be applied as usual.
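To make the ordered response formulation concrete, the category probabilities can be written out as below. This is a sketch in assumed notation: $X^* = \eta - \varepsilon$ with genetic share $\eta$, $F_\varepsilon$ the cumulative distribution function of the environmental share $\varepsilon$, and thresholds $\tau_1 < \dots < \tau_K$ with the conventions $\tau_0 = -\infty$ and $\tau_{K+1} = \infty$.

```latex
\begin{align*}
P(X \geq k \mid G, Z) &= P(\eta - \varepsilon \geq \tau_k \mid G, Z)
                       = F_\varepsilon(\eta - \tau_k), \\
P(X = k \mid G, Z)    &= F_\varepsilon(\eta - \tau_k) - F_\varepsilon(\eta - \tau_{k+1}).
\end{align*}
```

With a normal $F_\varepsilon$ this is an ordered probit model, as used in the reanalysis of the three‐category adiposity measure. The thresholds are estimated alongside the slope parameters, and only the fitted linear index is carried forward to the standardisation step.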

Two‐sample design with GWAS summary statistics

For rare diseases, it is not always possible to observe the coarsened exposure and the outcome in the same sample. It is common practice in Mendelian randomisation studies to use summary statistics from separate GWAS of the exposure and outcome to obtain two‐sample estimates (Burgess et al., 2015). This method also generalises to the two‐sample setting using the popular inverse‐variance weighted approach (Burgess et al., 2013). Suppose there is a set $\mathcal{S}$ of SNPs from the exposure GWAS, of which a subset $\mathcal{S}_0 \subseteq \mathcal{S}$ is selected as instruments from the outcome GWAS. Suppose we have estimates $\hat{\gamma}_j$ on the log‐odds scale of the instrument‐exposure relationship for each instrument in $\mathcal{S}$ and estimates $\hat{\Gamma}_j$ of the instrument‐outcome relationship for each instrument in $\mathcal{S}_0$. Additionally, we need the variance $\mathrm{var}(G_j)$ for each instrument in $\mathcal{S}$, which can be obtained from reported allele frequencies. Lastly, we also need estimates for the inverse‐variance weights $w_j = 1/s_j^2$, where $s_j$ is the standard error of $\hat{\Gamma}_j$. Under the assumption that the instruments in $\mathcal{S}$ are mutually independent, the inverse‐variance weighted estimator for $\beta\sigma_\eta$ can be obtained from the above summary statistics as

$$\hat{\beta}_{\text{IVW}} = \sqrt{\sum_{j \in \mathcal{S}} \hat{\gamma}_j^2\, \mathrm{var}(G_j)} \times \frac{\sum_{j \in \mathcal{S}_0} \hat{\gamma}_j \hat{\Gamma}_j w_j}{\sum_{j \in \mathcal{S}_0} \hat{\gamma}_j^2 w_j}, \tag{9}$$

which is derived in section C of the appendix. We can recover the effect in terms of $X^*$ (i.e., $\beta\sigma_{X^*}$) by rescaling by a suitable choice of $\rho$ as described in Section 3. Conveniently, the second term in (9) is the standard form of the inverse‐variance weighted estimator. This means that we can easily readjust existing Mendelian randomisation estimates of coarsened exposures using only the exposure GWAS and a choice for $\rho$. The large‐sample distribution of the estimator (9) is derived in section C of the appendix.
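The rescaling of an existing inverse‐variance weighted estimate can be sketched in a few lines. This assumes the estimator has the form "SD of the genetic share on the log‐odds scale, times the standard inverse‐variance weighted estimate", followed by division by the square root of the sensitivity parameter. All function names and summary statistics below are hypothetical, and the SNP variance formula assumes additive 0/1/2 coding under Hardy‐Weinberg equilibrium.

```python
import math


def allele_variance(freq):
    """Variance of an additively coded SNP (0/1/2) under Hardy-Weinberg equilibrium."""
    return 2.0 * freq * (1.0 - freq)


def ivw(gamma_hat, Gamma_hat, se_outcome):
    """Standard inverse-variance weighted estimate from per-SNP summary statistics."""
    weights = [1.0 / se ** 2 for se in se_outcome]
    num = sum(g * G * w for g, G, w in zip(gamma_hat, Gamma_hat, weights))
    den = sum(g * g * w for g, w in zip(gamma_hat, weights))
    return num / den


def genetic_sd(gamma_all, freq_all):
    """SD of the genetic share on the log-odds scale, assuming independent SNPs."""
    return math.sqrt(sum(g * g * allele_variance(p) for g, p in zip(gamma_all, freq_all)))


def rescale_to_latent_sd(ivw_estimate, gamma_all, freq_all, rho):
    """Effect per SD increase in the latent exposure for a chosen `rho`."""
    return genetic_sd(gamma_all, freq_all) * ivw_estimate / math.sqrt(rho)


# Hypothetical summary statistics for two independent SNPs.
gamma_hat = [0.1, 0.2]       # SNP-exposure log-odds ratios
Gamma_hat = [0.03, 0.06]     # SNP-outcome associations
se_outcome = [0.01, 0.02]    # standard errors of the SNP-outcome associations
freqs = [0.5, 0.5]           # effect allele frequencies

estimate = ivw(gamma_hat, Gamma_hat, se_outcome)
for rho in (0.02, 0.025, 0.05):
    print(rho, rescale_to_latent_sd(estimate, gamma_hat, freqs, rho))
```

If the outcome is binary and the associations are log odds ratios, the rescaled value would be exponentiated to report an odds ratio per SD increase in the latent exposure. A published inverse‐variance weighted estimate can be passed directly as `ivw_estimate` without recomputing it.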

REAL DATA EXAMPLES

We can assess the performance of this method in a realistic setting by creating a dichotomised variable from an observed continuous measure, BMI. The idea is to dichotomise BMI at some threshold value and then treat only the dichotomisation as observed. We shall compare the true standardised effect of BMI on some outcome with our procedure described in Section 3 and with the naive approach of using the dichotomisation as the exposure. Our example is based on the Mendelian randomisation analysis performed in Lyall et al. (2017), which estimates the effect of BMI on several cardiometabolic measures in the UK Biobank cohort. In particular, we look at the effect of BMI on systolic blood pressure. This is a convenient exposure‐outcome relationship to estimate because we should not expect there to be threshold effects, that is, the dichotomisations of BMI should have no distinct effects on systolic blood pressure except through BMI itself. Consistent with Lyall et al. (2017), we use as potential instruments the 93 genome‐wide significant SNPs reported in Locke et al. (2015) available in UK Biobank and we control for age, sex, assessment centre, alcohol intake, smoking status and Townsend deprivation index, along with genetic batch and the first 10 principal components of the genetic relatedness matrix. To avoid weak instrument bias, we prune these SNPs by including those which correlate with BMI with (conditional on the other SNPs) as instruments. We estimate the “true” standardised effect of BMI on systolic blood pressure via two‐stage least squares, finding that a one SD increase in BMI corresponds to an increase in systolic blood pressure of 1.53 mmHg (95% CI = 0.34–2.72). At each BMI threshold, we then generate a binary variable equal to 1 if an individual's BMI is above the threshold and 0 otherwise. Treating only this binary measure as observed, we apply the latent variable approach of Section 3.2 using a probit link function. 
The results of this example are summarised in Figure 5, which compares the estimated effects with the “true” effect of 1.53. The estimates using the dichotomised measure as the exposure are highly sensitive to the choice of threshold. Since we should not expect there to be distinct threshold effects in this setting, this demonstrates that the dichotomised exposure is not capturing the effect of the latent exposure; instead, it is picking up the shape of the distribution of the environmental risk factor for BMI, as discussed in Section 3.1. As predicted by the bias formula in Section 3.1, the estimates are inflated at the extreme thresholds, where the density of this distribution is lowest.

Figure 5

Comparison of estimated effect with “true” effect for various BMI thresholds. N = 70,261, and 95% confidence intervals are generated over 1000 bootstrap resamples. “True” corresponds to the sample estimate using BMI as the exposure; “naive” corresponds to using the binary measure as the exposure; and “latent” corresponds to the latent variable estimator of Section 3.2. BMI, body mass index.

For the latent variable approach, we select a sensitivity parameter $\rho$ of 0.0256 based on the $R^2$ of our first‐stage regression of BMI on the genetic variants. The effect estimate from this approach is much less sensitive to the choice of threshold. Furthermore, the estimates appear to accurately recover the “true” effect of 1.53 regardless of the threshold value, ranging from 1.35 at a BMI cut‐off of 30 to 1.92 at a BMI cut‐off of 22.5. We can also investigate this approach in a more realistic setting by reanalysing two existing papers. Box 1 gives an example of how existing two‐sample results which do not have interpretable effect sizes can be reinterpreted using this method. The original paper finds that schizophrenia liability increases one's likelihood of using cannabis, although the effect sizes are not interpretable (Pasman et al., 2018). Using our approach, we find that a one SD increase in liability corresponds to an OR in the range 1.15–1.26 (95% CI 1.10–1.44) for ever using cannabis. This approach allows us to infer the size of this effect which, in this instance, is very modest. Box 2 gives an example of how this approach can correct exclusion restriction violations. In the original paper, the exposure is self‐reported adiposity, which is measured on a three‐point scale (“thinner,” “plumper,” and “about average”). Genetic instruments will be acting on the underlying measure of child adiposity (e.g., BMI) rather than the three‐point scale, so the exclusion restriction is likely to be violated (Richardson et al., 2020). 
We use our latent variable approach to ameliorate this bias and to estimate the effect of child BMI directly, which is the exposure of interest.
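The threshold sensitivity described in this section can be illustrated with a small simulation. The sketch below is not the estimator of Section 3.2 and does not use UK Biobank data. It assumes a single standardised genetic score, a jointly normal latent exposure with unit variance, and a known genetic variance share `h2`; all function names and parameter values are hypothetical. Under these assumptions, the Wald ratio on the dichotomised exposure equals the true effect divided by the standard normal density at the cut-off, so it grows as the cut-off moves into the tails, while rescaling the first stage by the assumed `h2` gives an effect per SD of the latent exposure that does not depend on the cut-off.

```python
import random
from statistics import NormalDist, fmean


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    ma, mb = fmean(a), fmean(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (len(a) - 1)


def simulate(n, beta, h2, seed=0):
    """Draw genetic score g, latent exposure x (variance 1) and outcome y."""
    rng = random.Random(seed)
    g, x, y = [], [], []
    for _ in range(n):
        gi = rng.gauss(0.0, 1.0)  # standardised genetic score (assumption)
        ei = rng.gauss(0.0, 1.0)  # non-genetic part of the exposure
        xi = h2 ** 0.5 * gi + (1.0 - h2) ** 0.5 * ei
        # ei also enters the outcome directly, so it confounds x and y
        yi = beta * xi + 0.5 * ei + rng.gauss(0.0, 1.0)
        g.append(gi)
        x.append(xi)
        y.append(yi)
    return g, x, y


def naive_wald(g, x, y, cutoff):
    """Wald ratio using the dichotomised exposure 1[x > cutoff]."""
    d = [1.0 if xi > cutoff else 0.0 for xi in x]
    return cov(y, g) / cov(d, g)


def latent_sd_effect(g, y, h2):
    """Effect per SD of the latent exposure for an assumed genetic share h2."""
    var_g = cov(g, g)
    reduced_form = cov(y, g) / var_g   # outcome change per unit of score
    first_stage = (h2 / var_g) ** 0.5  # exposure SDs per unit of score, implied by h2
    return reduced_form / first_stage


if __name__ == "__main__":
    beta, h2 = 1.5, 0.2  # illustrative values, not taken from the paper
    g, x, y = simulate(40_000, beta, h2)
    phi = NormalDist().pdf
    for c in (-1.5, 0.0, 1.5):
        # columns: cut-off, naive estimate, beta / phi(c), latent-scale estimate
        print(c, naive_wald(g, x, y, c), beta / phi(c), latent_sd_effect(g, y, h2))
```

In this simplified setting, the latent-scale estimate is proportional to one over the square root of the assumed `h2`, which is why the sensitivity parameter needs to be estimated or varied over a plausible range.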

DISCUSSION

We propose a simple framework for estimation and interpretation of Mendelian randomisation for coarsened measurements of latent continuous exposures. We begin by demonstrating in Section 3.1 that using the coarsened measurement as the exposure results in a multiplicative bias which will inflate or deflate effect estimates without reversing their sign. However, under the assumptions of our framework, described in Section 2, we can recover the effect of the latent exposure in terms of SD increases. Section 4.4 shows that it is straightforward to generalise this approach to the two‐sample setting. The key sensitivity parameter in our approach is the genetic share of the variance of the latent exposure, which may be estimated or varied over a plausible range of values (Lee et al., 2012). Section 5 evaluates this approach by creating binary exposure measurements from the continuous BMI measure in UK Biobank. We show that we can accurately recover the effect of an SD increase in BMI on systolic blood pressure. We also demonstrate this approach in practice by re‐analysing two papers which are likely to suffer from this type of exclusion restriction violation, allowing us to meaningfully interpret their effect sizes.

The approach proposed in this paper relies on a number of strong structural assumptions on the relationship between the latent exposure and its corresponding measurement. The appropriateness of these assumptions must be assessed on a case‐by‐case basis. Exposure measurements which are defined by strict thresholds of the latent continuous exposure are easiest to conceptualise within this framework. In general, the assumption most difficult to justify is that the thresholds are independent of the genetic share of the latent exposure. One example where this assumption may be violated is self‐report measures of mental health status, for example, feelings of depression on a 1–5 scale. Individuals who are genetically predisposed to depression may have different thresholds for reporting their mental wellbeing, either over‐ or under‐reporting.

An additional complication occurs when this method is applied to disease exposures. We have assumed throughout that disease occurrence and disease diagnosis are equivalent; that is, everyone who develops the disease will receive a diagnosis. However, there are often barriers to seeking and accessing the healthcare services needed to receive a diagnosis. These might include stigma surrounding the disease, a lack of trust in healthcare providers or a lack of access to healthcare services due to cost, distance or institutional complexities (Cassim et al., 2019; Stangl et al., 2019). It is therefore possible that individuals with the disease will fail to be diagnosed. This can be viewed as a form of misclassification bias. Misclassification‐robust methods for binary exposures could potentially be incorporated into this approach, which we leave for future work (Lewbel, 2000; Rekaya et al., 2016; Smith et al., 2013).

In studies where the assumptions in Section 2 are believed to be implausible, it is important for researchers to be transparent that the magnitude of their effect estimate will not be well‐defined.

AUTHOR CONTRIBUTIONS

George Davey Smith and Matthew J. Tudball conceived the idea. Matthew J. Tudball designed the method and performed the analyses. George Davey Smith, Jack Bowden, Kate Tilling, Qingyuan Zhao and Rachael A. Hughes supervised the project. All authors contributed to the main ideas and the writing of the manuscript.

CONFLICT OF INTERESTS

The authors declare that there are no conflicts of interest.

Table B1

Ratio of estimated to true with link function misspecification

	Value of the skewness parameter a
Choice of link function	0	1	2	3	4	5
Logistic	1.01	1.02	1.03	1.05	1.06	1.07
Logistic, 95% CI	[1.01, 1.02]	[1.01, 1.03]	[1.03, 1.04]	[1.04, 1.06]	[1.05, 1.07]	[1.06, 1.07]
Probit	1.01	1.02	1.03	1.05	1.06	1.07
Probit, 95% CI	[1.00, 1.02]	[1.01, 1.03]	[1.02, 1.04]	[1.04, 1.06]	[1.05, 1.07]	[1.06, 1.08]
Semiparametric*	1.00	1.00	1.00	1.00	1.01	1.01
Semiparametric*, 95% CI	[0.99, 1.00]	[0.99, 1.01]	[0.99, 1.01]	[1.00, 1.01]	[1.00, 1.01]	[1.00, 1.02]

*Klein and Spady estimator; mean over 1000 draws; N = 2500; 95% Monte Carlo confidence intervals.

Table B2

Ratio of estimated to true with threshold dependence

	Value of the threshold dependence parameter b
Choice of link function	0	0.1	0.25	0.5	1
Logistic	1.01	1.08	1.17	1.34	1.71
Logistic, 95% CI	[1.01, 1.02]	[1.07, 1.08]	[1.16, 1.18]	[1.33, 1.36]	[1.69, 1.72]
Probit	1.01	1.07	1.17	1.33	1.69
Probit, 95% CI	[1.00, 1.02]	[1.06, 1.08]	[1.16, 1.18]	[1.32, 1.34]	[1.68, 1.70]
Semiparametric*	1.00	1.05	1.15	1.30	1.67
Semiparametric*, 95% CI	[0.99, 1.00]	[1.05, 1.06]	[1.14, 1.16]	[1.29, 1.31]	[1.66, 1.69]

*Klein and Spady estimator; mean over 1000 draws; N = 2500; 95% Monte Carlo confidence intervals.

24 in total

1. Preference-based instrumental variable methods for the estimation of treatment effects: assessing validity and interpreting results.

Authors: M Alan Brookhart; Sebastian Schneeweiss
Journal: Int J Biostat Date: 2007 Impact factor: 0.968

2. GWAS of lifetime cannabis use reveals new risk loci, genetic overlap with psychiatric traits, and a causal influence of schizophrenia.

Authors: Joëlle A Pasman; Karin J H Verweij; Zachary Gerring; Sven Stringer; Sandra Sanchez-Roige; Jorien L Treur; Abdel Abdellaoui; Michel G Nivard; Bart M L Baselmans; Jue-Sheng Ong; Hill F Ip; Matthijs D van der Zee; Meike Bartels; Felix R Day; Pierre Fontanillas; Sarah L Elson; Harriet de Wit; Lea K Davis; James MacKillop; Jaime L Derringer; Susan J T Branje; Catharina A Hartman; Andrew C Heath; Pol A C van Lier; Pamela A F Madden; Reedik Mägi; Wim Meeus; Grant W Montgomery; A J Oldehinkel; Zdenka Pausova; Josep A Ramos-Quiroga; Tomas Paus; Marta Ribases; Jaakko Kaprio; Marco P M Boks; Jordana T Bell; Tim D Spector; Joel Gelernter; Dorret I Boomsma; Nicholas G Martin; Stuart MacGregor; John R B Perry; Abraham A Palmer; Danielle Posthuma; Marcus R Munafò; Nathan A Gillespie; Eske M Derks; Jacqueline M Vink
Journal: Nat Neurosci Date: 2018-08-27 Impact factor: 24.884

3. Mendelian randomization analysis with multiple genetic variants using summarized data.

Authors: Stephen Burgess; Adam Butterworth; Simon G Thompson
Journal: Genet Epidemiol Date: 2013-09-20 Impact factor: 2.135

4. Using published data in Mendelian randomization: a blueprint for efficient identification of causal risk factors.

Authors: Stephen Burgess; Robert A Scott; Nicholas J Timpson; George Davey Smith; Simon G Thompson
Journal: Eur J Epidemiol Date: 2015-03-15 Impact factor: 8.082

5. The MR-Base platform supports systematic causal inference across the human phenome.

Authors: Gibran Hemani; Jie Zheng; Benjamin Elsworth; Tom R Gaunt; Philip C Haycock; Kaitlin H Wade; Valeriia Haberland; Denis Baird; Charles Laurin; Stephen Burgess; Jack Bowden; Ryan Langdon; Vanessa Y Tan; James Yarmolinsky; Hashem A Shihab; Nicholas J Timpson; David M Evans; Caroline Relton; Richard M Martin; George Davey Smith
Journal: Elife Date: 2018-05-30 Impact factor: 8.140

6. Mendelian randomization with a binary exposure variable: interpretation and presentation of causal estimates.

Authors: Stephen Burgess; Jeremy A Labrecque
Journal: Eur J Epidemiol Date: 2018-07-23 Impact factor: 8.082

7. Investigating causal relations between sleep traits and risk of breast cancer in women: mendelian randomisation study.

Authors: Rebecca C Richmond; Emma L Anderson; Hassan S Dashti; Samuel E Jones; Jacqueline M Lane; Linn Beate Strand; Ben Brumpton; Martin K Rutter; Andrew R Wood; Kurt Straif; Caroline L Relton; Marcus Munafò; Timothy M Frayling; Richard M Martin; Richa Saxena; Michael N Weedon; Debbie A Lawlor; George Davey Smith
Journal: BMJ Date: 2019-06-26

8. The Health Stigma and Discrimination Framework: a global, crosscutting framework to inform research, intervention development, and policy on health-related stigmas.

Authors: Anne L Stangl; Valerie A Earnshaw; Carmen H Logie; Wim van Brakel; Leickness C Simbayi; Iman Barré; John F Dovidio
Journal: BMC Med Date: 2019-02-15 Impact factor: 8.775

9. Patient and carer perceived barriers to early presentation and diagnosis of lung cancer: a systematic review.

Authors: Shemana Cassim; Lynne Chepulis; Rawiri Keenan; Jacquie Kidd; Melissa Firth; Ross Lawrenson
Journal: BMC Cancer Date: 2019-01-08 Impact factor: 4.430

10. Biological insights from 108 schizophrenia-associated genetic loci.

Authors:
Journal: Nature Date: 2014-07-22 Impact factor: 49.962
