Strengthening Causal Inference in Exposomics Research: Application of Genetic Data and Methods.

Christy L Avery^1,2, Annie Green Howard^3,2, Anna F Ballou¹, Victoria L Buchanan¹, Jason M Collins¹, Carolina G Downie¹, Stephanie M Engel¹, Mariaelisa Graff¹, Heather M Highland¹, Moa P Lee¹, Adam G Lilly^2,4, Kun Lu⁵, Julia E Rager⁵, Brooke S Staley¹, Kari E North¹, Penny Gordon-Larsen^6,2.

Abstract

Advances in technologies to measure a broad set of exposures have led to a range of exposome research efforts. Yet, these efforts have insufficiently integrated methods that incorporate genetic data to strengthen causal inference, despite evidence that many exposome-associated phenotypes are heritable. Objective: We demonstrate how integration of methods and study designs that incorporate genetic data can strengthen causal inference in exposomics research by helping address six challenges: reverse causation and unmeasured confounding, comprehensive examination of phenotypic effects, low efficiency, replication, multilevel data integration, and characterization of tissue-specific effects. Examples are drawn from studies of biomarkers and health behaviors, exposure domains where the causal inference methods we describe are most often applied. Discussion: Technological, computational, and statistical advances in genotyping, imputation, and analysis, combined with broad data sharing and cross-study collaborations, offer multiple opportunities to strengthen causal inference in exposomics research. Full application of these opportunities will require an expanded understanding of genetic variants that predict exposome phenotypes as well as an appreciation that the utility of genetic variants for causal inference will vary by exposure and may depend on large sample sizes. However, several of these challenges can be addressed through international scientific collaborations that prioritize data sharing. Ultimately, we anticipate that efforts to better integrate methods that incorporate genetic data will extend the reach of exposomics research by helping address the challenges of comprehensively measuring the exposome and its health effects across studies, the life course, and in varied contexts and diverse populations. https://doi.org/10.1289/EHP9098.

Entities: Chemical

Substances: Biomarkers

Year: 2022 PMID: 35533073 PMCID: PMC9084332 DOI: 10.1289/EHP9098

Source DB: PubMed Journal: Environ Health Perspect ISSN: 0091-6765 Impact factor: 11.035

Introduction

The past decade has witnessed a paradigm shift in the environmental sciences as studies have shifted from examining specific exposures to attempting to comprehensively measure and characterize the broad range of exposures an individual may encounter over the life course. Termed “exposomics,” this emerging approach aims to better understand the development and progression of disease by comprehensively measuring exogenous and endogenous environmental exposures (the “exposome”) at multiple levels and time periods.[1] The exposome is understandably complex and includes domains that span the chemical environment (e.g., nutrients, medications, and toxicants), health behaviors (e.g., cigarette smoking, sleeping, and physical activity), the social environment (e.g., neighborhood characteristics, social networks, and racism), and the natural and built environment (e.g., air and water pollution, green space)[1-5] (Figure 1). These domains include individual and aggregate exposures (e.g., diet and the built environment) that may be measured directly (e.g., secondhand smoking) or as a biomarker of exposure or effect (e.g., cotinine). The exposome’s broad scope and complex correlation structure have elicited comparisons with the Human Genome Project.[3,6,7] However, the dynamic and high-dimensional nature of the exposome makes measurement, characterization, and causal inference—the discipline that considers the assumptions, study designs, and estimation strategies that allow researchers to draw causal conclusions based on data[8,9]—far more complex.[1]

Figure 1.

Conceptual diagram of the exposome. By placing genetic data in the middle of four exposome domains (the natural and built environment, the social environment, the chemical environment, and health behaviors), the central role of genetic data is emphasized. Figure adapted from Vermeulen R, Schymanski EL, Barabasi AL, Miller GW. 2020. The exposome and health: Where chemistry meets biology. Science 367:392–396. Reprinted with permission from American Association for the Advancement of Science (AAAS).

Emerging statistical methods that integrate genetic data offer several avenues to help address measurement, characterization, and causal inference challenges in exposomics research. These approaches are enabled by the time invariance of germline genetic variants and a growing appreciation that many exposures are heritable, making genetic data a central component of the exposome (Figure 1). For example, although arsenic exposure in itself is not heritable, prior studies demonstrated that biomarkers of arsenic metabolism efficiency, which modulates the absolute and relative amounts of disease-associated arsenic metabolites, were highly heritable (range: 50%–59%).[10] The consequent identification of genetic variants associated with arsenic metabolism efficiency biomarkers has enabled causal inference studies examining the role of arsenic in skin lesion risk[11] and cardiometabolic diseases.[12] The goal of this commentary is to demonstrate how the application of causal inference methods that integrate genetic data can empower and enrich exposomics research by helping address six challenges:[3,4,7,13] reverse causation and unmeasured confounding, comprehensive examination of phenotypic effects, low efficiency, replication, multilevel data integration, and characterization of tissue-specific effects.
To frame this commentary for audiences with varied backgrounds in genetics and exposure science, we provide overviews of genetic variant measurement, core study designs, and consortia, as well as exposomic approaches that leverage advances in high-resolution mass spectrometry (MS).[14] We then describe two core metrics—heritability and genetic risk scores (GRS)—that may be estimated from genetic data and may inform or be employed in several approaches we describe: Mendelian randomization (MR), phenome-wide association studies (PheWAS), joint models for missing data, cross-study replication, multilevel data integration methods, and tissue-specific biomarker imputation (Table 1). Examples are drawn from studies of biomarkers and health behaviors, exposure domains where the causal inference methods we describe are most often applied. Finally, we acknowledge limitations, including how the utility of methods that integrate genetic data will vary by exposure and may depend on the presence of specific environmental exposures[1] and large sample sizes.[15,16]

Table 1

Six methods or approaches that leverage genetic data to address challenges facing exposomics research and empower causal inference.

Challenge	Statistical method or approach afforded by genetic data
Reverse causation and unmeasured confounding	Mendelian randomization
Comprehensive phenotype measurement and characterization of phenotypic effects	PheWAS of genetically predicted exposures in large biobanks or populations with EHR.
Decreased efficiency from data missing by design or from detection limits	Joint models that address missing data from exposure measured on subset of participants and detection limits by leveraging the information available from any associations between genetics and covariates with exposomic data
Difficulty replicating findings, particularly if exposure is not measured broadly, not measured with comparable protocols, or unidentified	External replication using genetically predicted exposures.
Multilevel data that may be difficult to integrate	Integrative approaches that use genetic data as a framework to link multi-omic data.
Limited ability to characterize tissue-specific effects	Imputation of tissue-specific biomarkers of exposure and internal dose (e.g., transcripts, methylation, metabolomics, proteins) using publicly available data.

Measurement of Genetic Variants, Genome-Wide Association Study (GWAS), and Consortia

Measurement of genetic variants.

Over the past three decades, assays that measured a handful of genetic variants have advanced to today’s whole genome sequencing. Whole genome sequencing is considered the gold standard in genome measurement because of its accuracy, scope, and ability to identify new genetic variants.[17] However, cost, storage, and computational feasibility have limited widespread adoption of whole genome sequencing. Instead, the primary source of genetic data remains arrays that genotype 500,000 to 5 million genetic variants. Genotyped genetic variants are then used as a scaffold for high-quality imputation to a wider set of genetic variants,[18] helping ensure a common set of genetic variants across studies. Imputation servers [e.g., the Michigan Imputation Server (https://imputationserver.sph.umich.edu/index.html) and the Trans-Omics Precision Medicine (TOPMed) Imputation Server (https://imputation.biodatacatalyst.nhlbi.nih.gov/)] that perform genetic imputation quickly and free of charge have streamlined, simplified, and improved imputation of array data.[19-22] Imputation coverage and accuracy depend crucially on the size of the reference panel used for imputation, the density of genotyped variants, and the genetic distance between reference panel populations and target populations.[23,24] The largest broadly available reference panels are from TOPMed and include individuals of diverse ancestry.[25,26] These panels provide highly accurate imputation of genetic variants down to low minor allele frequencies (MAFs) across multiple ancestries.[26] Although earlier reference panels include individuals from more diverse worldwide populations, these reference panels provide more limited imputation coverage and accuracy because of smaller sample sizes.[27]

The GWAS.

Advances in genotyping and imputation have facilitated the rise of the GWAS as a key study design to identify genetic variants associated with complex phenotypes (reviewed by Tam et al.[28]). By design, GWAS are unbiased with respect to mechanistic hypotheses, biological knowledge, and genomic location.[17,29] This design has been remarkably successful in mapping variant–phenotype associations.[30,31] The requirement for large sample sizes and the importance of replication have prompted the formation of numerous large GWAS consortia[32-34] that are well powered to detect common, infrequent, and rare variants associated with complex phenotypes. These consortia also routinely share summary-level data (e.g., effect estimates, p-values, and allele frequencies) that are publicly available via centralized repositories[30] (https://www.ncbi.nlm.nih.gov/gap/).

Exposomics Consortia

Recognizing the efficiencies enabled by a consortium model, exposomics consortia also have been formed (see Vrijheid et al.[35] and https://emoryhercules.com/). For example, the EXPOsOMICs project is a European exposomics consortium that includes experimental studies, mother–child cohorts, observational studies of adults, and personal exposure monitoring studies.[36] Several studies contributing to EXPOsOMICs have genetic data. By including cohorts across the life course, consortia like EXPOsOMICs enable examination of questions that would be difficult to address within a single study; these consortia also provide opportunities for replication or data pooling. Phenotype-specific consortia also have been assembled, including the COnsortium of METabolomics (COMETS).[37] COMETS is the world’s largest metabolomics consortium and comprises 47 international cohorts that include participants with blood metabolomics data. Genetic data were measured in approximately 68% of COMETS participants.

Core Metrics

Heritability.

Heritability estimates the proportion of phenotypic variation (range: 0–1) attributable to additive, dominance, and epistatic variance components, i.e., “broad-sense heritability” (reviewed by Zaitlen and Kraft[38]). Traditional methods to estimate heritability use family-based studies and only quantify additive genetic variance (“narrow-sense heritability”), the major contributor to broad-sense heritability, given longstanding challenges estimating nonadditive genetic variance.[39] Recent innovations have enabled approximation of narrow-sense heritability in population-based studies using genetic data[40]; these approximations typically underestimate narrow-sense heritability due to incomplete linkage disequilibrium (LD) between causal genetic variants and genotyped genetic variants as well as error in genetic variant effect estimates.[15] Narrow-sense heritability can help gauge the potential utility of genetic data to inform causal inference for a given exposure, because genetic data will have limited value when narrow-sense heritability is low. As an example, an Australian twin study of the metallic elements arsenic, cadmium, copper, mercury, lead, selenium, and zinc measured in erythrocytes estimated narrow-sense heritabilities of moderate size ranging from 0.19 (cadmium) to 0.40 (lead).[41] A follow-up GWAS of copper, selenium, and zinc identified eight highly significant common genetic variants that mapped to loci containing genes with roles in trace element metabolism.[42] These genetic variants accounted for 4%–8% of phenotypic variance in copper, selenium, and zinc. Effects of this magnitude are considerably larger than the majority of genetic variants identified by GWAS to date.[43]
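
The distinction between broad-sense and narrow-sense heritability can be written compactly as a variance decomposition. The symbols below follow conventional quantitative-genetics notation and are an assumption here, not notation taken from the text; the decomposition also assumes no gene-environment covariance or interaction.

```latex
% V_P: total phenotypic variance
% V_A, V_D, V_I: additive, dominance, and epistatic genetic variance
% V_E: environmental (nongenetic) variance
\begin{align}
V_P &= V_A + V_D + V_I + V_E \\
H^2 &= \frac{V_A + V_D + V_I}{V_P} && \text{(broad-sense heritability)} \\
h^2 &= \frac{V_A}{V_P} && \text{(narrow-sense heritability)}
\end{align}
```

Because $V_D$ and $V_I$ are nonnegative, $h^2 \le H^2$, and both quantities lie between 0 and 1.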

Genetically Predicted Phenotypes

Genetic data can help extend the reach of exposome studies by enabling the estimation of genetically predicted phenotypes. These genetically predicted phenotypes are then substituted for measured phenotypes when conducting association studies or causal inference investigations. As described below, in addition to expanding the number of phenotypes for evaluation, the use of a genetically predicted phenotype can help reduce bias from confounding and reverse causation.[44] Genetically predicted phenotypes can be constructed using one genetic variant, a limited number of genetic variants, or hundreds to millions of genetic variants. For phenotypes with a monogenic or oligogenic genetic architecture, genetically predicted phenotypes may be constructed using one genetic variant or by aggregating a small number of independent genetic variants.[45,46] For polygenic phenotypes, genetically predicted phenotypes often are constructed by aggregating hundreds to millions of genetic variants (reviewed by Chatterjee et al.[47] and Wray et al.[15]). Aggregation of genetic variants into a genetically predicted phenotype is accomplished using a GRS. GRS are calculated as a sum of genetic variants that are typically weighted by the magnitude of association between each genetic variant and the phenotype of interest. Numerous approaches are available to estimate GRS, which are distinguished by the method used to select and weigh genetic variants and the method used to account for LD. GRS weights are usually derived from a GWAS[47-49] and then applied in an independent, ancestry-matched target sample for validation.[15,49] In the absence of an independent target sample for validation, methods are emerging[50,51] that estimate GRS using cross-validation to address overfitting. 
These methods offer efficient alternatives for studies without access to independent data and may be particularly useful when examining a phenotype that is difficult to measure or uncommonly measured, or when conducting research in a unique population.
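
As a minimal sketch of the weighted-sum calculation described above, the following Python function computes a GRS for each participant from effect-allele dosages and per-variant weights. The function name, the dosage values, and the weights are hypothetical and chosen only for illustration; variant selection, LD handling, and validation in an independent sample are not shown.

```python
def genetic_risk_score(dosages, weights):
    """Return one weighted GRS per participant.

    dosages: list of per-participant lists of effect-allele dosages (0 to 2),
             one entry per variant, in the same order as `weights`.
    weights: per-variant weights, e.g., GWAS effect estimates.
    """
    scores = []
    for person in dosages:
        if len(person) != len(weights):
            raise ValueError("each participant needs one dosage per weight")
        # Weighted sum of effect-allele dosages across variants.
        scores.append(sum(d * w for d, w in zip(person, weights)))
    return scores


# Hypothetical example: three participants, three independent variants.
weights = [0.10, -0.05, 0.20]
dosages = [
    [0, 1, 2],
    [2, 0, 1],
    [1, 1, 0],
]
scores = genetic_risk_score(dosages, weights)
```

In practice, dosages from imputed data are often fractional rather than integer, and the same weighted sum applies.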

Exposure Science and the Chemical Exposome

Environmental health studies have undergone a dramatic shift in recent years, with rapid technological advancements enabling broader coverage of the chemical exposome while also expanding the inclusion of nonchemical stressors.[7,52,53] Approaches for chemical exposome characterization include suspect screening and nontargeted analyses, which enable the measurement of many chemicals simultaneously using approaches that rely on high-resolution chemical detection coupled with computational methods to efficiently mine large data sets. Targeted analytical methods also may be employed to evaluate the impacts of exposures to chemical mixtures in the environment.[54] These methods provide more limited coverage of chemicals and thus may not capture exposure information at the “-omic” level. Indeed, an increasing number of global measurement approaches have recently been implemented to characterize exposome signatures within environmental media, including household dust,[55] drinking water,[56] and consumer products.[57] Biological samples, such as blood, saliva, teeth, and urine, also may be analyzed to measure chemicals and their associated metabolites, as well as other exposure biomarkers.[53,58-62]

Suspect screening and nontargeted analyses.

Suspect screening and nontargeted analyses leverage MS platforms coupled with compound database matching approaches to identify and potentially confirm chemicals.[63-66] Suspect screening can be implemented using both gas chromatography (GC) and liquid chromatography (LC) separation followed by either low- or high-resolution MS detection. GC-based methods can be aided by the addition of electron ionization, whereas LC-based methods can use softer ionization techniques, such as electrospray ionization or atmospheric pressure chemical ionization, resulting in detailed fragmentation spectra information to better identify tentative chemicals. With suspect screening approaches, resulting spectra are compared against a library of known compounds, and those with matching attributes are identified and prioritized for confirmation. Nontargeted approaches, in contrast, rely on high-resolution MS platforms to acquire accurate mass, isotope profile, and fragmentation spectra. These data are then used to predict chemical structures, and chemicals are tentatively assigned formulas and associated chemical identifier information. Therefore, suspect screening analyses query for known chemicals, whereas nontargeted analyses generate information on chemicals that are potentially completely unknown. Both approaches yield tentatively identified chemicals which require further confirmation, using tandem mass spectrometry (MS/MS) fragmentation information or confirmation via chemical standards.[63,64]

Prioritizing chemicals for confirmation.

It is not feasible to confirm all, or even the majority, of chemicals in a given sample. Because of this limitation, it is important to develop and implement methods to prioritize chemicals for final confirmation. Data streams to aid prioritization include chemical exposure estimates and metabolite predictions, which inform the likelihood of a chemical being absent or present in a given sample, as well as toxicity screening and prediction data, which inform the likelihood of a chemical being toxic and therefore of high interest.[55,64] As these methods grow, exposomic measures likely will become increasingly integrated across multiple tiers of data to better address the dynamic nature of the exposome and its overall influence on health and disease.

How Causal Inference Methods and Study Designs That Use Genetic Data Can Empower Exposomics

In this section we draw on research interrogating a spectrum of exposures to demonstrate how causal inference methods and study designs that integrate genetic data can empower exposomics research. We focus on six challenges that are not necessarily unique to exposomics research but that we consider particularly salient, given the scope and dimensionality of the exposome. Although it may not be feasible to comprehensively address all six challenges using causal inference methods and study designs that integrate genetic data, we anticipate that these approaches will help strengthen causal inference for numerous exposome phenotypes.

Method 1: evaluate reverse causation and unmeasured confounding.

Ideally, exposomics research would leverage a longitudinal prospective design in which exposures are sampled repeatedly before an outcome occurs.[4] However, many exposome studies use cross-sectional[67] or case–control designs.[68] Reverse causation is a concern with these designs, because disease status may affect the exposure or its measurement.[69-71] Other challenges include unmeasured or poorly measured confounders as well as exposures that are poorly understood, making the identification of confounders difficult. MR, a popular causal inference tool that uses genetic data to investigate associations between potentially modifiable risk factors, including environmental exposures, and outcomes in observational data (reviewed by Davies et al.[44]), has been proposed to help address these challenges. MR is a form of instrumental variable (IV) analysis[72] based on the concept that if exposure X affects outcome Y, factors affecting X (i.e., inherited genetic variants, G) also must affect Y (Figure 2). G therefore serves as an IV for studies of the X–Y association. Strengths of MR include G–Y associations that are generally robust to confounding from variables other than ancestry, which can be addressed through adjustment.[73] Because G is determined at conception, G precedes Y, also protecting against reverse causation under the assumption that G is associated with X, not Y or an alternative cause of Y.[74] MR also is dependent on the identification of a strong genetic IV (i.e., a genetically predicted phenotype) and assumes an absence of pleiotropy (i.e., that G does not affect Y through pathways other than X).
The development of methods to evaluate these assumptions is an active area of research,[75,76] and alternative methods, including mediation analysis, have been proposed when assumptions are violated[77] or when biological mechanisms do not conform to assumptions.[78] If IV assumptions are satisfied, MR can inform on the presence and direction of the association between X and Y. However, the numerical values of MR estimates generally are not informative: because they derive from the G–Y association, they cannot be interpreted as the predicted real-world influence of changes to X.[79] Although few investigators have successfully used MR to study time-varying exposures,[80] methods are under development.[81]
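
To make the IV logic concrete, the sketch below computes two widely used summary-data MR quantities: a per-variant ratio (Wald) estimate, which divides the G–Y association by the G–X association, and a fixed-effect inverse-variance-weighted (IVW) combination across variants. This is an illustrative choice of estimator, not necessarily the one used in the studies cited here. The function names and all input values are hypothetical, the variants are assumed to be independent valid instruments, and the standard error uses a first-order approximation that ignores uncertainty in the G–X estimate. Consistent with the caution above, the output is best read for the presence and direction of an effect.

```python
import math


def wald_ratio(beta_gx, beta_gy, se_gy):
    """Ratio estimate of the X -> Y effect from a single variant.

    beta_gx: variant-exposure association (G-X)
    beta_gy: variant-outcome association (G-Y)
    se_gy:   standard error of beta_gy
    """
    estimate = beta_gy / beta_gx
    # First-order approximation: ignores uncertainty in beta_gx.
    se = se_gy / abs(beta_gx)
    return estimate, se


def ivw(ratio_estimates):
    """Fixed-effect inverse-variance-weighted mean of (estimate, se) pairs."""
    weights = [1.0 / se ** 2 for _, se in ratio_estimates]
    total = sum(weights)
    estimate = sum(w * est for w, (est, _) in zip(weights, ratio_estimates)) / total
    return estimate, math.sqrt(1.0 / total)


# Hypothetical summary statistics: (beta_gx, beta_gy, se_gy) per variant.
variants = [
    (0.20, 0.10, 0.02),
    (0.10, 0.04, 0.02),
    (0.25, 0.15, 0.05),
]
ratios = [wald_ratio(bx, by, se) for bx, by, se in variants]
ivw_estimate, ivw_se = ivw(ratios)
```

Weak instruments (small `beta_gx`) inflate the ratio estimate's variance, which is one reason the text emphasizes the need for a strong genetic IV.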

Figure 2.

Example causal diagram representing the relationship between genetic variants G, exposure X, and outcome Y. The hypothesis tested by Mendelian randomization is shown by the dotted arrow where G serves as an instrumental variable for X (solid arrow).

Despite these challenges, MR has been used to strengthen causal inference in exposomics research across a variety of phenotypes.[82-84] For example, Pierce et al.[11] used MR to confirm the presence and direction of the association between biomarkers of arsenic methylation efficiency and arsenic toxicity.[11] Here, MR helped gauge the degree to which observational findings reflected reverse causation or residual confounding by unmeasured or poorly measured covariates in a process termed “triangulation,” i.e., the integration of results from several different approaches, each with different and unrelated key sources of potential bias.[85] Other applications of MR relevant to exposomics research include multivariate MR, which has been used to simultaneously examine causal effects of correlated phenotypes.[86]

Method 2: comprehensively examine phenotypic effects.

Few studies have comprehensively studied the health effects of exposome phenotypes.[87] However, advances in large-scale phenotyping through biobanks and linkage to electronic health records (EHR), in combination with genetic data, offer opportunities to help address this research gap via a PheWAS (reviewed by Bush et al.[88]). Benefits of the hypothesis-free PheWAS include broad phenotypic characterization, enabling the identification of potentially novel associations. For example, a recent PheWAS of genetically predicted serum calcium examined associations with 925 disease outcomes constructed from hospital inpatient and mortality data.[89] This PheWAS identified associations with renal, musculoskeletal, and cardiovascular phenotypes, which in part mirror findings from calcium supplementation trials.[90] Unexpected associations with allergy or adverse effects of penicillin also were identified, which may point to an unappreciated role of calcium in immune function. Limitations of PheWAS include the requirement of a genetically predicted phenotype and large sample sizes with broad phenotypic characterization in the same ancestral population from which the genetically predicted phenotype was constructed.[91,92] In addition, few EHR PheWAS have fully incorporated unstructured exposure, behavioral, or lifestyle data, which are likely highly relevant to exposomics research but are challenging to extract from or missing in clinical free text.[93] Emerging electronic phenotyping approaches[93] and global biobank initiatives[94] offer potential ways forward.
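
The core bookkeeping of a PheWAS can be sketched as a scan over many outcomes with a multiple-testing correction. The example below assumes that an association estimate and standard error between a genetically predicted exposure and each phenotype have already been obtained; it then computes two-sided p-values under a normal approximation and applies a Bonferroni threshold. The Bonferroni correction, the function names, the phenotype labels, and the numbers are all assumptions made for illustration and are not taken from the cited calcium study.

```python
import math


def two_sided_p(beta, se):
    """Two-sided p-value for beta/se under a standard normal approximation."""
    z = beta / se
    return math.erfc(abs(z) / math.sqrt(2.0))


def phewas_scan(results, alpha=0.05):
    """Flag phenotypes that pass a Bonferroni-corrected threshold.

    results: list of (phenotype, beta, se) tuples, one per tested outcome.
    Returns the per-test threshold and a list of (phenotype, p) hits.
    """
    threshold = alpha / len(results)
    hits = []
    for phenotype, beta, se in results:
        p = two_sided_p(beta, se)
        if p < threshold:
            hits.append((phenotype, p))
    return threshold, hits


# Hypothetical associations of a genetically predicted exposure with outcomes.
results = [
    ("phenotype_A", 0.30, 0.05),
    ("phenotype_B", 0.05, 0.05),
    ("phenotype_C", -0.10, 0.05),
]
threshold, hits = phewas_scan(results)
hit_names = [name for name, _ in hits]
```

With hundreds of outcomes, as in the 925-outcome example above, the per-test threshold becomes correspondingly stricter, which is one reason large sample sizes are required.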

Method 3: increase efficiency.

Due to cost or other constraints, biomarkers of exposure and effect may be measured on a subset of study participants. Measuring biomarkers on a subset of participants introduces missing data by design, which reduces efficiency in comparison with an analysis of the entire study population and thereby increases uncertainty and decreases statistical power. Because genetic data often are available on larger population subsets, statistical methods that use genetic data to infer biomarkers in participants not selected for measurement can help increase efficiency. For example, the Atherosclerosis Risk in Communities (ARIC) study measured serum metabolites on 4,032 (26%) of 15,792 participants at baseline.[95] Although analyses investigating associations between metabolites and outcomes often restrict to this smaller sample, imputation methods may increase efficiency. These methods perform well when the sample with measured data is a random or stratified random sample of the larger study population.[96-98] In addition to missing data due to study design, many biomarkers of exposure and effect are subject to limits of detection and may be nondetectable. There are commonly accepted analytic practices for nondetectable data; however, most methods cannot address multiple missing data mechanisms.[99,100] Treating all missing data as originating from one missing data mechanism also can result in grossly inefficient and potentially biased estimates.[101,102] Methods that leverage genetic data can account for biomarker data that are missing due to study design and limits of detection. Using metabolites as an example, these missing-data methods use joint models to model both the association between genetic data and metabolites and the association between the metabolites and the outcome[103]; data from participants with genetic and outcome data are used regardless of whether metabolites were measured.
By using data from a larger group of participants (e.g., increasing the analytic sample size in the ARIC Study[104]), these models offer more efficient estimates of exposure–outcome associations. Accounting for two types of missing data also reduces the potential for biased estimates. Simulations have shown scenarios where these methods provided virtually unbiased estimates, whereas methods addressing only one type of missing data can provide estimates that differ by as much as 20% from the true value.[103]

Method 4: replicate findings.

Genetic data also provide a template for replication, defined here as the consistent estimation of effect direction, significance, and potentially magnitude (depending on the phenotype under investigation) in an independent sample from the same source population.[105] Replication is especially important in exposomic studies, because the number of exposures adds an exploratory element and concomitant large potential for false positive findings[106] that may mislead scientists and the public and misdirect the allocation of scientific resources. In GWAS, the potential for a high proportion of false positive results was addressed through imputation to common genotype reference panels, stringent multiple hypothesis testing correction, and replication (reviewed in Weinberg[107] and Chanock et al.[108]). A parallel framework is needed in exposomics research, although many exposures may not be measured broadly. Methods and output also may vary across studies according to study design factors such as sample type, instrumentation, analytical conditions, and the domain of chemicals under investigation. For studies of the exposome where researchers do not have access to an independent replication sample, genetic data may provide a partial solution. To illustrate, a recent study identified associations between manganese, lead, and chromium biomarkers with intelligence quotient (IQ) in adolescents.[67] Although the authors did not attempt replication, partnering with an independent study with measured IQ and genetic data from which to construct measures of genetically predicted manganese, lead, and chromium phenotypes could provide a replication opportunity. Other avenues for replication could include publicly available GWAS summary statistics.[30] Using publicly available GWAS summary statistics, researchers could examine whether genetic variants predictive of the phenotype of interest also were predictive of the outcome.
Returning to the Bauer 2020 example, genetic variant rs13107325 was identified in GWAS of manganese[109] as well as intelligence[110] and general cognitive ability,[111] providing independent evidence linking manganese with IQ.
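
A summary-statistics lookup of this kind can be sketched as a simple intersection: find variants that are significant in the exposure GWAS and check whether they also are significant in the outcome GWAS. In the sketch below, the threshold of 5e-8 is the conventional genome-wide significance level, the variant identifiers and all effect estimates and p-values are hypothetical placeholders, and effects are assumed to be aligned to the same allele in both data sets.

```python
GENOME_WIDE_P = 5e-8  # conventional genome-wide significance threshold


def shared_signals(exposure_gwas, outcome_gwas, threshold=GENOME_WIDE_P):
    """List variants significant in both GWAS summary-statistic tables.

    Each table maps variant ID -> (beta, p_value). Returns a list of
    (variant, same_direction) tuples, where same_direction is True when the
    two effect estimates share a sign (assumes the same effect allele).
    """
    shared = []
    for variant, (beta_x, p_x) in exposure_gwas.items():
        if p_x >= threshold or variant not in outcome_gwas:
            continue
        beta_y, p_y = outcome_gwas[variant]
        if p_y < threshold:
            shared.append((variant, (beta_x > 0) == (beta_y > 0)))
    return shared


# Hypothetical summary statistics keyed by placeholder variant IDs.
exposure_gwas = {
    "rs_hyp_1": (0.12, 1e-12),
    "rs_hyp_2": (0.08, 3e-9),
    "rs_hyp_3": (0.02, 0.2),
}
outcome_gwas = {
    "rs_hyp_1": (-0.03, 2e-10),
    "rs_hyp_2": (0.01, 0.4),
    "rs_hyp_4": (0.05, 1e-9),
}
shared = shared_signals(exposure_gwas, outcome_gwas)
```

A shared variant is supporting evidence rather than proof of a causal link, because the same variant can influence both traits through separate pathways.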

Method 5: multilevel data integration.

The exposome includes exposures that are multilevel, complex, and likely affected by genetic, environmental, and gene-by-environment effects. However, when describing these complexities, a majority of exposome studies distinguish between environmental and genetic causes of disease, with few studies considering opportunities to integrate information. Multi-omics studies, which include genomic, epigenomic, transcriptomic, proteomic, microbiomic, and exposomic data, are emerging efforts that attempt to disentangle complex, multilayered exposure effects. Examples of multilevel exposome studies include dimensionality reduction and variable selection approaches that consider the correlation structure between multiple omics.[112] Systems and network analyses also have been used to better assess the complex interplay within and between different omics while accounting for biological functionality.[113,114] Parallel efforts include the modeling of concentration dependency, and several tools that accommodate different dose–response trends also have been published.[115,116] Together, these approaches are promising avenues to address cross-omics relationships and their complex dynamics.[117] Despite emerging interest in multi-omics studies, few studies have integrated genetic data with other omics data, even though genetics is the most mature of the omics fields.[118] One example is provided by research examining atopic dermatitis (AD),[119] which integrated genetic, epigenomic, transcriptomic, and proteomic data to better understand disease heterogeneity. A crucial component of that study approach was the use of GWAS findings to identify priority genes from which candidate disease pathways integrating multilevel data were constructed and tested.

Method 6: characterization of tissue-specific effects.

Biomarkers of exposure and effect are promising tools for evaluating molecular responses to exposures as well as the downstream consequences of variation in molecular response. However, direct measurement of biomarkers across relevant tissues is largely infeasible because of expense and limited tissue accessibility. This evidence gap constrains interpretation of biomarker effects and determination of their relevance. The parallel collection of genetic and omics data in varied tissues enables construction of tissue-specific genetically predicted phenotypes; one example is genetically predicted gene expression.[120,121] Measures of genetically predicted gene expression offer a partial solution to examining the downstream effects of variation in tissue-specific gene expression, because models to infer genetically predicted gene expression are publicly available (http://predictdb.org/ and http://gusevlab.org/projects/fusion/). These models can also be used to construct exposures for association testing and to examine evidence of tissue-specific effects. Similar imputation approaches are being developed for other omics, including DNA methylation levels.[122] Although exposomics research examples are currently scarce, emerging research examining genetically predicted omics in inaccessible but highly relevant tissues demonstrates the potential of this approach for pathophysiological insight.[123]
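As a rough illustration of the idea, genetically predicted gene expression is often computed as a weighted sum of a person's genotype dosages, with one set of weights per gene and tissue. The sketch below assumes that simple linear form. The function name, tissue labels, weights, and dosages are all hypothetical toy values; they are not taken from the cited model repositories, whose file formats and tooling are not reproduced here.

```python
from typing import Dict, List, Sequence


def predict_expression(dosages: Sequence[Sequence[float]],
                       weights: Sequence[float]) -> List[float]:
    """Return one predicted expression value per person.

    dosages: one row per person, one column per variant (0, 1, or 2
             copies of the effect allele, or an imputed fractional dosage).
    weights: one weight per variant for a single gene in a single tissue.
    """
    predicted = []
    for person in dosages:
        if len(person) != len(weights):
            raise ValueError("each dosage row must have one entry per weight")
        # Linear predictor: sum over variants of weight * dosage.
        predicted.append(sum(w * d for w, d in zip(weights, person)))
    return predicted


# Hypothetical weights for the same three variants in two tissues.
TOY_WEIGHTS: Dict[str, List[float]] = {
    "liver": [0.30, -0.10, 0.05],
    "brain": [0.00, 0.20, 0.00],
}

# Hypothetical dosages for three people at those three variants.
TOY_DOSAGES = [
    [0, 1, 2],
    [2, 0, 1],
    [1, 1, 1],
]

predicted_by_tissue = {
    tissue: predict_expression(TOY_DOSAGES, tissue_weights)
    for tissue, tissue_weights in TOY_WEIGHTS.items()
}
```

Because the weights differ by tissue, the same genotypes yield different predicted values in each tissue. This is what allows the predicted values to serve as tissue-specific exposures in downstream association tests.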

Discussion

In this commentary, we described how the increased application of genetic data and methods could strengthen causal inference in exposomics research. These approaches are enabled by the broad availability of genetic data, the active development of causal inference tools and study designs that use genetic data, publicly available data repositories, and a growing appreciation that many exposome phenotypes are heritable.[124] Although the application of genetic data and methods may add analytical complexity, these approaches offer the potential to extend the reach of exposomics research and help address the challenges of comprehensively measuring the exposome and its health effects across studies, across the life course, and in varied contexts and diverse populations.

We acknowledge several limitations to the application of genetic data and methods for causal inference in exposomics research. Few studies have comprehensively cataloged genetic variants that predict diverse exposures. Even when genetic variants have been identified and replicated in independent studies, ascertaining biological impact remains challenging.[125] However, interpretation of a genetic variant’s impact is not necessarily required for many of the methods we propose, and some of the approaches we describe (e.g., using genetic data to characterize tissue-specific effects) may help illuminate effects in toxicity-relevant tissues and organs. We therefore advocate for expanding the evidence base to examine more comprehensively the genetic architecture of exposure biomarkers[126] and health behaviors,[127,128] the exposome domains that most likely harbor heritable exposures or exposure biomarkers.

Another major challenge is the lack of diversity in published GWAS. Although the limited racial/ethnic diversity of GWAS has been the topic of several commentaries,[129] GWAS in populations exposed to specific toxicants or populations capturing crucial life course stages (e.g., infancy and childhood or pregnancy) also remain uncommon. Expanding the diversity of GWAS and the cataloging of genetic variants that predict exposome phenotypes, e.g., through international scientific collaborations that share summary results through established repositories, could help remedy these research gaps.

Further, we excluded discussion of gene–environment interaction, instead focusing on genetic data applications that are less well known in exposomics research. As with the other challenges described, methods to enhance gene–environment interaction studies are an area of active research.[130-132] Finally, the sample sizes needed to construct well-powered genetically inferred phenotypes may be infeasible for a single study. Again, the sharing of summary data is a disciplinary norm that can increase statistical power to detect genetic effects and construct predictive genetically inferred phenotypes, particularly when examining phenotypes influenced by many common genetic variants of small effect, phenotypes for which the genetic effects are only observable in the presence of specific exposures that are themselves uncommon, or when studying gene–environment interaction.[1]

Wild proposed the concept of the exposome in 2005,[1] emphasizing the need to balance investments in genetics research with investments in exposomics research. Almost two decades later, distinctions between environmental and genetic effects on disease remain common in the exposomics literature,[3,6,133] with few examples of studies that successfully integrate both sources of data. It is noteworthy that many of the perceived hurdles associated with genetic data, including measurement scale, are not new to exposure scientists. Fully leveraging exposomic data also requires embracing biological complexity and systems-level thinking, two core exposure science paradigms.[2] Adding genetic data simply adds one more level of complexity. Ultimately, the success of attempts to integrate genetic data into exposomics research will likely require environmental scientists to expand their already large collaborative networks to include geneticists and genetic epidemiologists, because the requisite data are largely extant.[36] Through these collaborations, efforts that better integrate genetic and exposomics data to improve human health are achievable.
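The point made in the Discussion, that sharing summary data can increase statistical power, can be illustrated with a fixed-effect, inverse-variance weighted meta-analysis of per-study effect estimates. This is one common way to pool summary statistics and is offered only as a minimal sketch; it is not necessarily the approach used in any study cited here. The function name and the toy effect sizes and standard errors are hypothetical.

```python
import math
from typing import Sequence, Tuple


def ivw_meta(betas: Sequence[float], ses: Sequence[float]) -> Tuple[float, float]:
    """Pool per-study effect estimates with inverse-variance weights.

    betas: per-study effect estimates for the same variant and phenotype.
    ses:   the corresponding standard errors (all must be positive).
    Returns the pooled estimate and its standard error.
    """
    if len(betas) == 0 or len(betas) != len(ses):
        raise ValueError("betas and ses must be non-empty and the same length")
    if any(se <= 0 for se in ses):
        raise ValueError("standard errors must be positive")
    # Each study is weighted by the inverse of its sampling variance.
    weights = [1.0 / (se ** 2) for se in ses]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled_beta, pooled_se


# Hypothetical summary statistics from two equally sized studies.
toy_betas = [0.10, 0.20]
toy_ses = [0.10, 0.10]
pooled_beta, pooled_se = ivw_meta(toy_betas, toy_ses)
```

In this toy case the pooled standard error is smaller than either study's own standard error. That reduction is the gain in precision that motivates sharing summary results across studies.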
