
Lessons learned from the eMERGE Network: balancing genomics in discovery and practice.

Abstract

The Electronic Medical Records and Genomics (eMERGE) Network, established in 2007, is a consortium of academic and integrated health systems conducting discovery and implementation research in translational genomics. Here, we outline the history of the network, highlight major impacts and lessons learned, and present the tools and resources developed for large-scale genomic analyses and translation into a clinical setting. The network developed methods to extract phenotypes from the electronic medical record to perform genome-wide and phenome-wide association studies. Recruited cohorts were clinically sequenced off a custom panel for targeted sequencing of variants and monogenic disease risks and returned to participants to investigate the impact of return of genomic results. After generating a 105,000 participant-imputed genome-wide association study (GWAS) dataset for discovery, the network enrolled and sequenced 24,998 participants. Integration of these results into the medical record and the effects of results on participants provided key lessons to the field. These learned lessons inform genetic research in diverse populations and provide insights into the clinical impact of return and implementation of genomic medicine using the electronic medical record. The lessons produced by the eMERGE Network can be utilized by other consortia as translational genomic medicine research evolves.


Keywords: consortium; discovery; electronic medical records; genomic; implementation; network

Year: 2020 PMID: 35047833 PMCID: PMC8756524 DOI: 10.1016/j.xhgg.2020.100018

Source DB: PubMed Journal: HGG Adv ISSN: 2666-2477

The utilization of electronic medical record (EMR) data for clinical research has been an integral component of translational science over the last 15 years. The ability to pair clinical data with biobank samples and conduct large-scale genome-wide association studies (GWASs) has paved the way for translational research to provide insights into diabetes, cataracts, cardiovascular disease, and obesity, among many others. The Electronic Medical Records and Genomics (eMERGE) Network has pioneered discovery research methods using longitudinal EMR data linked to genotyping and sequence data [9–13] across diverse geographical, racial, and age distributions. To date, the network has produced a merged, imputed, multi-sample genotyping file representing data from 105,000 participants recruited across three phases [14–16] to investigate genetic associations with disease phenotypes [16–18]. In later phases, the scope of the network expanded to include the clinical applications of genetics. Diverse sites, including pediatric and adult academic medical centers, integrated health systems, and community-based clinics, have sequenced clinically relevant portions of the genome and returned actionable results using the EMR. This diversity provided a natural experiment to study differences and aggregate lessons learned for delivering both sequencing and pharmacogenomic data to diverse populations. As a result, the network has contributed to research in clinical genomics, pharmacogenomics, phenotyping of clinically relevant diseases, clinical annotation, return of results (RoRs), and assessment of clinical outcomes. The network’s work on establishing methods for transmitting genetic test results from laboratories into heterogeneous health care provider organizations and into clinical practice has helped the network activities span discovery and patient care.
This paper describes how the network was structured to achieve clinical implementation and ongoing discovery-based research and provides an overview of developed tools, lessons learned, and resources available to other researchers. The initial goals of the eMERGE Network established the methods for developing and validating electronic phenotyping algorithms across multiple sites and medical record systems. The network also created a pipeline for combining GWAS data from 18,663 participants across five sites to create an imputed, multi-sample file for discovery research. The PheWAS (phenome-wide association scan) method was originally developed by Denny et al. for conducting disease-gene associations using billing codes to automatically categorize over 700 disease phenotypes. The PheWAS Catalog was launched in 2013 and has been cited over 300 times, making it one of the most highly utilized tools developed by the network. In August 2011, the second phase of eMERGE began with the goal of advancing translational efforts. This was achieved by incorporating available genotyping data with electronic phenotyping and privacy protection methods into clinical research and ongoing clinical care. While eMERGE I focused on discovery in phenotyping and genomics, eMERGE II shifted to returning clinically relevant findings as sites recruited, sequenced, and returned results from the targeted Pharmacogenomics Research Network sequence platform (PGRNseq). These data combined variant information with drug interactions, which led to the development of the Sequence and Phenotype Integration Exchange (SPHINX), a web-based portal for exploring data for hypothesis generation, especially around drug response implications of genetic variation across the eMERGE Network. This component of eMERGE II enabled the network to develop processes for the RoRs and EMR integration and to assemble a diverse set of participants, approximately 20% of whom were of non-European ancestry.
Phase III of the eMERGE Network had four aims: (1) sequence and assess variants in targeted clinically relevant genes in approximately 25,000 participants; (2) assess the phenotypic associations of these variants; (3) integrate clinical reports of actionable genetic variants into EMRs for clinical care; and (4) create community resources. Two sequencing centers joined the network of ten clinical sites (Figure S1), which developed, implemented, and returned results from a network-specific sequencing panel, eMERGEseq, for use in clinical care. As the network transitioned from discovery-based research to implementation, protocols focused on sequencing and return of clinically relevant findings were developed. The accomplishments of eMERGE over the past decade (Figure 1), along with the transition to RoRs over the last several years, have allowed the network, which has focused on genomic research and pragmatic interventions, to facilitate the integration of research findings into real-world clinical settings (Figure 2).

Figure 1

Timeline and impact of the eMERGE Network

The network has produced 68 clinical phenotypes validated across multiple electronic medical record systems, launched tools focused on the reuse of genomic data, created multiple iterations of a large GWAS imputed dataset culminating with 105,108 participants, and sequenced and returned results off a PGRNseq and eMERGEseq custom sequencing platform.

Figure 2

eMERGE impacts on clinical care and discovery

The eMERGE Network began by focusing on discovery-based research (white hexagons) before moving into clinical utility-based research (gray hexagons) in phase III. Discovery-based research remains a foundation of the network, contributing to the broad knowledgebase of clinical genomics, which, in turn, can be utilized to inform standards of clinical care and precision medicine in non-research settings. The image describes the main workgroup topics focused on during the third phase and how the network approached both clinical- and discovery-based science.

Approach and methodology

Achieving network goals through milestone-driven workgroups

During eMERGE phase III, seven workgroups with specific goals and milestones were developed:
• Clinical annotation: create consistent approaches for gene and variant interpretation across the eMERGE sequencing centers and clinical sites and support contributions to public knowledge bases.
• EMR integration: integrate the clinical and genomic results data produced during phase III into electronic records at the clinical sites.
• Genomics: identify best practices and facilitate analyses of common and rare variant data in previous and current phases of eMERGE.
• Pharmacogenomics: coordinate and promote pharmacogenomic discovery and implementation work across eMERGE utilizing previous and current pharmacogenetic (PGx) data.
• Phenotyping: advance the science of phenotype development, including case, control, and cohort definition; develop and implement 25 new disease phenotypes; and incorporate natural language processing (NLP).
• Outcomes: develop cross-site outcomes assessment to track implementation and impact of eMERGE III sequencing and assess the impact on health care utilization and outcomes of importance for participants and their families.
• RoRs and ethical, legal, and social implications: develop best practices for the return of actionable variants and measure the impact this return has on participants, families, health care providers, and health care systems.

Data sharing, utilization, and the evolution of data security

The notions of privacy and security are critical to supporting genome-based research and clinical care. When eMERGE began, it established protocols for collecting and sharing de-identified data within the network to facilitate investigations and ensure equity of access. These guidelines have become a critical element of success of the eMERGE Network. While many of the participants consented to participate in the study and community consultation was performed, the process was refined as participants shifted from passive contributors of specimens and data in phase I to recipients of clinically actionable information in phases II and III. At the same time, eMERGE aimed to support the NIH’s goal of making data from genome-based investigations publicly available. It was understood that the uniqueness of certain types of data utilized by eMERGE, including demographics and DNA, made records vulnerable to privacy intrusions, particularly through “re-identification” attacks. As such, organizations making genomic data available began following the practice of sharing aggregate summary statistics about the underlying records, a policy adopted by the Database of Genotypes and Phenotypes (dbGaP), where eMERGE data would eventually be made accessible. However, just as eMERGE was beginning, it was shown that data shared in such a manner could also be subject to probabilistic inference attacks that would reveal the presence of a participant in the cohort and, thus, expose phenotypic information associated with the cohort (e.g., if the cohort was composed solely of participants diagnosed with a cardiological disorder). Thus, it was initially decided that all participant-level and summary data would be shared outside of the network only in a controlled fashion.
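To make the inference risk from aggregate statistics concrete, the following is a minimal simulation sketch. The source does not name the specific attack; this example assumes a distance statistic in the style of Homer et al. (2008), and the function names, cohort size, SNP count, and simulated frequencies are all hypothetical choices for illustration rather than anything used by eMERGE.

```python
import math
import random


def membership_statistic(genotype, cohort_freq, reference_freq):
    """Homer-style test statistic for one person against published frequencies.

    genotype:       per-SNP allele fraction of the person (0.0, 0.5, or 1.0)
    cohort_freq:    per-SNP allele frequency published for the study cohort
    reference_freq: per-SNP allele frequency in a reference population
    """
    # D_j > 0 when the person sits closer to the cohort than to the reference.
    d = [abs(y - ref) - abs(y - coh)
         for y, coh, ref in zip(genotype, cohort_freq, reference_freq)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)
    return mean / math.sqrt(var / n)


def draw_genotype(rng, freqs):
    # Two independent allele draws per SNP (Hardy-Weinberg assumption).
    return [((rng.random() < p) + (rng.random() < p)) / 2 for p in freqs]


def simulate(n_snps=5000, cohort_size=50, seed=0):
    """Simulate a cohort, publish its allele frequencies, and score two people."""
    rng = random.Random(seed)
    reference = [rng.uniform(0.05, 0.5) for _ in range(n_snps)]
    cohort = [draw_genotype(rng, reference) for _ in range(cohort_size)]
    published = [sum(person[j] for person in cohort) / cohort_size
                 for j in range(n_snps)]
    member = cohort[0]                         # someone inside the cohort
    outsider = draw_genotype(rng, reference)   # same population, not enrolled
    return (membership_statistic(member, published, reference),
            membership_statistic(outsider, published, reference))


member_t, outsider_t = simulate()
print(f"member: {member_t:.1f}  outsider: {outsider_t:.1f}")
```

Under these assumptions the cohort member's statistic tends to sit well above the outsider's once enough SNPs are published, which is the kind of signal that motivated controlled access even for summary-level data.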
All data sharing requests would be subject to review and all recipients would be required to enter into a data use agreement (DUA; see Web resources), with institutional signoff, which prohibited re-identification attempts and required secure data storage. The DUA allowed data sharing between parties while allowing each site’s institutional review board (IRB)-approved protocol to inform the transfer. This DUA has allowed the rapid onboarding of 9 affiliates, 5 collaborators, and 18 sites over the history of the network, including the Clinical Sequencing Evidence-Generating Research (CSER), Implementing Genomics in Practice (IGNITE), Population Architecture Using Genomics and Epidemiology (PAGE) Consortium, Clinical Pharmacogenetics Implementation Consortium (CPIC), All of Us Research Program (AoURP), the Analysis, Visualization, and Informatics Lab-space (AnVIL), the US Food and Drug Administration, and the Michigan Imputation Server group. Still, given the need to make data accessible to the public more broadly, the program developed various privacy protection methodologies. These included approaches for de-identification in natural language clinical notes, aggregation of clinical codes to support discovery, and game theory-based risk assessment strategies to determine when genomic and phenotype data could be made public through resources like SPHINX. Although the datasets were centralized, the analyses, research ideas, and discovery were independently led by the eMERGE sites. An essential component of the network’s success was the ability to share data and collaboratively work on projects across multiple institutions. Well-defined policies regarding data utilization, transparent communications across the steering committee and the entire network, along with the DUA, led to the success of eMERGE over the years.
The network’s “Publication Policy,” to which all members requesting access to network data adhere, requires members collaborating with investigators from other sites or utilizing data from other sites, to submit a “manuscript concept sheet” (MCS; see Web resources). These sheets capture the scope, data requested, timeline, and goals of the proposed project. The MCS is reviewed by the steering committee and participants from sites can request involvement in the analysis and publication. Discussions regarding overlap with previously proposed projects and clarification of scope occur during this time period, and modifications are made as necessary. The publications across the network are tracked centrally at the Coordinating Center. Citation counts for publications are tracked quarterly using Zotero reference software via Google Scholar.

Development of the eMERGE aggregate genomic datasets

To compile the GWAS Haplotype Reference Consortium (HRC)-imputed array (N = 105,108), the eMERGE Network included 83 genotype array batches comprising 105,108 participants from 12 contributing medical center sites in the final version of this phase III effort. The consented participants have indexed medical records linked to these genotype results for GWAS and PheWAS. The PGRNseq targeted capture sequencing panel (N = 9,010) was developed by the Pharmacogenomics Research Network (PGRN) as a tool to enable PGx discovery and implementation in large cohorts and includes 84 genes, representing all SNPs present on commercial PGx genotyping platforms at the time (Affy DMET+, Illumina ADME) and additional genes (exons and untranslated regions [UTRs]) nominated by PGRN for their associations to PGx phenotypes. Development of the eMERGEseq targeted capture sequencing panel, applied across N = 24,956 participants, has also been described. The variants from these two datasets (eMERGEseq and PGRNseq) are available in summary format by frequency in SPHINX. All three datasets are also available in dbGaP (GWAS: phs001584.v1.p1; PGRNseq: phs000906; and eMERGEseq: phs001616). To facilitate use of the eMERGEseq and PGRNseq datasets, the network upgraded the SPHINX tool during phase III. The eMERGEseq data (N = 24,956) were added to SPHINX, which previously housed the PGRNseq (N = 9,010) dataset. The tool was reconfigured to include dynamic data updates from DrugBank, the GWAS Catalog, linkage to dbSNP, and PharmGKB annotation resources. The importation of DrugBank annotations into SPHINX allowed for updates to drug-gene-pathway interactions to be maintained in a dynamic fashion. This ensured relevant updates to drug compound name information were captured and interactions with the variants detected in genes were displayed, providing a critical element for the study of PGx interactions.
The addition of the GWAS Catalog annotations linked variants that were shown to have phenotypic association to the published peer-reviewed literature, allowing investigators to quickly reference previous findings for a given variant. The linkage to dbSNP provided an indexing service to document and standardize the naming, genomic mapping, population frequency, and discovery history of genetic variation. Similarly, PharmGKB linkage added literature references to the PGx variant evidence. These additions allow SPHINX to maintain current information tied to variants in the eMERGEseq and PGRNseq datasets, creating a dynamic resource for pharmacogenomic and genetic discovery.

Discovery research in the eMERGE Network

As eMERGE transitioned from genotyping to sequencing, investigations utilizing eMERGE datasets began to incorporate rare variation along with common variation in their analyses. The network compiled 157,480 samples across 13 sites and 6 datasets (Table S1). The GWAS, PGRNseq, and eMERGEseq datasets were harmonized at the Coordinating Center to produce research files for network utilization. The array data were imputed and merged at each phase, and the latest effort utilized the Michigan Imputation Server and HRC version 1.1, thus allowing the aggregation of data from 105,108 eMERGE participants for ∼40 million variants, both common and rare. Figure S2 and Table 1 illustrate the evolution of enrollment and ancestry enrichment in eMERGE through all three phases. Participants of European ancestry formed the largest group, those of African ancestry formed the second largest, and Hispanic eMERGE participants were enriched in phase II. This dataset served as the backbone of discovery efforts in eMERGE. These genetic data were combined with phenotypic data across the sites. Over the three phases, the network developed and validated 68 clinical phenotyping algorithms (Table S2). Case-control status was centrally collected at the Coordinating Center, in addition to annually refreshed, commonly used variables (BMI, ICD, CPT, Phecodes, statin medications, and lipid and autoimmune labs; Table S3) linked to age at the event. These data were also linked to the genomic datasets. The Columbia University site led the effort in leveraging the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) to implement shared phenotypes. This effort demonstrated the effectiveness of using data standards to minimize the work needed to implement phenotypes across the network and promote scalability.
The harmonization and centralization of both EMR and genetic data at the Coordinating Center allowed for the rapid transfer of data to investigators across the network, freeing the sites from the burden of re-pulling and harmonizing the same data for every new research proposal. This dataset also qualifies eMERGE for the International HundredK+ Cohorts Consortium (IHCC), which is a global effort to aggregate data for translational research.
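As an illustration of the kind of analysis that centrally linked genotype and phenotype-code data support, the sketch below runs a toy phenome-wide scan for a single variant. It is a simplification and not the network's pipeline: it assumes a plain 2x2 chi-square test per phenotype code with a Bonferroni threshold, whereas published PheWAS analyses typically use regression models with covariates. The participant IDs and the codes `PHE_A` and `PHE_B` are hypothetical.

```python
import math
from collections import defaultdict


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df) and p-value for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0, 1.0  # degenerate table: no evidence either way
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))


def phewas(carrier, phecodes_by_person):
    """Test one variant against every phenotype code seen in the cohort.

    carrier:            dict person_id -> True if the person carries the variant
    phecodes_by_person: dict person_id -> set of phenotype codes for that person
    """
    all_codes = set().union(*phecodes_by_person.values())
    results = []
    for code in sorted(all_codes):
        counts = defaultdict(int)
        for person, is_carrier in carrier.items():
            is_case = code in phecodes_by_person.get(person, set())
            counts[(is_carrier, is_case)] += 1
        a, b = counts[(True, True)], counts[(True, False)]
        c, d = counts[(False, True)], counts[(False, False)]
        stat, p = chi_square_2x2(a, b, c, d)
        results.append({"phecode": code, "chi2": stat, "p": p})
    # Bonferroni threshold: one test per phenotype code.
    threshold = 0.05 / len(results) if results else 0.05
    for r in results:
        r["significant"] = r["p"] < threshold
    return results


# Toy cohort: 40 people, the first 20 carry the variant.
carrier = {f"p{i}": i < 20 for i in range(40)}
phecodes = {f"p{i}": set() for i in range(40)}
for i in list(range(16)) + [20, 21]:               # PHE_A is enriched in carriers
    phecodes[f"p{i}"].add("PHE_A")
for i in list(range(10)) + list(range(20, 30)):    # PHE_B is evenly distributed
    phecodes[f"p{i}"].add("PHE_B")

results = {r["phecode"]: r for r in phewas(carrier, phecodes)}
for code, r in results.items():
    print(code, round(r["chi2"], 2), r["significant"])
```

Keeping case status per code in one central table, as the Coordinating Center did, is what makes a loop like this cheap to rerun for each new variant.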

Table 1

Breakdown of eMERGE dataset diversity

	eMERGEseq N = 24,956	GWAS N = 105,108	PGRNseq N = 9,010
Self-identified racial category

African American	3,914 (16%)	15,836 (15%)	1,209 (13%)
Hawaiian/Pacific Islander	54 (0.2%)	23 (0.02%)	6 (0.1%)
Am Indian/Alaskan	79 (0.3%)	170 (0.2%)	26 (0.3%)
Asian American	1,578 (6%)	1,246 (1%)	135 (2%)
European	17,691 (71%)	79,764 (76%)	6,065 (67%)
Missing/unknown	1,640 (7%)	8,069 (8%)	1,569 (17%)

Self-identified ethnicity category

Hispanic/Latino	1,506 (6%)	5,217 (5%)	413 (5%)
Not Hispanic/Latino	22,551 (90%)	93,425 (89%)	7,313 (81%)
Missing/unknown	899 (4%)	6,466 (6%)	1,284 (14%)

Self-reported ancestry counts and percent of participants in the network-wide genomic datasets. Race and ethnicity are captured independently and represented separately.

These large datasets enabled 685 publications focused on discovery and clinical utility (Figure S3) to date. In the latter half of phase II, the network moved into sequencing for the clinical return of actionable genetic variation with the PGRNseq dataset. With these changes, the data footprint became larger and decisions on when to disseminate research datasets to the network and investigators influenced the timing of discovery. The rich datasets produced required significant time and effort to compile and maintain. A major lesson learned by the eMERGE Network was the need for outlining the analyses and product goals prior to compiling a large dataset, including phenotypic requirements and hard cut-offs for data processing. Time and resources were required when cohorts were altered and large research datasets were recompiled; therefore, groups must clearly lay out objectives and realistic timelines for data generation, quality control, and downstream investigations. Analysis and computation costs are significant in large datasets, and this should be considered when hosting and analyzing data, especially as the footprint of genomic data grows with new technological and software advances.

The evolution of clinical return in eMERGE III

The network’s focus on implementing genetic testing in a clinical setting produced successes but also identified challenges. One challenge that arose was that each site had a unique study design, cohort population, IRB regulations, and protocols, including how, when, and which results were returned. These differences were driven by study design as well as local IRB regulations. As described in detail by Fossey et al., IRBs vary in requirements, processes, and views toward RoRs from genomic sequencing. Sites encountered many real-world issues during the RoR process, including:
• Delays in expected timelines for RoRs;
• Changes in participant contact information;
• Engagement of participants;
• Death of participants prior to RoRs;
• Previous genetic testing influencing whether participants were interested in the eMERGE results; and
• Pediatric sites’ participants turning 18 years of age before RoRs, requiring re-consent as an adult.
Several sites had to amend their IRB protocols during the study to facilitate RoRs when participants were unable to be contacted. The major lesson learned from this work was that the RoR process is dependent on institutional culture, priorities, and local IRB regulations. However, in the absence of prior data on RoRs on a large network scale, this flexibility in the study design allowed sites more freedom to explore different approaches for RoRs across different medical settings and enabled the network to learn from the complexities of return across multiple institutional settings. The diversity of approaches resulted in a natural experiment that is being explored using implementation science frameworks. This will result in publications that should inform others interested in implementing genomics in clinical care. These lessons will strengthen future research RoR efforts and, in future studies, a single central IRB could standardize the processes and allow the network to create larger, more robust standard guidelines for practice.
Another focus was measuring the impact of the RoRs on clinical outcomes. The network determined which outcome measures to collect, deployed data collection forms, and collected data 6- and 12-months post return for network-wide analyses. Due to the short time frame of the project, the network focused on “process outcomes” that examined the specific steps in a process that had evidence supporting associations with a particular health outcome (e.g., a lipid test ordered after the return of a result in a gene associated with familial hypercholesterolemia) as well as “intermediate outcomes,” which are biomarkers associated with a given outcome (e.g., LDL cholesterol level). The outcomes forms were compared to the independently developed outcomes from the ClinGen Actionability working group and showed concordance, suggesting that common outcomes could be identified for genetic diseases and interventions. If common outcomes are used across studies, evidence can accumulate more quickly. During development of the forms, it became clear that implementation guides for a given disease were necessary as there were multiple study personnel abstracting data elements from the EMR across the 10 sites. Eleven guides were developed for the abstractors, ensuring consistent data entry for downstream analyses. Final analyses of the clinical outcomes are still pending as sites finalize data collection across their cohorts prior to the completion of phase III. Integrating results into the EMR was a critical component of eMERGE. Several challenges were encountered as the network harmonized integration and sites placed results into the EMR. Some medical centers transitioned to new medical record systems during the project, including large infrastructure changes during the Group Health Cooperative transition to Kaiser Permanente in 2017. Compliance regulations differed across states regarding data usage and return. 
Creating a standard for data flow between sequencing centers and sites is key to integrating genomic test results into the EMR. This pipeline differed depending on site regulations, study design, and requirements. The network continued to inform genetic health IT standards by developing a Fast Healthcare Interoperability Resources (FHIR) profile that codifies the combined experience gained during the return and integration of genetic information into the EMR (see Web resources). The ability to utilize and integrate these data for both research and in clinical settings and to learn from the heterogeneity of the network is critical. Overarching lessons learned during phase III (Table 2) showcase an adaptable, cohesive network that has moved beyond genomic discovery to leading research into clinical implementation and the return of genomic results. As eMERGE continues to evolve, the network will utilize its experience with large datasets, innovative research methods, communications, and integration into clinical settings to push forward genomic discovery and clinical implementation.

Table 2

Overarching lessons learned from the eMERGE Network

Focus	Lessons Learned	Tools
Genomic discovery¹⁴,¹⁵,²⁶,²⁸	• defined data freezes with specifications regarding diversity, phenotypic data, discovery, and implementation goals are critical to maximize resources • centrally collected and aligned genetic and EMR data maximize output and relieve repeated site efforts	SPHINX; PheWAS Catalog
Electronic clinical phenotyping²⁰	• implementation and review of complex clinical algorithms using local experts maximizes phenotype accuracy • clearly defined processes for phenotype creation, validation, and implementation, using standard vocabularies and common data models, increase portability	PheKB
Pharmacogenomics¹⁹,²⁸	• a shared variant knowledge base with access to structured data and knowledge is necessary for implementation • provider education, customization of clinical reports, and ongoing education are critical for provider utilization • technical requirements and approaches for PGx integration differ from those for highly penetrant genetic results • SNP coverage in targeted panels should be sufficient for interpreting the full range and types of variants for clinical return of a given gene (e.g., CYP2D6)	SPHINX; CDS_KB
Clinical sequencing²¹	• centralized sequencing allows for harmonization across large networks • with multiple sequencing centers, consistency in panel validation, variant interpretations, reclassifications, and discrepancies is essential	
Return of results (RoR)²²	• flexibility in study design allows exploration of different approaches for RoR • institutional culture and IRB regulations influence the RoR process • a single IRB and centralized protocol may enhance consistency and data harmonization	MyResults
Integration into EMR²⁴	• a standard for data flow is essential for returning genomic test results across sites • eMERGE-informed national FHIR profiles are needed to support genomic return • transitions to new EMR systems can delay and alter site integration plans	CDS_KB; DocUBuild
Clinical outcomes²³	• implementation guides for EHR abstraction ensure consistency across personnel and sites • focusing on process and intermediate outcomes allows for outcomes analysis in studies with short-term follow-up	

Lessons broken down by phase III workgroup, with relevant resources available. SPHINX (Sequence & Phenotype Integration Exchange) links sequencing data to drug associations, GWAS variants, and ancestry. PheWAS Catalog functions as a platform for analysis of phenotypes against single gene variants. PheKB (Phenotype KnowledgeBase) is a collaborative environment to build and validate electronic phenotypic algorithms. CDS_KB (Clinical Decision Support KnowledgeBase) catalogs and shares clinical decision support implementation artifacts. MyResults provides education targeted to the public and information about genetic test results and disease risks. DocUBuild is a web application for creating and sharing information resources for electronic medical record systems.

Advancing translational genomic research

The eMERGE Network has developed several methods on the secondary use of EMR data for the discovery of genotype-phenotype associations [9–13]. The network’s success in enabling discovery methods and findings was dependent on a coordinated workgroup infrastructure, a large and diverse genomic dataset, centralized data use agreements, adherence to policies that governed data security and privacy, and informed investigator collaboration on network publications. The lessons learned from the network (Table 2) span discovery-based research and implementation of the clinical return of genetic variations associated with clinical disease phenotypes. Clearly defined data freezes and centrally collected commonly used EMR variables maximized data availability and minimized duplicative efforts across study teams. Variability in site protocols made it possible to determine how different approaches alter participant experiences and to integrate genomic data into their clinical outcomes. Harmonization of protocols and data flows helped improve sample sizes, data integration, and collection across the network. The network’s transition to clinical implementation leveraged the variety of strategies devised by sites for returning results to participants and integrating results into diverse EMR systems using accepted standards (Rasmussen et al., 2017, AMIA, conference). As genomic research moves from independent studies to large networks and consortiums, such as CSER, PAGE, AoURP, and IGNITE, these lessons maximize genomic and EMR data utilization from large biobanks to push forward research in clinical and translational genomics. Collaborations between these networks are key as research moves forward. The overarching goal of eMERGE has been to advance genomic medicine.
To this end, the network has established multiple tools and resources available to external researchers, spanning the fields of genomic discovery, EMR integration, participant engagement, and education (Table 2). To date, eMERGE publications have been referenced over 40,000 times (Figure S3), impacting the field of translational genomic medicine. Collaborations with external groups have helped shape the practice of other networks, clinical research, and genetic medicine. Although the research conducted over the last several years has yielded many lessons, the approach the network undertook has several limitations. In the network’s experience with developing clinical algorithms, false positives and low positive predictive values can be an issue even with restrictions imposed, especially when incorporating complex variables and techniques such as natural language processing (NLP). Maximizing predictive values for electronic phenotyping of diseases and outcomes will be critical as genomic research evolves. The requirements to house, compute on, and make large genomic datasets available to the network have increased over time; the data footprint of over 80 TB presents continued challenges for storing data and making them accessible to the research community. Calculations on datasets of ∼100,000 participants are computationally intensive and require computational capacity beyond that available to many researchers. As the field moves from local servers to cloud-based platforms for data storage, egress fees and computational costs should be considered to make these data accessible and usable by all investigators. Finally, as sites implemented different protocols, ascertainment bias during enrollment at certain sites restricted downstream analyses on penetrance and clinical outcomes. 
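To make the positive predictive value concern concrete: the PPV of an electronic phenotyping algorithm is the fraction of algorithm-flagged records that manual chart review confirms as true cases. The following is a minimal sketch only. The function name and the review counts are hypothetical illustrations and are not taken from any eMERGE tool or validation study.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """Return PPV = TP / (TP + FP) for a phenotyping algorithm.

    true_positives:  flagged records confirmed as cases by chart review
    false_positives: flagged records rejected by chart review
    """
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("no flagged records to evaluate")
    return true_positives / flagged


# Hypothetical chart review: 100 flagged records, 88 confirmed as cases.
ppv = positive_predictive_value(true_positives=88, false_positives=12)
print(f"PPV = {ppv:.2f}")
```

Because PPV falls as disease prevalence falls, an algorithm validated at one site may score lower at a site with a different case mix, which is one reason per-site validation is worth reporting.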
Sample size was reduced for certain populations and disease phenotypes due to differences in site protocols, and utilizing a single protocol in future work may increase power and allow for the investigation of clinical diseases with low prevalence in general populations. Moving into the next round of eMERGE, the network is focusing on genomic risk assessments and polygenic risk scores (PRS). Overall risk for many common diseases is complex, and elements of family history, a person’s age, and prior health data need to be assessed in parallel with genomic data to provide accurate risk scores. Additionally, the impact of genomic risk can be modified over the lifespan. Age, medications, and lifestyle changes have been shown to interact with the genetic risk of a disease phenotype. The population from which a risk algorithm is derived should be considered when disseminating risk to participants. To achieve PRS performance in non-European ancestry groups comparable to that in European ancestry groups, ancestry-specific GWAS and PRS derived from large and diverse cohorts, together with statistical modeling, are crucial. Recent work by the network demonstrated lower hazard ratios for genome-wide PRS for coronary heart disease in participants of African ancestry compared with those of European ancestry or LatinX ethnicity. The network datasets contain approximately 15% ancestrally diverse participants (Table 1; Figure S2), which may limit the applicability of genetic findings in more diverse populations. In the next phase of the network, the goal is to increase the diversity of underrepresented populations, with targeted recruitment aimed at over 50% non-European ancestry. The lessons from enrollment and RoRs to diverse populations, even if limited, will inform our next phase as we continue to strive for a more representative population. As genomic medicine evolves, increasing this diversity is critical, and the next steps for the network include enrollment of populations underrepresented in research. 
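In its simplest additive form, a PRS is a weighted sum of effect-allele dosages, with per-variant weights taken from a GWAS. The sketch below illustrates only that arithmetic. The variant identifiers, weights, and function name are hypothetical, and the network's own PRS methods are not described in this section and are not assumed to work this way.

```python
from typing import Mapping


def polygenic_risk_score(dosages: Mapping[str, float],
                         weights: Mapping[str, float]) -> float:
    """Additive PRS: sum of (effect-allele dosage x GWAS weight).

    dosages: variant ID -> effect-allele dosage for one person (0.0 to 2.0)
    weights: variant ID -> per-allele effect size (e.g., log odds ratio)
    Only variants present in both mappings contribute to the score.
    """
    shared = sorted(dosages.keys() & weights.keys())
    return sum(dosages[v] * weights[v] for v in shared)


# Hypothetical weights and one person's dosages (illustrative values only).
weights = {"var1": 0.10, "var2": -0.05, "var3": 0.20}
dosages = {"var1": 2.0, "var2": 1.0, "var3": 0.0}
score = polygenic_risk_score(dosages, weights)
print(f"PRS = {score:.2f}")
```

The weights are the ancestry-sensitive part: effect sizes estimated in one ancestry group may not transfer to another, which is why the text stresses ancestry-specific GWAS from large and diverse cohorts.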
Balancing collaborative discovery-based research with implementing genomic technology and results in clinical settings across multiple sites and phases has been one of the greatest achievements of the network. Rapid access to data and collaborators affords network investigators the power to test relevant hypotheses and observe rare associations on a large scale. Network data are accessible to external researchers through dbGaP as well as through direct collaboration with the network. As the field of genetic medicine evolves, the network will continue to adapt to new techniques and standards, leading research in the prediction of disease risk, in methods to implement risk prediction on a larger scale (including integration of findings into medical records), and in evaluating the impact of RoR information on participants and providers.

Data and code availability

The datasets generated and referred to during this study are available at dbGaP under the following accession numbers: GWAS: phs001584.v1.p1, eMERGEseq: phs001616, and PGRNseq: phs000906.

Supplemental information

Consortia

The members of eMERGE are Jodell E. Linder, Alanna K. Rahm, Ian B. Stanaway, Hana Zouk, Elisabeth A. Rosenthal, Ozan Dikilitas, Adam S. Gordon, Megan J. Puckelwartz, Krzysztof Kiryluk, Bahram Namjou, Ingrid A. Holm, Cynthia A. Prows, Iftikhar J. Kullo, Marc S. Williams, Heidi L. Rehm, Patrick M.A. Sleiman, Chunhua Weng, Wendy K. Chung, Maureen E. Smith, Bradley A. Malin, Samuel E. Adunyah, Christopher G. Chute, Josh C. Denny, Ali Gharavi, Richard Gibbs, Hakon Hakonarson, John Harley, George Hripcsak, Elizabeth W. Karlson, Eric B. Larson, Niall Lennon, Shawn Murphy, Dan M. Roden, Richard R. Sharp, Jordan W. Smoller, Wei-Qi Wei, Scott T. Weiss, Digna R. Velez Edwards, Melissa A. Basford, Brittany B. City, Alanna J. DiVietro, Brandy M. Mapes, Timoethia M. Stone, Laura Allison B. Woods, Jyoti G. Dayal, Robb K. Rowley, Baergen I. Schultz, Ken L. Wiley, Jr., Sheethal Jose, Christine Eng, Jianhong Hu, David Murdock, Donna Muzny, Steven Scherer, Eric Venner, Kimberly Walker, Mullai Murugan, Viktoriya Korchina, Christie Kovar, Kevin R. Dufendach, Kenneth M. Kaufman, Todd Lingren, John Lynch, Keith Marsolo, Erin M. Miller, Melanie F. Myers, Yizhao Ni, Beth L. Cobb, Debra J. Abrams, Berta Almoguera, Meckenzie Behr, Elizabeth J. Bhoj, John J. Connolly, Joe T. Glessner, Margaret Harr, Heather S. Hain, Frank D. Mentch, Addie I. Nesbitt, Renata P. da Silva, Avni Santani, Lifeng Tian, Lyam Vazquez, Paul S. Appelbaum, Wendy Chung, Katherine D. Crew, David Fasel, Alex Fedotov, Alexander L. Hsieh, Atlas Khan, Cong Liu, Maddalena Marasa, Hila Milo Rasouly, Jordan Nestor, Lynn Petukhova, Soumitra Sengupta, Yufeng Shen, Ning Shang, Miguel Verbitsky, Julia Wynn, Kenneth M. Borthwick, Adam H. Buchanan, David J. Carey, Jessica M. Goehringer, Darren K. Johnson, Laney K. Jones, Navya Shilpa Josyula, Anne E. Justice, H. Lester Kirchner, Benjamin R. Kuhn, Ming Ta Michael Lee, Yanfei Zhang, Melissa A. Kelly, Casey Overby Taylor, Thomas N. Person, Cassandra Pisieczko, Amy C. Sturm, Agnes S. 
Sundaresan, Nephi Walton, Janet Williams, Juliann M. Savatt, Barbara Benoit, Andrew Cagan, Victor M. Castro, Vivian S. Gainer, Shawn N. Murphy, Ladia H. Albertson-Junkans, Deborah J. Bowen, David S. Carrell, Paul K. Crane, Stephanie M. Fullerton, Andrea L. Hartzler, Nora B. Henrikson, Dustin L. Key, Kathleen A. Leppig, James D. Ralston, Arvind Ramaprasan, Aaron Scrol, Peter Tarczy-Hornoch, David L. Veenstra, Hana Bangash, Pedro J. Caraballo, Mariza De Andrade, David C. Kochan, Noralane M. Lindor, Daniel J. Schaid, Joel E. Pacyna, Maya S. Safarova, Gabriel Q. Shaibi, Janet E. Olson, Philip Lammers, Siddharth Pratap, Rajbir Singh, Duane T. Smoot, Sharon Aufox, Christin Hoell, Yoonjung Y. Joo, Yuan Luo, Elizabeth McNally, Jennifer A. Pacheco, Luke V. Rasmussen, Laura J. Rasmussen-Torvik, Justin Starren, Theresa Walunas, Firas H. Wehbe, Samuel Aronson, Lawrence J. Babb, Mark Bowser, Birgit Funke, Stacey Gabriel, Chet Graham, Maegan V. Harden, Elizabeth D. Hynes, Barbara J. Klanderman, Emily Kudalkar, Matthew S. Lebo, Chiao-Feng Lin, Alyssa Macbeth, Lisa Mahanta, Himanshu Sharma, Matthew Varugheese, Leora Witkowski, Sarah T. Bland, Ellen Wright Clayton, Todd L. Edwards, Jacklyn N. Hellwege, Jacob M. Keaton, Sara L. Van Driest, Quinn S. Wells, Murray Brilliant, Scott Hebbring, Terrie Kitchner, Erwin P. Bottinger, Eimear E. Kenny, Aniwaa Owusu Obeng, Rex L. Chisholm, Gail P. Jarvik, Josh F. Peterson, David R. Crosslin. Author affiliations can be found in Table S4.

Declaration of interests

C.E., R.G., J. Hu, D. Murdock, D. Muzny, S.S., E.V., and K.W. list Baylor Genetics Laboratories (BGL) as a competing financial interest, and BGL is co-owned by Baylor College of Medicine. A.H.B. declares MeTree and You, Inc. as a competing financial interest and has equity stakes in the company. D.R.C. is a consultant for UnitedHealth Group R&D. E.E.K. received speaker honoraria from Regeneron, Illumina, and KPMG. J.F.P. is a consultant for Color Genomics. E.M. is a consultant to Invitae. E.V. is cofounder of Codified Genomics, which provided variant interpretation services for a portion of the work conducted. All other authors declare no competing interests.

38 in total

1. Quality improvement with an electronic health record: achievable, but not automatic.

Authors: Richard J Baron
Journal: Ann Intern Med Date: 2007-10-16 Impact factor: 25.391

2. Concordance between Research Sequencing and Clinical Pharmacogenetic Genotyping in the eMERGE-PGx Study.

Authors: Laura J Rasmussen-Torvik; Berta Almoguera; Kimberly F Doheny; Robert R Freimuth; Adam S Gordon; Hakon Hakonarson; Jared B Hawkins; Ammar Husami; Lynn C Ivacic; Iftikhar J Kullo; Michael D Linderman; Teri A Manolio; Aniwaa Owusu Obeng; Renata Pellegrino; Cynthia A Prows; Marylyn D Ritchie; Maureen E Smith; Sarah C Stallings; Wendy A Wolf; Kejian Zhang; Stuart A Scott
Journal: J Mol Diagn Date: 2017-05-11 Impact factor: 5.568

3. Electronic medical records for genetic research: results of the eMERGE consortium.

Authors: Abel N Kho; Jennifer A Pacheco; Peggy L Peissig; Luke Rasmussen; Katherine M Newton; Noah Weston; Paul K Crane; Jyotishman Pathak; Christopher G Chute; Suzette J Bielinski; Iftikhar J Kullo; Rongling Li; Teri A Manolio; Rex L Chisholm; Joshua C Denny
Journal: Sci Transl Med Date: 2011-04-20 Impact factor: 17.956

4. The eMERGE Network: a consortium of biorepositories linked to electronic medical records data for conducting genomic studies.

Authors: Catherine A McCarty; Rex L Chisholm; Christopher G Chute; Iftikhar J Kullo; Gail P Jarvik; Eric B Larson; Rongling Li; Daniel R Masys; Marylyn D Ritchie; Dan M Roden; Jeffery P Struewing; Wendy A Wolf
Journal: BMC Med Genomics Date: 2011-01-26 Impact factor: 3.063

5. High density GWAS for LDL cholesterol in African Americans using electronic medical records reveals a strong protective variant in APOE.

Authors: Laura J Rasmussen-Torvik; Jennifer A Pacheco; Russell A Wilke; William K Thompson; Marylyn D Ritchie; Abel N Kho; Arun Muthalagu; M Geoff Hayes; Loren L Armstrong; Douglas A Scheftner; John T Wilkins; Rebecca L Zuvich; David Crosslin; Dan M Roden; Joshua C Denny; Gail P Jarvik; Christopher S Carlson; Iftikhar J Kullo; Suzette J Bielinski; Catherine A McCarty; Rongling Li; Teri A Manolio; Dana C Crawford; Rex L Chisholm
Journal: Clin Transl Sci Date: 2012-08-23 Impact factor: 4.689

6. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations.

Authors: Joshua C Denny; Marylyn D Ritchie; Melissa A Basford; Jill M Pulley; Lisa Bastarache; Kristin Brown-Gentry; Deede Wang; Dan R Masys; Dan M Roden; Dana C Crawford
Journal: Bioinformatics Date: 2010-03-24 Impact factor: 6.937

7. Genome-wide study of resistant hypertension identified from electronic health records.

Authors: Logan Dumitrescu; Marylyn D Ritchie; Joshua C Denny; Nihal M El Rouby; Caitrin W McDonough; Yuki Bradford; Andrea H Ramirez; Suzette J Bielinski; Melissa A Basford; High Seng Chai; Peggy Peissig; David Carrell; Jyotishman Pathak; Luke V Rasmussen; Xiaoming Wang; Jennifer A Pacheco; Abel N Kho; M Geoffrey Hayes; Martha Matsumoto; Maureen E Smith; Rongling Li; Rhonda M Cooper-DeHoff; Iftikhar J Kullo; Christopher G Chute; Rex L Chisholm; Gail P Jarvik; Eric B Larson; David Carey; Catherine A McCarty; Marc S Williams; Dan M Roden; Erwin Bottinger; Julie A Johnson; Mariza de Andrade; Dana C Crawford
Journal: PLoS One Date: 2017-02-21 Impact factor: 3.240

8. The Return of Actionable Variants Empirical (RAVE) Study, a Mayo Clinic Genomic Medicine Implementation Study: Design and Initial Results.

Authors: Iftikhar J Kullo; Janet Olson; Xiao Fan; Merin Jose; Maya Safarova; Carmen Radecki Breitkopf; Erin Winkler; David C Kochan; Sara Snipes; Joel E Pacyna; Meaghan Carney; Christopher G Chute; Jyoti Gupta; Sheethal Jose; Eric Venner; Mullai Murugan; Yunyun Jiang; Magdi Zordok; Medhat Farwati; Maraisha Philogene; Erica Smith; Gabriel Q Shaibi; Pedro Caraballo; Robert Freimuth; Noralane M Lindor; Richard Sharp; Stephen N Thibodeau
Journal: Mayo Clin Proc Date: 2018-11 Impact factor: 7.616

9. Phenome-wide association studies demonstrating pleiotropy of genetic variants within FTO with and without adjustment for body mass index.

Authors: Robert M Cronin; Julie R Field; Yuki Bradford; Christian M Shaffer; Robert J Carroll; Jonathan D Mosley; Lisa Bastarache; Todd L Edwards; Scott J Hebbring; Simon Lin; Lucia A Hindorff; Paul K Crane; Sarah A Pendergrass; Marylyn D Ritchie; Dana C Crawford; Jyotishman Pathak; Suzette J Bielinski; David S Carrell; David R Crosslin; David H Ledbetter; David J Carey; Gerard Tromp; Marc S Williams; Eric B Larson; Gail P Jarvik; Peggy L Peissig; Murray H Brilliant; Catherine A McCarty; Christopher G Chute; Iftikhar J Kullo; Erwin Bottinger; Rex Chisholm; Maureen E Smith; Dan M Roden; Joshua C Denny
Journal: Front Genet Date: 2014-08-05 Impact factor: 4.599

10. Harmonizing Outcomes for Genomic Medicine: Comparison of eMERGE Outcomes to ClinGen Outcome/Intervention Pairs.

Authors: Janet L Williams; Wendy K Chung; Alex Fedotov; Krzysztof Kiryluk; Chunhua Weng; John J Connolly; Margaret Harr; Hakon Hakonarson; Kathleen A Leppig; Eric B Larson; Gail P Jarvik; David L Veenstra; Christin Hoell; Maureen E Smith; Ingrid A Holm; Josh F Peterson; Marc S Williams
Journal: Healthcare (Basel) Date: 2018-07-13
