Literature DB >> 35072136

GA4GH: International policies and standards for data sharing across genomic research and healthcare.

Heidi L Rehm^1,2,3, Angela J H Page^1,4, Lindsay Smith^4,5, Jeremy B Adams^4,5, Gil Alterovitz^6,3, Lawrence J Babb¹, Maxmillian P Barkley⁷, Michael Baudis^8,9, Michael J S Beauvais^4,10, Tim Beck¹¹, Jacques S Beckmann¹², Sergi Beltran^13,14,15, David Bernick¹, Alexander Bernier¹⁰, James K Bonfield¹⁶, Tiffany F Boughtwood^17,18, Guillaume Bourque^10,19, Sarion R Bowers¹⁶, Anthony J Brookes¹¹, Michael Brudno^{19,20,21,22,23}, Matthew H Brush²⁴, David Bujold^10,19,23, Tony Burdett²⁵, Orion J Buske²⁶, Moran N Cabili¹, Daniel L Cameron^27,28, Robert J Carroll²⁹, Esmeralda Casas-Silva³⁰, Debyani Chakravarty³¹, Bimal P Chaudhari^32,33, Shu Hui Chen³⁴, J Michael Cherry³⁵, Justina Chung^4,5, Melissa Cline³⁶, Hayley L Clissold¹⁶, Robert M Cook-Deegan³⁷, Mélanie Courtot²⁵, Fiona Cunningham²⁵, Miro Cupak⁷, Robert M Davies¹⁶, Danielle Denisko²⁰, Megan J Doerr³⁸, Lena I Dolman²⁰, Edward S Dove³⁹, L Jonathan Dursi^21,23, Stephanie O M Dyke¹⁰, James A Eddy³⁸, Karen Eilbeck⁴⁰, Kyle P Ellrott²⁴, Susan Fairley^4,25, Khalid A Fakhro^41,42, Helen V Firth^16,43, Michael S Fitzsimons⁴⁴, Marc Fiume⁷, Paul Flicek²⁵, Ian M Fore⁴⁵, Mallory A Freeberg²⁵, Robert R Freimuth⁴⁶, Lauren A Fromont⁴⁷, Jonathan Fuerth⁷, Clara L Gaff^17,18,27,28, Weiniu Gan³⁴, Elena M Ghanaim⁴⁸, David Glazer⁴⁹, Robert C Green^6,3, Malachi Griffith⁵⁰, Obi L Griffith⁵⁰, Robert L Grossman⁴⁴, Tudor Groza⁵¹, Jaime M Guidry Auvil⁴⁵, Roderic Guigó^14,47, Dipayan Gupta²⁵, Melissa A Haendel⁵², Ada Hamosh⁵³, David P Hansen^17,54, Reece K Hart^1,55,56, Dean Mitchell Hartley⁵⁷, David Haussler^36,58, Rachele M Hendricks-Sturrup⁵⁹, Calvin W L Ho⁶⁰, Ashley E Hobb⁷, Michael M Hoffman^20,21,22, Oliver M Hofmann^20,28, Petr Holub^61,62, Jacob Shujui Hsu⁶³, Jean-Pierre Hubaux⁶⁴, Sarah E Hunt²⁵, Ammar Husami⁶⁵, Julius O Jacobsen⁶⁶, Saumya S Jamuar^67,68, Elizabeth L Janes^4,69, Francis Jeanson⁷⁰, Aina Jené⁴⁷, Amber L Johns⁷¹, Yann Joly¹⁰, Steven J M Jones⁷², Alexander Kanitz^9,73, Kazuto Kato⁷⁴, Thomas M Keane^25,75, Kristina Kekesi-Lafrance^4,10, Jerome Kelleher⁷⁶, Giselle Kerry²⁵, Seik-Soon Khor^77,78, Bartha M Knoppers¹⁰, Melissa A Konopko⁷⁹, Kenjiro Kosaki⁸⁰, Martin Kuba⁶², Jonathan Lawson¹, Rasko Leinonen²⁵, Stephanie Li^1,4, Michael F Lin⁸¹, Mikael Linden^82,83, Xianglin Liu⁶⁹, Isuru Udara Liyanage²⁵, Javier Lopez⁸⁴, Anneke M Lucassen⁸⁵, Michael Lukowski⁴⁴, Alice L Mann^4,16, John Marshall⁸⁶, Michele Mattioni⁸⁷, Alejandro Metke-Jimenez⁵⁴, Anna Middleton^88,89, Richard J Milne^88,89, Fruzsina Molnár-Gábor⁹⁰, Nicola Mulder⁹¹, Monica C Munoz-Torres⁵², Rishi Nag²⁵, Hidewaki Nakagawa^92,93, Jamal Nasir⁹⁴, Arcadi Navarro^47,95,96,97, Tristan H Nelson⁹⁸, Ania Niewielska²⁵, Amy Nisselle^18,28,99, Jeffrey Niu²¹, Tommi H Nyrönen^82,83, Brian D O'Connor¹, Sabine Oesterle⁹, Soichi Ogishima¹⁰⁰, Vivian Ota Wang⁴⁵, Laura A D Paglione^101,102, Emilio Palumbo^14,47, Helen E Parkinson²⁵, Anthony A Philippakis¹, Angel D Pizarro¹⁰³, Andreas Prlic⁵⁵, Jordi Rambla^14,47, Augusto Rendon⁸⁴, Renee A Rider⁴⁸, Peter N Robinson^104,105, Kurt W Rodarmer¹⁰⁶, Laura Lyman Rodriguez¹⁰⁷, Alan F Rubin^27,28, Manuel Rueda⁴⁷, Gregory A Rushton¹, Rosalyn S Ryan¹⁰⁸, Gary I Saunders⁷⁹, Helen Schuilenburg²⁵, Torsten Schwede^9,73, Serena Scollen⁷⁹, Alexander Senf¹⁰⁹, Nathan C Sheffield¹¹⁰, Neerjah Skantharajah^4,5, Albert V Smith¹¹¹, Heidi J Sofia⁴⁸, Dylan Spalding^82,83, Amanda B Spurdle¹¹², Zornitza Stark^17,18,28, Lincoln D Stein^5,20, Makoto Suematsu⁸⁰, Patrick Tan^67,113,114, Jonathan A Tedds⁷⁹, Alastair A Thomson³⁴, Adrian Thorogood^10,115, Timothy L Tickle¹, Katsushi Tokunaga^78,116, Juha Törnroos^82,83, David Torrents^96,117, Sean Upchurch¹¹⁸, Alfonso Valencia^96,117, Roman Valls Guimera²⁸, Jessica Vamathevan²⁵, Susheel Varma^25,119, Danya F Vears^18,28,99,120, Coby Viner^20,21, Craig Voisin¹²¹, Alex H Wagner^32,33, Susan E Wallace¹¹, Brian P Walsh²⁴, Marc S Williams⁹⁸, Eva C Winkler¹²², Barbara J Wold¹¹⁸, Grant M Wood¹²³, J Patrick Woolley⁷⁶, Chisato Yamasaki⁷⁴, Andrew D Yates²⁵, Christina K Yung^5,124, Lyndon J Zass⁹¹, Ksenia Zaytseva^10,125, Junjun Zhang⁵, Peter Goodhand^4,5, Kathryn North^18,20,28, Ewan Birney^25,126.

Abstract

The Global Alliance for Genomics and Health (GA4GH) aims to accelerate biomedical advances by enabling the responsible sharing of clinical and genomic data through both harmonized data aggregation and federated approaches. The decreasing cost of genomic sequencing (along with other genome-wide molecular assays) and increasing evidence of its clinical utility will soon drive the generation of sequence data from tens of millions of humans, with increasing levels of diversity. In this perspective, we present the GA4GH strategies for addressing the major challenges of this data revolution. We describe the GA4GH organization, which is fueled by the development efforts of eight Work Streams and informed by the needs of 24 Driver Projects and other key stakeholders. We present the GA4GH suite of secure, interoperable technical standards and policy frameworks and review the current status of standards, their relevance to key domains of research and clinical care, and future plans of GA4GH. Broad international participation in building, adopting, and deploying GA4GH standards and frameworks will catalyze an unprecedented effort in data sharing that will be critical to advancing genomic medicine and ensuring that all populations can access its benefits.

Entities: Chemical

Year: 2021 PMID： 35072136 PMCID： PMC8774288 DOI： 10.1016/j.xgen.2021.100029

Source DB: PubMed Journal: Cell Genom ISSN： 2666-979X

Introduction

The Universal Declaration of Human Rights states that everyone has the right to share in scientific advancement and its benefits. In order to fully deliver the benefits from genomic science to the broad human population, researchers and clinicians must come together to agree on common methods for collecting, storing, transferring, accessing, and analyzing molecular and other health-related data. Otherwise, this information will remain siloed within individual disease areas, institutions, countries, or other jurisdictions, locking away its potential to contribute to research and medical advances. The Global Alliance for Genomics and Health (GA4GH) is a worldwide alliance of genomics researchers, data scientists, healthcare practitioners, and other stakeholders. We are collaborating to establish policy frameworks and technical standards for responsible, international sharing of genomic and other molecular data as well as related health data. Founded in 2013, the GA4GH community now consists of more than 1,000 individuals across more than 90 countries working together to enable broad sharing that transcends the boundaries of any single institution or country (see https://www.ga4gh.org). In this perspective, we present the strategic goals of GA4GH and detail current strategies and operational approaches to enable responsible sharing of clinical and genomic data, through both harmonized data aggregation and federated approaches, to advance genomic medicine and research. We describe technical and policy development activities of the eight GA4GH Work Streams and implementation activities across 24 real-world genomic data initiatives (“Driver Projects”). We review how GA4GH is addressing the major areas in which genomics is currently deployed including rare disease, common disease, cancer, and infectious disease. Finally, we describe differences between genomic sequence data that are generated for research versus healthcare purposes, and define strategies for meeting the unique challenges of responsibly enabling access to data acquired in the clinical setting.

Harnessing the Genomic Medicine Revolution

As the costs associated with human genomic sequencing continue to decline, genomic assays are increasingly used in both research and healthcare. As a result, we expect tens of millions of human whole-exome or whole-genome sequences to be generated within the next decade, with a high proportion of that data coming from the healthcare setting and therefore associated with clinical information. If they can be shared, these datasets hold great promise for research into the genetic basis of disease and will represent more diverse populations than have traditionally been accessible in research; however, data from individual healthcare systems are rarely accessible outside of institutional boundaries. GA4GH aims to enable the responsible sharing of clinical and genomic data across both research and healthcare by developing standards and facilitating their uptake. We believe that without such a consortium, the emerging utility of genomics in clinical practice will be slower, more expensive, and fragmented, with little harmonization between countries. GA4GH standards (see Table 1) allow researchers to securely and responsibly access data regardless of where they are physically located. Technical standards give researchers the confidence that someone else could reproduce their work by running the same packaged method over the same underlying data, using the same persistent identifiers. Standards also give data providers confidence that their data are being accessed in accordance with their data use policies, by researchers they have authorized, without losing control of multiple downloaded copies of the data. As a result, data providers can enable research with the assurance that their legal and ethical requirements are being upheld, while researchers benefit from the use of global data resources and tools.

Table 1

GA4GH toolkit

Relevant standards	URL	Type	Target user	Purpose
Identify and access datasets relevant to a disease study
Beacon API⁸	https://app.swaggerhub.com/apis/ELIXIR-Finland/ga-4_gh_beacon_api_specification/1.0.0-rc1	API	data custodians, researchers (via research infrastructures), identity provider services	The Beacon protocol defines an open standard for genomics data discovery. It provides a framework for public web services responding to queries against genomic data collections, for instance from population-based or disease-specific genome repositories. Beacon is designed to (1) focus on robustness and easy implementation, (2) be maintained by individual organizations and assembled into a federated network, (3) be general-purpose and able to be used to report on any variant collection, (4) provide a boolean (or quantitative) answer about the observation of a variant, and (5) protect privacy, with queries not returning information about single individuals. A new version of the API will include support for more granular control based on a user’s identity authorization and will enable discovery of cohorts, cases (patients), biological samples, and genomic variants and associated knowledge. More details can be found on the Beacon Project website.
Data Connect	https://github.com/ga4gh-discovery/data-connect	API	data custodians, researchers, and API & tool developers	Data Connect is a specification for discovery and search of biomedical data, which provides a mechanism for describing data and its data model, and for searching data within the given data model.The primary container for data in Data Connect is the table. Tables contain rows of data, where each row is a JSON object with key/value pairs. The table describes the structure of its row objects using JSON Schema (https://json-schema.org/). Row attributes can take on any legal JSON value, e.g., numbers, strings, booleans, nulls, arrays, and nested JSON objects.The API supports browsing and discovery of data models and table metadata, listing table data, and optionally searching table data using arbitrarily complex expressions including joins and aggregations. The query language is SQL with domain-specific functions to facilitate informative typing of the result fields. Data publishers can wrap existing data storage and retrieval systems in the Data Connect API or may choose to publish data directly as static files in the Data Connect JSON format.Data consumers can use Data Connect via graphical data discovery and exploration built upon the API, via command line tools (interactively or in batch workflows), and directly as an API in custom analysis programs. More information can be found in the specification (https://github.com/ga4gh-discovery/data-connect/blob/master/SPEC.md).
Data Use Ontology⁹	http://purl.obolibrary.org/obo/duo.owl	Data Model / Ontology	data custodians, researchers, DACs	The Data Use Ontology (DUO) is a hierarchical vocabulary of terms describing data use permissions and modifiers, in particular for research data in the health/clinical/biomedical domain. The GA4GH DUO standard allows large genomics and health data repositories to consistently annotate their datasets, ensuring a shared, machine readable, representation of data access conditions, and making them automatically discoverable based on a researcher’s authorization level or intended use.Reference implementations are available at Broad’s FireCloud - Data Library Broad’s DUOS (Data Use Oversight System) - Data Catalog European Genome-Phenome Archive. DUO is based on the OBO Foundry principles (http://www.obofoundry.org/principles/fp-000-summary.html) and developed using the W3C Web Ontology Language. DUO can be browsed online via the Ontology Lookup Service or Ontobee. It has been registered with the OBO Foundry (http://obofoundry.org/ontology/duo.html).
GA4GH Passports¹⁰	https://github.com/ga4gh-duri/ga4gh-duri.github.io/blob/master/researcher_ids/ga4gh_passport_v1.md	API / Data Model	data custodians, researchers, DACs, clinicians, API and tool developers	The GA4GH Passport specification aims to support data access policies within current and evolving data access governance systems. This specification defines Passports and Passport Visas as the standard way of communicating a user’s data access authorizations based on either their role (e.g., researcher), affiliation, or access status. Passport Visas from trusted organizations can therefore express data access authorizations that require either a registration process (for the Registered Access data access model¹¹) or custom data access approval (such as the Controlled Access applications used for many datasets).
Service Info	https://github.com/ga4gh-discovery/ga4gh-service-info	API	API and tool developers	Service discovery is at the root of any computational workflow using web-based APIs. Traditionally, this is hard-coded into workflows, and discovery is a manual process. Service Info provides a way for an API to expose a set of metadata to help discovery and aggregation of services via computational methods. It also allows a server/implementation to describe its capabilities and limitations. Service-info is described in GA4GH OpenAPI specification, which can be visualized using Swagger Editor (https://editor.swagger.io/?url=https://raw.githubusercontent.com/ga4gh-discovery/ga4gh-service-info/develop/service-info.yaml).
Service Registry	https://github.com/ga4gh-discovery/ga4gh-service-registry	API	API and tool developers	Service registry is a GA4GH service providing information about other GA4GH services, primarily for the purpose of organizing services into networks or groups and service discovery across organizational boundaries. Information about the individual services in the registry is described in the complementary Service Info specification (see above). The Service Registry specification is useful when dealing with technologies that handle multiple GA4GH services. Common use cases include creating networks or groups of services of a certain type (e.g., Beacon Network searches networks of Beacon services across multiple organizations, a workflow can be executed by a specific group of Workflow Execution Services, or Data Connect search on biomedical data is federated across a set of nodes), or a certain host (e.g., an organization provides implementations of Beacon, Data Connect, and Data Repository Service APIs, or a server hosts an implementation of refget and htsget APIs).
Remotely run analytical methods on data of interest
htsget¹²	samtools.github.io/hts-specs/htsget.html	API	API and tool developers, researchers	htsget is a data retrieval API that bridges from existing genomics file formats to a client/server model with the following features: Incumbent data formats (BAM, CRAM, VCF) are preferred initially, with a future path to others. Multiple server implementations are supported, including those that do format transcoding on the fly, and those that return essentially unaltered filesystem data. Multiple use cases are supported, including access to small subsets of genomic data (e.g., for browsing a given region) and to full genomes (e.g., for calling variants).
refget¹³	samtools.github.io/hts-specs/refget.html	API	API and tool developers, researchers	Refget (https://w3id.org/ga4gh/refget) is an API and mechanism for generating identifiers for reference sequences and retrieving sequences via API. The refget identifier is derived from sequence content directly and therefore does not rely on a central issuing authority. This allows downstream clients to unambiguously refer to a reference sequence and to retrieve said sequence. The refget API can also provide subsequences and metadata pertaining to the checksum identifier. A refget server can host any number of reference sequences of any type, e.g., genomic DNA or protein sequences. The refget protocol is a fundamental building block of the CRAM specification.An OpenAPI description of this specification is available and describes the 1.0.0 version (https://github.com/samtools/hts-specs/blob/master/pub/refget-openapi.yaml). Implementors can check if their refget implementations conform to the specification by using our compliance suite (https://github.com/ga4gh/refget-compliance-suite). A summary of all known public implementations is available from our compliance report website.
Task Execution Service (TES)	https://github.com/ga4gh/task-execution-schemas	API	API and tool developers, researchers, academic institutions	The Task Execution Service (TES) API is a standardized schema and API for describing and executing batch execution tasks. A task defines a set of input files, a set of containers and commands to run, a set of output files, and some additional logging and metadata. TES servers accept task documents and execute them asynchronously on available compute resources. A TES server could be built on top of a traditional HPC queuing system, such as Grid Engine, Slurm, or cloud style compute systems such as AWS Batch or Kubernetes.
Tool Registry Service (TRS)	https://github.com/ga4gh/tool-registry-service-schemas	API	API and tool developers, researchers, academic institutions	The GA4GH Tool Registry (TRS) API aims to provide a standardized way to describe the availability of tools and workflows. In this way, multiple repositories that share Docker-based tools and workflows (based on Common Workflow Language [CWL], Workflow Description Language [WDL], Nextflow, or Galaxy) can consistently interact, search, and retrieve information from one another. The end goal is to make it much easier to share scientific tools and workflows, enhancing our ability to make research reproducible, shareable, and transparent.To access the specification, users can: view the human-readable Reference Documentation explore the specification in the Swagger Editor preview documentation from the gh-openapi-docs for the development branch at https://ga4gh.github.io/tool-registry-service-schemas/preview/develop/docs/index.html
Workflow Execution Service (WES)	https://github.com/ga4gh/workflow-execution-service-schemas	API	API and tool developers, researchers, academic institutions	The Workflow Execution Service (WES) API describes a standard programmatic way to run and manage workflows. Having this standard API supported by multiple execution engines will let people run the same workflow using various execution platforms running on various clouds/environments. Key features include: (1) ability to request a workflow run using CWL or WDL; (2) ability to parameterize that workflow using a JSON schema; and (3) ability to get information about running workflows.
Securely access genotype and phenotype information on patients with related traits
Authentication & Authorisation Infrastructure (AAI)	https://github.com/ga4gh/data-security/blob/master/AAI/AAIConnectProfile.md	Guide	API and tool developers	The GA4GH Authentication & Authorisation Infrastructure (AAI) specification profiles the OpenID Connect (OIDC) protocol to provide a federated (multilateral) authentication and authorization infrastructure for greater interoperability between genomics institutions in a manner specifically applicable to (but not limited to) the sharing of restricted datasets.In particular, this specification introduces a JSON Web Token (JWT) syntax for an access token to enable an OIDC provider (called a Broker) to allow a downstream access token consumer (called a Claim Clearinghouse) to locate the Broker’s /userinfo endpoint as a means to fetch GA4GH Claims. This specification is suggested to be used together with others that specify the syntax and semantics of the GA4GH Claims exchanged.
Cloud Security and Privacy Policy v1.0	https://docs.google.com/document/d/1cBTwtetnsvO2vU3HVwLTLaC9H_ya-4MjZUa_g_xzOBg/edit	Guide	anyone handling sensitive data in a cloud infrastructure.	An increasing number of GA4GH projects rely on Cloud services to pursue their goals, and the GA4GH Cloud Work Stream is working on several products to make the GA4GH community take full advantage of the Cloud paradigm. However, the use of the Cloud poses significant security and privacy challenges that need to be carefully evaluated and addressed. The purpose of the Cloud Security and Privacy Policy is to outline a common security technology framework that can be used to systematically assess the products developed by the CWS from a security perspective. Product developers and reviewers can leverage the information contained herein to identify requirements, threats, and countermeasures related to the products they are working on, thus facilitating the production of secure standards.
CRAM¹⁴	samtools.github.io/hts-specs/CRAMv3.pdf	File Format	API and tool developers, researchers	The CRAM file format holds DNA sequencing records. It has the following major objectives: Significantly better lossless compression than BAM To permit simple and lossless transformations to and from BAM files Support for controlled loss of data The first two objectives allow users to take immediate advantage of the CRAM format while offering a smooth transition path from using BAM files.The third objective supports the exploration of different lossy compression strategies and provides a framework in which to effect these choices.Data in CRAM is stored in a columnar fashion, with each column being compressed with either a general-purpose compressor or a custom method. If aligned, sequences may be stored as differences against a reference sequence, which is optionally stored within the CRAM file. External references may be either a local file or obtained remotely via the refget API. Data may be retrieved either as whole alignment records, or selectively only for the fields (columns) required.
Crypt4GH¹⁵	samtools.github.io/hts-specs/crypt4gh.pdf	File Format	API and tool developers, data generators, researchers, clinicians, data custodians	By its nature, genomic data can include information of a confidential nature about the health of individuals. It is important that such information is not accidentally disclosed. One part of the defense against such disclosure is to, as much as possible, keep the data in an encrypted format. The Crypt4GH specification describes a file format that can be used to store data in an encrypted state. Existing applications can, with minimal modification, read and write data in the encrypted format. The choice of encryption also allows the encrypted data to be read starting from any location, facilitating indexed access to files. The format has the following properties: Confidentiality: Data stored in the file are readable only by holders of the correct secret decryption key. The format does not hide the length of the encrypted file, although it is possible to pad some file structures to obscure the length. Integrity: Data are stored in a series of 64 kilobyte blocks, each of which includes a message authentication code (MAC). At tempts to change the data in a block will make the MAC invalid; it is not possible to recalculate the MAC without knowing the key used to encrypt the file. The format only protects the contents of each individual block. It does not protect against insertion, removal, or reordering of entire blocks. Authentication: The format does not provide any way of authenticating files. Crypt4GH may be used with any data file or stream, but one usage is encryption of BAM, CRAM, VCF, and BCF data within the htsget API while still retaining full random access.
Data Repository Service (DRS)	https://github.com/ga4gh/data-repository-service-schemas	API	API and tool developers, researchers, academic institutions	The Data Repository Service (DRS) API provides a generic interface to data repositories so data consumers, including workflow systems, can access data objects in a single, standard way regardless of where they are stored and how they are managed. The primary functionality of DRS is to map a logical ID to a means for physically retrieving the data represented by the ID. The DRS specification describes the characteristics of those IDs, the types of data supported, how they can be pointed to using URIs, and how clients can use these URIs to ultimately make successful DRS API requests. The specification also describes the DRS API in detail and provides information on the specific endpoints, request formats, and responses. This specification is intended for developers of DRS-compatible services and of clients that will call these DRS services.
Data Security Infrastructure Policy (DSIP)	https://github.com/ga4gh/data-security/blob/master/DSIP/DSIP_v4.0.md	Policy Framework	data protection authorities	The Data Security Infrastructure Policy (DSIP) describes the data security infrastructure recommended for stakeholders in the GA4GH community. It is not meant to be a normative document, but rather a set of recommendations and best practices to enable a secure data sharing and processing ecosystem. However, it does not claim to be exhaustive, and additional precautions other than the ones collected in the policy might have to be taken to be compliant with national/regional legislations. As a living document, the DSIP will be revised and updated over time, in response to changes in the GA4GH Privacy and Security Policy, and as technology and biomedical science continue to advance.
Machine Readable Consent Guidance (MRCG) v1.0	https://www.ga4gh.org/wp-content/uploads/Machine-readable-Consent-Guidance_6JUL2020-1.pdf	Guide	researchers, institutional review boards/research ethics committees (international and national), research ethics policy makers, data generators, funding agencies	The Machine Readable Consent Guidance (MRCG) provides standardized consent clauses and supporting information to enable the development of consent forms that map unambiguously to the GA4GH Data Use Ontology (DUO). Integrating DUO into consent forms thereby facilitates data discovery and data access requests and approvals, maximizing data sharing, integration, and re-use while respecting the autonomy of data subjects. MRCG implementations include the Broad Data Use Oversight System (DUOS)¹⁶ and the Australian Genomics dynamic consent participant platform, CTRL.
Pedigree V1	https://github.com/GA4GH-Pedigree-Standard/pedigree	Data Model / Ontology	clinicians, researchers, API and tool developers, data generators, EHR vendors	Family health history is an important aspect in both genomic research and patient care. The GA4GH pedigree standard is an object-oriented graph-based model to represent family health history and pedigree information. It is intended to fit within the structure of other standards like HL7 FHIR and Phenopackets and enable the computable exchange of family health history as well as representation of larger, more complex families. Computable representation of family structure will allow patients, physicians, and researchers to share this information more easily between healthcare systems and help software tools use this information to improve genomic analysis and diagnosis. The draft model can be found on Github along with a Family History Relations Ontology and draft FHIR implementation guide. A draft recommendation for a minimal dataset of family health history (https://docs.google.com/document/d/1UAtSLBEQ_7ePRLvDPRpoFpiXnl6VQEJXL2eQByEmfGY/edit?usp=sharing) was developed as a foundation of these efforts.
Phenopackets	http://phenopackets.org	Data Model / Ontology	data generators, data custodians, researchers, clinicians, API and tool developers	The Phenopacket specification is an open machine-readable schema that supports the global exchange of disease and phenotype information to improve our ability to diagnose and conduct research on all types of diseases, including cancer and rare disease. A Phenopacket links detailed phenotypic descriptions with disease, patient, and genetic information, enabling clinicians, biologists, and disease and drug researchers to build more complete models of disease. Version 2 of the standard, released in June 2021, expands on the previous version to include better representation of the time course of disease, treatment, and COVID-19 and cancer-related data. The schema, as well as source code in Java, C++, and Python, are available from the phenopacket-schema GitHub repository.
RNAget	https://ga4gh-rnaseq.github.io/schema/docs/index.html	API	Data generators, data custodians, researchers, tool developers	The RNAget API describes a common set of endpoints for search and retrieval of processed RNA data. This currently includes feature level expression data from RNA-seq type assays and signal data over a range of bases from ChIP-seq, methylation, or similar epigenetic experiments.By using these common endpoints, data providers make it easier for client software to access their data with minimal or no modifications to underlying code. This improves interoperability with other compliant data providers and makes it easier for investigators to retrieve and compare data from multiple sites.For the software developer, these common endpoints and patterns make it easier to access multiple compliant server sites with the same client software. This reduces development time which may have otherwise been spent writing parsers and custom request generators. Using the API, it becomes much easier to write software to conduct comparisons, data mingling, or other analyses on data retrieved from multiple, potentially geographically dispersed data servers.The OpenAPI description of the specification can be used with code generators like OpenAPI Generator. The testing and compliance page includes a list of example server implementations which can be used as is or as a starting point. A custom solution can be implemented to link the API endpoints and queries to a local data backend (of any desired type) serving the data.
SAM and BAM¹⁷	samtools.github.io/hts-specs/SAMv1.pdf	File Format	researchers	SAM, or Sequence Alignment/Map format, is a format for storing primary DNA sequencing records. These are typically aligned and sorted by genomic coordinate, but unaligned data can also be represented. SAM is a TAB-delimited text format consisting of a header meta-data section and an alignment section. The BAM format is a binary serialization of SAM for more efficient access. SAM and BAM support full random access, selected by genomic region. The SAMtags document defines the optional per-record annotations. These are also used by the CRAM specification.
Variant Annotation	https://github.com/ga4gh/va-spec	Data Model / Modeling Framework	API and tool developers	Variant annotations are structured data object that holds a central piece of knowledge about a genetic variation, along with metadata supporting its interpretation and use. A given variant annotation may describe knowledge about its molecular consequence, functional impact on gene function, population frequency, pathogenicity for a given disease, or impact on therapeutic response to a particular treatment. The GA4GH VA-Specification will define an extensible data model for representation and exchange these and other diverse kinds of variant annotations. It will provide machine-readable messaging specifications to support sharing and validation of data through APIs and other exchange mechanisms. It will also provide a formal framework for defining custom extensions to the core model - allowing community-driven development of VA-based data models for new data types and use cases. A more detailed description of these components can be found online.The VA-Spec is being authored by a partnership among national resource providers and major public initiatives within GA4GH. It has been informed by and will be tested in diverse, established, and actively developed Driver Projects, including ClinGen, VICC, Genomics England, the Monarch Initiative, BRCA Exchange, and Australian Genomics. In these contexts, it will be used to support different types of tools and information systems, including variant curation tools and interpretation platforms (e.g., ClinGen, CIViC, Genomics England), variant annotation services (e.g., CellBase), knowledge aggregators/portals (e.g., BRCA Exchange, Monarch Initiative), matchmaking applications (e.g., Matchmaker Exchange), and clinical information systems and decision support tools.
Variation Representation¹⁸	https://vrs.ga4gh.org	Data Model & terminology	data generators, API and tool developers, data custodians	Maximizing the personal, public, research, and clinical value of genomic information will require that clinicians, researchers, and testing laboratories exchange genetic variation data reliably. The Variation Representation Specification (VRS, pronounced “verse”) — written by a partnership among national information resource providers, major public initiatives, and diagnostic testing laboratories — is an open specification to standardize the exchange of variation data.The primary contributions of VRS include (1) terminology and an information model, (2) a machine readable schema, (3) conventions that promote reliable data sharing, (4) globally unique computed identifiers, and (5) a Python implementation (available at vrs-python) that demonstrates the above schema and algorithms and supports translation of existing variant representation schemes into VRS for use in genomic data sharing. It may be used as the basis for development in Python, but it is not required in order to use VRS. The machine-readable schema definitions and example code are available online at the VRS repository. Readers may wish to view a complete example before reading the specification. For a discussion of VRS with respect to existing standards, such as HGVS, SPDI, and VCF, see “Relationship of VRS to existing standards,” an appendix to the specification documentation.
VCF/BCF¹⁹	samtools.github.io/hts-specs/VCFv4.3.pdf	File Format	researchers	The variant call format (VCF) is a generic format for storing DNA polymorphism data such as single nucleotide polymorphisms (SNPs), insertions, deletions, and structural variants, together with rich annotations. VCF may hold data for multiple samples within the same file. The specification contains the header meta-data fields, a series of mandatory columns describing the variants, and details of the optional annotations which are either per-site or per-sample. VCF and its binary counterpart, BCF, is usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome.

The GA4GH Toolkit outlines a suite of secure standards and frameworks that will enable more meaningful research and patient data harmonization and sharing. This suite addresses a variety of challenges across the data sharing life cycle and is applicable across the world’s accessible medical and patient-centered systems, knowledgebases, and raw data sources. All standards are subject to the GA4GH Copyright Policy (https://www.ga4gh.org/wp-content/uploads/GA4GH-Copyright-Policy-Updated-Formatting.pdf) and should be made available under an open source license such as the Apache 2.0 license for software.

As nascent genomic medicine programs emerge in many countries, we believe that federated approaches (see Federated access below), in addition to centralized data sharing where feasible, are necessary to satisfy the goals of both the research and healthcare communities. In addition, many commercial and public organizations aim to minimize the costs and risks of the complex technical software needed to either contribute to genomic medicine or deliver genomic tools. A complex, multistakeholder ecosystem requires neutral and technically competent standards; these standards must be adaptable for disparate purposes and useful for the broad set of end-users: clinical, academic, commercial, and public. Finally, standards must be developed to intentionally support the global research community with specific attention to policies of equity, diversity, and inclusion to tangibly enable progress for all global communities.

GA4GH Organization

GA4GH has partnered with 24 real-world genomic data initiatives (Driver Projects) to ensure its standards are fit for purpose and driven by real-world needs. Driver Projects make a commitment to help guide GA4GH development efforts and pilot GA4GH standards (see Table 2). Each Driver Project is expected to dedicate at least two full-time equivalents to GA4GH standards development, which takes place in the context of GA4GH Work Streams (see Figure 1). Work Streams are the key production teams of GA4GH, tackling challenges in eight distinct areas across the data life cycle (see Box 1). Work Streams consist of experts from their respective sub-disciplines and include membership from Driver Projects as well as hundreds of other organizations across the international genomics and health community.

Table 2

GA4GH Driver Projects

Driver Project	URL	Location	Thematic area*	Current size	Data type(s) collected	Data hosting model(s)	Data access model(s)	Implementations / deployments of GA4GH standards
All of Us Research Program	https://allofus.nih.gov/	US	RD, Ca, CT	100k whole-genome sequences (planning for 1 million)	WGS, WES	centralized	cloud	CRAM, DRS (forthcoming), htsget (forthcoming), Passports (forthcoming), TRS (forthcoming), and WES (forthcoming)
Australian Genomics	https://www.australiangenomics.org.au/	Australia	RD, Ca, CT	13,500 whole-genome sequences across all pilots	WGS, WES, panels, phenotype	centralized	cloud	Beacon V1, CRAM, Crypt4GH, DRS (forthcoming), DUO, htsget, MRCG (forthcoming), Passports (forthcoming), refget
Autism Sharing Initiative	https://www.autismsharinginitiative.org/	international	CT	11,316 whole-genome sequences (estimating 15k by 2025)	WGS	distributed	federated analysis	AAI (forthcoming), Beacon V1 (forthcoming), CRAM (forthcoming), Data Connect, DRS (forthcoming), DUO (forthcoming), Passports (forthcoming), Service Registry / Info, TRS (forthcoming), WES (forthcoming)
BRCA Exchange	http://www.brcaexchange.org	international	RD, Ca	66,657 variants	genetic variant pathogenicity assertions and supporting evidence	centralized	public	Beacon V1, VA (forthcoming), VRS, WES (forthcoming)
CanDIG	https://www.distributedgenomics.ca/	Canada	RD, Ca, CT, Bio	1,700 data records	WGS tumor/normal and whole transcriptome for cancer; WGS for COVID; clinical phenotype	distributed	federated analysis	Beacon V1, CRAM, DRS, DUO, htsget, Phenopackets, refget (forthcoming), RNAGet, Service Registry / Info (forthcoming), VRS (forthcoming), WES (forthcoming)
ClinGen	https://www.clinicalgenome.org/	US	RD	2,077 unique genes with at least one curation and 2,417 unique variants with at least one curation	genetic and experimental evidence	centralized	public	VA (forthcoming), VRS
ELIXIR	https://elixir-europe.org/	Europe	RD, Ca, CT, Bio	23 national nodes hold a variety of data types and run multiple services, some listed within this table (e.g., EGA). For a list of ELIXIR Core Data Resources, see https://elixir-europe.org/platforms/data/core-data-resources		distributed	download (also exploring Cloud)	AAI, Beacon V1, Crypt4GH, DRS, DUO, htsget, Passports, Phenopackets, refget, RNAGet, Service Registry / Info, TES, TRS, WES
ENA / EVA / EGA	https://www.ebi.ac.uk/ena,	Europe	RD, Ca, CT, Bio	EGA - 700k data records	EGA - WGS, WES, RNaseq, epigenetics, genotyping, transcriptome, singlecell seq, healthy and disease cohorts	distributed	download (also exploring Distributed Cloud)	Crypt4GH, htsget AAI, Passports, DUO
EpiShare	https://epishare-project.org/	international	Bio	~2,800 data records	FASTQ, CRAM/BAM, bigwig, bigbed for epigenomics experiments	distributed	federated analysis	CRAM (forthcoming), DRS, DUO, htsget (forthcoming), Phenopackets, RNAGet, Service Registry / Info, WES
EUCANCan	http://www.eucancan.com	international	Ca	data from 35 different sources including human, model, and non-model organisms	whole-genome, whole-exome, and whole-transcriptome sequence data	distributed	Cloud and federated analysis	AAI (forthcoming), Beacon V1 (forthcoming), CRAM (forthcoming), Data Connect (forthcoming), DRS (forthcoming), Passports (forthcoming), Phenopackets (forthcoming), Service Registry / Info (forthcoming), TES (forthcoming), TRS (forthcoming), VRS (forthcoming), WES (forthcoming)
European Joint Programme on Rare Disease (EJP RD)	https://www.ejprarediseases.org/	Europe	RD	>130,000 data records across several resources hosting genomic human data, mainly the EGA, DECIPHER and the RD-Connect Genome-Phenome Analysis Platform	a mix of WGS, WES, plausibly pathogenic variants and phenotypic information	distributed across centralized resources	download and Cloud analysis	AAI (forthcoming), Beacon V1, CRAM, Crypt4GH, DRS (forthcoming), DUO, htsget, Passports, Phenopackets, Service Registry / Info, TES, TRS, WES
GEnome Medical Alliance Japan (GEM Japan)	https://www.amed.go.jp/en/aboutus/collaboration/ga4gh_gem_japan.html	Japan	RD, Ca, CT	24k WGS (aiming for 100k)	whole-genome sequencing, whole-exome sequencing, gene expression, panels, phenotypic	centralized	download (also exploring Cloud)	Beacon V1 (forthcoming), CRAM, DUO, Phenopackets (forthcoming)
Genomics England	https://www.genomicsengland.co.uk	UK	RD, Ca, CT	136K WGS, (estimating 450K WGS by 2024)	WGS	centralized	Cloud	AAI (forthcoming), CRAM, DRS (forthcoming), DUO (forthcoming), htsget, Passports (forthcoming), WES (forthcoming)
Human Cell Atlas	https://www.humancellatlas.org	International	RD, Ca, CT, Bio	1,300 donors	single-cell sequencing	centralized	public and Cloud	AAI, DRS, DUO (forthcoming), Passports (forthcoming), TES, TRS, WES
Human Heredity and Health in Africa (H3Africa)	https://h3africa.org/	Africa	CT, Bio	75,000 participants (across all projects)	whole-genome sequencing, whole-exome sequencing, gene expression, microbiome, imaging, phenotypic, environmental/lifestyle	centralized	download	AAI (forthcoming), Beacon V1, CRAM, Crypt4GH, Data Connect (forthcoming), DUO, Passports (forthcoming), Phenopackets (forthcoming), VRS (forthcoming)
International Cancer Genome Consortium (ICGC) Accelerating Research in Genomic Oncology (ARGO)	https://www.icgc-argo.org	international	Ca	100k Genomes	WGS, WES, RNA-Seq, phenotype	distributed	Cloud and federated analysis	AAI (forthcoming), Beacon V1, CRAM, Passports (forthcoming), TRS, WES
Matchmaker Exchange	https://www.matchmakerexchange.org	international	RD	>109K cases	WGS, WES	distributed	federated analysis	AAI (forthcoming), Beacon V1, CRAM, htsget, Phenopackets
Monarch Initiative	https://monarchinitiative.org/	international	RD, Ca, CT, Bio	N/A	gene, genotype, variant, disease, and phenotype data across many species in the tree of life, from over 30 data sources	centralized	public cloud	DUO (forthcoming), Passports (forthcoming), Phenopackets, VRS
National Cancer Institute Cancer Research Data Commons (NCI CRDC)	https://datascience.cancer.gov/data-commons	US	Ca	~100,000 data records (includes GDC)	whole-genome sequencing, whole-exome sequencing, gene expression, panels, phenotypic, biospecimen, imaging, proteomics	centralized	Cloud and federated analysis	CRAM, DRS, DUO (forthcoming), Passports (forthcoming), Service Registry / Info, WES
National Cancer Institute Genomic Data Commons (NCI GDC)	https://gdc.cancer.gov	US	Ca	83,700 cases	WGS, WXS, panel, RNA-seq, miRNA-seq, methylation array, genotyping array, diagnosis slides, tissue slides, ATAC-seq, scRNA-seq. Also clinical (phenotypic) and biospecimen information	centralized	download and Cloud	AAI (forthcoming), CRAM (forthcoming), DRS (forthcoming), DUO (forthcoming), Passports (forthcoming), Phenopackets (forthcoming), TES (forthcoming), TRS (forthcoming), VRS (forthcoming), WES (forthcoming)
Swiss Personalized Health Network (SPHN)	http://sphn.ch	Switzerland	RD, Ca, CT, Bio	24 health data projects across Switzerland	clinical phenotypic, clinical routine, omics (genomic, transcriptomic, proteomic, etc), cohort, and imaging data and expert variant curation	distributed	federated analysis	Beacon V1, DRS (forthcoming), htsget (forthcoming), Phenopackets, TES (forthcoming), WES (forthcoming)
Trans-Omics for Precision Medicine (TOPMed)	https://topmed.nhlbi.nih.gov	US	RD, Ca, CT, Bio	180k whole genome sequences (233k by 2025), 96k panels	WGS, RNA-seq, metabolome, methylome (MethylationEPIC ‘850K’), proteome (SomaScan and Olink), longitudinal epidemiology studies, disease-studies, environmental/lifestyle, imaging	centralized	cloud	AAI (forthcoming), CRAM, DRS, DUO, Passports (forthcoming), Service Registry / Info (forthcoming), TRS, WES
Variant Interpretation for Cancer Consortium (VICC)	cancervariants.org	international	Ca	24,366 evidence items	genetic and experimental evidence	centralized	public	Beacon V1, Service Registry / Info, VA (forthcoming), VRS

GA4GH Driver Projects are external genomic data initiatives that have committed to both contributing to the development of genomic data sharing standards as well as piloting their use in real world practice. Abbreviations: RD, rare disease; Ca, cancer; CT, complex traits; Bio, basic biology.

Figure 1

Matrix structure of the Global Alliance for Genomics and Health

GA4GH is a community of diverse stakeholders from Driver Projects and other institutions working together in the context of Work Streams. Each GA4GH Driver Project is expected to dedicate two full-time equivalents across at least two GA4GH Work Streams. As foundational groups that review all GA4GH deliverables, the Regulatory and Ethics and Data Security Work Streams must have representation from every Driver Project. In addition to Driver Projects, any member of the community—regardless of domain, sector, nation, or affiliation—is invited to participate in any GA4GH Work Stream. Supplemental information includes details on how each of the 24 GA4GH Driver Projects intersects with the six technical Work Streams.

GA4GH standards development and approval process

GA4GH Work Streams and Driver Projects have identified, and are actively developing, the technical specifications and policy frameworks they believe to be of most relevance to enable widespread data sharing, federated approaches, and interoperability across datasets to facilitate genomic research (see supplemental information for more details on the product development process); the areas of focus are outlined in Box 1, with individual products defined in Table 1 and in the 2020/2021 GA4GH Roadmap (https://www.ga4gh.org/roadmap). Each GA4GH deliverable can be implemented on its own to enable interoperability and consistency in a single area. However, when implemented together, they support broader activities in the research and clinical domains and enable productive genomic data sharing and collaborative analyses that can leverage global datasets produced in distinct locations around the world. Each approved GA4GH deliverable is reviewed by a panel of internal and external experts not involved in the product’s development, and then by the GA4GH Steering Committee (https://www.ga4gh.org/about-us/governance-and-leadership-2/#steering). GA4GH standards are not typically accredited by a national or international standards body, and instead follow a model inspired by the Internet Engineering Task Force (IETF; https://www.ietf.org) and the World Wide Web Consortium (W3C; http://www.w3.org). This enables a flexible and rapid response to community needs and a focus on lowering barriers to interoperability through the development and adoption of pragmatic standards. However, there are occasions when certain standards benefit from a more formal accreditation process, especially when there is a direct link into healthcare usage (see next section and Box 2).

Alignment with other standards organizations

To achieve greater international coordination and consistency of standards development, GA4GH proactively collaborates with other standards development organizations working in genomics, e.g., Health Level Seven (HL7; http://www.hl7.org), International Organization for Standardization (ISO; https://www.iso.org), Open Biological and Biomedical Ontology Foundry (OBO; http://www.obofoundry.org/). While defined work processes between GA4GH and other standards development bodies are still under development, GA4GH has initiated several pilot projects to explore mechanisms of collaboration. One such approach is the submission of GA4GH standards to ISO’s technical committees for approval as ISO international standards. Using a product development timeline that aligns the ISO approval process with the GA4GH approval process, both communities are able to contribute to the development of a standard in a harmonized manner. These efforts expand the diversity of contributors to both organizations, leading to more robust and internationally applicable standards. Another approach, guided by HL7 working groups and experts, is the translation of GA4GH standards into HL7 Fast Health Interoperability Resources (FHIR) Implementation Guides. These implementation guides enable interoperability of GA4GH standards with clinical systems and accelerate the use of clinical data for research. GA4GH also aims to support and interoperate with existing translational models, ontologies, and terminologies (e.g., FHIR, HGVS, OMOP, PCORnet, Human Phenotype Ontology, SNOMED CT) for clinical genetics and genomics. Before launching a new standards development project, GA4GH Work Streams are encouraged to complete a landscape analysis that both defines relevant existing standards and how they will influence the development of the new standard. Coordination activities—such as joint meetings, shared documentation, and process harmonization between GA4GH work and these health standards-focused efforts—are critical for bridging the research-clinical divide and keeping respective products aligned. This helps prevent unnecessary proliferation of redundant standards and minimizes the development of semantically and syntactically conflicting standards that could hamper large-scale interoperability and lead to confusion within the adopter community (see Box 2).

Federated approaches

Federated approaches—the ability to analyze data across multiple distinct and secure sites—is increasingly seen as an important strategy where data cannot be pooled for legal or practical reasons. These approaches are characterized by independent organizations hosting data in secure processing environments (e.g., clouds, trusted research environments) while adopting technical standards that enable analysis at scale. Application programming interfaces (APIs) can be deployed to enable researchers and portable workflows to visit multiple databases even where the data and computing environment are variably configured. Tools like “identity federation” can facilitate even closer integration across organizations. GA4GH Driver Projects and other partners are beginning to implement cloud-based workflows built on GA4GH standards that allow scientists to share, access, and interrogate data stored at disparate sites around the globe. Some concrete examples of this access pattern include (1) the Data Coordination Platform of the Human Cell Atlas, an internationally federated compute environment for analyzing single-cell data; (2) Genomics England’s secure Research Environment for approved investigators to access the 100,000 Genomes Project dataset; (3) the NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL) and the Gen3 Data Commons, which provide cloud-based spaces for scientists to work with large-scale genomic and genomic-related datasets and shared tools; and (4) H3ABioNet, a bioinformatics platform that serves data from the Human Heredity and Health in Africa (H3Africa) network to researchers across the continent and provides containerized workflows for analysis of the data. Because these workflows are built on interoperable standards, they allow for secure access and efficient discovery, portability, and analysis. With more instances like these, the global community will be able to harness the power of large data and improve the reach of genomic medicine research. The federation and transparency enabled by standards will also encourage greater willingness among non-western and other underrepresented populations to share their data, affording greater diversity in the overall data available and equity in its impacts.

Genomics in Healthcare

The process of sequencing a genome is essentially the same in any setting, but the scale and quality control of production, as well as the regulation and dissemination of the resulting data, can be quite different in healthcare compared to research. “Research genomes” contain de-identified data and therefore are often openly shared with other researchers, including for funding and publishing requirements (for NIH policy, see web resources), frequently with managed access, e.g., via the European Genome-phenome Archive (EGA), the Japanese Genotype-phenotype Archive (JGA), or the database of Genotypes and Phenotypes (dbGaP). Researchers worldwide will draw on these openly shared genomic datasets for their own studies, increasing the amount of knowledge derived from each genome. However, while such research genomes are more readily available, these datasets usually do not include the type or extent of longitudinal, standardized, or interoperable clinical data needed for genomic medicine. Healthcare-based research and testing have an entirely different financial, legal, and social landscape, with the structure, provision, and regulation varying by country, covering the full spectrum from state-run to private schemes. In each system, the cost of an assay in healthcare—genomics included—is often considered in light of its benefits to the health of an individual and cost effectiveness within the healthcare system. In theory, if a genomic assay demonstrates clinical utility for a specific application within a healthcare system—especially if it is cost effective—the only limit to its deployment is the number of patients who will potentially benefit. In practice, however, there are logistical, financial, regulatory, educational, scientific, and clinical-based hurdles to overcome before a genomic test becomes a routine clinical offering. In addition, barriers to healthcare access will likely remain impediments to large-scale implementation in many countries. The current case for implementing genomics in healthcare can be presented in four broad disease areas: rare disease, cancer, common/chronic disease, and infectious disease. In the following sections we outline the case for healthcare-funded sequencing in each disease area. We also highlight challenges to implementation in each area and GA4GH deliverables aimed at overcoming these issues.

Rare disease

Arguably, the rare disease space has seen the most successful deployment of genomics in healthcare, with many reporting diagnostic rates of at least 20%–30%, and health economic studies demonstrating cost-effectiveness and diagnostic utility. Clinical geneticists have used single-gene or small gene panel tests since the early 1990s to support diagnosis and some treatment decisions for many of these diseases. The cost of assaying broader genomic regions—including exome and genome sequencing—has fallen considerably, with a substantial impact on rare-disease diagnosis and discovery research. However, with more than 10,000 rare diseases affecting more than 300 million patients worldwide diagnosing and discovering treatments for many of these diseases has been challenging. As such, the rare disease community has embraced data sharing in order to facilitate global knowledge exchange and improve patient diagnostic rates, understand disease progression, and augment care strategies. To further enable progress, clinical and research laboratories and health systems must support several key activities to effectively identify, diagnose, and eventually treat the genetic causes of rare disease: (1) aggregate genomic and phenotypic data, needed for discerning population allele frequencies in disease and non-disease populations and implicating new genes in rare disease; (2) catalog the validity of gene-disease associations using consistent annotation models and terminologies; (3) collectively build knowledge bases to understand variant pathogenicity; (4) define the natural histories of rare diseases to predict disease progression and enable a foundation upon which to develop clinical trials; and (5) monitor treatment efficacy of emerging therapeutics. GA4GH standards and policies already enable and will continue to build upon these activities. For example, the Matchmaker Exchange—a rare disease gene discovery platform which has benefited from GA4GH guidance on API-based data exchange formats as well as consent and data security policies —illustrates the power of bringing practicing clinicians and researchers together, as cases from across the globe are necessary to build evidence to confirm new gene-disease relationships. GA4GH promotes knowledge sharing in ClinVar, a database which has accelerated improvements in variant classification across the clinical laboratory community. Additional methods are now being deployed to move beyond manual submission of variant classifications to a centralized database; such advances will enable more timely access to siloed laboratory knowledge and evidence-based variant classification. Realtime sharing with ClinVar—facilitated by APIs and with entries linked to rich, case-level data—will be needed to scale our understanding of the more than 750 million variants so far identified in the human genome (e.g., within gnomAD; https://gnomad.broadinstitute.org). The Variation Representation (VRS) and Variant Annotation (VA) specifications aim to support the exchange of variant data, Phenopackets and Pedigree representation to support the use of standardized clinical and family history data, as well as new APIs (e.g., Beacon v2 API and Data Connect API) to enable the identification of data for further access and analysis. The aim is for these standards to support a more global and federated approach to rare disease data and knowledge sharing that will be critical to advancing diagnosis and treatment of rare diseases.

Cancer

One in five men and one in six women worldwide will have a cancer diagnosis in their lifetime. This risk is 2- to 3-fold greater in higher-resource countries, with estimates as high as one in two people in the UK for example. An altered somatic genome is a consistent hallmark of cancer, often associated with specific pathogenic mutations. In some individuals with hereditary cancer syndromes, germline variants can disrupt cancer-related pathways and increase the risk of developing a “heritable” malignancy. Characterizing a cancer by sequencing a patient’s tumor genome alongside their germline genome has resulted in profound insights into molecular mechanisms of malignant transformation and discovery of potential therapeutic targets. Tumor/normal sequencing has demonstrated applications in disease monitoring as well as diagnosis, prognosis, and therapeutic response prediction, both at initial presentation and disease recurrence. Applying cancer genomics in the clinic is more complicated than that for rare diseases. For cancer patients, treatment strategy time frames are commonly measured in weeks and incorporating genomic information within such an urgent turnaround time is logistically challenging to integrate into clinical decision making. Additionally, while the use of genomics for diagnosis and improved symptom management can lead to substantial improvements for rare disease patients and their families, application of genomics in cancer treatment is more complex and may include dual assessment of both somatic and germline genomes to determine heritable cancer risk and the assessment of the evolving tumor genome due to changing selective pressures in response to targeted therapies. Cancer genomic information is most useful if it informs treatment options, yet development of systems that match patients to appropriate clinical trials would be needed to fully realize the benefits of genomic tumor data where estimates of clinical trial enrollment in patients with cancer stands at ~8%. Genomic information is increasingly important in clinical decision making through routine clinical sequencing assays and molecular tumor boards. The heterogeneity of cancer as a disease—of each individual tumor and of any concurrent or subsequent manifestation, such as metastasis or recurrence—adds many layers of complexity to genomic analysis. To address this complexity, it is important to analyze somatic and germline variation data together to understand their contribution to cancer risk. Most of the same standards and workflows important for rare disease apply to tumor sequencing, including data storage and compression standards (e.g., CRAM), variation representation (e.g., VCF and VRS), analysis (e.g., cloud-based workflows), and linkage to patient records (e.g., Phenopackets). However, discovery of oncogenic driver mutations also requires significant coordination and standardization to track outcome data (e.g., progression and response to treatment), a key element in determining the clinical significance of variation found in cancer patients. As such, many groups have created knowledge bases to annotate cancer genomic variation associated with evidence of pathogenicity or relevant treatment options; however, these knowledge bases can have limited levels of interoperability. In 2014, a GA4GH task team launched the Variant Interpretation for Cancer Consortium (VICC), which standardizes and coordinates clinical somatic cancer curation efforts and has created an open community resource to provide the aggregated information. Moving forward, major oncogenomic resources are now working with GA4GH on the harmonization of variant interpretation evidence, through refinement and adoption of standards such as the Beacon API, the Data Use Ontology (DUO), VA, and VRS. Additionally, these standards are being implemented across multiple GA4GH Driver Projects (see Table 2) that capture genomic data and/or diagnostic variant interpretation across the longitudinal evolution of cancer.

Common/chronic disease

“Common disease” is a catchall phrase describing a vast spectrum of diseases that have complex environmental and genetic etiologies. Accurate prediction of common diseases from genetics has been a topic of study since the inception of human genetics, yet genomic information is still not widely used in clinical practice for this purpose. The discovery of a large number of genetic susceptibility loci (polygenic architecture) supported the common-disease common-variant hypothesis and has led to the generation of polygenic risk scores summarizing common disease risk. Studies are now beginning to demonstrate the clinical benefits of applying polygenic risk scores in practice through stratification of the population for deploying disease management strategies. As the assay of choice moves from genotype arrays to sequencing, there will be integration between common disease and rare disease applications; this is already the case for certain diseases such as susceptibility to breast cancer or heart disease. When such genomic information can be used clinically for common diseases, it will be more justifiable to sequence entire populations. Population-scale sequencing is in place already in some countries (e.g., Iceland) and is likely to become more commonplace in the next two decades. To support the discovery of the genetic causes and contributors to common disease across all populations, researchers must be able to identify and access aggregated data from large-scale cohort population studies from diverse backgrounds, carried out by multiple distinct sites such as biobanks in the UK (UK BioBank, Generation Scotland), China (China Ka-doorie Biobank), the US (NIH All of Us Research Program), and Japan (Tohoku Medical Megabank, Japanese BioBank); and whole population cohorts in Iceland (deCODE), Estonia (Estonian Genome Project), and Finland (FinnGen). Doing so requires the data to be harmonized across all sites using common data models and terminologies. Furthermore,since genomic datasets of this scale are too large to download and manipulate at individual sites, researchers must be able to bring analytical tools to the data, regardless of their location. Protocols are needed to deploy these tools consistently and effectively across distinct federated sites. GA4GH products support this critical type of biological study across the typical research life cycle from data discovery to analysis: (1) identify and access datasets relevant to a disease study (e.g., GA4GH Passports, DUO, multiple data discovery APIs), (2) access secure genotype and phenotype information on patients with related traits (e.g., Phenopackets, Data Repository Service [DRS] API, VRS, VA), and (3) remotely run analytical methods on data of interest (e.g., Task Execution Service [TES], Workflow Execution Service [WES] API, htsget API ), avoiding the need for inter-jurisdictional transfers and disparate regulatory requirements.

Infectious disease

Genomics can be used to identify the infectious agents of disease with more confidence and precision than ever before, and at increasing speed, allowing treatments that can quickly resolve infections as well as identifying the evolution of new species that may evade antibiotics, antivirals, and vaccines. The main challenges to deployment of genomics in infectious disease care are managing cost and logistics, tracking disease progression and its characterization, achieving precise phenotypic prediction (e.g., antibiotic resistance), and harmonizing historical knowledge bases from non-genomic-based assays to integrate with contemporary genomic tests. The COVID-19 pandemic tested this infrastructure, with diagnostic testing becoming widespread, viral genomic sequencing enabling tracking of strains, and human genome sequencing of symptomatic individuals contributing to a better understanding of the basis of COVID-19 disease severity. Infectious disease genomic research and surveillance primarily rely on sequencing bacterial and viral pathogens and the organisms in which they are carried and transmitted. These genomes vary greatly in size, content, and associated metadata, so the standards and APIs created for human genomic data may be insufficient for infectious disease data. However, while the specific data standards needed to advance pathogen genomics differ from those in human genomics, there is still considerable overlap in the mechanics of sharing the data. Through a variety of strategic alignments with organizations such as the Public Health Alliance for Genomic Epidemiology (PHA4GE; https://pha4ge.org/), the International COVID-19 Data Alliance (ICODA; http://www.icoda-research.org), and the European COVID19 data portal (http://www.covid19dataportal.org), GA4GH is working to ensure that the species-agnostic elements of genomic data sharing standards are transferred into the infectious disease community. In addition, some GA4GH standards have begun to explore how they should adapt to support infectious disease data; for example, the Phenopackets standard was improved to support case-level presentation for infectious diseases in 2020 in response to the COVID-19 pandemic. In addition, recently launched initiatives such as large-scale tuberculosis sequencing in several countries, rapid identification of Ebola and Zika virus strains, and tracing hospital outbreaks using genomics demonstrate a vibrant, functional interface between research, public health institutions, and clinical practice.

Challenges to Secondary Use of Clinically Acquired Data

We envision the global clinical and research communities collaborating seamlessly in the context of practicing healthcare to enable a true “learning healthcare system” (LHS). The LHS concept has existed for over a decade; however, implementation is still in its infancy, facing several barriers. Some useful implementations are found across medicine, including genomic medicine. Increasing numbers of institutions and countries have begun biobanks, in many cases connected to their healthcare system (see Common/chronic disease above), providing fertile grounds on which to bring healthcare data— including clinical genomic data—into research. To enable these efforts to reach their full potential, disparate systems must be able to share genomic and clinical data, requiring the community to overcome key challenges, particularly in the areas of infrastructure development, patient and physician incentives, ethics and regulation, privacy and security, and socio-cultural expectations (see Box 3). We believe these challenges can be overcome—but only if the genomics and healthcare communities commit to broad-based advocacy and coordinated efforts worldwide. This has already been successfully modeled through the Clinical Genome Resource (ClinGen; a GA4GH Driver Project), where healthcare providers, clinical laboratory staff, and researchers work together to develop standards for gene and variant curation, share underlying evidence, and then apply that evidence through a consensus-driven process to classify genes and variants which are made freely accessible to the broader community to support both research and clinical care.

Developing clinical data standards

Much of the clinical data contained within healthcare are not encoded in a standardized format. Multiple electronic health record (EHR) vendors exist today and are highly proprietary in their technical structures, making standardization across EHRs and with downstream research systems difficult. Although data recorded in EHRs often use standardized clinical terminologies (e.g., ICD, SNOMED CT), the intent of these systems is generally to present clinical information on individuals to healthcare providers and, in some regions, facilitate billing practices. This presents a challenge for secondary users, where it is difficult to make accurate, population-scale conclusions, often requiring extensive efforts to understand practices and generate useful research data. In order to promote adoption of standardized formats in research and ultimately within EHRs, GA4GH is developing standardized information models (e.g., Phenopackets, Pedigree) to describe clinical phenotypes and family histories. Standardizing the representation of phenotype and pedigree information will allow patients, care providers, and researchers to share this information more easily between healthcare and research systems and enable software tools to use this information to improve genome analysis and diagnosis.

Incentivizing and facilitating data sharing in healthcare

Resource limitations for healthcare providers and patients also impact their ability to share valuable clinical data. Some healthcare institutions (e.g., NHS England [https://www.england.nhs.uk/genomics/nhs-genomic-med-service], Dana-Farber Cancer Institute [http://www.dana-farber.org/for-patients-and-families/becoming-a-patient/preparing-for-your-first-appointment/checklist-for-new-adult-patients], Danish healthcare ) have built layered consent procedures into the regular routine of medical practice. Others support parallel biobanking efforts to separately consent patients for research. Still others have built this into their operations as an inherent part of the healthcare system. Further incentives can be built if providers can experience the direct benefits of research. For example, the clinical laboratory genetic testing industry largely participates voluntarily in data sharing through ClinVar, in part because they directly benefit from accurate variant interpretation. Several laboratories also joined when the US insurance industry began requiring submission as a condition of test reimbursement. However, despite progress in the sharing of variant knowledge, additional incentives and infrastructure are needed to support access to case-level results (e.g., variants interpreted for a patient indication) as well as full sequencing data, along with rich clinical phenotypes. Currently, most genetic test results are returned through PDF-based reports or accessed through external portals outside the medical system. Although standards exist for the exchange of genetic test results (see, for example, HL7’s guide in the web resources), robust standards that capture highly detailed, discrete genomic data are still under development. Adoption of those standards has been motivated by the implementation of downstream clinical decision support, but more incentives and infrastructure will be needed. To date, GA4GH has worked on maintaining and evolving standardized file formats for raw and annotated genomic data (SAM, BAM, CRAM, VCF/BCF); individual variant representation and interpretation (VRS, VA); and transmission of individual phenotype data and interpreted results (Phenopackets), all of which are critical for the evolving use of genomics in healthcare systems—particularly clinical laboratory workflows to share genomic data and genetic testing results. Future areas of development include better representation of structural variants, unambiguous representation of complex multi-allelic loci, and research into new, more scalable formats for storing and exchanging genetic variation. Population-scale sequencing programs in which healthcare systems share clinical genomic data for research are unlikely to allow large-scale aggregation of data to migrate beyond national boundaries, but federated analysis—in which analytical algorithms or queries are brought to the data in its location without data egress—is feasible and is a major area of focus of GA4GH’s standards development.

Ethics and regulation

Ethical considerations for patients and populations, together with responsible regulation, are essential for healthcare-funded genomics, which involves complex national regulation and legislation. Different countries and institutions have individual values and policies that relate to allowing access to personal information, with some embracing more open regulatory norms and systems on data collection, access, and sharing, and others being more restrictive. Nevertheless, most systems have some mechanism for researchers to access both research and clinical data. The GA4GH Regulatory and Ethics Work Stream (REWS) develops ready-to-use policy guidance to support responsible, international genomic and health-related data sharing. In Box 4, we list central components of the GA4GH Regulatory & Ethics Toolkit, including policies, consent tools, and data access guidance. The REWS also reviews all GA4GH technical standards for consideration of any regulatory or ethics issues that may be relevant. The first REWS product was the GA4GH Framework for Responsible Sharing of Genomic and Health-Related Data, which is built on the human right to benefit from scientific progress and its applications, as well as privacy, non-discrimination, and procedural fairness. It provides guidance for the responsible sharing of human genomic and health-related data, including personal health data and other types of data that may have predictive power in relation tohealth. The Framework has now been translated into 14 languages and has been used to inform local data sharing approaches around the globe, including, for example, the World Economic Forum, the Academy of Science of South Africa, DNA.Land, Health Data Research UK, and the Horizon-2020 CORBEL project. Keeping the fundamental human right to benefit from science at the heart of clinical and genomic data sharing ensures a universal approach to balancing the benefits and potential risks. We believe that most healthcare system actors can ultimately participate in responsible, worldwide data sharing while remaining compliant with applicable laws and institutional policies.

Privacy and security

Federating large volumes of sensitive clinical and genomic data across internationally distributed virtual computing environments presents formidable challenges in assuring data integrity, service availability, and individual privacy. Some of these challenges call for innovative application of well-established security standards, frameworks, and protocols—such as identity federation on a global scale—and some GA4GH standards already do so (e.g., crypt4GH, Authentication & Authorization Infrastructure [AAI] / Passports). Another crucial challenge is to enable secure, privacy-preserving federated analysis, wherein researchers can extract information without having to transfer raw data. This evolution is key to foster inter-institutional and international collaboration and will be a strong incentive to improve ontology homogeneity. Several technical solutions are available, either based on hardware devices or on software algorithms. The former are computationally efficient, but require trusting a vendor and are prone to side-channel attacks. The latter are computationally slower, but are mathematically proven and are a better response to GA4GH expectations. Recent results have demonstrated the effectiveness of a software-based approach (a combination of homomorphic cryptography and secure multiparty computation called “Multi-party Homomorphic Encryption” or MHE); these approaches have been positioned with respect to the GDPR. One of the major strengths of MHE is that partial aggregates can be considered to be anonymized and not just pseudonymous, in the sense of GDPR, and thus potentially obviating the need for data transfer and use agreements (DTUAs).

Societal challenges

Societal challenges of allowing access to genomic data within the healthcare ecosystem include maintaining public trust, overcoming differences in objectives and methods between research and healthcare, and breaking down unproductive divides between disciplines. Our vision for healthcare data ecosystems is one in which vetted researchers around the world can, with appropriate oversight and policy enforcement, gain access to human health data for the benefit of patients. GA4GH has defined the core elements of responsible data sharing, including transparency, accountability, recognition, and attribution as well as sanctions for misuse which form a framework to respect and maintain the trust of participants. In particular, the GA4GH Engagement Framework (see Box 4) further assists researchers in designing and understanding engagement with public, patient, and participant stakeholders through the central themes of fairness, context, heterogeneity, and the recognition of tensions. Through the implementation arm of GA4GH, the Genomics in Health Implementation Forum (https://www.ga4gh.org/implementation) described below and other engagement efforts, GA4GH is tackling the broader societal implementation issues including education and engagement of the public, healthcare providers, and regulators in order to build trust within the community. The GA4GH “Your DNA, Your Say” survey, an effort to gather international public attitudes toward genomic data sharing, has provided an evidence base for understanding which factors are important to maintaining public trust in the generation and sharing of genomic data, as well as how concerns differ according to geography. These findings help ensure that GA4GH’s work can enhance the public trust in a global context upon which the future of genomics depends.

Connecting Standards for Implementation

With more than 30 GA4GH standards approved, and dozens of production-ready implementations of those standards deployed around the world, GA4GH is now shifting its focus toward demonstrating how standards can work together to provide seamless support of genomic activities. Interconnected standards that are compatible and interoperable with each other and are hardened for real-world use will enable solutions for federated analyses across platforms and use cases. To drive this effort, GA4GH has established the Federated Analysis System Project (FASP), which aims to demonstrate how GA4GH APIs, when used in concert, can support real-world, scientific use cases (see https://www.ga4gh.org/genomic-data-toolkit/2020-connection-demos/). A key outcome of FASP is a series of scripts that represent working examples of clients accessing real-world GA4GH-compatible services to solve a spectrum of challenges across the search-access-analyze workflow. The scripts illustrate how these services have adopted GA4GH standards to solve challenges, such as dataset discoverability and controlled data access, in order to drive larger scale and more powerful analyses. By developing working implementations of GA4GH standards that are pressure tested in real world scenarios, the FASP team has identified specific areas of improvement within the standards. As a result of this work, new features will be added to existing GA4GH specifications to further facilitate secure, real-world federated data sharing and analysis. Most notably, the group is working toward a standardized solution for using a GA4GH Passport to access a controlled access dataset from a Data Repository Service (DRS), while fulfilling robust security requirements, such as preventing escalation of privilege. These efforts will be critical to support access to valuable datasets across the globe.

GA4GH Starter Kit

To date, GA4GH has primarily focused on overcoming the challenges of enabling interoperability within new initiatives built on a foundation of cloud infrastructure. However, an additional—and potentially more significant—challenge is bringing high-performance computing (HPC) infrastructures that are not already focused on cloud interoperability into the federated network envisioned by this community. While more ambitious goals are on the horizon for connecting and extending GA4GH standards (e.g., discovery of datasets; matching requests, analyses, and datasets; describing phenotypes; reporting on variants), FASP has shown through its real-world demonstrations of access across distributed but interoperable datasets that the initial groundwork for federated analysis is now in place. The Data Repository Service (DRS) allows data custodians to make controlled access data available at multiple sites; the Workflow Execution and Task Execution Services (WES & TES) allow researchers to encapsulate and run analyses on those data; and AAI and Passports allow for federated authorization and authentication, streamlining the data access process for both researchers and data custodians. In 2021, GA4GH has begun to develop the GA4GH Starter Kit, a set of open source reference implementations (for example, code bases that demonstrate the standards working in practice), to help ensure existing HPC environments can interoperate with the wider GA4GH network. These resources consist of “plug- and-play” code that any institution (cloud-based or HPC) can use to quickly achieve GA4GH compatibility and will facilitate the progressive movement of established large-scale systems toward interoperability. In addition, a testing suite will be developed to ensure deployments of both reference and non-reference implementations are compliant to their respective GA4GH specifications.

Genomics in Health Implementation Forum

Once standards have been piloted in real-world Driver Project settings and shown to enable true federated analysis in FASP, they can begin to be promoted more broadly in the research and clinical genomics communities. Launched in 2020, the Genomics in Health Implementation Forum (GHIF) brings together a group of national-scale genomic data initiatives to share resources, experiences, and best practices for implementing GA4GH standards, as well as broader experience in rolling out national and international data sharing activities. GHIF aims to support more accurate data interpretation and disease diagnosis plus other innovative solutions across healthcare through global cooperation in data sharing and clinical implementation of genomics. Broad uptake of GA4GH standards among GHIF members— which include both GA4GH Driver Projects as well as other national and multi-national initiatives (see https://ga4gh.org/implementation for full list)—will provide strong evidence that GA4GH standards are supporting the community’s actual data sharing needs. Implementation of GA4GH policies and standards throughout the scientific and healthcare communities will allow researchers to access data across the globe—a critical step toward answering otherwise impenetrable questions about disease and basic human biology. As the volume of genomic and health-related data grows exponentially around the world, researchers, clinicians, and bioinformaticians have a responsibility to make that data appropriately accessible and to use it to realize benefits for all humans everywhere. The promise of genomic medicine lies at a crossroads that depends on harmonization across the global community to significantly enhance human health and medicine. We believe that GA4GH, by embracing collaborative innovation and knowledge exchange, is well poised to meet this challenge.

103 in total

1. The Polygenic Score Catalog as an open database for reproducibility and systematic evaluation.

Authors: Samuel A Lambert; Laurent Gil; Simon Jupp; Scott C Ritchie; Yu Xu; Annalisa Buniello; Aoife McMahon; Gad Abraham; Michael Chapman; Helen Parkinson; John Danesh; Jacqueline A L MacArthur; Michael Inouye
Journal: Nat Genet Date: 2021-04 Impact factor: 38.330

2. Scaling resolution of variant classification differences in ClinVar between 41 clinical laboratories through an outlier approach.

Authors: Steven M Harrison; Jill S Dolinksy; Wenjie Chen; Christin D Collins; Soma Das; Joshua L Deignan; Kathryn B Garber; John Garcia; Olga Jarinova; Amy E Knight Johnson; Juha W Koskenvuo; Hane Lee; Rong Mao; Rebecca Mar-Heyming; Andrew S McFaddin; Krista Moyer; Narasimhan Nagan; Stefan Rentas; Avni B Santani; Eija H Seppälä; Brian H Shirts; Timothy Tidwell; Scott Topper; Lisa M Vincent; Kathy Vinette; Heidi L Rehm
Journal: Hum Mutat Date: 2018-11 Impact factor: 4.878

3. Mendelian Gene Discovery: Fast and Furious with No End in Sight.

Authors: Michael J Bamshad; Deborah A Nickerson; Jessica X Chong
Journal: Am J Hum Genet Date: 2019-09-05 Impact factor: 11.025

4. Cohort Profile: Estonian Biobank of the Estonian Genome Center, University of Tartu.

Authors: Liis Leitsalu; Toomas Haller; Tõnu Esko; Mari-Liis Tammesoo; Helene Alavere; Harold Snieder; Markus Perola; Pauline C Ng; Reedik Mägi; Lili Milani; Krista Fischer; Andres Metspalu
Journal: Int J Epidemiol Date: 2014-02-11 Impact factor: 7.196

5. Prospective comparison of the cost-effectiveness of clinical whole-exome sequencing with that of usual care overwhelmingly supports early use and reimbursement.

Authors: Zornitza Stark; Deborah Schofield; Khurshid Alam; William Wilson; Nessie Mupfeki; Ivan Macciocca; Rupendra Shrestha; Susan M White; Clara Gaff
Journal: Genet Med Date: 2017-01-26 Impact factor: 8.822

6. Translating Germline Cancer Risk into Precision Prevention.

Authors: Matthew B Yurgelun; Georgia Chenevix-Trench; Scott M Lippman
Journal: Cell Date: 2017-02-09 Impact factor: 66.850

7. Clinical whole genome sequencing as a first-tier test at a resource-limited dysmorphology clinic in Mexico.

Authors: Alicia Scocchia; Kristen M Wigby; Diane Masser-Frye; Miguel Del Campo; Carolina I Galarreta; Erin Thorpe; Julia McEachern; Keisha Robinson; Andrew Gross; Subramanian S Ajay; Vani Rajan; Denise L Perry; John W Belmont; David R Bentley; Marilyn C Jones; Ryan J Taft
Journal: NPJ Genom Med Date: 2019-02-14 Impact factor: 8.617

8. The Monarch Initiative in 2019: an integrative data and analytic platform connecting phenotypes to genotypes across species.

Authors: Kent A Shefchek; Nomi L Harris; Michael Gargano; Nicolas Matentzoglu; Deepak Unni; Matthew Brush; Daniel Keith; Tom Conlin; Nicole Vasilevsky; Xingmin Aaron Zhang; James P Balhoff; Larry Babb; Susan M Bello; Hannah Blau; Yvonne Bradford; Seth Carbon; Leigh Carmody; Lauren E Chan; Valentina Cipriani; Alayne Cuzick; Maria Della Rocca; Nathan Dunn; Shahim Essaid; Petra Fey; Chris Grove; Jean-Phillipe Gourdine; Ada Hamosh; Midori Harris; Ingo Helbig; Maureen Hoatlin; Marcin Joachimiak; Simon Jupp; Kenneth B Lett; Suzanna E Lewis; Craig McNamara; Zoë M Pendlington; Clare Pilgrim; Tim Putman; Vida Ravanmehr; Justin Reese; Erin Riggs; Sofia Robb; Paola Roncaglia; James Seager; Erik Segerdell; Morgan Similuk; Andrea L Storm; Courtney Thaxon; Anne Thessen; Julius O B Jacobsen; Julie A McMurry; Tudor Groza; Sebastian Köhler; Damian Smedley; Peter N Robinson; Christopher J Mungall; Melissa A Haendel; Monica C Munoz-Torres; David Osumi-Sutherland
Journal: Nucleic Acids Res Date: 2020-01-08 Impact factor: 16.971

9. Refractory alveolar rhabdomyosarcoma in an 11-year-old male.

Authors: Cora A Ricker; Andrew D Woods; William Simonson; Melvin Lathara; Ganapati Srinivasa; Erin R Rudzinski; Atiya Mansoor; Robert G Irwin; Charles Keller; Noah E Berlow
Journal: Cold Spring Harb Mol Case Stud Date: 2021-02-19

10. NCBI's Database of Genotypes and Phenotypes: dbGaP.

Authors: Kimberly A Tryka; Luning Hao; Anne Sturcke; Yumi Jin; Zhen Y Wang; Lora Ziyabari; Moira Lee; Natalia Popova; Nataliya Sharopova; Masato Kimura; Michael Feolo
Journal: Nucleic Acids Res Date: 2013-12-01 Impact factor: 16.971

15 in total

Review 1. [Rare-disease data standards].

Authors: Peter N Robinson; Holm Graessner
Journal: Bundesgesundheitsblatt Gesundheitsforschung Gesundheitsschutz Date: 2022-09-23 Impact factor: 1.595

Review 2. Opportunities and challenges for the use of common controls in sequencing studies.

Authors: Genevieve L Wojcik; Jessica Murphy; Jacob L Edelson; Christopher R Gignoux; Alexander G Ioannidis; Alisa Manning; Manuel A Rivas; Steven Buyske; Audrey E Hendricks
Journal: Nat Rev Genet Date: 2022-05-17 Impact factor: 59.581

3. Seven years since the launch of the Matchmaker Exchange: The evolution of genomic matchmaking.

Authors: Kym M Boycott; Danielle R Azzariti; Ada Hamosh; Heidi L Rehm
Journal: Hum Mutat Date: 2022-05-10 Impact factor: 4.700

4. Road to FAIR genomes: a gap analysis of NGS data generation and sharing in the Netherlands.

Authors: Jeroen A M Belien; Anke E Kip; Morris A Swertz
Journal: BMJ Open Sci Date: 2022-04-12

5. Inverting the model of genomics data sharing with the NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space.

Authors: Michael C Schatz; Anthony A Philippakis; Enis Afgan; Eric Banks; Vincent J Carey; Robert J Carroll; Alessandro Culotti; Kyle Ellrott; Jeremy Goecks; Robert L Grossman; Ira M Hall; Kasper D Hansen; Jonathan Lawson; Jeffrey T Leek; Anne O'Donnell Luria; Stephen Mosher; Martin Morgan; Anton Nekrutenko; Brian D O'Connor; Kevin Osborn; Benedict Paten; Candace Patterson; Frederick J Tan; Casey Overby Taylor; Jennifer Vessio; Levi Waldron; Ting Wang; Kristin Wuichet
Journal: Cell Genom Date: 2022-01-13

6. International federation of genomic medicine databases using GA4GH standards.

Authors: Adrian Thorogood; Heidi L Rehm; Peter Goodhand; Angela J H Page; Yann Joly; Michael Baudis; Jordi Rambla; Arcadi Navarro; Tommi H Nyronen; Mikael Linden; Edward S Dove; Marc Fiume; Michael Brudno; Melissa S Cline; Ewan Bimey
Journal: Cell Genom Date: 2021-11-10

7. CRAM 3.1: Advances in the CRAM File Format.

Authors: James K Bonfield
Journal: Bioinformatics Date: 2022-01-06 Impact factor: 6.937

8. The GA4GH Variation Representation Specification: A computational framework for variation representation and federated identification.

Authors: Alex H Wagner; Lawrence Babb; Gil Alterovitz; Michael Baudis; Matthew Brush; Daniel L Cameron; Melissa Cline; Malachi Griffith; Obi L Griffith; Sarah E Hunt; David Kreda; Jennifer M Lee; Stephanie Li; Javier Lopez; Eric Moyer; Tristan Nelson; Ronak Y Patel; Kevin Riehle; Peter N Robinson; Shawn Rynearson; Helen Schuilenburg; Kirill Tsukanov; Brian Walsh; Melissa Konopko; Heidi L Rehm; Andrew D Yates; Robert R Freimuth; Reece K Hart
Journal: Cell Genom Date: 2021-11-10

9. SvAnna: efficient and accurate pathogenicity prediction of coding and regulatory structural variants in long-read genome sequencing.

Authors: Daniel Danis; Julius O B Jacobsen; Parithi Balachandran; Qihui Zhu; Feyza Yilmaz; Justin Reese; Matthias Haimel; Gholson J Lyon; Ingo Helbig; Christopher J Mungall; Christine R Beck; Charles Lee; Damian Smedley; Peter N Robinson
Journal: Genome Med Date: 2022-04-28 Impact factor: 15.266

10. Beacon v2 and Beacon networks: A "lingua franca" for federated data discovery in biomedical genomics, and beyond.

Authors: Jordi Rambla; Michael Baudis; Roberto Ariosa; Tim Beck; Lauren A Fromont; Arcadi Navarro; Rahel Paloots; Manuel Rueda; Gary Saunders; Babita Singh; John D Spalding; Juha Törnroos; Claudia Vasallo; Colin D Veal; Anthony J Brookes
Journal: Hum Mutat Date: 2022-04-08 Impact factor: 4.700