Heidi L Rehm1,2,3, Angela J H Page1,4, Lindsay Smith4,5, Jeremy B Adams4,5, Gil Alterovitz6,3, Lawrence J Babb1, Maxmillian P Barkley7, Michael Baudis8,9, Michael J S Beauvais4,10, Tim Beck11, Jacques S Beckmann12, Sergi Beltran13,14,15, David Bernick1, Alexander Bernier10, James K Bonfield16, Tiffany F Boughtwood17,18, Guillaume Bourque10,19, Sarion R Bowers16, Anthony J Brookes11, Michael Brudno19,20,21,22,23, Matthew H Brush24, David Bujold10,19,23, Tony Burdett25, Orion J Buske26, Moran N Cabili1, Daniel L Cameron27,28, Robert J Carroll29, Esmeralda Casas-Silva30, Debyani Chakravarty31, Bimal P Chaudhari32,33, Shu Hui Chen34, J Michael Cherry35, Justina Chung4,5, Melissa Cline36, Hayley L Clissold16, Robert M Cook-Deegan37, Mélanie Courtot25, Fiona Cunningham25, Miro Cupak7, Robert M Davies16, Danielle Denisko20, Megan J Doerr38, Lena I Dolman20, Edward S Dove39, L Jonathan Dursi21,23, Stephanie O M Dyke10, James A Eddy38, Karen Eilbeck40, Kyle P Ellrott24, Susan Fairley4,25, Khalid A Fakhro41,42, Helen V Firth16,43, Michael S Fitzsimons44, Marc Fiume7, Paul Flicek25, Ian M Fore45, Mallory A Freeberg25, Robert R Freimuth46, Lauren A Fromont47, Jonathan Fuerth7, Clara L Gaff17,18,27,28, Weiniu Gan34, Elena M Ghanaim48, David Glazer49, Robert C Green6,3, Malachi Griffith50, Obi L Griffith50, Robert L Grossman44, Tudor Groza51, Jaime M Guidry Auvil45, Roderic Guigó14,47, Dipayan Gupta25, Melissa A Haendel52, Ada Hamosh53, David P Hansen17,54, Reece K Hart1,55,56, Dean Mitchell Hartley57, David Haussler36,58, Rachele M Hendricks-Sturrup59, Calvin W L Ho60, Ashley E Hobb7, Michael M Hoffman20,21,22, Oliver M Hofmann20,28, Petr Holub61,62, Jacob Shujui Hsu63, Jean-Pierre Hubaux64, Sarah E Hunt25, Ammar Husami65, Julius O Jacobsen66, Saumya S Jamuar67,68, Elizabeth L Janes4,69, Francis Jeanson70, Aina Jené47, Amber L Johns71, Yann Joly10, Steven J M Jones72, Alexander Kanitz9,73, Kazuto Kato74, Thomas M Keane25,75, Kristina Kekesi-Lafrance4,10, Jerome Kelleher76, 
Giselle Kerry25, Seik-Soon Khor77,78, Bartha M Knoppers10, Melissa A Konopko79, Kenjiro Kosaki80, Martin Kuba62, Jonathan Lawson1, Rasko Leinonen25, Stephanie Li1,4, Michael F Lin81, Mikael Linden82,83, Xianglin Liu69, Isuru Udara Liyanage25, Javier Lopez84, Anneke M Lucassen85, Michael Lukowski44, Alice L Mann4,16, John Marshall86, Michele Mattioni87, Alejandro Metke-Jimenez54, Anna Middleton88,89, Richard J Milne88,89, Fruzsina Molnár-Gábor90, Nicola Mulder91, Monica C Munoz-Torres52, Rishi Nag25, Hidewaki Nakagawa92,93, Jamal Nasir94, Arcadi Navarro47,95,96,97, Tristan H Nelson98, Ania Niewielska25, Amy Nisselle18,28,99, Jeffrey Niu21, Tommi H Nyrönen82,83, Brian D O'Connor1, Sabine Oesterle9, Soichi Ogishima100, Vivian Ota Wang45, Laura A D Paglione101,102, Emilio Palumbo14,47, Helen E Parkinson25, Anthony A Philippakis1, Angel D Pizarro103, Andreas Prlic55, Jordi Rambla14,47, Augusto Rendon84, Renee A Rider48, Peter N Robinson104,105, Kurt W Rodarmer106, Laura Lyman Rodriguez107, Alan F Rubin27,28, Manuel Rueda47, Gregory A Rushton1, Rosalyn S Ryan108, Gary I Saunders79, Helen Schuilenburg25, Torsten Schwede9,73, Serena Scollen79, Alexander Senf109, Nathan C Sheffield110, Neerjah Skantharajah4,5, Albert V Smith111, Heidi J Sofia48, Dylan Spalding82,83, Amanda B Spurdle112, Zornitza Stark17,18,28, Lincoln D Stein5,20, Makoto Suematsu80, Patrick Tan67,113,114, Jonathan A Tedds79, Alastair A Thomson34, Adrian Thorogood10,115, Timothy L Tickle1, Katsushi Tokunaga78,116, Juha Törnroos82,83, David Torrents96,117, Sean Upchurch118, Alfonso Valencia96,117, Roman Valls Guimera28, Jessica Vamathevan25, Susheel Varma25,119, Danya F Vears18,28,99,120, Coby Viner20,21, Craig Voisin121, Alex H Wagner32,33, Susan E Wallace11, Brian P Walsh24, Marc S Williams98, Eva C Winkler122, Barbara J Wold118, Grant M Wood123, J Patrick Woolley76, Chisato Yamasaki74, Andrew D Yates25, Christina K Yung5,124, Lyndon J Zass91, Ksenia Zaytseva10,125, Junjun Zhang5, Peter Goodhand4,5, Kathryn 
North18,20,28, Ewan Birney25,126.
Abstract
The Global Alliance for Genomics and Health (GA4GH) aims to accelerate biomedical advances by enabling the responsible sharing of clinical and genomic data through both harmonized data aggregation and federated approaches. The decreasing cost of genomic sequencing (along with other genome-wide molecular assays) and increasing evidence of its clinical utility will soon drive the generation of sequence data from tens of millions of humans, with increasing levels of diversity. In this perspective, we present the GA4GH strategies for addressing the major challenges of this data revolution. We describe the GA4GH organization, which is fueled by the development efforts of eight Work Streams and informed by the needs of 24 Driver Projects and other key stakeholders. We present the GA4GH suite of secure, interoperable technical standards and policy frameworks and review the current status of standards, their relevance to key domains of research and clinical care, and future plans of GA4GH. Broad international participation in building, adopting, and deploying GA4GH standards and frameworks will catalyze an unprecedented effort in data sharing that will be critical to advancing genomic medicine and ensuring that all populations can access its benefits.
Year: 2021 PMID: 35072136 PMCID: PMC8774288 DOI: 10.1016/j.xgen.2021.100029
Source DB: PubMed Journal: Cell Genom ISSN: 2666-979X
GA4GH toolkit
| Relevant standards | URL | Type | Target user | Purpose |
|---|---|---|---|---|
| Beacon API | | API | data custodians, researchers (via research infrastructures), identity provider services | The Beacon protocol defines an open standard for genomics data discovery. It provides a framework for public web services responding to queries against genomic data collections, for instance from population-based or disease-specific genome repositories. Beacon is designed to (1) focus on robustness and easy implementation, (2) be maintained by individual organizations and assembled into a federated network, (3) be general-purpose and able to be used to report on any variant collection, (4) provide a boolean (or quantitative) answer about the observation of a variant, and (5) protect privacy, with queries not returning information about single individuals. A new version of the API will include support for more granular control based on a user’s identity authorization and will enable discovery of cohorts, cases (patients), biological samples, and genomic variants and associated knowledge. More details can be found on the Beacon Project website. |
| Data Connect | | API | data custodians, researchers, and API and tool developers | Data Connect is a specification for discovery and search of biomedical data, which provides a mechanism for describing data and its data model, and for searching data within the given data model. |
| Data Use Ontology | | Data Model / Ontology | data custodians, researchers, DACs | The Data Use Ontology (DUO) is a hierarchical vocabulary of terms describing data use permissions and modifiers, in particular for research data in the health/clinical/biomedical domain. The GA4GH DUO standard allows large genomics and health data repositories to consistently annotate their datasets, ensuring a shared, machine-readable representation of data access conditions and making them automatically discoverable based on a researcher’s authorization level or intended use. Resources using DUO include Broad’s FireCloud Data Library, Broad’s DUOS (Data Use Oversight System) Data Catalog, and the European Genome-Phenome Archive. |
| GA4GH Passports | | API / Data Model | data custodians, researchers, DACs, clinicians, API and tool developers | The GA4GH Passport specification aims to support data access policies within current and evolving data access governance systems. This specification defines Passports and Passport Visas as the standard way of communicating a user’s data access authorizations based on their role (e.g., researcher), affiliation, or access status. Passport Visas from trusted organizations can therefore express data access authorizations that require either a registration process (for the Registered Access data access model |
| Service Info | | API | API and tool developers | Service discovery is at the root of any computational workflow using web-based APIs. Traditionally, this is hard-coded into workflows, and discovery is a manual process. Service Info provides a way for an API to expose a set of metadata to help discovery and aggregation of services via computational methods. It also allows a server/implementation to describe its capabilities and limitations. Service Info is described in a GA4GH OpenAPI specification, which can be visualized using Swagger Editor ( |
| Service Registry | | API | API and tool developers | Service Registry is a GA4GH service providing information about other GA4GH services, primarily for the purpose of organizing services into networks or groups and service discovery across organizational boundaries. Information about the individual services in the registry is described in the complementary Service Info specification (see above). The Service Registry specification is useful when dealing with technologies that handle multiple GA4GH services. Common use cases include creating networks or groups of services of a certain type (e.g., Beacon Network searches networks of Beacon services across multiple organizations, a workflow can be executed by a specific group of Workflow Execution Services, or Data Connect search on biomedical data is federated across a set of nodes), or a certain host (e.g., an organization provides implementations of Beacon, Data Connect, and Data Repository Service APIs, or a server hosts an implementation of refget and htsget APIs). |
| htsget | | API | API and tool developers, researchers | htsget is a data retrieval API that bridges from existing genomics file formats to a client/server model with the following features: (1) incumbent data formats (BAM, CRAM, VCF) are preferred initially, with a future path to others; (2) multiple server implementations are supported, including those that do format transcoding on the fly and those that return essentially unaltered filesystem data; and (3) multiple use cases are supported, including access to small subsets of genomic data (e.g., for browsing a given region) and to full genomes (e.g., for calling variants). |
| refget | | API | API and tool developers, researchers | Refget ( |
| Task Execution Service (TES) | | API | API and tool developers, researchers, academic institutions | The Task Execution Service (TES) API is a standardized schema and API for describing and executing batch execution tasks. A task defines a set of input files, a set of containers and commands to run, a set of output files, and some additional logging and metadata. TES servers accept task documents and execute them asynchronously on available compute resources. A TES server could be built on top of a traditional HPC queuing system, such as Grid Engine or Slurm, or on cloud-style compute systems such as AWS Batch or Kubernetes. |
| Tool Registry Service (TRS) | | API | API and tool developers, researchers, academic institutions | The GA4GH Tool Registry Service (TRS) API aims to provide a standardized way to describe the availability of tools and workflows. In this way, multiple repositories that share Docker-based tools and workflows (based on Common Workflow Language [CWL], Workflow Description Language [WDL], Nextflow, or Galaxy) can consistently interact, search, and retrieve information from one another. The end goal is to make it much easier to share scientific tools and workflows, enhancing our ability to make research reproducible, shareable, and transparent. |
| Workflow Execution Service (WES) | | API | API and tool developers, researchers, academic institutions | The Workflow Execution Service (WES) API describes a standard programmatic way to run and manage workflows. Having this standard API supported by multiple execution engines will let people run the same workflow using various execution platforms running on various clouds/environments. Key features include: (1) ability to request a workflow run using CWL or WDL; (2) ability to parameterize that workflow using a JSON schema; and (3) ability to get information about running workflows. |
| Authentication & Authorisation Infrastructure (AAI) | | Guide | API and tool developers | The GA4GH Authentication & Authorisation Infrastructure (AAI) specification profiles the OpenID Connect (OIDC) protocol to provide a federated (multilateral) authentication and authorization infrastructure for greater interoperability between genomics institutions in a manner specifically applicable to (but not limited to) the sharing of restricted datasets. |
| Cloud Security and Privacy Policy v1.0 | | Guide | anyone handling sensitive data in a cloud infrastructure | An increasing number of GA4GH projects rely on Cloud services to pursue their goals, and the GA4GH Cloud Work Stream (CWS) is working on several products to help the GA4GH community take full advantage of the Cloud paradigm. However, the use of the Cloud poses significant security and privacy challenges that need to be carefully evaluated and addressed. The purpose of the Cloud Security and Privacy Policy is to outline a common security technology framework that can be used to systematically assess the products developed by the CWS from a security perspective. Product developers and reviewers can leverage the information contained herein to identify requirements, threats, and countermeasures related to the products they are working on, thus facilitating the production of secure standards. |
| CRAM | | File Format | API and tool developers, researchers | The CRAM file format holds DNA sequencing records. It has the following major objectives: (1) significantly better lossless compression than BAM, (2) simple and lossless transformations to and from BAM files, and (3) support for controlled loss of data. |
| Crypt4GH | | File Format | API and tool developers, data generators, researchers, clinicians, data custodians | By its nature, genomic data can include information of a confidential nature about the health of individuals. It is important that such information is not accidentally disclosed. One part of the defense against such disclosure is to, as much as possible, keep the data in an encrypted format. The Crypt4GH specification describes a file format that can be used to store data in an encrypted state. Existing applications can, with minimal modification, read and write data in the encrypted format. The choice of encryption also allows the encrypted data to be read starting from any location, facilitating indexed access to files. The format has the following properties. Confidentiality: Data stored in the file are readable only by holders of the correct secret decryption key. The format does not hide the length of the encrypted file, although it is possible to pad some file structures to obscure the length. Integrity: Data are stored in a series of 64 kilobyte blocks, each of which includes a message authentication code (MAC). Attempts to change the data in a block will make the MAC invalid; it is not possible to recalculate the MAC without knowing the key used to encrypt the file. The format only protects the contents of each individual block. It does not protect against insertion, removal, or reordering of entire blocks. Authentication: The format does not provide any way of authenticating files. |
| Data Repository Service (DRS) | | API | API and tool developers, researchers, academic institutions | The Data Repository Service (DRS) API provides a generic interface to data repositories so data consumers, including workflow systems, can access data objects in a single, standard way regardless of where they are stored and how they are managed. The primary functionality of DRS is to map a logical ID to a means for physically retrieving the data represented by the ID. The DRS specification describes the characteristics of those IDs, the types of data supported, how they can be pointed to using URIs, and how clients can use these URIs to ultimately make successful DRS API requests. The specification also describes the DRS API in detail and provides information on the specific endpoints, request formats, and responses. This specification is intended for developers of DRS-compatible services and of clients that will call these DRS services. |
| Data Security Infrastructure Policy (DSIP) | | Policy Framework | data protection authorities | The Data Security Infrastructure Policy (DSIP) describes the data security infrastructure recommended for stakeholders in the GA4GH community. It is not meant to be a normative document, but rather a set of recommendations and best practices to enable a secure data sharing and processing ecosystem. However, it does not claim to be exhaustive, and additional precautions other than the ones collected in the policy might have to be taken to be compliant with national/regional legislation. As a living document, the DSIP will be revised and updated over time, in response to changes in the GA4GH Privacy and Security Policy, and as technology and biomedical science continue to advance. |
| Machine Readable Consent Guidance (MRCG) v1.0 | | Guide | researchers, institutional review boards/research ethics committees (international and national), research ethics policy makers, data generators, funding agencies | The Machine Readable Consent Guidance (MRCG) provides standardized consent clauses and supporting information to enable the development of consent forms that map unambiguously to the GA4GH Data Use Ontology (DUO). Integrating DUO into consent forms thereby facilitates data discovery and data access requests and approvals, maximizing data sharing, integration, and re-use while respecting the autonomy of data subjects. MRCG implementations include the Broad Data Use Oversight System (DUOS) |
| Pedigree V1 | | Data Model / Ontology | clinicians, researchers, API and tool developers, data generators, EHR vendors | Family health history is an important aspect in both genomic research and patient care. The GA4GH pedigree standard is an object-oriented graph-based model to represent family health history and pedigree information. It is intended to fit within the structure of other standards like HL7 FHIR and Phenopackets and enable the computable exchange of family health history as well as representation of larger, more complex families. Computable representation of family structure will allow patients, physicians, and researchers to share this information more easily between healthcare systems and help software tools use this information to improve genomic analysis and diagnosis. The draft model can be found on GitHub along with a Family History Relations Ontology and draft FHIR implementation guide. A draft recommendation for a minimal dataset of family health history ( |
| Phenopackets | | Data Model / Ontology | data generators, data custodians, researchers, clinicians, API and tool developers | The Phenopacket specification is an open machine-readable schema that supports the global exchange of disease and phenotype information to improve our ability to diagnose and conduct research on all types of diseases, including cancer and rare disease. A Phenopacket links detailed phenotypic descriptions with disease, patient, and genetic information, enabling clinicians, biologists, and disease and drug researchers to build more complete models of disease. Version 2 of the standard, released in June 2021, expands on the previous version to include better representation of the time course of disease, treatment, and COVID-19 and cancer-related data. The schema, as well as source code in Java, C++, and Python, is available from the phenopacket-schema GitHub repository. |
| RNAget | | API | data generators, data custodians, researchers, tool developers | The RNAget API describes a common set of endpoints for search and retrieval of processed RNA data. This currently includes feature-level expression data from RNA-seq type assays and signal data over a range of bases from ChIP-seq, methylation, or similar epigenetic experiments. |
| SAM and BAM | | File Format | researchers | SAM, or Sequence Alignment/Map format, is a format for storing primary DNA sequencing records. These are typically aligned and sorted by genomic coordinate, but unaligned data can also be represented. SAM is a TAB-delimited text format consisting of a header meta-data section and an alignment section. The BAM format is a binary serialization of SAM for more efficient access. SAM and BAM support full random access, selected by genomic region. The SAMtags document defines the optional per-record annotations. These are also used by the CRAM specification. |
| Variant Annotation | | Data Model / Modeling Framework | API and tool developers | A variant annotation is a structured data object that holds a central piece of knowledge about a genetic variation, along with metadata supporting its interpretation and use. A given variant annotation may describe knowledge about its molecular consequence, functional impact on gene function, population frequency, pathogenicity for a given disease, or impact on therapeutic response to a particular treatment. The GA4GH VA-Specification will define an extensible data model for representing and exchanging these and other diverse kinds of variant annotations. It will provide machine-readable messaging specifications to support sharing and validation of data through APIs and other exchange mechanisms. It will also provide a formal framework for defining custom extensions to the core model, allowing community-driven development of VA-based data models for new data types and use cases. A more detailed description of these components can be found online. |
| Variation Representation | | Data Model & terminology | data generators, API and tool developers, data custodians | Maximizing the personal, public, research, and clinical value of genomic information will require that clinicians, researchers, and testing laboratories exchange genetic variation data reliably. The Variation Representation Specification (VRS, pronounced “verse”), written by a partnership among national information resource providers, major public initiatives, and diagnostic testing laboratories, is an open specification to standardize the exchange of variation data. |
| VCF/BCF | | File Format | researchers | The variant call format (VCF) is a generic format for storing DNA polymorphism data such as single nucleotide polymorphisms (SNPs), insertions, deletions, and structural variants, together with rich annotations. VCF may hold data for multiple samples within the same file. The specification contains the header meta-data fields, a series of mandatory columns describing the variants, and details of the optional annotations, which are either per-site or per-sample. VCF and its binary counterpart, BCF, are usually stored in compressed form and can be indexed for fast retrieval of variants from a range of positions on the reference genome. |
The GA4GH Toolkit outlines a suite of secure standards and frameworks that will enable more meaningful research and patient data harmonization and sharing. This suite addresses a variety of challenges across the data sharing life cycle and is applicable across the world’s accessible medical and patient-centered systems, knowledgebases, and raw data sources. All standards are subject to the GA4GH Copyright Policy (https://www.ga4gh.org/wp-content/uploads/GA4GH-Copyright-Policy-Updated-Formatting.pdf) and should be made available under an open source license such as the Apache 2.0 license for software.
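The integrity property described in the Crypt4GH row (a MAC on each 64 kilobyte block, with no protection against reordering of whole blocks) can be illustrated with a small sketch. This is not the Crypt4GH format: it performs no encryption and substitutes HMAC-SHA256 from the Python standard library for the authenticated cipher a real implementation would use. The function names `seal_blocks`, `verify_block`, and `read_block` are hypothetical and exist only for this illustration.

```python
import hashlib
import hmac

# Assumption: "64 kilobyte" is taken to mean 64 * 1024 bytes of data per block.
BLOCK_SIZE = 64 * 1024
# Length of the stand-in MAC (HMAC-SHA256), NOT the MAC used by Crypt4GH itself.
MAC_SIZE = hashlib.sha256().digest_size


def seal_blocks(data: bytes, key: bytes) -> list[bytes]:
    """Split data into fixed-size blocks and append a MAC to each block."""
    sealed = []
    for start in range(0, len(data), BLOCK_SIZE):
        block = data[start:start + BLOCK_SIZE]
        tag = hmac.new(key, block, hashlib.sha256).digest()
        sealed.append(block + tag)
    return sealed


def verify_block(sealed_block: bytes, key: bytes) -> bool:
    """Recompute the MAC over the block contents and compare in constant time."""
    block, tag = sealed_block[:-MAC_SIZE], sealed_block[-MAC_SIZE:]
    expected = hmac.new(key, block, hashlib.sha256).digest()
    return hmac.compare_digest(tag, expected)


def read_block(sealed: list[bytes], index: int, key: bytes) -> bytes:
    """Random access: verify and return one block without reading the others."""
    if not verify_block(sealed[index], key):
        raise ValueError(f"block {index} failed its integrity check")
    return sealed[index][:-MAC_SIZE]


if __name__ == "__main__":
    key = b"k" * 32
    sealed = seal_blocks(bytes(range(256)) * 600, key)

    # Changing a single byte inside a block invalidates that block's MAC.
    tampered = bytearray(sealed[1])
    tampered[0] ^= 1
    print("tampered block verifies:", verify_block(bytes(tampered), key))

    # Swapping two intact blocks goes unnoticed, since each MAC covers one block only.
    swapped = [sealed[1], sealed[0], sealed[2]]
    print("swapped blocks verify:", all(verify_block(b, key) for b in swapped))
```

Because each tag covers only its own block, any block can be checked and read independently, which is what makes indexed access possible. The same design choice is why reordering or dropping whole blocks is not detected at this layer.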
GA4GH Driver Projects
| Driver Project | URL | Location | Thematic area* | Current size | Data type(s) collected | Data hosting model(s) | Data access model(s) | Implementations / deployments of GA4GH standards |
|---|---|---|---|---|---|---|---|---|
| All of Us Research Program | | US | RD, Ca, CT | 100k whole-genome sequences (planning for 1 million) | WGS, WES | centralized | cloud | CRAM, DRS (forthcoming), htsget (forthcoming), Passports (forthcoming), TRS (forthcoming), WES (forthcoming) |
| Australian Genomics | | Australia | RD, Ca, CT | 13,500 whole-genome sequences across all pilots | WGS, WES, panels, phenotype | centralized | cloud | Beacon V1, CRAM, Crypt4GH, DRS (forthcoming), DUO, htsget, MRCG (forthcoming), Passports (forthcoming), refget |
| Autism Sharing Initiative | | international | CT | 11,316 whole-genome sequences (estimating 15k by 2025) | WGS | distributed | federated analysis | AAI (forthcoming), Beacon V1 (forthcoming), CRAM (forthcoming), Data Connect, DRS (forthcoming), DUO (forthcoming), Passports (forthcoming), Service Registry / Info, TRS (forthcoming), WES (forthcoming) |
| BRCA Exchange | | international | RD, Ca | 66,657 variants | genetic variant pathogenicity assertions and supporting evidence | centralized | public | Beacon V1, VA (forthcoming), VRS, WES (forthcoming) |
| CanDIG | | Canada | RD, Ca, CT, Bio | 1,700 data records | WGS tumor/normal and whole transcriptome for cancer; WGS for COVID; clinical phenotype | distributed | federated analysis | Beacon V1, CRAM, DRS, DUO, htsget, Phenopackets, refget (forthcoming), RNAget, Service Registry / Info (forthcoming), VRS (forthcoming), WES (forthcoming) |
| ClinGen | | US | RD | 2,077 unique genes with at least one curation and 2,417 unique variants with at least one curation | genetic and experimental evidence | centralized | public | VA (forthcoming), VRS |
| ELIXIR | | Europe | RD, Ca, CT, Bio | 23 national nodes hold a variety of data types and run multiple services, some listed within this table (e.g., EGA). For a list of ELIXIR Core Data Resources, see | | distributed | download (also exploring Cloud) | AAI, Beacon V1, Crypt4GH, DRS, DUO, htsget, Passports, Phenopackets, refget, RNAget, Service Registry / Info, TES, TRS, WES |
| ENA / EVA / EGA | | Europe | RD, Ca, CT, Bio | EGA: 700k data records | EGA: WGS, WES, RNA-seq, epigenetics, genotyping, transcriptome, single-cell seq, healthy and disease cohorts | distributed | download (also exploring Distributed Cloud) | Crypt4GH, htsget, AAI, Passports, DUO |
| EpiShare | | international | Bio | ~2,800 data records | FASTQ, CRAM/BAM, bigwig, bigbed for epigenomics experiments | distributed | federated analysis | CRAM (forthcoming), DRS, DUO, htsget (forthcoming), Phenopackets, RNAget, Service Registry / Info, WES |
| EUCANCan | | international | Ca | data from 35 different sources including human, model, and non-model organisms | whole-genome, whole-exome, and whole-transcriptome sequence data | distributed | Cloud and federated analysis | AAI (forthcoming), Beacon V1 (forthcoming), CRAM (forthcoming), Data Connect (forthcoming), DRS (forthcoming), Passports (forthcoming), Phenopackets (forthcoming), Service Registry / Info (forthcoming), TES (forthcoming), TRS (forthcoming), VRS (forthcoming), WES (forthcoming) |
| European Joint Programme on Rare Disease (EJP RD) | | Europe | RD | >130,000 data records across several resources hosting genomic human data, mainly the EGA, DECIPHER and the RD-Connect Genome-Phenome Analysis Platform | a mix of WGS, WES, plausibly pathogenic variants and phenotypic information | distributed across centralized resources | download and Cloud analysis | AAI (forthcoming), Beacon V1, CRAM, Crypt4GH, DRS (forthcoming), DUO, htsget, Passports, Phenopackets, Service Registry / Info, TES, TRS, WES |
| GEnome Medical Alliance Japan (GEM Japan) | | Japan | RD, Ca, CT | 24k WGS (aiming for 100k) | whole-genome sequencing, whole-exome sequencing, gene expression, panels, phenotypic | centralized | download (also exploring Cloud) | Beacon V1 (forthcoming), CRAM, DUO, Phenopackets (forthcoming) |
| Genomics England | | UK | RD, Ca, CT | 136K WGS (estimating 450K WGS by 2024) | WGS | centralized | Cloud | AAI (forthcoming), CRAM, DRS (forthcoming), DUO (forthcoming), htsget, Passports (forthcoming), WES (forthcoming) |
| Human Cell Atlas | | international | RD, Ca, CT, Bio | 1,300 donors | single-cell sequencing | centralized | public and Cloud | AAI, DRS, DUO (forthcoming), Passports (forthcoming), TES, TRS, WES |
| Human Heredity and Health in Africa (H3Africa) | | Africa | CT, Bio | 75,000 participants (across all projects) | whole-genome sequencing, whole-exome sequencing, gene expression, microbiome, imaging, phenotypic, environmental/lifestyle | centralized | download | AAI (forthcoming), Beacon V1, CRAM, Crypt4GH, Data Connect (forthcoming), DUO, Passports (forthcoming), Phenopackets (forthcoming), VRS (forthcoming) |
| International Cancer Genome Consortium (ICGC) Accelerating Research in Genomic Oncology (ARGO) | | international | Ca | 100k Genomes | WGS, WES, RNA-Seq, phenotype | distributed | Cloud and federated analysis | AAI (forthcoming), Beacon V1, CRAM, Passports (forthcoming), TRS, WES |
| Matchmaker Exchange | | international | RD | >109K cases | WGS, WES | distributed | federated analysis | AAI (forthcoming), Beacon V1, CRAM, htsget, Phenopackets |
| Monarch Initiative | | international | RD, Ca, CT, Bio | N/A | gene, genotype, variant, disease, and phenotype data across many species in the tree of life, from over 30 data sources | centralized | public cloud | DUO (forthcoming), Passports (forthcoming), Phenopackets, VRS |
| National Cancer Institute Cancer Research Data Commons (NCI CRDC) | | US | Ca | ~100,000 data records (includes GDC) | whole-genome sequencing, whole-exome sequencing, gene expression, panels, phenotypic, biospecimen, imaging, proteomics | centralized | Cloud and federated analysis | CRAM, DRS, DUO (forthcoming), Passports (forthcoming), Service Registry / Info, WES |
| National Cancer Institute Genomic Data Commons (NCI GDC) | | US | Ca | 83,700 cases | WGS, WXS, panel, RNA-seq, miRNA-seq, methylation array, genotyping array, diagnosis slides, tissue slides, ATAC-seq, scRNA-seq. Also clinical (phenotypic) and biospecimen information | centralized | download and Cloud | AAI (forthcoming), CRAM (forthcoming), DRS (forthcoming), DUO (forthcoming), Passports (forthcoming), Phenopackets (forthcoming), TES (forthcoming), TRS (forthcoming), VRS (forthcoming), WES (forthcoming) |
| Swiss Personalized Health Network (SPHN) | | Switzerland | RD, Ca, CT, Bio | 24 health data projects across Switzerland | clinical phenotypic, clinical routine, omics (genomic, transcriptomic, proteomic, etc.), cohort, and imaging data and expert variant curation | distributed | federated analysis | Beacon V1, DRS (forthcoming), htsget (forthcoming), Phenopackets, TES (forthcoming), WES (forthcoming) |
| Trans-Omics for Precision Medicine (TOPMed) | | US | RD, Ca, CT, Bio | 180k whole-genome sequences (233k by 2025), 96k panels | WGS, RNA-seq, metabolome, methylome (MethylationEPIC ‘850K’), proteome (SomaScan and Olink), longitudinal epidemiology studies, disease studies, environmental/lifestyle, imaging | centralized | cloud | AAI (forthcoming), CRAM, DRS, DUO, Passports (forthcoming), Service Registry / Info (forthcoming), TRS, WES |
| Variant Interpretation for Cancer Consortium (VICC) | | international | Ca | 24,366 evidence items | genetic and experimental evidence | centralized | public | Beacon V1, Service Registry / Info, VA (forthcoming), VRS |
GA4GH Driver Projects are external genomic data initiatives that have committed both to contributing to the development of genomic data sharing standards and to piloting their use in real-world practice. *Thematic area abbreviations: RD, rare disease; Ca, cancer; CT, complex traits; Bio, basic biology.
Figure 1. Matrix structure of the Global Alliance for Genomics and Health
GA4GH is a community of diverse stakeholders from Driver Projects and other institutions working together in the context of Work Streams. Each GA4GH Driver Project is expected to dedicate two full-time equivalents across at least two GA4GH Work Streams. As foundational groups that review all GA4GH deliverables, the Regulatory and Ethics and Data Security Work Streams must have representation from every Driver Project. In addition to Driver Projects, any member of the community—regardless of domain, sector, nation, or affiliation—is invited to participate in any GA4GH Work Stream. Supplemental information includes details on how each of the 24 GA4GH Driver Projects intersects with the six technical Work Streams.