Piotr Pawliczek, Ronak Y Patel, Lillian R Ashmore, Andrew R Jackson, Chris Bizon, Tristan Nelson, Bradford Powell, Robert R Freimuth, Natasha Strande, Neethu Shah, Sameer Paithankar, Matt W Wright, Selina Dwight, Jimmy Zhen, Melissa Landrum, Peter McGarvey, Larry Babb, Sharon E Plon, Aleksandar Milosavljevic.
Abstract
Effective exchange of information about genetic variants is currently hampered by the lack of readily available globally unique variant identifiers that would enable aggregation of information from different sources. The ClinGen Allele Registry addresses this problem by providing (1) globally unique "canonical" variant identifiers (CAids) on demand, either individually or in large batches; (2) access to variant-identifying information in a searchable Registry; (3) links to allele-related records in many commonly used databases; and (4) services for adding links to information about registered variants in external sources. A core element of the Registry is a canonicalization service, implemented using an in-memory sequence alignment-based index, which groups variant identifiers denoting the same nucleotide variant and assigns unique and dereferenceable CAids. More than 650 million distinct variants are currently registered, including those from gnomAD, ExAC, dbSNP, and ClinVar, as well as a small number of variants registered by Registry users. The Registry is accessible both via a web interface and programmatically via well-documented Hypertext Transfer Protocol (HTTP) Representational State Transfer (REST) Application Programming Interfaces (APIs). For programmatic interoperability, the Registry content is accessible in the JavaScript Object Notation for Linked Data (JSON-LD) format. We present several use cases and demonstrate how the linked information may provide raw material for reasoning about a variant's pathogenicity.
Keywords: HGVS representation; linked data; pathogenicity of genetic variants; variant centric resources; variant identifiers
Year: 2018 PMID: 30311374 PMCID: PMC6519371 DOI: 10.1002/humu.23637
Source DB: PubMed Journal: Hum Mutat ISSN: 1059-7794 Impact factor: 4.878
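The abstract describes programmatic access over HTTP REST with JSON-LD responses. The following is a minimal sketch of what a client-side helper might look like. The base URL, the `/allele` path, the `hgvs` query parameter, and the shape of the sample response are all assumptions made for illustration and should be checked against the Registry's own API documentation; no network request is made.

```python
import json
from urllib.parse import urlencode

# Assumption: illustrative base URL and query parameter name, not confirmed by the text above.
REGISTRY_BASE = "https://reg.clinicalgenome.org"


def allele_query_url(hgvs: str, base: str = REGISTRY_BASE) -> str:
    """Build a lookup URL for one HGVS expression (percent-encoding the value)."""
    return f"{base}/allele?{urlencode({'hgvs': hgvs})}"


def caid_from_jsonld(document: str) -> str:
    """Extract the trailing identifier from the JSON-LD '@id' IRI of a record."""
    record = json.loads(document)
    return record["@id"].rstrip("/").rsplit("/", 1)[-1]


# Hypothetical, hand-written response body used only to exercise the parser.
sample_response = json.dumps({
    "@context": "https://example.org/context.jsonld",
    "@id": f"{REGISTRY_BASE}/allele/CA229394",
})

print(allele_query_url("NM_000277.1:c.1200delG"))
print(caid_from_jsonld(sample_response))
```

Because a CAid is dereferenceable, the `@id` of a record doubles as its address, which is why the sketch recovers the identifier from the IRI rather than from a separate field.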
Figure 1. Conceptual model of Allele Registry entities based on the Allele Model developed by the ClinGen Data Model Working Group
Figure 2. (a) Design and workflow of the ClinGen Allele Registry. (b) Screenshot of current core registry‐hosted links for a typical variant in the user interface
Types of variants within the Registry
| Reference and alternated sequences | Region of alteration | Alternate allele | Reference allele | |
|---|---|---|---|---|
Duplications are treated internally as a special type of insertion.
Inversions are treated internally as a special type of indel.
Variants involving insertion and/or deletion and their left‐ and right‐aligned representations
| Example variant | Left‐aligned | Right‐aligned |
|---|---|---|
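Left‐ and right‐aligned representations of a deletion can be illustrated with a short sketch. The code below shifts a deleted interval through a repeat as far as it will go in either direction; the reference sequence is a made-up toy string, and the functions are illustrative rather than the Registry's own alignment-based implementation.

```python
def right_align(ref: str, start: int, end: int) -> tuple[int, int]:
    """Shift the deleted interval ref[start:end] rightward while the result is unchanged."""
    while end < len(ref) and ref[start] == ref[end]:
        start, end = start + 1, end + 1
    return start, end


def left_align(ref: str, start: int, end: int) -> tuple[int, int]:
    """Shift the deleted interval ref[start:end] leftward while the result is unchanged."""
    while start > 0 and ref[start - 1] == ref[end - 1]:
        start, end = start - 1, end - 1
    return start, end


def apply_deletion(ref: str, start: int, end: int) -> str:
    """Return the sequence obtained by removing ref[start:end]."""
    return ref[:start] + ref[end:]


# Toy reference (hypothetical): a GTGA unit repeated twice between flanking bases.
REF = "AACGTGAGTGACTT"
deletion = (5, 9)  # one arbitrary placement of a 4-base deletion inside the repeat

left = left_align(REF, *deletion)
right = right_align(REF, *deletion)
print(left, right)
print(apply_deletion(REF, *left) == apply_deletion(REF, *right))
```

Every placement between the two extremes yields the same resulting sequence, which is why differently aligned descriptions need to be grouped under one identifier.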
Figure 3. Registry API services permit on‐demand linking of variant information from external sources. (i) The external source provides an RFC 6570 URI template for its API and, optionally, for its UI. (ii) The external source then associates one or more parameters with the CAids about which it has information via PUT requests to the Registry API. Bulk uploads of associations are also supported. These parameters are used to fill the templates, thereby creating the appropriate link. (iii) The Registry can now include links to these external sources in addition to its own core variant metadata. In the Allelic Epigenome case, because the API directly employs CAids, no parameter values need to be supplied when registering a link via PUT requests to the Registry. In contrast, if CIViC were to add links from Registry alleles to its data, two parameter values (p1, p2) would be registered for each CAid. Based on the CIViC templates shown, both parameter values are needed to construct the appropriate web page URL, whereas only one is needed to form the CIViC “api” URL
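The template-filling step in the caption can be sketched as simple (Level 1) RFC 6570 expansion, in which each `{name}` is replaced by the percent-encoded parameter value. The templates, host name, and parameter values below are hypothetical placeholders, not the actual CIViC or Registry templates, and the helper covers only the simplest form of the specification.

```python
import re
from urllib.parse import quote


def expand_template(template: str, params: dict) -> str:
    """Expand simple {name} expressions; undefined names expand to an empty string."""
    def substitute(match: re.Match) -> str:
        value = params.get(match.group(1))
        # Percent-encode everything outside the unreserved character set.
        return "" if value is None else quote(str(value), safe="")
    return re.sub(r"\{(\w+)\}", substitute, template)


# Hypothetical templates: the UI link needs both parameters, the API link only one.
UI_TEMPLATE = "https://example.org/events/{p1}/summary/variants/{p2}"
API_TEMPLATE = "https://example.org/api/variants/{p2}"

# Hypothetical parameter values registered for one CAid.
registered = {"CA229394": {"p1": 19, "p2": 33}}

for caid, values in registered.items():
    print(caid, expand_template(UI_TEMPLATE, values))
    print(caid, expand_template(API_TEMPLATE, values))
```

Storing only the parameters per CAid, rather than full URLs, means an external source can change its URL layout by updating one template instead of re-registering every link.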
Figure 4. Reference sequences currently supported by the Registry. The NM, NP, and NR prefixes denote known reference sequences, and XM, XP, and XR denote modeled reference sequences, from RefSeq (O'Leary et al., 2016). NC denotes chromosome sequences, whereas NW, NT, and NG denote various genomic scaffolds. LRG, LRGt, and LRGp are genomic, transcript, and protein sequences from the Locus Reference Genomic database (MacArthur et al., 2014). ENST and ENSP are transcript and amino acid sequences from ENSEMBL (Aken et al., 2016)
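The prefix conventions in this caption lend themselves to a small lookup. The sketch below classifies an accession, or the accession part of an HGVS expression, using only the categories named in the caption. The function name is hypothetical, and the assumed LRG identifier pattern (`LRG_<n>`, `LRG_<n>t<m>`, `LRG_<n>p<m>`) is not stated in the text.

```python
import re

# Categories taken from the caption: known vs. modeled RefSeq, chromosomes, scaffolds.
REFSEQ_PREFIXES = {
    "NM": "known RefSeq sequence", "NP": "known RefSeq sequence", "NR": "known RefSeq sequence",
    "XM": "modeled RefSeq sequence", "XP": "modeled RefSeq sequence", "XR": "modeled RefSeq sequence",
    "NC": "chromosome sequence",
    "NW": "genomic scaffold", "NT": "genomic scaffold", "NG": "genomic scaffold",
}
ENSEMBL_PREFIXES = {"ENST": "ENSEMBL transcript sequence", "ENSP": "ENSEMBL amino acid sequence"}
LRG_KINDS = {None: "LRG genomic sequence", "t": "LRG transcript sequence", "p": "LRG protein sequence"}


def classify_reference(identifier: str) -> str:
    """Return the caption's category for an accession or an HGVS expression."""
    accession = identifier.split(":", 1)[0]  # drop any ':c.' / ':g.' / ':p.' part
    refseq = re.match(r"([A-Z]{2})_", accession)
    if refseq and refseq.group(1) in REFSEQ_PREFIXES:
        return REFSEQ_PREFIXES[refseq.group(1)]
    lrg = re.fullmatch(r"LRG_\d+(?:([tp])\d+)?", accession)  # assumed identifier shape
    if lrg:
        return LRG_KINDS[lrg.group(1)]
    for prefix, label in ENSEMBL_PREFIXES.items():
        if accession.startswith(prefix):
            return label
    return "not among the listed reference types"


for example in ["NM_000277.1:c.1200delG", "NC_000010.11", "LRG_1t1", "ENST00000380152"]:
    print(example, "->", classify_reference(example))
```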
Resources preregistered and cross‐linked in the ClinGen Allele Registry
| Resource | Number of variants with link to source |
|---|---|
| ClinVar RCV | 475,034 |
| ClinVar Allele | 590,706 |
| ClinVar Variations | 348,882 |
| dbSNP | 338,830,568 |
| ExAC | 10,175,861 |
| gnomAD | 276,797,608 |
| myvariant.info (hg19) | 339,605,025 |
| myvariant.info (hg38) | 231,910,513 |
| COSMIC | 20,581,973 |
Figure 5. Query and registration functions accessible via the Registry web interface. (a) Example of an HGVS‐based search from the Registry landing page (left) and a typical page presented to the user when the variant is not registered. For logged‐in users, one click on “Get Identifier” provides the canonical allele identifier. (b) Search interface for fuzzy queries where the exact transcript for which the variation is defined is not known (left). Results of example queries are shown on the right
Different representations produced by different software for ground truth alleles and corresponding canonical allele identifiers
| HGVS expressions | CAids |
|---|---|
| NM_000277.1:c.1200-1delG; NM_000277.1:c.1200delG | CA229394 |
| NM_017739.3:c.1895+1_1895+4delGTGA; NM_017739.3:c.1895+5_1895+8delGTGA | CA263965 |
| NM_005228.3:c.2284-6delCinsCTCCAGGAAGCCT; NM_005228.3:c.2284-5_2290dupTCCAGGAAGCCT | CA135833 |
Summary of time required to query and number of duplicate variants identified in key variant centric resources
| Source | Number of variants | Number of variants used for checking duplicates | Number of variants processed by Registry | Number of duplicates | Time for processing |
|---|---|---|---|---|---|
| dbSNP | 339,334,552 | 19,964,466 (indels) | 19,953,620 | 1,775,058 | ∼15 min |
| MyVariant.Info | 412,996,966 | 412,996,966 | 412,965,634 | 134,881 | ∼90 min |
| ClinVar | 302,036 | 302,036 | 302,024 | 0 | 40 s |
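The figures in this table can be turned into rough throughput and duplicate rates. The sketch below uses the "processed by Registry" and "duplicates" columns as given; the times are the approximate values from the table, so the resulting rates are order-of-magnitude estimates only, and the function name is illustrative.

```python
# (source, variants processed by the Registry, duplicates found, approximate seconds)
ROWS = [
    ("dbSNP", 19_953_620, 1_775_058, 15 * 60),
    ("MyVariant.Info", 412_965_634, 134_881, 90 * 60),
    ("ClinVar", 302_024, 0, 40),
]


def throughput_summary(rows):
    """Compute variants per second and the fraction of processed variants flagged as duplicates."""
    summary = {}
    for source, processed, duplicates, seconds in rows:
        summary[source] = {
            "variants_per_second": processed / seconds,
            "duplicate_fraction": duplicates / processed,
        }
    return summary


for source, stats in throughput_summary(ROWS).items():
    print(f"{source}: {stats['variants_per_second']:,.0f} variants/s, "
          f"{stats['duplicate_fraction']:.4%} duplicates")
```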
Figure 6. Adoption of canonical allele identifiers by variant‐centric resources. (a) ClinGen variant and gene curation interface, (b) CIViC, and (c) ClinVar. Other systems that use Allele Registry identifiers (including the ClinGen Pathogenicity Calculator and the Database of pathogenic variants at Keio University) are not shown for brevity
Comparison of assertions in ClinVar for variants that result in identical amino acid change
| | Benign | Uncertain significance | Pathogenic |
|---|---|---|---|
| Benign | 70 | – | – |
| Uncertain significance | 31 | 140 | – |
| Pathogenic | 0 | 34 | 571 |
Note: For simplicity, likely pathogenic and pathogenic variants were combined, as were likely benign and benign variants. The full list of variants and assertions is provided in Supporting Information Table S1.
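As a worked reading of this comparison, the sketch below totals the lower-triangular counts and separates agreeing cells (the diagonal) from disagreeing ones. Treating each count as one comparison between variants sharing an amino acid change is an assumption about how the table was built; the numbers themselves are copied from the table.

```python
# Counts keyed by (row category, column category), copied from the table.
COUNTS = {
    ("Benign", "Benign"): 70,
    ("Uncertain significance", "Benign"): 31,
    ("Uncertain significance", "Uncertain significance"): 140,
    ("Pathogenic", "Benign"): 0,
    ("Pathogenic", "Uncertain significance"): 34,
    ("Pathogenic", "Pathogenic"): 571,
}


def concordance_summary(counts):
    """Split the total into concordant (same category) and discordant comparisons."""
    total = sum(counts.values())
    concordant = sum(n for (row, col), n in counts.items() if row == col)
    return {
        "total": total,
        "concordant": concordant,
        "discordant": total - concordant,
        "concordance": concordant / total,
    }


result = concordance_summary(COUNTS)
print(result)
```

Under that reading, 781 of 846 comparisons agree, all 65 disagreements involve an uncertain-significance assertion, and no benign assertion is paired with a pathogenic one.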