Jörg Hakenberg, Wei-Yi Cheng, Philippe Thomas, Ying-Chih Wang, Andrew V Uzilov, Rong Chen.
Abstract
BACKGROUND: Data from a plethora of high-throughput sequencing studies are readily available to researchers, providing genetic variants detected in a variety of healthy and disease populations. While each individual cohort helps gain insights into polymorphic and disease-associated variants, a joint perspective can be more powerful in identifying polymorphisms, rare variants, disease-associations, genetic burden, somatic variants, and disease mechanisms.
DESCRIPTION: We have set up a Reference Variant Store (RVS) containing variants observed in a number of large-scale sequencing efforts, such as 1000 Genomes, ExAC, Scripps Wellderly, UK10K; various genotyping studies; and disease association databases. RVS holds extensive annotations pertaining to affected genes, functional impacts, disease associations, and population frequencies. RVS currently stores 400 million distinct variants observed in more than 80,000 human samples.
Year: 2016 PMID: 26746786 PMCID: PMC4706706 DOI: 10.1186/s12859-015-0865-9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Number of variants imported from various external resources
| Study | Variant sites | Variants | Unique to study | Variants passed | Samples |
|---|---|---|---|---|---|
| 1000 Genomes | 81,195,126 | 81,693,252 | 57,400,612 | all | 2,504 |
| ESP6500 | 1,982,177 | 1,998,204 | 184,225 | all | 6,503 |
| UK10K | 37,258,978 | 37,560,436 | 6,155,493 | all | 2,432 |
| UK10K with disease c | 9,391,582 | 11,177,227 | 8,847,466 | 9,969,036 | 4,888 |
| TCGA | 200,691,728 | 219,533,884 | 90,884,769 | n/a | 4,224 |
| TCGA somatic | 876,970 | 890,172 | 696,754 | all | 4,205 |
| Scripps Wellderly | 76,144,271 | 91,947,469 | 63,331,143 | 53,303,437 | 534 |
| ExAC b | 9,579,712 | 10,450,724 | 6,581,946 | 8,811,372 | 63,352 |
| MSSM BioBank genotyping | 849,806 | 849,806 | 0 | all | 11,210 |
| In-house resequencing study | 29,326,393 | 29,671,729 | 10,134,258 | 23,610,572 | 142 |
| Total observed | 358,152,122 | 399,404,510 | 244,216,666 | >217,796,115 | 82,558 b |
| Other resources: | | | | | |
| dbNSFP a | 30,523,109 | 89,617,785 | 73,561,239 | — | — |
| ClinVar | 101,317 | 104,455 | 31,694 | — | — |
| OMIM | 10,863 | 10,913 | — | — | — |
| COSMIC | 1,483,983 | 1,525,243 | — | — | — |
| PharmGKB c | 672 | 684 | — | — | — |
| SwissVar d | (77,047) | (84,649) | (34,198) | — | — |
| HGMD c | 125,744 | 133,464 | 32,178 | — | — |
| Literature mining | — | 890,665 | — | — | — |
| Total observed + other | 388,902,292 | 472,965,749 | 317,841,777 | >217,796,115 | 82,558 |
The first block refers to sequencing/genotyping studies, the second to sample-independent annotation databases. “Unique to study” counts variants that were observed only in that particular study. “Variants passed” counts variants that passed the quality metrics defined by the particular study in at least one sample; n/a: individual sample quality metrics not available. Totals exclude duplicates seen in different studies. Variants in annotation databases are included only if they can be mapped to precise coordinates and alleles. Because a large proportion of the variants discovered by literature mining are given at the protein level only, they were not compared to other studies.
a: dbNSFP contains hypothetical variants, see text
b: ExAC includes samples from 1000 Genomes, ESP6500, and TCGA
c: Data from HGMD, PharmGKB, UK10K diseases, and TCGA germline are not visible to external users on the RVS website
d: Counts for SwissVar refer to distinct amino acid changes. Further details on individual resources are provided in Additional file 4: Table S3
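The “Unique to study” column and the deduplicated totals described in the table legend can be illustrated with a minimal Python sketch. It assumes each variant is keyed by a (chromosome, position, reference allele, alternate allele) tuple; the study names, the toy data, and the `summarize` helper are hypothetical and are not taken from RVS itself.

```python
from collections import Counter


def summarize(studies):
    """studies: dict mapping study name -> set of (chrom, pos, ref, alt) keys."""
    # Count in how many studies each distinct variant was observed.
    seen_in = Counter(v for variants in studies.values() for v in variants)
    # A variant is unique to a study if no other study reports it.
    unique_to = {
        name: sum(1 for v in variants if seen_in[v] == 1)
        for name, variants in studies.items()
    }
    # The total excludes duplicates seen in different studies.
    total_distinct = len(seen_in)
    return unique_to, total_distinct


# Hypothetical toy data, for illustration only.
toy = {
    "study_A": {("1", 100, "A", "G"), ("1", 200, "C", "T")},
    "study_B": {("1", 200, "C", "T"), ("2", 50, "G", "GA")},
    "study_C": {("1", 200, "C", "T")},
}
unique_to, total = summarize(toy)
```

In this toy input, `study_C` contributes no unique variants, which mirrors how a genotyping study can show 0 in the “Unique to study” column when all of its sites are also seen elsewhere.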
Fig. 1 RVS architecture and workflow. All new variant data in VCF format are loaded into a staging area, where novel variants are registered with RVS. Novel variants are exported to the compute cluster for annotation with snpEff and other tools. Data are then imported back into the production tables of RVS. Large studies also trigger the upload of (sub)population frequencies. Variants in RVS are assigned to each new or updated source, allowing multiple sources per variant.
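The staging workflow in the caption above can be sketched in a few lines of Python. This is a simplified in-memory model under stated assumptions, not the RVS implementation: the `VariantStore` class, its method names, and the `fake_annotate` callback (standing in for the cluster annotation step) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class VariantStore:
    # production: variant key -> {"annotation": ..., "sources": set of study names}
    production: dict = field(default_factory=dict)
    # staging: variant key -> set of study names, for variants not yet annotated
    staging: dict = field(default_factory=dict)

    def stage(self, variant_keys, source):
        """Register incoming variants; only novel ones enter the staging area."""
        for key in variant_keys:
            if key in self.production:
                # Known variant: just record the additional source.
                self.production[key]["sources"].add(source)
            else:
                self.staging.setdefault(key, set()).add(source)

    def annotate_and_promote(self, annotate):
        """Annotate staged variants and copy them to the production table."""
        for key, sources in self.staging.items():
            self.production[key] = {
                "annotation": annotate(key),
                "sources": set(sources),
            }
        promoted = len(self.staging)
        self.staging.clear()
        return promoted


def fake_annotate(key):
    # Placeholder for the external annotation step (hypothetical).
    return {"effect": "unknown"}


k1, k2, k3 = ("1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 50, "G", "GA")
store = VariantStore()
store.stage([k1, k2], "cohort_1")
n1 = store.annotate_and_promote(fake_annotate)
store.stage([k2, k3], "cohort_2")  # k2 is already known, only k3 is novel
n2 = store.annotate_and_promote(fake_annotate)
```

Keeping sources as a set per variant is one simple way to allow multiple sources per variant without re-annotating a variant each time a new study reports it.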
Major tables in the Reference Variant Store that hold all imported variants and annotations
| Table | Description |
|---|---|
| Summary | main table that stores each variant by chromosomal location, reference and alternate allele, dbSNP, and GRCh36/38 locations; most other tables are dependent tables |
| Impact | effect(s) on gene, transcript, intron/exon, missense/nonsense, CDS and amino acid change, where applicable; by transcript |
| Frequencies | allele frequencies in large-scale sequencing studies (1000 Genomes, ESP6500, ExAC, Scripps Wellderly, etc.) |
| Predictions | computational predictions of functional impact, such as PolyPhen-2, MutationAssessor, SIFT, CADD, PROVEAN, GWAVA, and ensemble scores |
| Phenotypes | disease-associations from ClinVar, HGMD, OMIM, etc. |
| Regions | observed and predicted regions that contain the given variant: functional and regulatory elements (ENCODE), protein domains (InterPro), microRNA target sites (miRanda) |
| Source | maps each variant to the study/studies in which it was observed; also stores pass- or non-pass flags according to filtering criteria if provided by the study |
| Comments | |
| Staging_summary | registry that holds potentially new variants while they are not yet automatically annotated and copied to the production summary table |
| Staging_impact | holds results from computational models regarding effects of the mutation (protein level) |
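The relationship between the Summary table and a dependent table such as Source can be sketched with an in-memory SQLite database. The column names, types, and constraints below are assumptions made for illustration; the source lists the tables and their purpose but not their schema or the database system used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Main table: one row per distinct variant (hypothetical columns).
    CREATE TABLE summary (
        variant_id INTEGER PRIMARY KEY,
        chrom      TEXT    NOT NULL,
        pos        INTEGER NOT NULL,
        ref        TEXT    NOT NULL,
        alt        TEXT    NOT NULL,
        UNIQUE (chrom, pos, ref, alt)
    );
    -- Dependent table: maps a variant to every study that observed it.
    CREATE TABLE source (
        variant_id INTEGER NOT NULL REFERENCES summary (variant_id),
        study      TEXT    NOT NULL,
        passed     INTEGER,  -- 1 = pass, 0 = non-pass, NULL = no filter provided
        PRIMARY KEY (variant_id, study)
    );
    """
)
conn.execute(
    "INSERT INTO summary (chrom, pos, ref, alt) VALUES ('1', 200, 'C', 'T')"
)
conn.executemany(
    "INSERT INTO source (variant_id, study, passed) VALUES (1, ?, ?)",
    [("study_A", 1), ("study_B", None)],
)
rows = conn.execute(
    """
    SELECT s.chrom, s.pos, src.study, src.passed
    FROM summary AS s
    JOIN source AS src ON src.variant_id = s.variant_id
    ORDER BY src.study
    """
).fetchall()
```

The other dependent tables (Impact, Frequencies, Predictions, and so on) could hang off `variant_id` in the same way under this assumed layout.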
Mutations extracted from PubMed/MEDLINE, PMC full texts, and PDFs from PMC, including supplementary files such as Excel tables. Variants are grouped by variant type; each piece of evidence for a variant is counted toward the grand total. We also show the number of variants that we were able to map to a dbSNP ID, as well as the number of unique variants, disregarding occurrences across multiple publications.
| Type | PubMed | PMC | PDF and Supplement | Total |
|---|---|---|---|---|
| Substitution | 617,693 | 853,487 | 5,804,542 | |
| dbSNP | 102,040 | 222,310 | 4,433,018 | |
| Insertion | 3,072 | 2,252 | 17,640 | |
| Duplication | 875 | 1,263 | 5,522 | |
| Repeat | 42 | 76 | 339 | |
| Deletion | 19,987 | 27,192 | 69,326 | |
| Insdel | 202 | 290 | 2,061 | |
| Frameshift | 2,185 | 3,065 | 28,405 | |
| Structural | 15,347 | 6,143 | 5,642,341 | |
| Total non-unique | 761,449 | 1,116,093 | 15,802,854 | |
| – with dbSNP ID | 261,881 | 381,500 | 4,743,471 | |
| Total unique | 203,055 | 201,597 | 4,221,952 | |
| Total unique mapped to allele a | 101,652 | 122,393 | 727,602 | 890,665 |
a: Where amino acid changes were given in the literature, we counted only one allele that would lead to that change
Variants in clinical annotation databases observed in healthy cohorts, binned by maximum ethnicity-specific allele frequency across cohorts. Bins are non-cumulative and intervals exclude the value of the upper boundary
| Source | 0 | 0–0.001 | 0.001–0.005 | 0.005–0.01 | 0.01–0.05 | 0.05–0.1 | 0.1–0.5 | ≥0.5 | Total |
|---|---|---|---|---|---|---|---|---|---|
| ClinVar: pathogenic | 30.09 | 2.59 | 0.86 | 0.20 | 0.26 | 0.05 | 0.14 | 0.02 | 34.21 |
| ClinVar: likely pathogenic | 3.26 | 0.29 | 0.08 | 0.01 | 0.02 | 3.66 | |||
| ClinVar: risk factor | 0.35 | 0.03 | 0.06 | 0.02 | 0.05 | 0.02 | 0.13 | 0.10 | 0.76 |
| ClinVar: association | 0.01 | <0.01 | <0.01 | 0.01 | 0.01 | <0.01 | 0.05 | 0.02 | 0.10 |
| ClinVar: likely benign | 0.47 | 0.95 | 1.01 | 0.49 | 0.51 | 0.05 | 0.05 | 0.01 | 3.54 |
| ClinVar: benign | 0.49 | 0.26 | 0.47 | 0.62 | 2.33 | 1.10 | 2.61 | 1.70 | 9.58 |
| ClinVar: protective | <0.01 | <0.01 | <0.01 | 0.02 | 0.01 | 0.03 | |||
| ClinVar: drug response | <0.01 | <0.01 | 0.01 | 0.01 | 0.02 | ||||
| ClinVar: uncertain significance | 8.15 | 2.05 | 1.53 | 0.37 | 0.36 | 0.05 | 0.06 | 0.03 | 12.60 |
| ClinVar: other | 1.02 | 0.05 | 0.04 | 0.02 | 0.03 | 0.01 | 0.15 | 0.08 | 1.40 |
| ClinVar: unknown | 29.44 | 2.58 | 0.97 | 0.19 | 0.38 | 0.12 | 0.31 | 0.11 | 34.10 |
| HGMD: DM | 81.24 | 4.17 | 1.40 | 0.39 | 0.46 | 0.05 | 0.04 | 0.01 | 87.76 |
| HGMD: DM? | 4.80 | 0.61 | 0.50 | 0.17 | 0.35 | 0.12 | 0.17 | 0.02 | 6.74 |
| HGMD: DFP | 0.16 | 0.01 | 0.02 | 0.01 | 0.07 | 0.08 | 0.48 | 0.29 | 1.12 |
| HGMD: DP | 0.30 | 0.03 | 0.06 | 0.04 | 0.13 | 0.09 | 0.71 | 0.45 | 1.81 |
| HGMD: FP | 0.86 | 0.15 | 0.16 | 0.09 | 0.20 | 0.09 | 0.32 | 0.15 | 2.02 |
| HGMD: FTV | 0.32 | 0.06 | 0.03 | 0.01 | 0.03 | 0.01 | 0.04 | 0.03 | 0.53 |
| OMIM: pathogenic | 72.24 | 8.69 | 3.19 | 1.00 | 1.14 | 0.26 | 0.73 | 0.23 | 87.48 |
| OMIM: probably pathogenic | 0.02 | 0.02 | |||||||
| OMIM: probably not pathogenic | 0.01 | 0.01 | |||||||
| OMIM: risk factor | 1.33 | 0.23 | 0.28 | 0.06 | 0.29 | 0.13 | 0.59 | 0.57 | 3.48 |
| OMIM: association | 0.02 | 0.01 | 0.01 | 0.01 | 0.06 | 0.09 | 0.20 | ||
| OMIM: no known pathogenicity | 0.11 | 0.03 | 0.04 | 0.03 | 0.11 | 0.06 | 0.39 | 0.15 | 0.92 |
| OMIM: confers sensitivity | 0.01 | 0.01 | |||||||
| OMIM: protective | 0.01 | 0.01 | 0.03 | 0.06 | 0.05 | 0.16 | |||
| OMIM: drug response | 0.01 | 0.02 | 0.05 | 0.07 | 0.15 | ||||
| OMIM: other | 6.46 | 0.32 | 0.15 | 0.06 | 0.03 | 0.03 | 0.04 | 0.01 | 7.10 |
| OMIM: VUS | 0.11 | 0.08 | 0.07 | 0.03 | 0.03 | 0.05 | 0.09 | 0.05 | 0.51 |
Values represent the percentage of variants from the respective resource that fall into each category and bin. DM, disease-causing mutation; DM?, likely DM; DP, disease-associated polymorphism; DFP, DP with additional functional evidence; FP, functional polymorphism; FTV, frameshift or truncating variant; VUS, variant of unknown significance.
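The binning rule stated in the caption (non-cumulative bins, upper boundary excluded, based on the maximum ethnicity-specific allele frequency across cohorts) can be written as a short Python sketch. The bin edges and labels follow the table header; the `af_bin` function and its input format (an iterable of per-population frequencies for one variant) are assumptions for illustration.

```python
from bisect import bisect_right

# Upper boundaries of the half-open bins; a value equal to an edge
# falls into the next higher bin.
EDGES = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5]
LABELS = [
    "0–0.001", "0.001–0.005", "0.005–0.01",
    "0.01–0.05", "0.05–0.1", "0.1–0.5", "≥0.5",
]


def af_bin(freqs):
    """Return the bin label for the maximum allele frequency in freqs."""
    max_af = max(freqs, default=0.0)
    if max_af == 0:
        # Variant not observed in any cohort.
        return "0"
    return LABELS[bisect_right(EDGES, max_af)]
```

For example, `af_bin([0.0, 0.0005])` falls into “0–0.001”, while `af_bin([0.001])` falls into “0.001–0.005” because each interval excludes its upper boundary.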
Fig. 2 RVS web query interface: public datasets in RVS can be queried by coordinates (shown), dbSNP, genes, and by defining ‘cohorts’ using populations in RVS. RVS will return full annotations, frequencies, phenotypes, and literature references.