Peter B McGarvey, Andrew Nightingale, Jie Luo, Hongzhan Huang, Maria J Martin, Cathy Wu.
Abstract
Understanding the association of genetic variation with its functional consequences in proteins is essential for the interpretation of genomic data and identifying causal variants in diseases. Integration of protein function knowledge with genome annotation can assist in rapidly comprehending genetic variation within complex biological processes. Here, we describe mapping UniProtKB human sequences and positional annotations, such as active sites, binding sites, and variants, to the human genome (GRCh38) and the release of a public genome track hub for genome browsers. To demonstrate the power of combining protein annotations with genome annotations for functional interpretation of variants, we present specific biological examples in disease-related genes and proteins. Computational comparisons of UniProtKB annotations and protein variants with ClinVar clinically annotated single nucleotide polymorphism (SNP) data show that 32% of UniProtKB variants colocate with 8% of ClinVar SNPs. The majority of colocated UniProtKB disease-associated variants (86%) map to 'pathogenic' ClinVar SNPs. UniProt and ClinVar are collaborating to provide a unified clinical variant annotation for genomic, protein, and clinical researchers. The genome track hubs, and related UniProtKB files, are downloadable from the UniProt FTP site and discoverable as public track hubs at the UCSC and Ensembl genome browsers.
Keywords: UniProt database; genome mapping; missense variants
Year: 2019 PMID: 30840782 PMCID: PMC6563471 DOI: 10.1002/humu.23738
Source DB: PubMed Journal: Hum Mutat ISSN: 1059-7794 Impact factor: 4.878
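The abstract describes mapping positional protein annotations onto GRCh38. The following is a minimal sketch of the underlying coordinate arithmetic, not the authors' pipeline: it assumes the coding exons of a transcript are already known as 1-based inclusive genomic intervals, and the function name `residue_to_genome` and the toy coordinates are hypothetical. A codon that straddles an exon boundary comes back as two intervals.

```python
from typing import List, Tuple

# 1-based inclusive genomic start and end of one coding exon segment
Interval = Tuple[int, int]


def residue_to_genome(residue: int, cds_exons: List[Interval], strand: str) -> List[Interval]:
    """Return the genomic interval(s) covering the codon of a 1-based residue.

    Hypothetical helper for illustration only.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if residue < 1:
        raise ValueError("residue positions are 1-based")

    # Walk the exons in transcription order: ascending on '+', descending on '-'.
    ordered = sorted(cds_exons, reverse=(strand == "-"))

    # 0-based offsets into the coding sequence of the three codon bases.
    first = 3 * (residue - 1)
    wanted = range(first, first + 3)

    positions = []
    offset = 0  # coding bases consumed by the exons seen so far
    for start, end in ordered:
        length = end - start + 1
        for w in wanted:
            if offset <= w < offset + length:
                k = w - offset
                positions.append(start + k if strand == "+" else end - k)
        offset += length

    if len(positions) != 3:
        raise ValueError("residue lies outside the coding sequence")

    # Merge adjacent bases into ascending genomic intervals.
    positions.sort()
    intervals: List[Interval] = []
    for p in positions:
        if intervals and p == intervals[-1][1] + 1:
            intervals[-1] = (intervals[-1][0], p)
        else:
            intervals.append((p, p))
    return intervals


# Toy transcript with two coding exons (made-up coordinates).
toy_exons = [(101, 110), (201, 210)]
print(residue_to_genome(4, toy_exons, "+"))  # [(110, 110), (201, 202)]
```

A multi-residue feature such as a domain could be handled the same way by mapping its first and last residues and keeping the exon segments in between.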
UniProtKB sequence annotations in track hubs. Annotation types, descriptions and current number of each feature mapped to the human genome are shown. UniProtKB release 2018_01 (Jan 2018) was used for this table. For more information on sequence features in UniProtKB see (www.uniprot.org/help/sequence_annotation)
| Annotation type | Description (feature_name) | Features mapped |
|---|---|---|
| Proteome | Location of complete protein and isoform sequences (proteome) | 112,093 |
| Signal | Sequence targeting proteins to the secretory pathway or periplasmic space (signal) | 10,360 |
| Transit peptide | Extent of a transit peptide for organelle targeting (transit) | 482 |
| Chain | Extent of a polypeptide chain in the mature protein (chain) | 25,339 |
| Peptide | Extent of an active peptide in the mature protein (peptide) | 383 |
| Propeptide | Peptide that is cleaved during maturation or activation (propep) | 802 |
| Initiator met | Cleavage of the initiator methionine (init_met) | 1,992 |
| Topological domain | Location of non‐membrane regions of membrane‐spanning proteins (topo_dom) | 18,773 |
| Transmembrane | Extent of a membrane‐spanning region (transmem) | 43,734 |
| Intramembrane | Extent of a region located in a membrane without crossing it (intramem) | 329 |
| Domain | Position and type of each modular protein domain (domain) | 66,315 |
| Repeat | Positions of repeated sequence motifs or domains (repeat) | 19,314 |
| Calcium binding | Position(s) of calcium binding region(s) within the protein (ca_bind) | 731 |
| Zinc finger | Position(s) and type(s) of zinc fingers within the protein (zn_fing) | 9,127 |
| DNA binding | Position and type of a DNA‐binding domain (dna_bind) | 1,267 |
| Nucleotide binding | Nucleotide phosphate binding region (np_bind) | 3,826 |
| Region | Region of interest in the sequence (region) | 9,894 |
| Coiled coil | Positions of regions of coiled coil within the protein (coiled) | 16,909 |
| Motif | Short (up to 20 aa) sequence motif of biological interest (motif) | 3,332 |
| Active site | Amino acids directly involved in the activity of an enzyme (act_site) | 4,190 |
| Metal binding | Binding site for a metal ion (metal) | 3,031 |
| Binding site | Binding site for any chemical group (binding) | 6,275 |
| Site | Any interesting single amino acid site on the sequence (site) | 2,183 |
| Modified residue | Modified residues excluding lipids, glycans & cross‐links (mod_res) | 54,743 |
| Lipidation | Covalently attached lipid group(s) (lipid) | 1,035 |
| Glycosylation | Covalently attached glycan group(s) (carbohyd) | 16,474 |
| Disulfide bond | Cysteine residues participating in disulfide bonds (disulfid) | 19,816 |
| Cross‐link | Residues in covalent linkage between proteins (crosslnk) | 6,829 |
| Non‐standard residue | Occurrence of non‐standard amino acids (selenocysteine & pyrrolysine) in the protein sequence (non_std) | 36 |
| Helix | Helical regions in the experimentally determined structure (helix) | 57,596 |
| Turn | Turns within the experimentally determined protein structure (turn) | 14,813 |
| Beta strand | Beta strand regions within the experimentally determined protein structure (strand) | 63,579 |
| Mutagenesis | Sites experimentally altered by mutagenesis (mutagen) | 20,335 |
| Natural variant | Description of a natural variant of the protein (variant) | 76,678 |
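Genome browser track hubs are commonly built from BED records, which use 0-based, half-open coordinates. As a rough illustration of how one mapped feature from the table might be written in the standard 12-column BED layout, the sketch below converts 1-based inclusive genomic intervals into a single BED12 line. The actual column layout of the UniProt hub files is not given here, so the helper `to_bed12`, the default colour, and the toy coordinates are assumptions.

```python
from typing import List, Tuple


def to_bed12(chrom: str, intervals: List[Tuple[int, int]], name: str,
             strand: str, rgb: str = "0,0,0") -> str:
    """Format 1-based inclusive intervals as one tab-separated BED12 line.

    Hypothetical helper; the score is fixed at 0 and the thick region
    is set to span the whole feature.
    """
    blocks = sorted(intervals)
    chrom_start = blocks[0][0] - 1   # BED starts are 0-based
    chrom_end = blocks[-1][1]        # BED ends are exclusive
    sizes = [end - start + 1 for start, end in blocks]
    starts = [start - 1 - chrom_start for start, _ in blocks]
    fields = [
        chrom, chrom_start, chrom_end, name, 0, strand,
        chrom_start, chrom_end, rgb, len(blocks),
        ",".join(str(s) for s in sizes),
        ",".join(str(s) for s in starts),
    ]
    return "\t".join(str(f) for f in fields)


# A single-residue feature whose codon is split across two exons (made-up coordinates).
print(to_bed12("chrX", [(110, 110), (201, 202)], "act_site", "+"))
```

Each block corresponds to one exon segment, so a feature that spans an intron is drawn as linked boxes in the browser.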
Figure 1. The GLA gene (P06280, alpha‐galactosidase A), associated with Fabry disease (FD), shown on the UCSC browser with UniProtKB genome tracks plus ClinVar and dbSNP tracks. In panel 1, selection a shows the UniProt annotation for part of the enzyme's active site and an amino acid variation, arising from a SNP associated with FD, in which an acidic proton donor (Asp, D) is replaced by a neutral residue (Asn, N). In selection b, another variant disrupts an annotated disulfide bond by removing a cysteine required for a structural fold. SNPs are not observed in the other data resources at these positions. In panel 2, selection c shows an N‐linked glycosylation site disrupted by another UniProt amino acid variant that does overlap pathogenic variants in ClinVar and other databases. Links between UniProt and ClinVar are illustrated in the display. GLA, alpha‐galactosidase A; SNP, single nucleotide polymorphism
Figure 2. Percentage of ClinVar SNPs in each annotation category that exist in each feature type; the underlying data table is in the supplemental methods. Features with “*” show greater pathogenic than benign or uncertain classifications. Features with “**” have pathogenic classifications greater than or equal to benign, but less than uncertain. SNP, single nucleotide polymorphism
Mapping of variants and annotations between ClinVar SNPs and UniProtKB amino acid variants that overlap in genome position and result in the same amino acid change. Only gold star rated ClinVar variants with evaluation criteria and no conflicts in assertions were included. Numbers in bold face are comparisons discussed in the text. Numbers in parentheses are totals for each database.
| | All ClinVar SNPs (249,784) | Pathogenic SNPs (27,819) | Uncertain SNPs (132,904) | Benign SNPs (94,541) | Other SNPs (11,489) |
|---|---|---|---|---|---|
| All UniProt variants (77,647) | | 3,918 | 1,291 | 3,360 | 40 |
| Disease variants (30,220) | 4,135 | | 412 | 159 | 2 |
| Unclassified variants (7,579) | 876 | 245 | | 155 | 0 |
| Polymorphism variants (39,848) | 3,598 | 111 | 403 | | 38 |
Abbreviation: SNP, single nucleotide polymorphism
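Several cells are absent from this copy of the table above. As a hedged sketch, and only under the assumption that every colocated variant falls into exactly one ClinVar category (so each row sums to its "All ClinVar SNPs" total), a missing cell can be inferred from its row total and cross-checked against the column totals in the "All UniProt variants" row. The inferred values are arithmetic consequences of that assumption, not figures quoted from the source. The helper name `fill_from_row_total` is hypothetical.

```python
# Column order: Pathogenic, Uncertain, Benign, Other.
# Each entry: (row total from "All ClinVar SNPs", cells); None marks an absent cell.
rows = {
    "Disease": (4135, [None, 412, 159, 2]),
    "Unclassified": (876, [245, None, 155, 0]),
    "Polymorphism": (3598, [111, 403, None, 38]),
}
# Surviving cells of the "All UniProt variants" row, used as column totals.
column_totals = [3918, 1291, 3360, 40]


def fill_from_row_total(total, cells):
    """Infer the single missing cell of a row, assuming the row is additive."""
    if sum(c is None for c in cells) != 1:
        raise ValueError("expected exactly one missing cell")
    known = sum(c for c in cells if c is not None)
    return [total - known if c is None else c for c in cells]


filled = {name: fill_from_row_total(total, cells)
          for name, (total, cells) in rows.items()}

# Cross-check: inferred cells must also reproduce every column total.
for j, col_total in enumerate(column_totals):
    assert sum(filled[name][j] for name in filled) == col_total

# Share of colocated disease variants that map to pathogenic ClinVar SNPs.
disease_total = rows["Disease"][0]
disease_pathogenic = filled["Disease"][0]
print(round(100 * disease_pathogenic / disease_total))  # 86
```

Under this assumption the row and column checks agree, and the disease-to-pathogenic share matches the 86% quoted in the abstract.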