David Stanek, Dana M Bis-Brewer, Cima Saghira, Matt C Danzi, Pavel Seeman, Petra Lassuthova, Stephan Zuchner.
Abstract
Genetic variation occurring within conserved functional protein domains warrants special attention when examining DNA variation in the context of disease causation. Here we introduce a resource, freely available at www.prot2hg.com, that addresses the question of whether a particular variant falls onto an annotated protein domain and directly translates chromosomal coordinates onto protein residues. The tool supports simple multiple-site queries, and the whole dataset is available for download and is also incorporated into our own accessible pipeline. To create this resource, National Center for Biotechnology Information protein data were retrieved using the Entrez Programming Utilities. After processing all human protein domains, residue positions were reverse translated, mapped to the reference genome hg19 and stored in a MySQL database. In total, 760 487 protein domains from 42 371 protein models were mapped to hg19 coordinates and made publicly available for search or download (www.prot2hg.com). In addition, this annotation was implemented in the genomics research platform GENESIS in order to query nearly 8000 exomes and genomes of families with rare Mendelian disorders (tgp-foundation.org). When applied to patient genetic data, we found that rare (<1%) variants in the Genome Aggregation Database were significantly more often annotated onto a protein domain than common (>1%) variants. Similarly, variants described as pathogenic or likely pathogenic in ClinVar were more likely to be annotated onto a domain. In addition, we tested a dataset consisting of 60 causal variants in a cohort of patients with epileptic encephalopathy and found that 71% of them (43 variants) were propagated onto protein domains. In summary, we developed a resource that annotates variants in the coding part of the genome onto conserved protein domains in order to increase variant prioritization efficiency. Database URL: www.prot2hg.com.
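The core question posed in the abstract, whether a variant position falls onto an annotated domain, comes down to an interval lookup. The following is a minimal sketch of such a multiple-site query, not the Prot2HG implementation: the record layout, the function names (`build_index`, `domains_at`, `annotate`) and all coordinates are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical domain records: (chromosome, start, end, domain name), using
# 1-based inclusive coordinates. All values are invented for illustration.
DOMAINS = [
    ("chr1", 1000, 1299, "Kinase domain"),
    ("chr1", 1200, 1400, "ATP-binding site"),
    ("chr2", 500, 650, "Zinc finger"),
]


def build_index(domains):
    """Group domain intervals by chromosome and sort them by start position."""
    index = defaultdict(list)
    for chrom, start, end, name in domains:
        index[chrom].append((start, end, name))
    for intervals in index.values():
        intervals.sort()
    return index


def domains_at(index, chrom, pos):
    """Return the names of all domains whose interval contains pos."""
    intervals = index.get(chrom, [])
    # Only intervals that start at or before pos can contain it.
    upper = bisect_right(intervals, (pos, float("inf"), ""))
    return [name for start, end, name in intervals[:upper] if end >= pos]


def annotate(index, sites):
    """Multiple-site query: map each (chrom, pos) to its overlapping domains."""
    return {site: domains_at(index, *site) for site in sites}


index = build_index(DOMAINS)
hits = annotate(index, [("chr1", 1250), ("chr1", 1500), ("chr2", 500)])
```

Here `hits` maps each queried site to a list of domain names; an empty list means the site lies outside every annotated domain. A production resource of the size described above would more likely delegate this lookup to indexed database queries.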
Year: 2020 PMID: 32293014 PMCID: PMC7157182 DOI: 10.1093/database/baz161
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. The process of mapping protein domains to the human genome. (A) Data obtained from NCBI: data on all proteins were downloaded and stored in a data structure for further analysis. (B) Data processed into the data structure: during this step, each protein domain is recorded with its subsequence and complete information. (C) Feature sequence mapped to cDNA: the protein subsequence is translated into a DNA sequence, and a matching sequence in the cDNA is then found and compared with the translated sequence to confirm the match. (D) Data split into exons and chromosomal locations assigned: some domains span more than one exon, so, using the UCSC gene database, we identify the chromosomal location of every exon that codes for the domain. The completed record is then stored in the Prot2HG database.
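Step (D) can be illustrated with a short sketch that converts a domain's residue range into one genomic interval per exon it touches. This is an assumption-laden illustration rather than the authors' code: the function name `domain_to_genomic`, the input layout (coding segments listed in transcript order, 1-based inclusive coordinates) and the example coordinates are all hypothetical.

```python
def domain_to_genomic(aa_start, aa_end, cds_exons, strand="+"):
    """Map a protein residue range to genomic intervals, one per exon touched.

    aa_start, aa_end: 1-based inclusive residue positions of the domain.
    cds_exons: coding segments as (start, end) genomic coordinates, 1-based
        inclusive, listed in transcript order (5' to 3' of the mRNA).
    strand: "+" or "-".
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # Residue i occupies coding bases 3*i - 2 .. 3*i (1-based).
    cds_from = 3 * aa_start - 2
    cds_to = 3 * aa_end
    pieces = []
    consumed = 0  # coding bases contained in the exons already visited
    for ex_start, ex_end in cds_exons:
        length = ex_end - ex_start + 1
        seg_from = consumed + 1     # first coding base carried by this exon
        seg_to = consumed + length  # last coding base carried by this exon
        lo = max(cds_from, seg_from)
        hi = min(cds_to, seg_to)
        if lo <= hi:  # the domain overlaps this exon
            if strand == "+":
                g_start = ex_start + (lo - seg_from)
                g_end = ex_start + (hi - seg_from)
            else:
                # On the minus strand, coding order runs toward lower coordinates.
                g_start = ex_end - (hi - seg_from)
                g_end = ex_end - (lo - seg_from)
            pieces.append((g_start, g_end))
        consumed += length
    return pieces


# Hypothetical two-exon gene on the plus strand; residues 3-5 straddle the intron.
plus_pieces = domain_to_genomic(3, 5, [(101, 110), (201, 220)], strand="+")
# The same exons read on the minus strand, given in transcript order.
minus_pieces = domain_to_genomic(6, 8, [(201, 220), (101, 110)], strand="-")
```

In both calls the domain covers three residues, so the returned intervals together span nine bases split across the two exons.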
Figure 2. Chi-square statistics. From the observed (real) data, we calculated the expected data distribution. The difference between the two columns (observed vs. expected) reflects the dependence in the data and, accordingly, the statistical significance; it also grows with the odds ratio. The pie chart is calculated from observed (real) data only. (A) Among rare variants, 32% were annotated, while only 23% of common variants fell onto a conserved protein domain. The results show a dependence between the distributions, and the difference was significant, with P < 0.01 and an odds ratio of 1.59. (B) For ClinVar, the difference was even more pronounced: 66% of pathogenic/likely pathogenic variants were annotated, while only 32% of other ClinVar variants fell onto a domain. The results show a dependence between the distributions, and the difference was significant, with P < 0.01 and an odds ratio of 4.04.
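The observed-versus-expected comparison in the caption can be sketched for a single 2x2 table as follows. This is a plain Pearson chi-square test without continuity correction, which may differ from the exact procedure used in the study. The counts are hypothetical, chosen only to mirror the rounded 32% and 23% proportions of panel (A), so the result does not reproduce the reported odds ratio exactly.

```python
def two_by_two(a, b, c, d):
    """Pearson chi-square statistic and odds ratio for a 2x2 table.

    Table layout:
                    on domain   off domain
        group 1         a           b
        group 2         c           d
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    # Expected count under independence = row total * column total / n.
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    observed = [a, b, c, d]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    odds_ratio = (a * d) / (b * c)
    return chi2, odds_ratio, expected


# Hypothetical counts: 320 of 1000 rare and 230 of 1000 common variants on a domain.
chi2, odds_ratio, expected = two_by_two(320, 680, 230, 770)
# 6.635 is the chi-square critical value for P < 0.01 at 1 degree of freedom.
significant = chi2 > 6.635
```

With these illustrative counts the odds ratio is about 1.58 and the statistic is well above the 0.01 threshold; the published figures were derived from the study's actual counts.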