Harry Biggs, Padmini Parthasarathy, Alexandra Gavryushkina, Paul P Gardner.
Abstract
Variants within the non-coding genome are frequently associated with phenotypes in genome-wide association studies. These non-coding regions may be involved in the regulation of gene expression, encode functional non-coding RNAs, or influence splicing and other cellular functions. We have curated a list of characterized non-coding human genome variants based on published evidence of the phenotypic consequences of the variation. To minimize annotation errors, two curators independently verified the supporting evidence for pathogenicity of each non-coding variant in the published literature. The database consists of 721 non-coding variants linked to the published literature describing the evidence of functional consequences. We have also sampled 7228 covariate-matched benign controls, each with a population frequency of over 5%, from the single nucleotide polymorphism database (dbSNP151). These were sampled while controlling for potential confounding factors such as linkage with pathogenic variants, annotation type (untranslated region, intron, intergenic, etc.) and variant type (substitution or indel). The dataset presented here is a curated repository with potential use in training or evaluating algorithms that predict non-coding variant functionality. Database URL: https://github.com/Gardner-BinfLab/ncVarDB
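The covariate-matched sampling of benign controls described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `sample_matched_controls`, the record fields (`annotation`, `variant_type`, `maf`) and the default ratio of 10 controls per pathogenic variant are assumptions chosen for illustration, and the sketch does not model the control for linkage with pathogenic variants.

```python
import random
from collections import Counter, defaultdict


def sample_matched_controls(pathogenic, candidates, ratio=10, min_maf=0.05, seed=0):
    """Sample common variants whose covariates match the pathogenic set.

    Hypothetical illustration: each variant is a dict with the keys
    "annotation" (e.g. "intron") and "variant_type" (e.g. "indel");
    candidates also carry "maf", their population frequency.
    """
    rng = random.Random(seed)

    # Count pathogenic variants in each (annotation, variant type) stratum.
    strata = Counter((v["annotation"], v["variant_type"]) for v in pathogenic)

    # Pool the candidates that pass the frequency filter, by stratum.
    pool = defaultdict(list)
    for c in candidates:
        if c["maf"] > min_maf:
            pool[(c["annotation"], c["variant_type"])].append(c)

    # Draw up to `ratio` controls per pathogenic variant from the same stratum.
    controls = []
    for key, n_pathogenic in strata.items():
        available = pool.get(key, [])
        k = min(len(available), n_pathogenic * ratio)
        controls.extend(rng.sample(available, k))
    return controls
```

Sampling within strata keeps the distribution of annotation and variant types in the controls close to that of the pathogenic set, so a classifier cannot separate the two groups on those covariates alone.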
Year: 2020 PMID: 33258967 PMCID: PMC7706182 DOI: 10.1093/database/baaa105
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. The location and single-nucleotide polymorphism (SNP) types of ncVarDB variants in comparison to variants from the dbSNP database. Variant positions and variant types are compared across four datasets: every SNP in dbSNP excluding variants from alternate contigs (dbSNP), every non-coding SNP with a minor allele frequency (MAF) between 5 and 20% (5–20% MAF dbSNP), the ncVar benign dataset and the ncVar pathogenic dataset. (A) A comparison of the frequency of genomic positions of variants present in each dataset. Positions are based on the genomic notation submitted with the variant in either dbSNP or ClinVar. (B) A comparison of the frequency of variant types for each dataset. Variant types have been simplified to three types to avoid type expansion.
Figure 2. Receiver operating characteristic (ROC) curves for the classification analyses of the ncVar dataset by three different software tools: FATHMM-XF, CADD v1.4 and DANN. FATHMM-XF and CADD predict the pathogenicity of the ncVarDB variants with noticeably higher specificity and sensitivity than DANN. The overall good performance of all three tools additionally validates the ncVar dataset.
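For readers who want to run a comparable evaluation, the following is a minimal sketch of how a ROC curve and its area under the curve (AUC) can be computed from a tool's scores. It assumes labels coded as 1 (pathogenic) and 0 (benign) and scores where higher means more likely pathogenic. The function names are illustrative and do not come from FATHMM-XF, CADD or DANN.

```python
def roc_curve_points(labels, scores):
    """Return (false positive rate, true positive rate) points.

    The threshold is swept from the highest score to the lowest;
    variants with tied scores are added to the curve together.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one pathogenic and one benign label")

    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every variant that shares this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy example with three pathogenic and three benign variants.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(auc(roc_curve_points(labels, scores)), 3))  # 0.889
```

An AUC of 1.0 corresponds to perfect separation of pathogenic from benign variants, and 0.5 to a score that carries no information.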