José Luis Cabrera-Alarcon, Jorge García Martinez, José Antonio Enríquez, Fátima Sánchez-Cabo.
Abstract
Accurate detection of pathogenic single nucleotide variants (SNVs) is a key challenge in whole exome and whole genome sequencing studies. To date, several in silico tools have been developed to predict deleterious variants from this type of data. However, these tools have limited power to detect new pathogenic variants, especially in non-coding regions. In this study, we evaluate a new metric, the Shannon Entropy of Locus Variability (SELV), calculated as the Shannon entropy of the variant frequencies reported in genome-wide population studies at a given locus, as a predictor of potentially pathogenic variants in non-coding nuclear and mitochondrial DNA, and also in coding regions under a selective pressure other than that imposed by the genetic code, e.g. splice-sites. For benchmarking, SELV was compared to predictors of pathogenicity in different genomic contexts. In nuclear non-coding DNA, SELV outperformed CDTS (SELV: ROC AUC = 0.97, PR-AUC = 0.96). For non-coding mitochondrial variants, SELV outperformed HmtVar (SELV: ROC AUC = 0.98, PR-AUC = 1.00). Moreover, SELV was compared against two state-of-the-art ensemble predictors of pathogenicity in splice-sites, ada-score and rf-score, and matched their overall performance in both ROC (AUC = 0.95) and precision-recall curves (PR-AUC = 0.97), with the advantage that SELV can be easily calculated for every position in the genome, unlike ada-score and rf-score. Therefore, we suggest that information about the observed genetic variability at a locus, as reported by large-scale population studies, could improve the prioritization of SNVs in splice-sites and in non-coding regions.
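The abstract defines SELV only as the Shannon entropy of the variant frequencies reported at a locus. A minimal sketch of that idea follows. It is not the authors' implementation: the function names `shannon_entropy` and `selv_sketch` are hypothetical, and two details the abstract does not specify are assumed here, namely that the reference allele frequency (one minus the sum of alternate allele frequencies) is included in the distribution and that the logarithm is base 2.

```python
import math


def shannon_entropy(freqs, base=2.0):
    """Shannon entropy of a discrete distribution given as non-negative weights.

    Weights are normalised to sum to 1; zero-weight terms contribute nothing.
    """
    total = sum(freqs)
    if total <= 0:
        raise ValueError("frequencies must sum to a positive value")
    probs = [f / total for f in freqs]
    return sum(-p * math.log(p, base) for p in probs if p > 0)


def selv_sketch(alt_allele_freqs, base=2.0):
    """Hypothetical SELV-like score for one locus.

    ASSUMPTION: the distribution is the reference allele frequency plus the
    population frequencies of each alternate allele observed at the locus.
    """
    alt_total = sum(alt_allele_freqs)
    if any(f < 0 for f in alt_allele_freqs) or alt_total > 1.0 + 1e-9:
        raise ValueError("alternate allele frequencies must be >= 0 and sum to <= 1")
    ref_freq = max(0.0, 1.0 - alt_total)
    return shannon_entropy([ref_freq, *alt_allele_freqs], base)


# Illustrative (invented) frequencies, not data from the study.
invariant_locus = selv_sketch([])            # no variants reported
rare_variant_locus = selv_sketch([0.001])    # one rare alternate allele
common_variant_locus = selv_sketch([0.5])    # one common alternate allele
```

Under these assumptions, a locus with no reported variants scores 0, a locus with a single alternate allele at frequency 0.5 scores 1 bit, and a locus with only a rare variant falls in between, close to 0.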
Year: 2022 PMID: 35079159 PMCID: PMC9091277 DOI: 10.1038/s41431-021-01034-1
Source DB: PubMed Journal: Eur J Hum Genet ISSN: 1018-4813 Impact factor: 5.351
Fig. 1 Distribution of neutral and pathogenic SNPs in analyzed data.
Number of neutral and pathogenic variants in the splice vs. non-splice analysis (A), in the dataset of mitochondrial SNVs (B), and in the dataset used for benchmarking SELV in non-coding variants (C). NPC: non-protein-coding; PC: protein coding.
Fig. 2 Performance of predictors compared to SELV.
ROC (A) and PR curves (B) for SELV, ada score, rf score, phastCons, and phyloP conservation scores for deleteriousness in splice-site SNVs. ROC curves (C) and PR curves (D) for SELV, HmtVar disease score, phastCons, and phyloP conservation scores for pathogenic variant detection in mitochondrial non-coding SNVs. The performance of SELV, CDTS, phastCons, and phyloP conservation scores for pathogenic variant detection in nuclear non-coding SNVs is depicted as ROC (E) and PR curves (F). Abbreviations: ROC, receiver operating characteristic; PR, precision-recall; SELV, Shannon Entropy of Locus Variability; CDTS, context-dependent tolerance score; SNV, single nucleotide variant.
Comparison of the area under the receiver operating characteristic curve (AUC) between SELV and the other predictors considered, in splice-sites, non-coding mitochondrial DNA, and non-coding nuclear DNA.
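| Genomic context | Predictor | ROC AUC |
|---|---|---|
| Splice-site SNVs | SELV | 0.95 |
| Splice-site SNVs | rf_score | 0.94 |
| Splice-site SNVs | ada_score | 0.95 |
| Splice-site SNVs | phastCons | 0.84 |
| Splice-site SNVs | phyloP | 0.9 |
| Non-coding mitochondrial SNVs | SELV | 0.98 |
| Non-coding mitochondrial SNVs | HmtVar | 0.82 |
| Non-coding mitochondrial SNVs | phastCons | 0.86 |
| Non-coding mitochondrial SNVs | phyloP | 0.88 |
| Non-coding nuclear SNVs | SELV | 0.97 |
| Non-coding nuclear SNVs | CDTS | 0.71 |
| Non-coding nuclear SNVs | phastCons | 0.93 |
| Non-coding nuclear SNVs | phyloP | 0.95 |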
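
As a reading aid for the AUC values above, the ROC AUC of a score equals the probability that a randomly chosen pathogenic variant is ranked above a randomly chosen neutral one, with ties counted as one half. The sketch below computes it that way on invented toy data. The function name `roc_auc` and the convention that a higher score means "more likely pathogenic" are assumptions for illustration; a score where lower values indicate pathogenicity would need to be negated first.

```python
def roc_auc(scores, labels):
    """ROC AUC via pairwise comparison of positives and negatives.

    labels: 1 for pathogenic, 0 for neutral.
    ASSUMPTION: higher score means more likely pathogenic.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive correctly ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy example (invented values, not data from the study).
toy_scores = [0.9, 0.2, 0.8, 0.1]
toy_labels = [1, 1, 0, 0]
toy_auc = roc_auc(toy_scores, toy_labels)
```

The pairwise loop is quadratic in the number of variants, which is fine for a small illustration; a rank-based computation is the usual choice for large benchmark sets.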