Fang Shi, Yao Yao, Yannan Bin, Chun-Hou Zheng, Junfeng Xia.
Abstract
BACKGROUND: Although synonymous single nucleotide variants (sSNVs) do not alter the protein sequence, they have been shown to play an important role in human disease. Distinguishing pathogenic sSNVs from neutral ones is challenging because pathogenic sSNVs tend to have low prevalence. Although many methods have been developed for predicting the functional impact of single nucleotide variants, only a few have been specifically designed for identifying pathogenic sSNVs.
Keywords: Feature selection; Pathogenicity prediction; Random forest; Synonymous variant
Year: 2019 PMID: 30704475 PMCID: PMC6357349 DOI: 10.1186/s12920-018-0455-6
Source DB: PubMed Journal: BMC Med Genomics ISSN: 1755-8794 Impact factor: 3.063
sSNVs were annotated with a set of 10 optimized features spanning five distinct classes of information relevant to assessing the harmfulness of sSNVs
| Feature name | Description | Type |
|---|---|---|
| Sequence feature | | |
| DSP | Distance from the mutation site to the nearest splice site | Integer |
| Functional region annotation | | |
| TFBS | Whether the variant lies in a transcription factor binding site | Bool |
| Splicing | | |
| MDE | Minimum distance as a proportion of half the exon | Numeric |
| DVE | Distance of the variant across the exon as a proportion | Numeric |
| ese-dens | Density of neighborhood inference exonic splicing enhancer hexamers in the exon sequence | Numeric |
| MES | Max splice site score | Numeric |
| SR- | SR-protein motifs lost | Numeric |
| dPSIZ | The z-score of dPSI (the predicted change in percent-inclusion due to the variant, reported as the maximum across tissues) relative to the distribution of dPSI values due to common variants | Numeric |
| Conservation | | |
| verPhyloP | Vertebrate PhyloP score at the mutation position | Numeric |
| Translation efficiency | | |
| TE | The tRNA adaptation index of the tRNA usage | Numeric |
Fig. 1 The ROC curves of prediction methods with and without feature selection using the 10-fold cross-validation on the training set
Prediction results after removing each feature in turn, using the 10-fold cross-validation on the training set
| Feature | Recall | Precision | F-measure (β = 1) | AUC |
|---|---|---|---|---|
| All features | | 0.820 | 0.755 | |
| No SR- | 0.697 | 0.823 | 0.755 | |
| No MES | 0.667 | 0.810 | 0.731 | 0.829 |
| No MDE | 0.677 | 0.820 | 0.742 | 0.834 |
| No DVE | 0.697 | 0.816 | 0.752 | 0.846 |
| No ese-dens | 0.693 | 0.835 | | 0.845 |
| No dPSIZ | | 0.813 | 0.752 | 0.850 |
| No verPhyloP | 0.697 | 0.768 | 0.731 | 0.829 |
| No TFBS | 0.690 | 0.818 | 0.749 | |
| No DSP | 0.683 | | 0.752 | 0.845 |
| No TE | 0.673 | 0.798 | 0.731 | 0.829 |
The highest values are highlighted in bold
Fig. 2 The ROC curves of different machine learning methods using the 10-fold cross-validation on the training set
Performance comparison of different methods on the independent test set
| Method | Recall | Precision | F-measure (β = 34) | AUC |
|---|---|---|---|---|
| IDSV | | 0.098 | | |
| CADD | 0.320 | 0.081 | 0.319 | 0.700 |
| FATHMM-MKL | 0.712 | 0.053 | 0.704 | 0.751 |
| SilVA | 0.490 | 0.581 | 0.490 | 0.844 |
| DDIG-SN | 0.298 | | 0.298 | 0.854 |
| TraP | 0.575 | 0.518 | 0.575 | 0.827 |
The highest values are highlighted in bold
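The F-measure column in the table above uses β = 34, which weights recall far more heavily than precision, so on this imbalanced test set the F-measure is almost equal to recall. The following minimal sketch shows the standard F-beta formula and checks it against the CADD and FATHMM-MKL rows; the function name `f_beta` is illustrative and not taken from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    beta > 1 favours recall, beta < 1 favours precision.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # convention: avoid division by zero
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Values copied from the CADD and FATHMM-MKL rows of the table above.
cadd = f_beta(precision=0.081, recall=0.320, beta=34)
fathmm = f_beta(precision=0.053, recall=0.712, beta=34)
print(round(cadd, 3), round(fathmm, 3))
```

With β = 1 the same function gives the balanced F-measure used in the other tables, where precision and recall contribute equally.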
Performance comparison of different methods based on the balanced subset of the independent test set in which benign variants were randomly selected from the full negative independent test set. We repeated this process 5 times with different random subsets of benign variants and averaged the results
| Method | Recall | Precision | F-measure (β = 1) | AUC | P-value |
|---|---|---|---|---|---|
| IDSV | | 0.781 ± 0.022 | | | * |
| CADD | 0.320 ± 0.000 | 0.760 ± 0.041 | 0.450 ± 0.007 | 0.698 ± 0.018 | 9.452e-07 |
| FATHMM-MKL | 0.712 ± 0.000 | 0.660 ± 0.026 | 0.685 ± 0.014 | 0.753 ± 0.019 | 0.0007962 |
| SilVA | 0.490 ± 0.000 | 0.977 ± 0.017 | 0.653 ± 0.004 | 0.844 ± 0.017 | 3.211e-05 |
| DDIG-SN | 0.298 ± 0.000 | | 0.459 ± 0.001 | 0.853 ± 0.006 | 5.957e-07 |
| TraP | 0.575 ± 0.000 | 0.971 ± 0.012 | 0.723 ± 0.003 | 0.848 ± 0.043 | 0.001015 |
The highest values are highlighted in bold. *Denotes the reference when calculating the P-value
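The balanced-subset evaluation described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes each subset contains as many benign variants as there are pathogenic ones, that predictions are already binarized (1 = called pathogenic), and that the spread is the sample standard deviation. The function name and arguments are hypothetical.

```python
import random
from statistics import mean, stdev


def balanced_subset_metrics(pos_preds, neg_preds, n_repeats=5, seed=0):
    """Recall plus mean and spread of precision over balanced resamples.

    pos_preds: binary calls for the pathogenic variants (all kept each time).
    neg_preds: binary calls for the benign variants (subsampled each time).
    Assumption: each subset draws len(pos_preds) benign variants.
    """
    rng = random.Random(seed)
    tp = sum(pos_preds)
    recall = tp / len(pos_preds)  # identical in every repeat
    precisions = []
    for _ in range(n_repeats):
        sample = rng.sample(neg_preds, len(pos_preds))
        fp = sum(sample)  # benign variants wrongly called pathogenic
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
    return recall, mean(precisions), stdev(precisions)


# Toy data: 10 pathogenic variants (6 detected), 1000 benign (50 false calls).
pos = [1] * 6 + [0] * 4
neg = [1] * 50 + [0] * 950
recall, prec_mean, prec_sd = balanced_subset_metrics(pos, neg)
print(recall, round(prec_mean, 3), round(prec_sd, 3))
```

Because the pathogenic variants are the same in every repeat, recall does not vary between subsets, which is consistent with the ± 0.000 spread reported for recall in the table, while precision changes with each random draw of benign variants.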