Satishkumar Ranganathan Ganakammal, Emil Alexov.
Abstract
Single-nucleotide variants (SNVs) are a major form of genetic variation in the human genome and contribute to various disorders. SNVs fall into two types: non-synonymous (missense) variants (nsSNVs) and synonymous variants (sSNVs), the latter predominantly involved in RNA processing or gene regulation. Unlike nsSNVs, sSNVs do not alter the amino acid sequence, which makes them challenging candidates for downstream functional studies. Numerous computational methods have been developed to evaluate the clinical impact of nsSNVs, but very few are available for understanding the effects of sSNVs. For this analysis, we downloaded sSNVs from the ClinVar database along with various features such as conservation, DNA-RNA, and splicing properties. We performed feature selection and implemented an ensemble random forest (RF) classification algorithm to build a classifier that predicts the pathogenicity of sSNVs. We demonstrate that the ensemble predictor with 20 selected features improves the classification of sSNVs into two categories, pathogenic and benign, with high accuracy (87%), precision (79%), and recall (91%). Furthermore, we used this prediction model to reclassify sSNVs of unknown clinical significance. Finally, the method is very robust and can be used to predict the effect of other unknown sSNVs.
Keywords: pathogenicity prediction; random forest (RF); synonymous variants (sSNVs); variant of unknown significance (VUS)
Year: 2020 PMID: 32967157 PMCID: PMC7565489 DOI: 10.3390/genes11091102
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Brief description of all the 29 features categorized into 5 groups.
| Feature Class | Feature | Description |
|---|---|---|
| In silico predictors | CADD | It uses a C-score obtained by the integration of multiple variant annotation resources. |
| | EIGEN | It uses a supervised approach to derive the aggregate functional score from various annotation resources. |
| | TraP (V3) | It evaluates the ability of a variant to cause disease by damaging the final transcript. |
| Conservation Score | GERP++ | GERP++ score is used to measure the conservation at the mutation position |
| | PhyloP (100 ways) | It computes conservation P-values for a specific lineage |
| | PhastCons | Scores based on conserved elements |
| Codon Usage | dRSCU | Change in RSCU caused by mutation |
| | RSCU | RSCU (relative synonymous codon usage) of new codon |
| Splicing Properties | MES | Max splice site score |
| | MES-KM | Has a value of 1 if site changes most or 0 if not |
| | dMES | Max change in splice site score |
| | MES- | Max splice site score decrease |
| | MES+ | Max splice site score increase |
| | dpsi | The delta PSI is the predicted change in percent-inclusion due to the variant |
| | dpsiz | The z-score of the dPSI relative |
| | FAS6+ | Hexamer splice suppressor motifs gained |
| | FAS6- | Hexamer splice suppressor motifs lost |
| | MEC-MC | Has a value of 1 if strongest site changes or 0 if not |
| | MEC-CS | Has a value of 1 if a cryptic site is now strongest or 0 if not |
| | PESS- | Octamer splice suppressor motifs lost |
| | PESS+ | Octamer splice suppressor motifs gained |
| | PESE- | Octamer splice enhancer motifs lost |
| | PESE+ | Octamer splice enhancer motifs gained |
| | SR- | SR-protein motifs lost |
| | SR+ | SR-protein motifs gained |
| Sequence Properties | CpG_exon | Observed/expected CpG content of exon |
| | CpG | Has a value of 1 if mutation changes a CpG or 0 if not |
| | f_premrna | Relative distance to end of pre-mRNA |
| | f_mrna | Relative distance to end of mature mRNA |
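To make the two codon usage features concrete, here is a minimal sketch of how RSCU and dRSCU could be computed. It uses the standard definition of RSCU (observed codon count divided by the mean count over the codon's synonymous family). The codon counts are hypothetical, and the sign convention for dRSCU (new codon minus old codon) is an assumption, since the table only describes it as the change in RSCU caused by the mutation.

```python
from collections import Counter

# Synonymous codon family for leucine in the standard genetic code,
# used here purely as an illustration.
LEU_CODONS = ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"]


def rscu(codon_counts, family):
    """RSCU per codon: observed count / mean count over the synonymous family."""
    total = sum(codon_counts.get(c, 0) for c in family)
    if total == 0:
        return {c: 0.0 for c in family}
    mean = total / len(family)
    return {c: codon_counts.get(c, 0) / mean for c in family}


def d_rscu(rscu_table, ref_codon, alt_codon):
    """Assumed convention: RSCU of the new codon minus RSCU of the old codon."""
    return rscu_table[alt_codon] - rscu_table[ref_codon]


# Hypothetical codon counts, not taken from the paper.
counts = Counter({"TTA": 10, "TTG": 20, "CTT": 20, "CTC": 30, "CTA": 10, "CTG": 90})
table = rscu(counts, LEU_CODONS)

# Synonymous change CTA -> CTG (both encode leucine).
new_codon_rscu = table["CTG"]
change = d_rscu(table, "CTA", "CTG")
print(round(new_codon_rscu, 2), round(change, 2))
```

A positive value under this convention means the variant moves to a more frequently used codon; the reference codon table would in practice come from genome-wide usage rather than a toy count.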
Statistical measures used to assess the performance of classification methods. Here TP stands for true positive, TN for true negative, FP for false positive, and FN for false negative.
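| Statistics | Formula |
|---|---|
| Precision | $\frac{TP}{TP + FP}$ |
| Recall | $\frac{TP}{TP + FN}$ |
| F-measure | $\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$ |
| MCC | $\frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$ |
| Accuracy | $\frac{TP + TN}{TP + TN + FP + FN}$ |
| Receiver operating characteristic (ROC) curve | Plot of the TP rate against the FP rate |
| Area under the ROC curve (AUC) | Measures the capability of a model to distinguish between classes. |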
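The threshold-based measures in this table can be computed directly from the four confusion-matrix counts. The following is a minimal sketch using the standard definitions; the function name and the example counts are hypothetical and are not results from the paper.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary classification measures from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f_measure = 2 * precision * recall / pr_sum if pr_sum else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # MCC is undefined when any marginal total is zero; return 0.0 in that case.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "mcc": mcc,
        "accuracy": accuracy,
    }


# Hypothetical counts for a balanced set of 243 pathogenic and 243 benign variants.
metrics = classification_metrics(tp=200, fp=23, tn=220, fn=43)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Unlike accuracy, MCC uses all four counts symmetrically, which is why it is often reported alongside accuracy even when the classes are balanced.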
Summary of performance of the random forest (RF) and Naive Bayes (NB) classification algorithms on 5 different training sets, evaluated using 10-fold cross-validation. Each training set includes 243 benign variants chosen randomly (5 times) along with the 243 pathogenic variants. The training and testing were done using all 29 features.
| Training Set | Classification Algorithm | Precision | Recall | F-Measure | MCC | Accuracy | AUC |
|---|---|---|---|---|---|---|---|
| Training Set 1 | Random forest | 0.886 | 0.802 | 0.842 | 0.703 | 0.849 | 0.929 |
| | Naive Bayes | 0.862 | 0.744 | 0.799 | 0.631 | 0.812 | 0.888 |
| Training Set 2 | Random forest | 0.928 | | | 0.789 | | |
| | Naive Bayes | 0.873 | 0.761 | 0.813 | 0.656 | 0.825 | 0.898 |
| Training Set 3 | Random forest | | 0.831 | 0.886 | | | 0.941 |
| | Naive Bayes | 0.872 | 0.757 | 0.811 | 0.652 | 0.823 | 0.894 |
| Training Set 4 | Random forest | 0.928 | 0.844 | 0.884 | 0.781 | 0.889 | 0.953 |
| | Naive Bayes | 0.868 | 0.786 | 0.825 | 0.67 | 0.833 | 0.912 |
| Training Set 5 | Random forest | 0.923 | 0.844 | 0.882 | 0.777 | 0.886 | 0.948 |
| | Naive Bayes | 0.877 | 0.761 | 0.815 | 0.66 | 0.827 | 0.905 |
Highest value is highlighted in bold.
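The sampling scheme behind this table (five balanced training sets, each evaluated with 10-fold cross-validation) can be sketched with the standard library alone. This is an illustrative sketch, not the authors' pipeline: the variant identifiers, the size of the benign pool, the random seeds, and the use of stratified folds are all assumptions.

```python
import random


def balanced_sets(pathogenic, benign, n_sets=5, seed=0):
    """Pair all pathogenic variants with an equal-sized random benign sample."""
    rng = random.Random(seed)
    sets = []
    for _ in range(n_sets):
        sampled = rng.sample(benign, len(pathogenic))
        # Label 1 = pathogenic, label 0 = benign.
        sets.append([(v, 1) for v in pathogenic] + [(v, 0) for v in sampled])
    return sets


def stratified_folds(examples, k=10, seed=0):
    """Split (variant, label) pairs into k folds with similar class balance."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in (0, 1):
        group = [e for e in examples if e[1] == label]
        rng.shuffle(group)
        for i, example in enumerate(group):
            folds[i % k].append(example)
    return folds


# Hypothetical identifiers; the benign pool size of 1000 is an assumption.
pathogenic = [f"P{i}" for i in range(243)]
benign_pool = [f"B{i}" for i in range(1000)]

training_sets = balanced_sets(pathogenic, benign_pool)
folds = stratified_folds(training_sets[0])

for held_out in range(len(folds)):
    test = folds[held_out]
    train = [e for j, fold in enumerate(folds) if j != held_out for e in fold]
    # A classifier (e.g. RF or NB) would be fit on `train` and scored on `test` here.
```

Repeating the benign draw five times gives a rough sense of how sensitive the reported measures are to which benign variants happen to be sampled.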
Figure 1. The ROC curve for evaluating the performance of the top (10, 15, 20) ranked features. Though the AUC is very close across all three sets, the top 20 features had better accuracy than the other two sets.
Figure 2. (a) Bar plot showing the distribution of accuracy obtained by each compared method on a known test dataset; (b) ROC curves for the same methods.
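As a reminder of how ROC curves and AUC values such as those in the figures are obtained from classifier scores, here is a minimal sketch that sweeps a decision threshold and integrates the curve with the trapezoidal rule. The labels and scores are hypothetical, and the helper names are illustrative; both classes must be present in the labels.

```python
def roc_points(labels, scores):
    """Return (FP rate, TP rate) points from sweeping a threshold over the scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all examples tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under a piecewise-linear ROC curve (trapezoidal rule)."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical labels (1 = pathogenic, 0 = benign) and predicted scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
area = auc(curve)
print(round(area, 3))
```

The resulting area equals the fraction of pathogenic/benign pairs in which the pathogenic variant receives the higher score, which is why two feature sets can have nearly identical AUC yet differ in accuracy at a fixed threshold.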