Satishkumar Ranganathan Ganakammal, Emil Alexov.
Abstract
BACKGROUND: Genomic diagnostic tests are performed for a wide spectrum of complex genetic conditions such as autism and cancer. Advances in technology have aided in successfully decoding the genetic variants that cause or trigger these disorders. However, interpreting these variants is not a trivial task, even at the level of distinguishing pathogenic from benign variants.
Keywords: DNA variants; Disease-causing mutations; Machine learning; Pathogenic mutations classification; Rett syndrome; Variant pathogenicity predictors
Year: 2019 PMID: 31799076 PMCID: PMC6884988 DOI: 10.7717/peerj.8106
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Description of In-silico predictors evaluated.
Brief description of the fourteen in-silico predictors (both independent and empirical predictors) used in this study, with their pathogenicity cutoff values.
| Predictor | Description | Pathogenicity cutoff |
|---|---|---|
| SIFT | It uses a multiple sequence alignment (MSA) methodology to determine the probability that a missense variant is tolerated, conditional on the most frequent amino acid being tolerated | <0.049 |
| Polyphen2 | It calculates the normalized accessible surface area and the changes in accessible surface propensity resulting from the amino acid substitution | >0.447 |
| LTR | It uses heuristic methods to identify mutations that disrupt highly conserved amino acids within protein-coding sequences | NA |
| Mutation taster | It uses a naive Bayes classifier to evaluate the pathogenicity of a variant based on information available from various databases | >0.5 |
| Mutation assessor | It uses the evolutionary conservation of the affected amino acid in protein homologs | >1.935 |
| FATHMM | It uses Hidden Markov Models (HMM) to assess the functional effect of the candidate variant by incorporating a disease-specific weighting scheme | <−1.151 |
| PROVEAN | It uses pairwise sequence alignment scores to predict the biological effect on protein function | <−2.49 |
| VEST3 | It uses supervised learning method utilizing | NA |
| MetaSVM | It applies a support vector machine approach to the previously generated scores | >0 |
| MetaLR | It applies a logistic regression model to the previously generated scores | >0.5 |
| M-CAP | It uses a gradient boosting trees method to analyze interactions between features to determine variant pathogenicity | NA |
| REVEL | It combines results from available prediction tools, using them as features to assess the pathogenicity of a variant | >0.75 |
| CADD | It uses a C-score obtained by integrating multiple variant annotation resources | >19 |
| Eigen | It uses a supervised approach to derive an aggregate functional score from various annotation resources | NA |
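The cutoffs in the table above can be applied mechanically to turn a raw predictor score into a pathogenic or benign call. The following is a minimal sketch, assuming scores are plain floats keyed by the predictor names used in the table. The `CUTOFFS` dictionary and the `is_pathogenic` helper are hypothetical names introduced for illustration, not part of the study's pipeline, and predictors listed with an NA cutoff are left unclassified.

```python
import operator

# Cutoffs copied from the table above as (comparison, threshold).
# Predictors whose cutoff is "NA" (LTR, VEST3, M-CAP, Eigen) are omitted.
CUTOFFS = {
    "SIFT": (operator.lt, 0.049),
    "Polyphen2": (operator.gt, 0.447),
    "Mutation taster": (operator.gt, 0.5),
    "Mutation assessor": (operator.gt, 1.935),
    "FATHMM": (operator.lt, -1.151),
    "PROVEAN": (operator.lt, -2.49),
    "MetaSVM": (operator.gt, 0.0),
    "MetaLR": (operator.gt, 0.5),
    "REVEL": (operator.gt, 0.75),
    "CADD": (operator.gt, 19.0),
}


def is_pathogenic(predictor, score):
    """Return True/False per the table cutoff, or None if no cutoff is listed."""
    if predictor not in CUTOFFS:
        return None
    compare, threshold = CUTOFFS[predictor]
    return compare(score, threshold)
```

Note that the direction of the comparison differs between predictors: SIFT, FATHMM and PROVEAN flag low scores as pathogenic, while the others flag high scores.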
Clinical relevance distribution of variants from ClinVar database.
Counts of Single Nucleotide Variants (SNVs) from ClinVar Database (for build GRCh37) categorized based on major clinical relevance.
| Clinical relevance | Total number of variants |
|---|---|
| Pathogenic | 36,536 |
| Benign | 7,249 |
| Likely pathogenic | 2,105 |
| Likely benign | 17,295 |
| Variant of unknown significance (VUS) | 135,534 |
Proposed golden dataset.
The golden dataset includes pathogenic and benign variants obtained by filtering the ClinVar SNVs (build GRCh37) based on the number of submitters (NOS) and submitter categories (SC).
| Clinical relevance | Total number of variants | Criteria |
|---|---|---|
| Pathogenic | 2,123 | NOS > 2 & SC = 3 |
| Benign | 2,231 | NOS > 3 & SC >= 2 |
| Total | 4,354 | |
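The filtering criteria in the table above can be expressed as a simple predicate over variant records. The sketch below assumes a hypothetical `Variant` record with integer `nos` and `sc` fields. These field names are illustrative only and do not correspond to actual ClinVar column names.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    # Hypothetical record layout, for illustration only.
    variant_id: str
    clinical_relevance: str  # e.g. "Pathogenic" or "Benign"
    nos: int                 # number of submitters (NOS)
    sc: int                  # submitter category (SC)


def in_golden_set(v):
    """Apply the NOS/SC criteria from the table to one variant."""
    if v.clinical_relevance == "Pathogenic":
        return v.nos > 2 and v.sc == 3
    if v.clinical_relevance == "Benign":
        return v.nos > 3 and v.sc >= 2
    # Other classes (likely benign, likely pathogenic, VUS) are excluded.
    return False


def build_golden_set(variants):
    """Keep only the variants that satisfy the golden-set criteria."""
    return [v for v in variants if in_golden_set(v)]
```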
Statistical measures from our supervised learning method.
Statistics calculated from our performance evaluation and classification analysis in the Weka software.
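| Statistics | Formula |
|---|---|
| Sensitivity | $\frac{TP}{TP + FN}$ |
| Specificity | $\frac{TN}{TN + FP}$ |
| Precision | $\frac{TP}{TP + FP}$ |
| MCC | $\frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$ |

TP, TN, FP and FN denote the counts of true positives, true negatives, false positives and false negatives. The formulas shown are the standard definitions of these statistics.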
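The four statistics named in the table are conventionally computed from the counts in a confusion matrix. The sketch below assumes the standard textbook definitions and plain integer counts. It is not the Weka implementation used in the study, and the function name `classification_stats` is hypothetical.

```python
import math


def _ratio(numerator, denominator):
    # Return 0.0 instead of raising when a class is absent.
    return numerator / denominator if denominator else 0.0


def classification_stats(tp, tn, fp, fn):
    """Compute sensitivity, specificity, precision and MCC from counts.

    tp, tn, fp, fn: true positives, true negatives,
    false positives and false negatives.
    """
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "precision": _ratio(tp, tp + fp),
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }
```

The Matthews correlation coefficient (MCC) ranges from -1 to 1 and, unlike accuracy, stays informative when the two classes differ in size.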
Summary of various supervised learning methods.
Statistics calculated on our cross-validation dataset by applying different machine learning algorithms to identify the best methods for feature evaluation.
| Classification algorithm | Sensitivity | Specificity | Precision | Recall | F-measure | MCC | Accuracy |
|---|---|---|---|---|---|---|---|
| Random forest | 0.985 | 0.952 | 0.956 | 0.985 | 0.970 | 0.938 | 0.969 |
| Naive Bayes | 0.905 | 0.911 | 0.914 | 0.905 | 0.909 | 0.815 | 0.907 |
| Classification via regression | 0.957 | 0.944 | 0.948 | 0.957 | 0.953 | 0.902 | 0.951 |
| LibSVM | 0.940 | 0.930 | 0.934 | 0.940 | 0.937 | 0.870 | 0.953 |
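In each row of the table above, the fifth numeric value matches the harmonic mean of the reported precision and recall (the F-measure) to within rounding. The sketch below is a consistency check under that assumption, not a reproduction of the study's Weka output. The `ROWS` dictionary and the helper names are hypothetical.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, fifth reported value) copied from the table rows.
# The LibSVM recall is read as 0.940.
ROWS = {
    "Random forest": (0.956, 0.985, 0.970),
    "Naive Bayes": (0.914, 0.905, 0.909),
    "Classification via regression": (0.948, 0.957, 0.953),
    "LibSVM": (0.934, 0.940, 0.937),
}


def max_discrepancy(rows):
    """Largest absolute gap between recomputed and reported values."""
    return max(
        abs(f_measure(p, r) - reported)
        for p, r, reported in rows.values()
    )
```

Because the inputs are themselves rounded to three decimals, small differences in the last digit are expected.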
Figure 1. Performance evaluations of 14 in-silico predictors.
The graphical representation of the major statistics obtained from the evaluation of all 14 in-silico predictors.
Figure 2. Performance comparison: independent vs empirical in-silico predictors.
The graphical representation of the major statistics obtained from the evaluation of both independent (solid bars) and ensemble (grey bars) predictors.
Summary statistics of our combinatory approach.
Statistics obtained by applying our classifier to the golden dataset with proposed combined set of independent (VEST3, LTR, Polyphen2 and PROVEAN) and ensemble or dependent (CADD, Eigen, MetaSVM and REVEL) predictors.
| Predictors | Classification algorithm | Sensitivity | Specificity | Precision | Recall | F-measure | MCC | Accuracy |
|---|---|---|---|---|---|---|---|---|
| VEST3, LTR, Polyphen2, PROVEAN, CADD, Eigen, MetaSVM and REVEL | Random forest | 0.982 | 0.950 | 0.954 | 0.982 | 0.968 | 0.933 | 0.966 |
Figure 3. Performance comparison: our approach vs ReVe.
Comparison of the statistics obtained from the proposed combined set of independent (VEST3, LTR, Polyphen2 and PROVEAN) and ensemble or dependent (CADD, Eigen, MetaSVM and REVEL) predictors (solid bars) to the combination of REVEL and VEST as proposed by Li et al. (2018) (grey bars).
Reclassification of the MECP2 variants.
Variants previously classified as likely benign, likely pathogenic, of uncertain significance (VUS), or with conflicting interpretations of pathogenicity were reclassified using our golden dataset (as the training dataset), with benchmarking against “pathogenic” and “benign” mutations.
| Clinical significance | Total variants | Classified as benign (best in-silico predictors) | Classified as pathogenic (best in-silico predictors) | Success rate |
|---|---|---|---|---|
| Pathogenic | 64 | 7 | 57 | 89% |
| Benign | 1 | 1 | 0 | 100% |
| Likely Benign | 10 | 9 | 1 | NA |
| Likely pathogenic | 11 | 2 | 9 | NA |
| Uncertain significance | 69 | 25 | 44 | NA |
| Conflicting interpretation | 11 | 5 | 6 | NA |
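The success rates in the table above are consistent with the share of variants whose predicted class agrees with their existing ClinVar label, which is why only the "Pathogenic" and "Benign" rows carry a value. The sketch below assumes that interpretation, inferred from the table's numbers rather than stated in the text. The function name `success_rate` is hypothetical.

```python
def success_rate(correct, total):
    """Percentage of variants assigned to their known class, rounded."""
    if total == 0:
        raise ValueError("total must be positive")
    return round(100 * correct / total)


# Counts copied from the table: (correctly classified, total variants).
KNOWN_LABEL_ROWS = {
    "Pathogenic": (57, 64),
    "Benign": (1, 1),
}

rates = {
    label: success_rate(correct, total)
    for label, (correct, total) in KNOWN_LABEL_ROWS.items()
}
```

For the remaining rows (likely benign, likely pathogenic, uncertain significance, conflicting interpretation) there is no reference label to compare against, so no rate is computed.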