Kirsley Chennen1,2, Thomas Weber1, Xavière Lornage2, Arnaud Kress1, Johann Böhm2, Julie Thompson1, Jocelyn Laporte2, Olivier Poch1.
Abstract
The diffusion of next-generation sequencing technologies has revolutionized research and diagnosis in the field of rare Mendelian disorders, notably via whole-exome sequencing (WES). However, one of the main issues hampering achievement of a diagnosis via WES analyses is the extended list of variants of unknown significance (VUS), mostly composed of missense variants. Hence, improved solutions are needed to address the challenges of identifying potentially deleterious variants and ranking them in a prioritized short list. We present MISTIC (MISsense deleTeriousness predICtor), a new prediction tool based on an original combination of two complementary machine learning algorithms using a soft voting system that integrates 113 missense features, ranging from multi-ethnic minor allele frequencies and evolutionary conservation, to physicochemical and biochemical properties of amino acids. Our approach also uses training sets with a wide spectrum of variant profiles, including both high-confidence positive (deleterious) and negative (benign) variants. Compared to recent state-of-the-art prediction tools in various benchmark tests and independent evaluation scenarios, MISTIC exhibits the best and most consistent performance, notably with the highest AUC value (> 0.95). Importantly, MISTIC maintains its high performance in the specific case of discriminating deleterious variants from benign variants that are rare or population-specific. In a clinical context, MISTIC drastically reduces the list of VUS (<30%) and significantly improves the ranking of "causative" deleterious variants. Pre-computed MISTIC scores for all possible human missense variants are available at http://lbgi.fr/mistic.
Year: 2020 PMID: 32735577 PMCID: PMC7394404 DOI: 10.1371/journal.pone.0236962
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Selected features included in the Soft Voting system of MISTIC.
| Name | Number of features | Description |
|---|---|---|
| Minor allele frequencies | 6 | Minor allele frequencies for 6 populations: all exomes (global MAF), African (AFR), American (AMR), East Asian (EAS), Non-Finnish European (NFE), South Asian (SAS) |
| PhastCons | 3 | phastCons conservation score based on three categories of multiple alignments: (i) 100 vertebrate genomes, (ii) 30 mammals, and (iii) 17 primates. The larger the score, the more conserved the site |
| PhyloP | 3 | phyloP (phylogenetic p-values) conservation score based on three categories of multiple alignments: (i) 100 vertebrate genomes, (ii) 30 mammals, and (iii) 17 primates. The larger the score, the more conserved the site |
| SiPhy 29way logOdds | 1 | The estimated stationary distribution of A, C, G and T at the locus using the SiPhy algorithm based on 29 mammalian genomes |
| GERP++_RS | 1 | Identified constrained elements in multiple alignments |
| CCRS | 1 | The score reflects the intolerance of constrained coding regions of protein-coding genes for protein-altering variants |
| MPC | 1 | A deleteriousness prediction score for missense variants based on regional missense constraints |
| AAindex | 90 | The AAindex substitution matrices for different physicochemical and biochemical properties of amino acids |
| SIFT | 1 | Prediction of the impact of an amino acid substitution on the protein function |
| PolyPhen 2 | 1 | Prediction of the impact of an amino acid substitution on the structure and function of a protein using straightforward physical and comparative considerations |
| VEST4_score | 1 | Machine learning method predicting the functional significance of missense mutations based on the probability that they are pathogenic |
| Condel | 1 | Weighted average of the normalized scores of five methods (SIFT, PolyPhen2, Logre, MAPP, MutationAssessor) |
| CADD PHRED | 1 | Machine learning scoring model that integrates more than sixty annotation features into a single metric, to distinguish variants that survived natural selection from simulated mutations |
| MetaLR | 1 | Logistic Regression model combining multiple variant scoring metrics |
| MetaSVM | 1 | Support Vector Machine model combining multiple variant scoring metrics |
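As an illustration of the soft voting idea named in the table title, the following minimal sketch averages the probabilities of deleteriousness returned by two models for the same variant. This record does not name the two algorithms, so the models are represented only by their output probabilities; the function names, the equal weights, and the 0.5 cut-off are illustrative assumptions, not values taken from MISTIC.

```python
from typing import Optional, Sequence


def soft_vote(probabilities: Sequence[float],
              weights: Optional[Sequence[float]] = None) -> float:
    """Combine per-model probabilities of the positive (deleterious) class
    into a single score by taking their weighted mean (soft voting)."""
    if not probabilities:
        raise ValueError("at least one model probability is required")
    if weights is None:
        # Assumption: every model contributes equally.
        weights = [1.0] * len(probabilities)
    if len(weights) != len(probabilities):
        raise ValueError("one weight per model is required")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(p * w for p, w in zip(probabilities, weights)) / total_weight


def is_deleterious(score: float, threshold: float = 0.5) -> bool:
    """Label a variant as deleterious when its combined score reaches the
    cut-off. The default of 0.5 is a placeholder, not a published value."""
    return score >= threshold


# Hypothetical probabilities from two models for one missense variant.
combined = soft_vote([0.875, 0.625])
print(combined, is_deleterious(combined))
```

Soft voting differs from hard (majority) voting in that each model's confidence is preserved, so a strongly confident model can outweigh a weakly confident one.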
Fig 1. Performance of missense prediction tools on the VarTest set.
MISTIC was compared to the individual component features (MetaSVM, MetaLR, VEST4, Condel, CADD, PolyPhen2, SIFT) used in its model (in grey) and to the best-performing recently published tools (in color). The area under the receiver operating characteristic curve (AUC) is shown in brackets.
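For readers unfamiliar with the AUC reported in Fig 1, it can be read as the probability that a randomly chosen deleterious variant receives a higher score than a randomly chosen benign one. The sketch below computes it by direct pairwise comparison; the function name and the example scores are hypothetical and are not data from the study.

```python
from typing import Sequence


def roc_auc(deleterious_scores: Sequence[float],
            benign_scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed as the fraction of
    (deleterious, benign) pairs ranked correctly; ties count as half."""
    if not deleterious_scores or not benign_scores:
        raise ValueError("both classes need at least one score")
    correct = 0.0
    for d in deleterious_scores:
        for b in benign_scores:
            if d > b:
                correct += 1.0
            elif d == b:
                correct += 0.5
    return correct / (len(deleterious_scores) * len(benign_scores))


# Hypothetical scores: three of the four pairs are ordered correctly.
print(roc_auc([0.9, 0.3], [0.5, 0.1]))
```

The pairwise form is quadratic in the number of variants, which is fine for illustration; a rank-based formulation is preferable for large benchmark sets.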
Fig 2. Evaluation of prediction tools on different variant analysis scenarios.
The performance of MISTIC was compared to that of other missense prediction tools for the discrimination of deleterious variants from rare benign variants and population-specific missense variants. All prediction tools were evaluated using novel deleterious variants (Fig 2A: ClinVarNew and Benign_EvalSet sets), known deleterious variants from diverse sources (Fig 2B: DoCM and Benign_EvalSet sets), rare benign variants with MAF data (<0.01, <0.005, <0.001, <0.0001, singleton) or benign variants without MAF data (Fig 2C: ClinVarNew/DoCM and PopSpe_EvalSet: UK10K, SweGen, WesternAsia).
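The MAF strata listed in this caption can be sketched as a simple grouping step. The code below assumes each benign variant is a pair of a hypothetical identifier and an optional MAF value, and treats the thresholds as cumulative (a variant below 0.001 also falls below 0.01). The singleton stratum is omitted because it requires allele counts, which are not described here.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Thresholds as listed in the caption, paired with readable labels.
MAF_STRATA = (("<0.01", 0.01), ("<0.005", 0.005),
              ("<0.001", 0.001), ("<0.0001", 0.0001))


def stratify_by_maf(
    variants: Sequence[Tuple[str, Optional[float]]]
) -> Dict[str, List[str]]:
    """Group variant identifiers by MAF stratum; variants without a MAF
    value go to a separate 'no_maf' group."""
    strata: Dict[str, List[str]] = {label: [] for label, _ in MAF_STRATA}
    strata["no_maf"] = []
    for variant_id, maf in variants:
        if maf is None:
            strata["no_maf"].append(variant_id)
            continue
        for label, threshold in MAF_STRATA:
            if maf < threshold:
                strata[label].append(variant_id)
    return strata


# Hypothetical benign variants.
example = [("v1", 0.008), ("v2", 0.0005), ("v3", None), ("v4", 0.00005)]
print(stratify_by_maf(example))
```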
Fig 3. Evaluation of the different missense prediction tools using simulated and real disease exomes.
(A) Distribution of the percentage of predicted deleterious variants in the simulated disease exomes. (B) Ranking of the "causative" deleterious variants introduced in simulated disease exomes. (C) Distribution of the percentage of predicted deleterious variants in the exomes of the MyoCapture project. (D) Ranking of the causative deleterious variants identified in real congenital myopathy exomes from the MyoCapture project.
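The two quantities shown in these panels can be sketched for a single exome as follows. The code assumes a mapping from hypothetical variant identifiers to prediction scores in which higher means more deleterious; the threshold used in the example is a placeholder, not a recommended value from the study.

```python
from typing import Dict


def percent_predicted_deleterious(scores: Dict[str, float],
                                  threshold: float) -> float:
    """Percentage of variants in one exome whose score reaches the cut-off."""
    if not scores:
        raise ValueError("the exome contains no scored variants")
    flagged = sum(1 for s in scores.values() if s >= threshold)
    return 100.0 * flagged / len(scores)


def rank_of_variant(scores: Dict[str, float], variant_id: str) -> int:
    """1-based rank of a variant when the exome is sorted by decreasing
    score; variants tied with it do not push it down the list."""
    target = scores[variant_id]
    return 1 + sum(1 for s in scores.values() if s > target)


# Hypothetical exome with one known causative variant ("c").
exome = {"a": 0.9, "b": 0.2, "c": 0.7, "d": 0.1}
print(percent_predicted_deleterious(exome, threshold=0.5))
print(rank_of_variant(exome, "c"))
```

A lower percentage means a shorter candidate list to review, and a rank closer to 1 means the causative variant is reached sooner.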
Fig 4. Distribution of scores for deleterious and benign variants.
The variants of the deleterious set (Del_EvalSet) and of the benign set with MAF data (Benign_EvalSet) were pooled, and the distributions of the scores for deleterious and benign variants were represented using violin plots. Red area: distribution of scores for deleterious variants. Green area: distribution of scores for benign variants. Black line: recommended threshold.