Shuang Li, K Joeri van der Velde, Dick de Ridder, Aalt D J van Dijk, Dimitrios Soudis, Leslie R Zwerwer, Patrick Deelen, Dennis Hendriksen, Bart Charbon, Marielle E van Gijn, Kristin Abbott, Birgit Sikkema-Raddatz, Cleo C van Diemen, Wilhelmina S Kerstjens-Frederikse, Richard J Sinke, Morris A Swertz.
Abstract
Exome sequencing is now mainstream in clinical practice. However, identification of pathogenic Mendelian variants remains time-consuming, in part because the limited accuracy of current computational prediction methods requires manual classification by experts. Here we introduce CAPICE, a new machine-learning-based method for prioritizing pathogenic variants, including SNVs and short InDels. CAPICE outperforms the best general (CADD, GAVIN) and consequence-type-specific (REVEL, ClinPred) computational prediction methods, for both rare and ultra-rare variants. CAPICE is easily added to diagnostic pipelines as a pre-computed score file or as command-line software, or through the online MOLGENIS web service with an API. CAPICE is free and open-source (LGPLv3) and can be downloaded at https://github.com/molgenis/capice.
Keywords: Allele frequency; Clinical genetics; Exome sequencing; Genome diagnostics; Machine learning; Molecular consequence; Variant pathogenicity prediction
Year: 2020 PMID: 32831124 PMCID: PMC7446154 DOI: 10.1186/s13073-020-00775-w
Source DB: PubMed Journal: Genome Med ISSN: 1756-994X Impact factor: 11.117
Fig. 1 An overview of the study setup
Data source for the variants and pathogenicity interpretation
| Data name | Data source | Number of pathogenic variants | Number of neutral variants |
|---|---|---|---|
| Training dataset | ClinVar (≥ 1 star) | 10,370 | 14,954 |
| | VKGL (≥ 1 lab support) | 581 | 11,129 |
| | van der Velde et al. [ | 30,187 | 274,112 |
| Benchmark dataset | ClinVar (≥ 2 stars) | 5,421 | 20 |
| | VKGL (≥ 2 lab support) | 187 | 11 |
| | ExAC | 0 | 5,392 |
| Benign Benchmark dataset 1 | Niroula et al. [ | 0 | 60,699 |
| Benign Benchmark dataset 2 | GoNL | 0 | 14,426,914 |
*The total number of variants in a dataset is smaller than or equal to the sum of the variants from its data sources because duplicated variants were removed
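The duplicate removal mentioned in the table footnote can be illustrated with a minimal sketch. The record does not say how duplicates were identified, so the key used here (chromosome, position, reference allele, alternate allele), the `Variant` type, the `merge_sources` helper, and the keep-the-first-record policy are all assumptions made for illustration, and the example variants are invented.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical variant record; not the data model used by CAPICE."""
    chrom: str
    pos: int
    ref: str
    alt: str
    label: str   # "pathogenic" or "neutral"
    source: str  # name of the data source the record came from


def merge_sources(*sources):
    """Merge variant lists, keeping the first record seen per (chrom, pos, ref, alt).

    Assumption: two records are duplicates when these four fields match.
    """
    seen = {}
    for variants in sources:
        for v in variants:
            key = (v.chrom, v.pos, v.ref, v.alt)
            seen.setdefault(key, v)  # later duplicates are ignored
    return list(seen.values())


# Invented toy records: one variant is present in both sources.
clinvar = [
    Variant("1", 1000, "A", "G", "pathogenic", "ClinVar"),
    Variant("2", 2000, "C", "T", "neutral", "ClinVar"),
]
vkgl = [
    Variant("1", 1000, "A", "G", "pathogenic", "VKGL"),
    Variant("3", 3000, "G", "GA", "neutral", "VKGL"),
]

merged = merge_sources(clinvar, vkgl)
print(len(clinvar) + len(vkgl), "records in,", len(merged), "unique variants out")
```

With this policy the merged total (3) is smaller than the sum over sources (4), which is the effect the footnote describes. A real pipeline would also need a rule for records whose labels disagree between sources; this sketch does not attempt one.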
Fig. 2 CAPICE outperforms other predictors in discriminating pathogenic variants from neutral variants. (a) True/false classification for all predictors tested against the full benchmark set, which contains all types of variants. The top bar shows the breakdown of the test set. The other bars show the classification performance of each method. Purple blocks represent correctly classified pathogenic variants. Dark-blue blocks represent correctly classified neutral variants. Pink and light-blue blocks denote false classifications. Gray blocks represent variants that were not classified by the predictor tested. Threshold selection methods are described in the “Methods” section. (b) Receiver operating characteristic (ROC) curves of CAPICE with AUC values for a subset of the benchmark data that contains only non-synonymous variants (the ROC curve for the full dataset can be found in Additional File 1: Fig. S2). Each ROC curve is for a subset of variants displaying a specific molecular consequence. AUC values for the different methods are listed in the figure legend
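The two quantities summarized in Fig. 2 (thresholded true/false classification, and the area under the ROC curve) can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' evaluation code: the function names, the toy scores, and the threshold of 0.5 are hypothetical, and a score of `None` stands in for a variant the predictor did not classify.

```python
def confusion_counts(scores, labels, threshold):
    """Count outcomes at a fixed threshold; label 1 = pathogenic, 0 = neutral.

    A score of None means the predictor gave no score for that variant.
    """
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0, "unclassified": 0}
    for score, label in zip(scores, labels):
        if score is None:
            counts["unclassified"] += 1
        elif score >= threshold:
            counts["TP" if label == 1 else "FP"] += 1
        else:
            counts["FN" if label == 1 else "TN"] += 1
    return counts


def roc_auc(scores, labels):
    """AUC as the probability that a random pathogenic variant outscores
    a random neutral one, counting ties as one half. Unscored variants are skipped.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1 and s is not None]
    neg = [s for s, y in zip(scores, labels) if y == 0 and s is not None]
    if not pos or not neg:
        raise ValueError("need at least one scored variant of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented toy data: four pathogenic (1) and four neutral (0) variants.
scores = [0.9, 0.8, 0.3, None, 0.6, 0.2, 0.1, 0.4]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

counts = confusion_counts(scores, labels, threshold=0.5)  # 0.5 is a hypothetical cutoff
auc = roc_auc(scores, labels)
print(counts)
print(round(auc, 3))
```

The pairwise formulation is quadratic in the number of variants, which is fine for a toy example; for benchmark-sized data a rank-based implementation such as a library ROC routine would be the practical choice. Note that the threshold affects only the counts, while the AUC is threshold-free.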
Fig. 3 Performance comparison for rare and ultra-rare variants for (a) variants with different molecular consequences and (b) the missense subset. Each dot represents the mean AUC value with standard deviation
Fig. 4 Performance comparison of CAPICE and CADD for variants of different molecular consequences
Fig. 5 Performance comparison in real cases. In total, 54 patients and 58 variants were included. Each variant is reported as the diagnosis for that patient. Each dot in the plot shows a variant. The color of the dot represents the molecular effect predicted by VEP