Michael Ferlaino, Mark F Rogers, Hashem A Shihab, Matthew Mort, David N Cooper, Tom R Gaunt, Colin Campbell.
Abstract
BACKGROUND: Small insertions and deletions (indels) have a significant influence on human disease and, in terms of frequency, are second only to single nucleotide variants as pathogenic mutations. As the majority of mutations associated with complex traits are located outside the exome, it is crucial to investigate the potential pathogenic impact of indels in non-coding regions of the human genome.
Keywords: Indels; Non-coding genome; Support vector machines; Variant prioritisation
Year: 2017 PMID: 28985712 PMCID: PMC5955213 DOI: 10.1186/s12859-017-1862-y
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Nested cross validation. To implement nested cross validation (NCV), we split the data set into ten stratified folds. The figure shows one of the ten NCV loops. For each NCV iteration, an independent testing set (F (10) in the figure) is left out to assess FATHMM-indel’s performance. The remaining folds (red sets in the figure) are merged to create the tuning set used to learn, under cross validation, the optimal values of the hyperparameters. Crucially, a different fold is used as the testing set in each iteration, so that all data are used to evaluate FATHMM-indel’s performance
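The loop structure described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a toy one-dimensional k-nearest-neighbour classifier stands in for the real model, the data are synthetic, the inner loop uses five folds (the caption does not say how many), and every function name is hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed):
    """Split sample positions into k folds with roughly equal class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for pos, label in enumerate(labels):
        by_class[label].append(pos)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        for n, pos in enumerate(members):
            folds[n % k].append(pos)
    return folds


def knn_predict(train_x, train_y, x, k):
    """Toy 1-D k-nearest-neighbour vote (a stand-in for the real classifier)."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > k else 0


def balanced_accuracy(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


def evaluate(x, y, train_idx, test_idx, k):
    """Fit on train_idx, score on test_idx, for hyperparameter value k."""
    tx = [x[i] for i in train_idx]
    ty = [y[i] for i in train_idx]
    preds = [knn_predict(tx, ty, x[i], k) for i in test_idx]
    return balanced_accuracy([y[i] for i in test_idx], preds)


def nested_cv(x, y, grid, n_outer=10, n_inner=5):
    outer = stratified_folds(y, n_outer, seed=0)
    outer_scores = []
    for i, test_idx in enumerate(outer):
        # Merge the remaining outer folds into the tuning set.
        tune_idx = [p for j, fold in enumerate(outer) if j != i for p in fold]
        # Inner folds index into tune_idx, so map positions back to samples.
        inner = [[tune_idx[p] for p in fold]
                 for fold in stratified_folds([y[p] for p in tune_idx],
                                              n_inner, seed=i + 1)]

        def inner_score(k):
            vals = []
            for m, val_idx in enumerate(inner):
                fit_idx = [p for n, fold in enumerate(inner) if n != m for p in fold]
                vals.append(evaluate(x, y, fit_idx, val_idx, k))
            return sum(vals) / len(vals)

        # Hyperparameter chosen on the tuning set only, never on the test fold.
        best_k = max(grid, key=inner_score)
        outer_scores.append(evaluate(x, y, tune_idx, test_idx, best_k))
    return outer_scores


# Synthetic two-class data (illustrative only).
rng = random.Random(42)
y = [1] * 100 + [0] * 100
x = [rng.gauss(1.0 if label else -1.0, 1.0) for label in y]
scores = nested_cv(x, y, grid=[1, 5, 15])
print(len(scores), round(sum(scores) / len(scores), 3))
```

The property the sketch preserves is that the held-out fold plays no part in choosing the hyperparameter, so the ten outer scores estimate performance on unseen data.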
NCV experiment results. FATHMM-indel’s performance across 50 data sets created by randomly subsampling the neutral (EVS) class
| Sensitivity (SE) | Specificity (SE) | Balanced accuracy (SE) | AUC (SE) |
|---|---|---|---|
| 0.886 (0.005) | 0.891 (0.005) | 0.889 (0.004) | 0.950 (0.003) |
The small standard errors (SEs) indicate that performance is stable across the random EVS subsamples, so it is reasonable to train the final model on a single random EVS sample
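As a rough illustration of how a mean and standard error over repeated subsamples can be summarised, the sketch below assumes the SE is the sample standard deviation divided by the square root of the number of subsamples. The source does not state the exact formula used, and the input values are made up for illustration.

```python
import math
import statistics


def mean_and_se(values):
    """Return the mean and the standard error of the mean for a list of scores."""
    n = len(values)
    mean = statistics.fmean(values)
    se = statistics.stdev(values) / math.sqrt(n)  # sample SD / sqrt(n)
    return mean, se


# Hypothetical per-subsample AUC values (not the paper's data).
aucs = [0.94, 0.95, 0.96, 0.95, 0.95]
mean_auc, se_auc = mean_and_se(aucs)
print(f"{mean_auc:.3f} ({se_auc:.4f})")
```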
Validation, on benchmark data, against published methods
| Method | Sensitivity | Specificity | Balanced accuracy | MCC |
|---|---|---|---|---|
| FATHMM-indel | 0.905 | 0.887 | 0.896 | 0.793 |
| CADD | 0.669 | 0.934 | 0.802 | 0.626 |
| GAVIN | 0.611 | 0.934 | 0.773 | 0.576 |
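The columns of this table are standard confusion-matrix summaries. Balanced accuracy is the mean of sensitivity and specificity, which can be checked against the FATHMM-indel row: (0.905 + 0.887) / 2 = 0.896. The sketch below computes all four quantities from raw counts; the counts in the example are hypothetical, since the benchmark's confusion matrices are not given here.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, balanced accuracy and MCC from confusion counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    balanced_accuracy = 0.5 * (sensitivity + specificity)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, balanced_accuracy, mcc


# Hypothetical counts, for illustration only.
sens, spec, bal_acc, mcc = classification_metrics(tp=90, fn=10, tn=80, fp=20)
print(sens, spec, bal_acc, round(mcc, 3))
```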
Fig. 2 Empirical ROC curves for FATHMM-indel and CADD. Performance comparison, on benchmark data, between FATHMM-indel and CADD. ROC curves display sensitivities and false positive rates at all possible cutoff levels. Therefore, they can be used to assess the performance of a model independently of the decision threshold
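An empirical ROC curve of this kind can be built by sweeping the cutoff over every distinct score and recording the false positive rate and sensitivity at each step. The following is a minimal sketch with made-up scores and labels; the function names are hypothetical, and the area under the curve is obtained with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """Return (false positive rate, sensitivity) pairs, one per distinct cutoff."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample sharing this score (handles ties).
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Illustrative scores: 1 = pathogenic, 0 = neutral.
example_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
example_labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(example_scores, example_labels)
auc_value = area_under_curve(curve)
print(curve, auc_value)
```

In this toy example, eight of the nine pathogenic/neutral pairs are ranked correctly, so the area is 8/9.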
Fig. 3 Frequency spectrum for 1000 Genomes (1 KG) indels predicted as pathogenic. Comparison between non-coding variants across populations, stratified according to allele frequency (AF < 1% for rare indels and AF > 5% for common indels)
Fig. 4 Cautious classification of benchmark indels. Balanced accuracy, over validation data, as a function of the decision threshold. By selecting only predictions with highest confidence, FATHMM-indel is capable of achieving almost perfect classification
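The idea of cautious classification can be sketched as a reject option: a prediction is kept only when its score is far enough from the decision boundary. The example below assumes scores lie in [0, 1] with higher values meaning more likely pathogenic, and uses a symmetric confidence threshold `tau`. Both are assumptions made for illustration, and the scores are made up.

```python
def cautious_predict(score, tau):
    """Return 1 (pathogenic), 0 (neutral), or None to abstain; tau is in [0.5, 1]."""
    if score >= tau:
        return 1
    if score <= 1.0 - tau:
        return 0
    return None


def cautious_balanced_accuracy(scores, labels, tau):
    """Balanced accuracy over retained predictions, plus the fraction retained."""
    kept = [(cautious_predict(s, tau), y) for s, y in zip(scores, labels)]
    kept = [(p, y) for p, y in kept if p is not None]
    pos = [p for p, y in kept if y == 1]
    neg = [p for p, y in kept if y == 0]
    coverage = len(kept) / len(scores)
    if not pos or not neg:
        return None, coverage  # undefined without both classes
    sensitivity = sum(p == 1 for p in pos) / len(pos)
    specificity = sum(p == 0 for p in neg) / len(neg)
    return 0.5 * (sensitivity + specificity), coverage


# Illustrative scores: 1 = pathogenic, 0 = neutral.
demo_scores = [0.95, 0.85, 0.55, 0.45, 0.60, 0.30, 0.10, 0.05]
demo_labels = [1, 1, 1, 1, 0, 0, 0, 0]
for tau in (0.5, 0.8):
    print(tau, cautious_balanced_accuracy(demo_scores, demo_labels, tau))
```

Raising `tau` trades coverage for accuracy. In this toy data, balanced accuracy rises from 0.75 with every prediction kept to 1.0 with half of them kept.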