Weitai Huang, Yu Amanda Guo, Karthik Muthukumar, Probhonjon Baruah, Mei Mei Chang, Anders Jacobsen Skanderup.
Abstract
SUMMARY: Somatic Mutation calling method using a Random Forest (SMuRF) integrates predictions and auxiliary features from multiple somatic mutation callers using a supervised machine learning approach. SMuRF is trained on community-curated matched tumor and normal whole genome sequencing data. SMuRF predicts both SNVs and indels with high accuracy in genome- or exome-level sequencing data. Furthermore, the method is robust across multiple tested cancer types and predicts low allele frequency variants with high accuracy. In contrast to existing ensemble-based somatic mutation calling approaches, SMuRF works out-of-the-box and is orders of magnitude faster.
Year: 2019 PMID: 30649191 PMCID: PMC6735703 DOI: 10.1093/bioinformatics/btz018
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Performance of SMuRF. Precision-recall profiles for individual somatic mutation callers and SMuRF evaluated on (A) SNVs and (B) indels using 20% withheld test data. Curves show the performance of the individual algorithms under different variant score thresholds (MuTect2 tumor log-odds score, Freebayes log-odds score, VarDict SSF score, VarScan SSC score and SMuRF confidence score). Solid points refer to the default performance of the caller in the bcbio-nextgen workflow. Black solid points denote the accuracy of calls identified by the majority-voting scheme in bcbio-nextgen (at least 1, 2, 3 or 4 callers). The grey contours indicate F1 scores as a function of recall and precision. (C) Accuracy of SMuRF and individual callers as a function of somatic variant allele frequency in the test set; F1 scores evaluated for each variant allele frequency bin. (D–F) Evaluation of SMuRF and SomaticSeq performance when trained and tested across different cancer types. (D) Models were trained on 70% of CLL data and tested on 30% of MB data (and vice versa). F1 scores were recorded for SMuRF and SomaticSeq SNV (E) and indel (F) predictions. Error bars represent the standard deviation of the mean across 10 random training/test data splits (same splits for both methods).
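The majority-voting baseline and the F1 contours mentioned in the caption can be illustrated with a minimal Python sketch. It is not the bcbio-nextgen or SMuRF implementation: the function names (`majority_vote`, `precision_recall_f1`), the variant tuples and the per-caller call sets are all hypothetical toy values chosen for illustration. The sketch keeps a variant if at least `min_callers` callers report it, then scores the result against a truth set, with F1 taken as the harmonic mean of precision and recall.

```python
from typing import Dict, Set, Tuple

# A variant is identified here by (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def majority_vote(calls: Dict[str, Set[Variant]], min_callers: int) -> Set[Variant]:
    """Keep variants reported by at least `min_callers` of the callers."""
    counts: Dict[Variant, int] = {}
    for variants in calls.values():
        for variant in variants:
            counts[variant] = counts.get(variant, 0) + 1
    return {variant for variant, n in counts.items() if n >= min_callers}


def precision_recall_f1(predicted: Set[Variant], truth: Set[Variant]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of a predicted call set against a truth set."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy data (not taken from the paper).
A: Variant = ("chr1", 100, "A", "T")
B: Variant = ("chr1", 200, "G", "C")
C: Variant = ("chr2", 300, "T", "G")
D: Variant = ("chr2", 400, "C", "A")
E: Variant = ("chr3", 500, "G", "T")

TRUTH: Set[Variant] = {A, B, C}
CALLS: Dict[str, Set[Variant]] = {
    "mutect2": {A, B, D},
    "freebayes": {A, B, C, E},
    "vardict": {A, C, D},
    "varscan": {A, B},
}

if __name__ == "__main__":
    # Sweep the voting threshold, as in "at least 1, 2, 3 or 4 callers".
    for k in range(1, len(CALLS) + 1):
        p, r, f1 = precision_recall_f1(majority_vote(CALLS, k), TRUTH)
        print(f"min_callers={k}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Raising the threshold trades recall for precision, which is why the voting scheme appears as a handful of discrete points on a precision-recall plot rather than as a continuous curve.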