SMuRF: portable and accurate ensemble prediction of somatic mutations.

Weitai Huang¹,², Yu Amanda Guo¹, Karthik Muthukumar¹, Probhonjon Baruah¹, Mei Mei Chang¹, Anders Jacobsen Skanderup¹.

Abstract

SUMMARY: Somatic Mutation calling method using a Random Forest (SMuRF) integrates predictions and auxiliary features from multiple somatic mutation callers using a supervised machine learning approach. SMuRF is trained on community-curated matched tumor and normal whole genome sequencing data. SMuRF predicts both SNVs and indels with high accuracy in genome- or exome-level sequencing data. Furthermore, the method is robust across multiple tested cancer types and predicts low allele frequency variants with high accuracy. In contrast to existing ensemble-based somatic mutation calling approaches, SMuRF works out-of-the-box and is orders of magnitude faster.
AVAILABILITY AND IMPLEMENTATION: The method is implemented in R and available at https://github.com/skandlab/SMuRF. SMuRF operates as an add-on to the community-developed bcbio-nextgen somatic variant calling pipeline.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2019 PMID: 30649191 PMCID: PMC6735703 DOI: 10.1093/bioinformatics/btz018

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Identification of somatic mutations from matched tumor and normal samples is challenged by sequencing noise and alignment ambiguities as well as the heterogeneous composition of tumors. Recent studies have revealed low concordance between existing methods for somatic variant calling (Hwang et al.; Kroigard et al.; O’Rawe et al.; Roberts et al.). Additionally, a benchmark study demonstrated that the accuracy of a given somatic mutation calling algorithm can vary extensively across different workflows and pipelines (Alioto et al.). Factors contributing to this variation may include the choice of alignment algorithm, the use of local re-alignment and the configuration of a multitude of post-processing filters. The consensus of multiple callers has been used to improve the accuracy of somatic variant calling (Callari et al.; Ellrott et al.; Rashid et al.). Taking this one step further, a machine learning-based ensemble method may combine multiple mutation callers with auxiliary sequence and alignment features to improve mutation calling accuracy (Ding et al.; Fang et al.; Wood et al.). While such approaches may improve accuracy, they are generally not portable: the end-user must obtain suitable training and testing datasets and needs knowledge of machine learning (Supplementary Fig. S1A). There is therefore a need for accurate, pre-trained ensemble approaches for somatic mutation calling that can be ported between research groups. Here, we developed a Somatic Mutation calling method using a Random Forest (SMuRF), which combines predictions from four mutation callers with auxiliary alignment and mutation features using supervised machine learning (Supplementary Fig. S1B).

2 Implementation

SMuRF is available as an R package. Briefly, the bcbio-nextgen framework (https://github.com/chapmanb/bcbio-nextgen) is used to generate somatic variant calls from four different methods: MuTect2 (Cibulskis et al.), Freebayes somatic (arXiv: https://arxiv.org/abs/1207.3907), VarDict (Lai et al.) and VarScan (Koboldt et al.). Variant and auxiliary features are extracted from the VCF files. The SMuRF random forest model is pre-trained on a gold standard set of mutation calls curated by the International Cancer Genome Consortium (ICGC) community using deep (>100×) whole genome sequencing (WGS) of two tumors (Alioto et al.). Feature extraction and prediction of somatic variants take ∼10 min for tumor-normal WGS data on a standard computer (4 CPUs, 16 GB RAM).
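
To make the feature-extraction step more concrete, the following Python sketch shows one way per-caller calls could be merged into a single feature row per candidate variant before classification. It is an illustration under stated assumptions, not the SMuRF implementation (which is in R): the function name, the feature names and the toy scores are hypothetical, and real input would be parsed from the VCF files produced by bcbio-nextgen.

```python
from collections import defaultdict

# Caller labels used as feature-name prefixes (hypothetical naming).
CALLERS = ("mutect2", "freebayes", "vardict", "varscan")


def build_feature_rows(calls_by_caller):
    """Merge per-caller calls into one feature row per candidate variant.

    calls_by_caller maps a caller label to a dict of
    (chrom, pos, ref, alt) -> caller-specific score.
    """
    rows = defaultdict(dict)
    for caller in CALLERS:
        for variant, score in calls_by_caller.get(caller, {}).items():
            rows[variant][f"{caller}_called"] = 1
            rows[variant][f"{caller}_score"] = score
    # Callers that did not report a variant get an explicit "not called" entry.
    for features in rows.values():
        for caller in CALLERS:
            features.setdefault(f"{caller}_called", 0)
            features.setdefault(f"{caller}_score", None)
    return dict(rows)


# Toy input with made-up scores; only two of the four callers report anything.
example_calls = {
    "mutect2": {("chr1", 100, "A", "T"): 12.5},
    "vardict": {("chr1", 100, "A", "T"): 0.01, ("chr2", 200, "G", "C"): 0.2},
}
feature_rows = build_feature_rows(example_calls)
```

Each row then holds one binary "called" flag and one score per caller, which is the kind of table a random forest classifier can be trained on. Auxiliary features such as mapping and base quality would be added as further columns.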

3 Overview

SMuRF SNV and indel models were trained on matched tumor-normal WGS data from a chronic lymphocytic leukemia (CLL) patient and a medulloblastoma (MB) patient, where the true somatic mutations have been identified and curated by the ICGC (Alioto et al.). The training data were augmented to expose the model to additional variation in sequencing coverage, tumor purity and tumor/normal coverage imbalance (Supplementary Fig. S1C and Supplementary Methods). SMuRF was trained on 80% of the data, with 20% of the data withheld as a test set. Highly predictive features were mostly somatic variant scores provided by individual methods as well as mapping and base quality estimates (Supplementary Table S1). SMuRF achieved F1-scores of 0.88 and 0.74 for SNVs and indels, respectively (Fig. 1A and B, Supplementary Tables S2 and S3 and Supplementary Fig. S3 for SNV coding regions). Importantly, SMuRF showed improved accuracy over the best mutation calling submissions reported in the benchmark by Alioto et al. using the same dataset (best reported F1-scores 0.79 and 0.65 for SNVs and indels, respectively). In our analysis, while individual methods could recover most of the true SNVs, this came at the cost of very low precision (<40% precision at 80% recall) (Supplementary Table S2). In contrast, SMuRF could recover 86% of the true SNVs (recall) at 92% precision on the withheld test set. While a simple consensus approach using the intersection of individual methods performed well (F1 = 0.82), SMuRF achieved markedly higher recall (86% versus 74%) at a similar level of precision. All methods, including SMuRF, were mostly robust when tested under different levels of tumor purity (Supplementary Fig. S4). However, SMuRF showed substantially improved accuracy at low somatic variant allele frequencies (VAFs) as compared to individual methods (Fig. 1C), which is particularly important in the setting of tumor heterogeneity inference (Shi et al.).
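
The consensus baselines and the precision, recall and F1 figures discussed above follow standard definitions that are easy to compute for any call set. The Python sketch below is illustrative only: the function names, the toy variant identifiers and the toy truth set are assumptions made for this example and are not taken from SMuRF or bcbio-nextgen.

```python
def consensus_calls(calls_by_caller, min_callers):
    """Return the variants reported by at least `min_callers` callers."""
    counts = {}
    for variants in calls_by_caller.values():
        for variant in set(variants):
            counts[variant] = counts.get(variant, 0) + 1
    return {v for v, n in counts.items() if n >= min_callers}


def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(predicted, truth):
    """Score a set of predicted variants against a set of true variants."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall, f1_from(precision, recall)


# Toy data: "v*" are true somatic variants, "x*" are false positives.
truth = {"v1", "v2", "v3", "v4"}
toy_calls = {
    "caller_a": {"v1", "v2", "v3", "x1"},
    "caller_b": {"v1", "v2", "x1", "x2"},
    "caller_c": {"v1", "v3", "x3"},
    "caller_d": {"v1", "v4", "x1"},
}

# Requiring more callers trades recall for precision.
scores_by_threshold = {
    k: precision_recall_f1(consensus_calls(toy_calls, k), truth)
    for k in (1, 2, 3, 4)
}
```

In this toy example the union of all callers (`k = 1`) recovers every true variant but includes three false positives, whereas the four-caller intersection (`k = 4`) is fully precise but keeps only one of the four true variants. This is the trade-off that a trained ensemble model aims to improve on.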
We further benchmarked SMuRF SNV calling using independent data from the DREAM Somatic Mutation Challenge, where artificial tumor data were generated using an in silico approach (Ewing et al.). While the performance of individual methods varied across these datasets, SMuRF was highly accurate across all synthetic tumors (F1 > 0.8) (Supplementary Fig. S5). Overall, these results support that SMuRF is robust and can generalize to unseen data.
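
Stratified evaluations such as the per-VAF-bin comparison reported for Fig. 1C can be sketched in the same spirit. The Python example below is a minimal illustration, assuming each variant carries a single VAF value and that bins are half-open intervals; the function name, bin edges and toy data are hypothetical and do not reproduce the binning used in the paper.

```python
def f1_by_vaf_bin(predicted, truth, vaf, bin_edges):
    """Compute F1 separately for each half-open VAF bin [low, high).

    predicted, truth: sets of variant identifiers
    vaf: dict mapping every variant identifier to its allele frequency
    bin_edges: ascending list of bin boundaries
    Returns a dict (low, high) -> F1, or None for a bin with no variants.
    """
    results = {}
    for low, high in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = {v for v, freq in vaf.items() if low <= freq < high}
        predicted_in_bin = predicted & in_bin
        truth_in_bin = truth & in_bin
        tp = len(predicted_in_bin & truth_in_bin)
        fp = len(predicted_in_bin) - tp
        fn = len(truth_in_bin) - tp
        denominator = 2 * tp + fp + fn
        results[(low, high)] = 2 * tp / denominator if denominator else None
    return results


# Toy data: three low-VAF candidates, two higher-VAF candidates.
toy_vaf = {"a": 0.05, "b": 0.08, "c": 0.30, "d": 0.45, "e": 0.07}
toy_truth = {"a", "b", "c", "d"}
toy_predicted = {"a", "c", "d", "e"}

binned_f1 = f1_by_vaf_bin(toy_predicted, toy_truth, toy_vaf, [0.0, 0.1, 0.5, 1.0])
```

Here the low-VAF bin contains one true positive, one false positive and one missed variant, so its F1 is lower than that of the higher-VAF bin, where both true variants are recovered. Reporting F1 per bin in this way shows where a caller loses accuracy, which an aggregate score would hide.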

Fig. 1.

Performance of SMuRF. Precision-recall profiles for individual somatic mutation callers and SMuRF evaluated on (A) SNVs and (B) indels using 20% withheld test data. Curves show the performance of the individual algorithms under different variant score thresholds (MuTect2 tumor log-odds score, Freebayes log-odds score, VarDict SSF score, VarScan SSC score and SMuRF confidence score). Solid points refer to the default performance of the caller in the bcbio-nextgen workflow. Black solid points denote the accuracy of calls identified by the majority-voting scheme in bcbio-nextgen (at least 1, 2, 3 or 4 callers). The grey contours indicate F1 scores as a function of recall and precision. (C) Accuracy of SMuRF and individual callers as a function of somatic variant allele frequency in the test set; F1 scores evaluated for each variant allele frequency bin. (D–F) Evaluation of SMuRF and SomaticSeq performance when trained and tested across different cancer types. (D) Models were trained on 70% of CLL data and tested on 30% of MB data (and vice versa). F1 scores were recorded for SMuRF and SomaticSeq SNV (E) and indel (F) predictions. Error bars represent the standard deviation of the mean across 10 random training/test data splits (same splits for both methods).

Analysis of indel prediction accuracy showed that individual mutation callers could recover most of the true indels (64–94% recall), but only at the cost of very low precision (<8%). Interestingly, simple consensus approaches performed well for indel prediction (F1 0.46 and 0.66 for 3- and 4-caller consensus, respectively). However, while consensus methods suffered from either low recall (0.55) or precision (0.31), SMuRF obtained high indel prediction accuracy (F1 = 0.74) with both high recall (0.74) and precision (0.75) (Fig. 1B, Supplementary Table S3).

We also analyzed the extent to which SMuRF predicts the same somatic mutations in tumor samples profiled with both whole exome sequencing (WES, >200× coverage) and WGS (<100× coverage). When restricting analysis to variants in coding regions, SMuRF predicted somatic SNVs and indels with comparable or higher concordance than individual methods (Supplementary Figs S6 and S7).

Finally, we compared SMuRF to two existing machine learning-based methods. The first was MutationSeq, a pre-trained ensemble SNV caller (Ding et al.), which achieved an F1-score of 0.68, similar to the other individual SNV callers in our analysis (Supplementary Fig. S8). Next, we compared the performance of SomaticSeq (Fang et al.), a method that requires users to train their own predictive model (see Supplementary Methods). The trained SomaticSeq model had slightly increased test set prediction accuracy over SMuRF for both SNVs (0.90 versus 0.88) and indels (0.78 versus 0.75) (Supplementary Fig. S9B and C). We further evaluated how the methods generalized when models were trained and tested across different tumor datasets and found that SomaticSeq showed greater test accuracy variation (Fig. 1D–F). This was especially pronounced for indel prediction, where the F1 accuracy of SomaticSeq varied from 0.48 to 0.73 (SMuRF 0.63–0.65) when tested on the MB or CLL sample, respectively. Furthermore, SomaticSeq used ∼24 h to predict both SNVs and indels since it also computes auxiliary features from the raw alignment data. In contrast, SMuRF depends only on VCF files and predicts both SNVs and indels in ∼10 min (Supplementary Fig. S9A). Overall, these results support that SMuRF is both accurate and computationally efficient.

In summary, SMuRF is an accurate, portable and user-friendly ensemble-based somatic mutation caller, which should benefit both cancer genomics studies and clinical applications.

References

1. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing.

Authors: Daniel C Koboldt; Qunyuan Zhang; David E Larson; Dong Shen; Michael D McLellan; Ling Lin; Christopher A Miller; Elaine R Mardis; Li Ding; Richard K Wilson
Journal: Genome Res Date: 2012-02-02 Impact factor: 9.043

2. Feature-based classifiers for somatic mutation detection in tumour-normal paired sequencing data.

Authors: Jiarui Ding; Ali Bashashati; Andrew Roth; Arusha Oloumi; Kane Tse; Thomas Zeng; Gholamreza Haffari; Martin Hirst; Marco A Marra; Anne Condon; Samuel Aparicio; Sohrab P Shah
Journal: Bioinformatics Date: 2011-11-13 Impact factor: 6.937

3. Cake: a bioinformatics pipeline for the integrated analysis of somatic variants in cancer genomes.

Authors: Mamunur Rashid; Carla Daniela Robles-Espinoza; Alistair G Rust; David J Adams
Journal: Bioinformatics Date: 2013-06-25 Impact factor: 6.937

4. Combining tumor genome simulation with crowdsourcing to benchmark somatic single-nucleotide-variant detection.

Authors: Adam D Ewing; Kathleen E Houlahan; Yin Hu; Kyle Ellrott; Cristian Caloian; Takafumi N Yamaguchi; J Christopher Bare; Christine P'ng; Daryl Waggott; Veronica Y Sabelnykova; Michael R Kellen; Thea C Norman; David Haussler; Stephen H Friend; Gustavo Stolovitzky; Adam A Margolin; Joshua M Stuart; Paul C Boutros
Journal: Nat Methods Date: 2015-05-18 Impact factor: 28.547

5. A comparative analysis of algorithms for somatic SNV detection in cancer.

Authors: Nicola D Roberts; R Daniel Kortschak; Wendy T Parker; Andreas W Schreiber; Susan Branford; Hamish S Scott; Garique Glonek; David L Adelson
Journal: Bioinformatics Date: 2013-07-09 Impact factor: 6.937

6. An ensemble approach to accurately detect somatic mutations using SomaticSeq.

Authors: Li Tai Fang; Pegah Tootoonchi Afshar; Aparna Chhibber; Marghoob Mohiyuddin; Yu Fan; John C Mu; Greg Gibeling; Sharon Barr; Narges Bani Asadi; Mark B Gerstein; Daniel C Koboldt; Wenyi Wang; Wing H Wong; Hugo Y K Lam
Journal: Genome Biol Date: 2015-09-17 Impact factor: 13.583

7. Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing.

Authors: Jason O'Rawe; Tao Jiang; Guangqing Sun; Yiyang Wu; Wei Wang; Jingchu Hu; Paul Bodily; Lifeng Tian; Hakon Hakonarson; W Evan Johnson; Zhi Wei; Kai Wang; Gholson J Lyon
Journal: Genome Med Date: 2013-03-27 Impact factor: 11.117

8. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples.

Authors: Kristian Cibulskis; Michael S Lawrence; Scott L Carter; Andrey Sivachenko; David Jaffe; Carrie Sougnez; Stacey Gabriel; Matthew Meyerson; Eric S Lander; Gad Getz
Journal: Nat Biotechnol Date: 2013-02-10 Impact factor: 54.908

9. Systematic comparison of variant calling pipelines using gold standard personal exome variants.

Authors: Sohyun Hwang; Eiru Kim; Insuk Lee; Edward M Marcotte
Journal: Sci Rep Date: 2015-12-07 Impact factor: 4.379

10. A comprehensive assessment of somatic mutation detection in cancer using whole-genome sequencing.

Authors: Tyler S Alioto; Ivo Buchhalter; Sophia Derdak; Barbara Hutter; Matthew D Eldridge; Eivind Hovig; Lawrence E Heisler; Timothy A Beck; Jared T Simpson; Laurie Tonon; Anne-Sophie Sertier; Ann-Marie Patch; Natalie Jäger; Philip Ginsbach; Ruben Drews; Nagarajan Paramasivam; Rolf Kabbe; Sasithorn Chotewutmontri; Nicolle Diessl; Christopher Previti; Sabine Schmidt; Benedikt Brors; Lars Feuerbach; Michael Heinold; Susanne Gröbner; Andrey Korshunov; Patrick S Tarpey; Adam P Butler; Jonathan Hinton; David Jones; Andrew Menzies; Keiran Raine; Rebecca Shepherd; Lucy Stebbings; Jon W Teague; Paolo Ribeca; Francesc Castro Giner; Sergi Beltran; Emanuele Raineri; Marc Dabad; Simon C Heath; Marta Gut; Robert E Denroche; Nicholas J Harding; Takafumi N Yamaguchi; Akihiro Fujimoto; Hidewaki Nakagawa; Víctor Quesada; Rafael Valdés-Mas; Sigve Nakken; Daniel Vodák; Lawrence Bower; Andrew G Lynch; Charlotte L Anderson; Nicola Waddell; John V Pearson; Sean M Grimmond; Myron Peto; Paul Spellman; Minghui He; Cyriac Kandoth; Semin Lee; John Zhang; Louis Létourneau; Singer Ma; Sahil Seth; David Torrents; Liu Xi; David A Wheeler; Carlos López-Otín; Elías Campo; Peter J Campbell; Paul C Boutros; Xose S Puente; Daniela S Gerhard; Stefan M Pfister; John D McPherson; Thomas J Hudson; Matthias Schlesner; Peter Lichter; Roland Eils; David T W Jones; Ivo G Gut
Journal: Nat Commun Date: 2015-12-09 Impact factor: 14.919
