Cake: a bioinformatics pipeline for the integrated analysis of somatic variants in cancer genomes.

Mamunur Rashid, Carla Daniela Robles-Espinoza, Alistair G Rust, David J Adams.

Abstract

We have developed Cake, a bioinformatics software pipeline that integrates four publicly available somatic variant-calling algorithms to identify single nucleotide variants with higher sensitivity and accuracy than any one algorithm alone. Cake can be run on a high-performance computer cluster or used as a stand-alone application. Availability: Cake is open-source and is available from http://cakesomatic.sourceforge.net/

Year:  2013        PMID: 23803469      PMCID: PMC3740632          DOI: 10.1093/bioinformatics/btt371

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

The development of next-generation sequencing technologies has made it possible to generate more comprehensive catalogues of somatic alterations in cancer genomes than ever before. Software tools to find these variants deploy different mathematical approaches to interrogate the genome sequences of tumour/germline paired samples. For example, the variant detectors Bambino (Edmonson et al., 2011) and VarScan 2 (Koboldt et al., 2012) both identify somatic variants by comparing alternative allele frequencies between tumour and normal sequences. VarScan 2 uses a Fisher’s exact test and Bambino a Bayesian scoring model to identify somatic variants in paired samples. Other algorithms include CaVEMan (Stephens et al., 2012) and SAMtools mpileup (Li et al., 2009), which compute the genotype likelihood of nucleotide positions in tumour and normal genome sequences by use of an expectation-maximization method. Putative raw variant calls made by these algorithms typically undergo further filtering. For example, known single nucleotide polymorphisms (SNPs) present in dbSNP (Sherry et al., 2001) or in the 1000 Genomes project (The 1000 Genomes Project Consortium, 2012), or sites with low mapping qualities are usually filtered from the final somatic call set. Validation rates ultimately depend on the stringency of this filtering of putative sites. Intriguingly, applying different variant-calling algorithms to the same data often results in a set of only partially overlapping somatic single nucleotide variant (SNV) calls. To illustrate this phenomenon, we deployed four publicly available somatic variant-calling algorithms (Bambino, CaVEMan, SAMtools mpileup and VarScan 2) on a dataset composed of 24 human hepatocellular carcinoma tumour/germline exome pairs (Guichard et al., 2012). Because this study reported 994 validated somatic variants identified using the independent CASAVA pipeline, we used these data to gauge the performance of each algorithm. 
This analysis revealed at best a 43.8% overlap between SNV calls made by any two of these widely used callers, and at worst a 6.45% overlap (Supplementary Table S1). Notable, however, was the fact that the majority of validated calls were identified by two or more algorithms, suggesting that a merging approach may improve both the sensitivity and accuracy of somatic variant calling. See the Supplementary Material for details. In an effort to take advantage of existing software tools and to improve variant detection, we developed Cake (Supplementary Fig. S1). Cake is a fully configurable bioinformatics pipeline that integrates four single nucleotide somatic variant-calling algorithms (Bambino, CaVEMan, SAMtools mpileup, and VarScan 2) and deploys an extensive collection of fully customizable post-processing filtering steps. We show that the performance of Cake exceeds any one algorithm for somatic SNV detection, making it an optimal tool for cancer genome analysis.
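The pairwise comparison of call sets described above can be illustrated with a minimal sketch. The text does not state how overlap was measured, so this example assumes the Jaccard index (shared calls divided by the union of calls); the function name `pairwise_overlap` and the toy variants are hypothetical and are not data from the study.

```python
from itertools import combinations


def pairwise_overlap(call_sets):
    """Return {(caller_a, caller_b): percent overlap} for every caller pair.

    Overlap is computed here as the Jaccard index (an assumption; the
    paper does not specify its overlap metric).
    """
    overlaps = {}
    for a, b in combinations(sorted(call_sets), 2):
        shared = call_sets[a] & call_sets[b]
        union = call_sets[a] | call_sets[b]
        overlaps[(a, b)] = 100.0 * len(shared) / len(union) if union else 0.0
    return overlaps


# Toy call sets keyed by (chromosome, position, ref, alt); illustrative only.
toy_calls = {
    "Bambino": {("1", 100, "A", "T"), ("1", 200, "C", "G"), ("2", 50, "G", "A")},
    "VarScan2": {("1", 100, "A", "T"), ("2", 50, "G", "A"), ("3", 75, "T", "C")},
    "mpileup": {("1", 100, "A", "T")},
}

overlap = pairwise_overlap(toy_calls)
for pair, percent in overlap.items():
    print(pair, f"{percent:.1f}%")
```

Representing each call as a (chromosome, position, ref, alt) tuple keeps the comparison independent of any caller-specific output format.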

2 IMPLEMENTATION

Cake is implemented in Perl, enabling the configuration, execution and monitoring of the four callers in a high-performance computing environment using a job scheduler. Alternatively, Cake can be configured to run in stand-alone mode on a single computer (see the User Manual on SourceForge for more details). The standard Cake workflow is to run all of the algorithms individually, merge the SNVs predicted by at least two somatic callers (Supplementary Fig. S2) and then apply the post-processing filters. This configuration can, however, be easily adjusted as required (Supplementary Table S2). The existing choice of algorithms can also be modified using a template we provide. A package containing wrappers around the callers, the post-processing modules and an installation script is available for download.
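The merge-then-filter workflow can be sketched as follows. This is an illustration in Python rather than Cake's own Perl code: the names `merge_calls` and `apply_filters`, the tuple representation of a variant and the toy known-SNP filter are all assumptions made for the example, not part of the Cake package.

```python
from collections import Counter


def merge_calls(call_sets, min_callers=2):
    """Keep variants reported by at least `min_callers` callers.

    Returns a dict mapping each retained variant to its number of
    supporting callers.
    """
    support = Counter()
    for variants in call_sets.values():
        support.update(set(variants))  # each caller votes once per variant
    return {v: n for v, n in support.items() if n >= min_callers}


def apply_filters(variants, filters):
    """Return the variants that pass every post-processing predicate."""
    return {v for v in variants if all(keep(v) for keep in filters)}


# Toy calls keyed by (chromosome, position, ref, alt); illustrative only.
v1, v2 = ("1", 100, "A", "T"), ("1", 200, "C", "G")
v3, v4 = ("2", 50, "G", "A"), ("3", 75, "T", "C")
toy_calls = {
    "Bambino": {v1, v2},
    "CaVEMan": {v1, v2, v4},
    "mpileup": {v1, v4},
    "VarScan2": {v1, v3, v4},
}

# Hypothetical filter: drop sites that match a known germline SNP.
known_snps = {v2}
not_known_snp = lambda variant: variant not in known_snps

merged = merge_calls(toy_calls, min_callers=2)
final = apply_filters(merged, [not_known_snp])
print(sorted(final))
```

Raising `min_callers` to 3 or 4 mirrors the stricter calling strategies, trading sensitivity for fewer calls per sample.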

3 RESULTS

To evaluate the performance of Cake, we used the aforementioned human hepatocellular carcinoma dataset composed of 24 exome tumour/germline pairs and two human breast cancer exomes for which we had genomic DNA for follow-up validation (Stephens et al., 2012). Each variant-calling algorithm was evaluated by running it individually with its default settings and filtering the results using the post-processing filters implemented in Cake. The results are summarized in Table 1.

Table 1. Summary of the results of different somatic variant-calling algorithms and Cake on two human exome datasets

HCC: hepatocellular carcinoma (24 samples/842 validated sites). BC: breast cancer (2 samples/264 validated sites).

| Calling strategy | Algorithms | HCC: validated mutations identified (total) | HCC: sensitivity (%) | HCC: average number of variant calls per sample | BC: validated mutations identified (total) | BC: sensitivity (%) | BC: average number of variant calls per sample | BC: validation success rate (Sequenom) (%) |
|---|---|---|---|---|---|---|---|---|
| Single algorithms (after filtering) | Bambino | 742 | 88.1 | 2503 ± 1070 | 248 | 93.9 | 3456 ± 324 | |
| | CaVEMan | 801 | 95.1 | 1072 ± 1055 | (263) | (99.6) | (961 ± 90) | |
| | Mpileup | 727 | 86.3 | 429 ± 226 | 181 | 68.6 | 329 ± 32 | |
| | VarScan 2 | 805 | 95.6 | 926 ± 527 | 205 | 77.7 | 929 ± 91 | |
| Cake | ≥ any 2 callers | 812 | 96.4 | 634 ± 299 | 254 | 96.2 | 613 ± 42 | 51.5 |
| | ≥ any 3 callers | 794 | 94.3 | 270 ± 132 | 214 | 81.1 | 326 ± 50 | 81.7 |
| | 4 callers | 652 | 77.4 | 168 ± 98 | 166 | 62.8 | 178 ± 42 | 88.3 |

3.1 Human hepatocellular carcinoma dataset

In their study, Guichard et al. (2012) experimentally validated 850 SNV positions, of which 8 were not covered by sequence reads following realignment, leaving a target reference set of 842. Using Cake with an intersection of any two or more algorithms, 812 validated variants were retained (Supplementary Fig. S3), representing an overall sensitivity of 96.4%. An average of 634 variants was predicted per exome (Table 1). Cake outperformed the best single algorithm in terms of specificity and the number of variants reported per sample.
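The sensitivity figures for this dataset follow directly from the counts in Table 1: validated mutations recovered, divided by the 842 sites in the reference set. The short sketch below reproduces that arithmetic; the function name `sensitivity` and the dictionary layout are illustrative choices, while the counts themselves are taken from Table 1.

```python
def sensitivity(validated_found, validated_total):
    """Percentage of validated reference sites recovered by a calling strategy."""
    return 100.0 * validated_found / validated_total


HCC_REFERENCE_SITES = 842  # 850 validated positions minus 8 without read coverage

# Validated mutations identified per strategy (hepatocellular carcinoma, Table 1).
hcc_found = {
    "Bambino": 742,
    "CaVEMan": 801,
    "Mpileup": 727,
    "VarScan 2": 805,
    "Cake (>= any 2 callers)": 812,
    "Cake (>= any 3 callers)": 794,
    "Cake (4 callers)": 652,
}

hcc_sensitivity = {
    name: round(sensitivity(found, HCC_REFERENCE_SITES), 1)
    for name, found in hcc_found.items()
}

for name, value in hcc_sensitivity.items():
    print(f"{name}: {value}%")
```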

3.2 Human breast cancer exome dataset

Because the above analysis will favour callers that perform like CASAVA, and because we did not have DNA from the hepatocellular carcinomas for follow-up analysis to ascertain the false positive and negative rates, we next used exome data from two breast tumours for which whole genome amplified tumour and germline DNA was in hand. Using Cake and an intersection of any two or more callers, we made 1225 calls (per sample 613 ± 42), of which 254 were from a reference call set representing a subset of positions (264) covered by the capture baits where a somatic mutation had resulted in a non-synonymous change; a sensitivity of 96.2% (Table 1, Supplementary Fig. S4). Excluding CaVEMan, which was used in the original study, Cake again outperformed all other algorithms (Table 1). To assess the specificity of the somatic variant calling by Cake, we used the Sequenom MassARRAY SNP genotyping platform on tumour and germline DNA samples. A total of 400 variants were randomly selected from the 1225 calls made by any two or more callers in the Cake pipeline, 200 from each sample. Two hundred and seventy variants were validated, including 95 somatic mutations confirmed in the original study, 111 somatic mutations that were not described previously and 64 germline variants (Supplementary Fig. S5). Importantly, we called variants in a greater target region than the original study by analyzing positions in 5′ and 3′ untranslated regions, and introns. Six additional non-synonymous SNVs were discovered and confirmed (Supplementary Table S3), including variants in AKAP1, PCNT and RERE, all of which have been implicated in cancer. A further 400 variants were included as a true negative set resulting in a worst-case accuracy for Cake of 75.8% [Accuracy = (95 + 111 + 400)/(400 + 400)]. 
Although we used our default of at least any two callers as part of the aforementioned analysis, 88.3% of positions that validated as somatic variants were reported by all four algorithms used by Cake (Supplementary Fig. S5, Table 1). This indicates that merging predictions increases the probability of identifying true mutations. Thus, we demonstrate that Cake may be used to help prioritize somatic SNV calls for follow-up validation.
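The validation arithmetic in this section can be checked with a few lines of code. The sketch below uses only the counts quoted above (400 tested calls, 95 + 111 confirmed somatic mutations, 64 germline variants, 400 true negatives); the function and variable names are illustrative. It shows that the same confirmed-somatic count yields both the 51.5% Sequenom validation success rate in Table 1 and the worst-case accuracy.

```python
def validation_success_rate(confirmed_somatic, tested):
    """Percentage of tested calls confirmed as somatic mutations."""
    return 100.0 * confirmed_somatic / tested


def accuracy(true_positives, true_negatives, total):
    """Percentage of all assessed positions that were classified correctly."""
    return 100.0 * (true_positives + true_negatives) / total


tested_calls = 400       # 200 randomly selected calls from each breast tumour
known_somatic = 95       # confirmed in the original study
novel_somatic = 111      # not described previously
germline = 64            # validated, but as germline variants
true_negative_set = 400  # additional positions included as true negatives

validated = known_somatic + novel_somatic + germline  # 270 validated variants
confirmed_somatic = known_somatic + novel_somatic     # 206 somatic mutations

success_rate = validation_success_rate(confirmed_somatic, tested_calls)
worst_case_accuracy = accuracy(
    confirmed_somatic, true_negative_set, tested_calls + true_negative_set
)

print(f"Validation success rate: {success_rate:.1f}%")  # 51.5%
print(f"Worst-case accuracy: {worst_case_accuracy:.2f}%")  # 75.75%, reported as 75.8%
```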

4 SUMMARY

Here we describe Cake, a software tool integrating four somatic variant detection algorithms to call variants with higher accuracy and specificity than any one algorithm alone. Cake performs well on whole genomes, exomes and targeted next-generation sequencing data, as well as on both human and mouse samples. Cake is freely available to the research community.
REFERENCES

1.  dbSNP: the NCBI database of genetic variation.

Authors:  S T Sherry; M H Ward; M Kholodov; J Baker; L Phan; E M Smigielski; K Sirotkin
Journal:  Nucleic Acids Res       Date:  2001-01-01       Impact factor: 16.971

2.  VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing.

Authors:  Daniel C Koboldt; Qunyuan Zhang; David E Larson; Dong Shen; Michael D McLellan; Ling Lin; Christopher A Miller; Elaine R Mardis; Li Ding; Richard K Wilson
Journal:  Genome Res       Date:  2012-02-02       Impact factor: 9.043

3.  Bambino: a variant detector and alignment viewer for next-generation sequencing data in the SAM/BAM format.

Authors:  Michael N Edmonson; Jinghui Zhang; Chunhua Yan; Richard P Finney; Daoud M Meerzaman; Kenneth H Buetow
Journal:  Bioinformatics       Date:  2011-01-28       Impact factor: 6.937

4.  The Sequence Alignment/Map format and SAMtools.

Authors:  Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal:  Bioinformatics       Date:  2009-06-08       Impact factor: 6.937

5.  Integrated analysis of somatic mutations and focal copy-number changes identifies key genes and pathways in hepatocellular carcinoma.

Authors:  Cécile Guichard; Giuliana Amaddeo; Sandrine Imbeaud; Yannick Ladeiro; Laura Pelletier; Ichrafe Ben Maad; Julien Calderaro; Paulette Bioulac-Sage; Mélanie Letexier; Françoise Degos; Bruno Clément; Charles Balabaud; Eric Chevet; Alexis Laurent; Gabrielle Couchy; Eric Letouzé; Fabien Calvo; Jessica Zucman-Rossi
Journal:  Nat Genet       Date:  2012-05-06       Impact factor: 38.330

6.  The landscape of cancer genes and mutational processes in breast cancer.

Authors:  Philip J Stephens; Patrick S Tarpey; Helen Davies; Peter Van Loo; Chris Greenman; David C Wedge; Serena Nik-Zainal; Sancha Martin; Ignacio Varela; Graham R Bignell; Lucy R Yates; Elli Papaemmanuil; David Beare; Adam Butler; Angela Cheverton; John Gamble; Jonathan Hinton; Mingming Jia; Alagu Jayakumar; David Jones; Calli Latimer; King Wai Lau; Stuart McLaren; David J McBride; Andrew Menzies; Laura Mudie; Keiran Raine; Roland Rad; Michael Spencer Chapman; Jon Teague; Douglas Easton; Anita Langerød; Ming Ta Michael Lee; Chen-Yang Shen; Benita Tan Kiat Tee; Bernice Wong Huimin; Annegien Broeks; Ana Cristina Vargas; Gulisa Turashvili; John Martens; Aquila Fatima; Penelope Miron; Suet-Feung Chin; Gilles Thomas; Sandrine Boyault; Odette Mariani; Sunil R Lakhani; Marc van de Vijver; Laura van 't Veer; John Foekens; Christine Desmedt; Christos Sotiriou; Andrew Tutt; Carlos Caldas; Jorge S Reis-Filho; Samuel A J R Aparicio; Anne Vincent Salomon; Anne-Lise Børresen-Dale; Andrea L Richardson; Peter J Campbell; P Andrew Futreal; Michael R Stratton
Journal:  Nature       Date:  2012-05-16       Impact factor: 49.962

7.  An integrated map of genetic variation from 1,092 human genomes.

Authors:  Goncalo R Abecasis; Adam Auton; Lisa D Brooks; Mark A DePristo; Richard M Durbin; Robert E Handsaker; Hyun Min Kang; Gabor T Marth; Gil A McVean
Journal:  Nature       Date:  2012-11-01       Impact factor: 49.962
