SUMMARY: Thousands of cancer exomes are currently being sequenced, yielding millions of non-synonymous single nucleotide variants (SNVs) of possible relevance to disease etiology. Here, we provide a software toolkit to prioritize SNVs based on their predicted contribution to tumorigenesis. It includes a database of precomputed, predictive features covering all positions in the annotated human exome and can be used either stand-alone or as part of a larger variant discovery pipeline. AVAILABILITY AND IMPLEMENTATION: MySQL database, source code and binaries freely available for academic/government use at http://wiki.chasmsoftware.org. Source in Python and C++. Requires a 32- or 64-bit Linux system (tested on Fedora Core 8, 10, 11 and Ubuntu 10), Python ≥2.5 and <3.0, MySQL server >5.0, 60 GB available hard disk space (50 MB for software and data files, 40 GB for MySQL database dump when uncompressed) and 2 GB of RAM.
1 INTRODUCTION
A fundamental goal of modern cancer genomics studies is to understand how alterations in DNA sequence contribute to disease susceptibility and prognosis. Targeted whole-exome deep sequencing is now affordable for many academic labs, and the multitude of studies underway is yielding datasets of unprecedented magnitude. While researchers have previously developed methods to computationally predict the impact of single nucleotide variants (SNVs) (Kaminker ; Mooney ; Ng and Henikoff, 2003; Sunyaev ), to our knowledge there are no existing tools capable of fast classification of very large SNV datasets in cancer exomes.

We have previously developed a computational method, Cancer-Specific High-throughput Annotation of Somatic Mutations (CHASM) (Carter , 2010), that predicts whether tumor-derived somatic missense mutations are important contributors to cancer cell fitness. Here, we describe a software package that implements the CHASM method. The package includes SNVBox, a database of pre-computed predictive features that facilitates rapid feature retrieval and classification of very large SNV datasets. Furthermore, the features in SNVBox can be used more generally to aid the development of new classification algorithms that predict the impact of either germline or somatic SNVs.
2 METHODS AND IMPLEMENTATION
CHASM is an open-source collection of Python and C++ programs that takes a list of somatic missense mutations as input and ranks them according to their likely tumorigenic impact. It includes a curated set of driver mutations culled from the COSMIC database (Forbes ), which is used as a positive class for training a Random Forest classifier (Amit and Geman, 1997; Breiman, 2001). The negative class of mutations is generated in silico according to an estimated distribution of benign (passenger) variation, matched to the tumor type of interest. Users have the option to use their own estimates of passenger variant frequencies or to select from a library of pre-computed passenger frequency tables for several common cancers.

PyInstaller 1.4 was used to package Python source into dynamically linked, executable binaries. The SnvGet, BuildClassifier and RunChasm executables are run by the user on the command line, while the others are called internally. The statically compiled C++ executable waffles_learn from the WAFFLES machine learning library is also called internally.

SNVBox is a MySQL database of 86 predictive features relevant to the biological impact of an SNV. The features have been pre-computed for each codon in all protein-coding exons of annotated human mRNA transcripts in the NCBI RefSeq, CCDS and EBI Ensembl databases (Birney ; Pruitt ; Pruitt ). The SnvGet program enables fast retrieval of selected features from the database for classifier training and scoring of mutations input by the user.
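As a rough illustration of how a negative class can be generated in silico, the sketch below draws random nucleotide substitutions from a transcript sequence, weighted by a table of relative substitution rates. It is not the CHASM implementation: the table layout, the rate values and the name sample_passengers are hypothetical, the code targets Python 3 rather than the Python 2 versions the toolkit requires, and the translation of each substitution into a missense change is omitted.

```python
import random

# Hypothetical relative substitution rates keyed by (reference base, variant base).
# Illustrative values only; not taken from any CHASM passenger rate table.
PASSENGER_RATES = {
    ("C", "T"): 0.40, ("G", "A"): 0.40,
    ("C", "A"): 0.05, ("G", "T"): 0.05,
    ("A", "G"): 0.04, ("T", "C"): 0.04,
    ("A", "T"): 0.01, ("T", "A"): 0.01,
}


def sample_passengers(transcript, n, rates=PASSENGER_RATES, seed=None):
    """Draw n random substitutions as (0-based position, ref, alt) tuples."""
    rng = random.Random(seed)
    candidates, weights = [], []
    # Enumerate every substitution the table allows at every position,
    # weighted by its relative rate.
    for pos, ref in enumerate(transcript.upper()):
        for (table_ref, alt), rate in rates.items():
            if table_ref == ref:
                candidates.append((pos, ref, alt))
                weights.append(rate)
    if not candidates:
        raise ValueError("no substitution in the rate table applies to this sequence")
    # Sample with replacement, proportionally to the rates.
    return rng.choices(candidates, weights=weights, k=n)
```

Sampling with replacement keeps the sketch short; a real pipeline would presumably also discard synonymous and nonsense changes before using the result as a passenger set.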
3 WORKFLOW
1. Prepare an input file of estimated passenger mutation rates in the cancer of interest. Optionally, select one of several pre-computed passenger rate tables.
2. Prepare an input file of missense SNVs to be classified. Each row contains a protein accession identifier, codon number, and reference and variant amino acid residues.
3. Run the BuildClassifier program.
   (a) Produces a negative class of in silico passenger mutations by random nucleotide substitution in a library of expressed human mRNA transcripts from NCBI RefSeq, according to the distributions specified in the passenger mutation rate table (Supplementary Material).
   (b) Retrieves a list of predictive features for each passenger (and driver) in the training set from SNVBox.
   (c) Builds a Random Forest classifier, using waffles_learn.
4. Run the RunChasm program.
   (a) Retrieves a feature list for all mutations supplied by the user.
   (b) Applies the trained classifier to generate a CHASM score for each variant.
   (c) Generates a second set of in silico passenger mutations, which (unlike the first set) is carefully filtered to eliminate mutations in any genes previously associated with cancer in the Cancer Gene Census (Futreal ), the COSMIC cancer gene list or any cancer (C4 collection) gene set in MSigDB (Subramanian ).
   (d) Filtered passengers are scored by the classifier to produce an empirical null distribution of variant scores.
   (e) This null score distribution is used to compute a P-value for each variant supplied by the user (the fraction of filtered passengers having CHASM scores less than or equal to the score of the variant).
   (f) Benjamini–Hochberg multiple testing correction (Benjamini and Hochberg, 1995) is applied to the P-values.
   (g) Outputs a list of the user-supplied mutations, with CHASM scores, P-values and Benjamini–Hochberg estimated false discovery rate (FDR).
   (h) Outputs an ARFF formatted file of features for the submitted mutations.
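The P-value and FDR steps of RunChasm can be sketched in a few lines of Python. The sketch follows the definitions given in the workflow (fraction of null scores less than or equal to the variant score, followed by Benjamini–Hochberg adjustment); the function names are hypothetical and the code is an illustration under those assumptions, not an excerpt from the CHASM source.

```python
from bisect import bisect_right


def empirical_p_values(variant_scores, null_scores):
    """Fraction of null (filtered passenger) scores <= each variant score."""
    null_sorted = sorted(null_scores)
    n = len(null_sorted)
    # bisect_right counts how many null scores are <= the query score.
    return [bisect_right(null_sorted, s) / n for s in variant_scores]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value to the smallest so that the adjusted
    # values stay monotone and never exceed 1.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Under this definition a variant that scores below every filtered passenger receives a P-value of exactly 0, so the resolution of the P-values depends on the size of the null set.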
4 DISCUSSION
The CHASM/SNVBox toolkit is the first distributable software package that specifically targets somatic missense mutations in cancer. The learning task of the Random Forest classifier is to discriminate between known drivers and a set of random passenger missense mutations that match the mutation spectrum in a cancer type of interest. CHASM results are sensitive to this definition of mutation spectrum, and users are encouraged to use the somatic variant calls from their sequencing data to make the best possible estimates of the spectrum (Supplementary Material).

While many SNV classifiers are available through web interfaces [reviewed in Karchin (2009)], these are not currently capable of handling large custom datasets (e.g. thousands to millions of SNVs discovered in sequencing projects). Some researchers have developed distributable packages that users can run on their local system to enable high-throughput SNV processing. These packages depend on third-party databases (sequences, alignments, protein structures, specialized protein annotations) and third-party software packages. The popular PolyPhen system, for example, requires installation of 10 third-party software packages, in addition to three Perl modules. To our knowledge, all available SNV classification tools base their inferences on predictive features computed when a custom dataset is input to the system (almost always using third-party databases and software). In contrast, the predictive features available in SNVBox (also calculated with many third-party tools) have been exhaustively pre-computed, allowing rapid retrieval for a custom dataset. In benchmark testing, retrieval of 86 features for one million SNVs took 11.39 h on a Dell R900 server with two AMD Opteron dual-core 64-bit CPUs and 16 GB of RAM. CHASM score computation for these one million mutations took an additional 10 min and 33 s.

Finally, the predictive features available in SNVBox were designed to be useful for classification of both germline and somatic SNVs. We hope that SNVBox will enable design of new, improved machine learning algorithms to predict the impact of SNVs.

Funding: National Cancer Institute, National Institutes of Health grant (CA135877 to R.K., in part); National Science Foundation CAREER award (DBI 0845275 to R.K., in part); DOD NDSEG Fellowship 32 CFR 168a (to H.C., in part).

Conflict of Interest: none declared.
Authors: Ewan Birney; T Daniel Andrews; Paul Bevan; Mario Caccamo; Yuan Chen; Laura Clarke; Guy Coates; James Cuff; Val Curwen; Tim Cutts; Thomas Down; Eduardo Eyras; Xose M Fernandez-Suarez; Paul Gane; Brian Gibbins; James Gilbert; Martin Hammond; Hans-Rudolf Hotz; Vivek Iyer; Kerstin Jekosch; Andreas Kahari; Arek Kasprzyk; Damian Keefe; Stephen Keenan; Heikki Lehvaslaiho; Graham McVicker; Craig Melsopp; Patrick Meidl; Emmanuel Mongin; Roger Pettett; Simon Potter; Glenn Proctor; Mark Rae; Steve Searle; Guy Slater; Damian Smedley; James Smith; Will Spooner; Arne Stabenau; James Stalker; Roy Storey; Abel Ureta-Vidal; K Cara Woodwark; Graham Cameron; Richard Durbin; Anthony Cox; Tim Hubbard; Michele Clamp Journal: Genome Res Date: 2004-04-12 Impact factor: 9.043
Authors: Kim D Pruitt; Jennifer Harrow; Rachel A Harte; Craig Wallin; Mark Diekhans; Donna R Maglott; Steve Searle; Catherine M Farrell; Jane E Loveland; Barbara J Ruef; Elizabeth Hart; Marie-Marthe Suner; Melissa J Landrum; Bronwen Aken; Sarah Ayling; Robert Baertsch; Julio Fernandez-Banet; Joshua L Cherry; Val Curwen; Michael Dicuccio; Manolis Kellis; Jennifer Lee; Michael F Lin; Michael Schuster; Andrew Shkeda; Clara Amid; Garth Brown; Oksana Dukhanina; Adam Frankish; Jennifer Hart; Bonnie L Maidak; Jonathan Mudge; Michael R Murphy; Terence Murphy; Jeena Rajan; Bhanu Rajput; Lillian D Riddick; Catherine Snow; Charles Steward; David Webb; Janet A Weber; Laurens Wilming; Wenyu Wu; Ewan Birney; David Haussler; Tim Hubbard; James Ostell; Richard Durbin; David Lipman Journal: Genome Res Date: 2009-06-04 Impact factor: 9.043
Authors: S A Forbes; G Bhamra; S Bamford; E Dawson; C Kok; J Clements; A Menzies; J W Teague; P A Futreal; M R Stratton Journal: Curr Protoc Hum Genet Date: 2008-04
Authors: Aravind Subramanian; Pablo Tamayo; Vamsi K Mootha; Sayan Mukherjee; Benjamin L Ebert; Michael A Gillette; Amanda Paulovich; Scott L Pomeroy; Todd R Golub; Eric S Lander; Jill P Mesirov Journal: Proc Natl Acad Sci U S A Date: 2005-09-30 Impact factor: 11.205
Authors: P Andrew Futreal; Lachlan Coin; Mhairi Marshall; Thomas Down; Timothy Hubbard; Richard Wooster; Nazneen Rahman; Michael R Stratton Journal: Nat Rev Cancer Date: 2004-03 Impact factor: 60.716