
CHASM and SNVBox: toolkit for detecting biologically important single nucleotide mutations in cancer.

Wing Chung Wong, Dewey Kim, Hannah Carter, Mark Diekhans, Michael C Ryan, Rachel Karchin.

Abstract

SUMMARY: Thousands of cancer exomes are currently being sequenced, yielding millions of non-synonymous single nucleotide variants (SNVs) of possible relevance to disease etiology. Here, we provide a software toolkit to prioritize SNVs based on their predicted contribution to tumorigenesis. It includes a database of precomputed, predictive features covering all positions in the annotated human exome and can be used either stand-alone or as part of a larger variant discovery pipeline.
AVAILABILITY AND IMPLEMENTATION: MySQL database, source code and binaries are freely available for academic/government use at http://wiki.chasmsoftware.org. Source is in Python and C++. Requires a 32- or 64-bit Linux system (tested on Fedora Core 8, 10, 11 and Ubuntu 10), 2.5 ≤ Python < 3.0, MySQL server > 5.0, 60 GB of available hard disk space (50 MB for software and data files, 40 GB for the MySQL database dump when uncompressed) and 2 GB of RAM.

Year: 2011 PMID: 21685053 PMCID: PMC3137226 DOI: 10.1093/bioinformatics/btr357

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

A fundamental goal of modern cancer genomics studies is to understand how alterations in DNA sequence contribute to disease susceptibility and prognosis. Targeted whole-exome deep sequencing is now affordable for many academic labs, and the multitude of studies underway is yielding datasets of unprecedented magnitude. While researchers have previously developed methods to computationally predict the impact of single nucleotide variants (SNVs) (Kaminker; Mooney; Ng and Henikoff, 2003; Sunyaev et al.), to our knowledge there are no existing tools capable of fast classification of very large SNV datasets in cancer exomes. We have previously developed a computational method, Cancer-Specific High-throughput Annotation of Somatic Mutations (CHASM) (Carter et al., 2010), that predicts whether tumor-derived somatic missense mutations are important contributors to cancer cell fitness. Here, we describe a software package that implements the CHASM method. The package includes SNVBox, a database of pre-computed predictive features that facilitates rapid feature retrieval and classification of very large SNV datasets. Furthermore, the features in SNVBox can be used more generally to aid the development of new classification algorithms that predict the impact of either germline or somatic SNVs.

2 METHODS AND IMPLEMENTATION

CHASM is an open-source collection of Python and C++ programs that takes a list of somatic missense mutations as input and ranks them according to their likely tumorigenic impact. It includes a curated set of driver mutations culled from the COSMIC database (Forbes et al.), which is used as the positive class for training a Random Forest classifier (Amit and Geman, 1997; Breiman, 2001). The negative class of mutations is generated in silico according to an estimated distribution of benign (passenger) variation, matched to the tumor type of interest. Users can supply their own estimates of passenger variant frequencies or select from a library of pre-computed passenger frequency tables for several common cancers. PyInstaller 1.4 was used to package the Python source into dynamically linked, executable binaries. The SnvGet, BuildClassifier and RunChasm executables are run by the user on the command line, while the others are called internally. The statically compiled C++ executable waffles_learn from the WAFFLES machine learning library is also called internally.

SNVBox is a MySQL database of 86 predictive features relevant to the biological impact of an SNV. The features have been pre-computed for each codon in all protein-coding exons of annotated human mRNA transcripts in the NCBI RefSeq, CCDS and EBI Ensembl databases (Birney et al.; Pruitt et al.; Pruitt et al.). The SnvGet program enables fast retrieval of selected features from the database for classifier training and for scoring mutations input by the user.
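The in silico generation of the passenger (negative) class can be pictured as weighted sampling of single-base substitutions that change the encoded residue. The following is a minimal sketch under stated assumptions, not the CHASM implementation: the function `sample_passengers`, the rate table keyed by `(reference base, variant base)` and the toy coding sequence are all hypothetical. The real passenger rate table is described in the Supplementary Material and may condition on sequence context, which this sketch ignores.

```python
import random
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def sample_passengers(cds, rates, n, rng):
    """Draw n synthetic missense mutations from one coding sequence.

    cds   : uppercase A/C/G/T string starting at the first codon position
    rates : hypothetical table mapping (ref_base, alt_base) to a relative weight
    Returns (codon_number, ref_residue, alt_residue) tuples, codons 1-based.
    """
    candidates, weights = [], []
    for i, ref in enumerate(cds):
        offset = i % 3
        codon = cds[i - offset:i - offset + 3]
        if len(codon) < 3:
            continue  # ignore a trailing partial codon
        ref_aa = CODON_TABLE[codon]
        for alt in BASES:
            weight = rates.get((ref, alt), 0.0)
            if alt == ref or weight <= 0:
                continue
            alt_aa = CODON_TABLE[codon[:offset] + alt + codon[offset + 1:]]
            # Keep missense changes only: residue differs, no stop involved.
            if alt_aa != ref_aa and "*" not in (ref_aa, alt_aa):
                candidates.append((i // 3 + 1, ref_aa, alt_aa))
                weights.append(weight)
    if not candidates:
        raise ValueError("no missense substitution has a positive rate")
    return rng.choices(candidates, weights=weights, k=n)


# Toy usage: only G>A substitutions are allowed by this made-up table.
toy_rates = {("G", "A"): 1.0}
passengers = sample_passengers("ATGGCC", toy_rates, 50, random.Random(0))
```

Each sampled tuple has the same shape as a row of the user's SNV input (codon number plus reference and variant residues), so synthetic passengers and real mutations can share one feature-retrieval path.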

3 WORKFLOW

1. Prepare an input file of estimated passenger mutation rates in the cancer of interest. Optionally, select one of several pre-computed passenger rate tables.
2. Prepare an input file of missense SNVs to be classified. Each row contains a protein accession identifier, a codon number, and the reference and variant amino acid residues.
3. Run the BuildClassifier program, which:
   (a) Produces a negative class of in silico passenger mutations by random nucleotide substitution in a library of expressed human mRNA transcripts from NCBI RefSeq, according to the distributions specified in the passenger mutation rate table (Supplementary Material).
   (b) Retrieves a list of predictive features for each passenger (and driver) in the training set from SNVBox.
   (c) Builds a Random Forest classifier, using waffles_learn.
4. Run the RunChasm program, which:
   (a) Retrieves a feature list for all mutations supplied by the user.
   (b) Applies the trained classifier to generate a CHASM score for each variant.
   (c) Generates a second set of in silico passenger mutations, which (unlike the first set) is carefully filtered to eliminate mutations in any gene previously associated with cancer in the Cancer Gene Census (Futreal et al.), the COSMIC cancer gene list or any of the cancer (C4 collection) gene sets in MSigDB (Subramanian et al.).
   (d) Scores the filtered passengers with the classifier to produce an empirical null distribution of variant scores. This null distribution is used to compute a P-value for each variant supplied by the user (the fraction of filtered passengers with CHASM scores less than or equal to the score of the variant). Benjamini–Hochberg multiple testing correction (Benjamini and Hochberg, 1995) is applied to the P-values.
   (e) Outputs a list of the user-supplied mutations, with CHASM scores, P-values and Benjamini–Hochberg estimated false discovery rates (FDRs).
   (f) Outputs an ARFF-formatted file of features for the submitted mutations.
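The P-value and FDR steps of the RunChasm workflow can be illustrated with a short sketch. It assumes plain lists of scores; the function names `empirical_p_values` and `benjamini_hochberg` are illustrative and are not part of the CHASM distribution. The P-value follows the definition given in the text (fraction of filtered passenger scores less than or equal to the variant score), and the correction is the standard Benjamini–Hochberg step-up adjustment.

```python
from bisect import bisect_right


def empirical_p_values(scores, null_scores):
    """P-value = fraction of null (filtered passenger) scores <= variant score."""
    if not null_scores:
        raise ValueError("the null distribution must not be empty")
    null_sorted = sorted(null_scores)
    n = len(null_sorted)
    # bisect_right counts how many null scores are <= s.
    return [bisect_right(null_sorted, s) / n for s in scores]


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy usage with made-up scores (lower score = more driver-like here).
null = [i / 10 for i in range(1, 11)]          # 0.1, 0.2, ..., 1.0
p_vals = empirical_p_values([0.05, 0.2, 0.95], null)
fdr = benjamini_hochberg(p_vals)
```

This estimator returns 0 for a variant that scores below every null passenger; whether CHASM applies any correction in that case is not stated in the text.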

4 DISCUSSION

The CHASM/SNVBox toolkit is the first distributable software package that specifically targets somatic missense mutations in cancer. The learning task of the Random Forest classifier is to discriminate between known drivers and a set of random passenger missense mutations that match the mutation spectrum in a cancer type of interest. CHASM results are sensitive to this definition of the mutation spectrum, and users are encouraged to use the somatic variant calls from their sequencing data to make the best possible estimates of the spectrum (Supplementary Material).

While many SNV classifiers are available through web interfaces [reviewed in Karchin (2009)], these are not currently capable of handling large custom datasets (e.g. thousands to millions of SNVs discovered in sequencing projects). Some researchers have developed distributable packages that users can run on their local system to enable high-throughput SNV processing. These packages depend on third-party databases (sequences, alignments, protein structures, specialized protein annotations) and third-party software packages. The popular PolyPhen system, for example, requires installation of 10 third-party software packages, in addition to three Perl modules.

To our knowledge, all available SNV classification tools base their inferences on predictive features computed when a custom dataset is input to the system (almost always using third-party databases and software). In contrast, the predictive features available in SNVBox (also calculated with many third-party tools) have been exhaustively pre-computed, allowing rapid retrieval for a custom dataset. In benchmark testing, retrieval of 86 features for one million SNVs took 11.39 h on a Dell R900 server with two AMD Opteron dual-core 64-bit CPUs and 16 GB of RAM. CHASM score computation for these one million mutations took an additional 10 min and 33 s.
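To put the benchmark figures on a per-variant scale, the arithmetic can be worked through as below. The inputs are the timings reported above; the variable and function names are illustrative only.

```python
n_snvs = 1_000_000
retrieval_s = 11.39 * 3600        # 11.39 h of feature retrieval, in seconds
scoring_s = 10 * 60 + 33          # 10 min 33 s of CHASM scoring, in seconds


def per_snv_ms(total_seconds, n):
    """Average wall-clock time per SNV in milliseconds."""
    return total_seconds / n * 1000.0


retrieval_ms = per_snv_ms(retrieval_s, n_snvs)   # about 41 ms per SNV
scoring_ms = per_snv_ms(scoring_s, n_snvs)       # about 0.63 ms per SNV
retrieval_share = retrieval_s / (retrieval_s + scoring_s)
```

On these numbers, feature retrieval accounts for roughly 98% of the total run time, so database lookup rather than classification dominates the cost of a large job.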
Finally, the predictive features available in SNVBox were designed to be useful for classification of both germline and somatic SNVs. We hope that SNVBox will enable the design of new, improved machine learning algorithms to predict the impact of SNVs.

Funding: National Cancer Institute, National Institutes of Health grant (CA135877 to R.K., in part); National Science Foundation CAREER award (DBI 0845275 to R.K., in part); DOD NDSEG Fellowship 32 CFR 168a (to H.C., in part).

Conflict of Interest: none declared.

13 in total

1. Prediction of deleterious human alleles.

Authors: S Sunyaev; V Ramensky; I Koch; W Lathe; A S Kondrashov; P Bork
Journal: Hum Mol Genet Date: 2001-03-15 Impact factor: 6.150

2. SIFT: Predicting amino acid changes that affect protein function.

Authors: Pauline C Ng; Steven Henikoff
Journal: Nucleic Acids Res Date: 2003-07-01 Impact factor: 16.971

Review 3. An overview of Ensembl.

Authors: Ewan Birney; T Daniel Andrews; Paul Bevan; Mario Caccamo; Yuan Chen; Laura Clarke; Guy Coates; James Cuff; Val Curwen; Tim Cutts; Thomas Down; Eduardo Eyras; Xose M Fernandez-Suarez; Paul Gane; Brian Gibbins; James Gilbert; Martin Hammond; Hans-Rudolf Hotz; Vivek Iyer; Kerstin Jekosch; Andreas Kahari; Arek Kasprzyk; Damian Keefe; Stephen Keenan; Heikki Lehvaslaiho; Graham McVicker; Craig Melsopp; Patrick Meidl; Emmanuel Mongin; Roger Pettett; Simon Potter; Glenn Proctor; Mark Rae; Steve Searle; Guy Slater; Damian Smedley; James Smith; Will Spooner; Arne Stabenau; James Stalker; Roy Storey; Abel Ureta-Vidal; K Cara Woodwark; Graham Cameron; Richard Durbin; Anthony Cox; Tim Hubbard; Michele Clamp
Journal: Genome Res Date: 2004-04-12 Impact factor: 9.043

4. The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes.

Authors: Kim D Pruitt; Jennifer Harrow; Rachel A Harte; Craig Wallin; Mark Diekhans; Donna R Maglott; Steve Searle; Catherine M Farrell; Jane E Loveland; Barbara J Ruef; Elizabeth Hart; Marie-Marthe Suner; Melissa J Landrum; Bronwen Aken; Sarah Ayling; Robert Baertsch; Julio Fernandez-Banet; Joshua L Cherry; Val Curwen; Michael Dicuccio; Manolis Kellis; Jennifer Lee; Michael F Lin; Michael Schuster; Andrew Shkeda; Clara Amid; Garth Brown; Oksana Dukhanina; Adam Frankish; Jennifer Hart; Bonnie L Maidak; Jonathan Mudge; Michael R Murphy; Terence Murphy; Jeena Rajan; Bhanu Rajput; Lillian D Riddick; Catherine Snow; Charles Steward; David Webb; Janet A Weber; Laurens Wilming; Wenyu Wu; Ewan Birney; David Haussler; Tim Hubbard; James Ostell; Richard Durbin; David Lipman
Journal: Genome Res Date: 2009-06-04 Impact factor: 9.043

5. Next generation tools for the annotation of human SNPs.

Authors: Rachel Karchin
Journal: Brief Bioinform Date: 2009-01 Impact factor: 11.622

6. The Catalogue of Somatic Mutations in Cancer (COSMIC).

Authors: S A Forbes; G Bhamra; S Bamford; E Dawson; C Kok; J Clements; A Menzies; J W Teague; P A Futreal; M R Stratton
Journal: Curr Protoc Hum Genet Date: 2008-04

7. Prioritization of driver mutations in pancreatic cancer using cancer-specific high-throughput annotation of somatic mutations (CHASM).

Authors: Hannah Carter; Josue Samayoa; Ralph H Hruban; Rachel Karchin
Journal: Cancer Biol Ther Date: 2010-10-01 Impact factor: 4.742

8. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles.

Authors: Aravind Subramanian; Pablo Tamayo; Vamsi K Mootha; Sayan Mukherjee; Benjamin L Ebert; Michael A Gillette; Amanda Paulovich; Scott L Pomeroy; Todd R Golub; Eric S Lander; Jill P Mesirov
Journal: Proc Natl Acad Sci U S A Date: 2005-09-30 Impact factor: 11.205

Review 9. A census of human cancer genes.

Authors: P Andrew Futreal; Lachlan Coin; Mhairi Marshall; Thomas Down; Timothy Hubbard; Richard Wooster; Nazneen Rahman; Michael R Stratton
Journal: Nat Rev Cancer Date: 2004-03 Impact factor: 60.716

10. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins.

Authors: Kim D Pruitt; Tatiana Tatusova; Donna R Maglott
Journal: Nucleic Acids Res Date: 2006-11-27 Impact factor: 16.971
