
CRAVAT: cancer-related analysis of variants toolkit.

Christopher Douville, Hannah Carter, Rick Kim, Noushin Niknafs, Mark Diekhans, Peter D Stenson, David N Cooper, Michael Ryan, Rachel Karchin.

Abstract

SUMMARY: Advances in sequencing technology have greatly reduced the costs incurred in collecting raw sequencing data. Academic laboratories and researchers therefore now have access to very large datasets of genomic alterations but limited time and computational resources to analyse their potential biological importance. Here, we provide a web-based application, Cancer-Related Analysis of Variants Toolkit, designed with an easy-to-use interface to facilitate the high-throughput assessment and prioritization of genes and missense alterations important for cancer tumorigenesis. Cancer-Related Analysis of Variants Toolkit provides predictive scores for germline variants, somatic mutations and relative gene importance, as well as annotations from published literature and databases. Results are emailed to users as MS Excel spreadsheets and/or tab-separated text files. AVAILABILITY: http://www.cravat.us/

Year: 2013; PMID: 23325621; PMCID: PMC3582272; DOI: 10.1093/bioinformatics/btt017

Source DB: PubMed; Journal: Bioinformatics; ISSN: 1367-4803; Impact factor: 6.937

1 INTRODUCTION

With the advent of high-throughput sequencing technology, researchers face a bottleneck in terms of the time required to analyse the potential impact on disease aetiology of the many genetic variants routinely detected. Computational algorithms can in principle help researchers to prioritize and direct future experiments by narrowing down the numerous genetic alterations identified in sequencing studies. However, in practice, it can be challenging to run these algorithms in a researcher’s own laboratory, owing to the requirements of third-party software and databases, and large hard disk space and RAM specifications. We have developed Cancer-Related Analysis of VAriants Toolkit (CRAVAT), a web-based application that provides a simple interface to prioritize genes and variants important for tumorigenesis, allowing users to assess millions of variants in a single upload step (Fig. 1).

Fig. 1.

CRAVAT interface and workflow. (1) Input co-ordinates. (2) Select ‘Cancer driver analysis’, ‘Functional effect analysis’ and/or ‘Gene annotation’. (3) Results are delivered to the provided email address

Numerous web implementations already exist for variant classifiers [reviewed in Karchin (2009)]. CRAVAT handles both germline and somatic variation but is dedicated to cancer genome analysis. It accepts variant calls from sequencing studies in either genomic coordinates (hg18 or hg19) or transcript coordinates from NCBI RefSeq, CCDS or Ensembl (Pruitt et al., 2009; Flicek et al.). Variants are mapped onto the best available transcript, using a greedy algorithm (see Supplementary Methods), and those variants that cause missense changes are identified. These variants can be scored in terms of their predicted impact on tumorigenesis, using the Cancer-Specific High-throughput Annotation of Somatic Mutations (CHASM) method (Carter et al.). They can also be scored by their predicted impact on protein function, with the Variant Effect Scoring Tool (VEST) (Carter et al.). Genes are ranked by their most significantly scored variant or mutation. Results are linked with published information from the 1000 Genomes Project (Clarke et al.), the Exome Sequencing Project, Catalogue of Somatic Mutations in Cancer (COSMIC) (Forbes et al.), GeneCards (Harel et al.) and PubMed, enabling users to compare predictions with known gene function, cancer associations and clinical/experimental studies. CRAVAT returns results via email as Excel spreadsheets and/or tab-separated text. It can also provide a formatted submission file for mutation Position Imaging Toolbox (muPIT) interactive (N.Niknafs et al., submitted for publication), allowing users to visualize variants interactively in 3D, together with position-specific annotations.
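The gene-ranking rule stated above (each gene is ranked by its most significantly scored variant) can be illustrated with a minimal Python sketch. The record layout, the function name `rank_genes` and the placeholder gene and variant names are assumptions made for illustration only; they are not CRAVAT's actual data format or code, and the P-values are invented.

```python
def rank_genes(variants):
    """Rank genes by their most significant (lowest) variant P-value.

    `variants` is an iterable of (gene, variant_id, p_value) tuples.
    This layout is hypothetical, not CRAVAT's real output format.
    """
    best = {}
    for gene, variant_id, p_value in variants:
        # Keep only the lowest P-value seen so far for each gene
        if gene not in best or p_value < best[gene][1]:
            best[gene] = (variant_id, p_value)
    # Sort ascending by P-value; ties are broken by gene name
    rows = [(gene, vid, p) for gene, (vid, p) in best.items()]
    return sorted(rows, key=lambda row: (row[2], row[0]))


# Invented placeholder records, for illustration only
example = [
    ("GENE_A", "var1", 0.0100),
    ("GENE_A", "var2", 0.6000),
    ("GENE_B", "var3", 0.0005),
    ("GENE_C", "var4", 0.0020),
]
ranking = rank_genes(example)
for gene, variant_id, p_value in ranking:
    print(gene, variant_id, p_value)
```

Under this rule a gene with one strongly scored variant outranks a gene with many weakly scored ones, which is worth keeping in mind when reading the gene-level results.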

2 SYSTEMS AND METHODS

CRAVAT runs on a Linux server with Apache Tomcat 6.0.35, and its web interface is written as Java Server Pages. When a user submits a job, a Java servlet is called, which places the job in the server’s queuing system, which is written in Python and built on a Redis backend. When the queued job runs, a ‘master analyzer’ script is launched to perform the requested analyses, calling our prediction software and annotation utilities and processing their results as needed. Local mirrors of annotation source databases are updated monthly. The prediction tools Single Nucleotide Variant Toolbox (SNVBox) (Wong et al.), CHASM and VEST are updated several times a year. Depending on server load, run time for analysis of 1000 SNVs is ∼5–10 minutes. Run time scales linearly with the number of SNVs. A job with 1.8 million SNVs takes from 4 to 13 days. Benchmarking details are provided in the Supplementary Information. There is no limit to the size of a job. To ensure that large jobs do not hold up smaller jobs, jobs are separated into two queues, depending on size.
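The two-queue idea and the linear run-time scaling described above can be sketched in a few lines of Python. This is an in-memory illustration only: the real system is built on Redis, the size cut-off `LARGE_JOB_THRESHOLD` is a hypothetical value (the text does not state one), and the function names are invented here.

```python
from collections import deque

# Hypothetical cut-off between "small" and "large" jobs; not given in the text
LARGE_JOB_THRESHOLD = 100_000

small_queue = deque()
large_queue = deque()


def enqueue(job_id, n_variants):
    """Route a job to one of two queues by size and report which one."""
    if n_variants >= LARGE_JOB_THRESHOLD:
        large_queue.append((job_id, n_variants))
        return "large"
    small_queue.append((job_id, n_variants))
    return "small"


def estimate_minutes(n_variants, minutes_per_1000=(5, 10)):
    """Linear extrapolation from the quoted 5-10 minutes per 1000 SNVs."""
    low, high = minutes_per_1000
    return (n_variants / 1000 * low, n_variants / 1000 * high)


print(enqueue("job-1", 1_000))
print(enqueue("job-2", 1_800_000))
low, high = estimate_minutes(1_800_000)
print(low / 60 / 24, high / 60 / 24)  # estimated range in days
```

Extrapolating linearly from 5 to 10 minutes per 1000 SNVs gives roughly 6 to 12.5 days for 1.8 million SNVs, which lies inside the reported range of 4 to 13 days.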

2.1 Prediction software

CHASM: Software to rank potential somatic driver mutations for specific cancer tissue types. It trains a classifier using parf, a Fortran implementation of Random Forest (Amit and Geman, 1997; Breiman, 2001). The training set is a positive class of known cancer drivers from the COSMIC database and a negative class of simulated passenger mutations.

VEST: VEST scores variants by predicted protein functional impact. It also uses parf to train a Random Forest classifier. The VEST training set is a positive class of disease-causing germline variants from the Human Gene Mutation Database (HGMD Professional 2012v2) (Stenson et al.) and a negative class of common variants from the Exome Sequencing Project dataset (ESP6500 accessed July 2012) (http://evs.gs.washington.edu/EVS/). Both CHASM and VEST provide P-values and false discovery rate estimates to help the user establish a score cut-off for accepting predictions.

SnvGet: Returns 86 pre-computed features for each variant from the SNVBox database, including the following: physicochemical properties of amino acid residues; scores derived from multiple sequence alignments of protein or DNA; region-based amino acid sequence composition; predicted properties of local protein structure; and annotations from the UniProtKB feature tables (UniProt Consortium and others, 2012). These features are used by CHASM and VEST to train classifiers and can be incorporated in new, user-generated predictive algorithms.
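To make the relationship between P-values and false discovery rate estimates concrete, the following Python sketch applies the Benjamini-Hochberg procedure to a list of P-values. The text does not say which FDR procedure CHASM and VEST use, so Benjamini-Hochberg is an assumption chosen here because it is a common default; the function name and the input values are likewise illustrative.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted values, in the input order.

    Assumption: this is a generic FDR procedure shown for illustration,
    not necessarily the one used by CHASM or VEST.
    """
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented P-values, for illustration only
p = [0.01, 0.04, 0.03, 0.20]
fdr = benjamini_hochberg(p)
accepted = [i for i, q in enumerate(fdr) if q <= 0.1]
print(fdr, accepted)
```

A user would then pick a cut-off on the adjusted values, accepting every prediction at or below it.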

2.2 Annotation utilities

Each variant is annotated with Single Nucleotide Polymorphism Database (dbSNP) identifiers, allele frequencies from the 1000 Genomes Project and ESP6500 populations, gene function information from the GeneCards database, the number of times that variant was observed in the COSMIC database and any previous cancer association of the gene harbouring the variant, as returned by a PubMed search.

3 DISCUSSION

We provide an example to demonstrate how the CRAVAT web server can prioritize and facilitate mutation analysis. We obtained genomic coordinates of 184 824 mutations from The Cancer Genome Atlas sequencing study of 248 endometrial tumors from Firehose. We limited our submission to mutations that were called as ‘missense’ by Firehose, yielding 121 440 mutations. Options for ‘Cancer Driver Analysis’, ‘CHASM’, ‘Uterus’ tissue type and ‘Include gene annotation’ were selected. Results were received via email after 16 h: an Excel spreadsheet with pages for ‘Variant Analysis’, ‘Amino Acid Level Analysis’ and ‘Gene Level Analysis’ and a separate text file to visualize amino acid substitutions in muPIT. On the ‘Variant Analysis’ sheet, 1066 mutations, of which 800 were unique, received a CHASM false discovery rate ≤0.3. Many significantly scored mutations were involved in pathways previously determined to impact endometrial cancer, e.g. PI3K, Wnt signalling, MAPK signalling and p53 signalling pathways (Kanehisa et al.). Several genes from these pathways (PIK3CA, PTEN, TP53, KRAS and CTNNB1) were known endometrial cancer driver genes (Liang et al.). In addition to identifying well-known drivers, CHASM identified potential drivers not previously associated with endometrial cancer, in biologically relevant pathways: viz. MTOR in the PI3K pathway and GSK-3B in the Wnt signalling pathway.
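The filtering step in this example (keep mutations with a CHASM false discovery rate ≤0.3, then count how many are unique) could be reproduced on the tab-separated results file with a short Python sketch such as the one below. The column names `Gene`, `AA change` and `CHASM FDR` are assumptions, not the actual CRAVAT headers, and the rows are invented placeholders; adjust both to match the real file.

```python
import csv
import io


def significant_mutations(tsv_text, fdr_column="CHASM FDR",
                          key_columns=("Gene", "AA change"), cutoff=0.3):
    """Count rows at or below the FDR cut-off, in total and unique.

    Column names are hypothetical; two rows count as the same mutation
    when all `key_columns` values match.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    hits = [row for row in reader if float(row[fdr_column]) <= cutoff]
    unique = {tuple(row[col] for col in key_columns) for row in hits}
    return len(hits), len(unique)


# Invented placeholder rows, for illustration only
demo = (
    "Gene\tAA change\tCHASM FDR\n"
    "GENE_A\tX10Y\t0.05\n"
    "GENE_A\tX10Y\t0.05\n"
    "GENE_B\tX20Z\t0.30\n"
    "GENE_C\tX30W\t0.45\n"
)
total, distinct = significant_mutations(demo)
print(total, distinct)
```

The same mutation can appear in several samples, which is why the total count and the unique count differ, as with the 1066 and 800 reported above.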

3.1 Future work

CRAVAT is currently limited to analysis of missense mutations. We shall provide additional tools to analyse other types of mutation and to rank genes based on somatic mutation frequencies, aggregated P-values of CHASM or VEST scores, ratios of truncating to non-truncating mutations and counts of recurrently mutated positions. We also plan to include statistics useful in identifying which variant calls may be artifacts.

13 in total

1. The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes.

Authors: Kim D Pruitt; Jennifer Harrow; Rachel A Harte; Craig Wallin; Mark Diekhans; Donna R Maglott; Steve Searle; Catherine M Farrell; Jane E Loveland; Barbara J Ruef; Elizabeth Hart; Marie-Marthe Suner; Melissa J Landrum; Bronwen Aken; Sarah Ayling; Robert Baertsch; Julio Fernandez-Banet; Joshua L Cherry; Val Curwen; Michael Dicuccio; Manolis Kellis; Jennifer Lee; Michael F Lin; Michael Schuster; Andrew Shkeda; Clara Amid; Garth Brown; Oksana Dukhanina; Adam Frankish; Jennifer Hart; Bonnie L Maidak; Jonathan Mudge; Michael R Murphy; Terence Murphy; Jeena Rajan; Bhanu Rajput; Lillian D Riddick; Catherine Snow; Charles Steward; David Webb; Janet A Weber; Laurens Wilming; Wenyu Wu; Ewan Birney; David Haussler; Tim Hubbard; James Ostell; Richard Durbin; David Lipman
Journal: Genome Res Date: 2009-06-04 Impact factor: 9.043

2. Next generation tools for the annotation of human SNPs.

Authors: Rachel Karchin
Journal: Brief Bioinform Date: 2009-01 Impact factor: 11.622

3. The Catalogue of Somatic Mutations in Cancer (COSMIC).

Authors: S A Forbes; G Bhamra; S Bamford; E Dawson; C Kok; J Clements; A Menzies; J W Teague; P A Futreal; M R Stratton
Journal: Curr Protoc Hum Genet Date: 2008-04

4. The 1000 Genomes Project: data management and community access.

Authors: Laura Clarke; Xiangqun Zheng-Bradley; Richard Smith; Eugene Kulesha; Chunlin Xiao; Iliana Toneva; Brendan Vaughan; Don Preuss; Rasko Leinonen; Martin Shumway; Stephen Sherry; Paul Flicek
Journal: Nat Methods Date: 2012-04-27 Impact factor: 28.547

5. Whole-exome sequencing combined with functional genomics reveals novel candidate driver cancer genes in endometrial cancer.

Authors: Han Liang; Lydia W T Cheung; Jie Li; Zhenlin Ju; Shuangxing Yu; Katherine Stemke-Hale; Turgut Dogruluk; Yiling Lu; Xiuping Liu; Chao Gu; Wei Guo; Steven E Scherer; Hannah Carter; Shannon N Westin; Mary D Dyer; Roeland G W Verhaak; Fan Zhang; Rachel Karchin; Chang-Gong Liu; Karen H Lu; Russell R Broaddus; Kenneth L Scott; Bryan T Hennessy; Gordon B Mills
Journal: Genome Res Date: 2012-10-01 Impact factor: 9.043

6. CHASM and SNVBox: toolkit for detecting biologically important single nucleotide mutations in cancer.

Authors: Wing Chung Wong; Dewey Kim; Hannah Carter; Mark Diekhans; Michael C Ryan; Rachel Karchin
Journal: Bioinformatics Date: 2011-06-17 Impact factor: 6.937

7. Ensembl 2012.

Authors: Paul Flicek; M Ridwan Amode; Daniel Barrell; Kathryn Beal; Simon Brent; Denise Carvalho-Silva; Peter Clapham; Guy Coates; Susan Fairley; Stephen Fitzgerald; Laurent Gil; Leo Gordon; Maurice Hendrix; Thibaut Hourlier; Nathan Johnson; Andreas K Kähäri; Damian Keefe; Stephen Keenan; Rhoda Kinsella; Monika Komorowska; Gautier Koscielny; Eugene Kulesha; Pontus Larsson; Ian Longden; William McLaren; Matthieu Muffato; Bert Overduin; Miguel Pignatelli; Bethan Pritchard; Harpreet Singh Riat; Graham R S Ritchie; Magali Ruffier; Michael Schuster; Daniel Sobral; Y Amy Tang; Kieron Taylor; Stephen Trevanion; Jana Vandrovcova; Simon White; Mark Wilson; Steven P Wilder; Bronwen L Aken; Ewan Birney; Fiona Cunningham; Ian Dunham; Richard Durbin; Xosé M Fernández-Suarez; Jennifer Harrow; Javier Herrero; Tim J P Hubbard; Anne Parker; Glenn Proctor; Giulietta Spudich; Jan Vogel; Andy Yates; Amonida Zadissa; Stephen M J Searle
Journal: Nucleic Acids Res Date: 2011-11-15 Impact factor: 16.971

8. GIFtS: annotation landscape analysis with GeneCards.

Authors: Arye Harel; Aron Inger; Gil Stelzer; Liora Strichman-Almashanu; Irina Dalah; Marilyn Safran; Doron Lancet
Journal: BMC Bioinformatics Date: 2009-10-23 Impact factor: 3.169

9. Identifying Mendelian disease genes with the variant effect scoring tool.

Authors: Hannah Carter; Christopher Douville; Peter D Stenson; David N Cooper; Rachel Karchin
Journal: BMC Genomics Date: 2013-05-28 Impact factor: 3.969

10. The Human Gene Mutation Database: 2008 update.

Authors: Peter D Stenson; Matthew Mort; Edward V Ball; Katy Howells; Andrew D Phillips; Nick St Thomas; David N Cooper
Journal: Genome Med Date: 2009-01-22 Impact factor: 11.117
