Data mining using the Catalogue of Somatic Mutations in Cancer BioMart.

Rebecca Shepherd, Simon A Forbes, David Beare, S Bamford, Charlotte G Cole, Sari Ward, Nidhi Bindal, Prasad Gunasekaran, Mingming Jia, Chai Yin Kok, Kenric Leung, Andrew Menzies, Adam P Butler, Jon W Teague, Peter J Campbell, Michael R Stratton, P Andrew Futreal.

Abstract

The Catalogue of Somatic Mutations in Cancer (COSMIC) (http://www.sanger.ac.uk/cosmic) is a publicly available resource providing information on somatic mutations implicated in human cancer. Release v51 (January 2011) includes data from just over 19,000 genes, 161,787 coding mutations and 5573 gene fusions, described in more than 577,000 tumour samples. COSMICMart (COSMIC BioMart) provides a flexible way to mine these data and combine somatic mutations with other biologically relevant data sets. This article describes the data available in COSMIC along with examples of how to mine and integrate data sets using COSMICMart.
Database URL: http://www.sanger.ac.uk/genetics/CGP/cosmic/biomart/martview/

Year: 2011; PMID: 21609966; PMCID: PMC3263736; DOI: 10.1093/database/bar018

Source DB: PubMed; Journal: Database (Oxford); ISSN: 1758-0463; Impact factor: 3.451


Project description

COSMIC is a repository for somatic mutations and associated phenotype/clinical information, combining data from a number of sources. Primarily, data are manually curated from the scientific literature for genes selected from the Cancer Gene Census (http://www.sanger.ac.uk/genetics/CGP/Census/), a listing of genes that are known to be mutated in human cancer. The manual curation initially focused on known cancer genes that had a high proportion of coding point mutations. In recent years, this has been extended to include cancer genes that are also mutated by gene fusion events. With the advent of next-generation sequencing, COSMIC has been adapted to hold complete catalogues of somatic mutations for individual tumour samples. The COSMIC website has been developed to provide an overview of the underlying database in a user-friendly manner. The website can be navigated by gene, cancer sample or tissue/histology type and has a series of graphical and tabular displays to summarize the content of the database.
However, the integration of biological data sets is still a major IT challenge. Although the data in COSMIC can be viewed and downloaded in a number of useful ways, it is still challenging for users to generate their own custom queries and to link to related resources. To facilitate the integration of the COSMIC data set, we have used the BioMart data mining software. BioMart uses a federated data model to allow integration of biological data from diverse databases (1). A significant number of data resources have now set up their own BioMarts, including Ensembl, UniProt and HGNC. The BioMart software also has interfaces that allow users to easily generate custom queries to obtain subsets of data from one or more BioMart data resources. We have successfully set up an instance of BioMart, COSMICMart, which holds a summary of the somatic mutations and associated phenotype data from the COSMIC database.

Data content

COSMIC version 51 (January 2011) contains a full curation of the scientific literature for 91 cancer genes, mostly point-mutated, as well as 53 curated fusion gene pairs, with over 11 000 papers having been assessed. Weekly literature searches are carried out for each cancer gene and each abstract/article is manually evaluated. Checks are made to determine whether a paper has already been curated, whether it is a review or reports original data, whether there are any inconsistencies in the data, and whether all the information required for full curation is present, e.g. the numbers of samples screened and sufficient mutation detail. Papers which pass the initial scan are submitted for detailed manual curation, while the remaining papers are ‘listed’ on the COSMIC website. Information is manually recorded on the actual mutation change, the detailed phenotype of the cancer samples and the original publication/study. Mutation changes are checked for accuracy and recorded at the genome, transcript (to one reference transcript) and protein level using the mutation syntax standards developed by the Human Genome Variation Society (2). Negative results (a sample screened for a particular gene in a particular publication that does not carry a mutation) are also recorded so that prevalence statistics can be estimated for each gene. As part of the curation process, cancer sample tissue/histology descriptions from the original publication are recorded and then re-classified to a COSMIC standard tissue/histology ontology (see the classification section of the COSMIC additional information page, http://www.sanger.ac.uk/genetics/CGP/cosmic/add_info/). Standardizing the main data types in COSMIC significantly enhances the utility of the data and makes them easier to browse and incorporate into external data sets. When available, clinical and exposure data and therapeutic responses are also recorded.
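The role of negative results in prevalence estimation can be illustrated with a minimal sketch. It assumes each curated screen is reduced to a (gene, sample, mutated) record; the record layout, the function name `prevalence_by_gene` and the toy identifiers are illustrative assumptions, not the COSMIC schema or real data.

```python
from collections import defaultdict


def prevalence_by_gene(screens):
    """Estimate per-gene mutation prevalence.

    `screens` is an iterable of (gene, sample_id, is_mutated) tuples.
    This layout is an assumption made for illustration only.
    """
    screened = defaultdict(set)  # every sample screened for the gene
    mutated = defaultdict(set)   # subset of samples found to be mutated
    for gene, sample_id, is_mutated in screens:
        screened[gene].add(sample_id)
        if is_mutated:
            mutated[gene].add(sample_id)
    # Negative results enlarge the denominator; without them every
    # gene would appear to be mutated in 100% of its samples.
    return {g: len(mutated[g]) / len(screened[g]) for g in screened}


# Toy records with placeholder names (not real COSMIC data).
toy_screens = [
    ("GENE_A", "S1", True),
    ("GENE_A", "S2", False),
    ("GENE_A", "S3", False),
    ("GENE_A", "S4", True),
    ("GENE_B", "S1", False),
    ("GENE_B", "S2", False),
]

print(prevalence_by_gene(toy_screens))  # {'GENE_A': 0.5, 'GENE_B': 0.0}
```

Samples are stored in sets so that a sample reported more than once for the same gene is counted only once in either the numerator or the denominator.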
The full functionality of COSMIC has previously been described (3, 4). The database coverage has recently been enhanced by the integration of somatic mutations from external data resources. A collaboration with the database curators at the International Agency for Research on Cancer (IARC) has allowed the majority of mutations from the IARC TP53 database R14 to be integrated into COSMIC (5). Somatic mutations from large-scale systematic cancer screens are curated in COSMIC, including studies on glioblastoma multiforme, breast, colorectal and lung cancer (6–9). Data are also directly curated from the data portals of large-scale projects, including the Cancer Genome Project (CGP) at the Sanger Institute and, more recently, validated somatic mutations from The Cancer Genome Atlas (TCGA) (6) and the International Cancer Genome Consortium (ICGC) (10). COSMIC has been enhanced to accommodate somatic mutations from whole genome and exome screens. This includes all somatic and non-coding mutations, structural rearrangements and gene fusions. Currently, COSMIC has annotated 51 genomes and 332 exomes from a range of cancers, including lung, malignant melanoma, renal, AML, pancreatic and ovarian cancer. The current contents of the database (v51, January 2011) are displayed in Table 1.

Table 1. Total contents in v51 of the COSMIC database, January 2011 release

| Curated data type | Curated data count |
| --- | --- |
| Experiments | 2 946 792 |
| Tumours | 577 304 |
| Mutations | 167 193 |
| References | 11 062 |
| Genes | 19 000 |
| Fusions | 5573 |
| Structural variants | 2729 |
| Whole-cancer genomes | 51 |
| Whole-cancer exomes | 332 |

Query examples

COSMICMart allows data to be filtered on six different categories (Figure 1): cancer sample, gene, mutation, site of the tumour, histology and other (e.g. Ensembl Gene ID, Swissprot ID, Entrez Gene ID). The interface has a number of pre-selected filters and attributes; mutated samples are selected by default. Users can change these to suit their requirements. Results are displayed in tabulated form and are exportable in various formats for further analysis.

Figure 1. Example of how COSMICMart can be queried. This query searches for all cell lines with missense substitution mutations in the BRAF gene (A). Attributes can be selected (B) to display in the results table (C).

Query #1: ‘Find all missense substitution mutations for BRAF in cell lines, and display sample, mutation, site, and histology information’ (Figure 1, Table 2).

Table 2. Data sets, filters and attributes selected for query #1

| Data sets | Filters | Attributes |
| --- | --- | --- |
| COSMIC51 | Mutated sample: yes | Sample name |
| | Sample source: cell-line | Sample source |
| | Gene name: BRAF | Gene name |
| | AA mutation type: substitution - missense | Cosmic mutation ID (COSM ID) |
| | | CDS mutation syntax |
| | | AA mutation syntax |
| | | Primary site |
| | | Primary histology |
| | | Tumour source |
| | | Pubmed ID |

Missense mutations are the most common variant type in COSMIC; over 90% of mutations in BRAF are missense mutations at the p.V600 position. The results are returned as a tabular summary with links back to the COSMIC website. The sample name field links back to the COSMIC sample overview web page, and the mutation ID (COSM ID) links to the COSMIC mutation summary page (Figure 2). From the COSMIC mutation summary web page, there are links to the Ensembl contig view so the mutation can be viewed in a genomic context. There are also links to GMOD’s GBrowse, where COSMIC coding and non-coding mutations, gene footprints, structural rearrangements and copy number variants can be viewed (11).
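BioMart servers generally also accept queries as XML documents, so the selections for query #1 can be expressed programmatically. The sketch below only builds such a document. The `Query`/`Dataset`/`Filter`/`Attribute` layout follows the general BioMart XML query format, but the internal dataset, filter and attribute names used here are placeholders: the article gives only display names, and the real identifiers would have to be looked up on the COSMICMart server.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, filters, attributes):
    """Return a BioMart-style XML query string.

    dataset    -- internal dataset name (placeholder in this sketch)
    filters    -- dict mapping internal filter name to value
    attributes -- list of internal attribute names
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",   # tab-separated results
        "header": "1",        # include a column header row
        "uniqueRows": "1",    # drop duplicate rows
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset",
                       {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    header = '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>'
    return header + ET.tostring(query, encoding="unicode")


# Query #1 from Table 2. All internal names and values below are
# ASSUMED placeholders, not verified COSMICMart identifiers.
query_xml = build_biomart_query(
    dataset="COSMIC51",
    filters={
        "mutated_sample": "yes",
        "sample_source": "cell-line",
        "gene_name": "BRAF",
        "aa_mutation_type": "Substitution - Missense",
    },
    attributes=[
        "sample_name", "sample_source", "gene_name", "cosmic_mutation_id",
        "cds_mutation_syntax", "aa_mutation_syntax", "primary_site",
        "primary_histology", "tumour_source", "pubmed_id",
    ],
)
print(query_xml)
```

On a typical BioMart installation the resulting string would be submitted as the `query` parameter of the server's `martservice` endpoint. The request itself is left out here because the service URL for COSMICMart is not stated in this article.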

Figure 2. The COSMIC sample (A) and mutation (B) summary pages are linked directly from the COSMICMart output table.

Query #2: ‘Find all gene fusion mutations involving the FUS gene with a primary site of bone, and display mutation and sample information’ (Table 3).

Table 3. Data sets, filters and attributes selected for query #2

| Data sets | Filters | Attributes |
| --- | --- | --- |
| COSMIC51 | Mutated sample: yes | Cosmic sample ID |
| | Gene name: FUS | Sample name |
| | CDS mutation type: inferred breakpoint, observed mRNA | Sample source |
| | Primary site: bone | Cosmic fusion mutation ID |
| | | CDS mutation syntax |
| | | Pubmed ID |

Gene fusions have been associated with a number of specific tumour types including prostate and blood tumours. These biomarkers can be useful in diagnosis and as targets for drug therapies. COSMIC has annotations for an increasing number of gene fusion mutations, which are viewable using COSMICMart. The COSMIC fusion mutation ID links to the gene fusion summary pages, which give a graphical view of the different fusion structures observed. Many of the papers describing gene fusions have identified more than one gene fusion product for the same genes in a single sample. Observed mRNAs are the actual expressed products reported in the results. However, to aid display and website navigation, we have inferred the genomic breakpoint from the experimental data.

Query #3: ‘Find variation information in Ensembl for all genes from mutated samples with a primary site of breast, and display COSMIC gene, mutation and sample information along with Ensembl variation information’ (Table 4).

Table 4. Data sets, filters and attributes selected for query #3

| Data sets | Filters | Attributes |
| --- | --- | --- |
| COSMIC51 | Mutated sample: yes | Cosmic sample ID |
| | Primary site: breast | Sample name |
| | | Sample source |
| | | Cosmic mutation ID (COSM ID) |
| | | CDS mutation syntax |
| | | AA mutation syntax |
| Ensembl: Homo sapiens genes | | Features: Ensembl gene ID |
| | | Features: Ensembl transcript ID |
| | | Variations: variation source |
| | | Variations: source description |
| | | Variations: reference ID |
| | | Variations: allele |

COSMICMart is federated with Ensembl (12), which allows BioMart queries to return and integrate data from both resources. For instance, linking the two resources allows variation data to be retrieved from both (somatic mutations from COSMIC and germline polymorphisms from Ensembl) for a particular gene, set of genes or cancer type. There is an increasing awareness of how genomic variation can affect a tumour’s sensitivity or resistance to anti-cancer agents. While this genetic variation can be familial or somatic, an understanding of common genetic variation around known cancer genes can be of much value to investigations searching for loci modifying a tumour’s response to drug therapy (13–15). This query is achieved by first selecting the filters/attributes in the COSMIC BioMart and then clicking the ‘Dataset’ link at the bottom of the left-hand margin of the BioMart interface. An additional data set can then be selected from the drop-down list, in this instance Ensembl, which allows a federated query between COSMIC and Ensembl. The filters/attributes are then set in the usual way using the Ensembl BioMart to produce an integrated query.
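Conceptually, the federated query behaves like a join between the two result sets on a shared identifier. The sketch below imitates this in memory, assuming the link key is the Ensembl gene ID; the key name, the field names and the toy rows are illustrative placeholders, and the actual linking is performed by the BioMart software rather than by user code.

```python
from collections import defaultdict


def link_by_gene(cosmic_rows, ensembl_rows, key="ensembl_gene_id"):
    """Pair each COSMIC mutation row with every Ensembl variation row
    for the same gene (an inner join on `key`)."""
    variations = defaultdict(list)
    for row in ensembl_rows:
        variations[row[key]].append(row)
    linked = []
    for mut in cosmic_rows:
        # Genes without any Ensembl variation row yield no output rows.
        for var in variations.get(mut[key], []):
            linked.append({**mut, **var})
    return linked


# Toy rows with placeholder identifiers (not real COSMIC/Ensembl data).
cosmic_rows = [
    {"ensembl_gene_id": "GENE_A", "cosmic_mutation_id": "COSM_X1",
     "sample_name": "S1"},
    {"ensembl_gene_id": "GENE_B", "cosmic_mutation_id": "COSM_X2",
     "sample_name": "S2"},
]
ensembl_rows = [
    {"ensembl_gene_id": "GENE_A", "variation_reference_id": "VAR_1",
     "allele": "A/G"},
    {"ensembl_gene_id": "GENE_A", "variation_reference_id": "VAR_2",
     "allele": "C/T"},
]

for row in link_by_gene(cosmic_rows, ensembl_rows):
    print(row["cosmic_mutation_id"], row["variation_reference_id"],
          row["allele"])
```

With these toy rows, the single somatic mutation in `GENE_A` is paired with both germline variants of that gene, while `GENE_B` drops out because no Ensembl row matches it.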

Future directions

COSMIC will continue to curate newly discovered cancer genes and is committed to updating existing cancer genes with a data release every 2 months. This will ensure that the scientific community has an up-to-date catalogue of somatic mutations implicated in human cancer. COSMICMart is also automatically updated with each new COSMIC release, which allows the data set to be easily mined and integrated with other resources. COSMIC has been successfully adapted to hold complete catalogues of somatic mutations for individual cancer samples. Currently COSMIC holds genome-wide data on 383 tumour samples and we expect this to increase in the near future. We intend to federate COSMICMart with further BioMart-driven data resources in addition to the current link with Ensembl. Linking our data to PRIDE (16), UniProt (17) and InterPro (18) will allow COSMIC somatic mutation data to be linked to protein and peptide annotation, while the addition of the Reactome (19) database will allow the incorporation of pathway data. We also intend to create direct links between COSMIC and the ICGC Data Portal (http://dcc.icgc.org/) so somatic mutation data can be integrated between the two data resources.

Funding

Funding for open access charge: Wellcome Trust (grant reference 077012/Z/05/Z).

Conflict of interest

None declared.

References

19 in total

Review 1.  Cell line-based platforms to evaluate the therapeutic efficacy of candidate anticancer agents.

Authors:  Sreenath V Sharma; Daniel A Haber; Jeff Settleman
Journal:  Nat Rev Cancer       Date:  2010-03-19       Impact factor: 60.716

Review 2.  Factors underlying sensitivity of cancers to small-molecule kinase inhibitors.

Authors:  Pasi A Jänne; Nathanael Gray; Jeff Settleman
Journal:  Nat Rev Drug Discov       Date:  2009-07-24       Impact factor: 84.694

3.  Somatic mutations affect key pathways in lung adenocarcinoma.

Authors:  Li Ding; Gad Getz; David A Wheeler; Elaine R Mardis; Michael D McLellan; Kristian Cibulskis; Carrie Sougnez; Heidi Greulich; Donna M Muzny; Margaret B Morgan; Lucinda Fulton; Robert S Fulton; Qunyuan Zhang; Michael C Wendl; Michael S Lawrence; David E Larson; Ken Chen; David J Dooling; Aniko Sabo; Alicia C Hawes; Hua Shen; Shalini N Jhangiani; Lora R Lewis; Otis Hall; Yiming Zhu; Tittu Mathew; Yanru Ren; Jiqiang Yao; Steven E Scherer; Kerstin Clerc; Ginger A Metcalf; Brian Ng; Aleksandar Milosavljevic; Manuel L Gonzalez-Garay; John R Osborne; Rick Meyer; Xiaoqi Shi; Yuzhu Tang; Daniel C Koboldt; Ling Lin; Rachel Abbott; Tracie L Miner; Craig Pohl; Ginger Fewell; Carrie Haipek; Heather Schmidt; Brian H Dunford-Shore; Aldi Kraja; Seth D Crosby; Christopher S Sawyer; Tammi Vickery; Sacha Sander; Jody Robinson; Wendy Winckler; Jennifer Baldwin; Lucian R Chirieac; Amit Dutt; Tim Fennell; Megan Hanna; Bruce E Johnson; Robert C Onofrio; Roman K Thomas; Giovanni Tonon; Barbara A Weir; Xiaojun Zhao; Liuda Ziaugra; Michael C Zody; Thomas Giordano; Mark B Orringer; Jack A Roth; Margaret R Spitz; Ignacio I Wistuba; Bradley Ozenberger; Peter J Good; Andrew C Chang; David G Beer; Mark A Watson; Marc Ladanyi; Stephen Broderick; Akihiko Yoshizawa; William D Travis; William Pao; Michael A Province; George M Weinstock; Harold E Varmus; Stacey B Gabriel; Eric S Lander; Richard A Gibbs; Matthew Meyerson; Richard K Wilson
Journal:  Nature       Date:  2008-10-23       Impact factor: 49.962

4.  An integrated genomic analysis of human glioblastoma multiforme.

Authors:  D Williams Parsons; Siân Jones; Xiaosong Zhang; Jimmy Cheng-Ho Lin; Rebecca J Leary; Philipp Angenendt; Parminder Mankoo; Hannah Carter; I-Mei Siu; Gary L Gallia; Alessandro Olivi; Roger McLendon; B Ahmed Rasheed; Stephen Keir; Tatiana Nikolskaya; Yuri Nikolsky; Dana A Busam; Hanna Tekleab; Luis A Diaz; James Hartigan; Doug R Smith; Robert L Strausberg; Suely Kazue Nagahashi Marie; Sueli Mieko Oba Shinjo; Hai Yan; Gregory J Riggins; Darell D Bigner; Rachel Karchin; Nick Papadopoulos; Giovanni Parmigiani; Bert Vogelstein; Victor E Velculescu; Kenneth W Kinzler
Journal:  Science       Date:  2008-09-04       Impact factor: 47.728

5.  Comprehensive genomic characterization defines human glioblastoma genes and core pathways.

Authors: 
Journal:  Nature       Date:  2008-09-04       Impact factor: 49.962

6.  BioMart Central Portal--unified access to biological data.

Authors:  Syed Haider; Benoit Ballester; Damian Smedley; Junjun Zhang; Peter Rice; Arek Kasprzyk
Journal:  Nucleic Acids Res       Date:  2009-05-06       Impact factor: 16.971

7.  COSMIC (the Catalogue of Somatic Mutations in Cancer): a resource to investigate acquired mutations in human cancer.

Authors:  Simon A Forbes; Gurpreet Tang; Nidhi Bindal; Sally Bamford; Elisabeth Dawson; Charlotte Cole; Chai Yin Kok; Mingming Jia; Rebecca Ewing; Andrew Menzies; Jon W Teague; Michael R Stratton; P Andrew Futreal
Journal:  Nucleic Acids Res       Date:  2009-11-11       Impact factor: 16.971

8.  Ensembl's 10th year.

Authors:  Paul Flicek; Bronwen L Aken; Benoit Ballester; Kathryn Beal; Eugene Bragin; Simon Brent; Yuan Chen; Peter Clapham; Guy Coates; Susan Fairley; Stephen Fitzgerald; Julio Fernandez-Banet; Leo Gordon; Stefan Gräf; Syed Haider; Martin Hammond; Kerstin Howe; Andrew Jenkinson; Nathan Johnson; Andreas Kähäri; Damian Keefe; Stephen Keenan; Rhoda Kinsella; Felix Kokocinski; Gautier Koscielny; Eugene Kulesha; Daniel Lawson; Ian Longden; Tim Massingham; William McLaren; Karine Megy; Bert Overduin; Bethan Pritchard; Daniel Rios; Magali Ruffier; Michael Schuster; Guy Slater; Damian Smedley; Giulietta Spudich; Y Amy Tang; Stephen Trevanion; Albert Vilella; Jan Vogel; Simon White; Steven P Wilder; Amonida Zadissa; Ewan Birney; Fiona Cunningham; Ian Dunham; Richard Durbin; Xosé M Fernández-Suarez; Javier Herrero; Tim J P Hubbard; Anne Parker; Glenn Proctor; James Smith; Stephen M J Searle
Journal:  Nucleic Acids Res       Date:  2009-11-11       Impact factor: 16.971

9.  The Universal Protein Resource (UniProt) in 2010.

Authors: 
Journal:  Nucleic Acids Res       Date:  2009-10-20       Impact factor: 16.971

10.  InterPro: the integrative protein signature database.

Authors:  Sarah Hunter; Rolf Apweiler; Teresa K Attwood; Amos Bairoch; Alex Bateman; David Binns; Peer Bork; Ujjwal Das; Louise Daugherty; Lauranne Duquenne; Robert D Finn; Julian Gough; Daniel Haft; Nicolas Hulo; Daniel Kahn; Elizabeth Kelly; Aurélie Laugraud; Ivica Letunic; David Lonsdale; Rodrigo Lopez; Martin Madera; John Maslen; Craig McAnulla; Jennifer McDowall; Jaina Mistry; Alex Mitchell; Nicola Mulder; Darren Natale; Christine Orengo; Antony F Quinn; Jeremy D Selengut; Christian J A Sigrist; Manjula Thimma; Paul D Thomas; Franck Valentin; Derek Wilson; Cathy H Wu; Corin Yeats
Journal:  Nucleic Acids Res       Date:  2008-10-21       Impact factor: 16.971

