
CaSNP: a database for interrogating copy number alterations of cancer genome from SNP array data.

Qingyi Cao, Meng Zhou, Xujun Wang, Cliff A Meyer, Yong Zhang, Zhi Chen, Cheng Li, X Shirley Liu.

Abstract

Cancer is known to have abundant copy number alterations (CNAs) that greatly contribute to its pathogenesis and progression. Investigation of CNA regions could potentially help identify oncogenes and tumor suppressor genes and infer cancer mechanisms. Although single-nucleotide polymorphism (SNP) arrays have strengthened our ability to identify CNAs with unprecedented resolution, a comprehensive collection of CNA information from SNP array data is still lacking. We developed a web-based CaSNP (http://cistrome.dfci.harvard.edu/CaSNP/) database for storing and interrogating quantitative CNA data, which curated ∼11,500 SNP arrays on 34 different cancer types in 104 studies. With a user input of region or gene of interest, CaSNP will return the CNA information summarizing the frequencies of gain/loss and averaged copy number for each study, and provide links to download the data or visualize it in UCSC Genome Browser. CaSNP also displays the heatmap showing copy numbers estimated at each SNP marker around the query region across all studies for a more comprehensive visualization. Finally, we used CaSNP to study the CNA of protein-coding genes as well as LincRNA genes across all cancer SNP arrays, and found putative regions harboring novel oncogenes and tumor suppressors. In summary, CaSNP is a useful tool for cancer CNA association studies, with the potential to facilitate both basic science and translational research on cancer.

Year: 2010 PMID: 20972221 PMCID: PMC3013814 DOI: 10.1093/nar/gkq997

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Cancer is a complex genetic disease whose initiation and progression are often accompanied by genome alterations. A large number of copy number alterations (CNAs) are known to occur across the entire human genome in malignant neoplasms. Of the different types of genome variation, CNAs have been the most strongly implicated in oncogenesis and cancer progression, and many CNAs are known to be characteristic of specific types of cancer (1–3). There is a growing demand to understand the nature of CNAs in cancer, as CNAs not only serve as biomarkers to predict cancer malignancy and prognosis, but also often harbor tumor suppressors and oncogenes (4,5), the study of which could shed light on the sequence and mechanism of oncogenesis. In addition, there is increasing evidence that some CNAs target noncoding RNA (ncRNA) genes such as miRNAs (6), suggesting that ncRNAs might be extensively involved in oncogenesis. Array comparative genomic hybridization (aCGH) has long been the standard platform for investigating relative gains and losses of genomic DNA, by measuring the signal ratios between differentially labeled tumor and normal samples hybridized to the array. Several repositories focusing on CNAs detected by aCGH are already publicly available (7–9). However, most of these aCGH studies use BAC or cDNA probes, which offer only coarse resolution for CNA detection. In the last few years, single nucleotide polymorphism (SNP) arrays have gradually become the major platform for SNP genotyping and CNA detection (10,11). SNP arrays carry probes for known SNPs that are densely distributed across the human genome, and allow accurate SNP genotyping at these loci for individual biological samples. In addition, by comparing the signal intensities at the SNP loci between cancer and normal reference samples, one can also obtain high-resolution CNA information about the cancer of interest.
It has been reported that SNP arrays outperform traditional aCGH in CNA detection resolution (12), enabling high-resolution SNP and CNA detection at the level of individual genes (13–15). Currently, some studies display their own results on dedicated websites (e.g. http://www.broadinstitute.org/tumorscape). However, a comprehensive resource of CNA data from all cancer SNP array experiments is still unavailable. We present CaSNP as a comprehensive collection of CNA information inferred from cancer SNP array data. We analyzed ∼11 500 Affymetrix SNP arrays on 34 different cancer types in 104 studies to profile genome-wide CNAs. This includes all publicly available cancer SNP profiles generated with Affymetrix SNP arrays, mostly from Gene Expression Omnibus (GEO) (16). We also developed a data extraction and annotation schema to interrogate copy number in a user-specified genomic region by cancer type and across different array platforms (from SNP 10K to 6.0) and studies. CaSNP is available at http://cistrome.dfci.harvard.edu/CaSNP/.

DESIGN AND IMPLEMENTATION

Data analysis and curation

Among the 104 studies collected, 100 are from GEO, one is from GlaxoSmithKline (https://cabig.nci.nih.gov/tools/caArray_GSKdata) and three are from the supplementary websites of individual publications (17,18). The raw data (.cel files) of the array experiments and the accompanying genotype files (if available) were collected for all samples. dCHIP-SNP (19), a widely used and referenced SNP array analysis algorithm (cited 238 times according to Google Scholar), was applied to each data set. Raw array data within each study were normalized in dCHIP-SNP with invariant set normalization, and signal values for individual SNP loci were then computed with the model-based expression index method (20). The relative copy number value for each SNP was calculated as the signal ratio of each tumor sample to the average of the normal reference samples within the same data set, and was exported and stored in CaSNP. For data sets with no normal reference samples, the average ‘normal reference’ was calculated for each SNP from the tumor samples bearing the middle 50% of signals (i.e. the 25% outlier signals on each side were excluded). We did not use normal samples from other experiments of the same array type as a reference, to avoid potential microarray batch effects. To treat and query copy number data from different array platforms in a unified manner, we updated the genome coordinate system to the latest human genome assembly (UCSC hg19). In addition, all SNP IDs were converted to dbSNP129. We also manually extracted and curated information on the clinical background of the samples and organized it at two levels: the top level is the tissue of origin (e.g. lung cancer), and the second level is the cancer subtype (e.g. small cell lung carcinoma).
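The reference and ratio calculation described above can be illustrated with a minimal sketch for a single SNP. This is not the dCHIP-SNP implementation: the function names are hypothetical, the inputs are assumed to be already normalized signal values, and scaling the ratio by 2 to place it on a diploid copy number scale is an assumption made here for readability.

```python
from statistics import mean

def trimmed_reference(signals):
    """Average the middle 50% of signals (drop the lowest and highest 25%)."""
    ordered = sorted(signals)
    k = len(ordered) // 4                 # samples trimmed from each side
    core = ordered[k:len(ordered) - k]
    return mean(core)

def relative_copy_number(tumor_signals, normal_signals=None, ploidy=2.0):
    """Return one copy number estimate per tumor sample at a single SNP.

    If normal samples from the same data set are given, their mean is the
    reference; otherwise the reference is the trimmed mean of the tumors.
    The ploidy factor (assumed, not from the source) rescales the ratio.
    """
    if normal_signals:
        reference = mean(normal_signals)
    else:
        reference = trimmed_reference(tumor_signals)
    return [ploidy * s / reference for s in tumor_signals]
```

With matched normals, a tumor signal equal to the normal mean maps to 2.0 on this scale; without normals, extreme tumor signals are excluded from the reference but still receive a copy number estimate.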

Query parameters

The only required field in a user query is the genome region, for which the user inputs a genomic coordinate range (limited to 2 Mb in size), a gene name, a RefSeq ID or an miRNA name, all of which are internally converted to a genomic coordinate range. The user can optionally specify the cancer type and subtype to limit the query. Alternatively, one can go to the ‘Browse Data’ page to select a subset of the data sets/series for analysis. The ‘Browse Data’ option allows the user to focus on specific studies or to conduct a joint analysis of two or more studies and/or across multiple cancer types. When the user specifies a cancer type, subtype or data sets/series, CaSNP consults the sample information table to extract the matching samples for analysis. In addition, the user can specify the upper and lower CNA thresholds (default 2.2 and 1.8, respectively) for CaSNP to calculate the percentage of samples beyond the thresholds within each study. A flowchart depicting the internal table schema of CaSNP is shown in Figure 1.
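As a rough illustration of how a region input might be normalized to a coordinate range, the sketch below accepts either a coordinate string or a name looked up in a table. The `chr:start-end` input syntax, the function name `parse_region` and the `gene_table` dictionary are assumptions for illustration only; the actual CaSNP query code is not shown in the paper. Only the 2-Mb size limit is taken from the text.

```python
import re

MAX_REGION_BP = 2_000_000  # size limit stated in the text

# Assumed input syntax, e.g. "chr5:1,000,000-2,500,000"
_REGION = re.compile(r"^(chr[0-9XYM]+):([\d,]+)-([\d,]+)$")

def parse_region(text, gene_table=None):
    """Resolve a coordinate string or a gene/miRNA name to (chrom, start, end).

    gene_table is a hypothetical mapping from a name to its coordinates,
    standing in for the RefSeq and miRNA coordinate tables.
    """
    text = text.strip()
    match = _REGION.match(text)
    if match:
        chrom = match.group(1)
        start = int(match.group(2).replace(",", ""))
        end = int(match.group(3).replace(",", ""))
    elif gene_table and text in gene_table:
        chrom, start, end = gene_table[text]
    else:
        raise ValueError(f"unrecognized region or name: {text!r}")
    if start >= end:
        raise ValueError("start must be smaller than end")
    if end - start > MAX_REGION_BP:
        raise ValueError("region exceeds the 2-Mb limit")
    return chrom, start, end
```

For example, `parse_region("GENE1", {"GENE1": ("chr1", 100, 5000)})` resolves the placeholder name through the table, while a 3-Mb coordinate range raises `ValueError`.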

Figure 1.

Overview of the internal structure of CaSNP. The genome region input, of whatever kind, is uniformly translated into genomic coordinates by querying the coordinate tables of miRNA or RefSeq genes, and is then sent to the query engine. The cancer type input is checked against the sample information table to extract the names of the samples matching this cancer type, which are then used by the query engine to search the CNA data tables. The CNA data are stored in one table per series and grouped by platform type. After being extracted from the data tables, the relevant copy number data are combined and grouped by the output engine to calculate average copy numbers and the percentage of threshold-passing samples, which are displayed on the result page. A graphic display is also available, in which the signals of each series in the queried region are represented as heatmaps. In addition, the returned CNA data are written to .bed files for users to download. Detailed information for each study can be viewed on the ‘Browse Data’ page by following the links to the corresponding annotations on GEO.

Data output and visualization

A screenshot of CaSNP’s result output page is shown in Figure 2. The most important results from a CaSNP query are the average copy number of the queried region for each of the series involved and the percentage of samples exceeding the copy number thresholds. The average copy number is calculated as the mean over all biological samples in each series. If a sample was profiled on multiple array platforms, all data for that sample are combined before the calculation. If the user specifies the upper or lower CNA threshold at input, the frequency of threshold-passing samples is also displayed for each series. This helps the user determine whether an observed CNA is prevalent in many samples or caused only by outliers. The percentages of threshold-passing samples at the SNP loci in the region are also encoded in the bedGraph file format, which is the standard for displaying continuous-valued data as a track in the UCSC Genome Browser. The generated bed files can be viewed directly in the UCSC Genome Browser (21) via a link, or downloaded. Also displayed on the result page are the sample and SNP counts for each series, links to the corresponding GEO entries at NCBI and other relevant information.
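The per-SNP percentages and their bedGraph encoding can be sketched as follows. This is an illustrative example rather than CaSNP's own output code: the function names and the input layout are hypothetical, and the use of strict inequalities against the default thresholds (2.2 and 1.8) is an assumption. The bedGraph lines follow the UCSC convention of a track line followed by tab-separated chromosome, 0-based start, end and value.

```python
def threshold_percentages(copy_numbers, upper=2.2, lower=1.8):
    """Return (gain %, loss %) across samples at one SNP."""
    n = len(copy_numbers)
    gain = 100.0 * sum(cn > upper for cn in copy_numbers) / n
    loss = 100.0 * sum(cn < lower for cn in copy_numbers) / n
    return gain, loss

def bedgraph_lines(chrom, snps, track_name, which="gain"):
    """Build bedGraph lines for one series.

    snps is a list of (1-based position, [copy number per sample]) pairs.
    Each SNP becomes a 1-bp interval with 0-based start, as bedGraph expects.
    """
    lines = [f'track type=bedGraph name="{track_name}"']
    for pos, cns in sorted(snps, key=lambda s: s[0]):
        gain, loss = threshold_percentages(cns)
        value = gain if which == "gain" else loss
        lines.append(f"{chrom}\t{pos - 1}\t{pos}\t{value:.1f}")
    return lines
```

Writing `"\n".join(bedgraph_lines(...))` to a file gives a track that can be uploaded as a custom track; separate tracks for gain and loss keep the two frequencies visually distinct.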

Figure 2.

A screenshot of CaSNP’s query result page.

A graphic display of the results is also provided through the ‘HeatMap’ query page (Figure 3). The series returned are grouped by array platform, with CNAs (loss to gain) expressed as a color gradient (blue to red) and white for normal diploid (copy number 2). This gives users a comprehensive view of the copy number data in the queried region and cancer types. The heatmaps are dynamically generated from the data in the database.

Figure 3.

A screenshot of CaSNP’s heatmap query result page. Red represents higher copy number, blue represents lower copy number and white represents normal copy number. Rows are the samples involved, and columns are the individual SNP markers detected by the corresponding array platforms along the queried region.
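A blue-white-red gradient of this kind can be sketched with a simple linear mapping from copy number to an RGB triple. The function name, the linear interpolation and the saturation points (copy number 0 for full blue, 4 for full red) are assumptions for illustration; the paper does not state how CaSNP scales its colors.

```python
def cn_to_rgb(cn, normal=2.0, span=2.0):
    """Map a copy number to (R, G, B): blue for loss, white at normal, red for gain.

    Values further than `span` from `normal` are clamped to the full color.
    """
    t = max(-1.0, min(1.0, (cn - normal) / span))
    fade = int(round(255 * (1 - abs(t))))
    if t >= 0:
        return (255, fade, fade)   # white fading to red
    return (fade, fade, 255)       # white fading to blue
```

Applying this function to each sample-by-SNP value yields one row of colored cells per sample, matching the layout described in the caption above.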

Database implementation

CaSNP runs on an Apache web server, and the data reside in a MySQL server. The scripts for query processing and data analysis are written in Python, and the user interface is built on the Django framework.

A CASE STUDY USING CaSNP

As an example of how CaSNP can be used for cancer biomarker or oncogene/tumor suppressor detection, we systematically extracted the copy number of all 20 221 RefSeq genes from CaSNP. We then calculated a G-score, which is a component of the GISTIC methodology (22), for each gene to summarize both the frequency and the amplitude of its copy number alteration across all 11 500 cancer samples. When comparing with annotated databases of known oncogenes (http://www.sanger.ac.uk/genetics/CGP/Census/) and tumor suppressor genes (http://cbio.mskcc.org/CancerGenes/), we found that the regions with the highest or lowest G-scores often harbor known oncogenes and tumor suppressor genes, respectively (Figure 4). This partially validated the quality of the data and the accuracy of our copy number estimation. Interestingly, we observed that many chromosome ends show strong deletions in cancer and harbor some of the well-known tumor suppressors, such as STK11, TSPAN32, MAPK9 and PTGES.
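A score that combines frequency and amplitude can be sketched as below. This follows one common formulation of the G-score (the fraction of altered samples multiplied by their mean log2 amplitude, which equals the summed amplitude of altered samples divided by the total sample count); whether the authors used exactly this form, these thresholds or a log2 scale is not stated in the text, so all of these are assumptions, as is the function name.

```python
from math import log2

def g_scores(copy_numbers, upper=2.2, lower=1.8, normal=2.0, floor=0.1):
    """Return (amplification score, deletion score) for one gene.

    Each score is the summed |log2(cn / normal)| over samples passing the
    threshold, divided by the total number of samples. Copy numbers are
    floored (assumed value) so that a value of 0 does not break the log.
    """
    n = len(copy_numbers)
    amp = sum(log2(cn / normal) for cn in copy_numbers if cn > upper)
    dele = sum(-log2(max(cn, floor) / normal) for cn in copy_numbers if cn < lower)
    return amp / n, dele / n
```

For instance, `g_scores([4.0, 4.0, 2.0, 1.0])` gives an amplification score of 0.5 (half of the samples amplified twofold) and a deletion score of 0.25. Ranking genes by either score would then highlight recurrent, high-amplitude events over rare or weak ones.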

Figure 4.

The distribution of amplified/deleted genes over the whole genome. The height of the bar represents the relative value of G-score. Top 50 oncogenes/tumor suppressors in G-score ranking were denoted.

A very striking exception is a strongly amplified region on the left tip of chromosome 5, with no previously annotated tumor suppressors or oncogenes. The region has been implicated in breast cancer risk (23), and a recent cancer CNA study (24) identified TERT as the putative target gene of the amplification, but did not experimentally validate its function in breast cancer. Checking Oncomine (25), we found that TERT is not highly expressed in breast cancers. Instead, a nearby gene, IRX2, not only shows gene amplification and enhanced expression in breast cancers, but also has some literature support for a role in mammary gland neoplasia (26). Alternatively, the oncogene in the Chr5 left tip might be an ncRNA, so we investigated the CNAs of all 4013 newly identified LincRNAs (27) in mammalian genomes (Supplementary Figure S1). Although gene expression data for LincRNAs in breast cancers are still lacking, our analysis did generate interesting leads for potential follow-up validation of LincRNAs as tumor suppressors and oncogenes, and demonstrates the value of CaSNP.

DISCUSSION

Here, we have presented the CaSNP database for identifying and visualizing CNAs in cancers at any specific region within the human genome. CaSNP stores pre-computed raw copy numbers, and dynamically generates viewable and downloadable summaries of CNA status in response to user queries. A schema for uniformly processing, storing, annotating and presenting data across different data sets and platforms was successfully implemented, making CaSNP a useful tool for cancer genomic meta-studies. The query results contain numerical values of cancer copy numbers and the frequencies of CNA events, which are well suited for more detailed analysis by other software or methods. Besides the tabular display, the heatmap view displays SNP copy numbers in colors, enabling users to visualize the results intuitively and comprehensively and facilitating the discovery of novel CNA regions in subsets of samples. We also provided a scenario of using CaSNP to explore cancer biomarkers or genes through a meta-analysis, and demonstrated CaSNP’s ability to suggest novel oncogenes/tumor suppressors, whether protein-coding genes or ncRNAs. Benefiting from the abundance of SNP array data sets in recent years, CaSNP is the largest repository of SNP array-oriented CNA data among databases of a similar type. The amount of publicly accessible SNP array data on cancer is still expanding, and so will the data collection in CaSNP. Such a large-scale analysis will be extremely valuable for correlating CNA data with a genomic location of specific diagnostic, prognostic or therapeutic value found in other studies, or for reducing noise from individual studies via meta-analysis. Now that high-throughput methods such as ChIP–chip or ChIP-seq can generate hundreds of thousands of regions of interest in a single run, CaSNP will be powerful for independent validation, such as screening for regions that might be related to oncogenesis and might go unnoticed in ChIP experiments alone.
Besides collecting more data, we will work on making better use of them. The loss-of-heterozygosity (LOH) information deduced from genotype data will be added, and the CNA status will be compared across different cancer types for specified regions and across the genome.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

The National Institutes of Health R01 (HG004069 to X.S.L., GM077122 to C.L.); Chinese Scholarship Council (2008632067 to Q.C.); State S & T Projects (11th Five Year) of China (2008ZX10002-007 to Z.C.); National Basic Research Program of China (973 Program No. 2010CB944904 to M.Z., X.W. and Y.Z.). Funding for open access charge: National Basic Research Program of China (973 Program No. 2010CB944904).

Conflict of interest statement. None declared.

27 in total

1. An integrated view of copy number and allelic alterations in the cancer genome using single nucleotide polymorphism arrays.

Authors: Xiaojun Zhao; Cheng Li; J Guillermo Paez; Koei Chin; Pasi A Jänne; Tzu-Hsiu Chen; Luc Girard; John Minna; David Christiani; Chris Leo; Joe W Gray; William R Sellers; Matthew Meyerson
Journal: Cancer Res Date: 2004-05-01 Impact factor: 12.701

2. dChipSNP: significance curve and clustering of SNP-array-based loss-of-heterozygosity data.

Authors: Ming Lin; Lee-Jen Wei; William R Sellers; Marshall Lieberfarb; Wing Hung Wong; Cheng Li
Journal: Bioinformatics Date: 2004-02-10 Impact factor: 6.937

3. A large intergenic noncoding RNA induced by p53 mediates global gene repression in the p53 response.

Authors: Maite Huarte; Mitchell Guttman; David Feldser; Manuel Garber; Magdalena J Koziol; Daniela Kenzelmann-Broz; Ahmad M Khalil; Or Zuk; Ido Amit; Michal Rabani; Laura D Attardi; Aviv Regev; Eric S Lander; Tyler Jacks; John L Rinn
Journal: Cell Date: 2010-08-06 Impact factor: 41.582

4. Homozygous deletions and chromosome amplifications in human lung carcinomas revealed by single nucleotide polymorphism array analysis.

Authors: Xiaojun Zhao; Barbara A Weir; Thomas LaFramboise; Ming Lin; Rameen Beroukhim; Levi Garraway; Javad Beheshti; Jeffrey C Lee; Katsuhiko Naoki; William G Richards; David Sugarbaker; Fei Chen; Mark A Rubin; Pasi A Jänne; Luc Girard; John Minna; David Christiani; Cheng Li; William R Sellers; Matthew Meyerson
Journal: Cancer Res Date: 2005-07-01 Impact factor: 12.701

5. The interactive online SKY/M-FISH & CGH database and the Entrez cancer chromosomes search database: linkage of chromosomal aberrations with the genome sequence.

Authors: Turid Knutsen; Vasuki Gobu; Rodger Knaus; Hesed Padilla-Nash; Meena Augustus; Robert L Strausberg; Ilan R Kirsch; Karl Sirotkin; Thomas Ried
Journal: Genes Chromosomes Cancer Date: 2005-09 Impact factor: 5.006

Review 6. Single nucleotide polymorphism array analysis of cancer.

Authors: Amit Dutt; Rameen Beroukhim
Journal: Curr Opin Oncol Date: 2007-01 Impact factor: 3.645

7. Regulated expression patterns of IRX-2, an Iroquois-class homeobox gene, in the human breast.

Authors: M T Lewis; S Ross; P A Strickland; C J Snyder; C W Daniel
Journal: Cell Tissue Res Date: 1999-06 Impact factor: 5.249

8. Human microRNA genes are frequently located at fragile sites and genomic regions involved in cancers.

Authors: George Adrian Calin; Cinzia Sevignani; Calin Dan Dumitru; Terry Hyslop; Evan Noch; Sai Yendamuri; Masayoshi Shimizu; Sashi Rattan; Florencia Bullrich; Massimo Negrini; Carlo M Croce
Journal: Proc Natl Acad Sci U S A Date: 2004-02-18 Impact factor: 11.205

9. Genetic alterations and oncogenic pathways associated with breast cancer subtypes.

Authors: Xiaolan Hu; Howard M Stern; Lin Ge; Carol O'Brien; Lauren Haydu; Cynthia D Honchell; Peter M Haverty; Brock A Peters; Thomas D Wu; Lukas C Amler; John Chant; David Stokoe; Mark R Lackner; Guy Cavet
Journal: Mol Cancer Res Date: 2009-04 Impact factor: 5.852

10. The landscape of somatic copy-number alteration across human cancers.

Authors: Rameen Beroukhim; Craig H Mermel; Dale Porter; Guo Wei; Soumya Raychaudhuri; Jerry Donovan; Jordi Barretina; Jesse S Boehm; Jennifer Dobson; Mitsuyoshi Urashima; Kevin T Mc Henry; Reid M Pinchback; Azra H Ligon; Yoon-Jae Cho; Leila Haery; Heidi Greulich; Michael Reich; Wendy Winckler; Michael S Lawrence; Barbara A Weir; Kumiko E Tanaka; Derek Y Chiang; Adam J Bass; Alice Loo; Carter Hoffman; John Prensner; Ted Liefeld; Qing Gao; Derek Yecies; Sabina Signoretti; Elizabeth Maher; Frederic J Kaye; Hidefumi Sasaki; Joel E Tepper; Jonathan A Fletcher; Josep Tabernero; José Baselga; Ming-Sound Tsao; Francesca Demichelis; Mark A Rubin; Pasi A Janne; Mark J Daly; Carmelo Nucera; Ross L Levine; Benjamin L Ebert; Stacey Gabriel; Anil K Rustgi; Cristina R Antonescu; Marc Ladanyi; Anthony Letai; Levi A Garraway; Massimo Loda; David G Beer; Lawrence D True; Aikou Okamoto; Scott L Pomeroy; Samuel Singer; Todd R Golub; Eric S Lander; Gad Getz; William R Sellers; Matthew Meyerson
Journal: Nature Date: 2010-02-18 Impact factor: 49.962
