ChimerDB 2.0--a knowledgebase for fusion genes updated.

Pora Kim, Suhyeon Yoon, Namshin Kim, Sanghyun Lee, Minjeong Ko, Haeseung Lee, Hyunjung Kang, Jaesang Kim, Sanghyuk Lee.

Abstract

Chromosome translocations and gene fusions are frequent events in the human genome and have been found to cause diverse types of tumors. ChimerDB is a knowledgebase of fusion genes identified from bioinformatics analysis of transcript sequences in GenBank and various other public resources such as the Sanger cancer genome project (CGP), OMIM, PubMed and Mitelman's database. In this updated version, we significantly modified the algorithm for identifying fusion transcripts. Specifically, the new algorithm is more sensitive and has detected 2699 fusion transcripts with high confidence. Furthermore, it can identify interchromosomal translocations as well as intrachromosomal deletions or inversions of large DNA segments. Importantly, results from the analysis of next-generation sequencing data in the short read archives are incorporated as well. We updated and integrated all contents (GenBank, Sanger CGP, OMIM, PubMed publications and Mitelman's database), and improved the user interface to support diverse types of searches and to make browsing PubMed articles more convenient. We also developed a new alignment viewer that should facilitate examining the reliability of fusion transcripts and inferring their functional significance. We expect ChimerDB 2.0, available at http://ercsb.ewha.ac.kr/fusiongene, to be a valuable tool in identifying biomarkers and drug targets.

Year: 2009 PMID: 19906715 PMCID: PMC2808913 DOI: 10.1093/nar/gkp982

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Fusion genes play important roles in tumorigenesis and cancer progression (1). In perhaps the best-characterized case, the BCR-ABL1 fusion causes chronic myelogenous leukemia and is the target of the anticancer drug Gleevec (imatinib) (2). Identification of fusion genes can thus lead to the discovery of diagnostic biomarkers and therapeutic targets, as well as to an understanding of the molecular basis of tumorigenesis. Initial studies concentrated on hematological cancers, in large part because of sample availability (1,3). Over the last few years, however, there has been significant progress in fusion gene identification in solid tumors. Importantly, Chinnaiyan and colleagues (4–7) reported several cases of gene fusion in prostate cancer identified via integrative analysis of microarray data (TMPRSS2 and ETS transcription factors) and transcriptome resequencing. Soda et al. (8) identified the transforming EML4-ALK fusion gene in non-small cell lung cancer (NSCLC) using a function-based screening procedure. A proteomic study of phosphotyrosine kinases also revealed the ROS-ALK fusion in NSCLC cell lines (9). These cases clearly indicate that gene fusions play an important role in cancer development in solid tumors.

Recent progress in next-generation sequencing (NGS) techniques provides a tremendous opportunity for fusion gene discovery. Notably, paired-end sequencing, now a frequent if not standard procedure, compensates for the short read length of NGS techniques (10). Sequencing and analyzing the whole genome or transcriptome leads to the identification of many chromosomal aberrations, including translocations, amplifications and deletions. Short read sequencing strategies were successfully applied to find fusion genes in prostate, lung and breast cancer cell lines (5,11–13).

There have been considerable efforts to compile a catalog of fusion genes. Mitelman’s database and the Sanger cancer genome project (CGP) are notable examples of collecting fusion genes from literature reports (1,3). The COSMIC and CancerGenes databases include other types of chromosomal aberrations such as mutations, amplifications and deletions (14,15). Currently, Mitelman’s database and the Sanger CGP collection include 150 and 270 gene pairs, respectively, involved in gene fusion events. Bioinformatics analysis of public transcriptome sequences in GenBank also provides ample fusion transcript candidates. Fusion genes may be classified into two groups, interchromosomal and intrachromosomal. The former results from fusion between two different chromosomes (i.e. translocation), and the latter originates within a single chromosome through deletion, inversion or amplification of large DNA segments. Romani et al. (16) analyzed mRNA sequences, and Hahn et al. (17) analyzed mRNA and EST sequences, to identify fusion transcripts between different chromosomes. Similar data-mining approaches were adopted to construct databases of fusion genes such as ChimerDB (18), HybridDB (19) and TICdb (20), although the computational details vary considerably.

ChimerDB is designed to be a knowledgebase of fusion genes that encompasses the fusion transcripts identified from bioinformatics analysis of transcript sequences in GenBank and various public resources such as the Sanger CGP (3), OMIM (21), Mitelman’s database (1) and PubMed. The updated version, ChimerDB 2.0, features (i) algorithm modifications for increased sensitivity, (ii) extensive coverage of recent publications and relevant databases, (iii) analysis of NGS data in the NCBI’s short read archives (SRA) and (iv) an enhanced user interface and a novel alignment viewer to support diverse types of search. ChimerDB 2.0 should be the most extensive catalog of fusion genes and transcripts publicly available to date.

IMPLEMENTATIONS

Computational method for transcriptome analysis

The basic strategy is virtually identical to the procedure used in ChimerDB 1.0, where the genomic alignments of transcript sequences were analyzed to identify the fusion transcripts. We describe the major differences and modifications here, with more details provided in the Supplementary data and in the web site documentation.

The most important change is a relaxation of the boundary condition, motivated by our observation that many reported cases did not satisfy the strict requirement that the fusion boundary of the transcript match an exon boundary. We therefore introduced the ‘reliability class’ as a measure of confidence. We consider an alignment with multiple exons, or with a single exon whose boundary matches, as a feature of reliability. Entry to Class A requires that both head and tail transcripts consist of multiple exons or of single exons with matching boundaries; these are the most reliable cases. A transcript is placed in Class B if only one of the head and tail genes satisfies this condition, and in Class C if neither does.

Another important difference is the introduction of various refinement steps. For example, we removed the entries whose genomic alignments have many hits of comparable quality in different genomic regions, even when these regions are not marked as repeat sequences. This step was necessary to avoid possible complications arising from gene duplication, pseudogenes and retroposon sequences. In addition, the number of exons was estimated by using the Exonerate program rather than the BLAT alignments (22,23).

The computational pipeline for 454 sequences from the SRA is identical to the EST processing since the sequence length is comparable. Solexa reads are generally too short for this pipeline, so we used them only as supporting evidence for the existence of fusion transcripts. Solexa transcriptome reads were aligned against the fusion transcripts using the BWA program (24) to determine whether multiple reads cover the fusion point. The alignments of the resulting candidates were manually examined.

In this updated version, we also include fusion transcripts within the same chromosome. The head and tail genes must not be adjacent and must be separated by >1 Mb. We exclude the fusion cases between adjacent genes, which we named co-transcription and intergenic splicing, in order to limit our focus to genuine fusion genes originating from chromosomal aberrations.
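The classification rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ChimerDB pipeline: the `GenePart` record, its field names and the gap calculation are hypothetical, and the adjacency test (whether other genes lie between head and tail) is not modeled.

```python
from dataclasses import dataclass
from typing import Optional

MIN_INTRACHROMOSOMAL_GAP = 1_000_000  # the >1 Mb separation stated in the text


@dataclass
class GenePart:
    """One side (head or tail) of a candidate fusion transcript (hypothetical record)."""
    chrom: str
    start: int                    # genomic start of the aligned gene
    end: int                      # genomic end of the aligned gene
    n_exons: int                  # number of aligned exons
    boundary_matches_exon: bool   # fusion boundary coincides with an exon boundary


def is_reliable(part: GenePart) -> bool:
    """Multiple exons, or a single exon whose boundary matches."""
    return part.n_exons > 1 or (part.n_exons == 1 and part.boundary_matches_exon)


def reliability_class(head: GenePart, tail: GenePart) -> str:
    """Class A: both sides reliable; Class B: exactly one; Class C: neither."""
    n_reliable = int(is_reliable(head)) + int(is_reliable(tail))
    return {2: "A", 1: "B", 0: "C"}[n_reliable]


def fusion_type(head: GenePart, tail: GenePart) -> Optional[str]:
    """Return the fusion type, or None when the pair is too close to be retained."""
    if head.chrom != tail.chrom:
        return "interchromosomal"
    # Distance between the two gene intervals; negative if they overlap.
    gap = max(head.start, tail.start) - min(head.end, tail.end)
    return "intrachromosomal" if gap > MIN_INTRACHROMOSOMAL_GAP else None


# Usage with made-up coordinates.
head = GenePart("chr1", 1_000, 50_000, n_exons=3, boundary_matches_exon=False)
tail = GenePart("chr2", 9_000, 80_000, n_exons=1, boundary_matches_exon=True)
print(reliability_class(head, tail), fusion_type(head, tail))
```

Returning `None` for same-chromosome pairs within 1 Mb mirrors the exclusion of co-transcription and intergenic splicing candidates; a fuller version would also need gene annotation to decide adjacency.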

Data sources

Transcript sequences were downloaded from GenBank as of September 2008. The download included 323 914 mRNA and over 8 million EST sequences for human. We also downloaded NGS transcriptome sequences from the SRA, comprising ∼1.2 million 454 sequences and ∼762 million Solexa reads. The human genome map used for transcriptome analysis was NCBI build 36.1 (hg18 in the UCSC genome browser database) (25). Literature-related information was obtained as follows. PubMed articles related to fusion genes were retrieved by using the Entrez query ‘chromosomal translocation or fusion gene’ together with MeSH terms on human cancers. Abstracts of 3618 articles were manually examined to obtain information on the fusion gene pairs. OMIM records retrieved by the query ‘translocation or fusion’ (May 2009) were also manually examined to find fusion gene pairs. As for the Sanger CGP data, the ‘cancer gene census’ list released in December 2008 was downloaded from the web site. Mitelman’s database was obtained from the web site for the recurrent chromosome aberrations in cancer (http://cgap.nci.nih.gov/Chromosomes/RecurrentAberrations) as of April 2009. Entries with specific gene symbols for both head and tail genes were retained as part of our literature-related data.
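As a rough illustration of the PubMed retrieval step, the sketch below builds an NCBI E-utilities ESearch URL for a comparable query. The exact MeSH terms used by the authors are not given in the text, so the `"neoplasms"[MeSH Terms]` and `"humans"[MeSH Terms]` filters are assumptions, as are the function name and the `retmax` default. The code only constructs the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(retmax: int = 100) -> str:
    """Return an ESearch URL for fusion-gene articles on human cancers."""
    # Free-text part taken from the paper; the MeSH filters are assumed.
    term = (
        "(chromosomal translocation OR fusion gene) "
        'AND "neoplasms"[MeSH Terms] AND "humans"[MeSH Terms]'
    )
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH}?{urlencode(params)}"


print(build_pubmed_query())
```

The returned PMIDs would still need the manual abstract review described above; the query only narrows the candidate set.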

RESULTS

ChimerDB 2.0 includes 9358 genes, 11 747 fusion gene pairs and 9898 fusion transcripts. Figure 1A shows the number of fusion gene pairs by information source, counting only the Class A candidates for the transcriptome analysis. As expected, transcriptome analysis is the richest source of fusion gene pairs and includes 2699 candidates, compared with ∼300 candidates in the original version. Only 96 of these have literature evidence from other resources, implying that the majority of candidates remain to be verified experimentally.

Figure 1.

The number of gene pairs in the ChimerDB 2.0 according to the information source.

Comparison between databases for just the literature-based cases is also revealing (Figure 1B). The overlaps between Sanger CGP, OMIM, Mitelman’s database and our own PubMed collection are much smaller than expected, and 327 of the 556 fusion gene pairs are found in only one of the literature databases. This reveals the incomplete coverage of manual curation efforts and the need to integrate various databases. ChimerDB 2.0 includes 537 genes and 556 fusion gene pairs from the literature, a significant increase over any single database. Detailed statistics of the transcriptome analysis are shown in Table 1. We found 1046, 6178 and 2674 fusion transcripts from mRNA, EST and 454 sequences, respectively. In sum, 89% of the total cases are interchromosomal fusions.
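The comparison of literature sources amounts to counting, for each gene pair, how many databases list it. A minimal sketch follows, assuming each source is available as a set of (head, tail) gene-symbol tuples. The function names are hypothetical, and the toy pairs are placeholders, not real database contents.

```python
from collections import Counter


def source_counts(sources):
    """Map each (head, tail) gene pair to the number of sources listing it."""
    counts = Counter()
    for pairs in sources.values():
        counts.update(set(pairs))  # set() guards against duplicates within a source
    return counts


def unique_to_one_source(sources):
    """Gene pairs reported by exactly one source."""
    return {pair for pair, n in source_counts(sources).items() if n == 1}


# Placeholder data for illustration only.
toy_sources = {
    "Sanger CGP": {("G1", "G2"), ("G3", "G4")},
    "OMIM": {("G1", "G2"), ("G5", "G6")},
    "Mitelman": {("G1", "G2"), ("G3", "G4"), ("G7", "G8")},
    "PubMed": {("G9", "G10")},
}

all_pairs = set().union(*toy_sources.values())
singletons = unique_to_one_source(toy_sources)
print(f"{len(singletons)} of {len(all_pairs)} pairs appear in only one source")
```

Pairs are kept as ordered tuples because head and tail are not interchangeable in a fusion gene.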

Table 1.

Statistics of transcriptome analysis in ChimerDB 2.0

	Interchromosomal			Intrachromosomal		
Class	A	A + B	A + B + C	A	A + B	A + B + C
No. of transcripts	1900	6073	8833	515	887	1065
mRNA	479	855	900	110	143	146
EST	1247	3972	5397	396	677	781
NGS	174	1246	2536	9	67	138
No. of genes (9358 in total)	2855	6976	8710	703	1276	1543
No. of gene pairs	2209	7362	10639	490	909	1108
With multiple transcripts	278	807	1137	144	220	246
With Solexa evidence	14	14	15	65	67	67
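Several figures quoted in the text can be re-derived from Table 1 by adding the interchromosomal and intrachromosomal columns. The check below is a sketch that copies the table values into Python tuples. The dictionary keys and helper name are our own labels, not ChimerDB field names.

```python
# Table 1 values, each tuple ordered as (Class A, A + B, A + B + C).
INTER = {
    "mRNA": (479, 855, 900),
    "EST": (1247, 3972, 5397),
    "NGS": (174, 1246, 2536),
    "transcripts": (1900, 6073, 8833),
    "gene_pairs": (2209, 7362, 10639),
    "multi_transcript": (278, 807, 1137),
    "solexa_evidence": (14, 14, 15),
}
INTRA = {
    "mRNA": (110, 143, 146),
    "EST": (396, 677, 781),
    "NGS": (9, 67, 138),
    "transcripts": (515, 887, 1065),
    "gene_pairs": (490, 909, 1108),
    "multi_transcript": (144, 220, 246),
    "solexa_evidence": (65, 67, 67),
}
A, AB, ABC = 0, 1, 2  # column indices


def total(key, col):
    """Sum a table row over interchromosomal and intrachromosomal events."""
    return INTER[key][col] + INTRA[key][col]


# The per-source rows add up to the transcript row in every column.
for table in (INTER, INTRA):
    for col in (A, AB, ABC):
        by_source = sum(table[src][col] for src in ("mRNA", "EST", "NGS"))
        assert by_source == table["transcripts"][col]

class_a_gene_pairs = total("gene_pairs", A)          # 2699
all_gene_pairs = total("gene_pairs", ABC)            # 11747
all_transcripts = total("transcripts", ABC)          # 9898
class_a_multi = total("multi_transcript", A)         # 422
solexa_supported = total("solexa_evidence", ABC)     # 82
per_source = {s: total(s, ABC) for s in ("mRNA", "EST", "NGS")}
inter_share = INTER["transcripts"][ABC] / all_transcripts

print(class_a_gene_pairs, all_gene_pairs, all_transcripts)
print(per_source, f"{inter_share:.0%}")
```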

A significant proportion of gene pairs (422) in Class A feature multiple fusion transcripts, indicating an even higher chance of representing genuine fusions. We also searched the Solexa transcriptome sequences in the SRA for short reads that span the fusion boundary of our candidates. Notably, 82 fusion transcripts were found to have multiple short-read matches. One of these fusions, CRTC1-MAML2, has been reported to be a frequent feature of mucoepidermoid carcinoma (26). It is noteworthy that intrachromosomal events are overrepresented among our fusion transcript candidates. A close look reveals that a major portion consists of genes belonging to the same family or of pseudogenes. It remains to be seen whether they arise from alignment ambiguity or represent genuine fusion events.
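The short-read support test can be pictured as counting reads whose alignment on the fusion transcript crosses the junction. The sketch below assumes alignments are already available as 0-based, half-open (start, end) intervals on the fusion transcript. The minimum overhang on each side is our own assumption, since the text only requires that multiple reads cover the fusion point.

```python
def spanning_reads(alignments, junction, min_overhang=5):
    """Return alignments that cross the junction with enough bases on each side.

    alignments: iterable of (start, end) intervals on the fusion transcript.
    junction: index of the first base contributed by the tail gene.
    min_overhang: assumed minimum number of aligned bases on each side.
    """
    return [
        (start, end)
        for start, end in alignments
        if junction - start >= min_overhang and end - junction >= min_overhang
    ]


def has_short_read_support(alignments, junction, min_reads=2, min_overhang=5):
    """True when multiple (at least min_reads) reads span the junction."""
    return len(spanning_reads(alignments, junction, min_overhang)) >= min_reads


# Usage with made-up 36-base read alignments around a junction at position 100.
reads = [(80, 116), (98, 134), (60, 96), (90, 126)]
print(spanning_reads(reads, junction=100))
print(has_short_read_support(reads, junction=100))
```

Requiring an overhang avoids counting reads that touch the junction by only one or two bases, which could align equally well to the unfused head or tail gene.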

User interface

The user interface of ChimerDB is significantly improved in this updated version. Figure 2 shows its important features. Most importantly, we support diverse types of search targeting transcripts, genes, gene pairs, cytobands and tissues. For fusion transcripts, users may choose the reliability class, the number of exons and the boundary types for the head and tail genes.

Figure 2.

User interface of ChimerDB 2.0. The search page is designed to support diverse types of search. The ‘search result’ page shows the gene pairs and the disease-related information in OMIM, Sanger CGP and Mitelman’s database with the title and journal name of PubMed articles. Clicking ‘more info’ link shows the detailed information on fusion genes and transcripts as seen in the bottom panel. The ‘alignment view’ shows the hypothetical fusion gene (head gene in blue, tail gene in red) and the candidate fusion transcript (in magenta) along with the UCSC-annotated genes (exons in black, UTRs in grey). The repeat regions and the Pfam domains are indicated in green and orange colors, respectively. Clicking on the alignment picture opens a magnified view. The information contents in this figure are trimmed for brevity.

The result page includes the gene pairs, disease information, PubMed articles and linkouts to diverse resources. PubMed articles are displayed with the title and journal name for user convenience. Except for the few cases without the fusion sequence available, we show the alignment picture that includes the gene structure, domains and repeat sequences. We also provide information on tissue and pathology type from the GenBank records and CGAP (Cancer Genome Anatomy Project) library data. Links to the UCSC genome browser are provided to allow users to examine the detailed gene structure and alignment.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Korea Science and Engineering Foundation (KOSEF) funded by the Korea government (MEST) (R01-2008-000-20818-0 and 2007-03983); grant from BioGreen 21 Program of the Korean Rural Development Administration (20070401034010); ‘Systems Biology Infrastructure Establishment Grant’ provided by Gwangju Institute of Science and Technology in 2009 through Ewha Research Center for Systems Biology (ERCSB); grant from the National Core Research Center (NCRC) program (R15-2006-020) of the KOSEF funded by the MEST. Funding for open access charge: Korea Science and Engineering Foundation (R01-2008-000-20818-0).

Conflict of interest statement. None declared.

26 in total

1. BLAT--the BLAST-like alignment tool.

Authors: W James Kent
Journal: Genome Res Date: 2002-04 Impact factor: 9.043

2. Finding fusion genes resulting from chromosome rearrangement by analyzing the expressed sequence databases.

Authors: Yoonsoo Hahn; Tapan Kumar Bera; Kristen Gehlhaus; Ilan R Kirsch; Ira H Pastan; Byungkook Lee
Journal: Proc Natl Acad Sci U S A Date: 2004-08-23 Impact factor: 11.205

3. Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer.

Authors: Scott A Tomlins; Daniel R Rhodes; Sven Perner; Saravana M Dhanasekaran; Rohit Mehra; Xiao-Wei Sun; Sooryanarayana Varambally; Xuhong Cao; Joelle Tchinda; Rainer Kuefer; Charles Lee; James E Montie; Rajal B Shah; Kenneth J Pienta; Mark A Rubin; Arul M Chinnaiyan
Journal: Science Date: 2005-10-28 Impact factor: 47.728

Review 4. Next-generation DNA sequencing of paired-end tags (PET) for transcriptome and genome analyses.

Authors: Melissa J Fullwood; Chia-Lin Wei; Edison T Liu; Yijun Ruan
Journal: Genome Res Date: 2009-04 Impact factor: 9.043

Review 5. A census of human cancer genes.

Authors: P Andrew Futreal; Lachlan Coin; Mhairi Marshall; Thomas Down; Timothy Hubbard; Richard Wooster; Nazneen Rahman; Michael R Stratton
Journal: Nat Rev Cancer Date: 2004-03 Impact factor: 60.716

6. ChimerDB--a knowledgebase for fusion sequences.

Authors: Namshin Kim; Pora Kim; Seungyoon Nam; Seokmin Shin; Sanghyuk Lee
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

7. Automated generation of heuristics for biological sequence comparison.

Authors: Guy St C Slater; Ewan Birney
Journal: BMC Bioinformatics Date: 2005-02-15 Impact factor: 3.169

8. Transcriptome sequencing to detect gene fusions in cancer.

Authors: Christopher A Maher; Chandan Kumar-Sinha; Xuhong Cao; Shanker Kalyana-Sundaram; Bo Han; Xiaojun Jing; Lee Sam; Terrence Barrette; Nallasivam Palanisamy; Arul M Chinnaiyan
Journal: Nature Date: 2009-01-11 Impact factor: 49.962

9. Fast and accurate short read alignment with Burrows-Wheeler transform.

Authors: Heng Li; Richard Durbin
Journal: Bioinformatics Date: 2009-05-18 Impact factor: 6.937

10. The COSMIC (Catalogue of Somatic Mutations in Cancer) database and website.

Authors: S Bamford; E Dawson; S Forbes; J Clements; R Pettett; A Dogan; A Flanagan; J Teague; P A Futreal; M R Stratton; R Wooster
Journal: Br J Cancer Date: 2004-07-19 Impact factor: 7.640
