ChimerDB 3.0: an enhanced database for fusion genes from cancer transcriptome and literature data mining.

Myunggyo Lee¹, Kyubum Lee², Namhee Yu³, Insu Jang⁴, Ikjung Choi⁵, Pora Kim⁶, Ye Eun Jang¹, Byounggun Kim⁷, Sunkyu Kim⁷, Byungwook Lee⁴, Jaewoo Kang⁸,⁷, Sanghyuk Lee⁹,³,⁵.

Abstract

Fusion genes are an important class of therapeutic targets and prognostic markers in cancer. ChimerDB is a comprehensive database of fusion genes encompassing analysis of deep sequencing data and manual curation. In this update, the database coverage was enhanced considerably by adding two new modules: analysis of The Cancer Genome Atlas (TCGA) RNA-Seq data and PubMed abstract mining. ChimerDB 3.0 is composed of three modules: ChimerKB, ChimerPub and ChimerSeq. ChimerKB is a manually curated knowledgebase of 1066 fusion genes compiled from public resources of fusion genes with experimental evidence. ChimerPub includes 2767 fusion genes obtained from text mining of PubMed abstracts. The ChimerSeq module is designed to archive fusion candidates from deep sequencing data. Importantly, we have analyzed RNA-Seq data of the TCGA project covering 4569 patients in 23 cancer types using two reliable programs, FusionScan and TopHat-Fusion. The new user interface supports diverse search options and graphic representation of fusion gene structure. ChimerDB 3.0 is available at http://ercsb.ewha.ac.kr/fusiongene/.

Year: 2016 PMID: 27899563 PMCID: PMC5210563 DOI: 10.1093/nar/gkw1083

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Fusion genes have been firmly established as an important class of biomarkers and therapeutic targets in various types of cancer. A number of databases have been developed to catalogue fusion genes of clinical value. Initial efforts include the Mitelman database (1), COSMIC (2), ChimerDB 1.0 (3) and TICdb (4). Since the advent of high-throughput sequencing technology, deep sequencing data have become the main source for identifying fusion genes. Numerous tools have been developed to predict fusion genes from genome or transcriptome sequencing data (5). However, the reliability of most programs falls short of the expectations of bench biologists and doctors, who want the predictions to pass validation experiments that typically require valuable resources such as patient tissues and time. Another major problem is computing resources, since typical programs require a substantial amount of CPU time and memory. Thus, processing hundreds or thousands of RNA-Seq data sets for detection of fusion transcripts is almost impossible for most labs, although a massive amount of deep sequencing data is currently publicly available.

Nevertheless, several databases on fusion genes have been released that include the results of analyzing transcriptome sequencing data. ChimerDB 2.0 was designed to be a knowledgebase of fusion genes with extensive manual curation and transcriptome sequencing data analysis (6). It has been used in many applications as a gold standard for developing prediction tools (7,8) and as a reference data set for fusion transcriptome simulation (9). ChiTaRS was developed with similar concepts and includes ∼29 000 fusion transcripts from eight species (10). The amount of cancer genome sequencing data has exploded over the last several years. For example, The Cancer Genome Atlas (TCGA) provides RNA-Seq data for 10 539 tumor samples in 26 cancer types as of September 13, 2016. Stransky et al. focused on recurrent kinase fusion genes with oncogenic potential (11). They applied an extensive filtering scheme to remove germline fusions observed in normal samples of the TCGA and Genotype-Tissue Expression projects (12). Approximately 3.0% of tumor samples were estimated to contain likely oncogenic, recurrent kinase fusion genes. Verhaak et al. developed a pipeline for predicting fusion genes named PRADA (13), and analyzed the TCGA RNA-Seq data to identify ∼8000 fusion transcript candidates in 13 tumor types (14).

However, these fusion gene databases contain computational results without experimental validation, and thus cannot be used as they are for developing biomarkers or therapeutic targets of clinical value. Since the number of PubMed articles reporting fusion genes is rapidly increasing, cataloguing and curation itself has become a major challenge. Thus, automatic methods for text mining of PubMed articles to identify fusion genes would be of great help in building a comprehensive knowledgebase on fusion genes. In this updated version, we introduce a new module of text mining for fusion genes, provide an updated version of the knowledgebase with manual curation and present a dramatically enhanced collection of fusion transcripts obtained by analyzing TCGA RNA-Seq data with several advanced programs. ChimerDB 3.0 would be the most extensive catalog of fusion genes and transcripts publicly available to date.

SYSTEM DESIGN AND METHODS

System overview

ChimerDB 3.0 is composed of three modules, ChimerKB, ChimerPub and ChimerSeq, as shown in Figure 1. ChimerKB is a knowledgebase of fusion genes compiled from well-known public resources such as GenBank, Mitelman (1), OMIM (15), COSMIC (2), TICdb (4) and dbCRID (16) fusion databases and PubMed articles. All entries were manually curated for disease, sequences, breakpoints and experimental evidence. Specifically for PubMed articles reporting fusion genes, we examined the full text to find the relevant information.

Figure 1.

Overview of ChimerDB 3.0. Each number indicates the number of gene pairs from relevant resources.

ChimerPub is our effort to provide up-to-date information on published fusion genes. We developed an advanced text mining system to identify fusion genes from 26 million PubMed articles. Text mining techniques and an elaborate machine learning approach yielded a highly reliable pipeline for extracting genuine PubMed articles reporting fusion genes. ChimerSeq collects fusion transcripts identified by computational analysis of transcriptome sequencing data including RNA-Seq and EST sequences. TCGA is the largest deep sequencing data set for cancer patients that is currently publicly available. We re-analyzed RNA-Seq data from the TCGA, comprising 4569 tumor samples in 23 cancer types, using FusionScan (http://fusionscan.ewha.ac.kr/) and TopHat-Fusion (17), which we selected based on a benchmark test of precision and recall rates. Prediction results from the PRADA pipeline were merged to build the TCGA fusion transcripts (PRADA ver. 1). We further compiled the fusion transcripts from ChiTaRS ver. 2.1 (18) and ChimerDB 2.0 (6) for better coverage.

ChimerPub development and implementation

ChimerPub is a new module that contains the fusion gene information obtained by text mining of PubMed abstracts. The computational procedure for identifying fusion-related PubMed abstracts is summarized in Figure 1, with more detailed information provided in Supplementary Figure S1. Out of ∼26 million PubMed abstracts as of June 2016, we searched for all sentences containing multiple gene names that were recognized by the BEST entity extractor (19), after taxonomy filtering with PubTator (20) to remove articles on non-human species. This initial screening yielded 302 615 sentences in 156 229 abstracts. We also extracted information on experimental methods, related diseases and translocation from the same abstract.

To build a machine learning classifier for fusion gene sentences, we prepared positive and negative sentence sets using 283 known fusion cases from the COSMIC database (2), as detailed in Supplementary Figure S1. We identified 9277 sentences that contained both gene names of a COSMIC fusion gene. We then randomly selected 1800 of these sentences and examined them manually, obtaining a gold standard positive set of 1549 sentences. Sentences not containing both gene names of a COSMIC fusion gene were used as the negative data set in the training procedure, even though they may include fusion genes that are not in the COSMIC list. The procedure for constructing a classifier consists of feature selection followed by logistic regression with the extracted features. We compared the word distribution between the positive and negative data sets to obtain 37 differential word features. We also added three features for fusion-specific information such as translocation description and experimental methods for validation. We applied logistic regression with these 40 features to build a classification model. Finally, we scored all candidate sentences with the resulting regression model.

The reliability of high scoring sentences was evaluated in two ways. We checked the 3000 top scoring sentences manually to see if they report genuine fusion genes and found that only one entry was not related to a fusion gene. We further examined the cumulative probability of including two sets of positive sentences: gold standard sentences used in the training procedure, and pseudo-positive sentences that included both gene names of known fusion genes in the ChimerKB module. The number of known fusion genes is much larger in ChimerKB than in the COSMIC database. The top 10 000 sentences recovered 38% and 49% of the positive sets from the gold standard and ChimerKB, respectively, while including only 0.8% of the negative sentence set used in the training procedure (Supplementary Figure S2). For database construction, we selected the 10 580 top scoring sentences, which were converted into 2767 fusion gene pairs. Thus, ChimerPub provides a highly accurate list of fusion genes obtained from text mining of PubMed abstracts.
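
The classification step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the word list, the translocation pattern, the four training sentences and the training settings are all hypothetical stand-ins, since the 37 word features and 3 fusion-specific features are not enumerated in the text. The sketch only shows the general shape of the approach, namely binary sentence features fed into logistic regression, with the fitted model then used to score candidate sentences.

```python
import math
import re

# Hypothetical differential words; the paper's actual 37 word features are not listed here.
WORD_FEATURES = ["fusion", "translocation", "rearrangement", "chimeric"]


def featurize(sentence):
    """Binary word-presence features plus one flag for a pattern such as t(9;22)."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    word_flags = [1.0 if word in tokens else 0.0 for word in WORD_FEATURES]
    has_translocation = 1.0 if re.search(r"t\(\s*\w+\s*;\s*\w+\s*\)", sentence) else 0.0
    return word_flags + [has_translocation]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, labels, epochs=500, learning_rate=0.5):
    """Fit logistic regression weights by plain batch gradient descent."""
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(samples)
    return weights, bias


def score(sentence, weights, bias):
    """Return the model's probability that a sentence reports a fusion gene."""
    x = featurize(sentence)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Toy training data (illustrative only, not taken from the paper).
TRAIN = [
    ("The BCR-ABL1 fusion results from the t(9;22) translocation.", 1),
    ("EML4-ALK is a chimeric fusion transcript found in lung cancer.", 1),
    ("TP53 and MDM2 expression levels were correlated.", 0),
    ("BRCA1 interacts with BARD1 in DNA repair.", 0),
]
X = [featurize(sentence) for sentence, _ in TRAIN]
Y = [label for _, label in TRAIN]
WEIGHTS, BIAS = train(X, Y)

fusion_like = score("A novel fusion of KIF5B and RET was detected.", WEIGHTS, BIAS)
unrelated = score("KRAS and EGFR mutations were mutually exclusive.", WEIGHTS, BIAS)
print(fusion_like > unrelated)
```

In a pipeline of this kind, all candidate sentences would be ranked by `score` and only the top of the ranking retained, analogous to the 10 580 top scoring sentences selected for ChimerPub.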

ChimerSeq development and implementation

The ChimerSeq module includes fusion transcripts obtained from computational analysis of transcriptome sequencing data. RNA-Seq is the most popular source of data for predicting fusion transcripts, and a number of tools have been published so far. We compared the performance of several tools including SOAPfuse (21), deFuse (22), FusionHunter (23), FusionMap (24), TopHat-Fusion (17) and FusionScan (our in-house developed program; http://fusionscan.ewha.ac.kr/) using RNA-Seq data of three cell lines (NCI-H660, K562 and MCF-7) whose genuine fusion genes were known. FusionScan and TopHat-Fusion achieved the best performance in the overall F1-score, a combination of the precision and recall rates (Supplementary Table S1). Notably, FusionScan outperformed other programs in terms of the precision rate, which would be the most important factor for clinical utility (precision rate = 0.60). Thus, we chose FusionScan and TopHat-Fusion to analyze RNA-Seq data in the TCGA project. The raw sequence data were downloaded from the CGHub of TCGA with dbGaP permission. Run time for FusionScan and TopHat-Fusion depends on the computer specification. Analyzing whole transcriptome data in the TCGA took several months with ∼600 CPU cores. To build a reliable list of fusion cases, we kept the fusion transcripts with the number of seed/junction reads ≥ 2 for FusionScan and PRADA, and the number of spanning pairs ≥ 100 for TopHat-Fusion. The number of fusion transcripts for each cancer type is shown in Supplementary Figure S3.
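
The read-support cutoffs stated above amount to a per-tool threshold filter. The sketch below is only an illustration under stated assumptions: the record layout and the field names `tool`, `junction_reads` and `spanning_pairs` are hypothetical and do not correspond to the actual output formats of FusionScan, PRADA or TopHat-Fusion. Only the threshold values come from the text.

```python
# Minimum support per tool, as stated in the text. The field names are
# hypothetical; real tool outputs use their own column names.
THRESHOLDS = {
    "FusionScan": ("junction_reads", 2),
    "PRADA": ("junction_reads", 2),
    "TopHat-Fusion": ("spanning_pairs", 100),
}


def keep_fusion(call):
    """Return True if a predicted fusion meets the support cutoff for its tool."""
    field, minimum = THRESHOLDS[call["tool"]]
    return call.get(field, 0) >= minimum


def filter_calls(calls):
    """Keep only the predictions that pass their tool-specific cutoff."""
    return [call for call in calls if keep_fusion(call)]


# Illustrative records, not real predictions.
CALLS = [
    {"tool": "FusionScan", "pair": ("EML4", "ALK"), "junction_reads": 5},
    {"tool": "FusionScan", "pair": ("GENE_A", "GENE_B"), "junction_reads": 1},
    {"tool": "PRADA", "pair": ("GENE_C", "GENE_D"), "junction_reads": 2},
    {"tool": "TopHat-Fusion", "pair": ("GENE_E", "GENE_F"), "spanning_pairs": 40},
    {"tool": "TopHat-Fusion", "pair": ("BCR", "ABL1"), "spanning_pairs": 150},
]

KEPT = filter_calls(CALLS)
print([call["pair"] for call in KEPT])
```

Keeping the cutoffs in a single table makes it straightforward to tighten or relax the support requirement for one tool without touching the filtering logic.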

RESULTS

ChimerDB 3.0 includes 33 316 fusion gene pairs, as summarized in the overall statistics of Table 1. The two modules representing known fusion genes (i.e. ChimerKB and ChimerPub) account for ∼10% of the total, and the remaining ∼90% are predictions from transcriptome sequencing (ChimerSeq), which require further experimental validation. The three modules are complementary since the overlap among them is not large (Figure 2A).

Table 1.

Statistics of ChimerDB 3.0

ChimerKB                        ChimerPub                       ChimerSeq
Literature curation     147     Information available                   TCGA                13 731
COSMIC                  365       Translocation         1248              FusionScan        5789
mRNA sequence           273       Disease               1917              TopHat-Fusion     1830
Mitelman, OMIM, GenBank 495       Validation method     1147              PRADA             7992
                                  Others                741             ChimerDB 2.0        142
                                                                ChiTaRS 2.1         16 360
Total                   1066    Total                   2767    Total               30 001
ChimerPub supported     250     ChimerKB supported      250     ChimerKB supported  226
ChimerSeq supported     226     ChimerSeq supported     146     ChimerPub supported 146
Known breakpoint cases                                          Novel fusion*       29 733
  Genomic position      106                                       TCGA              13 637
  Exon junction         1450                                      ChiTaRS           16 149

All numbers represent the number of unique fusion genes.

*Transcripts not included in ChimerKB and ChimerPub were classified as novel fusion.

Figure 2.

Statistics of ChimerDB 3.0. (A) Number of gene pairs from three modules. (B) Number of gene pairs from three prediction programs.

Entries in ChimerKB are the most frequently supported by other modules (45.1%, Table 1), as expected from the nature of a knowledgebase. Of note, we carefully collected fusion cases with known breakpoints, where the genomic position and exon junction information were annotated for 106 and 1450 cases, respectively. This should be a useful resource for developing algorithms to predict fusion breakpoints de novo (25). ChimerPub contains 2767 entries supported by literature publication. Only 250 of those are annotated in ChimerKB, which contains 1066 entries. It is evident that ChimerPub contributes a major portion of published fusion genes, emphasizing the importance of text mining in grasping current knowledge on fusion genes. ChimerPub entries were automatically annotated for information on disease, translocation and experimental methods, and the number of annotated fusion genes is shown in Table 1.

ChimerSeq accounts for ∼90% of fusion genes, and most of its entries are not annotated or published, making it a gold mine of novel fusion genes. However, users should be warned that many false positives are present even though we tried to use the most conservative programs to analyze the deep sequencing data. We expect that almost half of the predictions could be false, which should be acceptable considering the current status of prediction accuracy (Supplementary Table S1). Since ChimerSeq is a compilation of fusion transcripts from various resources based on computational analysis of transcriptome sequencing data, estimating the reliability of each resource is very important for end users. We compared the overlap of predicted fusion transcripts with ChimerKB and ChimerPub, which represent the current knowledge of known fusion genes (Supplementary Table S2). The overlap ratio was in the order of TopHat-Fusion (1.75%), FusionScan (1.00%), PRADA (0.63%) and ChiTaRS (0.01%). Similarly, we compared the overlap among the three prediction programs (Figure 2B). The overlapping proportions were 32.2%, 26.8% and 17.8% for TopHat-Fusion, FusionScan and PRADA, respectively. The order was identical in the two comparisons, so users are recommended to rank the reliability of the prediction tools in this order. TopHat-Fusion's predictions are the most reliable, but it misses many true positives. On the other hand, ChiTaRS has the most hits but seems to contain many false positives. FusionScan seems to be a good compromise between the two.
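
The overlap ratio used above as a reliability proxy can be computed with simple set operations. The following sketch assumes that a fusion is represented as an ordered pair of gene symbols (5' partner, 3' partner) and that symbols are compared case-insensitively; both choices are assumptions for illustration, as the text does not state how gene pairs were matched. The gene pairs are toy data, not ChimerDB content.

```python
def normalize(pair):
    """Represent a fusion as an ordered (5' gene, 3' gene) tuple of upper-case symbols."""
    head, tail = pair
    return (head.upper(), tail.upper())


def overlap_ratio(predicted, known):
    """Fraction of unique predicted gene pairs that also occur in the known set."""
    predicted_set = {normalize(pair) for pair in predicted}
    known_set = {normalize(pair) for pair in known}
    if not predicted_set:
        return 0.0
    return len(predicted_set & known_set) / len(predicted_set)


# Toy data: a small "known" set and the output of one hypothetical tool.
KNOWN = [("EML4", "ALK"), ("BCR", "ABL1"), ("TMPRSS2", "ERG")]
TOOL_CALLS = [("EML4", "ALK"), ("bcr", "abl1"), ("GENE_A", "GENE_B"), ("GENE_C", "GENE_D")]

RATIO = overlap_ratio(TOOL_CALLS, KNOWN)
print(f"{RATIO:.2%}")
```

The same function can compare one tool's predictions against the union of the other tools' predictions, which corresponds to the second comparison described in the text.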

USER INTERFACE

The user interface of ChimerDB was redesigned to accommodate the new modules in this update. Figure 3 shows the important features of the user interface, taking the EML4-ALK fusion as an example query to the ChimerSeq module. We support diverse types of search including gene, gene pair, chromosome locus and disease type. In a ChimerSeq search, users may select the data source, cancer type and prediction tools with optional parameters. With ample annotations, we support diverse filtering options such as function filters for kinase, oncogene, tumor suppressor, receptor and transcription factor genes. Users may also keep only fusion transcripts supported by other modules for increased reliability or cross-checking.

Figure 3.

User interface of ChimerDB 3.0. (A) The search window for ChimerSeq is shown as an example of supporting user queries for a specific data source, cancer type and prediction tools. (B) The main output is the result table showing query hits with a brief summary and links to further information. A click on each row activates the fusion structure window in (C) and the detailed information window in (D). Note that the fusion structure viewer supports zooming and moving, exon structure with domain information and the alignment of seed/junction reads if available.

The output GUI consists of a summary table of search hits, a graphic illustration of fusion structure and detailed information on a specific fusion event. The output table supports searching, sorting, exporting and linkouts to external resources. Clicking on an entry activates the graphic window of fusion gene structure and the detailed information table. The fusion gene graphic window shows the exons, domains and the breakpoint before and after the fusion event. This should be the most insightful picture for deducing the functional significance of a fusion event, since the location of functional domains before and after gene fusion is illustrated. If available, we also show the alignment of short reads (seed/junction reads only). We support zooming and panning for user convenience. The detailed information table provides all relevant information on the fusion transcript. Of note, the UCSC links guide users to the UCSC genome browser with the short read alignment added as a custom track so that they can examine the detailed gene structure and alignment.

25 in total

1. FusionHunter: identifying fusion transcripts in cancer using paired-end RNA-seq.

Authors: Yang Li; Jeremy Chien; David I Smith; Jian Ma
Journal: Bioinformatics Date: 2011-05-05 Impact factor: 6.937

Review 2. Application of next generation sequencing to human gene fusion detection: computational tools, features and perspectives.

Authors: Qingguo Wang; Junfeng Xia; Peilin Jia; William Pao; Zhongming Zhao
Journal: Brief Bioinform Date: 2012-08-09 Impact factor: 11.622

3. PRADA: pipeline for RNA sequencing data analysis.

Authors: Wandaliz Torres-García; Siyuan Zheng; Andrey Sivachenko; Rahulsimham Vegesna; Qianghu Wang; Rong Yao; Michael F Berger; John N Weinstein; Gad Getz; Roel G W Verhaak
Journal: Bioinformatics Date: 2014-04-01 Impact factor: 6.937

4. Reference-free prediction of rearrangement breakpoint reads.

Authors: Edward Wijaya; Kana Shimizu; Kiyoshi Asai; Michiaki Hamada
Journal: Bioinformatics Date: 2014-05-29 Impact factor: 6.937

5. Human genomics. The human transcriptome across tissues and individuals.

Authors: Marta Melé; Pedro G Ferreira; Ferran Reverter; David S DeLuca; Jean Monlong; Michael Sammeth; Taylor R Young; Jakob M Goldmann; Dmitri D Pervouchine; Timothy J Sullivan; Rory Johnson; Ayellet V Segrè; Sarah Djebali; Anastasia Niarchou; Fred A Wright; Tuuli Lappalainen; Miquel Calvo; Gad Getz; Emmanouil T Dermitzakis; Kristin G Ardlie; Roderic Guigó
Journal: Science Date: 2015-05-08 Impact factor: 47.728

6. TopHat-Fusion: an algorithm for discovery of novel fusion transcripts.

Authors: Daehwan Kim; Steven L Salzberg
Journal: Genome Biol Date: 2011-08-11 Impact factor: 13.583

7. TICdb: a collection of gene-mapped translocation breakpoints in cancer.

Authors: Francisco J Novo; Iñigo Ortiz de Mendíbil; José L Vizmanos
Journal: BMC Genomics Date: 2007-01-26 Impact factor: 3.969

8. COSMIC: exploring the world's knowledge of somatic mutations in human cancer.

Authors: Simon A Forbes; David Beare; Prasad Gunasekaran; Kenric Leung; Nidhi Bindal; Harry Boutselakis; Minjie Ding; Sally Bamford; Charlotte Cole; Sari Ward; Chai Yin Kok; Mingming Jia; Tisham De; Jon W Teague; Michael R Stratton; Ultan McDermott; Peter J Campbell
Journal: Nucleic Acids Res Date: 2014-10-29 Impact factor: 16.971

9. The landscape and therapeutic relevance of cancer-associated transcript fusions.

Authors: K Yoshihara; Q Wang; W Torres-Garcia; S Zheng; R Vegesna; H Kim; R G W Verhaak
Journal: Oncogene Date: 2014-12-15 Impact factor: 9.867

10. ChimerDB 2.0--a knowledgebase for fusion genes updated.

Authors: Pora Kim; Suhyeon Yoon; Namshin Kim; Sanghyun Lee; Minjeong Ko; Haeseung Lee; Hyunjung Kang; Jaesang Kim; Sanghyuk Lee
Journal: Nucleic Acids Res Date: 2009-11-11 Impact factor: 16.971
