Chimeric RNAs that comprise two or more different transcripts have been identified in many cancers and among the Expressed Sequence Tags (ESTs) isolated from different organisms; they might represent functional proteins and produce different disease phenotypes. The ChiTaRS database of Chimeric Transcripts and RNA-Sequencing data (http://chitars.bioinfo.cnio.es/) collects more than 16 000 chimeric RNAs from humans, mice and fruit flies, 233 chimeras confirmed by RNA-seq reads and ∼2000 cancer breakpoints. The database indicates the expression and tissue specificity of these chimeras, as confirmed by RNA-seq data, and it includes mass spectrometry results for some human entries at their junctions. Moreover, the database has advanced features to analyze junction consistency and to rank chimeras based on the evidence of repeated junction sites. Finally, 'Junction Search' screens through the RNA-seq reads found at the chimeras' junction sites to identify putative junctions in novel sequences entered by users. Thus, ChiTaRS is an extensive catalog of human, mouse and fruit fly chimeras that will extend our understanding of the evolution of chimeric transcripts in eukaryotes and can be advantageous in the analysis of human cancer breakpoints.
The eukaryote transcriptome is composed of RNAs transcribed from almost any location in the genome (1–6). Although most RNAs can be assigned to a single locus, some of them, called chimeras, are composed of exons from distinct genes and are therefore assigned to several loci (1,7–24). In some cases, the loci are close to each other in the genome, suggesting that the chimera is generated by read-through transcription (1,12). In other instances, the loci are megabases apart or on different chromosomes, suggesting that the chimera is generated through genome rearrangements or trans-splicing (9,22). Although the possibility that some chimeras are the in vitro artifact of template switching by the reverse transcriptase cannot be totally ruled out (reverse transcriptase–free assays are much harder to perform) (25), the recent evidence that some chimeras are translated corroborates their authenticity and motivated us to establish a systematic catalog of all chimeras (19). Another reason to categorize chimeras is their association with cancer, when the transcriptome is notoriously more complex owing to a large number of genome rearrangements, mutations and alterations of the splicing machinery (26,27).
The best-characterized chimeric transcript example is the BCR–ABL1 fusion that is expressed strongly in chronic myelogenous leukemia (23,24). Indeed, this fusion is the target of the anticancer drug imatinib (28,29). Thus, the therapeutic relationship highlights the benefits that can be obtained from identifying chimeric transcripts in cancers and other diseases, both as potential drug targets and as diagnostic tools (30–32).
Next-generation sequencing technology provides a great opportunity to identify chromosomal aberrations and novel fusion genes (14,23,24,31,33,34). Indeed, the TMPRSS2 and ETS fusion was identified in prostate cancer by RNA-sequencing and microarray data analysis (23,24,34). 
Similarly, the EML4–ALK fusion gene was identified in non–small-cell lung cancer using a functional screening procedure (35,36). Short-read sequencing strategies were successfully applied to find fusion genes in prostate, lung and breast cancer cell lines (18,23,24,34,37,38). These are just a few examples where it has been possible to show how gene fusions are associated with solid tumor development, and more examples are likely to come.
We recently screened thousands of candidate chimeric transcripts using functional annotation, high-throughput RNA sequencing and mass spectrometry, and identified 175 chimeric RNAs and 12 novel chimeric proteins expressed in humans (19,20). Generally, chimeric transcripts are expressed weakly in normal tissues, although these chimeras tend to incorporate highly expressed parental genes (19). Moreover, we presented evidence of chimeras that had lost certain functional domains and that might therefore actively compete with the functional wild-type proteins, producing dominant negative effects in cancers and other diseases (20). Hence, the screening of the Expressed Sequence Tag (EST) databases and RNA-sequence mapping may have certain advantages when attempting to identify novel chimeric transcripts in cancers, or in normal cells (39).
An enormous effort has been made to catalog chimeric transcripts from the literature: the Mitelman database (40) and the Sanger cancer genome project (41), including the COSMIC database (27,42,43). GenBank (44) also provides a resource to identify candidate inter-chromosomal or intra-chromosomal chimeric transcripts from EST and mRNA data sets (13,45,46). Several fundamental databases have been constructed to incorporate chimeric transcripts from different resources and using a variety of computational procedures: ChimerDB 2.0 (47), ChimerDB (48), HybridDB (49), TICdb (50) and dbCrid (51). 
Although these databases have been very useful and supported the research in the area, none of them integrates EST or mRNA sequences and literature resources together with RNA-sequencing data, expression level and tissue specificity of chimeric transcripts in different tissues and organisms.
Our ChiTaRS database is designed to incorporate chimeric transcripts from three organisms [human, mouse and fruit fly (Drosophila)], which helps to provide evidence of chimeras conserved in these organisms. The database was generated by performing a bioinformatics analysis of transcript sequences for the three organisms in GenBank (44). The special features of ChiTaRS include the use of an algorithm optimized for the quick retrieval and search of 16 262 chimeric transcripts in the three organisms, using various search parameters. It includes an extensive coverage of recent publications and relevant databases that collate 1892 cancer breakpoints and read-through fusions, as well as manual verification of the entries. Moreover, the database incorporates evidence from RNA-seq reads that map 233 chimeric junction sites from multiple next-generation sequencing data sets for the three organisms, providing information regarding the level of expression and the tissue specificity of the entries. The download page includes all the entries and tables in the database, together with the RNA-seq and mass spectrometry data supporting the existence of these chimeras. ChiTaRS also enables the transcripts and their junction site to be visualized by SpliceGrapher (52), using the genome annotation for humans, mice and fruit flies. Finally, the database has a unique feature to analyze the junction consistency, ranking the chimeras according to the evidence of the same junction site. This feature is advantageous to researchers seeking empirical confirmation of highly ranked chimeric transcripts. 
As a result, ChiTaRS represents the most extensive catalog of chimeric RNA transcripts in the human, mouse and fruit fly, which makes it particularly important that the data are presented in an easily understandable and user-friendly format.
RESULTS
Data sets of candidate chimeric transcripts
We have created a data set of chimeric transcripts using EST and mRNA sequences for human [the UCSC reference genome (51–53): GRCh37/hg19], mouse (NCBI37/mm9) and Drosophila (BDGP R5/dm3) from GenBank (44). All sequences were aligned to the corresponding reference genomic sequences using the UCSC BLAT program (53,54). The sequence was considered a chimera whenever the first part aligned to one gene and the second to another gene located at least 750 kb away [the default maximum intron size in BLAT (54)]. For the alignments, identity was set at a minimum of 95%, and the minimum length was set at 50 nucleotides (nt). We allowed an overlap of up to 10 nt between the two subparts of a chimera; therefore ChiTaRS also includes chimeric transcripts with short homologous sequences (46). Furthermore, we did not put any constraint on the splice sites; hence, ChiTaRS contains chimeras with either canonical or non-canonical splice sites. In this way, 14 512 human, 10 550 mouse and 4084 fruit fly candidate chimeras were identified. The chimeric transcripts incorporating opposite strands of the same gene were removed to avoid fusion by cotranscription and intergenic splicing (CoTIS) (47,48). Moreover, the chimeric junction sites were characterized to distinguish between genuine chimeras and artifacts, as the junction in a chimera is typically around the exon–exon splice sites (45,47). Applying this filter, whereby candidate chimeric sequences were removed if the junction was situated >50 nt away from a known splice site, reduced the number of chimeric candidates to 9379, 4828 and 2055 in the human, mouse and fruit fly data sets, respectively. These candidate chimeras involved 7808, 5141 and 1784 unique genes from human, mouse and fruit fly, respectively. It is worth noting that this pipeline did not capture read-through fusions (1), as they involve genes located <750 kb away. 
The read-through chimeras that we included in the database were added separately, based on adequate published supporting evidence (see ‘Full Collection & Search’).
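The selection criteria described above can be summarized in a short sketch. This is an illustration only, not the code used to build ChiTaRS: the `AlignedPart` record, the function names and the `splice_sites` lookup are hypothetical, the alignment parts are assumed to have been extracted from BLAT output beforehand, and requiring both sides of the junction to lie near a known splice site is an assumption. Only the numerical thresholds are taken from the text.

```python
from bisect import bisect_left
from dataclasses import dataclass

MIN_IDENTITY = 95.0           # minimum percent identity of each aligned part
MIN_LENGTH = 50               # minimum aligned length of each part (nt)
MAX_OVERLAP = 10              # maximum overlap of the two parts on the transcript (nt)
MIN_LOCUS_DISTANCE = 750_000  # minimum genomic distance between the two loci (nt)
MAX_SPLICE_OFFSET = 50        # maximum distance from junction to a known splice site (nt)


@dataclass
class AlignedPart:
    """One aligned part of a transcript (hypothetical record, not a BLAT type)."""
    gene: str
    chrom: str
    q_start: int       # start on the transcript (0-based)
    q_end: int         # end on the transcript (exclusive)
    g_start: int       # leftmost genomic coordinate of the part
    g_end: int         # rightmost genomic coordinate of the part
    identity: float    # percent identity of the alignment
    junction_pos: int  # genomic coordinate of the end that forms the junction


def near_splice_site(chrom, pos, splice_sites):
    """True if pos lies within MAX_SPLICE_OFFSET nt of a known splice site."""
    sites = splice_sites.get(chrom, [])      # sorted coordinates per chromosome
    i = bisect_left(sites, pos)
    neighbours = sites[max(i - 1, 0):i + 1]  # closest site on either side
    return any(abs(pos - s) <= MAX_SPLICE_OFFSET for s in neighbours)


def is_candidate_chimera(first, second, splice_sites):
    """Apply the thresholds quoted in the text to a two-part alignment."""
    for part in (first, second):
        if part.identity < MIN_IDENTITY or part.q_end - part.q_start < MIN_LENGTH:
            return False
    # The two parts may share at most MAX_OVERLAP nt of the transcript.
    if first.q_end - second.q_start > MAX_OVERLAP:
        return False
    # Two parts of the same gene (including opposite strands) are not kept.
    if first.gene == second.gene:
        return False
    # Loci on the same chromosome must be at least 750 kb apart.
    if first.chrom == second.chrom:
        gap = max(first.g_start, second.g_start) - min(first.g_end, second.g_end)
        if gap < MIN_LOCUS_DISTANCE:
            return False
    # Assumption: both sides of the junction must be close to a splice site.
    return all(
        near_splice_site(p.chrom, p.junction_pos, splice_sites)
        for p in (first, second)
    )
```

Because the 750 kb distance is applied only to parts on the same chromosome, inter-chromosomal candidates pass that step automatically, whereas read-through fusions between neighbouring genes are rejected, in line with the remark above.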
RNA-sequencing analysis of candidate chimeric transcripts
To assess the expression and validate the authenticity of the candidate chimeric transcripts, we screened RNA-seq data sets from the corresponding organism (19). For human candidate chimeras, we used the Human Body Map 2.0 data generated on the HiSeq 2000 by Illumina in 2010. This data set comprises 1097 million (M) reads of 75 nt derived from the sequencing of RNA from 16 different tissues. For the candidate Drosophila chimeras, we used a data set of 22 M reads of 75 nt resulting from the sequencing of the ovarian cell line Kc167 (the RGASP competition, the modENCODE group) (55). Clearly, the depth (number of reads sequenced) and breadth (number of tissues sampled) of the sequencing differed between the data sets, which explains why the number of chimeras we confirmed was different for each organism: 192 for humans, 12 for mice and 29 for fruit flies (see ‘Full Collection’).
To ensure that an RNA-seq read could be unambiguously assigned to a chimera and not to another location in the genome, we followed a specific mapping protocol (19). First, we mapped all RNA-seq reads to the reference genome and annotated exon junctions using the Grape RNAseq Analysis Pipeline Environment (GRAPE) (http://big.crg.cat/services/grape), and thereby removed any reads that could be linearly assigned to genomic regions. The remaining reads served as the set of putative chimeric reads and were mapped to our candidate chimeras. Selection of candidate chimeras required that an RNA-seq read map precisely to the chimeric junction, with at least six nucleotides on each side of the junction and with no more than three mismatches (32). Finally, 192 human chimeric transcripts were confirmed by at least two RNA-seq reads covering the gene–gene junction site. 
Based on this RNA-sequencing analysis, the ChiTaRS database contains information regarding the number of reads across the chimera junction, its tissue specificity and the abundance of a given chimera in human tissues (see ‘Full Collection & Search’).
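The junction-support criterion can be illustrated as follows. This is a minimal sketch under stated assumptions, not the mapping protocol itself: reads are compared to the chimera sequence without gaps, the earlier step that discards reads mapping linearly to the genome is not modelled, and the function names and the toy sequences in the usage note are invented. Only the three thresholds (six flanking nucleotides, three mismatches, two reads) come from the text.

```python
MIN_FLANK = 6        # nt required on each side of the junction
MAX_MISMATCHES = 3   # mismatches tolerated per read
MIN_READS = 2        # supporting reads required to confirm a chimera


def read_supports_junction(read, chimera, junction, offset):
    """Check one ungapped placement of `read` at chimera[offset:].

    `junction` is the index of the first base contributed by the second gene.
    """
    end = offset + len(read)
    if offset < 0 or end > len(chimera):
        return False
    # The read must extend at least MIN_FLANK nt into each partner gene.
    if junction - offset < MIN_FLANK or end - junction < MIN_FLANK:
        return False
    mismatches = sum(a != b for a, b in zip(read, chimera[offset:end]))
    return mismatches <= MAX_MISMATCHES


def count_supporting_reads(reads, chimera, junction):
    """Count reads with at least one placement that spans the junction."""
    count = 0
    for read in reads:
        first = junction - len(read) + MIN_FLANK  # leftmost valid offset
        last = junction - MIN_FLANK               # rightmost valid offset
        for offset in range(max(first, 0), last + 1):
            if read_supports_junction(read, chimera, junction, offset):
                count += 1
                break  # count each read only once
    return count


def is_confirmed(reads, chimera, junction):
    """A chimera is confirmed when enough reads cover its junction."""
    return count_supporting_reads(reads, chimera, junction) >= MIN_READS
```

For example, with the invented sequence `chimera = "GATTACAGGCTTAACC" + "TGCATCGGAAGTCCAT"` and `junction = 16`, the read `chimera[8:24]` supports the junction, whereas `chimera[4:20]` does not, because only four of its nucleotides fall on the second gene.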
Cancer-associated chimeric transcripts
The human data set of chimeric transcripts includes chromosomal fusions found in cancers that we extracted from the TICdb (50), dbCrid (51), ChimerDB 2.0 (47) and Mitelman (56) databases. The chimeric transcripts collected in our database are the result of chromosomal translocations, insertions, deletions, inversions, ring chromosomes, derivatives and many others (see ‘Breakpoints’) (50,51,56). Manual inspection of >7000 (3343 unique) articles was used to confirm the correspondence between the fusion event, the disease and the two genes incorporated into the chimeras. Thus, the ChiTaRS database is composed of 1892 fusions involving >1000 unique genes (see ‘Breakpoints’ and Figure 1) with cross-links to the chimeric ESTs. The database also incorporates the published read-through and trans-splicing fusions (1), which can be found explicitly under the ‘Full Collection & Search’ page (use the check-box for ‘Published Fusions’). To the best of our knowledge, ChiTaRS is the first catalog that enables cross-referencing between chimeric transcripts found in GenBank (44), relevant PubMed articles regarding putative breakpoints, the two genes involved and the ‘chimeric’ RNA-seq reads covering the chimeric junctions in a specific tissue or a cell type. For example, there is a chromosomal translocation t(10;11)(p13;q14), which creates a fusion between the PICALM and MLLT10 genes, characteristic of hematological malignancies. The translocation described corresponds to the chimeric transcript found in our database, ESTid = ‘EF051633’. Interestingly, searching ChiTaRS with this chimeric RNA transcript revealed that this chimera was also expressed in a female patient with chronic obstructive lung disease, according to the Human Body Map 2.0 data, with an expression level of 0.33 reads per kilobase per million reads (RPKM) (1–2 transcripts per cell on average) (see ‘Breakpoints’ and Figure 1).
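For reference, the expression unit quoted above follows the standard definition of RPKM, given here as a reminder. The symbols C, N and L are introduced only for this illustration, and how reads were counted for an individual chimera is not restated here.

```latex
\[
\mathrm{RPKM} = \frac{10^{9}\, C}{N \, L}
\]
```

Here C is the number of reads assigned to the transcript, N is the total number of mapped reads in the sample and L is the transcript length in nucleotides.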
Figure 1.
The ChiTaRS breakpoints collection page. The breakpoints collection (‘Breakpoints’) includes ∼2000 human cancer breakpoints with links to TICdb (50), dbCrid (51), ChimerDB 2.0 (47) and the Mitelman database (38,56). The search can be performed on the ‘Breakpoints’ page by a PubMed ‘Reference’, a ‘Gene’ name, ESTid (‘ChimeraID’), a ‘Disease’ and a type of ‘Chromosomal Aberrations’. The type of search term (‘Reference’, ‘Gene’, ‘ChimeraID’ or ‘Disease’) is recognized automatically. A specific combination of chromosomes, arms and the locus can be used as a ‘Search’ option as well. Finally, the RNA-sequencing results are presented by clicking on ‘RNA-seq’ and ‘Save sets and Search’.
Features of the ChiTaRS database
The search options
The ChiTaRS database is accessible at its home page: http://chitars.bioinfo.cnio.es. The search (see the ‘Full Collection & Search’ page and Figure 2) can be performed using ESTid (‘ChimeraID’), the names of the genes participating in the chimeras (‘Gene Name’, e.g. LMNA, DDX5), a sequence identity score (‘Identity’, e.g. 100, 95), a tissue type (‘Tissue Name’, e.g. lung), gene synonyms (‘Gene Synonym’) or a keyword (‘Keyword’, e.g. RARA). The complete collection can be obtained by selecting the ‘Full Collection’ option and clicking on ‘Search’. All 16 262 entries in the database for human, mouse and fruit fly are then listed together (Figure 2).
Figure 2.
The ChiTaRS full collection page. The full collection (‘Full Collection & Search’) consists of 16 262 transcripts in human (H. sapiens), mouse (M. musculus) and fruit fly (D. melanogaster). The search can be performed using ESTid (‘ChimeraID’), the names of the genes participating in the chimeras (‘Gene Name’, e.g. LMNA, DDX5), a sequence identity score (‘Identity’, e.g. 100, 95), a tissue type (‘Tissue Name’, e.g. lung), gene synonyms (‘Gene Synonym’) or a keyword (‘Keyword’, e.g. RARA). The RNA-sequencing results and the breakpoints can be extracted by clicking on the corresponding check-boxes and then ‘Search’.
The search results page shows all the relevant instances associated with the chimeric transcripts available, the RNA-seq data of the mapping to the chimeric junction site (19), the level of transcript expression and the cancer breakpoints (see the pop-up windows opened by clicking on the ‘RNA-seq’ column in the ‘Full Collection’ and Figure 2). It contains detailed information about the identifier and the link to the corresponding GenBank entry, the junction site, the gene names and the identity of the two genes incorporated into the chimera. The chimeras can be visualized as splice graphs (see description given later in the text). Together with the two genes and the disease information, the table of fusion transcripts includes general links to relevant resources, such as Entrez Gene, GenBank (44), the Mitelman database (56), TICdb (50), dbCrid (51) and PubMed references. The search results can be saved as a tab-delimited text file using the ‘Get Results as Text’ button (up to 100 sequences).
In addition, ‘Junction Search’ provides the option to screen through the list of RNA-seq reads found at the chimeras’ junction sites (19) to identify putative junction sites in novel sequences provided by a user. 
The ‘Junction Search’ is available for all three organisms in the database, and either the transcript sequence or the GenBank accession number can be used as input. The search is an automatic procedure that identifies a junction site in the transcript entered by a user and that aligns the previously found ‘chimeric’ RNA-seq reads to this junction site. This special feature of ChiTaRS allows users to identify to what extent their chimeric transcripts are similar to those for which there is RNA-seq data in the database, which helps when analyzing chimeras against large high-throughput data sets with multiple sequences. In the ‘Downloads’ section, we provide all the unmapped reads for the different RNA-seq data sets for the three organisms. These data sets enable users to search for the junction coverage among other available chimeric transcripts in the different databases.
Unique gene names
Distinct aliases for the same gene are one of the main problems when dealing with different gene, protein and transcript databases, and they may be a source of duplicated entries. In ChiTaRS, we use a specific table to map the synonymous gene names to a unique record, using the NCBI Entrez gene name as a key. We have currently performed four updates to the ChiTaRS database after manual verification of the entries and cancer breakpoints. In each update, synonymous gene names are checked automatically so that both genes incorporated into a chimera carry a unique name. Thus, all entries currently appearing in ChiTaRS have unified gene names, and as a result, searches can be performed based on gene names and synonyms (under ‘Full Collection & Search’, Figure 2).
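The idea of mapping synonyms to a unique record can be sketched as a simple lookup table. This is a hypothetical illustration rather than the ChiTaRS schema: the function names are invented, the official symbol stands in for the Entrez gene name used as the key, and an alias shared by two genes would need manual resolution (here the sketch simply raises an error).

```python
def build_synonym_index(records):
    """Build a case-insensitive alias-to-official-symbol lookup.

    `records` is an iterable of (official_symbol, aliases) pairs.
    """
    index = {}
    for symbol, aliases in records:
        for name in (symbol, *aliases):
            key = name.upper()
            if index.get(key, symbol) != symbol:
                # The same alias points to two different genes.
                raise ValueError(f"ambiguous alias: {name}")
            index[key] = symbol
    return index


def unify_gene_pair(gene1, gene2, index):
    """Return both partner genes of a chimera under their official symbols.

    Names that are absent from the index are returned unchanged.
    """
    return (
        index.get(gene1.upper(), gene1),
        index.get(gene2.upper(), gene2),
    )
```

For instance, with an index built from `[("PICALM", ["CALM"]), ("MLLT10", ["AF10"])]`, the call `unify_gene_pair("calm", "AF10", index)` returns `("PICALM", "MLLT10")`, so a search with either alias retrieves the same entry.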
Ranking of chimeric junction consistency
One of the key novelties of our database is the calculation and ranking of chimeric junction consistency. The ChiTaRS database contains transcripts that are chimeras of two genes, and in some cases, there is evidence these two genes may participate in many chimeras. The junction consistency ranking is a measure of how many times the same junction between the same genes has been found in chimeric transcripts. Thus, if the junction site is at the same genomic location of two genes incorporated in chimeras with a difference of no more than 1000 nt (an empirical threshold that can be changed in the ‘Search’ options), the junction rank is high (for more details, see Figure 3). Junction consistency is particularly useful for researchers who wish to verify the existence of highly ranked chimeras in cells by polymerase chain reaction (PCR), reverse transcriptase-quantitative PCR or other techniques, thereby reducing the chance of dealing with chimeras that are mere artifacts.
Figure 3.
An example of 15 chimeric transcripts involving the BCR and ABL1 genes (ESTids = ‘M19730.1’, ‘DQ912590.1’, ‘AM491360.1’, ‘EF423615.1’, ‘AM491359.1’, ‘DQ912589.1’, ‘AF113911.1’, ‘AY043457.1’, ‘M25946.1’, ‘AY789120.1’, ‘AM491361.1’, ‘EU216071.1’, ‘EU216066.1’, ‘AJ131467.1’ and ‘AJ131466.1’), presented by SpliceGrapher, with a consistent junction awarded a ranking of 5. This figure depicts the known splicing patterns for the two genes involved (middle two panels) in the chimera along with the 15 ESTs found in the ChiTaRS database that provide evidence for the chimera (top and bottom panels). Exons that participate in the chimera are highlighted in dark grey, and the location of the chimeric junction is highlighted in red. To score the junction consistency, we selected all 15 chimeras mapping to this gene pair. For each chimera, we have at our disposal the genomic location of its junction: end (gene 1) and start (gene 2). We calculated a distance between all pairs of chimeras based on these coordinates. The distance simply corresponds to the difference between the two starts of gene 1 and the difference between the two ends of gene 2. Then we selected the chimera with shortest distance to all others as the reference chimera. If another chimera of the same gene 1 and gene 2 had a distance of <1000 nt to the reference chimera, we decided that the junction is consistent and incremented the rank by 1. In the special case where two chimeras had strictly the same mapping positions, we selected only one, assuming that the duplication could be due to artifacts. In the example from the figure, the reference chimera is EU216071.1. Twelve chimeric ESTs among 15 (except chimeras ‘AM491361.1’, ‘AF113911.1’, ‘M19730.1’) are consistent with the junction site of EU216071.1; the rank of these 12 chimeras is 5. 
The junction consistency and rank may show that potential breakpoints are not artifacts, and indeed, the BCR and ABL1 chimeras have a breakpoint for the Philadelphia translocation t(9;22)(q34;q11) in chronic myelogenous leukemia.
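The ranking procedure described in the legend of Figure 3 can be sketched as follows. This is an illustration under explicit assumptions, not the ChiTaRS implementation: each chimera of a gene pair is represented by two junction coordinates (one per gene), the distance between two chimeras is taken as the sum of the two coordinate differences, and the rank counts the distinct chimeras other than the reference that fall within the threshold. The function names and the identifiers and coordinates in the usage note are invented.

```python
JUNCTION_TOLERANCE = 1000  # nt; the empirical default quoted in the text


def junction_distance(a, b):
    """Distance between two junctions given as (gene1_coord, gene2_coord).

    Assumption: the two coordinate differences are summed.
    """
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def junction_rank(junctions, tolerance=JUNCTION_TOLERANCE):
    """Rank the junction consistency for one gene pair.

    `junctions` maps a chimera identifier to its (gene1_coord, gene2_coord).
    Returns (reference_id, rank, consistent_ids).
    """
    # Chimeras with strictly identical positions are counted only once.
    unique = {}
    for chimera_id, pos in sorted(junctions.items()):
        unique.setdefault(pos, chimera_id)
    reps = {chimera_id: pos for pos, chimera_id in unique.items()}

    # The reference chimera has the smallest total distance to all others.
    def total_distance(chimera_id):
        return sum(junction_distance(reps[chimera_id], p) for p in reps.values())

    reference = min(reps, key=lambda c: (total_distance(c), c))
    ref_pos = reps[reference]

    # Assumption: the rank counts the other distinct chimeras within tolerance.
    rank = sum(
        1
        for chimera_id, pos in reps.items()
        if chimera_id != reference and junction_distance(pos, ref_pos) < tolerance
    )
    consistent = sorted(
        chimera_id
        for chimera_id, pos in junctions.items()
        if junction_distance(pos, ref_pos) < tolerance
    )
    return reference, rank, consistent
```

With the invented input `{"est1": (1000, 5000), "est2": (1000, 5000), "est3": (1200, 5100), "est4": (900, 4950), "est5": (90000, 5000)}`, the reference is `est1`, the duplicate `est2` is collapsed, `est5` is rejected as inconsistent and the rank is 2.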
Visualization of chimeras by SpliceGrapher
A further feature of ChiTaRS is the visualization of chimeric transcripts and their genomic context, including the junction site. The visualization figures were produced using the SpliceGrapher package, which was designed to predict splice graphs for a gene by combining evidence from RNA-Seq data, annotated gene models and EST alignments (52). To produce splice graphs for chimeras, we first used GMAP (57) to align all available chimeric sequences to their reference genome (Homo sapiens version GRCh37.63, Drosophila melanogaster version BDGP R5/dm3 and Mus musculus version NCBI37/mm9), and subsequently, SpliceGrapher was used to convert the resulting alignments into splice graphs (52). Finally, we used SpliceGrapher’s visualization modules to integrate the ESTs and gene models into figures that illustrate chimeric splicing. Each figure shows how the ESTs align across two genes, making it possible to envisage the potential transcripts that could arise from each chimera (Figure 3).
The human and mouse chimeras
The ChiTaRS database provides evidence of chimeric transcripts and their mapping by RNA-seq reads from three higher eukaryotes: human, mouse and fruit fly. The database allows the investigation of transcripts that incorporate orthologous genes in different organisms. An interesting example is the human chimera, ChimeraID = ‘AW882230’, and the mouse chimera, ChimeraID = ‘CF577921’. These chimeras both incorporate the PTMS gene (parathymosin, which may mediate the immune function) and are confirmed by RNA-seq reads in each organism. These RNA-seq data therefore support the idea that the ability to form chimeras is a conserved feature of genomic loci. ChiTaRS takes a first step in exploring this premise, and one of its main future goals is the study of the evolution of chimeric transcripts.
The ‘Contact Us’ webpage
The ‘Contact Us’ page describes a way to submit new chimeras and fusion transcripts, which have been detected by other groups, published, or found using alternative software or data sets. All requests to include data in the ChiTaRS database will be inspected and verified manually before uploading.
Downloads
The ChiTaRS database not only provides extended ‘Search’ options, but also offers the possibility to download all the database tables and the data sets. The full human, mouse and fruit fly collections include information about the two genes incorporated into the chimeras, the sequence identity and the positions of the junction sites. In addition, the freely available RNA-seq results, all the unmapped RNA-seq reads and mass spectrometry results are downloadable for each organism. For easy access to the most important fusions, we produced separate files for the published fusions (1) as well as the fusions identified in prostate cancer by high-throughput RNA sequencing (23,24).
CONCLUSIONS AND FUTURE PLANS
ChiTaRS is an extended database of chimeric transcripts selected from GenBank from three organisms: human, mouse and fruit fly. The database features, such as the junction consistency and the ranking, allow rapid discovery of genes from different chimeras and of chimeras that share the same junction site. In addition, the chimeras derived from the three organisms provide an evolutionary tool to study chimeric transcripts across different organisms that involve the same genes. The RNA-Seq data should serve as a basis for further experimental confirmation of candidate chimeric transcripts. Moreover, the expression level of transcripts, as obtained from RNA-seq reads in different organisms and tissues, offers important information regarding the expression of chimeric transcripts, in particular, tissue specificity and function. Our ChiTaRS database is already of great use for experimental and evolutionary studies of chimeric transcripts, and for the annotation of chimeras in the International Cancer Genome Consortium (ICGC) project studying Chronic Lymphocytic Leukemia (CLL) [in collaboration with the ICGC consortium (26,58)].
In summary, the ChiTaRS database encompasses all chimeric transcripts confirmed in humans and potentially translated into chimeric proteins (19). Our prediction is that the functions of chimeric proteins are substantially different from those of the original native proteins. Indeed, chimeric proteins sometimes contain different protein domains (20), or they are found in distinct cellular compartments or specific tissues associated with disease or cancer. We intend to continue expanding and annotating the ChiTaRS database with chimeric transcripts confirmed by RNA-seq reads and through the existence of the corresponding chimeric proteins, the latter preferably confirmed by mass spectrometry experiments. 
Our database should prove useful to biologists characterizing normal and cancer-associated chimeric transcripts and their corresponding proteins, and more generally, to researchers interested in gene expression and evolution, both physiological and pathological.
FUNDING
Miguel Servet (FIS) grant (to M.F-M.); Obra Social laCaixa grant (to K.I.); National Science Foundation ABI [0743097 to A.B-H. and M.R.]. Funding for open access charge: NHGRI-NIH ENCODE grant [HG00455-04]; Blueprint European Union project [282510]; Spanish Government grant [BIO2007-66855]; Spanish National Bioinformatics Institute (INB-ISCIII), Genecode/ENCODE NHGRI-NIH grant [HG00455-04]; Plan Cancer 2009-2013, ERC Advanced Grant Sisyphe, Investissements d'avenir en Bioinformatique.
Conflict of interest statement. None declared.
Trumbower; Hiram Clawson; Jennifer Hillman-Jackson; Ann S Zweig; Kayla Smith; Archana Thakkapallayil; Galt Barber; Robert M Kuhn; Donna Karolchik; Lluis Armengol; Christine P Bird; Paul I W de Bakker; Andrew D Kern; Nuria Lopez-Bigas; Joel D Martin; Barbara E Stranger; Abigail Woodroffe; Eugene Davydov; Antigone Dimas; Eduardo Eyras; Ingileif B Hallgrímsdóttir; Julian Huppert; Michael C Zody; Gonçalo R Abecasis; Xavier Estivill; Gerard G Bouffard; Xiaobin Guan; Nancy F Hansen; Jacquelyn R Idol; Valerie V B Maduro; Baishali Maskeri; Jennifer C McDowell; Morgan Park; Pamela J Thomas; Alice C Young; Robert W Blakesley; Donna M Muzny; Erica Sodergren; David A Wheeler; Kim C Worley; Huaiyang Jiang; George M Weinstock; Richard A Gibbs; Tina Graves; Robert Fulton; Elaine R Mardis; Richard K Wilson; Michele Clamp; James Cuff; Sante Gnerre; David B Jaffe; Jean L Chang; Kerstin Lindblad-Toh; Eric S Lander; Maxim Koriabine; Mikhail Nefedov; Kazutoyo Osoegawa; Yuko Yoshinaga; Baoli Zhu; Pieter J de Jong Journal: Nature Date: 2007-06-14 Impact factor: 49.962
Authors: Peter J Campbell; Philip J Stephens; Erin D Pleasance; Sarah O'Meara; Heng Li; Thomas Santarius; Lucy A Stebbings; Catherine Leroy; Sarah Edkins; Claire Hardy; Jon W Teague; Andrew Menzies; Ian Goodhead; Daniel J Turner; Christopher M Clee; Michael A Quail; Antony Cox; Clive Brown; Richard Durbin; Matthew E Hurles; Paul A W Edwards; Graham R Bignell; Michael R Stratton; P Andrew Futreal Journal: Nat Genet Date: 2008-04-27 Impact factor: 38.330
Authors: Daniel G Jamieson; Phoebe M Roberts; David L Robertson; Ben Sidders; Goran Nenadic Journal: Database (Oxford) Date: 2013-05-23 Impact factor: 3.451