Literature DB >> 22199257

RNASEQR--a streamlined and accurate RNA-seq sequence analysis program.

Leslie Y Chen¹, Kuo-Chen Wei, Abner C-Y Huang, Kai Wang, Chiung-Yin Huang, Danielle Yi, Chuan Yi Tang, David J Galas, Leroy E Hood.

Abstract

Next-generation sequencing (NGS) technologies-based transcriptomic profiling method often called RNA-seq has been widely used to study global gene expression, alternative exon usage, new exon discovery, novel transcriptional isoforms and genomic sequence variations. However, this technique also poses many biological and informatics challenges to extracting meaningful biological information. The RNA-seq data analysis is built on the foundation of high quality initial genome localization and alignment information for RNA-seq sequences. Toward this goal, we have developed RNASEQR to accurately and effectively map millions of RNA-seq sequences. We have systematically compared RNASEQR with four of the most widely used tools using a simulated data set created from the Consensus CDS project and two experimental RNA-seq data sets generated from a human glioblastoma patient. Our results showed that RNASEQR yields more accurate estimates for gene expression, complete gene structures and new transcript isoforms, as well as more accurate detection of single nucleotide variants (SNVs). RNASEQR analyzes raw data from RNA-seq experiments effectively and outputs results in a manner that is compatible with a wide variety of specialized downstream analyses on desktop computers.

Entities: CellLine Chemical Disease Gene Species

Mesh：

Year: 2011 PMID： 22199257 PMCID： PMC3315322 DOI： 10.1093/nar/gkr1248

Source DB: PubMed Journal: Nucleic Acids Res ISSN： 0305-1048 Impact factor: 16.971

INTRODUCTION

Analyzing the spectrum of poly-adenylated RNA using conventional Sanger sequencing has provided rich biological information on gene-expression levels, alternative RNA splicing events and common and rare genetic variations in the last few decades (1–3). Recently, RNA-seq, a deep transcriptome profiling approach based on the next-generation sequencing (NGS) platforms, provides an enormous amount of sequence information and offers a larger dynamic range than other transcriptome profiling methods (4,5). Prior studies have also shown that gene-expression profiles obtained by RNA-seq correlate well with quantitative polymerase chain reaction (qPCRs) measurements (4). The millions of short sequences from NGS platforms pose a challenge for experimental biologists to analyze and extract meaningful biological information (6). The sequences from the early versions of NGS technology ranged from 25 to 50 bp. With improvements in the chemistry and instrumentation the length of sequences generated from NGS is becoming longer, which should improve the accuracy of RNA sequence analysis. However, the longer sequences involve additional challenges in data analysis since these sequences are more likely to span multiple exons. A recent study indicated that ∼30% of the sequences in a 75-bp RNA-seq library extend across at least one exon junction (7), which makes it more difficult to accurately map and align these sequences. Previous approaches to address this challenge have been focused on creating splice junction reference libraries built from either known gene models (8–10) or predicted exons (11–13). Other approaches have also been adapted to assist the alignment of RNA-seq sequences, such as using: seed matching followed by a heuristic identification of splice junctions (14–17); in silico prediction of splice junctions (18); clustering or assembly of RNA-seq sequences (19–22); and comprehensive hash-based alignment (23,24). However, these approaches are heavily dependent on computational resources and still have a significant frequency of mis-aligned sequences. We have developed a new sequence mapper/aligner, RNASEQR, specifically for RNA-seq data analysis. RNASEQR takes advantage of annotated transcripts and genomic reference sequences to obtain high quality mapping/alignment results. To evaluate the performance of RNASEQR, we compared the results to those from other widely used RNA-seq tools, including ERANGE (8), MapSplice (16), SpliceMap (12) and TopHat (11), with a simulated dataset derived from the Consensus CDS (CCDS) project (25) and two experimental data sets generated from a patient with glioblastoma multiforme (GBM). RNASEQR significantly improves the mapping results, especially on transcripts containing smaller exons, which results in more accurate assessment of gene-expression profiles and better transcript structures. The RNASEQR pipeline also significantly reduces false identification of single nucleotide variants (SNVs) near the splice junctions. We report in this manuscript a comprehensive comparison between RNASEQR and four other most widely used RNA-seq tools by evaluating their performance in several downstream analyses. RNASEQR and its open source code are available at https://github.com/rnaseqr/RNASEQR.

MATERIALS AND METHODS

RNASEQR pipeline

RNASEQR was written in Python 2.7 and runs on 64-bit Linux systems. It employs a Burrows–Wheeler transform (BWT)-based and a hash-based indexing algorithm. Briefly, there are three sequential processing steps: the first step is to align RNA-Seq sequences to a transcriptomic reference; the second step is to detect novel exons; the third step is to identify novel splice junctions using an anchor-and-align strategy. In the first step, RNA-seq sequences are mapped to a pre-built transcriptomic reference sequence using Bowtie (26), and sequences are classified into three categories based on the mapping results: unique, multiple, and unassigned. RNASEQR records a maximum of 40 alignment records for each sequence by default. To obtain genomic coordinates for each mapped sequence, RNASEQR applies a two-level data structure to record the exon information of each transcript. The first level records the identifier of each transcript, while the second level records the information of chromosome, orientation, start, and length for each exon in a transcript. This unique data structure maintains a constant processing time for converting transcriptome-genome coordination. The positions of both uniquely- and multiply-assigned sequences on gene transcripts are converted to the coordinates on the genomic reference sequence, and sequences satisfying the following criteria are recorded as having unique genomic positions: Given a scoring function Φ and genome reference G, one read r is said to have a unique alignment if and only if there exists one alignment v of r such that Φ(v,G)> Φ(v’,G) for all alignments v’ of r besides v. The scoring function Φ is calculated using Hamming distance that measures the minimum number of substitutions required to change one string to the other. In the second step, RNASEQR maps the unassigned sequences to a pre-built BWT-based genomic index using Bowtie (26). The current version of RNASEQR only records sequences mapped uniquely on the genomic reference sequences. In the third step, RNASEQR applies an anchor-and-align strategy. To generate anchors, RNASEQR splits each unassigned sequence into multiple substring sequences of fixed-length (25 bp by default) and even distribution. These anchors are then mapped to pre-built BWT-based transcriptomic and genomic indexes simultaneously using Bowtie (26). Sequences with a specified number of anchors (two anchors by default) pointing to a unique genomic location are aligned locally to candidate chromosomal regions using BLAT (27). To map the paired-end RNA-seq reading results, RNASEQR maps such sequences to a transcriptomic reference using Bowtie with built-in paired-end feature. The remaining unmapped paired-end sequences were then separately mapped as single-end reading using Bowtie and BLAT in the second and the third steps. The un-paired mapped sequences are then examined for pairing based on their genomic distance and sequence orientation. The final mapping result of each sequence is reported in the SAM file format with extended CIGAR string format (28).

RNA-seq of a glioblastoma tumor and matched peripheral brain tissue

Tumor and peripheral brain tissues were obtained from a GBM patient at the Chang Gung Memorial Hospital in Taiwan under proper IRB approval (CGMH IRB No. 94-0182 and WIRB 20070569). Total RNA was extracted from 0.5 g of frozen tissue using the miRNeasy mini kit (Qiagen, Germantown, MD, USA), and the RNA quality was checked using Agilent 2100 Bioanalyzer (Agilent, Santa Clara, CA, USA). We performed 75-bp single-end RNA sequencing (RNA-seq) on the Illumina Genome analyzer II following the manufacturer's suggestion (Illumina, San Diego, CA). In short, 10 µg high-quality RNA was used to enrich poly-adenylated RNA, which was then fragmented, reverse transcribed followed by the synthesis of the second strand. Each double-stranded cDNA fragment resulting was then blunt-ended, adenylated, ligated to adaptors, and size-purified for ∼200-bp fragments. These size-selected cDNA templates were further enriched using PCR and checked for quality on the Agilent 2100 Bioanalyzer (Agilent, Santa Clara, CA, USA). We performed 75-bp sequencing using the SBS sequencing kit v2 (Illumina, San Diego, CA, USA) one lane for each sample, and obtained 25.3 million, and 23.2 million high quality sequences for the tumor and the peripheral brain RNA samples, respectively. The raw sequence data was deposited to the Gene-expression Omnibus (GEO) database and is accessible through accession number GSE33328.

Genomic and transcriptomic reference sequence and simulated RNA-seq library

We compiled the sequences of each full-length transcript annotated on the UCSC KnownGene (29), NCBI RefSeq (30), Ensembl Genes (31) and Consensus CDS (CCDS) Genes (25) databases using the human genome reference sequence (GRCh37). We used Bowtie to create a pre-built index from the compiled transcriptomic reference sequence for all gene databases. The index of genomic references was downloaded from the Bowtie webpage (http://bowtie-bio.sourceforge.net/). We performed a simulation test using the transcripts annotated in CCDS Gene (25). Full-length sequences of each transcript in CCDS Gene were assembled using human genome reference (GRCh37), and then split into overlapping 75-bp simulated sequences by sliding a 75-bp window one nucleotide at a time. Positions of these unique sequences were recorded for further evaluating mapping and alignment accuracy.

Some publicly available mapping programs and downstream analysis tools

The mapping performance of RNASEQR was compared with that of ERANGE (8) (version 3.2.1), MapSplice (16) (version 1.14.1), SpliceMap (12) (version 3.3.5.1) and TopHat (11) (version 1.1.1). We ran the SAMtools (28) (version 0.1.8) to detect SNVs presented in the mapping result. SNVs with read-depth fewer than five were manually removed. We ran the Scripture program (7) to assemble the mapping results, construct transcript structures, and determine alternative isoforms, and calculate gene-expression levels for each transcript from the mapping results. Novel exons and novel splice junction sites were identified by comparing the assembled transcript structures to that annotated in the Ensembl Genome Browser. Sequence mapping result, SNVs, and assembled transcripts were visualized using the Integrative Genomics Viewer (32).

RESULTS

RNASEQR adapted a three-step ‘align and remove’ strategy to streamline the RNA-seq sequence mapping and alignment process (Figure 1). Sequences were first mapped to a set of full-length RNA transcripts (transcriptomic reference), which assigned a majority of the sequences and left a smaller portion of sequences undetermined. This allowed us to fully implement computationally intensive algorithms with limited resources in the following steps. Sequences that failed to map to the transcriptomic reference were subsequently compared to a genomic reference sequence to identify novel exons. To identify new exon junctions, we adopted BLAT (27), a memory-efficient hash-based alignment algorithm. Collectively, the three-step process yielded high-quality mapping results for both known and novel transcripts.

Figure 1.

RNASEQR sequence mapping procedure. RNA-seq sequences read by NGS instruments are derived from either one or multiple exons. RNASEQR first maps the sequences to a set of full-length transcripts and calculates the genomic coordinates for their mapped reads. Unmapped sequences are then compared to a genomic reference sequence in the next step to identify novel exons. Novel splice junctions are identified in the third step using a gap-tolerant alignment algorithm. The result of uniquely mapped sequences is recorded in the Sequence Alignment/Map (SAM) format. To test the performance of RNASEQR, we used three different datasets, one simulated and two experimental RNA-seq datasets. The simulated dataset contained 38 506 959 sequences, which were generated by sliding a 75-bp window along the sequence of all annotated transcripts in CCDS (25). Of the two experimental datasets, one of them was generated from a tumor tissue and the other was from the corresponding peripheral normal (reference) brain tissue obtained from a GBM patient. Taken together, the two experimental libraries yielded 48 643 647 single-end 75-bp high-quality sequences, respectively.

The effect of different transcriptomic annotations on the performance of RNASEQR

Several independent efforts, including the Ensembl Genome browser (31), the UCSC Genome browser (29), the NCBI Reference Sequence (30) and the Consensus CDS (25), are dedicated to annotating transcribed regions of the genome (herein Ensembl, UCSC, RefSeq and CCDS). Each annotation effort has implemented different inclusion criteria for transcripts, which might affect the results generated from RNASEQR. To address this, we compared the results with RNASEQR using a transcriptomic reference library built from different annotations. Approximately 35 million sequences were mapped uniquely in the first and second steps (Table 1) using transcriptomic references based on either Ensembl, UCSC or RefSeq. Since CCDS collects only protein-coding transcripts, only one-third of the sequences were assigned in the first step, but an equivalent number of mapped sequences were obtained after the second and the third steps analysis in RNASEQR. The number of uniquely mapped sequences differed by only 1.71%, suggesting that different transcriptomic reference databases used did not significantly affect the overall performance of RNASEQR.

Table 1.

RNASEQR mapping results of two RNA-seq libraries using various transcriptomic references

	Transcriptome ref.		Ensembl	UCSC	RefSeq	CCDS
	Annotated transcripts		151 185	77 614	37 162	23 754
Step	Reference	Mapping	Reads
I	Transcriptome	Unique	8 034 365	6 998 920	16 024 976	10 570 226
		Unique ^a	20 099 654	21 143 858	11 228 678	3 829 696
		Multiple^b	1 016 650	625 143	411 881	240 348
II	Genome	Input	17 015 881	17 398 629	18 501 015	31 526 280
		Unique	6 411 301	6 704 035	7 478 666	19 518 521
		Multiple	258 860	342 518	598 051	1 010 132
III		Split reads	10 345 720	10 352 076	10 424 298	10 997 627
	Transcriptome and genome	Anchored	2 765 619	2 769 510	2 798 471	3 052 090
	Genome	Unique	731 426	742 058	751 073	882 532
	Total mapped uniquely, n (%)		35 276 746 (76.41)	35 588 871 (77.09)	35 483 393 (76.86)	34 800 975 (75.38)

aOn multiple transcripts but unique genomic location.

bOn multiple transcripts and multiple genomic locations.

RNASEQR mapping results of two RNA-seq libraries using various transcriptomic references aOn multiple transcripts but unique genomic location. bOn multiple transcripts and multiple genomic locations.

Comparing the mapping performance of RNASEQR with other RNA-seq tools

The mapping performance of RNASEQER was compared to some widely used RNA-seq tools including ERANGE (8), MapSplice (16), SpliceMap (12) and TopHat (11) using the simulated dataset created from CCDS transcripts. Among 38 506 959 non-redundant sequences, RNASEQR assigned more sequences uniquely than the other tools with highest overall accuracy of 99.91% (Table 2). The aligned sequences were further broken down into two groups based on whether the sequence originated from a single exon: a total of 23 187 354 sequences were from single exon (unspliced) and 15 319 605 sequences were from more than one exon (spliced). All tools performed equally well with high sensitivity for sequences originating from a single exon. RNASEQR provided a better mapping result in both sensitivity and specificity with sequences that came from one or more than one exon (Table 2). This finding suggests that RNASEQR is particularly effective and accurate in assigning sequences correctly to splice junctions and that it will be increasingly effective as the length of the RNA sequence reads increases.

Table 2.

Mapping result of a 75-bp CCDS derived library in human

	RNASEQR	ERANGE	MapSplice	SpliceMap	TopHat
Uniquely mapped reads,^an (%)	37 634 842 (97.74)	37 128 297 (96.42)	37 459 713 (97.28)	34 516 928 (89.64)	35 770 965 (92.89)
Mapped unspliced reads^b	22 619 568	22 480 501	22 496 259	21 619 384	22 487 761
Sensitivity (%)	97.54	96.95	97.02	93.15	96.98
Specificity (%)	99.94	92.82	92.66	90.68	79.11
Mapped spliced reads^c	15 015 274	14 647 796	14 963 454	12 897 544	13 283 204
Sensitivity (%)	97.98	88.41	90.26	69.29	65.36
Specificity (%)	99.90	99.85	99.50	88.90	99.15
Overall mapping accuracy (%)	99.91	96.95	96.69	87.48	90.50

aTotal 38 506 959 unique sequences (reads).

bTotal 23 187 354 reads originated from a single exon.

cTotal 15 319 605 spliced reads originated from at least two exons.

Mapping result of a 75-bp CCDS derived library in human aTotal 38 506 959 unique sequences (reads). bTotal 23 187 354 reads originated from a single exon. cTotal 15 319 605 spliced reads originated from at least two exons. Since no sequence variation was introduced in the simulated dataset, we could fully evaluate the impact of incorrect mapping by means of the identification of SNVs. RNASEQR gave the lowest number of incorrectly mapped sequences, and most of these sequences were partially correct (Table 3). RNASEQR also yielded the lowest number of SNVs due to its highly accurate alignment results (Table 3). Most other tools reported a higher number of spurious SNVs identified near the splice junctions (≤5 bp, Table 3). Further analysis showed that these spurious SNVs were the consequences of incorrect identification of splice junctions (data not shown).

Table 3.

Incorrectly mapped reads and location of resulting SNVs

	RNASEQR	ERANGE	MapSplice	SpliceMap	TopHat
Incorrectly mapped reads	32 184	1 133 128	1 240 521	4 322 547	3 399 156
Partially correct ^a (%)	86.31	49.01	41.89	43.08	42.77
Resulting false SNVs
Coding exon	719	53 735	27 090	53 362	126 812
Non-coding exon	37	3991	5132	5053	51 461
Intron (≤5 bp from exon)	330	382 742	249 573	507 627	704 682
Intron (>5 bp from exon)	39	302	1523	1523	252 773
Intergenic region	86	875	2065	2065	248 298

aCorrect position either at the beginning or the end of sequence.

Incorrectly mapped reads and location of resulting SNVs aCorrect position either at the beginning or the end of sequence.

Evaluate the performance of RNASEQR with experimental data

The two experimental libraries derived from a single GBM patient had 25 384 704 (tumor) and 23 258 943 (reference) single-end sequences, respectively. RNASEQR mapped 58% of the sequences with unique genomic coordinates to the UCSC transcriptomic reference. In the second and third mapping steps additional 6 704 035 (tumor) and 731 426 (reference) sequences with novel exons or splice junctions were assigned to the genome with unique locations. In summary, RNASEQR assigned ∼78.3% of the sequences to unique regions on the genome, 2.1% of the sequences were assigned to multiple locations, and 20.8% sequences are still unmapped. Most of these unmapped sequences were poor quality sequences based on the Phred quality score (Supplementary Figure S1). Comparing the mapping results with other programs, RNASEQR assigned ∼3% more sequences than the other programs with various number of mismatches allowed in mapping (Figure 2A). Using default settings, RNASEQR allowed three mismatches in the first and second steps; ERANGE and SpliceMap also allowed three mismatches, while MapSplice tolerated up to five. TopHat mapped 0.5% more sequences than RNASEQR by tolerating more mismatches and truncating low quality sequences. The mapped sequences under default setting were further classified as spliced and unspliced. Accurately mapped and assigned spliced sequences are essential to obtain complete gene structures. RNASEQR mapped 7.3 million spliced sequences that is 17–106% more than the other tools mapped (Figure 2B).

Figure 2.

Numbers of uniquely mapped RNA-seq sequences. (A) RNASEQR assigned more sequences than the other tools with default threshold. (B) Using the default threshold, RNASEQR mapped more spliced sequences than the other programs. The accuracy of the mapping process inevitably influences the quality of downstream analysis. To investigate the influence of mapping accuracy on gene-expression level estimation, we ran a program called Scripture (7) to calculate the expression levels of genes that have been annotated in Ensembl. Different RNA-seq mapping tools gave high overall expression correlations when comparing the results for those genes with expression levels greater than 1 RPKM (reads per kilo per million) (Figure 3A and B). The RNASEQR result showed that underestimated gene expression was inferred by the other tools and this was seen in protein-coding gene transcripts (Figure 3C and D). This was predominately seen especially in low abundant transcripts (dots in the red circle in the Figure 3A and B). The RNASEQR result showed that 21 549 Ensembl transcripts showed a 2-fold change in expression level in the glioblastoma tumor compared to that in the peripheral brain tissue (Supplementary Table S1).

Figure 3.

Expression levels of gene transcripts annotated in Ensembl in reference brain (A and C) and tumor tissues (B and D). (A) and (B) represent transcripts with expression levels >1 RPKM. The numbers indicate the Pearson’s correlation and coefficient of expression levels between different tools. Red circles indicate low abundant transcripts underestimated by the other tools. (C) and (D) denote the numbers of gene transcripts with expression levels of >2-fold when comparing RNASEQR to the other tools. In addition, we observed that transcripts containing small exons (exon length shorter than the sequence read length) could lead to an underestimation of gene expression, as expected. For example, ST13, suppression of tumorigenicity 13 (colon carcinoma), is an example of a gene with 12 exons where 5 of them are <76 bp (RNA-seq sequence length). RNASEQR was the only tool that could detect all exons in ST13, which may provide better transcriptional abundance information (Figure 4A). RNASEQR failed to detect 4658 exons and 4762 exons in the gene transcripts with expression level >1 RPKM in the peripheral brain tissue and the glioblastoma tumor, respectively. RNASEQR showed a significantly lower ratio of the unidentified exons when the exons are <76 bp (Figure 4B).

Figure 4.

(A) Detail sequence mapping result on the gene ST13. Gray peaks indicate the sequence depth (coverage). Blue lines and blocks indicate sequences that spanned splice junctions. The result was visualized using the Integrative Genomics Viewer. (B) Ratio of unidentified exons in the transcripts with expression levels >1 RPKM. Exons in these gene transcripts were classified according to their length, <76 bp or ≥76 bp. Identifying genes with alternative exon usage is one of the most powerful applications for RNA-seq. For example, adenylate kinase 2 (AK2) has seven known exons. RNASEQR identified two novel exons and four isoforms (Figure 5). Two of these isoforms were seen only in the tumor RNA-seq library. The two novel exons and the tumor specific isoforms were experimentally verified (Supplementary Figure S2). All the other tools failed to reveal the complete gene structures for the AK2 gene.

Figure 5.

Gene structure of the gene adenylate kinase 2, AK2, assembled from the sequences in (A) reference brain and (B) tumor tissues. Uniquely mapped RNA-seq sequences were assembled using Scripture to build the gene structure. RNASEQR identified two novel exons and four isoforms with complete gene structure. We used SAMtools (28) to identify SNVs in the RNA-seq dataset. As expected, the tumor harbored more SNVs than the reference brain tissue (Table 4). As with the simulated dataset, the RNASEQR identified fewer SNVs in splice junction regions in GBM samples compared to other tools, especially near the splice sites (Supplementary Figure S3), and hence provides more accurate assessments of variation in junctional sequences (Figure 6). These results suggested that RNASEQR provided more accurate alignment results that gave more reliable transcripts and associated SNVs.

Table 4.

SNVs identified in peripheral brain and glioblastoma tissues

	RNASEQR	ERANGE	MapSplice	SpliceMap	TopHat
Peripheral brain
Coding exon	22 986	19 944	50 714	19 641	21 871
Non-coding exon	1378	1103	3091	1194	4826
Intron (≤5 bp from exon)	82	6209	1637	7984	9850
Intron (>5 bp from exon)	2130	1555	4196	1250	2035
Intergenic region	5626	4077	11 119	9916	13 491
Subtotal	32 202	32 888	70 757	39 985	52 073
Glioblastoma
Coding exon	26 763	24 220	53 429	24 062	26 456
Non-coding exon	1898	1494	3952	1504	5894
Intron (≤5 bp from exon)	104	3994	2607	9756	32 796
Intron (>5 bp from exon)	3697	2911	7109	2146	3333
Intergenic region	7495	9281	15 192	12 396	17 169
Subtotal	39 957	41 900	82 289	49 864	85 648

Figure 6.

Detail sequence alignment result on an exon–intron junction (chr1:220,311,381 to 220,311–391) in IARS2.

Detail sequence alignment result on an exon–intron junction (chr1:220,311,381 to 220,311–391) in IARS2. SNVs identified in peripheral brain and glioblastoma tissues

DISCUSSION

Current NGS technology can produce tens of billions of raw sequence data in a few days in routine operation; this creates an enormous challenge for the efficiency and accuracy of subsequent data analysis. Compared to the size of human genome (∼3 billion base pairs), transcriptomic reference databases are much smaller and range from 40 million base pairs in CCDS to 240 million base pairs in the Ensembl genome browser. The UCSC transcriptomic annotation that is used as default reference for transcriptomic data in RNASEQR is ∼200 million base pairs, 1/15 the size of the human genome. The BWT-based indexing allows large genomic sequences to be searched efficiently in a workstation computer equipped with small memory. RNASEQR also takes advantage of the BWT-based alignment algorithm and parallel computation to shorten the processing time. For a 25 million-read library, RNASEQR takes ∼200 min on a single-thread 2.4 GHz Xeon CPU and 100 min if a four-threaded process is applied (Supplementary Figure S4). It requires as little as 4 GB of computer memory and can efficiently run on a standard desktop computer. As far as we know, the RNASEQR is the fastest pipeline with least hardware requirements for RNA-seq data processing. Mapping RNA-seq sequences to full-length gene transcripts is intuitive and has been discussed elsewhere (33); however, the limitation in using transcriptomic references alone is the inherent inability to identify novel transcripts, exons or alternative splicing events. RNASEQR uses both transcriptomic and genomic references so that it can effectively assess the expression levels of known transcripts and identify novel exons, transcripts and alternative uses of known exons. We set the default transcriptomic reference as the UCSC genome browser transcriptome in RNASEQR. The performance of RNASEQR is little affected by the specific use of the transcriptomic reference. Therefore, other databases could be selected as references for this purpose. The results from some RNA-seq data analysis programs show a preference to map sequences to pseudogenes in the genome (data not shown) because of their high sequence similarity to the functional, spliced protein-coding genes—most of them do not contain intronic sequences. This can lead to inaccurate measurement of gene-expression levels as well as transcript-associated SNVs. The implementation of transcriptomic references in the first step in RNASEQR greatly reduces the problem of ‘preferentially’ aligning the sequences to pesudogenes located in the genome for two reasons: the first is that full-length RNA sequences do not require tolerating sequence gaps in mapping and alignment; the second is that RNA-seq sequences match to their origins better when both coding and pseudogene sequences are provided in the transcriptomic references. To deal with sequences that span more than one exon, ERANGE uses a splice junction library generated from annotated exons in the UCSC Genome browser while MapSplice, SpliceMap and TopHat detect splice junctions with their de novo methods. RNASEQR takes a third route by utilizing the sequences of full-length gene transcripts for the known exon junctions and a hash-based local alignment algorithm to detect novel splice sites. Based on our test, the three-step framework implemented in RNASEQR best explores annotated and novel transcriptomic repertoires, and is also able to identify insertions and deletions (data not shown). Tolerating some mismatched sequences in sequence alignment is necessary because of SNVs, sequencing errors, and even the possibility of RNA editing. However, mismatch allowances should be minimized to avoid false identification of SNVs. Compared to other programs, RNASEQR aligns more sequences with fewer mismatches and uses only full-length RNA-seq sequences to avoid spurious results. Short exons represent another class of genomic features that may affect the performance of RNA-seq data analysis. In Ensembl, 17.5% of the annotated exons were <76 bp, and half of the genes in the human genome have at least one exon <76 bp. Poor mapping of small exons and associated splice junctions will result in an underestimation of expression levels and the overestimation of transcripts with alternative exon usages. However, the calculation for differential expression for abundant transcripts between samples will not be affected because undetected small exons are not observed in all analyzed RNA-seq samples. But underestimation of low abundant transcripts could prevent these transcripts from the downstream differential expression analysis (Supplementary Tables S2–S5). The RNASEQR performed much better than other tools in handling small exon associated sequences and accurately provided complete gene structures (Figure 4A). Sequence variations near splice junctions could have biological implications, for functions such as alternative exon usage. Most, if not all, programs have problems aligning sequences near splice junctions (Supplementary Figure S5). To avoid false identification of SNVs due to mis-alignment, approaches including additional local alignments to remap sequences mapped near splice junctions (34,35) and de novo RNA-seq sequence assembly are used. However, these approaches take significant time and computation resources. RNASEQR adapted anchored-and-align approach to deal with this problem and delivered the lowest number of falsely identified SNVs. Various important biological questions are being addressed by using RNA-seq techniques, such as the assessment of transcriptional regulation (36,37), the identification of novel regulatory RNAs (7), the expression of quantitative trait loci (38,39), and the assessment of allelic expression imbalances (40,41), to name just a few areas. The file output for RNASEQR is compatible with tools such as DEseq (42), DEGseq (43), SAMtools (28), GATK (44), SNVMix (45), MISO (37), Cufflink (36) and Scripture (7) to estimate gene-expression levels, identify transcripts associated SNVs, discover alternative exon usage and assemble complete gene structures. The current version of RNASEQR can read both the color-space and nucleotide sequence formats for both single-end and paired-end sequence analyses, and it can easily be adapted to take the results from future NGS technology, such as single molecular sequencing, for more sophisticated experimental designs.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online: Supplementary Table 1–5, Supplementary Figures 1–5, and Supplementary Methods.

FUNDING

Institute for Systems Biology-University of Luxemburg program; the National Institutes of Health Center for Systems Biology P50 [GM076547]; Republic of China National Science Council [97-2221-E-126-012-MY3]; Republic of China National Health Research Institute [NHRI-EX100-10004NI]; and Chang-Gung Memorial Hospital [CMRPG 380621, CMRPG 392101]. Funding for open access charge: Institute for Systems Biology-University of Luxemburg program. Conflict of interest statement. None declared.

45 in total

1. BLAT--the BLAST-like alignment tool.

Authors: W James Kent
Journal: Genome Res Date: 2002-04 Impact factor: 9.043

2. Genome-wide analysis of single-nucleotide polymorphisms in human expressed sequences.

Authors: K Irizarry; V Kustanovich; C Li; N Brown; S Nelson; W Wong; C J Lee
Journal: Nat Genet Date: 2000-10 Impact factor: 38.330

3. The UCSC Known Genes.

Authors: Fan Hsu; W James Kent; Hiram Clawson; Robert M Kuhn; Mark Diekhans; David Haussler
Journal: Bioinformatics Date: 2006-02-24 Impact factor: 6.937

4. Optimal spliced alignments of short sequence reads.

Authors: Fabio De Bona; Stephan Ossowski; Korbinian Schneeberger; Gunnar Rätsch
Journal: Bioinformatics Date: 2008-08-15 Impact factor: 6.937

5. Mapping and quantifying mammalian transcriptomes by RNA-Seq.

Authors: Ali Mortazavi; Brian A Williams; Kenneth McCue; Lorian Schaeffer; Barbara Wold
Journal: Nat Methods Date: 2008-05-30 Impact factor: 28.547

6. Stem cell transcriptome profiling via massive-scale mRNA sequencing.

Authors: Nicole Cloonan; Alistair R R Forrest; Gabriel Kolle; Brooke B A Gardiner; Geoffrey J Faulkner; Mellissa K Brown; Darrin F Taylor; Anita L Steptoe; Shivangi Wani; Graeme Bethel; Alan J Robertson; Andrew C Perkins; Stephen J Bruce; Clarence C Lee; Swati S Ranade; Heather E Peckham; Jonathan M Manning; Kevin J McKernan; Sean M Grimmond
Journal: Nat Methods Date: 2008-05-30 Impact factor: 28.547

7. Initial assessment of human gene diversity and expression patterns based upon 83 million nucleotides of cDNA sequence.

Authors: M D Adams; A R Kerlavage; R D Fleischmann; R A Fuldner; C J Bult; N H Lee; E F Kirkness; K G Weinstock; J D Gocayne; O White
Journal: Nature Date: 1995-09-28 Impact factor: 49.962

8. Full-length transcriptome assembly from RNA-Seq data without a reference genome.

Authors: Manfred G Grabherr; Brian J Haas; Moran Yassour; Joshua Z Levin; Dawn A Thompson; Ido Amit; Xian Adiconis; Lin Fan; Raktima Raychowdhury; Qiandong Zeng; Zehua Chen; Evan Mauceli; Nir Hacohen; Andreas Gnirke; Nicholas Rhind; Federica di Palma; Bruce W Birren; Chad Nusbaum; Kerstin Lindblad-Toh; Nir Friedman; Aviv Regev
Journal: Nat Biotechnol Date: 2011-05-15 Impact factor: 54.908

9. Alternative isoform regulation in human tissue transcriptomes.

Authors: Eric T Wang; Rickard Sandberg; Shujun Luo; Irina Khrebtukova; Lu Zhang; Christine Mayr; Stephen F Kingsmore; Gary P Schroth; Christopher B Burge
Journal: Nature Date: 2008-11-27 Impact factor: 49.962

10. Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution.

Authors: Brian T Wilhelm; Samuel Marguerat; Stephen Watt; Falk Schubert; Valerie Wood; Ian Goodhead; Christopher J Penkett; Jane Rogers; Jürg Bähler
Journal: Nature Date: 2008-05-18 Impact factor: 49.962

12 in total

1. Reliable identification of genomic variants from RNA-seq data.

Authors: Robert Piskol; Gokul Ramaswami; Jin Billy Li
Journal: Am J Hum Genet Date: 2013-09-26 Impact factor: 11.025

2. Identification of allelic expression imbalance genes in human hepatocellular carcinoma through massively parallel DNA and RNA sequencing.

Authors: Qiudao Wang; Yan An; Qing Yuan; Yao Qi; Ying Ou; Junhui Chen; Jian Huang
Journal: Med Oncol Date: 2016-03-21 Impact factor: 3.064

3. A context-based approach to identify the most likely mapping for RNA-seq experiments.

Authors: Thomas Bonfert; Gergely Csaba; Ralf Zimmer; Caroline C Friedel
Journal: BMC Bioinformatics Date: 2012-04-19 Impact factor: 3.169

4. Mining RNA-seq data for infections and contaminations.

Authors: Thomas Bonfert; Gergely Csaba; Ralf Zimmer; Caroline C Friedel
Journal: PLoS One Date: 2013-09-03 Impact factor: 3.240

5. RNA-Seq identifies key reproductive gene expression alterations in response to cadmium exposure.

Authors: Hanyang Hu; Xing Lu; Xiang Cen; Xiaohua Chen; Feng Li; Shan Zhong
Journal: Biomed Res Int Date: 2014-05-27 Impact factor: 3.411

6. PARRoT- a homology-based strategy to quantify and compare RNA-sequencing from non-model organisms.

Authors: Ruei-Chi Gan; Ting-Wen Chen; Timothy H Wu; Po-Jung Huang; Chi-Ching Lee; Yuan-Ming Yeh; Cheng-Hsun Chiu; Hsien-Da Huang; Petrus Tang
Journal: BMC Bioinformatics Date: 2016-12-22 Impact factor: 3.169

7. A comprehensive next generation sequencing-based virome assessment in brain tissue suggests no major virus - tumor association.

Authors: Michael J Strong; Eugene Blanchard; Zhen Lin; Cindy A Morris; Melody Baddoo; Christopher M Taylor; Marcus L Ware; Erik K Flemington
Journal: Acta Neuropathol Commun Date: 2016-07-11 Impact factor: 7.801