Literature DB >> 22303387

Thousands of Novel Transcripts Identified in Mouse Cerebrum, Testis, and ES Cells Based on ribo-minus RNA Sequencing.

Wanfei Liu¹, Yuhui Zhao, Peng Cui, Qiang Lin, Feng Ding, Chengqi Xin, Xinyu Tan, Shuhui Song, Jun Yu, Songnian Hu.

Abstract

The high-throughput next-generation sequencing technologies provide an excellent opportunity for the detection of less-abundance transcripts that may not be identifiable by previously available techniques. Here, we report a discovery of thousands of novel transcripts (mostly non-coding RNAs) that are expressed in mouse cerebrum, testis, and embryonic stem (ES) cells, through an in-depth analysis of rmRNA-seq data. These transcripts show significant associations with transcriptional start and elongation signals. At the upstream of these transcripts we observed significant enrichment of histone marks (histone H3 lysine 4 trimethylation, H3K4me3), RNAPII binding sites, and cap analysis of gene expression tags that mark transcriptional start sites. Along the length of these transcripts, we also observed enrichment of histone H3 lysine 36 trimethylation (H3K36me3). Moreover, these transcripts show strong purifying selection in their genomic loci, exonic sequences, and promoter regions, implying functional constraints on the evolution of these transcripts. These results define a collection of novel transcripts in the mouse genome and indicate their potential functions in the mouse tissues and cells.

Entities: Chemical Disease Gene Species

Keywords: next-generation sequencing; non-coding RNA; novel transcripts; ribo-minus RNA-seq

Year: 2011 PMID： 22303387 PMCID： PMC3268642 DOI： 10.3389/fgene.2011.00093

Source DB: PubMed Journal: Front Genet ISSN： 1664-8021 Impact factor: 4.599

Introduction

The mammalian transcriptomes are much more complex than what we have been anticipated according to the related research activities over the past decade. Recently, novel transcripts have been continuously identified in mammalian genomes. Bertone et al. (2004) found 10,595 novel transcribed sequences in human liver tissue. Carninci et al. (2005) demonstrated that the majority of the mammalian genome is transcribed and reported 16,247 new mouse protein-coding transcripts. The ENCODE pilot project reported that the human genome is pervasively transcribed and discovered the relationship between transcripts and chromatin accessibility features (Birney et al., 2007). According to the chromatin-state maps, about 1,600 large multi-exonic RNAs were identified by Guttman et al. (2009) in mouse. Cabili et al. (2011) presented an integrative approach and defined >8,000 human lincRNAs. Trapnell et al. (2010) got 3,724 previously un-annotated transcripts in mouse and 62% of them were supported by independent expression data or homologous genes in other species. These novel transcripts are called the “dark matter” RNAs, which include any RNAs whose functions are still unknown (Kapranov et al., 2010). Kapranov et al. (2010) concluded that the “dark matter” RNA can be greater than protein-encoding transcripts and a large number of long non-coding RNA reside in intergenic regions. However, controversial opinions still exist. It has been suggested that most novel transcribed regions are associated with known neighboring gene models. For example, by mapping and quantifying mouse transcriptome using poly(A) selected RNA-seq data, 92% of novel transcription regions can be assigned to their neighboring genes in a recent study (Mortazavi et al., 2008). van Bakel et al. (2010) also concluded that most non-exonic transcribed sequence fragments (seqfrags) probably are indeed partial fragments of pre-mRNA with introns, new exons of known genes in intergenic sequences, or promoter- and terminator-associated transcripts. Clark et al. (2011) and van Bakel et al. (2011) have discussed possible mechanisms of the pervasive transcription and some of the arguments are focused on universality and functionality of these novel transcripts (Jarvis and Robertson, 2011). In addition, studies have suggested that non-coding RNAs are important in transcriptional and post-transcriptional regulations, chromatin-modification, development and diseases, such as cancers (Gupta et al., 2010; Mattick et al., 2010; Glass et al., 2011; Kogo et al., 2011) and indeed fundamental to eukaryotic evolution (Mattick, 2010). Recently, RNA-seq methods, mRNA-based, or ribo-minus (rm) based on the next-generation sequencing technologies, are considered to be more accurate and comprehensive for transcriptome profiling (Wang et al., 2009). They are supreme over other transcriptomic methods, including expressed sequence tag (EST), serial analysis of gene expression (SAGE), and microarray, in dynamic range, sampling depth, and material processing. The methods allow researchers to acquire adequate amount of data to characterize novel transcripts, and moreover, when combined with other complementary data, such as those from cap analysis of gene expression (CAGE), histone modification, and RNAPII, as well as sequence conservation analysis, they provide stronger evidence for identifying novel transcription. In this study, we used publicly available rmRNA-seq data from the mouse cerebrum, testis, and embryonic stem (ES) cells to excavate new transcripts and verify their existence in the mouse genome, with an anticipation that rmRNA-seq data are expected to contribute more coding and non-coding transcripts, which lack polyA tails (Cui et al., 2010). We built a pipeline to identify expressed regions and candidate exons in the entire genome to define novel transcripts through comparison to known transcripts and carried out a combined analysis on relevant public data, including CAGE (Kawaji et al., 2006), histone modifications (H3K4me3, H3K27me3, and H3K36me3) and RNAPII (Mikkelsen et al., 2007), and sequence conservation values (Fujita et al., 2011). We also examined potential functions of these novel transcripts according to their sequence structures and characteristics. We expect to provide useful insights into the “dark matter” of the mouse genome.

Materials and Methods

Datasets

The transcriptome profiling of mouse cerebrum, testis, and ES cells, as well as data for histone modifications (H3K4me3 and H3K27me3) of mouse cerebrum and testis were from NCBI SRA database, SRA039962 and SRX005943, which were produced by our group previously. We also retrieved ChIP-seq data of RNAPII, H3K4me3, H3K27me3, and H3K36me3 from mouse ES cells (Mikkelsen et al., 2007) and 5′ CAGE tags from multiple mouse tissues published by the Fantom3 project (Kawaji et al., 2006). In addition, we obtained conservation scores from the UCSC database (Fujita et al., 2011).

Exon or transcription unit (TU) identification based on rmRNA-seq data

We built an efficient pipeline for TU identification (Figure A1 in Appendix; File S3 in Supplementary Material). First, RNA sequencing reads were mapped to the mouse genome assembly (mm9) by using TopHat (Langmead et al., 2009; Trapnell et al., 2009) and the coverage files were created based on mapping results by using a custom-designed perl script. Second, according to the coverage files, we obtained average coverage of all Refgene introns and set a cutoff value of the coverage to exclude 95% of introns (3, 4, and 7 for cerebrum, ES cells, and testis, respectively). To define the expressed regions, we limited each region to have at least 55-bp consecutive length and all these positions must be equal or above the cutoff value. If the distance of adjacent expressed regions (exons) is equal or smaller than the length of 95 bp (95% intron lengths are larger than 95 bp), we combined the adjacent expressed regions into one. We also revised the boundaries of exons using the split read feature from TopHat. Third, we evaluated the accuracy of exon identification, calculated the average coverage for exons defined in Refgene introns and removed exons whose coverage below the cutoff value. Fourth, we annotated and removed certain exons by comparing our putative exons with several databases (UCSC, ENSEMBL, NONCODE, RNAdb, fRNAdb, Rfam, miRBase, tRNAdb, and ncRNAdb). Fifth, we constructed TUs for exons found in intergenic regions according to the distance between exons, RNAPII signals, and H3K36me3 signals. Sixth, to assess the accuracy of this method, we compared our TUs (by using all exons in intergenic regions) with the Fantom3 RNAs.

Figure A1

A flowchart of gene identification process. We mapped the ribo-minus RNA-seq data using TopHat and created the coverage file for genome and identified exons according to the coverage of each position (>= cutoff value). Since 95% intron lengths are > or =95 bp, we merged small exons (distance < or =95 bp). Moreover, since 95% exon lengths are > or =55, we only keep the exons whose length is equal or larger than 55 bp to reduce false positives. We remove low coverage exons to reduce errors. We also filter known exons and build novel TUs on the basis of H3K36me3, RNAPII, and the different distance of adjacent exons between internal of TUs and adjacent TUs. We evaluate the accuracy of TU building by comparing our TUs with Fantom3 RNAs of intergenic regions.

Identification of enriched intervals of ChIP-seq data

We defined H3K4me3- and H3K27me3-enriched intervals by using SICER program (v1.03; Zang et al., 2009). The parameters were set as follows: (1) 200-bp window, 200-bp gap, and 0.001 for False Discovery Rate of H3K4me3; (2) 200-bp window, 600-bp gap, and 0.001 for False Discovery Rate of H3K27me3. The sequencing reads from a pan-H3 experiment was used as a background control for H3K4me3 and H3K27me3. The H3K36me3 and RNAPII enrichment intervals were downloaded from the website at the Broad Institute (see text footnote 1; Mikkelsen et al., 2007). Chromatin states of exons or TUs were determined based on overlapping regions where H3K4me3, H3K27me3, H3K36me3, and RNAPII are all enriched.

Conservation of exons, TUs, and their promoters

To estimate sequence conservation of exons, TUs, and their promoters, we used conservation scores derived from an alignment of 29-vertebrate-to-mouse genomes from the UCSC database (Fujita et al., 2011). We calculated the conservation score in a 12-bp sliding window with a step length of 1 bp and selected the maximal value as the conservation score. The sequences that have higher conservation scores are more conservative than other sequences.

Correlation between sense and antisense gene expression

We extracted the information for sense–antisense gene pairs and calculated the RPKM value for the sense and antisense genes based on mapping results. We subsequently divided the sense–antisense gene pairs into two portions according to their expression ratios between two samples for the sense and antisense expression. The expression ratio is equal to the sample 2 expression divided by the sample 1 expression. We classified them as positive if the log10 value of both sense and antisense expression ratios are greater or less than zero. Otherwise, we classified them as negative. We correlated the positive and negative types of sense–antisense gene pairs using the expression ratio.

Results

Identification of actively transcribed regions (exons)

We obtained rmRNA-seq data from the mouse cerebrum, testis, and ES cells which were generated based on the SOLiD sequencing platforms using a strand-specific rmRNA-seq method (Cui et al., 2010), and mapped rmRNA-seq reads onto the mouse genome assembly (mm9) using TopHat software (Table A1 in Appendix). Based on the mapped reads, we assessed sequencing saturation according to the increase of read start-points with increasing mapped read (Figure 1). To define actively transcribed regions, we calculated the coverage per nucleotide position and used those positions whose coverage values are equal or larger than the cutoff values (3, 4, and 7 for cerebrum, ES cells, and testis, respectively; also see Materials and Methods for details; Figure 2A). Moreover, due to sequencing bias, we required that each region to have at least 55 bp consecutive sequence above the cutoff value of coverage and the distance of adjacent regions is larger than 95bp (Figures 2C,B). Consequently, we obtained 395,546, 465,149, and 194,996 putative exons in the total in the three libraries, respectively (Table 1; File S1 in Supplementary Material). For assessing the accuracy of exon identification, we compared the defined actively transcribed regions to Refgene exons (Karolchik et al., 2004), and found that most Refgene exons (∼94.12%) have been identified and that the aligned length is up to ∼88.71%. Furthermore, ∼93.81% RefSeq-defined exons are shown to be one-to-one matches (Table A2 in Appendix). These statistics proved the viability of our pipeline for this analysis. Moreover, we found that different samples have different percentages of reads assembled into exons (Table A3 in Appendix). We believe that such variability is related to read length, read coverage, and the cutoff value of the read coverage.

Table A1

The sequence mapping summary of rmRNA-seq data.

Tissue/cell	Total reads	Multiple mapping reads	Unique mapping reads	Average coverage (whole genome)	Average coverage (identified exons)	Mapping percent
Cerebrum	428,434,624	124,991,301	120,041,080	2.19	69.46	29.17
Testis	497,996,641	144,583,797	136,348,798	2.37	55.97	29.03
ES cell	126,791,595	52,476,546	35,829,866	0.67	31.15	41.38

Figure 1

The saturation curve for the number of start-points of mapped reads. x-Axis shows the number of the mapped reads and y-axis displays the start-points number (million) of mapped reads.

Figure 2

Parameters used in exon identification. (A) The cutoff value of coverage in the mouse cerebrum, testis, and ES cells. The cutoff value (blue) of coverage (3, 4, and 7 for cerebrum, ES cells, and testis, respectively) is labeled on the x-axis and the corresponding accumulative frequency (0.95, colored in green) is labeled on the y-axis. (B) The minimal intron length used in exon identification. The value (blue) on x-axis is identified as minimal intron length (95) and the value (green) on y-axis is the corresponding accumulative frequency (0.05). (C) The minimal exon length used in exon identification. The value (blue) on the x-axis is identified as minimal exon length (55) and the value (green) on the y-axis is the corresponding accumulative frequency (0.05).

Table 1

Summary of novel exons identified in our analysis.

Sample	Cerebrum	Testis	ES cell
Identified exons¹	395,546 (105,657,702, 100%)³	465,149 (109,695,106, 100%)	194,996 (28,838,854, 100%)
Refgene exons	106,218 (25,924,734, 24.54%)	98,065 (38,325,083, 34.94%)	84,792 (18,396,077, 63.79%)
Refgene introns	233,775 (33,864,388, 32.95%)	243,879 (32,067,720, 29.23%)	75,426 (7,308,663, 25.34%)
Intergenic regions	69,971 (45,868,580, 43.41%)	135,644 (39,302,303, 35.83%)	43798 (3,134,114, 10.87%)
Refgene introns (filtered)²	33,053 (27,401,823, 25.93%)	28,931 (15,245,705, 13.9%)	10,011 (3,107,050, 10.77%)

Table A2

The evaluation of exon identification.

Sample	RefGene exons¹	Aligned exons (percent)	Percent of aligned length	One-to-one percent
Cerebrum	93,947	87,501(93.14%)	86.13%	93.57
Testis	99,332	93,834(94.47%)	88.66%	95.82
ES cell	82,942	78,580(94.74%)	91.34%	96.80

Table A3

The percentage of exon reads in all mapped reads.

Tissue/cell	Total reads	Identified exon reads	Exon reads percent
Cerebrum	124,991,301	105,657,702	84.53
Testis	144,583,797	109,695,106	75.87
ES cell	524,76,546	28,838,854	54.96

Summary of novel exons identified in our analysis. . The saturation curve for the number of start-points of mapped reads. x-Axis shows the number of the mapped reads and y-axis displays the start-points number (million) of mapped reads. Parameters used in exon identification. (A) The cutoff value of coverage in the mouse cerebrum, testis, and ES cells. The cutoff value (blue) of coverage (3, 4, and 7 for cerebrum, ES cells, and testis, respectively) is labeled on the x-axis and the corresponding accumulative frequency (0.95, colored in green) is labeled on the y-axis. (B) The minimal intron length used in exon identification. The value (blue) on x-axis is identified as minimal intron length (95) and the value (green) on y-axis is the corresponding accumulative frequency (0.05). (C) The minimal exon length used in exon identification. The value (blue) on the x-axis is identified as minimal exon length (55) and the value (green) on the y-axis is the corresponding accumulative frequency (0.05).

Annotation of novel exons

To annotate novel transcripts, we first removed known exons according to the Refgene collection. We then removed all other known exons that have matches to other databases, such as the NCBI nr database (Johnson et al., 2008) based on sequence alignment using the BLAST software packages. The repeat regions of the mouse genome were avoided according to the repeat annotation at UCSC (Fujita et al., 2011). We also built a custom-designed ncRNA database through integrating several databases that include mouse ncRNA data in ENSEMBL (Flicek et al., 2011), UCSC (Fujita et al., 2011), NONCODE (He et al., 2008), RNAdb (Pang et al., 2007), fRNAdb (Kin et al., 2007), ncRNAdb (Szymanski et al., 2007), Rfam (Griffiths-Jones et al., 2005), miRBase (Griffiths-Jones et al., 2006), and tRNAdb (Juhling et al., 2009; Table A4 in Appendix). Moreover, we filtered the newly identified exons of known genes using the split reads from the TopHat result. We also predicted the function of novel exons (unannotated) by comparing them to the Rfam database. Most Rfam-predicted exons are snoRNAs, but some are miRNAs, tRNAs, and snRNAs. Finally, we obtained three sets of putative novel exons (Table 2).

Table A4

The summary of ncRNA database records.

Sub-database	Records
ENSEMBL	8,269
UCSC	1,432
NONCODE	107,090
RNAdb	38,227
fRNAdb	510,055
Rfam	4,253
miRBase	579
tRNAdb	433
ncRNAdb	31,136

Table 2

The exon annotation based on Refgene intron (filtered) and intergenic regions.

Sample	Cerebrum	Testis	ES cell
Total exons¹	103,024 (73,270,403, 100%)²	164,575 (54,548,008, 100%)	53,809 (6,241,164, 100%)
nr	14,333 (7,613,882, 10.39%)	24,343 (19,326,301, 35.43%)	13,556 (1,803,672, 28.90%)
ncRNA	3,933 (2,372,724, 3.24%)	7,862 (4,096,291, 7.50%)	4,547 (97,395, 1.56%)
Repeat	19,116 (15,226,166, 20.78%)	34,015 (11,659,836, 21.38%)	13,109 (971,411, 15.56%)
New exons of known genes	1,073 (46,792, 0.06%)	1,404 (64,663, 0.12%)	1,101 (8,593, 0.14%)
Rfam prediction	207 (4,306,737, 5.90%)	219 (617,575, 1.13%)	104 (15,076, 0.24%)
Remaining	64,357 (43,704,102, 59.63%)	96,607 (18,783,342, 34.44%)	19,790 (3,345,017, 53.60%)

The exon annotation based on Refgene intron (filtered) and intergenic regions. .

Building novel transcription units (TU) in intergenic region

Since well-defined actively transcribed regions exhibit obvious gene structure features, we tried to connect the neighboring active regions into the same transcription units (TUs). When we calculated the distance of adjacent actively transcribed regions, we found that there are two main peaks in the density plots and this feature can be used for building novel TUs (Figure 3). In addition, there is a small peak appeared around 100 bp in length, which is a characteristic of the minimal intron (∼100 bp in length) described in our previous publications (such as Zhu et al., 2010). The first major peak represents the distance of adjacent exons inside TUs and the second major peak is related to the distance of exons between adjacent TUs. Moreover, we downloaded the RNAPII and H3K36me3 data of ES cell, which were used to define the transcription start and the elongation of the transcripts, respectively. We finally constructed TUs for novel exons in intergenic regions according to the information from the distance between exons, RNAPII signals, and H3K36me3 signals, producing 17,931, 18,512, and 6,966 annotated TUs in cerebrum, testis, and ES cells, respectively (File S2 in Supplementary Material).

Figure 3

Density distribution for distances between adjacent exons in intergenic regions. x-Axis shows the distance between adjacent exons and y-axis displays the density. There is a small peak appeared around 100 bp length and this peak implies the minimal introns. In addition, the first main peak represents the distance of general adjacent exons inside TUs and the second main peak is related to the distance of exons between adjacent TUs. To evaluate our processing algorithm, we compared our TUs with the intergenic RNAs annotated by the Fantom3 project. As expected, the one-to-one matching rate is about 95.62%, but the aligned length is a little bit lower, ∼70.99% (Table A5 in Appendix). The reason why the aligned length is not as high as the matching rate is that we may lose some exonic sequences due to their low coverage in the real data. It can be improved when more rmRNA-seq data are added. Nevertheless, the matching rate encouraged us to proceed.

Table A5

The evaluation of TU building based on the Fantom3 RNAs with multiple exons.

Sample	Fantom3 RNA¹	Percent of aligned length	One-to-one percent
Cerebrum	443	71.96	95.49
Testis	605	75.51	94.88
ES cell	143	65.51	96.50

We subsequently compared our intergenic TUs with the intergenic vlinc regions identified by Kapranov et al. (2007) in human. The coordinates of the 580 vlinc RNA domains were transformed from the hg18 to the mm9 version of the mouse genome, and we converted 486 vlinc RNAs successfully. The total matched vlinc RNAs and the total one-to-one matched vlinc RNA are 316 and 278, respectively (Table A6 in Appendix). This result implicated that many intergenic TUs are conserved among mammalian genomes. The one-to-one matching rate between intergenic TUs and vlinc RNAs is lower than what between intergenic TUs and the Fantom3 RNAs, and it may be resulted from expression regulation of intergenic TUs and the evolution of intergenic TUs among different species.

Table A6

The evaluation of TU building based on the vlinc RNA.

Sample	Vlinc RNA	Matching number	Matching percent	One-to-one number	One-to-one percent
Cerebrum	486	236	48.56	156	32.10
Testis	486	215	44.24	157	32.30
ES cell	486	95	19.55	62	12.76

The evidence of novel TUs in intergenic region

To define the function of novel TUs, we examined the distribution of 5′ CAGE tags (Kawaji et al., 2006) and RNAPII (Mikkelsen et al., 2007), histone modifications (H3K4me3, H3K27me3, and H3K36me3; Mikkelsen et al., 2007) around the TUs, and evaluated their sequence conservation value (Fujita et al., 2011). First, we found that there is a significant enrichment of 5′ CAGE tags at the TU start in all three samples (Figure 4A), suggesting that these TUs have 5′ 7-methylguanosine caps and possess transcriptional start sites. Moreover, we investigated the binding of RNAPII within upstream of these TUs using RNAPII data from mouse ES cells and observed an obvious enrichment of RNAPII around their TSS (Figure 5A), suggesting that the TUs have their own promoters for regulating transcriptional initiation. Second, based on ChIP-seq data for the three mouse samples, we examined H3K4me3, H3K27me3, and H3K36me3 statuses around the TUs (Figures 4B–D) and observed that H3K4me3 and H3K27me3 are enrichment at the upstream of the TUs and their densities are correlated well with gene expression. Moreover, H3K36me3 are also enriched across the TUs and marked the transcriptional elongation sites. These lines of evidence suggested that these novel TUs may be indeed independently transcribed in the samples. Finally, we investigated the sequence conservation of the novel TUs (see Materials and Methods) by calculating their conservation scores of the exonic sequences in comparison with Refgene protein exons and random sequences as controls. The conservation scores of the novel TU exons are highly similar to those of the Fantom3 RNA-defined exons (Figure 5B) and similar results were observed in the promoter conservation scores (Figure 5C). Results from both analyses suggest possible functionality of the novel TUs.

Figure 4

5′ CAGE and histone modification around novel TU–TSS or gene bodies in the mouse cerebrum, testis, and ES cell. (A–D) Profiles of 5′ CAGE, H3K4me3, H3K27me3, and H3K36me3.

Figure 5

The RNAPII around novel TU–TSS and sequence conservation of TU exon and promoters. (A) Profile of RNAPII, (B) cumulative distribution of sequence conservation for TU exon, protein exon, Fantom3 RNA exon, and random region, and (C) cumulative distribution of sequence conservation for TU promoter, protein promoter, Fantom3 RNA promoter, and random region.

5′ CAGE and histone modification around novel TU–TSS or gene bodies in the mouse cerebrum, testis, and ES cell. (A–D) Profiles of 5′ CAGE, H3K4me3, H3K27me3, and H3K36me3. The RNAPII around novel TU–TSS and sequence conservation of TU exon and promoters. (A) Profile of RNAPII, (B) cumulative distribution of sequence conservation for TU exon, protein exon, Fantom3 RNA exon, and random region, and (C) cumulative distribution of sequence conservation for TU promoter, protein promoter, Fantom3 RNA promoter, and random region. To illustrate the related characteristics of the novel TUs, we showed an intronic TU and an intergenic TU in Figure A2 in Appendix. Both TUs are significantly expressed in the tissues and cell. Moreover, the exons of both TUs have homologous sequences according to their conservation scores (0 means no conservation and 1 means highly conserved).

Figure A2

A snapshot for TUs in Refgene intron and intergenic region. The upper panel is a TU in an intron of the transmembrane protein gene, Tmem180, and lower panel is a TU adjacent to Sap130 gene. SAP130 is a subunit of the histone deacetylase-dependent SIN3A co-repressor complex which acts as a transcriptional repressor. The TU in plus and minus strands is shown as red and blue horizontal bars, respectively. For each TU, we show RNA expression level (vertical bars in red and blue), identified TU, Refgene, RNAPII signal (green), H3K36me3 signal (purple), and conservation score (yellow).

Classification and function analysis of novel TUs and exons

In order to explore the functionality of the novel TUs, we predicted their protein-coding capability based on PhyloCSF result (Table 3; Lin et al., 2011). We obtained their amino acid sequences (for single exon TUs) according to ORF prediction and aligned the amino acid sequences to the NCBI nr database. We found that most of these protein-coding TUs are similar to either known (such as ribosomal proteins and dehydrogenase) or hypothetical proteins (such as hypothetical and unnamed proteins). For exploring whether these protein-coding TUs are pseudogenes, we compared them to the Vega pseudogene annotations, and only about 8.85% of them are likely to be pseudogenes (Table A7 in Appendix). Moreover, we selected two protein-coding transcripts and predicted their secondary structures. One of them is similar to mouse mCG1041001 protein and is predicted to be extracellularly located. The other is similar to mouse EG382421 protein and possesses nuclear localization sequence.

Table 3

Summary of coding and non-coding exons and TUs.

Sample	Cerebrum	Testis	ES cell
New exons¹	64,357	96,607	19,790
Intron exons	19,986	16,263	3,703
Coding exons²	3,911 (3,625,424)⁶	3,308 (579,368)	1,716 (73,420)
Non-coding exons²	15,973 (2,865,243)	12,866 (1,583,086)	1,952 (108,833)
Unknown exons²	102 (3,526,680)	89 (14,187)	35 (215)
Intergenic exons	44,525	80,489	16,153
Coding exons³	7,440 (4,073,104)	8,717 (535,205)	2,703 (28,257)
Non-coding exons³	36,363 (15,469,822)	69,193 (4,068,501)	12,690 (238,595)
Unknown exons³	722 (150,523)	2,579 (214,720)	760 (2,599)
Intergenic TUs	17,931	18,512	6,966
Coding TUs³	5,441 (4,100,203)	4,794 (1,144,421)	2,005 (54,654)
Non-coding TUs³	11,735 (11,077,673)	12,230 (2,212,637)	4,618 (193,861)
Unknown TUs³	426 (1,286,476)	698 (559,251)	281 (8,639)
Inconsistent TUs^3,4	329 (174,288)	790 (814,202)	62 (8,232)
Modified non-coding TUs⁵	12,445 (11,077,673)	13,199 (2,212,637)	4,963 (193,861)

Table A7

The pseudogenes in novel intergenic TUs.

Sample	Protein-coding TUs in intergenic region	Matched pseudogenes	Matched percent
Cerebrum	5,441	329	6.05
Testis	4,794	278	5.80
ES cell	2,005	295	14.71

Summary of coding and non-coding exons and TUs. . We looked into the antisense regulation of the novel TUs. According to the PhyloCSF prediction, 65.93% of the novel TUs can be defined as non-coding RNAs due to lacking protein-coding characteristics (Table 4). To further examine them, we extracted the antisense RNAs by comparing the location of the TUs to known genes as putative cis-antisense RNAs (ncerebrum = 2,614, ntestis = 2,756, and nES = 732) and their target genes (ncerebrum = 2,324, ntestis = 2,356, and nES = 689). Since previous studies have suggested that sense–antisense gene pairs may play potential regulatory roles (Okada et al., 2008), we clustered the sense–antisense regulated genes using DAVID website (Huang da et al., 2009; Huang et al., 2009) and found that most of these antisense regulated genes are associated with synapse, ion binding/transport, cell junction, cytoskeletal, membrane, and signal transduction in the three samples (S1–S3 in Supplementary Material). Moreover, we found that the genes in cerebrum and testis are related to cardiomyopathy, cancer, endocytosis, cell junction, and signal pathway (S4 and S5 in Supplementary Material). The expression levels of the sense–antisense transcripts are either positively or negatively correlated among different tissues and cell lines (Katayama et al., 2005; Okada et al., 2008). We also compared the sense–antisense expression in a pairwise fashion among the three samples (see Materials and Methods) and found that the antisense expression is either positively or negatively associated with the sense expression (Figure 6). This characteristic is in agreement with previous studies (Katayama et al., 2005; Okada et al., 2008).

Table 4

The classification of ncRNAs in intronic and intergenic regions.

Sample	Cerebrum	Testis	ES cell
Intron RNAs	15,973 (2865243, 100%)³	12,866 (1,583,086, 100%)	1,952 (108,833, 100%)
Antisense RNAs¹	141 (4,347, 0.15%)	156 (12,829, 0.81%)	19 (626, 0.58%)
Small ncRNAs¹	14,196 (1,781,616, 62.18%)	10,483 (764,884, 48.32%)	1,467 (30,496, 28.02%)
Long ncRNAs¹	1,636 (1,079,280, 37.67%)	2,227 (80,5373, 50.87%)	466 (77,711, 71.40%)
Intergenic RNAs	12,445 (11,077,673, 100%)	13,199 (2,212,637, 100%)	4,963 (193,861, 100%)
Antisense RNAs²	2,614 (792,653, 7.16%)	2,756 (440,655, 19.92%)	732 (27,289, 14.08%)
Small ncRNAs²	6,502 (9,716,663, 87.71%)	5,072 (624,312, 28.22%)	2,271 (27,232, 14.05%)
Long ncRNAs²	3,329 (568,357, 5.13%)	5,371 (1,147,670, 51.86%)	1,960 (139,340, 71.87%)

Figure 6

The correlation between sense and antisense expression ratio in sense–antisense gene pairs. Red and green points represent the sense–antisense gene pairs in positive and negative types. “P” stands for the positive type and “N” stands for the negative type.

The classification of ncRNAs in intronic and intergenic regions. . The correlation between sense and antisense expression ratio in sense–antisense gene pairs. Red and green points represent the sense–antisense gene pairs in positive and negative types. “P” stands for the positive type and “N” stands for the negative type. We examined the novel TUs to see if some of them are actually non-coding RNAs. We divided the remaining (non-exonic) novel ncRNAs into long or small ncRNAs according to their sizes. About 43.87% of the remaining ncRNAs are larger than 200 bp in size, which were defined as long ncRNAs. There are 3,329, 5,371, and 1,960 novel long ncRNAs identified in the cerebrum, testis, and ES cells, respectively (Table 4). Comparing our long ncRNAs to lincRNAs identified by Guttman et al. (2009), we found 724 lincRNAs in our three samples, which are accounted for 43.48% of all lincRNAs. There are about 21% of lincRNAs found in each of our samples (ncerebrum = 359, ntestis = 391, and nES = 304). We defined the rest of the ncRNAs as small ncRNAs, ranging from 55 to 200 bp in length; the majority of these small ncRNAs (∼24.04%) are from 55 to 65 bp in size (Figure 7A). This size range of small ncRNAs is related to the insert size of the libraries and data processing parameters. First, we selected RNA fragments in a length range of 50–150 bp for analysis. Second, we filtered the small RNAs whose lengths are less than 55 bp and have overlapping sequences so that some of the smaller RNAs were eliminated in data processing procedures. For these small ncRNAs, we predicted their motifs using MEME software (Bailey and Elkan, 1994) and some conserved motifs were identified (Figures 7B–D), which were accounted for ∼20% of all small ncRNAs. To explore the relationship between conserved motifs and RNA structures, we calculated two distances: one is what between the RNA 5′ end and the motif start and the other is what between the motif end and RNA 3′ end; we did not observe any obvious patterns in the motif distribution (Figure A3 in Appendix). We also compared the novel ncRNAs among the three samples and found that the ratios of the tissue- or cell-specific novel ncRNAs are larger than the ratio of known genes (Figure A4 in Appendix). The biased distribution of the novel ncRNAs indicates their possible functional roles in different tissues or cell types. More ncRNA expression data from a broader tissue spectrum are certainly needed to decipher the functionality of the ncRNAs.

Figure 7

Histograms and motif logo of small RNAs in intergenic regions. (A) the histogram of small RNA length, (B) a motif logo of small RNAs in cerebrum (16.13% of 65 bp small RNAs involved in this motif), (C) a motif logo of small RNA in testis (41.54% of 64 bp small RNAs involved in this motif), and (D) a motif logo of small RNAs in ES cell (25.97% of 56 bp small RNAs involved in this motif).

Figure A3

The distances between (1) motif start and RNA 5′ end and (2) between motif end and RNA 3′ end. The histogram shows the distance between motif start and RNA 5′ end (left), the distance between motif end and RNA 3′ end (middle), and the density of both (right).

Figure A4

Venn diagram of newly identified non-coding TUs among mouse the cerebrum, testis, and ES cells.

Actively transcribed intronic regions

Based on the PhyloCSF prediction (Table 3), we identified about 79.52% ncRNA exons in the intronic regions of the cerebrum and testis, whereas only 52.71% ncRNA exons in the intronic regions of ES cells. Whether most of the predicted protein-coding exons are actually parts of known genes remains to be elucidated. For the analysis of ncRNAs in the intronic regions, we also divided them into three portions: antisense RNAs, small ncRNAs, and long ncRNAs (Table 4). Unlike ncRNAs in intergenic regions, most intronic ncRNAs are small ncRNAs: 88.88, 81.48, and 75.15% in the mouse cerebrum, testis, and ES cell, respectively. Since the intronic expressions are mostly weak and interfered by background expression, more efforts are to be devoted in the future for exploring their functions.

Discussion

In this study, we attempted to identify novel transcripts using rmRNA-seq data from two mouse transcript-rich tissues, the cerebrum and testis, and ES cells. Compared to what generated from polyA-based mRNA-seq method, rmRNA-seq data are expected to harbor more novel transcripts that do not have the polyA tails typical for eukaryotic mRNA (Cui et al., 2010). In addition, we took the advantage of a strand-specific nature of the method, which is readily done using the SOLiD platform and allows us to define sense and antisense transcript pairs of the antisense regulated genes. Using a custom-designed data processing pipeline, we carefully identified several to twenty thousands of novel TUs from different mouse tissues and cells and analyzed their distributions in both intronic and intergenic regions. We also used other supporting evidence from transcriptional initiation and epigenetic signals as well as one of the common evolutionary strategies – sequence conservation. These features helped us to argue for their functional roles in the tissues and cells. Our pipeline is able to recover ∼94.12% Refgene exons (average coverage is equal or larger than the cutoff value) from the dataset and the method is capable of driving mammalian transcriptome annotation to a completion if coupled with a protocol for characterizing even smaller RNAs, such as miRNAs. The annotation of these novel UTs remains challenging. First, when aligning these TUs that are characteristic of amino acid sequences, such as single exons, to sequences in the NCBI nr database, we can readily annotate about 24% of the novel protein-coding TUs. Although some of them are annotated to be structural proteins, such as those similar to ribosomal proteins and house-keeping enzymes, most of them are actually matching to unknown proteins. Second, we identified a large number of ncRNAs, including antisense RNAs, small ncRNAs, and long ncRNAs. According to the analysis on the targeted genes of antisense RNAs, we found that they are associated with synapse, ion binding-transport, cell junction, cytoskeletal, membrane, and signal transduction. Surprisingly, these genes are enriched in disease related pathways, such as cardiomyopathy and cancer. We believe that such enrichment is largely an artifact due to the fields of intensive research activities. In addition, we found that antisense expression is either positively or negatively associated with sense expression of sense–antisense gene pairs. Furthermore, numerous long ncRNAs are identified in intergenic regions, providing a basis for future functional studies. Moreover, we found that the majority of small ncRNAs are in a length range of 55–65 bp in intergenic regions, which may represent a novel class of ncRNAs since conserved motifs were found among the sequences. In addition, most novel exons we found in intronic regions are small ncRNAs of the same size range.

Conclusion

In this study, we identified a large number of novel exons and TUs using three strand-specific rmRNA-seq datasets. We also evaluated the universality and functionality of these novel TUs to demonstrate their features as actively transcribed genes based on an analysis that combines data from transcription start site, histone modification, RNAPII binding site, and sequence conservation. Our efforts in annotating these novel TUs revealed their possible functional features, resembling sequences of protein-coding and sense–antisense regulated genes as well as long and small ncRNAs. This study also provides a practical approach for the identification of most, if not all, genes of mammalian genomes.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Supplementary Material

The Supplementary Material for this article can be found online at http://www.frontiersin.org/non-coding_rna/10.3389/fgene.2011.00093/abstract Functional classification of cerebrum-associated genes that have antisense RNA. Click here for additional data file. Functional classification of testis-associated genes that have antisense RNA. Click here for additional data file. Functional classification of ES cell-associated genes that have antisense RNA. Click here for additional data file. KEGG pathways of the cerebrum-associated genes that have antisense RNA. Click here for additional data file. KEGG pathways of testis-associated genes that have antisense RNA. Click here for additional data file. Putatively identified exons in the mouse cerebrum, testis, and ES cells. These files have five columns. They are chromosome, strand, type (“modified” or “no-modified” for distinguishing whether it was revised using split reads), start position and end position. Click here for additional data file. Putatively identified TUs in the mouse cerebrum, testis, and ES cells. Click here for additional data file. The main step of identification of exons and TUs. Click here for additional data file.

42 in total

Review 1. Non-coding RNAs: regulators of disease.

Authors: Ryan J Taft; Ken C Pang; Timothy R Mercer; Marcel Dinger; John S Mattick
Journal: J Pathol Date: 2010-01 Impact factor: 7.996

2. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses.

Authors: Moran N Cabili; Cole Trapnell; Loyal Goff; Magdalena Koziol; Barbara Tazon-Vega; Aviv Regev; John L Rinn
Journal: Genes Dev Date: 2011-09-02 Impact factor: 11.361

3. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project.

Authors: Ewan Birney; John A Stamatoyannopoulos; Anindya Dutta; Roderic Guigó; Thomas R Gingeras; Elliott H Margulies; Zhiping Weng; Michael Snyder; Emmanouil T Dermitzakis; Robert E Thurman; Michael S Kuehn; Christopher M Taylor; Shane Neph; Christoph M Koch; Saurabh Asthana; Ankit Malhotra; Ivan Adzhubei; Jason A Greenbaum; Robert M Andrews; Paul Flicek; Patrick J Boyle; Hua Cao; Nigel P Carter; Gayle K Clelland; Sean Davis; Nathan Day; Pawandeep Dhami; Shane C Dillon; Michael O Dorschner; Heike Fiegler; Paul G Giresi; Jeff Goldy; Michael Hawrylycz; Andrew Haydock; Richard Humbert; Keith D James; Brett E Johnson; Ericka M Johnson; Tristan T Frum; Elizabeth R Rosenzweig; Neerja Karnani; Kirsten Lee; Gregory C Lefebvre; Patrick A Navas; Fidencio Neri; Stephen C J Parker; Peter J Sabo; Richard Sandstrom; Anthony Shafer; David Vetrie; Molly Weaver; Sarah Wilcox; Man Yu; Francis S Collins; Job Dekker; Jason D Lieb; Thomas D Tullius; Gregory E Crawford; Shamil Sunyaev; William S Noble; Ian Dunham; France Denoeud; Alexandre Reymond; Philipp Kapranov; Joel Rozowsky; Deyou Zheng; Robert Castelo; Adam Frankish; Jennifer Harrow; Srinka Ghosh; Albin Sandelin; Ivo L Hofacker; Robert Baertsch; Damian Keefe; Sujit Dike; Jill Cheng; Heather A Hirsch; Edward A Sekinger; Julien Lagarde; Josep F Abril; Atif Shahab; Christoph Flamm; Claudia Fried; Jörg Hackermüller; Jana Hertel; Manja Lindemeyer; Kristin Missal; Andrea Tanzer; Stefan Washietl; Jan Korbel; Olof Emanuelsson; Jakob S Pedersen; Nancy Holroyd; Ruth Taylor; David Swarbreck; Nicholas Matthews; Mark C Dickson; Daryl J Thomas; Matthew T Weirauch; James Gilbert; Jorg Drenkow; Ian Bell; XiaoDong Zhao; K G Srinivasan; Wing-Kin Sung; Hong Sain Ooi; Kuo Ping Chiu; Sylvain Foissac; Tyler Alioto; Michael Brent; Lior Pachter; Michael L Tress; Alfonso Valencia; Siew Woh Choo; Chiou Yu Choo; Catherine Ucla; Caroline Manzano; Carine Wyss; Evelyn Cheung; Taane G Clark; James B Brown; Madhavan Ganesh; Sandeep Patel; Hari Tammana; Jacqueline Chrast; Charlotte N Henrichsen; Chikatoshi Kai; Jun Kawai; Ugrappa Nagalakshmi; Jiaqian Wu; Zheng Lian; Jin Lian; Peter Newburger; Xueqing Zhang; Peter Bickel; John S Mattick; Piero Carninci; Yoshihide Hayashizaki; Sherman Weissman; Tim Hubbard; Richard M Myers; Jane Rogers; Peter F Stadler; Todd M Lowe; Chia-Lin Wei; Yijun Ruan; Kevin Struhl; Mark Gerstein; Stylianos E Antonarakis; Yutao Fu; Eric D Green; Ulaş Karaöz; Adam Siepel; James Taylor; Laura A Liefer; Kris A Wetterstrand; Peter J Good; Elise A Feingold; Mark S Guyer; Gregory M Cooper; George Asimenos; Colin N Dewey; Minmei Hou; Sergey Nikolaev; Juan I Montoya-Burgos; Ari Löytynoja; Simon Whelan; Fabio Pardi; Tim Massingham; Haiyan Huang; Nancy R Zhang; Ian Holmes; James C Mullikin; Abel Ureta-Vidal; Benedict Paten; Michael Seringhaus; Deanna Church; Kate Rosenbloom; W James Kent; Eric A Stone; Serafim Batzoglou; Nick Goldman; Ross C Hardison; David Haussler; Webb Miller; Arend Sidow; Nathan D Trinklein; Zhengdong D Zhang; Leah Barrera; Rhona Stuart; David C King; Adam Ameur; Stefan Enroth; Mark C Bieda; Jonghwan Kim; Akshay A Bhinge; Nan Jiang; Jun Liu; Fei Yao; Vinsensius B Vega; Charlie W H Lee; Patrick Ng; Atif Shahab; Annie Yang; Zarmik Moqtaderi; Zhou Zhu; Xiaoqin Xu; Sharon Squazzo; Matthew J Oberley; David Inman; Michael A Singer; Todd A Richmond; Kyle J Munn; Alvaro Rada-Iglesias; Ola Wallerman; Jan Komorowski; Joanna C Fowler; Phillippe Couttet; Alexander W Bruce; Oliver M Dovey; Peter D Ellis; Cordelia F Langford; David A Nix; Ghia Euskirchen; Stephen Hartman; Alexander E Urban; Peter Kraus; Sara Van Calcar; Nate Heintzman; Tae Hoon Kim; Kun Wang; Chunxu Qu; Gary Hon; Rosa Luna; Christopher K Glass; M Geoff Rosenfeld; Shelley Force Aldred; Sara J Cooper; Anason Halees; Jane M Lin; Hennady P Shulha; Xiaoling Zhang; Mousheng Xu; Jaafar N S Haidar; Yong Yu; Yijun Ruan; Vishwanath R Iyer; Roland D Green; Claes Wadelius; Peggy J Farnham; Bing Ren; Rachel A Harte; Angie S Hinrichs; Heather Trumbower; Hiram Clawson; Jennifer Hillman-Jackson; Ann S Zweig; Kayla Smith; Archana Thakkapallayil; Galt Barber; Robert M Kuhn; Donna Karolchik; Lluis Armengol; Christine P Bird; Paul I W de Bakker; Andrew D Kern; Nuria Lopez-Bigas; Joel D Martin; Barbara E Stranger; Abigail Woodroffe; Eugene Davydov; Antigone Dimas; Eduardo Eyras; Ingileif B Hallgrímsdóttir; Julian Huppert; Michael C Zody; Gonçalo R Abecasis; Xavier Estivill; Gerard G Bouffard; Xiaobin Guan; Nancy F Hansen; Jacquelyn R Idol; Valerie V B Maduro; Baishali Maskeri; Jennifer C McDowell; Morgan Park; Pamela J Thomas; Alice C Young; Robert W Blakesley; Donna M Muzny; Erica Sodergren; David A Wheeler; Kim C Worley; Huaiyang Jiang; George M Weinstock; Richard A Gibbs; Tina Graves; Robert Fulton; Elaine R Mardis; Richard K Wilson; Michele Clamp; James Cuff; Sante Gnerre; David B Jaffe; Jean L Chang; Kerstin Lindblad-Toh; Eric S Lander; Maxim Koriabine; Mikhail Nefedov; Kazutoyo Osoegawa; Yuko Yoshinaga; Baoli Zhu; Pieter J de Jong
Journal: Nature Date: 2007-06-14 Impact factor: 49.962

Review 4. RNA-Seq: a revolutionary tool for transcriptomics.

Authors: Zhong Wang; Mark Gerstein; Michael Snyder
Journal: Nat Rev Genet Date: 2009-01 Impact factor: 53.242

5. The transcriptional landscape of the mammalian genome.

Authors: P Carninci; T Kasukawa; S Katayama; J Gough; M C Frith; N Maeda; R Oyama; T Ravasi; B Lenhard; C Wells; R Kodzius; K Shimokawa; V B Bajic; S E Brenner; S Batalov; A R R Forrest; M Zavolan; M J Davis; L G Wilming; V Aidinis; J E Allen; A Ambesi-Impiombato; R Apweiler; R N Aturaliya; T L Bailey; M Bansal; L Baxter; K W Beisel; T Bersano; H Bono; A M Chalk; K P Chiu; V Choudhary; A Christoffels; D R Clutterbuck; M L Crowe; E Dalla; B P Dalrymple; B de Bono; G Della Gatta; D di Bernardo; T Down; P Engstrom; M Fagiolini; G Faulkner; C F Fletcher; T Fukushima; M Furuno; S Futaki; M Gariboldi; P Georgii-Hemming; T R Gingeras; T Gojobori; R E Green; S Gustincich; M Harbers; Y Hayashi; T K Hensch; N Hirokawa; D Hill; L Huminiecki; M Iacono; K Ikeo; A Iwama; T Ishikawa; M Jakt; A Kanapin; M Katoh; Y Kawasawa; J Kelso; H Kitamura; H Kitano; G Kollias; S P T Krishnan; A Kruger; S K Kummerfeld; I V Kurochkin; L F Lareau; D Lazarevic; L Lipovich; J Liu; S Liuni; S McWilliam; M Madan Babu; M Madera; L Marchionni; H Matsuda; S Matsuzawa; H Miki; F Mignone; S Miyake; K Morris; S Mottagui-Tabar; N Mulder; N Nakano; H Nakauchi; P Ng; R Nilsson; S Nishiguchi; S Nishikawa; F Nori; O Ohara; Y Okazaki; V Orlando; K C Pang; W J Pavan; G Pavesi; G Pesole; N Petrovsky; S Piazza; J Reed; J F Reid; B Z Ring; M Ringwald; B Rost; Y Ruan; S L Salzberg; A Sandelin; C Schneider; C Schönbach; K Sekiguchi; C A M Semple; S Seno; L Sessa; Y Sheng; Y Shibata; H Shimada; K Shimada; D Silva; B Sinclair; S Sperling; E Stupka; K Sugiura; R Sultana; Y Takenaka; K Taki; K Tammoja; S L Tan; S Tang; M S Taylor; J Tegner; S A Teichmann; H R Ueda; E van Nimwegen; R Verardo; C L Wei; K Yagi; H Yamanishi; E Zabarovsky; S Zhu; A Zimmer; W Hide; C Bult; S M Grimmond; R D Teasdale; E T Liu; V Brusic; J Quackenbush; C Wahlestedt; J S Mattick; D A Hume; C Kai; D Sasaki; Y Tomaru; S Fukuda; M Kanamori-Katayama; M Suzuki; J Aoki; T Arakawa; J Iida; K Imamura; M Itoh; T Kato; H Kawaji; N Kawagashira; T Kawashima; M Kojima; S Kondo; H Konno; K Nakano; N Ninomiya; T Nishio; M Okada; C Plessy; K Shibata; T Shiraki; S Suzuki; M Tagami; K Waki; A Watahiki; Y Okamura-Oho; H Suzuki; J Kawai; Y Hayashizaki
Journal: Science Date: 2005-09-02 Impact factor: 47.728

6. Ensembl 2011.

Authors: Paul Flicek; M Ridwan Amode; Daniel Barrell; Kathryn Beal; Simon Brent; Yuan Chen; Peter Clapham; Guy Coates; Susan Fairley; Stephen Fitzgerald; Leo Gordon; Maurice Hendrix; Thibaut Hourlier; Nathan Johnson; Andreas Kähäri; Damian Keefe; Stephen Keenan; Rhoda Kinsella; Felix Kokocinski; Eugene Kulesha; Pontus Larsson; Ian Longden; William McLaren; Bert Overduin; Bethan Pritchard; Harpreet Singh Riat; Daniel Rios; Graham R S Ritchie; Magali Ruffier; Michael Schuster; Daniel Sobral; Giulietta Spudich; Y Amy Tang; Stephen Trevanion; Jana Vandrovcova; Albert J Vilella; Simon White; Steven P Wilder; Amonida Zadissa; Jorge Zamora; Bronwen L Aken; Ewan Birney; Fiona Cunningham; Ian Dunham; Richard Durbin; Xosé M Fernández-Suarez; Javier Herrero; Tim J P Hubbard; Anne Parker; Glenn Proctor; Jan Vogel; Stephen M J Searle
Journal: Nucleic Acids Res Date: 2010-11-02 Impact factor: 16.971

7. The majority of total nuclear-encoded non-ribosomal RNA in a human cell is 'dark matter' un-annotated RNA.

Authors: Philipp Kapranov; Georges St Laurent; Tal Raz; Fatih Ozsolak; C Patrick Reynolds; Poul H B Sorensen; Gregory Reaman; Patrice Milos; Robert J Arceci; John F Thompson; Timothy J Triche
Journal: BMC Biol Date: 2010-12-21 Impact factor: 7.431

8. The reality of pervasive transcription.

Authors: Michael B Clark; Paulo P Amaral; Felix J Schlesinger; Marcel E Dinger; Ryan J Taft; John L Rinn; Chris P Ponting; Peter F Stadler; Kevin V Morris; Antonin Morillon; Joel S Rozowsky; Mark B Gerstein; Claes Wahlestedt; Yoshihide Hayashizaki; Piero Carninci; Thomas R Gingeras; John S Mattick
Journal: PLoS Biol Date: 2011-07-12 Impact factor: 8.029

9. The UCSC Genome Browser database: update 2011.

Authors: Pauline A Fujita; Brooke Rhead; Ann S Zweig; Angie S Hinrichs; Donna Karolchik; Melissa S Cline; Mary Goldman; Galt P Barber; Hiram Clawson; Antonio Coelho; Mark Diekhans; Timothy R Dreszer; Belinda M Giardine; Rachel A Harte; Jennifer Hillman-Jackson; Fan Hsu; Vanessa Kirkup; Robert M Kuhn; Katrina Learned; Chin H Li; Laurence R Meyer; Andy Pohl; Brian J Raney; Kate R Rosenbloom; Kayla E Smith; David Haussler; W James Kent
Journal: Nucleic Acids Res Date: 2010-10-18 Impact factor: 16.971

10. NCBI BLAST: a better web interface.

Authors: Mark Johnson; Irena Zaretskaya; Yan Raytselis; Yuri Merezhuk; Scott McGinnis; Thomas L Madden
Journal: Nucleic Acids Res Date: 2008-04-24 Impact factor: 16.971

9 in total

1. Developmental transcriptome analysis of human erythropoiesis.

Authors: Lihong Shi; Yu-Hsuan Lin; M C Sierant; Fan Zhu; Shuaiying Cui; Yuanfang Guan; Maureen A Sartor; Osamu Tanabe; Kim-Chew Lim; James Douglas Engel
Journal: Hum Mol Genet Date: 2014-04-29 Impact factor: 6.150

2. RNF17 blocks promiscuous activity of PIWI proteins in mouse testes.

Authors: Kaja A Wasik; Oliver H Tam; Simon R Knott; Ilaria Falciatori; Molly Hammell; Vasily V Vagin; Gregory J Hannon
Journal: Genes Dev Date: 2015-06-26 Impact factor: 11.361

3. Developmental analysis of spliceosomal snRNA isoform expression.

Authors: Zhipeng Lu; A Gregory Matera
Journal: G3 (Bethesda) Date: 2014-11-21 Impact factor: 3.154

4. Ribogenomics: the science and knowledge of RNA.

Authors: Jiayan Wu; Jingfa Xiao; Zhang Zhang; Xumin Wang; Songnian Hu; Jun Yu
Journal: Genomics Proteomics Bioinformatics Date: 2014-04-24 Impact factor: 7.691

5. Comparative analyses of H3K4 and H3K27 trimethylations between the mouse cerebrum and testis.

Authors: Peng Cui; Wanfei Liu; Yuhui Zhao; Qiang Lin; Daoyong Zhang; Feng Ding; Chengqi Xin; Zhang Zhang; Shuhui Song; Fanglin Sun; Jun Yu; Songnian Hu
Journal: Genomics Proteomics Bioinformatics Date: 2012-06-09 Impact factor: 7.691

6. Challenges to the common dogma.

Authors: Jun Yu
Journal: Genomics Proteomics Bioinformatics Date: 2012-06-09 Impact factor: 7.691

7. Improvement of genome assembly completeness and identification of novel full-length protein-coding genes by RNA-seq in the giant panda genome.

Authors: Meili Chen; Yibo Hu; Jingxing Liu; Qi Wu; Chenglin Zhang; Jun Yu; Jingfa Xiao; Fuwen Wei; Jiayan Wu
Journal: Sci Rep Date: 2015-12-11 Impact factor: 4.379

8. Life on two tracks.

Authors: Jun Yu
Journal: Genomics Proteomics Bioinformatics Date: 2012-06-23 Impact factor: 7.691

9. The pendulum model for genome compositional dynamics: from the four nucleotides to the twenty amino acids.

Authors: Zhang Zhang; Jun Yu
Journal: Genomics Proteomics Bioinformatics Date: 2012-08-11 Impact factor: 7.691

9 in total