Landscape of transcription in human cells.

Sarah Djebali¹, Carrie A Davis, Angelika Merkel, Alex Dobin, Timo Lassmann, Ali Mortazavi, Andrea Tanzer, Julien Lagarde, Wei Lin, Felix Schlesinger, Chenghai Xue, Georgi K Marinov, Jainab Khatun, Brian A Williams, Chris Zaleski, Joel Rozowsky, Maik Röder, Felix Kokocinski, Rehab F Abdelhamid, Tyler Alioto, Igor Antoshechkin, Michael T Baer, Nadav S Bar, Philippe Batut, Kimberly Bell, Ian Bell, Sudipto Chakrabortty, Xian Chen, Jacqueline Chrast, Joao Curado, Thomas Derrien, Jorg Drenkow, Erica Dumais, Jacqueline Dumais, Radha Duttagupta, Emilie Falconnet, Meagan Fastuca, Kata Fejes-Toth, Pedro Ferreira, Sylvain Foissac, Melissa J Fullwood, Hui Gao, David Gonzalez, Assaf Gordon, Harsha Gunawardena, Cedric Howald, Sonali Jha, Rory Johnson, Philipp Kapranov, Brandon King, Colin Kingswood, Oscar J Luo, Eddie Park, Kimberly Persaud, Jonathan B Preall, Paolo Ribeca, Brian Risk, Daniel Robyr, Michael Sammeth, Lorian Schaffer, Lei-Hoon See, Atif Shahab, Jorgen Skancke, Ana Maria Suzuki, Hazuki Takahashi, Hagen Tilgner, Diane Trout, Nathalie Walters, Huaien Wang, John Wrobel, Yanbao Yu, Xiaoan Ruan, Yoshihide Hayashizaki, Jennifer Harrow, Mark Gerstein, Tim Hubbard, Alexandre Reymond, Stylianos E Antonarakis, Gregory Hannon, Morgan C Giddings, Yijun Ruan, Barbara Wold, Piero Carninci, Roderic Guigó, Thomas R Gingeras.

Abstract

Eukaryotic cells make many types of primary and processed RNAs that are found either in specific subcellular compartments or throughout the cells. A complete catalogue of these RNAs is not yet available and their characteristic subcellular localizations are also poorly understood. Because RNA represents the direct output of the genetic information encoded by genomes and a significant proportion of a cell's regulatory capabilities are focused on its synthesis, processing, transport, modification and translation, the generation of such a catalogue is crucial for understanding genome function. Here we report evidence that three-quarters of the human genome is capable of being transcribed, as well as observations about the range and levels of expression, localization, processing fates, regulatory regions and modifications of almost all currently annotated and thousands of previously unannotated RNAs. These observations, taken together, prompt a redefinition of the concept of a gene.

Year: 2012 PMID: 22955620 PMCID: PMC3684276 DOI: 10.1038/nature11233

Source DB: PubMed Journal: Nature ISSN: 0028-0836 Impact factor: 49.962

As the technologies for RNA profiling and for cell type isolation and culture continue to improve, the catalogue of RNA types has grown and led to an increased appreciation for the numerous biological roles played by RNA, arguably putting RNAs on par with proteins in functional importance[1]. The Encyclopedia of DNA Elements (ENCODE) project has sought to catalogue the repertoire of RNAs produced by human cells as part of the intended goal of identifying and characterizing the functional elements present in the human genome sequence[2]. The pilot phase of the ENCODE project[3] examined approximately 1% of the human genome and observed that the gene-rich and gene-poor regions were pervasively transcribed, confirming results of prior studies[4,5]. During the second phase of the ENCODE project, the scope of examination was broadened to interrogate the complete human genome. Thus, we have sought both to provide a genome-wide catalogue of human transcripts and to identify the sub-cellular localization of the RNAs produced. Here we report identification and characterization of annotated and novel RNAs that are enriched in either of the two major cellular sub-compartments (nucleus and cytosol) for all 15 cell lines studied, and in three additional sub-nuclear compartments in one cell line. In addition, we have sought to determine whether identified transcripts are modified at their 5′ and 3′ termini by the presence of a 7-methyl guanosine cap or polyadenylation, respectively. We further studied primary transcript and processed product relationships for a large proportion of the previously annotated long and small RNAs. These results considerably extend the current genome-wide annotated catalogue of long polyadenylated and small RNAs collected by the Gencode annotation group[6-8].
Taken together, our genome-wide compilation of subcellular localized and product-precursor related RNAs serves as a public resource and reveals new and detailed facets of the RNA landscape: Cumulatively, we observed a total of 62.1% and 74.7% of the human genome to be covered by either processed or primary transcripts respectively, with no cell line showing more than 56.7% of the union of the expressed transcriptomes across all cell lines. The consequent reduction in the length of “intergenic regions” leads to a significant overlapping of neighboring gene regions and prompts a redefinition of a gene. Isoform expression by gene does not follow a minimalistic expression strategy, resulting in a tendency for genes to express many isoforms simultaneously with a plateau at about 10-12 expressed isoforms per gene per cell line. Cell type-specific enhancers are promoters that are differentiable from other regulatory regions by the presence of novel RNA transcripts, chromatin marks and DNase I hypersensitive sites. Coding and non-coding transcripts are predominantly localized in the cytosol and nucleus respectively, with a range of expression spanning six orders of magnitude for polyadenylated RNAs, and five orders of magnitude for non-polyadenylated RNAs. Approximately 6% of all annotated coding and non-coding transcripts overlap with small RNAs and are likely precursors to these small RNAs. The sub-cellular localization of both annotated and unannotated short RNAs is highly specific.

RNA dataset generation

We performed sub-cellular compartment fractionation (whole cell, nucleus and cytosol) prior to RNA isolation in 15 cell lines (Table S1) to deeply interrogate the human transcriptome. For the K562 cell line, we also performed additional nuclear sub-fractionation into chromatin, nucleoplasm and nucleoli. The RNAs from each of these sub-compartments were prepared in replicate and were separated based on length into >200 nucleotides (nt) (long) and <200 nt (short). Long RNAs were further fractionated into polyadenylated and non-polyadenylated transcripts. A number of complementary technologies were employed to characterize these RNA fractions as to their sequence (RNA-seq), sites of initiation of transcription (Cap Analysis of Gene Expression, CAGE[9]) and sites of 5′ and 3′ transcript termini (Paired-End Tags, PET[10]; Figure S1). Sequence reads were mapped and post-processed using a variety of software tools (Table S2, Figure S2). We used the mapped data to assemble and quantify de novo elements (exons, transcripts, genes, contigs, splice junctions and transcription start sites, TSS) as well as to quantify annotated Gencode (v7) elements. Elements and quantifications were further assessed for reproducibility between replicates using a non-parametric version (npIDR, Supplementary Material) of the Irreproducible Discovery Rate (IDR) statistical test[11]. Only elements deemed to be reproducible with at least 90% likelihood were used in most analyses. The raw data, mapped data and elements were then made available by the ENCODE Data Coordination Center or DCC (http://genome.ucsc.edu/ENCODE/dataSummary.html) (Figure S2). These data, as well as additional data on all intermediate processing steps, are available on the RNA Dashboard: http://genome.crg.cat/encode_RNA_dashboard/.

Long RNA expression landscape

Detection of annotated and novel transcripts

The Gencode gene (Figure S3a) and transcript (Figure S3b) reference annotation[8] captures our current understanding of the polyadenylated human transcriptome. In the samples interrogated here, we cumulatively detected 70% of annotated splice junctions, transcripts, and genes (Figure 1, and Table 1.1). We also detected approximately 85% of annotated exons with an average coverage by RNA-seq contigs of 96%. The variation in the proportion of detected elements among cell lines was small (Figure 1, width of box plots). Consistent with earlier studies, most annotated elements are present in both polyadenylated (Table S3a) and non-polyadenylated (Table S3b) samples[12-15]. Only a small proportion of Gencode elements (0.4% of exons, 2.8% of splice sites, 3.3% of transcripts and 4.7% of genes) are detected exclusively in the non-polyadenylated RNA fraction.

Figure 1

A large majority of Gencode elements are detected by RNA-seq data

Shown are Gencode detected elements in the polyadenylated and non-polyadenylated fractions of cellular compartments (cumulative counts for both RNA fractions and compartments refer to elements present in any of the fractions or compartments). Each box plot is generated from values across all cell lines, thus capturing the dispersion across cell lines. The largest point shows the cumulative value over all cell lines.

Table 1

Long polyadenylated and non-polyadenylated RNAs

1. Expression of Gencode (v7) annotated elements
Gene type	Detected exons² (annotation #)	Detected splice junctions² (annotation #)	Detected transcripts² (annotation #)	Detected genes² (annotation #)	Exon nucleotide coverage³ (%)	Number of genes expressed in at least one cell line	Number of genes expressed in only 1 cell line	Proportion over genes expressed (%)	Number of genes expressed in 14 cell lines	Proportion over genes expressed (%)
Long noncoding	22,381 (41,467)	8,017 (26,872)	6,521 (14,880)	5,906 (9,277)	87.5	5,906	1,386	23.5	631	10.7
Protein coding	288,322 (318,514)	194,752 (244,158)	59,822 (76,006)	18,939 (20,679)	98.1	18,939	1,082	5.7	10,571	55.8
Other¹	102,000 (133,937)	19,277 (47,663)	45,410 (71,113)	10,649 (21,750)	95.2	10,649	2,453	23.0	1,896	17.8
Total annotated	412,703 (493,918)	222,046 (318,693)	111,753 (161,999)	35,494 (51,706)	96.7	35,494	4,921	13.9	13,098	37.0

¹ includes pseudogenes, miRNAs, etc.

² all elements that passed npIDR (0.1)

³ cumulative detected nucleotides in detected exons / total nucleotides in detected exons

Beyond the Gencode annotated elements, we observed a substantial number of novel elements represented by reproducible RNA-seq contigs. These novel elements covered 78% of the intronic nucleotides and 34% of the intergenic sequences (Figure S4). Overall, the unique contribution of each cell line to the coverage of the genome tends to be small and similar for each cell line (Figure S5). Using the Cufflinks algorithm (see Supplementary Material), we predicted, over all long RNA-seq samples, 94,800 exons, 69,052 splice junctions, 73,325 transcripts and 41,204 genes in intergenic and antisense regions (Table 1.2). These novel elements increase the Gencode collection of exons, splice sites, transcripts and genes by 19%, 22%, 45% and 80% respectively. The increase in the number of genes and the relatively low contribution of novel splice sites is primarily caused by the detection of both polyadenylated and non-polyadenylated mono-exonic transcripts (Table S3). Detection of unspliced transcripts could partially be an artifact, caused by low levels of DNA contamination or by incomplete determination of transcript structures. Independent validation of multi-exonic transcript models and the associated predicted coding products was carried out using overlapping targeted 454 Life Sciences (Roche) paired-end reads and mass spectrometry. Of approximately 3,000 intergenic and antisense transcript models tested, validation rates from 70 to 90% were observed, depending on the number of reads and IDR score. In addition, these experiments led to the identification of more than 22,000 novel splice sites not previously detected, meaning an almost 8-fold increase in detection compared to the sites originally detected with RNA-seq (Figure S6). Using mass spectrometric analyses, we investigated what fraction of the novel Cufflinks transcript models show evidence consistent with protein expression.
We produced 998,570 spectra from two cell lines (K562 and GM12878, for details see Khatun et al.[16]), and mapped them to a 3-frame translation of the novel Cufflinks models (Supplementary Material). At a 1% false discovery rate (FDR), we identified 419 novel models with 5 or more spectral and/or 2 or more peptide hits, of which only 56 were intergenic or antisense to Gencode genes (Table S4 and Figure S7). Thus, most novel transcripts appear to lack protein coding capacity.
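The percentage increases quoted for the novel Cufflinks elements follow directly from the counts reported in the text and in the "Total annotated" row of Table 1.1. The short Python sketch below is only an arithmetic check; the variable and function names are ours and are not part of the study's pipeline.

```python
# Gencode v7 annotated totals (Table 1.1, "Total annotated" row).
annotated = {
    "exons": 493_918,
    "splice junctions": 318_693,
    "transcripts": 161_999,
    "genes": 51_706,
}

# Novel intergenic/antisense elements predicted with Cufflinks (counts from the text).
novel = {
    "exons": 94_800,
    "splice junctions": 69_052,
    "transcripts": 73_325,
    "genes": 41_204,
}


def percent_increase(novel_count, annotated_count):
    """Novel elements expressed as a percentage of the annotated collection."""
    return 100.0 * novel_count / annotated_count


# Rounded to whole percentages, as quoted in the text.
increase = {k: round(percent_increase(novel[k], annotated[k])) for k in annotated}
print(increase)
```

Rounded to whole numbers, the four values come out as 19, 22, 45 and 80, matching the figures given above.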

The transcriptome of nuclear sub-compartments

For the K562 cell line, we also analyzed RNA isolated from three sub-nuclear compartments (chromatin, nucleolus and nucleoplasm, Table S5). Almost half (18,330) of the Gencode (v7) annotated genes detected for all 15 cell lines (35,494) were identified in the analysis of just these three nuclear sub-compartments. In addition, there were as many novel unannotated genes found in K562 sub-compartments as there were in all other datasets combined (Table S5 vs. Table 1.2). For all annotated (Table S5.1) or novel (Table S5.2) elements, only a small fraction in each sub-compartment was unique to that compartment (Table S6). The interrogation of different sub-cellular RNA fractions provides snapshots of the status of the RNA population along the RNA processing pathway. Thus, by analyzing short and long RNAs in the different sub-cellular compartments, we confirm that splicing predominantly occurs during transcription. By using RNA-seq to measure the degree of completion of splicing (Figure 2a), we observed that around most exons, introns are already being spliced in chromatin-associated RNA—the fraction that includes the RNAs in the process of being transcribed (Figure 2b). Concomitantly, we found strong enrichment specifically of spliceosomal small nuclear RNAs (snRNAs) in this RNA fraction (see short RNA expression landscape section below). Co-transcriptional splicing provides an explanation for the increasing evidence connecting chromatin structure to splicing regulation, and we have indeed observed that exons in the process of being spliced are enriched in a number of chromatin marks[17,18].

Figure 2

Co-transcriptional splicing

a. Short read mappings for exon-based splicing completion. Read mappings that allow assessment of splicing completion around exons. (a,b,c) Reads providing evidence of splicing completion for the region containing the exon (with either exon inclusion, a and b, or exclusion, c). (d,e) Reads providing evidence that splicing of the region containing the exon has not yet been completed. The complete Splicing Index (coSI) is the ratio of a+b+c over a+b+c+d+e and can thus be broadly assumed to correspond to the fraction of RNA molecules in which the region containing the exon has already been spliced (see Tilgner et al.[17]). A coSI value of 1 means splicing is completed, while a value of 0 indicates that splicing has not yet been initiated.

b. Distribution of coSI scores computed on Gencode internal exons: (Top) Distribution in the total chromatin RNA fraction. (Bottom) Distribution in cytosolic polyadenylated RNA fraction.
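The coSI ratio defined in panel a can be written as a small function. This is a minimal sketch based only on the caption's definition; the function name, the handling of exons without informative reads and the example counts are our own assumptions, not the authors' implementation (see Tilgner et al.[17] for the actual method).

```python
def cosi(a, b, c, d, e):
    """coSI as defined in the Figure 2a caption.

    a, b, c: read counts supporting completed splicing around the exon
             (a and b: exon inclusion; c: exon exclusion).
    d, e:    read counts supporting not-yet-completed splicing.
    Returns None when no informative reads are available (our assumption).
    """
    spliced = a + b + c
    total = spliced + d + e
    if total == 0:
        return None
    return spliced / total


# Hypothetical read counts for one exon, for illustration only.
example = cosi(a=30, b=25, c=5, d=10, e=10)
print(example)  # 60 spliced reads out of 80 informative reads -> 0.75
```

Values near 1 correspond to exons whose surrounding introns are already spliced out in the fraction examined, and values near 0 to exons in mostly unspliced molecules.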

Gene expression across cell lines

The analyses of RNAs isolated from different sub-cellular compartments also provide information concerning compartment-specific relative steady-state abundance and the post-transcriptional processing state (spliced/unspliced, polyadenylated/non-polyadenylated, 5′ capped/uncapped) for each of the detected transcripts. The observed range of gene expression spans six orders of magnitude for polyadenylated RNAs (from 10⁻² to 10⁴ reads per kilobase per million reads [RPKM]), and five orders of magnitude (from 10⁻² to 10³ RPKM) for non-polyadenylated RNAs (Figure 3 and Figure S8a). The distribution of gene expression is very similar across cell lines, with protein coding genes, as a class, having on average higher expression levels than long non-coding RNAs (lncRNAs). Assuming that 1-4 RPKM approximates to 1 copy per cell[19], we find that almost one quarter of expressed protein coding genes and 80% of the detected lncRNAs are present in our samples in 1 or fewer copies per cell. The general lower level of gene expression measured in lncRNAs may not necessarily be the result of consistent low RNA copy number in all cells within the population interrogated, but may also result from restricted expression in only a subpopulation of cells. In some cell lines, individual lncRNAs can exhibit steady-state expression levels as high as those of protein coding genes. This is, for example, seen in the expression of the protein coding gene actin, gamma 1 (ACTG1), and the non-coding gene, H19 (Figure 3). ACTG1 transcripts are part of all non-muscle cytoskeleton systems within cells and show a steady state expression level at the population level that is at least 1-2 logs greater than H19, a cytosolic ncRNA. However, when measured at the individual transcript level, expression of lncRNA transcripts is comparable to that of individual protein coding transcripts (Figure S8b).
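For readers unfamiliar with the unit, RPKM normalizes a read count by both feature length (in kilobases) and sequencing depth (in millions of mapped reads). The sketch below uses the standard textbook definition; the study's own quantification tools are listed in Table S2 and may differ in detail, and the function names and example numbers here are purely illustrative.

```python
import math


def rpkm(read_count, length_nt, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads (standard definition)."""
    return read_count * 1e9 / (length_nt * total_mapped_reads)


def orders_of_magnitude(low, high):
    """Number of powers of ten separating two positive expression values."""
    return round(math.log10(high / low))


# Hypothetical gene: 500 reads over a 2,000 nt gene model in a library of 50 million reads.
example_rpkm = rpkm(500, 2_000, 50_000_000)

# Expression ranges quoted in the text.
polya_span = orders_of_magnitude(1e-2, 1e4)      # polyadenylated RNAs
non_polya_span = orders_of_magnitude(1e-2, 1e3)  # non-polyadenylated RNAs
print(example_rpkm, polya_span, non_polya_span)
```

The two spans evaluate to six and five orders of magnitude, the ranges stated for polyadenylated and non-polyadenylated RNAs respectively.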

Figure 3

Abundance of gene types in cellular compartments

2D Kernel density plots of nuclear over cytosolic enrichment (Y axis) versus overall gene expression in the whole cell extract (X axis), for protein coding, long non-coding and novel genes over all cell lines. Only genes present in all 3 RNA extracts are displayed, as well as two representative genes (ACTG1 in red and H19 in blue), for which the expression in each individual cell line is shown. The actual values of the estimated Kernel density are indicated by contour lines and color shades.

Novel antisense and intergenic genes predicted in this study comprise a third clustering of RNAs with levels of expression ranging from 10⁻⁴ to 10⁻¹ RPKM. As a class, only protein coding genes appear enriched in the cytosol, making the nucleus a center for the accumulation of non-coding RNAs (Figure 3). Other gene classes, such as pseudogenes and small annotated ncRNAs, also show sub-cellular compartmental enrichment (Figure S9). Higher variability and lower pairwise correlation of expression across all cell lines is consistent with lncRNAs contributing more to cell line specificity than protein-coding genes. Indeed, a considerable fraction (29%) of all expressed lncRNAs are detected in only one of the cell lines studied when considering the whole cell polyadenylated RNAs, while only 10% were expressed in all cell lines. Conversely, while a large fraction (53%) of expressed protein coding genes were constitutive (expressed in all cell lines), only ~7% were cell-line specific (Table S7, Figure S10).

Patterns of splicing

The analysis of the expression of alternative isoforms resulted in several observations. First, isoform expression does not seem to follow a minimalistic strategy. Genes tend to express many isoforms simultaneously, and as the number of annotated isoforms per gene grows, so does the number of expressed isoforms (Figure 4a). The increase, however, is not linear and appears to plateau at about 10-12 expressed isoforms per gene. We cannot, however, distinguish whether this is the result of multiple isoforms expressed in the same cell or of different isoforms expressed in different cells within the interrogated population. Second, alternative isoforms within a gene are not expressed at similar levels, and one isoform dominates in a given condition, usually capturing a large fraction of the total gene expression (at least 30% even for genes with many isoforms, Figure 4b). Third, about three quarters of protein coding genes have at least two different dominant/major isoforms depending on the cell line (Figure S11a). Fourth, the number of major isoforms per gene grows with the number of annotated isoforms; indeed, the proportion of genes with n isoforms that express only one major isoform is strikingly proportional to 1/n (Figure S11b). Fifth, variability of gene expression contributes more than variability of splicing ratios to the variability of transcript abundances across cell lines (Supplementary Material).
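The notion of a dominant (major) isoform used above can be made concrete as the isoform with the largest share of a gene's total expression in a given sample. The following sketch illustrates that idea only; the function name, the input layout and the example values are hypothetical, and the study's exact criteria are given in its Supplementary Material.

```python
def major_isoform(isoform_expression):
    """Return (isoform_id, share_of_gene_expression) for the most abundant isoform.

    isoform_expression: dict mapping isoform id -> expression level (e.g. RPKM)
    for a single gene in a single cell line. Returns None if the gene is not expressed.
    """
    total = sum(isoform_expression.values())
    if total == 0:
        return None
    top = max(isoform_expression, key=isoform_expression.get)
    return top, isoform_expression[top] / total


# Hypothetical gene with three annotated isoforms in one cell line.
gene = {"isoform_1": 12.0, "isoform_2": 6.0, "isoform_3": 2.0}
top_id, top_share = major_isoform(gene)
print(top_id, top_share)
```

Applying such a function per gene and per cell line, and then comparing the returned isoform identifiers across cell lines, is one way to ask whether a gene switches its major isoform between conditions.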

Figure 4

Isoform expression within a gene

a. Number of expressed isoforms per gene per cell line. Genes tend to express many isoforms simultaneously.

b. Relative expression of the most abundant isoform per gene per cell line. There is generally one dominant isoform in a given condition.

Alternative transcription initiation and termination

Based on RNA-seq analysis of polyadenylated RNAs, a total of 128,021 TSS were detected across all cell lines, of which 97,778 were previously annotated and 30,243 were novel intergenic/antisense TSS (Table S3a). CAGE tags, filtered by a hidden Markov model (HMM) based algorithm to differentiate between 5′ capped termini of polymerase II transcripts and recapping events[20] (Supplementary Material), identified a total of 82,783 non-redundant TSS (Table S8). Approximately 48% of the CAGE identified TSS are located within 500 bp of an annotated RNA-seq detected Gencode TSS, while an additional 3% are within 500 bp of a novel TSS (Figure S12). Interestingly, only ~72% of all CAGE sequencing reads map to TSS, indicating that the remaining ~28% may originate from recapping events or from a new class of TSS. Using data collected within the ENCODE consortium[21], we carried out a comparison of the Gencode/RNA-seq and CAGE determined TSSs and correlated them to chromatin and DNA features characteristic of initiation of transcription, such as DNase I hypersensitivity[22], chromatin modification and DNA binding elements[23,24]. All Gencode/RNA-seq determined TSS were examined in each of the cell lines (column 1, Figure S13). Of these redundant positions, 44.7% (199,146) of the RNA-seq supported TSS also displayed evidence of CAGE. Approximately half of these TSS positions are associated with at least one of the other characteristic features of transcription initiation (DNase I, H3K27ac and H3K4me3 chromatin modifications). Thus only a small minority of the TSS identified by either CAGE or RNA-seq/Gencode displayed all of the characteristics of the start of transcription (presence of DNase I, H3K4me3, H3K27ac sites and either Taf1 or Tbp binding). This is consistent with the possibility that regulatory regions proximal to TSS are of more than one type.
On the other hand, a total of 128,824 sites mapping within annotated Gencode transcripts were identified as potential sites of polyadenylation after trimming unmapped RNA-seq reads with long terminal polyadenine stretches[25]. About 20% of these mapped proximal to annotated polyadenylation sites (PAS) while the remaining 80% correspond to novel PAS of annotated genes, raising the average number of PAS per gene from 1.1 to 2.5. Generally, we observed a cell type preference for proximal PAS (closest to the annotated stop codon) in the cytosol compared to the nucleus (Supplementary Material).
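The 500 bp proximity comparison between CAGE-identified and annotated TSS described above can be sketched with a sorted list and binary search. This is an illustrative simplification under our own assumptions: a single chromosome and strand, point coordinates, and hypothetical positions. It is not the procedure used to produce Figure S12.

```python
from bisect import bisect_left


def within_distance(query_pos, sorted_reference, max_dist=500):
    """True if any reference position lies within max_dist bp of query_pos.

    sorted_reference must be sorted in ascending order; only the two
    neighbours flanking the insertion point need to be checked.
    """
    i = bisect_left(sorted_reference, query_pos)
    neighbours = sorted_reference[max(i - 1, 0):i + 1]
    return any(abs(query_pos - r) <= max_dist for r in neighbours)


# Hypothetical coordinates on one chromosome and strand.
annotated_tss = sorted([1_000, 5_000, 20_000])
cage_tss = [1_200, 5_600, 19_600, 40_000]

matched = [pos for pos in cage_tss if within_distance(pos, annotated_tss)]
fraction_matched = len(matched) / len(cage_tss)
print(matched, fraction_matched)
```

In this toy example two of the four CAGE positions fall within 500 bp of an annotated TSS; a genome-wide version would repeat the lookup per chromosome and strand.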

Short RNA expression landscape

Annotated small RNAs

Currently, a total of 7,054 small RNAs are annotated by Gencode, 85% of which correspond to four major classes: small nuclear (sn)RNAs, small nucleolar (sno)RNAs, micro (mi)RNAs and transfer (t)RNAs (Table 2a). Overall we find 28% of all annotated small RNAs to be expressed in at least one cell line (Table 2a). The distribution of annotated small RNAs differs markedly between cytosolic and nuclear compartments (Figure S14a). We found that the small RNA classes were enriched in those compartments where they are known to perform their functions: miRNAs and tRNAs in the cytosol, and snoRNAs in the nucleus. Interestingly, snRNAs were equally abundant in both the nucleus and the cytosol. When specifically interrogating the sub-nuclear compartments of the K562 cell line, however, snRNAs appear to be present in very high abundance in the chromatin-associated RNA fraction (Figure S14b,c). This striking enrichment is consistent with splicing being predominantly co-transcriptional[17,26].

Table 2

Short RNAs

a. Expression of Gencode (v7) annotated small RNA genes
Gene type¹	Gencode total	Detected genes (% detected)	# Genes expressed in only 1 cell line (% detected)	# Genes expressed in 12 cell lines (% detected)	miRNA guide fragment³	miRNA passenger fragment⁴	Internal fragments⁵ of annotated small RNA (average per detected gene)
miRNA	1,756	497 (28)	59 (12)	147 (30)	454 (454)	175 (175)	18
snoRNA	1,521	458 (30)	73 (16)	223 (49)	NA	NA	60
snRNA	1,944	378 (19)	123 (33)	41 (11)	NA	NA	36
tRNA	624	465 (75)	29 (6)	197 (42)	NA	NA	52
Other²	1,209	191 (16)	69 (36)	24 (13)	NA	NA	32
Total Gencode	7,054	1,989 (28)	353 (18)	632 (32)	NA	NA	40

includes all other Gencode small transcripts biotypes except pseudogenes

all elements that have passed npIDR (0.1)

number of detected miRNAs with an expressed annotated guide (with an annotated guide in mirbase)

number of detected miRNAs with an expressed annotated passenger (with an annotated passenger in mirbase)

short RNA-seq mappings whose 5′ ends fall between 5 bp after the start and 5 bp before the end of a detected gene

Unannotated short RNAs

We detected two types of unannotated short RNAs. The first type corresponds to sub-fragments of annotated small RNAs. Since we performed 36 nt end-sequencing of the small RNA fraction, we expected RNA-seq reads to map to the 5′ end of the small RNAs. Figure S15 shows the mapping profile of reads along small RNA genes. In both the nuclear and cytosolic compartments, we indeed detect accumulation of reads at the start of snoRNAs and at the guide and passenger sequences of annotated miRNAs. For snRNAs, however, we observed three prominent peaks: the expected one at the 5′ end and two smaller ones at the middle and at the 3′ end of the gene, suggesting fragmentation of some snRNAs. Finally, tRNAs appear not to have any prominent sets of 5′ end fragments present at levels greater than what is seen at the annotated 5′ termini. While sub-fragments of mature tRNAs have been reported previously, these reports were confined to distinct alleles of only a few tRNA genes[27-29]. The second and largest source of unannotated short RNAs corresponds to novel short RNAs (Table 2b) that map outside of annotated ones. Almost 90% of these are only observed in one cell line and are present at low copy numbers. Nearly 40% of these unannotated short RNAs are associated with promoter and terminator regions of annotated genes (promoter associated short RNAs [PASRs], termini associated short RNAs [TASRs]), and their position relative to TSS and transcription termination sites is similar to that reported previously[4].

Genealogy of short RNAs

Genome-wide, 27% of annotated small RNAs reside within 8% of protein-coding and 5% within 3% of lncRNA genes (Figure S16). Overall, about 6% of all annotated long transcripts overlap with small RNAs and are likely precursors to these small RNAs. While the majority of these small RNAs reside in introns, when controlling for relative exon/intron length, we found that exons from lncRNAs are comparatively enriched as hosts for snoRNAs (Figure S17a). Additionally, 8.4% of Gencode annotated small RNAs map within novel intergenic transcripts with the majority overlapping annotated tRNAs. The enrichment for tRNAs was mostly in novel intergenic transcripts derived from non-polyadenylated RNAs (Figure S17b). Many long RNAs, both novel and annotated, thus appear to have dual roles, as functional (protein coding) RNAs, and as precursors for many important classes of small RNAs. Using RNA-seq data from K562, we investigated the preferential cellular localization of these RNA precursors (Figure S18). For mature miRNAs and tRNAs (cytosolic enrichment), the potential RNA precursors, identified as RNA-seq contigs overlapping the small RNAs, were detected to be predominantly nuclear (Figure S18a,d). Interestingly, while mature snRNAs were both nuclear and cytosolic, the overlapping long RNAs were observed to be primarily nuclear (Figure S18c). Finally, for snoRNAs (nuclear enrichment), potential long RNA precursors were decidedly observed to be both nuclear and cytosolic (Figure S18b). Unannotated short RNAs were found overall not to be enriched in either the nuclear or cytosolic compartment (Figure S18e).

RNA editing and allele-specific expression

The sequence of transcripts can differ from the underlying genomic sequence as the result of post-transcriptional editing. We developed a pipeline to filter sequencing artifacts and identify genes that are RNA edited[30]. Focusing first on GM12878, a cell line that has been deeply resequenced, we find a total of 51,557 RNA-consistent single nucleotide variants (SNVs) within genic boundaries, 65% of which are present in dbSNP. Of the remainder, 1,186 SNVs in 430 genes (Figure S19a) survive our most stringent filters and 88% of these are candidate adenosine to inosine A->G(I) changes. Notably, the next most frequent SNVs are T->C (5%), and these are primarily in regions with detectable antisense transcription[30]. We find similar A->G(I) frequencies of 75-84% in 7 additional cell lines (Figure S19b). The remaining non-canonical edits amount to very few events in each cell line and are relatively evenly distributed (G->A is the third highest). These results do not support a recent report of a substantial number of non-canonical SNV edits in the RNA of human lymphoblastoid cells[31]. Using the AlleleSeq pipeline[32] on the SNPs in the GM12878 genome, we found that approximately 18% of both Gencode annotated protein coding and long non-coding genes exhibit allele-specific expression (ASE). The proportion of genes with ASE was similar in the three investigated RNA fractions (whole-cell, cytoplasm and nucleus, Table S9 and Supplementary Material).

Repeat region transcription

About 18% (14,828) of CAGE defined TSS regions overlap repetitive elements. More precisely, we find 322, 315, 507 and 1,262 intergenic CAGE clusters overlapping LINE, SINE, LTR and other repeat elements respectively (see Supplementary Material). Measuring Shannon entropy across cell lines, we found that CAGE clusters mapping to repeat regions were noticeably more narrowly expressed than CAGE clusters mapping within genic regions (Figure S20a). We represented the correlation of levels of expression compared to cell types as heat maps drawn separately for each of the three repeat element families (LINE, SINE and LTR) (Figure S20b-d). While a large proportion of the transcripts in the human genome are thought to be initiated from repetitive elements (especially retrotransposon elements[33]), these data clearly point to cell line specificity as the main characteristic of transcripts emanating from repeat regions.
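Shannon entropy is a common way to quantify how broadly or narrowly an element is expressed across samples: low entropy means expression is concentrated in few cell lines. The sketch below shows the general calculation on hypothetical expression vectors; the exact normalization and inputs used for Figure S20a are described in the Supplementary Material and may differ.

```python
import math


def shannon_entropy(expression):
    """Shannon entropy (in bits) of an expression profile across cell lines.

    expression: non-negative expression values, one per cell line.
    Values are converted to proportions of the total; zero entries contribute nothing.
    """
    total = sum(expression)
    if total == 0:
        return 0.0
    entropy = 0.0
    for value in expression:
        if value > 0:
            p = value / total
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical CAGE cluster profiles over four cell lines.
broad = shannon_entropy([5, 5, 5, 5])   # evenly expressed everywhere
narrow = shannon_entropy([8, 0, 0, 0])  # expressed in a single cell line
print(broad, narrow)
```

With four cell lines the entropy ranges from 0 bits (one cell line only) to 2 bits (perfectly even expression), so cell line-specific clusters sit at the low end of the scale.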

Characterization of enhancer RNA

It has recently been reported that RNA polymerase II binds some distal enhancer regions and can produce enhancer-associated transcripts named eRNA[34-36]. We used our RNA assays to detect and characterize transcriptional activity at enhancer loci predicted genome-wide from ENCODE ChIP-seq data[21,37]. Figure 5a shows the aggregate pattern of RNA-seq and CAGE signal in a strand specific manner around the subset of predicted gene-distal enhancers containing DNase I hypersensitive sites and centered on those sites. In these plots, as denoted by the accumulation of CAGE tags signifying transcription start sites (TSS), transcription initiation within the enhancer region is observed, and continues outwards for several kilobases. This behaviour can be observed for the polyadenylated and non-polyadenylated RNA fractions mapping in both intronic and intergenic regions. As previously reported[34], we observe a large diversity of expression levels at each of the transcribed enhancers. Polyadenylated to non-polyadenylated RNA ratios, as well as nuclear to cytoplasmic ratios, vary at individual enhancers (Figure S21a,b). However, contrary to some previous reports, while the majority of eRNAs are prevalent in the nuclear non-polyadenylated RNA fraction, some eRNAs appeared to be polyadenylated in the nucleus. This pattern was significantly different compared to transcripts from Gencode annotated and novel predicted[21] promoters (Figure 5b).

Figure 5

Transcription at enhancers

a. The pattern of RNA elements around enhancer predictions[21,37] containing DNase I hypersensitive (HS) sites. The lines represent the average frequency of RNA elements (top: polyadenylated long RNA contigs; middle: CAGE tag clusters; bottom: non-polyadenylated long RNA contigs) in a genomic window around the center of the enhancer prediction as determined by DNase I HS sites. Elements on the plus strand are shown in red, and on the minus strand in blue.

b. Enhancer transcripts differ from promoter transcripts.

The box plots compare the features of transcripts at predicted enhancer loci with those at predicted novel intergenic promoters[21] and annotated promoters[8]. H3K4me3, PolyA+ and Nucleus denote the following three ratios: H3K4me3/(H3K4me3 + H3K4me1), polyadenylated/(polyadenylated + non-polyadenylated), Nuclear/(Nuclear + Cytosolic). Enhancers are marked by higher levels of H3K4me1 relative to H3K4me3 than novel or annotated promoters (left). Enhancer transcripts show higher levels of non-polyadenylated (middle) and nuclear (right) RNA relative to promoters.

c. Chromatin state at transcribed enhancers.

Enhancer predictions with evidence of transcription (in blue; CAGE tags present at predicted locus) show a different pattern of histone modifications and higher levels of RNA polymerase II binding than non-transcribed predictions (red). They are enriched for H3K27 acetylation, H3K4 methylation and H3K79 di-methylation, and depleted for H3K27 tri-methylation.

d. Enhancer activity and transcription are cell-type specific.

Loci predicted to be active transcribed enhancers in GM12878 cells show low signal for CAGE tags (top) and for H3K27 acetylation (bottom) in other cell lines.
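The three ratios defined for panel b share the form a/(a + b), which bounds each feature between 0 and 1 regardless of the absolute signal level. A minimal sketch, with hypothetical function names and made-up signal values:

```python
def fraction(numerator_signal, other_signal):
    """Return numerator / (numerator + other), or None if both are zero."""
    total = numerator_signal + other_signal
    if total <= 0:
        return None  # ratio undefined without any signal
    return numerator_signal / total

def locus_features(h3k4me3, h3k4me1, polya_plus, polya_minus, nuclear, cytosolic):
    """Compute the three panel b ratios for one locus."""
    return {
        "H3K4me3": fraction(h3k4me3, h3k4me1),
        "PolyA+": fraction(polya_plus, polya_minus),
        "Nucleus": fraction(nuclear, cytosolic),
    }

# Hypothetical signals for an enhancer-like locus
enhancer_like = locus_features(
    h3k4me3=1.0, h3k4me1=3.0,
    polya_plus=1.0, polya_minus=4.0,
    nuclear=9.0, cytosolic=1.0,
)
print(enhancer_like)  # {'H3K4me3': 0.25, 'PolyA+': 0.2, 'Nucleus': 0.9}
```

With these toy values the locus shows the enhancer-like pattern described in the caption: low H3K4me3 and PolyA+ ratios and a high Nucleus ratio.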

Transcribed enhancers on average show a significantly different pattern of chromatin modifications than non-transcribed ones[38-41]. The transcribed enhancer regions displayed stronger signals for H3K4 methylation, H3K27 acetylation and H3K79 dimethylation, along with higher levels of RNA polymerase II binding, all associated with transcriptional initiation and elongation (Figure 5c). Both the transcripts and the chromatin states are cell-type specific (Figure 5d). Taking the GM12878 cell line as an example, the enhancer loci producing eRNA demonstrate enrichment of CAGE tag detection (Figure 5d.1) and the presence of the H3K27ac histone modification (Figure 5d.2) in this cell line compared to five other analyzed cell lines. This strongly suggests that the regulatory regions governing the expression of enhancer transcripts are distinct from regulatory regions located at the beginning of genic regions.

Conclusion: Genome-wide coverage of transcribed regions of the human genome and its consequences

The cumulative coverage of transcribed regions in the 15 cell lines across the human genome is 62.1% and 74.7% for processed and primary transcripts, respectively (Table S10 and Figure S22). On average for each cell line, 39% of the genome is covered by primary transcripts, and 22% by processed RNAs. No cell line showed transcription of more than 56.7% of the union of the expressed transcriptomes across all cell lines. When mapping the current RNA-seq data to the ENCODE pilot regions (Table S10), we observed a similar, albeit higher, extent of transcriptional coverage of 73.3% for processed RNAs, and 84.5% for primary transcripts. Previously reported estimates in these regions for processed and primary transcripts were 24% and 93%, respectively (Table S2.4.3[3]). The increased genome coverage by processed RNAs stems largely from the inclusion of non-polyadenylated RNAs in the current study. Other than that, given the differences in the samples studied, the selection of pilot regions with high genic content, the increase of annotated genomic regions over time, and the different technologies used to interrogate transcription, both estimates are in reasonable agreement. As a consequence of both the expansion of genic regions by the discovery of new isoforms and the identification of novel intergenic transcripts, there has been a marked increase in the number of intergenic regions (from 32,481 to 60,250) due to their fragmentation and a decrease in their median length (from 14,170 bp to 3,949 bp, Figure 6). Concordantly, we observe an increased overlap of genic regions. Since genic regions are currently defined by the cumulative lengths of their isoforms and their genetic association with phenotypic characteristics, the likely continued reduction in the lengths of intergenic regions will steadily lead to the overlap of most genes previously assumed to be distinct genetic loci. This is consistent with earlier observations of a highly interleaved transcribed genome[12], but more importantly, prompts the reconsideration of the definition of a gene. As this is a consistent characteristic of annotated genomes, we propose that the transcript be considered as the basic atomic unit of inheritance. Concomitantly, the term gene would then denote a higher order concept intended to capture all those transcripts (eventually divorced from their genomic locations) that contribute to a given phenotypic trait.
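The coverage and intergenic-length statistics above rest on simple interval arithmetic: the union of transcribed intervals gives genome coverage, and the gaps between merged genic intervals give the intergenic regions. The following is a minimal sketch on a toy chromosome, not the procedure behind Table S10 or Figure 6; it assumes half-open coordinates, and all names and numbers are hypothetical.

```python
from statistics import median

def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open intervals [start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]

def coverage_fraction(intervals, genome_length):
    """Fraction of the genome covered by the union of the intervals."""
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / genome_length

def intergenic_gaps(genic_intervals, genome_length):
    """Regions of [0, genome_length) not covered by any genic interval."""
    gaps, cursor = [], 0
    for start, end in merge_intervals(genic_intervals):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < genome_length:
        gaps.append((cursor, genome_length))
    return gaps

# Toy chromosome of 1,000 bp with transcribed intervals in two cell lines
genome_length = 1000
cell_lines = {
    "A": [(0, 100), (400, 500)],
    "B": [(50, 200), (450, 600)],
}
per_line = {name: coverage_fraction(ivs, genome_length) for name, ivs in cell_lines.items()}
all_intervals = [iv for ivs in cell_lines.values() for iv in ivs]
cumulative = coverage_fraction(all_intervals, genome_length)
print(per_line)    # {'A': 0.2, 'B': 0.3}
print(cumulative)  # 0.4

# A novel transcript splits one intergenic region into two shorter ones
old_genes = [(0, 200), (400, 600)]
new_genes = old_genes + [(700, 800)]
for genes in (old_genes, new_genes):
    lengths = [end - start for start, end in intergenic_gaps(genes, genome_length)]
    print(len(lengths), median(lengths))
```

In the toy example, cumulative coverage exceeds the coverage of either cell line alone, and adding one novel transcript raises the number of intergenic regions from two to three while lowering their median length, mirroring the direction of the trends reported above.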

Figure 6

Size distribution of intergenic regions

Novel genes increase the proportion of small intergenic regions; ig/as = intergenic / antisense.

