Martin Hemberg, Jesse M Gray, Nicole Cloonan, Scott Kuersten, Sean Grimmond, Michael E Greenberg, Gabriel Kreiman.
Abstract
More than 98% of a typical vertebrate genome does not code for proteins. Although non-coding regions are sprinkled with short (<200 bp) islands of evolutionarily conserved sequences, the function of most of these unannotated conserved islands remains unknown. One possibility is that unannotated conserved islands could encode non-coding RNAs (ncRNAs); alternatively, unannotated conserved islands could serve as promoter-distal regulatory factor binding sites (RFBSs) like enhancers. Here we assess these possibilities by comparing unannotated conserved islands in the human and mouse genomes to transcribed regions and to RFBSs, relying on a detailed case study of one human and one mouse cell type. We define transcribed regions by applying a novel transcript-calling algorithm to RNA-Seq data obtained from total cellular RNA, and we define RFBSs using ChIP-Seq and DNase-hypersensitivity assays. We find that unannotated conserved islands are four times more likely to coincide with RFBSs than with unannotated ncRNAs. Thousands of conserved RFBSs can be categorized as insulators based on the presence of CTCF or as enhancers based on the presence of p300/CBP and H3K4me1. While many unannotated conserved RFBSs are transcriptionally active to some extent, the transcripts produced tend to be unspliced, non-polyadenylated and expressed at levels 10- to 100-fold lower than annotated coding or ncRNAs. Extending these findings across multiple cell types and tissues, we propose that most conserved non-coding genomic DNA in vertebrate genomes corresponds to promoter-distal regulatory elements.
Year: 2012 PMID: 22684627 PMCID: PMC3439890 DOI: 10.1093/nar/gks477
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. More conserved islands overlap enhancers and RFBSs than unannotated ncRNAs. The bars show the number of unannotated conserved islands that overlapped an enhancer, other promoter-distal RFBS or novel (HaTriC-defined) transcript (in mouse neurons). For enhancers and RFBSs, a conserved island had to overlap an enhancer or RFBS to be counted. For novel ncRNAs, we used a statistical approach to estimate the number of conserved islands within exons of transcribed regions (Supplementary Methods). Lists of enhancer loci were taken from (5,18). RFBSs were defined as TF binding sites (mouse neurons) or DHSs (HeLa) that were not at promoters or enhancers.
Figure 3. Across many cell types, more conserved islands overlap RFBSs than ncRNAs. (A) Number of novel ncRNAs, ucRFBSs, and insulators discovered with each additional tissue or cell type investigated. The number of conserved islands assigned to each category initially increases as a power law. (B) Extrapolation to additional cell types of the number of unannotated conserved islands explained by ncRNAs, unannotated RFBSs and insulators (based on the slopes in the left panel). Assuming that there are a total of 100, 200, 500, 1000 or 1500 distinct cell types or conditions, we calculated the total number of conserved islands that would overlap ncRNAs, insulators or other RFBSs.
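The power-law extrapolation described in the Figure 3 caption can be illustrated with a minimal sketch. It assumes the relationship is fitted by ordinary least squares on log-transformed values; the caption does not state the fitting procedure, so this is only one plausible reading. The function names and the input numbers are hypothetical and are not data from the paper.

```python
import math

def fit_power_law(n_cell_types, n_islands):
    """Fit n_islands = a * n_cell_types ** b by least squares in log-log space.

    Returns (a, b). Fitting in log space is an assumption of this sketch.
    """
    xs = [math.log(x) for x in n_cell_types]
    ys = [math.log(y) for y in n_islands]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope on the log-log plot
    a = math.exp(mean_y - b * mean_x)   # prefactor
    return a, b

def extrapolate(a, b, n_total_cell_types):
    """Predicted number of conserved islands explained at n_total_cell_types."""
    return a * n_total_cell_types ** b

# Made-up illustrative counts (NOT from the paper): cumulative islands
# explained after surveying 1, 4, 9 and 16 cell types.
a, b = fit_power_law([1, 4, 9, 16], [100, 200, 300, 400])
for n in (100, 200, 500, 1000, 1500):   # totals considered in panel B
    print(n, round(extrapolate(a, b, n)))
```

Because the fitted exponent controls how quickly new cell types stop contributing new islands, small changes in the slope produce large differences at 1000 or more cell types, so any such extrapolation should be read as a rough estimate.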
Comprehensive accounting of RNA-Seq reads by genomic locus
| Transcript category | Percentage of RNA-Seq reads | No. of loci | Percentage of genome |
|---|---|---|---|
| Protein-coding gene | 71.451 | 12 108 | 21.27 |
| Annotated non-coding gene | 1.381 | 2564 | 0.92 |
| snRNAs, tRNAs, scRNAs, srpRNAs, rRNAs | 26.465 | 3625 | 0.01 |
| Promoter AS transcript | 0.354 | 4844 | 0.63 |
| Other (HaTriC-defined) AS transcript | 0.038 | 660 | 0.11 |
| Novel (HaTriC-defined) transcript | 0.076 | 255 | 0.08 |
| Extragenic eRNA | 0.013 | 622 | 0.04 |
| Intragenic eRNA | 0.008 | 331 | 0.01 |
| Other RFBS-associated RNA | 0.062 | 793 | 0.04 |
| Associated with other H3K4me3 peaks | 0.017 | 367 | 0.01 |
| Total | 99.8643 | 26 169 | 23.1147 |
The vast majority of RNA-Seq reads in mouse neurons fall within 10 categories of genomic loci (rows), identified and classified using a combination of gene annotation, HaTriC transcript calling, and chromatin state (Supplementary Methods). Here categories of expressed loci are characterized based on their fraction of the total number of RNA-Seq reads, their number of genomic loci and their fraction of genomic base-pairs. Transcribed loci were required to have at least nine RNA-Seq reads and a read density of at least 1 per kb. Annotated gene categories include UTRs and introns. Annotated non-coding genes include those annotated in the UCSC, RefSeq, Ensembl, lincRNA and macroRNA collections, excluding snRNAs, tRNAs, scRNAs, srpRNAs and rRNAs. A ‘Promoter AS transcript’ is an AS transcript with its 5′-end within 2 kb of an annotated TSS. An ‘Other (HaTriC-defined) AS transcript’ is an AS transcript (overlapping an annotated gene) with its 5′-end further than 2 kb from any annotated TSS. An ‘Other RFBS-associated RNA’ starts within 2 kb of an RFBS not identified as an enhancer. snRNAs, tRNAs, scRNAs, srpRNAs and rRNAs are defined by RepeatMasker. We note that rRNAs are under-represented here relative to their abundance within a cell because they were removed from total RNA samples by hybridization prior to sequencing. Similar results for HeLa cells are presented in Supplementary Table S8.
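The expression thresholds and the 2 kb rule for antisense (AS) transcripts in the table note can be written as a small sketch. This is not the HaTriC implementation. It reads the nine-read threshold as a minimum and treats a distance of exactly 2 kb as "within 2 kb", both of which are assumptions. All function and constant names are hypothetical, and positions are assumed to lie on the same chromosome.

```python
MIN_READS = 9                # read-count threshold (read here as a minimum)
MIN_DENSITY_PER_KB = 1.0     # reads per kb of locus length
PROMOTER_WINDOW_BP = 2000    # "within 2 kb of an annotated TSS"

def is_expressed(n_reads, length_bp):
    """True if a locus passes both the read-count and read-density thresholds."""
    if length_bp <= 0:
        return False
    density_per_kb = n_reads / (length_bp / 1000)
    return n_reads >= MIN_READS and density_per_kb >= MIN_DENSITY_PER_KB

def classify_antisense(five_prime_pos, annotated_tss_positions):
    """Label an AS transcript by the distance from its 5' end to the nearest TSS."""
    nearest = min(
        (abs(five_prime_pos - tss) for tss in annotated_tss_positions),
        default=None,  # no annotated TSS supplied
    )
    if nearest is not None and nearest <= PROMOTER_WINDOW_BP:
        return "Promoter AS transcript"
    return "Other (HaTriC-defined) AS transcript"

print(is_expressed(9, 9000))                    # 9 reads at 1 read per kb
print(classify_antisense(1500, [1000, 50000]))  # 500 bp from the nearest TSS
```

The same distance test could in principle be reused for the ‘Other RFBS-associated RNA’ category by passing RFBS positions instead of TSS positions, although the note does not say how ties between categories were resolved.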
Figure 2. The transcriptional profile at ucRFBSs is more similar to that of enhancers than promoters. Comparison of several properties of transcribed regions (as defined in ‘Materials and Methods’) that overlap at least one conserved island. TSSs were defined by annotation for annotated genes or by HaTriC for unannotated genes. Only expressed loci are included; thresholds for defining expressed loci were nine RNA-Seq reads and a read density of at least 1 per kb. (A) Transcribed regions at enhancers and ucRFBSs are short and expressed at lower levels than annotated genes, as shown by the average expression near TSSs in each category. For sites without obvious strand orientation (e.g. H3K4me3 sites), forward and reverse (rev) genomic strands are plotted separately; otherwise, only sense reads are plotted. (B) ucRFBSs and enhancers are associated with fewer 5′ ends of 5′-sequenced ESTs than promoters. (Note the different y-scales.) (C) The ratio of polyA+ to total RNA reads is much lower at enhancers and ucRFBSs relative to annotated RNAs. The x-axis is the ratio of normalized polyA+ reads divided by the number of normalized total RNA reads at a locus, and the y-axis is the cumulative density (CDF). (D) ucRFBSs and enhancers are expressed at lower levels than annotated RNAs. (E) Transcribed regions at ucRFBSs and enhancers are shorter than those at protein-coding genes. (F) ucRFBSs and enhancers are not bound by the initiation-specific H3K4me3 mark. (G) Genomic sequence conservation at promoters extends outward further than genomic sequence conservation at ucRFBSs and enhancers. (H) The CpG content at ucRFBSs and enhancers is lower than that at promoters.
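The quantity plotted in panel C can be sketched as follows. The caption does not say how reads were normalized, so this example assumes a simple reads-per-million scaling by library size. The function names and the numbers are hypothetical.

```python
def polya_ratio(polya_reads, total_reads, polya_library_size, total_library_size):
    """Ratio of normalized polyA+ reads to normalized total-RNA reads at one locus.

    Normalization to reads per million is an assumption of this sketch.
    """
    norm_polya = polya_reads / polya_library_size * 1e6
    norm_total = total_reads / total_library_size * 1e6
    return norm_polya / norm_total

def empirical_cdf(values):
    """Return (value, cumulative fraction) pairs, sorted by value."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (i + 1) / n) for i, v in enumerate(ordered)]

# Made-up per-locus ratios (NOT from the paper) for one category of loci.
ratios = [0.5, 0.1, 0.3]
for value, fraction in empirical_cdf(ratios):
    print(value, fraction)
```

A category whose curve rises steeply at low ratios, as the caption reports for enhancers and ucRFBSs, consists mostly of loci with little polyA+ signal relative to total RNA.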