Julien Lagarde, Barbara Uszczynska-Ratajczak, Javier Santoyo-Lopez, Jose Manuel Gonzalez, Electra Tapanari, Jonathan M Mudge, Charles A Steward, Laurens Wilming, Andrea Tanzer, Cédric Howald, Jacqueline Chrast, Alicia Vela-Boza, Antonio Rueda, Francisco J Lopez-Domingo, Joaquin Dopazo, Alexandre Reymond, Roderic Guigó, Jennifer Harrow.
Abstract
Long non-coding RNAs (lncRNAs) constitute a large, yet mostly uncharacterized fraction of the mammalian transcriptome. Such characterization requires a comprehensive, high-quality annotation of their gene structure and boundaries, which is currently lacking. Here we describe RACE-Seq, an experimental workflow designed to address this based on RACE (rapid amplification of cDNA ends) and long-read RNA sequencing. We apply RACE-Seq to 398 human lncRNA genes in seven tissues, leading to the discovery of 2,556 on-target, novel transcripts. About 60% of the targeted loci are extended in either 5′ or 3′, often reaching genomic hallmarks of gene boundaries. Analysis of the novel transcripts suggests that lncRNAs are as long, have as many exons and undergo as much alternative splicing as protein-coding genes, contrary to current assumptions. Overall, we show that RACE-Seq is an effective tool to annotate an organism's deep transcriptome, and compares favourably to other targeted sequencing techniques.
Year: 2016 PMID: 27531712 PMCID: PMC4992054 DOI: 10.1038/ncomms12339
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Figure 1: Schematic overview of RACE-Seq.
Standard 5′ and 3′ RACE primers are designed to target exons of a gene and produce primary RACE products, which undergo a second round of RACE reactions using nested 5′ or 3′ RACE primers. Both standard and nested 5′ and 3′ RACE products are subjected to long-read sequencing, followed by mapping to the reference genome.
Figure 2: Locus extension and novel transcript boundaries.
(a) Venn diagrams depicting the proportion of loci (left panel) and transcripts (right panel) extended in the 5′ and/or 3′ direction. (b) Novel boundaries for CAGE-supported (bottom box plot) and CAGE-unsupported loci (top box plot). A schematic depiction of a target locus is provided below the plots. The viewing range of the box plots is restricted to −10,000 to +10,000 nucleotides for clarity.
Figure 3: On-target RACE enrichment and RACE-Seq specificity.
(a) Number of RACE-Seq reads falling into exonic regions of targeted genes (dark shades) and outside of them (light shades), after using standard (blue) and nested (orange) 5′ and 3′ RACE. (b) Proportion of targets detected by standard (blue) and nested (orange) 5′ and 3′ RACE-Seq.
Figure 4: New isoform discovery and annotation.
(a) Length distribution of spliced transcripts (logarithmic scale) for pre- (brown) and post-RACE-Seq (green) targets. (b) Distribution of the number of alternatively spliced isoforms per pre- (brown) and post-RACE-Seq (green) targeted gene locus. (c) Exon count distribution in pre- (brown) and post-RACE-Seq (green) transcripts.
Figure 5: Locus examples.
(a) Two separate loci were merged into one larger locus (LINC01246). This example illustrates the large number of alternative splicing events found using the RACE-Seq approach. The red filled transcripts (far left) indicate the manual annotation models built from the 454 reads (far right, in pink), as visualized in the ZMap browser (http://www.sanger.ac.uk/science/tools/zmap). (b) RACE-Seq reads (far right, in pink) establish the transcription start site (TSS) of an existing incomplete lincRNA by extending the 5′ end of the gene to a CpG island (yellow box); the new TSS is also supported by FANTOM5 CAGE data (small pink boxes).
Comparison of various TTS data sets with Merck PolyA-Seq peaks.
| Data set | TTSs | TTSs supported by PolyA-Seq peaks | % supported |
| --- | --- | --- | --- |
| Targets (pre-RACE) | 535 | 83 | 16% |
| Targets updated (post-RACE) | 1,027 | 99 | 10% |
| Protein coding | 17,940 | 7,019 | 39% |
| lncRNAs | 12,556 | 2,223 | 18% |
lncRNA, long non-coding RNA; RACE, rapid amplification of cDNA ends; TTS, transcription termination site.
Statistics are also reported for the full sets of GENCODE-annotated protein-coding genes and lncRNAs for reference.
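As a quick consistency check, the last value in each row of the table above appears to be the second count divided by the first. The sketch below recomputes it under that assumption; the column meanings (total TTSs, TTSs overlapping a PolyA-Seq peak) are inferred from the caption rather than stated in the source, and the names `tts_support` and `percent_supported` are illustrative only.

```python
# Counts copied from the table above as (total, supported) pairs.
# ASSUMPTION: the first count is the number of TTSs in the data set and the
# second is the number overlapping a PolyA-Seq peak (inferred from the caption).
tts_support = {
    "Targets (pre-RACE)": (535, 83),
    "Targets updated (post-RACE)": (1027, 99),
    "Protein coding": (17940, 7019),
    "lncRNAs": (12556, 2223),
}


def percent_supported(total, supported):
    """Return the supported fraction as a whole-number percentage."""
    return round(100 * supported / total)


percentages = {
    name: percent_supported(total, supported)
    for name, (total, supported) in tts_support.items()
}

for name, pct in percentages.items():
    print(f"{name}: {pct}%")
```

Under this reading, the recomputed values match the percentages reported in the table.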
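Comparison of pre- and post-RACE TTS data sets with polyA peaks called using our RACE-Seq data.
| Data set | TTSs | TTSs supported by RACE-Seq polyA peaks | % supported |
| --- | --- | --- | --- |
| Targets (pre-RACE) | 535 | 206 | 39% |
| Targets updated (post-RACE) | 1,027 | 321 | 31% |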
RACE, rapid amplification of cDNA ends; TTS, transcription termination site.
Table summarizing basic annotation statistics before and after RACE-Seq.
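| Data set | Genes | Transcripts | Transcripts per gene | Exons | Unique exons | Exons per transcript |
| --- | --- | --- | --- | --- | --- | --- |
| Targets (pre-RACE) | 398 | 597 | 1.5 | 1,889 | 1,695 | 3.2 |
| Targets updated (post-RACE) | 343 | 2,556 | 7.5 | 10,139 | 5,326 (4,626) | 4.0 |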
RACE, rapid amplification of cDNA ends.
Unique exons are those having distinct coordinates on the genome. The number of previously unannotated unique exons is indicated between parentheses in the penultimate column.
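The two ratio values in each row of the table above can be reproduced from the raw counts. The sketch below assumes the counts are, in order, genes, transcripts and exons, and that the ratios are transcripts per gene and exons per transcript; this column assignment is inferred from the arithmetic and is not stated in the source. The names `annotation` and `derived_ratios` are illustrative.

```python
# Raw counts copied from the table above.
# ASSUMPTION: the first, second and fourth values of each row are the number
# of genes, transcripts and exons, respectively.
annotation = {
    "Targets (pre-RACE)": {"genes": 398, "transcripts": 597, "exons": 1889},
    "Targets updated (post-RACE)": {"genes": 343, "transcripts": 2556, "exons": 10139},
}


def derived_ratios(row):
    """Return (transcripts per gene, exons per transcript), one decimal each."""
    transcripts_per_gene = round(row["transcripts"] / row["genes"], 1)
    exons_per_transcript = round(row["exons"] / row["transcripts"], 1)
    return transcripts_per_gene, exons_per_transcript


ratios = {name: derived_ratios(row) for name, row in annotation.items()}

for name, (per_gene, per_transcript) in ratios.items():
    print(f"{name}: {per_gene} transcripts/gene, {per_transcript} exons/transcript")
```

With these assumptions, the recomputed ratios agree with the values in the table (1.5 and 3.2 before RACE-Seq, 7.5 and 4.0 after).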
Proportion of annotated splice junctions in pre- and post-RACE-Seq targets supported by short-read ENCODE or GTEx RNA-Seq data.
| Data set | Splice junctions | Supported by RNA-Seq | % supported |
| --- | --- | --- | --- |
| Targets (pre-RACE) | 1,093 | 771 | 71% |
| Targets updated (post-RACE) | 3,085* | 975 | 31% |
| Protein coding | 82,627 | 74,090 | 90% |
| lncRNAs | 24,133 | 16,937 | 67% |
lncRNA, long non-coding RNA; RACE, rapid amplification of cDNA ends.
Both data sets are derived from conventional, unbiased sequencing experiments. (* denotes novel introns only.)
Figure 6: RACE-Seq performance compared with CaptureSeq.
Venn diagram indicating the number of annotated and unannotated on-target (± 5 kb) splice junctions discovered by RNA CaptureSeq and RACE-Seq. Only the top 25% of splice junctions with canonical splice sites, ranked by read coverage, were included in this analysis (see Methods).
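The filtering step described in this caption can be illustrated with a minimal sketch. The actual procedure is in the paper's Methods, which are not reproduced here, so everything below is an assumption: "canonical" is taken to mean GT/AG intron boundaries only, the 25% cut-off is applied after the canonical filter, and the `Junction` record and `top_fraction_canonical` function are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Junction:
    """Hypothetical splice-junction record used only for illustration."""
    chrom: str
    start: int
    end: int
    donor: str      # first two intronic bases, e.g. "GT"
    acceptor: str   # last two intronic bases, e.g. "AG"
    reads: int      # number of reads spanning the junction


def top_fraction_canonical(junctions, fraction=0.25):
    """Keep GT/AG junctions, rank by read coverage, return the top fraction.

    ASSUMPTION: the cut-off is applied after the canonical-site filter and the
    kept count is rounded up; the source does not specify either choice.
    """
    canonical = [
        j for j in junctions
        if (j.donor.upper(), j.acceptor.upper()) == ("GT", "AG")
    ]
    ranked = sorted(canonical, key=lambda j: j.reads, reverse=True)
    keep = math.ceil(len(ranked) * fraction)
    return ranked[:keep]


# Toy input: eight canonical junctions plus two highly covered non-canonical ones.
example = [Junction("chr1", 100 * i, 100 * i + 50, "GT", "AG", i) for i in range(1, 9)]
example += [
    Junction("chr1", 5000, 5100, "AT", "AC", 500),
    Junction("chr1", 6000, 6100, "GC", "AG", 400),
]

selected = top_fraction_canonical(example)
print([j.reads for j in selected])
```

In this toy input, the two non-canonical junctions are discarded despite their high coverage, and the two best-covered GT/AG junctions out of eight are retained.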