
RNA-Seq and find: entering the RNA deep field.

Abstract

Initial high-throughput RNA sequencing (RNA-Seq) experiments have revealed a complex and dynamic transcriptome, but because it samples transcripts in proportion to their abundances, assessing the extent and nature of low-level transcription using this technique has been difficult. A new assay, RNA CaptureSeq, addresses this limitation of RNA-Seq by enriching for low-level transcripts with cDNA tiling arrays prior to high-throughput sequencing. This approach reveals a plethora of transcripts that have been previously dismissed as 'noise', and hints at single-cell transcription fingerprints that may be crucial in defining cellular function in normal and disease states.


Year: 2011 PMID: 22113004 PMCID: PMC3308029 DOI: 10.1186/gm290

Source DB: PubMed Journal: Genome Med ISSN: 1756-994X Impact factor: 11.117

The deep field

Techniques for directly assessing and quantifying RNA by high-throughput sequencing, collectively known as RNA-Seq [1], have revealed unexpected complexity and diversity in human transcriptomes [2]. Many of the previously unknown transcripts now being detected are low-abundance long intergenic non-coding RNAs (lncRNAs), which seem to be crucial for function and in disease [3]. A key advantage of RNA-Seq over previous microarray-based methods for assessing transcription is the ability to query all transcripts on a genome-wide scale without prior knowledge of the locations and structures of genes. However, this advantage of RNA-Seq has also been its Achilles heel: transcriptomes are dominated by a few highly abundant transcripts, and because transcripts are sampled in proportion to their abundances and lengths, the frequent sampling of abundant transcripts reduces the power to detect those that are rare or short [4].

An apt analogy for a genome biologist attempting to measure transcription of a short, low-abundance gene from genome-wide RNA-Seq data is the difficulty encountered by an astronomer attempting to detect a faint star from images collected in a low-resolution sky survey. The tradeoff between breadth and depth is one that astronomers have grappled with for a long time and have ultimately resolved with the development of telescopes that can limit the scope of a detector to areas of interest, along with guiding technology that enables deep sampling of a region over long exposures. This approach was the design principle for the Hubble Deep Field, which focused a powerful detector on the darkest portions of the sky. With the bright 'foreground' of nearby objects removed, an immense number of galaxies were discovered in what was previously thought to be empty space.

Mercer et al. [5] have designed an analogous focused experiment for probing the transcriptome, RNA CaptureSeq, and describe a similar outcome: regions with only scattered coverage in genome-wide experiments are revealed to be loci with transcription of low-abundance RNAs. A key aspect of CaptureSeq is that the integrity of the transcriptome in the 'deep field' is preserved: the relative proportions of transcripts sampled with CaptureSeq are shown to be equivalent to the relative proportions in conventional RNA-Seq.
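The sampling argument above can be made concrete with a small calculation. The sketch below is only an illustration under stated assumptions: fragments are drawn in proportion to abundance times length, fragment counts are treated as Poisson distributed (an assumption, not something stated in the article), and the toy transcriptome, the function names and the threshold of five fragments are all hypothetical.

```python
import math


def expected_fragments(weights, index, total_fragments):
    """Expected fragment count for one transcript.

    weights[i] is abundance_i * length_i, so fragments are drawn in
    proportion to both abundance and length.
    """
    return total_fragments * weights[index] / sum(weights)


def detection_probability(expected, min_fragments=1):
    """Poisson probability of observing at least `min_fragments` fragments.

    The Poisson model is an assumption made for this sketch.
    """
    p_fewer = sum(
        math.exp(-expected) * expected ** k / math.factorial(k)
        for k in range(min_fragments)
    )
    return 1.0 - p_fewer


# Hypothetical toy transcriptome: two abundant, long transcripts and one
# rare, short transcript (all numbers are made up for illustration).
abundances = [10_000, 10_000, 1]   # relative copy numbers
lengths = [2_000, 2_000, 500]      # transcript lengths in bases
weights = [a * l for a, l in zip(abundances, lengths)]

for depth in (10_000, 100_000, 1_000_000):
    mu = expected_fragments(weights, 2, depth)
    p = detection_probability(mu, min_fragments=5)
    print(f"{depth:>9} fragments: expected {mu:.3f}, P(>=5 fragments) = {p:.3f}")
```

In this toy setting almost every fragment comes from the two abundant transcripts, so the rare, short transcript only becomes reliably observable once the total depth is large.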

Seq and find

Mercer et al. [5] first performed conventional RNA-Seq focusing on a primary human foot fibroblast cell line. De novo assembled transcripts from conventional RNA-Seq were combined with 'dark' intergenic regions that seemed not to be transcribed to design capture arrays. The targeted regions were then pulled down using the array, followed by sequencing. Mercer et al. [5] provide numerous controls to show that the approach maintains library diversity without introducing PCR amplification bias or other biases. The enrichment provided by CaptureSeq is estimated to be 380-fold more than conventional RNA-Seq in the targeted regions. Therefore, the resolution achievable with CaptureSeq (in the targeted regions) is approximately equivalent to what could be obtained with 10 billion conventional RNA-Seq reads. As Mercer et al. [5] point out, such depth is necessary for finding very-low-abundance transcripts and for accurately quantifying abundances. We reinforce the latter observation in Figure 1, which gives an example showing that extreme depth is necessary for accurate isoform abundance estimation for the dystrophin gene, a complex multi-isoform gene that can harbor mutations causing muscular dystrophy. The need for deep sequencing in the example arises from the overlap between the multiple isoforms of the gene and the ambiguity that this causes in read mapping. Large gene families are equally difficult to resolve for the same reason. It has been estimated that only 60% of transcripts can be accurately quantified with 10 billion conventional RNA-Seq reads [6].
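The relationship between the two figures quoted above (380-fold enrichment and an equivalent of roughly 10 billion conventional reads) is simple arithmetic. The sketch below only rearranges those two numbers; the CaptureSeq depth it prints is implied by them and is not a figure reported in this article, and the function name is hypothetical.

```python
ENRICHMENT_FOLD = 380                  # enrichment in targeted regions, as quoted in the text
EQUIVALENT_CONVENTIONAL_READS = 10e9   # approximate equivalent depth, as quoted in the text


def equivalent_conventional_reads(capture_reads, enrichment_fold):
    """Conventional RNA-Seq reads needed to match the coverage that
    `capture_reads` CaptureSeq reads give in the targeted regions,
    assuming enrichment acts as a simple multiplier on depth."""
    return capture_reads * enrichment_fold


# Depth implied by the two quoted figures (an inference, not a reported value).
implied_capture_reads = EQUIVALENT_CONVENTIONAL_READS / ENRICHMENT_FOLD
print(f"Implied CaptureSeq depth: about {implied_capture_reads / 1e6:.1f} million reads")
```

Read the other way, a few tens of millions of captured reads stand in for a conventional experiment that is currently impractical to run.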

Figure 1

To demonstrate the difficulty of accurate isoform-level abundance estimation on low-abundance genes, we simulated an RNA CaptureSeq experiment on the 18 isoforms of the dystrophin gene as annotated in RefSeq (hg19), shown in (a). For each number of 76 bp paired-end fragments that aligned to the gene, we estimated abundances of each isoform using the online EM algorithm (for details of the model used see [8] and for details of the implementation see [11]). (b) The accuracy of isoform abundance estimation measured as the Pearson correlation coefficient (r) of the logged relative abundance estimates compared with the true abundance used to generate the simulated data. The results are averaged over four simulations from different random abundance distributions. Because of the similarity of the isoforms, only 2.5% of fragments aligned uniquely to a single isoform on average, making the deconvolution particularly difficult. The bottom x-axis shows how many alignable paired-end fragments would be required to achieve the same r in a genome-wide experiment as in the CaptureSeq simulation. Here we assume 3.17 fragments per kilobase per million mapped reads (FPKM) for the gene, which is what we estimated from a sample ENCODE dataset [accession SRR065495].

Increased accuracy is only one advantage of CaptureSeq. Another striking result of Mercer et al. [5] is the number of previously unknown transcripts discovered and the corollary that current sequencing experiments are very far from saturation. The message is clearly 'seq and find' and this is exactly what is happening in RNA-Seq. The experiments surveyed in [1] are an order of magnitude smaller than the norm today, and it is reasonable to extrapolate that as the costs of sequencing drop precipitously, the average depth of sequencing in RNA-Seq experiments will increase by another order of magnitude in the next 3 years.
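The isoform deconvolution described in the Figure 1 legend can be illustrated with a very small expectation-maximization (EM) example. This is a sketch under strong simplifying assumptions and not the method used for the figure: it is a batch EM rather than the online EM cited there, it ignores fragment-length distributions and bias correction, and the function name, the toy fragments and the effective lengths are hypothetical.

```python
def em_isoform_abundance(compat, eff_lengths, n_iter=200):
    """Estimate relative isoform abundances with a plain batch EM.

    compat:      one tuple per fragment listing the isoform indices it
                 aligns to (each tuple must be non-empty).
    eff_lengths: effective length of each isoform.
    Returns relative abundances that sum to 1.
    """
    k = len(eff_lengths)
    theta = [1.0 / k] * k  # fraction of fragments assigned to each isoform
    for _ in range(n_iter):
        counts = [0.0] * k
        # E-step: split each fragment among its compatible isoforms in
        # proportion to theta_i / effective_length_i.
        for isoforms in compat:
            w = [theta[i] / eff_lengths[i] for i in isoforms]
            total = sum(w)
            for i, wi in zip(isoforms, w):
                counts[i] += wi / total
        # M-step: renormalize the expected fragment counts.
        n = sum(counts)
        theta = [c / n for c in counts]
    # Convert fragment fractions to length-normalized relative abundances.
    rho = [t / l for t, l in zip(theta, eff_lengths)]
    z = sum(rho)
    return [r / z for r in rho]


# Hypothetical toy data: two equal-length isoforms, with most fragments
# aligning ambiguously to both of them.
fragments = [(0,)] * 30 + [(1,)] * 10 + [(0, 1)] * 60
estimates = em_isoform_abundance(fragments, eff_lengths=[1000.0, 1000.0])
print([round(x, 3) for x in estimates])
```

Here the ambiguous fragments are apportioned using the information carried by the few uniquely aligned ones, which is why estimates become unstable when, as in the dystrophin example, only a small percentage of fragments align uniquely.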

Breaking the curse of deep sequencing

Given the observations above, it is natural to speculate that conventional RNA-Seq with 10 billion reads will be commonplace in the near future and to ask whether technologies such as CaptureSeq are truly necessary. It certainly seems plausible that exome sequencing, which is to genome sequencing as CaptureSeq is to conventional RNA-Seq, will eventually be replaced by routine whole genome sequencing. However, in addition to the technological challenges that must be overcome to allow routine sequencing of 10 billion reads, there are also bioinformatics problems that must be solved if such data are to be useful. In particular, increased numbers of reads in RNA-Seq lead to the 'curse of deep sequencing', in which extra sequence actually reduces performance and accuracy in current processing pipelines. This happens for two reasons. Firstly, most existing algorithms for RNA-Seq quantification require loading a substantial fraction of the sequenced reads into memory, which is already a challenging prospect at 100 million reads. Algorithms whose running times grow faster than linearly in the number of reads are also likely to fail with large amounts of sequence data. Secondly, reads have errors and are prone to various biases [7]; although these occur at a fixed frequency, their absolute number grows with the number of reads. For example, if a sufficiently large amount of sequence is obtained, a recurring error in a highly abundant gene may appear to be a (false) novel isoform. It therefore becomes imperative with large amounts of sequence to correct for sequencing artifacts, and this can be computationally prohibitive. CaptureSeq breaks the curse of deep sequencing by providing increased transcript resolution at fixed sequencing depth. This means that existing methods can be readily applied to the analysis of CaptureSeq data, and indeed the authors [5] show that the Cufflinks suite of tools can be used for both assembly [8] and quantification [7] of CaptureSeq data.
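The second reason given above, that a fixed error frequency produces more erroneous reads at greater depth, can be sketched numerically. Every value below is hypothetical: the fraction of reads from the abundant gene, the rate of the recurring error and the minimum read support at which an assembler might report a novel isoform are chosen purely for illustration.

```python
def expected_artifact_reads(total_reads, gene_read_fraction, recurring_error_rate):
    """Expected number of reads carrying the same recurring artifact.

    The rate is fixed, so this count grows linearly with sequencing depth.
    """
    return total_reads * gene_read_fraction * recurring_error_rate


GENE_READ_FRACTION = 0.01     # hypothetical: 1% of reads come from one abundant gene
RECURRING_ERROR_RATE = 1e-6   # hypothetical: rate of one specific recurring artifact
MIN_SUPPORT = 10              # hypothetical: reads needed before an isoform is called

for depth in (10**8, 10**9, 10**10):
    n = expected_artifact_reads(depth, GENE_READ_FRACTION, RECURRING_ERROR_RATE)
    verdict = "could be called as a novel isoform" if n >= MIN_SUPPORT else "below support threshold"
    print(f"{depth:.0e} reads: about {n:.0f} artifact reads ({verdict})")
```

With these made-up values the artifact is invisible at 100 million reads but gathers enough support to look like a real isoform at 10 billion, even though the error rate itself has not changed.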

The single cell transcriptome

Mercer et al. [5] emphasize the discovery of novel lncRNAs in the deep field. They found 163 novel neighboring and antisense lncRNAs around protein coding genes. In general, they found that captured lncRNAs have very low expression of only 0.011 FPKM (fragments per kilobase per million mapped reads). These findings follow on the heels of a recent comprehensive annotation of human lncRNAs [9] and together suggest that very rare transcripts may bestow individual transcriptional 'fingerprints' on cells. In fact, Mercer et al. [5] estimate that the newly discovered lncRNAs are present at an average copy number of 0.0006 transcripts per cell. This precise quantification together with evidence from CaptureSeq that the sequenced fragments are samples from complete transcripts (and not just 'noise') points towards the presence of very rare transcripts, possibly even unique to individual cells. CaptureSeq therefore motivates the development of other approaches to the enrichment of low-abundance transcripts. One complementary possibility is the further development of depletion approaches, which could selectively filter the highest-abundance transcripts before sequencing; an example is removal of ribosomal RNA already performed in many experiments [1]. Although the depletion approach may inadvertently remove lower-abundance RNAs because of cross-hybridization, it offers a genome-wide approach to enrichment. Furthermore, CaptureSeq may have a similar bias in the opposite direction: repetitive sequence in the transcriptome might lead to captured RNA from outside a targeted genomic region.
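To give a feel for the two quantities quoted above, 0.011 FPKM and 0.0006 transcripts per cell, the sketch below converts them into more tangible numbers. The FPKM formula is the standard definition; the 1 kb transcript length and the depth of 30 million mapped fragments are assumptions made for this example, as is treating the 380-fold enrichment as a simple multiplier on effective depth in the targeted regions.

```python
def expected_fragments_from_fpkm(fpkm, transcript_length_bp, mapped_fragments):
    """Expected fragments for a transcript, from the definition of FPKM
    (fragments per kilobase of transcript per million mapped fragments)."""
    return fpkm * (transcript_length_bp / 1_000) * (mapped_fragments / 1_000_000)


def cells_per_copy(copies_per_cell):
    """Average number of cells that together contain one transcript copy."""
    return 1.0 / copies_per_cell


LNCRNA_FPKM = 0.011         # expression level quoted in the text
COPIES_PER_CELL = 0.0006    # average copy number quoted in the text
ENRICHMENT_FOLD = 380       # enrichment quoted earlier in the article
ASSUMED_LENGTH_BP = 1_000   # assumption for this sketch
ASSUMED_DEPTH = 30_000_000  # assumption for this sketch

conventional = expected_fragments_from_fpkm(LNCRNA_FPKM, ASSUMED_LENGTH_BP, ASSUMED_DEPTH)
enriched = conventional * ENRICHMENT_FOLD
print(f"Conventional RNA-Seq: about {conventional:.2f} fragments expected")
print(f"With 380-fold enrichment: about {enriched:.0f} fragments expected")
print(f"One copy per roughly {cells_per_copy(COPIES_PER_CELL):.0f} cells")
```

Under these assumptions a conventional experiment would be expected to yield less than one fragment from such a transcript, which is consistent with the article's point that these RNAs only become measurable after enrichment.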

RNA CaptureSeq and beyond

Regardless of how low-abundance transcripts are detected, Mercer et al. [5] have demonstrated the extent of discovery possible in the deep field. The functional relevance of ultra low-abundance transcripts is currently debated [9,10], and the question of whether rare transcripts regulate biologically important processes or are artifacts of stochastic transcription is a key open problem. However, there is increasing recognition that antisense lncRNAs are present at many protein coding genes, including numerous proto-oncogenes, and that they regulate their associated genes via epigenetic modifications [3]. The ability to see farther into the RNA deep field with CaptureSeq is therefore likely to lead to many exciting developments in genomic medicine thanks to better understanding of the aberrant transcription underlying human disease.

Abbreviations

FPKM: fragments per kilobase per million mapped reads; lncRNA: long intergenic non-coding RNA; RNA-Seq: RNA sequencing.

Competing interests

The authors declare that they have no competing interests.

10 in total

1. RNA-Seq-quantitative measurement of expression through massively parallel RNA-sequencing.

Authors: Brian T Wilhelm; Josette-Renée Landry
Journal: Methods Date: 2009-03-29 Impact factor: 3.608

Review 2. Non-coding RNAs: regulators of disease.

Authors: Ryan J Taft; Ken C Pang; Timothy R Mercer; Marcel Dinger; John S Mattick
Journal: J Pathol Date: 2010-01 Impact factor: 7.996

3. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses.

Authors: Moran N Cabili; Cole Trapnell; Loyal Goff; Magdalena Koziol; Barbara Tazon-Vega; Aviv Regev; John L Rinn
Journal: Genes Dev Date: 2011-09-02 Impact factor: 11.361

4. Targeted RNA sequencing reveals the deep complexity of the human transcriptome.

Authors: Tim R Mercer; Daniel J Gerhardt; Marcel E Dinger; Joanna Crawford; Cole Trapnell; Jeffrey A Jeddeloh; John S Mattick; John L Rinn
Journal: Nat Biotechnol Date: 2011-11-13 Impact factor: 54.908

5. Most "dark matter" transcripts are associated with known genes.

Authors: Harm van Bakel; Corey Nislow; Benjamin J Blencowe; Timothy R Hughes
Journal: PLoS Biol Date: 2010-05-18 Impact factor: 8.029

Review 6. RNA-Seq: a revolutionary tool for transcriptomics.

Authors: Zhong Wang; Mark Gerstein; Michael Snyder
Journal: Nat Rev Genet Date: 2009-01 Impact factor: 53.242

7. Improving RNA-Seq expression estimates by correcting for fragment bias.

Authors: Adam Roberts; Cole Trapnell; Julie Donaghey; John L Rinn; Lior Pachter
Journal: Genome Biol Date: 2011-03-16 Impact factor: 13.583

8. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation.

Authors: Cole Trapnell; Brian A Williams; Geo Pertea; Ali Mortazavi; Gordon Kwan; Marijke J van Baren; Steven L Salzberg; Barbara J Wold; Lior Pachter
Journal: Nat Biotechnol Date: 2010-05-02 Impact factor: 54.908

9. Characterization and improvement of RNA-Seq precision in quantitative transcript expression profiling.

Authors: Paweł P Łabaj; Germán G Leparc; Bryan E Linggi; Lye Meng Markillie; H Steven Wiley; David P Kreil
Journal: Bioinformatics Date: 2011-07-01 Impact factor: 6.937

10. Transcript length bias in RNA-seq data confounds systems biology.

Authors: Alicia Oshlack; Matthew J Wakefield
Journal: Biol Direct Date: 2009-04-16 Impact factor: 4.540

