Roland J Siezen1, Greer Wilson, Tilman Todt. 1. Kluyver Centre for Genomics of Industrial Fermentation, 2600GA Delft, The Netherlands. siezen@cmbi.ru.nl
Hybridization to microarrays has been the standard for genome‐wide transcriptome analyses of prokaryotes in the past 10 years. Microarrays have several limitations, however, among which are a small dynamic range for detection of transcript levels due to problems with saturation, background noise, spot density and spot quality. Moreover, comparing different experiments requires complex normalization methods (Hinton ) and comparing different strains requires designing pangenome arrays based on multiple sequenced genomes, leading to further problems in non‐specific or cross‐hybridization and complicated data analysis (Bayjanov ). Most microarrays have a biased genome coverage, as they only contain a limited number of short probes for known or expected genes in sequenced genomes, and they rarely probe intergenic regions. Technological advances in array production and dropping costs have recently led to the design and use of high‐density tiling arrays based on overlapping short oligonucleotides covering both strands of entire genomes (Selinger ; Mcgrath ; Rasmussen ; Toledo‐Arana ). Tiling array and other studies have provided a first insight into far more complex transcriptomes than previously envisioned, including an ever‐expanding range of regulatory RNAs (Waters and Storz, 2009). To overcome the remaining limitations of microarrays, a totally new approach to whole‐transcriptome analysis was needed – and a much‐awaited breakthrough in DNA sequencing came to the rescue. Here, we describe the first whole‐transcriptome applications in prokaryotes and discover that a new treasure chest of regulation in prokaryotes is being opened.
Whole‐transcriptome sequencing
With the dawn of next generation (or deep) sequencing technologies in recent years (Ansorge, 2009; Metzker, 2010), their application to high‐depth sequencing of whole transcriptomes, a technique now referred to as RNA‐seq, has been explored (Morozova ; Wang ; Wilhelm and Landry, 2009). RNA‐seq requires a conversion of mRNA into cDNA by reverse transcription, followed by deep sequencing of this cDNA (Fig. 1A). RNA‐seq was initially only used for analysing eukaryotic mRNA, as prokaryote mRNA is less stable and lacks the poly(A) tail that is used for enrichment and reverse transcription priming in eukaryotes. But these technological difficulties are being overcome, as various methods for enrichment of prokaryote mRNA and appropriate cDNA library construction protocols have been developed, some generating strand‐specific libraries which provide valuable information about the orientation of transcripts.
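As a rough illustration of why strand-specific libraries are informative, the sketch below builds a per-base read-depth histogram for each strand from already-mapped reads. The function name `strand_coverage`, the read tuple layout and the toy coordinates are assumptions made for this example only; they do not correspond to any particular pipeline used in the studies cited here.

```python
def strand_coverage(reads, genome_length):
    """Return per-base read depth for each strand.

    reads: iterable of (start, end, strand) tuples with 0-based,
           end-exclusive coordinates and strand '+' or '-'
           (hypothetical input format).
    """
    depth = {"+": [0] * genome_length, "-": [0] * genome_length}
    for start, end, strand in reads:
        # Clip reads to the genome so out-of-range coordinates are ignored.
        for pos in range(max(start, 0), min(end, genome_length)):
            depth[strand][pos] += 1
    return depth


# Toy example: two overlapping forward reads and one reverse read.
reads = [(0, 4, "+"), (2, 6, "+"), (3, 7, "-")]
cov = strand_coverage(reads, 8)
print(cov["+"])
print(cov["-"])
```

Keeping the two strands in separate histograms is what makes it possible to tell a sense transcript from an antisense transcript covering the same locus; a non-strand-specific library would merge both into one profile.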
Figure 1
(Left panel) Flow diagram of the steps involved in microbial transcriptome sequencing. The starting material is a mix of RNA, followed by optional subtraction of tRNA and rRNA, generation of cDNA libraries, sequencing, bioinformatics and interpretation of cDNA sequencing read histograms. (Right panel) Schematic representation of transcriptome sequencing histograms. Examples are shown of monocistronic and polycistronic mRNAs, non‐coding RNA, cis‐acting RNAs, and antisense RNA. Black filled arrows represent annotated ORFs. Reprinted from van Vliet (2010). Copyright 2009, FEMS and Blackwell Publishing Ltd.
In June 2008, the first reports appeared of RNA sequencing of whole microbial transcriptomes, i.e. the yeasts Saccharomyces cerevisiae (Nagalakshmi ) and Schizosaccharomyces pombe (Wilhelm ). Both studies demonstrated that most of the non‐repetitive sequence of the yeast genome is transcribed, and provided detailed information on novel genes, introns and their boundaries, 3′ and 5′ boundary mapping, 3′ end heterogeneity and overlapping genes, antisense RNA and more. Starting in 2009, several examples have been reported of prokaryote whole‐transcriptome analysis using tiling arrays and/or RNA‐seq, and these are summarized in Table 1. The first reviews of prokaryote transcriptome sequencing have just appeared (Croucher ; van Vliet and Wren, 2009; Sorek and Cossart, 2010; van Vliet, 2010).
Table 1
Whole‐transcriptome analysis of microbes.
Enriched for only sRNAs of 14–200 nt.
TA, tiling array; RNAseq, cDNA sequencing; ss, strand‐specific; ncRNA, non‐coding RNA.
Novel general features discovered
Numerous new insights into genomic elements, gene expression and complexity of regulation are emerging from these new high‐throughput and high‐resolution studies of microbial transcriptomes (Fig. 1B).
Gene structure/length, novel genes
Gene annotation has always been fraught with difficulties and is not a trivial exercise. Most gene‐finding algorithms miss or mis‐annotate small protein‐encoding genes and non‐coding RNAs (together called sRNAs), but tiling arrays and RNA‐seq can readily identify these genes (Figs 2 and 3). The high resolution of these techniques allows transcription start sites (TSS) to be mapped with single‐base‐pair resolution. Moreover, gene structure can be corrected (Table 1), as many gene starts are found to be downstream of the automatically predicted start of the largest possible ORF, e.g. in Sulfolobus solfataricus (Wurtzel ).
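The logic behind such start-site corrections can be sketched in a few lines: a start codon that lies upstream of the mapped TSS is not part of the transcript, so the annotated start is moved to the first in-frame start codon at or downstream of the TSS. This is a minimal sketch, not the procedure used in the cited study; the function name `corrected_start`, the set of accepted start codons (ATG, GTG, TTG) and the toy sequence are assumptions for illustration, and only forward-strand genes are handled.

```python
# Start codons accepted here are an assumption for this sketch.
START_CODONS = {"ATG", "GTG", "TTG"}


def corrected_start(seq, annotated_start, stop, tss):
    """Return the first in-frame start codon at or downstream of the TSS.

    Coordinates are 0-based positions on the forward strand; `stop` is the
    index of the first base of the stop codon. Returns None when no in-frame
    start codon lies between the TSS and the stop codon.
    """
    if tss <= annotated_start:
        return annotated_start  # annotated start is transcribed; keep it
    # Smallest multiple of 3 that reaches the TSS, to stay in frame.
    offset = (tss - annotated_start + 2) // 3 * 3
    for pos in range(annotated_start + offset, stop, 3):
        if seq[pos:pos + 3] in START_CODONS:
            return pos
    return None


# Toy ORF: ATG AAA GGG ATG CCC TTT TAA, with a mapped TSS at position 5.
seq = "ATGAAAGGGATGCCCTTTTAA"
new_start = corrected_start(seq, annotated_start=0, stop=18, tss=5)
print(new_start)
```

In this toy case the predicted ORF starts upstream of the transcript, so the start moves to the internal ATG and the gene becomes 9 bp shorter, analogous in kind to the shortening reported for SSO0451 in Fig. 3C.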
Figure 2
Transcriptome structure in H. salinarum determined with high‐density tiling arrays (60‐mer overlapping probes). Segment of genome map with signal intensity of total RNA is shown. Each blue dot represents probe intensity (in log2 scale) in the forward (upper panel) or reverse strand (lower panel). The overlaid red line is the result of a segmentation algorithm that was applied to determine transcription start sites (TSS and black arrows), transcription termination sites (TTS), untranslated regions in mRNAs (3′ UTR) and putative non‐coding RNAs. Reprinted and adapted from Koide ). Copyright 2009, EMBO and Macmillan Publishers Limited.
Figure 3
The structure of the S. solfataricus transcriptome determined by RNA‐seq. A. Core promoter. B. Distribution of mapped TSS (transcription start site) positions relative to the ORF ATG codon. C. Example of correction of gene annotations. Transcriptome data indicate that gene SSO0451 actually is 228 bp shorter, and that a new small gene is encoded on the reverse strand. D. Refinement of operon definition. Transcriptome data show either 2 or 3 separate transcriptional units (TU), instead of the predicted 1 TU. Red arrow indicates TSS on forward strand, and blue arrows indicate TSS on reverse strand. Reprinted from Wurtzel ). Copyright 2009, Cold Spring Harbor Laboratory Press.
Untranslated regions
Whole‐transcriptome mapping can identify contiguous expression extending into flanking regions of a protein‐encoding gene, indicative of 5′ or 3′ untranslated regions (UTRs). Long 5′ UTRs are often indicative of upstream regulatory elements, such as riboswitches (Toledo‐Arana ). Archaea have much shorter or no 5′ UTRs compared with bacteria (Koide ; Wurtzel ), suggesting alternative modes of regulation. Long 3′ UTRs could affect expression of downstream genes or genes on the opposite strand, as found in archaea (Brenneis and Soppa, 2009).
Operon structures
Whole‐transcriptome data allow operons to be better defined, and the first experimentally determined operon maps show that 60–70% of bacterial genes are transcribed as operons, but only 30–40% in archaea. Staircase‐like expression within operons appears to be common (Guell ).
Whole‐transcriptome analysis of Mycoplasma pneumoniae, using a combination of tiling arrays, deep sequencing and 137 different growth conditions, showed that there is context‐dependent modulation of operon structure (Guell ). This involves repression or activation of operon‐internal genes as well as genes located at the operon ends, which adds a whole new level of complexity to gene regulation. Similar ‘conditional operons’ were found in Halobacterium salinarum (Koide ).
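One simple way to derive an operon map from read-depth data is to join neighbouring genes on the same strand whenever the intergenic region between them is continuously covered by reads. The sketch below illustrates that idea under stated assumptions: the function name `transcriptional_units`, the `min_depth` threshold of 5 reads and the toy data are hypothetical, and the published studies used more elaborate segmentation methods.

```python
def transcriptional_units(genes, depth, min_depth=5):
    """Group adjacent same-strand genes into transcriptional units.

    genes: list of (name, start, end, strand), sorted by start,
           0-based and end-exclusive (hypothetical input format).
    depth: dict mapping strand -> list of per-base read depth.
    Two neighbouring genes are joined when they share a strand and every
    intergenic base between them is covered by at least `min_depth` reads.
    """
    units = []
    for gene in genes:
        _, start, _, strand = gene
        if units:
            prev = units[-1][-1]
            if prev[3] == strand and all(
                d >= min_depth for d in depth[strand][prev[2]:start]
            ):
                units[-1].append(gene)
                continue
        units.append([gene])
    return [[g[0] for g in unit] for unit in units]


# Toy data: reads span the a-b gap but not the b-c gap.
genes = [("a", 0, 3, "+"), ("b", 5, 8, "+"), ("c", 10, 13, "+")]
depth = {"+": [9, 9, 9, 8, 8, 9, 9, 9, 0, 0, 7, 7, 7]}
units = transcriptional_units(genes, depth)
print(units)
```

A single fixed threshold cannot capture staircase-like expression or condition-dependent operon structure; running the same grouping on data from different growth conditions and comparing the resulting units would be one way to look for such conditional operons.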
Non‐coding RNAs
Non‐coding RNAs (ncRNA), typically 50–500 nt long, can play important regulatory roles in prokaryotic physiology, such as virulence, stress response and quorum sensing. These ncRNAs have been largely overlooked in prokaryote genome annotation, since they are very difficult to detect with existing gene‐prediction software (Meyer, 2008; Livny and Waldor, 2009). Many act by base pairing with the 5′ UTR of a target mRNA, resulting in inhibition of translation or mRNA degradation. Whole‐transcriptome analysis of several prokaryotes has now identified large numbers of ncRNAs (Table 1), some of which are induced during niche switching, such as in Burkholderia cenocepacia (Yoder‐Himes ).
Antisense RNA
Cis‐antisense RNA was previously thought to be extremely rare in prokaryotes, but whole‐transcriptome analysis has recently detected hundreds of antisense transcripts in bacteria and archaea (Table 1). Some of these have been experimentally shown to downregulate their sense counterparts (Toledo‐Arana ). This is an area in which much is still to be discovered, as cis‐antisense may be a common form of regulation in prokaryotes.
Validation and comparing techniques
The ultimate goal is to obtain a complete and bias‐free view of microbial transcriptomes. The question remains to what extent RNA‐seq can provide such a view. Clearly, RNA‐seq has a number of advantages over microarray technology, since it offers both single‐base resolution and high mapping resolution. RNA‐seq is especially suited to identifying novel transcripts, alternative splice variants and non‐coding RNA (Marioni ; Mortazavi ; Nagalakshmi ; Wilhelm ).
However, some studies indicate that RNA‐seq is also not bias‐free (Marioni ; Mortazavi ). In recent studies that compared expression levels measured using both (tiling) microarrays and RNA‐seq, the two technologies showed reasonably good correlation (ranging from 0.62 to 0.75) (Marioni ; Mortazavi ; Fu ), especially when the comparison was restricted to protein‐coding gene loci (Sasidharan ). To compare expression levels from tiling microarrays and RNA‐seq, one has to consider the different data types of the two technologies. The outcome of the comparison may depend on the procedure applied to convert continuous expression levels from tiling microarrays into a ‘digital’ signal (Sasidharan ). Correlating expression levels from both technologies with proteomics data shows that RNA‐seq provides a better estimate not only of absolute transcript levels but also of protein levels (Fu ).
As demonstrated in a recent study on M. pneumoniae, combining various experimental data types can provide a more complete view of a transcriptome than using tiling arrays or RNA‐seq alone (Guell ). The authors report that in some cases (in particular for lowly expressed genes), RNA‐seq data alone were not sufficient to unambiguously define operon boundaries. However, the single‐base resolution of RNA‐seq allows more precise prediction of promoter locations (Guell ).
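A cross-platform comparison of the kind described above can be sketched as follows: read counts are first normalized for gene length and sequencing depth, log-transformed, and then correlated with log-scale array intensities. This is a minimal sketch under explicit assumptions. The cited studies do not all use the same measure, so the choice of a Pearson coefficient, the RPKM-style normalization with a pseudocount of 1, the function names and all numbers below are illustrative only.

```python
import math


def log2_rpkm(count, gene_length, total_reads):
    """log2 of reads per kilobase per million mapped reads, plus 1.

    The pseudocount of 1 (an assumption) keeps genes with zero reads finite.
    """
    rpkm = count * 1e9 / (gene_length * total_reads)
    return math.log2(rpkm + 1)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical data for four genes measured on both platforms.
array_log2 = [6.1, 8.4, 10.2, 12.0]   # log2 array intensities
counts = [12, 150, 900, 5200]         # mapped RNA-seq reads per gene
lengths = [900, 1200, 1500, 1000]     # gene lengths in bp
total_reads = 2_000_000               # total mapped reads

seq_log2 = [log2_rpkm(c, l, total_reads) for c, l in zip(counts, lengths)]
r = pearson(array_log2, seq_log2)
print(round(r, 2))
```

Because array intensities saturate at the high end and sit on background noise at the low end, a rank-based coefficient may be a reasonable alternative to Pearson when the relationship between the two platforms is monotonic but not linear.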
Future
Deep RNA sequencing provides clear advantages over conventional (tiling) microarray technology. It allows transcriptome analysis of the entire nucleotide sequence of the genome, it is very sensitive, it offers a large dynamic range, and it allows accurate determination of boundaries (e.g. TSS, 3′/5′ ends, exons). However, RNA‐seq is not completely bias‐free. Nearly all studies to date have used some sort of enrichment procedure for mRNA, inherently leading to some bias. In many recent studies this enrichment step is being skipped, as the enormous volume of cDNA sequence data holds enough information, even if mRNA comprises only a few per cent of the total RNA. Just throw away 95–98% of your sequence data!
The conversion of RNA into complementary DNA (cDNA) may also lead to bias. Recently, a new method was developed that measures RNA levels directly without this conversion step (Ozsolak ). The method is based on direct sequencing of RNA and is an extension of single‐molecule DNA sequencing technology (Braslavsky ; Harris ). The direct method uses RNA directly as a template for nucleotide incorporation by a modified DNA polymerase with reverse transcriptase activity. Under optimal conditions the method yields sequences in the range of 20–40 nucleotides in length, with a total raw base error rate of approximately 4%. These read lengths and error rates are sufficient to align sequences to reference genomes (Ozsolak ).
What does the future hold for sequencing and RNA‐seq? There is no doubt that the revolution that has occurred in our ability to sequence and profile RNA from the days of a single ‘Southern blot’ to microarray RNA dot‐blot hybridization and Q‐PCR to RNA‐seq has been exciting, informative and rapid. In the future we will need to miniaturize as we move to single‐cell sequencing and transcriptomics. How will this be achieved?
IBM is working on nanotechnology (‘The DNA transistor’; for a video see http://www.youtube.com/watch?v=wvclP3GySUY) to enable even more rapid, accurate and cheap genome sequencing (patent US200828191A1). DNA, or in fact any charged polymer, can be made to move through nanopores, and detection of the bases moving through the pore is possible. In fact the DNA moves through the pore too quickly and needs to be slowed down to be readable. So in the not too distant future, we may see that the genome sequence, transcriptome and regulome of a single cell will all be determined before the first coffee break of the day.
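The numbers quoted in this section lend themselves to a quick back-of-envelope check. The sketch below estimates how many reads remain when enrichment is skipped and 95–98% of the data are discarded, and how many errors to expect in a 20–40 nt direct RNA sequencing read at a 4% raw error rate. The run size of 10 million reads, the 36 nt read length and the 2 Mb of transcribed sequence are hypothetical values chosen for illustration, and the error model assumes independent errors at every base.

```python
def usable_reads(total_reads, mrna_fraction):
    """Reads left after discarding everything that is not mRNA."""
    return round(total_reads * mrna_fraction)


def mean_coverage(n_reads, read_length, transcribed_bases):
    """Average number of reads covering each transcribed base."""
    return n_reads * read_length / transcribed_bases


def expected_errors(read_length, error_rate):
    """Expected number of miscalled bases in one read."""
    return read_length * error_rate


def error_free_probability(read_length, error_rate):
    """Chance that a read contains no error (independent-error assumption)."""
    return (1 - error_rate) ** read_length


# Hypothetical run: 10 million reads of 36 nt, 2 Mb of transcribed sequence.
for fraction in (0.02, 0.05):
    kept = usable_reads(10_000_000, fraction)
    print(fraction, kept, mean_coverage(kept, 36, 2_000_000))

# Direct RNA sequencing: 20-40 nt reads at a 4% raw base error rate.
for length in (20, 40):
    print(length, expected_errors(length, 0.04),
          round(error_free_probability(length, 0.04), 3))
```

Under these assumptions a few hundred thousand mRNA reads survive, which still gives several-fold average coverage of a small prokaryotic transcriptome, and a short read carries on average about one to two errors, few enough for a mismatch-tolerant aligner to place it on a reference genome.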