Marc T J Johnson, Eric J Carpenter, Zhijian Tian, Richard Bruskiewich, Jason N Burris, Charlotte T Carrigan, Mark W Chase, Neil D Clarke, Sarah Covshoff, Claude W Depamphilis, Patrick P Edger, Falicia Goh, Sean Graham, Stephan Greiner, Julian M Hibberd, Ingrid Jordon-Thaden, Toni M Kutchan, James Leebens-Mack, Michael Melkonian, Nicholas Miles, Henrietta Myburg, Jordan Patterson, J Chris Pires, Paula Ralph, Megan Rolf, Rowan F Sage, Douglas Soltis, Pamela Soltis, Dennis Stevenson, C Neal Stewart, Barbara Surek, Christina J M Thomsen, Juan Carlos Villarreal, Xiaolei Wu, Yong Zhang, Michael K Deyholos, Gane Ka-Shu Wong.
Abstract
Next-generation sequencing plays a central role in the characterization and quantification of transcriptomes. Although numerous metrics are purported to quantify the quality of RNA, there have been no large-scale empirical evaluations of the major determinants of sequencing success. We used a combination of existing and newly developed methods to isolate total RNA from 1115 samples from 695 plant species in 324 families, which represents >900 million years of phylogenetic diversity from green algae through flowering plants, including many plants of economic importance. We then sequenced 629 of these samples on Illumina GAIIx and HiSeq platforms and performed a large comparative analysis to identify predictors of RNA quality and the diversity of putative genes (scaffolds) expressed within samples. Tissue types (e.g., leaf vs. flower) varied in RNA quality, sequencing depth and the number of scaffolds. Tissue age also influenced RNA quality but not the number of scaffolds ≥ 1000 bp. Overall, 36% of the variation in the number of scaffolds was explained by metrics of RNA integrity (RIN score), RNA purity (OD 260/230), sequencing platform (GAIIx vs. HiSeq) and the amount of total RNA used for sequencing. However, our results show that the most commonly used measures of RNA quality (e.g., RIN) are weak predictors of the number of scaffolds because Illumina sequencing is robust to variation in RNA quality. These results provide novel insight into the methods that are most important in isolating high-quality RNA for sequencing and assembling plant transcriptomes. The methods and recommendations provided here could increase the efficiency and decrease the cost of RNA sequencing for individual labs and genome centers.
Year: 2012 PMID: 23185583 PMCID: PMC3504007 DOI: 10.1371/journal.pone.0050226
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Example of the qualitative and quantitative results from Agilent’s Bioanalyzer 2100.
In panels (A) and (B) we show peaks produced from electropherograms (top left) that depict the size distribution of RNA fragments, the corresponding gel-like image of RNA fragments (top right), and metrics of RNA concentration and integrity (r26S/18S and RIN). We show two representative samples with (A) high-quality RNA in terms of high yield and minimal degradation (r26S/18S ≥1, RIN ≥8), and (B) low-quality RNA in terms of modest yield and high degradation (r26S/18S <1, RIN <5). Absence of clear peaks at 26S and 18S and an abundance of short fragments clustering on the left of the electropherogram in panel (B) are the hallmarks of severe RNA degradation. In electropherograms, ribosomal 26S (large) and 18S (small) subunits are shown in green and pink, respectively. Concentrations of 26S and 18S were calculated by taking the area above the green and pink straight lines at the base of the 26S and 18S peaks, respectively.
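The caption states that 26S and 18S concentrations were taken as the area above a straight line drawn at the base of each peak. A minimal Python sketch of that idea follows, assuming each peak is available as paired time and signal values and that the baseline joins the first and last points of the peak. The function names and the example traces are hypothetical and are not taken from the Bioanalyzer software or from the study's data.

```python
def area_above_baseline(times, signal):
    """Trapezoidal area between a peak trace and the straight line
    joining its first and last points (the assumed baseline)."""
    if len(times) != len(signal) or len(times) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    t0, t1 = times[0], times[-1]
    y0, y1 = signal[0], signal[-1]
    slope = (y1 - y0) / (t1 - t0)

    def height(i):
        # Signal minus the baseline value at the same time point.
        return signal[i] - (y0 + slope * (times[i] - t0))

    area = 0.0
    for i in range(len(times) - 1):
        dt = times[i + 1] - times[i]
        area += 0.5 * dt * (height(i) + height(i + 1))
    return area


def rrna_ratio(peak_26s, peak_18s):
    """Ratio of 26S to 18S peak areas; each argument is a (times, signal) pair."""
    return area_above_baseline(*peak_26s) / area_above_baseline(*peak_18s)


# Made-up traces, for illustration only (not data from the study).
peak_18s = ([40.0, 41.0, 42.0], [1.0, 3.0, 1.0])
peak_26s = ([46.0, 47.0, 48.0], [1.0, 5.0, 1.0])
ratio = rrna_ratio(peak_26s, peak_18s)
print(ratio)
```

Under these assumptions, a ratio of at least 1 would fall on the high-quality side of the r26S/18S values quoted for panel (A).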
Figure 2. Schematic representation of the method used to assemble Illumina reads into contigs, and contigs into scaffolds.
All reads were initially assembled into contigs using the de Bruijn graph method, without using information about paired-end reads (shown by blue dashed lines). A contig’s sequence was resolved at every base. Contigs were then joined into longer scaffolds wherever the two reads of a pair had been assembled into separate contigs. Assembling scaffolds in this way allowed us to create longer sequences of known length, although some scaffolds contained gaps of unknown sequence. These gaps were constrained to represent <5% of total sequence length.
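The contig step described above can be illustrated with a toy de Bruijn graph. This is a minimal Python sketch under simplifying assumptions (error-free reads, a single strand, a hypothetical k of 4) and is not the assembler used in the study. The function names `build_de_bruijn` and `walk_contig` and the example reads are invented for illustration.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Return a graph mapping each (k-1)-mer to the set of (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_contig(graph, start):
    """Extend a contig from `start` while the next node is unambiguous."""
    in_degree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            in_degree[node] += 1

    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if in_degree[nxt] != 1 or nxt in seen:
            break  # stop at a merge point or a cycle
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Three overlapping toy reads (hypothetical, not data from the study).
reads = ["ATGGCG", "GGCGTA", "CGTACC"]
graph = build_de_bruijn(reads, k=4)
contig = walk_contig(graph, "ATG")
print(contig)  # ATGGCGTACC
```

Every base of the resulting contig is resolved because the walk stops at the first branch. Joining contigs across read pairs, which is where gaps of unknown sequence can appear, is a separate step that this sketch does not cover.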
Figure 3. Variation among tissue types in RNA quality and transcriptome size.
We observed differences among tissue types for (A) total RNA mass (µg) isolated, (B) RIN, (C) sequencing depth and (D) number of scaffolds. For each tissue, we show the mean +1 SE and sample size at the base of columns. A posteriori pairwise contrasts among means corrected for multiple comparisons are shown in Supplemental Tables 3 and 5.
Table 1. Effects of tissue type and age on metrics of RNA quality and sequencing.
| Response variable | Tissue type: ndf, ddf | Tissue type: F | Tissue type: P | Tissue age: ndf, ddf | Tissue age: F | Tissue age: P |
|---|---|---|---|---|---|---|
| RNA mass | 7,24 | 3.77 |  | 1,40 | 1.36 | 0.25 |
| r26S/18S | 7,1071 | 10.06 |  | 1,41 | 4.56 | 0.33 |
| RIN | 7,1061 | 5.14 |  | 1,41 | 48.91 |  |
| OD 260/280 | 6,503 | 3.16 |  | 1,40 | 0.58 | 0.45 |
| OD 260/230 | 6,383 | 1.30 | 0.26 | – | – | – |
| Bases | 7,576 | 7.95 |  | 1,31 | 2.63 | 0.12 |
| Q20 bases | 7,576 | 13.23 |  | 1,31 | 2.66 | 0.11 |
| Reads | 7,576 | 10.98 |  | 1,31 | 2.54 | 0.12 |
| Scaffolds | 7,574 | 2.33 |  | 1,31 | 0.87 | 0.36 |
Significant effects (P<0.05) are shown in bold.
RNA mass: measured as µg of total RNA isolated from a given tissue.
ndf: numerator degrees of freedom of the F-statistic.
ddf: denominator degrees of freedom of the F-statistic. ddf are low for RNA mass because an unequal-variance model was used to account for heteroscedasticity in residuals among tissues.
F: F-statistic from analysis of variance (ANOVA).
P: P-value of the F-statistic given ndf and ddf.
Table 2. Statistical significance of explanatory variables in the best-fitting models for the data sets with and without OD ratios.
| Variable | df | F | P | Partial r² |
|---|---|---|---|---|
| Data set with OD ratios |  |  |  |  |
| Tissue type | 2,174 | 0.23 | 0.791 | 0.002 |
| Sequencing platform | 1,174 | 59.27 |  | 0.219 |
| r26S/18S | 1,174 | 0.90 | 0.344 | 0.003 |
| RIN | 1,174 | 9.38 |  | 0.035 |
| RNA seq | 1,174 | 8.11 |  | 0.030 |
| OD 260/280 | 1,174 | 3.46 | 0.065 | 0.013 |
| OD 260/230 | 1,174 | 14.69 |  | 0.054 |
| Data set without OD ratios |  |  |  |  |
| Tissue type | 7,499 | 0.28 | 0.96 | 0.003 |
| Sequencing platform | 2,499 | 30.35 |  | 0.104 |
| r26S/18S | 1,499 | 1.00 | 0.318 | 0.002 |
| RIN | 1,499 | 8.08 |  | 0.014 |
| RNA seq | 1,499 | 12.77 |  | 0.022 |
The best-fitting models were determined by comparing AIC values among models that considered all possible combinations of explanatory variables. Statistical significance was determined using an ANOVA model with type III sums of squares (SS). Variables with P<0.05 are shown in bold. Partial r² values (coefficient of determination) were determined by dividing the SS value of each factor by the total SS.
df: numerator (first number) and denominator (second number) degrees of freedom for the F-test.
RIN: RNA integrity number.
RNA seq: mass of total RNA sequenced.
Other abbreviations as per Table 1.
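As a worked check on the partial r² values in Table 2, the short Python sketch below adds them up for each data set. Treating the sum of partial r² values as the total variation explained is an assumption made here for illustration; with type III sums of squares the two need not coincide exactly. The helper names are hypothetical.

```python
# Partial r² values as reported in Table 2 (data set with OD ratios).
partial_r2_with_od = {
    "Tissue type": 0.002,
    "Sequencing platform": 0.219,
    "r26S/18S": 0.003,
    "RIN": 0.035,
    "RNA seq": 0.030,
    "OD 260/280": 0.013,
    "OD 260/230": 0.054,
}

# Partial r² values as reported in Table 2 (data set without OD ratios).
partial_r2_without_od = {
    "Tissue type": 0.003,
    "Sequencing platform": 0.104,
    "r26S/18S": 0.002,
    "RIN": 0.014,
    "RNA seq": 0.022,
}


def partial_r2(ss_factor, ss_total):
    """Partial r² as defined in the table note: factor SS divided by total SS."""
    return ss_factor / ss_total


def summed_r2(table):
    """Sum of partial r² values (assumed here to approximate total variation explained)."""
    return sum(table.values())


total_with_od = round(summed_r2(partial_r2_with_od), 3)
total_without_od = round(summed_r2(partial_r2_without_od), 3)
print(total_with_od, total_without_od)  # 0.356 0.145
```

The summed value for the data set with OD ratios, 0.356, is in line with the 36% of variation in scaffold number quoted in the abstract, and sequencing platform alone accounts for 0.219 of it.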
Figure 4. Factors that significantly predicted the number of large scaffolds.
Among our measures of RNA quality, (A) RNA integrity number (RIN) and (B) OD 260/230 ratio were the strongest predictors of the number of scaffolds ≥1000 bp. (C) Sequencing platform also had a strong effect on the number of large scaffolds (P<0.001, Table 2; numbers at the base of bars show sample size), and (D) mass of RNA sequenced had a weak but detectable effect (see Table 2). Note that for most samples we used 20, 30 or 40 µg of total RNA for sequencing, but a few samples used intermediate or lower amounts.