Jack A Gilbert, Dawn Field, Ying Huang, Rob Edwards, Weizhong Li, Paul Gilna, Ian Joint.
Abstract
BACKGROUND: Sequencing the expressed genetic information of an ecosystem (metatranscriptome) can provide information about the response of organisms to varying environmental conditions. Until recently, metatranscriptomics has been limited to microarray technology and random cloning methodologies. The application of high-throughput sequencing technology is now enabling access to both known and previously unknown transcripts in natural communities.
Year: 2008 PMID: 18725995 PMCID: PMC2518522 DOI: 10.1371/journal.pone.0003042
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Comparison of DNA and mRNA from samples from mid- and post-phytoplankton bloom.
| | Mid-Bloom | | | | Post-Bloom | | | | Combined data | | |
| | DNA-High CO2 | mRNA-High CO2 | DNA-Present Day | mRNA-Present Day | DNA-High CO2 | mRNA-High CO2 | DNA-Present Day | mRNA-Present Day | All DNA | All mRNA | All samples |
| Total size (bp) | 47,289,282 | | 30,991,689 | | 59,316,369 | | 68,187,679 | | 205,784,939 | | 323,161,989 |
| Total No. of reads | 209,073 | | 134,915 | | 344,216 | | 304,020 | | 992,224 | | 1,498,577 |
| Average length (bp) | 226 | | 229 | | 172 | | 224 | | 207 | | 215 |
| % of rRNA genes | 0.33 | | 0.31 | | 0.17 | | 0.24 | | 0.25 | | 0.16 |
| Absolute number of unique nucleotide sequence clusters | 170,580 | | 112,459 | | 257,375 | | 232,729 | | 630,159 | | 723,050 |
| Normalized number of clusters | 86,791 | | 84,096 | | 86,996 | | 87,112 | | n/a | | n/a |
| Total number of reads in top cluster | 12 | | 23 | | 19 | | 20 | | 36 | | 4,866 |
| Clustering: 1 sequence | 141,340 | | 94,386 | | 200,569 | | 183,028 | | 437,149 | | 494,604 |
| Clustering: 2–9 sequences | 29,232 | | 18,072 | | 56,729 | | 49,681 | | 191,822 | | 224,198 |
| Clustering: 10–99 sequences | 8 | | 1 | | 77 | | 20 | | 1,188 | | 3,639 |
| Clustering: 100+ sequences | 0 | | 0 | | 0 | | 0 | | 0 | | 609 |
| SEED Subsystem hits | 130,567 | | 120,141 | | 161,789 | | 175,477 | | n/a | | n/a |
| Total pORFs | 419,565 | | 279,061 | | 532,373 | | 637,896 | | 1,868,895 | | 3,026,200 |
| Unique pORFs at 95% | 358,705 | | 242,317 | | 435,876 | | 515,266 | | 1,340,241 | | 1,571,348 |
| Protein clusters | 321,839 | | 223,888 | | 382,762 | | 452,754 | | 1,083,644 | | 1,228,601 |
| Protein clusters (dominant, ≥10 non-redundant sequences) | 11 | | 3 | | 29 | | 31 | | 695 | | 2,029 |
| PFAM | 9 | | 1 | | 14 | | 13 | | 379 | | 571 |
| TIGRfam | 9 | | 1 | | 12 | | 7 | | 366 | | 476 |
| COG | 10 | | 1 | | 18 | | 17 | | 431 | | 572 |
| Number of novel protein clusters | 0 | | 2 | | 9 | | 12 | | 202 | | 1,287 |
Size, clustering and annotation data were generated by CAMERA. rRNA and subsystem hits were generated by SEED.
% of rRNA genes: analysis of sequences against the Ribosomal Database Project II (RDP-II) and the European Ribosomal large subunit (LSU) dataset.
Absolute number of unique nucleotide sequence clusters: based on clustering at 95% identity over 80% of the length of a sequence and over 120 bp.
Normalized number of clusters: for direct comparison of samples, individual rarefaction analysis, using R (http://www.r-project.org/) and the vegan package (http://cc.oulu.fi/jarioksa/softhelp/vegan.html), was used to estimate the number of clusters in each sample after adjusting the sample size of each dataset to be equivalent to the number of reads in the smallest dataset (mRNA – High CO2).
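The rarefaction step can be illustrated with the analytical (hypergeometric) rarefaction formula, which gives the expected number of clusters in a random subsample of reads. The sketch below is written in Python rather than R and is not the vegan implementation; the function name `rarefied_cluster_count` and the toy cluster sizes are assumptions made purely for illustration.

```python
from math import exp, lgamma


def _log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def rarefied_cluster_count(cluster_sizes, subsample_size):
    """Expected number of clusters observed when `subsample_size` reads are
    drawn without replacement from a sample whose clusters contain
    `cluster_sizes` reads each."""
    total = sum(cluster_sizes)
    if not 0 <= subsample_size <= total:
        raise ValueError("subsample_size must lie between 0 and the total read count")
    log_all = _log_choose(total, subsample_size)
    expected = 0.0
    for size in cluster_sizes:
        remaining = total - size
        if remaining < subsample_size:
            # Every possible subsample must contain at least one read of this cluster.
            expected += 1.0
        else:
            # Probability that none of the drawn reads belongs to this cluster.
            p_absent = exp(_log_choose(remaining, subsample_size) - log_all)
            expected += 1.0 - p_absent
    return expected


# Hypothetical sample: four clusters holding 5, 3, 1 and 1 reads.
toy_sizes = [5, 3, 1, 1]
print(rarefied_cluster_count(toy_sizes, 10))  # the full sample recovers every cluster
print(rarefied_cluster_count(toy_sizes, 4))   # a smaller subsample is expected to see fewer
```

Applying the same subsample size to every dataset is what makes the resulting cluster counts comparable across samples of different sequencing depth.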
Total pORFs: partial open reading frames (pORFs) were derived from a six-reading-frame translation of all reads using translation table 11, starting at the beginning of a read or at the first ATG after the previous stop codon, ending at the end of a read or at a stop codon, and comprising at least 30 contiguous amino acids.
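The pORF rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the CAMERA pipeline: the function names (`translate`, `porfs_in_frame`, `find_porfs`) are hypothetical, only ATG is treated as a start codon (as the note states, although table 11 also permits alternative starts), and codons containing ambiguous bases are translated as `X`.

```python
from itertools import product

# Codon -> amino acid map shared by NCBI translation tables 1 and 11
# (the two tables differ in permitted start codons, not in this mapping).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons of `dna`; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def porfs_in_frame(protein, min_len=30):
    """Split one translated frame into pORFs of at least `min_len` residues."""
    found = []
    for index, segment in enumerate(protein.split("*")):
        if index > 0:
            # After a stop codon, a pORF starts at the first ATG (M).
            start = segment.find("M")
            if start == -1:
                continue
            segment = segment[start:]
        # The first segment starts at the beginning of the read; every
        # segment ends at a stop codon or at the end of the read.
        if len(segment) >= min_len:
            found.append(segment)
    return found


def find_porfs(read, min_len=30):
    """Collect pORFs from all six reading frames of a read."""
    read = read.upper()
    strands = (read, read.translate(COMPLEMENT)[::-1])  # forward, reverse complement
    results = []
    for strand in strands:
        for offset in range(3):
            results.extend(porfs_in_frame(translate(strand[offset:]), min_len))
    return results


# Hypothetical read: 30 alanine codons followed by a stop codon.
toy_read = "GCT" * 30 + "TAA"
print(find_porfs(toy_read)[0])
```

Because a pORF may begin at the start of a read and end at its end, fragments of genes that are only partly covered by a short read are still retained.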
Unique pORFs at 95%: total pORF reads were clustered at 95% identity over 80% of the length of the sequences.
Protein clusters: the representative sequences of each cluster from the 95% step were clustered at 60% identity over 80% of the length of the sequences.
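The two-step scheme (95% identity first, then 60% identity on the representatives, both over 80% of sequence length) can be sketched as a greedy, representative-based clustering. Everything here is an assumption for illustration: the source does not name the clustering algorithm, `ungapped_similarity` is a toy stand-in for a real alignment, and the sequences are made up. Production pipelines would use an alignment-based clustering tool instead.

```python
def ungapped_similarity(rep, seq):
    """Toy similarity: slide `seq` along `rep` without gaps and keep the
    placement with the most matches. Returns (identity over the overlap,
    fraction of `seq` covered by the overlap)."""
    best_matches, best_overlap = 0, 0
    for offset in range(-(len(seq) - 1), len(rep)):
        rep_start, seq_start = max(offset, 0), max(-offset, 0)
        overlap = min(len(rep) - rep_start, len(seq) - seq_start)
        matches = sum(a == b for a, b in zip(rep[rep_start:rep_start + overlap],
                                             seq[seq_start:seq_start + overlap]))
        if matches > best_matches:
            best_matches, best_overlap = matches, overlap
    if best_overlap == 0:
        return 0.0, 0.0
    return best_matches / best_overlap, best_overlap / len(seq)


def greedy_cluster(seqs, min_identity, min_coverage, similarity=ungapped_similarity):
    """Assign each sequence (longest first) to the first representative it
    matches; otherwise it becomes the representative of a new cluster."""
    clusters = []  # each entry is [representative, members]
    for seq in sorted(seqs, key=len, reverse=True):
        for cluster in clusters:
            identity, coverage = similarity(cluster[0], seq)
            if identity >= min_identity and coverage >= min_coverage:
                cluster[1].append(seq)
                break
        else:
            clusters.append([seq, [seq]])
    return clusters


def two_step_clusters(porfs):
    """Step 1: 95% identity over 80% length. Step 2: cluster the step-1
    representatives at 60% identity over 80% length."""
    step1 = greedy_cluster(porfs, 0.95, 0.80)
    representatives = [rep for rep, _ in step1]
    step2 = greedy_cluster(representatives, 0.60, 0.80)
    return step1, step2


# Hypothetical sequences: two identical, one sharing 70% of positions.
toy = ["A" * 20, "A" * 20, "A" * 14 + "C" * 6]
unique_porfs, protein_clusters = two_step_clusters(toy)
print(len(unique_porfs), len(protein_clusters))
```

In this toy run the identical pair collapses at the 95% step, and the 70%-identical sequence joins the same cluster only at the 60% step, mirroring the distinction between "Unique pORFs at 95%" and "Protein clusters".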
Protein clusters (dominant): the dominant clusters (≥10 non-redundant sequences), with the exclusion of spurious pORFs.
PFAM: Protein families database.
TIGRfam: The Institute for Genomic Research protein database.
COG: NCBI clusters of orthologous groups database.
Number of novel protein clusters: with ≥10 non-redundant clustered sequences, excluding spurious ORFs.
n/a – not analyzed.
Comparison of methods described by current manuscript with the three most recent methods for analysing microbial metatranscriptomes.
| | Leininger et al. | Frias-Lopez et al. | Gilbert et al. |
| | Soil (nutrient-poor, sandy soil) | Marine (oligotrophic ocean) | Marine (eutrophic coastal waters) |
| | 1 (1 metatranscriptome) | 1 (1 metatranscriptome, 1 metagenome) | 4 (4 metatranscriptomes, 4 metagenomes) |
| | ∼25.32 | ∼60.1 | ∼323.2 |
| | Griffiths et al. | mirVana RNA isolation kit (Ambion) from 1 L of sea water | Neufeld et al. |
| | N/A | mRNA amplification using MessageAmp II-Bacterial kit (Ambion) | MicrobeExpress and Megaclear kit (Ambion). GenomiPHI amplification (GE Healthcare) |
| | GS20 pyrosequencing | GS20 pyrosequencing | GS-FLX pyrosequencing |
| | 98 bp | 112 bp | 215 bp |
| | 8.2% | 47.1% | 99.9% |
| | 22% (60% of mRNA assigned tags) | 89.5% | 87% |
Based on hits to nucleotide sequences using the MG-RAST SEED database.
Based on hits to potential open reading frames using the PFAM, TIGRfam and COG protein databases.
Not performed; rRNA and mRNA were expressly sequenced together to examine both community structure and function.
Top ten most abundant identifiable transcripts identified from pORF clustering.
| Rank | COG ID | No. of Seqs (nr) | No. of clusters | Annotation | TIGRfam ID | No. of Seqs. (nr) | No. of clusters | Annotation | PFAM ID | No. of Seqs. (nr) | No. of clusters | Annotation |
| 1 | COG0209 | 464 (149) | 12 | Ribonucleotide reductase, alpha subunit | TIGR02505 | 526 (190) | 13 | ribonucleoside-triphosphate reductase, adenosylcobalamin-dependent | PF02407 | 27147 (815) | 44 | Putative viral replication protein |
| 2 | COG0443 | 330 (156) | 11 | Molecular chaperone | TIGR02348 | 359 (158) | 11 | chaperonin GroL | PF00910 | 15101 (595) | 28 | RNA helicase |
| 3 | COG0459 | 359 (158) | 11 | Chaperonin GroEL (HSP60 family) | TIGR02350 | 330 (156) | 11 | chaperone protein DnaK | PF00005 | 372 (212) | 18 | ABC transporter |
| 4 | COG0376 | 96 (46) | 9 | Catalase (peroxidase I) | TIGR01369 | 214 (108) | 9 | carbamoyl-phosphate synthase, large subunit | PF00004 | 326 (149) | 13 | ATPase family associated with various cellular activities (AAA) |
| 5 | COG0458 | 214 (108) | 9 | Carbamoylphosphate synthase large subunit | TIGR02188 | 236 (120) | 9 | acetate–CoA ligase | PF00012 | 330 (156) | 11 | Hsp70 protein |
| 6 | COG5265 | 236 (126) | 9 | ABC-type transport system involved in Fe-S cluster assembly, permease and ATPase components | TIGR00630 | 224 (103) | 8 | excinuclease ABC, A subunit | PF00118 | 359 (158) | 11 | TCP-1/cpn60 chaperonin family |
| 7 | COG0086 | 210 (96) | 8 | DNA-directed RNA polymerase, beta' subunit/160 kD subunit | TIGR00936 | 257 (118) | 8 | adenosylhomocysteinase | PF00006 | 292 (122) | 10 | ATP synthase alpha/beta family, nucleotide-binding domain |
| 8 | COG0178 | 224 (103) | 8 | Excinuclease ATPase subunit | TIGR02013 | 219 (82) | 7 | DNA-directed RNA polymerase, beta subunit | PF00009 | 399 (148) | 10 | Elongation factor Tu GTP binding domain |
| 9 | COG0499 | 257 (118) | 8 | S-adenosylhomocysteine hydrolase | TIGR02506 | 267 (78) | 7 | ribonucleoside-diphosphate reductase, alpha subunit | PF02867 | 326 (101) | 9 | Ribonucleotide reductase, barrel domain |
| 10 | COG0085 | 219 (82) | 7 | DNA-directed RNA polymerase, beta subunit/140 kD subunit | TIGR01242 | 178 (84) | 6 | 26S proteasome subunit P45 family | PF00501 | 228 (116) | 8 | AMP-binding enzyme |
This table includes only the pORF clusters with >10 non-redundant sequences. No. of Seqs. refers to the number of sequences that contribute to that cluster; the number of non-redundant (nr) sequences that contribute to that cluster is given in brackets.
Figure 1. Relative abundance of sequence types identified for each sample.
The number of sequences per metabolism subsystem was normalised to sequencing effort for each sample, and the relative abundance of each subsystem was then calculated as a percentage.
Figure 2. Percentage taxonomic affiliation of sequences identified in each dataset by BLAST against the SEED database.
A – community at the peak of the phytoplankton bloom (±1 SD). B – community after the phytoplankton bloom (±1 SD). Standard deviations are calculated from comparison of the different treatments. Data shown are for the high CO2 treatment.
BLASTN comparison of total nucleic acids, representative sequences from nucleic acid clusters and representative sequences from pORF clusters from this study and the Frias-Lopez study [9].
| Frias-Lopez et al. | Gilbert et al. DNA | | Gilbert et al. mRNA | |
| Total DNA | 44261 (10.7) | 102637 (10.3) | 19359 (4.67) | 56835 (11.2) |
| Total DNA nuc-clusters | 35575 (10.6, | 59918 (9.5, | 15564 (4.65, | 11698 (8.8, |
| Total DNA pORF clusters | 59774 (15, | 40002 (3.7, | 21302 (5.5, | 17672 (7.5, |
| Total mRNA | 64609 (52.4) | 18602 (1.9) | 58123 (45) | 2680 (0.53) |
| Total mRNA (rRNA removed) | 15179 (24.1, | 15598 (1.57, | 13121 (20.8, | 2162 (0.43, |
| Total mRNA nuc-clusters | 27094 (38.7, | 10689 (1.7, | 22289 (31.9, | 1942 (1.45, |
| Total mRNA nuc-clusters (rRNA removed) | 8338 (19, | 9631 (1.5, | 6171 (14, | 1624 (1.2, |
| Total mRNA pORF clusters | 4330 (9, | 5484 (0.5, | 2372 (5, | 1736 (0.7, |
For each comparison two values are given: the first is the percentage of the Frias-Lopez data that is homologous to data from the current study; the second is the percentage of data from the current study that is homologous to the Frias-Lopez data. Comparisons were performed using BLASTN with the current study's dataset as the reference database and the Frias-Lopez dataset as the query. The BLASTN -b and -v parameters were set to 40,000. For every query sequence, every similar sequence in the reference dataset is identified. Sequences from both datasets that met the criterion of an E-value <0.001 were included. Percentage values in parentheses are calculated by dividing each value by the total number of sequences or representative sequences for each dataset. Percentage values in bold are normalised by dividing each value by the Total DNA or Total RNA for the relevant study. Nuc-cluster refers to nucleotide clusters.
For the Frias-Lopez data: Total DNA – 414,323, Total DNA nuc-clusters – 334,940, Total DNA pORF clusters – 390,599, Total mRNA – 128,234, Total mRNA (rRNA removed) – 63,111, Total mRNA nuc-clusters – 69,948, Total mRNA nuc-clusters (rRNA removed) – 43,948, Total mRNA pORF clusters – 46,703.
For the Gilbert data: Total DNA – 992,224, Total DNA nuc-clusters – 630,159, Total DNA pORF clusters – 1,083,644, Total mRNA – 506,353, Total mRNA nuc-clusters – 133,447, Total mRNA pORF clusters – 238,655.
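As a worked check of how the percentages in parentheses are obtained, the short Python sketch below divides the hit counts from the first data row of the table by the dataset totals quoted in this note. The helper name `percent_homologous` is an assumption made for illustration; the numbers are taken from the table and the totals listed here.

```python
def percent_homologous(hits, total):
    """Hits as a percentage of the total number of sequences in a dataset."""
    return 100.0 * hits / total


# Totals quoted in the table note.
FRIAS_LOPEZ_TOTAL_DNA = 414_323
GILBERT_TOTAL_DNA = 992_224
GILBERT_TOTAL_MRNA = 506_353

# First data row of the table: hit counts with the reported percentages.
checks = [
    (44_261, FRIAS_LOPEZ_TOTAL_DNA, 10.7),
    (102_637, GILBERT_TOTAL_DNA, 10.3),
    (19_359, FRIAS_LOPEZ_TOTAL_DNA, 4.67),
    (56_835, GILBERT_TOTAL_MRNA, 11.2),
]
for hits, total, reported in checks:
    computed = percent_homologous(hits, total)
    print(f"{hits} / {total} = {computed:.2f}% (reported: {reported}%)")
```

The computed values agree with the reported percentages after rounding, which also shows which total serves as the denominator for each cell of that row.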
BLASTN comparison of the reference sequences of the abundant nucleic acid clusters (>10 and >100 sequences per cluster) from the current study to the total combined mRNA and DNA sequences from the Frias-Lopez et al [9] study.
| | | Frias-Lopez et al. | |
| | | mRNA homologues (%) | DNA homologues (%) |
| Current study | 3639 nucleotide clusters (10–99 sequences) | 107 (2.9%) | 326 (9%) |
| | 85 nucleotide clusters (>100 sequences) | 1 (1.2%) | 4 (4.7%) |
The 3639 clusters have >10 sequences and the 85 clusters are ‘contigs’ of all 609 clusters with >100 sequences (as described in the Materials and Methods).
Top 10 most abundant annotatable transcripts from the Frias-Lopez et al. [9] study.
| PFAM | TIGRFAM | COG | |||||||||
| PFAM ID | Annotation | A | B | TIGRfam ID | Annotation | A | B | COG ID | Annotation | A | B |
| PF00004 | ATPase family associated with various cellular activities (AAA) | 13 | 2 | TIGR00485 | Translation elongation factor Tu | 6 | 2 | COG4585 | Signal transduction histidine kinase | 0 | 8 |
| PF01370 | NAD dependent epimerase/dehydratase family | 4 | 1 | TIGR01242 | 26S proteasome subunit P45 family | 6 | 1 | COG5116 | 26S proteasome regulatory complex component | 0 | 6 |
| PF03143 | Elongation factor Tu C-terminal domain | 3 | 1 | TIGR02639 | ATP-dependent Clp protease ATP-binding subunit ClpA | 6 | 1 | COG0050 | GTPases - translation elongation factors | 6 | 2 |
| PF00521 | DNA gyrase/topoisomerase IV, subunit A | 2 | 1 | TIGR00962 | ATP synthase F1, alpha subunit | 4 | 1 | COG1222 | ATP-dependent 26S proteasome regulatory subunit | 0 | 2 |
| PF03144 | Elongation factor Tu domain 2 | 2 | 1 | TIGR01472 | GDP-mannose 4,6-dehydratase | 3 | 1 | COG0187 | Type IIA topoisomerase (DNA gyrase/topo II, topoisomerase IV), B subunit | 4 | 1 |
| PF00216 | Bacterial DNA-binding protein | 1 | 1 | TIGR01017 | Ribosomal protein S4 | 2 | 1 | COG0568 | DNA-directed RNA polymerase, sigma subunit (sigma70/sigma32) | 4 | 1 |
| PF00101 | Ribulose bisphosphate carboxylase, small chain | 0 | 1 | TIGR02521 | Type IV pilus biogenesis/stability protein PilW | 1 | 1 | COG1089 | GDP-D-mannose dehydratase | 3 | 1 |
| PF00016 | Ribulose bisphosphate carboxylase large chain, catalytic domain | 0 | 1 | TIGR00038 | translation elongation factor P | 0 | 1 | COG0188 | Type IIA topoisomerase (DNA gyrase/topo II, topoisomerase IV), A subunit | 2 | 1 |
| PF01106 | NifU-like domain | 0 | 1 | TIGR00050 | RNA methyltransferase, TrmH family, group 1 | 0 | 1 | COG0206 | Cell division GTPase | 0 | 1 |
| PF07719 | Tetratricopeptide repeat | 0 | 1 | TIGR00065 | Cell division protein FtsZ | 0 | 1 | COG0278 | Glutaredoxin-related protein | 0 | 1 |
Column B gives the Frias-Lopez et al. [9] data; where a homologue was identified in the current study, it is given in column A. Numbers in columns A and B refer to the number of sequences that were assigned this particular protein annotation. Only pORF clusters with >10 non-redundant sequences were included in this analysis.