Isaam Saeed, Sen-Lin Tang, Saman K Halgamuge.
Abstract
An approach to infer the unknown microbial population structure within a metagenome is to cluster nucleotide sequences based on common patterns in base composition, otherwise referred to as binning. When functional roles are assigned to the identified populations, a deeper understanding of microbial communities can be attained, more so than gene-centric approaches that explore overall functionality. In this study, we propose an unsupervised, model-based binning method with two clustering tiers, which uses a novel transformation of the oligonucleotide frequency-derived error gradient and GC content to generate coarse groups at the first tier of clustering; and tetranucleotide frequency to refine these groups at the secondary clustering tier. The proposed method has a demonstrated improvement over PhyloPythia, S-GSOM, TACOA and TaxSOM on all three benchmarks that were used for evaluation in this study. The proposed method is then applied to a pyrosequenced metagenomic library of mud volcano sediment sampled in southwestern Taiwan, with the inferred population structure validated against complementary sequencing of 16S ribosomal RNA marker genes. Finally, the proposed method was further validated against four publicly available metagenomes, including a highly complex Antarctic whale-fall bone sample, which was previously assumed to be too complex for binning prior to functional analysis.
Year: 2011 PMID: 22180538 PMCID: PMC3300000 DOI: 10.1093/nar/gkr1204
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. The motivation for the two-tiered clustering framework and the features used therein: (A) the PCA projection of the tetranucleotide frequency of random fragments of nine genomes results in poor discrimination between genome types (shown here for the first two principal components for visualization; the same applies when considering the first three principal components). (B) However, the nine genome types form two coarse groups in the OFDEG and GC content space. (C and D) When the tetranucleotide frequency of fragments is computed with respect to each group, the discrimination between genome types is more clearly evident.
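Two of the compositional features named in the Figure 1 caption, GC content and tetranucleotide frequency, can be illustrated with a minimal Python sketch. The function names are hypothetical, and two choices are assumptions rather than details taken from the study: windows containing ambiguous bases are skipped, and reverse-complement tetramers are not merged. The OFDEG feature is not reproduced here, since its computation is not described in this text.

```python
from collections import Counter
from itertools import product

# All 256 tetranucleotides in a fixed lexicographic order (assumed ordering).
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def gc_content(seq):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in "ACGT")
    if total == 0:
        return 0.0
    return (counts["G"] + counts["C"]) / total


def tetranucleotide_frequency(seq):
    """Relative frequency of each overlapping 4-mer in `seq`.

    Windows containing anything other than A, C, G, T are ignored.
    Returns a list of 256 floats aligned with TETRAMERS.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    valid = sum(counts[t] for t in TETRAMERS)
    if valid == 0:
        return [0.0] * len(TETRAMERS)
    return [counts[t] / valid for t in TETRAMERS]
```

Each fragment then maps to one scalar (GC content) and one 256-dimensional vector; the latter is the kind of input that a PCA projection such as the one in panel A would operate on.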
A comparison of features at Tier 1 of the clustering framework revealed O4-GC as the most suitable feature for Tier 1 separation
| Ranking | Feature | simLC | simMC (family) | simMC (species) | sim-BG | Average performance |
|---|---|---|---|---|---|---|
| 1 | O4-GC | 100.00 | 100.00 | 87.00 | 79.49 | 91.62 |
| 2 | O2-GC | 100.00 | 100.00 | 87.02 | 73.89 | 90.23 |
| 3 | TNF | 100.00 | 77.42 | 100.00 | 75.32 | 88.19 |
| 4 | ODDS | 100.00 | 100.00 | 87.82 | 62.23 | 87.51 |
| 5 | ZSN-TNF | 100.00 | 77.72 | 97.30 | 70.05 | 86.27 |
| 6 | MOMN-TNF | 100.00 | 77.67 | 98.74 | 59.49 | 83.96 |
The accuracies reported here are the F-scores for each cluster solution.
Pairwise comparisons of the cluster structure produced by each compositional feature
| | MOMN-TNF | ODDS (%) | O2-GC (%) | O4-GC (%) | TNF (%) | ZSN-TNF (%) |
|---|---|---|---|---|---|---|
| MOMN-TNF | 1.00 (–) | 0.87 (32.82) | 0.71 (27.45) | 0.72 (23.38) | 0.88 (26.45) | 0.96 (35.99) |
| ODDS | – | 1.00 (–) | 0.70 (44.66) | 0.70 (45.81) | 0.90 (46.71) | 0.53 (46.71) |
| O2-GC | – | – | 1.00 (–) | 0.99 (77.71) | 0.89 (51.10) | 0.55 (55.82) |
| O4-GC | – | – | – | 1.00 (–) | 0.86 (52.71) | 0.56 (58.35) |
| TNF | – | – | – | – | 1.00 (–) | 0.58 (63.24) |
| ZSN-TNF | – | – | – | – | – | 1.00 (–) |
The reported values are the ARI values between clustered feature sets; the number of sequences in correspondence between cluster solutions is shown in parentheses.
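For reference, the adjusted Rand index (ARI) between two cluster solutions can be computed from their contingency counts. The following is a minimal sketch of the standard ARI formula in pure Python; the function name and the handling of degenerate partitions (returning 1.0) are assumptions of this sketch, and it does not reproduce the "sequences in correspondence" percentages shown in parentheses.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same sequences."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same sequences")
    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: treated as trivially identical

    # Pairs of sequences placed together in both solutions, in A, and in B.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both solutions are a single cluster
    return (together_both - expected) / (maximum - expected)
```

An ARI of 1.00 means the two solutions partition the sequences identically regardless of label names, while values near 0 indicate agreement no better than chance.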
Figure 2. Comparison of the proposed framework against the two next-best binning methods, PhyloPythia and TaxSOM, on the low complexity (simLC), medium complexity (simMC) and medium–high complexity (sim-BG) benchmark data sets. The sim-BG benchmark, in particular, highlights the percentage improvement over PhyloPythia and TaxSOM: 78.41% and 17.55% in sensitivity, respectively, and 0.13% and 9.47% in specificity, respectively.
A summary of the two-tiered binning approach applied to the novel mud volcano metagenome
| Bin | Candidate (Min support: 20) | Assigned | Not assigned (%) | No hits (%) | Total |
|---|---|---|---|---|---|
| M1-1 | | 61 | 165 (14.34) | 628 (54.56) | 1151 |
| M1-2 | | 24 | 128 (30.55) | 234 (55.85) | 419 |
| M2-1 | | 22 | 80 (12.18) | 365 (55.56) | 657 |
| M2-2 | | 41 | 54 (8.90) | 347 (57.17) | 607 |
| M2-3 | Unknown | – | 55 (39.57) | 84 (60.43) | 139 |
The predicted taxonomic assignments were estimated by post-processing blastp hits (e-value: 10⁻⁵) using MEGAN with a minimum support of 20 and a minimum bitscore of 100.
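The thresholds quoted above can be illustrated with a simplified tally. This sketch is not MEGAN's lowest-common-ancestor algorithm: it assumes each sequence takes the taxon of its best-scoring hit that passes the e-value and bitscore cut-offs, that taxa supported by fewer sequences than the minimum support are counted as "not assigned", and that sequences with no passing hit are counted as "no hits". The function name and input layout are hypothetical.

```python
from collections import Counter


def summarize_bin(hits_per_sequence, max_evalue=1e-5, min_bitscore=100.0, min_support=20):
    """Tally a bin into assigned / not assigned / no hits.

    hits_per_sequence: dict mapping a sequence id to a list of
    (taxon, evalue, bitscore) tuples, one per blastp hit (assumed layout).
    """
    best_taxon = {}
    for seq_id, hits in hits_per_sequence.items():
        # Keep only hits that pass both thresholds.
        kept = [h for h in hits if h[1] <= max_evalue and h[2] >= min_bitscore]
        if kept:
            # Simplification: use the single highest-bitscore hit, not an LCA.
            best_taxon[seq_id] = max(kept, key=lambda h: h[2])[0]

    support = Counter(best_taxon.values())
    assigned = {taxon: n for taxon, n in support.items() if n >= min_support}
    not_assigned = sum(n for n in support.values() if n < min_support)
    total = len(hits_per_sequence)
    return {
        "assigned": assigned,
        "not_assigned": not_assigned,
        "no_hits": total - len(best_taxon),
        "total": total,
    }
```

The percentages in the table above are consistent with counts divided by the bin total; for example, 128 of 419 sequences in bin M1-2 is 30.55%.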
Summary of the validation of the proposed framework on real-world metagenomes
| Sample | Taxon | Rank | A1-1 | A1-3 | A2-1 | A2-2 |
|---|---|---|---|---|---|---|
| Acid mine drainage | | Genus | 99 | 0 | 0 | 0 |
| | | Species | 24 | 59 | 0 | 0 |
| | | Species | 0 | 0 | 93 | 23 |
| | | Species | 0 | 0 | 0 | 28 |
| | GC content (%) | | 37.73 | 37.80 | 59.18 | 54.48 |
| | Length (Mb) | | 1.87 | 0.49 | 0.48 | 2.07 |

| Sample | Taxon | Rank | G1-1 | G3-1 | G3-2 | G4-1 |
|---|---|---|---|---|---|---|
| Gutless worm | δ1- | Class | 277 | 10 | 0 | 2 |
| | δ4- | Class | 14 | 153 | 0 | 4 |
| | γ1- | Class | 3 | 2 | 0 | 50 |
| | γ3- | Class | 2 | 1 | 75 | 0 |
| | Unknown | – | 102 | 51 | 6 | 77 |
| | GC content (%) | | 55.68 | 55.97 | 62.48 | |
| | Length (Mb) | | 3.74 | 1.14 | 1.76 | 0.30 |

| Sample | Taxon | Rank | W1-1 | W1-2 | W1-3 | W1-4 | W2-1 | W2-2 | W2-3 |
|---|---|---|---|---|---|---|---|---|---|
| Antarctic whale fall bone | | Order | 649 | 21 | 10 | 5 | 0 | 0 | 0 |
| | | Genus | 0 | 666 | 0 | 9 | 0 | 0 | 0 |
| | | Order | 0 | 0 | 1601 | 9 | 0 | 0 | 0 |
| | Unidentified | – | – | – | – | 302 | – | – | – |
| | | Class | 0 | 0 | 0 | 0 | 22 | 0 | 0 |
| | | Order | 0 | 0 | 0 | 5 | 11 | 83 | 44 |
| | | Order | 0 | 0 | 0 | 0 | 0 | 0 | 324 |
| | GC content (%) | | 41.16 | 44.54 | 35.04 | 44.70 | 57.91 | 56.96 | 57.97 |
| | Length (Mb) | | 0.43 | 0.49 | 1.07 | 0.26 | 0.12 | 0.13 | 0.39 |
The gutless worm community revealed 236 additional contigs that have been classified. The apparent noise in the classification could be traced back to the BLAST-based assignments given in Ref. (17). The whale fall bone sample shows good separation at the rank of order.
ᵃ Sequences correspond to mosaic genome types that require alignment to a reference genome for correct classification and cannot be separated by nucleotide frequency (18,23); nevertheless, all bins were classified with perfect specificity.