Francis C Weng, Chien-Hao Su, Ming-Tsung Hsu, Tse-Yi Wang, Huai-Kuang Tsai, Daryi Wang.
Abstract
BACKGROUND: Investigation of metagenomes provides greater insight into uncultured microbial communities. The improvement in sequencing technology, which yields a large amount of sequence data, has led to major breakthroughs in the field. However, at present, taxonomic binning tools for metagenomes discard 30-40% of Sanger sequencing data due to the stringency of BLAST cut-offs. In an attempt to provide a comprehensive overview of metagenomic data, we re-analyzed the discarded metagenomes by using less stringent cut-offs. Additionally, we introduced a new criterion, namely, the evolutionary conservation of adjacency between neighboring genes. To evaluate the feasibility of our approach, we re-analyzed discarded contigs and singletons from several environments with different levels of complexity. We also compared the consistency between our taxonomic binning and those reported in the original studies.
Year: 2010; PMID: 21083935; PMCID: PMC3098102; DOI: 10.1186/1471-2105-11-565
Source DB: PubMed; Journal: BMC Bioinformatics; ISSN: 1471-2105; Impact factor: 3.169
Figure 1. Overview of the proposed approach.
Number (and ratio) of discarded singletons that did not contain any CDS and those that contain one, two and three or more CDSs in the simulated metagenomes.
| Number of CDSs on singleton | simLC singletons | simLC % | simMC singletons | simMC % | simHC singletons | simHC % |
|---|---|---|---|---|---|---|
| 0 | 2575 | 6.2 | 2637 | 6.5 | 3986 | 6.0 |
| 1 | 18219 | 44.0 | 18072 | 44.8 | 29926 | 45.0 |
| 2 | 17549 | 42.4 | 16832 | 41.7 | 27874 | 41.9 |
| 3 or more | 3081 | 7.4 | 2838 | 7.0 | 4738 | 7.1 |
| Total singletons | 41424 | 100 | 40379 | 100 | 66524 | 100 |
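The percentages in the table above follow directly from the singleton counts. As a minimal sketch, the snippet below recomputes the simLC column; the function name `cds_ratios` and the dictionary layout are illustrative choices, not part of the original study.

```python
# Counts of discarded singletons by number of CDSs, copied from the simLC column.
sim_lc = {"0": 2575, "1": 18219, "2": 17549, "3 or more": 3081}


def cds_ratios(counts):
    """Return (total, {category: percentage rounded to one decimal})."""
    total = sum(counts.values())
    return total, {k: round(100 * v / total, 1) for k, v in counts.items()}


total, ratios = cds_ratios(sim_lc)
print(total, ratios)
```

The same function can be applied to the simMC and simHC columns to check that each set of categories sums to the reported total.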
Summary of collected metagenomic fragments.
| Data type I (contigs) | Location or sex | Position or age | Total contigs | Assigned CDSs (a) | Unassigned contigs (b) | Unassigned average length (bp) |
|---|---|---|---|---|---|---|
| whale fall sub. 1 | Pacific Ocean, Santa Cruz Basin (N33.30 W119.22) | section of rib bone | 35975 | 33139 | 7039 | 1167 |
| whale fall sub. 2 | Pacific Ocean, Santa Cruz Basin (N33.30 W119.22) | bone | 32459 | 32395 | 7660 | 1199 |
| whale fall sub. 3 | West Antarctic Peninsula Shelf (S65.10 W64.47) | bone | 27130 | 26841 | 4990 | 1357 |
| Japanese In-A | Male | 45 years | 76434 | 29247 | 13399 | 1057 |
| Japanese In-B | Male | 6 months | 80617 | 14718 | 7078 | 1058 |
| Japanese In-D | Male | 35 years | 84237 | 48033 | 28244 | 1034 |
| Japanese In-E | Male | 3 months | 80852 | 27860 | 10838 | 1124 |
| Japanese In-M | Female | 4 months | 89340 | 26350 | 8456 | 1008 |
| Japanese In-R | Female | 24 years | 85787 | 45438 | 21661 | 998 |
| Japanese F1-S | Male | 30 years | 78452 | 40427 | 15378 | 1005 |
| Japanese F1-T | Female | 28 years | 81348 | 46487 | 21780 | 958 |
| Japanese F1-U | Female | 7 months | 82525 | 27332 | 11791 | 969 |
| Japanese F2-V | Male | 37 years | 80772 | 49411 | 19733 | 1006 |
| Japanese F2-W | Female | 36 years | 79163 | 42750 | 16961 | 1039 |
| Japanese F2-X | Male | 3 years | 80858 | 41337 | 19351 | 1040 |
| Japanese F2-Y | Female | 1.5 years | 79754 | 49315 | 20061 | 990 |
a Genes with best hits at 30% identity or higher in Archaea and Bacteria kingdoms from JGI
b Genes with best hits less than 30% identity in Archaea and Bacteria kingdoms from JGI
c Phred and PCAP assembly package for Japanese samples.
d The number of predicted open-reading frames showing similarity to genes in the "in-house NR database".
Summary of reassignments using discarded metagenomic data.
| Data type I (contigs) | Unassigned contigs | Re-assigned contigs | Re-assigned average length (bp) | Rate (%) | r (phylum) | r (family) |
|---|---|---|---|---|---|---|
| whale fall sub. 1 | 7039 | 1050 | 1388 | 14.9 | 0.98 | 0.92 |
| whale fall sub. 2 | 7660 | 995 | 1295 | 12.9 | 0.98 | 0.77 |
| whale fall sub. 3 | 4990 | 720 | 1400 | 14.4 | 0.97 | 0.79 |
| Japanese In-A | 13399 | 3542 | 1074 | 26.4 | 0.95 | 0.85 |
| Japanese In-B | 7078 | 2050 | 1073 | 28.9 | 0.99 | 0.90 |
| Japanese In-D | 28244 | 5542 | 1061 | 16.9 | 0.89 | 0.72 |
| Japanese In-E | 10838 | 2888 | 1129 | 26.6 | 0.99 | 0.95 |
| Japanese In-M | 8546 | 2159 | 1057 | 25.2 | 0.93 | 0.86 |
| Japanese In-R | 21661 | 3993 | 1020 | 18.4 | 0.96 | 0.80 |
| Japanese F1-S | 15378 | 3216 | 1018 | 20.9 | 0.95 | 0.82 |
| Japanese F1-T | 21780 | 4395 | 971 | 20.1 | 0.93 | 0.59 |
| Japanese F1-U | 11791 | 3711 | 983 | 31.4 | 0.99 | 0.99 |
| Japanese F2-V | 19733 | 4007 | 1020 | 20.3 | 0.90 | 0.61 |
| Japanese F2-W | 16961 | 4011 | 1052 | 23.6 | 0.89 | 0.77 |
| Japanese F2-X | 19351 | 4402 | 1054 | 22.7 | 0.92 | 0.66 |
| Japanese F2-Y | 20061 | 4766 | 1002 | 23.7 | 0.96 | 0.82 |
The consistency between binning with discarded fragments and that in the original studies was tested by the Pearson correlation coefficient (r).
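As an illustration of how such a consistency score can be computed, the sketch below calculates the Pearson correlation coefficient between two taxonomic compositions. The phylum counts are hypothetical placeholders, not values from the study, and the helper names `pearson_r` and `fractions` are assumptions made for this example.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


def fractions(counts, taxa):
    """Convert per-taxon counts to fractions, in the order given by `taxa`."""
    total = sum(counts.get(t, 0) for t in taxa)
    return [counts.get(t, 0) / total for t in taxa]


# Hypothetical phylum-level counts for one microbiome (illustrative only).
original = {"Firmicutes": 500, "Bacteroidetes": 300,
            "Actinobacteria": 150, "Proteobacteria": 50}
reassigned = {"Firmicutes": 120, "Bacteroidetes": 60,
              "Actinobacteria": 40, "Proteobacteria": 20}

phyla = sorted(set(original) | set(reassigned))
r = pearson_r(fractions(original, phyla), fractions(reassigned, phyla))
print(f"r (phylum) = {r:.2f}")
```

Comparing fractions instead of raw counts keeps the two vectors on the same scale, although Pearson's r itself is unaffected by rescaling either vector.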
Figure 2. Compositional view of 16 microbiomes at the phylum rank. The bars depict the detailed contribution of microbiomes with 22 phyla represented on two types of genomic fragments, contigs and singletons. For each microbiome, the similarity of binning between our re-assignments and that of the original studies was compared. The consistency of the two datasets is represented by the Pearson correlation coefficient (Table 3).
Sensitivity and specificity of taxonomic binning at different taxonomic ranks using discarded dataset of simMC.
| Gene adjacency | Criteria | | P Sn | P Sp | O Sn | O Sp | F Sn | F Sp | G Sn | G Sp |
|---|---|---|---|---|---|---|---|---|---|---|
| with | 1e-2 | 50 | 31.8 | 97.2 | 28.3 | 90.8 | 27.6 | 84.8 | 24.1 | 88.8 |
| with | 1e-2 | 150 | 34.9 | 94.5 | 30.8 | 86.2 | 30.0 | 79.3 | 25.3 | 82.0 |
| with | 1e-2 | 250 | 36.8 | 93.3 | 31.7 | 84.0 | 30.7 | 77.0 | 25.9 | 79.0 |
| with | 1e-2 | 350 | 37.9 | 92.7 | 32.0 | 83.0 | 30.7 | 75.8 | 25.7 | 77.6 |
| with | 1e-4 | 50 | 29.7 | 97.4 | 27.8 | 91.1 | 27.4 | 85.1 | 23.7 | 89.2 |
| with | 1e-4 | 150 | 34.1 | 94.9 | 30.3 | 87.1 | 29.5 | 80.2 | 25.5 | 83.1 |
| with | 1e-4 | 250 | 36.0 | 93.7 | 31.4 | 84.9 | 30.3 | 77.8 | 25.7 | 80.2 |
| with | 1e-4 | 350 | 37.1 | 93.2 | 31.7 | 83.9 | 30.5 | 76.7 | 25.8 | 78.7 |
| with | 1e-6 | 50 | 29.1 | 97.5 | 27.3 | 91.2 | 26.8 | 85.1 | 23.4 | 89.2 |
| with | 1e-6 | 150 | 33.1 | 94.9 | 29.7 | 87.1 | 28.8 | 80.2 | 24.7 | 83.1 |
| with | 1e-6 | 250 | 35.0 | 93.7 | 30.8 | 84.9 | 29.8 | 77.8 | 25.5 | 80.2 |
| with | 1e-6 | 350 | 36.2 | 93.2 | 31.3 | 83.9 | 30.2 | 76.7 | 25.8 | 78.7 |
| without | 1e-2 | 50 | 31.3 | 99.6 | 9.6 | 97.8 | 5.9 | 89.1 | 4.1 | 93.3 |
| without | 1e-2 | 150 | 20.0 | 99.7 | 7.3 | 94.0 | 5.2 | 88.1 | 4.5 | 92.0 |
| without | 1e-2 | 250 | 17.5 | 99.6 | 7.3 | 94.0 | 5.2 | 88.1 | 4.5 | 92.3 |
| without | 1e-2 | 350 | 16.3 | 99.5 | 7.3 | 94.0 | 5.2 | 88.1 | 4.5 | 92.3 |
The criteria were tested separately with and without considering gene adjacency. P: phylum; O: order; F: family; G: genus. Sn and Sp denote sensitivity (%) and specificity (%), respectively.
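This section does not spell out how Sn and Sp were calculated. The sketch below uses one convention common in binning evaluations and should be read as an assumption: sensitivity is the share of all fragments that were assigned to the correct taxon, and specificity is the share of assigned fragments that were correct. The function name and the use of `None` for unassigned fragments are illustrative.

```python
def sensitivity_specificity(true_taxa, predicted_taxa):
    """Return (Sn, Sp) in percent under the assumed convention.

    Sn = correct assignments / all fragments
    Sp = correct assignments / fragments that received any assignment
    A prediction of None means the fragment stayed unassigned.
    """
    if len(true_taxa) != len(predicted_taxa):
        raise ValueError("inputs must have the same length")
    total = len(true_taxa)
    assigned = sum(1 for p in predicted_taxa if p is not None)
    correct = sum(
        1 for t, p in zip(true_taxa, predicted_taxa)
        if p is not None and p == t
    )
    sn = 100 * correct / total if total else 0.0
    sp = 100 * correct / assigned if assigned else 0.0
    return sn, sp


# Toy example with known source taxa, as in a simulated metagenome.
truth = ["A", "A", "B", "B", "C"]
predicted = ["A", None, "B", "C", None]
sn, sp = sensitivity_specificity(truth, predicted)
print(f"Sn = {sn:.1f}%, Sp = {sp:.1f}%")
```

Under this convention, looser criteria that assign more fragments tend to raise Sn while lowering Sp, which is the kind of trade-off visible across the rows of the table.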