Ruichang Zhang, Zhanzhan Cheng, Jihong Guan, Shuigeng Zhou.
Abstract
BACKGROUND: With the rapid development of high-throughput technologies, researchers can sequence the whole metagenome of a microbial community sampled directly from the environment. Assigning these metagenomic reads to species or taxonomic classes, referred to as binning of metagenomic data, is a vital step in metagenomic analysis.
Year: 2015 PMID: 25859745 PMCID: PMC4402587 DOI: 10.1186/1471-2105-16-S5-S2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The pipeline of the proposed method.
Figure 2. The LDA model.
Figure 3. Applying the LDA model to metagenomic reads.
Simulated datasets of low abundance (read length is 1 kbp on average).
| Dataset | Reads number | Species number | Abundance ratio |
|---|---|---|---|
| D1 | 5k | 2 | 1:1 |
| D2 | 5k | 2 | 1:2 |
| D3 | 5k | 2 | 1:4 |
| D4 | 5k | 2 | 1:6 |
| D5 | 5k | 2 | 1:8 |
| D6 | 5k | 2 | 1:10 |
| D7 | 5k | 2 | 1:12 |
| D8 | 5k | 3 | 1:1:1 |
| D9 | 5k | 3 | 1:3:9 |
| D10 | 5k | 4 | 1:3:3:9 |
| D11 | 5k | 5 | 1:1:1:1:1 |
| D12 | 5k | 5 | 1:1:3:3:9 |
| D13 | 5k | 10 | 1:1:1:1:1:1:1:1:1:1 |
| D14 | 50k | 3 | 1:3:9 |
| D15 | 50k | 4 | 1:3:3:9 |
| D16 | 50k | 5 | 1:1:3:3:9 |
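To make the abundance ratios in the table above concrete, the following minimal sketch converts a ratio string into per-species read shares and approximate read counts. It assumes that "5k" means 5,000 reads and that the read share follows the ratio directly (genome length differences are ignored). The function names `abundance_shares` and `expected_reads` are hypothetical helpers, not part of any tool named in this record.

```python
from fractions import Fraction


def abundance_shares(ratio: str) -> list[Fraction]:
    """Convert a ratio string such as '1:3:9' into per-species fractions."""
    parts = [int(p) for p in ratio.split(":")]
    total = sum(parts)
    return [Fraction(p, total) for p in parts]


def expected_reads(ratio: str, n_reads: int) -> list[int]:
    """Approximate reads per species, assuming read share follows the ratio.

    Each count is rounded independently, so the counts may not sum
    exactly to n_reads.
    """
    return [round(n_reads * share) for share in abundance_shares(ratio)]


if __name__ == "__main__":
    # D9: 5k reads, three species at 1:3:9 (assumption: 5k = 5000 reads)
    print(expected_reads("1:3:9", 5000))
    # Share of the rarest species in D1 (1:1) versus D7 (1:12)
    print(float(abundance_shares("1:1")[0]), float(abundance_shares("1:12")[0]))
```

Under these assumptions, the rarest species in D7 (1:12) contributes about 1/13 of the reads, compared with 1/2 in D1.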
Simulated datasets of relatively-high abundance (read length is 1 kbp on average).
| Dataset | Reads number | Species number | Abundance ratio |
|---|---|---|---|
| S1 | 50k | 2 | 1:1 |
| S2 | 50k | 3 | 1:1:1 |
| S3 | 50k | 3 | 1:3:9 |
| S4 | 50k | 5 | 1:1:3:3:9 |
| S5 | 50k | 10 | 1:1:1:1:1:1:1:1:1:1 |
| S6 | 500k | 2 | 1:1 |
| S7 | 500k | 3 | 1:1:1 |
| S8 | 500k | 3 | 1:3:9 |
| S9 | 500k | 5 | 1:1:3:3:9 |
| S10 | 500k | 10 | 1:1:1:1:1:1:1:1:1:1 |
Simulated datasets of very high abundance (read length is 75 bp on average).
| Dataset | Reads number | Species number | Abundance ratio |
|---|---|---|---|
| A | 1 million | 20 | 1 × 5:3 × 5:5 × 5:10 × 5 |
| B | 1 million | 50 | 1 × 34:6 × 6:8 × 5:10 × 5 |
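The compact ratios for datasets A and B appear to use a "level × count" notation, for example "1 × 5" meaning five species at relative abundance 1. This reading is an assumption, but it is consistent with the species numbers in the table (20 and 50). The hypothetical helper `expand_ratio` below expands the notation so the species count can be checked.

```python
def expand_ratio(compact: str) -> list[int]:
    """Expand 'level × count' terms into one abundance level per species.

    Assumes each colon-separated term is either 'level × count' or a
    bare level (count of 1).
    """
    levels: list[int] = []
    for term in compact.split(":"):
        if "×" in term:
            level_text, count_text = term.split("×")
            level, count = int(level_text), int(count_text)
        else:
            level, count = int(term), 1
        levels.extend([level] * count)
    return levels


if __name__ == "__main__":
    dataset_a = expand_ratio("1 × 5:3 × 5:5 × 5:10 × 5")
    dataset_b = expand_ratio("1 × 34:6 × 6:8 × 5:10 × 5")
    # Number of species implied by each compact ratio
    print(len(dataset_a), len(dataset_b))
```

Reading the terms the other way round (count × level) would give 19 and 25 species, which does not match the table.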
Simulated datasets of extremely high abundance (read length is 128 bp on average).
| Dataset | Reads number | Species number | Abundance ratio |
|---|---|---|---|
| C | 3000k | 2 | 1:1 |
| D | 3000k | 3 | 1:1:1 |
| E | 3000k | 3 | 1:3:9 |
| F | 3000k | 5 | 1:1:3:3:9 |
| G | 3000k | 10 | 1:1:1:1:1:1:1:1:1:1 |
Figure 4. The taxonomy of species in R1.
Figure 5. The effect of topic number on binning performance.
Results on simulated datasets (D1, D8, D11 and D13) with identical abundance ratio.
| Dataset | MetaCluster 3.0 | MCluster | TM-MCluster | ||||||
|---|---|---|---|---|---|---|---|---|---|
| D1 | .9628 | .9805 | .9877 | .9877 | .9877 | .9882 | |||
| D8 | .7432 | .9218 | .8229 | .9158 | .9158 | .9158 | |||
| D11 | .8215 | .8766 | .8481 | .8394 | .8394 | .8394 | |||
| D13 | .4335 | .5794 | .706 | .6894 | .6976 | .7732 | |||
Each bold value indicates the best result on a certain dataset.
Results on 12 unevenly-distributed datasets.
| Dataset | MetaCluster 3.0 | MCluster | TM-MCluster | ||||||
|---|---|---|---|---|---|---|---|---|---|
| D2 | .9648 | .9820 | .9888 | .9860 | .9860 | .9860 | |||
| D3 | .9596 | .9793 | .9950 | .9948 | .9948 | .9948 | |||
| D4 | .9612 | .9802 | .9942 | .9942 | .9942 | .9946 | |||
| D5 | .9608 | .9800 | .9950 | .9950 | .9950 | .9954 | |||
| D6 | .9610 | .9801 | .9966 | .9966 | |||||
| D7 | .9618 | .9805 | .9980 | .9980 | .9980 | .9988 | |||
| D9 | .7277 | .8289 | .8974 | .8974 | .8974 | .9320 | |||
| D10 | .7345 | .9096 | .8127 | .8852 | .8852 | .8852 | |||
| D12 | .7489 | .8202 | .8524 | .8524 | .8524 | .8930 | |||
| D14 | .7275 | .8255 | .8863 | .8860 | .8863 | .9420 | |||
| D15 | .7472 | .8247 | .8764 | .8764 | .8765 | .9070 | |||
| D16 | .6792 | .778 | .8546 | .8546 | .8546 | .8875 | |||
Each bold value indicates the best result on a certain dataset.
Results on high-abundance datasets.
| Dataset | AbundanceBin | MCluster | TM-MCluster | ||||||
|---|---|---|---|---|---|---|---|---|---|
| S1 | .7258 | .9740 | .8317 | .9875 | .9875 | .9875 | |||
| S2 | .4047 | .9405 | .5600 | .9154 | .9154 | .9154 | |||
| S3 | .5866 | .7528 | .6594 | .8873 | .8873 | .8873 | |||
| S4 | .4106 | .5723 | .8554 | .8554 | .8554 | .8921 | |||
| S5 | .1748 | .2970 | .7361 | .7241 | .7301 | .7546 | |||
| S6 | .7266 | .8416 | .9873 | .9869 | .9869 | .9869 | |||
| S7 | .3991 | .5705 | .9173 | .9173 | .9173 | .9545 | |||
| S8 | .8591 | .8591 | .8591 | .8868 | .8868 | .8868 | |||
| S9 | .6457 | .6476 | .6466 | .8581 | .8581 | .8581 | |||
| S10 | .1888 | .7223 | .2993 | .7161 | .7207 | .7196 | |||
Each bold value indicates the best result on a certain dataset.
Binning performance of AbundanceBin, MCluster and TM-MCluster on short reads (75 bp average) datasets: Dataset-A and Dataset-B.
| Dataset | AbundanceBin | | | MCluster | | | TM-MCluster | | |
|---|---|---|---|---|---|---|---|---|---|
| | Pr | Se | F1 | Pr | Se | F1 | Pr | Se | F1 |
| A | .2270 | .9878 | .3692 | .2250 | .3674 | .6471 | |||
| B | .0757 | .9878 | .1407 | .0744 | .1384 | .5836 | |||
The bold values are the best precision, sensitivity and F1-score.
Memory and time costs of AbundanceBin, MCluster and TM-MCluster on short reads (75 bp average) datasets: Dataset-A and Dataset-B.
| Dataset | AbundanceBin | | MCluster | | TM-MCluster | |
|---|---|---|---|---|---|---|
| | Memory | Time | Memory | Time | Memory | Time |
| A | 3.07 GB | 2.15 h | 3.20 GB | 1.36 h | 4.12 GB | 3.11 h |
| B | 3.20 GB | 3.20 h | 3.46 GB | 2.38 h | 4.10 GB | 3.31 h |
Performance comparison: TM-MCluster vs. MetaCluster 5.0.
| Dataset | MetaCluster 5.0 | TM-MCluster | ||||
|---|---|---|---|---|---|---|
| C | .3862 | .5563 | .9793 | |||
| D | .4290 | .5986 | .7198 | |||
| E | .6923 | .4645 | .5574 | |||
| F | .3178 | .4796 | .5801 | |||
| G | .0066 | .0131 | .2141 | |||
Results on the real dataset R1.
| Methods | # | Pr | Se | F1 |
|---|---|---|---|---|
| MetaCluster 3.0 | 2 | .8441 | .7845 | |
| AbundanceBin | 2 | .3952 | .5655 | |
| | 3 | .3952 | .5655 | |
| | 5 | .3952 | .5648 | |
| MCluster | 2 | .7050 | .9422 | .8066 |
| | 3 | .7054 | .9179 | .7978 |
| | 5 | .6972 | .6444 | .6698 |
| TM-MCluster | 2 | .7186 | .9682 | |
| | 3 | .9645 | | |
| | 5 | .9130 | | |
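The Pr, Se and F1 columns can be related through the standard definition of the F1-score as the harmonic mean of precision and sensitivity. The sketch below applies that definition to the three MCluster rows of the R1 table, which list all three values. It assumes the record uses this standard definition; `f1_score` is a hypothetical helper name, and small differences in the fourth decimal place are expected because the inputs are themselves rounded.

```python
def f1_score(precision: float, sensitivity: float) -> float:
    """Harmonic mean of precision and sensitivity (standard F1 definition)."""
    if precision + sensitivity == 0:
        return 0.0
    return 2 * precision * sensitivity / (precision + sensitivity)


if __name__ == "__main__":
    # (Pr, Se, reported F1) for the MCluster rows of the R1 table
    mcluster_rows = [
        (0.7050, 0.9422, 0.8066),
        (0.7054, 0.9179, 0.7978),
        (0.6972, 0.6444, 0.6698),
    ]
    for pr, se, reported in mcluster_rows:
        computed = f1_score(pr, se)
        print(f"Pr={pr:.4f} Se={se:.4f} computed F1={computed:.4f} reported F1={reported:.4f}")
```

The computed values agree with the reported F1 column to within rounding error, which supports reading the three columns as Pr, Se and F1 in that order.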