Damayanthi Herath, Sen-Lin Tang, Kshitij Tandon, David Ackland, Saman Kumara Halgamuge.
Abstract
BACKGROUND: In metagenomics, the separation of nucleotide sequences belonging to an individual population or to closely matched populations is termed binning. Binning aids the evaluation of the underlying microbial population structure as well as the recovery of individual genomes from a sample of uncultivable microbial organisms. Both supervised and unsupervised learning methods have been employed in binning; however, characterizing a metagenomic sample containing multiple strains remains a significant challenge. In this study, we designed and implemented a new workflow, Coverage and composition based binning of Metagenomes (CoMet), for binning contigs in a single metagenomic sample. CoMet utilizes the coverage values and the compositional features of metagenomic contigs. The binning strategy in CoMet consists of an initial grouping of contigs in guanine-cytosine (GC) content-coverage space, followed by refinement of the bins in tetranucleotide frequency space, in a purely unsupervised manner. CoMet employs the clustering algorithm DBSCAN for binning contigs. The performance of CoMet was compared against four existing approaches for binning a single metagenomic sample, namely MaxBin, Metawatt, MyCC (default) and MyCC (coverage), using multiple datasets, including a sample comprising multiple strains.
Keywords: Binning; Contig composition; Contig coverage; DBSCAN algorithm; Metagenomics
Year: 2017 PMID: 29297295 PMCID: PMC5751405 DOI: 10.1186/s12859-017-1967-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 A schematic diagram of the CoMet workflow. The figure illustrates the key steps involved in the proposed binning workflow
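To make the two feature spaces in the workflow more concrete, the following is a minimal sketch, not the CoMet implementation. It computes the GC content and tetranucleotide frequencies of a contig and includes a textbook DBSCAN written with the standard library only. The function names, the toy contigs, and the `eps` and `min_pts` values are assumptions chosen for illustration; the record above does not state CoMet's parameters or any feature scaling.

```python
from itertools import product
from math import dist

# All 256 tetranucleotides over the A/C/G/T alphabet, in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def gc_content(seq):
    """Fraction of G and C among the unambiguous (A/C/G/T) bases of a contig."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def tetranucleotide_freqs(seq):
    """Relative frequency of each overlapping 4-mer; ambiguous windows are skipped."""
    seq = seq.upper()
    counts = dict.fromkeys(TETRAMERS, 0)
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in TETRAMERS]


def dbscan(points, eps, min_pts):
    """Textbook DBSCAN. Returns one cluster label per point; -1 marks noise."""
    labels = [None] * len(points)

    def neighbours(i):
        # Brute-force neighbourhood query (includes the point itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(nb)
    return labels


# Toy usage with made-up (sequence, coverage) pairs.
contigs = [("ACGTACGTGG", 10.0), ("ACGTACGTGC", 10.5), ("TTTTAAAATT", 80.0)]
stage1_points = [(gc_content(s), cov) for s, cov in contigs]
stage1_labels = dbscan(stage1_points, eps=1.0, min_pts=2)  # hypothetical settings
```

In a refinement step, the same `dbscan` function could be applied to the `tetranucleotide_freqs` vectors of the contigs within each initial bin. Because GC content and coverage live on very different scales, a real workflow would likely need some normalisation before clustering; how CoMet handles this is not described here.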
Precision comparison between CoMet and other contig coverage and/or composition based binning methods
| Dataset | Metawatt | MaxBin | MyCC (default) | MyCC (coverage) | CoMet |
|---|---|---|---|---|---|
| sim10_1 | 96.69 (4) | 92.44 (5) | 97.47 (2) | 97.42 (3) | |
| sim10_20x | 84.25 (5) | 96.90 (2) | 90.66 (4) | 96.71 (3) | |
| sim10_80x | 95.13 (5) | 95.63 (4) | 98.55 (2) | | 97.12 (3) |
| sim30_CAMI | 53.68 (5) | 66.60 (4) | 75.02 (2) | 75.02 (2) | |
Binning methods are ranked by their precision on each dataset, with ranks given in parentheses. Bold values indicate the highest precision
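The ranks in parentheses appear to follow standard competition ranking: tied methods share a rank and the next rank is skipped, as with the two MyCC variants on sim30_CAMI (both rank 2, followed by rank 4). A minimal sketch of that ranking scheme is shown below. The function name and the toy scores are illustrative assumptions, not values from the table.

```python
def competition_ranks(scores):
    """Rank methods by descending score; ties share the best rank (1, 2, 2, 4, ...)."""
    return {
        name: 1 + sum(other > score for other in scores.values())
        for name, score in scores.items()
    }


# Made-up scores for four hypothetical methods.
toy_scores = {"method_a": 90.0, "method_b": 75.0, "method_c": 75.0, "method_d": 60.0}
toy_ranks = competition_ranks(toy_scores)
```

Each method's rank is one plus the number of methods that scored strictly higher, which is what produces the shared rank for ties.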
F1-Score comparison between CoMet and other contig coverage and/or composition based binning methods
| Dataset | Metawatt | MaxBin | MyCC (default) | MyCC (coverage) | CoMet |
|---|---|---|---|---|---|
| sim10_1 | 91.58 (4) | 94.91 (3) | | 96.09 (2) | 89.28 (5) |
| sim10_20x | 70.5 (5) | | 83.46 (4) | 88.35 (3) | 96.7 (2) |
| sim10_80x | 75.21 (5) | 95.61 (3) | 98.55 (2) | | 88.56 (4) |
| sim30_CAMI | 60.55 (5) | 75.58 (4) | 80.93 (2) | 80.93 (2) | |
Binning methods are ranked by their F1-score on each dataset, with ranks given in parentheses. Bold values indicate the highest F1-score
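This record does not define how precision and F1-score were computed. The sketch below uses one definition that is common in binning evaluations, and it is an assumption here: precision credits each bin with its majority species, recall credits each species with the bin that holds most of its contigs, and F1 is their harmonic mean. The function name, its arguments, and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def binning_scores(assignments, total_contigs=None):
    """Return (precision, recall, f1) for (bin_id, true_species) pairs.

    `assignments` lists only the binned contigs. `total_contigs` may be
    larger when some contigs were left unbinned (assumed convention).
    """
    per_bin = defaultdict(Counter)      # bin -> species counts
    per_species = defaultdict(Counter)  # species -> bin counts
    n_binned = 0
    for bin_id, species in assignments:
        per_bin[bin_id][species] += 1
        per_species[species][bin_id] += 1
        n_binned += 1
    if n_binned == 0:
        return 0.0, 0.0, 0.0
    n_total = total_contigs if total_contigs is not None else n_binned
    precision = sum(max(c.values()) for c in per_bin.values()) / n_binned
    recall = sum(max(c.values()) for c in per_species.values()) / n_total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data: bin b1 holds 8 contigs of species A and 2 of B; bin b2 holds 10 of B.
toy = [("b1", "A")] * 8 + [("b1", "B")] * 2 + [("b2", "B")] * 10
# 25 contigs in total, so 5 were left unbinned.
precision, recall, f1 = binning_scores(toy, total_contigs=25)
```

Under this definition a method can achieve high precision with a lower F1-score when many contigs are left unbinned or a species is split across bins, since both lower recall.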
The number of species recovered from different binning approaches
| Dataset | Metawatt | MaxBin | MyCC (default) | MyCC (coverage) | CoMet |
|---|---|---|---|---|---|
| sim10_1 | 9 (3) | 8 (5) | | | 9 (3) |
| sim10_20x | | 3 (3) | 3 (3) | 2 (5) | |
| sim10_80x | 6 (5) | | | | 7 (4) |
| sim30_CAMI | 9 (5) | 13 (4) | 18 (2) | 18 (2) | |
Binning methods are ranked by the number of species recovered, with ranks given in parentheses. Bold values indicate the highest number of species identified
Individual precision and contigs binned from each identified species from the strain dataset from CAMI
| Taxon Id | MaxBin: Precision | MaxBin: Contigs binned (%) | Metawatt: Precision | Metawatt: Contigs binned (%) | MyCC (default): Precision | MyCC (default): Contigs binned (%) | MyCC (coverage): Precision | MyCC (coverage): Contigs binned (%) | CoMet: Precision | CoMet: Contigs binned (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NI | NI | NI | NI | NI | NI | NI | NI | | |
| 2 | 96.15 | | NI | NI | NI | NI | NI | NI | | 88 |
| 3 | NI | NI | 93.16 | 63.08 | 93.58 | 87.06 | 93.58 | 87.06 | | |
| 4 | 88.74 | 87.31 | NI | NI | 89.37 | | 89.37 | | | 51.32 |
| 5 | NI | NI | 74.84 | 56.93 | 97.94 | | 97.94 | | | 67.61 |
| 6 | NI | NI | 61.84 | 58.02 | 83.82 | 77.03 | 83.82 | 77.03 | | 77.03 |
| 7 | NI | NI | 96.22 | 59.89 | 96.59 | | 96.59 | | | 58.17 |
| 8 | NI | NI | NI | NI | NI | NI | NI | NI | | 96.72 |
| 9 | 55.18 | 86.26 | NI | NI | 93.46 | | 93.46 | | | 50.07 |
| 10 | NI | NI | 80.72 | 57.41 | | | | | NI | NI |
| 11 | 92.96 | | NI | NI | 68.18 | 51.92 | 68.18 | 51.92 | | 51.15 |
| 12 | 97.25 | | 83.25 | 92.29 | 96.1 | 89.97 | 96.1 | 89.97 | | 63.94 |
| 13 | 50 | | NI | NI | 52.17 | 75 | 52.17 | 75 | | 75 |
| 14 | 51.14 | 63.38 | NI | NI | 65.22 | 63.38 | 65.22 | 63.38 | | |
| 15 | 53.3 | 57.74 | NI | NI | 82.29 | | 82.29 | | | 62.13 |
| 16 | NI | NI | NI | NI | NI | NI | NI | NI | | |
| 17 | NI | NI | NI | NI | | 50.55 | | 50.55 | 51.09 | |
| 18 | 72.58 | | NI | NI | 66.1 | 81.25 | 66.1 | 81.25 | | 89.58 |
| 19 | NI | NI | NI | NI | NI | NI | NI | NI | | |
| 20 | 75.05 | 94.62 | 82.2 | 84.17 | 92.54 | | 92.54 | | | 96.24 |
| 20 | 73.25 | 74.78 | NI | NI | | 94.93 | | 94.93 | NI | NI |
| 22 | 88.72 | 87.41 | 85.23 | 64.13 | | 78.15 | | 78.15 | NI | NI |
| 23 | 93.97 | 99 | 84.29 | 86.51 | 96.41 | | 96.41 | | | 70.57 |
Bold data represent the highest precisions and highest percentage of contigs binned for each identified species. NI: Not Identified
Fig. 2 Performance of CoMet on contigs with different numbers of distinct coverage distributions. The figure shows how the binning performance of CoMet varies as the differences in contig coverage values change in a sample of multiple strains