Jarno Alanko, Fabio Cunial, Djamal Belazzougui, Veli Mäkinen.
Abstract
BACKGROUND: A metagenomic sample is a set of DNA fragments, randomly extracted from multiple cells in an environment, belonging to distinct, often unknown species. Unsupervised metagenomic clustering aims at partitioning a metagenomic sample into sets that approximate taxonomic units, without using reference genomes. Since samples are large and steadily growing, space-efficient clustering algorithms are strongly needed.
Keywords: Burrows-Wheeler transform; Metagenomics; Read clustering; Right-maximal k-mer; Submaximal repeat; Suffix-link tree; Text indexing
Year: 2017 PMID: 28361710 PMCID: PMC5374685 DOI: 10.1186/s12859-017-1466-6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Precision, sensitivity, peak memory usage and wall clock time of the following clustering algorithms: bwtCluster (BWT), MetaCluster (MC), MBBC, and BiMeta (BM)
| Dataset | Size (Gbp) | Tool | Precluster prec. | Precluster sens. | Cluster prec. | Cluster sens. | Mem. (GB) | Time |
|---|---|---|---|---|---|---|---|---|
| Species level | 0.1 | BWT | 0.84 | 0.18 | 0.83 | 0.8 | 0.3 | 3.7 m |
| | | MC | × | × | × | × | × | × |
| | | MBBC | | | 0.7 | 0.7 | 3.8 | 7.3 m |
| | | BM | | | 0.52 | 0.76 | 2.6 | 9.1 m |
| Genus level | 0.1 | BWT | 0.87 | 0.09 | 0.87 | 0.84 | 0.3 | 3.7 m |
| | | MC | × | × | × | × | × | × |
| | | MBBC | | | 0.79 | 0.79 | 3.9 | 8.7 m |
| | | BM | | | 0.59 | 0.59 | 2.5 | 9.1 m |
| Family level | 0.1 | BWT | 0.94 | 0.12 | 0.94 | 0.91 | 0.3 | 3.9 m |
| | | MC | × | × | × | × | × | × |
| | | MBBC | | | 0.78 | 0.78 | 3.8 | 4.6 m |
| | | BM | | | 0.65 | 0.65 | 2.6 | 9 m |
| A | 1.7 | BWT | 0.84 | 0.01 | 0.71 | 0.34 | 6.4 | 1 h |
| | | MC | 0.90 | 0.05 | 0.76 | 0.71 | 22.4 | 41 m |
| | | MBBC | | | | | ≥64 | ≥24 h |
| | | BM | | | | | ≥69 | ≥24 h |
| B | 0.6 | BWT | 0.92 | 0.01 | 0.76 | 0.72 | 1.9 | 27 m |
| | | MC | 0.97 | 0.05 | 0.82 | 0.37 | 10.5 | 11 m |
| | | MBBC | | | | | ≥27 | ≥24 h |
| | | BM | | | 0.30 | 0.52 | 22 | 4.6 h |
| C | 1.7 | BWT | 0.84 | 0.01 | 0.69 | 0.40 | 5.7 | 1 h |
| | | MC | 0.90 | 0.05 | 0.71 | 0.70 | 22.3 | 43 m |
| | | MBBC | | | | | ≥65 | ≥24 h |
| | | BM | | | | | ≥59 | ≥24 h |
| Human gut | 1.4 | BWT | | | | | 4.1 | 53 m |
| | | MC | | | | | × | × |
| | | MBBC | | | | | ≥35 | ≥24 h |
| | | BM | | | | | ≥17 | ≥24 h |
| Mouse gut | 0.8 | BWT | | | | | 2.2 | 25 m |
| | | MC | | | | | × | × |
| | | MBBC | | | | | ≥22 | ≥24 h |
| | | BM | | | | | ≥35 | ≥24 h |
The size of each dataset is given in billion base pairs (Gbp). Algorithms that return an error are marked with the symbol ×.
Fig. 1 Running bwtCluster on dataset A: time (horizontal axis, minutes) versus memory (vertical axis, gigabytes). BWT index constructions are highlighted in gray.