Temesgen Hailemariam Dadi, Bernhard Y Renard, Lothar H Wieler, Torsten Semmler, Knut Reinert.
Abstract
Identification and quantification of microorganisms is a significant step in studying the alpha and beta diversities within and between microbial communities, respectively. Both identification and quantification of a given microbial community can be carried out using whole genome shotgun (WGS) sequences with less bias than when using 16S-rDNA sequences. However, regions of DNA shared among reference genomes and taxonomic units make it difficult to assign reads correctly to their true origins. Existing microbial community profiling tools commonly deal with this problem either by preparing signature-based unique references or by assigning an ambiguous read to its least common ancestor in a taxonomic tree. The former method can only make use of reads that map to the curated regions, while the latter suffers from a lack of uniquely mapped reads at lower (more specific) taxonomic ranks. Moreover, even where the tools perform well in calling the organisms present in a sample, there is still room for improvement in determining the correct relative abundance of those organisms. We present a new method, Species Level Identification of Microorganisms from Metagenomes (SLIMM), which addresses these issues by using coverage information of reference genomes to remove unlikely genomes from the analysis and thereby gain more uniquely mapped reads to assign at lower ranks of a taxonomic tree. SLIMM is based on a few seemingly easy steps which, when combined, create a tool that outperforms state-of-the-art tools in run-time and memory usage while being on par or better in computing quantitative and qualitative information at the species level.
Keywords: Metagenomics; Microbial communities; Microbiology; Microorganisms; NGS data; Taxonomic profiling
Year: 2017 PMID: 28367376 PMCID: PMC5372838 DOI: 10.7717/peerj.3138
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
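The least-common-ancestor strategy that the abstract attributes to existing tools can be illustrated with a minimal Python sketch. The toy taxonomy, the `PARENT` map, and the function names below are hypothetical and chosen only for illustration; they are not taken from SLIMM or any other tool.

```python
from typing import Dict, Iterable, List

# Hypothetical toy taxonomy (child -> parent); the root maps to itself.
PARENT: Dict[str, str] = {
    "root": "root",
    "Enterobacteriaceae": "root",
    "Escherichia": "Enterobacteriaceae",
    "Salmonella": "Enterobacteriaceae",
    "E. coli": "Escherichia",
    "E. fergusonii": "Escherichia",
    "S. enterica": "Salmonella",
}


def lineage(taxon: str, parent: Dict[str, str]) -> List[str]:
    """Return the path from a taxon up to the root (taxon first)."""
    path = [taxon]
    while parent[path[-1]] != path[-1]:
        path.append(parent[path[-1]])
    return path


def least_common_ancestor(taxa: Iterable[str], parent: Dict[str, str]) -> str:
    """Return the deepest node shared by the lineages of all given taxa."""
    paths = [lineage(t, parent)[::-1] for t in taxa]  # root first
    if not paths:
        raise ValueError("at least one taxon is required")
    lca = paths[0][0]
    for nodes in zip(*paths):
        if len(set(nodes)) != 1:
            break  # lineages diverge below the current ancestor
        lca = nodes[0]
    return lca


# A read hitting two species of the same genus is pushed up to the genus.
print(least_common_ancestor(["E. coli", "E. fergusonii"], PARENT))
```

The sketch shows why this strategy loses resolution: a read shared between two species can be reported no lower than their common genus, which is the shortage of species-level assignments the abstract describes.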
Figure 1. Overview of the SLIMM methodology. (A) The SLIMM algorithm: SLIMM takes two inputs, the SLIMMDB and an alignment file in either SAM or BAM format, and calculates statistics for each reference sequence in the database. SLIMM uses coverage information to exclude reference sequences from consideration and then recalculates the statistics. This, in turn, yields the counts of reads that are uniquely mapped to a clade at a given taxonomic rank. (B) The SLIMM pipeline: the preprocessing module of SLIMM downloads/updates all available genomes of a given interest group (e.g., Archaea, Bacteria, Viruses or any combination of them) and tags the sequences with their corresponding taxonomic information. A read mapper is then used to map the WGS reads to these reference sequences. The SLIMM algorithm then uses the mapping results to produce taxonomic profile reports. (C) Reference filtering based on coverage information: G2 and G3 did not pass the filtering steps because they lacked sufficient coverage by uniquely mapped reads and by all reads, respectively.
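The filtering idea in panels (A) and (C) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not SLIMM's implementation: coverage is approximated as the fraction of fixed-size bins hit by at least one read, and the thresholds `min_cov_all` and `min_cov_unique`, the data layout, and all function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple

# (read id, reference id, index of the bin the read falls into) -- assumed layout
Alignment = Tuple[str, str, int]


def unique_reads(hits: Dict[str, Set[str]]) -> Dict[str, str]:
    """Map each read that hits exactly one reference to that reference."""
    return {read: next(iter(refs)) for read, refs in hits.items() if len(refs) == 1}


def bin_coverage(alignments: List[Alignment], n_bins: Dict[str, int],
                 keep_reads: Optional[Set[str]] = None) -> Dict[str, float]:
    """Fraction of bins per reference covered by (optionally restricted) reads."""
    covered: Dict[str, Set[int]] = defaultdict(set)
    for read, ref, bin_idx in alignments:
        if keep_reads is None or read in keep_reads:
            covered[ref].add(bin_idx)
    return {ref: len(covered[ref]) / n_bins[ref] for ref in n_bins}


def filter_and_recount(alignments: List[Alignment], n_bins: Dict[str, int],
                       min_cov_all: float = 0.5,       # assumed threshold
                       min_cov_unique: float = 0.25):  # assumed threshold
    """Drop poorly covered references, then recount uniquely mapped reads."""
    hits: Dict[str, Set[str]] = defaultdict(set)
    for read, ref, _ in alignments:
        hits[read].add(ref)
    uniq = unique_reads(hits)
    cov_all = bin_coverage(alignments, n_bins)
    cov_uniq = bin_coverage(alignments, n_bins, keep_reads=set(uniq))
    kept = {ref for ref in n_bins
            if cov_all[ref] >= min_cov_all and cov_uniq[ref] >= min_cov_unique}
    # Reads shared only with removed references become unique after filtering.
    remaining = {read: refs & kept for read, refs in hits.items() if refs & kept}
    counts: Dict[str, int] = defaultdict(int)
    for ref in unique_reads(remaining).values():
        counts[ref] += 1
    return kept, dict(counts)


# Toy data loosely mirroring panel (C): G2 lacks unique coverage, G3 lacks overall coverage.
EXAMPLE: List[Alignment] = [
    ("r1", "G1", 0), ("r2", "G1", 1),
    ("r3", "G1", 2), ("r3", "G2", 0),
    ("r4", "G1", 3), ("r4", "G2", 1),
    ("r5", "G1", 0), ("r5", "G2", 2),
    ("r6", "G3", 0),
]
BINS = {"G1": 4, "G2": 4, "G3": 4}
kept, counts = filter_and_recount(EXAMPLE, BINS)
print(kept, counts)
```

In this toy input only two reads map uniquely to G1 before filtering. Once G2 is removed, the three reads it shared with G1 also count as unique, which is the gain in uniquely mapped reads that the caption describes.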
Runtime and memory comparison of SLIMM against existing methods.
| Alignment + SLIMM | Kraken | GOTTCHA | mOTUs | |
|---|---|---|---|---|
| Avg. Runtime (Seconds) | 157.4 | 1727.1 | 1526.6 | |
| Peak Memory (GB) | 102 | 4 | 1.6 |
Comparison of SLIMM against different tools regarding precision, recall and F1 at the species level. The highest values in each row are marked bold for both precision and recall.
| Precision | Recall | F1 | |||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Type | Dataset | SLIMM | Kraken | GOTTCHA | mOTUs | SLIMM | Kraken | GOTTCHA | mOTUs | SLIMM | Kraken | GOTTCHA | mOTUs |
| Mock | MG01 | 0.8923 | 0.6264 | 0.9808 | 0.9194 | 0.8226 | 0.8065 | 0.7451 | 0.8947 | 0.8929 | |||
| MG02 | 0.9545 | 0.8400 | 0.9524 | 0.8571 | 0.9130 | 0.9756 | 0.9231 | ||||||
| MG03 | 0.9524 | 0.6897 | 0.8571 | 0.4286 | 0.8000 | 0.9231 | 0.6000 | ||||||
| Mimic.Sim | MG04 | 0.4250 | 0.6000 | 0.9474 | 0.6176 | 0.5294 | 0.5965 | 0.6087 | 0.6792 | ||||
| MG05 | 0.6650 | 0.8714 | 0.9630 | 0.4656 | 0.1985 | 0.7988 | 0.6070 | 0.3291 | |||||
| Rand.Sim | MG06 | 0.4352 | 0.6897 | 0.8718 | 0.9375 | 0.8333 | 0.7083 | 0.6026 | 0.7547 | 0.7816 | |||
| MG07 | 0.4352 | 0.6964 | 0.9091 | 0.9375 | 0.8125 | 0.6250 | 0.6026 | 0.7500 | 0.7407 | ||||
| MG08 | 0.4299 | 0.7143 | 0.8824 | 0.9375 | 0.8333 | 0.6250 | 0.5935 | 0.7692 | 0.7317 | ||||
| MG09 | 0.7220 | 0.8396 | 0.9286 | 0.9211 | 0.5855 | 0.3421 | 0.8291 | 0.6899 | 0.5000 | ||||
| MG10 | 0.7178 | 0.7949 | 0.9574 | 0.9276 | 0.4079 | 0.2961 | 0.8192 | 0.5391 | 0.4523 | ||||
| MG11 | 0.7164 | 0.8058 | 0.9464 | 0.9079 | 0.5461 | 0.3487 | 0.8159 | 0.6510 | 0.5096 | ||||
| MG12 | 0.8284 | 0.7333 | 0.9773 | 0.9315 | 0.0377 | 0.1473 | 0.8889 | 0.0717 | 0.2560 | ||||
| MG13 | 0.8237 | 0.8095 | 0.9811 | 0.9281 | 0.0582 | 0.1781 | 0.8728 | 0.1086 | 0.3014 | ||||
| MG14 | 0.9851 | 0.8000 | 0.9811 | 0.9041 | 0.0548 | 0.1781 | 0.9429 | 0.1026 | 0.3014 | ||||
| CAMI | MG15 | 0.7644 | 0.7397 | 0.8000 | 0.7990 | 0.2714 | 0.1206 | 0.7813 | 0.3971 | 0.2096 | |||
| MG16 | 0.8377 | 0.7027 | 0.6883 | 0.7839 | 0.2663 | 0.1106 | 0.7411 | 0.3841 | 0.1956 | ||||
| MG17 | 0.7608 | 0.4531 | 0.7368 | 0.7990 | 0.1457 | 0.1407 | 0.7794 | 0.2205 | 0.2363 | ||||
| MG18 | 0.6996 | 0.4839 | 0.7778 | 0.7839 | 0.1508 | 0.1407 | 0.7393 | 0.2299 | 0.2383 | ||||
Notes.
GOTTCHA and mOTUs are at a disadvantage in recall and F1 because their databases do not contain the complete set of references for the corresponding datasets.
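For readers who want to relate the three metric groups in the table, the following Python sketch computes species-level precision, recall and F1 from a set of called species and a set of true species, using the standard definitions. The function names and the toy species sets are illustrative assumptions. The final check uses a precision of 0.7644 and a recall of 0.7990 from the MG15 row, which give the F1 of 0.7813 that appears in the same row.

```python
from typing import Set, Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(predicted: Set[str], truth: Set[str]) -> Tuple[float, float, float]:
    """Species-level precision, recall and F1 for a set of called species."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall, f1_score(precision, recall)


# Hypothetical call set: two of four called species are truly present,
# and one of three true species is missed.
p, r, f1 = precision_recall_f1({"A", "B", "C", "D"}, {"A", "B", "E"})
print(p, r, f1)

# Consistency check against the precision/recall/F1 values in the MG15 row.
print(round(f1_score(0.7644, 0.7990), 4))
```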
Figure 2. PR curves. (A) and (B): comparison of SLIMM against existing methods, with true positive rate (TPR)/recall drawn against precision. SLIMM showed the highest performance. GOTTCHA did not report any false positives but is low in recall. (C) and (D): PR curves of different variants of SLIMM, i.e., SLIMM-DG (with digital normalization), SLIMM-NF (without the filtration step based on coverage landscape), SLIMM-NF-DG (without filtration but with digital normalization), and SLIMM using alignments produced by the read mapper Bowtie2.
Figure 3. Predicting abundances correctly. (A) Random dataset and (B) CAMI dataset: abundances predicted by different tools compared to the true abundances used for simulation. SLIMM predicted the abundances more accurately than the other tools. Kraken overestimates the abundances. GOTTCHA and mOTUs did not perform well in predicting the abundances. Violin plots for (C) the random dataset and (D) the CAMI dataset: SLIMM has the lowest divergence from the true abundances.