Ze-Gang Wei1,2, Xiao-Dan Zhang1, Ming Cao3,4, Fei Liu1, Yu Qian1, Shao-Wu Zhang2.
Abstract
With the advent of next-generation sequencing technology, it has become convenient and cost-efficient to thoroughly characterize the microbial diversity and taxonomic composition of various environmental samples. Millions of sequencing reads can be generated, and how to utilize this enormous sequence resource has become a critical concern for microbial ecologists. One particular challenge is picking operational taxonomic units (OTUs) in 16S rRNA sequence analysis. Fortunately, this challenge can be addressed directly by sequence clustering, which attempts to group similar sequences. Numerous clustering methods have therefore been proposed to cluster 16S rRNA sequences into OTUs. However, each method has its own clustering mechanism, and different methods produce different outputs. Even a slight parameter change for the same method can generate distinct results, so choosing an appropriate method has become a challenge for inexperienced users. A lot of time and resources can be wasted in selecting clustering tools and analyzing the clustering results. In this study, we review recent advances in clustering methods for OTU picking, focusing on three aspects: (i) the principles of existing clustering algorithms, (ii) benchmark dataset construction and evaluation metrics for OTU picking, and (iii) the performance of different methods at various distance thresholds on benchmark datasets. This paper aims to help biological researchers select reasonable clustering methods for analyzing their collected sequences and to help algorithm developers design more efficient sequence clustering methods.
Keywords: 16S rRNA; high-throughput sequencing; metagenomics; operational taxonomic units; sequence clustering
Year: 2021; PMID: 33841367; PMCID: PMC8024490; DOI: 10.3389/fmicb.2021.644012
Source DB: PubMed; Journal: Front Microbiol; ISSN: 1664-302X; Impact factor: 5.640
FIGURE 1. Schematic diagram of hierarchical clustering algorithms. (A) Input read set, (B) distance matrix, (C) hierarchical tree, and (D) OTU formation.
FIGURE 2. The distance between two clusters as defined in the single-linkage (SL) (A), complete-linkage (CL) (B), and average-linkage (AL) (C) clustering algorithms.
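The three cluster-distance definitions named in the Figure 2 caption can be illustrated with a minimal sketch. It assumes the standard textbook definitions (SL takes the minimum pairwise distance, CL the maximum, AL the mean) and uses a hypothetical `mismatch_distance` on pre-aligned, equal-length sequences as a stand-in for whatever pairwise distance a real tool computes. The function names are illustrative and do not come from any of the reviewed methods.

```python
from itertools import product


def mismatch_distance(a: str, b: str) -> float:
    """Fraction of mismatched positions (assumes pre-aligned, equal-length reads)."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster_distance(c1, c2, linkage="average", dist=mismatch_distance):
    """Distance between two clusters under single, complete, or average linkage."""
    pair_d = [dist(a, b) for a, b in product(c1, c2)]
    if linkage == "single":      # closest pair of members
        return min(pair_d)
    if linkage == "complete":    # farthest pair of members
        return max(pair_d)
    if linkage == "average":     # mean over all cross-cluster pairs
        return sum(pair_d) / len(pair_d)
    raise ValueError(f"unknown linkage: {linkage}")


cluster_a = ["AAAA", "AAAT"]
cluster_b = ["AATT", "TTTT"]
for name in ("single", "complete", "average"):
    print(name, cluster_distance(cluster_a, cluster_b, name))
```

For any pair of clusters the SL distance is never larger than the AL distance, which in turn is never larger than the CL distance, so at a fixed threshold SL tends to merge clusters most readily and CL least.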
FIGURE 3. Schematic diagram of classical heuristic clustering methods. (A) Sequence assignment, (B) new seed generation, and (C) OTU results.
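The seed-based procedure outlined in the Figure 3 caption (assign a sequence to an existing seed, otherwise create a new seed) can be sketched as follows. This is a generic greedy scheme written under stated assumptions, not the implementation of any specific tool: reads are processed in input order, each read joins the first seed within the threshold, and the hypothetical `mismatch_distance` stands in for a real alignment-based distance.

```python
def mismatch_distance(a: str, b: str) -> float:
    """Fraction of mismatched positions (assumes pre-aligned, equal-length reads)."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold, dist=mismatch_distance):
    """Greedy seed-based clustering; returns a list of OTUs (lists of reads).

    The first read of each OTU is its seed.
    """
    seeds = []
    otus = []
    for s in seqs:
        for i, seed in enumerate(seeds):
            if dist(s, seed) <= threshold:   # (A) sequence assignment
                otus[i].append(s)
                break
        else:                                # (B) no seed is close enough
            seeds.append(s)
            otus.append([s])
    return otus                              # (C) OTU results


reads = ["AAAA", "AAAT", "TTTT", "TTTA", "AATT"]
print(greedy_cluster(reads, threshold=0.25))
```

Because assignment depends on input order and on the choice of seed, reordering the reads or slightly changing the threshold can change the resulting OTUs, which is consistent with the sensitivity noted in the abstract.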
FIGURE 4. Schematic diagram of network-based methods.
FIGURE 5. Publication years of the operational taxonomic unit (OTU) picking methods mentioned in this paper.
Statistics of three benchmark datasets for operational taxonomic unit (OTU) picking.
| Simulated dataset | 9 | 22 K | 500 | - |
| V4 dataset | 68 | ∼511 K | 253 | V4 |
| Global 16S rRNA | 1,498 | ∼887 K | ∼1,400 | V1-V9 |
FIGURE 6. Normalized mutual information (NMI) values of different clustering methods on the simulated dataset.
Maximum normalized mutual information (NMI) values for different OTU picking methods on the simulated dataset.
| Max. NMI | 0.9503 | 0.9107 | 0.9252 | 0.8979 | 0.9334 | 0.9334 |
| OTUs number | 9 | 10 | 17 | 13 | 9 | 9 |
| Max. NMI | 0.9334 | 0.9333 | 0.9334 | 0.8795 | 0.9293 | 0.9333 |
| OTUs number | 9 | 9 | 9 | 9 | 9 | 9 |
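As an illustration of how an NMI value like those tabulated above can be computed, the following minimal sketch compares a reference labeling with an OTU assignment. It assumes the normalization NMI = 2·I(A;B) / (H(A) + H(B)), which is one common convention; the exact normalization used for the reported values is not stated in this record, and the function name `nmi` is illustrative.

```python
from collections import Counter
from math import log


def nmi(true_labels, otu_labels):
    """Normalized mutual information, 2*I/(H_true + H_otu), in [0, 1]."""
    n = len(true_labels)
    if n == 0 or n != len(otu_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    count_t = Counter(true_labels)
    count_o = Counter(otu_labels)
    joint = Counter(zip(true_labels, otu_labels))

    # Shannon entropies of the two partitions
    h_t = -sum(c / n * log(c / n) for c in count_t.values())
    h_o = -sum(c / n * log(c / n) for c in count_o.values())
    # Mutual information between the two partitions
    mi = sum(
        c / n * log(c * n / (count_t[t] * count_o[o]))
        for (t, o), c in joint.items()
    )
    denom = h_t + h_o
    # Both partitions are a single cluster: treat them as identical
    return 1.0 if denom == 0 else 2 * mi / denom


species = ["s1", "s1", "s2", "s2"]
print(nmi(species, ["otu_a", "otu_a", "otu_b", "otu_b"]))  # same partition
print(nmi(species, ["otu_a", "otu_b", "otu_a", "otu_b"]))  # unrelated partition
```

NMI depends only on how the reads are grouped, not on the label names, so an OTU assignment that reproduces the reference grouping scores 1 regardless of how the OTUs are numbered.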
FIGURE 7. The Matthews correlation coefficient (MCC) values of 12 OTU picking methods on the simulated dataset.
The average, SD, and maximum MCC values of 12 OTU picking methods on the simulated dataset.
| Max. MCC | 0.9980 | 0.9369 | 0.9838 | 0.9947 | 0.9840 | 0.9980 |
| OTUs number | 9 | 528 | 17 | 16 | 27 | 9 |
| Ave. MCC | 0.9363 | 0.8198 | 0.7929 | 0.9286 | 0.9120 | 0.8347 |
| SD of MCC | 0.0343 | 0.0737 | 0.1750 | 0.0366 | 0.0451 | 0.1585 |
| Max. MCC | 0.9349 | 0.9921 | 0.9868 | 0.9106 | 0.9868 | 0.9980 |
| OTUs number | 1,291 | 15 | 9 | 9 | 9 | 9 |
| Ave. MCC | 0.8204 | 0.8891 | 0.5474 | 0.7832 | 0.8879 | 0.9564 |
| SD of MCC | 0.0578 | 0.0567 | 0.1385 | 0.1436 | 0.0781 | 0.0270 |
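The MCC values above summarize how well an OTU assignment agrees with a reference. A minimal sketch of one way to compute such a score is given below. It assumes a pair-counting formulation in which every pair of reads is classified against a reference labeling (for example, species of origin): a pair is a true positive if the two reads share both a reference label and an OTU, and so on. Whether the reported values use reference labels or a distance threshold as the ground truth is not stated in this record, so this is an illustration rather than the authors' procedure.

```python
from itertools import combinations
from math import sqrt


def pairwise_mcc(true_labels, otu_labels):
    """Matthews correlation coefficient over all pairs of reads."""
    if len(true_labels) != len(otu_labels):
        raise ValueError("label lists must have equal length")
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_otu = otu_labels[i] == otu_labels[j]
        if same_true and same_otu:
            tp += 1   # correctly placed together
        elif same_otu:
            fp += 1   # wrongly placed together
        elif same_true:
            fn += 1   # wrongly split apart
        else:
            tn += 1   # correctly kept apart
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: return 0 when the coefficient is undefined
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


species = ["s1", "s1", "s2", "s2"]
print(pairwise_mcc(species, [0, 0, 1, 1]))
print(pairwise_mcc(species, [0, 1, 0, 1]))
```

The loop visits every pair of reads, so the cost grows quadratically with the number of reads; for datasets of the sizes listed in the benchmark table, the four counts would more practically be derived from a contingency table of reference labels against OTUs.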
FIGURE 8. NMI values of eight OTU picking methods at different clustering thresholds on the V4 dataset.
FIGURE 9. MCC values of eight OTU picking methods at different clustering thresholds on the V4 dataset.
The average, SD, and maximum MCC values of eight OTU picking methods on the V4 dataset.
| Max. | 0.9913 | 0.9797 | 0.9746 | 0.9884 | 0.9875 | 0.9083 | 0.9876 | 0.9904 |
| Ave. | 0.9480 | 0.8481 | 0.8444 | 0.9478 | 0.8938 | 0.7671 | 0.8697 | 0.9246 |
| SD | 0.0330 | 0.1438 | 0.0933 | 0.0283 | 0.1409 | 0.1593 | 0.1382 | 0.1175 |