Evguenia Kopylova, Jose A Navas-Molina, Céline Mercier, Zhenjiang Zech Xu, Frédéric Mahé, Yan He, Hong-Wei Zhou, Torbjørn Rognes, J Gregory Caporaso, Rob Knight.
Abstract
Sequence clustering is a common early step in amplicon-based microbial community analysis, when raw sequencing reads are clustered into operational taxonomic units (OTUs) to reduce the run time of subsequent analysis steps. Here, we evaluated the performance of recently released state-of-the-art open-source clustering software products, namely, OTUCLUST, Swarm, SUMACLUST, and SortMeRNA, against current principal options (UCLUST and USEARCH) in QIIME, hierarchical clustering methods in mothur, and USEARCH's most recent clustering algorithm, UPARSE. All the latest open-source tools showed promising results, reporting up to 60% fewer spurious OTUs than UCLUST, indicating that the underlying clustering algorithm can vastly reduce the number of these derived OTUs. Furthermore, we observed that stringent quality filtering, such as is done in UPARSE, can cause a significant underestimation of species abundance and diversity, leading to incorrect biological results. Swarm, SUMACLUST, and SortMeRNA have been included in the QIIME 1.9.0 release.
IMPORTANCE Massive collections of next-generation sequencing data call for fast, accurate, and easily accessible bioinformatics algorithms to perform sequence clustering. A comprehensive benchmark is presented, including open-source tools and the popular USEARCH suite. Simulated, mock, and environmental communities were used to analyze sensitivity, selectivity, species diversity (alpha and beta), and taxonomic composition. The results demonstrate that recent clustering algorithms can significantly improve accuracy and preserve estimated diversity without the application of aggressive filtering. Moreover, these tools are all open source, apply multiple levels of multithreading, and scale to the demands of modern next-generation sequencing data, which is essential for the analysis of massive multidisciplinary studies such as the Earth Microbiome Project (EMP) (J. A. Gilbert, J. K. Jansson, and R. Knight, BMC Biol 12:69, 2014, http://dx.doi.org/10.1186/s12915-014-0069-1).
Keywords: amplicon sequencing; microbial community analysis; operational taxonomic units; sequence clustering
Year: 2016 PMID: 27822515 PMCID: PMC5069751 DOI: 10.1128/mSystems.00003-15
Source DB: PubMed Journal: mSystems ISSN: 2379-5077 Impact factor: 6.496
Description of studies used in analysis
| Data set | QIIME identity | Reference | Gene | Region | No. of reads | No. of samples | Read length (nt) | Platform |
|---|---|---|---|---|---|---|---|---|
| Simulated | | | | | | | | |
| sim_even | | | 16S | V4 | 107,600 | 1 | 150 | ART |
| sim_staggered | | | 16S | V4 | 107,025 | 1 | 150 | ART |
| Mock | | | | | | | | |
| Bokulich_2 | 1685 | | 16S | V4 | 6,938,836 | 4 | 189–251 | MiSeq |
| Bokulich_3 | 1686 | | 16S | V4 | 3,594,237 | 4 | 114–151 | MiSeq |
| Bokulich_6 | 1688 | | 16S | V4 | 250,903 | 1 | 114–150 | MiSeq |
| mock_nematodes | | | 18S | V4 | 9,061 | 1 | 54–305 | GS FLX |
| Genuine | | | | | | | | |
| canadian_soil | 632 | | 16S | V4 | 2,966,053 | 13 | 76–100 | HiSeq |
| body_sites | 449 | | 16S | V2 | 886,630 | 602 | 117–351 | GS FLX |
| global_soil | 2107 | | 18S | V9 | 9,252,764 | 57 | 119–151 | HiSeq |
Benchmark summary
OTU counts do not include singletons. F measure (F1) is for assigned taxonomies at the genus level. The phylogenetic diversity (PD) whole-tree column for Bokulich_2 and Bokulich_3 represents PD intervals across various sampling depths. Procrustes M² (the sum of the squared deviations, i.e., the dissimilarity of two data sets for UniFrac PCoA) and rho (Pearson’s correlation coefficient for taxonomies at the genus level) values are with respect to UCLUST (the default for QIIME versions 1.0.0 to 1.9.1). Monte Carlo P values were not included, since all values were <0.05 except for de novo usearch52 versus uclust (P = 0.09). The darkest blue shades represent the highest F1 scores, while the darkest red shades represent results closest to those obtained with UCLUST.
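As a rough illustration of the rho statistic described above, the following sketch computes Pearson's correlation coefficient between two genus-level relative-abundance profiles, for example one tool's profile against the UCLUST profile. The function names `align_profiles` and `pearson_r`, the genus names, and the abundance values are all hypothetical and chosen for illustration; they are not taken from the benchmark or from QIIME.

```python
import math


def align_profiles(profile_a, profile_b):
    """Return two abundance lists over the union of genera (missing = 0.0)."""
    genera = sorted(set(profile_a) | set(profile_b))
    a = [profile_a.get(g, 0.0) for g in genera]
    b = [profile_b.get(g, 0.0) for g in genera]
    return a, b


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical genus-level relative abundances (illustrative values only).
uclust_profile = {"Bacteroides": 0.40, "Prevotella": 0.25, "Clostridium": 0.20}
other_profile = {"Bacteroides": 0.38, "Prevotella": 0.27, "Lactobacillus": 0.05}

a, b = align_profiles(uclust_profile, other_profile)
rho = pearson_r(a, b)
print(f"rho = {rho:.3f}")
```

Genera missing from one profile are treated as zero abundance here, which is one possible convention and an assumption of this sketch.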
Sensitivity and selectivity statistics for assigned taxonomies at genus level, Bokulich_2
| Software | No. of OTUs (no singletons) | P | R | F1 | TP | FN | FP (total) | FP (chimeric) | FP (known) | FP (other) |
|---|---|---|---|---|---|---|---|---|---|---|
| De novo | | | | | | | | | | |
| usearch52 | 1,522 | 0.34 | 1 | 0.5 | 18 | 0 | 35 | 5 | 13 | 17 |
| Swarm | 7,084 | 0.32 | 1 | 0.48 | 18 | 0 | 38 | 7 | 22 | 9 |
| uclust | 20,084 | 0.25 | 1 | 0.4 | 18 | 0 | 53 | 4 | 15 | 34 |
| usearch61 | 22,987 | 0.24 | 1 | 0.39 | 18 | 0 | 56 | 4 | 18 | 34 |
| sumaclust | 9,575 | 0.24 | 1 | 0.38 | 18 | 0 | 57 | 4 | 15 | 38 |
| Closed reference | | | | | | | | | | |
| usearch52 | 571 | 0.37 | 1 | 0.54 | 18 | 0 | 30 | 3 | 13 | 14 |
| sortmerna | 396 | 0.36 | 1 | 0.53 | 18 | 0 | 31 | 4 | 26 | 1 |
| uclust | 1,053 | 0.36 | 1 | 0.53 | 18 | 0 | 32 | 6 | 26 | 0 |
| usearch61 | 1,027 | 0.36 | 1 | 0.53 | 18 | 0 | 32 | 4 | 28 | 0 |
| Open reference | | | | | | | | | | |
| uclust | 10,169 | 0.25 | 1 | 0.4 | 18 | 0 | 52 | 4 | 19 | 29 |
| usearch61 | 9,414 | 0.25 | 1 | 0.4 | 18 | 0 | 53 | 4 | 18 | 31 |
| sortmerna_sumaclust | 9,272 | 0.24 | 1 | 0.39 | 18 | 0 | 55 | 5 | 16 | 34 |
P, precision; R, recall; F1, F measure; TP, true positive; FN, false negative; FP, false positive. The TP, FN, and FP columns give numbers of taxonomies. The last three columns represent a refined breakdown of FP data, including false-positive taxonomies for which all comprising OTUs were classified as chimeric (using UCHIME) (chimeric), mapped to BLAST’s NT database with ≥97% similarity (known), or mapped to BLAST’s NT database with <97% similarity (other).
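The relationship between the TP, FN, and FP counts and the P, R, and F1 columns can be sketched with the standard definitions: P = TP / (TP + FP), R = TP / (TP + FN), and F1 = 2PR / (P + R). The helper name `precision_recall_f1` is hypothetical, and the source does not state how the tabulated values were rounded, so small last-digit differences from the table are possible.

```python
def precision_recall_f1(tp, fn, fp):
    """Compute precision, recall, and F1 from taxonomy counts.

    Uses the standard definitions; returns 0.0 for a metric whose
    denominator is zero (a convention assumed for this sketch).
    """
    if min(tp, fn, fp) < 0:
        raise ValueError("counts must be non-negative")
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts from the first usearch52 row of the table: TP = 18, FN = 0, FP = 35.
p, r, f1 = precision_recall_f1(tp=18, fn=0, fp=35)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")  # prints P=0.340 R=1.000 F1=0.507
```

Because every tool recovers all 18 expected genera (FN = 0), recall is 1 throughout, and differences in F1 are driven entirely by the number of false-positive taxonomies.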
FIG 1 Layered bar chart showing top 20 abundant genera, Bokulich_6. The bars do not reach 1, since only a fraction (top 20) of taxonomies was illustrated.
FIG 2 Taxonomic composition graph illustrating top 50 (per software) abundant genera, body_sites. The bars do not reach 1, since only a fraction (top 50) of taxonomies was illustrated. mothur was run using recommended filtering (trim.seqs function) for 454 SOP and with QIIME’s split_libraries_fastq.py to highlight the effect of different filtering methods.
FIG 3 Taxonomic composition graph illustrating top 50 (per software) abundant genera, canadian_soil. The bars do not reach 1, since only a fraction (top 50) of taxonomies was illustrated.
FIG 4 Alpha diversity for tools at different sampling depths (order: de novo, closed reference, and open reference), canadian_soil.
FIG 5 Run time performance for all benchmarked software. All tests were performed using 1 to 32 cores on an Intel Xeon CPU E5-2640 v3 at 2.60 GHz. Input files contained reads subsampled from the Global Gut. For serial performance, some tools do not show results for 10^8 reads because they exceeded the wall time limit (230 h) or failed memory allocation. For parallel performance, a single file containing 1 million Illumina sequences was used over multiple threads.