Sarah L Westcott, Patrick D Schloss.
Abstract
Assignment of 16S rRNA gene sequences to operational taxonomic units (OTUs) is a computational bottleneck in the process of analyzing microbial communities. Although this has been an active area of research, it has been difficult to overcome the time and memory demands while improving the quality of the OTU assignments. Here, we developed a new OTU assignment algorithm that iteratively reassigns sequences to new OTUs to optimize the Matthews correlation coefficient (MCC), a measure of the quality of OTU assignments. To assess the new algorithm, OptiClust, we compared it to 10 other algorithms using 16S rRNA gene sequences from two simulated and four natural communities. Using the OptiClust algorithm, the MCC values averaged 15.2 and 16.5% higher than the OTUs generated when we used the average neighbor and distance-based greedy clustering with VSEARCH, respectively. Furthermore, on average, OptiClust was 94.6 times faster than the average neighbor algorithm and just as fast as distance-based greedy clustering with VSEARCH. An empirical analysis of the efficiency of the algorithms showed that the time and memory required to perform the algorithm scaled quadratically with the number of unique sequences in the data set. The significant improvement in the quality of the OTU assignments over previously existing methods will significantly enhance downstream analysis by limiting the splitting of similar sequences into separate OTUs and merging of dissimilar sequences into the same OTU. The development of the OptiClust algorithm represents a significant advance that is likely to have numerous other applications.
IMPORTANCE The analysis of microbial communities from diverse environments using 16S rRNA gene sequencing has expanded our knowledge of the biogeography of microorganisms. An important step in this analysis is the assignment of sequences into taxonomic groups based on their similarity to sequences in a database or based on their similarity to each other, irrespective of a database. In this study, we present a new algorithm for the latter approach. The algorithm, OptiClust, seeks to optimize a metric of assignment quality by shuffling sequences between taxonomic groups. We found that OptiClust produces more robust assignments and does so in a rapid and memory-efficient manner. This advance will allow for a more robust analysis of microbial communities and the factors that shape them.
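As an illustration of how the MCC can score a set of OTU assignments, the following minimal Python sketch counts pairs of sequences in four categories. It rests on one reading of the abstract: similar pairs placed in the same OTU are true positives, dissimilar pairs kept apart are true negatives, dissimilar pairs merged into one OTU are false positives, and similar pairs split across OTUs are false negatives. The function name `otu_mcc` and the `close_pairs` input (pairs whose distance falls within the clustering threshold) are hypothetical names chosen for this example, not part of any published implementation.

```python
from itertools import combinations
from math import sqrt


def otu_mcc(assignments, close_pairs):
    """Score OTU assignments with the Matthews correlation coefficient.

    assignments: dict mapping sequence name -> OTU label (hypothetical format).
    close_pairs: set of frozensets, each holding two sequence names whose
                 distance is within the clustering threshold.
    """
    tp = tn = fp = fn = 0
    for a, b in combinations(sorted(assignments), 2):
        same_otu = assignments[a] == assignments[b]
        similar = frozenset((a, b)) in close_pairs
        if same_otu and similar:
            tp += 1      # similar pair kept together
        elif same_otu:
            fp += 1      # dissimilar pair merged
        elif similar:
            fn += 1      # similar pair split
        else:
            tn += 1      # dissimilar pair kept apart
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Return 0.0 when the coefficient is undefined (zero denominator).
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy data: A, B, and C are mutually similar; D is similar to nothing.
close = {frozenset(p) for p in [("A", "B"), ("A", "C"), ("B", "C")]}
good = {"A": 1, "B": 1, "C": 1, "D": 2}
poor = {"A": 1, "B": 1, "C": 2, "D": 2}
print(otu_mcc(good, close), otu_mcc(poor, close))
```

In the toy data, the first assignment neither splits nor merges any pair incorrectly, whereas the second splits C from A and B and merges C with D, so it scores lower.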
Keywords: 16S rRNA gene; bioinformatics; microbial ecology; microbiome
Year: 2017 PMID: 28289728 PMCID: PMC5343174 DOI: 10.1128/mSphereDirect.00073-17
Source DB: PubMed Journal: mSphere ISSN: 2379-5042 Impact factor: 4.389
Description of data sets used to evaluate the OptiClust algorithm and compare its performance to other algorithms
| Data set | Read length (nt) | No. of samples | Total no. of sequences | No. of unique sequences | No. of distances | No. of OTUs |
|---|---|---|---|---|---|---|
| Soil | 150 | 18 | 948,243 | 143,677 | 11,775,167 | 40,216 |
| Marine | 250 | 7 | 1,384,988 | 75,923 | 12,908,857 | 25,787 |
| Mice | 250 | 360 | 2,825,495 | 32,447 | 6,988,306 | 2,658 |
| Human | 250 | 489 | 20,951,841 | 121,281 | 38,544,315 | 11,648 |
| Even | NA | NA | 1,155,800 | 11,558 | 29,694 | 7,651 |
| Staggered | NA | NA | 1,156,550 | 11,558 | 29,694 | 7,653 |
Each data set contains sequences from the V4 region of the 16S rRNA gene. The number of distances for each data set indicates those that were less than or equal to 0.03. The number of OTUs was determined using the OptiClust algorithm. The even and staggered data sets were generated by extracting the V4 region from full-length reference sequences, and the data sets from the natural communities were generated by sequencing the V4 region using an Illumina MiSeq with paired reads of either 150 or 250 nt. NA, not applicable.
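To put the distance counts in context, the short Python sketch below compares the number of retained distances (those less than or equal to 0.03) with the number of all possible pairwise comparisons among the unique sequences, n(n - 1)/2. The figures are copied from the table above; the helper `total_pairs` and the `datasets` dictionary are names made up for this example.

```python
def total_pairs(n_unique):
    """Number of distinct unordered pairs among n_unique sequences."""
    return n_unique * (n_unique - 1) // 2


# (unique sequences, distances <= 0.03), taken from the table above.
datasets = {
    "Soil": (143_677, 11_775_167),
    "Marine": (75_923, 12_908_857),
    "Mice": (32_447, 6_988_306),
    "Human": (121_281, 38_544_315),
    "Even": (11_558, 29_694),
    "Staggered": (11_558, 29_694),
}

for name, (n_unique, kept) in datasets.items():
    fraction = kept / total_pairs(n_unique)
    print(f"{name}: {kept:,} of {total_pairs(n_unique):,} pairs ({fraction:.3%})")
```

For every data set the retained distances are a small fraction of all possible pairs, which suggests (as an inference from these counts, not a statement from the text) that storing only the distances within the threshold is far cheaper than storing a full distance matrix.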
FIG 1 Comparison of de novo clustering algorithms. Plot of MCC (A), number of OTUs (B), and execution times (C) for the comparison of de novo clustering algorithms when applied to four natural and two synthetic data sets. The first three columns of each panel contain the results of clustering the data sets with OptiClust by (i) seeding the algorithm with one sequence per OTU and allowing the algorithm to proceed until the MCC value no longer changed, (ii) seeding the algorithm with one sequence per OTU and allowing the algorithm to proceed until the MCC changed by less than 0.0001, and (iii) seeding the algorithm with all of the sequences in one OTU and allowing the algorithm to proceed until the MCC value no longer changed. The human data set could not be clustered by the average neighbor, Sumaclust, USEARCH, or OTUCLUST algorithms with less than 45 GB of RAM or 50 h of execution time. The median from 10 reorderings of the data is presented for each method and data set. The range of observed values is indicated by the error bars, which are typically smaller than the plotting symbol.
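The seeding and stopping options described in this caption can be sketched with a toy greedy optimizer in Python. This is a simplified illustration of iteratively reassigning sequences to raise the MCC, not the published OptiClust implementation: it recomputes the full confusion matrix for every candidate move, which is only practical for very small inputs. All names (`greedy_mcc_clustering`, `neighbors`, `tol`, `start_together`) are hypothetical.

```python
from math import sqrt


def mcc_of(otu_of, neighbors):
    """MCC over all pairs; neighbors maps each sequence to its similar sequences."""
    seqs = sorted(otu_of)
    tp = tn = fp = fn = 0
    for i, a in enumerate(seqs):
        for b in seqs[i + 1:]:
            same = otu_of[a] == otu_of[b]
            close = b in neighbors.get(a, ())
            if same and close:
                tp += 1
            elif same:
                fp += 1
            elif close:
                fn += 1
            else:
                tn += 1
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def greedy_mcc_clustering(seqs, neighbors, tol=0.0, start_together=False, max_iter=100):
    """Move each sequence to the OTU that most improves the MCC until it stalls."""
    # Seed with everything in one OTU, or with one sequence per OTU.
    otu_of = {s: (0 if start_together else i) for i, s in enumerate(seqs)}
    best = mcc_of(otu_of, neighbors)
    for _ in range(max_iter):
        previous = best
        for s in seqs:
            current = otu_of[s]
            # Candidate moves: join a neighbor's OTU or start a fresh OTU.
            candidates = {otu_of[n] for n in neighbors.get(s, ())}
            candidates.add(max(otu_of.values()) + 1)
            choice, choice_score = current, best
            for c in sorted(candidates):
                if c == current:
                    continue
                otu_of[s] = c
                score = mcc_of(otu_of, neighbors)
                if score > choice_score:
                    choice, choice_score = c, score
            otu_of[s] = choice
            best = choice_score
        # Stop when the MCC is unchanged, or changed by less than tol.
        if best == previous or best - previous < tol:
            break
    return otu_of, best


pairs = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "E")]
neighbors = {}
for a, b in pairs:
    neighbors.setdefault(a, set()).add(b)
    neighbors.setdefault(b, set()).add(a)

otus, score = greedy_mcc_clustering(["A", "B", "C", "D", "E"], neighbors)
print(otus, score)
```

Passing `tol=0.0001` mirrors the second configuration in the caption, and `start_together=True` mirrors the third; the default arguments correspond to the first.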
FIG 2 OptiClust performance. Average execution time (A) and memory usage (B) required to cluster the four natural data sets. The confidence intervals indicate the range between the minimum and maximum values. The y axis is scaled by the square root to demonstrate the relationship between the time and memory requirements relative to the number of unique sequences squared.
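The reason for the square-root scaling can be made explicit with a short derivation. It assumes an idealized, purely quadratic cost with a hypothetical constant $c$ and $n$ unique sequences; lower-order terms in a real measurement would bend the line slightly.

```latex
% Assumption: time (or memory) T grows purely quadratically in n.
T(n) \approx c\,n^{2}
\quad\Longrightarrow\quad
\sqrt{T(n)} \approx \sqrt{c}\,n
```

Under that assumption, plotting $T$ on a square-root-scaled y axis against the number of unique sequences gives an approximately straight line, which is the visual signature of quadratic scaling.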
FIG 3 Effects of taxonomically splitting the data sets on clustering quality. The data sets were split at each taxonomic level based on their classification using a naive Bayesian classifier and clustered using average neighbor, VSEARCH-based DGC, and OptiClust.