Jullien M Flynn, Emily A Brown, Frédéric J J Chain, Hugh J MacIsaac, Melania E Cristescu.
Abstract
Metabarcoding has the potential to become a rapid, sensitive, and effective approach for identifying species in complex environmental samples. Accurate molecular identification of species depends on the ability to generate operational taxonomic units (OTUs) that correspond to biological species. Due to the sometimes enormous estimates of biodiversity using this method, there is a great need to test the efficacy of data analysis methods used to derive OTUs. Here, we evaluate the performance of various methods for clustering length-variable 18S amplicons from complex samples into OTUs using a mock community and a natural community of zooplankton species. We compare analytic procedures consisting of a combination of (1) stringent and relaxed data filtering, (2) singleton sequences included and removed, (3) three commonly used clustering algorithms (mothur, UCLUST, and UPARSE), and (4) three methods of treating alignment gaps when calculating sequence divergence. Depending on the combination of methods used, the number of OTUs varied by nearly two orders of magnitude for the mock community (60-5068 OTUs) and three orders of magnitude for the natural community (22-22191 OTUs). The use of relaxed filtering and the inclusion of singletons greatly inflated OTU numbers without increasing the ability to recover species. Our results also suggest that the method used to treat gaps when calculating sequence divergence can have a great impact on the number of OTUs. Our findings are particularly relevant to studies that cover taxonomically diverse species and employ markers such as rRNA genes in which length variation is extensive.
Keywords: 18S rRNA; OTU; biodiversity; eDNA; high-throughput sequencing; metabarcoding
Year: 2015 PMID: 26078860 PMCID: PMC4461425 DOI: 10.1002/ece3.1497
Source DB: PubMed Journal: Ecol Evol ISSN: 2045-7758 Impact factor: 2.912
Figure 1. A natural zooplankton community sampled from Sudbury, Ontario, Canada.
List of different clustering algorithms (not exhaustive). Identity definitions: no gaps = gaps are not included in the identity calculation; one gap = a gap of any size is treated as a single mutational difference; each gap = each nucleotide in the gap is treated as an additional mutational difference
| Algorithm name | Algorithm type | Identity definition(s) used/available |
|---|---|---|
| mothur (Schloss et al. | Hierarchical | Default is |
| UCLUST (Edgar | Greedy heuristic | |
| UPARSE (Edgar | Greedy heuristic | |
| CD-HIT (Li and Godzik | Greedy heuristic | Gaps penalized only in longer sequence of pairwise comparison; user cannot change |
| ESPRIT (Sun et al. | Hierarchical | |
| ESPRIT-Tree (Cai and Sun | Hierarchical but pairwise comparisons are not exhaustive | |
| CROP (Hao et al. | Bayesian approach | |
| TSC (Jiang et al. | Step 1: hierarchical 2: greedy heuristic | Directly from alignment algorithm; user cannot change |
| M-pick (Wang et al. | Modularity based | |
| MSClust (Chen et al. | Greedy heuristic | Directly from alignment algorithm; user cannot change |
| SWARM (Mahé et al. | Agglomerative | |
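The three identity definitions in the table caption can be illustrated with a minimal Python sketch. It assumes two sequences that are already aligned to equal length with `-` marking gaps; the function name `pairwise_identity` and the `gap_mode` labels are hypothetical and do not correspond to the interface of mothur, UCLUST, or any other tool listed above. Columns where both sequences carry a gap are skipped, which is a further assumption.

```python
def pairwise_identity(a: str, b: str, gap_mode: str = "each_gap") -> float:
    """Identity of two pre-aligned sequences under one of three gap treatments.

    gap_mode (hypothetical labels):
      "no_gaps"  - gapped columns are ignored entirely
      "one_gap"  - each contiguous gap run counts as a single difference
      "each_gap" - every gapped column counts as a difference
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    if gap_mode not in {"no_gaps", "one_gap", "each_gap"}:
        raise ValueError(f"unknown gap_mode: {gap_mode}")

    matches = mismatches = gap_columns = gap_runs = 0
    prev_gap_owner = None  # sequence that held the gap in the previous gapped column

    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # assumption: shared gap columns carry no information
        if x == "-" or y == "-":
            owner = "a" if x == "-" else "b"
            gap_columns += 1
            if owner != prev_gap_owner:
                gap_runs += 1  # a new gap run starts here
            prev_gap_owner = owner
        else:
            prev_gap_owner = None
            if x == y:
                matches += 1
            else:
                mismatches += 1

    if gap_mode == "no_gaps":
        differences = mismatches
    elif gap_mode == "one_gap":
        differences = mismatches + gap_runs
    else:
        differences = mismatches + gap_columns

    compared = matches + differences
    return matches / compared if compared else 0.0


if __name__ == "__main__":
    s1 = "ACGTACGTAC"
    s2 = "ACG---GTAC"  # one three-nucleotide gap, no substitutions
    for mode in ("no_gaps", "one_gap", "each_gap"):
        print(mode, pairwise_identity(s1, s2, mode))
```

For this toy pair the identity is 1.0 with gaps ignored, 0.875 with the gap counted once, and 0.7 with each gapped nucleotide counted, so the same pair can fall on either side of a fixed clustering threshold depending on the definition used.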
Previous studies that have compared clustering methods (not exhaustive)
| Reference | Relevant methods compared | Marker(s) and data used | Performance measure(s) | Conclusions |
|---|---|---|---|---|
| Barriuso et al. ( | mothur, ESPRIT, CROP, UCLUST, RDP clustering | 16S sequences; synthetic and natural community data | OTU number compared to expected | RDP, ESPRIT, UCLUST produced acceptable results, CROP produced anomalous results; mothur unable to process large datasets |
| Sun et al. ( | MSA vs. PSA; CD-HIT, UCLUST, ESPRIT, MUSCLE, ESPRIT-Tree | 16S sequences; simulation and natural community data | OTU number; NMI | Although PSA does not consider secondary structure like MSA can, PSA still produced more reliable estimates with 16S sequences; hierarchical clustering algorithms performed better |
| Edgar ( | UPARSE, AmpliconNoise | 16S sequences; two mock communities and natural community data | OTU number; classified OTUs as perfect, good, noisy, chimeric | UPARSE performed best: most perfect and good sequences and fewest chimeric sequences; UPARSE OTUs approached 1:1 correspondence with species in mock community |
| Chen et al. ( | ESPRIT, ESPRIT-Tree, mothur, muscle+mothur, CROP, CD-HIT, UCLUST, SLP | Dataset of 16S sequences of known microbial species; simulated 16S datasets | NID; OTU number compared to expected | With default parameters, the methods tended to inaccurately estimate number of OTUs |
| Bachy et al. ( | MSA+mothur, AmpliconNoise, USEARCH workflow, CD-HIT-OTU | 18S and ITS sequences from a mock community of protist morphotypes | OTU number compared to expected from morphology and map to reference dataset | Great differences in OTU number, some methods overestimating by an order of magnitude; denoising methods tended to underestimate some of the species richness |
| Yang et al. ( | USEARCH+CROP, Denoiser+UCLUST, OCTUPUS | 18S and CO1 sequences; natural community data | OTU number | Pipelines produced similar results for community composition; OCTUPUS appeared to inflate diversity |
| Bonder et al. ( | Filtering: none, chimera removal, denoising, denoising + chimera removal; Clustering: UCLUST, mothur, ESPRIT-Tree, CD-HIT, QIIME | 16S sequences; mock community and natural community datasets | OTU number compared to expected; NMI score | CD-HIT, UCLUST, ESPRIT-Tree performed well; filtering required for accurate OTU estimates |
| May et al. ( | Filtering: none, chimera removal, denoising, denoising then chimera removal, chimera removal then denoising; Clustering: 11 different clustering algorithms were evaluated | 16S sequences; mock community datasets and simulated datasets | OTU number compared to expected; NMI score | The choice and order of filtering options have a great impact on clustering results; after chimera removal and denoising, the performance of the different clustering algorithms was similar |
MSA – multiple sequence alignment; PSA – pairwise sequence alignment (when comparing sequences during clustering).
Metric of cluster quality and proper assignment of sequences; generally requires a ground truth composition to determine.
AmpliconNoise: algorithm that denoises reads before further processing (Quince et al. 2011).
QIIME: pipeline that implements a variety of tools for data processing (Caporaso et al. 2010).
SLP: single linkage preclustering; a method that attempts to reduce noise to minimize OTU estimate inflation (Huse et al. 2010).
Greedy heuristic algorithm (Ghodsi et al. 2011).
Greedy heuristic algorithm based on a grammar distance metric (Russell et al. 2010).
OCTUPUS: Fonseca et al. (2010).
Main characteristics of stringent and relaxed filtering procedures
| Stringent filtering (USEARCH) | Relaxed filtering (RDP) |
|---|---|
| Primer mismatches removed | Primer mismatches removed |
| Sequences <400 bp removed and remaining sequences trimmed to 400 bp | Sequences <250 bp or >600 bp removed |
| Sequences containing ambiguous nucleotides (Ns) removed | Sequences containing ambiguous nucleotides (Ns) removed |
| Sequences with expected error >0.5 removed | Sequences with average quality <20 removed |
| Chimeras removed with UCHIME | Chimeras removed with UCHIME |
Except for datasets clustered with UPARSE.
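The per-read criteria in the table above can be sketched in Python as follows. This is a minimal illustration, not the USEARCH or RDP implementation: the function names are hypothetical, primer matching and UCHIME chimera removal are omitted, and the order of operations in the stringent filter (trim to 400 bp first, then check ambiguous bases and expected error on the trimmed read) is an assumption. Expected error is computed in the usual way as the sum of per-base error probabilities, 10^(-Q/10) for Phred score Q.

```python
from typing import Optional, Sequence


def expected_errors(phred: Sequence[int]) -> float:
    """Sum of per-base error probabilities implied by Phred quality scores."""
    return sum(10 ** (-q / 10) for q in phred)


def stringent_filter(seq: str, phred: Sequence[int],
                     trim_len: int = 400, max_ee: float = 0.5) -> Optional[str]:
    """Return the trimmed read if it passes the stringent criteria, else None."""
    if len(seq) < trim_len:
        return None  # sequences shorter than 400 bp are removed
    seq, phred = seq[:trim_len], phred[:trim_len]  # assumption: trim before scoring
    if "N" in seq.upper():
        return None  # ambiguous nucleotides
    if expected_errors(phred) > max_ee:
        return None  # expected error above 0.5
    return seq


def relaxed_filter(seq: str, phred: Sequence[int],
                   min_len: int = 250, max_len: int = 600,
                   min_mean_q: float = 20.0) -> Optional[str]:
    """Return the read unchanged if it passes the relaxed criteria, else None."""
    if len(seq) < min_len or len(seq) > max_len:
        return None  # outside the 250-600 bp window
    if "N" in seq.upper():
        return None  # ambiguous nucleotides
    if sum(phred) / len(phred) < min_mean_q:
        return None  # average quality below 20
    return seq


if __name__ == "__main__":
    read = "ACGT" * 125      # 500 bp toy read
    quals = [20] * len(read)  # uniform Q20
    print("stringent keeps read:", stringent_filter(read, quals) is not None)
    print("relaxed keeps read:", relaxed_filter(read, quals) is not None)
```

A 500 bp read with uniform Q20 has an expected error of about 4 over its first 400 bases, so it fails the stringent filter, yet its average quality of exactly 20 is enough to pass the relaxed one.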
Figure 2. Data analysis methods. MSA refers to multiple sequence alignment.
Mock and natural community OTU results. The number of preprocessed reads indicates the reads before quality and length filtering and the number of processed sequences is the number of sequences after filtering and dereplication. First number represents the total number of OTUs; the number in brackets represents OTUs that matched the target species in the mock community; in bold are numbers of species detected from the mock community
| | USEARCH filtering | | | | | | RDP filtering | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | Singletons removed | | | Singletons included | | | Singletons removed | | | Singletons included | |
| Mock community (61 species); 427,868 preprocessed reads | | | | | | | | | | | |
| Processed sequences | 6490 | | | 19,881 | | | 22,923 | | | 169,807 | |
| ID def | mothur | uclust | uparse | mothur | uclust | uparse | mothur | uclust | uparse | uclust | uparse |
| No gaps | 63 (57) | 64 (58) | – | 83 (68) | 101 (86) | – | 70 (60) | 78 (68) | – | 716 | – |
| One gap | 70 (64) | 62 (56) | – | 98 (83) | 94 (79) | – | 75 (65) | 79 (68) | – | 1025 | – |
| Each gap | 79 (73) | 68 (62) | 60 (54) | 137 (121) | 114 (99) | 84 (69) | 262 (239) | 263 (241) | 75 (62) | 5068 | 647 |
| Natural community; 497,806 preprocessed reads | | | | | | | | | | | |
| Processed sequences | 2731 | | | 15,806 | | | 7080 | | | 130,433 | |
| ID def | mothur | uclust | uparse | mothur | uclust | uparse | mothur | uclust | uparse | uclust | uparse |
| No gaps | 22 | 22 | – | 96 | 147 | – | 28 | 42 | – | 2723 | – |
| One gap | 33 | 33 | – | 227 | 160 | – | 46 | 48 | – | 5471 | – |
| Each gap | 43 | 44 | 24 | 312 | 243 | 63 | 247 | 230 | 38 | 22,191 | 1174 |
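The "processed sequences" rows count unique sequences after dereplication, with singletons either kept or dropped. A minimal Python sketch of those two steps is given below; it assumes exact-match dereplication on the full read string and treats a singleton as a unique sequence observed exactly once in the dataset. The function names are hypothetical and the reads are toy data, not sequences from the study.

```python
from collections import Counter
from typing import Dict, Iterable


def dereplicate(reads: Iterable[str]) -> Counter:
    """Collapse identical reads into unique sequences with abundance counts."""
    return Counter(reads)


def remove_singletons(abundances: Dict[str, int]) -> Dict[str, int]:
    """Keep only unique sequences observed more than once."""
    return {seq: n for seq, n in abundances.items() if n > 1}


if __name__ == "__main__":
    toy_reads = ["ACGT", "ACGT", "ACGA", "TTTT", "TTTT", "TTTT"]
    unique = dereplicate(toy_reads)
    kept = remove_singletons(unique)
    print(len(unique), "unique sequences;", len(kept), "after singleton removal")
```

Because sequencing errors tend to produce sequences seen only once, dropping singletons shrinks the input to clustering considerably, which is consistent with the gap between the "singletons removed" and "singletons included" columns.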
Figure 3. Species detection and precision across workflows. Species detection is the ratio of the number of species recovered and the number of species in the mock community database, whereas precision is the ratio of the number of species recovered and the number of OTUs. (A) The combination of relaxed (RDP) and stringent (USEARCH) filtering methods with clustering algorithms. Results shown for the mock community dataset with singletons removed, and each gap identity definition was used for all clustering algorithms. (B) The combination of removing singletons (− singletons) and including singletons (+ singletons) with all clustering algorithms. Results shown for the mock community dataset filtered with USEARCH and clustering with each gap identity definition.
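The two ratios defined in this caption can be written out as a short Python sketch. The community size (61 species) and the OTU count (60, for USEARCH filtering, singletons removed, UPARSE, each gap) come from the results table; the number of species recovered is a hypothetical placeholder, since that value is not given in this section. The function names are also hypothetical.

```python
def species_detection(species_recovered: int, species_in_community: int) -> float:
    """Fraction of mock community species that were recovered."""
    if species_in_community <= 0:
        raise ValueError("species_in_community must be positive")
    return species_recovered / species_in_community


def precision(species_recovered: int, n_otus: int) -> float:
    """Species recovered per OTU generated; lower values mean more excess OTUs."""
    if n_otus <= 0:
        raise ValueError("n_otus must be positive")
    return species_recovered / n_otus


if __name__ == "__main__":
    species_in_mock = 61         # from the results table
    otus = 60                    # USEARCH, singletons removed, UPARSE, each gap
    hypothetical_recovered = 50  # placeholder value, NOT a result from the study
    print("detection:", round(species_detection(hypothetical_recovered, species_in_mock), 3))
    print("precision:", round(precision(hypothetical_recovered, otus), 3))
```

Under these definitions, a workflow that inflates the OTU count without recovering additional species leaves detection unchanged and lowers precision.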
Figure 4. Species detected unique to the particular filtering method. Stringent (USEARCH) versus relaxed (RDP) filtering – both with singletons removed and clustered with UPARSE. The size of the circle corresponds with the number of species recovered.
Figure 5. Precision comparisons of methods of calculating sequence divergence. Precision is the ratio of the number of species recovered and the number of OTUs. (A) USEARCH filtered data with singletons removed and clustered by mothur with all identity definitions. (B) RDP filtered data with singletons removed and clustered by mothur with all identity definitions.