Tobias Guldberg Frøslev, Rasmus Kjøller, Hans Henrik Bruun, Rasmus Ejrnæs, Ane Kirstine Brunbjerg, Carlotta Pietroni, Anders Johannes Hansen.
Abstract
DNA metabarcoding is promising for cost-effective biodiversity monitoring, but reliable diversity estimates are difficult to achieve and validate. Here we present and validate a method, called LULU, for removing erroneous molecular operational taxonomic units (OTUs) from community data derived by high-throughput sequencing of amplified marker genes. LULU identifies errors by combining sequence similarity and co-occurrence patterns. To validate the LULU method, we use a unique data set of high-quality survey data of vascular plants paired with plant ITS2 metabarcoding data of DNA extracted from soil from 130 sites in Denmark spanning major environmental gradients. OTU tables are produced with several different OTU definition algorithms, subsequently curated with LULU, and validated against field survey data. LULU curation consistently improves α-diversity estimates and other biodiversity metrics, and does not require a sequence reference database; thus, it represents a promising method for reliable biodiversity estimation.
Year: 2017 PMID: 29084957 PMCID: PMC5662604 DOI: 10.1038/s41467-017-01312-x
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1. Effects of curation with the LULU algorithm for clustering methods at 97% level. OTU table metrics before (red = raw) and after (blue = curated) curation with LULU. a correspondence of OTU (plant ITS2 sequence data) richness vs. plant richness for each of the 130 sampling sites, b total number of OTUs compared to total plant species recorded (564 species, dashed line), c percentage of OTUs having taxonomically redundant annotation, d OTU β-diversity (total richness/mean site richness) compared to plant β-diversity (17.23, dashed line), e distribution of best reference database (GenBank) match for OTUs retained and discarded by LULU
Metrics of the OTU tables produced with multiple OTU generation algorithms before and after curation with LULU
| Method | Level | Correlation (R²) | Slope | Intercept | Taxonomic redundancy | Total OTUs | Avg. best match | β-diversity |
|---|---|---|---|---|---|---|---|---|
| CROP | 98% | 0.56/0.59 | 0.32/0.3 | 3.8/2.9 | 28%/7% | 369/241 | 95.8%/97.5% | 25.9/19.1 |
| CROP | 97% | 0.54/0.6 | 0.24/0.23 | 2/1.4 | 22%/6% | 249/174 | 94.7%/96.4% | 25/19.5 |
| CROP | 95% | 0.48/0.6 | 0.24/0.22 | 1.8/1.1 | 28%/8% | 383/252 | 92.2%/93.7% | 39/29.9 |
| DADA2 | 100% | 0.42/0.56 | 0.77/0.53 | 15.6/3.6 | 77%/45% | 2568/761 | 97.7%/98.8% | 62.8/36.3 |
| DADA2( + VS) | 98.50% | 0.54/0.63 | 0.55/0.44 | 6.4/1.8 | 53%/13% | 1141/430 | 96.7%/98.7% | 46.9/26.5 |
| DADA2( + VS) | 98% | 0.55/0.64 | 0.52/0.42 | 6/1.9 | 50%/10% | 1033/402 | 96.6%/98.7% | 45.2/25.5 |
| DADA2( + VS) | 97% | 0.57/0.65 | 0.49/0.42 | 5/1.8 | 43%/7% | 842/365 | 96.4%/98.6% | 40.4/23.7 |
| DADA2( + VS) | 96% | 0.62/0.67 | 0.47/0.41 | 4/1.3 | 37%/6% | 721/341 | 96.2%/98.6% | 37.3/22.9 |
| DADA2( + VS) | 95% | 0.61/0.68 | 0.44/0.41 | 3.7/1.1 | 32%/5% | 622/324 | 96.2%/98.5% | 34.2/22.3 |
| SWARM | 99% | 0.15/0.64 | 3.49/0.64 | 49.6/2.1 | 93%/18% | 14828/520 | 95.1%/97.9% | 90.5/22.5 |
| SWARM | 98.50% | 0.2/0.67 | 2.35/0.62 | 26.4/1.8 | 88%/13% | 8422/467 | 94.2%/97.8% | 81.5/21.2 |
| SWARM | 98% | 0.25/0.69 | 1.81/0.58 | 18.1/2.1 | 84%/9% | 5779/430 | 93.6%/97.7% | 74.8/20.6 |
| SWARM | 97% | 0.27/0.69 | 1.55/0.56 | 14.7/2.8 | 81%/8% | 4585/401 | 93.3%/97.7% | 70/19.1 |
| SWARM | 96% | 0.27/0.7 | 1.55/0.56 | 14.1/2.8 | 81%/8% | 4547/401 | 93.2%/97.7% | 70/19.1 |
| SWARM | 95% | 0.39/0.71 | 1.15/0.53 | 4.5/2.3 | 70%/9% | 2500/362 | 92.6%/97.3% | 59.4/18.5 |
| VSEARCH | 98.50% | 0.15/0.63 | 2.15/0.73 | 62.7/1.6 | 90%/23% | 8008/558 | 97.4%/98.4% | 60.2/21.9 |
| VSEARCH | 98% | 0.17/0.59 | 1.58/0.7 | 41.5/1.7 | 85%/20% | 4815/517 | 96.8%/98.4% | 51.6/20.9 |
| VSEARCH | 97% | 0.22/0.64 | 0.92/0.61 | 22/1.8 | 72%/13% | 2425/458 | 96.1%/98.4% | 46.5/21 |
| VSEARCH | 96% | 0.27/0.64 | 0.8/0.57 | 16.4/1.9 | 64%/10% | 1740/415 | 95.7%/98.3% | 40.9/20.1 |
| VSEARCH | 95% | 0.34/0.66 | 0.7/0.55 | 12.3/1.9 | 56%/9% | 1320/396 | 95.5%/98.2% | 37.5/19.8 |
Effects of post-clustering curation with the LULU algorithm for clustering methods (VSEARCH, SWARM, DADA2 and CROP) at several levels. Values before the slash are metrics for the method prior to curation with LULU; values after the slash are post-curation metrics. R² denotes the coefficient of determination of the linear regression of OTU count vs. plant richness; slope and intercept denote the constants of the inferred linear regression; taxonomic redundancy is the proportion of OTUs with a redundant taxonomic assignment; total OTUs is the count of unique OTUs for each method; avg. best match is the average of the best GenBank match over all OTUs for each method; and β-diversity is γ-diversity (total richness) divided by the average α-diversity (mean site richness), as in Fig. 1d
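As a rough illustration of how the site-level metrics in this table can be derived from an OTU table, the sketch below computes per-site OTU richness, β-diversity as total richness divided by mean site richness (the definition given in the Fig. 1 caption), and the slope, intercept and R² of an ordinary least-squares fit of OTU richness on plant richness. The data layout (a dict of OTU id to per-site read counts) and all function names are assumptions made for this example; they are not taken from the paper's scripts.

```python
def richness_per_site(otu_table):
    """Number of OTUs with at least one read at each site.

    otu_table: dict mapping OTU id -> list of read counts, one per site
    (a hypothetical layout chosen for this sketch).
    """
    n_sites = len(next(iter(otu_table.values())))
    return [
        sum(1 for counts in otu_table.values() if counts[site] > 0)
        for site in range(n_sites)
    ]


def beta_diversity(otu_table):
    """Total richness (gamma) divided by mean site richness (mean alpha)."""
    alpha = richness_per_site(otu_table)
    gamma = sum(1 for counts in otu_table.values() if any(c > 0 for c in counts))
    return gamma / (sum(alpha) / len(alpha))


def linear_fit(x, y):
    """Ordinary least-squares fit y = slope * x + intercept, plus R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Toy data: 4 OTUs across 3 sites, and the plant richness recorded per site.
toy_otu_table = {
    "otu1": [10, 0, 5],
    "otu2": [0, 3, 4],
    "otu3": [2, 2, 0],
    "otu4": [0, 0, 7],
}
toy_plant_richness = [4, 5, 8]

otu_richness = richness_per_site(toy_otu_table)
slope, intercept, r_squared = linear_fit(toy_plant_richness, otu_richness)
```

Under this reading, a slope near 1 and an intercept near 0 would mean that OTU richness tracks observed plant richness closely, which is the direction in which the post-curation values in the table move for most methods.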
Taxonomic composition of OTUs for single sites compared with plant survey data
| Method | Level | Imperfect matches | Recaptured species | Unregistered species | Redundant species | Lost species |
|---|---|---|---|---|---|---|
| CROP | 98% | 0.56 ± 0.16/0.50 ± 0.17 | 0.31 ± 0.13/0.34 ± 0.13 | 0.12 ± 0.11/0.13 ± 0.12 | 0.02 ± 0.05/0.02 ± 0.06 | 0.00 ± 0.01 |
| CROP | 97% | 0.80 ± 0.13/0.79 ± 0.13 | 0.13 ± 0.11/0.14 ± 0.11 | 0.06 ± 0.10/0.07 ± 0.10 | 0.00 ± 0.00/0.00 ± 0.00 | 0.00 ± 0.00 |
| CROP | 95% | 0.87 ± 0.09/0.86 ± 0.09 | 0.09 ± 0.09/0.10 ± 0.09 | 0.04 ± 0.09/0.04 ± 0.09 | 0.00 ± 0.00/0.00 ± 0.00 | 0.00 ± 0.00 |
| DADA2 | 100% | 0.66 ± 0.12/0.40 ± 0.14 | 0.22 ± 0.10/0.43 ± 0.16 | 0.08 ± 0.07/0.14 ± 0.11 | 0.03 ± 0.03/0.03 ± 0.05 | 0.02 ± 0.04 |
| DADA2( + VS) | 98.50% | 0.51 ± 0.15/0.29 ± 0.14 | 0.36 ± 0.14/0.54 ± 0.17 | 0.11 ± 0.08/0.16 ± 0.12 | 0.01 ± 0.03/0.02 ± 0.05 | 0.02 ± 0.09 |
| DADA2( + VS) | 98% | 0.48 ± 0.16/0.29 ± 0.14 | 0.38 ± 0.15/0.54 ± 0.15 | 0.12 ± 0.09/0.16 ± 0.12 | 0.01 ± 0.03/0.01 ± 0.03 | 0.02 ± 0.09 |
| DADA2( + VS) | 97% | 0.46 ± 0.16/0.29 ± 0.14 | 0.40 ± 0.14/0.53 ± 0.15 | 0.13 ± 0.09/0.16 ± 0.12 | 0.01 ± 0.03/0.02 ± 0.04 | 0.01 ± 0.03 |
| DADA2( + VS) | 96% | 0.42 ± 0.16/0.27 ± 0.14 | 0.43 ± 0.15/0.55 ± 0.16 | 0.13 ± 0.1/0.16 ± 0.12 | 0.01 ± 0.02/0.01 ± 0.03 | 0.01 ± 0.03 |
| DADA2( + VS) | 95% | 0.39 ± 0.16/0.25 ± 0.14 | 0.45 ± 0.17/0.56 ± 0.17 | 0.15 ± 0.1/0.19 ± 0.12 | 0.00 ± 0.02/0.00 ± 0.02 | 0.01 ± 0.03 |
| SWARM | 99% | 0.80 ± 0.15/0.32 ± 0.12 | 0.11 ± 0.10/0.46 ± 0.14 | 0.05 ± 0.05/0.19 ± 0.10 | 0.03 ± 0.03/0.03 ± 0.05 | 0.02 ± 0.05 |
| SWARM | 98.50% | 0.74 ± 0.16/0.29 ± 0.11 | 0.15 ± 0.11/0.48 ± 0.13 | 0.08 ± 0.07/0.22 ± 0.10 | 0.03 ± 0.04/0.01 ± 0.02 | 0.02 ± 0.05 |
| SWARM | 98% | 0.69 ± 0.17/0.26 ± 0.12 | 0.18 ± 0.11/0.49 ± 0.14 | 0.10 ± 0.08/0.24 ± 0.11 | 0.03 ± 0.03/0.01 ± 0.03 | 0.03 ± 0.05 |
| SWARM | 97% | 0.66 ± 0.17/0.25 ± 0.12 | 0.20 ± 0.11/0.48 ± 0.14 | 0.12 ± 0.09/0.27 ± 0.11 | 0.02 ± 0.03/0.00 ± 0.01 | 0.03 ± 0.05 |
| SWARM | 96% | 0.65 ± 0.17/0.25 ± 0.12 | 0.20 ± 0.11/0.49 ± 0.14 | 0.13 ± 0.09/0.27 ± 0.11 | 0.02 ± 0.03/0.00 ± 0.01 | 0.03 ± 0.05 |
| SWARM | 95% | 0.55 ± 0.17/0.24 ± 0.11 | 0.27 ± 0.13/0.48 ± 0.16 | 0.16 ± 0.10/0.28 ± 0.13 | 0.02 ± 0.04/0.00 ± 0.01 | 0.05 ± 0.08 |
| VSEARCH | 98.50% | 0.85 ± 0.09/0.43 ± 0.14 | 0.10 ± 0.07/0.42 ± 0.15 | 0.04 ± 0.03/0.14 ± 0.09 | 0.01 ± 0.02/0.01 ± 0.04 | 0.02 ± 0.05 |
| VSEARCH | 98% | 0.80 ± 0.12/0.41 ± 0.16 | 0.13 ± 0.09/0.44 ± 0.16 | 0.05 ± 0.05/0.15 ± 0.09 | 0.02 ± 0.02/0.01 ± 0.03 | 0.02 ± 0.05 |
| VSEARCH | 97% | 0.70 ± 0.14/0.39 ± 0.15 | 0.21 ± 0.11/0.45 ± 0.14 | 0.08 ± 0.06/0.15 ± 0.10 | 0.02 ± 0.03/0.01 ± 0.03 | 0.02 ± 0.04 |
| VSEARCH | 96% | 0.64 ± 0.14/0.36 ± 0.14 | 0.25 ± 0.11/0.47 ± 0.14 | 0.09 ± 0.07/0.16 ± 0.10 | 0.02 ± 0.03/0.01 ± 0.03 | 0.01 ± 0.03 |
| VSEARCH | 95% | 0.60 ± 0.14/0.36 ± 0.14 | 0.28 ± 0.11/0.47 ± 0.15 | 0.10 ± 0.07/0.16 ± 0.10 | 0.02 ± 0.03/0.00 ± 0.01 | 0.01 ± 0.03 |
Effect of curation on the taxonomic composition of single sites for OTU tables produced with different clustering methods at several levels. Values before the slash are values prior to curation with LULU. Values after the slash are post-curation values. Values are average proportions for single sites (given with standard deviations). Imperfect matches are calculated as the proportion of OTUs for each site whose best reference database match is below 100%. Recaptured species are calculated as the proportion of OTUs with a perfect reference database match and a unique taxonomic annotation corresponding to a plant species recorded for the site. Unregistered species are calculated as the proportion of OTUs with a perfect reference database match and a unique taxonomic annotation corresponding to a plant species not recorded for the site. Redundant species are calculated as the proportion of OTUs with a perfect reference database match and a redundant taxonomic annotation (i.e., already represented by a recaptured or unregistered species). Lost species is the proportion of the recaptured species lost during curation, so it has a single post-curation value
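The four per-site categories defined in this caption can be expressed as a small classification routine. The sketch below is one possible reading, assuming each OTU at a site comes with a best-match percentage and a species annotation, and treating the first perfectly matching OTU of a species as the unique one and any later OTU of the same species as redundant. The input layout and the function name are hypothetical.

```python
def classify_site_otus(otus, recorded_species):
    """Return the proportion of a site's OTUs in each category.

    otus: list of (best_match_percent, species_name) tuples for one site
          (hypothetical layout for this sketch).
    recorded_species: set of plant species recorded in the field survey.
    """
    counts = {"imperfect": 0, "recaptured": 0, "unregistered": 0, "redundant": 0}
    seen_species = set()

    for match_percent, species in otus:
        if match_percent < 100.0:
            # Best reference match below 100%
            counts["imperfect"] += 1
        elif species in seen_species:
            # Perfect match, but the species is already represented
            counts["redundant"] += 1
        else:
            seen_species.add(species)
            if species in recorded_species:
                counts["recaptured"] += 1
            else:
                counts["unregistered"] += 1

    total = len(otus)
    if total == 0:
        return {category: 0.0 for category in counts}
    return {category: n / total for category, n in counts.items()}


# Toy site: five OTUs, two of which share the same perfect-match species.
toy_otus = [
    (100.0, "Poa annua"),
    (100.0, "Poa annua"),
    (98.5, "Carex nigra"),
    (100.0, "Urtica dioica"),
    (100.0, "Bellis perennis"),
]
toy_recorded = {"Poa annua", "Bellis perennis"}
proportions = classify_site_otus(toy_otus, toy_recorded)
```

Averaging these proportions over all sites, once before and once after curation, would give values of the kind reported on either side of the slash in the table.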
Fig. 2. LULU curation workflow. (1) The user constructs an OTU table. (2) The user constructs a match list. (3) The OTU table and match list are fed to the LULU algorithm
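To make the idea of combining sequence similarity with co-occurrence concrete, here is a deliberately simplified sketch of a curation pass over the two inputs named in the workflow: an OTU table and a match list. It is not the LULU implementation. The ranking rule, the merge rule, the threshold values (`min_match`, `min_cooccurrence`) and the data layouts are all assumptions chosen for illustration.

```python
def curate(otu_table, match_list, min_match=84.0, min_cooccurrence=0.95):
    """Merge likely erroneous OTUs into similar, more abundant, co-occurring ones.

    otu_table: dict OTU id -> list of read counts per site (assumed layout).
    match_list: list of (query_id, hit_id, percent_similarity) tuples (assumed layout).
    Thresholds are illustrative assumptions, not published settings.
    """
    def spread(otu):
        return sum(1 for c in otu_table[otu] if c > 0)

    # Rank OTUs by number of sites occupied, then by total reads (descending).
    ranked = sorted(otu_table, key=lambda o: (spread(o), sum(otu_table[o])), reverse=True)

    # Index sufficiently similar pairs in both directions.
    similar = {}
    for query, hit, percent in match_list:
        if query != hit and percent >= min_match:
            similar.setdefault(query, set()).add(hit)
            similar.setdefault(hit, set()).add(query)

    curated = {}      # retained OTU id -> accumulated read counts
    merged_into = {}  # discarded OTU id -> retained OTU it was merged into

    for otu in ranked:
        counts = otu_table[otu]
        present = [site for site, c in enumerate(counts) if c > 0]
        parent = None
        if present:
            for candidate in curated:  # already retained, higher-ranked OTUs
                if candidate not in similar.get(otu, ()):
                    continue
                cand_counts = otu_table[candidate]
                shared = [site for site in present if cand_counts[site] > 0]
                cooccurrence = len(shared) / len(present)
                more_abundant = all(cand_counts[s] > counts[s] for s in shared)
                if cooccurrence >= min_cooccurrence and more_abundant:
                    parent = candidate
                    break
        if parent is None:
            curated[otu] = list(counts)
        else:
            curated[parent] = [a + b for a, b in zip(curated[parent], counts)]
            merged_into[otu] = parent

    return curated, merged_into


# Toy input: B resembles A and only occurs where A is more abundant;
# C also resembles A but occurs at a site where A is absent.
toy_table = {"A": [100, 50, 0], "B": [5, 2, 0], "C": [0, 0, 30]}
toy_matches = [("B", "A", 99.0), ("C", "A", 90.0)]
curated_table, merge_log = curate(toy_table, toy_matches)
```

In this toy case B is folded into A while C is kept as a separate OTU, because sequence similarity alone is not treated as evidence of error when the two OTUs do not co-occur.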