Mohamed Mysara, Natalie Leys, Jeroen Raes, Pieter Monsieurs.
Abstract
BACKGROUND: The popularity of new sequencing technologies has led to an explosion of possible applications, including new approaches in biodiversity studies. However, each of these sequencing technologies suffers from sequencing errors originating from different factors. For 16S rRNA metagenomics studies, 454 pyrosequencing is one of the most frequently used platforms, but sequencing errors still cause important data analysis issues (e.g. in clustering into taxonomic units and in biodiversity estimation). Moreover, retaining a larger portion of the sequencing data by preserving as much of the read length as possible, while keeping the error rate within an acceptable range, will have important consequences at the level of taxonomic precision.
Year: 2015 PMID: 25888405 PMCID: PMC4403973 DOI: 10.1186/s12859-015-0520-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Benchmarking of different denoising algorithms using the MOCK1 dataset
| Algorithm | Average read length | Error rate | Computational time | Number of reads |
|---|---|---|---|---|
| Denoiser | 504 | 0.0045 | 112 hr | 862279 |
| Acacia | 482 | 0.0040 | 8.8 hr | 845513 |
| Pre-cluster | 482 | 0.0028 | 8 hr | 845513 |
| AmpliconNoise | 499 | 0.0019 | 370 hr | 860273 |
| NoDe | 481 | 0.0012 | 9.5 hr | 845513 |

| Algorithm | Average read length | Error rate | Computational time | Number of reads |
|---|---|---|---|---|
| Denoiser | 439 | 0.0024 | 96 hr | 785115 |
| Acacia | 424 | 0.0021 | 7.7 hr | 827123 |
| Pre-cluster | 424 | 0.0014 | 7 hr | 827123 |
| AmpliconNoise | 424 | 0.0013 | 312 hr | 818421 |
| NoDe | 425 | 0.0008 | 8.3 hr | 827123 |
The comparison covers the final error rate and the computational cost (on a single CPU, Intel Xeon E5-2640 2.50 GHz) of the analysis pipelines for all tested denoising algorithms (Acacia, Denoiser, Pre-cluster, AmpliconNoise, NoDe). The number of reads and the average read length returned by each algorithm are also shown.
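The trade-off between error rate and run time in the table can be made explicit with a little arithmetic. The following is a minimal sketch, not part of the original study: it copies the error rates and run times from the first block of rows in the table and uses two hypothetical helpers, `rank_by_error` and `relative_reduction`, to rank the algorithms and to express one error rate as a fractional reduction of another.

```python
# Values copied from the first block of rows in the benchmarking table.
# The dictionary layout and helper names are illustrative assumptions.
RESULTS = {
    "Denoiser": {"error_rate": 0.0045, "hours": 112.0},
    "Acacia": {"error_rate": 0.0040, "hours": 8.8},
    "Pre-cluster": {"error_rate": 0.0028, "hours": 8.0},
    "AmpliconNoise": {"error_rate": 0.0019, "hours": 370.0},
    "NoDe": {"error_rate": 0.0012, "hours": 9.5},
}


def rank_by_error(results):
    """Return algorithm names sorted from lowest to highest error rate."""
    return sorted(results, key=lambda name: results[name]["error_rate"])


def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    for name in rank_by_error(RESULTS):
        row = RESULTS[name]
        print(f"{name:14s} error={row['error_rate']:.4f} time={row['hours']:.1f} hr")
    gain = relative_reduction(
        RESULTS["AmpliconNoise"]["error_rate"], RESULTS["NoDe"]["error_rate"]
    )
    print(f"NoDe vs AmpliconNoise: {gain:.1%} lower error rate")
```

With these figures, NoDe has the lowest error rate of the five pipelines while running in a small fraction of the time needed by AmpliconNoise.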
Figure 1. Effect of denoising algorithms with respect to position in the read. A) Error rate versus position in the read after treatment with the different denoising algorithms: Acacia (orange), Denoiser (blue), SLP (green), AmpliconNoise (violet) and NoDe (red), with the raw error rate in black. B) Insertion (upper), deletion (middle) and substitution (lower) error rates in the raw reads (black) and after treatment with the different approaches, versus position in the read.
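A per-position error profile of the kind plotted in Figure 1 can be tallied from pairwise read-to-reference alignments. The sketch below is an illustration under stated assumptions and not the procedure used in the study: each alignment is assumed to be a pair of equal-length strings with `-` marking gaps, the function name `per_position_error_rates` is hypothetical, and a deletion is attributed to the read position of the next base that follows it.

```python
from collections import Counter


def per_position_error_rates(alignments):
    """Tally insertion, deletion and substitution rates per read position.

    `alignments` is an iterable of (read, reference) gapped strings of equal
    length. Positions are 0-based and counted along the ungapped read.
    Deletions after the last covered read position are not reported.
    """
    covered = Counter()  # number of reads with a base at each position
    errors = {kind: Counter() for kind in ("insertion", "deletion", "substitution")}
    for read, ref in alignments:
        pos = 0
        for read_base, ref_base in zip(read, ref):
            if read_base == "-" and ref_base == "-":
                continue  # column carries no information
            if read_base == "-":
                # Reference base missing from the read: a deletion.
                errors["deletion"][pos] += 1
                continue
            covered[pos] += 1
            if ref_base == "-":
                errors["insertion"][pos] += 1  # extra base in the read
            elif read_base != ref_base:
                errors["substitution"][pos] += 1
            pos += 1
    return {
        kind: {p: counts[p] / covered[p] for p in sorted(covered)}
        for kind, counts in errors.items()
    }


# Toy alignments (hypothetical data, for illustration only).
EXAMPLE = [
    ("ACGT", "ACGT"),
    ("AGGT", "ACGT"),   # substitution at read position 1
    ("AC-T", "ACGT"),   # deletion before read position 2
    ("ACGAT", "ACG-T"), # insertion at read position 3
]
RATES = per_position_error_rates(EXAMPLE)
```

Plotting each of the three dictionaries in `RATES` against position would give curves analogous to the panels in part B of the figure.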
Tabular overview of the computational cost of the different denoising algorithms
| Algorithm | | | Denoising (memory) | | | Total |
|---|---|---|---|---|---|---|
| NoDe | 00:00:12 | 00:00:02 | 00:02:16 (760 MB) | 00:06:40 | 00:01:00 | 00:10:05 |
| Pre-cluster | 00:00:12 | 00:00:02 | 00:00:13 (100 MB) | 00:06:40 | 00:01:00 | 00:08:02 |
| AmpliconNoise | 00:00:12 | 00:00:01 | 08:25:17 (1,900 MB) | 00:03:40 | 00:01:00 | 08:30:27 |
| Denoiser | 00:00:12 | 00:00:01 | 00:38:17 (2,300 MB) | 00:02:30 | 00:01:00 | 00:42:00 |
| Acacia | 00:00:12 | 00:00:02 | 00:00:55 (1,600 MB) | 00:06:40 | 00:01:00 | 00:08:49 |
To give an idea of the computational cost of each step, the complete pipeline was subdivided into separate steps and the running time of each is reported, as described above. The table shows that integrating the NoDe algorithm adds a relatively small computational burden to the complete preprocessing pipeline, and that this cost is largely compensated by a significant improvement in the error rate, which is lower than that of the second-best performing (but computationally intensive) algorithm, AmpliconNoise. For the denoising algorithms, the average amount of memory required is also given.
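How much of the pipeline time is spent on denoising can be worked out directly from the table. The sketch below is a rough illustration, assuming that the column carrying the memory figure is the denoising step and that the final column is the pipeline total; the helper names `to_seconds` and `denoising_share` are hypothetical.

```python
def to_seconds(hms):
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def denoising_share(denoising, total):
    """Fraction of the total pipeline time spent in the denoising step."""
    return to_seconds(denoising) / to_seconds(total)


# (denoising time, total time) copied from the table; the column roles
# are an assumption, as noted in the text above.
TIMINGS = {
    "NoDe": ("00:02:16", "00:10:05"),
    "Pre-cluster": ("00:00:13", "00:08:02"),
    "AmpliconNoise": ("08:25:17", "08:30:27"),
    "Denoiser": ("00:38:17", "00:42:00"),
    "Acacia": ("00:00:55", "00:08:49"),
}

if __name__ == "__main__":
    for name, (denoising, total) in TIMINGS.items():
        print(f"{name:14s} {denoising_share(denoising, total):.1%} of total time")
```

Under these assumptions, denoising accounts for roughly a fifth of the NoDe pipeline time, whereas it dominates the AmpliconNoise pipeline almost entirely.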
OTUs produced after treating the data with different noise removal approaches
| Total OTUs | Correct OTUs | Noisy OTUs | Missed OTUs | Over-splitted OTUs | Contaminant OTUs | Other OTUs | Rare OTUs | Redundant OTUs |
|---|---|---|---|---|---|---|---|---|
| 22 | 17 | 4 | 0 | 1 | 0 | 0 | 4 | 18 |
| 29 | 16 | 4 | 1 | 2 | 1 | 5 | 11 | 18 |
| 24 | 16 | 3 | 1 | 1 | 0 | 4 | 7 | 17 |
| 46 | 17 | 24 | 0 | 4 | 0 | 0 | 22 | 24 |
| 46 | 17 | 23 | 0 | 5 | 1 | 0 | 21 | 25 |
| 58 | 17 | 29 | 0 | 5 | 1 | 7 | 35 | 17 |
The left side of the table displays the qualitative OTU assessment and the right side displays the quantitative analysis. For the qualitative analysis, we counted the number of “correct OTUs” (classified as one of the mock species), “noisy OTUs” (classified as one of the mock species, but only down to the Class, Order or Family level), “missed OTUs” (number of undetected mock species), “over-splitted OTUs” (correct yet redundant classification), “contaminant OTUs” (classified as a species not belonging to the mock community) and “other OTUs” (OTUs unclassified at the Class level or higher). In the quantitative analysis, we counted the number of OTUs with a redundancy below 0.1% (rare OTUs) and the number with a redundancy above 0.1% (redundant OTUs).
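The quantitative split into rare and redundant OTUs amounts to a threshold on relative abundance. A minimal sketch follows, assuming redundancy means an OTU's share of all reads and that an OTU exactly at the threshold counts as redundant (the text does not specify the boundary case); the function name and the example read counts are hypothetical.

```python
def count_rare_and_redundant(otu_read_counts, threshold=0.001):
    """Count OTUs below and at or above a relative-abundance threshold.

    `otu_read_counts` maps an OTU identifier to its number of reads.
    `threshold` is a fraction of all reads (0.001 corresponds to 0.1%).
    Returns a (rare, redundant) tuple of OTU counts.
    """
    total = sum(otu_read_counts.values())
    if total == 0:
        return (0, 0)
    rare = sum(1 for count in otu_read_counts.values() if count / total < threshold)
    redundant = len(otu_read_counts) - rare
    return (rare, redundant)


# Hypothetical read counts, for illustration only.
EXAMPLE_COUNTS = {"otu_a": 9000, "otu_b": 995, "otu_c": 5}
RARE, REDUNDANT = count_rare_and_redundant(EXAMPLE_COUNTS)
```

In this toy example only `otu_c`, with 5 of 10,000 reads (0.05%), falls below the 0.1% threshold.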