Mohamed Mysara, Mercy Njima, Natalie Leys, Jeroen Raes, Pieter Monsieurs.
Abstract
High-throughput sequencing technologies, and in particular recent advances in the Illumina MiSeq platform, give microbial ecologists an efficient way to assess bacterial diversity at unprecedented depth. Analyzing such data, however, poses substantial computational challenges and requires specialized bioinformatics solutions at several stages of the processing pipeline: assembly of paired-end reads, chimera removal, correction of sequencing errors, and clustering of the sequences into Operational Taxonomic Units (OTUs). Individual algorithms addressing each of these challenges have been combined into various bioinformatics pipelines, such as mothur, QIIME, LotuS, and USEARCH. Using a set of well-described bacterial mock communities, state-of-the-art pipelines for Illumina MiSeq amplicon sequencing data are benchmarked on the number of sequences retained, computational cost, error rate, and quality of the OTUs. In addition, a new pipeline called OCToPUS is introduced, which optimally combines different algorithms. Large variability is observed between the pipelines for the monitored performance parameters; in general, the number of retained reads is inversely related to the quality of the reads. By contrast, OCToPUS achieves the lowest error rate, the fewest spurious OTUs, and the closest correspondence to the actual mock community, while retaining the largest number of reads of all pipelines compared. The newly introduced pipeline translates Illumina MiSeq amplicon sequencing data into high-quality and reliable OTUs, with improved performance and accuracy compared to the currently existing pipelines.
Keywords: 16S rRNA metagenomics; OTU clustering; amplicon sequencing; chimera; denoising; operational taxonomic units
Year: 2017 PMID: 28369460 PMCID: PMC5466709 DOI: 10.1093/gigascience/giw017
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Overview of the algorithms available for different steps within amplicon sequencing data analysis
| Step | Tools |
|---|---|
| Paired-end assembler | FLASH, PANDAseq, COPE, PEAR |
| Quality filtering | trim.seqs (mothur), split_libraries (QIIME), fastq_filter (USEARCH) |
| Denoising | Pre-cluster, UNOISE, IPED |
| Chimera detection | Pintail, Bellerophon, ChimeraSlayer, DECIPHER, Perseus, UPARSE, UCHIME, CATCh |
| Clustering | Dotur, ESPRIT, ESPRIT-Tree, CD-HIT, Uclust, GramCluster, DNAClust, CROP, Swarm, UPARSE |
Figure 1. Overview of the different steps within each pipeline.
Figure 2. Average number of reads removed by the various pipelines due to improper assembly, quality filtering, or chimera removal. Because LotuS orders its processing steps differently, it could not be included in the figure (on average, LotuS retains 23% of the reads).
Error rates for the different samples after applying the various pipelines, either with complete removal of chimeric reads (via the seq.error command) or after applying the chimera removal algorithm embedded in the workflow in question.
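| Sample ID | QIIME (chimera absent) | mothur (chimera absent) | USEARCH (chimera absent) | OCToPUS (chimera absent) | QIIME (chimera removal algorithm) | mothur (chimera removal algorithm) | USEARCH (chimera removal algorithm) | OCToPUS (chimera removal algorithm) |
|---|---|---|---|---|---|---|---|---|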
| 130403( | 0.0022 | 0.0006 | 0.0003 | 0.0003 | 0.0023 | 0.0008 | 0.0005 | 0.0004 |
| 130417( | 0.0018 | 0.0005 | 0.0003 | 0.0003 | 0.0019 | 0.0007 | 0.0005 | 0.0004 |
| 130422( | 0.0023 | 0.0012 | 0.0008 | 0.0009 | 0.0023 | 0.0011 | 0.0010 | 0.0009 |
| 130403( | 0.00055 | 0.00013 | 0.00010 | 0.00005 | 0.00208 | 0.00167 | 0.00161 | 0.00126 |
| 130417( | 0.00049 | 0.00010 | 0.00008 | 0.00003 | 0.00187 | 0.00150 | 0.00147 | 0.00114 |
| 130422( | 0.00048 | 0.00010 | 0.00008 | 0.00003 | 0.00182 | 0.00144 | 0.00141 | 0.00109 |
| V4.I.1 | 0.00079 | 0.00007 | 0.00002 | 0.00002 | 0.00087 | 0.00013 | 0.00006 | 0.00003 |
| V4.I.05 | 0.00087 | 0.00010 | 0.00002 | 0.00002 | 0.00099 | 0.00020 | 0.00008 | 0.00003 |
| V4.V5.I.1 | 0.0257 | 0.0084 | 0.0075 | 0.0041 | 0.0241 | 0.0069 | 0.0049 | 0.0047 |
| V4.V5.I.11 | 0.0218 | 0.0060 | 0.0072 | 0.0031 | 0.0218 | 0.0044 | 0.0047 | 0.0032 |
| M1( | 0.0014 | 0.0006 | 0.0006 | 0.0005 | 0.0052 | 0.0039 | 0.0042 | 0.0038 |
| M2( | 0.0014 | 0.0007 | 0.0006 | 0.0005 | 0.0058 | 0.0045 | 0.0047 | 0.0043 |
| M3( | 0.0011 | 0.0006 | 0.0005 | 0.0005 | 0.0052 | 0.0041 | 0.0041 | 0.0039 |
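
The comparison in the error-rate table can be checked with a few lines of Python. The sketch below is illustrative only: it copies the "chimera removal algorithm" values for the four rows whose sample IDs are intact in this record, and the helper names (`best_pipeline`, `fold_reduction`) are hypothetical, not part of any of the benchmarked tools.

```python
# Error rates after each pipeline's own chimera removal, copied from the table.
# Only rows with complete sample IDs are included; column order is fixed below.
PIPELINES = ("QIIME", "mothur", "USEARCH", "OCToPUS")

ERROR_RATES = {
    "V4.I.1":     (0.00087, 0.00013, 0.00006, 0.00003),
    "V4.I.05":    (0.00099, 0.00020, 0.00008, 0.00003),
    "V4.V5.I.1":  (0.0241, 0.0069, 0.0049, 0.0047),
    "V4.V5.I.11": (0.0218, 0.0044, 0.0047, 0.0032),
}


def best_pipeline(rates):
    """Return the name of the pipeline with the lowest error rate in one row."""
    idx = min(range(len(rates)), key=lambda i: rates[i])
    return PIPELINES[idx]


def fold_reduction(rates, baseline="QIIME", target="OCToPUS"):
    """Return how many times lower the target's error rate is than the baseline's."""
    return rates[PIPELINES.index(baseline)] / rates[PIPELINES.index(target)]


for sample, rates in ERROR_RATES.items():
    print(sample, best_pipeline(rates), round(fold_reduction(rates), 1))
```

For these four rows the lowest value in each row belongs to OCToPUS, which matches the claim in the abstract; the fold reduction is a convenience summary and not a statistic reported in the source.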
Figure 3. Rarefaction curves of the different samples. The x-axis gives the sequencing depth and the y-axis the number of OTUs returned by each pipeline.
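
As a reminder of what a rarefaction curve plots, here is a minimal sketch, assuming each read has already been assigned an OTU label. It is a generic illustration with made-up toy data, not the procedure used to produce Figure 3; the function name `rarefaction_curve` and the labels `OTU1` to `OTU4` are hypothetical.

```python
import random


def rarefaction_curve(otu_labels, depths, seed=0):
    """For each depth, draw that many reads without replacement
    and count how many distinct OTUs appear in the subsample."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    curve = []
    for depth in depths:
        subsample = rng.sample(otu_labels, depth)
        curve.append((depth, len(set(subsample))))
    return curve


# Toy input (hypothetical): one OTU label per read, 100 reads in total.
reads = ["OTU1"] * 50 + ["OTU2"] * 30 + ["OTU3"] * 15 + ["OTU4"] * 5
curve = rarefaction_curve(reads, depths=[10, 25, 50, 100])

for depth, n_otus in curve:
    print(depth, n_otus)
```

Each depth is subsampled independently here, so a single run is not guaranteed to be monotone; in practice the counts are usually averaged over many random subsamples per depth.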
Figure 4. Composition of the OTUs produced by the various approaches, classified into four categories: original (blue), chimeric (violet), contaminant (green), and no hit (red). The size of each circle represents the number of reads retained after running each pipeline (exact percentages can be found in supplementary file 2).