Julien Tremblay, Etienne Yergeau.
Abstract
BACKGROUND: With the advent of high-throughput sequencing, microbiology is becoming increasingly data-intensive. Because of its low cost, robust databases, and established bioinformatic workflows, sequencing of 16S/18S/ITS ribosomal RNA (rRNA) gene amplicons, which provides a marker of choice for phylogenetic studies, has become ubiquitous. Many established end-to-end bioinformatic pipelines are available for analyzing short amplicon sequence data. These pipelines suit a general audience, but few options exist for more specialized users who are experienced in scripting, Linux-based systems, and high-performance computing (HPC) environments. For such users, existing pipelines can make it difficult to fully leverage modern HPC capabilities and to tune and optimize individual steps. Moreover, a wealth of stand-alone software packages that perform specific, targeted bioinformatic tasks is increasingly accessible, and finding a way to easily integrate these applications into a pipeline is critical to the evolution of bioinformatic methodologies.
Keywords: High Performance Computing; bioinformatics; metagenomics; rRNA gene amplicons
Year: 2019 PMID: 31816087 PMCID: PMC6901069 DOI: 10.1093/gigascience/giz146
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Details of investigated datasets
| Study | Targeted gene and region | Average read length of paired assembled fragments¹ (mean ± standard deviation) | Sequencing configuration | No. of reads | No. of base pairs (Gb) | No. of samples | File size of sequencing data (gzip compressed) |
|---|---|---|---|---|---|---|---|
| Even mock community (this study) | 16S Bacteria/archaea; V4 | 250.3 ± 0.7 bp | Illumina 2 × 250 bp | 1,987,408 | 0.50 | 4 | 375 MB |
| Staggered mock community [ | 16S Bacteria/archaea; V4 | 250.2 ± 0.7 bp | Illumina 2 × 250 bp | 289,434 | 0.072 | 3 | 30 MB |
| Indoor microbiome [ | 16S Bacteria; V3–V4 region | No assembled fragments, single-end reads of 151 bp | Illumina 1 × 150 bp | 111,093,697 | 16.8 | 1,625 | 6.9 GB |
| Lake Michigan [ | 18S Eukaryotes; 1181F–1624R | 250.5 ± 3.1 bp | Illumina 2 × 150 bp | 19,359,618 | 4.86 | 89 | 2.3 GB |
| Antibiotic-associated diarrhea [ | 16S Bacteria/archaea; V4 | 250.4 ± 0.8 bp | Illumina 2 × 250 bp | 22,003,478 | 3.3 | 276 | 2.9 GB |
| Plant microbiome transplant [ | Fungi ITS; ITS1 | 249.0 ± 7.9 bp | Illumina 2 × 250 bp | 30,775,636 | 7.72 | 94 | 4.1 GB |
| PacBio mock community [ | 16S Bacterial; full length | No assembled fragments, single-end reads of 1,472.7 ± 215.5 bp | PacBio single-end sequencing | 86,353 | 0.13 | 8 | 25 MB |
| Oral microbiota [ | 16S Bacterial; full length | No assembled fragments, single-end reads of 1,470.2 ± 225.8 bp | PacBio single-end sequencing | 689,430 | 1.01 | 40 | 140 MB |
¹ These are the reads that are sent for OTU/ASV generation after having been paired-end assembled (for paired-end sequencing) and controlled for quality as described in Methods.
Figure 1: Comparison between Deblur, DADA2, DNACLUST, and VSEARCH as implemented in AmpliconTagger and QIIME2-VSEARCH, QIIME2-DADA2, and QIIME2-Deblur for the taxonomic profiles of (a) even and (b) staggered mock community and (c) β-diversity (weighted UniFrac) and (d) α-diversity of mock community samples (16S V4 region; 2 × 250 bp), where each point represents the Observed OTUs or ASVs index of a given sample. Results labelled with the QIIME2- prefix were processed entirely with QIIME2 using either VSEARCH, Deblur, or DADA2 as the OTU or ASV generation method. PCo: principal coordinate.
Figure 2: Resource consumption for investigated datasets and each OTU/ASV generation method. There are no common core steps for the QIIME2-DADA2 workflow because raw reads were submitted to DADA2 directly.
Number of reads and OTUs/ASVs throughout AmpliconTagger's execution
| Project | OTU/ASV generation method | Total reads | Contaminant reads | PhiX reads | Non-contaminant and non-PhiX reads | Non-contaminant and non-PhiX reads 1 | Non-contaminant and non-PhiX reads 2 | Reads 1 QC passed | Assembled reads | Assembled reads QC passed | Clustered or dereplicated sequences | No. of clusters or dereplicated sequences |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mock community (V4 16S; paired-end) | Deblur | 2,602,808 | 8 | 27 | 2,276,776 | 1,138,388 | 1,138,388 | - | 1,123,408 | 1,032,461 | 597,547 | 67 |
| | DNACLUST | | | | | | | | | | 918,257 | 67 |
| | VSEARCH | | | | | | | | | | 973,750 | 34 |
| | DADA2 | | | | | | | 928,310 R1 + 928,310 R2 | - | - | 885,235 | 96 |
| | QIIME2-Deblur | | - | - | - | - | - | - | 966,899 | 966,829 | 599,522 | 46 |
| | QIIME2-VSEARCH | | - | - | - | - | - | - | 966,899 | 966,829 | 916,980 | 37 |
| | QIIME2-DADA2 | | - | - | - | - | - | - | - | - | 980,184 | 124 |
| Indoor microbiome (V4 16S; single-end) | Deblur | 111,093,697 | 48,996 | 0 | 111,044,701 | - | - | 108,008,427 | - | - | 73,010,229 | 27,391 |
| | DNACLUST | | | | | | | | | | 95,120,374 | 18,340 |
| | VSEARCH | | | | | | | | | | 100,289,243 | 14,341 |
| | DADA2 | | | | | | | | | | 103,031,599 | 32,647 |
| Lake Michigan (1181F-1624R 18S; paired-end) | Deblur | 19,359,618 | 275,694 | 4,521 | 18,803,930 | 9,401,965 | 9,401,965 | - | 8,052,948 | 3,356,475 | 2,201,736 | 662 |
| | DNACLUST | | | | | | | | | | 2,629,227 | 564 |
| | VSEARCH | | | | | | | | | | 2,672,395 | 483 |
| | DADA2 | | | | | | | 2,522,201 R1 + 2,522,201 R2 | - | - | 1,880,759 | 854 |
| AAD (V4 16S; paired-end) | Deblur | 22,003,478 | 151 | 80 | 22,001,808 | 11,000,904 | 11,000,904 | - | 10,860,416 | 9,209,510 | 5,657,445 | 1,560 |
| | DNACLUST | | | | | | | | | | 7,791,719 | 1,053 |
| | VSEARCH | | | | | | | | | | 8,100,048 | 827 |
| | DADA2 | | | | | | | 7,435,547 R1 + 7,435,547 R2 | - | - | 6,583,440 | 1,791 |
| Plant microbiome transplant (ITS1 ITS; paired-end) | Deblur | 30,775,636 | 174,471 | 5,770,656 | 24,816,770 | 12,408,385 | 12,408,385 | - | 9,850,519 | 7,479,355 | 1,901,625 | 780 |
| | DNACLUST | | | | | | | | | | 6,124,824 | 1,172 |
| | VSEARCH | | | | | | | | | | 7,166,333 | 1,056 |
| | DADA2 | | | | | | | 3,215,102 R1 + 3,215,102 R2 | - | - | 2,954,161 | 1,130 |
| Mock community (full-length 16S; single-end) | Deblur | 93,905 | - | - | 93,905 | 86,353 | - | 74,485 | - | - | 6,543 | 47 |
| | DNACLUST | | | | | | | | | | 59,348 | 1,026 |
| | VSEARCH | | | | | | | | | | 60,499 | 415 |
| | DADA2 | | | | | | | | | | 50,876 | 49 |
| Oral microbiome (full-length 16S; single-end) | Deblur | 627,138 | - | - | 627,138 | 562,986 | - | 562,896 | - | - | 219,423 | 199,780 |
| | DNACLUST | | | | | | | | | | 520,992 | 77,305 |
| | VSEARCH | | | | | | | | | | 523,444 | 69,478 |
| | DADA2 | | | | | | | | | | 262,414 | 6,535 |
Mantel r statistics comparing distance matrices of each ASV/OTU generation method for each project
| Beta-diversity metric | Deblur vs DNACLUST | Deblur vs VSEARCH | Deblur vs DADA2 | DNACLUST vs VSEARCH | DNACLUST vs DADA2 | VSEARCH vs DADA2 |
|---|---|---|---|---|---|---|
| Weighted UniFrac | | | | | | |
| Mock community (V4 16S; paired-end) | 0.984 | 0.999 | 0.999 | 0.983 | 0.986 | 0.999 |
| Indoor microbiome (V4 16S; single-end) | 0.940 | 0.900 | 0.905 | 0.934 | 0.955 | 0.956 |
| Lake Michigan (1181F-1624R 18S; paired-end) | 0.952 | 0.956 | 0.949 | 0.986 | 0.977 | 0.978 |
| AAD (V4 16S; paired-end) | 0.943 | 0.943 | 0.942 | 0.968 | 0.970 | 0.958 |
| Plant microbiome transplant (ITS1 ITS; paired-end) | 0.403 | 0.508 | 0.542 | 0.403 | 0.465 | 0.617 |
| Mock community (full-length 16S; single-end) | 0.359 | 0.225 | 0.461 | 0.988 | −0.078 | −0.080 |
| Oral microbiome (full-length 16S; single-end) | −0.097 | −0.100 | −0.093 | 0.987 | 0.958 | 0.953 |
| Bray-Curtis | | | | | | |
| Mock community (V4 16S; paired-end) | 0.864 | 0.580 | 0.999 | 0.724 | 0.865 | 0.580 |
| Indoor microbiome (V4 16S; single-end) | 0.991 | 0.967 | 0.996 | 0.980 | 0.993 | 0.972 |
| Lake Michigan (1181F-1624R 18S; paired-end) | 0.994 | 0.996 | 0.993 | 0.997 | 0.993 | 0.991 |
| AAD (V4 16S; paired-end) | 0.960 | 0.941 | 0.993 | 0.980 | 0.967 | 0.949 |
| Plant microbiome transplant (ITS1 ITS; paired-end) | 0.821 | 0.812 | 0.792 | 0.979 | 0.935 | 0.914 |
| Mock community (full-length 16S; single-end) | −0.331 | −0.241 | −0.119 | 0.901 | 0.023 | 0.194 |
| Oral microbiome (full-length 16S; single-end) | 0.046 | 0.040 | −0.026 | 0.933 | 0.415 | 0.436 |
Each r statistic had a P-value < 0.001.
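The Mantel r statistics above measure how strongly two sample-by-sample distance matrices agree. The text here does not state which implementation produced them, so the following is only a minimal sketch, assuming a Pearson correlation between the upper triangles of the two matrices and a one-sided permutation test. The function name `mantel`, its parameters, and the toy matrices are hypothetical and chosen purely for illustration.

```python
import numpy as np


def mantel(d1, d2, permutations=999, seed=0):
    """Illustrative Mantel test (hypothetical helper, not the authors' code).

    Returns the Pearson r between the upper triangles of two square
    distance matrices and a one-sided permutation p-value.
    """
    d1 = np.asarray(d1, dtype=float)
    d2 = np.asarray(d2, dtype=float)
    if d1.ndim != 2 or d1.shape[0] != d1.shape[1] or d1.shape != d2.shape:
        raise ValueError("d1 and d2 must be square matrices of equal shape")

    n = d1.shape[0]
    upper = np.triu_indices(n, k=1)  # each sample pair counted once
    x = d1[upper]
    r_obs = np.corrcoef(x, d2[upper])[0, 1]

    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(permutations):
        # Shuffle sample labels of d2: rows and columns move together.
        perm = rng.permutation(n)
        d2_perm = d2[np.ix_(perm, perm)]
        if np.corrcoef(x, d2_perm[upper])[0, 1] >= r_obs:
            hits += 1

    # Add-one correction so the p-value is never exactly zero.
    p_value = (hits + 1) / (permutations + 1)
    return r_obs, p_value


# Toy data (an assumption): Euclidean distances between 8 random points,
# compared against the same distances scaled by 2.
rng = np.random.default_rng(42)
points = rng.random((8, 2))
d_a = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
d_b = 2.0 * d_a

r, p = mantel(d_a, d_b)
print(f"Mantel r = {r:.3f}, p = {p:.3f}")
```

With 999 permutations the smallest attainable p-value is 0.001, so a reported threshold of P < 0.001 would require more permutations under this particular scheme. In practice, each pair of columns in the table would correspond to one call with the two methods' distance matrices for the same set of samples.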