Miklós Bálint, Philipp-André Schmidt, Rahul Sharma, Marco Thines, Imke Schmitt.
Abstract
High-throughput metabarcoding studies on fungi and other eukaryotic microorganisms are rapidly becoming more frequent and more complex, requiring researchers to handle ever-increasing amounts of raw sequence data. Here, we provide a flexible pipeline for pruning and analyzing fungal barcode (ITS rDNA) data generated as paired-end reads on Illumina MiSeq sequencers. The pipeline includes specific steps fine-tuned for ITS that are mostly missing from pipelines developed for prokaryotes. It (1) employs state-of-the-art programs and follows best practices in fungal high-throughput metabarcoding; (2) consists of modules and scripts that the user can easily modify to suit the specific needs of a project or future methodological developments; and (3) is straightforward to use, including in classroom settings. We provide detailed descriptions and revision techniques for each step, thus giving the user maximum control over data treatment and avoiding a black-box approach. Employing this pipeline will improve and speed up the tedious and error-prone process of cleaning fungal Illumina metabarcoding data.
Keywords: community ecology; data pruning; high throughput; internal transcribed spacer rDNA; next-generation sequencing
Year: 2014 PMID: 25077016 PMCID: PMC4113289 DOI: 10.1002/ece3.1107
Source DB: PubMed Journal: Ecol Evol ISSN: 2045-7758 Impact factor: 2.912
Steps of the Illumina metabarcoding data denoising pipeline using a fungal ITS rDNA example file. For each step, we present the decreasing read/cluster numbers, approximate computing times, and computing infrastructure. Computing times are based on the example data file run either on a standard desktop computer with two processors and 4 GB RAM or on a large-memory machine with 48 processors and 512 GB RAM.
| Pipeline step | Program | Files | Read numbers | Cluster numbers | Time | Computer | Processors |
|---|---|---|---|---|---|---|---|
| Raw sequence data | | Pool 1 forward | 14,940,845 | | | | |
| | | Pool 1 reverse | 14,940,845 | | | | |
| | | Pool 2 forward | 11,209,268 | | | | |
| | | Pool 2 reverse | 11,209,268 | | | | |
| | | Pool 3 forward | 13,946,058 | | | | |
| | | Pool 3 reverse | 13,946,058 | | | | |
| 1. Quality filtering | Script (Supplements): Reads_Quality_Length_distribution.pl | Pool 1 forward | 13,433,309 | | Up to 1 h | Desktop computer | 2 |
| | | Pool 1 reverse | 13,433,309 | | | | |
| | | Pool 2 forward | 9,998,878 | | | | |
| | | Pool 2 reverse | 9,998,878 | | | | |
| | | Pool 3 forward | 13,044,704 | | | | |
| | | Pool 3 reverse | 13,044,704 | | | | |
| 2. Paired-end assembly | PANDAseq | Pool 1 | 12,341,403 | | Up to 1 h | Desktop computer | 2 |
| | | Pool 2 | 9,314,737 | | | | |
| | | Pool 3 | 12,134,242 | | | | |
| 3. Remove primer artifacts | Script (Supplements): remove_multiprimer.py | Pool 1 | 11,255,037 | | Minutes | Desktop computer | 2 |
| | | Pool 2 | 8,520,491 | | | | |
| | | Pool 3 | 11,153,448 | | | | |
| 4. Reorient reads to 5′-3′ | fqgrep | Pool 1 | 9,061,462 | | Up to 1 h | Desktop computer | 2 |
| | | Pool 2 | 8,155,112 | | | | |
| | | Pool 3 | 10,539,993 | | | | |
| 5. Demultiplex | Script (Supplements): demultiplex.sh | | | | | | |
| (A) forward labels | | Pool 1 | 8,851,827 | | Up to 1 h | Desktop computer | 2 |
| | | Pool 2 | 8,053,268 | | | | |
| | | Pool 3 | 9,903,834 | | | | |
| (B) reverse labels | | Pool 1 | 3,297,016 | | Up to 1 h | Desktop computer | 2 |
| | | Pool 2 | 2,957,182 | | | | |
| | | Pool 3 | 3,949,934 | | | | |
| 6. Pool files, remove primers and labels | rename.pl | Pool 1, 2 and 3 combined | 10,203,752 | | Minutes | Desktop computer | 2 |
| 7. Extract fungal ITS | FungalITSextractor | Pool 1, 2 and 3 combined | 10,093,751 | | Several days | Computer cluster | 50 |
| 8. Similarity clustering | | | | | | | |
| (A) Remove replicate sequences | UPARSE | Pool 1, 2 and 3 combined | 4,869,466 | | Minutes | Desktop computer | 2 |
| (B) Sort sequences by abundance | UPARSE | Pool 1, 2 and 3 combined | | 560,678 | Minutes | Desktop computer | 2 |
| (C) Cluster OTUs | UPARSE | Pool 1, 2 and 3 combined | | 14,781 | Up to 1 h | Desktop computer | 2 |
| 9. Reference-based chimera filtering | UPARSE | Pool 1, 2 and 3 combined | | 14,636 | Up to 1 h | Desktop computer | 2 |
| 10. Identify fungal OTUs | | | | | | | |
| (A) BLAST | BLAST | Pool 1, 2 and 3 combined | | 14,636 | Several hours | Computer cluster | 25 |
| (B) Assign fungal reads | MEGAN | Pool 1, 2 and 3 combined | | 3208 | Minutes | Desktop computer | |
| 11. Fungal OTU abundance table | UPARSE | Pool 1, 2 and 3 combined | 5,964,069 | 3208 | Several hours | Desktop computer | 2 |
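
The read loss per step in the table can be made explicit with a short calculation. The sketch below is not part of the published pipeline; it only takes the Pool 1 read numbers from the table and reports, for each step, the percentage of reads retained relative to the previous step and to the raw data. The function name `retention` and the step labels are illustrative choices.

```python
# Pool 1 read numbers copied from the table (steps before pooling).
pool1_reads = [
    ("Raw sequence data", 14_940_845),
    ("1. Quality filtering", 13_433_309),
    ("2. Paired-end assembly", 12_341_403),
    ("3. Remove primer artifacts", 11_255_037),
    ("4. Reorient reads to 5'-3'", 9_061_462),
    ("5A. Demultiplex, forward labels", 8_851_827),
    ("5B. Demultiplex, reverse labels", 3_297_016),
]


def retention(counts):
    """Return (step, reads, % of previous step, % of raw data) per step."""
    raw = counts[0][1]
    previous = raw
    rows = []
    for step, reads in counts:
        rows.append((step, reads, 100 * reads / previous, 100 * reads / raw))
        previous = reads
    return rows


for step, reads, pct_prev, pct_raw in retention(pool1_reads):
    print(f"{step:<34}{reads:>12,}{pct_prev:>8.1f}%{pct_raw:>8.1f}%")
```

The same function can be applied to the Pool 2 and Pool 3 columns by swapping in their read numbers.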
Figure 1. Frequency of soil samples in relation to the number of Illumina MiSeq reads allocated to each sample. The majority of samples contained between 40,000 and 70,000 Illumina MiSeq reads.
Figure 2. Distribution of Illumina MiSeq reads and 97% OTUs across different taxonomic groups. Approximately 5.9 million reads were assigned to fungi (from a total of >10 million reads). Of a total of 14,636 OTUs, 3208 could be assigned to fungi.
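
The proportions behind Figure 2 follow directly from the table. The sketch below is an illustration, not part of the published pipeline. It assumes that the ">10 million reads" total corresponds to the read number after ITS extraction (step 7); the caption does not state which step the total comes from, so the read percentage is approximate.

```python
# Values taken from the table; the choice of denominator for reads is an
# assumption (step 7, reads remaining after fungal ITS extraction).
total_reads = 10_093_751   # step 7, Pool 1, 2 and 3 combined
fungal_reads = 5_964_069   # step 11, reads in the fungal OTU abundance table
total_otus = 14_636        # step 9, OTUs after chimera filtering
fungal_otus = 3_208        # step 10B, OTUs assigned to fungi

fungal_read_pct = 100 * fungal_reads / total_reads
fungal_otu_pct = 100 * fungal_otus / total_otus

print(f"Fungal share of reads: {fungal_read_pct:.1f}%")
print(f"Fungal share of OTUs:  {fungal_otu_pct:.1f}%")
```

Under this assumption, fungi account for a majority of the reads but only a minority of the OTUs.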