Great differences in performance and outcome of high-throughput sequencing data analysis platforms for fungal metabarcoding.

Sten Anslan¹, R Henrik Nilsson², Christian Wurzbacher³, Petr Baldrian⁴, Mohammad Bahram⁵,⁶,⁷.

Abstract

Along with recent developments in high-throughput sequencing (HTS) technologies and the resulting fast accumulation of HTS data, there has been a growing need and interest for developing tools for HTS data processing and communication. In particular, a number of bioinformatics tools have been designed for analysing metabarcoding data, each with specific features, assumptions and outputs. To evaluate the potential effect of applying different bioinformatics workflows on the results, we compared the performance of different analysis platforms on two contrasting high-throughput sequencing data sets. Our analysis revealed that the computation time, the quality of error filtering and hence the output of a specific bioinformatics process largely depend on the platform used. Our results show that none of the bioinformatics workflows appears to perfectly filter out the accumulated errors and generate Operational Taxonomic Units, although PipeCraft, LotuS and PIPITS perform better than QIIME2 and Galaxy for the tested fungal amplicon dataset. We conclude that the output of each platform requires manual validation of the OTUs by examining the taxonomy assignment values.

Keywords: Microbial communities; amplicon sequencing; fungal biodiversity; metagenomics; microbiome; mycobiome

Year: 2018 PMID: 30271256 PMCID: PMC6160831 DOI: 10.3897/mycokeys.39.28109

Source DB: PubMed Journal: MycoKeys ISSN: 1314-4049 Impact factor: 2.984

Introduction

Fungi are major ecological and functional players in terrestrial ecosystems. The full diversity of fungi remains largely uncharted owing to their mostly unculturable nature, the lack of tangible morphological manifestations and the failure of the mycological community to sample beyond traditional habitats and substrates (Grossart et al. 2016; Hibbett et al. 2017; Lücking et al. 2018). As a result, identification of fungi has come to rely mainly on direct DNA sequencing of material containing fungal hyphae or spores. In this regard, several DNA barcoding regions have been evaluated and the current consensus is that the nuclear ribosomal internal transcribed spacer (ITS) region is the best region for delimiting fungal taxa at the species level across a variety of fungal groups (Schoch et al. 2012). Recent advances in high-throughput sequencing (HTS) have made it possible to sequence millions of reads and identify thousands of fungal taxa from a single sample. Handling this enormous amount of data is often complicated and requires extensive bioinformatics expertise. Multiple analysis platforms have been introduced to facilitate the bioinformatics treatment of HTS data. However, most of these software suites were developed for the prokaryotic 16S rRNA gene and may therefore perform poorly with other markers and other organisms, in particular with ITS sequences, owing to their length variation and non-alignability across taxonomic expanses. To accommodate this, several tailored platforms have recently been developed to specifically address fungal ITS datasets (Anslan et al. 2017; Gweon et al. 2015; Hildebrand et al. 2014; Větrovský et al. 2018). These platforms cover multiple steps of the analysis procedure, including quality control, clustering, taxonomic assignment and generation of Operational Taxonomic Unit (OTU) abundance tables. Many of these platforms cover all these analysis steps, whereas others do not.
The application of different bioinformatics workflows may introduce variation in the data quality and output OTU tables (Majaneva et al. 2015; Sinha et al. 2017). However, to date, there are no data on the relative performance of the available tools for fungal HTS data analysis. In this study, we report on the relative performance of the most popular software pipelines on two contrasting HTS datasets.

Methods

Sequence data and general workflow

We compared the performance of bioinformatics analysis platforms on two fungal ITS datasets. The tested datasets comprised Illumina MiSeq paired-end ITS2 amplicons from arthropod substrates (Anslan et al. 2018) and full ITS circular consensus sequences from a Pacific Biosciences (PacBio) Sequel machine, amplified from soil samples. The PacBio dataset is available through the PlutoF database (Abarenkov et al. 2010b; https://plutof.ut.ee/#/datacite/10.15156%2FBIO%2F781236). For bioinformatics analyses, we used multiple platforms that support all steps in the analysis of HTS-based metabarcoding datasets: QIIME2 (v2018.2; Caporaso et al. 2010), LotuS (v1.59; Hildebrand et al. 2014), Galaxy (v.2.1.1; Afgan et al. 2016), PipeCraft (v1.0; Anslan et al. 2017) and PIPITS (v2.0; Gweon et al. 2015) (Table 1; Figure 1).

Depending on the analysis platform, quality filtering was performed using either VSEARCH (Rognes et al. 2016), trimmomatic (Bolger et al. 2014), DADA2 (Callahan et al. 2016), sdm (Hildebrand et al. 2014) or fastx (http://hannonlab.cshl.edu/fastx_toolkit). Quality-filtered sequences were passed through the chimera removal algorithms implemented in USEARCH (Edgar 2013; Edgar et al. 2011) or VSEARCH. Using PipeCraft, LotuS and PIPITS, reads were also subjected to ITS extraction using ITSx (Bengtsson-Palme et al. 2013) to remove the conserved flanking genes of the ITS region. OTU formation (clustering) was performed using USEARCH or VSEARCH as outlined below (Platform specific options). For each platform, we relied on de-novo single linkage clustering, which is the most popular approach in fungal community studies, knowing that reference-based clustering methods can provide similar results (Cline et al. 2017). Taxonomic affiliations were assigned to OTUs using the RDP Naive Bayesian rRNA Classifier (RDP classifier v2.11; Wang et al. 2007) with the Warcup Fungal ITS trainset 2 (confidence threshold: 80%; Deshpande et al. 2016) as well as a BLAST+ (Camacho et al. 2009) search (e-value = 0.001, word size = 7, reward = 1, penalty = -1, gap open = 1, gap extend = 2) against the UNITE v7.2 reference database (Abarenkov et al. 2010a).
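
The BLAST+ search settings listed above can be written down as a command line. The sketch below only assembles an argument list, assuming the nucleotide program `blastn` was the one used; the file names, the database name and the tabular output format are placeholders chosen for illustration and are not values reported in the study.

```python
def build_blastn_command(query_fasta, database, output_path):
    """Assemble a blastn argument list with the search settings from the text.

    The paths, database name and output format are hypothetical placeholders.
    """
    return [
        "blastn",
        "-query", query_fasta,
        "-db", database,
        "-evalue", "0.001",   # e-value = 0.001
        "-word_size", "7",    # word size = 7
        "-reward", "1",       # reward = 1
        "-penalty", "-1",     # penalty = -1
        "-gapopen", "1",      # gap open = 1
        "-gapextend", "2",    # gap extend = 2
        "-outfmt", "6",       # tabular output (an assumption, not from the text)
        "-out", output_path,
    ]


command = build_blastn_command("otus.fasta", "unite_v7.2", "otus_vs_unite.tsv")
print(" ".join(command))
```

The list can be handed to `subprocess.run(command, check=True)` on a machine where BLAST+ is installed and a matching database has been built.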

Table 1. Software used, and sequence and OTU counts, for the a) Illumina and b) PacBio analysis platforms. The number of sequences denotes raw input reads and the reads remaining after each analysis step. Singleton OTUs were excluded from the OTU counts.

a)	LotuS	QIIME2	PipeCraft	Galaxy	PIPITS
Raw reads	7,981,812^a	7,335,838^b	7,981,812^a	7,981,812^a	7,335,838^b
Assembly	FLASH / NA	DADA2 / NA	VSEARCH / 7,511,274	FASTQ joiner / 7,911,554	VSEARCH / 7,198,094
Quality filtering	sdm / NA	DADA2 / 5,428,563	VSEARCH / 7,511,274	trimmomatic / 7,879,960	fastx / 7,142,354
Demultiplexing	sdm / 6,727,631	NP	mothur / 6,558,772	mothur / 1,643,879	NP
Chimera filtering	USEARCH / 6,486,802	NP	VSEARCH / 6,300,085	VSEARCH / 1,621,330	VSEARCH / NA
ITS extraction	5,919,084	NP	6,262,000	NP	6,401,097
Clustering (OTUs)	UPARSE / 8,659	VSEARCH / 7,477	UPARSE / 7,598	VSEARCH / 23,167	VSEARCH / 7,887

b)	LotuS	PipeCraft	Galaxy
CCS^c reads	720,222^a	720,222^a	720,222^a
Quality filtering	sdm / NA	VSEARCH / 462,010	trimmomatic / 672,292
Demultiplexing	sdm / 258,085	mothur / 380,722	mothur / 457,173
Chimera filtering	USEARCH / 255,746	VSEARCH / 341,154	VSEARCH / 405,025
ITS extraction	192,485	338,150	NP
Clustering (OTUs)	UPARSE / 942	UPARSE / 4,176	VSEARCH / 8,338

^a multiplexed input data; ^b demultiplexed input data; ^c circular consensus sequences; NA: not available; NP: not performed.

Figure 1.

Outline of the workflow in the different analysis pipelines.

Platform specific options

Using QIIME2, reads were assembled (Illumina data) and quality filtered using DADA2 (Callahan et al. 2016) with default options, except --p-trunc-len = 0, --p-max-ee = 1 and --p-chimera-method = none (with chimera-method = consensus, QIIME2 reported an error for our data). Clustering was performed with VSEARCH cluster-features-de-novo (--p-perc-identity 0.97).

In the LotuS pipeline, data were assembled (Illumina data), quality filtered (minimum length = 170, minAvgQuality = 27, TruncateSequenceLength = 170, maxAccumulatedError = 0.75) and demultiplexed with sdm (pdiffs = 1, bdiffs = 1). Chimera filtering was undertaken using USEARCH de novo chimera filtering (abundance annotation = 0.97, abskew = 2) and USEARCH reference-based chimera filtering using UNITE v7.2 as the reference database. Flanking genes of the ITS region were discarded using ITSx (v1.0.11; default options). ITS reads were clustered into OTUs with the USEARCH/UPARSE algorithm (-id = 3, -minsize = 2).

Using the web-based Galaxy pipeline, Illumina data were assembled with Fastq joiner (Galaxy Version 2.0.1.1; Blankenberg et al. 2010) with default options. Quality filtering was performed with Trimmomatic (Galaxy Version 0.36.3) with the following settings: SLIDINGWINDOW (number of bases to average across = 15, average quality required = 30) and minimum length of kept reads = 45. FASTQ files were converted to FASTA files using FASTQ to FASTA converter (Galaxy Version 1.0.0). FASTA files were demultiplexed using mothur (Galaxy Version 1.39.5.0; Schloss et al. 2009) with pdiffs = 2 and bdiffs = 1. As sequences were of mixed orientation in the files (5’-3’ and 3’-5’), the demultiplexing step was repeated for reverse-orientated sequences (reads were reversed using mothur reverse.seqs). Chimera filtering was undertaken using VSEARCH chimera detection (Galaxy Version 1.9.7.0) with default settings (abundance annotation = 97% similarity threshold) and using the UNITE v7.2 database as reference. Clustering was performed using VSEARCH (--cluster-fast, --id 0.97, --iddef 1).

In the PipeCraft platform, reads were assembled (Illumina data) and quality filtered using VSEARCH (minimum overlap = 15, minimum length = 100, E max = 1, max ambiguous = 0, allowstagger = T). Demultiplexing was undertaken using mothur (pdiffs = 2, bdiffs = 1). In this step, sequences are also re-orientated into the 5’-3’ orientation based on primers (2 mismatches allowed). Chimeric sequences were removed using VSEARCH de novo (abundance annotation = 0.97, abskew = 2) and reference-based (UNITE v7.2 as reference) chimera filtering algorithms. In the chimera filtering step, the PipeCraft-supported option for “primer artefact” removal was also used (sequences in which primer strings were found in the middle of the sequence were removed). ITS reads were extracted using ITSx (default options). Clustering was performed using the USEARCH/UPARSE algorithm (id = 3, minsize = 2).

Using PIPITS, sequences were assembled with VSEARCH and quality filtering was undertaken with fastx through the PIPITS command pispino_createreadpairslist. ITSx was executed through the PIPITS command pipits_funits. Chimera filtering and clustering were undertaken using VSEARCH through the PIPITS command pipits_process.
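
Several of the settings above (--p-max-ee = 1 in DADA2, E max = 1 in VSEARCH) limit the number of expected errors per read. As a worked illustration of that quantity, the sketch below sums the per-base error probabilities implied by Phred quality scores, P = 10^(-Q/10). It assumes Phred+33 encoded quality strings and is not the code used by either tool; the function names are hypothetical.

```python
def expected_errors(quality_string, phred_offset=33):
    """Sum of per-base error probabilities implied by the Phred scores."""
    total = 0.0
    for char in quality_string:
        q = ord(char) - phred_offset   # Phred score of this base
        total += 10 ** (-q / 10)       # probability that the base call is wrong
    return total


def passes_max_ee(quality_string, max_ee=1.0):
    """Keep a read only if its expected error count does not exceed max_ee."""
    return expected_errors(quality_string) <= max_ee


high_quality = "?" * 100   # '?' encodes Q30, i.e. P = 0.001 per base
low_quality = "+" * 100    # '+' encodes Q10, i.e. P = 0.1 per base
print(passes_max_ee(high_quality), passes_max_ee(low_quality))
```

A 100-base read at Q30 carries about 0.1 expected errors and is kept, whereas the same length at Q10 carries about 10 and is discarded.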

Additional filtering

The additional manual OTU table filtering was based on the BLAST similarity scores obtained against the UNITE (v7.2) reference database. Any OTUs that had no BLAST hit or that were not classified to the kingdom Fungi were discarded from the OTU table. The remaining OTUs were filtered based on BLAST e-value and query coverage: OTUs with an e-value higher than 1e-25 and query coverage below 70% were excluded from the dataset (as putative artefacts or non-fungal OTUs). Additionally, OTUs with low numbers of sequences per sample were removed (fewer than 10 sequences per sample; Brown et al. 2015). Finally, the LULU algorithm (Frøslev et al. 2017) was applied (minimum_ratio_type = “min”, minimum_match = 97) to merge consistently co-occurring ‘daughter’ OTUs.
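
The filtering rules in this section can be expressed as a small function. This is a minimal sketch under stated assumptions rather than the procedure actually run: it assumes the target kingdom is Fungi, treats failure of either the e-value or the coverage threshold as grounds for exclusion (the text could also be read as requiring both), and interprets the low-abundance rule as zeroing per-sample counts below 10. The `BlastHit` record and all names are hypothetical, and the LULU step is not covered.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BlastHit:
    kingdom: Optional[str]   # kingdom of the best hit, if any
    evalue: float
    query_coverage: float    # percent of the query covered by the hit


def keep_otu(hit, max_evalue=1e-25, min_coverage=70.0):
    """Apply the hit-based rules: needs a fungal hit within both thresholds."""
    if hit is None or hit.kingdom != "Fungi":
        return False
    return hit.evalue <= max_evalue and hit.query_coverage >= min_coverage


def filter_otu_table(table, hits, min_reads=10):
    """table maps OTU -> {sample: count}; hits maps OTU -> BlastHit."""
    filtered = {}
    for otu, counts in table.items():
        if not keep_otu(hits.get(otu)):
            continue
        # Zero out per-sample counts below the minimum (one reading of the rule).
        cleaned = {s: (n if n >= min_reads else 0) for s, n in counts.items()}
        if any(cleaned.values()):
            filtered[otu] = cleaned
    return filtered


table = {
    "OTU1": {"S1": 120, "S2": 4},
    "OTU2": {"S1": 50, "S2": 30},
    "OTU3": {"S1": 8, "S2": 3},
    "OTU4": {"S1": 40, "S2": 0},
}
hits = {
    "OTU1": BlastHit("Fungi", 1e-80, 98.0),
    "OTU2": BlastHit("Metazoa", 1e-60, 95.0),
    "OTU3": BlastHit("Fungi", 1e-90, 100.0),
}
filtered = filter_otu_table(table, hits)
print(filtered)
```

In this toy table only OTU1 survives: OTU2 matches a non-fungal kingdom, OTU3 never reaches 10 reads in any sample, and OTU4 has no BLAST hit.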

Data pooling

To detect the effect of analysis platform choice on the OTU composition, we pooled sequences originating from different platforms and applied a common clustering method to generate a single OTU table. For the Illumina data, filtered reads from PipeCraft, LotuS and PIPITS were pooled and clustered using CD-HIT (Fu et al. 2012) at 97% sequence similarity (Table 1). The pooled PacBio dataset included filtered sequences from the LotuS, PipeCraft and Galaxy platforms; clustering was performed using the UPARSE algorithm with a 97% sequence similarity threshold (Table 1).

Statistical analysis

We used PERMANOVA analysis (Anderson and Walsh 2013; Type III SS, 4,999 permutations) on Bray-Curtis distances of Hellinger-transformed OTU matrices, using PRIMER6 (Clarke and Gorley 2006). Outliers were screened and removed using analysis of non-metric multidimensional scaling (NMDS). The numbers of sequences per sample were included in the analysis as covariates. Rarefaction curves were generated based on OTU abundance matrices for each dataset using the RTK package (Saary et al. 2017) of R (R-Core-Team 2015).
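
To make the distance measure concrete, the sketch below computes a Hellinger transformation (square root of relative abundance) followed by the Bray-Curtis dissimilarity for pairs of samples. It is a plain-Python illustration with made-up counts; the analyses reported here were run in PRIMER6, and the permutation test itself is not reproduced.

```python
import math


def hellinger_transform(counts):
    """Square root of each OTU's relative abundance within one sample."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [math.sqrt(c / total) for c in counts]


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / sum(a_i + b_i)."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(x + y for x, y in zip(a, b))
    return numerator / denominator if denominator else 0.0


# Made-up OTU counts for three samples (same OTU order in each list).
sample_a = hellinger_transform([90, 10, 0])
sample_b = hellinger_transform([45, 5, 0])    # same proportions as sample_a
sample_c = hellinger_transform([0, 0, 100])   # shares no OTU with sample_a

print(bray_curtis(sample_a, sample_b))
print(bray_curtis(sample_a, sample_c))
```

Because the transformation works on proportions, two samples with identical composition but different sequencing depth end up at distance zero, while samples sharing no OTUs reach the maximum of one.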

Results and discussion

Properties of bioinformatics analysis platforms

All tested bioinformatics platforms offer straightforward installation. While Galaxy provides a freely available online platform, the benefits of PipeCraft and QIIME2 include easy-to-use graphical user interfaces and multiple options for data analysis. These platforms bundle many tools for diverse tasks. LotuS and PIPITS are command-line based platforms. PIPITS offers a limited number of tools, but data analysis is easily performed with a straightforward pipeline. LotuS has been developed to minimise computational time and memory requirements. Specifically, for accuracy of ITS-based analyses of fungi and other eukaryotes, PipeCraft, LotuS and PIPITS implement the ITSx tool (Bengtsson-Palme et al. 2013), which removes the fragments of conserved flanking genes for precise clustering purposes. There is no such option in QIIME2 and Galaxy. Bioinformatics platforms differ in their requirements for the input data, the options being a raw multiplexed file (a single file containing all sequences from one run) and multiple demultiplexed files (reads split into separate files based on indexes). PipeCraft and Galaxy use raw multiplexed data, whereas QIIME2 and PIPITS require demultiplexed files. Only LotuS accepts both multiplexed and demultiplexed files as input. As the raw data files are multiplexed by default, the QIIME2 and PIPITS platforms required additional analysis steps outside these tools to meet the input requirements. Using a Python script, we demultiplexed the raw Illumina data, allowing 2 and 1 mismatches to primer and index strings, respectively. However, PacBio data analysis was dropped for QIIME2 and PIPITS, as the present versions of these platforms are limited to the analysis of short-read (Illumina) data.
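
The demultiplexing script itself is not shown in this article. As a rough illustration of the idea, the sketch below assigns a read to a sample when its leading index matches with at most 1 mismatch and the following primer matches with at most 2 mismatches. The read layout (index directly followed by primer), the simple Hamming comparison without degenerate bases, and the toy index and primer strings are all assumptions.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(1 for x, y in zip(a, b) if x != y)


def assign_sample(read, index_to_sample, primer, max_index_mm=1, max_primer_mm=2):
    """Return (sample, trimmed_read), or None when no index/primer pair matches.

    The first matching index wins, so indexes are assumed to differ enough
    that at most one of them can match a given read.
    """
    for index, sample in index_to_sample.items():
        primer_start = len(index)
        primer_end = primer_start + len(primer)
        if len(read) < primer_end:
            continue  # read too short to hold this index plus the primer
        index_ok = hamming(read[:primer_start], index) <= max_index_mm
        primer_ok = hamming(read[primer_start:primer_end], primer) <= max_primer_mm
        if index_ok and primer_ok:
            return sample, read[primer_end:]
    return None


# Toy data: hypothetical index and primer strings, not those of the study.
indexes = {"ACGT": "sample_A", "TGCA": "sample_B"}
primer = "GATTACA"
reads = ["ACGAGATTACATTTGGG", "TGCAGCTTGCACCC", "GGGGGATTACAAAA"]
for read in reads:
    print(read, assign_sample(read, indexes, primer))
```

The first read has one index mismatch and the second has two primer mismatches, so both are assigned; the third matches neither index closely enough and is left unassigned.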

Performance of bioinformatics platforms on sequence data

For both the Illumina and PacBio datasets, the final OTU richness (singleton OTUs excluded) differed considerably amongst the tested workflows (Table 1). We found that pipelines that produced roughly comparable numbers of total OTUs (QIIME2, PipeCraft, PIPITS and LotuS for Illumina data) still exhibited large variations in OTU richness per sample (Figures 2 and 3). By performing joint de-novo clustering of filtered sequences from different pipelines (total number of OTUs = 16,333), we observed a weak but significant effect of pipeline choice on overall OTU composition for the Illumina dataset (PERMANOVA: pseudo-F(2, 868) = 5.88, adjusted R² = 0.012, P < 0.001). For the PacBio dataset (total number of OTUs = 4,448), differences amongst platforms were slightly stronger (pseudo-F(2, 512) = 9.174, adjusted R² = 0.033, P < 0.001).

Figure 2.

OTU accumulation curves of the evaluated pipelines for a) PacBio and b) Illumina datasets.

Figure 3.

Number of OTUs per sample for Illumina data recorded from a) pipeline-generated OTU tables (median differences = 38 OTUs) and from b) filtered OTU tables (median differences = 12 OTUs). The Galaxy workflow was excluded here.

Taxonomic annotation tools differed in their ability to classify OTUs. In general, BLAST searches revealed many cases of high-quality matches to non-fungal organisms (in some cases for hundreds of OTUs), whereas RDP combined with the Warcup Fungal ITS trainset optimistically classified all OTUs to Fungi (100% confidence). Numerous papers have evaluated the accuracy of different taxonomic assignment methods, and performance inevitably hinges on the completeness of the reference database used (e.g. Gdanetz et al. 2017; Richardson et al. 2017). In spite of its relatively rapid performance, the RDP Fungal ITS trainset does not include any non-fungal data, which explains its shortcomings in detecting non-fungal OTUs. However, the confidence score of the RDP classifier did not exceed 64% for non-fungal OTUs, which mostly inflated the group of unclassified fungi. We also observed that the quality-filtered datasets included up to ~10% of obviously erroneous or chimeric OTUs that produced matches with low query coverage and confidence scores. A long tail of satellite OTUs, assigned to a single species hypothesis with 99–100% BLAST identity and RDP classifier confidence level, was also common, especially in the results where a relatively high number of OTUs was observed (Galaxy platform). After filtering the spurious OTUs manually (see Methods), we found that richness estimates per sample became more homogeneous across pipelines (Illumina data: Figure 3). When OTU table filtering was applied to jointly clustered reads from different pipelines, the significant effect of pipeline choice on the community composition disappeared (Illumina data: pseudo-F(2, 837) = 0.955, adjusted R² = 0.007, P = 0.779).

In conclusion, our results indicate that bioinformatics analysis pipelines differ greatly in their relative performance on ITS datasets targeting fungi, even though roughly similar quality-orientated settings were implemented. Overall, our recommended workflow for Illumina data would be PipeCraft, PIPITS or LotuS, which provide a good balance between speed, mycological accuracy (including support for ITS Extractor) and technical quality. For PacBio, the tools implemented in PipeCraft were the most suitable for long-read analysis. Conversely, Galaxy, a platform widely used in prokaryote 16S-based studies, performed relatively poorly on the ITS data with the options we chose. While QIIME2 implements the accurate quality filtering algorithm of DADA2, the lack of ITS region extraction lowers its accuracy for mycological studies. Of the classification tools, BLAST searches against the UNITE database provided more accurate results at the kingdom and phylum levels than the RDP classifier combined with the Warcup ITS trainset. We emphasise that none of the tested bioinformatics workflows is able to fully filter out the errors that accumulated during sample preparation and sequencing, even when using the most elaborate error-filtering options. Therefore, manual curation of OTU tables continues to be an important step in obtaining robust datasets, although semi-automatic tools to assist evaluation are becoming available (Frøslev et al. 2017). It is also important to rely on high-coverage reference databases to be able to recognise non-target organisms and metagenomic reads.

References

1. UPARSE: highly accurate OTU sequences from microbial amplicon reads.

Authors: Robert C Edgar
Journal: Nat Methods Date: 2013-08-18 Impact factor: 28.547

2. PipeCraft: Flexible open-source toolkit for bioinformatics analysis of custom high-throughput amplicon sequencing data.

Authors: Sten Anslan; Mohammad Bahram; Indrek Hiiesalu; Leho Tedersoo
Journal: Mol Ecol Resour Date: 2017-06-21 Impact factor: 7.090

3. Fungal identification using a Bayesian classifier and the Warcup training set of internal transcribed spacer sequences.

Authors: Vinita Deshpande; Qiong Wang; Paul Greenfield; Michael Charleston; Andrea Porras-Alfaro; Cheryl R Kuske; James R Cole; David J Midgley; Nai Tran-Dinh
Journal: Mycologia Date: 2015-11-09 Impact factor: 2.696

4. DADA2: High-resolution sample inference from Illumina amplicon data.

Authors: Benjamin J Callahan; Paul J McMurdie; Michael J Rosen; Andrew W Han; Amy Jo A Johnson; Susan P Holmes
Journal: Nat Methods Date: 2016-05-23 Impact factor: 28.547

5. UCHIME improves sensitivity and speed of chimera detection.

Authors: Robert C Edgar; Brian J Haas; Jose C Clemente; Christopher Quince; Rob Knight
Journal: Bioinformatics Date: 2011-06-23 Impact factor: 6.937

6. Bioinformatic Amplicon Read Processing Strategies Strongly Affect Eukaryotic Diversity and the Taxonomic Composition of Communities.

Authors: Markus Majaneva; Kirsi Hyytiäinen; Sirkka Liisa Varvio; Satoshi Nagai; Jaanika Blomster
Journal: PLoS One Date: 2015-06-05 Impact factor: 3.240

7. RTK: efficient rarefaction analysis of large datasets.

Authors: Paul Saary; Kristoffer Forslund; Peer Bork; Falk Hildebrand
Journal: Bioinformatics Date: 2017-08-15 Impact factor: 6.937

8. CD-HIT: accelerated for clustering the next-generation sequencing data.

Authors: Limin Fu; Beifang Niu; Zhengwei Zhu; Sitao Wu; Weizhong Li
Journal: Bioinformatics Date: 2012-10-11 Impact factor: 6.937

9. LotuS: an efficient and user-friendly OTU processing pipeline.

Authors: Falk Hildebrand; Raul Tadeo; Anita Yvonne Voigt; Peer Bork; Jeroen Raes
Journal: Microbiome Date: 2014-09-30 Impact factor: 14.650

10. PIPITS: an automated pipeline for analyses of fungal internal transcribed spacer sequences from the Illumina sequencing platform.

Authors: Hyun S Gweon; Anna Oliver; Joanne Taylor; Tim Booth; Melanie Gibbs; Daniel S Read; Robert I Griffiths; Karsten Schonrogge
Journal: Methods Ecol Evol Date: 2015-05-25 Impact factor: 7.781
