Literature DB >> 34025601

HCK and ABAA: A Newly Designed Pipeline to Improve Fungi Metabarcoding Analysis.

Kodjovi D Mlaga¹, Alban Mathieu^1,2, Charles Joly Beauparlant^1,2, Alban Ott³, Ahmad Khodr³, Olivier Perin³, Arnaud Droit^1,2.

Abstract

INTRODUCTION: The fungi ITS sequence length dissimilarity, non-specific amplicons, including chimaera formed during Polymerase Chain Reaction (PCR), added to sequencing errors, create bias during similarity clustering and abundance estimation in the downstream analysis. To overcome these challenges, we present a novel approach, Hierarchical Clustering with Kraken (HCK), to classify ITS1 amplicons and Abundance-Base Alternative Approach (ABAA) pipeline to detect and filter non-specific amplicons in fungi metabarcoding sequencing datasets.
MATERIALS AND METHODS: We compared the performances of both pipelines against QIIME, KRAKEN, and DADA2 using publicly available fungi ITS mock community datasets and using BLASTn as a reference. We calculated the Precision, Recall, F-score using the True-Positive, False-positive, and False-negative estimation. Alpha diversity (Chao1 and Shannon metrics) was also used to evaluate the diversity estimation of our method.
RESULTS: The analysis shows that ABAA reduced the number of false-positive with all metabarcoding methods tested, and HCK increases precision and recall. HCK, coupled with ABAA, improves the F-score and bring alpha diversity metric value close to that of the BLASTn alpha diversity values when compared to QIIME, KRAKEN, and DADA2.
CONCLUSION: The developed HCK-ABAA approach allows better identification of the fungi community structures while avoiding use of a reference database for non-specific amplicons filtration. It results in a more robust and stable methodology over time. The software can be downloaded on the following link: https://bitbucket.org/GottySG36/hck/src/master/.

Entities: Chemical

Keywords: ABAA; F-score; HCK; ITS amplicons; benchmarking; fungi; hierarchical clustering

Year: 2021 PMID： 34025601 PMCID： PMC8134036 DOI： 10.3389/fmicb.2021.640693

Source DB: PubMed Journal: Front Microbiol ISSN： 1664-302X Impact factor: 5.640

Introduction

The mycobiome concept was first introduced in 2010 to designate the fungal community of the human oral cavity (Tang et al., 2015) before being extended to other micro-environments. Three genomic markers are widely used to identify fungal species in a microbial environment: 18S ribosomal gene (Wu et al., 2015), 28S ribosomal gene (Ninet et al., 2003), and the Internal Transcribed Spacers (ITS) (Martin and Rygiewicz, 2005; Bellemain et al., 2010). The most commonly used is the ITS amplicon (Fujita et al., 2001) which targets two loci: ITS1, located between the 18S and 5.8S genes, and ITS2, between 5.8S and 28S (Bellemain et al., 2010). ITS1 has been demonstrated to yield the best performance (Bazzicalupo et al., 2013; Wang et al., 2015). Several packages have been developed to automate the process, and most of them are OTU (Operational Taxonomic Unit) sequence similarity-based pipeline (Schloss et al., 2009; Gweon et al., 2015; Rognes et al., 2016; Mysara et al., 2017; Bolyen et al., 2018). To date, the research communities are gradually moving to the new concept of ASVs (Amplicons sequence Variants) or Exact Sequences Variants (ESVs) (Callahan et al., 2017). With these pipelines, the taxonomy delineates based on the single nucleotides’ variant of amplicons, assuming that amplicons sequences have a similar length which is not the case with fungi ITS sequences. To date, several pipelines have been developed to classify fungal species using ITS sequencing. These include Plutof (Abarenkov et al., 2010b), Clotu (Kumar et al., 2011), PIPITS (Gweon et al., 2015), CloVR-ITS (White et al., 2013), and BioMaS (Fosso et al., 2015) specially designed to analyse fungi ITS datasets, Kraken (Wood and Salzberg, 2014), Mothur (Schloss et al., 2009) Qiime (Caporaso et al., 2010; Bolyen et al., 2018), Vsearch (Rognes et al., 2016), and DADA2 (Callahan et al., 2016) among many others, to examine both bacterial 16S rRNA and fungal ITS amplicons. The size of fungal ITS sequences is highly variable, and species can differ widely by the number of loci (Tang et al., 2015; Khodadadi et al., 2017). The sequence length dissimilarity creates bias during clustering and affects OTUs abundance estimation. Moreover, besides biologically valid amplicons, PCR generates many non-specific fragments resulting from elongation interruption or two or more incomplete amplicons joining (chimaeras) (Lahr and Katz, 2009; Edgar, 2016; Bjørnsgaard Aas et al., 2017). These non-specific amplicons are hybrid products between multiple parent sequences that can be falsely interpreted as existing or novel species, thus significantly affect the diversities, including the alpha and beta diversity metrics (Zajec et al., 2012). Hence, non-specific amplicons formed during amplification with two incomplete segments (bimeras) are generally at a lower proportion. However, chimaeras with more than two fragments (multimers) may form at comparable rates and account for a significant fraction in an amplified sample (Lahr and Katz, 2009). The most commonly used pipeline to detect chimaeras is UCHIME, composed of reference-based and de novo approaches (Edgar, 2016). The reference-based approach detects non-specific amplicons in a dataset by making a model from a concatenated pair of sub-sequences in a reference database. Chimaeras are detected if the query alignment sequence score of the model exceeds a threshold. UCHIME depends on a reference database, and ITS sequence size variation can be a significant source of false-positive detection, throwing away biologically valid sequences. DADA2 implements isBimeraDenovo() function that identifies exact bimeras or multimeras sequences. Child sequences that differ by a single mismatch from the chimeric model are flagged if the left parent and right parent are at least four nucleotides away from the child sequence (Callahan et al., 2016). The challenge is that databases are rarely updated, and the similarity search can be time-consuming, especially when databases are large. Computational resources are one of the critical limitations. Maintaining specific databases up to date is a real challenge, and a broad range of databases suffer from contamination and unannotated sequences. The available databases, such as UNITE, which is commonly used, presents 26% of entries that cannot be consistently assigned to a taxonomic family (Nilsson et al., 2008; Kõljalg et al., 2013). These tools are mainly developed for 16S/18S markers but widely applied to fungal ITS amplicons. Besides, these tools have been optimised using simulated datasets and not real datasets (Bjørnsgaard Aas et al., 2017). To overcome the above limitations, we present a novel classification approach for ITS amplicon’s taxonomy assignment. This approach consists of two steps: The amplicons Abundance-Base Alternative Approach (ABAA), a de novo method to filter non-specific amplicons from sequence datasets and a Hierarchical Clustering with Kraken (HCK) to classify ITS amplicons. We built HCK on a hierarchical clustering approach with multiple-step iterating runs. Each cluster’s representative sequences are taxonomically assigned using Kraken with the exact alignment of k-mers using fungal ITS loci sequence database (ITSdb). In this study, we use comparative analysis approach to assess the performance of ABAA and HCK. We calculated the Precision, the Recall, and the F-score using the True-Positive, False-positive, and False-negative estimation. Alpha diversity (Chao1 and Shannon) was also used to evaluate the methods’ diversity estimation. Chao1 is based on the concept that rare species allow inferring the number of missing species. As the Chao1 richness estimator gives more weight to the low abundance species while the Shannon index measures the richness and the evenness (Kim et al., 2017), making the Chao1 metric more sensitive to abundance estimation than Shannon’s. Henceforth, to simplify the manuscript, chimaeras and non-specific amplicons will interchangeably be used to designate all non-specific amplicons, including chimaeras, incomplete amplicons and sequencing errors.

Materials and Methods

The methodology in this study is organised in two parts. In the first part, we will describe ABAA and HCK workflow using publicly available ITS mock community datasets. We will then, in a second part, compare the performance of HCK-ABAA to that of QIIME, DADA2 and KRAKEN using BLASTn search abundance estimation as a reference.

Fungi ITS Mock Communities’ Datasets

We downloaded Biological mock community datasets of three different projects from the SRA NCBI database. The three projects were conducted using the Illumina Miseq sequencing technology. The first project, available under accession number (McTaggart et al., 2019), contains six different samples (), which were prepared from subsets of 53 species of fungi with an emphasis on human lung pathogens. The second project, available under accession number (Hoggard et al., 2018), contains three samples (, including specific fungal species from different human body location or organs (lung, oral cavity, gastrointestinal tract, and skin). The third project, available under accession number , contains two samples ( that include 16 species of fungi. Overall, the mock communities contain 36 fungi genera which are: Alternaria, Apophysomyces, Aspergillus, Blastomyces, Candida, Cladosporium, Clavispora, Coccidioides, Cryptococcus, Cunninghamella, Exophiala, Fusarium, Histoplasma, Lichtheimia, Malassezia, Meyerozyma, Mucor, Paecilomyces, Penicillium, Phanerochaete, Pichia, Purpureocillium, Rasamsonia, Rhizopus, Saccharomyces, Sarocladium, Scedosporium, Schizosaccharomyces, Sporidiobolus, Sporothrix, Talaromyces, Trichoderma, Trichosporon, Wickerhamomyces, Sclerotina, Rhyzomucor, Trichophyton detailed in Table 1.

TABLE 1

Absolute count of reads affiliated of each genus among the different datasets of the mock community (determined by BLASTn against NT database).

Taxa	SRR 5439721	SRR 5439722	SRR 6702280	SRR 6702281	SRR 6702283	SRR 8473974	SRR 8473977	SRR 8473978	SRR 8473979	SRR 8473980	SRR 8473984
Alternaria	0	0	3,941	3,334	5,462	39,009	0	0	0	0	0
Apophysomyces	0	0	0	0	0	0	0	0	0	0	191
Aspergillus	0	1,381	924	1,029	1,120	315,986	5,991	5,096	5,657	3,613	2,362
Blastomyces	0	0	0	0	0	0	379	230	256	183	0
Candida	150,704	108,934	33,556	28,123	39,320	125,949	0	0	0	0	0
Cladosporium	0	0	51	66	76	22,476	0	0	0	0	0
Clavispora	0	0	0	0	0	1,645	0	0	0	0	0
Coccidioides	0	0	0	0	0	0	502	649	698	515	0
Cryptococcus	1,087	33,983	12	9	16	16	555	568	825	665	131
Cunninghamella	0	0	0	0	0	0	0	0	0	0	13
Exophiala	0	0	3,340	2,791	4,140	0	299	271	319	221	2
Fusarium	0	0	52	54	76	39,505	0	0	0	0	138
Histoplasma	0	0	0	0	0	0	152	112	138	87	0
Lichtheimia	0	0	0	0	0	0	0	0	0	0	45
Malassezia	0	0	50	67	67	0	0	0	0	0	0
Meyerozyma	0	0	0	0	0	15,220	0	0	0	0	0
Mucor	0	0	0	0	0	11,935	109	96	109	80	438
Paecilomyces	0	0	0	0	0	0	0	0	0	0	80
Penicillium	0	0	32,037	29,551	39,368	27,888	1,645	1,461	1,593	854	0
Phanerochaete	32,650	20,763	0	0	0	0	0	0	0	0	0
Pichia	22,965	69,159	0	0	0	95,622	0	0	0	0	0
Purpureocillium	0	0	0	0	0	0	120	120	128	96	0
Rasamsonia	0	0	2	0	0	292	0	0	0	0	128
Rhizopus	0	0	0	0	0	57	2,283	2,354	3,447	2,584	109
Saccharomyces	57,107	46,582	1,737	1,863	1,942	12,654	0	0	0	0	0
Sarocladium	0	0	0	0	0	0	257	269	336	221	0
Scedosporium	0	0	0	0	0	3	62	75	86	54	36
Schizosaccharomyces	36,413	23,880	0	2	0	7	0	0	0	0	0
Sporidiobolus	0	0	0	0	0	17,476	0	0	0	0	0
Sporothrix	0	0	0	0	0	2	0	0	0	0	1
Talaromyces	0	0	0	0	0	1,336	435	382	403	160	6
Trichoderma	29,027	18,431	0	0	0	0	0	0	0	0	0
Trichosporon	0	0	122	77	185	88,790	2,305	2,080	2,221	1,642	0
Wickerhamomyces	2,208	1,235	0	0	0	10	0	0	0	0	0
Sclerotina	25,545	12,058	0	0	0	0	0	0	0	0	0
Rhyzomucor	0	0	0	0	0	0	0	0	0	0	0
Trichophyton	0	0	7,142	6,084	9,087	0	0	0	0	0	0

Absolute count of reads affiliated of each genus among the different datasets of the mock community (determined by BLASTn against NT database).

Fungi ITS Analysis Workflow With HCK-ABAA

Data Pre-processing and Quality Check

The sequence reads are trimmed with paired-end mode using Trimmomatic (Bolger et al., 2014) to remove residual adapters. The default parameters are used, including “phred33” to encode the quality part of the Fastq file to base 33, the low-quality bases from the sequence beginning and the end is set to 3 bases, respectively. The sliding window size was set to 4 with a minimum length of 50 bases. The paired reads generated from the trimming are then joined into contigs to produce the final fasta file using Pandaseq (Masella et al., 2012) with default parameters. Sequences with ambiguous bases are removed.

Non-specific Amplicons Filtering: ABAA

We empirically consider that amplicons with length-frequency below the standard deviation overall distribution to originate from non-specific amplification. Technically, after determining the amplicons’ length distribution and their frequency within each sample, an amplicon is considered to be non-specific if its length-frequency is below a certain threshold. This threshold corresponds to the standard deviation of the frequency of the amplicon lengths. ABAA filtering corresponds to step 1 of the whole pipeline.

Hierarchical Clustering With Kraken Assignment (HCK)

Amplicons Hierarchical Clustering

Amplicon hierarchical clustering corresponds to step 2 of the whole pipeline. HCK clusters amplicons sequences using multiple-step iterated runs of sequence alignments with a neighbour-joining algorithm implemented in CD-HIT version 4.5.4 (Fu et al., 2012). A segment sliding window in this context or “word” is defined as the consecutive position of a certain number of nucleotides in a sequence fragment. We implemented three iterative runs in the clustering and set the sequence identities (c) to 0.99, 0.98, and 0.97, as well as the “word” size (n) to 10, 8, and 7 bps, respectively. It is possible to control the sequence length difference cut-off(s), the alignment coverage of the more extended sequence (aL), and the alignment coverage for the shorter sequence (aS). The most crucial parameter is the length difference cut-off(s) depending on the overall distribution of the amplicon’s size. It can be empirically estimated by dividing the average size by the size of the most extended amplicon. This value was set to 90% in the study. The iterated clusters generated are then merged into one single, no redundant cluster file and sorted by size to remove singleton amplicons. An intermediary step 3 is essential to retrieve representative sequences from each cluster and be classified using Kraken (Figure 1A).

FIGURE 1

HCK workflow diagram. (A) Hierarchical clustering with three iterations. Chimaeras free sequences are results of pipeline step 1, including raw reads trimming, merging (forward and reverse reads) and ABAA filtering. Sequences are combined into one single fasta file and clustered using a hierarchical clustering approach in step 2. All clusters are then merged into one single non-redundant clusters and got rid of singletons sequences. HCK retrieves representative sequences from each cluster for amplicons’ classification, the second part of the pipeline (step 3). (B) Classification uses Lowest Common Ancestor (LCA) taxonomical assignment implemented in Kraken to classify representative sequences and taxonomy reported to each cluster, and a final BIOM file can be generated for downstream analysis (Steps 4 and 5).

ITS Loci RefSeq

We downloaded the fungal Internal Transcribed Spacer RNA (ITS) RefSeq Targeted Loci (ITSdb) containing 11,252 entries. We retrieved the corresponding taxonomy profile from the NCBI taxonomy database[1] and created a Qiime-compatible taxonomy file. Both files (fasta and taxonomy file) were sorted and cleaned to have similar entries, using the following utilities[2]. ITSdb was used to generate a kraken database following the procedure available at this web address: http://ccb.jhu.edu/software/kraken/MANUAL.html.

Taxonomical Classification

Each cluster’s representative sequences are classified using the Lowest Common Ancestor (LCA) algorithm with Kraken version 1 (Wood and Salzberg, 2014). The taxonomy assignment is then extended to other amplicons of the respective clusters for a complete classification. This step corresponds to step 4 of the HCK workflow. The command uses the sample metadata information to generate a BIOM file. The final stage, step 5, uses the BIOM file to estimate the diversity abundance and further metric calculation analysis (Figure 1B).

Benchmark Analysis and Performances Evaluation

BLASTn (Reference)

We determined the actual reference diversity and abundance with BLASTn sequence similarity search against the NCBI NT database. Consensus classification was determined for coverage ≥98%, identity ≥97%, and e-value ≤ 0.00001 with a maximum of 100 hits retained per entry. The BLASTn output was then filtered for the best hits successively by the e-value, coverage percentage, and identity percentage. The final consensual taxonomy classification for each amplicon is kept based on a minimum number of 80 identical taxid out of 100 for each query (80% of the total hits) to generate an abundance table following a procedure described by other authors (Blaalid et al., 2013; McTaggart et al., 2019).

Comparative Analysis

To evaluate the efficacy of the newly developed tools, we compared the absolute count diversity of HCK to Qiime v1.9 (Caporaso et al., 2010), Kraken (Wood and Salzberg, 2014), and DADA2 (version 1.8) (Callahan et al., 2016) with and without non-specific sequences/chimaera removal using BLASTn abundance estimation as reference. We test HCK, Kraken, Qiime with ITSdb, Qiime with UNITE (Abarenkov et al., 2010a) and ITSdb database. DADA2 is tested only with the native UNITE database. The performance of each method is determined by its ability to assign the suitable taxa to the right sequence and to be able to assign the maximum of good sequences using sensitivity (recall), the positive predictive value (precision), and the f-score metric calculation (Figure 2). We determined True positive (TP) as following: For x, the abundance estimated by the BLAST (reference) and y, the abundance estimated by the tested methods for given sample i, we determined true positives by TP. The overestimated abundance classified by the tested method is considered false positive, and the underestimation differences are included in the false negatives. The false negatives (FN) are determined by the sum of counts of amplicon only detected by BLAST but are not correctly assigned by the assessed method. For Tr, the total abundance estimated by BLAST for given sample i, FN. The false positive (FP) corresponds to the sum of counts of amplicons wrongly assigned by the tested method but not detected by BLAST or not included in the initial mock community composition. For Tm, the total number of amplicons classified for given taxa by the tested method, FP. We determined the precision (P) and the recall (R) and calculated the F- score using the following formula. P, F-score (Gardner et al., 2019). We also calculated alpha diversity using Shannon and chao1 indices to assess the association of chimaera removal methods and taxonomy classification in downstream diversity analysis. We compared it to the diversity of BLAST abundance estimation. We estimate the difference between the alpha diversity of the assessed methods and that of the BLASTn estimation. The lower the difference, the best is the method. All scripts and command lines are details in Supplementary Material: scripts_and_command_lines.

FIGURE 2

Benchmarking workflow: the workflow is organised in pre-processing, including Illumina sequencing reads trimming and forward and reverse reads merging. To determine the best non-specific amplicons filtering method, ABAA was tested with UCHIME (UCHIME_Ref, UCHIME_DENOVO, and isBimeraDenovo() implemented in DADA2. The classifier tested also includes HCK with ITSdb, the newly designed pipeline, QIIME with ITSdb and UNITE, Kraken with ITSdb, DADA2 with UNITE and compared to BLASTn search using NCBI NT database, as reference. The pipelines performances are evaluated using precision, recall and F-score.

Results

Fungi ITS Amplicons Length: A Vast Diversity Among Species

All 11 samples from the three projects were combined into one single dataset during the pre-processing treatment. The average read length is 200.9 bp (SD = 65.6), the maximum read size is 251 bp, and the minimum is 35 bp (Figure 3A). Fragments with a read length below 150 bp have fewer duplicated percentages than those between 240 and 250. After joining the paired reads, the average size is 233.3 bp (SD = 94.45) with a maximum of 472 bp and a minimum of 35 bp. The predominant amplicons size is 251 bps. We observed low-frequency fragments below 250 bp and above 400 bp (Figure 3B).

FIGURE 3

Chimaera detection flow using ABAA: (A) distribution of reads from all datasets; The average read length is 200.9 bp (SD = 65.6, a median of 250), the maximum reads size is 251 bp, and the minimum is 35 bp. (B) Distribution of the frequency of contigs length (assembly of forward and reverse reads), the average size is 233.3 bp (SD = 94.45) with a maximum of 472 bp and a minimum of 35 bp. The predominant amplicons size is 251 bps. (C) Distribution of contigs length-frequency by length: determination of non-specific amplicons filtering cut-off: cut-off was tested for means (blue line), standard deviation (green line), and mean + standard deviation (red line). The standard deviation was kept for better performance. (D) Distribution of sequences by the length in all datasets. Blackline represents the distribution of all sequences). Moreover, the red line represents the distribution of filtered sequences with ABAA (standard deviation).

Taxonomic Assignment Using BLAST: Mock Communities Real Abundance Estimation as a Reference

We conducted a BLASTn search against the NCBI NT database to re-estimate the absolute abundance of the expected genus. We observed discrepancies between theoretical data and BLAST results. Even though samples SRR5439721 and SRR5439722 were from the same mock preparation, Aspergillus amplicons could not be detected in SRR5439721, and Cryptococcus was undermined in sample SRR5439721. Malassezia was also undermined in SRR6702280, SRR6702281, SRR6702283. The details of the abundance table are shown in Table 1.

Benchmark and Comparative Analysis: Performance of HCK and ABAA

ABAA: Amplicons Filtering

For our analysis, amplicons length below 250 bp and above 400 bp have shown low frequency compared to those comprised between 251 and 400 bp (Figure 3D). Each peak in Figure 3C is composed of amplicons of a similar size. The enlargement of the base of the curve may correspond to the variation of the amplicon’s size. The frequency of these amplicons indicates that they could also be derived from non-specific amplification. Here we hypothesise this amplicon to be a chimaera and attempt to filter them out. The minimum sequence length detected by ABAA is 35 bp, with an average of 308 bp, higher than the overall average length (233 bp) and a maximum of 472 bp. It indicates that most chimaeras formed in this dataset may result from bimera and or multimera forming than incomplete amplification. Filtered amplicons by ABAA include amplicons below 250 bp, above 400 bp, and low amplification between 250 and 400 bp (Figure 3D). In total ABAA has detected 252,567 sequences accounting for 10.86% of overall sequences as non-specific amplicons while 528,544 (22.72%) with uchime_ref and 1,165,031 (50.08%) by DADA2 and 32 (0.0013%) detected by uchime_denovo. isBimeraDenovo() in DADA2 has filtered out up to 75.43% of sequences in sample SRR5439722. However, 24.76% were detected with UCHIME_REF, and 17.02% by ABAA on the other hand. Also, 51.74% were detected in sample SRR5439721, while 24.67% detected with UCHIME_REF and 17.23% by ABAA and UCHIME_REF seem to be more consistent than isBimeraDenovo() in DADA2 as samples SRR5439721 and SRR5439722 were from the same mock preparation (Table 2).

TABLE 2

Level of detection of chimaera removal methods.

Samples	Total sequences	Chimaera ABaa	Chimaera free Abaa*	Chimaera uchime_ref	Chimaera free uchime_ref*	Chimaera uchime_denovo	Chimaera free uchime_denovo*	Chimaera dada2	Chimaera free dada2*
SRR5439721	426,354	73,482 (17.23%)	352,872	105,167 (24.67%)	321,187	03 (00007%)	426,351	220,607 (51.74%)	205,747
SRR5439722	397,268	68,091 (17.02%)	329,177	98,367 (24.76%)	298,901	00 (0%)	397,268	299,665 (75.43%)	97,603
SRR6702280	174,185	19,187 (11.02%)	154,998	61,937 (35.56%)	112,248	11 (0.0063%)	174,174	71,129 (40.84%)	103,056
SRR6702281	148,397	17,980 (12.12%)	130,417	55,059 (37.10)	93,338	09 (0%)	148,388	63,247 (42.62%)	85,150
SRR6702283	246,368	31,616 (12.83%)	214,752	90,865 (36.88%)	155,503	08 (0.0032%)	246,360	93,543 (37.97%)	152,825
SRR8473974	865,248	40,730 (4.71%)	824,518	112,068 (12.95%)	753,180	01 (00001%)	865,247	388,352 (44.88%)	476,896
SRR8473977	17,181	379 (2.21%)	16,802	1,566 (9.11%)	15,615	00 (0%)	17,181	7,044 (41.00%)	10,137
SRR8473978	15,694	121 (0.77%)	15,573	1,363 (8.68%)	14,331	00 (0%)	15,694	6,510 (41.48%)	9,184
SRR8473979	18,394	295 (1.60%)	18,099	1,336 (7.26%)	17,058	00 (0%)	18,394	7,421 (40.34%)	10,973
SRR8473980	12,730	676 (5.31%)	12,054	387 (3.04%)	12,343	00 (0%)	12,730	4,898 (38.48%)	7,832
SRR8473984	4,420	10 (0.23%)	4,410	429 (9.71%)	3,991	00 (0%)	4,420	2,615 (59.16%)	1,805
Total	2,326,239	252,567 (10.86%)	2,073,672	528,544 (22.72%)	1,797,695	32 (0.0013%)	2,326,207	1,165,031 (50.08%)	1,161,208

Level of detection of chimaera removal methods.

HCK-ABAA: Taxonomic Assignment Performances

The second step of the HCK pipeline handles the chimaeras-free sequences. Samples sequences pre-processed and filtered by ABAA in the first step are then combined into a single fasta for the clustering process. With our datasets, we cluster a total of 2,326,239 amplicons with HCK using multiple-step iterated runs of cd-hit-est to perform hierarchical clustering. The first iteration performed with 99% sequence similarity generates 88,428 clusters, the second iteration with 98% creates 32,200 clusters (3/8 of the initial clusters), and the final iteration at 97% produces 18,831 clusters. The final iteration reduces the total clusters by 1/5 of the initial clusters, a crucial benefit of the hierarchical clustering that will be detailed in the discussion. All clusters generated by different iterations are merged into 18,770 non-redundant clusters, including 14,431 singletons, for which 2,545 have fragments size ≤ 149 bp and 11,886 with sequence size ≥150 bp (150 bp, widely considered as the minimum standard of ITS length). The singletons are removed from further analysis based on the assumption that a unique sequence might derive from sequencing errors or non-specific amplification. As a result, only 4,339 clusters are composed of biologically valid amplicons corresponding to 4,339 representative sequences. The performance of HCK with and without ABAA is assessed using the precision (positive predictive value), the recall (sensitivity), and the F-score based on the true-positive, false-positive, and false-negative rates as described in material and method. This performance is compared to other pipelines, e.g., QIIME version 1 with both databases ITSdb and UNITE, Kraken version1 with ITSdb, and DADA2 with UNITE database. All classification methods are tested with and without the chimaera removal step. The analysis shows that HCK without non-specific amplicons removal is slightly better than Kraken (precision: 0.685 and 0.682, recall: 0.986 and 0.983 and F-score: 0.80 and 0.79, respectively) and HCK decreases by 13.22% the false-positive detection and by 45.36% of false negatives compared to Kraken. The chimaera removal step with UCHIME_REF reduces the false positives by 32.05% and the false-negative by 10.44% compared to raw sequence processing. However, adding a step of chimaera filtering affects the sensitivity (recall), regardless of the method (Table 3).

TABLE 3

Precision, recall, accuracy, and F-score performance of HCK and ABAA version other tested combination of chimaera detection and taxonomy assignment methods.

Taxa. assign.	Chimaera remov.	Database	Total sequences	Unclassified sequences	True positive	False positive	False negative	Precision	Recall	F-score
hck	ABaa	ITSdb	2,073,672	378046	1,528,646	166,980	27,314	0.834998	0.97222	0.89594
hck	Uchime_ref	ITSdb	1,797,695	231779	1,306,938	258,978	249,022	0.803725	0.84344	0.81431
hck	Uchime_denovo	ITSdb	2,326,207	102360	1,547,646	676,201	8,314	0.670532	0.98645	0.79213
hck	–	ITSdb	2,326,239	220025	1,547,649	558,565	8,311	0.685258	0.98645	0.8013
kraken	ABaa	ITSdb	2,073,672	73480	1,497,681	502,511	58,279	0.713001	0.9651	0.81431
kraken	Uchime_ref	ITSdb	1797695	138673	1,277,882	381,140	278,078	0.725327	0.83483	0.77351
kraken	Uchime_denovo	ITSdb	2,326,207	141772	1,540,747	643,688	15,213	0.682455	0.98295	0.7977
kraken	–	ITSdb	2,326,239	141804	1,540,747	643,688	15,213	0.682455	0.98295	0.7977
qiime	ABaa	ITSdb	2,073,672	407960	1,369,677	296,035	186,283	0.782386	0.83022	0.78878
qiime	Uchime_ref	ITSdb	1,797,695	389342	1,175,806	232,547	380,154	0.765336	0.72615	0.71648
qiime	Uchime_denovo	ITSdb	2,326,207	515687	1,391,958	418,562	164,002	0.734174	0.83587	0.76968
qiime	–	ITSdb	2,326,239	525891	1,391,597	408,751	164,363	0.735705	0.83605	0.77068
qiime	ABaa	Unite	2,073,672	455026	1,208,721	409,925	347,239	0.529577	0.76841	0.62466
qiime	Uchime_ref	Unite	1,797,695	43024	1,050,365	704,306	505,595	0.510642	0.64894	0.56654
qiime	Uchime_denovo	Unite	2,326,207	347217	1,310,235	668,755	245,725	0.527798	0.79978	0.63181
qiime	–	Unite	2,326,239	52075	1,307,126	967,038	248,834	0.539557	0.79841	0.6396
Dada2		Unite	1,161,208	0	750,588	410,620	805,372	0.717852	0.5755	0.62471

Precision, recall, accuracy, and F-score performance of HCK and ABAA version other tested combination of chimaera detection and taxonomy assignment methods. HCK yields better classification performance when ABAA is added upstream (precision 0.83 and the second best is HCK/UCHIME_REF with 0.8), and consequently, the F-score is also improved (0.89, Figure 4). Besides, the association of HCK and ABAA reduces the proportion of false-positive by 35.52% compared to HCK with UCHIME_REF and 97.01% without non-specific sequences removal. QIIME used with UCHIME_REF, and ITSdb performs better (F-score = 0.716) than similar approach with UNITE database (f-score = 0.648). DADA2 was also tested with its filtering method includes in the pipeline. The true-positive sequence classified was shallow compared to others, and this might be due to the high number of chimaeras sequences filtered (50.08%). Its F-score performance is 0.62, with 411 775 false-positive and 805 372 false negatives. The Kraken based classification implemented with chimaera methods yields comparable sensitivity results (recall) to that of HCK, but the higher number of false-positive impacts the precision and the overall performance (F-score: 0.79) (Table 3).

FIGURE 4

F-score performance of HCK (light green), Kraken (mauve), Qiime with ITSdb (blue), Qiime with Unite database (red), and DADA2 (green).

Diversity Metrics Analysis

One of the most significant endpoints of ITS sequencing is the comparison of alpha diversity; thus, we compare the alpha diversity of all tested classification methods to that of BLASTn using Chao1 and Shannon indexes, assuming that diversity with BLAST search is closer to reality. With the Chao1 index, HCK diversity is closed to BLASTn estimation compared to Kraken, QIIME, and DADA2 estimation. With chao1, HCK in association with ABAA held the lowest difference with BLASTn (54.02), followed by HCK with UCHIME_DENOVO. With the Shannon index, HCK, used with UCHIME_REF, held the best rank(0.76), followed by HCK with ABAA (Supplementary Table 1). The data show that the BLASTn search estimates the chao1 index between 8 and 14 for all the samples. HCK with ABAA chaos1estimates is between 20 and 80, and kraken with and without chimaera removal’s estimation between 176 and 856. DADA2’s chao1 estimation is also close to the BLASTn search’s; however, there is an overestimation for some samples (15−650, Figure 5). The average Shannon index diversity of BLASTn search is 2 for all samples. It varies between 2 and 3 with HCK and ABAA, 2−7 for DADA2, 3 for Kraken with or without chimaera removal, and up to 6 for QIIME, depending on the database and the chimaera removal method (Figure 5).

FIGURE 5

Estimation of Alpha diversity metric with Chao1 (left) and Shannon (right) indexes calculated with the estimated abundance of different methods and association with various chimaera removal methods

Computing Specification and Speed

ABAA and HCK computing resources were tested and timed on ubuntu-based system 20.04, WSL2; Processor: Intel® CoreTM i7-8650U CPU @ 1.90GHz 2.11 GHz; RAM 16.0 GB; System type 64-bit Operating System, x64-based processor. The running speed for each method tested is listed in Supplementary Table 2. ABaa filtered 2,326,239 sequences (623 MiB) in less than 2 min (1 min 18 s), while this requires almost 50 min for UCHIME_REF. isBimeraDenovo() function of DADA2 could not be tested separately as this step depends on many other steps. HCK and Qiime v1 process 23,700 sequences per min on average, while DADA2 processes 65,300 sequences per min. This speed does not include sequence truncation as it is not required for ITS processing.

Discussion

In this work, we present full fungi ITS based classification workflow using two newly developed tools, ABAA and HCK, to filter non-specific amplicons and taxonomically classify them. We compare the performance of ABAA to UCHIME and isBimeraDenovo() in DADA2 and the performance of HCK to that of Kraken, QIIME, and DADA2 using publicly available mock community Illumina sequencing datasets. The analysis revealed that HCK-ABAA yields the best performance. The F-score is systematically improved when ABAA is used to filter amplicons regardless of the classifier. This work also shows the impact of filtering methods on the ecological diversity metrics and how they can dramatically change the estimation of a sample’s diversity. The efficiency of sequence amplification and the quality of sequencing reads are critical and determinants for the outcome of the metabarcoding analysis and especially for fungi ITS locus (Schloss et al., 2011). Non-specific amplicons present a serious threat to the classification and taxa abundance estimation. The size and the number of ITS loci are highly variable, unlike the 16S rRNA gene in bacteria and sufficiently polymorphic to delineate fungi at the genus and or species level (Tang et al., 2015; Khodadadi et al., 2017). This variation can be biological or can derive from high rates of insertions and deletions in the evolution of this less conserved genetic region. It can also derive from non-specific amplification. ABAA has this advantage of considering the real distribution of amplicons size from real datasets. It does not need database maintenance and only requires minimum computing resources. It filters sequences based on the distribution of their size-frequency and mainly targets amplicons with low length-frequency. The performance of the majority of chimaera filtering methods are usually assessed on simulated chimaera sequences (Nilsson et al., 2010; Harris et al., 2012), but when applied to the real dataset, it is challenging to determine whether sequences that have been filtered are real chimaeras. The fragment size dissimilarity also creates bias during conventional clustering. Consequently, this affects OTUs picking and abundance estimation, including overestimating or underestimating community abundance (De Filippis et al., 2017). Except for Kraken, the majority of metabarcoding methods include a clustering process. Clustering consists of reducing the amplicons similarity redundancy of data diversity. The most commonly used in amplicon metabarcoding analysis are uclust in the usearch algorithm (Edgar, 2010), vsearch (Rognes et al., 2016), and CD-HIT (Fu et al., 2012). usearch and vsearch can cluster nucleotide sequences based on their similarity, length, and abundance, assuming that the same species’ amplicons will probably be identical in size with a minimal coverage dissimilarity. As a result, with fungi ITS, clustering may create multiple OTUs from the same species amplicons and increase the alpha and the beta diversities. CD-HIT implements a more realistic clustering approach, hierarchical clustering, which consists of a multiple-step, iterated runs with a neighbour-joining approach and generates a hierarchical structure. In HCK, with the datasets that we analyse, the second iteration with 98% identity reduces the first number of clusters by 3/8 and the final iteration with 97% identity by 1/5. In addition to filtering out the singleton, HCK drastically reduces the number of false-positive and normalises the diversity abundance. It is essential to highlight that databases also play an important role in the performance of the classifier. Qiime version 1 performs better with the ITSdb database than its native database UNITE, regardless of the filtering method. The inappropriate estimation of the abundance (overestimation, underestimation of population or sequence wrongly classified) can also influence metrics of diversity. The high diversity found with the UNITE database might be due to the higher number of incorrect classification sequences in the UNITE database.

Conclusion

The classification of fungi using ITS marker is very challenging. It is owed to the high diversity of the kingdom. Moreover, targeting an intergenic section as ITS1 leads to diversified amplicon sizes and sequences that are not taken into account with the classical approaches developed for 16S analysis. Combining HCK and ABAA increases the number of true-positive and decreases the proportion of false-positive, as shown with the datasets we have evaluated. Consequently, HCK maintained the alpha diversity metric with the Chao1 index close to that of the BLASTn, compared to QIIME, Kraken, and DADA2. As demonstrated in this analysis, the use of HCK in association with ABAA allows a more realistic estimation of fungal diversity. So far, it is the best option to perform fungi ITS1 metabarcoding analysis on clinical and non-clinical samples.

Data Availability Statement

Sequences analysed in this project are available under accession numbers SRR8473974, SRR8473977, SRR8473978, SRR8473979, SRR8473980, SRR8473984, SRP132544, SRR6702280, SRR6702281, SRR6702283, SRR5439721, and SRR5439722 in NCBI. The original contributions presented in the study are included in the Material and method section, further inquiries can be directed to the corresponding authors.

Author Contributions

AD and OP supervised the study. KM analysed interpreted data and wrote the manuscript. CB, AM, AO, and AK made critical revisions. All authors read and approved the final manuscript.

Conflict of Interest

AO, AK and OP were employed by the company L’Oreal. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

38 in total

1. Utilization of size polymorphism in ITS1 and ITS2 regions for identification of pathogenic yeast species.

Authors: Hossein Khodadadi; Ladan Karimi; Nilufar Jalalizand; Hassan Adin; Hossein Mirhendi
Journal: J Med Microbiol Date: 2017-03 Impact factor: 2.472

2. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities.

Authors: Patrick D Schloss; Sarah L Westcott; Thomas Ryabin; Justine R Hall; Martin Hartmann; Emily B Hollister; Ryan A Lesniewski; Brian B Oakley; Donovan H Parks; Courtney J Robinson; Jason W Sahl; Blaz Stres; Gerhard G Thallinger; David J Van Horn; Carolyn F Weber
Journal: Appl Environ Microbiol Date: 2009-10-02 Impact factor: 4.792

3. Reducing the impact of PCR-mediated recombination in molecular evolution and environmental studies using a new-generation high-fidelity DNA polymerase.

Authors: Daniel J G Lahr; Laura A Katz
Journal: Biotechniques Date: 2009-10 Impact factor: 1.993

4. ITS1: a DNA barcode better than ITS2 in eukaryotes?

Authors: Xin-Cun Wang; Chang Liu; Liang Huang; Johan Bengtsson-Palme; Haimei Chen; Jian-Hui Zhang; Dayong Cai; Jian-Qin Li
Journal: Mol Ecol Resour Date: 2014-09-24 Impact factor: 7.090

5. DADA2: High-resolution sample inference from Illumina amplicon data.

Authors: Benjamin J Callahan; Paul J McMurdie; Michael J Rosen; Andrew W Han; Amy Jo A Johnson; Susan P Holmes
Journal: Nat Methods Date: 2016-05-23 Impact factor: 28.547

6. PANDAseq: paired-end assembler for illumina sequences.

Authors: Andre P Masella; Andrea K Bartram; Jakub M Truszkowski; Daniel G Brown; Josh D Neufeld
Journal: BMC Bioinformatics Date: 2012-02-14 Impact factor: 3.169

7. CLOTU: an online pipeline for processing and clustering of 454 amplicon reads into OTUs followed by taxonomic annotation.

Authors: Surendra Kumar; Tor Carlsen; Bjørn-Helge Mevik; Pål Enger; Rakel Blaalid; Kamran Shalchian-Tabrizi; Håvard Kauserud
Journal: BMC Bioinformatics Date: 2011-05-20 Impact factor: 3.307

8. Mycobiome Sequencing and Analysis Applied to Fungal Community Profiling of the Lower Respiratory Tract During Fungal Pathogenesis.

Authors: Lisa R McTaggart; Julia K Copeland; Anuradha Surendra; Pauline W Wang; Shahid Husain; Bryan Coburn; David S Guttman; Julianne V Kus
Journal: Front Microbiol Date: 2019-03-15 Impact factor: 5.640

9. CD-HIT: accelerated for clustering the next-generation sequencing data.

Authors: Limin Fu; Beifang Niu; Zhengwei Zhu; Sitao Wu; Weizhong Li
Journal: Bioinformatics Date: 2012-10-11 Impact factor: 6.937

10. Kraken: ultrafast metagenomic sequence classification using exact alignments.

Authors: Derrick E Wood; Steven L Salzberg
Journal: Genome Biol Date: 2014-03-03 Impact factor: 13.583