Yiheng Hu1, Laszlo Irinyi2,3,4, Minh Thuy Vi Hoang2,3,4, Tavish Eenjes1, Abigail Graetz1, Eric A Stone1,5, Wieland Meyer2,3,4,6,7, Benjamin Schwessinger1, John P Rathjen1.
Abstract
The kingdom Fungi is highly diverse in morphology and ecosystem function. Yet fungi are challenging to characterize as they can be difficult to culture and morphologically indistinct. Overall, their description and analysis lag far behind other microbes such as bacteria. Classification of species via high-throughput sequencing is increasingly becoming the norm for pathogen detection, microbiome studies, and environmental monitoring. With the rapid development of sequencing technologies, however, standardized procedures for taxonomic assignment of long sequence reads have not yet been well established. Focusing on nanopore sequencing technology, we compared classification and community composition analysis pipelines using shotgun and amplicon sequencing data generated from mock communities comprising 43 fungal species. We show that regardless of the sequencing methodology used, the highest accuracy of species identification was achieved by sequence alignment against a fungal-specific database. During the assessment of classification algorithms, we found that applying cutoffs to the query coverage of each read or contig significantly improved the classification accuracy and community composition analysis without major data loss. We also generated draft genome assemblies for three fungal species from nanopore data which were absent from genome databases. Our study improves sequence-based classification and estimation of relative sequence abundance using real fungal community data and provides a practical guide for the design of metagenomics analyses focusing on fungi.
IMPORTANCE Our study is unique in that it provides an in-depth comparative study of a real-life complex fungal community analyzed with multiple long- and short-read sequencing approaches. These technologies and their application are currently of great interest to diverse biologists as they seek to characterize the community compositions of microbiomes. Although great progress has been made on bacterial community compositions, microbial eukaryotes such as fungi clearly lag behind. Our study provides a detailed breakdown of strategies to improve species identification with immediate relevance to real-world studies. We find that real-life data sets do not always behave as expected, distinct from reports based on simulated data sets.
Keywords: bioinformatics; fungi; metagenomics; pathogens
Year: 2022 PMID: 35404122 PMCID: PMC9040722 DOI: 10.1128/mbio.02444-21
Source DB: PubMed Journal: mBio Impact factor: 7.786
Metadata of the mock fungal community
| Species name | Strain used in the mock community | Corresponding reference genome strain | Assembly level | NCBI accession | File name |
|---|---|---|---|---|---|
|  | WM 03.225 | NRRL3357 | Scaffold | GCF_000006275.2 | GCF_000006275.2_JCVI-afl1-v2.0_genomic.fna.gz |
|  | WM 06.98 | Af293 | Chromosome | GCF_000002655.1 | GCF_000002655.1_ASM265v1_genomic.fna.gz |
|  | WM 07.12 (CBS 522.75 Type strain) | NRRL Y-17577 | Scaffold | GCA_003707485.2 | GCA_003707485.2_ASM370748v2_genomic.fna.gz |
|  | WM 229 (CBS 562 Type strain) | SC5314 | Chromosome | GCA_000182965.3 | GCA_000182965.3_ASM18296v3_genomic.fna.gz |
|  | WM 02.73 | CD36 | Chromosome | GCA_000026945.1 | GCA_000026945.1_ASM2694v1_genomic.fna.gz |
|  | WM 03.500 | CBS 138 | Chromosome | GCF_000002545.3 | GCF_000002545.3_ASM254v2_genomic.fna.gz |
|  | WM 890 | B11899 | Contig | GCF_002926055.2 | GCF_002926055.2_CanHae_1.0_genomic.fna.gz |
|  | WM 02.200 | CDC317 | Scaffold | GCA_000182765.2 | GCA_000182765.2_ASM18276v2_genomic.fna.gz |
|  | WM 01.203 | MYA-3404 | Scaffold | GCF_000006335.3 | GCF_000006335.3_ASM633v3_genomic.fna.gz |
|  | WM 18 (CBS 4413 Type strain) | ATCC 42720 | Scaffold | GCF_000003835.1 | GCF_000003835.1_ASM383v1_genomic.fna.gz |
|  | WM 773 (CBS 142 Type strain) | JCM 2334 | Scaffold | GCA_001599735.1 | GCA_001599735.1_JCM_2334_assembly_v001_genomic.fna.gz |
|  | WM 179 (VGI Standard strain) | WM 276 | Chromosome | GCF_000185945.1 | GCA_000185945.1_ASM18594v1_genomic.fna.gz |
|  | WM 178 (VGII Standard strain) | R265 | Scaffold | GCA_000786445.1 | GCA_000786445.1_R265.1_genomic.fna.gz |
|  | WM 175 (VGIII Standard strain) |  |  |  |  |
|  | WM 779 (VGIV Standard strain) |  |  |  |  |
|  | WM 13.104 |  |  |  |  |
|  | WM 148 (VNI Standard strain) | JEC21 | Chromosome | GCA_000091045.1 | GCA_000091045.1_ASM9104v1_genomic.fna.gz |
|  | WM 626 (VNII Standard strain) | B-3501A | Chromosome | GCF_000149385.1 | GCF_000149385.1_ASM14938v1_genomic.fna.gz |
|  | WM 629 (VNIV Standard strain) | H99 | Chromosome | GCF_000149245.1 | GCF_000149245.1_CNA3_genomic.fna.gz |
|  | WM 45 (CBS 1600 Type strain) | NBRC0988 | Chromosome | GCA_000328385.1 | GCA_000328385.1_Cuti_1.0_genomic.fna.gz |
|  | WM 309 (CBS 767 Type strain) | CBS767 | Chromosome | GCF_000006445.2 | GCF_000006445.2_ASM644v2_genomic.fna.gz |
|  | WM 03.477 | WY3-10-4 | Contig | GCA_003285555.1 | GCA_003285555.1_ASM328555v1_genomic.fna.gz |
|  | WM 03.468 |  |  |  |  |
|  | WM 03.463 |  |  |  |  |
|  | WM 05.217 | CLIB 918 | Scaffold | GCA_001402995.1 | GCA_001402995.1_New2.3_08062011_genomic.fna.gz |
|  | WM 17.18 | CICC 1368 | Scaffold | GCA_002233575.1 | GCA_002233575.1_ASM223357v1_genomic.fna.gz |
|  | WM 13 (CBS 834) | DMKU3-1042 | Complete Genome | GCF_001417885.1 | GCF_001417885.1_Kmar_1.0_genomic.fna.gz |
|  | WM 10.200 | 148 | Contig | GCA_004919595.1 | GCA_004919595.1_Kodohm_148_genomic.fna.gz |
|  | WM 13.369 | JHH-5317 | Contig | GCA_002276285.1 | GCA_002276285.1_Lprolificans_pilon_genomic.fna.gz |
|  | WM 03.389 | MG20W | Contig | GCA_000755205.1 | GCA_000755205.1_ASM75520v1_genomic.fna.gz |
|  | WM 02.131 | ATCC 6260 | Scaffold | GCF_000149425.1 | GCF_000149425.1_ASM14942v1_genomic.fna.gz |
|  | WM 02.78 | CBS573 | Complete Genome | GCA_003054445.1 | GCA_003054445.1_ASM305444v1_genomic.fna.gz |
|  | WM 32 (CBS 638 Type strain) | NRRL Y-2026 | Scaffold | GCF_001661235.1 | GCF_001661235.1_Picme2_genomic.fna.gz |
|  | WM 885 | NRRL Y-7687 | Scaffold | GCA_003705465.1 | GCA_003705465.1_ASM370546v1_genomic.fna.gz |
|  |  | PLFJ-1 |  | GCA_004026455.1 | GCA_004026455.1_ASM402645v1_genomic.fna |
|  | WM 09.204 | RIT389 | Contig | GCA_002250355.1 | GCA_002250355.1_RIT389_v1_genomic.fna.gz |
|  | WM 06.385 | WM 09.24 | Scaffold | GCA_000812075.1 | GCA_000812075.1_ASM81207v1_genomic.fna.gz |
|  | WM 09.122 (CBS 101.22 Type strain) | IHEM 23826 | Contig | GCA_002221725.1 | GCA_002221725.1_ScBoyd1.0_genomic.fna.gz |
|  | WM 04.474 | CBS 118892 | Scaffold | GCF_000151425.1 | GCF_000151425.1_ASM15142v1_genomic.fna.gz |
|  | WM 03.423 | CBS 2479 | Scaffold | GCF_000293215.1 | GCF_000293215.1_Trichosporon_asahii_1_genomic.fna.gz |
|  | WM 601 | JCM 11170 | Scaffold | GCA_003116895.1 | GCA_003116895.1_JCM_11170_assembly_v001_genomic.fna.gz |
|  | WM 03.507 | NRRL Y-366-8 | Scaffold | GCF_001661255.1 | GCF_001661255.1_Wican1_genomic.fna.gz |
|  | WM 17 (CBS 6124 Type strain) | CLIB122 | Chromosome | GCF_000002525.2 | GCF_000002525.2_ASM252v1_genomic.fna.gz |
|  | WM 02.460 (CBS 5839 Type strain) | Y-7136 | Scaffold |  | Zyghe1_2_AssemblyScaffolds_Repeatmasked.fasta |
No publicly available reference genome, but high-quality reference genomes of related strains are available.
Locally sequenced and assembled.
Potential contamination species.
Downloaded from JGI, no NCBI accession.
Characteristics of each sequence data set
| Sample | Sequencing tech | Sequencing strategy | Number base pairs | Number reads | Number assembled contigs | Number mapped base pairs (Gb) |
|---|---|---|---|---|---|---|
| Pooled DNA | Illumina | Shotgun | 3.91 Gb | 14,525,058 | 338,823 | 3.69 |
| Pooled DNA | Illumina | Amplicon | 66.9/95.8/106.4 Mb | 39,374/9,614/10,236 | NA | NA |
| Pooled DNA | Nanopore | Shotgun | 1.96 Gb | 1,273,484 | NA | NA |
| Pooled DNA | Nanopore | Amplicon | 71.5/72.5/86.5 Mb | 26,212/26,680/31,826 | NA | NA |
| Pooled biomass | Illumina | Shotgun | 3.67 Gb | 13,623,120 | 345,009 | 3.44 |
| Pooled biomass | Illumina | Amplicon | 55.7/38.1/71.9 Mb | 23,613/13,828/27,093 | NA | NA |
| Pooled biomass | Nanopore | Shotgun | 3.78 Gb | 1,043,343 | NA | NA |
| Pooled biomass | Nanopore | Amplicon | 54.5/49.4/42.0 Mb | 20,163/18,273/15,502 | NA | NA |
The total number of base pairs in each technical replicate was calculated before import into the QIIME2 pipeline.
Number of nanopore reads or paired-end Illumina reads for technical replicate 1/replicate 2/replicate 3 after quality control.
FIG 1 Analysis of shotgun metagenomics data. (A) Swarmplot showing the concordance in genus identification after varying either the alignment algorithm or querying different databases on different data inputs. nt = NCBI nucleotide database (29); RFD = RefSeq Fungi database (30); data inputs are indicated below the line (PD = pooled DNA; PB = pooled biomass). (B) Identification of fungal genera from PD samples. The classification proportion and precision were derived from different combinations of search algorithms and databases as indicated (box). (C) Identification of fungal genera from PB samples. The classification proportion and precision were derived from the different combinations of search algorithms and databases as indicated.
FIG 2 Dynamics in precision, completeness, and remaining rate after applying progressive cutoffs on BLAST alignment metrics. PD = pooled DNA; PB = pooled biomass. (A) Cutoffs applied to query length. (B) Cutoffs applied to alignment E values. (C) Cutoffs applied to the percentage of identical matches. (D) Cutoffs applied to query coverage.
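The cutoff sweep summarized in FIG 2 can be illustrated with a minimal Python sketch. It assumes each read or contig has already been reduced to one best hit carrying a genus label and a query coverage percentage. The `Hit` record, the function names, and the metric definitions used here are illustrative assumptions rather than the authors' implementation: precision is taken as the share of retained hits assigned to an expected genus, completeness as the share of expected genera still detected, and remaining rate as the share of hits retained.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record for one best alignment hit (not from the paper)."""
    read_id: str
    genus: str
    query_coverage: float  # percent of the query covered by the alignment


def apply_cutoff(hits, min_query_coverage):
    """Keep hits whose query coverage is at or above the cutoff."""
    return [h for h in hits if h.query_coverage >= min_query_coverage]


def summarize(all_hits, kept_hits, expected_genera):
    """Return (precision, completeness, remaining_rate) under the assumed definitions."""
    correct = sum(1 for h in kept_hits if h.genus in expected_genera)
    precision = correct / len(kept_hits) if kept_hits else 0.0
    detected = {h.genus for h in kept_hits} & set(expected_genera)
    completeness = len(detected) / len(expected_genera) if expected_genera else 0.0
    remaining_rate = len(kept_hits) / len(all_hits) if all_hits else 0.0
    return precision, completeness, remaining_rate


def sweep(hits, expected_genera, cutoffs):
    """Evaluate the three metrics at each cutoff in turn."""
    return {c: summarize(hits, apply_cutoff(hits, c), expected_genera) for c in cutoffs}
```

Short, spurious alignments tend to have low query coverage, so in a sweep like this precision would be expected to rise as the cutoff increases while the remaining rate falls; the toy definitions above make that trade-off explicit.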
Assignment of published sequence data to genera after application of cutoffs to query coverage
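| Sample ID | Sample description | Sequencing tech | Cutoff on query coverage (%) | Filtered results (%) | Confirmed genera before applying cutoffs (%) | Confirmed genera after applying cutoffs (%) |
|---|---|---|---|---|---|---|
| a1 | Human sputum samples (35) | Nanopore | 59 | 20.2 | 85.9 | 86.5 |
| a2 | Human sputum samples (35) | Nanopore | 53.2 | 20.1 | 97.9 | 98.5 |
| a3 | Human sputum samples (35) | Nanopore | 54 | 20.5 | 96.5 | 97.4 |
| a4 | Human sputum samples (35) | Nanopore | 45.5 | 20.1 | 16.2 | 19.8 |
| a5 | Human sputum samples (35) | Nanopore | 58.5 | 20 | 71.1 | 66.9 |
| a6 | Human sputum samples (35) | Nanopore | 50.4 | 20.1 | 93.6 | 94.7 |
| b1 | Field infected wheat samples (34) | Nanopore | 5 | 20 | 60.4 | 75.1 |
| b2 | Field infected wheat samples (34) | Nanopore | 0.77 | 19.9 | 34.8 | 43 |
| b3 | Field infected wheat samples (34) | Nanopore | 12 | 19.7 | 67 | 82 |
| b4 | Field infected wheat samples (34) | Nanopore | 0.61 | 20 | 5.8 | 6.2 |
| c1 | Swine gut microbiome samples (38) | Illumina | 2.4 | 20.1 | 32 | 35.4 |
| c2 | Swine gut microbiome samples (38) | Illumina | 3.3 | 20.2 | 34.2 | 36.6 |
| c3 | Swine gut microbiome samples (38) | Illumina | 2.6 | 20.2 | 35.2 | 38.3 |
| d1 | Mouse gut microbiome samples (39) | Illumina | 3.4 | 19.8 | 29.1 | 24.3 |
| d2 | Mouse gut microbiome samples (39) | Illumina | 14 | 20.1 | 63.7 | 69.4 |
| d3 | Mouse gut microbiome samples (39) | Illumina | 4.5 | 20.2 | 38.6 | 42.3 |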
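In the table above, the filtered share sits near 20% for every sample while the cutoff itself varies widely, which is consistent with a per-sample cutoff chosen so that roughly a fifth of the hits is removed. The Python sketch below shows one way such a cutoff could be derived from a list of query coverage values. This is an interpretation of the table, not a description of the authors' procedure, and the function name and tie handling are assumptions.

```python
import math


def cutoff_for_target_loss(query_coverages, target_fraction=0.20):
    """Pick a query coverage cutoff that removes roughly target_fraction of hits.

    Hits with coverage strictly below the returned cutoff would be discarded.
    Returns (cutoff, achieved_fraction). Tied coverage values can make the
    achieved fraction smaller than the target.
    """
    if not query_coverages:
        raise ValueError("no query coverage values supplied")
    ordered = sorted(query_coverages)
    n = len(ordered)
    # Index of the first value that should survive the filter.
    k = min(math.ceil(target_fraction * n), n - 1)
    cutoff = ordered[k]
    removed = sum(1 for value in ordered if value < cutoff)
    return cutoff, removed / n
```

Because the cutoff follows the distribution of coverage values in each sample, data sets dominated by short partial alignments would end up with a low cutoff and data sets with mostly full-length alignments with a high one, which matches the spread of cutoff values seen across sample groups.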
FIG 3 Benchmarking of amplicon data sets. PD = pooled DNA; PB = pooled biomass. (A) Scatterplot representing genus level classification proportion and precision for nanopore amplicon data. (B) Genus level precision of Illumina amplicon data. Classification proportion values for Illumina data were 100% due to the nature of the QIIME2 pipeline (based on the UNITE ITS database). (C) Genus level completeness of both nanopore and Illumina amplicon data sets. The nanopore results are from the minimap2 algorithm against the UNITE ITS database.
FIG 4 Improving community composition analysis by applying query coverage cutoffs. PD = pooled DNA; PB = pooled biomass; nt = NCBI nucleotide database (29); RFD = RefSeq Fungi database (30). (A) Experimental flowchart for analyzing community compositions. (B) Statistical similarity measures between the gold standard community composition and each combination of algorithms and databases. Lower values correspond to greater similarity between the samples and the gold standard. (C) Change in Bhattacharyya distance after applying cutoffs to query coverage for each data set as indicated. Adjacent points are separated by a query coverage step of 0.5%. (D) Change in relative Euclidean distance after applying cutoffs to query coverage for each data set. Adjacent points are separated by a step of 0.5%. (E) Change in relative entropy after applying cutoffs to query coverage for each data set. Adjacent points are separated by a step of 0.5%.
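The distance measures named in FIG 4 can be sketched in Python for two community compositions expressed as genus-to-count mappings. The Bhattacharyya distance and relative entropy (Kullback-Leibler divergence) follow their textbook definitions. The direction of the relative entropy (observed relative to gold standard), the `eps` floor for genera missing from the reference, and the use of a plain Euclidean distance are assumptions, since the normalization behind "relative Euclidean distance" is not specified in this excerpt.

```python
import math


def normalize(counts):
    """Convert raw counts per genus into relative abundances that sum to 1."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    return {genus: value / total for genus, value in counts.items()}


def bhattacharyya_distance(p, q):
    """-ln of the Bhattacharyya coefficient; 0 for identical compositions."""
    genera = set(p) | set(q)
    coefficient = sum(math.sqrt(p.get(g, 0.0) * q.get(g, 0.0)) for g in genera)
    return -math.log(coefficient) if coefficient > 0 else math.inf


def relative_entropy(p, q, eps=1e-12):
    """KL divergence of p from q; eps (an assumption) guards genera absent from q."""
    return sum(
        value * math.log(value / max(q.get(genus, 0.0), eps))
        for genus, value in p.items()
        if value > 0
    )


def euclidean_distance(p, q):
    """Plain Euclidean distance between two relative abundance vectors."""
    genera = set(p) | set(q)
    return math.sqrt(sum((p.get(g, 0.0) - q.get(g, 0.0)) ** 2 for g in genera))
```

In use, both the gold standard and the observed classification counts would be passed through `normalize` first, and each measure recomputed after every query coverage cutoff to trace curves like those in panels C to E.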