Chao Wang, Ying Zhang, Shuguang Han.
Abstract
Fungi play essential roles in many ecological processes, and taxonomic classification is fundamental for microbial community characterization and vital for the study and preservation of fungal biodiversity. To cope with massive fungal barcode data, tools that can handle large volumes of barcode sequences, especially those from the internal transcribed spacer (ITS) region, are necessary. However, high variation in the ITS region and the computational cost of processing high-dimensional features remain challenging for existing predictors. In this study, we developed Its2vec, a bioinformatics tool for the classification of fungal ITS barcodes to the species level. An ITS database covering more than 25,000 species in a broad range of fungal taxa was assembled. For dimensionality reduction, a word embedding algorithm was used to represent an ITS sequence as a dense low-dimensional vector. A random forest-based classifier was built for species identification. Benchmarking results showed that our model achieved an accuracy comparable to that of several state-of-the-art predictors and, more importantly, that it could handle large datasets and greatly reduce dimensionality. We expect the Its2vec model to be helpful for fungal species identification and, thus, for revealing microbial community structures and deepening our understanding of their functional mechanisms.
Year: 2020 PMID: 32566672 PMCID: PMC7275950 DOI: 10.1155/2020/2468789
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Figure 1. Schematic view of Its2vec. (a) Pipeline scheme of ITS dataset construction. (b) The skip-gram architecture of word2vec, which predicts surrounding k-mers (GAA, AGG, AAG, and ACA) based on a given center word (GAA). (c) Pipeline scheme of distributed representation of ITS sequences. For example, the ITS sequence SH1_1 (length N) was first represented by a set of N − 2 3-mers (TAG, AGA, GAG, …, AAG). Then, for each k-mer, we generated a distributed vector representation of size 100 based on the skip-gram model, e.g., TAG [0.041, 0.158, 0.219…]. Sequence SH1_1 was then represented by the average of all N − 2 k-mer vectors, which is also a vector of size 100, e.g., SH1_1 [0.050, –0.017, 0.391…]. Similar words have close vectors; in this figure, SH1_1 and SH2_1 are close to SH1_2 and SH2_2, respectively. (d) Flow diagram showing model training and testing using the RF classifier.
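The representation described in panel (c) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `kmers` and `embed_sequence` are hypothetical, and the 3-dimensional vectors in `toy_embeddings` are made-up values standing in for the size-100 skip-gram vectors.

```python
from typing import Dict, List


def kmers(sequence: str, k: int = 3) -> List[str]:
    """Return all overlapping k-mers; a length-N sequence yields N - k + 1 of them."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def embed_sequence(sequence: str, embeddings: Dict[str, List[float]], k: int = 3) -> List[float]:
    """Average the vectors of the k-mers in `sequence`.

    k-mers without an entry in `embeddings` are skipped (an assumption of this sketch).
    """
    vectors = [embeddings[w] for w in kmers(sequence, k) if w in embeddings]
    if not vectors:
        raise ValueError("no k-mer of the sequence has an embedding")
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


# Toy 3-dimensional embeddings (assumed values; the figure describes vectors of size 100).
toy_embeddings = {
    "TAG": [0.0, 1.0, 2.0],
    "AGA": [1.0, 1.0, 0.0],
    "GAG": [2.0, 1.0, 1.0],
}

words = kmers("TAGAG")                          # N = 5, so N - 2 = 3 overlapping 3-mers
vector = embed_sequence("TAGAG", toy_embeddings)
print(words)
print(vector)
```

In a real pipeline the lookup table would come from a trained skip-gram model, and the averaged vectors would be the feature rows passed to the classifier.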
Taxonomic coverage of the ITS database established in this study.
| Taxonomy level | ITSset_2 | ITSset_3 | ITSset_4 | ITSset_5 | ITSset_6 | ITSset_7 | ITSset_8 | ITSset_9 | ITSset_10 | Number of taxa |
|---|---|---|---|---|---|---|---|---|---|---|
| Phyla | 15 | 13 | 9 | 8 | 7 | 8 | 8 | 7 | 10 | 18 |
| Classes | 58 | 46 | 41 | 34 | 31 | 32 | 28 | 30 | 43 | 63 |
| Order | 165 | 142 | 128 | 106 | 97 | 98 | 78 | 78 | 133 | 187 |
| Family | 516 | 424 | 381 | 317 | 280 | 263 | 220 | 201 | 418 | 626 |
| Genus | 2073 | 1432 | 1116 | 875 | 693 | 633 | 497 | 404 | 1598 | 3385 |
| Species | 8586 | 4141 | 2503 | 1684 | 1236 | 929 | 701 | 566 | 5374 | 25720 |
| Sequences | 17172 | 12423 | 10012 | 8420 | 7416 | 6503 | 5608 | 5094 | 53740 | 126388 |
Accuracy (%) of the models constructed with different k-mers and subsets.
| ITSset | 3-mer | 4-mer | 5-mer | 6-mer | 7-mer | 8-mer | 9-mer | 10-mer | 11-mer | 12-mer |
|---|---|---|---|---|---|---|---|---|---|---|
| ITSset_2 | 70.04 | 68.08 | 68.32 | 69.23 | 69.64 | 70.37 | 71.37 | 71.23 | 70.53 | 69.34 |
| ITSset_3 | 83.96 | 82.70 | 83.53 | 83.88 | 83.71 | 83.94 | 84.71 | 84.22 | 83.41 | 82.64 |
| ITSset_4 | 89.15 | 89.13 | 89.538 | 89.63 | 89.79 | 89.85 | 90.40 | 90.12 | 89.07 | 87.95 |
| ITSset_5 | 92.36 | 92.71 | 93.17 | 93.02 | 92.84 | 92.96 | 93.37 | 93.15 | 92.47 | 91.78 |
| ITSset_6 | 93.34 | 93.68 | 93.99 | 94.22 | 93.82 | 94.05 | 94.12 | 94.39 | 93.31 | 92.97 |
| ITSset_7 | 94.99 | 95.40 | 95.92 | 95.73 | 95.76 | 96.02 | 96.03 | 95.77 | 94.99 | 94.88 |
| ITSset_8 | 96.09 | 96.20 | 96.47 | 96.45 | 96.34 | 96.31 | 96.43 | 96.36 | 96.31 | 95.74 |
| ITSset_9 | 95.84 | 96.47 | 96.54 | 96.78 | 96.62 | 96.58 | 96.51 | 96.37 | 95.90 | 95.78 |
| ITSset_10 | 84.37 | 84.57 | 85.78 | 86.37 | 86.36 | 87.19 | 87.96 | 86.26 | 86.06 | 84.93 |
Accuracy (%) of the models constructed with different window sizes and subsets.
| ITSset | Size = 1 | Size = 2 | Size = 3 | Size = 4 | Size = 5 | Size = 6 | Size = 7 |
|---|---|---|---|---|---|---|---|
| ITSset_2 | 66.96 | 70.60 | 71.19 | 71.65 | 71.30 | 70.75 | 70.84 |
| ITSset_3 | 81.26 | 84.10 | 84.73 | 85.20 | 84.88 | 84.47 | 84.15 |
| ITSset_4 | 87.74 | 90.26 | 90.14 | 90.32 | 90.39 | 89.92 | 89.87 |
| ITSset_5 | 91.50 | 93.47 | 93.69 | 93.30 | 93.33 | 93.10 | 93.17 |
| ITSset_6 | 93.03 | 94.62 | 94.62 | 94.30 | 94.27 | 94.39 | 94.30 |
| ITSset_7 | 95.42 | 96.28 | 95.99 | 95.96 | 95.85 | 95.77 | 95.71 |
| ITSset_8 | 96.06 | 97.02 | 96.63 | 96.50 | 96.63 | 96.67 | 96.54 |
| ITSset_9 | 96.43 | 97.02 | 96.90 | 96.84 | 96.76 | 96.60 | 97.02 |
| ITSset_10 | 85.62 | 88.00 | 87.73 | 87.71 | 88.06 | 87.42 | 96.91 |
Figure 2. Accuracy, precision, recall, and MCC values of the RF model constructed using different numbers of features and estimators. Panels (a–d) show the evaluation results for ITSset_5; panels (e–h) show the evaluation results for ITSset_7.
Performance of the Its2vec model on 9 ITSsets based on optimized parameters.
| ITSset | Accuracy (%) | Precision (%) | Recall (%) | MCC |
|---|---|---|---|---|
| ITSset_2 | 78.62 | 72.10 | 78.62 | 0.79 |
| ITSset_3 | 89.70 | 85.96 | 89.70 | 0.90 |
| ITSset_4 | 93.36 | 90.69 | 93.36 | 0.96 |
| ITSset_5 | 95.51 | 93.58 | 95.51 | 0.96 |
| ITSset_6 | 95.95 | 94.09 | 95.95 | 0.96 |
| ITSset_7 | 96.96 | 95.62 | 96.96 | 0.97 |
| ITSset_8 | 97.50 | 96.37 | 97.50 | 0.98 |
| ITSset_9 | 97.53 | 96.37 | 97.53 | 0.98 |
| ITSset_10 | 90.23 | 86.48 | 90.23 | 0.90 |
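In the table above, the Recall column equals the Accuracy column in every row. That pattern is what support-weighted averaging of per-class recall produces, because the class weights cancel the per-class denominators. The averaging scheme is not stated in this record, so treating it as weighted is an assumption. The sketch below computes the four metrics on a toy label set to show the identity; the function name `evaluate`, the labels, and the choice of the multiclass form of the Matthews correlation coefficient are illustrative, not taken from the source.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def evaluate(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Accuracy, support-weighted precision/recall, and multiclass MCC (as fractions)."""
    n = len(y_true)
    true_counts = Counter(y_true)   # support of each class
    pred_counts = Counter(y_pred)   # how often each class was predicted
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct predictions per class
    correct = sum(hits.values())

    accuracy = correct / n
    # Weighted precision: a class that is never predicted contributes 0.
    precision = sum(
        true_counts[c] * (hits[c] / pred_counts[c] if pred_counts[c] else 0.0)
        for c in true_counts
    ) / n
    # Weighted recall: the support weight cancels the denominator, leaving correct / n.
    recall = sum(true_counts[c] * (hits[c] / true_counts[c]) for c in true_counts) / n

    # Multiclass MCC computed from class counts.
    cov_tp = correct * n - sum(pred_counts[c] * true_counts[c] for c in true_counts)
    cov_tt = n * n - sum(v * v for v in true_counts.values())
    cov_pp = n * n - sum(v * v for v in pred_counts.values())
    mcc = cov_tp / math.sqrt(cov_tt * cov_pp) if cov_tt and cov_pp else 0.0

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "mcc": mcc}


# Toy labels (made up for illustration): three species, six sequences.
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "b", "a"]
scores = evaluate(y_true, y_pred)
print(scores)
```

Multiplying the accuracy, precision, and recall fractions by 100 gives percentages on the same scale as the table; MCC is left as a value between -1 and 1.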
Comparison of the accuracy of the Its2vec and other existing predictors.
| ITS dataset | Classifier | Accuracy (%) | Significance |
|---|---|---|---|
| ITSset_5 | Its2vec | 95.51 ± 1.55 | a∗ |
| | RDP | 98.68 ± 0.55 | b |
| | Mothur | 97.97 ± 0.62 | bc |
| | funbarRF | 91.00 ± 2.656 | c |
| Fold-10 | Its2vec | 89.80 ± 1.92 | a |
| | RDP | 89.36 ± 2.21 | a |
| | Mothur | 85.54 ± 2.54 | b |
| | funbarRF | 84.94 ± 4.65 | b |
∗Different letters indicate significant differences among the methods according to Tukey's HSD test at P < 0.05.