Shunyao Wu, Yuzhu Chen, Zhiruo Li, Jian Li, Fengyang Zhao, Xiaoquan Su.
Abstract
Machine learning (ML) has been widely used in microbiome research for biomarker selection and disease prediction. By training on microbial profiles of samples from patients and healthy controls, ML classifiers construct data models from community features that are highly correlated with the target diseases, so as to determine the status of new samples. To clearly understand the host-microbe interactions of specific diseases, previous studies have typically focused on well-designed cohorts in which each sample was labeled with exactly one status type. In practice, however, an individual may be associated with multiple diseases simultaneously, which introduces additional variation in microbial patterns that interferes with status detection. More importantly, comorbidities or complications can be missed by regular ML models, limiting the practical application of microbiome techniques. In this review, we summarize the typical ML approaches to single-label classification for microbiome research and demonstrate their limitations in multi-label disease detection using a real dataset. We then look ahead to a further step for ML, multi-label classification, which could potentially solve this problem, and discuss a series of promising strategies and key technical issues for applying multi-label classification in microbiome-based studies.
Keywords: Machine learning; Microbiome; Multi-label classification; Single-label classification
Year: 2021 PMID: 34093989 PMCID: PMC8131981 DOI: 10.1016/j.csbj.2021.04.054
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Fig. 1 Comparison of single-label classification and multi-label classification. a. Single-label classification requires that each sample have exactly one label (status). b. Multi-label classification can detect more than one status for each sample.
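The difference between the two settings in Fig. 1 can be sketched in terms of how labels are encoded. The snippet below is a minimal illustration only: the sample IDs and their statuses are made up, and the helper names `to_indicator` and `to_single_label` are hypothetical, not taken from the review.

```python
# Hypothetical sample annotations. The disease names appear in this article's
# tables, but the samples themselves are invented for illustration.
samples = {
    "s1": {"IBS"},
    "s2": {"IBS", "Migraine"},  # comorbid sample
    "s3": set(),                # healthy control
    "s4": {"Thyroid"},
}
labels = ["IBS", "Migraine", "Thyroid"]


def to_indicator(statuses, labels):
    """Multi-label encoding: one 0/1 entry per disease."""
    return [1 if lab in statuses else 0 for lab in labels]


def to_single_label(statuses):
    """Single-label encoding: exactly one status per sample.

    Returns None when the sample carries more than one disease,
    because a single label cannot represent it.
    """
    if not statuses:
        return "Healthy"
    if len(statuses) == 1:
        return next(iter(statuses))
    return None


indicator = {sid: to_indicator(st, labels) for sid, st in samples.items()}
single = {sid: to_single_label(st) for sid, st in samples.items()}

for sid in samples:
    print(sid, single[sid], indicator[sid])
```

Under this encoding the comorbid sample `s2` has no valid single label, while its multi-label indicator row simply has two entries set to 1.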
Characteristics of machine learning methods widely used for microbiome-based disease detection.
| ML approach | Feature importance measurement | Interpretability | Package and applicable programming language |
|---|---|---|---|
| LR | Y | Excellent | Scikit-learn (Python) |
| SVM | Y | Good | Scikit-learn (Python), LibSVM |
| | N | Weak | Scikit-learn (Python) |
| RF | Y | Good | Scikit-learn (Python) |
| GBDT | Y | Good | XGBoost (Python/R/C++) |
| Neural Networks | N | Weak | TensorFlow (Python/Java) |
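As a rough illustration of what "feature importance measurement" means for one of the listed approaches, the sketch below fits a scikit-learn random forest (RF) on synthetic relative abundances and reads its `feature_importances_` attribute. Everything about the data is an assumption made for the example: the taxa are anonymous columns, and the disease label is constructed to depend on the first taxon only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_samples, n_taxa = 200, 10

# Synthetic profiles, normalized so each row sums to 1 (relative abundance).
X = rng.random((n_samples, n_taxa))
X = X / X.sum(axis=1, keepdims=True)

# Assumption for illustration: disease status is driven by taxon 0 alone.
y = (X[:, 0] > np.median(X[:, 0])).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Impurity-based importances, one value per taxon, summing to 1.
importances = clf.feature_importances_
ranking = np.argsort(importances)[::-1]
for idx in ranking[:3]:
    print(f"taxon_{idx}: {importances[idx]:.3f}")
```

On real microbiome data the ranking would be used as a list of candidate biomarkers, to be checked against an independent statistical test.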
Brief summary of samples labeled with target diseases.
| Target disease | Total number of disease samples | Number of single-disease samples | Number of comorbidity samples |
|---|---|---|---|
| IBS | 2351 | 1064 | 1287 |
| Autoimmune | 2301 | 487 | 1814 |
| Lung disease | 2251 | 1248 | 1003 |
| Migraine | 2109 | 938 | 1171 |
| Thyroid | 1814 | 559 | 1255 |
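The counts in this table can be checked for internal consistency and turned into comorbidity fractions. The sketch below assumes the three numeric columns mean total, single-disease, and comorbidity samples, so that the last two should add up to the first.

```python
# (total, single-disease, comorbidity) counts copied from the table above.
counts = {
    "IBS": (2351, 1064, 1287),
    "Autoimmune": (2301, 487, 1814),
    "Lung disease": (2251, 1248, 1003),
    "Migraine": (2109, 938, 1171),
    "Thyroid": (1814, 559, 1255),
}

comorbid_fraction = {}
for disease, (total, single, comorbid) in counts.items():
    # Each disease sample is either single-disease or comorbid.
    if single + comorbid != total:
        raise ValueError(f"inconsistent counts for {disease}")
    comorbid_fraction[disease] = comorbid / total

for disease, frac in sorted(comorbid_fraction.items(), key=lambda kv: -kv[1]):
    print(f"{disease}: {frac:.1%} of disease samples are comorbid")
```

For four of the five target diseases more than half of the samples are comorbid, with autoimmune disease having the largest share; lung disease is the exception, at under half.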
Results of single-label classifiers on target disease detection.
a. Performance (AUC) on IBS

| Classifier | Test: SD, Train: SD | Test: SD, Train: MD | Test: MD, Train: SD | Test: MD, Train: MD |
|---|---|---|---|---|
| RF | 0.681 ± 0.039 | 0.661 ± 0.032 | 0.718 ± 0.025 | 0.757 ± 0.018 |
| GBDT | 0.713 ± 0.025 | 0.689 ± 0.036 | 0.731 ± 0.022 | 0.787 ± 0.015 |
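For readers unfamiliar with the metric, the sketch below shows one way entries of the form "AUC ± spread" could be produced. It is an assumption that the ± values are standard deviations over repeated train/test splits; the source does not say. The scores are invented, and `auc` and `summarize` are hypothetical helper names.

```python
from statistics import mean, stdev


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive sample is
    scored higher than a randomly chosen negative one (ties count 0.5).
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def summarize(aucs):
    """Format repeated AUC values as mean ± sample standard deviation."""
    return f"{mean(aucs):.3f} ± {stdev(aucs):.3f}"


# Invented predictions for one train/test combination.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(y_true, scores))

# Invented AUC values from three repeated splits.
print(summarize([0.70, 0.72, 0.68]))
```

Each cell of the table would correspond to one such summary, computed for a given pairing of training and testing sets.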
Fig. 2 Microbial biomarkers of autoimmune disease selected from SD and MD by a distribution-free independence test.
Fig. 3 The decision tree of the GBDT binary classifier constructed from SD (A) was less complicated than that constructed from MD (B). In each tree, internal nodes represent genus-level taxa, leaf nodes represent labels, and branch weights represent decision criteria.
Fig. 4 Three key technical issues in multi-label classification. a. Too many labels in the training data lead to unexpectedly high computational cost. b. Missed labels reduce detection sensitivity. c. Ambiguous labels introduce false positive results.
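The first issue in Fig. 4 can be made concrete with a small count. The sketch below compares two common ways of handling multiple labels, which are assumptions chosen for illustration rather than strategies confirmed by the captions here: training one binary classifier per label, versus treating every distinct combination of labels as its own class. The function names are hypothetical.

```python
from itertools import combinations

# The five target diseases listed in this article's sample summary table.
diseases = ["IBS", "Autoimmune", "Lung disease", "Migraine", "Thyroid"]


def n_binary_classifiers(k):
    """One binary model per label: cost grows linearly with k."""
    return k


def n_label_combinations(k):
    """Upper bound on distinct label sets, including the empty (healthy) set."""
    return 2 ** k


def label_sets(labels, max_size):
    """Enumerate non-empty label sets containing at most max_size labels."""
    return [
        set(combo)
        for size in range(1, max_size + 1)
        for combo in combinations(labels, size)
    ]


k = len(diseases)
print(n_binary_classifiers(k), n_label_combinations(k))
print(len(label_sets(diseases, 2)))
```

With 5 labels the gap is small (5 models versus up to 32 classes), but with 20 labels the number of possible combinations already exceeds one million, which is one way the cost can grow unexpectedly.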