Congmin Zhu, Xin Wang, Jianchu Li, Rui Jiang, Hui Chen, Ting Chen, Yuqing Yang.
Abstract
The effects of lifestyle and physiological variables on human disease risk have been shown to be mediated by gut microbiota. Concordance between case-control studies that detect disease-associated microbes has been low because of limited sample sizes and population-wide bias in lifestyle and physiological variables. To infer gut microbiota-disease associations accurately, we propose building machine learning models that include both human variables and gut microbiota. When the model with both gut microbiota and human variables performs better than the model with human variables alone, an independent gut microbiota-disease association is considered confirmed. By building models on the American Gut Project dataset, we found that gut microbiota showed distinct association strengths with different diseases. Adding gut microbiota to human variables significantly enhanced the classification performance for IBD; independent associations were found between occurrence information of gut microbiota and irritable bowel syndrome, C. difficile infection, and unhealthy status; adding gut microbiota did not improve model performance for diabetes, small intestinal bacterial overgrowth, lactose intolerance, or cardiovascular disease. Our results suggest that although gut microbiota has been reported to be associated with many diseases, a considerable proportion of these associations may be very weak. We propose a list of microbes as biomarkers for classifying IBD and unhealthy status. Further functional investigation of these microbes will improve understanding of the molecular mechanisms of human diseases.
Keywords: Disease classification; Gut microbiota; Human variables; Machine learning
Year: 2022 PMID: 34979898 PMCID: PMC8722223 DOI: 10.1186/s12866-021-02414-9
Source DB: PubMed Journal: BMC Microbiol ISSN: 1471-2180 Impact factor: 3.605
Fig. 1 Workflow of disease classification model construction. We classified eight diseases (IBD: Inflammatory Bowel Disease; CDI: C. difficile Infection; IBS: Irritable Bowel Syndrome; SIBO: Small Intestinal Bacterial Overgrowth; DI: Diabetes; LI: Lactose Intolerance; CD: Cardiovascular Disease; MD: Mental Disorder) with 30 human variables (physiological characteristics, lifestyle, location, and diet) and gut microbial community data (OTUs) obtained from the American Gut Project database, using four machine learning techniques (Random Forest, Gradient Boosting Decision Tree, Logistic Regression, and eXtreme Gradient Boosting). We propose building association models that include both human variables and gut microbiota, and assume that when the model with both gut microbiota and human variables performs better than the model with human variables alone, an independent association of gut microbiota with the disease can be confirmed
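The comparison described in the Fig. 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation: it does not train any of the four classifiers, and the function names, the toy scores, and the `min_gain` margin are hypothetical. It only shows how an AUC can be computed from predicted scores and how per-fold AUCs of a human-variables-only model and a combined model might be compared.

```python
from statistics import mean


def roc_auc(labels, scores):
    """AUC as the probability that a random case outscores a random control.

    labels: 1 for disease cases, 0 for controls. Ties count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def microbiota_adds_signal(auc_meta_folds, auc_combined_folds, min_gain=0.0):
    """Hypothetical decision rule (an assumption, not taken from the paper):
    the combined model must beat the human-variables-only model by more
    than min_gain in mean AUC across cross-validation folds.
    """
    gain = mean(auc_combined_folds) - mean(auc_meta_folds)
    return gain > min_gain, gain


# Toy predicted scores for six samples (three cases, three controls).
labels = [1, 1, 1, 0, 0, 0]
scores_meta = [0.6, 0.4, 0.7, 0.5, 0.3, 0.65]      # human variables only
scores_combined = [0.8, 0.6, 0.9, 0.5, 0.3, 0.4]   # human variables + OTUs

auc_meta = roc_auc(labels, scores_meta)
auc_combined = roc_auc(labels, scores_combined)
improved, gain = microbiota_adds_signal([auc_meta], [auc_combined])
print(auc_meta, auc_combined, improved)
```

In practice the comparison would be repeated over cross-validation folds and judged with a significance test rather than a fixed margin; the margin here is only a placeholder for that step.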
Fig. 2 Comparison of AUC values for nine diseases using five feature types
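As a rough illustration of how such feature types could be assembled, the sketch below assumes that "OTUab" denotes OTU abundance and "OTUoc" denotes OTU occurrence (presence or absence), which the abstract's phrase "occurrence information" suggests but this record does not define. The standalone names `OTUab` and `OTUoc`, the zero threshold, and the helper functions are assumptions made for illustration only.

```python
def to_occurrence(abundances, threshold=0.0):
    """Binarize OTU abundances: 1 if the OTU exceeds the threshold, else 0.

    The zero threshold is an assumption; a study may use a different cutoff.
    """
    return [1 if a > threshold else 0 for a in abundances]


def build_feature_sets(meta, otu_abundance):
    """Return one feature vector per feature type for a single sample.

    meta: human variables (already numerically encoded).
    otu_abundance: relative abundances of the OTUs.
    """
    otu_occurrence = to_occurrence(otu_abundance)
    return {
        "Meta": list(meta),
        "OTUab": list(otu_abundance),
        "OTUoc": otu_occurrence,
        "Meta-OTUab": list(meta) + list(otu_abundance),
        "Meta-OTUoc": list(meta) + otu_occurrence,
    }


# Toy sample: two human variables and four OTUs.
sample_sets = build_feature_sets([34.0, 22.5], [0.0, 0.12, 0.0, 0.03])
print(sample_sets["Meta-OTUoc"])
```

Each feature set would then be passed to the same classifier so that differences in AUC reflect the features rather than the model.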
Top 10 most important features using three types of feature sets for IBD
| Feature Name (Meta) | Rank (Meta) | Feature Name (Meta-OTUab) | Rank (Meta-OTUab) | Feature Name (Meta-OTUoc) | Rank (Meta-OTUoc) |
|---|---|---|---|---|---|
| ELEVATION | 5 | PROBIOTIC_FREQUENCY | 3.7 | p_Firmicutes;c_Clostridia;o_Clostridiales;f_Lachnospiraceae;g_;s_ | 5 |
| VITAMIN_B_SUPPLEMENT_FREQUENCY | 6 | p_Firmicutes;c_Erysipelotrichi;o_Erysipelotrichales;f_Erysipelotrichaceae;g_Holdemania;s_ | 4.7 | p_Firmicutes;c_Clostridia;o_Clostridiales;f_Ruminococcaceae;g_Ruminococcus;s_ | 14 |
| PROBIOTIC_FREQUENCY | 6.3 | p_Firmicutes;c_Clostridia;o_Clostridiales;f_Lachnospiraceae;g_;s_ | 9.8 | PROBIOTIC_FREQUENCY | 15.6 |
| LATITUDE | 7.3 | p_Bacteroidetes;c_Bacteroidia;o_Bacteroidales;f_Rikenellaceae;g_Alistipes;s_indistinctus | 13.4 | EXERCISE_FREQUENCY | 24.7 |
| SALTED_SNACKS_FREQUENCY | 7.7 | p_Firmicutes;c_Clostridia;o_Clostridiales;f_Lachnospiraceae;g_Coprococcus;s_ | 14.5 | p_Firmicutes;c_Clostridia;o_Clostridiales;f_Ruminococcaceae;g_;s_ | 25.3 |
| AGE_CORRECTED | 7.7 | p_Firmicutes;c_Clostridia;o_Clostridiales;f_Ruminococcaceae;g_Ruminococcus;s_ | 21 | p_Firmicutes;c_Clostridia;o_Clostridiales;f_;g_;s_ | 45.4 |
| BMI | 7.9 | p_Proteobacteria;c_Gammaproteobacteria;o_Enterobacteriales;f_Enterobacteriaceae;g_Morganella;s_ | 22.3 | WEIGHT_KG | 48.5 |
| MILK_CHEESE_FREQUENCY | 8.1 | VITAMIN_D_SUPPLEMENT_FREQUENCY | 23.3 | p_Firmicutes;c_Clostridia;o_Clostridiales;f_Lachnospiraceae;g_;s_ | 50.7 |
| FROZEN_DESSERT_FREQUENCY | 9.7 | p_Firmicutes;c_Clostridia;o_Clostridiales;f_Lachnospiraceae;g_[Ruminococcus];s_ | 29.1 | Caucasian | 51.2 |
| VITAMIN_D_SUPPLEMENT_FREQUENCY | 9.7 | VITAMIN_B_SUPPLEMENT_FREQUENCY | 31.3 | p_Firmicutes;c_Clostridia;o_Clostridiales;f_Lachnospiraceae;g_;s_ | 51.3 |
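The fractional ranks in the table above (for example 3.7 or 6.3) suggest that ranks were averaged over several models or runs, although this record does not state how they were obtained. The sketch below shows one way such a mean rank could be computed; the averaging scheme, the function names, and the toy importance values are all assumptions.

```python
from statistics import mean


def rank_features(importances):
    """Rank features by importance within one model (1 = most important)."""
    ordered = sorted(importances, key=importances.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def mean_ranks(importances_per_model):
    """Average each feature's rank across models (assumed scheme)."""
    per_model = [rank_features(imp) for imp in importances_per_model]
    names = per_model[0].keys()
    return {name: mean(ranks[name] for ranks in per_model) for name in names}


# Toy importance scores from three hypothetical models.
models = [
    {"ELEVATION": 0.50, "LATITUDE": 0.30, "BMI": 0.20},
    {"ELEVATION": 0.30, "LATITUDE": 0.60, "BMI": 0.10},
    {"ELEVATION": 0.40, "LATITUDE": 0.25, "BMI": 0.35},
]
averaged = mean_ranks(models)
top_features = sorted(averaged, key=averaged.get)
print(top_features)
```

Sorting by mean rank, lowest first, gives an ordering analogous to the "Rank" columns, under the stated assumption.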
Top 10 features using Meta-OTUoc for classifying IBS, CDI, and UH
| Rank | IBS | CDI | UH |
|---|---|---|---|
| 1 | LATITUDE | p_Firmicutes;c_Erysipelotrichi;o_Erysipelotrichales;f_Erysipelotrichaceae;g_;s_ | MILK_CHEESE_FREQUENCY |
| 2 | MILK_CHEESE_FREQUENCY | p_Firmicutes;c_Clostridia;o_Clostridiales;f_Lachnospiraceae;g_;s_ | PROBIOTIC_FREQUENCY |
| 3 | PROBIOTIC_FREQUENCY | VITAMIN_B_SUPPLEMENT_FREQUENCY | MILK_SUBSTITUTE_FREQUENCY |
| 4 | AGE_CORRECTED | BMI | p_Firmicutes;c_Clostridia;o_Clostridiales;f_;g_;s_ |
| 5 | female | WEIGHT_KG | p_Firmicutes;c_Clostridia;o_Clostridiales;f_Lachnospiraceae;g_;s_ |
| 6 | WEIGHT_KG | p_Firmicutes;c_Clostridia;o_Clostridiales;f_Ruminococcaceae;g_;s_ | AGE_CORRECTED |
| 7 | MILK_SUBSTITUTE_FREQUENCY | EXERCISE_FREQUENCY | FROZEN_DESSERT_FREQUENCY |
| 8 | HEIGHT_CM | p_Firmicutes;c_Clostridia;o_Clostridiales;f_Ruminococcaceae;g_Oscillospira;s_ | POULTRY_FREQUENCY |
| 9 | EXERCISE_FREQUENCY | PROBIOTIC_FREQUENCY | VITAMIN_B_SUPPLEMENT_FREQUENCY |
| 10 | p_Firmicutes;c_Clostridia;o_Clostridiales;f_;g_;s_ | p_Bacteroidetes;c_Bacteroidia;o_Bacteroidales;f_[Barnesiellaceae];g_;s_ | p_Firmicutes;c_Clostridia;o_Clostridiales;f_;g_;s_ |
Fig. 3 Feature distribution for the best model (the one with the highest AUC). Features are distinguished by color and shape. OTUs are annotated at the order level. In all subgraphs, the order of host variables and OTUs is fixed and consistent, and OTUs are sorted in descending order of average size
Top 10 features using a combination of Meta and OTUs for classifying DI, SIBO, LI, and CD
| Rank | DI (Meta-OTUoc) | SIBO (Meta-OTUab) | LI (Meta-OTUoc) | CD (Meta-OTUab) |
|---|---|---|---|---|
| 1 | BMI | MILK_CHEESE_FREQUENCY | MILK_SUBSTITUTE_FREQUENCY | AGE_CORRECTED |
| 2 | AGE_CORRECTED | PROBIOTIC_FREQUENCY | MILK_CHEESE_FREQUENCY | WEIGHT_KG |
| 3 | p_Firmicutes;c_Clostridia;o_Clostridiales;f_Lachnospiraceae;g_;s_ | WHOLE_GRAIN_FREQUENCY | FROZEN_DESSERT_FREQUENCY | p_Firmicutes;c_Clostridia;o_Clostridiales;f_Ruminococcaceae;g_;s_ |
| 4 | WEIGHT_KG | FROZEN_DESSERT_FREQUENCY | POULTRY_FREQUENCY | p_Bacteroidetes;c_Bacteroidia;o_Bacteroidales;f_[Barnesiellaceae];g_;s_ |
| 5 | p_Proteobacteria;c_Deltaproteobacteria;o_Desulfovibrionales;f_Desulfovibrionaceae;g_Desulfovibrio;s_ | WEIGHT_KG | Caucasian | p_Firmicutes;c_Clostridia;o_Clostridiales;f_Lachnospiraceae;g_Coprococcus;s_ |
| 6 | p_Actinobacteria;c_Coriobacteriia;o_Coriobacteriales;f_Coriobacteriaceae;g_;s_ | p_Firmicutes;c_Clostridia;o_Clostridiales;f_Lachnospiraceae;g_[Ruminococcus];s_ | ELEVATION | HEIGHT_CM |
| 7 | p_Proteobacteria;c_Deltaproteobacteria;o_Desulfovibrionales;f_Desulfovibrionaceae;g_Bilophila;s_ | p_Actinobacteria;c_Coriobacteriia;o_Coriobacteriales;f_Coriobacteriaceae;g_Collinsella;s_aerofaciens | HIGH_FAT_RED_MEAT_FREQUENCY | p_Firmicutes;c_Clostridia;o_Clostridiales;f_Ruminococcaceae;g_Ruminococcus;s_ |
| 8 | p_Proteobacteria;c_Gammaproteobacteria;o_Enterobacteriales;f_Enterobacteriaceae;g_;s_ | p_Firmicutes;c_Clostridia;o_Clostridiales;f_Ruminococcaceae;g_Oscillospira;s_ | p_Firmicutes;c_Clostridia;o_Clostridiales;f_Lachnospiraceae;g_;s_ | p_Firmicutes;c_Clostridia;o_Clostridiales;f_Ruminococcaceae;g_;s_ |
| 9 | p_Firmicutes;c_Clostridia;o_Clostridiales;f_Ruminococcaceae;g_;s_ | p_Firmicutes;c_Bacilli;o_Lactobacillales;f_Streptococcaceae;g_Streptococcus;s_ | BMI | p_Firmicutes;c_Clostridia;o_Clostridiales;f_Lachnospiraceae;g_;s_ |
| 10 | p_Proteobacteria;c_Gammaproteobacteria;o_Pseudomonadales;f_Pseudomonadaceae;g_;s_ | VITAMIN_B_SUPPLEMENT_FREQUENCY | RED_MEAT_FREQUENCY | p_Firmicutes;c_Clostridia;o_Clostridiales;f_;g_;s_ |
Fig. 4 Performance of four machine learning methods across different feature types and disease predictions. Circle color indicates the machine learning method, and circle size indicates the standard deviation
Fig. 5 Change in the AUC of the optimal model with the number of OTUs. The optimal model is shown for four diseases (IBD: Inflammatory Bowel Disease; IBS: Irritable Bowel Syndrome; DI: Diabetes; UH: Unhealthy status)