Honglong Wu, Lihua Cai, Dongfang Li, Xinying Wang, Shancen Zhao, Fuhao Zou, Ke Zhou.
Abstract
Dysbiosis of the human microbiome has been shown to be associated with the development of many human diseases. Metagenome sequencing has emerged as a powerful tool for investigating the effects of the microbiome on disease. Identifying human gut microbiome markers associated with abnormal phenotypes may facilitate feature selection for multiclass classification. Compared with binary classifiers, multiclass classification models must capture more complex discriminative patterns. Here, we developed a pipeline to address the challenging characterization of multilabel samples. In this study, a total of 300 biomarkers were selected from the microbiome of 806 Chinese individuals (383 controls, 170 with type 2 diabetes, 130 with rheumatoid arthritis, and 123 with liver cirrhosis), and a logistic regression prediction algorithm was then applied with those markers as the model's intrinsic features. The resulting model produced an F1 score of 0.9142, which was better than that of other popular classification methods, and an average area under the receiver operating characteristic (ROC) curve of 0.9475 showed a significant correlation between the selected microbiome biomarkers and the corresponding phenotypes. These results indicate that machine learning is a vital tool for mining microbiome data to identify disease-related biomarkers, which may contribute to the application of microbiome-based precision medicine in the future.
Year: 2018 PMID: 29568746 PMCID: PMC5820663 DOI: 10.1155/2018/2936257
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Figure 1. The pipeline of data mining procedures. The pipeline of this study consists of preprocessing the data (converting SRA to FASTQ, checking that clinical information is available, and discarding samples without complete clinical information), aligning to the IGC and constructing the abundance matrix, feature selection and algorithm training, and biological interpretation.
Statistics on sample information.
| Characteristic | Phenotype: Normal (n = 383) | Phenotype: Abnormal (n = 423) |
|---|---|---|
| | 50.99 (14.36) | 51.35 (12.00) |
| Male (1) | 183 | 215 |
| Female (0) | 200 | 208 |
| | 23.81 (3.78) | 23.30 (3.26) |
| Type 2 diabetes | 383 (0) | 170 (1) |
| Rheumatoid arthritis | | 130 (2) |
| Liver cirrhosis | | 123 (3) |
Note. The number in parentheses indicates the phenotype label (normal = 0, type 2 diabetes = 1, rheumatoid arthritis = 2, and liver cirrhosis = 3).
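The label encoding in the note can be written down directly. The following is a minimal sketch, not the authors' code: the names `PHENOTYPE_LABELS`, `COHORT_SIZES`, `encode`, and `class_fractions` are hypothetical, and only the label values and cohort sizes come from the text above. It also shows how unevenly the four classes are represented.

```python
from collections import Counter

# Label encoding stated in the table note.
PHENOTYPE_LABELS = {
    "normal": 0,
    "type 2 diabetes": 1,
    "rheumatoid arthritis": 2,
    "liver cirrhosis": 3,
}

# Cohort sizes as reported in the abstract and the table.
COHORT_SIZES = {
    "normal": 383,
    "type 2 diabetes": 170,
    "rheumatoid arthritis": 130,
    "liver cirrhosis": 123,
}


def encode(phenotypes):
    """Map phenotype names to integer class labels."""
    return [PHENOTYPE_LABELS[p] for p in phenotypes]


def class_fractions(sizes):
    """Return the fraction of the cohort that falls in each class."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


# Expand the cohort sizes into one integer label per individual.
labels = encode([name for name, n in COHORT_SIZES.items() for _ in range(n)])
label_counts = Counter(labels)
fractions = class_fractions(COHORT_SIZES)
```

The normal class makes up a little under half of the 806 individuals, while each disease class is considerably smaller.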
The evaluation of algorithms based on F1 score.
| | KNN | LR | RF | SVM | GBDT | SGD | ADA |
|---|---|---|---|---|---|---|---|
| NearMiss3ᵃ | 0.6628 | 0.7510 | 0.7888 | 0.7184 | 0.8282 | 0.7696 | 0.7956 |
| SMOTEENNᵇ | 0.8602 | **0.9142** | 0.8341 | 0.9138 | 0.8741 | 0.8360 | 0.8959 |
| # of Markers | 280 | 300 | 220 | 320 | 160 | 220 | 220 |
KNN, K-nearest neighbor; LR, logistic regression; RF, random forest; SVM, support vector machine; GBDT, gradient boosting decision tree; SGD, stochastic gradient descent; ADA, adaptive boosting. ᵃ NearMiss3 is an undersampling method based on the K-nearest neighbor (KNN) classifier; ᵇ SMOTEENN is a resampling method that oversamples the training set and then removes samples based on their three nearest neighbors. Bold font indicates the best result.
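For readers unfamiliar with the F1 score in a four-class setting, here is a minimal sketch of one common variant, the macro-averaged F1. The source does not say which averaging scheme was used, so the macro average is an assumption; `macro_f1` and the toy labels are hypothetical.

```python
def macro_f1(y_true, y_pred, labels=(0, 1, 2, 3)):
    """Unweighted mean of the per-class F1 scores (macro averaging, assumed)."""
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels using the phenotype encoding 0..3; not data from the study.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 1, 0, 2, 2, 3, 2]
score = macro_f1(y_true, y_pred)
```

Macro averaging weights every phenotype equally regardless of its sample count, which is one reason it is often preferred when classes are imbalanced.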
Figure 2. ROC plot of the 5-fold cross-validation. For the repeated cross-validation, the individual curves are plotted in green, with each class repeated five times, and the mean curve is plotted in blue. The confidence interval is shaded in grey.
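A 5-fold split of a cohort like this one can be sketched as follows. This is an illustration under assumptions: the source does not state whether the folds were stratified by phenotype, and `stratified_k_fold` is a hypothetical helper written with the standard library only.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs that keep class proportions per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices round-robin so each fold gets a near-equal share.
        for j, i in enumerate(indices):
            folds[j % k].append(i)
    for f in range(k):
        test_idx = sorted(folds[f])
        train_idx = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train_idx, test_idx


# One label per individual, using the cohort sizes reported in the abstract.
labels = [0] * 383 + [1] * 170 + [2] * 130 + [3] * 123
splits = list(stratified_k_fold(labels))
```

Each sample appears in exactly one test fold, so five ROC curves per class can be drawn and then averaged.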
Figure 3. ROC plot and AUC. The curves for all classes are plotted on the same figure (normal = blue, type 2 diabetes = green, rheumatoid arthritis = red, and liver cirrhosis = aqua blue); the numbers in parentheses are the AUC values.
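Per-class AUC values of this kind are typically obtained one class at a time, treating that class as positive and the rest as negative. The sketch below assumes this one-vs-rest scheme, which the source does not confirm. It computes the AUC through its rank interpretation rather than by integrating a curve; `binary_auc`, `one_vs_rest_auc`, and the probabilities are hypothetical.

```python
def binary_auc(scores, is_positive):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(y_true, proba, n_classes=4):
    """Per-class AUC, treating each class in turn as the positive class."""
    return [
        binary_auc([row[c] for row in proba], [y == c for y in y_true])
        for c in range(n_classes)
    ]


# Made-up class probabilities for six samples, for illustration only.
y_true = [0, 0, 1, 2, 3, 3]
proba = [
    [0.7, 0.1, 0.1, 0.1],
    [0.4, 0.3, 0.2, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.2, 0.6, 0.1],
    [0.1, 0.1, 0.2, 0.6],
    [0.5, 0.1, 0.1, 0.3],
]
aucs = one_vs_rest_auc(y_true, proba)
mean_auc = sum(aucs) / len(aucs)
```

Averaging the four per-class values gives a single summary number comparable in form to the average reported in the abstract.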
Figure 4. Venn diagram of phenotype-specific biomarkers. Each circle represents one phenotype, and each number is the count of biomarkers in that region.
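The counts in a Venn diagram like this correspond to set operations on the per-phenotype marker lists. A minimal sketch follows, assuming each phenotype has a set of marker identifiers. The function `venn_regions` and the marker names `g1` to `g6` are hypothetical and do not come from the study.

```python
from itertools import combinations


def venn_regions(marker_sets):
    """Count the markers in every exclusive region of a Venn diagram."""
    names = list(marker_sets)
    regions = {}
    for r in range(1, len(names) + 1):
        for group in combinations(names, r):
            # Markers shared by every phenotype in the group...
            inside = set.intersection(*(marker_sets[n] for n in group))
            # ...minus markers that also occur in any phenotype outside it.
            others = [marker_sets[n] for n in names if n not in group]
            outside = set.union(*others) if others else set()
            regions[group] = len(inside - outside)
    return regions


# Hypothetical marker identifiers, for illustration only.
markers = {
    "type 2 diabetes": {"g1", "g2", "g3", "g4"},
    "rheumatoid arthritis": {"g3", "g4", "g5"},
    "liver cirrhosis": {"g4", "g6"},
}
regions = venn_regions(markers)
```

Because the regions are mutually exclusive, their counts add up to the number of distinct markers across all phenotypes.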