Celia Díez López, Diego Montiel González, Athina Vidaki, Manfred Kayser.
Abstract
Human microbiome research is moving from characterization and association studies to translational applications in medical research, clinical diagnostics, and other fields. One of these applications is the prediction of human traits, where machine learning (ML) methods are often employed but face practical challenges. Class imbalance in available microbiome data is one of the major problems: if unaccounted for, it leads to spurious prediction accuracies and limits the classifier's generalization. Here, we investigated the predictability of smoking habits from class-imbalanced saliva microbiome data by combining data augmentation techniques, which account for class imbalance, with ML methods for prediction. We collected publicly available saliva 16S rRNA gene sequencing data and smoking habit metadata, which showed a serious class imbalance problem, i.e., 175 current vs. 1,070 non-current smokers. Three data augmentation techniques (synthetic minority over-sampling technique, adaptive synthetic, and tree-based associative data augmentation) were applied together with seven ML methods: logistic regression, k-nearest neighbors, support vector machine with linear and radial kernels, decision trees, random forest, and extreme gradient boosting. K-fold nested cross-validation was used with the different augmented data types and baseline non-augmented data to validate the prediction outcome. Combining data augmentation with ML generally outperformed baseline methods in our dataset. The final prediction model combined tree-based associative data augmentation and support vector machine with linear kernel, and achieved a classification performance, expressed as Matthews correlation coefficient, of 0.36 and an AUC of 0.81. Our method successfully addresses the problem of class imbalance in microbiome data for reliable prediction of smoking habits.
Keywords: class imbalance; data augmentation; human microbiome; machine learning; prediction modeling; saliva microbiome; smoking status; trait prediction
Year: 2022 PMID: 35928158 PMCID: PMC9343866 DOI: 10.3389/fmicb.2022.886201
Source DB: PubMed Journal: Front Microbiol ISSN: 1664-302X Impact factor: 6.064
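The abstract names the synthetic minority over-sampling technique (SMOTE) as one of the augmentation methods. As a rough illustration of the general idea only, and not the authors' implementation, the sketch below creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority-class neighbors. The function name `smote_like`, the toy abundance values, and the parameter choices are all hypothetical; the adaptive synthetic and tree-based associative techniques are not shown.

```python
import random

def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority rows by squared Euclidean distance to `base`.
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)))
        neighbor = rng.choice(others[:k])  # one of the k nearest neighbors
        gap = rng.random()                 # position on the segment base -> neighbor
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbor)])
    return synthetic

# Toy relative-abundance profiles for the minority class (hypothetical values).
current_smokers = [
    [0.50, 0.30, 0.20],
    [0.45, 0.35, 0.20],
    [0.60, 0.25, 0.15],
    [0.55, 0.20, 0.25],
]
new_rows = smote_like(current_smokers, n_new=6, k=2)
print(len(new_rows))
```

Because each synthetic row is a convex combination of two existing rows, rows that sum to 1 (relative abundances) still sum to 1 after interpolation.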
Characteristics of the two saliva microbiome datasets used in this study.

| Characteristic | S1, n (%) | S2, n (%) |
| --- | --- | --- |
| Smoking habit | | |
| Never smoker | 473 (43.5) | 39 (24.8) |
| Former smoker | 519 (47.7) | 39 (24.8) |
| Current smoker | 96 (8.8) | 79 (50.4) |
| Sex | | |
| Female | 429 (39.4) | 88 (56.1) |
| Male | 659 (60.6) | 69 (43.9) |
| Age (years) | | |
| 20–29 | – | 20 (12.7) |
| 30–39 | – | 31 (19.8) |
| 40–49 | – | 40 (25.5) |
| 50–59 | 147 (13.5) | 29 (18.5) |
| 60–69 | 505 (46.4) | 21 (13.4) |
| 70–79 | 377 (34.7) | 9 (5.7) |
| 80–89 | 59 (5.4) | 6 (3.8) |
| ≥90 | – | 1 (0.6) |
| Ancestry | | |
| European | 1,028 (94.5) | 59 (37.6) |
| Non-European | 60 (5.5) | 98 (62.4) |
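The smoking-habit counts in the table above can be summed across the two datasets to recover the class totals quoted in the abstract. The short sketch below does that arithmetic and then scores a trivial classifier that labels every sample "non-current", to show why plain accuracy is misleading under this imbalance. The helper `mcc` is written for illustration, and returning 0 when the denominator is zero is a common convention assumed here.

```python
from math import sqrt

# Smoking-habit counts per dataset (first column, second column) from the table.
counts = {
    "never": (473, 39),
    "former": (519, 39),
    "current": (96, 79),
}

current = sum(counts["current"])
non_current = sum(counts["never"]) + sum(counts["former"])
total = current + non_current

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0 by convention if a margin is empty."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Trivial classifier: predict "non-current" for everyone (positive = current).
tp, fn = 0, current
tn, fp = non_current, 0
accuracy = (tp + tn) / total

print(current, non_current)   # 175 1070
print(round(accuracy, 3))     # 0.859
print(mcc(tp, tn, fp, fn))    # 0.0
```

The majority-class guess reaches about 86% accuracy without identifying a single current smoker, while its Matthews correlation coefficient is 0.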
Figure 1. Overview of the study's analytical strategy. (A–C) The original dataset was split into a training set (80%) (purple box in B) and a holdout test set (20%) (red box in C), maintaining the original ratio between classes in both partitions. Data augmentation techniques were applied to the training set, giving a total of six different input data types (d = 6): the baseline non-augmented data and five differently augmented data types. (D) For the nested cross-validation (nCV) approach, the training set was split into five outer k-folds, each consisting of a training set (80%) (orange box in D) and a test set (20%) (blue box in D). (E) Each outer k-fold was split into two inner n-folds of training (50%) and validation (50%) sets (orange box in E), in which seven different machine learning (ML) models (m = 7) were optimized and validated (inner models). (F) The best-performing n-fold inner model (green box in F) was applied to the corresponding k-fold test set (green arrow to blue box in F). (G) For each k-fold test set, two performance metrics were obtained: Matthews correlation coefficient (MCC) and area under the receiver operating characteristic curve (AUC). Steps (D) to (G) were repeated for every combination of input data type (d = 6) and ML method (m = 7), a total of 42 different approaches. (H) Steps (A) to (G) were repeated 10 times (i = 10) to control for variation introduced by the data partitions. (I) The best-performing combination of data type and ML method was selected based on the MCC metric and trained on the full final 80% training set to create the final prediction model. (J) The final prediction model was validated on the final 20% holdout test set.
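The partitioning scheme in steps (A) to (E) can be sketched with index lists alone. The code below is a minimal illustration under stated assumptions, not the authors' pipeline: it uses only the class totals from the abstract (175 current, 1,070 non-current), the helpers `stratified_folds` and `train_test_pairs` are hypothetical names, and data augmentation, model tuning, and scoring are left as comments.

```python
import random
from collections import defaultdict

def stratified_folds(indices, labels, n_folds, rng):
    """Split indices into n_folds parts, keeping the class ratio in each part."""
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    folds = [[] for _ in range(n_folds)]
    for members in by_class.values():
        rng.shuffle(members)
        for pos, i in enumerate(members):
            folds[pos % n_folds].append(i)  # deal class members round-robin
    return folds

def train_test_pairs(folds):
    """Yield (train, test) index lists, using each fold once as the test part."""
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

rng = random.Random(0)
labels = [1] * 175 + [0] * 1070   # 1 = current smoker, 0 = non-current
all_idx = list(range(len(labels)))

# (A-C) Stratified 80/20 split: one of five parts becomes the holdout test set.
parts = stratified_folds(all_idx, labels, 5, rng)
holdout = parts[0]
training = [i for part in parts[1:] for i in part]

# (D) Five outer folds on the training set; (E) two inner folds per outer training part.
outer_splits = []
for outer_train, outer_test in train_test_pairs(stratified_folds(training, labels, 5, rng)):
    inner_splits = list(train_test_pairs(stratified_folds(outer_train, labels, 2, rng)))
    # Inner loop (not shown): tune each ML method on inner train, score on inner validation.
    # Outer loop (not shown): apply the best inner model to outer_test, record MCC and AUC.
    outer_splits.append((outer_train, outer_test, inner_splits))

print(len(training), len(holdout), len(outer_splits))
```

Keeping the holdout indices out of every fold is what lets step (J) serve as an independent check on the model selected in step (I).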
Figure 2. Validation of data types with machine learning (ML) methods for microbiome-based prediction of smoking habits based on the S1 and S2 datasets together. For each ML method, we evaluated six types of input data: baseline non-augmented data and five augmented datasets based on different methods (ADASYN-1, ADASYN-2, SMOTE-1, SMOTE-2, and TADA). (A) Matthews correlation coefficient (MCC) and (B) area under the receiver operating characteristic curve (AUC) values from the 5-fold nested cross-validation repeated 10 times (5 × 10). For MCC, +1 represents a perfect prediction, 0 a random prediction, and −1 a perfect inverse prediction. For AUC, 1 indicates a perfectly accurate prediction and 0.5 indicates a random prediction. ML method abbreviations: DT, decision trees; KNN, k-nearest neighbors; LR, logistic regression; RF, random forest; SVML, support vector machine with linear kernel; SVMR, support vector machine with radial kernel; XGBoost, extreme gradient boosting.
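The AUC reference points in the caption can be checked with a small calculation. The sketch below uses the rank interpretation of AUC, the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function `roc_auc` and the toy labels and scores are illustrative assumptions, not values from the study.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # Count pairwise wins for the positive class; a tie counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y = [1, 1, 0, 0, 0, 0]  # toy labels: 1 = current smoker, 0 = non-current
auc_perfect = roc_auc(y, [0.9, 0.8, 0.3, 0.2, 0.1, 0.4])   # positives ranked on top
auc_constant = roc_auc(y, [0.5] * 6)                        # uninformative score
auc_inverted = roc_auc(y, [0.1, 0.2, 0.6, 0.7, 0.8, 0.9])  # ranking fully reversed

print(auc_perfect, auc_constant, auc_inverted)  # 1.0 0.5 0.0
```

Because AUC depends only on how samples are ranked, it is unaffected by the classification threshold, which is one reason it is reported alongside MCC.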