Mahbaneh Eshaghzadeh Torbati, Makedonka Mitreva, Vanathi Gopalakrishnan.
Abstract
Human microbiome data from genomic sequencing technologies are accumulating rapidly, giving us insights into bacterial taxa that contribute to health and disease. Predictive modeling of such microbiota count data to classify human infection by parasitic worms, such as helminths, can help in detection and management across global populations. Real-world datasets from microbiome experiments are typically sparse, containing hundreds of measurements for bacterial species, of which only a few are detected in the bio-specimens analyzed. This sparsity means that more observations are needed for accurate predictive modeling, a challenge that has previously been addressed with different methods of feature reduction. To our knowledge, integrative methods such as transfer learning have not yet been explored in the microbiome domain as a way to deal with data sparsity by incorporating knowledge from different but related datasets. One way of incorporating this knowledge is to use a meaningful mapping among the features of these datasets. In this paper, we claim that such a mapping exists among the members of each individual cluster, where clusters are formed based on phylogenetic dependency among taxa and their association with the phenotype. We validate our claim by showing that models incorporating associations in such a grouped feature space show no performance deterioration on the given classification task. We test our hypothesis using classification models that detect helminth infection in the microbiota of human fecal samples obtained from Indonesia and Liberia. In our experiments, we first learn binary classifiers for helminth infection detection using Naive Bayes, Support Vector Machines, Multilayer Perceptrons, and Random Forest methods. In the next step, we add taxonomic modeling by using the SMART-scan module to group the data, and learn classifiers using the same four methods to test the validity of the resulting groupings. We observed a 6% to 23% and 7% to 26% performance improvement based on the Area Under the receiver operating characteristic (ROC) Curve (AUC) and Balanced Accuracy (Bacc) measures, respectively, over 10 runs of 10-fold cross-validation. These results show that using phylogenetic dependency to group our microbiota data yields a noticeable improvement in classification performance for helminth infection detection. These promising results from this feasibility study demonstrate that methods such as SMART-scan can be utilized in the future for knowledge transfer from different but related microbiome datasets by phylogenetically-related functional mapping, to enable novel integrative biomarker discovery.
Keywords: 16S rRNA gene; SMART-scan method; classification; helminth infection; microbiota; taxonomic tree; transfer learning
Year: 2016 PMID: 28239609 PMCID: PMC5325162 DOI: 10.3390/data1030019
Source DB: PubMed Journal: Data (Basel) ISSN: 2306-5729
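The abstract reports results over 10 runs of 10-fold cross-validation. The following is a minimal sketch of how such splits could be generated with the Python standard library only. Stratifying the folds by class label and the function name `repeated_stratified_kfold` are assumptions made for illustration; the record does not say whether the authors stratified or which software they used.

```python
import random
from collections import defaultdict


def repeated_stratified_kfold(labels, n_splits=10, n_repeats=10, seed=0):
    """Yield (run, fold, train_idx, test_idx) for repeated stratified k-fold CV.

    Hypothetical helper: stratification is an assumption, not a detail
    taken from the paper.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    for run in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        offset = 0
        for y in sorted(by_class):
            idx = by_class[y][:]
            rng.shuffle(idx)
            # Deal indices round-robin so every fold keeps roughly the class ratio.
            for j, i in enumerate(idx):
                folds[(offset + j) % n_splits].append(i)
            offset += len(idx)
        for f in range(n_splits):
            test = sorted(folds[f])
            held_out = set(test)
            train = [i for i in range(len(labels)) if i not in held_out]
            yield run, f, train, test


# Toy labels with the Indonesia class counts (38 infected, 52 non-infected).
labels = [1] * 38 + [0] * 52
splits = list(repeated_stratified_kfold(labels))
print(len(splits))  # number of (run, fold) pairs
```

Each of the 100 train/test pairs would be used to fit a classifier on `train` and score it on `test`; averaging the scores within a run gives one value per run for the baseline and for the taxonomic model.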
Indonesia and Liberia dataset statistics.
| Dataset ID | Number of Samples | Number of Taxa | Class Distribution (Helminth-Infected/Non-Infected) | Helminth-Infected Distribution (Single/Multi-Infected) | Type of Multi-Infection Worms |
|---|---|---|---|---|---|
| Indonesia | 90 | 702 | 38/52 | 35/3 | 3 (ascaris + hookworm) |
| Liberia | 74 | 702 | 23/51 | 19/4 | 3 (ascaris + hookworm) and 1 (ascaris + whipworm) |
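The counts in the table above can be checked for internal consistency, and they show how imbalanced the two classes are. The short sketch below copies the numbers from the table; the helper names are hypothetical and only serve the illustration.

```python
# Counts copied from the dataset statistics table.
datasets = {
    "Indonesia": {"samples": 90, "infected": 38, "non_infected": 52, "single": 35, "multi": 3},
    "Liberia": {"samples": 74, "infected": 23, "non_infected": 51, "single": 19, "multi": 4},
}


def is_consistent(d):
    """Class counts must add up to the sample count, and infection types to the infected count."""
    return (d["infected"] + d["non_infected"] == d["samples"]
            and d["single"] + d["multi"] == d["infected"])


def prevalence(d):
    """Fraction of samples that are helminth-infected."""
    return d["infected"] / d["samples"]


def majority_baseline_accuracy(d):
    """Accuracy of a classifier that always predicts the larger class."""
    return max(d["infected"], d["non_infected"]) / d["samples"]


for name, d in datasets.items():
    print(name, is_consistent(d), round(prevalence(d), 3), round(majority_baseline_accuracy(d), 3))
```

Because always predicting "non-infected" already reaches a plain accuracy well above 50% on both datasets, measures that account for class imbalance, such as AUC and balanced accuracy, are the more informative ones here.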
Figure 1. Flowchart of baseline and taxonomic models. The phylogenetic tree is one of the outputs of the Ribosomal Database Project (RDP) [30].
Figure 2. The pseudocode and first three iterations of the SMART-scan algorithm, derived from the explanations provided in the main paper [7] and the R code provided by its authors. The derived pseudocode in (a) is meant to provide a computational interpretation and a rapid reference that supports extending the SMART-scan method to automatic clustering of microbiota. The first three iterations of the SMART-scan algorithm in (b) are prepared based on the pseudocode in (a). In the first column, the sub-tree candidate for grouping is extracted from TreeCandidatesForSplitting. In the second column, the sub-tree is enclosed by a triangle in the phylogenetic tree; all possible cut points of the sub-tree are marked by lines cutting edges, and the taxa are labeled x. In the third column, the selected cut points are depicted by double lines, the selected groupings of taxa are labeled Z, and the newly split sub-trees, labeled T1 and T2, are pushed into TreeCandidatesForSplitting.
Baseline and taxonomic model performance on detecting helminth infection over the Indonesia dataset, using 10 runs of 10-fold cross-validation. In this table and the two following tables, the columns are (from left to right): the classifier generation method, the Area Under the ROC Curve (AUC) for the baseline and taxonomic models, the improvement achieved by the taxonomic model, and the p-values for the t-test and the Wilcoxon test. The best results in the Taxonomic Model AUC and Improvement columns are shown in bold font in the tables.
| Classifier | Baseline AUC | Taxonomic Model AUC | Improvement | p-value [t-test, Wilcoxon] |
|---|---|---|---|---|
| NB | 0.69 | | 0.18 | [0.029, 0.039] |
| SVMs | 0.61 | | | [0.002, 0.000] |
| MLP | 0.61 | 0.85 | 0.24 | [0.004, 0.001] |
| RF | 0.67 | 0.81 | 0.14 | [0.039, 0.048] |
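The AUC values in this table summarize how well a classifier's scores rank infected samples above non-infected ones. As a rough illustration, the sketch below computes AUC from its pairwise-ranking definition using only the standard library. The function name `roc_auc` and the toy scores are assumptions for illustration, not the authors' evaluation code.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 1 = helminth-infected, 0 = non-infected (made-up scores).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(y_true, scores)

# The Improvement column is the difference between the two AUC columns,
# e.g. for the MLP row of the table above:
improvement = round(0.85 - 0.61, 2)
print(auc, improvement)
```

The quadratic pairwise loop is fine for datasets of this size (fewer than 100 samples per fold); a rank-based formulation would be preferable for large test sets.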
Baseline and taxonomic model (AUC) performance on detecting helminth infection over the Combined dataset.
| Classifier | Baseline AUC | Taxonomic Model AUC | Improvement | p-value [t-test, Wilcoxon] |
|---|---|---|---|---|
| NB | 0.66 | | | [0.002, 0.000] |
| SVMs | 0.59 | 0.81 | 0.22 | [0.002, 0.000] |
| MLP | 0.75 | 0.87 | 0.12 | [0.014, 0.008] |
| RF | 0.72 | 0.85 | 0.13 | [0.027, 0.024] |
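The last column of these tables reports p-values from a t-test and a Wilcoxon test comparing the baseline and taxonomic models. The sketch below shows one way such a paired comparison could be set up, assuming one score per cross-validation run for each model (10 paired values). That pairing, the function names, and the per-run numbers are all hypothetical; the record does not describe how the authors computed their tests. The t-test part stops at the test statistic, because the standard library has no t-distribution to turn it into a p-value.

```python
from itertools import product
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """t statistic of the paired differences a[i] - b[i] (len(a) - 1 degrees of freedom)."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / sqrt(len(d)))


def wilcoxon_exact_p(a, b):
    """Two-sided exact Wilcoxon signed-rank p-value.

    Zero differences are dropped and tied absolute differences get average
    ranks. All 2**n sign patterns are enumerated, so keep n small.
    """
    d = [round(x - y, 10) for x, y in zip(a, b)]
    d = [x for x in d if x != 0]
    n = len(d)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    half = sum(ranks) / 2
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    observed = abs(w_plus - half)
    extreme = 0
    for signs in product((False, True), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - half) >= observed - 1e-9:
            extreme += 1
    return extreme / 2 ** n


# Made-up per-run AUCs for illustration only (10 runs each).
baseline = [0.60, 0.58, 0.63, 0.61, 0.59, 0.62, 0.60, 0.57, 0.64, 0.61]
taxonomic = [0.84, 0.80, 0.86, 0.85, 0.83, 0.88, 0.81, 0.82, 0.87, 0.86]
t_stat = paired_t_statistic(taxonomic, baseline)
p_wilcoxon = wilcoxon_exact_p(taxonomic, baseline)
print(t_stat, p_wilcoxon)
```

With 10 pairs in which one model wins every time, the exact two-sided Wilcoxon p-value is 2/1024, which is the smallest value this test can produce at that sample size.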
Baseline and taxonomic model performance on detecting helminth infection over the Indonesia dataset, using 10 runs of 10-fold cross-validation. In this table and the two following tables, the columns are (from left to right): the classifier generation method, the Sensitivity/Specificity (Sen/Spec) for infection detection and the Balanced accuracy (Bacc) for the baseline and taxonomic models, the improvement in Bacc achieved by the taxonomic model, and the p-values for the t-test and the Wilcoxon test. The best results in the Taxonomic Model Bacc and Improvement columns are shown in bold font in the tables.
| Classifier | Baseline Sen/Spec | Taxonomic Model Sen/Spec | Baseline Bacc | Taxonomic Model Bacc | Improvement Bacc | p-value [t-test, Wilcoxon] |
|---|---|---|---|---|---|---|
| NB | 0.58/0.67 | 0.86/0.77 | 0.62 | 0.82 | 0.20 | [0.001, 0.004] |
| SVMs | 0.26/0.96 | 0.78/0.96 | 0.61 | | | [0.000, 0.002] |
| MLP | 0.51/0.73 | 0.82/0.78 | 0.62 | 0.80 | 0.18 | [0.030, 0.040] |
| RF | 0.41/0.88 | 0.57/0.85 | 0.64 | 0.71 | 0.07 | [0.006, 0.030] |
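Balanced accuracy is the mean of sensitivity and specificity. For the NB baseline row above, (0.58 + 0.67) / 2 = 0.625, in line with the reported 0.62. The sketch below computes all three measures from predicted labels; the function name and the toy labels are illustrative assumptions, not data from the study.

```python
def sens_spec_bacc(y_true, y_pred):
    """Return (sensitivity, specificity, balanced accuracy) for binary labels (1 = infected)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity, (sensitivity + specificity) / 2


# Toy labels: 4 infected and 6 non-infected samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec, bacc = sens_spec_bacc(y_true, y_pred)

# A classifier that always predicts "non-infected" has plain accuracy 0.6
# on these labels but a balanced accuracy of only 0.5.
always_negative = sens_spec_bacc(y_true, [0] * len(y_true))
print(sens, spec, bacc, always_negative)
```

This is why a row with low sensitivity and high specificity, such as the SVMs baseline (0.26/0.96), ends up with a balanced accuracy close to 0.6 despite classifying most non-infected samples correctly.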
Baseline and taxonomic model (Sensitivity, Specificity, and Balanced accuracy) performance on detecting helminth infection over the Combined dataset.
| Classifier | Baseline Sen/Spec | Taxonomic Model Sen/Spec | Baseline Bacc | Taxonomic Model Bacc | Improvement Bacc | p-value [t-test, Wilcoxon] |
|---|---|---|---|---|---|---|
| NB | 0.59/0.70 | 0.87/0.79 | 0.64 | 0.83 | | [0.001, 0.002] |
| SVMs | 0.21/0.97 | 0.67/0.88 | 0.59 | 0.77 | 0.18 | [0.002, 0.006] |
| MLP | 0.68/0.82 | 0.77/0.80 | 0.75 | | 0.12 | [0.036, 0.039] |
| RF | 0.29/0.94 | 0.61/0.91 | 0.61 | 0.76 | 0.15 | [0.001, 0.004] |
Baseline and taxonomic model (AUC) performance on detecting helminth infection over the Liberia dataset.
| Classifier | Baseline AUC | Taxonomic Model AUC | Improvement | p-value [t-test, Wilcoxon] |
|---|---|---|---|---|
| NB | 0.71 | | 0.23 | [0.006, 0.002] |
| SVMs | 0.52 | 0.82 | | [0.002, 0.000] |
| MLP | 0.78 | 0.84 | 0.06 | [0.180, 0.087] |
| RF | 0.85 | 0.92 | 0.07 | [0.062, 0.073] |
Baseline and taxonomic model (Sensitivity, Specificity, and Balanced accuracy) performance on detecting helminth infection over the Liberia dataset.
| Classifier | Baseline Sen/Spec | Taxonomic Model Sen/Spec | Baseline Bacc | Taxonomic Model Bacc | Improvement Bacc | p-value [t-test, Wilcoxon] |
|---|---|---|---|---|---|---|
| NB | 0.6/0.72 | 0.88/0.90 | 0.66 | | | [0.020, 0.040] |
| SVMs | 0.05/1.0 | 0.72/0.92 | 0.75 | 0.82 | 0.07 | [0.000, 0.002] |
| MLP | 0.63/0.82 | 0.68/0.92 | 0.72 | 0.80 | 0.08 | [0.001, 0.002] |
| RF | 0.18/0.98 | 0.52/0.98 | 0.58 | 0.75 | 0.17 | [0.030, 0.040] |