Divya Sharma, Andrew D Paterson, Wei Xu.
Abstract
MOTIVATION: Research supports the potential use of the microbiome as a predictor of some diseases. Motivated by the findings that microbiome data is complex in nature, and that there is an inherent correlation due to the hierarchical taxonomy of microbial Operational Taxonomic Units (OTUs), we propose a novel machine learning method incorporating a stratified approach to group OTUs into phylum clusters. Convolutional Neural Networks (CNNs) were trained within each of the clusters individually. Then, through an ensemble learning approach, the features obtained from each cluster were concatenated to improve prediction accuracy. Our two-step approach, comprising stratification prior to combining multiple CNNs, aided in capturing the relationships between OTUs sharing a phylum efficiently, as compared to using a single CNN that ignores OTU correlations.
Year: 2020 PMID: 32449747 PMCID: PMC7750934 DOI: 10.1093/bioinformatics/btaa542
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. An illustration of correlation-based ordering of the OTUs in a cluster. (a) Example heatmap obtained by plotting Spearman rank coefficients between positively correlated OTUs in a cluster. (b) Cumulative coefficient obtained for each row of the heatmap matrix. (c) Vector of cumulative coefficients arranged in decreasing order. (d) The cumulative coefficients are renamed to indicate that they are now arranged in decreasing order. (e) Heatmap sorted by the new order of cumulative coefficients, so that correlated OTUs are concentrated together and sit closer in the matrix
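The ordering steps in the Fig. 1 caption can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the authors' implementation: the function names, the toy abundance values, and the choice to sum every entry of a row (including the diagonal and any negative coefficients) are assumptions made here for the example.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def correlation_order(otu_columns):
    """Order OTUs by decreasing cumulative Spearman coefficient.

    otu_columns maps a (hypothetical) OTU name to its abundances across samples.
    Spearman correlation is computed as Pearson correlation of the ranks.
    """
    names = list(otu_columns)
    ranked = {n: average_ranks(otu_columns[n]) for n in names}
    corr = {a: {b: pearson(ranked[a], ranked[b]) for b in names} for a in names}
    # Assumption: the cumulative coefficient is the plain row sum of the matrix.
    cumulative = {a: sum(corr[a].values()) for a in names}
    ordered = sorted(names, key=lambda n: cumulative[n], reverse=True)
    return ordered, cumulative

# Toy abundances for three OTUs across five samples (made-up values).
toy_cluster = {
    "otu_a": [1, 2, 3, 4, 5],
    "otu_b": [2, 4, 6, 8, 11],
    "otu_c": [5, 3, 4, 1, 2],
}
ordered, cumulative = correlation_order(toy_cluster)
```

In the toy cluster, `otu_a` and `otu_b` rise together while `otu_c` runs against them, so `otu_c` receives the smallest cumulative coefficient and is placed last.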
Fig. 2. Illustration of the layers in the CNN framework. (a) Detailed illustration of the phylum-based stratification and ensemble learning of CNNs for disease prediction. The four clusters are color coded and, after phylum stratification, are input to the four neural networks (N1, N2, N3 and N4). The extracted features are then flattened and stacked in the concatenation step, which leads to prediction of the disease outcome. (b) Illustration of the layers in a single neural network (N1/N2/N3/N4) acting on one particular cluster of the input data. (Color version of this figure is available at Bioinformatics online.)
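The data flow in Fig. 2(a), stratify by phylum, run one branch per cluster, then concatenate, can be illustrated without a deep learning library. The sketch below is only a stand-in: `conv1d_relu` applies a single fixed, untrained kernel where the paper uses a trained CNN, and the OTU names, phylum assignments, and abundance values are hypothetical.

```python
from collections import defaultdict

def stratify_by_phylum(otu_to_phylum):
    """Group OTU names into clusters keyed by phylum, preserving input order."""
    clusters = defaultdict(list)
    for otu, phylum in otu_to_phylum.items():
        clusters[phylum].append(otu)
    return dict(clusters)

def conv1d_relu(signal, kernel):
    """Valid 1D cross-correlation followed by ReLU (stand-in for one CNN branch)."""
    width = len(kernel)
    return [
        max(0.0, sum(signal[i + j] * kernel[j] for j in range(width)))
        for i in range(len(signal) - width + 1)
    ]

def ensemble_features(sample, clusters, kernel=(0.5, 0.5)):
    """Apply the stand-in branch to each cluster and concatenate the outputs."""
    features = []
    for phylum in sorted(clusters):  # fixed cluster order keeps features aligned
        signal = [sample[otu] for otu in clusters[phylum]]
        features.extend(conv1d_relu(signal, kernel))
    return features

# Hypothetical taxonomy and one sample's relative abundances.
otu_to_phylum = {
    "otu1": "Firmicutes",
    "otu2": "Bacteroidetes",
    "otu3": "Firmicutes",
    "otu4": "Bacteroidetes",
    "otu5": "Firmicutes",
}
sample = {"otu1": 0.2, "otu2": 0.1, "otu3": 0.4, "otu4": 0.3, "otu5": 0.0}

clusters = stratify_by_phylum(otu_to_phylum)
features = ensemble_features(sample, clusters)
```

In the real model the concatenated vector would feed further layers that produce the disease prediction; here it simply shows that each branch only ever sees OTUs from its own phylum.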
Fig. 3. ROC curve obtained on the test set of the simulated study. The test set comprised 60 controls and 60 cases. The red dotted line corresponds to AUC equal to 0.5, indicating a random classification model
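As a reminder of what the AUC in Fig. 3 measures, the area under the ROC curve equals the probability that a randomly chosen case scores higher than a randomly chosen control, with ties counted as one half. The sketch below computes it by direct pair counting; the function name and the toy labels and scores are assumptions for illustration, and nothing here reproduces the paper's data.

```python
def roc_auc(labels, scores):
    """AUC by pair counting: P(case score > control score), ties count 0.5.

    labels: 1 for cases, 0 for controls; scores: predicted risk per sample.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))

# Toy example: three of the four case/control pairs are ranked correctly.
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

A model that assigns the same score to everyone gets 0.5 under this definition, which matches the random-classifier reference line in the figure.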
Fig. 4. 95% confidence intervals obtained for the mean AUC values for 10 times 10-fold cross validation on the training set for the (a) T2D study and the (b) Cirrhosis study
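One common way to obtain an interval like those in Fig. 4 is a normal approximation over the per-fold AUC values. The record does not say how the authors computed their intervals, so the formula below (mean ± 1.96 standard errors) is an assumption, as are the function name and the made-up AUC values.

```python
from statistics import mean, stdev

def mean_ci95(values):
    """Return (mean, lower, upper) using a normal-approximation 95% interval."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate a spread")
    m = mean(values)
    half_width = 1.96 * stdev(values) / n ** 0.5  # 1.96 standard errors
    return m, m - half_width, m + half_width

# Made-up AUC values standing in for the folds of a repeated cross validation.
fold_aucs = [0.70, 0.72, 0.74, 0.76]
m, lower, upper = mean_ci95(fold_aucs)
```

Folds from repeated cross validation share training data and are not independent, so an interval computed this way tends to be narrower than the true uncertainty.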
AUC values for various machine learning methods on the test sets of the T2D and Cirrhosis studies
| Method | AUC T2D (w/o age+sex) | AUC T2D (w age+sex) | AUC Cirrhosis (w/o age+sex) | AUC Cirrhosis (w age+sex) |
|---|---|---|---|---|
| RF | 0.703 | 0.708 | 0.893 | 0.901 |
| GBC | 0.642 | 0.648 | 0.816 | 0.825 |
| SVM | 0.701 | 0.704 | 0.877 | 0.882 |
| Lasso regression | 0.665 | 0.670 | 0.823 | 0.831 |
| Ridge regression | 0.700 | 0.705 | 0.842 | 0.848 |
| NB | 0.682 | 0.685 | 0.802 | 0.807 |
| CNN_basic | 0.643 | 0.647 | 0.799 | 0.801 |
| CNN_shuffle | 0.712 | 0.718 | 0.844 | 0.852 |
| **taxoNNcorr** | **0.720** | **0.725** | **0.903** | **0.908** |
Note: The results are reported on both studies considering model performance without (w/o) including age and sex and with (w) age and sex. Note that the last row (values in bold) shows the consistent improvement in the performance of the proposed model taxoNNcorr for both studies.