Abstract
BACKGROUND: Microbiome profiles from the human body and from environmental niches have become publicly available thanks to recent advances in high-throughput sequencing technologies. Recent studies have already identified different microbiome profiles in healthy and sick individuals for a variety of diseases, which suggests that the microbiome profile can be used as a diagnostic tool for identifying the disease state of an individual. However, the high-dimensional nature of metagenomic data poses a significant challenge to existing machine learning models. Consequently, to enable personalized treatments, an efficient framework that can accurately and robustly differentiate between healthy and sick microbiome profiles is needed.
Keywords: Host phenotypes; Machine learning; Metagenomics; Neural networks
Year: 2019 PMID: 31216991 PMCID: PMC6584521 DOI: 10.1186/s12859-019-2833-2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Our proposed MetaNN framework for the classification of metagenomic data. Given the raw metagenomic count data, we first filter out microbes that appear in fewer than 10% of the samples in each dataset. Next, we fit a negative binomial (NB) distribution to the training data and sample from the fitted distribution to generate synthetic microbial samples that augment the training set. The augmented samples, together with the original training set, are used to train a neural network classifier. In this example, the neural network takes the counts of three microbes (x1,x2,x3) as input features and outputs the probabilities of two class labels (z1,z2). The two intermediate (hidden) layers have four and three hidden units, respectively. The input to each layer is the output of the previous layer multiplied by the weights (W1,W2,W) on the connecting lines. Finally, we evaluate the proposed neural network classifier on synthetic and real datasets using several metrics and compare its outputs against several existing machine learning models (see Review of ML methods)
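The first two stages of the pipeline in Fig. 1 (prevalence filtering and NB-based augmentation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the NB parameters are estimated by the method of moments, the fit is done per class and per microbe, and sampling uses a gamma-Poisson mixture. The caption does not specify any of these choices.

```python
import math
import random
from statistics import mean, variance


def filter_rare_microbes(counts, min_prevalence=0.10):
    """Drop feature columns that are non-zero in fewer than min_prevalence of samples."""
    n_samples = len(counts)
    keep = [
        j for j in range(len(counts[0]))
        if sum(1 for row in counts if row[j] > 0) / n_samples >= min_prevalence
    ]
    return [[row[j] for j in keep] for row in counts], keep


def fit_nb_moments(values):
    """Method-of-moments NB fit (an assumption). Returns (r, mu); r is None when
    variance <= mean, i.e. no overdispersion, signalling a Poisson fallback."""
    mu = mean(values)
    var = variance(values)  # sample variance, needs at least two values
    if var <= mu:
        return None, mu
    return mu * mu / (var - mu), mu


def sample_poisson(lam, rng):
    """Knuth's multiplication method, applied in chunks of at most 30 so that
    exp(-lam) never underflows for large means."""
    total = 0
    while lam > 0:
        step = min(lam, 30.0)
        threshold = math.exp(-step)
        k, prod = 0, rng.random()
        while prod > threshold:
            k += 1
            prod *= rng.random()
        total += k
        lam -= step
    return total


def sample_nb(r, mu, rng):
    """Draw one NB count as a gamma-Poisson mixture with mean mu and dispersion r."""
    if r is None:
        return sample_poisson(mu, rng)
    return sample_poisson(rng.gammavariate(r, mu / r), rng)


def augment_class(class_samples, n_new, seed=0):
    """Generate n_new synthetic samples from per-microbe NB fits of one class."""
    rng = random.Random(seed)
    n_features = len(class_samples[0])
    params = [fit_nb_moments([row[j] for row in class_samples]) for j in range(n_features)]
    return [[sample_nb(r, mu, rng) for r, mu in params] for _ in range(n_new)]
```

In this sketch, `augment_class` would be called once per class on the training folds only, and its output appended to the training set with that class label, so that no information from held-out samples leaks into the fitted distributions.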
Table 1 Real metagenomic data used in this paper
| Dataset | # of samples | # of features | # of classes | Classification task |
|---|---|---|---|---|
| Classification of body sites | ||||
| Costello | 552 | 1454 | 6 | Classify body habitats: skin (357), oral cavity (46), external auditory canal (44), hair (14), nostril (46), feces (45) |
| Costello | 357 | 600 | 12 | Classify skin sites: external nose (14), forehead (32), glans penis (8), labia minora (6), axilla (28), pinna (27), palm (64), palmar index finger (28), plantar foot (64), popliteal fossa (46), volar forearm (28), umbilicus (12) |
| Human Microbiome Project (HMP) | 1025 | 323 | 5 | Classify 5 major body sites: anterior nares (269), buccal mucosa (312), stool (319), supragingival plaque (313), tongue dorsum (316) |
| Classification of subjects | ||||
| Costello | 140 | 464 | 7 | Classify 7 subjects: (20, 20, 20, 20, 20, 20, 20) |
| Fierer | 104 | 294 | 3 | Classify 3 subjects: (40, 33, 31) |
| Fierer | 98 | 294 | 6 | Classify by subject and left/right hand: (20, 18, 17, 14, 16, 13) |
| Classification of disease states | ||||
| Inflammatory Bowel Disease (IBD) | 1025 | 1025 | 2 | Classify disease states: normal (500), IBD (500) |
| Pei | 200 | 5955 | 4 | Classify disease states: normal (28), reflux esophagitis (36), Barrett’s esophagus (84), esophageal adenocarcinoma (52) |
We consider three categories of classification task: body sites, subjects, and disease states. The number of samples in each class is given in round brackets. The number of features equals the number of different OTUs (i.e., microbes)
Fig. 2 Synthetic microbial frequency count distribution generated using the NB distribution based on microbiome profiles. a The underlying true distribution, which is highly zero-inflated (i.e., certain microbes are absent). b Type 1 error, which adds non-zero noise to zero-count entries in order to change the distribution. c Type 2 error, which changes underlying non-zero entries to zeros. d Type 3 error, which changes the distribution of the non-zero counts. Note that each type of error is added with a probability of 0.5
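A minimal sketch of how the three error types could be injected into a vector of counts is given here. The caption does not specify the noise distributions, so the uniform integer noise, the `max_noise` parameter, and the choice to make type 2 and type 3 errors mutually exclusive on non-zero entries (which requires e2 + e3 ≤ 1) are all assumptions made for illustration.

```python
import random


def inject_errors(counts, e1, e2, e3, max_noise=5, seed=0):
    """Return a noisy copy of counts.

    e1: probability that a zero entry receives non-zero noise (type 1)
    e2: probability that a non-zero entry is set to zero (type 2)
    e3: probability that a non-zero entry is perturbed but stays non-zero (type 3)
    """
    if e2 + e3 > 1:
        raise ValueError("e2 + e3 must not exceed 1 in this sketch")
    rng = random.Random(seed)
    noisy = []
    for x in counts:
        if x == 0:
            if rng.random() < e1:
                x = rng.randint(1, max_noise)  # type 1: spurious presence
        else:
            u = rng.random()
            if u < e2:
                x = 0  # type 2: spurious absence
            elif u < e2 + e3:
                # type 3: shift the count, clamped so it remains non-zero
                x = max(1, x + rng.randint(-max_noise, max_noise))
        noisy.append(x)
    return noisy
```

For example, `inject_errors(profile, 0.5, 0.1, 0.4)` applies one of the (e1, e2, e3) settings used in the synthetic experiments to a hypothetical list `profile` of counts.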
Fig. 3 Illustration of random dropout, where dropped units are shown as blue filled circles. a No dropout. b With dropout. As can be seen, connections to the dropped units are also disabled. Because the dropped units are chosen at random in each training step, training effectively combines exponentially many different NN architectures, which helps prevent over-fitting
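The mechanism in Fig. 3 can be sketched for a single layer of activations. The version here is "inverted" dropout, which rescales the surviving units during training so that nothing needs to change at test time; the caption does not say which variant the authors used, so the rescaling is an assumption.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Zero each unit with probability drop_prob and rescale the survivors."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        return list(activations)  # at test time every unit is kept
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]
```

With n units in a layer there are 2^n possible keep/drop masks, which is the sense in which dropout trains exponentially many thinned networks that share weights.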
Fig. 4 A regular convolutional neural network (CNN). The input consists of S samples and P features. A 1D filter with kernel size K and L channels is convolved with the input. After pooling (downsampling) with a kernel size of 2, the resulting tensor is approximately of size S×P/4×L. The fully connected layer considers all the features in every channel and outputs the probability of each class label (C) for each sample
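The approximate P/4 feature length follows from applying two convolution-plus-pooling blocks, as in the CNN configuration table. A short sketch of the arithmetic, assuming stride-1 convolutions without padding and non-overlapping pooling (neither is stated in the caption):

```python
def conv1d_len(n, kernel, stride=1, padding=0):
    """Output length of a 1D convolution over n positions."""
    return (n + 2 * padding - kernel) // stride + 1


def pool1d_len(n, kernel=2):
    """Output length of non-overlapping 1D max pooling."""
    return n // kernel


def cnn_feature_len(p, kernel=3, n_blocks=2):
    """Feature length after n_blocks of Conv1D(kernel) followed by MaxPool1D(2)."""
    for _ in range(n_blocks):
        p = pool1d_len(conv1d_len(p, kernel))
    return p
```

For P = 1454 features, the lengths are 1454 → 1452 → 726 → 724 → 362, close to P/4 = 363.5; with 8 channels the fully connected layer would then see 362 × 8 = 2896 inputs per sample.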
Model configurations for MLP and CNN
| Model | Synthetic | CBH | CSS | HMP | CS | FS | FSH | IBD | PDX |
|---|---|---|---|---|---|---|---|---|---|
| MLP | (256, 256) | (1024, 512) | (512, 256) | (512, 256) | (512, 512) | (512, 512) | (512, 256) | (512, 256, 128) | (512, 256, 128) |
| CNN | Conv1D(8, 3) → Dropout → ReLu → MaxPool1D(2) → Conv1D(8, 3) → ReLu → MaxPool1D(2) → FC | ||||||||
For MLP, the numbers in round brackets are the numbers of hidden units in each hidden layer. Conv1D is the one-dimensional convolution layer. ReLu is the non-linear rectifier layer. MaxPool1D is the one-dimensional max pooling layer. Dropout and FC denote dropout and fully connected layers, respectively. Details of each dataset are described in Table 1
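To make the MLP notation concrete, here is a minimal forward pass for a network specified by its layer sizes, for example `[3, 4, 3, 2]` for the toy network of Fig. 1. The ReLU hidden activations, the softmax output, and the uniform weight initialization are assumptions made for this sketch; the function names are hypothetical.

```python
import math
import random


def init_mlp(layer_sizes, seed=0):
    """Create one (weights, biases) pair per layer with small random weights."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(layers, x):
    """Forward pass: ReLU on hidden layers, softmax on the output layer."""
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    shift = max(x)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def n_parameters(layer_sizes):
    """Number of trainable weights and biases in a fully connected network."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
```

`n_parameters([3, 4, 3, 2])` gives 39 for the toy network; the same function applied to the larger configurations in the table shows how quickly the parameter count grows relative to the few hundred samples available per dataset.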
Performance comparison of different ML and NN models for different types of error (e1,e2,e3)
| (e1, e2, e3) | SVM | GB | RF | MNB | LR1 | LR2 | MLP | CNN |
|---|---|---|---|---|---|---|---|---|
| F1-micro | ||||||||
| (0.5, 0.1, 0.4) | 0.96 | 0.79 | | | 0.30 | | 0.98 | 0.75 |
| (0.5, 0.4, 0.1) | 0.99 | 0.82 | | | 0.43 | | | 0.81 |
| (0.3, 0.1, 0.4) | 0.98 | 0.87 | 0.98 | | 0.54 | | | 0.74 |
| (0.0, 0.7, 0.2) | 0.99 | 0.83 | | | 0.66 | | | 0.86 |
| (0.0, 0.2, 0.7) | 0.89 | 0.58 | 0.81 | | 0.51 | 0.87 | | 0.59 |
We consider several existing supervised ML methods, as well as NN models (i.e., MLP and CNN). For each experiment, we use 10-fold cross-validation. We use F1-micro to quantify the performance as defined in Classification performance metrics. Bold values represent the best results
Performance comparison of ML models on eight real datasets described in Table 1
| Dataset | SVM | RF | GB | MNB | LR1 | LR2 |
|---|---|---|---|---|---|---|
| F1-macro | ||||||
| CBH | 0.78(0.03) | 0.73(0.03) | 0.74(0.04) | 0.66(0.03) | 0.41(0.04) | 0.17(0.01) |
| CSS | 0.63(0.07) | 0.58(0.08) | 0.48(0.05) | 0.49(0.03) | 0.26(0.03) | 0.24(0.02) |
| HMP | 0.97(0.01) | 0.97(0.01) | 0.95(0.01) | 0.95(0.01) | 0.94(0.01) | 0.93(0.01) |
| CS | 0.88(0.05) | 0.87(0.05) | 0.74(0.06) | 0.76(0.04) | 0.16(0.04) | 0.19(0.06) |
| FS | 0.94(0.03) | 1.00(0.01) | 0.91(0.06) | 0.98(0.01) | 0.60(0.05) | 0.58(0.04) |
| FSH | 0.68(0.04) | 0.63(0.08) | 0.55(0.06) | 0.50(0.04) | 0.17(0.01) | 0.17(0.00) |
| IBD | 0.68(0.04) | 0.57(0.02) | 0.65(0.02) | 0.43(0.01) | 0.47(0.02) | 0.43(0.01) |
| PDX | 0.29(0.13) | 0.28(0.09) | 0.35(0.05) | 0.18(0.03) | 0.15(0.01) | 0.15(0.01) |
| F1-micro | ||||||
| CBH | 0.93(0.02) | 0.91(0.02) | 0.89(0.02) | 0.88(0.02) | 0.76(0.02) | 0.68(0.00) |
| CSS | 0.71(0.03) | 0.67(0.03) | 0.57(0.04) | 0.58(0.03) | 0.48(0.03) | 0.48(0.03) |
| HMP | 0.97(0.01) | 0.97(0.01) | 0.95(0.01) | 0.95(0.01) | 0.94(0.01) | 0.93(0.01) |
| CS | 0.88(0.06) | 0.88(0.04) | 0.75(0.05) | 0.75(0.05) | 0.23(0.05) | 0.28(0.07) |
| FS | 0.94(0.03) | 1.00(0.01) | 0.91(0.06) | 0.98(0.01) | 0.68(0.03) | 0.67(0.03) |
| FSH | 0.70(0.08) | 0.69(0.05) | 0.58(0.06) | 0.62(0.03) | 0.33(0.01) | 0.33(0.01) |
| IBD | 0.79(0.02) | 0.78(0.02) | 0.77(0.02) | 0.76(0.02) | 0.76(0.02) | 0.76(0.02) |
| PDX | 0.44(0.07) | 0.43(0.07) | 0.40(0.05) | 0.42(0.04) | 0.42(0.04) | 0.42(0.04) |
We consider several existing supervised ML methods. For each experiment, we consider 10-fold cross-validation and use F1-macro and F1-micro scores to quantify performance as defined in Classification performance metrics. For each fold, we perform five simulation runs with standard deviations shown between round brackets
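The gap between the F1-macro and F1-micro rows for imbalanced datasets is easier to read with the two averages written out. The sketch here assumes the common definitions (F1-macro as the unweighted mean of per-class F1, F1-micro computed from counts pooled over classes); the paper's own definitions are in its Classification performance metrics section, which is not reproduced here.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (F1-macro, F1-micro) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_c, fp_c, fn_c):
        denom = 2 * tp_c + fp_c + fn_c
        return 2 * tp_c / denom if denom else 0.0

    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro
```

For a toy case with 8 samples of class 0 and 2 of class 1, predicting class 0 everywhere gives F1-micro = 0.8 but F1-macro ≈ 0.44, because the missed minority class contributes a per-class F1 of zero. Under these definitions, F1-micro equals accuracy for single-label problems.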
Performance comparison of SVM, RF and NN models on eight real datasets described in Table 1
| Dataset | SVM | SVM+A | RF | RF+A | MLP+D | CNN+D | MLP+D+A | CNN+D+A | Gain (%) |
|---|---|---|---|---|---|---|---|---|---|
| F1-macro | |||||||||
| CBH | 0.78 (0.03) | 0.82 (0.03) | 0.73 (0.03) | 0.75 (0.03) | 0.85 (0.03) | 0.77 (0.04) | 0.86 (0.03) | 0.82 (0.03) | 5 |
| CSS | 0.63 (0.07) | 0.65 (0.06) | 0.58 (0.08) | 0.61 (0.06) | 0.66 (0.06) | 0.59 (0.06) | 0.67 (0.06) | 0.62 (0.06) | 3 |
| HMP | 0.97 (0.01) | 0.97 (0.01) | 0.97 (0.01) | 0.97 (0.01) | 0.97 (0.01) | 0.97 (0.01) | 0.97 (0.01) | 0.97 (0.01) | 0 |
| CS | 0.88 (0.05) | 0.88 (0.05) | 0.87 (0.05) | 0.87 (0.05) | 0.92 (0.05) | 0.87 (0.06) | 0.93 (0.05) | 0.88 (0.05) | 6 |
| FS | 0.94 (0.03) | 0.95 (0.02) | 1.00 (0.01) | 1.00 (0.01) | 0.97 (0.03) | 0.90 (0.15) | 0.98 (0.02) | 0.97 (0.02) | -2 |
| FSH | 0.68 (0.08) | 0.70 (0.08) | 0.63 (0.08) | 0.68 (0.08) | 0.74 (0.06) | 0.66 (0.07) | 0.74 (0.05) | 0.72 (0.07) | 6 |
| IBD | 0.68 (0.04) | 0.72 (0.02) | 0.57 (0.02) | 0.60 (0.02) | 0.75 (0.02) | 0.67 (0.03) | 0.78 (0.02) | 0.70 (0.02) | 8 |
| PDX | 0.29 (0.13) | 0.43 (0.02) | 0.28 (0.09) | 0.34 (0.07) | 0.51 (0.00) | 0.44 (0.05) | 0.56 (0.03) | 0.45 (0.08) | 30 |
| F1-micro | |||||||||
| CBH | 0.93 (0.02) | 0.93 (0.01) | 0.91 (0.02) | 0.92 (0.02) | 0.94 (0.01) | 0.89 (0.02) | 0.94 (0.01) | 0.92 (0.02) | 1 |
| CSS | 0.71 (0.03) | 0.72 (0.04) | 0.67 (0.03) | 0.68 (0.03) | 0.72 (0.03) | 0.67 (0.04) | 0.74 (0.03) | 0.68 (0.04) | 3 |
| HMP | 0.97 (0.01) | 0.97 (0.01) | 0.97 (0.01) | 0.97 (0.01) | 0.97 (0.01) | 0.96 (0.01) | 0.97 (0.01) | 0.97 (0.01) | 0 |
| CS | 0.88 (0.06) | 0.89 (0.05) | 0.88 (0.04) | 0.88 (0.05) | 0.92 (0.04) | 0.87 (0.06) | 0.94 (0.04) | 0.89 (0.05) | 6 |
| FS | 0.94 (0.03) | 0.95 (0.02) | 1.00 (0.01) | 1.00 (0.01) | 0.97 (0.03) | 0.91 (0.12) | 0.98 (0.02) | 0.97 (0.02) | -2 |
| FSH | 0.70 (0.08) | 0.71 (0.07) | 0.69 (0.05) | 0.72 (0.06) | 0.75 (0.05) | 0.68 (0.06) | 0.76 (0.05) | 0.75 (0.07) | 6 |
| IBD | 0.79 (0.02) | 0.79 (0.02) | 0.78 (0.02) | 0.79 (0.02) | 0.82 (0.01) | 0.77 (0.02) | 0.84 (0.01) | 0.78 (0.02) | 6 |
| PDX | 0.44 (0.07) | 0.48 (0.03) | 0.43 (0.07) | 0.44 (0.06) | 0.53 (0.01) | 0.49 (0.05) | 0.56 (0.03) | 0.50 (0.06) | 17 |
+D and +A denote dropout and data augmentation, respectively. For each experiment, we use 10-fold cross-validation and quantify performance with the F1-macro and F1-micro scores defined in Classification performance metrics. For each fold, we perform five simulation runs; standard deviations are shown in round brackets. Gain (%) is the relative improvement of the best NN model over the best ML model. Bold values show the best results
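The Gain (%) column is consistent with a relative difference between the best NN score and the best ML score in each row, rounded to the nearest integer. The formula is inferred from the reported numbers rather than stated in the caption, so treat this as a plausible reading:

```python
def relative_gain(best_nn, best_ml):
    """Relative improvement of the best NN score over the best ML score, in percent."""
    return round(100 * (best_nn - best_ml) / best_ml)
```

For the PDX F1-macro row, the best NN score is 0.56 (MLP+D+A) and the best ML score is 0.43 (SVM+A), giving a gain of 30; for FS, 0.98 against 1.00 gives -2.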
Fig. 5 ROC curves and AUCs for (a) multilayer perceptron (MLP) and (b) convolutional neural network (CNN). True positive rates are averaged over 10-fold cross-validation, each with 5 independent random runs. We show the ROC curves and AUCs for the real datasets considered in this paper
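For reference, an ROC curve and its AUC can be computed from binary labels and predicted scores as sketched here. For the multi-class datasets this would have to be applied per class in a one-vs-rest fashion and then averaged; that choice, and the averaging over folds and runs, is not detailed in the caption and is not covered by this sketch.

```python
def roc_points(y_true, scores):
    """Return ROC points (FPR, TPR) obtained by sweeping the decision threshold.

    y_true holds 0/1 labels; tied scores are processed together as one threshold.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    pairs = sorted(zip(scores, y_true), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

With labels `[0, 0, 1, 1]` and scores `[0.1, 0.4, 0.35, 0.8]`, the curve passes through (0, 0.5), (0.5, 0.5), (0.5, 1) and (1, 1), giving an AUC of 0.75.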
Fig. 6 (a-b and e-f) Q-Q plots and (c-d and g-h) scatter plots for the FS and PDX datasets, respectively. The red line is the fitted linear regression line, with the adjusted R-squared reported in the top-left corner. S1 and S2 denote samples from subject 1 and subject 2, respectively. BE and EA denote samples from Barrett’s esophagus (BE) and esophageal adenocarcinoma (EA) patients, respectively
Fig. 7 Visualization of (a) HMP, (b) IBD, and (c) PDX datasets using t-SNE projection [33]. We project the activations of the last hidden layer for the test data onto a 2D space, where different colors represent different classes. For instance, the red and green colors represent samples collected from anterior nares and stool, respectively. As can be seen, HMP and IBD samples show a clear separation between classes, while PDX samples are hard to distinguish