Abstract
The microbiome, the community of microorganisms living within an individual, is a promising avenue for developing non-invasive methods for disease screening and diagnosis. Here, we utilize 5643 aggregated, annotated whole-community metagenomes to implement the first multiclass microbiome disease classifier of this scale, able to discriminate between 18 different diseases and healthy samples. We compared three machine learning models: random forests, deep neural nets, and a novel graph convolutional architecture that exploits the graph structure of phylogenetic trees as its input. We show that the graph convolutional model outperforms deep neural nets in terms of accuracy (75% average test-set accuracy), receiver operating characteristic (92.1% average area under the ROC curve (AUC)), and precision-recall (50% average area under the precision-recall curve (AUPR)). Additionally, the convolutional net's performance complements that of the random forest: it shows a lower propensity for Type-I errors (false positives), while the random forest makes fewer Type-II errors (false negatives). Lastly, we achieve over 90% average top-3 accuracy across all of our models. Together, these results indicate that there are predictive, disease-specific signatures across microbiomes that can be used for diagnostic purposes.
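The Type-I / Type-II distinction drawn in the abstract can be made concrete per class in a one-vs-rest sense. The following is a minimal sketch, not the authors' evaluation code; the function name `per_class_errors` and the toy labels are hypothetical.

```python
from collections import Counter

def per_class_errors(y_true, y_pred):
    """Count one-vs-rest false positives and false negatives for each class."""
    labels = sorted(set(y_true) | set(y_pred))
    false_pos = Counter()
    false_neg = Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth != pred:
            false_pos[pred] += 1   # Type-I error for the predicted class
            false_neg[truth] += 1  # Type-II error for the true class
    return {
        c: {"false_positives": false_pos[c], "false_negatives": false_neg[c]}
        for c in labels
    }

# Toy labels (hypothetical, not taken from the paper).
y_true = ["Healthy", "CRC", "CRC", "T2D", "Healthy"]
y_pred = ["Healthy", "Healthy", "CRC", "CRC", "Healthy"]
errors = per_class_errors(y_true, y_pred)
print(errors)
```

In a multiclass setting every misclassification is simultaneously a false negative for the true class and a false positive for the predicted class, so the two error types are best compared class by class.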
Year: 2020; PMID: 31797586; PMCID: PMC7120658
Source DB: PubMed; Journal: Pac Symp Biocomput; ISSN: 2335-6928
Fig. 1. Architecture of the graph convolutional classifier. Purple lines indicate the flow of information from the previous layer to the highlighted neuron in each layer. In the convolutional layers, each neuron receives information only from its immediate neighbors in the preceding layer.
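The neighbor-restricted connectivity described in the caption can be illustrated with a simplified message-passing step on a small tree. This is only a sketch under stated assumptions: the shared scalar weights `w_self` and `w_neighbor`, the ReLU, and the toy phylogeny are all hypothetical stand-ins, and the paper's actual layer definition is not given here.

```python
def neighbor_layer(values, edges, w_self=1.0, w_neighbor=0.5):
    """One simplified message-passing step on an undirected tree.

    Each node's output depends only on its own previous value and the
    previous values of its immediate neighbors, followed by a ReLU.
    """
    neighbors = {node: [] for node in values}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    out = {}
    for node, value in values.items():
        total = w_self * value + w_neighbor * sum(values[n] for n in neighbors[node])
        out[node] = max(0.0, total)  # ReLU
    return out

# Hypothetical toy phylogeny: root -> {cladeA, cladeB}, cladeA -> {taxon1, taxon2}.
edges = [("root", "cladeA"), ("root", "cladeB"),
         ("cladeA", "taxon1"), ("cladeA", "taxon2")]
abundances = {"root": 0.0, "cladeA": 0.0, "cladeB": 0.2, "taxon1": 0.5, "taxon2": 0.3}

hidden1 = neighbor_layer(abundances, edges)
hidden2 = neighbor_layer(hidden1, edges)  # a second layer widens the receptive field
print(hidden1["root"], hidden2["root"])
```

After one step the root sees only `cladeA` and `cladeB`; after two steps, signal from `taxon1` and `taxon2` has reached it through `cladeA`. Stacking such layers is what lets local connectivity aggregate information along the tree.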
Percent accuracy by disease for each model. Boldface indicates the model(s) with the highest score in the category.
| Disease | GCN (n=30) | DNN (n=30) | RF (n=20) |
|---|---|---|---|
| Healthy | 91 ± 1 | 87 ± 1 | |
| Colorectal Cancer (CRC) | 19 ± 2 | ||
| Type 2 Diabetes (T2D) | 08 ± 1 | ||
| Rheumatoid Arthritis (RA) | 73 ± 4 | ||
| Hypertension | |||
| Inflammatory Bowel Disease (IBD) | 07 ± 1 | ||
| Adenoma | 13 ± 2 | 04 ± 1 | |
| Otitis | 00 ± 0 | ||
| Hepatitis B Virus (HBV) | 62 ± 3 | 59 ± 3 | |
| Fatty Liver | 22 ± 2 | 02 ± 1 | |
| Psoriasis | 51 ± 5 | ||
| Type 1 Diabetes (T1D) | 25 ± 6 | ||
| Metabolic Syndrome | 08 ± 3 | ||
| Impaired Glucose Tolerance (IGT) | 12 ± 3 | 00 ± 0 | |
| Periodontitis | |||
| Atopic Dermatitis (AD) | |||
| CDI | 53 ± 5 | 54 ± 9 | |
| Infectious Gastroenteritis | 01 ± 2 | ||
| Bronchitis | |||
| Average | 46 ± 27 | 45 ± 12 | 35 ± 14 |
Percent area-under-precision-recall (AUPR) and area-under-ROC (AUC) by disease for each model. Boldface indicates the model(s) with the highest score in the category.
| Disease | GCN AUC (%) | DNN AUC (%) | RF AUC (%) | GCN AUPR (%) | DNN AUPR (%) | RF AUPR (%) |
|---|---|---|---|---|---|---|
| Healthy | 82 ± 0 | 83 ± 0 | 73 ± 8 | 88 ± 1 | ||
| CRC | 81 ± 2 | 50 ± 0 | 41 ± 4 | 35 ± 3 | ||
| T2D | 86 ± 1 | 50 ± 0 | 40 ± 5 | 31 ± 2 | ||
| RA | 73 ± 4 | 74 ± 3 | ||||
| Hypertension | 92 ± 1 | 66 ± 1 | 51 ± 5 | 44 ± 3 | ||
| IBD | 90 ± 1 | 50 ± 1 | 41 ± 4 | 36 ± 2 | ||
| Adenoma | 75 ± 2 | 50 ± 0 | 15 ± 3 | 12 ± 1 | ||
| Otitis | 74 ± 2 | 50 ± 0 | 18 ± 3 | 10 ± 1 | ||
| HBV | 69 ± 2 | 58 ± 3 | ||||
| Fatty Liver | 80 ± 2 | 5 ± 2 | 35 ± 7 | 20 ± 3 | ||
| Psoriasis | 93 ± 2 | 73 ± 3 | 57 ± 4 | 62 ± 3 | ||
| T1D | 53 ± 3 | |||||
| Metabolic Syndrome | 51 ± 1 | 51 ± 1 | ||||
| IGT | 88 ± 2 | 50 ± 0 | 22 ± 3 | 16 ± 1 | ||
| Periodontitis | 96 ± 2 | 93 ± 2 | ||||
| AD | 88 ± 3 | 78 ± 5 | ||||
| CDI | 91 ± 4 | 64 ± 3 | 64 ± 8 | 62 ± 3 | ||
| Infectious Gastroenteritis | 70 ± 1 | 50 ± 0 | 21 ± 4 | 14 ± 1 | ||
| Bronchitis | 50 ± 0 | 13 ± 3 | 13 ± 2 | |||
| Average | 62 ± 7 | 50 ± 11 | 45 ± 12 | |||
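For readers unfamiliar with the two metrics in this table, the sketch below computes a one-vs-rest AUC and an average-precision estimate of AUPR for a single class. It is an illustration only: the paper's exact estimator is not stated here, average precision is just one common way to summarize a precision-recall curve, and the function names and toy scores are hypothetical. The code assumes at least one positive and one negative sample.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Toy one-vs-rest example: 1 = sample has the disease, 0 = any other class.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
auc = roc_auc(labels, scores)
aupr = average_precision(labels, scores)
print(round(auc, 3), round(aupr, 3))
```

An AUC of 50% corresponds to chance-level ranking, whereas the chance level for AUPR is roughly the prevalence of the positive class, so the two columns should not be read on the same scale.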
Fig. 2. (left) Accuracy at the top-1, top-3, and top-5 levels for (top to bottom) random forest, GCN, and DNN. (right) Chord diagram showing, for a subset of labels, the most common incorrect classification made for each class.
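Top-k accuracy counts a prediction as correct when the true label is among the k highest-scoring classes. A minimal sketch follows, assuming each model outputs one score per class; the function name and the toy scores are hypothetical.

```python
def top_k_accuracy(y_true, class_scores, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    correct = 0
    for truth, scores in zip(y_true, class_scores):
        top_k = sorted(scores, key=scores.get, reverse=True)[:k]
        correct += truth in top_k
    return correct / len(y_true)

# Toy per-class scores (hypothetical, not model outputs from the paper).
y_true = ["CRC", "T2D", "Healthy"]
class_scores = [
    {"Healthy": 0.6, "CRC": 0.3, "T2D": 0.1},
    {"Healthy": 0.5, "CRC": 0.3, "T2D": 0.2},
    {"Healthy": 0.7, "CRC": 0.2, "T2D": 0.1},
]
top1 = top_k_accuracy(y_true, class_scores, 1)
top2 = top_k_accuracy(y_true, class_scores, 2)
top3 = top_k_accuracy(y_true, class_scores, 3)
print(top1, top2, top3)
```

Top-k accuracy can only increase with k, which is why a model with modest top-1 accuracy on rare diseases can still reach a high top-3 score.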
Fig. 3. t-SNE visualization of the final-layer activations of the GCN model (excluding healthy). Inset shows the same t-SNE labeled by healthy (blue) and disease (red).
Overview of dataset samples.
| Disease | Count | Site | Studies |
|---|---|---|---|
| Atopic Dermatitis (AD) | 38 | Skin | Chng[ |
| Adenoma | 143 | Stool | Thomas,[ |
| Bronchitis | 18 | Stool | Yassour[ |
| Clostridium difficile Infection (CDI) | 33 | Stool | Vincent[ |
| Colorectal Cancer (CRC) | 273 | Stool | Vogtmann,[ |
| Fatty Liver | 94 | Stool | Loomba,[ |
| Hepatitis B Virus (HBV) | 99 | Stool | Qin[ |
| Healthy | 3808 | All | 25 Studies |
| Hypertension | 169 | Stool | Thomas,[ |
| Inflammatory Bowel Disease (IBD) | 148 | Stool | Nielsen[ |
| Impaired Glucose Tolerance (IGT) | 49 | Stool | Karlsson[ |
| Infectious Gastroenteritis | 20 | Stool | David,[ |
| Metabolic Syndrome | 50 | Stool | Vrieze[ |
| Otitis | 107 | Stool | Yassour[ |
| Periodontitis | 48 | Oral | Shi[ |
| Psoriasis | 74 | Skin | TettAJ[ |
| Rheumatoid Arthritis (RA) | 194 | Stool | Chengping[ |
| Type 1 Diabetes (T1D) | 55 | Stool | Heintz-Buschart,[ |
| Type 2 Diabetes (T2D) | 223 | Stool | Qin[ |
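The sample counts in this table imply a strong class imbalance, which matters when reading the accuracy and AUPR figures. The sketch below tabulates class prevalence and the accuracy of a trivial always-predict-the-majority baseline. The counts are copied from the table; the key `CDI` for the 33-sample row is an assumption taken from the label used in the AUC/AUPR table, and the whole-dataset prevalence only approximates that of any particular train/test split.

```python
# Sample counts per class, copied from the dataset overview table.
counts = {
    "Atopic Dermatitis": 38, "Adenoma": 143, "Bronchitis": 18,
    "CDI": 33,  # assumed label for the 33-sample row
    "Colorectal Cancer": 273, "Fatty Liver": 94, "Hepatitis B Virus": 99,
    "Healthy": 3808, "Hypertension": 169, "Inflammatory Bowel Disease": 148,
    "Impaired Glucose Tolerance": 49, "Infectious Gastroenteritis": 20,
    "Metabolic Syndrome": 50, "Otitis": 107, "Periodontitis": 48,
    "Psoriasis": 74, "Rheumatoid Arthritis": 194,
    "Type 1 Diabetes": 55, "Type 2 Diabetes": 223,
}

total = sum(counts.values())
prevalence = {disease: n / total for disease, n in counts.items()}

# Accuracy obtained by always predicting the most common class.
majority_class = max(counts, key=counts.get)
majority_baseline = counts[majority_class] / total

print(total, majority_class, round(majority_baseline, 4))
```

The counts sum to the 5643 metagenomes quoted in the abstract. Always predicting "Healthy" would already score about 67.5% accuracy on the pooled data, and a rare class such as bronchitis (18 samples, about 0.3% of the data) has a correspondingly low chance-level AUPR.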