Xiang Gao, Huaiying Lin, Qunfeng Dong.
Abstract
Dysbiosis of microbial communities is associated with various human diseases, raising the possibility of using microbial compositions as biomarkers for disease diagnosis. We have developed a Bayes classifier by modeling microbial compositions with Dirichlet-multinomial distributions, which are widely used to model multicategorical count data with extra variation. The parameters of the Dirichlet-multinomial distributions are estimated from training microbiome data sets based on maximum likelihood. The posterior probability of a microbiome sample belonging to a disease or healthy category is calculated based on Bayes' theorem, using the likelihood values computed from the estimated Dirichlet-multinomial distribution, as well as a prior probability estimated from the training microbiome data set or previously published information on disease prevalence. When tested on real-world microbiome data sets, our method, called DMBC (for Dirichlet-multinomial Bayes classifier), shows better classification accuracy than the only existing Bayesian microbiome classifier based on a Dirichlet-multinomial mixture model and the popular random forest method. The advantage of DMBC is its built-in automatic feature selection, capable of identifying a subset of microbial taxa with the best classification accuracy between different classes of samples based on cross-validation. This unique ability enables DMBC to maintain and even improve its accuracy at modeling species-level taxa. The R package for DMBC is freely available at https://github.com/qunfengdong/DMBC.

IMPORTANCE By incorporating prior information on disease prevalence, Bayes classifiers have the potential to estimate disease probability better than other common machine-learning methods. Thus, it is important to develop Bayes classifiers specifically tailored for microbiome data. Our method shows higher classification accuracy than the only existing Bayesian classifier and the popular random forest method, and thus provides an alternative option for using microbial compositions for disease diagnosis.
Keywords: Bayes classifier; Dirichlet-multinomial distribution; disease diagnosis; microbiome
Year: 2017 PMID: 29242838 PMCID: PMC5729222 DOI: 10.1128/mSphereDirect.00536-17
Source DB: PubMed Journal: mSphere ISSN: 2379-5042 Impact factor: 4.389
Comparison of the classification accuracies between DMBC, DMM, and random forest methods
| Test data set | DMBC AUC | DMM AUC | Random forest AUC (SD) |
|---|---|---|---|
| IBS at the genus level (157 genera) | 0.809 | 0.718 | 0.741 (0.005) |
| IBS at the OTU level (6,011 OTUs) | 0.78 | 0.672 | 0.643 (0.008) |
| NAFLD at the genus level (120 genera) | 0.684 | 0.686 | 0.621 (0.006) |
| NAFLD at the OTU level (4,287 OTUs) | 0.709 | 0.626 | 0.680 (0.004) |
For each test data set, the taxonomic level (genus- or species-level OTU) and the number of features (i.e., the number of genera or OTUs) are indicated.
The classification accuracies, computed with leave-one-out cross-validation, are represented by the AUC values for each classifier. Because the random forest method generates its decision trees randomly, its results vary between runs; we therefore repeated each random forest classification three times and reported the averages, with the corresponding standard deviations in parentheses.
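As an illustration of how an AUC can be obtained from leave-one-out cross-validation, the following is a minimal Python sketch, not the authors' R implementation. The function names `auc` and `leave_one_out_scores`, and the `fit` and `score` callables, are hypothetical placeholders for any classifier; the AUC is computed here as the fraction of (positive, negative) sample pairs that the classifier ranks correctly, with ties counted as one half.

```python
def auc(scores, labels):
    """Area under the ROC curve from held-out scores (labels: 1 = disease, 0 = healthy)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample from each class")
    # Count correctly ranked (positive, negative) pairs; a tie counts as half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def leave_one_out_scores(samples, labels, fit, score):
    """Score each sample with a model trained on all the other samples.

    `fit(train_samples, train_labels)` and `score(model, sample)` are
    hypothetical callables supplied by the caller.
    """
    held_out = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        held_out.append(score(model, samples[i]))
    return held_out
```

In use, the list returned by `leave_one_out_scores` would be passed to `auc` together with the true labels, giving one AUC value per classifier and data set.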
FIG 1 ROC curves for the three classifiers being compared using the IBS and NAFLD data sets at both the genus and OTU levels.
FIG 2 Overview of the DMBC method. A major characteristic of our method is to automatically select a subset of microbial taxa that may achieve the highest classification accuracy (i.e., feature selection). See Materials and Methods for details.
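The core calculation described in the abstract, a Dirichlet-multinomial likelihood per class combined with a prior through Bayes' theorem, can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the DMBC R package: the function names are hypothetical, the Dirichlet parameters `alphas` are assumed to have been estimated already (the paper uses maximum likelihood, which is not shown here), and feature selection is omitted.

```python
from math import exp, lgamma, log


def dm_log_likelihood(counts, alpha):
    """Log probability of a taxon count vector under a Dirichlet-multinomial.

    counts: non-negative integer counts, one per taxon.
    alpha:  positive Dirichlet parameters, one per taxon (assumed pre-estimated).
    """
    n = sum(counts)
    a0 = sum(alpha)
    ll = lgamma(n + 1) + lgamma(a0) - lgamma(n + a0)
    for x, a in zip(counts, alpha):
        ll += lgamma(x + a) - lgamma(x + 1) - lgamma(a)
    return ll


def posterior(counts, alphas, priors):
    """Posterior class probabilities via Bayes' theorem.

    alphas: dict mapping class name -> Dirichlet parameters for that class.
    priors: dict mapping class name -> prior probability (e.g., disease prevalence).
    """
    log_joint = {c: dm_log_likelihood(counts, alphas[c]) + log(priors[c]) for c in alphas}
    # Subtract the maximum before exponentiating to avoid numerical underflow.
    m = max(log_joint.values())
    z = sum(exp(v - m) for v in log_joint.values())
    return {c: exp(v - m) / z for c, v in log_joint.items()}


# Illustrative values only: three taxa, made-up parameters and equal priors.
example = posterior(
    counts=[30, 5, 5],
    alphas={"disease": [6.0, 2.0, 2.0], "healthy": [2.0, 4.0, 4.0]},
    priors={"disease": 0.5, "healthy": 0.5},
)
```

Working in log space matters in practice because likelihoods over thousands of OTUs are far too small to represent directly as floating-point numbers.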