Daniel Beck, James A Foster.
Abstract
Microbial communities are important to human health. Bacterial vaginosis (BV) is a disease associated with the vaginal microbiome. While the causes of BV are unknown, the microbial community in the vagina appears to play a role. We use three machine-learning techniques to classify microbial communities into BV categories: genetic programming (GP), random forests (RF), and logistic regression (LR). We evaluate the classification accuracy of each technique on two datasets. We then deconstruct the classification models to identify important features of the microbial community. The classification models produced by these techniques obtained accuracies above 90% for Nugent score BV and above 80% for Amsel criteria BV. While the classification models identify largely different sets of important features, the shared features often agree with past research.
Year: 2014 PMID: 24498380 PMCID: PMC3912131 DOI: 10.1371/journal.pone.0087830
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. The correlated microbe groups.
This figure shows the correlated microbe groups. We converted the sparCC correlations between microbial taxa to distances by subtracting the absolute value of the correlation from one. We then clustered the taxa and defined correlated groups using a dynamic tree-pruning algorithm (from the R library dynamicTreeCut). Microbial taxa not falling into these groups are not shown.
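The distance conversion described in this caption can be sketched in a few lines. The example below is only an illustration under stated assumptions: the taxa names and correlation values are made up, and a simple average-linkage clustering with a fixed-height cut stands in for the dynamic tree-pruning step. It is not the dynamicTreeCut algorithm, and the linkage method and cut height are assumptions, not values from the paper.

```python
from itertools import combinations


def correlation_to_distance(corr):
    """Convert a correlation matrix to distances: d = 1 - |r|."""
    return [[1.0 - abs(r) for r in row] for row in corr]


def average_linkage_groups(dist, names, cut_height):
    """Agglomerative average-linkage clustering with a fixed-height cut.

    Assumption: this fixed cut is a simplified stand-in for dynamic tree
    pruning. Taxa left in singleton clusters are dropped from the result.
    """
    clusters = [[i] for i in range(len(names))]

    def cluster_distance(a, b):
        # Mean pairwise distance between the members of two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]),
        )
        if cluster_distance(clusters[x], clusters[y]) > cut_height:
            break  # nothing left that is close enough to merge
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)

    return [sorted(names[i] for i in c) for c in clusters if len(c) > 1]


# Hypothetical taxa and correlations, for illustration only.
names = ["taxon_A", "taxon_B", "taxon_C", "taxon_D", "taxon_E"]
corr = [
    [1.0, 0.9, -0.1, 0.05, 0.1],
    [0.9, 1.0, 0.0, 0.1, 0.0],
    [-0.1, 0.0, 1.0, -0.8, 0.2],
    [0.05, 0.1, -0.8, 1.0, -0.1],
    [0.1, 0.0, 0.2, -0.1, 1.0],
]
dist = correlation_to_distance(corr)
groups = average_linkage_groups(dist, names, cut_height=0.5)
```

Because the absolute value is taken before subtracting from one, strongly negative correlations (such as the hypothetical taxon_C and taxon_D) end up close together, just like strongly positive ones.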
Figure 2. A comparison of the classification accuracies for each machine learning technique.
This figure shows the accuracy of different classifiers at classifying microbial communities into BV categories. The red and blue lines show the accuracy of the random forest and logistic regression classifiers, respectively. The black dots are different genetic programming models. Panel A shows the results using the Srinivasan et al. dataset and Amsel BV. Panel B uses the Srinivasan et al. dataset and Nugent score BV. Panel C uses the Ravel et al. dataset and Nugent score BV.
This table shows the fifteen most important features identified by the different classifiers.
| Srinivasan | | | Srinivasan | | | Ravel | | |
| Genetic Programming | Random Forests | Logistic Regression | Genetic Programming | Random Forests | Logistic Regression | Genetic Programming | Random Forests | Logistic Regression |
| nugent^a | nugent^a | Prevotella genogroup 7 | pH^b | CG2^a | clue^b | CG4^a | CG4^a | CG1^b |
| Moryella indoligenes | CG1 | Streptococcus agalactiae^b | CG2^a | CG1^b | CG1^b | pH^b | pH^b | Lactobacillus 2 |
| Mycoplasma^b | Uncorrelated microbes | Peptoniphilus lacrimalis | whiff^a | CG5 | race | GpI | CG3 | Sutterella |
| Fusobacterium | CG3 | Bifidobacteriaceae | Sutterella wadsworthensis | vag_fluid | whiff^a | Coriobacteriaceae 3 | Ureaplasma | CG6 |
| Sutterella wadsworthensis | CG4 | Raoultella planticola | Neisseria gonorrhoeae | pH^b | Finegoldia magna^b | Peptostreptococcaceae Incertae Sedis | CG5 | Total number reads^b |
| Bacteroides Porphyromonas^b | Anaerococcus prevotii tetradius | Mycoplasma^b | Bacteroides | whiff^a | CG2^a | Flexibacteraceae 5 | Total number reads^b | Bulleidia |
| CG5^b | CG2 | nugent^a | Haemophilus | clue^b | Streptococcus anginosus | Moryella | CG2^b | Proteobacteria 3 |
| Streptococcus agalactiae^b | CG5^b | Peptoniphilus harei | Bacteroides xylanisolvens | CG4 | Peptostreptococcus | Megamonas | CG1^b | Bilophila |
| Veillonella montpellierensis | Peptostreptococcus | Dialister propionicifaciens | Eubacteriaceae Lachnospiraceae | Streptococcus agalactiae^b | Streptococcus agalactiae^b | Enterobacteriaceae 2 | Ethnic Group | CG2^b |
| Candidatus Peptoniphilus massiliensis | Arcanobacterium phocae | Fusobacteriaceae | Arcanobacterium phocae | CG3 | Raoultella planticola | Chryseobacterium | Uncorrelated microbes | Lactobacillales 1^b |
| Clostridiales | race | Megasphaera micronuciformis | Eubacteriaceae Ruminococcaceae | Finegoldia magna^b | Streptococcus parasanguinis | Patulibacter | Lactobacillus gasseri | CG4^a |
| Bifidobacterium breve | Coriobacteriaceae | Porphyromonas sp. type 1 | Mobiluncus curtisii | Uncorrelated microbes | Megasphaera | Haemophilus | Clostridiales 15 | Salmonella |
| Haemophilus | Actinomyces | Haemophilus pittmaniae | Campylobacter ureolyticus | Lactobacillus coleohominis | Prevotella genogroup 4 | Clostridia 2 | Community group | Dermabacter |
| Streptococcus salivarius thermophilus | Bacteroides Porphyromonas^b | Neisseria gonorrhoeae | Delftia tsuruhatensis | Anaerococcus vaginalis | Streptococcus salivarius thermophilus | Rothia | Lactobacillales 1^b | Flexibacteraceae 2 |
| Fusobacterium periodonticum | Finegoldia magna | Asticcacaulis excentricus | Ruminococcaceae | Streptococcus mitis oralis | Pseudomonadaceae | Bacillus c | Staphylococcus | Exiguobacterium |
Features common to all three techniques are labeled ‘a’. Features common to two techniques are labeled ‘b’. The features are listed in order of importance.
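The labeling rule in this footnote amounts to counting how many of the three feature lists each feature appears in. A minimal sketch, assuming each technique's top features are available as a plain list; the function name `label_shared_features` and the feature names are hypothetical placeholders, not values from the table:

```python
from collections import Counter


def label_shared_features(rankings):
    """Map each feature to 'a' (in all three lists), 'b' (in two), or ''."""
    # set() guards against a feature being counted twice within one list.
    counts = Counter(f for ranked in rankings.values() for f in set(ranked))
    marks = {3: "a", 2: "b"}
    return {feature: marks.get(n, "") for feature, n in counts.items()}


# Hypothetical ranked feature lists, one per technique.
rankings = {
    "Genetic Programming": ["feature_1", "feature_2", "feature_3"],
    "Random Forests": ["feature_1", "feature_2", "feature_4"],
    "Logistic Regression": ["feature_1", "feature_4", "feature_5"],
}
labels = label_shared_features(rankings)
```

Here `feature_1` would receive the 'a' label, `feature_2` and `feature_4` the 'b' label, and the remaining features no label.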
This table lists the parameter values used by the GP classifier.
| Parameter | Value |
| Population size | 15000 |
| Tournament group size | 4 |
| Cross-over probability | 0.2 |
| Total generations | 300 |
| Mutation probability | 1 |
| Available node functions | addition, subtraction, protected division, multiplication, if/then/else, sine, cosine, logical AND, logical OR, maximum, minimum, log |
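Two entries in this table, protected division and the tournament group size, can be illustrated with a short sketch. This is not the authors' implementation. The table does not say what protected division returns for a zero denominator, so returning 1.0 is an assumed convention here, as is treating higher fitness as better. The names `GP_PARAMS`, `protected_div`, and `tournament_select` are hypothetical.

```python
import random

# Parameter values copied from the table above.
GP_PARAMS = {
    "population_size": 15000,
    "tournament_group_size": 4,
    "crossover_probability": 0.2,
    "total_generations": 300,
    "mutation_probability": 1,
}


def protected_div(x, y):
    """Divide x by y, returning 1.0 when y is zero (assumed convention)."""
    return x / y if y != 0 else 1.0


def tournament_select(population, fitness, group_size, rng=random):
    """Draw `group_size` distinct individuals at random and return the fittest.

    Assumption: higher fitness is better.
    """
    contenders = rng.sample(population, group_size)
    return max(contenders, key=fitness)


# Toy usage: individuals are integers and fitness is the value itself.
toy_population = list(range(100))
winner = tournament_select(
    toy_population,
    fitness=lambda individual: individual,
    group_size=GP_PARAMS["tournament_group_size"],
    rng=random.Random(0),
)
```

A larger tournament group makes selection greedier, because the best individual in the population is more likely to be among the contenders in any given draw.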