Pajau Vangay, Benjamin M Hillmann, Dan Knights.
Abstract
The use of machine learning in high-dimensional biological applications, such as the human microbiome, has grown exponentially in recent years, but algorithm developers often lack the domain expertise required for interpretation and curation of the heterogeneous microbiome datasets. We present Microbiome Learning Repo (ML Repo, available at https://knights-lab.github.io/MLRepo/), a public, web-based repository of 33 curated classification and regression tasks from 15 published human microbiome datasets. We highlight the use of ML Repo in several use cases to demonstrate its wide application, and we expect it to be an important resource for algorithm developers.
Keywords: database; machine learning; microbiome; repository
Year: 2019 PMID: 31042284 PMCID: PMC6493971 DOI: 10.1093/gigascience/giz042
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Microbiome datasets with available classification tasks in ML Repo
| Project name | V Region | Target size | No. samples | No. subjects | Area | Description | Sequencing technology | Study design |
|---|---|---|---|---|---|---|---|---|
| Cho 2012 | V3 | 177 | 95 | 47 | Antibiotics | Mouse fecal and cecal samples, control vs 4 kinds of antibiotics | 454 | Cross-sectional |
| Claesson 2012 | V4 | 221 | 168 | 168 | Age | Elderly and young adults | 454 | Cross-sectional |
| David 2014 | V4 | 282 | 235 | 11 | Diet | Plant-based vs animal-based diet, cross-over study | Illumina MiSeq | Longitudinal |
| Gevers 2014 | V4 | 173 | 1,321 | 668 | IBD | Biopsies from patients with IBD prior to treatment | Illumina MiSeq | Cross-sectional |
| HMP 2012 | V35 | 527 | 6,407 | 242 | Body habitat, sex | Up to 18 body sites across 242 healthy subjects at 1–2 time points | 454 | Cross-sectional |
| Kostic 2012 | V35 | 569 | 190 | 95 | Colorectal cancer | Adjacent healthy vs tumor colon biopsy tissues | 454 | Paired |
| Montassier 2016 | V56 | 280 | 28 | 28 | Bacteremia | Patients prior to chemotherapy who did or did not develop bacteremia | 454 | Cross-sectional |
| Morgan 2012 | V35 | 569 | 231 | 231 | IBD | Healthy controls, patients with Crohn's disease or ulcerative colitis | 454 | Cross-sectional |
| Turnbaugh 2009 | V2 | 230 | 281 | 154 | Obesity | Monozygotic or dizygotic twin pairs concordant for body mass index class, and their mothers | 454 | Cross-sectional |
| Wu 2011 | V12 | 244 | 95 | 10 | Diet | Controlled high-fat or low-fat feeding on 10 subjects over 10 days | 454 | Longitudinal |
| Yatsunenko 2012 | V4 | 282 | 531 | 531 | Geography, age, sex | Humans of varying ages from the USA, Malawi, and Venezuela | Illumina MiSeq | Cross-sectional |
| Ravel 2011 | V12 | 240 | 396 | 396 | Bacterial vaginosis | Vaginal samples from 4 ethnic groups; Nugent scores for bacterial vaginosis | 454 | Cross-sectional |
| Karlsson 2013 | NA | NA | 144 | 144 | Diabetes | Patients with normal, impaired, or type 2 diabetes glucose tolerance categories | Illumina HiSeq | Cross-sectional |
| Qin 2012 | NA | NA | 134 | 134 | Diabetes | Chinese healthy controls vs patients with type 2 diabetes | Illumina HiSeq | Cross-sectional |
| Qin 2014 | NA | NA | 130 | 130 | Cirrhosis | Healthy controls vs patients with cirrhosis | Illumina HiSeq | Cross-sectional |
ML Repo contains 33 classification and regression tasks from 15 publicly available human microbiome datasets shown here. IBD: inflammatory bowel disease; NA: not applicable.
Figure 1: Data processing workflow and website generation. (A) Quality-filtered sequences were obtained from either QIITA or another public repository, then trimmed and filtered using SHI7. Reference-based OTUs were picked using BURST with the NCBI RefSeq and Greengenes 97 (GG 97) databases. (B) Individual GitHub Markdown pages were generated from dataset and task lists with a custom Python script and Jinja2 template, then uploaded to GitHub to be hosted.
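The page-generation step in panel (B) can be pictured with a short sketch. This is not the authors' script: it uses the standard-library `string.Template` as a stand-in for Jinja2, and the record fields (`dataset`, `task`, `area`, `samples`), the template text, and the filename scheme are all assumptions made for illustration. The two sample sizes are taken from the task table in this record.

```python
from string import Template

# Hypothetical task records; field names are illustrative, not ML Repo's schema.
TASKS = [
    {"dataset": "Kostic 2012", "task": "Healthy vs Tumor",
     "area": "Colorectal cancer", "samples": 172},
    {"dataset": "Qin 2014", "task": "Cirrhosis vs Healthy",
     "area": "Cirrhosis", "samples": 130},
]

# string.Template stands in for a Jinja2 template so the sketch is stdlib-only.
PAGE = Template(
    "# $dataset: $task\n\n"
    "- Area: $area\n"
    "- Sample size: $samples\n"
)


def render_pages(tasks):
    """Return a mapping of output filename -> rendered Markdown text."""
    pages = {}
    for task in tasks:
        # Build a filename such as "kostic-2012-healthy-vs-tumor.md".
        slug = f"{task['dataset']}-{task['task']}".lower().replace(" ", "-")
        pages[f"{slug}.md"] = PAGE.substitute(task)
    return pages


if __name__ == "__main__":
    for name, text in render_pages(TASKS).items():
        print(name)
        print(text)
```

Keeping the task list as plain data and the layout in a single template means every page can be regenerated whenever a task is added, which is presumably why a template-driven approach suits a repository like this.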
Figure 2: Screenshots of ML Repo web interface. (A) Available classification and regression tasks are listed by high-level phenotype categories for browsing. (B) Individual task webpages contain links to files for classifying a specific task, as well as relevant task-specific metadata. (C) Individual dataset webpages contain relevant metadata pertaining to the entire dataset, as well as links to raw metadata files and sequencing data.
Figure 3: ROCs comparing random forest and SVM with different kernels. Across all 28 binary classification tasks available in ML Repo, we compare ROCs of random forest, SVM with a radial kernel, and SVM with a linear kernel. AUCs are listed within plots and are colored by model. cd: Crohn's disease; dz: dizygotic; mz: monozygotic; uc: ulcerative colitis.
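For readers less familiar with the metric reported in these plots, the following sketch computes the area under the ROC curve directly from its rank interpretation: the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. It is a generic illustration rather than the code used for the figure, and the labels and scores in the usage example are made up.

```python
def roc_auc(labels, scores):
    """AUC as the chance that a random positive outranks a random negative.

    labels: iterable of 0/1 (1 = positive class); scores: classifier scores.
    A tie between a positive and a negative counts as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Made-up labels and scores, for illustration only.
    toy_labels = [1, 1, 0, 0, 1, 0]
    toy_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.65]
    print(f"AUC = {roc_auc(toy_labels, toy_scores):.3f}")
```

The pairwise loop is quadratic in the number of samples, which is fine at the sample sizes listed in this record; a sort-based version would be preferable for much larger inputs.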
Figure 4: Summary statistics of framework and database comparisons. (A) AUCs of random forest (rf) vs SVM-Linear (left) and random forest vs SVM-Radial (right). Paired t-tests reveal that random forest results in significantly higher AUC than both SVM-Linear (P = 0.0014) and SVM-Radial (P = 0.00032). (B) Accuracies of random forest vs SVM-Linear (left) and random forest vs SVM-Radial (right). Paired t-tests reveal that random forest results in significantly better accuracy than SVM-Radial (P = 0.03), but not SVM-Linear (P = 0.083). (C) AUCs (left) and accuracies (right) of random forest classifications of 24 tasks using OTUs picked with the NCBI RefSeq database or the Greengenes (gg) database as predictors. Student's t-test reveals that reference database choice has limited impact on classification AUC or accuracy. Lines are colored by the top model for each classification task.
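The paired t-tests in panels (A) and (B) compare two models task by task, so each task contributes one difference. A minimal sketch of that statistic follows; the function name `paired_t` is hypothetical, the AUC values in the usage example are invented and are not the paper's results, and the sketch stops at the t statistic and degrees of freedom because the standard library has no t distribution (if SciPy is available, `scipy.stats.ttest_rel` also returns the P value).

```python
import math
from statistics import mean, stdev


def paired_t(a, b):
    """Return (t statistic, degrees of freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    # One difference per pair, e.g. per classification task.
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("differences have zero variance")
    n = len(diffs)
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


if __name__ == "__main__":
    # Invented per-task AUCs for two models; not taken from the paper.
    auc_model_a = [0.90, 0.80, 0.70, 0.85]
    auc_model_b = [0.85, 0.70, 0.72, 0.80]
    t_stat, dof = paired_t(auc_model_a, auc_model_b)
    print(f"t = {t_stat:.3f}, df = {dof}")
```

Pairing by task removes between-task variability in difficulty, which is why it suits comparisons where the same tasks are scored by each model.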
Figure 5: ROCs comparing NCBI RefSeq and Greengenes 97 (gg97) databases. Across the 24 16S-based binary classification tasks available in ML Repo, we compare ROCs of random forest with genus-level taxonomic summaries as predictors from OTU-picking strategies with the NCBI RefSeq prokaryote reference database and the Greengenes 97 reference database. AUCs are listed within plots and are colored by database. cd: Crohn's disease; dz: dizygotic; mz: monozygotic; uc: ulcerative colitis.
Description of available prediction tasks
| Dataset | Attributes | Description | Area | Regression? | Sample size | No. of features (OTU, RefSeq) | No. of features (OTU, GG) | No. of features (Taxa, RefSeq) | No. of features (Taxa, GG) | Control variable |
|---|---|---|---|---|---|---|---|---|---|---|
| Cho 2012 | Abx: Control, Chlortetracycline | 5 groups of mice treated with 4 different antibiotics or no antibiotics | Antibiotics | | 47 | 293 | 1,144 | 299 | 141 | N |
| Cho 2012 | Abx: Control, Chlortetracycline | 5 groups of mice treated with 4 different antibiotics or no antibiotics | Antibiotics | | 45 | 293 | 1,144 | 299 | 141 | N |
| Cho 2012 | Abx: Penicillin, Vancomycin | 5 groups of mice treated with 4 different antibiotics or no antibiotics | Antibiotics | | 47 | 293 | 1,144 | 299 | 141 | N |
| Cho 2012 | Abx: Penicillin, Vancomycin | 5 groups of mice treated with 4 different antibiotics or no antibiotics | Antibiotics | | 45 | 293 | 1,144 | 299 | 141 | N |
| Claesson 2012 | AGE: Elderly, Young | Elderly or young adults | Age | | 167 | 569 | 3,763 | 662 | 279 | N |
| David 2014 | Diet: Plant, Animal | Individuals on the last day of an animal or plant diet intervention | Diet | | 18 | 1,747 | 6,293 | 1,535 | 695 | Y |
| Gevers 2014 | DIAGNOSIS: no, CD | Healthy controls and patients with CD | IBD | | 140 | 943 | 3,547 | 992 | 446 | N |
| Gevers 2014 | DIAGNOSIS: no, CD | Healthy controls and patients with CD | IBD | | 160 | 943 | 3,547 | 992 | 446 | N |
| Gevers 2014 | PCDAI | PCDAI scores of patients with CD at 6 months after sampling | IBD | X | 68 | 943 | 3,547 | 992 | 446 | N |
| Gevers 2014 | PCDAI | PCDAI scores of patients with CD at 6 months after sampling | IBD | X | 51 | 943 | 3,547 | 992 | 446 | N |
| HMP 2012 | HMPBODYSUPERSITE: Oral, Gastrointestinal_tract; HOST_SUBJECT_ID | Gastrointestinal tract and oral cavity of healthy adults | Body habitat | | 2,070 | 3,121 | 9,383 | 3,090 | 1,218 | Y |
| HMP 2012 | SEX: male, female | Healthy male and female adults | Sex | | 180 | 3,121 | 9,383 | 3,090 | 1,218 | N |
| HMP 2012 | HMPBODYSUBSITE: Stool, Tongue_dorsum; HOST_SUBJECT_ID | Stool and tongue of healthy adults | Body habitat | | 404 | 3,121 | 9,383 | 3,090 | 1,218 | Y |
| HMP 2012 | HMPBODYSUBSITE: Subgingival_plaque, Supragingival_plaque; HOST_SUBJECT_ID | Subgingival and supragingival plaque of healthy adults | Body habitat | | 408 | 3,121 | 9,383 | 3,090 | 1,218 | Y |
| Karlsson 2013 | Classification: IGT, T2D | Impaired or type 2 diabetes glucose tolerance categories | Diabetes | | 101 | 12,845 | NA | 3,758 | NA | N |
| Karlsson 2013 | Classification: NGT, T2D | Normal or type 2 diabetes glucose tolerance categories | Diabetes | | 96 | 12,845 | NA | 3,758 | NA | N |
| Kostic 2012 | DIAGNOSIS: Healthy, Tumor; HOST_SUBJECT_ID | Colorectal carcinoma tumors and adjacent nonaffected tissues | Cancer | | 172 | 908 | 3,228 | 980 | 409 | Y |
| Montassier 2016 | Treatment: bact, NObact | Patients prior to chemotherapy who did or did not develop bacteremia | Bacteremia | | 28 | 541 | 1,852 | 640 | 228 | N |
| Morgan 2012 | ULCERATIVE_COLIT_OR_CROHNS_DIS: Crohn's disease, Healthy | Healthy controls or patients with CD or ulcerative colitis | IBD | | 128 | 829 | 3,677 | 877 | 367 | N |
| Morgan 2012 | ULCERATIVE_COLIT_OR_CROHNS_DIS: Ulcerative Colitis, Healthy | Healthy controls or patients with CD or ulcerative colitis | IBD | | 128 | 829 | 3,677 | 877 | 367 | N |
| Qin 2012 | Diabetic: Y, N | Healthy controls or patients with type 2 diabetes | Diabetes | | 124 | 11,880 | NA | 2,526 | NA | N |
| Qin 2014 | Cirrhotic: Cirrhosis, Healthy | Healthy controls or patients with cirrhosis | Cirrhosis | | 130 | 8,483 | NA | 2,579 | NA | N |
| Ravel 2011 | Ethnic_Group: Black, Hispanic | Vaginal microbiomes of black and Hispanic women | Vaginal | | 199 | 586 | 1,093 | 660 | 305 | N |
| Ravel 2011 | Nugent_score_category: low, high | Predict Nugent score category (low, high) from vaginal microbiome | Vaginal | | 342 | 586 | 1,093 | 660 | 305 | N |
| Ravel 2011 | Nugent_score | Predict Nugent score from vaginal microbiome | Vaginal | X | 388 | 586 | 1,093 | 660 | 305 | N |
| Ravel 2011 | pH | Predict pH from vaginal microbiome | Vaginal | X | 388 | 586 | 1,093 | 660 | 305 | N |
| Ravel 2011 | Ethnic_Group: White, Black | Vaginal microbiomes of white and black women | Vaginal | | 200 | 586 | 1,093 | 660 | 305 | N |
| Turnbaugh 2009 | OBESITYCAT: Lean, Obese; ZYGOSITY: MZ, DZ, Mom | Lean or obese individuals (monozygotic or dizygotic twins or their mothers) | Obesity | | 142 | 557 | 4,051 | 680 | 232 | Y |
| Wu 2011 | DIET: HighFat, LowFat | Individuals after completing a high-fat or low-fat diet intervention | Diet | | 10 | 292 | 1,769 | 361 | 136 | N |
| Yatsunenko 2012 | AGE | Infants (up to age 3 years) from the USA | Age | X | 49 | 4,660 | 15,783 | 4,021 | 1,544 | N |
| Yatsunenko 2012 | COUNTRY: GAZ: Venezuela, GAZ: Malawi | Individuals living in Malawi or Venezuela | Geography | | 54 | 4,660 | 15,783 | 4,021 | 1,544 | N |
| Yatsunenko 2012 | SEX: male, female | Males and females from the USA | Sex | | 129 | 4,660 | 15,783 | 4,021 | 1,544 | N |
| Yatsunenko 2012 | COUNTRY: GAZ: United States of America, GAZ: Malawi | Individuals living in the USA or Malawi | Geography | | 150 | 4,660 | 15,783 | 4,021 | 1,544 | N |
Abx: antibiotics; CD: Crohn's disease; DZ: dizygotic; GG: Greengenes; IBD: inflammatory bowel disease; IGT: impaired glucose tolerance; MZ: monozygotic; NGT: normal glucose tolerance; PCDAI: Pediatric Crohn's Disease Activity Index; T2D: type 2 diabetes; GAZ: Gazetteer, an ontology of place names.
Glossary
| Term | Definition |
|---|---|
| OTU | Operational taxonomic unit, group of closely related organisms based on DNA sequence similarity |
| 16S | 16S ribosomal RNA gene, component of the prokaryotic ribosome, used to reconstruct phylogenies |
| FASTA | Text-based format for representing nucleotide sequences with single-letter codes |
| FASTQ | Text-based format for representing nucleotide sequences and corresponding quality scores, with single-letter codes for nucleotides and quality |
| Taxa | Groups of ≥1 populations of organisms. Usually summarized at phylum, class, order, family, genus, or species levels |
| Metadata | Descriptive data pertaining to samples within a study |
| Shotgun | Shotgun metagenomics sequencing breaks up all available DNA into random small segments and uses chain termination to sequence reads. Reads can be aligned directly to a reference database, or overlapping reads can be assembled into contiguous sequences |
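To make the FASTQ entry above concrete, here is a minimal reader for the common four-line record layout (header, sequence, separator, quality). It is a sketch under stated assumptions, not part of ML Repo: it assumes sequences are not wrapped across lines and that quality characters use the Phred+33 offset, and the function names `parse_fastq` and `phred33` are made up for this example.

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, quality) from an iterable of FASTQ lines.

    Assumes the common 4-line record layout with unwrapped sequences.
    """
    stripped = (line.rstrip("\r\n") for line in lines)
    for header in stripped:
        if not header:
            continue  # tolerate blank lines between records
        seq = next(stripped, None)
        plus = next(stripped, None)
        qual = next(stripped, None)
        if qual is None or not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


def phred33(qual):
    """Decode a quality string to integer scores, assuming a Phred+33 offset."""
    return [ord(ch) - 33 for ch in qual]


if __name__ == "__main__":
    # A single made-up read, for illustration only.
    example = ["@read1\n", "ACGT\n", "+\n", "II!I\n"]
    for read_id, seq, qual in parse_fastq(example):
        print(read_id, seq, phred33(qual))
```

A FASTA reader would be simpler still, since it drops the separator and quality lines and marks headers with `>` instead of `@`.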