
Microbiome Learning Repo (ML Repo): A public repository of microbiome regression and classification tasks.

Pajau Vangay, Benjamin M. Hillmann, Dan Knights

Abstract

The use of machine learning in high-dimensional biological applications, such as the human microbiome, has grown exponentially in recent years, but algorithm developers often lack the domain expertise required for interpretation and curation of the heterogeneous microbiome datasets. We present Microbiome Learning Repo (ML Repo, available at https://knights-lab.github.io/MLRepo/), a public, web-based repository of 33 curated classification and regression tasks from 15 published human microbiome datasets. We highlight the use of ML Repo in several use cases to demonstrate its wide application, and we expect it to be an important resource for algorithm developers.
© The Author(s) 2019. Published by Oxford University Press.


Keywords:  database; machine learning; microbiome; repository


Year:  2019        PMID: 31042284      PMCID: PMC6493971          DOI: 10.1093/gigascience/giz042

Source DB:  PubMed          Journal:  Gigascience        ISSN: 2047-217X            Impact factor:   6.524


Findings

Background

Machine learning is widely used as a method for classification and prediction, with a growing number of applications in human health [1]. The use of machine learning in biological fields [2, 3], and more specifically in microbiome research [4-7], has grown exponentially owing to the robustness of these algorithms to high-dimensional data. However, large-scale meta-analyses remain challenging because they often require manual curation of metadata and standardized processing of raw sequence data, resulting in variation in the results derived from chosen datasets across studies [8, 9]. In addition, microbiome research data can be challenging to access and analyze for expert machine learning algorithm developers, who often lack the domain expertise required to interpret the data and metadata in complex microbiome studies. General resources with curated classification tasks from a variety of domains do exist: the University of California Irvine Machine Learning Repository [10] revolutionized machine learning methods development by giving developers access to many curated datasets, and its widespread usage and impact can be seen from its thousands of resulting citations. However, we are unaware of any machine learning repository dedicated to microbiome classification tasks. We constructed a complementary database to address this deficiency, in order to promote the development and use of improved machine learning methods in the microbiome community.

Workflow

We present the Microbiome Learning Repo (ML Repo), a repository of 33 curated classification and regression tasks involving human microbiome data. Our 33 tasks are derived from 15 publicly available human microbiome datasets, which include 12 amplicon-based and 3 shotgun sequencing datasets (Table 1). These datasets vary across sequencing technology platforms, 16S hypervariable regions, and study design, in order to help developers ensure robustness of algorithms across data types.

We streamlined the microbiome data using a single post-processing workflow (Fig. 1A). We downloaded trimmed and quality-filtered sequencing reads for 8 datasets from QIITA [11], and raw sequences for 7 datasets from public repositories. Raw sequences were trimmed and quality filtered using SHI7 [12] or QIIME [13]. We picked operational taxonomic units (OTUs) from all quality-filtered sequences using a closed-reference method with the BURST [14] aligner against both the National Center for Biotechnology Information (NCBI) RefSeq 16S ribosomal RNA project [15] and the Greengenes 97 database [16]. Samples with <1,000 sequencing reads were dropped for 10 datasets, while we applied a lower threshold of 100 sequencing reads per sample for 5 datasets that had lower expected bacterial load. Full details regarding the data preprocessing are provided for each dataset in the mlrepo-source branch of the GitHub repository, under preprocessing/make.mappings.r.

As a result, for each dataset we generated RefSeq-based OTU and taxa abundance counts, and Greengenes-based OTU and taxa abundance counts. These counts are presented in tables with OTUs or taxa as rows and samples as columns. OTUs are represented as either NCBI genome identifiers or Greengenes identifiers. Taxa are represented as "kingdom; phylum; class; order; family; genus; species; strain," with the highest taxonomic specificity possible.
We excluded additional post-processing filtering and normalization steps so that these parameters can be explored in future benchmarking use cases. We also limit our data to OTU and taxa tables because other metrics, such as α and β diversity, can be subsequently generated as needed.
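The table layout described above (OTUs or taxa as rows, samples as columns) and the per-sample read-depth thresholds can be sketched with a short loader. This is a minimal illustration, not ML Repo's actual tooling; the tab-delimited layout and the `load_taxa_table`/`drop_shallow_samples` names are assumptions.

```python
# Minimal sketch: parse an ML Repo-style abundance table (rows = taxa or OTUs,
# columns = samples) and apply the read-depth threshold described above.
import csv
import io

def load_taxa_table(text):
    """Parse a tab-delimited abundance table into {sample: {taxon: count}}."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    samples = rows[0][1:]
    table = {s: {} for s in samples}
    for row in rows[1:]:
        taxon = row[0]
        for sample, count in zip(samples, row[1:]):
            table[sample][taxon] = int(count)
    return table

def drop_shallow_samples(table, min_reads=1000):
    """Drop samples whose total read count falls below the threshold."""
    return {s: t for s, t in table.items() if sum(t.values()) >= min_reads}
```

For the low-biomass datasets mentioned above, `min_reads` would be set to 100 instead of the default 1,000.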
Table 1:

Microbiome datasets with available classification tasks in ML Repo

Project name    | V region | Target size | No. samples | No. subjects | Area                 | Description                                                                               | Sequencing technology | Study design
Cho 2012        | V3       | 177         | 95          | 47           | Antibiotics          | Mouse fecal and cecal samples, control vs 4 kinds of antibiotics                          | 454                   | Cross-sectional
Claesson 2012   | V4       | 221         | 168         | 168          | Age                  | Elderly and young adults                                                                  | 454                   | Cross-sectional
David 2014      | V4       | 282         | 235         | 11           | Diet                 | Plant-based vs animal-based diet, cross-over study                                        | Illumina MiSeq        | Longitudinal
Gevers 2014     | V4       | 173         | 1,321       | 668          | IBD                  | Biopsies from patients with IBD prior to treatment                                        | Illumina MiSeq        | Cross-sectional
HMP 2012        | V3-5     | 527         | 6,407       | 242          | Body habitat, sex    | Up to 18 body sites across 242 healthy subjects at 1-2 time points                        | 454                   | Cross-sectional
Kostic 2012     | V3-5     | 569         | 190         | 95           | Colorectal cancer    | Adjacent healthy vs tumor colon biopsy tissues                                            | 454                   | Paired
Montassier 2016 | V5-6     | 280         | 28          | 28           | Bacteremia           | Patients prior to chemotherapy who did or did not develop bacteremia                      | 454                   | Cross-sectional
Morgan 2012     | V3-5     | 569         | 231         | 231          | IBD                  | Healthy controls, patients with Crohn's disease or ulcerative colitis                     | 454                   | Cross-sectional
Turnbaugh 2009  | V2       | 230         | 281         | 154          | Obesity              | Monozygotic or dizygotic twin pairs concordant for body mass index class, and their mothers | 454                 | Cross-sectional
Wu 2011         | V1-2     | 244         | 95          | 10           | Diet                 | Controlled high-fat or low-fat feeding on 10 subjects over 10 days                        | 454                   | Longitudinal
Yatsunenko 2012 | V4       | 282         | 531         | 531          | Geography, age, sex  | Humans of varying ages from the USA, Malawi, and Venezuela                                | Illumina MiSeq        | Cross-sectional
Ravel 2011      | V1-2     | 240         | 396         | 396          | Bacterial vaginosis  | Vaginal samples from 4 ethnic groups; Nugent scores for bacterial vaginosis               | 454                   | Cross-sectional
Karlsson 2013   | NA       | NA          | 144         | 144          | Diabetes             | Patients with normal, impaired, or type 2 diabetes glucose tolerance categories           | Illumina HiSeq        | Cross-sectional
Qin 2012        | NA       | NA          | 134         | 134          | Diabetes             | Chinese healthy controls vs patients with type 2 diabetes                                 | Illumina HiSeq        | Cross-sectional
Qin 2014        | NA       | NA          | 130         | 130          | Cirrhosis            | Healthy controls vs patients with cirrhosis                                               | Illumina HiSeq        | Cross-sectional

ML Repo contains 33 classification and regression tasks from 15 publicly available human microbiome datasets shown here. IBD: inflammatory bowel disease; NA: not applicable.

Figure 1:

Data processing workflow and website generation. (A) Quality-filtered sequences were obtained either from QIITA or from another public repository and trimmed and filtered using SHI7. Reference-based OTUs were picked using BURST with the NCBI RefSeq and Greengenes 97 (GG 97) databases. (B) Individual GitHub Markdown pages were generated from dataset and task lists with a custom Python script and Jinja2 template, then uploaded to GitHub to be hosted.

Sample metadata from individual studies were manually curated to generate viable prediction tasks. When available, published study exclusion criteria, such as reported use of antibiotics, were applied accordingly and confounders were removed by dropping samples or stratification. Well-known confounders were accounted for when constructing prediction tasks for other human-associated conditions; e.g., predicting age using the Yatsunenko 2012 dataset is restricted to samples from the USA owing to the known variation in gut microbiomes across different geographical locations. Details of how samples were subset for each prediction task can be found in the mlrepo-source branch of the GitHub repository, under preprocessing/make.mappings.r. Studies that were cross-sectional by design but contained several samples per subject were filtered to contain 1 sample per subject. In study designs with paired diseased-healthy or pre- and post-intervention samples, samples were reduced to 2 samples per subject with subject identifiers provided as confounder variables.
Hence, each prediction task is made available as an individual, compartmentalized metadata file that contains sample identifiers, responses to predict, and optionally, confounder variables that are inherent to the research study design such as paired healthy and diseased samples from the same subject (see Methods for more details). As a result, we generated 33 distinct tasks for predicting human-associated responses.
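A task metadata file of the kind described above can be consumed with a few lines of code. This is a hedged sketch: the column names used here ("#SampleID", "Var", "ControlVar") are assumptions for illustration, not a guarantee of ML Repo's actual headers, which should be checked on each task page.

```python
# Sketch: read a task metadata file containing sample identifiers, the response
# to predict, and an optional confounder (control variable) column.
import csv
import io

def read_task(text):
    """Return sample IDs, responses, and (optionally) confounder values."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    ids = [r["#SampleID"] for r in rows]
    responses = [r["Var"] for r in rows]
    # The confounder column is optional; return None when it is absent.
    confounders = [r["ControlVar"] for r in rows] if rows and "ControlVar" in rows[0] else None
    return ids, responses, confounders
```

Tasks without an inherent study-design confounder simply omit the third column, and the reader returns None for it.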

Publicly available web-based interface

We expect 2 types of users: (i) machine learning algorithm developers with limited knowledge of microbiome study designs and (ii) microbiome researchers interested in obtaining additional datasets for meta-analysis. Generally, we expect that method developers will be most interested in sweeping through the full set of prediction tasks for benchmarking, and hence would prefer to download a single compressed file containing all tasks and data. On the other hand, we expect microbiome researchers to be more selective in downloading specific datasets and tasks depending on their research domain. Hence, researchers may prefer to browse specific details about tasks and datasets prior to downloading. On the basis of these expected use cases, we created a publicly available web-based interface for ML Repo hosted by GitHub Pages [17]. Tasks are organized by relevant response categories (Fig. 2A). Task pages contain descriptive details such as sample size and response type that are specific to the selected prediction task, as well as links for downloading OTU tables, taxa tables, and sample metadata (Fig. 2B). Dataset pages contain important details about the entire dataset, including links to the original research study, as well as original metadata files and quality-filtered sequences (Fig. 2C). We also provide a single compressed file containing the entire set of available tasks (OTU tables, taxa tables, and relevant metadata) for download from the main home page.
Figure 2:

Screenshots of ML Repo web interface. (A) Available classification and regression tasks are listed by high-level phenotype categories for browsing. (B) Individual task webpages contain links to files for classifying a specific task, as well as relevant task-specific metadata. (C) Individual dataset webpages contain relevant metadata pertaining to the entire dataset, as well as links to raw metadata files and sequencing data.


Benefits of curated microbiome-based prediction tasks

We expect ML Repo to benefit both the machine learning community and the microbiome research community. ML Repo will be a powerful complement to the University of California Irvine Machine Learning Repository because it allows benchmarking of curated classification tasks with high-dimensional data and hence enables the subsequent development of novel algorithms for these complex datasets. Our streamlined approach to generating OTU and taxa tables offers a rich set of 15 datasets that microbiome researchers can use directly for further comparison with their own studies, for teaching and learning purposes, or for large meta-analyses. We expect that our provided OTU and taxa tables will also be beneficial for researchers with limited access to the high-performance computing resources or bioinformatics skills necessary for processing raw sequencing data. In addition, we expect microbiome-specific methods development to benefit from our repository: the subset of samples found in each prediction task metadata file replaces the work of rigorously deciphering metadata and understanding the subtle differences of individual research studies. New methods, such as OTU-picking algorithms, can be evaluated not only on metrics such as speed and accuracy but also on overall impact on study findings.

Comparison to similar databases

Although a number of microbiome repositories exist, many are intended as data archival repositories [18, 19] or function as resources for aggregating across studies [20]. Resources such as QIITA [11] offer an extensive collection of datasets, and the mock-community-based Mockrobiota [21] is well suited for benchmarking upstream methods, but neither offers support for the metadata interpretation necessary for predicting high-level phenotypes. Microbiome repositories that do provide manually curated metadata include curatedMetagenomicData [22] and MicrobiomeHD [23]. curatedMetagenomicData offers a collection of shotgun-metagenomics datasets across varying human sample types, with gene, pathway, and taxonomic abundance tables, but its data are accessible only via Bioconductor [24] and are stored as ExpressionSet objects, which integrate metadata and abundance data. Although curatedMetagenomicData is an impressive repository with many features, it is most suitable for advanced bioinformaticians because its interface may hinder use by beginner data analysts and in teaching environments. MicrobiomeHD offers easily accessible taxonomic abundance tables with curated metadata but is limited to amplicon-based sequencing data, human stool samples, and case-control responses. And although both curatedMetagenomicData and MicrobiomeHD provide manually curated metadata, biological interpretation is still required because other sample metadata, e.g., antibiotic use, may have biological relevance in predicting responses. This poses a potential problem for machine learning developers with limited biological and microbiome domain expertise. ML Repo resolves this issue by explicitly defining classification and regression tasks for predicting responses that have either been manually curated to remove confounders or been specifically annotated with biological confounders that must be controlled for.
Metadata files in ML Repo are task-specific and, hence, are simplified to contain only (i) sample identifiers indicating samples that should be used for the prediction task, (ii) corresponding high-level phenotypes or responses, and optionally, (iii) a confounder that should be accounted for owing to its biological relevance. In addition, datasets in ML Repo include both amplicon-based and shotgun-metagenomics datasets covering a variety of human sample types, and are easily accessible via a web-based interface.

Case studies

We compare the performance of 3 machine learning models: a random forest [25], and a support vector machine (SVM) [26] with either a radial or linear kernel. Sweeping through available tasks with binary responses, we compare our models by examining receiver operating curves (ROCs) and areas under the curve (AUCs), considered the standard method for machine learning model evaluation [27, 28] (Fig. 3). Through comparison of ROCs, we can see that random forest outperforms or ties the other 2 models in 21 of the 28 tasks. The choice of kernel for SVM seems to have limited impact on overall mean accuracy, yet a linear kernel was able to perfectly classify penicillin-treated and vancomycin-treated mouse cecal contents when the other models could not; further examination of the microbial features in these samples may be warranted to better elucidate the strengths of this kernel. We also performed pairwise comparisons of random forest against the other models across all tasks. When evaluated by AUC, random forest performed significantly better than both SVM with a linear kernel (P = 0.0014) and SVM with a radial kernel (P = 0.00032) (Fig. 4A). We found that random forest accuracy improvements were moderate when compared with SVM-Linear (P = 0.083) and SVM-Radial (P = 0.03) (Fig. 4B), which may be explained by the fact that, unlike AUC, accuracy ignores class prediction probability estimates. Our results support the broad usage [4, 5, 8, 29] and acceptance of random forest as a robust classifier [6] for high-dimensional microbiome data.
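The AUC used above has a simple probabilistic reading: it equals the probability that a randomly chosen positive sample receives a higher predicted class probability than a randomly chosen negative one (the Mann-Whitney formulation). A minimal standard-library implementation, offered only to make the metric concrete:

```python
# AUC via the Mann-Whitney rank statistic: the fraction of (positive, negative)
# pairs in which the positive sample gets the higher score, counting ties as 0.5.
def auc(labels, scores):
    """labels: 1 = positive, 0 = negative; scores: predicted P(positive)."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 1.0 means the model ranks every positive above every negative; 0.5 corresponds to random ranking, which is why AUC, unlike plain accuracy, rewards well-ordered class probability estimates.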
Figure 3:

ROCs comparing random forest and SVM with different kernels. Sweeping across all binary classification tasks available in ML Repo (28), we compare ROCs of random forest, SVM with a radial kernel, and SVM with a linear kernel. AUCs are listed within plots and are colored respective to each model. cd: Crohn's disease; dz: dizygotic; mz: monozygotic; uc: ulcerative colitis.

Figure 4:

Summary statistics of framework and database comparisons. (A) AUCs of random forest (rf) to SVM-Linear (left) and random forest to SVM-Radial (right). Paired t-tests reveal that random forest results in significantly higher AUC than both SVM-Linear (P = 0.0014) and SVM-Radial (P = 0.00032). (B) Accuracies of random forest to SVM-Linear (left) and random forest to SVM-Radial (right). Paired t-tests reveal that random forest results in significantly better accuracy than SVM-Radial (P = 0.03), but not SVM-Linear (P = 0.083). (C) AUCs (left) and accuracies (right) of random forest classifications of 24 tasks using OTUs picked with NCBI RefSeq database or Greengenes (gg) database as predictors. Student t-test reveals that reference database choice has limited impact on classification AUC or accuracy. Lines are colored by the top model for each classification task.

To assess the impact of reference database choice on classification accuracies, we also used the classification tasks to compare random forest using OTUs picked with the Greengenes 97 database or the NCBI RefSeq Targeted Loci 16S project. We found that there was limited impact of database choice on overall classification accuracies (Figs 4C and 5). This may be due to (i) large effect sizes that are driven mainly by several well-characterized bacterial taxa present in both databases (e.g., stool vs tongue samples), or (ii) small effect sizes such that classification is difficult regardless of the database (e.g., male vs female stool).
Note that OTU-picking with the Greengenes database resulted in more OTU features in every dataset (Table 2); hence, these findings further highlight that the smaller, higher-quality NCBI RefSeq database can recover the same signal as the larger Greengenes database.
Figure 5:

ROCs comparing NCBI RefSeq and Greengenes 97 (gg97) databases. Sweeping across 16S-based binary classification tasks available in ML Repo (24), we compare ROCs of random forest with genus-level taxonomic summaries as predictors from OTU-picking strategies with the NCBI RefSeq prokaryote reference database and the Greengenes 97 reference database. AUCs are listed within plots and are colored respective to each database. cd: Crohn's disease; dz: dizygotic; mz: monozygotic; uc: ulcerative colitis.

Table 2:

Description of available prediction tasks

Dataset | Attributes | Description | Area | Regression? | Sample size | OTUs (RefSeq) | OTUs (GG) | Taxa (RefSeq) | Taxa (GG) | Control variable
Cho 2012 | Abx: Control, Chlortetracycline | 5 groups of mice treated with 4 different antibiotics or no antibiotics | Antibiotics |  | 47 | 293 | 1,144 | 299 | 141 | N
Cho 2012 | Abx: Control, Chlortetracycline | 5 groups of mice treated with 4 different antibiotics or no antibiotics | Antibiotics |  | 45 | 293 | 1,144 | 299 | 141 | N
Cho 2012 | Abx: Penicillin, Vancomycin | 5 groups of mice treated with 4 different antibiotics or no antibiotics | Antibiotics |  | 47 | 293 | 1,144 | 299 | 141 | N
Cho 2012 | Abx: Penicillin, Vancomycin | 5 groups of mice treated with 4 different antibiotics or no antibiotics | Antibiotics |  | 45 | 293 | 1,144 | 299 | 141 | N
Claesson 2012 | AGE: Elderly, Young | Elderly or young adults | Age |  | 167 | 569 | 3,763 | 662 | 279 | N
David 2014 | Diet: Plant, Animal | Individuals on the last day of an animal or plant diet intervention | Diet |  | 18 | 1,747 | 6,293 | 1,535 | 695 | Y
Gevers 2014 | DIAGNOSIS: no, CD | Healthy controls and patients with CD | IBD |  | 140 | 943 | 3,547 | 992 | 446 | N
Gevers 2014 | DIAGNOSIS: no, CD | Healthy controls and patients with CD | IBD |  | 160 | 943 | 3,547 | 992 | 446 | N
Gevers 2014 | PCDAI | PCDAI scores of patients with CD at 6 months after sampling | IBD | X | 68 | 943 | 3,547 | 992 | 446 | N
Gevers 2014 | PCDAI | PCDAI scores of patients with CD at 6 months after sampling | IBD | X | 51 | 943 | 3,547 | 992 | 446 | N
HMP 2012 | HMPBODYSUPERSITE: Oral, Gastrointestinal_tract; HOST_SUBJECT_ID | Gastrointestinal tract and oral cavity of healthy adults | Body habitat |  | 2,070 | 3,121 | 9,383 | 3,090 | 1,218 | Y
HMP 2012 | SEX: male, female | Healthy male and female adults | Sex |  | 180 | 3,121 | 9,383 | 3,090 | 1,218 | N
HMP 2012 | HMPBODYSUBSITE: Stool, Tongue_dorsum; HOST_SUBJECT_ID | Stool and tongue of healthy adults | Body habitat |  | 404 | 3,121 | 9,383 | 3,090 | 1,218 | Y
HMP 2012 | HMPBODYSUBSITE: Subgingival_plaque, Supragingival_plaque; HOST_SUBJECT_ID | Subgingival and supragingival plaque of healthy adults | Body habitat |  | 408 | 3,121 | 9,383 | 3,090 | 1,218 | Y
Karlsson 2013 | Classification: IGT, T2D | Impaired or type 2 diabetes glucose tolerance categories | Diabetes |  | 101 | 12,845 | NA | 3,758 | NA | N
Karlsson 2013 | Classification: NGT, T2D | Normal or type 2 diabetes glucose tolerance categories | Diabetes |  | 96 | 12,845 | NA | 3,758 | NA | N
Kostic 2012 | DIAGNOSIS: Healthy, Tumor; HOST_SUBJECT_ID | Colorectal carcinoma tumors and adjacent nonaffected tissues | Cancer |  | 172 | 908 | 3,228 | 980 | 409 | Y
Montassier 2016 | Treatment: bact, NObact | Patients prior to chemotherapy who did or did not develop bacteremia | Bacteremia |  | 28 | 541 | 1,852 | 640 | 228 | N
Morgan 2012 | ULCERATIVE_COLIT_OR_CROHNS_DIS: Crohn's disease, Healthy | Healthy controls or patients with CD or ulcerative colitis | IBD |  | 128 | 829 | 3,677 | 877 | 367 | N
Morgan 2012 | ULCERATIVE_COLIT_OR_CROHNS_DIS: Ulcerative Colitis, Healthy | Healthy controls or patients with CD or ulcerative colitis | IBD |  | 128 | 829 | 3,677 | 877 | 367 | N
Qin 2012 | Diabetic: Y, N | Healthy controls or patients with type 2 diabetes | Diabetes |  | 124 | 11,880 | NA | 2,526 | NA | N
Qin 2014 | Cirrhotic: Cirrhosis, Healthy | Healthy controls or patients with cirrhosis | Cirrhosis |  | 130 | 8,483 | NA | 2,579 | NA | N
Ravel 2011 | Ethnic_Group: Black, Hispanic | Vaginal microbiomes of black and Hispanic women | Vaginal |  | 199 | 586 | 1,093 | 660 | 305 | N
Ravel 2011 | Nugent_score_category: low, high | Predict Nugent score category (low, high) from vaginal microbiome | Vaginal |  | 342 | 586 | 1,093 | 660 | 305 | N
Ravel 2011 | Nugent_score | Predict Nugent score from vaginal microbiome | Vaginal | X | 388 | 586 | 1,093 | 660 | 305 | N
Ravel 2011 | pH | Predict pH from vaginal microbiome | Vaginal | X | 388 | 586 | 1,093 | 660 | 305 | N
Ravel 2011 | Ethnic_Group: White, Black | Vaginal microbiomes of white and black women | Vaginal |  | 200 | 586 | 1,093 | 660 | 305 | N
Turnbaugh 2009 | OBESITYCAT: Lean, Obese; ZYGOSITY: MZ, DZ, Mom | Lean or obese individuals (monozygotic or dizygotic twins or their mothers) | Obesity |  | 142 | 557 | 4,051 | 680 | 232 | Y
Wu 2011 | DIET: HighFat, LowFat | Individuals after completing a high-fat or low-fat diet intervention | Diet |  | 10 | 292 | 1,769 | 361 | 136 | N
Yatsunenko 2012 | AGE | Infants (up to age 3 years) from the USA | Age | X | 49 | 4,660 | 15,783 | 4,021 | 1,544 | N
Yatsunenko 2012 | COUNTRY: GAZ: Venezuela, GAZ: Malawi | Individuals living in Malawi or Venezuela | Geography |  | 54 | 4,660 | 15,783 | 4,021 | 1,544 | N
Yatsunenko 2012 | SEX: male, female | Males and females from the USA | Sex |  | 129 | 4,660 | 15,783 | 4,021 | 1,544 | N
Yatsunenko 2012 | COUNTRY: GAZ: United States of America, GAZ: Malawi | Individuals living in the USA or Malawi | Geography |  | 150 | 4,660 | 15,783 | 4,021 | 1,544 | N

Abx: antibiotics; CD: Crohn's disease; DZ: dizygotic; GG: Greengenes; IBD: inflammatory bowel disease; IGT: impaired glucose tolerance; MZ: monozygotic; NGT: normal glucose tolerance; PCDAI: Pediatric Crohn's Disease Activity Index; T2D: type 2 diabetes; GAZ: Gazeteer, an ontology of place names.


Future work

We expect and hope that the broader microbiome research community will add new datasets and prediction tasks to ML Repo. We provide instructions [30] on our GitHub repository that guide users to fork our repository, add the appropriate data and files, and update the master task and dataset lists. Researchers can then submit a pull request for our review, and requests that are properly formatted will be accepted and merged into the repository. We expect that data submissions will come from either the original researchers or those well acquainted with the datasets, and hence that sample selection and subsetting will have undergone rigorous review for each prediction task.

Methods

Pre-processing of sequencing reads

When available, preprocessed FASTA files were downloaded from QIITA (or previously, the QIIME database). For all other datasets, raw FASTQ files were downloaded from the sources listed in Supplemental Table 1. Adaptors and barcodes were removed and sequences were quality filtered (at Phred score ≥ Q20) using SHI7 [12] or QIIME [13]. OTUs were picked from processed FASTA files using BURST [14] against Greengenes [16] 97 or the NCBI RefSeq Targeted Loci 16S project [15] (accessed on 4 July 2017). Samples with sequencing depth <1,000 sequences per sample were dropped for all studies, except for 5 datasets [31-35], where the minimum threshold was 100 sequences per sample.
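The Phred ≥ Q20 criterion above can be made concrete with a few lines of code. This is a simplified sketch assuming Sanger-offset (+33) FASTQ quality strings and a mean-quality criterion; the actual pipelines (SHI7, QIIME) apply more sophisticated trimming rather than simple pass/fail filtering.

```python
# Sketch of a Phred >= Q20 quality check on a FASTQ quality string.
# Assumes the Sanger/Illumina 1.8+ ASCII offset of 33.
def mean_phred(qual):
    """Mean Phred score of a FASTQ quality string (offset 33)."""
    return sum(ord(c) - 33 for c in qual) / len(qual)

def passes_q20(qual):
    """True if the read's mean quality meets the Q20 threshold (99% base accuracy)."""
    return mean_phred(qual) >= 20
```

Q20 corresponds to a 1-in-100 base-call error probability, which is why it is a common floor for amplicon quality filtering.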

Selection of classification tasks

Classification tasks were selected on the basis of reported study results, biologically relevant high-level phenotypes, and sufficient sample sizes. Original metadata files and research methods were rigorously and manually curated in order to subset samples with minimal confounders. For confounders that were inherent to the study, we include an additional variable to control for in the task metadata files. The presence of control variables can be found by examining “control_vars” in the Tasks table.

Website generation

Website templating was developed using Jinja2 [36] and custom Python scripts. Individual webpages were generated by iterating through items in the Tasks and Datasets tables, and dynamically populating templates to generate individual Markdown [37] pages. The resulting Markdown pages are hosted as GitHub Pages.
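The paper's site generation uses Jinja2; as a dependency-free sketch of the same templating idea, the standard library's string.Template can fill a Markdown page per task. The template text and field names below are illustrative assumptions, not ML Repo's actual template.

```python
# Minimal stand-in for the Jinja2 templating step: iterate over task records
# and substitute their fields into a Markdown page template.
from string import Template

PAGE = Template(
    "# $title\n"
    "\n"
    "Samples: $n\n"
    "\n"
    "[Download OTU table]($otu_url)\n"
)

def render_task_page(fields):
    """Fill the Markdown page template with one task's fields."""
    return PAGE.substitute(fields)
```

In the real workflow, one such page is rendered per row of the Tasks and Datasets tables and committed so that GitHub Pages serves the resulting Markdown.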

Case study benchmarking

Case study results were generated with custom R [38] scripts, which can be found in the /example folder of the ML Repo GitHub repository. To compare machine learning models, we iterated through tasks with binary responses. OTU counts were converted to relative abundances, filtered at a minimum of 10% prevalence across samples, and collapsed at a complete-linkage correlation of 95% (i.e., calculating the Pearson correlation between each pair of OTUs using all complete pairs of observations, hierarchically clustering the results, and cutting the resulting dendrogram at a height of 0.05). We then constructed a 5-fold cross-validation for tasks containing >100 samples, or a leave-one-out cross-validation for tasks with fewer samples. For n-fold cross-validation, samples were assigned to folds such that classes were equally balanced within each fold (e.g., if our task contained 40% healthy and 60% diseased samples, our folds would also be selected to represent this distribution). For tasks that contained control variables, we selected folds such that samples with the same control variable value were contained within the same fold. For example, for a task containing matching stool and oral samples from subjects, the subject identifier would be listed as the control variable and samples would be assigned to folds such that all samples from a given subject were contained within a single fold. This step is crucial to avoid biasing or overfitting the training model; test folds should contain not only new samples but also samples that are independent of those in the training set. Models were constructed using the "caret" package [39]. Control parameters were set using the function trainControl with parameter method = "none" and default parameters.
Default settings for all models are as follows: the SVM radial basis σ is set to 0.1, C for all SVMs [40] is set to 1, and the randomForest number of trees is set to 500 with the number of variables considered at each split set to sqrt(p), where p is the number of features. This entire process was bootstrapped 100 times, and the mean class probabilities were used to calculate the resulting AUCs and ROCs. To compare classification accuracies using different reference databases, we used a similar procedure but held the model constant and predicted using different base OTU tables. This framework enables comparison of a myriad of machine learning models available in the "caret" package and can be easily expanded to compare different OTU-picking algorithms, or normalization and filtering techniques.
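The fold-assignment rule described above, keeping all samples that share a control-variable value (e.g., the same subject) inside a single fold, can be sketched as follows. This simplified version (the original is implemented in R) assigns whole groups round-robin and omits the class-balancing constraint; the function name is an illustrative assumption.

```python
# Sketch: group-aware fold assignment so that test folds stay independent of the
# training data. All samples with the same control-variable value land in one fold.
from collections import defaultdict

def grouped_folds(sample_ids, groups, n_folds=5):
    """Map each sample ID to a fold index, never splitting a group across folds."""
    members = defaultdict(list)
    for sid, group in zip(sample_ids, groups):
        members[group].append(sid)
    fold_of = {}
    # Assign whole groups to folds round-robin.
    for i, group in enumerate(members):
        for sid in members[group]:
            fold_of[sid] = i % n_folds
    return fold_of
```

Without this grouping, paired samples from one subject could appear in both the training and test folds, inflating apparent accuracy through leakage.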

Availability of supporting data and materials

All test datasets are available in the Microbiome Learning Repo site [17]. Snapshots of our code and other supporting data are available in the GigaScience database, GigaDB [41].

Availability of supporting source code and requirements

Project name: Microbiome Learning Repo
Project home page: https://knights-lab.github.io/MLRepo/
Operating system: Platform independent
Programming language: Python, R
License: MIT License
Restrictions: None
RRID: SCR_017079

Abbreviations

AUC: area under the curve; IBD: inflammatory bowel disease; ML Repo: Microbiome Learning Repo; NCBI: National Center for Biotechnology Information; OTU: operational taxonomic unit; ROC: receiver operating characteristic; SVM: support vector machine.

Competing interests

D.K. serves as CEO and holds equity in CoreBiome, a company involved in the commercialization of microbiome analysis. The University of Minnesota also has financial interests in CoreBiome under the terms of a license agreement with CoreBiome. These interests have been reviewed and managed by the University of Minnesota in accordance with its Conflict-of-Interest policies.

Funding

This work is supported by funds from National Institutes of Health grant R01AI121383.

Authors’ contributions

Conceptualization: P.V. and D.K.; data curation: P.V.; formal analyses: P.V.; methodology: P.V., B.M.H., D.K.; software: P.V.; writing—original draft: P.V.; writing—review and editing: B.M.H. and D.K.

Glossary

OTU: Operational taxonomic unit; a group of closely related organisms defined by DNA sequence similarity.
16S: The 16S ribosomal RNA gene, a component of the prokaryotic ribosome, used to reconstruct phylogenies.
FASTA: A text-based format for representing nucleotide sequences with single-letter codes.
FASTQ: A text-based format for representing nucleotide sequences and their corresponding quality scores, with single-letter codes for both nucleotides and quality.
Taxa: Groups of one or more populations of organisms, usually summarized at the phylum, class, order, family, genus, or species level.
Metadata: Descriptive data pertaining to the samples within a study.
Shotgun: Shotgun metagenomic sequencing breaks all available DNA into random small fragments and sequences them as short reads; reads can be aligned directly to a reference database, or overlapping reads can be assembled into contiguous sequences.
References (27 in total; first 10 shown)

1. Furey TS, Cristianini N, Duffy N, Bednarski DW, Schummer M, Haussler D. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics. 2000.
2. Knights D, Costello EK, Knight R. Supervised classification of human microbiota. FEMS Microbiol Rev. 2010.
3. A framework for human microbiome research. Nature. 2012.
4. Pasolli E, Truong DT, Malik F, Waldron L, Segata N. Machine learning meta-analysis of large metagenomic datasets: tools and biological insights. PLoS Comput Biol. 2016.
5. Bokulich NA, Rideout JR, Mercurio WG, Shiffer A, Wolfe B, Maurice CF, Dutton RJ, Turnbaugh PJ, Knight R, Caporaso JG. mockrobiota: a public resource for microbiome bioinformatics benchmarking. mSystems. 2016.
6. Sze MA, Schloss PD. Looking for a signal in the noise: revisiting obesity and the microbiome. mBio. 2016.
7. Smith MI, Yatsunenko T, Manary MJ, et al. Gut microbiomes of Malawian twin pairs discordant for kwashiorkor. Science. 2013.
8. Turnbaugh PJ, Hamady M, Yatsunenko T, et al. A core gut microbiome in obese and lean twins. Nature. 2008.
9. O'Leary NA, Wright MW, Brister JR, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2015.
10. Al-Ghalith GA, Hillmann B, Ang K, Shields-Cutler R, Knights D. SHI7 is a self-learning pipeline for multipurpose short-read DNA quality control. mSystems. 2018.