q2-sample-classifier: machine-learning tools for microbiome classification and regression.

Nicholas A Bokulich, Matthew R Dillon, Evan Bolyen, Benjamin D Kaehler, Gavin A Huttley, J Gregory Caporaso.

Abstract

q2-sample-classifier is a plugin for the QIIME 2 microbiome bioinformatics platform that facilitates access, reproducibility, and interpretation of supervised learning (SL) methods for a broad audience of non-bioinformatics specialists.

Year:  2018        PMID: 31552137      PMCID: PMC6759219          DOI: 10.21105/joss.00934

Source DB:  PubMed          Journal:  J Open Res Softw        ISSN: 2049-9647


Microbiome studies often aim to predict outcomes or differentiate samples based on their microbial compositions, tasks that can be efficiently performed by SL methods (Knights et al., 2011). The goal of SL is to train a machine learning model on a set of samples with known target values or class labels, and then use that model to predict the target values or class membership of additional, unlabeled samples. The ability to categorize new samples, as opposed to describing the structure of existing data, lends itself to many useful applications, e.g., the prediction of disease or susceptibility to disease (Pasolli, Truong, Malik, Waldron, & Segata, 2016; Schubert, Sinani, & Schloss, 2015; Yazdani et al., 2016), crop productivity (Chang, Haudenshield, Bowen, & Hartman, 2017), wine chemical composition (Bokulich et al., 2016b), or sample collection site (Bokulich, Thorngate, Richardson, & Mills, 2013); the identification of mislabeled samples in microbiome data sets (Knights et al., 2011); or tracking microbiota-for-age development in children (Bokulich et al., 2016a; Subramanian et al., 2014).

We describe q2-sample-classifier, a QIIME 2 plugin to support SL tools for pattern recognition in microbiome data. This plugin provides several SL methods, automatic parameter tuning, feature selection, and various learning algorithms. The visualizations generated provide portable, shareable reports, publication-ready figures, and integrated decentralized data provenance. Additionally, integration as a QIIME 2 plugin streamlines data handling and supports the use of multiple user interfaces, including a prototype graphical user interface (q2studio), facilitating its use by non-expert users. The plugin is freely available under the BSD-3-Clause license at https://github.com/qiime2/q2-sample-classifier.

The q2-sample-classifier plugin is written in Python 3.5 and employs pandas (McKinney, 2010) and numpy (Walt, Colbert, & Varoquaux, 2011) for data manipulation, scikit-learn (Pedregosa et al., 2011) for SL and feature selection algorithms, scipy (Jones, Oliphant, Peterson, & others, 2001) for statistical testing, and matplotlib (Hunter, 2007) and seaborn (Waskom et al., 2017) for data visualization. The plugin is compatible with macOS and Linux operating systems.

The standard workflow for classification and regression in q2-sample-classifier is shown in Figure 1. All q2-sample-classifier actions accept a feature table (i.e., a matrix of feature counts per sample) and sample metadata (prediction targets) as input. Feature observations for q2-sample-classifier would commonly consist of microbial counts (e.g., amplicon sequence variants, operational taxonomic units, or taxa detected by marker-gene or shotgun metagenome sequencing methods), but any observation data, such as gene, transcript, protein, or metabolite abundance, could be provided as input. Input samples are shuffled and split into training and test sets at a user-defined ratio (default: 4:1) with or without stratification (equal sampling per class label; stratified by default); test samples are left out of all model training steps and are used only for final model validation.
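The splitting step described above can be sketched with scikit-learn's `train_test_split`. This is only an illustration under stated assumptions, not the plugin's own code: the toy feature table, the class labels (`gut`, `soil`), and all variable names are hypothetical, and `test_size=0.2` is used here to mirror the default 4:1 ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples x 5 features of counts, two class labels.
rng = np.random.default_rng(0)
feature_table = rng.integers(0, 100, size=(20, 5))
labels = np.array(["gut"] * 10 + ["soil"] * 10)

# test_size=0.2 corresponds to a 4:1 train:test ratio.
# stratify=labels keeps the class proportions the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    feature_table,
    labels,
    test_size=0.2,
    shuffle=True,
    stratify=labels,
    random_state=0,
)

# 16 training samples and 4 test samples; the test set is set aside
# and would only be used to validate the final model.
print(X_train.shape, X_test.shape)
```

Passing `stratify=None` instead would give a plain shuffled split, which corresponds to the "without stratification" option mentioned in the text.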
Figure 1:

Workflow schematic (A) and output data and visualizations (B-E) for q2-sample-classifier. Data splitting, model training, and testing (A) can be accompanied by automatic hyperparameter optimization (OPT) and recursive feature elimination for feature selection (RFE). Outputs include trained estimators for re-use on additional samples, lists of feature importance (B), RFE results if RFE is enabled (C), and predictions and accuracy results, including either confusion matrix heatmaps for classification results (D) or scatter plots of true vs. predicted values for regression results (E).

The user can enable automatic feature selection and hyperparameter tuning, and can select the number of cross-validation folds to use for each (default = 5). Feature selection is performed using cross-validated recursive feature elimination via scikit-learn’s RFECV to select the features that maximize predictive accuracy. Hyperparameter tuning is performed automatically using a cross-validated randomized search over a parameter grid via scikit-learn’s RandomizedSearchCV to find hyperparameter combinations (within a sensible range) that maximize accuracy.

The following scikit-learn (Pedregosa et al., 2011) SL estimators are currently implemented in q2-sample-classifier: AdaBoost (Freund & Schapire, 1997), Extra Trees (Geurts, Ernst, & Wehenkel, 2006), Gradient boosting (Friedman, 2002), and Random Forest (Breiman, 2001) ensemble classifiers and regressors; linear SVC, linear SVR, and nonlinear SVR support vector machine classifiers/regressors (Cortes & Vapnik, 1995); k-Neighbors classifiers/regressors (Altman, 1992); and Elastic Net (Zou & Hastie, 2005), Ridge (Hoerl & Kennard, 1970), and Lasso (Tibshirani, 1996) regression models.
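The feature selection and tuning steps named above can be sketched directly with scikit-learn's `RFECV` and `RandomizedSearchCV`. This is a minimal illustration, not the plugin's implementation: the synthetic data, the choice of a Random Forest classifier, the parameter ranges, and the order in which the two steps are chained are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for a feature table (hypothetical data).
X, y = make_classification(
    n_samples=100, n_features=12, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Cross-validated recursive feature elimination (5 folds), scored by accuracy.
selector = RFECV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    step=2,
    cv=5,
    scoring="accuracy",
)
selector.fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

# Cross-validated randomized search over an illustrative parameter grid.
param_distributions = {
    "n_estimators": [10, 20, 50],
    "max_depth": [None, 2, 4, 8],
    "min_samples_split": [2, 4, 8],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=5,
    cv=5,
    scoring="accuracy",
    random_state=0,
)
search.fit(X_train_sel, y_train)

# The held-out test set is used only once, for the final accuracy estimate.
test_accuracy = search.score(X_test_sel, y_test)
print(selector.n_features_, search.best_params_, test_accuracy)
```

Both searches see only the training data, so the test accuracy reported at the end is not influenced by feature selection or tuning, in line with the text's statement that test samples are left out of all model training steps.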