James M W R McElhinney, Mary Krystelle Catacutan, Aurelie Mawart, Ayesha Hasan, Jorge Dias.
Abstract
Microbial communities are ubiquitous and carry an exceptionally broad metabolic capability. Upon environmental perturbation, microbes are also amongst the first natural responsive elements with perturbation-specific cues and markers. These communities are thereby uniquely positioned to inform on the status of environmental conditions. The advent of microbial omics has led to an unprecedented volume of complex microbiological data sets. Importantly, these data sets are rich in biological information with potential for predictive environmental classification and forecasting. However, the patterns in this information are often hidden amongst the inherent complexity of the data. There has been a continued rise in the development and adoption of machine learning (ML) and deep learning architectures for solving research challenges of this sort. Indeed, the interface between molecular microbial ecology and artificial intelligence (AI) appears to show considerable potential for significantly advancing environmental monitoring and management practices through their application. Here, we provide a primer for ML, highlight the notion of retaining biological sample information for supervised ML, discuss workflow considerations, and review the state of the art of the exciting, yet nascent, interdisciplinary field of ML-driven microbial ecology. Current limitations in this sphere of research are also addressed to frame a forward-looking perspective toward the realization of what we anticipate will become a pivotal toolkit for addressing environmental monitoring and management challenges in the years ahead.
Keywords: artificial intelligence; environmental monitoring; machine learning; metagenomics; microbial ecology; microbial omics; microbiology; predictive modeling
Year: 2022 PMID: 35547145 PMCID: PMC9083327 DOI: 10.3389/fmicb.2022.851450
Source DB: PubMed Journal: Front Microbiol ISSN: 1664-302X Impact factor: 6.064
FIGURE 1. The interface of microbial omics and machine learning (ML). A generalized and simplified overview of the workflows is presented, highlighting the major steps in the microbial omics and ML workflows as they relate to one another, along with key outcomes obtainable from the application of ML to omics data. Microbial community responses (the biological information on which learning is aimed) are summarized below the cartoon snapshot of a contaminated environment of interest. Here, HC cont., hydrocarbon contamination; PAH, polyaromatic hydrocarbons (as examples of targets in petroleum hydrocarbon scenarios); QC, quality control; ASV, amplicon sequence variant (ASVs are given here as an example of an omics classification; other examples include the often used OTU, genes, mRNA transcripts, protein categories or metabolite IDs); DL, deep learning; ANN, artificial neural networks (shallow); RF, random forest; SVM, support vector machine; GB, gradient boost; LR, logistic regression; SMOTE, synthetic minority oversampling technique; SML, supervised machine learning; and MP, model performance.
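The caption lists SMOTE among the ML workflow steps without describing it. As a rough illustration only, the sketch below shows the core idea of synthetic minority oversampling: a new sample is placed at a random point on the line between a minority-class sample and one of its nearest minority-class neighbors. The function name `smote`, its parameters, and the toy ASV relative abundances are all assumptions made for this example; none of them come from the paper, and a real analysis would more likely use an established library implementation.

```python
import math
import random


def smote(minority, n_synthetic, k=2, seed=0):
    """Create n_synthetic points by interpolating between minority samples.

    minority: list of equal-length feature vectors (hypothetical example:
    ASV relative abundances of the under-represented class).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbor = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical toy data: rows are samples, columns are ASV relative abundances.
contaminated = [[0.70, 0.20, 0.10], [0.65, 0.25, 0.10], [0.60, 0.25, 0.15]]
new_samples = smote(contaminated, n_synthetic=4)
```

Because each synthetic row is a convex combination of two real rows, it stays within the observed range of every feature and, for relative abundances, still sums to one.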
Example applications of SML to microbial omics data for addressing environmental challenges.
| Environment | Niche | Application | Omics | Input data | Feature | Target(s) | SML architectures | Software | References |
|---|---|---|---|---|---|---|---|---|---|
| Aquatic | Marine (coral reef) | Prediction of environmental status | Metataxonomics | 16S rRNA OTUs | OTU abundance | Eutrophication indicators and temperature | RF | Caret and RF R packages | |
| Industrial | WWTP | Prediction of environmental variable to identify key subpopulations | Metataxonomics | 16S rRNA OTUs | OTU abundance, PCA coordinates | WWTP water temperature | LR, RF, SVML, DT, KNN, SVMRBF | Scikit-Learn | |
| Terrestrial | Soil | Prediction of carbon cycling | Metataxonomics | 16S rRNA OTUs | OTU abundance | [DOC] | RF, ANN | THEANO, Scikit-Learn | |
| Terrestrial | Compost | Classification of microbial biomarkers | Metataxonomics | 16S rRNA OTUs | OTU abundance | Compost cycle | RF | RF R package | |
| Terrestrial | Groundwater + soil | Prediction of environmental contaminants | Metataxonomics | 16S rRNA OTUs | OTU abundance | [dioxane] and [CVOCs] | RF | | |
| Terrestrial | Soil | Prediction of environmental quality | Metataxonomics | 16S rRNA OTUs | OTU abundance | Soil physicochemical features | RF | RF R package | |
| Aquatic | Marine (coastal waters) | Prediction of environmental contaminants | Metataxonomics | 16S rRNA OTUs | OTU abundance, 16S rRNA gene sequences | Glyphosate | RF, ANN | RF R package and DL4J | |
| Aquatic | Freshwater (river) | Classification of anthropogenic pathogen loads | Metataxonomics | 16S rRNA OTUs | OTU abundance | Fecal source | RF, MCMC | RF R package and SourceTracker | |
| Aquatic | Marine and freshwater | Classification of microbial biomarkers | Metataxonomics | 16S rRNA and ITS OTUs | OTU abundance | Plastisphere communities | RF | RF R package | |
| Aquatic | Marine sediment (munitions dumpsite) | Prediction of environmental contaminants | Metataxonomics | 16S rRNA OTUs | OTU abundance | TNT | RF, ANN | Ranger R package; ANN via R keras framework + TensorFlow back end | |
| Aquatic | Freshwater (river) | Classification of sample origin | Metataxonomics | 16S rRNA OTUs | OTU abundance (top taxa) | Sample origin | RF | RF R package | |
| Aquatic | Marine (oceanic waters) | Classification of trophic modes | Metatranscriptomics | Gene expression levels | Expression levels of selected Pfam entries | Trophic mode (photo/hetero/mixo) | RF, DT, ANN | NR and XGBoost | |
| Terrestrial | Soil | Prediction of crop productivity | Metagenomics | Shotgun sequencing | OTU abundance | Crop productivity | RF | Ranger R package | |
| Terrestrial | Soil | Prediction of soil phylogroups from environmental metadata | Metagenomics | NR | NR | | RF | RF R package | |
Here, ANN, Artificial Neural Network; CVOCs, Chlorinated Volatile Organic Compounds; DOC, Dissolved Organic Carbon; DT, Decision Tree; KNN, K-Nearest Neighbors; LR, Logistic Regression; MCMC, Markov Chain Monte Carlo; NR, Not reported; RF, Random Forest; SVML, Support Vector Machine (SVM) with a linear kernel; SVMRBF, SVM with a radial basis function kernel; TNT, trinitrotoluene; WWTP, Wastewater Treatment Plant.
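Most rows in the table share one pattern: an OTU abundance table is the feature matrix and an environmental label is the target. The sketch below illustrates that pattern in a minimal, hedged form using k-nearest neighbors (KNN, one of the architectures listed) with leave-one-out validation as a simple measure of model performance. It uses only the Python standard library. The OTU counts, the class labels, and the helper names (`relative_abundance`, `knn_predict`, `leave_one_out_accuracy`) are invented for illustration and are not taken from any of the cited studies, which mostly used random forests in R or Scikit-Learn.

```python
import math
from collections import Counter


def relative_abundance(counts):
    """Convert raw OTU counts for one sample to proportions."""
    total = sum(counts)
    return [c / total for c in counts]


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training samples closest to the query."""
    ranked = sorted(range(len(train_x)),
                    key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(x, y, k=3):
    """Hold out each sample in turn and predict it from the rest."""
    hits = 0
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += knn_predict(train_x, train_y, x[i], k) == y[i]
    return hits / len(x)


# Hypothetical OTU count table: rows are samples, columns are OTUs.
otu_counts = [
    [120, 30, 5, 45],
    [110, 40, 10, 40],
    [130, 25, 8, 37],
    [100, 35, 12, 53],
    [20, 150, 60, 10],
    [15, 140, 70, 5],
    [25, 160, 55, 12],
    [30, 135, 65, 8],
]
labels = ["reference"] * 4 + ["contaminated"] * 4

features = [relative_abundance(row) for row in otu_counts]
accuracy = leave_one_out_accuracy(features, labels, k=3)

# Classify a new, unlabeled sample (also hypothetical).
new_sample = relative_abundance([18, 145, 62, 9])
predicted = knn_predict(features, labels, new_sample, k=3)
print(accuracy, predicted)
```

The toy classes are deliberately well separated, so the example says nothing about how such a model would perform on real community data; it only shows the shape of the feature-and-target setup summarized in the table.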