| Literature DB >> 35359727 |
Alban Mathieu, Mickael Leclercq, Melissa Sanabria, Olivier Perin, Arnaud Droit.
Abstract
Shotgun sequencing of environmental DNA (i.e., metagenomics) has revolutionized the field of environmental microbiology, allowing the characterization of all microorganisms in a sequencing experiment. To identify the microbes in terms of taxonomy and biological activity, the sequenced reads must be aligned to known microbial genomes/genes. However, current alignment methods are limited in terms of speed and can produce a significant number of false positives when detecting bacterial species, or false negatives in specific cases (viruses, plasmids, and gene detection). Moreover, recent advances in metagenomics have enabled the reconstruction of new genomes using de novo binning strategies, but these genomes, not yet fully characterized, are not used in classic approaches, whereas machine and deep learning methods can use them as models. In this article, we review the different methods and their efficiency in improving the annotation of metagenomic sequences. Deep learning models have reached the performance of the widely used k-mer alignment-based tools, with better accuracy in certain cases; however, they still must demonstrate their robustness across the variety of environmental samples and the rapid expansion of genomes accessible in databases.
Keywords: classification; deep learning; functional annotation; machine learning; metagenomic; taxonomic annotation; whole genome shotgun
Year: 2022 PMID: 35359727 PMCID: PMC8964132 DOI: 10.3389/fmicb.2022.811495
Source DB: PubMed Journal: Front Microbiol ISSN: 1664-302X Impact factor: 5.640
FIGURE 1 | Factors that influence the capacity of sequence annotation. Parameters, defined in the sequencing and bioinformatic processes, are tunable by the users. Intrinsic factors are characteristics of the studied environment that influence the annotation rate; by definition, they are not tunable. The cursors indicate where the annotation rate will be the highest. A low sequence identity cutoff for assignment increases the annotation rate, but the trade-off is a higher rate of false positives. Precision of the annotation refers to the degree of annotation examined (for taxonomic assignment, it corresponds to the taxonomic rank used for the analysis; for functional annotation, to the metabolic/anabolic level: genes, short biosynthetic pathways, and global pathways).
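The identity-cutoff trade-off described in the caption can be illustrated with a minimal filter over alignment hits (a sketch only; the record fields, sequences, and cutoff values are hypothetical, not any tool's actual implementation):

```python
def percent_identity(query: str, subject: str) -> float:
    """Percent identity between two equal-length aligned sequences."""
    matches = sum(q == s for q, s in zip(query, subject))
    return 100.0 * matches / len(query)

def filter_hits(hits, cutoff=95.0):
    """Keep only alignments whose identity meets the cutoff.

    Lowering the cutoff annotates more reads (higher annotation rate)
    but lets in more spurious assignments (false positives).
    """
    return [h for h in hits if percent_identity(h["query"], h["subject"]) >= cutoff]

hits = [
    {"read": "r1", "query": "ACGTACGT", "subject": "ACGTACGT"},  # 100% identity
    {"read": "r2", "query": "ACGTACGT", "subject": "ACGAACGA"},  # 75% identity
]
print([h["read"] for h in filter_hits(hits, cutoff=90.0)])  # → ['r1']
```

With the cutoff lowered to 70%, both reads would be annotated, mirroring the figure's point that a permissive cutoff raises the annotation rate at the cost of precision.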
Summary of the articles and models reviewed.
| Publication | Machine/deep learning category | Models tested | Training input | Tested input | Real applications input | Output | Encoding scheme | Parameters | Hyper-parameters | Best model selected |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NBC: the naive Bayes classification tool web server for taxonomic classification of metagenomic reads | Machine learning | Naive Bayes | Genome sequence from DB (25, 100, and 500 bp) | Genome sequence from DB (25, 100, and 500 bp) | Metagenomic reads | Strain–species–genus classification | Compositional vectors (“Target encoding” like) | NA | k-mer size (3, 6, and 9–15) | Naive Bayes |
| Accurate phylogenetic classification of variable-length DNA fragments | Support vector machine | Linear or Gaussian SVM | Genome sequence from DB (1, 5, 10, and 15 kb) | Genome sequence from DB (25, 100, and 500 bp) | Contigs (assembled metagenomic data) | Genus to domain classification | Compositional vectors (“Target encoding” like) | Misclassification cost | k-mer size (2–6) | 5–6-mer-size Gaussian SVM |
| Large-scale machine learning for metagenomics sequence classification | Support vector machine | Linear SVM | Genome sequence | Genome sequences affiliated to the same species as trained; simulated reads with a sequencing error model | Metagenomic reads | Rank-flexible classification of metagenomic reads | Compositional vectors (“Target encoding” like) | Squared loss function | k-mer size (4, 5, and 6) | Linear SVM classifier with rank-flexible classification |
| Deep learning models for bacteria | Deep neural network (DNN) | Convolutional neural network (CNN) | Simulated reads of 16S rRNA sequences | Simulated reads of 16S rRNA sequences | 16S amplicon reads or metagenomic reads | Domain to genus classification | One-hot encoding | # hidden units | k-mer size (3–7) | CNN |
| DeepMicrobes: taxonomic classification for metagenomics with deep learning | Deep neural network (DNN) | ResNet-like CNN, CNN + LSTM, Pool, CNN, LSTM, LSTM + ATTENTION | Simulated reads from MAG sequences | Simulated reads from MAG sequences (training excluded) | Metagenomic reads | Genus/species read classification | One-hot encoding | # size of CNN filters | k-mer length and redundancy | k-mer embedding + LSTM + ATTENTION |
| A fast and accurate functional annotator and classifier of genomic and metagenomic sequences | Machine learning supervised classification coupled to alignment method | Naive Bayesian classifier, Random Forest (RF), AdaBoost, Multiclass classifier, and Lib-SVM | Peptides from eggNOG databases | Genomes | Genomic/metagenomic reads | Functional annotation of predicted genes | Compositional vectors of amino acid composition | # features | NA | Random forest + RAPSearch2 |
| DeepARG: a deep learning approach for predicting antibiotic resistance genes from metagenomic data | Deep neural network (DNN) | Deep neural network (DNN) | UniProt genes with similarity against ARDB genes | Short gene fragments | Genes | Antibiotic resistance gene prediction | Matrix of dissimilarity against AR genes | NA | NA | Deep neural network (DNN) |
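Several of the models in the table (NBC's naive Bayes, the SVM classifiers) take compositional vectors of k-mer frequencies as input rather than raw sequence. A minimal sketch of building such a vector (the function name and the frequency normalization are illustrative assumptions, not the papers' exact encodings):

```python
from collections import Counter
from itertools import product

def kmer_profile(seq: str, k: int = 4):
    """Normalized k-mer frequency vector ("compositional vector").

    Returns a fixed-length vector (4**k dimensions, one per possible
    k-mer over A/C/G/T) suitable as input to a classifier.
    """
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(sum(counts.values()), 1)  # avoid division by zero on short reads
    return [counts[km] / total for km in all_kmers]

vec = kmer_profile("ACGTACGTACGT", k=2)
print(len(vec))  # → 16
```

The vector length grows as 4^k, which is why the reviewed models restrict k to small values (roughly 2–15) and why larger k-mers are typically handled via embeddings instead.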
Functional databases and their characteristics.
| Functional databases | CAZy | Pfam | KEGG | eggNOG | GO Terms | MetaCyc | UniProt |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Base unit | Carbohydrate-Active Enzymes | Protein domain | Ortholog gene | Ortholog gene | Vocabulary | Small-molecule metabolism | Protein |
| Grouping family | Protein family and sub-families | Family | Module, pathway, disease | Pathway | Ontology | Metabolic pathway | NA |
FIGURE 2 | Schematization of deep learning models. The encoded input represents a metagenomic DNA sequence or k-mer that is transformed by the activation functions in the hidden layers. Each gray circle in the hidden layers represents a cell that passes its output to the other cells. As mentioned in the text, LSTM models possess a “forget” gate that selects relevant information. The final output of the hidden layers is the classification, with a predicted probability for an input to belong to each category. During training, the probabilities are produced by the softmax function, which yields values between 0 and 1; for the final prediction, the argmax function is used to select the most probable category.
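The softmax/argmax distinction in the caption can be sketched as follows (the class labels and logit values are hypothetical):

```python
import math

def softmax(logits):
    """Turn raw network outputs into probabilities in (0, 1) summing to 1.

    Used during training so a probabilistic loss can be computed.
    """
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits, labels):
    """At inference time, argmax simply selects the most probable class."""
    probs = softmax(logits)
    return labels[probs.index(max(probs))]

labels = ["genus_A", "genus_B", "genus_C"]
print(predict([2.0, 0.5, -1.0], labels))  # → genus_A
```

Softmax preserves the full probability distribution over categories; argmax discards it and keeps only the winning label, which is why the former suits training and the latter suits the final assignment.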