
Using metagenomic data to boost protein structure prediction and discovery.

Qingzhen Hou^1,2, Fabrizio Pucci^3,4, Fengming Pan^1,2, Fuzhong Xue^1,2, Marianne Rooman^3,4, Qiang Feng^5,6.

Abstract

Over the past decade, metagenomic sequencing approaches have been providing an ever-increasing amount of protein sequence data at an astonishing rate. These constitute an invaluable source of information which has been exploited in various research fields such as the study of the role of the gut microbiota in human diseases and aging. However, only a small fraction of all metagenomic sequences collected have been functionally or structurally characterized, leaving much of them completely unexplored. Here, we review how this information has been used in protein structure prediction and protein discovery. We begin by presenting some widely used metagenomic databases and analyze in detail how metagenomic data has contributed to the impressive improvement in the accuracy of structure prediction methods in recent years. We then examine how metagenomic information can be exploited to annotate protein sequences. More specifically, we focus on the role of metagenomes in the discovery of enzymes and new CRISPR-Cas systems, and in the identification of antibiotic resistance genes. With this review, we provide an overview of how metagenomic data is currently revolutionizing our understanding of protein science.


Keywords: Antibiotic resistance; CRISPR-Cas system; Enzyme design; Metagenomics; Microbiome; Multiple sequence alignment

Year: 2022 PMID: 35070166 PMCID: PMC8760478 DOI: 10.1016/j.csbj.2021.12.030

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271

Introduction

The use of metagenomic sequencing is dramatically improving our understanding of the evolution and ecology of microbial systems in various environments, from water and soil to the human body [1], [2], [3]. For example, metagenomics has been essential in revealing the mechanism of certain human diseases by detecting changes in the gut microbiome [4], [5], [6] and in identifying and controlling pathogens [7]. These advances have been made possible by metagenomics high-throughput sequencing techniques, through which billions of protein sequences have been characterized. Meanwhile, their number continues to grow at an impressive rate. These huge amounts of data are an invaluable source of information that has a big impact in different areas of protein science. Protein three-dimensional (3D) structure prediction is one of these areas. Since a seminal article [8], metagenomic sequence data has been widely used to construct multiple sequence alignments (MSA) of target proteins, which are used as inputs to deep learning models for structure prediction. Metagenomic information has significantly contributed to improving the accuracy of the predictors [8], which have achieved astonishing scores in recent years [9], [10]. Many studies have also been devoted to understanding the functions of proteins from metagenomic assembly, even though only a small part is functionally annotated. Metagenomic data constitute a huge reservoir of information that can be exploited to discover new proteins with specific functions. Indeed, they have proven to be a fundamental resource for discovering new enzymes with given stability properties [11], [12], [13], [14], [15], [16], [17], [18], exploring antibiotic resistance genes in different microbial communities [19], [20], [21], [22], [23] and identifying new CRISPR-Cas systems [24], [25], [26], [27], [28]. 
In the next sections, we review widely known metagenomic databases and their characteristics, and show how this huge amount of information is used to improve the abovementioned research fields, as schematically depicted in Fig. 1.

Fig. 1

Schematic representation of the pipeline from biomes, metagenome samples to protein structure prediction and discovery.

Metagenomic resources and databases

We start by reviewing the widely known and curated metagenome resources and databases: IMG/M [29], MGnify [30], MetaClust [31] and BFD [32]. These databases are extensively used by the research community in a wide range of studies, e.g., protein structure prediction [8], [9], metabolic gene cluster discovery [33], enzyme discovery [11], and gene function prediction [34]. Their characteristics and content, dated December 2021, are described below; further details can be found in Table S1 of Supplementary Material.

Here we only describe metagenomic databases that were commonly used in the last rounds of the Critical Assessment of Structure Prediction (CASP) [48], a community-wide experiment in which the competitors blindly predict the 3D structure of target proteins and the accuracy of the predictions is evaluated by a group of assessors. Other metagenomic databases are available in the literature and are listed in Table S2 of Supplementary Material. Among them, one of the most complete repositories is the MetaGenomics Rapid Annotation using Subsystems Technology (MG-RAST) [49], which allows storage, annotation, phylogenetic study and functional analysis of metagenomes. Other resources are mainly databases that collect eukaryotic metagenomic data such as TOPAZ [50], SMAGs [51] and MetaEuk [52] or viral metagenomes such as MetaVir [53], VIROME [54], Metagenomic Gut Virus (MGV) [55] and Gut Phage Database (GPD) [56].

IMG/M. The Integrated Microbial Genomes and Microbiomes [29] is a comprehensive data management resource for the analysis of annotated genomic and metagenomic sequence data. It is growing rapidly and has reached about 360 million genes from isolated genomes and 66 billion genes from metagenomes. The latter mainly come from the human gut microbiome and from marine and freshwater microbial systems (see Fig. 2.a).
The genomes and metagenomes with their metadata attributes were collected from the manually curated GOLD database [35] and then annotated with the IMG annotation pipeline [36]. Protein-coding genes were identified from (meta) genomic data by the prediction program Prodigal [37], and functionally annotated by using hidden Markov model (HMM)-based homologous sequence searches [38].

Fig. 2

Sources of metagenomic data in (a) IMG/M and (b) MGnify databases. For more details, see Table S1 of Supplementary Material.

IMG/M includes a set of genomic tools for data analysis, such as IMG/ABC for the study of biosynthetic gene clusters and secondary metabolites, and IMG/VR for the analysis of viral genome fragments derived from metagenomic samples. It also provides multi-search capabilities to search the database for, e.g., homologous proteins of a target sequence via BLAST [39], KEGG enzyme classes and pathways [40], [41], CATH families [42] and Pfam domains [43].

MGnify. MGnify is a comprehensive hub for the analysis, exploration and archiving of microbiome information [30]. It is one of the world's largest resources of microbiome data and a user-friendly platform integrating multiple genomic tools, which explains its wide use. A total of about 4,000 publicly available studies corresponding to about 325,000 samples and 437,000 analyses have been deposited in the database. These numbers are constantly growing and have doubled in the last two years. MGnify provides a non-redundant protein set generated from the analysis of all the assembled datasets, which contains more than 1 billion sequences [30]. It also uses Linclust [31] to cluster the protein sequences with a sequence identity and coverage of 90%; the cluster representative is chosen to be the longest sequence. Moreover, it provides very useful tools to, for example, query the non-redundant protein dataset for sequence homologs using the HMM profile-based tool HMMER [38]. Note that the user can choose to query only a subset of the full set of proteins, corresponding to a type of microbial niche (also called biome). As shown in Fig. 2.b, most of the entries come from human microbiomes, but marine, animal, plant, and soil biomes also contribute significantly to the dataset.

MetaClust. The MetaClust database contains about 1.6 billion protein sequence fragments, predicted by the gene prediction program Prodigal [37] from about 1,800 metagenomic and 400 metatranscriptomic datasets obtained from multiple resources [29], [44], [45]. These sequences were clustered into 424 million classes using Linclust [31], a fast protein sequence clustering algorithm able to cluster huge sets of sequences. The thresholds used for the clustering are 50% sequence identity and 90% sequence coverage. MetaClust is a ready-to-use tool, providing 424 million representative sequences.

BFD. Unlike other databases, the Big Fantastic Database (BFD) is a sequence profile database. It contains about 65 million families represented as MSAs and hidden Markov models (HMMs). It is one of the largest and most used metagenomic databases, as MSA and HMM representations are sometimes more convenient to work with than non-redundant representative protein sequence sets. It has been constructed by collecting about 2.5 billion protein sequences from UniProt/TrEMBL [46], SwissProt [47], MetaClust [31], as well as the Soil Reference Catalog and the Marine Eukaryotic Reference Catalog, assembled using the de novo protein-level assembler PLASS [32], which is able to recover more protein sequences from metagenomes than classical assembly methods. The sequences were clustered using a strict sequence identity cut-off of 30% and a coverage threshold of 90% using MMseqs2/Linclust [31]. Clusters with fewer than three entries were removed.

Despite the huge number of sequences in the different metagenomic databases described above, their overlap with standard protein sequence databases such as UniProtKB [57] is very limited. Therefore, the combination of metagenome and genome sequence databases has enormous potential to provide improved biological information.

Integrating metagenomics data to structure prediction pipelines

Boosting protein structure prediction accuracy

In the last two decades, huge improvements have been made in the field of protein structure prediction. Many predictors have been developed with a steady increase in performance, achieving amazing prediction accuracy in CASP14, the last round of the CASP competition [48]. AlphaFold2 [58] was the best performing method, reaching an accuracy close to that of the experimental methods, even for difficult targets for which no structural templates were available [48]. The improved performance of structure predictors is due to several technological advances, among which novel machine learning algorithms such as deep learning and end-to-end prediction models; for more detailed analyses of these approaches, we refer to excellent reviews on the subject [9], [10]. Another breakthrough is certainly the incorporation of data from large metagenomic databases, which allows building deeper and better quality MSAs than when limiting the search to genomic sequences; this is especially true when the number of genomic sequences is low. In turn, these enhanced MSAs are used to gain more precise protein structure information through the application of coevolutionary approaches. The idea that coevolutionary models extracted from MSAs can be used to gain information on protein structure dates back almost thirty years [59], even though it only started to be a focus of the research community when a series of seminal papers introduced the coevolution formalism of direct coupling analysis (DCA) [60], [61], [62] (see also [63] for a recent review). Basically, residues that are close in the 3D protein structure tend to coevolve along the evolutionary history. Indeed, if an interaction between two residues is essential for the stability of the protein structure, the mutation of one of the residues causes an evolutionary pressure on the second residue, which favors compensatory mutations to restore the original interaction, thus maintaining the molecular function of the protein [64]. 
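The coevolution idea described above can be illustrated with a deliberately simple proxy. The sketch below computes the mutual information between two MSA columns; it is not DCA, which additionally disentangles direct from indirect couplings, and the function name and toy alignment are hypothetical.

```python
from collections import Counter
from math import log2


def column_mutual_information(msa, i, j):
    """Mutual information (in bits) between columns i and j of an MSA.

    `msa` is a list of equal-length aligned sequences. High values mean
    that the residues observed at the two positions vary together.
    """
    n = len(msa)
    col_i = [seq[i] for seq in msa]
    col_j = [seq[j] for seq in msa]
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))

    mi = 0.0
    for (a, b), count in count_ij.items():
        p_ab = count / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


# Toy alignment (made up): positions 1 and 3 swap K/E together,
# as a compensatory pair would, while position 0 is fully conserved.
toy_msa = ["AKLE", "AKLE", "AELK", "AELK"]

if __name__ == "__main__":
    print(column_mutual_information(toy_msa, 1, 3))
    print(column_mutual_information(toy_msa, 0, 1))
```

With only four sequences the estimate is statistically meaningless, which is precisely why alignment depth matters for this family of methods.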
Since these early approaches, many studies have been devoted to extracting coevolutionary signals from MSAs and using them to predict protein contact maps [65], [66], [67], [68], [69], [70]. These contacts are then used as constraints in modeling tools to guide the protein structure prediction pipeline. A key step of these pipelines is to build and curate MSAs. Widely known algorithms for MSA construction are PSI-BLAST, which uses position-specific sequence profiles [71], and HHblits [72] and HMMsearch [73], which use HMM profiles. Note, however, that low-quality or shallow MSAs can lead to inaccurate predictions, because the substitution statistics are then poorly estimated. Metagenomic data was first used to improve MSA quality in 2017 [8]. It was shown that substantially deeper MSAs can be obtained by combining the Integrated Microbial Genomes and Microbiomes (IMG/M) database [29] with the genomic sequence cluster database UniRef30 [74]. Indeed, the addition of metagenomic data increases the effective number of sequences (as defined in [61]) by a factor of about 3.5 on average for the approximately 5,000 protein families from the Pfam database [43]. In particular, about 500 families show an increase by a factor of 10, and a few families by a factor of 100 [8] (Fig. 3.a). This improvement led to more accurate predictions of the protein contact maps for about 20% of the Pfam families considered using the contact map predictor GREMLIN [75], which in turn led to more accurate 3D structures generated via the de novo structural modeling tool Rosetta [76].
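To make the notion of an effective number of sequences concrete, here is a minimal sketch of the reweighting scheme commonly used in the DCA literature: each sequence is down-weighted by the number of sequences that resemble it. The 80% identity threshold, the function names and the toy alignments are assumptions made for illustration and are not taken from [8] or [61].

```python
def fractional_identity(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences are identical."""
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return matches / len(seq_a)


def effective_number_of_sequences(msa, identity_threshold=0.8):
    """Sum of weights 1/m_s, where m_s counts the sequences (including s
    itself) whose identity to s is at least `identity_threshold`."""
    n_eff = 0.0
    for s in msa:
        neighbours = sum(
            1 for t in msa if fractional_identity(s, t) >= identity_threshold
        )
        n_eff += 1.0 / neighbours
    return n_eff


# Toy alignments (made up): three near-identical "genomic" sequences,
# then the same set enriched with two divergent "metagenomic" homologs.
genomic_msa = ["ACDEFGHIKL", "ACDEFGHIKL", "ACDEFGHIKV"]
enriched_msa = genomic_msa + ["MNPQRSTVWY", "MNPQRSTVWA"]

if __name__ == "__main__":
    before = effective_number_of_sequences(genomic_msa)
    after = effective_number_of_sequences(enriched_msa)
    print(before, after, after / before)
```

The example shows why raw sequence counts can mislead: the three redundant genomic sequences count as roughly one effective sequence, so adding a single divergent cluster doubles the effective depth.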

Fig. 3

Metagenomics in protein structure prediction. (a) Quantitative MSA enrichment when adding metagenomic sequences: probability distribution of the enrichment ratio, defined as the ratio between the number of effective sequences in MSAs constructed from both metagenomic and genomic sequence databases and that in MSAs constructed from genomic sequences only; the values come from the study of 5,721 Pfam families [8]; (b) Schematic representation of the two types of protein structure prediction pipelines based on MSAs: the optimization of multiple intermediate steps such as the identification of coevolutionary signals and the prediction of contact maps, and an end-to-end differentiable model which enables a single optimization from the input MSA to the output 3D structure; (c) Number of times metagenomic databases have been used in structure prediction methods in the last three CASP experiments [48], [77], [78].

After this first study [8], multiple structure prediction tools integrating metagenomic data have been developed. Already in the CASP13 experiment [77], several methods used this source of information for residue-residue contact prediction [79], [80], [81], 3D structure prediction [82], [83], [84], [85] and structure refinement [86]. DeepMSA [87], an open-source automated pipeline for the construction of deep alignments using metagenomic information, was also introduced in CASP13. It is based on different types of sequence sources: two genomic sources (UniClust30 [88] and UniRef90 [74]) and a metagenomic source (MetaClust [31]). These databases are queried via a hybrid homology-detection approach including HHblits [72] and HMMER [89]. The high-quality MSAs generated by DeepMSA have been shown to significantly improve long-range contact prediction accuracy [87].
In a later investigation [90], metagenomic sequence data collected by the Tara Oceans expeditions [45] and from MetaClust [31], another metagenomic database, were used in addition to UniRef [91] to increase the number of effective sequences of about 400 Pfam families. For 27 of them, an enriched MSA was obtained with an increase by a factor of two, which, again, led to an improvement of their predicted 3D protein structures. The contribution of metagenomics to the field became even more important in the latest CASP round (CASP14), where the majority of the methods used metagenomic sequences either for predicting inter-residue contacts or distances as a preliminary step in protein structure modeling [92], [81], [93], or directly in an end-to-end structure prediction model without intermediate steps [58], [94], [95], [83], [96] (see Fig. 3). Some of the prediction methods in CASP14 used the DeepMSA pipeline to query the target sequence against metagenomic databases. However, the best performing methods such as AlphaFold2 [58], D-I-TASSER [97] and RoseTTAFold [95] developed new, improved pipelines for homologous sequence search which combine multiple methods to mine metagenomic databases. For example, AlphaFold2 [58], which dominated CASP14 and achieved astonishing prediction accuracy, employs homologous searches in UniRef90 and MGnify using JackHMMER, and in BFD and Uniclust30 using HHblits. The output MSAs of these searches are then deduplicated and stacked together to further increase the number of homologous sequences collected. This pipeline led to an average improvement of the structure prediction performance of approximately 6% in terms of the global distance test score [58]. The DeepMSA approach has been generalized to DeepMSA2 [97] in which, in addition to the Uniclust30 and UniRef90 genomic sequence databases, the four widely known metagenomic sequence databases described in the previous sections are mined (MetaClust, BFD, MGnify and IMG/M).
The full pipeline consists of a complex series of steps including multiple rounds of database mining with JackHMMER, HHblits and HMMsearch (see [97] for technical details). It provides MSAs that are 40% to 150% deeper than those of the original DeepMSA pipeline, which in turn leads to statistically significant improvements in both residue-residue distance and protein structure predictions [97]. Recently, a more computationally efficient pipeline for MSA generation has been introduced [98]. It employs MMseqs2 [31] to mine UniRef30 and, using the sequence profile generated, performs an iterative search against two new databases: BFD/MGnify and ColabFoldDB. The former was created by merging the BFD and MGnify databases through an MMseqs2 search of MGnify sequences among the BFD clusters of representative sequences. The aligned matches were assigned to the corresponding BFD clusters; the non-matching MGnify sequences were used to generate new clusters. The latter database (ColabFoldDB) was essentially constructed in the same way but includes, in addition to BFD/MGnify, sequences retrieved from other metagenomic databases such as MetaClust, GPD, MGV, TOPAZ, MetaEuk and SMAGs. The improved accuracy of these different methods compared to standard approaches that do not rely on metagenomic information demonstrates the central role played by metagenomics in the field of protein structure prediction. This is due to the fact that the current sequence databases are far from complete, despite their rapid growth, and that they contain too few homologous sequences for too many target proteins. Metagenome sequence databases have the advantage of filling this gap. Note that the combined use of multiple metagenomic databases with different mining algorithms and parameters further improves homologous sequence search and thus helps construct deeper MSAs and identify more accurate evolutionary information needed for protein structure prediction.
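The step of deduplicating and stacking MSAs obtained from several database searches can be sketched as follows. This is only an illustration of the idea and not the code of any of the pipelines cited above; it assumes that every search result has already been projected onto the query coordinates (all rows have the query length), and the function and sequence names are hypothetical.

```python
def merge_alignments(query, *alignments):
    """Stack several MSAs of the same query, dropping duplicate rows.

    Each alignment is a list of (name, aligned_sequence) pairs whose
    sequences are already aligned to `query` (same length, '-' for gaps).
    The query is kept as the first row; later exact duplicates are skipped.
    """
    seen = {query}
    merged = [("query", query)]
    for alignment in alignments:
        for name, aligned_seq in alignment:
            if len(aligned_seq) != len(query):
                raise ValueError(f"{name} is not aligned to the query length")
            if aligned_seq in seen:
                continue
            seen.add(aligned_seq)
            merged.append((name, aligned_seq))
    return merged


# Toy search results (made up) from a genomic and a metagenomic database.
query = "MKVLA"
genomic_hits = [("uniref_1", "MKVLA"), ("uniref_2", "MRVLA")]
metagenomic_hits = [("mgnify_1", "MRVLA"), ("mgnify_2", "M-VIA")]

if __name__ == "__main__":
    for name, seq in merge_alignments(query, genomic_hits, metagenomic_hits):
        print(name, seq)
```

Exact-match deduplication is the simplest possible choice; real pipelines may also filter by identity or coverage, which this sketch does not attempt.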

Is more metagenomics always better?

We underlined in the previous subsection the rapid accumulation of metagenomic sequences and the impressive size of metagenomic databases, with, e.g., the IMG/M database containing more than 60 billion microbial genes [29]. Although these databases represent an invaluable source of information, deep MSA construction by querying them is becoming computationally expensive and memory-demanding. The precise identification of MSA characteristics that may improve the accuracy of contact and structure prediction is still an open question in the community. Indeed, having more sequence homologs in the alignment is not always better [87], considering that there is a trade-off between the effective number of sequences, the sequence coverage, and the alignment accuracy. In an interesting recent study [99], the link between microbial niches and homologous protein families was investigated for a set of about 2,000 Pfam families with no structural templates. Four different microbial biomes, from gut, lake, soil and fermentor, were used in turn for MSA enrichment to test their ability to improve 3D structure prediction. It turned out that the structural modeling of the Pfam families is more precise when only one or a few specific biomes linked to the target protein family are used. This led to the proposal of a prediction model called MetaSource, which is able to identify the biome or set of biomes that allows better MSA construction and modeling of a given Pfam family [99]. Note that this approach yields not only an improved accuracy but also a significant increase in computational efficiency: it is around 3.3-fold faster than considering all sets of metagenomic information [99].

Integrating metagenomics data for functional annotation and validation

Boosting enzyme discovery using metagenomics

Metagenomic sequencing data is now also used to identify new proteins with specific enzymatic activity and stability properties. The use of huge amounts of sequence data extracted from a wide variety of different environments, from animal rumen to marine, water and soil, has revolutionized the discovery process of novel enzymes in the last decade [11]. We can estimate from previous reviews on the topic that at least 500 new enzymes have been identified using metagenomics-based approaches; this underlines the deep impact of metagenomics in this important biotechnological research field [11]. Here we provide a non-exhaustive list of enzyme types whose development has been boosted by the use of such approaches. The HotZyme project [100], for example, has been devoted to the extensive screening and analysis of metagenomes from thermal springs around the world, with the aim to first discover and then characterize novel thermostable hydrolases of industrial interest. Metagenomic screening resulted in 100 potentially new hydrolases, of which 12 have been biochemically and structurally characterized, including carboxylesterases, lactonases and cellulases. Metagenomic data has also been widely used for the identification of lignin-degrading enzymes such as laccase, xylanase, β-glucosidase, acetyl xylan esterases, arabinofuranosidases, and lyases [12], [13], [14], [15]. These enzymes, which catalyze the depolymerization of lignin, have been discovered by different consortia such as RAS [101] and LigMet [12] through the mining of metagenomes from different environments such as rice straw compost, sugarcane soil samples, bovine rumen and insect intestinal tracts. Marine-related metagenomic data also provides huge amounts of information. Large expeditions collecting marine samples such as Tara Oceans [45], [102] and GEOTRACES [103] have led to the metagenome assembly of more than 25,000 genomes.
These efforts have contributed to the discovery and functional characterization of a wide series of novel enzymes [16], [17], [18], such as cold-adapted lipases and esterases, which are of key importance in the food and biotechnology industries. Other discovered enzymes of interest are novel thermostable biocatalysts including lipolytic enzymes, hydrolases, fumarase and β-glucosidase, as well as a series of enzymes that are tolerant to salt, acid or basic pH, or heavy metals. There are basically two main approaches for the discovery of new enzymes from metagenomic data: function-based and sequence-based screening [104]. In the former, DNA fragments from environmental metagenomes are first cloned and expressed using expression vectors to produce proteins, which are then screened in vivo or in vitro for enzymatic activities [105]. This is the most frequently used approach and allows the identification of enzymes that do not share sequence similarity with known counterparts. However, it is laborious and requires reliable and high-throughput screening methods that are difficult to generalize [104]. The second approach is sequence-based and applicable when the enzymes sought are closely related in terms of sequence similarity to enzymes collected in known databases. This in silico method queries large metagenomic datasets with HMM profiles [38] constructed from sequences of known enzymes and their close homologs. While this method is more efficient than function-based methods, it is more limited in terms of sequence space explored. It can also lead to false hits, due to misannotations in poorly curated datasets. Examples of automatic computational pipelines for metagenomic enzyme discovery through sequence-based screening are MetaHMM [106] and ANASTASIA [107].
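As an illustration of the sequence-based route, the sketch below filters the per-sequence table that HMMER's hmmsearch writes with its `--tblout` option, keeping only hits below an E-value cut-off. It assumes the standard whitespace-separated layout in which the target name, query name, full-sequence E-value and score are columns 1, 3, 5 and 6; the sample rows, the profile name and the 1e-5 threshold are invented for the example and do not come from the pipelines cited above.

```python
# Made-up excerpt in the layout of an `hmmsearch --tblout` file.
SAMPLE_TBLOUT = """\
# target name  accession  query name  accession  E-value  score  bias  ...
contig_17_gene_3 - lipase_profile - 2.1e-45 150.2 0.1 3.0e-45 149.7 0.1 1.0 1 0 0 1 1 1 1 hypothetical protein
contig_98_gene_1 - lipase_profile - 0.02 12.4 0.0 0.05 11.1 0.0 1.2 1 0 0 1 1 0 0 -
"""


def significant_hits(tblout_text, max_evalue=1e-5):
    """Return (target, query, evalue, score) tuples with E-value <= max_evalue,
    sorted from most to least significant."""
    hits = []
    for line in tblout_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        # 18 fixed columns, then a free-text description that may hold spaces
        fields = line.split(None, 18)
        target, query = fields[0], fields[2]
        evalue, score = float(fields[4]), float(fields[5])
        if evalue <= max_evalue:
            hits.append((target, query, evalue, score))
    return sorted(hits, key=lambda hit: hit[2])


if __name__ == "__main__":
    for hit in significant_hits(SAMPLE_TBLOUT):
        print(hit)
```

A strict E-value cut-off reduces false hits but also narrows the sequence space explored, which mirrors the limitation of sequence-based screening noted above.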

CRISPR-Cas system identification in microbiomes

CRISPR-Cas, where Cas is an enzyme and CRISPR is the acronym of Clustered Regularly Interspaced Short Palindromic Repeats, is a system employed by most archaea and bacteria as immunological defense against invading DNA [108]. In brief, short fragments of foreign DNA are integrated into CRISPR loci, which keeps a memory of the infection. These loci are transcribed into CRISPR RNAs, which are then used as guides for Cas proteins to specifically interfere with invading nucleic acids upon reoccurring infection. Due to its huge potential for genome editing, the CRISPR-Cas system is used as a precise technology in biological and clinical research and applications [109], and metagenomic datasets are therefore mined to discover new such systems [24]. For example, three sources of metagenomic data were used for this purpose [24], from two soil environments and one water environment. As many as 155 million protein-coding genes were extracted from these, using Prodigal, a protein-coding gene predictor for prokaryotic genomes [37]. This set of sequences was searched for Cas protein homologs using HMMER [38], while CRISPR arrays were identified using the CrisprFinder detection tool [110]. This analysis led to the identification of novel CRISPR-Cas systems: CRISPR-Cas9 in archaea, and CRISPR-CasX and CasY in bacteria, which are among the most compact CRISPR-Cas systems known to date [24]. The International Metagenomics and Metadesign of Subways and Urban Biomes (MetaSUB) consortium provided 4,728 metagenomic samples from mass-transit systems of 60 cities around the world [111]. These data led to the discovery of 838,532 CRISPR arrays predicted by an improved version of CrisprFinder [25], of which 3,245 had unambiguous annotations.
More recently, 2.9 million CRISPR loci have been functionally and taxonomically profiled from 2,355 body-wide human microbiomes from 17 different body sites [26], thus increasing by one order of magnitude the number of known CRISPR loci in the human microbiome. The Crass tool, which identifies and reconstructs CRISPR loci from unassembled metagenomic data [112], was used for that purpose. Also, the abundance of different Cas proteins was profiled and associated with CRISPR subtypes to obtain information about the functional and evolutionary role of CRISPR-Cas systems in human microbiomes. The studies that identified CRISPR-Cas systems by mining metagenome resources are not limited to these few examples but include other studies that use completely different metagenomic environments such as irrigation water sources [27], various human microbiomes from skin to oral microbiomes [113], [114] and extreme environments ranging from Antarctic snow to hot springs [28], [115]. To discover CRISPR repeats in these metagenomes, several bioinformatics tools have been developed, among which MinCED (github.com/ctSkennerton/minced), MetaCRAST [116], Crass [112], and metaCRT [113]. Finally, note that metagenomic databases can also be mined to explore anti-CRISPRs, i.e. natural inhibitors of the CRISPR-Cas system [117]. For example, a high-throughput approach has been developed to discover anti-CRISPR genes from metagenomic data based on their functional activity [118]. Eleven DNA fragments from soil, animal, and human metagenomes were identified and tested in vitro for their ability to decrease Cas9 activity in Streptococcus pyogenes.
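The defining signature of a CRISPR array, a short repeat recurring at regular intervals with variable spacers in between, can be shown with a toy detector. The sketch below only looks for exact copies of a fixed-length repeat and is far simpler than the dedicated tools listed above; the function name, the default parameters and the example sequence are hypothetical.

```python
def find_repeat_arrays(dna, repeat_len=8, min_repeats=3,
                       min_spacer=4, max_spacer=12):
    """Return (repeat, start_positions) for every exact k-mer that occurs
    at least `min_repeats` times in a row, with consecutive copies
    separated by spacers of min_spacer..max_spacer bases."""
    # Index the start positions of every k-mer (positions are ascending).
    positions = {}
    for start in range(len(dna) - repeat_len + 1):
        positions.setdefault(dna[start:start + repeat_len], []).append(start)

    arrays = []
    for kmer, starts in positions.items():
        best, run = [], [starts[0]]
        for pos in starts[1:]:
            spacer = pos - (run[-1] + repeat_len)
            if min_spacer <= spacer <= max_spacer:
                run.append(pos)          # regular spacing: extend the run
            else:
                if len(run) > len(best):
                    best = run
                run = [pos]              # irregular spacing: start a new run
        if len(run) > len(best):
            best = run
        if len(best) >= min_repeats:
            arrays.append((kmer, best))
    return arrays


# Toy sequence (made up): three copies of one repeat and two 5-base spacers.
REPEAT = "GTTTTAGA"
toy_dna = "TT" + REPEAT + "ACGTA" + REPEAT + "CATGC" + REPEAT + "GG"

if __name__ == "__main__":
    print(find_repeat_arrays(toy_dna))
```

Real repeats tolerate mismatches and vary in length, so an exact-match search like this one would miss many genuine arrays; it is meant only to make the repeat-spacer layout tangible.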

Functional annotation and analysis of the resistome using metagenomics data

Antimicrobial resistance is another central problem in microbiology where metagenomic data plays a fundamental role. The identification of antibiotic resistance genes (ARG) in soil-dwelling bacteria, human gut microbiota and other microbial communities, which can potentially act as ARG reservoirs [119], [120], [121], is important to fully understand the origin, evolution and maintenance of antibiotic resistance. Indeed, these genes can be exchanged through lateral gene transfer and confer antibiotic resistance to pathogens. For example, infant gut microbiome was investigated and revealed a cohort of resistance genes in fecal microbiota of pediatric patients, even without their prior exposure to the selective pressure of antibiotics [122]. These findings explain how a healthy human gut can act as a reservoir for ARGs. An interesting method, taking advantage of 3D protein structures, was developed to predict ARGs in gut microbiota [20]. This method, based on a combination of homology modeling and machine learning techniques, is able to correctly identify ARGs: from the 71 predicted ARGs, antibiotic resistance activity was detected in vitro in 51 of them. The method was also tested on an experimentally validated functional metagenomic dataset from soils, highlighting very good performance, especially in terms of sensitivity. Furthermore, metagenomic sequencing from respiratory specimens of patients with and without chronic respiratory diseases such as severe asthma, chronic obstructive pulmonary disease and bronchiectasis showed that respiratory tract microbiota also harbors a core of ARGs dominated by genes resistant to macrolide antibiotics [21]. This finding was independent of the health status of the patients and of their previous exposure to antibiotics. Soil is certainly another reservoir of ARGs, since it is in direct contact with antibiotics used in livestock farming and agriculture. 
Evidence of ARG exchange between soil-dwelling bacteria and clinical pathogens was shown from functional metagenomics analyses of soil-derived bacteria cultures [22]. Multidrug-resistant soil bacteria were shown to harbor ARGs against five important classes of antibiotics: β-lactams, aminoglycosides, amphenicols, sulfonamides, and tetracyclines. Furthermore, the analysis of metagenomic data from gut microbiota of migratory birds revealed about 1,000 ARGs that can be classified into about 200 different types associated with specific antibiotic resistance [23]. Compared to environmental metagenomes, microbiota of migratory birds have a lower phylogenetic diversity but more antibiotic resistance proteins, thus suggesting the possible role of birds as an ARG reservoir. Finally, possible differences in the ARG distribution according to the ecological niches were investigated [19]. The analysis of human, animal, water, soil, plant and insect metagenomes from the MG-RAST database [123] led to the conclusion that the human microbiome is characterized by the highest relative ARG abundance.

Conclusion

The use of metagenomics data has become essential in different domains of research during the last decade. Indeed, considerable efforts to improve the standardization of data analyses and metagenomic databases have resulted in impressive developments in enzyme discovery, 3D protein structure prediction and function annotation. The study of the role of human microbiota in disease, aging and antibiotic resistance has also greatly benefited from these developments. The explosion of the amount of metagenomic data is currently creating a challenge for bioinformatics tools, especially in the data storage and analysis and in the integration of different metagenomic techniques, including metatranscriptomics, metaproteomics and metabolomics. The improvements of these tools will lead in the near future to further advances in these fields, but will also boost or continue to fuel a series of other applications that have not been analyzed in this mini-review such as protein function prediction [124], predictions of protein–protein interactions and protein complex structures [95], [125], [126] and the detection and tracing of novel viral pathogens [127], [128].

CRediT authorship contribution statement

Qingzhen Hou: Data curation, Formal analysis, Writing - original draft, Writing - review & editing. Fabrizio Pucci: Data curation, Formal analysis, Writing - original draft, Writing - review & editing. Fengming Pan: Data curation, Formal analysis. Fuzhong Xue: Funding acquisition, Project administration, Supervision. Marianne Rooman: Formal analysis, Funding acquisition, Supervision, Validation, Writing - original draft, Writing - review & editing. Qiang Feng: Funding acquisition, Project administration, Supervision, Investigation, Software, Writing - original draft.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

118 in total

1. Gut microbiome development along the colorectal adenoma-carcinoma sequence.

Authors: Qiang Feng; Suisha Liang; Huijue Jia; Andreas Stadlmayr; Longqing Tang; Zhou Lan; Dongya Zhang; Huihua Xia; Xiaoying Xu; Zhuye Jie; Lili Su; Xiaoping Li; Xin Li; Junhua Li; Liang Xiao; Ursula Huber-Schönauer; David Niederseer; Xun Xu; Jumana Yousuf Al-Aama; Huanming Yang; Jian Wang; Karsten Kristiansen; Manimozhiyan Arumugam; Herbert Tilg; Christian Datz; Jun Wang
Journal: Nat Commun Date: 2015-03-11 Impact factor: 14.919

2. The IMG/M data management and analysis system v.6.0: new tools and advanced capabilities.

Authors: I-Min A Chen; Ken Chu; Krishnaveni Palaniappan; Anna Ratner; Jinghua Huang; Marcel Huntemann; Patrick Hajek; Stephan Ritter; Neha Varghese; Rekha Seshadri; Simon Roux; Tanja Woyke; Emiley A Eloe-Fadrosh; Natalia N Ivanova; Nikos C Kyrpides
Journal: Nucleic Acids Res Date: 2021-01-08 Impact factor: 16.971

3. Decoding the link of microbiome niches with homologous sequences enables accurately targeted protein structure prediction.

Authors: Pengshuo Yang; Wei Zheng; Kang Ning; Yang Zhang
Journal: Proc Natl Acad Sci U S A Date: 2021-12-07 Impact factor: 12.779

4. The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes.

Authors: F Meyer; D Paarmann; M D'Souza; R Olson; E M Glass; M Kubal; T Paczian; A Rodriguez; R Stevens; A Wilke; J Wilkening; R A Edwards
Journal: BMC Bioinformatics Date: 2008-09-19 Impact factor: 3.169

5. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches.

Authors: Baris E Suzek; Yuqi Wang; Hongzhan Huang; Peter B McGarvey; Cathy H Wu
Journal: Bioinformatics Date: 2014-11-13 Impact factor: 6.937

6. HMMER web server: 2018 update.

Authors: Simon C Potter; Aurélien Luciani; Sean R Eddy; Youngmi Park; Rodrigo Lopez; Robert D Finn
Journal: Nucleic Acids Res Date: 2018-07-02 Impact factor: 16.971

7. Prediction of interresidue contacts with DeepMetaPSICOV in CASP13.

Authors: Shaun M Kandathil; Joe G Greener; David T Jones
Journal: Proteins Date: 2019-07-27

8. Metagenomics Reveals a Core Macrolide Resistome Related to Microbiota in Chronic Respiratory Disease.

Authors: Micheál Mac Aogáin; Kenny J X Lau; Zhao Cai; Jayanth Kumar Narayana; Rikky W Purbojati; Daniela I Drautz-Moses; Nicolas E Gaultier; Tavleen K Jaggi; Pei Yee Tiew; Thun How Ong; Mariko Siyue Koh; Albert Lim Yick Hou; John A Abisheganaden; Krasimira Tsaneva-Atanasova; Stephan C Schuster; Sanjay H Chotirmall
Journal: Am J Respir Crit Care Med Date: 2020-08-01 Impact factor: 21.405

9. Improved protein structure prediction by deep learning irrespective of co-evolution information.

Authors: Jinbo Xu; Matthew Mcpartlon; Jin Li
Journal: Nat Mach Intell Date: 2021-05-20

10. Lignolytic-consortium omics analyses reveal novel genomes and pathways involved in lignin modification and valorization.

Authors: Eduardo C Moraes; Thabata M Alvarez; Gabriela F Persinoti; Geizecler Tomazetto; Livia B Brenelli; Douglas A A Paixão; Gabriela C Ematsu; Juliana A Aricetti; Camila Caldana; Neil Dixon; Timothy D H Bugg; Fabio M Squina
Journal: Biotechnol Biofuels Date: 2018-03-22 Impact factor: 6.040