Qingzhen Hou, Fabrizio Pucci, Fengming Pan, Fuzhong Xue, Marianne Rooman, Qiang Feng.
Abstract
Over the past decade, metagenomic sequencing approaches have provided an ever-increasing amount of protein sequence data at an astonishing rate. These data constitute an invaluable source of information that has been exploited in various research fields, such as the study of the role of the gut microbiota in human diseases and aging. However, only a small fraction of all metagenomic sequences collected have been functionally or structurally characterized, leaving most of them completely unexplored. Here, we review how this information has been used in protein structure prediction and protein discovery. We begin by presenting some widely used metagenomic databases and analyze in detail how metagenomic data have contributed to the impressive improvement in the accuracy of structure prediction methods in recent years. We then examine how metagenomic information can be exploited to annotate protein sequences. More specifically, we focus on the role of metagenomes in the discovery of enzymes and new CRISPR-Cas systems, and in the identification of antibiotic resistance genes. With this review, we provide an overview of how metagenomic data are currently revolutionizing our understanding of protein science.
Keywords: Antibiotic resistance; CRISPR-Cas system; Enzyme design; Metagenomics; Microbiome; Multiple sequence alignment
Year: 2022; PMID: 35070166; PMCID: PMC8760478; DOI: 10.1016/j.csbj.2021.12.030
Source DB: PubMed; Journal: Comput Struct Biotechnol J; ISSN: 2001-0370; Impact factor: 7.271
Fig. 1. Schematic representation of the pipeline from biomes and metagenome samples to protein structure prediction and discovery.
Fig. 2. Sources of metagenomic data in (a) IMG/M and (b) MGnify databases. For more details, see Table S1 of Supplementary Material.
Fig. 3. Metagenomics in protein structure prediction. (a) Quantitative MSA enrichment when adding metagenomic sequences: probability distribution of the ratio between the number of effective sequences in MSAs constructed from both metagenomic and genomic sequence databases and the number in MSAs constructed from genomic sequences only; the values come from the study of 5,721 Pfam families [8]. (b) Schematic representation of the two types of protein structure prediction pipelines based on MSAs: the optimization of multiple intermediate steps, such as the identification of coevolutionary signals and the prediction of contact maps, and an end-to-end differentiable model that enables a single optimization from the input MSA to the output 3D structure. (c) Number of times metagenomic databases have been used in structure prediction methods in the last three CASP experiments [48], [77], [78].
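The enrichment ratio described in Fig. 3(a) can be illustrated with a minimal sketch. The record does not say how the number of effective sequences is computed, so the code below assumes one common convention: each sequence is weighted by the inverse of the number of sequences in the alignment that share at least 80% identity with it. The function names, the 0.8 threshold, and the toy alignments are hypothetical and are not taken from the paper.

```python
import numpy as np

def effective_sequences(msa, identity_threshold=0.8):
    """Number of effective sequences under an ASSUMED convention:
    weight each sequence by 1 / (count of sequences, itself included,
    with fractional identity >= identity_threshold), then sum the weights.
    Gap characters are treated like any other symbol in this sketch."""
    arr = np.array([list(seq) for seq in msa])  # shape (N, L), aligned sequences
    # Pairwise fractional identity over all aligned columns, shape (N, N).
    identity = (arr[:, None, :] == arr[None, :, :]).mean(axis=2)
    neighbours = (identity >= identity_threshold).sum(axis=1)
    return float((1.0 / neighbours).sum())

def enrichment_ratio(msa_genomic, msa_combined, identity_threshold=0.8):
    """Effective sequences of the combined (genomic + metagenomic) MSA
    divided by effective sequences of the genomic-only MSA."""
    return (effective_sequences(msa_combined, identity_threshold)
            / effective_sequences(msa_genomic, identity_threshold))

# Toy alignments (made up for illustration only).
genomic = ["ACDE", "ACDE", "ACDF"]
combined = genomic + ["WYWY", "KLMN"]
ratio = enrichment_ratio(genomic, combined)
```

A ratio above 1 means that the added metagenomic sequences contribute diversity the genomic alignment lacks; near-duplicates of existing sequences barely change it. The pairwise comparison uses memory quadratic in the number of sequences, so this form only suits small alignments.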