
Deciphering microbial gene function using natural language processing.

Danielle Miller, Adi Stern, David Burstein.

Abstract

Revealing the function of uncharacterized genes is a fundamental challenge in an era of ever-increasing volumes of sequencing data. Here, we present a concept for tackling this challenge using deep learning methodologies adopted from natural language processing (NLP). We repurpose NLP algorithms to model "gene semantics" based on a biological corpus of more than 360 million microbial genes within their genomic context. We use the language models to predict functional categories for 56,617 genes and find that out of 1369 genes associated with recently discovered defense systems, 98% are inferred correctly. We then systematically evaluate the "discovery potential" of different functional categories, pinpointing those with the most genes yet to be characterized. Finally, we demonstrate our method's ability to discover systems associated with microbial interaction and defense. Our results highlight that combining microbial genomics and language models is a promising avenue for revealing gene functions in microbes.
© 2022. The Author(s).
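
Illustrative note: the idea of modeling "gene semantics" from genomic context can be sketched with a toy example in which contigs play the role of sentences and gene families the role of words. The sketch below is a simple count-based co-occurrence stand-in, not the deep learning language models the abstract describes. The corpus, the gene-family names, the `known` labels, and the helper functions are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt

# Hypothetical toy corpus: each "sentence" is a contig, each "word" a gene family.
contigs = [
    ["famA", "cas1", "cas2", "famB"],
    ["famC", "cas1", "unk1", "famB"],
    ["famA", "unk1", "cas2", "famD"],
    ["rpoB", "rpoC", "rpsL", "famE"],
    ["famF", "rpoB", "rpoC", "rpsL"],
]

# Hypothetical functional labels for the already characterized gene families.
known = {
    "cas1": "defense",
    "cas2": "defense",
    "rpoB": "transcription",
    "rpoC": "transcription",
}


def context_vectors(corpus, window=2):
    """Count, for every gene family, the neighbors seen within `window` positions."""
    vectors = defaultdict(Counter)
    for genes in corpus:
        for i, gene in enumerate(genes):
            lo, hi = max(0, i - window), min(len(genes), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[gene][genes[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def predict(gene, vectors, labels):
    """Assign the label of the labeled gene family with the most similar context."""
    best = max(labels, key=lambda k: cosine(vectors[gene], vectors[k]))
    return labels[best]


vectors = context_vectors(contigs)
print(predict("unk1", vectors, known))
print(predict("rpsL", vectors, known))
```

In this toy setting an uncharacterized family that repeatedly sits next to the same neighbors as labeled defense genes inherits the "defense" label. A learned embedding model would replace the raw count vectors, but the nearest-context principle is the same.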

Year: 2022
PMID: 36175448
PMCID: PMC9523054
DOI: 10.1038/s41467-022-33397-4

Source DB: PubMed
Journal: Nat Commun
ISSN: 2041-1723
Impact factor: 17.694


