Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences.

Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C Lawrence Zitnick, Jerry Ma, Rob Fergus.

Abstract

In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In the life sciences, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Protein language modeling at the scale of evolution is a logical step toward predictive and generative artificial intelligence for biology. To this end, we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone. The learned representation space has a multiscale organization reflecting structure from the level of biochemical properties of amino acids to remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and can be identified by linear projections. Representation learning produces features that generalize across a range of applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure and improving state-of-the-art features for long-range contact prediction.
Copyright © 2021 the Author(s). Published by PNAS.
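
The abstract's claim that structural information "can be identified by linear projections" refers to fitting a linear map on top of frozen representations. A minimal sketch of such a linear probe follows, under loud assumptions: the embeddings here are synthetic random vectors, not outputs of the paper's model, and the three classes (which could stand for helix, strand, and coil) as well as the function names `fit_linear_probe`, `predict`, and `demo` are hypothetical and chosen only for illustration. It is not the authors' implementation.

```python
import numpy as np


def fit_linear_probe(embeddings, labels, num_classes, ridge=1e-3):
    """Fit a ridge-regularized linear projection from fixed embeddings to one-hot labels."""
    # Append a constant column so the projection includes a bias term.
    X = np.hstack([embeddings, np.ones((embeddings.shape[0], 1))])
    Y = np.eye(num_classes)[labels]  # one-hot targets, shape (n, num_classes)
    # Closed-form ridge regression: (X^T X + ridge * I) W = X^T Y
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ Y)


def predict(W, embeddings):
    """Project embeddings with W and return the highest-scoring class per row."""
    X = np.hstack([embeddings, np.ones((embeddings.shape[0], 1))])
    return np.argmax(X @ W, axis=1)


def demo(seed=0):
    """Train and evaluate the probe on synthetic stand-in embeddings."""
    rng = np.random.default_rng(seed)
    n, dim, num_classes = 600, 32, 3
    # ASSUMPTION: synthetic data. Each class gets its own mean vector,
    # and every "embedding" is that mean plus unit Gaussian noise.
    class_means = 3.0 * rng.normal(size=(num_classes, dim))
    labels = rng.integers(0, num_classes, size=n)
    embeddings = class_means[labels] + rng.normal(size=(n, dim))

    # Hold out the last 200 examples; the embeddings are never updated,
    # only the linear map W is fit.
    W = fit_linear_probe(embeddings[:400], labels[:400], num_classes)
    held_out_pred = predict(W, embeddings[400:])
    return float(np.mean(held_out_pred == labels[400:]))


if __name__ == "__main__":
    print(f"held-out accuracy: {demo():.3f}")
```

Because only the projection `W` is trained, held-out accuracy measures how linearly accessible the labels are in the representation space rather than the capacity of the classifier.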

Keywords:  deep learning; generative biology; protein language model; representation learning; synthetic biology

Year:  2021        PMID: 33876751     DOI: 10.1073/pnas.2016239118

Source DB:  PubMed          Journal:  Proc Natl Acad Sci U S A        ISSN: 0027-8424            Impact factor:   11.205


  102 in total

Review 1.  Machine learning: its challenges and opportunities in plant system biology.

Authors:  Mohsen Hesami; Milad Alizadeh; Andrew Maxwell Phineas Jones; Davoud Torkamaneh
Journal:  Appl Microbiol Biotechnol       Date:  2022-05-16       Impact factor: 4.813

2.  Informative RNA base embedding for RNA structural alignment and clustering by deep representation learning.

Authors:  Manato Akiyama; Yasubumi Sakakibara
Journal:  NAR Genom Bioinform       Date:  2022-02-22

3.  Constructing benchmark test sets for biological sequence analysis using independent set algorithms.

Authors:  Samantha Petti; Sean R Eddy
Journal:  PLoS Comput Biol       Date:  2022-03-07       Impact factor: 4.475

4.  Accurate protein function prediction via graph attention networks with predicted structure information.

Authors:  Boqiao Lai; Jinbo Xu
Journal:  Brief Bioinform       Date:  2022-01-17       Impact factor: 11.622

5.  Large-scale design and refinement of stable proteins using sequence-only models.

Authors:  Jedediah M Singer; Scott Novotney; Devin Strickland; Hugh K Haddox; Nicholas Leiby; Gabriel J Rocklin; Cameron M Chow; Anindya Roy; Asim K Bera; Francis C Motta; Longxing Cao; Eva-Maria Strauch; Tamuka M Chidyausiku; Alex Ford; Ethan Ho; Alexander Zaitzeff; Craig O Mackenzie; Hamed Eramian; Frank DiMaio; Gevorg Grigoryan; Matthew Vaughn; Lance J Stewart; David Baker; Eric Klavins
Journal:  PLoS One       Date:  2022-03-14       Impact factor: 3.240

6.  TALE: Transformer-based protein function Annotation with joint sequence-Label Embedding.

Authors:  Yue Cao; Yang Shen
Journal:  Bioinformatics       Date:  2021-03-23       Impact factor: 6.937

7.  Utilizing graph machine learning within drug discovery and development.

Authors:  Thomas Gaudelet; Ben Day; Arian R Jamasb; Jyothish Soman; Cristian Regep; Gertrude Liu; Jeremy B R Hayter; Richard Vickers; Charles Roberts; Jian Tang; David Roblin; Tom L Blundell; Michael M Bronstein; Jake P Taylor-King
Journal:  Brief Bioinform       Date:  2021-11-05       Impact factor: 11.622

8.  PANDA2: protein function prediction using graph neural networks.

Authors:  Chenguang Zhao; Tong Liu; Zheng Wang
Journal:  NAR Genom Bioinform       Date:  2022-02-02

9.  Fast and effective protein model refinement using deep graph neural networks.

Authors:  Xiaoyang Jing; Jinbo Xu
Journal:  Nat Comput Sci       Date:  2021-07-15

Review 10.  Representation learning applications in biological sequence analysis.

Authors:  Hitoshi Iuchi; Taro Matsutani; Keisuke Yamada; Natsuki Iwano; Shunsuke Sumi; Shion Hosoda; Shitao Zhao; Tsukasa Fukunaga; Michiaki Hamada
Journal:  Comput Struct Biotechnol J       Date:  2021-05-23       Impact factor: 7.271
