ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning.

Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rehawi, Yu Wang, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Martin Steinegger, Debsindhu Bhowmik, Burkhard Rost.   

Abstract

Computational biology and bioinformatics provide vast data gold mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The protein LMs (pLMs) were trained on the Summit supercomputer using 5616 GPUs and on a TPU Pod with up to 1024 cores. Dimensionality reduction revealed that the raw pLM embeddings from unlabeled data captured some biophysical features of protein sequences. We validated the advantage of using the embeddings as exclusive input for several subsequent tasks: (1) a per-residue (per-token) prediction of protein secondary structure (3-state accuracy Q3=81%-87%); (2) per-protein (pooling) predictions of protein sub-cellular location (ten-state accuracy: Q10=81%) and membrane versus water-soluble (2-state accuracy Q2=91%). For secondary structure, the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without multiple sequence alignments (MSAs) or evolutionary information, thereby bypassing expensive database searches. Taken together, the results implied that pLMs learned some of the grammar of the language of life. All our models are available through https://github.com/agemagician/ProtTrans.
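
The abstract distinguishes per-residue predictions from per-protein predictions obtained by pooling, and reports results as Q-state accuracies. The following is a minimal Python sketch of those two ideas on toy data, not the authors' pipeline. It assumes mean pooling (the abstract says only "pooling") and assumes that a Q-state accuracy is the plain fraction of correctly classified items. The function names `mean_pool` and `q_accuracy`, the 2-dimensional toy vectors, and the H/E/C labels are hypothetical and chosen for illustration only.

```python
from typing import List, Sequence


def mean_pool(residue_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average per-residue vectors into one fixed-size per-protein vector.

    Assumption: mean pooling; other pooling schemes are possible.
    """
    if not residue_embeddings:
        raise ValueError("need at least one residue embedding")
    length = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(vec[d] for vec in residue_embeddings) / length for d in range(dim)]


def q_accuracy(predicted: Sequence[str], observed: Sequence[str]) -> float:
    """Fraction of items whose predicted class equals the observed class.

    With three classes (e.g. H/E/C per residue) this is a Q3-style score;
    with ten per-protein classes it would be a Q10-style score.
    """
    if len(predicted) != len(observed) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)


# Toy per-residue embeddings for a 3-residue protein (real embeddings
# have far more dimensions; these values are made up).
toy_embeddings = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
protein_vector = mean_pool(toy_embeddings)

# Toy 3-state secondary structure strings: 4 of 5 positions agree.
q3 = q_accuracy("HHEEC", "HHECC")

print(protein_vector, q3)
```

Pooling yields a vector whose size does not depend on sequence length, which is what lets a single per-protein classifier handle proteins of different lengths.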

Year: 2022    PMID: 34232869    DOI: 10.1109/TPAMI.2021.3095381

Source DB: PubMed    Journal: IEEE Trans Pattern Anal Mach Intell    ISSN: 0098-5589    Impact factor: 9.322


  62 in total

1.  Large-scale design and refinement of stable proteins using sequence-only models.

Authors:  Jedediah M Singer; Scott Novotney; Devin Strickland; Hugh K Haddox; Nicholas Leiby; Gabriel J Rocklin; Cameron M Chow; Anindya Roy; Asim K Bera; Francis C Motta; Longxing Cao; Eva-Maria Strauch; Tamuka M Chidyausiku; Alex Ford; Ethan Ho; Alexander Zaitzeff; Craig O Mackenzie; Hamed Eramian; Frank DiMaio; Gevorg Grigoryan; Matthew Vaughn; Lance J Stewart; David Baker; Eric Klavins
Journal:  PLoS One       Date:  2022-03-14       Impact factor: 3.240

2.  Comparative analysis of molecular fingerprints in prediction of drug combination effects.

Authors:  B Zagidullin; Z Wang; Y Guan; E Pitkänen; J Tang
Journal:  Brief Bioinform       Date:  2021-11-05       Impact factor: 11.622

3.  Deciphering the language of antibodies using self-supervised learning.

Authors:  Jinwoo Leem; Laura S Mitchell; James H R Farmery; Justin Barton; Jacob D Galson
Journal:  Patterns (N Y)       Date:  2022-05-18

4.  Mitigating cold-start problems in drug-target affinity prediction with interaction knowledge transferring.

Authors:  Tri Minh Nguyen; Thin Nguyen; Truyen Tran
Journal:  Brief Bioinform       Date:  2022-07-18       Impact factor: 13.994

5.  Deciphering microbial gene function using natural language processing.

Authors:  Danielle Miller; Adi Stern; David Burstein
Journal:  Nat Commun       Date:  2022-09-29       Impact factor: 17.694

6.  LM-GVP: an extensible sequence and structure informed deep learning framework for protein property prediction.

Authors:  Zichen Wang; Steven A Combs; Ryan Brand; Miguel Romero Calvo; Panpan Xu; George Price; Nataliya Golovach; Emmanuel O Salawu; Colby J Wise; Sri Priya Ponnapalli; Peter M Clark
Journal:  Sci Rep       Date:  2022-04-27       Impact factor: 4.996

7.  Protein inter-residue contact and distance prediction by coupling complementary coevolution features with deep residual networks in CASP14.

Authors:  Yang Li; Chengxin Zhang; Wei Zheng; Xiaogen Zhou; Eric W Bell; Dong-Jun Yu; Yang Zhang
Journal:  Proteins       Date:  2021-08-19

8.  TALE: Transformer-based protein function Annotation with joint sequence-Label Embedding.

Authors:  Yue Cao; Yang Shen
Journal:  Bioinformatics       Date:  2021-03-23       Impact factor: 6.937

9.  Differentiable biology: using deep learning for biophysics-based and data-driven modeling of molecular mechanisms. (Review)

Authors:  Mohammed AlQuraishi; Peter K Sorger
Journal:  Nat Methods       Date:  2021-10-04       Impact factor: 28.547

10.  PredictProtein - Predicting Protein Structure and Function for 29 Years.

Authors:  Michael Bernhofer; Christian Dallago; Tim Karl; Venkata Satagopam; Michael Heinzinger; Maria Littmann; Tobias Olenyi; Jiajun Qiu; Konstantin Schütze; Guy Yachdav; Haim Ashkenazy; Nir Ben-Tal; Yana Bromberg; Tatyana Goldberg; Laszlo Kajan; Sean O'Donoghue; Chris Sander; Andrea Schafferhans; Avner Schlessinger; Gerrit Vriend; Milot Mirdita; Piotr Gawron; Wei Gu; Yohan Jarosz; Christophe Trefois; Martin Steinegger; Reinhard Schneider; Burkhard Rost
Journal:  Nucleic Acids Res       Date:  2021-07-02       Impact factor: 16.971
