Contrastive learning on protein embeddings enlightens midnight zone.

Michael Heinzinger, Maria Littmann, Ian Sillitoe, Nicola Bordin, Christine Orengo, Burkhard Rost.

Abstract

Experimental structures are leveraged through multiple sequence alignments, or more generally through homology-based inference (HBI), which transfers information from a protein with known annotation to a query without any annotation. A recent alternative expands the concept of HBI from sequence-distance lookup to embedding-based annotation transfer (EAT). These embeddings are derived from protein Language Models (pLMs). Here, we introduce the use of single protein representations from pLMs for contrastive learning. This learning procedure creates a new set of embeddings that optimizes constraints captured by the hierarchical classification of protein 3D structures defined by the CATH resource. The approach, dubbed ProtTucker, recognizes distant homologous relationships better than more traditional techniques such as threading or fold recognition. Thus, these embeddings have allowed sequence comparison to step into the 'midnight zone' of protein similarity, i.e. the region in which distantly related sequences have a seemingly random pairwise sequence similarity. The novelty of this work lies in the particular combination of tools and sampling techniques, which achieved performance comparable to or better than existing state-of-the-art sequence comparison methods. Additionally, since this method does not need to generate alignments, it is also orders of magnitude faster. The code is available at https://github.com/Rostlab/EAT.
© The Author(s) 2022. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics.
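
The two ideas named in the abstract, contrastive learning on embeddings and annotation transfer by embedding-distance lookup, can be illustrated with a minimal sketch. The abstract does not specify the loss or distance, so everything here is an assumption: a generic triplet margin loss with Euclidean distance, hypothetical function names (`triplet_margin_loss`, `eat_transfer`), and toy 2-D vectors with placeholder labels standing in for real pLM embeddings and CATH classes. It is not the ProtTucker implementation.

```python
import numpy as np


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Generic triplet margin loss (assumed form, not taken from the paper).

    Each argument is an array of shape (batch, dim). The loss pushes each
    anchor closer to its positive than to its negative by at least `margin`.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))


def eat_transfer(query, lookup_embeddings, lookup_labels):
    """Transfer the label of the nearest lookup embedding to the query.

    Returns the label and the Euclidean distance to that nearest neighbor.
    """
    dists = np.linalg.norm(lookup_embeddings - query, axis=1)
    idx = int(np.argmin(dists))
    return lookup_labels[idx], float(dists[idx])


# Toy lookup set: placeholder vectors and labels, for illustration only.
lookup_embeddings = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
lookup_labels = ["label_A", "label_B", "label_C"]

label, distance = eat_transfer(np.array([0.9, 1.2]), lookup_embeddings, lookup_labels)
print(label, round(distance, 4))

# A triplet that violates the margin: the negative is only 0.5 farther away.
loss = triplet_margin_loss(
    np.array([[0.0, 0.0]]), np.array([[0.0, 1.0]]), np.array([[0.0, 1.5]])
)
print(loss)
```

In this sketch the lookup is a single vectorized distance computation with no alignment step, which is the kind of operation the abstract's speed claim refers to. In practice the distance to the nearest neighbor could also serve as a rough reliability signal, though any threshold would have to be calibrated on real data.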

Year:  2022        PMID: 35702380      PMCID: PMC9188115          DOI: 10.1093/nargab/lqac043

Source DB:  PubMed          Journal:  NAR Genom Bioinform        ISSN: 2631-9268


