Christian Dallago1,2, Konstantin Schütze1, Michael Heinzinger1,2, Tobias Olenyi1, Maria Littmann1,2, Amy X Lu3, Kevin K Yang4, Seonwoo Min5, Sungroh Yoon5,6, James T Morton7, Burkhard Rost1,8,9,10,11.
1. TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology, Garching/Munich, Germany.
2. TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Garching/Munich, Germany.
3. Department of Computer Science, University of Toronto, Toronto, Canada & Vector Institute.
4. Microsoft Research New England, Cambridge, Massachusetts.
5. Department of Electrical and Computer Engineering, Seoul National University, Seoul, South Korea.
6. Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, South Korea.
7. Center for Computational Biology, Flatiron Institute, New York, New York.
8. Institute for Advanced Study (TUM-IAS), Garching/Munich, Germany.
9. TUM School of Life Sciences Weihenstephan (WZW), Freising, Germany.
10. Columbia University, Department of Biochemistry and Molecular Biophysics, New York, New York.
11. New York Consortium on Membrane Protein Structure (NYCOMPS), New York, New York.
Abstract
Models from machine learning (ML) or artificial intelligence (AI) increasingly assist in guiding experimental design and decision making in molecular biology and medicine. Recently, Language Models (LMs) have been adapted from Natural Language Processing (NLP) to encode the implicit language written in protein sequences. Protein LMs show enormous potential in generating descriptive representations (embeddings) for proteins from just their sequences, in a fraction of the time required by previous approaches, yet with comparable or improved predictive ability. Researchers have trained a variety of protein LMs that are likely to illuminate different angles of the protein language. By leveraging the bio_embeddings pipeline and modules, simple and reproducible workflows can be laid out to generate protein embeddings and rich visualizations. Embeddings can then be used as input features through machine learning libraries to develop methods predicting particular aspects of protein function and structure. Beyond the workflows included here, embeddings have served as proxies for traditional homology-based inference and even to align similar protein sequences. A wealth of possibilities remains for researchers to harness through the tools provided in the following protocols.
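The workflow the abstract outlines, that is embedding a sequence with a protein LM, pooling per-residue vectors into a fixed-length per-protein representation, and using that representation as input features (for example, for embedding-based annotation transfer instead of homology-based inference), can be sketched schematically. The embedder below is a random stand-in, not an actual protein LM; a real pipeline such as bio_embeddings would supply learned, context-dependent vectors, but the surrounding mechanics are the same:

```python
import math
import random

def embed(sequence, dim=8):
    """Stand-in embedder: one vector per residue.
    Placeholder for a real protein LM; seeded per sequence for determinism."""
    rng = random.Random(sequence)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in sequence]

def reduce_per_protein(per_residue):
    """Mean-pool residue vectors into a single fixed-length protein vector."""
    length = len(per_residue)
    dim = len(per_residue[0])
    return [sum(vec[d] for vec in per_residue) / length for d in range(dim)]

def cosine(a, b):
    """Cosine similarity between two protein vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy sequences (hypothetical identifiers, for illustration only).
proteins = {"P1": "MKTAYIAKQR", "P2": "MKTAYIAKQL", "P3": "GGGGSSSSAA"}
vectors = {pid: reduce_per_protein(embed(seq)) for pid, seq in proteins.items()}

# Per-protein vectors can now feed a classifier, a visualization,
# or a nearest-neighbour lookup for annotation transfer:
query = vectors["P1"]
hits = sorted(((cosine(query, v), pid) for pid, v in vectors.items() if pid != "P1"),
              reverse=True)
print("nearest neighbour of P1:", hits[0][1])
```

With a trained LM in place of the random embedder, nearby vectors tend to correspond to proteins with related structure or function, which is what makes the per-protein representations useful as ML features.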