Evaluating Protein Transfer Learning with TAPE.

Roshan Rao, Nicholas Bhattacharya, Neil Thomas, Yan Duan, Xi Chen, John Canny, Pieter Abbeel, Yun S Song.

Abstract

Machine learning applied to protein sequences is an increasingly popular area of research. Semi-supervised learning for proteins has emerged as an important paradigm due to the high cost of acquiring supervised protein labels, but the current literature is fragmented when it comes to datasets and standardized evaluation techniques. To facilitate progress in this field, we introduce the Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. We curate tasks into specific training, validation, and test splits to ensure that each task tests biologically relevant generalization that transfers to real-life scenarios. We benchmark a range of approaches to semi-supervised protein representation learning, which span recent work as well as canonical sequence learning techniques. We find that self-supervised pretraining is helpful for almost all models on all tasks, more than doubling performance in some cases. Despite this increase, in several cases features learned by self-supervised pretraining still lag behind features extracted by state-of-the-art non-neural techniques. This gap in performance suggests a huge opportunity for innovative architecture design and improved modeling paradigms that better capture the signal in biological sequences. TAPE will help the machine learning community focus effort on scientifically relevant problems. Toward this end, all data and code used to run these experiments are available at https://github.com/songlab-cal/tape.

Year: 2019    PMID: 33390682    PMCID: PMC7774645

Source DB: PubMed    Journal: Adv Neural Inf Process Syst    ISSN: 1049-5258


