Learned protein embeddings for machine learning.

Kevin K Yang, Zachary Wu, Claire N Bedbrook, Frances H Arnold.

Abstract

Motivation: Machine-learning models trained on protein sequences and their measured functions can infer biological properties of unseen sequences without requiring an understanding of the underlying physical or biological mechanisms. Such models enable the prediction and discovery of sequences with optimal properties. Machine-learning models generally require that their inputs be vectors, and the conversion from a protein sequence to a vector representation affects the model's ability to learn. We propose to learn embedded representations of protein sequences that take advantage of the vast quantity of unmeasured protein sequence data available. These embeddings are low-dimensional and can greatly simplify downstream modeling.
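To make the sequence-to-vector step concrete, here is a minimal sketch of one-hot encoding, a common baseline representation. It is an illustration only and is not the embedding method the abstract proposes; the function name `one_hot_encode` and the example sequence are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Flatten a protein sequence into a binary vector of length 20 * len(sequence)."""
    n_letters = len(AMINO_ACIDS)
    vector = [0] * (len(sequence) * n_letters)
    for position, residue in enumerate(sequence):
        if residue not in AA_INDEX:
            raise ValueError(f"unsupported residue {residue!r} at position {position}")
        # Each position owns a block of 20 slots; exactly one slot is set to 1.
        vector[position * n_letters + AA_INDEX[residue]] = 1
    return vector


# Hypothetical 8-residue sequence, used only to show the resulting dimensions.
encoded = one_hot_encode("MKTAYIAK")
print(len(encoded))  # 8 residues x 20 amino acids = 160
print(sum(encoded))  # one active entry per residue = 8
```

In this sketch the vector length grows with the sequence length (20 entries per residue), which is one reason a fixed-size, low-dimensional embedding can simplify downstream modeling.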
Results: The predictive power of Gaussian process models trained using embeddings is comparable to that of models trained on existing representations, which suggests that embeddings enable accurate predictions despite having orders of magnitude fewer dimensions. Moreover, embeddings are simpler to obtain because they do not require alignments, structural data, or selection of informative amino-acid properties. Visualizing the embedding vectors shows that meaningful relationships between the embedded proteins are captured.
Availability and implementation: The embedding vectors and code to reproduce the results are available at https://github.com/fhalab/embeddings_reproduction/.
Supplementary information: Supplementary data are available at Bioinformatics online.
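The Gaussian process modeling mentioned in the Results can be sketched in a few lines. The following is a minimal, dependency-free illustration of zero-mean Gaussian process regression on fixed-length vectors, assuming a squared-exponential kernel and a small noise term. It is not the authors' implementation (see the repository above for that); the kernel choice, the function names, the 2-D "embeddings", and the property values are all made-up assumptions for illustration.

```python
import math


def rbf_kernel(u, v, length_scale=1.0):
    """Squared-exponential similarity between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-0.5 * sq_dist / length_scale ** 2)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def gp_posterior_mean(x_train, y_train, x_test, length_scale=1.0, noise=1e-6):
    """Posterior mean: K(test, train) @ (K(train, train) + noise * I)^-1 @ y."""
    n = len(x_train)
    k_train = [
        [rbf_kernel(x_train[i], x_train[j], length_scale) + (noise if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]
    weights = solve(k_train, y_train)
    return [
        sum(rbf_kernel(x, x_train[j], length_scale) * weights[j] for j in range(n))
        for x in x_test
    ]


# Placeholder 2-D "embeddings" and made-up property values, for illustration only.
embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
measurements = [0.2, 0.9, 0.4, 1.3]
predictions = gp_posterior_mean(embeddings, measurements, [[0.5, 0.5], [100.0, 100.0]])
print(predictions)
```

With a near-zero noise term the model reproduces the training measurements almost exactly, and a query far from all training points falls back to the prior mean of zero. The same code runs unchanged on vectors of any dimension.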

Year:  2018        PMID: 29584811      PMCID: PMC6061698          DOI: 10.1093/bioinformatics/bty178

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


