Kevin K Yang (1), Zachary Wu (1), Claire N Bedbrook (2), Frances H Arnold (1, 2)
(1) Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA.
(2) Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA.
Abstract
Motivation: Machine-learning models trained on protein sequences and their measured functions can infer biological properties of unseen sequences without requiring an understanding of the underlying physical or biological mechanisms. Such models enable the prediction and discovery of sequences with optimal properties. Machine-learning models generally require that their inputs be vectors, and the conversion from a protein sequence to a vector representation affects the model's ability to learn. We propose to learn embedded representations of protein sequences that take advantage of the vast quantity of unmeasured protein sequence data available. These embeddings are low-dimensional and can greatly simplify downstream modeling.
Results: The predictive power of Gaussian process models trained using embeddings is comparable to that of models trained on existing representations, which suggests that embeddings enable accurate predictions despite having orders of magnitude fewer dimensions. Moreover, embeddings are simpler to obtain because they do not require alignments, structural data, or selection of informative amino-acid properties. Visualizing the embedding vectors shows that they capture meaningful relationships between the embedded proteins.
Availability and implementation: The embedding vectors and code to reproduce the results are available at https://github.com/fhalab/embeddings_reproduction/.
Supplementary information: Supplementary data are available at Bioinformatics online.
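To make the dimensionality point concrete, the sketch below builds a one-hot encoding of a protein sequence. One-hot encoding is used here only as a common baseline representation; the abstract does not say which existing representations were compared. The function name `one_hot_encode` and the toy sequence are hypothetical and chosen for illustration.

```python
# Illustrative sketch only: one-hot encoding is assumed here as a typical
# baseline representation; it is not taken from the abstract.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return a flat 0/1 list of length len(sequence) * 20."""
    n_letters = len(AMINO_ACIDS)
    vector = [0] * (len(sequence) * n_letters)
    for position, aa in enumerate(sequence):
        # Each position owns a block of 20 slots; set the slot for its residue.
        vector[position * n_letters + AA_INDEX[aa]] = 1
    return vector


toy_sequence = "MKTAYIAKQR"  # hypothetical 10-residue sequence
encoded = one_hot_encode(toy_sequence)
print(len(encoded))  # 200
```

The length of this vector grows linearly with sequence length (a 300-residue protein would need 6000 dimensions), and sequences of different lengths must first be aligned or padded. A fixed-length learned embedding avoids both issues.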