Learned protein embeddings for machine learning.

Kevin K Yang¹, Zachary Wu¹, Claire N Bedbrook², Frances H Arnold¹,².

Abstract

Motivation: Machine-learning models trained on protein sequences and their measured functions can infer biological properties of unseen sequences without requiring an understanding of the underlying physical or biological mechanisms. Such models enable the prediction and discovery of sequences with optimal properties. Machine-learning models generally require that their inputs be vectors, and the conversion from a protein sequence to a vector representation affects the model's ability to learn. We propose to learn embedded representations of protein sequences that take advantage of the vast quantity of unmeasured protein sequence data available. These embeddings are low-dimensional and can greatly simplify downstream modeling.
Results: The predictive power of Gaussian process models trained on embeddings is comparable to that of models trained on existing representations, which suggests that embeddings enable accurate predictions despite having orders of magnitude fewer dimensions. Moreover, embeddings are simpler to obtain because they do not require alignments, structural data, or selection of informative amino-acid properties. Visualizing the embedding vectors shows that meaningful relationships between the embedded proteins are captured.
Availability and implementation: The embedding vectors and code to reproduce the results are available at https://github.com/fhalab/embeddings_reproduction/.
Supplementary information: Supplementary data are available at Bioinformatics online.
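
To make the sequence-to-vector step and the Gaussian process prediction concrete, here is a minimal sketch in plain Python. It is not the authors' implementation (their code is at the repository linked above). The one-hot encoder, the squared-exponential kernel, the helper names, and the toy sequences and measurements are all assumptions chosen for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def one_hot(seq):
    """Flatten a protein sequence into a 20 * len(seq) binary vector."""
    vec = []
    for residue in seq:
        vec.extend(1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS)
    return vec

def rbf_kernel(x, y, length_scale=1.0):
    """Squared-exponential kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * length_scale ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def gp_posterior_mean(X_train, y_train, x_new, noise=1e-6, length_scale=1.0):
    """Posterior mean of a zero-mean GP: k_*^T (K + noise * I)^{-1} y."""
    n = len(X_train)
    K = [[rbf_kernel(X_train[i], X_train[j], length_scale)
          + (noise if i == j else 0.0) for j in range(n)] for i in range(n)]
    alpha = solve(K, y_train)
    k_star = [rbf_kernel(x, x_new, length_scale) for x in X_train]
    return sum(k * a for k, a in zip(k_star, alpha))

# Toy data: hypothetical sequences with made-up measurements.
train_seqs = ["ACDA", "ACDG", "GCDA"]
train_y = [1.0, 0.5, 0.2]
X_train = [one_hot(s) for s in train_seqs]

prediction = gp_posterior_mean(X_train, train_y, one_hot("GCDG"))
```

A one-hot vector grows by 20 entries per residue, so a sequence of a few hundred residues already needs thousands of dimensions. In an embedding-based workflow, `one_hot` would be swapped for a lookup of a learned, much shorter vector, and the Gaussian process code would stay the same.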

Substances: Proteins

Year: 2018 PMID: 29584811 PMCID: PMC6061698 DOI: 10.1093/bioinformatics/bty178

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

15 in total

1. Mismatch string kernels for discriminative protein classification.

Authors: Christina S Leslie; Eleazar Eskin; Adiel Cohen; Jason Weston; William Stafford Noble
Journal: Bioinformatics Date: 2004-01-22 Impact factor: 6.937

2. A diverse family of thermostable cytochrome P450s created by recombination of stabilizing fragments.

Authors: Yougen Li; D Allan Drummond; Andrew M Sawayama; Christopher D Snow; Jesse D Bloom; Frances H Arnold
Journal: Nat Biotechnol Date: 2007-08-26 Impact factor: 54.908

3. ProFET: Feature engineering captures high-level protein functions.

Authors: Dan Ofer; Michal Linial
Journal: Bioinformatics Date: 2015-06-30 Impact factor: 6.937

4. Navigating the protein fitness landscape with Gaussian processes.

Authors: Philip A Romero; Andreas Krause; Frances H Arnold
Journal: Proc Natl Acad Sci U S A Date: 2012-12-31 Impact factor: 11.205

5. Directed evolution of Gloeobacter violaceus rhodopsin spectral properties.

Authors: Martin K M Engqvist; R Scott McIsaac; Peter Dollinger; Nicholas C Flytzanis; Michael Abrams; Stanford Schor; Frances H Arnold
Journal: J Mol Biol Date: 2014-06-28 Impact factor: 5.469

6. Issues in performance evaluation for host-pathogen protein interaction prediction.

Authors: Wajid Arshad Abbasi; Fayyaz Ul Amir Afsar Minhas
Journal: J Bioinform Comput Biol Date: 2016-01-14 Impact factor: 1.122

7. Learning epistatic interactions from sequence-activity data to predict enantioselectivity.

Authors: Julian Zaugg; Yosephine Gumulya; Alpeshkumar K Malde; Mikael Bodén
Journal: J Comput Aided Mol Des Date: 2017-12-12 Impact factor: 3.686

8. Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics.

Authors: Ehsaneddin Asgari; Mohammad R K Mofrad
Journal: PLoS One Date: 2015-11-10 Impact factor: 3.240

9. Periscope: quantitative prediction of soluble protein expression in the periplasm of Escherichia coli.

Authors: Catherine Ching Han Chang; Chen Li; Geoffrey I Webb; BengTi Tey; Jiangning Song; Ramakrishnan Nagasundara Ramanan
Journal: Sci Rep Date: 2016-03-02 Impact factor: 4.379

10. Machine learning to design integral membrane channelrhodopsins for efficient eukaryotic expression and plasma membrane localization.

Authors: Claire N Bedbrook; Kevin K Yang; Austin J Rice; Viviana Gradinaru; Frances H Arnold
Journal: PLoS Comput Biol Date: 2017-10-23 Impact factor: 4.475

33 in total

1. AlphaFold at CASP13.

2. Cluster learning-assisted directed evolution.

3. PRECOGx: exploring GPCR signaling mechanisms with deep protein representations.

4. Machine Learning-driven Protein Library Design: A Path Toward Smarter Libraries.

5. DTI-BERT: Identifying Drug-Target Interactions in Cellular Networking Based on BERT and Deep Learning Method.

6. Unified rational protein engineering with sequence-based deep representation learning.

7. A convolutional neural network for the prediction and forward design of ribozyme-based gene-control elements.

8. Embeddings of genomic region sets capture rich biological associations in lower dimensions.

9. Evaluating Protein Transfer Learning with TAPE.

10. Representation learning applications in biological sequence analysis. (Review)