Christophe N Magnan (1), Arlo Randall, Pierre Baldi. (1) Institute for Genomics and Bioinformatics, School of Information and Computer Sciences, University of California, Irvine, CA, USA.
Abstract
MOTIVATION: Protein insolubility is a major obstacle for many experimental studies. A sequence-based prediction method able to accurately predict the propensity of a protein to be soluble on overexpression could be used, for instance, to prioritize targets in large-scale proteomics projects and to identify mutations likely to increase the solubility of insoluble proteins. RESULTS: Here, we first curate a large, non-redundant and balanced training set of more than 17 000 proteins. Next, we extract and study 23 groups of features computed directly or predicted (e.g. secondary structure) from the primary sequence. The data and the features are used to train a two-stage support vector machine (SVM) architecture. The resulting predictor, SOLpro, is compared directly with existing methods and shows significant improvement according to standard evaluation metrics, with an overall accuracy of over 74% estimated using multiple runs of 10-fold cross-validation.
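The evaluation protocol described above (features computed from the primary sequence, accuracy estimated by 10-fold cross-validation) can be illustrated with a minimal sketch. It is not the SOLpro implementation: amino acid composition is used here only as an assumed example of a feature computed directly from the sequence, and the names `composition`, `k_fold_indices`, `cross_validated_accuracy`, `fit` and `predict` are hypothetical. The classifier is left as a pair of caller-supplied functions, so an SVM from any library could be plugged in.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(sequence):
    """Return the frequency of each standard amino acid in `sequence`.

    Assumed example of a feature computed directly from the primary
    sequence; non-standard characters are ignored.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def cross_validated_accuracy(features, labels, fit, predict, k=10):
    """Estimate accuracy by k-fold cross-validation.

    `fit(X, y)` returns a model and `predict(model, x)` returns a label;
    both are hypothetical hooks supplied by the caller.
    """
    correct = 0
    for train, test in k_fold_indices(len(labels), k):
        model = fit([features[i] for i in train], [labels[i] for i in train])
        correct += sum(predict(model, features[i]) == labels[i] for i in test)
    return correct / len(labels)
```

A two-stage arrangement could be layered on top by training one first-stage model per feature group and passing their outputs as the feature vector of a second-stage model, but the abstract does not specify those details, so they are omitted here.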