Protein-Sol: a web tool for predicting protein solubility from sequence.

Max Hebditch¹, M Alejandro Carballo-Amador², Spyros Charonis¹, Robin Curtis¹, Jim Warwicker².

Abstract

MOTIVATION: Protein solubility is an important property in industrial and therapeutic applications. Prediction is a challenge, despite a growing understanding of the relevant physicochemical properties.
RESULTS: Protein-Sol is a web server for predicting protein solubility. Using available data for Escherichia coli protein solubility in a cell-free expression system, 35 sequence-based properties are calculated. Feature weights are determined from separation of low and high solubility subsets. The model returns a predicted solubility and an indication of the features which deviate most from average values. Two other properties are profiled in windowed calculation along the sequence: fold propensity, and net segment charge. The utility of these additional features is demonstrated with the example of thioredoxin.
AVAILABILITY AND IMPLEMENTATION: The Protein-Sol webserver is available at http://protein-sol.manchester.ac.uk. CONTACT: jim.warwicker@manchester.ac.uk.

Year: 2017 PMID: 28575391 PMCID: PMC5870856 DOI: 10.1093/bioinformatics/btx345

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Protein solubility is an important property, from recombinant protein production to the development of biotherapeutics. A number of methods have been used to predict aggregation (Agrawal ) and solubility, based on factors such as propensity to form inclusion bodies (Wilkinson and Harrison, 1991) and β-strands (Tartaglia and Vendruscolo, 2008), structural genomics studies (Magnan ), and physicochemical properties (Agostini ). A web server is presented, Protein–Sol, for predicting protein solubility, based on the observation of a bimodal distribution of protein solubilities for E.coli proteins in cell-free expression (Niwa ). These measurements report the amount of a protein that is soluble (in the supernatant subsequent to centrifugation) compared with the total amount of that protein, rather than a thermodynamic property. A wider significance is apparent from two factors. First, that proteins tend to evolve to a point at which their solubility matches that required for their natural abundance (Tartaglia ). Second, the properties seen in the current work that associate with more soluble proteins are those seen previously, such as fewer amino acids with aromatic sidechains, favouring negative charge, and a preference for lysine over arginine (Warwicker ).

2 The Protein–Sol server

Protein–Sol is available at http://protein-sol.manchester.ac.uk without account registration or licence. It processes an amino acid sequence and calculates predicted solubility and other properties, which are returned in a graphical format and as a text file. Thirty-five features are considered in the algorithm: 20 amino acid compositions; 7 composites: K-R, D-E, K+R, D+E, K+R-D-E, K+R+D+E, F+W+Y; and 8 further predicted features: length, pI, hydropathy (Kyte and Doolittle, 1982), absolute charge at pH 7, fold propensity (Uversky ), disorder (Linding ), sequence entropy, and β-strand propensity (Costantini ). A linear model combining the 35 features gave an initial fit to the solubility data (Niwa ). Weights were then derived from differences between the lower and higher 5% tails of the solubility distribution, recorded as z-scores. Proteins predicted to have a transmembrane (TM) segment (hydropathy > 1.6 in any 21 amino acid segment) were excluded. For a query sequence, the contribution of each feature to predicted solubility is a linear scaling between its corresponding values averaged within each of the lower and higher subsets, multiplied by feature weight, with feature weights normalized to sum to 1. Because there are many correlations between features, and because some features do not contribute to the prediction, combinations of features were assessed by the overall correlation of prediction with the experimental solubilities for 2395 proteins (without predicted TM regions). Features with the least weighting were eliminated first, and elimination continued until model performance fell. The final prediction scheme consists of 10 features (H, L, V, K-R, D+E, F+W+Y, length, absolute charge, fold propensity, sequence entropy), with a correlation coefficient of 0.621 between calculated and experimental values. A predicted solubility of 58% gives the best threshold for separating the lowest and highest 5% subsets in a receiver operating characteristic (ROC) analysis.
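The composition features, the composites, the TM exclusion rule and the per-feature scaling described above can be sketched as follows. This is a minimal illustration and not the server's code. It assumes compositions are expressed as fractions, that the hydropathy threshold applies to the mean Kyte-Doolittle value in each 21-residue window, and that the "linear scaling" maps the low-subset mean to 0 and the high-subset mean to 1. All function names are hypothetical, and only the 20 standard residues are handled.

```python
# Kyte-Doolittle hydropathy scale (Kyte and Doolittle, 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(seq):
    """Fraction of each of the 20 standard amino acids (units are an assumption)."""
    seq = seq.upper()
    return {aa: seq.count(aa) / len(seq) for aa in KD}


def composites(comp):
    """The seven composite features listed in the text."""
    k, r, d, e = comp["K"], comp["R"], comp["D"], comp["E"]
    return {
        "K-R": k - r,
        "D-E": d - e,
        "K+R": k + r,
        "D+E": d + e,
        "K+R-D-E": k + r - d - e,
        "K+R+D+E": k + r + d + e,
        "F+W+Y": comp["F"] + comp["W"] + comp["Y"],
    }


def has_predicted_tm(seq, window=21, threshold=1.6):
    """True if any window has mean hydropathy above the threshold (assumed reading)."""
    values = [KD[aa] for aa in seq.upper()]
    for start in range(len(values) - window + 1):
        if sum(values[start:start + window]) / window > threshold:
            return True
    return False


def normalize(weights):
    """Rescale feature weights so that they sum to 1."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def scaled_contribution(value, low_mean, high_mean, weight):
    """Weighted position of a feature value between the low and high subset means."""
    return weight * (value - low_mean) / (high_mean - low_mean)
```

Under this reading, a feature whose mean is lower in the high-solubility subset (for example F+W+Y) reduces the contribution as its value rises, without needing a signed weight. The subset means and weights themselves are not given in the text and would have to be taken from the training data.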
In addition to charge-based features, non-polar features are also present in the model. For example, aromatic (F+W+Y) composition weights predicted solubility down, whilst valine weights solubility up. In addition, predicted fold propensity and sequence entropy have a negative influence on predicted solubility. Our interpretation is that, in addition to a charged protein surface being favourable for solubility, there may also be a subset of more soluble proteins that have reduced sequence complexity, perhaps similar to intrinsically disordered proteins. Display of the extent to which each feature deviates from population average allows the user to select features that could be targeted to improve solubility. Net charge and fold propensity over a sliding window are displayed as profiles, providing additional information with which to interpret protein behaviour.

Prediction of solubility from sequence is a single step process for the user. Each sequence for calculation is assigned a unique id number, formatted, and stored temporarily on the server. No calculation occurs if the input is invalid and the user is informed of the mismatch. The algorithm generates a text file that is processed using shell scripts and R to produce a graphical interpretation of the results. The predicted protein solubility is not valid for membrane proteins, but the results will be presented, with a warning, if a predicted transmembrane region is identified.

Several tests have been made of the server. Protein expression data from structural genomics projects is often aggregated and heterogeneous. The first test set consists of 679 strongly expressed and well-behaved proteins from a single pipeline, which were used to derive a model for crystallization propensity (Price ). We predict an average solubility of 70.6% for these 679 proteins, with 70.3% of the set above the 58% threshold. A further set of 200 proteins used to test the crystallization model (Price ) gives an average of 76.1% predicted solubility with 82.5% of the set above the 58% threshold.

Thermophile proteins have evolved to counter particularly stringent tests on solubility (Greaves and Warwicker, 2007). Methanopyrus kandleri is a sequenced archaeon with one of the highest known growth temperatures (, Slesarev ). Excluding those containing a predicted TM segment, solubility predictions for 1294 proteins from UniProt (UniProt Consortium, 2017) averaged 78.6%, with 93.6% of these above 58%.

A link between protein aggregation rates and gene expression levels (Tartaglia ) has been reinforced with comparison of the abundant proteins serum albumin and myoglobin with their less abundant paralogues (Warwicker ). Quantitative proteomics allows comparison of (log scale) protein abundance and predicted protein solubility, with ROC plot analysis using low and high abundance subsets from the 5% tails. Calculations have been made with whole proteome integrated sets for Escherichia coli, Saccharomyces cerevisiae and Homo sapiens retrieved from PaxDb (Wang ). Results are reported in Table 1 (excluding proteins containing predicted TM segments), with the original development set of E.coli protein solubility added for reference. With membrane proteins included (not shown), the measures of agreement increase, an outcome of the importance of charge for protein solubility. Accuracy for the ROC analysis is listed at 58% solubility prediction, since this gives the highest accuracy for the development set. ROC plots are shown in Figure 1.

Table 1.

ROC plot and correlation analysis of predictions versus protein solubility or abundance

Set	Proteins	5% Tails	AUC	Acc at 58%	Corr
E.coli solubility Train	2395	120	0.974	0.900	0.621
E.coli abundance Test	2364	119	0.922	0.828	0.382
Yeast abundance Test	4275	214	0.707	0.626	0.188
Human abundance Test	10662	534	0.708	0.659	0.190

Note: AUC is area under the curve, Acc is accuracy at 58% solubility prediction threshold.
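The tail-based ROC analysis summarised in Table 1 can be illustrated with a short sketch. It assumes that the low and high subsets are the bottom and top 5% of proteins ranked by the measured quantity (solubility or abundance), that AUC is computed as the probability that a high-tail protein receives a higher prediction than a low-tail protein, and that accuracy counts low-tail predictions below 58% and high-tail predictions at or above 58% as correct. The rounding-up rule in `tail_size` is inferred from the table rather than stated in the text, and all function names are hypothetical.

```python
def tail_size(n, percent=5):
    """Number of proteins in each tail, rounding up (inferred from Table 1)."""
    return (n * percent + 99) // 100


def roc_auc(low_scores, high_scores):
    """Fraction of (high, low) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for h in high_scores:
        for l in low_scores:
            if h > l:
                wins += 1.0
            elif h == l:
                wins += 0.5
    return wins / (len(high_scores) * len(low_scores))


def tails_auc_and_accuracy(measured, predicted, percent=5, threshold=58.0):
    """AUC and accuracy at a fixed threshold, using only the two tails."""
    order = sorted(range(len(measured)), key=lambda i: measured[i])
    k = tail_size(len(measured), percent)
    low = [predicted[i] for i in order[:k]]
    high = [predicted[i] for i in order[-k:]]
    correct = sum(p < threshold for p in low) + sum(p >= threshold for p in high)
    return roc_auc(low, high), correct / (2 * k)


# Tail sizes for the four set sizes listed in Table 1.
print([tail_size(n) for n in (2395, 2364, 4275, 10662)])
# → [120, 119, 214, 534]
```

The printed sizes match the "5% Tails" column. The pairwise AUC loop is quadratic in the tail size, which is acceptable for tails of a few hundred proteins.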

Fig. 1

Performance of the predictions across bacterial and eukaryotic proteomes. ROC plots are shown for prediction performance in the training set of measured solubilities, and 3 test sets of protein abundance

Through these varied tests, a structural genomics pipeline, the proteome of a hyperthermophile, and protein abundance in organisms across the tree of life, the model consistently demonstrates correlations.

3 Discussion

Protein–Sol is demonstrated with E.coli thioredoxin, known to enhance solubility of co-produced proteins in E.coli (Yasukawa ). Predicted solubility (scaled from 0 to 1) is plotted (Fig. 2A) alongside the population average for the experimental dataset (Niwa ). Thioredoxin at 0.76 is well above the average of 0.45, consistent with its wider use in co-expression or as a fusion partner. Solubility prediction on the server is given in the 0–1 range for ease of user interpretation. Percentage values, which were used in training and testing, can exceed 100% in the experimental dataset. For reference, thioredoxin predicts at 88% against a population average of 53%. The predicted pI is also displayed. Next, a plot shows deviations from population averages for the 35 features. Although only 10 of these contribute to the prediction, the signed deviations show the characteristics of the input sequence. For example, KmR (meaning K-R) is prominent for thioredoxin and contributes to the prediction of high solubility. To improve solubility, K-R is perhaps more useful than the other 9 features in the final model, since lysine and arginine can generally be swapped with little consequence for protein function or fold. The plot of windowed fold propensity (Fig. 2A) shows two subdomains, consistent with experimental characterization of thioredoxin folding (Katti ). The subdomain structure is also apparent in a novel representation of windowed net charge, with negatively charged N-terminal and positively charged C-terminal subdomains (Fig. 2B). Whilst the windowed net charge does not indicate a complete separation of charge between subdomains, it shows the possibility for interactions dependent on the opposite sign of net charges, exemplified by the two salt-bridges shown in Figure 2B.
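A windowed net charge profile of the kind discussed above might be computed as in the following sketch. The window length of 21 residues, the restriction to K/R (+1) and D/E (-1), and the omission of histidine and the chain termini are all assumptions made for illustration, since the text does not specify how the server defines the window or assigns charges. The function name is hypothetical.

```python
# Formal side-chain charges near neutral pH (histidine and termini ignored here).
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def windowed_net_charge(seq, window=21):
    """Mean net charge per residue in each full-length sliding window."""
    charges = [CHARGE.get(aa, 0) for aa in seq.upper()]
    if len(charges) < window:
        return []
    total = sum(charges[:window])
    profile = [total / window]
    for i in range(window, len(charges)):
        # Slide the window by one residue: add the new residue, drop the oldest.
        total += charges[i] - charges[i - window]
        profile.append(total / window)
    return profile
```

A run of negative values followed by a run of positive values in the returned profile would correspond to the pattern of oppositely charged subdomains described for thioredoxin.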

Fig. 2

(A) The Protein–Sol calculation. Results are shown for the E.coli thioredoxin example. (B) E.coli thioredoxin (2trx chain A, Katti ) is shown colour-coded by subdomain (1–67 and 68–108), with salt-bridges E44-K96 and E48-K100 displayed between the subdomains. Drawn with PyMOL (http://pymol.org).

Protein–Sol provides a fast sequence-based method for predicting protein solubility. Lysine and arginine content are highlighted with regard to modifying protein solubility, as K/R swaps are likely to be structurally and functionally neutral. A case study with thioredoxin shows that additional features of the server can be used to interpret subdomain structures and introduces the novel feature of windowed net charge, which may inform on charge-charge interactions between subdomains.

19 in total

1. Why are "natively unfolded" proteins unstructured under physiologic conditions?

Authors: V N Uversky; J R Gillespie; A L Fink
Journal: Proteins Date: 2000-11-15

2. Predicting the solubility of recombinant proteins in Escherichia coli.

Authors: D L Wilkinson; R G Harrison
Journal: Biotechnology (N Y) Date: 1991-05

3. Amino acid propensities for secondary structures are influenced by the protein structural class.

Authors: Susan Costantini; Giovanni Colonna; Angelo M Facchiano
Journal: Biochem Biophys Res Commun Date: 2006-02-08 Impact factor: 3.575

4. Bimodal protein solubility distribution revealed by an aggregation analysis of the entire ensemble of Escherichia coli proteins.

Authors: Tatsuya Niwa; Bei-Wen Ying; Katsuyo Saito; WenZhen Jin; Shoji Takada; Takuya Ueda; Hideki Taguchi
Journal: Proc Natl Acad Sci U S A Date: 2009-02-27 Impact factor: 11.205

5. SOLpro: accurate sequence-based prediction of protein solubility.

Authors: Christophe N Magnan; Arlo Randall; Pierre Baldi
Journal: Bioinformatics Date: 2009-06-23 Impact factor: 6.937

6. The Zyggregator method for predicting protein aggregation propensities.

Authors: Gian Gaetano Tartaglia; Michele Vendruscolo
Journal: Chem Soc Rev Date: 2008-05-27 Impact factor: 54.564

7. A simple method for displaying the hydropathic character of a protein.

Authors: J Kyte; R F Doolittle
Journal: J Mol Biol Date: 1982-05-05 Impact factor: 5.469

8. Crystal structure of thioredoxin from Escherichia coli at 1.68 A resolution.

Authors: S K Katti; D M LeMaster; H Eklund
Journal: J Mol Biol Date: 1990-03-05 Impact factor: 5.469

9. Understanding the physical properties that control protein crystallization by analysis of large-scale experimental data.

Authors: W Nicholson Price; Yang Chen; Samuel K Handelman; Helen Neely; Philip Manor; Richard Karlin; Rajesh Nair; Jinfeng Liu; Michael Baran; John Everett; Saichiu N Tong; Farhad Forouhar; Swarup S Swaminathan; Thomas Acton; Rong Xiao; Joseph R Luft; Angela Lauricella; George T DeTitta; Burkhard Rost; Gaetano T Montelione; John F Hunt
Journal: Nat Biotechnol Date: 2009-01 Impact factor: 54.908

10. UniProt: the universal protein knowledgebase.

Authors: UniProt Consortium
Journal: Nucleic Acids Res Date: 2016-11-29 Impact factor: 16.971
