
ccSOL omics: a webserver for solubility prediction of endogenous and heterologous expression in Escherichia coli.

Federico Agostini¹, Davide Cirillo¹, Carmen Maria Livi¹, Riccardo Delli Ponti¹, Gian Gaetano Tartaglia¹.

Abstract

SUMMARY: Here we introduce ccSOL omics, a webserver for large-scale calculations of protein solubility. Our method allows (i) proteome-wide predictions; (ii) identification of soluble fragments within each sequence; (iii) exhaustive single-point mutation analysis.
RESULTS: Using coil/disorder, hydrophobicity, hydrophilicity, β-sheet and α-helix propensities, we built a predictor of protein solubility. Our approach shows an accuracy of 79% on the training set (36 990 Target Track entries). Validation on three independent sets indicates that ccSOL omics discriminates soluble and insoluble proteins with an accuracy of 74% on 31 760 proteins sharing <30% sequence similarity.
AVAILABILITY AND IMPLEMENTATION: ccSOL omics can be freely accessed on the web at http://s.tartaglialab.com/page/ccsol_group. Documentation and tutorial are available at http://s.tartaglialab.com/static_files/shared/tutorial_ccsol_omics.html.
CONTACT: gian.tartaglia@crg.es
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Year: 2014; PMID: 24990610; PMCID: PMC4184263; DOI: 10.1093/bioinformatics/btu420

Source DB: PubMed; Journal: Bioinformatics; ISSN: 1367-4803; Impact factor: 6.937

1 INTRODUCTION

Algorithms for prediction of protein solubility (Wilkinson and Harrison, 1991) and aggregation (Fernandez–Escamilla ) provide a solid basis to investigate physico-chemical determinants of amyloid fibril formation and associated diseases (Conchillo–Solé ; Tartaglia ). In the past years, an in vitro reconstituted translation system allowed the large-scale investigation of Escherichia coli protein solubility (Niwa ), thus providing the opportunity for the development of predictive methods such as ccSOL (Agostini ). In ccSOL, coil/disorder, hydrophobicity, hydrophilicity, β-sheet and α-helical propensities are combined into a solubility propensity score that is useful to investigate protein expression (Baig ) as well as bacterial evolution (Warnecke, 2012). Other methods have been developed to predict protein solubility based on amino acid characteristics. For instance, PROSO II (Smialowski ) exploits the occurrence of monopeptides and dipeptides to estimate heterologous expression in E. coli. PROSO II was trained on the pepcDB database [now Target Track (Berman )], which stores target and protocol information provided by Protein Structure Initiative centers. Both ccSOL and PROSO II perform accurate predictions when used to predict endogenous and heterologous soluble expression, respectively [ccSOL: 76% accuracy; PROSO II: 75% accuracy (Smialowski )]. We found that the experimental status of several Target Track entries (http://sbkb.org/tt/) has been recently updated and that new data are available to train predictive methods (see Supplementary Material). Here, we introduce a novel implementation of the ccSOL method, called ccSOL omics, to perform large-scale predictions of endogenous and heterologous expression in E. coli. Our algorithm has been trained on non-redundant Target Track entries to identify soluble and insoluble regions within protein sequences.
We envisage that ccSOL omics will be useful for protein engineering studies, as it allows the investigation of sequence variants in large datasets.

2 WORKFLOW AND IMPLEMENTATION

The ccSOL omics server allows the investigation of large protein datasets (see Supplementary Material). Once the user provides sequences in FASTA format, the algorithm calculates:

Solubility profiles. To identify soluble fragments within each polypeptide chain, protein sequences are divided into elements and individual solubility propensities are calculated. Starting from the N-terminus of a protein, we use a sliding window of 21 amino acids that is moved one residue at a time until the C-terminus is reached. The solubility propensity of each fragment is calculated as defined in our previous publication (Agostini ).

Sequence susceptibility. For each sequence analyzed, the algorithm computes the effect of single amino acid mutations at different positions. This approach is particularly useful to identify regions susceptible to solubility change upon mutation. All variants are reported along with their scores, which provides a basis to engineer protein sequences and test hypotheses such as the occurrence of specific mutations in pathology.

Solubility score. The solubility profile represents a unique signature containing information on all fragments arranged in sequential order. In our approach, the profile is used to estimate solubility upon expression in the E. coli system. As sequences have different lengths, we exploit a method based on the Fourier transform (Bellucci ; Tartaglia ) that allows comparison of polypeptide chains with different sizes. Using 100 Fourier coefficients, we trained an algorithm with the same architecture developed for the analysis of protein expression levels in E. coli [i.e. a neural network approach (Tartaglia )].

Reliability score. The webserver provides a confidence score based on statistical analysis of both training and testing sets (i.e. sequence range used to validate the method; see Supplementary Material).

All the aforementioned analyses are performed for each submitted protein set if the number of entries is <500. Because of the intense CPU usage, sequence susceptibility scores are not computed for datasets >500 entries.
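
The profile, mutation scan and Fourier steps described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the ccSOL omics implementation: the Kyte-Doolittle hydropathy scale is only a stand-in for the ccSOL propensity (which combines several properties and is defined in the earlier publication), and all function names, the use of coefficient magnitudes and the zero-padding of short profiles are hypothetical choices made for the example.

```python
import cmath

# Kyte-Doolittle hydropathy values. ASSUMPTION: used only as a placeholder
# per-residue score; the real ccSOL propensity is a different quantity.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
WINDOW = 21  # window length stated in the text


def window_profile(seq, window=WINDOW):
    """Mean placeholder score of every window, from N- to C-terminus."""
    if len(seq) < window:
        raise ValueError("sequence is shorter than the window")
    values = [KD[aa] for aa in seq]  # KeyError on non-standard residues
    return [sum(values[i:i + window]) / window
            for i in range(len(seq) - window + 1)]


def fourier_descriptor(profile, n_coeff=100):
    """Fixed-length descriptor: magnitudes of the first n_coeff DFT terms.

    ASSUMPTION: profiles with fewer than n_coeff points are zero-padded.
    """
    n = len(profile)
    descriptor = []
    for k in range(n_coeff):
        if k >= n:
            descriptor.append(0.0)
            continue
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(profile))
        descriptor.append(abs(coeff) / n)
    return descriptor


def single_point_variants(seq):
    """Yield (position, label, mutant) for all 19 substitutions per residue."""
    for i, wt in enumerate(seq):
        for aa in AMINO_ACIDS:
            if aa != wt:
                yield i, f"{wt}{i + 1}{aa}", seq[:i] + aa + seq[i + 1:]


def mean_hydropathy(seq):
    """Placeholder whole-sequence scorer (hypothetical, for the demo only)."""
    return sum(KD[aa] for aa in seq) / len(seq)


def susceptibility(seq, scorer):
    """Per position: (maximal, average) absolute score change on mutation."""
    base = scorer(seq)
    deltas = [[] for _ in seq]
    for pos, _label, mutant in single_point_variants(seq):
        deltas[pos].append(abs(scorer(mutant) - base))
    return [(max(d), sum(d) / len(d)) for d in deltas]
```

In this sketch, a sequence of any length of at least 21 residues is reduced to a vector of 100 numbers by `fourier_descriptor(window_profile(seq))`, which is the property that makes chains of different sizes comparable. The mutation scan evaluates 19 variants per position, which illustrates why this analysis is the CPU-intensive part of the workflow.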

3 PERFORMANCES

Expression of human prion protein (PrP) in E. coli is particularly difficult, as the protein accumulates in inactive aggregates (Baneyx and Mujacic, 2004). ccSOL omics correctly predicts that PrP is insoluble and identifies the fragment 130–170 as the least soluble (Fig. 1A–C), together with region 231–253 (not present in the mature form). This finding agrees well with previous reports (Tartaglia , 2008). Moreover, the analysis of susceptible fragments identifies a number of experimentally validated mutations (e.g. G131V, S132I, R148H, V176I and D178N) associated with lower solubility and located in the region promoting PrP aggregation (Corsaro ) [see Supplementary Material]. To assess the large-scale performance of ccSOL omics, we used 10-fold cross-validation on Target Track [total of 36 990 entries with 30% redundancy (Fu )] and observed 79% accuracy in discriminating between soluble and insoluble proteins. Furthermore, we tested the algorithm on three independent datasets containing protein expression data [total of 31 760 entries taken from E. coli (Niwa ), SOLpro (Magnan ) and PROSO II (Smialowski )] and found 74% accuracy (Fig. 1D; see also Supplementary Material).
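
For readers unfamiliar with the evaluation protocol, the two ingredients named above (a 10-fold split and an accuracy measure) can be sketched as follows. This is an illustrative outline only: the function names are hypothetical, the fold assignment is a simple interleaved split, and nothing here reproduces the actual training procedure or data handling of ccSOL omics.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists for k interleaved folds.

    Fold f holds out items f, f + k, f + 2k, ... In practice the items
    would usually be shuffled first; that step is omitted here.
    """
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and the number of items")
    for fold in range(k):
        test = list(range(fold, n_items, k))
        held_out = set(test)
        train = [i for i in range(n_items) if i not in held_out]
        yield train, test


def accuracy(labels, predictions):
    """Fraction of entries whose predicted class equals the true class."""
    if not labels or len(labels) != len(predictions):
        raise ValueError("labels and predictions must be non-empty and aligned")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)
```

Each entry is held out exactly once, so averaging `accuracy` over the ten test folds gives a single cross-validated estimate of the kind reported for the training set.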

Fig. 1.

Human Prion Solubility and ccSOL Performances. (A) Starting from the N-terminus, ccSOL computes the solubility profile using a sliding window moved toward the C-terminus. ccSOL identifies the fragment 130–170 as the most insoluble within the C-terminus of human PrP (region 231–253 is not present in the mature form of the protein). (B, C) Maximal and average susceptibility upon single-point mutation. (D) We trained on the Target Track set (AUROC = 85.5%) and tested on E. coli [AUROC = 93.3%; (Niwa )], SOLpro [AUROC = 85.7%; (Magnan )] and PROSO II [AUROC = 82.9%; (Smialowski )] proteins. Inset: overall score distribution for soluble (red) and insoluble (blue) proteins.
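
The AUROC quoted in panel D is the probability that a randomly chosen soluble protein receives a higher score than a randomly chosen insoluble one. A minimal sketch of that definition is given below; the function name and the label convention (1 = soluble, 0 = insoluble) are assumptions for the example, and the pairwise loop is quadratic, so it is meant for small illustrations rather than datasets of tens of thousands of entries.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5).

    ASSUMPTION: labels use 1 for soluble and 0 for insoluble proteins.
    Returns a fraction in [0, 1]; multiply by 100 for a percentage.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two score distributions shown in the inset.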

4 CONCLUSIONS

The ccSOL omics algorithm performs well in predicting the solubility of endogenous and heterologous gene products in E. coli (79% accuracy in cross-validation and 74% on independent sets). We hope that the webserver will be useful for biotechnological purposes; for instance, it could be used to design fusion tags for soluble expression. Although accurate, our calculations are based on sequence features only, and integration with structural characteristics is expected to increase the predictive power. We plan to combine ccSOL omics with information on chaperone (Tartaglia ) and RNA (Bellucci ; Choi ) interactions, as these molecules greatly contribute to the solubility of protein products (Cirillo ; Zanzoni ).

REFERENCES (23 in total)

1. Sequence-based prediction of protein solubility.

Authors: Federico Agostini; Michele Vendruscolo; Gian Gaetano Tartaglia
Journal: J Mol Biol Date: 2011-12-09 Impact factor: 5.469

2. PROSO II--a new method for protein solubility prediction.

Authors: Pawel Smialowski; Gero Doose; Phillipp Torkler; Stefanie Kaufmann; Dmitrij Frishman
Journal: FEBS J Date: 2012-05-21 Impact factor: 5.542

3. Physicochemical determinants of chaperone requirements.

Authors: Gian Gaetano Tartaglia; Christopher M Dobson; F Ulrich Hartl; Michele Vendruscolo
Journal: J Mol Biol Date: 2010-04-21 Impact factor: 5.469

4. RNA-mediated chaperone type for de novo protein folding. Review.

Authors: Seong Il Choi; Kisun Ryu; Baik L Seong
Journal: RNA Biol Date: 2009-01-18 Impact factor: 4.652

5. Bimodal protein solubility distribution revealed by an aggregation analysis of the entire ensemble of Escherichia coli proteins.

Authors: Tatsuya Niwa; Bei-Wen Ying; Katsuyo Saito; WenZhen Jin; Shoji Takada; Takuya Ueda; Hideki Taguchi
Journal: Proc Natl Acad Sci U S A Date: 2009-02-27 Impact factor: 11.205

6. SOLpro: accurate sequence-based prediction of protein solubility.

Authors: Christophe N Magnan; Arlo Randall; Pierre Baldi
Journal: Bioinformatics Date: 2009-06-23 Impact factor: 6.937

7. Predicting protein associations with long noncoding RNAs.

Authors: Matteo Bellucci; Federico Agostini; Marianela Masin; Gian Gaetano Tartaglia
Journal: Nat Methods Date: 2011-06 Impact factor: 28.547

8. Loss of the DnaK-DnaJ-GrpE chaperone system among the Aquificales.

Authors: Tobias Warnecke
Journal: Mol Biol Evol Date: 2012-06-07 Impact factor: 16.240

9. CD-HIT: accelerated for clustering the next-generation sequencing data.

Authors: Limin Fu; Beifang Niu; Zhengwei Zhu; Sitao Wu; Weizhong Li
Journal: Bioinformatics Date: 2012-10-11 Impact factor: 6.937

10. Role of prion protein aggregation in neurotoxicity. Review.

Authors: Alessandro Corsaro; Stefano Thellung; Valentina Villa; Mario Nizzari; Tullio Florio
Journal: Int J Mol Sci Date: 2012-07-11 Impact factor: 6.208

