ccSOL omics: a webserver for solubility prediction of endogenous and heterologous expression in Escherichia coli.

Federico Agostini, Davide Cirillo, Carmen Maria Livi, Riccardo Delli Ponti, Gian Gaetano Tartaglia.

Abstract

SUMMARY: Here we introduce ccSOL omics, a webserver for large-scale calculations of protein solubility. Our method allows (i) proteome-wide predictions; (ii) identification of soluble fragments within each sequence; and (iii) exhaustive single-point mutation analysis.
RESULTS: Using coil/disorder, hydrophobicity, hydrophilicity, β-sheet and α-helix propensities, we built a predictor of protein solubility. Our approach shows an accuracy of 79% on the training set (36 990 Target Track entries). Validation on three independent sets indicates that ccSOL omics discriminates soluble and insoluble proteins with an accuracy of 74% on 31 760 proteins sharing <30% sequence similarity.
AVAILABILITY AND IMPLEMENTATION: ccSOL omics can be freely accessed on the web at http://s.tartaglialab.com/page/ccsol_group. Documentation and tutorial are available at http://s.tartaglialab.com/static_files/shared/tutorial_ccsol_omics.html.
CONTACT: gian.tartaglia@crg.es
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author 2014. Published by Oxford University Press.

Year:  2014        PMID: 24990610      PMCID: PMC4184263          DOI: 10.1093/bioinformatics/btu420

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

Algorithms for prediction of protein solubility (Wilkinson and Harrison, 1991) and aggregation (Fernandez–Escamilla ) provide a solid basis to investigate physico-chemical determinants of amyloid fibril formation and associated diseases (Conchillo–Solé ; Tartaglia ). In recent years, an in vitro reconstituted translation system has allowed large-scale investigation of the solubility of Escherichia coli proteins (Niwa ), thus providing the opportunity to develop predictive methods such as ccSOL (Agostini ). In ccSOL, coil/disorder, hydrophobicity, hydrophilicity, β-sheet and α-helical propensities are combined into a solubility propensity score that is useful to investigate protein expression (Baig ) as well as bacterial evolution (Warnecke, 2012). Other methods have been developed to predict protein solubility based on amino acid characteristics. For instance, PROSO II (Smialowski ) exploits the occurrence of monopeptides and dipeptides to estimate heterologous expression in E.coli. PROSO II was trained on the pepcDB database [now Target Track (Berman )], which stores target and protocol information provided by Protein Structure Initiative centers. ccSOL and PROSO II give accurate predictions of endogenous and heterologous soluble expression, respectively [ccSOL: 76% accuracy; PROSO II: 75% accuracy (Smialowski )]. We found that the experimental status of several Target Track entries (http://sbkb.org/tt/) has recently been updated and that new data are available to train predictive methods (see Supplementary Material). Here, we introduce a novel implementation of the ccSOL method, called ccSOL omics, to perform large-scale predictions of endogenous and heterologous expression in E.coli. Our algorithm has been trained on non-redundant Target Track entries to identify soluble and insoluble regions within protein sequences.
We envisage that ccSOL omics will be useful for protein engineering studies, as it allows the investigation of sequence variants in large datasets.

2 WORKFLOW AND IMPLEMENTATION

The ccSOL omics server allows the investigation of large protein datasets (see Supplementary Material). Once the user provides sequences in FASTA format, the algorithm calculates:

Solubility profiles. To identify soluble fragments within each polypeptide chain, protein sequences are divided into elements and individual solubility propensities are calculated. Starting from the N-terminus of a protein, we use a sliding window of 21 amino acids that is moved one residue at a time until the C-terminus is reached. The solubility propensity profile of each fragment is calculated as defined in our previous publication (Agostini ).

Sequence susceptibility. For each sequence analyzed, the algorithm computes the effect of single amino acid mutations at different positions. This approach is particularly useful to identify regions susceptible to solubility change upon mutation. All variants are reported along with their scores, which provides a basis to engineer protein sequences and test hypotheses such as the occurrence of specific mutations in pathology.

Solubility score. The solubility profile represents a unique signature containing information on all fragments arranged in sequential order. In our approach, the profile is used to estimate solubility upon expression in the E.coli system. As sequences have different lengths, we exploit a method based on the Fourier transform (Bellucci ; Tartaglia ) that allows comparison of polypeptide chains with different sizes. Using 100 Fourier coefficients, we trained an algorithm that has the same architecture developed for the analysis of protein expression levels in E.coli [i.e. neural network approach (Tartaglia )].

Reliability score. The webserver provides a confidence score based on statistical analysis of both training and testing sets (i.e. sequence range used to validate the method; see Supplementary Material).

All the aforementioned analyses are performed for each submitted protein set if the number of entries is <500. Because of the intense CPU usage, sequence susceptibility scores are not computed for datasets >500 entries.
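The three sequence-level steps of this section (sliding-window profile, exhaustive single-point mutation scan, fixed-length Fourier representation) can be illustrated with a minimal Python sketch. It is not the ccSOL omics implementation. The scoring function is a placeholder based on the Kyte-Doolittle hydropathy scale, the summary score is a plain profile mean rather than a neural network, and the handling of short profiles (zeros beyond the profile length) is an arbitrary choice. All function names are hypothetical; only the window length of 21 and the 100 coefficients come from the text.

```python
import cmath
from typing import Callable, List, Tuple

WINDOW = 21       # window length given in the text
N_COEFF = 100     # number of Fourier coefficients given in the text
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# ASSUMPTION: Kyte-Doolittle hydropathy is used as a placeholder scale.
# It is NOT the ccSOL solubility propensity.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def toy_propensity(fragment: str) -> float:
    """Placeholder fragment score: negative mean hydropathy."""
    return -sum(KD[aa] for aa in fragment) / len(fragment)


def solubility_profile(seq: str,
                       score: Callable[[str], float] = toy_propensity,
                       window: int = WINDOW) -> List[float]:
    """Score each window, moving one residue at a time from N- to C-terminus."""
    if not seq:
        raise ValueError("empty sequence")
    if len(seq) <= window:
        return [score(seq)]  # sequence shorter than one window: single fragment
    return [score(seq[i:i + window]) for i in range(len(seq) - window + 1)]


def fourier_features(profile: List[float], n_coeff: int = N_COEFF) -> List[float]:
    """Magnitudes of the first n_coeff DFT coefficients (fixed-length output).

    ASSUMPTION: coefficients with index >= len(profile) are set to zero.
    """
    n = len(profile)
    feats = []
    for k in range(n_coeff):
        if k >= n:
            feats.append(0.0)
            continue
        c = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(profile))
        feats.append(abs(c) / n)
    return feats


def mean_score(seq: str) -> float:
    """Toy summary score: mean of the profile (stand-in for a trained model)."""
    profile = solubility_profile(seq)
    return sum(profile) / len(profile)


def mutation_scan(seq: str) -> List[Tuple[int, str, str, float]]:
    """Return (1-based position, wild type, mutant, score change) for all
    19 substitutions at every position."""
    base = mean_score(seq)
    results = []
    for i, wt in enumerate(seq):
        for mut in AMINO_ACIDS:
            if mut != wt:
                variant = seq[:i] + mut + seq[i + 1:]
                results.append((i + 1, wt, mut, mean_score(variant) - base))
    return results
```

A sequence of length L yields 19 × L variants, each of which requires a full profile recalculation. This quadratic growth in work per sequence is consistent with the text's remark that susceptibility analysis is CPU-intensive.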

3 PERFORMANCES

Expression of human prion (PrP) in E.coli is particularly difficult, as the protein accumulates in inactive aggregates (Baneyx and Mujacic, 2004). ccSOL omics correctly predicts that PrP is insoluble and identifies the fragment 130–170 as the least soluble (Fig. 1A–C), together with region 231–253 (not present in the mature form). This finding agrees well with previous reports (Tartaglia , 2008). Moreover, the analysis of susceptible fragments identifies a number of experimentally validated mutations (e.g. G131V, S132I, R148H, V176I and D178N) associated with lower solubility and located in the region promoting PrP aggregation (Corsaro ) [see Supplementary Material]. To assess the large-scale performance of ccSOL omics, we used 10-fold cross-validation on Target Track [total of 36 990 entries with 30% redundancy (Fu )] and observed 79% accuracy in discriminating between soluble and insoluble proteins. Furthermore, we tested the algorithm on three independent datasets containing protein expression data [total of 31 760 entries taken from E.coli (Niwa ), SOLpro (Magnan ) and PROSO II (Smialowski )] and found 74% accuracy (Fig. 1D; see also Supplementary Material).
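As a rough illustration of how a 10-fold cross-validated accuracy is obtained, the sketch below splits a dataset into ten folds, "trains" on nine and evaluates on the held-out one. The classifier is a deliberately trivial stand-in: a threshold placed midway between the mean scores of the two classes. It is not the neural network used by ccSOL omics, and all function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Shuffle indices 0..n-1 and return k (train, test) index pairs."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    pairs = []
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        pairs.append((train, folds[i]))
    return pairs


def fit_threshold(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Toy 'training': midpoint between the class means (1 = soluble, 0 = insoluble).

    Assumes both classes are present in the training data.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def accuracy(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """Fraction of entries whose thresholded score matches the label."""
    correct = sum((s > threshold) == bool(y) for s, y in zip(scores, labels))
    return correct / len(labels)


def cross_validated_accuracy(scores: Sequence[float], labels: Sequence[int],
                             k: int = 10, seed: int = 0) -> float:
    """Mean held-out accuracy over k folds."""
    accs = []
    for train, test in k_fold_indices(len(scores), k, seed):
        thr = fit_threshold([scores[i] for i in train], [labels[i] for i in train])
        accs.append(accuracy([scores[i] for i in test],
                             [labels[i] for i in test], thr))
    return sum(accs) / len(accs)
```

Each entry is used for testing exactly once, so the reported figure reflects performance on data the model did not see during fitting.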
Fig. 1.

Human Prion Solubility and ccSOL Performances. (A) Starting from the N-terminus, ccSOL computes the solubility profile using a sliding window moved toward the C-terminus. ccSOL identifies the fragment 130–170 as the most insoluble within the C-terminus of human PrP (region 231–253 is not present in the mature form of the protein). (B, C) Maximal and average susceptibility upon single-point mutation. (D) We trained on the Target Track set (AUROC = 85.5%) and tested on E.coli [AUROC = 93.3%; (Niwa )], SOLpro [AUROC = 85.7%; (Magnan )] and PROSO II [AUROC = 82.9%; (Smialowski )] proteins. Inset: overall score distribution for soluble (red) and insoluble (blue) proteins
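The AUROC values quoted in the caption can be read as the probability that a randomly chosen soluble protein receives a higher score than a randomly chosen insoluble one. A minimal sketch of that pairwise definition follows. The function name and the label convention (1 = soluble, 0 = insoluble) are assumptions made for illustration, and the direct pairwise loop is quadratic, so a rank-based formula would be preferable for datasets of the size discussed here.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (soluble, insoluble) pairs in which the soluble
    entry has the higher score; ties count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

Unlike accuracy, this measure does not depend on a single decision threshold, which is why a method can show different rankings under the two metrics.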


4 CONCLUSIONS

The ccSOL omics algorithm shows excellent performance in predicting the solubility of endogenous and heterologous gene products in E.coli. We hope that the webserver will be useful for biotechnological purposes; for instance, it could be used to design fusion tags for soluble expression. Although accurate, our calculations are based on sequence features, and integration with structural characteristics will dramatically increase the predictive power. We plan to combine ccSOL omics with information on chaperone (Tartaglia ) and RNA (Bellucci ; Choi ) interactions, as these molecules greatly contribute to the solubility of protein products (Cirillo ; Zanzoni ).

REFERENCES

1.  Sequence-based prediction of protein solubility.

Authors:  Federico Agostini; Michele Vendruscolo; Gian Gaetano Tartaglia
Journal:  J Mol Biol       Date:  2011-12-09       Impact factor: 5.469

2.  PROSO II--a new method for protein solubility prediction.

Authors:  Pawel Smialowski; Gero Doose; Phillipp Torkler; Stefanie Kaufmann; Dmitrij Frishman
Journal:  FEBS J       Date:  2012-05-21       Impact factor: 5.542

3.  Physicochemical determinants of chaperone requirements.

Authors:  Gian Gaetano Tartaglia; Christopher M Dobson; F Ulrich Hartl; Michele Vendruscolo
Journal:  J Mol Biol       Date:  2010-04-21       Impact factor: 5.469

4.  RNA-mediated chaperone type for de novo protein folding.

Authors:  Seong Il Choi; Kisun Ryu; Baik L Seong
Journal:  RNA Biol       Date:  2009-01-18       Impact factor: 4.652

5.  Bimodal protein solubility distribution revealed by an aggregation analysis of the entire ensemble of Escherichia coli proteins.

Authors:  Tatsuya Niwa; Bei-Wen Ying; Katsuyo Saito; WenZhen Jin; Shoji Takada; Takuya Ueda; Hideki Taguchi
Journal:  Proc Natl Acad Sci U S A       Date:  2009-02-27       Impact factor: 11.205

6.  SOLpro: accurate sequence-based prediction of protein solubility.

Authors:  Christophe N Magnan; Arlo Randall; Pierre Baldi
Journal:  Bioinformatics       Date:  2009-06-23       Impact factor: 6.937

7.  Predicting protein associations with long noncoding RNAs.

Authors:  Matteo Bellucci; Federico Agostini; Marianela Masin; Gian Gaetano Tartaglia
Journal:  Nat Methods       Date:  2011-06       Impact factor: 28.547

8.  Loss of the DnaK-DnaJ-GrpE chaperone system among the Aquificales.

Authors:  Tobias Warnecke
Journal:  Mol Biol Evol       Date:  2012-06-07       Impact factor: 16.240

9.  CD-HIT: accelerated for clustering the next-generation sequencing data.

Authors:  Limin Fu; Beifang Niu; Zhengwei Zhu; Sitao Wu; Weizhong Li
Journal:  Bioinformatics       Date:  2012-10-11       Impact factor: 6.937

10.  Role of prion protein aggregation in neurotoxicity.

Authors:  Alessandro Corsaro; Stefano Thellung; Valentina Villa; Mario Nizzari; Tullio Florio
Journal:  Int J Mol Sci       Date:  2012-07-11       Impact factor: 6.208
