Using the concept of Chou's pseudo amino acid composition to predict protein solubility: an approach with entropies in information theory.

Niu Xiaohui, Li Nana, Xia Jingbo, Chen Dingyan, Peng Yuehua, Xiao Yang, Wei Weiquan, Wang Dongming, Wang Zengzhen.

Abstract

Protein solubility plays a major role in proteomics. Predicting whether a protein will be soluble or will form inclusion bodies is a fundamental problem that has not yet been satisfactorily resolved. To predict protein solubility, almost 10,000 protein sequences were downloaded from NCBI. Highly homologous sequences were then removed with CD-HIT, leaving 5692 sequences. Amino acid and dipeptide compositions are generally extracted from protein sequences to predict protein solubility. In this study, entropy from information theory was introduced as an additional predictive factor in the model. Experiments involving nine different feature vector combinations, built from these three kinds of factors, were conducted with support vector machines (SVMs) as the prediction engine. Each combination was evaluated by a re-substitution test and a 10-fold cross-validation test. According to the evaluation results, both the accuracy and the Matthews correlation coefficient (MCC) were boosted by the introduction of the entropy. The best combination was the one with amino acid compositions, dipeptide compositions, and their entropies. Its accuracy reached 90.34% with an MCC of 0.7494 in the re-substitution test, and 88.12% with an MCC of 0.7945 in 10-fold cross-validation. In conclusion, the introduction of the entropy significantly improved the performance of the predictive method.
Copyright © 2013. Published by Elsevier Ltd.
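
The feature types named in the abstract can be illustrated with a minimal sketch. It assumes that "entropy" means the Shannon entropy of each composition distribution; the abstract does not give the exact definition, so this is an illustration and not the authors' implementation. The function names (`composition`, `shannon_entropy`, `feature_vector`, `mcc`) are hypothetical.

```python
import math
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq, k):
    """Relative frequency of every k-mer over the 20-letter alphabet.

    k=1 gives the amino acid composition (20 values),
    k=2 gives the dipeptide composition (400 values).
    k-mers containing non-standard residues are ignored.
    """
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def shannon_entropy(freqs):
    """Shannon entropy in bits; zero-frequency terms contribute nothing."""
    return -sum(p * math.log2(p) for p in freqs if p > 0)


def feature_vector(seq):
    """Assumed layout: 20 + 400 composition values, then their two entropies."""
    aac = composition(seq, 1)
    dpc = composition(seq, 2)
    return aac + dpc + [shannon_entropy(aac), shannon_entropy(dpc)]


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Vectors built this way could be passed to any SVM implementation and scored by re-substitution and 10-fold cross-validation; the abstract does not state which SVM software or kernel was used.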

Keywords:  Entropy in information theory; Protein solubility; Pseudo amino acid composition; Support vector machine

Year: 2013    PMID: 23524162    DOI: 10.1016/j.jtbi.2013.03.010

Source DB: PubMed    Journal: J Theor Biol    ISSN: 0022-5193    Impact factor: 2.691


  5 in total

1.  Some illuminating remarks on molecular genetics and genomics as well as drug development. (Review)

Authors:  Kuo-Chen Chou
Journal:  Mol Genet Genomics       Date:  2020-01-01       Impact factor: 3.291

2.  FEPS: A Tool for Feature Extraction from Protein Sequence.

Authors:  Hamid Ismail; Clarence White; Hussam Al-Barakati; Robert H Newman; Dukka B Kc
Journal:  Methods Mol Biol       Date:  2022

3.  A Two-Step Feature Selection Method to Predict Cancerlectins by Multiview Features and Synthetic Minority Oversampling Technique.

Authors:  Runtao Yang; Chengjin Zhang; Lina Zhang; Rui Gao
Journal:  Biomed Res Int       Date:  2018-02-07       Impact factor: 3.411

4.  PseAAC-General: fast building various modes of general form of Chou's pseudo-amino acid composition for large-scale protein datasets.

Authors:  Pufeng Du; Shuwang Gu; Yasen Jiao
Journal:  Int J Mol Sci       Date:  2014-02-26       Impact factor: 5.923

5.  iSS-PseDNC: identifying splicing sites using pseudo dinucleotide composition.

Authors:  Wei Chen; Peng-Mian Feng; Hao Lin; Kuo-Chen Chou
Journal:  Biomed Res Int       Date:  2014-05-21       Impact factor: 3.411
