Xianfang Wang (1), Yifeng Liu (2), Zhiyong Du (3), Mingdong Zhu (4), Aman Chandra Kaushik (5), Xue Jiang (6), Dongqing Wei (7).
1. School of Computer Science and Technology, Henan Institute of Technology, Xinxiang, 453003, China. 2wangfang@163.com.
2. School of Management, Henan Institute of Technology, Xinxiang, 453003, China.
3. School of Electrical Engineering and Automation, Henan Institute of Technology, Xinxiang, 453003, China.
4. School of Computer Science and Technology, Henan Institute of Technology, Xinxiang, 453003, China.
5. Wuxi School of Medicine, Jiangnan University, Wuxi, 214000, China.
6. School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China.
7. School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China. dqwei@sjtu.edu.cn.
Abstract
BACKGROUND: Prediction of protein solubility is an indispensable prerequisite for pharmaceutical research and production. The objective of this work is to design a new model for predicting protein solubility that uses protein sequence feature fusion and a deep dual-channel convolutional neural network (DDcCNN) to improve on the performance of existing prediction models. METHODS: Redundancy in the raw protein sequences is reduced with CD-HIT. Four subsequences are built from each protein sequence: one global and three local. The global subsequence is the entire protein sequence, and the local subsequences are obtained by moving a sliding window according to a set of rules. G-gap features are extracted from these four subsequences and assembled into a mixed matrix, which serves as the input of one channel composed of three convolutional layers. Additional features are extracted with the SCRATCH tool as the input of the other channel, which consists of a single convolution, in order to find hidden relationships and improve the accuracy of the predictor. The outputs of the two parallel channels are concatenated as the input of the hidden layer, and the solubility prediction is obtained at the output layer. The best protein solubility prediction model is selected through comparative experiments on different frameworks. RESULTS: The DDcCNN model achieves an accuracy of 77.82%, a Matthews correlation coefficient of 0.57, a sensitivity of 76.13% and a specificity of 79.32%. Comparative experiments show that the overall performance of the DDcCNN model is better than that of existing models (GCNN, LCNN and PCNN). The related models and data are publicly deposited at http://www.ddccnn.wang .
CONCLUSION: The satisfactory performance of the DDcCNN model shows that these features and flexible computational methodologies can reinforce existing models for protein solubility prediction. The model could be applied in several settings, such as preselecting initial targets that are soluble or altering the solubility of target proteins, and can thus help to reduce production cost.
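The G-gap feature extraction mentioned in METHODS can be illustrated with a minimal sketch. It assumes the commonly used g-gap dipeptide composition, in which residue pairs separated by g intervening positions are counted and normalized into a 400-dimensional frequency vector. The abstract does not state the exact definition, the gap values, or the sliding-window rules used for DDcCNN, so the function name `g_gap_dipeptide_composition` and its parameters are hypothetical.

```python
from itertools import product

# The 20 standard amino acids and all 400 ordered residue pairs.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def g_gap_dipeptide_composition(sequence, gap):
    """Return 400 frequencies of residue pairs separated by `gap` positions.

    Assumed definition (not taken from the paper): the pair at position i
    is (sequence[i], sequence[i + gap + 1]); pairs containing a
    non-standard residue are skipped.
    """
    counts = dict.fromkeys(PAIRS, 0)
    span = gap + 1
    total = 0
    for i in range(len(sequence) - span):
        pair = sequence[i] + sequence[i + span]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        # Sequence too short for this gap, or no valid pairs.
        return [0.0] * len(PAIRS)
    return [counts[p] / total for p in PAIRS]


# Example: adjacent pairs (gap 0) and pairs one residue apart (gap 1).
adjacent = g_gap_dipeptide_composition("ACDA", 0)
one_apart = g_gap_dipeptide_composition("ACDA", 1)
```

Under this assumption, applying the function to the global sequence and to each local subsequence would give four vectors that could be stacked into a mixed matrix of the kind the abstract describes.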
Keywords:
Deep dual-channel convolutional neural network; Deep learning; Feature fusion; Protein solubility