Varanavasi Nallasamy, Malarvizhi Seshiah.
Abstract
In living organisms, proteins are the executants of biological functions. Owing to the pivotal role that protein structure plays in protein folding patterns, comprehending protein structure is a challenging issue. Moreover, given the large number of protein sequences in protein data banks and the complexity of protein structures, experimental methods are inadequate for protein structural class prediction. It is therefore advantageous to design a reliable computational method that predicts protein structural classes from protein sequences. In recent years, interest in using deep learning to assist protein structure prediction has grown, because protein structure prediction models can be used to screen a large number of novel sequences. In this regard, we propose a model that employs an Energy Profile for atom pairs in conjunction with the Legion-Class Bayes function, called the Energy Profile Legion-Class Bayes Protein Structure Identification model. We then use a Thompson Optimized convolutional neural network to extract features between amino acids, and the Thompson Optimized SoftMax function to extract associations between protein sequences for predicting protein secondary structure. The proposed Energy Profile Bayes and Thompson Optimized Convolutional Neural Network (EPB-OCNN) method was tested on distinct protein data and compared with the following state-of-the-art methods: Template-Based Modeling (TBM), Protein Design using Deep Graph Neural Networks (PD-DGNN), a deep learning-based S-glutathionylation site prediction tool called a Computational Framework (CF), Deep Learning (DL), and distance-based protein structure prediction using deep learning (DPSP). Results were obtained using the Biopython tool and measured in terms of protein structure prediction time, protein structure prediction accuracy, specificity, recall, F-measure, and precision. The proposed EPB-OCNN method outperformed the state-of-the-art methods, thereby corroborating the objective.
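The abstract names a SoftMax function as the final prediction step. As a rough illustration only, the sketch below shows a plain, numerically stable softmax that turns raw scores into class probabilities. The three-state label set and the example scores are assumptions made for illustration, and the Thompson optimization described by the authors is not modeled here.

```python
import math

def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed label set for illustration; the source does not list the classes.
CLASSES = ("helix", "strand", "coil")

# Hypothetical raw scores for one residue.
scores = [2.0, 0.5, -1.0]
probs = softmax(scores)
predicted = CLASSES[probs.index(max(probs))]

print({name: round(p, 3) for name, p in zip(CLASSES, probs)}, predicted)
```

Subtracting the maximum score before exponentiating leaves the probabilities unchanged but avoids overflow for large scores.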
Keywords: Convolutional neural network; Energy profile; Legion-Class Bayes; Protein structure identification; Secondary structure prediction; Thompson optimization
Year: 2022 PMID: 36245797 PMCID: PMC9542649 DOI: 10.1007/s00521-022-07868-0
Source DB: PubMed Journal: Neural Comput Appl ISSN: 0941-0643 Impact factor: 5.102
Fig. 1 Block diagram of Energy Profile Bayes and Thompson Optimized Convolutional Neural Network (EPB-OCNN) method
Fig. 2 Block diagram of Energy Profile Legion-Class Bayes Protein Structure identification
Fig. 3 Structure of Thompson Optimized Convolutional Neural Network protein secondary structure prediction
Typical PDB dataset with four categories utilized for benchmarking
| Dataset | | | | | |
|---|---|---|---|---|---|
| Protein Data Bank (PDB) | | | | | |
The number of proteins in each category and the total number of proteins in PDB dataset
Hyperparameters and description
| S. no | Hyperparameters | Description |
|---|---|---|
| 1 | Number of hidden layers used | Two hidden layers are used (the first hidden layer from the convolution and the second hidden layer from the pooling) |
| 2 | Activation function used in hidden layers | Nonlinear down-sampling function (i.e., linear activation function) is used in the hidden layers |
| 3 | Activation function used in output layer | Sigmoid activation function |
| 4 | Learning rate | The learning rate used in our work is 0.01 |
| 5 | Momentum | The momentum is set to 0.9 |
| 6 | Batch size | Batch size refers to the number of samples taken from the training dataset. In our work, the batch size is 5000, matching the number of samples considered for simulation |
| 7 | Number of epochs | The number of epochs in our work is 10 |
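To make the learning rate and momentum values in the table concrete, the following is a minimal sketch of a classical SGD-with-momentum update. It assumes the tabulated momentum is used in the standard way, which the source does not confirm, and it exercises the update on a toy quadratic rather than on the authors' network. `Hyperparameters`, `sgd_momentum_step`, and `minimize_quadratic` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hyperparameters:
    learning_rate: float = 0.01  # value from the table
    momentum: float = 0.9        # value from the table
    batch_size: int = 5000       # value from the table
    epochs: int = 10             # value from the table

def sgd_momentum_step(weight, velocity, gradient, hp):
    """One classical momentum update: v <- mu*v - lr*g, then w <- w + v."""
    velocity = hp.momentum * velocity - hp.learning_rate * gradient
    return weight + velocity, velocity

def minimize_quadratic(hp, target=3.0, steps=500):
    """Toy objective f(w) = (w - target)^2, used only to exercise the update."""
    weight, velocity = 0.0, 0.0
    for _ in range(steps):
        gradient = 2.0 * (weight - target)
        weight, velocity = sgd_momentum_step(weight, velocity, gradient, hp)
    return weight

hp = Hyperparameters()
print(minimize_quadratic(hp))  # approaches the target value 3.0
```

The `steps` argument is a property of the toy problem and is unrelated to the 10 epochs listed in the table.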
Tabulation for protein structure prediction time (ms)
| Number of protein data | EPB-OCNN | TBM | PD-DGNN | CF | DL | DPSP |
|---|---|---|---|---|---|---|
| 500 | 55 | 60 | 82.5 | 92.5 | 97.5 | 100 |
| 1000 | 68.20 | 90.05 | 110.05 | 140.15 | 170.20 | 200.25 |
| 1500 | 80.09 | 120.25 | 135.55 | 160.25 | 185.15 | 220.05 |
| 2000 | 90.09 | 135.05 | 180.05 | 200.05 | 220.35 | 280.35 |
| 2500 | 120.25 | 170.05 | 200.05 | 240.45 | 280.25 | 300.05 |
| 3000 | 135.15 | 200.15 | 230.25 | 270.25 | 320.05 | 350.15 |
| 3500 | 160.55 | 215.05 | 275.05 | 320.05 | 380.25 | 400.15 |
| 4000 | 195.05 | 235.25 | 300.05 | 380.25 | 419.45 | 480.05 |
| 4500 | 220.25 | 275.05 | 330.15 | 420.05 | 480.05 | 520.25 |
| 5000 | 265.05 | 290.25 | 375.15 | 480.43 | 530.05 | 590.05 |
Fig. 4 Protein structure prediction time analyses
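One way to read the prediction time table is as a relative saving of EPB-OCNN over each baseline. The sketch below computes that percentage for the first and last rows of the table. The helper name `reduction_vs_baselines` is hypothetical, and the calculation is a reading aid rather than an analysis reported by the authors.

```python
METHODS = ("EPB-OCNN", "TBM", "PD-DGNN", "CF", "DL", "DPSP")

# Prediction times in ms, copied from the first and last rows of the table.
TIMES_MS = {
    500: (55, 60, 82.5, 92.5, 97.5, 100),
    5000: (265.05, 290.25, 375.15, 480.43, 530.05, 590.05),
}

def reduction_vs_baselines(row):
    """Percent time saved by the first column relative to each other column."""
    proposed, *baselines = row
    return {
        name: 100.0 * (t - proposed) / t
        for name, t in zip(METHODS[1:], baselines)
    }

for n, row in TIMES_MS.items():
    rounded = {k: round(v, 2) for k, v in reduction_vs_baselines(row).items()}
    print(n, rounded)
```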
Tabulation for protein structure prediction accuracy (%)
| Number of protein data | EPB-OCNN | TBM | PD-DGNN | CF | DL | DPSP |
|---|---|---|---|---|---|---|
| 500 | 98.8 | 98 | 97 | 96 | 95 | 94 |
| 1000 | 98.80 | 96.25 | 92.45 | 92.10 | 91.10 | 90.45 |
| 1500 | 98.60 | 94.25 | 90.55 | 89.45 | 89.10 | 88.25 |
| 2000 | 98.30 | 93.45 | 87.65 | 86.25 | 85.35 | 85.10 |
| 2500 | 98.25 | 90.55 | 85.75 | 84.35 | 83.25 | 82.10 |
| 3000 | 97.15 | 88.25 | 82.45 | 81.10 | 80.65 | 79.10 |
| 3500 | 97.10 | 85.35 | 81.55 | 80.25 | 79.35 | 79.10 |
| 4000 | 97.10 | 84.25 | 81.25 | 80.10 | 79.10 | 78.65 |
| 4500 | 96.95 | 81.35 | 80.35 | 78.45 | 78.25 | 78.10 |
| 5000 | 96.65 | 81.10 | 76.45 | 75.31 | 74.25 | 74.10 |
Fig. 5 Protein structure prediction accuracy analyses
Tabulation for ROC curve (true positive rate of each method at a given false positive rate)
| False positive rate | EPB-OCNN | TBM | PD-DGNN | CF | DL | DPSP |
|---|---|---|---|---|---|---|
| 0.1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0.2 | 0.33 | 0.23 | 0.18 | 0.17 | 0.16 | 0.15 |
| 0.3 | 0.48 | 0.33 | 0.23 | 0.21 | 0.18 | 0.17 |
| 0.4 | 0.63 | 0.48 | 0.36 | 0.28 | 0.23 | 0.21 |
| 0.5 | 0.71 | 0.54 | 0.48 | 0.40 | 0.34 | 0.31 |
| 0.6 | 0.76 | 0.63 | 0.51 | 0.44 | 0.40 | 0.34 |
| 0.7 | 0.83 | 0.74 | 0.63 | 0.55 | 0.43 | 0.41 |
| 0.8 | 0.95 | 0.85 | 0.75 | 0.68 | 0.58 | 0.53 |
| 0.9 | 0.98 | 0.88 | 0.83 | 0.76 | 0.68 | 0.63 |
| 1.0 | 1 | 0.93 | 0.87 | 0.83 | 0.75 | 0.68 |
Fig. 6 ROC curve analyses
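A tabulated ROC curve is commonly summarized by the area under it. The sketch below applies the trapezoidal rule to three columns of the table. Because the tabulated false positive rates start at 0.1, the result is the area over the range 0.1 to 1.0 only, not a full AUC, and the source does not report such a figure. The helper name `trapezoid_area` is hypothetical.

```python
# False positive rates and true positive rates copied from the table.
FPR = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
TPR = {
    "EPB-OCNN": [0, 0.33, 0.48, 0.63, 0.71, 0.76, 0.83, 0.95, 0.98, 1],
    "TBM": [0, 0.23, 0.33, 0.48, 0.54, 0.63, 0.74, 0.85, 0.88, 0.93],
    "DPSP": [0, 0.15, 0.17, 0.21, 0.31, 0.34, 0.41, 0.53, 0.63, 0.68],
}

def trapezoid_area(xs, ys):
    """Area under the piecewise-linear curve through the points (xs[i], ys[i])."""
    area = 0.0
    for i in range(1, len(xs)):
        area += (xs[i] - xs[i - 1]) * (ys[i] + ys[i - 1]) / 2.0
    return area

for method, tpr in TPR.items():
    print(method, round(trapezoid_area(FPR, tpr), 4))
```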
Tabulation for precision
| Number of protein data | EPB-OCNN | TBM | PD-DGNN | CF | DL | DPSP |
|---|---|---|---|---|---|---|
| 500 | 0.99 | 0.93 | 0.77 | 0.75 | 0.72 | 0.70 |
| 1000 | 0.94 | 0.88 | 0.78 | 0.77 | 0.75 | 0.80 |
| 1500 | 0.94 | 0.87 | 0.79 | 0.80 | 0.77 | 0.76 |
| 2000 | 0.99 | 0.87 | 0.80 | 0.73 | 0.73 | 0.72 |
| 2500 | 0.99 | 0.90 | 0.90 | 0.79 | 0.76 | 0.75 |
| 3000 | 0.96 | 0.89 | 0.91 | 0.77 | 0.75 | 0.73 |
| 3500 | 0.94 | 0.90 | 0.88 | 0.78 | 0.77 | 0.75 |
| 4000 | 0.96 | 0.89 | 0.89 | 0.78 | 0.76 | 0.75 |
| 4500 | 0.98 | 0.89 | 0.89 | 0.68 | 0.66 | 0.65 |
| 5000 | 0.94 | 0.89 | 0.89 | 0.79 | 0.77 | 0.75 |
Fig. 7 Precision analyses
Tabulation for specificity, recall and F-measure
| Metrics | EPB-OCNN | TBM | PD-DGNN | CF | DL | DPSP |
|---|---|---|---|---|---|---|
| Specificity | 88.25 | 86.45 | 84.25 | 76.75 | 71.25 | 67.32 |
| Recall | 87.65 | 85.25 | 83.55 | 77.15 | 72.65 | 68.65 |
| F-measure | 85.55 | 83.45 | 81.55 | 75.45 | 70.25 | 65.45 |
Fig. 8 Specificity, recall, and F-measure analyses
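For reference, the metrics tabulated above have standard definitions in terms of confusion-matrix counts. The sketch below computes them for a binary case. The counts are hypothetical, and the source does not state how per-class results were aggregated, so this is a sketch of the textbook formulas only.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def classification_metrics(tp, fp, tn, fn):
    """Standard binary metrics from true/false positive/negative counts."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)        # also called sensitivity
    specificity = _ratio(tn, tn + fp)
    f_measure = _ratio(2 * precision * recall, precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_measure": f_measure,
    }

# Hypothetical counts, chosen only to exercise the formulas.
metrics = classification_metrics(tp=80, fp=20, tn=70, fn=30)
print({name: round(value, 4) for name, value in metrics.items()})
```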
Tabulation for precision–recall curve (precision of each method at a given recall)
| Recall | EPB-OCNN | TBM | PD-DGNN | CF | DL | DPSP |
|---|---|---|---|---|---|---|
| 0.1 | 0.26 | 0.23 | 0.21 | 0.18 | 0.17 | 0.15 |
| 0.2 | 0.29 | 0.27 | 0.26 | 0.22 | 0.21 | 0.19 |
| 0.3 | 0.38 | 0.37 | 0.35 | 0.32 | 0.30 | 0.26 |
| 0.4 | 0.43 | 0.40 | 0.39 | 0.36 | 0.32 | 0.29 |
| 0.5 | 0.44 | 0.42 | 0.39 | 0.36 | 0.34 | 0.33 |
| 0.6 | 0.45 | 0.43 | 0.41 | 0.35 | 0.36 | 0.33 |
| 0.7 | 0.46 | 0.44 | 0.43 | 0.39 | 0.37 | 0.35 |
| 0.8 | 0.48 | 0.46 | 0.45 | 0.40 | 0.38 | 0.36 |
| 0.9 | 0.48 | 0.48 | 0.46 | 0.40 | 0.38 | 0.38 |
| 1 | 0.52 | 0.49 | 0.47 | 0.42 | 0.39 | 0.39 |
Fig. 9 Precision–recall analyses
Tabulation for MCC
| Number of protein data | EPB-OCNN | TBM | PD-DGNN | CF | DL | DPSP |
|---|---|---|---|---|---|---|
| 500 | 0.93 | 0.88 | 0.82 | 0.81 | 0.78 | 0.77 |
| 1000 | 0.92 | 0.87 | 0.81 | 0.80 | 0.77 | 0.74 |
| 1500 | 0.91 | 0.85 | 0.80 | 0.79 | 0.75 | 0.73 |
| 2000 | 0.90 | 0.84 | 0.78 | 0.76 | 0.73 | 0.71 |
| 2500 | 0.90 | 0.84 | 0.77 | 0.76 | 0.72 | 0.69 |
| 3000 | 0.88 | 0.83 | 0.77 | 0.73 | 0.70 | 0.66 |
| 3500 | 0.88 | 0.83 | 0.76 | 0.71 | 0.68 | 0.64 |
| 4000 | 0.85 | 0.81 | 0.75 | 0.70 | 0.66 | 0.63 |
| 4500 | 0.84 | 0.80 | 0.73 | 0.68 | 0.63 | 0.62 |
| 5000 | 0.83 | 0.78 | 0.72 | 0.66 | 0.62 | 0.57 |
Fig. 10 MCC analyses
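The Matthews correlation coefficient (MCC) ranges from -1 to 1 and is computed from all four confusion-matrix counts. The sketch below shows the standard binary formula with hypothetical counts. How the authors computed MCC for their multi-class setting is not stated in the source.

```python
import math

def matthews_corrcoef(tp, fp, tn, fn):
    """Binary MCC: (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0

# Hypothetical counts, chosen only to exercise the formula.
print(round(matthews_corrcoef(tp=90, fp=15, tn=85, fn=10), 4))
```

A value of 1 indicates perfect agreement between predictions and labels, 0 indicates no better than chance, and -1 indicates total disagreement.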
Comparison of algorithms
| S. no | Criterion | EPB-OCNN | TBM | PD-DGNN | CF | DL | DPSP |
|---|---|---|---|---|---|---|---|
| 1 | Feature selection | Energy profiles for pair of atoms | Local structural feature selection | Structural features | No feature selection algorithm is applied separately | No feature selection algorithm is applied separately | No feature selection algorithm is applied separately |
| 2 | Linear/nonlinear/collinear data | Any protein data type | Linear type | Linear sequence of amino acids | Not applicable | Can be applied with linear data only | Can be applied with linear data only |
| 3 | Optimization algorithm | Thompson Optimization function (with momentum set to 0.9) | Torsion angle optimization | Not applied | Gradient descent | Gradient-based weight optimization | ResNet |
| 4 | Activation function | Sigmoid activation function | Per residue network activation | Not applicable | Linear function | Linear function | Linear function |
| 5 | Hyper parameters (regularization parameter, learning rate) | Hyper parameter is optimized via Thompson function and changes according to the amino acid sequence used for simulation | Not applicable | Not applicable | 0.6 | 0.6 | 0.5 |
| 6 | Neural network construction method | Convolutional model | Deep residual neural network | Deep graph neural network | Deep Neural Networks | Deep learning | deep convolutional residual neural network |
| 7 | Weight calculation of nodes | Optimization model | Markov random field model | Hidden Markov models | 62 | Not available | Not available |
| 8 | Error handling | Gaussian prior modeling | Not handled | Spearman’s correlation coefficient | Not used | Static – zero training error | Absolute error calculation |
Tabulation for McNemar test (M-test)
| Number of protein data | EPB-OCNN | TBM | PD-DGNN | CF | DL | DPSP |
|---|---|---|---|---|---|---|
| 500 | 78.23 | 76.35 | 75.55 | 74.55 | 73.35 | 72.10 |
| 1000 | 77.55 | 75.25 | 74.45 | 73.25 | 72.25 | 71.35 |
| 1500 | 77.10 | 74.45 | 74.35 | 73.10 | 72.10 | 71.10 |
| 2000 | 77.95 | 74.10 | 73.65 | 72.45 | 71.65 | 70.45 |
| 2500 | 78.65 | 73.55 | 73.10 | 72.10 | 71.45 | 70.10 |
| 3000 | 78.45 | 73.45 | 72.35 | 71.65 | 71.25 | 69.25 |
| 3500 | 78.10 | 73.10 | 71.10 | 70.35 | 71.10 | 68.10 |
| 4000 | 77.55 | 72.55 | 70.45 | 70.10 | 69.65 | 67.25 |
| 4500 | 77.45 | 72.45 | 71.35 | 69.45 | 69.10 | 67.10 |
| 5000 | 77.10 | 72.10 | 71.10 | 69.10 | 68.25 | 66.10 |
Fig. 11 M-test analysis
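For context, the textbook McNemar test compares two classifiers on the same samples using only the discordant pairs. The sketch below computes the usual chi-square statistic with an optional continuity correction. The counts are hypothetical, and the source does not explain how the tabulated M-test values were derived, so they may not correspond to this statistic.

```python
# Chi-square critical value for 1 degree of freedom at the 0.05 level.
CHI2_CRITICAL_05_DF1 = 3.841

def mcnemar_statistic(b, c, continuity_correction=True):
    """McNemar chi-square statistic from discordant counts.

    b: samples the first classifier got right and the second got wrong.
    c: samples the second classifier got right and the first got wrong.
    """
    if b + c == 0:
        return 0.0
    diff = abs(b - c)
    if continuity_correction:
        diff -= 1  # Edwards' continuity correction
    return diff ** 2 / (b + c)

# Hypothetical discordant counts for two classifiers.
stat = mcnemar_statistic(b=25, c=10)
print(round(stat, 3), stat > CHI2_CRITICAL_05_DF1)
```

A statistic above the critical value suggests that the two classifiers' error rates differ at the 0.05 level.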
Tabulation for L2 loss function
| Number of protein data | EPB-OCNN | TBM | PD-DGNN | CF | DL | DPSP |
|---|---|---|---|---|---|---|
| 500 | 36 | 100 | 225 | 400 | 625 | 900 |
| 1000 | 144 | 1406.25 | 5700.25 | 6241 | 7921 | 9120.25 |
| 1500 | 441 | 7439.06 | 20,093.1 | 25,043.1 | 26,732.3 | 31,064.1 |
| 2000 | 1156 | 17,161 | 61,009 | 75,625 | 85,849 | 88,804 |
| 2500 | 1914.06 | 55,814.1 | 126,914 | 153,077 | 175,352 | 200,256 |
| 3000 | 7310.25 | 124,256 | 277,202 | 321,489 | 336,980 | 393,129 |
| 3500 | 10,302.3 | 262,913 | 416,993 | 477,827 | 522,368 | 535,092 |
| 4000 | 13,456 | 396,900 | 562,500 | 633,616 | 698,896 | 729,316 |
| 4500 | 18,837.6 | 704,341 | 781,898 | 940,415 | 957,952 | 971,210 |
| 5000 | 28,056.3 | 893,025 | 1,386,506 | 1,523,990 | 1,657,656 | 1,677,025 |
Fig. 12 L2 loss function
Tabulation for root mean square error (RMSE)
| Number of protein data | EPB-OCNN | TBM | PD-DGNN | CF | DL | DPSP |
|---|---|---|---|---|---|---|
| 500 | 6 | 10 | 15 | 20 | 25 | 30 |
| 1000 | 12 | 37.5 | 75.5 | 79 | 89 | 95.5 |
| 1500 | 21 | 86.25 | 141.75 | 158.25 | 163.5 | 176.25 |
| 2000 | 34 | 131 | 247 | 275 | 293 | 298 |
| 2500 | 43.75 | 236.25 | 356.25 | 391.25 | 418.75 | 447.5 |
| 3000 | 85.5 | 352.5 | 526.5 | 567 | 580.5 | 627 |
| 3500 | 101.5 | 512.75 | 645.75 | 691.25 | 722.75 | 731.5 |
| 4000 | 116 | 630 | 750 | 796 | 836 | 854 |
| 4500 | 137.25 | 839.25 | 884.25 | 969.75 | 978.75 | 985.5 |
| 5000 | 167.5 | 945 | 1177.5 | 1234.5 | 1287.5 | 1295 |
Fig. 13 Root mean square error loss function
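Comparing the two error tables, each tabulated L2 loss value appears to equal the square of the corresponding RMSE value (for example, 6² = 36 and 37.5² = 1406.25). The sketch below checks this for the first two rows of both tables. The source does not state how either quantity was computed, so the relationship is an observation about the tabulated numbers, and `squares_match` is a hypothetical helper.

```python
import math

# Rows for 500 and 1000 protein data, copied from the RMSE table.
RMSE = {
    500: (6, 10, 15, 20, 25, 30),
    1000: (12, 37.5, 75.5, 79, 89, 95.5),
}

# The same rows, copied from the L2 loss table.
L2_LOSS = {
    500: (36, 100, 225, 400, 625, 900),
    1000: (144, 1406.25, 5700.25, 6241, 7921, 9120.25),
}

def squares_match(rmse_row, l2_row, rel_tol=1e-9):
    """True when every L2 value equals the square of its RMSE counterpart."""
    return all(
        math.isclose(r * r, l2, rel_tol=rel_tol)
        for r, l2 in zip(rmse_row, l2_row)
    )

for n in RMSE:
    print(n, squares_match(RMSE[n], L2_LOSS[n]))
```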