Rahu Sikander, Yuping Wang, Ali Ghulam, Xianjuan Wu.
Abstract
Predicting whether a protein sequence is an enzyme or a non-enzyme is an important but very challenging task. Existing methods use either protein geometric structures or protein sequences alone to predict enzymatic function, so their prediction results are unsatisfactory. In this paper, we propose a novel approach for classifying the amino acid sequences of enzymes and non-enzymes with a Convolutional Neural Network (CNN). The CNN predicts enzymatic roles from multiple sources of biological information, including both sequence and structural information. We represent each protein as two-dimensional data and use a 2DCNN to classify enzymes and non-enzymes under the same fivefold cross-validation protocol. We also test the model on an independent dataset, and the results demonstrate that the overfitting problem is avoided. Using the proposed CNN, we evaluated filter configurations of 32, 64, and 128 filters on both the fivefold cross-validation set and the independent test set. Via the Dipeptide Deviation from Expected mean (DDE) matrix, mutation information is extracted from the amino acid sequences, and structural information on the distances and angles between amino acids is conveyed; the derived feature maps are then encoded with DDE. On the independent dataset, the model is also compared with two other methods, GRU and XGBoost. All analyses were conducted with 32, 64, and 128 filters in the proposed CNN. The model achieved an accuracy of 0.8762 under cross-validation and 0.7621 on the independent dataset, a fivefold cross-validation ROC AUC of 0.95, a sensitivity of 0.9028, and a specificity of 0.8497. The overall accuracy of our model was 0.9133, compared with 0.8310 for the other model.
Keywords: enzyme; function; machine learning; protein; sequence
Year: 2021 PMID: 34917128 PMCID: PMC8670239 DOI: 10.3389/fgene.2021.759384
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
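The Dipeptide Deviation from Expected mean (DDE) encoding mentioned in the abstract can be sketched in pure Python. The record does not spell out the formulas, so the sketch below follows the standard DDE definition (observed dipeptide composition Dc, theoretical mean Tm from codon counts, theoretical variance Tv); function and variable names are illustrative, not the authors' code.

```python
from itertools import product

# Number of sense codons encoding each amino acid (standard genetic code, 61 total)
CODONS = {'A': 4, 'C': 2, 'D': 2, 'E': 2, 'F': 2, 'G': 4, 'H': 2, 'I': 3,
          'K': 2, 'L': 6, 'M': 1, 'N': 2, 'P': 4, 'Q': 2, 'R': 6, 'S': 6,
          'T': 4, 'V': 4, 'W': 1, 'Y': 2}
AA = sorted(CODONS)
DIPEPTIDES = [a + b for a, b in product(AA, repeat=2)]  # 400 dipeptides

def dde(seq):
    """DDE vector: (Dc - Tm) / sqrt(Tv) for each of the 400 dipeptides."""
    n = len(seq) - 1                      # number of dipeptides in the sequence
    counts = {dp: 0 for dp in DIPEPTIDES}
    for i in range(n):
        counts[seq[i:i + 2]] += 1
    vec = []
    for dp in DIPEPTIDES:
        dc = counts[dp] / n                               # observed composition
        tm = (CODONS[dp[0]] / 61) * (CODONS[dp[1]] / 61)  # theoretical mean
        tv = tm * (1 - tm) / n                            # theoretical variance
        vec.append((dc - tm) / tv ** 0.5)
    return vec                             # 400-dimensional feature vector
```

The 400-dimensional output is what the paper reshapes into two-dimensional input for the 2DCNN.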
FIGURE 4 Comparison of CNN with GRU and XGBoost.
FIGURE 1 Proposed framework model.
Parameter settings of the CNN model.
| Parameter name | Recommendation |
|---|---|
| Learning rate | 0.001 |
| Activation function | ReLU |
| Cost function | Cross-entropy |
| Optimizer | Adam |
| Dropout rate | 0.4 |
| Width | 3 |
| Depth | 128 |
| Epoch | 80 |
All layers and trainable parameters of the 2DCNN model.
| Layer (type) | Output shape | Param # |
|---|---|---|
| conv2d_4 (Conv2D) | (None, 1, 20, 32) | 5,792 |
| leaky_re_lu_5 (LeakyReLU) | (None, 1, 20, 32) | 0 |
| max_pooling2d_4 (MaxPooling2D) | (None, 1, 10, 32) | 0 |
| dropout_5 (Dropout) | (None, 1, 10, 32) | 0 |
| conv2d_5 (Conv2D) | (None, 1, 10, 64) | 18,496 |
| leaky_re_lu_6 (LeakyReLU) | (None, 1, 10, 64) | 0 |
| max_pooling2d_5 (MaxPooling2D) | (None, 1, 5, 64) | 0 |
| dropout_6 (Dropout) | (None, 1, 5, 64) | 0 |
| conv2d_6 (Conv2D) | (None, 1, 5, 128) | 73,856 |
| leaky_re_lu_7 (LeakyReLU) | (None, 1, 5, 128) | 0 |
| max_pooling2d_6 (MaxPooling2D) | (None, 1, 3, 128) | 0 |
| dropout_7 (Dropout) | (None, 1, 3, 128) | 0 |
| flatten_2 (Flatten) | (None, 384) | 0 |
| dense_3 (Dense) | (None, 128) | 49,280 |
| leaky_re_lu_8 (LeakyReLU) | (None, 128) | 0 |
| dropout_8 (Dropout) | (None, 128) | 0 |
| dense_4 (Dense) | (None, 2) | 258 |
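The "Param #" column of the layer table can be reproduced from the standard parameter-count formulas for convolutional and dense layers. The table does not state the kernel size or input channel count; the values below (3×3 kernels, a 1×20 input grid with 20 channels from the reshaped DDE vector) are inferred from the parameter counts themselves, so treat them as a consistency check rather than a confirmed architecture.

```python
def conv2d_params(kernel_h, kernel_w, in_ch, out_ch):
    """Trainable parameters of a Conv2D layer: weights plus one bias per filter."""
    return (kernel_h * kernel_w * in_ch + 1) * out_ch

def dense_params(in_units, out_units):
    """Trainable parameters of a Dense layer: weight matrix plus biases."""
    return (in_units + 1) * out_units

# Reproduce the Param # column (assumed 3x3 kernels, 20 input channels)
print(conv2d_params(3, 3, 20, 32))     # conv2d_4  -> 5792
print(conv2d_params(3, 3, 32, 64))     # conv2d_5  -> 18496
print(conv2d_params(3, 3, 64, 128))    # conv2d_6  -> 73856
print(dense_params(1 * 3 * 128, 128))  # dense_3   -> 49280 (flattened 384 units)
print(dense_params(128, 2))            # dense_4   -> 258
```

Every value matches the table, which also confirms the flatten size of 1 × 3 × 128 = 384 units feeding the first dense layer.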
Statistics of all retrieved enzyme and non-enzyme proteins.
| Protein type | Used data points | Training | Independent |
|---|---|---|---|
| Enzyme proteins | 652 | 501 | 129 |
| Non-enzyme proteins | 1,108 | 859 | 249 |
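The abstract evaluates the model with fivefold cross-validation over the training portion of this dataset (501 + 859 = 1,360 proteins). A minimal stdlib sketch of such a split is shown below; the function name and seed are illustrative, and the paper may well have used a library splitter or a stratified variant instead.

```python
import random

def five_fold_indices(n, seed=0):
    """Shuffle n sample indices and yield (train, test) index lists for 5 folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::5] for i in range(5)]          # 5 roughly equal folds
    for k in range(5):
        test = folds[k]
        train = [i for j, f in enumerate(folds) if j != k for i in f]
        yield train, test

# 1,360 training sequences (501 enzymes + 859 non-enzymes)
for train, test in five_fold_indices(501 + 859):
    assert len(train) + len(test) == 1360
```

Each protein lands in the test fold exactly once across the five rounds, so every training sample contributes to both fitting and validation.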
FIGURE 2 Training and validation accuracy and loss curves.
Performance of the DDE-CNN model with the optimal parameters.
| Model | ACC (CV) | Sens (CV) | Spec (CV) | MCC (CV) | ACC (Ind.) | Sens (Ind.) | Spec (Ind.) | MCC (Ind.) |
|---|---|---|---|---|---|---|---|---|
| DDE-CNN | 0.8762 | 0.9028 | 0.8497 | 0.7545 | 0.7621 | 0.7621 | 0.7621 | 0.5276 |
FIGURE 3 CNN model ROC (AUC), precision, and recall.
Comparison of different filter configurations.
| Filters | Sens (CV) | Spec (CV) | ACC (CV) | MCC (CV) | Sens (Ind.) | Spec (Ind.) | ACC (Ind.) | MCC (Ind.) |
|---|---|---|---|---|---|---|---|---|
| 32 | 0.899 | 0.876 | 0.888 | 0.778 | 0.799 | 0.810 | 0.805 | 0.614 |
| 32–64 | 0.895 | 0.858 | 0.877 | 0.756 | 0.763 | 0.828 | 0.795 | 0.596 |
| 32–64–128 | 0.894 | 0.862 | 0.878 | 0.759 | 0.793 | 0.855 | 0.824 | 0.657 |
Efficiency of different methods in terms of various feature extraction schemes.
| Classifier | ACC (CV) | Sens (CV) | Spec (CV) | MCC (CV) | ACC (Ind.) | Sens (Ind.) | Spec (Ind.) | MCC (Ind.) |
|---|---|---|---|---|---|---|---|---|
| CNN | 0.8762 | 0.9028 | 0.8497 | 0.7545 | 0.7621 | 0.7621 | 0.7621 | 0.5276 |
| GRU | 0.8584 | 0.8757 | 0.8411 | 0.7481 | 0.9001 | 0.9459 | 0.8540 | 0.8034 |
| XGBOOST | 0.8088 | 0.7530 | 0.8646 | 0.6445 | 0.9055 | 0.8111 | 1 | 0.8242 |
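The metrics reported throughout these tables (accuracy, sensitivity, specificity, MCC) all derive from a 2×2 confusion matrix like the ones in Figure 6. A minimal sketch of their standard definitions is below; the example counts are illustrative, not taken from the paper.

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, and MCC from a 2x2 confusion matrix."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    sens = tp / (tp + fn)   # recall on the positive (enzyme) class
    spec = tn / (tn + fp)   # recall on the negative (non-enzyme) class
    mcc = (tp * tn - fp * fn) / (
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return acc, sens, spec, mcc

# Illustrative counts: 50 true positives, 10 false positives,
# 30 true negatives, 10 false negatives
acc, sens, spec, mcc = metrics(50, 10, 30, 10)
print(round(acc, 4), round(sens, 4), round(spec, 4), round(mcc, 4))
```

MCC is the most informative single number here because, unlike accuracy, it stays low when a classifier ignores the minority class, which matters for the imbalanced enzyme/non-enzyme counts in this dataset.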
FIGURE 5 Validation accuracy of different optimizers in this study (epochs 0–150).
FIGURE 6 Confusion matrices for (A) the independent test and (B) the cross-validation test.
Comparison of accuracy and ROC (AUC) with existing methods.
| S. no | Previous method | Accuracy | ROC (AUC) | References |
|---|---|---|---|---|
| 1 | SVM classifier, jackknife cross-validated | 76.46% | 0.8019 | — |
| 2 | Jackknife test in identifying | 62.86% | 0.7898 | — |
| 3 | Support Vector Machines (SVM) and NN | 73.5% | — | — |
| 4 | Proposed 2DCNN with DDE | Cross-validation 0.8762, independent 0.7621 | 0.95 | — |