Wafa Alameen Alsanousi1,2, Nosiba Yousif Ahmed1,3, Eman Mohammed Hamid1, Murtada K Elbashir4, Mohamed Elhafiz M Musa5, Jianxin Wang6, Noman Khan7.
Abstract
Plasmodium falciparum is a parasitic protozoan that causes malaria, a deadly disease. Accurate identification of malaria parasite mitochondrial proteins is therefore essential for understanding their functions and identifying novel drug targets. Several adaptive statistical techniques have been devised for classifying protein sequences. Despite significant gains, prediction performance is still constrained by the lack of appropriate feature descriptors and learning strategies in current systems. Moreover, good ground-truth data is important for Artificial Intelligence (AI)-based models, but such data is lacking in the literature. In this work, we therefore propose a novel hybrid network that combines a 1D Convolutional Neural Network (CNN) and a Bidirectional Gated Recurrent Unit (BGRU) to classify malaria parasite mitochondrial proteins. Furthermore, we curate sequence data collected from the National Center for Biotechnology Information (NCBI) and UniProtKB/Swiss-Prot protein databanks to prepare a dataset that the research community can use to evaluate AI-based algorithms. After preprocessing the collected data, we obtain 4204 cases and denote this set of proteins as PF4204. Finally, we conduct an ablation study on several conventional and deep models using the PF4204 and benchmark PF2095 datasets. The proposed model, CNN-BGRU, obtains accuracy values of 0.9096 and 0.9857 on the PF4204 and PF2095 datasets, respectively. In addition, CNN-BGRU is compared with state-of-the-art methods, and the results illustrate that it can extract robust features and identify proteins accurately.
Year: 2022 PMID: 36201724 PMCID: PMC9536844 DOI: 10.1371/journal.pone.0275195
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752
Abbreviations used in the paper.
| Abbreviation | Definition | Abbreviation | Definition |
|---|---|---|---|
| DL | Deep Learning | NCBI | National Center for Biotechnology Information |
| RNN | Recurrent Neural Network | ML | Machine Learning |
| DT | Decision Tree | MCC | Matthews Correlation Coefficient |
| NPV | Negative Predictive Value | WHO | World Health Organization |
| ATP | Adenosine Triphosphate | DNA | Deoxyribonucleic Acid |
| AI | Artificial Intelligence | PDB | Protein Data Bank |
| NB | Naive Bayes | FDR | False Discovery Rate |
| SVM | Support Vector Machine | PCA | Principal Component Analysis |
| FPR | False Positive Rate | BGRU | Bidirectional Gated Recurrent Unit |
| NLP | Natural Language Processing | CNN | Convolutional Neural Network |
| IoT | Internet of Things | PSSM | Position Specific Scoring Matrix |
| LR | Logistic Regression | SAAC | Split Amino Acid Composition |
| FNR | False Negative Rate | GRU | Gated Recurrent Unit |
| KNN | K-Nearest Neighbors | PAAC | Pseudo Amino Acid Composition |
Fig 1. Overview of the proposed hybrid architecture for Plasmodium mitochondrial protein classification.
Fig 5. (a) Statistics of the protein datasets; (b) comparative analysis with the state-of-the-art model.
Fig 2. (a) Unit structure of the GRU; (b) workflow of the BGRU.
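
Since the Fig 2 image is not reproduced here, the following is a minimal pure-Python sketch of a GRU unit and of how a bidirectional GRU combines a forward and a backward pass. It uses one common gating convention, h_t = (1 - z_t) * h_{t-1} + z_t * h~_t; the paper's exact formulation, layer sizes, and weights are not given in this record, so every name below (`gru_step`, `run_bgru`, the parameter keys `Wz`, `Uz`, `bz`, and so on) is an illustrative assumption rather than the authors' implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gate(W, U, b, x, h, act):
    """Compute act(W x + U h + b) element-wise."""
    return [act(a + c + d) for a, c, d in zip(matvec(W, x), matvec(U, h), b)]


def gru_step(x, h_prev, p):
    """One GRU step; p is a dict of hypothetical weights W*, U* and biases b*."""
    z = gate(p["Wz"], p["Uz"], p["bz"], x, h_prev, sigmoid)  # update gate
    r = gate(p["Wr"], p["Ur"], p["br"], x, h_prev, sigmoid)  # reset gate
    reset_h = [ri * hi for ri, hi in zip(r, h_prev)]
    h_cand = gate(p["Wh"], p["Uh"], p["bh"], x, reset_h, math.tanh)
    # Blend the previous state and the candidate state (assumed convention).
    return [(1 - zi) * hp + zi * hc for zi, hp, hc in zip(z, h_prev, h_cand)]


def run_gru(xs, p, hidden):
    """Run a GRU over a sequence, returning the hidden state at every step."""
    h = [0.0] * hidden
    states = []
    for x in xs:
        h = gru_step(x, h, p)
        states.append(h)
    return states


def run_bgru(xs, p_fwd, p_bwd, hidden):
    """Bidirectional GRU: concatenate forward and backward states per step."""
    fwd = run_gru(xs, p_fwd, hidden)
    bwd = run_gru(xs[::-1], p_bwd, hidden)[::-1]  # re-align to input order
    return [f + b for f, b in zip(fwd, bwd)]
```

Each output step has twice the hidden size, because the state read left-to-right is concatenated with the state read right-to-left at the same position.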
Performance of different models on the PF4204 dataset using the hold-out validation method.
| Metrics/Models | LR | NB | KNN | DT | SVM | GRU | BGRU | CNN-GRU | Proposed |
|---|---|---|---|---|---|---|---|---|---|
| Sensitivity | 0.7816 | 0.7936 | 0.7986 | 0.8050 | 0.8165 | 0.8506 | 0.8736 | 0.8956 | 0.9070 |
| Specificity | 0.7882 | 0.8025 | 0.8094 | 0.8148 | 0.8272 | 0.8621 | 0.8867 | 0.9024 | 0.9124 |
| Precision | 0.7981 | 0.8122 | 0.8192 | 0.8239 | 0.8357 | 0.8685 | 0.8920 | 0.9061 | 0.9155 |
| NPV | 0.7711 | 0.7831 | 0.7880 | 0.7952 | 0.8072 | 0.8434 | 0.8675 | 0.8916 | 0.9036 |
| FPR | 0.2118 | 0.1975 | 0.1906 | 0.1852 | 0.1728 | 0.1379 | 0.1133 | 0.0976 | 0.0876 |
| FDR | 0.2019 | 0.1878 | 0.1808 | 0.1761 | 0.1643 | 0.1315 | 0.1080 | 0.0939 | 0.0845 |
| FNR | 0.2184 | 0.2064 | 0.2014 | 0.1950 | 0.1835 | 0.1494 | 0.1264 | 0.1044 | 0.0930 |
| F1 Score | 0.7898 | 0.8028 | 0.8088 | 0.8144 | 0.8260 | 0.8595 | 0.8827 | 0.9008 | 0.9112 |
| MCC | 0.5695 | 0.5957 | 0.6076 | 0.6195 | 0.6433 | 0.7123 | 0.7599 | 0.7979 | 0.8192 |
| Accuracy | 0.7848 | 0.7979 | 0.8038 | 0.8098 | 0.8216 | 0.8561 | 0.8799 | 0.8989 | 0.9096 |
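
The ten metrics in the table above all derive from the four counts of a binary confusion matrix. As a minimal sketch, the standard definitions can be written as follows; the function name `confusion_metrics` and the example counts are hypothetical and are not taken from the paper's confusion matrices.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Compute the tabulated metrics from binary confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # positive predictive value
    npv = tn / (tn + fn)           # negative predictive value
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return {
        "Sensitivity": sensitivity,
        "Specificity": specificity,
        "Precision": precision,
        "NPV": npv,
        "FPR": 1 - specificity,    # false positive rate
        "FDR": 1 - precision,      # false discovery rate
        "FNR": 1 - sensitivity,    # false negative rate
        "F1 Score": f1,
        "MCC": mcc,
        "Accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical counts, chosen only to exercise the function.
metrics = confusion_metrics(tp=90, fp=10, tn=85, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

These definitions imply FPR = 1 - Specificity, FDR = 1 - Precision, and FNR = 1 - Sensitivity, which gives a quick consistency check on each column of the table.
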
Performance of different models on the PF4204 dataset using the 10-fold cross-validation method.
| Metrics/Models | LR | NB | KNN | DT | SVM | GRU | BGRU | CNN-GRU | Proposed |
|---|---|---|---|---|---|---|---|---|---|
| Sensitivity | 0.7469 | 0.7582 | 0.7642 | 0.7755 | 0.7967 | 0.8347 | 0.8506 | 0.8607 | 0.8921 |
| Specificity | 0.7222 | 0.7458 | 0.7600 | 0.7727 | 0.8057 | 0.8436 | 0.8611 | 0.8870 | 0.9167 |
| Precision | 0.7826 | 0.8043 | 0.8174 | 0.8261 | 0.8522 | 0.8783 | 0.8913 | 0.9130 | 0.9348 |
| NPV | 0.6806 | 0.6911 | 0.6963 | 0.7120 | 0.7382 | 0.7906 | 0.8115 | 0.8220 | 0.8639 |
| FPR | 0.2778 | 0.2542 | 0.2400 | 0.2273 | 0.1943 | 0.1564 | 0.1389 | 0.1130 | 0.0833 |
| FDR | 0.2174 | 0.1957 | 0.1826 | 0.1739 | 0.1478 | 0.1217 | 0.1087 | 0.0870 | 0.0652 |
| FNR | 0.2531 | 0.2418 | 0.2358 | 0.2245 | 0.2033 | 0.1653 | 0.1494 | 0.1393 | 0.1079 |
| F1 Score | 0.7643 | 0.7806 | 0.7899 | 0.8000 | 0.8235 | 0.8559 | 0.8705 | 0.8861 | 0.9130 |
| MCC | 0.4662 | 0.4997 | 0.5190 | 0.5432 | 0.5964 | 0.6735 | 0.7073 | 0.7413 | 0.8037 |
| Accuracy | 0.7363 | 0.7530 | 0.7625 | 0.7743 | 0.8005 | 0.8385 | 0.8551 | 0.8717 | 0.9026 |
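
For readers unfamiliar with 10-fold cross-validation, the sketch below shows one way to split sample indices into ten disjoint test folds, training on the other nine each time. Only the dataset size of 4204 comes from the abstract; whether the authors shuffled, stratified by class, or fixed a random seed is not stated in this record, so the shuffling, the seed, and the helper name `k_fold_indices` are assumptions.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumed: shuffle once, fixed seed
    # Deal indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# PF4204 contains 4204 proteins according to the abstract.
for fold_no, (train, test) in enumerate(k_fold_indices(4204, k=10), start=1):
    print(f"fold {fold_no}: train={len(train)} test={len(test)}")
```

With 4204 samples, this split yields four test folds of 421 samples and six of 420. Each metric would then typically be averaged over the ten folds, although the averaging procedure used in the paper is not described here.
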
Fig 3. Comparative confusion matrices of different models using the PF4204 dataset and the hold-out validation method.
Performance of different models on the PF2095 dataset using the hold-out validation method.
| Metrics/Models | LR | NB | KNN | DT | SVM | GRU | BGRU | CNN-GRU | Proposed |
|---|---|---|---|---|---|---|---|---|---|
| Sensitivity | 0.7736 | 0.8130 | 0.8276 | 0.8488 | 0.8654 | 0.9360 | 0.9442 | 0.9680 | 0.9879 |
| Specificity | 0.7143 | 0.7707 | 0.7911 | 0.8137 | 0.8491 | 0.9112 | 0.9286 | 0.9586 | 0.9766 |
| Precision | 0.8233 | 0.8554 | 0.8675 | 0.8795 | 0.9036 | 0.9398 | 0.9518 | 0.9719 | 0.9839 |
| NPV | 0.6471 | 0.7118 | 0.7353 | 0.7706 | 0.7941 | 0.9059 | 0.9176 | 0.9529 | 0.9824 |
| FPR | 0.2857 | 0.2293 | 0.2089 | 0.1863 | 0.1509 | 0.0888 | 0.0714 | 0.0414 | 0.0234 |
| FDR | 0.1767 | 0.1446 | 0.1325 | 0.1205 | 0.0964 | 0.0602 | 0.0482 | 0.0281 | 0.0161 |
| FNR | 0.2264 | 0.1870 | 0.1724 | 0.1512 | 0.1346 | 0.0640 | 0.0558 | 0.0320 | 0.0121 |
| F1 Score | 0.7977 | 0.8337 | 0.8471 | 0.8639 | 0.8841 | 0.9379 | 0.9480 | 0.9699 | 0.9859 |
| MCC | 0.4790 | 0.5754 | 0.6107 | 0.6563 | 0.7060 | 0.8464 | 0.8711 | 0.9257 | 0.9654 |
| Accuracy | 0.7518 | 0.7971 | 0.8138 | 0.8353 | 0.8592 | 0.9260 | 0.9379 | 0.9642 | 0.9833 |
Performance of different models on the PF2095 dataset using the 10-fold cross-validation method.
| Metrics/Models | LR | NB | KNN | DT | SVM | GRU | BGRU | CNN-GRU | Proposed |
|---|---|---|---|---|---|---|---|---|---|
| Sensitivity | 0.8818 | 0.9091 | 0.9189 | 0.9279 | 0.9292 | 0.9744 | 0.9829 | 0.9758 | 0.9922 |
| Specificity | 0.6800 | 0.7100 | 0.7273 | 0.7374 | 0.7526 | 0.8387 | 0.8495 | 0.9070 | 0.9756 |
| Precision | 0.7519 | 0.7752 | 0.7907 | 0.7984 | 0.8140 | 0.8837 | 0.8915 | 0.9380 | 0.9845 |
| NPV | 0.8395 | 0.8765 | 0.8889 | 0.9012 | 0.9012 | 0.9630 | 0.9753 | 0.9630 | 0.9877 |
| FPR | 0.3200 | 0.2900 | 0.2727 | 0.2626 | 0.2474 | 0.1613 | 0.1505 | 0.0930 | 0.0244 |
| FDR | 0.2481 | 0.2248 | 0.2093 | 0.2016 | 0.1860 | 0.1163 | 0.1085 | 0.0620 | 0.0155 |
| FNR | 0.1182 | 0.0909 | 0.0811 | 0.0721 | 0.0708 | 0.0256 | 0.0171 | 0.0242 | 0.0078 |
| F1 Score | 0.8117 | 0.8368 | 0.8500 | 0.8583 | 0.8678 | 0.9268 | 0.9350 | 0.9565 | 0.9883 |
| MCC | 0.5764 | 0.6352 | 0.6627 | 0.6823 | 0.6983 | 0.8297 | 0.8494 | 0.8918 | 0.9700 |
| Accuracy | 0.7857 | 0.8143 | 0.8286 | 0.8381 | 0.8476 | 0.9143 | 0.9238 | 0.9476 | 0.9857 |
Fig 4. Comparative confusion matrices of different models using the PF2095 dataset and the hold-out validation method.