Jian Zhou, Suling Bo, Hao Wang, Lei Zheng, Pengfei Liang, Yongchun Zuo.
Abstract
The 2-oxoglutarate/Fe (II)-dependent (2OG) oxygenase superfamily is mainly responsible for protein modification, nucleic acid repair and/or modification, and fatty acid metabolism, and it plays important roles in cancer, cardiovascular disease, and other diseases. Its members are likely to become new therapeutic targets for cancer and other diseases, so the accurate identification of 2OG oxygenases is of great significance. Many computational methods have been proposed to predict functional proteins and so compensate for time-consuming and expensive experimental identification. However, machine learning has not yet been applied to the study of 2OG oxygenases. In this study, we developed OGFE_RAAC, a prediction model that identifies whether a protein is a 2OG oxygenase. To improve the performance of OGFE_RAAC, protein sequences were recoded with 673 reduced amino acid alphabets to determine the optimal feature representation scheme. A 10-fold cross-validation test showed that the model identifies 2OG oxygenases with an accuracy of 91.04%. In addition, results on an independent dataset indicated that the model generalizes well and is robust. It is expected to become an effective tool for the identification of 2OG oxygenases. Further analysis suggested that the function of 2OG oxygenases may be related to their polarity and hydrophobicity, which will help follow-up studies of their catalytic mechanism and of how they interact with substrates. Based on this model, a user-friendly web server was established and can be accessed at http://bioinfor.imu.edu.cn/ogferaac.
Keywords: 10-fold cross-validation test; 2-oxoglutarate/Fe (II)-dependent oxygenase; anova; incremental feature selection; machine learning; reduced amino acid cluster
Year: 2021 PMID: 34336861 PMCID: PMC8323781 DOI: 10.3389/fcell.2021.707938
Source DB: PubMed Journal: Front Cell Dev Biol ISSN: 2296-634X
FIGURE 1. Schematic diagram of the structure of 2-oxoglutarate/Fe (II)-dependent (2OG) oxygenase.
FIGURE 2. The workflow of the OGFE_RAAC predictor.
Data composition of each dataset.
| Dataset | Group | Training set | Test set |
| 2OG-SwissProt | Positive | 240 | 75 |
| 2OG-SwissProt | Negative | 240 | 75 |
| 2OG-Fe | Positive | 240 | 75 |
| 2OG-Fe | Negative | 231 | 84 |
| 2OG-domain | Positive | 113 | 621 |
| 2OG-domain | Negative | 170 | 415 |
FIGURE 3. Density distributions of prediction accuracy at different K values. (A–C) show the distributions for the 2OG-SwissProt set, 2OG-Fe set, and 2OG-domain set, respectively, across the 673 reduction schemes.
FIGURE 4. Performance evaluation of different reduced amino acid clusters. (A) Heat map of the accuracy distribution across reduced amino acid clusters. (B) The best-performing reduced amino acid cluster (t = 33, s = 15) reaches an accuracy of 83.75%. (C) The incremental feature selection (IFS) curve shows a prediction accuracy of 91.46% when using the 812 optimal features based on the tripeptide combination (t = 33, s = 15).
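The keywords name ANOVA and incremental feature selection, and panel (C) reports the resulting curve. The following is a minimal sketch of that general idea, not the authors' implementation: features are ranked by a one-way ANOVA F statistic, and the top-n subsets are scored by k-fold cross-validation. The classifier is not stated in this record, so a nearest-centroid rule stands in for it here, and every function name is hypothetical.

```python
import random
from statistics import mean


def anova_f(values, labels):
    """One-way ANOVA F statistic of a single feature across the classes."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    if df_between == 0:
        return 0.0
    grand = mean(values)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return (between / df_between) / (within / df_within)


def rank_features(X, y):
    """Feature indices sorted from highest to lowest F statistic."""
    scores = [anova_f([row[j] for row in X], y) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def cv_accuracy(X, y, folds=10, seed=0):
    """k-fold cross-validated accuracy of a nearest-centroid classifier (stand-in)."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    correct = 0
    for f in range(folds):
        test = set(idx[f::folds])
        centroids = {}
        for label in set(y):
            rows = [X[i] for i in idx if i not in test and y[i] == label]
            centroids[label] = [mean(col) for col in zip(*rows)]
        for i in test:
            pred = min(
                centroids,
                key=lambda c: sum((a - b) ** 2 for a, b in zip(X[i], centroids[c])),
            )
            correct += pred == y[i]
    return correct / len(X)


def incremental_feature_selection(X, y, folds=10):
    """Add ranked features one at a time; keep the subset with the best CV accuracy."""
    order = rank_features(X, y)
    best_n, best_acc = 0, -1.0
    for n in range(1, len(order) + 1):
        subset = [[row[j] for j in order[:n]] for row in X]
        acc = cv_accuracy(subset, y, folds)
        if acc > best_acc:
            best_n, best_acc = n, acc
    return order[:best_n], best_acc
```

Ranking on the full dataset before cross-validation, as done here for brevity, can inflate the accuracy estimate; a stricter variant would rank features inside each training fold.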
Cluster size of reduced amino acid alphabet of type 33.
| Size | Reduced amino acid cluster |
| 2 | STANDGRQEKHPIVLMWYF-C |
| 3 | STANDGRQEKHP-IVLMWYF-C |
| 4 | STANDG-RQEKHP-IVLMWYF-C |
| 5 | STAND-G-RQEKHP-IVLMWYF-C |
| 6 | STAND-G-RQEK-HP-IVLMWYF-C |
| 7 | STA-ND-G-RQEK-HP-IVLMWYF-C |
| 8 | STA-ND-G-RQ-EK-HP-IVLMWYF-C |
| 9 | STA-ND-G-RQ-EK-HP-IVLM-WYF-C |
| 10 | ST-A-ND-G-RQ-EK-HP-IVLM-WYF-C |
| 11 | ST-A-ND-G-RQ-EK-H-P-IVLM-WYF-C |
| 12 | ST-A-N-D-G-RQ-EK-H-P-IVLM-WYF-C |
| 13 | ST-A-N-D-G-RQ-EK-H-P-IV-LM-WYF-C |
| 14 | S-T-A-N-D-G-RQ-EK-H-P-IV-LM-WYF-C |
| 15 | S-T-A-N-D-G-RQ-EK-H-P-IV-L-M-WYF-C |
| 16 | S-T-A-N-D-G-RQ-E-K-H-P-IV-L-M-WYF-C |
| 17 | S-T-A-N-D-G-RQ-E-K-H-P-IV-L-M-WY-F-C |
| 18 | S-T-A-N-D-G-R-Q-E-K-H-P-IV-L-M-WY-F-C |
| 19 | S-T-A-N-D-G-R-Q-E-K-H-P-I-V-L-M-WY-F-C |
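To make the recoding concrete, here is a minimal sketch that applies the size-15 row of the table above to a sequence and counts tripeptide frequencies over the reduced alphabet. It assumes plain contiguous 3-mers and uses the first letter of each cluster as its representative; the exact feature definition used by OGFE_RAAC is not given in this record, and the function names and the toy sequence are illustrative only.

```python
from collections import Counter
from itertools import product

# Size-15 row of the type-33 table above; "-" separates clusters.
SCHEME_T33_S15 = "S-T-A-N-D-G-RQ-EK-H-P-IV-L-M-WYF-C"


def build_mapping(scheme):
    """Map each residue to the first letter of its cluster (an arbitrary representative)."""
    mapping = {}
    for cluster in scheme.split("-"):
        for residue in cluster:
            mapping[residue] = cluster[0]
    return mapping


def reduce_sequence(sequence, mapping):
    """Recode a protein sequence; characters outside the 20 standard residues are skipped."""
    return "".join(mapping[r] for r in sequence.upper() if r in mapping)


def kmer_frequencies(reduced, alphabet, k=3):
    """Relative frequency of every contiguous k-mer over the reduced alphabet."""
    windows = max(len(reduced) - k + 1, 0)
    counts = Counter(reduced[i:i + k] for i in range(windows))
    return {
        "".join(kmer): (counts["".join(kmer)] / windows if windows else 0.0)
        for kmer in product(alphabet, repeat=k)
    }


mapping = build_mapping(SCHEME_T33_S15)
alphabet = sorted(set(mapping.values()))
reduced = reduce_sequence("MKQRELIVWYF", mapping)  # toy sequence, not a real protein
features = kmer_frequencies(reduced, alphabet, k=3)
print(reduced, len(alphabet), len(features))  # MERRELIIWWW 15 3375
```

With 15 clusters, the tripeptide feature space has 15^3 = 3375 dimensions, compared with 20^3 = 8000 for the unreduced alphabet.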
FIGURE 5. t-SNE clustering scatter plots of the feature sets and receiver operating characteristic (ROC) curves. (A–C) t-SNE clustering of the unreduced feature set, the reduced feature set, and the feature set after feature screening, respectively; 0 and 1 represent positive samples and negative samples, respectively. (D–F) ROC curves of the three models 2OG-SwissProt, 2OG-Fe, and 2OG-domain, respectively.
The results of each evaluation index of the three models.
| Model | Acc (%) | Sn (%) | Sp (%) | MCC (%) | AUC (%) | |
| 2OG-SwissProt | 91.04 | 93.33 | 88.75 | 82.34 | 91.26 | 97.15 |
| 2OG-Fe | 97.23 | 97.92 | 96.53 | 94.48 | 97.31 | 99.57 |
| 2OG-domain | 97.87 | 98.23 | 97.65 | 95.60 | 97.37 | 99.89 |
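This record does not state how the indices were computed. The sketch below assumes the standard confusion-matrix definitions of accuracy (Acc), sensitivity (Sn), specificity (Sp), and the Matthews correlation coefficient (MCC). The counts are hypothetical: they were chosen so that Sn and Sp match the 2OG-SwissProt row on a balanced set of 240 positives and 240 negatives, and they are not values reported in the source.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, specificity, and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fn + tn + fp)
    sn = tp / (tp + fn)   # true positive rate
    sp = tn / (tn + fp)   # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "Sn": sn, "Sp": sp, "MCC": mcc}


# Hypothetical counts for a balanced 240/240 set (assumption, see above).
m = binary_metrics(tp=224, fn=16, tn=213, fp=27)
print({name: round(100 * value, 2) for name, value in m.items()})
```

The MCC from these pooled counts need not match the tabulated value exactly, since the source does not say how fold-level results were aggregated.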
FIGURE 6. Home page and results page of the OGFE_RAAC web server.