Hao Li, ShiQi Zhang, Lei Chen, Xiaoyong Pan, ZhanDong Li, Tao Huang, Yu-Dong Cai.
Abstract
Determining the biological functions of proteins is an important task in modern biology. Given the large number of proteins in some organisms, characterizing their functions one by one through traditional experiments is impractical. Quick and reliable methods for identifying protein functions are therefore needed. The considerable accumulation of protein knowledge and recent developments in computer science offer an alternative way to complete this task: designing computational methods. Several efforts have been made in this field. Most previous methods adopted protein sequence features or directly used the linkage from a protein-protein interaction (PPI) network. In this study, we propose novel multi-label classifiers that adopt new embedding features to represent proteins. These features were derived from functional domains and a PPI network via word embedding and network embedding, respectively. The minimum redundancy maximum relevance method was used to assess the features, generating a ranked feature list. Incremental feature selection, incorporating RAndom k-labELsets (RAKEL) to construct multi-label classifiers, used this list to build two optimum classifiers, one for each of two key measurements: accuracy and exact match. These two classifiers performed well and were superior to classifiers that used features extracted by traditional methods.
Keywords: embedding features; feature selection; mouse protein; multi-label classification; rakel
Year: 2022 PMID: 35651937 PMCID: PMC9149260 DOI: 10.3389/fgene.2022.909040
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.772
Number of proteins in each functional category.
| Index | Category | Number of proteins (training dataset) | Number of proteins (test dataset) | Number of proteins (overall) |
|---|---|---|---|---|
| 1 | METABOLISM | 1152 | 280 | 1432 |
| 2 | ENERGY | 247 | 64 | 311 |
| 3 | CELL CYCLE AND DNA PROCESSING | 473 | 124 | 597 |
| 4 | TRANSCRIPTION | 906 | 229 | 1135 |
| 5 | PROTEIN SYNTHESIS | 213 | 45 | 258 |
| 6 | PROTEIN FATE (folding, modification, destination) | 983 | 234 | 1217 |
| 7 | PROTEIN WITH BINDING FUNCTION OR COFACTOR REQUIREMENT (structural or catalytic) | 3316 | 868 | 4184 |
| 8 | REGULATION OF METABOLISM AND PROTEIN FUNCTION | 414 | 102 | 516 |
| 9 | CELLULAR TRANSPORT, TRANSPORT FACILITIES AND TRANSPORT ROUTES | 915 | 227 | 1142 |
| 10 | CELLULAR COMMUNICATION/SIGNAL TRANSDUCTION MECHANISM | 1228 | 328 | 1556 |
| 11 | CELL RESCUE, DEFENSE AND VIRULENCE | 318 | 76 | 394 |
| 12 | INTERACTION WITH THE ENVIRONMENT | 501 | 138 | 639 |
| 13 | SYSTEMIC INTERACTION WITH THE ENVIRONMENT | 488 | 149 | 637 |
| 14 | TRANSPOSABLE ELEMENTS, VIRAL AND PLASMID PROTEINS | 3 | 1 | 4 |
| 15 | CELL FATE | 550 | 171 | 721 |
| 16 | DEVELOPMENT (Systemic) | 421 | 127 | 548 |
| 17 | BIOGENESIS OF CELLULAR COMPONENTS | 287 | 68 | 355 |
| 18 | CELL TYPE DIFFERENTIATION | 146 | 39 | 185 |
| 19 | TISSUE DIFFERENTIATION | 144 | 37 | 181 |
| 20 | ORGAN DIFFERENTIATION | 237 | 53 | 290 |
| 21 | SUBCELLULAR LOCALIZATION | 3920 | 947 | 4867 |
| 22 | CELL TYPE LOCALIZATION | 80 | 15 | 95 |
| 23 | TISSUE LOCALIZATION | 82 | 26 | 108 |
| 24 | ORGAN LOCALIZATION | 168 | 44 | 212 |
| | Sum of proteins over all categories | 17,192 | 4392 | 21,584 |
| | Number of distinct proteins | 5560 | 1390 | 6950 |
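Because the category totals exceed the number of distinct proteins, each protein belongs to several categories on average. The short sketch below only divides the two summary rows of the table above; the variable and function names are illustrative and do not come from the article.

```python
# Totals copied from the last two rows of the table above.
category_memberships = {"training": 17192, "test": 4392, "overall": 21584}
distinct_proteins = {"training": 5560, "test": 1390, "overall": 6950}


def mean_categories_per_protein(memberships: int, proteins: int) -> float:
    """Average number of functional categories assigned to one protein."""
    return memberships / proteins


label_cardinality = {
    split: mean_categories_per_protein(category_memberships[split], distinct_proteins[split])
    for split in category_memberships
}

for split, value in label_cardinality.items():
    print(f"{split}: {value:.3f}")
# training: 3.092
# test: 3.160
# overall: 3.106
```

Roughly three categories per protein in every split is what makes this a multi-label rather than a single-label problem.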
FIGURE 1. Distribution of training, test, and overall samples by the number of categories to which they belong. Several samples belong to two or more categories.
FIGURE 2. Overall procedure for constructing the multi-label classifiers that predict functions of mouse proteins. Mouse proteins and their function annotations are retrieved from MfunGD. These proteins are randomly divided into one training dataset and one test dataset. Embedding features are derived from protein functional domains and the protein–protein interaction network through Word2vec and Node2vec, respectively. A feature selection procedure is used to analyze the embedding features, and essential features are fed into RAKEL to construct the multi-label classifiers. Proteins in the test dataset are fed into these classifiers to further evaluate their performance.
FIGURE 3. IFS curves on embedding features using different classification methods. (A) Accuracy is plotted on the Y-axis. (B) Exact match is plotted on the Y-axis. RAKEL_RF/RAKEL_SVM indicates that RAKEL with RF/SVM as the base classifier is used to construct the multi-label classifiers.
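An IFS curve of this kind can be produced by scoring the top-k features of a ranked list for increasing k and keeping the best-scoring subset. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `evaluate` callback is hypothetical and stands in for training a multi-label classifier (for example RAKEL) and returning accuracy or exact match, and the toy scoring function and feature names are invented for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def incremental_feature_selection(
    ranked_features: Sequence[str],
    evaluate: Callable[[Sequence[str]], float],
    step: int = 1,
) -> Tuple[List[str], float, List[Tuple[int, float]]]:
    """Score the top-k features for growing k and keep the best subset.

    ranked_features: features ordered from most to least important
                     (for example, by mRMR).
    evaluate:        hypothetical callback that trains a classifier on the
                     given features and returns one score; higher is better.
    step:            number of features added between evaluations.
    """
    if step < 1:
        raise ValueError("step must be at least 1")
    curve: List[Tuple[int, float]] = []
    best_k, best_score = 0, float("-inf")
    for k in range(step, len(ranked_features) + 1, step):
        score = evaluate(ranked_features[:k])
        curve.append((k, score))
        if score > best_score:  # on ties, the smaller subset is kept
            best_k, best_score = k, score
    return list(ranked_features[:best_k]), best_score, curve


# Toy usage: the score peaks when exactly the three informative features are used.
informative = {"f1", "f2", "f3"}


def toy_score(features: Sequence[str]) -> float:
    hits = sum(f in informative for f in features)
    noise = len(features) - hits
    return hits / len(informative) - 0.1 * noise


ranking = ["f1", "f2", "f3", "f4", "f5"]
best_features, best_score, ifs_curve = incremental_feature_selection(ranking, toy_score)
print(best_features)  # ['f1', 'f2', 'f3']
```

The returned `ifs_curve` holds the (number of features, score) pairs that would be plotted. Running the procedure once per measurement yields one optimum feature subset for accuracy and another for exact match.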
Accuracy of the important multi-label classifiers with different features on training and test datasets.
| Method | Feature | Number of features | Accuracy (training dataset) | Accuracy (test dataset) |
|---|---|---|---|---|
| RAKEL_RF | Embedding features | 702 | 0.542 | 0.536 |
| RAKEL_SVM | Embedding features | 746 | 0.542 | 0.537 |
| RAKEL_RF | Embedding features | 48 | 0.530 | 0.530 |
| RAKEL_RF | Domain features | 26 | 0.429 | 0.426 |
| RAKEL_SVM | Domain features | 27 | 0.429 | 0.428 |
| RAKEL_RF | Linkage features | 233 | 0.462 | 0.460 |
| RAKEL_SVM | Linkage features | 234 | 0.432 | 0.424 |
| RAKEL_RF | Domain and linkage features | 221 | 0.470 | 0.462 |
| RAKEL_SVM | Domain and linkage features | 227 | 0.449 | 0.433 |
Exact match of the important multi-label classifiers with different features on training and test datasets.
| Method | Feature | Number of features | Exact match (training dataset) | Exact match (test dataset) |
|---|---|---|---|---|
| RAKEL_RF | Embedding features | 690 | 0.186 | 0.171 |
| RAKEL_SVM | Embedding features | 445 | 0.179 | 0.157 |
| RAKEL_RF | Embedding features | 53 | 0.170 | 0.159 |
| RAKEL_RF | Domain features | 25 | 0.077 | 0.078 |
| RAKEL_SVM | Domain features | 29 | 0.075 | 0.077 |
| RAKEL_RF | Linkage features | 158 | 0.130 | 0.123 |
| RAKEL_SVM | Linkage features | 225 | 0.113 | 0.104 |
| RAKEL_RF | Domain and linkage features | 201 | 0.135 | 0.130 |
| RAKEL_SVM | Domain and linkage features | 215 | 0.132 | 0.111 |
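The two measurements reported in the tables above are not defined in this excerpt. The sketch below assumes the definitions commonly used for multi-label classification: accuracy as the mean ratio of the intersection to the union of the true and predicted label sets, and exact match as the fraction of samples whose predicted set equals the true set. The function names and the toy label sets are illustrative only.

```python
from typing import Sequence, Set


def _check(true_sets: Sequence[Set[int]], predicted_sets: Sequence[Set[int]]) -> None:
    if len(true_sets) != len(predicted_sets) or not true_sets:
        raise ValueError("need two non-empty sequences of equal length")


def multilabel_accuracy(true_sets: Sequence[Set[int]], predicted_sets: Sequence[Set[int]]) -> float:
    """Mean of |Y & Y'| / |Y | Y'| over all samples (assumed definition)."""
    _check(true_sets, predicted_sets)
    total = 0.0
    for truth, pred in zip(true_sets, predicted_sets):
        union = truth | pred
        # Assumption: two empty label sets count as a perfect prediction.
        total += len(truth & pred) / len(union) if union else 1.0
    return total / len(true_sets)


def exact_match(true_sets: Sequence[Set[int]], predicted_sets: Sequence[Set[int]]) -> float:
    """Fraction of samples whose predicted label set equals the true set."""
    _check(true_sets, predicted_sets)
    hits = sum(truth == pred for truth, pred in zip(true_sets, predicted_sets))
    return hits / len(true_sets)


# Toy example using category indices from the first table.
y_true = [{1, 7}, {21}, {4, 10, 21}]
y_pred = [{1, 7}, {21, 7}, {4, 21}]
accuracy = multilabel_accuracy(y_true, y_pred)   # (1 + 1/2 + 2/3) / 3
exact = exact_match(y_true, y_pred)              # 1 of 3 samples matches exactly
print(f"accuracy={accuracy:.3f}, exact match={exact:.3f}")
```

Under these definitions exact match is the stricter measurement, since a prediction that is only partly correct still contributes to accuracy but counts as a miss for exact match. This is consistent with the exact match values in the tables being much lower than the accuracy values.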
FIGURE 4. Distribution of embedding features used in two efficient classifiers. (A) Distribution of embedding features used in the classifier selected by accuracy. (B) Distribution of embedding features used in the classifier selected by exact match.
FIGURE 5. IFS curves on domain features using different classification methods. (A) Accuracy is plotted on the Y-axis. (B) Exact match is plotted on the Y-axis. RAKEL_RF/RAKEL_SVM indicates that RAKEL with RF/SVM as the base classifier is used to construct the multi-label classifiers.
FIGURE 7. IFS curves on domain and linkage features using different classification methods. (A) Accuracy is plotted on the Y-axis. (B) Exact match is plotted on the Y-axis. RAKEL_RF/RAKEL_SVM indicates that RAKEL with RF/SVM as the base classifier is used to construct the multi-label classifiers.