Yu-Hang Zhang, ShiJian Ding, Lei Chen, Tao Huang, Yu-Dong Cai.
Abstract
Subcellular localization assigns each protein to one of the cell compartments that perform specific biological functions. Finding the link between proteins, biological functions, and subcellular localization is an effective way to investigate the general organization of living cells in a systematic manner. However, determining the subcellular localization of proteins by traditional experimental approaches is difficult. Here, protein-protein interaction networks, functional enrichment on gene ontology and pathway, and a set of proteins with confirmed subcellular localization were used to build prediction models for human protein subcellular localization. To build an effective predictive model, we employed a variety of robust machine learning algorithms, including Boruta feature selection, minimum redundancy maximum relevance, Monte Carlo feature selection, and LightGBM. Then, the incremental feature selection method with random forest and support vector machine was used to discover the essential features. Furthermore, 38 key features were determined by integrating the results of the different feature selection methods, which may provide critical insights into the subcellular location of proteins. The biological relevance of these features to subcellular localization was discussed according to recent publications. In summary, our computational framework can help advance the understanding of subcellular localization prediction techniques and provide a new perspective for investigating the patterns of protein subcellular localization and their biological importance.
Year: 2022 PMID: 36132086 PMCID: PMC9484878 DOI: 10.1155/2022/3288527
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.246
Number of proteins in each category.
| Index | Category | Number of proteins |
|---|---|---|
| Class 1 | Biological membrane | 1487 |
| Class 2 | Cell periphery | 35 |
| Class 3 | Cytoplasm | 506 |
| Class 4 | Cytoplasmic vesicle | 70 |
| Class 5 | Endoplasmic reticulum | 190 |
| Class 6 | Endosome | 25 |
| Class 7 | Extracellular space or cell surface | 649 |
| Class 8 | Flagellum or cilium | 3 |
| Class 9 | Golgi apparatus | 98 |
| Class 10 | Microtubule cytoskeleton | 48 |
| Class 11 | Mitochondrion | 345 |
| Class 12 | Nuclear periphery | 33 |
| Class 13 | Nucleolus | 112 |
| Class 14 | Nucleus | 1285 |
| Class 15 | Peroxisome | 46 |
| Class 16 | Vacuole | 54 |
| Total | | 4986 |
Figure 1. Entire procedure for constructing and evaluating protein subcellular location prediction models. Human proteins and their subcellular location information are retrieved from Swiss-Prot. Each protein is represented by three feature groups: network features, functional KEGG features, and functional GO features. All features are analyzed by Boruta, followed by the mRMR, MCFS, and LightGBM methods, which yield three ranked feature lists. Each list is fed into the IFS method, combined with two classification algorithms, to build efficient models and extract essential features. Thirty-eight essential features are selected on the basis of the feature integration rules.
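The IFS step in this procedure can be illustrated with a minimal sketch. It assumes the usual form of incremental feature selection: score growing prefixes of a ranked feature list and keep the best-scoring prefix. The function name, the `evaluate` callback, the `step` parameter, and the toy scoring function are all hypothetical; in the study, the score would come from training RF or SVM on each subset, which is not reproduced here.

```python
from typing import Callable, List, Sequence, Tuple


def incremental_feature_selection(
    ranked_features: Sequence[str],
    evaluate: Callable[[List[str]], float],
    step: int = 1,
) -> Tuple[List[str], float, List[Tuple[int, float]]]:
    """Score the top-k prefixes of a ranked feature list and keep the best one.

    `evaluate` is a hypothetical callback that trains a classifier on the
    given feature subset and returns a quality score (for example, MCC).
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    best_subset: List[str] = []
    best_score = float("-inf")
    curve: List[Tuple[int, float]] = []  # (number of features, score)
    for k in range(step, len(ranked_features) + 1, step):
        subset = list(ranked_features[:k])
        score = evaluate(subset)
        curve.append((k, score))
        # Strict comparison: on ties, the smaller subset is kept.
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score, curve


def toy_score(subset: List[str]) -> float:
    """Toy stand-in for a classifier: rewards informative features,
    penalizes noise features. Not the scoring used in the paper."""
    informative = {"f1", "f2", "f3"}
    hits = len(informative.intersection(subset))
    noise = len(subset) - hits
    return hits / 3 - 0.05 * noise


ranked = ["f1", "f2", "n1", "f3", "n2", "n3"]
best, score, curve = incremental_feature_selection(ranked, toy_score)
print(len(best), round(score, 3))
```

The `curve` list corresponds to the kind of data plotted in the IFS figures (score against number of top-ranked features), which is how a smaller subset with near-optimal performance can be read off.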
Figure 2. Results of the IFS method with RF and SVM on the LightGBM feature list. The highest MCC values for RF and SVM are 0.838 and 0.851, respectively. Both classifiers still perform well with far fewer features (76 for RF and 1027 for SVM).
Figure 3. Results of the IFS method with RF and SVM on the MCFS feature list. The highest MCC values for RF and SVM are 0.836 and 0.852, respectively. Both classifiers still perform well with far fewer features (484 for RF and 1448 for SVM).
Figure 4. Results of the IFS method with RF and SVM on the mRMR feature list. The highest MCC values for RF and SVM are 0.835 and 0.852, respectively. Both classifiers still perform well with far fewer features (46 for RF and 1431 for SVM).
Figure 5. Performance of different classifiers on each category. (a) Performance of the optimal classifiers constructed from the three feature lists on the 16 categories. (b) Performance of the classifiers using far fewer features from the three feature lists on the 16 categories.
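The MCC values quoted in the IFS figure captions summarize a 16-class prediction in a single number. The sketch below assumes the standard multiclass generalization of the Matthews correlation coefficient (Gorodkin's R_K statistic); this record does not state which formula the authors used, and the function name and toy labels are illustrative only.

```python
from collections import Counter
from math import sqrt
from typing import Hashable, Sequence


def multiclass_mcc(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Multiclass Matthews correlation coefficient (R_K form).

    Returns 0.0 when the denominator is zero, e.g. when every sample
    is predicted as the same class.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    s = len(y_true)                                     # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))     # correct predictions
    t_counts = Counter(y_true)                          # true count per class
    p_counts = Counter(y_pred)                          # predicted count per class
    labels = set(t_counts) | set(p_counts)
    numerator = c * s - sum(t_counts[k] * p_counts[k] for k in labels)
    cov_true = s * s - sum(v * v for v in t_counts.values())
    cov_pred = s * s - sum(v * v for v in p_counts.values())
    denominator = sqrt(cov_true * cov_pred)
    return numerator / denominator if denominator else 0.0


# Toy labels, not data from the study.
truth = ["nucleus", "nucleus", "cytoplasm", "membrane"]
perfect = multiclass_mcc(truth, truth)
constant = multiclass_mcc(truth, ["nucleus"] * 4)
binary = multiclass_mcc([1, 1, 0, 0], [1, 0, 0, 0])
print(perfect, constant, round(binary, 4))
```

Because the statistic uses per-class true and predicted counts, it is less flattered by large classes than plain accuracy, which matters for a dataset where class sizes range from 3 to 1487 proteins.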
Thirty-eight key features obtained by feature integration rules.
| Rank | Feature name | Description |
|---|---|---|
| 1 | ENSP00000407401 | PEX5 gene: peroxisomal biogenesis factor 5 |
| 2 | hsa04142 | Lysosome |
| 3 | GO:0005654 | Nucleoplasm |
| 4 | GO:0031090 | Organelle membrane |
| 5 | GO:0016021 | Integral component of membrane |
| 6 | ENSP00000405965 | SUMO2 gene: small ubiquitin-like modifier 2 |
| 7 | ENSP00000357748 | BCCIP gene: BRCA2 and CDKN1A interacting protein |
| 8 | GO:0005615 | Extracellular space |
| 9 | hsa00520 | Amino sugar and nucleotide sugar metabolism |
| 10 | ENSP00000263239 | DDX18 gene: DEAD-box helicase 18 |
| 11 | GO:0005789 | Endoplasmic reticulum membrane |
| 12 | ENSP00000317578 | GRK3 gene: G protein-coupled receptor kinase 3 |
| 13 | ENSP00000317159 | CYC1 gene: cytochrome C1 |
| 14 | hsa05110 | Vibrio cholerae infection |
| 15 | hsa00010 | Glycolysis/gluconeogenesis |
| 16 | GO:0031224 | Intrinsic component of membrane |
| 17 | GO:0070013 | Intracellular organelle lumen |
| 18 | ENSP00000346725 | PES1 gene: pescadillo ribosomal biogenesis factor 1 |
| 19 | GO:0031975 | Envelope |
| 20 | ENSP00000390722 | SLC25A17 gene: solute carrier family 25 member 17 |
| 21 | ENSP00000328854 | NOC4L gene: nucleolar complex associated 4 homolog |
| 22 | GO:0001578 | Microtubule bundle formation |
| 23 | GO:0005887 | Integral component of plasma membrane |
| 24 | ENSP00000264279 | NOP58 gene: NOP58 ribonucleoprotein |
| 25 | GO:0016491 | Oxidoreductase activity |
| 26 | ENSP00000380982 | PUM3 gene: Pumilio RNA binding family member 3 |
| 27 | GO:0005634 | Nucleus |
| 28 | ENSP00000371101 | NOL10 gene: nucleolar protein 10 |
| 29 | GO:0005886 | Plasma membrane |
| 30 | GO:0044450 | Obsolete microtubule organizing center part |
| 31 | ENSP00000244230 | MPHOSPH10 gene: M-phase phosphoprotein 10 |
| 32 | GO:0042147 | Retrograde transport, endosome to Golgi |
| 33 | ENSP00000408017 | / |
| 34 | GO:0009060 | Aerobic respiration |
| 35 | GO:0044424 | Obsolete intracellular part |
| 36 | ENSP00000402733 | / |
| 37 | GO:0005815 | Microtubule organizing center |
| 38 | GO:0044451 | Obsolete nucleoplasm part |