Long Pang1, Junjie Wang2, Lingling Zhao3, Chunyu Wang2, Hui Zhan2.
Abstract
The disordered distribution of proteins among cellular compartments or organelles leads to many human diseases, including neurodegenerative diseases such as Alzheimer's disease. Predicting protein subcellular localization plays an important role in understanding the mechanisms of protein function, pathogenesis, and disease therapy. This paper proposes a novel subcellular localization method that integrates a Convolutional Neural Network (CNN) with eXtreme Gradient Boosting (XGBoost): the CNN acts as a feature extractor that automatically obtains features from the original sequence information, and an XGBoost classifier acts as a recognizer that identifies the protein subcellular localization from the output of the CNN. Experiments are conducted on three protein datasets. The results show that the CNN-XGBoost method performs better than general protein subcellular localization methods.
Keywords: Convolutional Neural Network (CNN); XGBoost; deep learning (DL); machine learning; protein subcellular localization
Year: 2019 PMID: 30713552 PMCID: PMC6345701 DOI: 10.3389/fgene.2018.00751
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. The framework of the CNN-XGBoost based protein subcellular location predictor.
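The framework pairs a CNN feature extractor with an XGBoost classifier. The sketch below illustrates only the feature-extraction half in plain Python, under stated assumptions: the one-hot encoding, the filter width, the number of filters, and the randomly initialised (untrained) filters are all hypothetical choices for illustration and are not taken from the paper, whose actual network architecture is not given in this record.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(sequence):
    """Encode a protein sequence as a list of 20-dimensional one-hot rows."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for residue in sequence:
        row = [0.0] * len(AMINO_ACIDS)
        if residue in index:  # unknown residues (e.g. X) stay all-zero
            row[index[residue]] = 1.0
        rows.append(row)
    return rows


def conv_max_pool(encoded, filters):
    """Slide each 1D filter along the sequence, apply ReLU, then global max pooling.

    Returns one feature per filter, so the output length does not depend
    on the length of the input sequence.
    """
    features = []
    for kernel in filters:
        width = len(kernel)
        best = 0.0  # ReLU floor: negative activations are clipped to zero
        for start in range(len(encoded) - width + 1):
            activation = sum(
                kernel[k][c] * encoded[start + k][c]
                for k in range(width)
                for c in range(len(AMINO_ACIDS))
            )
            best = max(best, activation)
        features.append(best)
    return features


def random_filters(n_filters, width, seed=0):
    """Hypothetical stand-in for learned filters: random weights in [-1, 1]."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-1.0, 1.0) for _ in AMINO_ACIDS] for _ in range(width)]
        for _ in range(n_filters)
    ]


filters = random_filters(n_filters=8, width=3)
features = conv_max_pool(one_hot("MKTAYIAKQRQISFVKSHFSRQ"), filters)
print(len(features))  # one value per filter
```

In a full pipeline the filters would be trained rather than random, and the resulting fixed-length vectors would be passed to a gradient-boosted classifier, for example `xgboost.XGBClassifier().fit(X, y)`, in place of the CNN's own output layer.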
Dataset Summary.
| No. Proteins | 3,126 | 379 | 2,597 | 576 | 5,959 | 158 |
| No. Labels | 4,229 | 541 | 2,597 | 576 | 5,959 | 158 |
| No. Locations | 12 | 4 | 6 | |||
Comparison of CNN-XGBoost with other methods on the Hum-mPloc 3.0 dataset.
| Centrosome | 0 | 0 | 0 | 0.75 | 0.14 | 0.23 | 0.59 | 0.59 | 0.59 | 0.75 | 0.55 | 0.94 | 0.75 | 0.79 | 0.50 | 0.61 | ||
| Cytoplasm | 0.5 | 0.54 | 0.52 | 0.69 | 0.53 | 0.60 | 0.93 | 0.51 | 0.66 | 0.76 | 0.73 | 0.74 | 0.79 | 0.81 | 0.85 | 0.89 | ||
| Cytoskeleton | 0 | 0 | 0 | 0.32 | 0.34 | 0.33 | 0.9 | 0.22 | 0.35 | 0.8 | 0.68 | 0.74 | 0.93 | 0.77 | 0.89 | 0.80 | ||
| ER | 0 | 0 | 0 | 0.73 | 0.2 | 0.31 | 0.74 | 0.49 | 0.59 | 0.83 | 0.37 | 0.51 | 0.9 | 0.7 | 0.97 | 0.71 | ||
| Endosome | 0 | 0 | 0 | 0.25 | 0.07 | 0.11 | 0.38 | 0.2 | 0.26 | 0.58 | 0.47 | 0.57 | 0.37 | 0.80 | 0.27 | 0.40 | ||
| Extracellular | 0.62 | 0.62 | 0.62 | 0.67 | 0.77 | 0.16 | 0.69 | 0.26 | 0.5 | 0.46 | 0.48 | 0.66 | 0.71 | 0.68 | 0.80 | 0.62 | ||
| Golgi apparatus | 0.6 | 0.3 | 0.4 | 0.6 | 0.15 | 0.24 | 0.72 | 0.65 | 0.68 | 0.69 | 0.45 | 0.55 | 0.88 | 0.61 | 0.80 | 0.60 | ||
| Lysosome | 0.5 | 0.13 | 0.2 | 0.2 | 0.13 | 0.15 | 0.55 | 0.75 | 0.63 | 0.71 | 0.63 | 0.67 | 1 | 0.55 | 1.00 | 0.75 | ||
| Mitochondrion | 0.95 | 0.33 | 0.49 | 0.79 | 0.73 | 0.76 | 0.83 | 0.88 | 0.85 | 0.78 | 0.75 | 0.76 | 0.92 | 0.88 | 0.96 | 0.80 | ||
| Nucleus | 0.54 | 0.7 | 0.61 | 0.65 | 0.64 | 0.64 | 0.85 | 0.7 | 0.76 | 0.75 | 0.71 | 0.73 | 0.81 | 0.92 | 0.83 | 0.91 | ||
| Peroxisome | 1 | 0.5 | 0.67 | 0.5 | 1 | 0.67 | 0.29 | 1 | 0.44 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| Plasma membrane | 0.42 | 0.33 | 0.37 | 0.44 | 0.53 | 0.48 | 0.58 | 0.56 | 0.57 | 0.65 | 0.44 | 0.52 | 0.78 | 0.74 | 0.89 | 0.75 | ||
| ACC-mean | 0.41 | 0.50 | 0.65 | 0.63 | ||||||||||||||
| F1-mean | 0.32 | 0.44 | 0.56 | 0.65 | ||||||||||||||
Bold marks the best result and underline marks the second-best result.
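The column headers of the Hum-mPloc 3.0 table are not shown here. In several rows, the third value of each three-column group matches the harmonic mean of the first two, which is consistent with a precision, recall, F1 layout. That reading is an inference from the numbers, not something the source states. A minimal check, using the first three values of three rows:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (first value, second value, reported third value) copied from the table rows.
# Treating these as precision/recall/F1 is an assumption.
rows = {
    "Cytoplasm": (0.5, 0.54, 0.52),
    "Golgi apparatus": (0.6, 0.3, 0.4),
    "Mitochondrion": (0.95, 0.33, 0.49),
}

for location, (first, second, reported) in rows.items():
    computed = round(f1_score(first, second), 2)
    print(location, computed, reported)
```

Because F1 is symmetric in its two arguments, this check cannot tell which of the first two columns is precision and which is recall. Rows whose inputs were already rounded may differ from the reported value in the second decimal place.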
Figure 2. The accuracy comparison on the Hum-mPloc 3.0 dataset.
Comparison of CNN-XGBoost ACC/F1-mean with other methods on other protein datasets.
| MultiLoc2-LowRes | 0.73/0.76 | – |
| MultiLoc2-HighRes | 0.68/0.71 | 0.57/0.41 |
| BaCelLo | 0.64/0.66 | – |
| Hum-mPloc 3.0 | 0.86/0.84 | 0.64/0.59 |
| PSL-Recommender | ||
| CNN-XGBoost |
Bold marks the best result and underline marks the second-best result.