Bo Li, Lijun Cai, Bo Liao, Xiangzheng Fu, Pingping Bing, Jialiang Yang.
Abstract
The prediction of protein subcellular localization is critical for inferring protein functions, gene regulation and protein-protein interactions. With advances in high-throughput sequencing technologies and proteomic methods, the protein sequences of numerous yeasts have become publicly available, which enables us to computationally predict yeast protein subcellular localization. However, widely used protein sequence representation techniques, such as amino acid composition and Chou's pseudo amino acid composition (PseAAC), have difficulty extracting adequate information about the interactions between residues and the position distribution of each residue. Therefore, novel sequence representations are still urgently needed. In this study, we present two novel protein sequence representation techniques: Generalized Chaos Game Representation (GCGR), based on the frequency and distribution of the residues in the protein primary sequence, and novel statistics and information theory (NSI), reflecting local position information of the sequence. In the GCGR + NSI representation, a protein primary sequence is represented by a 5-dimensional feature vector, while other popular methods such as PseAAC and dipeptide composition adopt features with hundreds of dimensions. In practice, the feature representation is highly efficient in predicting protein subcellular localization. Even without machine learning-based classifiers, a simple model based on the feature vector achieves prediction accuracies of 0.8825 and 0.7736 for the CL317 and ZW225 datasets, respectively. To further evaluate the effectiveness of the proposed encoding schemes, we introduce a multi-view features-based method that combines the two above-mentioned features with other well-known features, including PseAAC and dipeptide composition, and uses a support vector machine as the classifier to predict protein subcellular localization.
This novel model achieves prediction accuracies of 0.927 and 0.871 for the CL317 and ZW225 datasets, respectively, better than other existing methods in the jackknife tests. The results suggest that the GCGR and NSI features are useful complements to popular protein sequence representations in predicting yeast protein subcellular localization. Finally, we validate a few newly predicted protein subcellular localizations with evidence from published articles in authoritative journals and books.
Keywords: generalized chaos game representation; protein primary sequence; protein subcellular localization; statistical method; support vector machine; unitary distance
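To make the dimensionality contrast in the abstract concrete, the following is a minimal sketch of the two composition-based baselines it names: amino acid composition (20 dimensions) and dipeptide composition (400 dimensions). The function names are hypothetical, and silently dropping non-standard residues is an assumption of this sketch, not something the paper specifies.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def _clean(seq):
    # Assumption: characters outside the 20 standard residues are dropped,
    # so residues on either side of a dropped character become adjacent.
    return [r for r in seq.upper() if r in AMINO_ACIDS]


def amino_acid_composition(seq):
    """Return the 20-dimensional vector of residue frequencies."""
    residues = _clean(seq)
    n = len(residues)
    return [residues.count(a) / n if n else 0.0 for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return the 400-dimensional vector of adjacent-pair frequencies."""
    residues = _clean(seq)
    keys = [a + b for a, b in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(keys, 0)
    for first, second in zip(residues, residues[1:]):
        counts[first + second] += 1
    total = len(residues) - 1
    return [counts[k] / total if total > 0 else 0.0 for k in keys]
```

Both vectors record only how often residues or adjacent pairs occur, not where they occur along the sequence, which is the limitation the abstract attributes to composition-based representations.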
Year: 2019 PMID: 30845684 PMCID: PMC6429470 DOI: 10.3390/molecules24050919
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
The prediction results of dataset CL317 using unitary distance based on GCGR + NSI features in the jackknife test.
| | Cy | Me | Nu | En | Mi | Se |
|---|---|---|---|---|---|---|
| Sn (%) | 91.8 | 85.3 | 86.5 | 93.7 | 83.3 | 74.5 |
| Sp (%) | 86.2 | 99.6 | 91.9 | 86.2 | 91.5 | 90.9 |
| MCC | 0.83 | 0.83 | 0.86 | 0.88 | 0.86 | 0.81 |
| Acc | 0.8825 | | | | | |
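As a reading aid for the Sn, Sp, MCC and Acc rows, here is a minimal sketch that computes the per-class measures from true and predicted labels using the common one-vs-rest definitions. This record does not reproduce the paper's own formulas, so these definitions are an assumption and may differ from the paper's (for instance, some studies report Sp as TP / (TP + FP)). The function names are hypothetical.

```python
import math


def one_vs_rest_metrics(y_true, y_pred, label):
    """Return (Sn, Sp, MCC) for one class, treating all other classes as negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    tn = sum(t != label and p != label for t, p in pairs)

    sn = tp / (tp + fn) if tp + fn else 0.0  # sensitivity (recall)
    sp = tn / (tn + fp) if tn + fp else 0.0  # specificity, standard definition (assumed)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, mcc


def overall_accuracy(y_true, y_pred):
    """Fraction of proteins whose predicted location matches the true one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Sn, Sp and MCC are computed per location, whereas Acc is a single value over the whole dataset, which is why the Acc row has only one entry.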
The prediction results of dataset ZW225 using unitary distance based on GCGR + NSI features in the jackknife test.
| | Me | Cy | Nu | Mi |
|---|---|---|---|---|
| Sn | 0.6617 | 0.8286 | 0.88 | 0.8439 |
| Sp | 0.7792 | 0.7432 | 0.9383 | 0.8841 |
| MCC | 0.6863 | 0.6789 | 0.9115 | 0.7745 |
| Acc | 0.7736 | | | |
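The jackknife test used for both tables holds out each protein in turn, builds the model from the remaining proteins, and predicts the held-out one. The sketch below illustrates that loop with a nearest-centroid rule and Euclidean distance. That rule is only a stand-in: the paper's unitary distance is not defined in this record, and the function names are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def nearest_centroid_predict(train_X, train_y, x):
    """Predict the class whose centroid is closest to x.

    Euclidean distance is an assumption standing in for the paper's
    unitary distance.
    """
    by_class = {}
    for vector, label in zip(train_X, train_y):
        by_class.setdefault(label, []).append(vector)
    centroids = {label: centroid(vs) for label, vs in by_class.items()}
    return min(centroids, key=lambda label: math.dist(centroids[label], x))


def jackknife_accuracy(X, y):
    """Leave-one-out accuracy: each sample is predicted from all the others."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nearest_centroid_predict(train_X, train_y, X[i]) == y[i]:
            correct += 1
    return correct / len(X)
```

Because every protein is held out exactly once, the jackknife result is deterministic for a given dataset and classifier, unlike a randomly split cross-validation.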
Figure 1. The prediction results based on CL317 using the support vector machine algorithm with different combinations of features.
Comparison of prediction performance for CL317 in the jackknife test.
| Predictor | MCC (Cy) | MCC (Me) | MCC (Nu) | MCC (En) | MCC (Mi) | MCC (Se) | Acc |
|---|---|---|---|---|---|---|---|
| [ | 0.80 | 0.77 | 0.73 | 0.90 | 0.74 | 0.68 | 0.827 |
| [ | 0.87 | 0.90 | 0.86 | 0.95 | 0.86 | 0.80 | 0.909 |
| [ | 0.84 | 0.85 | 0.84 | 0.91 | 0.77 | 0.80 | 0.88 |
| [ | 0.89 | 0.88 | 0.87 | 0.95 | 0.88 | 0.78 | 0.911 |
| [ | 0.946 | 0.909 | 0.885 | 0.957 | 0.882 | 0.706 | 0.912 |
| This paper | 0.896 | 0.913 | 0.929 | 0.892 | 0.853 | 0.905 | 0.921 |
Comparison of prediction performance for ZW225 in the jackknife test.
| Predictor | MCC (Me) | MCC (Cy) | MCC (Nu) | MCC (Mi) | Acc |
|---|---|---|---|---|---|
| [ | 0.933 | 0.90 | 0.634 | 0.60 | 0.831 |
| [ | 0.91 | 0.929 | 0.732 | 0.68 | 0.858 |
| [ | 0.921 | 0.871 | 0.732 | 0.64 | 0.84 |
| [ | 0.91 | 0.871 | 0.756 | 0.72 | 0.849 |
| This paper | 0.909 | 0.892 | 0.867 | 0.778 | 0.889 |
The six classes of the 20 amino acids.
| Classification | Abbreviation | Amino Acids |
|---|---|---|
| Strongly hydrophilic or polar | H | H, R, D, E, N, Q, K |
| Strongly hydrophobic | L | L, I, V, A, M, F |
| Weakly hydrophilic or weakly hydrophobic (ambiguous) | S | S, T, Y, W |
| Proline | P | P |
| Glycine | G | G |
| Cysteine | C | C |
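The six-class grouping in the table above can be applied to a sequence as a simple lookup. The sketch below maps each residue to its class abbreviation and reports class frequencies. The function names are hypothetical, and skipping characters outside the 20 standard residues is an assumption of this sketch.

```python
# Class abbreviation -> member residues, copied from the table above.
CLASS_MEMBERS = {
    "H": "HRDENQK",  # strongly hydrophilic or polar
    "L": "LIVAMF",   # strongly hydrophobic
    "S": "STYW",     # weakly hydrophilic or weakly hydrophobic (ambiguous)
    "P": "P",        # proline
    "G": "G",        # glycine
    "C": "C",        # cysteine
}

RESIDUE_TO_CLASS = {
    residue: cls for cls, members in CLASS_MEMBERS.items() for residue in members
}


def reduce_sequence(seq):
    """Rewrite a protein sequence in the six-letter class alphabet.

    Assumption: characters that are not standard residues are skipped.
    """
    return "".join(RESIDUE_TO_CLASS[r] for r in seq.upper() if r in RESIDUE_TO_CLASS)


def class_frequencies(seq):
    """Return the fraction of residues that fall in each of the six classes."""
    reduced = reduce_sequence(seq)
    n = len(reduced)
    return {cls: (reduced.count(cls) / n if n else 0.0) for cls in CLASS_MEMBERS}
```

For example, `reduce_sequence("MKVLGC")` yields `"LHLLGC"`, since M, V and L are strongly hydrophobic and K is strongly hydrophilic.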
Figure 2. The GCGRs of the primary sequences of proteins from six subcellular locations. GCGR: Generalized Chaos Game Representation.
Figure 3. Six time series that represent the first three GCGRs in Figure 2. Each panel in Figure 2 gives rise to two time series.
Figure 4. Six time series that represent the last three GCGRs in Figure 2. Each panel in Figure 2 gives rise to two time series.
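This record does not give the GCGR construction, so the following is only an illustration of how a chaos-game walk over a sequence can produce two time series (the x and y coordinates) per protein, as the captions describe. It assumes the six amino acid classes sit at the vertices of a regular hexagon and that each step moves halfway from the current point toward the vertex of the next residue's class. The vertex order, the starting point and the 0.5 ratio are all assumptions, and `chaos_game_series` is a hypothetical name, not the paper's method.

```python
import math

# Six-class grouping of the 20 amino acids (from the classification table).
CLASS_MEMBERS = {
    "H": "HRDENQK", "L": "LIVAMF", "S": "STYW",
    "P": "P", "G": "G", "C": "C",
}
RESIDUE_TO_CLASS = {
    residue: cls for cls, members in CLASS_MEMBERS.items() for residue in members
}

# Assumed layout: class k sits at angle k * 60 degrees on the unit circle.
VERTICES = {
    cls: (math.cos(k * math.pi / 3), math.sin(k * math.pi / 3))
    for k, cls in enumerate(CLASS_MEMBERS)
}


def chaos_game_series(seq, ratio=0.5):
    """Return (xs, ys): the coordinates visited by the chaos-game walk.

    Starts at the hexagon centre (assumed) and skips non-standard residues.
    """
    x, y = 0.0, 0.0
    xs, ys = [], []
    for residue in seq.upper():
        cls = RESIDUE_TO_CLASS.get(residue)
        if cls is None:
            continue
        vx, vy = VERTICES[cls]
        x += ratio * (vx - x)  # move a fixed fraction toward the class vertex
        y += ratio * (vy - y)
        xs.append(x)
        ys.append(y)
    return xs, ys
```

Plotting `xs` against `ys` would give a scatter in the plane, while plotting each list against the residue index would give two series of the kind described in the captions.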
Figure 5. The boxplots for the … and … of all the proteins in dataset CL317 grouped into the six subcellular locations.