Shunfang Wang, Bing Nie, Kun Yue, Yu Fei, Wenjia Li, Dongshu Xu.
Abstract
Kernel discriminant analysis (KDA) is a dimension reduction and classification algorithm based on the nonlinear kernel trick, which offers a novel way to treat high-dimensional and complex biological data before classification tasks such as protein subcellular localization. Kernel parameters strongly affect the performance of the KDA model. In particular, for KDA with the popular Gaussian kernel, selecting the scale parameter remains a challenging problem. This paper therefore introduces the KDA method and proposes a new method for Gaussian kernel parameter selection, based on the observation that, for a suitable kernel parameter, the difference between the reconstruction errors of edge normal samples and those of interior normal samples should be maximized. Experiments on several standard data sets for protein subcellular localization show that the overall accuracy of protein classification with KDA is much higher than without KDA. The kernel parameter of KDA also has a great impact on efficiency: the proposed method yields an optimum parameter with which the new algorithm performs as effectively as the traditional ones while reducing computational time.
Keywords: Gaussian kernel function; dimension reduction; kernel discriminant analysis (KDA); kernel parameter selection; protein subcellular localization
Year: 2017 PMID: 29244758 PMCID: PMC5751319 DOI: 10.3390/ijms18122718
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. The overall accuracy versus for four sample sets.
The overall accuracy and the ratio of runtime for two methods.
| Sample Set | Method | Overall Accuracy | Ratio of Runtime |
|---|---|---|---|
| GP-220 (PSSM-S) | The proposed method | 0.9924 | 0.7087 |
| GP-220 (PSSM-S) | Grid searching method | 0.9924 | |
| GP-1000 (PsePSSM) | The proposed method | 0.9924 | 0.7362 |
| GP-1000 (PsePSSM) | Grid searching method | 0.9924 | |
| GN-220 (PSSM-S) | The proposed method | 0.9801 | 0.7416 |
| GN-220 (PSSM-S) | Grid searching method | 0.9801 | |
| GN-1000 (PsePSSM) | The proposed method | 0.9574 | 0.7687 |
| GN-1000 (PsePSSM) | Grid searching method | 0.9574 | |
Figure 2. The overall accuracy and the ratio of runtime for two methods.
Figure 3. The overall accuracy versus value with or without KDA algorithm.
Figure 4. The overall accuracy for four sample sets with different values.
The values of evaluation criterion with the proposed method for the Gram-positive.
| Sample Set | Cell Membrane | Cell Wall | Cytoplasm | Extracell |
|---|---|---|---|---|
| GP-220 | 1 | 0.9444 | 0.9904 | 0.9919 |
| GP-1000 | 0.9943 | 0.9444 | 1 | 0.9837 |
| GP-220 | 0.9943 | 1 | 1 | 0.9950 |
| GP-1000 | 0.9971 | 1 | 0.9937 | 0.9925 |
| GP-220 | 0.9914 | 0.9709 | 0.9920 | 0.9841 |
| GP-1000 | 0.9914 | 0.9709 | 0.9921 | 0.9840 |
| GP-220 | 0.9924 | | | |
| GP-1000 | 0.9924 | | | |
The values of evaluation criterion with the proposed method for the Gram-negative.
| Sample Set | (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) |
|---|---|---|---|---|---|---|---|---|
| GN-220 | 1 | 0.9699 | 1 | 0 | 0.9982 | 0 | 0.9677 | 1 |
| GN-1000 | 1 | 0.9323 | 1 | 0 | 0.9659 | 0 | 0.9516 | 0.9556 |
| GN-220 | 0.9924 | 0.9902 | 1 | 1 | 0.9978 | 1 | 1 | 0.9953 |
| GN-1000 | 0.9608 | 0.9872 | 1 | 1 | 0.9967 | 1 | 1 | 0.9992 |
| GN-220 | 0.9866 | 0.9324 | 1 | - | 0.9956 | - | 0.9823 | 0.9814 |
| GN-1000 | 0.9346 | 0.8957 | 1 | - | 0.9681 | - | 0.9733 | 0.9712 |
| GN-220 | 0.9801 | | | | | | | |
| GN-1000 | 0.9574 | | | | | | | |
(1) Cytoplasm, (2) Extracell, (3) Fimbrium, (4) Flagellum, (5) Inner membrane, (6) Nucleoid, (7) Outer membrane, (8) Periplasm.
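The tables above do not name their evaluation criteria. As a hedged illustration only, the sketch below assumes the usual choices for this task: per-class sensitivity, specificity, and Matthews correlation coefficient (MCC), plus the overall accuracy. The function name `per_class_criteria` and the integer label encoding are hypothetical. Under that assumption, a class that is never predicted gets sensitivity 0, specificity 1, and an undefined MCC, which is one way a pattern like `0`, `1`, `-` can arise for very small classes.

```python
import numpy as np

def per_class_criteria(y_true, y_pred, n_classes):
    """Return ([(sensitivity, specificity, mcc), ...] per class, overall accuracy).

    Assumed criteria, not confirmed by the source tables.
    Undefined ratios (zero denominator) are reported as NaN.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    rows = []
    for c in range(n_classes):
        # One-vs-rest confusion counts for class c.
        tp = int(np.sum((y_true == c) & (y_pred == c)))
        fn = int(np.sum((y_true == c) & (y_pred != c)))
        fp = int(np.sum((y_true != c) & (y_pred == c)))
        tn = int(np.sum((y_true != c) & (y_pred != c)))
        sens = tp / (tp + fn) if (tp + fn) else float("nan")
        spec = tn / (tn + fp) if (tn + fp) else float("nan")
        denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
        mcc = (tp * tn - fp * fn) / denom if denom else float("nan")
        rows.append((sens, spec, mcc))
    overall = float(np.mean(y_true == y_pred))
    return rows, overall

# Toy labels: class 2 has one sample and is never predicted.
rows, overall = per_class_criteria([0, 0, 0, 1, 1, 2], [0, 0, 0, 1, 1, 0], 3)
```

In this toy case the MCC for class 2 is NaN because no sample is assigned to that class, so the denominator is zero.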
Figure 5. The flow of protein subcellular localization.
The Selection of Internal and Edge Samples.
| Input: |
|---|
| 1. Calculate the radius of neighborhood |
| 2. Calculate the centroid |
| 3. Calculate the distances |
| 4. For each training sample Calculate the If If |
The Method for Selecting the Gaussian KDA Parameter.
| Input: A reasonable candidate set |
|---|
| 1. Get the internal sample set |
| 2. For each parameter Calculate the kernel matrix Reduce dimension of the Calculate Calculate the value of objective function |
| 3. Select the optimum parameter |
| Output: the optimum Gaussian kernel parameter |
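The selection loop above can be sketched in Python under several stated assumptions. Kernel PCA stands in for the KDA projection, since the projection step is not spelled out here. The interior/edge split uses a hypothetical centroid-distance quantile rule. The objective is taken to be the mean reconstruction error of edge samples minus that of interior samples, following the abstract. The names `edge_mask_by_centroid_distance`, `reconstruction_errors`, and `select_sigma` are illustrative, not the authors'.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def reconstruction_errors(X_train, X_eval, sigma, n_components):
    """Feature-space reconstruction error of X_eval after kernel PCA on X_train."""
    K = gaussian_kernel(X_train, X_train, sigma)
    col_mean = K.mean(axis=0)
    total_mean = K.mean()
    # Center the kernel matrix in feature space.
    Kc = K - col_mean[None, :] - col_mean[:, None] + total_mean
    eigval, eigvec = np.linalg.eigh(Kc)
    top = np.argsort(eigval)[::-1][:n_components]
    eigval, eigvec = eigval[top], eigvec[:, top]
    keep = eigval > 1e-10
    # Scale so each principal direction has unit norm in feature space.
    alphas = eigvec[:, keep] / np.sqrt(eigval[keep])
    Kx = gaussian_kernel(X_eval, X_train, sigma)
    Kx_c = Kx - Kx.mean(axis=1, keepdims=True) - col_mean[None, :] + total_mean
    proj = Kx_c @ alphas
    # Squared norm of the centered feature vector; k(x, x) = 1 for a Gaussian kernel.
    norm2 = 1.0 - 2.0 * Kx.mean(axis=1) + total_mean
    return norm2 - (proj ** 2).sum(axis=1)

def edge_mask_by_centroid_distance(X, edge_fraction=0.2):
    """Hypothetical rule: the farthest `edge_fraction` of samples from the centroid are 'edge'."""
    d = np.linalg.norm(X - X.mean(axis=0), axis=1)
    return d >= np.quantile(d, 1.0 - edge_fraction)

def select_sigma(X, candidates, n_components=5, edge_fraction=0.2):
    """Return the candidate sigma maximizing mean(edge error) - mean(interior error)."""
    edge = edge_mask_by_centroid_distance(X, edge_fraction)
    scores = []
    for sigma in candidates:
        err = reconstruction_errors(X, X, sigma, n_components)
        scores.append(float(err[edge].mean() - err[~edge].mean()))
    return candidates[int(np.argmax(scores))], scores
```

A call such as `select_sigma(X, [0.5, 1.0, 2.0, 4.0])` evaluates each candidate once, so the cost grows linearly with the size of the candidate set; each evaluation is dominated by the eigendecomposition of the kernel matrix.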
The name and the size of each location for the Gram-positive data set.
| No. | Subcellular Localization | Number of Proteins |
|---|---|---|
| 1 | cell membrane | 174 |
| 2 | cell wall | 18 |
| 3 | cytoplasm | 208 |
| 4 | extracell | 123 |
The name and the size of each location for the Gram-negative data set.
| No. | Subcellular Localization | Number of Proteins |
|---|---|---|
| 1 | cytoplasm | 410 |
| 2 | extracell | 133 |
| 3 | fimbrium | 32 |
| 4 | flagellum | 12 |
| 5 | inner membrane | 557 |
| 6 | nucleoid | 8 |
| 7 | outer membrane | 124 |
| 8 | periplasm | 180 |
Sample sets.
| Sample Sets | Benchmarks for Subcellular Locations | Extraction Feature Method | The Number of Classes | The Dimension of Feature Vector | The Number of Samples |
|---|---|---|---|---|---|
| GN-1000 | Gram-negative | PsePSSM | 8 | 1000 | 1456 |
| GN-220 | Gram-negative | PSSM-S | 8 | 220 | 1456 |
| GP-1000 | Gram-positive | PsePSSM | 4 | 1000 | 523 |
| GP-220 | Gram-positive | PSSM-S | 4 | 220 | 523 |