Guohua Huang, Lin Lu, Kaiyan Feng, Jun Zhao, Yuchao Zhang, Yaochen Xu, Ning Zhang, Bi-Qing Li, Weiping Huang, Yu-Dong Cai.
Abstract
Protein S-nitrosylation plays an important role in a wide variety of cellular activities, yet accurate prediction of S-nitrosylation sites remains a major challenge. In this paper, we present a framework for computationally predicting S-nitrosylation sites based on kernel sparse representation classification and the minimum Redundancy Maximum Relevance algorithm. A total of 666 features, derived from five categories of amino acid properties and one protein structure feature, are used to represent proteins numerically. A total of 529 protein sequences collected from open-access databases and the published literature are used to train and test our predictor. Computational results show that our predictor achieves Matthews' correlation coefficients of 0.1634 and 0.2919 on the training set and the testing set, respectively, which are better than those of the k-nearest neighbor, random forest, and sparse representation classification algorithms. The results also indicate that 134 optimal features represent S-nitrosylation peptides better than the original 666 redundant features. Furthermore, we constructed an independent testing set of 113 protein sequences to evaluate the robustness of our predictor. The predictor also performed well on this independent testing set, with a Matthews' correlation coefficient of 0.2239.
Year: 2014 PMID: 25184139 PMCID: PMC4145740 DOI: 10.1155/2014/438341
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Distribution of feature type for a sample.
| Feature category | Number of features from each category |
|---|---|
| Evolutionary conservation | 21 × 20 |
| Amino acid factor | 20 × 5 |
| Secondary structure | 21 × 3 |
| Solvent accessibility | 21 × 2 |
| Amino acid frequency | 20 × 1 |
| Disorder | 21 × 1 |
| Number of features of a sample | 666 |
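The per-category counts in the table can be checked with a few lines of arithmetic. The sketch below only multiplies and sums the factors listed above; the dictionary name `FEATURE_LAYOUT` and the helper `features_per_category` are hypothetical, and what the two factors of each product stand for is not stated in this excerpt, so they are treated as opaque numbers.

```python
# Factors copied from the feature table; their interpretation is not
# given in this excerpt, so they are treated as plain numbers.
FEATURE_LAYOUT = {
    "Evolutionary conservation": (21, 20),
    "Amino acid factor": (20, 5),
    "Secondary structure": (21, 3),
    "Solvent accessibility": (21, 2),
    "Amino acid frequency": (20, 1),
    "Disorder": (21, 1),
}


def features_per_category(layout):
    """Return the number of features contributed by each category."""
    return {name: a * b for name, (a, b) in layout.items()}


counts = features_per_category(FEATURE_LAYOUT)
total_features = sum(counts.values())
print(total_features)  # 666, matching the last row of the table
```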
Algorithm 1: OMP algorithm.
Algorithm 2: SRC algorithm.
Algorithm 3: KSRC algorithm.
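The algorithm bodies are not reproduced in this record, so the following is only a minimal sketch of the generic textbook forms of orthogonal matching pursuit (OMP) and sparse representation classification (SRC), not the authors' exact procedures. The function names `omp` and `src_predict`, the sparsity parameter `n_nonzero`, and the toy dictionary are all assumptions made for illustration. A kernel variant (KSRC) would, roughly speaking, replace the inner products with kernel evaluations; that step is not shown here.

```python
import numpy as np


def omp(D, y, n_nonzero):
    """Greedy OMP: approximate y using at most n_nonzero columns of D."""
    coef = np.zeros(D.shape[1])
    residual = y.astype(float).copy()
    support = []
    for _ in range(n_nonzero):
        # Select the atom most correlated with the current residual.
        correlations = np.abs(D.T @ residual)
        if support:
            correlations[support] = -np.inf  # never pick an atom twice
        support.append(int(np.argmax(correlations)))
        # Refit all selected atoms by least squares, then update the residual.
        sol, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        coef[:] = 0.0
        coef[support] = sol
        residual = y - D[:, support] @ sol
        if np.linalg.norm(residual) < 1e-10:
            break
    return coef


def src_predict(D, labels, y, n_nonzero=5):
    """Assign y to the class whose training atoms reconstruct it best."""
    labels = np.asarray(labels)
    D = D / np.linalg.norm(D, axis=0, keepdims=True)  # unit-norm atoms
    coef = omp(D, y, n_nonzero)
    residuals = {}
    for c in np.unique(labels):
        mask = labels == c
        # Reconstruction error using only the coefficients of class c.
        residuals[c] = np.linalg.norm(y - D[:, mask] @ coef[mask])
    return min(residuals, key=residuals.get)


# Toy dictionary (hypothetical data): columns are training samples.
D_demo = np.array([[1.0, 0.9, 0.0, 0.1],
                   [0.0, 0.1, 1.0, 0.9],
                   [0.0, 0.0, 0.0, 0.0]])
labels_demo = [0, 0, 1, 1]
y_demo = np.array([0.05, 1.0, 0.0])
predicted = src_predict(D_demo, labels_demo, y_demo, n_nonzero=2)
print(predicted)
```

In this toy case the query lies almost entirely along the class-1 samples, so the class-1 residual is the smaller one and the sketch returns label 1.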
Figure 1: MCC value of 10-fold cross validation of the KSRC on the training set in the incremental feature selection procedure.
Figure 2: MCC curves of 10-fold cross validation on the training set of (a) SRC, (b) KNN, (c) RF, (d) SMO, and (e) Dagging in the incremental feature selection procedure.
Performances of six algorithms on the training set with the respective optimal features using 10-fold cross validation.
| Algorithm | SN | SP | ACC | MCC |
|---|---|---|---|---|
| KSRC | 0.4048 | 0.7543 | 0.6393 | 0.1634 |
| SRC | 0.3489 | 0.7876 | 0.6433 | 0.1467 |
| KNN | 0.3852 | 0.7469 | 0.6279 | 0.1358 |
| RF | 0.3399 | 0.7957 | 0.6458 | 0.1473 |
| SMO | 0.2840 | 0.8705 | 0.6776 | 0.1887 |
| Dagging | 0.3610 | 0.8320 | 0.6771 | 0.2150 |
KSRC: kernel sparse representation classification; SRC: sparse representation classification; KNN: k-nearest neighbor algorithm; RF: random forest method; SMO: sequential minimal optimization; Dagging refers to the use of majority vote to combine multiple models derived from a single learning algorithm using disjoint samples.
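The column headings SN, SP, ACC, and MCC are the standard sensitivity, specificity, accuracy, and Matthews correlation coefficient. As a rough illustration of how such values follow from a binary confusion matrix, the sketch below uses the usual textbook definitions; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Compute SN, SP, ACC and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)                    # sensitivity: recall on positives
    sp = tn / (tn + fp)                    # specificity: recall on negatives
    acc = (tp + tn) / (tp + fn + tn + fp)  # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}


# Hypothetical counts, chosen only to illustrate the formulas.
metrics = binary_metrics(tp=40, fn=60, tn=150, fp=50)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because MCC uses all four counts, it can stay low even when ACC looks reasonable on imbalanced data, which is one reason it is commonly reported alongside accuracy.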
Performances of six algorithms on the testing set with the respective optimal features.
| Algorithm | SN | SP | ACC | MCC |
|---|---|---|---|---|
| KSRC | 0.4727 | 0.8077 | 0.6978 | 0.2919 |
| SRC | 0.2909 | 0.7988 | 0.6322 | 0.1000 |
| KNN | 0.4061 | 0.7899 | 0.6649 | 0.2062 |
| RF | 0.3636 | 0.8343 | 0.6799 | 0.2206 |
| SMO | 0.2364 | 0.8669 | 0.6600 | 0.1299 |
| Dagging | 0.2848 | 0.8343 | 0.6541 | 0.1386 |
Performances of KSRC on the training and testing sets with the original 666 features.
| Data set | SN | SP | ACC | MCC |
|---|---|---|---|---|
| The training set | 0.2749 | 0.8120 | 0.6354 | 0.0991 |
| The testing set | 0.2909 | 0.8462 | 0.6640 | 0.1612 |
Performances of eight algorithms on the independent testing set with the respective optimal features.
| Algorithm | SN | SP | ACC | MCC |
|---|---|---|---|---|
| KSRC | 0.5196 | 0.7368 | 0.6915 | 0.2239 |
| SRC | 0.5588 | 0.7419 | 0.7038 | 0.2617 |
| KNN | 0.4069 | 0.7419 | 0.6721 | 0.1333 |
| RF | 0.4657 | 0.7535 | 0.6936 | 0.1958 |
| SMO | 0.1765 | 0.8645 | 0.7211 | 0.0474 |
| Dagging | 0.2745 | 0.7884 | 0.6813 | 0.0612 |
| iSNO-AAPair | 0.4020 | 0.7252 | 0.6578 | 0.1125 |
| iSNO-PseAAC | 0.5343 | 0.6103 | 0.5945 | 0.1190 |