Ravindra Kumar, Sohni Jain, Bandana Kumari, Manish Kumar.
Abstract
The nucleus is the largest and most highly organized organelle of eukaryotic cells. Within the nucleus there are a number of pseudo-compartments that are not separated by any membrane, yet each contains only a specific set of proteins. Understanding protein sub-nuclear localization can therefore be an important step towards understanding the biological functions of the nucleus. Here we describe SubNucPred, a method we developed for predicting the sub-nuclear localization of proteins. The method predicts localization for 10 sub-nuclear locations sequentially, by combining the presence or absence of unique Pfam domains with an amino acid composition based SVM model. The prediction accuracy during leave-one-out cross-validation was 85.05% for centromeric proteins, 76.85% for chromosomal proteins, 81.27% for nuclear speckle proteins, 81.79% for nucleolar proteins, 79.37% for nuclear envelope proteins, 77.78% for nuclear matrix proteins, 76.98% for nucleoplasm proteins, 88.89% for nuclear pore complex proteins, 75.40% for PML body proteins and 83.33% for telomeric proteins. Comparison with other reported methods showed that SubNucPred performs better than existing methods. A web-server for predicting protein sub-nuclear localization, named SubNucPred, has been established at http://14.139.227.92/mkumar/subnucpred/. A standalone version of SubNucPred can also be downloaded from the web-server.
Year: 2014; PMID: 24897370; PMCID: PMC4045734; DOI: 10.1371/journal.pone.0098345
Source DB: PubMed; Journal: PLoS One; ISSN: 1932-6203; Impact factor: 3.240
Figure 1. Flow diagram of SubNucPred.
The overall scheme is divided into three steps. Step 1 makes a prediction on the basis of the presence or absence of a unique Pfam domain (Method-I). Steps 2 and 3 (referred to as Layer-I and Layer-II, respectively, in the manuscript) make predictions on the basis of an amino acid composition based SVM model and a threshold (Method-II). In Step 2 (Layer-I), prediction is made for five sub-nuclear classes (centromere, chromosome, nucleolus, nuclear speckle and 'Others'). If the SVM score for 'Others' is greater than the threshold, the query protein is predicted to belong to one of the locations grouped under 'Others'. Step 3 (Layer-II) SVM prediction is then applied to the six locations that make up 'Others'.
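The three-step flow described in this caption can be sketched roughly as follows. This is an illustrative outline, not the SubNucPred implementation: the function name `predict_locations`, the callables `domain_lookup`, `layer1_scores` and `layer2_scores`, the single shared `threshold`, and the choice to skip the SVM layers once a unique domain is found are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional

# Class names taken from the caption; "Others" groups the six Layer-II locations.
LAYER1 = ["Centromere", "Chromosome", "Nucleolus", "Nuclear speckle", "Others"]
LAYER2 = ["Nuclear envelope", "Nuclear matrix", "Nucleoplasm",
          "Nuclear pore complex", "PML body", "Telomere"]

DomainLookup = Callable[[str], Optional[str]]   # hypothetical: sequence -> location or None
Scorer = Callable[[str], Dict[str, float]]      # hypothetical: sequence -> {location: SVM score}


def predict_locations(sequence: str,
                      domain_lookup: DomainLookup,
                      layer1_scores: Scorer,
                      layer2_scores: Scorer,
                      threshold: float = 0.0) -> List[str]:
    """Return predicted sub-nuclear locations using the cascaded scheme."""
    # Step 1 (Method-I): a unique Pfam domain decides the location directly.
    location = domain_lookup(sequence)
    if location is not None:
        return [location]

    missing = float("-inf")

    # Step 2 (Layer-I): keep every named location whose score exceeds the threshold.
    s1 = layer1_scores(sequence)
    hits = [loc for loc in LAYER1
            if loc != "Others" and s1.get(loc, missing) > threshold]

    # Step 3 (Layer-II): only consulted when 'Others' exceeds the threshold.
    if s1.get("Others", missing) > threshold:
        s2 = layer2_scores(sequence)
        hits += [loc for loc in LAYER2 if s2.get(loc, missing) > threshold]

    return hits
```

In practice the three callables would wrap a Pfam domain search and the trained SVM models; here they are left abstract so the control flow is the only thing being illustrated.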
Figure 2. Average amino acid composition analysis of proteins belonging to different sub-nuclear locations.
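For readers unfamiliar with the feature, amino acid composition is commonly computed as the percentage of each of the 20 standard residues in a sequence. The sketch below follows that common definition; the function name `amino_acid_composition` and the decision to ignore non-standard residue codes are assumptions, since the record does not state how the authors handled them.

```python
from collections import Counter
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def amino_acid_composition(sequence: str) -> Dict[str, float]:
    """Return the percentage of each standard amino acid in the sequence."""
    # Assumption: characters outside the 20 standard codes (e.g. X, B, Z) are skipped.
    counts = Counter(res for res in sequence.upper() if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}
```

The resulting 20-value vector is the kind of input an amino acid composition based SVM would take; averaging these vectors over the proteins of one location gives a profile like the one plotted in the figure.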
Prediction efficiency at various sub-nuclear locations on the basis of presence of SSLD.
| Location | Number of Unique Domains | Proteins Predicted | Prediction Efficiency (%) |
|---|---|---|---|
| Centromere (86) | 21 | 39 | 45.35 |
| Chromosome (113) | 19 | 22 | 19.47 |
| Nuclear speckle (50) | 16 | 12 | 24.00 |
| Nucleolus (294) | 81 | 102 | 34.69 |
| Nuclear envelope (17) | 6 | 7 | 41.18 |
| Nuclear matrix (18) | 3 | 5 | 27.78 |
| Nucleoplasm (30) | 7 | 6 | 20.00 |
| Nuclear pore complex (12) | 4 | 3 | 25.00 |
| PML body (12) | 4 | 3 | 25.00 |
| Telomere (37) | 10 | 10 | 27.03 |
SSLD represents single sub-nuclear domain.
Values in parentheses are the number of proteins in that location.
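The prediction efficiency column appears to be the number of proteins predicted divided by the number of proteins in the location, expressed as a percentage; the table values are consistent with this reading (for example, 39 of 86 centromeric proteins gives 45.35%). A minimal check, with the helper name `prediction_efficiency` chosen for illustration:

```python
def prediction_efficiency(predicted: int, total: int) -> float:
    """Percentage of proteins in a location that were predicted via a unique domain."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * predicted / total


if __name__ == "__main__":
    # (proteins predicted, proteins in location) copied from three rows of the table.
    rows = {"Centromere": (39, 86), "Chromosome": (22, 113), "Nucleolus": (102, 294)}
    for name, (predicted, total) in rows.items():
        print(f"{name}: {prediction_efficiency(predicted, total):.2f}%")
```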
Performance of the amino acid composition based SVM model using the layered approach (for details, see Table S9).
| Location | TP | TN | FP | FN | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC | AUC |
|---|---|---|---|---|---|---|---|---|---|
| Layer-I | | | | | | | | | |
| Centromere (86) | 67(44) | 524(628) | 159(55) | 19(42) | 77.91(51.16) | 76.72(91.95) | 76.85(87.39) | 0.38(0.41) | 0.83 |
| Chromosome (113) | 76(38) | 423(606) | 233(50) | 37(75) | 67.26(33.63) | 64.48(92.38) | 64.89(83.75) | 0.23(0.29) | 0.71 |
| Nuclear speckle (50) | 35(15) | 527(701) | 192(18) | 15(35) | 70.00(30.00) | 73.30(97.50) | 73.08(93.11) | 0.23(0.33) | 0.80 |
| Nucleolus (294) | 211(162) | 342(411) | 133(64) | 83(132) | 71.77(55.10) | 72.00(86.53) | 71.91(74.51) | 0.43(0.44) | 0.78 |
| Others (126) | 86(86) | 438(438) | 205(205) | 40(40) | 68.25(68.25) | 68.12(68.12) | 68.14(68.14) | 0.28(0.28) | 0.72 |
| Layer-II | | | | | | | | | |
| Nuclear envelope (17) | 12(8) | 83(100) | 26(9) | 5(9) | 70.59(47.06) | 76.15(91.74) | 75.40(85.71) | 0.35(0.39) | 0.76 |
| Nuclear matrix (18) | 13(5) | 75(104) | 33(4) | 5(13) | 72.22(27.78) | 69.44(96.30) | 69.84(86.51) | 0.30(0.33) | 0.72 |
| Nucleoplasm (30) | 20(23) | 65(59) | 31(37) | 10(7) | 66.67(76.67) | 67.71(61.46) | 67.46(65.08) | 0.30(0.33) | 0.67 |
| Nuclear pore complex (12) | 9(9) | 90(90) | 24(24) | 3(3) | 75.00(75.00) | 78.95(78.95) | 78.57(78.57) | 0.36(0.36) | 0.80 |
| PML body (12) | 8(8) | 76(76) | 38(38) | 4(4) | 66.67(66.67) | 66.67(66.67) | 66.67(66.67) | 0.20(0.20) | 0.66 |
| Telomere (37) | 27(20) | 64(82) | 25(7) | 10(17) | 72.97(54.05) | 71.91(92.13) | 72.22(80.95) | 0.42(0.51) | 0.76 |
TP, TN, FP, FN, MCC and AUC denote true positives, true negatives, false positives, false negatives, Matthews correlation coefficient and area under the ROC curve, respectively.
Values in parentheses in the 'Location' column are the number of proteins in that location. Values in parentheses in the TP, TN, FP, FN, Sensitivity, Specificity, Accuracy and MCC columns are the values at which the maximum MCC was found.
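The derived columns can be reproduced from the four confusion-matrix counts using the standard definitions of sensitivity, specificity, accuracy and MCC. The sketch below applies those textbook formulas; the function name `binary_metrics` is illustrative, and returning an MCC of 0 when the denominator is zero is a convention chosen here, not something stated in the record.

```python
import math
from typing import Dict


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Sensitivity, specificity and accuracy (in %) plus MCC from confusion counts."""
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: MCC is reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "mcc": mcc}


if __name__ == "__main__":
    # Centromere row of the table: TP=67, TN=524, FP=159, FN=19.
    for name, value in binary_metrics(67, 524, 159, 19).items():
        print(f"{name}: {value:.2f}")
```

With the centromere counts this gives a sensitivity of 77.91%, a specificity of 76.72%, an accuracy of 76.85% and an MCC of 0.38, matching the first values in that row.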
Figure 3. ROC curve of amino acid composition based SVM modules.
Performance of SSLD and amino acid composition based SVM.
| Location | TP | TN | FP | FN | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC |
|---|---|---|---|---|---|---|---|---|
| Layer-I | | | | | | | | |
| Centromere | 73(60) | 581(653) | 102(30) | 13(26) | 84.88(69.77) | 85.07(95.61) | 85.05(92.72) | 0.53(0.64) |
| Chromosome | 88(56) | 503(623) | 153(33) | 25(57) | 77.88(49.56) | 76.68(94.97) | 76.85(88.30) | 0.42(0.49) |
| Nuclear speckle | 39(23) | 586(705) | 133(14) | 11(27) | 78.00(46.00) | 81.50(98.05) | 81.27(94.67) | 0.35(0.51) |
| Nucleolus | 245(213) | 384(435) | 91(40) | 49(81) | 83.33(72.45) | 80.84(91.58) | 81.79(84.27) | 0.63(0.66) |
| Others | 95(95) | 536(536) | 107(107) | 31(31) | 75.40(75.40) | 83.36(83.36) | 82.05(82.05) | 0.49(0.49) |
| Layer-II | | | | | | | | |
| Nuclear envelope | 13(12) | 87(101) | 22(8) | 4(5) | 76.47(70.59) | 79.82(92.66) | 79.37(89.68) | 0.43(0.59) |
| Nuclear matrix | 13(7) | 85(104) | 23(4) | 5(11) | 72.22(38.89) | 78.70(96.30) | 77.78(88.10) | 0.39(0.44) |
| Nucleoplasm | 22(24) | 75(70) | 21(26) | 8(6) | 73.33(80.00) | 78.12(72.92) | 76.98(74.60) | 0.46(0.46) |
| Nuclear pore complex | 11(11) | 101(101) | 13(13) | 1(1) | 91.67(91.67) | 88.60(88.60) | 88.89(88.89) | 0.60(0.60) |
| PML body | 9(9) | 86(86) | 28(28) | 3(3) | 75.00(75.00) | 75.44(75.44) | 75.40(75.40) | 0.33(0.33) |
| Telomere | 32(27) | 73(83) | 16(6) | 5(10) | 86.49(72.97) | 82.02(93.26) | 83.33(87.30) | 0.64(0.69) |
TP, TN, FP, FN and MCC denote true positives, true negatives, false positives, false negatives and Matthews correlation coefficient, respectively.
Values in parentheses are the values in the respective column at which the maximum MCC was found.
Performance of SubNucPred method on DataIND (One-vs-Rest approach).
| Location | TP | TN | FP | FN | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC |
|---|---|---|---|---|---|---|---|---|
| Layer-I | | | | | | | | |
| Centromere (31) | 17 | 118 | 58 | 14 | 54.84 | 67.05 | 65.22 | 0.16 |
| Chromosome (38) | 25 | 111 | 58 | 13 | 65.79 | 65.68 | 65.70 | 0.25 |
| Nuclear speckle (14) | 12 | 127 | 66 | 2 | 85.71 | 65.80 | 67.15 | 0.27 |
| Nucleolus (46) | 31 | 127 | 34 | 15 | 67.39 | 78.88 | 76.33 | 0.41 |
| Others (78) | 60 | 73 | 56 | 18 | 76.92 | 56.59 | 64.25 | 0.33 |
| Layer-II | | | | | | | | |
| Nuclear envelope (51) | 27 | 20 | 7 | 24 | 52.94 | 74.07 | 60.26 | 0.26 |
| Nuclear matrix (6) | 3 | 54 | 18 | 3 | 50.00 | 75.00 | 73.08 | 0.15 |
| Nucleoplasm (7) | 4 | 46 | 25 | 3 | 57.14 | 64.79 | 64.10 | 0.13 |
| Nuclear pore complex (2) | 1 | 54 | 22 | 1 | 50.00 | 71.05 | 70.51 | 0.07 |
| PML body (7) | 4 | 51 | 20 | 3 | 57.14 | 71.83 | 70.51 | 0.18 |
| Telomere (5) | 3 | 38 | 35 | 2 | 60.00 | 52.05 | 52.56 | 0.06 |
TP, TN, FP, FN and MCC denote true positives, true negatives, false positives, false negatives and Matthews correlation coefficient, respectively.
Values in parentheses are the number of proteins in that location.
Comparison of performance of SubNucPred with Nuc-PLoc, Snlpred and Scp web-servers using DataIND.
| Location | SubNucPred | Scp | Nuc-PLoc | Snlpred |
|---|---|---|---|---|
| Centromere (31) | 15 | 2 Chromatin | 7 Chromatin + 2 Heterochromatin | 11 Chromatin |
| Chromosome (38) | 16 | 1 Chromatin | 2 Chromatin | 7 Chromatin |
| Nuclear speckle (14) | 12 | 4 | 3 | 5 |
| Nucleolus (46) | 35 | 10 | 43 | 37 |
| Nuclear envelope (51) | 31 | 47 Nuclear Lamina | 18 | 7 Nuclear Lamina |
| Nuclear matrix (6) | 2 | 0 | 1 | 0 |
| Nucleoplasm (7) | 1 | 0 | 0 | 1 |
| Nuclear pore complex (2) | 1 | 1 Nuclear Lamina | 0 | 2 Nuclear Lamina |
| PML body (7) | 2 | 0 | 1 | 0 |
| Telomere (5) | 2 | 0 | 1 Chromatin | 1 Chromatin |
Because Scp, Nuc-PLoc and Snlpred do not cover the same sub-nuclear locations as SubNucPred, we counted a prediction of chromatin or heterochromatin as correct for centromeric, chromosomal and telomeric proteins. Similarly, for nuclear envelope and nuclear pore complex proteins, a prediction of nuclear lamina was counted as correct.
Values in parentheses are the number of proteins in that location.
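The label adjustment described in the note above amounts to a small equivalence table applied when scoring the other servers. A minimal sketch follows; the name `is_correct`, the exact label strings and the case-insensitive matching are assumptions for illustration.

```python
# Coarser labels from the other servers that were accepted as correct, per the note above.
EQUIVALENT_LABELS = {
    "Centromere": {"Centromere", "Chromatin", "Heterochromatin"},
    "Chromosome": {"Chromosome", "Chromatin", "Heterochromatin"},
    "Telomere": {"Telomere", "Chromatin", "Heterochromatin"},
    "Nuclear envelope": {"Nuclear envelope", "Nuclear lamina"},
    "Nuclear pore complex": {"Nuclear pore complex", "Nuclear lamina"},
}


def is_correct(true_location: str, predicted_label: str) -> bool:
    """True if the predicted label counts as correct for the annotated location."""
    # Locations without an entry only accept their own name.
    accepted = EQUIVALENT_LABELS.get(true_location, {true_location})
    return predicted_label.strip().lower() in {label.lower() for label in accepted}
```

Under this rule a centromeric protein labelled "Chromatin" is scored as a hit, while the same label for a nucleolar protein is not.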
Lei Z, Dai Y (2005) An SVM-based system for predicting protein subnuclear localizations. BMC Bioinformatics 6: 291.
Shen HB, Chou KC (2007) Nuc-PLoc: a new web-server for predicting protein subnuclear localization by fusing PseAA composition and PsePSSM. Protein Eng Des Sel 20: 561–567.
Han GS, Yu ZG, Anh V, Krishnajith AP, Tian YC (2013) An ensemble method for predicting subnuclear localizations from primary protein structures. PLoS One 8: e57225.