Abstract
BACKGROUND: The completion of various genome sequencing projects has produced a massive amount of gene sequence information, which calls for large-scale computational methods that predict protein localization from sequence. A protein's localization can provide valuable information about its molecular function and the biological pathway in which it participates. Predicting localization at the subnuclear level is a challenging task. In our previous work we proposed an SVM-based system that uses protein sequence information for this prediction task. In this work, we assess protein similarity with Gene Ontology (GO) and improve the system's performance by adding a nearest-neighbor classifier module that uses a similarity measure derived from the GO annotation terms of protein sequences.
Year: 2006 PMID: 17090318 PMCID: PMC1660555 DOI: 10.1186/1471-2105-7-491
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
The summary of the nuclear proteins
| Class label | Compartment | Number of sequences |
| 1 | PML BODY | 38 |
| 2 | Nuclear Lamina | 55 |
| 3 | Nuclear Splicing Speckles | 56 |
| 4 | Chromatin | 61 |
| 5 | Nucleoplasm | 75 |
| 6 | Nucleolus | 219 |
| - | Multiple localizations | 92 |
Predictive results obtained by using different similarity measures for GO term pairs
| Semantic similarity method | Lord | SimLP | Exact Match |
| Compartment | MCC (Accuracy %) | ||
| PML BODY | 0.223 (31.6) | 0.253 (34.2) | 0.250 (31.6) |
| Nuclear Lamina | 0.579 (60.0) | 0.578 (63.6) | 0.578 (63.6) |
| Nuclear Splicing Speckles | 0.598 (66.1) | 0.607 (62.5) | 0.63 (62.5) |
| Chromatin | 0.511 (59.0) | 0.518 (60.7) | 0.509 (57.4) |
| Nucleoplasm | 0.411 (50.7) | 0.504 (56.0) | 0.483 (54.7) |
| Nucleolus | 0.615 (75.3) | 0.656 (79.0) | 0.642 (80.8) |
| Overall for Single-localization | |||
(Based on SUM_Match: The similarity of two proteins is defined as the sum of similarity scores over all matched GO term pairs.)
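The MCC (Accuracy %) cells can be read with a small one-vs-rest calculation. The sketch below assumes the standard Matthews correlation coefficient and treats per-compartment accuracy as correct predictions divided by class size, which is consistent with the class sizes in the summary table (13 of 38 PML BODY sequences gives 34.2%). The function names are hypothetical and not taken from the paper, and the false-positive and true-negative counts needed for MCC cannot be recovered from the tables.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for one compartment vs. the rest."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: report 0 when any marginal count is zero.
        return 0.0
    return (tp * tn - fp * fn) / denom


def class_accuracy(tp, fn):
    """Percentage of a compartment's sequences that were predicted correctly."""
    return 100.0 * tp / (tp + fn)


# PML BODY under SimLP: 13 correct out of 38 sequences.
print(round(class_accuracy(tp=13, fn=25), 1))  # 34.2
```

An MCC of 1 corresponds to a compartment predicted with no false positives and no false negatives; values near 0 indicate performance close to chance for that compartment.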
Predictive results obtained by using various similarity definitions for proteins
| Similarity Definition | MAX | AVG | SUM | AVG_BestPairs |
| Compartment | MCC (Accuracy %) | |||
| PML BODY | 0.189 (28.9) | 0.153 (34.2) | 0.129 (76.3) | -0.031 (0.0) |
| Nuclear Lamina | 0.344 (45.5) | 0.535 (63.6) | 0.455 (45.5) | 0.315 (61.8) |
| Nuclear Splicing Speckles | 0.377 (35.7) | 0.251 (71.4) | 0.289 (33.9) | 0.013 (12.5) |
| Chromatin | 0.236 (19.7) | 0.218 (16.4) | 0.112 (4.9) | 0.142 (8.2) |
| Nucleoplasm | 0.272 (29.3) | 0.039 (9.3) | -0.079 (4.0) | 0.118 (6.7) |
| Nucleolus | 0.367 (75.8) | 0.431 (44.7) | 0.214 (26.0) | 0.289 (75.3) |
| Overall for Single-localization | ||||
| Similarity Definition | SUM_BestPairs | AVG_Match | SUM_Match | MAX_Match |
| Compartment | MCC (Accuracy %) | |||
| PML BODY | 0.242 (44.7) | 0.187 (34.2) | 0.253 (34.2) | 0.211 (31.6) |
| Nuclear Lamina | 0.53 (67.3) | 0.586 (60.0) | 0.578 (63.6) | 0.344 (45.5) |
| Nuclear Splicing Speckles | 0.438 (46.4) | 0.397 (66.1) | 0.607 (62.5) | 0.487 (46.4) |
| Chromatin | 0.325 (36.1) | 0.467 (45.9) | 0.518 (60.7) | 0.263 (21.3) |
| Nucleoplasm | 0.284 (36.0) | 0.332 (32.0) | 0.504 (56.0) | 0.298 (32.0) |
| Nucleolus | 0.512 (66.7) | 0.615 (72.6) | 0.656 (79.0) | 0.407 (76.7) |
| Overall for Single-localization | ||||
(Based on SimLP: The GO term similarity is defined on the longest path shared by two GO terms [22].)
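As a rough illustration of a longest-shared-path similarity, the sketch below computes, for two terms in a toy ontology, the number of nodes on the longest root-to-ancestor path that both terms share. This is one plausible reading of the SimLP note, not the definition from [22]; the ontology, the term names, and the helper functions are all hypothetical.

```python
from functools import lru_cache

# Toy ontology (hypothetical terms, not real GO structure): term -> parents.
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a"],
    "d": ["a", "b"],
    "e": ["c", "d"],
    "other_root": [],  # a term from a separate ontology branch
}


@lru_cache(maxsize=None)
def longest_depth(term):
    """Number of nodes on the longest path from a root down to `term`."""
    parents = PARENTS[term]
    if not parents:
        return 1
    return 1 + max(longest_depth(p) for p in parents)


def ancestors(term):
    """The term itself plus every term reachable through parent links."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def sim_lp(t1, t2):
    """Length (in nodes) of the longest path shared by t1 and t2; 0 if none."""
    common = ancestors(t1) & ancestors(t2)
    return max((longest_depth(t) for t in common), default=0)


print(sim_lp("c", "d"))           # 2: the shared path is root -> a
print(sim_lp("e", "e"))           # 4: a term shares its whole longest path with itself
print(sim_lp("c", "other_root"))  # 0: no common ancestor
```

Under this reading, a term compared with itself scores its own depth, and terms with no common ancestor score 0.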
Results obtained by using different numbers of homolog(s)
| Number of homologs (up to n) | n = 1 | n = 5 |
| Compartment | MCC (Accuracy %) | |
| PML BODY | 0.262 (39.5) | 0.253 (34.2) |
| Nuclear Lamina | 0.395 (43.6) | 0.578 (63.6) |
| Nuclear Splicing Speckles | 0.566 (57.1) | 0.607 (62.5) |
| Chromatin | 0.474 (47.5) | 0.518 (60.7) |
| Nucleoplasm | 0.457 (53.3) | 0.504 (56.0) |
| Nucleolus | 0.606 (79.5) | 0.656 (79.0) |
| Overall for Single-localization | ||
Results obtained from the previous and current systems
| Method | AA (ver1) | GO-AA (ver2) |
| Compartment | MCC (Accuracy %) | |
| PML BODY | 0.172 (29.0) | 0.253 (34.2) |
| Nuclear Lamina | 0.338 (43.6) | 0.578 (63.6) |
| Nuclear Splicing Speckles | 0.363 (35.7) | 0.607 (62.5) |
| Chromatin | 0.260 (19.7) | 0.518 (60.7) |
| Nucleoplasm | 0.206 (22.7) | 0.504 (56.0) |
| Nucleolus | 0.367 (76.7) | 0.656 (79.0) |
| Overall for Single-localization | ||
| Multi-localization | ||
Results obtained with and without the use of the six GO terms related to subnuclear compartments.
| GO Module with BLAST homologs | With the subnuclear compartment GO terms | Without the subnuclear compartment GO terms |
| Compartment | MCC (Accuracy %) | |
| PML BODY | 0.291 (40.0) | 0.290 (40.0) |
| Nuclear Lamina | 0.626 (67.4) | 0.609 (65.1) |
| Nuclear Splicing Speckles | 0.657 (70.0) | 0.640 (73.7) |
| Chromatin | 0.544 (63.5) | 0.543 (61.5) |
| Nucleoplasm | 0.543 (58.5) | 0.548 (60.0) |
| Nucleolus | 0.744 (82.5) | 0.723 (80.1) |
| Overall for Single-localization | ||
| Number of proteins predicted by the GO module | ||
Figure 1. Depth of GO terms.
The SimLP scores for GO term pairs
| | GO:0005737 | GO:0006412 | GO:0006415 | GO:0016149 |
| GO:0005488 | 0 | 0 | 0 | 2 |
| GO:0005515 | 0 | 0 | 0 | 2 |
| GO:0006412 | 0 | 7 | 7 | 0 |
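To show how term-level scores like these could be rolled up into a protein-level similarity, the sketch below applies four aggregations to the scores in this table. It assumes, hypothetically, that the row terms annotate one protein and the column terms annotate another. It also assumes that a "matched" pair means the same GO term annotates both proteins; the source does not spell this out, and the MAX, AVG, and SUM functions are straightforward guesses at what those names denote.

```python
# SimLP scores copied from the table: (row term, column term) -> score.
SIM_LP = {
    ("GO:0005488", "GO:0005737"): 0, ("GO:0005488", "GO:0006412"): 0,
    ("GO:0005488", "GO:0006415"): 0, ("GO:0005488", "GO:0016149"): 2,
    ("GO:0005515", "GO:0005737"): 0, ("GO:0005515", "GO:0006412"): 0,
    ("GO:0005515", "GO:0006415"): 0, ("GO:0005515", "GO:0016149"): 2,
    ("GO:0006412", "GO:0005737"): 0, ("GO:0006412", "GO:0006412"): 7,
    ("GO:0006412", "GO:0006415"): 7, ("GO:0006412", "GO:0016149"): 0,
}

# Hypothetical assignment of terms to two proteins (rows vs. columns).
PROTEIN_A = ["GO:0005488", "GO:0005515", "GO:0006412"]
PROTEIN_B = ["GO:0005737", "GO:0006412", "GO:0006415", "GO:0016149"]


def pair_scores(terms_a, terms_b, sim):
    """Scores for every cross pair of terms from the two proteins."""
    return [sim[(a, b)] for a in terms_a for b in terms_b]


def sim_max(terms_a, terms_b, sim):
    return max(pair_scores(terms_a, terms_b, sim))


def sim_avg(terms_a, terms_b, sim):
    scores = pair_scores(terms_a, terms_b, sim)
    return sum(scores) / len(scores)


def sim_sum(terms_a, terms_b, sim):
    return sum(pair_scores(terms_a, terms_b, sim))


def sim_sum_match(terms_a, terms_b, sim):
    # Assumption: only terms present in both annotation sets count as matched.
    shared = set(terms_a) & set(terms_b)
    return sum(sim[(t, t)] for t in shared)


print(sim_max(PROTEIN_A, PROTEIN_B, SIM_LP))        # 7
print(sim_avg(PROTEIN_A, PROTEIN_B, SIM_LP))        # 1.5
print(sim_sum(PROTEIN_A, PROTEIN_B, SIM_LP))        # 18
print(sim_sum_match(PROTEIN_A, PROTEIN_B, SIM_LP))  # 7
```

Under these assumptions only GO:0006412 is shared, so the matched sum is its self-score of 7, while the plain sum also picks up the nonzero scores between distinct terms.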