Tatyana Goldberg, Tobias Hamp, Burkhard Rost.
Abstract
MOTIVATION: Subcellular localization is one aspect of protein function. Despite advances in high-throughput imaging, localization maps remain incomplete. Several methods accurately predict localization, but many challenges remain to be tackled.
Year: 2012 PMID: 22962467 PMCID: PMC3436817 DOI: 10.1093/bioinformatics/bts390
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Hierarchical architecture of LocTree2. The localization prediction follows a different tree for each of the three domains of life: (a) archaea, (b) bacteria and (c) eukaryota. Each hierarchy mimics the biological sorting mechanism in that domain (in eukaryotes, membrane and non-membrane proteins are treated separately). The branches represent paths of the protein sorting, the leaves represent the final prediction of one localization class and the internal nodes are the decision points along the path. These decisions are implemented as binary support vector machines (SVMs). CHL, chloroplast; CHLM, chloroplast membrane; CYT, cytosol; ER, endoplasmic reticulum; ERM, endoplasmic reticulum membrane; EXT, extra-cellular; FIM, fimbrium; GOL, Golgi apparatus; GOLM, Golgi apparatus membrane; MIT, mitochondria; MITM, mitochondria membrane; NUC, nucleus; NUCM, nucleus membrane; OM, outer membrane; PERI, periplasmic space; PER, peroxisome; PERM, peroxisome membrane; PM, plasma membrane; PLAS, plastid; VAC, vacuole; VACM, vacuole membrane.
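The idea of routing a protein through a tree of binary decisions can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration only: the `Node` and `Leaf` types, the three-leaf toy topology and the two placeholder decision rules are hypothetical, they stand in for trained SVMs and do not reproduce the actual LocTree2 trees or features.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union


@dataclass
class Leaf:
    label: str  # final localization class, e.g. "CYT"


@dataclass
class Node:
    decide: Callable[[str], bool]  # placeholder for a trained binary SVM
    if_true: Union["Node", Leaf]
    if_false: Union["Node", Leaf]


def predict(tree: Union[Node, Leaf], sequence: str) -> Tuple[str, List[bool]]:
    """Walk from the root to a leaf, recording each binary decision taken."""
    path: List[bool] = []
    node = tree
    while isinstance(node, Node):
        branch = node.decide(sequence)
        path.append(branch)
        node = node.if_true if branch else node.if_false
    return node.label, path


# Hypothetical decision rules; they carry no biological meaning.
def leucine_rich(seq: str) -> bool:
    return seq.count("L") / len(seq) > 0.3


def starts_with_m(seq: str) -> bool:
    return seq.startswith("M")


# Hypothetical toy topology with three leaves.
toy_tree = Node(leucine_rich, Leaf("PM"), Node(starts_with_m, Leaf("EXT"), Leaf("CYT")))

print(predict(toy_tree, "MKKAAA"))
```

Each prediction returns both the leaf label and the sequence of branch decisions, which makes the path through the hierarchy inspectable.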
Fig. 2. High performance in cross-validation. For the cross-validation sets (a: averages over 479 bacterial proteins and b: averages over 1682 eukaryotic proteins), LocTree2 reached high levels of sustained performance. Overall, performance tended to correlate with the number of representatives (pie charts: inner ring: composition in the corresponding data set and outer ring: composition in correct predictions). Exceptions were membrane-bound classes in eukaryotes, for which the performance tended to be better than that for the corresponding non-membrane-bound class (e.g. MIT = mitochondrial proteins versus MITM = membrane-linked mitochondrial proteins). Localization classes as in Figure 1; performance measures: Acc, accuracy; Cov, coverage; gAv, geometric average of Acc and Cov; Q, overall prediction accuracy (Q6 for six and Q18 for 18 classes). Standard errors were estimated by bootstrapping (see Section 2). Classes with fewer than 20 members were excluded.
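The performance measures named in this caption can be sketched as follows. The formulas are assumptions based on conventional usage, since the caption does not spell them out: Acc as the percentage of predictions of a class that are correct, Cov as the percentage of observed members of a class that are predicted correctly, gAv as the square root of their product, and Q as the overall percentage of correct predictions. The bootstrap below is a generic resampling estimate; the number of resamples and the function names are hypothetical and not taken from the paper's Section 2.

```python
import math
import random
import statistics


def class_scores(observed, predicted, label):
    """Return (Acc, Cov, gAv) in percent for one localization class."""
    correct = sum(o == p == label for o, p in zip(observed, predicted))
    n_pred = sum(p == label for p in predicted)
    n_obs = sum(o == label for o in observed)
    acc = 100 * correct / n_pred if n_pred else 0.0
    cov = 100 * correct / n_obs if n_obs else 0.0
    return acc, cov, math.sqrt(acc * cov)


def q_overall(observed, predicted):
    """Overall accuracy Q in percent across all classes."""
    return 100 * sum(o == p for o, p in zip(observed, predicted)) / len(observed)


def bootstrap_se(observed, predicted, n_resamples=1000, seed=0):
    """Standard deviation of Q over resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(observed)
    qs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        qs.append(q_overall([observed[i] for i in idx], [predicted[i] for i in idx]))
    return statistics.stdev(qs)


# Made-up toy labels, not data from the paper.
observed = ["CYT", "CYT", "PM", "EXT", "PM", "CYT"]
predicted = ["CYT", "PM", "PM", "EXT", "PM", "CYT"]

print(class_scores(observed, predicted, "CYT"))
print(q_overall(observed, predicted))
print(bootstrap_se(observed, predicted))
```

With so few toy samples the bootstrap estimate is large; it shrinks as the test set grows.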
Fig. 3. More reliable predictions are more accurate. The curves show the percentage accuracy/coverage for LocTree2 predictions above a given threshold in the reliability index (RI; from 0 = unreliable to 100 = most reliable). True positives are the number of correct predictions with reliability indices above the given threshold, false negatives are the number of correct predictions with reliability indices below the threshold and false positives are the number of wrong predictions with reliability indices above the threshold. The curves were obtained on cross-validated test sets of bacterial (gray line) and eukaryotic (black line) proteins. Half of all eukaryotic proteins are predicted at RI>80; for these, Q18 is above 92% (black arrow). As the number of localization classes is lower for bacteria, the corresponding accuracy is higher (Q6 is above 95% at 50% coverage, gray arrow).
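The true positive, false negative and false positive counts defined in this caption can be turned into one point on the accuracy/coverage curve as sketched below. Two things are assumptions: the formulas accuracy = TP / (TP + FP) and coverage = TP / (TP + FN), which follow common usage, and the treatment of a prediction exactly at the threshold as "above" it. The function name and the toy data are hypothetical.

```python
def accuracy_coverage(predictions, threshold):
    """predictions: list of (reliability_index, is_correct) pairs.

    Returns (accuracy, coverage) in percent; a value is None when undefined.
    """
    # Assumption: a reliability index equal to the threshold counts as above it.
    tp = sum(1 for ri, ok in predictions if ok and ri >= threshold)
    fn = sum(1 for ri, ok in predictions if ok and ri < threshold)
    fp = sum(1 for ri, ok in predictions if not ok and ri >= threshold)
    accuracy = 100 * tp / (tp + fp) if tp + fp else None
    coverage = 100 * tp / (tp + fn) if tp + fn else None
    return accuracy, coverage


# Made-up toy predictions, not data from the paper.
toy = [(95, True), (90, True), (85, False), (60, True), (40, False), (20, True)]

for threshold in (0, 50, 90):
    print(threshold, accuracy_coverage(toy, threshold))
```

Sweeping the threshold from 0 to 100 traces the trade-off: stricter thresholds keep fewer predictions but a larger share of them is correct.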
Performance comparison on independent data sets
| Method | New SWISS-PROT | LocDB | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| | Q(5) | Q(3) | Q(9) | Q(6) | Q(5) | Q(9) | Q(8) | Q(6) | Q(8) | Q(7) | Q(6) |
| LocTree2 | 42 ± 8 | ||||||||||
| CELLO v. 2.5 | 57 ± 22 | — | 46 ± 16 | — | — | 26 ± 18 | — | — | 40 ± 8 | — | — |
| WoLF PSORT | — | — | 62 ± 14 | — | — | 19 ± 15 | — | — | — | — | |
| PSORTb 3.0 | 71 ± 21 | — | — | — | — | — | — | — | — | — | — |
| MultiLoc2 | — | — | — | 60 ± 16 | — | — | 24 ± 18 | — | — | 42 ± 9 | — |
| LocTree | 77 ± 21 | — | — | 62 ± 17 | — | — | 24 ± 18 | — | — | 48 ± 9 | |
Data ‘New SWISS-PROT’: 28 sequence-unique bacterial and 52 eukaryotic proteins added to SWISS-PROT between releases 2011_04 and 2012_02 (sequence uniqueness was ascertained both within this set and from any protein in this set to any other protein previously in SWISS-PROT). Data ‘A. thaliana’ and ‘H. sapiens’: 43 Arabidopsis thaliana and 201 Homo sapiens proteins from the LocDB database (as for ‘New SWISS-PROT’: sequence unique with respect to itself and to SWISS-PROT 2011_04). Qn, the overall prediction accuracy in n classes; highest value in each column in bold; values ± standard error (see Section 2).