Andrea Pierleoni, Pier Luigi Martelli, Piero Fariselli, Rita Casadio.
Abstract
The Eukaryotic Subcellular Localization DataBase (eSLDB) collects annotations of subcellular localization for eukaryotic proteomes. So far, five proteomes have been processed and stored: Homo sapiens, Mus musculus, Caenorhabditis elegans, Saccharomyces cerevisiae and Arabidopsis thaliana. For each sequence, the database lists the localization obtained with three different approaches: (i) experimentally determined (when available); (ii) homology-based (when possible); and (iii) predicted. The predicted localization is computed with a suite of machine-learning-based methods developed in house. All data are available at our website and can be searched by sequence, by protein code and/or by protein description. More complex searches can be performed by combining different search fields and keys. All data in the database can be freely downloaded in flat-file format. The database is available at http://gpcr.biocomp.unibo.it/esldb/.
Year: 2006 PMID: 17108361 PMCID: PMC1669738 DOI: 10.1093/nar/gkl775
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Number of proteins with an experimental or a similarity-based annotation of the subcellular localization
| Organism | No. of sequences in the genome | No. of sequences with a SwissProt entry | No. of sequences experimentally annotated | No. of sequences annotated by similarity |
|---|---|---|---|---|
| Homo sapiens | 48 926 | 12 927 (26%) | 9341 (19%) | 33 225 (68%) |
| Mus musculus | 31 302 | 6228 (20%) | 4669 (15%) | 20 764 (66%) |
| Caenorhabditis elegans | 25 714 | 3327 (13%) | 1612 (6%) | 11 222 (44%) |
| Saccharomyces cerevisiae | 6680 | 5296 (79%) | 3106 (46%) | 3503 (52%) |
| Arabidopsis thaliana | 30 600 | 4030 (13%) | 2645 (9%) | 10 121 (33%) |
The sequences experimentally annotated are included among those annotated by similarity.
Number of sequences in the 17 different subcellular localizations as derived with experimental and similarity-based annotations
| Subcellular localization | Experimental annotation | Similarity-based annotation |
|---|---|---|
| Cell wall | 73 | 404 |
| Cytoplasm | 4468 | 23 492 |
| Cytoskeleton | 519 | 4099 |
| Endoplasmic reticulum | 1058 | 4115 |
| Endosome | 202 | 1091 |
| Extracellular | 340 | 1719 |
| Golgi | 806 | 3231 |
| Lysosome | 208 | 959 |
| Membrane | 1913 | 9956 |
| Mitochondrion | 1829 | 4490 |
| Nucleus | 3413 | 25 812 |
| Peroxisome | 112 | 718 |
| Plastid | 429 | 1525 |
| Secretory pathway | 2157 | 7262 |
| Transmembrane | 6773 | 19 586 |
| Vacuole | 179 | 506 |
| Vesicles | 390 | 2289 |
The sequences experimentally annotated are included among those annotated by similarity.
Figure 1. Flow chart of the prediction pipeline adopted in eSLDB. SVM, support vector machine. BaCelLo, Spep and ENSEMBLE are predictive methods described previously (8,17,18).
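The class names in the performance table suggest a top-down hierarchy: transmembrane versus soluble, then secretory versus intracellular, then organelle versus nucleus or cytoplasm, then nucleus versus cytoplasm. A cascade of this kind can be sketched in Python as a chain of binary decisions. This is only an illustration under stated assumptions: the four predicate functions are hypothetical stand-ins and not the interfaces of BaCelLo, Spep or ENSEMBLE, the branching order is inferred rather than copied from Figure 1, and the plant-specific chloroplast branch is left out.

```python
from typing import Callable

# A predicate takes a protein sequence and returns a yes/no decision.
Predicate = Callable[[str], bool]


def predict_localization(
    sequence: str,
    is_transmembrane: Predicate,
    is_secreted: Predicate,
    is_organellar: Predicate,
    is_nuclear: Predicate,
) -> str:
    """Walk the assumed hierarchy top-down and return a single leaf label."""
    if is_transmembrane(sequence):
        return "Transmembrane"
    # Everything below this point is treated as soluble.
    if is_secreted(sequence):
        return "Secretory"
    # Everything below this point is treated as intracellular.
    if is_organellar(sequence):
        return "Mitochondrion"
    return "Nucleus" if is_nuclear(sequence) else "Cytoplasm"


# Toy predicates for demonstration only; they carry no biological meaning.
toy = dict(
    is_transmembrane=lambda s: "LLLLL" in s,
    is_secreted=lambda s: s.startswith("MK"),
    is_organellar=lambda s: s.startswith("MR"),
    is_nuclear=lambda s: "KKKK" in s,
)

print(predict_localization("MAAKKKKSE", **toy))
```

Passing the classifiers in as arguments keeps the cascade independent of any particular predictor, so each step can be replaced without touching the control flow.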
Number of sequences in the six predicted subcellular localizations
| Subcellular localization | Homo sapiens | Mus musculus | Caenorhabditis elegans | Saccharomyces cerevisiae | Arabidopsis thaliana |
|---|---|---|---|---|---|
| Transmembrane | 10 229 | 7750 | 6593 | 1657 | 8079 |
| Secretory | 7816 | 4971 | 5172 | 348 | 3001 |
| Nucleus | 12 358 | 6820 | 4733 | 1717 | 7649 |
| Cytoplasm | 14 720 | 9356 | 6280 | 1710 | 6033 |
| Mitochondrion | 3630 | 2326 | 1454 | 1112 | 963 |
| Chloroplast | — | — | — | — | 4875 |
Performance of the prediction pipeline as compared with the experimental and the similarity-based annotations
| Subcellular localization | No. of proteins (experimental annotations) | Coverage (%) (experimental annotations) | Accuracy (%) (experimental annotations) | No. of proteins (similarity-based annotations) | Coverage (%) (similarity-based annotations) | Accuracy (%) (similarity-based annotations) |
|---|---|---|---|---|---|---|
| Transmembrane | 2244 | 87.7 | 93.0 | 6660 | 76.5 | 82.3 |
| Soluble | 4200 | 92.4 | 86.6 | 18 474 | 92.3 | 83.0 |
| Secretory pathway | 865 | 82.3 | 60.6 | 2844 | 68.8 | 48.3 |
| Intracellular | 3364 | 89.0 | 90.5 | 15 776 | 87.2 | 83.5 |
| Nucleus or cytoplasm | 3013 | 88.2 | 90.7 | 14 788 | 83.0 | 82.6 |
| Nucleus | 2107 | 66.5 | 85.1 | 8973 | 54.5 | 69.6 |
| Cytoplasm | 1410 | 56.7 | 62.4 | 8779 | 51.0 | 57.2 |
| Organelle (mitochondrion) | 398 | 58.0 | 60.9 | 1230 | 42.3 | 31.9 |
The indentation of the subcellular localization names reflects the hierarchy of the prediction (see Figure 1). Coverage = (no. of proteins of class i predicted as class i)/(total no. of proteins in class i). Accuracy = (no. of proteins of class i predicted as class i)/(total no. of proteins predicted as class i).
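The two definitions above correspond to what is elsewhere called recall (coverage) and precision (accuracy), computed per class. A minimal Python sketch of the calculation follows; the function name and the toy labels are illustrative assumptions and are not taken from eSLDB.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def coverage_and_accuracy(
    observed: Sequence[str], predicted: Sequence[str]
) -> Dict[str, Tuple[float, float]]:
    """Return {class: (coverage, accuracy)} as fractions between 0 and 1."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")

    # Proteins of class i predicted as class i.
    correct = Counter(o for o, p in zip(observed, predicted) if o == p)
    in_class = Counter(observed)      # total proteins in class i
    called = Counter(predicted)       # total proteins predicted as class i

    scores = {}
    for label in sorted(set(observed) | set(predicted)):
        cov = correct[label] / in_class[label] if in_class[label] else float("nan")
        acc = correct[label] / called[label] if called[label] else float("nan")
        scores[label] = (cov, acc)
    return scores


# Toy example: five proteins with made-up annotations and predictions.
observed = ["Nucleus", "Nucleus", "Nucleus", "Cytoplasm", "Cytoplasm"]
predicted = ["Nucleus", "Nucleus", "Cytoplasm", "Cytoplasm", "Nucleus"]
scores = coverage_and_accuracy(observed, predicted)
for label, (cov, acc) in scores.items():
    print(f"{label}: coverage {cov:.1%}, accuracy {acc:.1%}")
```

In the toy data, two of the three annotated nuclear proteins are predicted as nuclear, and two of the three nuclear predictions are correct, so both measures are 2/3 for that class.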
Figure 2. Output page of eSLDB for a single protein chain.