Yao-Qing Shen, Gertraud Burger.
Abstract
BACKGROUND: The eukaryotic cell has an intricate architecture with compartments and substructures dedicated to particular biological processes. Knowing the subcellular location of proteins not only indicates how bio-processes are organized in different cellular compartments, but also contributes to unravelling the function of individual proteins. Computational localization prediction is possible based on sequence information alone, and has been successfully applied to proteins from virtually all subcellular compartments and all domains of life. However, we realized that current prediction tools do not perform well on partial protein sequences such as those inferred from Expressed Sequence Tag (EST) data, limiting the exploitation of the large and taxonomically most comprehensive body of sequence information from eukaryotes.
Year: 2010 PMID: 21078192 PMCID: PMC3000424 DOI: 10.1186/1471-2105-11-563
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Performance of available tools and TESTLoc on plant EST-peptides¹
| Predictor | chl² SN | chl PPV | cyt SN | cyt PPV | end SN | end PPV | ext SN | ext PPV | mit SN | mit PPV | nuc SN | nuc PPV | per SN | per PPV | pla SN | pla PPV | vac SN | vac PPV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TargetP | 19 | 62 | 29 | 25 | 18 | 44 | ||||||||||||
| Protein Prowler | 14 | 66 | 34 | 32 | 26 | 60 | 25 | 100 | ||||||||||
| BaCelLo | 61 | 69 | 50 | 28 | 60 | 32 | 9 | 90 | 88 | 39 | ||||||||
| Wolf-PSORT | 27 | 41 | 60 | 27 | 0 | 0 | 18 | 36 | 8 | 58 | 65 | 71 | 16 | 50 | 0 | 0 | 0 | 0 |
| YLoc | 25 | 77 | 84 | 13 | 22 | 13 | 52 | 98 | 35 | 84 | 82 | 80 | 50 | 26 | 0 | 0 | 15 | 31 |
| KnowPred | NA | NA | 71 | 23 | 0 | 0 | 23 | 36 | 61 | 49 | 86 | 40 | 58 | 54 | 0 | 0 | NA | NA |
| MultiLoc2 | 14 | 81 | 84 | 12 | 9 | 8 | 44 | 48 | 18 | 37 | 48 | 67 | 67 | 10 | 0 | 0 | 45 | 54 |
| BLAST³ | 76 | 97 | 77 | 62 | 64 | 100 | 81 | 95 | 57 | 87 | 77 | 96 | 25 | 60 | 0 | 0 | 0 | 0 |
¹ Numbers in %. Bold numbers are the results of TESTLoc, the tool described here, which is tailored for ESTs. Results for full-length proteins are compiled in Additional file 8.
² Abbreviations: chl, chloroplast; cyt, cytosol; end, endoplasmic reticulum; ext, extracellular space; mit, mitochondrion; nuc, nucleus; per, peroxisome; pla, plasma membrane; vac, vacuole; SN, sensitivity; PPV, positive predictive value.
³ Note that all test data have homologs in databases, which in practice is rarely the case; see text.
Number of EST-peptides used in this study
| dataset | chl | cyt | end | ext | mit | nuc | per | pla | vac | total |
|---|---|---|---|---|---|---|---|---|---|---|
| | 97 | 53 | 5 | 9 | 167 | 41 | 5 | 4 | 5 | 386 |
| Expanded plant data | 679 | 122 | 11 | 48 | 309 | 260 | 12 | 7 | 29 | 1477 |
Figure 1. Fragmentation procedure of plant protein sequences in order to expand the EST-peptide dataset. Open bars, full-length proteins; filled bars, fragmented protein sequences. Proteins shorter than 200 residues remained unchanged. Proteins ranging from 200 to 400 residues were fragmented into two pieces. Proteins longer than 400 residues were fragmented into three pieces. See text for details.
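The length thresholds in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `fragment_protein` is hypothetical, and cutting into contiguous, non-overlapping pieces of near-equal length is an assumption, since the caption defers the exact cut points to the main text.

```python
def fragment_protein(seq: str) -> list[str]:
    """Split a protein sequence following the length thresholds of Figure 1.

    Assumption: pieces are contiguous, non-overlapping, and as equal in
    length as possible; the caption does not specify the cut points.
    """
    n = len(seq)
    if n < 200:
        pieces = 1  # shorter than 200 residues: left unchanged
    elif n <= 400:
        pieces = 2  # 200 to 400 residues: two pieces
    else:
        pieces = 3  # longer than 400 residues: three pieces

    # Spread any remainder over the first pieces so lengths differ by at most 1.
    base, extra = divmod(n, pieces)
    fragments, start = [], 0
    for i in range(pieces):
        end = start + base + (1 if i < extra else 0)
        fragments.append(seq[start:end])
        start = end
    return fragments
```

Under these assumptions, a 300-residue protein yields two 150-residue fragments, and joining the fragments always restores the original sequence.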
Amino acids grouped according to their chemical properties
| Group C, chemical properties | Amino acids | Group D, Devlin structural properties | Amino acids |
|---|---|---|---|
| Acidic | D, E | Monoamino, monocarboxylic | G, A |
| Basic | H, K, R | Unsubstituted | V, L, I |
| Aromatic | F, W, Y | Heterocyclic | P, F |
| Small hydroxyl | S, T | Aromatic | W, Y |
| Sulphur-containing | C, M | Thioether | M |
| Aliphatic1 | A, G, P | Hydroxy | S, T |
| Aliphatic2 | I, L, V | Mercapto | C |
| Amide | N, Q | Carboxamide | N, Q |
| | | Monoamino, dicarboxylic | D, E |
| | | Diamino, monocarboxylic | H, K, R |
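As an illustration of how such groupings can be applied, the sketch below rewrites a protein sequence in a reduced alphabet with one symbol per group. The group memberships are transcribed from the table above; the one-letter group codes (`a`, `b`, ...), the function names, and the handling of unknown residues are hypothetical choices made for this example and are not taken from the source.

```python
# Group C (chemical properties) and group D (Devlin structural properties),
# transcribed from the table above.
GROUP_C = {
    "acidic": "DE",
    "basic": "HKR",
    "aromatic": "FWY",
    "small hydroxyl": "ST",
    "sulphur-containing": "CM",
    "aliphatic1": "AGP",
    "aliphatic2": "ILV",
    "amide": "NQ",
}
GROUP_D = {
    "monoamino, monocarboxylic": "GA",
    "unsubstituted": "VLI",
    "heterocyclic": "PF",
    "aromatic": "WY",
    "thioether": "M",
    "hydroxy": "ST",
    "mercapto": "C",
    "carboxamide": "NQ",
    "monoamino, dicarboxylic": "DE",
    "diamino, monocarboxylic": "HKR",
}


def build_lookup(groups):
    """Map each residue to a one-letter code for its group (a, b, c, ...).

    The codes are arbitrary labels chosen for this sketch.
    """
    lookup = {}
    for code, residues in zip("abcdefghij", groups.values()):
        for residue in residues:
            lookup[residue] = code
    return lookup


def recode(seq, groups, unknown="x"):
    """Rewrite `seq` in the reduced alphabet; unlisted residues become `unknown`."""
    lookup = build_lookup(groups)
    return "".join(lookup.get(residue, unknown) for residue in seq.upper())
```

For example, `recode("DEHK", GROUP_C)` gives `"aabb"`, because D and E fall in the first group and H and K in the second.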
Figure 2. Training and evaluation of SVM predictors. The circle and pies indicate the dataset and portions thereof. The procedure in each dashed box was repeated ten times. The whole dataset was randomly divided into ten parts; nine parts were combined to construct the SVM model, and the remaining part was used to evaluate the model. The data combined for model construction were further divided randomly into ten subsets, of which nine were combined to serve as training data and the tenth served as test data. See text for details.
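The two-level split described in this caption can be sketched with index lists alone. The code below is a hedged illustration: the function names, the fixed seeds, and the interleaved fold assignment are assumptions, and SVM training itself is omitted because the caption does not say how the inner test results were used.

```python
import random


def k_fold_indices(indices, k=10, seed=0):
    """Shuffle `indices` and yield (kept, held_out) index lists for each of k folds."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        kept = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield kept, held_out


def nested_splits(n_samples, k=10, seed=0):
    """Yield (train, test, evaluation) index lists for a nested k x k split."""
    outer_folds = k_fold_indices(range(n_samples), k, seed)
    for outer, (construction, evaluation) in enumerate(outer_folds):
        # Inner split: only the model-construction data is divided again,
        # so the evaluation part never overlaps with training or test data.
        for train, test in k_fold_indices(construction, k, seed + outer + 1):
            yield train, test, evaluation
```

With 100 samples and `k=10`, each outer round holds out 10 samples for evaluation and divides the remaining 90 into 81 training and 9 test samples, giving 100 (train, test, evaluation) triples in total.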
Figure 3. Independent evaluation of SVM predictors based on different representations of amino acid composition. The performance was assessed by the Matthews Correlation Coefficient (MCC). For most classes, the best MCC was obtained with the 4th order amino acid composition (the frequency of tetra-peptides). Amino acid group-C and group-D composition yielded similar results (see Additional file 5).
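Since the caption equates 4th order amino acid composition with tetra-peptide frequency, a k-th order composition can be sketched as a k-mer frequency count. The function name, the use of overlapping windows, and the sparse dictionary output are assumptions made for illustration; the source does not describe the exact encoding.

```python
from collections import Counter


def kth_order_composition(seq, k):
    """Return the relative frequency of each overlapping k-mer in `seq`.

    Only k-mers that occur are returned (sparse representation).
    Sequences shorter than k yield an empty dict.
    """
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: count / total for kmer, count in counts.items()}
```

Calling `kth_order_composition(seq, 4)` gives tetra-peptide frequencies; the same function applies to a sequence recoded in a reduced alphabet, which would correspond to a group-based composition.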
Evaluation of results from top-ranking TESTLoc prediction schemes¹
| Prediction scheme | chl | cyt | end | ext | mit | nuc | per | pla | vac | ||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Expanded plant dataset | 1. Top-performing individual feature (4th order amino acid composition) | SN | 99.9 | 20 | 20 | 20 | |||||
| PPV | 99.9 | 76.1 | 20 | 96.3 | 67.9 | 88 | 20 | 20 | |||
| MCC | 0.99 | 0.61 | 0.2 | 0.7 | 0.82 | 0.2 | 0.2 | ||||
| 2. Integration of predictions from all sequence features | SN | 100 | 45.5 | 20 | 69 | 86.1 | 78.1 | 40 | 63.3 | ||
| PPV | 99.3 | 20 | 98 | 70.3 | 11.7 | 8.8 | 100 | ||||
| MCC | 0.99 | 0.63 | 0.2 | 0.81 | 0.7 | 0.16 | 0.15 | 0.78 | |||
| 3. Integration of attributes from all sequence features | SN | 9.3 | 10 | 48.5 | 82.2 | 80 | 0 | 0 | 0 | ||
| PPV | 28.8 | 10 | 77.8 | 53.1 | 76.2 | 0 | 0 | 0 | |||
| MCC | 0.1 | 0.1 | 0.6 | 0.5 | 0.7 | 0 | 0 | 0 | |||
| 4. Integration of predictions from three top-performing features² | SN | 99.9 | 50.5 | 71 | 86.7 | 75.8 | 30 | 63.3 | |||
| PPV | 99.7 | 88.4 | 96.1 | 5 | 100 | ||||||
| MCC | 0.99 | 0.65 | 0.83 | 0.82 | 0.12 | 0.78 | |||||
| 5. Integration of attributes from three top-performing features | SN | 94.4 | 52.2 | 20 | 75 | 84.2 | 77.7 | 20 | 20 | 76.7 | |
| PPV | 86.6 | 90.5 | 20 | 96 | 67.8 | 92 | 20 | 100 | |||
| MCC | 0.8 | 0.2 | 0.84 | 0.68 | 0.81 | 0.2 | 0.86 | ||||
| Integration of predictions from three top-performing features | SN | 47.4 | 58.5 | 80 | 100 | 89.8 | 90.2 | 0 | 100 | 100 | |
| PPV | 90.6 | 86.1 | 100 | 100 | 68.2 | 100 | 100 | 100 | 100 | ||
| MCC | 0.42 | 0.67 | 0.89 | 1 | 0.57 | 0.94 | 0 | 1 | 1 | ||
¹ Numbers are the average of the 10-fold test. Numbers in parentheses are the standard deviation. Bold numbers indicate the best values for each metric (SN, PPV, MCC) in each class of the expanded plant dataset. The values for SN and PPV are given in %. MCC, Matthews Correlation Coefficient. For other abbreviations, see footnote to Table 1.
² The three features are 4th order amino acid composition, 6th order group-C amino acid composition and 7th order group-D amino acid composition.
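For reference, the three metrics reported in this table can be computed per class from confusion counts using their standard definitions. The sketch below assumes one-vs-rest counting for each location class and returns 0 when a denominator is zero; both conventions are assumptions, as the footnotes do not state how the authors handled these cases.

```python
import math


def class_metrics(tp, fp, tn, fn):
    """Return (SN, PPV, MCC) for one class from its confusion counts.

    SN  = TP / (TP + FN)
    PPV = TP / (TP + FP)
    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    Undefined ratios (zero denominator) are reported as 0.0 by convention here.
    """
    sn = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, ppv, mcc
```

SN and PPV are returned as fractions; multiplying by 100 gives the percentages used in the table.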
Figure 4. Integration of predictions from SVM models based on individual features. Each of the 41 SVM models built with single sequence features forms the first layer SVM and emits the probabilities for the query sequence to belong to the various classes. The probabilities are used as input for the second layer SVM.
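The data flow in this caption can be sketched as follows. This is only an illustration of how first-layer probabilities might be assembled into an input vector for the second layer: the function name is hypothetical, the first-layer models are represented by plain callables standing in for trained SVMs, and simple concatenation of the probability vectors is an assumption not confirmed by the caption.

```python
def second_layer_input(query, first_layer_models):
    """Concatenate the class probabilities emitted by each first-layer model.

    `first_layer_models` is a list of callables (hypothetical stand-ins for
    trained SVMs); each maps a query sequence to a list of class probabilities.
    """
    features = []
    for model in first_layer_models:
        features.extend(model(query))
    return features
```

Under this assumption, 41 first-layer models that each emit probabilities for nine classes would produce a 41 × 9 = 369-dimensional input vector for the second-layer SVM.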