Andrea Pierleoni, Pier Luigi Martelli, Rita Casadio.
Abstract
BACKGROUND: Several eukaryotic proteins associated with the extracellular leaflet of the plasma membrane carry a glycosylphosphatidylinositol (GPI) anchor, which is linked to the C-terminal residue after a proteolytic cleavage occurring at the so-called ω-site. Computational methods have been developed to identify proteins that undergo this post-translational modification starting from their amino acid sequences. However, more accurate methods are needed for a reliable annotation of whole proteomes.
Year: 2008 PMID: 18811934 PMCID: PMC2571997 DOI: 10.1186/1471-2105-9-392
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The HMM model of the ω-site. Different colors represent different emission probability sets. The ω-site is represented in red. Surrounding residues are colored in green, orange and yellow. The preceding region is represented in dark green. The spacer and the C-terminal hydrophobic regions are depicted in violet and blue, respectively. The total number of independent trainable parameters is 147.
Figure 2. The ROC curve of PredGPI. The ROC curve of PredGPI is shown as a continuous line. The dashed line corresponds to a random guess. Two points are marked on the ROC curve: the circle indicates a false positive rate of 0.15%, while the triangle indicates a false positive rate of 0.5%. The curve was computed using the 145 positive examples and the 10,630 negative examples in GPI-Set and Non-GPI-Set, respectively. See the Methods section for details.
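As an illustration of how a curve like the one in Figure 2 can be obtained, the following minimal sketch sweeps a decision threshold over predictor scores and records the false positive rate and true positive rate at each step. The scores are invented toy values, and the assumption that a higher score means "more likely GPI-anchored" is ours; neither comes from PredGPI.

```python
def roc_points(pos_scores, neg_scores):
    """Return (FPR, TPR) pairs obtained by lowering the threshold step by step."""
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for t in thresholds:
        tpr = sum(s >= t for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= t for s in neg_scores) / len(neg_scores)
        points.append((fpr, tpr))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (FPR, TPR) points sorted by FPR."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical scores, for illustration only.
toy_pos = [0.9, 0.8, 0.4]
toy_neg = [0.7, 0.3, 0.2, 0.1]

points = roc_points(toy_pos, toy_neg)
auc = area_under_curve(points)
print(points)
print(auc)
```

A random predictor would trace the diagonal from (0, 0) to (1, 1), which is the dashed line in the figure.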
Table 1. Comparison between PredGPI and other available predictors
| Predictor | TP | FP | Cov (%) | Acc (%) | FP rate (%) | MCC |
| PredGPI | 112 | 15 | 77.2 | 88.2 | 0.14 | 0.823 |
| FragAnchor | 102 | 37 | 70.3 | 73.4 | 0.35 | 0.725 |
| BIG-PI | 79 | 33 | 53.4 | 70.5 | 0.31 | 0.609 |
| DGPI | 117 | 250 | 79.1 | 31.9 | 2.35 | 0.492 |
| GPI-SOM | 126 | 182 | 85.1 | 40.9 | 1.7 | 0.583 |
| MemType-2L (Shanghai server*) | 74 | 189 | 51.0 | 28.1 | 1.8 | 0.368 |
| MemType-2L (Harvard server*) ** | ≤ 107 | ≥ 60 | ≤ 73.8 | ≤ 64.1 | ≥ 0.56 | ≤ 0.683 |
Performance was evaluated on the 145 positive and 10,630 negative examples contained in GPI-Set and Non-GPI-Set, respectively. PredGPI was evaluated using the jack-knife procedure. Note that many of the tested proteins may have been used to train the other predictors.
Abbreviations: TP = True Positives, FP = False Positives; the number of sequences is listed; Cov = Coverage, Acc = Accuracy, FP rate = False Positives over the total number of negative examples, MCC = Matthews Correlation Coefficient.
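The indexes above can be recomputed from the TP and FP counts and the sizes of the two sets. The sketch below does this for the PredGPI row. The formulas for Cov and FP rate follow the abbreviations given; the formula used for Acc, TP / (TP + FP), is an assumption inferred from the tabulated values rather than stated in the legend, and the function name is ours.

```python
from math import sqrt


def table_indexes(tp, fp, n_pos, n_neg):
    """Recompute Cov, Acc, FP rate (all in %) and MCC from raw counts."""
    fn = n_pos - tp  # positives that were missed
    tn = n_neg - fp  # negatives that were correctly rejected
    cov = 100.0 * tp / n_pos
    acc = 100.0 * tp / (tp + fp)  # assumed definition, inferred from the table
    fp_rate = 100.0 * fp / n_neg
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return cov, acc, fp_rate, mcc


# PredGPI row: 112 TP and 15 FP on 145 positives and 10,630 negatives.
cov, acc, fp_rate, mcc = table_indexes(tp=112, fp=15, n_pos=145, n_neg=10630)
print(f"Cov={cov:.1f} Acc={acc:.1f} FP rate={fp_rate:.2f} MCC={mcc:.3f}")
```

With these inputs the sketch reproduces the PredGPI row (77.2, 88.2, 0.14, 0.823).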
*MemType-2L is available in two versions: the Shanghai server and the Harvard server. For the sake of completeness, we used both.
** The Harvard server of MemType-2L returned an answer for only 143 of the 145 positive examples in GPI-Set; 105 of these sequences are correctly predicted. Moreover, the server returned an answer for only 4,265 of the 10,630 negative examples in Non-GPI-Set; 60 of these are mispredicted. The limits on the indexes scoring the performance of this server are computed for the best case, namely by considering all the non-predicted proteins as correctly predicted.
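The best-case limits for the Harvard server can be reproduced with a short calculation: every unanswered positive is counted as a true positive, and every unanswered negative as a true negative. The sketch below applies this to the counts given in the note. As before, Acc is taken to be TP / (TP + FP), which is an assumption inferred from the table, and the function name is hypothetical.

```python
from math import sqrt


def best_case_limits(n_pos, answered_pos, correct_pos, n_neg, wrong_neg):
    """Best-case indexes when unanswered sequences count as correctly predicted."""
    tp = correct_pos + (n_pos - answered_pos)  # upper limit on TP
    fp = wrong_neg                             # lower limit on FP
    fn = n_pos - tp
    tn = n_neg - fp
    cov = 100.0 * tp / n_pos
    acc = 100.0 * tp / (tp + fp)  # assumed definition, inferred from the table
    fp_rate = 100.0 * fp / n_neg
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return tp, fp, cov, acc, fp_rate, mcc


tp, fp, cov, acc, fp_rate, mcc = best_case_limits(
    n_pos=145, answered_pos=143, correct_pos=105, n_neg=10630, wrong_neg=60
)
print(tp, fp, round(cov, 1), round(acc, 1), round(fp_rate, 2), round(mcc, 3))
```

The result matches the limits listed for the Harvard server: TP ≤ 107, FP ≥ 60, Cov ≤ 73.8, Acc ≤ 64.1, FP rate ≥ 0.56 and MCC ≤ 0.683.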
Table 2. Coverage on all the experimentally annotated proteins in SwissProt
| Predictor | TP | Cov (%) |
| PredGPI | 301 | 88.5 |
| FragAnchor | 286 | 84.1 |
| BIG-PI | 189 | 55.6 |
| DGPI | 267 | 78.5 |
| GPI-SOM | 278 | 83.8 |
| MemType-2L (Shanghai server*) | 147 | 43.2 |
| MemType-2L (Harvard server*) ** | ≤ 293 | ≤ 86.1 |
The test dataset comprises all 340 GPI-anchored proteins experimentally annotated in SwissProt and contained in All-GPI-Set. Abbreviations: TP = True Positives; Cov = Coverage.
* See legend to Table 1.
** The Harvard server of MemType-2L returned an answer for only 309 of the 340 positive examples in All-GPI-Set; 262 of these sequences are correctly predicted. The limits on the indexes scoring the performance of this server are computed for the best case, namely by considering all the non-predicted proteins as correctly predicted.
Table 3. Performance for the prediction of the ω-site
| Predictor | BIG-PI | DGPI | GPI-SOM | PredGPI (cv*) | PredGPI (non cv**) |
| Correctly predicted proteins | 23 | 16 | 15 | 21 | 24 |
| Proteins wrongly predicted by one position | 1 | 4 | 2 | 3 | 2 |
| Proteins wrongly predicted by more than one position | 1 | 5 | 7 | 2 | 0 |
| Proteins predicted as non GPI-anchored | 1 | 1 | 2 | 0 | 0 |
The test set comprises 26 sequences with experimentally annotated ω-site (GPIω-Set). The number of sequences is listed.
*cv = results obtained with a 20-fold cross validation prediction;
**non cv = results obtained with a predictor trained on all the 26 sequences.
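The four rows of the table correspond to a simple classification of each test protein by the distance between the predicted and the annotated ω-site. A minimal sketch of that tally is given below; the positions are invented toy values, and representing "predicted as non GPI-anchored" by `None` is our own convention.

```python
from collections import Counter


def omega_site_outcome(annotated, predicted):
    """Classify one prediction by its distance from the annotated omega-site."""
    if predicted is None:  # predictor reported the protein as non GPI-anchored
        return "predicted as non GPI-anchored"
    distance = abs(predicted - annotated)
    if distance == 0:
        return "correctly predicted"
    if distance == 1:
        return "wrong by one position"
    return "wrong by more than one position"


# Hypothetical (annotated, predicted) residue positions, for illustration only.
toy_pairs = [(100, 100), (50, 51), (30, 27), (80, None), (10, 10)]

tally = Counter(omega_site_outcome(a, p) for a, p in toy_pairs)
for outcome, count in tally.items():
    print(outcome, count)
```

Applied to the 26 sequences of GPIω-Set, the four counts for any one predictor add up to 26, as they do in each column of the table.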
Table 4. Most relevant features as evaluated by MCC decrease upon feature elimination
| Feature | ΔMCC | Higher in |
| Average KD hydrophobicity of 20 C-ter residues | -0.021 | GPI |
| Frequency of Ser in 40 C-ter residues | -0.020 | GPI |
| Frequency of Leu in 40 C-ter residues | -0.018 | GPI |
| Frequency of Gly in 20 C-ter residues | -0.016 | GPI |
| Frequency of Asn in 20 N-ter residues | -0.016 | Non GPI |
| Frequency of Asn in whole sequence | -0.015 | GPI |
| Frequency of Gln in 20 N-ter residues | -0.015 | Non GPI |
| Frequency of Leu in 20 N-ter residues | -0.015 | GPI |
| Frequency of Thr in whole sequence | -0.015 | GPI |
| Frequency of Ala in 20 N-ter residues | -0.015 | GPI |
Only features whose elimination leads to a ΔMCC of -0.015 or lower are listed. The third column indicates whether the feature has a higher average value in GPI-anchored or in non GPI-anchored proteins, as computed on the sequences included in GPI-Set and Non-GPI-Set.
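To make the feature definitions and the ΔMCC ranking concrete, the sketch below computes a few of the listed features from a sequence and then ranks features by the change in MCC when each is left out. Several things here are assumptions: "KD" is taken to mean the Kyte-Doolittle hydropathy scale, the feature names are ours, and `toy_mcc` is a stand-in scorer with invented weights that replaces retraining and evaluating the actual predictor.

```python
# Kyte-Doolittle hydropathy values (assumed meaning of "KD" in the table).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def average_kd(region):
    """Mean hydropathy of a sequence region."""
    return sum(KD[aa] for aa in region) / len(region)


def frequency(region, residue):
    """Fraction of positions in the region occupied by the given residue."""
    return region.count(residue) / len(region)


def extract_features(seq):
    """A few of the tabulated features; the dictionary keys are hypothetical names."""
    return {
        "kd_cter20": average_kd(seq[-20:]),
        "ser_cter40": frequency(seq[-40:], "S"),
        "leu_cter40": frequency(seq[-40:], "L"),
        "ala_nter20": frequency(seq[:20], "A"),
    }


def mcc_changes(feature_names, evaluate_mcc):
    """Delta MCC for each feature: MCC without it minus MCC with all features."""
    baseline = evaluate_mcc(feature_names)
    return {
        name: evaluate_mcc([f for f in feature_names if f != name]) - baseline
        for name in feature_names
    }


# Toy sequence and toy scorer, for illustration only.
toy_seq = "M" + "A" * 19 + "S" * 20 + "L" * 20
features = extract_features(toy_seq)

toy_weights = {"kd_cter20": 0.3, "ser_cter40": 0.2, "ala_nter20": 0.1}
toy_mcc = lambda names: sum(toy_weights[n] for n in names)

changes = mcc_changes(list(toy_weights), toy_mcc)
ranked = sorted(changes.items(), key=lambda item: item[1])  # most negative first
print(features)
print(ranked)
```

In a real analysis the scorer would retrain the classifier on the reduced feature set and measure MCC on held-out data, so each ΔMCC would reflect the actual loss of discriminating power.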