Chi-Yuan Yu, Lih-Ching Chou, Darby Tien-Hao Chang.
Abstract
BACKGROUND: Elucidating protein-protein interactions (PPIs) is essential to constructing protein interaction networks and facilitating our understanding of the general principles of biological systems. Previous studies have revealed that interacting protein pairs can be predicted from their primary structure. Most of these approaches have achieved satisfactory performance on datasets comprising equal numbers of interacting and non-interacting protein pairs. However, this ratio is highly unbalanced in nature, and these techniques have not been comprehensively evaluated with respect to the effect of the large number of non-interacting pairs in realistic datasets. Moreover, since highly unbalanced distributions usually lead to large datasets, more efficient predictors are desired when handling such challenging tasks.
Year: 2010 PMID: 20361868 PMCID: PMC2868006 DOI: 10.1186/1471-2105-11-167
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Demonstration of evaluation bias owing to a subset-sampled dataset, where the dashed line represents the decision boundary of .
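One way to see the kind of bias this caption refers to is to hold a classifier's sensitivity and specificity fixed and vary only the number of negatives per positive. The sketch below is illustrative only: the function name `precision_at_ratio` and the rates 0.80 and 0.75 are assumptions made for this example, not values from the study.

```python
def precision_at_ratio(sensitivity, specificity, negatives_per_positive):
    """Precision implied by fixed per-class rates at a 1:k positive-to-negative ratio.

    Counts are expressed per single positive pair:
      TP = sensitivity, FP = k * (1 - specificity).
    """
    tp = sensitivity
    fp = negatives_per_positive * (1.0 - specificity)
    return tp / (tp + fp)


# Hypothetical classifier: 80% sensitivity, 75% specificity (assumed values).
for k in (1, 3, 7, 15):
    p = precision_at_ratio(0.80, 0.75, k)
    print(f"1:{k}  precision = {p:.3f}")
```

Under these assumed rates, the same decision boundary that looks acceptable on a balanced subset yields far lower precision once the negatives are restored, which is why evaluation on a 1:1 sample can overstate performance.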
Evaluation measurements employed in this study
| Measurement | Abbreviation | Equation¹ |
|---|---|---|
| Precision | | TP/(TP+FP) |
| Sensitivity | | TP/(TP+FN) |
| Specificity | | TN/(TN+FP) |
| Accuracy | | (TP+TN)/(TP+TN+FP+FN) |
| F-measure | | 2TP/(2TP+FP+FN) |
¹ Definitions of the abbreviations used: TP is the number of interacting protein pairs correctly classified; FN is the number of interacting protein pairs incorrectly classified as non-interacting; TN is the number of non-interacting protein pairs correctly classified; and FP is the number of non-interacting protein pairs incorrectly classified as interacting.
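The five equations in the table can be sketched as a small helper. The dictionary keys are conventional names for these ratios, chosen here for illustration, and the counts in the usage example are hypothetical. The sketch assumes every denominator is non-zero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the five measurements from the table from confusion-matrix counts.

    tp: interacting pairs predicted as interacting
    fn: interacting pairs predicted as non-interacting
    tn: non-interacting pairs predicted as non-interacting
    fp: non-interacting pairs predicted as interacting
    """
    return {
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "f_measure": 2 * tp / (2 * tp + fp + fn),
    }


# Hypothetical counts for a 1:3 dataset: 100 interacting, 300 non-interacting pairs.
metrics = classification_metrics(tp=60, fp=30, tn=270, fn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that the F-measure formula never uses TN, so it is unaffected by how many non-interacting pairs are classified correctly, whereas accuracy is dominated by them when negatives are abundant.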
Performance of the compared feature sets on datasets with different positive-to-negative ratios
| Feature | Accuracy | F-measure | Precision | Sensitivity | Specificity |
|---|---|---|---|---|---|
| Datasets with 1:1 positive-to-negative ratio | | | | | |
| Shen | 77.1 ± 0.8 | 77.9 ± 0.8 | 75.2 ± 0.9 | 80.9 ± 1.4 | 73.3 ± 1.4 |
| Guo | 77.2 ± 0.9 | 77.6 ± 0.9 | 76.2 ± 1.0 | 79.1 ± 1.3 | 75.4 ± 1.4 |
| This work⁴ | | | | | |
| Datasets with 1:3 positive-to-negative ratio | | | | | |
| Shen | 82.2 ± 0.3 | 58.6 ± 1.1 | | 50.4 ± 1.6 | 92.7 ± 0.3 |
| Guo | 82.1 ± 0.6 | 58.3 ± 1.7 | 69.8 ± 1.6 | 50.1 ± 1.8 | |
| This work | | | 67.9 ± 0.9 | | 89.7 ± 0.4 |
| Datasets with 1:7 positive-to-negative ratio | | | | | |
| Shen | 88.0 ± 0.3 | 45.4 ± 1.7 | 52.8 ± 1.8 | 39.9 ± 1.9 | 94.9 ± 0.3 |
| Guo | 87.2 ± 0.3 | 45.5 ± 1.3 | 48.8 ± 1.5 | | 93.6 ± 0.3 |
| This work | | | | 41.8 ± 1.8 | |
| Datasets with 1:15 positive-to-negative ratio | | | | | |
| Shen | 92.5 ± 0.1 | 33.1 ± 1.4 | 37.5 ± 1.3 | 29.7 ± 1.5 | 96.7 ± 0.1 |
| Guo | 91.7 ± 0.2 | 36.6 ± 1.5 | 35.1 ± 1.5 | 38.3 ± 1.9 | 95.3 ± 0.2 |
| This work | | | | | |
The best performance within each positive-to-negative ratio is highlighted in bold. ¹ Parameter selection is based on five-fold cross-validation of the training dataset to maximize the F-measure. ² Using triad frequency as the feature set. ³ Using auto cross covariance as the feature set. ⁴ Using triad significance as the feature set.
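The reported values can be cross-checked with two identities that follow from the definitions of the measurements. The sketch below reads the Shen row of the 1:7 block as accuracy 88.0, F-measure 45.4, precision 52.8, sensitivity 39.9 and specificity 94.9. That column assignment is an assumption, which the arithmetic supports. The function names are hypothetical.

```python
def accuracy_from_rates(sensitivity, specificity, negatives_per_positive):
    """Accuracy at a 1:k ratio is the class-weighted mean of the per-class rates."""
    k = negatives_per_positive
    return (sensitivity + k * specificity) / (1 + k)


def f_measure_from_rates(precision, sensitivity):
    """F-measure as the harmonic mean of precision and sensitivity."""
    return 2 * precision * sensitivity / (precision + sensitivity)


# Shen row, 1:7 block (values in percent, column assignment assumed).
acc = accuracy_from_rates(sensitivity=39.9, specificity=94.9, negatives_per_positive=7)
f1 = f_measure_from_rates(precision=52.8, sensitivity=39.9)
print(f"accuracy ~ {acc:.1f}, F-measure ~ {f1:.1f}")
```

Because specificity carries a weight of k/(1+k), accuracy rises with the ratio even as sensitivity and F-measure fall, which matches the trend across the four blocks of the table.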
Figure 2. Comparison of . The Random predictor predicts any query protein pair as positive with a probability of 0.5 and as negative with a probability of 0.5. The Opportunistic predictor predicts every query protein pair as negative when evaluated by accuracy and as positive when evaluated by F-measure. Shen et al. use triad frequency as the feature set. Guo et al. use auto cross covariance as the feature set. This work uses triad significance as the feature set.
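The two baseline predictors described in this caption have closed-form scores at any positive-to-negative ratio. The following is a minimal sketch under stated assumptions: the function name `baseline_scores` is hypothetical, and the Random predictor's F-measure is computed from expected counts, which approximates rather than equals its expected F-measure on a finite dataset.

```python
def baseline_scores(negatives_per_positive):
    """Scores of the Random and Opportunistic baselines at a 1:k ratio.

    Counts are expressed per single positive pair (pos = 1, neg = k).
    """
    pos, neg = 1.0, float(negatives_per_positive)

    # Random: each pair is called positive with probability 0.5.
    tp, fn, fp = 0.5 * pos, 0.5 * pos, 0.5 * neg
    random_accuracy = 0.5
    random_f = 2 * tp / (2 * tp + fp + fn)

    # Opportunistic: all-negative for accuracy, all-positive for F-measure.
    opportunistic_accuracy = neg / (pos + neg)
    opportunistic_f = 2 * pos / (2 * pos + neg)  # TP = pos, FP = neg, FN = 0

    return {
        "random_accuracy": random_accuracy,
        "random_f": random_f,
        "opportunistic_accuracy": opportunistic_accuracy,
        "opportunistic_f": opportunistic_f,
    }


for k in (1, 3, 7, 15):
    s = baseline_scores(k)
    print(f"1:{k}  " + "  ".join(f"{name}={value:.3f}" for name, value in s.items()))
```

At 1:15 the all-negative predictor already reaches an accuracy of 15/16, so accuracies above 90% on such data say little on their own; the F-measure baselines, by contrast, shrink as the ratio grows.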
Figure 3.
Amino acid groups adopted in this study
| Group no. | Amino acids | Occurrence (%)¹ |
|---|---|---|
| 1 | Ala, Gly, Val | 22.0 |
| 2 | Ile, Leu, Phe, Pro | 24.2 |
| 3 | Tyr, Met, Thr, Ser | 17.3 |
| 4 | His, Asn, Gln, Trp | 11.4 |
| 5 | Arg, Lys | 11.4 |
| 6 | Asp, Glu | 12.2 |
| 7 | Cys | 1.4 |
This table follows the work of Shen et al. [33]. ¹ Occurrences of the seven amino acid groups in the Swiss-Prot database release 57.0 [49].
Figure 4. Schematic diagram of encoding a protein sequence into a feature vector.
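As a rough illustration of this kind of encoding, the sketch below maps each residue to one of the seven groups in the table and counts every window of three consecutive groups, giving a vector of 7 × 7 × 7 = 343 raw counts. This is plain triad counting in the spirit of the triad frequency feature attributed to Shen et al.; it is not the triad significance feature used in this work, and any normalization step is omitted. All names (`GROUPS`, `triad_counts`) and the example sequence are hypothetical, and group 4 is assumed to contain Trp (one-letter code W).

```python
from itertools import product

# Seven amino acid groups from the table, in one-letter codes.
GROUPS = {
    1: "AGV",   # Ala, Gly, Val
    2: "ILFP",  # Ile, Leu, Phe, Pro
    3: "YMTS",  # Tyr, Met, Thr, Ser
    4: "HNQW",  # His, Asn, Gln, Trp (assumed reading of the fourth residue)
    5: "RK",    # Arg, Lys
    6: "DE",    # Asp, Glu
    7: "C",     # Cys
}
RESIDUE_TO_GROUP = {aa: g for g, residues in GROUPS.items() for aa in residues}

# Fixed ordering of all 343 possible group triads.
TRIADS = list(product(range(1, 8), repeat=3))
TRIAD_INDEX = {triad: i for i, triad in enumerate(TRIADS)}


def triad_counts(sequence):
    """Return a 343-element list of raw triad counts for a protein sequence.

    Windows containing a residue outside the 20 standard amino acids are skipped.
    """
    groups = [RESIDUE_TO_GROUP.get(aa) for aa in sequence.upper()]
    counts = [0] * len(TRIADS)
    for triad in zip(groups, groups[1:], groups[2:]):
        if None in triad:
            continue
        counts[TRIAD_INDEX[triad]] += 1
    return counts


vector = triad_counts("MKVLAAGC")  # hypothetical 8-residue sequence
print(len(vector), sum(vector))
```

A sequence of length n with only standard residues contributes n − 2 triads, so longer proteins produce larger raw counts; any scheme built on these counts would need to account for sequence length.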