Jisu Kim, De-Shuang Huang, Kyungsook Han.
Abstract
BACKGROUND: Supervised learning and many stochastic methods for predicting protein-protein interactions require both negative and positive interactions in the training data set. Unlike positive interactions, negative interactions cannot be readily obtained from interaction data, so they must be generated. In protein-protein interactions, as in other molecular interactions, taking all non-positive interactions as negative interactions produces far too many negative interactions relative to the positive ones. Random selection from the non-positive interactions is unsuitable, since the selected data may not reflect the original distribution of the data.
Year: 2009 PMID: 19208160 PMCID: PMC2648735 DOI: 10.1186/1471-2105-10-S1-S57
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Motif pairs found during five-fold cross validation
| | A = 1/10 | A = 1/8 | A = 1/6 | A = 1/4 |
| M1 | 12563 | 21821 | 50634 | 142395 |
| M2 | 3479 | 4866 | 12472 | 38008 |
| M3 | 1047 | 1181 | 3498 | 15220 |
| M4 | 189 | 344 | 874 | 6970 |
| M5 | 28 | 105 | 141 | 2134 |
Mi denotes the set of motif pairs found in at least i folds during five-fold cross validation.
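The definition of Mi can be illustrated with a minimal sketch. It assumes that each fold yields a set of motif pairs and simply counts in how many folds each pair appears; the function name `motif_sets_by_fold_support` and the toy motif pairs are hypothetical and are not taken from the paper.

```python
from collections import Counter

def motif_sets_by_fold_support(fold_results):
    """Return {i: M_i}, where M_i holds the motif pairs found in at least i folds.

    fold_results: one collection of motif pairs per cross-validation fold.
    """
    # Count each motif pair at most once per fold.
    support = Counter(pair for fold in fold_results for pair in set(fold))
    n_folds = len(fold_results)
    return {
        i: {pair for pair, count in support.items() if count >= i}
        for i in range(1, n_folds + 1)
    }

# Toy example with five folds (hypothetical motif pairs).
folds = [
    {("0000", "0001"), ("5555", "5554")},
    {("0000", "0001")},
    {("0000", "0001"), ("5555", "5554")},
    set(),
    {("1234", "4321")},
]
M = motif_sets_by_fold_support(folds)
for i in sorted(M):
    print(f"M{i}: {len(M[i])} motif pairs")
```

By construction the sets are nested (M5 ⊆ M4 ⊆ ... ⊆ M1), which matches the decreasing counts from M1 to M5 in the table.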
Prediction performance with respect to acceptance ratios of bootstrapping
| | A = 1/10 | A = 1/8 | A = 1/6 | A = 1/4 |
| Sensitivity | 58.35% | 75.88% | 82.42% | 90.42% |
| Specificity | 78.83% | 84.40% | 92.29% | 96.02% |
| Accuracy | 66.09% | 80.14% | 87.35% | 93.22% |
As the acceptance ratio A increases, the prediction performance of the motif pairs improves.
Prediction performance with respect to proportions of positive and negative data
| Data ratio (P:N) | 1712:2283 | 1712:1712 | 1712:1141 |
| Sensitivity | 68.98% | 75.88% | 77.80% |
| Specificity | 87.03% | 84.40% | 77.56% |
| Accuracy | 79.30% | 80.14% | 77.70% |
P: positive data, N: negative data.
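The relationship between the three measures in this table can be checked with a short sketch. It assumes the standard definition accuracy = (TP + TN) / (P + N), with TP = sensitivity × P and TN = specificity × N; the record itself does not state the formula, and the function name `accuracy` is illustrative.

```python
def accuracy(sensitivity, specificity, n_pos, n_neg):
    """Accuracy as the count-weighted average of sensitivity and specificity.

    Assumes accuracy = (TP + TN) / (P + N), which is not stated in the source.
    """
    true_pos = sensitivity * n_pos
    true_neg = specificity * n_neg
    return (true_pos + true_neg) / (n_pos + n_neg)

# (sensitivity, specificity, P, N) for each column of the table.
columns = [
    (0.6898, 0.8703, 1712, 2283),
    (0.7588, 0.8440, 1712, 1712),
    (0.7780, 0.7756, 1712, 1141),
]
recomputed = [100 * accuracy(*col) for col in columns]
for (sens, spec, p, n), acc in zip(columns, recomputed):
    print(f"{p}:{n} -> accuracy {acc:.2f}%")
```

Under this assumption the recomputed values agree with the reported accuracies (79.30%, 80.14%, 77.70%) to within rounding of the published sensitivity and specificity.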
Figure 1. Sensitivity and specificity of predictions with respect to the proportions of positive and negative data. As the proportion of positive data increases, the sensitivity increases but the specificity decreases.
Prediction performance of two boosting algorithms
| Boosting algorithm | AdaBoost algorithm | Our Boosting algorithm |
| Sensitivity | 70.55% | 75.88% |
| Specificity | 84.21% | 84.40% |
| Accuracy | 77.37% | 80.14% |
Parameter values: T = 4, S = 5, R = 100,000.
Motif pairs found in each fold
| Set | # of motif pairs | p-value |
| M1 | 334 | 1 |
| M2 | 87 | 3.13e-3 |
| M3 | 22 | 3.02e-3 |
| M4 | 7 | 2.25e-2 |
| M5 | 2 | 1.79e-1 |
The number of motif pairs predicted by our boosting algorithm for complexes of human and virus proteins.
Figure 2. Motif pairs predicted for 1AGF. Red balls: contact residue pairs correctly predicted; cyan balls: contact residue pairs missed in the prediction; gray wireframe: non-contact residues.
Encoding scheme for the interacting motif pairs
| Biochemical property | | Candidate motif pair (4-tuple pairs, M bits) | | |
| Classification | Category number | Bit number | Human 4-tuple | Virus 4-tuple |
| {I, V, L, M} | 0 | 1 | 0000 | 0000 |
| {F, Y, W} | 1 | 2 | 0000 | 0001 |
| {H, K, R} | 2 | ⋮ | ⋮ | ⋮ |
| {D, E} | 3 | | | |
| {Q, N, T, P} | 4 | M-1 | 5555 | 5554 |
| {A, C, G, S} | 5 | M | 5555 | 5555 |
The total number of possible motif pairs is M = 1,679,616 (6^4 × 6^4), with one bit for each motif pair. A bit value of 1 indicates that the corresponding motif pair exists in the pair of proteins, and 0 indicates that the motif pair is absent.
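The encoding scheme can be sketched in a few lines. The category numbers and the bit ordering follow the table (bit 1 is the pair 0000/0000 and bit M is 5555/5555); treating a motif as a contiguous window of four residues is an assumption made only for this illustration, and the names `tuple_code`, `bit_number`, and `encode` are hypothetical.

```python
# Six biochemical categories from the table; list index = category number.
CATEGORIES = ["IVLM", "FYW", "HKR", "DE", "QNTP", "ACGS"]
CATEGORY_OF = {aa: cat for cat, group in enumerate(CATEGORIES) for aa in group}

N_TUPLES = 6 ** 4          # 1,296 possible 4-tuples per protein
M = N_TUPLES ** 2          # 1,679,616 possible motif pairs

def tuple_code(motif):
    """Read a 4-residue motif as a base-6 number of category digits."""
    if len(motif) != 4:
        raise ValueError("a motif must contain exactly four residues")
    code = 0
    for aa in motif:
        code = code * 6 + CATEGORY_OF[aa]
    return code

def bit_number(human_motif, virus_motif):
    """1-based bit number of a (human, virus) motif pair, ordered as in the table."""
    return tuple_code(human_motif) * N_TUPLES + tuple_code(virus_motif) + 1

def encode(human_seq, virus_seq):
    """Bit numbers set to 1 for a protein pair (assumes contiguous 4-residue windows)."""
    human = {human_seq[i:i + 4] for i in range(len(human_seq) - 3)}
    virus = {virus_seq[i:i + 4] for i in range(len(virus_seq) - 3)}
    return {bit_number(h, v) for h in human for v in virus}

print(bit_number("IIII", "IIIF"))   # human 0000, virus 0001
print(bit_number("AAAA", "AAAA"))   # human 5555, virus 5555
```

Storing only the bit numbers that are set to 1, rather than the full M-bit vector, keeps the sketch small; a dense bit array would represent the same information.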
Figure 3. Framework for Yu's AdaBoost algorithm. The AdaBoost algorithm requires 20 weak hypotheses for T = 4 and S = 5.
Figure 4. Framework of our boosting algorithm. Our algorithm requires only 5 weak hypotheses for S = 5.