Nazar Zaki, Sanja Lazarova-Molnar, Wassim El-Hajj, Piers Campbell.
Abstract
BACKGROUND: Protein-protein interactions (PPIs) are essential to most biological processes, and abnormal interactions may have implications in a number of neurological syndromes. Because the association and dissociation of protein molecules are crucial, computational tools capable of effectively identifying PPIs are desirable. In this paper, we propose a simple yet effective method to detect PPIs based on pairwise similarity, using only the primary structure of the protein. The PPI based on Pairwise Similarity (PPI-PS) method represents each protein sequence by a vector of pairwise similarities against large subsequences of amino acids, which are created by a shifting window that passes over the concatenated protein training sequences. Each coordinate of this vector is typically the E-value of the Smith-Waterman score. These vectors are then used to compute the kernel matrix, which is used in conjunction with support vector machines.
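The feature extraction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the match/mismatch/gap scores, the window and step sizes, and the toy sequences are all assumptions made for illustration. In particular, the sketch uses the raw Smith-Waterman score with a simple scoring scheme rather than the E-value, and a plain linear kernel rather than whatever kernel the paper uses.

```python
from typing import List


def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score (linear gap penalty; scores are assumed)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def window_subsequences(training: List[str], window: int, step: int) -> List[str]:
    """Slide a window over the concatenated training sequences."""
    joined = "".join(training)
    last_start = max(len(joined) - window, 0)
    return [joined[s:s + window] for s in range(0, last_start + 1, step)]


def feature_vector(protein: str, subsequences: List[str]) -> List[float]:
    """One coordinate per subsequence: similarity of the protein to it."""
    return [float(smith_waterman_score(protein, s)) for s in subsequences]


def linear_kernel(vectors: List[List[float]]) -> List[List[float]]:
    """Gram matrix of dot products between feature vectors."""
    return [[sum(x * y for x, y in zip(u, v)) for v in vectors] for u in vectors]


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    training = ["MKTAYIAKQR", "GSHMLEDPAA"]
    subsequences = window_subsequences(training, window=8, step=4)
    vectors = [feature_vector(p, subsequences) for p in ["MKTAYI", "LEDPAA"]]
    print(linear_kernel(vectors))
```

The resulting kernel matrix could then be passed to any support vector machine implementation that accepts a precomputed kernel.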
Year: 2009 PMID: 19445721 PMCID: PMC2701420 DOI: 10.1186/1471-2105-10-150
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
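Randomly selected training and testing protein datasets

| Training protein pair | Interaction type | Testing protein pair | Interaction type |
|---|---|---|---|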
| YAR003W-YBR175W | interact | YCR077C-YDL160C | interact |
| YBR126C-YML100W | interact | YPR072W-YIL038C | interact |
| YNR006W-YOR025W | non-interact | YNL137C-YOR025W | non-interact |
| YMR203W-YNL029C | non-interact | YMR261C-YOR321W | non-interact |
Mean, standard deviation and confidence level of the length of the selected 15 proteins

| Dataset | Mean | Standard deviation | Confidence level |
|---|---|---|---|
| Training Dataset | 539 | 243.81 | 203.83 |
| Testing Dataset | 679.75 | 213.67 | 178.64 |
Similarity and identity averages and standard deviations calculated based on the selected 15 proteins
| 48.35 | 3.79 | 29.59 | 2.72 | 51.97 | 15.09 |
Figure 1. Similarity score of each protein sequence in the testing dataset against the three generated subsequences.
Mean, standard deviation and confidence level of the length of the 157 positive and 77 negative protein sequences

| Dataset | Mean | Standard deviation | Confidence level |
|---|---|---|---|
| Positive Examples | 567.7 | 374.7 | 59.08 |
| Negative Examples | 510.27 | 314.15 | 71.30 |
Similarity and identity averages and standard deviations calculated based on the 157 positive and 77 negative protein sequences
| 44.97 | 2.26 | 29.21 | 2.41 | 65.57 | 10.93 |
ROC, SN, SP and overall accuracy recorded from testing PPI-PS on 100 interacting protein pairs and 100 non-interacting protein pairs based on several window size values.

| Window size | ROC | SN | SP | Accuracy |
|---|---|---|---|---|
| 20000 | 0.9591 | 0.9 | 0.9 | 0.9 |
| 19000 | 0.9751 | 0.94 | 0.86 | 0.9 |
| 18000 | 0.996 | 1 | 0.96 | 0.98 |
| 17000 | 0.976 | 0.95 | 0.92 | 0.935 |
| 16000 | 0.974 | 0.88 | 0.91 | 0.895 |
| 15000 | 0.996 | 1 | 0.96 | 0.98 |
| 14000 | 0.979 | 0.91 | 0.97 | 0.94 |
| 13000 | 0.9918 | 1 | 0.94 | 0.97 |
| 12000 | 0.98804 | 0.93 | 0.97 | 0.95 |
| 11000 | 0.9885 | 0.98 | 0.96 | 0.97 |
| 10000 | 0.9985 | 1 | 0.95 | 0.975 |
| 9000 | 0.9979 | 1 | 0.95 | 0.975 |
| 8000 | 0.989 | 0.98 | 0.98 | 0.98 |
| 7000 | 0.9964 | 1 | 0.93 | 0.965 |
| 6000 | 0.9984 | 1 | 0.95 | 0.975 |
| 4000 | 0.9941 | 0.98 | 0.96 | 0.97 |
| 3000 | 0.9962 | 1 | 0.95 | 0.975 |
| 2000 | 0.9927 | 0.97 | 0.94 | 0.955 |
| 1000 | 0.9864 | 0.96 | 0.87 | 0.915 |
| 500 | 0.973 | 0.96 | 0.78 | 0.87 |
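The last column of the table above (overall accuracy) is consistent with a class-size-weighted mean of the sensitivity (SN) and specificity (SP) columns. The sketch below shows that relationship using the standard definitions of SN and SP. The function name is hypothetical, and the assumption that the table was computed this way is inferred from the numbers rather than stated in the source.

```python
def overall_accuracy(sn: float, sp: float, n_pos: int, n_neg: int) -> float:
    """Accuracy implied by sensitivity and specificity for given class sizes."""
    true_pos = sn * n_pos  # SN = TP / (TP + FN)
    true_neg = sp * n_neg  # SP = TN / (TN + FP)
    return (true_pos + true_neg) / (n_pos + n_neg)


# Window size 17000: SN = 0.95, SP = 0.92, with 100 pairs in each class.
print(round(overall_accuracy(0.95, 0.92, 100, 100), 3))  # 0.935
```

With equal class sizes this reduces to the simple average of SN and SP. With unequal classes, such as 4917 interacting and 4000 non-interacting pairs, the larger class carries more weight.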
ROC, SN, SP and overall accuracy recorded from testing PPI-PS on a dataset of 50 interacting protein pairs and 50 non-interacting protein pairs based on several window size values.

| Window size | ROC | SN | SP | Accuracy |
|---|---|---|---|---|
| 20000 | 0.8648 | 0.48 | 0.86 | 0.67 |
| 19000 | 0.87772 | 1 | 0.78 | 0.89 |
| 18000 | 0.8432 | 0.96 | 0.78 | 0.87 |
| 17000 | 0.8336 | 0.88 | 0.76 | 0.82 |
| 16000 | 0.8176 | 0.78 | 0.76 | 0.77 |
| 15000 | 0.8612 | 0.82 | 0.76 | 0.79 |
| 14000 | 0.854 | 1 | 0.74 | 0.87 |
| 11000 | 0.842 | 1 | 0.72 | 0.86 |
| 10000 | 0.8556 | 0.8 | 0.8 | 0.8 |
| 9000 | 0.8628 | 0.94 | 0.78 | 0.86 |
| 8000 | 0.8724 | 0.96 | 0.72 | 0.84 |
| 7000 | 0.8732 | 0.98 | 0.76 | 0.87 |
| 6000 | 0.8812 | 1 | 0.74 | 0.87 |
| 5000 | 0.8792 | 0.96 | 0.74 | 0.85 |
| 4000 | 0.8532 | 1 | 0.72 | 0.86 |
| 3000 | 0.8876 | 1 | 0.74 | 0.87 |
| 2000 | 0.8488 | 1 | 0.62 | 0.81 |
| 1000 | 0.8608 | 1 | 0.58 | 0.79 |
| 500 | 0.8544 | 1 | 0.46 | 0.73 |
Comparing the classification accuracy of the 200 protein pairs based on reversed sequence order.

| Sequence order | ROC | SN | SP | Accuracy |
|---|---|---|---|---|
| Original Order | 0.86 | 0.93 | 0.735 | 0.833 |
| Reverse Order | 0.853 | 0.93 | 0.73 | 0.83 |
Mean, standard deviation and confidence level of the length of the 8917 training examples and the 8917 testing examples

| Dataset | Mean | Standard deviation | Confidence level |
|---|---|---|---|
| Training Examples | 548.31 | 398.29 | 12.10 |
| Testing Examples | 547.48 | 398.27 | 12.14 |
Similarity and identity averages and standard deviations calculated based on the 8917 training examples and the 8917 testing examples
| 19.16 | 9.91 | 29.97 | 1.96 | 81.59 | 6.71 |
ROC, SN, SP and overall accuracy recorded from testing PPI-PS on a testing dataset of 4917 interacting protein pairs and 4000 non-interacting protein pairs based on several window size values.

| Window size | ROC | SN | SP | Accuracy |
|---|---|---|---|---|
| 20000 | 0.8407 | 0.7914 | 0.7357 | 0.7664 |
| 15000 | 0.84 | 0.793 | 0.736 | 0.767 |
| 10000 | 0.845 | 0.795 | 0.745 | 0.77 |
| 500 | 0.7858 | 0.7 | 0.721 | 0.7098 |
Comparing the classification accuracy of the 8917 protein pairs based on reversed sequence order.

| Sequence order | ROC | SN | SP | Accuracy |
|---|---|---|---|---|
| Original Order | 0.833 | 0.77 | 0.73 | 0.75 |
| Reverse Order | 0.80 | 0.71 | 0.726 | 0.72 |
Figure 2. Comparing PPI-PS performance with MLE and Domain-based random forest of decision trees methods.
Figure 3. Illustration of the feature extraction algorithm.
Figure 4. Overview of the feature extraction step.