Abstract
BACKGROUND: Protein-protein interactions underlie many important biological processes. Computational prediction methods can complement experimental approaches for identifying protein-protein interactions. Recently, a unique category of sequence-based prediction methods has been put forward, unique in the sense that it does not require homologous protein sequences. This makes it universally applicable to all protein sequences, unlike many previous sequence-based prediction methods. If as effective as claimed, these new sequence-based, universally applicable prediction methods would have far-reaching utility in many areas of biology research.
Year: 2009 PMID: 20003442 PMCID: PMC2803199 DOI: 10.1186/1471-2105-10-419
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Cross-validation on the yeast and the human data. ROC and recall-precision plots for the four tested methods and the consensus method. The title of each plot follows the 'D1_D2_(ROC|RP)100' format, where D1 is the training data set, D2 is the testing data set, and '100' indicates that the negative test data set is 100 times the size of the positive test data set. 'ROC' indicates an ROC plot, whereas 'RP' indicates a recall-precision plot. When D1 and D2 are identical, as they are here, 4-fold cross-validation was performed. When they are not identical, D1 and D2 are used for training and testing, respectively.
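The title convention in this caption can be unpacked mechanically. The following is a minimal Python sketch, not code from the paper: the names `PlotTitle` and `parse_plot_title` and the example data set names are hypothetical, and the pattern assumes that D1 and D2 contain no underscores.

```python
import re
from typing import NamedTuple


class PlotTitle(NamedTuple):
    train: str             # D1, the training data set
    test: str              # D2, the testing data set
    kind: str              # "ROC" or "RP" (recall-precision)
    neg_ratio: int         # negative test set size relative to the positive set
    cross_validated: bool  # True when D1 == D2 (4-fold cross-validation)


# Assumption: data set names contain no underscore characters.
_TITLE = re.compile(r"^(?P<d1>[^_]+)_(?P<d2>[^_]+)_(?P<kind>ROC|RP)(?P<ratio>\d+)$")


def parse_plot_title(title: str) -> PlotTitle:
    """Split a 'D1_D2_(ROC|RP)100'-style title into its parts."""
    m = _TITLE.match(title)
    if m is None:
        raise ValueError(f"not a recognised plot title: {title!r}")
    return PlotTitle(
        train=m["d1"],
        test=m["d2"],
        kind=m["kind"],
        neg_ratio=int(m["ratio"]),
        cross_validated=(m["d1"] == m["d2"]),
    )
```

For example, a hypothetical title such as `yeast_human_RP100` would be read as a recall-precision plot for training on yeast and testing on human, without cross-validation.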
Cross-validation results on the yeast and the human data
| Data set | Method | AUC (10¹) | P20R² (10¹) | AUC (100) | P20R² (100) |
|---|---|---|---|---|---|
| Yeast | M1 | 0.83 | 0.55 | 0.83 | 0.11 |
| Yeast | M2 | 0.79 | 0.82 | 0.79 | 0.33 |
| Yeast | M3 | 0.60 | 0.28 | 0.60 | 0.04 |
| Yeast | M4 | 0.75 | 0.35 | 0.75 | 0.05 |
| Yeast | C | 0.85 | 0.84 | 0.85 | 0.34 |
| Human | M1 | 0.86 | 0.70 | 0.86 | 0.19 |
| Human | M2 | 0.81 | 0.91 | 0.81 | 0.51 |
| Human | M3 | 0.67 | 0.36 | 0.67 | 0.05 |
| Human | M4 | 0.83 | 0.59 | 0.83 | 0.12 |
| Human | C | 0.91 | 0.95 | 0.90 | 0.67 |

¹ Evaluation using a negative subset of size 10N randomly chosen from the 100N negative set, where N is the size of the positive set.
² Precision at 20% recall.
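The two measures reported here behave differently as the negative set grows: AUC is insensitive to the negative-to-positive ratio, whereas precision at a fixed recall falls as more negatives compete for the top ranks. The sketch below is an illustration on toy scores, not the paper's evaluation code. The function names are hypothetical, AUC is computed as the fraction of correctly ordered positive-negative pairs, and tied scores are assumed to be rare.

```python
from typing import Sequence


def roc_auc(pos: Sequence[float], neg: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Quadratic in the input size, so toy data only.
    """
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def precision_at_recall(pos: Sequence[float], neg: Sequence[float],
                        recall: float = 0.2) -> float:
    """Precision at the first rank where the requested recall is reached."""
    if not pos:
        raise ValueError("need at least one positive score")
    # Highest score first. The sort is stable, so a positive that ties with a
    # negative is ranked ahead of it (an optimistic tie-breaking assumption).
    ranked = sorted([(s, True) for s in pos] + [(s, False) for s in neg],
                    key=lambda item: item[0], reverse=True)
    tp = fp = 0
    for _, is_positive in ranked:
        if is_positive:
            tp += 1
        else:
            fp += 1
        if tp / len(pos) >= recall:
            return tp / (tp + fp)
    raise ValueError("recall must not exceed 1.0")


# Toy scores (made up for illustration): 5 positives and 4 negatives.
toy_pos = [0.9, 0.7, 0.5, 0.3, 0.1]
toy_neg = [0.95, 0.6, 0.2, 0.05]
print(roc_auc(toy_pos, toy_neg), precision_at_recall(toy_pos, toy_neg))
# Ten times as many negatives with the same score distribution.
print(roc_auc(toy_pos, toy_neg * 10), precision_at_recall(toy_pos, toy_neg * 10))
```

On these toy scores the AUC is unchanged when the negatives are replicated tenfold, while the precision at 20% recall drops, which mirrors the pattern between the 10 and 100 columns of the table.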
Prediction performance of the modified M1 on the yeast and the human data
| Method | Yeast 100: AUC | Yeast 100: P20R | Human 100: AUC | Human 100: P20R |
|---|---|---|---|---|
| M1 | 0.82 | 0.10 | 0.84 | 0.17 |
Cross-species testing results
| Method | Human - Yeast¹ 100: AUC | Human - Yeast 100: P20R | Yeast - Human 100: AUC | Yeast - Human 100: P20R |
|---|---|---|---|---|
| M1 | 0.67 | 0.03 | 0.65 | 0.03 |
| M2 | 0.72 | 0.06 | 0.67 | 0.04 |
| M3 | 0.52 | 0.02 | 0.51 | 0.01 |
| M4 | 0.62 | 0.02 | 0.62 | 0.02 |
| C | 0.73 | 0.07 | 0.68 | 0.04 |

¹ "A - B" signifies training with the A data and testing on the B data.
In this table, prediction methods were trained with all the data from one species and tested on all the data from another species (no 4-fold cross-validation).
Figure 2. Cross-species benchmarking results. The title of each plot reads in the same way as in Fig. 1.
Figure 3. Cross-validation on the combined data. The title of each plot reads in the same way as in Fig. 1.
Testing results on the combined data set
| Method | Combined - Yeast 100: AUC | Combined - Yeast 100: P20R | Combined - Human 100: AUC | Combined - Human 100: P20R | Combined - Combined 100: AUC | Combined - Combined 100: P20R |
|---|---|---|---|---|---|---|
| M1 | 0.79 | 0.07 | 0.86 | 0.18 | 0.85 | 0.15 |
| M2 | 0.79 | 0.24 | 0.82 | 0.52 | 0.81 | 0.48 |
| M3 | 0.62 | 0.04 | 0.68 | 0.05 | 0.67 | 0.05 |
| M4 | 0.74 | 0.05 | 0.83 | 0.11 | 0.81 | 0.10 |
| C | 0.84 | 0.31 | 0.89 | 0.63 | 0.88 | 0.59 |
Mean coefficients of the four methods in the linear SVC consensus methods
| Method | Yeast | Human | Combined |
|---|---|---|---|
| M1 | 0.35 | 0.30 | 0.13 |
| M2 | 0.42 | 0.26 | 0.52 |
| M3 | 0.08 | 0.12 | 0.11 |
| M4 | 0.15 | 0.32 | 0.24 |
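To make the role of these coefficients concrete, the sketch below combines four per-method scores into one consensus score as a weighted sum, which is the form a linear classifier's decision function takes. This is an illustration only: the function name `consensus_score` is hypothetical, the bias term is not reported in the table and is assumed to be zero, and the input scores are assumed to be on comparable scales.

```python
from typing import Mapping

# Mean coefficients copied from the table above, keyed by training data set.
MEAN_COEFFICIENTS = {
    "yeast":    {"M1": 0.35, "M2": 0.42, "M3": 0.08, "M4": 0.15},
    "human":    {"M1": 0.30, "M2": 0.26, "M3": 0.12, "M4": 0.32},
    "combined": {"M1": 0.13, "M2": 0.52, "M3": 0.11, "M4": 0.24},
}


def consensus_score(method_scores: Mapping[str, float], data_set: str,
                    bias: float = 0.0) -> float:
    """Weighted sum of the four method scores for one protein pair.

    `bias` is an assumption (the table does not report an intercept).
    """
    weights = MEAN_COEFFICIENTS[data_set]
    return bias + sum(w * method_scores[name] for name, w in weights.items())


# Hypothetical scores for a single protein pair.
example = {"M1": 0.2, "M2": 0.9, "M3": 0.5, "M4": 0.4}
for name in MEAN_COEFFICIENTS:
    print(name, round(consensus_score(example, name), 3))
```

Because M2 carries the largest coefficient on the yeast and combined data, a pair scored highly by M2 alone moves the consensus score more there than on the human data, where M4 and M1 weigh more.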
Prediction performance of consensus methods that combine two or three methods
| Data set | Metric | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Yeast | AUC | 0.85 | 0.81 | 0.85 | 0.85 | 0.85 | 0.85 | 0.85 | 0.85 | 0.79 | 0.81 | 0.81 |
| Yeast | P20R | 0.34 | 0.37 | 0.34 | 0.35 | 0.29 | 0.29 | 0.29 | 0.35 | 0.32 | 0.38 | 0.37 |
| Human | AUC | 0.90 | 0.88 | 0.90 | 0.90 | 0.89 | 0.89 | 0.89 | 0.90 | 0.82 | 0.88 | 0.88 |
| Human | P20R | 0.67 | 0.63 | 0.67 | 0.67 | 0.64 | 0.64 | 0.64 | 0.67 | 0.52 | 0.62 | 0.63 |
| Combined | AUC | 0.88 | 0.86 | 0.88 | 0.88 | 0.86 | 0.86 | 0.86 | 0.88 | 0.81 | 0.86 | 0.86 |
| Combined | P20R | 0.59 | 0.54 | 0.59 | 0.59 | 0.54 | 0.54 | 0.54 | 0.59 | 0.48 | 0.54 | 0.54 |
| Human - Yeast¹ | AUC | 0.73 | 0.71 | 0.73 | 0.73 | 0.73 | 0.73 | 0.73 | 0.73 | 0.72 | 0.71 | 0.71 |
| Human - Yeast¹ | P20R | 0.07 | 0.07 | 0.07 | 0.07 | 0.06 | 0.06 | 0.06 | 0.07 | 0.06 | 0.07 | 0.07 |
| Yeast - Human¹ | AUC | 0.68 | 0.65 | 0.68 | 0.68 | 0.67 | 0.67 | 0.67 | 0.68 | 0.67 | 0.65 | 0.65 |
| Yeast - Human¹ | P20R | 0.04 | 0.03 | 0.04 | 0.04 | 0.03 | 0.03 | 0.03 | 0.04 | 0.04 | 0.03 | 0.03 |

¹ "A - B" signifies training with the A data and testing on the B data.
Analysis of prediction results by the gene ontology slims
| Rank | GO ID | Number of pairs | Best method | Performance | GO term |
|---|---|---|---|---|---|
| Yeast data, sorted by AUC | | | | | |
| 1 | 0005198 | 39513 | C¹ | 0.90 | Structural molecule activity |
| 2 | 0007124 | 9192 | C | 0.89 | Pseudohyphal growth |
| 3 | 0006997 | 10093 | C | 0.89 | Nucleus organization |
| 4 | 0007047 | 18668 | C | 0.89 | Cell wall organization |
| 5 | 0005215 | 44019 | C | 0.89 | Transporter activity |
| Yeast data, sorted by P20R | | | | | |
| 1 | 0005618 | 8689 | M2 | 1.00 | Cell wall |
| 2 | 0006997 | 10093 | C | 0.97 | Nucleus organization |
| 3 | 0042254 | 44304 | C | 0.95 | Ribosome biogenesis |
| 4 | 0005198 | 39513 | C | 0.92 | Structural molecule activity |
| 5 | 0008289 | 10690 | M2 | 0.92 | Lipid binding |
| Human data, sorted by AUC | | | | | |
| 1 | 0008907 | 245 | C | 1.00 | Integrase activity |
| 2 | 0004871 | 71939 | C | 0.92 | Signal transducer activity |
| 3 | 0051704 | 88280 | C | 0.92 | Multi-organism process |
| 4 | 0008219 | 98990 | C | 0.92 | Cell death |
| 5 | 0016740 | 244001 | C | 0.91 | Transferase activity |
| Human data, sorted by P20R | | | | | |
| 1 | 0009405 | 1017 | M2 | 1.00 | Pathogenesis |
| 2 | 0008907 | 245 | M2 | 1.00 | Integrase activity |
| 3 | 0004871 | 71939 | C | 0.91 | Signal transducer activity |
| 4 | 0004872 | 208752 | C | 0.88 | Receptor activity |
| 5 | 0016301 | 110554 | C | 0.88 | Kinase activity |
| Combined data, sorted by AUC | | | | | |
| 1 | 0008907 | 245 | C | 0.99 | Integrase activity |
| 2 | 0004871 | 77553 | C | 0.92 | Signal transducer activity |
| 3 | 0015267 | 7183 | C | 0.91 | Channel activity |
| 4 | 0004872 | 208752 | C | 0.91 | Receptor activity |
| 5 | 0051704 | 88280 | C | 0.91 | Multi-organism process |
| Combined data, sorted by P20R | | | | | |
| 1 | 0005618 | 8689 | M2 | 1.00 | Cell wall |
| 2 | 0009405 | 1017 | M2 | 1.00 | Pathogenesis |
| 3 | 0008907 | 245 | M2 | 1.00 | Integrase activity |
| 4 | 0006997 | 10093 | M2 | 0.97 | Nucleus organization |
| 5 | 0008289 | 10690 | M2 | 0.92 | Lipid binding |

¹ C: the consensus method that integrates the four methods M1 through M4.
For each combination of a data set (yeast, human, or combined) and an evaluation measure (AUC or P20R), the five GO terms with the best performance are listed. For each GO term, the third column shows the number of protein-protein pairs in the data set in which either protein is annotated with that term. Also shown are the best-performing method (column 4) and its performance (column 5).
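The pair counts in the third column follow an "either protein" rule, which can be sketched as below. This is a hypothetical illustration rather than the paper's code: the function name, the protein identifiers, and the toy annotations are made up, and only the two GO IDs are taken from the table.

```python
from collections import Counter
from typing import Iterable, Mapping, Set, Tuple


def count_pairs_per_go_term(
    pairs: Iterable[Tuple[str, str]],
    annotations: Mapping[str, Set[str]],
) -> Counter:
    """Count, for each GO term, the pairs in which either protein carries it."""
    counts: Counter = Counter()
    for protein_a, protein_b in pairs:
        # Union of both annotation sets: a pair is counted once per term,
        # even when both proteins are annotated with that term.
        terms = annotations.get(protein_a, set()) | annotations.get(protein_b, set())
        counts.update(terms)
    return counts


# Toy input: P1..P4 are placeholder protein names, not real identifiers.
toy_pairs = [("P1", "P2"), ("P2", "P3"), ("P3", "P4")]
toy_annotations = {
    "P1": {"0005618"},             # Cell wall
    "P2": {"0005618", "0006997"},  # Cell wall, Nucleus organization
    "P4": {"0006997"},             # Nucleus organization
}
print(count_pairs_per_go_term(toy_pairs, toy_annotations))
```

Performance for a GO term would then be computed on the subset of pairs counted for that term, so terms with very few pairs (such as the 245 pairs for integrase activity) yield less stable estimates.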