Zhi-Ping Liu, Ling-Yun Wu, Yong Wang, Luonan Chen, Xiang-Sun Zhang.
Abstract
BACKGROUND: Annotation of protein functions is an important task in the post-genomic era. Most early approaches for this task exploit only the sequence or global structure information. However, protein surfaces are believed to be crucial to protein functions because they are the main interfaces to facilitate biological interactions. Recently, several databases related to structural surfaces, such as pockets and cavities, have been constructed with a comprehensive library of identified surface structures. For example, CASTp provides identification and measurements of surface accessible pockets as well as interior inaccessible cavities.
Year: 2007 PMID: 18070366 PMCID: PMC2233648 DOI: 10.1186/1471-2105-8-475
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Table 1. Percentage of edges (pocket pairs) with similar GO terms in the pocket similarity network constructed using the E-value of sequence similarity between two pockets as the threshold.
| Threshold | 1.0 × 10⁻¹ | 1.0 × 10⁻² | 1.0 × 10⁻³ | 1.0 × 10⁻⁴ | 1.0 × 10⁻⁵ |
| Pocket pairs | 3178 | 1460 | 652 | 320 | 219 |
| GO annotated pairs | 2359 | 1086 | 492 | 241 | 160 |
| Similar pairs | 581 | 375 | 252 | 178 | 126 |
| Percentage | 24.63% | 34.53% | 51.21% | 73.85% | 78.75% |
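The Percentage row appears to be the number of similar pairs divided by the number of GO annotated pairs (not by all pocket pairs). A minimal sketch that recomputes it from the counts in the E-value table above; the function and variable names are illustrative and are not taken from the paper:

```python
# Counts copied from the E-value table above, keyed by threshold.
go_annotated = {1e-1: 2359, 1e-2: 1086, 1e-3: 492, 1e-4: 241, 1e-5: 160}
similar = {1e-1: 581, 1e-2: 375, 1e-3: 252, 1e-4: 178, 1e-5: 126}


def similar_percentage(n_similar, n_annotated):
    """Share of GO-annotated pocket pairs with similar GO terms, in percent."""
    if n_annotated == 0:
        raise ValueError("no GO-annotated pairs at this threshold")
    return 100.0 * n_similar / n_annotated


percentages = {
    t: similar_percentage(similar[t], go_annotated[t]) for t in go_annotated
}

for threshold, pct in percentages.items():
    print(f"threshold {threshold:g}: {pct:.2f}%")
```

The recomputed values agree with the table to within 0.01 percentage points; the small differences suggest the published figures were truncated rather than rounded.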
Table 2. Percentage of edges (pocket pairs) with similar GO terms in the pocket similarity network constructed using the p-value of structural cRMSD similarity between two pockets as the threshold.
| Threshold | 1.0 × 10⁻¹ | 1.0 × 10⁻² | 1.0 × 10⁻³ | 1.0 × 10⁻⁴ | 1.0 × 10⁻⁵ |
| Pocket pairs | 1002 | 602 | 521 | 481 | 468 |
| GO annotated pairs | 778 | 501 | 437 | 405 | 397 |
| Similar pairs | 508 | 464 | 430 | 403 | 396 |
| Percentage | 65.29% | 92.61% | 98.40% | 99.50% | 99.75% |
Table 3. Percentage of edges (pocket pairs) with similar GO terms in the pocket similarity network constructed using the p-value of structural oRMSD similarity between two pockets as the threshold.
| Threshold | 1.0 × 10⁻¹ | 1.0 × 10⁻² | 1.0 × 10⁻³ | 1.0 × 10⁻⁴ | 1.0 × 10⁻⁵ |
| Pocket pairs | 2354 | 1182 | 757 | 618 | 567 |
| GO annotated pairs | 1786 | 922 | 617 | 516 | 483 |
| Similar pairs | 626 | 564 | 515 | 491 | 475 |
| Percentage | 35.05% | 61.17% | 83.47% | 95.16% | 98.34% |
Table 4. Percentage of edges (pocket pairs) with similar GO terms in the pocket similarity network constructed using the combination of the E-value of sequence similarity and the p-value of structural cRMSD similarity between two pockets as the threshold.
| Threshold | 1.0 × 10⁻¹ | 1.0 × 10⁻² | 1.0 × 10⁻³ | 1.0 × 10⁻⁴ | 1.0 × 10⁻⁵ |
| Pocket pairs | 711 | 360 | 257 | 189 | 145 |
| GO annotated pairs | 551 | 293 | 203 | 148 | 109 |
| Similar pairs | 402 | 287 | 203 | 148 | 109 |
| Percentage | 72.95% | 97.95% | 100% | 100% | 100% |
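One plausible reading of the combined threshold is that a pocket pair becomes an edge only when both the sequence E-value and the cRMSD p-value pass the same cutoff. The sketch below illustrates that reading; the AND rule, the non-strict comparison, and the pair data are all assumptions for illustration, not details confirmed by the text:

```python
def passes_combined(e_value, p_value, threshold):
    # Assumption: "combination" means both scores must be at or below the cutoff.
    return e_value <= threshold and p_value <= threshold


def build_edges(scored_pairs, threshold):
    """Keep the pocket pairs whose E-value and p-value both pass the cutoff."""
    return [
        (a, b)
        for a, b, e_value, p_value in scored_pairs
        if passes_combined(e_value, p_value, threshold)
    ]


# Hypothetical pocket pairs: (pocket, pocket, sequence E-value, cRMSD p-value).
scored_pairs = [
    ("p1", "p2", 1e-6, 1e-4),
    ("p1", "p3", 1e-2, 1e-5),
    ("p2", "p3", 1e-4, 1e-3),
]

edges_loose = build_edges(scored_pairs, 1e-3)
edges_strict = build_edges(scored_pairs, 1e-4)
```

Under this reading the combined network can never contain more edges than either single-criterion network at the same cutoff, which is consistent with the pair counts in the tables above.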
Figure 1. Similar pockets imply similar functions. The relationship between pocket similarity and functional similarity. The graph summarizes the results in Tables 1-4. Four statistical significance measures are used as thresholds to construct the similarity network: the E-value of sequence similarity, the p-value of the structural similarity cRMSD, the p-value of the structural similarity oRMSD, and the combination of the E-value and the cRMSD p-value. The horizontal axis represents the number of GO annotated edges (i.e. pocket pairs) found at the given thresholds. The vertical axis denotes the percentage of GO annotated edges with at least one identical GO term between the two end nodes. The graph shows that the structure-based thresholds perform better than the sequence-based threshold.
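The vertical-axis quantity can be sketched as follows: among edges whose two end nodes both carry GO annotations, count those sharing at least one GO term. Treating an edge as "GO annotated" only when both ends are annotated is an assumption, and the pocket names and annotations below are toy data:

```python
def shared_term_percentage(edges, annotations):
    """Return (number of annotated edges, percent of them sharing a GO term)."""
    # Assumption: an edge counts as GO annotated only if both ends have terms.
    annotated = [
        (a, b) for a, b in edges if annotations.get(a) and annotations.get(b)
    ]
    if not annotated:
        return 0, 0.0
    similar = sum(1 for a, b in annotated if annotations[a] & annotations[b])
    return len(annotated), 100.0 * similar / len(annotated)


# Toy annotations (pocket -> set of GO terms); p4 is un-annotated.
annotations = {
    "p1": {"GO:0003677", "GO:0006260"},
    "p2": {"GO:0003677"},
    "p3": {"GO:0016491"},
    "p4": set(),
}
edges = [("p1", "p2"), ("p2", "p3"), ("p3", "p4"), ("p1", "p3")]

n_annotated, pct = shared_term_percentage(edges, annotations)
print(n_annotated, round(pct, 2))
```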
Figure 2. Recall-precision graph. Recall-precision relationship graph for the prediction results.
Figure 3. Statistics of individual predictions. Distributions of proteins with respect to (a) recall and (b) precision values. The number of proteins is shown on each bar. Most of the 316 proteins have high recall and precision values, which means that their functions are largely covered by the predictions and that false positives are few. (c) Percentage of proteins with both recall and precision higher than the given thresholds. 156 proteins are predicted with recall 1 and precision 1.
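Per-protein recall and precision can be sketched with the standard set-based definitions: recall is the fraction of known GO terms that were predicted, and precision is the fraction of predicted GO terms that are known. The protein names, term sets, and the use of "at or above" for the thresholds in panel (c) are assumptions for illustration:

```python
def recall_precision(predicted, known):
    """Standard set-based recall and precision for one protein."""
    hits = len(predicted & known)
    recall = hits / len(known) if known else 0.0
    precision = hits / len(predicted) if predicted else 0.0
    return recall, precision


def share_at_or_above(results, min_recall, min_precision):
    """Fraction of proteins whose recall and precision both reach the cutoffs."""
    if not results:
        return 0.0
    ok = sum(1 for r, p in results if r >= min_recall and p >= min_precision)
    return ok / len(results)


# Toy data: protein -> (predicted GO terms, known GO terms).
proteins = {
    "protA": ({"T1", "T2"}, {"T1", "T2"}),
    "protB": ({"T1", "T2", "T3"}, {"T1", "T2"}),
    "protC": ({"T1"}, {"T1", "T2"}),
}

results = [recall_precision(pred, known) for pred, known in proteins.values()]
perfect_share = share_at_or_above(results, 1.0, 1.0)
```

Here only `protA` reaches recall 1 and precision 1, the category that the caption reports for 156 of the 316 proteins.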
Figure 4. Recall-precision graphs for the three ontologies. Recall-precision relationship graphs of the predictions for the three GO ontologies independently: (a) cellular component ontology, (b) molecular function ontology, (c) biological process ontology.
Prediction results for the three GO ontologies individually. These results are similar to those of the integrated version and show that the performance of the proposed prediction method does not depend heavily on the GO terms considered. C denotes cellular component, F molecular function, and P biological process. The row "Proteins with R & P 100%" counts the proteins predicted with recall 1 and precision 1.
| Gene Ontology | C | F | P |
| Maximum F-measure | 0.823 | 0.778 | 0.772 |
| Recall-precision | (0.839, 0.805) | (0.797, 0.729) | (0.792, 0.715) |
| Number of proteins | 92 | 290 | 275 |
| Predicted proteins | 81 | 249 | 233 |
| Not predicted | 11 | 41 | 42 |
| Proteins with recall 100% | 74 | 207 | 202 |
| Proteins with precision 100% | 69 | 174 | 166 |
| Proteins with R & P 100% | 66 | 155 | 150 |
Prediction results in the pocket similarity network using different cRMSD p-values as thresholds. The detailed prediction results for thresholds from 10⁻³ to 10⁻⁵ are listed in the Additional Files.
| Threshold | 10⁻² | 10⁻³ | 10⁻⁴ | 10⁻⁵ |
| Maximum F-measure | 0.756 | 0.875 | 0.898 | 0.904 |
| Recall-precision | (0.774, 0.706) | (0.888, 0.839) | (0.907, 0.869) | (0.913, 0.877) |
| Coverage | 84.50% | 96.98% | 98.35% | 99.16% |
| Number of proteins | 316 | 265 | 242 | 237 |
Some predicted GO terms of the un-annotated proteins. The predictions are largely consistent with existing functional knowledge in databases and the literature. The full predictions for all un-annotated proteins in the considered pocket similarity network can be found at our website.
| Protein | Predicted GO terms | GO description | Information |
| 1orn | GO:0003677 F | DNA binding | PDB Classification: Hydrolase/dna |
| | GO:0019104 F | DNA N-glycosylase activity | |
| | GO:0003906 F | DNA-(apurinic or apyrimidinic site) lyase activity | |
| 1lia | GO:0009503 C | light-harvesting complex (sensu Viridiplantae) | PDB Classification: Light Harvesting Protein |
| | GO:0015979 P | photosynthesis | PMID: 11134927 |
| 1dnl | GO:0010181 F | FMN binding | PDB Classification: Oxidoreductase |
| | GO:0004733 F | pyridoxamine-phosphate oxidase activity [source: EC:1.4.3.5] | EC no.: 1.4.3.5 |
| 1jbe | GO:0004871 F | signal transducer activity | PDB Classification: Signaling Protein |
| | GO:0000160 P | two-component signal transduction system (phosphorelay) | PMID: 12270703 |
| 1uap | GO:0005509 F | calcium ion binding | PDB Classification: Protein Binding |
| | GO:0004252 F | serine-type endopeptidase activity | PMID: 12670942 |
| 1fb3 | GO:0016491 F | oxidoreductase activity | PDB Classification: Oxidoreductase |
| | GO:0004324 F | ferredoxin-NADP+ reductase activity [source: EC:1.18.1.2] | EC no.: 1.18.1.2 |
| | GO:0050660 F | FAD binding | |
| | GO:0050661 F | NADP binding | |
| | GO:0006118 P | electron transport | |
| 1got | GO:0005834 C | heterotrimeric G-protein complex | PDB Classification: Complex (gtp Binding/transducer) |
| | GO:0004871 F | signal transducer activity | PMID: 17036056 |
| | GO:0007186 P | G-protein coupled receptor protein signaling pathway | |
| 1k8g | GO:0003677 F | DNA binding | PDB Classification: DNA Binding Protein/dna |
| | GO:0042162 F | telomeric DNA binding | PMID: 11743727 |
| | GO:0006260 P | DNA replication | |
| 1iis | GO:0042612 C | MHC class I protein complex | PDB Classification: Immune System |
| | GO:0030106 F | MHC class I receptor activity | |
| | GO:0019882 P | antigen processing and presentation | |
| 1lug | GO:0008270 F | zinc ion binding | PDB Classification: Lyase |
| | GO:0004089 F | carbonate dehydratase activity [source: EC:4.2.1.1] | EC no.: 4.2.1.1 |
| | GO:0006730 P | one-carbon compound metabolic process | Chemical Component: ZINC ION |
Figure 5. Comparison of prediction results between methods by global and local similarity. The recall-precision (RP) graphs of the prediction results by the pocket similarity network versus those by the protein similarity network, for the proteins in their intersection. (a) The RP curve of the prediction results by pvSOAR cRMSD p-value 10⁻² versus CE Z-Score 4.8. (b) Illustration of the intersection of the proteins that can be predicted by the two methods. The red circle represents the pocket similarity network and the blue circle the protein similarity network. The RP graph is drawn for the proteins in the intersection of the two circles. (c) and (d) show the same comparison for pvSOAR cRMSD p-value 10⁻³ and CE Z-Score 5.8.
Figure 6. The prediction results with selected GO terms. The recall-precision (RP) graphs of the prediction results with selected specific GO terms by the pocket similarity network (pvSOAR cRMSD p-value 10⁻²). (a) The RP curves of the prediction results when the informative GO terms are selected based on GO term probability. (b) The RP curves of the prediction results when the informative GO terms are selected based on depth level. The red dashed line gives the original prediction results without filtering.
Figure 7. Example of pocket scoring. An example of the pocket scoring scheme. The closest neighbors of pocket C3 and their GO terms are listed in the left box. The number of occurrences of each GO term among the closest neighbors of pocket C3 is recorded in the corresponding row of the pocket-term table. For example, GO term T1 is annotated to 8 of the closest neighbors, and GO term T2 to 6 of them. The scores between pocket C3 and the GO terms are obtained by normalizing the row.
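The scoring step in this caption can be sketched as counting, for each GO term, how many of the closest neighbors carry it and then normalizing the resulting row. The caption does not say how the row is normalized; dividing by the row sum is an assumption here, and the neighbor list is toy data chosen to reproduce the counts 8 (T1) and 6 (T2):

```python
from collections import Counter


def pocket_term_scores(neighbor_terms):
    """Count GO term occurrences over the neighbors and normalize the row."""
    # Each neighbor contributes at most once per term.
    counts = Counter(term for terms in neighbor_terms for term in set(terms))
    total = sum(counts.values())
    if total == 0:
        return {}
    # Assumption: normalize by the row sum so the scores add up to 1.
    return {term: count / total for term, count in counts.items()}


# Toy neighbors of pocket C3: T1 occurs in 8 of them, T2 in 6.
neighbor_terms = [{"T1"}] * 4 + [{"T1", "T2"}] * 4 + [{"T2"}] * 2

scores_c3 = pocket_term_scores(neighbor_terms)
print(scores_c3)
```

With row-sum normalization, T1 and T2 receive scores 8/14 and 6/14; a different normalization (for example, dividing by the number of neighbors) would change the scale but not the ranking of terms.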
Figure 8. Procedure of protein function prediction. An example of the protein scoring scheme. The target protein is submitted to the CASTp server to obtain its pockets. The scores between pockets and GO terms are then inferred as shown in Figure 7. The score of the protein for a GO term is the weighted sum of the scores of all its pockets for that term.
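The final aggregation step can be sketched as a weighted sum over the pockets of a protein. The caption does not specify how the pocket weights are chosen, so the weights below, like the pocket names and scores, are hypothetical values for illustration:

```python
def protein_term_scores(pocket_scores, weights):
    """Weighted sum of per-pocket GO term scores for one protein."""
    totals = {}
    for pocket, scores in pocket_scores.items():
        weight = weights[pocket]
        for term, score in scores.items():
            totals[term] = totals.get(term, 0.0) + weight * score
    return totals


# Hypothetical per-pocket scores (pocket -> GO term -> score).
pocket_scores = {
    "C1": {"T1": 0.5, "T2": 0.5},
    "C3": {"T1": 8.0 / 14.0, "T2": 6.0 / 14.0},
}
# Hypothetical pocket weights; the weighting scheme is not given in the caption.
weights = {"C1": 0.25, "C3": 0.75}

protein_scores = protein_term_scores(pocket_scores, weights)
ranked_terms = sorted(protein_scores, key=protein_scores.get, reverse=True)
```

Ranking the GO terms by the aggregated score and applying a cutoff then yields the predicted term set for the protein; the cutoff itself is another parameter that would need to be chosen.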