Maxat Kulmanov, Robert Hoehndorf.
Abstract
MOTIVATION: Protein functions are often described using the Gene Ontology (GO), an ontology consisting of over 50 000 classes and a large set of formal axioms. Predicting the functions of proteins is one of the key challenges in computational biology, and a variety of machine learning methods have been developed for this purpose. However, these methods usually require a significant amount of training data and cannot make predictions for GO classes that have only a few or no experimental annotations.
Year: 2022 PMID: 35758802 PMCID: PMC9235501 DOI: 10.1093/bioinformatics/btac256
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931
Fig. 1. Distribution of proteins by sequence similarity. We compute pairwise similarity between all proteins in the SwissProt dataset; the figure shows the frequency of pairs by sequence similarity.
Summary of the UniProtKB-SwissProt dataset
| Ontology | Terms | Proteins | Groups | Training | Validation | Testing |
|---|---|---|---|---|---|---|
| MFO | 6868 | 43 279 | 22 445 | 34 716 | 3851 | 4712 |
| BPO | 21 381 | 58 729 | 29 311 | 47 733 | 5552 | 5444 |
| CCO | 2832 | 59 257 | 30 006 | 48 318 | 4970 | 5969 |
Note: The table shows the number of GO terms, total number of proteins, number of groups of similar proteins, number of proteins in training, validation and testing sets for the UniProtKB-SwissProt dataset.
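The table reports groups of similar proteins alongside the split sizes. One plausible use of such groups, which this section does not state explicitly, is to assign whole groups to a split so that similar proteins never straddle training and testing. The sketch below illustrates that idea; the function name `split_by_group`, the 80/10/10 fractions, and the fixed seed are assumptions for illustration only.

```python
import random
from typing import List, Sequence, Tuple


def split_by_group(
    groups: Sequence[Sequence[str]],
    fractions: Tuple[float, float] = (0.8, 0.1),  # assumed train/validation shares
    seed: int = 0,
) -> Tuple[List[str], List[str], List[str]]:
    """Assign whole groups of similar proteins to train/validation/test."""
    order = list(range(len(groups)))
    random.Random(seed).shuffle(order)  # shuffle groups, not proteins

    cut1 = int(len(order) * fractions[0])
    cut2 = cut1 + int(len(order) * fractions[1])

    def flatten(indices: List[int]) -> List[str]:
        return [protein for g in indices for protein in groups[g]]

    # Everything after the second cut goes to the test split.
    return flatten(order[:cut1]), flatten(order[cut1:cut2]), flatten(order[cut2:])
```

Because the split is made over groups rather than individual proteins, the resulting protein counts per split only approximate the requested fractions.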
Summary of the NetGO dataset
| Ontology | Terms | Training | Validation | Testing |
|---|---|---|---|---|
| MFO | 6854 | 62 646 | 1128 | 505 |
| BPO | 21 814 | 89 828 | 1124 | 491 |
| CCO | 2880 | 81 377 | 1359 | 268 |
Note: The table shows the number of GO terms and the number of proteins used as part of training, validation, and testing in the NetGO dataset.
Fig. 2. High-level overview and example of the DeepGOZero model. On the left, a protein P is embedded in a vector space using an MLP, whereas the right side shows how GO axioms are embedded using the EL Embedding method; the MLP embeds the protein in the same space as the GO axioms. The example shows a protein P which is annotated to positive regulation of protein kinase B signaling (GO:0051897). This class is defined as biological regulation (GO:0065007) and positively regulates (RO:0002213) some protein kinase B signaling (GO:0043491). This knowledge allows us to annotate proteins with GO:0051897 even if we do not have any training proteins (zero-shot). Both the protein and the GO class embeddings are optimized jointly during training of DeepGOZero.
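The definition in the caption can be written as a single description logic axiom. The following is a sketch in standard EL notation, using the identifiers from the caption; the choice of an equivalence axiom (rather than a subclass axiom) is an assumption based on the phrase "is defined as".

```latex
\[
\text{GO:0051897} \;\equiv\; \text{GO:0065007} \;\sqcap\; \exists\, \text{RO:0002213}.\, \text{GO:0043491}
\]
```

Read left to right: positive regulation of protein kinase B signaling is exactly a biological regulation that positively regulates some protein kinase B signaling. Since the right-hand side is built only from other classes and a relation, an embedding of the left-hand class can be derived even when no protein is annotated to it.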
Prediction results for Molecular Function on the UniProtKB-SwissProt dataset
| Method | Fmax | Smin | AUPR | AUC |
|---|---|---|---|---|
| DiamondScore | 0.623 | 10.145 | 0.380 | 0.747 |
| MLP | 0.657 | 9.857 | 0.655 | 0.882 |
| MLP + DiamondScore | | | 0.649 | 0.886 |
| DeepGOCNN | 0.430 | 13.601 | 0.393 | 0.765 |
| DeepGOPlus | 0.634 | 10.072 | 0.636 | 0.844 |
| DeepGOZero | 0.657 | 9.808 | 0.657 | 0.903 |
| DeepGOZero + DiamondScore | 0.668 | 9.595 | | |
Note: This table shows protein-centric Fmax, Smin and AUPR, and the class-centric average AUC. The bold values indicate the best results.
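For readers unfamiliar with the protein-centric Fmax measure, the sketch below follows the usual CAFA-style definition: at each threshold, precision is averaged over proteins with at least one prediction, recall is averaged over all proteins, and Fmax is the best resulting F-score. It is an illustration, not the authors' evaluation code; the function name `fmax`, the dictionary layout, and the 0.01 threshold step are assumptions, and annotations are assumed to be already propagated up the ontology.

```python
from typing import Dict, Set


def fmax(
    scores: Dict[str, Dict[str, float]],  # protein -> {GO term: predicted score}
    truth: Dict[str, Set[str]],           # protein -> set of true GO terms
    steps: int = 100,                     # assumed threshold grid: 0.01 .. 1.00
) -> float:
    """Protein-centric Fmax over a grid of score thresholds."""
    best = 0.0
    for i in range(1, steps + 1):
        t = i / steps
        precision_sum, n_with_predictions, recall_sum = 0.0, 0, 0.0
        for protein, true_terms in truth.items():
            predicted = {c for c, s in scores.get(protein, {}).items() if s >= t}
            tp = len(predicted & true_terms)
            if predicted:  # precision only counts proteins with predictions
                precision_sum += tp / len(predicted)
                n_with_predictions += 1
            if true_terms:
                recall_sum += tp / len(true_terms)
        if n_with_predictions == 0:
            continue
        precision = precision_sum / n_with_predictions
        recall = recall_sum / len(truth)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

Higher Fmax is better, which is why the largest value in the Fmax column marks the best method.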
Prediction results for Biological Process on the UniProtKB-SwissProt dataset
| Method | Fmax | Smin | AUPR | AUC |
|---|---|---|---|---|
| DiamondScore | 0.444 | 45.040 | 0.313 | 0.610 |
| MLP | 0.460 | 43.987 | 0.435 | 0.793 |
| MLP + DiamondScore | | | | 0.797 |
| DeepGOCNN | 0.344 | 48.543 | 0.289 | 0.672 |
| DeepGOPlus | 0.462 | 44.485 | 0.421 | 0.726 |
| DeepGOZero | 0.451 | 44.621 | 0.422 | 0.798 |
| DeepGOZero + DiamondScore | 0.482 | 44.058 | 0.446 | |
Note: This table shows protein-centric Fmax, Smin and AUPR, and the class-centric average AUC. The bold values indicate the best results.
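The Smin column is much larger for Biological Process than for the other sub-ontologies because Smin sums information content over wrongly predicted and missed classes. The sketch below follows the standard definition, S(t) = sqrt(ru(t)^2 + mi(t)^2), where ru is the average information content of missed true classes and mi that of false positives. The function name `smin`, the `ic` dictionary of per-class information content, and the threshold grid are assumptions for illustration.

```python
import math
from typing import Dict, Set


def smin(
    scores: Dict[str, Dict[str, float]],  # protein -> {GO term: predicted score}
    truth: Dict[str, Set[str]],           # protein -> set of true GO terms
    ic: Dict[str, float],                 # GO term -> information content (assumed given)
    steps: int = 100,                     # assumed threshold grid: 0.01 .. 1.00
) -> float:
    """Minimum semantic distance over a grid of score thresholds."""
    if not truth:
        raise ValueError("truth must contain at least one protein")
    n = len(truth)
    best = math.inf
    for i in range(1, steps + 1):
        t = i / steps
        ru = 0.0  # remaining uncertainty: IC of true terms that were missed
        mi = 0.0  # misinformation: IC of predicted terms that are wrong
        for protein, true_terms in truth.items():
            predicted = {c for c, s in scores.get(protein, {}).items() if s >= t}
            ru += sum(ic.get(c, 0.0) for c in true_terms - predicted)
            mi += sum(ic.get(c, 0.0) for c in predicted - true_terms)
        best = min(best, math.hypot(ru / n, mi / n))
    return best
```

Lower Smin is better, so the smallest value in the Smin column marks the best method.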
Prediction results for Cellular Component on the UniProtKB-SwissProt dataset.
| Method | Fmax | Smin | AUPR | AUC |
|---|---|---|---|---|
| DiamondScore | 0.581 | 11.092 | 0.352 | 0.648 |
| MLP | 0.667 | | 0.670 | 0.846 |
| MLP + DiamondScore | 0.666 | 10.526 | 0.654 | 0.851 |
| DeepGOCNN | 0.641 | 11.396 | 0.645 | 0.775 |
| DeepGOPlus | | 10.591 | 0.667 | 0.821 |
| DeepGOZero | 0.661 | 10.681 | 0.665 | 0.854 |
| DeepGOZero + DiamondScore | 0.667 | 10.615 | | |
Note: This table shows protein-centric Fmax, Smin and AUPR, and the class-centric average AUC. The bold values indicate the best results.
Fig. 3. Average prediction performance of classes grouped by annotation size on the UniProtKB-SwissProt dataset
The comparison of performance on the NetGO dataset. The bold values indicate the best results.
| Method | Fmax (MFO) | Fmax (BPO) | Fmax (CCO) | Smin (MFO) | Smin (BPO) | Smin (CCO) | AUPR (MFO) | AUPR (BPO) | AUPR (CCO) | AUC (MFO) | AUC (BPO) | AUC (CCO) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DiamondScore | 0.627 | 0.407 | 0.625 | 5.503 | 25.918 | 9.351 | 0.427 | 0.272 | 0.412 | 0.836 | 0.643 | 0.682 |
| DeepGOCNN | 0.589 | 0.337 | 0.624 | 6.417 | 27.235 | 10.617 | 0.565 | 0.271 | 0.623 | 0.867 | 0.694 | 0.834 |
| DeepGOPlus | 0.661 | 0.419 | 0.655 | 5.407 | 25.603 | 9.374 | 0.667 | 0.342 | 0.663 | 0.913 | 0.737 | |
| MLP | 0.667 | 0.419 | 0.656 | 5.326 | | 9.688 | 0.672 | 0.359 | 0.650 | 0.921 | 0.738 | 0.839 |
| MLP + DiamondScore | 0.659 | | 0.662 | 5.316 | 24.904 | 9.545 | 0.664 | | 0.651 | 0.924 | 0.740 | 0.846 |
| DeepGOZero | 0.662 | 0.396 | 0.662 | 5.322 | 25.838 | 9.834 | 0.668 | 0.337 | 0.645 | 0.930 | 0.717 | 0.809 |
| DeepGOZero + DiamondScore | 0.655 | 0.432 | 0.675 | 5.337 | 25.439 | 9.391 | 0.665 | 0.356 | 0.654 | | 0.725 | 0.827 |
| NetGO2 (Webserver) | | 0.431 | 0.662 | | 25.076 | 9.473 | | 0.343 | 0.627 | 0.856 | 0.635 | 0.772 |
| DeepGraphGO | 0.671 | 0.418 | | 5.374 | 25.866 | | 0.647 | | 0.669 | 0.930 | | 0.857 |
| TALE+ | 0.466 | 0.382 | 0.661 | 8.136 | 26.308 | 9.599 | 0.441 | 0.310 | | 0.753 | 0.608 | 0.778 |
Average information content of the predictions on NetGO dataset
| Ontology | DeepGOZero + DiamondScore | NetGO2 | DeepGraphGO | TALE+ |
|---|---|---|---|---|
| MFO | 9.018 | 8.034 | 9.099 | |
| BPO | | 26.897 | 29.854 | 31.489 |
| CCO | | 7.141 | 10.145 | 8.870 |
Note: This table shows average information content of the predictions that perform best in measure. The bold values indicate the best results.
Zero-shot and trained prediction performance on specific classes with more than 100 annotations
| Ontology | Term | Name | AUC (test) | AUC (all) | AUC (trained) | AUC (trained mlp) |
|---|---|---|---|---|---|---|
| mf | GO:0001227 | DNA-binding transcription repressor activity, RNA polymerase II-specific | 0.257 | 0.405 | 0.932 | 0.926 |
| mf | GO:0001228 | DNA-binding transcription activator activity, RNA polymerase II-specific | 0.574 | 0.699 | 0.948 | 0.944 |
| mf | GO:0003735 | Structural constituent of ribosome | 0.400 | 0.194 | 0.940 | 0.942 |
| mf | GO:0004867 | Serine-type endopeptidase inhibitor activity | 0.972 | 0.967 | 0.985 | 0.984 |
| mf | GO:0005096 | GTPase activator activity | 0.847 | 0.870 | 0.938 | 0.960 |
| bp | GO:0000381 | Regulation of alternative mRNA splicing, via spliceosome | 0.855 | 0.865 | 0.906 | 0.886 |
| bp | GO:0032729 | Positive regulation of interferon-gamma production | 0.870 | 0.919 | 0.932 | 0.906 |
| bp | GO:0032755 | Positive regulation of interleukin-6 production | 0.719 | 0.819 | 0.884 | 0.873 |
| bp | GO:0032760 | Positive regulation of tumor necrosis factor production | 0.861 | 0.906 | 0.925 | 0.867 |
| bp | GO:0046330 | Positive regulation of JNK cascade | 0.855 | 0.894 | 0.904 | 0.916 |
| bp | GO:0051897 | Positive regulation of protein kinase B signaling | 0.772 | 0.864 | 0.888 | 0.915 |
| bp | GO:0120162 | Positive regulation of cold-induced thermogenesis | 0.637 | 0.789 | 0.738 | 0.835 |
| cc | GO:0005762 | Mitochondrial large ribosomal subunit | 0.889 | 0.975 | 0.874 | 0.916 |
| cc | GO:0022625 | Cytosolic large ribosomal subunit | 0.898 | 0.969 | 0.893 | 0.849 |
| cc | GO:0042788 | Polysomal ribosome | 0.858 | 0.950 | 0.889 | 0.780 |
| cc | GO:1904813 | Ficolin-1-rich granule lumen | 0.653 | 0.782 | 0.792 | 0.900 |
| Average | | | 0.745 | 0.804 | 0.898 | 0.900 |
Note: Evaluation measures are class-centric. AUC (test) is the zero-shot performance on the test set, i.e. neither the class nor the protein was included during model training; AUC (all) is the zero-shot performance on all proteins, i.e. the class was never seen during training but the model has seen the proteins (annotated with other classes) during training; AUC (trained) and AUC (trained mlp) are the performance of the DeepGOZero and MLP models, respectively, on the testing set when trained with the class (i.e. the protein is not seen but other proteins with the class were used during training).
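All four AUC columns use the same class-centric computation and differ only in which proteins are scored and whether the class was seen in training. As a minimal sketch, the AUC for one class can be computed as the probability that a randomly chosen annotated protein is scored above a randomly chosen unannotated one, with ties counted as half. The function name `class_auc` and the input layout are assumptions for illustration, not the authors' evaluation code.

```python
from typing import Dict, Set


def class_auc(scores: Dict[str, float], positives: Set[str]) -> float:
    """AUC for a single GO class from per-protein scores.

    scores:    protein -> predicted score for this class
    positives: proteins that are truly annotated with this class
    """
    pos = [s for p, s in scores.items() if p in positives]
    neg = [s for p, s in scores.items() if p not in positives]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative protein")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum(
        1.0 if sp > sn else 0.5 if sp == sn else 0.0
        for sp in pos
        for sn in neg
    )
    return wins / (len(pos) * len(neg))
```

Under this reading, a value below 0.5, such as 0.257 for GO:0001227 in the AUC (test) column, means the zero-shot model ranks annotated proteins below unannotated ones more often than not for that class.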
Zero-shot prediction performance on classes with fewer than 10 annotations
| Ontology | Num. classes (all classes) | AUC (all classes) | Num. classes (defined classes) | AUC (defined classes) |
|---|---|---|---|---|
| MFO | 4791 | 0.804 | 95 | 0.862 |
| BPO | 11 092 | 0.737 | 4598 | 0.786 |
| CCO | 1492 | 0.819 | 151 | 0.915 |
Note: The table shows average performance for all evaluated classes and for classes that have definition axioms. It provides the number of classes and the average class-centric AUC.
Fig. 4. Distribution of GO classes by their number of annotated proteins (<10) for zero-shot prediction evaluation