Irina M Armean, Kathryn S Lilley, Matthew W B Trotter, Nicholas C V Pilkington, Sean B Holden.
Abstract
Motivation: Protein-protein interactions (PPI) play a crucial role in our understanding of protein function and biological processes. Experimental findings are increasingly standardized and recorded in ontologies, with the Gene Ontology (GO) being one of the most successful projects. Several PPI evaluation algorithms have been based on the application of probabilistic frameworks or machine learning algorithms to GO properties. Here, we introduce a new training set design and machine learning based approach that combines dependent heterogeneous protein annotations from the entire ontology to evaluate putative co-complex protein interactions determined by empirical studies.
Year: 2018 PMID: 29390084 PMCID: PMC5972588 DOI: 10.1093/bioinformatics/btx803
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Bar plot of the number of members per complex in the CYC2008 dataset of 408 complexes. The four largest complexes are the cytoplasmic ribosomal large subunit (81 members), the cytoplasmic ribosomal small subunit (57 members), the mitochondrial ribosomal large subunit (44 members) and the mitochondrial ribosomal small subunit (32 members). The most common complex size is 2 members (171/408 complexes, 42%).
Fig. 2. Annotation coverage of the protein interactions of the two initial training sets, each containing 9593 interactions.
Fig. 3. Performance of the different systems trained on the different datasets, evaluated using accuracy (ACC), Matthews Correlation Coefficient (MCC), F1, recall and precision as defined in the formulas (Supplementary Tables S5, S6). Systems shown: GIS-MaxEnt trained on the six different training sets (GO-BP, GO-CC, GO-MF, GO, GO-IP); SVM trained on the same six training sets (SVM-GO-BP, SVM-GO-CC, SVM-GO-MF, SVM-GO, SVM-IP, SVM-GO-IP); and the GIS-MaxEnt Ensemble (GME) and Multiple Kernel Learning (MKL), which were trained on all the data.
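The five scores named in the Fig. 3 caption can all be derived from a binary confusion matrix. The sketch below uses the standard textbook definitions; the exact formulas in Supplementary Tables S5 and S6 are not reproduced in this record and may differ in detail. The function name `binary_metrics` and the example counts are illustrative assumptions, not values from the study.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification scores from confusion-matrix counts.

    tp/fp/tn/fn are hypothetical counts of true/false positives/negatives.
    Undefined ratios (zero denominators) are reported as 0.0 by convention.
    """
    total = tp + fp + tn + fn
    acc = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Matthews Correlation Coefficient: correlation between predicted
    # and true labels, ranging from -1 to 1.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "MCC": mcc, "F1": f1,
            "recall": recall, "precision": precision}


# Made-up counts, for illustration only.
example = binary_metrics(tp=40, fp=10, tn=45, fn=5)
```

Unlike accuracy, MCC stays informative when the positive and negative classes are unbalanced, which is one common reason for reporting both.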
The individual weights on each dataset used by the MKL algorithm
| GO-CC | GO-BP | GO-MF | IP |
|---|---|---|---|
| 0.428±5E-05 | 0.422±3E-05 | 0.538±7E-05 | 0.590±11E-05 |
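In the usual multiple kernel learning formulation, the per-dataset weights scale one kernel matrix per data source before the matrices are summed into a single combined kernel. The sketch below illustrates that weighted sum with the mean weights from the table above. It assumes this standard formulation, which the record itself does not spell out, and the 2x2 identity kernels and the name `combine_kernels` are hypothetical placeholders.

```python
# Mean MKL weights copied from the table above (standard deviations omitted).
MKL_WEIGHTS = {"GO-CC": 0.428, "GO-BP": 0.422, "GO-MF": 0.538, "IP": 0.590}


def combine_kernels(kernels, weights):
    """Return the weighted sum of square kernel matrices (lists of lists).

    `kernels` maps a dataset name to its n x n kernel matrix; `weights`
    maps the same names to scalar weights.
    """
    names = list(weights)
    n = len(kernels[names[0]])
    combined = [[0.0] * n for _ in range(n)]
    for name in names:
        matrix, weight = kernels[name], weights[name]
        for i in range(n):
            for j in range(n):
                combined[i][j] += weight * matrix[i][j]
    return combined


# Toy base kernels: a 2 x 2 identity matrix per dataset (assumption).
identity = [[1.0, 0.0], [0.0, 1.0]]
toy_kernels = {name: identity for name in MKL_WEIGHTS}
combined_kernel = combine_kernels(toy_kernels, MKL_WEIGHTS)

# Sum of squared weights, useful for checking which norm constraint
# the weights might satisfy.
squared_norm = sum(w * w for w in MKL_WEIGHTS.values())
```

The squared weights in the table sum to roughly 1, which would be consistent with an L2-norm constraint on the weights, although the record does not state which constraint was used.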
AUC for go2ppi and GIS-MaxEnt in different configurations
| Model | GO-CC | GO-BP | GO-MF | GO |
|---|---|---|---|---|
| go2ppi-NB | 0.765/0.730 | 0.731/0.700 | 0.729/0.697 | 0.761/0.723 |
| go2ppi-RF | 0.991/0.719 | 0.985/0.697 | 0.957/0.695 | 0.997/0.708 |
| GIS-MaxEnt, term-only | 0.963 | 0.959 | 0.950 | 0.972 |
| GIS-MaxEnt, all-parents | 0.965 | 0.787 | 0.956 | 0.978 |
Note: The go2ppi algorithm reports two results, displayed as Train/Test. ‘Train’ is the self-test AUC in the training phase (for example, 0.731 for go2ppi-NB and GO-BP). ‘Test’ is the 10-fold cross-validation AUC in the testing phase over 50 runs (for example, 0.700 for go2ppi-NB and GO-BP).
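The Train/Test distinction in the note comes down to which examples are scored: the training examples themselves (self-test) or examples held out in each cross-validation fold. The sketch below shows a rank-based AUC and a simple k-fold index split. It is a generic illustration, not go2ppi's implementation; the function names, the interleaved fold assignment, and the example scores are all assumptions.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the fraction of (positive, negative) pairs in which the positive
    example receives the higher score; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) for k interleaved folds."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n_items) if i not in held]
        yield train, held_out


# Made-up labels and classifier scores, for illustration only.
example_labels = [1, 1, 0, 0]
example_scores = [0.9, 0.4, 0.6, 0.1]
example_auc = auc(example_labels, example_scores)
```

A self-test AUC scores the same examples the model was fitted on, so it tends to be optimistic; the gap between 0.991 and 0.719 for go2ppi-RF on GO-CC in the table is an instance of this pattern.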
Fig. 4. Occurrences of GO terms defining protein complexes (full dots) in the positive and negative sets compared to the rest of the GO terms (empty circles). The difference in counts of all GO terms between the positive and negative sets is significant at P < 0.05 (P-value = 0.028), while the frequencies for the 180 protein complex GO terms do not differ significantly (P-value = 0.28). The right plot (b) is a closer view of the points in the 0 to 1000 range of the left plot (a).