Fatima Zohra Smaili, Xin Gao, Robert Hoehndorf.
Abstract
Motivation: Biological knowledge is widely represented in the form of ontology-based annotations: ontologies describe the phenomena assumed to exist within a domain, and the annotations associate a (kind of) biological entity with a set of phenomena within the domain. The structure and information contained in ontologies and their annotations make them valuable for developing machine learning, data analysis and knowledge extraction algorithms; notably, semantic similarity is widely used to identify relations between biological entities, and ontology-based annotations are frequently used as features in machine learning applications.
Year: 2018 PMID: 29949999 PMCID: PMC6022543 DOI: 10.1093/bioinformatics/bty259
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Onto2Vec workflow. The blue-shaded part illustrates the steps to obtain vector representations for classes from the ontology. The purple-shaded part shows the steps to obtain vector representations of ontology classes and the entities annotated with these classes.
Fig. 2. ROC curves for PPI prediction for the unsupervised learning methods
AUC values of ROC curves for PPI prediction

| Method | Yeast | Human |
|---|---|---|
| Resnik | 0.7942 | 0.7891 |
| Lin | 0.7354 | 0.7222 |
| Jiang and Conrath | 0.7108 | 0.7027 |
| | 0.7634 | 0.7594 |
| | 0.7701 | 0.7614 |
| | 0.7439 | 0.7385 |
| | 0.6912 | 0.6712 |
| | 0.6741 | 0.6470 |
| | 0.7139 | 0.7093 |
| | 0.7959 | 0.7785 |
| | **0.8586** | **0.8621** |
| | 0.7009 | 0.7785 |
| | 0.8253 | 0.8068 |
| | 0.7662 | 0.7064 |
Note: The best AUC value among all methods is shown in bold. Resnik, Lin, Jiang and Conrath and sim_GIC are semantic similarity measures; Onto2Vec is our method, in which protein and ontology class representations are learned jointly from a single knowledge base that is deductively closed; Onto2Vec_NoReasoner is identical to Onto2Vec but does not use the deductive closure of the knowledge base; Binary_GO represents a protein’s GO annotations as a binary vector (closed against the GO structure); Onto_BMA generates vector representations only for GO classes and compares proteins by comparing their GO annotations individually using cosine similarity, averaging the individual values with the best-match average (BMA) approach; Onto_AddVec sums GO class vectors to represent a protein. The methods with suffix LR, SVM and NN use logistic regression, a support vector machine and an artificial neural network, respectively, on either the Onto2Vec or the Binary_GO protein representations.
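The note describes two ways of turning per-class vectors into a protein-level comparison: pairwise cosine similarity averaged with the best-match average (BMA), and summing class vectors before a single comparison. A minimal sketch of both, using hypothetical 2-D toy vectors and illustrative function names (not the authors' code; real Onto2Vec vectors are 200-dimensional):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def bma_similarity(vecs_a, vecs_b):
    """Best-Match Average (Onto_BMA-style): for each class vector of one
    protein take the best cosine match among the other protein's class
    vectors, then average the two directional averages."""
    best_a = [max(cosine(a, b) for b in vecs_b) for a in vecs_a]
    best_b = [max(cosine(b, a) for a in vecs_a) for b in vecs_b]
    return (sum(best_a) / len(best_a) + sum(best_b) / len(best_b)) / 2

def sum_vectors(vecs):
    """Onto_AddVec-style: element-wise sum of class vectors gives a
    single protein vector."""
    return [sum(components) for components in zip(*vecs)]

# Toy inputs: one protein annotated with two GO classes, one with a single class.
protein_a = [[1.0, 0.0], [0.0, 1.0]]
protein_b = [[1.0, 0.0]]

bma_score = bma_similarity(protein_a, protein_b)
addvec_score = cosine(sum_vectors(protein_a), sum_vectors(protein_b))
```

With these toy inputs, BMA yields 0.75 while the summed-vector cosine yields about 0.707; on real data the two can rank protein pairs quite differently, which is what the tables compare.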
Fig. 3. ROC curves for PPI prediction for the supervised learning methods, in addition to Resnik’s semantic similarity measure for comparison
Spearman correlation coefficients between STRING confidence scores and PPI prediction scores of different prediction methods

| Method | Yeast | Human |
|---|---|---|
| | 0.1107 | 0.1151 |
| | 0.1067 | 0.1099 |
| | 0.1021 | 0.1031 |
| | 0.1424 | 0.1453 |
| | **0.2245** | **0.2621** |
| | 0.1121 | 0.1208 |
| | 0.1363 | 0.1592 |
| | 0.1243 | 0.1616 |
Note: The highest absolute correlation across all methods is highlighted in bold.
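The Spearman coefficients above are the Pearson correlation of the ranks of the two score lists. A self-contained sketch (average ranks for tied values are a standard convention, not a detail taken from the paper):

```python
def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on ranks,
    with tied values receiving the average of the ranks they occupy."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        i = 0
        while i < len(order):
            j = i
            # Extend j over a block of tied values.
            while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
                j += 1
            avg_rank = (i + j) / 2 + 1  # 1-based average rank of the block
            for k in range(i, j + 1):
                r[order[k]] = avg_rank
            i = j + 1
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Toy example: two score lists over the same four protein pairs.
rho = spearman([1, 2, 3, 4], [1, 3, 2, 4])
```

In practice `scipy.stats.spearmanr` performs the same computation; the small magnitudes in the table (all below 0.27) show that the prediction scores track STRING confidence only weakly.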
AUC values of the ROC curves for PPI interaction type prediction

| Method | Reaction (Yeast) | Activation (Yeast) | Binding (Yeast) | Catalysis (Yeast) | Reaction (Human) | Activation (Human) | Binding (Human) | Catalysis (Human) |
|---|---|---|---|---|---|---|---|---|
| Resnik | 0.5811 | 0.6023 | 0.5738 | 0.5792 | 0.5341 | 0.5331 | 0.5233 | 0.5810 |
| | 0.5738 | 0.5988 | 0.5611 | 0.5814 | 0.5153 | 0.5104 | 0.5073 | 0.6012 |
| | 0.7103 | 0.7011 | 0.6819 | 0.6912 | 0.7091 | 0.6951 | 0.6722 | 0.6853 |
| | 0.7311 | 0.7117 | | | | | | |
| | 0.7419 | 0.7737 | 0.7811 | 0.7265 | 0.7568 | 0.7713 | | |
| | 0.6874 | 0.6611 | 0.6214 | 0.6433 | 0.6151 | 0.6533 | 0.6018 | 0.6189 |
| | 0.7455 | 0.7346 | 0.7173 | 0.7738 | 0.7246 | 0.7132 | 0.6821 | 0.7422 |
| | 0.7131 | 0.6934 | 0.6741 | 0.6838 | 0.6895 | 0.6803 | 0.6431 | 0.6752 |
Note: The best AUC value for each action is shown in bold.
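Each AUC above is the area under the ROC curve, which equals the probability that a randomly chosen positive (interacting) pair is scored above a randomly chosen negative pair. A brute-force sketch of that rank-based equivalence, using made-up labels and scores:

```python
def roc_auc(labels, scores):
    """AUC via the Mann-Whitney formulation: fraction of
    (positive, negative) pairs where the positive scores higher,
    counting ties as half a win."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy evaluation: labels mark true interactions, scores are predictions.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

An AUC of 0.5 corresponds to random scoring, so cells in the table near 0.51-0.53 indicate that those predictions carry little signal for that interaction type.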
Fig. 4. t-SNE visualization of 10,000 enzyme vectors, color-coded by their first-level EC category (1, 2, 3, 4, 5 or 6)
Parameters used for training the Word2Vec model

| Parameter | Definition | Default value |
|---|---|---|
| sg | Choice of training algorithm (sg=1: skip-gram; sg=0: CBOW) | 1 |
| size | Dimension of the obtained vectors | 200 |
| min_count | Words with frequency lower than this value will be ignored | 1 |
| window | Maximum distance between the current and the predicted word | 10 |
| iter | Number of iterations | 5 |
| negative | Whether negative sampling is used and how many ‘noise words’ are drawn | 4 |
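The definitions in the table correspond to arguments of gensim's Word2Vec implementation. Assuming gensim's classic argument names (newer gensim releases rename `size` to `vector_size` and `iter` to `epochs`), the settings can be collected in one place:

```python
# Hyperparameters from the table above, keyed by gensim's classic
# Word2Vec argument names (the names are an assumption, inferred from
# the parameter definitions).
w2v_params = {
    "sg": 1,         # training algorithm: 1 = skip-gram, 0 = CBOW
    "size": 200,     # dimensionality of the learned vectors
    "min_count": 1,  # ignore words rarer than this frequency
    "window": 10,    # max distance between current and predicted word
    "iter": 5,       # number of training iterations
    "negative": 4,   # number of negative-sampling "noise words"
}

# Usage sketch, assuming gensim is installed and `corpus` holds the
# axiom sentences produced from the ontology and its annotations:
# from gensim.models import Word2Vec
# model = Word2Vec(sentences=corpus, **w2v_params)
```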