Robert Hoehndorf, Axel-Cyrille Ngonga Ngomo, Michael Dannemann, Janet Kelso.
Abstract
Biological data, and particularly annotation data, are increasingly being represented in directed acyclic graphs (DAGs). However, while relevant biological information is implicit in the links between multiple domains, annotations from these different domains are usually represented in distinct, unconnected DAGs, making links between the domains represented difficult to determine. We develop a novel family of general statistical tests for the discovery of strong associations between two directed acyclic graphs. Our method takes the topology of the input graphs and the specificity and relevance of associations between nodes into consideration. We apply our method to the extraction of associations between biomedical ontologies in an extensive use-case. Through a manual and an automatic evaluation, we show that our tests discover biologically relevant relations. The suite of statistical tests we develop for this purpose is implemented and freely available for download.
Year: 2010 PMID: 20585388 PMCID: PMC2886832 DOI: 10.1371/journal.pone.0010996
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Schematic representation of our method.
Elements of the test scores of the six tests.
| Test | Combining p-values in the CDFs of score differences from parents to children | Variance distribution of scores | Variance distributions to children and parents |
| Test 1 | geometric mean | | |
| Test 2 | minimum | | |
| Test 3 | geometric mean | X | |
| Test 4 | minimum | X | |
| Test 5 | geometric mean | X | X |
| Test 6 | minimum | X | X |
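The two combination functions named in the table can be illustrated with a minimal sketch. It assumes that each candidate association comes with a list of p-values in (0, 1]; the function names and sample values are hypothetical, and the variance weighting marked with X in the table is not modelled here.

```python
import math


def combine_geometric_mean(p_values):
    """Combine p-values by their geometric mean, computed in log space."""
    return math.exp(sum(math.log(p) for p in p_values) / len(p_values))


def combine_minimum(p_values):
    """Combine p-values by taking the smallest one."""
    return min(p_values)


# Hypothetical p-values for one candidate association (illustration only).
p_values = [0.04, 0.20, 0.50]
gm = combine_geometric_mean(p_values)
mn = combine_minimum(p_values)
print(f"geometric mean: {gm:.4f}, minimum: {mn:.4f}")
```

For any set of p-values in (0, 1], the minimum is never larger than the geometric mean, so the two functions can rank the same association quite differently.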
Example synsets taken from the GO and the CL.
| ID | Label | Synonyms |
| GO:0001574 | ganglioside biosynthetic process | ganglioside biosynthesis; ganglioside formation; ganglioside synthesis |
| CL:0000114 | surface ectodermal cell | cell of surface ectoderm; surface ectoderm cell |
Quantiles of the test results at different quantile levels for all six tests.
| Quantile level | Test 1 | Test 2 | Test 3 | Test 4 | Test 5 | Test 6 |
| 0.5 | 0.075 | 0.017 | 0.024 | 0.003 | 0.007 | 0.001 |
| 0.8 | 0.288 | 0.145 | 0.141 | 0.047 | 0.061 | 0.016 |
| 0.9 | 0.522 | 0.433 | 0.298 | 0.168 | 0.220 | 0.120 |
| 0.95 | 0.806 | 0.790 | 0.472 | 0.412 | 0.456 | 0.400 |
| 0.99 | 0.952 | 0.950 | 0.863 | 0.826 | 0.859 | 0.824 |
Given a quantile level (first column), each entry is the test result below which that fraction of the test's values fall.
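As a rough illustration of how such a quantile can be read off a list of test results, the sketch below uses a simple "smallest value covering a fraction q" definition. The function name and the sample results are hypothetical, and the source does not state which quantile definition was used.

```python
import math


def empirical_quantile(values, q):
    """Return the smallest value v such that at least a fraction q of values are <= v."""
    if not values or not 0 < q <= 1:
        raise ValueError("need a non-empty list and 0 < q <= 1")
    ordered = sorted(values)
    index = math.ceil(q * len(ordered)) - 1
    return ordered[index]


# Hypothetical test results for ten candidate associations (illustration only).
results = [0.001, 0.004, 0.010, 0.030, 0.075, 0.120, 0.300, 0.520, 0.800, 0.950]
for q in (0.5, 1.0):
    print(q, empirical_quantile(results, q))
```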
Figure 2. Distribution of test results.
The plots on the left and on the right show the distributions of the test results. It can be seen that a test using the minimum function is more restrictive than a test using the geometric mean. Furthermore, weighting the tests with the CDFs of the variances produces stronger results than the basic test. The test results of the GO-CL dataset for each test are displayed below the distributions.
Association examples.
| CL | GO |
| Myoepithelial cell | Milk ejection |
| Oocyte | Meiotic anaphase I |
| Osteoclast | Protein geranylgeranylation |
| Neuroblast | Neuron recognition |
| Keratinocyte | Keratinization |
| Sensory neuron | Optic nerve formation |
| Motor neuron | Spinal cord development |
| Protoplast | Photosynthesis |
| Lymphocyte | Chloroplast fission |
The results in this table were above the quantile in all six tests. While the kind of relation between the categories is apparent for most results, some, like the relation between lymphocytes and chloroplast fission, remain dubious.
Manually identified ontological relations in the top-scoring association results with respect to .
| Relation | Number of occurrences |
| | 62 |
| | 13 |
| | 2 |
| unclassified | 38 |
Evaluation of our approach with respect to the GO-CL dataset [23].
| Recall | Test 1 | Test 2 | Test 3 | Test 4 | Test 5 | Test 6 |
| 99% | 0.004 | 0 | 0 | 0 | 0 | 0 |
| 95% | 0.007 | 0.006 | 0.003 | 0 | 0.002 | 0 |
| 80% | 0.102 | 0.054 | 0.028 | 0.003 | 0.016 | 0.002 |
| 70% | 0.173 | 0.109 | 0.049 | 0.008 | 0.029 | 0.004 |
| 50% | 0.502 | 0.350 | 0.173 | 0.063 | 0.154 | 0.060 |
The dataset we used for comparison consists of the relations from the GO-CL crossproduct [23] found in our text corpus. Columns two to seven show the cutoff values required to identify the percentage of associations given in column one as significant using tests one to six. For example, at a cutoff of 0.102, 80% of the relations found in the dataset were significant according to test one.
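The relationship between cutoff and recall in this table can be sketched as follows. The sketch assumes that a relation counts as significant when its test result is at or above the cutoff, an assumption inferred from the table, where recall falls as the cutoff rises. The function name and the scores are hypothetical.

```python
def recall_at_cutoff(known_scores, cutoff):
    """Fraction of known relations whose test result is at or above the cutoff."""
    if not known_scores:
        raise ValueError("known_scores must not be empty")
    hits = sum(1 for score in known_scores if score >= cutoff)
    return hits / len(known_scores)


# Hypothetical test results for ten known relations (illustration only).
known_scores = [0.002, 0.05, 0.11, 0.18, 0.35, 0.51, 0.66, 0.80, 0.92, 0.99]
for cutoff in (0.004, 0.102, 0.502):
    print(cutoff, recall_at_cutoff(known_scores, cutoff))
```

Raising the cutoff can only keep or lower the recall, which is why each column of the table increases from the 99% row to the 50% row.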