Fernando Benites, Svenja Simon, Elena Sapozhnikova.
Abstract
The constantly increasing volume and complexity of available biological data require new methods for their management and analysis. An important challenge is the integration of information from different sources in order to discover possible hidden relations between already known data. In this paper we introduce a data mining approach which relates biological ontologies by mining cross- and intra-ontology pairwise generalized association rules. Its advantage is sensitivity to rare associations, which are important to biologists. We propose a new class of interestingness measures designed for hierarchically organized rules. These measures allow one to select the most important rules and to take rare cases into account. They favor rules whose actual interestingness value exceeds the expected value, which is calculated taking the parent rule into account. We demonstrate this approach by applying it to the analysis of data from the Gene Ontology and GPCR databases. Our objective is to discover interesting relations between two different ontologies or parts of a single ontology. The association rules thus discovered can provide the user with new knowledge about underlying biological processes or help improve annotation consistency. The results show that the produced rules represent meaningful and quite reliable associations.
Year: 2014 PMID: 24404165 PMCID: PMC3880308 DOI: 10.1371/journal.pone.0084475
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Measure List.
| Number | Measure name | Abbreviation | Formula | Range | Ref. |
| 1. | Support | | | [0,1] | |
| 2. | Confidence | | | [0,1] | |
| 3. | Cosine | | | [0,1] | |
| 4. | All-Confidence | | | [0,1] | |
| 5. | Kulczynski | | | [0,1] | |
| 6. | Lift | | | [0, ∞) | |
| 7. | Bayes Factor | | | [0, ∞) | |
| 8. | CenteredConfidence | | | [−1,1) | |
| 9. | | | | [0,1] | |
| 10. | Jaccard | | | [0,1] | |
| 11. | JacDif | | | (−1,1] | our |
| 12. | CosDif | | | (−1,1] | our |
The list of the interestingness measures used and their abbreviations. The term Sup(a,b) refers to the fraction of transactions in the whole transaction set in which a and b co-occur; Sup(a) and Sup(b) are defined analogously.
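As an illustration of how several of the listed measures are obtained from support counts, the sketch below uses the common textbook definitions. These definitions are an assumption, since the formula column of the table is not available here. The function name `measures` and the example counts are hypothetical.

```python
import math


def measures(sup_ab: int, sup_a: int, sup_b: int, n: int) -> dict:
    """Textbook interestingness measures for a rule a => b.

    sup_ab, sup_a, sup_b are absolute counts and n is the total number
    of transactions. The formulas are the standard ones and are assumed
    here; they are not taken from the table above.
    """
    p_ab, p_a, p_b = sup_ab / n, sup_a / n, sup_b / n
    return {
        "support": p_ab,
        "confidence": p_ab / p_a,
        "cosine": p_ab / math.sqrt(p_a * p_b),
        "all_confidence": p_ab / max(p_a, p_b),
        "kulczynski": 0.5 * (p_ab / p_a + p_ab / p_b),
        "lift": p_ab / (p_a * p_b),
        "jaccard": p_ab / (p_a + p_b - p_ab),
    }


# Hypothetical counts: a occurs 150 times, b 30 times, both together 30 times.
example = measures(sup_ab=30, sup_a=150, sup_b=30, n=1000)
```

Only support and lift depend on the total number of transactions; the other measures in this sketch are ratios of counts and stay the same when n changes.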
Hierarchical Measures Example.
| Nr. | Rule | Rule support | Item | Item support |
| 1 | Clothes | 30 | Clothes | 150 |
| 2 | Outerwear | 20 | Outerwear | 100 |
| 3 | Jackets | 15 | Jackets | 90 |
| 4 | Travel Pants | 10 | Travel Pants | 15 |
| | | | Hiking Boots | 30 |
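A minimal sketch of how the counts in this example can be turned into a measure value, assuming that the item listed in each row is the antecedent of the rule in the same row, and using the standard definition of confidence (rule support divided by antecedent support). Both points are assumptions, and the variable names are hypothetical.

```python
# (rule support, antecedent item support) per row of the example table.
example_rows = {
    "Clothes": (30, 150),
    "Outerwear": (20, 100),
    "Jackets": (15, 90),
    "Travel Pants": (10, 15),
}

# Standard confidence: how often the rule holds when its antecedent occurs.
confidence = {
    name: rule_sup / item_sup
    for name, (rule_sup, item_sup) in example_rows.items()
}

# The row whose rule has the highest confidence.
strongest = max(confidence, key=confidence.get)
```

Under these assumptions the row with the smallest rule support (Travel Pants) has the highest confidence, which is the kind of rare but strong association that a pure support threshold would miss.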
Hierarchical Measures and Example.
| Nr. | Rule | | | | | |
| 3 | Jackets | | | | | |
| 4 | Travel Pants | | | | | |
Figure 1. Example hierarchy.
DBpedia-Yago performance results.
| Method | | | | | | | | | | GRP | GRL | | |
| Best 151 | | | | | | | | | | | | | |
| T-Rules | 7 | 74 | 73 | 74 | 72 | 5 | 7 | 29 | 73 | 7 | 54 | 73 | 68 |
| | 4.64 | | | | 47.68 | 3.31 | 4.63 | 19.21 | | 4.64 | 35.76 | | 45.03 |
| Best possible | | | | | | | | | | | | | |
| Rules | 609 | 225 | 201 | 208 | 187 | 265 | 447 | 384 | 199 | 591 | 107 | 195 | 221 |
| T-Rules | 105 | 96 | 89 | 92 | 85 | 10 | 33 | 91 | 89 | 105 | 54 | 90 | 94 |
| | 27.63 | | 50.57 | | 50.30 | 4.81 | 11.04 | 34.02 | 50.85 | 28.30 | 41.86 | | 50.54 |
DBpedia-Yago: The number of found rules and the number of true rules among them (T-Rules), for the first 151 rules and for the best possible rule set. F-1 is given in %. The three best values are shown in bold.
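The percentage rows of the table appear consistent with an F1 score computed against a reference set of 151 true rules. This reading is inferred from the surviving numbers and is not stated in the table itself, so the sketch below should be taken as an assumption. The function name `f1_percent` is hypothetical.

```python
def f1_percent(true_found: int, found: int, reference: int) -> float:
    """F1 score in percent from rule counts.

    true_found: number of true rules among the extracted rules (T-Rules)
    found:      number of extracted rules (Rules)
    reference:  size of the reference set of true rules (assumed to be 151)
    """
    if true_found == 0:
        return 0.0
    precision = true_found / found
    recall = true_found / reference
    return 100 * 2 * precision * recall / (precision + recall)


# First column of the "Best possible" block: 609 rules, 105 of them true.
best_possible_first = round(f1_percent(105, 609, 151), 2)

# First column of the "Best 151" block: 151 rules, 7 of them true.
best_151_first = round(f1_percent(7, 151, 151), 2)
```

When exactly 151 rules are selected, precision and recall coincide, so the score reduces to the share of true rules among the 151.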
Figure 2. Sorted true rules by absolute support.
The color indicates whether a rule was among the first 151 rules of each metric.
Median absolute support of the “best 151 rules” for the DBpedia-Yago dataset.
| Method | | | | | | | | | | | | GRP | GRL |
| found true rules | 53 | 1202 | 1204 | 1193 | 1202 | 3 | 53 | 1150 | 1200 | 942 | 696.5 | 53 | 1030 |
| not found true rules | 635.5 | 378 | 359 | 435 | 378 | 635.5 | 635.5 | 579 | 406.5 | 496 | 551 | 635.5 | 480 |
| all found rules | 4 | 1222 | 1222 | 1082 | 1204 | 1 | 4 | 3 | 1204 | 779 | 547 | 4 | 591 |
Median absolute support of the rules in the setting “best 151 rules”.
The number of intersections for DBpedia-Yago.
| | | | | | | | | | | | GRP | GRL |
| | 211 | 211 | 209 | 207 | 43 | 218 | 451 | 210 | 203 | 188 | 490 | 107 |
| | | 479 | 472 | 441 | 82 | 144 | 244 | 470 | 459 | 427 | 215 | 71 |
| | | | 451 | 461 | 83 | 145 | 244 | 488 | 453 | 430 | 215 | 71 |
| | | | | 416 | 83 | 145 | 242 | 445 | 444 | 406 | 213 | 71 |
| | | | | | 97 | 156 | 236 | 465 | 431 | 428 | 211 | 69 |
| | | | | | | 317 | 48 | 94 | 97 | 103 | 42 | 2 |
| | | | | | | | 224 | 156 | 158 | 161 | 217 | 30 |
| | | | | | | | | 244 | 230 | 212 | 456 | 00 |
| | | | | | | | | | 458 | 437 | 214 | 71 |
| | | | | | | | | | | 462 | 206 | 70 |
| | | | | | | | | | | | 191 | 67 |
| GRP | | | | | | | | | | | | 103 |
The number of intersections among the best 500 rules extracted by different methods from the DBpedia-Yago dataset.
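An intersection table of this kind can be produced by comparing the top-k rule sets of every pair of methods. The sketch below is one straightforward way to do this and is not taken from the paper. The function name, the method labels and the toy rankings are hypothetical.

```python
from itertools import combinations


def top_k_intersections(rankings: dict, k: int) -> dict:
    """Count shared rules among the top-k rules of each pair of methods.

    rankings maps a method name to its rules, sorted from best to worst.
    Returns {(method_a, method_b): number of common rules in the top k}.
    """
    top = {method: set(rules[:k]) for method, rules in rankings.items()}
    return {
        (a, b): len(top[a] & top[b])
        for a, b in combinations(top, 2)
    }


# Toy rankings with hypothetical rule identifiers.
toy_rankings = {
    "Jac": ["r1", "r2", "r3", "r4"],
    "Cos": ["r2", "r1", "r5", "r3"],
    "GRP": ["r9", "r1", "r8", "r7"],
}
overlap = top_k_intersections(toy_rankings, k=2)
```

Because each unordered pair is counted once, the result fills only the upper triangle of the matrix, which matches the layout of the table.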
Difference in the true rules sets of Jac and JacDif for the DBpedia-Yago dataset.
| Nr. | Sup(a,b) | Sup(a) | Sup(b) | DBpedia | Yago | rank Jac | rank JacDif |
| The true rules found by Jac | | | | | | | |
| 1 | 1847 | 2521 | 1847 | Planet | planet109394007 | 134 | |
| 2 | 1974 | 1997 | 1984 | RadioStation | radiostation104044119 | 27 | 569 |
| 3 | 1864 | 1884 | 1879 | River | river109411430 | 30 | 359 |
| 4 | 1396 | 1552 | 1657 | Saint | saint110546850 | 126 | 152 |
| 5 | 2553 | 3065 | 3045 | School | school108276720 | 145 | 188 |
| The true rules found by JacDif | | | | | | | |
| 1 | 93 | 149 | 93 | Archaea | Archaeagenera | 176 | 151 |
| 2 | 5 | 8 | 5 | Continent | Continents | 173 | 149 |
| 3 | 15 | 23 | 15 | SpaceStation | Spacestations | 163 | 142 |
| 4 | 51 | 64 | 60 | Valley | valley109468604 | 153 | 117 |
The difference in the found true rules of Jac and JacDif among the best 151. Support refers in this table to the absolute support (number of instances: sup*N).
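As a check on how the absolute supports in this table relate to the Jac ranks, the sketch below recomputes a Jaccard value for each row and sorts the rows by it. It assumes that Jac is the standard Jaccard coefficient, Sup(a,b) / (Sup(a) + Sup(b) - Sup(a,b)); the data are copied from the table, and the variable names are hypothetical.

```python
# (DBpedia class, Sup(a,b), Sup(a), Sup(b), rank Jac) copied from the table.
rows = [
    ("Planet", 1847, 2521, 1847, 134),
    ("RadioStation", 1974, 1997, 1984, 27),
    ("River", 1864, 1884, 1879, 30),
    ("Saint", 1396, 1552, 1657, 126),
    ("School", 2553, 3065, 3045, 145),
    ("Archaea", 93, 149, 93, 176),
    ("Continent", 5, 8, 5, 173),
    ("SpaceStation", 15, 23, 15, 163),
    ("Valley", 51, 64, 60, 153),
]


def jaccard(sup_ab: int, sup_a: int, sup_b: int) -> float:
    """Standard Jaccard coefficient from absolute supports (assumed definition)."""
    return sup_ab / (sup_a + sup_b - sup_ab)


# Order by recomputed Jaccard (best first) and by the reported Jac rank.
by_score = [r[0] for r in sorted(rows, key=lambda r: jaccard(*r[1:4]), reverse=True)]
by_rank = [r[0] for r in sorted(rows, key=lambda r: r[4])]
```

The two orderings agree for these nine rows. The four rare rules in the second group have the lowest Jaccard values, which fits their falling outside the best 151 under Jac while JacDif ranks them at 151 or better.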
Median absolute support and the number of intersections for GPCR-GO-MF dataset.
| | Nr. of Rules | Median Sup. | | | | | | GRP | GRL |
| | 8781 | 5 | 459 | 444 | 450 | 414 | 371 | 11 | 31 |
| | 8781 | 5 | | 420 | 487 | 412 | 375 | 13 | 31 |
| | 8781 | 6 | | | 409 | 393 | 342 | 8 | 31 |
| | 8781 | 4 | | | | 417 | 379 | 13 | 33 |
| | 7879 | 5 | | | | | 443 | 12 | 21 |
| | 7077 | 4 | | | | | | 12 | 17 |
| GRP | 5419 | 3 | | | | | | | 49 |
| GRL | 511 | 15 | | | | | | | |
Median absolute support and the number of intersections among the best 500 rules extracted by different methods from the GPCR-GO-MF dataset (for GRL only 32 rules).
Rule Rankings for GPCR-GO-MF dataset.
| Rule | | | | | | | GRP | GRL |
| GPCR:Serotonin | 42 | 43 | 43 | 43 | 36 | 45 | 2037 | |
| GPCR:“Chemokine receptor-like” | 129 | 130 | 142 | 114 | 76 | 60 | 2421 | |
| GPCR:“Anaphylatoxin” | 43 | 42 | 42 | 42 | 23 | 19 | 2035 | |
| GPCR:P2RY1 | 989 | 1159 | 970 | 1128 | 1329 | 1780 | 2553 | |
| GPCR:“Trace amine” | 3055 | 2276 | 3175 | 2714 | 7466 | 6671 | 485 |
Figure 3. Number of inconsistent rules.
The number of inconsistent rules found among the best x rules extracted by the given metric. The best x rules were gathered from 20,000 rules, and ancestor rules were then removed.
Median absolute support and the number of intersections for GO-MF-GO-MF.
| Metric | Nr. of rules | Median Sup. | | | | | | GRP | GRL |
| | 5445 | 5 | 433 | 436 | 433 | 227 | 206 | 47 | 11 |
| | 5485 | 6 | | 379 | 498 | 203 | 230 | 53 | 16 |
| | 5518 | 5 | | | 379 | 197 | 176 | 39 | 10 |
| | 5495 | 5.5 | | | | 203 | 231 | 53 | 16 |
| | 5678 | 5 | | | | | 431 | 33 | 14 |
| | 5717 | 5 | | | | | | 38 | 14 |
| GRP | 5688 | 2 | | | | | | | 67 |
| GRL | 513 | 146 | | | | | | | |
Median absolute support and the number of intersections between the best 500 rules extracted by different methods from the GO-MF-GO-MF dataset. Nr. of rules refers to the total number of rules after preprocessing.
20 best rules extracted by JacDif from the GO-MF dataset.
| Nr. | | | | GO name | GO name | | 10-13 | 13-10 |
| 1. | 5 | 1 | 1.00 | GO:0008954 peptidoglycan synthetase activity | GO:0016807 cysteine-type carboxypeptidase activity | 0 | 5 | 0 |
| 2. | 1 | 1 | 1.00 | GO:0034437 glycoprotein transporter activity | GO:0034041 sterol-transporting ATPase activity | 19 | 0 | 18 |
| 3. | 2 | 1 | 1.00 | GO:0010490 UDP-4-keto-rhamnose-4-keto-reductase activity | GO:0010489 UDP-4-keto-6-deoxy-glucose-3,5-epimerase activity | 4 | 0 | 2 |
| 4. | 1 | 1 | 1.00 | GO:0015518 arabinose:hydrogen symporter activity | GO:0015150 fucose transmembrane transporter activity | 1 | 0 | 0 |
| 5. | 1 | 1 | 1.00 | GO:0070905 serine binding | GO:0010855 adenylate cyclase inhibitor activity | 20 | 0 | 19 |
| 6. | 1 | 1 | 1.00 | GO:0050241 pyrroline-2-carboxylate reductase activity | GO:0050132 N-methylalanine dehydrogenase activity | 2 | 0 | 1 |
| 7. | 1 | 1 | 1.00 | GO:0017045 adrenocorticotropin-releasing hormone activity | GO:0051431 corticotropin-releasing hormone receptor 2 binding | 3 | 0 | 2 |
| 8. | 1 | 1 | 1.00 | GO:0017045 adrenocorticotropin-releasing hormone activity | GO:0051430 corticotropin-releasing hormone receptor 1 binding | 3 | 0 | 2 |
| 9. | 9 | 1 | 1.00 | GO:0047376 all-trans-retinyl-palmitate hydrolase activity | GO:0050251 retinol isomerase activity | 0 | 9 | 0 |
| 10. | 2 | 1 | 1.00 | GO:0035473 lipase binding | GO:0035478 chylomicron binding | 13 | 0 | 11 |
| 11. | 2 | 1 | 1.00 | GO:0080048 GDP-D-glucose phosphorylase activity | GO:0010475 galactose-1-phosphate guanylyltransferase (GDP) activity | 8 | 0 | 6 |
| 12. | 1130 | 1 | 0.99 | GO:0043752 adenosylcobinamide kinase activity | GO:0008820 cobinamide phosphate guanylyltransferase activity | 1518 | 812 | 1,202 |
| 13. | 3590 | 1 | 0.97 | GO:0004743 pyruvate kinase activity | GO:0030955 potassium ion binding | 11,356 | 188 | 7,975 |
| 14. | 2407 | 1 | 0.87 | GO:0004643 phosphoribosylaminoimidazolecarboxamide formyltransferase activity | GO:0003937 IMP cyclohydrolase activity | 7759 | 83 | 5,438 |
| 15. | 1756 | 0.99 | 0.86 | GO:0019134 glucosamine-1-phosphate N-acetyltransferase activity | GO:0003977 UDP-N-acetylglucosamine diphosphorylase activity | 6889 | 77 | 5,211 |
| 16. | 2424 | 0.97 | 0.85 | GO:0004486 methenyltetrahydrofolate dehydrogenase activity | GO:0004477 methylenetetrahydrofolate cyclohydrolase activity | 4 | 2,418 | 0 |
| 17. | 329 | 0.93 | 0.85 | GO:0051861 glycolipid binding | GO:0017089 glycolipid transporter activity | 935 | 14 | 624 |
| 18. | 1862 | 0.95 | 0.84 | GO:0004633 phosphopantothenoylcysteine decarboxylase activity | GO:0004632 phosphopantothenate–cysteine ligase activity | 6009 | 86 | 4,234 |
| 19. | 1619 | 0.84 | 0.67 | GO:0008066 glutamate receptor activity | GO:0005234 extracellular-glutamate-gated ion channel activity | 3967 | 69 | 2,435 |
| 20. | 2074 | 0.86 | 0.65 | GO:0008531 riboflavin kinase activity | GO:0003919 FMN adenylyltransferase activity | 7022 | 78 | 5,028 |
20 best rules extracted by JacDif (JD) with Faria's filtering method without filtering by min Sup, min Cnf and min Agr.