Christopher Brewster, Simon Jupp, Joanne Luciano, David Shotton, Robert D Stevens, Ziqi Zhang.
Abstract
BACKGROUND: Ontology construction for any domain is a labour-intensive and complex process. Any methodology that can reduce the cost and increase efficiency has the potential to make a major impact in the life sciences. This paper describes an experiment in ontology construction from text for the animal behaviour domain. Our objective was to see how much could be done in a simple and relatively rapid manner using a corpus of journal papers. We used a sequence of pre-existing text processing steps, and here describe the different choices made to clean the input, to derive a set of terms and to structure those terms in a number of hierarchies. We describe some of the challenges, especially that of focusing the ontology appropriately given a starting point of a heterogeneous corpus.
Year: 2009 PMID: 19426458 PMCID: PMC2679401 DOI: 10.1186/1471-2105-10-S5-S1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Animal Behaviour Ontology (top level).
Figure 2. Aspects of a corpus (brown circles represent as yet unspecified ontologies or sets of terms).
The regular expression used in Step 5a.
| defence$ | attack$ | behaviour$ | | preference$ | discrimination$ | |
| choice$ | selection$ | ing$ | | attraction$ | grunt$ | reduction$ | |
| care$ | conflict$ | aggression$ | | chase$ | aggregation$ | alarm$ | |
| ism$ | competition$ | recognition$ | | movement$ | investment$ | urination$ | |
| skill$ | ship$ | copulation$ | | submission | invitation$ | expulsion$ | |
| play$ | flight$ | flip$ | response$ | | motion$ | mimicry$ | release$ | |
| avoidance$ | fidelity$ | courtship$ | | icide$ | fight$ | rut$ | |
| inspection$ | intrusion$ | activity$ | | coercion$ | construction$ | flight$ | |
| reactivity$ | communication$ | attendance$ | | solicitation$ | search$ | appeasement$ | |
| igration$ | harassment$ | contest$ | | mimicry$ | protection$ | submission$ | |
| interference$ | foraging$ | polyandry$ | | preparation$ | vocalization$ | vocalisation$ | |
| predation$ | call$ | bob$ | | incubation$ | insemination$ | concealment$ | |
| intrusion$ | tactic$ | strategy$ | | evasion$ | nod$ | call$ | |
| attempt$ | trill$ | whistle$ | | trill$ | song |
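The suffix patterns above can be applied as a single alternation. The following is a minimal sketch in Python, not the authors' implementation: it uses only a handful of the listed patterns, and the names `SUFFIX_PATTERNS`, `BEHAVIOUR_RE` and `select_terms`, as well as the two non-matching candidate terms, are assumptions made for illustration.

```python
import re

# Illustrative subset of the suffix patterns listed in the table above.
SUFFIX_PATTERNS = ["defence$", "behaviour$", "ing$", "grunt$", "ism$", "response$"]

# Join the alternatives with "|"; the "$" anchors each one to the end of the term.
BEHAVIOUR_RE = re.compile("|".join(SUFFIX_PATTERNS))


def select_terms(terms):
    """Return the terms whose ending matches at least one suffix pattern."""
    return [term for term in terms if BEHAVIOUR_RE.search(term)]


# The first four candidates appear in the random sample of Step 5a output;
# the last two are hypothetical terms that should be rejected.
candidates = [
    "threat grunt",
    "acid metabolism",
    "dashing",
    "acoustic response",
    "colony size",
    "body mass",
]
selected = select_terms(candidates)
print(selected)
```

Here `acid metabolism` is selected only because it ends in `ism`, which shows how a purely suffix-based filter can admit terms unrelated to behaviour.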
Lexico-syntactic phrases used in Step 7a and 7b.
| < Child singular > or other < Parent plural > |
| < Child plural > and other < Parent plural > |
| < Child plural > or other < Parent plural > |
| < Child singular > is a type of < Parent singular > |
| < Child singular >, a type of < Parent singular > |
| < Child singular > and other < Parent plural > |
| < Parent plural > such as < Child plural > |
| < Child singular > is a kind of < Parent singular > |
| < Child singular >, a kind of < Parent singular > |
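As a rough illustration of how such lexico-syntactic phrases yield subsumption pairs, the sketch below matches simplified versions of the patterns with regular expressions. It is an assumption-laden sketch rather than the procedure used in Steps 7a and 7b: terms are restricted to single lowercase words, the singular/plural distinction in the table is ignored, and `PATTERNS`, `extract_pairs` and the example sentences are hypothetical.

```python
import re

WORD = r"[a-z]+"  # simplifying assumption: a term is a single lowercase word

# Simplified forms of the phrases in the table above; named groups mark the
# child (subclass) and parent (superclass) terms.
PATTERNS = [
    re.compile(rf"(?P<child>{WORD}) (?:and|or) other (?P<parent>{WORD})"),
    re.compile(rf"(?P<child>{WORD}) is a (?:type|kind) of (?P<parent>{WORD})"),
    re.compile(rf"(?P<child>{WORD}), a (?:type|kind) of (?P<parent>{WORD})"),
    re.compile(rf"(?P<parent>{WORD}) such as (?P<child>{WORD})"),
]


def extract_pairs(sentence):
    """Return (child, parent) pairs found in one sentence."""
    text = sentence.lower()
    pairs = []
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            pairs.append((match.group("child"), match.group("parent")))
    return pairs


print(extract_pairs("Grunts and other calls were recorded."))
print(extract_pairs("A trill is a type of call."))
print(extract_pairs("Vocalisations such as whistles are common."))
```

Each pair can then be read as a candidate subclass axiom (child is a subclass of parent); a fuller version would need multi-word terms and number agreement.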
Figure 3. The processing steps undertaken.
Figure 4. Screen-shot of part of the sub-tree concerning behaviour from the output of Step 8a.
Figure 5. Screen-shot of part of the sub-tree concerning behaviour from the output of Step 7b.
Figure 6. Screen-shot of part of the sub-tree concerning calling from the output of Step 8a.
Figure 7. Screen-shot of part of the sub-tree concerning call from the output of Step 7b.
Random sample of terms extracted by Step 5a.
| colony fissioning | active territory defence |
| high investment | matching neighbour song |
| greater reduction | soldier bug predation |
| major force favouring | stranger song |
| eavesdropping and mate choice | severe matriline based aggression |
| mormoniella and pachycrepoideus polyandry | dashing |
| threat grunt | acid metabolism |
| total attraction | taeniopygia guttata song |
| location and spacing | acoustic response |
| orienting response | prior winning |
| memorize heterospecific song | daytime rest and sleep behaviour |
| diet selection behaviour | sexual and courtship behaviour |
| mass rearing | xiphophorus female preference |
| diurnal activity | utetheisa ornatrix site dependent aggression |
| locomotor behaviour | |
| noise interference |
The number of terms and animal behaviour terms retrieved from each section of the corpus.
| Section | Words | Terms | Animal behaviour terms | Proportion |
| Title and Abstract | 131132 | 13463 | 2031 | 0.15 |
| Introduction | 525841 | 36932 | 4915 | 0.13 |
| Materials and Methods | 646675 | 13611 | 1535 | 0.11 |
| Results | 323196 | 19504 | 2197 | 0.11 |
| Discussion | 260168 | 20170 | 2683 | 0.13 |
| Bibliography (titles only) | 318678 | 39115 | 6336 | 0.16 |
| Total (non unique) | 2205690 | 142795 | 19697 | |
| Total unique terms | 98435 | |||
| Total unique animal behaviour terms | 13755 | |||
| Proportion of animal behaviour terms | 0.14 | |||
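The proportions in the table above follow directly from the two term counts in each row. The short check below recomputes them; the variable names are illustrative, and the reading of the third and fourth columns as "terms" and "animal behaviour terms" is taken from the table caption.

```python
# (section, terms, animal behaviour terms), copied from the table above.
rows = [
    ("Title and Abstract", 13463, 2031),
    ("Introduction", 36932, 4915),
    ("Materials and Methods", 13611, 1535),
    ("Results", 19504, 2197),
    ("Discussion", 20170, 2683),
    ("Bibliography (titles only)", 39115, 6336),
]

# Proportion of animal behaviour terms per section, rounded as in the table.
proportions = {name: round(ab / terms, 2) for name, terms, ab in rows}

# Overall proportion from the unique-term totals.
overall = round(13755 / 98435, 2)

for name, value in proportions.items():
    print(f"{name}: {value}")
print(f"Overall (unique terms): {overall}")
```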
Data sets at each step 1 – 8
| Step | 1 | 2,3 | 4 | 5a | 5b | 6a | 6b | 7a | 7b | 8a | 8b |
| Number of articles | 623 | ||||||||||
| Number of noun phrases | 135026 | ||||||||||
| Number of terms | 98435 | ||||||||||
| Number of terms selected | 13755 | 13755 | |||||||||
| Number of classes | 18171 | 14251 | 18055 | 12383 | 18055 | 12383 | |||||
| Subclass axioms | 16877 | 11716 | 16876 | 10497 | 17393 | 12326 | |||||
| Top level classes | 1294 | 2535 | 1179 | 1886 | 662 | 58 | |||||
| Maximum Depth | 5 | 5 | 5 | 5 | 8 | 13 | |||||
| Average Depth | 2.4 | 2.1 | 2.4 | 2.2 | 3.1 | 1.4 | |||||
| Maximum span | 1294 | 2535 | 1179 | 1886 | 778 | 557 | |||||
| Average Span | 1.7 | 1.8 | 1.7 | 1.8 | 1.7 | 2.2 |
Results from Steps 1-5a and 5b, and the Evaluation Steps 1 and 2.
| | Regular expression method | | Term voting method | |
| TOTAL terms | 98435 | 100% | 98435 | 100% |
| Selected set | 13755 | 14% | 13755 | 14% |
| Excluded set | 84680 | 86% | 84680 | 86% |
| Sample of excluded | 3140 | 100% | | |
| Wrong (false negative) | 49 | 1.6% | | |
| Correct (true negative) | 3091 | 98.4% | | |
| Proportionate number of false negatives in the excluded set | 1321 | | | |
| Sample of included | 2070 | 100% | 2287 | 100% |
| Wrong (false positive) | 1538 | 74.3% | 1974 | 86.3% |
| Correct (true positive) | 532 | 25.7% | 313 | 13.7% |
| Probable number of true positives in the selected set | 3535 | | 1883 | |
| Recall | 0.728 | | | |
| Precision | 0.257 | | 0.137 | |
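The extrapolated figures in the table above can be reproduced by scaling each sample rate up to the size of the set it was drawn from. The sketch below is one reading of that arithmetic, not a calculation stated in the source: the helper `extrapolate` is hypothetical, and the interpretation of 0.728 as estimated recall (true positives divided by true positives plus false negatives) is an assumption that happens to match the reported value.

```python
def extrapolate(sample_hits, sample_size, population):
    """Scale a count observed in a sample up to the whole population."""
    return round(sample_hits / sample_size * population)


EXCLUDED, SELECTED = 84680, 13755  # set sizes from the table

# Regular expression method
false_negatives = extrapolate(49, 3140, EXCLUDED)     # missed terms in excluded set
true_pos_regex = extrapolate(532, 2070, SELECTED)     # correct terms in selected set
precision_regex = round(532 / 2070, 3)
recall_regex = round(true_pos_regex / (true_pos_regex + false_negatives), 3)

# Term voting method (no excluded-set sample, so no recall estimate)
true_pos_voting = extrapolate(313, 2287, SELECTED)
precision_voting = round(313 / 2287, 3)

print(false_negatives, true_pos_regex, precision_regex, recall_regex)
print(true_pos_voting, precision_voting)
```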
Results of the evaluation of the subsumption pairs in the output of Step 8a.
| Number of terms sampled | 408 |
| Number of valid terms | 198 |
| Number of incorrect terms | 210 |
| Term precision on this sample | 0.49 |
| Number of subsumption pairs in this sample | 204 |
| Number of valid subsumption pairs in this sample | 57 |
| Number of incorrect subsumption pairs in this sample | 147 |
| Precision of the subsumption pairs in this sample | 0.28 |
Data from human evaluation of Ontologies 8a and 7b.
| | Ontology 8a – Regex | | Ontology 7b – Voting | |
| Total number of top-level terms | 662 | | 1886 | |
| Number of sampled top-level terms | 200 | 30.2% of total | 200 | 10.6% of total |
| Number of sampled top-level terms relevant to animal behaviour | 84 | 42% of sample | 82 | 41% of sample |
| Proportionate number of top-level animal behaviour terms in the whole ontology | 278 | | 773 | |
| Number of sampled top-level animal behaviour terms not found in the other ontology | 44 | 22% of sample | 77 | 38.5% of sample |
| Proportionate number of top-level animal behaviour terms in the whole ontology not found in the other ontology | 145 | | 727 | |
Figure 8. A sample subsumption hierarchy generated from the phrase "high reproductive potential egg laying" using the string inclusion method.
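The string inclusion idea can be sketched as follows. This assumes, without confirmation from the text available here, that each multi-word term is treated as a subclass of the term obtained by dropping its leftmost word, so that the head noun ends up at the top of the chain; the function names are hypothetical.

```python
def string_inclusion_chain(term):
    """List the term followed by each shorter term that is a suffix of it."""
    words = term.split()
    return [" ".join(words[i:]) for i in range(len(words))]


def subsumption_pairs(term):
    """Return (child, parent) pairs linking each term to its next shorter suffix."""
    chain = string_inclusion_chain(term)
    return list(zip(chain, chain[1:]))


chain = string_inclusion_chain("high reproductive potential egg laying")
pairs = subsumption_pairs("high reproductive potential egg laying")

for child, parent in pairs:
    print(f"{child!r} is a subclass of {parent!r}")
```

Under this assumption a five-word phrase produces a chain five classes deep, which is consistent with the maximum depth of 5 reported in the data set table.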
No. of classes and effort involved in the production of a number of manually curated ontologies.
| Ontology | No. of classes | Subsumption axioms | Duration | Effort in hours |
| ABO | 305 | 303 | 3 years | 800 |
| Normalisation of CTO | 1110 | 2823 | 8 months | 444 |
| EFO | 1420 | 1873 | 1 year | 1440 |
| InfluenzO | 269 | 223 | 2.5 years | 2900 |
| BioPAX L1 | 28 | 67 | 2.5 years | 9900 |
| OBI | 1366 | 2016 | 3 years | 14,296 |
| GO | 26894 | | 10 years | > 160 000 |