Thomas Wächter, Michael Schroeder.
Abstract
MOTIVATION: Ontologies and taxonomies have proven highly beneficial for biocuration. The Open Biomedical Ontology (OBO) Foundry alone lists over 90 ontologies mainly built with OBO-Edit. Creating and maintaining such ontologies is a labour-intensive, difficult, manual process. Automating parts of it is of great importance for the further development of ontologies and for biocuration.
Year: 2010 PMID: 20529942 PMCID: PMC2881373 DOI: 10.1093/bioinformatics/btq188
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Overview on term generation systems and their characteristics
| System | Linguistic filtering | Statistical filtering | Machine learning | Context | Description |
|---|---|---|---|---|---|
| | | | | | Morphosyntactic patterns, list of suffixes, frequency, mutual information (Medicine), 70% recall |
| | | | | | NP parsers, statistical disambig., sub-compound generation, 240 Mb News corpus, 82% recall |
| | | | | | POS tagger; context defining words in the corpus, 75% precision within top 25% of terms |
| | | | | | Comprehensive system including term, definition extraction and disambiguation; Tourism domain 0.80 |
| | | | | | Framework for ontology learning, algorithms for term and relation extraction |
| Lee | | | | | Dependency parsing for relationship extraction for sub-units of GO concepts low |
| Wermter and Hahn ( | | | | | Comparison of statistics with filtering by frequency or linguistic information |
All methods use linguistic filtering, most also use statistical filtering, and some use context information. The quality is given in terms of precision and recall (see ‘Methods’ section).
Overview on the quality of taxonomy induction
| System | Quality |
|---|---|
| Hearst ( | Precision > 90%, Recall << 10% |
| Caraballo ( | 33% Precision (strict), 60% Precision |
| Sanderson and Croft ( | 48% Precision (baseline 28%) |
| Cimiano | |
| Snow | Maximal |
| Snow | 58% Precision, 20% Recall |
| Ryu and Choi ( | All Recall and Precision below 50% |
The F-measure is usually <50%. In information retrieval, quality is often measured as F-measure (F), the harmonic mean of precision and recall (see ‘Methods’ section).
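The harmonic mean mentioned in the note above can be illustrated with a short sketch. The function name `f_measure` is chosen here for illustration and is not taken from the paper; the example input is the Snow row of the table (58% precision, 20% recall).

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention for the degenerate case (an assumption)
    return 2 * precision * recall / (precision + recall)


# Snow row from the table: 58% precision, 20% recall.
snow_f = f_measure(0.58, 0.20)
print(f"F = {snow_f:.2f}")  # roughly 0.30, i.e. below 50%
```

Because the harmonic mean is dominated by the smaller of the two values, a system with high precision but very low recall still ends up with a low F-measure.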
Overview on the quality of definition extraction and definitional question answering
| System | Quality |
|---|---|
| Xu | |
| Yang | |
| Echihabi | |
| Han | |
| Degórski | |
In the TREC2003 task on definitional question answering, the best system achieved an F-measure of F = 0.31. In information retrieval, quality is often measured as F-measure (F), the harmonic mean of precision and recall (see ‘Methods’ section).
Fig. 1. Screenshot of the OBO-Edit ontology generation tool showing the three steps ‘Term Generation’, ‘Definition Generation’ and ‘Add To Ontology’ for the example of adding and defining the term endosome.
Original and the best generated definition for four GO and four MeSH terms
| Original | Generated | |
|---|---|---|
| Gene ontology | ||
| Integrin biosynthetic process (GO:0045112) The chemical reactions and pathways resulting in the formation of integrins, a large family of transmembrane proteins that act as receptors for cell-adhesion molecules. | 4th: integrin is a heterodimer transmembrane protein that plays a critical role in cellular adhesion and migration during the inflammation and immune response. ( | Correct |
| Anion channel activity (GO:0005253) Catalysis of the energy-independent passage of anions across a lipid bilayer down a concentration gradient. | 1st: Anion channel is an integral membrane protein or more typically an assembly of several proteins. ( | Valid |
| Benzoate metabolic process (GO:0018874) The chemical reactions and pathways involving benzoate, the anion of benzoic acid, a fungistatic compound widely used as a food preservative; […] | 1st: Benzoate is a common carbon source in nature that is funnelled directly to the widely distributed benzoyl-coenzyme A (benzoyl-CoA) central pathway. ( | Valid |
| | 1st: cerebral cortex is a layer of nerve cells forming a convoluted outer shell over the brain, […] in which much of the thinking or higher intellectual activity of the brain takes place. ( | Valid |
| Medical subject headings | ||
| Flucytosine (D005437) A fluorinated cytosine analog that is used as an antifungal agent. | 1st: Flucytosine is a fluorine analog of cytosine […], leading to inhibition of thymidylate synthetase and disruption of DNA synthesis. ( | Correct |
| Cystoscopy (D003558) Endoscopic examination, therapy or surgery of the urinary bladder. | 3rd: cystoscopy is an examination of the bladder […] using a flexible, miniature telescope […] ( | Correct |
| | | Correct |
| | | Correct |
Definitions are manually labelled as either correct, if they match the GO/MeSH definition, or valid, if they contain useful information. For each generated definition, the rank of retrieval (1st, 2nd, 3rd or 4th) is shown.
Fig. 2. The mean percentage of generated terms from UMLS, MeSH, OBO and GO in the top-k ranked generated terms and their distance to the randomly selected query MeSH term used to retrieve 250 PubMed abstracts. The generated terms show both a high proportion of terms similar to existing ontology terms, justifying the notion of noun phrases as term candidates, and a certain variance in the distances of generated terms to the query MeSH term, thus mapping out the neighbourhood of the query MeSH term as well as addressing other aspects of the document set.
Proportion of terms in MeSH and GO containing parent terms, ancestor terms or other existing terms in their definitions
| | All GO | All MeSH |
|---|---|---|
| Total | 28 814 | 29 348 |
| Terms with definition (%) | 99.1 | 96.0 |
| Words in definition | 24.3 (±15.3) | 30.2 (±19.3) |
| Terms in definition | 2.4 (±2.3) | 5.7 (±4.1) |
| ≥ 1 term in definition (%) | 88.0 | 97.2 |
| ≥ 1 ancestor in definition (%) | 54.1 | 56.2 |
| ≥ 1 parent in definition (%) | 15.8 | 36.6 |
Nearly all GO and MeSH terms are defined. 54.1–56.2% of terms are defined via an ancestor, 15.8–36.6% via a parent term.
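How such proportions could be counted is sketched below. This is a minimal illustration, not the authors' procedure: the toy ontology, the whole-word matching rule and all names (`ONTOLOGY`, `ancestors`, `mentions`, `share_defined_via`) are assumptions made for the example.

```python
import re

# Hypothetical toy ontology: term -> (definition, list of parent terms).
# These entries are invented for illustration and are not real GO/MeSH data.
ONTOLOGY = {
    "cell": ("The basic structural unit of organisms.", []),
    "organelle": ("A specialised structure within a cell.", ["cell"]),
    "endosome": ("A membrane-bound organelle that carries endocytosed material.", ["organelle"]),
    "lysosome": ("A vacuole in the cell containing hydrolytic enzymes.", ["organelle"]),
}


def ancestors(term):
    """All terms reachable by following parent links (parents included)."""
    seen, stack = set(), list(ONTOLOGY[term][1])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(ONTOLOGY[current][1])
    return seen


def mentions(definition, term):
    """Assumed matching rule: case-insensitive whole-word occurrence."""
    return re.search(r"\b" + re.escape(term) + r"\b", definition, re.IGNORECASE) is not None


def share_defined_via(get_related):
    """Percentage of terms whose definition mentions at least one related term."""
    hits = sum(
        any(mentions(definition, related) for related in get_related(term))
        for term, (definition, _) in ONTOLOGY.items()
    )
    return 100.0 * hits / len(ONTOLOGY)


parent_share = share_defined_via(lambda term: ONTOLOGY[term][1])
ancestor_share = share_defined_via(ancestors)
print(parent_share, ancestor_share)
```

Since every parent is also an ancestor, the ancestor share can never be lower than the parent share, which matches the ordering of the two rows in the table.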
Evaluation of generated definitions for 500 GO and 500 MeSH terms
| | 500 GO: Correct (%) | 500 GO: Valid (%) | 500 MeSH: Correct (%) | 500 MeSH: Valid (%) |
|---|---|---|---|---|
| Top 1 | 21.9 | 41.2 | 32.0 | 47.0 |
| Within top 5 | 27.8 | 54.6 | 49.8 | 72.6 |
| Within top 10 | 27.8 | 54.6 | 53.6 | 78.2 |
For 22–38% of terms, the top-ranked definition captured aspects of the true definition; for 41–47% it was a valid definition but not similar to the original one. Within the top 10 ranked definitions, a valid definition was found for 55–78% of terms.
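The "top 1" and "within top k" figures can be read as hit rates over ranked lists of labelled definitions. The sketch below shows one way to compute them; the labels, the data and the function name `hit_rate_at_k` are hypothetical, and counting "correct" and "valid" as separate labels is an assumption.

```python
def hit_rate_at_k(labels_per_term, accepted, k):
    """Percentage of terms with at least one accepted label among the top k."""
    hits = sum(
        any(label in accepted for label in labels[:k])
        for labels in labels_per_term.values()
    )
    return 100.0 * hits / len(labels_per_term)


# Hypothetical manual labels per term, ordered by retrieval rank (1st, 2nd, ...).
RANKED_LABELS = {
    "endosome": ["correct", "valid", "wrong"],
    "flucytosine": ["wrong", "valid", "correct"],
    "cystoscopy": ["wrong", "wrong", "wrong"],
    "benzoate": ["valid", "wrong"],
}

top1_correct = hit_rate_at_k(RANKED_LABELS, {"correct"}, 1)
top5_correct = hit_rate_at_k(RANKED_LABELS, {"correct"}, 5)
top5_valid = hit_rate_at_k(RANKED_LABELS, {"valid"}, 5)
print(top1_correct, top5_correct, top5_valid)
```

A hit rate computed this way can only stay equal or grow as k increases, which is the pattern seen between the "Top 1", "Within top 5" and "Within top 10" rows.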
Evaluation of taxonomic information contained in generated definitions for 500 GO and 500 MeSH terms
| | 500 GO: Parent (%) | 500 GO: Ancestor (%) | 500 MeSH: Parent (%) | 500 MeSH: Ancestor (%) |
|---|---|---|---|---|
| Contained in top 1 | 12.2 | 32.4 | 20.2 | 37.0 |
| Contained in top 10 | 13.4 | 38.0 | 26.0 | 54.4 |
For 26% of the 500 randomly selected MeSH terms the parent and for 54% some ancestor could be found in the top 10 generated definitions.
Fig. 3. Generated taxonomy for the MeSH subtree ‘Blood’. Result for co-occurrence-based taxonomy induction as described in Section 4.3 using a maximum of 10 000 000 documents per node and a threshold of 0.01.
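To give a feel for how a similarity threshold shapes a co-occurrence-based taxonomy, the following is a loose, simplified sketch in the spirit of threshold-based induction. It is not the procedure of Section 4.3: the Jaccard similarity over document sets, the ordering by document frequency, the toy data and the names `TERM_DOCS` and `induce_taxonomy` are all assumptions made for illustration.

```python
# Hypothetical toy data: term -> set of IDs of documents mentioning the term.
TERM_DOCS = {
    "blood": {1, 2, 3, 4, 5, 6},
    "plasma": {1, 2, 3},
    "serum": {4, 5},
    "fetal blood": {6},
}


def induce_taxonomy(term_docs, threshold):
    """Attach each term to its most similar already placed term, else to ROOT."""

    def similarity(a, b):
        # Assumed measure: Jaccard overlap of the two document sets.
        union = term_docs[a] | term_docs[b]
        return len(term_docs[a] & term_docs[b]) / len(union) if union else 0.0

    # Assumption: terms found in more documents are more general, so place them first.
    order = sorted(term_docs, key=lambda t: (-len(term_docs[t]), t))
    parent_of = {}
    for term in order:
        best, best_sim = "ROOT", 0.0
        for placed in parent_of:
            sim = similarity(term, placed)
            if sim > best_sim:
                best, best_sim = placed, sim
        # Below the threshold the term is not trusted as a child and goes to the root.
        parent_of[term] = best if best_sim >= threshold else "ROOT"
    return parent_of


loose = induce_taxonomy(TERM_DOCS, threshold=0.1)
strict = induce_taxonomy(TERM_DOCS, threshold=0.4)
print(loose)
print(strict)
```

In this toy setting a low threshold attaches every term below "blood", while a higher threshold leaves weakly co-occurring terms at the root, trading recall for precision.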
Precision, recall and F-measure for automatic induction of MeSH using document-wise co-occurrences in PubMed abstracts
| | Threshold (Heymann alg.) | Precision (AB) | Recall (AB) | F-measure (AB) | Precision (AB\|A..B\|BA) | Recall (AB\|A..B\|BA) | F-measure (AB\|A..B\|BA) |
|---|---|---|---|---|---|---|---|
| MeSH | 0.5 | 0.28 | 0.02 | 0.04 | 0.34 | 0.02 | 0.04 |
| MeSH | 0.1 | 0.15 | 0.11 | 0.13 | 0.21 | 0.17 | 0.19 |
| MeSH | 0.01 | 0.12 | 0.13 | 0.13 | 0.19 | 0.20 | 0.19 |
Results for AB regard only relations that exist in MeSH as correct, while AB|A..B|BA also regards predicted ancestor relations and inverted direct relations as correct.
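The two evaluation modes can be sketched as follows. This is an illustrative reading of the note above, not the authors' evaluation code: the single-parent gold hierarchy, the toy relations, the choice to compute recall as correct predictions divided by the number of gold relations, and the names `GOLD_PARENT`, `PREDICTED` and `evaluate` are all assumptions.

```python
# Hypothetical gold hierarchy (child -> parent); invented, not actual MeSH relations.
GOLD_PARENT = {
    "plasma": "blood",
    "serum": "plasma",
    "fetal blood": "blood",
    "platelets": "blood",
}

# Hypothetical predicted relations as (parent, child) pairs.
PREDICTED = {
    ("blood", "plasma"),       # direct relation that exists in the gold hierarchy
    ("blood", "serum"),        # ancestor instead of parent
    ("serum", "plasma"),       # inverted direct relation
    ("serum", "fetal blood"),  # wrong under both modes
}


def ancestors(term, parent_of):
    """Chain of ancestors of a term, assuming a cycle-free single-parent tree."""
    chain = []
    while term in parent_of:
        term = parent_of[term]
        chain.append(term)
    return chain


def evaluate(predicted, gold_parent, relaxed=False):
    """Return (precision, recall, F-measure) for the predicted relations."""
    gold = {(parent, child) for child, parent in gold_parent.items()}

    def is_correct(edge):
        parent, child = edge
        if edge in gold:
            return True
        if not relaxed:
            return False
        # Relaxed mode: also accept ancestor relations and inverted direct relations.
        return parent in ancestors(child, gold_parent) or (child, parent) in gold

    correct = sum(is_correct(edge) for edge in predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0  # assumed definition of recall
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f


strict_scores = evaluate(PREDICTED, GOLD_PARENT)
relaxed_scores = evaluate(PREDICTED, GOLD_PARENT, relaxed=True)
print(strict_scores, relaxed_scores)
```

The relaxed mode can only add accepted relations, so its scores are never lower than the strict ones, in line with the pattern in the table.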