Lana Yeganova, Won Kim, Donald C Comeau, W John Wilbur.
Abstract
BACKGROUND: There are several manually defined ontologies relevant to Medline. However, Medline is a fast-growing collection of biomedical documents, which makes it difficult to update and expand these manually defined ontologies. Automatically identifying meaningful categories of entities in a large text corpus is useful for information extraction, construction of machine learning features, and development of semantic representations. In this paper we describe and compare two methods for automatically learning meaningful biomedical categories in Medline. The first approach is a simple statistical method that uses part-of-speech and frequency information to extract a list of frequent nouns from Medline. The second method implements an alignment-based technique to learn frequent generic patterns that indicate a hyponymy/hypernymy relationship between a pair of noun phrases. We then apply these patterns to Medline to collect frequent hypernyms as potential biomedical categories.
Year: 2012 PMID: 23046816 PMCID: PMC3465206 DOI: 10.1186/2041-1480-3-S3-S3
Source DB: PubMed Journal: J Biomed Semantics
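The first approach in the abstract combines part-of-speech and frequency information. A minimal sketch of that idea follows, assuming the text has already been tagged with Penn Treebank style tags; the function name `frequent_nouns`, the `min_count` threshold, and the toy input are hypothetical and are not taken from the paper.

```python
from collections import Counter

def frequent_nouns(tagged_tokens, min_count=2):
    """Count tokens tagged as nouns and keep those seen at least min_count times.

    tagged_tokens: iterable of (word, tag) pairs. Tags are assumed to follow
    the Penn Treebank convention, where every noun tag starts with "NN".
    """
    counts = Counter(
        word.lower() for word, tag in tagged_tokens if tag.startswith("NN")
    )
    # most_common() returns entries sorted by descending frequency.
    return [(noun, n) for noun, n in counts.most_common() if n >= min_count]

# Hand-tagged toy input (invented for illustration, not taken from Medline).
tagged = [
    ("Aspirin", "NN"), ("is", "VBZ"), ("a", "DT"), ("drug", "NN"), (".", "."),
    ("The", "DT"), ("drug", "NN"), ("inhibits", "VBZ"), ("enzymes", "NNS"), (".", "."),
]
print(frequent_nouns(tagged))
```

The sketch does no lemmatization, so singular and plural forms are counted separately; a fuller pipeline would likely normalize them first.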
List of 40 patterns generated by alignment-based method.
| Pattern | Pattern |
|---|---|
| X is a Y | X is a potent Y |
| X are Y | X is the most common Y |
| X and other Y | X are rare Y |
| X as a Y | X is a widely used Y |
| X such as Y | X is an uncommon Y |
| X is an Y | X is an autosomal dominant Y |
| X as an Y | X is a form of Y |
| X is an important Y | X is one of the major Y |
| X a new Y | X is a chronic Y |
| X are the most common Y | X and other forms of Y |
| X is a rare Y | X is a broad spectrum Y |
| X is a novel Y | X is the primary Y |
| X is a major Y | X is a rare autosomal recessive Y |
| X is an essential Y | X is the most common type of Y |
| X was the only Y | X is the second most common Y |
| X was the most common Y | X are the most frequent Y |
| X is a common Y | X is the most widely used Y |
| X is a new Y | X is the most frequent Y |
| X is a complex Y | X is the most common primary Y |
| X is an effective Y | X is one of the major Y |
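Patterns like those listed above can be applied to text to propose candidate (narrower, broader) pairs. The following is a rough sketch using regular expressions, not the authors' implementation: it assumes X and Y each match a single alphabetic token rather than a full noun phrase, and the names `compile_pattern` and `extract_pair` are hypothetical.

```python
import re

# Three of the patterns listed in the table above.
PATTERNS = ["X and other Y", "X is a Y", "X is a form of Y"]

# Simplifying assumption: X and Y each stand for one alphabetic token,
# not a full noun phrase as in the paper.
TOKEN = r"[A-Za-z][A-Za-z-]*"

def compile_pattern(pattern):
    """Turn a pattern such as 'X and other Y' into a compiled regex."""
    parts = []
    for piece in pattern.split():
        if piece == "X":
            parts.append(rf"(?P<x>{TOKEN})")
        elif piece == "Y":
            parts.append(rf"(?P<y>{TOKEN})")
        else:
            parts.append(re.escape(piece))
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)

# Try longer patterns first so "X is a form of Y" wins over "X is a Y".
COMPILED = [
    compile_pattern(p)
    for p in sorted(PATTERNS, key=lambda p: len(p.split()), reverse=True)
]

def extract_pair(sentence):
    """Return (narrower, broader) for the first matching pattern, else None."""
    for regex in COMPILED:
        match = regex.search(sentence)
        if match:
            return match.group("x"), match.group("y")
    return None

print(extract_pair("Effects of anticoagulants and other drugs"))
```

Because each placeholder matches only one token, a multiword term such as "rabies virus" would be truncated to "virus"; handling that case would need a noun phrase chunker.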
Sentences containing pairs of terms in a narrower/broader relationship, with the narrower term replaced by X and the broader term replaced by Y.
| | X | Pattern | Y | |
|---|---|---|---|---|
| Effects of | anticoagulants | and other | drugs | |
| Epidemiology of | rabies virus | and other | lyssaviruses | |
| Biosynthesis of | Cholesterol | and other | Sterols | |
| | Alcohol | and other | drug | dependencies |
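The substitution described in the caption, where the narrower term becomes X and the broader term becomes Y, can be sketched as below. The helper name `to_template` is hypothetical, and the sketch assumes that each term occurs literally in the sentence and that neither term is contained in the other.

```python
import re

def to_template(sentence, narrower, broader):
    """Replace the first occurrence of the narrower term with X and of the broader term with Y."""
    def swap(text, term, placeholder):
        # Word boundaries keep "drug" from matching inside "drugs".
        return re.sub(
            rf"\b{re.escape(term)}\b", placeholder, text,
            count=1, flags=re.IGNORECASE,
        )
    return swap(swap(sentence, narrower, "X"), broader, "Y")

print(to_template(
    "Epidemiology of rabies virus and other lyssaviruses",
    "rabies virus", "lyssaviruses",
))
```

Templates produced this way expose the shared context between X and Y ("and other" in the rows above), which is the material an alignment step would compare across sentences.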
Counts, percentage, and significance of overlap between the FHN100 ⊂ FHN50 ⊂ FHN10 lists and the PN50 list. Percentage is computed relative to the FHN sets.
| Set | Counts | Percentage | Significance |
|---|---|---|---|
| FHN10 | 1,564 | 50% | 2215 |
| FHN50 | 915 | 76% | 1483 |
| FHN100 | 522 | 87% | 885 |
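The count and percentage columns follow directly from set intersection, with the percentage taken relative to the FHN set as the caption states. A small sketch, with a hypothetical function name `overlap_stats` and invented toy sets rather than the actual FHN and PN50 lists:

```python
def overlap_stats(fhn_terms, pn_terms):
    """Return (overlap count, overlap as a percentage of the FHN set)."""
    fhn, pn = set(fhn_terms), set(pn_terms)
    if not fhn:
        raise ValueError("the FHN set must not be empty")
    shared = fhn & pn
    return len(shared), 100.0 * len(shared) / len(fhn)

# Toy sets invented for illustration only.
fhn_toy = {"drug", "protein", "enzyme", "disease"}
pn_toy = {"drug", "enzyme", "virus"}

count, pct = overlap_stats(fhn_toy, pn_toy)
print(count, pct)  # 2 50.0
```

The significance column is not reproduced here because this excerpt does not describe how it was computed.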