Karin M Verspoor, Go Eun Heo, Keun Young Kang, Min Song.
Abstract
BACKGROUND: The Variome corpus, a small collection of published articles about inherited colorectal cancer, includes annotations of 11 entity types and 13 relation types related to the curation of the relationship between genetic variation and disease. Due to the richness of these annotations, the corpus provides a good testbed for evaluation of biomedical literature information extraction systems.
Keywords: Biocuration; Genetic variant information; Information extraction; Text mining
Year: 2016 PMID: 27454860 PMCID: PMC4959367 DOI: 10.1186/s12911-016-0294-3
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Fig. 1 Overall architecture of the proposed approach
Fig. 2 Architecture of PKDE4J. The structure of the overall PKDE4J system
Rules for syntactic feature generation
| Rule name | Description |
|---|---|
| Verb in dependency path | At the level of the root (verb), records the types and directions of the subordinate dependency relations |
| No verb in dependency path | Determines whether a verb occurs between the two entities. If not, the detect nominalization or weak nominalization rule is applied |
| Contains clause | Checks whether the sentence contains any clauses |
| Clause distance | Distance between the clause and the closest entities to its left and right. The entities may all lie to the right, all to the left, or on both sides |
| Same head | Checks whether the two entities have the same parent |
| Full tree path | The full tree path, used in the dependency parsing process |
| Path length | Length of the path from the parent node to the child node |
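
As an illustration of the "Same head", "Path length", and "Verb in dependency path" rules, the following is a minimal sketch over a hand-built toy parse. It is not the PKDE4J implementation: the sentence, the head indices, the part-of-speech tags, and all function names are assumptions made for this example.

```python
from typing import Dict, List, Optional

# Hypothetical toy dependency parse (not taken from the Variome corpus).
# HEADS maps each token index to the index of its head; None marks the root.
TOKENS = ["Mutations", "in", "MLH1", "cause", "Lynch", "syndrome"]
POS = ["NOUN", "ADP", "PROPN", "VERB", "PROPN", "NOUN"]
HEADS: Dict[int, Optional[int]] = {0: 3, 1: 2, 2: 0, 3: None, 4: 5, 5: 3}


def path_to_root(heads: Dict[int, Optional[int]], node: int) -> List[int]:
    """Return the chain of nodes from `node` up to the root, inclusive."""
    path = [node]
    while heads[path[-1]] is not None:
        path.append(heads[path[-1]])
    return path


def dependency_path(heads: Dict[int, Optional[int]], a: int, b: int) -> List[int]:
    """Return the nodes on the tree path from a to b via their lowest common ancestor."""
    up_a = path_to_root(heads, a)
    up_b = path_to_root(heads, b)
    ancestors_b = set(up_b)
    # First node on a's upward chain that is also on b's upward chain.
    lca = next(n for n in up_a if n in ancestors_b)
    left = up_a[: up_a.index(lca) + 1]   # a ... lca
    right = up_b[: up_b.index(lca)]      # b ... child of lca
    return left + right[::-1]


def path_length(heads: Dict[int, Optional[int]], a: int, b: int) -> int:
    """Number of edges on the dependency path between a and b."""
    return len(dependency_path(heads, a, b)) - 1


def same_head(heads: Dict[int, Optional[int]], a: int, b: int) -> bool:
    """True when both tokens attach to the same parent node."""
    return heads[a] is not None and heads[a] == heads[b]


def verb_in_path(heads: Dict[int, Optional[int]], pos: List[str], a: int, b: int) -> bool:
    """True when at least one node on the dependency path is tagged as a verb."""
    return any(pos[i] == "VERB" for i in dependency_path(heads, a, b))
```

Here the path between "MLH1" (index 2) and "syndrome" (index 5) passes through "Mutations" and the root verb "cause", so a rule keyed on the verb in the path would fire for this pair.
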
Rules for lexical feature generation
| Rule name | Description |
|---|---|
| Negation | Determines negation by checking for a negative word, e.g., a negative adjective modifying the verb or relation trigger, or a semantically negative word |
| Tense (active/passive) | Tense determination |
| Words in between entities | The sequence of words between the two entities |
| Surface distance | Distance between the two recognized entities (including existing tokens and entity itself) |
| Window left entity | A window of |
| Window right entity | A window of |
| Detect nominalization | Existence of nominalized verb located in left/right position of the entity and distance from specific entity |
| Weak nominalization | Detect when an entity occurs after a preposition and whether a nominalized biomedical verb is located ahead of that preposition |
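
Some of the lexical rules above can be sketched in a few lines. The example below is a simplified illustration, not the PKDE4J code: the cue list, the span convention (token offsets with an exclusive end), and the reading of "surface distance" as the token count from the start of the first entity to the end of the second are all assumptions. The negation check is also cruder than the rule in the table, since it only looks for a cue word among the tokens between the entities.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive (assumed convention)

# Assumed, illustrative cue list; a real system would use a curated lexicon.
NEGATION_CUES = {"no", "not", "never", "without", "neither", "nor"}


def lexical_features(tokens: List[str], e1: Span, e2: Span) -> Dict[str, object]:
    """Compute a few lexical features for a pair of non-overlapping entity spans."""
    first, second = sorted([e1, e2])  # order the spans by position in the sentence
    between = tokens[first[1]:second[0]]
    return {
        # The sequence of words between the two entities.
        "words_between": between,
        # One possible reading of surface distance: tokens spanned by both
        # entities plus everything between them.
        "surface_distance": second[1] - first[0],
        # Simplified negation test: any cue word between the entities.
        "negated": any(tok.lower() in NEGATION_CUES for tok in between),
    }


sentence = "MLH1 mutations were not detected in the sporadic cohort".split()
features = lexical_features(sentence, (0, 2), (7, 9))
```

For this toy sentence the words between the entities are "were not detected in the", and the negation flag is set because of "not".
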
NER-based Rules for supplementary feature generation
| Rule name | Description |
|---|---|
| Number entities between entities | Number of recognizable entities located between the two recognized entities |
| Entities in between the two main entities | Show which entities are located in between |
| Entity counts | Number of entities |
| Entity order | Order of entities |
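
The NER-based rules lend themselves to a compact sketch. The following is an illustrative reading of the table, not the PKDE4J implementation: the `Entity` class, its field names, and the feature names are hypothetical, and "entity counts" is taken here to mean the total number of recognized entities in the sentence.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Entity:
    """Hypothetical entity record: a type label and a token span (end exclusive)."""
    label: str
    start: int
    end: int


def ner_features(entities: List[Entity], e1: Entity, e2: Entity) -> Dict[str, object]:
    """Compute supplementary features from the entities recognized in one sentence."""
    first, second = sorted([e1, e2], key=lambda e: e.start)
    # Entities lying entirely between the two main entities, in sentence order.
    between = sorted(
        (e for e in entities if e.start >= first.end and e.end <= second.start),
        key=lambda e: e.start,
    )
    return {
        "num_entities_between": len(between),
        "entities_between": [e.label for e in between],
        "entity_count": len(entities),                 # assumed: total per sentence
        "entity_order": (first.label, second.label),   # left-to-right order of the pair
    }


cohort = Entity("cohort", 0, 1)
gene = Entity("gene", 3, 4)
mutation = Entity("mutation", 4, 5)
disease = Entity("disease", 7, 9)
result = ner_features([cohort, gene, mutation, disease], disease, cohort)
```

In this toy sentence the gene and mutation mentions sit between the cohort and disease mentions, so two intervening entities are reported regardless of the order in which the pair is passed.
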
Fig. 3 Stages in iterative optimization algorithm
Fig. 4 Example Variome annotation. An example sentence annotation from the Variome corpus [6], including annotations of a number of relations such as cohort-has-mutation (“individuals with germline mutations”) and cohort-has-size (“5% of all colorectal cancers”, where ‘cancers’ is treated as a metonym for a patient cohort)
Relation extraction results over Variome corpus relations with at least 100 examples, based on 10-fold cross-validation and assuming gold-standard entity annotations
| Relation | N | Baseline P | Baseline R | Baseline F | PKDE4J P | PKDE4J R | PKDE4J F |
|---|---|---|---|---|---|---|---|
| Disease-has-ConceptIdeas | 431 | 0.704 | 0.922 | 0.764 | 0.799 | 0.746 | |
| Disease-has-Physiology | 188 | 0.752 | 0.873 | 0.788 | 0.885 | 0.684 | |
| Disease-has-Disorders | 349 | 0.754 | 0.884 | | 0.773 | 0.713 | 0.752 |
| Disease-relatedTo-BodyPart | 445 | 0.763 | 0.853 | | 0.794 | 0.666 | 0.746 |
| Mutation-relatedTo-Disease | 126 | 0.702 | 0.986 | | 0.683 | 0.973 | 0.758 |
| Gene-has-Physiology | 180 | 0.844 | 0.897 | | 0.857 | 0.723 | 0.807 |
| Gene-has-Mutation | 538 | 0.835 | 0.866 | | 0.910 | 0.569 | 0.758 |
| Cohort-has-Mutation | 307 | 0.873 | 0.921 | | 0.909 | 0.726 | 0.839 |
| Cohort-has-Disease | 717 | 0.715 | 0.813 | 0.745 | 0.865 | 0.654 | |
| Cohort-has-Size | 669 | 0.857 | 0.793 | 0.835 | 0.910 | 0.736 | |
| Cohort-has-Disorders | 119 | 0.859 | 0.918 | | 0.903 | 0.748 | 0.845 |
P = Precision, R = Recall, F = F0.5 score (weighting P more than R due to the importance of Precision). The {P,R}-Base results refer to results from a simple co-occurrence baseline. The best F score (F-Base or F-PKDE4J) is bolded
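
For reference, the F0.5 score is the β = 0.5 case of the general F-beta measure, F_β = (1 + β²)·P·R / (β²·P + R), which weights precision more heavily than recall when β < 1. The sketch below uses the standard formula with made-up precision and recall values; the function name is hypothetical, and the example is not an attempt to reproduce the table entries, which come from 10-fold cross-validation.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 favours precision, beta > 1 favours recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing is retrieved correctly
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative values only (not taken from the results table).
f_half = f_beta(0.8, 0.6, beta=0.5)
f_one = f_beta(0.8, 0.6, beta=1.0)
print(round(f_half, 3))  # 0.75
print(round(f_one, 3))   # 0.686
```

With precision higher than recall, F0.5 comes out above F1 for the same pair of values, which is the intended effect of favouring precision.
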
Relation extraction results over Variome corpus relations, with the Characteristics entity type aggregating its subtypes (Concept & Ideas, Disorders, Physiology, and Phenomena) based on 10-fold cross-validation and assuming gold-standard entity annotations
| Relation | N | P | R | F |
|---|---|---|---|---|
| Disease-has-Characteristics | 989 | 0.749 | 0.602 | 0.667 |
| Mutation-has-Characteristics | 17 | 0.701 | 0.583 | 0.642 |
| Gene-has-Characteristics | 230 | 0.876 | 0.699 | 0.778 |
| Cohort-has-Characteristics | 239 | 0.886 | 0.734 | 0.803 |
P = Precision, R = Recall, F = F1 score