Wonjun Choi, Baeksoo Kim, Hyejin Cho, Doheon Lee, Hyunju Lee.
Abstract
BACKGROUND: Plants are natural products that humans consume in various ways, including as food and medicine. They have a long empirical history of treating diseases with relatively few side effects. Based on these strengths, many studies have been performed to verify the effectiveness of plants in treating diseases. It is crucial to understand the chemicals contained in plants, because these chemicals can regulate the activities of proteins that are key factors in causing diseases. With the accumulation of a large volume of biomedical literature in databases such as PubMed, it becomes possible to automatically extract relationships between plants and chemicals at large scale by applying a text mining approach. A cornerstone of achieving this task is a corpus of relationships between plants and chemicals.
Keywords: Chemical; Corpus; Data mining; Medicine; Natural language processing; Natural product; Plant; Text mining
Year: 2016 PMID: 27650402 PMCID: PMC5029005 DOI: 10.1186/s12859-016-1249-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 The workflow of corpus construction. The corpus was constructed as follows: (i) we collected PubMed abstracts from the PubTator database; (ii) we applied NER tools, including LingPipe and ChemSpot, to pre-annotate plant and chemical names; (iii) we extracted co-occurrence sentences that contain at least one plant name and at least one chemical name; (iv) we randomly selected candidate sentences, where the numbers of positive and negative sentences were set to be approximately the same, and split them into corpus units; (v) we manually annotated candidate corpus units according to our guidelines and conducted later annotation to harmonize disagreements after annotators finished their annotation tasks; and (vi) we converted the annotated corpus units to the BioC XML format
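Step (iii) of this workflow — keeping only sentences that mention both entity types — can be sketched as a simple filter. The entity lists and pre-split sentences below are hypothetical stand-ins for the LingPipe/ChemSpot pre-annotations, not the actual pipeline code:

```python
# Sketch of co-occurrence sentence extraction: keep only sentences that
# mention at least one pre-annotated plant name and one chemical name.
# The inputs stand in for NER output; the real pipeline uses
# LingPipe/ChemSpot annotations rather than plain string matching.

def co_occurrence_sentences(sentences, plant_names, chemical_names):
    """Return sentences containing >=1 plant and >=1 chemical mention."""
    plants = {p.lower() for p in plant_names}
    chemicals = {c.lower() for c in chemical_names}
    hits = []
    for sent in sentences:
        lowered = sent.lower()
        has_plant = any(p in lowered for p in plants)
        has_chemical = any(c in lowered for c in chemicals)
        if has_plant and has_chemical:
            hits.append(sent)
    return hits

sentences = [
    "About 450 mg of FB1 were obtained from 800 g cultured corn.",
    "The plant was collected in spring.",
]
print(co_occurrence_sentences(sentences, ["corn"], ["FB1"]))
# -> only the first sentence survives the filter
```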
Fig. 2 Annotation example in Excel file. The candidate corpus units were exported in Excel format so that annotators could easily annotate the corpus. The first and second columns represent annotated plant names and IDs contained in each sentence shown in the eleventh column. The sixth and seventh columns show annotated chemical names and IDs. Annotators should check whether plant and chemical names and their IDs are correctly annotated. If they are incorrectly annotated, then annotators should write the letter “X” in the fifth, tenth, or both columns. Also, they should leave comments in the fourth, ninth, or both columns. For the thirteenth column, annotators should determine whether the relationship in the sentence is positive or negative. If the sentence contains a positive relationship, the weak and strong triggers should be written in the last two columns
Fig. 3 An example of the result of the Stanford Dependency Parser. The tree shows the dependency type of each token when the Stanford Dependency Parser is applied to the following sentence: “About 450 mg of FB1 were obtained from 800 g cultured corn.” In the dependency parse tree, words in square brackets express the dependency types of the linked tokens below. For instance, [nsubjpass] means a passive nominal subject: a noun phrase that is the syntactic subject of a passive clause, here “about 450 mg of FB1”. Detailed descriptions of the various dependency types are available in the Stanford Typed Dependencies manual provided by the Stanford NLP Group
The rule specifications
| Rule specification type | # | Rule structure | Trigger word form | Constraint | Example (PMID) |
|---|---|---|---|---|---|
| Verbal trigger rule | 1 | … | Transitive verb (active form) | NA | … |
| | 2 | … | Transitive verb (passive form) | Any preposition between … | … |
| | 3 | … | Intransitive verb | Any preposition between … | … |
| Preposition trigger rule | 1 | … | Preposition | NA | … |
| | 2 | … | Preposition | NA | … |
| Relative trigger rule | 1 | … | Past participle form | Any preposition between … | … |
| | 2 | … | Gerund form | When the trigger word … | … |
| Apposition trigger rule | 1 | … | Apposition form (e.g. comma) | The token distance between … | … |
| | 2 | … | Apposition form (e.g. comma) | The token distance between … | … |
| Copula trigger rule | 1 | … | Be verb form | The token distance between … | … |
| | 2 | … | Be verb form | The token distance between … | … |
| Compound noun trigger rule | 1 | … | White space | NA | … |
The rule-based model consists of six types of rules. The first column shows the specification name. Each specification has one or more rule structures, shown in the third column. In the rule structure, NP0 denotes the noun phrase containing a plant name, and NP1 denotes the noun phrase in which a chemical name appears. The component marked with “tr” represents a trigger word, described in the fourth column. We also defined several constraints where necessary
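A minimal sketch of how one such rule could be checked against dependency triples. Only verbal trigger rule 1 (active transitive) is shown, and the triples, trigger set, and helper function are hypothetical illustrations rather than the authors' implementation:

```python
# Hypothetical check for the NP0-tr-NP1 structure of verbal trigger
# rule 1: a plant noun phrase as subject of an active transitive trigger
# verb whose direct object is a chemical noun phrase.

ACTIVE_TRIGGERS = {"contain", "contains", "contained"}

def match_verbal_rule(triples, plant, chemical):
    """triples: (governor, dep_label, dependent) tuples from a parse.

    True when a trigger verb governs the plant as nsubj and the chemical
    as dobj, i.e. the plant-contains-chemical pattern."""
    for verb, label, dep in triples:
        if verb in ACTIVE_TRIGGERS and label == "nsubj" and dep == plant:
            has_object = any(v == verb and l == "dobj" and d == chemical
                             for v, l, d in triples)
            if has_object:
                return True
    return False

# "Ginkgo biloba contains flavonoids." (illustrative triples)
triples = [("contains", "nsubj", "Ginkgo"),
           ("contains", "dobj", "flavonoids")]
print(match_verbal_rule(triples, "Ginkgo", "flavonoids"))  # -> True
```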
Trigger words used in the rule-based model. The table shows trigger words selected for the six predefined rules in the model
| Trigger type | Trigger word form | Trigger words |
|---|---|---|
| Verbal trigger | Active form | contain, contains, contained; have, has, had; involve, involves, involved; incorporate, incorporates, incorporated; possess, possesses, possessed; encompass, encompasses, encompassed; subsume, subsumes, subsumed; comprise, comprises, comprised; embody, embodies, embodied; embrace, embraces, embraced; include, includes, included; cover, covers, covered; compose, composes, composed; originate, originates, originated; produce, produces, produced; derive, derives, derived; accumulate, accumulates, accumulated; release, releases, released |
| | Passive form | contained, involved, incorporated, possessed, encompassed, subsumed, comprised, embodied, embraced, included, covered, composed, produced, originated, derived, accumulated, released, isolated, extracted, separated, detached, split, segregated, obtained, found, gained, discovered, uncovered, identified |
| | Intransitive form | consist, consists |
| Prepositional trigger | Preposition | any token whose dependency type is “prep” |
| Relative trigger | Past participle form | … |
| | Gerund form | containing, involving, incorporating, possessing, encompassing, subsuming, comprising, embracing, including, covering, composing, embodying, producing, originating, deriving, accumulating, releasing, having, consisting |
| Apposition trigger | Apposition form | any token whose dependency type is “appos” |
| Copula trigger | Copula form | any token whose dependency type is “cop” |
| Compound noun trigger | Compound noun form | strings that are made up of plant and chemical names together |
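The lexicon above can be organized as a simple lookup table keyed by trigger word form. The sketch below includes only a handful of the listed words for brevity:

```python
# A small slice of the trigger lexicon as a lookup table; note that some
# words (e.g. "contained") legitimately appear under more than one form.

TRIGGER_LEXICON = {
    "active": {"contain", "contains", "contained", "have", "has", "had"},
    "passive": {"contained", "isolated", "extracted", "obtained", "found"},
    "intransitive": {"consist", "consists"},
    "gerund": {"containing", "involving", "including", "consisting"},
}

def trigger_forms(word):
    """Return every trigger-word form a token matches (possibly several)."""
    return sorted(form for form, words in TRIGGER_LEXICON.items()
                  if word.lower() in words)

print(trigger_forms("contained"))  # -> ['active', 'passive']
```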
Fig. 4 The overall method for extracting plant–chemical relationships from abstracts. To extract plant–chemical relationships from biomedical articles, our model annotates plant and chemical names in the articles using LingPipe and ChemSpot, respectively. Then, the system splits annotated abstracts into sentences using the LingPipe sentence splitter and applies the Stanford Dependency Parser, which produces a dependency parse tree for each sentence. Finally, the rule-based model checks the grammatical structure of the dependency parse trees and the trigger words in sentences to extract relationships
Fig. 5 Example of our corpus units in BioC XML format. The corpus contains information about gold-standard sentences with annotated plant and chemical names along with their IDs, the locations of entity names, trigger terms (for positive relationships only), and class labels (positive or negative). The corpus data provided on our website are currently divided into three types: (i) 939 primary corpus units from annotation phases 1 and 2; (ii) 68 later-annotation corpus units after harmonizing disagreements; and (iii) 102 rule development corpus units
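A rough sketch of emitting one such corpus unit with the standard library. The element and infon names below follow the BioC schema only loosely; the exact keys used by this corpus may differ:

```python
# Simplified BioC-style serialization of one corpus unit (sentence,
# two entity annotations with offsets, and a relation label). The infon
# keys are illustrative, not necessarily those used by the real corpus.

import xml.etree.ElementTree as ET

def corpus_unit(pmid, sentence, plant, chemical, label):
    doc = ET.Element("document")
    ET.SubElement(doc, "id").text = pmid
    passage = ET.SubElement(doc, "passage")
    ET.SubElement(passage, "text").text = sentence
    for ann_id, (etype, text) in enumerate([("plant", plant),
                                            ("chemical", chemical)]):
        ann = ET.SubElement(passage, "annotation", id=str(ann_id))
        ET.SubElement(ann, "infon", key="type").text = etype
        offset = sentence.find(text)  # character location of the entity
        ET.SubElement(ann, "location",
                      offset=str(offset), length=str(len(text)))
        ET.SubElement(ann, "text").text = text
    rel = ET.SubElement(passage, "relation")
    ET.SubElement(rel, "infon", key="label").text = label
    return doc

unit = corpus_unit("27650402", "Corn accumulates FB1.",
                   "Corn", "FB1", "positive")
print(ET.tostring(unit, encoding="unicode"))
```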
Statistics and IAA scores for entities, relation labels, and triggers. The IAA scores for entities and triggers were calculated using simple percent agreement; the IAA score for relation labels was calculated using Cohen’s kappa statistic
| Phases | Entity | # of entities | # of agreements for entities (IAA, Simple) | Relation | # of class labels | # of agreements for relation labels (IAA, Kappa) | Trigger | # of triggers | # of agreements for triggers (IAA, Simple) |
|---|---|---|---|---|---|---|---|---|---|
| Phase 1 | Plants | 642 | 640 (99.7 %) | Plant–chemical | 642 | 570 (77.6 %) | Weak | 284 | 275 (96.8 %) |
| | Chemicals | 642 | 636 (99.1 %) | | | | Strong | 284 | 271 (95.4 %) |
| | Total | 1,284 | 1,276 (99.4 %) | | | | Total | 568 | 546 (96.1 %) |
| Phase 2 | Plants | 401 | 401 (100.0 %) | Plant–chemical | 401 | 369 (82.8 %) | Weak | 245 | 234 (95.5 %) |
| | Chemicals | 401 | 400 (99.8 %) | | | | Strong | 245 | 223 (91.0 %) |
| | Total | 802 | 801 (99.9 %) | | | | Total | 490 | 457 (93.3 %) |
| Overall | | 2,086 | 2,077 (99.6 %) | | 1,043 | 939 (79.8 %) | | 1,058 | 1,003 (94.8 %) |
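The two agreement measures can be computed as follows. The label sequences here are toy data for illustration, not the corpus counts from the table above:

```python
# Simple percent agreement and Cohen's kappa between two annotators'
# label sequences (toy data, not the corpus statistics).

from collections import Counter

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    # observed agreement: simple percent agreement
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # expected agreement under chance, from each annotator's label frequencies
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["pos", "pos", "neg", "neg", "pos", "neg"]
b = ["pos", "neg", "neg", "neg", "pos", "neg"]
print(round(cohens_kappa(a, b), 3))  # -> 0.667
```

Kappa discounts the agreement expected by chance, which is why the kappa scores in the table sit below the corresponding raw agreement ratios.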
Statistics and IAA scores of annotated corpus units. An agreement was counted when annotators agreed on both entities and relation labels. The IAA score in this case was calculated using Cohen’s kappa statistic
| Phases | # of corpus units | # of agreements for both entities and relations | IAA score, Kappa |
|---|---|---|---|
| Phase 1 | 642 | 566 | 76.5 % |
| Phase 2 | 401 | 368 | 82.0 % |
| Overall | 1,043 | 934 | 78.9 % |
Evaluation of a rule-based text mining model to extract plant–chemical relationships using the corpus data
| Phase | Positives | Negatives | P (%) | R (%) | F (%) |
|---|---|---|---|---|---|
| Phase 1 | 273 | 297 | 66.5 | 56.0 | 60.8 |
| Phase 2 | 239 | 130 | 81.0 | 71.6 | 76.0 |
| Overall | 512 | 427 | 73.5 | 63.3 | 68.0 |
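The F-scores in the table are the harmonic mean of precision and recall, F = 2PR / (P + R), which can be verified directly from the P and R columns:

```python
# Consistency check: recompute the F column from the P and R columns.

def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    return 2 * precision * recall / (precision + recall)

for phase, p, r in [("Phase 1", 66.5, 56.0),
                    ("Phase 2", 81.0, 71.6),
                    ("Overall", 73.5, 63.3)]:
    print(phase, round(f_score(p, r), 1))
# -> Phase 1 60.8, Phase 2 76.0, Overall 68.0 (matching the table)
```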