Daniel M Lowe, Noel M O'Boyle, Roger A Sayle.
Abstract
Awareness of the adverse effects of chemicals is important in biomedical research and healthcare. Text mining can allow timely and low-cost extraction of this knowledge from the biomedical literature. We extended our text mining solution, LeadMine, to identify diseases and chemical-induced disease relationships (CIDs). LeadMine is a dictionary/grammar-based entity recognizer and was used to recognize and normalize both chemicals and diseases to Medical Subject Headings (MeSH) IDs. The disease lexicon was obtained from three sources: MeSH, the Disease Ontology and Wikipedia. The Wikipedia dictionary was derived from pages with a disease/symptom box, or those where the page title appeared in the lexicon. Composite entities (e.g. heart and lung disease) were detected and mapped to their composite MeSH IDs. For CIDs, we developed a simple pattern-based system to find relationships within the same sentence. Our system was evaluated in the BioCreative V Chemical-Disease Relation task and achieved very good results for both disease concept ID recognition (F1-score: 86.12%) and CIDs (F1-score: 52.20%) on the test set. As our system was over an order of magnitude faster than other solutions evaluated on the task, we were able to apply the same system to the entirety of MEDLINE allowing us to extract a collection of over 250 000 distinct CIDs.
Year: 2016 PMID: 27060160 PMCID: PMC4825350 DOI: 10.1093/database/baw039
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Example of a Wikipedia disease page, demonstrating the term relationships that were extracted in bulk from a dump of Wikipedia.
Effect of the choice of lexicon on performance of the system on the development set.
| Lexicon | Precision | Recall | F1-score |
| --- | --- | --- | --- |
| Wikipedia | 79.3% | 61.3% | 69.1% |
| MeSH/Disease Ontology | 91.6% | 67.1% | 77.4% |
| MeSH/Disease Ontology/Wikipedia | 85.1% | 73.1% | 78.6% |
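The three numeric columns read naturally as precision, recall and F1-score. That reading is an assumption here, since the scraped table carries no header, but it can be checked with a short sketch: the F1-score is the harmonic mean of precision and recall, and recomputing it from the first two columns should land within rounding of the third.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) per lexicon, copied from the table above.
lexicon_rows = {
    "Wikipedia": (79.3, 61.3, 69.1),
    "MeSH/Disease Ontology": (91.6, 67.1, 77.4),
    "MeSH/Disease Ontology/Wikipedia": (85.1, 73.1, 78.6),
}

for name, (p, r, reported) in lexicon_rows.items():
    recomputed = f1_score(p, r)
    # Inputs are already rounded to one decimal, so allow a small tolerance.
    print(f"{name}: recomputed {recomputed:.2f}, reported {reported}")
```

Because the published precision and recall are themselves rounded, the recomputed value can differ from the reported one in the last digit.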
Figure 2. Workflow for chemical-disease relationship extraction. Dashed boxes are optional steps.
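The abstract describes the relationship step as a simple pattern-based search within a single sentence. A minimal sketch of that idea follows. It is not LeadMine's implementation: the `Entity` type, the `extract_cids` function and both trigger-phrase lists are hypothetical, chosen only to illustrate matching a trigger between a recognized chemical and a recognized disease.

```python
import re
from typing import NamedTuple


class Entity(NamedTuple):
    kind: str     # "chemical" or "disease"
    mesh_id: str  # normalized MeSH identifier
    start: int    # character offsets within the sentence
    end: int


# Hypothetical trigger phrases; the real pattern grammar is not given in this record.
CHEMICAL_THEN_DISEASE = re.compile(r"\b(induced|caused|associated)\b", re.I)
DISEASE_THEN_CHEMICAL = re.compile(
    r"\b(caused by|induced by|during|after|following)\b", re.I
)


def extract_cids(sentence: str, entities: list[Entity]) -> set[tuple[str, str]]:
    """Return (chemical ID, disease ID) pairs linked by a trigger phrase."""
    chemicals = [e for e in entities if e.kind == "chemical"]
    diseases = [e for e in entities if e.kind == "disease"]
    found = set()
    for chem in chemicals:
        for dis in diseases:
            if chem.end <= dis.start:    # chemical precedes disease
                between, pattern = sentence[chem.end:dis.start], CHEMICAL_THEN_DISEASE
            elif dis.end <= chem.start:  # chemical follows disease
                between, pattern = sentence[dis.end:chem.start], DISEASE_THEN_CHEMICAL
            else:                        # overlapping spans: skip
                continue
            if pattern.search(between):
                found.add((chem.mesh_id, dis.mesh_id))
    return found


sentence = "Hypohidrosis and hyperthermia during topiramate treatment in children."


def span(kind: str, mesh_id: str, text: str) -> Entity:
    """Build an Entity by locating `text` in the example sentence."""
    start = sentence.find(text)
    return Entity(kind, mesh_id, start, start + len(text))


entities = [
    span("disease", "D007007", "Hypohidrosis"),
    span("disease", "D005334", "hyperthermia"),
    span("chemical", "C052342", "topiramate"),
]
pairs = extract_cids(sentence, entities)
```

The example sentence and MeSH IDs are the topiramate record listed at the end of this page; the sketch pairs topiramate with both diseases because the trigger "during" sits between each disease mention and the chemical.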
Precision of patterns where the chemical term precedes the disease term.
| Pattern | True positives | False positives | Precision |
| --- | --- | --- | --- |
| Chemical <caused> | 528 | 219 | 70.7% |
| Chemical Disease | 41 | 25 | 62.1% |
| Chemical <related to> | 8 | 2 | 80.0% |
| <negative effects caused by> chemical | 4 | 2 | 66.7% |
| <relationship between> chemical <and> | 2 | 1 | 66.7% |
Precision of patterns where the chemical term follows the disease term.
| Pattern | True positives | False positives | Precision |
| --- | --- | --- | --- |
| Disease <caused by> | 208 | 79 | 72.47% |
| Disease <after or during> | 108 | 76 | 58.70% |
| Disease <after or while taking> | 73 | 36 | 67.00% |
| Disease <in person taking> | 18 | 4 | 81.80% |
| Disease <effect of> | 14 | 14 | 50.00% |
| Disease <related to> | 14 | 6 | 70.00% |
| Disease <complications of> | 12 | 5 | 70.60% |
| <induction of> Disease <by or with> | 2 | 1 | 66.70% |
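The two count columns in the pattern tables are consistent with true positives and false positives. That is an inference from the arithmetic rather than something this record states, and the sketch below shows the check: precision is TP / (TP + FP), and recomputing it from the counts matches the reported percentage to within rounding.

```python
def precision_percent(true_positives: int, false_positives: int) -> float:
    """Precision as a percentage: TP / (TP + FP)."""
    total = true_positives + false_positives
    if total == 0:
        raise ValueError("precision is undefined when a pattern never fires")
    return 100.0 * true_positives / total


# (pattern, first count, second count, reported precision) from the table above.
pattern_rows = [
    ("Disease <caused by>", 208, 79, 72.47),
    ("Disease <after or during>", 108, 76, 58.70),
    ("Disease <after or while taking>", 73, 36, 67.00),
    ("Disease <in person taking>", 18, 4, 81.80),
    ("Disease <effect of>", 14, 14, 50.00),
]

for pattern, tp, fp, reported in pattern_rows:
    recomputed = precision_percent(tp, fp)
    print(f"{pattern}: recomputed {recomputed:.2f}%, reported {reported:.2f}%")
```

The same check holds for the table of patterns where the chemical precedes the disease, for example 528 / (528 + 219) for the first row.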
Performance of the system on the test set for the DNER and CID tasks.
| Task | Precision | Recall | F1-score | Time |
| --- | --- | --- | --- | --- |
| DNER | 86.08% | 86.17% | 86.12% | 45.0 ms |
| CID (pattern-based) | 57.65% | 36.77% | 44.90% | 96.9 ms |
| CID (pattern-based with filters) | 60.99% | 35.93% | 45.22% | 121.8 ms |
| CID (pattern-based with filters and recall increasing heuristic) | 52.62% | 51.78% | 52.20% | 119.3 ms |
Performance of the system on the test set for CID identification when using gold-standard entities.
| Task | Precision | Recall | F1-score |
| --- | --- | --- | --- |
| CID (pattern-based, gold-standard entities) | 62.75% (+5.10%) | 44.56% (+7.79%) | 52.11% (+7.21%) |
| CID (pattern-based with filters, gold-standard entities) | 66.52% (+5.53%) | 43.62% (+7.69%) | 52.69% (+7.47%) |
| CID (pattern-based with filters and recall increasing heuristic, gold-standard entities) | 59.29% (+6.67%) | 62.29% (+10.51%) | 60.75% (+8.55%) |
The change in performance from using gold-standard entities is shown in parentheses.
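| PMID | Chemical MeSH ID | Disease MeSH ID | Chemical | Disease | Sentence |
| --- | --- | --- | --- | --- | --- |
| 23427516 | C052342 | D007007 | topiramate | Hypohidrosis | Hypohidrosis and hyperthermia during topiramate treatment in children. |
| 23427516 | C052342 | D005334 | topiramate | hyperthermia | Hypohidrosis and hyperthermia during topiramate treatment in children. |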