Kalpana Raja, Naman Dasot, Pawan Goyal, Siddhartha R Jonnalagadda.
Abstract
Precision medicine is an emerging approach to the prevention and treatment of disease that considers each person's individual variability in genes, environment, and lifestyle. Disseminating individualized evidence by automatically identifying population information in the literature is key to evidence-based precision medicine at the point of care. We propose a hybrid approach that uses natural language processing techniques to automatically extract population information from biomedical literature. Our approach first applies a binary classifier to separate sentences that contain population information from those that do not. A rule-based system built on syntactic-tree regular expressions is then applied to the sentences containing population information to extract the population named entities. The proposed two-stage approach achieved an F-score of 0.81 using a MaxEnt classifier with the rule-based system and an F-score of 0.87 using a Naïve-Bayes classifier with the rule-based system, and it performed relatively well compared with many existing systems. The system and evaluation dataset are being released as open source.
Year: 2016 PMID: 27570671 PMCID: PMC5001749
Source DB: PubMed Journal: AMIA Jt Summits Transl Sci Proc
Figure 1. Extraction of population named entity
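The two-stage control flow described in the abstract (classify sentences first, then extract only from the positive ones) can be sketched as follows. This is a minimal illustration, not the authors' system: the cue-word check stands in for the trained MaxEnt or Naïve-Bayes classifier, the regular expression stands in for the Tregex rules, and the function names, cue words, and example sentences are all hypothetical.

```python
import re
from typing import Callable, List, Optional

# Hypothetical cue words; a stand-in for a trained sentence classifier.
CUE_WORDS = r"(?:patients|adults|children|women|men)"
CUE_RE = re.compile(rf"\b{CUE_WORDS}\b", re.IGNORECASE)

# Hypothetical flat-text rule; a stand-in for syntactic-tree patterns.
PHRASE_RE = re.compile(
    rf"\b(?:in|among|for)\s+((?:[\w-]+\s+){{0,3}}?{CUE_WORDS}\b"
    r"(?:\s+with\s+[\w\s-]+)?)",
    re.IGNORECASE,
)


def has_population(sentence: str) -> bool:
    """Stage 1: decide whether a sentence mentions a population."""
    return CUE_RE.search(sentence) is not None


def extract_phrase(sentence: str) -> Optional[str]:
    """Stage 2: pull out the population phrase, if the rule matches."""
    match = PHRASE_RE.search(sentence)
    return match.group(1).strip() if match else None


def extract_populations(
    sentences: List[str],
    classify: Callable[[str], bool] = has_population,
    extract: Callable[[str], Optional[str]] = extract_phrase,
) -> List[str]:
    """Run the extractor only on sentences the classifier accepts."""
    results = []
    for sentence in sentences:
        if not classify(sentence):
            continue  # filtered out before any rule is applied
        phrase = extract(sentence)
        if phrase:
            results.append(phrase)
    return results


# Hypothetical input sentences, written for this sketch only.
sentences = [
    "Mortality was measured in elderly patients with heart failure.",
    "Echocardiography remains a common imaging test.",
]
print(extract_populations(sentences))
```

Passing the classifier and extractor as parameters keeps the two stages independent, so either stand-in could be swapped for a real model or parser-based rule set without changing the pipeline function.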
List of Tregex symbols for pattern generation
| Tregex Symbol | Description |
|---|---|
| Node A << Node B | Node A dominates Node B |
| Node A >> Node B | Node A is dominated by Node B (Node B << Node A) |
| Node A < Node B | Node A immediately dominates Node B |
| Node A > Node B | Node A is immediately dominated by Node B (Node B < Node A) |
| Node A $ Node B | A and B are sisters, i.e. at the same level in the parse tree (but are not equal) |
| @Node | Selects the entire phrase (noun or verb) mentioned, e.g. @NP |
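The dominance and sisterhood relations in the table above can be made concrete on a small parse tree. The sketch below is plain Python rather than Tregex itself: the `Node` class, the helper names, and the toy tree are hypothetical, and the comments map each helper to the Tregex operator it mimics (`<`, `<<`, `$`).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A parse-tree node that remembers its parent (hypothetical helper)."""
    label: str
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)

    def __post_init__(self) -> None:
        for child in self.children:
            child.parent = self


def immediately_dominates(a: Node, b: Node) -> bool:
    """A < B: B is a direct child of A."""
    return b.parent is a


def dominates(a: Node, b: Node) -> bool:
    """A << B: A is an ancestor of B at any depth."""
    node = b.parent
    while node is not None:
        if node is a:
            return True
        node = node.parent
    return False


def sisters(a: Node, b: Node) -> bool:
    """A $ B: A and B share a parent and are not the same node."""
    return a is not b and a.parent is not None and a.parent is b.parent


# Toy tree: (S (NP) (VP (VBN) (PP (IN) (NP))))
np_obj = Node("NP")
prep = Node("IN")
pp = Node("PP", [prep, np_obj])
verb = Node("VBN")
vp = Node("VP", [verb, pp])
subj = Node("NP")
s = Node("S", [subj, vp])

print(immediately_dominates(pp, np_obj))  # PP < NP
print(dominates(s, np_obj))               # S << NP
print(sisters(prep, np_obj))              # IN $ NP
```

The reversed operators `>` and `>>` follow by swapping the arguments, for example `immediately_dominates(pp, np_obj)` also answers whether the object NP satisfies `NP > PP`.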
Tregex patterns for population named entity extraction (to be used in conjunction with "population-related concepts")
| Pattern | Output Phrase | Example Sentence with Output Underlined |
|---|---|---|
| NP > PP | Noun phrase succeeding prepositional phrase | Aldosterone blockade has been shown to be effective in reducing total mortality as well as |
| PP $ NP | Prepositional phrase and noun phrase are sisters | Rosuvastatin did not reduce mortality |
| NP $ NP | Two noun phrases as sisters | Implantation of CRT-D rather than an implantable cardioverter defibrillator in |
| NP $ NNS | Noun phrase and noun are sisters | Diuretics are indicated for |
| @NP | Noun phrase | So far, nebivolol is the only beta-blocker to have been shown effective in |
| PP, NP | Prepositional phrase immediately follows noun phrase | It is suggested that beta-receptor blockade should be added to conventional treatment with digitalis and diuretics in |
| PP $ PP | Two prepositional phrases as sisters | This article reviews the physiological changes that occur in the elderly and the treatment approach that can be taken |
| NP, PP | Noun phrase immediately follows prepositional phrase | It is suggested that potassium depletion is not a major problem |
| NP $ PP | Noun phrase and prepositional phrase are sisters | Piretanide, a diuretic that acts on the loop of Henle, was used to treat |
| @VP | Verb phrase | Isosorbide dinitrate and hydralazine hydrochloride should be |
Evaluation Dataset
| Dataset | Citations with population | Citations without population |
|---|---|---|
| Diagnosis for CHF | 80 | 120 |
| Treatment for CHF | 98 | 102 |
| Diagnosis for AFib | 140 | 60 |
| Treatment for AFib | 56 | 58 |
System performance
| System | Precision | Recall | F-score |
|---|---|---|---|
| Rule-based system | 0.67 | 0.62 | 0.64 |
| MaxEnt classifier + Rule-based system | 0.87 | 0.76 | 0.81 |
| Naïve-Bayes classifier + Rule-based system | 0.90 | 0.83 | 0.87 |
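The F-scores in the table above appear consistent with the balanced F-measure, the harmonic mean of precision and recall. That reading is an assumption, since the formula is not stated here. A minimal sketch that recomputes the scores from the rounded precision and recall values:

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported (rounded) in the table above.
reported = {
    "Rule-based system": (0.67, 0.62),
    "MaxEnt classifier + Rule-based system": (0.87, 0.76),
    "Naive-Bayes classifier + Rule-based system": (0.90, 0.83),
}

for name, (precision, recall) in reported.items():
    print(f"{name}: F = {f_score(precision, recall):.3f}")
```

Recomputed this way, the three scores come out at roughly 0.64, 0.81, and 0.86. The last sits just below the reported 0.87, which would be expected if the published precision and recall were themselves rounded.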
Performance of binary classifiers
| System | Precision | Recall | F-score |
|---|---|---|---|
| MaxEnt classifier | 0.87 | 0.82 | 0.84 |
| Naïve-Bayes classifier | 0.89 | 0.91 | 0.90 |
Approach and dataset used by various systems
| System | Dataset | Sentence classification: model | Sentence classification: F-score | Population extraction: model | Population extraction: F-score | Remarks |
|---|---|---|---|---|---|---|
| Xu | Abstracts only from PubMed | HMM + NLP techniques | 92% | Classification + parse tree (Stanford) | 0.51 | - |
| Zhu | - | - | - | Partially matched using MetaMap | 0.84 | - |
| | | | | Partially matched using NLP-based method | 0.83 | |
| | | | | Exact matched using MetaMap | 0.42 | |
| | | | | Exact matched using NLP-based method | 0.64 | |
| Demner-Fushman and Lin | MEDLINE abstracts | - | - | Baseline | 0.53 | Returns a set of results |
| | | | | Extractor | 0.80 | |
| Zhao | Reduced dataset | Mallet CRF | - | Independent | 0.78 | 4 different methods used |
| | | | | Sentence-First | 0.78 | |
| | | | | Word-First | 0.78 | |
| | | | | Joint | 0.75 | |
| | Full dataset | | | Independent | 0.64 | |
| | | | | Sentence-First | 0.64 | |
| | | | | Word-First | 0.63 | |
| | | | | Joint | 0.60 | |
| Kelly | Abstracts from PubMed | - | - | Partial match | 0.877 | Dependency parse |
| | | | | Exact match | 0.601 | Regular expressions |
| RIDeM Tool | Tested on our dataset | - | - | Original | 0.63 | Upper bound on precision |
| | | | | With add-ons | 0.766 | |
| Our current system | Tested on our dataset | - | - | Rule-based | 0.64 | - |
| | | MaxEnt | 0.844 | MaxEnt + Rule-based | 0.81 | |
| | | Naïve-Bayes | 0.9 | Naïve-Bayes + Rule-based | 0.87 | |