Abdul Wahab Muzaffar, Farooque Azam, Usman Qamar.
Abstract
Extracting information from unstructured text is a complex task. Manual extraction often produces the best results, but it is hard to sustain for biomedical data because the volume of data is growing exponentially. Automatic tools and techniques are therefore needed for information extraction in biomedical text mining. Relation extraction is a significant area of biomedical information extraction that has gained importance over the last two decades. Much of the work on biomedical relation extraction has focused on rule-based and machine learning techniques; in the last decade, the focus has shifted to hybrid approaches, which show better results. This research presents a hybrid feature set for classifying relations between biomedical entities. Its main contribution lies in the semantic feature set, where verb phrases are ranked using the Unified Medical Language System (UMLS) and a ranking algorithm. Two effective machine learning techniques, Support Vector Machine and Naïve Bayes, are used to classify these relations. Our approach has been validated on the standard biomedical text corpus obtained from MEDLINE 2001. Our framework outperforms the state-of-the-art approaches used for relation extraction on the same corpus.
Year: 2015 PMID: 26347797 PMCID: PMC4546954 DOI: 10.1155/2015/910423
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.238
Analysis of the existing literature on biomedical relation extraction.
| Authors | Technique | Domain area | Type of relations | Year of publication | Reported results |
|---|---|---|---|---|---|
| Huang et al. [ | Hybrid approach (shallow parsing and pattern matching) | Biomedical | Protein-protein (P2P) interaction | 2006 | 80% |
| Frunza and Inkpen [ | Hybrid approach | Biomedical | Disease and treatment relation (cure, prevent, and side effect relations) | 2010 | Accuracy Cure 95% |
| Sharma et al. [ | Verb-centric algorithm | Biomedical | Not mentioned | 2010 | 90% |
| Ben Abacha and Zweigenbaum [ | Hybrid approach (pattern based and machine learning) | Biomedical | Disease and treatment relation | 2011 | 94.07% |
| Ben Abacha and Zweigenbaum [ | Linguistic patterns and domain knowledge | Biomedical | Relation between 16 entities | 2011 | Precision of 75.72% and recall of 60.64% |
| Yang et al. [ | Verb-centric approach | Biomedical | Relation between (foods, chemicals, diseases, proteins, and genes) | 2011 | 90.5% |
| Kadir and Bokharaeian [ | Hybrid approach (rule-based, kernel based, and cooccurrence based methods) | Biomedical | Not mentioned | 2013 | Not reported |
| Rosario and Hearst [ | Graphical models and neural network | Biomedical | Disease and treatment relation (cure, prevent, side effect relations) | 2004 | Accuracy Cure 92.6% |
Original dataset description from [21].
| Sr. number | Relationship | Number of sentences |
|---|---|---|
| 1 | Cure/treat for dis. | 810 |
| 2 | Prevent relation | 63 |
| 3 | Side effect | 29 |
| 4 | DisOnly | 616 |
| 5 | TreatOnly | 166 |
| 6 | Vague | 36 |
| 7 | No cure/treat number for dis. | 4 |
| 8 | Nonrelevant | 1771 |
| Total | | 3495 |
Figure 1: Hybrid feature set based relation extraction framework.
GENIA tagger output on the example sentence.
| Word | Base form | POS | Chunk | Named entity | Open NLP |
|---|---|---|---|---|---|
| Only | Only | RB | B-NP | O | B-NP |
| Two | Two | CD | I-NP | O | I-NP |
| Protein | Protein | NN | I-NP | B-protein | I-NP |
| Subunits | Subunit | NNS | I-NP | I-protein | I-NP |
| Pop1p | Pop1p | NN | B-NP | O | B-NP |
| And | And | CC | I-NP | O | I-NP |
| Pop4p | Pop4p | NN | I-NP | O | I-NP |
| Specifically | Specifically | RB | B-ADVP | O | B-ADVP |
| Bind | Bind | VBP | B-VP | O | B-VP |
| The | The | DT | B-NP | O | B-NP |
| RNA | RNA | NN | I-NP | B-protein | I-NP |
| Subunit | Subunit | NN | I-NP | I-protein | I-NP |
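The chunk column above follows the usual B-/I-/O scheme, so noun and verb phrases can be recovered by grouping each B- tag with the I- tags that follow it. The following is a minimal Python sketch, assuming the tagger output has already been parsed into (word, chunk tag) pairs; `group_chunks` is an illustrative name and not part of the GENIA tagger.

```python
from typing import List, Optional, Tuple


def group_chunks(tokens: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group (word, chunk_tag) pairs into (phrase_type, phrase_text) spans."""
    phrases: List[Tuple[str, str]] = []
    current_type: Optional[str] = None
    current_words: List[str] = []

    def flush() -> None:
        # Close the phrase collected so far, if any.
        if current_words and current_type is not None:
            phrases.append((current_type, " ".join(current_words)))

    for word, tag in tokens:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            flush()
            current_type, current_words = tag[2:], [word]
        elif tag.startswith("I-"):
            current_words.append(word)
        else:  # "O": token outside any chunk
            flush()
            current_type, current_words = None, []
    flush()
    return phrases


# Word and chunk columns transcribed from the table above.
tokens = [
    ("Only", "B-NP"), ("Two", "I-NP"), ("Protein", "I-NP"), ("Subunits", "I-NP"),
    ("Pop1p", "B-NP"), ("And", "I-NP"), ("Pop4p", "I-NP"),
    ("Specifically", "B-ADVP"), ("Bind", "B-VP"),
    ("The", "B-NP"), ("RNA", "I-NP"), ("Subunit", "I-NP"),
]
phrases = group_chunks(tokens)
noun_phrases = [text for kind, text in phrases if kind == "NP"]
verb_phrases = [text for kind, text in phrases if kind == "VP"]
```

On the example sentence this yields three noun phrases and one verb phrase, which are the units the NLP feature set is described as using.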
Output of MetaMap API to rank the noun phrases and verb phrases.
| S/number | VP | UMLS concept | NP | UMLS concepts |
|---|---|---|---|---|
| 1 | Experience cover place too | Meta mapping (775): | The abdomen diagnostic peritoneal lavage 4 g/p and amp | Meta mapping (599): |
| 2 | be to compare | Meta mapping (1000): | Efficacy and safety | Meta mapping (1000): |
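Each MetaMap cell above carries a mapping score in parentheses (for example 775 or 1000). One way such scores could be used to order phrases is sketched below in Python. The string format handled by `mapping_score` is taken from the table; the function names and the plain sort-by-score rule are assumptions for illustration, not the paper's ranking algorithm and not the MetaMap API.

```python
import re
from typing import List, Optional, Tuple

# Matches the score inside cells such as "Meta mapping (775):".
SCORE_RE = re.compile(r"Meta mapping \((\d+)\)")


def mapping_score(cell: str) -> Optional[int]:
    """Return the integer score from a MetaMap output cell, or None if absent."""
    match = SCORE_RE.search(cell)
    return int(match.group(1)) if match else None


def rank_phrases(rows: List[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Sort (phrase, metamap_cell) rows by score, highest first (assumed rule)."""
    scored = [(phrase, mapping_score(cell)) for phrase, cell in rows]
    kept = [(phrase, score) for phrase, score in scored if score is not None]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Verb phrase column transcribed from the table above.
verb_phrase_rows = [
    ("Experience cover place too", "Meta mapping (775):"),
    ("be to compare", "Meta mapping (1000):"),
]
ranked = rank_phrases(verb_phrase_rows)
```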
Detail of experimental settings.
| Setups | Class label: +1 | Class label: −1 |
|---|---|---|
| Setting # 1 | Cure | DisOnly + TreatOnly |
| Setting # 1 | Prevent | DisOnly + TreatOnly |
| Setting # 1 | Side effect | DisOnly + TreatOnly |
| Setting # 2 | Cure | Vague |
| Setting # 2 | Prevent | Vague |
| Setting # 2 | Side effect | Vague |
| Setting # 3 | Cure | Prevent + side effect |
| Setting # 3 | Prevent | Side effect |
| Setting # 3 | Side effect | Prevent |
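Each setup above is a binary problem: one relation forms the positive class (+1) and one or more other labels form the negative class (−1). A small Python sketch of how such subsets could be drawn from labelled sentences follows. The `SETTINGS` dictionary transcribes the table, with label spellings following the dataset description; `build_binary_dataset` and the toy sentences are hypothetical.

```python
from typing import Dict, List, Tuple

# setting number -> positive relation -> labels used as the negative class
SETTINGS: Dict[int, Dict[str, List[str]]] = {
    1: {
        "Cure": ["DisOnly", "TreatOnly"],
        "Prevent": ["DisOnly", "TreatOnly"],
        "Side effect": ["DisOnly", "TreatOnly"],
    },
    2: {
        "Cure": ["Vague"],
        "Prevent": ["Vague"],
        "Side effect": ["Vague"],
    },
    3: {
        "Cure": ["Prevent", "Side effect"],
        "Prevent": ["Side effect"],
        "Side effect": ["Prevent"],
    },
}


def build_binary_dataset(
    samples: List[Tuple[str, str]], setting: int, positive: str
) -> List[Tuple[str, int]]:
    """Keep only sentences relevant to one setup and map labels to +1 / -1."""
    negatives = set(SETTINGS[setting][positive])
    dataset = []
    for sentence, label in samples:
        if label == positive:
            dataset.append((sentence, 1))
        elif label in negatives:
            dataset.append((sentence, -1))
        # Sentences with any other label are left out of this setup.
    return dataset


# Toy placeholders standing in for corpus sentences (not real data).
samples = [
    ("s1", "Cure"), ("s2", "DisOnly"), ("s3", "Vague"),
    ("s4", "Prevent"), ("s5", "TreatOnly"),
]
cure_setting1 = build_binary_dataset(samples, 1, "Cure")
cure_setting3 = build_binary_dataset(samples, 3, "Cure")
```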
Results of all feature sets using classification algorithms.
| Relation | Feature set | Algo | Setting 1 FS | Setting 1 P | Setting 1 R | Setting 2 FS | Setting 2 P | Setting 2 R | Setting 3 FS | Setting 3 P | Setting 3 R |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Cure | BOW | SVM-RBF | 84.76 | 85.07 | 84.46 | 97.87 | 96.38 | 99.4 | 97.58 | 95.82 | 99.4 |
| | | SVM | | 86.64 | 83.61 | 97.99 | 96.28 | 99.76 | 97.45 | 96.13 | 98.8 |
| | | NB | 82.88 | 84.48 | 81.33 | 97.26 | 96.34 | 98.19 | 96.45 | 94.77 | 98.19 |
| | BOW + NLP | SVM-RBF | 84.61 | 86.14 | 83.13 | 97.92 | 96.38 | 99.52 | 97.29 | 95.37 | 99.28 |
| | | SVM | 84.53 | 85.84 | 83.25 | 97.74 | 96.59 | 98.92 | 97.03 | 95.67 | 98.43 |
| | | NB | 83.64 | 83.8 | 83.49 | 97.19 | 96.44 | 97.95 | 96.51 | 94.88 | 98.19 |
| | BOW + NLP + UMLS (NP) | SVM-RBF | 84.9 | 85.47 | 84.34 | 97.99 | 96.39 | 99.64 | | 95.82 | 99.4 |
| | | SVM | 84.44 | 86.58 | 82.41 | 97.98 | 96.6 | 99.4 | 97.08 | 95.89 | 98.31 |
| | | NB | 82.71 | 84.81 | 80.72 | 97.06 | 96.42 | 97.71 | 96.45 | 94.66 | 98.31 |
| | BOW + NLP + UMLS (NP + VP) | SVM-RBF | 84.83 | 85.82 | 83.86 | | 96.39 | 99.76 | 97.41 | 95.49 | 99.4 |
| | | SVM | 84.83 | 86.87 | 82.89 | 97.92 | 96.6 | 99.28 | 97.08 | 95.89 | 98.31 |
| | | NB | 82.71 | 84.53 | 80.96 | 96.94 | 96.42 | 97.47 | 96.58 | 94.68 | 98.55 |
| Prevent relation | BOW | SVM-RBF | 78.95 | 88.24 | 71.43 | 90.91 | 94.83 | 87.3 | 89.06 | 87.69 | 90.48 |
| | | SVM | 81.02 | 93.75 | 71.34 | 91.81 | 94.92 | 88.89 | 86.82 | 84.85 | 88.89 |
| | | NB | 60.2 | 77.5 | 49.21 | 91.94 | 93.44 | 90.48 | | 93.33 | 88.89 |
| | BOW + NLP | SVM-RBF | 71.15 | 90.24 | 58.73 | 90.75 | 96.43 | 85.71 | 86.61 | 85.94 | 87.3 |
| | | SVM | 80 | 93.62 | 69.84 | 90.91 | 94.83 | 87.3 | 86.15 | 83.58 | 88.89 |
| | | NB | 63.07 | 72.92 | 55.56 | 93.55 | 95.08 | 92.06 | 88.71 | 90.16 | 87.3 |
| | BOW + NLP + UMLS (NP) | SVM-RBF | 72.38 | 90.48 | 60.32 | 90.75 | 96.43 | 85.71 | 88.19 | 87.5 | 88.89 |
| | | SVM | | 92 | 73.02 | 90.91 | 94.83 | 87.3 | 86.15 | 83.58 | 88.89 |
| | | NB | 62.13 | 80 | 50.79 | 93.55 | 95.08 | 92.06 | 88.72 | 90.19 | 87.3 |
| | BOW + NLP + UMLS (NP + VP) | SVM-RBF | 72.38 | 90.48 | 60.32 | 90.75 | 96.43 | 85.71 | 85.94 | 84.62 | 87.3 |
| | | SVM | 80.36 | 91.84 | 71.43 | 90.91 | 94.83 | 87.3 | 85.5 | 82.35 | 88.89 |
| | | NB | 62.26 | 76.74 | 52.38 | | 95.08 | 92.06 | 88.89 | 88.89 | 88.89 |
| Side effect | BOW | SVM-RBF | 21.05 | 50 | 13.33 | 70 | 70 | 70 | 75.86 | 78.57 | 73.33 |
| | | SVM | 21.74 | 31.25 | 16.67 | 63.16 | 66.67 | 60 | 70.18 | 74.07 | 66.67 |
| | | NB | 22.22 | 33.33 | 16.67 | | 71.88 | 76.67 | 83.54 | 78.79 | 88.89 |
| | BOW + NLP | SVM-RBF | 25.65 | 55.56 | 16.67 | 68.97 | 71.43 | 66.67 | 71.18 | 72.41 | 70 |
| | | SVM | 22.73 | 35.71 | 16.67 | 65.52 | 67.86 | 63.33 | 67.86 | 73.08 | 63.33 |
| | | NB | | 43.75 | 23.33 | 67.69 | 62.86 | 73.33 | 77.42 | 75 | 80 |
| | BOW + NLP + UMLS (NP) | SVM-RBF | 16.67 | 50 | 10 | 66.67 | 66.67 | 66.67 | 74.57 | 75.86 | 73.33 |
| | | SVM | 18.18 | 28.57 | 13.33 | 62.07 | 64.29 | 60 | 67.86 | 73.08 | 63.33 |
| | | NB | 13.64 | 21.43 | 10 | 68.75 | 64.71 | 73.33 | 88.71 | 90.16 | 87.3 |
| | BOW + NLP + UMLS (NP + VP) | SVM-RBF | 21.62 | 57.14 | 13.33 | 70.18 | 74.07 | 66.67 | 68.97 | 71.43 | 66.67 |
| | | SVM | 22.22 | 33.33 | 16.67 | 65.52 | 67.86 | 63.33 | 65.45 | 72 | 60 |
| | | NB | 26.67 | 40 | 20 | 67.69 | 62.86 | 73.33 | | 88.89 | 88.89 |
Bag of words (BOW) uses unigrams only.
Natural language processing (NLP) features are noun and verb phrases.
UMLS (NP) uses UMLS with noun phrase ranking only.
UMLS (NP + VP) uses UMLS with noun phrase and verb phrase ranking.
SVM is the Support Vector Machine algorithm; SVM-RBF is the Support Vector Machine with Radial Basis Function kernel; NB is the Naïve Bayes algorithm; FS is F-score; P is precision; R is recall.
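The reported FS values appear consistent with the usual balanced F-score, the harmonic mean of precision and recall, F = 2PR / (P + R). The Python helper below is a minimal sketch for checking individual rows. `f_score` is an illustrative name, and the sample inputs are the precision and recall reported for Cure with the BOW feature set and SVM-RBF in Setting 1.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both are zero
    return 2 * precision * recall / (precision + recall)


# Cure, BOW, SVM-RBF, Setting 1: P = 85.07, R = 84.46
fs = round(f_score(85.07, 84.46), 2)
print(fs)  # 84.76, matching the FS value reported for that row
```

Because the inputs are percentages, the result is also a percentage; the same function works unchanged on values in the range 0 to 1.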
Classification accuracy comparison with state-of-the-art approaches.
| Relations | Rosario and Hearst [ | Frunza and Inkpen [ | Our approach |
|---|---|---|---|
| Cure | 92.6% | 95% | 96.19% |
| Prevent relation | 38.5% | 75% | 97.45% |
| Side effect | 20% | 46% | 96.49% |
F-measure comparison with state-of-the-art approaches.
| Relations | Frunza and Inkpen [ | Ben Abacha and Zweigenbaum [ | Our approach |
|---|---|---|---|
| Cure | 87.10 | 96.84 | 98.05 |
| Prevent relation | 77.78 | 67.92 | 93.55 |
| Side effect | 55.56 | 64.15 | 88.89 |