Syed Toufeeq Ahmed, Hasan Davulcu, Sukru Tikves, Radhika Nair, Zhongming Zhao.
Abstract
Background. Advances in computational and biological methods over the last two decades have greatly increased the scale of biomedical research, bringing unprecedented growth in both the production of biomedical data and the amount of published literature discussing it. An automated extraction system coupled with a cognitive search and navigation service over these document collections would not only save time and effort, but also pave the way to discovering hitherto unknown information implicitly conveyed in the texts.

Results. We developed a novel framework (named "BioEve") that integrates Faceted Search (Information Retrieval) with an Information Extraction module to provide an interactive search experience for researchers in the life sciences. It enables guided, step-by-step refinement of the search query by suggesting concepts and entities (such as genes, drugs, and diseases) that quickly filter the results and change the search direction, so that the user can discover related concepts and keywords while seeking information.

Conclusions. The BioEve Search framework makes it easier to provide scalable interactive search over large collections of textual articles and to discover knowledge hidden in thousands of biomedical literature articles.
Year: 2012 PMID: 22693501 PMCID: PMC3368157 DOI: 10.1155/2012/509126
Source DB: PubMed Journal: Adv Bioinformatics ISSN: 1687-8027
Figure 1: BioEve search framework architecture.
Scheme 1
Figure 2: A sample screenshot of the main search screen. The left panel shows the top relevant entities as clickable items; selecting one refines the query and the results dynamically. The user can deselect any previously selected entity to refine the query further, and the results update dynamically to reflect the currently selected entities.
Figure 3: A sample result set for the query “cholesterol.”
Figure 4: “Hepatic-lipase” selected.
Figure 5: “Hyperthyroidism” highlighted.
Figure 6: Final refined search results.
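The refinement sequence in Figures 3 to 6 (query, select an entity, select another, view the narrowed results) can be illustrated with a minimal in-memory sketch. This is not BioEve's implementation: the toy documents, the substring matching, and the helper names `search` and `facets` are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy index: each document has text plus a set of extracted entities.
DOCS = [
    {"id": 1, "text": "cholesterol levels and hepatic-lipase activity",
     "entities": {"cholesterol", "hepatic-lipase"}},
    {"id": 2, "text": "hepatic-lipase in hyperthyroidism alters cholesterol",
     "entities": {"cholesterol", "hepatic-lipase", "hyperthyroidism"}},
    {"id": 3, "text": "cholesterol and statin therapy",
     "entities": {"cholesterol", "statin"}},
    {"id": 4, "text": "insulin signalling in diabetes",
     "entities": {"insulin", "diabetes"}},
]

def search(query, selected=()):
    """Return documents whose text contains the query and that carry every selected entity."""
    selected = set(selected)
    return [d for d in DOCS
            if query.lower() in d["text"].lower() and selected <= d["entities"]]

def facets(results, selected=(), top=5):
    """Count entities over the current results, skipping those already selected."""
    selected = set(selected)
    counts = Counter(e for d in results for e in d["entities"] if e not in selected)
    return counts.most_common(top)

# Step-by-step refinement: each selection narrows the result set.
step1 = search("cholesterol")
step2 = search("cholesterol", {"hepatic-lipase"})
step3 = search("cholesterol", {"hepatic-lipase", "hyperthyroidism"})
print([d["id"] for d in step1], [d["id"] for d in step2], [d["id"] for d in step3])
```

Deselecting an entity corresponds to calling `search` again with that entity removed from `selected`, which widens the result set and changes the facet counts accordingly.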
Classification approaches used: Naïve Bayes classifier (NBC), NBC + Expectation Maximization (EM), Maximum Entropy (MaxEnt), Conditional Random Fields (CRFs).
| Granularity | Features | Classifier |
|---|---|---|
| Single label, sentence level | Bag-of-words (BOW) | NBC |
| Single label, sentence level | BOW + gene names boosted | NBC |
| Single label, sentence level | BOW + trigger words boosted | NBC |
| Single label, sentence level | BOW + gene names and trigger words boosted | NBC |
| Multiple labels, sentence level | BOW | NBC + EM |
| Multiple labels, sentence level | BOW | MaxEnt |
| Event trigger phrase labeling | BOW + 3-gram and 4-gram prefixes and suffixes + orthographic features + trigger phrase dictionary | CRFs |
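The feature families listed for event trigger phrase labeling (word identity, 3- and 4-character prefixes and suffixes, orthographic cues, and a trigger phrase dictionary) might be computed per token roughly as follows. This is a sketch under assumptions: the dictionary contents, the specific orthographic tests, and the function names are hypothetical, and the resulting feature dictionaries would still need to be passed to a CRF toolkit for training.

```python
# Hypothetical trigger dictionary; the one used in the source is not listed.
TRIGGER_DICT = {"phosphorylation", "expression", "binding", "transcription"}

def token_features(tokens, i):
    """Build a feature dict for tokens[i]."""
    tok = tokens[i]
    low = tok.lower()
    feats = {
        "word": low,                                     # bag-of-words style identity
        "is_capitalized": tok[:1].isupper(),             # orthographic features
        "is_all_caps": tok.isupper(),
        "has_digit": any(ch.isdigit() for ch in tok),
        "has_hyphen": "-" in tok,
        "in_trigger_dict": low in TRIGGER_DICT,          # dictionary lookup
    }
    # 3-gram and 4-gram character prefixes and suffixes, when the token is long enough.
    for n in (3, 4):
        if len(low) >= n:
            feats[f"prefix{n}"] = low[:n]
            feats[f"suffix{n}"] = low[-n:]
    return feats

def sentence_features(tokens):
    """One feature dict per token, in sentence order."""
    return [token_features(tokens, i) for i in range(len(tokens))]

feats = sentence_features(["Phosphorylation", "of", "STAT3", "increased"])
print(feats[0])
```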
Single label, sentence level results.
| Classifier | Feature set | Precision |
|---|---|---|
| NBC | Bag-of-words | 62.39% |
| NBC | Bag-of-words + gene name boosting | 50.00% |
| NBC | Bag-of-words + trigger word boosting | 49.92% |
| NBC | Bag-of-words + trigger word boosting + gene name boosting | 49.77% |
| NBC | Bag-of-POS tagged words | 43.30% |
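To make the "boosted" bag-of-words feature sets concrete, here is a small multinomial Naïve Bayes sketch in which trigger words and gene names are counted more than once. How the original system applied boosting is not stated here, so the count multiplier, the word lists, the toy training sentences, and the class names are all assumptions.

```python
import math
from collections import Counter, defaultdict

# Hypothetical word lists, for illustration only.
TRIGGERS = {"phosphorylation", "binds", "expression"}
GENES = {"stat3", "tp53"}

def featurize(sentence, boost=2):
    """Bag-of-words counts; trigger words and gene names count `boost` times."""
    counts = Counter()
    for tok in sentence.lower().split():
        counts[tok] += boost if tok in TRIGGERS or tok in GENES else 1
    return counts

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, sentences, labels, boost=2):
        self.boost = boost
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for s, y in zip(sentences, labels):
            self.word_counts[y].update(featurize(s, boost))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, sentence):
        feats = featurize(sentence, self.boost)
        total = sum(self.class_counts.values())
        best, best_lp = None, -math.inf
        for y, n in self.class_counts.items():
            denom = sum(self.word_counts[y].values()) + len(self.vocab)
            lp = math.log(n / total)  # class prior
            for w, c in feats.items():
                if w in self.vocab:   # words unseen in training are ignored
                    lp += c * math.log((self.word_counts[y][w] + 1) / denom)
            if lp > best_lp:
                best, best_lp = y, lp
        return best

# Toy training data (hypothetical sentences, one label each).
model = NaiveBayes().fit(
    ["stat3 phosphorylation was increased",
     "tp53 binds the promoter",
     "expression of tp53 was detected"],
    ["Phosphorylation", "Binding", "Gene expression"],
)
print(model.predict("increased phosphorylation of stat3"))
```

Setting `boost=1` recovers the plain bag-of-words baseline, which makes it easy to compare the two feature sets on the same data.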
Summary of classification approaches: number of test instances (marked events) for each event type in the test dataset, with precision (P), recall (R), and F1-score in percent. Of the three classifiers, Maximum Entropy had the best average precision, while CRF had the best recall together with good precision, giving it the best F1-score.
| Event type | Test instances (total: 942) | NB + EM P | NB + EM R | NB + EM F1 | MaxEnt P | MaxEnt R | MaxEnt F1 | CRF P | CRF R | CRF F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| Phosphorylation | 38 | 62 | 42 | 50 | 97 | 73 | 83 | 80 | 83 | 81 |
| Protein catabolism | 17 | 60 | 47 | 53 | 97 | 73 | 83 | 85 | 86 | 85 |
| Gene expression | 200 | 60 | 41 | 49 | 88 | 58 | 70 | 75 | 81 | 78 |
| Localization | 39 | 39 | 47 | 43 | 61 | 69 | 65 | 67 | 79 | 72 |
| Transcription | 60 | 24 | 52 | 33 | 49 | 80 | 61 | 57 | 78 | 66 |
| Binding | 153 | 56 | 63 | 59 | 65 | 62 | 63 | 65 | 81 | 72 |
| Regulation | 90 | 47 | 69 | 55 | 52 | 67 | 58 | 62 | 73 | 67 |
| Positive regulation | 220 | 70 | 27 | 39 | 75 | 25 | 38 | 55 | 74 | 63 |
| Negative regulation | 125 | 42 | 46 | 44 | 54 | 38 | 45 | 68 | 82 | 74 |
| Average | | 51 | 48 | 47 | 71 | 61 | 63 | 68 | 80 | 73 |
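The relationship between the P, R, and F1 columns, and between the per-event rows and the Average row, can be checked with a few lines of arithmetic. The sketch below uses the NB + EM figures from the table and assumes the averages are unweighted means over the nine event types, which is consistent with the reported 51, 48, and 47; the function names are illustrative.

```python
# (precision, recall, reported F1) for NB + EM, copied from the table above.
NB_EM = {
    "Phosphorylation":     (62, 42, 50),
    "Protein catabolism":  (60, 47, 53),
    "Gene expression":     (60, 41, 49),
    "Localization":        (39, 47, 43),
    "Transcription":       (24, 52, 33),
    "Binding":             (56, 63, 59),
    "Regulation":          (47, 69, 55),
    "Positive regulation": (70, 27, 39),
    "Negative regulation": (42, 46, 44),
}

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_average(rows):
    """Unweighted mean of each column over all event types."""
    n = len(rows)
    return tuple(sum(row[k] for row in rows.values()) / n for k in range(3))

# Recomputed F1 can differ from the reported value by up to a point,
# because P and R in the table are already rounded.
for event, (p, r, reported) in NB_EM.items():
    print(f"{event:20s} F1 recomputed {f1(p, r):5.1f}  reported {reported}")

print("macro average (P, R, F1):", [round(x) for x in macro_average(NB_EM)])
```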
CRF sequence labeling results.
| Type of evaluation | Coverage % |
|---|---|
| Exact boundary matching | 79% |
| Soft boundary matching | 82% |
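The two evaluation modes are not defined in this record. One common reading, assumed here rather than taken from the source, is that exact matching requires a predicted trigger span to coincide with the gold span, while soft matching accepts any overlap. Under that assumption, and treating coverage as the fraction of gold spans matched (also an assumption), the two scores could be computed as sketched below with hypothetical token offsets.

```python
def exact_match(pred, gold):
    """Spans are (start, end) token offsets with end exclusive; both ends must agree."""
    return pred == gold

def soft_match(pred, gold):
    """Assumed soft criterion: the two spans share at least one token."""
    return pred[0] < gold[1] and gold[0] < pred[1]

def coverage(predicted, gold, match):
    """Fraction of gold spans matched by at least one predicted span."""
    if not gold:
        return 0.0
    hits = sum(1 for g in gold if any(match(p, g) for p in predicted))
    return hits / len(gold)

# Hypothetical spans: one exact hit, one partial overlap, one miss.
gold_spans = [(0, 2), (5, 6), (8, 10)]
pred_spans = [(0, 2), (4, 6), (12, 13)]

print("exact:", coverage(pred_spans, gold_spans, exact_match))
print("soft: ", coverage(pred_spans, gold_spans, soft_match))
```

Because every exact match is also a soft match, the soft score can never be lower than the exact score, which agrees with the ordering of the two figures in the table.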