Daniel G Jamieson, Phoebe M Roberts, David L Robertson, Ben Sidders, Goran Nenadic.
Abstract
The vast collection of biomedical literature and its continued expansion have presented a number of challenges to researchers who require structured findings to stay abreast of and analyze molecular mechanisms relevant to their domain of interest. By structuring literature content into topic-specific machine-readable databases, the aggregate data from multiple articles can be used to infer trends that can be compared and contrasted with similar findings from topic-independent resources. Our study presents a generalized procedure for semi-automatically creating a custom topic-specific molecular interaction database through the use of text mining to assist manual curation. We apply the procedure to capture molecular events that underlie 'pain', a complex phenomenon with a large societal burden and unmet medical need. We describe how existing text mining solutions are used to build a pain-specific corpus, extract molecular events from it, add context to the extracted events and assess their relevance. The pain-specific corpus contains 765 692 documents from Medline and PubMed Central, from which we extracted 356 499 unique normalized molecular events, with 261 438 single protein events and 93 271 molecular interactions supplied by BioContext. Event chains are annotated with negation, speculation, anatomy, Gene Ontology terms, mutations, pain and disease relevance, which collectively provide detailed insight into how that event chain is associated with pain. The extracted relations are visualized in a wiki platform (wiki-pain.org) that enables efficient manual curation and exploration of the molecular mechanisms that underlie pain. Curation of 1500 grouped event chains ranked by pain relevance revealed 613 accurately extracted unique molecular interactions that in the future can be used to study the underlying mechanisms involved in pain.
Our approach demonstrates that combining existing text mining tools with domain-specific terms and wiki-based visualization can facilitate rapid curation of molecular interactions to create a custom database. Database URL: •••
Year: 2013 PMID: 23707966 PMCID: PMC3662864 DOI: 10.1093/database/bat033
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 2. Pain dictionary summary statistics. (A) Represents the numbers of pain-specific and pain-relevant terms in the pain dictionary for each category of pain term. (B) Represents the numbers of pain-specific and pain-relevant synonyms in the dictionary for each category of pain term.
Figure 3. Pain term matches. Pain term matches from Medline (A) and open access PMC documents (B) in each type of document section across the 12 pain term categories are displayed. The overall percentages of pain-specific and pain-relevant terms from Medline and open access PMC documents are shown for each type of document section. ‘Body’ represents full text excluding abstracts and titles. MeSH refers to the textual document tags that PubMed uses to index articles.
Top reported pain terms in P1
| Pain Term | Category | Pain Specific | Frequency | Documents |
|---|---|---|---|---|
| Pain | Disorder | Yes | 627 644 | 247 312 |
| Anaesthesia | Pain type | Yes | 190 376 | 115 614 |
| Analgesic | Drug class | Yes | 112 703 | 61 223 |
| Headache | Disorder | Yes | 118 956 | 50 249 |
| Brain haemorrhage | Disorder | No | 85 702 | 45 214 |
| Opioid | Drug class | Yes | 77 921 | 33 486 |
| Morphine | Drug | Yes | 119 985 | 33 337 |
| Analgesia | Pain type | Yes | 64 777 | 31 982 |
| Palliative | Treatment | Yes | 51 401 | 27 536 |
| Abdominal pains | Pain type | Yes | 33 916 | 25 062 |
‘Pain term’ refers to the individual pain term and all its synonyms. Pain terms are pain specific (yes) or pain relevant (no). Pain term ‘categories’ are defined in supplementary file 1. ‘Frequency’ refers to the total number of times the term was mentioned. ‘Documents’ refers to the number of documents in which the term was mentioned.
Figure 4. Document pain relevancy scores. Pie charts represent the overall pain scores for Medline (abstracts and titles).
Evaluations of TM software used
| Tool | Data | True positives | False positives | True negatives | False negatives | True negative rate | Precision | Recall | Accuracy | F score |
|---|---|---|---|---|---|---|---|---|---|---|
| Pain terms (LINNAEUS) | 50 Documents | 3803 | 0 | N/A | 443 | N/A | 100 | 89.6 | N/A | 94.5 |
| Mutation to protein linker | 100 Event chains | 36 | 1 | 109 | 14 | 99.1 | 97.3 | 72 | 90.6 | 82.7 |
| Pain relevancy (>50 confidence) | 100 Event chains | 78 | 22 | N/A | N/A | N/A | 78 (92 expected) | N/A | N/A | N/A |
| Pain relevancy (≤50 confidence) | 100 Event chains | 39 | 61 | N/A | N/A | N/A | 39 (20 expected) | N/A | N/A | N/A |
| Disease terms (LINNAEUS) | 25 Documents | 345 | 16 | N/A | 15 | N/A | 95.6 | 95.8 | N/A | 95.7 |
| Disease relevancy (>50 confidence) | 100 Event chains | 84 | 16 | N/A | N/A | N/A | 84 (88 expected) | N/A | N/A | N/A |
| Disease relevancy (≤50 confidence) | 100 Event chains | 30 | 70 | N/A | N/A | N/A | 30 (13 expected) | N/A | N/A | N/A |
For each tool evaluated we display a summary of the data used in the evaluation (either documents or event chains), and the frequencies of true positives, false positives, false negatives and true negatives for each tool wherever possible. From the true positives, false positives, false negatives and true negatives we calculated the true-negative rate, precision, recall, accuracy and F score of each tool where applicable. In pain and disease relevancy we also note the expected precision calculated from the average relevancy score of each term in the respective evaluation.
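The derived columns follow from the four confusion counts. The sketch below is illustrative only: the function name `confusion_metrics` and its dictionary keys are hypothetical, and it assumes the standard definitions of precision, recall, F score, true-negative rate and accuracy. Applied to the counts in the table, it matches the tabulated percentages to within rounding of the last digit.

```python
from typing import Optional


def confusion_metrics(tp: int, fp: int, fn: int, tn: Optional[int] = None) -> dict:
    """Return metrics as percentages.

    Metrics that need true negatives stay None when tn is not available,
    mirroring the N/A cells in the table.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    metrics = {
        "precision": 100 * precision,
        "recall": 100 * recall,
        "f_score": 100 * f_score,
        "true_negative_rate": None,
        "accuracy": None,
    }
    if tn is not None:
        metrics["true_negative_rate"] = 100 * tn / (tn + fp)
        metrics["accuracy"] = 100 * (tp + tn) / (tp + fp + fn + tn)
    return metrics


# Counts taken from the table rows above.
mutation_linker = confusion_metrics(tp=36, fp=1, fn=14, tn=109)
pain_terms = confusion_metrics(tp=3803, fp=0, fn=443)  # no true negatives reported

for name, value in mutation_linker.items():
    print(f"{name}: {value:.2f}")
```

The function does not guard against empty denominators, so it assumes at least one predicted positive and one actual positive.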
Event chains from P1
| Involving only | Single events | Molecular interactions | More than two participants | Total |
|---|---|---|---|---|
| Human proteins | 45 731 | 14 568 | 262 | 60 561 |
| Mouse proteins | 41 671 | 12 956 | 230 | 54 857 |
| Rat proteins | 26 736 | 7369 | 132 | 34 237 |
| Other proteins | 147 300 | 58 378 | 1166 | 206 844 |
| Total | 261 438 | 93 271 | 1790 | 356 499 |
Event chains are shown for those involving only human, mouse, rat and other proteins as their cause and/or theme. Event chains are divided into single events, molecular interactions (i.e. those containing two participants) and event chains with more than two participants. Total numbers of event chains by number of participants and by proteins involved are displayed.
Event types involved in event chains
| Event type | Single events | Molecular interactions | More than two participants |
|---|---|---|---|
| Binding | 33 358 | 37 291 (37 315) | 897 (919) |
| Gene expression | 78 255 | 12 223 (12 482) | 95 |
| Transcription | 12 158 | 1238 | 10 |
| Localization | 27 329 | 5355 (5368) | 50 |
| Phosphorylation | 7360 | 1782 (1784) | 37 |
| Protein catabolism | 5296 | 467 | 6 |
| Positive regulation | 69 846 (75 064) | 32 222 (35 740) | 1174 (1650) |
| Negative regulation | 52 754 (54 729) | 13 698 (14 870) | 541 (624) |
| Regulation | 41 137 (42 422) | 19 271 (19 783) | 468 (551) |
Non-redundant frequencies of single events, molecular interactions (i.e. those containing two participants) and event chains containing more than two participants are displayed for each of the nine categories of events used by the event extractors. The numbers in brackets represent the total number of occurrences of that event type where some events have duplicate (redundant) event types, e.g. ‘positive regulation of positive regulation of protein A’.
Figure 5. Number of negated event chains. ‘Mixed’ refers to event chains that have been mentioned both negatively and positively. ‘All negated’ refers to the number of event chains that are only mentioned negatively. Proportions of mixed and negated data are shown for all molecular interactions and single events that have been mentioned more than once or more than five times.
Top 10 anatomical regions associated with event chains
| Name | Frequency |
|---|---|
| Neurons | 37 666 |
| Plasma | 36 969 |
| Brain | 31 775 |
| Blood | 19 291 |
| T cells | 16 092 |
| Liver | 15 650 |
| Spinal Cord | 14 453 |
| Macrophage | 13 409 |
| Neuronal | 12 368 |
| Nerve | 11 355 |
| Total | 761 990 |
Anatomy terms are extracted using GETM.
Overview of overall pain relevancy scores for unique event chains involving human, mouse or rat proteins and excluding self-interactions
| Pain relevancy score | Single events | Molecular interactions | More than two participants | Total |
|---|---|---|---|---|
| Low (0) | 22 623 | 9240 | 191 | 32 054 |
| Medium (>0, ≤1) | 62 640 | 25 593 | 520 | 88 753 |
| High (>1) | 28 875 | 2646 | 42 | 31 563 |
We show the frequency of unique single events, molecular interactions (i.e. two participants) and event chains with more than two participants with a low (0), medium (>0, ≤1) or high (>1) overall pain relevancy score.
Top diseases associated with documents containing event data
| Disease name | Disease term mentions |
|---|---|
| Disease | 135 367 |
| Pain | 122 233 |
| Cancer | 117 041 |
| Inflammation | 101 059 |
| Injury | 59 237 |
| Infection | 57 481 |
| Diabetes mellitus | 50 705 |
| Stress | 41 056 |
| Depression | 39 762 |
| AIDS or HIV infection | 30 872 |
| Total | 3 041 109 |
Here we report the total number of disease term mentions in documents that contain at least one event chain.
Pain genes enrichment analysis
| Corpus | Event chains mentioning a pain gene | Event chains not mentioning a pain gene | Total event chains | % of event chains with a pain gene |
|---|---|---|---|---|
| P1 | 71 685 | 1 506 969 | 1 578 654 | 4.54 |
| R1 | 47 998 | 2 196 618 | 2 244 616 | 2.14 |
P1 represents the pain corpus and R1 represents the randomly generated generic corpus. We show frequencies of event chains mentioning a gene from the Pain Gene DB for each corpus and event chains not mentioning a gene from the Pain Gene DB. We also display total event chains for each corpus and the percentage of event chains that contain genes from the Pain Gene DB. Fisher’s exact test showed significant enrichment of pain genes within P1, having an odds ratio of 2.177008 with a P-value <2.2e-16.
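The percentages and the size of the enrichment can be checked directly from the 2×2 table. This is a minimal sketch that computes the sample (cross-product) odds ratio, which is an assumption about how to approximate the reported figure: Fisher's exact test implementations often report a conditional maximum-likelihood estimate instead, so the result is close to, but not identical with, the 2.177008 quoted above. The function name `odds_ratio` is hypothetical, and the P-value is not recomputed here.

```python
def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Sample odds ratio for the 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


# Event chains with / without a pain gene, from the table above.
p1_with, p1_without = 71_685, 1_506_969   # pain corpus P1
r1_with, r1_without = 47_998, 2_196_618   # random generic corpus R1

p1_percent = 100 * p1_with / (p1_with + p1_without)
r1_percent = 100 * r1_with / (r1_with + r1_without)
sample_or = odds_ratio(p1_with, p1_without, r1_with, r1_without)

print(f"P1: {p1_percent:.2f}% of event chains mention a pain gene")
print(f"R1: {r1_percent:.2f}% of event chains mention a pain gene")
print(f"Sample odds ratio: {sample_or:.3f}")
```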
Figure 6. Example of a typical molecular interaction in wiki-pain.org. We have removed the page borders typical of a MediaWiki interface and annotated each novel region of the page that we designed. All ‘INT’ pages on wiki-pain.org follow the same framework, including those for single events and event chains containing more than two participants. The specific page shown can be viewed by searching ‘INT106559’ on wiki-pain.org.
Top 10 homologues appearing in our manually curated data
| Homologue ID | Symbol | Frequency |
|---|---|---|
| 1876 | NGF | 53 |
| 37368 | OPRM1 | 50 |
| 723 | POMC | 45 |
| 12920 | TRPV1 | 44 |
| 88337 | CALCB | 40 |
| 4528 | PENK | 39 |
| 502 | IL6 | 27 |
| 599 | CRH | 22 |
| 496 | TNF | 19 |
| 4537 | PNOC | 16 |
Homologues are ranked by the number of unique molecular interactions in which each is involved in our manually curated data. Homologue ID refers to the ID used by the NCBI HomoloGene database (http://www.ncbi.nlm.nih.gov/homologene).
Manual curation evaluation
| Analysis | TPs before | TPs after | FPs before | FPs after | Agreed | Disagreed | P(A) | P(E) (%) | K |
|---|---|---|---|---|---|---|---|---|---|
| Intra | 18 | 12 | 32 | 38 | 42 | 8 | 0.84 | 57.3 | 0.427 |
| Inter | 27 | 22 | 23 | 28 | 45 | 5 | 0.9 | 49.5 | 0.802 |
| Overall | 45 | 34 | 55 | 66 | 87 | 13 | 0.87 | 51.6 | 0.731 |
We evaluate the quality of our manual curation using an intra analysis (data are re-evaluated by the same curator who originally curated them) and an inter analysis (data originally curated by a different curator are evaluated); the two are combined to give an overall evaluation of our manual curation. We present the number of true positives (TPs) and false positives (FPs) in the original curation (before) and the new curation results (after). Results that were the same were marked as ‘Agreed’ and those that were different as ‘Disagreed’. The absolute agreement, P(A), was calculated as the proportion of agreement, agreed/(agreed + disagreed). Cohen’s Kappa coefficient (K) was calculated from the proportion of agreement, corrected for expected agreement by chance [P(E)], such that K = [P(A) – P(E)]/[1 – P(E)].
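The calculation can be sketched as follows. The caption does not say how P(E) was obtained, so this sketch assumes the standard chance agreement computed from the before and after marginal proportions of TPs and FPs; the function name `cohen_kappa` is hypothetical. With P(E) taken as a proportion rather than a percentage, the assumption reproduces the P(E) and K values listed for the ‘Inter’ and ‘Overall’ rows.

```python
def cohen_kappa(tp_before: int, fp_before: int,
                tp_after: int, fp_after: int, agreed: int) -> tuple:
    """Return (P(A), P(E), K) for two rounds of TP/FP judgements on the same items."""
    n = tp_before + fp_before  # number of items judged in each round
    p_a = agreed / n
    # Chance agreement: both rounds say TP, or both rounds say FP (assumed definition).
    p_e = (tp_before / n) * (tp_after / n) + (fp_before / n) * (fp_after / n)
    kappa = (p_a - p_e) / (1 - p_e)
    return p_a, p_e, kappa


# Counts taken from the 'Inter' and 'Overall' rows of the table above.
inter = cohen_kappa(tp_before=27, fp_before=23, tp_after=22, fp_after=28, agreed=45)
overall = cohen_kappa(tp_before=45, fp_before=55, tp_after=34, fp_after=66, agreed=87)

for label, (p_a, p_e, kappa) in (("Inter", inter), ("Overall", overall)):
    print(f"{label}: P(A)={p_a:.2f}  P(E)={p_e:.3f}  K={kappa:.3f}")
```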