Kalpana Raja, Suresh Subramani, Jeyakumar Natarajan.
Abstract
Mining protein-protein interactions (PPIs) from MEDLINE abstracts and full-text research articles is one of the most common and challenging problems in biomedical text mining, because PPIs play a major role in understanding biological processes and the impact of proteins in diseases. We implemented PPInterFinder, a web-based text mining tool that extracts human PPIs from biomedical literature. PPInterFinder uses co-occurrences of relation keywords with protein names to extract PPI information from MEDLINE abstracts, and it works in three phases. First, it identifies the relation keyword using a parser with Tregex and a relation keyword dictionary. Next, it automatically identifies candidate PPI pairs with a set of rules related to PPI recognition. Finally, it extracts the relations by matching the sentence against a set of 11 specific patterns based on the syntactic nature of the PPI pair. PPInterFinder predicts PPIs with an F-score of 66.05% on the AIMED corpus and outperforms most of the existing systems. DATABASE URL: http://www.biomining-bu.in/ppinterfinder/
Year: 2013 PMID: 23325628 PMCID: PMC3548331 DOI: 10.1093/database/bas052
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Workflow of PPInterFinder.
Figure 2. Tregex-based algorithm for extracting the relation keyword.
Rules set for identifying candidate PPI pairs in the three abstract forms
| Rules | Description | Abstract Form 1 (PIP) | Abstract Form 2 (IPP) | Abstract Form 3 (PPI) |
|---|---|---|---|---|
| Rule 1 | Order of two proteins and relation keyword | A | A | A |
| Rule 2 | Distance between the protein pair | NA | A | A |
| Rule 3 | Simple sentence with two proteins | A | A | A |
| Rule 4 | Simple sentence with two proteins and negation keyword | A | A | NA |
| Rule 5 | Complex sentence having more than two proteins | A | A | A |
| Rule 6 | Complex sentence having more than two proteins and negation keyword | A | A | NA |
| Rule 7 | Complex sentence having more than two proteins and two negation keywords | Special rule independent of forms | | |
PIP, protein–relation–protein; IPP, relation–protein–protein; PPI, protein–protein–relation; A, applicable; NA, not applicable
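The applicability table above can be read as a simple lookup. The following is a minimal sketch, not the tool's implementation: the names `RULE_APPLICABILITY` and `applicable_rules` are hypothetical, and Rule 7 is treated as always applicable because the table marks it as independent of the forms.

```python
# Hypothetical encoding of the rule table: for each rule number,
# the set of abstract forms marked "A" (applicable).
RULE_APPLICABILITY = {
    1: {"PIP", "IPP", "PPI"},
    2: {"IPP", "PPI"},          # NA for PIP
    3: {"PIP", "IPP", "PPI"},
    4: {"PIP", "IPP"},          # NA for PPI
    5: {"PIP", "IPP", "PPI"},
    6: {"PIP", "IPP"},          # NA for PPI
}

# Rule 7 is listed as a special rule independent of the forms.
FORM_INDEPENDENT_RULES = [7]


def applicable_rules(form):
    """Return the sorted rule numbers that apply to an abstract form."""
    if form not in {"PIP", "IPP", "PPI"}:
        raise ValueError(f"unknown abstract form: {form!r}")
    rules = [n for n, forms in RULE_APPLICABILITY.items() if form in forms]
    return sorted(rules + FORM_INDEPENDENT_RULES)


print(applicable_rules("PPI"))  # [1, 2, 3, 5, 7]
```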
Figure 3. Algorithm to extract PPI triplets from complex sentences with more than two proteins.
List of Tregex syntax tags and description
| Syntax tag | Tag description |
|---|---|
| S | Sentence |
| NP | Noun phrase |
| VP | Verb phrase |
| NNPS | Proper noun, plural |
| CC | Coordinating conjunction |
| IN | Preposition, subordinating conjunction |
| JJ | Adjective |
| DT | Determiner |
| $++ | A $++ B: A is a left sister of B |
| $+ | A $+ B: A is the immediate left sister of B |
| << | A << B: A dominates B (B is a descendant of A) |
| < | A < B: A immediately dominates B (B is a child of A) |
| PROTEIN1, PROTEIN2, PROTEIN3 | Special tag for protein |
| RELATION | Special tag for relation keyword |
| NEGATION | Special tag for negation keyword |
| And | Exact word match |
| With | Exact word match |
Figure 4. PPI extraction methodology.
Performance of PPInterFinder on AIMED, HPRD50 and IntAct corpora
| Form | AIMED TP | AIMED FP | AIMED FN | AIMED R | AIMED P | AIMED F | HPRD50 TP | HPRD50 FP | HPRD50 FN | HPRD50 R | HPRD50 P | HPRD50 F | IntAct TP | IntAct FP | IntAct FN | IntAct R | IntAct P | IntAct F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PPI algorithm | ||||||||||||||||||
| PIP | 432 | 72 | 233 | 64.96 | 85.71 | 73.91 | 73 | 7 | 49 | 59.84 | 91.25 | 72.28 | 270 | 34 | 73 | 78.72 | 88.82 | 83.47 |
| IPP | 103 | 39 | 137 | 42.92 | 69.13 | 52.96 | 9 | 5 | 15 | 37.50 | 64.29 | 47.37 | 64 | 12 | 33 | 65.98 | 84.21 | 73.99 |
| PPI | 42 | 31 | 81 | 34.15 | 57.53 | 42.85 | 5 | 1 | 4 | 55.56 | 83.33 | 66.67 | 70 | 20 | 24 | 74.47 | 77.78 | 76.09 |
| Total | 577 | 142 | 451 | 56.12 | 80.25 | 66.05 | 87 | 13 | 68 | 56.13 | 87.00 | 68.24 | 404 | 55 | 130 | 75.66 | 88.01 | 81.37 |
| PPI algorithm with preprocessing steps (NER and GN) | ||||||||||||||||||
| PIP | 334 | 68 | 331 | 50.23 | 83.08 | 62.61 | 49 | 5 | 75 | 39.52 | 90.74 | 55.04 | 258 | 34 | 85 | 75.22 | 88.36 | 81.26 |
| IPP | 95 | 41 | 144 | 39.75 | 69.85 | 50.65 | 8 | 6 | 17 | 32.00 | 57.14 | 41.02 | 53 | 12 | 44 | 54.64 | 81.54 | 65.43 |
| PPI | 40 | 16 | 96 | 29.41 | 71.43 | 41.67 | 3 | 1 | 6 | 50.00 | 75.00 | 60.00 | 65 | 20 | 29 | 69.15 | 76.47 | 72.63 |
| Total | 469 | 125 | 571 | 45.10 | 78.96 | 57.41 | 60 | 12 | 98 | 37.97 | 83.33 | 52.17 | 376 | 55 | 158 | 70.58 | 87.33 | 78.07 |
Performance evaluation (%): recall (R), precision (P) and F-score (F); NER, named entity recognition; GN, gene/protein normalization.
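The R, P and F columns can be derived from the TP, FP and FN counts. The sketch below assumes the standard definitions (recall = TP / (TP + FN), precision = TP / (TP + FP), F-score as the harmonic mean of the two); the exact formulas used by the authors are not stated here, and the function name `prf` is hypothetical. Under these assumptions the PIP row for AIMED is reproduced.

```python
def prf(tp, fp, fn):
    """Return (recall, precision, F-score) as percentages rounded to 2 places.

    Assumes the standard definitions; F-score is the harmonic mean of
    precision and recall (F1).
    """
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision + recall == 0:
        f_score = 0.0
    else:
        f_score = 2 * precision * recall / (precision + recall)
    return tuple(round(100 * x, 2) for x in (recall, precision, f_score))


# PIP form on the AIMED corpus: TP = 432, FP = 72, FN = 233
print(prf(432, 72, 233))  # (64.96, 85.71, 73.91)
```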
Performance comparison with the existing systems on AIMED corpus
| System | Description | F-score (%) |
|---|---|---|
| Saetre | Feature-based, two parsers | 64.2 |
| Miwa | Multiple kernels, two parsers | 60.8 |
| Kim | Walk-weighted subsequence kernels, one parser | 56.6 |
| Airola | All-paths graph kernel, one parser | 56.4 |
| Niu | All-paths graph kernel, one parser | 53.5 |
| Bui | RBF kernel, one parser | 61.2 |
| PPInterFinder | Pattern matching, two parsers | 66.05 |
Evaluation of PPInterFinder prior to BioCreative Workshop 2012
| Evaluation | Curator 1 R | Curator 1 P | Curator 1 F | Curator 2 R | Curator 2 P | Curator 2 F |
|---|---|---|---|---|---|---|
| Preprocessing steps (NER & GN) + PPI extraction algorithm | 46.88 | 85.71 | 60.61 | 46.88 | 85.71 | 60.61 |
| PPI extraction algorithm | 69.76 | 85.71 | 76.91 | 63.83 | 85.71 | 73.17 |
Performance evaluation (%): recall (R), precision (P) and F-score (F); NER, named entity recognition; GN, gene/protein normalization.
Performance of the system with improvements from BioCreative Workshop 2012
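| Dataset | PPInterFinder (improved version) R | PPInterFinder (improved version) P | PPInterFinder (improved version) F | PPInterFinder (BioCreative Workshop 2012) R | PPInterFinder (BioCreative Workshop 2012) P | PPInterFinder (BioCreative Workshop 2012) F |
|---|---|---|---|---|---|---|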
| 693 sentences from IntAct Database | 70.58 | 87.33 | 78.07 | 71.27 | 81.28 | 75.94 |
Performance evaluation (%): recall (R), precision (P) and F-score (F).
Figure 5. Screenshot of PPInterFinder showing input and extracted PPI pairs.
| Form | Pattern | Examples |
|---|---|---|
| Form 1 (PIP) | PROTEIN1 - token* - RELATION - token* - PROTEIN2 | PROTEIN1 interacts with PROTEIN2; PROTEIN1 has weak association with PROTEIN2 |
| Form 2 (IPP) | RELATION - token* - PROTEIN1 - token* - PROTEIN2 | interaction between PROTEIN1 and PROTEIN2 |
| Form 3 (PPI) | PROTEIN1 - token* - PROTEIN2 - token* - RELATION | PROTEIN1 and PROTEIN2 complex |
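The three abstract forms above differ only in the order of the two protein tags and the relation keyword, so a sentence can be assigned to a form by comparing token positions. The following is a minimal sketch under stated assumptions, not the parser-based Tregex procedure the tool uses: `RELATION_KEYWORDS` is a tiny hypothetical stand-in for the relation keyword dictionary, `classify_form` is a hypothetical name, and only the first relation keyword in the sentence is considered.

```python
import re

# Hypothetical, very small stand-in for the relation keyword dictionary.
RELATION_KEYWORDS = {"interacts", "interaction", "association", "complex", "binds"}


def classify_form(sentence):
    """Return 'PIP', 'IPP' or 'PPI' from token order, or None if no form fits.

    Expects protein names to be already replaced by PROTEIN1 and PROTEIN2.
    """
    tokens = re.findall(r"\w+", sentence)
    try:
        p1 = tokens.index("PROTEIN1")
        p2 = tokens.index("PROTEIN2")
    except ValueError:
        return None  # one of the protein tags is missing
    # Position of the first relation keyword, if any.
    rel = next(
        (i for i, tok in enumerate(tokens) if tok.lower() in RELATION_KEYWORDS),
        None,
    )
    if rel is None:
        return None
    if p1 < rel < p2:
        return "PIP"  # protein - relation - protein
    if rel < p1 < p2:
        return "IPP"  # relation - protein - protein
    if p1 < p2 < rel:
        return "PPI"  # protein - protein - relation
    return None


print(classify_form("PROTEIN1 interacts with PROTEIN2"))           # PIP
print(classify_form("interaction between PROTEIN1 and PROTEIN2"))  # IPP
print(classify_form("PROTEIN1 and PROTEIN2 complex"))              # PPI
```

A position-based check like this ignores sentence structure, which is why a syntactic parser and the candidate-pair rules are still needed for complex sentences and negation.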