Yun Niu, David Otasek, Igor Jurisica.
Abstract
MOTIVATION: Identification and characterization of protein-protein interactions (PPIs) is one of the key aims in biological research. While previous research in text mining has made substantial progress in automatic PPI detection from literature, the need to improve the precision and recall of the process remains. More accurate PPI detection will also improve the ability to extract experimental data related to PPIs and provide multiple pieces of evidence for each interaction.
Year: 2009 PMID: 19850753 PMCID: PMC2796811 DOI: 10.1093/bioinformatics/btp602
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. A sentence parsed by Collins' parser.
The output of the phrase extraction script
| 0 | C-PP | IN | In | PP | B-S/B-PP |
| 1 | B-NP | DT | the | NOFUNC | I-S/I-PP/B-NP |
| 2 | I-NP | NN | complex | NP | I-S/I-PP/I-NP |
| 3 | O | COMMA | COMMA | NOFUNC | I-S/E-PP/E-NP |
| 4 | B-NP | CD | one | NOFUNC | I-S/B-NP |
| 5 | I-NP | NN | interferon | NOFUNC | I-S/I-NP |
| 6 | I-NP | NN | gamma | NOFUNC | I-S/I-NP |
| 7 | E-NP | NN | homo-dimer | NP | I-S/E-NP |
| 8 | C-VP | VBZ | binds | VP/S | I-S/B-VP |
| 9 | B-NP | CD | two | NOFUNC | I-S/I-VP/B-NP |
| 10 | I-NP | NN | receptor | NOFUNC | I-S/I-VP/I-NP |
| 11 | I-NP | NNS | molecules | NP | I-S/I-VP/I-NP |
| 12 | O | . | . | NOFUNC | E-S/E-VP/E-NP |
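As a rough illustration of how the chunk tags in the second column could be consumed, the sketch below groups tokens into noun phrases. It assumes (the source does not spell this out) that `B-`, `I-` and `E-` mark the beginning, inside and end of a chunk, that `C-` marks a single-token chunk, and that `O` lies outside any chunk. `np_chunks` is a hypothetical helper, not part of the authors' script.

```python
# (chunk tag, word) pairs transcribed from the table above.
TOKENS = [
    ("C-PP", "In"), ("B-NP", "the"), ("I-NP", "complex"), ("O", "COMMA"),
    ("B-NP", "one"), ("I-NP", "interferon"), ("I-NP", "gamma"),
    ("E-NP", "homo-dimer"), ("C-VP", "binds"), ("B-NP", "two"),
    ("I-NP", "receptor"), ("I-NP", "molecules"), ("O", "."),
]


def np_chunks(tokens):
    """Group tokens into noun-phrase strings using the assumed B/I/E scheme."""
    chunks, current = [], []
    for tag, word in tokens:
        prefix, _, label = tag.partition("-")
        if label == "NP" and prefix == "B":
            # A new noun phrase starts; flush any phrase left open.
            if current:
                chunks.append(" ".join(current))
            current = [word]
        elif label == "NP" and prefix in ("I", "E") and current:
            current.append(word)
            if prefix == "E":
                chunks.append(" ".join(current))
                current = []
        else:
            # Any non-NP tag closes an open noun phrase.
            if current:
                chunks.append(" ".join(current))
                current = []
    if current:
        chunks.append(" ".join(current))
    return chunks


print(np_chunks(TOKENS))
# ['the complex', 'one interferon gamma homo-dimer', 'two receptor molecules']
```

Closing a phrase on any non-NP tag, and not only on `E-NP`, is a deliberate choice here, because two of the three noun phrases in the table end without an explicit `E-NP` tag.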
An output example from Minipar (Lin, 1994)
| E0 | (() | fin | C | * | ) |
| 1 | (In | Prep | E0 | mod | (gov fin)) |
| 2 | (the | Det | 3 | det | (gov complex)) |
| 3 | (complex | N | 1 | pcomp-n | (gov in)) |
| 4 | (, | U | E0 | punc | (gov fin)) |
| 5 | (one | N | 8 | nn | (gov homo-dimer)) |
| 6 | (interferon | N | 8 | nn | (gov homo-dimer)) |
| 7 | (gamma | N | 8 | nn | (gov homo-dimer)) |
| 8 | (homo-dimer | N | 9 | s | (gov bind)) |
| 9 | (binds | V | E0 | i | (gov fin)) |
| 10 | (two | N | 12 | nn | (gov molecule)) |
| 11 | (receptor | N | 12 | nn | (gov molecule)) |
| 12 | (molecules | N | 9 | obj | (gov bind)) |
| 13 | (. | U | * | punc | ) |
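To show how a dependency parse of this kind can expose a candidate interaction, the following minimal sketch reads the rows above as (id, word, category, head, relation) records and pairs each subject (`s`) with each object (`obj`) attached to the same verb. The `Dep` record and `subject_verb_object` function are hypothetical names introduced for illustration; the source does not describe the authors' dependency features at this level of detail.

```python
from typing import List, NamedTuple, Optional, Tuple


class Dep(NamedTuple):
    idx: str             # node id as printed by the parser
    word: str
    cat: str             # syntactic category (Prep, Det, N, V, U)
    head: Optional[str]  # id of the governing node; None where the table shows "*"
    rel: str             # dependency relation to the head


# Rows transcribed from the table above; the empty root node E0 is omitted.
ROWS = [
    Dep("1", "In", "Prep", "E0", "mod"),
    Dep("2", "the", "Det", "3", "det"),
    Dep("3", "complex", "N", "1", "pcomp-n"),
    Dep("4", ",", "U", "E0", "punc"),
    Dep("5", "one", "N", "8", "nn"),
    Dep("6", "interferon", "N", "8", "nn"),
    Dep("7", "gamma", "N", "8", "nn"),
    Dep("8", "homo-dimer", "N", "9", "s"),
    Dep("9", "binds", "V", "E0", "i"),
    Dep("10", "two", "N", "12", "nn"),
    Dep("11", "receptor", "N", "12", "nn"),
    Dep("12", "molecules", "N", "9", "obj"),
    Dep("13", ".", "U", None, "punc"),
]


def subject_verb_object(rows: List[Dep]) -> List[Tuple[str, str, str]]:
    """Return (subject, verb, object) word triples that share a verb head."""
    by_id = {r.idx: r for r in rows}
    subjects, objects = {}, {}
    for r in rows:
        if r.rel == "s":
            subjects.setdefault(r.head, []).append(r.word)
        elif r.rel == "obj":
            objects.setdefault(r.head, []).append(r.word)
    triples = []
    for verb_id, subs in subjects.items():
        for s in subs:
            for o in objects.get(verb_id, []):
                triples.append((s, by_id[verb_id].word, o))
    return triples


print(subject_verb_object(ROWS))
# [('homo-dimer', 'binds', 'molecules')]
```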
Pair-level evaluation of individual features and their combination on data from AImed (Bunescu and Mooney, 2005), using 10-fold cross-validation and measuring precision, recall and F-score
| Features | Precision (%) | Recall (%) | F-score (%) |
|---|---|---|---|
| position+lexical forms+context (1) | 57.4 | 25.3 | 35.1 |
| (1)+keywords | 56.2 | 17.6 | 26.8 |
| (1)+patterns | 66.7 | 11.0 | 18.9 |
| (1)+phrase | 58.6 | 29.8 | 39.5 |
| (1)+dependency | 64.0 | 29.3 | 40.2 |
| (1)+phrase+dependency (2) | 61.6 | 31.6 | 41.7 |
| (2)+overlap (3) | 65.4 | 33.0 | 43.9 |
| (3)+interaction sentence | 65.0 | 35.9 | 46.2 |
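The last column of this table can be reproduced from the first two, assuming the F-score is the balanced harmonic mean of precision and recall (the usual F1 definition; the source does not state the formula here). The sketch below recomputes it for every row. Because the published precision and recall are themselves rounded, the results agree with the table only to within about 0.1.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F-score %) copied from the table above.
PAIR_LEVEL = {
    "position+lexical forms+context (1)": (57.4, 25.3, 35.1),
    "(1)+keywords": (56.2, 17.6, 26.8),
    "(1)+patterns": (66.7, 11.0, 18.9),
    "(1)+phrase": (58.6, 29.8, 39.5),
    "(1)+dependency": (64.0, 29.3, 40.2),
    "(1)+phrase+dependency (2)": (61.6, 31.6, 41.7),
    "(2)+overlap (3)": (65.4, 33.0, 43.9),
    "(3)+interaction sentence": (65.0, 35.9, 46.2),
}

for name, (p, r, reported) in PAIR_LEVEL.items():
    computed = f_score(p, r)
    print(f"{name}: computed {computed:.2f}, reported {reported}")
```

The harmonic mean penalizes an imbalance between the two measures, which is why the `(1)+patterns` row has the highest precision in the table but the lowest F-score.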
Abstract-level evaluation of different systems on the AImed dataset.
| System | Evaluation setup | Recall (%) | Precision (%) | F-score (%) |
|---|---|---|---|---|
| Bunescu and Mooney | 10-fold cross-validation | 43 | 60 | 50.1 |
| Bunescu | 10-fold cross-validation | 43 | 48 | 45.4 |
| Yakushiji | 10-fold cross-validation | 45.3 | 37.3 | 40.9 |
| Mitsumori | 10-fold cross-validation | 36.7 | 64.2 | 46.7 |
| Romano | development: 60%, test: 40% | 29 | 42 | 34.3 |
| Our approach | 10-fold cross-validation | 43.2 | 70.2 | 53.5 |
The 10-fold splits follow (Bunescu et al., 2005).
Results obtained for the SwissProt-only test set
| | Precision | Recall | F-score |
|---|---|---|---|
| Average (11 articles) | 0.5454 | 0.3918 | 0.4560 |
| Macro-average | 0.0246 | 0.0176 | 0.0205 |
| Micro-average | 0.5714 | 0.0094 | 0.0186 |
Fig. 2. System architecture for PPI verification. To speed up analysis, we use a copy of NLM Medline/PubMed stored in an IBM DB2 ver. 9.2 database on an IBM p595 server. Individual abstracts are indexed using the relevant MeSH terms and protein name synonyms found in each abstract.
Number of distinct interactions in individual human-curated databases
| | HPRD | BIND | DIP | MINT | IntAct |
|---|---|---|---|---|---|
| No. of PPIs in a database | 34 177 | 44 391 | 4443 | 58 733 | 55 182 |
| No. of PPI partners in abstracts | 11 182 | 3410 | 473 | 1620 | 1892 |
Please note that human curation uses full papers, not just abstracts.
Number of distinct interactions, considering simple co-occurrence and SVM classification
| | HPRD | BIND | DIP | MINT | IntAct |
|---|---|---|---|---|---|
| Co-occur in abstracts | 300 | 300 | 300 | 300 | 300 |
| Co-occur in sentences | 247 | 258 | 234 | 266 | 241 |
| Detected by the classifier | 149 | 158 | 128 | 168 | 126 |
Number of target abstracts in HPRD, BIND, DIP, MINT, IntAct: 357, 315, 315, 331, 326, respectively.
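The first two rows of this table distinguish protein pairs that co-occur anywhere in an abstract from pairs that co-occur within a single sentence. The sketch below illustrates that distinction only. It assumes case-insensitive substring matching and a naive punctuation-based sentence splitter, neither of which is claimed to be the authors' method, and the example abstract is made up for illustration.

```python
import re


def cooccurrence_level(abstract: str, protein_a: str, protein_b: str) -> str:
    """Return 'sentence', 'abstract' or 'none' for a protein pair."""
    text = abstract.lower()
    a, b = protein_a.lower(), protein_b.lower()
    if a not in text or b not in text:
        return "none"
    # Naive splitter: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if a in sentence and b in sentence:
            return "sentence"
    return "abstract"


# Hypothetical two-sentence abstract, not taken from any of the databases.
EXAMPLE = "Interferon gamma binds its receptor. JAK1 is activated downstream."

print(cooccurrence_level(EXAMPLE, "interferon gamma", "receptor"))  # sentence
print(cooccurrence_level(EXAMPLE, "interferon gamma", "JAK1"))      # abstract
print(cooccurrence_level(EXAMPLE, "STAT1", "JAK1"))                 # none
```

A real pipeline would need protein name normalization and synonym lookup before matching. Plain substring search is used here only to keep the example self-contained.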
Results of identifying new evidence for 300 pairs in DIP (showing numbers of distinct pairs)
| | ABS | SEN | SVM > 0 (SIN) | SVM > 0 (MUL) | SVM > 0.5 (SIN) | SVM > 0.5 (MUL) |
|---|---|---|---|---|---|---|
| Search provided abstracts | 300 | 234 | 123 | 5 | 84 | 1 |
| Search all of PubMed | 300 | 295 | 26 | 243 | 69 | 128 |
ABS, co-occur in abstracts; SEN, co-occur in sentences; SIN, PPIs with single abstract as evidence; MUL, PPIs with multiple abstracts as evidence.
Fig. 3. PPIs with new evidence from automatic text mining. Expanding the original set of 1433 proteins and 845 interactions by including PPIs from I2D results in a network with 8250 proteins and 21 652 interactions. The subgraph, taken from Supplementary Figure 3, shows only the largest connected component and highlights new evidence for human-curated PPIs, predicted PPIs and PPIs from high-throughput experiments. The combined network comprises 392 interactions among 322 proteins. Nodes are colored by Gene Ontology category, as shown in the legend, and sized by their degree in the network. For simplicity, only the protein names of high-degree nodes are shown. Edges are colored by PPI source, as described in the legend. The figure was generated in NAViGaTOR 2.1.12 (http://ophid.utoronto.ca/navigator; Brown et al., 2009).