Chen Li, Antonio Jimeno-Yepes, Miguel Arregui, Harald Kirsch, Dietrich Rebholz-Schuhmann.
Abstract
The extraction of information from the scientific literature is a complex task, both for researchers doing manual curation and for automatic text processing solutions. The identification of protein-protein interactions (PPIs) requires the extraction of protein named entities and their relations. Semi-automatic interactive support is one way to combine both approaches into efficient working processes that generate reliable database content. In principle, PPIs can be extracted with different methods, which can be combined in different ways to deliver high-precision results, high-recall results or both at the same time. Interactive use is possible if the analytical methods are fast enough to process the retrieved documents. PCorral provides interactive mining of PPIs from the scientific literature, allowing curators to skim MEDLINE for PPIs at low overhead. The keyword query to PCorral steers the selection of documents, and the subsequent text analysis generates high-recall and high-precision results for the curator. The underlying components of PCorral process the documents on the fly and are also available as web services from the Whatizit infrastructure. The human interface summarizes the identified PPI results, and the involved entities are linked to relevant resources and databases. Altogether, PCorral serves curators at both the beginning and the end of the curation workflow, for information retrieval and information extraction.
Database URL: http://www.ebi.ac.uk/Rebholz-srv/pcorral
Year: 2013 PMID: 23640984 PMCID: PMC3641755 DOI: 10.1093/database/bat030
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. PCorral back end workflow. The processing is split into three main parts: collection of relevant citations by querying an index on MEDLINE, identification of gene mentions and their normalization to UniProt identifiers, and extraction of relations among the identified genes.
List of verbs used in PCorral split into groups defining the interaction type
| Interaction type | Verbs |
|---|---|
| Verbs denoting protein chemical modification | acetylate, acylate, amidate, brominate, biotinylate, carboxylate, cysteinylate, farnesylate, formylate, ‘hydrox[iy]late’, methylate, demethylate, ‘myristo?ylate’, ‘palmito?ylate’, phosphorylate, dephosphorylate, pyruvate, nitrosylate, sumoylate, ‘ubiquitin(yl)?ate’ |
| Verbs denoting interaction and regulation events | associate, dissociate, assemble, attach, bind, complex, contact, couple, ‘(multi\|di)meri[zs]e’, link, interact, precipitate, regulate, inhibit, activate, ‘down[-]regulate’, express, suppress, ‘up[-]regulate’, block, contain, inactivate, induce, modify, overexpress, promote, stimulate, substitute, catalyze, cleave, conjugate, disassemble, discharge, mediate, modulate, repress, transactivate |
The verbs are given as regular expressions, which also cover morphological variants of the verb forms.
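As a minimal sketch of how such verb expressions could be applied, the snippet below compiles a handful of entries from the verb table and checks a sentence for inflected forms. The suffix rules in `verb_pattern` and the helper `find_interaction_verbs` are assumptions made for illustration; the source does not describe how PCorral expands the stems.

```python
import re

# A few entries copied from the verb table; quoted table entries are
# already regular expressions and are used unchanged.
VERB_STEMS = [
    "phosphorylate",
    "dephosphorylate",
    "hydrox[iy]late",
    "ubiquitin(yl)?ate",
    "(multi|di)meri[zs]e",
    "bind",
    "interact",
]


def verb_pattern(stem):
    """Compile a stem into a regex covering regular inflections.

    Assumption: these suffix rules are illustrative only; irregular
    forms such as "bound" are not covered.
    """
    if stem.endswith("e"):
        body = stem[:-1] + r"(?:e|es|ed|ing)"
    else:
        body = stem + r"(?:s|ed|ing)?"
    return re.compile(r"\b(?:" + body + r")\b", re.IGNORECASE)


PATTERNS = [(stem, verb_pattern(stem)) for stem in VERB_STEMS]


def find_interaction_verbs(sentence):
    """Return the stems (in table order) whose pattern occurs in the sentence."""
    return [stem for stem, pattern in PATTERNS if pattern.search(sentence)]


if __name__ == "__main__":
    print(find_interaction_verbs("BRCA2 is phosphorylated by CDK1 and binds RAD51."))
```

Word boundaries keep `phosphorylate` from firing inside "dephosphorylated", and nominal forms such as "interaction" are deliberately not matched.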
Figure 2. Syntactical patterns (SynPs). The diagram explains the composition of the SynPs. The verb phrase (VP) is composed of several subcomponents that enable the identification of modal verbs (Vmodal), forms of ‘to be’ (Vbe) and common forms of hedging (Vshown). NP_P is a noun phrase (NP) containing a protein mention.
Figure 3. PCorral query interface.
Figure 4. PPI summary table. The top ranks of the screenshot show the proteins that interact most frequently with BRCA2 (using the query ‘Breast cancer’): amongst all proteins, RAD51 is most frequently linked to BRCA2 across the selection of documents. The frequency of findings per abstract and per sentence is also listed for each method [language pattern (ppi), tri-occurrence (co3) and co-occurrence (co)], together with the interaction verbs.
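The per-sentence counts in such a summary table could be gathered roughly as follows. This is a sketch under stated assumptions: protein mentions are taken as already recognized and normalized, tri-occurrence (co3) is read here as two proteins plus an interaction verb in the same sentence, and the function name `count_pairs` and the input layout are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_pairs(sentences):
    """Count protein pairs per sentence.

    `sentences` is a list of (protein_ids, has_interaction_verb) tuples.
    Returns two Counters keyed by alphabetically sorted protein pairs:
    co  -> sentences in which both proteins occur,
    co3 -> the subset of those sentences that also contain an interaction
           verb (assumed reading of "tri-occurrence").
    """
    co, co3 = Counter(), Counter()
    for protein_ids, has_interaction_verb in sentences:
        # Deduplicate mentions so a pair is counted once per sentence.
        for pair in combinations(sorted(set(protein_ids)), 2):
            co[pair] += 1
            if has_interaction_verb:
                co3[pair] += 1
    return co, co3


if __name__ == "__main__":
    example = [
        (["BRCA2", "RAD51"], True),
        (["RAD51", "BRCA2", "PALB2"], False),
        (["BRCA2"], True),
    ]
    co, co3 = count_pairs(example)
    for pair, n in co.most_common():
        print(pair, "co =", n, "co3 =", co3[pair])
```

Ranking the pairs by these counts gives the kind of "most frequently linked" ordering that the caption describes for RAD51 and BRCA2.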
Figure 5. Example annotated sentences with PPIs. The evidence is highlighted to allow better identification and curation of the PPIs. Each highlighted protein/gene is linked back to UniProt. Interaction verbs are denoted in square brackets.
Evaluation of CO, CO3 and SynP for PPIs on MEDLINE abstracts
| Method | Predictions | Correct predictions | Precision (%) | Recall (%) | F-measure (%) |
|---|---|---|---|---|---|
| CO | 5934 | 1705 | 28.73 | 17.54 | 21.78 |
| CO3 | 1461 | 454 | 31.07 | 4.67 | 8.12 |
| SynP | 370 | 142 | 38.38 | 1.46 | 2.81 |
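The precision and F-measure columns can be reproduced from the other columns with the standard definitions. The sketch below does this for the three rows of the MEDLINE table; recall is taken from the table as given, because the size of the gold standard is not stated here. The function names are illustrative.

```python
def precision(correct, predicted):
    """Precision in percent: correct predictions over all predictions."""
    return 100.0 * correct / predicted


def f_measure(p, r):
    """Balanced F-measure (harmonic mean) of precision and recall, in percent."""
    return 2.0 * p * r / (p + r) if (p + r) else 0.0


if __name__ == "__main__":
    # (method, predictions, correct predictions, recall in % as reported)
    rows = [
        ("CO", 5934, 1705, 17.54),
        ("CO3", 1461, 454, 4.67),
        ("SynP", 370, 142, 1.46),
    ]
    for method, predicted, correct, recall in rows:
        p = round(precision(correct, predicted), 2)
        f = round(f_measure(p, recall), 2)
        print(f"{method}: precision={p} recall={recall} F={f}")
```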
Evaluation of CO, CO3, SynP for PPIs on the BioCreative II sentences
| Method | Predictions | Correct predictions | Precision (%) | Recall (%) | F-measure (%) |
|---|---|---|---|---|---|
| CO | 52136 | 785 | 1.5 | 33.2 | 2.9 |
| CO3 | 15823 | 609 | 3.8 | 28.8 | 6.8 |
| SynP | 2078 | 358 | 17.2 | 17.0 | 17.1 |
List of verbs that contributed to a correct prediction of related proteins
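| Verb | Predictions | Correct predictions | Precision (%) | Recall (%) | F-measure (%) |
|---|---|---|---|---|---|
| Regulate | 179 | 12 | 6.7 | 0.6 | 1.0 |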
| Contain | 286 | 12 | 4.2 | 0.6 | 1.0 |
| Inhibit | 130 | 9 | 6.9 | 0.4 | 0.8 |
| Mediate | 136 | 7 | 5.1 | 0.3 | 0.6 |
| Activate | 165 | 7 | 4.2 | 0.3 | 0.6 |
| Modulate | 31 | 5 | 16.1 | 0.2 | 0.5 |
| Precipitate | 31 | 4 | 12.9 | 0.2 | 0.4 |
| Express | 218 | 4 | 1.8 | 0.2 | 0.3 |
| Promote | 42 | 3 | 7.1 | 0.1 | 0.3 |
| Induce | 110 | 3 | 2.7 | 0.1 | 0.3 |
| Modify | 6 | 2 | 33.3 | 0.1 | 0.2 |
| Dephosphorylate | 8 | 2 | 25.0 | 0.1 | 0.2 |
| Complex | 15 | 2 | 13.3 | 0.1 | 0.2 |
| Stimulate | 41 | 2 | 4.9 | 0.1 | 0.2 |
| Downregulate | 6 | 2 | 33.3 | 0.1 | 0.2 |
| Methylate | 6 | 1 | 16.7 | 0.0 | 0.1 |
| Substitute | 7 | 1 | 14.3 | 0.0 | 0.1 |
| Assemble | 11 | 1 | 9.1 | 0.0 | 0.1 |
| Block | 30 | 1 | 3.3 | 0.0 | 0.1 |
| Suppress | 40 | 1 | 2.5 | 0.0 | 0.1 |
The verbs are sorted by F-measure. The list can be used to tune an IE system for performance (e.g. for precision, recall or speed).
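One way to read this tuning idea is to keep only the verbs whose precision clears a chosen threshold, trading recall for precision. The sketch below uses six rows from the verb table; the 10% threshold, the function names and the pooling of counts are assumptions for illustration (pooling treats the per-verb predictions as non-overlapping, which the source does not state).

```python
# (verb, predictions, correct predictions), copied from the verb table
VERB_ROWS = [
    ("Regulate", 179, 12),
    ("Contain", 286, 12),
    ("Modulate", 31, 5),
    ("Express", 218, 4),
    ("Modify", 6, 2),
    ("Dephosphorylate", 8, 2),
]


def select_verbs(rows, min_precision):
    """Keep verbs whose precision (in percent) is at least min_precision."""
    return [
        verb
        for verb, predicted, correct in rows
        if 100.0 * correct / predicted >= min_precision
    ]


def pooled_precision(rows, verbs):
    """Precision in percent over the pooled predictions of the chosen verbs.

    Assumption: predictions of different verbs do not overlap.
    """
    chosen = [(p, c) for verb, p, c in rows if verb in verbs]
    predicted = sum(p for p, _ in chosen)
    correct = sum(c for _, c in chosen)
    return 100.0 * correct / predicted if predicted else 0.0


if __name__ == "__main__":
    kept = select_verbs(VERB_ROWS, min_precision=10.0)
    print(kept, pooled_precision(VERB_ROWS, kept))
```

Raising the threshold shrinks the verb list and the number of predictions, so the gain in precision comes at the cost of correct predictions that the dropped verbs would have contributed.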