Catalina O Tudor, Cecilia N Arighi, Qinghua Wang, Cathy H Wu, K Vijay-Shanker.
Abstract
Protein phosphorylation is a central regulatory mechanism in signal transduction and is involved in most biological processes. Phosphorylation of a protein may lead to activation or repression of its activity, alternative subcellular location and interaction with different binding partners. Extracting this type of information from the scientific literature is critical for connecting phosphorylated proteins with kinases and interaction partners, along with their functional outcomes, for knowledge discovery from phosphorylation protein networks. We have developed the Extracting Functional Impact of Phosphorylation (eFIP) text mining system, which combines several natural language processing techniques to find relevant abstracts mentioning phosphorylation of a given protein together with indications of protein-protein interactions (PPIs) and potential evidence for the impact of phosphorylation on the PPIs. eFIP integrates our previously developed tools, Extracting Gene Related ABstracts (eGRAB) for document retrieval and name disambiguation, Rule-based LIterature Mining System for Protein Phosphorylation (RLIMS-P) for extraction of phosphorylation information, a PPI module to detect PPIs involving phosphorylated proteins and an impact module for relation extraction. The text mining system has been integrated into the curation workflow of the Protein Ontology (PRO) to capture knowledge about phosphorylated proteins. The eFIP web interface accepts gene/protein names or identifiers, or PubMed identifiers, as input and displays results as a ranked list of abstracts with sentence evidence and a summary table, which can be exported as a spreadsheet upon result validation.
As a participant in the BioCreative-2012 Interactive Text Mining track, the performance of eFIP was evaluated on document retrieval (F-measures of 78-100%), sentence-level information extraction (F-measures of 70-80%) and document ranking (normalized discounted cumulative gain measures of 93-100% and mean average precision of 0.86). The utility and usability of the eFIP web interface were also evaluated during the BioCreative Workshop. The use of the eFIP interface provided a significant speed-up (∼2.5-fold) for time to completion of the curation task. Additionally, eFIP significantly simplifies the task of finding relevant articles on PPIs involving phosphorylated forms of a given protein.
Year: 2012 PMID: 23221174 PMCID: PMC3514748 DOI: 10.1093/database/bas044
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. The eFIP text mining system overview. The pipeline consists of four components: (1) retrieval of all documents relevant to a given protein (eGRAB), (2) extraction of phosphorylation mentions (kinase, substrate and site) in these documents (RLIMS-P), (3) extraction of PPI mentions (protein interactants and type of interaction) (PPI module) and (4) detection of phosphorylation-interaction relations (impact module).
Example patterns that capture PPI mentions
| Pattern | Example phrase capturing the pattern |
|---|---|
| NP_P1 NP_int Prep_from NP_P2 | … |
| NP_its NP_int Prep_with NP_P2 | … |
| NP_it VG_int Prep_with NP_P2 | … |
| NP_int Prep_of NP_P1 Prep_to NP_P2 | … the |
| NP_P1 VG_int Prep_with NP_P2 | |
‘NP’ stands for noun phrase, ‘NP_P’ stands for a noun phrase that holds a protein name and ‘NP_int’ stands for a noun phrase holding a trigger word for interaction (e.g. ‘binds’, ‘binding’, ‘interacts’, ‘interaction’). ‘VG_int’ stands for a verb group containing a trigger word for interaction. ‘Prep’ stands for preposition, and the actual preposition is given after the underscore. Pronouns are also allowed as interactants, and we mark them with ‘NP_its’, ‘NP_it’, ‘NP_they’, etc.
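As an illustration of how such a pattern could be applied, the following minimal sketch matches one pattern from the table against a sentence that has already been chunked and tagged. The chunk representation, the `find_ppi` helper and the toy sentence are assumptions made for this example; they are not the eFIP implementation, which is not described at this level of detail here.

```python
from typing import List, Optional, Tuple

Chunk = Tuple[str, str]  # (chunk tag, surface text); hypothetical representation

# One pattern from the table, written as a sequence of slot names.
PATTERN = ("NP_P1", "VG_int", "Prep_with", "NP_P2")


def slot_matches(slot: str, tag: str) -> bool:
    """NP_P1 and NP_P2 both accept a protein noun phrase; other slots match exactly."""
    if slot in ("NP_P1", "NP_P2"):
        return tag == "NP_P"
    return slot == tag


def find_ppi(chunks: List[Chunk],
             pattern: Tuple[str, ...] = PATTERN) -> Optional[Tuple[str, str]]:
    """Return the two interactants of the first contiguous match, else None."""
    n = len(pattern)
    for start in range(len(chunks) - n + 1):
        window = chunks[start:start + n]
        if all(slot_matches(slot, tag) for slot, (tag, _) in zip(pattern, window)):
            captured = {slot: text for slot, (_, text) in zip(pattern, window)}
            return captured["NP_P1"], captured["NP_P2"]
    return None


# Toy chunked sentence: "A interacts with B"
chunks = [("NP_P", "A"), ("VG_int", "interacts"),
          ("Prep_with", "with"), ("NP_P", "B")]
print(find_ppi(chunks))  # ('A', 'B')
```

A real system would need many such patterns and would have to handle pronoun interactants such as ‘NP_it’, which this sketch ignores.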
Features used in the detection of phosphorylation-interaction relations
| Type | Feature | Description |
|---|---|---|
| T, C | SSI | Substrate is the same as interactant |
| T | IMP | One of the interactants is mentioned as being ‘phosphorylated’ (phosphorylated A binds to B) |
| T | CONJ | P and I are mentioned in a conjunction (there are five types of conjunctions captured in five different features) |
| C | ACTION | P and I are mentioned in a Subject–Verb–Object relationship (A phosphorylation leads to interaction with B) |
| T, C | DEPEND | I mentioned to be dependent on P (phosphorylation-dependent interaction of A to B) |
| T | PFIRST | P mentioned before I in the sentence |
| T | IFIRST | I mentioned before P in the sentence |
| T | WLR | There is a word/phrase between P and I hinting at a directionality of events from left to right (leads to) |
| T | WRL | There is a word/phrase between P and I hinting at a directionality of events from right to left (requires) |
| C | NEG | One of the events or the action is being negated (phosphorylated A does not bind to B) |
| C | HEDGE | One of the events or the action is mentioned with hedging (phosphorylated A might bind to B) |
| T | RELAPPB | P or I is mentioned in a relative clause or appositive referring to a protein (A, which interacts with B, is phosphorylated by C) |
| T | RELAPPG | I is mentioned in a relative clause or appositive referring to the phosphorylation (phosphorylation of A, which increased the interaction with B) |
The type column specifies if the feature is used in the detection of the temporal relation (T), causal relation (C) or both (T, C). The feature column lists the features by name and the description column gives a description of each feature. ‘P’ is short for phosphorylation and ‘I’ is short for interaction.
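To make the ordering features concrete, the sketch below derives PFIRST, IFIRST, WLR and WRL from character spans of the phosphorylation (P) and interaction (I) mentions. The span representation, the `order_features` helper and the one-item cue lists (taken from the examples in the table) are assumptions for illustration only, not the feature extractor used by eFIP.

```python
from typing import Dict, Tuple

Span = Tuple[int, int]  # (start, end) character offsets; hypothetical representation

# Cue phrases taken from the examples in the table; real lists would be longer.
LEFT_TO_RIGHT_CUES = ("leads to",)
RIGHT_TO_LEFT_CUES = ("requires",)


def order_features(sentence: str, p_span: Span, i_span: Span) -> Dict[str, bool]:
    """Compute the four ordering features for one P mention and one I mention."""
    p_first = p_span[0] < i_span[0]
    first, second = (p_span, i_span) if p_first else (i_span, p_span)
    # Text strictly between the two mentions, lower-cased for cue lookup.
    between = sentence[first[1]:second[0]].lower()
    return {
        "PFIRST": p_first,
        "IFIRST": not p_first,
        "WLR": any(cue in between for cue in LEFT_TO_RIGHT_CUES),
        "WRL": any(cue in between for cue in RIGHT_TO_LEFT_CUES),
    }


sentence = "A phosphorylation leads to interaction with B"
p_start = sentence.index("phosphorylation")
i_start = sentence.index("interaction")
features = order_features(
    sentence,
    (p_start, p_start + len("phosphorylation")),
    (i_start, i_start + len("interaction")),
)
print(features)
# {'PFIRST': True, 'IFIRST': False, 'WLR': True, 'WRL': False}
```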
Figure 2. eFIP ranking and result summary of abstracts for protein BAD. A total of 1331 abstracts are linked to protein BAD as determined by eGRAB, among which 369 mention phosphorylation information (ranked and partially shown). The ‘Impact’, ‘PPI’ and ‘Site’ images on the left indicate the type of information found in the abstract. The title, authors and a summary of the interactions involving the phosphorylated forms of BAD are displayed. A spreadsheet summary file can be downloaded by clicking on the ‘Download info in CSV format’ button.
Figure 3. eFIP annotation interface with sentence evidence attribution of phosphorylated protein and interaction events in PMID 10837486.
eFIP performance evaluation on document retrieval as measured by precision, recall and F-measure based on TP, TN, FP and FN
| Evaluation set | # Abstracts | Precision (%) | Recall (%) | F-measure (%) | TP | TN | FP | FN |
|---|---|---|---|---|---|---|---|---|
| In-house Set | 96 | 71.1 | 86.5 | 78.0 | 32 | 46 | 13 | 5 |
| BioCreative Set 1 | 25 | 100.0 | 100.0 | 100.0 | 11 | 14 | 0 | 0 |
| BioCreative Set 2 | 25 | 83.3 | 83.3 | 83.3 | 10 | 11 | 2 | 2 |
| BioCreative Set 3 | 40 | 82.4 | 93.3 | 87.5 | 28 | 4 | 6 | 2 |
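The precision, recall and F-measure values in this table follow from the TP, FP and FN counts under the standard definitions. The short sketch below recomputes the In-house Set row; the function name `prf` is chosen for this example.

```python
from typing import Tuple


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) as fractions in [0, 1]."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# In-house Set: TP = 32, FP = 13, FN = 5
p, r, f = prf(tp=32, fp=13, fn=5)
print(f"{100 * p:.1f} {100 * r:.1f} {100 * f:.1f}")  # 71.1 86.5 78.0
```

TN does not enter any of the three measures, which is why a set with few true negatives (such as BioCreative Set 3) can still score well.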
eFIP performance evaluation on information extraction at the sentence level as measured by precision, recall and F-measure based on TP, TN, FP and FN
| Evaluation type | # Abstracts/sentences | Time to completion | Precision (%) | Recall (%) | F-measure (%) | TP | TN | FP | FN |
|---|---|---|---|---|---|---|---|---|---|
| System evaluation (In-house) | 96/148 | | 72.4 | 67.9 | 70.1 | 55 | 46 | 21 | 26 |
| User evaluation (BioCreative) | | | | | | | | | |
| Set 1: Manual curation | 25/37 | 104 min | 84.2 | 80.0 | 82.0 | 16 | 14 | 3 | 4 |
| Set 2: eFIP interface | 25/37 | 42.5 min | 94.7 | 69.2 | 80.0 | 18 | 10 | 1 | 8 |
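The roughly 2.5-fold speed-up quoted in the abstract can be checked against the two completion times in this table, assuming it is the ratio of manual curation time to eFIP-assisted time:

```latex
\text{speed-up} = \frac{t_{\text{manual}}}{t_{\text{eFIP}}}
                = \frac{104\ \text{min}}{42.5\ \text{min}}
                \approx 2.45
```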
eFIP performance evaluation on document ranking as measured by nDCG and AveP based on the ranked lists of abstracts
| Evaluation set | # Abstracts | Relevant | Irrelevant | nDCG (%) | AveP |
|---|---|---|---|---|---|
| In-house Set | 96 | 37 | 59 | 94.50 | 0.75 |
| BioCreative Set 1 | 25 | 11 | 14 | 100.00 | 1.00 |
| BioCreative Set 2 | 25 | 12 | 13 | 98.08 | 0.81 |
| Protein | | | | | |
| LAT | 10 | 10 | 0 | 100.00 | 1.00 |
| LCP2 | 10 | 8 | 2 | 98.76 | 0.83 |
| PLCG1 | 10 | 4 | 6 | 93.45 | 0.73 |
| ZAP70 | 10 | 8 | 2 | 96.20 | 0.88 |
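For readers unfamiliar with the two ranking measures, the sketch below computes average precision (AveP) and nDCG for a ranked list of binary relevance judgments. It assumes the common formulations (AveP averaged over the relevant abstracts in the list, DCG with a log2 rank discount); the exact variants used in the evaluation are not stated here, so treat this as an illustration and not a reproduction of the scoring script.

```python
import math
from typing import Sequence


def average_precision(relevance: Sequence[int]) -> float:
    """Mean of precision@k over the ranks k that hold a relevant item."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def dcg(relevance: Sequence[int]) -> float:
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevance, start=1))


def ndcg(relevance: Sequence[int]) -> float:
    """DCG normalized by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / ideal if ideal else 0.0


# A list in which all 10 abstracts are relevant (as in the LAT row)
# scores perfectly on both measures.
all_relevant = [1] * 10
print(average_precision(all_relevant), ndcg(all_relevant))  # 1.0 1.0

# Mean of the four per-protein AveP values in the table.
per_protein_avep = [1.00, 0.83, 0.73, 0.88]
print(round(sum(per_protein_avep) / len(per_protein_avep), 2))  # 0.86
```

The mean of the four per-protein AveP values matches the mean average precision of 0.86 reported in the abstract.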