Pei-Yau Lung, Zhe He, Tingting Zhao, Disa Yu, Jinfeng Zhang.
Abstract
Information about the interactions between chemical compounds and proteins is indispensable for understanding the regulation of biological processes and for developing therapeutic drugs. Manually extracting such information from the biomedical literature is time- and resource-consuming. In this study, we propose a computational method to automatically extract chemical-protein interactions (CPIs) from a given text. Our method extracts CPI pairs and CPI triplets from sentences, where a CPI pair consists of a chemical compound and a protein name, and a CPI triplet consists of a CPI pair along with an interaction word describing their relationship. We extracted a diverse set of features from sentences and used them to build multiple machine learning models. Our models use both simple features, which can be computed directly from sentences, and more sophisticated features derived using sentence structure analysis techniques. For example, one set of features was extracted based on the shortest paths between the CPI pairs or among the CPI triplets in the dependency graphs obtained from sentence parsing. We designed a three-stage approach to predict the multiple categories of CPIs. Our method performed the best among systems that use non-deep-learning methods and outperformed several deep-learning-based systems in track 5 of the BioCreative VI challenge. The features we designed in this study are informative and can be applied to other machine learning methods, including deep learning.
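The definitions of CPI pairs and CPI triplets given in the abstract can be illustrated with a minimal sketch. It assumes entity mentions have already been replaced by the placeholder tokens `CHEM` and `PROT`, and that a list of interaction words is available; the function names and the small `interaction_words` set are hypothetical and are not taken from the authors' system.

```python
from itertools import product


def candidate_pairs(tokens):
    """Return every (chemical index, protein index) combination in a sentence."""
    chems = [i for i, tok in enumerate(tokens) if tok == "CHEM"]
    prots = [i for i, tok in enumerate(tokens) if tok == "PROT"]
    return list(product(chems, prots))


def candidate_triplets(tokens, interaction_words):
    """Extend each candidate pair with every interaction word in the sentence."""
    iw_positions = [i for i, tok in enumerate(tokens)
                    if tok.lower() in interaction_words]
    return [(c, p, w)
            for (c, p) in candidate_pairs(tokens)
            for w in iw_positions]


# Illustrative input only: a shortened sentence and an assumed word list.
tokens = "PROT agonist CHEM enhances functional recovery after detachment caused by injection".split()
interaction_words = {"agonist", "enhances", "caused"}

pairs = candidate_pairs(tokens)
triplets = candidate_triplets(tokens, interaction_words)
```

Each triplet is a tuple of token positions (chemical, protein, interaction word), so a sentence with one pair and three interaction words yields three candidate triplets.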
Year: 2019 PMID: 30624652 PMCID: PMC6323317 DOI: 10.1093/database/bay138
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Number of cases in each true CPR group
| Data set | Count | CPR:3 | CPR:4 | CPR:5 | CPR:6 | CPR:9 |
|---|---|---|---|---|---|---|
| Training | # pairs | 761 | 2251 | 173 | 235 | 727 |
| Training | # triplets | 2492 | 6452 | 486 | 732 | 1536 |
| Development | # pairs | 548 | 1093 | 115 | 199 | 457 |
| Development | # triplets | 1650 | 2936 | 400 | 673 | 881 |
| Test | # pairs | 664 | 1658 | 185 | 292 | 644 |
| Test | # triplets | 2120 | 4780 | 570 | 870 | 1371 |
Features of semantic pattern from a sentence
| Features | Feature values | Comment |
|---|---|---|
| sp_type | Integer 0 to 7 | Type 0: e1–iw–e2 (entity1–interaction word–entity2); Type 1: iw–e1–e2; Type 2: e1–e2–iw; Type 3: the star shape with no other paths; Type 4: the triangle shape; Type 5: the star shape with a path between e1 and e2; Type 6: the star shape with a path between iw and e2; Type 7: the star shape with a path between iw and e1 |
| SenLen | | Number of words in the sentence. |
| steps_sp1 | | Number of edges in the shortest path between e1 and iw. |
| steps_sp2 | | Number of edges in the shortest path between iw and e2. |
| steps_sp3 | | Number of edges in the shortest path between e1 and e2. |
| pos_e1 | | Number of words that lie before e1. |
| pos_e2 | | Number of words that lie before e2. |
| pos_iw | | Number of words that lie before the interaction word. |
| NEntities | | Number of entities other than e1 and e2. |
| Significant | | Presence of |
| isBracket | | Whether e1 or e2 is in (any kind of) brackets. |
| isSubstrate | | Presence of |
| isAdjacent | | Whether e1 and e2 are adjacent. |
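Several of the features above are simple counts over token positions and over a dependency graph. The following is a minimal sketch of how they could be computed, assuming tokens are indexed from 0 and the dependency graph is supplied as an undirected list of token-index edges. The helper names and the hand-written edge list are illustrative assumptions; a real pipeline would obtain the edges from a dependency parser.

```python
from collections import deque


def shortest_path_steps(edges, source, target):
    """Breadth-first search: number of edges on the shortest path, or None."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two tokens are not connected


def triplet_features(tokens, edges, e1, e2, iw):
    """Compute a subset of the tabulated features for one (e1, e2, iw) triplet."""
    return {
        "SenLen": len(tokens),
        "steps_sp1": shortest_path_steps(edges, e1, iw),
        "steps_sp2": shortest_path_steps(edges, iw, e2),
        "steps_sp3": shortest_path_steps(edges, e1, e2),
        "pos_e1": e1,   # with 0-based indexing, index == number of preceding words
        "pos_e2": e2,
        "pos_iw": iw,
        "isAdjacent": abs(e1 - e2) == 1,
    }


# Hypothetical toy sentence; edges are written by hand, not parsed.
tokens = ["CHEM", "inhibits", "PROT"]
edges = [(1, 0), (1, 2)]  # "inhibits" governs both entities
features = triplet_features(tokens, edges, e1=0, e2=2, iw=1)
```

In this toy case the interaction word sits between the two entities on the dependency path, so `steps_sp1` and `steps_sp2` are both 1 and `steps_sp3` is 2.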
Figure 1. Grammatical dependency graph.
Entity name replacement
| | |
|---|---|
| Before | |
| After | |
Figure 2. Flowchart of the three-stage CPI extraction model.
Example for choosing CPI triplet
Sentence: PROT agonist CHEM enhances functional recovery after detachment caused by subCPT10 injection in normal and rds mice.
| Triplet | CPR group | Score | Selected |
|---|---|---|---|
| CHEM-PROT-agonist | CPR:5 | 0.7234 | Yes |
| CHEM-PROT-enhances | CPR:3 | 0.5118 | No |
| CHEM-PROT-caused | CPR:9 | 0.3841 | No |
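The example above is consistent with keeping, for each CPI pair, the candidate triplet with the highest numeric value. A minimal sketch of that selection follows, assuming the third column is a model score and that the maximum is the selection rule; both points are inferred from the table rather than stated in it, and `choose_triplet` is a hypothetical helper name.

```python
def choose_triplet(candidates):
    """Return the (triplet, cpr, score) candidate with the highest score."""
    return max(candidates, key=lambda candidate: candidate[2])


# Values copied from the example table.
candidates = [
    ("CHEM-PROT-agonist", "CPR:5", 0.7234),
    ("CHEM-PROT-enhances", "CPR:3", 0.5118),
    ("CHEM-PROT-caused", "CPR:9", 0.3841),
]
best_triplet, best_cpr, best_score = choose_triplet(candidates)
```

Under this assumption the pair is assigned the CPR group of the winning triplet, here `CPR:5`.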
F1 score of models at stage II
| | | | |
|---|---|---|---|
| Triplets | Yes | 0.5615 | 0.5558 |
| Triplets | No | 0.5284 | 0.5225 |
| Pairs | Yes | 0.5533 | 0.5275 |
| Pairs | No | 0.5192 | 0.4981 |
Best F1 score of each team
| Team | Deep learning | F1 score | Precision | Recall |
|---|---|---|---|---|
| 1 | Y | 0.6410 | 0.7266 | 0.5735 |
| 2 | Y | 0.6141 | 0.5610 | 0.6784 |
| 3 | Y | 0.6099 | 0.6608 | 0.5662 |
| 4 | Y | 0.5853 | 0.6704 | 0.5194 |
| Our system | N | 0.5671 | 0.6352 | 0.5121 |
| 6 | Y | 0.5181 | 0.5738 | 0.4722 |
| 7 | Y | 0.4948 | 0.5301 | 0.4639 |
| 8 | Y | 0.4582 | 0.4718 | 0.4453 |
| 9 | Y | 0.3839 | 0.2696 | 0.6663 |
| 10 | N | 0.3700 | 0.3387 | 0.4078 |
| 11 | N | 0.3092 | 0.2932 | 0.3271 |
| 12 | | 0.2195 | 0.1618 | 0.3409 |
| 13 | Y | 0.1864 | 0.6057 | 0.1102 |
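The three numeric values reported for each team are consistent with the first being the harmonic mean of the other two, that is, an F1 score computed from precision and recall. The short check below treats the columns that way, which is an interpretation of the table rather than something it states; small differences in the last digit are expected because the inputs are already rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rows copied from the table: (reported F1, precision, recall).
rows = {
    "team 1": (0.6410, 0.7266, 0.5735),
    "our system": (0.5671, 0.6352, 0.5121),
    "team 12": (0.2195, 0.1618, 0.3409),
}
recomputed = {name: f1_score(p, r) for name, (_, p, r) in rows.items()}
max_gap = max(abs(recomputed[name] - rows[name][0]) for name in rows)
```

For these rows the recomputed values agree with the reported F1 scores to about three decimal places.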
Confusion matrix and performance by CPR
| True \ Predicted | CPR:3 | CPR:4 | CPR:5 | CPR:6 | CPR:9 | Other |
|---|---|---|---|---|---|---|
| CPR:3 | 305 | 59 | 1 | 0 | 4 | 296 |
| CPR:4 | 41 | 1085 | 2 | 1 | 6 | 526 |
| CPR:5 | 2 | 1 | 87 | 3 | 1 | 101 |
| CPR:6 | 0 | 4 | 2 | 172 | 0 | 115 |
| CPR:9 | 15 | 31 | 0 | 0 | 122 | 476 |
| Other | 198 | 426 | 22 | 27 | 72 | - |
| F1 score | 0.498 | 0.665 | 0.5649 | 0.6964 | 0.2874 | - |
| Precision | 0.5446 | 0.6773 | 0.7699 | 0.8557 | 0.5951 | - |
| Recall | 0.4586 | 0.6532 | 0.4462 | 0.587 | 0.1894 | - |
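The per-class scores can be related to the counts with a short calculation. The sketch below assumes that rows are true groups and columns are predicted groups, in the order CPR:3, CPR:4, CPR:5, CPR:6, CPR:9, with a final row and column for cases outside these groups; this orientation is an assumption, and the `-` cell is treated as 0. Under that reading, the recomputed recall values match the last row of the table closely, while recomputed precision agrees only to within about 0.01, so the reported precision may have been derived from slightly different counts.

```python
LABELS = ["CPR:3", "CPR:4", "CPR:5", "CPR:6", "CPR:9"]

# Counts copied from the table; the last row/column is the non-CPR case.
MATRIX = [
    [305, 59, 1, 0, 4, 296],
    [41, 1085, 2, 1, 6, 526],
    [2, 1, 87, 3, 1, 101],
    [0, 4, 2, 172, 0, 115],
    [15, 31, 0, 0, 122, 476],
    [198, 426, 22, 27, 72, 0],  # "-" in the table is treated as 0 (assumption)
]


def per_class_metrics(matrix, labels):
    """Return {label: (precision, recall, f1)} from a true-by-predicted matrix."""
    metrics = {}
    for k, label in enumerate(labels):
        true_positive = matrix[k][k]
        predicted_total = sum(row[k] for row in matrix)  # column sum
        actual_total = sum(matrix[k])                    # row sum
        precision = true_positive / predicted_total if predicted_total else 0.0
        recall = true_positive / actual_total if actual_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


metrics = per_class_metrics(MATRIX, LABELS)
```

For example, the CPR:9 row sums to 644 cases with 122 on the diagonal, giving a recall of 122/644, which rounds to the 0.1894 shown in the table.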