Kaixian Yu, Pei-Yau Lung, Tingting Zhao, Peixiang Zhao, Yan-Yuan Tseng, Jinfeng Zhang.
Abstract
BACKGROUND: Relationships between bio-entities (genes, proteins, diseases, etc.) constitute a significant part of our knowledge. Most of this information is documented as unstructured text in various forms, such as books, articles and online pages. Automatically extracting this information and storing it in structured form could help researchers access it more easily and make it possible to incorporate it into advanced integrative analyses. In this study, we developed a novel approach to extract bio-entity relationship information using Natural Language Processing (NLP) and a graph-theoretic algorithm.
Keywords: Graph-theoretic algorithm; Information extraction; Natural language processing; Protein-protein interactions; Relationship extraction
Year: 2018 PMID: 30066644 PMCID: PMC6069288 DOI: 10.1186/s12911-018-0628-4
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Fig. 1. Example of PPIs. The sentence has four protein names and two interaction words, “interact” and “target”. The five triplets with “interact” are shown below the sentence
Fig. 2. Grammatical dependency graph
Fig. 3. Grammatical dependency sub-graph: a) the strict pattern directly extracted from the annotated sample; b) the relaxed pattern, where the word "domain" is allowed to vary; c) the more general pattern, where the interaction word can be replaced by any word from the same pre-defined interaction class
Fig. 4. Example decision tree
Dataset information
| Corpus | No. of sentences | No. of triplets | No. of true PPIs |
|---|---|---|---|
| HPRD50 | 145 | 954 | 126 |
| IEPA | 374 | 1341 | 164 |
| LLL | 79 | 977 | 106 |
| PICAD | 1033 | 19,755 | 1831 |
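The counts above imply that true interactions are a small minority of the candidate triplets in every corpus. The following is a minimal sketch that makes this explicit. It assumes that each true PPI corresponds to exactly one candidate triplet, which the table does not state, and the helper names `positive_rate` and `triplets_per_sentence` are illustrative, not taken from the paper.

```python
# Counts copied from the dataset table: (sentences, triplets, true PPIs).
DATASETS = {
    "HPRD50": (145, 954, 126),
    "IEPA": (374, 1341, 164),
    "LLL": (79, 977, 106),
    "PICAD": (1033, 19755, 1831),
}


def positive_rate(triplets: int, true_ppis: int) -> float:
    """Fraction of candidate triplets that are true PPIs.

    Assumption: one true PPI maps to one candidate triplet.
    """
    return true_ppis / triplets


def triplets_per_sentence(sentences: int, triplets: int) -> float:
    """Average number of candidate triplets generated per sentence."""
    return triplets / sentences


for name, (sentences, triplets, true_ppis) in DATASETS.items():
    rate = positive_rate(triplets, true_ppis)
    density = triplets_per_sentence(sentences, triplets)
    print(f"{name}: {rate:.1%} positive, {density:.1f} triplets per sentence")
```

Under that assumption, roughly 9% to 13% of the triplets in each corpus are positive, which is worth keeping in mind when reading precision and recall figures.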
Performance comparison
| Method | HPRD50 F | HPRD50 P | HPRD50 R | IEPA F | IEPA P | IEPA R | LLL F | LLL P | LLL R | PICAD F (b) | PICAD P (b) | PICAD R (b) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bui et al. [ | 71.7 | 62.2 | 84.7 | 73.4 | 62.9 | 88.1 | 83.6 | 81.9 | 85.4 | – | – | – |
| Miwa et al. [ | 70.9 | 68.5 | 76.1 | 71.7 | 67.5 | 78.6 | 80.1 | 77.6 | 86.0 | – | – | – |
| Chang et al. [ | 71.5 | 63.8 | 81.2 | 71.4 | 62.5 | 83.3 | 80.6 | 73.2 | 89.6 | – | – | – |
| Murugesan et al. [ | 80.0 | 76.3 | 84.2 | 80.2 | 75.9 | 85.2 | 89.2 | 87.3 | 91.2 | – | – | – |
| Zhao et al. (a) [ | 71.3 | 58.7 | 92.4 | 74.2 | 67.0 | 84.0 | 82.0 | 75.8 | 91.8 | – | – | – |
| GRGT | 64.0 | 86.5 | 50.8 | 74.9 | 91.0 | 63.6 | 83.6 | 91.2 | 77.1 | 70.0 | 78.2 | 63.4 |
Performance comparison of our method (GRGT) with top-performing methods on four benchmark datasets. F: F1-score; P: precision; R: recall. All values are out of 100. (a) Deep learning method. (b) Values are not available because of the unavailability of an executable program or source code
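The F column can be checked against the P and R columns using the standard definition of the F1-score as the harmonic mean of precision and recall. The sketch below does this for the GRGT row only. The function name `f1_score` and the tolerance of 0.1 (to allow for rounding in the reported values) are choices made for this illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# GRGT row from the performance table: (precision, recall, reported F1).
GRGT = {
    "HPRD50": (86.5, 50.8, 64.0),
    "IEPA": (91.0, 63.6, 74.9),
    "LLL": (91.2, 77.1, 83.6),
    "PICAD": (78.2, 63.4, 70.0),
}

for corpus, (p, r, reported_f) in GRGT.items():
    computed = f1_score(p, r)
    # Reported values are rounded to one decimal place.
    consistent = abs(computed - reported_f) < 0.1
    print(f"{corpus}: computed F1 = {computed:.2f}, reported = {reported_f}, "
          f"consistent = {consistent}")
```

For GRGT, the recomputed values agree with the reported F1-scores to within rounding on all four corpora. The row also shows the trade-off the method makes: higher precision than the compared methods on HPRD50, IEPA and LLL, at the cost of lower recall.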
Fig. 5. Precision-recall curves: a) HPRD50, b) IEPA, c) LLL, d) PICAD
Summary of the extracted subgraphs and their generalizations
| Corpus | No. of patterns | No. of valid patterns (a) | Triplets per valid pattern |
|---|---|---|---|
| HPRD50 | 3895 | 522 | 1.83 |
| IEPA | 6117 | 575 | 2.33 |
| LLL | 4859 | 891 | 1.10 |
| PICAD | 18,794 | 4363 | 4.53 |
(a) Patterns that appeared at least twice
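The validity criterion in the footnote, a pattern appearing at least twice, amounts to a frequency filter over the extracted patterns. The sketch below illustrates that filter only. The string encoding of patterns and the function name `valid_patterns` are hypothetical; in the paper the patterns are dependency sub-graphs, and how they are compared for equality is not described in this section.

```python
from collections import Counter
from typing import Iterable, Set


def valid_patterns(patterns: Iterable[str], min_count: int = 2) -> Set[str]:
    """Keep the patterns that occur at least `min_count` times."""
    counts = Counter(patterns)
    return {pattern for pattern, n in counts.items() if n >= min_count}


# Toy patterns, written as strings purely for illustration.
extracted = [
    "PROT1 <-nsubj- interact -prep_with-> PROT2",
    "PROT1 <-nsubj- bind -dobj-> domain -prep_of-> PROT2",
    "PROT1 <-nsubj- interact -prep_with-> PROT2",
    "PROT1 <-nsubj- target -dobj-> PROT2",
]

kept = valid_patterns(extracted)
print(f"{len(kept)} of {len(set(extracted))} distinct patterns are valid")
```

On the real corpora, the table shows that this filter retains a minority of the patterns, for example 522 of 3895 on HPRD50.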