| Literature DB >> 27454611 |
Thi Thanh Thuy Phan, Takenao Ohkawa.
Abstract
BACKGROUND: Protein-protein interaction (PPI) extraction from published scientific articles is a key problem in biological research because of its importance for understanding biological processes. Despite considerable recent advances in automatic PPI extraction from articles, there remains demand to improve the performance of existing methods.
Keywords: Biomedical text mining; Information extraction; Protein protein interaction; k-nearest neighbors
Year: 2016 PMID: 27454611 PMCID: PMC4965725 DOI: 10.1186/s12859-016-1100-z
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Word context features obtained directly from sentences
| Features | Definitions/Remarks | Values | Examples |
|---|---|---|---|
| Distance_KP1 | The distance, defined as the number of words appearing between the keyword K and the first protein name P1. | Integer value | In sentence LLL.d33.s1 of LLL corpus, |
| Distance_KP2 | The distance between the keyword K and the second protein name P2. | Integer value | In sentence LLL.d33.s1 above, Distance_KP2 is 8. |
| Distance_P1P2 | The distance between the two protein names in the sentence. | Integer value | In sentence LLL.d33.s1 above, Distance_P1P2 is 9. |
| Position_P1 | The word position of the first protein name P1 in the sentence (counted from 1). | Integer value | In sentence LLL.d33.s1 above, Position_P1 is 1. |
| Position_P2 | The word position of the second protein name P2 in the sentence (counted from 1). | Integer value | In sentence LLL.d33.s1 above, Position_P2 is 11. |
| Position of keyword | The position of the keyword K relative to the protein pair: between P1 and P2 (‘infix’), before P1 (‘prefix’), or after P2 (‘postfix’). | ‘Infix’, ‘prefix’, or ‘postfix’ | In sentence LLL.d33.s1 above, feature value is ‘infix’. |
| Comma between keyword and protein pair | Because the topic of a sentence frequently changes before and after commas, we use the information of whether a comma appears between the keyword K and each protein name of the pair. | ‘tt’, ‘ff’, ‘tf’, or ‘ft’ | In sentence LLL.d33.s1 above, feature value is ‘ft’. |
| Multiple occurrences of keywords | Check whether the keyword occurs more than once in the sentence. | ‘true’ or ‘false’ | In sentence LLL.d33.s1 above, feature value is ‘false’. |
| Parallel expression of a protein pair | Check whether the two protein names of the protein pair are contiguous in the word order of the sentence containing them (they are also considered contiguous even if only a comma or conjunction separates them). | ‘true’ or ‘false’ | In sentence LLL.d30.s0, |
Word context features extracted from sentences. P1, P2, and K denote the protein name appearing first, the protein name appearing later, and the keyword in a sentence, respectively. ‘t’ and ‘f’ are abbreviations of ‘true’ and ‘false’
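The word-context features above can be computed directly from a tokenized sentence. The following is a minimal sketch, assuming P1, P2, and the keyword are given as 0-based token indices and that the distance features count words strictly between the two items; the function name and the exact comma-feature encoding (first letter for K–P1, second for K–P2) are illustrative assumptions, not the paper's code.

```python
def word_context_features(tokens, i_p1, i_p2, i_k):
    """Distance/position features for one (P1, P2, keyword) instance.

    tokens: list of words; i_p1, i_p2, i_k: 0-based indices of P1, P2, K.
    """
    feats = {
        "Distance_KP1": abs(i_k - i_p1) - 1,   # words strictly between K and P1
        "Distance_KP2": abs(i_k - i_p2) - 1,
        "Distance_P1P2": abs(i_p2 - i_p1) - 1,
        "Position_P1": i_p1 + 1,               # 1-based word order
        "Position_P2": i_p2 + 1,
    }
    if i_k < i_p1:
        feats["Position_of_keyword"] = "prefix"
    elif i_k > i_p2:
        feats["Position_of_keyword"] = "postfix"
    else:
        feats["Position_of_keyword"] = "infix"

    def comma_between(i, j):
        # 't' if a comma token appears strictly between positions i and j
        lo, hi = sorted((i, j))
        return "t" if "," in tokens[lo + 1:hi] else "f"

    feats["Comma"] = comma_between(i_k, i_p1) + comma_between(i_k, i_p2)
    return feats
```

On a toy sentence such as `"A activates , strongly B"` with P1 = `A`, P2 = `B`, and keyword `activates`, this yields an ‘infix’ keyword position and the comma code ‘ft’.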
Syntactic features obtained from parse trees
| Features | Definitions/Remarks | Values | Examples |
|---|---|---|---|
| Height_P1 | The height of the first protein name P1 in the constituent parse tree. | Integer value | In Fig. |
| Height_P2 | The height of the second protein name P2 in the constituent parse tree. | Integer value | In Fig. |
| Height_K | The height of the keyword K in the constituent parse tree. | Integer value | In Fig. |
| POS_P1 | The part-of-speech information on the path from the root of the constituent parse tree to the leaf representing the first protein name P1. | The list of part-of-speech information of the path from the root of the constituent parse tree. | In Fig. |
| POS_P2 | POS_P2 denotes the part-of-speech information on the path from the root to the leaf representing the second protein name P2. | The list of part-of-speech information of the path from the root of the constituent parse tree. | In Fig. |
| POS_K | POS_K denotes the part-of-speech information on the path from the root to the leaf representing the keyword K. | The list of part-of-speech information of the path from the root of the constituent parse tree. | In Fig. |
All sentences were transformed into representations called constituent parse trees, output from the Stanford parser [7]. Syntactic features were extracted from constituent parse trees. P1, P2, and K denote the protein name appearing first, the protein name appearing later, and the keyword in a sentence, respectively
Fig. 1 Example of a constituent parse tree. Constituent parse tree for the sentence, “Oxytocin stimulates IP3 production in dose-dependent fashion as well,” from sentence IEPA.d0.s0 of IEPA corpus (first protein P1 is Oxytocin and second protein P2 is IP3)
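The height and POS-path features can be sketched over a small constituent tree. This is a hedged illustration, not the paper's pipeline: the tree is hand-built as nested `(label, children...)` tuples rather than Stanford parser output, and the function names are assumptions.

```python
def path_from_root(tree, word):
    """Labels from the root down to the preterminal dominating `word`.

    tree: nested tuples, e.g. ("NP", ("NN", "Oxytocin")); returns None if absent.
    """
    label = tree[0]
    for child in tree[1:]:
        if isinstance(child, str):          # terminal word under a POS node
            if child == word:
                return [label]
        else:
            sub = path_from_root(child, word)
            if sub is not None:
                return [label] + sub
    return None

def syntactic_features(tree, p1, p2, keyword):
    """Height_* and POS_* features for P1, P2, and the keyword K."""
    feats = {}
    for name, word in (("P1", p1), ("P2", p2), ("K", keyword)):
        path = path_from_root(tree, word)
        feats[f"Height_{name}"] = len(path)  # depth of the leaf in the tree
        feats[f"POS_{name}"] = path          # labels on the path root -> leaf
    return feats
```

For the Fig. 1 sentence, a simplified tree gives P1 = Oxytocin a root-to-leaf path such as S → NP → NN, so Height_P1 = 3.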
Set of PPI syntax patterns
| No. | PPI-Pattern |
|---|---|
| Pattern 1 | |
| Pattern 2 | |
| Pattern 3 | |
| Pattern 4 | |
| Pattern 5 | |
| Pattern 6 | |
| Pattern 7 | |
| Pattern 8 | complex between ∗ |
| Pattern 9 | complex of ∗ |
| Pattern 10 | |
| Pattern 11 | |
| Pattern 12 | |
| Pattern 13 | between |
We prepared syntax patterns related to PPI based on the syntax patterns proposed by Plake et al. [8]. P1 and P2 denote the protein names appearing first and later in a sentence, respectively. iNoun and iVerb denote sets of nouns and verbs related to interaction. The number of words substituted by a wildcard ‘∗’ in a pattern is limited to five. After the training set was divided into subsets based on the existence of significant keywords and the structure of the sentence, these syntax patterns were applied to each subset
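Plake-style patterns with a bounded wildcard can be applied via regular expressions. A minimal sketch, assuming protein mentions have already been replaced by the placeholders P1 and P2 in the sentence text; the pattern string in the example is illustrative, not one of the paper's thirteen.

```python
import re

def compile_pattern(pattern):
    """Translate e.g. 'P1 * interacts with P2' into a compiled regex.

    '*' stands for at most five intervening words, matching the paper's limit.
    """
    pieces = []
    for tok in pattern.split():
        if tok == "*":
            pieces.append(r"(?:\S+ ){0,5}")        # up to five wildcard words
        else:
            pieces.append(re.escape(tok) + " ")    # literal token
    return re.compile(r"\b" + "".join(pieces).rstrip() + r"\b")
```

With this limit, a sentence where six or more words separate the pattern's anchors does not match, which keeps the patterns from over-generalizing.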
Fig. 2 Framework of our PPI extraction system. Our system consists of two phases. First, the training set is divided into subsets based on the presence of significant keywords and the position-of-keyword feature. Second, after cross-validation is performed on the training data to assess the contribution levels of the four groups of related features, feature selection is performed automatically through our three approaches (BEST1G, U3G, O2G). Finally, the k-NN classifier is used to classify candidate PPI pairs in the test data
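The final classification step is a k-nearest-neighbours vote over feature vectors. The sketch below is a deliberately plain stand-in, assuming a Hamming distance over mixed categorical/numeric feature tuples and k = 3; the paper's distance metric and feature encoding are richer.

```python
from collections import Counter

def knn_predict(train, labels, query, k=3):
    """Majority vote among the k training instances closest to `query`.

    train: list of feature tuples; labels: parallel list of class labels.
    """
    def dist(a, b):
        # Hamming distance: number of feature positions whose values differ
        return sum(x != y for x, y in zip(a, b))

    nearest = sorted(range(len(train)), key=lambda i: dist(train[i], query))[:k]
    vote = Counter(labels[i] for i in nearest)
    return vote.most_common(1)[0][0]
```

A query instance whose distance, keyword-position, and comma features resemble known positive instances is thus classified as an interacting pair.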
Division of training set
| Subset | Significant keyword | Position of keyword |
|---|---|---|
| A | Included | Infix |
| B | Included | Prefix/postfix |
| C | Not included | Infix/prefix/postfix |
The training set was divided into subsets A, B, and C based on the presence of a significant keyword and the position-of-keyword feature
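The routing rule in the table above is a two-way decision. A minimal sketch (the function name is an assumption):

```python
def assign_subset(has_significant_keyword, keyword_position):
    """Route an instance to training subset A, B, or C.

    keyword_position: 'infix', 'prefix', or 'postfix' (the feature above).
    """
    if not has_significant_keyword:
        return "C"                      # no significant keyword at all
    return "A" if keyword_position == "infix" else "B"
```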
Fig. 3 Outline of PPI prediction based on division of the training set. The training set was divided into subsets A, B, and C based on the existence of a significant keyword and the position-of-keyword feature. Three classifiers were generated, one from each subset. Similarly, each unlabeled instance was assigned to one of three subsets, A’, B’, and C’, and the corresponding classifier was used to identify whether a PPI exists in that instance
Removed patterns for each training subset
| Subset | Removed patterns |
|---|---|
| A | Patterns 7,8,9,13 |
| B | Patterns 1,2,10,12 |
| C | No deletions |
After the training set was divided into subsets A, B, and C, the syntax patterns (Table 4) we prepared were checked to determine whether they matched each subset. Unsuitable patterns were removed beforehand for subsets A and B. No pattern was excluded for subset C
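The per-subset pattern removal can be encoded directly from the table above. A sketch (pattern bodies are omitted here since only their numbers matter for the filtering step; the names are assumptions):

```python
# Patterns removed for each training subset, as listed in the table
REMOVED_PATTERNS = {
    "A": {7, 8, 9, 13},
    "B": {1, 2, 10, 12},
    "C": set(),           # no deletions for subset C
}

def patterns_for_subset(subset, all_patterns):
    """Keep only the syntax patterns not removed for this subset."""
    return {n: p for n, p in all_patterns.items()
            if n not in REMOVED_PATTERNS[subset]}
```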
Fig. 4 S-fold cross-validation (SFCV) performed on original training data. The original training data Train_all was divided into S equal-sized partitions P_i (i = 0, ⋯, S−1) to perform SFCV on it, estimating the contribution levels of the four groups G_1, G_2, G_3, and G_4 and performing feature selection
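The partitioning in Fig. 4 can be sketched as follows, assuming a simple round-robin split into S roughly equal folds (the paper does not specify the exact assignment rule, so this is an illustrative choice):

```python
def s_fold_splits(instances, s=10):
    """Yield (train, held_out) pairs for S-fold cross-validation.

    Splits `instances` into S partitions P_0..P_{S-1}; each split holds out
    one partition and trains on the union of the others.
    """
    folds = [instances[i::s] for i in range(s)]   # round-robin partitioning
    for i in range(s):
        held_out = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, held_out
```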
Number of positive and negative instances in four corpora: LLL, HPRD50, IEPA, and AIMed
| Corpus | LLL | HPRD50 | IEPA | AIMed |
|---|---|---|---|---|
| Positive instances | 164 | 163 | 335 | 1000 |
| Negative instances | 166 | 270 | 482 | 4834 |
The four corpora, LLL, HPRD50, IEPA, and AIMed, were converted into a unified XML format with a very simple structure by Pyysalo et al. [17] to make them easily accessible to users. The number of positive instances (interacting protein pairs) and negative instances (non-interacting protein pairs) in each corpus is shown
Experiment results of our three approaches: BEST1G, U3G, O2G
| Corpus | LLL | | | HPRD50 | | | IEPA | | | AIMed | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (%) | P | R | F | P | R | F | P | R | F | P | R | F |
| Only k-NN | 71.6 | 79.8 | 74.4 | 71.9 | 62.4 | 65.9 | | 66.6 | 67.2 | | 35.9 | 42.3 |
| BEST1G | | | 76.5 | 72.5 | 72.6 | 71.6 | 67.7 | 71.3 | 69.2 | 50.8 | 40.9 | 45.1 |
| U3G | 74.6 | 80.7 | 75.9 | 72.5 | 72.6 | 71.6 | 68.3 | | | 49.5 | 40.5 | 44.4 |
| O2G | | | | | | | 68.1 | 71.3 | 69.5 | 50.2 | 39.8 | 44.3 |
Precision (P), recall (R), and F-score (F) of our three approaches (BEST1G, U3G, O2G), evaluated by 10-fold document-level cross-validation on the four corpora LLL, HPRD50, IEPA, and AIMed, are shown in the second, third, and fourth rows. As a baseline, the first row shows results when only k-NN is applied, without feature selection using the contribution levels of the groups of related features. Precision (P), recall (R), and F-score (F) values are shown as percentages (%). Bold typeface marks the best results per corpus in terms of precision, recall, and F-score
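Precision, recall, and F-score as reported in the tables are the standard quantities computed from true positives (TP), false positives (FP), and false negatives (FN):

```python
def prf(tp, fp, fn):
    """Precision, recall, and F-score from TP/FP/FN counts.

    P = TP/(TP+FP), R = TP/(TP+FN), F = 2PR/(P+R), with 0.0 on empty denominators.
    """
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

Note that the cross-validated scores in the tables are aggregated over folds, so a table's F value need not equal the harmonic mean of its printed P and R exactly.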
Comparison of our three approaches (BEST1G, U3G, O2G) with other systems
| | Corpus | LLL | | | HPRD50 | | | IEPA | | | AIMed | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | (%) | P | R | F | P | R | F | P | R | F | P | R | F |
| Feature-based methods | BEST1G | | | 76.5 | 72.5 | 72.6 | 71.6 | 67.7 | 71.3 | 69.2 | 50.8 | 40.9 | 45.1 |
| | U3G | 74.6 | 80.7 | 75.9 | 72.5 | 72.6 | 71.6 | 68.3 | | | 49.5 | 40.5 | 44.4 |
| | O2G | | | 76.5 | | | | 68.1 | 71.3 | 69.5 | 50.2 | 39.8 | 44.3 |
| | Landeghem et al. | 72.0 | 73.0 | 73.0 | 60.0 | 51.0 | 55.0 | 64.0 | 70.0 | 67.0 | 49.0 | 44.0 | 46.0 |
| | Liu et al. | | 64.9 | 62.1 | | | | | | | | | |
| | Yakushiji et al. | | | | | | | | | | 33.7 | 33.1 | 33.4 |
| Kernel-based methods | Airola et al. | 72.5 | 87.2 | 76.8 | 64.3 | 65.8 | 63.4 | | 82.7 | | 52.9 | 61.8 | 56.4 |
| | Miwa et al. | | 86.0 | 80.1 | | 76.1 | 70.9 | 67.5 | 78.6 | 71.7 | 55.0 | | |
| | Tikk et al. | 69.3 | | 78.1 | 62.2 | | | 58.8 | | 70.5 | 50.1 | 41.4 | 44.6 |
| | Qian et al. | | 68.8 | 69.8 | | | | | | | | 57.6 | 58.1 |
| Co-occurrence | Airola et al. | 55.9 | 100.0 | 70.3 | 38.9 | 100.0 | 55.4 | 40.8 | 100.0 | 57.6 | 17.8 | 100.0 | 30.1 |
| Rule-based methods | RelEx | 82.0 | 72.0 | 77.0 | 76.0 | 64.0 | 69.0 | 74.0 | 61.0 | 67.0 | 40.0 | 50.0 | 44.0 |
| | Kabiljo et al. | 76.7 | 40.2 | 52.8 | 52.0 | 55.8 | 53.8 | 66.2 | 51.3 | 57.8 | 29.1 | 52.9 | 37.5 |
Performance comparison of our three approaches (BEST1G, U3G, O2G) with related research on the four corpora LLL, HPRD50, IEPA, and AIMed. Results of co-occurrence and rule-based methods are also listed as baselines. Precision (P), recall (R), and F-score (F) values are shown as percentages (%). Bold typeface marks the best results among feature-based and kernel-based methods per corpus in terms of precision, recall, and F-score
Lexical features obtained directly from sentences
| Features | Definitions/Remarks | Values | Examples |
|---|---|---|---|
| Keyword | Words indicating a relationship between the two proteins. | One of the 180 kinds of words obtained by stemming 642 kinds of words. | In sentence IEPA.d0.s0 (Fig. |
| Negative word | Check if a negative word appears in the sentence. | ‘true’ or ‘false’ | In sentence HPRD50.d21.s1 of HPRD50 corpus, |
| Conjunctive word | Check if one of the words indicating a conjunctive relation appears. | ‘true’ or ‘false’ | In sentence HPRD50.d21.s1 above, feature value is ‘false’. |
| ‘Which’ | Check if ‘which’ appears. Although ‘which’ also shows conjunctive relations, because ‘which’ appears more often than the conjunctive words listed above, we differentiate it from the above features. | ‘true’ or ‘false’ | In sentence LLL.d13.s0 of LLL corpus, |
| ‘But’ | Check if ‘but’ appears. Although ‘but’ appears about as frequently as ‘which’ to represent conjunctive relations, ‘but’ implies negation of context. | ‘true’ or ‘false’ | In sentence AIMed.d55.s485 of AIMed corpus, |
| Words indicating condition or presumption | Check if ‘if’ or ‘whether’ appears between the protein pair. | ‘true’ or ‘false’ | In sentence IEPA.d0.s0 (Fig. |
| Preposition of keyword | The preposition following the keyword K. | One of the prepositions. | In sentence AIMed.d55.s487 of AIMed corpus, |
| Second keyword | Check the appearance of each of seven specific words that can also act as keywords; since only one word in a sentence is chosen as the keyword feature, these seven words are checked separately. | ‘true’ or ‘false’ for each of these seven words (if one of these seven words appears in the sentence and is not chosen as the keyword, its value is ‘true’). | In sentence IEPA.d0.s0 (Fig. |
Lexical features extracted from sentences
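Several of the boolean lexical features above reduce to word-list membership tests. A minimal sketch; the negative and conjunctive word lists here are small illustrative stand-ins, since the paper's full lists were not reproduced in this record.

```python
# Illustrative word lists (assumptions, not the paper's full lists)
NEGATIVE_WORDS = {"not", "no", "unable", "fail", "without"}
CONJUNCTIVE_WORDS = {"while", "whereas", "although", "though"}

def lexical_features(tokens):
    """Boolean lexical features for one tokenized sentence."""
    lower = [t.lower() for t in tokens]
    return {
        "Negative_word": any(t in NEGATIVE_WORDS for t in lower),
        "Conjunctive_word": any(t in CONJUNCTIVE_WORDS for t in lower),
        "Which": "which" in lower,   # tracked separately: more frequent
        "But": "but" in lower,       # tracked separately: implies negation
    }
```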