Shun Koyabu, Thi Thanh Thuy Phan, Takenao Ohkawa.
Abstract
A machine learning approach is useful for automatically extracting protein-protein interaction (PPI) information from scientific articles. A classifier is trained on data represented by several features to decide whether the protein pair in a sentence interacts. Specific keywords that are directly related to interaction, such as "bind" or "interact", play an important role in training classifiers. We call such a keyword, which affects the capability of the classifier, a dominant keyword. Although it is important to identify dominant keywords, whether a keyword is dominant depends on the context in which it occurs. We therefore propose a method for predicting, for each instance, whether its keyword is dominant. In this method, a keyword that yields imbalanced classification results is tentatively assumed to be a dominant keyword. Classifiers are then trained separately on the instances with and without the assumed dominant keywords. The validity of the assumption is evaluated from the classification results of these classifiers, and the assumption is updated according to the evaluation. Repeating this process increases the prediction accuracy of the dominant keyword. Experimental results on five corpora show the effectiveness of the proposed method with dominant keyword prediction.
Year: 2015 PMID: 26783534 PMCID: PMC4689882 DOI: 10.1155/2015/928531
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Features obtained directly from sentences.
| Features | Definitions/remarks | Values | Examples |
|---|---|---|---|
| Keywords | Words representing relationship between two proteins | One of the 180 kinds of words obtained by stemming 642 kinds of words such as | |
| Distance between protein pair and keyword: three types | The word distance defined by the number of words appearing between keyword mentioned above and protein names constituting the protein pair; Type 1 is the distance between | Integer value | In sentence |
| Position of keyword: three types | The word order of protein pair and keyword | “Infix” (the order of the sentence is [ | In sentence |
| Position of protein names | The value adding word distance between the word at the beginning of the sentence and the protein name to one; positions 1 and 2 are defined for | Integer value | In sentence |
| Comma between keyword and protein pair: four types | Since the topic often changes before and after a comma, we use such information if there is any comma between the keyword and the protein pair | “yy”, “nn”, “yn”, or “ny” (e.g., “yy” means commas are observed between | In sentence |
| Negative words | Whether any negative word such as | “True” or “false” | In sentence |
| Conjunctive words | Whether one of the following 16 kinds of words representing conjunctive relations appears: | “True” or “false” | In sentence |
| “Which” | Whether “which” appears; since “which” also represents the conjunctive relation but occurs more frequently than the 16 words mentioned above, we distinguish “which” from the above features | “True” or “false” | |
| “But” | Whether “but” appears; in addition to “which”, “but” also frequently represents the conjunctive relation; however, “but” introduces negation to the context | “True” or “false” | |
| Words representing assumptions or conditions | Whether “if” or “whether” appears between the protein names or the keyword and the protein name | “True” or “false” | |
| Preposition of keyword | The preposition following the keyword providing that the word distance between the keyword and the preposition is within 3; if there are many prepositions, the preposition is used whose word distance from the keyword is nearer | One of the prepositions | In sentence “ |
| Multiple occurrences of keywords | Whether there is more than one keyword in a sentence | “True” or “false” | In sentence “ |
| Second keywords: seven kinds | Only one of seven particular words: “ | “True” or “false” for each of the seven words (if some of these seven words appear in the sentence and are not selected as a keyword, we use “true” as a feature value for them) | In sentence “ |
| Parallel expression of protein pair | Whether the protein names constituting the protein pair are adjacent (they are also considered adjacent even if “—”, “/”, “and”, “or”, “(” appears between them); if protein names are expressed in parallel in a sentence, interaction between them is difficult; we can easily determine the parallel expression of a protein pair in a sentence by determining whether these protein names are adjacent in the word order of that sentence | “True” or “false” | In sentence “ |
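
As a rough illustration of how a few of the sentence-level features in the table above could be computed, the following Python sketch derives the keyword position, two word distances, and the comma feature from a tokenized sentence. The function names, the index-based inputs, the reading of "prefix"/"postfix" as "keyword before/after both protein names", and the reading of the two comma characters as "comma between keyword and first/second protein name" are all assumptions made for illustration; the keyword is also left unstemmed here.

```python
def words_between(i, j):
    """Number of tokens strictly between token positions i and j."""
    return max(abs(i - j) - 1, 0)


def sentence_features(tokens, p1, p2, kw):
    """Compute a few sentence-level features for one protein pair.

    tokens: list of words; p1 < p2: indices of the two protein names;
    kw: index of the keyword. The definitions are illustrative assumptions.
    """
    # Keyword position relative to the protein pair (assumed definition).
    if p1 < kw < p2:
        position = "infix"
    elif kw < p1:
        position = "prefix"
    else:
        position = "postfix"

    def comma_flag(i, j):
        # "y" if a comma token lies strictly between positions i and j.
        lo, hi = sorted((i, j))
        return "y" if "," in tokens[lo + 1:hi] else "n"

    return {
        "keyword": tokens[kw],
        "position": position,
        "dist_kw_p1": words_between(kw, p1),
        "dist_kw_p2": words_between(kw, p2),
        "comma": comma_flag(kw, p1) + comma_flag(kw, p2),
    }


tokens = "PROT1 binds to PROT2 , but not PROT3".split()
print(sentence_features(tokens, p1=0, p2=3, kw=1))
```

Each feature is computed per protein pair and keyword, so a sentence that mentions several proteins yields one feature set per candidate pair.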
Features obtained from parsing information.
| Features | Definitions/remarks | Values | Examples |
|---|---|---|---|
| Height of protein pair and keyword: three types | The heights of the protein names constituting the protein pair and the keyword at the parse tree structure: these heights differ from word distances; features height_ | Integer value | In |
| Part-of-speech information of protein pair and keyword: three types | The part-of-speech information of PATH (the path from the root) at the parse tree structure of the protein names constituting the protein pair and the keyword; it is possible to represent the syntax structure and train classifiers to learn pseudo grammar structure; features POS_ | List of part-of-speech information of PATH | In |
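
The two parse-based features above can be sketched on a toy parse tree. The snippet below is a minimal illustration, assuming the tree is given as nested tuples `(label, child, ...)` with leaves `(POS, word)`, and assuming "height" means the depth of the word's node counted from the root; the function name `pos_path` and the example tree are hypothetical, and a real system would take the tree from a syntactic parser.

```python
def pos_path(tree, word):
    """Return the node labels from the root down to `word`, or None if absent."""
    label, *children = tree
    # A leaf has the form (POS, word).
    if len(children) == 1 and isinstance(children[0], str):
        return [label] if children[0] == word else None
    for child in children:
        sub = pos_path(child, word)
        if sub is not None:
            return [label] + sub
    return None


# Hypothetical parse of "PROT1 binds to PROT2".
tree = ("S",
        ("NP", ("NN", "PROT1")),
        ("VP", ("VBZ", "binds"),
               ("PP", ("TO", "to"), ("NP", ("NN", "PROT2")))))

for w in ("PROT1", "binds", "PROT2"):
    path = pos_path(tree, w)
    # Height is taken here as the number of edges from the root (an assumption).
    print(w, path, len(path) - 1)
```

In this toy tree the keyword is adjacent to the first protein name in word order but sits at a different depth from the second, which is the sense in which heights differ from word distances.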
Figure 1: Example of parse tree.
Set of 13 PPI patterns.
| Number | PPI pattern |
|---|---|
| Pattern 1 | |
| Pattern 2 | |
| Pattern 3 | |
| Pattern 4 | |
| Pattern 5 | |
| Pattern 6 | |
| Pattern 7 | |
| Pattern 8 | complex between |
| Pattern 9 | complex of |
| Pattern 10 | |
| Pattern 11 | |
| Pattern 12 | |
| Pattern 13 | between |
Division of training set.
| Subset | Dominant keyword | Position of keyword |
|---|---|---|
| II | Included | Infix |
| IP | Included | Prefix/postfix |
| NI | Not included | Infix |
| NP | Not included | Prefix/postfix |
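
The division in the table above amounts to a two-letter code: the first letter records whether a dominant keyword is included, the second whether the keyword is infix. A minimal sketch follows; the function names and the dictionary layout of an instance are hypothetical.

```python
from collections import defaultdict


def assign_subset(has_dominant_keyword, keyword_position):
    """Map one instance to II, IP, NI, or NP as in the division table."""
    position = keyword_position.lower()
    if position not in {"infix", "prefix", "postfix"}:
        raise ValueError(f"unknown keyword position: {keyword_position!r}")
    first = "I" if has_dominant_keyword else "N"   # dominant keyword included?
    second = "I" if position == "infix" else "P"   # infix vs. prefix/postfix
    return first + second


def divide_training_set(instances):
    """Group instances (dicts with 'dk' and 'position' keys) by subset."""
    subsets = defaultdict(list)
    for inst in instances:
        subsets[assign_subset(inst["dk"], inst["position"])].append(inst)
    return dict(subsets)


example = [
    {"id": 1, "dk": True, "position": "infix"},
    {"id": 2, "dk": True, "position": "postfix"},
    {"id": 3, "dk": False, "position": "prefix"},
]
print({name: [i["id"] for i in group]
       for name, group in divide_training_set(example).items()})
```

One classifier can then be trained per subset, so that instances with and without a dominant keyword never share a model.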
Figure 2: Overview of PPI prediction based on division of training set.
Figure 3: General flow of updating DK values.
Algorithm 1: Procedure for updating DK values.
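
The general flow described in the abstract (tentatively assume, train separately, evaluate, update, repeat) can be sketched as below. This is a toy illustration under strong assumptions, not the procedure of Algorithm 1: the imbalance test uses gold labels per keyword as a stand-in for classification results, the classifier is a per-keyword majority vote, an instance's flag is flipped when the other group's classifier predicts its label correctly and its own group's classifier does not, and the parameter `threshold` is a hypothetical name that is not claimed to be the paper's T.

```python
from collections import Counter, defaultdict


def init_dk(instances, threshold=0.2):
    """Tentatively flag instances whose keyword has an imbalanced label ratio."""
    counts = defaultdict(Counter)
    for inst in instances:
        counts[inst["keyword"]][inst["label"]] += 1
    for inst in instances:
        c = counts[inst["keyword"]]
        positive_ratio = c[True] / (c[True] + c[False])
        inst["dk"] = abs(positive_ratio - 0.5) >= threshold


def train_majority(instances):
    """Toy classifier: per-keyword majority label, else the group's majority."""
    by_kw, overall = defaultdict(Counter), Counter()
    for inst in instances:
        by_kw[inst["keyword"]][inst["label"]] += 1
        overall[inst["label"]] += 1
    default = overall.most_common(1)[0][0] if overall else False
    table = {kw: c.most_common(1)[0][0] for kw, c in by_kw.items()}
    return lambda inst: table.get(inst["keyword"], default)


def update_dk(instances, max_rounds=10):
    """Repeat: train on both groups, evaluate each instance, update its flag."""
    for _ in range(max_rounds):
        clf_dk = train_majority([i for i in instances if i["dk"]])
        clf_no = train_majority([i for i in instances if not i["dk"]])
        changed = False
        for inst in instances:
            own, other = (clf_dk, clf_no) if inst["dk"] else (clf_no, clf_dk)
            # Flip the assumption only when the other group explains it better.
            if own(inst) != inst["label"] and other(inst) == inst["label"]:
                inst["dk"] = not inst["dk"]
                changed = True
        if not changed:
            break
    return instances


labels = {"bind": [True, True, True, False],
          "inhibit": [False, False, False, False],
          "associate": [True, True, False, False, False]}
data = [{"keyword": kw, "label": y} for kw, ys in labels.items() for y in ys]
init_dk(data)
update_dk(data)
print([(d["keyword"], d["dk"]) for d in data])
```

In this toy run the flag is kept per instance rather than per keyword, so one occurrence of "bind" ends up unflagged while the others stay flagged, which mirrors the point that dominance depends on context.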
Removed features for each training subset.
| Subset | Removed features |
|---|---|
| II | Patterns 7, 8, 9, and 13 |
| IP | Patterns 1, 2, 10, and 12 |
| NI | Patterns 7, 8, 9, and 13 |
| NP | Patterns 1, 2, 10, and 12 |
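
The table above can be encoded as a lookup from subset to removed pattern numbers. The sketch below is a minimal illustration; the names `REMOVED_PATTERNS` and `active_patterns` are hypothetical.

```python
# Pattern numbers removed for each training subset, copied from the table.
REMOVED_PATTERNS = {
    "II": {7, 8, 9, 13},
    "IP": {1, 2, 10, 12},
    "NI": {7, 8, 9, 13},
    "NP": {1, 2, 10, 12},
}


def active_patterns(subset, n_patterns=13):
    """Return the pattern numbers still used as features for `subset`."""
    removed = REMOVED_PATTERNS[subset]
    return [p for p in range(1, n_patterns + 1) if p not in removed]


for name in ("II", "IP", "NI", "NP"):
    print(name, active_patterns(name))
```

The removed patterns depend only on the keyword position: the two infix subsets (II, NI) drop the same four patterns, and the two prefix/postfix subsets (IP, NP) drop the same four others.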
Experimental results.
| Corpus | LLL | | | HPRD50 | | | IEPA | | | AImed | | | BioInfer | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (%) | | | | | | | | | | | | | | | |
| SC | 85.4 | 79.1 | 82.1 | 70.1 | 75.2 | 72.3 | 63.6 | 68.9 | 66.1 | 49.5 | 67.9 | 57.3 | 67.7 | 74.8 | 71.1 |
| MC | 85.4 | 81.9 | 83.6 | 72.4 | 71.5 | 72.0 | 64.2 | 70.3 | 67.1 | 54.4 | 67.7 | | 68.1 | 74.3 | 71.1 |
| DK-MC | 84.8 | 84.8 | 84.8 | 77.3 | 72.8 | 75.0 | 66.9 | 71.3 | 69.0 | 54.4 | 66.8 | 60.0 | 69.5 | 74.3 | 71.7 |
| FS-MC | 87.8 | 81.8 | 84.7 | 77.3 | 73.7 | 75.4 | 65.6 | 72.6 | 69.0 | 51.8 | 66.7 | 58.3 | 69.1 | 75.0 | 71.9 |
| DK-FS-MC | 86.6 | 83.5 | | 77.9 | 76.0 | | 67.2 | 71.4 | | 55.0 | 66.0 | 60.0 | 70.8 | 74.8 | |
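
The three values reported per corpus appear to be precision, recall, and F-value. That reading is an assumption, but the third value of each triple is consistent with the harmonic mean of the first two, as the short check below illustrates on three rows of the LLL corpus. The function name `f_value` is hypothetical.

```python
def f_value(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (first value, second value, reported third value) for the LLL corpus.
lll_rows = {
    "SC": (85.4, 79.1, 82.1),
    "DK-MC": (84.8, 84.8, 84.8),
    "FS-MC": (87.8, 81.8, 84.7),
}

for method, (p, r, reported) in lll_rows.items():
    print(method, round(f_value(p, r), 1), reported)
```

Because the harmonic mean is symmetric, this check does not reveal which of the first two columns is precision and which is recall.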
Influence of value of T on F-values.
| T | LLL | HPRD50 | IEPA | AImed | BioInfer |
|---|---|---|---|---|---|
| 0.15 | | 77.0 | | | |
| 0.20 | 83.7 | | 68.4 | 59.4 | 71.6 |
| 0.25 | 81.6 | 74.6 | 67.7 | 58.9 | 71.6 |
| 0.30 | 83.3 | 72.0 | 67.5 | | 71.9 |
| 0.35 | 84.3 | 73.0 | 66.1 | 59.3 | 71.9 |
Performance comparison of PPI extraction.
| Corpus | Method | | | |
|---|---|---|---|---|
| LLL | Fundel et al. [ | 79.0 | | 82.0 |
| | Fayruzov et al. [ | 86.0 | 72.0 | 78.0 |
| | Van Landeghem et al. [ | 84.0 | 79.0 | 82.0 |
| | DK-FS-MC | | 83.5 | |
| HPRD50 | Van Landeghem et al. [ | 71.0 | 71.0 | 71.0 |
| | DK-FS-MC | | | |
| IEPA | Van Landeghem et al. [ | | | |
| | DK-FS-MC | 67.2 | 71.4 | 69.2 |
| AImed | Giuliano et al. [ | | 64.5 | |
| | Mitsumori et al. [ | 53.6 | 55.7 | 54.3 |
| | Fayruzov et al. [ | 50.0 | 41.0 | 45.0 |
| | Van Landeghem et al. [ | 58.0 | 66.0 | 62.0 |
| | Edit of Erkan et al. [ | 43.5 | | 55.6 |
| | Cosine of Erkan et al. [ | 55.0 | 62.0 | 58.1 |
| | DK-FS-MC | 55.0 | 66.0 | 60.0 |
Number of positive/negative pairs in AImed corpus applied in our work and existing works.
| | Our work | Mitsumori et al. [ | Giuliano et al. [ | Van Landeghem et al. [ | Erkan et al. [ | Fayruzov et al. [ |
|---|---|---|---|---|---|---|
| Positive pairs | 1,000 | 1,107 | 1,008 | 1,000 | 951 | 816 |
| Negative pairs | 4,834 | 4,369 | 4,634 | 4,670 | 3,075 | 3,204 |