Yanpeng Li, Xiaohua Hu, Hongfei Lin, Zhihao Yang.
Abstract
BACKGROUND: Extracting protein-protein interactions from biomedical literature is an important task in biomedical text mining. Supervised machine learning methods have been used with great success in this task, but they tend to suffer from data sparseness because they can obtain knowledge only from a limited amount of labelled data. In this work, we study the use of unlabeled biomedical texts to enhance the performance of supervised learning for this task. We use feature coupling generalization (FCG) - a recently proposed semi-supervised learning strategy - to learn an enriched representation of local contexts in sentences from 47 million unlabeled examples and investigate the performance of the new features on the AIMED corpus.
Year: 2010 PMID: 20406505 PMCID: PMC3166043 DOI: 10.1186/1471-2105-11-S2-S7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Examples of EDFs
| Current example: | The results show that prot1 heterodimerizes with prot0 and prot2 in vivo , but it does not homodimerize to a measurable extent . |
|---|---|
| SP-EDF: | |
| CP-EDF: | |
| DS-EDF: | |
In SP-EDF, the distance between a protein and an n-gram is the word count between the protein and the last token of the n-gram. In CP-EDF, the distance between the two proteins is 5, so all the features are joined with this distance.
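The distance bookkeeping described above can be sketched in a few lines of Python. This is an illustrative reading of the SP-EDF and CP-EDF definitions, not the authors' implementation; the function names and the feature string format are invented here.

```python
def sp_edf_features(tokens, protein_idx, n=2, max_dist=5):
    """Single-protein EDFs: n-grams near one protein, tagged with the
    word count between the protein and the last token of the n-gram
    (a hypothetical reading of the SP-EDF distance rule)."""
    feats = []
    for start in range(len(tokens) - n + 1):
        last = start + n - 1
        if start <= protein_idx <= last:
            continue  # skip n-grams that contain the protein token
        dist = abs(protein_idx - last) - 1  # words strictly in between
        if dist <= max_dist:
            feats.append("SP:%s@%d" % ("_".join(tokens[start:last + 1]), dist))
    return feats


def cp_edf_features(tokens, idx1, idx2, n=1):
    """Couple-protein EDFs: n-grams between a protein pair, each joined
    with the distance between the two proteins."""
    dist = abs(idx1 - idx2)
    lo, hi = sorted((idx1, idx2))
    return ["CP:%s|d=%d" % ("_".join(tokens[i:i + n]), dist)
            for i in range(lo + 1, hi - n + 1)]


tokens = ("The results show that prot1 heterodimerizes with prot0 "
          "and prot2 in vivo").split()
# prot1 and prot2 are 5 tokens apart, as in the running example above
print(cp_edf_features(tokens, tokens.index("prot1"), tokens.index("prot2")))
```

For the prot1/prot2 pair, each unigram between the two proteins is emitted joined with the pair distance d=5, matching the description above.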
Examples of CDFs
| CDFs | Chi-square | CDFs | Chi-square |
|---|---|---|---|
| | 219.34 | | 93.10 |
Since the CDFs extracted from the training set in each round of ten-fold cross-validation overlap heavily, we list only the top results of the first round. "Chi-square" is the score given by the Chi-square feature selection method.
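The Chi-square score used here is the standard statistic over a 2×2 contingency table of CDF occurrence versus class label. A minimal sketch (the counts and CDF names below are made up for illustration):

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 table: n11 = positive examples
    containing the CDF, n10 = positive examples without it, n01/n00
    likewise for negative examples."""
    n = n11 + n10 + n01 + n00
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return n * (n11 * n00 - n10 * n01) ** 2 / den if den else 0.0


# Rank candidate CDFs by score and keep the top ones, as in the table above.
scores = {"interacts": chi_square(40, 60, 10, 890),
          "binds": chi_square(25, 75, 5, 895)}
top_cdfs = sorted(scores, key=scores.get, reverse=True)
```

A CDF that is balanced across classes scores zero; the more its presence skews toward one class, the higher the score.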
Figure 1. An example that shows how FCG generates new features for the PPIE task. Here only CP-EDFs are considered. They are divided into four groups according to their EDF roots. A CDF is denoted by c. Since only one FCD type is used here, the FCD features are indexed by the conjunction of EDF roots and CDFs.
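The construction in Figure 1 can be approximated as follows: count, over the unlabeled examples, how often each EDF co-occurs with each CDF, then represent a labelled example by features indexed by (EDF, CDF) conjunctions. This is a simplified sketch of FCG under one FCD type; the grouping by EDF roots and the exact coupling statistic are abstracted away, and all names and data are illustrative.

```python
from collections import defaultdict


def count_cooccurrence(unlabeled_examples, cdfs):
    """One pass over unlabeled examples (47 million in the paper),
    counting EDF occurrences and EDF/CDF co-occurrences."""
    edf_count = defaultdict(int)
    co_occur = defaultdict(int)
    for edfs, words in unlabeled_examples:
        for edf in edfs:
            edf_count[edf] += 1
            for cdf in cdfs & words:
                co_occur[(edf, cdf)] += 1
    return edf_count, co_occur


def fcd_vector(example_edfs, cdfs, edf_count, co_occur):
    """New features for a labelled example: indexed by the conjunction
    of an EDF and a CDF, valued by a normalized coupling degree."""
    vec = {}
    for edf in example_edfs:
        for cdf in cdfs:
            c = co_occur.get((edf, cdf), 0)
            if c:
                vec["%s&%s" % (edf, cdf)] = c / edf_count[edf]
    return vec


cdfs = {"interact", "bind"}
unlabeled = [({"with|d=5"}, {"interact", "the"}),
             ({"with|d=5"}, {"the", "cell"})]
edf_count, co_occur = count_cooccurrence(unlabeled, cdfs)
print(fcd_vector({"with|d=5"}, cdfs, edf_count, co_occur))
```

The new feature space is dense where the labelled data is sparse: an EDF seen only once in the training set still gets informative values from its unlabeled co-occurrence statistics.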
Performance of local lexical features
| Feature | Precision | Recall | F-score | AUC |
|---|---|---|---|---|
| F1 | 42.2 | 50.17 | | 78.22 |
| F1+F2 | 45.83 | 61.65 | 50.89 | 78.92 |
| F1+F2+F3 | 54.06 | 60.25 | 56.39 | 83.1 |
| F1+F2+F3+F4 | 60.61 | 63.43 | 61.11 | 85.97 |
| F1+F2+F3+F4+F5 | 62.4 | | | |
F1: GA-BOW; F2: GA-Lex; F3: SA-Lex; F4: SP-Lex; F5: CP-Lex
Performance of FCD features
| ID | Local | SP-EDF | CP-EDF | DS-EDF | Linear | RBF | Precision | Recall | F-score | AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | * | | | | * | | 62.40 | | 61.4 | 86.11 |
| 2 | | * | | | * | | 53 | 64.48 | 57.6 | 81.75 |
| 3 | | | * | | * | | 58.08 | 62.67 | 59.16 | 82.97 |
| 4 | | | | * | * | | 56.83 | 53.55 | 54.33 | 79.26 |
| 5 | | * | | | | * | 54.64 | 57.71 | 54.47 | 78.34 |
| 6 | | | * | | | * | 53.5 | 56.4 | 53.87 | 78.02 |
| 7 | | | | * | | * | 54.45 | 58.71 | 55.56 | 79.79 |
| 8 | | * | * | * | * | | 54.08 | 63.66 | 58.06 | 82.61 |
| 9 | | * | * | * | * | * | 59.7 | 61.52 | 60.06 | 83.78 |
| 10 | * | * | * | * | * | * | 60.47 | | | |
In the RBF SVM runs, SVD is used to reduce the feature dimension and an SVM with an RBF kernel is used to classify examples. Run 10 combines the outputs of Run 1 and Run 9.
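The SVD-plus-RBF setup just described can be sketched with scikit-learn. The data, dimensions, hyperparameters, and the averaging rule for combining two runs are placeholders for illustration, not the paper's settings.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = (rng.rand(60, 300) < 0.05).astype(float)  # stand-in for sparse FCD features
y = rng.randint(0, 2, 60)                     # stand-in interaction labels

# SVD reduces the sparse feature space before the RBF-kernel SVM.
rbf_run = make_pipeline(TruncatedSVD(n_components=50, random_state=0),
                        SVC(kernel="rbf", probability=True, random_state=0))
rbf_run.fit(X, y)


def combine_runs(run_a, run_b, X):
    """Run-10-style combination: average the positive-class scores of
    two trained runs (one plausible way to 'combine outputs')."""
    return (run_a.predict_proba(X)[:, 1] + run_b.predict_proba(X)[:, 1]) / 2


scores = combine_runs(rbf_run, rbf_run, X)
```

Averaging the probability outputs is only one way to combine classifiers; the paper does not specify the exact rule, so treat `combine_runs` as a guess.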
Figure 2. Relation between the performance of FCD features and the number of CDFs. The patterns are selected in descending order of Chi-square score. For all the FCD features, linear classifiers are used.
Figure 3. Relation between the performance of FCD features and the number of unlabeled examples. Here linear classifiers are used.
Comparison with other systems on AIMED corpus
| Method or author | F-score | AUC |
|---|---|---|
| (Miwa et al., 2009) | 65.2 | 89.3 |
| Our method (Combined) | 63.5 | 87.2 |
| (Miwa et al., 2009) | 62.7 (64.3) | 83.2 (87.9) |
| Our method (Lex) | 61.4 | 86.11 |
| Our method (FCD) | 60.1 | 83.8 |
| (Miyao et al., 2008) | 59.5 | - |
| (Airola et al., 2008) | 56.4 | 84.8 |
| (Sætre et al., 2007) | 52.0 | - |
| (Mitsumori et al., 2006) | 47.7 | - |
For the work in [12], the scores in brackets are the reported results obtained after removing all examples with self-interacting protein pairs from the AIMED corpus.
Figure 4. Comparison of local lexical features and FCD features. For the FCD features, non-linear classifiers are used.
Figure 5. Sparse features in FCG and supervised learning. The x-axis is the count of features (EDFs for FCD features) in the AIMED corpus. The y-axis is the F-score obtained by training with all features whose count is less than or equal to the value indicated on the x-axis. For the FCD features, linear classifiers are used.