Richard Tzong-Han Tsai, Hsi-Chuan Hung, Hong-Jie Dai, Yi-Wen Lin, Wen-Lian Hsu.
Abstract
BACKGROUND: Experimentally verified protein-protein interactions (PPI) cannot be easily retrieved by researchers unless they are stored in PPI databases. The curation of such databases can be made faster by ranking newly-published articles' relevance to PPI, a task which we approach here by designing a machine-learning-based PPI classifier. All classifiers require labeled data, and the more labeled data available, the more reliable they become. Although many PPI databases with large numbers of labeled articles are available, incorporating these databases into the base training data may actually reduce classification performance since the supplementary databases may not annotate exactly the same PPI types as the base training data. Our first goal in this paper is to find a method of selecting likely positive data from such supplementary databases. Only extracting likely positive data, however, will bias the classification model unless sufficient negative data is also added. Unfortunately, negative data is very hard to obtain because there are no resources that compile such information. Therefore, our second aim is to select such negative data from unlabeled PubMed data. Thirdly, we explore how to exploit these likely positive and negative data. And lastly, we look at the somewhat unrelated question of which term-weighting scheme is most effective for identifying PPI-related articles.
Year: 2008 PMID: 18315856 PMCID: PMC2259404 DOI: 10.1186/1471-2105-9-S1-S3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
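The abstract's last question concerns which term-weighting scheme best identifies PPI-related articles. As a hedged illustration of one standard scheme (not necessarily the one the paper finds best), the sketch below computes TF-IDF weights over a toy corpus of tokenized abstracts; the corpus and function name are invented for the example:

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Compute TF-IDF weights for a small corpus of tokenized abstracts.

    Returns one dict mapping term -> weight per document.
    """
    n_docs = len(docs)
    # Document frequency: number of abstracts containing each term at least once.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({
            # Term frequency normalized by document length, scaled by
            # inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weighted

docs = [
    "protein interaction binding".split(),
    "protein expression assay".split(),
]
w = tfidf_weights(docs)
# "protein" occurs in every abstract, so its idf (and hence weight) is 0;
# corpus-wide terms carry no discriminative signal under this scheme.
```

Terms that occur in every document receive zero weight, which is exactly why such schemes help separate PPI-relevant vocabulary from generic biomedical terms.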
Figure 1 A PPI record in the MINT database.
Figure 2 An overview of our protein-protein interaction text classification system.
Datasets used in our experiment
| Dataset | | Size (# of abstracts) |
| Training | True positive (TP) | 3,536 |
| | True negative (TN) | 1,959 |
| | Likely-positive (LP) | 18,930 |
| | Unlabeled (U) | 105,000 |
| Test | Positive | 338 |
| | Negative | 339 |
The selected likely datasets
| Dataset | Size (# of abstracts) |
| Selected Likely-positive (LP*) | 8,862 |
| Selected Likely-negative (LN*) | 10,000 |
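The tables above show that only subsets of the likely-positive pool (8,862 of 18,930) and the unlabeled pool (10,000 of 105,000) are selected. The record does not give the paper's exact selection criteria, but a common approach is to score candidate abstracts with a base classifier and keep only the confidently positive and confidently negative ones. The sketch below illustrates that idea; the thresholds and data are hypothetical:

```python
def select_likely(scored, pos_threshold=0.8, neg_threshold=0.2):
    """Split scored abstracts into likely-positive and likely-negative sets.

    `scored` is a list of (abstract_id, score) pairs, where score is a base
    classifier's estimated probability that the abstract describes a PPI.
    Thresholds here are illustrative, not the paper's actual values.
    """
    likely_pos = [doc_id for doc_id, s in scored if s >= pos_threshold]
    likely_neg = [doc_id for doc_id, s in scored if s <= neg_threshold]
    # Mid-range scores are discarded: neither set should contain
    # ambiguous abstracts, or the augmented training data gets noisy.
    return likely_pos, likely_neg

scored = [("a1", 0.95), ("a2", 0.10), ("a3", 0.55), ("a4", 0.85)]
lp, ln = select_likely(scored)
# lp == ["a1", "a4"]; ln == ["a2"]; "a3" is dropped as ambiguous
```

Discarding the ambiguous middle band is what keeps the selected LP*/LN* sets smaller than their source pools.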
Figure 3 Impact of adding likely data on different term weighting schemes (F-measure).
Figure 4 Impact of adding likely data on different term weighting schemes (AUC).
Figure 5 Impact of applying different term weighting schemes (F-measure). The rank 1 setting denotes the highest F-measure among all participants in BioCreAtIvE-II IAS.
Figure 6 Impact of applying different term weighting schemes (AUC). The rank 1 setting denotes the highest AUC among all participants in BioCreAtIvE-II IAS.
The contingency table for document frequency of term t in different classes. ¬t stands for all words other than t.
| Class | t | ¬t |
| Positive | | |
| Negative | | |
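Contingency tables like the one above feed class-sensitive term-weighting schemes. As a hedged example (the record does not say which statistic the paper uses), the chi-square statistic measures how strongly a term's document frequency is associated with the positive class; the function and counts below are illustrative:

```python
def chi_square(tp, fp, fn, tn):
    """Chi-square statistic for term/class association from document frequencies.

    tp: positive abstracts containing t;  fp: negative abstracts containing t;
    fn: positive abstracts without t;     tn: negative abstracts without t.
    """
    n = tp + fp + fn + tn
    num = n * (tp * tn - fp * fn) ** 2
    den = (tp + fp) * (fn + tn) * (tp + fn) * (fp + tn)
    # A zero denominator means the term (or its absence) never varies,
    # so it carries no class information.
    return num / den if den else 0.0

# A term concentrated in positive abstracts scores high...
strong = chi_square(40, 10, 10, 40)
# ...while a term spread evenly across classes scores zero.
uninformative = chi_square(25, 25, 25, 25)
```

Ranking terms by such a statistic lets the classifier emphasize vocabulary that discriminates PPI-relevant abstracts from the rest.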
Figure 7 The flowchart of constructing the mixed model.
Figure 8 The flowchart of constructing the hierarchical model.