Hongning Wang, Minlie Huang, Xiaoyan Zhu.
Abstract
BACKGROUND: Considerable efforts have been made to extract protein-protein interactions from the biological literature, but little work has been done on the extraction of interaction detection methods. It is crucial to annotate the detection methods in the literature, since different detection methods shed different degrees of reliability on the reported interactions. However, the diversity of method mentions in the literature makes the automatic extraction quite challenging.
Year: 2009 PMID: 19208158 PMCID: PMC2648772 DOI: 10.1186/1471-2105-10-S1-S55
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
String matching performance.
| Corpus | Precision | Recall | F-score |
| 740 Full Texts | 0.090 | 0.107 | 0.098 |
We apply a string matching algorithm, using all the names/synonyms from the MI ontology, to a set of 740 documents annotated with 96 methods and provided by the BioCreative II challenge evaluation.
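A dictionary-based baseline of this kind can be sketched as follows. The `MI_SYNONYMS` entries below are illustrative stand-ins for the full MI ontology name/synonym list, and `match_methods` is a hypothetical helper, not the paper's implementation:

```python
import re

# Illustrative fragment of an MI ontology name/synonym dictionary
# (identifiers follow the PSI-MI format; the entries are examples only).
MI_SYNONYMS = {
    "MI:0018": ["two hybrid", "yeast two-hybrid", "2-hybrid", "Y2H"],
    "MI:0096": ["pull down", "pull-down", "GST pull-down"],
    "MI:0114": ["x-ray crystallography", "x-ray diffraction"],
}

def match_methods(text):
    """Return the set of MI identifiers whose name or any synonym
    occurs verbatim (case-insensitively) in the document text."""
    found = set()
    lowered = text.lower()
    for mi_id, names in MI_SYNONYMS.items():
        for name in names:
            # word-boundary match to avoid partial-word hits
            if re.search(r"\b" + re.escape(name.lower()) + r"\b", lowered):
                found.add(mi_id)
                break
    return found

doc = "Binding was confirmed by a GST pull-down assay and yeast two-hybrid screening."
print(sorted(match_methods(doc)))  # ['MI:0018', 'MI:0096']
```

Exact matching of this sort explains the low recall in the table: method mentions in full text rarely reproduce the ontology names verbatim.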
Figure 1. Graphical model representation of the CMW model. Following the standard graphical model formalism [19]: nodes represent the random variables and edges indicate possible dependencies. The joint probability can be read off the graph by taking the product of the conditional distributions of nodes given their parents.
Figure 2. Comparison between the traditional approach and the CMW model. In this representation, e denotes the detection methods associated with the document, w denotes the observed words, and t in the right panel denotes the latent topic factors in the CMW model.
Figure 3. Statistics of the corpus. In the whole corpus, the 5 dominant detection methods account for nearly 59.3% of the occurrences, and 86.1% of the methods (99 out of 115) occur in fewer than 10% of the documents.
Figure 4. Method perplexity. Lower perplexity on the testing data indicates better generalization capability. Here we held out 20% of the collection for testing and used the remaining 80% to train the model, in accordance with 5-fold cross-validation.
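Per-word perplexity on held-out data is the usual measure: the exponential of the negative total log-likelihood divided by the total token count. A minimal sketch (the function name and toy numbers are our own, not from the paper):

```python
import math

def perplexity(log_likelihoods, token_counts):
    """Per-word perplexity over held-out documents:
    exp(-(sum of document log-likelihoods) / (total number of tokens)).
    Lower values indicate better generalization."""
    total_ll = sum(log_likelihoods)
    total_tokens = sum(token_counts)
    return math.exp(-total_ll / total_tokens)

# Toy example: two held-out documents with 50 and 30 tokens
print(perplexity([-120.0, -80.0], [50, 30]))  # exp(200/80) = exp(2.5) ≈ 12.18
```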
Figure 5. Performance versus the number of topics. We use the same data set partition as in Figure 4 and evaluate the precision and recall of the CMW model.
Figure 6. Comparison with the baseline models. We compare the F-score of the four models for different proportions of the data used for training. In this comparison, we set the number of topics in the CMW model to 250 and k in the KNN model to 37.
Figure 7. Coverage comparison with the baseline models. We compare the coverage of the four models on the same data set partition as in Figure 4, with the same model parameter settings.
Comparison with BioCreative II best result.
| 0.506 | 0.522 | 0.483 |
We run the CMW model on the BioCreative II testing corpus (300 full-text documents) to compare with the best result reported in the evaluation. We set the number of topics to 300 according to the result in Figure 5. The CMW model achieved competitive results, improving the F-score by 12.4%.
Top 20 relevant terms for methods.
| structure, crystal, residue, molecule, model, site, form, interface, chain, contact, bond, hydrogen, helix, pp, record, helical, window, surface, linker, segment |
| yeast, two-hybrid, interact, assay, fusion, system, plasmid, clone, cdna, screen, bait, sequence, acid, amino, encode, site, pp, record, domain, plant |
| gst, fusion, glutathione, pull-down, assay, interact, bead, buffer, wash, yeast, scopus, min, incubate, two-hybrid, antibody, pp, record, system, plasmid, sequence |
| record, pp, cite, yeast, antibody, strain, panel, anti-flag, saccharomyces, flag, cerevisia, growth, blot, western, flag-tagg, gene, grow, medline, ha, anti-ha |
| control, buffer, pp, record, isi, bait, cancer, antibody, extract, c-terminus, bead, sirna, tumor, stain, gene, yeast, sds, luciferase, embo, cdna |
| antibody, pp, record, extract, yeast, domain, sequence, expression, blot, cdna, clone, activity, luciferase, growth, transfect, acid, fusion, sirna, mmedta, link |
We collect the top 20 terms for 6 different methods from the corpus, according to Eq. (15).
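Given a per-method word distribution (as Eq. (15) provides), each table row amounts to sorting the vocabulary by probability and keeping the top k entries. A minimal sketch with hypothetical names and toy data, not the paper's implementation:

```python
def top_terms(word_probs, vocab, k=20):
    """Rank vocabulary terms by their probability under a method's
    word distribution and return the k most probable terms."""
    ranked = sorted(range(len(vocab)), key=lambda i: word_probs[i], reverse=True)
    return [vocab[i] for i in ranked[:k]]

# Toy vocabulary and distribution for one hypothetical method
vocab = ["structure", "crystal", "yeast", "assay"]
probs = [0.4, 0.3, 0.2, 0.1]
print(top_terms(probs, vocab, k=2))  # ['structure', 'crystal']
```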
Figure 8. Method clustering tree. We use an agglomerative clustering algorithm to perform hierarchical clustering and build up the "pedigree" tree of the detection methods. Red circles in the figure indicate clusters that are correct according to the MI ontology definition.
Figure 9. Relevance distribution in relevant and irrelevant documents. In the diagram, the red line indicates the relevance scores in the relevant document set and the blue dots indicate the relevance scores in the irrelevant document set. Selecting the classification threshold indicated by the green line yields promising classification performance: precision 0.745, recall 0.676, and AUC 0.819.
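The threshold rule and the AUC can be computed directly from the two sets of scores; AUC is the probability that a randomly chosen relevant document outscores a randomly chosen irrelevant one. The toy numbers below are illustrative, not the paper's data:

```python
def precision_recall(relevant_scores, irrelevant_scores, threshold):
    """Classify a document as relevant when its score exceeds the
    threshold, then compute precision and recall of that rule."""
    tp = sum(s > threshold for s in relevant_scores)
    fp = sum(s > threshold for s in irrelevant_scores)
    fn = len(relevant_scores) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def auc(relevant_scores, irrelevant_scores):
    """AUC as the probability that a random relevant document scores
    higher than a random irrelevant one (ties count as 0.5)."""
    wins = sum(
        1.0 if r > i else 0.5 if r == i else 0.0
        for r in relevant_scores
        for i in irrelevant_scores
    )
    return wins / (len(relevant_scores) * len(irrelevant_scores))

rel = [0.9, 0.8, 0.6, 0.4]  # scores of relevant documents (toy)
irr = [0.5, 0.3, 0.2, 0.1]  # scores of irrelevant documents (toy)
print(precision_recall(rel, irr, 0.45))  # (0.75, 0.75)
print(auc(rel, irr))  # 0.9375
```

Sweeping the threshold trades precision against recall, which is exactly the choice the green line in the figure represents.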