Minlie Huang, Shilin Ding, Hongning Wang, Xiaoyan Zhu.
Abstract
BACKGROUND: Deciphering physical protein-protein interactions is fundamental to elucidating both the functions of proteins and biological processes. The development of high-throughput experimental technologies such as yeast two-hybrid screening has produced an explosion in interaction data. Because manual curation is costly in both time and money, there is an urgent need for text-mining tools to facilitate the extraction of such information. The BioCreative (Critical Assessment of Information Extraction systems in Biology) challenge evaluation provided common standards and shared evaluation criteria to enable comparisons among different approaches.
Year: 2008 PMID: 18834490 PMCID: PMC2559983 DOI: 10.1186/gb-2008-9-s2-s12
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
The average Kullback-Leibler divergence between the distributions of different datasets
| Compared distributions | Term feature | | String feature | |
| | Pr(x\|c+) | Pr(x\|c-) | Pr(x\|c+) | Pr(x\|c-) |
| Dist on the remaining training dataset versus Dist on the leave-out dataset | 0.0216 | 0.0703 | 0.0029 | 0.0163 |
| Dist on the remaining training dataset versus Dist on the official test dataset | 0.0369 | 0.9926 | 0.0357 | 0.1887 |
The table shows the average Kullback-Leibler divergence between the distributions estimated on the leave-out dataset, the remaining training dataset, and the official test dataset. Dist, distribution.
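The divergence reported in the table can be illustrated with a minimal sketch. It assumes that each Pr(x|c) is treated as a normalized discrete distribution over features, that the natural logarithm is used, and that the average is taken over several dataset splits; none of these details is stated here. The function names and the `eps` floor for unseen features are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)).

    p and q map each feature to its probability. Features missing from q
    are floored at eps (an assumption) to avoid division by zero.
    """
    total = 0.0
    for feature, px in p.items():
        if px > 0:
            qx = max(q.get(feature, 0.0), eps)
            total += px * math.log(px / qx)
    return total

def average_kl(pairs):
    """Average KL divergence over (p, q) pairs, e.g. one pair per split."""
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)

# Toy distributions, not data from the paper.
train = {"interact": 0.5, "bind": 0.5}
test = {"interact": 0.25, "bind": 0.75}
print(kl_divergence(train, test))
```

A value near zero means the two datasets assign similar probabilities to the features; larger values indicate a distribution shift between them.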
Figure 1. The probability of a feature x occurring in irrelevant articles. The figure shows Pr(x|c-), the probability of a feature x occurring in irrelevant articles, for the leave-out dataset, the remaining training dataset, and the official test dataset (only 40 features are shown).
Article filtering performance with different features and classifiers
| Model | Precision | Recall | F1 score | AUC |
| Mean | 0.6642 | 0.7636 | 0.6868 | 0.7351 |
| Standard deviation | 0.0810 | 0.1926 | 0.1035 | 0.0741 |
| Best reported in terms of AUC [ | 0.7080 | 0.8609 | 0.7770 | 0.8554 |
| Our results in BioCreative 2006 | 0.7507 | 0.8107 | 0.7795 | 0.8471 |
| Term (baseline) | 0.7016 | 0.8213 | 0.7568 | 0.8037 |
| String | 0.7044 | 0.8960 | 0.7887 | 0.8416 |
| Named entity (NE) | 0.5815 | 0.9600 | 0.7243 | 0.7570 |
| Template | 0.7841 | 0.7653 | 0.7746 | 0.8239 |
| String + NE | 0.7360 | 0.8773 | 0.8005 | 0.8479 |
| String + template | 0.7416 | 0.8880 | 0.8082 | 0.8372 |
| String + NE + template | 0.7585 | 0.8373 | 0.7959 | 0.8507 |
| String + term + NE + template | 0.7432 | 0.8720 | 0.8025 | 0.8608 |
| Naïve Bayes classifier | 0.6321 | 0.8613 | 0.7291 | 0.7884 |
| Multinomial classifier | 0.6264 | 0.8720 | 0.7290 | 0.7770 |
| Linear kernel SVM | 0.7016 | 0.8213 | 0.7568 | 0.8037 |
| | 0.7352 | 0.8293 | 0.7794 | 0.8376 |
| Integration of the above four classifiers (AdaBoost) | 0.7995 | 0.8933 | 0.8438 | 0.8746 |
This table shows the experimental results for article filtering. AUC, area under the receiver operating characteristic curve; SVM, support vector machine.
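The F1 scores in the table are consistent with the standard harmonic mean of precision and recall. The sketch below recomputes F1 for three rows as a sanity check; the function name is hypothetical. Small differences in the fourth decimal place are expected because the tabulated precision and recall are already rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) copied from the table above.
rows = {
    "Term (baseline)": (0.7016, 0.8213, 0.7568),
    "String": (0.7044, 0.8960, 0.7887),
    "AdaBoost integration": (0.7995, 0.8933, 0.8438),
}

for name, (p, r, reported) in rows.items():
    print(f"{name}: computed {f1_score(p, r):.4f}, reported {reported:.4f}")
```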
Comparative results for protein name normalization
| | | Precision | Recall | F1 score |
| Average | Mean | 0.1495 | 0.2828 | 0.1707 |
| | Standard deviation | 0.0963 | 0.1294 | 0.0764 |
| | Median | 0.1337 | 0.2723 | 0.1683 |
| Our results | Baseline | 0.2223 | 0.1024 | 0.1402 |
| | + entry curation | 0.2345 | 0.2648 | 0.2487 |
| | + organism context | 0.3483 | 0.2410 | 0.2849 |
The table shows the comparative results when identifying and normalizing protein names.
Comparative results for interaction pair extraction
| Compared models | Whole collection | | | SwissProt only article collection | | |
| | Precision | Recall | F1 score | Precision | Recall | F1 score |
| Mean | 0.1062 | 0.1858 | 0.1035 | 0.1160 | 0.2000 | 0.1127 |
| Standard deviation | 0.0945 | 0.1001 | 0.0761 | 0.1035 | 0.1062 | 0.0836 |
| Median | 0.0755 | 0.1961 | 0.0788 | 0.0808 | 0.2156 | 0.0842 |
| Best reported in terms of F1 score [ | 0.3908 | 0.2970 | 0.2849 | 0.3893 | 0.3073 | 0.2885 |
| Template-based method (threshold = 0.0) | 0.1373 | 0.2905 | 0.1578 | 0.1566 | 0.3189 | 0.1784 |
| Template-based method (threshold = 80.0) | 0.2177 | 0.2651 | 0.2038 | 0.2434 | 0.2828 | 0.2247 |
| Profile-based method | 0.3096 | 0.2935 | 0.2623 | 0.3695 | 0.3268 | 0.3042 |
'Whole collection' means that all of the articles were considered. 'SwissProt only article collection' includes only the articles containing interaction pairs that can be normalized to SwissProt entries. The table shows the comparative results for the extraction of interaction pairs.
Figure 2. Errors of interaction pair extraction. The figure shows the distribution of errors: the blue ellipse contains 798 annotated pairs, the yellow ellipse 8,172 coincident pairs, and the green circle 339 extracted pairs. I, 100 true-positive samples; II, 166 coincident but false-negative samples; III, 239 false-positive samples; IV, 7,135 true-negative samples; V, 532 false-negative samples but never coincident.
Figure 3. The system architecture of our method. The three main modules of the system are shown as blue rectangles. MR, molecule recognition; PPI, protein-protein interaction.
An example for constructing string features
| Input and processed documents and candidate string features | Details |
| Input document | The Three Human Syntrophin Genes Are Expressed in Diverse Tissues, Have Distinct Chromosomal Locations, and Each Bind to Dystrophin and Its Relatives |
| Processed document | the three human syntrophin genes are expressed in diverse tissues have distinct chromosomal locations and each bind to dystrophin and its relatives |
| Candidate string features | the thr |
| | he thre |
| | e three |
| | three h |
| | hree hu |
| | ree hum |
| | ee huma |
| | e human |
| | ... |
The length of substring is fixed to 7. The example document only has one sentence (the title of the document of PMID:8576247). A seven-character window moves along the sequential text. All characters are converted to lower case. Only alphabetical letters and the space character are processed. Punctuation is converted to the space character.
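The construction described above can be sketched as follows. The function names are hypothetical, and two details are assumptions: every character other than a letter or a space (including digits) is mapped to a space, and runs of spaces are collapsed to match the processed document in the table. A strict one-character sliding window also produces space-bounded substrings such as " three ".

```python
import re

def normalize(text):
    """Lower-case the text and keep only letters and single spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z ]", " ", text)   # assumption: non-letters become spaces
    return re.sub(r" +", " ", text).strip()  # assumption: collapse repeated spaces

def string_features(text, width=7):
    """Return every substring of the given width from the normalized text."""
    s = normalize(text)
    return [s[i:i + width] for i in range(len(s) - width + 1)]

title = ("The Three Human Syntrophin Genes Are Expressed in Diverse Tissues, "
         "Have Distinct Chromosomal Locations, and Each Bind to Dystrophin "
         "and Its Relatives")
for feature in string_features(title)[:5]:
    print(repr(feature))
```

Because the window ignores word boundaries, these features capture word fragments and short cross-word contexts without any tokenization or stemming.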
Figure 4. The flowchart of the molecule recognition module. Gray boxes are the inputs of the module.
Figure 5. The profile vector in the extraction of interacting protein pairs. The figure shows how the profile vector is constructed for each candidate protein pair from term features (unigram/bigram), template features, and position features.
Examples for template features used in the profile-based method
| activation of * Protein1 * to * Protein2 |
| interaction of * Protein1 * and * Protein2 |
| association of * Protein1 * with * Protein2 |
| interaction between * Protein1 * and * Protein2 |
| binding of * Protein1 * to * Protein2 |
| Protein1 * bind * to * Protein2 |
| Protein1 * activate * Protein2 |
| activation of * Protein1 * with * Protein2 |
| Protein1 * coimmunoprecipitate * with * Protein2 |
| Protein1 * copurify * with * Protein2 |
| Protein1 * mediate * interaction * Protein2 |
| Protein1 * form * complex * with * Protein2 |
| Protein1 * phosphorylate * Protein2 |
| Protein1 * interact * with * Protein2 |
| Protein1 * associate * with * Protein2 |
| Protein1 * in * complex * with * Protein2 |
| Protein1 * regulate * Protein2 |
| regulation of * Protein1 * by * Protein2 |
| yeast * Protein1 * screen * Protein2 |
| yeast * strain * Protein1 * and * Protein2 |
| yeast * Protein1 * to * function * Protein2 |
| Protein1 * lead * to * activation * of * Protein2 |
Template examples are listed in this table. These templates are reproduced from [26]. The complete list of template features is available upon request from the authors. The asterisk in the template features indicates that any word can be skipped.
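One way to apply such templates is to compile each one into a regular expression over a token sequence. The sketch below is illustrative only and is not the authors' implementation. It assumes that the sentence is already tokenized and lemmatized, that the two candidate protein mentions have been replaced by the placeholders Protein1 and Protein2, and that each asterisk may skip zero or more words. The function names are hypothetical.

```python
import re

def compile_template(template):
    """Turn a template such as 'Protein1 * interact * with * Protein2'
    into a regex over a space-joined token string."""
    parts = []
    for token in template.split():
        if token == "*":
            parts.append(r"(?:\S+ )*?")          # skip zero or more whole words
        else:
            parts.append(re.escape(token) + " ")  # literal word, then a space
    # The lookbehind makes the match start at a word boundary.
    return re.compile(r"(?<!\S)" + "".join(parts))

def matches(template, tokens):
    """True if the template matches somewhere in the token sequence."""
    sentence = " ".join(tokens) + " "  # trailing space closes the last word
    return compile_template(template).search(sentence) is not None

tokens = "we show that Protein1 can directly interact with Protein2 in vivo".split()
print(matches("Protein1 * interact * with * Protein2", tokens))
```

Each template that matches could then set one binary template feature for the candidate pair. An upper bound on the number of skipped words may be worth adding to limit spurious long-distance matches.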