Hui Yang, Goran Nenadic, John A. Keane.
Abstract
BACKGROUND: The availability of information about transcription factors (TFs) is crucial for genome biology, as TFs play a central role in the regulation of gene expression. While manual literature curation is expensive and labour-intensive, the development of semi-automated text-mining support is hindered by the unavailability of training data. There have been no studies on how existing data sources (e.g. TF-related data from the MeSH thesaurus and the GO ontology) or potentially noisy example data (e.g. protein-protein interaction, PPI) could be used to provide training data for the identification of TF contexts in the literature.
Year: 2008 PMID: 18426546 PMCID: PMC2352869 DOI: 10.1186/1471-2105-9-S3-S11
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
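The abstract proposes deriving training data from existing resources (e.g. MeSH/GO TF descriptions as positives and noisy PPI sentences as contrast) instead of manual annotation. As an illustrative sketch only — the paper's actual features and classifiers are not shown in this record, and all class names and example sentences below are hypothetical — a minimal bag-of-words Naive Bayes classifier trained on such distantly labelled sentences could look like this:

```python
import math
from collections import Counter

class NaiveBayesTF:
    """Minimal bag-of-words Naive Bayes sentence classifier (hypothetical sketch)."""

    def fit(self, docs, labels):
        # Class priors from label frequencies (log space).
        self.classes = set(labels)
        self.prior = {c: math.log(labels.count(c) / len(labels)) for c in self.classes}
        # Per-class word counts over whitespace tokens.
        self.counts = {c: Counter() for c in self.classes}
        for doc, lab in zip(docs, labels):
            self.counts[lab].update(doc.lower().split())
        self.vocab = {w for c in self.classes for w in self.counts[c]}
        self.total = {c: sum(self.counts[c].values()) for c in self.classes}

    def predict(self, doc):
        # Log-likelihood with add-one (Laplace) smoothing.
        v = len(self.vocab)
        def score(c):
            s = self.prior[c]
            for w in doc.lower().split():
                s += math.log((self.counts[c][w] + 1) / (self.total[c] + v))
            return s
        return max(self.classes, key=score)
```

In a distant-supervision setting, the "TF" training sentences would come from MeSH/GO-derived text and the contrast class from PPI or generic biomedical sentences, accepting some label noise.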
Examples of typical actors and events in transcription factor contexts
Figure 1. Overall architecture of the approach.
Statistics for the datasets used in the experiments
| 491 | 712 | 477 | 77 | 283 | 127 | 1200 | 1700 | |
| 1680 | 1687 | 1700 | | | | | | |
Feature statistics for different datasets (GM = generic model; BM = biological model). Note that the feature list used in the BM model is longer than that of the GM model due to the additional binary biological features (has-protein, has-two-proteins, etc.).
| 1327 | 1188 | 1780 |
| 803 | 760 | 1306 |
| 9.70 | 14.44 | 11.43 |
| 12.87 | 17.73 | 9.78 |
Performance of the three machine-learning classifiers on the FlyTF test data using only MeSH and GO TF data as positive training data (GM = generic model; BM = biological model)
| Precision | Recall | F-measure | Precision | Recall | F-measure | Precision | Recall | F-measure |
| .9328 | .7352 | .8223 | .9477 | .8859 | .9158 | .9595 | .7230 | .8246 |
| .9542 | .7210 | .8213 | .9595 | .8676 | .9112 | .9802 | .7047 | .8199 |
| 1.000 | .6986 | .8225 | 1.000 | .6354 | .7771 | .9972 | .7210 | .8369 |
| .9810 | .6314 | .7683 | 1.000 | .5234 | .6872 | .9816 | .6517 | .7834 |
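The nine-column rows above group into three (precision, recall, F-measure) triples: each third value is the harmonic mean of the two before it. A one-line check of that relationship:

```python
def f_measure(precision: float, recall: float) -> float:
    """F1 score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```

For example, `f_measure(0.9328, 0.7352)` reproduces the `.8223` reported in the first triple of the first row (to four-decimal rounding).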
Performance of the three machine-learning classifiers on the FlyTF test data using both MeSH and GO TF data and part of the FlyTF data as positive training data (GM = generic model; BM = biological model)
| Precision | Recall | F-measure | Precision | Recall | F-measure | Precision | Recall | F-measure |
| .9271 | .8910 | .9087 | .9308 | .9592 | .9447 | .9588 | .8533 | .9029 |
| .9455 | .8925 | .9182 | .9527 | .9450 | .9488 | .9770 | .8655 | .9109 |
| 1.000 | .9183 | .9574 | 1.000 | .8879 | .9406 | 1.000 | .9124 | .9541 |
| .9936 | .8926 | .9403 | 1.000 | .8370 | .9112 | .9885 | .8818 | .9321 |
Figure 2. Average KL divergence of the feature distributions between (1) the TF and PPI datasets and (2) the TF and NonPF datasets, for the GM and BM models, when only the top-ranked features are considered (TF&PPI_GM = feature distribution in TF vs. feature distribution in PPI under the GM model, etc.).
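The record does not show how the divergence in Figure 2 was computed, so the following is only a plausible sketch, assuming feature distributions are relative frequencies over a shared feature list ranked by one dataset:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) for two equal-length discrete distributions.
    A small epsilon guards against zero probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def topk_kl(freq_a, freq_b, k):
    """KL divergence restricted to the k top-ranked features (ranked by
    freq_a), after renormalising the truncated frequency vectors."""
    top = sorted(range(len(freq_a)), key=lambda i: freq_a[i], reverse=True)[:k]
    p = [freq_a[i] for i in top]
    q = [freq_b[i] for i in top]
    sp, sq = sum(p), sum(q)
    return kl_divergence([x / sp for x in p], [x / sq for x in q])
```

Averaging `topk_kl` over the dataset pairs and models would give curves of the kind the figure caption describes.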
Examples of confused contexts in the TF & PPI dataset
| PPI | TF | |
| PPI | TF | |
| PPI | TF | |
| TF | PPI | |
| TF | PPI |
Performance of the three machine-learning classifiers on the TF & NonPF and TF & PPI datasets using 5-fold cross-validation (GM = generic model; BM = biological model)
| Precision | Recall | F-measure | Precision | Recall | F-measure | Precision | Recall | F-measure |
| .9342 | .9104 | .9222 | .9413 | .9744 | .9576 | .9638 | .9042 | .9330 |
| .9421 | .9343 | .9380 | .9434 | .9726 | .9578 | .9591 | .9351 | .9470 |
| .8938 | .9463 | .9193 | .8767 | .9268 | .9010 | .8685 | .9554 | .9099 |
| .9092 | .9367 | .9227 | .8892 | .9268 | .9076 | .8974 | .9524 | .9241 |
Figure 3. F-measures of the three machine-learning approaches on the TF & NonPF dataset (GM = generic model; BM = biological model).
Figure 4. F-measures of the three machine-learning approaches on the TF & PPI dataset (GM = generic model; BM = biological model).
Performance of the three machine-learning classifiers on the TF & NonPF and TF & PPI datasets with additional negative examples for training using 5-fold cross-validation (GM = generic model; BM = biological model)
| Precision | Recall | F-measure | Precision | Recall | F-measure | Precision | Recall | F-measure |
| .9592 | .8967 | .9269 | .9472 | .9708 | .9588 | .9700 | .8863 | .9263 |
| .9602 | .9242 | .9418 | .9371 | .9661 | .9513 | .9609 | .9208 | .9404 |
| .8959 | .9469 | .9207 | .8743 | .9149 | .8941 | .8760 | .9542 | .9134 |
| .9103 | .9379 | .9239 | .8891 | .9119 | .9004 | .9058 | .9506 | .9277 |
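The tables above report results under 5-fold cross-validation. A generic, stdlib-only sketch of the fold-splitting scheme (the paper's exact fold assignment is not shown in this record):

```python
import random

def five_fold_indices(n, seed=0):
    """Yield (train_idx, test_idx) pairs for 5-fold cross-validation
    over n examples: each example appears in exactly one test fold."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)          # fixed seed for reproducibility
    folds = [idx[i::5] for i in range(5)]     # 5 near-equal partitions
    for i in range(5):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

Each classifier is trained on four folds and evaluated on the held-out fifth; the reported precision/recall/F values are then averaged across the five runs.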
Stage I performance after merging the results of the three classifiers learned on the same dataset (using the biological model); the best performance in each column before and after Stage I is highlighted.
| .9381 | .9420 | .9488 | |
| .9315 | .8787 | .9143 | .8961 |
| .9327 | .9474 | .8895 | .9204 |
| .9576 | .9427 | .8966 | .9515 |
| .9242 | .8836 | .9052 | |
| .9141 | .9452 | .8598 | .9166 |
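The merging rule used in Stage I is not spelled out in this record; one common way to combine the outputs of three classifiers is majority voting, sketched here:

```python
from collections import Counter

def majority_vote(predictions):
    """Merge per-classifier label sequences by majority vote.
    predictions: list of equal-length label lists, one per classifier
    (e.g. 'TF' vs. 'PPI' per sentence)."""
    merged = []
    for labels in zip(*predictions):          # one tuple per sentence
        merged.append(Counter(labels).most_common(1)[0][0])
    return merged
```

With three voters and two classes there are no ties, so every sentence gets the label at least two classifiers agree on.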
Stage II performance, after combining the results from the two datasets (TF & NonPF and TF & PPI); the best combination results are highlighted
| .9036 | .9493 | .8339 | .9094 | .9595 |
| .9598 | .9748 | .9115 | .9534 | |
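Stage II combines the results obtained on the TF & NonPF and TF & PPI datasets; the record does not state the combination operator, so the following is a hypothetical sketch of the two obvious choices, where "or" favours recall and "and" favours precision:

```python
def combine_predictions(pred_a, pred_b, mode="or"):
    """Combine two per-sentence boolean TF predictions.
    mode='or'  : positive if either model flags the sentence (union);
    mode='and' : positive only if both models agree (intersection)."""
    op = (lambda x, y: x or y) if mode == "or" else (lambda x, y: x and y)
    return [op(a, b) for a, b in zip(pred_a, pred_b)]
```

The union raises recall at some cost in precision, while the intersection does the opposite; the best trade-off would be chosen on held-out data.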