Christopher S Funk, Indika Kahanda, Asa Ben-Hur, Karin M Verspoor.
Abstract
Most computational methods that predict protein function do not take advantage of the large amount of information contained in the biomedical literature. In this work we evaluate both ontology term co-mention and bag-of-words features mined from the biomedical literature and analyze their impact in the context of a structured output support vector machine model, GOstruct. We find that even simple literature-based features are useful for predicting human protein function (F-max: Molecular Function = 0.408, Biological Process = 0.461, Cellular Component = 0.608). One advantage of using literature features is their ability to offer easy verification of automated predictions. We find through manual inspection of misclassifications that some false positive predictions could be biologically valid predictions based upon support extracted from the literature. Additionally, we present a "medium-throughput" pipeline that was used to annotate a large subset of co-mentions; we suggest that this strategy could help to speed up the rate at which proteins are curated.
Keywords: Biomedical concept recognition; Protein function prediction; Text mining
Year: 2015 PMID: 26005564 PMCID: PMC4441003 DOI: 10.1186/s13326-015-0006-4
Source DB: PubMed Journal: J Biomed Semantics
Figure 1. Overview of the experimental setup used for function prediction.
Statistics of co-mentions extracted from both Medline and PMCOA using the different dictionaries for identifying GO terms
| Dictionary | | | | | |
|---|---|---|---|---|---|
| Original | sentence | 12,826 | 14,102 | 1,473,579 | 25,765,168 |
| | non-sentence | 13,459 | 17,231 | 3,070,466 | 147,524,964 |
| | combined | 13,492 | 17,424 | 3,222,619 | 173,289,862 |
| Enhanced | sentence | 12,998 | 15,415 | 1,839,360 | 33,199,284 |
| | non-sentence | 13,513 | 18,713 | 3,725,450 | 196,761,554 |
| | combined | 13,536 | 18,920 | 3,897,951 | 229,960,838 |
| | | | | | |
| Original | sentence | 5,016 | 9,471 | 317,715 | 2,945,833 |
| | non-sentence | 5,148 | 12,582 | 715,363 | 18,142,448 |
| | combined | 5,160 | 12,819 | 748,427 | 21,088,281 |
| Enhanced | sentence | 5,063 | 12,877 | 414,322 | 3,853,994 |
| | non-sentence | 5,160 | 13,769 | 901,123 | 23,986,761 |
| | combined | 5,167 | 14,018 | 939,743 | 27,840,755 |
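The sentence, non-sentence, and combined rows above can be illustrated with a minimal Python sketch. It assumes that a "sentence" co-mention means a protein and a GO term recognized in the same sentence, and that a "non-sentence" co-mention means both appear in the same passage but in different sentences; that reading, the input shape, and the `summarize` helper are assumptions made for illustration, not the authors' pipeline.

```python
from collections import Counter
from itertools import product


def count_comentions(passages):
    """Count protein/GO-term co-mentions within and across sentences.

    `passages` is a list of passages; each passage is a list of sentences,
    and each sentence is a pair (protein_ids, go_ids) of already-recognized
    concept sets. This input shape is an assumption made for illustration.
    """
    sentence = Counter()      # protein and GO term in the same sentence
    non_sentence = Counter()  # same passage, different sentences (assumed meaning)
    for sentences in passages:
        for i, (proteins, _) in enumerate(sentences):
            for j, (_, go_terms) in enumerate(sentences):
                target = sentence if i == j else non_sentence
                for pair in product(proteins, go_terms):
                    target[pair] += 1
    return sentence, non_sentence


def summarize(counter):
    """Summary counts for one set of co-mentions (hypothetical helper)."""
    return {
        "unique_proteins": len({protein for protein, _ in counter}),
        "unique_go_terms": len({go for _, go in counter}),
        "unique_comentions": len(counter),
        "total_comentions": sum(counter.values()),
    }


# Toy input: PROT_A is a made-up protein identifier.
passages = [
    [({"PROT_A"}, {"GO:0006323"}), (set(), {"GO:0007076"})],
    [({"PROT_A"}, {"GO:0006323"})],
]
sentence, non_sentence = count_comentions(passages)
combined = sentence + non_sentence  # Counter addition sums the pair counts
print(summarize(sentence))
print(summarize(combined))
```

Under this reading, total counts add across the two spans while unique counts do not, because the same protein or GO term can appear in both.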
Figure 2. Precision, recall, and F-max performance of four different co-mention feature sets on function prediction. Better performance is to the upper right, and the grey iso bars represent balance between precision and recall. Diamonds: Cellular Component; circles: Biological Process; squares: Molecular Function.
Overall performance of literature features on human proteins
| Features | F-max | Precision | Recall | Macro-AUC |
|---|---|---|---|---|
| Baseline (Original) | 0.094 | 0.055 | 0.327 | 0.680 |
| Baseline (Enhanced) | 0.064 | 0.036 | 0.322 | 0.701 |
| Co-mentions (Original) | 0.386 | 0.302 | | 0.769 |
| Co-mentions (Enhanced) | 0.377 | 0.336 | 0.447 | 0.764 |
| BoW | 0.394 | | 0.414 | 0.768 |
| Co-mentions + BoW | | 0.354 | 0.491 | |
| | | | | |
| Baseline (Original) | 0.134 | 0.091 | 0.249 | 0.610 |
| Baseline (Enhanced) | 0.155 | 0.103 | 0.311 | 0.611 |
| Co-mentions (Original) | 0.424 | 0.426 | 0.422 | 0.750 |
| Co-mentions (Enhanced) | 0.429 | 0.427 | 0.430 | 0.752 |
| BoW | | | 0.455 | 0.768 |
| Co-mentions + BoW | 0.459 | 0.426 | | |
| | | | | |
| Baseline (Original) | 0.086 | 0.050 | 0.305 | 0.640 |
| Baseline (Enhanced) | 0.073 | 0.041 | 0.317 | 0.642 |
| Co-mentions (Original) | 0.587 | 0.590 | 0.585 | 0.744 |
| Co-mentions (Enhanced) | 0.589 | 0.583 | 0.596 | 0.753 |
| BoW | | | | 0.755 |
| Co-mentions + BoW | 0.607 | 0.592 | 0.622 | |
Precision, Recall and F-max are micro-averaged across all proteins. Baseline corresponds to using only the co-mentions mined from the literature as a classifier. Macro-AUC is the average AUC per GO category. “Co-mentions + BoW” utilizes original co-mentions and BoW features within a single classifier.
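As a rough illustration of the metric named in this note, the sketch below computes a micro-averaged F-max in Python. It assumes that "micro-averaged" means pooling true positives, predictions, and gold annotations over all proteins before computing precision and recall, and that F-max is the best F-measure over a set of score thresholds; the function names and input shapes are hypothetical, not taken from GOstruct.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_f_max(scores, gold, thresholds):
    """Return (f_max, precision, recall, threshold) at the best threshold.

    scores: {protein: {go_term: confidence}}  (assumed input shape)
    gold:   {protein: set of gold-standard go_terms}
    """
    best = (0.0, 0.0, 0.0, None)
    for t in thresholds:
        tp = n_pred = n_gold = 0
        for protein, truth in gold.items():
            predicted = {g for g, s in scores.get(protein, {}).items() if s >= t}
            tp += len(predicted & truth)   # pooled true positives
            n_pred += len(predicted)       # pooled predictions
            n_gold += len(truth)           # pooled gold annotations
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_gold if n_gold else 0.0
        f = f_measure(precision, recall)
        if f > best[0]:
            best = (f, precision, recall, t)
    return best


# Toy data: two proteins, two GO terms, two candidate thresholds.
scores = {"A": {"g1": 0.9, "g2": 0.4}, "B": {"g1": 0.8}}
gold = {"A": {"g1"}, "B": {"g2"}}
print(micro_f_max(scores, gold, thresholds=[0.3, 0.5]))
```

As a sanity check on the harmonic-mean relationship, `f_measure(0.055, 0.327)` rounds to 0.094, which agrees with the first Baseline (Original) row of the table.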
Description of the gold standard human annotations and predictions made by GOstruct from each type of feature
| | | | |
|---|---|---|---|
| Gold standard | 36,349 | 264,631 | 79,631 |
| Original | 102,486 | 268,068 | 76,513 |
| Enhanced | 64,919 | 276,734 | 81,094 |
| BoW | 40,499 | 268,114 | 77,753 |
| Combined | 62,039 | 386,267 | 78,475 |
All numbers are counts based on the predictions broken down by sub-ontology; these counts have the ‘true path rule’ applied.
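The "true path rule" mentioned in this note says that a protein annotated with a GO term is implicitly annotated with all of that term's ancestors. A minimal Python sketch of that propagation follows; the `parents` mapping and the toy term names are hypothetical stand-ins for a real ontology graph.

```python
def ancestors(term, parents):
    """Collect every ancestor of `term` in a DAG given {child: [parents]}."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def apply_true_path_rule(annotations, parents):
    """Extend each protein's term set with the ancestors of its terms."""
    return {
        protein: set(terms).union(*(ancestors(t, parents) for t in terms))
        for protein, terms in annotations.items()
    }


# Toy hierarchy (not real GO structure): leaf -> mid -> root.
parents = {"leaf": ["mid"], "mid": ["root"], "other": ["root", "mid"]}
annotations = {"PROT_A": {"leaf"}, "PROT_B": set()}
propagated = apply_true_path_rule(annotations, parents)
print(propagated)
```

Counting annotations after propagation inflates the totals for shallow terms, since every specific annotation also counts toward each of its ancestors.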
Figure 3. Functional class analysis of all GO term annotations and predictions. a) Distribution of the depth and information content of GO term annotations. As IC values are real numbers, they are binned, and each bar represents a range, e.g. ‘[1,2)’ includes all depth 1 terms and IC between 1 and 2 (not including 2). b) Macro-averaged F-measure performance broken down by GO term depth. c) Macro-averaged F-measure performance binned by GO term information content.
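The binning described in the Figure 3 caption can be sketched in Python. The example assumes a common definition of information content, the negative logarithm of the fraction of proteins annotated with a term; the logarithm base (2 here) and the use of propagated annotations are assumptions, since the text above does not state them.

```python
import math
from collections import Counter


def information_content(annotations):
    """IC per term as log2(n_proteins / n_proteins_with_term).

    annotations: {protein: set of go_terms}, assumed already propagated
    to ancestors. The log base is an assumption for illustration.
    """
    n = len(annotations)
    counts = Counter(t for terms in annotations.values() for t in terms)
    return {term: math.log2(n / c) for term, c in counts.items()}


def ic_bin(ic):
    """Half-open unit bin label, e.g. 1.22 -> '[1,2)'."""
    lower = math.floor(ic)
    return f"[{lower},{lower + 1})"


# Toy data: term "a" annotates every protein, term "b" annotates half.
annotations = {"P1": {"a", "b"}, "P2": {"a"}, "P3": {"a"}, "P4": {"a", "b"}}
ic = information_content(annotations)
print({term: ic_bin(value) for term, value in ic.items()})
```

A term that annotates every protein carries no information (IC 0), while rarer terms score higher, which is why frequently annotated shallow terms have low IC.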
Top biological process and molecular function classes predicted by each type of feature
| GO term | Name | Predictions | Precision | Recall | F-measure | Depth | IC |
|---|---|---|---|---|---|---|---|
| GO:0009987 | cellular process | 6,164 | 0.812 | 0.875 | 0.842 | 1 | 0.66 |
| GO:0044699 | single-organism process | 4,849 | 0.743 | 0.765 | 0.754 | 1 | 0.96 |
| GO:0044763 | single-organism cellular process | 4,295 | 0.681 | 0.714 | 0.697 | 2 | 1.20 |
| GO:0008152 | metabolic process | 3,893 | 0.644 | 0.726 | 0.682 | 1 | 1.22 |
| GO:0065007 | biological regulation | 3,615 | 0.691 | 0.629 | 0.658 | 1 | 0.90 |
| GO:0071704 | organic substance metabolic process | 3,489 | 0.611 | 0.677 | 0.643 | 2 | 1.42 |
| GO:0050789 | regulation of biological process | 3,350 | 0.668 | 0.601 | 0.633 | 2 | 0.97 |
| GO:0044238 | primary metabolic process | 3,337 | 0.593 | 0.655 | 0.623 | 2 | 1.56 |
| GO:0044237 | cellular metabolic process | 3,268 | 0.590 | 0.644 | 0.616 | 2 | 1.49 |
| GO:0050794 | regulation of cellular process | 3,156 | 0.648 | 0.583 | 0.614 | 3 | 1.11 |
| GO:0050896 | response to stimulus | 2,968 | 0.606 | 0.590 | 0.597 | 1 | 1.62 |
| GO:0043170 | macromolecule metabolic process | 2,640 | 0.548 | 0.618 | 0.581 | 3 | 1.77 |
| | | | | | | | |
| GO:0009987 | cellular process | 6,223 | 0.816 | 0.887 | 0.850 | 1 | 0.66 |
| GO:0007076 | mitotic chromosome condensation | 6 | 0.833 | 0.714 | 0.769 | 4 | 8.58 |
| GO:0006323 | DNA packaging | 6 | 0.833 | 0.714 | 0.769 | 3 | 7.81 |
| GO:0044699 | single-organism process | 4,957 | 0.744 | 0.783 | 0.763 | 1 | 0.96 |
| GO:0044763 | single-organism cellular process | 4,423 | 0.682 | 0.736 | 0.708 | 2 | 1.20 |
| GO:0008152 | metabolic process | 3,887 | 0.643 | 0.723 | 0.681 | 1 | 1.22 |
| GO:0065007 | biological regulation | 3,701 | 0.683 | 0.636 | 0.659 | 1 | 0.90 |
| GO:0050789 | regulation of biological process | 3,453 | 0.662 | 0.613 | 0.637 | 2 | 0.97 |
| GO:0071704 | organic substance metabolic process | 3,491 | 0.605 | 0.670 | 0.636 | 2 | 1.42 |
| GO:0043252 | sodium-independent organic anion transport | 11 | 0.636 | 0.583 | 0.608 | 7 | 8.50 |
| GO:0000398 | mRNA splicing, via spliceosome | 140 | 0.492 | 0.697 | 0.577 | 10 | 5.88 |
| GO:0006607 | NLS-bearing protein import into nucleus | 15 | 0.533 | 0.571 | 0.551 | 6 | 8.50 |
| | | | | | | | |
| GO:0009987 | cellular process | 6,005 | 0.820 | 0.869 | 0.844 | 1 | 0.66 |
| GO:0044699 | single-organism process | 4,940 | 0.754 | 0.799 | 0.776 | 1 | 0.96 |
| GO:0044763 | single-organism cellular process | 4,449 | 0.696 | 0.764 | 0.728 | 2 | 1.20 |
| GO:0043252 | sodium-independent organic anion transport | 8 | 0.875 | 0.583 | 0.700 | 7 | 8.50 |
| GO:0065007 | biological regulation | 3,865 | 0.698 | 0.686 | 0.692 | 1 | 0.90 |
| GO:0008152 | metabolic process | 3,870 | 0.647 | 0.733 | 0.688 | 1 | 1.22 |
| GO:0050789 | regulation of biological process | 3,597 | 0.680 | 0.663 | 0.671 | 2 | 0.97 |
| GO:0006479 | protein methylation | 13 | 0.615 | 0.727 | 0.666 | 8 | 6.52 |
| GO:0051568 | histone H3-K4 methylation | 13 | 0.615 | 0.727 | 0.666 | 11 | 7.94 |
| GO:0007076 | mitotic chromosome condensation | 5 | 0.800 | 0.571 | 0.666 | 4 | 8.58 |
| GO:0050794 | regulation of cellular process | 3,440 | 0.657 | 0.651 | 0.654 | 3 | 1.11 |
| GO:0006497 | protein lipidation | 9 | 0.889 | 0.500 | 0.640 | 7 | 6.79 |
| | | | | | | | |
| GO:0009987 | cellular process | 6,420 | 0.813 | 0.913 | 0.860 | 1 | 0.66 |
| GO:0044699 | single-organism process | 5,338 | 0.736 | 0.834 | 0.782 | 1 | 0.96 |
| GO:0044763 | single-organism cellular process | 4,862 | 0.674 | 0.800 | 0.731 | 2 | 1.20 |
| GO:0065007 | biological regulation | 4,445 | 0.669 | 0.749 | 0.707 | 1 | 0.90 |
| GO:0008152 | metabolic process | 4,252 | 0.638 | 0.785 | 0.704 | 1 | 1.22 |
| GO:0050789 | regulation of biological process | 4,199 | 0.650 | 0.733 | 0.689 | 2 | 0.97 |
| GO:0050794 | regulation of cellular process | 4,046 | 0.626 | 0.723 | 0.671 | 3 | 1.11 |
| GO:0043252 | sodium-independent organic anion transport | 15 | 0.600 | 0.750 | 0.667 | 7 | 8.50 |
| GO:0071704 | organic substance metabolic process | 3,883 | 0.602 | 0.743 | 0.665 | 2 | 1.42 |
| GO:0043170 | macromolecule metabolic process | 3,007 | 0.540 | 0.694 | 0.607 | 3 | 1.77 |
| GO:0051716 | cellular response to stimulus | 3,176 | 0.520 | 0.674 | 0.587 | 3 | 1.89 |
| GO:0006386 | termination of RNA polymerase III transcription | 12 | 0.583 | 0.583 | 0.583 | 7 | 8.18 |
Most difficult biological process and molecular function classes
| GO term | Name | Predictions | Precision | Recall | F-measure | IC |
|---|---|---|---|---|---|---|
| GO:0051179 | localization | 28 | 0.107 | 0.054 | 0.072 | 5.70 |
| GO:0016247 | channel regulator activity | 115 | 0.043 | 0.208 | 0.071 | 6.53 |
| GO:0009055 | electron carrier activity | 108 | 0.03 | 0.111 | 0.055 | 6.94 |
| GO:0007067 | mitosis | 23 | 0.043 | 0.031 | 0.036 | 7.54 |
| GO:0042056 | chemoattractant activity | 53 | 0.018 | 0.067 | 0.029 | 7.56 |
| | | | | | | |
| GO:0009055 | electron carrier activity | 102 | 0.090 | 0.138 | 0.109 | 6.94 |
| GO:0051179 | localization | 42 | 0.071 | 0.055 | 0.061 | 5.70 |
| GO:0019838 | growth factor binding | 44 | 0.021 | 0.035 | 0.027 | 5.99 |
| GO:0070888 | E-box binding | 99 | 0.010 | 0.066 | 0.019 | 7.49 |
| GO:0030545 | receptor regulator activity | 152 | 0.007 | 0.020 | 0.010 | 7.63 |
| | | | | | | |
| GO:0051179 | localization | 18 | 0.277 | 0.090 | 0.137 | 5.70 |
| GO:0009055 | electron carrier activity | 29 | 0.103 | 0.083 | 0.092 | 6.94 |
| GO:0016042 | lipid catabolic process | 26 | 0.076 | 0.054 | 0.063 | 5.80 |
| GO:0015992 | proton transport | 15 | 0.066 | 0.047 | 0.055 | 7.29 |
| GO:0005516 | calmodulin binding | 14 | 0.071 | 0.033 | 0.045 | 7.25 |
| | | | | | | |
| GO:0051179 | localization | 61 | 0.100 | 0.109 | 0.104 | 5.70 |
| GO:0009055 | electron carrier activity | 62 | 0.079 | 0.138 | 0.101 | 6.94 |
| GO:0030545 | receptor regulator activity | 63 | 0.064 | 0.080 | 0.071 | 7.63 |
| GO:0042056 | chemoattractant activity | 24 | 0.041 | 0.066 | 0.051 | 7.56 |
| GO:0040007 | growth | 27 | 0.030 | 0.066 | 0.047 | 7.33 |
IC is the information content of the term.