Abstract
BACKGROUND: Advances in sequencing technology over the past decade have produced an abundance of sequenced proteins whose function is still unknown. Computational systems that can automatically predict and annotate protein function are therefore in demand. Most such systems use features derived from protein sequence or protein structure to predict function. In earlier work, we demonstrated the utility of biomedical literature as a source of text features for predicting protein subcellular location. We also showed that combining text-based and sequence-based prediction improves the performance of location predictors. Following up on this work, for the Critical Assessment of Function Annotations (CAFA) Challenge, we developed a text-based system that aims to predict molecular function and biological process (using Gene Ontology terms) for unannotated proteins. In this paper, we present the preliminary work and evaluation that we performed for our system as part of the CAFA challenge.
Year: 2013 PMID: 23514326 PMCID: PMC3584852 DOI: 10.1186/1471-2105-14-S3-S14
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
List of evidence codes
| Code | Included evidence codes | Code | Excluded evidence codes |
|---|---|---|---|
| EXP | Inferred from Experiment | ISS | Inferred from Sequence/Structural Similarity |
| IDA | Inferred from Direct Assay | ISO | Inferred from Sequence Orthology |
| IPI | Inferred from Physical Interaction | ISA | Inferred from Sequence Alignment |
| IMP | Inferred from Mutant Phenotype | ISM | Inferred from Sequence Model |
| IGI | Inferred from Genetic Interaction | IGC | Inferred from Genomic Context |
| IEP | Inferred from Expression Pattern | RCA | Reviewed Computational Analysis |
| IC | Inferred by Curator | IEA | Inferred from Electronic Annotation |
| TAS | Traceable Author Statement | NAS | Non-traceable Author Statement |
The table shows which GO evidence codes were included in our dataset and which evidence codes were excluded.
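The inclusion rule in the table can be expressed as a simple filter. The following is a minimal sketch, not the authors' code: the tuple layout `(protein_id, go_id, evidence_code)`, the function name `filter_annotations`, and the example records are assumptions made for illustration. Only the set of included codes comes from the table.

```python
# Evidence codes from the "Included" column of the table above.
INCLUDED_EVIDENCE_CODES = frozenset(
    {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP", "IC", "TAS"}
)


def filter_annotations(annotations):
    """Keep only annotations whose evidence code is in the included set.

    `annotations` is a hypothetical iterable of
    (protein_id, go_id, evidence_code) tuples.
    """
    return [a for a in annotations if a[2] in INCLUDED_EVIDENCE_CODES]


# Made-up records, for illustration only.
example = [
    ("P1", "GO:0005488", "IDA"),  # direct assay: kept
    ("P2", "GO:0003824", "IEA"),  # electronic annotation: dropped
    ("P3", "GO:0005215", "TAS"),  # traceable author statement: kept
    ("P4", "GO:0005198", "ISS"),  # sequence similarity: dropped
]
kept = filter_annotations(example)
```

In this sketch, annotations backed only by sequence-based or electronic evidence (for example ISS or IEA) are dropped, which mirrors the split shown in the table.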
The GO categories that are used as function classes in this work
| GO ID | Molecular Function | GO ID | Biological Process |
|---|---|---|---|
| 0005488 | Binding | 0065007 | biological regulation |
| 0003824 | catalytic activity | 0032502 | developmental process |
| 0030528 | transcription regulator activity | 0009987 | cellular process |
| 0005215 | transporter activity | 0050896 | response to stimulus |
| 0060089 | molecular transducer activity | 0008152 | metabolic process |
| 0030234 | enzyme regulator activity | 0051234 | establishment of localization |
| 0005198 | structural molecule activity | 0016043 | cellular component organization |
| 0016247 | channel regulator activity | 0023052 | Signalling |
| 0009055 | electron carrier activity | 0032501 | Multi-cellular organismal process |
| 0045182 | translation regulator activity | 0022414 | reproductive process |
| | | 0051704 | multi-organism process |
| | | 0040011 | Locomotion |
| | | 0040007 | Growth |
| | | 0051179 | Localization |
| | | 0022610 | biological adhesion |
| | | 0008283 | cell proliferation |
| | | 0000003 | Reproduction |
| | | 0002376 | immune system process |
| | | 0016265 | Death |
| | | 0071554 | cell wall organization or biogenesis |
| | | 0048511 | rhythmic process |
| | | 0023046 | signalling process |
| | | 0044085 | cellular component biogenesis |
| | | 0043473 | Pigmentation |
Prediction performance on molecular function classes, over the cross-validation dataset.
| Function | # Training Proteins | # Test Proteins | Text-KNN | | | Base-Prior | | | Base-Seq | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | P | R | F | P | R | F | P | R | F |
| GO:0005488 | 10720 | 2680 | 0.65 | 0.63 | 0.64 | 0.63 | 0.75 | 0.71 | |||
| GO:0003824 | 2943 | 736 | 0.23 | 0.32 | 0.16 | 0.15 | 0.15 | 0.38 | |||
| GO:0030528 | 1276 | 319 | 0.44 | 0.24 | 0.31 | 0.07 | 0.07 | 0.07 | |||
| GO:0005215 | 782 | 196 | 0.38 | 0.04 | 0.04 | 0.04 | 0.50 | ||||
| GO:0060089 | 738 | 184 | 0.16 | 0.22 | 0.04 | 0.04 | 0.04 | 0.26 | |||
| GO:0030234 | 485 | 121 | 0.05 | 0.08 | 0.03 | 0.03 | 0.03 | 0.16 | |||
| GO:0005198 | 334 | 84 | 0.04 | 0.01 | 0.01 | 0.02 | 0.02 | 0.02 | |||
| GO:0016247 | 58 | 14 | 0.01 | 0.01 | 0.01 | 0.00 | 0.00 | 0.00 | |||
| GO:0009055 | 54 | 14 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| GO:0045182 | 21 | 5 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
The text-based classifier, Text-KNN, is compared with two baselines: Base-Prior and Base-Seq. The columns P, R, and F refer, respectively, to the Precision, Recall, and F-measure of the classifier over individual GO categories. Precision and recall values of 0 for a class indicate that all the proteins belonging to that class are misclassified into another class.
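The per-class measures named in the caption can be sketched as follows. This is an illustration under stated assumptions rather than the evaluation code used in the paper: it assumes each protein has a single true class and a single predicted class, and it takes the F-measure to be the balanced harmonic mean of precision and recall. The function name and the toy labels are hypothetical.

```python
def precision_recall_f(true_labels, predicted_labels, target_class):
    """Per-class precision, recall, and balanced F-measure.

    Assumes one true label and one predicted label per protein.
    """
    tp = fp = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if predicted == target_class and truth == target_class:
            tp += 1  # correctly assigned to the class
        elif predicted == target_class:
            fp += 1  # assigned to the class but belongs elsewhere
        elif truth == target_class:
            fn += 1  # belongs to the class but assigned elsewhere
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return precision, recall, f_measure


# Made-up labels, for illustration only.
truth = ["binding", "binding", "catalytic", "transporter"]
predicted = ["binding", "catalytic", "catalytic", "binding"]

binding_scores = precision_recall_f(truth, predicted, "binding")
transporter_scores = precision_recall_f(truth, predicted, "transporter")
```

Here the only `transporter` protein is assigned to another class, so precision, recall, and F-measure are all 0 for that class, which is the situation the caption describes.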
Prediction performance on biological process classes, over the cross-validation dataset.
| Function | # Training Proteins | # Test Proteins | Text-KNN | | | Base-Prior | | | Base-Seq | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | P | R | F | P | R | F | P | R | F |
| GO:0065007 | 3626 | 906 | 0.23 | 0.31 | 0.20 | 0.24 | 0.22 | 0.48 | |||
| GO:0032502 | 3338 | 835 | 0.19 | 0.20 | 0.12 | 0.17 | 0.14 | ||||
| GO:0009987 | 1790 | 447 | 0.24 | 0.26 | 0.17 | 0.14 | 0.15 | 0.27 | |||
| GO:0050896 | 1780 | 445 | 0.10 | 0.10 | 0.10 | 0.16 | 0.09 | 0.11 | |||
| GO:0008152 | 1658 | 415 | 0.23 | 0.14 | 0.17 | 0.08 | 0.06 | 0.07 | |||
| GO:0051234 | 1204 | 301 | 0.32 | 0.20 | 0.25 | 0.05 | 0.05 | 0.05 | |||
| GO:0016043 | 1145 | 286 | 0.13 | 0.05 | 0.07 | 0.06 | 0.05 | 0.06 | |||
| GO:0023052 | 965 | 241 | 0.18 | 0.11 | 0.14 | 0.05 | 0.04 | 0.04 | |||
| GO:0032501 | 606 | 151 | 0.12 | 0.02 | 0.04 | 0.04 | 0.03 | 0.04 | |||
| GO:0022414 | 346 | 86 | 0.02 | 0.02 | 0.02 | 0.14 | 0.03 | 0.05 | |||
| GO:0051704 | 272 | 68 | 0.01 | 0.01 | 0.01 | 0.09 | 0.04 | 0.05 | |||
| GO:0040011 | 170 | 42 | 0.01 | ||||||||
| GO:0040007 | 165 | 41 | 0.00 | 0.00 | 0.00 | ||||||
| GO:0051179 | 151 | 38 | 0.00 | 0.00 | 0.00 | ||||||
| GO:0022610 | 128 | 32 | 0.01 | 0.01 | 0.01 | 0.00 | 0.00 | 0.00 | |||
| GO:0008283 | 118 | 29 | 0.00 | 0.00 | 0.00 | ||||||
| GO:0000003 | 96 | 24 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |||
| GO:0002376 | 74 | 19 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |||
| GO:0016265 | 64 | 16 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |||
| GO:0071554 | 46 | 11 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |||
| GO:0048511 | 43 | 11 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |||
| GO:0023046 | 35 | 9 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |||
| GO:0044085 | 16 | 4 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |||
| GO:0043473 | 13 | 3 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
The text-based classifier, Text-KNN, is compared with two baselines: Base-Prior and Base-Seq. The columns P, R, and F refer, respectively, to the Precision, Recall, and F-measure of the classifier over individual GO categories. Precision and recall values of 0 for a class indicate that all the proteins belonging to that class are misclassified into another class.
Prediction performance on molecular function classes, over the dataset of textless proteins.
| GO:0005488 | 58 | 0.47 | 0.59 | |||||
| GO:0003824 | 9 | 0.29 | ||||||
| GO:0030528 | 1 | 0.04 | 0.08 | |||||
| GO:0005215 | 5 | 0.50 | 0.20 | 0.29 | ||||
| GO:0060089 | 7 | 0.57 | ||||||
| GO:0005198 | 2 | 0.00 | 0.00 | 0.00 | ||||
Prediction performance of Text-KNN on proteins that have no associated text is shown in the Text-KNN (Textless) column. For reference only, the average cross-validation results obtained over the whole cross-validation dataset, denoted Text-KNN (Cross-Validation), are also shown. The columns P, R, and F refer, respectively, to the Precision, Recall, and F-measure of the classifier over individual GO categories. Precision and recall values of 0 for a class indicate that all the proteins belonging to that class are misclassified into another class.
Prediction performance on biological process classes, over the dataset of textless proteins.
| GO:0065007 | 19 | |||||||
| GO:0032502 | 18 | 0.19 | ||||||
| GO:0009987 | 8 | 0.04 | 0.13 | 0.06 | ||||
| GO:0050896 | 20 | |||||||
| GO:0008152 | 7 | |||||||
| GO:0051234 | 9 | |||||||
| GO:0016043 | 6 | 0.00 | 0.00 | 0.00 | ||||
| GO:0023052 | 3 | 0.00 | 0.00 | 0.00 | ||||
| GO:0032501 | 9 | 0.00 | 0.00 | 0.00 | ||||
| GO:0022414 | 7 | 0.00 | 0.00 | 0.00 | ||||
| GO:0051704 | 1 | 0.00 | 0.00 | 0.00 | ||||
| GO:0040011 | 3 | 0.00 | 0.00 | 0.00 | ||||
| GO:0002376 | 1 | 0.00 | 0.00 | 0.00 | ||||
Prediction performance of Text-KNN on proteins that have no associated text is shown in the Text-KNN (Textless) column. For reference only, the average cross-validation results obtained over the whole cross-validation dataset, denoted Text-KNN (Cross-Validation), are also shown. The columns P, R, and F refer, respectively, to the Precision, Recall, and F-measure of the classifier over individual GO categories. Precision and recall values of 0 for a class indicate that all the proteins belonging to that class are misclassified into another class.
Prediction performance for molecular function classes, over the CAFA evaluation dataset. (The number of proteins in each class is shown below each function header)
| Function | Text-KNN | CAFA-Prior | CAFA-Seq | GOtcha | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P | R | S | P | R | S | P | R | S | P | R | S | |
| 0.643 | 0.17 | 0.87 | 0.579 | 0.00 | 0.085 | 0.723 | 0.16 | 0.916 | ||||
| 0.00 | 0.00 | 0.97 | 0.077 | 0.00 | 0.5 | 0.036 | 0.179 | 0.994 | ||||
| 0.312 | 0.03 | 0.95 | 0.451 | 0.00 | 0.714 | 0.03 | 0.067 | |||||
The text-based classifier, Text-KNN, is compared with baseline results provided by the CAFA challenge: CAFA-Prior, CAFA-Seq, and GOtcha. The confidence threshold used for each classifier is shown under its name in the respective column. A confidence threshold of 0.01 is used for CAFA-Prior because the classifier does not make any predictions for the 'transporter activity' class at higher confidence thresholds.
The columns P, R, and S refer, respectively, to the Precision, Recall, and Specificity of the classifiers over individual classes. Precision and recall values of 0 for a class indicate that all the proteins belonging to that class are misclassified (when the confidence score is 0.95). CAFA-Prior always has a specificity value of 0, because it assigns all the proteins to each class, and as such the number of true negatives is always 0.
A specificity value that is close to 1, for a class whose precision and recall are both 0, indicates that most proteins in the dataset are not in the class (true negatives) and are indeed not assigned to the class. A few proteins from other classes are misclassified into the class (false positives), hence the specificity is slightly less than 1.
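The two specificity cases described above can be reproduced with a small sketch. It assumes the standard definition, specificity = TN / (TN + FP), computed per class over sets of protein identifiers. The function name, the protein identifiers, and the class membership are made up for illustration and do not come from the CAFA data.

```python
def specificity(true_members, predicted_members, all_proteins):
    """Specificity for one class: TN / (TN + FP).

    All arguments are sets of protein identifiers.
    """
    negatives = all_proteins - true_members  # proteins not in the class
    if not negatives:
        return 0.0
    true_negatives = negatives - predicted_members  # correctly left out
    return len(true_negatives) / len(negatives)


# Made-up dataset of ten proteins, two of which belong to the class.
all_proteins = {f"P{i}" for i in range(1, 11)}
true_members = {"P1", "P2"}

# A prior-style baseline that assigns every protein to the class
# leaves no true negatives, so its specificity is 0.
prior_specificity = specificity(true_members, all_proteins, all_proteins)

# A classifier that misses both true members (precision = recall = 0)
# and wrongly assigns one other protein has specificity just below 1.
miss_specificity = specificity(true_members, {"P3"}, all_proteins)
```

With eight negative proteins and one false positive, the second case gives 7/8 = 0.875, which shows how a class can have a specificity near 1 while its precision and recall are both 0.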
Prediction performance for biological process classes, over the CAFA evaluation dataset. (The number of proteins in each class is shown below each function header)
| Function | Text-KNN | CAFA-Prior | CAFA-Seq | GOtcha | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P | R | S | P | R | S | P | R | S | P | R | S | |
| 0.5 | 0.009 | 0.261 | 0 | 0 | 0.105 | 0.978 | 0.404 | 0.351 | 0.817 | |||
| 0.00 | 0.00 | 0.939 | 0.067 | 0 | 0.00 | 0.00 | 0.069 | 0.988 | ||||
| 0.2 | 0.017 | 0.138 | 0 | 0.067 | 0.976 | 0.297 | 0.317 | 0.88 | ||||
| 0.25 | 0.026 | 0.087 | 0 | 0.105 | 0.99 | 0.263 | 0.395 | 0.894 | ||||
| 0.125 | 0.009 | 0.243 | 0 | 0.047 | 0.39 | 0.302 | 0.848 | |||||
| 0.00 | 0.00 | 0 | 0.06 | 0.989 | 0.263 | 0.181 | 0.881 | |||||
| 0.069 | 0.023 | 0.923 | 0.2 | 0 | 0.115 | 0.343 | 0.264 | 0.874 | ||||
| 0.03 | 0.076 | 0 | 0.25 | 0.061 | 0.985 | 0.077 | 0.061 | 0.94 | ||||
| 0.00 | 0.00 | 0.971 | 0.06 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.993 | |||
| 0.00 | 0.00 | 0.147 | 0 | 0.031 | 0.987 | 0.192 | 0.156 | 0.887 | ||||
| 0.857 | 0.016 | 0.844 | 0 | 0.071 | 0.941 | 0.866 | 0.829 | 0.309 | ||||
| 0.00 | 0.00 | 0.489 | 0 | 0.588 | 0.047 | 0.969 | 0.559 | 0.691 | ||||
| 0.083 | 0.08 | 0.946 | 0.057 | 0 | 0.00 | 0.00 | 0.214 | 0.12 | 0.973 | |||
| 0.083 | 0.08 | 0.946 | 0.057 | 0 | 0.00 | 0.00 | 0.12 | 0.981 | ||||
The text-based classifier, Text-KNN, is compared with baseline results provided by the CAFA challenge: CAFA-Prior, CAFA-Seq, and GOtcha. The confidence threshold used for each classifier is shown under its name in the respective column. The confidence thresholds for Text-KNN, GOtcha, and CAFA-Prior are set at 0.75, 0.14, and 0.01, respectively, because these classifiers make no predictions for over 75% of the classes at higher confidence thresholds.
The columns P, R, and S refer, respectively, to the Precision, Recall, and Specificity of the classifier over individual classes. Precision and recall values of 0 for a class indicate that all the proteins belonging to that class are misclassified (at the respective confidence level). CAFA-Prior always has a specificity value of 0, because it assigns all the proteins to each class, and as such the number of true negatives is always 0.
A specificity value that is close to 1, for a class whose precision and recall are both 0, indicates that most proteins in the dataset are not in the class (true negatives) and are indeed not assigned to the class. A few proteins from other classes are misclassified into the class (false positives), hence the specificity is slightly less than 1.
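The effect of a confidence threshold on which classes receive any prediction can be sketched as below. The three threshold values are those quoted in the caption for the biological process evaluation. Everything else is an assumption for illustration: the `(protein, go_id, score)` layout, the inclusive comparison (`>=`), the helper names, and the scores themselves.

```python
# Thresholds quoted in the caption for the biological process classes.
BP_THRESHOLDS = {"Text-KNN": 0.75, "GOtcha": 0.14, "CAFA-Prior": 0.01}


def apply_threshold(scored_predictions, threshold):
    """Return the (protein, GO class) pairs whose score reaches the threshold."""
    return {
        (protein, go_id)
        for protein, go_id, score in scored_predictions
        if score >= threshold  # inclusive comparison is an assumption
    }


def classes_without_predictions(kept_pairs, all_classes):
    """Classes for which no prediction survives the threshold."""
    return set(all_classes) - {go_id for _, go_id in kept_pairs}


# Made-up scores, for illustration only.
scores = [
    ("P1", "GO:0065007", 0.90),
    ("P2", "GO:0065007", 0.40),
    ("P3", "GO:0032502", 0.80),
    ("P4", "GO:0009987", 0.60),
]
classes = ["GO:0065007", "GO:0032502", "GO:0009987"]

kept = apply_threshold(scores, BP_THRESHOLDS["Text-KNN"])
empty_at_075 = classes_without_predictions(kept, classes)
empty_at_095 = classes_without_predictions(apply_threshold(scores, 0.95), classes)
```

In this toy example, raising the threshold from 0.75 to 0.95 leaves every class without predictions, which is the kind of behaviour that motivates choosing a lower threshold per classifier.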