Indika Kahanda, Christopher S Funk, Fahad Ullah, Karin M Verspoor, Asa Ben-Hur.
Abstract
BACKGROUND: The recently held Critical Assessment of Function Annotation challenge (CAFA2) required its participants to submit predictions for a large number of target proteins regardless of whether they have previous annotations or not. This is in contrast to the original CAFA challenge in which participants were asked to submit predictions for proteins with no existing annotations. The CAFA2 task is more realistic, in that it more closely mimics the accumulation of annotations over time. In this study we compare these tasks in terms of their difficulty, and determine whether cross-validation provides a good estimate of performance.
Keywords: Automated function prediction; Gene Ontology; Machine learning; Support vector machines
Year: 2015 PMID: 26380075 PMCID: PMC4570743 DOI: 10.1186/s13742-015-0082-5
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Fig. 1 Overview of the NA and NP setups. We distinguish between three sets of annotations that are used to define the training and test sets in the two setups. Annotations accumulated between an initial time t0 and t1 (end of 2009 in our experiments) form a set A, which is the training set in both NA and NP. The annotations acquired after t1 for the proteins annotated in A form a set B, which is the test set in the NA setup. The annotations acquired after t1 for proteins that were unannotated before t1 form a set C, which is the test set in the NP setup
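The three annotation sets described in Fig. 1 can be illustrated with a small sketch. The record layout (protein, GO term, year), the function name `split_annotations`, and the toy identifiers are assumptions made for illustration; this is not the authors' pipeline, which may, for example, define "unannotated" separately for each subontology.

```python
def split_annotations(records, t1):
    """Partition (protein, go_term, year) records into the sets A, B and C.

    A: annotations made up to and including t1 (training set in NA and NP).
    B: later annotations for proteins that already appear in A (NA test set).
    C: later annotations for proteins with no annotation in A (NP test set).
    """
    a = {(p, term) for p, term, year in records if year <= t1}
    annotated_before = {p for p, _ in a}
    # Annotations acquired after t1 that are not already part of A.
    later = {(p, term) for p, term, year in records if year > t1} - a
    b = {(p, term) for p, term in later if p in annotated_before}
    c = later - b
    return a, b, c


# Toy data with hypothetical protein and term identifiers.
records = [
    ("P1", "GO:x", 2008),
    ("P1", "GO:y", 2011),
    ("P2", "GO:z", 2012),
]
set_a, set_b, set_c = split_annotations(records, t1=2009)
```

In this toy example P1 contributes to both A and B, while P2, first annotated after t1, contributes only to C.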
The number of proteins and the number of annotations in the train and test sets with respect to the three setups for yeast
| Subontology | Setup | Training proteins | Training annots. | Test proteins | Test annots. |
|---|---|---|---|---|---|
| F | CV | 1532 | 2185 | 383 | 546 |
| F | NA | 1367 | 1706 | 208 | 285 |
| F | NP | 1367 | 1706 | 521 | 677 |
| P | CV | 2752 | 5789 | 688 | 1447 |
| P | NA | 2834 | 5161 | 633 | 990 |
| P | NP | 2834 | 5161 | 604 | 1046 |
| C | CV | 3731 | 7053 | 932 | 1763 |
| C | NA | 4189 | 6968 | 813 | 1162 |
| C | NP | 4189 | 6968 | 476 | 681 |
F, P and C represent molecular function, biological process and cellular component, respectively. For the CV setup, numbers represent average values computed across the training and test folds (5-fold CV)
The number of proteins and the number of annotations in the train and test sets with respect to the three setups for human
| Subontology | Setup | Training proteins | Training annots. | Test proteins | Test annots. |
|---|---|---|---|---|---|
| F | CV | 4532 | 8467 | 1133 | 2116 |
| F | NA | 4305 | 6898 | 799 | 1343 |
| F | NP | 4305 | 6898 | 1344 | 2174 |
| P | CV | 7533 | 31794 | 1883 | 7948 |
| P | NA | 5824 | 12196 | 3301 | 13192 |
| P | NP | 5824 | 12196 | 3574 | 12973 |
| C | CV | 8440 | 19196 | 2110 | 4799 |
| C | NA | 5082 | 8185 | 2966 | 5511 |
| C | NP | 5082 | 8185 | 5468 | 10200 |
F, P and C represent molecular function, biological process and cellular component, respectively. For the CV setup, numbers represent average values computed across the training and test folds (5-fold CV)
Fig. 2 Performance comparison between CV, NA and NP. GOstruct, binary SVMs and GBA are evaluated in cross-validation (CV), novel annotation (NA) and novel proteins (NP) in yeast and human. The top, middle and bottom panels depict the molecular function, biological process and cellular component subontologies, respectively. Performance is evaluated using the protein-centric F-max
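For readers unfamiliar with the metric named in Fig. 2, the following is a minimal sketch of protein-centric F-max as it is commonly defined in the CAFA evaluations: at each score threshold, precision is averaged over proteins with at least one prediction, recall is averaged over all test proteins, and the maximum harmonic mean over thresholds is reported. The function name `fmax`, the dictionary layout, and the threshold grid are assumptions; the authors' evaluation code may differ in details such as how annotations are propagated through the ontology.

```python
def fmax(predictions, truth, thresholds=None):
    """Protein-centric F-max.

    predictions: {protein: {term: score in [0, 1]}}
    truth:       {protein: set of true terms}
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for protein, true_terms in truth.items():
            if not true_terms:
                continue  # proteins without true terms cannot be scored
            scores = predictions.get(protein, {})
            predicted = {term for term, s in scores.items() if s >= t}
            tp = len(predicted & true_terms)
            if predicted:
                # Precision only counts proteins with a prediction at t.
                precisions.append(tp / len(predicted))
            recalls.append(tp / len(true_terms))
        if not precisions or not recalls:
            continue
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best


# Toy example with hypothetical proteins and terms.
toy_predictions = {"P1": {"a": 0.9, "b": 0.4}, "P2": {"c": 0.8}}
toy_truth = {"P1": {"a"}, "P2": {"c", "d"}}
toy_fmax = fmax(toy_predictions, toy_truth)
```

For the toy data the best threshold lies between 0.4 and 0.8, where average precision is 1 and average recall is 0.75.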
Fig. 3 Label distribution comparison between CV, NA and NP. First we computed the probability (number of annotated proteins/number of all proteins) of GO category i in the training set and in the test set for all three setups; in the CV setup the calculation was performed for each of the five folds and averaged across them. The discrepancy for category i is then defined from the absolute difference between these two probabilities. The average discrepancy is shown in the top left panel. p-values based on paired t-tests for CV vs NA and CV vs NP in all three subontologies for both species are less than 1E−4. The individual signed discrepancy values (without the absolute value) are shown in the other three panels, sorted by magnitude for each setup
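The label-distribution comparison in Fig. 3 can be sketched as follows. Because the exact formula is not legible in the caption, this sketch assumes that the discrepancy is the plain absolute difference between the training and test probabilities of a category, and that the signed value is training minus test; any normalization the authors applied is not reproduced. All function names and the toy data are hypothetical.

```python
def category_probabilities(annotations, proteins):
    """Fraction of proteins annotated with each GO category.

    annotations: iterable of (protein, term) pairs
    proteins:    collection of all proteins in the set
    """
    members = {}
    for protein, term in annotations:
        members.setdefault(term, set()).add(protein)
    return {term: len(ps) / len(proteins) for term, ps in members.items()}


def signed_discrepancies(train_probs, test_probs):
    """Assumed signed discrepancy: training minus test probability."""
    terms = set(train_probs) | set(test_probs)
    return {t: train_probs.get(t, 0.0) - test_probs.get(t, 0.0) for t in terms}


def average_discrepancy(train_probs, test_probs):
    """Mean of the absolute discrepancies over all categories."""
    signed = signed_discrepancies(train_probs, test_probs)
    return sum(abs(v) for v in signed.values()) / len(signed)


# Toy data: four training proteins and four test proteins.
train_probs = category_probabilities(
    [("P1", "x"), ("P2", "x"), ("P1", "y")], ["P1", "P2", "P3", "P4"]
)
test_probs = category_probabilities(
    [("Q1", "x"), ("Q1", "y"), ("Q2", "y")], ["Q1", "Q2", "Q3", "Q4"]
)
toy_average = average_discrepancy(train_probs, test_probs)
```

In the CV setup the same computation would be repeated for each fold and the results averaged, as the caption describes.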