Flavio E Spetale, Elizabeth Tapia, Flavia Krsticevic, Fernando Roda, Pilar Bulacio.
Abstract
As the volume of genomic data grows, computational methods become essential for providing a first glimpse into gene annotations. Automated Gene Ontology (GO) annotation methods based on hierarchical ensemble classification techniques are particularly interesting when interpretability of annotation results is a main concern. In these methods, raw GO-term predictions computed by base binary classifiers are leveraged by checking their consistency with predefined GO relationships. Both formal leveraging strategies, which focus mainly on annotation precision, and heuristic alternatives, which focus mainly on scalability, have been described in the literature. In this contribution, a factor graph approach to the hierarchical ensemble formulation of the automated GO annotation problem is presented. In this formal framework, a core factor graph is first built based on the GO structure and then enriched to take into account the noisy nature of GO-term predictions. Then, starting from raw GO-term predictions, an iterative message passing algorithm between nodes of the factor graph is used to compute marginal probabilities of target GO-terms. Evaluations on Saccharomyces cerevisiae, Arabidopsis thaliana and Drosophila melanogaster protein sequences from the GO Molecular Function domain showed significant improvements over competing approaches, even when protein sequences were naively characterized by their physicochemical and secondary structure properties or when loose, noisy annotation datasets were considered. Based on these promising results and using Arabidopsis thaliana annotation data, we extend our approach to the identification of the most promising molecular function annotations for a set of proteins of unknown function in Solanum lycopersicum.
Year: 2016 PMID: 26771463 PMCID: PMC4714749 DOI: 10.1371/journal.pone.0146986
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Matching a GO-DAG to a core FG.
(a) GO-DAG where GO:i nodes are GO-terms and edges are is_a relationships. (b) Core GO-FG where x are variable nodes representing positive/negative GO:i annotations and f are logical factor nodes modeling the TPG constraint.
The truth table of the logical factor node f3.
Positive/negative annotations of variable nodes x2, x3 and x4 are depicted as 1/0. Parent variable nodes x2 and x3 are shown in bold.
| **x2** | **x3** | x4 | f3 |
|---|---|---|---|
| 0 | 0 | 0 | 1 |
| 0 | 0 | 1 | 0 |
| 0 | 1 | 0 | 1 |
| 0 | 1 | 1 | 0 |
| 1 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 |
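The truth table above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the TPG constraint reads as "a child term may be positive only if all of its parents are positive", which is what the eight rows imply; the function name `tpg_factor` is hypothetical and not taken from the paper.

```python
from itertools import product


def tpg_factor(parents, child):
    """Hypothetical logical factor: 1 if the assignment is TPG-consistent, else 0.

    A positive child (1) is allowed only when every parent is positive (1);
    a negative child (0) is always allowed.
    """
    return int(child == 0 or all(p == 1 for p in parents))


# Enumerate (x2, x3, x4) in the same row order as the truth table:
# x2 and x3 are the parent nodes, x4 is the child node.
f3_table = [
    (x2, x3, x4, tpg_factor((x2, x3), x4))
    for x2, x3, x4 in product((0, 1), repeat=3)
]

for row in f3_table:
    print(row)
```

The factor only rules out the three assignments in which x4 is positive while at least one of its parents is negative.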
Fig 2. (a) Core GO-FG. (b) Enriched core GO-FG, where x are latent variable nodes modeling actual positive/negative GO:i annotations, f are logical factor nodes modeling the TPG constraint over them, y are observable variable leaf nodes modeling real-valued GO:i predictions, and g are probabilistic factor nodes modeling their statistical dependence on the latent variable nodes x.
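To make the roles of x, f, y and g concrete, the sketch below computes the marginals P(x_i = 1 | y) on a four-term toy graph. It rests on several assumptions that are not from the paper: the toy DAG in `PARENTS` is hypothetical (only the x2, x3 → x4 relation mirrors Fig 1), the factors g are modeled as Gaussians centered at +1 for positive and -1 for negative annotations, and exact enumeration over all assignments is swapped in for the iterative message passing algorithm, which is only feasible on graphs this small.

```python
import math
from itertools import product

# Hypothetical toy GO-DAG, given as child -> parents.
PARENTS = {1: (), 2: (1,), 3: (1,), 4: (2, 3)}


def tpg_ok(x):
    """Logical factors f: a positive term requires all of its parents positive."""
    return all(
        x[child] == 0 or all(x[p] == 1 for p in parents)
        for child, parents in PARENTS.items()
    )


def likelihood(y, x, sigma=1.0):
    """Probabilistic factor g (assumed Gaussian, unnormalized).

    y is taken to scatter around +1 when x = 1 and around -1 when x = 0.
    """
    mean = 1.0 if x == 1 else -1.0
    return math.exp(-((y - mean) ** 2) / (2.0 * sigma ** 2))


def marginals(y, sigma=1.0):
    """Exact P(x_i = 1 | y), summing over all TPG-consistent assignments."""
    nodes = sorted(PARENTS)
    total = 0.0
    positive = {n: 0.0 for n in nodes}
    for bits in product((0, 1), repeat=len(nodes)):
        x = dict(zip(nodes, bits))
        if not tpg_ok(x):
            continue  # the logical factors give this assignment weight 0
        weight = math.prod(likelihood(y[n], x[n], sigma) for n in nodes)
        total += weight
        for n in nodes:
            if x[n] == 1:
                positive[n] += weight
    return {n: positive[n] / total for n in nodes}


# Made-up raw predictions (e.g. classifier margins), one per GO-term.
raw = {1: 0.2, 2: -0.4, 3: 0.9, 4: 1.1}
post = marginals(raw)
print(post)
```

Because every consistent assignment with a positive child also has positive parents, the resulting marginal of a child never exceeds that of any of its parents, whatever the raw predictions are.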
S. cerevisiae, A. thaliana and D. melanogaster datasets in the GO-Molecular Function domain.
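| Training | Organism | # GO-terms | Characterization | # Features | # Samples |
|---|---|---|---|---|---|
| Robust | S. cerevisiae | 103 | Pfam | 3070 | 3223 |
| Robust | S. cerevisiae | 103 | Physicochemical+ | 457 | 3223 |
| Robust | A. thaliana | 54 | Pfam | 3323 | 2863 |
| Robust | A. thaliana | 54 | Physicochemical+ | 457 | 3856 |
| Robust | D. melanogaster | 226 | Pfam | 4823 | 8636 |
| Robust | D. melanogaster | 226 | Physicochemical+ | 457 | 8636 |
| Loose | S. cerevisiae | 435 | Pfam | 3070 | 3223 |
| Loose | S. cerevisiae | 435 | Physicochemical+ | 457 | 3223 |
| Loose | A. thaliana | 659 | Pfam | 3789 | 19601 |
| Loose | A. thaliana | 659 | Physicochemical+ | 457 | 24150 |
| Loose | D. melanogaster | 656 | Pfam | 4825 | 8655 |
| Loose | D. melanogaster | 656 | Physicochemical+ | 457 | 9320 |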
Fig 3. Scatter-plot of the average AUC after versus before TPR-DAG and FGGA classification.
Annotation of D. melanogaster protein sequences to the GO-Molecular Function domain with Pfam characterization and loose annotation data is considered. (Left) The average AUC for TPR-DAG versus baseline SVM classifiers. (Right) The average AUC for FGGA versus baseline SVM classifiers.
Fig 4. Scatter-plot of the average AUC for FGGA and TPR-DAG classifiers on the annotation of D. melanogaster protein sequences to the GO-Molecular Function domain with a Pfam characterization.
Points above the diagonal show AUC improvements by FGGA. Points above the dashed line show 10% margin improvements. (Left) GO with 226 terms, 10 levels and robust annotation data. (Right) GO with 656 terms, 14 levels and loose annotation data.
Average hierarchical precision (HP), recall (HR) and F-score (HF) of the FGGA and TPR-DAG methods in the GO Molecular Function domain.
Organisms are S. cerevisiae, A. thaliana and D. melanogaster. Characterizations are Pfam and physicochemical/secondary structure (PhyChe+) properties. Training policies are robust and loose. For each model organism, characterization and training policy, the best performing method according to the Wilcoxon rank sum test (p = 0.01) is shown in bold.
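| Organism | Characterization | Training | Method | HP | HR | HF |
|---|---|---|---|---|---|---|
| S. cerevisiae | Pfam | Robust | FGGA | 0.62 | | |
| S. cerevisiae | Pfam | Robust | TPR-DAG | 0.62 | 0.66 | 0.61 |
| S. cerevisiae | Pfam | Loose | FGGA | 0.53 | | |
| S. cerevisiae | Pfam | Loose | TPR-DAG | 0.53 | 0.70 | 0.56 |
| S. cerevisiae | PhyChe+ | Robust | FGGA | 0.46 | | |
| S. cerevisiae | PhyChe+ | Robust | TPR-DAG | 0.45 | 0.79 | 0.55 |
| S. cerevisiae | PhyChe+ | Loose | FGGA | | 0.84 | |
| S. cerevisiae | PhyChe+ | Loose | TPR-DAG | 0.40 | 0.83 | 0.52 |
| A. thaliana | Pfam | Robust | FGGA | | | |
| A. thaliana | Pfam | Robust | TPR-DAG | 0.71 | 0.73 | 0.69 |
| A. thaliana | Pfam | Loose | FGGA | | 0.90 | |
| A. thaliana | Pfam | Loose | TPR-DAG | 0.76 | 0.90 | 0.77 |
| A. thaliana | PhyChe+ | Robust | FGGA | | | |
| A. thaliana | PhyChe+ | Robust | TPR-DAG | 0.47 | 0.84 | 0.59 |
| A. thaliana | PhyChe+ | Loose | FGGA | | | |
| A. thaliana | PhyChe+ | Loose | TPR-DAG | 0.33 | 0.84 | 0.46 |
| D. melanogaster | Pfam | Robust | FGGA | 0.71 | | |
| D. melanogaster | Pfam | Robust | TPR-DAG | 0.70 | 0.81 | 0.72 |
| D. melanogaster | Pfam | Loose | FGGA | | | |
| D. melanogaster | Pfam | Loose | TPR-DAG | 0.51 | 0.80 | 0.59 |
| D. melanogaster | PhyChe+ | Robust | FGGA | | 0.84 | |
| D. melanogaster | PhyChe+ | Robust | TPR-DAG | 0.40 | 0.84 | 0.52 |
| D. melanogaster | PhyChe+ | Loose | FGGA | | | |
| D. melanogaster | PhyChe+ | Loose | TPR-DAG | 0.33 | 0.85 | 0.47 |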
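For readers unfamiliar with hierarchical metrics, the sketch below shows one common set-based way of computing HP, HR and HF for a single protein: predicted and true GO-terms are first expanded with all their ancestors, and precision and recall are then taken over the expanded sets. This record does not reproduce the paper's exact definition, so the formula, the function names and the toy `GO:1` to `GO:4` identifiers are all assumptions made for illustration.

```python
def with_ancestors(terms, parents):
    """Expand a set of terms with all of their ancestors in the DAG."""
    closed, stack = set(), list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents.get(term, ()))
    return closed


def hierarchical_prf(predicted, true, parents):
    """Set-based hierarchical precision, recall and F-score for one protein."""
    p = with_ancestors(predicted, parents)
    t = with_ancestors(true, parents)
    overlap = len(p & t)
    hp = overlap / len(p) if p else 0.0
    hr = overlap / len(t) if t else 0.0
    hf = 2 * hp * hr / (hp + hr) if hp + hr else 0.0
    return hp, hr, hf


# Hypothetical toy DAG, given as child -> parents.
parents = {
    "GO:1": (),
    "GO:2": ("GO:1",),
    "GO:3": ("GO:1",),
    "GO:4": ("GO:2", "GO:3"),
}

# Predicting the parent GO:2 when the true annotation is the child GO:4.
hp, hr, hf = hierarchical_prf({"GO:2"}, {"GO:4"}, parents)
print(hp, hr, hf)
```

In this toy case the prediction is too general rather than wrong: every predicted term is a true ancestor, so precision is perfect, while recall is penalized for the two missed terms.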