Pietro Pinoli, Davide Chicco, Marco Masseroli.
Abstract
BACKGROUND: Gene function annotations, which are associations between a gene and a term of a controlled vocabulary describing gene functional features, are of paramount importance in modern biology. Datasets of these annotations, such as the ones provided by the Gene Ontology Consortium, are used to design novel biological experiments and interpret their results. Despite their importance, these sources of information have some known issues: they are incomplete, since biological knowledge is far from definitive and evolves rapidly, and they may contain some erroneous annotations. Since curating novel annotations is costly in both money and time, computational tools that can reliably predict likely annotations, and thus speed up the discovery of new gene annotations, are very useful.
Year: 2015 PMID: 25916950 PMCID: PMC4416163 DOI: 10.1186/1471-2105-16-S6-S4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Prediction workflow. The input is a gene annotation repository. First, the annotations of interest that it contains are represented in a computable structure (i.e. a matrix, binary or weighted). Then, this representation is used as the training dataset for a machine learning method that fits a predictive model of gene annotations. Finally, the estimated model is treated as a generative process, and new putative annotations are produced, each with a confidence value.
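The first step of this workflow, representing annotations as a binary matrix, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_annotation_matrix` and the toy gene and term identifiers are hypothetical.

```python
def build_annotation_matrix(annotations):
    """Build a binary gene-by-term matrix from (gene, term) pairs.

    matrix[i][j] is 1 if gene i is annotated to term j, else 0.
    """
    genes = sorted({gene for gene, _ in annotations})
    terms = sorted({term for _, term in annotations})
    gene_index = {gene: i for i, gene in enumerate(genes)}
    term_index = {term: j for j, term in enumerate(terms)}

    matrix = [[0] * len(terms) for _ in genes]
    for gene, term in annotations:
        matrix[gene_index[gene]][term_index[term]] = 1
    return genes, terms, matrix


# Hypothetical toy annotations (gene, GO term ID); not taken from the paper's data.
toy_annotations = [
    ("geneA", "GO:0000001"),
    ("geneA", "GO:0000002"),
    ("geneB", "GO:0000002"),
]
genes, terms, A = build_annotation_matrix(toy_annotations)
```

Rows correspond to genes and columns to function terms; a weighted variant would replace the 1 entries with real-valued weights.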
Weighting schema components.
| Component | Code | Name | Description |
|---|---|---|---|
| Local weight | N | No transformation | ∀ |
| Local weight | A | Augmented | ∀ |
| Global weight | T | Term weight | ∀ |
| Normalization | N | No normalization | Normalization factor is not used |
| Normalization | M | Maximum | |
Each of the proposed weighting schemes is made of a local weight, a global weight and a normalization function. The implemented and tested options for the three components are listed below.
Figure 2. Truncated Singular Value Decomposition. Given a truncation level k, an approximation of the matrix W is built by taking into account only the first k columns of the left singular vector matrix U and of the right singular vector matrix V, together with the k × k portion of the diagonal matrix S of the singular values of W. The submatrices considered are highlighted.
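The rank-k approximation described in this caption can be sketched with NumPy. This is an illustrative sketch under stated assumptions: the function name `truncated_svd_approximation` and the toy matrix are hypothetical, and `np.linalg.svd` returns the transpose of V, so the first k columns of V are the first k rows of `Vt`.

```python
import numpy as np


def truncated_svd_approximation(W, k):
    """Return the rank-k approximation U_k S_k V_k^T of W."""
    # full_matrices=False gives the compact SVD: W = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Keep the first k left/right singular vectors and the k x k block of S.
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]


# Hypothetical toy gene-by-term matrix (4 genes, 3 terms).
W = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
])
W_k = truncated_svd_approximation(W, k=2)
```

Entries of `W_k` that are zero in `W` but relatively large in the approximation are natural candidates for predicted annotations; how they are thresholded is a separate choice.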
Figure 3. pLSAnorm aspect model. Each gene is associated with each function term through hidden variables, the topics. Connections between nodes represent probability values.
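For reference, the aspect model of standard pLSA factorizes the joint probability of a gene and a function term over the hidden topics. The symbols g (gene), f (function term), z (topic) and Z (set of topics) are notation chosen here, and the normalization that distinguishes pLSAnorm is not described in this caption, so the expression below covers only the generic model.

```latex
P(g, f) = \sum_{z \in Z} P(z)\, P(g \mid z)\, P(f \mid z)
```

Each factor corresponds to one kind of connection in the figure: topic priors, gene-given-topic probabilities and term-given-topic probabilities.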
Quantitative characteristics of the considered GO (BP+CC+MF) gene annotation datasets in their July 2009 version and March 2013 updated version from the GPDW.
| Dataset | #g (July 2009) | #f (July 2009) | #a (July 2009) | #g (March 2013) | #f (March 2013) | #a (March 2013) | Δ | Δ% |
|---|---|---|---|---|---|---|---|---|
| | 734 | 3,714 | 32,232 | 2,243 | 8,421 | 144,358 | 112,126 | 347.87 |
| | 1,807 | 2,967 | 49,834 | 3,825 | 6,848 | 179,142 | 129,308 | 259.47 |
| | 8,722 | 6,516 | 308,962 | 10,304 | 8,850 | 517,457 | 208,495 | 67.48 |
| | 11,646 | 6,927 | 335,063 | 5,428 | 9,107 | 232,945 | −102,118 | −30.47 |
| | 14,114 | 3,270 | 262,940 | 15,439 | 4,191 | 345,712 | 82,772 | 31.47 |
| | 7,950 | 2,136 | 86,207 | 8,433 | 2,560 | 96,354 | 10,147 | 11.77 |
Figures do not include GO annotations with IEA or ND evidence, nor obsolete GO terms and genes. #g: number of genes; #f: number of function features (GO terms); #a: number of GO gene annotations; Δ: difference of annotation number between the two dataset versions; Δ%: percentage difference of annotation number between the two dataset versions. Drosophila m.: Drosophila melanogaster organism.
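The Δ and Δ% columns can be checked with a few lines of arithmetic. The sketch below assumes that Δ% is expressed relative to the July 2009 annotation count, which matches the table's first row; the function name `annotation_change` is hypothetical.

```python
def annotation_change(a_old, a_new):
    """Return (delta, delta_percent) between two annotation counts."""
    delta = a_new - a_old
    # Percentage change relative to the older dataset version (assumption).
    delta_pct = 100.0 * delta / a_old
    return delta, delta_pct


# First row of the table: #a in July 2009 and in March 2013.
delta, delta_pct = annotation_change(32_232, 144_358)
print(delta, round(delta_pct, 2))  # 112126 347.87
```

Some table entries appear to be truncated rather than rounded to two decimals, so the last digit of a recomputed Δ% may differ by one.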
Figure 4. ROC curves for the Bos taurus datasets. ROC curves and their AUC percentages of Annotation Confirmed rate (AC rate) versus Annotation Predicted rate (AP rate), obtained by varying the threshold τ in predicting the GO annotations of Bos taurus genes with the LSI (a), SIM (b) or pLSA (c) methods, each with or without weighting schemes.
Figure 5. Predictions for the PGRP-LB gene. Branch of the Directed Acyclic Graph (DAG) of the GO Biological Process terms associated with the PGRP-LB (Peptidoglycan recognition protein LB) gene (Entrez Gene ID: 41379) of the Drosophila melanogaster organism. It includes the GO terms present in the analyzed dataset (black circles), the GO terms predicted by the SIM method with the NTM weighting schema as associated with the same gene (blue hexagons), and those predicted terms that were found validated in the updated version of the dataset (green rectangles). Other parts of the GO DAG are connected to the shown branch as indicated by the dotted lines.
Percentages of the top 500 predicted annotations found confirmed for each method (LSI, SIM, pLSA), weighting schema (none, NTN, NTM, ATN), and dataset (Bos taurus, Danio rerio, Drosophila melanogaster).
| ||||||||
|---|---|---|---|---|---|---|---|---|---|
| LSI-none | 26 | 3.6 | 21.2 | 11.6 | 6.8 | ||||
| LSI-NTN | 22.2 | 24.8 | 7.2 | 9.4 | 27.4 | 16.8 | 6.6 | 13.8 | 7.2 |
| LSI-NTM | 14.8 | 19.2 | 6.4 | 6.6 | 17.6 | 11.6 | 9.8 | 26 | |
| LSI-ATN | 21.6 | 6.2 | 23.6 | 24 | 5.8 | ||||
| SIM-none | 19.2 | 19 | 4.4 | 21.8 | 10.8 | 35 | 10.6 | ||
| SIM-NTN | 17.4 | 20.6 | 7 | 11.4 | 28.8 | 17.8 | 7.4 | 24.6 | 17 |
| SIM-NTM | 22 | 24 | 6.2 | 7.4 | 30.2 | 21.4 | 16 | ||
| SIM-ATN | 6 | 22.6 | 23.8 | 6 | |||||
| pLSA-none | 20.6 | 5.2 | 14 | 13 | 4.8 | 6.6 | 4.2 | ||
| pLSA-NTN | 19.6 | 11.8 | 21.6 | ||||||
| pLSA-NTM | 14.6 | 20.2 | 5 | 23.8 | 11.6 | 5.4 | 7 | 3.2 | |
| pLSA-ATN | 15.6 | 16.2 | 6.4 | 4.8 | 9.6 | 5 | 3.8 | 6.4 | 4 |
We report: the portion of predictions confirmed in the outdated dataset with only computational evidence, and therefore not included in the input corpus (cmp); the portion of predictions confirmed in the updated dataset with any evidence (Uany); and the portion of predictions confirmed in the updated dataset with non-computational evidence (Ucur). The best values obtained for each of these three cases and each dataset across the considered methods are highlighted in bold.
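A percentage of this kind can be computed by ranking the predictions by confidence, keeping the top N, and counting how many appear in a reference set of confirmed annotations. The sketch below is a hedged illustration, not the authors' evaluation code: the function name `confirmed_percentage`, the tuple layout and the toy data are assumptions, and the choice of reference set (cmp, Uany or Ucur) is left to the caller.

```python
def confirmed_percentage(predictions, confirmed, top_n=500):
    """Percentage of the top_n highest-confidence predictions found in `confirmed`.

    predictions: iterable of (gene, term, confidence) tuples.
    confirmed:   set of (gene, term) pairs used as the reference.
    """
    ranked = sorted(predictions, key=lambda p: p[2], reverse=True)[:top_n]
    if not ranked:
        return 0.0
    hits = sum((gene, term) in confirmed for gene, term, _ in ranked)
    # Assumption: divide by the number of predictions actually kept.
    return 100.0 * hits / len(ranked)


# Hypothetical toy data, not taken from the paper.
toy_predictions = [
    ("g1", "t1", 0.9),
    ("g1", "t2", 0.8),
    ("g2", "t1", 0.4),
    ("g2", "t3", 0.2),
]
toy_confirmed = {("g1", "t1"), ("g2", "t3")}
pct_top2 = confirmed_percentage(toy_predictions, toy_confirmed, top_n=2)
```

With `top_n=2`, only the two most confident predictions are evaluated, and one of them is in the reference set.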