Daniela Stojanova, Michelangelo Ceci, Donato Malerba, Saso Dzeroski.
Abstract
BACKGROUND: Ontologies and catalogs of gene functions, such as the Gene Ontology (GO) and MIPS-FUN, assume that functional classes are organized hierarchically, that is, general functions include more specific ones. This has recently motivated the development of several machine learning algorithms for gene function prediction that leverage this hierarchical organization, in which instances may belong to multiple classes. In addition, it is possible to exploit relationships among examples, since it is plausible that related genes tend to share functional annotations. Although these relationships have been identified and extensively studied in the area of protein-protein interaction (PPI) networks, they have not received much attention in hierarchical and multi-class gene function prediction. Relations between genes introduce autocorrelation in functional annotations and violate the assumption that instances are independently and identically distributed (i.i.d.), which underlies most machine learning algorithms. Although the explicit consideration of these relations brings additional complexity to the learning process, we expect substantial benefits in the predictive accuracy of the learned classifiers.
Year: 2013 PMID: 24070402 PMCID: PMC3850549 DOI: 10.1186/1471-2105-14-285
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Example of a hierarchy. (a) A part of the FUN hierarchy [7]. (b) An example of input data: the FUN class hierarchy of an example and the corresponding class vector and attribute set. (c) An example of a predictive clustering tree for HMC. The internal nodes contain tests on attribute values, and the leaves contain vectors of probabilities associated with the class values.
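The class vector mentioned in panel (b) can be illustrated with a minimal sketch. The hierarchy below is a hypothetical toy example whose codes only imitate the FUN style; `PARENT`, `CLASSES`, `with_ancestors`, and `class_vector` are illustrative names, not part of the original work. The sketch assumes that a gene annotated with a specific function is also annotated with all of its more general ancestors.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy hierarchy (child -> parent); not the real FUN catalog.
PARENT: Dict[str, str] = {
    "1/1": "1",
    "1/2": "1",
    "1/2/1": "1/2",
    "2/1": "2",
}
# Fixed class order that defines the positions in the class vector.
CLASSES: List[str] = ["1", "1/1", "1/2", "1/2/1", "2", "2/1"]


def with_ancestors(labels: Set[str]) -> Set[str]:
    """Add every ancestor of each label (general functions include specific ones)."""
    closed: Set[str] = set()
    for label in labels:
        node: Optional[str] = label
        while node is not None:
            closed.add(node)
            node = PARENT.get(node)  # None once a top-level class is reached
    return closed


def class_vector(labels: Set[str]) -> List[int]:
    """Binary vector with a 1 for each class the example belongs to."""
    closed = with_ancestors(labels)
    return [1 if c in closed else 0 for c in CLASSES]


print(class_vector({"1/2/1", "2"}))
```

A leaf of a predictive clustering tree would then hold, for each position of this vector, the proportion of training examples in the leaf that have a 1 there.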
Basic properties of the datasets
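| Dataset | FUN: examples | FUN: attributes | FUN: classes | GO: examples | GO: attributes | GO: classes |
|---|---|---|---|---|---|---|
| seq | 3932 | 476 | 499 | 3900 | 476 | 4133 |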
| pheno | 1592 | 67 | 455 | 1587 | 67 | 3127 |
| struc | 3838 | 19629 | 499 | 3822 | 19629 | 4132 |
| hom | 3848 | 47035 | 499 | 3567 | 47035 | 4126 |
| cellcycle | 3757 | 77 | 499 | 3751 | 77 | 4125 |
| church | 3779 | 550 | 499 | 3774 | 550 | 4131 |
| derisi | 2424 | 63 | 499 | 2418 | 63 | 3573 |
| eisen | 3725 | 79 | 461 | 3719 | 79 | 4119 |
| gasch1 | 3764 | 172 | 499 | 3758 | 172 | 4125 |
| gasch2 | 3779 | 51 | 499 | 3758 | 51 | 4131 |
| spo | 3703 | 79 | 499 | 3698 | 79 | 4119 |
| exp | 3782 | 550 | 499 | 3773 | 550 | 4131 |
We use 12 yeast (Saccharomyces cerevisiae) datasets (as considered by [1]) and two functional annotation schemes (FUN and GO).
The performance of NHMC and competitive methods in predicting gene function for different datasets and PPI networks
| | |||||||||
|---|---|---|---|---|---|---|---|---|---|
| | |||||||||
| seq | 0.023 | 0.032 | 0.030 | 0.004 | 0.003 | 0.011 | 0.011 | 0.006 | 0.006 |
| pheno | 0.019 | 0.016 | 0.016 | 0.001 | 0.001 | 0.016 | 0.016 | 0.003 | 0.002 |
| struc | 0.018 | 0.012 | 0.012 | 0.001 | 0.001 | 0.012 | 0.012 | 0.003 | 0.002 |
| homo | 0.040 | 0.013 | 0.013 | 0.000 | 0.000 | 0.012 | 0.012 | 0.001 | 0.002 |
| cellcycle | 0.019 | 0.287 | 0.288 | 0.004 | 0.003 | 0.012 | 0.012 | 0.006 | 0.006 |
| church | 0.014 | 0.015 | 0.012 | 0.003 | 0.002 | 0.012 | 0.012 | 0.006 | 0.006 |
| derisi | 0.017 | 0.015 | 0.017 | 0.004 | 0.003 | 0.044 | 0.317 | 0.006 | 0.006 |
| eisen | 0.030 | 0.024 | 0.024 | 0.005 | 0.003 | 0.015 | 0.334 | 0.006 | 0.008 |
| gasch1 | 0.024 | 0.018 | 0.019 | 0.003 | 0.002 | 0.050 | 0.354 | 0.006 | 0.006 |
| gasch2 | 0.020 | 0.021 | 0.021 | 0.004 | 0.003 | 0.012 | 0.012 | 0.006 | 0.006 |
| spo | 0.019 | 0.018 | 0.015 | 0.004 | 0.003 | 0.012 | 0.012 | 0.006 | 0.006 |
| exp | 0.023 | 0.017 | 0.016 | 0.003 | 0.002 | 0.012 | 0.012 | 0.006 | 0.006 |
| Average: | 0.022 | 0.041 | 0.040 | 0.003 | 0.002 | 0.018 | 0.093 | 0.005 | 0.005 |
| | |||||||||
| | |||||||||
| | | | | | |||||
| seq | 0.037 | 0.072 | 0.1 | 0.003 | 0.001 | 0.025 | 0.035 | 0.007 | 0.007 |
| pheno | 0.051 | 0.016 | 0.051 | 0.002 | 0.002 | 0.051 | 0.051 | 0.006 | 0.005 |
| struc | 0.078 | 0.078 | 0.078 | 0.001 | 0.002 | 0.078 | 0.078 | 0.003 | 0.003 |
| homo | 0.047 | 0.068 | 0.068 | 0.001 | 0.001 | 0.023 | 0.023 | 0.002 | 0.003 |
| cellcycle | 0.027 | 0.036 | 0.018 | 0.004 | 0.005 | 0.026 | 0.041 | 0.007 | 0.007 |
| church | 0.017 | 0.025 | 0.025 | 0.004 | 0.004 | 0.025 | 0.025 | 0.007 | 0.007 |
| derisi | 0.078 | 0.078 | 0.106 | 0.004 | 0.004 | 0.044 | 0.042 | 0.007 | 0.007 |
| eisen | 0.043 | 0.061 | 0.146 | 0.005 | 0.005 | 0.030 | 0.045 | 0.007 | 0.007 |
| gasch1 | 0.051 | 0.094 | 0.095 | 0.004 | 0.005 | 0.050 | 0.046 | 0.007 | 0.007 |
| gasch2 | 0.04 | 0.088 | 0.107 | 0.004 | 0.005 | 0.025 | 0.043 | 0.007 | 0.007 |
| spo | 0.04 | 0.078 | 0.09 | 0.004 | 0.005 | 0.026 | 0.035 | 0.007 | 0.007 |
| exp | 0.045 | 0.036 | 0.092 | 0.004 | 0.004 | 0.025 | 0.025 | 0.007 | 0.007 |
| Average: | 0.046 | 0.061 | 0.081 | 0.003 | 0.003 | 0.036 | 0.041 | 0.006 | 0.006 |
We use the 2/3-1/3 training-testing evaluation schema. We report the performance of the CLUS-HMC (α = 1), NHMC (α = 0.5 and α = 0), FunctionalFlow (FF), and Hopfield (H) methods when predicting gene function in yeast using GO annotations. We use 12 yeast (Saccharomyces cerevisiae) datasets (as considered by [1]). We consider all genes. Results for two PPI networks (DIP and BioGRID) are presented.
The performance of NHMC and competitive methods in predicting gene function for different datasets and PPI networks
| | |||||||||
|---|---|---|---|---|---|---|---|---|---|
| | |||||||||
| seq | 0.030 | 0.025 | 0.025 | 0.003 | 0.002 | 0.022 | 0.022 | 0.004 | 0.006 |
| pheno | 0.021 | 0.018 | 0.019 | 0.002 | 0.001 | 0.018 | 0.018 | 0.004 | 0.002 |
| struc | 0.018 | 0.012 | 0.016 | 0.002 | 0.000 | 0.012 | 0.012 | 0.004 | 0.002 |
| homo | 0.040 | 0.013 | 0.031 | 0.001 | 0.001 | 0.013 | 0.013 | 0.002 | 0.002 |
| cellcycle | 0.017 | 0.297 | 0.273 | 0.004 | 0.002 | 0.013 | 0.013 | 0.006 | 0.006 |
| church | 0.017 | 0.013 | 0.012 | 0.003 | 0.002 | 0.012 | 0.012 | 0.006 | 0.006 |
| derisi | 0.018 | 0.022 | 0.021 | 0.004 | 0.002 | 0.039 | 0.315 | 0.006 | 0.006 |
| eisen | 0.025 | 0.020 | 0.020 | 0.004 | 0.002 | 0.021 | 0.335 | 0.006 | 0.008 |
| gasch1 | 0.020 | 0.017 | 0.017 | 0.003 | 0.002 | 0.029 | 0.339 | 0.006 | 0.006 |
| gasch2 | 0.019 | 0.020 | 0.018 | 0.004 | 0.002 | 0.015 | 0.016 | 0.006 | 0.006 |
| spo | 0.018 | 0.019 | 0.018 | 0.004 | 0.002 | 0.017 | 0.017 | 0.006 | 0.006 |
| exp | 0.020 | 0.017 | 0.017 | 0.002 | 0.002 | 0.018 | 0.018 | 0.006 | 0.006 |
| Average: | 0.022 | 0.041 | 0.041 | 0.003 | 0.002 | 0.019 | 0.094 | 0.005 | 0.005 |
We use the 3-fold cross-validation evaluation schema. We report the average performance (estimated by 3-fold CV) of the CLUS-HMC (α = 1), NHMC (α = 0.5 and α = 0), FunctionalFlow (FF), and Hopfield (H) methods when predicting gene function in yeast using GO annotations. We use 12 yeast (Saccharomyces cerevisiae) datasets. Results for two PPI networks (DIP and BioGRID) are presented.
Figure 2. Performance distribution. Comparison of the predictive performance of the models learned by CLUS-HMC and NHMC (α = 0.5 and α = 0) from the most connected subsets of genes from the (a) gasch2 and (b) cellcycle datasets, annotated with labels from the GO hierarchy. The horizontal axis gives the minimum relative number (in %) of interactions a gene must have in the DIP PPI network to be included in the testing data, whereas the vertical axis gives the model performance on the testing data. At the far right (100 on the horizontal axis), we have the performance on the most highly connected genes from the testing set. At the far left (0 on the horizontal axis), we have the performance on all genes in the testing set.
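The selection of increasingly connected test genes described in this caption can be sketched as follows. This is only an illustration under stated assumptions: the caption does not say what the percentage is relative to, so the sketch assumes it is relative to the highest degree among the test genes. The function names and the edge-list representation are hypothetical, and the edge list is assumed to contain each undirected interaction once.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def degrees(edges: Iterable[Tuple[str, str]]) -> Counter:
    """Number of interactions per gene; self-interactions are ignored."""
    deg: Counter = Counter()
    for u, v in edges:
        if u != v:
            deg[u] += 1
            deg[v] += 1
    return deg


def connected_subset(
    test_genes: Sequence[str],
    edges: Iterable[Tuple[str, str]],
    min_pct: float,
) -> List[str]:
    """Test genes whose degree is at least min_pct % of the maximum test degree.

    Assumption: "relative number of interactions" means degree divided by the
    largest degree among the test genes; the source does not define it.
    """
    deg = degrees(edges)
    max_deg = max((deg[g] for g in test_genes), default=0)
    if max_deg == 0:
        # No test gene has any interaction: only the 0 % threshold keeps genes.
        return list(test_genes) if min_pct <= 0 else []
    return [g for g in test_genes if 100.0 * deg[g] / max_deg >= min_pct]


ppi = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
genes = ["a", "b", "c", "d", "e"]
for threshold in (0, 50, 100):
    print(threshold, connected_subset(genes, ppi, threshold))
```

With a threshold of 0 every test gene is kept, and with a threshold of 100 only the most highly connected genes remain, which matches the two ends of the horizontal axis.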
The performance of NHMC and competitive methods in predicting gene function on weakly connected genes
| | ||||||||
|---|---|---|---|---|---|---|---|---|
| seq | 0.014 | 0.014 | 0.001 | 0.001 | 0.033 | 0.042 | 0.007 | 0.007 |
| pheno | 0.018 | 0.051 | 0.001 | 0.001 | 0.033 | 0.027 | 0.005 | 0.007 |
| struc | 0.012 | 0.078 | 0.001 | 0.001 | 0.093 | 0.093 | 0.000 | 0.007 |
| homo | 0.012 | 0.023 | 0.001 | 0.001 | 0.149 | 0.149 | 0.003 | 0.007 |
| cellcycle | 0.015 | 0.015 | 0.001 | 0.001 | 0.041 | 0.023 | 0.007 | 0.007 |
| church | 0.013 | 0.025 | 0.001 | 0.001 | 0.031 | 0.022 | 0.007 | 0.007 |
| derisi | 0.015 | 0.015 | 0.000 | 0.001 | 0.024 | 0.026 | 0.007 | 0.007 |
| eisen | 0.020 | 0.020 | 0.000 | 0.001 | 0.039 | 0.040 | 0.007 | 0.002 |
| gasch1 | 0.015 | 0.015 | 0.001 | 0.001 | 0.023 | 0.025 | 0.007 | 0.006 |
| gasch2 | 0.018 | 0.023 | 0.001 | 0.001 | 0.028 | 0.028 | 0.007 | 0.007 |
| spo | 0.015 | 0.015 | 0.000 | 0.001 | 0.022 | 0.022 | 0.007 | 0.007 |
| exp | 0.015 | 0.015 | 0.001 | 0.001 | 0.026 | 0.044 | 0.007 | 0.003 |
| Average: | 0.015 | 0.026 | 0.001 | 0.001 | 0.045 | 0.045 | 0.006 | 0.006 |
We report the performance of the CLUS-HMC (α = 1), NHMC (α = 0.5), FunctionalFlow (FF), and Hopfield (H) methods when predicting gene function in yeast, using GO annotations and the BioGRID PPI network. The models are trained on the subset of highly connected genes and tested on the subset of weakly connected genes.
Weighted features
| | ||||||
|---|---|---|---|---|---|---|
| seq | 0.011 | 0.006 | 0.006 | 0.021 | 0.006 | 0.006 |
| pheno | 0.016 | 0.003 | 0.002 | 0.016 | 0.004 | 0.004 |
| struc | 0.012 | 0.003 | 0.002 | 0.093 | 0.002 | 0.003 |
| homo | 0.012 | 0.001 | 0.002 | 0.149 | 0.006 | 0.006 |
| cellcycle | 0.012 | 0.006 | 0.006 | 0.013 | 0.007 | 0.006 |
| church | 0.012 | 0.006 | 0.006 | 0.012 | 0.007 | 0.006 |
| derisi | 0.317 | 0.006 | 0.006 | 0.013 | 0.007 | 0.006 |
| eisen | 0.334 | 0.006 | 0.006 | 0.041 | 0.006 | 0.006 |
| gasch1 | 0.354 | 0.006 | 0.006 | 0.016 | 0.007 | 0.006 |
| gasch2 | 0.012 | 0.006 | 0.006 | 0.016 | 0.007 | 0.006 |
| spo | 0.012 | 0.006 | 0.006 | 0.016 | 0.007 | 0.006 |
| exp | 0.012 | 0.006 | 0.006 | 0.015 | 0.007 | 0.006 |
| Average: | 0.093 | 0.005 | 0.005 | 0.035 | 0.006 | 0.006 |
Binary vs. weighted connections: performance of NHMC (α = 0.5) with weighted distances in the BioGRID PPI network.
The performance of NHMC in predicting gene function in combination with feature selection
| Features | struc: number of features | struc: performance | homo: number of features | homo: performance |
|---|---|---|---|---|
| All the features | 19624 | 0.012 | 47034 | 0.012 |
| Top 15% | 2944 | 0.0115 | 7055 | 0.012 |
| Top 10% | 1962 | 0.0115 | 4703 | 0.012 |
| Top 5% | 981 | 0.0115 | 2351 | 0.012 |
| Top 1% | 196 | 0.0115 | 470 | 0.0115 |
Performance of NHMC (α = 0.5) with the BioGRID PPI network on the struc and homo datasets, when working on all the features, the top 15%, the top 10%, the top 5%, and the top 1% of the features. The struc and homo datasets are chosen because they have a very high number of features compared to the other datasets.
Basic properties of the yeast PPI networks
| | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| seq | 46 | 96 | 46 | 97 | 8 | 8 | 15 | 8 | 7.09 | 7.09 | 7.15 | 54.97 |
| pheno | 46 | 98 | 46 | 99 | 6 | 11 | 16 | 11 | 3.53 | 27.67 | 17.57 | 27.75 |
| struc | 13 | 98 | 59 | 98 | 7 | 14 | 14 | 14 | 7.27 | 54.74 | 7.07 | 54.97 |
| hom | 45 | 97 | 48 | 14 | 7 | 16 | 14 | 16 | 7.22 | 54.301 | 7.79 | 58.57 |
| cellcycle | 72 | 99 | 47 | 99 | 2 | 17 | 17 | 16 | 7.36 | 55.63 | 7.38 | 55.72 |
| church | 46 | 99 | 46 | 99 | 15 | 16 | 13 | 15 | 7.35 | 56.21 | 7.39 | 56.28 |
| derisi | 72 | 100 | 73 | 100 | 7 | 17 | 11 | 16 | 11.17 | 84.43 | 11.19 | 84.64 |
| eisen | 35 | 65 | 35 | 65 | 9 | 19 | 19 | 17 | 4.68 | 32.47 | 4.69 | 32.52 |
| gasch1 | 47 | 99 | 47 | 99 | 9 | 17 | 19 | 16 | 7.41 | 55.83 | 7.42 | 55.92 |
| gasch2 | 47 | 98 | 47 | 99 | 7 | 17 | 17 | 16 | 7.35 | 55.62 | 7.39 | 55.93 |
| spo | 48 | 99 | 48 | 99 | 3 | 13 | 17 | 16 | 7.31 | 55.27 | 7.32 | 55.35 |
| exp | 46 | 99 | 46 | 99 | 15 | 16 | 39 | 15 | 7.35 | 56.16 | 7.36 | 56.3 |
The percentage of connected genes, the percentage of function-relevant interactions, and the average degree of nodes for two different PPI networks (DIP [28] and BioGRID [27]).
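The three network properties named in this caption can be computed from an edge list along the lines of the sketch below. It rests on assumptions the source does not confirm: an interaction is counted as function-relevant when the two genes share at least one class label, the average degree is taken over all genes in the dataset (including unconnected ones), and interactions with genes outside the dataset are dropped. The function name `network_summary` and its argument layout are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def network_summary(
    genes: Iterable[str],
    edges: Iterable[Tuple[str, str]],
    labels: Dict[str, Set[str]],
) -> Dict[str, float]:
    """Summarize a PPI network restricted to the genes of one dataset."""
    gene_set = set(genes)
    # Unique undirected edges with both endpoints in the dataset, no self-loops.
    kept = {
        frozenset((u, v))
        for u, v in edges
        if u != v and u in gene_set and v in gene_set
    }
    degree = {g: 0 for g in gene_set}
    relevant = 0
    for edge in kept:
        u, v = tuple(edge)
        degree[u] += 1
        degree[v] += 1
        # Assumption: "function-relevant" means sharing at least one label.
        if labels.get(u, set()) & labels.get(v, set()):
            relevant += 1
    n = len(gene_set)
    connected = sum(1 for d in degree.values() if d > 0)
    return {
        "pct_connected": 100.0 * connected / n if n else 0.0,
        "pct_function_relevant": 100.0 * relevant / len(kept) if kept else 0.0,
        "avg_degree": sum(degree.values()) / n if n else 0.0,
    }


toy_genes = ["a", "b", "c", "d"]
toy_edges = [("a", "b"), ("b", "a"), ("b", "c"), ("a", "x")]
toy_labels = {"a": {"1"}, "b": {"1", "2"}, "c": {"3"}}
print(network_summary(toy_genes, toy_edges, toy_labels))
```

In the toy data, the duplicate edge between `a` and `b` is counted once and the edge to `x` is dropped because `x` is not in the dataset.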