Rafal Kustra, Romy Shioda, Mu Zhu.
Abstract
BACKGROUND: Expression array data are used to predict biological functions of uncharacterized genes by comparing their expression profiles to those of characterized genes. While biologically plausible, this is both statistically and computationally challenging. Typical approaches are computationally expensive and ignore correlations among expression profiles and functional categories.
Year: 2006 PMID: 16630343 PMCID: PMC1468435 DOI: 10.1186/1471-2105-7-216
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Illustration of performance metrics. A hypothetical table illustrating 4 possibilities in predicting one GO category.
| | In top-ranked list | Not in top-ranked list | Total |
| Associated | a | b | s |
| Unassociated | c | d | p-s |
| Total | m | p-m | p |
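The counts in this table can be turned into the usual classification metrics. The sketch below is a minimal illustration that assumes the standard definitions of sensitivity, specificity, and precision. The class name `TopListTable`, its method names, and the example counts are hypothetical and are not taken from the paper, which may define its metrics differently.

```python
from dataclasses import dataclass


@dataclass
class TopListTable:
    """2x2 table for one GO category, using the cell names a, b, c, d."""
    a: int  # associated genes in the top-ranked list
    b: int  # associated genes not in the top-ranked list
    c: int  # unassociated genes in the top-ranked list
    d: int  # unassociated genes not in the top-ranked list

    @property
    def s(self) -> int:
        # Total number of genes associated with the category.
        return self.a + self.b

    @property
    def m(self) -> int:
        # Size of the top-ranked list.
        return self.a + self.c

    @property
    def p(self) -> int:
        # Total number of genes.
        return self.a + self.b + self.c + self.d

    def sensitivity(self) -> float:
        # Fraction of associated genes that appear in the top-ranked list.
        return self.a / self.s

    def specificity(self) -> float:
        # Fraction of unassociated genes kept out of the top-ranked list.
        return self.d / (self.p - self.s)

    def precision(self) -> float:
        # Fraction of the top-ranked list that is truly associated.
        return self.a / self.m


# Hypothetical counts, chosen only to illustrate the arithmetic.
table = TopListTable(a=8, b=2, c=12, d=178)
print(table.s, table.m, table.p)   # 10 20 200
print(table.sensitivity())         # 0.8
print(table.precision())           # 0.4
```

The marginal totals s, m, and p are derived from the four cells, so the table stays consistent with the layout above by construction.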
Cross-validated performance of FAM and SVM, averaged over all 369 GO functions. For each of the four cross-validation (C-V) folds, the other three folds are used to train FAM and SVM.
| Fold | FAM | SVM | FAM | SVM | FAM | SVM |
| 1 | 0.503 | 0.501 | 0.905 | 0.905 | 0.784 | 0.783 |
| 2 | 0.482 | 0.480 | 0.905 | 0.905 | 0.769 | 0.773 |
| 3 | 0.518 | 0.509 | 0.905 | 0.905 | 0.797 | 0.791 |
| 4 | 0.485 | 0.478 | 0.905 | 0.905 | 0.769 | 0.764 |
Pairwise comparison of predictions made by FAM and SVM for 369 GO functions in 4 cross-validation experiments. For each of the 369 GO functions, the performance of FAM is compared with that of SVM. "Win," "Tie" and "Lose" refer to the number of GO functions for which FAM's predictions are better than, tied with, or worse than those of SVM, respectively. The last row, denoted by "ACV," compares FAM and SVM using performance metrics averaged over the 4 cross-validation folds. As expected, this gives rise to fewer ties.
| Fold | Win | Tie | Lose | Win | Tie | Lose | Win | Tie | Lose |
| 1 | 106 | 168 | 95 | 107 | 167 | 95 | 193 | 2 | 174 |
| 2 | 87 | 181 | 101 | 88 | 183 | 98 | 177 | 1 | 191 |
| 3 | 104 | 171 | 94 | 109 | 166 | 94 | 186 | 3 | 180 |
| 4 | 90 | 193 | 86 | 92 | 192 | 85 | 199 | 6 | 164 |
| ACV | 184 | 36 | 149 | 194 | 34 | 141 | 201 | 0 | 168 |
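The "Win," "Tie" and "Lose" counts can be obtained by comparing the two methods function by function. The following is a minimal sketch of that tally; the function name `win_tie_lose`, the optional tolerance `tol`, and the example scores are assumptions made for illustration and do not come from the paper.

```python
def win_tie_lose(fam_scores, svm_scores, tol=0.0):
    """Count GO functions where FAM beats, ties with, or loses to SVM.

    Both inputs hold one metric value per GO function, in the same order.
    Two values are treated as tied when they differ by at most `tol`.
    """
    if len(fam_scores) != len(svm_scores):
        raise ValueError("score lists must have the same length")
    win = tie = lose = 0
    for fam, svm in zip(fam_scores, svm_scores):
        diff = fam - svm
        if abs(diff) <= tol:
            tie += 1
        elif diff > 0:
            win += 1
        else:
            lose += 1
    return win, tie, lose


# Hypothetical metric values for four GO functions.
fam_scores = [0.50, 0.48, 0.52, 0.40]
svm_scores = [0.50, 0.47, 0.55, 0.40]
print(win_tie_lose(fam_scores, svm_scores))  # (1, 2, 1)
```

Because every GO function falls into exactly one of the three outcomes, the three counts for a metric always sum to the number of functions compared. For example, 106 + 168 + 95 = 369 in the first row.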
Averaged cross-validation performance of FAM and SVM for 369 GO functions classified into 4 groups based on how informative they are. A GO function is said to be more informative if fewer genes are associated with it. N = number of GO functions in each group. H, MH, ML, L denote the 4 groups with high, medium-high, medium-low, and low levels of informativeness. For each GO function, the cross-validation (CV) performance metrics of FAM and SVM are first averaged over the folds. These averaged metrics are then averaged over all GO functions within each group. Notice that the last row is identical to that in Table 1.
| | N | FAM | SVM | FAM | SVM | FAM | SVM |
| H | 95 | 0.550 | 0.532 | 0.903 | 0.903 | 0.788 | 0.784 |
| MH | 144 | 0.521 | 0.521 | 0.904 | 0.904 | 0.792 | 0.790 |
| ML | 71 | 0.449 | 0.445 | 0.906 | 0.906 | 0.771 | 0.768 |
| L | 59 | 0.412 | 0.412 | 0.908 | 0.908 | 0.748 | 0.749 |
| | 369 | | | | | | |
Pairwise comparison of predictions made by FAM and SVM for 369 GO functions classified into 4 groups based on how informative they are. A GO function is said to be more informative if fewer genes are associated with it. N = number of GO functions in each group. H, MH, ML, L denote the 4 groups with high, medium-high, medium-low, and low levels of informativeness. For each GO function, the performance of FAM is compared with that of SVM using performance metrics averaged over the 4 cross-validation folds. "Win," "Tie" and "Lose" refer to the number of GO functions for which FAM's predictions are better than, tied with, or worse than those of SVM, respectively. Notice that the last row is identical to that in Table 2.
| | N | Win | Tie | Lose | Win | Tie | Lose | Win | Tie | Lose |
| H | 95 | 46 | 22 | 27 | 46 | 20 | 29 | 55 | 0 | 40 |
| MH | 144 | 64 | 12 | 68 | 76 | 12 | 56 | 74 | 0 | 70 |
| ML | 71 | 43 | 2 | 26 | 39 | 2 | 30 | 39 | 0 | 32 |
| L | 59 | 31 | 0 | 28 | 33 | 0 | 26 | 33 | 0 | 26 |
| ACV | 369 | 184 | 36 | 149 | 194 | 34 | 141 | 201 | 0 | 168 |
Figure 1. Four sample ROC curves. We group the 369 GO functions into four groups based on how informative they are. We say a GO function is more informative if fewer genes are associated with it. H, MH, ML, L denote the 4 groups with high, medium-high, medium-low, and low levels of informativeness. Here, we show sample ROC curves for four randomly selected GO functions, one from each of the four groups. GO:0000002 = "mitochondrial genome maintenance" (from group H); GO:0007568 = "aging" (from group MH); GO:0006626 = "protein targeting to mitochondrion" (from group ML); and GO:0006399 = "tRNA metabolism" (from group L). Solid line = FAM; dashed line = SVM.
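The area under an ROC curve (AUC) summarizes a curve like those in Figure 1 in a single number. As a minimal sketch, the AUC can be computed as the fraction of (associated, unassociated) gene pairs that a method ranks correctly, with ties counted as one half. The function name `auc` and the example scores and labels are hypothetical; this pairwise formula is a standard equivalent of the area under the empirical ROC curve and is not necessarily how the paper's software computes it.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: predicted scores, higher meaning more likely associated.
    labels: 1 for genes associated with the GO function, 0 otherwise.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    correct = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                correct += 1.0   # pair ranked correctly
            elif sp == sn:
                correct += 0.5   # tie counts as half
    return correct / (len(pos) * len(neg))


# Hypothetical scores for six genes, three of them associated.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
print(round(auc(scores, labels), 3))  # 0.889
```

The double loop is quadratic in the number of genes, which is fine for illustration. A rank-based implementation would be preferable for large gene sets.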
AUC of SVM on the testing set. The testing set consists of 50 functional categories classified into 10 groups according to how informative they are. A GO function is said to be more informative if fewer genes are associated with it. An SVM was trained for each of these categories using two sets of parameters. In this table, Group 1 consists of the most informative functions, whereas Group 10 consists of the least informative functions. The "Default" column reports the prediction performance on the AUC scale using the default control parameters in our SVM software, Gist. The "Tuned" column reports the prediction performance on the AUC scale using control parameters optimized on the corresponding training set. The AUC values listed are averaged across the four C-V folds. See the subsection "SVM Parameter Selection" in the "Method" section for further detail.
| Group | Default | Tuned |
| 1 | 0.804 | 0.793 |
| 2 | 0.683 | 0.638 |
| 3 | 0.803 | 0.806 |
| 4 | 0.853 | 0.862 |
| 5 | 0.864 | 0.863 |
| 6 | 0.751 | 0.740 |
| 7 | 0.788 | 0.805 |
| 8 | 0.807 | 0.796 |
| 9 | 0.748 | 0.773 |
| 10 | 0.739 | 0.766 |
| Mean | 0.784 | 0.784 |
| (Stdev) | (0.055) | (0.064) |
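The summary rows can be recomputed from the ten group values in the table. The short check below uses Python's standard `statistics` module; the variable names are illustrative. The reported standard deviations appear to match the sample (n − 1) formula, which is what `statistics.stdev` computes.

```python
from statistics import mean, stdev

# Group-level AUC values copied from the table (Groups 1 to 10).
default_auc = [0.804, 0.683, 0.803, 0.853, 0.864,
               0.751, 0.788, 0.807, 0.748, 0.739]
tuned_auc = [0.793, 0.638, 0.806, 0.862, 0.863,
             0.740, 0.805, 0.796, 0.773, 0.766]

# stdev() is the sample standard deviation (divides by n - 1).
print(round(mean(default_auc), 3), round(stdev(default_auc), 3))  # 0.784 0.055
print(round(mean(tuned_auc), 3), round(stdev(tuned_auc), 3))      # 0.784 0.064
```

Both parameter settings give the same mean AUC to three decimals, while the tuned setting shows slightly more spread across groups.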