Philip Groth, Bertram Weiss, Hans-Dieter Pohlenz, Ulf Leser.
Abstract
BACKGROUND: Health and disease of organisms are reflected in their phenotypes. Often, a genetic component of a disease is discovered only after its phenotype has been clearly defined. In recent years, many technologies for systematically generating phenotypes in a high-throughput manner, such as RNA interference or gene knock-out, have been developed and used to decipher gene functions. However, there have been relatively few efforts to make use of phenotype data beyond single genotype-phenotype relationships.
Year: 2008 PMID: 18315868 PMCID: PMC2311305 DOI: 10.1186/1471-2105-9-136
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Distribution of cluster sizes. The diagram shows the distribution of the number of clusters in different sizes.
Figure 2. Cross-species phenotype data distribution. The left pie chart depicts the distribution of genes by species, i.e. the relative number of genes in our gene set according to species affiliation. The right pie chart shows the distribution of clusters according to single species or 'mixed', if the cluster is made up of genes from more than one species.
Figure 3. Protein-Protein interactions derived from one 'phenocluster' and genes lacking phenotype data. The figure shows an example for interactions between proteins from genes in a 'phenocluster'. Depicted is a network with many genes from the same 'phenocluster' (blue nodes with Entrez Gene IDs) for which associated proteins are connected, while the genes of all proteins that are responsible for these connections are not in our initial set of genes due to lack of substantial phenotype data (red nodes).
Figure 4. Protein-Protein interactions of proteins derived from several 'phenoclusters'. The figure shows an example for interactions between proteins from genes of several 'phenoclusters'. Depicted is a network with many genes from the same 'phenocluster' (blue nodes with Entrez Gene IDs) for which associated proteins are connected. Also, other genes with known phenotypes for which proteins are responsible for some connections are not in the same 'phenocluster' but in the same network (shown as green nodes).
'Phenocluster' with 17 associated genes with a GO-score of 0.9 in the Biological Process subtree.
| Entrez ID | Gene Symbol | Gene name | # annotated GO-process terms | # terms common to at least 50% genes in group | # terms common to at least 75% genes in group |
| 172805 | rps-19 | Ribosomal Protein, Small subunit 19 | 5 | 5 | 5 |
| 174346 | eif-3.G | Eukaryotic Initiation Factor | 7 | 4 | 4 |
| 175501 | rpl-3 | Ribosomal Protein, Large subunit 3 | 6 | 6 | 5 |
| 175538 | lrs-1 | Leucyl tRNA Synthetase | 14 | 6 | 5 |
| 175584 | rps-1 | Ribosomal Protein, Small subunit 1 | 7 | 6 | 5 |
| 175659 | rrt-1 | aRginyl aa-tRNA syntheTase | 8 | 4 | 4 |
| 175796 | rpl-23 | Ribosomal Protein, Large subunit 23 | 8 | 6 | 5 |
| 175901 | rps-13 | Ribosomal Protein, Small subunit 13 | 5 | 5 | 5 |
| 176007 | rpl-36 | Ribosomal Protein, Large subunit 36 | 6 | 6 | 5 |
| 176011 | rps-21 | Ribosomal Protein, Small subunit 21 | 6 | 6 | 5 |
| 176024 | prs-1 | Prolyl tRNA Synthetase | 9 | 6 | 5 |
| 176071 | rpl-9 | Ribosomal Protein, Large subunit 9 | 7 | 6 | 5 |
| 176097 | rpl-35 | Ribosomal Protein, Large subunit 35 | 5 | 5 | 5 |
| 176146 | rpl-21 | Ribosomal Protein, Large subunit 21 | 5 | 5 | 5 |
| 177583 | rps-2 | Ribosomal Protein, Small subunit 2 | 5 | 5 | 5 |
| 179063 | W02F12.5 | W02F12.5 | 8 | 5 | 5 |
| 189611 | Y37B11A.3 | Y37B11A.3 | 2 | 1 | 1 |
Of all terms associated with this group, there are 5 terms annotated to 14 out of 17 genes. Due to the homogeneous nature of the annotations, one can hypothesize that the remaining 3 genes should receive the same common annotation as the other 14 genes.
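The hypothesis above (transferring terms shared by most genes in a group to the remaining genes) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `predict_shared_terms`, the 75% default threshold, and the placeholder gene and term names are all assumptions made for the example. The toy data only mirrors the counts in the table (17 genes, 5 terms each shared by at least 14 genes); it does not reproduce the real annotations.

```python
from collections import Counter


def predict_shared_terms(cluster, min_fraction=0.75):
    """Suggest GO terms for genes in one cluster.

    cluster: dict mapping gene ID -> set of annotated GO terms.
    A term counts as "common" if at least min_fraction of the genes
    in the cluster carry it (threshold is an assumption here).
    Returns a dict mapping gene ID -> set of common terms it lacks.
    """
    n_genes = len(cluster)
    counts = Counter(term for terms in cluster.values() for term in terms)
    common = {t for t, c in counts.items() if c / n_genes >= min_fraction}
    predictions = {}
    for gene, terms in cluster.items():
        missing = common - terms
        if missing:
            predictions[gene] = missing
    return predictions


# Hypothetical toy cluster: placeholder names, not real annotations.
shared = {"term_1", "term_2", "term_3", "term_4", "term_5"}
toy_cluster = {f"gene_{i}": set(shared) for i in range(1, 15)}  # 14 genes
toy_cluster["gene_15"] = shared - {"term_5"}
toy_cluster["gene_16"] = shared - {"term_5"}
toy_cluster["gene_17"] = {"term_1"}

predictions = predict_shared_terms(toy_cluster)
for gene in sorted(predictions):
    print(gene, sorted(predictions[gene]))
```

In this toy case, the three genes that lack part of the common annotation are the only ones that receive predictions.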
Different criteria for filtering clusters for function prediction
| | (Filter 1) | (Filter 1 & Filter 2) | (Filter 1 & Filter 3) | (Filter 1 & Filter 4) | (Filter 1 & Filter 5) |
| # of groups | 196 | 74 | 53 | 185 | 11 |
| # of terms | 345 | 159 | 102 | 338 | 16 |
| # of genes | 3213 | 711 | 409 | 2895 | 320 |
| Precision | 67.91% | 62.52% | 60.52% | 67.73% | 64.70% |
| Recall | 22.98% | 26.16% | 19.78% | 23.80% | 11.21% |
To raise precision and recall, we sought filter criteria for selecting appropriate gene groups a priori. To this end, we defined the following filter criteria for our 1,000 'phenoclusters':
Filter 1: Removes groups with fewer than 3 genes or with no GO-terms associated with at least 50% of their genes.
Filter 2: Removes groups with a GO-similarity score < 0.4.
Filter 3: Removes groups with a PPi-connectedness < 33%.
Filter 4: Removes all non-single-species clusters.
Filter 5: Removes all single-species clusters.
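The five filters listed above can be expressed as simple predicates over a cluster record. The sketch below is an illustration under stated assumptions: the `Cluster` dataclass, its field names, and the `select` helper are hypothetical, and the GO-similarity score and PPi-connectedness are taken as precomputed inputs because this section does not describe how they are calculated.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical record for one 'phenocluster' (field names assumed)."""
    go_terms: dict            # gene ID -> set of GO terms
    species: set              # species represented in the cluster
    go_similarity: float      # precomputed GO-similarity score
    ppi_connectedness: float  # precomputed, as a fraction in [0, 1]


def passes_filter_1(c):
    # Keep groups with >= 3 genes and a GO term shared by >= 50% of genes.
    n = len(c.go_terms)
    if n < 3:
        return False
    counts = Counter(t for terms in c.go_terms.values() for t in terms)
    return any(count / n >= 0.5 for count in counts.values())


def passes_filter_2(c):
    return c.go_similarity >= 0.4        # removes score < 0.4


def passes_filter_3(c):
    return c.ppi_connectedness >= 0.33   # removes connectedness < 33%


def passes_filter_4(c):
    return len(c.species) == 1           # keeps single-species clusters


def passes_filter_5(c):
    return len(c.species) > 1            # keeps mixed-species clusters


def select(clusters, *filters):
    """Return the clusters that pass every given filter."""
    return [c for c in clusters if all(f(c) for f in filters)]


# Hypothetical toy clusters with placeholder genes, terms, and species.
a = Cluster({"g1": {"t1"}, "g2": {"t1"}, "g3": {"t2"}},
            {"species_A"}, 0.9, 0.5)
b = Cluster({"g1": {"t1"}, "g2": {"t2"}},
            {"species_A", "species_B"}, 0.2, 0.1)
c = Cluster({"g1": {"t1"}, "g2": {"t1"}, "g3": {"t1"}, "g4": set()},
            {"species_A", "species_B"}, 0.3, 0.4)

kept = select([a, b, c], passes_filter_1, passes_filter_2)
print(len(kept))
```

Writing each filter as a separate predicate makes combinations such as "Filter 1 & Filter 3" a matter of passing both functions to `select`.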
Results for different filters applied to gene groups (k = 1,000).
Precision and recall values of function prediction in all clusters and with varying k selected by different combinations of the filters defined in Table 2.
The distribution of clusters with their characteristics given different values for k (the number of clusters) from 500 to 3,000.
| K | 500 | 1,000 | 2,000 | 3,000 |
| Single Species cluster | 422 (84.4%) | 904 (90.4%) | 1897 (94.9%) | 2894 (96.5%) |
| # of Phenocopy-Pairs (of 25) | 25 (100%) | 13 (52%) | 12 (48%) | 8 (32%) |
| Cluster w/PT-Sim ≥ 0.4 | 92 (18.4%) | 293 (29.3%) | 526 (26.3%) | 810 (40.5%) |
| # Genes | 3221 | 5886 | 6379 | 6878 |
| Cluster w/GO-Sim ≥ 0.4 | 51 (10.2%) | 206 (20.6%) | 522 (26.1%) | 921 (46.1%) |
| Correlation GO-Sim vs PT-SIM | 0.53 | 0.41 | 0.37 | 0.28 |
| # Genes | 863 | 1800 | 2392 | 3065 |
| Cluster w/PPi ≥ 75% | 21 (4.2%) | 60 (6.0%) | 174 (8.7%) | 305 (10.2%) |
| # Genes | 1497 | 1858 | 2335 | 2702 |
| Cluster w/PPi ≥ 33% | 63 (12.6%) | 138 (13.8%) | 286 (14.3%) | 413 (13.8%) |
| # Genes | 3890 | 4322 | 4965 | 4996 |
| Cluster for GO-Predictions | 90 (18%) | 196 (19.6%) | 393 (19.7%) | 611 (20.4%) |
| # Genes | 2820 | 3213 | 4145 | 4546 |
| # Terms | 142 | 345 | 730 | 1226 |
| Precision | 72.55% | 67.91% | 63.40% | 60.31% |
| Recall | 16.73% | 22.98% | 25.63% | 28.32% |
| Avg. Genes/Cluster | 54 | 29 | 16 | 11 |
As an internal measure of cluster quality, we sought insight into how the data structure changes for different values of k, ranging from 500 to 3,000. Here, Filter 1 has been applied for GO-predictions. For details, see text.
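The precision and recall rows in the table show a trade-off: from k = 500 to k = 3,000, precision falls from 72.55% to 60.31% while recall rises from 16.73% to 28.32%. This section does not state how the two values were computed, so the sketch below only assumes the standard definitions, applied to sets of (gene, term) pairs. The function name and the toy pairs are hypothetical.

```python
def precision_recall(predicted, known):
    """Standard precision and recall over sets of (gene, term) pairs.

    predicted: set of (gene, term) pairs proposed by a prediction method.
    known:     set of (gene, term) pairs treated as the reference.
    Assumption: both values are computed over pairs; the source does
    not specify its exact evaluation protocol.
    """
    true_positives = len(predicted & known)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(known) if known else 0.0
    return precision, recall


# Hypothetical toy data with placeholder gene and term names.
predicted_pairs = {("g1", "t1"), ("g1", "t2"), ("g2", "t1")}
known_pairs = {("g1", "t1"), ("g2", "t1"), ("g2", "t3"), ("g3", "t1")}

precision, recall = precision_recall(predicted_pairs, known_pairs)
print(f"precision={precision:.2%} recall={recall:.2%}")
```

Here two of the three predicted pairs are in the reference set, and those two cover half of the four reference pairs.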