Matthew T Weirauch, Christopher K Wong, Alexandra B Byrne, Joshua M Stuart.
Abstract
BACKGROUND: The rapid annotation of genes on a genome-wide scale is now possible for several organisms using high-throughput RNA interference assays to knock down the expression of a specific gene. To date, dozens of RNA interference phenotypes have been recorded for the nematode Caenorhabditis elegans. Although previous studies have demonstrated the merit of using knock-down phenotypes to predict gene function, it is unclear how the data can be used most effectively. An open question is how to optimally make use of phenotypic observations, possibly in combination with other functional genomics datasets, to identify genes that share a common role.
Year: 2008 PMID: 18959798 PMCID: PMC2596148 DOI: 10.1186/1471-2105-9-463
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Four possible associations between phenotypic congruency and shared gene function. Boxes indicate that a phenotype (column) was observed to be either present (dark box) or absent (light box) upon knock-down of a gene (row) using RNA interference. Genes are grouped into four different pairs: A, B, C, and D. Pairs A and C contain genes with identical signatures; pairs A and D contain genes with related function. Phenotypes are ordered from left to right by decreasing frequency (indicated as inverted bars in the bar graph), calculated across genes with at least one phenotype.
The 34 phenotypes collected from high-throughput RNA interference screens.
| Phenotype | Abbreviation | Frequency | Phenotype | Abbreviation | Frequency |
| Embryonic lethal | EMB | 0.52 | Paralyzed | PRL | 0.037 |
| Slow post-embryonic growth | GRO | 0.43 | Patchy appearance | PCH | 0.035 |
| Larval arrest | LVA | 0.37 | Sluggish appearance | SLU | 0.030 |
| Uncoordinated | UNC | 0.33 | Long | LON | 0.026 |
| Sterile | STE | 0.26 | Adult lethal | ADL | 0.025 |
| Protruding vulva | PVL | 0.18 | Molting defect | MLT | 0.016 |
| Lethal | LET | 0.17 | Blistering of cuticle | BLI | 0.016 |
| Sterile progeny | STP | 0.14 | Pale | PALE | 0.0093 |
| Reduced brood size | RBS | 0.13 | High incidence of males | HIM | 0.0072 |
| Body morphological defects | BMD | 0.10 | Oocytes | OOC | 0.0072 |
| Sick | SCK | 0.093 | Roller | ROL | 0.0063 |
| Ruptured | RUP | 0.080 | Multivulva | MUV | 0.0059 |
| Dumpy | DPY | 0.077 | Kinker | KNK | 0.0013 |
| Clear | CLR | 0.074 | Unique phenotype | UNIQ | 0.0013 |
| Egg-laying defect | EGL | 0.056 | Vulvaless | VUL | 0.0013 |
| Thin | THIN | 0.040 | Hyperactive | HYA | 0.0008 |
| Small | SMA | 0.037 | Social | SOC | 0.0004 |
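The numeric column in the table above appears to be the phenotype frequency, which the Figure 1 caption describes as calculated across genes with at least one phenotype. A minimal sketch of that calculation follows. The function name, the dictionary layout, and the toy gene signatures are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter

def phenotype_frequencies(signatures):
    """Fraction of genes showing each phenotype, among genes with >= 1 phenotype.

    signatures: dict mapping a gene name to the set of phenotype codes observed
    for that gene (hypothetical layout, chosen for this sketch).
    """
    # Genes with no observed phenotype are excluded from the denominator.
    observed = {gene: phs for gene, phs in signatures.items() if phs}
    n_genes = len(observed)
    if n_genes == 0:
        return {}
    counts = Counter(ph for phs in observed.values() for ph in phs)
    return {ph: count / n_genes for ph, count in counts.items()}

# Invented toy data, not real screen results.
toy = {
    "geneA": {"EMB", "GRO"},
    "geneB": {"EMB"},
    "geneC": set(),            # no phenotype: not counted
    "geneD": {"UNC", "GRO"},
    "geneE": {"EMB", "STE"},
}
freqs = phenotype_frequencies(toy)
# Order phenotypes by decreasing frequency, as in the table.
ranked = sorted(freqs, key=freqs.get, reverse=True)
```

Excluding genes with no phenotype keeps the denominator restricted to genes for which a knock-down effect was actually observed.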
19 metrics evaluated for their ability to identify functionally related genes from knock-down phenotypes. See Methods for mathematical definitions of each metric.
| Metric type (a) | Description | |
| P | Counts the number of matching present phenotypes. | |
| A | Counts the number of matching absent phenotypes. | |
| P, A | Counts the number of matching present and absent phenotypes. | |
| P, A | Vector correlation coefficient. | |
| P, A | Same as | |
| P, A | Same as | |
| P, A | Measures the degree to which knowledge about one gene's phenotypes reduces the entropy of another's. | |
| P, A | The "straight line" distance between two vectors. | |
| P, A | The number of matching present phenotypes divided by the number of phenotypes present in either gene. | |
| P, F | Scales the number of matching present phenotypes by the frequency of each phenotype. | |
| P, F | Same as | |
| P, F | Same as | |
| P, A, F | Same as | |
| P, A, F | Ranking system used by PhenoBlast (Gunsalus | |
| P, A, F | Scales the number of matching present and absent phenotypes by their frequencies across all genes [ | |
| P, C | Same as | |
| P, F, C | Weights by the background pairwise co-occurrences of present phenotypes. | |
| P, F, C | Same as | |
| P, A, F, C | Same as |
a: The metric type indicates whether the metric rewards shared present phenotypes (P), rewards shared absent phenotypes (A), factors in frequencies of phenotypes across all genes (F), and/or factors in pairwise co-occurrence of phenotypes across all genes (C).
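Several of the simpler descriptions in the table above can be written directly over binary phenotype vectors (1 for present, 0 for absent). The sketch below covers the matching-present count, the matching-absent count, the ratio of matching present phenotypes to phenotypes present in either gene, the straight-line distance, and the vector correlation coefficient. The function names are hypothetical, since the metric names did not survive in the table, and the exact definitions used in the paper are in its Methods.

```python
import math

def shared_present(x, y):
    """Count phenotypes present (1) in both genes."""
    return sum(1 for a, b in zip(x, y) if a == 1 and b == 1)

def shared_absent(x, y):
    """Count phenotypes absent (0) in both genes."""
    return sum(1 for a, b in zip(x, y) if a == 0 and b == 0)

def present_overlap_ratio(x, y):
    """Matching present phenotypes divided by phenotypes present in either gene."""
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return shared_present(x, y) / either if either else 0.0

def euclidean_distance(x, y):
    """The 'straight line' distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def correlation(x, y):
    """Pearson correlation of two vectors; 0.0 if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

# Invented signatures over six phenotypes.
g1 = [1, 1, 0, 0, 1, 0]
g2 = [1, 0, 0, 0, 1, 1]
```

Note that the distance is a dissimilarity (smaller means more alike), whereas the other four functions are similarities, so rankings built from them must sort in opposite directions.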
Figure 2. Evaluation of metrics. A. Gene network precision. The precision of the top-scoring gene pairs is shown for each evaluated metric (see Methods). Gene pair precision (y-axis) is plotted against the number of top-scoring gene pairs for the given metric (x-axis). Metrics discussed in the text are displayed as bold lines. The dashed line indicates the background precision of all gene pairs in the dataset. B. Gene neighborhood functional coherence. The 25 most similar genes to each query gene were identified using each evaluated metric. The precision of these 25 gene pairs was calculated using the evaluation set. Shown is the number of query genes with high precision (> 0.25) for information-based (black bars) and non-information-based metrics (gray bars).
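The precision curve in panel A can be sketched as follows: rank all gene pairs by score, keep the top k, and report the fraction found in the evaluation set. The function name, the pair representation, and the toy scores are assumptions for illustration. The sketch also assumes that a higher score means greater similarity and that the evaluation set can be expressed as a set of functionally related gene pairs.

```python
def precision_at_k(scored_pairs, positives, k):
    """Fraction of the k top-scoring gene pairs that are in the evaluation set.

    scored_pairs: list of ((gene1, gene2), score) tuples (hypothetical layout).
    positives: set of frozenset({gene1, gene2}) pairs treated as related.
    """
    top = sorted(scored_pairs, key=lambda item: item[1], reverse=True)[:k]
    if not top:
        return 0.0
    hits = sum(1 for pair, _ in top if frozenset(pair) in positives)
    return hits / len(top)

# Invented toy scores and evaluation set.
pairs = [
    (("a", "b"), 0.9),
    (("a", "c"), 0.8),
    (("b", "c"), 0.4),
    (("c", "d"), 0.2),
    (("a", "d"), 0.1),
]
related = {frozenset(("a", "b")), frozenset(("c", "d"))}

top1 = precision_at_k(pairs, related, 1)
top2 = precision_at_k(pairs, related, 2)
# Using every pair gives the background precision (the dashed line in panel A).
background = precision_at_k(pairs, related, len(pairs))
```

Storing pairs as frozensets makes the lookup independent of gene order within a pair.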
Figure 3. Network and subnetwork comparisons. Key is shown in the upper left-hand panel. A. AGREE links join genes with more shared present phenotypes. The x-axis indicates the link complexity, or number of shared present phenotypes. The y-axis indicates the frequency of occurrence of each link complexity bin. B. AGREE links have higher precision for every level of link complexity. The links of each network were binned by their complexity (x-axis). The y-axis indicates the precision of each bin for each network. Error bars indicate one standard deviation, assuming a binomial distribution. C. AGREE subnetworks are enriched for more phenotypes. The number of over-represented phenotypes present in the genes of each subnetwork (subnetwork complexity) was determined using the hypergeometric distribution (see Methods). The x-axis indicates subnetwork complexity. The y-axis indicates the frequency of the given subnetwork complexity in the ANET and UNET. D. A greater number of AGREE links are supported by other data types. The x-axis indicates the data type. The y-axis indicates the number of links in the ANET and UNET that are supported by that data type. Error bars indicate one standard deviation, assuming a binomial distribution.
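Two statistical ingredients of this figure can be sketched with the standard library: a hypergeometric upper-tail probability for phenotype over-representation in a subnetwork (panel C), and a binomial standard deviation for a precision estimate (panels B and D). The function names and parameter layout are assumptions, and the significance threshold and any multiple-testing correction used in the paper are not reproduced here. Scaling the binomial deviation to a proportion is also an assumption about how the error bars were drawn.

```python
from math import comb, sqrt

def hypergeom_upper_tail(population, with_phenotype, drawn, observed):
    """P(X >= observed) for a hypergeometric draw.

    population: total number of genes
    with_phenotype: genes in the population showing the phenotype
    drawn: number of genes in the subnetwork
    observed: subnetwork genes showing the phenotype
    """
    upper = min(with_phenotype, drawn)
    total = comb(population, drawn)
    favourable = sum(
        comb(with_phenotype, i) * comb(population - with_phenotype, drawn - i)
        for i in range(observed, upper + 1)
    )
    return favourable / total

def binomial_precision_sd(hits, n_links):
    """One standard deviation of a proportion hits / n_links under a binomial model."""
    p = hits / n_links
    return sqrt(p * (1.0 - p) / n_links)

# Invented numbers: 10 genes, 5 with the phenotype, subnetwork of 5 genes, all 5 showing it.
p_value = hypergeom_upper_tail(10, 5, 5, 5)
sd = binomial_precision_sd(25, 100)
```

A small tail probability means the phenotype is present in the subnetwork more often than random sampling would suggest; counting such phenotypes gives a subnetwork complexity in the sense of panel C.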
Figure 4. Comparison of the AGREE and UPC metrics. Each point represents one functional category, and indicates the negative log significance of the pairwise scores of all genes within that functional category using the AGREE (x-axis) and UPC metrics (y-axis). Dashed lines indicate significances of P < 0.01 or better. Categories were taken from the evaluation set, and were filtered to ensure that no two categories overlap by greater than half of their gene members (see Methods). * 'Embryonic development' is short for 'Embryonic development ending in birth or egg hatching.'
Figure 5. Integration with other data sources. Link colors indicate interaction type: Green, AGREE phenotype congruency; Blue, UPC phenotype congruency; Purple, protein-protein; Red, co-expression; Orange, genetic. A. Superimposed network. Superimposed network created from multiple data sources. Shown below is a multiply-supported subnetwork identified in the superimposed network. B. Multiply-supported network. Network created from restricting to links supported by multiple data sources. Shown below the network is one 'molecular machine', identified as a subnetwork in the multiply-supported network.
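The two constructions in this figure can be sketched as a union of links tagged by source (the superimposed network) followed by a filter on the number of supporting sources (the multiply-supported network). The function names, the data layout, and the toy links are assumptions for illustration, and the cut-off of two sources is one reading of "multiple".

```python
from collections import defaultdict

def superimpose(networks):
    """Merge several link lists into one map: link -> set of supporting sources.

    networks: dict mapping a source name to a list of (gene1, gene2) links
    (hypothetical layout). Links are undirected, so they are stored as frozensets.
    """
    support = defaultdict(set)
    for source, links in networks.items():
        for gene1, gene2 in links:
            support[frozenset((gene1, gene2))].add(source)
    return dict(support)

def multiply_supported(support, min_sources=2):
    """Keep only links backed by at least min_sources distinct data sources."""
    return {link: srcs for link, srcs in support.items() if len(srcs) >= min_sources}

# Invented toy links.
toy_networks = {
    "phenotype": [("g1", "g2"), ("g2", "g3")],
    "protein-protein": [("g2", "g1")],
    "co-expression": [("g3", "g4")],
}
merged = superimpose(toy_networks)
strong = multiply_supported(merged)
```

Connected groups of genes within the filtered network would then be candidates for the kind of subnetwork shown below panel B.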
Figure 6. Illustration of frequency-weighted phenotype congruency. The distance between points reflects the relative number of genes that share (or lack) the corresponding phenotype. Phenotypic signatures for a single gene are represented as a line; phenotypes correspond to individual vertical bars. The length of the gray area in a phenotype's bar is proportional to its frequency. The presence of a phenotype for a gene is indicated by drawing a point at the extremes of the shaded area for the phenotype. The total distance between the lines reflects the relative dissimilarity of the gene pair. A. Two genes which share frequently occurring present and absent phenotypes, and so would receive a poor information-based score. B. Two genes which share rare present and absent phenotypes, and so would receive a good information-based score.
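The intuition in panels A and B can be illustrated with a simple information-style weighting in which an agreement counts for more when it is less likely by chance. The scoring rule below (negative log of the frequency for a shared present phenotype, negative log of one minus the frequency for a shared absent phenotype) is a stand-in chosen for this sketch and is not the paper's metric. The two frequencies are taken from the phenotype table; the gene signatures are invented.

```python
import math

def info_weighted_score(present_x, present_y, freq):
    """Sum of surprisal over phenotypes on which two genes agree.

    present_x, present_y: sets of phenotype codes present in each gene.
    freq: dict mapping phenotype code to its frequency, with 0 < f < 1.
    """
    score = 0.0
    for phenotype, f in freq.items():
        in_x, in_y = phenotype in present_x, phenotype in present_y
        if in_x and in_y:
            score += -math.log(f)          # rare shared presence scores high
        elif not in_x and not in_y:
            score += -math.log(1.0 - f)    # rare shared absence scores high
    return score

# Frequencies from the phenotype table: EMB is common, SOC is rare.
freq = {"EMB": 0.52, "SOC": 0.0004}

# Panel A analogue: both genes show the common phenotype and lack the rare one.
common_agreement = info_weighted_score({"EMB"}, {"EMB"}, freq)
# Panel B analogue: both genes show the rare phenotype and lack the common one.
rare_agreement = info_weighted_score({"SOC"}, {"SOC"}, freq)
```

Under this weighting the second pair scores well above the first, which matches the ordering the caption describes for information-based scores.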