Erdahl T Teber, Jason Y Liu, Sara Ballouz, Diane Fatkin, Merridee A Wouters.
Abstract
BACKGROUND: Automated candidate gene prediction systems allow geneticists to home in on disease genes more rapidly by identifying the most probable candidate genes linked to the disease phenotypes under investigation. Here we assessed the ability of eight different candidate gene prediction systems to predict disease genes in intervals previously associated with type 2 diabetes by benchmarking their performance against genes implicated by recent genome-wide association studies.
Year: 2009 PMID: 19208173 PMCID: PMC2648789 DOI: 10.1186/1471-2105-10-S1-S69
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Data sources and approaches used in automated candidate gene prediction methods. (A): Most systems draw on at least two types of data. SUSPECTS [21] (not shown) uses keywords from InterPro [22] and GO [23], co-expression data, and also incorporates the PROSPECTR module [12] (shown on right). (B): Upper left: Gene clustering approaches associate a gene cluster with a phenotype via a group member. For example, Systems Biology approaches [4,5,24] group genes whose protein products interact, and link them to a phenotype using a group-member gene associated with the phenotype. Systems Biology methods assume oligogenic diseases are associated with disruption in proteins that participate in a common complex or pathway [25]. Other gene clustering systems look for enrichment of keywords or domains associated with particular phenotypes and suggest candidate genes with similar properties. These systems are based on the principle that candidate genes have similar functions to disease genes already determined [26-28]. Upper right: Phenotype clustering approaches such as that of Freudenberg & Propping [29] group related phenotypes into superphenotypes. Lower left: Most of the Machine Learning approaches do not use phenotype information and are based on the concept that the genome consists of a bipartite distribution of genes: those that cause diseases, and those that do not. By analysing these two gene sets with respect to discriminating variables, a profile for "non-disease genes" and "disease genes" is produced, which enables training of a classifier. A novel gene submitted to the classifier is flagged as either "disease-causing" or "non-disease-causing". Systems include eVOC [30], PROSPECTR [12], SUSPECTS [21] and DGP [31]. Lower right: Finally, G2D is a transitive method that maps phenotypes to genes [32] by interfacing literature- and keyword-based ontologies.
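The two-class idea behind the Machine Learning approaches in the caption can be illustrated with a toy example. The sketch below is not the method of eVOC, PROSPECTR, SUSPECTS or DGP: it swaps in a simple nearest-centroid classifier, and the feature names, feature values and function names (`centroid`, `train`, `classify`) are all hypothetical, chosen only to show how a profile of each gene set can be used to flag a novel gene.

```python
from math import dist  # Euclidean distance, Python 3.8+


def centroid(rows):
    """Mean feature vector of a gene set (its 'profile')."""
    n = len(rows)
    return tuple(sum(col) / n for col in zip(*rows))


def train(disease_rows, non_disease_rows):
    """Build one profile per class from the two gene sets."""
    return {
        "disease-causing": centroid(disease_rows),
        "non-disease-causing": centroid(non_disease_rows),
    }


def classify(model, features):
    """Flag a novel gene with the label of the closest profile."""
    return min(model, key=lambda label: dist(model[label], features))


# Made-up discriminating variables: (gene length in kb, conservation score).
# These values are illustrative only and do not come from the source.
disease = [(120.0, 0.90), (95.0, 0.80), (150.0, 0.85)]
non_disease = [(20.0, 0.30), (35.0, 0.40), (15.0, 0.20)]

model = train(disease, non_disease)
label = classify(model, (110.0, 0.75))
print(label)
```

A real system would scale the features and use a properly validated classifier; the point here is only the train-then-flag workflow.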
Automated Candidate Gene Prediction Systems
♠ Assessed here, ◇ Webserver.
Figure 2. Comparison of methods against the HS (left) and MHWD (right) T2D gene data sets. Top: Relative Enrichment Ratios. Bottom: Comparisons based on Sensitivity and Specificity.
Gentrepid ab initio results
| Predictions | Reference list | ER | ER L95% | ER U95% | S | S L95% | S U95% |
| CPS rank 8+ pathways | HS | 3.3 | 1.1 | 9.4 | 0.45 | 0.21 | 0.72 |
| CPS rank 8+ pathways | HS – annotated | 7.2 | 2.1 | 25 | 1.00 | 0.57 | 1.00 |
| CPS rank 8+ pathways | MHWD | 2.1 | 1.3 | 3.6 | 0.30 | 0.20 | 0.42 |
| CPS rank 8+ pathways | MHWD – annotated | 6.8 | 3.6 | 13 | 0.95 | 0.75 | 0.99 |
| CPS interactions top 50% | HS | 4.1 | 1.2 | 15 | 0.27 | 0.10 | 0.57 |
| CPS interactions top 50% | HS – annotated | 9.0 | 2.2 | 37 | 0.60 | 0.23 | 0.88 |
| CPS interactions top 50% | MHWD | 1.7 | 0.79 | 3.8 | 0.11 | 0.06 | 0.22 |
| CPS interactions top 50% | MHWD – annotated | 8.1 | 3.2 | 20 | 0.54 | 0.29 | 0.77 |
| CMP top 10% | HS | 2.2 | 0.3 | 17 | 0.1 | 0.02 | 0.38 |
| CMP top 10% | MHWD | 2.0 | 0.8 | 4.8 | 0.1 | 0.08 | 0.18 |
Abbreviations in Table: ER – Enrichment Ratio, L95% – Lower 95% confidence limit, U95% – Upper 95% confidence limit, S – Sensitivity
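To make the sensitivity columns concrete, here is a minimal sketch of how a sensitivity and a 95% confidence interval could be computed from counts. It assumes sensitivity is the fraction of reference genes recovered by the predictions and uses a Wilson score interval. This excerpt does not state which interval method or which counts were used, so the function names and the counts (3 of 5) are hypothetical.

```python
from math import sqrt


def sensitivity(true_positives, reference_total):
    """Fraction of reference genes that appear among the predictions."""
    return true_positives / reference_total


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical counts: 3 of 5 reference genes recovered.
s = sensitivity(3, 5)
low, high = wilson_interval(3, 5)
print(f"S = {s:.2f} (95% CI {low:.2f} to {high:.2f})")
# prints "S = 0.60 (95% CI 0.23 to 0.88)"
```

With these assumed counts the result has the same form as a table row (S, L95%, U95%). That is consistent with a Wilson-type interval, but it does not confirm the method actually used.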
Figure 3. MHWD dataset filtered against prioritized automatic candidate gene predictions. Genes in bold are robustly supported genes from the GWA studies (HS set).