Richard A George, Jason Y Liu, Lina L Feng, Robert J Bryson-Richardson, Diane Fatkin, Merridee A Wouters.
Abstract
Linkage analysis is a successful procedure for associating diseases with specific genomic regions. These regions are often large, containing hundreds of genes, which makes the experimental methods used to identify the disease gene arduous and expensive. We present two methods to prioritize candidates for further experimental study: Common Pathway Scanning (CPS) and Common Module Profiling (CMP). CPS is based on the assumption that common phenotypes are associated with dysfunction in proteins that participate in the same complex or pathway. CPS applies network data derived from protein-protein interaction (PPI) and pathway databases to identify relationships between genes. CMP identifies likely candidates using a domain-dependent sequence similarity approach, based on the hypothesis that disruption of genes of similar function will lead to the same phenotype. Both algorithms accept two forms of input data: known disease genes or multiple disease loci. When using known disease genes as input, our combined methods have a sensitivity of 0.52 and a specificity of 0.97 and reduce the candidate list by 13-fold. Using multiple loci, our methods successfully identify disease genes for all benchmark diseases with a sensitivity of 0.84 and a specificity of 0.63. Our combined approach prioritizes good candidates and will accelerate the disease gene discovery process.
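As a rough illustration of the CPS idea, the sketch below collects genes in a linkage interval that lie within a given number of interaction steps of a known disease gene. The function name `cps_candidates`, the dictionary layout, and the toy network with placeholder gene labels are all assumptions made for illustration; this is not the authors' implementation and does not use real OPHID, KEGG or BioCarta data.

```python
from collections import deque


def cps_candidates(network, disease_genes, interval_genes, max_distance=1):
    """Return {gene: distance} for interval genes reachable from any known
    disease gene within max_distance interaction steps.

    network: dict mapping each gene to the set of genes it interacts with
             (hypothetical layout; interactions are treated as undirected).
    """
    # Multi-source breadth-first search, so each recorded distance is the
    # shortest path length to the nearest known disease gene.
    distance = {gene: 0 for gene in disease_genes}
    queue = deque(disease_genes)
    while queue:
        gene = queue.popleft()
        if distance[gene] == max_distance:
            continue  # do not expand beyond the requested interaction level
        for neighbour in network.get(gene, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[gene] + 1
                queue.append(neighbour)
    # Keep only genes inside the interval, excluding the seed genes themselves.
    return {g: d for g, d in distance.items() if g in interval_genes and d > 0}


# Hypothetical toy network: "D1" is a known disease gene, the rest are placeholders.
toy_network = {
    "D1": {"A", "B"},
    "A": {"D1", "C"},
    "B": {"D1"},
    "C": {"A", "E"},
    "E": {"C"},
}
interval = {"A", "C", "E", "X"}

level1 = cps_candidates(toy_network, ["D1"], interval, max_distance=1)
level3 = cps_candidates(toy_network, ["D1"], interval, max_distance=3)
print(level1)
print(level3)
```

Raising `max_distance` pulls in more candidates from the interval, which would be expected to trade specificity for sensitivity.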
Year: 2006 PMID: 17020920 PMCID: PMC1636487 DOI: 10.1093/nar/gkl707
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Table 1. Number of correctly predicted disease genes by each method using known disease genes
| Disease | Known disease genes | CMP | CPS BioCarta | CPS KEGG | CPS OPHID | CPS OPHIDh | CPS OPHIDlit+ | CPS OPHIDlit− | Total | Random (50) | Random (100) | Random (150) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| aan | 4 | 0 | 0 | 0 | 3 | 3 | 3 | 2 | 3 | 0.1 | 0.1 | 0.1 |
| alz | 8 | 2 | 3 | 6 | 5 | 5 | 5 | 3 | 6 | 0.3 | 0.2 | 0.2 |
| aml | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.2 | 0.2 | 0.2 |
| bb | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| bc | 9 | 0 | 4 | 0 | 6 | 6 | 6 | 0 | 6 | 0.5 | 0.5 | 0.5 |
| bcc | 4 | 1 | 1 | 2 | 3 | 3 | 3 | 0 | 3 | 0.1 | 0.0 | 0.1 |
| cchn | 6 | 5 | 0 | 0 | 5 | 4 | 4 | 4 | 5 | 0.4 | 0.3 | 0.3 |
| cf | 5 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 2 | 0.2 | 0.2 | 0.2 |
| cfh | 12 | 5 | 0 | 4 | 4 | 4 | 4 | 0 | 9 | 1.0 | 0.7 | 0.8 |
| cmt | 5 | 0 | 0 | 0 | 2 | 2 | 2 | 0 | 2 | 0.2 | 0.2 | 0.2 |
| ebl | 5 | 3 | 0 | 5 | 5 | 5 | 5 | 0 | 5 | 0.2 | 0.1 | 0.1 |
| ed | 7 | 5 | 0 | 2 | 0 | 0 | 0 | 0 | 5 | 0.4 | 0.3 | 0.2 |
| fap | 4 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 3 | 0.2 | 0.2 | 0.1 |
| gc | 5 | 0 | 2 | 3 | 0 | 0 | 0 | 0 | 4 | 0.3 | 0.2 | 0.2 |
| h | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.1 | 0.2 | 0.2 |
| ibd | 5 | 0 | 2 | 3 | 4 | 4 | 4 | 2 | 4 | 0.4 | 0.3 | 0.3 |
| joag | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.1 | 0.1 | 0.1 |
| lca | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.1 | 0.1 | 0.1 |
| lhscr | 5 | 0 | 0 | 2 | 2 | 2 | 2 | 0 | 4 | 0.2 | 0.3 | 0.3 |
| md | 6 | 2 | 0 | 0 | 2 | 2 | 2 | 0 | 3 | 0.1 | 0.1 | 0.1 |
| mf | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.2 | 0.2 | 0.2 |
| mody | 6 | 2 | 0 | 0 | 4 | 4 | 4 | 2 | 5 | 0.3 | 0.3 | 0.3 |
| niddm | 8 | 4 | 2 | 0 | 2 | 2 | 2 | 2 | 5 | 0.6 | 0.4 | 0.3 |
| oc | 4 | 0 | 0 | 4 | 2 | 2 | 2 | 2 | 4 | 0.3 | 0.3 | 0.3 |
| pc | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.1 | 0.1 | 0.2 |
| pd | 3 | 0 | 0 | 3 | 2 | 2 | 2 | 0 | 3 | 0.1 | 0.0 | 0.0 |
| rp | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.2 | 0.2 | 0.2 |
| sle | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.2 | 0.1 | 0.2 |
| tcp | 13 | 3 | 0 | 2 | 4 | 4 | 4 | 0 | 7 | 0.9 | 0.8 | 0.8 |
| Total | 170 | 32 | 16 | 41 | 55 | 54 | 54 | 17 | 88 | 8.0 | 6.6 | 6.7 |
CMP results are based on a cut-off threshold of 0.1. CPS interaction results use the first level of interaction only. CPS-OPHID contains all PPI data from OPHID. CPS-OPHIDh contains human data only. CPS-OPHIDlit+ contains data from literature databases only. CPS-OPHIDlit− does not contain PPI data from literature databases. Random is calculated on total predictions for the 50, 100 and 150 interval sizes.

Disease abbreviations: aan, adrenoleukodystrophy, autosomal neonatal; alz, Alzheimer disease; aml, acute myeloid leukemia; bb, Bardet-Biedl syndrome; bc, breast cancer; bcc, basal cell carcinoma; cchn, colorectal cancer, hereditary nonpolyposis; cf, cystic fibrosis; cfh, cardiomyopathy, familial hypertrophic; cmt, Charcot-Marie-Tooth disease; ebl, epidermolysis bullosa letalis; ed, epiphyseal dysplasia, multiple types 1–5; fap, familial adenomatous polyposis; gc, gastric cancer; h, hypertension; ibd, inflammatory bowel disease; joag, juvenile-onset primary open angle glaucoma; lca, Leber congenital amaurosis; lhscr, long-segment Hirschsprung disease; md, muscular dystrophy, limb-girdle; mf, familial meningioma; mody, maturity-onset diabetes of the young; niddm, type 2 diabetes mellitus; oc, ovarian carcinoma; pc, prostate cancer; pd, Parkinson disease; rp, retinitis pigmentosa; sle, systemic lupus erythematosus; tcp, thyroid carcinoma, papillary.
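The totals row of the table above can be tied back to the abstract. Assuming the Total column counts known disease genes recovered by CMP and CPS combined, 88 of 170 corresponds to the reported sensitivity of 0.52. The short check below uses only figures from the totals row; the variable names are illustrative.

```python
# Values copied from the Total row of the table above.
known_disease_genes = 170
combined_correct = 88
per_method_correct = {
    "CMP": 32,
    "CPS BioCarta": 16,
    "CPS KEGG": 41,
    "CPS OPHID": 55,
    "CPS OPHIDh": 54,
    "CPS OPHIDlit+": 54,
    "CPS OPHIDlit-": 17,
}

# Fraction of known disease genes recovered by the combined methods.
combined_sensitivity = combined_correct / known_disease_genes
print(f"combined: {combined_sensitivity:.2f}")  # prints "combined: 0.52"

# Fraction recovered by each method on its own.
per_method_fraction = {
    name: correct / known_disease_genes
    for name, correct in per_method_correct.items()
}
for name, fraction in per_method_fraction.items():
    print(f"{name}: {fraction:.2f}")
```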
Figure 1. Sensitivity (continuous line) and proportion of predicted genes that are actually disease genes (dashed line) for OPHID (diamond), OPHIDh (circle), OPHIDlit+ (triangle) and OPHIDlit− (square) at three levels of interactions (Distance). Results are shown for the 100 interval size only.
Figure 2. Performance of PPI data from (a) OPHID, (b) OPHIDh, (c) OPHIDlit+ and (d) OPHIDlit−. Results are shown for three levels of interaction using the shortest path length to a disease gene (Distance). Black diamonds represent the number of disease genes found. The numbers of non-disease genes returned are shown for the 50 gene interval (square), 100 gene interval (triangle) and 150 gene interval (x). The numbers of disease genes returned by random selection are shown for the 50 gene interval (*), 100 gene interval (circle) and 150 gene interval (+).
Figure 3. Combined prediction success. (a) Correct predictions based on known disease genes. (b) Correct predictions based on multiple intervals. (c) Combined CPS and CMP predictions for familial hypertrophic cardiomyopathy using known disease genes. Disease genes are represented by their HUGO names. Lines linking genes are predictions by CPS and CMP. For example, TNNT2 is found from the known disease gene TNNI3 using CPS-PPI and CMP predictions, and TNNI3 is found from the known disease gene TNNT2 using CPS-PPI predictions. PRKAG2 and TPM1 were found using PPI data at a distance of three; all other PPI predictions are at a distance of one.
Multiple interval benchmark results
| Method | Sensitivity (50) | Specificity (50) | ER (50) | Sensitivity (100) | Specificity (100) | ER (100) | Sensitivity (150) | Specificity (150) | ER (150) |
|---|---|---|---|---|---|---|---|---|---|
| CPS-pathway | 0.35 | 0.90 | 3.4 | 0.39 | 0.89 | 3.4 | 0.41 | 0.88 | 3.2 |
| CPS-PPI | 0.39 | 0.95 | 7.3 | 0.42 | 0.93 | 6.1 | 0.47 | 0.92 | 5.6 |
| CPS | 0.54 | 0.87 | 4.0 | 0.59 | 0.84 | 3.7 | 0.62 | 0.82 | 3.5 |
| CMP ( | 0.17 | 0.95 | 3.3 | 0.19 | 0.94 | 3.1 | 0.23 | 0.93 | 3.2 |
| CMP ( | 0.46 | 0.77 | 1.9 | 0.55 | 0.72 | 1.9 | 0.59 | 0.69 | 1.9 |
| CMP ( | 0.16 | 0.95 | 3.2 | 0.18 | 0.94 | 3.1 | 0.22 | 0.94 | 3.3 |
| CMP ( | 0.46 | 0.77 | 2.0 | 0.55 | 0.72 | 1.9 | 0.58 | 0.69 | 1.9 |
| CPS-CMP ( | 0.74 | 0.69 | 2.3 | 0.84 | 0.63 | 2.2 | 0.87 | 0.59 | 2.1 |
, significance based on the assumption that domains in a gene are uncorrelated; , significance based on the assumption that domains in a gene are correlated; multi, genes that contain multiple Pfam domains only; all, genes that contain at least one Pfam domain. All χ² tests are at a significance level of 0.995.
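For readers comparing rows of the benchmark table, sensitivity and specificity follow the standard confusion-matrix definitions, sketched below. The counts are hypothetical toy values, not taken from the benchmark, and the function name is an assumption. ER is left out because this record does not define how it is calculated.

```python
def sensitivity_specificity(tp, fp, tn, fn):
    """Standard definitions from confusion-matrix counts.

    tp: disease genes predicted as candidates
    fn: disease genes missed
    tn: non-disease genes correctly left out
    fp: non-disease genes predicted as candidates
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical counts for illustration only.
sens, spec = sensitivity_specificity(tp=8, fp=10, tn=90, fn=2)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

A method that returns more candidates per interval typically raises sensitivity while lowering specificity, which is the pattern visible when the CPS and CMP rows are compared with the combined CPS-CMP row.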
Figure 4. Candidate gene enrichment for the 50 (a), 100 (b) and 150 (c) gene interval sizes using the combined methods. Enrichment values are on the y-axis and diseases are listed alphabetically from left to right on the x-axis, as in Table 1. Black diamonds represent enrichment of data using known disease genes. Grey squares represent enrichment of data using multiple intervals. The dashed line represents data enrichment by random selection.