Jing Zhao, Ting-Hong Yang, Yongxu Huang, Petter Holme.
Abstract
Many diseases have complex genetic causes, where a set of alleles can affect the propensity of getting the disease. The identification of such disease genes is important to understand the mechanistic and evolutionary aspects of pathogenesis, improve diagnosis and treatment of the disease, and aid in drug discovery. Current genetic studies typically identify chromosomal regions associated with specific diseases, but picking out an unknown disease gene from hundreds of candidates located on the same genomic interval is still challenging. In this study, we propose an approach to prioritize candidate genes by integrating data on gene expression level, protein-protein interaction strength and known disease genes. Our method is based on only two simple, biologically motivated assumptions: that a gene is a good disease-gene candidate if it is differentially expressed in cases and controls, or if it is close to other disease-gene candidates in its protein interaction network. We tested our method on 40 diseases in 58 gene expression datasets of the NCBI Gene Expression Omnibus database. On these datasets our method is able to predict unknown disease genes as well as to identify pleiotropic genes involved in the physiological cellular processes of many diseases. Our study not only provides an effective algorithm for prioritizing candidate disease genes but is also a way to discover phenotypic interdependency, co-occurrence and shared pathophysiology between different disorders.
Year: 2011 PMID: 21912686 PMCID: PMC3166320 DOI: 10.1371/journal.pone.0024306
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Illustration of the method with synthetic data.
The area of each node is proportional to x, the difference in expression level. The width of each edge represents the coupling strength w in the protein interaction network. The color of the nodes represents our score, and the numbers show their order in this ranking. Panels A and B show the results for two values of φ: a low value (2% of the maximal possible) in A and a high value (98% of the maximum) in B. Low φ-values put an emphasis on the difference in expression level; high φ-values stress the proximity to other vertices with high score. We also assume η≫1.
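The trade-off controlled by φ can be illustrated with a small sketch. The scoring formula itself is not given in this record, so the code below assumes a simple linear propagation model, s = x + φ·W·s, and assumes that the "maximal possible" φ is 1/λmax of the weight matrix W (the point where that linear system stops being solvable). Both assumptions, the function names `propagate_scores` and `rank_genes`, and the toy network are illustrative only. The parameter η and the contribution of known disease genes are omitted.

```python
import numpy as np

def propagate_scores(x, W, phi_fraction):
    """Solve s = x + phi * W @ s for s (ASSUMED model, not the paper's formula).

    x            : differential expression per gene
    W            : symmetric, non-negative interaction weights
    phi_fraction : phi as a fraction (0 <= f < 1) of the assumed maximum 1/lambda_max
    """
    x = np.asarray(x, dtype=float)
    W = np.asarray(W, dtype=float)
    # Largest eigenvalue magnitude of the symmetric matrix W.
    lam_max = np.max(np.abs(np.linalg.eigvalsh(W)))
    phi = phi_fraction / lam_max if lam_max > 0 else 0.0
    # Rearranged: (I - phi * W) s = x
    return np.linalg.solve(np.eye(len(x)) - phi * W, x)

def rank_genes(scores):
    """Return gene indices ordered from highest to lowest score."""
    return np.argsort(-np.asarray(scores))

# Toy network: gene 1 has low expression difference but links genes 0 and 2;
# gene 3 is isolated.
x = [1.0, 0.1, 0.5, 0.6]
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 0],
     [0, 0, 0, 0]]

low = propagate_scores(x, W, 0.02)   # expression dominates
high = propagate_scores(x, W, 0.98)  # network proximity dominates
print(rank_genes(low), rank_genes(high))
```

In this toy case the hub gene 1 ranks last at the low setting and first at the high setting, which mirrors the qualitative behavior described for panels A and B.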
Figure 2. Parameter dependence of prediction performance.
(A) The distributions of r-ratios for the known OMIM disease genes of the 40 diseases under study, at (φ,η) = (0,1), i.e., only gene expression levels were used to predict disease genes. (B) The distributions of r-ratios for the known OMIM disease genes at (φ,η) = (0.001,0), i.e., only the PPI network was used in the ranking. (C) The distributions of r-ratios for the known OMIM disease genes at (φ,η) = (0.005,39). (D) ROC curves for (φ,η) = (0.005,39), (0.001,0), and (0,1), respectively.
Prediction results of our algorithm at (φ,η) = (0.005,39) for the known OMIM disease genes of the 40 diseases under study.
| h | TP | TPR | FPR | TPR/FPR |
| 1 | 28 | 0.081 | 0.009 | 9 |
| 10 | 120 | 0.345 | 0.098 | 3.520 |
| 15 | 163 | 0.468 | 0.147 | 3.184 |
| 24 | 208 | 0.600 | 0.236 | 2.542 |
| 30 | 233 | 0.670 | 0.296 | 2.264 |
h: number of genes at the top of the candidate ranking that were predicted as disease-associated; TP: number of true positives, i.e., the number of known disease genes that were predicted as disease-associated; TPR: true positive rate; FPR: false positive rate.
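As a consistency check on the table above, the last column can be recomputed from the TPR and FPR columns. The sketch below only uses the numbers printed in the table. The helper name `implied_positives` is illustrative, and the implied count of known disease genes is only approximate because TPR is rounded to three decimals.

```python
# Rows copied from the table: (h, TP, TPR, FPR, reported TPR/FPR)
rows = [
    (1, 28, 0.081, 0.009, 9.0),
    (10, 120, 0.345, 0.098, 3.520),
    (15, 163, 0.468, 0.147, 3.184),
    (24, 208, 0.600, 0.236, 2.542),
    (30, 233, 0.670, 0.296, 2.264),
]

def implied_positives(tp, tpr):
    """Estimate the total number of known disease genes from TPR = TP / P."""
    return tp / tpr

for h, tp, tpr, fpr, reported in rows:
    ratio = tpr / fpr
    print(f"h={h:2d}  TPR/FPR={ratio:.3f}  reported={reported:.3f}  "
          f"implied P={implied_positives(tp, tpr):.1f}")
```

The ratio falls as h grows: each additional predicted gene adds true positives more slowly than false positives, which is the trade-off between sensitivity and specificity.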
Figure 3. Finding a trade-off between sensitivity and specificity.
The variation trend of TPR/FPR in response to changes of h, the number of disease genes predicted. TPR: true positive rate; FPR: false positive rate.
Selected prediction results for disease genes in three monogenic diseases and three complex diseases.
| Disease MeSH | Gene name | Gene loci | s-rank |
| Progeria | LMNA | 1q21.2 | 4 |
| Muscular Dystrophy, Duchenne | DMD | Xp21.2 | 2 |
| Cystic Fibrosis | CFTR | 7q31.2 | 8 |
| Alzheimer Disease | APOE | 19q13.2 | 3 |
| Alzheimer Disease | APP | 21q21 | 4 |
| Alzheimer Disease | PSEN1 | 14q24.3 | 4 |
| Alzheimer Disease | PSEN2 | 1q31-q42 | 15 |
| Crohn Disease | IL6 | 7p21 | 1 |
| Crohn Disease | IL23R | 1p31.3 | 3 |
| Crohn Disease | NOD2 | 16q12 | 4 |
| Diabetes Mellitus, Type 2 | IL6 | 7p21 | 1 |
| Diabetes Mellitus, Type 2 | PPARG | 3p25 | 1 |
| Diabetes Mellitus, Type 2 | IRS1 | 2q36 | 2 |
| Diabetes Mellitus, Type 2 | IRS2 | 13q34 | 3 |
s-rank: ranks of candidate genes according to their s-values when (φ,η) = (0.005,39).
Figure 4. ROC curves for the predictions of disease genes.
Here we restrict the analysis to diseases with at least two known associated genes.
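For readers unfamiliar with how an ROC curve is built from a ranked candidate list, the following is a minimal sketch. It is not the evaluation code behind the figure: the function names `roc_points` and `auc` and the toy scores are illustrative, and tied scores are not treated specially.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the cutoff one gene at a time.

    scores : higher means more likely disease-associated
    labels : 1 for a known disease gene, 0 for any other candidate
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    positives = sum(labels)
    negatives = len(labels) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy example with four candidates, two of them known disease genes.
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [1, 0, 1, 0]
print(auc(roc_points(toy_scores, toy_labels)))
```

An area of 1.0 corresponds to all known disease genes ranked above every other candidate, and 0.5 to a ranking no better than chance.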
Alzheimer's disease (AD) associated genes predicted by our algorithm that are supported by the literature.
| No | Unknown AD gene (symbol in OMIM morbid) | OMIM ID | Gene loci (OMIM morbid) | Predicted gene ID | Predicted gene symbol | Predicted gene loci | s-rank |
| 1 | AD5 | 602096 | 12p11.23–q13.12 | 7421 | VDR | 12q13.11c | 3 |
| 2 | AD6 | 605526 | 10q24 | 8945 | BTRC | 10q24.32a | 4 |
| 3 | AD7 | 606187 | 10p13 | 1787 | TRDMT1 | 10p13a | 9 |
| 4 | AD8 | 607116 | 20p | 5111 | PCNA | 20p12.3c | 2 |
| 5 | AD9 | 608907 | 19p13.2 | 3383 | ICAM1 | 19p13.2c | 1 |
| 6 | AD10 | 609636 | 7q36 | 4846 | NOS3 | 7q36.1c–q36.1d | 1 |
| 7 | AD11 | 609790 | 9p22.1 | 1029 | CDKN2A | 9p21.3c | 2 |
| 8 | AD12 | 611073 | 8p12–q22 | 2260 | FGFR1 | 8p12a | 1 |
| 9 | AD13 | 611152 | 1q21 | 6275 | S100A4 | 1q21.3c | 3 |
| 10 | AD14 | 611154 | 1q25 | 9588 | PRDX6 | 1q25.1a | 1 |
| 11 | AD15 | 611155 | 3q22–q24 | 7018 | TF | 3q22.1e | 1 |
| 12 | AD16 | 300756 | Xq21.3 | 1349 | COX7B | Xq21.1a | 3 |
Selected significantly enriched GO terms for the top s1-ranked genes.
| GO ID | GO Term | Mapped genes | Total genes |
| GO:0050896 | response to stimulus | 68 | 6192 |
| GO:0006950 | response to stress | 53 | 2538 |
| GO:0002376 | immune system process | 44 | 1436 |
| GO:0030154 | cell differentiation | 43 | 2008 |
| GO:0042127 | regulation of cell proliferation | 40 | 946 |
| GO:0010941 | regulation of cell death | 44 | 1042 |
All reported GO terms are significantly enriched, with a P-value less than 0.001.
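This record does not say which statistical test produced these P-values. A common choice for GO enrichment is the one-sided hypergeometric test, sketched below under that assumption. The background size (20000 genes) and the size of the top-ranked list (100 genes) are hypothetical placeholders, since neither is stated here; only the counts 40 and 946 come from the table row for GO:0042127.

```python
from math import comb

def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) for a hypergeometric draw.

    N : background genes (HYPOTHETICAL value in the example below)
    K : background genes annotated with the GO term ("Total genes")
    n : genes in the top-ranked list (HYPOTHETICAL value in the example below)
    k : genes in the list annotated with the term ("Mapped genes")
    """
    upper = min(n, K)
    # Exact integer arithmetic; convert to float only at the end.
    numerator = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return numerator / comb(N, n)

# "regulation of cell proliferation": 40 mapped genes, 946 annotated genes in total.
p_value = hypergeom_upper_tail(k=40, K=946, n=100, N=20000)
print(p_value)
```

With these placeholder sizes about five annotated genes would be expected by chance, so 40 mapped genes gives a P-value far below the 0.001 threshold. The exact value depends entirely on the assumed background and list sizes.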
Figure 5. Correlation between importance and pleiotropy.
We measure the s1-score averaged over bins of the number of diseases that a particular gene is shared among (as a measure of the strength of pleiotropy).