Yaogong Zhang, Jiahui Liu, Xiaohu Liu, Xin Fan, Yuxiang Hong, Yuan Wang, YaLou Huang, MaoQiang Xie.
Abstract
BACKGROUND: Disease gene prioritization aims to identify potential disease-causing genes for a given phenotype, which can help reveal the inherited basis of human diseases and facilitate drug development. Our work is motivated by the label propagation algorithm and by the false positive protein-protein interactions (PPIs) that exist in the dataset. To the best of our knowledge, false positive protein-protein interactions have not previously been considered in disease gene prioritization. Label propagation has been successfully applied to prioritize disease-causing genes in previous network-based methods, which use basic label propagation, i.e. random walk, on networks in different ways. However, none of these methods can handle a dataset that contains many false positive protein-protein interactions, because they take the PPI network as a fixed input. This characteristic of the data source may cause a large deviation in the results.
Keywords: Gene prioritization; Heterogeneous network; Label propagation
Year: 2018 PMID: 29422030 PMCID: PMC5806269 DOI: 10.1186/s12859-018-2040-6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Statistics of Data in Experiments
| Statistics | Value |
|---|---|
| Number of genes | 3292 |
| Number of phenotypes | 4120 |
| Number of gene-phenotype associations (Aug-2015/Dec-2016) | 4,678/4,801 |
| Average number of genes per phenotype (Aug-2015/Dec-2016) | 1.1354/1.1653 |
| Average number of phenotypes per gene (Aug-2015/Dec-2016) | 1.4210/1.4584 |
| Percentage of phenotypes that have only one disease gene (Aug-2015/Dec-2016) | 91.87%/94.10% |
| Percentage of genes that have only one interaction phenotype (Aug-2015/Dec-2016) | 66.22%/66.74% |
| Sparsity of the PPI matrix (Aug-2015) | 99.74% |
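The two average rows in the table follow directly from the counts: each is the number of associations divided by the number of phenotypes or genes. The short check below recomputes them from the table's own figures; the function name `averages` is only illustrative.

```python
# Counts taken from the statistics table above.
N_GENES = 3292
N_PHENOTYPES = 4120
N_ASSOCIATIONS = {"Aug-2015": 4678, "Dec-2016": 4801}


def averages(n_assoc):
    """Return (genes per phenotype, phenotypes per gene), rounded to 4 places."""
    genes_per_phenotype = round(n_assoc / N_PHENOTYPES, 4)
    phenotypes_per_gene = round(n_assoc / N_GENES, 4)
    return genes_per_phenotype, phenotypes_per_gene


for snapshot, n_assoc in N_ASSOCIATIONS.items():
    print(snapshot, averages(n_assoc))
```

Both snapshots reproduce the reported values, which confirms that the averages are taken over all 4120 phenotypes and all 3292 genes.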
Notations
| NOTATION | DESCRIPTION |
|---|---|
| | Number of genes |
| | Number of phenotypes |
| | Binary PPI network |
| | Phenotype similarity network |
| | Normalized PPI network |
| | Normalized phenotype similarity network |
| | Known binary gene-phenotype associations for training |
| | Gene-phenotype associations matrix to be learnt |
| | Weighted PPI network to be learnt |
| | Weighted phenotype similarity network to be learnt |
Fig. 1 Illustration of the IDLP framework. Square nodes represent phenotypes; all pairwise phenotype similarity relationships make up the phenotype similarity network. Circular nodes represent genes; all pairwise gene interactions make up the PPI network. Nodes surrounded by an oval are query phenotypes (or genes); nodes surrounded by a triangle are seed genes (or phenotypes). a For a query phenotype p, the corresponding related genes are selected as seed nodes. b By modeling the noise in the PPI network, the interactions between gene nodes are changed. To better explain the situation, we consider two extreme cases here, i.e., edge deletion and edge addition. During the optimization of IDLP, the interaction between genes g and f is added and the interaction between genes d and e is removed. The changes to the PPI network result in a high score for gene g, because gene g directly receives score from seed gene f. Moreover, gene d no longer receives scores from gene e, which indirectly results in gene d receiving more support from gene e. c For a query gene g, the corresponding related phenotypes are selected as seed nodes. d By modeling the noise in the phenotype network, the similarity scores between phenotypes are changed. The edge addition between phenotypes r and p and the edge deletion between phenotypes r and t result in a high score for phenotype p
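The effect described in panel b can be illustrated with a plain random walk with restart on a toy graph. This is only a sketch under stated assumptions: the four-gene topology, the restart probability of 0.3, and the helper names `rwr_scores` and `with_edge` are all hypothetical, and the edge changes are applied by hand, whereas IDLP learns the network weights during optimization.

```python
import numpy as np


def rwr_scores(adj, seed, restart=0.3):
    """Random walk with restart on an undirected graph, solved in closed form."""
    adj = np.asarray(adj, dtype=float)
    col_sums = adj.sum(axis=0)
    col_sums[col_sums == 0] = 1.0  # isolated nodes keep an all-zero column
    w = adj / col_sums             # column-normalised transition matrix
    p0 = np.zeros(adj.shape[0])
    p0[seed] = 1.0
    # Stationary solution of p = (1 - restart) * W p + restart * p0
    return restart * np.linalg.solve(np.eye(len(p0)) - (1 - restart) * w, p0)


def with_edge(adj, i, j, value):
    """Return a copy of adj with the symmetric entry (i, j) set to value."""
    out = np.array(adj, dtype=float)
    out[i, j] = out[j, i] = value
    return out


# Hypothetical toy topology (not taken from the figure): chain f - e - d - g.
genes = {"d": 0, "e": 1, "f": 2, "g": 3}
before = np.zeros((4, 4))
for a, b in [("f", "e"), ("e", "d"), ("d", "g")]:
    before = with_edge(before, genes[a], genes[b], 1.0)

after = with_edge(before, genes["f"], genes["g"], 1.0)  # edge addition f-g
after = with_edge(after, genes["d"], genes["e"], 0.0)   # edge deletion d-e

p_before = rwr_scores(before, seed=genes["f"])
p_after = rwr_scores(after, seed=genes["f"])
print("score of g before:", p_before[genes["g"]])
print("score of g after: ", p_after[genes["g"]])
```

In this toy setting gene g moves from three hops away from the seed gene f to a direct neighbour, so its propagated score rises.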
Average AUCs scores of gene prioritization on test set and validation set
| Method | Test set AUC20 | Test set AUC50 | Test set AUC100 | Validation set AUC20 | Validation set AUC50 | Validation set AUC100 |
|---|---|---|---|---|---|---|
| CIPHER_SP | 0.0029* | 0.0046* | 0.0066* | 0 | 0 | 0 |
| CIPHER_DN | 0.0015* | 0.0027* | 0.0042* | 0 | 0 | 0 |
| RWR | 0.0075* | 0.0178* | 0.0283* | 0.0233 | 0.0358 | 0.0475 |
| DK | 0.0192* | 0.0255* | 0.0294* | 0.0211 | 0.0306 | 0.0399 |
| RWRH | 0.0916* | 0.1250* | 0.1664* | 0.2009 | 0.2724 | 0.3288 |
| MINProp | 0.0771* | 0.1266* | 0.1799* | 0.1963 | 0.2625 | 0.3104 |
| BiRW | 0.0421* | 0.0780* | 0.1142* | 0.1544 | 0.2180 | 0.26672 |
| PRINCE | 0.1117 | 0.1468 | 0.2088 | 0.1433 | 0.2137 | 0.2715 |
| IDLP-G | 0.0040* | 0.0076* | 0.0166* | 0.0189 | 0.0348 | 0.0519 |
| IDLP-P | 0.1051* | 0.1457 | 0.1897 | 0.2003 | 0.2592 | 0.3010 |
| IDLP | 0.1123 | 0.1492 | 0.1909 | 0.2004 | 0.2572 | 0.2990 |
We compared AUCs when the number of false positive genes is up to 20, 50, and 100
*indicates that IDLP significantly outperforms the baseline with p<0.05 using Student's t-test
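The exact AUC formula is not reproduced here, so the following is a minimal sketch of one common definition of an AUC truncated at n false positives: the area under the ROC curve up to the n-th false positive, normalised by n times the number of true disease genes. The function name `auc_n` and the 0/1 label encoding are assumptions made for illustration.

```python
def auc_n(ranked_labels, n):
    """Truncated AUC for one query.

    ranked_labels: 1 for a true disease gene, 0 otherwise, best-ranked first.
    n: maximum number of false positives to consider (e.g. 20, 50, 100).
    """
    total_pos = sum(ranked_labels)
    if total_pos == 0 or n <= 0:
        return 0.0
    tp = fp = area = 0
    for label in ranked_labels:
        if fp == n:
            break
        if label:
            tp += 1
        else:
            fp += 1
            area += tp  # true positives ranked above this false positive
    # If the list ran out before n false positives, credit the remaining
    # columns with every true positive found so far.
    area += (n - fp) * tp
    return area / (n * total_pos)


print(auc_n([1, 0, 0, 1, 0], n=2))
print(auc_n([1, 1, 0, 0], n=2))
```

Under this definition a ranking that places every true gene above the first false positive scores 1, and one that places n false positives above every true gene scores 0.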
Fig. 2 Data analysis. a The phenotype distribution based on the genes each phenotype is associated with. b The distribution of newly added phenotypes based on whether they have known disease-causing gene(s). c The AUC20 scores of different methods in two situations: 1. phenotypes with known disease genes are used as queries (left); 2. phenotypes with unknown disease genes are used as queries (right)
Fig. 3 Average precision on all query diseases of the test set at each top-k position
Fig. 4 Average recall on all query diseases of the test set at each top-k position
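The quantities plotted in Figs. 3 and 4 can be sketched as follows, assuming the usual definitions: precision at k is the fraction of the top-k ranked genes that are true disease genes, and recall at k is the fraction of true disease genes that appear in the top k. The function names, disease labels, and gene labels below are hypothetical.

```python
def precision_recall_at_k(ranked_genes, true_genes, k):
    """Precision and recall of the top-k genes for a single query disease."""
    true_genes = set(true_genes)
    hits = sum(1 for gene in ranked_genes[:k] if gene in true_genes)
    precision = hits / k
    recall = hits / len(true_genes) if true_genes else 0.0
    return precision, recall


def average_at_k(rankings, truths, k):
    """Average precision and recall at k over all query diseases."""
    pairs = [precision_recall_at_k(rankings[q], truths[q], k) for q in rankings]
    n = len(pairs)
    return sum(p for p, _ in pairs) / n, sum(r for _, r in pairs) / n


# Hypothetical rankings for two query diseases.
rankings = {
    "disease_A": ["g1", "g2", "g3", "g4"],
    "disease_B": ["g3", "g1", "g4", "g2"],
}
truths = {"disease_A": {"g2", "g4"}, "disease_B": {"g3"}}

for k in (1, 2, 4):
    print(k, average_at_k(rankings, truths, k))
```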
Fig. 5 Effects of parameters on the performance of IDLP. a Performance of AUC20 w.r.t. α. b Performance of AUC20 w.r.t. γ
Fig. 6 Robustness of IDLP. Four disturbed PPI networks are applied to each algorithm: 1. randomly delete 10% of the PPI data; 2. randomly delete 10% of the PPI data and add 10% PPI data; 3. randomly delete 20% of the PPI data; 4. randomly delete 20% of the PPI data and add 20% PPI data. The best and the worst performance across these four situations are drawn as error bars on the histogram. a Results when all diseases are chosen as the test set. b Results when totally new diseases are chosen as the test set
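The four disturbed networks could be generated along the following lines. This is a sketch, not the authors' procedure: it assumes a binary symmetric adjacency matrix with no self-loops, and it interprets "add 10% PPI data" as adding new edges equal in number to 10% of the original edge count, drawn from pairs that did not interact originally. The function name `perturb_ppi` and the ring-shaped toy network are hypothetical.

```python
import numpy as np


def perturb_ppi(adj, delete_frac, add_frac=0.0, seed=0):
    """Randomly delete and add edges in a binary symmetric adjacency matrix."""
    rng = np.random.default_rng(seed)
    adj = np.array(adj, dtype=int)
    rows, cols = np.triu_indices_from(adj, k=1)  # each undirected pair once
    is_edge = adj[rows, cols] == 1
    edge_idx = np.flatnonzero(is_edge)
    non_edge_idx = np.flatnonzero(~is_edge)

    n_edges = edge_idx.size
    n_del = int(round(delete_frac * n_edges))
    n_add = min(int(round(add_frac * n_edges)), non_edge_idx.size)

    drop = rng.choice(edge_idx, size=n_del, replace=False)
    add = rng.choice(non_edge_idx, size=n_add, replace=False)

    out = adj.copy()
    out[rows[drop], cols[drop]] = 0
    out[cols[drop], rows[drop]] = 0
    out[rows[add], cols[add]] = 1
    out[cols[add], rows[add]] = 1
    return out


# Toy network: a ring of 20 genes, i.e. 20 undirected edges.
n = 20
ring = np.zeros((n, n), dtype=int)
for i in range(n):
    ring[i, (i + 1) % n] = ring[(i + 1) % n, i] = 1

settings = [(0.1, 0.0), (0.1, 0.1), (0.2, 0.0), (0.2, 0.2)]
perturbed = [perturb_ppi(ring, d, a, seed=k) for k, (d, a) in enumerate(settings)]
print([int(p.sum()) // 2 for p in perturbed])
```

Because added edges are sampled only from pairs absent in the original network, a deleted interaction is never re-added within the same perturbation.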
Predicted top 10 new genes for Parkinson’s disease by IDLP
| Gene | Score | Evidence of Support |
|---|---|---|
| DNAJC13 | 0.7016 | DNAJC13 mutations in Parkinson disease |
| CYP2D6 | 0.5796 | CYP2D6 phenotypes and Parkinson’s disease risk: a meta-analysis |
| DRD4 | 0.5667 | Lack of allelic association of dopamine D4 receptor gene polymorphisms with Parkinson’s disease in a Chinese Population |
| RAB39B | 0.5421 | Loss-of-function mutations in RAB39B are associated with typical early-onset Parkinson disease |
| TRPM7 | 0.3101 | TRPM7 and its role in neurodegenerative diseases |
| SNCB | 0.2342 | Beta-synuclein gene variants and Parkinson’s disease: a preliminary case-control study |
| DCTN1 | 0.1791 | A Novel DCTN1 mutation with late-onset parkinsonism and frontotemporal atrophy |
| ATP6AP2 | 0.1562 | Altered splicing of ATP6AP2 causes X-linked parkinsonism with spasticity (XPDS) |
| WDR45 | 0.1415 | - |
| PSEN2 | 0.1401 | - |