Xiaoyan Huang1,2,3, Hankui Liu2,3, Xinming Li4, Liping Guan2,3, Jiankang Li2,3, Laurent Christian Asker M Tellier2,3,5, Huanming Yang2,6, Jian Wang2,6, Jianguo Zhang7,8,9.
Abstract
BACKGROUND: Alzheimer's disease (AD) is an important, progressive neurodegenerative disease with a complex genetic architecture. A key goal of biomedical research is to identify disease risk genes and to elucidate their function in the development of disease. For this purpose, expanding the set of AD-associated genes is necessary. Previous methods for predicting AD-related genes have explored only limited regions of the genome. Here we present a genome-wide method for predicting AD candidate genes.
Keywords: Alzheimer’s disease; Gene; Machine learning
Year: 2018 PMID: 29320986 PMCID: PMC5763548 DOI: 10.1186/s12883-017-1010-3
Source DB: PubMed Journal: BMC Neurol ISSN: 1471-2377 Impact factor: 2.474
Fig. 1 Genome-wide prediction of Alzheimer-associated genes. Our prediction was based on machine learning methods trained on genes already linked to AD at various levels of evidence (C1AD-C4AD) as the positive training set, and on other genes excluded from the OMIM database (C5) as the negative training set. Combining these with the human brain-specific gene network, we built an evidence-weighted, network-based classifier and predicted the probability of association between each gene and AD across the genome
Fig. 2 Receiver operating characteristic (ROC) curve for the SVM classifier. The classification threshold was 0.561; at this threshold, sensitivity was 0.859 and specificity was 0.892. The area under the curve (AUC) was 0.94
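The quantities in the Fig. 2 caption can be illustrated with a minimal sketch. It assumes that a gene is called positive when its score is at or above the threshold, and the scores and labels below are made-up toy values, not data from the paper. The function names `confusion_rates` and `auc` are hypothetical helpers introduced only for this illustration.

```python
def confusion_rates(scores, labels, threshold):
    """Return (sensitivity, specificity), calling score >= threshold positive."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= threshold)
    fn = sum(1 for s, y in pairs if y == 1 and s < threshold)
    tn = sum(1 for s, y in pairs if y == 0 and s < threshold)
    fp = sum(1 for s, y in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(scores, labels):
    """AUC as the chance that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy classifier scores (illustrative only): 1 = AD-associated, 0 = control.
scores = [0.95, 0.80, 0.60, 0.40, 0.70, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

sens, spec = confusion_rates(scores, labels, threshold=0.561)
area = auc(scores, labels)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} AUC={area:.3f}")
```

Sensitivity and specificity depend on the chosen threshold, whereas the AUC summarizes the whole curve and does not change when the threshold moves.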
Top ten GO items of significantly enriched AD-associated genes
| GO ID | P value | GO term |
|---|---|---|
| GO:0005515 | 5.70e-19 | Protein binding |
| GO:0005615 | 5.34e-15 | Extracellular space |
| GO:0042493 | 6.59e-15 | Response to drug |
| GO:0042157 | 9.57e-15 | Lipoprotein metabolic process |
| GO:0008203 | 1.68e-14 | Cholesterol metabolic process |
| GO:0009986 | 5.34e-13 | Cell surface |
| GO:0042802 | 1.61e-12 | Identical protein binding |
| GO:0019899 | 3.24e-12 | Enzyme binding |
| GO:0044281 | 3.55e-11 | Small molecule metabolic process |
| GO:0001540 | 6.87e-11 | Beta-amyloid binding |
The list of selected features of gene sets for comparison
| Feature | Source | Description |
|---|---|---|
| Gene length | Ensembl | Length of gene in bp |
| Protein length | UniProt | Length of protein in aa |
| CDS length | Ensembl | Length of coding sequence in bp |
| Length of 3′ UTR | Ensembl | Length of the 3′ untranslated region in bp |
| Length of 5′ UTR | Ensembl | Length of the 5′ untranslated region in bp |
| Transcript count | Ensembl | Transcript count in the gene |
| Number of exons | Ensembl | Number of exons in the gene |
| GC content | Ensembl | GC content (%) of gene |
| Transmembrane domain | Ensembl | Whether the gene has a transmembrane domain |
| Signal domain | Ensembl | Whether the gene has a signal domain |
| Paralog | Ensembl | Whether the gene has a paralog in the human genome |
Significant differences among the AD-associated set, control set and predicted AD candidate set
| Feature | AD-related dataset (median) | Control dataset (median) | AD-predicted dataset (median) |
|---|---|---|---|
| Gene length (bp) | 43,474.5 | 8,906 | 36,937 |
| Length of 3′ UTR (bp) | 309 | 103 | 362 |
| Length of 5′ UTR (bp) | 345 | 134 | 332 |
| Transcript count | 8 | 3 | 8 |
| Number of exons | 10 | 5 | 10 |
| Transmembrane domain | 31.31% | 23.18% | 32.04% |
| Signal domain | 33.43% | 14.09% | 33.86% |
| Paralog | 81.79% | 45.97% | 86.21% |
Pairwise differences between the four datasets, given as P values from the Mann-Whitney U test or the Chi-squared test
| Feature | Dataset | AD-associated dataset | Control dataset | AD-predicted dataset | Non-mental-health dataset |
|---|---|---|---|---|---|
| Gene length | AD-associated dataset | – | < 2.2E-16 | 0.01607 | 0.002573 |
| | Control dataset | < 2.2E-16 | – | < 2.2E-16 | < 2.2E-16 |
| | AD-predicted dataset | 0.01607 | < 2.2E-16 | – | 0.2018 |
| | Non-mental-health dataset | 0.002573 | < 2.2E-16 | 0.2018 | – |
| Length of 3′ UTR | AD-associated dataset | – | < 2.2E-16 | 0.109 | 0.1131 |
| | Control dataset | < 2.2E-16 | – | < 2.2E-16 | < 2.2E-16 |
| | AD-predicted dataset | 0.109 | < 2.2E-16 | – | 0.0003546 |
| | Non-mental-health dataset | 0.1131 | < 2.2E-16 | 0.0003546 | – |
| Length of 5′ UTR | AD-associated dataset | – | 1.17E-13 | 0.4351 | 0.008426 |
| | Control dataset | 1.17E-13 | – | < 2.2E-16 | 3.05E-13 |
| | AD-predicted dataset | 0.4351 | < 2.2E-16 | – | 0.000159 |
| | Non-mental-health dataset | 0.008426 | 3.05E-13 | 0.000159 | – |
| Transcript count | AD-associated dataset | – | < 2.2E-16 | 0.2962 | 0.0006213 |
| | Control dataset | < 2.2E-16 | – | < 2.2E-16 | < 2.2E-16 |
| | AD-predicted dataset | 0.2962 | < 2.2E-16 | – | 0.0006213 |
| | Non-mental-health dataset | 0.0006213 | < 2.2E-16 | 0.0006213 | – |
| Number of exons | AD-associated dataset | – | < 2.2E-16 | 0.1314 | 0.3506 |
| | Control dataset | < 2.2E-16 | – | < 2.2E-16 | < 2.2E-16 |
| | AD-predicted dataset | 0.1314 | < 2.2E-16 | – | 0.1537 |
| | Non-mental-health dataset | 0.3506 | < 2.2E-16 | 0.1537 | – |
| Transmembrane domain | AD-associated dataset | – | 0.03783 | 0.8109 | 0.6143 |
| | Control dataset | 0.03783 | – | 0.01107 | 0.0448 |
| | AD-predicted dataset | 0.8109 | 0.01107 | – | 0.302 |
| | Non-mental-health dataset | 0.6143 | 0.0448 | 0.302 | – |
| Signal domain | AD-associated dataset | – | 3.70E-07 | 0.89 | 3.54E-05 |
| | Control dataset | 3.70E-07 | – | 1.20E-08 | 0.006176 |
| | AD-predicted dataset | 0.89 | 1.20E-08 | – | 1.12E-08 |
| | Non-mental-health dataset | 3.54E-05 | 0.006176 | 1.12E-08 | – |
| Paralog | AD-associated dataset | – | < 2.2E-16 | 0.05604 | 0.0007944 |
| | Control dataset | < 2.2E-16 | – | < 2.2E-16 | < 2.2E-16 |
| | AD-predicted dataset | 0.05604 | < 2.2E-16 | – | 5.19E-13 |
| | Non-mental-health dataset | 0.0007944 | < 2.2E-16 | 5.19E-13 | – |
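The two tests named in the table caption can be sketched as follows. The sketch assumes that numeric features (lengths, counts) are compared with the Mann-Whitney U test and yes/no features (transmembrane domain, signal domain, paralog) with a Chi-squared test on a 2x2 table. That split is an inference from the feature types, not something the caption states. The function names `mann_whitney_u` and `chi2_2x2` are hypothetical, the toy numbers are made up, and the Mann-Whitney p value uses a normal approximation without continuity correction, so statistical software may give slightly different values.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    pooled = sorted(list(x) + list(y))
    rank_of = {}   # midrank shared by each distinct value
    tie_term = 0   # sum of t^3 - t over groups of t tied values
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    u1 = sum(rank_of[v] for v in x) - n1 * (n1 + 1) / 2
    n = n1 + n2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u1 - n1 * n2 / 2) / math.sqrt(var_u)
    return u1, math.erfc(abs(z) / math.sqrt(2))


def chi2_2x2(a, b, c, d):
    """Pearson Chi-squared test on [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom the upper tail equals erfc(sqrt(chi2 / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))


# Toy numeric feature (illustrative values, e.g. exon counts in two gene sets).
u, p_u = mann_whitney_u([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])

# Toy binary feature: rows are gene sets, columns are (has domain, lacks domain).
chi2, p_chi = chi2_2x2(30, 70, 15, 85)
print(f"U={u:.1f} p={p_u:.4f}; chi2={chi2:.3f} p={p_chi:.4f}")
```

A rank-based test suits the length and count features because their distributions are strongly skewed, while the Chi-squared test compares proportions such as the share of genes with a paralog.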
Fig. 3 Distributions of selected features in the different datasets. The distributions for the predicted AD candidate set are broadly consistent with those for the AD-associated set, whereas the distributions for the control set differ markedly from both
AD-associated genes reported in papers published since 2015
| Articles | Total genes | Trained genes | Predicted genes |
|---|---|---|---|
| Chen J A, et al. | DYSF, PAXIP1 | – | PAXIP1 |
| Xiao Q, et al. | CD2AP, SORL1, FERMT2, PVRL2, TOMM40 | SORL1, FERMT2, TOMM40 | PVRL2 |
| Gao H, et al. | DAB1 | – | DAB1 |
| Malishkavich A, et al. | ADNP | – | ADNP |
| Lee Y H, et al. | ANXA1, CDC25C | – | ANXA1, CDC25C |
| Zheng X, et al. | APC2 | – | APC2 |
| Lin Q, et al. | APOA1, APOC3, APOA4 | APOA1 | APOC3 |
| Marchesi V T, et al. | NLRP3, APP, TREX1, NOTCH3, COL4A1 | APP | NLRP3, COL4A1 |
| Total | 20 | 6 | 10 |