Magdalena Arnal Segura1,2,3, Giorgio Bini2, Dietmar Fernandez Orth3, Eleftherios Samaras4, Maya Kassis4, Fotis Aisopos5, Jordi Rambla De Argila3, George Paliouras5, Peter Garrard4, Claudia Giambartolomei2, Gian Gaetano Tartaglia1,2,3,6.
Abstract
Introduction: Genome-wide association studies (GWAS) in late onset Alzheimer's disease (LOAD) provide lists of individual genetic determinants. However, GWAS do not capture the synergistic effects among multiple genetic variants and lack good specificity.
Keywords: Apolipoprotein E; genetic determinants; genomic interactions; genomic profiles; late onset Alzheimer's disease; machine learning; single nucleotide variants; variant prioritization
Year: 2022 PMID: 35415203 PMCID: PMC8984091 DOI: 10.1002/dad2.12300
Source DB: PubMed Journal: Alzheimers Dement (Amst) ISSN: 2352-8729
Summary of the evaluation metrics obtained with GB, ET, and RF models and Alzheimer's Disease predictors
| Model | Accuracy | AUC‐ROC | F score | Sensitivity | Specificity | PPV | NPV |
|---|---|---|---|---|---|---|---|
| GB | | | | | | | |
| ET | 0.707 | 0.820 | 0.706 | 0.703 | 0.710 | 0.708 | 0.705 |
| RF | 0.739 | 0.804 | 0.735 | 0.725 | 0.754 | 0.746 | 0.732 |
Abbreviations: GB, gradient boosted decision trees; ET, extremely randomized trees; RF, random forest; AUC‐ROC, area under the receiver operating characteristic curve; PPV, positive predictive value; NPV, negative predictive value.
Note: Machine learning models with best scores in each evaluation metric are highlighted in red.
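As a minimal sketch of how the threshold-based metrics in this table relate to one another, the function below derives them from confusion-matrix counts. It assumes that "F score" means the F1 score, which the source does not state. The counts are hypothetical and are not data from the study. AUC‐ROC is omitted because it needs the ranked prediction scores rather than a single confusion matrix.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute threshold-based metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # positive predictive value (precision)
    npv = tn / (tn + fn)           # negative predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # Assumption: "F score" is the F1 score, the harmonic mean of PPV and sensitivity.
    f_score = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "accuracy": accuracy,
        "f_score": f_score,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
    }


# Hypothetical counts for illustration only; not taken from the study.
metrics = classification_metrics(tp=70, fp=30, tn=70, fn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With balanced classes and symmetric errors, as in these hypothetical counts, all six metrics coincide. The ET and RF rows differ slightly across columns, which is what an asymmetric confusion matrix would produce.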
FIGURE 1 A, The genomic location of single nucleotide variants (SNVs) selected using a feature importance (FI) >0.01 in the chromosome 19 hot‐spot region. SNVs prioritized by different machine learning (ML) methods are illustrated in different tracks. B, Venn diagram showing the intersection of the prioritized SNVs by gradient boosted decision trees (GB), extremely randomized trees (ET), and random forest (RF). The names of the SNVs in the intersection of the three methods are provided. C, For the 145 SNVs in Alzheimer's disease (AD) predictors, distribution of the Fisher‐test P‐values obtained measuring differences in allele frequency (AF) between late onset Alzheimer's disease (LOAD) and controls over the chromosomes. The names of the SNVs prioritized by any of the three ML methods are provided, and a color is assigned depending on the number of methods that selected each SNV. The six SNVs prioritized by GB, ET, and RF are colored in red.
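The prioritization described in this caption (keep SNVs with FI >0.01 per method, then intersect across methods) can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline. The FI values for rs429358, rs405509, and rs7412 are those reported in this document's table of shared SNVs. The entry `snv_hypothetical` and its scores are invented to show a variant that fails the threshold in one method.

```python
FI_THRESHOLD = 0.01  # cut-off stated in the Figure 1 caption

# FI scores per method. The three rs IDs use values reported in the document;
# "snv_hypothetical" is an invented entry for illustration only.
feature_importance = {
    "RF": {"rs429358": 0.047, "rs405509": 0.130, "rs7412": 0.014, "snv_hypothetical": 0.004},
    "ET": {"rs429358": 0.041, "rs405509": 0.152, "rs7412": 0.015, "snv_hypothetical": 0.020},
    "GB": {"rs429358": 0.066, "rs405509": 0.139, "rs7412": 0.024, "snv_hypothetical": 0.012},
}


def prioritized(scores, threshold=FI_THRESHOLD):
    """Return the set of SNVs whose FI is strictly above the threshold."""
    return {snv for snv, value in scores.items() if value > threshold}


selected = {method: prioritized(scores) for method, scores in feature_importance.items()}

# SNVs selected by all three methods (the centre of the Venn diagram).
common = set.intersection(*selected.values())
# SNVs selected by at least one method.
any_method = set.union(*selected.values())

print(sorted(common))
print(sorted(any_method - common))
```

Here the hypothetical variant is selected by ET and GB but not RF, so it appears in the union and not in the three-way intersection.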
Characteristics of the six SNVs prioritized by the three machine learning methods
| SNV | Gene | Region | Chr | hg19 position | AF AD | AF Cntrl | LOG2 FC AF AD/Cntrl | FI RF | FI ET | FI GB | Fisher P‐value |
|---|---|---|---|---|---|---|---|---|---|---|---|
| rs429358 | | Exonic | 19 | 45411941 | 0.403 | 0.156 | 1.373 | 0.047 | 0.041 | 0.066 | 3.72E‐44 |
| rs769449 | | Intronic | 19 | 45410002 | 0.320 | 0.127 | 1.329 | 0.040 | 0.043 | 0.119 | 9.95E‐37 |
| rs4420638 | | Downstream | 19 | 45422946 | 0.417 | 0.189 | 1.144 | 0.051 | 0.050 | 0.045 | 4.56E‐42 |
| rs405509 | | Upstream | 19 | 45408836 | 0.411 | 0.470 | –0.194 | 0.130 | 0.152 | 0.139 | 1.40E‐03 |
| rs1160985 | | Intronic | 19 | 45403412 | 0.322 | 0.458 | –0.507 | 0.013 | 0.026 | 0.167 | 5.04E‐14 |
| rs7412 | | Exonic | 19 | 45412079 | 0.033 | 0.087 | –1.385 | 0.014 | 0.015 | 0.024 | 5.14E‐09 |
Notes: The dbSNP ID and gene annotations are provided in the columns “SNV”, “Gene”, “Region”, “Chr”, and “hg19 position”. AF in AD and in controls are used to calculate the log2 fold change (log2FC) of AD vs. controls (column “LOG2 FC AF AD/Cntrl”). SNVs are ordered from the highest log2FC (top) to the lowest (bottom) and colored in blue and red accordingly. FI values obtained with RF, ET, and GB are in columns “FI RF”, “FI ET”, and “FI GB”, respectively. Fisher test P‐values measuring the significance of AF differences between AD and controls are provided in the “Fisher P‐value” column.
Abbreviations: AD, Alzheimer's disease; SNV, single nucleotide variants; AF, allele frequencies; APOE, apolipoprotein E; FI, feature importance; RF, random forest; ET, extremely randomized trees; GB, gradient boosted decision trees.
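The log2 fold change column can be approximately reproduced from the two allele-frequency columns, as in the sketch below. It uses the rounded AF values printed in the table, so results differ from the reported column in the second or third decimal. The assumption is that the reported values were computed from unrounded frequencies, which the source does not confirm.

```python
import math

# (AF in AD, AF in controls), copied from the table of the six shared SNVs.
allele_freqs = {
    "rs429358": (0.403, 0.156),
    "rs769449": (0.320, 0.127),
    "rs4420638": (0.417, 0.189),
    "rs405509": (0.411, 0.470),
    "rs1160985": (0.322, 0.458),
    "rs7412": (0.033, 0.087),
}


def log2_fold_change(af_ad, af_control):
    """log2 of the ratio of AD allele frequency to control allele frequency."""
    return math.log2(af_ad / af_control)


log2fc = {snv: log2_fold_change(*afs) for snv, afs in allele_freqs.items()}

# Order SNVs from highest to lowest log2FC, as in the table.
ranked = sorted(log2fc, key=log2fc.get, reverse=True)
for snv in ranked:
    print(f"{snv}: {log2fc[snv]:+.3f}")
```

Positive values mark alleles more frequent in AD and negative values mark alleles more frequent in controls. The resulting order matches the row order of the table.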
FIGURE 2 Genomic profiles of correctly classified samples in gradient boosted decision trees (GB) defined with the nine prioritized single nucleotide variants (SNVs). Genomic profiles with only one sample or having missing values were excluded. In (A), genomic profiles of true positives (TP) represent all samples that were correctly classified as late onset Alzheimer's disease (LOAD). In (B), genomic profiles of true negatives (TN) represent all samples that were correctly classified as controls. Dendrograms on the top and the left were made with the Ward‐D2 method and Euclidean distances. Clusters of genomic profiles are indicated with numbers on the x‐axis. Fisher‐test P‐values are provided, measuring the statistical significance of the different representation of Alzheimer's disease (AD) and controls in clusters of genomic profiles. The percentage of samples having each genomic profile in LOAD and controls is indicated in the bar plots below the heatmaps. SNVs are colored by their corresponding gene loci, and the group with the higher allele frequency (AF), LOAD or controls, is indicated in the right‐side bar. An asterisk marks the six SNVs commonly prioritized by GB, extremely randomized trees (ET), and random forest (RF).
FIGURE 3 Representation of the pairwise test of interactions between the 14 single nucleotide variants (SNVs) prioritized by any of the three machine learning (ML) methods and commonly present in UK Biobank (UKB) and Alzheimer's Disease Neuroimaging Initiative 3 (ADNI3) arrays. Further details on the approach used to test the statistical significance of the pairwise interactions are provided in the Methods section “Statistical test for interactions between pairs of single nucleotide variants (SNVs)”. Details on the asymptotically exact harmonic mean P‐values (HMP) for each pairwise interaction are provided in Table S6. A cut‐off of HMP <0.01 was used to consider an interaction statistically significant. An asterisk marks the six SNVs commonly prioritized by gradient boosted decision trees (GB), extremely randomized trees (ET), and random forest (RF). Statistically significant interactions are enriched with rs1160985 and rs405509 in both datasets. These two SNVs (1) had high feature importance (FI) scores in the ML models, (2) had low allele frequency (AF) differences between Alzheimer's disease (AD) and controls, and (3) were involved in interaction patterns of the genomic profiles obtained with ML approaches. In UKB, 16 of the 19 statistically significant pairwise interactions involved rs1160985 or rs405509 (Fisher test P‐value 5.28E‐09). In ADNI, all the statistically significant pairwise interactions involved one of the two SNVs (Fisher test P‐value 9.42E‐08). SNVs are ordered from top to bottom and from left to right by decreasing number of statistically significant interactions. The gray gradient corresponding to the AF shows a weak correlation between the number of statistically significant pairwise interactions of SNVs and AF (Spearman correlation 0.41 and 0.32 in UKB and ADNI, respectively).
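To make the HMP cut‐off concrete, the sketch below computes the raw weighted harmonic mean of a set of P‐values and compares it with the 0.01 threshold from the caption. It is not the authors' procedure: the "asymptotically exact" HMP mentioned in the caption involves an additional calibration step that is not reproduced here, and how the P‐values for each SNV pair were obtained is described only in the paper's Methods. The input P‐values and the function name `harmonic_mean_p` are hypothetical.

```python
HMP_CUTOFF = 0.01  # significance cut-off stated in the Figure 3 caption


def harmonic_mean_p(p_values, weights=None):
    """Raw weighted harmonic mean of P-values (weights must sum to 1).

    This is the uncalibrated statistic only; the asymptotically exact
    version referred to in the caption is not implemented here.
    """
    n = len(p_values)
    if n == 0:
        raise ValueError("at least one P-value is required")
    if weights is None:
        weights = [1.0 / n] * n  # equal weights by default
    if len(weights) != n or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must match p_values in length and sum to 1")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("P-values must lie in (0, 1]")
    return 1.0 / sum(w / p for w, p in zip(weights, p_values))


# Hypothetical P-values for one SNV pair; not data from the study.
hypothetical_p_values = [0.01, 0.02, 0.04]
hmp = harmonic_mean_p(hypothetical_p_values)
is_significant = hmp < HMP_CUTOFF
print(f"HMP = {hmp:.4f}, significant at {HMP_CUTOFF}: {is_significant}")
```

The harmonic mean is dominated by the smallest P‐values, so a single very small value can pull the combined statistic below the cut‐off even when the others are large.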