Jorge I Vélez, Luiggi A Samper, Mauricio Arcos-Holzinger, Lady G Espinosa, Mario A Isaza-Ruget, Francisco Lopera, Mauricio Arcos-Burgos.
Abstract
Machine learning (ML) algorithms are widely used to develop predictive frameworks. Accurate prediction of Alzheimer's disease (AD) age of onset (ADAOO) is crucial for investigating potential treatments, follow-up, and therapeutic interventions. Although our group and others have elucidated genetic and non-genetic factors affecting ADAOO, the comprehensive and sequential application of ML to estimate the actual ADAOO precisely, rather than a wide confidence interval within which it may fall, remains to be explored. Here, we assessed the performance of ML algorithms for predicting ADAOO in two AD cohorts, one with early-onset familial AD and one with late-onset sporadic AD, combining genetic and demographic variables. Performance was assessed using the root mean squared error (RMSE), the R-squared (R²), and the mean absolute error (MAE) with a 10-fold cross-validation procedure. For predicting ADAOO in familial AD, boosting-based ML algorithms performed best. In the sporadic cohort, boosting-based ML algorithms performed best on the training data set, while regularization methods performed best on unseen data. ML algorithms represent a feasible alternative for accurately predicting ADAOO with little human intervention. Future studies may include predicting the speed of cognitive decline in our cohorts using ML.
Keywords: Alzheimer’s disease; PSEN1; age of onset; genetic isolates; machine learning; natural history; predictive genomics
Year: 2021 PMID: 34067584 PMCID: PMC8156402 DOI: 10.3390/diagnostics11050887
Source DB: PubMed Journal: Diagnostics (Basel) ISSN: 2075-4418
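The evaluation scheme named in the abstract (RMSE, R², and MAE under 10-fold cross-validation) can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the single-predictor least-squares model, the synthetic data, and the helper names (`fit_line`, `cross_validate`) are all assumptions made for the example. R² is computed here as 1 − SS_res/SS_tot on pooled out-of-fold predictions; other tools may define it differently (for example, as a squared correlation) or average the measures per fold.

```python
import math
import random

def rmse(y_true, y_pred):
    """Root mean squared error (lower is better)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error (lower is better)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (higher is better)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def cross_validate(x, y, k=10, seed=0):
    """k-fold CV: every sample is predicted by a model that never saw it."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    out_of_fold = [None] * len(x)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in held_out:
            out_of_fold[i] = slope * x[i] + intercept
    return {"rmse": rmse(y, out_of_fold),
            "r2": r_squared(y, out_of_fold),
            "mae": mae(y, out_of_fold)}

# Synthetic stand-in data (NOT from the study): one numeric predictor and a
# noisy "age of onset"-like outcome.
rng = random.Random(1)
x_demo = [rng.uniform(0, 2) for _ in range(60)]
y_demo = [48 + 3 * a + rng.gauss(0, 2) for a in x_demo]
print(cross_validate(x_demo, y_demo, k=10))
```

Pooling the out-of-fold predictions avoids undefined R² values in very small folds, which matters for cohorts of this size.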
Table 1. Common exonic functional variants modifying ADAOO in 125 individuals from the Paisa genetic isolate.
| Cohort | Chr | Marker | Position | Gene | Change | β (SE) | PFDR |
|---|---|---|---|---|---|---|---|
| E280A | 19 | rs7412 | 45,412,079 | | p.Arg176Cys | 17.45 (0.48) | 2.13 × 10⁻³⁰ |
| | 8 | rs36092215 | 142,367,246 | | p.Arg260Cys | 12.12 (0.54) | 6.58 × 10⁻²² |
| | 11 | rs12364019 | 5,730,343 | | p.Arg321Lys | −11.64 (0.79) | 1.15 × 10⁻¹⁴ |
| | 1 | rs16838748 | 157,508,997 | | p.Asn427Lys | 7.14 (0.68) | 8.61 × 10⁻¹⁰ |
| | 7 | rs12701506 | 36,566,020 | | | −2.75 (0.30) | 5.69 × 10⁻⁸ |
| | 19 | rs2682585 | 44,081,288 | | p.His6Arg | −1.68 (0.21) | 1.67 × 10⁻⁶ |
| | 1 | rs62621173 | 159,021,506 | | p.Ser512Phe | −2.80 (0.37) | 8.63 × 10⁻⁶ |
| | 1 | rs10798302 | 173,987,798 | | | 1.76 (0.27) | 1.86 × 10⁻⁴ |
| | 7 | rs754554 | 24,758,818 | | p.Pro142Thr | −1.39 (0.28) | 3.62 × 10⁻² |
| Sporadic | 2 | rs35946826 | 105,859,249 | | p.Leu312fs | −12.67 (0.148) | 3.08 × 10⁻³⁶ |
| | 1 | rs61742849 | 114,226,143 | | p.Gly1318fs | −14.32 (0.199) | 4.38 × 10⁻³⁴ |
| | 6 | rs675026 | 154,414,563 | | p.Ala442fs | 5.42 (0.079) | 1.15 × 10⁻³³ |
| | 10 | rs838759 | 22,498,468 | | p.Gly149fs | −4.26 (0.092) | 3.90 × 10⁻²⁸ |
| | 17 | rs61749930 | 48,594,691 | | p.Arg124fs | −12.08 (0.286) | 6.06 × 10⁻²⁷ |
| | 19 | rs7250872 | 1,811,603 | | p.Gly45fs | −2.54 (0.088) | 9.57 × 10⁻²² |
| | 16 | rs749670 | 31,088,625 | | p.Lys328fs | −1.52 (0.067) | 1.35 × 10⁻¹⁸ |
| | 4 | rs7677237 | 89,306,659 | | p.Met123fs | 2.14 (0.122) | 3.58 × 10⁻¹⁵ |
| | 4 | rs6835769 | 79,284,694 | | p.Ala817fs | −1.11 (0.074) | 2.74 × 10⁻¹³ |
| | 11 | rs4757987 | 5,906,205 | | p.Arg228fs | 1.02 (0.07) | 6.86 × 10⁻¹³ |
| | 20 | rs236150 | 5,903,141 | | p.Lys117fs | −2.14 (0.181) | 2.12 × 10⁻¹⁰ |
| | 6 | rs3130257 | 33,256,471 | | p.Thr40fs | −2.35 (0.209) | 7.92 × 10⁻¹⁰ |
| | 18 | rs754093 | 77,246,406 | | p.Cys751fs | −0.94 (0.094) | 1.34 × 10⁻⁸ |
| | 3 | rs34230332 | 14,725,878 | | p.Leu84fs | 1.59 (0.185) | 4.81 × 10⁻⁷ |
| | 19 | rs867228 | 52,249,211 | | p.Glu346fs | −0.94 (0.115) | 1.34 × 10⁻⁶ |
| | 4 | rs3733251 | 77,192,838 | | p.Arg166fs | −0.71 (0.127) | 2.07 × 10⁻³ |
| | 16 | rs2303772 | 87,795,580 | | p.Leu56fs | 0.75 (0.135) | 2.75 × 10⁻³ |
| | 16 | rs739999 | 319,511 | | p.Met416fs | 0.35 (0.075) | 3.48 × 10⁻² |
| | 16 | rs34779002 | 87,782,396 | | p.Gly74fs | 0.78 (0.172) | 4.00 × 10⁻² |
| | 15 | rs6493068 | 43,170,793 | | p.Asp9fs | −0.48 (0.107) | 4.27 × 10⁻² |
| | 16 | rs17137138 | 4,606,743 | | p.Val85fs | 1.00 (0.223) | 4.40 × 10⁻² |
| | 7 | rs3823646 | 99,757,612 | | p.Lys468fs | −0.31 (0.069) | 4.47 × 10⁻² |
| | 13 | rs17081389 | 25,487,001 | | p.Pro55fs | 1.00 (0.223) | 4.61 × 10⁻² |
| | 10 | rs78334417 | 75,071,618 | | p.Pro450fs | 1.00 (0.223) | 4.84 × 10⁻² |
| | 7 | rs186048202 | 134,678,273 | | p.Arg52fs | 0.61 (0.139) | 4.91 × 10⁻² |
UCSC GRCh37/hg19 coordinates; Markers can accelerate (β < 0) or delay (β > 0) ADAOO according to their effect; Chromatin state segmentation strong enhancer state-5 from ChIP-seq data; CpG islands, DNaseI hypersensitivity uniform peak from ENCODE/analysis. ADAOO = Alzheimer’s disease age of onset; Chr = chromosome; β = regression coefficient; SE = standard error of β; PFDR = P-value corrected using the False Discovery Rate (FDR) [78,79].
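For readers unfamiliar with FDR-corrected P-values such as PFDR, the sketch below shows one common way to compute them, the Benjamini-Hochberg step-up adjustment. This is an assumption made for illustration: the record cites only [78,79] and does not state which FDR procedure was applied, and the function name `bh_adjust` and the example P-values are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order.

    Each raw P-value is scaled by m / rank (rank 1 = smallest), and a running
    minimum taken from the largest rank downward keeps the adjusted values
    monotone in the raw ones.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw P-values (NOT taken from the table above).
raw = [0.01, 0.04, 0.03, 0.005]
print(bh_adjust(raw))
```

A marker is then declared significant when its adjusted value falls below the chosen FDR threshold (0.05 is a conventional choice).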
Table 2. Performance of ML algorithms for predicting ADAOO in the E280A pedigree. RMSE = root mean squared error, lower is better; MAE = mean absolute error, lower is better; R² = coefficient of determination, higher is better. Best results are shown in bold.
| ML Algorithm | RMSE (Training) | RMSE (Testing) | R² (Training) | R² (Testing) | MAE (Training) | MAE (Testing) |
|---|---|---|---|---|---|---|
| glmboost | 3.51 | | 0.62 | | 2.41 | |
| bstTree | 3.67 | 6.75 | 0.59 | 0.08 | 3.00 | 4.52 |
| gbm | 4.90 | 6.68 | 0.27 | 0.09 | 3.86 | 4.52 |
| glmnet | 3.59 | 3.85 | 0.62 | 0.64 | 2.51 | 2.89 |
| knn | 4.53 | 6.35 | 0.39 | 0.05 | 3.56 | 4.13 |
| mlp | 6.30 | 6.62 | 0.07 | 0.43 | 5.64 | 5.78 |
| qrf | 1.35 | 7.24 | 0.95 | 0.03 | 0.69 | 4.65 |
| rf | 2.14 | 6.17 | 0.91 | 0.12 | 1.70 | 3.93 |
| rpart | 4.73 | 6.36 | 0.31 | 0.07 | 3.95 | 4.51 |
| rpart1SE | 4.18 | 5.89 | 0.46 | 0.18 | 3.35 | 4.11 |
| rpart2 | 4.28 | 6.02 | 0.43 | 0.15 | 3.43 | 4.11 |
| svmLinear | 4.74 | 6.80 | 0.43 | 0.07 | 2.97 | 4.21 |
| svmLinear2 | 4.74 | 6.80 | 0.43 | 0.07 | 2.97 | 4.21 |
| svmPoly | 3.46 | 7.30 | 0.66 | 0.14 | 1.86 | 5.13 |
| svmRadial | 5.21 | 6.50 | 0.35 | 0.02 | 3.43 | 3.96 |
| treebag | 4.26 | 6.02 | 0.45 | 0.16 | 3.47 | 4.20 |
| xgbLinear | | 7.14 | | 0.06 | | 4.28 |
| xgbTree | 1.79 | 7.12 | 0.90 | 0.08 | 1.28 | 4.65 |
Figure 1. PCA and K-means clustering representation of the performance measures for ML algorithms predicting ADAOO in individuals carrying the PSEN1 E280A mutation when the (a) training (n = 51) and (b) testing (n = 20) data sets are used. (c) Variable importance for the glmnet (left) and glmboost (right) ML algorithms. Here, higher values are better.
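The clustering idea behind this figure can be illustrated with a small sketch that groups algorithms by their standardized performance measures using K-means (Lloyd's algorithm). Several things here are assumptions: the figure's actual settings (number of clusters, initialization, software) are not given in this record, the PCA projection used for display is omitted, and the helper names and the farthest-point initialization are choices made for the example. The input values are the testing-set RMSE, R², and MAE of eight algorithms from the E280A performance table.

```python
import math

def standardize(rows):
    """Scale every column to mean 0 and sample standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm with a deterministic farthest-point initialization."""
    centers = [list(points[0])]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(list(max(points,
                                key=lambda p: min(sq_dist(p, c) for c in centers))))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Testing-set (RMSE, R2, MAE) values copied from the E280A performance table.
names = ["gbm", "glmnet", "knn", "mlp", "qrf", "rf", "svmPoly", "xgbTree"]
measures = [
    [6.68, 0.09, 4.52], [3.85, 0.64, 2.89], [6.35, 0.05, 4.13], [6.62, 0.43, 5.78],
    [7.24, 0.03, 4.65], [6.17, 0.12, 3.93], [7.30, 0.14, 5.13], [7.12, 0.08, 4.65],
]
labels = kmeans(standardize(measures), k=2)
for name, label in zip(names, labels):
    print(name, "-> cluster", label)
```

Standardizing first keeps RMSE and MAE, which are measured in years, from dominating R², which is bounded by 1.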
Table 3. Performance of ML algorithms for predicting ADAOO in individuals with sporadic AD from the Paisa genetic isolate. Conventions as in Table 2. Best results are shown in bold.
| ML Algorithm | RMSE (Training) | RMSE (Testing) | R² (Training) | R² (Testing) | MAE (Training) | MAE (Testing) |
|---|---|---|---|---|---|---|
| bstTree | 3.33 | 5.22 | 0.83 | 0.44 | 2.56 | 3.75 |
| glmboost | 2.32 | 3.08 | 0.92 | 0.84 | 1.96 | 2.47 |
| glmnet | 0.25 | 0.52 | 1.00 | 0.99 | 0.17 | |
| knn | 5.37 | 6.75 | 0.48 | 0.16 | 3.90 | 4.98 |
| lasso | 0.40 | | 1.00 | | 0.31 | 0.42 |
| qrf | 0.87 | 5.86 | 0.99 | 0.30 | 0.40 | 4.57 |
| rf | 2.47 | 5.09 | 0.94 | 0.49 | 1.86 | 4.15 |
| rpart | 5.53 | 7.69 | 0.38 | 0.00 | 4.46 | 6.37 |
| rpart1SE | 5.53 | 7.69 | 0.38 | 0.00 | 4.46 | 6.37 |
| rpart2 | 5.92 | 6.98 | 0.29 | 0.03 | 4.63 | 5.75 |
| svmLinear | 0.61 | 1.11 | 0.99 | 0.97 | 0.57 | 0.83 |
| svmLinear2 | 0.61 | 1.11 | 0.99 | 0.97 | 0.57 | 0.83 |
| svmPoly | 0.75 | 1.33 | 0.99 | 0.96 | 0.70 | 1.07 |
| svmRadial | 2.57 | 4.70 | 0.93 | 0.51 | 1.57 | 3.64 |
| treebag | 5.22 | 7.02 | 0.48 | 0.02 | 4.13 | 5.54 |
| xgbLinear | | 4.61 | | 0.67 | | 3.32 |
| xgbTree | 1.13 | 3.98 | 0.98 | 0.70 | 0.93 | 3.19 |
Figure 2. PCA and K-means clustering representation of the performance measures for ML algorithms predicting ADAOO in individuals with sporadic AD from the Paisa genetic isolate when the (a) training (n = 40) and (b) testing (n = 14) data sets are used. (c) Variable importance for the svmLinear (left), lasso (center) and glmnet (right) ML algorithms. Conventions as in Figure 1.
Figure 3. Variable importance for the best ADAOO-predicting ML algorithm in (a) individuals carrying the E280A mutation and (b) individuals with sAD. Blue dots represent the average importance; segments represent 95% bootstrap-based confidence intervals based on B = 1000 replicates. Conventions as in Figure 1.
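A bootstrap-based 95% confidence interval with B = 1000 replicates, as mentioned in this caption, can be sketched as follows. The percentile method, the choice to resample a list of importance scores, and the example values are all assumptions for illustration; the record does not say what was resampled or which interval type was used.

```python
import random

def bootstrap_ci(values, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`.

    Returns (mean, lower, upper). Each replicate resamples the data with
    replacement and records the replicate mean.
    """
    rng = random.Random(seed)
    n = len(values)
    boot_means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    alpha = (1.0 - level) / 2.0
    lo_idx = max(0, round(alpha * n_boot))
    hi_idx = min(n_boot - 1, round((1.0 - alpha) * n_boot) - 1)
    return sum(values) / n, boot_means[lo_idx], boot_means[hi_idx]

# Hypothetical importance scores for one variable across repeated model fits
# (NOT values from the study).
importance_scores = [62.0, 71.5, 58.3, 80.1, 66.7, 74.2, 69.9, 61.4]
mean, lower, upper = bootstrap_ci(importance_scores, n_boot=1000)
print(f"mean = {mean:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

The fixed seed makes the interval reproducible; with so few observations, the percentile interval is only a rough guide to uncertainty.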
Figure 4. Variable importance vs. effect on ADAOO for genetic variants in individuals with (a) E280A PSEN1 and (b) sporadic AD. Protective (β > 0) variants are shown in green, while harmful (β < 0) variants are shown in red. See Table 1 for more details.