Marlena Osipowicz, Bartek Wilczynski, Magdalena A Machnicka.
Abstract
Despite a great increase in the amount of data from genome-wide association studies (GWAS) and whole-genome sequencing (WGS), the genetic background of Alzheimer's disease (AD), which is only partially heritable, is not yet fully understood. Machine learning methods are expected to help researchers analyse the large number of SNPs possibly associated with disease onset. To date, a number of such approaches have been applied to genotype-based classification of AD patients and healthy controls using GWAS data, with reported accuracies of 0.65-0.975. However, since the estimated influence of genotype on sporadic AD occurrence is lower than that, these very high classification accuracies may be a result of overfitting. We explored feature selection and classification with random forests on WGS and GWAS data from two datasets. Our results suggest that this approach is prone to overfitting if feature selection is performed before the data are divided into training and testing sets. Therefore, we recommend that the features used to build the model never be selected using data included in the testing set. We suggest that for currently available dataset sizes the expected classifier performance is between 0.55 and 0.7 (AUC), and that higher accuracies reported in the literature are likely a result of overfitting.
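The overfitting mechanism described in the abstract can be reproduced on pure noise. The sketch below is an illustration only, not the authors' pipeline: it assumes random genotypes (0/1/2) with labels unrelated to them, and it swaps the random forest for a simple weighted-sum score so that it runs with the Python standard library alone. All names (`select_top_k`, `cross_validated_auc`, the sample and SNP counts) are hypothetical choices made for the example.

```python
import random

def mean_diff(X, y, rows, j):
    """Mean genotype of cases minus mean genotype of controls for SNP j, using only `rows`."""
    cases = [X[i][j] for i in rows if y[i] == 1]
    ctrls = [X[i][j] for i in rows if y[i] == 0]
    return sum(cases) / len(cases) - sum(ctrls) / len(ctrls)

def select_top_k(X, y, rows, k):
    """Indices of the k SNPs with the largest |case-control difference| on `rows`."""
    ranked = sorted(range(len(X[0])),
                    key=lambda j: abs(mean_diff(X, y, rows, j)),
                    reverse=True)
    return ranked[:k]

def auc(scores, labels):
    """AUC as the fraction of (case, control) pairs ranked correctly; ties count half."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def cross_validated_auc(X, y, k, n_folds, leaky):
    """Mean AUC over folds; `leaky=True` selects SNPs before the train/test split."""
    all_rows = list(range(len(X)))
    # Labels alternate 0/1, so assigning consecutive pairs to folds keeps folds balanced.
    fold_of = [(i // 2) % n_folds for i in all_rows]
    if leaky:
        features = select_top_k(X, y, all_rows, k)  # test samples influence selection
    fold_aucs = []
    for f in range(n_folds):
        train = [i for i in all_rows if fold_of[i] != f]
        test = [i for i in all_rows if fold_of[i] == f]
        if not leaky:
            features = select_top_k(X, y, train, k)  # selection sees training data only
        # The classifier itself is always fitted on the training fold only.
        weights = {j: mean_diff(X, y, train, j) for j in features}
        scores = [sum(w * X[i][j] for j, w in weights.items()) for i in test]
        fold_aucs.append(auc(scores, [y[i] for i in test]))
    return sum(fold_aucs) / n_folds

rng = random.Random(0)
n_samples, n_snps = 60, 2000
X = [[rng.randint(0, 2) for _ in range(n_snps)] for _ in range(n_samples)]
y = [i % 2 for i in range(n_samples)]  # labels carry no genetic signal by construction

leaky_auc = cross_validated_auc(X, y, k=20, n_folds=5, leaky=True)
honest_auc = cross_validated_auc(X, y, k=20, n_folds=5, leaky=False)
print(f"selection before split: AUC = {leaky_auc:.2f}")
print(f"selection inside folds: AUC = {honest_auc:.2f}")
```

Because the labels are independent of the genotypes, any AUC well above 0.5 in the first variant comes from the selection step having seen the test samples, which is the pattern the abstract warns about.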
Year: 2021 PMID: 34327330 PMCID: PMC8315124 DOI: 10.1093/nargab/lqab069
Source DB: PubMed Journal: NAR Genom Bioinform ISSN: 2631-9268
Figure 1. Analysis summary. (A) Feature (SNP) selection before division of the data into training and test sets. (B) Feature selection after division of the data into training and test sets. (C) Datasets included in the study.
Numbers of selected features and classifier performance for the analysis with feature selection before the train/test division. Mean values and standard deviations from 10-fold CV are given.
| ADNI WGS: AUC | ADNI WGS: #selected SNPs | ROSMAP: AUC | ROSMAP: #selected SNPs | ADNI GWAS: AUC | ADNI GWAS: #selected SNPs |
|---|---|---|---|---|---|
| 0.99 ± 0.02 | 341 690 | 0.99 ± 0.01 | 369 012 | 0.98 ± 0.03 | 24 998 |
Numbers of selected features and classifier performance for the analysis with feature selection before the train/test division, on SNPs shared between the ADNI-WGS and ROSMAP-WGS datasets. Mean values and standard deviations from 10-fold CV are given.
| Training set | Test set ADNI WGS: AUC | Test set ADNI WGS: #selected SNPs | Test set ROSMAP: AUC | Test set ROSMAP: #selected SNPs |
|---|---|---|---|---|
| ADNI WGS | 0.99 ± 0.01 | 257 634 | 0.50 ± 0.02 | 257 634 |
| ROSMAP | 0.50 ± 0.04 | 185 252 | 0.99 ± 0.01 | 185 252 |
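The contrast in this table (about 0.99 within a dataset, about 0.50 on the other dataset) is what one would expect if the selected SNPs capture noise specific to one cohort. The following is a minimal sketch under stated assumptions, not the authors' procedure: two independent cohorts of random genotypes with uninformative labels, a weighted-sum score standing in for the random forest, and in-sample scoring standing in for the cross-validation with selection before the split. The names `null_cohort`, `fit` and `predict` are hypothetical.

```python
import random

def auc(scores, labels):
    """AUC as the fraction of (case, control) pairs ranked correctly; ties count half."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_diff(X, y, j):
    """Mean genotype of cases minus mean genotype of controls for SNP j."""
    cases = [row[j] for row, t in zip(X, y) if t == 1]
    ctrls = [row[j] for row, t in zip(X, y) if t == 0]
    return sum(cases) / len(cases) - sum(ctrls) / len(ctrls)

def fit(X, y, k):
    """Keep the k SNPs with the largest |difference|; the differences serve as weights."""
    diffs = {j: mean_diff(X, y, j) for j in range(len(X[0]))}
    top = sorted(diffs, key=lambda j: abs(diffs[j]), reverse=True)[:k]
    return {j: diffs[j] for j in top}

def predict(model, X):
    """Weighted sum of the selected genotypes for every sample."""
    return [sum(w * row[j] for j, w in model.items()) for row in X]

def null_cohort(rng, n_samples, n_snps):
    """Random genotypes with alternating labels, so genotype and label are unrelated."""
    X = [[rng.randint(0, 2) for _ in range(n_snps)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    return X, y

rng = random.Random(1)
X_a, y_a = null_cohort(rng, 60, 2000)   # stands in for one cohort
X_b, y_b = null_cohort(rng, 60, 2000)   # stands in for an independent cohort

model = fit(X_a, y_a, k=20)             # selection and fitting use cohort A only
auc_same = auc(predict(model, X_a), y_a)
auc_other = auc(predict(model, X_b), y_b)
print(f"scored on the cohort used for selection: AUC = {auc_same:.2f}")
print(f"scored on the independent cohort:        AUC = {auc_other:.2f}")
```

An independent cohort never took part in the selection, so it gives an estimate that the leak cannot inflate.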
Numbers of selected features and classifier performance for the analysis with feature selection after the train/test division. Mean values and standard deviations from three repetitions are given.
| SNPs set | ADNI WGS: AUC | ADNI WGS: #selected SNPs | ROSMAP: AUC | ROSMAP: #selected SNPs | ADNI GWAS: AUC | ADNI GWAS: #selected SNPs |
|---|---|---|---|---|---|---|
| All from each dataset | 0.56 ± 0.02 | 334 886 ± 15 762 | 0.58 ± 0.02 | 358 181 ± 4640 | 0.67 ± 0.06 | 24 610 ± 667 |
| Shared between ADNI-WGS and ROSMAP-WGS | 0.51 ± 0.09 | 258 758 ± 13 288 | 0.54 ± 0.04 | 182 348 ± 5622 | - | - |
Numbers of selected features and classifier performance for the analysis with feature selection after the train/test division, performed on ROSMAP and ADNI-WGS subsets of patients chosen based on genetic similarity. Mean values and standard deviations were calculated from two and four repetitions for ADNI-WGS and ROSMAP, respectively.
| ADNI WGS: AUC | ADNI WGS: #selected SNPs | ROSMAP: AUC | ROSMAP: #selected SNPs |
|---|---|---|---|
| 0.63 ± 0.03 | 201 484 | 0.55 ± 0.09 | 160 992 ± 211 |
Cross-classification based on features chosen and/or a random forest built on a different dataset, after the train/test division (mean values and standard deviations from two repetitions).
| Analysis type | Training set | Test set | AUC | #selected SNPs |
|---|---|---|---|---|
| SNPs selected and random forest built on training set | ADNI-WGS | ROSMAP | 0.55 ± 0.01 | 243 799 |
| SNPs selected and random forest built on training set | ROSMAP | ADNI-WGS | 0.53 ± 0.04 | 183 528 |
| SNPs selected on training set, random forest built on test set | ADNI-WGS | ROSMAP | 0.55 ± 0.03 | 243 799 |
| SNPs selected on training set, random forest built on test set | ROSMAP | ADNI-WGS | 0.57 ± 0.01 | 183 528 |
Gene Ontology terms overrepresentation for SNPs selected from ADNI-WGS and ROSMAP-WGS
| PANTHER GO-Slim Biological Process | # in reference list | # in input list | Expected | Fold enrichment | Raw P-value | FDR |
|---|---|---|---|---|---|---|
| Cell–cell adhesion via plasma-membrane adhesion molecules (GO:0098742) | 52 | 29 | 7.24 | 4.01 | 3.77E-08 | 7.53E-06 |
| Modulation of chemical synaptic transmission (GO:0050804) | 75 | 32 | 10.44 | 3.07 | 8.79E-07 | 1.13E-04 |
| Adherens junction organization (GO:0034332) | 23 | 16 | 3.2 | 5 | 5.55E-06 | 4.99E-04 |
| Cell–cell junction assembly (GO:0007043) | 28 | 17 | 3.9 | 4.36 | 1.07E-05 | 8.35E-04 |
| Synaptic transmission, glutamatergic (GO:0035249) | 44 | 19 | 6.12 | 3.1 | 1.24E-04 | 5.85E-03 |
| Actin cytoskeleton organization (GO:0030036) | 179 | 47 | 24.91 | 1.89 | 2.16E-04 | 9.45E-03 |
| Cellular calcium ion homeostasis (GO:0006874) | 151 | 41 | 21.01 | 1.95 | 3.39E-04 | 1.45E-02 |
| Cell morphogenesis involved in neuron differentiation (GO:0048667) | 110 | 32 | 15.31 | 2.09 | 6.83E-04 | 2.51E-02 |
| Second-messenger-mediated signaling (GO:0019932) | 191 | 47 | 26.58 | 1.77 | 9.26E-04 | 3.08E-02 |
| Synapse assembly (GO:0007416) | 15 | 9 | 2.09 | 4.31 | 1.40E-03 | 4.11E-02 |
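The fold-enrichment column can be checked directly from the other columns. The short sketch below assumes that the first count column is the number of genes annotated with the term in the reference list, the second is the number found in the analysed gene list, and that fold enrichment is the observed count divided by the expected count. The values are copied from the table; the variable names are illustrative.

```python
# (GO term, # in reference list, # in analysed list, expected, reported fold enrichment)
rows = [
    ("GO:0098742", 52, 29, 7.24, 4.01),
    ("GO:0050804", 75, 32, 10.44, 3.07),
    ("GO:0034332", 23, 16, 3.2, 5.0),
    ("GO:0007043", 28, 17, 3.9, 4.36),
    ("GO:0035249", 44, 19, 6.12, 3.1),
    ("GO:0030036", 179, 47, 24.91, 1.89),
    ("GO:0006874", 151, 41, 21.01, 1.95),
    ("GO:0048667", 110, 32, 15.31, 2.09),
    ("GO:0019932", 191, 47, 26.58, 1.77),
    ("GO:0007416", 15, 9, 2.09, 4.31),
]

# Fold enrichment recomputed as observed / expected.
folds = [n_obs / expected for _, _, n_obs, expected, _ in rows]

# Expected / reference count: under the assumption above this should be the same
# for every term, namely the share of the reference list covered by the analysed genes.
fractions = [expected / n_ref for _, n_ref, _, expected, _ in rows]

for (term, _, _, _, reported), fold, frac in zip(rows, folds, fractions):
    print(f"{term}: fold = {fold:.2f} (reported {reported}), expected/reference = {frac:.4f}")
```

The recomputed values agree with the reported column to two decimals, and the expected-to-reference ratio is nearly constant across terms (about 0.139), which is consistent with the column interpretation assumed here.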