Erik D Huckvale1, Matthew W Hodgman1, Brianna B Greenwood2, Devorah O Stucki2, Katrisa M Ward2, Mark T W Ebbert1, John S K Kauwe2, Justin B Miller1.
Abstract
The Alzheimer's Disease Neuroimaging Initiative (ADNI) contains extensive patient measurements (e.g., magnetic resonance imaging [MRI], biometrics, RNA expression, etc.) from Alzheimer's disease (AD) cases and controls that have recently been used by machine learning algorithms to evaluate AD onset and progression. While using a variety of biomarkers is essential to AD research, highly correlated input features can significantly decrease machine learning model generalizability and performance. Additionally, redundant features unnecessarily increase computational time and resources necessary to train predictive models. Therefore, we used 49,288 biomarkers and 793,600 extracted MRI features to assess feature correlation within the ADNI dataset to determine the extent to which this issue might impact large-scale analyses using these data. We found that 93.457% of biomarkers, 92.549% of the gene expression values, and 100% of MRI features were strongly correlated with at least one other feature in ADNI based on our Bonferroni corrected α (p-value ≤ 1.40754 × 10−13). We provide a comprehensive mapping of all ADNI biomarkers to highly correlated features within the dataset. Additionally, we show that significant correlation within the ADNI dataset should be resolved before performing bulk data analyses, and we provide recommendations to address these issues. We anticipate that these recommendations and resources will help guide researchers utilizing the ADNI dataset to increase model performance and reduce the cost and complexity of their analyses.
Keywords: ADNI; Alzheimer’s disease; feature reduction; machine learning; pairwise feature correlation
Year: 2021 PMID: 34828267 PMCID: PMC8619902 DOI: 10.3390/genes12111661
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. Creation of the MRI domain from the MRI slice sequences using the trained convolutional autoencoders. A separate autoencoder was trained for each MRI slice, and the latent space was concatenated for each person to create a row specific to that individual.
Statistical tests chosen for comparisons and their conditions.
| Comparison Data Types | Condition | Statistical Test |
|---|---|---|
| Numeric and Numeric | Both features follow a normal distribution | Pearson correlation |
| Numeric and Numeric | At least one of the features does not follow a normal distribution | Spearman correlation |
| Categorical and Categorical | The contingency table contains at least one frequency less than five | N/A |
| Categorical and Categorical | All frequencies in the contingency table are greater than or equal to five | Chi-squared |
| Numeric and Categorical | All categories have a normal distribution | ANOVA |
| Numeric and Categorical | Not all categories have a normal distribution | Kruskal-Wallis |
For numeric features, we chose between a parametric test (when the data followed a normal distribution) and a non-parametric test (when they did not). If both features were categorical, we used the Chi-squared test, provided that every cell of the contingency table formed from the two features contained at least five instances. Otherwise, the test was not performed.
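The decision rules in the table above can be sketched as a small lookup function. This is only an illustration of the table's logic, not the authors' code: the function name `choose_test`, its parameters, and the returned labels are hypothetical, and how normality is assessed is left to the caller because this section does not say which normality check was used.

```python
def choose_test(kind_a, kind_b, all_normal=False, min_cell_count=None):
    """Return the test named in the table above, or None when the row says N/A.

    kind_a, kind_b: "numeric" or "categorical" (hypothetical labels).
    all_normal: True when both numeric features (or, for a mixed pair,
        the numeric values within every category) look normally distributed.
    min_cell_count: smallest frequency in the contingency table; only
        needed when both features are categorical.
    """
    kinds = sorted([kind_a, kind_b])
    if kinds == ["numeric", "numeric"]:
        return "Pearson correlation" if all_normal else "Spearman correlation"
    if kinds == ["categorical", "categorical"]:
        if min_cell_count is None:
            raise ValueError("min_cell_count is required for two categorical features")
        # Chi-squared only when every cell holds at least five instances.
        return "Chi-squared" if min_cell_count >= 5 else None
    if kinds == ["categorical", "numeric"]:
        return "ANOVA" if all_normal else "Kruskal-Wallis"
    raise ValueError(f"unknown feature kinds: {kind_a!r}, {kind_b!r}")
```

For example, `choose_test("numeric", "categorical", all_normal=False)` selects Kruskal-Wallis, and `choose_test("categorical", "categorical", min_cell_count=3)` returns `None` because the comparison is skipped.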
ADNIMERGE Features that are Highly Correlated with other Features.
| Feature Name | ADNIMERGE Frequency | Gene Expression Frequency | MRI Frequency | Total Frequency |
|---|---|---|---|---|
| PTGENDER | 207 | 281 | 145,780 | 146,268 |
| ICV | 265 | 84 | 143,377 | 143,724 |
| CLOCKNUM | 199 | 0 | 97,725 | 97,924 |
| COPYTIME | 243 | 0 | 97,030 | 97,228 |
| CLOCKSYM | 216 | 0 | 96,307 | 96,523 |
| ST65SV | 191 | 46 | 81,250 | 81,487 |
| GLUCOSE | 155 | 0 | 81,245 | 81,400 |
Numbers of correlated features for seven example features in the ADNIMERGE domain. The ‘Feature Name’ is the column header as it appeared in our constructed tabular data set. The ‘ADNIMERGE Frequency’ is the number of ADNIMERGE features that are highly correlated with the feature. For example, intra-cranial volume (ICV) is correlated with 265 other ADNIMERGE features. It is likewise correlated with 84 gene expression levels and 143,377 extracted MRI features. The ‘Total Frequency’ is the sum of the ‘ADNIMERGE Frequency’, ‘Gene Expression Frequency’, and ‘MRI Frequency’. In other words, it is the total number of features that are highly correlated with each row across the entire ADNI data set.
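As a minimal sketch of how such per-domain frequencies could be tallied, the snippet below counts, for each feature, its significantly correlated partners by domain and sums them into a total. The function `count_correlated`, the toy feature names `gene_1`, `mri_0`, and `mri_1`, and the example pairs are all hypothetical; the sketch assumes each significant pair is listed exactly once.

```python
from collections import defaultdict

def count_correlated(pairs, domain_of):
    """Tally, per feature, the number of correlated partners in each domain."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in pairs:
        # Each significant pair contributes one partner to both features.
        counts[a][domain_of[b]] += 1
        counts[b][domain_of[a]] += 1
    result = {}
    for feature, per_domain in counts.items():
        row = dict(per_domain)
        row["Total"] = sum(per_domain.values())  # sum across all domains
        result[feature] = row
    return result

# Hypothetical toy data, not taken from ADNI.
domain_of = {
    "ICV": "ADNIMERGE",
    "PTGENDER": "ADNIMERGE",
    "gene_1": "Gene Expression",
    "mri_0": "MRI",
    "mri_1": "MRI",
}
pairs = [
    ("ICV", "PTGENDER"),
    ("ICV", "gene_1"),
    ("ICV", "mri_0"),
    ("ICV", "mri_1"),
    ("PTGENDER", "mri_0"),
]
freq = count_correlated(pairs, domain_of)
print(freq["ICV"])
```

In this toy input, `ICV` has one ADNIMERGE partner, one gene expression partner, and two MRI partners, so its total is four.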
Summarized correlated feature frequencies based on the Bonferroni corrected α.
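| Domain (A: ADNIMERGE Frequency) | Average | Standard Deviation | Minimum | Maximum | Domain (B: Gene Expression Frequency) | Average | Standard Deviation | Minimum | Maximum |
|---|---|---|---|---|---|---|---|---|---|
| ADNIMERGE | 129.49 | 88.06 | 1 | 346 | ADNIMERGE | 11.91 | 30.9 | 0 | 616 |
| Gene Expression | 0.28 | 5.52 | 0 | 189 | Gene Expression | 6139.72 | 6195.45 | 1 | 24,588 |
| MRI | 9.31 | 20.09 | 0 | 188 | MRI | 7.87 | 19.66 | 0 | 149 |

| Domain (C: MRI Frequency) | Average | Standard Deviation | Minimum | Maximum | Domain (D: Total Frequency) | Average | Standard Deviation | Minimum | Maximum |
|---|---|---|---|---|---|---|---|---|---|
| ADNIMERGE | 6988.04 | 19,170.23 | 0 | 145,780 | ADNIMERGE | 7129.43 | 19,203.93 | 1 | 146,268 |
| Gene Expression | 140.05 | 3642.09 | 0 | 119,556 | Gene Expression | 6280.05 | 7096.48 | 1 | 120,141 |
| MRI | 141,348.57 | 69,866.96 | 81 | 347,944 | MRI | 141,365.75 | 69,873.31 | 81 | 347,955 |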
Summary of the numbers of correlated features based on the Bonferroni corrected α. Sections A through D provide summary statistics for the domain frequencies for ADNIMERGE, Gene Expression, MRI, and Total. For example, the ‘Average’ column of the ‘MRI’ row in section A gives the average number of ADNIMERGE features with which the MRI features are strongly correlated. That row states that the MRI features are strongly correlated with an average of 9.31 ADNIMERGE features, with a standard deviation of 20.09 features. The 0 in the ‘Minimum’ column indicates that at least one MRI feature is not correlated with any ADNIMERGE features. The 188 under ‘Maximum’ indicates that at least one MRI feature is correlated with 188 ADNIMERGE features when p-value ≤ 1.40754 × 10−13.
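The threshold quoted in this caption is consistent with a standard Bonferroni correction over all unique feature pairs. The sketch below is a reading of the reported number rather than the authors' stated procedure: it assumes a nominal α of 0.05 (not given in this section) and that every unordered pair among the 49,288 biomarkers and 793,600 MRI features from the abstract is tested once.

```python
from math import comb

n_biomarkers = 49_288       # from the abstract
n_mri_features = 793_600    # from the abstract
n_features = n_biomarkers + n_mri_features

# Assumption: one test per unordered pair of features.
n_comparisons = comb(n_features, 2)

# Assumption: nominal alpha of 0.05 before correction.
nominal_alpha = 0.05
bonferroni_alpha = nominal_alpha / n_comparisons

print(f"{n_comparisons:,} comparisons, alpha = {bonferroni_alpha:.5e}")
```

Under these assumptions the corrected α comes out at about 1.40754 × 10−13, matching the value reported above.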
Summarized correlated feature frequencies based on the maximally significant α.
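| Domain (A: ADNIMERGE Frequency) | Average | Standard Deviation | Minimum | Maximum | Domain (B: Gene Expression Frequency) | Average | Standard Deviation | Minimum | Maximum |
|---|---|---|---|---|---|---|---|---|---|
| ADNIMERGE | 5.48 | 5.83 | 1 | 23 | ADNIMERGE | 0.0 | 0.0 | 0 | 0 |
| Gene Expression | 0.0 | 0.0 | 0 | 0 | Gene Expression | 1.55 | 0.94 | 1 | 7 |
| MRI | 0.0 | 0.0 | 0 | 0 | MRI | 0.0 | 0.0 | 0 | 0 |

| Domain (C: MRI Frequency) | Average | Standard Deviation | Minimum | Maximum | Domain (D: Total Frequency) | Average | Standard Deviation | Minimum | Maximum |
|---|---|---|---|---|---|---|---|---|---|
| ADNIMERGE | 0.0 | 0.0 | 0 | 0 | ADNIMERGE | 5.48 | 5.83 | 1 | 23 |
| Gene Expression | 0.0 | 0.0 | 0 | 0 | Gene Expression | 1.55 | 0.94 | 1 | 7 |
| MRI | 2457.08 | 4397.99 | 1 | 11,957 | MRI | 2457.08 | 4397.99 | 1 | 11,957 |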
Summary of the numbers of correlated features based on the maximally significant comparisons (p-value ≤ 5 × 10−324). Interestingly, when applying a maximally significant α, features were only strongly correlated with other features in their same domain.
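The value 5 × 10−324 coincides with the smallest positive number representable as an IEEE 754 double. The sketch below illustrates that fact and a threshold check built on it; it assumes p-values were stored as double-precision floats, which this section does not state, and the helper `is_maximally_significant` is hypothetical.

```python
import sys

smallest_positive = 5e-324  # threshold quoted in the caption

# 5e-324 is the smallest positive (subnormal) double: 2**-1074.
assert smallest_positive > 0.0
assert smallest_positive == sys.float_info.min * sys.float_info.epsilon

# Halving it underflows to exactly zero.
assert smallest_positive / 2 == 0.0

def is_maximally_significant(p_value, threshold=5e-324):
    """True when a p-value is at or below the smallest positive double."""
    return p_value <= threshold

print(is_maximally_significant(0.0))      # p-value that underflowed to zero
print(is_maximally_significant(1e-323))   # next representable value above the threshold
```

With double-precision p-values, only results that are exactly zero or equal to the smallest subnormal pass this check, which is one way to read "maximally significant".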
Example of Features with Identical Results but Slightly Different Names.
| Feature | ADNIMERGE Frequency | Gene Expression Frequency | MRI Frequency | Total Frequency | Domain |
|---|---|---|---|---|---|
| ICV | 265 | 84 | 141,676 | 142,025 | ADNIMERGE |
| ICV.BL | 265 | 84 | 141,676 | 142,025 | ADNIMERGE |