Hamel Patel, Raquel Iniesta, Daniel Stahl, Richard J B Dobson, Stephen J Newhouse.
Abstract
BACKGROUND: The typical approach to identifying blood-derived gene expression signatures as biomarkers for Alzheimer's disease (AD) has relied on training classification models using AD and healthy controls only. This may inadvertently identify markers of general illness rather than markers specific to AD.
Keywords: Age-related memory disorders; Alzheimer’s disease; biomarkers; dementia; gene expression; human; machine learning; microarray analysis; neurodegenerative disorders
Year: 2020 PMID: 32065794 PMCID: PMC7175937 DOI: 10.3233/JAD-191163
Source DB: PubMed Journal: J Alzheimers Dis ISSN: 1387-2877 Impact factor: 4.472
Fig. 1. Overview of study design. Two types of XGBoost classification models were developed, optimized, and evaluated. The first (“AD vs Healthy Control”) used the typical approach, training on Alzheimer’s disease (AD) and cognitively healthy controls (HC), while the second (“AD vs Mixed Control”) was trained on AD and a mixed control group. The mixed control group consisted of Parkinson’s disease (PD), multiple sclerosis (MS), amyotrophic lateral sclerosis (ALS), bipolar disorder (BD), schizophrenia (SCZ), coronary artery disease (CD), rheumatoid arthritis (RA), chronic obstructive pulmonary disease (not represented in the figure), and cognitively healthy subjects. The individual groups within the mixed controls were upsampled with replacement to avoid sampling biases during model development. To account for the randomness, a thousand “AD vs Healthy Control” and a thousand “AD vs Mixed Control” classification models were developed and evaluated. cv, cross-validation; RFE, recursive feature elimination.
Dataset demographics
| Disorder | Study ID (associated publication) | Platform | BeadArray | Tissue source | No. probes (before QC) | Case sex, M/F (before QC) | Control sex, M/F (before QC) | No. samples (before QC) | No. gender mismatches (removed during QC) | No. outlying samples (removed during QC) | No. probes (after QC) | Case sex, M/F (after QC) | Control sex, M/F (after QC) | No. samples (after QC) | Training and testing set assignment |
| Alzheimer’s Disease | GSE63060 ([ | I | HT-12 v3.0 | WB | 38323 | 46/99 | 42/62 | 249 | 2 | 10 | 5364 | 45/93 | 40/59 | 237 | Training |
| | GSE63061 ([ | I | HT-12 v4.0 | WB | 32049 | 51/81 | 55/87 | 274 | 5 | 4 | 5241 | 48/79 | 54/84 | 265 | Testing |
| | E-GEOD-6613 ([ | A | HG U133A | WB | 22283 | 8/15 | 11/11 | 45 | 0 | 1 | 4184 | 8/14 | 11/11 | 44 | Training |
| Parkinson’s Disease | E-GEOD-6613 ([ | A | HG U133A | WB | 22283 | 38/12 | 0/0 | 50 | 0 | 0 | 3674 | 38/12 | 0/0 | 50 | Training |
| | E-GEOD-72267 ([ | A | HG U133A 2.0 | PBMC | 22277 | 23/17 | 8/11 | 59 | 0 | 0 | 8742 | 23/17 | 8/11 | 59 | Testing |
| Multiple Sclerosis | GSE24427 ([ | A | HG U133A | WB | 22283 | 9/16 | 0/0 | 25 | 0 | 0 | 6633 | 9/16 | 0/0 | 25 | Testing |
| | E-GEOD-16214 ([ | A | HG U133 plus 2.0 | PBMC | 54675 | 11/71 | 0/0 | 82 | 0 | 3 | 8098 | 11/68 | 0/0 | 79 | Training |
| | E-GEOD-41890 ([ | A | Exon 1.0 ST | PBMC | 33297 | 20/24 | 12/12 | 68 | 0 | 1 | 8157 | 19/24 | 12/12 | 67 | Training |
| Schizophrenia | GSE38484 ([ | I | HT-12 v3.0 | WB | 48743 | 76/30 | 42/54 | 202 | 9 | 5 | 6700 | 69/28 | 39/52 | 188 | Training |
| | E-GEOD-27383 ([ | A | HG U133 plus 2.0 | WB | 54675 | 43/0 | 29/0 | 72 | 0 | 1 | 11297 | 42/0 | 29/0 | 71 | Testing |
| | GSE38481 ([ | I | Human-6 v3 | WB | 24526 | 4/11 | 16/6 | 37 | 2 | 1 | 8106 | 11/3 | 15/5 | 34 | Testing |
| Bipolar Disorder | E-GEOD-46449 ([ | A | HG U133 plus 2.0 | L | 54675 | 28/0 | 25/0 | 53 | 0 | 0 | 9882 | 28/0 | 25/0 | 53 | Training |
| | GSE23848 ([ | I | Human-6 v2 | WB | 48701 | 6/14 | 5/10 | 35 | 0 | 0 | 7211 | 6/14 | 5/10 | 35 | Testing |
| Cardiovascular Disease | E-GEOD-46097 ([ | A | HG U133A 2.0 | PBMC | 22277 | 102/36 | 60/180 | 378 | 0 | 24 | 7676 | 94/36 | 57/167 | 354 | Training |
| | GSE59867 ([ | A | Exon 1.0 ST | WB | 33297 | 85/26 | 0/0 | 111 | 0 | 3 | 7936 | 82/26 | 0/0 | 108 | Testing |
| | E-GEOD-12288 ([ | A | HG U133A | WB | 22283 | 88/22 | 84/28 | 222 | 0 | 8 | 4815 | 83/22 | 82/27 | 214 | Training |
| Rheumatoid Arthritis | E-GEOD-74143 ([ | A | HT HG U133 plus | WB | 54715 | 81/296 | 0/0 | 377 | 1 | 23 | 8112 | 80/273 | 0/0 | 353 | Training |
| | E-GEOD-54629 ([ | A | Exon 1.0 ST | WB | 33297 | 11/58 | 0/0 | 69 | 0 | 0 | 11931 | 11/58 | 0/0 | 69 | Testing |
| | E-GEOD-42296 ([ | A | Exon 1.0 ST | PBMC | 33297 | 4/15 | 0/0 | 19 | 0 | 0 | 10417 | 4/15 | 0/0 | 19 | Testing |
| Chronic Obstructive Pulmonary Disease | E-GEOD-54837 ([ | A | HG U133 plus 2.0 | WB | 54675 | 91/45 | 57/33 | 226 | 0 | 16 | 5531 | 83/44 | 52/31 | 210 | Training |
| | E-GEOD-42057 ([ | A | HG U133 plus 2.0 | WB | 54675 | 52/42 | 22/20 | 136 | 3 | 4 | 6445 | 49/39 | 21/20 | 129 | Testing |
| ALS | E-TABM-940 | A | HG U133 plus 2.0 | WB | 54675 | 27/26 | 18/19 | 90 | 3 | 10 | 10442 | 27/25 | 15/10 | 77 | Training |
| Total | | | | | | 904/956 | 486/533 | 2879 | 25 | 114 | | 870/906 | 465/49 | 2740 | |
Each study is accompanied by its corresponding publication (if available), where the individual study design can be found. When possible, datasets were obtained in their raw format. The exceptions were GSE63060, GSE63061, E-GEOD-41890, GSE23848, E-GEOD-74143, E-GEOD-54629, and E-GEOD-42296, which were only available in a processed form that had already been background corrected, log2 transformed, and normalized by the techniques stated in the corresponding publications. Where multiple datasets existed for the same disease, the dataset with the largest number of diseased subjects was prioritized into the training set for better discovery. Study IDs beginning with “GSE” and “E-GEOD” were obtained from GEO and ArrayExpress, respectively. I, Illumina; A, Affymetrix; WB, whole blood; PBMC, peripheral blood mononuclear cell; L, lymphocytes.
Fig. 2. Distribution of gene expression across all 2,740 subjects in this study. Plots a) and c) are boxplots, where each vertical line represents an individual, while plots b) and d) represent the expression density of the same 2,740 subjects, where each line represents a different individual. Plots a) and b) show the variation in gene expression across subjects prior to YuGene transformation, providing evidence of batch effects between samples and datasets. In contrast, plots c) and d) reveal a more evenly distributed gene expression profile across all 2,740 subjects after extracting the 1,681 common “reliably detected” genes and independently YuGene transforming each sample.
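The per-sample rescaling mentioned in this caption can be illustrated with a minimal Python sketch. It assumes that YuGene acts as a cumulative-proportion transform: values are sorted from highest to lowest, accumulated, and each gene receives one minus its cumulative share of the sample total. The function name `cumulative_proportion_transform` is hypothetical, ties are broken by input order for simplicity, and this is not the authors' pipeline or the YuGene package itself.

```python
def cumulative_proportion_transform(expression):
    """Rescale one sample's non-negative expression values to the range [0, 1).

    Simplified sketch of a cumulative-proportion transform (an assumption
    about how YuGene works, not a reimplementation of the package).
    """
    values = [float(v) for v in expression]
    if any(v < 0 for v in values):
        raise ValueError("expression values must be non-negative")
    total = sum(values)
    if total == 0:
        raise ValueError("sample has no signal to rescale")

    # Gene indices ordered from highest to lowest expression.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)

    scaled = [0.0] * len(values)
    running = 0.0
    for i in order:
        running += values[i]
        # One minus the cumulative share of the sample total.
        scaled[i] = 1.0 - running / total
    return scaled


# Two samples measuring the same profile on different scales
# end up with the same transformed values.
sample_a = cumulative_proportion_transform([1, 3, 6])
sample_b = cumulative_proportion_transform([2, 6, 12])
```

Because each sample is transformed using only its own values, no reference dataset is needed, which is what allows samples from different platforms to be rescaled independently.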
Overview of training and testing set subjects
| Dataset | Training set: AD vs Healthy Control | Training set: AD vs Mixed Control | Testing set | Class assignment for XGBoost |
| Alzheimer’s Disease | 160* | 160* | 127 | 0 |
| Parkinson’s Disease | 0 | 702 (50) | 40 | 1 |
| Multiple Sclerosis | 0 | 702 (122*) | 25 | 1 |
| Schizophrenia | 0 | 702 (97*) | 56* | 1 |
| Bipolar Disorder | 0 | 702 (28) | 20 | 1 |
| Cardiovascular Disease | 0 | 702 (235*) | 108 | 1 |
| Rheumatoid Arthritis | 0 | 702 (353) | 88* | 1 |
| Chronic Obstructive Pulmonary Disease | 0 | 702 (127) | 88 | 1 |
| ALS | 0 | 702 (52) | 0 | 1 |
| Pooled Controls | 127* | 702* | 262 | 1 |
Entire datasets from each disease were assigned to either the “Training Set” for classification model development or the “Testing Set” for validation purposes. Datasets with the larger number of diseased subjects were prioritized into the training set to increase discovery. Two types of classification models were developed. The first (“AD vs Healthy Control”) was developed using only the 160 AD and associated 127 healthy control samples. The second (“AD vs Mixed Control”) was developed using the same 160 AD samples and 6,318 upsampled mixed controls. The pooled controls in the “AD vs Healthy Control” training set originate only from AD datasets. Sample numbers provided in brackets are before upsampling. Sample numbers with an asterisk (*) indicate that multiple datasets were available, and the subject numbers shown are a sum across these datasets.
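The upsampling arithmetic behind the 6,318 mixed controls can be sketched in Python as follows. The group sizes are the pre-upsampling counts from the table. The sketch assumes that every group is raised to the size of the largest group (702, the pooled controls) by keeping the original samples and drawing the remainder with replacement. The function name, seed, and sample labels are hypothetical, and the authors' exact resampling procedure may differ.

```python
import random


def upsample_to_largest(groups, seed=0):
    """Upsample every group (with replacement) to the size of the largest one."""
    rng = random.Random(seed)
    target = max(len(samples) for samples in groups.values())
    upsampled = {}
    for name, samples in groups.items():
        # Keep the originals and draw the shortfall with replacement.
        extra = rng.choices(samples, k=target - len(samples))
        upsampled[name] = list(samples) + extra
    return upsampled


# Pre-upsampling training counts taken from the table above.
sizes = {
    "PD": 50, "MS": 122, "SCZ": 97, "BD": 28, "CVD": 235,
    "RA": 353, "COPD": 127, "ALS": 52, "Controls": 702,
}
# Placeholder sample identifiers, for illustration only.
groups = {name: [f"{name}_{i}" for i in range(n)] for name, n in sizes.items()}

balanced = upsample_to_largest(groups)
total_mixed_controls = sum(len(v) for v in balanced.values())  # 9 groups x 702 = 6318
```

Balancing the groups this way keeps a large group such as rheumatoid arthritis from dominating the mixed control class, at the cost of repeating samples from small groups such as bipolar disorder many times.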
Classification model performance
| Metric | AD vs Healthy Control | AD vs Mixed Control |
| Sensitivity | 48.7% (34.7–64.6) | 40.8% (27.5–52.0) |
| Specificity | 41.9% (26.8–54.3) | 95.22% (93.3–97.1) |
| PPV | 13.6% (9.9–18.5) | 61.35% (53.8–69.6) |
| NPV | 81.1% (73.3–87.7) | 89.7% (87.8–91.4) |
| Balanced Accuracy | 45.3% (36.0–56.0) | 67.99% (61.9–72.9) |
| AUC | 0.45 (0.34–0.60) | 0.86 (0.82–0.90) |
| AUC Rating | Test not useful | Very Good |
| CUI+ve | 0.07 (0.04–0.12) | 0.25 (0.16–0.32) |
| CUI+ve Rating | Poor | Poor |
| CUI –ve | 0.34 (0.2–0.46) | 0.85 (0.84–0.87) |
| CUI –ve Rating | Poor | Excellent |
The table provides the average performance measurements from validating a thousand “AD vs Healthy Control” and a thousand “AD vs Mixed Control” classification models on the same testing set. A Student's t-test between the “AD vs Healthy Control” and “AD vs Mixed Control” classification performances reveals a significant difference for all metrics (p < 2.20e–16). The values provided in brackets are the 95% confidence intervals.
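The metrics in this table follow from a standard two-by-two confusion matrix, sketched below in Python with AD treated as the positive class. The clinical utility index is taken here in its usual form, CUI+ve = sensitivity × PPV and CUI–ve = specificity × NPV. The function name and the example counts are hypothetical and are not the study's data.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute diagnostic metrics from confusion-matrix counts.

    tp/fn: AD samples predicted as AD / as non-AD.
    tn/fp: non-AD samples predicted as non-AD / as AD.
    Assumes every denominator is non-zero.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "cui_positive": sensitivity * ppv,  # utility for ruling a diagnosis in
        "cui_negative": specificity * npv,  # utility for ruling a diagnosis out
    }


# Made-up counts, for illustration only.
example = diagnostic_metrics(tp=40, fp=10, tn=90, fn=60)
```

The reported averages are consistent with these definitions: for “AD vs Mixed Control”, 0.408 × 0.6135 ≈ 0.25 (CUI+ve), 0.9522 × 0.897 ≈ 0.85 (CUI–ve), and (40.8% + 95.22%) / 2 ≈ 68% (balanced accuracy).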
Fig. 3. Testing set raw prediction comparison by (a) the thousand “AD vs Healthy Control” classification models and (b) the thousand “AD vs Mixed Control” classification models. Samples with a probability of ≤0.5 are predicted to be AD. Controls represent pooled non-diseased subjects from all datasets. AD, Alzheimer’s disease; BD, bipolar disorder; CD, coronary artery disease; COPD, chronic obstructive pulmonary disease; MS, multiple sclerosis; PD, Parkinson’s disease; RA, rheumatoid arthritis; SCZ, schizophrenia.
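The decision rule in this caption can be written as a short Python sketch. It assumes the plotted probability is the model's probability for class 1 (non-AD), which would explain why values of 0.5 or below map to AD, given that AD is assigned class 0 for XGBoost. The function name and the example inputs are hypothetical.

```python
from collections import defaultdict


def fraction_predicted_ad(probabilities, groups, threshold=0.5):
    """Return, per group, the fraction of samples predicted to be AD.

    probabilities: assumed probability of class 1 (non-AD) for each sample.
    groups: the true group label of each sample (for example "AD" or "PD").
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [predicted AD, total]
    for prob, group in zip(probabilities, groups):
        counts[group][1] += 1
        if prob <= threshold:  # at or below the threshold means predicted AD
            counts[group][0] += 1
    return {group: n_ad / n for group, (n_ad, n) in counts.items()}


# Made-up probabilities, for illustration only.
fractions = fraction_predicted_ad(
    [0.2, 0.5, 0.9, 0.7, 0.4, 0.6],
    ["AD", "AD", "PD", "PD", "RA", "RA"],
)
```

Under this reading, the fraction for the AD group corresponds to sensitivity, while one minus the fraction for any other group is that group's contribution to specificity.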