Xinxing Wu, Chong Peng, Peter T Nelson, Qiang Cheng.
Abstract
Alzheimer's disease (AD) is a complex neurodegenerative disorder that affects thinking, memory, and behavior. Limbic-predominant age-related TDP-43 encephalopathy (LATE) is a recently identified, common neurodegenerative disease that mimics the clinical symptoms of AD. The development of drugs to prevent or treat these neurodegenerative diseases has been slow, partly because the genes associated with them are incompletely understood. A notable hindrance from a data analysis perspective is that the clinical samples for patients and controls are usually highly imbalanced, which makes it challenging to apply most existing machine learning algorithms directly to such datasets. Meeting this data analysis challenge is critical, because identifying disease-associated genes more specifically may yield new insights into the underlying disease-driving mechanisms, help find biomarkers and, in turn, improve prospects for effective treatment strategies. To detect disease-associated genes from imbalanced transcriptome-wide data, we proposed an integrated multiple random forests (IMRF) algorithm. IMRF is effective in differentiating putative genes associated with subjects having LATE and/or AD from controls based on transcriptome-wide data, thereby enabling effective discrimination between these samples. Several forms of validation testify to the effectiveness of IMRF in identifying genes with altered expression in LATE and/or AD: cross-domain verification of our method on other datasets; improved and competitive classification performance using the identified genes; effective performance on testing data with a classifier that is completely independent of decision trees and random forests; and relationships with prior AD and LATE studies on genes linked to neurodegeneration.
We conclude that IMRF, as an effective feature selection algorithm for imbalanced data, shows promise for facilitating the development of new gene biomarkers as well as targets for effective strategies of disease prevention and treatment.
Year: 2021 PMID: 34492068 PMCID: PMC8423259 DOI: 10.1371/journal.pone.0256648
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
LATE vs. AD.
| LATE | AD |
|---|---|
| Nelson et al., 2019 | Alzheimer, 1906 |
| Usually 80+ | Usually 65+ |
| LATE is slower than AD, but AD plus LATE will cause a more rapid decline | |
| About a quarter of AD patients actually have LATE, which mimics AD in syndrome | |
| TDP-43 | A |
| TDP-43 | Braak and CERAD |
Fig 1. Overall scheme of IMRF.
As an illustration, we show the use of IMRF on a synthetic dataset with or without tiny black points for visualization.
Fig 2. The procedure for calculation of feature importances from multiple RFs.
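One way to read the procedure in Fig 2 is as an aggregation of per-forest importance vectors into a single ranking. The sketch below assumes that the aggregation is a plain mean followed by a top-k cut. The function name `aggregate_importances`, the mean rule, and the toy numbers are illustrative assumptions, not the IMRF procedure itself.

```python
from statistics import mean

def aggregate_importances(per_forest, top_k):
    """Average importance vectors from several forests and pick the top_k features.

    per_forest: list of equal-length lists, one importance vector per forest.
    Assumption: a plain mean is used; IMRF's actual integration rule may differ.
    """
    n_features = len(per_forest[0])
    if any(len(v) != n_features for v in per_forest):
        raise ValueError("all importance vectors must have the same length")
    averaged = [mean(v[j] for v in per_forest) for j in range(n_features)]
    # Sort feature indices by averaged importance, highest first.
    ranked = sorted(range(n_features), key=lambda j: -averaged[j])
    return ranked[:top_k], averaged

# Toy importances from three hypothetical forests over four features.
toy = [
    [0.10, 0.60, 0.20, 0.10],
    [0.05, 0.55, 0.30, 0.10],
    [0.15, 0.50, 0.25, 0.10],
]
selected, averaged = aggregate_importances(toy, top_k=2)
print(selected)  # indices of the two features with the highest mean importance
```

In practice the per-forest vectors could come from forests trained on different subsets of the data; how those subsets are drawn is not specified here and is left out of the sketch.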
Fig 3. Demographics for the stratified study population of RNA array expression.
(a) Sex distribution with respect to the four classes: LATE+AD, pure LATE, pure AD, and control. The vertical axis represents the number of samples. (b) Age distribution with respect to the four classes. The vertical axis represents the age of the subjects. The horizontal axes in (a) and (b) denote the classes.
Fig 4. Supervised feature selection on MNIST and synthetic data.
(a) MNIST with the digits 1 and 9; (b) MNIST with the digits 3 and 8; (c) Four classes of noise background images with or without black points; (d) Four classes of noise background images with or without cross black points. The black point in the middle of the right side is a common black point for classes 1 and 2; (e) Using classes 1 and 2 in Table 3 in Section 3 of S1 File for classification and feature selection; (f) Using classes 1 and 2 in Table 3 in Section 3 of S1 File for classification and feature selection. In (a)-(f), the selected features are marked in red for visualization. Best viewed in color when zoomed in.
Top 5 genes identified and ranked from 35,339 genes for differentiating controls and ADs, and the related prior studies in the literature on these genes.
| Rank | Gene name | Related study |
|---|---|---|
| 1 | | [ |
| 2 | | [ |
| 3 | | [ |
| 4 | | [ |
| 5 | | [ |
Fig 5. The 31 genes selected from 48,803 genes by IMRF.
Red vertical lines with gene names represent the IMRF-identified genes.
Fig 6. Comparison of F1 scores and accuracies by SVM on the total and IMRF-selected genes.
(a) Class-wise F1 scores and overall accuracy for four-class classification; (b) Accuracy for three scenarios of binary classification.
Fig 7. Comparison of F1 scores and accuracy for three scenarios of binary classification using the total genes and using the IMRF-selected genes.
(a) LATE+AD vs. pure LATE; (b) LATE+AD vs. pure AD; (c) pure LATE vs. pure AD.
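Figs 6 and 7 report both class-wise F1 scores and overall accuracy. With imbalanced classes the two can diverge, which the following minimal sketch illustrates using the standard definitions. The labels and predictions are made-up toy values, not results from the study, and the helper names are hypothetical.

```python
def f1_per_class(y_true, y_pred):
    """Return {class: F1}, with F1 = 2*TP / (2*TP + FP + FN) for each class."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy imbalanced example: six controls and two pure AD samples.
y_true = ["control"] * 6 + ["pure AD"] * 2
y_pred = ["control"] * 5 + ["pure AD"] + ["pure AD", "control"]

scores = f1_per_class(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print(scores, acc)
```

Here the accuracy is 0.75, while the F1 score of the minority class is only 0.5, which is why class-wise F1 is informative alongside accuracy for imbalanced data.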
Genes identified by IMRF from 48,803 genes for six scenarios of pair-wise classes.
The p-values calculated by ANOVA are shown in parentheses. Genes in bold are also selected for differentiating the four classes, as shown in Table 5. The numbers of genes with p-values greater than 0.05 are 4, 6, 7, 1, 3, and 3 for LATE+AD vs. pure LATE, LATE+AD vs. pure AD, pure LATE vs. pure AD, LATE+AD vs. control, pure LATE vs. control, and pure AD vs. control, respectively.
| Class | Gene name (p-value) |
|---|---|
| LATE+AD vs. pure LATE (20) | |
| LATE+AD vs. pure AD (24) | |
| pure LATE vs. pure AD (21) | |
| LATE+AD vs. control (12) | |
| pure LATE vs. control (18) | |
| pure AD vs. control (14) | |
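The p-values above are reported as coming from ANOVA. As a reminder of what that test measures for a single gene, the sketch below computes the one-way ANOVA F statistic, the ratio of between-class to within-class mean squares. The expression values are hypothetical, and the function `anova_f` is an illustrative helper rather than the code used in the study.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups (each a list of numbers)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: group sizes times squared mean offsets.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: squared deviations from each group's own mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Hypothetical expression values of one gene in three classes.
f_stat = anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]])
print(f_stat)
```

Turning the F statistic into a p-value requires the F distribution with (k - 1, N - k) degrees of freedom; a statistics library such as SciPy (`scipy.stats.f_oneway`) returns both values directly.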
Top 31 genes identified and ranked from 48,803 genes for differentiating the four classes, their p-values by using ANOVA, and the related studies on these genes.
| Rank | Gene name | p-value | Related study |
|---|---|---|---|
| 1 | | 1.09E-6 | |
| 2 | | 9.79E-4 | AD [ |
| 3 | | 6.84E-5 | AD [ |
| 4 | | 2.47E-7 | |
| 5 | | 7.12E-1 | Neurodegenerative diseases [ |
| 6 | | 2.83E-2 | AD [ |
| 7 | | 5.37E-6 | |
| 8 | | 6.74E-1 | AD [ |
| 9 | | 2.34E-4 | Neurodegenerative diseases [ |
| 10 | | 3.72E-5 | AD [ |
| 11 | | 4.86E-6 | |
| 12 | | 3.69E-4 | |
| 13 | | 7.00E-4 | |
| 14 | | 1.78E-1 | |
| 15 | | 7.76E-5 | |
| 16 | | 8.67E-6 | |
| 17 | | 8.32E-7 | AD [ |
| 18 | | 2.82E-3 | |
| 19 | | 1.58E-3 | |
| 20 | | 1.42E-3 | AD [ |
| 21 | | 2.65E-4 | |
| 22 | | 7.21E-6 | Its mutations will cause exercise intolerance, neuropathy, and muscle weakness or developmental delay and spastic paraparesis [ |
| 23 | | 1.38E-1 | Associated to mental retardation syndromes but with unknown molecular basis [ |
| 24 | | 9.90E-1 | Cognition and memory [ |
| 25 | | 7.50E-4 | |
| 26 | | 6.00E-2 | Brain calcifications [ |
| 27 | | 2.56E-4 | Involved in maintaining genome integrity, DNA damage response, and DNA repair. Defective DNA repair may lead to neurological disorders like AD [ |
| 28 | | 5.53E-2 | |
| 29 | | 9.12E-5 | AD [ |
| 30 | | 2.57E-3 | AD [ |
| 31 | | 1.29E-2 | The |
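The p-values listed in the table above can be tallied against the conventional 0.05 threshold. The short sketch below copies the 31 values in rank order and counts how many fall on each side. The threshold name `ALPHA` is an illustrative choice.

```python
# p-values copied from the four-class table above, in rank order 1..31.
p_values = [
    1.09e-6, 9.79e-4, 6.84e-5, 2.47e-7, 7.12e-1, 2.83e-2, 5.37e-6, 6.74e-1,
    2.34e-4, 3.72e-5, 4.86e-6, 3.69e-4, 7.00e-4, 1.78e-1, 7.76e-5, 8.67e-6,
    8.32e-7, 2.82e-3, 1.58e-3, 1.42e-3, 2.65e-4, 7.21e-6, 1.38e-1, 9.90e-1,
    7.50e-4, 6.00e-2, 2.56e-4, 5.53e-2, 9.12e-5, 2.57e-3, 1.29e-2,
]
ALPHA = 0.05

# Ranks (1-based) whose p-value is at or above the threshold.
not_significant = [rank for rank, p in enumerate(p_values, start=1) if p >= ALPHA]
significant_count = len(p_values) - len(not_significant)

print(not_significant)                         # [5, 8, 14, 23, 24, 26, 28]
print(len(not_significant), significant_count)  # 7 24
```

So 7 of the 31 selected genes have p-values of at least 0.05 and 24 fall below it.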
Subject categorization rules for RNA expression data.
Here, 〚⋅〛 denotes the grade corresponding to the specific metric.
| Rule | Class |
|---|---|
| 〚Braak〛 ⩾ 5, 〚CERAD〛 ⩽ 2, and 〚TDP-43〛 = 1 | LATE+AD |
| (〚Braak〛 < 5 or 〚CERAD〛 > 2) and 〚TDP-43〛 = 1 | pure LATE |
| 〚Braak〛 ⩾ 5, 〚CERAD〛 ⩽ 2, and 〚TDP-43〛 = 0 | pure AD |
| (〚Braak〛 < 5 or 〚CERAD〛 > 2) and 〚TDP-43〛 = 0 | control |
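The four rules in the table can be written as a small function. The sketch below assumes that, in the two rules containing "or", the "or" is evaluated before the "and"; this is the only reading under which the four rules partition all subjects without overlap. The function name `categorize` and its argument names are hypothetical.

```python
def categorize(braak, cerad, tdp43):
    """Map neuropathology grades to one of the four classes in the rule table.

    Assumption: the 'or' rules group as (Braak < 5 or CERAD > 2) and TDP-43 = value.
    """
    if tdp43 not in (0, 1):
        raise ValueError("TDP-43 grade must be 0 or 1")
    # AD-type pathology: Braak >= 5 and CERAD <= 2. Its negation is
    # Braak < 5 or CERAD > 2, which covers the remaining two rules.
    ad_pathology = braak >= 5 and cerad <= 2
    if ad_pathology:
        return "LATE+AD" if tdp43 == 1 else "pure AD"
    return "pure LATE" if tdp43 == 1 else "control"

print(categorize(braak=6, cerad=1, tdp43=1))  # LATE+AD
print(categorize(braak=3, cerad=1, tdp43=1))  # pure LATE
print(categorize(braak=5, cerad=2, tdp43=0))  # pure AD
print(categorize(braak=6, cerad=3, tdp43=0))  # control
```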
Fig 8. SVM classification performance (F1 score) using all genes and using the genes selected by different RF-based algorithms.
Fig 9. The ratios of genes with p-value ⩾ 0.05 vs. p-value < 0.05 among the 31 genes selected by different algorithms.
Fig 10. SVM classification performance (F1 score) on all genes and on the genes selected by different feature selection algorithms.
Without (a) or with (b) using SMOTE as a preprocessing procedure to counteract the class imbalance.
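SMOTE counteracts class imbalance by synthesizing new minority-class samples through interpolation between a minority sample and one of its nearest minority neighbors. The following is a simplified, dependency-free sketch of that idea, not the implementation used for Fig 10. The function name `smote`, its parameters, and the toy points are illustrative assumptions.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by neighbor interpolation.

    minority: list of feature vectors (lists of floats) from the minority class.
    Each synthetic point lies on the segment between a sampled point and one of
    its k nearest minority neighbors (squared Euclidean distance).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by distance to the base point.
        others = [m for j, m in enumerate(minority) if j != i]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbor)])
    return synthetic

# Toy minority class with four two-dimensional samples.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote(minority, n_new=6)
print(len(new_points))  # 6
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay inside the region already occupied by the minority class.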
Fig 11. Schematic representation of the p-values of the IMRF-selected genes for four classes and six pair-wise classes.