Chang Su1, Jie Tong2, Fei Wang1.
Abstract
High-throughput techniques have generated abundant genetic and transcriptomic data from Parkinson's disease (PD) patients, but data analysis approaches such as traditional statistical methods have provided little insightful integrated analysis or interpretation of these data. Machine learning, an advanced computational approach that can identify complex patterns and insights in data, has consequently been harnessed to analyze and interpret large, highly complex genetic and transcriptomic data toward a better understanding of PD. In particular, machine learning models have been developed to integrate patient genotype data, alone or combined with demographic, clinical, neuroimaging, and other information, for PD outcome studies. They have also been used to identify biomarkers of PD based on transcriptomic data, e.g., gene expression profiles from microarrays. This study reviews the relevant literature on using machine learning models for genetic and transcriptomic data analysis in PD, points out remaining challenges, and suggests future directions accordingly. The use of machine learning is amplifying what PD genetic and transcriptomic data can contribute and is accelerating the study of PD. Existing studies have demonstrated the great potential of machine learning in discovering hidden patterns within genetic or transcriptomic information and thus revealing clues underpinning pathology and pathogenesis. Moving forward, by addressing the remaining challenges, machine learning may advance our ability to precisely diagnose, prognose, and treat PD.
Keywords: Parkinson's disease; Translational research
Year: 2020 PMID: 32964109 PMCID: PMC7481248 DOI: 10.1038/s41531-020-00127-w
Source DB: PubMed Journal: NPJ Parkinsons Dis ISSN: 2373-8057
Parkinson’s disease repositories with genetic data.
| Repository | Participant types | Genetic data screened | Other type of data |
|---|---|---|---|
| HBS (USA, Canada) | PD and HC | Targeted sequencing or Asn370Ser, Glu326Lys, Thr369Met genotyping | Motor and nonmotor assessments, biospecimen data, neuroimaging data |
| DIGPD (France) | PD and HC | Sanger sequencing | Motor and nonmotor assessments |
| CamPaIGN (UK) | PD and subjects diagnosed with other causes of parkinsonism/tremor | Sanger sequencing; SNP genotype | Motor and nonmotor assessments |
| PROPARK (Netherlands) | PD | Targeted sequencing or whole exome sequencing | Motor and nonmotor assessments |
| LABS-PD (USA, Canada) | PD and HC | Targeted sequencing | Motor and nonmotor assessments, biomarkers, imaging data |
| PICNICS (UK) | PD | Sanger sequencing | Motor and nonmotor assessments |
| DATATOP (USA, Canada) | PD | Targeted sequencing | Motor and nonmotor assessments |
| PDBP | PD and HC | NeuroX genotyping | Motor and nonmotor assessments |
| Penn-Udall (USA) | PD | Targeted sequencing | Motor and nonmotor assessments |
| PPMI (USA, Europe) | PD, SWEDD, and HC | Whole exome sequence | Motor and nonmotor assessments, CSF biomarkers, neuroimaging data |
| BioFIND (USA) | PD and HC | Whole genomic sequence of the GBA1 gene | Motor and nonmotor assessments, biospecimen data |
| IPDGC (worldwide) | PD and HC | NeuroX genotyping | Not specified |
CamPaIGN Cambridgeshire Parkinson’s incidence from General Practitioner to Neurologist, DIGPD Drug Interaction with Genes in Parkinson’s Disease, EMBL-EBI The European Bioinformatics Institute, HBS Harvard Biomarkers Study, IPDGC International Parkinson’s Disease Genomics Consortium, LABS-PD Longitudinal and Biomarker Study in Parkinson’s disease, NCBI The National Center for Biotechnology Information, PD Parkinson’s disease, PDBP Parkinson’s Disease Biomarkers Program, Penn-Udall Morris K Udall Parkinson’s Disease Research Center of Excellence cohort, PICNICS Parkinsonism: Incidence, Cognition and Non–motor heterogeneity in Cambridgeshire, PPMI Parkinson’s Progression Marker Initiative, PreCEPT Parkinson Research Examination of CEP–1347 Trial, PROPARK PROFIling PARKinson’s disease, SWEDD scans without evidence of dopaminergic deficit.
Parkinson’s disease repositories with transcriptomic data.
| Repository | Description | URL |
|---|---|---|
| GEO | A public functional genomics data repository, provided by NCBI. | |
| ArrayExpress | A public database that stores data from high-throughput functional genomics experiments, provided by EMBL-EBI. | |
| ParkDB | A complete set of reanalyzed, curated and annotated microarray datasets of Parkinson’s disease. | |
GEO gene expression omnibus.
Fig. 1 Illustrations of machine learning.
a An example of supervised learning. A supervised learning model takes as input the feature vectors of the subjects together with their “true” labels (a.k.a. supervision information), and involves the following steps: feature selection (optional), model training on the training set, model evaluation on the testing set, and model deployment for predicting labels of new data. b An example of unsupervised learning. An unsupervised learning model takes as input only the feature vectors of the subjects, without any supervision information, and groups the subjects into homogeneous groups (a.k.a. clusters). c Illustration of K-fold cross-validation. The data are split into K folds; each fold in turn serves as the testing set while the remaining K-1 folds together serve as the training set. d Illustration of underfitting and overfitting. Underfitting occurs when the model fails to capture the patterns in the data, while overfitting occurs when the model fits the details and noise of the training data so closely that it predicts new data poorly.
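The K-fold procedure in panel c can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the reviewed studies: the helper names (`k_fold_indices`, `cross_validate`), the one-feature threshold classifier, and the toy data are all assumptions made for the example.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for K-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        # The remaining K-1 folds together form the training set.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def fit_threshold(xs, ys):
    """Toy model (hypothetical): midpoint between the two class means."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2

def predict_threshold(threshold, x):
    """Predict label 1 when the feature exceeds the learned threshold."""
    return int(x > threshold)

def cross_validate(xs, ys, k=5, seed=0):
    """Return the mean test accuracy over the K folds."""
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(ys), k, seed):
        model = fit_threshold([xs[i] for i in train_idx],
                              [ys[i] for i in train_idx])
        correct = sum(predict_threshold(model, xs[i]) == ys[i]
                      for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k

# Toy, well-separated data: 20 "controls" (label 0) and 20 "cases" (label 1).
xs = [i * 0.1 for i in range(20)] + [5 + i * 0.1 for i in range(20)]
ys = [0] * 20 + [1] * 20
accuracy = cross_validate(xs, ys, k=5)
print(accuracy)
```

Every sample is used for testing exactly once, so the averaged accuracy estimates performance on unseen data. A large gap between training and test accuracy would point to the overfitting shown in panel d.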
Studies using machine learning for PD outcome analysis.
| Study | Task | Discovery cohorts | Validation cohorts | Genetic clues/features | Other features | Model |
|---|---|---|---|---|---|---|
| Nalls et al. 2015 | PD diagnosis | PPMI: 367 PDs, 165 HCs and 55 SWEDD subjects | PDBP: 453 PDs and 156 HCs; PARS: 15 PDs, 85 HCs and 146 at risk; 23andMe: 20 PDs and 20 HCs; LABS-PD: 239 PDs and 13 SWEDD subjects; Penn-Udall: 98 PDs. | GRS from 30 genetic risk factors (28 common risk loci[ | Olfactory function, self-reported family history of PD, age, sex | Logistic regression |
| Dinov et al. 2016 | PD diagnosis | PPMI: 263 PDs, 40 SWEDD subjects and 127 HCs | None | Not specified. | Clinical data, demographics and derived neuroimaging biomarker data. | A series of typical machine learning methods, such as AdaBoost[ |
| Kraemmer et al. 2016 | ICD prediction | PPMI: 276 PDs (86% started DRT, 40% DA, 19% reported incident ICD behavior during follow-up in the study) | None | Genotype of 13 genes: DRD2, DRD3, DAT1, COMT, DDC, GRIN2B, ADRA2C, SERT, TPH2, HTR2A, OPRK1, and OPRM1. | Age, sex, PD treatment (no treatment, DA treatment, other DRT), and duration of follow-up. | Logistic regression |
| Latourelle et al. 2017 | Motor progression prediction | PPMI: 312 PDs and 117 HCs | LABS-PD: 317 PDs | 53 a priori selected PD-related SNPs, 17403 SNPs by LD pruning and 10 genetic principal components. | 7 CSF protein biomarkers, 8 DaTscan imaging variables and 18 clinical and demographic variables. | Ensemble model based on Bayesian platform |
| Liu et al. 2017 | GCI prediction in PD | HBS: 556 PDs; PDBP: 499 PDs; CamPaIGN: 114 PDs; PICNICS: 129 PDs; PROPARK: 327 PDs; DIGPD: 409 PDs. | DATATOP: 437 PDs; PreCEPT: 332 PDs; PPMI: 396 PDs | GBA mutation status | Age at onset, sex, years of education at baseline, baseline MMSE, MDS-UPDRS II and III scores, Hoehn and Yahr stage, and baseline depression status | Multivariable Cox regression model |
| Tropea et al. 2018 | Cognitive decline prediction | 100 PDs | None | APOE, COMT, MAPT variants and GBA mutations | Biomarkers from clinical, biochemical (CSF), and MRI-based imaging modalities | Multivariate linear mixed-effects model |
| Nalls et al. 2019 | PD prediction | IPDGC: 5,851 PDs and 5,866 HCs. | HBS: 527 PDs and 472 HCs | GRS from 1805 variants. | None | Not specified |
| Fereshtehnejad et al. 2017 | PD subtyping | PPMI: 421 PDs | None | GRS from 30 genetic risk factors (28 common risk loci[ | Demographics, motor manifestations, neuropsychological testing, and other nonmotor manifestations | Unsupervised learning model |
APOE Apolipoprotein E, CamPaIGN Cambridgeshire Parkinson’s incidence from General Practitioner to Neurologist, COMT Catechol-O-methyltransferase, CSF cerebrospinal fluid, DA dopamine agonists, DATATOP deprenyl and tocopherol antioxidative therapy of parkinsonism, DIGPD Drug Interaction with Genes in Parkinson’s Disease, DRT Dopamine replacement therapy, GBA β-glucocerebrosidase, GCI Global cognitive impairment, GRS Genetic risk score, HBS Harvard Biomarkers Study, HC Healthy control, ICD Impulse control disorder, IPDGC International Parkinson’s Disease Genomics Consortium, KNN K-nearest neighbor classification, LABS-PD Longitudinal and Biomarker Study in PD, LD linkage disequilibrium, MAPT microtubule-associated protein-tau, MDS-UPDRS Movement Disorders Society-Unified Parkinson’s Disease Rating Scale, MMSE Mini Mental State Examination, PARS Parkinson’s Associated Risk Study, PD Parkinson’s disease, PDBP Parkinson’s Disease Biomarkers Program, PICNICS Parkinsonism: Incidence, Cognition and Non–motor heterogeneity in Cambridgeshire, PPMI Parkinson’s Progression Marker Initiative, PROPARK PROFIling PARKinson’s disease, SWEDD scans without evidence of dopaminergic deficit.
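Several studies in the table above feed a genetic risk score (GRS) into a logistic regression model. The sketch below assumes the common definition of a GRS as a weighted sum of risk-allele counts; the cited studies may construct theirs differently. The weights, coefficient, and intercept are hypothetical values chosen for illustration, not estimates from any cohort.

```python
import math

def genetic_risk_score(allele_counts, weights):
    """Weighted sum of risk-allele counts (0, 1, or 2 copies per variant)."""
    if len(allele_counts) != len(weights):
        raise ValueError("one weight is required per variant")
    return sum(count * weight for count, weight in zip(allele_counts, weights))

def logistic_probability(features, coefficients, intercept):
    """Logistic regression prediction: sigmoid of a linear combination."""
    z = intercept + sum(x * b for x, b in zip(features, coefficients))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical per-variant weights (e.g., log odds ratios from a GWAS).
weights = [0.2, 0.5, 0.1]

# Two illustrative genotypes: no risk alleles vs. two copies at every variant.
low_grs = genetic_risk_score([0, 0, 0], weights)
high_grs = genetic_risk_score([2, 2, 2], weights)

# Hypothetical fitted model with the GRS as its only feature.
coefficients, intercept = [1.5], -1.0
low_risk = logistic_probability([low_grs], coefficients, intercept)
high_risk = logistic_probability([high_grs], coefficients, intercept)
print(low_risk, high_risk)
```

Collapsing many variants into a single score keeps the model small, which is one reason a GRS can be combined with a handful of clinical covariates such as age, sex, and olfactory function in a plain logistic regression.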
Fig. 2 Machine learning in PD genetic and transcriptomic data analysis.
a Applying machine learning to genetic data (usually combined with other features such as demographics, clinical assessments, and neuroimaging features) for PD outcome study. b Applying machine learning to transcriptomic data (e.g., microarray data) for PD biomarker identification.
Studies based on machine learning for PD biomarker identification.
| Study | Genetic data | Participants | Model | Validation | Biomarkers |
|---|---|---|---|---|---|
| Scherzer et al. 2007 | Microarrays from whole blood samples | 50 PDs and 55 HCs | Supervised classification model | Validation sample set | 8 genes: VDR, HIP2, CLTB, FPRL2, CA12, CEACAM4, ACRV1, and UTX |
| Molochnikov et al. 2012 | Microarrays from blood samples | 38 de novo PDs, 24 early stage PDs (within 1st-year medication), 30 advanced PDs, 29 ADs and 64 HCs | Stepwise multivariate logistic regression model | Validation sample set | 5 genes: HIP2, ALDH1A1, PSMC4, HSPA8 and EGLN1 |
| Potashkin et al. 2012 | Splice variant-specific microarrays | 51 PDs, 34 APD (17 MSA and 17 PSP), 39 HCs | KNN | qPCR | 13 genes: C5ORF4, WLS, MACF1, PRG3, EFTUD2, PKM2, SLC14A1-S, SLC14A1-L, MPP1, COPZ1, ZNF160, MAP4K1 and ZNF134 |
| Santiago et al. 2013 | Gene expression data from blood samples | 50 PDs and 46 HCs | Stepwise multivariate LDA | None | 7 out of 13 genes in previous study[ |
| Karlsson et al. 2013 | Microarrays from blood samples | 79 PDs and 75 HCs | CPLS | None | 6 genes: LRPPRC, BCL2, SRSF8, HSPA8, UBE2K, EGLN1 |
| Calligaris et al. 2015 | Microarrays from blood samples | 52 PDs and 32 HCs | PLS-DA and LDA (a Bayesian classification method) | RT-qPCR | 54 genes |
| Shamir et al. 2017 | Microarrays from blood samples | 205 PDs, 233 HCs and 48 other neurodegenerative diseases | SVM | None | A gene signature of 87 genes (64 upregulated and 23 downregulated genes between PD and HC) |
AD Alzheimer’s disease, APD Atypical parkinsonian disorders, CPLS canonical partial least squares, HC Healthy control, KNN K-nearest neighbor classification, LDA linear discriminant analysis (a Bayesian classification method), MSA multiple system atrophy, PD Parkinson’s disease, PLS-DA partial least squares discriminant analysis, PSP progressive supranuclear palsy, qPCR quantitative polymerase chain reaction, SN substantia nigra, SVM support vector machine.
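The studies above share a broad pattern: rank genes by how well their expression separates PD from controls, keep a small signature, and classify on it. The sketch below illustrates that pattern with Welch's t statistic for ranking and a K-nearest-neighbor vote for classification. It is an assumption-laden toy, not the pipeline of any listed study: the gene names `GENE_A` and `GENE_B`, the expression values, and the helper functions are all hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic; assumes each group has nonzero variance."""
    standard_error = math.sqrt(variance(group_a) / len(group_a)
                               + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / standard_error

def rank_genes(expression, labels):
    """Order genes by |t| between cases (label 1) and controls (label 0)."""
    scores = {}
    for gene, values in expression.items():
        cases = [v for v, y in zip(values, labels) if y == 1]
        controls = [v for v, y in zip(values, labels) if y == 0]
        scores[gene] = abs(welch_t(cases, controls))
    return sorted(scores, key=scores.get, reverse=True)

def knn_predict(train_rows, train_labels, query, k=3):
    """Majority vote among the k training samples closest to the query."""
    neighbors = sorted((math.dist(row, query), label)
                       for row, label in zip(train_rows, train_labels))
    votes = [label for _, label in neighbors[:k]]
    return max(set(votes), key=votes.count)

# Hypothetical expression values for 4 cases followed by 4 controls.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
expression = {
    "GENE_A": [5.1, 4.9, 5.3, 5.0, 2.0, 2.2, 1.9, 2.1],  # differs by group
    "GENE_B": [3.0, 3.2, 2.9, 3.1, 3.1, 2.9, 3.2, 3.0],  # no group difference
}

ranked = rank_genes(expression, labels)
signature = ranked[:1]  # keep only the top-ranked gene as the "signature"

# Build one feature row per sample from the signature genes.
train_rows = [[expression[g][i] for g in signature] for i in range(len(labels))]
print(ranked, knn_predict(train_rows, labels, [4.8]))
```

In a real analysis the gene ranking should be repeated inside each training fold of a cross-validation; selecting genes on the full data set before evaluation leaks information from the test samples and inflates accuracy.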
Summary points of challenges and potential future directions to address them.
| Challenges | Potential future directions |
|---|---|
| Bias of sample size | Integrated multiple cohort modeling. |
| Handling whole spectrum genetic information | Engaging appropriate feature engineering tools such as genetic principal component analysis; Incorporating appropriate deep learning models such as autoencoders. |
| Multifactorial modeling | Multivariate modeling; Incorporating kernel approaches and probability models. |
| Cohort diversity | Validation on an external cohort; Training models on data from multiple populations if possible; Engaging transfer learning. |
| Model interpretation | Using interpretable models such as Bayesian, rule-based (e.g., decision tree and random forest), and logistic regression models; Incorporating or developing model interpretation methods for “black box” models, e.g., deep learning models. |
| Model evaluation | Evaluation using an independent validation data set; Applying experimental test evaluation; Developing visualization tools for model evaluation. |
| Interdisciplinary issue | Deep interdisciplinary collaboration; Incorporating domain knowledge in model training. |