Mary B Makarious, Hampton L Leonard, Dan Vitale, Hirotaka Iwaki, Lana Sargent, Anant Dadu, Ivo Violich, Elizabeth Hutchins, David Saffo, Sara Bandres-Ciga, Jonggeol Jeff Kim, Yeajin Song, Melina Maleknia, Matt Bookman, Willy Nojopranoto, Roy H Campbell, Sayed Hadi Hashemi, Juan A Botia, John F Carter, David W Craig, Kendall Van Keuren-Jensen, Huw R Morris, John A Hardy, Cornelis Blauwendraat, Andrew B Singleton, Faraz Faghri, Mike A Nalls.
Abstract
Personalized medicine promises individualized disease prediction and treatment. The convergence of machine learning (ML) and available multimodal data is key moving forward. We build upon previous work to deliver multimodal predictions of Parkinson's disease (PD) risk and systematically develop a model using GenoML, an automated ML package, to make improved multi-omic predictions of PD, validated in an external cohort. We investigated top features, constructed hypothesis-free disease-relevant networks, and investigated drug-gene interactions. We performed automated ML on multimodal data from the Parkinson's progression marker initiative (PPMI). After selecting the best-performing algorithm, all PPMI data were used to tune the selected model. The model was validated in the Parkinson's Disease Biomarker Program (PDBP) dataset. Our initial model showed an area under the curve (AUC) of 89.72% for the diagnosis of PD. The tuned model was then validated on external data (PDBP, AUC 85.03%). Optimizing thresholds for classification increased the diagnosis prediction accuracy and other metrics. Finally, networks were built to identify gene communities specific to PD. Combining data modalities outperforms the single biomarker paradigm. UPSIT and PRS contributed most to the predictive power of the model, but their accuracy is supplemented by many smaller-effect transcripts and risk SNPs. Our model is best suited to identifying large groups of individuals to monitor within a health registry or biobank to prioritize for further testing. This approach allows complex predictive models to be reproducible and accessible to the community, with the package, code, and results publicly available.
Year: 2022 PMID: 35365675 PMCID: PMC8975993 DOI: 10.1038/s41531-022-00288-w
Source DB: PubMed Journal: NPJ Parkinsons Dis ISSN: 2373-8057
Descriptive statistics of studies included from AMP PD.
| Study | Status | Age at baseline, mean (SD) | UPSIT score, mean (SD) | Male (%) | Positive family history of PD (%) | Inferred Ashkenazi ancestry (%) |
|---|---|---|---|---|---|---|
| PPMI | Case | 61.75 (9.69) | 23.48 (8.35) | 65.57 | 25.53 | 6.09 |
| PPMI | Control | 60.61 (10.43) | 34.18 (4.71) | 63.74 | 5.85 | 11.11 |
| PDBP | Case | 64.59 (8.99) | 19.65 (8.01) | 64.18 | 24.88 | 3.61 |
| PDBP | Control | 62.87 (10.96) | 32.52 (5.98) | 45.25 | 8.14 | 4.07 |
AMP-PD accelerating medicines partnership in Parkinson’s disease, PPMI Parkinson’s progression marker initiative, PDBP Parkinson’s disease biomarker program, PD Parkinson’s disease, SD standard deviation, UPSIT University of Pennsylvania smell identification test.
Fig. 1 Workflow and Data Summary.
Scientific notation in the workflow diagram denotes minimum p values from reference GWAS or differential expression studies as a pre-screen for feature inclusion. Blue indicates subsets of genetics data (also denoted as “G”), green indicates subsets of transcriptomics data (also denoted as *omics or “O”), yellow indicates clinico-demographic data (also denoted as C + D), and purple indicates combined data modalities. PD Parkinson’s disease, AMP-PD accelerating medicines partnership in Parkinson’s disease, PPMI Parkinson’s progression marker initiative, PDBP Parkinson’s disease biomarker program, WGS whole-genome sequencing, GWAS genome-wide association study, QC quality control, MAF minor allele frequency, PRS polygenic risk score.
Performance metric summaries comparing data modalities at training, evaluated in withheld samples in PPMI.
| Data Modality | Genetics | Clinico-demographic | Transcriptomics | Combined |
|---|---|---|---|---|
| Stage | Training in PPMI (70:30) | Training in PPMI (70:30) | Training in PPMI (70:30) | Training in PPMI (70:30) |
| Algorithm | MLPClassifier | LogisticRegression | SVC | AdaBoostClassifier |
| AUC (%) | 70.66 | 87.52 | 79.73 | 89.72 |
| Accuracy (%) | 70.00 | 79.44 | 73.89 | 85.56 |
| Balanced accuracy (%) | 60.64 | 75.27 | 54.60 | 82.41 |
| Log Loss | 0.83 | 0.39 | 0.48 | 0.63 |
| Sensitivity | 0.83 | 0.85 | 0.97 | 0.89 |
| Specificity | 0.38 | 0.65 | 0.12 | 0.76 |
| PPV | 0.77 | 0.86 | 0.75 | 0.91 |
| NPV | 0.48 | 0.64 | 0.60 | 0.73 |
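The balanced accuracy row can be cross-checked against the sensitivity and specificity rows, since balanced accuracy is conventionally the mean of the two. The sketch below is an illustration under that assumption, not the authors' code; the function name `balanced_accuracy` is hypothetical, and small gaps from the reported values are expected because the table rounds sensitivity and specificity to two decimals.

```python
def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Return the mean of sensitivity and specificity as a fraction in [0, 1]."""
    return (sensitivity + specificity) / 2.0


# (sensitivity, specificity, reported balanced accuracy in %) from the table above
table_rows = {
    "Genetics": (0.83, 0.38, 60.64),
    "Clinico-demographic": (0.85, 0.65, 75.27),
    "Transcriptomics": (0.97, 0.12, 54.60),
    "Combined": (0.89, 0.76, 82.41),
}

# Recompute balanced accuracy from the rounded inputs and keep the gap
# to the reported value for each data modality.
gaps = {}
for modality, (sens, spec, reported_pct) in table_rows.items():
    recomputed_pct = 100.0 * balanced_accuracy(sens, spec)
    gaps[modality] = abs(recomputed_pct - reported_pct)
    print(f"{modality}: recomputed {recomputed_pct:.2f}%, reported {reported_pct:.2f}%")
```

The transcriptomics column shows why this metric matters here: a sensitivity of 0.97 with a specificity of 0.12 gives a plain accuracy of 73.89% but a balanced accuracy near chance.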
Fig. 2 Receiver operating characteristic curves and case probability density plots in withheld training samples at default thresholds comparing performance metrics in different data modalities from the PPMI dataset.
P values mentioned indicate the threshold of significance used per datatype, except for the inclusion of all clinico-demographic features. a PPMI combined *omics dataset (genetics p value threshold = 1E-5, transcriptomics p value threshold = 1E-2, and clinico-demographic information); b PPMI genetics-only dataset (p value threshold = 1E-5); c PPMI clinico-demographics only dataset; d PPMI transcriptomics-only dataset (p value threshold = 1E-2). Note that x-axis limits may vary, as some models inherently produce less extreme probability distributions than others, depending on the fit to the input data and the algorithm used. Further detailed images are included in Supplementary Fig. 5. PPMI Parkinson’s progression marker initiative, ROC receiver operating characteristic curve.
Performance metric summaries comparing baseline and tuned models at cross-validation in withheld samples in PPMI.
| Data Modality | Genetics | Clinico-demographic | Transcriptomics | Combined |
|---|---|---|---|---|
| Stage | Tuning in PPMI | Tuning in PPMI | Tuning in PPMI | Tuning in PPMI |
| Algorithm | MLPClassifier | LogisticRegression | SVC | AdaBoostClassifier |
| AUC at training (%) | 70.66 | 87.52 | 79.73 | 89.72 |
| Mean, AUC during CV for baseline model (%) | 69.44 | 88.51 | 78.05 | 86.99 |
| Standard deviation, AUC during CV for baseline model (%) | 4.46 | 2.17 | 4.27 | 2.30 |
| Min, AUC during CV for baseline model (%) | 62.45 | 86.19 | 71.49 | 84.27 |
| Max, AUC during CV for baseline model (%) | 75.73 | 91.98 | 82.62 | 90.70 |
| Mean, AUC during CV for tuned model (%) | 70.93 | 88.55 | 79.01 | 90.17 |
| Standard deviation, AUC during CV for tuned model (%) | 5.39 | 2.20 | 4.71 | 1.64 |
| Min, AUC during CV for tuned model (%) | 61.29 | 86.33 | 70.88 | 88.06 |
| Max, AUC during CV for tuned model (%) | 76.71 | 92.15 | 84.01 | 92.73 |
| Variance, AUC during CV for baseline model (%) | 19.89 | 4.73 | 18.20 | 5.29 |
| Variance, AUC during CV for tuned model (%) | 29.03 | 4.82 | 22.18 | 2.70 |
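The variance rows are the squares of the standard deviation rows, so the two can be checked against each other. The sketch below performs that check on the values reported in the table; the helper name `variance_from_sd` and the tolerance are assumptions chosen for illustration, with the tolerance sized to absorb the rounding of each standard deviation to two decimals.

```python
def variance_from_sd(sd: float) -> float:
    """Variance is the square of the standard deviation."""
    return sd ** 2


# (standard deviation, variance) of the CV AUC in %, as reported in the table
reported = {
    ("Genetics", "baseline"): (4.46, 19.89),
    ("Clinico-demographic", "baseline"): (2.17, 4.73),
    ("Transcriptomics", "baseline"): (4.27, 18.20),
    ("Combined", "baseline"): (2.30, 5.29),
    ("Genetics", "tuned"): (5.39, 29.03),
    ("Clinico-demographic", "tuned"): (2.20, 4.82),
    ("Transcriptomics", "tuned"): (4.71, 22.18),
    ("Combined", "tuned"): (1.64, 2.70),
}

# An SD rounded to two decimals can be off by 0.005, which shifts the
# squared value by roughly 2 * sd * 0.005; 0.06 covers the largest SD here.
TOLERANCE = 0.06

consistent = {
    key: abs(variance_from_sd(sd) - var) <= TOLERANCE
    for key, (sd, var) in reported.items()
}
for (modality, stage), ok in consistent.items():
    print(f"{modality} ({stage}): {'consistent' if ok else 'inconsistent'}")
```

Read this way, tuning narrowed the spread of the combined model across folds (variance 5.29 to 2.70) while widening it for the genetics-only and transcriptomics-only models.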
Performance metric summaries comparing combined tuned and untuned model performance on PDBP validation dataset.
| Data Modality | Combined | Combined; Untuned | Combined; Tuned |
|---|---|---|---|
| Stage | Untuned in PPMI as reference | Validation in PDBP | Validation in PDBP |
| Algorithm | AdaBoostClassifier | AdaBoostClassifier | AdaBoostClassifier |
| AUC (%) | 89.72 | 83.84 | 85.03 |
| Accuracy (%) | 85.56 | 75.81 | 75.00 |
| Balanced accuracy (%) | 82.41 | 69.31 | 68.09 |
| Log Loss | 0.63 | 0.64 | 0.67 |
| Sensitivity | 0.89 | 0.93 | 0.93 |
| Specificity | 0.76 | 0.46 | 0.43 |
| PPV | 0.91 | 0.75 | 0.74 |
| NPV | 0.73 | 0.78 | 0.78 |
Fig. 3 Receiver operating characteristic and case probability density plots in the external dataset (PDBP) at validation for the trained and then tuned models at default thresholds.
Probabilities are predicted case status (r1), so controls (status of 0) skew towards the left, and positive PD cases (status of 1) skew towards the right. a Testing in PDBP the combined *omics model (genetics p value threshold = 1E-5, transcriptomics p value threshold = 1E-2, and clinico-demographic information) developed in PPMI prior to tuning the hyperparameters of the model; b Testing in PDBP the combined *omics model (genetics p value threshold = 1E-5, transcriptomics p value threshold = 1E-2, and clinico-demographic information) developed in PPMI after tuning the hyperparameters of the model. PPMI Parkinson’s progression marker initiative, PDBP Parkinson’s disease biomarker program, ROC receiver operating characteristic curve.
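To make the link between the case probability densities and the AUC concrete, the sketch below computes AUC as the fraction of case-control pairs in which the case receives the higher predicted probability, with ties counted as half. This is a generic illustration on made-up toy data, not the study's pipeline; the function name `roc_auc` and the example labels and probabilities are assumptions.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], probs: Sequence[float]) -> float:
    """AUC as the probability that a random case outranks a random control."""
    cases = [p for y, p in zip(labels, probs) if y == 1]
    controls = [p for y, p in zip(labels, probs) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case_prob in cases:
        for control_prob in controls:
            if case_prob > control_prob:
                wins += 1.0  # case correctly ranked above control
            elif case_prob == control_prob:
                wins += 0.5  # ties count as half
    return wins / (len(cases) * len(controls))


# Toy data (hypothetical): status 0 = control, 1 = case
toy_labels = [0, 0, 0, 1, 1, 1, 1]
toy_probs = [0.10, 0.35, 0.60, 0.40, 0.70, 0.80, 0.95]
toy_auc = roc_auc(toy_labels, toy_probs)
print(f"AUC on toy data: {toy_auc:.4f}")
```

The more the control density sits to the left of the case density, the more pairs are ranked correctly and the closer the AUC gets to 1. The quadratic loop is fine for a few hundred samples; a rank-based implementation would be preferable at larger scale.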
Optimizing the AUC threshold in withheld training samples and in the validation data.
| Dataset | PPMI, withheld samples | PPMI, withheld samples | PDBP, external test samples | PDBP, external test samples |
|---|---|---|---|---|
| Model | Training phase | Training phase | Tuned model | Tuned model |
| Optimization | optimized | default | optimized | default |
| Case Probability Threshold (%) | 51 | 50 | 51 | 50 |
| Accuracy (%) | 85 | 85.56 | 78.58 | 75 |
| Balanced accuracy (%) | 83.95 | 82.41 | 77.97 | 68.09 |
| Log loss | 0.05 | 0.05 | 0.07 | 0.09 |
| Sensitivity | 0.86 | 0.89 | 0.80 | 0.93 |
| Specificity | 0.82 | 0.76 | 0.76 | 0.43 |
| PPV | 0.93 | 0.91 | 0.85 | 0.74 |
| NPV | 0.69 | 0.73 | 0.68 | 0.78 |
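The table shows a one-point shift in the case probability threshold trading sensitivity for specificity in PDBP. One common way to choose such a threshold is to scan a grid and keep the value that maximizes balanced accuracy (equivalent to maximizing Youden's J). The sketch below assumes that criterion purely for illustration, since the criterion the authors used is not stated in this section; all function names and the toy data are hypothetical.

```python
from typing import List, Sequence


def balanced_accuracy_at(labels: Sequence[int], probs: Sequence[float],
                         threshold: float) -> float:
    """Balanced accuracy when samples with prob >= threshold are called cases."""
    tp = fn = tn = fp = 0
    for y, p in zip(labels, probs):
        predicted_case = p >= threshold
        if y == 1:
            tp += predicted_case
            fn += not predicted_case
        else:
            fp += predicted_case
            tn += not predicted_case
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one case and one control")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2.0


def best_threshold(labels: Sequence[int], probs: Sequence[float],
                   grid: List[float]) -> float:
    """Return the first grid threshold with the highest balanced accuracy."""
    return max(grid, key=lambda t: balanced_accuracy_at(labels, probs, t))


# Toy data (hypothetical): predicted probabilities cluster near 0.5
toy_y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
toy_p = [0.30, 0.45, 0.52, 0.55, 0.49, 0.58, 0.60, 0.70, 0.80, 0.90]
grid = [i / 100 for i in range(1, 100)]

chosen = best_threshold(toy_y, toy_p, grid)
default_score = balanced_accuracy_at(toy_y, toy_p, 0.50)
chosen_score = balanced_accuracy_at(toy_y, toy_p, chosen)
print(f"default 0.50: {default_score:.3f}, chosen {chosen:.2f}: {chosen_score:.3f}")
```

When predicted probabilities are concentrated near the default cut-off, as in this toy example, a small move in the threshold can reclassify many samples, which is one plausible reading of the large specificity gain in the PDBP columns.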
Fig. 4 Feature importance plots for top 5% of features in data.
In the plot on the left, lower values are shown in blue and higher values in red, relative to the baseline risk estimate. The plot on the right indicates directionality: features predictive of cases are shown in red, and features that better predict controls are shown in blue. SHAP Shapley values, UPSIT University of Pennsylvania smell identification test, PRS polygenic risk score.