Simeone Marino1,2, Jiachen Xu1, Yi Zhao1, Nina Zhou1, Yiwang Zhou1, Ivo D Dinov1,3,4.
Abstract
The theoretical foundations of Big Data Science are not yet fully developed. This study proposes a new scalable framework for Big Data representation, high-throughput analytics (variable selection and noise reduction), and model-free inference. Specifically, we explore the core principles of distribution-free and model-agnostic methods for scientific inference based on Big Data sets. Compressive Big Data analytics (CBDA) iteratively generates random (sub)samples from a big and complex dataset. This subsampling with replacement is conducted on the feature and case levels and results in samples that are not necessarily consistent or congruent across iterations. The approach relies on an ensemble predictor where established model-based or model-free inference techniques are iteratively applied to preprocessed and harmonized samples. Repeating the subsampling and prediction steps many times yields derived likelihoods, probabilities, or parameter estimates, which can be used to assess the algorithm's reliability and the accuracy of findings via bootstrapping methods, or to extract important features via controlled variable selection. CBDA provides a scalable algorithm for addressing some of the challenges associated with handling complex, incongruent, incomplete, and multi-source data and analytics. Although it is not yet fully developed, a CBDA mathematical framework will enable the study of the ergodic properties and the asymptotics of the specific statistical inference approaches via CBDA. We implemented the high-throughput CBDA method using pure R as well as via the graphical pipeline environment. To validate the technique, we used several simulated datasets as well as a real neuroimaging-genetics case study of Alzheimer's disease. The CBDA approach may be customized to provide generic representation of complex multimodal datasets and to provide stable scientific inference for large, incomplete, and multisource datasets.
Year: 2018 PMID: 30161148 PMCID: PMC6116997 DOI: 10.1371/journal.pone.0202674
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. CBDA framework.
CBDA involves the following steps: Step 1: Data Cleaning, Step 2: Data Harmonization, Step 3: Data Aggregation and Selection of Prediction Dataset. The first three steps represent Data Wrangling. Step 4: Random Sampling from the aggregated dataset, Step 5: Data Imputation, Scaling and Balancing (if needed), Step 6: Controlled variable selection and SuperLearner algorithms, Step 7: Ranking of Mean Square Errors (MSE) and Accuracy metrics, and finally, Step 8: Feature Mining and Inference.
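Steps 4 to 8 can be illustrated with a minimal sketch. It is not the authors' R implementation: ordinary least squares stands in for the SuperLearner ensemble, imputation and balancing are skipped, and the function name `cbda_sketch`, its parameters, and the synthetic data are all hypothetical. Within each iteration the sampled indices are distinct; the subsets are redrawn independently and may overlap across iterations.

```python
import numpy as np


def cbda_sketch(X, y, n_iter=400, fsr=(0.05, 0.15), csr=(0.3, 0.6), top=40, seed=0):
    """Toy CBDA-style loop: subsample, fit, rank by MSE, count features.

    Assumption: a least-squares fit replaces the SuperLearner ensemble.
    Returns how often each feature appears among the `top` best subsamples.
    """
    rng = np.random.default_rng(seed)
    n_cases, n_features = X.shape
    results = []  # one (mse, feature_indices) pair per iteration
    for _ in range(n_iter):
        # Step 4: draw subset sizes from the feature and case sampling ranges.
        k_feat = max(1, int(rng.uniform(*fsr) * n_features))
        k_case = max(k_feat + 2, int(rng.uniform(*csr) * n_cases))
        feats = rng.choice(n_features, size=k_feat, replace=False)
        cases = rng.choice(n_cases, size=k_case, replace=False)
        held_out = np.setdiff1d(np.arange(n_cases), cases)

        # Step 6 (stand-in): least-squares fit with an intercept on the subsample.
        A = np.column_stack([np.ones(k_case), X[np.ix_(cases, feats)]])
        coef, *_ = np.linalg.lstsq(A, y[cases], rcond=None)

        # Predict the cases left out of this subsample and record the MSE.
        B = np.column_stack([np.ones(held_out.size), X[np.ix_(held_out, feats)]])
        mse = float(np.mean((B @ coef - y[held_out]) ** 2))
        results.append((mse, feats))

    # Step 7: rank the iterations by MSE, lowest first, and keep the best ones.
    best = sorted(results, key=lambda r: r[0])[:top]

    # Step 8: count how often each feature occurs in the top-ranked subsamples.
    return np.bincount(np.concatenate([f for _, f in best]), minlength=n_features)


# Hypothetical synthetic data: only features 0 and 5 carry signal.
demo_rng = np.random.default_rng(42)
X_demo = demo_rng.normal(size=(300, 40))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 5] + demo_rng.normal(scale=0.5, size=300)
demo_counts = cbda_sketch(X_demo, y_demo)
print("most frequent features:", np.argsort(demo_counts)[::-1][:5])
```

Features that carry signal lower the held-out MSE whenever they are sampled, so they accumulate in the top-ranked subsamples; the frequency counts are the raw material for the feature-mining step.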
Alzheimer Disease Neuroimaging Initiative dataset.
| Source | Types of Data | Sample Size | Clinical Relevance |
|---|---|---|---|
| | | Each data modality comes with a different number of cohorts. Generally, 500–2,500 subjects (for instance see [ | ADNI provides interesting data modalities, multiple cohorts (e.g., early-onset, mild, and severe dementia, controls) that allow effective model training and validation |
Table 2. Input specifications for all CBDA experiments used to validate the convergence of the CBDA method.
| M [number of CBDA iterations] | Fraction of missing values [misValperc] | Feature Sampling Range [FSR] | Cases Sampling Range [CSR] | Subsets of M | Top-Ranked Predictions |
|---|---|---|---|---|---|
| 9,000 | 0% | [1%,5%] | [30%,60%] | 1,000 | 100 |
Fig 2. LONI pipeline workflow for the CBDA protocol.
In the graphical pipeline workflow implementation, the CBDA technique is divided into the following steps: Steps 1–5 are data wrangling and sampling; Step 6 represents the SuperLearner loop; Step 7 is consolidation, performance metrics generation, and ranking; and Step 8 includes consolidation of performance metrics and inference on the top features.
CBDA computational complexity.
| CBDA Computational Complexity | CPU time per job | Total CPU time (M = 9000) |
|---|---|---|
| Desktop/Laptop | ~3–10 mins | [x M] |
| Small Multicore Server | ~3–10 mins | [(x M)/n] |
| Large Cloud Server | ~3–10 mins | [(x M)/n] |
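The totals in the table follow from simple arithmetic, sketched below. The table does not define n; it is read here as the number of jobs that run concurrently, which is an assumption, as are the function name and the example value n = 100.

```python
def total_minutes(minutes_per_job, n_jobs, n_workers=1):
    """Estimate (x * M) / n: x minutes per job, M jobs, n concurrent workers.

    Assumption: n is the number of jobs running at the same time.
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    return minutes_per_job * n_jobs / n_workers


M = 9000  # number of CBDA jobs, as in the table header

# Desktop/laptop case (n = 1), using the ~3-10 minute range per job.
low, high = total_minutes(3, M), total_minutes(10, M)
print(f"single worker: {low / 60:.0f} to {high / 60:.0f} hours")

# Hypothetical server with 100 concurrent workers.
low_n, high_n = total_minutes(3, M, 100), total_minutes(10, M, 100)
print(f"100 workers: {low_n / 60:.1f} to {high_n / 60:.1f} hours")
```

With one worker the 9,000 jobs take 27,000 to 90,000 minutes (450 to 1,500 hours); dividing by the number of concurrent workers is what makes the multicore and cloud configurations practical.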
Fig 3. Heatmaps of the CBDA protocol for the binomial datasets.
The x axis represents the 16 combinations of the subsets of M (i.e., 1,000, 3,000, 6,000 and 9,000) and the number of top-ranked predictions (i.e., 100, 200, 500 and 1,000, as described in the last 2 columns of Table 2 in the Methods section). The combinations are ordered as follows: Combination 1 = (1,000, 100), Combination 2 = (1,000, 200), Combination 3 = (1,000, 500), Combination 4 = (1,000, 1,000), Combination 5 = (3,000, 100), Combination 6 = (3,000, 200), Combination 7 = (3,000, 500), Combination 8 = (3,000, 1,000), Combination 9 = (6,000, 100), Combination 10 = (6,000, 200), Combination 11 = (6,000, 500), Combination 12 = (6,000, 1,000), Combination 13 = (9,000, 100), Combination 14 = (9,000, 200), Combination 15 = (9,000, 500), Combination 16 = (9,000, 1,000). The y axis represents the CBDA experiment specifications, where Experiments 1–6 have no missing values (i.e., misValperc = 0%), and Experiments 7–12 have 20% missing values (i.e., misValperc = 20%). Both sets of experiments combine the FSR and CSR ranges in ascending order, namely Exp 1 and Exp 7 = [FSR, CSR] = [1–5%, 30–60%], Exp 2 and Exp 8 = [FSR, CSR] = [5–15%, 30–60%], Exp 3 and Exp 9 = [FSR, CSR] = [15–30%, 30–60%], Exp 4 and Exp 10 = [FSR, CSR] = [1–5%, 60–80%], Exp 5 and Exp 11 = [FSR, CSR] = [5–15%, 60–80%], Exp 6 and Exp 12 = [FSR, CSR] = [15–30%, 60–80%]. See Table 2 for details. Panels A, C and E show the CBDA results using the Accuracy performance metric. Panels B, D and F show the CBDA results using the Mean Square Error (MSE) performance metric (see Methods for details on the performance metrics). Panels A and B, C and D, E and F show the results for the 3 Binomial datasets tested, respectively.
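The axis orderings in this caption can be reproduced with a short enumeration, which may help when mapping a heatmap cell back to its settings. The variable names are illustrative and do not come from the CBDA code.

```python
from itertools import product

# x axis: subsets of M (outer loop) crossed with top-ranked predictions (inner loop).
subsets_of_m = [1000, 3000, 6000, 9000]
top_ranked = [100, 200, 500, 1000]
combinations = dict(enumerate(product(subsets_of_m, top_ranked), start=1))

# y axis: missing-value percentage (outermost), then CSR, then FSR (innermost).
missing_pct = [0, 20]
csr_ranges = [(30, 60), (60, 80)]           # case sampling ranges, in percent
fsr_ranges = [(1, 5), (5, 15), (15, 30)]    # feature sampling ranges, in percent
experiments = {
    i: {"missing_pct": miss, "fsr": fsr, "csr": csr}
    for i, (miss, csr, fsr) in enumerate(
        product(missing_pct, csr_ranges, fsr_ranges), start=1
    )
}

print(combinations[11])
print(experiments[11])
```

Looking up index 11 in each dictionary returns the settings the caption lists for Combination 11 and Exp 11.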
Fig 4. CBDA results on the null and binomial datasets.
Panels A, C and E show the corresponding histograms generated from the CBDA analysis on the three Null datasets. Panels B, D and F show the corresponding histograms generated from the CBDA analysis on the three Binomial datasets. Panels A and B, C and D, E and F show the combined results of all 12 experiments using the MSE metric.
Fig 5. Knockoff filtering of null vs binomial data.
Panels A, C and E show the corresponding histograms generated from the Knockoff Filter algorithm on the three Null datasets. Panels B, D and F show the corresponding histograms generated from the Knockoff Filter algorithm on the three Binomial datasets. Panels A and B, C and D, E and F show the combined results of all 12 experiments using the MSE metric.
CBDA multinomial classification results on the ADNI dataset.
Confusion Matrix and Statistics.
| Prediction | Reference | | |
|---|---|---|---|
| | 69 | 17 | 1 |
| | 12 | 243 | 8 |
| | 0 | 9 | 140 |
| Overall Statistics | | | |
| Accuracy | 0.9058 [95% CI = (0.8767, 0.93)] | | |
| No Information Rate | 0.5391 | | |
| p-value [Acc>NIR] | <2e-16 | | |
| Kappa | 0.8426 | | |
| McNemar’s Test p-value | 0.589 | | |
| Sensitivity | 0.8519 | 0.9033 | 0.9396 |
| Specificity | 0.9569 | 0.913 | 0.9743 |
| Positive Pred Value | 0.7931 | 0.924 | 0.9396 |
| Negative Pred Value | 0.9709 | 0.8898 | 0.9743 |
| Prevalence | 0.1623 | 0.5391 | 0.2986 |
| Detection Rate | 0.1383 | 0.487 | 0.2806 |
| Detection Prevalence | 0.1743 | 0.5271 | 0.2986 |
| Balanced Accuracy | 0.9044 | 0.9082 | 0.9569 |
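The reported statistics can be recomputed from the nine counts alone, as in the sketch below. It assumes that rows are predicted classes and columns are reference classes; the prevalence and detection-prevalence rows of the table are consistent with that orientation. The class labels are not given in the table, so the classes are referred to only by position.

```python
import numpy as np

# Confusion matrix from the table: rows = predicted class, columns = reference class.
cm = np.array([
    [69, 17, 1],
    [12, 243, 8],
    [0, 9, 140],
])

total = cm.sum()
tp = np.diag(cm)                 # correctly classified cases per class
ref_totals = cm.sum(axis=0)      # cases per reference class
pred_totals = cm.sum(axis=1)     # cases per predicted class

accuracy = tp.sum() / total
prevalence = ref_totals / total
sensitivity = tp / ref_totals
# True negatives for a class: everything outside both its row and its column.
true_neg = total - ref_totals - pred_totals + tp
specificity = true_neg / (total - ref_totals)
ppv = tp / pred_totals
balanced_accuracy = (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement corrected for chance agreement.
expected = (ref_totals * pred_totals).sum() / total**2
kappa = (accuracy - expected) / (1 - expected)

print(f"accuracy = {accuracy:.4f}, kappa = {kappa:.4f}")
print("sensitivity:", np.round(sensitivity, 4))
print("specificity:", np.round(specificity, 4))
```

The no-information rate in the table (0.5391) equals the largest prevalence, that is, the accuracy obtained by always predicting the most common reference class.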