Mingming Gong1,2,3, Peng Liu1, Frank C Sciurba1, Petar Stojanov2, Dacheng Tao4, George C Tseng1, Kun Zhang2, Kayhan Batmanghelich1.
Abstract
MOTIVATION: There is growing interest in the biomedical research community in incorporating retrospective data, available in healthcare systems, to shed light on associations between different biomarkers. Understanding the association between various types of biomedical data, such as genetic data, blood biomarkers, and imaging, can provide a holistic understanding of human diseases. To formally test a hypothesized association between two types of data in Electronic Health Records (EHRs), one requires a substantial sample size with both data modalities to achieve reasonable power. Current association test methods only allow using data from individuals who have both data modalities. Hence, researchers cannot take advantage of much larger EHR samples that include individuals with at least one of the data types, which limits the power of the association test.
Year: 2021 PMID: 33070196 PMCID: PMC8098021 DOI: 10.1093/bioinformatics/btaa886
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. X and Y represent two modalities. Current approaches use only paired data. Assuming that the total numbers of samples of X (M) and Y (N) exceed the number of paired samples, we aim to find out how the control of false discoveries and the power of association tests can be improved by the unpaired data
Comparison of VCST and KIT
| Method | Test statistic (unbiased) | Null distribution (unbiased) | Test statistic (biased) | Null distribution (biased) | Unpaired | Unpaired |
|---|---|---|---|---|---|---|
| VCST | | | | | ✗ | ✓ |
| KIT | | | | | ✓ | ✓ |
Fig. 2. Evaluation of SAT-rx type I error rate control on the simulated data generated by procedure (1) in the random X setting. The blue line (KIT) is the result of using only the n = 100 paired data points; hence it does not change with the addition of unpaired data. Our methods (green and orange) start with n pairs and gradually add unpaired data to improve type I error control. False-positive rates for both variants of our method SAT-rx are well controlled around the nominal value (DR: Dimension Reduction)
Fig. 3. Evaluation of SAT-rx test power on the simulated data generated by procedure (1) in the random X setting (DR: Dimension Reduction). The results for heritability values and dimensionality are shown. KIT only uses the n = 100 paired data points. Our methods start with n pairs and gradually add unpaired data to improve test power
Fig. 4. Evaluation of SAT-fx type I error rate control on the data generated in simulation (2). VCST only uses the n = 3000 paired data points. Our method SAT-fx starts with n pairs and gradually adds unpaired data to improve type I error control
Fig. 5. Evaluation of SAT-fx test power on the data generated in simulation (2). VCST only uses the n = 3000 paired data points. Our method SAT-fx starts with n pairs and gradually adds unpaired data to improve test power
Fig. 6. Experiments on three real imaging and genetics datasets. (a) Test of an association between multidimensional imaging features and plasma biomarkers. (b) Test of an association between imaging features and peripheral blood mononuclear cell gene expression data. (c) Test of an association between imaging features and the expression of genes in the immune system pathway of the disease. In all the experiments, we start with n paired data points and show the behavior of our methods when adding unpaired data, with and without dimensionality reduction (DR)
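The type I error rates and test power reported in the figure captions above are empirical rejection rates over repeated simulations. The following is a minimal sketch of how such rates can be estimated, assuming a plain permutation test on the absolute Pearson correlation of one-dimensional paired data. This statistic is a simple stand-in and is not KIT, VCST, or SAT; the function names `perm_pvalue` and `rejection_rate`, the effect size, and the sample sizes are hypothetical choices made for illustration only.

```python
import numpy as np


def perm_pvalue(x, y, n_perm=200, rng=None):
    """Permutation p-value for association between paired 1-D samples.

    Test statistic: absolute Pearson correlation (an illustrative stand-in,
    not one of the kernel-based statistics used in the paper).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    # Standardize once so that a dot product divided by n is the correlation.
    xs = (x - x.mean()) / x.std()
    ys = (y - y.mean()) / y.std()
    observed = abs(xs @ ys) / n
    # Shuffling y breaks the pairing and simulates the null hypothesis.
    null = np.array([abs(xs @ rng.permutation(ys)) / n for _ in range(n_perm)])
    # The +1 terms keep the p-value valid (never exactly zero).
    return (1 + np.sum(null >= observed)) / (1 + n_perm)


def rejection_rate(n, effect, n_trials=200, alpha=0.05, seed=0):
    """Fraction of simulated datasets in which the null is rejected.

    With effect == 0 this estimates the type I error rate;
    with effect != 0 it estimates the test power.
    """
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_trials):
        x = rng.standard_normal(n)
        y = effect * x + rng.standard_normal(n)
        if perm_pvalue(x, y, rng=rng) <= alpha:
            rejections += 1
    return rejections / n_trials


type1 = rejection_rate(n=100, effect=0.0)        # no true association
power_small = rejection_rate(n=30, effect=0.3)   # few paired samples
power_large = rejection_rate(n=300, effect=0.3)  # many paired samples
print(f"type I error: {type1:.3f}")
print(f"power (n=30): {power_small:.3f}, power (n=300): {power_large:.3f}")
```

In this toy setting the rejection rate under the null should stay near the nominal level, while power grows with the number of paired samples, which is the limitation of paired-only tests that the abstract describes.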
P-values on Uganda General Population Cohort
| | | KIT | | SAT-rx (w/o DR) | | SAT-rx | | Oracle | |
|---|---|---|---|---|---|---|---|---|---|
| SBP | 0.22 | 0.293 | 1.000 | 0.224 | 1.000 | 0.128 | 0.897 | 0.010 | 0.195 |
| DBP | 0.29 | 0.091 | 0.928 | | 0.537 | | 0.138 | <1.00e-05 | <1.90e-04 |
| BMI | 0.37 | 0.101 | 0.907 | | 0.528 | | 0.214 | <1.00e-05 | <1.90e-04 |
| WHR | 0.14 | 0.249 | 1.000 | 0.171 | 0.901 | 0.119 | 0.810 | 0.033 | 0.630 |
| Weight | 0.43 | 0.057 | 0.819 | | 0.235 | | | <1.00e-05 | <1.90e-04 |
| Height | 0.50 | 0.031 | 0.532 | 3.81e-03 | 0.072 | 1.74e-04 | | <1.00e-05 | <1.90e-04 |
| HC | 0.37 | 0.095 | 0.930 | | 0.503 | | 0.196 | <1.00e-05 | <1.90e-04 |
| WC | 0.31 | 0.127 | 0.928 | 0.057 | 0.662 | | 0.345 | 1.20e-05 | 2.28e-04 |
| ALT | 0.37 | 0.204 | 0.920 | 0.172 | 0.646 | 0.106 | 0.617 | 1.76e-03 | 0.033 |
| Albumin | 0.44 | 0.117 | 0.983 | | 0.593 | | 0.395 | <1.00e-05 | <1.90e-04 |
| ALP | 0.12 | 0.442 | 1.000 | 0.419 | 1.000 | 0.318 | 1.000 | 0.261 | 1.000 |
| AST | 0.25 | 0.293 | 1.000 | 0.322 | 1.000 | 0.276 | 0.875 | 0.187 | 1.000 |
| Bilirubin | 0.45 | 0.046 | 0.629 | 0.027 | 0.390 | 8.43e-03 | 0.160 | <1.00e-05 | <1.90e-04 |
| Cholesterol | 0.60 | 0.024 | 0.448 | 2.25e-03 | | 1.96e-04 | | <1.00e-05 | <1.90e-04 |
| GGT | 0.11 | 0.307 | 1.000 | 0.290 | 0.801 | 0.265 | 0.800 | 0.039 | 0.734 |
| HDL | 0.51 | 0.063 | 0.717 | | 0.326 | | 0.090 | <1.00e-05 | <1.90e-04 |
| LDL | 0.60 | 0.012 | 0.222 | 6.10e-04 | | 2.20e-05 | | <1.00e-05 | <1.90e-04 |
| Triglycerides | 0.27 | 0.242 | 1.000 | 0.164 | 1.000 | 0.126 | 0.880 | 6.76e-04 | 0.013 |
| HbA1c2 | 0.56 | 6.23e-03 | 0.118 | 3.66e-04 | | 1.80e-05 | | <1.00e-05 | <1.90e-04 |
| WBC | 0.44 | 6.95e-03 | 0.139 | <1.00e-05 | < | <1.00e-05 | < | | |
| RBC | 0.39 | 0.011 | 0.219 | 4.40e-05 | | ≪1.00e-05 | < | | |
| Hemoglobin | 0.20 | 0.041 | 0.815 | 1.18e-03 | | 1.40e-05 | | | |
| HCT | 0.22 | 0.025 | 0.508 | 3.36e-04 | | <1.00e-05 | < | | |
| MCV | 0.57 | 1.47e-03 | 0.029 | <1.00e-05 | <2.00e-04 | <1.00e-05 | <2.00e-04 | | |
| MCH | 0.53 | 2.50e-03 | 0.050 | <1.00e-05 | <2.00e-04 | <1.00e-05 | <2.00e-04 | | |
| MCHC | 0.72 | <1.00e-05 | <2.00e-04 | <1.00e-05 | <2.00e-04 | <1.00e-05 | <2.00e-04 | | |
| RDW | 0.33 | 6.70e-03 | 0.134 | <1.00e-05 | < | <1.00e-05 | < | | |
| PLT | 0.48 | 3.00e-03 | 0.060 | <1.00e-05 | < | <1.00e-05 | < | | |
| MPV | 0.57 | 1.00e-05 | 2.00e-04 | <1.00e-05 | <2.00e-04 | <1.00e-05 | <2.00e-04 | | |
| NEUPr | 0.39 | 0.015 | 0.304 | 7.80e-05 | | <1.00e-05 | < | | |
| LYMPHPr | 0.47 | 3.30e-03 | 0.066 | <1.00e-05 | < | <1.00e-05 | < | | |
| MONOPr | 0.48 | 7.48e-03 | 0.150 | <1.00e-05 | < | <1.00e-05 | < | | |
| EOSPr | 0.41 | 1.13e-01 | 1.000 | | 0.331 | | | | |
| BASOPr | 0.47 | 9.60e-04 | 0.019 | <1.00e-05 | <2.00e-04 | <1.00e-05 | <2.00e-04 | | |
| LYMPH | 0.52 | 4.10e-04 | 0.008 | <1.00e-05 | <2.00e-04 | <1.00e-05 | <2.00e-04 | | |
| NEU | 0.35 | 0.062 | 1.000 | | 0.066 | | | | |
| MONO | 0.43 | 0.012 | 0.236 | 2.40e-05 | | <1.00e-05 | < | | |
| EOS | 0.39 | 0.212 | 1.000 | 0.080 | 1.000 | | 0.136 | | |
| BASO | 0.50 | 1.20e-05 | 2.40e-04 | <1.00e-05 | <2.00e-04 | <1.00e-05 | <2.00e-04 | | |
Note: The newly found associations by our method at the significance level 0.05 are marked as bold. Since we mimic the missingness for phenotypes in the top part of the table, we are able to compare our performance with the oracle. In the bottom part of the table, a subset of the subjects has a missing phenotype; hence, the oracle columns are empty.