Xiaoyan Yin, Daniel Levy, Christine Willinger, Aram Adourian, Martin G Larson.
Abstract
Multivariable analysis of proteomics data using standard statistical models is hindered by the presence of incomplete data. We faced this issue in a nested case-control study of 135 incident cases of myocardial infarction and 135 pair-matched controls from the Framingham Heart Study Offspring cohort. Plasma protein markers (K = 861) were measured on the case-control pairs (N = 135), and the majority of proteins had missing expression values for a subset of samples. In the setting of many more variables than observations (K ≫ N), we explored and documented the feasibility of multiple imputation approaches along with subsequent analysis of the imputed data sets. Initially, we selected proteins with complete expression data (K = 261) and randomly masked some values as the basis of simulation to tune the imputation and analysis process. We randomly shuffled proteins into several bins, performed multiple imputation within each bin, and followed up with stepwise selection using conditional logistic regression within each bin. This process was repeated hundreds of times. We determined the optimal method of multiple imputation, number of proteins per bin, and number of random shuffles using several performance statistics. We then applied this method to 544 proteins with incomplete expression data (≤ 40% missing values), from which we identified a panel of seven proteins that were jointly associated with myocardial infarction.
Keywords: high dimension; imputation quality; multiple imputation; stepwise selection
Year: 2015 PMID: 26565662 PMCID: PMC4777663 DOI: 10.1002/sim.6800
Source DB: PubMed Journal: Stat Med ISSN: 0277-6715 Impact factor: 2.373
Figure 1. Flow chart of simulation process. Using 261 biomarkers with complete data, we introduced missing values and studied parameters for successful imputation. The imputation parameters included the number of variables per bin, FCS versus MCMC, use of prior information in MCMC, and the number of shuffles. From evaluating these parameters, we determined that 25 was the ‘best’ bin size and that MCMC with prior information worked better than MCMC without prior information or FCS. Using these settings, we imputed data and studied the parameters for stepwise selection.
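The shuffling step in this flow chart can be illustrated with a minimal sketch. It assumes that a shuffle is a random permutation of the marker list followed by a split into bins of roughly equal size; the function name `shuffle_into_bins`, the round-robin split, and the placeholder marker names are illustrative assumptions, not the authors' implementation.

```python
import random


def shuffle_into_bins(markers, bin_size=25, seed=None):
    """Randomly partition marker names into bins of roughly bin_size.

    Hypothetical helper: the source does not say how bins were formed
    beyond random shuffling into bins of about 25 markers.
    """
    rng = random.Random(seed)
    shuffled = list(markers)
    rng.shuffle(shuffled)
    n_bins = max(1, round(len(shuffled) / bin_size))
    # Deal markers round-robin so bin sizes differ by at most one.
    return [shuffled[i::n_bins] for i in range(n_bins)]


# Placeholder names standing in for the 261 complete-data biomarkers.
markers = [f"marker_{i}" for i in range(261)]
bins = shuffle_into_bins(markers, bin_size=25, seed=0)
print(len(bins), sorted(len(b) for b in bins))
```

Under these assumptions, each shuffle yields a fresh partition, so a given marker is imputed and tested alongside a different set of companions every time.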
Figure 2. Imputation quality metric: mean distance ratio (DR) by bin size and imputation method. Results are based on simulated data. For each combination of bin size and prior information, ten shuffles were performed and ten datasets were imputed within each bin. Bin‐specific datasets were merged within each shuffle. Mean DR was calculated across the ten datasets within each shuffle. (Smaller DR indicates less distance between the imputed data and original data.) Mean DR decreased with increasing bin size. MCMC with prior information generated data with smaller DR than either FCS or MCMC without priors. Failure to finish imputation was common with bin sizes of 25 and 30 when no prior was used in MCMC.
Figure 3. Imputation quality metric: mean absolute difference in correlations (MADC) by bin size and imputation method. MADC decreased with increasing bin size. MCMC with prior information generated data with smaller differences in correlation matrices than either FCS or MCMC without priors. The mean and standard deviation of the absolute correlation coefficients in complete data are 0.13 and 0.10.
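One plausible reading of MADC is sketched below: compute the Pearson correlation for every pair of markers in the complete data and in the imputed data, then average the absolute differences. The exact definition used in the paper is not given here, so the pairwise Pearson choice, the function names `pearson` and `madc`, and the toy values are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (no constant columns)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def madc(original, imputed):
    """Mean absolute difference in pairwise correlations.

    original, imputed: dicts mapping marker name -> list of values,
    with the same markers and the same sample order in both.
    """
    diffs = [
        abs(pearson(original[a], original[b]) - pearson(imputed[a], imputed[b]))
        for a, b in combinations(sorted(original), 2)
    ]
    return sum(diffs) / len(diffs)


# Toy data (invented): three markers measured on four samples.
original = {"A": [1.0, 2.0, 3.0, 4.0], "B": [2.0, 1.0, 4.0, 3.0], "C": [4.0, 3.0, 1.0, 2.0]}
# Pretend the third value of B was masked and imputed as 3.5.
imputed = {**original, "B": [2.0, 1.0, 3.5, 3.0]}
print(madc(original, imputed))
```

A value of zero means the imputed data reproduce the original correlation structure exactly; larger values indicate more distortion.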
Figure 4. Kappa statistics for model selection results based on complete data of 261 biomarkers by number of shuffles into bins of size 25. Kappa quantifies the agreement between selection results based on specific numbers of shuffles versus a gold standard obtained from 7200 shuffles. We ran stepwise selection within bins holding ~25 biomarkers. Given n shuffles, we called a biomarker ‘important’ if it was selected in at least 40%, 50%, or 60% of the n shuffles. From these results, we concluded that 140 shuffles were sufficient for a stable panel.
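The thresholding rule in this caption can be sketched as follows: count how often each biomarker is selected across shuffles and keep those whose inclusion frequency meets the threshold. The function name `important_markers` and the toy selections are hypothetical; only the rule itself comes from the caption.

```python
from collections import Counter


def important_markers(selections, threshold):
    """Return markers selected in at least `threshold` of the shuffles.

    selections: list of sets, one per shuffle, holding the markers that
    stepwise selection retained in any bin of that shuffle.
    """
    n = len(selections)
    counts = Counter(m for chosen in selections for m in chosen)
    return {m for m, c in counts.items() if c / n >= threshold}


# Toy example (invented): five shuffles, three candidate markers.
selections = [{"A", "B"}, {"A"}, {"A", "C"}, {"A", "B"}, {"B"}]
# Inclusion frequencies: A = 4/5, B = 3/5, C = 1/5.
print(sorted(important_markers(selections, 0.5)))
print(sorted(important_markers(selections, 0.7)))
```

Raising the threshold shrinks the panel, which matches the pattern of fewer selected markers at stricter thresholds.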
Figure 5. Kappa statistics based on imputed data for the 261 biomarkers after introducing missing values. Kappa values compare the presence/absence of biomarkers in the gold standard model (complete data, filled symbols) versus the imputed data (20 imputations, open symbols). For each combination of number of shuffles (n) and selection threshold (t), kappa from the complete data is the average from ten replications. Kappa values were calculated by comparing selection results with a gold standard (as in Figure 4). Kappa values are generally smaller than those based on the complete data. It appears that kappa does not improve with more than 140 shuffles.
Number of biomarkers selected in conditional logistic models per bin and per shuffle.
| Statistic | Observed complete data, per bin | Observed complete data, per shuffle | Imputed data, per bin | Imputed data, per shuffle |
|---|---|---|---|---|
| Min | 0 | 13 | 0 | 11 |
| Q1 | 1 | 18 | 1 | 16 |
| Median | 2 | 19 | 2 | 18 |
| Q3 | 3 | 21 | 2 | 20 |
| Max | 10 | 28 | 9 | 26 |
The imputed data are based on 135 case–control pairs, 261 biomarkers (≤40% missingness), a bin size of 25, and imputation using MCMC with priors.
Final panel of biomarkers jointly predicting myocardial infarction status in a multiple‐marker conditional logistic model.
| Marker (gene name) | Missing values | Inclusion | Single marker p-value | Final model using RR approach: odds ratio | Final model using RR approach: 95% CI | Final model using RR approach: p-value |
|---|---|---|---|---|---|---|
| | 11% | 100% | 0.0010 | 0.40 | (0.18, 0.88) | 0.023 |
| | 0% | 99% | 0.0012 | 0.50 | (0.29, 0.86) | 0.013 |
| | 40% | 98% | 0.019 | 0.40 | (0.20, 0.81) | 0.012 |
| | 9% | 97% | 0.0053 | 0.38 | (0.17, 0.83) | 0.017 |
| | 0% | 90% | 0.036 | 2.43 | (1.25, 4.70) | 0.009 |
| | 8% | 85% | 0.0056 | 0.34 | (0.16, 0.75) | 0.008 |
| | 0% | 79% | 0.038 | 2.10 | (1.19, 3.70) | 0.011 |
Inclusion frequency was calculated from 260 shuffles of 544 markers (stage 1) and rounded to the nearest integer percentage.
The final model (stage 2) was based on 135 case–control pairs and 50 imputed datasets with the 26 most frequently included markers. The model was adjusted for BMI, diabetes status, HDL cholesterol, hypertension treatment, systolic blood pressure, and total cholesterol.
| Chosen by gold standard | No | Yes | Total | No | Yes | Total | No | Yes | Total |
|---|---|---|---|---|---|---|---|---|---|
| No | 239 | 2 | 241 | 244 | 1 | 245 | 249 | 1 | 250 |
| Yes | 1 | 19 | 20 | 0 | 16 | 16 | 0 | 11 | 11 |
| Total | 240 | 21 | 261 | 244 | 17 | 261 | 249 | 12 | 261 |
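To show how an agreement table of this kind maps to a kappa value, here is a minimal sketch. It assumes that the kappa reported is Cohen's kappa for a 2×2 table, which the text does not state explicitly. The function name `cohen_kappa` and its argument names are illustrative; the counts are the cell values of the first column group above.

```python
def cohen_kappa(both_no, row_no_col_yes, row_yes_col_no, both_yes):
    """Cohen's kappa for a 2x2 agreement table given its four cell counts."""
    n = both_no + row_no_col_yes + row_yes_col_no + both_yes
    observed = (both_no + both_yes) / n
    row_no, row_yes = both_no + row_no_col_yes, row_yes_col_no + both_yes
    col_no, col_yes = both_no + row_yes_col_no, row_no_col_yes + both_yes
    # Agreement expected by chance from the row and column margins.
    expected = (row_no * col_no + row_yes * col_yes) / n**2
    return (observed - expected) / (1 - expected)


# Cells of the first column group: 239 (No/No), 2 (No/Yes), 1 (Yes/No), 19 (Yes/Yes).
kappa = cohen_kappa(239, 2, 1, 19)
print(round(kappa, 3))
```

Because most of the 261 biomarkers are rejected by both procedures, raw agreement is very high by chance alone; kappa corrects for that, which is why it is a more informative summary than percent agreement here.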