Célia Escribe, Tianyuan Lu, Julyan Keller-Baruch, Vincenzo Forgetta, Bowei Xiao, J Brent Richards, Sahir Bhatnagar, Karim Oualkacha, Celia M T Greenwood.
Abstract
Medical research increasingly includes high-dimensional regression modeling with a need for error-in-variables methods. The Convex Conditioned Lasso (CoCoLasso) utilizes a reformulated Lasso objective function and an error-corrected cross-validation to enable error-in-variables regression, but requires heavy computations. Here, we develop a Block coordinate Descent Convex Conditioned Lasso (BDCoCoLasso) algorithm for modeling high-dimensional data that are only partially corrupted by measurement error. This algorithm separately optimizes the estimation of the uncorrupted and corrupted features in an iterative manner to reduce computational cost, with a specially calibrated formulation of cross-validation error. Through simulations, we show that the BDCoCoLasso algorithm successfully copes with much larger feature sets than CoCoLasso, and as expected, outperforms the naïve Lasso with enhanced estimation accuracy and consistency, as the intensity and complexity of measurement errors increase. Also, a new smoothly clipped absolute deviation penalization option is added that may be appropriate for some data sets. We apply the BDCoCoLasso algorithm to data selected from the UK Biobank. We develop and showcase the utility of covariate-adjusted genetic risk scores for body mass index, bone mineral density, and lifespan. We demonstrate that by leveraging more information than the naïve Lasso in partially corrupted data, the BDCoCoLasso may achieve higher prediction accuracy. These innovations, together with an R package, BDCoCoLasso, make error-in-variables adjustments more accessible for high-dimensional data sets. We posit the BDCoCoLasso algorithm has the potential to be widely applied in various fields, including genomics-facilitated personalized medicine research.
Keywords: Lasso; estimation accuracy; high dimension; measurement error; variable selection
Year: 2021 PMID: 34468045 PMCID: PMC9292988 DOI: 10.1002/gepi.22430
Source DB: PubMed Journal: Genet Epidemiol ISSN: 0741-0395 Impact factor: 2.344
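The abstract describes alternating between the uncorrupted and corrupted feature blocks. The following is a minimal Python sketch of that idea for additive error only; it is not the authors' implementation or the BDCoCoLasso R package. It assumes the additive error is independent with known standard deviation `tau`, corrects only the Gram matrix of the corrupted block, and swaps in simple eigenvalue clipping for the positive semidefinite projection. All function names and the demo settings are hypothetical.

```python
import numpy as np

def soft_threshold(x, lam):
    """Scalar soft-thresholding operator used by the Lasso update."""
    return np.sign(x) * max(abs(x) - lam, 0.0)

def cd_quadratic_lasso(S, rho, lam, beta, n_sweeps=50):
    """Coordinate descent for 0.5*b'Sb - rho'b + lam*||b||_1 (S needs a positive diagonal)."""
    beta = beta.copy()
    for _ in range(n_sweeps):
        for j in range(len(beta)):
            # Correlation with the partial residual, excluding feature j itself.
            partial = rho[j] - S[j] @ beta + S[j, j] * beta[j]
            beta[j] = soft_threshold(partial, lam) / S[j, j]
    return beta

def project_psd(S, eps=1e-6):
    """Assumption: eigenvalue clipping as a simple stand-in for a PSD projection."""
    w, V = np.linalg.eigh((S + S.T) / 2)
    return (V * np.maximum(w, eps)) @ V.T

def block_cd_corrected_lasso(X_clean, Z_noisy, y, tau, lam, n_outer=20):
    """Alternate Lasso updates between the clean block and the error-corrected noisy block."""
    n = len(y)
    S11 = X_clean.T @ X_clean / n
    S12 = X_clean.T @ Z_noisy / n
    # Subtract the error variance from the noisy block's Gram matrix, then make it PSD.
    S22 = project_psd(Z_noisy.T @ Z_noisy / n - tau**2 * np.eye(Z_noisy.shape[1]))
    rho1 = X_clean.T @ y / n
    rho2 = Z_noisy.T @ y / n
    b1 = np.zeros(X_clean.shape[1])
    b2 = np.zeros(Z_noisy.shape[1])
    for _ in range(n_outer):
        b1 = cd_quadratic_lasso(S11, rho1 - S12 @ b2, lam, b1)    # clean block
        b2 = cd_quadratic_lasso(S22, rho2 - S12.T @ b1, lam, b2)  # corrupted block
    return b1, b2

# Small illustrative run (all settings hypothetical): 8 clean and 4 noisy features.
rng = np.random.default_rng(0)
n, p_clean, p_noisy, tau = 5000, 8, 4, 0.5
X = rng.standard_normal((n, p_clean + p_noisy))
beta_true = np.zeros(p_clean + p_noisy)
beta_true[0] = 1.5
beta_true[p_clean] = 2.0  # a causal feature that is observed with error
y = X @ beta_true + rng.standard_normal(n)
X_clean = X[:, :p_clean]
Z_noisy = X[:, p_clean:] + tau * rng.standard_normal((n, p_noisy))

b1, b2 = block_cd_corrected_lasso(X_clean, Z_noisy, y, tau=tau, lam=0.02)
# Setting tau=0 skips the correction, which mimics a naive Lasso on the noisy data.
b1_naive, b2_naive = block_cd_corrected_lasso(X_clean, Z_noisy, y, tau=0.0, lam=0.02)
print("corrected:", b2[0], "naive:", b2_naive[0])
```

In this sketch only the corrupted block's Gram matrix has to be projected, which is one plausible reading of how a block-wise scheme could reduce computation when few features are measured with error.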
Table 1. Summary of simulation design
| Err. | No. Obs. | No. Fts. | No. Causal Fts. | % Fts. with Additive Err. | % Fts. Missing | $\beta$ | $\tau$ | $r$ |
|---|---|---|---|---|---|---|---|---|
| Additive | 10,000 | 200 | 6 | 10 | | $\beta_0$ | $\tau_0$ | |
| Missing | 10,000 | 200 | 6 | | 10 | $\beta_0$ | | … |
| Additive | 1000 | 2000 | 6 | 10 | | $\beta_0$ | $\tau_0$ | |
| Missing | 1000 | 2000 | 6 | | 10 | $\beta_0$ | | $r_{0,\text{high-dim}}$ |
| Additive | 10,000 | 200 | 5%, 20% | 10, 20, 50 | | $N(0, 1)$ | 0.2 | |
| Missing | 10,000 | 200 | 5%, 20% | | 10, 20, 50 | $N(0, 1)$ | | 0.2 |
| Additive | 10,000 | 500, 1000 | 5% | 10, 20, 50 | | $N(0, 1)$ | 0.2 | |
| Missing | 10,000 | 500, 1000 | 5% | | 10, 20, 50 | $N(0, 1)$ | | 0.2 |
| Mixed | 10,000 | 200 | 5% | 10, 20, 50 | 10, 20, 50 | $N(0, 1)$ | 0.2, 0.5, 0.8 | 0.2, 0.5 |
Note: All simulations were replicated 100 times in each of the autoregressive and symmetric covariance settings. $\beta_0 = (3, 1.5, 0, 0, 2, 0, \ldots, 0, 2, 0, 0, 1.5, 3)$; $\tau_0 \in \{0, 0.05, 0.10, \ldots, 0.70, 0.75, 0.80\}$; and $r_{0,\text{high-dim}} \in \{0, 0.05, 0.10, \ldots, 0.30, 0.35, 0.40\}$, a narrower range because, with a small number of observations, high missing rates leave some features completely missing.
Abbreviations: Err., errors; Fts., features; Obs., observations.
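As a rough illustration of the single-error rows of the simulation design, the Python sketch below generates features with an autoregressive covariance, uses the six-feature effect vector from the note, and corrupts 10% of the features with either additive error or missingness. Several details are assumptions not stated in the source: the autoregressive correlation `ar_rho = 0.5`, unit outcome noise, Gaussian features, random choice of which features are corrupted, and treating the additive error level as a standard deviation. The function name `simulate_design` is hypothetical.

```python
import numpy as np

def simulate_design(n=10_000, p=200, frac_err=0.10, kind="additive",
                    level=0.2, ar_rho=0.5, noise_sd=1.0, seed=0):
    """Simulate one data set with a fraction of features corrupted (sketch, see assumptions)."""
    rng = np.random.default_rng(seed)

    # Autoregressive covariance: Sigma[i, j] = ar_rho ** |i - j| (ar_rho is an assumption).
    idx = np.arange(p)
    sigma = ar_rho ** np.abs(idx[:, None] - idx[None, :])
    X = rng.standard_normal((n, p)) @ np.linalg.cholesky(sigma).T

    # Effect vector from the table note: (3, 1.5, 0, 0, 2, 0, ..., 0, 2, 0, 0, 1.5, 3).
    beta = np.zeros(p)
    beta[[0, 1, 4]] = [3.0, 1.5, 2.0]
    beta[[-5, -2, -1]] = [2.0, 1.5, 3.0]
    y = X @ beta + noise_sd * rng.standard_normal(n)

    # Pick which features are measured with error (random choice is an assumption).
    n_err = int(round(frac_err * p))
    cols = np.sort(rng.choice(p, size=n_err, replace=False))

    Z = X.copy()
    if kind == "additive":
        # Additive error with standard deviation `level` (interpretation assumed).
        Z[:, cols] += level * rng.standard_normal((n, n_err))
    elif kind == "missing":
        # Each entry of a corrupted feature is missing with probability `level`.
        block = Z[:, cols]
        block[rng.random((n, n_err)) < level] = np.nan
        Z[:, cols] = block
    else:
        raise ValueError("kind must be 'additive' or 'missing'")
    return Z, y, beta, cols

# Example: a smaller data set with 20% missingness in 10% of the features.
Z, y, beta, cols = simulate_design(n=2000, p=200, kind="missing", level=0.2, seed=1)
print(Z.shape, len(cols), np.isnan(Z).mean())
```

The mixed-error row, which corrupts separate feature sets with additive error and missingness, is not covered by this sketch.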
Figure 1. Performance of BDCoCoLasso, BDCoCoLasso‐SCAD, and Lasso with increasing additive error ($\tau$) and missing rates ($r$) for the simulation scenarios in the first four rows of Table 1, where six features were assigned to be causal with large effect sizes. Panels (a) and (b) show squared bias. Dots denote median total‐mean‐square error and error bars show the interquartile range based on 100 replicates in each simulation setting. When $\tau = 0$ or $r = 0$, no measurement error exists. All simulations were based on (a) 10,000 observations of 200 features or (b) 1000 observations of 2000 features where 10% of the features were measured with error. In (b), as $r$ increased, all observations of a feature were frequently missing; therefore, scenarios with $r > 0.40$ were not explored. Comparison of running time in (c) the lower‐dimensional settings and (d) the higher‐dimensional settings indicates the substantially improved computational efficiency of the BDCoCoLasso over CoCoLasso. Running time was summarized over all replicates in each simulation setting. All methods were implemented using a 2.6‐GHz quad‐core processor. BDCoCoLasso, Block coordinate Descent Convex Conditioned Lasso; CoCoLasso, Convex Conditioned Lasso; SCAD, smoothly clipped absolute deviation.
Figure 2. Squared bias of BDCoCoLasso and Lasso with higher error rates and weaker signals (rows 5 and 6 in Table 1). Dots and triangles denote median total‐mean‐square error and error bars denote interquartile range based on 100 replicates in each simulation setting. Error rates denote the fractions of features measured with either additive error or missing data. Causal features denote the fractions of features assigned to be causal. Effect sizes of causal features were sampled from a standardized normal distribution. All simulations were based on 10,000 observations of 200 features. BDCoCoLasso, Block coordinate Descent Convex Conditioned Lasso.
Figure 3. Squared bias of BDCoCoLasso and Lasso with high‐dimensional feature sets of 500 or 1000 features (rows 7 and 8 in Table 1). Dots denote median total‐mean‐square error and error bars denote interquartile ranges based on 100 replicates in each simulation setting. Error rates denote the fractions of features measured with either additive error or missing data. In all simulation settings, 5% of the features were assigned to be causal with effect sizes sampled from a standardized normal distribution. Features were simulated to have an autoregressive covariance matrix. BDCoCoLasso, Block coordinate Descent Convex Conditioned Lasso.
Figure 4. Squared bias of BDCoCoLasso and Lasso in the mixed error setting using a three‐block coordinate descent algorithm. Dots and triangles denote median total‐mean‐square error and error bars denote interquartile range based on 100 replicates in each simulation setting. Additive error rates and missing error rates were set to be equal, taking values of 0.1, 0.2, or 0.5. When both the additive error rate and the missing error rate are 0.5, all features are measured with error and a two‐block coordinate descent algorithm supplants the three‐block coordinate descent algorithm. All simulations were based on 10,000 observations of 200 features. In each setting, 5% of the features were assigned to be causal with effect sizes sampled from a standardized normal distribution. BDCoCoLasso, Block coordinate Descent Convex Conditioned Lasso.
Figure 5. Comparison of Lasso and BDCoCoLasso in developing a covariate‐adjusted genetic risk score for the body mass index z score. (a) Summary of missing rates for the covariates in the training data set (left). The test data set does not have missing data. The coefficients of these covariates estimated by Lasso based on complete observations (second panel; N = 895) or mean imputation (third panel; N = 3000), and by BDCoCoLasso (rightmost panel; N = 3000) on the training data set are aligned. (b) Comparison of model metrics for the Lasso and BDCoCoLasso models. Standard errors of the proportion of variance explained and root‐mean‐square error were generated using 100 bootstrap replicates of the test data set (N = 1500). The five models were evaluated on the same bootstrap replicates. (c) Comparison of running time in logarithmic scale. All methods were implemented using a 2.6‐GHz quad‐core processor. BDCoCoLasso, Block coordinate Descent Convex Conditioned Lasso; FEV1, forced expiratory volume in 1 s; FVC, forced vital capacity.
Figure 6. Comparison of predictive performance of covariate‐adjusted genetic risk scores for bone mineral density in identifying individuals who had fractures. (a) Receiver operating characteristic curves and (b) precision‐recall curves. Scores were evaluated based on the test data set (N = 1500). Other model metrics are provided in Figure S6. BDCoCoLasso, Block coordinate Descent Convex Conditioned Lasso.
Figure 7. Comparison of predictive performance of covariate‐adjusted genetic risk scores for lifespan. (a) Kaplan–Meier curves for time to maternal death and (b) Kaplan–Meier curves for time to paternal death. Parents of individuals with scores in the highest 20% (predicted to be the most likely to live longer) and in the lowest 20% (predicted to be the least likely to live longer) were compared. Hazard ratios (HRs) were estimated based on standardized covariate‐adjusted genetic risk score using Cox regression models. Scores were evaluated based on the test data set (N = 1500). Other model metrics are provided in Figure S7.