Block coordinate descent algorithm improves variable selection and estimation in error-in-variables regression.

Célia Escribe^1,2, Tianyuan Lu^1,3, Julyan Keller-Baruch^1,4, Vincenzo Forgetta^1, Bowei Xiao^1,3, J Brent Richards^1,4,5,6, Sahir Bhatnagar^5,7, Karim Oualkacha^8, Celia M T Greenwood^1,4,5,9.

Abstract

Medical research increasingly includes high-dimensional regression modeling with a need for error-in-variables methods. The Convex Conditioned Lasso (CoCoLasso) utilizes a reformulated Lasso objective function and an error-corrected cross-validation to enable error-in-variables regression, but requires heavy computations. Here, we develop a Block coordinate Descent Convex Conditioned Lasso (BDCoCoLasso) algorithm for modeling high-dimensional data that are only partially corrupted by measurement error. This algorithm separately optimizes the estimation of the uncorrupted and corrupted features in an iterative manner to reduce computational cost, with a specially calibrated formulation of cross-validation error. Through simulations, we show that the BDCoCoLasso algorithm successfully copes with much larger feature sets than CoCoLasso, and as expected, outperforms the naïve Lasso with enhanced estimation accuracy and consistency, as the intensity and complexity of measurement errors increase. Also, a new smoothly clipped absolute deviation penalization option is added that may be appropriate for some data sets. We apply the BDCoCoLasso algorithm to data selected from the UK Biobank. We develop and showcase the utility of covariate-adjusted genetic risk scores for body mass index, bone mineral density, and lifespan. We demonstrate that by leveraging more information than the naïve Lasso in partially corrupted data, the BDCoCoLasso may achieve higher prediction accuracy. These innovations, together with an R package, BDCoCoLasso, make error-in-variables adjustments more accessible for high-dimensional data sets. We posit the BDCoCoLasso algorithm has the potential to be widely applied in various fields, including genomics-facilitated personalized medicine research.

Keywords: Lasso; estimation accuracy; high dimension; measurement error; variable selection

Year: 2021 PMID: 34468045 PMCID: PMC9292988 DOI: 10.1002/gepi.22430

Source DB: PubMed Journal: Genet Epidemiol ISSN: 0741-0395 Impact factor: 2.344

INTRODUCTION

Modern medical research is increasingly built on modeling of high‐dimensional data. Sparse regression methods, such as the Lasso (Tibshirani, 1996), Generalized Lasso (Tibshirani et al., 2011), Grouped Lasso (Yuan & Lin, 2006), adaptive Lasso (Zou, 2006), and Elastic Net (Zou & Hastie, 2005), have been widely applied to perform estimation and variable selection at the same time. However, high‐dimensional data sets often contain less precise measurements of phenotypes than those that might be available in smaller studies. For example, large biobanks often use billing codes from electronic health care records as proxy measures for a physician‐made diagnosis. It is well known that applying naïve regression methods to predictor variables that are measured with error can lead to attenuation of effect estimates (Chesher, 1991; Rosenbaum et al., 2010). Analogously, questionnaire data from large cohorts often contain many missing values (Obermeyer & Emanuel, 2016). Removing subjects who are missing at least one measurement can easily lead to removal of most subjects when data are high dimensional. Many error‐in‐variables solutions have been proposed. In addition to simple complete case analysis and pairwise deletion, more rigorous methods, such as expectation‐maximization algorithms (Dempster, 1977; Schafer, 1997), multiple imputation methods (Buuren, 2011), and full information maximum likelihood estimation (Enders, 2001; Friedman et al., 2010), have been developed, but these computationally expensive methods cannot be easily extended to high‐dimensional settings. In contrast, Loh and Wainwright (2011) developed a penalized method for error‐in‐variables regression. Within a properly chosen constraint radius, a projected gradient descent algorithm will converge to a small neighborhood of the set of all global minimizers, and is promising for variable selection in a high‐dimensional setting (Loh & Wainwright, 2011). 
Nevertheless, proper choice of this constraint radius depends on knowledge of the parameters yet to be estimated (Datta et al., 2017). Hence, Datta and Zou (2017) developed the Convex Conditioned Lasso (CoCoLasso), which does not require prior knowledge of the unknown parameters. The CoCoLasso algorithm can correct for both additive measurement error and missing data, and showed a substantial increase in estimation accuracy and stability compared with the naïve Lasso. However, when the data are only partially corrupted (i.e., some features are free of measurement error), the CoCoLasso still performs estimation for all features in an undifferentiated manner, and the intensive matrix computations required limit its use with large feature sets. Such partial corruption is common in genetic epidemiology studies based on large genotyped cohorts, where the genotypes are accurately measured by highly reliable high‐throughput sequencing or microarrays, but lifestyle or clinical risk factors (except for age and sex) are measured with various types of error. For instance, in the UK Biobank, one of the largest health registries to date, hundreds of thousands of single nucleotide polymorphisms (SNPs) were accurately measured for each participant with little missing data, but most covariates based on questionnaires or health care records contained missing data (Bycroft et al., 2018). Samples with such corrupted covariates are usually discarded, potentially leading to underuse of information. Therefore, inspired by the CoCoLasso, we propose here a Block coordinate Descent Convex Conditioned Lasso (BDCoCoLasso) algorithm that makes it possible to perform higher‐dimensional error‐in‐variables regressions by separately optimizing the coefficient estimates for uncorrupted and corrupted features in an iterative manner. Our proposal requires the implementation of a carefully calibrated cross‐validation strategy. 
Furthermore, we build in the smoothly clipped absolute deviation (SCAD) penalty (Fan & Li, 2001) in the new algorithm. In simulations, we confirm that our algorithm provides equivalent results to the CoCoLasso, and demonstrates better performance than the naïve Lasso, with increasing benefit as the dimension increases. Although this approach will still encounter computational limitations for many corrupted features, we substantially enlarge the magnitude of problems that can be analyzed with an error‐in‐variables approach. We demonstrate the potential practical utility of the BDCoCoLasso by deriving covariate‐adjusted genetic risk scores to predict body mass index, bone mineral density, and lifespan in a subset of the UK Biobank (Bycroft et al., 2018). The rest of the manuscript is organized as follows. In Section 2, we briefly review the CoCoLasso method, and then we describe our new version that allows blocks of features with different corruption states—BDCoCoLasso. We describe simulation settings and results in Section 3. Section 4 illustrates the performance of our algorithm on the UK Biobank data.

METHODS

In this section, we first review the principles of the CoCoLasso. We then seek to improve its computational efficiency and stability when the covariate matrix is partially corrupted or when different types of measurement error exist simultaneously, by implementing a block coordinate descent algorithm (Rosenbaum et al., 2013). We also implement a SCAD penalty (Fan & Li, 2001) to avoid overshrinkage when some features have strong effects.

The CoCoLasso

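Suppose a true covariate matrix $X \in \mathbb{R}^{n \times p}$, with $n$ observations and $p$ features, is measured as a corrupted covariate matrix $Z$, where measurement error can be:

Additive error: $Z = X + A$, where $A$ represents additive error;

Missing data: $Z = (z_{ij})$, where $z_{ij} = x_{ij}$ or $z_{ij}$ is missing.

It has been shown that applying a classical Lasso, with an objective function taking the form

$$\hat{\beta} = \arg\min_{\beta} \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1, \tag{1}$$

directly to the corrupted matrix $Z$ in place of $X$ can lead to biased estimation of $\beta$ (Datta et al., 2017; Loh & Wainwright, 2011), where $y$ is the continuous response. Alternatively, this objective function can be reformulated as

$$\hat{\beta} = \arg\min_{\beta} \frac{1}{2} \beta^T \Sigma \beta - \rho^T \beta + \lambda \lVert \beta \rVert_1, \tag{2}$$

where $\Sigma = \frac{1}{n} X^T X$ and $\rho = \frac{1}{n} X^T y$. Loh and Wainwright (2011) proposed that $\Sigma$ and $\rho$ could be replaced by their unbiased estimators $\hat{\Sigma}$ and $\tilde{\rho}$ such that $E(\hat{\Sigma}) = \Sigma$ and $E(\tilde{\rho}) = \rho$. However, since the new covariance matrix $\hat{\Sigma}$ can have negative eigenvalues, particularly when the covariate matrix is high dimensional ($p > n$), the new optimization problem with the objective function $\frac{1}{2} \beta^T \hat{\Sigma} \beta - \tilde{\rho}^T \beta + \lambda \lVert \beta \rVert_1$ is not necessarily convex. Loh and Wainwright (2011) showed that by setting certain constraints on $\beta$, the problem could become convex, yet it is necessary to have prior knowledge of $\beta$ to find a suitable constraint. Datta and Zou (2017) therefore proposed the CoCoLasso that adopts the adapted objective function but finds a nearest positive semidefinite matrix for $\hat{\Sigma}$:

$$\hat{\beta} = \arg\min_{\beta} \frac{1}{2} \beta^T \tilde{\Sigma} \beta - \tilde{\rho}^T \beta + \lambda \lVert \beta \rVert_1, \tag{3}$$

where

$$\tilde{\Sigma} = (\hat{\Sigma})_+ = \arg\min_{\Sigma_1 \succeq 0} \lVert \hat{\Sigma} - \Sigma_1 \rVert_{\max}. \tag{4}$$

Here, the elementwise maximum norm for a matrix $B = (b_{ij})$ is defined as $\lVert B \rVert_{\max} = \max_{i,j} |b_{ij}|$. This nearest positive semidefinite matrix can then be solved by an alternating direction method of multipliers (ADMM) algorithm (Boyd et al., 2011).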
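To make the convexity issue concrete, the following is a minimal sketch, assuming i.i.d. additive error with known standard deviation `tau`. It builds the error-corrected covariance surrogate and shows that it is indefinite when there are more features than observations. The repair step uses eigenvalue clipping, which is the Frobenius-norm projection onto the positive semidefinite cone; this is a simpler stand-in and is not the elementwise max-norm ADMM projection used by the CoCoLasso. The function names are hypothetical.

```python
import numpy as np

def additive_surrogate(Z, y, tau):
    """Error-corrected surrogates, assuming Z = X + A with A_ij ~ N(0, tau^2) i.i.d."""
    n, p = Z.shape
    sigma_hat = Z.T @ Z / n - tau**2 * np.eye(p)  # unbiased for X'X/n, may be indefinite
    rho_tilde = Z.T @ y / n                       # unbiased for X'y/n when A is independent of y
    return sigma_hat, rho_tilde

def clip_to_psd(S):
    """Frobenius-norm PSD projection by eigenvalue clipping (stand-in for the max-norm ADMM step)."""
    w, V = np.linalg.eigh((S + S.T) / 2)
    return (V * np.maximum(w, 0.0)) @ V.T

rng = np.random.default_rng(0)
n, p, tau = 50, 100, 0.5                          # p > n, so Z'Z/n is rank deficient
X = rng.standard_normal((n, p))
Z = X + tau * rng.standard_normal((n, p))
y = X[:, :3] @ np.array([3.0, 1.5, 2.0]) + rng.standard_normal(n)

sigma_hat, rho_tilde = additive_surrogate(Z, y, tau)
sigma_psd = clip_to_psd(sigma_hat)
min_eig_before = np.linalg.eigvalsh(sigma_hat).min()
min_eig_after = np.linalg.eigvalsh(sigma_psd).min()
print(min_eig_before, min_eig_after)
```

Because the Gram matrix has rank at most `n`, subtracting `tau**2` on the diagonal pushes the remaining eigenvalues below zero, which is why a projection step is needed before the Lasso-type problem can be solved.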

Two‐block coordinate descent for partially corrupted covariate matrix

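The CoCoLasso enables error‐in‐variables regression in general, but when the feature set is large, the required matrix calculations are demanding. Implementing a block coordinate descent could substantially improve the computational efficiency when the covariate matrix is only partially corrupted. Specifically, projection of the covariance matrix onto a positive semidefinite subspace, that is, $\tilde{\Sigma} = (\hat{\Sigma})_+$, within the CoCoLasso, requires multiple operations on matrices of dimension $p \times p$, which are order $O(p^3)$. In contrast, our BDCoCoLasso requires these operations only on the corrupted subblocks of the covariance matrix, which are anticipated to be much smaller.

Suppose the true covariate matrix $X = (X_1, X_2)$ is now measured as $Z = (Z_1, Z_2)$, where $Z_1 = X_1 \in \mathbb{R}^{n \times p_1}$ is measured without error, and $Z_2 \in \mathbb{R}^{n \times p_2}$ is measured with error. We then need to estimate $\beta = (\beta_1^T, \beta_2^T)^T$, where $\beta_1$ is a coefficient vector for the noncorrupted covariates, and $\beta_2$ is a coefficient vector for the corrupted covariates. We derive the objective function as

$$f(\beta_1, \beta_2) = \frac{1}{2} \beta_1^T \Sigma_{11} \beta_1 + \beta_1^T \hat{\Sigma}_{12} \beta_2 + \frac{1}{2} \beta_2^T \hat{\Sigma}_{22} \beta_2 - \rho_1^T \beta_1 - \tilde{\rho}_2^T \beta_2 + \lambda \lVert \beta_1 \rVert_1 + \lambda \lVert \beta_2 \rVert_1. \tag{5}$$

We conceive a two‐step block coordinate descent algorithm based on (2)–(4).

We first consider $\beta_2$ fixed, and we solve

$$\hat{\beta}_1 = \arg\min_{\beta_1} \frac{1}{2} \beta_1^T \Sigma_{11} \beta_1 - (\rho_1 - \hat{\Sigma}_{12} \beta_2)^T \beta_1 + \lambda \lVert \beta_1 \rVert_1, \tag{6}$$

where $\Sigma_{11} = \frac{1}{n} Z_1^T Z_1$ and $\rho_1 = \frac{1}{n} Z_1^T y$. $\hat{\Sigma}_{12}$ is defined as follows: in the additive error setting, $\hat{\Sigma}_{12} = \frac{1}{n} Z_1^T Z_2$; in the missing‐error setting, specifically, we define a ratio matrix $R = (r_{jl})$ indicating the presence or absence of data as $r_{jl} = n_{jl}/n$ for $j \neq l$ and $r_{jj} = n_j/n$, where $n_{jl}$ is the number of samples for which both the $j$th and the $l$th features are measured and $n_j$ is the number of samples for which the $j$th feature is measured. Note that $R$ is used to correct for measurement error in the corrupted covariates. We then have $\hat{\Sigma}_{12} = \frac{1}{n} Z_1^T Z_2 \oslash R_{12}$ (with missing entries of $Z_2$ set to zero in the matrix product), that is, $(\hat{\Sigma}_{12})_{jl} = \frac{1}{n_{jl}} \sum_i z_{ij} z_{il}$ for $j = 1, \ldots, p_1$ and $l = p_1 + 1, \ldots, p$.

We next consider $\beta_1$ fixed, with a value optimized in the previous step, and we solve

$$\hat{\beta}_2 = \arg\min_{\beta_2} \frac{1}{2} \beta_2^T \tilde{\Sigma}_{22} \beta_2 - (\tilde{\rho}_2 - \hat{\Sigma}_{12}^T \hat{\beta}_1)^T \beta_2 + \lambda \lVert \beta_2 \rVert_1, \tag{7}$$

where $\tilde{\rho}_2$ is an unbiased surrogate of $\rho_2 = \frac{1}{n} X_2^T y$ and $\tilde{\Sigma}_{22} = (\hat{\Sigma}_{22})_+$ is the nearest positive semidefinite matrix of $\hat{\Sigma}_{22}$. For $\hat{\Sigma}_{22}$ and $\tilde{\rho}_2$, in the additive error setting, $\hat{\Sigma}_{22} = \frac{1}{n} Z_2^T Z_2 - \Sigma_A$ and $\tilde{\rho}_2 = \frac{1}{n} Z_2^T y$, where $\Sigma_A$ is a known variance–covariance matrix for features measured with additive error; in the missing error setting, $\hat{\Sigma}_{22} = \frac{1}{n} Z_2^T Z_2 \oslash R_{22}$ and $\tilde{\rho}_2 = \frac{1}{n} Z_2^T y \oslash \mathrm{diag}(R_{22})$. Here, $\oslash$ represents elementwise division. We then alternate between the two steps until convergence. Following similar arguments as in Datta et al. (2017), we can ensure that both problems are equivalent to a Lasso problem. The complete optimization procedure is described in Algorithm 1.

Of note, the estimation problem can be defined as finding the global solution for $\min_{\beta_1, \beta_2} f(\beta_1, \beta_2)$, and our two‐step procedure can be seen as equivalent to replacing $\hat{\Sigma}_{22}$ by its nearest positive semidefinite matrix, $\tilde{\Sigma}_{22}$, in (5). Use of this substitution might not lead to a jointly convex problem. However, since both marginal problems (6) and (7) are convex, and both have suitable properties (i.e., both are strongly convex and smooth), our generalized alternating minimization algorithm can guarantee global minimization (Jain & Kar, 2017; Kelley, 1999).

Cross‐validation to choose the penalization parameter, $\lambda$, must be appropriately implemented for the block implementation. Therefore, extending the concept in CoCoLasso (Datta et al., 2017), a $K$‐fold cross‐validated $\hat{\lambda}$ can be obtained by minimizing the total cross‐validation error while correcting for the two blocks separately,

$$\hat{\lambda} = \arg\min_{\lambda} \frac{1}{K} \sum_{k=1}^{K} \left[ \hat{\beta}_1^{(-k)T} \Sigma_{11}^{(k)} \hat{\beta}_1^{(-k)} + 2 \hat{\beta}_1^{(-k)T} \hat{\Sigma}_{12}^{(k)} \hat{\beta}_2^{(-k)} + \hat{\beta}_2^{(-k)T} \tilde{\Sigma}_{22}^{(k)} \hat{\beta}_2^{(-k)} - 2 \rho_1^{(k)T} \hat{\beta}_1^{(-k)} - 2 \tilde{\rho}_2^{(k)T} \hat{\beta}_2^{(-k)} \right]. \tag{8}$$

Here, $\hat{\beta}_1^{(-k)}$ and $\hat{\beta}_2^{(-k)}$ are estimated as described above for $\beta_1$ and $\beta_2$ based on data not in the $k$th fold; $\Sigma_{11}^{(k)}$, $\rho_1^{(k)}$, $\tilde{\Sigma}_{22}^{(k)}$, and $\tilde{\rho}_2^{(k)}$ are derived as described above for $\Sigma_{11}$, $\rho_1$, $\tilde{\Sigma}_{22}$, and $\tilde{\rho}_2$ based on data in the $k$th fold. $\hat{\Sigma}_{12}^{(k)}$ is an unbiased surrogate of $\frac{1}{n^{(k)}} X_1^T X_2$, where $X_1$ and $X_2$ are in the $k$th fold and $n^{(k)}$ is the number of samples in that fold. More specifically, in the additive error setting, where the additive error is centered to have zero mean, $\hat{\Sigma}_{12}^{(k)} = \frac{1}{n^{(k)}} Z_1^{(k)T} Z_2^{(k)}$; in the missing error setting, $\hat{\Sigma}_{12}^{(k)} = \frac{1}{n^{(k)}} Z_1^{(k)T} Z_2^{(k)} \oslash R_{12}^{(k)}$, where $R_{12}^{(k)}$ is the corresponding block of the ratio matrix computed from the data in the $k$th fold.

Although either an additive error setting or a missing error setting can be approached in the aforementioned two‐step manner, data often contain variables subject to both types of errors. Therefore, we further propose a generalized algorithm that copes with a mixed error setting, described in Supporting Information.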
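The alternating scheme described above can be sketched in a few lines. This is a rough illustration under stated assumptions, not the authors' Algorithm 1 or the BDCoCoLasso R package: it covers only the additive error setting with i.i.d. error of known standard deviation `tau`, uses a fixed penalty `lam` instead of cross-validation, and repairs the corrupted block by eigenvalue clipping rather than the max-norm ADMM projection. All function names are hypothetical.

```python
import numpy as np

def clip_to_psd(S):
    """Eigenvalue clipping; a simple stand-in for the max-norm PSD projection."""
    w, V = np.linalg.eigh((S + S.T) / 2)
    return (V * np.maximum(w, 0.0)) @ V.T

def lasso_cov(S, r, lam, b_init, n_iter=500, tol=1e-9):
    """Coordinate descent for 0.5*b'Sb - r'b + lam*||b||_1, with S assumed PSD."""
    b = b_init.copy()
    for _ in range(n_iter):
        b_old = b.copy()
        for j in range(len(b)):
            if S[j, j] <= 0.0:
                b[j] = 0.0
                continue
            g = r[j] - S[j] @ b + S[j, j] * b[j]              # exclude feature j's own term
            b[j] = np.sign(g) * max(abs(g) - lam, 0.0) / S[j, j]  # soft-thresholding
        if np.max(np.abs(b - b_old)) < tol:
            break
    return b

def two_block_fit(Z1, Z2, y, tau, lam, n_outer=100, tol=1e-7):
    """Alternate between the clean block (Z1) and the corrupted block (Z2)."""
    n, p2 = Z2.shape
    S11, S12 = Z1.T @ Z1 / n, Z1.T @ Z2 / n
    S22 = clip_to_psd(Z2.T @ Z2 / n - tau**2 * np.eye(p2))   # corrected, then made PSD
    r1, r2 = Z1.T @ y / n, Z2.T @ y / n
    b1, b2 = np.zeros(Z1.shape[1]), np.zeros(p2)
    for _ in range(n_outer):
        b1_old, b2_old = b1.copy(), b2.copy()
        b1 = lasso_cov(S11, r1 - S12 @ b2, lam, b1)          # step 1: corrupted block fixed
        b2 = lasso_cov(S22, r2 - S12.T @ b1, lam, b2)        # step 2: clean block fixed
        if max(np.abs(b1 - b1_old).max(), np.abs(b2 - b2_old).max()) < tol:
            break
    return b1, b2

rng = np.random.default_rng(1)
n, p1, p2, tau = 5000, 10, 5, 0.5
X = rng.standard_normal((n, p1 + p2))
beta = np.zeros(p1 + p2)
beta[0], beta[p1] = 3.0, 2.0                                  # one causal feature per block
y = X @ beta + rng.standard_normal(n)
Z1 = X[:, :p1]
Z2 = X[:, p1:] + tau * rng.standard_normal((n, p2))

b1, b2 = two_block_fit(Z1, Z2, y, tau, lam=0.1)
b1_naive, b2_naive = two_block_fit(Z1, Z2, y, 0.0, lam=0.1)   # tau = 0 ignores the error
print(b2[0], b2_naive[0])
```

Only the corrupted block passes through the projection step, so the expensive matrix operation acts on a `p2 × p2` matrix rather than on the full covariance matrix. Setting `tau = 0` reproduces a naïve fit on the same data, which makes the attenuation of the corrupted-block coefficient easy to inspect.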

Implementation of a SCAD penalty

For potential application in scenarios where the causal variables are few but have large effect sizes, using the Lasso penalty may lead to overshrinkage (Fan & Li, 2001). To resolve this potential issue, we have also implemented a nonconcave SCAD penalty (Fan & Li, 2001). The SCAD penalty is given by

$$p_\lambda(\beta_j) = \begin{cases} \lambda |\beta_j|, & |\beta_j| \le \lambda, \\ \dfrac{2a\lambda|\beta_j| - \beta_j^2 - \lambda^2}{2(a-1)}, & \lambda < |\beta_j| \le a\lambda, \\ \dfrac{\lambda^2 (a+1)}{2}, & |\beta_j| > a\lambda, \end{cases} \tag{9}$$

and its first derivative with respect to $|\beta_j|$ is given by

$$p'_\lambda(\beta_j) = \lambda \left\{ I(|\beta_j| \le \lambda) + \frac{(a\lambda - |\beta_j|)_+}{(a-1)\lambda} I(|\beta_j| > \lambda) \right\}. \tag{10}$$

Substituting the regular $\ell_1$ penalty used in the Lasso by the SCAD penalty can retain large coefficients while shrinking smaller coefficients to zero. Thus, the SCAD penalty is able to produce a sparse solution and more accurate estimation for large coefficients. Following Zou and Li (2008), we implement a local linear approximation of the penalization function:

$$p_\lambda(\beta_j) \approx p_\lambda(\hat{\beta}_j^{(0)}) + p'_\lambda(\hat{\beta}_j^{(0)}) \left( |\beta_j| - |\hat{\beta}_j^{(0)}| \right),$$

where $p_\lambda$ and $p'_\lambda$ are given by Equations (9) and (10), respectively, and $\hat{\beta}_j^{(0)}$ is the estimate obtained from the previous iteration. Equivalently, the penalty term becomes $\lambda \sum_j w_j |\beta_j|$ with $w_j = p'_\lambda(\hat{\beta}_j^{(0)}) / \lambda$, where a weight $w_j$ specific to the $j$th feature is introduced to the regular $\ell_1$ penalty and is updated after each iteration. This implementation enables an adaptive BDCoCoLasso. In principle, the hyperparameter $a$ in the SCAD penalty should be estimated through cross‐validation. However, the resulting two‐dimensional cross‐validation would be computationally expensive. Fan and Li (2001) proposed that $a = 3.7$ should be suitable for many problems, and that the algorithm performance does not improve significantly with $a$ selected by data‐driven approaches. We therefore set $a = 3.7$ in all simulations described below. In addition to the SCAD penalty, other weighting schemes, such as the minimax concave penalty (Zhang, 2010), could be implemented in the future for improved generalizability.
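As a small illustration of the local linear approximation, the sketch below evaluates the standard SCAD penalty of Fan and Li (2001), its derivative, and the resulting per-feature weights for a weighted $\ell_1$ penalty. The function names are hypothetical, and dividing the derivative by `lam` so that the weights lie between 0 and 1 is one convention among several, assumed here for readability.

```python
import numpy as np

def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty: linear near zero, quadratic blend, then constant."""
    b = np.abs(beta)
    middle = (2 * a * lam * b - b**2 - lam**2) / (2 * (a - 1))
    flat = lam**2 * (a + 1) / 2
    return np.where(b <= lam, lam * b, np.where(b <= a * lam, middle, flat))

def scad_derivative(beta, lam, a=3.7):
    """Derivative of the SCAD penalty with respect to |beta|."""
    b = np.abs(beta)
    return np.where(b <= lam, lam, np.maximum(a * lam - b, 0.0) / (a - 1))

def lla_weights(beta_prev, lam, a=3.7):
    """Weights for the weighted l1 penalty lam * sum_j w_j * |beta_j|."""
    return scad_derivative(beta_prev, lam, a) / lam

beta_prev = np.array([0.0, 0.5, 2.0, 5.0])
weights = lla_weights(beta_prev, lam=1.0)
print(weights)
```

Coefficients that were small in the previous iteration keep the full Lasso weight, while coefficients larger than `a * lam` receive zero weight and are no longer shrunk, which is how the overshrinkage of strong effects is avoided.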

SIMULATION STUDY

Simulations were designed to explore the performance of BDCoCoLasso as a function of the number and proportion of corrupted features. Furthermore, we wanted to ensure that our results matched CoCoLasso when both methods could be implemented, that is, for fairly modest $p$, and a single type of error.

Simulation design

We first simulated an uncorrupted covariate matrix $X$ from a multivariate normal distribution with $n$ observations, zero mean, and a predefined correlation structure across features. We explored a lower‐dimensional setting ($n = 10{,}000$ and $p = 200$) and a higher‐dimensional setting ($n = 1000$ and $p = 2000$) in combination with two common covariance matrix designs to introduce correlation between features: an autoregressive setting, in which the correlation between two features decays with the distance between their indices, and a symmetric setting, in which every pair of distinct features shares the same correlation. We then generated the response as $y = X\beta + \varepsilon$, with the variance of the noise term $\varepsilon$ chosen to ensure a realistic signal‐to‐noise ratio.

When assessing the performance of the CoCoLasso algorithm, Datta and Zou used a sparse coefficient vector to generate strong signals from only a few features. Likewise, to start with a simulation that was similar to theirs, we set $\beta = \beta_0 = (3, 1.5, 0, 0, 2, 0, \ldots, 0, 2, 0, 0, 1.5, 3)$, where three of the features measured without error and three of the features measured with error were assigned to be causal with relatively large effect sizes. Since we anticipate that this algorithm will be useful in large cohorts where $n > p$, and anticipating multiple associated features with small effect sizes, we simulated more scenarios with $n = 10{,}000$ and $p = 200$. We assigned different fractions of features to be causal (5% or 20%), and created higher dimensionality ($p = 200$, $500$, or $1000$) while sampling $\beta$ from a standardized normal distribution $N(0, 1)$.

Next, we introduced different types of error to the covariate matrix:

For the additive error setting, the corrupted design matrix was generated as $Z = X + A$, where the entries of $A$ in the corrupted columns were drawn from $N(0, \tau^2)$. We explored different $\tau$ parameters in combination with different fractions (at least 10%) of features measured with additive error.

For the missing error setting, the corrupted design matrix was generated as $z_{ij} = x_{ij}$ if $m_{ij} = 1$ and $z_{ij}$ missing if $m_{ij} = 0$, where each element of $M = (m_{ij})$ follows a Bernoulli distribution with $P(m_{ij} = 0) = r$, where $r$ is the missing rate. We explored different values for the missing rate in combination with different fractions (at least 10%) of features measured with missing data.

For the mixed error setting, we generated a corrupted design matrix containing an additive‐error block $Z_A$ and a missing‐data block $Z_M$, which were generated as in the additive error setting and the missing error setting, respectively. We explored different combinations of $\tau \in \{0.2, 0.5, 0.8\}$ for $Z_A$ and $r \in \{0.2, 0.5\}$ for $Z_M$.

All parameters used in the simulations are summarized in Table 1. In all simulations, the data were simulated twice for the same $\beta$ to create a training data set for model fitting and a test data set of equal size ($n$ = 10,000 or 1000) for assessing prediction accuracy. We used fivefold cross‐validation in the training data to optimize the $\lambda$ parameter. We repeated each simulation scenario 100 times. Data were then analyzed with BDCoCoLasso, naïve Lasso, and for the simulation scenarios with strong signals in Table 1, also with BDCoCoLasso‐SCAD, using the SCAD penalty variant, as well as the adaptive Lasso. All methods were implemented using a 2.6‐GHz quad‐core processor with 32 GB of random access memory. The data sets were also analyzed with CoCoLasso for comparison of computational cost. The following criteria were used to compare the performance of different methods:

Table 1

Summary of simulation design

Err.	No. Obs.	No. Fts.	No. Causal Fts.	% Fts. with Additive Err.	% Fts. Missing	β	τ	r
Additive	10,000	200	6	10		β0	τ0
Missing	10,000	200	6		10	β0		r0
Additive	1000	2000	6	10		β0	τ0
Missing	1000	2000	6		10	β0		r0,high‐dim.
Additive	10,000	200	5%, 20%	10, 20, 50		~N(0,1)	0.2
Missing	10,000	200	5%, 20%		10, 20, 50	~N(0,1)		0.2
Additive	10,000	500, 1000	5%	10, 20, 50		~N(0,1)	0.2
Missing	10,000	500, 1000	5%		10, 20, 50	~N(0,1)		0.2
Mixed	10,000	200	5%	10, 20, 50	10, 20, 50	~N(0,1)	0.2, 0.5, 0.8	0.2, 0.5

Note: All simulations were replicated 100 times in each of the autoregressive covariance setting and the symmetric covariance setting, respectively. β 0 = (3,1.5,0,0,2,0,…,0,2,0,0,1.5,3), τ 0 ∈ {0,0.05,0.10,…,0.70,0.75,0.80}, and r 0,high‐dim. ∈ {0,0.05,0.10,…,0.30,0.35,0.40} as high missing rates lead to completely missing data in some features with a small number of observations.

Abbreviations: Err., errors; Fts., features; Obs., observations.

Computational time (for some scenarios).
Total‐mean‐square error in the training data set.
False‐positive rate (FPR), that is, the number of truly zero coefficients estimated to be nonzero.
Sparsity: The fraction of features correctly estimated to be zero or nonzero.
Variance explained ($R^2$) in the test data set.

When the naïve Lasso and the adaptive Lasso were applied to corrupted data in the additive error setting, estimates could be directly obtained, without taking the measurement error into account. However, in the missing error setting, since removing all observations with missing data would occasionally lead to insufficient numbers of samples, we used the classical mean imputation method to impute missing data. The adaptive weight for the $j$th feature in the adaptive Lasso was obtained by Ridge regression with fivefold cross‐validation. We did not apply more sophisticated imputation methods, such as the Multivariate Imputation by Chained Equations (Buuren, 2011), since they would have prohibitive computational costs in a high‐dimensional setting.
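The corruption mechanisms and selection criteria described above can be sketched as follows. This is an illustration under assumptions rather than the simulation code used in the study: additive errors are drawn i.i.d. normal, missingness is completely at random, FPR is computed as a proportion of the truly zero coefficients, and the estimation error is taken as the sum of squared differences between estimated and true coefficients. All function names are hypothetical.

```python
import numpy as np

def corrupt_additive(X, cols, tau, rng):
    """Add N(0, tau^2) noise to the selected columns only."""
    Z = X.copy()
    Z[:, cols] += tau * rng.standard_normal((X.shape[0], len(cols)))
    return Z

def corrupt_missing(X, cols, r, rng):
    """Set each entry of the selected columns to NaN with probability r."""
    Z = X.copy()
    block = Z[:, cols]                       # fancy indexing returns a copy
    block[rng.random(block.shape) < r] = np.nan
    Z[:, cols] = block
    return Z

def mean_impute(Z):
    """Replace missing entries by the observed column mean."""
    return np.where(np.isnan(Z), np.nanmean(Z, axis=0), Z)

def selection_metrics(beta_hat, beta_true):
    """Return (FPR as a proportion, sparsity, sum of squared coefficient errors)."""
    est_nz, true_nz = beta_hat != 0, beta_true != 0
    fpr = np.sum(est_nz & ~true_nz) / max(np.sum(~true_nz), 1)
    sparsity = np.mean(est_nz == true_nz)
    sq_err = np.sum((beta_hat - beta_true) ** 2)
    return fpr, sparsity, sq_err

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
Z_add = corrupt_additive(X, [8, 9], tau=0.2, rng=rng)
Z_mis = corrupt_missing(X, [8, 9], r=0.3, rng=rng)
Z_imp = mean_impute(Z_mis)
print(selection_metrics(np.array([1.0, 0.0, 0.5, 0.0]), np.array([1.0, 0.0, 0.0, 0.0])))
```

Mean imputation fills every gap with the same value per column, which shrinks the variance of the imputed features; this is one reason a naïve fit on imputed data can behave differently from an error-corrected fit.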

Simulation results

BDCoCoLasso outperforms Lasso when covariate matrix is partially corrupted, and can cope with much larger data sets than the CoCoLasso

To ensure validity of our implementation, we analyzed the same data with BDCoCoLasso as well as the CoCoLasso algorithm without the block coordinate descent procedures. As expected, we found that all the estimates obtained by the BDCoCoLasso were numerically the same as those obtained by the CoCoLasso (numerical discrepancies were below the convergence tolerance), while the latter had a higher computational cost (Figure 1c,d). The computational efficiency of the BDCoCoLasso was more prominent in higher‐dimensional data with stronger correlations between features. For instance, on 1000 observations of 2000 features simulated with a symmetric covariance structure, it took the BDCoCoLasso approximately 10 min, on average, to construct the model, whereas the ordinary CoCoLasso had an average running time above 10 h.

Figure 1

Performance of BDCoCoLasso, BDCoCoLasso‐SCAD, and Lasso with increasing additive error ($\tau$) and missing rates ($r$) for the simulation scenarios in the first four rows of Table 1, where six features were assigned to be causal with large effect sizes. Panels (a) and (b) show squared bias. Dots denote median total‐mean‐square error and error bars show the interquartile range based on 100 replicates in each simulation setting. When $\tau = 0$ or $r = 0$, no measurement error exists. All simulations were based on (a) 10,000 observations of 200 features or (b) 1000 observations of 2000 features where 10% of the features were measured with error. In (b), as $r$ increased, frequently all observations of a feature were missing. Therefore, scenarios with $r > 0.4$ were not explored. Comparison of running time in (c) the lower‐dimensional settings and (d) the higher‐dimensional settings indicates the substantially improved computational efficiency of the BDCoCoLasso over CoCoLasso. Running time was summarized over all replicates in each simulation setting. All methods were implemented using a 2.6‐GHz quad‐core processor. BDCoCoLasso, Block coordinate Descent Convex Conditioned Lasso; CoCoLasso, Convex Conditioned Lasso; SCAD, smoothly clipped absolute deviation

Also as expected based on Datta et al. (2017), the BDCoCoLasso achieved better performance than the naïve Lasso in most scenarios. In both the lower‐dimensional setting ($n = 10{,}000$ and $p = 200$) and the higher‐dimensional setting ($n = 1000$ and $p = 2000$), with 10% features measured with error and strong signals, the BDCoCoLasso yielded smaller total‐mean‐square error, lower FPR, and higher sparsity compared with the naïve Lasso (Figures 1, S1, and S2). The BDCoCoLasso was relatively insensitive to the increase in additive error rates or missing rates, while the naïve Lasso had considerably worse performance as corruption rates increased. Although the naïve Lasso achieved a slightly better prediction accuracy in the test data set with small values of $\tau$ or $r$, its predictive performance deteriorated more rapidly than that of the BDCoCoLasso (Figures S1 and S2).

Moreover, implementing a SCAD penalty in the lower‐dimensional setting ($n = 10{,}000$ and $p = 200$) with strong signals further improved the estimation accuracy of the BDCoCoLasso. As indicated in Figures 1a and S1, the BDCoCoLasso with SCAD penalty yielded smaller total‐mean‐square error with a 100% sparsity when no measurement error occurred ($\tau = 0$ or $r = 0$). Further, it consistently outperformed the BDCoCoLasso implementing an $\ell_1$ penalty, the naïve Lasso as well as the adaptive Lasso with increasing $\tau$ and $r$. Notably, while the adaptive Lasso had comparable performance to the BDCoCoLasso with SCAD penalty, and was slightly better than BDCoCoLasso with an $\ell_1$ penalty when the measurement error was weak, its accuracy could deteriorate substantially with a higher $\tau$ or $r$. However, the SCAD penalty implementation had a low prediction accuracy despite consistently achieving an FPR close to 0 and almost 100% sparsity (Figures S1 and S2). This situation can arise when there are many highly correlated predictor variables. Since SCAD has good performance in variable selection, it does not retain many noncausal variables. In contrast, prediction models created by some other methods may retain several noncausal variables that are highly correlated with the true causal predictors; this obviously leads to worse metrics for sensitivity and sparsity, but can in fact lead to better $R^2$ even in test data.

The BDCoCoLasso also outperforms naïve Lasso with weakened signals, increased error rate, and increased dimensionality

In the lower‐dimensional setting ($n = 10{,}000$ and $p = 200$), when the magnitude of causal feature effect sizes was reduced, and more causal features were introduced, the estimation accuracy and stability for the naïve Lasso decreased substantially (Figures 2 and S3). In contrast, although an increase in the number of causal features and the correlation between features rendered the signals more elusive and resulted in an increase in FPR and a decrease in sparsity, the BDCoCoLasso always maintained better estimation accuracy than the naïve Lasso with better consistency across replicates (Figures 2 and S3). Also as expected, the BDCoCoLasso was clearly less sensitive to changes in the proportion of features measured with error (Figure 2). Such an improved estimation accuracy persisted when the covariate matrix contained more features ($p = 500$ or $1000$; Figures 3 and S4).

Figure 2

Squared bias of BDCoCoLasso and Lasso with higher error rates and weaker signals (rows 5 and 6 in Table 1). Dots and triangles denote median total‐mean‐square error and error bars denote interquartile range based on 100 replicates in each simulation setting. Error rates denote the fractions of features measured with either additive error or missing data. Causal features denote the fractions of features assigned to be causal. Effect sizes of causal features were sampled from a standardized normal distribution. All simulations were based on 10,000 observations of 200 features. BDCoCoLasso, Block coordinate Descent Convex Conditioned Lasso

Figure 3

Squared bias of BDCoCoLasso and Lasso with high‐dimensional feature sets of 500 or 1000 features (rows 7 and 8 in Table 1). Dots denote median total‐mean‐square error and error bars denote interquartile ranges based on 100 replicates in each simulation setting. Error rates denote the fractions of features measured with either additive error or missing data. In all simulation settings, 5% of the features were assigned to be causal with effect sizes sampled from a standardized normal distribution. Features were simulated to have an autoregressive covariance matrix. BDCoCoLasso, Block coordinate Descent Convex Conditioned Lasso

The BDCoCoLasso handles measurement error with mixed types

The new three‐block coordinate descent algorithm (Supporting Information) copes seamlessly with coexistence of both types of error (Figures 4 and S5). As demonstrated in previous figures, the BDCoCoLasso achieved higher estimation accuracy than Lasso in all combinations of $\tau$, $r$, and error rates. Its advantage became more prominent when the covariate matrix was more corrupted with a higher $\tau$, $r$, and error rate. In particular, when all features were measured with error, a two‐block coordinate descent iterating between the additive‐error block and the missing‐error block retained its superiority over the naïve Lasso.

Figure 4

Squared bias of BDCoCoLasso and Lasso in the mixed error setting using a three‐block coordinate descent algorithm. Dots and triangles denote median total‐mean‐square error and error bars denote interquartile range based on 100 replicates in each simulation setting. Additive error rates and missing error rates were set to be equivalent, taking values of 0.1, 0.2, or 0.5. When both the additive error rate and the missing error rate are 0.5, all features are measured with error and a two‐block coordinate descent algorithm supplants the three‐block coordinate descent algorithm. All simulations were based on 10,000 observations of 200 features. Five percent of the features were assigned to be causal with effect sizes sampled from a standardized normal distribution. BDCoCoLasso, Block coordinate Descent Convex Conditioned Lasso

REAL DATA APPLICATION EXAMPLES IN THE UK BIOBANK

The UK Biobank provides deep genetic and phenotypic data collected from nearly 500,000 participants between 2006 and 2010, and has enabled many important advances in human genetics and health care (Bycroft et al., 2018). One important advance is the development of genetic risk scores, which have demonstrated potential for improving risk screening and possibly guiding prevention and intervention (Khera et al., 2018; Lu et al., 2020; Lu, Forgetta, Keller‐Baruch, et al., 2021; Lu, Forgetta, Wu, et al., 2021; Lu, Zhou, et al., 2021). Notably, several genetic risk scores have been developed using the Lasso (Lu, Forgetta, Keller‐Baruch, et al., 2021; Lu, Forgetta, Wu, et al., 2021; Lu, Zhou, et al., 2021). As in most large‐scale cohort studies, measurement errors, especially missingness, affected a substantial proportion of clinical and lifestyle variables. We thus tested whether the BDCoCoLasso could help improve the predictive performance and clinical utility of covariate‐adjusted genetic risk scores compared with the naïve Lasso or the adaptive Lasso. For the purpose of testing BDCoCoLasso in a reasonably large high‐dimensional setting, we randomly selected 4500 unrelated individuals from the UK Biobank of white British ancestry with self‐reported age, sex, measured body mass index, bone mineral density, maternal and paternal living status or age of death, and 30 clinical and lifestyle variables (Figure 5a). We randomly split this data set into a training data set, including 3000 individuals possibly with missing data, and a test data set, including 1500 individuals without missing data. For all three examples, genotypes had been imputed to the Haplotype Reference Consortium panel (McCarthy et al., 2016).

Figure 5

Comparison of Lasso and BDCoCoLasso in developing a covariate‐adjusted genetic risk score for the body mass index z score. (a) Summary of missing rates for the covariates in the training data set (left). The test data set does not have missing data. The coefficients of these covariates estimated by Lasso based on complete observations (second panel; N = 895) or mean imputation (third panel; N = 3000), and by BDCoCoLasso (rightmost panel; N = 3000) on the training data set are aligned. (b) Comparison of model metrics for the Lasso and BDCoCoLasso models. Standard errors of the proportion of variance explained and root‐mean‐square error were generated using 100 bootstrap replicates of the test data set (N = 1500). The five models were evaluated on the same bootstrap replicates. (c) Comparison of running time in logarithmic scale. All methods were implemented using a 2.6‐GHz quad‐core processor. BDCoCoLasso, Block coordinate Descent Convex Conditioned Lasso; FEV1, forced expiratory volume in 1 s; FVC, forced vital capacity

Predicting body mass index with accurate genotype variables and corrupted clinical and lifestyle measurements

Obesity is a highly polygenic trait involving multiple genes of small or moderate effects (Speliotes et al., 2010; Willer et al., 2009). Previously, the genetic basis of obesity was explored by a genome‐wide association study of body mass index, a widely used measure to define obesity, in 322,154 individuals of European ancestry from the Genetic Investigation of ANthropometric Traits Consortium (Locke et al., 2015). Although tens of independent genetic risk loci have been identified, many clinical and lifestyle risk factors are also strongly associated with body mass index (Marti et al., 2004; Speakman, 2004), yet measurements of these risk factors may be missing in large‐scale cohort studies. Such missingness limits the investigation of the joint effects of genetic and nongenetic risk factors for obesity. We therefore applied the BDCoCoLasso to a subset of data from the UK Biobank (Bycroft et al., 2018) to examine whether incorporating variables that previously had to be discarded due to missingness could improve the prediction of body mass index. An existing large meta‐analysis of genome‐wide association studies identified 1882 SNPs strongly associated with body mass index (Locke et al., 2015); we therefore retrieved genotypes of these SNPs as candidate genetic predictors. Among these SNPs, which were representative of causal signals, many were correlated due to linkage disequilibrium. Thus, an ℓ₁ penalty was adopted for variable selection. Genetic variants together with age and sex were considered features measured without error or missing data, while the other clinical or lifestyle features containing missing values were further processed by BDCoCoLasso.
In addition, we compared the performance of BDCoCoLasso with two implementations each of the naïve Lasso and the adaptive Lasso (with adaptive weights obtained from fivefold cross‐validated Ridge regression, as in the simulation study): we built Lasso models either on only the 895 individuals with no missing data in the training data set, or on the entire mean‐imputed training data set. These five models were constructed based on the same fivefold cross‐validation, such that in each fold the optimization was high dimensional, and were evaluated on the independent test data set. Proportion of variance explained and prediction root‐mean‐square error were examined based on 100 bootstrap replicates. As anticipated, because of a largely compromised sample size, the Lasso models relying on only the complete observations explained the smallest proportion of variance in body mass index in the test data set, with the fewest predictors selected (Figure 5b). On the other hand, the naïve Lasso model and the adaptive Lasso model with mean imputation derived estimates for clinical and lifestyle covariates similar to those of the BDCoCoLasso model, which were substantially different from those estimated by the Lasso models on complete observations (Figure 5a). However, the BDCoCoLasso achieved a significantly higher proportion of variance explained (0.622 vs. 0.584 of the naïve Lasso, paired t test p value of bootstrap replicates = ; 0.622 vs. 0.583 of the adaptive Lasso, paired t test p = ) and a significantly lower prediction root‐mean‐square error (0.725 vs. 0.783 of the naïve Lasso, paired t test p value of bootstrap replicates = ; 0.725 vs. 0.784 of the adaptive Lasso, paired t test p = ) than these mean‐imputed Lasso models on the test data set.
Notably, the BDCoCoLasso required only twice the running time of the mean‐imputed naïve Lasso model, whereas the CoCoLasso without the block coordinate descent procedure had a more than 100 times higher time cost to yield the same parameter estimates (Figure 5c).
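The comparison above evaluates all models on the same bootstrap replicates of the test data set and then applies a paired t test to the replicate‐level metrics. A minimal Python sketch of that procedure follows. It assumes the proportion of variance explained is defined as 1 − SSE/SST (the text does not state the exact definition), and the function names are illustrative rather than taken from the authors' code.

```python
import math
import random


def rmse(y, pred):
    """Root-mean-square prediction error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y))


def variance_explained(y, pred):
    """Assumed definition: 1 - SSE/SST."""
    mean_y = sum(y) / len(y)
    sse = sum((a - b) ** 2 for a, b in zip(y, pred))
    sst = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - sse / sst


def paired_bootstrap(y, pred_a, pred_b, metric, n_boot=100, seed=0):
    """Mean difference metric(A) - metric(B) and its paired t statistic.

    Both models are scored on the same resampled indices, so the
    replicate-level differences are paired.
    """
    rng = random.Random(seed)
    n = len(y)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        yb = [y[i] for i in idx]
        a = metric(yb, [pred_a[i] for i in idx])
        b = metric(yb, [pred_b[i] for i in idx])
        diffs.append(a - b)
    mean_d = sum(diffs) / n_boot
    sd_d = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (n_boot - 1))
    return mean_d, mean_d / (sd_d / math.sqrt(n_boot))


# Toy example: model A has small errors, model B has larger, uneven errors.
y = [float(i) for i in range(30)]
pred_a = [v + 0.1 for v in y]
pred_b = [v + 0.5 * (i % 3) for i, v in enumerate(y)]
mean_diff, t_stat = paired_bootstrap(y, pred_a, pred_b, rmse)
```

A p value would be obtained by referring the t statistic to a t distribution with `n_boot - 1` degrees of freedom, which is omitted here to keep the sketch dependency‐free.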

Predicting bone mineral density and fracture risk

Osteoporotic fractures affect up to 1 in 3 women and 1 in 5 men aged above 50 years, and incur a heavy socioeconomic burden among elderly populations (Kanis et al., 2000). Therefore, good predictions of the risk of osteoporotic fracture are essential to public health management. Bone mineral density is a key indicator of bone mass and bone quality, and has been included in successful risk factor‐based fracture risk prediction tools, such as FRAX (Kanis, 2002; Kanis et al., 2008). Recently, it has been shown that, when combined with clinical risk factors, genetically predicted bone mineral density could significantly improve the predictive performance in identifying individuals at an elevated risk of fracture (Lu, Forgetta, Keller‐Baruch, et al., 2021). Therefore, we attempted to leverage BDCoCoLasso to further improve the prediction of bone mineral density and fracture risk. We retrieved 7307 SNPs strongly associated with bone mineral density (estimated by quantitative ultrasound speed of sound and broadband ultrasound attenuation); SNPs had demonstrated p in a previous genome‐wide association study (Morris et al., 2019). We implemented BDCoCoLasso, the naïve Lasso (with complete data or mean‐imputed data) and the adaptive Lasso (with complete data or mean‐imputed data) as in Section 4.1 with the same training data set and the same clinical and lifestyle features. We found that the covariate‐adjusted genetic risk score constructed using the BDCoCoLasso again had the highest proportion of variance explained and the lowest prediction root‐mean‐square error for bone mineral density on the independent test data set, and its computational cost was tremendously reduced compared with the CoCoLasso (Figure S6). Moreover, among the 1500 individuals in the test data set, 170 self‐reported or had a medical record of major osteoporotic fractures affecting hip, radius/ulna, humerus, or vertebrae upon recruitment. 
The score constructed by BDCoCoLasso also exhibited the strongest discriminative power in identifying individuals who experienced fractures, with an area under the receiver operating characteristic curve (AUROC) of 0.571 and an area under the precision‐recall curve (AUPRC) of 0.132 (Figure 6). In contrast, the naïve Lasso with mean imputation, the best performer among the four naïve Lasso or adaptive Lasso implementations, obtained an AUROC of only 0.554 and an AUPRC of only 0.123 (Figure 6).
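To put these values in context, a score with no discriminative power has an expected AUROC of 0.5 and an expected AUPRC close to the case prevalence, here 170/1500 ≈ 0.113. The Python sketch below shows one common way to compute both metrics. It assumes a higher score means a higher predicted fracture risk (so a bone mineral density score would need its sign flipped), and it uses average precision as the AUPRC estimator, which may differ from the estimator the authors used.

```python
def auroc(scores, labels):
    """Probability that a random case outscores a random control (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Step-wise area under the precision-recall curve (ties keep input order)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


# Prevalence of fracture in the test data set, i.e. the AUPRC of a random score.
prevalence = 170 / 1500

# Toy example with two cases and two controls.
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [1, 0, 1, 0]
toy_auroc = auroc(toy_scores, toy_labels)
toy_ap = average_precision(toy_scores, toy_labels)
```

The pairwise AUROC loop is quadratic in the number of individuals, which is acceptable for a test set of 1500 but would be replaced by a rank‐based formula for larger data.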

Figure 6

Comparison of predictive performance of covariate‐adjusted genetic risk scores for bone mineral density in identifying individuals who had fractures. (a) Receiver operating characteristic curves and (b) precision‐recall curves. Scores were evaluated based on the test data set (N = 1500). Other model metrics are provided in Figure S6. BDCoCoLasso, Block coordinate Descent Convex Conditioned Lasso

Predicting human lifespan

Longevity is a highly complex trait in which genetics plays a debatable role (van den Berg et al., 2017). Only recently have genes and genetic variants influencing extreme longevity (Deelen et al., 2019) or human lifespan (Timmers et al., 2019) been systematically identified in large‐scale genome‐wide association studies. We tested whether a covariate‐adjusted genetic risk score could predict lifespan and inform lifetime risk of death. We retrieved 462 SNPs strongly associated with human lifespan (p ) identified in a recent genome‐wide association study (Timmers et al., 2019) as candidate genetic predictors. Because the majority of the UK Biobank participants were alive at the time of the latest follow‐up, we sought to predict parental lifespan instead. Since BDCoCoLasso has not been adapted to time‐to‐event outcomes, we created two subtraining data sets from the original training data set of 3000 individuals, containing 1814 individuals whose mother had died and 2254 individuals whose father had died. We trained models to predict maternal or paternal age of death separately, to account for potential sex‐specific effects (Timmers et al., 2019). We again implemented BDCoCoLasso, the naïve Lasso (with complete data or mean‐imputed data), and the adaptive Lasso (with complete data or mean‐imputed data) as above with the same clinical and lifestyle features. Notably, the naïve Lasso models with complete data did not select any of the predictors, probably due to the reduced sample size (Figure S7). Next, we tested the predictive performance of these covariate‐adjusted genetic risk scores on the test data set using Cox regression models. Of the 1500 individuals in the test data set, 902 had a mother and 1109 had a father who had died by the time of recruitment.
Although a genetic risk score based on offspring genotypes is not an ideal way to estimate parental genetic predispositions, our BDCoCoLasso‐based scores achieved modest discriminative power in identifying individuals whose parents lived longer in the test data set, and outperformed the other naïve Lasso or adaptive Lasso models (Figure 7). Specifically, a one standard deviation decrease in the maternal score (corresponding to a shorter predicted lifespan) was associated with a lifetime hazard ratio for time to death of 1.104 (95% CI, 1.032–1.181) while a one standard deviation decrease in the paternal score was associated with a lifetime hazard ratio of 1.071 (95% CI, 1.009–1.137). In contrast, the runner‐up score for maternal lifespan using an adaptive Lasso with mean imputation had a hazard ratio of 1.084 (95% CI, 1.013–1.161) per standard deviation increase and the runner‐up score for paternal lifespan using naïve Lasso with mean imputation had a hazard ratio of 1.068 (95% CI, 1.006–1.134) per standard deviation increase.
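The hazard ratios above come from Cox regression coefficients on the standardized score. As a worked illustration in Python, and assuming the reported intervals are Wald intervals of the form exp(β ± 1.96·SE) (the text does not say how they were computed), the standard error can be backed out from the maternal interval and the interval rebuilt from it. Reversing the direction of the score inverts the hazard ratio.

```python
import math


def hazard_ratio_ci(beta, se, z=1.96):
    """Hazard ratio and Wald confidence limits from a Cox log-hazard coefficient."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def se_from_ci(lower, upper, z=1.96):
    """Standard error of the log hazard ratio implied by a Wald interval."""
    return (math.log(upper) - math.log(lower)) / (2 * z)


# Maternal score: HR 1.104 (95% CI, 1.032-1.181) per SD decrease in the score.
se = se_from_ci(1.032, 1.181)
hr, lo, hi = hazard_ratio_ci(math.log(1.104), se)

# The same association expressed per SD increase in the score.
hr_per_sd_increase = 1.0 / hr
```

The rebuilt limits agree with the reported ones to within rounding, which is consistent with (but does not prove) the Wald assumption.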

Figure 7

Comparison of predictive performance of covariate‐adjusted genetic risk scores for lifespan. (a) Kaplan–Meier curves for time to maternal death and (b) Kaplan–Meier curves for time to paternal death. Parents of individuals with scores in the highest 20% (predicted to be the most likely to live longer) and in the lowest 20% (predicted to be the least likely to live longer) were compared. Hazard ratios (HRs) were estimated based on standardized covariate‐adjusted genetic risk scores using Cox regression models. Scores were evaluated based on the test data set (N = 1500). Other model metrics are provided in Figure S7.

DISCUSSION

With the increasing availability of large population‐based cohorts, developing rigorous methods for model estimation and variable selection is a pressing need in contemporary medical research. The CoCoLasso algorithm proposed by Datta and Zou (2017) uses a reformulation of the Lasso objective function with a modified covariance estimator to allow for high‐dimensional error‐in‐variables regression. More recent studies have combined the principles of the CoCoLasso with other techniques that render more complicated scenarios tractable. For example, Brown et al. (2019) developed a Measurement Error Boosting algorithm with a measurement error‐corrected score function to enable Poisson, Gamma, and Wald. However, to our knowledge, no algorithm specifically targets data that are only partially corrupted by measurement error or that have mixed error types, although such characteristics are common in most large‐scale genomics and medical studies. In this study, we developed a block coordinate descent algorithm as an extension to the CoCoLasso algorithm to improve both computational efficiency and estimation accuracy. We also implemented an optional SCAD penalty for further improved model estimation and variable selection when the signals are strong. These adaptations make it possible to use error‐in‐variables penalized models for data sets with a large feature dimension, as long as the number of corrupted features remains modest. Computational time depends linearly on sample size, but is cubic in the number of corrupted features. Therefore, although these developments are an important step towards analyzing large‐scale data, additional developments would be required to work with data of the size of the UK Biobank while allowing for corrupted data.
Perhaps by combining these approaches with new methods for working with biobank data at scale (Bi et al., 2020; Jiang et al., 2019; Qian et al., 2020), it may be possible to achieve the orders‐of‐magnitude expansions required. In multifaceted simulations, the BDCoCoLasso algorithm substantially outperformed the naïve Lasso (as expected), achieving smaller total‐mean‐square error, lower FPR, and higher sparsity. The BDCoCoLasso was also less sensitive to increases in the intensity of additive error and/or the missing rate, the fraction of features measured with error, and the dimensionality, as well as to reductions in the magnitude of signals. We further derived covariate‐adjusted genetic risk scores for body mass index, bone mineral density, and parental lifespan in the UK Biobank and showed that the BDCoCoLasso leveraged more information than the naïve Lasso without the need to discard missing data or perform imputation, and achieved better prediction accuracy. It should be noted that, while we worked with well‐genotyped and well‐imputed SNPs (INFO > 0.3), poorly imputed SNPs that were filtered out before our analysis could potentially be considered as measured with error, and hence used more effectively by our algorithm. We do not pursue this here since most genetic studies analyze only well‐imputed genotypes. Considering that genomics‐facilitated personalized medicine is booming, and that large data sets containing both accurate genotyping information and other partially corrupted features are being rapidly released, we posit that the BDCoCoLasso algorithm has the potential to be applied in various medical research settings, and we have provided a freely available R package for public use. Since our algorithm utilizes corrupted covariates, BDCoCoLasso on an extremely small sample size may have less stable performance than the naïve Lasso. Particularly in small‐n, large‐p situations, results should be carefully examined and data perturbed to assess stability.
If cross‐validation is employed, the number of folds should be chosen such that each fold contains sufficient observations. Our simulations with (fivefold cross‐validation) experienced no trouble, but with or 200 (and double these values, using fivefold cross‐validation), convergence was not always achieved. Leave‐one‐out cross‐validation may be an appropriate alternative under such circumstances. Extra caution should also be taken when implementing the SCAD penalty in a high‐dimensional setting if the features are correlated, as it may introduce instability in parameter estimation or prediction. Given that our algorithm exhibited better model sparsity in multiple simulation settings, it may be combined with various approaches for post‐selection inference, including but not limited to those proposed by Lee et al. (2016, with closed‐form p values and confidence intervals), Taylor et al. (2016, forward stepwise regression and least angle regression), and possibly in the future, Taylor and Tibshirani (2018, generalized regression models). The improved control of false discovery rate may benefit various fields, including genetic epidemiology studies. Our algorithm has some important limitations. First, it assumes that each feature can harbor at most one type of error (either additive or missing error) and does not cope with the coexistence of both types of error in one feature. Therefore, BDCoCoLasso could be combined with a complete case analysis removing features with both types of error but a low missing rate, or an imputation of only the features with a low missing rate to control potential bias. Second, a useful extension of our algorithm could be to allow for varying penalty factors for different coefficient blocks, for example, for and for in Equation (5). However, without strong prior knowledge of the features, selecting optimal penalty factors with cross‐validation becomes nontrivial and requires future investigations.
Third, the ADMM algorithm becomes unstable when the missing rate is high. Replacing the max norm by a Frobenius norm when defining the nearest positive semidefinite matrix, or down‐weighting features with a high missing rate in the ADMM algorithm, may boost its stability; in fact, the recently developed high missing Lasso (HMLasso) algorithm has successfully adopted similar concepts to handle scenarios where features are subject to very high missing rates (Takada et al., 2019). Our package includes an option with HMLasso features, although we did not observe a clear benefit to this adaptation in our simulations. Furthermore, in the additive error setting, similar to the CoCoLasso (Datta & Zou, 2017), our algorithm requires knowledge about the variance of the error, and therefore it is essential to be able to find relevant literature, such as measures of precision of an instrument used for measurement. Lastly, we noted that with a very large feature dimension and strong correlations between features (e.g., a symmetric covariance matrix for ), the algorithm became time intensive. Enhanced memory handling and parallelization may assist in enabling and accelerating computation in higher‐dimensional data sets with more complex correlation structures. Nevertheless, the algorithm copes extremely efficiently with large sample sizes; our UK Biobank example analyzed thousands of samples and could easily have analyzed more.
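The two projections mentioned in the third limitation can be written side by side. This is a sketch under the assumption that $\hat{\Sigma}$ denotes the symmetric, error‐corrected (and possibly indefinite) covariance estimate; the notation is illustrative and may differ from the notation used elsewhere in the paper.

```latex
% Max-norm projection, as in CoCoLasso (no closed form; solved iteratively by ADMM):
\tilde{\Sigma}_{\max}
  = \arg\min_{\Sigma \succeq 0} \lVert \Sigma - \hat{\Sigma} \rVert_{\max},
\qquad
\lVert A \rVert_{\max} = \max_{j,k} \lvert a_{jk} \rvert .

% Frobenius-norm projection, with eigendecomposition
% \hat{\Sigma} = V \,\mathrm{diag}(d_1, \dots, d_p)\, V^{\top}:
\tilde{\Sigma}_{F}
  = \arg\min_{\Sigma \succeq 0} \lVert \Sigma - \hat{\Sigma} \rVert_{F}
  = V \,\mathrm{diag}\bigl(\max(d_1, 0), \dots, \max(d_p, 0)\bigr)\, V^{\top} .
```

The unweighted Frobenius projection has a closed form (negative eigenvalues are set to zero), so it needs no iterative solver, which is one plausible reason it could be more stable at high missing rates.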

CONFLICT OF INTERESTS

The authors declare that there are no conflicts of interest.

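Algorithm 1 Two‐block coordinate descent

Input: $\Sigma_1$, $\tilde{\Sigma}_2$, $R$, $y$, $\lambda$, $X_1$, $Z_2$, error

Initialize: $\beta_1^0 \leftarrow 0$; $\beta_2^0 \leftarrow 0$

while not converged do

  if error = missing then $\tilde{Z}_2 = Z_2\,\mathrm{diag}(1/R)$ end if

  if error = additive then $\tilde{Z}_2 = Z_2$ end if

  $\tilde{\rho}_1 \leftarrow \frac{1}{n} X_1'(y - \tilde{Z}_2 \beta_2^0)$

  $\beta_1 \leftarrow \arg\min_{\beta_1} \frac{1}{2}\beta_1'\Sigma_1\beta_1 - \tilde{\rho}_1'\beta_1 + \lambda\lVert\beta_1\rVert_1$

  if error = missing then $\tilde{\rho}_2 \leftarrow \frac{1}{n} Z_2'(y - X_1\beta_1)\,\mathrm{diag}(1/R)$ end if

  if error = additive then $\tilde{\rho}_2 \leftarrow \frac{1}{n} Z_2'(y - X_1\beta_1)$ end if

  $\beta_2 \leftarrow \arg\min_{\beta_2} \frac{1}{2}\beta_2'\tilde{\Sigma}_2\beta_2 - \tilde{\rho}_2'\beta_2 + \lambda\lVert\beta_2\rVert_1$

  Update: $\beta_1^0 \leftarrow \beta_1$; $\beta_2^0 \leftarrow \beta_2$

end while

Output: $\beta_1$, $\beta_2$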
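The loop in Algorithm 1 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the implementation in the BDCoCoLasso R package: each block subproblem is solved by cyclic coordinate descent with soft‐thresholding (the package may use a different solver), matrices are row‐major lists of lists, the caller supplies the already rescaled matrix `Z2_tilde` (equal to Z2 for additive error, or Z2 diag(1/R) for missing data, so that the second correlation vector is computed as Z̃2′(y − X1β1)/n), and `Sigma2_tilde` is assumed positive semidefinite with a positive diagonal. All function names are hypothetical.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator used in Lasso coordinate updates."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_quadratic(Sigma, rho, lam, beta0, n_sweeps=100):
    """Cyclic coordinate descent for min_b 0.5*b'Sigma b - rho'b + lam*||b||_1."""
    b = list(beta0)
    p = len(b)
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation with the partial residual that excludes coordinate j.
            r = rho[j] - sum(Sigma[j][k] * b[k] for k in range(p) if k != j)
            b[j] = soft_threshold(r, lam) / Sigma[j][j]
    return b


def scaled_crossprod(A, v):
    """Return (1/n) * A'v for a row-major matrix A with n rows."""
    n = len(A)
    return [sum(A[i][j] * v[i] for i in range(n)) / n for j in range(len(A[0]))]


def residual(y, A, b):
    """Return y - A b."""
    return [y[i] - sum(A[i][j] * b[j] for j in range(len(b))) for i in range(len(y))]


def two_block_cd(Sigma1, Sigma2_tilde, X1, Z2_tilde, y, lam, tol=1e-10, max_iter=500):
    """Alternate Lasso updates of the uncorrupted (b1) and corrupted (b2) blocks."""
    b1 = [0.0] * len(X1[0])
    b2 = [0.0] * len(Z2_tilde[0])
    for _ in range(max_iter):
        rho1 = scaled_crossprod(X1, residual(y, Z2_tilde, b2))
        new_b1 = lasso_quadratic(Sigma1, rho1, lam, b1)
        rho2 = scaled_crossprod(Z2_tilde, residual(y, X1, new_b1))
        new_b2 = lasso_quadratic(Sigma2_tilde, rho2, lam, b2)
        # Convergence check (an assumption): largest coefficient change.
        change = max(abs(a - b) for a, b in zip(new_b1 + new_b2, b1 + b2))
        b1, b2 = new_b1, new_b2
        if change < tol:
            break
    return b1, b2


# Toy example: two orthogonal, error-free features and y = 2*x1 + 3*z2.
X1 = [[1.0], [-1.0], [1.0], [-1.0]]
Z2 = [[1.0], [1.0], [-1.0], [-1.0]]
y = [5.0, 1.0, -1.0, -5.0]
b1, b2 = two_block_cd([[1.0]], [[1.0]], X1, Z2, y, lam=0.5)
```

In this toy case the features are orthogonal with unit variance, so each coefficient is simply its least‐squares value shrunk towards zero by `lam`. With correlated blocks the two updates interact and several outer iterations are needed.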

29 in total

1. A resource-efficient tool for mixed model association analysis of large-scale data.

Authors: Longda Jiang; Zhili Zheng; Ting Qi; Kathryn E Kemper; Naomi R Wray; Peter M Visscher; Jian Yang
Journal: Nat Genet Date: 2019-11-25 Impact factor: 38.330

2. Post-Selection Inference for ℓ₁-Penalized Likelihood Models.

Authors: Jonathan Taylor; Robert Tibshirani
Journal: Can J Stat Date: 2017-03-06 Impact factor: 0.875

Review 3. Historical demography and longevity genetics: Back to the future.

Authors: Niels van den Berg; Marian Beekman; Ken Robert Smith; Angelique Janssens; Pieternella Eline Slagboom
Journal: Ageing Res Rev Date: 2017-07-05 Impact factor: 10.895

4. One-step Sparse Estimates in Nonconcave Penalized Likelihood Models.

Authors: Hui Zou; Runze Li
Journal: Ann Stat Date: 2008-08-01 Impact factor: 4.028

5. Individuals with common diseases but with a low polygenic risk score could be prioritized for rare variant screening.

Authors: Tianyuan Lu; Sirui Zhou; Haoyu Wu; Vincenzo Forgetta; Celia M T Greenwood; J Brent Richards
Journal: Genet Med Date: 2020-10-28 Impact factor: 8.822

6. A Polygenic Risk Score to Predict Future Adult Short Stature Among Children.

Authors: Tianyuan Lu; Vincenzo Forgetta; Haoyu Wu; John R B Perry; Ken K Ong; Celia M T Greenwood; Nicholas J Timpson; Despoina Manousaki; J Brent Richards
Journal: J Clin Endocrinol Metab Date: 2021-06-16 Impact factor: 6.134

7. A meta-analysis of genome-wide association studies identifies multiple longevity genes.

Authors: Joris Deelen; Daniel S Evans; Dan E Arking; Niccolò Tesi; Marianne Nygaard; Xiaomin Liu; Mary K Wojczynski; Mary L Biggs; Ashley van der Spek; Gil Atzmon; Erin B Ware; Chloé Sarnowski; Albert V Smith; Ilkka Seppälä; Heather J Cordell; Janina Dose; Najaf Amin; Alice M Arnold; Kristin L Ayers; Nir Barzilai; Elizabeth J Becker; Marian Beekman; Hélène Blanché; Kaare Christensen; Lene Christiansen; Joanna C Collerton; Sarah Cubaynes; Steven R Cummings; Karen Davies; Birgit Debrabant; Jean-François Deleuze; Rachel Duncan; Jessica D Faul; Claudio Franceschi; Pilar Galan; Vilmundur Gudnason; Tamara B Harris; Martijn Huisman; Mikko A Hurme; Carol Jagger; Iris Jansen; Marja Jylhä; Mika Kähönen; David Karasik; Sharon L R Kardia; Andrew Kingston; Thomas B L Kirkwood; Lenore J Launer; Terho Lehtimäki; Wolfgang Lieb; Leo-Pekka Lyytikäinen; Carmen Martin-Ruiz; Junxia Min; Almut Nebel; Anne B Newman; Chao Nie; Ellen A Nohr; Eric S Orwoll; Thomas T Perls; Michael A Province; Bruce M Psaty; Olli T Raitakari; Marcel J T Reinders; Jean-Marie Robine; Jerome I Rotter; Paola Sebastiani; Jennifer Smith; Thorkild I A Sørensen; Kent D Taylor; André G Uitterlinden; Wiesje van der Flier; Sven J van der Lee; Cornelia M van Duijn; Diana van Heemst; James W Vaupel; David Weir; Kenny Ye; Yi Zeng; Wanlin Zheng; Henne Holstege; Douglas P Kiel; Kathryn L Lunetta; P Eline Slagboom; Joanne M Murabito
Journal: Nat Commun Date: 2019-08-14 Impact factor: 14.919

8. Polygenic risk for coronary heart disease acts through atherosclerosis in type 2 diabetes.

Authors: Tianyuan Lu; Vincenzo Forgetta; Oriana H Y Yu; Lauren Mokry; Madeline Gregory; George Thanassoulis; Celia M T Greenwood; J Brent Richards
Journal: Cardiovasc Diabetol Date: 2020-01-30 Impact factor: 9.951

9. A reference panel of 64,976 haplotypes for genotype imputation.

Authors: Shane McCarthy; Sayantan Das; Warren Kretzschmar; Olivier Delaneau; Andrew R Wood; Alexander Teumer; Hyun Min Kang; Christian Fuchsberger; Petr Danecek; Kevin Sharp; Yang Luo; Carlo Sidore; Alan Kwong; Nicholas Timpson; Seppo Koskinen; Scott Vrieze; Laura J Scott; He Zhang; Anubha Mahajan; Jan Veldink; Ulrike Peters; Carlos Pato; Cornelia M van Duijn; Christopher E Gillies; Ilaria Gandin; Massimo Mezzavilla; Arthur Gilly; Massimiliano Cocca; Michela Traglia; Andrea Angius; Jeffrey C Barrett; Dorrett Boomsma; Kari Branham; Gerome Breen; Chad M Brummett; Fabio Busonero; Harry Campbell; Andrew Chan; Sai Chen; Emily Chew; Francis S Collins; Laura J Corbin; George Davey Smith; George Dedoussis; Marcus Dorr; Aliki-Eleni Farmaki; Luigi Ferrucci; Lukas Forer; Ross M Fraser; Stacey Gabriel; Shawn Levy; Leif Groop; Tabitha Harrison; Andrew Hattersley; Oddgeir L Holmen; Kristian Hveem; Matthias Kretzler; James C Lee; Matt McGue; Thomas Meitinger; David Melzer; Josine L Min; Karen L Mohlke; John B Vincent; Matthias Nauck; Deborah Nickerson; Aarno Palotie; Michele Pato; Nicola Pirastu; Melvin McInnis; J Brent Richards; Cinzia Sala; Veikko Salomaa; David Schlessinger; Sebastian Schoenherr; P Eline Slagboom; Kerrin Small; Timothy Spector; Dwight Stambolian; Marcus Tuke; Jaakko Tuomilehto; Leonard H Van den Berg; Wouter Van Rheenen; Uwe Volker; Cisca Wijmenga; Daniela Toniolo; Eleftheria Zeggini; Paolo Gasparini; Matthew G Sampson; James F Wilson; Timothy Frayling; Paul I W de Bakker; Morris A Swertz; Steven McCarroll; Charles Kooperberg; Annelot Dekker; David Altshuler; Cristen Willer; William Iacono; Samuli Ripatti; Nicole Soranzo; Klaudia Walter; Anand Swaroop; Francesco Cucca; Carl A Anderson; Richard M Myers; Michael Boehnke; Mark I McCarthy; Richard Durbin
Journal: Nat Genet Date: 2016-08-22 Impact factor: 38.330

10. The UK Biobank resource with deep phenotyping and genomic data.

Authors: Clare Bycroft; Colin Freeman; Desislava Petkova; Gavin Band; Lloyd T Elliott; Kevin Sharp; Allan Motyer; Damjan Vukcevic; Olivier Delaneau; Jared O'Connell; Adrian Cortes; Samantha Welsh; Alan Young; Mark Effingham; Gil McVean; Stephen Leslie; Naomi Allen; Peter Donnelly; Jonathan Marchini
Journal: Nature Date: 2018-10-10 Impact factor: 49.962

1 in total

1. Block coordinate descent algorithm improves variable selection and estimation in error-in-variables regression.

Authors: Célia Escribe; Tianyuan Lu; Julyan Keller-Baruch; Vincenzo Forgetta; Bowei Xiao; J Brent Richards; Sahir Bhatnagar; Karim Oualkacha; Celia M T Greenwood
Journal: Genet Epidemiol Date: 2021-09-01 Impact factor: 2.344
