Literature DB >> 31632436

On the Use of the Pearson Correlation Coefficient for Model Evaluation in Genome-Wide Prediction.

Abstract

The large number of markers in genome-wide prediction demands the use of methods with regularization and model comparison based on some hold-out test prediction error measure. In quantitative genetics, it is common practice to calculate the Pearson correlation coefficient (r2 ) as a standardized measure of the predictive accuracy of a model. Based on arguments from the bias-variance trade-off theory in statistical learning, we show that shrinkage of the regression coefficients (i.e., QTL effects) reduces the prediction mean squared error (MSE) by introducing model bias compared with the ordinary least squares method. We also show that the LASSO and the adaptive LASSO (ALASSO) can reduce the model bias and prediction MSE by adding model variance. In an application of ridge regression, the LASSO and ALASSO to a simulated example based on results for 9,723 SNPs and 3,226 individuals, the best model selected was with the LASSO when r2 was used as a measure. However, when model selection was based on test MSE and coefficient of determination R2 the ALASSO proved to be the best method. Hence, use of r2 may lead to selection of the wrong model and therefore also nonoptimal ranking of phenotype predictions and genomic breeding values. Instead, we propose use of the test MSE for model selection and R2 as a standardized measure of the accuracy.

Entities: Chemical Disease Species

Keywords: accuracy; bias–variance trade-off; coefficient of determination; genomic selection; model comparison

Year: 2019 PMID： 31632436 PMCID： PMC6781837 DOI： 10.3389/fgene.2019.00899

Source DB: PubMed Journal: Front Genet ISSN： 1664-8021 Impact factor: 4.599

Introduction

At the heart of classical quantitative genetics is linear model theory (Lynch and Walsh, 1998). Statistical inference in linear models mostly falls within the ordinary least squares (OLS) and maximum likelihood (ML) frameworks (Casella and Berger, 2002). The recent transition from pedigree-based classical quantitative genetics to prediction based on genome-wide markers involves some steps where the characteristics of the data complicate statistical inference and may have profound effects on model selection. One of the most important factors is the number of markers p in relation to the number of individuals n. If p < < n, we can set up the linear model y = X β + e where each individual genotype score (0,1, or 2) is collected in a matrix X (standardized over columns to have mean equal to zero and variance equal to one) and the corresponding phenotypes in a vector y (centered to have a mean of zero), and then use standard OLS to obtain unbiased solutions to the regression coefficients of the genetic markers, i.e., β = (X X) y. Note that this is also the solution to the ML function max p(y | X, β). It is straightforward to incorporate dominance and epistasis into X using indicator variables. The predicted phenotypes are calculated as and the residuals as e = y - ŷ. Based on the residuals, it is possible to calculate the residual sum of squares RSS = e e, the OLS error variance , and the mean squared error: We can also obtain the variances (diagonal terms) and covariances (off-diagonal terms) of the regression coefficients as (Ravishanker and Dey, 2002). However, for estimation of the genomic variance and the genomic heritability it is necessary to use some random effects model where the covariance structure is based on the outer product X X instead of the inner product X X (Morota and Gianola, 2014; de los Campos et al., 2015). When p < < n, OLS will give unbiased estimates of the genomic parameters with low variance. However, if n is not much larger than p, there can be considerable variability in the OLS fit, resulting in overfitting with very small, or even zero error variance, and consequently incorrect predictions of future observations. Hence, it is advisable to cast OLS into a supervized statistical learning framework where the data are split into training and test sets, and MSE is evaluated on the test set (Hastie et al., 2009).

Regularization

Although the number of genotyped individuals is generally increasing, the experimental setting in genomic prediction is often that p > n or even p > > n. This is an example of a high-dimensional statistical problem which leads to certain challenges (Johnstone and Titterington, 2009; Fan et al., 2014). Standard OLS is not applicable in this situation, because X X is singular (i.e., does not have an inverse) and the parameters in the regression model cannot be uniquely estimated. One approach to overcome the singularity problem is to use regularization (also known as penalization). An early example of this is ridge regression (RR) (Hoerl and Kennard, 1970), in which the regression coefficient is estimated using , where I is an identity matrix and λ is a positive penalty parameter that needs to be tuned using training and test data. Note that genomic best unbiased linear prediction (GBLUP) is a form of random effects RR, where and the genomic relationship matrix G is calculated based on XX (Goddard, 2009; Morota and Gianola, 2014). There is also a Bayesian rationale for RR where the regression coefficients follows a normal prior, . The RR estimator has some interesting properties. Firstly, both the expectation and the variance tend towards zero when λ goes to infinity. Secondly, compared with OLS estimates, is biased, and the variance of the OLS estimator is always larger than when λ > 0 (van Wieringen, 2018). Another interesting feature of RR appears when considering the MSE. In general, for any estimator of a parameter θ, the mean squared test error can be decomposed following (Hastie et al., 2009). The bias–variance decomposition is a way of analyzing the expected test error of a learning algorithm with respect to a particular problem. In order to minimize the test error, a model that simultaneously achieves low variance and low bias needs to be selected. The variance refers to the amount by which θ would change if it were estimated using other training datasets. Ideally, the estimate of θ should vary as little as possible. Bias represents the error that is the result of approximating a complex problem with a simpler model. Generally, more flexible methods result in less bias, but also lead to higher variance. Hence, there is a bias–variance trade-off that needs to be optimized using the test data. For data with an orthonormal design matrix, i.e., X X = I = (X X) and n = p, it can be mathematically shown that there is a value of λ > 0 where (Theobald, 1974; Farebrother, 1976). RR can be written as an optimization problem , where denotes the Euclidean ℓ2-norm. The first term is the loss function and the second term the penalty. By changing the penalty into an ℓ1-norm, we end up with which is also known as the LASSO (Tibshirani, 1996). In contrast to RR, the LASSO sets regression coefficients to zero and therefore performs variable selection. In general, the LASSO will perform better than RR when a relatively small number of predictors (markers) have relatively large effects on the response (phenotype). Compared with OLS, the LASSO and also RR can yield a reduction in variance at the expense of some increase in bias, and consequently generate lower MSE and better prediction accuracy (Hastie et al., 2009). Unfortunately, minimization of the LASSO problem does not provide an estimate of the error variance, because it depends on a complex relationship between the signal-to-noise ratio (i.e., the heritability) and the sparsity pattern (i.e., number of QTLs in relation to number of markers). In general, it is notoriously difficult to obtain proper error variance estimates with regularization methods in the p > n situation, because of the biased estimates and the difficulty in calculating correct degrees of freedom (Reid et al., 2016). The LASSO has been extended in many directions (Vidaurre et al., 2013; Hastie et al., 2015). Among the most interesting variants is the adaptive LASSO (ALASSO), where a pre-calculated vector w is used to weight the coefficients differently in the penalty, i.e., (Zou, 2006). The weights can be calculated as the absolute values of marginal covariances between the markers and the phenotype. The bias introduced by the shrinkage of in RR and LASSO is reduced in ALASSO at the expense of an increase in variance (Giraud, 2015). The LASSO and ALASSO have shown competitive prediction performance compared with a range of other methods in comparative genomic prediction studies (Li and Sillanpää, 2012; Momen et al., 2018).

Model Selection

In order to determine the best model, it is important to find a good measure of the lowest test error, because the training error will decrease when more variables or parameters are added to the model. There are a number of approaches (e.g., Mallows’ C, AIC and BIC) that attempt to correct the training RSS for model size. However, their use as model selection criteria in regularized models with p > n data is questionable, since they rely on asymptotic theory, for example that it is possible to obtain correct degrees of freedom and unbiased error variance estimates. In an application of RR to genomic marker data, Whittaker et al. (2000) suggest optimizing λ by minimizing , which is a variant of Mallows’ C. An alternative approach is to use cross-validation (CV). There are several variants of CV, but the general idea is to average MSE over some sets of hold-out test data (Hastie et al., 2009). In quantitative genetics, it is common to use the Pearson correlation coefficient, r, as a model selection criterion, both with and without CV (González-Recio et al., 2014). Daetwyler et al. (2008) suggest to use the expected predictive correlation accuracy: for model evaluation in genome-enabled prediction. The use of r2 for model comparison has been questioned, see for example Gianola and Schön (2016). Based on the regularization theory above, it is evident that there are potential problems with r because VAR[y] will be unaffected, whereas VAR[ŷ] will be heavily influenced by the type of model and level of regularization. It is also possible to assess the goodness of fit of the models using the coefficient of determination R. Kvålseth (1985) identifies eight different variants of this statistic and compares them for different types of models. For linear OLS regression models with an intercept term, the problem seems to be of a minor nature, since the majority of the R statistics are equivalent. However, for other types of models, such as linear models without intercepts or nonlinear models, the various R statistics generally yield different values. Although not examined by Kvålseth (1985), the same problem applies to regularized models. Kvålseth (1985) concludes that the best coefficient to use is:

Illustration of the Problem With r2

In a recent publication (Waldmann et al., 2019), we presented a novel automatic adaptive LASSO (AUTALASSO) based on the alternating direction method of multipliers (ADMM) optimization algorithm. We also compared the ALASSO, LASSO, and RR on a simulated dataset using the glmnet software (Friedman et al., 2010). The original simulated data stem from the QTLMAS2010 workshop (Szydłowski and Paczynska, 2011). The total number of individuals is 3,226, structured in a pedigree with five generations. The continuous quantitative trait was created from 37 QTLs, including nine controlled major genes and 28 random minor genes. The controlled QTLs included two pairs of epistatic genes with no individual effects, three maternally imprinted genes, and two additive major genes. The random genes were chosen among the simulated SNPs and their effects were sampled from a truncated normal distribution. In addition to these original data, one dominance locus, one over-dominance and one under-dominance loci were created and added to the phenotype (Waldmann et al., 2019). The narrow sense heritability was equal to 0.45. MAF cleaning was performed at the 0.01 level, resulting in a final sample of 9,723 SNPs. Data from individual 1 to 2,326 were used as training data and data from individual 2,327 to 3,226 as test (or validation) data. The regularization path in glmnet was run over 100 different λ-values to estimate the smallest test MSE and largest test r and R. In our previous paper (Waldmann et al., 2019), we estimated only MSE and r and therefore add R here. Application of the ALASSO, LASSO, and RR resulted in a test MSE of 64.52, 65.73, and 83.07, respectively. Hence, based on the MSE, it is clear that the ALASSO is the best model. The ALASSO is also favored in terms of R, which yields the results 0.449, 0.439, and 0.291, respectively. However, based on r, the LASSO is the best model, with an estimate of 0.460, compared with ALASSO and RR estimates of 0.455 and 0.300, respectively. Decomposing r into its parts reveals that the test VAR[y] is the same (117.2) for all three methods. However, VAR[ŷ] differs between the models, increasing from 29.54 for RR to 36.41 for the LASSO and 48.17 for the ALASSO. The COV[y,ŷ] also follows this pattern, but the proportions to VAR[ŷ] differ. These results are summarized in . Introduction of the weight factor in the ALASSO increases model complexity, which results in decreased model bias at the expense of increased variance. Most importantly, however, the test MSE is reduced. This is an example of the bias-variance trade-off that is fundamental in statistical learning, where r can provide estimates that may result in erroneous model decisions.

Table 1

Method	MSE	r²	R²	COV[ y,ŷ]	VAR[ŷ]
RR	83.07	0.300	0.291	32.22	29.54
LASSO	65.73	0.460	0.439	44.30	36.41
ALASSO	64.52	0.455	0.449	50.68	48.17

Mean squared error (MSE), predictive correlation accuracy (r), coefficient of determination (R), covariance between test phenotypes and predicted test phenotypes (COV[y, ŷ]), and variance of predicted test phenotypes (VAR[ŷ]) for ridge regression (RR), LASSO and adaptive LASSO (ALASSO), evaluated on the simulated QTLMAS2010 data. Ranking of individuals in terms of breeding values and predicted phenotypes is important in breeding. The order of the 10 best individuals differs not only between the RR, LASSO and ALASSO, but also within each model when min MSE and max r are used for determination of the best model (). How regularization and the variable selection properties of the LASSO and ALASSO affects the statistical properties of rank correlation measures (e.g. Spearman’s and Kendall’s rank correlation coefficients) is unclear because of the bias-variance trade-off and needs to be further investigated. For example, a rank correlation measure can be high even if the model is highly biased and therefore the rank statistic may work in the opposite direction of the MSE loss function which will lead to optimization conflicts. Hence, it would be necessary to use a model with a rank-based loss function.

Table 2

Ranking of the 10 best individuals from the simulated QTLMAS2010 data based on ŷ for RR, LASSO and ALASSO using min MSE and max predictive correlation accuarcy (r) as model selection measures.

					Rank
Metdod/selection statistic	1	2	3	4	5	6	7	8	9	10
RR/min[MSE]	2,586	2,772	2,977	3,050	3,195	3,056	2,756	2,738	2,821	3,184
RR/max[r²]	2,586	2,772	3,195	2,977	3,050	3,184	2,589	2,821	2,756	2,738
LASSO/min[MSE]	2,967	2,820	2,586	2,809	3,050	2,977	3,195	2,582	2,688	2,765
LASSO/max[r²]	2,967	2,820	2,809	2,688	2,582	2,586	3,195	3,050	2,977	2,972
ALASSO/min[MSE]	2,820	2,582	2,586	2,809	3,050	2,832	3,195	3,006	2,589	2,817
ALASSO/max[r²]	2,820	2,582	2,809	2,586	3,050	3,195	2,832	3,006	2,817	2,972

Ranking of the 10 best individuals from the simulated QTLMAS2010 data based on ŷ for RR, LASSO and ALASSO using min MSE and max predictive correlation accuarcy (r) as model selection measures.

Data Availability Statement

The simulated dataset QTLMAS2010ny012.zip can be found in https://github.com/patwa67/AUTALASSO.

Author Contributions

The author wrote, read and approved the final version of the manuscript.

Funding

Financial support was provided by the Beijer Laboratory for Animal Science, SLU, Uppsala.

Conflict of Interest

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

13 in total

1. Marker-assisted selection using ridge regression.

Authors: J C Whittaker; R Thompson; M C Denham
Journal: Genet Res Date: 2000-04 Impact factor: 1.588

Review 2. Overview of LASSO-related penalized regression methods for quantitative trait mapping and genomic selection.

Authors: Zitong Li; Mikko J Sillanpää
Journal: Theor Appl Genet Date: 2012-05-24 Impact factor: 5.699

3. Statistical challenges of high-dimensional data.

Authors: Iain M Johnstone; D Michael Titterington
Journal: Philos Trans A Math Phys Eng Sci Date: 2009-11-13 Impact factor: 4.226

4. Challenges of Big Data Analysis.

Authors: Jianqing Fan; Fang Han; Han Liu
Journal: Natl Sci Rev Date: 2014-06 Impact factor: 17.275

5. Regularization Paths for Generalized Linear Models via Coordinate Descent.

Authors: Jerome Friedman; Trevor Hastie; Rob Tibshirani
Journal: J Stat Softw Date: 2010 Impact factor: 6.440

6. QTLMAS 2010: simulated dataset.

Authors: Maciej Szydlowski; Paulina Paczyńska
Journal: BMC Proc Date: 2011-05-27

7. Cross-Validation Without Doing Cross-Validation in Genome-Enabled Prediction.

Authors: Daniel Gianola; Chris-Carolin Schön
Journal: G3 (Bethesda) Date: 2016-10-13 Impact factor: 3.154

8. Predictive ability of genome-assisted statistical models under various forms of gene action.

Authors: Mehdi Momen; Ahmad Ayatollahi Mehrgardi; Ayyub Sheikhi; Andreas Kranis; Llibertat Tusell; Gota Morota; Guilherme J M Rosa; Daniel Gianola
Journal: Sci Rep Date: 2018-08-17 Impact factor: 4.379

9. AUTALASSO: an automatic adaptive LASSO for genome-wide prediction.

Authors: Patrik Waldmann; Maja Ferenčaković; Gábor Mészáros; Negar Khayatzadeh; Ino Curik; Johann Sölkner
Journal: BMC Bioinformatics Date: 2019-04-02 Impact factor: 3.169

10. Accuracy of predicting the genetic risk of disease using a genome-wide approach.

Authors: Hans D Daetwyler; Beatriz Villanueva; John A Woolliams
Journal: PLoS One Date: 2008-10-14 Impact factor: 3.240

5 in total

1. Genome-Wide Association Study and Genomic Prediction for Bacterial Wilt Resistance in Common Bean (Phaseolus vulgaris) Core Collection.

Authors: Bazgha Zia; Ainong Shi; Dotun Olaoye; Haizheng Xiong; Waltram Ravelombola; Paul Gepts; Howard F Schwartz; Mark A Brick; Kristen Otto; Barry Ogg; Senyu Chen
Journal: Front Genet Date: 2022-05-31 Impact factor: 4.772

2. Prognostic Nomogram Based on the Metastatic Lymph Node Ratio for T_1-4N_0-1M₀ Pancreatic Neuroendocrine Tumors After Surgery.

Authors: Jingxiang Shi; Sifan Liu; Jisen Cao; Shigang Shan; Chaoyi Ren; Jinjuan Zhang; Yijun Wang
Journal: Front Oncol Date: 2022-04-27 Impact factor: 5.738

3. Evaluation of Bayesian alphabet and GBLUP based on different marker density for genomic prediction in Alpine Merino sheep.

Authors: Shaohua Zhu; Tingting Guo; Chao Yuan; Jianbin Liu; Jianye Li; Mei Han; Hongchang Zhao; Yi Wu; Weibo Sun; Xijun Wang; Tianxiang Wang; Jigang Liu; Christian Keambou Tiambo; Yaojing Yue; Bohui Yang
Journal: G3 (Bethesda) Date: 2021-10-19 Impact factor: 3.154

4. Number of Positive Lymph Nodes Combined with the Logarithmic Ratio of Positive Lymph Nodes predicts Survival in Patients with Non-Metastatic Larynx Squamous Cell Carcinoma.

Authors: Qiyue Wang; Zhuo Tan; Chuanming Zheng; Jiafeng Wang; Guowan Zheng; Ping Huang; Yiwen Zhang; Minghua Ge
Journal: J Cancer Date: 2022-03-14 Impact factor: 4.207

5. Genome-Wide Association and Prediction of Traits Related to Salt Tolerance in Autotetraploid Alfalfa (Medicago sativa L.).

Authors: Cesar Augusto Medina; Charles Hawkins; Xiang-Ping Liu; Michael Peel; Long-Xi Yu
Journal: Int J Mol Sci Date: 2020-05-09 Impact factor: 5.923

5 in total