
A Bayesian Genomic Regression Model with Skew Normal Random Errors.

Paulino Pérez-Rodríguez1, Rocío Acosta-Pech1, Sergio Pérez-Elizalde1, Ciro Velasco Cruz1, Javier Suárez Espinosa1, José Crossa2,3.   

Abstract

Genomic selection (GS) has become a tool for selecting candidates in plant and animal breeding programs. In the case of quantitative traits, it is common to assume that the distribution of the response variable can be approximated by a normal distribution. However, it is known that the selection process leads to skewed distributions. There is a vast statistical literature on skewed distributions, but the skew normal distribution is of particular interest in this research. This distribution includes a third parameter that drives the skewness, so that it generalizes the normal distribution. We propose an extension of Bayesian whole-genome regression to data with a skew normal distribution in the context of GS applications, where the number of predictors usually vastly exceeds the sample size; however, it can also be applied when the number of predictors is smaller than the sample size. We used a stochastic representation of a skew normal random variable, which allows the implementation of standard Markov Chain Monte Carlo (MCMC) techniques to efficiently fit the proposed model. The predictive ability and goodness of fit of the proposed model were evaluated using simulated and real data, and the results were compared to those obtained with the Bayesian Ridge Regression model. Results indicate that the proposed model has a better fit than, and predicts as well as, the conventional Bayesian Ridge Regression model, based on the DIC criterion and cross-validation, respectively. A computer program coded in the R statistical package and the C programming language to fit the proposed model is available as supplementary material.
Copyright © 2018 Pérez-Rodríguez et al.

Keywords:  GBLUP; GenPred; Genomic Selection; Ridge regression; Shared Data Resources; asymmetric distributions; data augmentation

Year:  2018        PMID: 29588381      PMCID: PMC5940167          DOI: 10.1534/g3.117.300406

Source DB:  PubMed          Journal:  G3 (Bethesda)        ISSN: 2160-1836            Impact factor:   3.154


In genetic studies of plants or animals, it is common to find quantitative traits whose distribution is not normal; this is often because the data are obtained from multiple sources or contain isolated observations (Li ). Landfors noted that it is often necessary to normalize the data to remove variation introduced during the development of the experiment. However, such standard normalization techniques are not always able to remove bias, because a large number of observations may be positively or negatively affected by some treatment. In addition, a suitable transformation for the data may be difficult to find, and may bring further complications when estimating and interpreting the results obtained (Fernandes ). To avoid transformations, different methods have been developed that are flexible enough to represent the data and reduce unrealistic assumptions (Arellano-Valle ). In the genomic selection framework (GS; Meuwissen ), dense molecular marker genotypes and phenotypes are used to predict the genetic values of candidates for selection. The availability of high-density molecular markers for many agricultural species, together with promising results from simulations (e.g., Meuwissen ) and empirical studies in plants (de los Campos ; de los Campos ; Crossa , 2011) and animals (e.g., VanRaden 2008; Weigel ), is prompting the adoption of GS in several breeding programs. The parametric model for GS explains phenotypes $y_i$ ($i = 1, \dots, n$) as functions of marker genotypes $x_{ij}$ using a linear model of the form $y_i = \mu + \sum_{j=1}^{p} x_{ij}\beta_j + e_i$, where $n$ is the number of phenotypic records, $p$ is the number of markers, $x_{ij}$ represents the number of copies of a bi-allelic marker (e.g., an SNP), and $\beta_j$ is the additive effect of the reference allele at the $j$-th marker, $j = 1, \dots, p$. In matrix notation, the model is expressed as $\mathbf{y} = \mathbf{1}\mu + \mathbf{X}\boldsymbol{\beta} + \mathbf{e}$, where $\mathbf{y}$, $\boldsymbol{\beta}$ and $\mathbf{e}$ are vectors of phenotypes, marker effects and Gaussian zero-mean errors, respectively, and $\mathbf{X}$ is an $n \times p$ matrix of marker genotypes.
However, when the data are not normal, normal-theory regression methods generate estimates that are inconsistent with the natural distribution of the data and, therefore, the estimation of the conditional mean given the covariates is also inconsistent (Bianco ). With dense molecular markers, the number of markers exceeds the number of records in the reference population ($p \gg n$) and, therefore, penalized regression estimation methods and their Bayesian counterparts are commonly used. Penalized estimation methods produce regression parameter estimates that are often equivalent to posterior modes. The literature on GS offers a long list of Bayesian models whose main difference is the prior distributions assigned to marker effects, which leads to what is known as the Bayesian Alphabet (Gianola 2013; de los Campos ). The above-mentioned models assume that the response follows a normal distribution. Several phenotypic traits have distributions that are skewed, for example, female flowering, male flowering, the interval between male and female flowering, categorical measurements of diseases on an ordinal scale, count data, etc. In these cases, either a normalizing transformation of the response variable (e.g., a Box-Cox transformation) or a model that deals with skewed responses may be used. Varona proposed using linear mixed models with asymmetric distributions for the residuals to tackle the problem in the context of animal breeding when pedigree information is available. Nascimento proposed Regularized Quantile Regression as a way to overcome the issue of non-symmetric distributions when marker information is available. If a population is selected so that one trait exceeds (does not exceed) some threshold, then the conditional distribution of another trait of interest, given that selection, is skewed (Arnold and Beaver 2000); the skew-normal (SN) distribution, which arises in exactly this way, is of particular interest in this research.
This distribution is a generalization of the normal distribution (Azzalini 1985) with a shape parameter added so that it can adopt skewed forms. It has the advantage of being mathematically tractable and shares properties with the normal distribution; for example, the density of the SN is unimodal (Genton 2004). Varona argues that the asymmetric distributions observed for phenotypes are the result of environmental factors and that the data can be modeled using non-symmetric residual distributions. Based on the previous considerations, and motivated by the fact that a great many traits in plant breeding are skew-normally distributed, such as flowering time in most crop species, as well as categorical traits such as diseases (ordinal, binomial, or count data), in this study we propose a general Bayesian genomic regression model for phenotypic traits with skew-normal random errors. The model uses a stochastic representation of the response variable (Arnold and Beaver 2000) in order to ease computations, and it also works when $n > p$. It should be pointed out, however, that the aim of the paper is not to describe and study the causes of the skewed distribution; rather, we assume that the skewed data are given, and the objective is to propose a robust statistical model that deals with the skew-normal distribution of the phenotypes and residuals. The structure of this paper is as follows. In section 2, we present the statistical models and describe the latent variable model used in the regression. In section 3, we describe a simulation experiment performed to evaluate the predictive power of the proposed model. In section 4, we present an application with real data; section 5 includes the discussion and concluding remarks.

MATERIALS AND METHODS

Statistical models

In this section we introduce the statistical models used in the manuscript. We begin with a brief review of the skew normal model. Then we introduce the concept of data augmentation and use it to generate a skew normal random variable. After that, we introduce the "centered parameterization" of the skew normal model and regression with skew normal random errors. Finally, we present the prior, posterior and full conditional distributions for the regression model with skew normal residuals.

Skew-normal model:

A continuous random variable $Y$ is said to follow the skew-normal law with shape parameter $\lambda$, denoted by $Y \sim SN_d(\lambda)$, if its density function is

$f(y; \lambda) = 2\phi(y)\Phi(\lambda y),$

where $\phi(\cdot)$ and $\Phi(\cdot)$ denote the density and cumulative distribution functions of a standard normal random variable, respectively. The subscript $d$ indicates the use of the "direct parametrization" (Azzalini 1985). Note that the skew normal distribution reduces to the normal case when $\lambda = 0$. With $\delta = \lambda/\sqrt{1 + \lambda^2}$, the mean and variance of $Y$ are given by

$E(Y) = \delta\sqrt{2/\pi}, \qquad Var(Y) = 1 - \frac{2\delta^2}{\pi}.$

The coefficient of skewness of $Y$ is

$\rho = \frac{4 - \pi}{2}\,\frac{(\delta\sqrt{2/\pi})^3}{(1 - 2\delta^2/\pi)^{3/2}}.$

If $X$ is a random variable defined by $X = \xi + \omega Y$, then $X$ is said to have a skew-normal distribution with location ($\xi$), scale ($\omega$), and shape ($\lambda$) parameters, and is denoted as $X \sim SN_d(\xi, \omega^2, \lambda)$. The density function of $X$ is given by

$f(x; \xi, \omega, \lambda) = \frac{2}{\omega}\,\phi\!\left(\frac{x - \xi}{\omega}\right)\Phi\!\left(\lambda\,\frac{x - \xi}{\omega}\right).$

It can be shown that the coefficient of skewness of $X$ corresponds to the skewness coefficient of $Y$.
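The density and moments above are standard and easy to check numerically. The following sketch (not the authors' code; standard library only) implements the $SN_d(\lambda)$ density with $\Phi$ computed from `math.erf` and verifies, by crude numerical integration, that it integrates to one and that its first moment matches the closed form $\delta\sqrt{2/\pi}$:

```python
# Sketch: the SN_d(lambda) density f(y) = 2*phi(y)*Phi(lambda*y) and its
# first two moments, using only the Python standard library.
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal cumulative distribution function, via erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sn_density(y, lam):
    """Density of SN_d(lambda)."""
    return 2.0 * phi(y) * Phi(lam * y)

def sn_mean_var(lam):
    """Closed-form mean and variance, with delta = lam/sqrt(1+lam^2)."""
    delta = lam / math.sqrt(1.0 + lam * lam)
    mean = delta * math.sqrt(2.0 / math.pi)
    return mean, 1.0 - 2.0 * delta * delta / math.pi

# Crude numerical check on a grid over [-10, 10] for a skewed case.
lam = 3.0
h = 0.001
grid = [i * h - 10.0 for i in range(20001)]
total = sum(sn_density(y, lam) for y in grid) * h   # should be ~1
m1 = sum(y * sn_density(y, lam) for y in grid) * h  # should match closed form
```

Setting `lam = 0` recovers the standard normal density, consistent with the reduction noted above.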

Hidden truncation:

Let $V$ and $W$ be two random variables whose joint distribution is given as follows:

$(V, W)' \sim N_2(\mathbf{0}, \mathbf{I}_2),$

where $N_2(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ denotes a bivariate normal random variable with mean $\boldsymbol{\mu}$ and variance-covariance matrix $\boldsymbol{\Sigma}$, and the random variable $U$ is defined as follows:

$U = \delta|V| + \sqrt{1 - \delta^2}\,W;$

then $U \sim SN_d(\lambda)$, with $\delta = \lambda/\sqrt{1 + \lambda^2}$ (Arnold and Beaver 2000; Hea-Jung 2005; Liseo and Parisi 2013). The above representation allows writing an augmented likelihood function (Hea-Jung 2005; Liseo and Parisi 2013), "as if" we had observed the latent variable $Z = |V|$. The conditional distribution of $U$ given $Z = z$ is $N(\delta z, 1 - \delta^2)$, and $Z$ is a truncated normal random variable with location parameter 0, scale parameter 1, lower truncation bound 0 and upper truncation bound $\infty$. Therefore, the joint distribution of $U$ and $Z$ is $f(u, z) = f(u \mid z)\,f(z)$, that is:

$f(u, z) = \frac{1}{\sqrt{1 - \delta^2}}\,\phi\!\left(\frac{u - \delta z}{\sqrt{1 - \delta^2}}\right) 2\phi(z), \quad z > 0.$

Note that the density function of $U$ can be obtained by integrating with respect to $z$; that is, $f(u) = \int_0^\infty f(u, z)\,dz = 2\phi(u)\Phi(\lambda u)$. Estimating the parameters in the direct parametrization is troublesome, so the "centered parametrization" is more appropriate for parameter estimation and interpretation (Azzalini 1985; Pewsey 2000; Azevedo , among others). If $\mu_z = E(U) = \delta\sqrt{2/\pi}$ and $\sigma_z^2 = Var(U) = 1 - 2\delta^2/\pi$, then $(U - \mu_z)/\sigma_z$ has mean 0 and variance 1. Figure 1 shows the density function of $U$ for several values of the shape parameter $\lambda$ and the corresponding values of $\rho$. The random variable

$X = \mu + \sigma\,\frac{U - \mu_z}{\sigma_z}$

is said to be a skew normal random variable with parameters $\mu$, $\sigma^2$ and $\rho$, where $\rho$ is Pearson's skewness coefficient given by $\rho = \frac{4 - \pi}{2}\,\frac{\mu_z^3}{\sigma_z^3}$, and the range of $\rho$ is (-0.99527, 0.99527). In this case, $E(X) = \mu$, $Var(X) = \sigma^2$. The usual notation is $X \sim SN_c(\mu, \sigma^2, \rho)$.
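The stochastic representation can be checked by simulation. The sketch below (illustrative parameter values, standard library only) draws skew normal variates through the hidden-truncation construction $U = \delta|V| + \sqrt{1-\delta^2}\,W$ and compares the sample mean with the closed form $E(U) = \delta\sqrt{2/\pi}$:

```python
# Sketch: drawing from SN_d(lambda) via the hidden-truncation representation,
# with V, W independent standard normals.
import math
import random

def rsn(lam, n, seed=42):
    """Draw n variates from SN_d(lam) using U = delta*|V| + sqrt(1-delta^2)*W."""
    rng = random.Random(seed)
    delta = lam / math.sqrt(1.0 + lam * lam)
    c = math.sqrt(1.0 - delta * delta)
    return [delta * abs(rng.gauss(0, 1)) + c * rng.gauss(0, 1)
            for _ in range(n)]

lam = 2.0
u = rsn(lam, 200_000)
delta = lam / math.sqrt(1.0 + lam * lam)
sample_mean = sum(u) / len(u)
theory_mean = delta * math.sqrt(2.0 / math.pi)
```

The same construction, with $|V|$ retained as the latent $Z$, is what makes the augmented likelihood tractable.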
Figure 1

Densities of the standard skew normal distribution for different values of λ and the corresponding values of ρ.

If we consider the transformation

$Y = \mu + \sigma\,\frac{U - \mu_z}{\sigma_z}$

(with $Z$ unchanged), it can be shown, using Jacobians (Casella and Berger, 2002, Chapter 2), that the joint density of $Y$ and $Z$ is given by

$f(y, z) = \frac{\sigma_z}{\sigma\sqrt{1 - \delta^2}}\,\phi\!\left(\frac{\mu_z + \sigma_z(y - \mu)/\sigma - \delta z}{\sqrt{1 - \delta^2}}\right) 2\phi(z), \quad z > 0,$

where $\mu_z = \delta\sqrt{2/\pi}$ and $\sigma_z = \sqrt{1 - 2\delta^2/\pi}$. Note that the density function of $Y$ can be obtained by integrating with respect to $z$; that is, $f(y) = \int_0^\infty f(y, z)\,dz$, with $Y \sim SN_c(\mu, \sigma^2, \rho)$. This representation is very convenient, because it allows us to write an augmented likelihood function (Hea-Jung 2005; Liseo and Parisi 2013), "as if" we had observed the latent value $Z$. The density function of the skew normal random variable under the "centered parametrization" was given by Azevedo :

$f(y) = \frac{2}{\omega}\,\phi\!\left(\frac{y - \xi}{\omega}\right)\Phi\!\left(\lambda\,\frac{y - \xi}{\omega}\right),$

where $\xi = \mu - \omega\mu_z$, $\omega = \sigma/\sigma_z$, with $\lambda = \delta/\sqrt{1 - \delta^2}$, $\delta = \mu_z\sqrt{\pi/2}$.
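The centered-to-direct mapping can be sketched in code. The helper names below are assumptions for illustration; the relations themselves are the standard Azzalini-style conversions between $(\mu, \sigma, \rho)$ and $(\xi, \omega, \lambda)$ used above:

```python
# Sketch (assumed helper names): convert centered parameters (mu, sigma, rho)
# of SN_c to direct parameters (xi, omega, lambda) of SN_d.
import math

B = (4.0 - math.pi) / 2.0  # constant in the skewness coefficient

def cp_to_dp(mu, sigma, rho):
    """Map centered (mu, sigma^2, rho) to direct (xi, omega, lam)."""
    # mu_z solves rho = B * mu_z^3 / (1 - mu_z^2)^(3/2)
    t = abs(rho) ** (2.0 / 3.0)
    mu_z = math.copysign(math.sqrt(t / (t + B ** (2.0 / 3.0))), rho)
    sigma_z = math.sqrt(1.0 - mu_z * mu_z)
    delta = mu_z * math.sqrt(math.pi / 2.0)
    lam = delta / math.sqrt(1.0 - delta * delta)
    omega = sigma / sigma_z
    xi = mu - omega * mu_z
    return xi, omega, lam

def dp_moments(xi, omega, lam):
    """Mean and standard deviation implied by SN_d(xi, omega^2, lam)."""
    delta = lam / math.sqrt(1.0 + lam * lam)
    mu_z = delta * math.sqrt(2.0 / math.pi)
    return xi + omega * mu_z, omega * math.sqrt(1.0 - mu_z * mu_z)
```

A round trip recovers the centered mean and scale, which is precisely why the centered parametrization is the convenient one for interpretation: $\mu$ and $\sigma$ are the actual mean and standard deviation of $Y$.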

Regression with skew normal random errors:

Azzalini and Capitanio (1999) and Rusell and González (2002) proposed a simple linear regression model where the error terms are independent and identically distributed as $SN_d(0, \sigma^2, \lambda)$. The proposed model is

$y_i = \beta_0 + \beta_1 x_i + e_i, \quad i = 1, \dots, n;$

from the properties of the skew normal distribution, it follows that $E(y_i) = \beta_0 + \beta_1 x_i + \sigma\delta\sqrt{2/\pi}$. The model can be easily extended to include more covariates; that is,

$y_i = \beta_0 + \sum_{j=1}^{p} x_{ij}\beta_j + e_i.$

Azzalini and Capitanio (1999) and Rusell and González (2002) used the maximum likelihood method to estimate the parameters of the model. These ideas can be extended to the case of errors that are independent and identically distributed as $SN_c(0, \sigma^2, \rho)$.

Bayesian regression with skew normal random errors (BSN):

Let $\mathbf{y} = (y_1, \dots, y_n)'$ be the vector of phenotypes, with $y_i = \beta_0 + \mathbf{x}_i'\boldsymbol{\beta} + e_i$ and the $e_i$ independent and identically distributed as $SN_c(0, \sigma^2, \rho)$. Then, the likelihood function is given by

$L(\beta_0, \boldsymbol{\beta}, \sigma^2, \rho \mid \mathbf{y}) = \prod_{i=1}^{n} f(y_i; \beta_0 + \mathbf{x}_i'\boldsymbol{\beta}, \sigma^2, \rho),$

where $f(\cdot; \mu, \sigma^2, \rho)$ is the $SN_c$ density given above. Let $\boldsymbol{\theta} = (\beta_0, \boldsymbol{\beta}', \sigma^2, \rho)'$, let $p(\boldsymbol{\theta} \mid \boldsymbol{\omega})$ be the prior distribution of $\boldsymbol{\theta}$, and let $\boldsymbol{\omega}$ be a set of hyper-parameters that index the prior distributions. Then, by Bayes' theorem, the joint posterior distribution of $\boldsymbol{\theta}$ is as follows:

$p(\boldsymbol{\theta} \mid \mathbf{y}, \boldsymbol{\omega}) \propto L(\beta_0, \boldsymbol{\beta}, \sigma^2, \rho \mid \mathbf{y})\,p(\boldsymbol{\theta} \mid \boldsymbol{\omega}).$

Neither the joint posterior distribution nor the full conditional distributions of the parameters of interest have a closed form; therefore, the implementation of this model within the Bayesian framework is computing intensive. We propose using hidden truncation together with two standard MCMC techniques in Bayesian analysis, (i) Gibbs Sampling (Geman and Geman 1984) and (ii) Random Walk Metropolis Sampling, to alleviate some of the computing burden.

Prior, posterior and full conditional distributions:

Consider the joint distribution of $Y$ and $Z$ given in (5). In the regression context, we set $\mu_i = \beta_0 + \mathbf{x}_i'\boldsymbol{\beta}$; then the augmented likelihood function is

$L(\beta_0, \boldsymbol{\beta}, \sigma^2, \rho, \mathbf{z} \mid \mathbf{y}) = \prod_{i=1}^{n} f(y_i, z_i; \beta_0 + \mathbf{x}_i'\boldsymbol{\beta}, \sigma^2, \rho).$

In order to fully specify the Bayesian model, prior distributions for the unknown parameters must be defined. Let $\beta_j \sim N(0, \sigma_\beta^2)$, $j = 1, \dots, p$; for $\rho$, the prior is based on the transformation $\rho^* = (\rho + 0.99527)/(2 \times 0.99527)$, where the prior for $\rho^*$ is denoted $p(\rho^*; \alpha_1, \alpha_2)$ and, depending on the hyper-parameters $\alpha_1$ and $\alpha_2$, it can lead to a rich variety of shapes, just as in the case of the Beta distribution. For the intercept, the prior distribution is $\beta_0 \sim N(0, \sigma_{\beta_0}^2)$; for the scale parameter, a scaled inverted chi-squared prior distribution was used, that is, $\sigma^2 \sim \chi^{-2}(\nu_e, S_e)$; and finally $\sigma_\beta^2 \sim \chi^{-2}(\nu_\beta, S_\beta)$ (see Sorensen and Gianola, 2002, p. 85, for details about the parametrization used in this paper). The joint prior distribution is the product of these components. Combining equations 6 and 7 through Bayes' theorem gives the posterior distribution of the parameters and latent variables. The full conditional distributions of the parameters are given in Appendix A; these are the inputs for the Gibbs and Random Walk Metropolis samplers. In Appendix B, some pragmatic rules to elicit values for the hyper-parameters $\nu_e$, $S_e$, $\nu_\beta$ and $S_\beta$ are given; the rules are based on those given by de los Campos and by Pérez and de los Campos (2014). In this paper, we set the hyper-parameters following those rules, in terms of the sample phenotypic variance. We set $\sigma_{\beta_0}^2$ to a large value in order to reduce shrinkage and because in practice it mimics a non-informative but proper prior. To sample from the full conditionals that lack a closed form, we implemented a Random Walk Metropolis sampler whose parameters are tuned so that the acceptance rate is about 0.23 (see Appendix A for details).
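The tuning step mentioned above can be sketched as follows. This is an illustrative one-dimensional Random Walk Metropolis sampler (not the authors' full conditionals; the toy target and adaptation rule are assumptions) whose proposal scale is adapted in batches so the acceptance rate approaches the usual 0.23 target:

```python
# Sketch: random walk Metropolis with crude batch-wise scale adaptation
# toward a 23% acceptance rate, applied to a skewed toy log-density.
import math
import random

def rw_metropolis(logpdf, x0, n_iter=20_000, batch=200, seed=7):
    """Return (samples, overall acceptance rate)."""
    rng = random.Random(seed)
    x, scale = x0, 1.0
    accepted, batch_acc, samples = 0, 0, []
    for i in range(1, n_iter + 1):
        prop = x + rng.gauss(0, scale)
        logdiff = logpdf(prop) - logpdf(x)
        if logdiff >= 0 or rng.random() < math.exp(logdiff):
            x = prop
            accepted += 1
            batch_acc += 1
        samples.append(x)
        if i % batch == 0:  # widen/narrow the proposal toward 23% acceptance
            rate = batch_acc / batch
            scale *= math.exp(0.5 * (rate - 0.23))
            batch_acc = 0
    return samples, accepted / n_iter

# Toy target: log of 2*phi(x)*Phi(3x) up to a constant (an SN_d(3) density).
def log_target(x):
    Phi3x = 0.5 * (1.0 + math.erf(3.0 * x / math.sqrt(2.0)))
    return -0.5 * x * x + math.log(max(1e-300, Phi3x))

samples, acc_rate = rw_metropolis(log_target, 0.0)
```

For a near-Gaussian target in one dimension, acceptance rates in the 0.2-0.4 range are the usual efficiency guideline, which is what the batch adaptation aims for.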
The BSN can be re-parametrized by replacing $\mathbf{X}\boldsymbol{\beta}$ with $\mathbf{g} = \mathbf{X}\boldsymbol{\beta}$; if the prior distribution of marker effects is normal with mean 0 and variance $\sigma_\beta^2$, then the prior of $\mathbf{g}$ is $N(\mathbf{0}, \sigma_\beta^2\mathbf{X}\mathbf{X}')$, which leads to a G-BLUP model (see de los Campos , for details about G-BLUP) but with skew normal residuals, that is, $y_i = \beta_0 + g_i + e_i$ or, in matrix notation, $\mathbf{y} = \mathbf{1}\beta_0 + \mathbf{g} + \mathbf{e}$, which is a skew linear mixed model, a particular case of the model proposed by Arellano-Valle ( , 2007), who relaxed all normality assumptions in a standard mixed model.
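The implied covariance of $\mathbf{g}$ can be verified directly. The sketch below (toy 0/1 genotypes; numpy assumed available) draws marker effects from the normal prior and checks that the empirical covariance of $\mathbf{g} = \mathbf{X}\boldsymbol{\beta}$ approaches $\sigma_\beta^2\mathbf{X}\mathbf{X}'$:

```python
# Sketch (toy genotypes, numpy assumed): if beta ~ N(0, sigma_b2 * I),
# then g = X beta has covariance sigma_b2 * X X', the G-BLUP-type kernel.
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma_b2 = 8, 50, 0.01
X = rng.integers(0, 2, size=(n, p)).astype(float)  # 0/1 marker codes

K = sigma_b2 * X @ X.T           # covariance of g implied by the marker prior

# Monte Carlo check: empirical covariance across many draws of beta.
betas = rng.normal(0.0, np.sqrt(sigma_b2), size=(p, 20_000))
g = X @ betas                     # each column is one draw of g
emp = np.cov(g)                   # 8 x 8 empirical covariance
```

This equivalence is what allows fitting the same model either on marker effects ($p$ unknowns) or on genetic values ($n$ unknowns), whichever is smaller.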

Bayesian ridge regression With random normal errors (BRR):

Regression with normal random errors is a special case of the proposed model when $\rho = 0$. The model is widely used in the GS literature (e.g., de los Campos ). In the GS context, the model is given by

$y_i = \beta_0 + \sum_{j=1}^{p} x_{ij}\beta_j + e_i,$

where the $e_i$ are independent and identically distributed as $N(0, \sigma^2)$. The prior distributions for the unknowns are: $\beta_j \sim N(0, \sigma_\beta^2)$ and $\beta_0 \sim N(0, \sigma_{\beta_0}^2)$; for the scale parameter, a scaled inverted chi-squared distribution, that is, $\sigma^2 \sim \chi^{-2}(\nu_e, S_e)$; and finally $\sigma_\beta^2 \sim \chi^{-2}(\nu_\beta, S_\beta)$. The joint prior distribution is the product of these components, and the posterior distribution is proportional to the likelihood times the prior. The required full conditional distributions of the parameters for implementing a Gibbs sampler can be found elsewhere (e.g., de los Campos ). We set the hyper-parameters using the same rules as in the BSN model. The BRR model can be fitted easily using the BGLR statistical package (Pérez and de los Campos 2014).
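Conditional on the variance components, the BRR posterior mean of the marker effects is the classical ridge estimator with regularization parameter $\sigma_e^2/\sigma_\beta^2$ (the quantity $\theta$ reported later in Table 1). A minimal sketch with synthetic data (numpy assumed; not the BGLR implementation):

```python
# Sketch (synthetic data, numpy assumed): the ridge / BRR posterior mean
# beta_hat = (X'X + lam*I)^{-1} X'y, with lam = sigma_e^2 / sigma_beta^2.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 30
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = 0.7                        # a few nonzero effects
y = X @ beta_true + rng.normal(size=n)     # normal errors, as in BRR

def ridge_posterior_mean(X, y, lam):
    """Posterior mean of beta given the variance components."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

b_small = ridge_posterior_mean(X, y, lam=1.0)
b_big = ridge_posterior_mean(X, y, lam=1000.0)   # heavier shrinkage
```

Larger `lam` shrinks the coefficient vector toward zero, which is the sense in which higher $\theta$ means more shrinkage.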

Monte Carlo Simulation

In this section, we use data simulated from marker genotypes of a wheat dataset made publicly available by Crossa . The dataset includes genotypic information for 599 wheat lines which were genotyped for 1279 DArT markers coded as 0 and 1. We simulated the phenotypes using the following additive genetic model:

$y_i = \beta_0 + \sum_{j=1}^{p} x_{ij}\beta_j + e_i, \quad (9)$

where $e_i \sim SN_c(0, \sigma^2, \rho)$, with $\rho \in \{0, 0.5, 0.75, 0.90, 0.95, 0.99\}$, which leads to different degrees of skewness. The intercept parameter $\beta_0$ was set equal to 3; 10 marker effects were sampled from a normal distribution with mean 0 and variance 0.5/10 (Pérez and de los Campos 2014), and the rest were set equal to 0. The idea here is to verify, through simulation, whether the proposed model works satisfactorily. We therefore obtained point estimates for $\beta_0$, $\sigma^2$, $\sigma_\beta^2$ and $\rho$. We also fitted the Bayesian Ridge Regression model and compared the estimates of regression coefficients, predictions and estimates of genetic values of both models. Let $\hat{\boldsymbol{\beta}}$ be the vector of posterior means of the regression coefficients. Pearson's correlation between the observed ($\mathbf{y}$) and predicted values ($\hat{\mathbf{y}}$) is a goodness-of-fit measure; Pearson's correlation between the "true" genetic values ($\mathbf{X}\boldsymbol{\beta}$) and the predicted values ($\mathbf{X}\hat{\boldsymbol{\beta}}$) is a measure of how well the genetic values are estimated; finally, Pearson's correlation between the "true" marker effects ($\boldsymbol{\beta}$) and the estimated effects ($\hat{\boldsymbol{\beta}}$) indicates how good a model is at uncovering marker effects (de los Campos ). We also computed the effective number of parameters ($p_D$) and the deviance information criterion (DIC) for the two fitted models (see Spiegelhalter , for more details). The algorithm used in this simulation experiment is described briefly below.

1. Set the values of $\beta_0$, $\sigma^2$ and $\rho$.
2. Simulate the phenotypes using equation (9).
3. Fit the regression model with skew normal random errors and obtain point estimates for $\beta_0$, $\sigma^2$, $\sigma_\beta^2$ and $\rho$, that is, $\hat\beta_0$, $\hat\sigma^2$, $\hat\sigma_\beta^2$ and $\hat\rho$; the point estimates correspond to the means of the posterior distributions of the parameters of interest.
4. Fit the Bayesian Ridge Regression model and obtain point estimates for $\beta_0$, $\sigma^2$ and $\sigma_\beta^2$, that is, $\hat\beta_0$, $\hat\sigma^2$ and $\hat\sigma_\beta^2$.
5. Compute the correlation between observed and predicted phenotypes, "true" and predicted genetic values, and "true" and estimated regression coefficients for both regression models, together with the effective number of parameters ($p_D$) and the deviance information criterion (DIC).
6. Repeat steps 1 to 5 one hundred times and obtain the averages of the correlations, the intercept, $p_D$ and DIC.
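The data-generating step of the scheme above can be sketched as follows (numpy assumed; the genotype matrix here is a random 0/1 stand-in for the real DArT data, and the shape value `lam` is an arbitrary choice for illustration). The errors are generated through the hidden-truncation representation and standardized so they have mean 0 and variance $\sigma^2$, as required by $SN_c(0, \sigma^2, \rho)$:

```python
# Sketch: simulate phenotypes under model (9) with zero-mean skew normal
# errors, via the hidden-truncation representation.
import numpy as np

def rsn_centered(n, sigma2, lam, rng):
    """Zero-mean, variance-sigma2 skew normal errors with shape lam."""
    delta = lam / np.sqrt(1.0 + lam**2)
    u = (delta * np.abs(rng.normal(size=n))
         + np.sqrt(1.0 - delta**2) * rng.normal(size=n))
    mu_z = delta * np.sqrt(2.0 / np.pi)
    sigma_z = np.sqrt(1.0 - mu_z**2)
    return np.sqrt(sigma2) * (u - mu_z) / sigma_z

rng = np.random.default_rng(3)
n, p = 599, 1279                                    # wheat lines x DArT markers
X = rng.integers(0, 2, size=(n, p)).astype(float)   # stand-in genotypes
beta = np.zeros(p)
nz = rng.choice(p, size=10, replace=False)
beta[nz] = rng.normal(0.0, np.sqrt(0.5 / 10), size=10)  # 10 nonzero effects
y = 3.0 + X @ beta + rsn_centered(n, sigma2=1.0, lam=5.0, rng=rng)
```

With `lam = 0` the same generator produces normal errors, so one simulation routine covers all the $\rho$ settings in Table 1.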

Application to real data

This dataset is from the Drought Tolerance Maize (DTMA) project of CIMMYT's Global Maize Program (http://www.cimmyt.org). The dataset comes from a large study aimed at detecting chromosomal regions affecting drought tolerance. The genotypic data consist of information from 300 tropical inbred lines that were genotyped using 1,152 SNPs (Single Nucleotide Polymorphisms). The analyzed trait is Gray Leaf Spot (GLS), caused by the fungus Cercospora zeae-maydis, which was evaluated at three different sites: Kakamega (Kenya), San Pedro Lagunillas (Mexico) and Santa Catalina (Colombia) (see Supporting information). Crossa analyzed a subset of these data; the response variable was transformed using a Box-Cox transformation (Box and Cox 1964). Figure 2 shows density plots for GLS rating at the three sites. Kernel density was estimated using a Gaussian kernel, and the bandwidth for the kernel was estimated according to Venables and Ripley (2002). Figure 2 also shows the sample skewness index, $b_1 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^3 / s^3$, where $\bar{y}$ is the sample mean and $s$ is the sample standard deviation (see Joanes and Gill 1998); in the three cases, the distribution is skewed to the right, so most of the distribution is concentrated around small values of the response variable.
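The sample skewness index just described is straightforward to compute; a minimal sketch (standard library only, with $s$ taken from the $n$-denominator variance, one of the conventions discussed by Joanes and Gill 1998):

```python
# Sketch: the sample skewness index b1 = (1/n) * sum((y_i - ybar)^3) / s^3.
import math

def skewness(y):
    """Sample skewness with n-denominator moments."""
    n = len(y)
    ybar = sum(y) / n
    m2 = sum((v - ybar) ** 2 for v in y) / n   # population variance
    m3 = sum((v - ybar) ** 3 for v in y) / n   # third central moment
    return m3 / m2 ** 1.5

# A right-skewed toy sample (most ratings small, a few large) gives b1 > 0,
# the same pattern reported for the GLS ratings at all three sites.
ratings = [1, 1, 1, 1, 2, 2, 2, 3, 4, 5]
```

A positive value indicates right skewness; a symmetric sample gives a value of zero.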
Figure 2

Density plot for Gray Leaf Spot (GLS) rating (disease resistance), at each site: Kakamega (Kenya), San Pedro Lagunillas (Mexico) and Santa Catalina (Colombia).

We propose using the regression model with skew normal random errors to predict disease resistance. We fitted two models: (1) the standard Bayesian Ridge Regression, where the errors are NIID $N(0, \sigma^2)$, with "NIID" standing for "normally, independently and identically distributed"; and (2) the proposed model with skew normal random errors. The Bayesian Ridge Regression was fitted using the BGLR package (Pérez and de los Campos 2014), whereas the proposed model was fitted using the algorithm described in Appendix A. The models were first fitted using the full data; subsequently, 100 random partitions with 80% of observations in the training set and 20% of observations in the testing set were generated. The two models were fitted for each of these random partitions; then the phenotypes of the individuals in the testing set were predicted, and the ability of each model to make predictions was evaluated using Pearson's correlation between observed and predicted values. Inferences for each fit were based on 100,000 samples obtained after discarding 50,000 samples as burn-in. Convergence was checked by inspecting trace plots of the parameters.
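The partitioning scheme just described can be sketched with the standard library alone (seed and function name are assumptions for illustration):

```python
# Sketch: 100 random 80/20 training/testing partitions for cross-validation.
import random

def cv_partitions(n, n_parts=100, test_frac=0.20, seed=123):
    """Yield (train_idx, test_idx) pairs with ~20% of records held out."""
    rng = random.Random(seed)
    n_test = int(round(n * test_frac))
    for _ in range(n_parts):
        idx = list(range(n))
        rng.shuffle(idx)
        yield idx[n_test:], idx[:n_test]

parts = list(cv_partitions(300))   # 300 maize lines, as in the DTMA data
```

Each partition holds out 60 of the 300 lines; predictions for the held-out lines are then correlated with the observed phenotypes.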

Data and program availability

The data and programs are available as File S1 which corresponds to a compressed zip folder. The zip folder also contains a description of the data and commands to read it into the R statistical software.

RESULTS

Table 1 shows point estimates for $\beta_0$, $\sigma_e^2$, $\sigma_\beta^2$ and $\rho$ for the BSN and BRR models for different values of $\rho$. It also shows an estimate of $\theta = \sigma_e^2/\sigma_\beta^2$, a regularization parameter that is widely used in Bayesian Ridge Regression; higher values of this parameter are associated with more shrinkage. Note that the estimates of $\sigma_\beta^2$ are very similar in both models. It is also clear from this table that the point estimates for $\beta_0$ and $\sigma_e^2$ are very close to the real values used in the simulation. The correlation between observed and predicted values and the mean squared error are quite similar for both models, and there is no clear winner. Finally, the algorithm is not able to precisely estimate the parameter $\rho$ for distributions that are only slightly asymmetric.
Table 1

Point estimates and standard deviations (in parentheses) for $\beta_0$, $\sigma_e^2$ and $\sigma_\beta^2$, the regularization parameter $\hat\theta$, correlations between observed and predicted values, and MSE of predictions. Phenotypes were simulated under model (9) with $\rho \in \{0, 0.5, 0.75, 0.90, 0.95, 0.99\}$, and then regression models with skew normal (BSN) and normal (BRR) errors were fitted

Model | β̂0 | σ̂e² | σ̂β² | θ̂ | Cor(y, ŷ) | MSE

ρ = 0, ρ̂ = 0.016 (0.207)
BSN | 3.075 (0.854) | 2.257 (0.052) | 0.003 (0.001) | 833.72 | 0.479 | 2.441
BRR | 3.113 (0.975) | 2.207 (0.155) | 0.003 (0.001) | 627.33 | 0.531 | 3.036

ρ = 0.5, ρ̂ = 0.075 (0.270)
BSN | 3.009 (0.771) | 2.218 (0.047) | 0.003 (0.001) | 803.35 | 0.648 | 3.274
BRR | 2.991 (0.905) | 2.167 (0.133) | 0.003 (0.001) | 602.89 | 0.667 | 2.714

ρ = 0.75, ρ̂ = 0.329 (0.261)
BSN | 2.972 (0.714) | 2.210 (0.048) | 0.003 (0.001) | 816.71 | 0.442 | 2.219
BRR | 2.945 (0.828) | 2.168 (0.139) | 0.003 (0.001) | 614.19 | 0.506 | 2.154

ρ = 0.90, ρ̂ = 0.841 (0.115)
BSN | 3.094 (0.821) | 2.219 (0.054) | 0.003 (0.001) | 833.48 | 0.648 | 3.112
BRR | 3.120 (0.858) | 2.175 (0.155) | 0.003 (0.001) | 621.59 | 0.676 | 2.639

ρ = 0.95, ρ̂ = 0.943 (0.023)
BSN | 3.055 (0.942) | 2.270 (0.061) | 0.003 (0.001) | 872.98 | 0.662 | 1.676
BRR | 3.067 (0.900) | 2.196 (0.169) | 0.003 (0.001) | 631.79 | 0.696 | 1.642

ρ = 0.99, ρ̂ = 0.987 (0.008)
BSN | 2.830 (1.280) | 2.167 (0.056) | 0.003 (0.001) | 668.16 | 0.578 | 2.593
BRR | 2.890 (0.878) | 2.165 (0.153) | 0.004 (0.001) | 606.84 | 0.631 | 2.893

Table 2 shows the effective number of parameters $p_D$, the deviance information criterion (DIC), the correlation between "true" and estimated marker effects, and the correlation between "true" and estimated signals. The table shows that, in general, the $p_D$ and the DIC (smaller is better) favored the BSN model. The correlation between "true" and estimated marker effects is slightly better for BSN, and the difference between the two models becomes clearer as $\rho$ increases. The same pattern is observed for the correlations between true and estimated genetic signals.
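The $p_D$ and DIC quantities compared above follow the standard Spiegelhalter construction: $p_D = \bar{D} - D(\bar{\theta})$ and $\mathrm{DIC} = \bar{D} + p_D$, where $\bar{D}$ is the posterior mean deviance and $D(\bar{\theta})$ the deviance at the posterior mean. A minimal sketch (the deviance values below are toy numbers for illustration, not from the paper's chains):

```python
# Sketch: pD and DIC from MCMC deviance evaluations (Spiegelhalter et al.).
def dic(deviance_samples, deviance_at_posterior_mean):
    """Return (pD, DIC): pD = Dbar - D(theta_bar), DIC = Dbar + pD."""
    d_bar = sum(deviance_samples) / len(deviance_samples)
    p_d = d_bar - deviance_at_posterior_mean
    return p_d, d_bar + p_d

# Toy numbers for illustration only.
p_d, dic_val = dic([2210.0, 2205.0, 2215.0], 2150.0)
```

Smaller DIC indicates a better trade-off between fit and effective model complexity, which is the sense in which Table 2 favors BSN.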
Table 2

True ($\rho$) and estimated ($\hat\rho$) posterior mean of the skewness parameter, effective number of parameters ($p_D$), deviance information criterion (DIC), correlations between "true" and estimated marker effects, and correlations between "true" and estimated genetic signals; standard deviations in parentheses. Phenotypes were simulated under model (9) with $\rho \in \{0, 0.5, 0.75, 0.90, 0.95, 0.99\}$, and then regression models with skew normal (BSN) and normal (BRR) errors were fitted

Model | pD | DIC | Cor(β, β̂) | Cor(Xβ, Xβ̂)

ρ = 0, ρ̂ = 0.016 (0.207)
BSN | 40.794 | 2206.112 | 0.192 (0.046) | 0.697 (0.116)
BRR | 59.573 | 2212.001 | 0.193 (0.049) | 0.689 (0.115)

ρ = 0.5, ρ̂ = 0.075 (0.270)
BSN | 80.469 | 2279.974 | 0.207 (0.049) | 0.718 (0.119)
BRR | 91.548 | 2279.706 | 0.207 (0.050) | 0.714 (0.117)

ρ = 0.75, ρ̂ = 0.329 (0.261)
BSN | 41.996 | 2262.930 | 0.194 (0.051) | 0.717 (0.104)
BRR | 57.826 | 2267.114 | 0.195 (0.052) | 0.708 (0.104)

ρ = 0.90, ρ̂ = 0.841 (0.115)
BSN | 76.978 | 2218.017 | 0.203 (0.049) | 0.718 (0.114)
BRR | 96.31 | 2238.787 | 0.198 (0.052) | 0.706 (0.115)

ρ = 0.95, ρ̂ = 0.943 (0.023)
BSN | 93.687 | 2144.77 | 0.203 (0.046) | 0.734 (0.109)
BRR | 102.345 | 2174.32 | 0.191 (0.047) | 0.707 (0.116)

ρ = 0.99, ρ̂ = 0.987 (0.008)
BSN | 85.465 | 2151.505 | 0.216 (0.055) | 0.747 (0.098)
BRR | 83.422 | 2276.08 | 0.196 (0.052) | 0.703 (0.109)

Full data:

Table 3 shows estimates of the posterior means of the parameters $\sigma_e^2$, $\sigma_\beta^2$, $\theta$ and $\rho$, as well as the effective number of parameters and the deviance information criterion (DIC). From Table 3 it is clear that the estimation of marker effects is more precise for the BSN model than for the BRR model; the $p_D$ and the DIC also favored the BSN model. The estimated parameter $\hat\rho$ also supports the assumption of skew normal random errors, since the point estimate is far from 0 at every site except San Pedro Lagunillas.
Table 3

Estimates of posterior means of the parameters $\sigma_e^2$, $\sigma_\beta^2$, $\theta$ and $\rho$ from the full-data analysis of Kakamega, San Pedro Lagunillas and Santa Catalina for Gray Leaf Spot in 300 tropical inbred maize lines and 1,152 SNPs; standard deviations in parentheses

Site | Model | σ̂e² | σ̂β² | θ̂ | pD | DIC | ρ̂
Kakamega | BSN | 0.498 (0.0725) | 0.00032 (9e-05) | 1726.551 (723.794) | 61.257 | 586.361 | 0.981 (0.021)
Kakamega | BRR | 0.425 (0.073) | 0.00053 (0.00014) | 901.913 (391.380) | 97.367 | 629.15 | —
San Pedro Lagunillas | BSN | 0.369 (0.079) | 0.00093 (0.00019) | 425.973 (173.541) | 126.833 | 602.752 | 0.376 (0.550)
San Pedro Lagunillas | BRR | 0.331 (0.069) | 0.00104 (0.00019) | 339.114 (125.014) | 143.06 | 597.852 | —
Santa Catalina | BSN | 0.518 (0.092) | 0.00046 (0.00015) | 1331.033 (785.158) | 50.072 | 555.512 | 0.9226 (0.227)
Santa Catalina | BRR | 0.404 (0.070) | 0.00075 (0.00016) | 574.862 (199.984) | 112.447 | 595.027 | —

Figure 3 shows scatterplots of the predicted GLS obtained with the BSN and BRR models, which were fitted separately for each site. As expected, Pearson's correlation between the two sets of predictions is very high (higher than 0.95). This implies that even when the data are skewed, if a BRR model is fitted in order to obtain candidates for selection, we can expect to select about the same individuals.
Figure 3

Scatterplot of predicted Gray Leaf Spot (GLS) obtained when fitting the BSN model and the BRR model. In the three cases considered, the Pearson’s correlation between predictions was higher than 0.95.


Cross-validation:

Figure 4 shows scatterplots for Pearson’s correlation between observed and predicted values for individuals in the testing set obtained after fitting the BSN and BRR models for the three locations. When the correlations are higher for BSN than for BRR, this is represented by a filled circle, and by an open circle otherwise. The figure also shows the number of times Pearson’s correlation is higher for the BSN than for the BRR model. From this figure, it is clear that the BSN model predicts slightly better than the BRR model. Figure 5 shows a scatterplot for the mean squared errors in the testing set for the three locations. When the MSE in BSN is smaller than the MSE in BRR, this is represented by an open circle and by a filled circle otherwise. The number of times that the MSE in BRR is greater than the MSE in BSN is also shown in the plots. From this figure, it is clear that in general, the MSE for BRR is greater than the MSE for BSN. Table 4 shows the average Pearson’s correlation and mean squared error (MSE) between observed and predicted values in the testing set. The averages and the standard deviations are very similar for both models and the differences between the models are non-significant, but the figures suggest that the BSN model predicts slightly better than the BRR model.
Figure 4

Plots of the predictive correlation for each of the 100 cross-validations and 3 locations. When the best model is BSN, this is represented by a filled circle, and when the best model is BRR, this is represented by an open circle. The number of times that Pearson’s correlation in BSN is better than Pearson’s correlation in BRR is also shown in the plots.

Figure 5

Plots of the mean squared error (MSE) in the testing set for each of the 100 cross-validations and 3 locations. When the MSE in BSN is smaller than the MSE in BRR, this is represented by an open circle and when the MSE in BRR is bigger than the MSE in BSN, this is represented by a filled circle. The number of times that the MSE in BRR is bigger than the MSE in BSN is also shown in the plots.

Table 4

Average of Pearson’s correlation and mean squared error (MSE) between observed and predicted values in the testing set. The predictions were obtained after fitting the BSN and BRR models. The average is across the 100 random partitions with 80% of observations in the training set and 20% in the testing set. Standard deviations are given in parentheses

Site | Model | Pearson's correlation | MSE
Kakamega | BSN | 0.2836 (0.1157) | 0.7017 (0.1130)
Kakamega | BRR | 0.2609 (0.1163) | 0.7187 (0.1212)
San Pedro Lagunillas | BSN | 0.5489 (0.0895) | 0.7752 (0.1031)
San Pedro Lagunillas | BRR | 0.5450 (0.0887) | 0.7804 (0.1064)
Santa Catalina | BSN | 0.4871 (0.1238) | 0.7790 (0.1302)
Santa Catalina | BRR | 0.4804 (0.1220) | 0.7685 (0.1338)

DISCUSSION AND CONCLUSIONS

We have proposed a Bayesian regression model for skewed responses with applications when $p \gg n$ in the GS context, but it can also be employed in other cases and, of course, when $p < n$. In addition to generalizing linear whole-genome regression models for various discrete distributions (ordinal, binomial, etc.), this study further completes the Bayesian toolbox for whole-genome regression. The proposed model uses a stochastic representation of a skew normal random variable in order to facilitate the computations; it also allows using standard MCMC techniques to fit the proposed model. Results of the simulation and of applications with real data suggest that the proposed model fits the data better and also predicts slightly better than the standard Ridge Regression model. The Ridge Regression model is a particular case of our model when $\rho = 0$. On the other hand, our results also suggest that BRR is a very robust model: although in the simulated data we knew beforehand that it was the wrong model to fit, its predictive power was still very good. Although the conventional Bayesian whole-genome regression is robust, it does not correctly deal with skewed phenotypic data, and this can decrease its genomic-enabled prediction accuracy and its goodness of fit to the data. Thus, the advantages of the proposed Bayesian whole-genome regression compensate for its complexity and possible increases in computational time as compared to the conventional Bayesian Ridge Regression. The model proposed in this study is conceptually and operationally different from, and presumably simpler than, the skew-normal linear mixed model of Arellano-Valle , which uses a multivariate skew-normal distribution in order to relax normality.
Although skewness is a major concern for breeding data analyses and may often result from uneven sampling of "high"- and "low"-performing individuals, selection, environmental effects, etc., the theoretical developments presented in this study are also applicable to many other areas of research in agronomy and in agriculture in general. For example, most crop flowering time data are indeed skewed, as are categorical data representing different types of diseases, such as those presented in this research. Since skewness in the phenotypic response can be the result of artificial phenomena, the aim of this study was to propose a statistical model better suited to dealing with that problem. Results of this study can be compared to those of two other studies, Crossa and González-Camacho. Crossa included one site in Mexico (San Pedro Lagunillas) that was also analyzed by transforming the original GLS ordinal scale using a Box-Cox transformation; the prediction accuracy of different models (e.g., Bayesian Lasso and Reproducing Kernel Hilbert Spaces) ranged from 0.416 to 0.462. Although a strict comparison with the results obtained in this study is not possible because different random cross-validations were generated, the prediction accuracies of the BSN (0.5489) and BRR (0.5450) models were higher than those previously reported by Crossa for the same site. The stochastic representation can be used to extend Reproducing Kernel Hilbert Space (de los Campos) models, which in many empirical studies have led to more accurate predictions than Bayesian Ridge Regression and Bayesian LASSO, among others (e.g., Pérez-Rodríguez), so this is a topic for future research. Extending the model proposed in this study to include genotype × environment interaction should not be complicated. The proposed model can also be extended by assigning different prior distributions to the marker effects, for example, to induce variable selection.
This could potentially lead to a new Bayesian alphabet with skew normal random errors.

Supplementary Material

Supplemental material is available online at www.g3journal.org/lookup/suppl/doi:10.1534/g3.117.300406/-/DC1.
References (17 in total)

1.  Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images.

Authors:  S Geman; D Geman
Journal:  IEEE Trans Pattern Anal Mach Intell       Date:  1984-06       Impact factor: 6.226

2.  Semi-parametric genomic-enabled prediction of genetic values using reproducing kernel Hilbert spaces methods.

Authors:  Gustavo De los Campos; Daniel Gianola; Guilherme J M Rosa; Kent A Weigel; José Crossa
Journal:  Genet Res (Camb)       Date:  2010-08       Impact factor: 1.588

3.  Mapping of quantitative trait loci using the skew-normal distribution.

Authors:  Elisabete Fernandes; António Pacheco; Carlos Penha-Gonçalves
Journal:  J Zhejiang Univ Sci B       Date:  2007-11       Impact factor: 3.066

4.  Predicting quantitative traits with regression models for dense molecular markers and pedigree.

Authors:  Gustavo de los Campos; Hugo Naya; Daniel Gianola; José Crossa; Andrés Legarra; Eduardo Manfredi; Kent Weigel; José Miguel Cotes
Journal:  Genetics       Date:  2009-03-16       Impact factor: 4.562

5.  Efficient methods to compute genomic predictions.

Authors:  P M VanRaden
Journal:  J Dairy Sci       Date:  2008-11       Impact factor: 4.034

6.  Bayesian analysis of quantitative traits using skewed distributions.

Authors:  L Varona; N Ibañez-Escriche; R Quintanilla; J L Noguera; J Casellas
Journal:  Genet Res (Camb)       Date:  2008-04       Impact factor: 1.588

7.  Predictive ability of direct genomic values for lifetime net merit of Holstein sires using selected subsets of single nucleotide polymorphism markers.

Authors:  K A Weigel; G de los Campos; O González-Recio; H Naya; X L Wu; N Long; G J M Rosa; D Gianola
Journal:  J Dairy Sci       Date:  2009-10       Impact factor: 4.034

8.  A robust multiple-locus method for quantitative trait locus analysis of non-normally distributed multiple traits.

Authors:  Z Li; J Möttönen; M J Sillanpää
Journal:  Heredity (Edinb)       Date:  2015-07-15       Impact factor: 3.821

9.  Prediction of genetic values of quantitative traits in plant breeding using pedigree and molecular markers.

Authors:  José Crossa; Gustavo de los Campos; Paulino Pérez; Daniel Gianola; Juan Burgueño; José Luis Araus; Dan Makumbi; Ravi P Singh; Susanne Dreisigacker; Jianbing Yan; Vivi Arief; Marianne Banziger; Hans-Joachim Braun
Journal:  Genetics       Date:  2010-09-02       Impact factor: 4.562

10.  Normalization of high dimensional genomics data where the distribution of the altered variables is skewed.

Authors:  Mattias Landfors; Philge Philip; Patrik Rydén; Per Stenberg
Journal:  PLoS One       Date:  2011-11-22       Impact factor: 3.240

Cited by (1 in total)

1.  Maximum a posteriori Threshold Genomic Prediction Model for Ordinal Traits.

Authors:  Abelardo Montesinos-López; Humberto Gutierrez-Pulido; Osval Antonio Montesinos-López; José Crossa
Journal:  G3 (Bethesda)       Date:  2020-11-05       Impact factor: 3.154

