
Genomic-Enabled Prediction of Ordinal Data with Bayesian Logistic Ordinal Regression.

Osval A. Montesinos-López, Abelardo Montesinos-López, José Crossa, Juan Burgueño, Kent Eskridge.

Abstract

Most genomic-enabled prediction models developed so far assume that the response variable is continuous and normally distributed. The exception is the probit model, developed for ordered categorical phenotypes. Because the Bayesian probit ordinal regression (BPOR) model is easy to implement, Bayesian logistic ordinal regression (BLOR) is rarely used in the context of genomic-enabled prediction, where the sample size (n) is much smaller than the number of parameters (p). For this reason, in this paper we propose a BLOR model that uses the Pólya-Gamma data augmentation approach, which produces a Gibbs sampler whose full conditional distributions are similar to those of the BPOR model, with the additional advantage that the BPOR model is a particular case of the BLOR model. We evaluated the proposed model using simulation and two real data sets. Results indicate that our BLOR model is a good alternative for analyzing ordinal data in the context of genomic-enabled prediction with either the probit or the logit link.
Copyright © 2015 Montesinos-López et al.


Keywords:  Bayesian ordinal regression; GenPred; Gibbs sampler; genomic selection; logit; probit; shared data resource


Year:  2015        PMID: 26290569      PMCID: PMC4592994          DOI: 10.1534/g3.115.021154

Source DB:  PubMed          Journal:  G3 (Bethesda)        ISSN: 2160-1836            Impact factor:   3.154


Genomic-enabled prediction models are revolutionizing animal and plant breeding. There is evidence that they are powerful for predicting the genomic merit of animals and plants based on high-density single-nucleotide polymorphism (SNP) marker panels, and they are increasingly being recommended for genomic prediction in human health (Yang and Tempelman 2012). However, most genomic-enabled prediction models assume a continuous and normally distributed phenotype. Because this assumption is often not fulfilled, researchers typically approach such phenotypes in one of three ways: (a) they ignore the lack of normality in the phenotypes; (b) they transform the non-normal phenotype to approximate normality; or (c) they use generalized linear mixed models (GLMMs) to model the appropriate distribution of the phenotype (Stroup 2015). The first approach is justified for large sample sizes by the central limit theorem, which states that treatment means have an approximately normal distribution if the sample size is large enough. However, there is considerable evidence that the first approach produces highly biased results for small and moderate sample sizes (Stroup 2012, 2015). The transformations introduced by Bartlett (1947) for non-normal data were proposed to stabilize the variance and thus fulfill the assumption of homogeneous variance; they are still considered standard procedure in many agricultural disciplines. After transformation, implementation proceeds exactly as for normally distributed phenotypes (i.e., with the linear model). However, there is mounting evidence that transformations do more harm than good for the models required by most agricultural research, because the linear model, with or without transformed data, produces a great loss of accuracy and power (Stroup 2015), especially with small sample sizes. Nelder and Wedderburn (1972) introduced generalized linear models, a major departure from the usual approach to non-normal data.
GLMMs extend linear model theory to accommodate non-normal data with heterogeneous variance and even correlated observations. Viewed through the GLMM lens, the pre-1990s understanding of non-normal data––still pervasive in the agricultural research community––is antiquated at best, obsolete at worst. Today, small-sample investigations are providing an increasing body of evidence that GLMMs work as well in practice as they do in theory, because they produce more accuracy and power than approaches (a) and (b). Textbooks and software are now available for implementing GLMMs, although approaches (a) and (b) still dominate agricultural research (Stroup 2015). In genomic-enabled prediction, the use of GLMMs is still in its early stages because their implementation is not straightforward: the number of observations (n) is usually smaller than the number of covariates (p) included in the model, and a complex dependence structure is observed among covariates (markers) and observations (lines) due to the joint involvement of biological processes and pathways. To overcome this situation in the pregenomic era, Gianola (1980, 1982) and Gianola and Foulley (1983) proposed a probit (threshold) model for dealing with ordinal categorical traits in animal breeding. This probit model was extended to deal with p >> n in the genomic era by González-Recio and Forni (2011) and Villanueva for binary traits, and by Wang and Montesinos-López for more than two ordinal categories. Also, BGLR (Bayesian generalized linear regression), software developed for genomic-enabled prediction that can deal with normal, binary, ordinal, and censored data (de los Campos and Perez-Rodriguez 2013; Perez-Rodriguez and de los Campos 2014), is now available. However, no GLMMs are available for genomic-enabled prediction of count and percentage phenotypes.
For modeling ordinal categorical phenotypes, the ordinal logistic regression model is often preferred over the ordinal probit model in statistical applications because it provides regression coefficients that are more interpretable, owing to their connection to odds ratios (Zucknick and Richardson 2014). However, in genomic-enabled prediction (when p >> n), only the Bayesian probit model is frequently implemented, given that Bayesian methods that introduce sparseness through additional priors on the model size are very well suited to this problem. Because a Bayesian logistic ordinal model analogous to the Bayesian probit ordinal model, based on a data augmentation approach, has been lacking, the logistic model has not been practical for genomic selection. Both the logistic and normal distributions are symmetric with the basic unimodal "bell curve" shape. The only difference is that the logistic distribution has somewhat heavier tails, which means that it is less sensitive to outlying data (and hence somewhat more robust to model mis-specification or erroneous data). This is another advantage of logistic regression over probit regression. Because of its easy implementation, the use of BPOR is extremely common, even though it is less robust to model mis-specification and its coefficients are less interpretable. Because of the aforementioned properties of the logistic model, some researchers have proposed approximations to logit regression based on the probit. For example, Bartholomew and Knott (1999) proposed approximating the logistic distribution function by Φ(x/k), where Φ is the cumulative distribution function of the standard normal distribution, whereas Camilli (1994) proposed using k = 1.702, obtained by minimizing the maximum distance between the two cumulative distribution functions. Although Amemiya (1981) proposed a value of k = 1.6 and computed tables of representative values of the density function for different values of k, he did not explain why he chose that value.
More recently, Savalei (2006) obtained a value of k = 1.75 by minimizing the Kullback-Leibler information. Although some of these approximations do a reasonable job of approximating the logistic distribution, they are only approximations, and an exact solution is clearly preferable. In this paper, we propose a Bayesian logistic ordinal regression (BLOR) model for genomic-enabled prediction using a data augmentation approach. We illustrate the proposed method with simulated and real data, and we compare BLOR with the Bayesian probit ordinal regression (BPOR) model, with and without the approximation.
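As a quick numerical check (ours, not part of the article), the scaling constant that minimizes the maximum distance between the logistic and rescaled normal distribution functions can be recovered by a simple grid search:

```python
import numpy as np
from scipy.stats import norm

# Grid search for the constant k minimizing the sup-distance between the
# standard logistic CDF and Phi(x / k), the probit-logit scaling discussed above.
x = np.linspace(-10.0, 10.0, 4001)
logistic_cdf = 1.0 / (1.0 + np.exp(-x))
ks = np.arange(1.50, 1.90, 0.001)
max_gaps = np.array([np.abs(logistic_cdf - norm.cdf(x / k)).max() for k in ks])
k_opt = float(ks[max_gaps.argmin()])
# k_opt lands near Camilli's 1.702, and the sup-distance there is below 0.01.
```

Minimizing a different criterion (e.g., Kullback-Leibler information, as in Savalei 2006) shifts the optimum slightly, which is why values between 1.6 and 1.75 all appear in the literature.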

Materials and Methods

Gray leaf spot (GLS) and Septoria data sets

GLS is one of the most important foliar diseases of maize worldwide. The GLS data set is composed of 278 maize lines; the ordinal trait measured on each line was GLS severity [1 (no disease), 2 (low infection), 3 (moderate infection), 4 (high infection), 5 (complete infection)] caused by the fungus Cercospora zeae-maydis, evaluated in three environments (México, Zimbabwe, and Colombia). These data are part of a data set previously analyzed by Crossa, González-Camacho, and Montesinos-López. Genotypes of all 278 lines were obtained using the 55K SNP Illumina platform. SNPs with >10% missing values or a minor allele frequency ≤0.05 were excluded from the data. After line-specific quality control (applying the same quality control to each line separately), the maize data still contained 46,347 SNPs. The Septoria data set contains 268 wheat lines planted in Toluca, México, in 2010; the trait (Septoria score) was measured on an ordinal four-point scale. Genotypes of these lines were obtained with 45,000 genotyping-by-sequencing (GBS) markers, following the protocol of Poland. We kept only the 13,913 GBS markers that had <50% missing data; after filtering for minor allele frequency, we ended up with 6787 GBS markers that were used in the analysis. For implementing the proposed model, we formed five data sets from these two real data sets (GLS and Septoria), four from the GLS data set and one from the Septoria data set. The first three data sets formed from GLS correspond to the three individual environments; the fourth was formed by pooling the data from the three environments (combining the information from the three environments without including environment as a covariate).
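The marker filtering described above (excluding SNPs with >10% missing values or minor allele frequency ≤0.05) can be sketched as follows. This is our illustrative helper, not the authors' pipeline; the 0/1/2 dosage coding with NaN for missing calls is an assumption:

```python
import numpy as np

def filter_markers(geno, max_missing=0.10, min_maf=0.05):
    """Drop SNP columns with too many missing calls or a low minor allele
    frequency, mirroring the GLS quality control described above.
    geno: (lines x markers) dosage matrix coded 0/1/2, np.nan = missing."""
    missing_rate = np.isnan(geno).mean(axis=0)
    p = np.nanmean(geno, axis=0) / 2.0          # allele frequency from dosages
    maf = np.minimum(p, 1.0 - p)
    keep = (missing_rate <= max_missing) & (maf > min_maf)
    return geno[:, keep], keep

# Tiny demo: 10 lines x 3 markers.
geno = np.zeros((10, 3))
geno[:4, 0] = 1.0        # marker 0: MAF 0.2, complete   -> kept
                         # marker 1: monomorphic (MAF 0) -> dropped
geno[:3, 2] = np.nan     # marker 2: 30% missing         -> dropped
geno[3:, 2] = 1.0
filtered, keep = filter_markers(geno)
```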

Bayesian logistic ordinal regression

Let y_ij (i = 1, …, q; j = 1, …, n_i) denote the response on experimental unit j of genotype i, where i indexes the genotypes and j the replicates (experimental units) of each genotype. The total number of observations is n = Σ_{i=1}^{q} n_i, and the n-dimensional vector y collects all responses. The response variable represents an assignment into one of C mutually exclusive and exhaustive ordered categories. Therefore, the ordinal logistic regression model can be written in terms of a latent response variable as

l_ij = x'_ij β + b_i + e_ij,

where the l_ij are called "liabilities", e_ij ~ L(0, 1), with L(0, 1) denoting the standard logistic distribution, the vectors x_ij (p × 1) are explanatory variables associated with the fixed effects β, and the random genotype effect b_i ~ N(0, σ_b²). In genomic-enabled prediction, p >> n. Since the l_ij are unobservable and are measured indirectly through the observable ordinal variable y_ij, the observed response is defined by

y_ij = c if γ_{c−1} < l_ij ≤ γ_c, c = 1, 2, …, C.

This means that the liability scale is divided by thresholds into C intervals, corresponding to the C ordered categories. The first threshold, γ_1, defines the upper bound of the interval corresponding to observed outcome 1. Similarly, threshold γ_{C−1} defines the lower bound of the interval corresponding to observed outcome C, and threshold γ_{c−1} defines the boundary between the intervals corresponding to observed outcomes c − 1 and c. The threshold parameters satisfy γ_0 = −∞ < γ_1 < γ_2 < … < γ_{C−1} < γ_C = ∞. Assuming that the error term of the latent response is distributed as L(0, 1), the cumulative response probability for category c of the ordinal outcome is

P(y_ij ≤ c) = F(γ_c − x'_ij β − b_i), c = 1, 2, …, C − 1,

where F is the standard logistic distribution function. Equivalently, the model can be written as a cumulative logit model,

logit[P(y_ij ≤ c)] = γ_c − x'_ij β − b_i.

This GLMM is described by: (1) two distributions, one for the response variable (multinomial, with β the p × 1 vector of fixed effects) and one for the random effects, b_i ~ N(0, σ_b²), where b_i is the effect of line i; (2) the linear predictor η_c,ij = γ_c − x'_ij β − b_i, where η_c,ij denotes the cth link (c = 1, 2, …, C − 1) for the fixed and random effects combination, γ_c is the intercept (threshold) for the cth link, and x'_ij are known row incidence vectors corresponding to the fixed effects in β.
Because there are C categories, a total of C − 1 link functions are required to fully specify the model; and (3) the link function, the cumulative logit logit[P(y_ij ≤ c)] = η_c,ij, c = 1, 2, …, C − 1. Using the inverse link, P(y_ij ≤ c) = exp(η_c,ij)/[1 + exp(η_c,ij)]. Since the latent liabilities are logistically distributed and we observe y_ij = c if, and only if, γ_{c−1} < l_ij ≤ γ_c, the joint posterior density of the parameter vector and the latent liabilities follows from these two pieces. We assume a scaled independent inverse chi-square prior for σ_b², a normal prior distribution for β, a normal prior distribution for b, and a prior for the thresholds (Gianola 2013). Following Sorensen, the prior for the C − 1 unknown thresholds is given as order statistics from a U(γ_min, γ_max) distribution. The fully conditional posterior distributions are provided below; details of all derivations are given in Appendix A.
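A minimal sketch of how the cumulative-logit link maps a linear predictor and thresholds to category probabilities (`category_probs` is our illustrative helper, not the authors' code):

```python
import numpy as np
from scipy.special import expit

def category_probs(eta, thresholds):
    """Cumulative-logit cell probabilities for C ordered categories.

    eta: linear predictor (x'beta + b) for one observation.
    thresholds: increasing (gamma_1, ..., gamma_{C-1}).
    """
    gammas = np.concatenate(([-np.inf], np.asarray(thresholds, float), [np.inf]))
    cum = expit(gammas - eta)      # P(y <= c) = logistic(gamma_c - eta)
    return np.diff(cum)            # P(y = c) by differencing

# Example with C = 5, using the threshold values from the simulation study below.
probs = category_probs(0.3, [-0.8416, -0.2533, 0.2533, 0.8416])
```

Because the cumulative probabilities are built from a single monotone function of the thresholds, the C cell probabilities are automatically nonnegative and sum to one.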

Liabilities and Pólya-Gamma values

The fully conditional posterior distribution of the liability l_ij is a truncated normal distribution: given the Pólya-Gamma value ω_ij, l_ij is normally distributed with mean x'_ij β + b_i and variance 1/ω_ij, truncated to the interval (γ_{c−1}, γ_c] determined by the observed category y_ij = c [Equation (3)]. For simplicity, ELSE denotes the data and all parameters except the one in question; N(·, ·) is a normal density with the parameters indicated in its argument, and Φ is the cumulative distribution function of a normal density with the corresponding mean and variance. The fully conditional posterior distribution of ω_ij is a Pólya-Gamma distribution whose second (tilting) parameter depends on l_ij − x'_ij β − b_i [Equation (4)].
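The exact form of the Pólya-Gamma full conditional is derived in Appendix A. Purely as an illustration (our sketch, not the authors' implementation), a Pólya-Gamma variate can be drawn approximately by truncating its infinite-sum representation, with an assumed truncation level:

```python
import numpy as np

rng = np.random.default_rng(1)

def rpolyagamma(b, z, n_terms=200, size=1):
    """Approximate PG(b, z) draws by truncating the infinite-sum representation
    PG(b, z) = (1 / (2 pi^2)) * sum_k g_k / ((k - 1/2)^2 + z^2 / (4 pi^2)),
    with independent g_k ~ Gamma(b, 1)."""
    k = np.arange(1, n_terms + 1)
    denom = (k - 0.5) ** 2 + z ** 2 / (4.0 * np.pi ** 2)
    g = rng.gamma(shape=b, scale=1.0, size=(size, n_terms))
    return (g / denom).sum(axis=1) / (2.0 * np.pi ** 2)

draws = rpolyagamma(1.0, 0.0, size=20000)
# Sanity check: E[PG(b, z)] = b / (2 z) * tanh(z / 2), with limit b / 4 as z -> 0.
```

In production code one would use an exact rejection sampler for the Pólya-Gamma distribution; the truncated series is shown only because it makes the distribution's construction transparent.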

Regression coefficients (β)

The fully conditional posterior of β is a multivariate normal distribution [Equation (5)] whose precision matrix combines the Pólya-Gamma weights ω with the prior precision 1/σ_β², and whose mean is the corresponding weighted least-squares solution based on the liabilities. It is important to point out that if we use a flat (improper uniform) prior for β, then the term involving 1/σ_β² must be set to 0 in both the mean and the precision.

Polygenic effects (b)

Now, the fully conditional posterior of the polygenic effects b is also a multivariate normal distribution [Equation (6)].

Variance of polygenic effects

Next, the fully conditional posterior of σ_b² is a scaled inverted chi-square distribution [Equation (7)].

Threshold effects (γ)

The density of the fully conditional posterior distribution of the cth threshold, γ_c, is uniform on the interval bounded by the largest liability in category c and the smallest liability in category c + 1 [Equation (8)].

Variance of regression coefficients

The fully conditional posterior of σ_β² is a scaled inverted chi-square distribution [Equation (9)].

The Gibbs sampler

The Gibbs sampler is implemented by sampling repeatedly from the following loop:

1. Sample the liabilities l_ij from the truncated normal distribution in (3).
2. Sample the ω_ij values from the Pólya-Gamma distribution in (4).
3. Sample the regression coefficients β from the normal distribution in (5).
4. Sample the polygenic effects b from the normal distribution in (6).
5. Sample the variance of the polygenic effects (σ_b²) from the scaled inverted χ² distribution in (7).
6. Sample the thresholds γ_c from the uniform distribution in (8).
7. Sample the variance of the regression coefficients (σ_β²) from the scaled inverted χ² distribution in (9).
8. Return to step 1, or terminate if the chain length is adequate to meet convergence diagnostics.

In the absence of polygenic effects (b), the Gibbs sampler above can be used simply by skipping steps 4 and 5. If all marker effects are included in the design matrix, with a normal prior for the beta regression coefficients, we end up with a threshold Bayesian ridge regression. This is a version for ordinal categorical data of the ridge estimator of Hoerl and Kennard (1970), since the posterior expectation of β is the ridge-type solution for the pseudo-response given by the liabilities. Another important point is that by setting each ω_ij = 1, the Gibbs sampler for BLOR with the logistic link reduces to the Gibbs sampler for BPOR with the probit link proposed by Albert and Chib (1993). This implies that the proposed BLOR model is more general and includes the Gibbs sampler for the BPOR model as a particular case.
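The loop above can be sketched end to end on a toy data set. This is our minimal illustration, not the authors' R implementation: it omits the polygenic effects and the variance updates (steps 4, 5, and 7), fixes σ_β² at a weakly informative value, and draws the Pólya-Gamma weights as PG(2, ·) from a truncated-series sampler, following the Gaussian scale-mixture representation of the logistic density:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

def rpg2(z, n_terms=60):
    """Approximate PG(2, z) draws via the truncated infinite-sum
    representation of the Polya-Gamma distribution (illustrative)."""
    k = np.arange(1, n_terms + 1)
    denom = (k - 0.5) ** 2 + z[:, None] ** 2 / (4.0 * np.pi ** 2)
    g = rng.gamma(2.0, 1.0, size=(z.size, n_terms))
    return (g / denom).sum(axis=1) / (2.0 * np.pi ** 2)

# Toy ordinal data: 3 categories, logistic liabilities, no polygenic term.
n, p = 300, 2
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -1.0])
gamma_true = np.array([-0.5, 0.5])
liab = X @ beta_true + rng.logistic(size=n)
y = np.searchsorted(gamma_true, liab)          # observed categories 0, 1, 2

sigma2_beta = 100.0                            # fixed weakly informative prior variance
beta, gamma, l = np.zeros(p), gamma_true.copy(), liab.copy()
keep = []
for it in range(1500):
    eta = X @ beta
    omega = rpg2(l - eta)                                       # step 2
    # Step 1: liabilities ~ N(eta, 1/omega), truncated to (gamma_{c-1}, gamma_c].
    lo = np.concatenate(([-np.inf], gamma))[y]
    hi = np.concatenate((gamma, [np.inf]))[y]
    sd = 1.0 / np.sqrt(omega)
    u = rng.uniform(norm.cdf((lo - eta) / sd), norm.cdf((hi - eta) / sd))
    l = eta + sd * norm.ppf(u)
    # Step 3: beta ~ N(m, V), with V = (X' Omega X + I / sigma2_beta)^{-1}.
    V = np.linalg.inv((X.T * omega) @ X + np.eye(p) / sigma2_beta)
    beta = rng.multivariate_normal(V @ (X.T @ (omega * l)), V)
    # Step 6: each threshold uniform between adjacent liability groups.
    for c in range(gamma.size):
        gamma[c] = rng.uniform(l[y <= c].max(), l[y > c].min())
    if it >= 500:
        keep.append(beta)
post_mean = np.mean(keep, axis=0)              # should be near (1.5, -1.0)
```

Setting every ω_ij to 1 in this loop turns the liability update into the unit-variance truncated normal of Albert and Chib (1993), which is the particular-case reduction described above.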

Simulation study

The purpose of this simulation study is twofold: (1) to compare the performance of the proposed BLOR with (a) the approximation obtained by using the estimates of BPOR with k = 1.75, denoted as BLOR*; (b) the results of using maximum likelihood estimators (MLEs) with the logit link for ordinal data (MLLOR); and (c) the approximation obtained by multiplying the MLEs with the probit link for ordinal data by 1.75, denoted as MLLOR*; and (2) to evaluate the performance of BLOR in the presence of outliers. We used the value 1.75 because, according to the literature review, it is the most reasonable value (Savalei 2006). To reach these two goals, two data sets were simulated. Both simulation studies were carried out with the liability l_i = x'_i β + e_i, where β = (β1, β2, β3)' = (−6, −5, 7)' and the covariate vectors x_i were drawn independently, with components following a uniform distribution on the interval [−0.1, 0.1]. The threshold parameters used were γ1 = −0.8416, γ2 = −0.2533, γ3 = 0.2533, and γ4 = 0.8416. The response variable was then generated by categorizing each liability with these thresholds (y_i = c if γ_{c−1} < l_i ≤ γ_c). For simulated data set 1, we used four sample sizes, n = 5, 10, 20, and 40, and all the error terms e_i were drawn independently from an L(0,1) distribution, whereas for simulated data set 2, we used only one sample size (n = 40) and the error terms were obtained from two distributions [L(0,1) and a Student's t distribution with four degrees of freedom, denoted t4]. We studied four scenarios: in Scenario 1, the percentage of outliers (PO) drawn from the t4 was 5% and the remaining percentage was obtained from the L(0,1) distribution; in Scenario 2, the PO from the t4 was 10% and 90% came from the L(0,1); in Scenario 3, the PO from the t4 was 20% and 80% from the L(0,1); and in Scenario 4, the PO from the t4 was 30% and 70% from the L(0,1). The MLE estimates for the ordinal regression were obtained using the polr function of the MASS package in R (R Core Team 2015).
It is important to point out that the priors used for the Bayesian methods were noninformative for the regression coefficients, and weakly informative hyperparameters were used for the thresholds. We computed 20,000 Markov chain Monte Carlo (MCMC) samples; Bayes estimates were computed from 10,000 samples, because the first 10,000 were discarded as burn-in.
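The contaminated-error design of simulated data set 2 can be generated along these lines (a sketch with our own function name; the coefficient and threshold values are the true values stated above, and the outlier fraction is swapped in exactly as described):

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_ordinal(n, p_outlier=0.0):
    """Simulate ordinal responses from the liability model above: logistic
    errors, with a fraction p_outlier replaced by heavier-tailed t_4 draws."""
    X = rng.uniform(-0.1, 0.1, size=(n, 3))        # covariates as described
    beta = np.array([-6.0, -5.0, 7.0])             # true coefficients
    thresholds = np.array([-0.8416, -0.2533, 0.2533, 0.8416])
    err = rng.logistic(size=n)
    is_out = rng.uniform(size=n) < p_outlier
    err[is_out] = rng.standard_t(df=4, size=is_out.sum())
    liab = X @ beta + err
    return np.searchsorted(thresholds, liab) + 1   # categories 1..5

y = simulate_ordinal(1000, p_outlier=0.10)
```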

Model implementation

The Gibbs sampler described previously for the BLOR model was implemented in R (R Core Team 2015). Implementation followed a Bayesian approach using MCMC through the Gibbs sampler algorithm, which samples sequentially from the full conditional distributions until it reaches a stationary process converging to the joint posterior distribution (Gelfand and Smith 1990). For the real data, to reduce the potential impact of MCMC error on prediction accuracy, we performed a total of 60,000 iterations with a burn-in of 20,000, so that 40,000 samples were used for inference. We did not thin the chains, following the suggestions of Geyer (1992), Maceachern and Berliner (1994), and Link and Eaton (2012), who argue against subsampling MCMC output for approximating simple features of the target distribution (e.g., means, variances, and percentiles): thinning is neither necessary nor desirable, and unthinned chains are more precise. It is important to point out that implementation of the BLOR model for the real data sets used hyperparameters chosen to yield weakly informative priors, with γmin = −1000 and γmax = 1000 for the threshold parameters.
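The no-thinning advice can be checked with a small Monte Carlo experiment. This is our illustration (an AR(1) surrogate for autocorrelated MCMC output, not part of the original analysis): the mean of the full chain estimates the target mean more precisely than the mean of a thinned subsample of the same chain.

```python
import numpy as np

rng = np.random.default_rng(11)

def ar1_chain(length, phi=0.5):
    """AR(1) series as a cheap stand-in for autocorrelated MCMC output."""
    x = np.empty(length)
    x[0] = rng.normal()
    for t in range(1, length):
        x[t] = phi * x[t - 1] + rng.normal()
    return x

full_means, thinned_means = [], []
for _ in range(400):
    c = ar1_chain(4000)
    full_means.append(c.mean())               # use every draw
    thinned_means.append(c[::10].mean())      # keep only every 10th draw
var_full = float(np.var(full_means))
var_thinned = float(np.var(thinned_means))
# Discarding draws can only lose information: var_full < var_thinned.
```

Thinning remains useful only when storage or post-processing cost, not estimator precision, is the binding constraint.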

Assessing prediction accuracy

We used cross-validation to estimate the prediction accuracy of the proposed models. The data set was divided into training and validation sets 10 times, with 90% of the data used for training and 10% for testing; this was done only for the pooled data. The training set was used to fit the model and the validation set was used to evaluate prediction accuracy. Since the phenotypic response is ordinal categorical, we measured prediction ability with the Brier score (Brier 1950), which is equal to

BS = (1/n) Σ_{i=1}^{n} Σ_{c=1}^{C} (p̂_ic − d_ic)²,

where BS denotes the Brier score, p̂_ic is the predicted probability that individual i falls into category c, and d_ic takes the value 1 if the ordinal categorical response observed for individual i falls into category c and 0 otherwise. This scoring rule uses all the information contained in the predictive distribution, not just a small portion as the hit rate or the log-likelihood score do. It is therefore a reasonable choice for comparing categorical regression models, although other scoring rules also have good properties. The range of BS in Equation (10) is between 0 and 2; for this reason, we divided the Brier score by 2 so that it is bounded between 0 and 1, with lower scores implying better predictions. We also used the BS when analyzing the full data sets. In addition, we used the deviance information criterion (DIC) to compare Bayesian models, as suggested by Gelman; the lower the DIC, the better the model.
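The rescaled Brier score of Equation (10) can be computed as follows (`brier_score` is our illustrative helper, not the authors' code):

```python
import numpy as np

def brier_score(prob, y):
    """Multicategory Brier score of Equation (10), rescaled to [0, 1].

    prob: (n, C) predicted category probabilities; y: observed categories 1..C.
    Lower values indicate better predictions.
    """
    n, C = prob.shape
    d = np.zeros((n, C))
    d[np.arange(n), np.asarray(y) - 1] = 1.0        # one-hot observed category
    bs = np.mean(((prob - d) ** 2).sum(axis=1))     # ranges over [0, 2]
    return bs / 2.0
```

For example, a perfect forecaster scores 0, while a forecaster that assigns 1/C to every one of C categories scores (C − 1)/(2C), i.e., 0.4 for the five-category GLS scale.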

Data availability

The two real data sets and two simulated data sets, together with R codes, are deposited at the link http://hdl.handle.net/11529/10254. The phenotypic data for GLS in three environments (México, Zimbabwe, and Colombia) for the 278 maize lines, the 46,347 SNPs, and the R scripts developed to fit the predictive models used in this study are given in the files PhenoGLS.RData and MarkersGLSFinal.RDat. The repository also contains the Septoria genotypic and phenotypic data sets, in the files SeptoriaGenotypic.RDat and SeptoriaPhenotypic.RDat, respectively. The R codes to generate simulated data sets 1 and 2, and to analyze the real data set from Colombia, are given directly in Appendices B, C, and D, respectively.

Results

In the following sections, we investigate the performance of the proposed BLOR estimator through a simulation study and with real data.

Simulated data set

In Table 1, we report the average estimates obtained by all methods, along with their SDs; all results are based on 50 replications. From Table 1, it is clear that as the sample size increases, the average bias and SD decrease in all cases, confirming the consistency of all the estimators. Table 1 also shows that, in general, the point estimates of the Bayesian methods (BLOR and BLOR*) are similar to the MLEs (MLLOR and MLLOR*, the latter approximated from the MLE under probit ordinal regression); however, the approximations (BLOR* and MLLOR*) have greater bias and SD. BLOR has lower bias and SD for most of the parameters studied, producing better parameter estimates than MLLOR (the reference method, which uses maximum likelihood with the logit link function). For this reason, the proposed BLOR is an excellent alternative. However, a more in-depth simulation study is required to ensure that these findings hold across all possible scenarios.
Table 1

Simulated data set 1: Average values (Mean) and SD of MLEs and the Bayesian estimators, with four sample sizes (n)

Parameter  True Value  BLOR Mean (SD)   BLOR* Mean (SD)  MLLOR Mean (SD)  MLLOR* Mean (SD)

n = 5
β1   −6      −6.141 (1.935)   −6.711 (2.117)   −6.380 (2.489)   −6.826 (2.668)
β2   −5      −4.957 (2.262)   −5.546 (2.726)   −5.596 (2.731)   −5.927 (2.872)
β3    7       7.550 (2.746)    7.815 (3.112)    6.659 (2.698)    7.110 (2.885)
γ1   −0.842  −0.851 (0.190)   −0.937 (0.152)   −0.883 (0.177)   −0.942 (0.185)
γ2   −0.253  −0.254 (0.154)   −0.271 (0.142)   −0.277 (0.167)   −0.298 (0.179)
γ3    0.253   0.274 (0.170)    0.328 (0.171)    0.262 (0.203)    0.281 (0.218)
γ4    0.842   0.878 (0.171)    0.967 (0.151)    0.863 (0.211)    0.920 (0.220)

n = 10
β1   −6      −6.224 (1.562)   −6.534 (1.650)   −6.118 (1.673)   −6.480 (1.767)
β2   −5      −4.987 (1.825)   −5.433 (1.901)   −4.717 (1.500)   −5.022 (1.619)
β3    7       7.306 (1.971)    7.762 (1.825)    6.606 (1.836)    7.038 (1.939)
γ1   −0.842  −0.843 (0.100)   −0.926 (0.147)   −0.847 (0.127)   −0.907 (0.135)
γ2   −0.253  −0.239 (0.097)   −0.284 (0.131)   −0.273 (0.110)   −0.296 (0.119)
γ3    0.253   0.276 (0.113)    0.272 (0.123)    0.233 (0.110)    0.249 (0.120)
γ4    0.842   0.861 (0.116)    0.920 (0.124)    0.841 (0.115)    0.897 (0.123)

n = 20
β1   −6      −6.122 (1.063)   −6.278 (1.390)   −6.030 (0.936)   −6.422 (1.017)
β2   −5      −5.099 (1.262)   −5.538 (1.103)   −5.181 (0.962)   −5.488 (1.019)
β3    7       7.271 (1.114)    7.479 (1.394)    7.243 (1.118)    7.669 (1.183)
γ1   −0.842  −0.849 (0.108)   −0.917 (0.100)   −0.847 (0.073)   −0.905 (0.077)
γ2   −0.253  −0.252 (0.106)   −0.291 (0.094)   −0.250 (0.069)   −0.270 (0.074)
γ3    0.253   0.259 (0.091)    0.261 (0.099)    0.254 (0.068)    0.272 (0.073)
γ4    0.842   0.860 (0.097)    0.883 (0.106)    0.850 (0.079)    0.907 (0.084)

n = 40
β1   −6      −6.058 (0.804)   −6.467 (0.924)   −6.199 (0.719)   −6.563 (0.783)
β2   −5      −5.175 (0.817)   −5.231 (0.879)   −4.804 (0.791)   −5.101 (0.856)
β3    7       7.163 (0.815)    7.674 (0.958)    7.093 (0.867)    7.528 (0.904)
γ1   −0.842  −0.844 (0.065)   −0.911 (0.069)   −0.841 (0.051)   −0.899 (0.055)
γ2   −0.253  −0.258 (0.053)   −0.278 (0.064)   −0.248 (0.050)   −0.267 (0.056)
γ3    0.253   0.255 (0.052)    0.267 (0.060)    0.252 (0.044)    0.271 (0.048)
γ4    0.842   0.856 (0.054)    0.900 (0.071)    0.835 (0.047)    0.893 (0.050)

BLOR* and MLLOR* use the parameter estimates of the probit models (BPOR and its maximum likelihood counterpart) multiplied by k = 1.75 to approximate BLOR and MLLOR. For each parameter, the best model is the one whose estimate is closest to the true value. MLEs, maximum likelihood estimators; BLOR, Bayesian logistic ordinal regression; MLLOR, maximum likelihood logistic ordinal regression.

Table 2 shows that the smaller the PO, the less bias there is in the parameter estimates under all four methods (two Bayesian and two maximum likelihood). Under the two Bayesian methods, the proposed BLOR showed less bias than BLOR*; this behavior is observed for all parameters and POs under study, except for the parameter γ2 when the PO is 30%. Under the two maximum likelihood methods, the approximate method (MLLOR*) showed greater bias in all scenarios and for all parameters. Finally, although BLOR and MLLOR both produced better results in terms of bias, in most cases MLLOR was better than BLOR.
Table 2

Simulated data set 2: average values (Mean) and SD of MLEs and Bayesian estimators, with four POs


Parameter  True Value  BLOR Mean (SD)   BLOR* Mean (SD)  MLLOR Mean (SD)  MLLOR* Mean (SD)

PO = 5%
β1   −6      −5.994 (0.824)   −6.525 (0.915)   −6.230 (0.723)   −6.608 (0.766)
β2   −5      −5.088 (0.880)   −5.678 (0.864)   −5.015 (0.763)   −5.324 (0.815)
β3    7       7.183 (0.802)    7.607 (0.893)    7.111 (0.651)    7.535 (0.689)
γ1   −0.842  −0.862 (0.066)   −0.923 (0.067)   −0.853 (0.050)   −0.910 (0.053)
γ2   −0.253  −0.263 (0.071)   −0.281 (0.059)   −0.256 (0.048)   −0.276 (0.051)
γ3    0.253   0.258 (0.068)    0.285 (0.051)    0.261 (0.040)    0.281 (0.043)
γ4    0.842   0.861 (0.067)    0.928 (0.058)    0.856 (0.041)    0.914 (0.043)

PO = 10%
β1   −6      −6.156 (0.854)   −6.670 (0.979)   −6.136 (0.607)   −6.504 (0.647)
β2   −5      −5.359 (0.899)   −5.544 (1.094)   −5.149 (0.610)   −5.459 (0.639)
β3    7       7.349 (0.830)    7.739 (0.856)    7.109 (0.645)    7.536 (0.682)
γ1   −0.842  −0.883 (0.063)   −0.925 (0.076)   −0.866 (0.048)   −0.924 (0.050)
γ2   −0.253  −0.266 (0.070)   −0.267 (0.072)   −0.265 (0.052)   −0.285 (0.056)
γ3    0.253   0.260 (0.071)    0.313 (0.065)    0.264 (0.051)    0.284 (0.055)
γ4    0.842   0.868 (0.068)    0.964 (0.082)    0.861 (0.052)    0.918 (0.055)

PO = 20%
β1   −6      −6.529 (0.730)   −6.828 (0.903)   −6.345 (0.645)   −6.709 (0.689)
β2   −5      −5.244 (0.759)   −5.722 (0.877)   −5.275 (0.629)   −5.589 (0.671)
β3    7       7.525 (0.784)    7.860 (0.758)    7.344 (0.735)    7.780 (0.772)
γ1   −0.842  −0.915 (0.065)   −0.972 (0.060)   −0.883 (0.049)   −0.942 (0.053)
γ2   −0.253  −0.275 (0.059)   −0.295 (0.057)   −0.260 (0.044)   −0.280 (0.048)
γ3    0.253   0.279 (0.063)    0.297 (0.059)    0.271 (0.047)    0.291 (0.051)
γ4    0.842   0.922 (0.060)    0.977 (0.060)    0.895 (0.053)    0.954 (0.057)

PO = 30%
β1   −6      −6.794 (0.803)   −7.011 (1.013)   −6.555 (0.654)   −6.916 (0.691)
β2   −5      −5.652 (0.754)   −5.827 (0.754)   −5.298 (0.634)   −5.590 (0.666)
β3    7       8.065 (0.894)    8.351 (0.881)    7.491 (0.752)    7.925 (0.798)
γ1   −0.842  −0.972 (0.071)   −1.004 (0.075)   −0.898 (0.041)   −0.956 (0.044)
γ2   −0.253  −0.301 (0.060)   −0.296 (0.072)   −0.269 (0.044)   −0.289 (0.047)
γ3    0.253   0.279 (0.067)    0.318 (0.069)    0.286 (0.044)    0.306 (0.047)
γ4    0.842   0.944 (0.068)    1.023 (0.060)    0.922 (0.044)    0.981 (0.046)

The outliers were generated with a Student's t distribution with four degrees of freedom. BLOR* and MLLOR* use the parameter estimates of the probit models multiplied by k = 1.75 to approximate BLOR and MLLOR. For each parameter, the best model is the one whose estimate is closest to the true value. MLEs, maximum likelihood estimators; POs, percentages of outliers; BLOR, Bayesian logistic ordinal regression; MLLOR, maximum likelihood logistic ordinal regression.


Real data sets

Next, we compared BLOR and BPOR on the real data sets (GLS and Septoria) described in Materials and Methods. Because the number of parameters (betas and thresholds) to be estimated is high for these data, we compared the models through point and interval estimates of the probabilities estimated for each category of the four- and five-point ordinal scales. In Table 3, we compare BLOR, BPOR, and BLOR* (the approximate method, in which the estimates from BPOR are used to approximate BLOR with k = 1.75) on five data sets (the three locations, the pooled data from the three locations, and the Septoria data set) built from the data described in Materials and Methods. First, comparing BLOR with BLOR*, there are differences in the point probability estimates, and the lower the probabilities, the greater the differences; in general, however, the widths of the credible sets are similar. Second, comparing BLOR with BPOR, the point estimates and credible sets for each location and for Septoria are similar; for the pooled data, however, BLOR produced estimates with narrower credible sets than BPOR (Table 3). The estimates produced by BPOR are less accurate because wider credible sets are produced when the data are pooled; this could be because the assumption of normally distributed errors with mean zero and variance one is not fulfilled when the data are pooled. Also, there were no differences in the BS of the three models (BLOR, BLOR*, and BPOR).
Table 3

Real data sets: GLS and Septoria data sets


Model   Data Set (DIC)       Statistic  Cat 1  Cat 2  Cat 3  Cat 4  Cat 5  BS

BLOR    Zimbabwe (4150.29)   Mean       0.025  0.256  0.392  0.196  0.132  0.363
                             L          0.018  0.236  0.368  0.173  0.112  0.363
                             U          0.034  0.278  0.415  0.216  0.149  0.363
        México (1313.82)     Mean       0.054  0.442  0.267  0.169  0.069  0.350
                             L          0.035  0.399  0.228  0.139  0.050  0.349
                             U          0.075  0.480  0.308  0.200  0.092  0.352
        Colombia (2577.92)   Mean       0.204  0.248  0.259  0.212  0.077  0.390
                             L          0.179  0.219  0.232  0.187  0.058  0.389
                             U          0.233  0.279  0.287  0.239  0.095  0.391
        Pooled (8327.36)     Mean       0.083  0.284  0.332  0.198  0.104  0.377
                             L          0.069  0.269  0.315  0.184  0.093  0.377
                             U          0.093  0.299  0.350  0.211  0.119  0.377
        Septoria (654.33)    Mean       0.077  0.150  0.432  0.341  —      0.335
                             L          0.049  0.112  0.376  0.294  —      0.333
                             U          0.110  0.195  0.493  0.388  —      0.338

BLOR*   Zimbabwe (4156.86)   Mean       0.032  0.237  0.416  0.190  0.124  0.363
                             L          0.025  0.219  0.389  0.171  0.110  0.363
                             U          0.039  0.258  0.443  0.211  0.140  0.365
        México (1315.21)     Mean       0.057  0.436  0.280  0.156  0.070  0.350
                             L          0.041  0.392  0.242  0.128  0.054  0.350
                             U          0.075  0.479  0.323  0.187  0.089  0.353
        Colombia (2581.66)   Mean       0.193  0.260  0.279  0.194  0.074  0.390
                             L          0.168  0.226  0.248  0.168  0.060  0.389
                             U          0.220  0.292  0.310  0.223  0.089  0.392
        Pooled (8339.54)     Mean       0.082  0.277  0.358  0.184  0.100  0.377
                             L          0.074  0.259  0.341  0.168  0.090  0.377
                             U          0.091  0.294  0.375  0.200  0.109  0.378
        Septoria (652.50)    Mean       0.075  0.137  0.457  0.332  —      0.334
                             L          0.051  0.098  0.392  0.268  —      0.330
                             U          0.104  0.176  0.527  0.393  —      0.341

BPOR    Zimbabwe (4150.18)   Mean       0.025  0.253  0.392  0.199  0.132  0.363
                             L          0.017  0.233  0.366  0.179  0.113  0.363
                             U          0.033  0.274  0.416  0.219  0.148  0.363
        México (1314.70)     Mean       0.054  0.440  0.265  0.171  0.070  0.350
                             L          0.036  0.399  0.229  0.142  0.051  0.350
                             U          0.076  0.479  0.304  0.203  0.092  0.352
        Colombia (2578.42)   Mean       0.206  0.249  0.261  0.209  0.075  0.390
                             L          0.176  0.221  0.230  0.183  0.058  0.389
                             U          0.233  0.277  0.293  0.233  0.095  0.391
        Pooled (8329.84)     Mean       0.071  0.256  0.331  0.218  0.123  0.377
                             L          0.042  0.183  0.313  0.191  0.104  0.374
                             U          0.086  0.287  0.350  0.286  0.171  0.389
        Septoria (651.79)    Mean       0.075  0.150  0.430  0.345  —      0.334
                             L          0.047  0.110  0.369  0.282  —      0.330
                             U          0.109  0.191  0.500  0.402  —      0.339

(Septoria was scored on a four-point scale, so category 5 is empty for that data set.)

BLOR* uses the parameter estimates of BPOR to approximate BLOR with k = 1.75. Shown are point probability estimates, credible sets for each category, DIC, and BS for the threshold Bayesian ridge regression. L and U denote the lower and upper limits of the credible sets, respectively. GLS, gray leaf spot; DIC, deviance information criterion; BS, Brier score; BLOR, Bayesian logistic ordinal regression; BPOR, Bayesian probit ordinal regression.

Comparing the models with DIC, we see that in Zimbabwe, BPOR produced the lowest DIC (4150.18), whereas BLOR* produced the highest (4156.86); however, the DIC of BLOR (4150.29) was very close to that of BPOR. In México, the lowest DIC was for BLOR (1313.82) and the highest for BLOR* (1315.21). In Colombia, BLOR had the lowest DIC (2577.92) and BLOR* the highest (2581.66). In the Septoria data set, BPOR had the lowest DIC (651.79) and BLOR the highest (654.33). Finally, in the pooled data, BLOR again had the lowest DIC (8327.36), whereas BLOR* had the highest (8339.54). All these results show that even an approximation that does a reasonable job (BLOR*) can sometimes be far from the exact methods (BLOR and BPOR); for this reason, whenever possible, an exact model (BLOR or BPOR) should be chosen. The fact that BPOR is sometimes best (lowest DIC) implies that for these data sets the probit assumption in Equation (1) is adequate. Finally, in Table 4 we present the BS for the testing sets of the pooled real GLS data, with 10 cross-validations and 90% of the data used for training and 10% for testing. For BLOR and BPOR, the BS are almost identical, meaning that both models had similar predictive performance. Although the approximate method (BLOR*) produced a slightly higher BS, its prediction accuracy was very close to that of BLOR and BPOR. Although the three models performed similarly in prediction accuracy on this data set, more in-depth research is required to validate this observation.
Table 4

GLS data set

Model    Brier Scores
         Mean    Min     Max
BLOR     0.373   0.365   0.381
BLOR*    0.374   0.364   0.382
BPOR     0.373   0.365   0.381

BLOR* uses the parameter estimates of BPOR and approximates BLOR with . Brier scores (mean, minimum, and maximum; lower scores indicate better prediction) evaluated for validation samples from the pooled data. GLS, Gray leaf spot; BLOR, Bayesian logistic ordinal regression; BPOR, Bayesian probit ordinal regression.
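As a hedged illustration of how the multi-category Brier scores reported in Table 4 are typically computed (mean squared difference between the predicted category probabilities and the one-hot indicator of the observed category, averaged over the validation sample), the following sketch uses made-up probabilities and observations, not the GLS data:

```python
def brier_score(probs, observed, n_categories):
    """Mean multi-category Brier score.

    probs    : list of per-observation probability vectors over the categories
    observed : list of observed category indices (0-based)
    Lower scores indicate better probabilistic prediction.
    """
    total = 0.0
    for p, y in zip(probs, observed):
        for c in range(n_categories):
            indicator = 1.0 if c == y else 0.0
            total += (p[c] - indicator) ** 2
    return total / len(observed)

# Toy validation sample: 3 ordinal categories, 4 observations (invented values)
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.2, 0.3, 0.5],
    [0.5, 0.4, 0.1],
]
observed = [0, 1, 2, 1]
print(round(brier_score(probs, observed, 3), 4))  # → 0.35
```

In a cross-validation scheme like the one described above, this score would be computed once per testing set and then summarized as the mean, minimum, and maximum across the 10 random partitions.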

This paper proposes a method for BLOR using the Pólya-Gamma data augmentation approach of Scott and Pillow (2013), which produces a Gibbs sampler with full conditional distributions similar to those of the BPOR model of Albert and Chib (1993). The proposed method reduces to the BPOR model when the values sampled from the Pólya-Gamma distribution in Equation (4) are set to 1. This is an advantage because, with the proposed model, researchers can perform an exact logistic or probit ordinal regression without resorting to approximations for the logistic case. The performance of the proposed method was compared with the approximation based on the probit ordinal regression model in a small simulation study and in real data sets (the GLS and Septoria data sets) measured on four- and five-point ordinal scales. The simulation study makes clear that parameter estimation via the approximation produces a considerable amount of bias and can lead to wrong conclusions in association studies. With the real data, however, we observed that both models (BLOR and BPOR) have similar prediction ability, even though BLOR had lower DIC values in México, Colombia, and the pooled data.
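The latent-variable draws at the heart of such a sampler come from the Pólya-Gamma distribution, which can be represented as an infinite weighted sum of independent Gamma variables. A minimal sketch of a truncated-series PG(b, c) sampler is given below; this is an illustrative approximation under that representation, not the exact sampler used in the paper (exact samplers use an accept-reject scheme instead of truncation):

```python
import math
import random

def sample_polya_gamma(b, c, n_terms=200, rng=random):
    """Approximate draw from PG(b, c) using the infinite convolution of
    gammas representation, truncated at n_terms. Illustrative only."""
    total = 0.0
    for k in range(1, n_terms + 1):
        g_k = rng.gammavariate(b, 1.0)  # g_k ~ Gamma(shape=b, scale=1)
        total += g_k / ((k - 0.5) ** 2 + c ** 2 / (4.0 * math.pi ** 2))
    return total / (2.0 * math.pi ** 2)

random.seed(1)
draws = [sample_polya_gamma(1.0, 0.0) for _ in range(5000)]
mean = sum(draws) / len(draws)
# E[PG(1, 0)] = 1/4, so the empirical mean should be close to 0.25
print(round(mean, 3))
```

In the Gibbs sampler, one such latent draw is made per observation and enters the full conditional of the regression coefficients; fixing those draws to 1 recovers the probit (BPOR) update, as noted above.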
This means that when the violation of the assumption in Equation (1) is not strong, either model can be used. Accordingly, we observed greater accuracy (narrower credible sets) for the BLOR model compared with the exact BPOR model only with the pooled data set without a covariate for location. Although with the real data we did not observe an advantage in prediction accuracy for the proposed BLOR model, it is well documented in the statistical literature that logistic ordinal regression is more robust to outlying data because the logistic distribution has heavier tails; this was corroborated in terms of parameter estimates in the simulation with simulated data set 2. For this reason, the proposed BLOR should be preferred, because it is usually not practical to test whether the error term in Equation (1) follows a normal or a logistic distribution. In addition to being more robust, the proposed method provides regression coefficients that are more interpretable because of their connection to odds ratios (Zucknick and Richardson 2014). However, this advantage loses force when p >> n, because in that case the main driving force in Bayesian models is the prior rather than the data (Gianola 2013). Even with this restriction, this paper unifies logistic and probit ordinal regression under a Bayesian framework and offers a useful alternative for genomic-enabled prediction of ordinal categorical traits when the available data sets have more parameters to estimate than observations. Also, the proposed method should be preferred over BPOR when outliers are present and not easily detected. This is especially true for multidimensional data, since a few outliers can often strongly influence parameter estimation and inference. Finally, it is important to point out that, to devise the method proposed in this paper, we generalized the work of Scott and Pillow (2013) to ordered categorical responses.
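The two arguments above, heavier logistic tails and odds-ratio interpretability, can both be checked numerically. The sketch below compares upper-tail probabilities of the standard logistic and standard normal distributions (note the standard logistic has a larger spread, sd ≈ 1.81, but the tail gap persists even after rescaling) and converts a made-up cumulative-logit coefficient to an odds ratio:

```python
import math

def logistic_sf(x):
    """P(X > x) for the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(x))

def normal_sf(x):
    """P(X > x) for the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

# Heavier tails: the logistic survival function dominates the normal one
for x in (2.0, 3.0, 4.0):
    print(x, round(logistic_sf(x), 5), round(normal_sf(x), 5))

# Interpretability: a cumulative-logit coefficient beta maps directly to a
# cumulative odds ratio exp(beta); beta = 0.9 is an invented value
beta = 0.9
print(round(math.exp(beta), 3))  # → 2.46
```

For probit coefficients no such closed-form odds-ratio reading exists, which is the interpretability advantage referenced above; as noted, though, it matters less when p >> n and the prior dominates.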
Our method is elegant, easy to implement, and produces a unified Gibbs sampler framework useful for both the logit and the probit link. For this reason, we believe it is an appealing alternative for plant and animal researchers. Also, the proposed BLOR model can be easily extended to take into account genotype × environment interactions, which play an extremely important role in plant breeding, especially when selecting candidate genotypes.
References:  11 in total

1.  The use of transformations.

Authors:  M S Bartlett
Journal:  Biometrics       Date:  1947-03       Impact factor: 2.571

2.  Accuracy of genome-wide evaluation for disease resistance in aquaculture breeding programs.

Authors:  B Villanueva; J Fernández; L A García-Cortés; L Varona; H D Daetwyler; M A Toro
Journal:  J Anim Sci       Date:  2011-07-08       Impact factor: 3.159

3.  Bayesian methods for estimating GEBVs of threshold traits.

Authors:  C-L Wang; X-D Ding; J-Y Wang; J-F Liu; W-X Fu; Z Zhang; Z-J Yin; Q Zhang
Journal:  Heredity (Edinb)       Date:  2012-10-31       Impact factor: 3.821

4.  Sire evaluation for ordered categorical data with a threshold model.

Authors:  D Gianola; J Foulley
Journal:  Genet Sel Evol       Date:  1983       Impact factor: 4.297

5.  Priors in whole-genome regression: the bayesian alphabet returns.

Authors:  Daniel Gianola
Journal:  Genetics       Date:  2013-05-01       Impact factor: 4.562

6.  A method of sire evaluation for dichotomies.

Authors:  D Gianola
Journal:  J Anim Sci       Date:  1980-12       Impact factor: 3.159

7.  A Bayesian antedependence model for whole genome prediction.

Authors:  Wenzhao Yang; Robert J Tempelman
Journal:  Genetics       Date:  2011-11-30       Impact factor: 4.562

8.  Genome-wide prediction of discrete traits using Bayesian regressions and machine learning.

Authors:  Oscar González-Recio; Selma Forni
Journal:  Genet Sel Evol       Date:  2011-02-17       Impact factor: 4.297

9.  Genome-wide regression and prediction with the BGLR statistical package.

Authors:  Paulino Pérez; Gustavo de los Campos
Journal:  Genetics       Date:  2014-07-09       Impact factor: 4.562

10.  Threshold models for genome-enabled prediction of ordinal categorical traits in plant breeding.

Authors:  Osval A Montesinos-López; Abelardo Montesinos-López; Paulino Pérez-Rodríguez; Gustavo de Los Campos; Kent Eskridge; José Crossa
Journal:  G3 (Bethesda)       Date:  2014-12-23       Impact factor: 3.154

