Literature DB >> 28364037

A Bayesian Poisson-lognormal Model for Count Data for Multiple-Trait Multiple-Environment Genomic-Enabled Prediction.

Osval A Montesinos-López¹, Abelardo Montesinos-López², José Crossa³, Fernando H Toledo⁴, José C Montesinos-López⁵, Pawan Singh⁴, Philomin Juliana⁴, Josafhat Salinas-Ruiz⁶.

Abstract

When a plant scientist wishes to make genomic-enabled predictions of multiple traits measured in multiple individuals in multiple environments, the most common strategy for performing the analysis is to use a single trait at a time taking into account genotype × environment interaction (G × E), because there is a lack of comprehensive models that simultaneously take into account the correlated counting traits and G × E. For this reason, in this study we propose a multiple-trait and multiple-environment model for count data. The proposed model was developed under the Bayesian paradigm for which we developed a Markov Chain Monte Carlo (MCMC) with noninformative priors. This allows obtaining all required full conditional distributions of the parameters leading to an exact Gibbs sampler for the posterior distribution. Our model was tested with simulated data and a real data set. Results show that the proposed multi-trait, multi-environment model is an attractive alternative for modeling multiple count traits measured in multiple environments.

Entities: Chemical Disease Species

Keywords: Bayesian; GenPred; count phenotype; genomic selection; genomic-enabled prediction; multi-environment; multi-trait; shared data resource

Mesh：

Year: 2017 PMID： 28364037 PMCID： PMC5427491 DOI： 10.1534/g3.117.039974

Source DB: PubMed Journal: G3 (Bethesda) ISSN： 2160-1836 Impact factor: 3.154

Plant breeders need more efficient models for performing genomic selection for multiple-traits and multiple-environments for count data. Count data are those dependent variables that take values 0, 1, 2,… without an explicit upper limit. These types of dependent variables are common in genomic selection, for example: panicle number per plant, seed number per plant, number of infected spikelets per plant, etc. Due to its simplicity and its ability to generate samples from high-dimensional probability distributions, the Gibbs sampler is one of the most popular computationally intensive methods for fitting complex multilevel models (Park and van Dyk 2009). This method is also very popular for modeling normal and binary responses when efficient closed-form Gibbs samplers have been developed. However, obtaining a closed-form Gibbs sampler for count data is not straightforward. For this reason, Montesinos-López , 2016a) in the context of genomic-enabled prediction and genomic selection proposed closed-form Gibbs samplers for multilevel models for univariate count responses with and without the genotype × environment interaction (G × E) term that helps fill the lack of closed-form Gibbs samplers for count data. Although these models are helpful for modeling univariate count responses, many times breeders record phenotypic data for multiple counts. Scientists must take advantage of correlated traits to improve the prediction of unobserved genotypes and to increase the prediction accuracy of other count traits that are difficult to measure but that are associated with traits that are easy to measure. The available univariate count models are not appropriate for dealing with these situations. Since prediction problems are ubiquitous and of great interest and importance in statistical science, more attention has been given to parametric inference than to predictive inference (Harville 2014). However, thanks to the efforts of scientists like C.R. Henderson, currently there is a lot of evidence that the model-based approach to prediction is a useful tool for predicting future observations, and many linear mixed effects models have been developed for predicting future observations (Harville 2014). Most of these models are valid for normal responses, but few have been developed for discrete outcomes and very few for predicting more than one trait at a time. Thus there is a great need for developing more univariate and multivariate models for non-normal traits. Developing these models is very challenging since in predictive inference, inferences are sought about tangible quantities that are regarded as unobservable at the time of the inferences; the focus is on quantities that will, or could, become observable in the future, or that can be observed at the time of the inferences but only with an unacceptable delay or with undue effort or expense (Harville 2014). These quantities are regarded as the realizations of the elements of an unobservable random vector, say an M-dimensional unobservable random column vector w, and/or as realizations of linear combinations (or other functions) of the elements of w (Harville 2014). Poisson and negative binomial (NB) distributions are the most common random variables used for modeling count data; NB distribution is preferred when there is evidence of considerable overdispersion. In the Bayesian context, estimating the overdispersion parameter in the NB distribution is challenging because Metropolis-Hastings algorithms are used most of the time; this is computationally expensive and not practical for use in genomic-enabled prediction where data sets are large. Montesinos-López , 2016a) proposed a Gibbs sampler for NB distribution but it is not computationally efficient. Other authors have proposed using the Poisson-lognormal distribution to model count data to account for overdispersion (Williams and Ebel 2012). The Poisson component of the Poisson-lognormal distribution accommodates integer inputs (or outputs) to describe the actual number of counts observed within a single unit or sample, while the lognormal component of the distribution describes the overdispersion in the Poisson rate parameter due to clustering of some factors and describes how the average of these factors varies across the population (Williams and Ebel 2012). Adding this lognormal component to the predictor of a Poisson model is very helpful for accommodating a general correlation structure between traits when more than one trait is under study (Ma ). Based on the previous considerations, the main goal of this research is to extend the genomic-enabled Bayesian prediction model for count data with genotype × environment (G × E) interaction to the context of multiple traits under a Poisson-lognormal model. Since nowadays scientists measure multiple count traits in multiple environments, the joint modeling of multiple count traits can help to increase prediction accuracy and parameter estimation accuracy, and reduce trait selection bias. This argument is very well documented for continuous phenotypes; see, for example, Henderson and Quaas (1976), Pollak , Schaeffer (1984), Jiang , and Montesinos-López . It is reasonable that it could also substantially help in the context of multivariate count phenotypes.

Materials and Methods

Experimental data

First we describe real phenotypic and genotypic experimental data used to illustrate the results of the new Poisson-lognormal model. Then we explain the theory (joint posterior density and prior specification) of the proposed model, the Gibb sampler, and its implementation. We also describe how to assess prediction accuracy and simulated data sets, genomic-enabled prediction models, and give the link to the data and software availability.

Phenotypic data

The phenotypic data set used included 182 spring wheat lines developed by the International Maize and Wheat Improvement Center (CIMMYT) that were assembled and evaluated for resistance to Fusarium graminearum at El Batan experiment station in Mexico in 2011 in three experiments. For the application, we call these three experiments Env1, Env2, and Env3. In all the experiments (environments), the genotypes were arranged in a randomized complete block design, in which each plot comprised two 1-m double rows separated by a 0.25-m space. Fusarium head blight (FHB) severity data were collected 20 and 30 d before maturity by counting symptomatic spikelets on five randomly selected spikes in each plot. We used the counts collected at 20 d as trait 1 and the counts collected at 30 d as trait 2. These data sets were taken from the data used by Montesinos-López in their paper for count data with genotype × environment interaction.

Genotypic data

DNA samples were extracted from young leaves 2–3 wk old taken from each line, using Wizard Genomic DNA purification (Promega) following the manufacturer’s protocol. DNA samples were genotyped using an Illumina 9K SNP chip with 8632 single nucleotide polymorphisms (SNPs) (Cavanagh ). For a given marker, the genotype for the th line was coded as the number of copies of a designated marker-specific allele carried by the th line (absence = zero and presence = one). SNP markers with unexpected AB (heterozygous) genotype were recoded as either AA or BB based on the graphical interface visualization tool of GenomeStudio (Illumina) software. SNP markers that did not show clear clustering patterns were excluded. In addition, 66 simple sequence repeat markers were screened. After filtering the markers for 0.05 minor allele frequency and deleting markers with >10% of no calls, the final set of SNPs included 1635 SNPs.

Statistical model

Let be the phenotypic response of trait in replication in environment for genotype Conditionally on and we assume that the distribution of is a Poisson distribution with mean equal to with the following linear predictor:where is the intercept of environment in trait is the random effect of line in trait is the interaction random effect of environment line and trait and is an individual random effect for the response in the replication environment line , and trait the are jointly distributed as in order to take into account part of the correlation between traits not explained by the other random effects. It is interesting to point out that marginally the proposed model is equivalent to a Poisson-lognormal model with random effects. Conditioning on and the distribution of can be approximated using the NB distribution as:with a large enough value of (for example, ). This probability can be expressed as:where and Then, using the identity of Polson , this expression (Equation 2) can be expressed as:where denotes the probability density function of a Pólya-Gamma random variable with parameters and Then conditioning on and the distribution of is expressed as:ELSE means that the conditioning was done on and Therefore, the joint conditional distribution of all phenotypic count responses in the same individual is given by:where and Here ELSE has a similar meaning as before, but adds the corresponding random effects of each response. In this case, we are conditioning on and Therefore, according to Appendix A, the joint conditional distribution for all phenotypic count responses is expressed as:where , and the others terms are defined in the Appendix A. Hereafter we assume that the joint distribution of the random line-trait effects is with the joint distribution of the interaction random effect of environment and line in trait is with and with assuming that all the extra random effects () are independent and identically distributed. It is important to point out that the correlation between traits is taken into account in both parts of the model: (a) in random effect with general covariance matrix and (b) in random effect with general covariance matrix while the correlation between environments is taken into account in random effect with general covariance matrix Normal random effect resulted because we paired a lognormal distribution, with the Poisson distribution to create an overdispersed distribution, which is referred to as the Poisson-lognormal distribution. The above statement is clear if we remember that if has a normal distribution, then has a lognormal distribution.

Prior specification and joint posterior

In this section, we provide the joint posterior density and prior specification for the Bayesian Poisson Multiple-Trait and Multiple-Environment (BPMTME) model. The joint posterior density of the parameter vector becomes:where we are implicitly assuming the following structure of the prior density of the hyper-parameters:More specifically, we assume that and . Here indicates an inverse Wishart random matrix distribution with density function both are positive definite matrices, and denotes an inverse gamma distribution with shape parameter and rate parameter Using the specified priors we ended up with all full conditional distributions given in Appendix B.

Gibbs sampler

To produce posterior means for all relevant model parameters, below we outline the exact Gibbs sampler procedure that we propose for estimating the parameters of interest. As is the case with Markov Chain Monte Carlo (MCMC) techniques, the ordering of draws is somewhat arbitrary; however, we suggest the following order: Step 1. Simulate according to the normal distribution given in Appendix B (B1). Step 2. Simulate according to the normal distribution given in Appendix B (B2). Step 3. Simulate according to the normal distribution given in Appendix B (B3). Step 4. Simulate according to the normal distribution given in Appendix B (B4). Step 5. Simulate according to the IW distribution given in Appendix B (B5). Step 6. Simulate according to the IG distribution given in Appendix B (B6). Step 7. Simulate according to the IW distribution given in Appendix B (B7). Step 8. Simulate according to the IG distribution given in Appendix B (B8). Step 9. Simulate according to the Pólya-gamma distribution given in Appendix B (B9). Step 10. Simulate according to the IW distribution given in Appendix B (B10). Step 11. Simulate according to the IG distribution given in Appendix B (B11). Step 12. Return to step 1 or terminate when chain length is adequate for meeting convergence diagnostics.

Model implementation

The Gibbs sampler described above for the BPMTME model was implemented as an R-software package (R Core Team 2016). We performed a total of 40,000 iterations; 20,000 samples were used for inference because the first 20,000 were used as burn-in to decrease MCMC errors in prediction accuracy. We used a thinning of five, so that 4000 samples were used for inference. The chain convergence diagnostic was done by visual checks using the trace plots and autocorrelation functions of each component of β coefficients and variance components, and in general we observed that they stabilize very quickly for the real and simulated data. We implemented the prior specification given in the previous section where the BPMTME model was defined. The hyper-parameters used were: and where is the identity matrix of dimension All these hyper-parameters were chosen to lead weakly informative priors. Also, it is important to point out that for the implementation with real and simulated data sets, we worked with the sum for each line resulting from its corresponding number of replicates. This was done to save time in the implementation of the proposed model.

Assessing the prediction accuracy of the models

We used cross-validation 1 (CV1), which mimics a situation where lines were evaluated in some environments for all traits but some lines are missing in other environments, as proposed by Montesinos-López . For real and simulated data, prediction ability was assessed using 10 trn-tst (trn = training and tst = testing) random partitions; we used this approach because it provides higher precision in the predictive estimates than the framework that uses different numbers of folds. However, for the simulated data we used five different partitions (percentages) for the training and testing sets, and the corresponding testing sets were assumed as missed values. The percentages for the training set were 90, 80, 70, 60, and 50% and their corresponding complements were used as testing (tst) sets (tst = 10, tst = 20, tst = 30, tst = 40, and tst = 50%). Of the variety of methods for comparing the predictive posterior distribution to the observed data (generally termed “posterior predictive checks”), we used Spearman’s correlation because the phenotype is not normally distributed. Models with values closer to one indicate better predictions. The predicted observations were calculated with collected Gibbs samplers as: where are estimates of and in the sth collected sample.

Simulated data sets

The proposed model was also tested with two simulated data sets (S1 and S2). Both data sets were simulated under the proposed model (described in the statistical model section) with three environments, two traits, 200 genotypes, and five replications. The β coefficients used for both data sets were = [0.20, 0.25, 0.15, 0.20, 0.30, and 0.32]. The first two β coefficients belong to traits 1 and 2 in environment 1, the third and fourth values belong to traits 1 and 2 in environment 2, and the last two traits for environment 3. The variance-covariance matrices used for the first data set (scenario S1) gave rise to a matrix of correlation between traits and environments of 0.8; these matrices were: and . The variance-covariance matrices used for the second data set (scenario 2) gave rise to correlation matrices with correlations equal to 0.3; the corresponding covariances were: , and . It is also important to point out that for both simulated data sets, we assumed independence between genotypes, i.e., The implementation of the BPMTME model for these two simulated data sets was the same as that used with the real data sets described in the section on model implementation.

Data availability

The Gibbs algorithm described in this paper was implemented in C++ using the Armadillo linear algebra Library (Eddelbuettel and Sanderson 2014). It was included in the R package BMTME as an extension of its current features. With the updated version (0.0.4), the user can also fit count data by means of the same pipeline presented by Montesino-López et al. (2016b). However, due to the need for sampling from a Pólya-gamma distribution, it was necessary to include a dependency regarding the R package BayesLogit (Polson ) and for the moment, we are only distributing the package through CIMMYT’s repository http://hdl.handle.net/11529/10866. Users must have BayesLogit properly installed in their computers before installing BMTME v0.0.4, as well as the R version 3.2.4. or higher. The phenotypic (FHB) and genotypic (marker) data used in this study can also be downloaded from that link.

Results

The results are given in two sections: the first section gives the results of the simulated data and the second section gives the results of the real data set.

Prediction accuracy in the simulated data

Table 1 shows the results for both scenarios (S1 and S2) described in the simulation study for the BPMTME model and for the Bayesian Poisson multi-environment (BPME) model (univariate version of the BPMTME model). Since for each scenario we studied the prediction accuracy of the models using 10 random partitions for each of the five training-testing percentages, we provided the average of the Spearman correlation of the 10 random partitions for each testing percentage (10, 20, 30, 40, and 50%) for each scenario and each model (BPMTME and BPME models). First, we compare, for each scenario (S1), the prediction accuracies for each testing percentage between the BPMTME model and the BPME model. Then we compare the prediction accuracy for each testing percentage between the two scenarios.

Table 1

Prediction accuracy measured with the average of the 10 random partitions using the Spearman correlation (ASC) for each testing (tst) percentage for the BPMTME and BPME models

		BPMTME					BPME
		tst = 10%	tst = 20%	tst = 30%	tst = 40%	tst = 50%	tst = 10%	tst = 20%	tst = 30%	tst = 40%	tst = 50%
	Trait–Env	ASC	ASC	ASC	ASC	ASC	ASC	ASC	ASC	ASC	ASC
	11	0.625	0.575	0.710	0.635	0.597	0.480	0.574	0.574	0.551	0.553
	21	0.693	0.568	0.705	0.566	0.542	0.533	0.527	0.560	0.572	0.546
	12	0.815	0.571	0.694	0.615	0.551	0.492	0.395	0.452	0.423	0.397
	22	0.548	0.553	0.685	0.612	0.541	0.597	0.627	0.608	0.609	0.590
	13	0.754	0.575	0.715	0.623	0.518	0.613	0.635	0.616	0.581	0.569
	23	0.610	0.590	0.706	0.628	0.559	0.539	0.601	0.556	0.600	0.597
	Average	0.674	0.572	0.702	0.613	0.551	0.542	0.560	0.561	0.556	0.542
S1		SD	SD	SD	SD	SD	SD	SD	SD	SD	SD
	11	0.118	0.104	0.060	0.045	0.050	0.053	0.018	0.018	0.013	0.016
	21	0.104	0.118	0.046	0.059	0.045	0.033	0.026	0.025	0.013	0.007
	12	0.048	0.113	0.032	0.061	0.051	0.026	0.025	0.016	0.014	0.009
	22	0.136	0.116	0.072	0.065	0.045	0.032	0.032	0.015	0.010	0.011
	13	0.077	0.135	0.045	0.047	0.048	0.033	0.030	0.018	0.012	0.011
	23	0.102	0.118	0.062	0.050	0.043	0.033	0.019	0.014	0.013	0.008
	Trait–Env	Mean	Mean	Mean	Mean	Mean	Mean	Mean	Mean	Mean	Mean
	11	0.728	0.514	0.740	0.643	0.581	0.499	0.581	0.584	0.548	0.561
	21	0.561	0.630	0.711	0.587	0.544	0.511	0.550	0.561	0.584	0.561
	12	0.537	0.623	0.668	0.647	0.567	0.529	0.402	0.437	0.439	0.416
	22	0.531	0.575	0.721	0.625	0.550	0.613	0.641	0.591	0.616	0.600
	13	0.571	0.633	0.674	0.613	0.522	0.607	0.618	0.593	0.564	0.570
	23	0.640	0.538	0.711	0.636	0.568	0.570	0.574	0.544	0.588	0.585
	Average	0.595	0.585	0.704	0.625	0.555	0.555	0.561	0.552	0.557	0.549
S2		SD	SD	SD	SD	SD	SD	SD	SD	SD	SD
	11	0.074	0.122	0.044	0.050	0.054	0.046	0.023	0.016	0.015	0.012
	21	0.154	0.098	0.057	0.055	0.061	0.039	0.028	0.026	0.014	0.007
	12	0.144	0.088	0.041	0.056	0.039	0.023	0.025	0.014	0.013	0.010
	22	0.142	0.108	0.055	0.054	0.035	0.025	0.029	0.020	0.012	0.012
	13	0.137	0.112	0.065	0.058	0.040	0.030	0.027	0.017	0.013	0.010
	23	0.105	0.135	0.064	0.051	0.043	0.032	0.020	0.014	0.014	0.006

SD denotes standard deviations.

SD denotes standard deviations. The first comparison is important since we hypothesized that the BPMTME model produces better predictions than the BPME model. Under S1, on average, the BPMTME model produced better predictions than the BPME model. For the 10% testing percentage, the BPMTME model was on average 17.89% better than the BPME model. When the testing percentages were 20 and 30%, the BPMTME model was on average 2.1% better than the BPME model. When the testing percentage was 40%, the BPMTME model was on average 9.16% better than the BPME model, and when the testing percentage was 50%, the BPMTME model was on average 1.4% better than the BPME model. It is also important to point out that for trait–environment combination 12, the best model was the BPMTME model for the five testing percentages; the superiority of this model over the BPME model in the five cases was >27%. For the second scenario (S2) and the 10% testing percentage, the BPMTME model was on average 5.16% better than the BPME model, whereas for the 20, 30, 40, and 50% testing percentages, the BPMTME model was on average 3.17, 21.75, 10.75, and 0.9% better than the BPME model, respectively. When we compare the prediction accuracies of scenarios S1 and S2, we see that there are no significant differences between the two scenarios, although, on average, for all the testing percentages the BPME model was a little better than BPMTME. To make sense of this result, we need to remember that under scenario S1, the correlation of the three variance-covariance matrices (, and ) was 0.8, while under scenario S2 the correlation of these three covariance matrices was 0.30.

Estimation and prediction accuracy in the real data set

Table 2 gives the parameter estimates of the real data set. The β coefficients between traits and between environments are different, which implies that the counts between environments and between traits are different. The larger counts are observed in environment 3 and the lower counts are observed in environment 1; also, the counts in trait 1 are a little larger in comparison to trait 2. The variances for each trait (that belong to ) are very similar (0.393 for trait 1 and 0.381 for trait 2) and the correlation between these two traits is 0.948 (with a covariance of 0.367 in ); that is, the correlation between traits is high since the resulting correlation between traits is close to one. The variances of traits estimated under the variance-covariance matrix () were 0.702 for trait 1 and 0.726 for trait 2, with a correlation of 0.1418 (and a covariance of 0.102); in this case, the correlation between traits that is not explained by is very low (0.1418). However, the variances of environments were very low: 0.0198, 0.003, and 4.94E−05 for environments 1, 2, and 3, respectively (obtained from ). The correlations between environments were also very low since the correlation between environments 1 and 2 was 0.182 (with covariance 0.0014 from ), the correlation between environments 1 and 3 was 0.039 (with covariance 3.92E−05 from ), and the correlation between environments 2 and 3 was 0.037 (with covariance 1.45E−05, from ). is the general variance-covariance matrix of traits that were not explained by variance-covariance matrix

Table 2

The posterior means (Mean) and SD of parameter estimates of of the BPMTME model

Coefficient Estimates of β*			Estimates of Σt and Σc			Estimates of ΣE
Parameter	Mean	SD	Parameter	Mean	SD	Parameter	Mean	SD
β^11*	0.018	0.119	Σ^t11	0.393	0.086	Σ^E11	0.0198	0.0416
β^12*	−0.151	0.133	Σ^t12	0.367	0.063	Σ^E12	0.0014	0.0079
β^21*	1.806	0.099	Σ^t22	0.381	0.069	Σ^E13	3.92E−05	0.0003
β^22*	1.680	0.102	Σ^c11	0.702	0.058	Σ^E22	0.003	0.0041
β^31*	2.806	0.097	Σ^c12	0.102	0.039	Σ^E23	1.45E−05	0.0002
β^32*	2.751	0.091	Σ^c22	0.726	0.060	Σ^E33	4.94E−05	0.0001

SD denotes standard deviations.

SD denotes standard deviations. Table 3 shows that, in general, there are no significant differences in terms of prediction accuracy between the two models (BPMTME and BPME); however, it is very interesting that predictions are pretty high under both models, since in all cases the predictions are >0.5. For example, under both models the lowest predictions were for trait–environment combination 11, while the best predictions were for trait–environment combinations 13 and 23.

Table 3

Prediction accuracy of the real data set measured based on the average of the 10 random partitions using the Spearman correlation (ASC) for each testing (tst) percentage for the BPMTME and BPME models in bold are the best predictions.

	BPMTME		BPME
Trait–Env	Mean	SE	Mean	SE
11	0.5008	0.0666	0.5145	0.0595
21	0.5309	0.0375	0.5814	0.0290
12	0.6805	0.0287	0.7064	0.0383
22	0.7025	0.0462	0.6715	0.0394
13	0.7447	0.0158	0.7672	0.0195
23	0.7576	0.0262	0.7187	0.0214
Average	0.6528	0.0368	0.6600	0.0345

SE denotes the standard error.

SE denotes the standard error. These results can be explained in part because the correlation observed between traits in correlation matrix was very low (0.1418), and because the correlation observed in the correlation matrix between environments was also low (see Table 2). However, although the correlation between traits observed in correlation matrix was very high (0.948), this alone did not contribute to superior predictions under the BPMTME model in comparison to the BPME model. It is important to point out that these results with the real data do not indicate that the BPMTME model is not useful, but rather that there is no advantage in using this model over the BPME model when the correlation structures of traits ( ) and environments () are not high. Another possible explanation of why the BPMTME model did not outperform the BPME model is that the number of wheat lines in this data set is limited (only 182) and only 18 (10%) of them are predicted in some environments. When our proposed BPMTME model is used with limited data sets, it is likely to capture idiosyncrasies of the data rather than the true underlying processes; this problem may be entirely due to limitations the data impose on our ability to detect the underlying processes, rather than to any inherent value of simple models. Although the results in Table 3 do not favor the BPMTME model, we can argue that the proposed multivariate model has a low risk of doing worse in terms of prediction accuracy. For this reason, we are convinced that when dealing with prediction of count phenotypes for multiple traits and multiple environments, the savviest solution is to run both models and choose the one with better prediction performance. Another possible reason for the difference in performance of both models may be the presence of more zeros than our model can support in the phenotypes under study, since Figure 1 shows that for trait–environment combinations, Trait1_Environment1, Trait1_Environment2, Trait2_Environment1, and Trait2_Environment2, there is overdispersion for the excess zeros. In the statistical literature, zero-inflated and hurdle models have been proposed for coping with zero-inflated outcome data with or without overdispersion. For this reason, as one reviewer suggested, the proposed models for multiple-traits and multiple-environments can be expanded to deal with the high occurrence of zeros in the observed dependent variables to end up with zero-inflated and hurdle versions for multiple-traits and multiple-environments.

Figure 1

Histogram of count frequencies of the real data set for the two traits under study for each environment.

Discussion

In this paper, we propose a Poisson-lognormal multivariate model that takes into account the and interaction terms, and helps to improve prediction capacity in comparison to the univariate Poisson-lognormal model, which only takes into account the interaction term. The proposed model is original in the sense that an exact Gibbs sampler was obtained, which makes this model more efficient than existing Poisson-lognormal multivariate models that do not take into account interaction terms and use the Metropolis-Hastings algorithm to estimate parameters. Also, the lognormal random effect that was introduced in the linear predictor allows having an approximate Poisson-lognormal model. The advantage of introducing these lognormal random effects was twofold: (1) accounting for overdispersion (more than the Poisson variance) and (2) exploiting the correlation between traits that was not captured by the variance-covariance matrix of traits (). We introduced a general covariance structure between the traits for the extra random effect which plays a really important role because it also allows borrowing information between traits. Additionally, the proposed model also has a general covariance structure between environments which also allows borrowing information between environments. It is important to point out that if a gamma distribution was assumed for the random effect this would produce a multivariate negative binomial model which, in addition to being very inefficient for estimating the scale parameter, only allows positive correlation structures because the NB model (univariate and multivariate) was motivated solely for mathematical convenience. Another advantage of the proposed BPMTME model is that it can be used for performing univariate analysis employing the predictor given in Equation (1), using the appropriate design matrices of environments, lines, and genotype by environment. This alternative is really helpful since in addition to implementing the multiple-trait and multi-environment model for count data, it also allows implementing a univariate multiple-environment model with the same proposed code under a Poisson-lognormal model. Here it is important to point out that the resulting univariate Poisson-lognormal model is different from the Poisson and NB models with the interaction term proposed by Montesinos-López , given that the models with proposed by these authors do not have the normal random effect For this reason, a comparison with this model was not included. However, according to the existing statistical literature, there is evidence that when there is overdispersion, the Poisson-lognormal model does a better job than the Poisson only model. According to our results, the proposed model may be useful for breeders interested in modeling more than one count trait in many environments simultaneously. Although in the simulation study the BPMTME model was better in terms of prediction accuracy than the BPME model, in the real data set both models had similar prediction performance. Although the proposed BPMTME model showed no superiority in the real data set, we believe that when the correlations between traits and between environments are high, the proposed model should be superior. However, more empirical evidence is necessary to be sure that the proposed model helps to increase prediction accuracy under these circumstances. As previously mentioned, our results with the real data sets indicated that the univariate model was slightly better than our multivariate model in terms of prediction accuracy. This result, which was not expected, has been documented in areas like econometrics where multivariate models have been better at modeling interdependencies and achieve better fit. Within a given sample, it has been found that univariate methods outperform multivariate methods in sample predictions. Some of the reasons that may explain these results are given here. (a) Multivariate models have more parameters than univariate models (parsimonious). Every additional parameter is an unknown quantity and has to be estimated. This estimation brings an additional source of error due to sampling variation. (b) The number of potential candidates for multivariate models exceeds its univariate counterpart. Model selection is therefore more complex, lengthier, and more prone to errors, which affect prediction. (c) It is difficult to generalize nonlinear procedures in the multivariate case. Generally, multivariate models must have a simpler structure than univariate ones to overcome the additional complexity of being multivariate. For example, while a researcher may use a nonlinear model for univariate data, she/he may refrain from using the multivariate counterpart or such a generalization may not have been developed. Then, multivariate models will miss the nonlinearities that are handled properly by the univariate models. (d) Outliers can have a more serious effect on multivariate predictions than on univariate predictions. Moreover, it is easier to spot and control outliers in the univariate context, etc. For these reasons, we believe it is necessary to explore other approaches to prediction when dealing with multivariate responses (count and normal phenotypes) and multiple environments, for when the number of traits, lines, and environments increased significantly even with normal phenotypes, the implementation of this model became very difficult. Finally, although the proposed model has some advantages, it also has some disadvantages. The main disadvantage of the proposed BPMTME model is that it is computationally demanding. Although we implemented this model in C++ code, when the number of lines, environments, and traits grows substantially, the time required for its implementation increases greatly and can be untenable for large numbers. The second disadvantage is that our proposed BPMTME model is in reality an approximation to the true BPMTME model: instead of working directly with the Poisson model, we approximated it with the negative NB distribution using as a scale parameter, which works very well for small counts, as was shown by Montesinos-López . Thanks to this approximation, it was possible to obtain an exact Gibbs sampler that is more powerful for complex and large data sets.

Conclusions

In this paper, we propose a multiple-trait, multiple-environment model for count data under an approximate Bayesian Poisson-lognormal model. The Gibbs sampler obtained is an approximation since the Poisson-lognormal model is approximated using the NB distribution in conjunction with a lognormal random effect. The proposed BPMTME model was compared with its univariate counterpart (BPME), and in the simulation examples, the proposed multivariate model (BPMTME) was considerably better in terms of prediction accuracy. This can be expected when there is a moderate or large correlation between traits, which allows borrowing information between traits. However, the prediction accuracy of the proposed multivariate model was not superior to that of the univariate model with the real data set. The proposed BPMTME model allows borrowing information between environments because it takes into account a general covariance matrix between environments. For these reasons, we believe that more empirical work is needed to determine whether the proposed BPMTME model is really a better tool than the BPME model tool for breeders interested in simultaneously improving more than one count trait in many environments. However, although an exact Gibbs sampler was proposed and the program was implemented in C++, it is computationally demanding and more work is required for increasing its speed. Nonetheless, we are convinced that this limitation will disappear or diminish considerably in the coming years as computer science is improving very quickly.

6 in total

1. A multivariate Poisson-lognormal regression model for prediction of crash counts by severity, using Bayesian methods.

Authors: Jianming Ma; Kara M Kockelman; Paul Damien
Journal: Accid Anal Prev Date: 2007-12-18

2. Genome-wide comparative diversity uncovers multiple targets of selection for improvement in hexaploid wheat landraces and cultivars.

Authors: Colin R Cavanagh; Shiaoman Chao; Shichen Wang; Bevan Emma Huang; Stuart Stephen; Seifollah Kiani; Kerrie Forrest; Cyrille Saintenac; Gina L Brown-Guedira; Alina Akhunova; Deven See; Guihua Bai; Michael Pumphrey; Luxmi Tomar; Debbie Wong; Stephan Kong; Matthew Reynolds; Marta Lopez da Silva; Harold Bockelman; Luther Talbert; James A Anderson; Susanne Dreisigacker; Stephen Baenziger; Arron Carter; Viktor Korzun; Peter Laurent Morrell; Jorge Dubcovsky; Matthew K Morell; Mark E Sorrells; Matthew J Hayden; Eduard Akhunov
Journal: Proc Natl Acad Sci U S A Date: 2013-04-29 Impact factor: 11.205

3. Discussion of "The Need for More Emphasis on Prediction: A 'Nondenominational' Model-Based Approach"

Authors: Hal Stern
Journal: Am Stat Date: 2014 Impact factor: 8.710

4. Genomic Bayesian Prediction Model for Count Data with Genotype × Environment Interaction.

Authors: Abelardo Montesinos-López; Osval A Montesinos-López; José Crossa; Juan Burgueño; Kent M Eskridge; Esteban Falconi-Castillo; Xinyao He; Pawan Singh; Karen Cichy
Journal: G3 (Bethesda) Date: 2016-05-03 Impact factor: 3.154

5. Joint prediction of multiple quantitative traits using a Bayesian multivariate antedependence model.

Authors: J Jiang; Q Zhang; L Ma; J Li; Z Wang; J-F Liu
Journal: Heredity (Edinb) Date: 2015-04-15 Impact factor: 3.821

6. A Genomic Bayesian Multi-trait and Multi-environment Model.

Authors: Osval A Montesinos-López; Abelardo Montesinos-López; José Crossa; Fernando H Toledo; Oscar Pérez-Hernández; Kent M Eskridge; Jessica Rutkoski
Journal: G3 (Bethesda) Date: 2016-09-08 Impact factor: 3.154

6 in total

5 in total

Review 1. A Review of Integrative Omic Approaches for Understanding Rice Salt Response Mechanisms.

Authors: Mohammad Asad Ullah; Muhammad-Redha Abdullah-Zawawi; Rabiatul-Adawiah Zainal-Abidin; Noor Liyana Sukiran; Md Imtiaz Uddin; Zamri Zainal
Journal: Plants (Basel) Date: 2022-05-27

2. Prediction of Multiple-Trait and Multiple-Environment Genomic Data Using Recommender Systems.

Authors: Osval A Montesinos-López; Abelardo Montesinos-López; José Crossa; José C Montesinos-López; David Mota-Sanchez; Fermín Estrada-González; Jussi Gillberg; Ravi Singh; Suchismita Mondal; Philomin Juliana
Journal: G3 (Bethesda) Date: 2018-01-04 Impact factor: 3.154

3. A singular value decomposition Bayesian multiple-trait and multiple-environment genomic model.

Authors: Osval A Montesinos-López; Abelardo Montesinos-López; José Crossa; Juan Manuel Ramírez-Alcaraz; Ravi Singh; S Mondal; P Juliana
Journal: Heredity (Edinb) Date: 2018-08-17 Impact factor: 3.821

4. A Multivariate Poisson Deep Learning Model for Genomic Prediction of Count Data.

Authors: Osval Antonio Montesinos-López; José Cricelio Montesinos-López; Pawan Singh; Nerida Lozano-Ramirez; Alberto Barrón-López; Abelardo Montesinos-López; José Crossa
Journal: G3 (Bethesda) Date: 2020-11-05 Impact factor: 3.154

5. A zero altered Poisson random forest model for genomic-enabled prediction.

Authors: Osval Antonio Montesinos-López; Abelardo Montesinos-López; Brandon A Mosqueda-Gonzalez; José Cricelio Montesinos-López; José Crossa; Nerida Lozano Ramirez; Pawan Singh; Felícitas Alejandra Valladares-Anguiano
Journal: G3 (Bethesda) Date: 2021-02-09 Impact factor: 3.154

5 in total