Literature DB >> 36249559

Genomic prediction through machine learning and neural networks for traits with epistasis.

Weverton Gomes da Costa¹, Maurício de Oliveira Celeri², Ivan de Paiva Barbosa³, Gabi Nunes Silva⁴, Camila Ferreira Azevedo³, Aluizio Borem³, Moysés Nascimento², Cosme Damião Cruz¹.

Abstract

Genomic wide selection (GWS) is one contributions of molecular genetics to breeding. Machine learning (ML) and artificial neural networks (ANN) methods are non-parameterized and can develop more accurate and parsimonious models for GWS analysis. Multivariate Adaptive Regression Splines (MARS) is considered one of the most flexible ML methods, automatically modeling nonlinearities and interactions of the predictor variables. This study aimed to evaluate and compare methods based on ANN, ML, including MARS, and G-BLUP through GWS. An F2 population formed by 1000 individuals and genotyped for 4010 SNP markers and twelve traits from a model considering epistatic effect, with QTL numbers ranging from eight to 480 and heritability ( h 2 ) of 0.3, 0.5 or 0.8 were simulated. Variation in heritability and number of QTL impacts the performance of methods. About quantitative traits (40, 80, 120, 240, and 480 QTLs) was observed highest R2 to Radial Base Network (RBF) and G-BLUP, followed by Random Forest (RF), Bagging (BA), and Boosting (BO). RF and BA also showed better results for traits to h 2 of 0.3 with R 2 values 16.51% and 16.30%, respectively, while MARS methods showed better results for oligogenic traits with R 2 values ranging from 39,12 % to 43,20 % in h 2 of 0.5 and from 59.92% to 78,56% in h 2 of 0.8. Non-additive MARS methods also showed high R2 for traits with high heritability and 240 QTLs or more. ANN and ML methods are powerful tools to predict genetic values in traits with epistatic effect, for different degrees of heritability and QTL numbers.

Entities: Chemical

Keywords: Genome wide selection; Genome-enabled prediction; Multivariate adaptive regression splines; Non-additive effects; Quantitative trait locus

Year: 2022 PMID： 36249559 PMCID： PMC9547190 DOI： 10.1016/j.csbj.2022.09.029

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN： 2001-0370 Impact factor: 6.155

Introduction

Genomic wide selection (GWS), proposed by [1], has become one of the main contributions of molecular genetics to breeding. The GWS approach increased the accuracy in the prediction of breeding values, making the selection of elite genotypes more efficient and accurate [2]. Furthermore, GWS made it possible to accelerate the improvement process by half the time in relevant crops, helping to sustain current food demands [3], [4], [5].This time reduction allowed breeders to maximize genetic gains per unit of time, in addition to early selection [6], [7]. All these benefits are due to the direct use of DNA information in the selection of individuals, associating marker information with phenotypic information, reducing the time and resources allocated to the development of a new cultivar [2], [8], [9]. Genome-based prediction is influenced by several factors, such as the predictive ability of the methods, the complexity of the trait's genetic architecture due to non-additive effects (dominance and epistasis), number of phenotypic observations and markers used [2]. Increasingly, researchers are turning to machine learning and neural network techniques, which have built-in predictor selection capabilities and are unparameterized to develop more accurate and parsimonious models [10]. Furthermore, some of these methods allow identifying interactions between markers. This property allows great flexibility to deal with different types of traits with gene control with additive, dominant and epistatic effects [5], [9], [11], [12]. Among the various methods based on machine learning, Multivariate Adaptive Regression Splines (MARS) is considered one of the most flexible [13], it proves to be more parsimonious and performs better than artificial neural networks for genomic prediction in some studies [10], [14]. MARS produces continuous models that can have multiple partitions, automatically models nonlinearities, and contemplates interactions of predictor variables using adaptively selected spline functions [15], [16], [17]. In the genetic context, MARS can be able to adjust the genetic architecture of the trait and can also detect interactions, such as epistasis, and can be used to define the type of trait control. Thus, the inheritance mode of the markers and their interactions can also be determined automatically, therefore, the number of parameters in the modeling can be drastically reduced [14]. Although several studies have proven the high power of MARS in the evaluation of genomic data in the medical field [14], [18], [19], [20], it is not known whether more advanced machine learning, such as MARS, offer superior performance over traditional statistical methods for genetic improvement. In this sense, the objectives of this study were: (i) to evaluate the general accuracy and the variability of the prediction performance of methods based on machine learning, including MARS, and neural networks in genomic prediction analyzes for simulated traits for different numbers of genes in the presence of dominance and epistasis and with different degrees of heritability and (ii) to compare the results obtained with G-BLUP in different scenarios.

Material and methods

Simulation of population genome

To simulate the data, an F2 population of a diploid species () with an effective size of 1000 individuals was taken as reference. The genotypic constitution of each individual was established considering the information in the genome and the random union of gametes from the parents, assuming a gametic pool of 5000 reproductive units, per parent, in each fertilization. The population was generated using divergent parental lines, i.e., contrasting homozygous parents (P1 dominant and P2 recessive), with a genome established considering 10 linkage groups with a size of 200 cM each. To provide linkage disequilibrium between markers, the percentage of recombination was equivalent to a distance between loci of 0.5 cM. The genome was generated with a saturation level of 401 equidistantly spaced molecular markers in each linkage group, resulting in a total of 4010 molecular markers in the genome. Markers were codominant (SNPs - Single Nucleotide Polymorphism), allowing the identification of heterozygous individuals.

Simulation and constitution of phenotypic values

From the simulated genotypic data of the F2 population, 18 traits with numbers of controlling genes ranging from 8 to 480 and heritability of 0.3, 0.5 or 0.8 were simulated (Table 1). The controlling genes (QTL - Quantitative Trait Locus) were distributed equally among the first 8 linkage groups (Supplementary Fig. 1).

Table 1

Number of controlling loci and heritability () of the 12 simulated traits (C1 to C18).

h2	Number of controlling loci
h2	8	40	80	120	240	480
0,3	C1	C2	C3	C4	C5	C6
0,5	C7	C8	C9	C10	C11	C12
0,8	C13	C14	C15	C16	C17	C18

Number of controlling loci and heritability () of the 12 simulated traits (C1 to C18). Eight QTL controlled for C1, C7 and C13 traits, defined by the central markers of the first eight linkage groups. For traits C2 to C6, C8 to C12, and C14 to C18 the QTL were distributed keeping an approximate distance between them, within the first 8 linkage groups (Supplementary Fig. 1). The total phenotypic values expressed by a given individual for traits C1 to C18 were simulated according with [21], [22], [23] considering the mean equal to 100 and coefficient of variation equal to 12 %, with a dominance level (d) equal to 0.5 and by a model with epistatic effect according to the following equation:where is the phenotypic value for observation ; is the general mean; is the effect of the favorable allele at locus j of individual i, that is, it assumes the values , and for the genotypic values associated with classes AA, Aa and aa, respectively, with being the mean between the dominant homozygote (AA) and the recessive homozygote (aa). Classes were identified by coding 1, 0 or − 1, respectively; represents the interaction between favorable alleles at different loci. The variance structure of the residues was given by [24] , where , where is the residual variance, is the genotypic variance, and the heritability.

Prediction of breeding values

The genomic breeding values (GEBVs) were predicted using methods based on statistical approaches, represented by G-BLUP, on neural network approaches, represented by the Multilayer Perceptron Network (MLP) and Radial Basis Function Network (RBF) and on learning approaches from Multivariate machine Adaptive Regression Splines (MARS), Decision Tree (DT), Boosting (BO), Bagging (BA) and Random Forest (RF). Neural network approaches are often treated as machine learning [12], [22], [25]. However, each approach has its specificity and here they will be considered as different approaches. As neural networks work like the human brain, they are composed of neurons organized in layers that capture all available information to generate a decision-making criterion, they differ from machine learning methods, which model the limitations of data separation with based on the learning decision rules on the input characteristics of the model [26].

Data analysis

For all methods, the input data was a matrix of molecular markers, represented by the genotypic values encoded in − 1, 0 and 1, simulated for 4010 markers and 1000 individuals. The methods returned in the output a vector with the GEBV for each individual. For comparison, the methods were grouped according to their respective learning approach: G-BLUP – G-BLUP; MLP and RBF – NETWORK; DT, BA, BO and RF – TREES; and MARS 1, MARS 2 and MARS 3 – MARS.

Multivariate Adaptive regression Splines (MARS)

The algorithm proposed by [27] Multivariate Adaptive Regression Splines (MARS), considers an expansion in piecewise linear functions, called basis functions (BFs), as follows: Each function is a piecewise linear spline, with a node at the value t. These two BFs are called a reflexive pair. MARS forms reflexive pairs for each input (marker) , with nodes at each observed value of that input. The model building strategy is like a progressive linear regression, but instead of using the original inputs, we used functions from the set and/or its products. The MARS model, which is a linear combination of the BFs and/or their interactions, is given by [28]: where is the regression constant, with m = 1, 2, …, M, are the regression coefficients, and is a function in , or a product of two or more functions. The estimation process of the parameters and is based on the minimization of the residual sum of squares. First, the forward phase is performed on the training data, initially starting to build the model only with the constant function , and all functions in the set are candidate functions. At each subsequent step, the base pair that produces the maximum reduction in training error is added. Considering a model with basic M functions, the next pair to be added to the model is [28]: where and are coefficients estimated by the least square method, together with all other coefficients in the model. This process of adding BFs continues until the model reaches a predetermined maximum number, often leading to a purposefully oversized model [29]. The backward phase improves the model by removing the least significant terms until finding the best submodel. The model subsets are compared using the generalized cross-validation (GCV) method. The GCV is the root-mean-square residual error divided by a penalty that depends on the complexity of the model [29]. The GCV is calculated as [28]: where is the effective number of model parameters, is a cost function for each basis function included in the developed submodel, which by default is adopted by default value of 3 [27], N is the number of datasets used in cross-validation and denotes the predicted MARS values. To identify the possible interaction between the QTLs, MARS models with degrees equal to 1, 2, and 3 were used, with the model with degree 1 considered an additive model and the others non-additive, which allow interactions between markers. For the stopping criterion of the forward phase, the maximum number of terms in the adopted model was equal to 200, as the default of the “earth” package of R. A preliminary analysis was carried out for the second stopping criterion [30], in which incrementing a term in the model would change the coefficient of determination from less than 0.001 (default) to 0.05, choosing the best model that presented the highest selective accuracy () for the validation set.

Genomic BLUP (G-BLUP)

An epistatic model, including dominance and additive effects, for the REML/G-BLUP method was used according to the following expression: where is the vector of phenotypic observations; is the vector of fixed effects (in this study, the general mean) with incidence matrix ; , e are vectors of genetic values of additive, dominant and epistatic effects, respectively; is the incidence matrix for these vectors; and ε is the random error vector. The variance structure was given by ); ); ) and , where , and are the genomic relationship matrices for the additive, dominant and epistatic effects, respectively , and , and are the additive, dominance and epistatic variances, respectively. For the construction of the genomic relationship matrices ( and ) used in the model, M_ij was considered to be the incidence of the number of alleles of brand of individual and the frequency of the dominant allele A in brand j. In this way, the W and S matrices were given by [31]: In this way, we obtain: Where is the Hadamard product operator. The mixed model equations for the full model were given by [24]: were , and and the variances were estimated by the Restricted Maximum Likelihood Method (REML).

Multilayer Perceptron neural Network (MLP)

The Levenberg-Marquardt backpropagation training algorithm was used for the Multilayer Perceptron Neural Network (MLP). Preliminary tests were performed with different architectures, being represented by 1 layer and the number of neurons varying from 5 to 15, to choose the best topology to be used. The linear activation function (purelin) was used. The linear function for the nth neuron of the output layer of an MLP was represented by: where: p is a linear activation function, is the bias term of the nth neuron, is the i-th input, is the synaptic weights to be adjusted and is the value coming from the layer hidden for each input i, assigned to an activation function. The activation function used in this work was the linear one ().

Neural Network Radial Base function (RBF)

The Radial Base Function Neural Network (RBF) uses a feedforward architecture. This model also consists of an input layer, a hidden layer, and an output layer. RBF training is hybrid (supervised and unsupervised) and the input layer information goes through a linear k-means cluster [11]. The hidden layer applies a non-linear transformation of the input space to a high-dimensional hidden space with a Gaussian function. The output layer applies a transformation to the hidden space, providing an output vector for the network. The RBF optimization training included: the weights between the hidden layer and the output layer, the activation function, the center of activation functions, the distribution of the center of activation functions, and the number of hidden neurons [11]. During the training process, only the weights between the hidden layer and the output layer are modified [9]. To select the best RBF architecture, according to the MLP, preliminary tests were carried out. The number of neurons ranged from 5 to 50 and radius size from 30 to 50. The mean square error was set to 0.05. The linear function for the nth neuron of the output layer of an RBF was represented by: where: g is a linear function, is the bias term of the nth neuron, is the i-th input, is the synaptic weights to be adjusted, and is the value coming from the hidden layer for each input i, assigned to the Gaussian activation function, which is given by the equation:, where c is the center of the Gaussian function, σ2 is the variance of the Gaussian function and i is the value of the individual's input, which represents the activation potential of the clustering phase.

Decision tree

The decision tree structure was based on a regression tree, created from the search for the tree that would lead to the data partition until the formation of homogeneous groups was obtained. To perform recursive binary division, first is the marker and the cutoff point s so that the division of the predictor space into the regions e leads to the greatest possible reduction in RSS. That is, we consider all markers and all possible values of the cutoff for each of the markers, and then choose the marker and cutpoint so that the resulting tree has the smallest RSS. The equation that reflects the binary division is [32]: and then we look for the value of and that minimize the equation: where: is the average of the response variable of the training observations belonging to the region , is the average of the response variable of the training observations belonging to the region and is the true value of the traits of each individual.

Bagging

The Bagging (BA) method creates several similar datasets by resampling (bootstrapping) to obtain an average of several regression trees that are performed without pruning for each dataset [33], [34]. Thus, a number B of models are obtained: . These generated models are used to obtain an average model, given by: . The number of trees sampled for BA was set at 500 trees.

Random Forest

Random Forest (RF) [35] is similar to BA in that bootstrap samples are used to build multiple trees, the difference being that each tree is established with a random subset of predictors. The number of predictors used to find the best split at each node is a subset that was chosen by , with v being the total number of predictors. The number of trees for the RF was set at 500. For the RF, the trees grow to their maximum size without pruning, and the aggregation is done by averaging the trees [25].

Boosting

Boosting (BO) creates trees sequentially using information from previous trees [32]. In this sense, BO is an approach repeatedly trained on the same sample so that at each iteration, a measure of prediction error is calculated for each marker, and in the next iteration, markers with higher errors receive greater weight in the model training. The prediction is performed by weighting the results of the set of all regression trees [36]. The number of trees sampled was 500, with a learning rate of 0.01 and a depth of 2. The following model was used to adjust the BO [28]: Where , m = 1, 2, …, M are the coefficients of base expansion and are simple functions of the multivariate argument , with a set of parameters .

Efficiency parameters

To evaluate the efficiency of the techniques, the selective accuracy was used, which is measured by the square of the correlation (R2) between the estimated values - GEBVs () and the real values (), and the root means square error (), which expresses the predictive accuracy. The selective and predictive accuracies were given respectively by the following equations: and.

Training and validation

For the training and validation of the techniques used, cross-validation (k-fold) was performed with k = 5 partitions [37]. In each of the five rounds, four of these subsets constituted the training population (80 % − 800 individuals), and the remaining subset constituted the validation population (20–200 individuals). The techniques were compared based on the arithmetic mean of the five performance estimates of the validation sets.

Computational aspects

Population simulations were performed using the GENES software [38]. The G-BLUP, DT, BA, RF, BO, and MARS methods were performed with the GENES software integrated with the R software [39], [40]. The MLP and RBF methods were performed by the GENES software integrated with the MATLAB software [39], [41].

Results

The phenotypic and genotypes values of the 18 simulated traits are shown in Fig. 1. The low variations can be explained by the 12% coefficient assigned in the simulation and the mean was very close to 100, as defined in the simulation.

Fig. 1

Boxplot of the genetic and phenotypic values of the 18 simulated traits, considering a coefficient of variation equal to 12% and mean to 100. The specification of each characteristic is represented in Table 1. The selective accuracy (R2) of the prediction of breeding values for all methods was higher in scenarios with higher heritability (Fig. 1). On the other hand, the variation in the number of QTL showed that the methods have diversity among the results obtained, indicating that the number of QTL of the traits directly influences the prediction of GEBVs according to the method used and that the increase in the number of QTL is harmful to the MARS approach and the DT method, while for the other methods the increase in the number of QTL reflects an improvement in . For both heritability scenarios, the methods based on machine learning, MARS and Trees, presented higher values of for the traits with the lowest number of QTL, when compared to the other methods (Fig. 1). The effect of the interaction between markers was even more evident for higher heritability (80 %), resulting in higher values for the non-additive MARS models (MARS 2 and 3). From the scenarios with 40 QTL, an increase was observed for the values of as the number of QTL increased, except for MARS and DT, reaching values close to those of the real genetic variation when the trait presents 480 QTL. In these scenarios, the G-BLUP and RBF methods, followed by RF and BA, presented the highest values for R2 and always above the general average (red line) for the traits for both heritability scenarios (Fig. 1). For scenarios with 80 % heritability and 40 or more QTL, the MLP and BO methods also deserve to be highlighted. Despite presenting lower values for compared to other methods, the predictive power of MARS methods for traits with many QTLs cannot be neglected, especially when there is a very high number of QTLs, such as 240 and 480 genes, the non-additive MARS methods (MARS 2 and 3) showed high R2 values (above 60 %), and considering the standard error, values lower only than G-BLUP (Fig. 1). It is worth mentioning that for these scenarios, MARS had high predictive potential, explaining almost all the genetic variations of these traits. Methods based on MARS and regression trees did not obtain a linear response as a function of increasing the number of QTL. On the other hand, both methods based on neural networks and G-BLUP showed a substantial improvement the higher the QTL number (Fig. 1). With the exception of DT, which presented lower values in almost all scenarios, the tree-based methods presented values close to the simulated heritability, mainly for scenarios with 240 QTL (Fig. 1). In addition, these methods presented values of greater than the overall mean of in all scenarios (Fig. 1). The BO method presented the best result when the scenarios were of greater heritability and 40 or more QTL. BO was also the method that showed the greatest sensitivity to heritability and showed a substantial improvement in results in higher heritability scenarios. The predictive accuracy results (), referring to the error in the prediction of the GEBVs of the individuals, were always smaller according to the increase in the number of QTL, that is, the greater the number of QTL, the lower the error in the prediction of the GEBVs of the individuals, regardless of the method used (Fig. 2). In this case, the impact caused by the increase in the number of QTLs on the prediction error of GEBVs is greater than the change in heritability and is inversely proportional. This result was possible due to the fixed number of markers, providing a greater proportion of direct effects of markers on traits in relation to those poorly correlated with the phenotype and without direct effect.

Fig. 2

Average results of selective accuracy () as a function of the number of genes and heritability for the families of the methods: Trees [Bagging (BA), Boosting (BO), Decision Tree (DT); and Random Forest (RF)]; Network (Multilayer Perceptron Network (MLP) and Radial Base Function Network (RBF) MARS (MARS 1, 2 and 3); and G-BLUP. The red dashed line refers to the overall mean value of the selective accuracy () between all methods for comparison purposes. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 3

Average results of predictive accuracy (RMSE) as a function of the number of genes and heritability for the families of the methods: Trees [Bagging (BA), Boosting (BO), Regression Tree (DT); and Random Forest (RF)]; Network (Multilayer Perceptron Network (MLP) and Radial Base Function Network (RBF) MARS (MARS 1, 2 and 3) and G-BLUP. The red dashed line refers to the overall mean value of predictive accuracy (RMSE)) between all methods for comparison purposes. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.) The RBF method presented very similar RMSE values when compared to those obtained through G-BLUP for all scenarios (Fig. 2). From 40 QTL, these methods presented the lowest values for RMSE. Similar values of these methods were obtained by the non-additive MARS (MARS 2 and MARS 3) for the scenarios with 240 and 480 QTL and heritability of 80 %.

Discussion

The inclusion of a greater number of marker variables in a predictive model can be useful to obtain a better performance, but it can lead to the addition of redundant information and make it difficult to apply in practice [20]. Furthermore, in hybrid populations, non-additive effects, i.e., dominance, and epistasis, are highly relevant and should also be considered [42]. In this sense, methods that deal with high dimensionality and take into account possible interactions between predictor variables have important traits. Many recent studies have been applied to GWS and have shown that machine learning and neural network methods can perform better or similarly in predicting genotypic data phenotypes compared to statistical methods [9], [12], [22], [43], [44], [45], [46]. However, there is still a gap to be filled on which methods may be preferable to perform the prediction when it comes to different degrees of heritability and QTL number, including when considering epistatic effects. The variability in the results for the different methods suggests that any method is prone to produce a differentiated result under some type of data perturbations. The results obtained were able to demonstrate the strong effect of heritability and increase in QTL number on R2 and RMSE values. Several studies have shown that there is a favorable effect of heritability on selective accuracy [9], [22], [36], [47], [48], as also obtained in this study. This is justified by the greater genetic variation in higher heritability’s and, consequently, less environmental effect, contributing to more accurate predictions of marker effects [22]. The results showed that there was a reduction in the RMSE in scenarios with a greater number of QTL, and that, in the same way, there was also a reduction in the RMSE in the scenarios with greater heritability, however in a smaller proportion. Results similar to those obtained in this study were observed by [36], in scenarios with heritability ranging from 0.1 to 0.5 and 100 QTL, and [22] evaluating scenarios with QTL numbers ranging from 2 to 88 and heritability from 0.3 to 0.8. The reduction in RMSE due to the increase in the number of QTLs may have occurred due to the lower influence of the multiplicative effect between the additive and dominant effects that characterize epistatic effects in more complex traits [22], [36], [49]. As MARS can simultaneously include multiple terms (additive and epistatic effects) in a model [50] genetic interactions can be better evaluated. Apparently, this fact could lead to a better prediction for GWS, since, in this way, it would be possible to reduce the residual variance of the model, by capturing information that before was isolated only residual component, such as the effects of the interaction between markers. However, as [14] explain, some patterns of interaction tend to be less pronounced to be detected in features with a high number of QTL. Thus, methods based on recursive partitioning, such as MARS and Trees, benefit from situations in which the predictor variables can be partitioned into well-defined regions [12], as is the case with features with lower QTL numbers (oligogenic). This is because traits controlled by few genes have well-defined phenotypic classes and suffer little or no environmental influence [51]. These results show that MARS is an alternative to be used, especially when it is easier to identify groups of individuals based on the population genome. Due to the identification of markers and/or interactions between markers of greater effect, MARS proved to be more efficient when the multiplicative effects of the controlling genes (epistasis) may be more important, since, for traits with lower QTL numbers, the multiplicative effects control genes (epistasis) may be of greater magnitude in proportional terms, as the individual effect of each gene is greater than in traits controlled by a greater number of QTL [22]. This is a direct result of its modeling philosophy, which tries to approximate a (possibly higher-order) function with a set of basic functions that are locally lower-order, so it has more power and flexibility to model relationships that are almost additive or involve interactions in at most a few variables [27]. The frame is different when there is a large number of predictor variables with high correlation and response trait has high genetic (dominance and epistasis effect) and environmental noise. When the trait involves a greater number of QTL, there is a greater chance that a marker explains in genetic terms the same variation of another marker, in addition to providing a greater action of the environmental effect, thus impairing the prediction efficiency. The excess of markers associated with a reduced number of genotypic observations can also lead to multicollinearity problems [12]. As [27] points out, MARS is not particularly robust against correlated inputs and relies heavily on data to infer the process model, in these cases, MARS loses explanatory power. Thus, analyses should use an optimal set of informative SNPs, according to expectations regarding the number of QTLs, to adopt the best analytical strategy, maximizing predictive accuracy estimates [12], [22]. In addition to MARS, another method susceptible to multicollinearity vulnerability is DT. If two predictors are highly collinear, MARS or DT has to make an arbitrary knot or split selection that minimizes the residual sum of squares, this can profoundly affect all subsequent selections and final predictions [52]. Also, more generally, recursive partitioning methods have difficulties when the dominant interactions involve a small fraction of the total number of variables, so one cannot discern whether the approximation function approximates a simple one, such as linear or additive, or if it involves complex interactions between variables [27]. This explains why MARS did not perform well in genomic regions where strong genetic interactions are present, such as for traits from 40 to 120 QTL. However, for scenarios where these effects are diluted, high QTL number (250 to 480), MARS has high predictive potential and lower model variance. Thus, it is notable that the excellent results obtained for the non-additive MARS show that this approach should be considered for GWS. The greater number of genotypic classes in scenarios with a greater number of QTL reduces the representativeness of each genotypic combination in the training set and overparameterization of the model [22]. In this context, it was these restrictions that led these algorithms to not present such satisfactory results when the trait is polygenic, mainly DT. Low DT efficiency was also observed by [12] to predict the genetic values for rust incidence in Coffea arabica and by [22] for simulated features with epistatic effects with 16 or more QTL. The approaches based on decision trees (BA, BO, and RF) showed excellent results regarding the accuracy of the GEBVs prediction for traits with many QTL. A differential of the BA and RF approaches is the resampling of the original data in sub-samples (bootstrap) to perform the prediction according to a number of determined trees. This resampling of data brings concrete benefits for prediction in these cases, allowing for the easy evaluation of poorly predicted samples and possible discrepancies [53]. BA analyzes its main effects on variance and can make forecasting more robust by decreasing the variance lead time and RF not only combines a large number of decision trees to reduce forecast variance like BA but also decreases dependency between decision trees by projecting random features to obtain a much smaller prediction error [54]. As a result, these methods perform better to maximize prediction in a target population, suggesting that bootstrapping can be performed by other methods to achieve better prediction results. As it is a gradient-enhancing algorithm that has a learning rate, the BO method combines individually weak predictors to produce a strong classifier [32], thus allowing a better prediction of the genetic effects of individuals, as observed in this study. Neural network approaches, as used in this study, MLP and RBF, apparently, are not affected by correlated inputs. The MLP and RBF methods are defended for being efficient in capturing nonlinear effects, in this case, provided by interallelic interactions [22]. Both RBF and MLP were harmed by the excess of ineffective markers, showing lower performance compared to other methods, as also found by [22], [55]. However, neural networks were efficient in predicting traits with many QTL, especially when the phenotypic value of the trait was mostly due to genetic value. G-BLUP considers the interaction between marker pairs and relies on the DL between SNPs and QTL, moreover, when QTL are in strong LD and the use of an unweighted genomic relationship matrix in G-BLUP can cause upward bias in the heritability estimate [56], [57], [58], [59]. However, if only a few markers are important, the technique is hampered by this bias, as confirmed in this study. On the other hand, the G-BLUP was highlighted in the performance of the prediction of the GEBVs, presenting very similar results to the Family Network methods for the traits with more QTL. Results similar to those found in this study were found in [22], [60]. However, some markers are more informative for some traits than others, this increase in the amount of information using the genomic matrix G (genomic relationship matrix) can sometimes lead to better and more accurate estimates and predictions [12]. These results corroborate other studies where the GBLUP precision increased for characters with a high proportion of non-additive variation and when with increasing heritability [61], [62], [63] and justified due to G-BLUP principle that genomic predictions are based on the relatedness derived from all markers [60], so when more markers have a genetic effect the prediction accuracy increases. Although MARS performs the selection of SNPs, eliminating a large number of markers, the performance of this method showed a greater difference for the RBF, MLP, and G-BLUP methods, which consider all markers, and BA, RF, and BO in the scenarios between 40 and 280 QTL. This can be explained by the fact that the F2 population has a high rate of linkage disequilibrium (LD), due to the combination process. This LD can then cause false-positive signals for some loci, which have no connection with the studied trait in question [59]. So, the SNPs closest to a QTL are not sampled often enough and the QTL signal may be captured by more distant SNPs, consequently, the signal from a QTL to MARS may be blurred compared to other methods. Alternatively, use of hybrid modeling schemes (combination of two or more methodologies) including MARS had been previously very effective with initial data clustering using c-means or principal components [64], [65], [66], it could be more important for diverse population for genomic predictions. For example, studies such as the one by [43] proposed hybrid smart modeling schemes for heart disease classification using combined MARS-ANN. This would be a viable alternative to improve predictability and decrease the effect of multicollinearity using MARS on genomic data, which is worthy of further investigation and deserves further research. The main limitation of the additive MARS is that the model is constrained to be additive. With many variables, important interactions can be missed. On the other hand, as the model is additive, we can examine the effect of each marker on the prediction of GEBVs individually. Furthermore, the model can be represented in a way that separately identifies additive contributions and those associated with different multivariate interactions, being useful for future studies applied to Genomic Wide Association Studies (GWAS). MARS also has several ways of improvement that can improve the predictability of the traits and that must be tested, mainly the change in gamma, where the model becomes more flexible to detect close variables for inclusion in the model for the forward phase. In general, as the number of QTLs increases, the total genetic variation is expected to be divided among the QTLs, which can reduce the efficiency of methods to estimate small QTL effects and lead to a loss of precision [36], [67]. This is confirmed only for traits that present stronger effects between interactions in the same linkage group, such as for traits with 40 QTL, since, as they have a smaller number of QTL in a single linkage group, the expression of interactions between these QTL is stronger. On the other hand, the increase in efficiency for a greater number of QTL can be attributed to the excess of markers with null effects, which can impair the accuracy of the methods [12], [22]. Each technique has its specificity and must be evaluated in a wide set of data so that the decision on which method to base is correctly made [2]. It is rare that more than one technique is used when performing GWS analyses, but these results align with the view that evaluating multiple methods is a useful strategy to ensure that uncertainty in data is considered from multiple angles.

Conclusions

MARS ability to simplify complex relationships is quite pertinent to GWS, as most traits of interest in plant breeding are affected by complex interactions of biological, environmental, and management conditions. Non-additive MARS is better for predicting breeding values than additive MARS in the scenarios evaluated. The additive and non-additive MARS methods showed superior results in the prediction of genetic values in characters with dominant and epistatic effects for scenarios with eight QTL in relation to G-BLUP methods, neural networks, and other machine learning methods. The use of different statistical methods, neural networks, and machine learning, such as MARS, to estimate genetic values resulted in different consequences influenced by the complexity and particularity of the analyzed traits. Therefore, it is recommended that when evaluating the prediction of genetic values, the use of multiple approaches is used, in order to choose the best method to be used.

CRediT authorship contribution statement

Weverton Gomes da Costa: Conceptualization, Methodology, Formal analysis, Data curation, Investigation, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing. Mauricio de Oliveira Celeri: Methodology, Investigation, Visualization, Writing – original draft, Writing – review & editing. Ivan de Paiva Barbosa: Conceptualization, Methodology, Visualization, Writing – original draft. Gabi Nunes Silva: Visualization, Writing – original draft. Camila Ferreira Azevedo: Visualization, Writing – original draf. Aluizio Borem: Visualization, Writing – original draft. Moysés Nascimento: Investigation , Visualization, Writing – original draft, Project administration, Supervision, Writing - Review & Editin. Cosme Damião Cruz: Investigation, Visualization, Writing – original draft, Project administration, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

30 in total

1. Tree and spline based association analysis of gene-gene interaction models for ischemic stroke.

Authors: Nancy R Cook; Robert Y L Zee; Paul M Ridker
Journal: Stat Med Date: 2004-05-15 Impact factor: 2.373

2. Multivariate adaptive regression splines: a powerful method for detecting disease-risk relationship differences among subgroups.

Authors: Timothy P York; Lindon J Eaves; Edwin J C G van den Oord
Journal: Stat Med Date: 2006-04-30 Impact factor: 2.373

3. A novel linkage-disequilibrium corrected genomic relationship matrix for SNP-heritability estimation and genomic prediction.

Authors: Boby Mathew; Jens Léon; Mikko J Sillanpää
Journal: Heredity (Edinb) Date: 2017-12-14 Impact factor: 3.821

4. Application of Machine-Learning Models to Predict Tacrolimus Stable Dose in Renal Transplant Recipients.

Authors: Jie Tang; Rong Liu; Yue-Li Zhang; Mou-Ze Liu; Yong-Fang Hu; Ming-Jie Shao; Li-Jun Zhu; Hua-Wen Xin; Gui-Wen Feng; Wen-Jun Shang; Xiang-Guang Meng; Li-Rong Zhang; Ying-Zi Ming; Wei Zhang
Journal: Sci Rep Date: 2017-02-08 Impact factor: 4.379

5. Genomic Prediction of Breeding Values Using a Subset of SNPs Identified by Three Machine Learning Methods.

Authors: Bo Li; Nanxi Zhang; You-Gan Wang; Andrew W George; Antonio Reverter; Yutao Li
Journal: Front Genet Date: 2018-07-04 Impact factor: 4.599

6. Factors Affecting the Accuracy of Genomic Selection for Agricultural Economic Traits in Maize, Cattle, and Pig Populations.

Authors: Haohao Zhang; Lilin Yin; Meiyue Wang; Xiaohui Yuan; Xiaolei Liu
Journal: Front Genet Date: 2019-03-14 Impact factor: 4.599

7. Clinical predictive modelling of post-surgical recovery in individuals with cervical radiculopathy: a machine learning approach.

Authors: Bernard X W Liew; Anneli Peolsson; David Rugamer; Johanna Wibault; Hakan Löfgren; Asa Dedering; Peter Zsigmond; Deborah Falla
Journal: Sci Rep Date: 2020-10-08 Impact factor: 4.379