
Genome wide association analysis of the 16th QTL-MAS Workshop dataset using the Random Forest machine learning approach.

Giulietta Minozzi, Andrea Pedretti, Stefano Biffani, Ezequiel Luis Nicolazzi, Alessandra Stella.

Abstract

BACKGROUND: Genome wide association studies are now widely used in the livestock sector to estimate the association between single nucleotide polymorphisms (SNPs) distributed across the whole genome and one or more traits. As computational power increases, the use of machine learning techniques to analyze large genome wide datasets becomes possible.
METHODS: The objective of this study was to identify SNPs associated with the three traits simulated in the 16th QTL-MAS workshop dataset using the Random Forest (RF) approach. The approach was applied to single and multiple trait estimated breeding values and to yield deviations, and the results were compared with those of the GRAMMAR-CG method.
RESULTS: The two QTL mapping methods used, GRAMMAR-CG and RF, were successful in identifying the main QTLs for trait 1 on chromosomes 1 and 4, for trait 2 on chromosomes 1, 4 and 5 and for trait 3 on chromosomes 1, 2 and 3.
CONCLUSIONS: The results of the RF approach were confirmed by the GRAMMAR-CG method and validated against the actual QTL positions, even though the two methods handle cryptic genetic structure differently. Furthermore, the two methods produced complementary findings. However, when the variance explained by a QTL was low, both failed to detect significant associations.

Year:  2014        PMID: 25519518      PMCID: PMC4195406          DOI: 10.1186/1753-6561-8-S5-S4

Source DB:  PubMed          Journal:  BMC Proc        ISSN: 1753-6561


Background

Genome wide association studies (GWAs) are now widely used in the livestock sector to estimate the association between multiple single nucleotide polymorphisms (SNPs) distributed across the whole genome and one or more traits. GWAs are typically carried out one marker at a time, by performing a marginal chi-square test or regression. However, these methods do not take into account linkage disequilibrium between markers or the genetic structure of the population, which can have a large impact in structured populations (e.g. cattle populations). Approaches for genome wide pedigree-based quantitative trait loci (QTL) analysis have been developed (e.g. GRAMMAR-CG), which are based on mixed model and regression, where the genomic kinship matrix estimated from genomic marker data can be used to correct for familial correlation and cryptic relatedness [1]. As computational power increases, the use of more advanced machine learning techniques to analyze large genome wide datasets becomes possible [2]. These techniques include Support Vector Machines [3], Bayesian Networks [4] and Random Forest [5]. The Random Forest (RF) algorithm [6] is a machine-learning method that has been widely applied to classification and regression problems, and is particularly well suited to circumstances in which the number of potential explanatory variables exceeds the number of observations, as is the case for GWAs. The RF algorithm produces a collection of trees (a forest), each grown on a different bootstrap sample of observations. At each split (node) of a tree, a different random subset of predictors (SNPs) is evaluated to identify the best split. The final scores are then calculated by aggregating the predictions of all the trees grown in the forest.
RF combines several characteristics that make it appropriate for genetic applications: it is well suited to very large datasets; it is non-parametric and thus does not require a causal model to be specified; it is highly parallelizable; and it considers interactions between predictors. The objective of this study was to identify SNPs associated with the three traits simulated in the 16th QTL-MAS workshop dataset using the Random Forest approach and to compare them with the results obtained by the GRAMMAR-CG method. SNPs identified by both methods were verified against the actual QTL positions.
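The mechanism described above (a bootstrap sample per tree, a random subset of SNPs per split, and aggregation over trees) can be illustrated with a deliberately reduced sketch. It assumes one-split trees (stumps) rather than fully grown trees, simulated genotypes coded 0/1/2, and importance measured as the summed decrease in squared error; all function names, sizes and the simulated data are hypothetical and are not taken from the study.

```python
import random

def sse(values):
    """Sum of squared deviations from the mean (node impurity for regression)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(genotypes, y, rows, snps):
    """Among the candidate SNPs, return the one whose split reduces SSE the most."""
    parent = sse([y[i] for i in rows])
    best_snp, best_gain = None, 0.0
    for s in snps:
        for cut in (0, 1):  # genotype codes 0/1/2: split at <=0 or <=1
            left = [y[i] for i in rows if genotypes[i][s] <= cut]
            right = [y[i] for i in rows if genotypes[i][s] > cut]
            if not left or not right:
                continue
            gain = parent - sse(left) - sse(right)
            if gain > best_gain:
                best_snp, best_gain = s, gain
    return best_snp, best_gain

def stump_forest_importance(genotypes, y, n_trees, mtry, rng):
    """Grow n_trees stumps and accumulate the impurity decrease per SNP."""
    n, p = len(y), len(genotypes[0])
    importance = [0.0] * p
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of individuals
        snps = rng.sample(range(p), mtry)            # random subset of predictors
        snp, gain = best_split(genotypes, y, rows, snps)
        if snp is not None:
            importance[snp] += gain
    return importance

# Hypothetical toy data: 200 individuals, 20 SNPs, one causal SNP (index 7).
rng = random.Random(1)
n, p, causal = 200, 20, 7
genotypes = [[rng.randrange(3) for _ in range(p)] for _ in range(n)]
y = [2.0 * g[causal] + rng.gauss(0.0, 0.5) for g in genotypes]

importance = stump_forest_importance(genotypes, y, n_trees=200, mtry=4, rng=rng)
ranking = sorted(range(p), key=lambda s: importance[s], reverse=True)
print("top-ranked SNP index:", ranking[0])
```

Even in this reduced form, SNPs are ranked by how much they reduce impurity across the forest, which is the kind of ordering used to select candidate SNPs in an RF-based association scan.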

Methods

Dataset

The dataset used was provided by the organisers of the 16th QTL-MAS workshop and consisted of 4080 individuals (G0 to G4). The simulated genome was 499.750 Mb long and consisted of 5 chromosomes, each carrying 2,000 equally distributed SNPs (10,000 SNPs in total). The GWA analysis was conducted on 3000 samples, all females belonging to generations G1 to G3, for which phenotypic information for three traits (yield deviations) was provided. The analysis was performed on: yield deviations (YD1, YD2 and YD3), the estimated breeding values (EBVs) obtained from a single trait model (tr1_ST, tr2_ST, tr3_ST) and the EBVs obtained from a multiple trait model (tr1_MT, tr2_MT, tr3_MT).
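The SNP names, chromosomes and positions listed in Tables 2 to 4 are consistent with markers numbered consecutively along the five chromosomes at a spacing of 50 kb. That layout is inferred from the tables, not stated in the dataset description, so the helper below is only a sketch under that assumption; the function name `locate` is hypothetical.

```python
SNPS_PER_CHROMOSOME = 2000  # from the dataset description
N_CHROMOSOMES = 5           # from the dataset description
SPACING_KB = 50             # assumption: inferred from neighbouring SNPs in the tables

def locate(snp_name):
    """Map a name such as 'SNP6499' to (chromosome, position in kb)."""
    if not snp_name.startswith("SNP"):
        raise ValueError("expected a name of the form SNP<number>")
    index = int(snp_name[3:])
    if not 1 <= index <= SNPS_PER_CHROMOSOME * N_CHROMOSOMES:
        raise ValueError("SNP index out of range")
    chromosome = (index - 1) // SNPS_PER_CHROMOSOME + 1
    within = (index - 1) % SNPS_PER_CHROMOSOME
    return chromosome, within * SPACING_KB

print(locate("SNP6499"))  # chromosome 4, 24900 kb (24.900 Mb)
print(locate("SNP1682"))  # chromosome 1, 84050 kb (84.050 Mb)
```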

Analysis

Variance components and EBV estimation

Variance components and EBVs were obtained separately, using the REMLF90 and BLUPF90 programs, respectively [7]. The model used to estimate variance components and EBVs was:

$$y_{ijk} = \mu_k + \mathrm{GEN}_i + \mathrm{Animal}_j + e_{ijk}$$

where $y_{ijk}$ is the observation for the kth trait on animal j in generation i, $\mu_k$ is a general mean for the kth trait, $\mathrm{GEN}_i$ is a fixed effect of the ith generation (i = 1 to 3), $\mathrm{Animal}_j$ is a random animal effect with distribution $N(0, \sigma^2_a)$, where $\sigma^2_a$ is the additive genetic variance, and $e_{ijk}$ is the random residual with distribution $N(0, \sigma^2_e)$, where $\sigma^2_e$ is the residual variance. Covariance between traits was considered only in the multiple-trait analysis.

Random Forest

Feature selection (SNP) analysis was performed with the randomForest package in R [8] using 3000 individuals and the 9042 SNPs that passed quality control checks out of the total of 10000 SNPs. The minimum size of the terminal nodes was set to 5. The number of trees grown was set to 1000. The subset of samples evaluated at each tree was 70% of the total number of samples (n = 2100). The number of variables evaluated at each node was set to the square root of the number of predictors (p = 94). All SNPs were ordered by Mean decrease Gini index [6], and the most strongly associated SNPs are at the top of the lists shown in Tables 2, 3 and 4.
Table 2

Top SNPs identified by the Random Forest and GRAMMAR-CG approaches for Trait 1

Trait 1
Random Forest method            GRAMMAR-CG

SNP name  CHR  Pos (Mb)         SNP name  CHR  Pos (Mb)  p-value

Multiple Trait EBV              Multiple Trait EBV
SNP6499   4    24.900           SNP6499   4    24.900    3.6E-17
SNP4688   3    34.350           SNP1682   1    84.050    1.2E-07
SNP4674   3    33.650           SNP1683   1    84.100    4.4E-06
SNP4197   3    9.800            SNP6498   4    24.850    7.2E-06
SNP7145   4    57.200           SNP3585   2    79.200    1.1E-05
SNP1012   1    50.550           SNP6501   4    25.000    1.1E-05
SNP1614   1    80.650           SNP6469   4    23.400    8.4E-05
SNP6534   4    26.650           SNP9362   5    68.050    9.5E-05

Single Trait EBV                Single Trait EBV
SNP6499   4    24.900           SNP6499   4    24.900    9.8E-16
SNP4688   3    34.350           SNP1682   1    84.050    9.8E-06
SNP4674   3    33.650           SNP6501   4    25.000    1.4E-05
SNP4197   3    9.800            SNP6498   4    24.850    3.0E-05
SNP1012   1    50.550           SNP1683   1    84.100    5.0E-05
SNP1614   1    80.650           SNP293    1    14.600    5.6E-05

Yield Deviation                 Yield Deviation
SNP6499   4    24.900           SNP6499   4    24.900    1.9E-19
SNP1683   1    84.100           SNP1682   1    84.050    4.2E-09
SNP6507   4    25.300           SNP1683   1    84.100    2.8E-08
SNP1614   1    80.650           SNP6498   4    24.850    2.7E-07
SNP6506   4    25.250           SNP6501   4    25.000    6.9E-07
SNP4674   3    33.650           SNP6506   4    25.250    3.3E-06
SNP1682   1    84.050           SNP293    1    14.600    9.2E-06
SNP9374   5    68.650           SNP6507   4    25.300    2.7E-05
SNP1012   1    50.550           SNP1699   1    84.900    5.1E-05
SNP1685   1    84.200           SNP1161   1    58.000    7.6E-05

Table 3

Top SNPs identified by the Random Forest and GRAMMAR-CG approaches for Trait 2

Trait 2
Random Forest method            GRAMMAR-CG

SNP name  CHR  Pos (Mb)         SNP name  CHR  Pos (Mb)  p-value

Multiple Trait EBV              Multiple Trait EBV
SNP6499   4    24.900           SNP6499   4    24.900    1.93E-18
SNP7151   4    57.500           SNP293    1    14.600    3.51E-10
SNP298    1    14.850           SNP4044   3    2.150     6.38E-10
SNP2171   2    8.500            SNP298    1    14.850    4.07E-07
SNP293    1    14.600           SNP6501   4    25.000    1.79E-06
SNP9528   5    76.350           SNP6498   4    24.850    1.74E-05
SNP296    1    14.750           SNP296    1    14.750    8.60E-05

Single Trait EBV                Single Trait EBV
SNP6499   4    24.900           SNP6499   4    24.900    1.93E-19
SNP7151   4    57.500           SNP293    1    14.600    1.74E-09
SNP2171   2    8.500            SNP4044   3    2.150     1.24E-08
SNP298    1    14.850           SNP298    1    14.850    1.12E-06
SNP293    1    14.600           SNP6501   4    25.000    1.17E-06
SNP9528   5    76.350           SNP6498   4    24.850    8.61E-06

Yield Deviation                 Yield Deviation
SNP6499   4    24.900           SNP6499   4    24.900    2.90E-24
SNP293    1    14.600           SNP293    1    14.600    1.21E-11
SNP298    1    14.850           SNP6501   4    25.000    2.34E-08
SNP296    1    14.750           SNP298    1    14.850    4.93E-08
SNP6507   4    25.300           SNP6498   4    24.850    7.76E-08
SNP6506   4    25.250           SNP4044   3    2.150     3.19E-07
SNP6425   4    21.200           SNP296    1    14.750    1.58E-06
SNP9374   5    68.650           SNP6506   4    25.250    2.80E-06
SNP295    1    14.700           SNP295    1    14.700    3.81E-06
SNP7151   4    57.500           SNP6503   4    25.100    1.66E-05
                                SNP6507   4    25.300    2.37E-05
                                SNP6504   4    25.150    5.45E-05
                                SNP6502   4    25.050    8.14E-05
                                SNP9362   5    68.050    8.14E-05

Table 4

Top SNPs identified by the Random Forest and GRAMMAR-CG approaches for Trait 3

Trait 3
Random Forest method            GRAMMAR-CG

SNP name  CHR  Pos (Mb)         SNP name  CHR  Pos (Mb)  p-value

Multiple Trait EBV              Multiple Trait EBV
SNP4738   3    36.850           SNP3585   2    79.200    1.54E-22
SNP3585   2    79.200           SNP4738   3    36.850    1.71E-14
SNP1683   1    84.100           SNP4044   3    2.150     1.52E-13
SNP3584   2    79.150           SNP1682   1    84.050    5.52E-13
SNP1291   1    64.500           SNP1683   1    84.100    7.06E-11
SNP1478   1    73.850           SNP3584   2    79.150    9.54E-11
SNP1682   1    84.050           SNP1699   1    84.900    8.86E-08
SNP1169   1    58.400           SNP1166   1    58.250    3.33E-06

Single Trait EBV                Single Trait EBV
SNP1683   1    84.100           SNP1682   1    84.050    7.61E-13
SNP4738   3    36.850           SNP1683   1    84.100    1.12E-12
SNP7012   4    50.550           SNP3585   2    79.200    1.36E-11
SNP1291   1    64.500           SNP4044   3    2.150     9.75E-09
SNP1169   1    58.400           SNP1699   1    84.900    2.41E-08
SNP1478   1    73.850           SNP1161   1    58.000    1.61E-07
SNP296    1    14.750           SNP4738   3    36.850    2.28E-06
SNP3585   2    79.200           SNP1178   1    58.850    1.43E-05
SNP4317   3    15.800           SNP4047   3    2.300     2.39E-05
SNP295    1    14.700           SNP3584   2    79.150    2.80E-05

Yield Deviation                 Yield Deviation
SNP1683   1    84.100           SNP1682   1    84.050    2.49E-19
SNP1682   1    84.050           SNP1683   1    84.100    2.09E-18
SNP4738   3    36.850           SNP3585   2    79.200    5.84E-15
SNP3585   2    79.200           SNP1699   1    84.900    3.29E-12
SNP295    1    14.700           SNP4044   3    2.150     3.88E-08
SNP1161   1    58.000           SNP1161   1    58.000    6.46E-08
SNP1169   1    58.400           SNP3584   2    79.150    8.54E-08
SNP296    1    14.750           SNP4738   3    36.850    2.30E-07
SNP278    1    13.850           SNP1166   1    58.250    1.86E-06
SNP1166   1    58.250           SNP1697   1    84.800    1.92E-06
                                SNP1178   1    58.850    3.32E-06
                                SNP1168   1    58.350    7.79E-06
                                SNP1685   1    84.200    1.42E-05
                                SNP3595   2    79.700    2.08E-05
                                SNP4047   3    2.300     3.11E-05

GRAMMAR-CG

Genome-wide association analysis was performed with the GenABEL package in R using a three-step GRAMMAR-CG (Genome wide Association using Mixed Model and Regression - Genomic Control) approach [1,9].
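The last element of the approach named above is a genomic control correction. As a minimal sketch of that step alone, assume that per-SNP chi-square statistics with 1 degree of freedom are already available from the regression step. The function name and the input values are hypothetical, and this is not the GenABEL implementation.

```python
import math
from statistics import median

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df

def genomic_control(chi2_stats):
    """Return (lambda, corrected statistics, corrected p-values).

    lambda is the ratio of the observed median statistic to its expected
    value under the null; every statistic is divided by lambda. No clamping
    to lambda >= 1 is applied here (an assumption of this sketch).
    """
    lam = median(chi2_stats) / CHI2_1DF_MEDIAN
    corrected = [x / lam for x in chi2_stats]
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    pvalues = [math.erfc(math.sqrt(x / 2.0)) for x in corrected]
    return lam, corrected, pvalues

# Hypothetical test statistics for five SNPs.
stats = [0.1, 0.3, 0.9098728, 1.5, 40.0]
lam, corrected, pvalues = genomic_control(stats)
print(round(lam, 6), [round(c, 3) for c in corrected])
```

Dividing by lambda rescales the whole distribution of test statistics so that its median matches the null expectation, which is how residual inflation or deflation is removed before p-values are read off.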

Results and discussion

Variance components and EBV estimation

Means and standard deviations of the nine phenotypes used are shown in Table 1. The heritability estimates from the single trait model were 0.38, 0.38 and 0.50 for traits 1, 2 and 3, respectively. A large genetic correlation was observed between traits 1 and 2 (0.83), whereas a lower genetic correlation was observed between traits 2 and 3 (0.12). A negative genetic correlation was observed between traits 1 and 3 (-0.44).
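For reference, the quantities reported above are conventionally derived from the estimated variance components as follows. This is the standard textbook definition, not a formula quoted from the study; $\sigma^2_a$ and $\sigma^2_e$ are the additive genetic and residual variances defined in Methods, and $\sigma_{a_k a_l}$, the additive genetic covariance between traits k and l, is notation introduced here for illustration.

```latex
h^2_k = \frac{\sigma^2_{a_k}}{\sigma^2_{a_k} + \sigma^2_{e_k}},
\qquad
r_g(k,l) = \frac{\sigma_{a_k a_l}}{\sqrt{\sigma^2_{a_k}\,\sigma^2_{a_l}}}
```

Under this definition, a heritability of 0.50 for trait 3 would mean that the additive genetic and residual variances are equal.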
Table 1

Statistics of the nine phenotypes used in the GWAs.

Trait    n     Mean    Sd
YD1      3000  0       176.519
YD2      3000  0       9.512
YD3      3000  0       0.024
tr1_MT   3000  -0.238  81.495
tr2_MT   3000  0.03    14.264
tr3_MT   3000  0       0.015
tr1_ST   3000  -0.555  79.057
tr2_ST   3000  0.004   14.254
tr3_ST   3000  0       0.014

Yield deviations for the three traits (YD1, YD2 and YD3), estimated breeding values (EBV) obtained from a single trait model (tr1_ST, tr2_ST, tr3_ST) and from a multiple trait model (tr1_MT, tr2_MT, tr3_MT).

Association mapping

Both QTL mapping methods, GRAMMAR-CG and RF, successfully identified the largest QTLs: for trait 1 on chromosomes 4 and 1 at positions 24 Mb and 84 Mb; for trait 2 on chromosomes 4, 5 and 1 at positions 24 Mb, 68 Mb and 14 Mb; and for trait 3 on chromosomes 1, 2 and 3 at 84 Mb, 79 Mb and 36 Mb, respectively. Positions and names of the significant SNPs are shown in Tables 2, 3 and 4. Both methods located the QTL precisely when compared with the "real" QTL positions provided by [10]. Interestingly, the exact markers flanking the QTL were identified for all traits. Differences were observed, however, depending on i) the phenotype analysed (YD, single trait EBV or multiple trait EBV) and ii) the method used (RF or GRAMMAR-CG). With regard to Trait 1, the GRAMMAR-CG method identified 8 significant associations for multiple trait EBV, 6 for single trait EBV and 10 for YD, only 4 of which were common to the three phenotypes. The RF approach identified the same number of markers per phenotype, but only 2 markers were common to both methods of analysis and all phenotypes. The two markers identified by both approaches corresponded to the QTL that explained the largest variance. The other markers are nevertheless all true associations, which indicates that different phenotypes for the same trait and different analysis methods may give overlapping results but may also differ in the QTLs and positions detected. Traits 2 and 3 followed the same pattern as trait 1. Several QTL were identified in common between phenotypes and methods, but only a few were common to both analysis methods: 2 markers for trait 2 and 3 markers for trait 3. When the YD phenotype was used, a larger number of significant SNPs was detected. This may be due to the larger variability of the YD compared with the more regressed EBV phenotypes (Table 1). Interestingly, both methods failed to identify the QTLs on chromosomes 4 and 5 for Trait 3.
The variance explained by these markers is low, suggesting that neither method is able to detect QTLs that explain a small amount of variance. The RF approach, however, detected the QTL on chromosomes 5 and 3 for Trait 1. Overall, the results of the RF approach were confirmed by those of the GRAMMAR-CG method and were validated against the actual positions given for the QTL. Interestingly, even though the RF approach does not directly use family structure information through a relationship matrix (genomic or additive), as the GRAMMAR-CG approach does, it still identified the QTL positions correctly.

Conclusions

In this study we proposed the use of recursive partitioning approaches such as Random Forest as an alternative to traditional regression methods for detecting genetic loci. The results of the RF approach were consistent with those of the GRAMMAR-CG method and were validated against the actual positions given for the QTL. However, when the variance explained by a QTL was low, both methods failed to detect a significant association.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

AP, GM, ELN, and SB participated in the design and performed the statistical analysis. GM and AS conceived the study, participated in its design and coordination. GM drafted the manuscript. All authors read and approved the final manuscript.
  6 in total

1.  GenABEL: an R library for genome-wide association analysis.

Authors:  Yurii S Aulchenko; Stephan Ripke; Aaron Isaacs; Cornelia M van Duijn
Journal:  Bioinformatics       Date:  2007-03-23       Impact factor: 6.937

Review 2.  Random forests for genetic association studies.

Authors:  Benjamin A Goldstein; Eric C Polley; Farren B S Briggs
Journal:  Stat Appl Genet Mol Biol       Date:  2011-07-12

3.  Analysis of multiple single nucleotide polymorphisms of candidate genes related to coronary heart disease susceptibility by using support vector machines.

Authors:  Yeomin Yoon; Junghan Song; Seung Ho Hong; Jin Q Kim
Journal:  Clin Chem Lab Med       Date:  2003-04       Impact factor: 3.694

4.  Genome-wide association study for backfat thickness in Canchim beef cattle using Random Forest approach.

Authors:  Fabiana Barichello Mokry; Roberto Hiroshi Higa; Maurício de Alvarenga Mudadu; Andressa Oliveira de Lima; Sarah Laguna Conceição Meirelles; Marcos Vinicius Gualberto Barbosa da Silva; Fernando Flores Cardoso; Maurício Morgado de Oliveira; Ismael Urbinati; Simone Cristina Méo Niciura; Rymer Ramiz Tullio; Maurício Mello de Alencar; Luciana Correia de Almeida Regitano
Journal:  BMC Genet       Date:  2013-06-05       Impact factor: 2.797

5.  Genetic studies of complex human diseases: characterizing SNP-disease associations using Bayesian networks.

Authors:  Bing Han; Xue-wen Chen; Zohreh Talebizadeh; Hua Xu
Journal:  BMC Syst Biol       Date:  2012-12-17

6.  A genomic background based method for association analysis in related individuals.

Authors:  Najaf Amin; Cornelia M van Duijn; Yurii S Aulchenko
Journal:  PLoS One       Date:  2007-12-05       Impact factor: 3.240
