
Genome-wide association mapping for flowering and maturity in tropical soybean: implications for breeding strategies.

Rodrigo Iván Contreras-Soto¹,²,³, Freddy Mora⁴, Fabiane Lazzari⁵, Marco Antônio Rott de Oliveira⁶, Carlos Alberto Scapim¹, Ivan Schuster⁵.

Abstract

Knowledge of the genetic architecture of flowering and maturity is needed to develop effective breeding strategies in tropical soybean. The aim of this study was to identify haplotypes across multiple environments that contribute to flowering time and maturity, with the purpose of selecting desired alleles, but maintaining a minimal impact on yield-related traits. For this purpose, a genome-wide association study (GWAS) was undertaken to identify genomic regions that control days to flowering (DTF) and maturity (DTM) using a soybean association mapping panel genotyped for single nucleotide polymorphism (SNP) markers. Complementarily, yield-related traits were also assessed to discuss the implications for breeding strategies. To detect either stable or specific associations, the soybean cultivars (N = 141) were field-evaluated across eight tropical environments of Brazil. Seventy-two and forty associations were significant at the genome-wide level relating respectively to DTM and DTF, in two or more environments. Haplotype-based GWAS identified three haplotypes (Gm12_Hap12; Gm19_Hap42 and Gm20_Hap32) significantly co-associated with DTF, DTM and yield-related traits in single and multiple environments. These results indicate that these genomic regions may contain genes that have pleiotropic effects on time to flowering, maturity and yield-related traits, which are tightly linked with multiple other genes with high rates of linkage disequilibrium.


Keywords: linkage disequilibrium; pleiotropy; quantitative trait loci; tropical soybean

Year: 2017 PMID: 29398937 PMCID: PMC5790042 DOI: 10.1270/jsbbs.17024

Source DB: PubMed Journal: Breed Sci ISSN: 1344-7610 Impact factor: 2.086

Introduction

Flowering, maturity and plant height are key complex traits determining soybean productivity and adaptability (Cober and Morrison 2010, Zhang et al.). Most of these traits have been studied through their correlation with yield to improve the understanding of their relationship to yield components (Fox et al., Li et al., Mansur et al.). Moreover, to improve relevant agronomic traits in breeding programs, where large populations are evaluated every year, genotyping with a small number of markers would be more feasible (Schuster 2011). Consequently, it is desirable to identify molecular markers in genetically superior progenies or exotic plant introductions carrying favorable alleles, which can then be introgressed using marker-assisted selection (MAS) (Fox et al.). Yield quantitative trait loci (QTL) are often detected within the context of specific soybean breeding populations and environments, since conditions in any given environment, geographic region or year can change the grain yield (Guzman et al., Orf et al.). According to Palomeque et al., studies have identified QTLs associated with traits of interest that appear to be independent of the environment but dependent on the genetic background in which they are found. The difficulty of identifying yield QTL that are effective for MAS across a wide range of genetic and/or environmental contexts might be addressed by using preliminary yield trials to model target haplotypes within each context and then immediately selecting inbred lines that carry the target genotypes in real time (Sebastian et al.). Sebastian et al. demonstrated that using MAS with haplotypes to improve grain yield is possible if it is focused within a specific genetic and environmental context. In addition, the context-specific approach has already been adopted as a major component of MAS strategies known commercially as Accelerated Yield Technology (AYT) at Pioneer Hi-Bred International.
Genome-wide association studies (GWAS) using individual single nucleotide polymorphisms (SNPs) and haplotype information have been used to improve agronomic traits in soybean (Contreras-Soto et al., Hao et al., Zhang et al.). A haplotype block is a genomic region in which two or more polymorphic loci (i.e., SNPs) in close proximity tend to be inherited together with high probability (Abdel-Shafy et al.). These blocks are believed to be delimited by recombination hotspots, with recombination being extremely rare within the intervening stretches of DNA, so that the enclosed SNPs segregate together from one generation to the next and act as combined multi-site alleles (Greenspan and Geiger 2004). The combination of SNP alleles in a haplotype block on one chromosome covers the observed variation and can have higher linkage disequilibrium (LD) with the allele of a QTL than the individual SNP alleles that are used to construct the haplotype (Abdel-Shafy et al.). Furthermore, haplotype association is likely to be more powerful in the presence of LD (Garner and Slatkin 2003). Lorenz et al. used simulated phenotype data to show that the use of SNP-based haplotypes can increase power over the use of single-SNP markers in GWAS. Using haplotypes for QTL mapping could compensate for the bi-allelic limitation of SNPs and substantially improve the efficiency of QTL mapping (Yang et al.). According to Song et al., highly selfing species, such as soybean, are in many ways uniquely suitable for haplotype block mapping. Therefore, the aim of this study was to identify haplotypes across multiple environments that contribute to time to flowering and maturity in tropical soybean, with a view to improving the selection of desired alleles for these traits with minimal impact on yield.

Material and Methods

Plant material and field evaluation

The association panel of this study consisted of 141 cultivars of tropical soybean (Supplemental Table 1), which were field-evaluated in five locations that represent eight environments of Brazil: Cascavel (24°52′54.9″S 53°32′30.4″W) in the growing seasons 2012/2013, 2013/2014 and 2014/2015 (Cas12/13, Cas13/14 and Cas14/15, respectively); Palotina (24°21′06.5″S 53°45′24.9″W) in the growing season 2014/2015 (Pal14/15); Primavera do Leste (15°34′37.6″S 54°20′41.8″W) in the growing season 2012/2013 (Pri12/13); Rio Verde (17°45′49.0″S 51°01′49.3″W) in the growing seasons 2013/2014 and 2014/2015 (Rio13/14 and Rio14/15, respectively); and Sorriso (12°32′43.6″S 55°42′41.8″W) in the growing season 2014/2015 (Sorr14/15). These locations were chosen on the basis of their diversity of latitude and altitude. Field trials were arranged in a complete block design with two replications. Fertilizer and field management practices recommended for optimum soybean production were applied according to Embrapa (2011).

Phenotypic data analysis

Seed yield (SY), 100-seed weight (SW), plant height (PH), number of days to flowering (DTF) and number of days to maturity (DTM) were measured in the 141 soybean cultivars across the eight environments. DTF was measured by counting the days from emergence to flowering, recorded when approximately 50% of the plants in a plot had at least one open flower (R1), and DTM was measured by counting the days from planting to the date when plants had 95% of their pods dry (R8 on the scale of Fehr and Caviness 1977). Field data were analyzed on the basis of the following mixed linear model:

$$y_{ijk} = \mu + g_i + l_j + (gl)_{ij} + b_{k(j)} + e_{ijk}$$

where $y_{ijk}$ is the phenotypic observation of the ith genotype in the kth block of the jth environment, $\mu$ is the total mean, $g_i$ is the genetic effect of the ith genotype, $l_j$ is the effect of the jth environment, $(gl)_{ij}$ is the interaction effect between the ith genotype and the jth environment (G × E), $b_{k(j)}$ is the random block effect within the jth environment, and $e_{ijk}$ is a random error following $N(0, \sigma_e^2)$. Adjusted entry means (AEM) were calculated for each of the 141 entries (ith genotype: $g_i$) with the LSMEANS option of the MIXED procedure, and these were used as the dependent variable in the subsequent association analysis. AEM (denoted as M) was

$$M_i = \hat{\mu} + \hat{g}_i$$

where $\hat{\mu}$ and $\hat{g}_i$ are the generalized least-squares estimates of $\mu$ and $g_i$, respectively. To estimate AEM for all cultivars at each of the eight environments, g was regarded as fixed and b as random, as proposed by Stich et al. The Restricted Likelihood Ratio Test (RLRT) was calculated to confirm the heterogeneity of residual variance (across environments) using the GLIMMIX procedure in SAS, according to the following:

$$\mathrm{RLRT} = -2\left[\ln L(M_{CV}) - \ln L(M_{HV})\right]$$

where $M_{HV}$ and $M_{CV}$ are the models with heterogeneous and common (homogeneous) variances, respectively, and $L(\cdot)$ is the restricted likelihood of the corresponding model. The asymptotic distribution of the RLRT statistic is Chi-square with p degrees of freedom ($\chi^2_p$), where p is the difference in the number of parameters included in the $M_{HV}$ and $M_{CV}$ models (in this case p = 7).
Consequently, error variances were assumed to be heterogeneous among locations, and these were computed using the COVTEST homogeneity option, with the RANDOM _residual_ statement and GROUP option in the GLIMMIX procedure (Mora et al.). Analysis of Deviance (ANODEV) was conducted to evaluate the significance of the effects on the five traits across environments by using the MIXED procedure in SAS (Nelder and Wedderburn 1972). The CORR procedure was used to analyze Pearson correlations among variables by environment, since G × E interactions were significant. Broad-sense heritability ($h^2$) for the five traits at each environment was estimated as the proportion of genetic variance ($\sigma_g^2$) over the total variance ($\sigma_p^2$), according to the formula:

$$h^2 = \frac{\sigma_g^2}{\sigma_p^2}$$
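
The residual-variance heterogeneity test described above can be illustrated with a minimal Python sketch. It assumes the standard likelihood-ratio form, minus two times the difference between the restricted log-likelihoods of the common-variance and heterogeneous-variance models, compared against a Chi-square distribution with 7 degrees of freedom. The log-likelihood values and the function names `rlrt` and `chi2_sf` are hypothetical and are not taken from the study, where the test was run in SAS.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a Chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: normal tail plus a finite series in odd powers of sqrt(x)
    term, total = math.sqrt(x), 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(math.sqrt(half)) + math.sqrt(2.0 / math.pi) * math.exp(-half) * total


def rlrt(loglik_common, loglik_hetero):
    """Likelihood-ratio statistic: common-variance vs heterogeneous-variance model."""
    return -2.0 * (loglik_common - loglik_hetero)


# Hypothetical restricted log-likelihoods (illustration only, not study results)
ll_common, ll_hetero = -4120.5, -4098.2
stat = rlrt(ll_common, ll_hetero)
p_value = chi2_sf(stat, df=7)  # eight environments: 8 - 1 = 7 extra variance parameters
print(f"RLRT = {stat:.1f}, P = {p_value:.2e}")
```

A small P-value in such a comparison would support modelling a separate residual variance for each environment.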

Association panel, SNP genotyping and population structure

Cultivars were genotyped for 6,000 single nucleotide polymorphisms (SNPs) using the Illumina BARCSoySNP6K BeadChip, corresponding to a subset of SNPs from the SoySNP50K BeadChip (Song et al.). Genotyping was conducted by Deoxi Biotechnology Ltda.® in Araçatuba, São Paulo, Brazil. After filtering, a total of 3,780 polymorphic, non-redundant SNP markers with a minor allele frequency (MAF) greater than 10% and less than 25% missing data were retained for subsequent analysis, with heterozygous calls treated as missing data. Haplotype blocks were constructed using the Solid Spine method implemented in the software Haploview (Barrett et al.), and have been previously reported by Contreras-Soto et al. (Supplemental Table 2). This method requires that the first and last markers in a block are in strong LD with all intermediate markers, thereby providing more robust block boundaries. A cutoff of 1% was used, meaning that if the addition of a SNP to a block resulted in a recombinant allele at a frequency exceeding 1%, then that SNP was not included in the block. These LD blocks were then used to conduct the haplotype-based GWAS. A Bayesian model-based method, implemented in the program InStruct (Gao et al.), was used to infer population structure from the 3,780 SNPs. Posterior probabilities were estimated using five independent runs of the Markov Chain Monte Carlo (MCMC) sampling algorithm for numbers of genetically differentiated groups (k) varying from 2 to 10, without prior population information. The MCMC chains were run for a burn-in of 5,000 iterations, followed by 50,000 iterations. The convergence of the log likelihood was assessed with the Gelman-Rubin statistic. The best estimate of k was determined according to the lowest value of the average log(Likelihood) and Deviance Information Criterion (DIC) values among the simulated groups (Gao et al.), as defined by Spiegelhalter et al.:

$$\mathrm{DIC} = \bar{D} + p_D$$

where $\bar{D}$ is a Bayesian measure of model fit that is defined as the posterior expectation of the deviance, $\bar{D} = E_{\theta|y}[-2 \ln f(y|\theta)]$, and $p_D$ is the effective number of parameters, which measures the complexity of the model.
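
As an illustration of how the DIC is used to choose k, the following Python sketch computes the criterion from MCMC deviance samples. It assumes the usual definition of the effective number of parameters, the mean deviance minus the deviance evaluated at the posterior mean of the parameters, which the text above does not spell out. The deviance values and the names `dic` and `runs` are hypothetical; InStruct reports these quantities itself.

```python
from statistics import mean


def dic(deviance_samples, deviance_at_posterior_mean):
    """Return (DIC, pD) from post-burn-in MCMC deviance samples."""
    d_bar = mean(deviance_samples)            # posterior expectation of the deviance
    p_d = d_bar - deviance_at_posterior_mean  # effective number of parameters (assumed definition)
    return d_bar + p_d, p_d


# Hypothetical deviance summaries for three candidate numbers of groups (k)
runs = {
    2: ([5210.0, 5214.0, 5212.0], 5190.0),
    3: ([5105.0, 5101.0, 5103.0], 5070.0),
    4: ([5098.0, 5102.0, 5100.0], 5040.0),
}

dic_by_k = {k: dic(samples, d_hat)[0] for k, (samples, d_hat) in runs.items()}
best_k = min(dic_by_k, key=dic_by_k.get)  # lowest DIC is preferred
print(dic_by_k, "best k:", best_k)
```

In this toy example the most complex model has the lowest mean deviance but is penalised by its larger effective number of parameters, so an intermediate k is selected.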

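The marker filtering described in this section (MAF greater than 10%, less than 25% missing data, heterozygous calls treated as missing) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: genotype calls are two-letter strings such as "AA" or "AG", absent calls are `None`, and heterozygous calls are converted to missing before the missing-data rate is computed. The function name `filter_snps`, the marker names and the toy data are hypothetical and do not reproduce the actual genotyping pipeline.

```python
from collections import Counter


def filter_snps(calls_by_marker, min_maf=0.10, max_missing=0.25):
    """Return marker names passing the MAF and missing-data thresholds."""
    kept = []
    for marker, calls in calls_by_marker.items():
        # Keep homozygous calls only; heterozygous and absent calls count as missing.
        alleles = [c[0] for c in calls if c is not None and len(c) == 2 and c[0] == c[1]]
        missing_rate = 1.0 - len(alleles) / len(calls)
        if missing_rate >= max_missing:
            continue
        counts = Counter(alleles)
        if len(counts) < 2:
            continue  # monomorphic marker
        maf = min(counts.values()) / len(alleles)
        if maf > min_maf:
            kept.append(marker)
    return kept


# Toy panel of ten cultivars (hypothetical calls)
demo = {
    "snp_ok": ["AA"] * 6 + ["GG"] * 4,                       # MAF 0.40, no missing data
    "snp_rare": ["CC"] * 9 + ["TT"],                          # MAF 0.10, not above threshold
    "snp_missing": ["AA"] * 4 + ["GG"] * 3 + ["AG", None, None],  # 30% missing
    "snp_mono": ["TT"] * 10,                                  # monomorphic
}
print(filter_snps(demo))
```

Because soybean cultivars are highly inbred, allele frequencies here are estimated from homozygous calls only, one allele per cultivar.
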
SNP-based GWAS

The AEM of each cultivar were used to perform SNP-based and haplotype-based GWAS for SY, PH, SW, DTF and DTM. To account for the effects of population structure and genetic relatedness among the cultivars, the following unified mixed model of association (Cappa et al., Yu et al.) was employed (in matrix form):

$$\mathbf{y} = \mathbf{S}\boldsymbol{\alpha} + \mathbf{Q}\mathbf{v} + \mathbf{Z}\mathbf{u} + \boldsymbol{\varepsilon}$$

where y is a vector of adjusted phenotypic observations; α is a vector of SNP effects (fixed); v is a vector of population structure effects (fixed); u is a vector of polygene background effects (random); and ε is a vector of residual effects. S, Q and Z are incidence matrices for α, v and u, respectively. According to Yu et al., the variances of u and ε are $\mathrm{Var}(\mathbf{u}) = 2\mathbf{K}V_g$ and $\mathrm{Var}(\boldsymbol{\varepsilon}) = \mathbf{R}V_R$, respectively, where $V_g$ and $V_R$ are the genetic and residual variances, and K and R are the kinship and residual variance matrices, respectively. This is a structured association model (Q model), which considers the genetic structure of the association panel included in the association mixed model. The kinship coefficient matrix (K), which describes the most likely identity by state of each allele between cultivars, was estimated using the program TASSEL (Bradbury et al., Endelman and Jannink 2012). Mixed linear models (MLM) including Q or K alone, as well as both (Q + K), were also run in TASSEL (Bradbury et al., Yu et al.). The Bayesian information criterion (BIC) (Schwarz 1978) was used for model selection, which is defined as:

$$\mathrm{BIC} = -2 \ln L + p \ln(n)$$

where L is the restricted maximum likelihood for a given model, p is the number of parameters to be estimated in the model, and n is the sample size. BIC values were computed using the TASSEL program following Yu et al.
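
The BIC-based comparison of the Q, K and Q + K models can be sketched as below. The code applies the Schwarz (1978) criterion, minus two times the log-likelihood plus the number of parameters times the natural log of the sample size, to hypothetical log-likelihoods and parameter counts. The numbers and the names `bic` and `models` are assumptions for illustration only and are not values reported by TASSEL for this panel.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion; lower values indicate a better trade-off."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


n_cultivars = 141  # panel size used in the study

# Hypothetical (log-likelihood, number of parameters) for each candidate model
models = {
    "Q": (-520.0, 11),
    "K": (-512.0, 3),
    "Q+K": (-480.0, 12),
}

bic_by_model = {name: bic(ll, p, n_cultivars) for name, (ll, p) in models.items()}
best_model = min(bic_by_model, key=bic_by_model.get)
for name, value in sorted(bic_by_model.items(), key=lambda item: item[1]):
    print(f"{name}: BIC = {value:.2f}")
print("selected:", best_model)
```

With these toy inputs the Q + K model is selected because its gain in likelihood outweighs the penalty for its additional parameters.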

Haplotype-based GWAS

Haplotype-based GWAS was performed on the basis of LD information. Haplotype-based association mapping was performed by using adjusted phenotypes (y) as the dependent traits and the information of haplotype blocks in the model, as follows:

$$\mathbf{y} = \mathbf{1}\mu + \mathbf{H}\boldsymbol{\beta} + \mathbf{u} + \mathbf{e}$$

where 1 is a vector of n ones, with n representing the number of soybean cultivars; μ is the overall mean; H is the incidence matrix of haplotype genotypes for the individuals at a given haplotype locus, and β is the corresponding vector of haplotype effects. The element $H_{ij}$ of H is equal to the number of copies of the i-th haplotype carried by the j-th cultivar. For this analysis, u represents the polygenic effect, with covariance proportional to the kinship matrix (K), and e represents the residual effects. A logarithm of odds (LOD) score higher than 3 was used as the significance threshold for haplotype-trait associations according to Hwang et al. Then, only the significant haplotypes were used to estimate the phenotypic variance explained by haplotypes. The percentage of variation explained by the haplotype-based method was calculated using a simple regression performed in TASSEL as follows:

$$R^2_{LR} = 1 - \exp\left[-\frac{2}{n}\left(\log L_M - \log L_0\right)\right]$$

where LR is the likelihood ratio; n represents the number of observations (i.e., number of soybean cultivars); and $\log L_M$ and $\log L_0$ are the log-likelihood functions of the reduced and the intercept-only models, respectively (Sun et al.). The Chi-square test was performed to check phenotypic differences among haplotypes within blocks using the CONTRAST option of the GENMOD procedure in SAS (SAS Institute, Inc., Cary, NC). Additionally, the genomic regions or SNPs in haplotype blocks identified in this study were compared to the genomic locations of QTLs previously reported for the traits under study. Genes, QTLs and markers annotated in Glyma1.01 and NCBI RefSeq gene models in SoyBase (www.soybase.org) were used as references.
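
The likelihood-ratio-based measure of explained variation attributed above to Sun et al. can be sketched in Python. The function assumes the form one minus the exponential of minus 2/n times the difference between the log-likelihood of the fitted model and that of the intercept-only model. The log-likelihood values and the function name `r2_likelihood_ratio` are hypothetical and serve only to show the arithmetic.

```python
import math


def r2_likelihood_ratio(loglik_model, loglik_null, n_obs):
    """Likelihood-ratio-based R^2 for a fitted model against an intercept-only model."""
    return 1.0 - math.exp(-(2.0 / n_obs) * (loglik_model - loglik_null))


n_cultivars = 141  # panel size used in the study

# Hypothetical log-likelihoods for a haplotype model and the intercept-only model
loglik_haplotype = -480.0
loglik_intercept = -500.0

r2 = r2_likelihood_ratio(loglik_haplotype, loglik_intercept, n_cultivars)
print(f"R2_LR = {100 * r2:.1f}%")
```

The measure equals zero when the haplotype model fits no better than the intercept-only model and approaches one as the gain in log-likelihood grows relative to n.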

Results

Phenotypic analysis, heritability and correlation between traits

Analysis of deviance indicated that the effects of genotype (G), environment (E) and their interaction (G × E) were statistically significant (χ² test, P < 0.01) for all traits under study (Supplemental Table 3). Highly significant differences were observed among traits and environments (Supplemental Figs. 1–5). On average, PH ranged from 38.27 cm (Rio13/14) to 103.45 cm (Cas12/13). Mean SY and SW ranged from 670.23 kg ha⁻¹ (Rio13/14) to 3319.00 kg ha⁻¹ (Cas14/15) and from 11.96 g (Rio13/14) to 15.50 g (Rio14/15), respectively. As expected, DTF and DTM varied widely, ranging from 30 (Pri12/13) to 47 (Cas13/14) days, and from 88 (Pri12/13) to 133 (Cas14/15) days, respectively (Table 1). The high phenotypic variability was confirmed by the analysis of deviance, which revealed that all traits were strongly influenced by environmental factors, showing significant G × E interaction (Supplemental Table 3). Over the eight environments, SY was moderately heritable with a value of 56.7%, whereas SW, DTM, PH and DTF showed high heritabilities: 81.7%, 91.7%, 93.4% and 94.6%, respectively.

Table 1

Descriptive statistics of phenotypic variation, heritability (h2) across environments and variance components (G and G × E) of seed yield (SY), seed weight (SW), plant height (PH), days to maturity (DTM) and flowering (DTF) of 141 cultivars of soybean evaluated in eight environments

Trait	Environment	Mean	SD	Min	Max	G	G × E	h² (%)
SY (kg ha⁻¹)	Cas12/13	2457.59	820.92	806.00	6563.00	75068	351055	56.7
	Pri12/13	1910.82	767.17	233.00	4372.00
	Cas13/14	1863.71	623.59	125.00	5127.00
	Rio13/14	670.23	305.47	128.00	1780.00
	Cas14/15	3319.00	1297.25	176.00	7149.00
	Pal14/15	1442.93	667.79	299.00	3669.00
	Rio14/15	1559.18	814.61	136.00	4284.00
	Sorr14/15	1775.69	800.04	152.00	4916.00
	Mean

SW (g per 100 seeds)	Cas12/13	12.08	2.31	7.90	25.50	1.50	2.05	81.7
	Pri12/13	12.59	1.99	9.00	25.80
	Cas13/14	13.49	2.26	7.90	25.00
	Rio13/14	11.96	1.69	8.20	23.90
	Cas14/15	12.33	3.15	6.30	19.40
	Pal14/15	12.30	1.92	7.60	18.40
	Rio14/15	15.50	1.93	10.30	21.00
	Sorr14/15	14.78	1.96	10.10	21.80
	Mean

PH (cm)	Cas12/13	103.45	19.89	55.00	220.00	209.30	101.83	93.4
	Pri12/13	48.36	11.98	20.00	90.00
	Cas13/14	97.59	19.54	45.00	205.00
	Rio13/14	38.27	11.59	20.00	75.00
	Cas14/15	90.34	24.45	30.00	180.00
	Pal14/15	74.25	23.04	30.00	130.00
	Rio14/15	46.17	13.72	20.00	95.00
	Sorr14/15	55.52	18.50	23.00	100.00
	Mean

DTF (days)	Cas12/13	46.16	10.38	28.00	80.00	44.63	18.42	94.6
	Pri12/13	30.29	5.90	24.00	52.00
	Cas13/14	47.75	9.41	29.00	82.00
	Rio13/14	40.49	7.31	28.00	77.00
	Cas14/15	46.58	10.85	26.00	76.00
	Pal14/15	46.76	6.89	32.00	70.00
	Rio14/15	37.39	7.26	24.00	54.00
	Sorr14/15	31.42	6.09	25.00	46.00
	Mean

DTM (days)	Cas12/13	126.33	15.39	104.00	256.00	82.00	53.57	91.7
	Pri12/13	88.83	9.84	40.00	172.00
	Cas13/14	124.89	15.66	97.00	248.00
	Rio13/14	99.02	13.71	82.00	182.00
	Cas14/15	133.89	10.59	106.00	164.00
	Pal14/15	119.59	6.48	106.00	138.00
	Rio14/15	104.98	9.38	80.00	123.00
	Sorr14/15	98.27	5.56	75.00	123.00
	Mean

G × E = Genotype × Environment interaction.

G = Genotype.

Analysis of phenotypic correlation was conducted by environment since residual heterogeneity was observed among the environments and the G × E interaction was significant for all traits. In most of the environments, significant and positive phenotypic correlations were observed between SY and SW, with correlation coefficients ranging from 0.15 (Pri12/13; P-value < 10⁻²) to 0.58 (Cas14/15; P-value < 10⁻⁴); no correlation between SY and SW was found in Pal14/15 and Sorr14/15. SY and SW showed different patterns of phenotypic correlation with DTF and DTM across environments, and the same was observed for SY and SW with PH. In most of the environments, PH and SW showed negative correlations, although these were non-significant at the 0.05 level. In contrast, PH, DTF and DTM were positively correlated, from weakly to strongly, and the correlations were statistically different from zero, with coefficients ranging from 0.13 for PH and DTF at Rio14/15 (P-value < 10⁻²) to 0.84 for DTM and DTF at Rio13/14 (P-value < 10⁻⁴) (Supplemental Table 4).

Genome-wide association across environments and traits

According to the deviance information criterion (from the posterior Bayesian clustering analysis), the most probable number of subpopulations was nine (Supplemental Fig. 6). The results based on the Bayesian information criterion (BIC) consistently showed a better fit for the Q + K model over either Q or K alone (Supplemental Table 5). In total, 33, 29, 57, 72 and 40 linkage disequilibrium blocks were significantly associated with SY, SW, PH, DTM and DTF, respectively (Tables 2–6, Supplemental Tables 6–10). The haplotype blocks explained considerable phenotypic variation: 17.6% to 96.8%, 13.6% to 33.2%, 45.2% to 99.4%, 12.7% to 59.9% and 12.9% to 42.7% for SY, SW, PH, DTM and DTF, respectively (Tables 2–6).

Table 2

Haplotype blocks associated with seed yield in 141 cultivars of tropical soybean

Env	Chr	Start (bp)	End (bp)	SN	Hap_ID	HapA	HF	SY (kg ha⁻¹)	R² (%)	Nearby genes or QTLs
Cas13/14	9	38523430	38906660	3	Gm9_Hap22a	CCC	29	2126.2a	21.3	DNA-binding protein RHL1-like
					Gm9_Hap22b	CTC	3	1861.8ab
					Gm9_Hap22c	TTC	61	1761.7b
					Gm9_Hap22d	TTT	20	1717.9b

Cas14/15	12	5622210	6052289	4	Gm12_Hap12a	TAAC	55	3509.0a	41.4	uncharacterized LOC102667945
					Gm12_Hap12b	TAAT	37	3354.1a
					Gm12_Hap12c	CGGT	28	2323.0b

Pal14/15	19	44965128	45370594	6	Gm19_Hap42a	AATxAA	34	1815.1a	96.8	Beta-fructofuranosidase insoluble isoenzyme 1-like
					Gm19_Hap42b	GCCGGG	88	1219.3a
					Gm19_Hap42c	ACCGGG	2	374.2b
					Gm19_Hap42d	AATGAA	–	–

Pal14/15	10	3962673	4360182	6	Gm10_Hap8a	TATxTA	16	1999.6a	17.6	uncharacterized LOC100499780
					Gm10_Hap8b	CCGCTA	8	1522.3b
					Gm10_Hap8c	CCGCCG	30	1516.2bc
					Gm10_Hap8d	TCTxTA	34	1227.9bcd
					Gm10_Hap8e	CCGCCA	22	1162.2bcd
					Gm10_Hap8f	TCTCTA	–	–
					Gm10_Hap8g	TATCTA	–	–
					Gm10_Hap8h	TCGCTA	3	–

Pal14/15	19	45478438	45643073	3	Gm19_Hap43a	ATA	31	1848.2a	50.6	Intergenic
					Gm19_Hap43b	ACG	2	1113.0b
					Gm19_Hap43c	GCG	89	1112.9b
					Gm19_Hap43d	GTA	2	–

Pal14/15	11	4462645	4806173	5	Gm11_Hap11a	CCxAA	31	1699.9a	45.6	Probable 125 kDa kinesin-related protein-like
					Gm11_Hap11b	TATCA	6	1548.8ab
					Gm11_Hap11c	CCTAC	21	1144.4bc
					Gm11_Hap11d	TATAA	10	1060.8bc

Sorr14/15	5	5621714	5794460	3	Gm5_Hap7a	CAC	18	2027.8a	18.7	uncharacterized LOC100818074
					Gm5_Hap7b	CGT	21	1956.5a
					Gm5_Hap7c	TAT	78	1743.5a

Sorr14/15	15	5621714	5794460	3	Gm15_Hap11a	TCC	7	2024.1a	30.1	uncharacterized LOC100785341
					Gm15_Hap11b	CCC	92	1898.9ab
					Gm15_Hap11c	TTA	27	1510.8b

Env: Environment; Chr: Chromosome; SN: Number of SNPs per haplotype block; Hap_ID: Haplotype ID; HapA: Haplotype alleles; HF: Haplotype frequency; SY: mean seed yield (kg ha⁻¹) of each haplotype in the given environment.