Justin N Vaughn, Randall L Nelson, Qijian Song, Perry B Cregan, Zenglu Li.
Abstract
Soybean oil and meal are major contributors to world-wide food production. Consequently, the genetic basis for soybean seed composition has been intensely studied using family-based mapping. Population-based mapping approaches, in the form of genome-wide association (GWA) scans, have been able to resolve loci controlling moderately complex quantitative traits (QTL) in numerous crop species. Yet, it is still unclear how soybean's unique population history will affect GWA scans. Using one of the populations in this study, we simulated phenotypes resulting from a range of genetic architectures. We found that with a heritability of 0.5, ∼100% and ∼33% of the 4 and 20 simulated QTL can be recovered, respectively, with a false-positive rate of less than ∼6×10⁻⁵ per marker tested. Additionally, we demonstrated that combining information from multi-locus mixed models and compressed linear-mixed models improves QTL identification and interpretation. We applied these insights to exploring seed composition in soybean, refining the linkage group I (chromosome 20) protein QTL and identifying additional oil QTL that may allow some decoupling of highly correlated oil and protein phenotypes. Because the value of protein meal is closely related to its essential amino acid profile, we attempted to identify QTL underlying methionine, threonine, cysteine, and lysine content. Multiple QTL were found that have not been observed in family-based mapping studies, and each trait exhibited associations across multiple populations. Chromosomes 1 and 8 contain strong candidate alleles for essential amino acid increases. Overall, we present these and additional data that will be useful in determining breeding strategies for the continued improvement of soybean's nutrient portfolio.
Keywords: QTL; amino acid; genome-wide association; oil; protein; soybean population structure
Year: 2014 PMID: 25246241 PMCID: PMC4232554 DOI: 10.1534/g3.114.013433
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154
Figure 2. Attributes of populations used in GWA scans for protein and oil. Genotypes were clustered based on genetic distance. Each genotype used in the study represents a leaf in the dendrogram at the top of each panel. Country of origin and maturity group (“MG”) are color-coded. Protein and oil are represented as a heat map, with red being the highest value within that population and green being the lowest. Values in parentheses indicate the number of lines within a given category. (A) MS-2000 population. (B) IL-1966 population. Note that color-coding can be different for the same category in different populations.
Simulation results using MS-2000 population
| #QTL | 4 | 4 | 4 | 4 | 20 | 20 | 20 | 20 | 200 | 200 |
|---|---|---|---|---|---|---|---|---|---|---|
| Effect Distr. | Linear | Linear | Uniform | Uniform | Linear | Linear | Uniform | Uniform | Linear | Linear |
| H² | 0.95 | 0.5 | 0.95 | 0.5 | 0.95 | 0.5 | 0.95 | 0.5 | 0.95 | 0.5 |
| False (−) (total) | 0.1 | 0.35 (0.25) | 0 (0) | 0.1 (0) | 0.5 (0.41) | 0.69 (0.63) | 0.3 (0.19) | 0.68 (0.53) | 0.96 (0.93) | 0.99 (0.99) |
| False (−) (top 1/4) | 0 (0) | 0 (0) | NA | NA | 0.04 (0.04) | 0.24 (0.12) | NA | NA | 0.87 (0.78) | 0.98 (0.97) |
| False (−) (top 3/4) | 0 (0) | 0.13 (0.13) | NA | NA | 0.33 (0.23) | 0.59 (0.51) | NA | NA | 0.94 (0.91) | 0.99 (0.99) |
| False (+) | 2.0E−4 (3.5E−4) | 4.2E−5 (4.2E−5) | 3.5E−5 (4.9E−5) | 4.9E−5 (6.3E−5) | 3.6E−4 (8.5E−4) | 8.4E−5 (1.0E−4) | 2.9E−3 (4.8E−3) | 8.4E−5 (2.0E−4) | 2.8E−5 (1.9E−4) | 1.4E−5 (4.2E−5) |
Each value is the mean of five separate replicates under the given combination of variables.
“Top 1/4” indicates that only the top quartile of loci with the strongest effects were evaluated in terms of type II errors. Similarly, “Top 3/4” refers to the top 3 quartiles. Because these categories cannot apply to uniform effect distributions, applicable cells are given “NA” values.
p-value threshold < 10⁻⁵.
Parenthetical values for p-value threshold < 10⁻⁴.
False (−) indicates the fraction of true positives that were missed; false (+) indicates the fraction of tests that identified untrue associations.
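The simulation design and the two error rates defined above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, the "linear" effect distribution is assumed to mean effects that fall off in equal steps, genotypes are assumed to be coded 0/1 for inbred lines, and a simulated QTL is counted as recovered only when that exact marker is significant (the study may well have accepted nearby markers).

```python
import random
import statistics


def linear_effects(n_qtl):
    """Effects decreasing in equal steps from 1 to 1/n_qtl (assumed meaning of 'linear')."""
    return [(n_qtl - i) / n_qtl for i in range(n_qtl)]


def uniform_effects(n_qtl):
    """Every simulated QTL has the same effect size."""
    return [1.0] * n_qtl


def simulate_phenotype(genotypes, qtl_idx, effects, h2, rng):
    """Return one phenotype per line so that var(genetic) / var(total) is about h2.

    genotypes: list of lines, each a list of 0/1 allele codes per marker.
    qtl_idx:   marker indices chosen as simulated QTL.
    h2:        target heritability, 0 < h2 <= 1.
    """
    genetic = [sum(e * line[j] for j, e in zip(qtl_idx, effects)) for line in genotypes]
    var_g = statistics.pvariance(genetic)
    # Choose the noise variance so that var_g / (var_g + var_e) equals h2.
    sd_e = (var_g * (1 - h2) / h2) ** 0.5
    return [g + rng.gauss(0, sd_e) for g in genetic]


def error_rates(true_qtl, significant, n_tests):
    """False (-): fraction of simulated QTL missed.
    False (+): fraction of all tests that flagged a marker that is not a simulated QTL."""
    true_qtl, significant = set(true_qtl), set(significant)
    false_neg = len(true_qtl - significant) / len(true_qtl)
    false_pos = len(significant - true_qtl) / n_tests
    return false_neg, false_pos
```

In a replicate, `significant` would hold the marker indices passing the chosen p-value threshold, and the table entries would correspond to the mean of `error_rates` over five such replicates.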
GWA scan results for protein and oil
| MS-1996 (728) | | | MS-2000 (934) | | | IL-1964 (619) | | | IL-1966 (977) | | | Huang | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SNP | -log(p) | Allelic Effect Estimate | SNP | -log(p) | Allelic Effect Estimate | SNP | -log(p) | Allelic Effect Estimate | SNP | -log(p) | Allelic Effect Estimate | SNP | -log(p) | Allelic Effect Estimate |
| % Protein (43.45 [43.50]) | ||||||||||||||
| | 4.85 (4.69) | 0.6 | | 10.63 (10.43) | 1.38 | 8_44632488 | 4.4 (4.37) | 0.44 | 13_24858209 | 4.75 (4.12) | 0.39 | | 5.99 (5.98) | 1.49 |
| 17_2678979 | 4.03 (4.61) | 0.63 | 4.94 (4.11) | 0.66 | 11_37932701 | 4.68 (5.08) | 0.41 | |||||||
| 15_3919945 | 4.35 (4.24) | 0.28 | ||||||||||||
Population used; value in parentheses is the number of genotypes used.
Value from MLMM; parenthetical value from CMLM.
Mean values for combined populations are in parentheses with median values in brackets.
Bold font indicates that the marker (or a marker within 4 Mbp) was associated with the trait in two or more environment–population datasets.
Italic font indicates that the marker (or a marker within 4 Mbp) was also associated with another trait in the study.
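The replication rule behind the bold markers (a hit in one dataset with a hit on the same chromosome within a fixed distance in another dataset) can be expressed as a short check. The sketch below is an illustration under assumptions: the function names and the toy marker names in the usage note are hypothetical, marker names are taken to be in the reduced `chromosome_position` form, and "within" is read as an inclusive distance in base pairs.

```python
def parse_marker(name):
    """Split a reduced marker name such as '8_8462762' into (chromosome, position)."""
    chrom, pos = name.split("_")
    return int(chrom), int(pos)


def replicated_markers(hits_by_dataset, window_bp):
    """Return markers that have a hit within window_bp on the same chromosome
    in at least one other dataset.

    hits_by_dataset: dict mapping a dataset label to a list of marker names.
    """
    replicated = set()
    for dataset, hits in hits_by_dataset.items():
        for name in hits:
            chrom, pos = parse_marker(name)
            for other, other_hits in hits_by_dataset.items():
                if other == dataset:
                    continue  # a marker cannot replicate itself within one dataset
                parsed = (parse_marker(h) for h in other_hits)
                if any(c == chrom and abs(p - pos) <= window_bp for c, p in parsed):
                    replicated.add(name)
                    break
    return replicated
```

With toy input `{"A": ["1_1000000", "5_200000"], "B": ["1_3500000"]}`, a 4 Mbp window flags both chromosome 1 markers, while a 200 Kbp window flags none.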
GWA scan results for selected essential amino-acid profiles
| IL-1996 (900) | | | MS-1997 (978) | | |
|---|---|---|---|---|---|
| SNP | -log(p) | Allelic Effect Estimate | SNP | -log(p) | Allelic Effect Estimate |
| Cysteine (1.47 [1.50]) | |||||
| | 5.39 (5.01) | 0.06 | | 12.33 (12.06) | 0.06 |
| 6_18690983 | 9.87 (8.6) | 0.04 | |||
| 6_17674401 | 4.33 (5.01) | 0.03 | |||
Associated Manhattan plots are given in Figure S1 and Figure S2.
Population used; value in parentheses is the number of genotypes used.
Value from MLMM; parenthetical value from CMLM.
Mean values for combined populations are in parentheses, with median values in brackets (% protein by dry weight).
Bold font indicates that the marker (or a marker within 200 Kbp) was associated with the trait in two or more environment–population datasets.
Italic font indicates that the marker (or a marker within 200 Kbp) was also associated with another trait in the study.
For simplicity, marker names are reduced to their chromosome position form, e.g., BARC1.01Gm08_8462762 appears as 8_8462762.
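The name reduction described above is a simple string transformation. The sketch below is one way to write it, assuming every full marker name follows the `BARC1.01Gm<chromosome>_<position>` pattern seen in the example; the function name is hypothetical, and names that do not match the pattern are returned unchanged.

```python
import re

# Full names look like BARC1.01Gm08_8462762: a fixed prefix, a chromosome
# number (possibly zero-padded), an underscore, and a physical position.
_MARKER = re.compile(r"^BARC1\.01Gm(\d+)_(\d+)$")


def short_marker_name(name):
    """Reduce a full marker name to its 'chromosome_position' form."""
    match = _MARKER.match(name)
    if match is None:
        return name  # leave unrecognized names untouched
    chrom, pos = match.groups()
    return f"{int(chrom)}_{pos}"  # int() drops the zero padding


print(short_marker_name("BARC1.01Gm08_8462762"))  # 8_8462762
```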
Figure 1. Protein and oil phenotypic variation and covariance within populations. (A) All lines across all populations for which protein and oil were measured. (B and C) Lines for a specific population assayed in a particular environment. IL, Illinois; MS, Mississippi. Year of growth is given adjacent to location and maturity group is given below the location and date. In all graphs, percent dry-weight protein and oil are plotted on the x-axis and y-axis, respectively.
Figure 3. Significance scores of simulated genetic architectures in the MS-2000 soybean population using CMLM and MLMM methods. Each marker is plotted with its -log(p-val) on the y-axis and physical position is plotted on the x-axis. Chromosomes are indicated by alternating black and gray coloration and are plotted in order, 1 through 20. Magenta markers indicate the polymorphisms associated with a simulated effect. A significance threshold of p-value < 10⁻⁵ is indicated by a dotted line. (A) Four QTL with uniform effect sizes and a heritability of 0.5. (B) Twenty QTL sampled from linear effect sizes with a heritability of 0.5. Rank of the allelic effect is given above the marker, with “1” being the largest effect. (C) Two hundred QTL, a linear effect distribution, and a heritability of 0.95. Only MLMM method is shown.
Figure 4. Percent protein and oil GWA scan. For population MS-2000 (A) and IL-1966 (B), each marker is plotted with its -log(p-val), as assessed using the CMLM method, on the y-axis and its physical position is plotted on the x-axis. Orange color indicates markers also identified by the MLMM method; their discovery order and -log(p-val) are also indicated in adjacent orange font. Chromosomes are indicated by alternating black and gray and are plotted in order, 1 through 20. A significance threshold of p-value < 10⁻⁵ is indicated by a dotted line. (C) Using MS-2000 population, LD plot for the region around the protein/oil QTL identified in this study and others. The total physical distance shown is ∼8 Mbp. R² and D′ measures of LD are given above and below the diagonal, respectively; both values range from 0 to 1. Brackets indicate the physical range previously found to associate with protein and oil content. The bar graph to the right of the LD plot is scaled approximately to the physical position along the LD plot, as indicated, and plots the Tajima’s D metric for sliding windows 20 markers wide with an overlap of 10 markers. Both monomorphic and polymorphic markers were included in the Tajima’s D calculation, whereas only polymorphic (MAF > 0.05) sites are shown in the LD plot.
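The windowed Tajima's D in panel C can be illustrated with the standard formula applied to per-marker allele counts. This is a sketch under assumptions rather than the authors' computation: the function names are hypothetical, each marker is treated as a biallelic site scored in all `n` lines with no missing data, and no correction is made for the ascertainment of array SNPs. A window of 20 markers with an overlap of 10 corresponds to a step of 10 markers.

```python
import math


def sliding_windows(n_markers, width=20, overlap=10):
    """Return (start, end) index pairs for windows of `width` markers
    that share `overlap` markers with the previous window."""
    step = width - overlap
    return [(s, s + width) for s in range(0, n_markers - width + 1, step)]


def tajimas_d(allele_counts, n):
    """Tajima's D from the count of one allele at each marker among n lines.

    Monomorphic markers (count 0 or n) may be included; they add nothing to
    either diversity estimate. Returns None when D is undefined.
    """
    seg = [k for k in allele_counts if 0 < k < n]
    s = len(seg)  # number of segregating sites
    if s == 0:
        return None
    # Mean pairwise differences, summed over sites.
    pi = sum(k * (n - k) for k in seg) / (n * (n - 1) / 2)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * s + e2 * s * (s - 1)
    if variance <= 0:
        return None  # happens for very small n
    return (pi - s / a1) / math.sqrt(variance)
```

Given a list `counts` ordered by physical position, the per-window values would be `[tajimas_d(counts[a:b], n) for a, b in sliding_windows(len(counts))]`. An excess of rare alleles in a window pushes D negative, and an excess of intermediate-frequency alleles pushes it positive.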
Figure 5. Allele distribution as it relates to population structure and protein levels. Genotypes were clustered based on genetic distance across all markers (not just those depicted here). Each genotype used in the study represents a leaf in the dendrogram at the top of each panel. Percent protein is represented as a heat map, with red being the highest value within a population and green being the lowest. For a given marker, a genotype is color-coded according to the nucleotide for which it is homozygous; heterozygotes are shown as brown. Marker names are abbreviated to exclude "BARC1.01." (A) MS-2000 population. (B) IL-1966 population. (C) Population created by enriching BARC1.01Gm20_31610452-C to a frequency of 0.5, regardless of the environment in which a line was phenotyped.
Figure 6. Attributes of populations used in GWA scans for seed quality traits. Genotypes were clustered based on genetic distance. Each genotype used in the study represents a leaf in the dendrogram at the top of each panel. Country of origin and maturity group (“MG”) are color-coded. Traits are represented as a heat map, with red being the highest value within a population and green being the lowest. Values in parentheses indicate the number of lines within a given category. (A) MS-1997 population. (B) IL-1996 population. Note that color-coding can be different for the same category in different populations.