Shiyu Chen1, Shawn M Kaeppler1,2, Kenneth P Vogel3,4, Michael D Casler2,5. 1. Department of Agronomy, University of Wisconsin-Madison, Madison, Wisconsin, United States of America. 2. Department of Energy, Great Lakes Bioenergy Research Center, Madison, Wisconsin, United States of America. 3. USDA-ARS, Grain, Forage, and Bioenergy Research Unit, Lincoln, Nebraska, United States of America. 4. Department of Agronomy & Horticulture, University of Nebraska, Lincoln, Nebraska, United States of America. 5. USDA-ARS, U.S. Dairy Forage Research Center, Madison, Wisconsin, United States of America.
Abstract
Switchgrass is undergoing development as a dedicated cellulosic bioenergy crop. Fermentation of lignocellulosic biomass to ethanol in a bioenergy system or to volatile fatty acids in a livestock production system is strongly and negatively influenced by lignification of cell walls. This study detects specific loci that exhibit selection signatures across switchgrass breeding populations that differ in in vitro dry matter digestibility (IVDMD), ethanol yield, and lignin concentration. Allele frequency changes in candidate genes were used to detect loci under selection. Out of the 183 polymorphisms identified in the four candidate genes, twenty-five loci in the intron regions and four loci in coding regions were found to display a selection signature. All loci in the coding regions are synonymous substitutions. Selection in both directions was observed on polymorphisms that appeared to be under selection. Genetic diversity and linkage disequilibrium within the candidate genes were low. The recurrent divergent selection caused an excess of moderate allele frequencies in the cycle 3 reduced lignin population as compared to the base population. This study provides valuable insight into genetic changes occurring during short-term selection in polyploid populations, and identifies potential markers for breeding switchgrass with improved biomass quality.
Over the last decade, biomass energy consumption has increased more than 60%, driven by biofuel production, mainly in the form of bioethanol [1]. Switchgrass-based ethanol production contributes to energy diversification and environmental sustainability [2]. Ethanol production from switchgrass biomass produces 540% more renewable energy than nonrenewable energy consumed during the production process, while reducing greenhouse-gas emissions by 94% compared to gasoline [3]. However, due to the hydrophobicity of lignin and the cross-linking between lignin and hemicellulose in the cell walls, pretreatments are required to facilitate the enzymatic hydrolysis of cellulose and hemicellulose, increasing the cost and complexity of bioethanol production from cellulosic biomass [4].
Recent approaches to improving switchgrass biomass quality have focused on engineering genes involved in the lignin biosynthesis pathway. Switchgrass plants with down-regulated caffeic acid o-methyltransferase (COMT) evaluated in the field had biomass with 10 to 14% reduced lignin concentration, 34% greater sugar release and 28% higher ethanol yield compared to control plants [5]. Despite these results, there are administrative challenges to commercializing transgenic switchgrass due to the deregulation process [6]. Switchgrass pollen retains its viability for up to 60 min, 100 min in rare cases, and may travel up to 3.5 km under mild wind conditions [7]. As a native grass species with less than 1% self-compatibility, the existence of viable pollen over large distances will result in migration of transgenes into native grasslands [8]. Autoexcision was investigated as a solution for preventing transgene flow, resulting in reduction of transgene flow by about 22–24% [9]. Traditional plant breeding for improved biomass quality represents an alternative approach to reduce recalcitrance of switchgrass biomass [10, 11]. 
Switchgrass populations divergently selected for in vitro dry matter digestibility (IVDMD) in a livestock production system showed a strong genetic correlation between IVDMD and ethanol yield of r = 0.84 [12]. This strong and positive genetic correlation indicates that the genetic basis underlying improvements in IVDMD could point to opportunities to improve ethanol yield from switchgrass biomass.
Forward genetic screening for causal alleles underlying phenotypic variation in natural populations can be carried out using high-resolution single nucleotide polymorphisms (SNPs) [13]. Different methodologies have been applied depending on the populations under investigation. Allele segregation patterns were used to indicate causal markers in crossing populations, while the association between genetic variance and phenotypic variance was used in linkage disequilibrium mapping. Detection of allele frequency (AF) changes has been implemented in studying adaptively or artificially divergent populations [14-17]. Considering the large sample size needed to account for the high density of genetic variation in natural populations, bulking the extremely divergent samples could drastically reduce the genotyping cost, and has been exploited successfully to detect SNPs associated with phenotype divergence [18, 19].
Numerous gene members in the monolignol biosynthesis pathway can affect lignin concentrations and forage digestibility in various species. COMT, which catalyzes methylation in the monolignol pathway [20, 21], is critical for the formation of two of the three monolignol units, guaiacyl (G) and syringyl (S). Down-regulation of the COMT2 gene in switchgrass reduced lignin concentration, S/G ratio and recalcitrance to the fermentation process [22], indicating its methylation function in switchgrass lignin biosynthesis. 
Cinnamyl alcohol dehydrogenase (CAD) catalyzes the reduction of hydroxycinnamyl aldehydes into their corresponding alcohols in the last steps of the monolignol pathway [23]. The class I CAD, also known as bona fide CAD, is hypothesized to be correlated with the origin of lignin, based on evidence from the phylogenetic distance of CAD genes and the lack of lignin in the earliest plants, which have no bona fide CAD genes [24]. The CAD2 gene found in switchgrass is close to ZmCAD2 in maize and OsCAD2 in rice, which were also classified as bona fide CAD [24]. Genetic engineering studies in alfalfa, rice, and forage grasses also demonstrated the influence of various monolignol genes on lignin concentration and degradation [25-27]. Genetic engineering of hydroxycinnamoyl-CoA shikimate/quinate hydroxycinnamoyl transferase (HCT) in alfalfa and of COMT and CAD genes in tall fescue resulted in disrupted lignin biosynthesis. In switchgrass, transgenic plants were generated by downregulating monolignol genes encoding COMT [22], CAD [28, 29], and 4-coumarate: CoA ligase 1 (4CL1) [30], each showing a significant decrease in lignin concentration and an increase in ethanol production efficiency. The extensive forward and reverse genetic studies on these monolignol genes made them good candidates for investigating the genetic mechanisms underlying switchgrass breeding populations divergent in lignin concentration.
The switchgrass populations used in this study were generated by divergent breeding for decreased and increased in vitro dry matter digestibility (IVDMD) at the USDA-ARS grass breeding project at the University of Nebraska-Lincoln, Nebraska [12]. The IVDMD test simulates the digestion of forages or biomass in ruminants. The five divergent populations in this study were generated from the selection process, including one base population (C0), the population of one selection cycle for low IVDMD (C-1), and the populations of three selection cycles for high IVDMD (C+1 to C+3) [12]. 
Lignin concentration, IVDMD and ethanol yield across the divergent populations changed substantially due to selection (S1 Fig). While IVDMD increased by 9.6% from population C-1 to C+3, acid detergent lignin (ADL) decreased by 17%, and ethanol yield increased by 12.7% [12]. The recurrent selection cycles resulted in significant differences and consistent rankings of IVDMD, ethanol yield and ADL values across the selection cycles [12, 31].
To identify polymorphisms responsible for the distinct differences in IVDMD, lignin concentration and ethanol yield in the breeding populations, candidate genes in the monolignol biosynthesis pathway were investigated in this study. We describe an approach for detecting selection signatures in divergently selected switchgrass populations by AF changes. The identification of polymorphisms under selection provides insight into the genetic basis of recurrent selection in switchgrass, and potential SNP markers to facilitate marker-assisted selection.
Materials and Methods
Plant materials
A random sample of five generations of divergent selection for IVDMD (populations C-1 through C+3) was space-transplanted in the field in May 2006. The populations are described as NE Trailblazer C-1, NE Trailblazer C0, Trailblazer, NE Trailblazer C2, and NE Trailblazer C3 in the official release notification by USDA-ARS. For our purposes, Trailblazer is noted as C+1 and the other four populations are noted according to their cycle number from the original population (C-1, C0, C+2, and C+3). The sign refers to the direction of selection for IVDMD and the number refers to the number of selection cycles or recombination events. The breeding generation evaluation nursery was established in 2006 with a randomized complete block design in Lincoln, Nebraska [12]. Within each of the six blocks, ten individual genotypes from each breeding generation were planted in a plot. Leaf samples were collected from each individual plant, freeze-dried, and sent to Madison, Wisconsin in 2010.
Gene sequencing and genetic diversity
The dried leaf samples of the five populations were pooled by population, with 0.002 g per individual. DNA was extracted from each pool using the protocol described in [32]. Candidate genes COMT1, COMT2, CAD2, and 4CL1 were amplified from the genomic DNA using the NCBI cDNA sequences and the primers shown in S1 Table. A high-fidelity polymerase and a minimum number of PCR cycles were used in genomic amplification to reduce PCR errors. Due to the high heterozygosity levels in switchgrass, the amplicons were cloned and sequenced individually by Sanger sequencing. Middle primers were designed to sequence each amplicon as a haplotype read. A sample of sequences was obtained from each population pool for each of the four candidate genes (S2 Table). About 190 reads were sequenced from the C0 population pool to increase the precision of the initial AF estimate in the AF tests.
To control the sequence quality, a preliminary AF was calculated for each polymorphic site as the proportion of the minor polymorphism among the total reads across all five populations. Polymorphic sites with preliminary AFs lower than 0.05 were discarded. If a site had more than one minor allele and one of the preliminary AFs was greater than 0.05, the site was retained, but the haplotypes carrying the rare polymorphisms at this site were discarded. The AF per site per population was then calculated for use in the statistical tests for selection signature.
Nucleotide diversity and haplotype diversity were calculated using DnaSP [33, 34]. Nucleotide diversity (π) was calculated as the average number of nucleotide differences per site between two sequences [35]. Pairwise linkage disequilibrium (LD) was estimated as r² for both SNPs and InDels in R 3.2.0 [36]. LD values were fitted with nonlinear models against the distances between pairs of SNPs. The half-decay distance of LD is the distance at which the predicted LD is half of its maximum value. 
The genetic diversity was also estimated within each population.
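As a rough illustration of the two statistics defined above, the following sketch computes π as the average number of pairwise differences per site and r² between two biallelic sites from a set of aligned reads. It is not the DnaSP or R code used in the study; the function names and the toy reads are hypothetical, and each cloned read is assumed to represent one haplotype of equal length.

```python
from itertools import combinations

def nucleotide_diversity(haplotypes):
    """Average number of differences per site over all pairs of aligned reads."""
    length = len(haplotypes[0])
    pairs = list(combinations(haplotypes, 2))
    diffs = sum(sum(a != b for a, b in zip(h1, h2)) for h1, h2 in pairs)
    return diffs / (len(pairs) * length)

def ld_r2(haplotypes, i, j):
    """r^2 between biallelic sites i and j, estimated from haplotype reads."""
    n = len(haplotypes)
    a = haplotypes[0][i]  # allele tracked at site i (arbitrary choice)
    b = haplotypes[0][j]  # allele tracked at site j (arbitrary choice)
    p_a = sum(h[i] == a for h in haplotypes) / n
    p_b = sum(h[j] == b for h in haplotypes) / n
    p_ab = sum(h[i] == a and h[j] == b for h in haplotypes) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return d * d / denom if denom > 0 else float("nan")

# Hypothetical toy reads, for illustration only.
reads = ["ACGT", "ACGA", "TCGA", "TCGT"]
pi = nucleotide_diversity(reads)      # 8 differences / (6 pairs * 4 sites)
r2_independent = ld_r2(reads, 0, 3)   # sites 0 and 3 are uncorrelated here
r2_complete = ld_r2(["AT", "AT", "GC", "GC"], 0, 1)
```

Monomorphic sites give an undefined r², which the sketch reports as NaN rather than zero.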
Statistical tests for selection signature
The sequence data set was separated into the five divergent populations (C0, C-1, C+1 to C+3), and the AFs were calculated within each population. The allele frequencies in the C0 population are the initial frequencies. Only the loci common to all five populations were kept for the following statistical tests.
The demographic scheme was simulated 10,000 times with genetic drift only to build the null distributions for the statistical tests [37]. In each simulation, a base C0 population was generated using the initial AF observed in the real data. Each individual had a single diallelic polymorphic locus. Each locus was assigned eight alleles, which were randomly separated into two subgenomes during intercrossing, and followed a tetrasomic inheritance pattern within the subgenomes [38]. Individuals were randomly selected from the C0 population and produced progeny by polycrossing. The experimental error in estimating population AFs came from multiple sources at different levels, for example, sampling error during population pooling, PCR amplification and clone picking. To account for these variations as much as possible, the sampling process was included in the simulation to calculate the simulated AF in each population. After simulation of each population, sixty individuals were randomly chosen to form an allele pool, and a number of alleles were randomly drawn according to the real number of reads obtained for each population pool. The simulated AF was calculated using this allele sample.
In the AF change test, thresholds were determined using the null distribution unique to each initial AF. The p-values were calculated as the proportion of simulated AFs exceeding the observed AF among the 10,000 simulations. The polymorphisms significant in the AF change test were then processed through a second, independent statistical test: linear regression of AF on the cycle numbers. 
The slopes of the linear regression for each locus were compared to slopes generated from the simulated distribution. The p-values of both tests were adjusted to control the false discovery rate [39]. Selection signatures were considered significant only for loci that passed both tests with p-values < 0.05.
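The drift-only simulation described above can be sketched as follows. This is a simplified illustration, not the authors' code: the number of parents, the progeny population size, and the number of reads are hypothetical placeholders (the actual demographic scheme is given in S2 Fig), and only 200 replicates are run instead of 10,000. The sketch keeps the elements stated in the text: eight allele copies per locus split over two subgenomes, tetrasomic inheritance within each subgenome, random choice of parents, pooling of sixty plants, and random drawing of reads.

```python
import random

def make_individual(p, rng):
    """Octoploid at one locus: 2 subgenomes x 4 allele copies (1 = minor allele)."""
    return [[1 if rng.random() < p else 0 for _ in range(4)] for _ in range(2)]

def gamete(ind, rng):
    """Tetrasomic inheritance within each subgenome: transmit 2 of the 4 copies."""
    return [rng.sample(sub, 2) for sub in ind]

def cross(mother, father, rng):
    """Combine one gamete from each parent, subgenome by subgenome."""
    gm, gf = gamete(mother, rng), gamete(father, rng)
    return [gm[s] + gf[s] for s in range(2)]

def next_cycle(pop, n_parents, n_progeny, rng):
    """Drift only: parents are chosen at random and then polycrossed."""
    parents = rng.sample(pop, n_parents)
    return [cross(*rng.sample(parents, 2), rng) for _ in range(n_progeny)]

def sampled_af(pop, n_reads, rng, n_pooled=60):
    """Pool sixty plants, then draw reads at random from the pooled alleles."""
    pooled = rng.sample(pop, n_pooled)
    alleles = [a for ind in pooled for sub in ind for a in sub]
    return sum(rng.choice(alleles) for _ in range(n_reads)) / n_reads

def simulate_af_change(p0, rng, n_parents=20, n_progeny=120, n_reads=80):
    """AF(C+3) - AF(C-1) under drift; population sizes are assumed values."""
    c0 = [make_individual(p0, rng) for _ in range(n_progeny)]
    low = next_cycle(c0, n_parents, n_progeny, rng)   # one cycle: C-1
    high = c0
    for _ in range(3):                                # three cycles: C+1..C+3
        high = next_cycle(high, n_parents, n_progeny, rng)
    return sampled_af(high, n_reads, rng) - sampled_af(low, n_reads, rng)

rng = random.Random(2016)
null_changes = [simulate_af_change(0.15, rng) for _ in range(200)]
```

Because the read-sampling step is part of each replicate, the resulting null distribution reflects both drift and the error of estimating AF from a finite number of cloned reads.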
Results
Gene sequencing and SNP discovery
Two members of the COMT family (COMT1 and COMT2), together with CAD2 and 4CL1, were sequenced using pooled genomic DNA from each population. The cDNA sequences of COMT2, CAD2, and 4CL1 in switchgrass were obtained from NCBI (accessions: HQ645965.1, GU045612.1, EU491511.1) [22, 30, 40]. Multiple members of the COMT gene family were found by an expression study in maize [41]. The coding sequence of the COMT1 gene in switchgrass was identified by querying the most commonly expressed COMT member in maize by BLAST against the NCBI EST database (accessions: FL749574.1, FL749575.1). These coding sequences were used for primer design to amplify genomic sequences of the four genes. Measures including designing primers for specificity, using a high-fidelity polymerase, and gel excision of the PCR amplicons were taken to ensure amplification of the intended members of the gene families in switchgrass.
Gene structures of the resulting genomic sequences were inferred by comparing them to homologs from maize and sorghum (Fig 1). Thirteen exon regions and nine intron regions were sequenced. A small region of the 3’ UTR was sequenced in 4CL1. The SNP markers were discovered by aligning the sequences from all five populations. To control errors from PCR amplification and sequencing, polymorphic sites with preliminary AF less than 0.05 were discarded. As a result of the quality control, all the polymorphic sites were biallelic. The NCBI accession numbers of the aligned sequences are KY004561-KY004928 for COMT1, KY004196-KY004560 for COMT2, KY005440-KY005851 for CAD2 and KY004929-KY005439 for 4CL1.
Fig 1
Hypothesized gene structures of the sequenced COMT1, COMT2, CAD2, and 4CL1 in switchgrass.
The numbers of SNPs and InDels among all five populations are summarized in Table 1. A total of 183 SNPs and InDels were identified. The number of SNPs per 100 bp ranged from 0.83 for 4CL1 to 1.73 for COMT1. SNPs in the coding regions were found for all four genes, of which 17 are synonymous and 7 are nonsynonymous. No InDel was found in the coding regions. The intron regions have a total of 159 polymorphisms, of which 101 sites are SNPs and 58 sites are InDels. No polymorphic sites were found in the 3’ UTR of the 4CL1 gene.
Table 1
Total number of polymorphisms for the four candidate genes in switchgrass divergent populations.
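| Gene | Measure | Whole gene | Non-coding regions | Coding regions | Synonymous | Non-synonymous |
|---|---|---|---|---|---|---|
| COMT1 | Sequence length (bp) | 1851 | 978 | 873 | NA | NA |
| COMT1 | # of SNP sites | 32 | 29 | 3 | 3 | 0 |
| COMT1 | # of Insert/Delete sites | 34 | 34 | 0 | 0 | 0 |
| COMT1 | SNP sites per 100 bp | 1.73 | 2.97 | 0.34 | NA | NA |
| COMT2 | Sequence length (bp) | 1580 | 683 | 897 | NA | NA |
| COMT2 | # of SNP sites | 27 | 19 | 8 | 6 | 2 |
| COMT2 | # of Insert/Delete sites | 3 | 3 | 0 | 0 | 0 |
| COMT2 | SNP sites per 100 bp | 1.71 | 2.78 | 0.89 | NA | NA |
| CAD2 | Sequence length (bp) | 2789 | 2165 | 624 | NA | NA |
| CAD2 | # of SNP sites | 39 | 31 | 8 | 6 | 2 |
| CAD2 | # of Insert/Delete sites | 11 | 11 | 0 | 0 | 0 |
| CAD2 | SNP sites per 100 bp | 1.40 | 1.43 | 1.28 | NA | NA |
| 4CL1 | Sequence length (bp) | 3255 | 2271 | 984 | NA | NA |
| 4CL1 | # of SNP sites | 27 | 22 | 5 | 2 | 3 |
| 4CL1 | # of Insert/Delete sites | 10 | 10 | 0 | 0 | 0 |
| 4CL1 | SNP sites per 100 bp | 0.83 | 0.97 | 0.51 | NA | NA |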
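The SNP densities in Table 1 follow directly from the SNP counts and sequence lengths. The short sketch below recomputes the whole-gene values as a consistency check; the dictionary and function names are illustrative only.

```python
# (number of SNP sites, sequence length in bp) from the whole-gene column of Table 1
WHOLE_GENE = {
    "COMT1": (32, 1851),
    "COMT2": (27, 1580),
    "CAD2": (39, 2789),
    "4CL1": (27, 3255),
}

def snps_per_100bp(n_snps, length_bp):
    """SNP density expressed per 100 bp, rounded to two decimals."""
    return round(100 * n_snps / length_bp, 2)

densities = {gene: snps_per_100bp(*values) for gene, values in WHOLE_GENE.items()}
total_snps = sum(n for n, _ in WHOLE_GENE.values())
```

Adding the 58 InDel sites to the SNP total recovers the 183 polymorphisms reported for the four genes.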
Genetic diversity and linkage disequilibrium of polymorphisms in four candidate genes
Nucleotide diversity estimated for each gene ranged from 0.0027 to 0.0060 (Table 2). 4CL1 had the lowest overall diversity among the four genes. The diversity of synonymous sites is only slightly lower than the overall gene diversity. The ratios of diversity at synonymous sites to diversity at nonsynonymous sites were 7.8, 6.0 and 1.4 for CAD2, COMT2, and 4CL1, respectively. Haplotype diversity and LD were analyzed for each gene across all populations. A considerable number of haplotypes was found for each gene, from 47 haplotypes in COMT2 to 100 in 4CL1, increasing with the length of the gene sequence. The ranking of the genes by haplotype diversity differed from their ranking by number of haplotypes. 4CL1 had the highest number of haplotypes and a medium level of haplotype diversity (0.80). In contrast, COMT2 had the lowest number of haplotypes and the highest haplotype diversity (0.93). Many of the haplotypes were represented by only one read each. The common haplotypes for each gene are those represented by more than 5% of the reads. The number of common haplotypes was drastically lower, ranging from 4 to 8 for the four genes. As expected, LD decayed rapidly along the genes. The overall means of pairwise LD (r²) were lower than 0.4. LD decayed to half of its maximum within only several hundred base pairs for all of the genes (Fig 2). The mosaic patterns in the LD heatmaps (Fig 3) showed very short LD blocks at the candidate genes in the octoploid switchgrass populations.
Table 2
Genetic diversity and LD in each of the four candidate genes.
Different nucleotide diversity was estimated using SNPs within the whole gene, π, nonsynonymous SNP sites, π(nonsyn), synonymous SNP sites, π(syn), and the silent SNP sites including both synonymous and non-coding sites, π(s). The results of haplotype and LD analysis include number of haplotypes (H), haplotype diversities (Hd), the number of haplotypes with proportions higher than 0.05 (H>0.05), mean of pairwise LD (LD mean) and the half LD decay distance (LD decay).
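| Gene | π | π(nonsyn) | π(syn) | π(s) | H | Hd | H (>0.05) | LD mean | LD decay (bp) |
|---|---|---|---|---|---|---|---|---|---|
| COMT1 | 0.0043 | 0.0000 | 0.0039 | 0.0067 | 53 | 0.741 | 5 | 0.334 | 375 |
| COMT2 | 0.0060 | 0.0011 | 0.0068 | 0.0096 | 47 | 0.927 | 8 | 0.224 | 188 |
| CAD2 | 0.0043 | 0.0014 | 0.0109 | 0.00486 | 87 | 0.920 | 5 | 0.196 | 204 |
| 4CL1 | 0.0027 | 0.0015 | 0.0021 | 0.0031 | 100 | 0.803 | 4 | 0.258 | 609 |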
Fig 2
The scatter plots of pairwise LD on the distances between the polymorphic sites.
The red line shows the LD predicted by the fitted nonlinear model.
Fig 3
The heatmaps of pairwise LD of the polymorphisms in the four candidate genes.
The blue stars indicate the significant loci under selection.
The phylogenetic tree of the haplotypes indicated that, despite the number of haplotypes discovered, the haplotypes within each gene showed no significant branching. The substituted amino acids were analyzed in SIFT for the impact of substitution on protein function [42]. No significant impact on protein function was predicted for any of the non-synonymous loci.
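The haplotype summaries reported in Table 2 (H, Hd, and the number of haplotypes above 5%) can be illustrated with the sketch below. It assumes Nei's sample-size-corrected estimator, Hd = n/(n-1) × (1 - Σp²), where p is the frequency of each haplotype among n reads; the values in the study were obtained with DnaSP, and the function name and toy reads here are hypothetical.

```python
from collections import Counter

def haplotype_summary(reads, common_threshold=0.05):
    """Return (number of haplotypes, haplotype diversity, number of common haplotypes)."""
    n = len(reads)
    counts = Counter(reads)
    freqs = [c / n for c in counts.values()]
    # Assumed estimator: Nei's haplotype diversity with the n/(n-1) correction.
    hd = n / (n - 1) * (1 - sum(f * f for f in freqs))
    n_common = sum(f > common_threshold for f in freqs)
    return len(counts), hd, n_common

# Hypothetical toy data: 20 reads of one haplotype and a single read of another.
h, hd, h_common = haplotype_summary(["AAA"] * 20 + ["AAT"])
```

In this toy example the singleton haplotype has a frequency of 1/21, below the 5% threshold, so it counts toward H but not toward the common haplotypes, which mirrors how many single-read haplotypes can coexist with only a handful of common ones.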
Allele frequencies in the extreme cycles for four candidate genes
To investigate the association of polymorphisms in the four genes with selection for IVDMD, allele frequencies and AF changes between the most extreme populations were analyzed. Minor allele frequencies of the SNPs/InDels were calculated within each population pool. The AF changes were calculated as the AF in the C+3 population (high IVDMD) minus the AF in the C-1 population (low IVDMD).
During short-term selection, two processes can change AF across the genome: genetic drift, which causes random AF fluctuation, and selection, which produces directional frequency changes. Loci under constant selection would likely show larger AF changes than loci that have undergone only genetic drift, if the selection intensity is high or the trait under selection is highly heritable. A demographic scheme was simulated (S2 Fig) to reflect the population size changes from generation to generation in the IVDMD breeding project, except that randomly chosen individuals passed their alleles to the next generation. This is the genetic drift effect that would occur at neutral loci during the breeding process. A distribution of the AF changes at a given locus was obtained by repeating the simulation of the demographic process 10,000 times. This distribution provided the null distribution for the hypothesis that there is no selection effect at a locus, only genetic drift. Therefore, loci with AF changes exceeding the thresholds defined by the simulated distribution were determined to be under selection (Fig 4). The significance levels were calculated in one-tailed statistical tests as the ratio of the number of simulations with larger (or smaller) AF changes than the observed AF change to the total of 10,000 simulations.
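The one-tailed empirical p-value and the subsequent false discovery rate adjustment can be sketched as below. This is an illustrative reading of the procedure, not the authors' code: the function names are hypothetical, ties are counted as "at least as extreme", and the adjustment shown is the standard Benjamini-Hochberg step-up procedure named in the figure legends.

```python
def empirical_p(observed, null_changes):
    """One-tailed p-value: share of simulated changes at least as extreme
    as the observed change, in the direction of the observed change."""
    if observed >= 0:
        extreme = sum(x >= observed for x in null_changes)
    else:
        extreme = sum(x <= observed for x in null_changes)
    return extreme / len(null_changes)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda k: pvalues[k])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical values, for illustration only.
p_up = empirical_p(0.3, [0.0, 0.1, 0.2, 0.3, 0.4])
p_down = empirical_p(-0.2, [-0.3, -0.1, 0.0, 0.1])
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

With 10,000 simulations the smallest non-zero p-value is 0.0001, which sets the resolution of the test before adjustment.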
Fig 4
Distribution of simulated allele frequency change between C-1 and C+3 for an initial allele frequency in C0 of 0.15 at locus 246 of COMT1 gene.
The red arrow indicates the observed change in allele frequency between cycles C-1 and C+3. The lines indicate the Benjamini-Hochberg-adjusted confidence intervals of allele frequency change with one-tailed test with α = 0.05.
In total, 36 SNPs and InDels were found significant for AF changes after adjusting p-values to control FDR (Fig 5). None of the nonsynonymous sites were found significant. Out of the 36 polymorphisms, 25 were located in the intron regions, and 3 were synonymous polymorphisms in the exon regions. Ranges of AF changes for all the observed SNPs/InDels were: -0.11 to 0.48 for COMT1, -0.11 to 0.15 for COMT2, -0.15 to 0.12 for CAD2, and -0.26 to 0.18 for 4CL1. The COMT1 gene had the widest range of AF change among the four genes, followed by 4CL1. Due to the AF changes at the significant loci, rare alleles became frequent or common alleles by the end of the selection cycles, and vice versa. Even though CAD2 and COMT2 had comparatively medium levels of genetic diversity, they had fewer loci with significant AF changes.
Fig 5
Changes in allele frequency between divergent breeding populations C-1 and C+3 for COMT1, COMT2, CAD2 and 4CL1 in switchgrass.
The data points are observed allele frequency changes plotted against the initial allele frequencies in C0. The dotted lines indicate the Benjamini-Hochberg-adjusted confidence intervals (CI) (α = 0.05) of allele frequency changes using the 10,000-simulation data. Data points inside the CI are attributed to drift, while those outside the CI (shown in red) are deemed candidates for selection.
Linear regression of allele frequencies against selection cycles
The SNPs/InDels significant in the AF change test were analyzed by regression of allele frequencies on selection cycles. Slopes of the linear regression in the observed data were compared with those calculated from the simulated data to determine p-values. Eighty percent of the significant polymorphisms from the AF change test were also significant in the regression test. As a result, 29 SNPs and InDels passed both tests as final significant polymorphisms associated with recurrent selection. All 7 polymorphisms that did not pass the regression coefficient test were from the CAD2 gene. Significant loci were detected only in COMT1 and 4CL1.
The fit of the linear regression (r²) and the slopes (b) were plotted against their physical positions in each gene (Fig 6). The b values of the significant loci clustered together in the same way as the LD blocks in COMT1 and 4CL1. For the significant polymorphisms, the average of the absolute b values was 0.058 in COMT1 and 0.040 in 4CL1. The sign of b is the direction of AF change of the minor allele at a significant locus, with a positive sign indicating that the minor allele increased in frequency over cycles of selection for higher IVDMD. All of the significant loci of COMT1 have positive b values, indicating that the minor alleles at these loci have positive effects on IVDMD and a negative impact on lignin concentration. In 4CL1, the significant loci had b values of mixed signs, within the range of -0.063 to 0.046. The loci with positive b in 4CL1 were interspersed with loci with negative b, which corresponded to the pairwise LD patterns in 4CL1 (Fig 3). The synonymous SNPs in the coding regions have b values of 0.085 and 0.048 in COMT1 and 0.035 and 0.042 in 4CL1. The linear regressions for the synonymous SNPs have goodness-of-fit values ranging from 0.21 in 4CL1 to 0.96 in COMT1.
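The slope b and the fit r² reported for each locus come from an ordinary least-squares regression of allele frequency on cycle number. A minimal sketch is given below, assuming the five populations are coded as cycles -1, 0, 1, 2, and 3; the function name and the allele frequencies in the example are hypothetical and are not taken from the data.

```python
def regress_af_on_cycle(cycles, afs):
    """Least-squares slope (AF change per cycle) and r^2 of AF on cycle number."""
    n = len(cycles)
    mean_x = sum(cycles) / n
    mean_y = sum(afs) / n
    sxx = sum((x - mean_x) ** 2 for x in cycles)
    syy = sum((y - mean_y) ** 2 for y in afs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(cycles, afs))
    slope = sxy / sxx
    r2 = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return slope, r2

# Assumed coding of populations C-1, C0, C+1, C+2, C+3.
cycles = [-1, 0, 1, 2, 3]
# Hypothetical minor allele frequencies rising steadily with selection for IVDMD.
b, r2 = regress_af_on_cycle(cycles, [0.10, 0.15, 0.20, 0.25, 0.30])
```

Applying the same function to each drift-only simulation would give the null distribution of slopes against which an observed b is compared.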
Fig 6
Slope (change in allele frequency per cycle of selection) and fit of the linear regressions (r²) for the polymorphisms with significant allele frequency change across the selection cycles.
Plus signs represent polymorphisms with P≤0.05, and open circles represent polymorphisms with P>0.05.
Allele frequency, genetic diversity and haplotypes change across the selection cycles
Enriched intermediate-frequency alleles and a slight increase in genetic diversity were observed, as expected for positive selection on standing variation over a short term [43]. The AF spectra of all 183 polymorphic loci in populations C0 and C+3 were plotted as histograms (Fig 7). The C0 population has an enriched number of loci with low-frequency alleles and very low counts of loci with allele frequencies higher than 0.3. After three cycles of selection, the allele frequency distribution shifted distinctly, resulting in an increased number of alleles with intermediate frequencies ranging from 0.2 to 0.5 and a decreased number of alleles in the range of 0 to 0.2. Genetic diversity and haplotype diversity of each gene were calculated for each selection cycle. Within COMT1 and 4CL1, where significant polymorphisms were found, the nucleotide diversity (π) increased in both directions after one cycle of divergent selection, and continued to increase as the selection cycles increased (S3 Table). In the COMT2 and CAD2 genes, no clear trend was seen. The increase in moderate AFs also coincided with the increase of π in COMT1 and 4CL1.
Fig 7
Histograms of allele frequency for all 183 polymorphisms that underwent statistical tests in the C0 and C+3 populations.
Discussion
Switchgrass produces a high yield of lignocellulosic biomass, especially with recent advances in breeding for increased biomass production [44]. Reducing recalcitrance of switchgrass biomass to fermentation has been a long-term research objective toward improving the economics and sustainability of livestock production [45]. Parallels between ruminant livestock fermentation and biomass fermentation for bioethanol suggest similar mechanisms for biomass recalcitrance [12, 46]. The divergent populations generated from recurrent selection for IVDMD [31] provided powerful tools to identify the polymorphisms under selection and the candidate polymorphisms associated with lignin concentration and ethanol yield. The existence of discrete selection cycles and the availability of genotypes from all the intermediate cycles facilitated detection of selection signatures using both an allele divergence test and a linear regression test.
Multiple polymorphisms in the candidate genes were found to be under selection for IVDMD. Artificial selection has been known to affect a number of genomic regions for traits such as ear number [16], seed size [17], and disease resistance [47] in maize long-term breeding populations. Multiple genes involved in the monolignol pathway were also found to be associated with digestibility traits in maize breeding lines [48]. Similar results were observed in other association studies in maize [48-53], sorghum [54], alfalfa [55] and perennial ryegrass [56-58]. The larger b values in COMT1 suggested that larger phenotypic changes were associated with selection on these polymorphisms than on the polymorphisms in 4CL1 [59].
Complex traits like IVDMD are controlled by multiple loci with small effects [49]. 
The anatomic study in the divergent genotypes of these breeding populations showed reduced lignification, fewer cortical sclerenchyma in the stem tissues and more parenchyma cells in some vascular bundles, which indicated that besides lignin biosynthesis, other pathways affecting cell development could also be selected while breeding for divergent IVDMD [60, 61]. In this study, we chose to investigate four candidate genes in three functionally characterized gene families in switchgrass [22, 28, 29, 30]. None of the non-synonymous polymorphisms within the sequenced candidate genes was significant, suggesting that the significant polymorphisms could be involved in trans-regulation, or the causal genes could be in LD with COMT1 and 4CL1. Genome-wide molecular markers are needed to gain a complete picture of genetic controls of IVDMD in these short-term breeding populations.Different monolignol genes in these divergent switchgrass populations showed low to medium nucleotide diversities. Nucleotide diversity of COMT genes in this study fell within the similar range as the estimations in maize (Zea mays) and alfalfa (Medicago sativa) [55, 62, 63]. The 4CL1 gene had lower diversity level than that in the maize inbred lines. The nucleotide diversity of resistance genes in diverse switchgrass populations ranged from 0.0051 to 0.072, slightly higher than the estimations in this study [64]. Different regions of the genome and the population origins could contribute to the relatively low nucleotide diversity [65].The patterns of genetic variation in these candidate genes depicted the complexity of the octoploid switchgrass genomes. Majority of the significant loci have initial allele frequencies in the low range (<0.2) except for two loci in the COMT1 gene. This could be explained by that genetic diversity for low lignin and high IVDMD traits are not necessary for surviving in the wild habitat, sometimes might even be defective [66]. 
Before selection, the alleles beneficial for bioethanol production could arise by mutation and preserved in the genome by many different haplotypes each at a relatively low frequency resulting in a relatively low nucleotide diversity. The selection force accumulated the beneficial alleles and the haplotypes that harbor these alleles, which resulted in low LD within a gene. It is interesting to note that 4CL1 gene have mixed signs of b, and lower LD while the COMT1 gene has much more defined and longer LD blocks. Giving the unstable nature of chromosome pairing and segregating in polyploid switchgrass, recombination could be indirectly selected during the breeding process, even within a short term [67, 68], which could explain that both directions of selection were observed in 4CL1 gene. The genetic patterns revealed in the breeding populations suggested the need of developing a comprehensive selection criteria or germplasm pool to maintain the overall performance of the breeding populations especially for long term selection.The significant polymorphisms discovered in this study are potential candidates for QTL underlying biomass quality in switchgrass, and provided possible markers for marker-assisted selection [69, 70]. Depending on the number of QTL, heritability of the traits and genomic models, genomic selection could also increase the favorable allele frequencies of QTL at various rates [71]. The number of significant polymorphisms suggested that the individual loci underlying recalcitrance to biomass conversion had small effects, which was also observed in the maize cell wall component traits [72]. However, the low LD in the switchgrass promises the potential of genetic gain under the appropriate selection scheme. To effectively improve biomass quality in switchgrass, breeding projects could benefit greatly from the marker-assisted selection by increasing the favorable alleles of the QTL recurrently.
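The meaning of the sign of b can be illustrated with a minimal sketch, assuming that b is the ordinary least-squares slope of allele frequency regressed on selection cycle, with the populations C-1, C0, C+1, C+2 and C+3 coded as -1, 0, 1, 2 and 3. This coding, the function name, and the toy frequencies are assumptions for illustration only and may differ from the regression test actually applied in this study.

```python
def regression_slope(cycles, freqs):
    """Ordinary least-squares slope of allele frequency on cycle number."""
    n = len(cycles)
    mean_x = sum(cycles) / n
    mean_y = sum(freqs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(cycles, freqs))
    sxx = sum((x - mean_x) ** 2 for x in cycles)
    return sxy / sxx


# Hypothetical cycle coding: C-1, C0, C+1, C+2, C+3.
cycles = [-1, 0, 1, 2, 3]

# Toy allele frequencies (not study data).
rising = [0.05, 0.10, 0.15, 0.20, 0.25]   # allele favored by high-IVDMD selection
falling = [0.40, 0.35, 0.30, 0.25, 0.20]  # allele disfavored by high-IVDMD selection

b_rising = regression_slope(cycles, rising)
b_falling = regression_slope(cycles, falling)
```

Under these assumptions, a positive b indicates an allele that increased in frequency across cycles of selection for high IVDMD, a negative b indicates one that decreased, and a gene with mixed signs of b contains both kinds of loci.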
Changes in in vitro dry matter digestibility (IVDMD), ethanol production and lignin concentration across the five populations evaluated in Lincoln, Nebraska.
The figure is adapted from data of Vogel and others [12] for illustrative purposes only and is not a replicate of published images. (TIF)
Population sizes through divergent recurrent selection for in vitro dry matter digestibility in switchgrass.
From the base population C0, one cycle of selection for low IVDMD and three cycles of selection for high IVDMD were conducted, resulting in four selected populations, C-1, C+1, C+2 and C+3. Population sizes are represented by n and the number of selected individuals by m for each group of selected individuals (S-1, S+1, S+2, and S+3). The figure is adapted from data of Vogel and others [12] for illustrative purposes only and is not a replicate of published images. (TIFF)
Summary information on allele sequences for four candidate genes obtained from the five divergent populations.
Switchgrass v3.1 genomic identifiers were obtained from the Phytozome genome database by using our sequences as queries in BLAST. (DOCX)
The number of gene sequences sampled from each population allele pool.
(DOCX)
Genetic diversity and haplotype diversity within the divergent populations for the four candidate genes.