Literature DB >> 30982471

Analysis of the recombination landscape of hexaploid bread wheat reveals genes controlling recombination and gene conversion frequency.

Laura-Jayne Gardiner^1,2, Luzie U Wingen³, Paul Bailey⁴, Ryan Joynson⁴, Thomas Brabbs⁴, Jonathan Wright⁴, James D Higgins⁵, Neil Hall^4,6, Simon Griffiths³, Bernardo J Clavijo⁴, Anthony Hall^7,8.

Abstract

BACKGROUND: Sequence exchange between homologous chromosomes through crossing over and gene conversion is highly conserved among eukaryotes, contributing to genome stability and genetic diversity. A lack of recombination limits breeding efforts in crops; therefore, increasing recombination rates can reduce linkage drag and generate new genetic combinations.
RESULTS: We use computational analysis of 13 recombinant inbred mapping populations to assess crossover and gene conversion frequency in the hexaploid genome of wheat (Triticum aestivum). We observe that high-frequency crossover sites are shared between populations and that closely related parents lead to populations with more similar crossover patterns. We demonstrate that gene conversion is more prevalent and covers more of the genome in wheat than in other plants, making it a critical process in the generation of new haplotypes, particularly in centromeric regions where crossovers are rare. We identify quantitative trait loci for altered gene conversion and crossover frequency and confirm functionality for a novel RecQ helicase gene that belongs to an ancient clade that is missing in some plant lineages including Arabidopsis.
CONCLUSIONS: This is the first gene to be demonstrated to be involved in gene conversion in wheat. Harnessing the RecQ helicase has the potential to break linkage drag utilizing widespread gene conversions.

Entities: Chemical Disease Gene Species

Keywords: Crossover; Gene conversion; QTL; Recombination; Wheat

Mesh：

Year: 2019 PMID： 30982471 PMCID： PMC6463664 DOI： 10.1186/s13059-019-1675-6

Source DB: PubMed Journal: Genome Biol ISSN： 1474-7596 Impact factor: 13.583

Main text

There is an evolutionary requirement for genetic diversity across a species. Shuffling of material between homologous chromosomes, or genetic recombination, breaks linkage between genes resulting in offspring that have combinations of alleles that differ from those found in either of the parents. During meiosis, double-strand breaks (DSBs) can generate sequence variation in gametes via the DSB repair model [44]. DSBs are resolved by homologous recombination either as crossovers (COs), i.e., the reciprocal exchange of large regions between chromosomes, or otherwise as non-crossovers (NCOs). A minimum of one CO per chromosome during meiosis is a requirement for proper chromosome segregation [36]. When both COs and NCOs are resolved, they can also give rise to gene conversions (GCs) as a mechanism of DSB repair involving the non-reciprocal transfer of short DNA segments between homologous non-sister chromatids [17, 43]. GCs can be either allelic, meaning that one allele of the same gene replaces another allele, or ectopic, meaning that one paralogous DNA sequence converts another, they are also involved in the repair of DSBs that occur during mitosis (reviewed in: [7]). In plant and animal breeding, researchers strive to identify and introduce loci linked to favorable traits ranging from abiotic or biotic stress resistance to agronomic traits. This often also introduces linked undesirable genes and their resulting traits. The goal is to introduce a favorable allele while minimizing linkage drag from surrounding undesirable alleles. Increasing COs at meiosis breaks up linkage groups reducing linkage drag. However, COs are largely absent in centromeric chromosomal regions [45]. GCs contribute to breaking up linkage groups, and it has been observed that GCs are prevalent in centromeric regions suggesting that centromeres do experience genetic change but that DSBs in these regions are converted preferentially to GCs [41, 43]. It is therefore important for us to understand recombination and GC if we are to alter their rates to accelerate the induction of novel allelic combinations or to generate stable cultivars. This is particularly important in bread wheat where recombination frequency is low and skewed towards the ends of chromosomes [9]. Wheat has a large (16 Gb) complex allohexaploid genome, and in the light of recent advances in the wheat’s genomic and genetic resources, it presents an excellent model crop ([4, 8, 26, 46]). In plants, over 80 genes have been identified and characterized that are involved in recombination including crossover and gene conversion formation [12, 14, 35, 39, 54]. For example, in Arabidopsis, Ziolkowski et al. [54] found that the HEI10 meiotic E3 ligase could control crossover recombination. Recombination rates can also be variable: between populations, within a population, and even within a chromosome of a single organism [11, 38]. However, in crops such as wheat, little detail is known about the mechanisms that control genome-wide recombination. In wheat, quantitative trait loci (QTL) have been identified that affect recombination frequency; however, there are few QTLs that have been fully characterized to the point of validating candidate genes [11, 23, 50]. There is very little known in wheat about the genes or mechanisms controlling GC or GC frequency across its complex genome. To investigate the recombination landscape in bread wheat, we used 13 genotyped recombinant inbred line (RIL) populations generated using single-seed descent from the UK elite variety Paragon crossed with a diverse collection of landraces and elite material (for population details see Additional file 1: Table S1 and S2). Using the genotypes of the parents, we can accurately map CO positions across the genome (Additional file 1: Figure S1). In addition to large CO blocks, shorter shifts in genotype were frequently encountered across the genome. These have previously been disregarded as potential genotyping errors or issues with local ordering of markers. These short shifts are potential markers for GC events. Recent advances in wheat’s genome sequence assembly and local ordering of contigs into pseudo-chromosomes allow us to more confidently classify shorter shifts as GCs in wheat [4, 8]. Using this methodology, we have characterized CO and GC frequencies and locations, how different parental crosses affect the recombination landscapes, and genomic regions that control CO and GC frequency and identified a RecQ-like gene controlling GC frequency in wheat. We have used whole genome sequencing of selected lines to validate our CO and GC calls and to generate a more comprehensive profile of these events across the genome.

Results

Analysis of the CO landscape in wheat

For each of the 13 populations, the number of COs per RIL was recorded across the 21 chromosomes (Fig. 1a; Additional file 1: Figure S2; CO-Phenotype, “Materials and methods”). CO frequencies show a relatively normal distribution independent of the analyzed population with outlier RILs observed with high or low CO frequencies in each population. The average CO frequency per RIL remained relatively stable across the 13 populations varying from 40.8 to 51.9 (Additional file 1: Table S2) consistent with similar studies [11, 22]. For each population, we calculated the number of RILs sharing an individual CO site. Figure 1b and Additional file 1: Figure S3 highlight the distribution of shared CO sites; there is a peak of, on average, 10.2% of CO sites that are seen in only one RIL with the remainder shared in two or more lines (Additional file 1: Table S3). The rate of CO conservation steadily declines as the number of RILs increases. The maximum number of RILs with a conserved CO site never exceeds 55.3% of the population size (range 44.6–55.3%). Although we do not expect highly conserved CO locations in populations of this size, the overlap of a proportion of our COs is likely explained by the binning of our SNPs into 20-Mbp windows to define COs.

Fig. 1

Recombination landscape of wheat. a The number of COs recorded for each RIL in the Paragon × Chinese Spring population (CO frequency per sample) as a frequency histogram. b The number of RILs sharing each recorded CO (number of samples with each CO) as a frequency histogram for the Paragon × Chinese Spring population. c For all analyzed COs, the location of the CO (the start of the window that shows a different predominant parental allele compared to the previous window) is plotted on the x-axis with the number of samples in the population that share the CO on the y axis. d The intersection of two 20-Mbp windows defines a CO. Therefore, for all windows of 40 Mbp encompassing a central defined CO, the number of high-confidence genes that are found within each interval is plotted alongside the number of RILs within each population showing the CO. e Parents for the 13 populations clustered according to their representative alleles from the 35K SNP array. f The 13 populations clustered according to their individual CO profiles, i.e., number of RILs with each recorded CO in the population. The dendrograms in e and f were produced using the R package pvclust average linkage method with correlation-based dissimilarity matrix and the value of this distance metric between clusters is represented as height on the y-axis. AU (approximately unbiased) p values were computed by multiscale bootstrap resampling (bootstrap number of 1000). Landraces are highlighted with blue boxes and pure breeding lines are highlighted with pink boxes We observed that CO sites cluster towards the end of chromosomes and that this effect is more pronounced if the CO site is more frequently encountered, i.e. appears in multiple RILs (Fig. 1c). CO sites show a bias to genic regions that is statistically significant (two-tailed t test p < 0.0001, t = 6.2534, df = 10,174). In the regions defined containing CO sites, i.e. the windows that were representative of cM bins (CO-Phenotype, “Materials and methods”), the average number of genes is 363.83 and the average proportion of the region represented by genes is 2.65% whereas across all array SNP windows the average number of genes per region decreases to 342.98 and the average proportion of each region represented by genes decreases to 2.54% (“Materials and methods”). Further to this bias of COs to genic regions, there was a significant positive correlation between the number of genes at the CO site and the number of RILs with the CO (Fig. 1d, Pearson correlation coefficient (r) = 0.271706, n = 4780, p < 0.00001). Therefore, if a CO is observed in more RILs, then it is more likely to be gene associated. We assessed the number of RILs that each identified CO was observed in and found that COs that are more frequently seen within a population are more likely to be seen in all 13 populations. The number of COs that are seen in only one population decreases as more RILs in the population have the CO; conversely, COs that are seen in all 13 populations increase in number as more RILs within the populations have the CO (Additional file 1: Figure S4, Table S4). We clustered parental accessions for all of the populations according to their SNP profiles (Fig. 1e) and clustered the resultant populations based on their CO profiles (Fig. 1f). The CIMMYT lines (pink in Fig. 1e, f) cluster together closely and are distinct from the landraces (blue). This is true for the genotype profiles of the parents as well as for the resultant CO profiles of the RIL populations. Additionally, from Fig. 1e, we were able to define pairs of the most closely related parental lines: Baj/Waxw, Bkachu/CIM49 and Super152/Pfau. Looking at the resultant populations after these parents were crossed with Paragon, these are also the most closely related populations according to the CO profile. Therefore, more similar parents for a population lead to populations with more similar CO profiles.

GCs are more prevalent than COs in wheat

Previously, we defined COs by comparing windows that were representative of cM bins (CO-Phenotype, “Materials and methods”). This methodology is unaffected by subtle differences in genome organization between wheat accessions, such as GC events. Sun et al. [43] report 120–222 DSBs per meiosis in Arabidopsis, and although associated GCs should be possible for all of the DSBs, they predict a GC rate of 60–111 assuming that mismatch repair restores 50% of DSBs to their original allelic state. With these rates of GC frequency and previous approximations of GCs being between 2 bp and 10 kbp in length [43, 52], it is likely that such events will be missed by our analysis with on average 4335 SNPs available for analysis per population. However, after adjustment for our parallel definition of COs, we were able to identify on average 104 potential GCs per RIL across the 13 populations (Additional file 1: Table S2, Figure S5; Fig. 2a; GC-Phenotype, “Materials and methods”).

Fig. 2

Fine-scale analysis of sequence exchange events. a The number of COs and/or GCs recorded for each RIL in the Paragon × Chinese Spring population (GC/CO frequency per sample) as a frequency histogram. b Line plots separately for the number of COs (COs), GCs (GCs), and array SNPs per 20-Mbp window across the genome. All chromosomes are normalized to 500 Mbp in length to be displayed in a single plot. The moving average of each dataset is displayed (period = 15). c Schematic of methodology for calling gene conversions (GCs) and crossovers (COs) in the skim sequencing data using pre-defined Paragon and Chinese Spring-specific homozygous SNPs. d Immunolocalization of the chromosome axis protein ASY1 (blue) and yH2A.X (red) a marker for DNA DSB on hexaploid wheat leptotene male meiotic nuclei. Scale bar = 10 μM. e Original nuclei as per d; however, yH2A.X foci are marked that co-localize with ASY1. Mean number of yH2A.X foci across five replicates are shown from the displayed image n = 1673. f Line plots separately for the number of GCs 20 bp–2 kbp, 2–10 kbp, 10–500 kbp, and > 500 kbp in length per 20-Mbp window across the genome. Chromosomes are normalized as per b and the average frequency per window is displayed In Fig. 2b, we compare the distribution of COs to the profile of GCs across the genome. Taking into account the normalization of chromosome lengths used in Fig. 2b, f (see “Materials and methods”) and the representation of all 21 chromosomes in a single plot, the translated centromeric regions across the chromosomes span the interval from ~ 140 to 268 Mbp. Both distributions show the characteristic increase in frequency towards the distal regions of the chromosomes indicative of our likely detection of GC events associated with COs and therefore observed at similar locations. However, high GC frequency in the centromeric regions, where COs are sparse, was also seen and is likely to represent GCs associated with NCOs.

Using whole genome sequencing to define GC locations

There are three hypotheses as to how we are able to identify potential GCs in RILs using low-resolution genotyping; either gene conversion events are far more prevalent in wheat than Arabidopsis and other eukaryotes, lengths of GCs can be longer in wheat or the GCs could represent genotyping error or structural variation in the RILs. To test these hypotheses and more precisely map COs and GCs, we performed whole genome sequencing at low coverage (skim sequencing) for 12 lines from the Paragon × Chinese Spring population. These lines represent the upper and lower ends of the CO and GC frequency range (Additional file 1: Table S5). We defined 31,327,143 homozygous SNPs between Paragon and Chinese Spring along the Chinese Spring IWGSC RefSeq V1.0 reference sequence (IWGSC et al., 2018) that represent parent-specific allelic differences for the RIL population (“Materials and methods”). This translates to 1 SNP approximately every 540 bp to discriminate Chinese Spring and Paragon that is the resolution of our event detection. Skim sequencing data for each of the 12 lines was aligned to RefSeq V1.0 gaining on average 6.42× coverage across 91.3% of the genome and the defined parental SNP set was used to identify homozygous Paragon and Chinese Spring-specific alleles within the sequencing data to allow us to identify COs and GCs (Fig. 2c; “Materials and methods”). The high-resolution whole genome skim sequencing allowed us to identify on average 30,110 potential GC events per RIL. We adapted categories used by Yang et al. [52] to assign confidence to our skim sequencing GCs by classifying them according to length (“Materials and methods”; Table 1). We noted that GC numbers could be inflated as an artifact of incomplete or locally inaccurate reference genome assemblies or due to structural variation between the analyzed lines and the reference genome. Structural variants are likely to present as events that are conserved across the analyzed lines. We used the abundance levels of our events across the analyzed lines to calculate a false positive rate for our calls with regard to structural variation. 49.3% of our identified events were unique to a single RIL with 82.4% observed in < 50% of the analyzed lines. We can assign higher confidence to our unique GCs. Interestingly, the average length of unique GCs was 452 kbp. However, for those GCs conserved across the majority of lines (> 50%), the average length was only 3.7 kbp. Therefore, it appears that the shorter GCs that we identified are more likely to represent structural variation between Chinese Spring and Paragon. Considering only our high-confidence GC calls or unique calls, we defined a range of 7909–22,847 and on average 14,795 GCs events per RIL, of which, on average, 10,064 GCs were of 2 bp–2 kbp, 2255 GCs of 2–10 kbp, 2247 of 10–500 kbp and 228 of > 500 kbp in length (Table 1, “Materials and methods”). As a final validation, we identified sequencing read pairs that spanned our GC shifts from Chinese Spring to Paragon encompassing both a Chinese Spring and a Paragon SNP (“Materials and methods”). This analysis provided an accuracy rate of 85.2% for our definition of GC shifts. The false positives (14.8%) are likely to be a result of read misalignment or structural reference-related errors.

Table 2

Population/phenotype	Cross	BC.gen	F.gen	Chr	Pos cM	LOD	%var	Add eff	Flanking markers	Top marker	Candidate genes plus annotation	Parent with increasing effect
CSxPCO-Phenotype	RIself	NA	NA	6A	11.84	3.637	6.037	− 4.502	AX-95155001AX-945071466,740,223-21,438,976 bpTraesCS6A01G013800-TraesCS6A01G040900 (including 267 genes)	AX-94901219 11.84 cM6D: 13,554,815 bp; TraesCS6D01G032100; Pre-mRNA-processing protein 40AMarker at 11.84 cM on 6A: 14,609,435 bp	TraesCS6A01G026300.2 Holliday junction ATP-dependent DNA helicase RuvB (13,063,092 bp)TraesCS6A01G037500.1 Holliday junction ATP-dependent DNA helicase RuvA (18,534,777 bp)	Chinese Spring
CSxPGC-Phenotype	RIself	NA	NA	2A	161	2.513	4.212	− 14.73	AX-94616756AX-9463916851,940,072-70,183,206 bpTraesCS2A01G098600-TraesCS2A01G119800(including 207 genes)	AX-9441908457,999,746 bp	TraesCS2A01G103300 (56,942,731 bp)ATP-dependent RNA helicase DeaD; DNA helicase, ATP-dependent, RecQ type	Chinese Spring
PxCIM47GC-Phenotype	RIself	0	5	2B	0	6.041	6.623	0.519	AX-94768203AX-9454604547,425,660-64,988,255 bpTraesCS2B01G084900-TraesCS2B01G104200(including 189 genes)	AX-9476820364,988,255 bp	TraesCS2B01G103300 (64,091,671 bp)ATP-dependent DNA helicase PIF2	CIM47
PxWat94GC-Phenotype	RIself	0	4	4B	191	2.591	10.461	− 0.838	AX-94840592AX-951860173,513,381-4,996,179 bpTraesCS4B01G005000-TraesCS4B01G007500(including 26 genes)	AX-948405924,996,179 bp	TraesCS4B01G006700 Protein HIRA (4,526,708 bp)	Paragon
PxBajGC-Phenotype	RIself	0	4	5A	304.6	3.802	9.898	1.24	AX-95240942AX-94480416464,478,676-487,684,559 bpTraesCS5A01G249800-TraesCS5A01G277700(including 273 genes)	AX-94758045472,344,060 bp	TraesCS5A01G257000 (472,343,789-472,347,557 bp)WPP domain-interacting protein 1TraesCS5A01G260500 (474,537,170 bp) Regulator of chromosome condensation (RCC1)TraesCS5A01G381700LC (472,221,190-472,221,614 bp)BURP domain protein RD22-similar to C-PH1 for homologous chromosome pairing	Baj

In Arabidopsis, Sun et al. [43] predict the GC rate based on the DSB frequency per meiosis. Here, to allow us to perform a similar validation for our GC calls, we calculated the DSB rate for male hexaploid wheat leptotene nuclei (n = 5) using immunolocalization recording on average 2133+/− 157 DSBs per meiosis (Fig. 2d, e; Additional file 1: Figure S6). This translates to 3952–4580 considering female DSBs per meiosis, and when considering that we are observing an F8 generation, we would therefore expect 31,616–36,640 DSBs to have occurred in our analyzed RILs. However, since our F8 RILs have passed through single seed descent becoming increasingly homozygous in each resultant generation, detectable allelic GCs, where an allele of the same gene replaces another variant allele, will decrease as generations increase. Accounting for increasing homozygosity at a rate of approximately 50% per generation, we estimate that 7874–9125 DSBs would have the potential to be detectable allelic GCs. It is unclear how DSBs directly translate to detectable GCs in wheat and the impact of mismatch repair on this number; however, when we focus on our high-confidence GC calls from the skim sequencing, after correction for our classification accuracy rate of 85.2%, we define a range of 6588–19,192 and on average 12,411 GCs of 20 bp–500 kbp that overlaps the frequency of DSBs. We showed with array SNPs that GCs increase in frequency towards the distal regions of the chromosomes but also show higher frequency in centromeric regions. Here, we profiled the different length high-confidence GCs defined from the whole genome sequencing to see if their profiles differ (Fig. 2f). It is evident from Fig. 2f that all GCs tend to increase at the distal regions of the chromosomes; however, compared to GCs of > 500 kbp in length, which are likely to include crossovers, both 10–500 kbp and 2–10 kbp GCs maintain an elevated frequency that is conserved across proximal and centromeric regions. Shorter GCs of 20 bp–2 kbp display the highest frequency that is conserved highly across proximal regions and to a relatively high level at the centromere. It is therefore the GCs from 20 bp to 500 kbp in length which appear to largely break proximal and centromeric regions.

Using whole genome sequencing to validate our array-based CO and GCs

Across the 12 sequenced RILs, we observed 52 COs per RIL considering events of at least 40 Mbp which closely reflects the average of 54.6 COs for the same 12 RILs defined using the array SNPs. This is likely because the 40 Mbp represents the total analysis window that we used for array SNPs to define a central CO at the intersection of two 20-Mbp windows. Furthermore, Table 1 highlights that we could validate up to 100% of the COs from the array analysis with the skim sequencing datasets and that they overlapped events that averaged 42,859,250 bp in length with 84.7% of events > 500 kbp. On average, we validate 90.7% of COs with the skim sequencing data. Similarly, we validated GC calls from the array analysis using the skim sequencing. We could directly associate on average 95.7% of our previously defined GCs with GC events in the sequencing data that were on average 29,992,400 bp in length with 61.2% of events < 20 Mbp (Table 1). Furthermore, on average, we could validate 83.0% of the specific array SNP alleles that were used to define GCs and given the reported agreement rates of 85.7% for SNP arrays when compared to sequencing data by Burridge et al. [5]; this is in line with expectations. Using the array, the reason for our ability to identify GCs in low-resolution genotyping data is that we typically detect the larger GC events whereas whole genome sequencing gives the ability to robustly detect shorter GC events across the genome. It appears the frequency of GCs is far higher than in Arabidopsis and in contrast to other eukaryotes studied to date; longer GCs are also prevalent in wheat [43].

QTLs identified for CO frequency

In our analyzed RIL populations, we observed a normal distribution of CO and GC frequency with a positive skew (Figs. 1a and 2a). As such, there were subsets of RILs with increased levels of GCs or COs compared to the population average. Therefore, we used GC and CO frequency as traits for QTL analysis (“Materials and methods”). The CO analysis (CO-Phenotype, “Materials and methods”) identified a robust QTL for the Paragon × Chinese Spring population that explained > 6% of the variation (LOD score 3.64, p < 0.05; Table 2; Fig. 3a, c, and e). The Holliday junction ATP-dependent DNA helicase RuvB-like was located within ~ 1.5 Mbp of the QTL peak on chromosome 6A. This gene is known to act in a complex with RuvA to promote strand exchange reactions in homologous recombination [40] and is therefore our main candidate for the QTL. The RuvA-like gene was also located within the QTL interval (~ 4 Mbp from the main peak).

Table 1

Sequence exchange events defined from whole genome skim sequencing. Detailing for the 12 Paragon × Chinese Spring samples skim sequenced; the sample number, CO and frequency defined using the 35K array, % of these events validated by skim sequencing, GCs identified with track lengths > 40,000,000 bp (representative of COs), > 20,000,000 bp, > 500,000 bp, 10,000–500,000 bp, 2000–10,000 bp, and 20–2000 bp. High-confidence GCs are defined as those where the specific array SNP allele that was used to make the call could be validated in the skim sequencing

Sample	CO frequency (35K array)	% of array COs validated by skim seq	GC frequency (35K array)	% array GC positions validated with skim sequence	% GC-specific SNP alleles validated with skim sequence	Track length> 40Mbp	Track length> 20Mbp	Track length > 500 kbp	High-confidence track length> 500 kbp	Track length 10–500 kbp	High-confidence track length10–500 kbp	Track length 2–10 kbp	High-confidence track length2–10 kbp	Track length 20 bp–2 kbp	High-confidence track length20 bp–2 kbp
0055	73	86.3	429	96.0	83.9	56	94	186	177	2884	898	4230	1158	15,262	5676
0334	70	91.4	507	96.8	86.9	55	88	337	321	10,017	8155	7687	4828	18,913	9543
0200	54	87.0	461	92.8	82.6	53	81	200	184	3433	1322	5251	1691	19,602	8609
0270	54	81.5	483	96.7	50.9	79	122	278	257	2907	1694	3721	1853	17,651	11,383
0008	65	98.5	413	94.2	85.2	43	72	215	197	3588	1630	5954	2508	26,063	13,451
0119	69	91.3	409	94.4	82.8	43	75	330	317	5926	3995	6097	2898	20,600	9961
0005	55	94.5	363	96.7	84.6	51	79	238	213	4303	2228	6002	2402	22,553	10,738
0018	52	92.3	386	94.3	87.0	58	101	241	218	2531	1148	4323	1773	18,583	10,225
0097	53	98.1	322	96.3	86.9	30	56	177	163	4296	1717	7203	2694	28,299	13,193
0004	34	88.2	378	97.9	86.6	47	73	217	202	3377	1701	5242	2206	20,889	10,829
0010	37	100	319	94.7	89.1	44	85	309	291	3292	1383	4926	1860	22,063	11,234
0281	39	79.5	306	97.7	89.4	66	102	212	196	2642	1097	3426	1195	13,612	5930

Fig. 3

Output from QTL analysis from the Paragon × Chinese Spring population. QTL analysis output for the Paragon × Chinese Spring population that yielded significant associations for either a CO-Phenotype or b GC-Phenotype (p < 0.05). Detailing LOD scores plotted over the respective linkage groups, i.e., chromosomes. Increased resolution of QTL peaks for c CO-Phenotype and d GC-Phenotype. e Finally, the locations of the array SNPs showing the peak associations are marked in red surrounded by a red box while also showing all other array SNP locations per chromosome

QTLs identified for CO frequency. Detailing for the QTL analysis; the population under analysis, phenotype, cross type and generation, chromosome, position (cM), logarithm of odds (LOD) score, % variation explained, additive effect, flanking markers including QTL region length and gene number, top marker from QTL scan and annotation of identified candidate genes in the region Output from QTL analysis from the Paragon × Chinese Spring population. QTL analysis output for the Paragon × Chinese Spring population that yielded significant associations for either a CO-Phenotype or b GC-Phenotype (p < 0.05). Detailing LOD scores plotted over the respective linkage groups, i.e., chromosomes. Increased resolution of QTL peaks for c CO-Phenotype and d GC-Phenotype. e Finally, the locations of the array SNPs showing the peak associations are marked in red surrounded by a red box while also showing all other array SNP locations per chromosome For the GC frequency trait (GC-Phenotype, “Materials and methods”), we identified multiple robust QTL that explained 4.2–10.5% of the observed variation (Table 2; Fig. 3b, d and e; Additional file 1: Figure S7). We only identified QTL for four of the 13 populations likely due to the low power of our analysis in some of the populations since they were made up of < 100 RILs. We studied genes within the QTL intervals and identified four gene candidates for GC frequency that were on average 600 kbp from the QTL peak and showed functional significance (p < 0.05); firstly, ATP-dependent RNA helicase RecQ-like (for the purpose of this paper, we have named this RecQ-7) on chromosome 2A from the Paragon × Chinese Spring population analysis, overexpression of RecQ in rice embryogenic cells has been linked to stimulation of homologous recombination [30] (Fig. 3b); secondly, ATP-dependent DNA helicase PIF2 on chromosome 2B from the Paragon × CIMMYT 47 analysis, PIF2 is a known DNA repair and recombination helicase (Additional file 1: Figure S7a); thirdly, a gene encoding the protein HIRA on chromosome 4B from the Paragon × Watkins 94 analysis, chromatin reassembly during DSB repair has been shown to be dependent on the HIRA histone chaperone [1, 3, 21, 31] (Additional file 1: Figure S7b); finally, WPP domain-interacting protein 1 on chromosome 5A from the Paragon × Baj analysis, this gene is key for nuclear assembly and transport and is involved in the same pathway as the gene RCC1 that is seen in the same interval (Additional file 1: Figure S7c) [6, 24, 53]. We noted a low-confidence gene ~120 kbp from the WPP domain-interacting protein 1 in the Paragon × Baj analysis; this BURP domain protein RD22 shows similarity to the gene C-Ph1 that is involved in homologous chromosome pairing and could also be contributing to this QTL peak [15].

Validation of candidate genes

We used the Cadenza TILLING population, a mutagenized bread wheat population that has been widely characterized using exome sequencing, to identify lines with likely knockouts of our candidate genes from the QTL analysis; we prioritized stop codon inducing mutations that were as close to the start of the gene as possible to ensure a null phenotype [26] (Additional file 1: Table S6 and S7). We then use the high frequency background EMS mutations in the TILLING lines to call COs and GCs as previously, however, here recording homozygous/heterozygous shifts across the genome (“Materials and methods”) [51]. Firstly, for the Holliday junction ATP-dependent DNA helicase RuvB-like that was associated with CO frequency in the Paragon × Chinese Spring population, the CO frequencies of the eight knockout RuvB lines and the control group of ten lines showed no discernible difference with average frequencies of 57.4 and 57.6 respectively (Fig. 4a; Additional file 1: Table S6, Note S1; “Materials and methods”).

Fig. 4

Examination of candidate genes from QTL analysis RecQ-7 and RuvB. a Box plot comparison of the knockout RuvB lines with the control lines, defining CO frequency using collapsed linkage windows as per CO-Phenotype. b Box plot comparison of the knockout RecQ-7 lines with the control lines, defining CO/GC frequency using GC-Phenotype. c Box plot comparison of the knockout RecQ-7 lines with the control lines, defining CO frequency using CO-Phenotype. d Phylogenetic tree of identified genes across multiple species (Arabidopsis, rice and wheat) with sequence similarity to the RecQ helicase family, including our wheat candidate RecQ-7 gene for comparison. Bootstrap values ≥ 90% are shown as green dots on the branches Secondly, for the ATP-dependent RNA helicase RecQ-7 that was associated with GC frequency in the Paragon × Chinese Spring population, we observed a significant decrease in GC frequency in the RecQ-7 mutant knockout group of four lines compared to the control group (Welch t test, P = 0.0338, t = 2.424, df = 11) (Additional file 1: Note S2). We observed average GC frequencies of 426.96 and 498.54 for mutant and control lines respectively (Fig. 4b; Additional file 1: Table S6, Note S3; “Materials and methods”). We then went on to define CO frequency for the RecQ-7 knockouts to determine if our GC-Phenotype QTL, which mainly reflects increased GC, could translate to an increase in COs. We observed a significant decrease in CO frequency in the RecQ-7 mutant group compared to the control group (Welch t test, P = 0.0411, t = 2.3439, df = 10) with average CO frequencies of 51.5 and 59 for mutant and control respectively (Fig. 4c, “Materials and methods”). In Arabidopsis, there are seven RecQ genes, AtRecQl 1, 2, 3, 4A, 4B, 5 and AtRecQsim. RecQ4A and RecQ4B have been identified as playing a role in recombination [19, 39]. To understand how the wheat RecQ7 gene is related to the Arabidopsis RecQ genes, we built a phylogenetic tree including rice, Arabidopsis and wheat RecQ homologs (Fig. 4d). We found that the RecQ7 gene failed to cluster with any of the Arabidopsis genes; instead, it clusters in a very well-defined separate branch with the rice RecQ-Like gene Os07g48360 defining a novel RecQ clade that has been lost in some plant lineages (Additional file 1: Figure S8; [18]). We used the Cadenza TILLING population to ascertain if knockouts of the three homoeologous wheat homologs of the Arabidopsis candidates RecQ4A and RecQ4B, identified from Fig. 4d, showed similar CO phenotypes (Additional file 1: Table S6). Knockouts of our homoeologous gene candidates resulted in general decreases in the average CO frequency per line compared to the control group; however, none were statistically significant (Additional file 1: Note S4). We next used the Cadenza TILLING population to ascertain if knockouts of the homoeologous gene copies of our candidate gene RecQ-7 on chromosome 2A showed similar GC phenotypes to RecQ-7 (Additional file 1: Table S6, Note S5, Figure S9). Knockouts of our homoeologous gene candidates resulted in a decrease in the average GC frequency per line from 498.5 in the control group to 414.4 and 449.4 for sub-genomes B and D respectively (GC-Phenotype). When we consider all sub-genome A, B and D knockouts as a group together, there is a significant decrease in GC frequency between this group and the control group (two-tailed t test, P = 0.0371, t = 2.2617, df = 17). The increasing effect of RecQ-7 on CO frequency was associated with the landrace parental line Chinese Spring. We compared our candidate gene sequence between Paragon and Chinese Spring using genomic and transcriptome sequence and saw a high level of sequence conservation; however, there was an insertion of 12 bp in Paragon compared to Chinese Spring that is potentially a microsatellite expansion in the intronic sequence. Furthermore, Cadenza, the founder accession for the TILLING population that we used to knockout the gene and observe a phenotypic effect, has a homologous RecQ-7 sequence that is more comparable to the genic sequence seen in Chinese Spring. In Chinese Spring, RecQ-7 is reportedly expressed in a large range of tissues, but notably it is expressed at the highest levels in the vegetative and reproductive spike at anthesis, flag leaf stage, ear emergence and meiosis [2]. In the vegetative spike, our candidate homoeolog RecQ-7 on sub-genome A is expressed at 1.4× and 1.7× the level of the homoeologs on sub-genomes B and D respectively. Similarly in the reproductive spike, RecQ-7 on sub-genome A is expressed at 1.3× and 1.9× the level of the homoeologs on sub-genomes B and D respectively. Therefore, although a combined triple gene knockout may be beneficial, this could be why a significant observable effect is only gained from the knockout of our single candidate RecQ-7 on sub-genome A with uneven gene expression across the RecQ-7 homoeologs that is biased to sub-genome A. However, it may be that a triple mutant would be lethal. We focused on genes that were identified from the Chinese Spring × Paragon population since this was the largest population with the highest number of SNPs available for analysis and was therefore likely to give the most robust candidates. Knockouts of QTL candidates from the other populations showed observable decreases in GC/CO frequency; however, they showed decreases that were typically not statistically significant (Additional file 1: Note S6, Table S7). The notable exception was the WPP domain-interacting protein 1 from the Paragon × Baj population analysis where, although mutants for this gene showed a decrease in GC frequency that was not significant, this translated to a significant decrease in CO frequency in the mutant group compared to the control group. This gene could be an important candidate for further analysis. It is possible that combined triple gene knockouts for our additional QTL candidates may yield significant results.

Discussion

Using the 35K array, we see the characteristic normal distribution of CO frequency, with an average of 41–52 COs per RIL, and a non-random distribution of COs between RILs and along the chromosomes [9, 11, 22]. There is a positive correlation between the number of genes in the CO interval and the number of RILs with the CO suggesting that perhaps recombination in gene-rich regions leads to favorable phenotypes that are under selective pressure. In addition, we show that introducing a more divergent parent, e.g., a landrace such as Chinese Spring or a Watkins accession to be crossed with an elite line, could increase the potential for introducing new CO profiles in the resultant population in addition to new allelic diversity. Therefore, wide crosses not only bring in novel alleles but also new COs, potentially breaking linkage groups. Using whole genome skim sequencing, we profile GC events across the wheat genome revealing a large unexplored source of sequence exchange between chromosomes that is particularly desirable in the drive to break linkage drag in the crossover sparse centromeric regions. We show that larger GC events can be profiled using low-density genotyping arrays where previously the crop breeding and research community may have ignored such changes regarding them as potential genotyping errors. These large GCs have not been described before in eukaryotes and may be the result of large complex repetitive polyploid genomes. There is also the possibility that the larger GCs could actually be double COs where RecQ-7 could promote more intermediates going down the non-interfering CO pathway. However, we do not observe a global increase in non-interfering COs, although we cannot rule out a possible role in recombination intermediate branch migration by RecQ-7 as observed in the human ortholog BLM [25], that may separate components of a double Holliday junction substantially enough for both Holliday junctions to resolve via the class II pathway as COs. In any case, this would be an equally useful tool to reduce linkage drag and an interesting feature of such a complex polyploid. Despite the critical role of GC in genome diversity and evolution, little is known about the mechanism of control. In yeast, the protein complex MutLβ and meiosis-specific helicase Mer3 have been implicated controlling the length of gene conversions [10]. Here, we were able to identify multiple QTL affecting GC frequency. We have identified a candidate helicase gene, ATP-dependent RNA helicase RecQ, on chromosome 2A that we named RecQ-7. Using EMS-induced gene knockouts, we went on to validate our gene candidate RecQ-7 observing a drop in GC frequency in those lines where our candidate gene was knocked out. This reduction in GC frequency also correlated with a significant decrease in CO frequency. Furthermore, the decrease in GC was observed when either of the other homoeologous gene copies was knocked out. By performing phylogenetic analysis, we determined that RecQ-7 was present in rice but not in Arabidopsis. Therefore, genetic screens to identify suppressors of recombination in Arabidopsis would not identify RecQ-7, although RecQ-4 might perform a similar role [39]. However, RecQ-7 (OsRecQ 886 in rice) is structurally divergent from RecQ-4 (reviewed in [18]). RecQ-7 is structurally similar to moss RecQ-6 that is essential for gene targeting, but has no apparent role in DNA repair [48]. Therefore, RecQ-7 is an important target for wheat breeders to increase recombination in genomic “cold” regions by increasing CO and GC frequency and could be transferred to dicotyledonous crops to be used as an enhancer of recombination.

Materials and methods

Crossover calls (CO-Phenotype)

The SNP sequences from the 35K array were aligned (anchored) to the IWGSC RefSeq V1.0 wheat genome reference sequence (IWGSC et al., 2018), and therefore, for each SNP, we have its exact base pair position the Chinese Spring reference. To identify if specific sequences or regions are targeted for recombination consistently between populations, the SNP Chinese Spring reference anchoring was implemented for all populations throughout this study to aid comparative analyses between populations. For each population, for each RIL, array SNPs were annotated according to the parental allele that they represented. SNPs that were deemed to be so close together as to render recombination between them more unlikely were collapsed into bins representative of cM bins. With a genome size of 16 Gb and a previously reported cM genome size of 3894 cM, we collapsed SNPs within intervals across the genome of 20 Mbp (16 Gbp/3894 cM × 5 the estimated shared haplotype length around a SNP in wheat) [22]. Collapsing SNPs involved classifying the most frequently encountered parent specific allele per window, if the window appeared to be more Parent 1 or 2, i.e., homozygous for Parent 1 or 2, this was recorded, and a mixture of the two parents was recorded as a heterozygous region. A change in the most frequently encountered parent-specific allele from one window to the next across the genome was recorded as a CO. This methodology is unaffected by subtle differences in genome organization between wheat accessions and is therefore a conservative method for calling COs enabling cross-population comparisons; our refinement of a CO will be limited to the intersection of two 20-Mbp windows. To call COs from the Cadenza TILLING population, for each of the lines we assessed, we utilized SNPs that were called by aligning exome capture data to the IWGSC RefSeq V1.0 Chinese Spring reference sequence. The SNPs are publicly available at https://plants.ensembl.org/Triticum_aestivum/Info/Annotation (see Acknowledgements). We, firstly, filtered the SNPs removing those not showing allelic changes characteristic of EMS treatment (leaving 89% of the SNPs). Secondly, we selected only those SNPs with a mapping quality of ≥ 50. Thirdly, we annotated homozygous mutant SNPs as those having the mutant or alternate allele in > 85% of the total sequencing reads, i.e., wild-type (WT) allele in ≤ 15% of the reads; the remaining SNPs were annotated as heterozygous. Finally, we removed SNPs where the mutant alternate allele was observed at less than 4× coverage and removed SNPs that were not assigned to one of the 21 wheat chromosomes. This resulted in a list of high-confidence SNPs for CO calling. COs were called as previously; however, instead of looking for Parent 1/Parent 2 shifts, we recorded homozygous/heterozygous shifts between windows.

Gene conversion calls (GC-Phenotype)

The RefSeq V1.0 anchored SNPs from the 35K array that were used for CO calling were used here to identify short shifts between parents that we use as a potential indicator for GC. Here, no windows or collapsing was used and each change in the encountered parent-specific allele from one SNP to the next across the genome was recorded as an event. Here, we assume a correct local reference order and consider each genotype shift even if they are short in length or multiple changes occur in close proximity. Therefore, here we consider both GC and CO events together due to our limited ability to detect the difference via the specific length of the event with only on average 4335 SNPs available for analysis per population with a genome size of 16 Gb. This methodology has the potential to include false positives led by incorrect genome order and incorrect SNP calls. As such, here we rely on the recent advances in the wheat reference sequence assembly and genome ordering alongside the high-confidence SNP calls generated by the 35K array. For direct estimates of the number of GC events per RIL, we subtract the number of COs from the GC/CO estimate. To call GCs from exome capture data from the Cadenza TILLING population, SNPs were identified in the data and identified as homozygous/heterozygous as previously detailed. GCs (again alongside COs) were called as previously; however, instead of looking for Parent 1/Parent 2 shifts, we recorded homozygous/heterozygous shifts across the genome. Since we observed high variation in the number of SNPs that were available for each EMS-treated line (1700-9018) alongside a linear relationship between the number of SNPs available and the number of GCs identified (R2 = 0.91595), we normalized GC estimates to reflect a SNP count of 5000 per line.

Determining if COs show bias to genic regions

To determine if COs are more or less likely to target genic regions, we compared the number of genes at the CO sites to the number of genes in regions of the same size across the genome, which contain an array SNP, independently of whether they contained a CO or not. Focusing on only those regions containing array SNPs eliminates bias from the array SNP positions already being focused in genic regions.

Cluster analysis

The dendrograms in Fig. 1e and f were produced using the R package pvclust average linkage method with correlation-based dissimilarity matrix, and AU (approximately unbiased) p values were computed by multiscale bootstrap resampling (bootstrap number of 1000).

Phylogenetic analysis of the RecQ gene family

RecQ-like genes were identified by detecting the presence of the helicase conserved C-terminal domain in the proteomes of ten species using HMMER3.1b2 HMMSEARCH (hmmer.org). The inputs to HMMSEARCH were the Pfam hidden Markov model (HMM), Helicase_C (PF00271), and the protein data sets from the following genome annotations: Arabidopsis (Araport11), Medicago (Phytozome, V10), Brachypodium (Phytozome, V12), rice (MSU_RGAP_v7.0), maize (Phytozome, V10), barley high- and low-confidence genes (IBSC consortium, [33]), wheat TGACv1 [8], wheat RefSeq v1.0 high- and low-confidence genes (IWGSC consortium), Marchantia (Marchantia.info), moss (Phytozome V10), and yeast (Ensembl Fungi Release 40). The Helicase_C protein sequences detected were aligned back to the HMM using HMMER3.1b2 HMMALIGN. Gap columns in the alignment were removed, and sequences with less than 70% coverage across the alignment were removed to reduce false placement in the tree of sequences with insufficient coverage across the domain. The longest sequence for each gene out of the available set of splice versions was used for phylogenetic analysis. Phylogenetic analysis was carried out using the MPI version of RAxML v8.2.9 [42] with the following method parameters set: -f a, -x 12345, -p 12345, -# 100, -m PROTCATJTT. The tree was mid-point rooted and visualized using the Interactive Tree of Life (iToL) tool [27] to identify the clade containing the RECQ-like homologs. Full-length sequences for these homologs were aligned to build a tree containing Arabidopsis, rice and wheat proteins (Fig. 4d), and a tree containing proteins from all 10 species (Additional file 1: Figure S8a). Each alignment was made with PRANK [32], then all the columns between the conserved DEAD and RECQ domains were extracted using JalView [47] and the alignment was used to build the final tree for the RecQ-like proteins with the above RAxML command.

Selecting samples for skim sequencing

RILs were selected from the Paragon × Chinese Spring population due to us having whole genome reference sequences available for both accessions. RILs were selected to cover a profile of low, medium, and high CO and GC frequency compared to the averages for the population (~ 52 for COs and ~ 388 for COs and GCs). Low was defined as ≥ 10% less than the average (CO range 0–46; CO/GC range 0–350), medium was defined as within ± 10% of the average (CO range 47–57; CO/GC range 351–428), and high was defined as ≥ 10% more than the average (CO range 58–93; CO/GC range 428–540).

Genomic DNA isolation for skim sequencing

Leaf tissue from each line was ground in liquid nitrogen with a mortar and pestle. Genomic DNA was extracted from the ground tissue using a DNeasy plant mini kit (Qiagen) according to the manufacturer’s protocol (version March 2018) with two alterations. The incubation time with buffer AP1 and RNase was extended to 1 h and a new 1.5-mL microcentrifuge tube was used for the second elution to prevent dilution of the DNA. For all library preparations, the first elution of each sample was used. The genomic DNA was assessed by spectrophotometry using a NanoDrop 2000 (Thermo Fisher) for contamination. DNA concentration was determined using a Quant-iT High Sensitivity double-stranded DNA assay kit (Invitrogen) and an Infinite F200 Pro microplate reader (Tecan). DNA integrity was assessed using genomic DNA ScreenTape (Agilent) and a TapeStation 2200 (Agilent).

Library preparation for skim sequencing

The 12 genomic DNA samples were sheared to 300 bp using a S2 Covaris ultrasonicator (2 cycles 60 s, 10% duty factor, intensity of 5, and 200 cycles/burst). A whole genome library was produced for each sample using the KAPA High Throughput library preparation kit (Roche). The standard protocol was followed (Version 5.16), with the modifications listed below. The safe stopping point in the standard protocol at A-tailing was included. For adapter ligation, 3 μl of 10 μM SeqCap adapters (Roche) were used. The volume of water per reaction was adjusted so the total volume remained 50 μl. The dual size selection ratios were adjusted to 0.5× and 0.7× to account for the larger fragment size. Two cycles of PCR were used to amplify the libraries in order to keep PCR duplicates down. The libraries were quantified by Qubit High Sensitivity double-stranded DNA assays (Invitrogen) and a Qubit 2 (Invitrogen). Final library yield was between 80 and 200 ng. Fragment size was determined by running the libraries on High Sensitivity DNA Bioanalyzer (Agilent) chips. Exact molar concentrations were determined prior to sequencing using the universal KAPA Illumina Library Quantification kit (Roche) on an Applied Biosystems StepOne Plus system.

Skim sequencing of whole genome libraries

Sequencing was performed on the Illumina MiSeq and NovaSeq6000 platforms. The libraries were pooled equi-molarly for the MiSeq according to the concentrations determined by qPCR. A MiSeq Nano v2 150bp paired-end run was used to quality check the libraries and optimize library pooling for the NovaSeq run. The NovaSeq pooled library was balanced according to the MiSeq read data. The library was run on two S2 150-bp paired-end NovaSeq6000 lanes.

Identification of Paragon-Chinese Spring-specific SNPs for validation of skim sequencing

In this analysis, we used the Earlham Institute’s Paragon whole genome assembly (at http://opendata.earlham.ac.uk/opendata/data/Triticum_aestivum/EI/v1.1/). From this sequence, we simulated 2 × 150 bp paired end sequencing reads with no errors at 20× coverage across the genome using the short-read simulator dwgsim v0.1.11. To avoid the pitfalls in GC detection noted by Qi et al. [37] and Wijnker et al. [51], we aligned these reads back to the Chinese Spring RefSeq V1.0 using BWA-MEM v0.7.10 [28], took only reads aligned in a proper pair (correct orientation and mapped distance), and removed any non-uniquely mapped reads (mapping quality ≤ 10) using SAMtools [29]. We aligned 1,923,658,991 simulated reads initially and after filtering 1,817,505,942 reads remained resulting in coverage of 11,390,273,387 bp of the 14-Gb reference genome size at a minimum of 10×. We then called SNPs between Paragon and the Chinese Spring reference using GATK [34]. We focused only on homozygous SNPs (alternate allele frequency > 80%), removed SNPs with a quality score less than 30 and a depth less than 3, and removed SNPs if 3 or more were defined within a 10-bp window. These SNPs are sites where we can accurately discriminate Chinese Spring and Paragon, and we refer to this SNP list as “Paragon-Chinese Spring-specific SNPs”.

Identification of Paragon-Chinese Spring COs and GCs from skim sequencing

Here, we aligned sequencing reads for each of the 12 lines that were sequenced to the Chinese Spring RefSeq V1.0 using the same methodology as for the simulated Paragon reads detailed above, however, with the following additions; we included a duplicate read removal step using Picard tools v2.1.1, and this time for SNP calling using GATK, we removed SNPs with a quality score less than 20. Since we are aligning skim sequencing from Paragon × Chinese Spring crosses to a Chinese Spring genome, we expect that the alternate SNP alleles we define in the sequencing data will be Paragon-specific, and therefore, these can be used to define COs and GCs. However, our defined SNPs are typically seen at low coverage and as such require validation. We validated our SNPs in the skim sequencing data by comparing the defined alleles to the “Paragon-Chinese Spring-specific SNPs” that we previously defined. For each skim sequenced line, we cycle through the list of “Paragon-Chinese Spring-specific SNPs,” i.e., all possible differences between Chinese Spring and Paragon, and for each SNP position; if the skim sequenced line also has a SNP with an alternate allele defined in > 80% of the sequencing reads at this position that matches the Paragon allele, we define this as Paragon-specific; if the line has a SNP with an alternate allele in < 80% of the reads or that does not match Paragon, the position is undetermined; if the line has no SNP allele defined at this position, we check for mapping coverage in the region (if none, the position is undetermined); and if sufficient coverage is observed but no SNP called, we check that the Chinese Spring reference allele is found at this position and define this as Chinese Spring-specific. We then have a validated list of positions across each of the skim-sequenced lines from which we can define COs and GCs where we either define Paragon-specific sequence and Chinese Spring-specific sequence or have an undetermined call which is removed from the analysis. Due to working with the F8 generation and not F2 as per previous studies, we focus on homozygous SNPs only rather than homozygous-heterozygous shifts that are more likely to be mistakenly identified due to incorrectly aligned reads as highlighted by Qi et al. [37]. In this analysis, we defined COs and GCs in sequencing data as the intersection of unbroken runs of markers from a single parent that are surrounded or followed by runs of markers from the other parent. We defined the lengths of the unbroken runs and used this to classify events as GCs in the category of 2 bp–10 kbp (while also subsetting these GCs into regions of 20 bp–2 kbp and 2–19 bp) and classified COs as regions of sequence exchange between parents of > 10 kbp (while also subsetting these COs into regions of 10–500 kbp and > 500 kbp). Due to the large genome size of wheat, it is unknown if GCs/COs may present as longer shifts than were seen in Arabidopsis; therefore, we categorized blocks of 20 bp–2 kbp, 2–10 kbp, 10–500 kbp, and > 500 kbp and also used blocks of > 20 Mbp/> 40 Mbp to allow comparison with our previous array-based analysis. To ensure confident calls for CO/GC determination, we only considered a break in the run of markers from one parent to another if three or more SNPs on the run showed such a shift between Paragon and Chinese Spring. Since our resolution of SNPs that discriminate Chinese Spring and Paragon is 1 SNP per ~ 540 bp, we should still be able to effectively detect COs and the majority of GCs using this methodology while eliminating false positive calls.

Validation of COs and GCs from array analysis with skim sequencing

Using the array, our refinement of a CO is limited to the intersection of two 20-Mbp windows. In order to validate COs that were defined using the array with COs that were defined using skim sequencing, we looked at the interval 20 Mbp at each side of each array-defined CO. We then determined if this region overlapped a sequence exchange site or GC that we defined from the skim sequencing with high confidence, i.e., the region between the SNPs that show the Chinese Spring-Paragon shift. We expected our COs to overlap these shifts that were associated with GCs of > 20 Mbp. Table 1 details the percentage of COs that overlapped and the average length of the GC that was overlapped was in line with our expectations. Sequence exchange events defined from whole genome skim sequencing. Detailing for the 12 Paragon × Chinese Spring samples skim sequenced; the sample number, CO and frequency defined using the 35K array, % of these events validated by skim sequencing, GCs identified with track lengths > 40,000,000 bp (representative of COs), > 20,000,000 bp, > 500,000 bp, 10,000–500,000 bp, 2000–10,000 bp, and 20–2000 bp. High-confidence GCs are defined as those where the specific array SNP allele that was used to make the call could be validated in the skim sequencing To validate GCs that were defined using the array, we looked at the interval 3 Mbp around our array-defined GC (representative of the typical genome space between array SNPs if 5000 are analyzed across a 16-Gb genome) and determined if this fell within a GC that we defined from the skim sequencing. Table 1 details the percentage of array GCs that overlapped a skim sequencing GC and those overlapping skim sequencing GCs with the same SNP allele were classified as high-confidence GCs. As an additional validation for the skim sequenced sample 0004, we identified paired end sequencing reads that spanned Chinese Spring-Paragon GC shift sites and where the SNPs used to define the shifts were close enough in proximity to allow a single read pair to provide coverage for both the Chinese Spring and Paragon SNPs. At 169 shift sites, we gained a minimum of 3× coverage for both the Chinese Spring and Paragon SNPs that was made up from single read pairs and this allowed to us calculate a likely false positive rate from the perspective of read misalignment or structural reference-related problems. For correlation of array-defined GCs with skim sequencing-defined GCs, we normalized our skim sequencing-defined GC frequencies according to the number of SNP sites that show sequencing coverage in the individual datasets where Paragon and Chinese Spring can be discriminated since this has a direct impact on GC calls.

QTL analysis

Our QTL analysis implemented genetic maps for each individual population that have been previously generated and are publicly available (see “Availability of data and materials”). QTL calculation and plotting of logarithm of odds (LOD) scores were conducted using R package “qtl,” in the first step as a single QTL model employing the extended Haley-Knott method on genotypes. Significant thresholds for the QTLs were calculated from the data distribution. Final QTL LOD scores and effects were received from a multiple QTL model, using the QTL detected in the initial scan.

Immunolocalization

Immunolocalization was performed using the protocol from Higgins et al. [20] using primary antibodies anti-ASY1 (guinea pig) and anti-γH2A.X (rabbit) (Millipore). Secondary antibodies: anti-guinea pig 488 (false colored to blue for contrast with red) and anti-rabbit Alexa 594 (Invitrogen). Images were captured using Nikon NIS-Elements software and processed with the Mexican Hat function. Individual γH2A.X foci co-localizing with ASY1 were marked using the counting tool. Includes supplemental Information in PDF format including Supplemental Notes S1-S9, Supplemental Figures S1-S9 and Supplemental Tables S1-S7. (PDF 6178 kb)

52 in total

1. Great majority of recombination events in Arabidopsis are gene conversion events.

Authors: Sihai Yang; Yang Yuan; Long Wang; Jing Li; Wen Wang; Haoxuan Liu; Jian-Qun Chen; Laurence D Hurst; Dacheng Tian
Journal: Proc Natl Acad Sci U S A Date: 2012-12-03 Impact factor: 11.205

2. Detection of genomic variations and DNA polymorphisms and impact on analysis of meiotic recombination and genetic mapping.

Authors: Ji Qi; Yamao Chen; Gregory P Copenhaver; Hong Ma
Journal: Proc Natl Acad Sci U S A Date: 2014-06-23 Impact factor: 11.205

3. High-Resolution Mapping of Crossover Events in the Hexaploid Wheat Genome Suggests a Universal Recombination Mechanism.

Authors: Benoit Darrier; Hélène Rimbert; François Balfourier; Lise Pingault; Ambre-Aurore Josselin; Bertrand Servin; Julien Navarro; Frédéric Choulet; Etienne Paux; Pierre Sourdille
Journal: Genetics Date: 2017-05-22 Impact factor: 4.562

4. The role of RPA2 phosphorylation in homologous recombination in response to replication arrest.

Authors: Wei Shi; Zhihui Feng; Jiuqin Zhang; Ignacio Gonzalez-Suarez; Robert P Vanderwaal; Xiaohua Wu; Simon N Powell; Joseph L Roti Roti; Susana Gonzalo; Junran Zhang
Journal: Carcinogenesis Date: 2010-02-03 Impact factor: 4.944

5. The RanGTP gradient - a GPS for the mitotic spindle.

Authors: Petr Kalab; Rebecca Heald
Journal: J Cell Sci Date: 2008-05-15 Impact factor: 5.285

6. The CAF-1 and Hir Histone Chaperones Associate with Sites of Meiotic Double-Strand Breaks in Budding Yeast.

Authors: Elsa Brachet; Claire Béneut; Maria-Elisabetta Serrentino; Valérie Borde
Journal: PLoS One Date: 2015-05-04 Impact factor: 3.240

7. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies.

Authors: Alexandros Stamatakis
Journal: Bioinformatics Date: 2014-01-21 Impact factor: 6.937

8. Conversion of array-based single nucleotide polymorphic markers for use in targeted genotyping by sequencing in hexaploid wheat (Triticum aestivum).

Authors: Amanda J Burridge; Paul A Wilkinson; Mark O Winfield; Gary L A Barker; Alexandra M Allen; Jane A Coghill; Christy Waterfall; Keith J Edwards
Journal: Plant Biotechnol J Date: 2017-10-23 Impact factor: 9.803

9. The genetic architecture of genome-wide recombination rate variation in allopolyploid wheat revealed by nested association mapping.

Authors: Katherine W Jordan; Shichen Wang; Fei He; Shiaoman Chao; Yanni Lun; Etienne Paux; Pierre Sourdille; Jamie Sherman; Alina Akhunova; Nancy K Blake; Michael O Pumphrey; Karl Glover; Jorge Dubcovsky; Luther Talbert; Eduard D Akhunov
Journal: Plant J Date: 2018-07-19 Impact factor: 6.417

10. Fast and accurate short read alignment with Burrows-Wheeler transform.

Authors: Heng Li; Richard Durbin
Journal: Bioinformatics Date: 2009-05-18 Impact factor: 6.937

20 in total

1. MutS homologue 4 and MutS homologue 5 Maintain the Obligate Crossover in Wheat Despite Stepwise Gene Loss following Polyploidization.

Authors: Stuart D Desjardins; Daisy E Ogle; Mohammad A Ayoub; Stefan Heckmann; Ian R Henderson; Keith J Edwards; James D Higgins
Journal: Plant Physiol Date: 2020-06-11 Impact factor: 8.340

2. Recombination landscape divergence between populations is marked by larger low-recombining regions in domesticated rye.

Authors: Mona Schreiber; Yixuan Gao; Natalie Koch; Joerg Fuchs; Stefan Heckmann; Axel Himmelbach; Andreas Börner; Hakan Özkan; Andreas Maurer; Nils Stein; Martin Mascher; Steven Dreissig
Journal: Mol Biol Evol Date: 2022-06-11 Impact factor: 8.800

3. Trend, population structure, and trait mapping from 15 years of national varietal trials of UK winter wheat.

Authors: Oluwaseyi Shorinola; James Simmonds; Luzie U Wingen; Cristobal Uauy
Journal: G3 (Bethesda) Date: 2022-02-04 Impact factor: 3.542

4. High-resolution mapping of the Mov-1 locus in wheat by combining radiation hybrid (RH) and recombination-based mapping approaches.

Authors: Alexander Mahlandt; Nidhi Rawat; Jeff Leonard; Prakash Venglat; Raju Datla; Nathan Meier; Bikram S Gill; Oscar Riera-Lizarazu; Gary Coleman; Angus S Murphy; Vijay K Tiwari
Journal: Theor Appl Genet Date: 2021-04-08 Impact factor: 5.574

5. A Co-Expression Network in Hexaploid Wheat Reveals Mostly Balanced Expression and Lack of Significant Gene Loss of Homeologous Meiotic Genes Upon Polyploidization.

Authors: Abdul Kader Alabdullah; Philippa Borrill; Azahara C Martin; Ricardo H Ramirez-Gonzalez; Keywan Hassani-Pak; Cristobal Uauy; Peter Shaw; Graham Moore
Journal: Front Plant Sci Date: 2019-10-18 Impact factor: 5.753

Review 6. Advanced domestication: harnessing the precision of gene editing in crop breeding.

Authors: Wendy J Lyzenga; Curtis J Pozniak; Sateesh Kagale
Journal: Plant Biotechnol J Date: 2021-03-25 Impact factor: 9.803

7. FISH landmarks reflecting meiotic recombination and structural alterations of chromosomes in wheat (Triticum aestivum L.).

Authors: Yang Zou; Linrong Wan; Jie Luo; Zongxiang Tang; Shulan Fu
Journal: BMC Plant Biol Date: 2021-04-06 Impact factor: 4.215

8. Conversion between 100-million-year-old duplicated genes contributes to rice subspecies divergence.

Authors: Chendan Wei; Zhenyi Wang; Jianyu Wang; Jia Teng; Shaoqi Shen; Qimeng Xiao; Shoutong Bao; Yishan Feng; Yan Zhang; Yuxian Li; Sangrong Sun; Yuanshuai Yue; Chunyang Wu; Yanli Wang; Tianning Zhou; Wenbo Xu; Jigao Yu; Li Wang; Jinpeng Wang
Journal: BMC Genomics Date: 2021-06-19 Impact factor: 3.969

Review 9. DNA Helicases as Safekeepers of Genome Stability in Plants.

Authors: Annika Dorn; Holger Puchta
Journal: Genes (Basel) Date: 2019-12-10 Impact factor: 4.096

10. Genomic Patterns of Introgression in Interspecific Populations Created by Crossing Wheat with Its Wild Relative.

Authors: Moses Nyine; Elina Adhikari; Marshall Clinesmith; Katherine W Jordan; Allan K Fritz; Eduard Akhunov
Journal: G3 (Bethesda) Date: 2020-10-05 Impact factor: 3.154