Literature DB >> 35718938

Expressed genes and their new alleles identification during fibre elongation reveal the genetic factors underlying improvements of fibre length in cotton.

Jianjiang Ma1,2, Yafei Jiang3, Wenfeng Pei1, Man Wu1, Qifeng Ma1, Ji Liu1, Jikun Song1, Bing Jia1, Shang Liu1, Jianyong Wu1,2, Jinfa Zhang4, Jiwen Yu1,2.   

Abstract

Interspecific breeding in cotton takes advantage of genetic recombination among desirable genes from different parental lines. However, the expression new alleles (ENAs) from crossovers within genic regions and their significance in fibre length (FL) improvement are currently not understood. Here, we generated resequencing genomes of 191 interspecific backcross inbred lines derived from CRI36 (Gossypium hirsutum) × Hai7124 (Gossypium barbadense) and 277 dynamic fibre transcriptomes to identify the ENAs and extremely expressed genes (eGenes) potentially influencing FL, and uncovered the dynamic regulatory network of fibre elongation. Of 35 420 eGenes in developing fibres, 10 366 ENAs were identified and preferentially distributed in chromosomes subtelomeric regions. In total, 1056-1255 ENAs showed transgressive expression in fibres at 5-15 dpa (days post-anthesis) of some BILs, 520 of which were located in FL-quantitative trait locus (QTLs) and GhFLA9 (recombination allele) was identified with a larger effect for FL than GhFLA9 of CRI36 allele. Using ENAs as a type of markers, we identified three novel FL-QTLs. Additionally, 456 extremely eGenes were identified that were preferentially distributed in recombination hotspots. Importantly, 34 of them were significantly associated with FL. Gene expression quantitative trait locus analysis identified 1286, 1089 and 1059 eGenes that were colocalized with the FL trait at 5, 10 and 15 dpa, respectively. Finally, we verified the Ghir_D10G011050 gene linked to fibre elongation by the CRISPR-cas9 system. This study provides the first glimpse into the occurrence, distribution and expression of the developing fibres genes (especially ENAs) in an introgression population, and their possible biological significance in FL.
© 2022 The Authors. Plant Biotechnology Journal published by Society for Experimental Biology and The Association of Applied Biologists and John Wiley & Sons Ltd.

Entities:  

Keywords:  Expression new alleles; Extremely expressed genes; Fibre length; Gossypium; Intragenic recombination; eQTLs

Mesh:

Year:  2022        PMID: 35718938      PMCID: PMC9491459          DOI: 10.1111/pbi.13874

Source DB:  PubMed          Journal:  Plant Biotechnol J        ISSN: 1467-7644            Impact factor:   13.263


Introduction

Hybridization is a classic breeding strategy that provides opportunities to combine the abundant variations of parents in progeny. Interactions between transcriptional networks of parents are assumed to contribute to these ideal phenotypes by inducing novel patterns of gene expression and sequence variants (Landry et al., 2007; Liu et al., 2018; Mackay et al., 2021; Pan et al., 2016). However, the systematic study of gene expression and sequence variants potentially influencing agronomic trait improvement in breeding populations remains limited. With the rapid development of sequencing technologies, several additional factors derived from hybridization have been identified as related to favourable traits. For example, misexpressed genes, which have higher or lower expression than either parent, are caused by genetic recombination and are potential factors for associated traits in some organisms (Shivaprasad et al., 2012; Liu et al., 2018; McGirr and Martin, 2021). Furthermore, intragenic recombination was ignored in higher eukaryotes until Nelson showed that a recombination event between two wx mutant alleles created a novel sequence variant of the wx allele (i.e., a new allele) leading to a revertant pollen grain (Nelson, 1959). High‐throughput genotyping approaches have since replaced single‐gene intragenic studies to characterize the outcomes of recombination (Brown et al., 2020; Brunner et al., 2008; Kelly et al., 2010; Liu et al., 2018; Zhang et al., 2020). Numerous new alleles generated by intragenic recombination have been identified in several plant species including maize, and some have shown transgressive expression (Liu et al., 2018). Additionally, expression quantitative trait locus (eQTL) mapping has emerged as a potentially powerful new approach to investigate the link between gene expression variation and phenotype, and the genetic network of genes in specific tissues has been clarified in many organisms (Li et al., 2020; Liu et al., 2017; Pang et al., 2019; Vosa et al., 2021; Wang et al., 2014; Wang et al., 2020a). Regarding structural genomics, QTL mapping is a common genetic approach to expound on the genetic variations of phenotypic differentiation in various organisms (Beche et al., 2020; Gangurde et al., 2020; Gu et al., 2020; Li et al., 2016). Therefore, transcriptomics, genomics and phenomics can be efficiently integrated to clarify the genetic factors of expression genes and their new alleles underlying improvements in agronomic traits in hybrid populations. Cotton fibres, the longest, fastest‐growing cells in plants, are an indispensable natural source in the textile industry. Upland cotton (Gossypium hirsutum L.) has a high lint yield and broad adaptation, accounting for more than 95% of the world's cotton production (Wang et al., 2019a). Compared with upland cotton, extralong staple, Pima, Egyptian or Sea Island cotton (Gossypium barbadense) has high fibre quality, but low yields and narrow production areas due to its requirements for high temperature and low humidity (Zhang et al., 2014). The major goal of cotton breeding is to develop upland cotton cultivars that produce fibres of superior quality, particularly with respect to the fibre length (FL) trait. To date, genetic recombination between G. hirsutum and G. barbadense is commonly used to select progeny with a long FL trait, and several hundred candidate QTLs have been identified for FL (www.cottonqtldb.org) (Chandnani et al., 2018; Chen et al., 2018; He et al., 2006; Lacape et al., 2009; Li et al., 2019a; Said et al., 2015; Shi et al., 2020; Wang et al., 2020; Yu et al., 2013). The elongation of fibres is a subtle and complex regulatory process that occurs after initiation (−3 to 3 days post‐anthesis, dpa) and lasts until 25–30 dpa (Hu et al., 2019). At this stage, the fibres from 3 to 15 dpa exhibit a greater elongation rate, which determines the ultimate length (Ma et al., 2018a). To date, numerous studies of transcriptome analysis of fibre development have been reported. One of the recent studies investigated fibre transcriptomes at 15 dpa for 251 accessions of natural upland cotton, contributing to uncovering the genetic regulatory network orchestrating the initiation of secondary cell wall development in cotton (Li et al., 2020). However, transcriptome analysis of introgression lines from interspecific hybrids between G. hirsutum and G. barbadense is lacking. More importantly, focusing on the fibre transcriptomes of only a single time point does not allow monitoring of the dynamic patterns of differential expression (DE) or regulatory variation throughout fibre elongation. Additionally, a knowledge gap persists regarding extremely expressed genes (eGenes) (gene activation and silencing) and new alleles that contribute to agronomic trait improvement based on G. hirsutum × G. barbadense interspecific breeding in cotton. In the present study, using the case of the FL trait tested from multiple environments, we combined genome resequencing data from an interspecific backcross inbred line (BIL) population of 191 lines derived from a cross between upland cotton CRI36 and G. barbadense Hai7124, with fibre transcriptome data at 5, 10 and 15 dpa from 47 BILs with contrasting FL. We aimed to (1) identify expression new alleles (ENAs) from intragenic recombination in fibre transcriptomes and explore their potentially influence on FL; (2) elaborate on the “extremely eGenes” generated near recombination hotspots and their impact on FL; and (3) provide insights into the regulatory “temporal–spatial” landscape of dynamic fibre elongation and search for candidate genes related to FL.

Results

Resequencing‐based SNP mapping and FL‐QTL analysis

Based on the genome resequencing data for the interspecific BIL population of 191 lines, a total of 11 892 179 single nucleotide polymorphisms (SNPs) were identified to construct an extra high‐density genetic map with 7558 bin markers, which spanned 5474.49 cM in the genetic distance across 26 chromosomes, with a mean marker density of 0.72 cM per bin (Figure S1; Table S1). However, recombination events did not occur evenly in that 269 recombination hotspots (RHs) were identified with 4 (on D04, D06 and D13) to 26 (on A09) RHs in this BILs population (Figure 1b; Table S2). Overall, the collinearity coefficient between the genetic and physical maps was 0.87 among the 26 chromosomes (Figure S2), and the genetic distance was also significantly positively correlated with the physical distance on a chromosome basis (r = 0.63; P = 5.77E‐4), indicating the reliability of the constructed genetic map.
Figure 1

Chromosomal features of a G. hirsutum × G. barbadense interspecific population based on integrated genetic and transcriptome data. (a) A total of 11 892 179 high‐quality SNP markers (aa × bb) in 1 Mb sliding windows are shown. (b) Distribution of the 7558 bin markers. The red bar indicates the recombination hotspots. (c) Distribution of the expressed genes at 5 dpa. Three types of genes are shown in yellow (common genes), red (new alleles), and blue (AGEs and SGEs). For each color, dark to light represents high to low expression, respectively. Similar to (c), (d) and (e) indicate the distribution of the expressed genes at 10 and 15 dpa. (f) Distribution of DEGs at the three time points. The DEGs at 5, 10 and 15 dpa are shown in red, yellow and green, respectively. (g) The FL‐QTLs (orange bar) identified in the present study. [Colour figure can be viewed at wileyonlinelibrary.com]

Chromosomal features of a G. hirsutum × G. barbadense interspecific population based on integrated genetic and transcriptome data. (a) A total of 11 892 179 high‐quality SNP markers (aa × bb) in 1 Mb sliding windows are shown. (b) Distribution of the 7558 bin markers. The red bar indicates the recombination hotspots. (c) Distribution of the expressed genes at 5 dpa. Three types of genes are shown in yellow (common genes), red (new alleles), and blue (AGEs and SGEs). For each color, dark to light represents high to low expression, respectively. Similar to (c), (d) and (e) indicate the distribution of the expressed genes at 10 and 15 dpa. (f) Distribution of DEGs at the three time points. The DEGs at 5, 10 and 15 dpa are shown in red, yellow and green, respectively. (g) The FL‐QTLs (orange bar) identified in the present study. [Colour figure can be viewed at wileyonlinelibrary.com] Phenotypically, the G. barbadense parent Hai7124 (31.08 mm) had significantly longer fibres than the upland parent CRI36 (26.27 mm) across the nine testing environments (Table S3). The FL in the BIL population followed a normal distribution in each environment and ranged from 20.42 to 34.46 mm (Figure S3; Table S3), exhibiting transgressive segregation in both directions. The variation in FL was dissected using the bin map, resulting in the detection of 25 FL‐QTLs on 14 chromosomes, each with 4.32–13.34% of the phenotypic variance explained (PVE) (Figure 1g; Table S4). Importantly, 11 (44%) of the identified QTLs were among the 166 FL‐QTLs previously published in interspecific populations of G. hirsutum × G. barbadense between 2003 and 2020 (Table S5). Among the 166 QTLs, 35 were reported in multiple studies or multiple environments in one (but not the same) study and were therefore named meta‐QTLs (i.e., mQTLs). These 35 mQTLs were integrated with the 25 FL‐QTLs identified in this study for the subsequent analysis (Table S6).

Identification of new alleles within eGenes resulted from intra‐genic recombination events

Based on a total of 277 dynamic transcriptomes from 47 BILs in developing fibres at 5, 10 and 15 dpa, new alleles within express genes (called expression new alleles, ENAs) resulted from intra‐genic recombination were identified, as compared to the parental sequences. Because the number of eGenes differed, the count of ENAs ranged from 531 to 2824, 362 to 1515 and 348 to 2063 at 5, 10 and 15 dpa fibres, respectively. The reliability of the ENAs was validated by Sanger sequencing of three randomly selected genes (Figure S4). For example, the Ghir_A10G021280 gene sequence at both the DNA level and its transcript sequence level in BIL line L12 was consistent, showing recombination of the gene sequence from the upland cotton parent CRI36 and the TM‐1 reference genome from 1 to 2084 bp and that from the G. barbadense parent Hai7124 and its reference genome from 2085 to 2348 bp (Figure S4a). Because intra‐genic recombination occurred between homologous chromosomes from the two parents during the development of the BILs regardless of its gene expression, all the ENAs from the three time points of the same BIL were merged, resulting in the identification of 10 366 ENAs in 35 420 eGenes (Table S7). However, 31.17% (3231) ENAs were identified in only one (but not the same) individual BIL or another, and common ENAs among BILs gradually decreased as the number of BILs increased, as expected (Figure S5). Across the 47 BILs, ENAs differed among chromosomes ranging from 211 on A04 to 662 on A05, with no subgenome bias (Figure 2a and b). The ENAs were distributed predominantly in telomeric regions (Figure 1c, d and e). The number of ENAs within the 1 Mb window along each chromosome was highly correlated with the number of bin markers (see the previous section) within the same 1 Mb region (Figure 2c).
Figure 2

Distribution of ENAs in the genome. (a) Counts of ENAs in 26 chromosomes. Yellow and blue bars represent the 13 chromosomes in subgenomes A and D, respectively. ns indicates no significant difference by Student's t‐test. (b) Frequencies (count of ENAs/total genes in individual chromosomes) of the ENAs in 26 chromosomes. Yellow and blue bars represent the 13 chromosomes in subgenomes A and D, respectively. (c) Pearson correlation analysis between recombination bin markers and ENAs with 1 Mb sliding windows across the whole genome. *** Represents a significant correlation at P = 0.001. [Colour figure can be viewed at wileyonlinelibrary.com]

Distribution of ENAs in the genome. (a) Counts of ENAs in 26 chromosomes. Yellow and blue bars represent the 13 chromosomes in subgenomes A and D, respectively. ns indicates no significant difference by Student's t‐test. (b) Frequencies (count of ENAs/total genes in individual chromosomes) of the ENAs in 26 chromosomes. Yellow and blue bars represent the 13 chromosomes in subgenomes A and D, respectively. (c) Pearson correlation analysis between recombination bin markers and ENAs with 1 Mb sliding windows across the whole genome. *** Represents a significant correlation at P = 0.001. [Colour figure can be viewed at wileyonlinelibrary.com] The 47 BILs and two parents were grouped into 2 clusters using the whole genome‐wide SNPs, with 15 SF and 6 LF BILs and Hai7124 in cluster A and 11 SF and 15 LF BIL and CRI36 in cluster B (Figure S6). The results indicated that the overall variation in SNPs was not associated with FL, consistent with the previously QTL mapping results that only a small number of SNPs were linked with FL. However, the broad extent of ENAs in expressed fibre genes suggests that some of them may be a new driving force to generate significant genetic variation in FL. To test this hypothesis, the SNPs within the 2‐kb region upstream and downstream of each ENA were extracted and used in cluster analysis. The resulted genetic distance between BILs was correlated with that calculated based on the genome‐wide SNPs (r = 0.65, P = 8.05E‐7), suggesting an overall congruence between existing SNPs and ENAs. Although the mean genetic distance between each BIL and the parents had an overall significant positive correlation with the number of ENAs in the BILs (Figure 3c), the 47 BILs and the two parents were grouped into 3 clusters (Figure 3a). Cluster І contained the LF Hai7124 parent and more SF (11) and less LF (4) BILs, and cluster П contained the SF CRI36 parent and similar numbers of SF (13) and LF (9) BILs (Figure 3a). However, cluster Ш was unique with more LF (8) and less SF (2) and different from the two parents and the other two clusters (Figure 3a). Furthermore, cluster Ш also had a significantly higher number of ENAs (2465) than cluster І (1,004; P = 2.62E‐6) and П (1,228; P = 2.38E‐6) which has a similar number of ENAs (P = 0.14; Figure 3b). The above results implied that some of the ENAs may be one of the driving forces for trait differentiation (here, FL) in artificial genetic populations, which were identified and further analysed (see the next section).
Figure 3

The accumulation of ENAs drives the differentiation of organisms. (a) A neighbour‐joining tree constructed based on the adjacent SNPs of the ENAs in 47 BILs and parents. Parent Hai7124 and parent CRI36 are present in clusters І and П, respectively. BILs in cluster Ш had a large genetic distance from the parents. Pie charts represent the ratio of the long‐(red) and short‐(green) fibre individuals in each cluster. (b) Number of the ENAs in the three clusters. ns indicates no significant difference by Student's t‐test. *** Represents a significant difference by Student's t‐test at P = 0.001. (c) Correlation analysis between the mean genetic distance and ENAs in BILs. *** Represents a significant correlation at P = 0.001. [Colour figure can be viewed at wileyonlinelibrary.com]

The accumulation of ENAs drives the differentiation of organisms. (a) A neighbour‐joining tree constructed based on the adjacent SNPs of the ENAs in 47 BILs and parents. Parent Hai7124 and parent CRI36 are present in clusters І and П, respectively. BILs in cluster Ш had a large genetic distance from the parents. Pie charts represent the ratio of the long‐(red) and short‐(green) fibre individuals in each cluster. (b) Number of the ENAs in the three clusters. ns indicates no significant difference by Student's t‐test. *** Represents a significant difference by Student's t‐test at P = 0.001. (c) Correlation analysis between the mean genetic distance and ENAs in BILs. *** Represents a significant correlation at P = 0.001. [Colour figure can be viewed at wileyonlinelibrary.com]

Influence of ENAs on FL

Overall, the number of ENAs (mean of 1896) in the LF group with 17 BILs was significantly greater than that in the SF group with 24 BILs (1097) (P = 7.41E‐5; Figure S7). Correlation analysis confirmed that the number of ENAs was significantly positively correlated with FL (r = 0.40; P = 0.009; Table S8). Importantly, ENAs were located within 25 FL‐QTL regions identified in this study. For 88 eGenes, on average, BILs with an ENA (named “R”), had significantly different FL than those with the CIR36 allele and/or Hai7124 allele, including 42 ENAs producing greater FL than the Hai7124 allele (Table S9). According to the genotype of 10 366 genes, the 47 BILs and parents were distinguished from the CRI36, Hai7124, heterozygosity and ENA alleles, and performed a GWAS for FL trait with MLM (PCA + K) model. We identified seven QTLs for FL, four of which were consistent with those detected previously. However, the other three QTLs were new (Table S10). A further t‐test showed that three ENAs (named “R alleles”) with each from Ghir_D03G012070, Ghir_D03G012240 and Ghir_D03G012780 in the new QTL region on D03 produced average longer fibres than their CRI36 alleles in the BILs; and the BILs with the CIR36 and Hai7124 alleles did not differ in FL (Figure S8a). The results demonstrate that new alleles through intra‐genic recombination affected FL, and the inclusion of ENAs as a new set of markers facilitated the identification of QTLs that were undetected based on traditional biallelic markers. Additionally, combined with previous studies, 18 genes involved in fibre development were identified to possess ENAs (Table S11; Greer et al., 2007; Huang et al., 2008; Huang et al., 2013; Gong et al., 2014; Xiao et al., 2016; Zhang et al., 2017b; Zhang et al., 2017c; Zhang et al., 2017a; Zhao et al., 2018; Gao et al., 2019; Salih et al., 2019; Wang et al., 2019c; Cao et al., 2020; Liu et al., 2020; Wang et al., 2021; Wang et al., 2022). For example, the transcription factor Ghir_D05G035270 (GhTCP4D) has been identified to play a critical role in balancing cotton fibre cell elongation and wall synthesis in cotton (Cao et al., 2020), and Ghir_D02G012370 encodes a 3‐ketoacyl‐CoA synthase (GhKCS10), which regulates fibre elongation through the very‐long‐chain fatty acid pathway (Xiao et al., 2016). The FL of BILs with five ENA alleles was significantly greater than that of the CRI36 allele (Figure S8b). However, no significant difference was detected between the CRI36 and Hai7124 alleles.

Expression levels of ENAs and association with FL

ENAs affected gene expressions negatively or positively beyond the boundary of parents, defined as transgressive transcripts here, which in turn could be involved in fibre elongation. Here, transgressive transcripts in ENAs were identified to be expressed at levels 5% higher or lower than the most extreme value observed across the BIL set (and their parents) for nonrecombinant parental alleles (Table S12). As a result, of the 10 366 ENAs, 1255, 1131 and 1056 exhibited transgressive expression at 5, 10 and 15 dpa fibres, respectively; and 323 of which simultaneously exhibited transgressive transcript accumulation at all the three time points (Figure S9). Furthermore, 2300 ENAs were identified in FL‐QTL regions, 520 of which showed transgressive expressions at one or more time points (Table S13). In parallel, 26 of the 2300 ENAs exhibited DE between the LF and SF groups (Table 1). Among the 26 genes, a fasciclin‐like arabinogalactan gene (GhFLA9, Ghir_D12G015320), which has a transgressive transcript at 5 and 10 dpa (Figure 4b) and shows DE at 10 dpa (Table S14), and one of its homologous gene, GhFLA1 (Ghir_D09G011010), is involved in fibre initiation and elongation (Huang et al., 2013). Sanger sequencing revealed the ORF of GhFLA9 with two SNPs between Hai7124 (A594, T694) and CRI36 (G594, G694) (Figure S10). Combined with the genome resequencing and transcriptome data, we identified that GhFLA9 exhibited an R allele (A594, G694) in several BILs (Figure 4a). The SNP at 594 bp position was synonymous while the SNP at the 694 bp position was nonsynonymous. Phyre2 was used to predict the 3D structure of the GhFLA9 amino acids in Hai7124 (S198, F232), CRI36 (S198, V232) and R (S198, V232), and the mutant amino acids at 232 showed a 3D structural difference between Hai7124 and CRI36, which probably led to a functional difference in GhFLA9 (Figure 4a). Although there was no significant expression difference in GhFLA9 between the BILs with the Hai7124 allele and the R allele in BILs at 5, 10 and 15 dpa, the BILs showed a significantly higher expression than those with the CRI36 allele at 5 and 10 dpa (Figure 4c). This result suggests that the expression level of GhFLA9 (R allele) was derived from the parent Hai7124 drive by its promoter. To confirm, SNPs of the GhFLA9 promoter (−2500–0 bp) in the 47 BILs and parents were extracted using the genome resequencing data. BILs with the R allele (GhFLA9) had the same promoter sequences as Hai7124 and therefore similar SNPs as compared to CIR36 (Figure 4a). Therefore, the R and CRI36 alleles have the same amino acid sequence of GhFLA9, but its expression in the R allele was controlled by the promoter of Hai7124 in BILs with the R allele (Figure 4a). The FL in BILs with the R allele of GhFLA9 was significantly greater than that of the BILs with the CRI36 allele, suggesting that the increased expression of GhFLA9 (CRI36 allele) could improve the FL in cotton (Figure 4c).
Table 1

Summary of the differentially expressed new alleles in FL‐QTLs

QTLGeneAnnotationTETime point for TE
mqFL‐lg1‐1 Ghir_A01G004450 Kinase superfamily proteinNo
mqFL‐lg2‐1 Ghir_A02G003210 Primary amine oxidaseNo
qFL‐lg2‐1; mqFL‐lg2‐2 Ghir_A02G016340 Cytochrome c oxidase subunit 6aNo
mqFL‐lg2‐2 Ghir_A02G017640 Transducin/WD40 repeat‐like superfamily proteinNo
mqFL‐lg4‐1 Ghir_A04G002840 NAC domain‐containing 18‐like proteinNo
mqFL‐lg5‐1 Ghir_A05G002650 Hypothetical protein F383_23056No
mqFL‐lg7‐1 Ghir_A07G010680 Expansin‐A4‐like proteinYes10 dpa
qFL‐lg11‐1 Ghir_A11G002730 Pentatricopeptide repeat superfamily proteinNo
qFL‐lg11‐1 Ghir_A11G003570 TCP domain‐like protein IYes5 dpa
mqFL‐lg11‐1 Ghir_A11G005760 Ras‐related Rab11AYes5 dpa
mqFL‐lg14‐1 Ghir_D01G000700 Transmembrane 53‐BYes10 dpa
mqFL‐lg15‐1 Ghir_D02G002260 Putative receptor protein kinase TMK1Yes5, 10 and 15 dpa
qFL‐lg15‐1 Ghir_D02G023110 PAP10Yes10 dpa
mqFL‐lg16‐1 Ghir_D03G015460 ZIP metal ion transporter familyYes10 dpa
qFL‐lg18‐2 Ghir_D05G028710 Hypothetical protein F383_16006No
qFL‐lg18‐1 Ghir_D05G035850 Indole‐3‐acetic acid‐amido synthetaseNo
mqFL‐lg19‐1 Ghir_D06G004810 l‐ascorbate peroxidase TYes15 dpa
mqFL‐lg22‐1 Ghir_D09G021070 Phosphoglucan proteinYes5, 10 and 15 dpa
mqFL‐lg22‐1 Ghir_D09G022220 Putative serine/threonine‐protein kinaseNo
qFL‐lg23‐2; mqFL‐lg23‐1 Ghir_D10G004070 Putative leucine‐rich repeat family proteinYes10 dpa
qFL‐lg23‐2 Ghir_D10G008420 Leucine‐rich repeat family protein isoform 1Yes5, 10 and 15 dpa
qFL‐lg23‐2 Ghir_D10G008990 Uncharacterized protein TCM_017094No
mqFL‐lg25‐1 Ghir_D12G015250 Cytochrome P450Yes10 dpa
mqFL‐lg25‐1 Ghir_D12G015320 Fasciclin‐like arabinogalactan protein 1Yes5 and 10 dpa
mqFL‐lg25‐1 Ghir_D12G015670 Aldehyde dehydrogenase family 2 member B4No
mqFL‐lg25‐1 Ghir_D12G021980 Phosphatidylinositol‐glycan biosynthesis class XYes5, 10 and 15 dpa

TE, transgressive expression.

Figure 4

Smart megamerger of GhFLA9 in the G. hirsutum × G. barbadense interspecific population. (a) Promoter genotype and protein structure analysis of the three types of GhFLA9 alleles in the parents and 47 BILs. The yellow arrow in the 3D protein structure indicates the mutated amino acid at 232, and the 3D protein structure in the white dot box suggests the difference (two loops for the Hai7124 allele and three loops for the CRI36 and R alleles) caused by the mutated amino acid at 232. (b) Transgressive transcript accumulation of GhFLA9 (R allele) at the fibre elongation stage. The expression level was calculated by log2(FPKM+1). The grey dot represents GhFLA9 (R allele) in some specific BILs; the blue dot represents GhFLA9 in the parents; the red dot represents GhFLA9 in the remaining BILs. (c) Expression and FL analysis among the three alleles. The expression level was calculated by log2(FPKM+1). ** Represents a significant difference by Student's t‐test at P = 0.01. ns indicates no significant difference by Student's t‐test. [Colour figure can be viewed at wileyonlinelibrary.com]

Summary of the differentially expressed new alleles in FL‐QTLs TE, transgressive expression. Smart megamerger of GhFLA9 in the G. hirsutum × G. barbadense interspecific population. (a) Promoter genotype and protein structure analysis of the three types of GhFLA9 alleles in the parents and 47 BILs. The yellow arrow in the 3D protein structure indicates the mutated amino acid at 232, and the 3D protein structure in the white dot box suggests the difference (two loops for the Hai7124 allele and three loops for the CRI36 and R alleles) caused by the mutated amino acid at 232. (b) Transgressive transcript accumulation of GhFLA9 (R allele) at the fibre elongation stage. The expression level was calculated by log2(FPKM+1). The grey dot represents GhFLA9 (R allele) in some specific BILs; the blue dot represents GhFLA9 in the parents; the red dot represents GhFLA9 in the remaining BILs. (c) Expression and FL analysis among the three alleles. The expression level was calculated by log2(FPKM+1). ** Represents a significant difference by Student's t‐test at P = 0.01. ns indicates no significant difference by Student's t‐test. [Colour figure can be viewed at wileyonlinelibrary.com]

Substantial roles of gene activation and silencing in fibre elongation

Genetic recombination can also cause gene activation (from no expression in parents) and silencing (from expression in parents). To test if these two types of extremely eGenes in dynamic fibre samples are involved in elongation and development, we first identified 169, 118 and 105 expression‐activating genes (AGEs) and 49, 30 and 33 expression‐silencing genes (SGEs) in developing fibres at 5, 10 and 15 dpa, respectively (Figure 1c, d and e; Table S15), with a total of 358 AGEs and 98 SGEs (Figure S11a and d). Semiquantitative RT‐PCR of six randomly selected genes confirmed the reliability of the AGEs/SGEs identified in this study (Figure S12a and b). The number of AGEs/SGEs differed among BILs (Figure 5a and d) and chromosomes (Figure S11b and e). The AGEs ranged from 7 (A02, A04 and D04) to 27 (D05) and SGEs ranged from 1 (A07) to 8 (A01 and D02); however, no significant subgenome (A vs. D) bias was observed (Figure S11b, c, e and f). Overall, more AGEs/SGEs were located in near telomeric regions than centromeric regions, and many AGEs/SGEs were located in the areas of recombination hotspots (RHs) (Figure 1b, c, d and e). For further analysis of the relationship between AGEs/SGEs and RHs, based on a physical window of 4, 2, 1 and 0.5 Mb for each RH, a total of 269 RHs encompassed 926, 607, 382 and 226 Mb, with 217, 180, 113 and 68 AGEs, respectively. However, the mean AGEs per Mb was similar (0.295–0.301) except for the range of 4 Mb window for an RH (0.234), higher than the genome‐wide average (0.156, i.e., 358 AGEs/2300 Mb). Similarly, RH areas also contained more SGEs, that is, 60, 44, 26 and 16 SGEs based on the physical window of 4, 2, 1 and 0.5 Mb for each RH, respectively. The average SGEs per Mb were similar (0.065–0.072) among the four physical windows, but higher than the genome‐wide average (0.042, 98 SGEs/2300 Mb; Figure 5b). This result suggests that AGEs/SGEs are more likely to occur from genes in recombination hotspot areas of chromosomes.
Figure 5

Identification of the extremely expressed genes and their function in fibre elongation. (a) Quantity distribution of the AGEs in BILs. (b) Density distribution of AGEs (blue)/SGEs (red) around recombination hotspots. (c) Twenty‐two AGEs significantly influence FL. The five‐pointed star symbol with the same color indicates that the AGE significantly influenced FL at different time points. Ten AGEs below the red line were positive for FL. (d) Quantity distribution of the SGEs in BILs. (e) Twelve SGEs significantly influence FL. The square symbol with the same color indicates that the SGE significantly influenced fibre length at different time points. Six SGEs below the red line were positive for FL. [Colour figure can be viewed at wileyonlinelibrary.com]

Identification of the extremely expressed genes and their function in fibre elongation. (a) Quantity distribution of the AGEs in BILs. (b) Density distribution of AGEs (blue)/SGEs (red) around recombination hotspots. (c) Twenty‐two AGEs significantly influence FL. The five‐pointed star symbol with the same color indicates that the AGE significantly influenced FL at different time points. Ten AGEs below the red line were positive for FL. (d) Quantity distribution of the SGEs in BILs. (e) Twelve SGEs significantly influence FL. The square symbol with the same color indicates that the SGE significantly influenced fibre length at different time points. Six SGEs below the red line were positive for FL. [Colour figure can be viewed at wileyonlinelibrary.com] The expression levels of 31 AGEs were significantly correlated with FL (P < 0.05). A further t‐test detected significant differences in 22 AGEs (including 5 AGEs at two time points) between the expression‐activating group (FPKM ≥ 1) and no‐expression group (0 ≤ FPKM < 1) of the 47 BILs, 10 of which showed positive effects on FL (Figure 5c). Gene ontology (GO) analysis of the 22 AGEs showed that Ghir_A02G000150 (auxin efflux carrier protein) and Ghir_A10G002710 (E3 ubiquitin‐protein ligase) were likely involved in some important metabolic processes related to fibre development (Table S16). Additionally, by combining the 22 AGEs with the FL‐QTLs, we identified Ghir_A02G016670 (mitochondrial mRNA‐processing protein) located in qFL‐lg2‐1. In parallel, the expression levels of 12 SGEs (including 2 SGEs at two time points) were significantly associated with FL, 6 of which showed that silencing of the expression (0 ≤ FPKM < 1) had a positive effect on FL (Figure 5e). We identified a FAD gene, involved in fatty acid metabolic processes (Ghir_D02G012480) in qFL‐lg15‐3, is likely related to fibre elongation (Table S16). The above results showed that AGEs/SGEs might be genetic factors associated with FL variation in cotton.

eGWAS of the rapid elongation stage of fibres

Genetic recombination supports an ideal model to reveal the regulatory architecture of genes at the whole‐genome level by constructing a critical regulator network to understand the biological processes and developmental programmes of organisms (Wang et al., 2014). Here, we identified 11 211 eQTLs associated with 4404 eGenes at 5 dpa (Table S17), 15 557 eQTLs associated with 8335 eGenes at 10 dpa (Table S17) and 10 456 eQTLs associated with 4145 eGenes at 15 dpa (Table S17). As described by Li et al. (2020), we adopted a 1‐Mb distance between eQTLs and eGenes to define cis‐(≤1 Mb) or trans‐eQTLs (>1 Mb on the same chromosomes or on different chromosomes). We then identified 3132, 2589 and 2370 cis‐eQTLs and 8079, 12 968 and 8086 trans‐eQTLs at 5, 10 and 15 dpa, respectively (Table S17), which revealed a higher frequency of cis‐eQTLs (27.94%) at 5 dpa than at 10 (16.64%) and 15 dpa (22.67%; Figure S13a). Among the three time points, the frequency of the trans‐eQTLs was significantly higher than that of the cis‐eQTLs (P = 2.72E‐4; Figure S13b). However, the mean interpretation of the expression variation of cis‐eQTLs for the three time points was 0.43, which was significantly higher than that of 0.39 obtained for trans‐eQTLs (P = 2.05E−129; Figure S13c); this result agrees with similar findings obtained with upland cotton and other organisms (Li et al., 2020; Liu et al., 2015; Liu et al., 2017; Wang et al., 2014). Furthermore, we estimated the effects of eQTLs on eGenes by calculating the proportion of the explained variance at the three time points. We found that 55.42% of cis‐eQTLs explained more than 40% of the expression variation, and this number declined to 33.26% for trans‐eQTLs. These results indicate that cis‐eQTLs have a greater impact on eGenes than trans‐eQTLs, and a similar trend was found in rice (Wang et al., 2014).

Combination of the FL‐QTL and eQTL results for candidate gene isolation

We focused on the eGenes that colocalized with the FL trait. Therefore, we identified 411, 393 and 387 eQTLs overlapping with FL‐QTL regions, corresponding to 1286, 1089 and 1059 eGenes at 5, 10 and 15 dpa, respectively (Table S18). To characterize the complex dynamic regulatory relationship between eQTLs and eGenes, we constructed a dynamic eQTL network for fibre elongation and development. In this network, some eQTLs exhibited spatiotemporally specific regulation (Figure 6a). Here, we defined that the eQTL controlled a cluster of eGenes (>20 eGenes), more than 90% of which were derived from only one time point and were regarded as a spatiotemporal‐specific regulation eGene cluster (SSREC). We identified 2, 2 and 3 SSRECs at 5, 10 and 15 dpa; for example, the Sbin2788 marker (overlapping with mqFL‐lg9‐1) controls 132 eGenes at 5 dpa, three eGenes at 10 dpa and one eGene at 15 dpa; the Sbin4252 marker (overlapping with mqFL‐lg25‐1) controls 75 eGenes at 10 dpa and only two eGenes at 5 dpa (Figure 6a). By contrast, the other eQTL showed common regulation (CREC) at the three time points; for example, the Sbin2424 marker (overlapping with qFL‐lg23‐2) controls 11, 12 and 12 eGenes at 5, 10 and 15 dpa, respectively, and contributes to connecting the dynamic regulatory network of fibre elongation and development (Figure 6a).
Figure 6

Identification of eGenes for fibre elongation and development. (a) Dynamic regulatory network between eQTLs and eGenes. The green, yellow and pink dots represent the eGenes at 5, 10 and 15 dpa, respectively. The blue dot represents the eQTL. The grey font represents the spatiotemporal‐specific regulatory eGene cluster; the purple font shows a common regulatory eGene cluster at the three time points. (b) Venn diagram of the number of eGenes at the three time points. The black number represents the core eGenes, and the blue number represents the specific eGenes. (c) Expression pattern of the 105 DEGs at the three time points. LF and SF indicate the long‐fibre and short‐fibre groups, respectively; GhMAH1 (Ghir_D10G011050) is highlighted in red font; 15 genes with blue font generated their ENAs in BILs. (d) Gene Ontology (GO) enrichment of the 105 DEGs. Cellular component (CC); Molecular function (MF); Biological process (BP). [Colour figure can be viewed at wileyonlinelibrary.com]

Identification of eGenes for fibre elongation and development. (a) Dynamic regulatory network between eQTLs and eGenes. The green, yellow and pink dots represent the eGenes at 5, 10 and 15 dpa, respectively. The blue dot represents the eQTL. The grey font represents the spatiotemporal‐specific regulatory eGene cluster; the purple font shows a common regulatory eGene cluster at the three time points. (b) Venn diagram of the number of eGenes at the three time points. The black number represents the core eGenes, and the blue number represents the specific eGenes. (c) Expression pattern of the 105 DEGs at the three time points. LF and SF indicate the long‐fibre and short‐fibre groups, respectively; GhMAH1 (Ghir_D10G011050) is highlighted in red font; 15 genes with blue font generated their ENAs in BILs. (d) Gene Ontology (GO) enrichment of the 105 DEGs. Cellular component (CC); Molecular function (MF); Biological process (BP). [Colour figure can be viewed at wileyonlinelibrary.com] Furthermore, 334 eGenes were present in at least two time points and were defined as core eGenes; the remaining 2437 were defined as specific eGenes (Figure 6b). To characterize the dynamic biological role of fibre elongation, we performed GO enrichment analysis of the core eGenes and specific eGenes in the network at 5, 10 and 15 dpa. The enrichment of core eGenes was observed in the cytoplasmic part and mitochondria of cellular components (Table S19). However, the specific eGenes at 5 dpa were enriched in the metabolic process and catalytic activity among the biological process and molecular function categories (Table S19). Those at 10 dpa were enriched in cellular component and cell wall organization within the biological process category and catalytic activity within the molecular function category (Table S19). The eGenes at 15 dpa were enriched in catalytic activity within the molecular function category and single‐organism process within the biological process category (Table S19). We assumed that the significant GO terms for specific eGenes represent processes that may clarify the difference in the fibre‐cell elongation rate from 5 to 15 dpa. In total, 105 FL‐colocalized eGenes with DE in at least one time point between the LF and SF groups were identified in the dynamic eQTL network (Figure 6c). GO analysis of the 105 eGenes showed enrichment in cell wall macromolecule metabolic process in biological process, anchored component of the membrane in a cellular component, and alkane 1‐monooxygenase activity in molecular function (Figure 6d). Further analysis revealed DE of an alkane hydroxylase gene (Ghir_D10G011050, GhMAH1) at 10 dpa (Figure 7c), which is located in qFL‐lg23‐2 in numerous studies (Table S4) (Guo et al., 2018; Ma et al., 2018a; Ma et al., 2018b; Shi et al., 2019; Shi et al., 2020; Sun et al., 2017) and was speculated to be involved in the FL trait in our previous study (Ma et al., 2018a). Hence, we designed a mutagenesis experiment with GhMAH1 using the CRISPR–Cas9 system and obtained lines with a guanine deoxyribonucleotide insertion in 15 103 937 bp of D10 and a 4 bp (D10: 15 103 937–15 103 940)/9 bp (D10: 15 103 936–15 103 944) deletion by Sanger sequencing (Figure 7a and b). Compared with GhMAH1 (wild type, WT), Ghmah1 lines (+1 bp) and (−4 bp) gained stop codons at 55 and 37, respectively; the Ghmah1 line (−9 bp) lost three amino acids at 35–37. After measuring the FL of the three Ghmah1 lines (T1) and WT, the mean FL of the three Ghmah1 lines (25.8 ± 0.4 mm) was significantly shorter than that of the WT plants (27.5 ± 0.4 mm; P = 1.33E−4; Figure 7d and e). This result suggested that GhMAH1 is involved in fibre elongation. Additionally, we identified 15 of the 105 eGenes that generated their new alleles in BILs (Figure 6c). Interestingly, three of the 15 eGenes are involved in the key processes of the initiation and growth of fibre cells, such as the glycerol‐3‐phosphate acyltransferase gene Ghir_D04G013940 (GhGPAT6) and fasciclin‐like arabinogalactan protein genes Ghir_D08G005610 (GhFLA7) and Ghir_D13G000310 (GhFLA17) (Huang et al., 2013; Li et al., 2020; Qin et al., 2007; Wang et al., 2019c). The above‐described data highlighted the candidate gene pool and characterized the complex dynamic regulatory network for fibre elongation in the interspecific breeding of G. hirsutum × G. barbadense.
Figure 7

The candidate gene GhMAH1 (Ghir_D10G011050) was validated by the CRISPR–cas9 system. (a) Gene model of GhMAH1. Only one exon for GhMAH1. (b) GhMAH1 was knocked out by CRISPR–cas9 genome editing. The 1 bp insertion and 4 bp/9 bp deletion are shown in GhMAH1. The lower panel illustrates the Sanger sequencing spectrogram. (c) Expression of GhMAH1 in fibre tissue at the three time points. * Represents a significant difference by Student's t‐test at P = 0.05. ns indicates no significant difference by Student's t‐test. (d) Phenotype of mature fibres of Ghmah1 (CRI24) and WT (CRI24). (e) Mature fibre length between Ghmah1 (CRI24) and WT (CRI24). ** Represents a significant difference by Student's t‐test at P = 0.01. [Colour figure can be viewed at wileyonlinelibrary.com]

The candidate gene GhMAH1 (Ghir_D10G011050) was validated by the CRISPR–cas9 system. (a) Gene model of GhMAH1. Only one exon for GhMAH1. (b) GhMAH1 was knocked out by CRISPR–cas9 genome editing. The 1 bp insertion and 4 bp/9 bp deletion are shown in GhMAH1. The lower panel illustrates the Sanger sequencing spectrogram. (c) Expression of GhMAH1 in fibre tissue at the three time points. * Represents a significant difference by Student's t‐test at P = 0.05. ns indicates no significant difference by Student's t‐test. (d) Phenotype of mature fibres of Ghmah1 (CRI24) and WT (CRI24). (e) Mature fibre length between Ghmah1 (CRI24) and WT (CRI24). ** Represents a significant difference by Student's t‐test at P = 0.01. [Colour figure can be viewed at wileyonlinelibrary.com]

Discussion

ENAs from intragenic recombination are common

Each crop has its own unique origin and diversification and domestication histories that have altered its genetic composition, resulting in various physiological and phenotypic differences compared with wild relatives (Doebley et al., 2006; Hu et al., 2019). The genus Gossypium contains two cultivated tetraploids, G. hirsutum (AD1) and G. barbadense (AD2), which originated from interspecies hybridization between putative diploid A genome (G. herbaceum or G. arboretum) and D genome (G. raimondii) ancestors 1–1.5 million years ago (Endrizzi et al., 1985; Wendel, 1989; Zhang et al., 2015). Genome assemblies and pangenome analyses have identified extensive genetic variation including sequence differences within genes between G. hirsutum and G. barbadense (Li et al., 2021), and evolution studies have been accomplished using high‐resolution markers in several large natural populations (Du et al., 2018; He et al., 2021; Ma et al., 2018b; Ma et al., 2021; Yang et al., 2019). Genetic recombination is an important driving force for evolution in organisms (Gaut et al., 2007; Liu et al., 2018; Mcdowell et al., 1998; Wolf et al., 2014). However, few studies have focused on the relationship between intragenic recombination and trait variation in crops, including cotton. In this study, through the analysis of 277 fibre transcriptomes from 47 BILs and their two parents, 10 366 (ca. 30%) of 35 420 eGenes were found to contain new alleles from intragenic recombination in one or more BILs. The results indicated that genetic recombination occurred highly frequently within genes during the development of the introgressed population. Considering the fact that most intergenic (or inter‐marker) recombination hotspots occur in gene‐rich subtelomeric regions, it should not be surprising that intragenic recombination also occurs in similar regions. Because intergenic regions account for a majority of the complex genome such as cotton, it is understandable that recombination events due to crossovers between homologous chromosomes occur more frequently in intergenic regions than intragenic regions. Intragenic recombination was largely ignored in the past because high‐density genome‐wide SNPs especially gene‐based SNPs have only become a reality until recently. In this study, a cluster analysis of the BILs using the 10 366 ENAs and their neigbouring SNPs grouped many BILs into their parental groups. However, other BILs possessed more ENAs and exhibited a greater genetic distance from the parents and formed a separate group. These results implied that a greater accumulation of new alleles in BILs could be associated with a greater genetic differentiation relative to parents.

Function and expression differences among new alleles for fibre elongation

Since 1955, when Benzer identified intragenic recombination in bacteriophages (Benzer, 1955), an increasing number of investigations have identified intragenic recombination events in higher plants (i.e., Arabidopsis, wheat and maize), and some studies have speculated that new alleles contribute to trait improvement (Kelly et al., 2010; Liu et al., 2018; Mourad et al., 1994; Sun et al., 2020; Zhang et al., 2020). However, studies on the link between new alleles and agronomic traits remain lacking in crops. In this study, we identified 10 366 eGenes that generated new alleles in BILs, which accounted for approximately 30% of the eGenes in fibre tissue. This result indicated that interspecific hybridization is an efficient way to overcome the challenge of the limited genetic resources of domesticated cotton and other crops. Fortunately, we used GhFLA9 to reveal a detailed case in which intragenic recombination produced a smart megamerger, for example, combining the powerful promoter of Hai7124 with the amino acid sequence of CRI36, in some specific BILs, was more likely to contribute to fibre elongation than the parent alleles. Transcript accumulation can be affected by sequences at both ends of genes, for example, promoters at the 5′ end of the gene (Biłas et al., 2016) and 3′ regions that affect RNA stability (Wang et al., 2009). Therefore, an intragenic recombinant allele that combines a 5′ promoter of one parent with a stabilizing 3′ polymorphism of another parent could exhibit transgressive levels of transcript accumulation, and this hypothesis has been verified in maize (Liu et al., 2018). In the present study, we identified 520 new alleles in the FL‐QTL regions that exhibited transgressive transcript accumulation in dynamic fibre samples and were potential genetic factors to create the transgressive FL phenotype in cotton.

Genetic regulation of extreme‐expression genes

Gene birth and death are considered continuous and dynamic processes in evolution that are culled by natural selection or genetic drift (Kaessmann, 2010; Palmieri et al., 2014; Carvalho et al., 2015; Tobler et al., 2017). Similar to gene birth and death, gene activation and silencing of expression also exist in the evolutionary process. Thus, evolutionarily young genes have been identified to preferentially activate expression in the testis across species (Witt et al., 2021). Using transcript data from the 47 BILs and parents, we identified 358 genes that activated expression in at least four BILs relative to the parents. Genetic mechanistic studies on AGEs have suggested that AGEs are usually preferentially distributed in the 2‐Mb region of the RHs. Furthermore, eGWAS results showed that 98 of the 358 AGEs were significantly regulated by 15 cis‐eQTLs and 204 trans‐eQTLs (Table S20). Because cis‐eQTLs have a larger effect on eGenes than trans‐eQTLs, we focused on the cis‐eQTLs that regulate the expression of AGEs (Stern and Orgogozo, 2009; Wang et al., 2014). Interestingly, the AGE Ghir_D11G011050, which was significantly negative for FL (Figure 5c), was regulated by 9 cis‐Sbin markers (8.63–10.83 Mb) at the three time points (Figure S14a). In parallel, 98 genes were identified to have silenced expression in some specific BILs relative to the parents. The eGWAS results showed that 27 of the 98 SGEs were significantly regulated by 17 cis‐eQTLs and 53 trans‐eQTLs (Table S20). The SGE Ghir_D11G002660, which was significantly positive for FL (Figure 5e), was regulated by 6 cis‐Sbin markers (1.71–2.95 Mb) at 5 dpa (Figure S14b). These results implied that the AGEs and SGEs were likely caused by the recombination loci nearby and influenced fibre elongation in progeny.

Construction of a complex dynamic regulatory network of fibre elongation

The construction of a regulatory network is the basis for depicting the regulation of gene expression at the whole‐genome level and is a fundamental process to understand the biological processes and functions of both individual genes and the genome. To date, a complex regulatory network of eGenes in specific tissues has been constructed in maize and rice (Liu et al., 2015; Wang et al., 2014). Moreover, a complex genetic regulation network of fibre samples at 15 dpa was previously constructed (Li et al., 2020), and this network depicted the developmental transition from rapid cell elongation to secondary cell wall synthesis. In part, these regulatory networks contribute to clarifying the phenotypic variations of critical agronomic traits. Because the final phenotypic traits of plants are always controlled by the dynamic expression of different genes during growth, some of the associated regulatory networks are missed when using the transcriptome of a single period. In this study, we identified 1286, 1089 and 1059 eGenes that were colocalized with the FL trait at 5, 10 and 15 dpa, and constructed the dynamic regulatory network of these eGenes. Interestingly, we found 2, 2 and 3 SSRECs at 5, 10 and 15 dpa, implying that some of the eGenes experienced preferential regulation of the genetic loci at different time points (Figure 6a). A total of 105 DEGs in the dynamic regulation network were selected (Figure 6c), among which 15 FL‐related genes generated their new alleles in some BILs. Taken together, this study provides resources to expand our understanding of and improve the mechanism underlying fibre elongation and development in modern cotton interspecific breeding.

Conclusion

Our analyses reveal gene expression and sequence variant diversity in G. hirsutum × G. barbadense interspecific hybridization populations and identify extremely eGenes and new alleles that may contribute to fibre elongation and development. Additionally, the complex dynamic regulatory network of FL‐related eGenes provides a valuable resource for molecular mechanism studies of functional genes in cotton.

Methods

Plant materials and fibre length data

An interspecific BIL population of 191 lines was used in this study. Detailed information of the parents (CRI36 and Hai7124) and 191 BILs was described previously (Ma et al., 2019). We collected 9 environments (year‐by‐location) of FL data from 2015 to 2017 using the High‐Volume Instrument (HVI) 900 system.

Library construction and sequencing

The modified CTAB nucleic DNA extraction method for parents and BILs was described previously (Yang et al., 2019). We constructed paired‐end sequencing libraries with insert sizes ranging from 300 to 700 bp using the Illumina HiSeq 2500 platform according to the manufacturer's instructions. In total, approximately 3.05 terabases of genomic sequence data, with an average of 5× and 30× genomic coverage for each BIL and parent, were generated.

Genetic map construction and FL‐QTL analysis

The paired‐end reads of the clean data were mapped to the G. hirsutum TM‐1 reference genome (Wang et al., 2019a) using BWA software with the mem process (−t 4 ‐k 32 ‐M ‐R). The SNP‐calling procedure as described previously (Wang et al., 2017) identified aa × bb SNP markers (Figure 1a). A sliding window approach was used to call genotypes for the recombination breakpoint determination of 15 consecutive SNPs, with a step length of 1 (Gu et al., 2020; Huang et al., 2009). Next, high‐quality bin markers were selected to construct the linkage map using Joinmap4.0 with a Chain length per Monte Carlo EM cycle of 1000; the other parameters were default (Van Ooijen, 2006). The Kosambi mapping function was applied to estimate the map distance (Kosambi, 1944). The relationship between the genetic and physical positions was visualized using the ALLMAPS program (Hu et al., 2018). In cotton, the genetic distance between adjacent markers was higher than 20 cM/Mb, indicating an RH (Wang et al., 2015). The FL data collected from nine environments were subjected to QTL mapping with the composite interval mapping (CIM) method using WinQTLCart (Wang et al., 2012). The logarithm of the odds (LOD) score for declaring significant FL‐QTLs was implemented using a 1000‐permutation test (walk speed 1.0 cM, P < 0.05). LOD score values between 2.5 and permutation test LOD thresholds were used to declare suggestive QTLs (Jia et al., 2016; Zhang et al., 2016).

RNA‐seq of dynamic fibres in polarization subgroups of BILs

Based on the FL of the 191 BILs in 4 environments (15Ay, 16Hn, 16Ayxt and 16Alaer), we selected the top 10% (range from 29.5 ± 0.4 to 31.2 ± 0.6 mm) of lines with long fibre and the bottom 15% of lines (range from 23.6 ± 0.3 to 26.8 ± 0.3 mm) with short fibre. In Xinjiang (a major cotton‐producing area of China), 29.0 mm of FL was an important reference for the quality grading of the main cultivated varieties. Hence, the top 10% of BILs (21) were regarded as the “LF group,” and the bottom 15% of BILs (26) comprised the “SF group.” T‐test analysis of the FL trait for the LF versus SF groups showed an extremely significant difference (P = 7.41E−5; Figure S15). Hai7124, CRI36 and 47 BILs were cultivated with two replications in a randomized complete block design in the field at Anyang in 2017. In July, white flowers (or yellow flowers in the case of G. barbadense) present in the middle fruiting branches were tagged, and this time point was defined as 0 dpa. Six normally growing bolls were randomly sampled from 47 lines and parents in two replications at the three key time points (5, 10 and 15 dpa) for fibre elongation. For each individual, fibre samples at the corresponding time points were manually separated from ovules in liquid nitrogen and stored at −70 °C for total RNA extraction. High‐throughput RNA‐seq of 5, 10 and 15 dpa fibre tissues was performed using an Illumina HiSeq 2000 platform, and two independent biological replicates were included in the analysis (17 samples without transcriptomes data due to their poor quality of RNA). We collected 10–15 million (paired‐end 150‐bp) reads from each sample. The analysis criterion of the sequencing reads was performed as described previously (Li et al., 2020). The remaining reads were used to calculate the expression levels of genes (fragments per kilobase of transcript per million mapped reads, FPKM) using StringTie software (version 1.3.4) with default settings (Pertea et al., 2015). Using the DESeq R package, 34, 436 and 367 DEGs in SF vs. LF were identified at 5, 10 and 15 dpa, respectively, and 79 DEGs were found in at least two time points (Figure 1f; Table S14).

Identification of ENAs based on transcriptomes

Intragenic recombination provides a crucial source of data for studies of evolution and transgressive segregation, and numerous novel protein functions are caused by this type of recombination (Maguire et al., 2014; Salim et al., 2011). To date, studies are lacking on intragenic recombination in cotton. Two cultivated tetraploids, G. hirsutum (AD1) and G. barbadense (AD2), are extensive SNP variants that supply an ideal reference for studying novel alleles in interspecific populations (Hu et al., 2019; Ma et al., 2021; Wang et al., 2019a). Herein, 277 fibre transcriptomes at the three time points (5, 10 and 15 dpa) in the 47 BILs and genotyped parents were used to detect the new alleles. The detailed processes were as follows: (i) the transcriptomes of two biological repetitions for parents and 47 BILs were merged, and their SNP variations were called with GATK (QD < 2.0, FS > 60.0, MQ < 40.0, SOR >4.0, MQRankSum <−12.5, ReadPosRankSum <−8.0, DP < 10); (ii) genes with at least two homozygous SNPs between parents CRI36 and Hai7124 were retained and (iii) the retained genes in each BIL were scanned, and if one gene contained the SNPs from both CRI36 and Hai7124 in at least one BIL, it was considered as a new allele. Similar methods for new allele identification were used in previous studies (Liu et al., 2018; Pan et al., 2016).

Analysis of expression‐activated and expression‐silenced genes

The types of extremely eGenes (e.g., activated or silenced) were similar to the overexpression or RNAi biotechnology and showed high potential for phenotypic improvement. Here, we defined transcripts with two independent biological replicates that showed FPKM = 0 in the parents and FPKM ≥1 in at least 4 BILs (8%) simultaneously as AGEs. By contrast, the cluster of genes that were expressed in the parents was silent in some BILs. We defined the transcripts that exhibited both FPKM = 0 in at least 4 BILs and FPKM ≥1 in the parents as SGEs. The correlation analysis was designed for FPKM values of AGEs/SGEs and FL in 47 BILs and parents.

Semiquantitative RT‐PCR

cDNA of the 10 dpa fibre samples of parents and BILs was synthesized using TransScript One‐Step gDNA Removal and cDNA Synthesis SuperMix (TRANSGEN BIOTECH, Beijing, China), and the TransStart Top Green qPCR SuperMix (TRANSGEN BIOTECH, Beijing, China) was used to perform semiquantitative RT–PCR according to the manufacturer's instructions. Sequence‐specific primers were designed using Primer‐BLAST from NCBI (https://www.ncbi.nlm.nih.gov/tools/primer‐bl) (Table S21).

eQTL analysis of fibres at different developmental time points

To identify the genes involved in dynamic fibre development, an analysis of the gene expression level was performed to discard those with expression levels less than 1.0 (FPKM < 1.0) in all of the samples (Wang et al., 2019b). This analysis allowed us to obtain 32 707, 30 891 and 30 326 genes expressed at 5, 10 and 15 dpa of fibre development, respectively (Figure 1c, d and e; Table S22). We performed an association analysis of the 47 lines using the MLM (PCA + K) model (Bradbury et al., 2007). The cut‐off for significant associations was set to 1/n (n represents the 2216 Sbin markers; P < 4.51E‐4; Table S23). eQTL analysis can detect genetic elements that regulate the expression variation of e‐traits acting in cis if the eQTL is located in the vicinity of an e‐trait or in trans if it is located in a distant position. Such analysis could suggest the existence of a potential regulator in a genomic region for a gene (Li et al., 2020; Pang et al., 2019).

Network construction

To systematically understand the dynamic regulation associations, we leveraged a genetic network involving genes (traits) and eQTL data at the fibre fast elongation stage to visualize the relationship between eGenes and eQTLs. eQTLs that were colocalized with the FL‐QTLs identified in the present study and previous studies of G. hirsutum × G. barbadense populations were highlighted in the genetic network. This network included the relationship between eQTLs and their regulated genes at three time points of fibre development. In the present analysis, we focused on these genes, which showed DE at least one‐time point between the LF and SF groups and were regarded as causal regulators of fibre elongation. The construction of the network was performed using CYTOSCAPE (v.3.8.2) with the edge‐weighted force‐directed layout method (Shannon et al., 2003).

CRISPR–cas9 mutagenesis of

The specific target site (5′‐CTAGTGGGGATGATGCCGCA‐3′) was designed to knock out GhMAH1 (Ghir_D10G011050) in cotton using the CRISPR–Cas9 system (Li et al., 2019b). G. hirsutum cultivar accession CRI24 was used for Agrobacterium‐mediated transformation. The T0 transgenic cotton plants were confirmed by genotyping PCR. The WT and T1 generations were cultivated in pots (21 cm × 21 cm), and positioned in a well‐managed phytotron (16 h light with 500 μmol/m2/s photosynthetically active radiation; 8 h dark and 28 °C) of the Institute of Cotton Research of CAAS, and the mutant individuals were identified by Sanger sequencing. The FL data were collected using three or more healthy and mature bolls from each mutant line and wild plants in T1 as described previously study (Qu et al., 2012).

Conflicts of interest

The authors have no conflicts of interest to declare.

Author contributions

J.Z. and J.Y. conceived and designed the experiments. J.W. discussed the potential factors of heterosis. Y.J. screened the ENAs. W.P. performed the field cultivation of cotton plants and FL measurements. M.W., Q.M., and J.L. performed fibre sample collection and RNA extraction. J.S. and B.J. contributed to the evolution analysis. S.L. supported the screening of the misexpressed genes. J.M. performed the experiments and wrote the manuscript. J.Z. and J.Y. edited the manuscript. All the authors read and approved the final manuscript. Figure S1 Construction of the 26 linkage groups of the G. hirsutum × G. barbadense interspecific population. Click here for additional data file. Figure S2 Marker collinearity between the genetic and physical maps. Click here for additional data file. Figure S3 Frequency map of the FL trait in 191 BILs in different environments. Click here for additional data file. Figure S4 Sanger sequencing for intragenic recombination genes. Click here for additional data file. Figure S5 Frequency statistics of the ENAs in BILs. Click here for additional data file. Figure S6 A neighbour‐joining tree was constructed based on the whole genome‐wide SNPs in 47 BILs and parents. Click here for additional data file. Figure S7 Count statistics of the ENAs for the BILs with transcriptomes at 5, 10 and 15 dpa, simultaneously. Click here for additional data file. Figure S8 Comparisons of FL among the three types of alleles. Click here for additional data file. Figure S9 Venn diagram of the number of transgressive transcripts of the ENAs at the three time points. Click here for additional data file. Figure S10 Sanger sequencing of GhFLA9. Click here for additional data file. Figure S11 Statistics of the extremely expressed genes in the genome. Click here for additional data file. Figure S12 Semiquantitative RT‐PCR for extremely expressed genes. Click here for additional data file. Figure S13 Dynamic eQTL identification in cotton fibre elongation. Click here for additional data file. Figure S14 Identification of cis‐eQTLs of FL‐associated extremely expressed genes at specific time points. Click here for additional data file. Figure S15 Comparison of FL between the LF and SF groups. Click here for additional data file. Table S1 The 7558 bin markers and their genetic distance and physical position on 26 chromosomes. Table S2 Distribution of the recombination hotspots on the linkage map. Table S3 FL of backcross inbred lines (BILs) of Hai7124 × CRI36 hybrids and their parents. Table S4 Summary of FL‐QTLs identified in different environments. Table S5 Summary of the total FL‐QTLs identified in previous studies. Table S6 Summary of the stable FL‐QTLs identified in previous studies. Table S7 Summary of the frequency of ENAs in each BIL. Table S8 Correlation between the number of ENAs and FL in BILs. Table S9 Summary of the 88 expressed genes in 25 FL‐QTLs. Table S10 Summary of the stable FL‐QTLs identified by ENA markers. Table S11 Annotations of the 18 ENAs involved in fibre development. Table S12 Lists of transgressive transcripts of ENAs at the three time points. Table S13 Annotations of the 520 transgressive ENAs located in FL‐QTLs. Table S14 DEGs between the LF and SF groups at 5, 10 and 15 dpa. Table S15 Lists of AGEs/SGEs at three time points. Table S16 Annotations of the 358 AGEs and 98 SGEs. Table S17 Significant eQTLs of the eGenes at the three time points. Table S18 eQTLs at the three time points overlapped with the FL‐QTL in Gossypium hirsutum × Gossypium barbadense populations. Table S19 GO enrichment analysis of the core‐ and specific eGenes in the network at 5, 10 and 15 dpa. Table S20 eQTLs of the 98 AGEs and 27 SGEs at the three time points. Table S21 Specific primers of the genes in the present study. Table S22 Expressed genes at the three time points. Table S23 The physical position of the Sbin markers on 26 chromosomes. Click here for additional data file.
  89 in total

1.  Homoeologous exchanges occur through intragenic recombination generating novel transcripts and proteins in wheat and other polyploids.

Authors:  Zhibin Zhang; Xiaowan Gou; Hongwei Xun; Yao Bian; Xintong Ma; Juzuo Li; Ning Li; Lei Gong; Moshe Feldman; Bao Liu; Avraham A Levy
Journal:  Proc Natl Acad Sci U S A       Date:  2020-06-09       Impact factor: 11.205

2.  Genome-wide recombination dynamics are associated with phenotypic variation in maize.

Authors:  Qingchun Pan; Lin Li; Xiaohong Yang; Hao Tong; Shutu Xu; Zhigang Li; Weiya Li; Gary J Muehlbauer; Jiansheng Li; Jianbing Yan
Journal:  New Phytol       Date:  2015-12-31       Impact factor: 10.151

3.  Structure of a viral cap-independent translation element that functions via high affinity binding to the eIF4E subunit of eIF4F.

Authors:  Zhaohui Wang; Krzysztof Treder; W Allen Miller
Journal:  J Biol Chem       Date:  2009-03-10       Impact factor: 5.157

4.  A new interspecific, Gossypium hirsutum x G. barbadense, RIL population: towards a unified consensus linkage map of tetraploid cotton.

Authors:  Jean-Marc Lacape; J Jacobs; T Arioli; R Derijcker; N Forestier-Chiron; D Llewellyn; J Jean; E Thomas; C Viot
Journal:  Theor Appl Genet       Date:  2009-04-23       Impact factor: 5.699

5.  Dissecting functions of KATANIN and WRINKLED1 in cotton fiber development by virus-induced gene silencing.

Authors:  Jing Qu; Jian Ye; Yun-Feng Geng; Yan-Wei Sun; Shi-Qiang Gao; Bi-Pei Zhang; Wen Chen; Nam-Hai Chua
Journal:  Plant Physiol       Date:  2012-07-26       Impact factor: 8.340

6.  Resequencing of 243 diploid cotton accessions based on an updated A genome identifies the genetic basis of key agronomic traits.

Authors:  Xiongming Du; Gai Huang; Shoupu He; Zhaoen Yang; Gaofei Sun; Xiongfeng Ma; Nan Li; Xueyan Zhang; Junling Sun; Min Liu; Yinhua Jia; Zhaoe Pan; Wenfang Gong; Zhaohui Liu; Heqin Zhu; Lei Ma; Fuyan Liu; Daigang Yang; Fan Wang; Wei Fan; Qian Gong; Zhen Peng; Liru Wang; Xiaoyang Wang; Shuangjiao Xu; Haihong Shang; Cairui Lu; Hongkun Zheng; Sanwen Huang; Tao Lin; Yuxian Zhu; Fuguang Li
Journal:  Nat Genet       Date:  2018-05-07       Impact factor: 38.330

7.  Sequence-based ultra-dense genetic and physical maps reveal structural variations of allopolyploid cotton genomes.

Authors:  Sen Wang; Jiedan Chen; Wenpan Zhang; Yan Hu; Lijing Chang; Lei Fang; Qiong Wang; Fenni Lv; Huaitong Wu; Zhanfeng Si; Shuqi Chen; Caiping Cai; Xiefei Zhu; Baoliang Zhou; Wangzhen Guo; Tianzhen Zhang
Journal:  Genome Biol       Date:  2015-05-24       Impact factor: 13.583

8.  Nested-association mapping (NAM)-based genetic dissection uncovers candidate genes for seed and pod weights in peanut (Arachis hypogaea).

Authors:  Sunil S Gangurde; Hui Wang; Shasidhar Yaduru; Manish K Pandey; Jake C Fountain; Ye Chu; Thomas Isleib; C Corley Holbrook; Alencar Xavier; Albert K Culbreath; Peggy Ozias-Akins; Rajeev K Varshney; Baozhu Guo
Journal:  Plant Biotechnol J       Date:  2019-12-25       Impact factor: 9.803

9.  High rate of translocation-based gene birth on the Drosophila Y chromosome.

Authors:  Ray Tobler; Viola Nolte; Christian Schlötterer
Journal:  Proc Natl Acad Sci U S A       Date:  2017-10-19       Impact factor: 11.205

10.  Few Fixed Variants between Trophic Specialist Pupfish Species Reveal Candidate Cis-Regulatory Alleles Underlying Rapid Craniofacial Divergence.

Authors:  Joseph A McGirr; Christopher H Martin
Journal:  Mol Biol Evol       Date:  2021-01-23       Impact factor: 16.240

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.