Literature DB >> 36122204

Evolutionary divergence of duplicated genomes in newly described allotetraploid cottons.

Renhai Peng¹, Yanchao Xu², Shilin Tian³, Turgay Unver⁴, Zhen Liu¹, Zhongli Zhou², Xiaoyan Cai², Kunbo Wang², Yangyang Wei¹, Yuling Liu¹, Heng Wang², Guanjing Hu^2,5, Zhongren Zhang³, Corrinne E Grover⁶, Yuqing Hou², Yuhong Wang², Pengtao Li¹, Tao Wang¹, Quanwei Lu¹, Yuanyuan Wang⁷, Justin L Conover⁶, Hassan Ghazal⁸, Qinglian Wang⁷, Baohong Zhang⁹, Marc Van Montagu^10,11, Yves Van de Peer^10,11,12,13, Jonathan F Wendel⁶, Fang Liu².

Abstract

Allotetraploid cotton (Gossypium) species represents a model system for the study of plant polyploidy, molecular evolution, and domestication. Here, chromosome-scale genome sequences were obtained and assembled for two recently described wild species of tetraploid cotton, Gossypium ekmanianum [(AD)6, Ge] and Gossypium stephensii [(AD)7, Gs], and one early form of domesticated Gossypium hirsutum, race punctatum [(AD)1, Ghp]. Based on phylogenomic analysis, we provide a dated whole-genome level perspective for the evolution of the tetraploid Gossypium clade and resolved the evolutionary relationships of Gs, Ge, and domesticated G. hirsutum. We describe genomic structural variation that arose during Gossypium evolution and describe its correlates-including phenotypic differentiation, genetic isolation, and genetic convergence-that contributed to cotton biodiversity and cotton domestication. Presence/absence variation is prominent in causing cotton genomic structural variations. A presence/absence variation-derived gene encoding a phosphopeptide-binding protein is implicated in increasing fiber length during cotton domestication. The relatively unimproved Ghp offers the potential for gene discovery related to adaptation to environmental challenges. Expanded gene families enoyl-CoA δ isomerase 3 and RAP2-7 may have contributed to abiotic stress tolerance, possibly by targeting plant hormone-associated biochemical pathways. Our results generate a genomic context for a better understanding of cotton evolution and for agriculture.

Entities: Chemical

Keywords: adaptive evolution; polyploid dynamics; structure variations; tetraploid cotton

Mesh：

Substances：
Isomerases

Year: 2022 PMID： 36122204 PMCID： PMC9522333 DOI： 10.1073/pnas.2208496119

Source DB: PubMed Journal: Proc Natl Acad Sci U S A ISSN： 0027-8424 Impact factor: 12.779

Polyploidization is an important evolutionary process in many higher plants, contributing to speciation and adaptation (1–5). Allopolyploidy in particular has been considered a major evolutionary force due to the novel genomic possibilities resulting from hybridization and increased genetic variability. Approximately 1 to 2 million y ago (Mya), hybridization between geographically disjunct diploid A and D genome ancestors (2n = 26, AA and DD genome) and concomitant polyploidization generated allotetraploid cotton (2n = 52, AADD genome) (6, 7). This new allopolyploid clade subsequently diversified into the seven species recognized today [(AD)1 to (AD)7] (8, 9). Among them, Gossypium hirsutum [Gh, genome designation (AD)1] and Gossypium barbadense [Gb, (AD)2)], provide the majority of natural fiber for commercial production (10, 11). Five tetraploid cottons [(AD)1 to (AD)5], including the two domesticated species, have recently been sequenced using long-read technology, providing high-quality genome assemblies and genomic resources for uncovering the genetic basis of spinnable fiber (12–16). However, genome assemblies for the two most recently described wild tetraploid species, both of which are closely related to Gh, have not been reported. Genome sequences for these species may facilitate an improved understanding of evolution and fiber improvement in the dominant crop species, G. hirsutum. In fact, these species—that is, Gossypium ekmanianum [Ge, (AD)6] from the Dominican Republic and Gossypium stephensii [Gs, (AD)7] from the Wake Atoll near French Polynesia—are so similar to Gh that they have historically been confounded with wild accessions of Gh (8, 9). Additionally, there are no genome sequences for early domesticated or wild forms of either of the two domesticated species, precluding comparative genomic approaches to understanding cotton evolution and domestication. Among the great diversity of morphological forms spanning the wild-to-domesticated continuum in Gh, many of the least-improved forms occur in the Yucatan Peninsula of Mexico, including the truly wild race yucatanense, and the relatively unimproved race punctatum (Ghp) (17, 18). Assembling the genome of these tetraploid cottons will provide an important model system for understanding both the evolutionary consequences of polyploidy and of parallel domestication (19–21). Here, we report high-quality genome assemblies of the three tetraploid genomes, Ge, Gs, and Ghp. Using comparative genomics and phylogenomics, we reveal extensive genomic structural variations (SVs) in tetraploid cottons, and reevaluate phylogenetic relationships and divergence times within the polyploid clade. Extensive SVs are associated with phenotypic diversity, including the economically important trait, fiber length. We characterize wild gene resources that have the potential to facilitate adaptation to various abiotic and biotic stresses in domesticated cotton. These results deepen our understanding of genome evolution in polyploids and provide insight into the genetic and morphological diversity of tetraploid cottons.

Results

Assemblies and Characterization of Three Allotetraploid Cotton Reference Genomes.

Three allotetraploid cotton genomes (Ge, Gs, and Ghp) were sequenced and assembled using a combination of sequencing technologies, including single-molecule real-time (PacBio), paired-end Illumina sequencing, and chromatin conformation capture (Hi-C). An initial assembly was generated via FALCON (22) using at least 20.83 million PacBio long reads for each cotton species (), and subsequently corrected using Illumina paired-end data (average 120-fold coverage). These megabase assemblies (N50 of 1.57, 1.23, and 11.49 Mb for Ge, Gs, and Ghp, respectively) () were combined with Hi-C interaction information to produce chromosome-scale scaffolds (), yielding final assemblies of 2.34, 2.29, and 2.29 Gb for Ge, Gs, and Ghp, respectively. These high-quality assemblies had scaffold N50 values of more than 107 Mb (Table 1), with 99% of bases anchored onto chromosomes and with 99% of mapped Illumina reads covering about 97% of the genomes (). Nearly all of the 1,614 Embryophyta benchmarking universal single-copy orthologs (BUSCOs, embryophyta_odb10) (23) were complete in the Ge (99.2%), Gs (97.4%), and Ghp (99.3%) assemblies (), and the long terminal repeat (LTR) assembly index (LAI score 13.7 in Ge, 12.8 in Gs, and 12.7 in Ghp) further indicated that these three assemblies could be considered “reference quality” (24) ().

Table 1.

Features of three tetraploid cotton assemblies

Genomic features	Ge	Gs	Ghp
Assembly
Genome size (Mb)	2,341.87	2,291.84	2,292.48
Scaffold number	160	243	277
Scaffold N50 (Mb)	108.06	108.2	106.96
Contig size (Mb)	2,341.51	2,291.47	2,292.40
Contig number	3,781	3,927	1,111
Contig N50 (Mb)	1.57	1.23	11.49
Gap number	3,621	3,684	834
Gap length (Mb)	0.36	0.37	0.08
Pseudochromosomes size (Mb)	2,337.03	2,272.89	2,283.07
Annotation
TE percentage	64.86	63.01	64.89
Gene number	74,178	74,970	74,520
Genes in pseudochromosomes	74,038	73,324	74,283
Complete BUSCOs (%)	95.50	97.10	95.40

Features of three tetraploid cotton assemblies A total of 1,575 Mb (65%), 1,489 Mb (63%), and 1,488 Mb (65%) of sequence corresponding to transposable elements (TEs) were predicted in Ge, Gs, and Ghp, respectively (Table 1 and ). We identified 74,178, 74,970, and 74,520 protein-coding gene models (PCGs), respectively, of which an average of 97% had matched functional identifiers (). The majority (95 to 97%) of PCGs predicted in Ge, Gs, and Ghp had an identifiable homolog (>80% protein identity) in the published tetraploid cotton genomes () (15): that is, Gh, Gb, Gossypium tomentosum [Gt, (AD)3], Gossypium mustelinum [Gm, (AD)4], and Gossypium darwinii [Gd, (AD)5]. An assessment of TE and PCG density in 1,000 equal windows per chromosome suggest a strong bias for Copia and PCG accumulation within 20% of the windows nearest the chromosome telomeres, having an average of 0.85-fold (P < 10−16, Wilcox test) and 2.34-fold (P < 10−16, Wilcox test) increase, respectively, compared to other chromosomal regions (Fig. 1 and ). In contrast, Gypsy element density exhibited an average decrease of 0.74-fold (P < 10−16, Wilcox test) in telomeric versus other regions.

Fig. 1.

Genomic feature distribution in the three cotton genomes. The scale unit of each chromosome is 10 M. Track a: Gene density indicated by gene length per megabase. Track b: Gypsy density showed the distribution of Gypsy TE length per megabase across each chromosome. Track c: Copia density showed the distribution of Copia length per megabase across each chromosome. Track d: TE density showed the distribution of DNA transposon length per megabase across each chromosome.

Phylogenetic Analysis of Tetraploid Gossypium.

A maximum-likelihood phylogenetic tree was generated using 3,281 single-copy coding genes for the eight tetraploid cottons (seven species), eight representative diploid cottons [G. herbaceum A1 (14), Gossypium arboreum A2 (25), Gossypium longicalyx F1 (26), Gossypium australe G2 (27), Gossypium thurberi D1 (28), Gossypium raimondii D5 (10), and Gossypium turneri D10 (29)], and the phylogenetic outgroup species Gossypioides kirkii (Gki) (30). Divergence times were estimated using Ks values for orthologous genes (Figs. 2 ). The phylogeny of allopolyploid cotton is reiterated in both the A and D genome clades, as expected given their formation from A and D genome diploid antecedents. The initial divergence time within the allopolyploids was estimated at 1.80 Mya (95% CI: 1.10 to 2.72 Mya) (Fig. 2), consistent with previous reports (15, 31). Except for Gm and Gt, which comprise one of the two deepest branches, tetraploid species fall into two clearly distinguished clades, each of which includes one of the two economically important domesticated cottons (upland cotton Gh and sea island cotton Gb). These two groups are hereafter referred to as the Gh-like and Gb-like clades, respectively (Fig. 2), and are inferred to have diverged from each other ∼0.79 Mya (95% CI, 0.49 to 1.49 Mya). Ghp is among the most primitive among the many forms of semidomesticated and feral derivatives found within the diversity generated by the 4,000+-y history of Gh domestication (32, 33). We observed similar divergence times between Gb-Gd (0.63 Mya; 95% CI: 0.37 to 1.26 Mya) and Ghp-Gh (0.68 Mya; 95% CI: 0.41 to 1.14 Mya), confirming earlier data indicating that the Galapagos Island endemic Gd, previously considered to be conspecific with Gb, diverged relatively recently from its mainland relatives (34). The genome sequences for the two species Ge and Gs completes the sampling of wild tetraploid Gossypium and, as expected from prior analyses (19–21), they fall within the Gh-like clade, having all diverged from their most recent common ancestor around 0.75 Mya (95% CI, 0.42 to 1.33 Mya) (). With respect to diploid divergence, two branches are distinguished (Fig. 2): the New World clade (D genome) and the African–Australian–Asian clade (A, G, and F genomes) (6, 7). We observed phylogenetic inconsistencies in the order of divergence within the Gh-like clade for the two subgenomes [Dt clade: (Ge, Gs), (Gh, Ghp)] vs. At clade: {Ge, [Gs, (Gh, Ghp)]} (Fig. 2), which is reasonable given the rapid divergence exhibited by these species (8, 9).

Fig. 2.

Phylogenetic analysis of the Gossypium genomes. (A) Maximum-likelihood tree inferred using G. kirkii (Gki) as the outgroup. (B) Distribution of Ks values for orthologous genes among Gossypium genomes. (C) Evolution of the allopolyploid cotton clade, formed following hybridization between an extinct A0 and ancestor of D5. Our results support previous inferences regarding the monophyletic origin of allopolyploid cottons (35). Although the AT genome of tetraploid cotton is more divergent from A2 (∼1.31 Mya) than A1 (∼1.23 Mya), the range in estimates for these divergence times overlap (). As expected, based on prior studies (7), Gm is the sole survivor of the earliest split in the allopolyploid species, and thus it can be used as an outgroup to evaluate subsequent evolutionary differences of the remaining allopolyploids in the Gh- and Gb-like clades. In general, the synonymous substitution rate (Ks) was higher for Dt homeologs than for At homeologs (), consistent with a previous report (15) and possibly reflecting subgenome-specific evolutionary processes, including differences in recombination rates and selective sweeps.

Genomic SVs Occurred During the Evolution of Tetraploid Gossypium.

SVs occur frequently during plant evolution and domestication, providing a major genetic source of phenotypic diversity (36, 37). We focused on identifying all genomic SVs ≥50 bp in length because these are the least well-characterized genetic variations and are likely to affect gene function (38, 39). By mapping the seven tetraploid Gossypium assembled genomes and their sequencing reads to the reference genome of Gm [(AD)4_JGI] (15), four methods [smartie-SV (39), SVMU (40), SyRI (41), and Breakdancer (42)] were combined to identify SVs (Fig. 3) and to polarize their directionality (i.e., insertion vs. deletion relative to Gm). SVs were only considered when they were consistently identified by at least two methods, resulting in an average of 72,965 insertions (range 67,885 to 77,756), 63,126 deletions (range 59,663 to 65,670), and 339 inversions (range 297 to 410) (). Presence/absence of variation (PAVs) occurred relatively frequently during cotton evolution. The lowest number of PAVs was observed in Gt (). Notably, the domesticated polyploids (Gb and Gh) had the longest average length of PAVs among the seven tetraploid cotton genomes surveyed (), yet the other five cotton accessions exhibit more PAVs with size ≥1 kb (). Relative to the large number of species/accession-specific PAVs (range 36,476 to 75,125), fewer shared PAVs (8,277, average ratio of 6.05%) among species/accessions were observed, suggesting potential for impact by PAVs on species/accession-specific traits (relative to Gm) ().

Fig. 3.

Characterization of genomic variations in Gossypium At genome and Dt genome. Genic synteny blocks are connected by gray lines. Translocation blocks and inversion blocks are connected by red and green lines, respectively. The number of PAVs in the Dt genome (range 61,132 to 67,223) is slightly smaller than in the At genome (range 64,875 to 78,695) for all polyploid cotton accessions except Gh, suggesting a higher density of PAVs in the twofold smaller Dt subgenome (). The majority of PAVs were located in intergenic regions (70.53 to 76.81%), and fewer were in coding regions than in introns (). PAVs overlapping exons resulted in a predicted 20,343 frameshift and 9,771 stop-codon gain or loss mutations within 11,557 predicted protein sequences (15.68% of the total) (Dataset S1), including 1,168 proteins affected in at least 4 of the genomes. The three chromosomes affected most by PAVs within genes were At05 (535 genes), At11 (428), and Dt11 (312) (). Within the 8,277 PAVs shared by all accessions (relative to Gm), 646 protein sequences were affected by these PAVs. This branch (i.e., the internode between Gm and the radiation of the remaining genomes) exhibited the largest number of genic PAV-induced changes across the polyploid phylogeny (0.078) by a factor of approximately three; on the terminal branches, species/accession-specific rates of PAV-induced genic changes ranged from 0.006 to 0.027 (). Of the SVs affecting genes, we found a 450-bp SV in a synteny block on At10 among the eight tetraploid genomes (coordinates in domesticated Gh; At10: 84,877,673 to 84,878,123) that resulted in a shorter version of Ghi_A10G09231 in the Gb-like clade and Gt (Fig. 4). This shorter version is phylogenetically homoplasious, possibly resulting from either a repeated truncation or a single event in a polymorphic ancestor followed by lineage sorting. This gene encodes a phosphopeptide-binding protein that is involved in fiber length, and which exhibits significantly reduced expression in fiber from cotton species missing this gene () (43). This deletion (relative to Gm) was present in Gb and Gd, suggesting that the deletion occurred subsequent to divergence from Gm. We found a large-scale inversion event (∼4.48 Mb) in a synteny block on Dt04 that distinguishes the Gh-like clade from the Gt and the Gb-like clade (Fig. 4), and thus is phylogenetically diagnosed as having occurred in the ancestor of the Ge through Gh clade. This inversion was further confirmed by mapping Hi-C data of four accessions (Gb, Ge, Gs, and Ghp) to TM-1_WHU (). Notably, two genes at this inversion boundary are involved in various abiotic stress responses, in which the Ghi_D04G05266 gene encodes calcium-dependent protein kinase and the Ghi_D04G0499 gene encodes alcohol dehydrogenase class-P (Dataset S2) (44–48). Another large inversion (>986 kb) in a synteny block on Dt01 was present in both domesticated Gb and Gh, but not their close wild relatives. At its boundary, the gene Ghi_D01G09866 was the homologous BXL in Arabidopsis thaliana, which encodes a β-xylosidase associated with secondary cell wall metabolism (49), suggesting a possible relationship to fiber quality; the other gene, Ghi_D01G10141, encodes an ethylene-responsive transcription factor, which interacts with TPL to regulate seed germination (Dataset S2) (50). Collectively, these SVs occurring in the evolution of tetraploid Gossypium led to the important phenotypic differentiation, although should be experimentally validated in the future.

Fig. 4.

SVs in tetraploid cotton genomes. (A) A 450-bp fragment SV occurred between Gh-like and Gb-like clades in a synteny block among At10 of all eight tetraploid Gossypium genomes. Coverage of Gh genome by the Illumina reads of eight tetraploid Gossypium genomes (Upper) and gene structure of Ghi_A10G09231 are shown (Lower). The deletion region of Ghi_A10G09231 is outlined in red and marked in gray bar. The yellow bars indicate the coding sequence regions. Evolutionary relationships are shown in the tree to the Left. (B) A 4.5-Mb inversion occurred between Gh-like and Gb-like clades. (C) A 980-kb inversion occurs at both Gb and Gh relative to their wild progenitors. Of the tetraploid cotton pan-genome made by combining our newly sequenced genomes with sequences from the five previously published tetraploid cotton species (Gh, Gb, Gt, Gm, and Gd) (15), a total of 96,537 gene families was detected (Fig. 5). Core genes were enriched for gene ontology (GO) terms related to “regulation of biosynthetic process” and “metabolic process” (), similar to previous findings from other plants (51–53). However, the number of pan-gene sets increased as additional genomes were added and did not approach a plateau (Fig. 5), as for some within and between species comparisons, such as soybean and its wild relatives (51). This observation likely reflects the levels of divergence and perhaps genome size variation that exists among Gossypium species.

Fig. 5.

Pan-genome analysis for eight tetraploid cotton genomes. (A) Increase in pan-gene families and decrease in core gene families with the addition of tetraploid cotton genomes. (B) Clustering of core and dispensable gene families of tetraploid cotton genomes. Among the Gossypium gene families, most (average of 59.78% or 27,484 families) were considered core families that account for an average of 68.11% of the genes, followed by approximately one-quarter of the genes that were considered “dispensable” (an average of 18,809 genes in each genome) (Fig. 5 and ). We also observed 6.81% of genes, on average, were species-specific, but the number and the proportion of Gh-like are almost twice as big as four other Gossypium (). Additionally, upland and sea island cottons have slightly fewer specific genes than their closest wild relatives (i.e., Gh vs. Ghp and Gb vs. Gd) (). Nucleotide-binding site leucine-rich repeat (NLR) gene families mediate plant resistance to biotic stressors (54, 55). We identified a total of 3,462 to 4,312 NLR genes in each of the eight tetraploid cotton genomes (). NLR genes were scattered across almost all chromosomes, with significantly higher numbers in Dt subgenomes than in At subgenomes (P = 0.012), congruent with earlier reports for five allopolyploid cotton species (15) (). We observed some dense clusters appearing in A04, A11, and D11 of Ghp and Ge (). Oligonucleotide probes designed for Ge A04 and A11 R gene clusters () confirmed the presence of these clusters in other tetraploid genomes (). Notably, the domesticated Gossypium genome contained a narrower range of NLR gene domain architectures than were observed in the other genomes (); this was especially evident in the Gh-like clade, where there was a trend of gradual decreasing from wild to domesticated Gossypium. This suggest that perhaps selection during domestication was for gene family reduction to eliminate nonessential genes, or perhaps neutral forces were involved.

Gene Evolution in the Relatively Unimproved Ghp.

Wild tetraploid cotton species naturally occur in habitats that periodically are subjected to drought or salt-stress (7, 17). The relatively unimproved Ghp may harbor genes/gene families for adaptation to these harsh environments. Notably, we identified 446 expanded gene families containing 948 genes in Ghp compared to the other 7 allotetraploid cotton species. These expanded gene families were significantly associated with the functional terms “sodium ion transport (GO: 0006814, P = 3.95e-08),” “glycolytic process (GO: 0006096, P = 1.7e-04),” “biotin metabolism (ath00780, P = 1.20e-05),” “fatty acid metabolism (ath01212, P = 1.49e-05),” and “fatty acid biosynthesis (ath00061, P = 6.12e-05)” (). To explore possible relationships to stress, we performed transcriptomic sequencing under two independent stress treatments (salt and drought) at three time periods (0, 12, and 24 h) for Ghp (). We detected a total of 9,700 and 1,197 differentially expressed genes (DEGs) in salt and drought treatments, respectively, using 0-h seedling leaves as the control group (Fig. 6). Interestingly, we found that 402 of 948 (42.41%) expanded genes in Ghp presented evident transcriptional expression changes, suggesting that expansion of these gene families may have contributed to abiotic and biotic stress resistance. Among these expanded DEGs, we identified one DEG, GhirPD0101G028900, homologous to ECI3 in A. thaliana, encoding a homolog of enoyl-CoA δ isomerase 3. This gene is involved in salt and drought stress response in A. thaliana (56), and the expression levels of cotton homologs of this gene were also significantly changed under both salt and drought stress treatment in Ghp (). We confirmed that expression of ECI3 was significantly decreased in the early stages of either cold or salt stress (Fig. 6 ). We also observed a DEG, GhirPA0801G001500, a homolog for ethylene-responsive transcription factor RAP2-7, that belongs to the AP2/ERF transcript factor family. This family is not only involved in the regulation of gene expression by stress factors and by components of stress signal transduction pathways, but also negatively regulates the transition to flowering (57). These results exemplify the potential of the relatively unimproved Ghp presented here for gene discovery, including those involved in adaptation to environmental challenges (27, 58).

Fig. 6.

Abiotic stress adaption of the Ghp. (A) Venn map of DEGs in salt (300 mM NaCl) and drought (17% PEG) treatments at two time periods, respectively. (B) The mature plants of GhP. (C) Significant comparative analysis of the expression level for GhECI3 among different treatments. *P < 0.05. (D) Phenotypes of two treatment groups and a control (water) after salt and drought treatments in VIGS plants. (E) Expression detection of gene GhECI3 in its silenced plants (TRV2::GhECI). (F) Significant comparison analysis of ion leakage and relative leaf water content. ANOVA analysis was performed with the standard t test, with least significant difference used for multiple comparisons. The different letters above the error bars indicate significant differences (P < 0.05) in all combinations.

Discussion

Assembling closely related wild and semiwild Gossypium genomes has the potential to inform our understanding of the evolution of domesticated Gossypium and of adaptation during species diversification. We present three new tetraploid Gossypium genome assemblies, including one from an early form of domesticated G. hirsutum (race punctatum) [(AD)1, Ghp], and two recently described wild species of tetraploid cotton, G. ekmanianum [(AD)6, Ge] and G. stephensii [(AD)7, Gs]. All three of these have close evolutionary proximity to domesticated Gh and share a similar genome size with previously sequenced upland cotton (59–61). Following its initial domestication, Gh spread throughout the drier regions of meso-America, in the process developing a diversity of morphological forms spanning the wild-to-domesticated continuum (32). Ghp is thought to be among the earliest domesticated forms of Gh (32). Our results confirm that they were domesticated from a common wild ancestor. The cryptic Gh-like wild species Ge and Gs are geographically isolated as island endemics in the Caribbean and South Pacific, respectively; however, they are so similar to Gh that they have been included as Gh in germplasm banks (8, 9). We found that they shared the greatest genetic similarity with Gh and Ghp and diverged from their most recent common ancestor about 0.75 Mya. We confirm the monophyletic origin of the polyploid clade (35). Notably, Gh shares with Ghp, Ge, and Gs a lower level of SNPs and InDel density (59) than found in the other allotetraploid genomes. The three assembled tetraploid Gossypium genomes represent a valuable resource for understanding cotton evolution and domestication. SVs are important in gene expression and functional evolution, and thus are thought to be important in crop domestication and genomic diversity (62–65). Several large chromosomal SVs in domesticated Gh have been reported that are related to geographic and genetic differentiation (13, 66, 67). Through identifying SVs among the eight tetraploid Gossypium genomes, we affirm the influence of SVs and discovered new SV-related regulatory mutations for specific genes during Gossypium evolution. We found a 450-bp SV on At10 that affects fiber length (43), which occurred after divergence from Gm and before Gh- and Gb-like divergence (Fig. 4). Another ∼4.48-Mb inversion on Dt04 accompanies the formation of Gh-like species (Fig. 4). These SVs may have contributed to genetic isolation between Gh- and Gb-like species. One notable SV is an inversion covering 986.42 kb on Dt01, which was detected in both domesticated Gh and Gb (Fig. 4). This parallelism suggests either a remarkable genomic convergence under human selection, or perhaps more likely, a region of introgression acquired during human manipulation of the two species (21). Collectively, it seems likely that structural variation is a significant factor in Gossypium divergence, and that it may have adaptive and agronomic relevance to crop phenotypes. Crop wild relatives are an important source of agricultural genetic diversity. The observation that the Gossypium pan-gene size does not rapidly asymptote suggests that there has been abundant genic diversification that accompanied diversification and global spread of the ∼50 species in the genus (15). The Gossypium pan-genome shows the expansion of new genes in Gh-like species that can provide the basis for crop improvement. The domestication and crop improvement history of cotton has entailed sequential genetic bottlenecks and an accompanying loss of genetic diversity (68–70); some of this lost diversity might have potential to mitigate problems associated with adaptation to various abiotic and biotic stresses. Compared with domesticated cotton, wild relatives have greater resistance to different abiotic and biotic stresses, including disease, drought, and salinity (7, 17). The relatively unimproved Ghp may be useful in this respect, for introgression of favorable genes into domesticated cotton and perhaps to contribute increased resilience to climate challenges. The completion of reference grade genome sequences for all seven allotetraploid cotton species provides a foundation for future investigation of genomic and phenotypic evolution and adaptation, both in natural settings and under domestication. Future comparative analysis across the full spectrum of “omics” will facilitate a better understanding of cotton evolution and domestication, and further elucidate the genome-level basis of elite traits, such as tolerance to environmental stresses and enhanced cotton fiber properties.

Materials and Methods

Plant Materials and Growth Conditions.

Leaves for DNA sequencing were collected from three cotton species, G. ekmanianum accession no. AD602, G. stephensii accession no. AD701, and G. hirsutum race punctatum accession no. Punctatum 25 (TX-1000). These perennials are all maintained at the National Wild Cotton Nursery in Sanya, China, which is supervised by the Institute of Cotton Research Institute, Chinese Academy of Agricultural Sciences (ICR-CAAS).

Genome Sequencing.

High-molecular weight genome DNA (gDNA) of three tetraploid cotton species/accessions (AD602, AD607, and Punctatum 25) was extracted according to the standard CTAB protocol, and subsequently fragmented for PacBio SMRTbell long-read sequencing libraries using Covaris g-TUBE Shearing Device. DNA fragments were purified using 0.45X AMPure beads, and DNA quality was assessed by both Qubit fluorometer and Agilent 2100 Bioanalyzer. The PacBio library was prepared by using the purified DNA fragments and sequenced on the PacBio Sequel I platform. Illumina paired-end sequencing libraries with an insert size of 350 bp were generated from the same gDNA extraction following the manufacturer’s protocol, and all libraries of three species/accessions were sequenced on the Illumina HiSeq X Ten platform as PE150. Illumina Hi-C was generated following a published protocol (71). Briefly, the leaves of 15-d-old seedlings were fixed in 1% formaldehyde solution. The nuclei/chromatin was extracted from the fixed tissue and digested with DpnII. The overhangs resulting from DpnII digestion were filled in using biotin-14-dCTP (Invitrogen) and Klenow (New England Biolabs). After dilution and relegation chromatin with T4 DNA ligase (New England Biolabs), gDNA was extracted and sheared to a size of 300 to 500 bp with a Bioruptor (Diagenode). The biotin-labeled DNA fragments were enriched using streptavidin beads (Invitrogen) and subject to library preparation according to a previous report (72). Illumina sequencers (Illumina HiSeq X Ten platform) carried out the sequencing of the Hi-C libraries. HiC-Pro (v2.10.0) was used to evaluate Hi-C data quality (72). Samples from leaves, stems, and stem apices of mature Ghp, Gs, and Ge plants were collected for extracting RNA. RNA-sequencing (RNA-seq) libraries were constructed using the protocol of NEBNext Ultra RNA Library Prep Kit for Illumina (New England Biolabs) and sequenced on the Illumina X Ten platform.

Contig Assembly.

De novo genome assembly was performed mainly using the PacBio SMART long reads with FALCON (https://github.com/PacificBiosciences/FALCON/, falcon-kit==1.8.1) (22). Briefly, we first selected the longest 50× of subreads as seeds to do error correction. These filtered data were used in FALCON for assembly with the parameters: length_cutoff_pr = 5000, max_diff = 100, max_cov = 100. The resulting primary contigs (p-contigs) were then polished using Quiver (https://www.pacb.com/support/software-downloads) (73) by aligning total SMRT reads. Finally, Pilon (v1.18) (74) were used to perform a second round of error correction with Illumina PE reads (insertion size = 350 bp).

Chromosome Assembly Using Hi-C.

To avoid artificial bias, the following type of reads were removed: 1) reads with ≥10% unidentified nucleotides (N); 2) reads with >10 nt aligned to the adapter, allowing ≤10% mismatches; 3) reads with >50% bases having phred quality < 5. The filtered Hi-C reads were aligned against the contig assemblies with BWA (v0.7.8) (75). Reads were excluded from subsequent analysis if they did not align within 500 bp of a restriction site or did not uniquely map, and the number of Hi-C read pairs linking each pair of scaffolds was tabulated. LACHESIS (https://github.com/shendurelab/LACHESIS) (76) used hierarchical agglomerative clustering to 26 groups. Juicebox v1.22 (https://github.com/aidenlab/Juicebox) was finally used to order the scaffolds in each group.

Assembly Assessment.

The genome assembly was evaluated by mapping the high-quality reads from 350-bp insert size PE libraries to the Hi-C assembly using BWA-mem. The distribution of the sequencing depth at each position was calculated to measure the completeness of the genome assembly. BUSCO (v3.0.2) (23) was used to evaluate the assembly completeness of three cotton genomes with 1,440 embryophyte genes from the “Embryophyta_odb9” database. LAI was used to evaluate assembly continuity and completeness by full-length LTR retrotransposons (LTR-RTs) (24). LTRharvest (v1.5.3) (77) (parameters: “-similar 85.00 -vic 10 -seed 30 -seqids yes -motif TGCA -motifmis 1 -minlenltr 100 -maxlenltr 3500 - mindistltr 1000 -maxdistltr 20000 -mintsd 4 -maxtsd 20”) and LTR_FINDER (V 64–1.0.5) (78) (parameters: “-w 2 -l 100 -L 3500 -d 1000 -D 20000 -M 0.3”) were used to de novo predict the candidate LTR-RTs in the three genome assemblies. LTR_retriever (v2.9.0) (79) was then used to combine and refine all the candidates to get the final, complete LTR-RTs. Each LAI score was calculated based on the formula: LAI = (intact LTR-RTs length/total LTR-RTs length) × 100%.

Repeat Annotation.

Repeat annotation was carried out based on de novo predictions and homolog-based predictions for the three new cotton genomes. For de novo-based predictions, RepeatModeler1 (v1.0.8), RepeatScout (v1.0.5), and LTR_FINDER (v1.07) were used to predict TEs and to build a TE library. We integrated this TE library with a known repeat library (Repbase v15.02, homolog-based) and used these with RepeatMasker (v3.3.0) to predict TEs. RepeatProteinMask (v3.3.0, www.repeatmasker.org/RepeatMasker) which makes homology-based predictions, was performed to detect TEs in these three cotton genomes by comparing them to the TE protein database. Tandem repeats were detected in the genome using Tandem Repeats Finder (TRF, v4.07b).

Gene Annotation.

A combination of de novo, homology-based, and RNA-seq–based predictions were employed to annotate the PCG in the three cotton genomes. Five ab initio gene-prediction programs were used to predict genes, including Augustus (v3.0.2) (80), Genescan (v1.0) (81), Geneid (v1.4) (82), GlimmerHMM (v3.0.2) (83), and SNAP (v2013-02-16) (84). Protein sequences from six dicot species [i.e., A. thaliana (85), Theobroma cacao (86), Populus trichocarpa (87), G. hirsutum (14), G. arboreum (14), and G. raimondii (29)] were downloaded from CottonGen (88), Ensembl (89), and the National Center for Biotechnology Information (NCBI) (90) and aligned against to the genome using WUblast (v2.0) (91). Genewise (v2.2.0) (92) was employed to predict gene models based on the sequence alignment results. For RNA-seq–based predictions, reads from more than four tissues (leaves, stems, and stem apices) RNA-seq data, which were detected in our research were, aligned to the three cotton genomes using TopHat (v2.0.13) (93) to identify exons region and splice positions. The alignment results were then used as input for cufflinks (v2.1.1) (94) to assemble transcripts to the gene models. In addition, the RNA-seq data were assembled by Trinity (v2.1.1), creating several pseudo-ESTs, which were mapped to each assembly by BLAT (v3.2.3) (95) and used to predict gene models via PASA (r20140417) (96). A weighted and nonredundant gene set was generated by EVidenceModeler (EVM, v1.1.1) (97), which merged all genes models predicted by the above three approaches. These were combined with the transcript assembly, and PASA was used to adjust the gene models generated by EVM.

Resistance Gene Analog Identification and Evolution.

To predict resistance gene analogs (RGAs) in cotton tetraploids, RGAdb from RGAugury (https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-3197-x) software was downloaded (98). Protein sequences of all annotated genes of cottons were aligned to the RGAdb using BLASTP with an E-value cutoff of 1e-05. Seven RGA-related domains and motifs—including NB-ARC, NBS, LRR, TM, STTK, LysM, CC, and TIR—were searched by InterProScan, hmmscan, and phobius from RGAugury pipeline in annotated genes.

Functional Annotation.

Predicted protein sequences were assigned functions by searching six protein/function databases: NR, InterPro, GO, Kyoto Encyclopedia of Genes and Genomes (KEGG), Swiss-Prot, and TrEMBL. We used InterproScan46 (v20180213) (99) to search the InterPro database with parameters: -f TSV -dp -gotermes -iprlookup -pa. For the other five databases, BLAST was run with an E-value cutoff of 1e-5. Results from these databases were concatenated together. The R package Clusterprofiler47 (100) was used for GO term and KEGG enrichment analyses.

Orthology and Pan-Genome Analysis.

Protein sequences of annotated genes from eight genomes [G. kirkii (30), G. australe (27), G. longicalyx (26), G. herbaceum (14), G. arboreum (25), G. thurberi (28), G. turneri (29) and G. raimondii (29)] were analyzed in conjunction with the proteins from the eight genomes from allotetraploids (Gh (15), Gb (15), Gt (15), Gm (15), Gd (15), Ge, Gs, and Ghp) to determine orthology. The longest proteins for each gene were in an all-versus-all BLASTP with an E-value cutoff of 1e-5. OrthoFinder (v2.2.7) (101) was used to detect orthogroups of homologous genes from all genomes using default parameters. Single-copy gene orthogroups were aligned with MUSCLE (v3.8.31) (102) and concatenated into a superalignment. RAxML (v8.0.19) was used to build a phylogenetic tree with the parameters: “-n cds -m GTRGAMMA -p 12345 -x 12345 -# 1000 -f ad”. K and Ks values were calculated for single-copy orthologous genes between each diploid cotton genome and each tetraploid cotton subgenome by KaKs_Calculator (v2.0) software (103). Divergence times (T) were estimated using the formula T = Ks/2r (substitution rate r = 2.6 × 10−9) (104). Putative positively selected genes were detected using the branch-site model in PAML (v4.7) (105). Genome synteny blocks containing at least four genes was detected using mcscan (https://github.com/tanghaibao/jcvi/wiki/MCscan-(Python-version)) with parameter: –cscore = 0.90, –iter = 1. Gene families for the eight tetraploid cottons (Gh, Gb, Gt, Gm, Gd, Ge, Gs, and Ghp) were generated by OrthoFinder. Gene families that were shared among the eight genomes were defined as core gene families, and those that only existed in one genome were defined as species-special gene families. The gene families that were presenting in one to seven samples were defined as dispensable gene families.

Virus-Induced Gene Silencing.

Virus-induced gene silencing (VIGS) of the GhECI3 was performed to verify their potential functions. Here, we used G. hirsutum race Marie-Galante 85 since it has been demonstrated to have better salt stress tolerance in our previous study (106). First, the VIGS vector TRV::GhECI3 was constructed by recombining ∼300-bp fragments of GhECI3 into pTRV-RNA2 vector and introducing into Agrobacterium tumefaciens strain GV4104. TRV::00, without recombined fragments, was used as a control vector. Then, this Agrobacterium culture was used to infect seedlings of G. hirsutum race Marie Galante 85 (MAR85) according to a previous protocol (27). The transformed cotton seedlings were grown under greenhouse conditions with 25 °C and 8-h dark/16-h day cycle. After 20 d post-Agrobacterium inoculation, the VIGS-plants and non-VIGS plants were exposed to salt (300 mM NaCl) and drought (17% PEG6000) treatment for 3 d. Finally, we collected the leaves of TRV::GhECI3, TRV::00, and non-VIGS seedlings for morphological and physiological analysis.

Oligo Probes.

Six probes for RGAs were designed based on the Ge genome sequence. These oligo probes were synthesized by Ningbo Kangbei Biochem, which attached a 6-carboxyfluorescein (6-FAM) or 6-carboxytetramethylrho-damine (TAMRA) to the 5′ end. Primer sequence information is shown in . The oligo probes were designed according to a previous method (107). Briefly, the RGA sequences enriched in the chromosomes A04, A11, and D11 of Ge were analyzed using the TRF algorithm, using alignment parameters of 2, 7, and 7 for match, mismatch, and indels, respectively. The tandem repeats in each chromosome were identified based on a minimum alignment score of 50 and were divided into three classes with different size of period distances (<20, 20 to 60, and >60). At the same time, the tandem repeats were physically mapped onto the genome sequence using a web server B2DSC (mcgb.uestc.edu.cn/b2dsc) to predict the distribution on chromosomes. The RGA repeat sequences specific to these chromosomes in the genome were determined using the SPSS software (v22.0, SPSS).

FISH Analysis of RGA-Derived Oligo Probes.

Root tips of five cotton species—G. hirsutum (cultivar: TM-1), G. barbadense (cultivar: 3-79), G. tomentosum (accession no. in ICR-CAAS: AD3-LZ), G. mustelinum (accession no. in ICR-CAAS: AD4-LZ), and G. darwinii accession (accession no. in ICR-CAAS: AD5-07)—were harvested from circa 6-d-old incubator-grown seedlings. Root tips were pretreated using 0.089 mM cycloheximide at 20 °C for 80 min, fixed in methanol-acetic acid (3:1), and then stored at 4 °C for 24 h. Chromosome preparations of metaphase chromosomes were created according to a previously reported method (108). The protocol of ND-FISH using synthesized probes was described by Tang et al. (109). Briefly, 10 μL of hybrid solution with 1.0 μL working solution of each probe and residual volume of 2× SSC 1× TE (pH7.0) were added to the metaphase chromosome slides of different cotton species and covered with a plastic film cover. Hybridization took place at 42 °C for 1 to 3 h. After hybridization, the slides were replaced in 2× SSC solution until the plastic film cover fell off naturally. Slides were dried in the dark. Chromosomes were counter-stained with DAPI in Vectashield antifading solution (Vector Laboratories) under a coverslip. Slides were examined using Zeiss Imager M2 microscope. FISH images were captured using CCD camera (MetaSystems CoolCube 1). The photos and signals were merged using MetaSystems Isis software.

RNA-Seq.

All samples for RNA-seq were collected from the National Wild Cotton Nursery in Sanya, China. Four-week-old seedlings of Ghp and Gh (TM-1) were exposed to both salt (300 mM NaCl) and polyethylene glycol (200 g/L PEG). Leaf samples were collected post treatment at 0, 12, and 24 h. Experiments were reproducible in three independent repetitions. In the follow-up analysis, we use the samples at 0 h as the control group (CK), and other samples at 12 h and 24 h after salt or PEG treatment as a different treatment group (S12P, S24P, S12R, and S24R, represent 12-h PEG, 24-h PEG, 12-h NaCl, and 24-h NacCl treatments, respectively). All fresh tissues were frozen in liquid nitrogen immediately and stored at −80 °C before processing. Total RNAs for each sample were extracted using TRIzol Reagent (Invitrogen) according to the manufacturer’s instructions. RNA-seq libraries were prepared using the Illumina standard mRNA-seq library preparation kit (Illumina) and sequenced on an Illumina NovaSeq platform as pair-end short reads (150 bp). RNA-seq data were mapped to the corresponding genomes using Tophat2 (v2.0.8) (93). HTSeq v0.6.1 (110) was employed to count the number of reads mapped to each gene. FPKM (fragments per kilobase of transcript per million mapped reads) was calculated for each gene based on the length of the gene and number of reads mapped to that gene. Differential expression analysis of two groups was (treatment group vs. control group) performed using the DESeq R package (1.18.0) (111). Genes with an adjusted P < 0.05 were considered differentially expressed.

Genomic SV Detection.

We aligned seven allotetraploid cotton genomes to the Gm (AD4_JGI) reference genome and then applied three methods to identify SVs, including smartie-SV (https://github.com/zeeev/smartie-sv) (39), SyRI (https://github.com/schneebergerlab/syri) (41), and SVMU (https://github.com/mahulchak/svmu) (40). Specifically, the pipeline of Smartie-SV was performed based on the BLASR (v5.3.2) alignment with default parameters; we extracted alignment pairs from any pair of genomes based on nucmer (v3.23) (–mum –maxgap = 500 –mincluster = 1000) to serve as input for the packages SyRI and SVMU with default parameters (112). Then, we aligned Illumina reads of seven tetraploid cotton genomes to the Gm reference genome to identify SVs using Breakdancer (v1.3.6). On the basis of the above pipeline, we obtained four raw sets of SVs. For insertions and deletions, we merged four raw set using package Jasmine (v1.0.11, https://github.com/mkirsche/Jasmine) with the parameters “min_support = 1 max_dist = 100 k_jaccard = 9 min_seq_id = 0.2 spec_len = 30”, and identified candidate insertions and deletions supported by at least two methods (62). For inversions, we also only considered candidates supported by at least two methods by using bedtools (113). Annotation of genomic SVs was performed using the package ANNOVAR (v2019Oct24) (114). Genomic SVs were categorized as being in exonic regions (overlapping with a coding exon), intronic regions (overlapping with an intron), splice sites (within 2 bp of a splicing junction), upstream and downstream regions (within a 1-kb region upstream or downstream from the transcription start site), and intergenic regions.

102 in total

1. Polyploidy: Pitfalls and paths to a paradigm.

Authors: Douglas E Soltis; Clayton J Visger; D Blaine Marchant; Pamela S Soltis
Journal: Am J Bot Date: 2016-05-27 Impact factor: 3.844

2. AtBXL1, a novel higher plant (Arabidopsis thaliana) putative beta-xylosidase gene, is involved in secondary cell wall metabolism and plant development.

Authors: Thomas Goujon; Zoran Minic; Abdelhak El Amrani; Olivier Lerouxel; Estelle Aletti; Catherine Lapierre; Jean-Paul Joseleau; Lise Jouanin
Journal: Plant J Date: 2003-02 Impact factor: 6.417

10. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy.

Authors: David M Emms; Steven Kelly
Journal: Genome Biol Date: 2015-08-06 Impact factor: 13.583