Literature DB >> 29194487

Comparative Genomics of an Unusual Biogeographic Disjunction in the Cotton Tribe (Gossypieae) Yields Insights into Genome Downsizing.

Corrinne E Grover1, Mark A Arick2, Justin L Conover1, Adam Thrash2, Guanjing Hu1, William S Sanders2,3,4, Chuan-Yu Hsu2, Rubab Zahra Naqvi5, Muhammad Farooq5, Xiaochong Li6, Lei Gong6, Joann Mudge7, Thiruvarangan Ramaraj7, Joshua A Udall8, Daniel G Peterson2, Jonathan F Wendel1.   

Abstract

Long-distance insular dispersal is associated with divergence and speciation because of founder effects and strong genetic drift. The cotton tribe (Gossypieae) has experienced multiple transoceanic dispersals, generating an aggregate geographic range that encompasses much of the tropics and subtropics worldwide. Two genera in the Gossypieae, Kokia and Gossypioides, exhibit a remarkable geographic disjunction, being restricted to the Hawaiian Islands and Madagascar/East Africa, respectively. We assembled and use de novo genome sequences to address questions regarding the divergence of these two genera from each other and from their sister-group, Gossypium. In addition, we explore processes underlying the genome downsizing that characterizes Kokia and Gossypioides relative to other genera in the tribe. Using 13,000 gene orthologs and synonymous substitution rates, we show that the two disjuncts last shared a common ancestor ∼5 Ma, or half as long ago as their divergence from Gossypium. We report relative stasis in the transposable element fraction. In comparison to Gossypium, there is loss of ∼30% of the gene content in the two disjunct genera and a history of genome-wide accumulation of deletions. In both genera, there is a genome-wide bias toward deletions over insertions, and the number of gene losses exceeds the number of gains by ∼2- to 4-fold. The genomic analyses presented here elucidate genomic consequences of the demographic and biogeographic history of these closest relatives of Gossypium, and enhance their value as phylogenetic outgroups.
© The Author 2017. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.

Entities:  

Keywords:  Gossypium; genetic divergence; genome evolution; long-distance dispersal; molecular evolution

Mesh:

Year:  2017        PMID: 29194487      PMCID: PMC5737505          DOI: 10.1093/gbe/evx248

Source DB:  PubMed          Journal:  Genome Biol Evol        ISSN: 1759-6653            Impact factor:   3.416


Introduction

One of the intriguing evolutionary phenomena that characterizes the cotton tribe, Gossypieae, is the prevalence of long-distance, transoceanic dispersal (Wendel and Grover 2015). The most famous of these occurred in the cotton genus (Gossypium), which includes the intercontinental dispersal of an African species to the Americas in the mid-Pleistocene (Wendel 1989) that gave rise to the New World allopolyploid cottons (including the primary cottons of commerce, i.e., Go. hirsutum and Go. barbadense). Outside of Gossypium, multiple long-distance dispersals have occurred during the evolution of the tribe (Stephens 1958, 1966; Fryxell 1979; Wendel 1989; Wendel and Percival 1990; Wendel and Percy 1990; Dejoode and Wendel 1992; Wendel and Albert 1992; Seelanan et al. 1997; Wendel and Cronn 2003). One example includes the sister genera Kokia and Gossypioides, from Hawaii and southeast Africa, respectively. Based on preliminary molecular divergence estimates derived from chloroplast and nuclear genes, these two genera are estimated to have diverged from each other in the Pliocene, ∼3 Ma, and from Gossypium during the Miocene, perhaps 10–15 Ma (Seelanan et al. 1997; Cronn et al. 2002). Kokia (Malvaceae) is a small genus of Hawaiian endemics comprising four species that were once widespread components of Hawaiian forests yet now are either endangered (three species) or recently extinct (K. lanceolata Lewton) (Bates 1990; Sherwood and Morden 2014; Morden and Yorkston 2017). Few individuals remain of the two extant species, K. kauaiensis (Rock) Degener and Duvel and K. drynarioides (Seem.) Lewton, the latter being nearly extinct in the wild, whereas the third endangered species, K. cookei Degener, exists only as a maintained graft derived from a single individual (US Fish and Wildlife Service 2012; Sherwood and Morden 2014). Due to the significance of Kokia to Hawaiian forests, diversity in the genus has been evaluated for the purposes of conservation (Sherwood and Morden 2014; Morden and Yorkston 2017). A surprising amount of diversity within and among species has been detected, particularly given the demographic history of Kokia, which includes the original genetic bottleneck associated with dispersal to the Hawaiian Islands, subsequent interisland dispersals, and the subsequent bottlenecks due to habitat loss and the introduction of competitive and/or damaging alien species (Sherwood and Morden 2014; Morden and Yorkston 2017). The native region of Gossypioides, the sister genus to Kokia, is located over 17,500 km distant in East Africa and Madagascar (fig. 1). The two species that comprise the genus, G. kirkii M. Mast. and G. brevilanatum Hoch. (East Africa and Madagascar, respectively), are themselves reproductively isolated and, with Kokia, are cytologically distinct from the remainder of the cotton tribe in that they appear to have experienced an aneuploid reduction in chromosome number. Specifically, while most genera in the Gossypieae have a haploid chromosome base of n = 13, species in both Kokia and Gossypioides are n = 12, likely representing a chromosome loss or fusion event. The two species of Gossypioides also are cytogenetically distinct, with an unusually long chromosome pair in G. brevilanatum (Hutchinson and Ghose 1937; Hutchinson 1943). Hutchinson (1943) notes that successful grafts can be made between K. drynarioides and G. kirkii, and that their shared chromosomal reduction (n = 12) is unique in the tribe.
. 1.

—Modern geographic ranges of the genus Kokia in Hawaii and Gossypioides in East Africa/Madagascar, and their estimated divergence time (Ma, million years ago).

—Modern geographic ranges of the genus Kokia in Hawaii and Gossypioides in East Africa/Madagascar, and their estimated divergence time (Ma, million years ago). Despite extensive research on the evolution of Gossypium, these sister genera have been understudied, except for their utility as outgroups for cotton phylogenetic and genomic research (Wendel and Grover 2015) and, in the case of Kokia, for assessments of current status and diversity (Morden and Yorkston 2017). Direct comparisons of the two genera are limited. Estimates of synonymous substitutions for nuclear gene orthologs indicate that the distance between K. drynarioides and G. kirkii is less than that between basally diverged species in Gossypium (Wendel and Grover 2015), that is, ∼2% versus 3.6% (Cronn et al. 2002; Flagel et al. 2012), although these estimates for Kokia and Gossypioides are based on few genes. Genomic resources for both genera are minimal and access to plant material is limited. With the recent exception of studies on divergence diversity within and among Kokia species noted above, much of our knowledge regarding these genera is decades old (Hutchinson 1947; Fryxell 1968; Seelanan et al. 1997). The history of these genera, however, is biogeographically intriguing. The current geographic ranges of Kokia in the Hawaiian Islands and Gossypioides in East Africa-Madagascar, combined with their sister-genus status and divergence time estimates, implies that there has been at least one significant transoceanic traversal to the relatively young Hawaiian archipelago. The present islands began to emerge only ∼4–6 Ma (Flinders et al. 2010; Lim and Marshall 2017), an age on the same order of magnitude as that estimated for the divergence between Kokia and Gossypioides (Seelanan et al. 1997). Here, we apply a whole-genome sequencing strategy to understand the evolution and divergence of these two genera from a genomic perspective. We present a draft assembly of K. drynarioides, and compare it with the prerelease reference-quality sequence of G. kirkii. Through genome sequence comparisons, we derive a more precise estimate of the divergence time between the two genera, and their similarities and differences with respect to their suite of genes and repetitive sequences. As these species represent the two closest genera to the cotton genus, this information may prove informative with respect to understanding the evolution and composition of the cotton genome.

Materials and Methods

Sequencing and Genome Assembly of K. drynarioides

DNA was extracted from mature leaves using the Qiagen Plant DNeasy kit (Qiagen). Total genomic DNA was independently sheared via Covaris into two average sizes, that is, 350 and 550 bp, for Illumina library construction. A single, independent library was constructed from each fragment pool using the Illumina PCR-free library construction kit (Illumina). The libraries were sequenced on a single lane of Illumina HiSeq2000 and two MiSeq flowcells (both at the Institute for Genomics, Biocomputing and Biotechnology, Mississippi State University). The reads were trimmed and filtered with Trimmomatic v0.32 (Bolger et al. 2014) with the following options: 1) sequence adapter removal, 2) removal of leading and/or trailing bases when the quality score (Q) <28, 3) removal of bases after average Q <28 (8-nt window) or single base quality <10, and 4) removal of reads <85 nt. The trimmed DNA data and RNA assembly were jointly assembled via ABySS v2.0.1 (Simpson et al. 2009), using every fifth kmer value from 65 through 200. Each assembly was further scaffolded with ABySS using the MEGAHIT-derived transcripts. The assembly with the highest E-size (Salzberg et al. 2012) was retained for improvement and analysis. ABySS Sealer v2.0.1 (Paulino et al. 2015) was used to fill gaps in the retained assembly using every tenth kmer starting at 100 and decreasing to 30. Pilon v1.22 (Walker et al. 2014) polished the resulting gap-filled assembly using all trimmed DNA data. QUAST v4.5 (Gurevich et al. 2013) was used to generate the final assembly statistics. The K. drynarioides genome is available under NCBI BioProject PRJNA400144.

Genome Annotation of K. drynarioides and G. kirkii

RNA was extracted from three biological replicates of both K. drynarioides and G. kirkii for the purpose of annotation. Three-centimeter (length) seedling leaves were collected and RNA as extracted using the Concert Plant RNA Reagent (Invitrogen) according to the manufacturer’s instructions. Illumina libraries were generated using the TruSeq RNA Sample Preparation Kit (Illumina) in preparation for paired-end, 150-nt sequencing. Sequencing was completed on the Illumina HiSeqX Ten at BerryGenomics (Beijing). MEGAHIT commit: 02102e1 (Li et al. 2015) was used to assemble the RNA data into transcripts. RNA-seq reads are available under BioProject PRJNA400144. MAKER (v2.31.6) (Holt and Yandell 2011) annotation of the genome was then completed in two rounds, using only contigs >1 kb in size and training MAKER with Kokia-specific sequences. First pass de novo annotations were derived from Genemark (v4.3.3) (Lomsadze et al. 2005) and retained for MAKER training. At the same time, BUSCO (v2) (Simão et al. 2015) was used both to train Augustus and create a Snap model (Korf 2004). Finally, Trinity (v2.2.0) (Grabherr et al. 2011) was used to create an RNASeq-assembly to pass to MAKER as EST evidence. The first pass of MAKER was run using the combination of: 1) the output from Genemark, 2) the BUSCO-generated Snap model, 3) the BUSCO-trained Augustus (Stanke et al. 2008) model, 4) the Trinity RNASeq-assembly as ESTs, and 5) the UniProt protein database (The UniProt Consortium 2017). After the first pass of MAKER was complete, the annotations generated by MAKER were used to train Augustus and generate a second Snap model. MAKER was run again with the same input except using the newly generated Snap model (#2 above) and Augustus model (#3 above) to replace those in the first pass. All annotations were output to gff format and can be found at https://github.com/Wendellab/KokiaKirkii. The prerelease genome sequence of G. kirkii (PRJNA400144) was similarly annotated (for assembly notes, please see https://github.com/Wendellab/KokiaKirkii). As with K. drynarioides, RNA was 1) extracted from three biological replicates of 3 cm (length) G. kirkii seedling leaves using the Concert Plant RNA Reagent (Invitrogen) according to the manufacturer’s instructions, 2) prepared for sequencing via TruSeq RNA Sample Preparation Kit (Illumina); and 3) sequenced as paired-end, 150 nt on the Illumina HiSeqX Ten at BerryGenomics (Beijing). MEGAHIT commit: 02102e1 (Li et al. 2015) was used to assemble the RNA data into transcripts, which were used in the same MAKER (Holt and Yandell 2011) iterations as used for K. drynarioides (see above). RNA-seq reads are available from NCBI PRJNA400144.

dN/dS Estimation

Amino acid sequences from G. kirkii, Gossypium raimondii (Paterson et al. 2012), and K. drynarioides were clustered using OrthoFinder v1.1.4 (Emms and Kelly 2015), which utilizes a Markov clustering algorithm of normalized BLASTp scores to infer homology between proteins sequences from different species; default values were used for the inflation parameter (1.5) in the Markov clustering. Orthologous groups containing only a single representative from all species were retained (12,281 orthogroups out of 21,415 total, discarding ∼17,000 genes from both G. kirkii and K. drynarioides). These retained groups were subsequently discarded if one or more representatives in that group contained ambiguous nucleotide bases (indicating poor sequence coverage; 50 groups total). Amino acid sequences from each possible pairwise group (Gossypium raimondii + G. kirkii, Go. raimondii + K. drynarioides, K. drynarioides + G. kirkii) were aligned using the pairwise2 python package (https://github.com/biopython/biopython/blob/master/Bio/pairwise2.py) and the BLOSUM62 substitution matrix (Eddy 2004); the highest scoring alignment then served as a guide for codon-aligning the CDS sequences using a custom python script (https://github.com/Wendellab/KokiaKirkii). Pairwise dN and dS values were calculated via CODEML (PAML v.4.9; Yang 2007) and groups with any pairwise dS > 0.6 were removed due to possible inclusion of nonorthologous proteins; this threshold represents the upper-limit average of dS values between Go. raimondii and Theobroma cacao, a more distant relative (Wang et al. 2012). Distributions of all pairwise dN, dS, and dN/dS values were evaluated, and basic statistics (mean, median, and SD) were calculated in R (R Core Team 2017).

Estimating Divergence Times

Absolute rates of synonymous substitutions have been estimated for eight angiosperm families (De La Torre et al. 2017), fortuitously including the Malvaceae where the rate of substitution between Theobroma and Gossypium is estimated to be 4.56E–09/year (based on 42 genes). Here, we extended this analysis to include two orders of magnitude more genes (n = 13,643 single copy orthologs) using published genome sequences for these taxa. In estimating dS values between T. cacao and Go. raimondii, we removed values >3 to eliminate saturated synonymous sites (43 genes). We then used the equation r = dS/(2 T), where T is the fossil (perhaps 60 Ma; Carvalho et al. 2011 and www.timetree.org) and sequence-calibrated estimate for divergence of Gossypium and Theobroma, r is the number of synonymous substitutions × synonymous site−1 × year−1 in the Malvaceae, and dS the median of the dS distribution using the 13,643 single copy orthologs. Divergence time between each pairwise group within Gossypieae was estimated using the equation T = dS/(2r), where r is the synonymous substitution rate calculated above and dS is the median dS value in the dS distribution for each pairwise comparison (after applying the filtering criteria).

Copy Number Variation Estimation

A custom Python script (https://github.com/Wendellab/KokiaKirkii) was used to calculate lineage-specific gene losses and duplications between G. kirkii and K. drynarioides, as inferred by OrthoFinder. First, orthologous groups were filtered for clusters with both copy number variation (CNV) among species and where either G. kirkii or K. drynarioides had the same copy number as in the sister genus, represented here by Go. raimondii. Gene gain or loss was inferred when the nonequal species contained more or fewer genes, respectively, than the species equivalent in copy number to Go. raimondii. Although an absolute limit on CNV size was not set, most orthologous groups did not have a CNV > 3 genes. Verification of inferred gains and losses was independently completed by 1) delly2 (Rausch et al. 2012) and 2) searching for the “missing” genes via gmap (Wu and Watanabe 2005) of the coding sequence to a masked genome, where all annotated genes are masked. Parameters for delly2 can be found at https://github.com/Wendellab/KokiaKirkii/tree/master/analysis/CNV_verification_delly2. Missing gene detection was completed by using genes annotated in Go. raimondii that were not present in either K. drynarioides or G. kirkii, that is, inferred losses, as gmap queries against the unmasked K. drynarioides or G. kirkii. Results were visualized with Circos (Krzywinski et al. 2009).

Repeat Clustering and Annotation

All forward reads from the DNA libraries were filtered for quality and trimmed to a standard 95 nt using Trimmomatic version 0.33 (Bolger et al. 2014) as per (https://github.com/Wendellab/KokiaKirkii). Surviving reads were randomly subsampled to represent a 1% genome size equivalent for each genome (Wendel et al. 2002; Hendrix and Stewart 2005) and combined as input into the RepeatExplorer pipeline (Novák et al. 2010, 2013), which is designed to cluster reads based on similarity and identify putative repetitive sequences using low-coverage, small read sequencing. Clusters containing a minimum of 0.01% of the total input sequences (i.e., 201 reads from a total input of 2,013,469 reads) were annotated by the RepeatExplorer implementation of RepeatMasker (Smit et al. 2013–2015) using a custom library derived from a combination of Repbase version 21.08 (Bao et al. 2015) and previously annotated cotton repeats (Grover et al. 2004, 2007, 2008b; Hawkins et al. 2006; Paterson et al. 2012). A cutoff of 0.01% read representation is common; however, we evaluated the suitability of this cut using a log of diminishing returns (supplementary fig. S1, Supplementary Material online; https://github.com/Wendellab/KokiaKirkii). Within the annotated clusters, the number of megabases (Mb) attributable to that cluster (i.e., element type) for each genome/accession was calculated based on the 1% genome representation of the sample and the standardized read length of 95 nt; total repetitive amounts for each broad repetitive classification were summed from these results. The genome occupation of each cluster (i.e., the calculated number of Mb) was normalized by genome size for each accession, resulting in the percent of each genome occupied by that element type, for use in multivariate visualization (i.e., Principle Coordinate Analysis and Principal Component Analysis). Raw counts were also log-transformed and visualized via PCoA. All analyses were conducted in R (R Core Team 2017); R versions and scripts are available at (https://github.com/Wendellab/KokiaKirkii).

Repeat Heterogeneity and Relative Age

Relative cluster age was approximated using the among-read divergence profile of each cluster, as previously used for Fritillaria (Kelly et al. 2015) and dandelion (Ferreira de Carvalho et al. 2016). Briefly, an all-versus-all BLASTn (Altschul et al. 1990; Boratyn et al. 2013) was conducted on a cluster-by-cluster basis using the same BLAST parameters implemented in RepeatExplorer. A histogram of pairwise percent identity was generated for each cluster and the trend (i.e., biased toward high-identity, “young” or lower-identity, “older” element reads) was described for each via regression models using R. Specifically, two regression models were used to describe the data as either linear (Y = a + bX) or quadratic (Y = a + bX + cX2), and the model with the highest confidence was determined using the Bayesian Information Criterion (Schwarz 1978). The read similarity profile for each cluster was automatically evaluated for each histogram to determine if the reads trend toward highly similar “young” or more divergent “older” reads, as previously characterized (Ferreira de Carvalho et al. 2016) but with an additional category. These categories include the following: 1) positive linear regression; 2) absence of linear regression; 3) negative linear regression; 4) positive quadratic vertical parabola, trend described by right side of vertex; 4b) positive quadratic vertical parabola, trend described by left side of vertex; 5) negative quadratic vertical parabola, trend described by right side of vertex; and 6) negative quadratic vertical parabola, trend described by left side of vertex and vertex at > 99% pairwise-identity (supplementary fig. S2, Supplementary Material online). Categories that trend toward highly similar reads (i.e., 1, 4, and 6) were interpreted as representing more recent divergences, whereas categories with lower identities (i.e., 2, 3, 4b, and 5) were interpreted as being composed of older elements. As with Ferreira de Carvalho (2016), this regression simply provides a relative characterization of cluster/element age and is not designed to detect statistically significant differences.

Repetitive Profiles between K. drynarioides and G. kirkii

Comparison of abundance for the annotated clusters in K. drynarioides and G. kirkii was computed in R (R Core Team 2017), including the assumption of a 1:1 ratio between K. drynarioides and G. kirkii cluster sizes if their repetitive profiles had remained static postdivergence. Differential abundance (in read counts) between K. drynarioides and G. kirkii for each cluster was evaluated via two-sample χ2 tests; all P values were subject to Benjamini–Hochberg correction for multiple testing (Benjamini and Yekutieli 2001).

Indel Characterization in K. drynarioides and G. kirkii

Indels in K. drynarioides and G. kirkii were evaluated by mapping each set of DNA sequencing reads to the Go. raimondii genome and using GATK (v 3.6) (Van der Auwera et al. 2002; McKenna et al. 2010; DePristo et al. 2011) to align and characterize indels. GATK indel calls were pruned to remove 1) positions with missing data in either G. kirkii or K. drynarioides or 2) heterozygous sites. The resulting table was imported into R (R Core Team 2017) for characterization of indels and length determination using the Go. raimondii reference state as an outgroup. Indels were characterized as insertions or deletions for each species under the following criteria: 1) the state must be different in K. drynarioides and G. kirkii; 2) either K. drynarioides or G. kirkii must share the state with the outgroup; 3) insertions are represented by longer sequence in either K. drynarioides or G. kirkii compared with the other two; and 4) deletions are represented by shorter sequence in K. drynarioides or G. kirkii as compared with the other two. Software versions and scripts are available at (https://github.com/Wendellab/KokiaKirkii).

Results

Kokia Genome Assembly and Annotation

ABySS assembly of the 80× coverage Illumina (trimmed; raw = 111×) led to 19,146 scaffolds (25,827 contigs) ranging in size from 500 to 2.29 Mb and comprising a total length of 520.9 Mb (supplementary table S1, Supplementary Material online; estimated genome size for K. drynarioides = 590 Mb; Wendel et al. 2002). Nearly 80% of the K. drynarioides assembly is represented in scaffolds of > 50 kb, which, in conjunction with an N50 of 176.7 kb, indicates a relatively contiguous genome. As an additional measure of genic completeness, we searched for 1,440 Benchmarking Universal Single-Copy Ortholog (BUSCO) groups (Simão et al. 2015) in the K. drynarioides assembly. This search recovered 1,377 BUSCOs (95.6%), with 1,213 (84.2%) recovered as single-copy (supplementary table S2, Supplementary Material online). Annotation of the K. drynarioides genome (supplementary table S3, Supplementary Material online) resulted in 29,231 gene models, ∼22% fewer than in the “gold-standard” Gossypium raimondii genome sequence (Paterson et al. 2012), which has 37,505 predicted protein-coding genes. For comparative purposes, we annotated the prerelease G. kirkii genome (34× coverage PacBio; PRJNA400144) in the same manner as the K. drynarioides genome using two iterations of MAKER and the G. kirkii leaf RNA-seq generated here. The preliminary version of the G. kirkii genome used here has greater contiguity than K. drynarioides, that is, an N50 of 616 kb and a total contig length of ∼530 Mb; however, BUSCO analysis recovered approximately the same number of complete and single-copy complete BUSCOs (1,349 and 1,213, respectively). The same annotation method also yielded approximately the same number of gene models in G. kirkii as in K. drynarioides (29,179 vs. 29,231).

Molecular Evolution between K. drynarioides and G. kirkii

OrthoFinder-based clustering resulted in 21,414 orthologous groups, of which 12,281 contained only one gene from each species (i.e., singleton groups). A disproportionate number of Go. raimondii genes were not included in any group, as compared with the other two genera (10,408 in Go. raimondii vs. 5,188 and 4,400 in G. kirkii and K. drynarioides, respectively), an observation consistent with the observation of nearly 8,000 additional gene models in the Go. raimondii genome (5,982 verified as “missing”; see Materials and Methods, supplementary table S4, Supplementary Material online). Rates of molecular evolution among these three lineages were estimated for each singleton group (supplementary table S5, Supplementary Material online), with the exception of those (n = 106) where any pairwise comparison resulted in dS > 0.6 (i.e., the upper estimate of the dS between Go. raimondii and T. cacao, see Materials and Methods). The median dS value for G. kirkii versus K. drynarioides was approximately half that of either Go. raimondii versus G. kirkii or Go. raimondii versus K. drynarioides (0.0383 vs. 0.0743 and 0.0810 substitutions per synonymous site per year, respectively; supplementary table S6, Supplementary Material online), whose median dS values were approximately equivalent (fig. 2). The median dN values for each comparison showed a similar pattern, that is, 0.0050 between the sister genera versus 0.0086 and 0.0095 substitutions per nonsynonymous site per year for Go. raimondii versus G. kirkii and Go. raimondii versus K. drynarioides, respectively (fig. 2 and supplementary table S6, Supplementary Material online).
. 2.

—Distribution of substitution rates between pairwise comparisons of Kokia drynarioides, Gossypioides kirkii, and Gossypium raimondii. The line graph depicts the frequency distribution between G. kirkii and K. drynarioides (red); G. kirkii and Go. raimondii (green); or K. drynarioides and Go. raimondii (blue) calculated for 12,281 genes. Inset into the frequency graph are box plots of both the values (including the median) of both synonymous substitutions (red) and nonsynonymous substitutions (black).

—Distribution of substitution rates between pairwise comparisons of Kokia drynarioides, Gossypioides kirkii, and Gossypium raimondii. The line graph depicts the frequency distribution between G. kirkii and K. drynarioides (red); G. kirkii and Go. raimondii (green); or K. drynarioides and Go. raimondii (blue) calculated for 12,281 genes. Inset into the frequency graph are box plots of both the values (including the median) of both synonymous substitutions (red) and nonsynonymous substitutions (black).

Divergence Time within Malvaceae

Earlier estimates of divergence times within the Gossypieae (Cronn et al. 2002) relied on dating calibrations derived from a single nuclear gene (AdhA) in the Brassicaceae (Koch et al. 2000) or palms (Morton et al. 1996), and on rates of chloroplast DNA evolution (Seelanan et al. 1997). More recently, divergence times for the Malvaceae, which includes cotton and chocolate, have been reported based on a single gene each from the chloroplast and the nuclear genomes (Richardson et al. 2015), which suggests that chocolate (Theobroma) and Gossypium diverged circa 60–70 Ma. Using a more extensive data set, absolute rates of synonymous substitutions have been estimated for eight angiosperm families (De La Torre et al. 2017), fortuitously including the Malvaceae; this analysis estimates the rate of substitution between Theobroma and Gossypium as 4.56E–09/year. Using this rate of substitution between Theobroma and Gossypium, whose fossil-based divergence time is ∼60 Ma, we revisit divergence times among Kokia, Gossypioides, and Gossypium using two orders of magnitude more genes (n = 13,643 single copy orthologs) extracted from the genome sequences for these taxa. The median of the resultant dS distribution (supplementary fig. S3, Supplementary Material online) was 0.4332, which predicts a synonymous substitution rate (r) of 3.61×10−9 synonymous substitutions per synonymous site per year, similar to that reported recently for 42 genes (De La Torre et al. 2017). Using this evolutionary rate, we estimate that Gossypium diverged from the Kokia and Gossypioides lineage between 10.29 and 11.22 Ma, and that the sister genera Kokia and Gossypioides diverged from each other ∼5.30 Ma.

Gene Copy Number Variation between K. drynarioides and Gossypioides kirkii

The 9,133 orthologous groups not classified as singleton groups were evaluated for evidence of CNV (see Materials and Methods), resulting in 2,991 candidate groups with possible copy number alterations in G. kirkii and 2,424 candidates in K. drynarioides. The remaining 3,718 groups were excluded either due to complexity (i.e., different copy numbers in each species) or because they were indicative of CNV between Go. raimondii and G. kirkii/K. drynarioides, but not between the sister genera themselves. Candidate CNV groups were evaluated for direction (gain vs. loss) and magnitude. We inferred 731 genes gained and 2,957 lost in G. kirkii (distributed among 259 and 2,730 orthologous groups, respectively; table 1). The CNV magnitude (i.e., the number of genes gained or loss per group) varied between one and seven, although two groups encompassed a remarkably large number of genes (i.e., 14 and 225; table 1); these were excluded from subsequent calculations as putative falsely annotated transposable elements or errors in the clustering algorithm. In K. drynarioides, we infer a somewhat similar number of gains and losses, with 790 genes gained in 499 orthologous groups and 2,008 genes lost from 1,925 orthologous groups. Thus, in both genera, the number of losses is ∼4-fold higher than the number of gains. The magnitude of gains varied from one to eight copies, whereas the magnitude of losses was slightly lower at one to six copies per group (table 1). Interestingly, the number of groups where genes were gained in duplicate for K. drynarioides (i.e., two genes gained in the same orthologous group) was nearly as high as the number where only one copy was gained (200 vs. 260 groups, respectively).
Table 1

The number of orthogroups affected by a gain or loss of a given size (i.e. gain/loss of 1, 2, 3, 4, 5, 6, 7, 8, 14, or 255 genes in a single orthogroup) for each species

Size of Gains/Losses in Orthogroup
Total EventsTotal Genes
1234567814a225a
Gain
Gossypioides kirkii2222353222011259731
Kokia drynarioides260200333110100499790
Loss
G. kirkii2,5511452562010002,7302,957
K. drynarioides1,85758720100001,9252,008

These groups were not used in calculation of Total Events or Total Genes. We infer these genes are either falsely annotated transposable elements or error in the clustering algorithm used.

The number of orthogroups affected by a gain or loss of a given size (i.e. gain/loss of 1, 2, 3, 4, 5, 6, 7, 8, 14, or 255 genes in a single orthogroup) for each species These groups were not used in calculation of Total Events or Total Genes. We infer these genes are either falsely annotated transposable elements or error in the clustering algorithm used. Because overlooked annotations affect our ability to infer CNV events, we evaluated each genome for a subset of the “missing” annotations using only the easiest to interpret cases (i.e., one gene in Go. raimondii vs. >1 [gains] or 0 [losses] in either G. kirkii or K. drynarioides). For the 211 gain events in G. kirkii and 394 in K. drynarioides meeting this criteria, few genes (1–8%) were recovered from the remaining genome sequences (see Materials and Methods), and in most cases, the predicted protein sequence was nonviable (supplementary table S4, Supplementary Material online). For the 2,144 losses in G. kirkii, 1,465 were recovered in the masked G. kirkii; however, 477 contained frame-shift mutations resulting in nonviable proteins, leading to an overall validation rate of 53.9%. Likewise, 872 of the 1,458 putative gene losses in K. drynarioides found in the nonannotated regions of the K. drynarioides genome, with 358 nonviable protein models (64.8% validation). Verification of deletions via delly2 led to a similar, but slightly lower overall validation rate for G. kirkii (40.8% of the original 2,957 inferred), but nearly complete validation of the original K. drynarioides deletion estimate (88.8%). The average number of deletions verified by the two independent methods suggests that the number of losses in both species exceeds the number of gains by ∼2-fold since divergence (5.3 Ma) with a similar number of losses in both G. kirkii and K. drynarioides. Clearly this remarkable result will warrant further research, including using improved genomes and syntenic analyses to uncover the true extent and identity of gene deletions.

Changes in the Repetitive Landscape between K. drynarioides and G. kirkii

Because K. drynarioides and G. kirkii have relatively compact genomes (both 590 Mb), multiple representatives of three cotton species previously used for repetitive analysis (Renny-Byfield et al. 2016) were included in the clustering to aid in the identification of repeat-derived sequences. Just over two million reads derived from these five species (comprising 1% genome size equivalents each) were coclustered using the RepeatExplorer pipeline, producing a total of 74,001 clusters (n >2 reads). Because the smallest clusters are not informative with respect to repetitive sequence evolution, we chose to annotate only those clusters comprising >0.01% of the total reads input (=201 reads); this procedure resulted in 274 retained clusters. We evaluated the cumulative read sum as the cluster number increases (clusters are numbered from largest to smallest) to confirm that the retained clusters represent a majority of the data set, that is, most of the input data were represented in the analyzed clusters (supplementary fig. S1, Supplementary Material online). Despite similarly sized genomes, K. drynarioides and G. kirkii show an ∼1 Mb difference in clustered repeats (109.4 Mb vs. 110.3 Mb, respectively), although this difference is not statistically significant (χ2P > 0.95). Contingency table analysis of the repetitive profiles of each species, as well as the total amount of repetitive DNA calculated for each, suggest that these profiles are indistinguishable (at P < 0.05), despite being an intergeneric comparison. Interspecies (intragenus) repetitive profiles for Gossypium species present in the analysis showed a different pattern, as expected from the 2-fold difference in genome size, whereby Go. raimondii (880 Mb) shows a highly distinct repetitive profile (P <0.05) compared with either species from subgenus Gossypium (i.e., Go. herbaceum and Go. arboreum; 1667 and 1689 Mb, respectively). Notably, the two A-genome species are not distinct (see discussion). To explore further the similarities and differences between the repetitive fractions of the K. drynarioides and G. kirkii genomes, we considered the possibility that while the overall repetitive profiles may not be significantly different, individual clusters may be. Toward this end, we conducted a χ2 test of independence for each cluster and applied a Benjamini–Hochberg correction for multiple testing. At P <0.05, 55 clusters (out of 188) are differentially abundant in K. drynarioides versus G. kirkii, 94.5% of which are LTR-transposable elements (i.e., 61.8% gypsy, 10.9% copia, and 20% unspecified). Greater abundance was more frequently observed in K. drynarioides versus G. kirkii (34 vs. 21 clusters), although the total number of reads in differentially abundant G. kirkii clusters was marginally greater (7,413 reads vs. 7,252, representing a 1.5-Mb genome-wide difference). Because these differentially abundant clusters could represent differences in either proliferation or decay/removal, we gauged the relative age of each cluster based on the method of Ferreira de Carvalho et al. (2016). This analysis attempts to characterize the age of each cluster based on the distinctiveness of the reads which comprise the cluster; that is, younger clusters will have reads that are skewed toward high similarity, whereas reads comprising older clusters will have more interread differences. While an imperfect measure, this characterization permits a generalized perspective on the repeats identified here. Overall, most of the repeats in K. drynarioides and G. kirkii displayed a pattern suggestive of older elements (202 “older” vs. 72 “young”); however, of the 55 differentially abundant clusters, nearly half (25) were categorized as “younger” (supplementary table S7, Supplementary Material online) and were annotated in approximately the same proportions as the overall TE type classification. Interestingly, over 80% of the “young” clusters were overrepresented in K. drynarioides, potentially reflecting differential amplification in these two species. Most of the clusters were broadly annotated as belonging to the Ty3/gypsy superfamily, a result commonly observed in plant genomes (fig. 3; Hawkins et al. 2006; Baucom et al. 2009; Paterson et al. 2009; Schnable et al. 2009; Tian et al. 2009; Lee and Kim 2014). Overall, gypsy elements comprise 77.6 and 76 Mb of the K. drynarioides and G. kirkii genomes, respectively, with uncategorized LTR-retrotransposons and Ty1/copia elements comprising the next most abundant repeats and in similar amounts in each genome (table 2). Unsurprisingly, the small genomes of K. drynarioides and G. kirkii (590 Mb) had lower absolute quantities of most repeat types than the included diploid Gossypium genomes (i.e., Gossypium raimondii, 880 Mb; Go. arboreum, 1689 Mb, and Go. herbaceum, 1667 Mb) except for the non-LTR retrotransposon category. Kokia drynarioides and G. kirkii have comparable or slightly greater amounts of non-LTR retrotransposons as these three cotton species, despite the latter having 2–3× larger genomes (fig. 3). This difference is due to the sole retroposon cluster recovered, which was in the top five largest clusters for both K. drynarioides and G. kirkii (although present in different absolute amounts). The high percent identity among reads for this cluster suggests it is relatively young, and it has likely experienced recent proliferation in these species. Furthermore, the cluster shows differential abundance between the two species, suggesting either that the proliferation began prior to species divergence and continued differentially afterwards, or that the two lineages experienced similar releases from repression for this element, although to varying degrees. The other differentially abundant clusters were largely annotated as putative gypsy elements (61.8%).
. 3.

—The (average) aggregate number of kilobases represented by each transposable element category for each species (genome sizes included next to species names). Transposable elements were broadly categorized into categories and their representation per species summarized. Multiple representatives were available for each Gossypium species (Renny-Byfield et al. 2016), allowing an estimate of SE for those species.

Table 2

Aggregate Amount of Each Broad TE Category per Genome (in kb)

LineageKokia drynarioidesGossypioides kirkiiGo. herbaceumGo. arboreumGo. raimondii
Unspecified2,4131,52012,29915,8425,267
DNA219190823610447
DNA/MULE-MuDR86761,9671,7844,592
LTR15,77016,20752,89048,24132,619
LTR/Copia9,4919,71931,40132,28326,110
LTR/Gypsy77,59676,000876,955970,883200,992
Aggregate Amount of Each Broad TE Category per Genome (in kb) —The (average) aggregate number of kilobases represented by each transposable element category for each species (genome sizes included next to species names). Transposable elements were broadly categorized into categories and their representation per species summarized. Multiple representatives were available for each Gossypium species (Renny-Byfield et al. 2016), allowing an estimate of SE for those species. Ancestral state reconstructions for the 22 clusters with the lowest P value (P < 0.001) were conducted using both K. drynarioides and G. kirkii, as well as three diploid cotton representatives as outgroup species (i.e., Gossypium raimondii, Go. arboreum, and Go. herbaceum). Patterns of both amplification and deletion were inferred (fig. 4), sometimes within the same cluster. For example, both K. drynarioides and G. kirkii have experienced reductions in copy number for repeat cluster 5 (gypsy), albeit to different extents (fig. 4). Likewise, the repeat represented by cluster 129 (gypsy) has experienced copy number growth in both K. drynarioides and G. kirkii, with the element attaining much higher copy numbers in G. kirkii (fig. 4). A large subset of the repeat clusters (20 out of 22) showed gain in one of the two lineages coupled with concomitant loss in the other, creating differentially abundant clusters. Notably, no identifiable pattern of gain/loss or age bias was identified with respect to TE type. These data implicate a recurring pattern of differential proliferation and removal of multiple different repetitive element families (mostly retrotransposons). Congruent with their equivalent genome sizes, no lineage bias was observed for amplification versus contraction.
. 4.

—Example ancestral state reconstructions for gain/loss of sequence in the 22 clusters with the lowest P value (P < 0.001) during the evolution of Kokia/Gossypioides/Gossypium. The total amount of sequence attributable to each cluster is given in kilobases, both next to the name (terminus) and at branch points. Patterns of both amplification (represented by green/blue color) and deletion (yellow/orange/red) were inferred, frequently within the same cluster and sometimes between sister taxa. (A) Exemplar cluster 5, gypsy; (B) Exemplar cluster 129, gypsy; (C) Exemplar cluster 110, gypsy; (D) Exemplar cluster 177, LTR-retrotransposon (unspecified type); and (E) Exemplar cluster 96, gypsy. Notably, no bias in TE type was detected for either gain/loss or age in Kokia and Gossypioides.

—Example ancestral state reconstructions for gain/loss of sequence in the 22 clusters with the lowest P value (P < 0.001) during the evolution of Kokia/Gossypioides/Gossypium. The total amount of sequence attributable to each cluster is given in kilobases, both next to the name (terminus) and at branch points. Patterns of both amplification (represented by green/blue color) and deletion (yellow/orange/red) were inferred, frequently within the same cluster and sometimes between sister taxa. (A) Exemplar cluster 5, gypsy; (B) Exemplar cluster 129, gypsy; (C) Exemplar cluster 110, gypsy; (D) Exemplar cluster 177, LTR-retrotransposon (unspecified type); and (E) Exemplar cluster 96, gypsy. Notably, no bias in TE type was detected for either gain/loss or age in Kokia and Gossypioides.

Patterns of Insertion and Deletion in K. drynarioides and G. kirkii

To explore further sequence gain and loss in these two genera, we polarized indels (as predicted by GATK; see Materials and Methods) for both K. drynarioides and G. kirkii using the Go. raimondii genome to represent the ancestral state. A gain or loss was inferred when one taxon shared the reference state with Go. raimondii and the other had an apparent insertion or deletion. Kokia drynarioides exhibited a greater number of both insertions and deletions; that is, of the 490,591 indels that passed our filtering criteria, 130,177 were insertions in K. drynarioides and 159,222 were deletions, whereas G. kirkii had a total of 87,951 insertions and 113,241 deletions. The distribution of insertion and deletion sizes was biased (for both) towards very small (<10 nt) indels; however, when considering the global pattern, insertions in K. drynarioides tended to be longer than in G. kirkii, whereas G. kirkii had a greater number of smaller insertions (fig. 5). For deletions, K. drynarioides and G. kirkii were largely similar in the number of smaller deletions; however, K. drynarioides exhibited more deletions as the size increased. The overall consequence of these differences in indel evolution resulted in a net gain of 68.6 kb for K. drynarioides and a net loss of 113.2 kb in G. kirkii, a total genome size difference of ∼181.8 kb (0.03% of genome size). The distribution of insertions and deletions across each chromosome was roughly even for both taxa, with up to a 2-fold difference in indel number across chromosomes (fig. 6).
. 5.

—The frequency of indels present between Kokia drynarioides (green) and Gossypioides kirkii (blue), parsed as insertions (top) and deletions (bottom).

. 6.

—Genomic distribution of copy number variations and indels in Kokia drynarioides (Left) and Gossypioides kirkii (Right). Ring 1: gene gains (dark) and losses (light). Ring 2: insertions. Ring 3: deletions. Ring 4: mutual gene losses in Kokia and Gossypioides, relative to Gossypium.

—The frequency of indels present between Kokia drynarioides (green) and Gossypioides kirkii (blue), parsed as insertions (top) and deletions (bottom). —Genomic distribution of copy number variations and indels in Kokia drynarioides (Left) and Gossypioides kirkii (Right). Ring 1: gene gains (dark) and losses (light). Ring 2: insertions. Ring 3: deletions. Ring 4: mutual gene losses in Kokia and Gossypioides, relative to Gossypium.

Discussion

Divergence and speciation are expected outcomes of long-distance insular dispersal, whose conceptual foundations are rooted in the observations of Darwin and many subsequent evolutionary biologists. Because of the small population sizes associated with dispersal-mediated genetic bottlenecks, islands serve as natural laboratories to study the effects of isolation and drift on character evolution, including, as we show here, on genome structure and features. The tribe Gossypieae is characterized by multiple long-range dispersals, ultimately achieving an aggregate geographic distribution that encompasses tropical and subtropical regions worldwide. With the exception of the type genus Gossypium, little is known about the genomes of genera in the Gossypieae, apart from estimates of genome size (Wendel et al. 2002). Here, we present a comparative analysis for the clade of two genera that together comprise the phylogenetic outgroup to Gossypium. We provide insight into the interesting biogeographic history of these genera and clarify the temporal framework for divergence between Gossypioides and Kokia as well as these two genera from Gossypium. This framework permits an analysis of the pace, patterns, and processes that have characterized genomic divergence among the three genera, including novel insights into gene loss, structural variation, and genome downsizing.

Temporal Framework for Divergence and Biogeographic Implications

Interest in the sister genera of Kokia and Gossypioides stems largely from their close evolutionary relationship to Gossypium, although Kokia is an important member of Hawaiian forest communities (see introduction). Early divergence estimates placed the most recent common ancestor of Gossypium and Gossypioides/Kokia at ∼10–15 Myr before present, and the Kokia versus Gossypioides split in the Pliocene at ∼3–5 Ma (Seelanan et al. 1997; Cronn et al. 2002). These initial estimates were from the pregenomics era, and hence were based on relatively few nuclear and plastid genes. Here, we present a updated estimate for the synonymous substitution rate (3.91×10−9 substitutions per site per year) within the Malvaceae using 13,643 single copy orthologs from Go. raimondii and T. cacao. We use this estimate, and a set of 12,175 nuclear orthologs inferred from the three genera of the Gossypieae, to confirm that the synonymous substitution rates are similar between Go. raimondii and either G. kirkii or K. drynarioides. This indicates that despite their disjunct geographic distribution and multiple sequential founder events, there are no significant differences in generation time and/or mutation rate per generation between G. kirkii and K. drynarioides or that any such differences are reciprocal in their effects. With respect to dating divergences, our genome-scale data set permits us to refine earlier estimates. Thus, in contrast to previous analyses, which estimated an ∼4-fold difference in divergence time between Gossypium and Gossypioides/Kokia, we estimate only a 2-fold difference; that is, the divergence of Gossypium from the Kokia/Gossypioides common ancestor occurred approximately twice as long ago as the divergence of those two sister genera from each other. Our estimate of 10.29–11.22 Ma for the divergence of Gossypium from G. kirkii/K. drynarioides is similar to previous estimates (Seelanan et al. 1997; Cronn et al. 2002; Senchina et al. 2003), which is remarkable observation given the fact that earlier estimates were based on two orders of magnitude fewer genes, although the caveat remains that these were derived from single representatives. The indication that K. drynarioides diverged from G. kirkii ∼5.30 Ma, instead of 3 Ma as reported earlier, may be biogeographically significant in that it suggests a divergence at about the same time as the earliest emergence estimate for the present Hawaiian Islands. Because a signature trait of the Gossypieae is multiple transoceanic dispersals, these divergence data may suggest multiple transoceanic voyages between intermediate locale in the evolutionary history of Kokia (and any now-extinct members of its clade) before its arrival and diversification in the Hawaiian Islands concomitant with local extinction at any geographically intermediate locations. We note, however, that the Hawaiian Islands are the world’s most isolated oceanic archipelago, without clear “stepping stones” across the Pacific Ocean from either continental hemisphere. Therefore, a credible alternative is that the antecedent of modern Kokia may have made a great leap circa 5.3 Ma to the Hawaiian archipelago with subsequent island-hopping as suitable habitat became available during the genesis and ecological development of the island chain. In any event, the biogeographic story is a remarkable one, as the two genera Kokia and Gossypioides are separated by a minimum of 17,500 km, and yet are each other’s closest relatives. This is even more striking when one considers that present species lack any clear mechanism for oceanic dispersal, as seeds sink relatively quickly. Seeds of many taxa in the tribe do possess a certain degree of salt-water tolerance (Stephens 1958; Fryxell 1979; Wendel and Grover 2015), however, so the possibility remains that this remarkable dispersal voyage entailed some sort of natural rafting on oceanic debris, either of seeds or of mature but undehisced capsules.

Extensive Gene Removal Differentiates Kokia and Gossypioides

The temporal framework provided above provides the opportunity to explore the relative evolutionary rates of genomic differentiation. With respect to genes, variation in gene content among species and individuals is more extensive than once thought, leading to the concept of “core” and “dispensable” genomes (together, the pan-genome; Medini et al. 2005; Hirsch et al. 2014). Research in plants (Morgante et al. 2007; Springer et al. 2009; Swanson-Wagner et al. 2010; Cao et al. 2011; Chia et al. 2012; Hirsch et al. 2014) suggests that many plant species exhibit evidence of a pan-genome whose “dispensable” component may contribute to diversity and adaptation (Medini et al. 2005; Tettelin et al. 2005; Kahlke et al. 2012). Here, using a divergence time 5.3 Myr, we estimate that gene deletions between Kokia and Gossypioides have occurred at ∼245–300 per lineage per million years. Perhaps more surprising is the number of additional genes in the Gossypium raimondii genome as compared with that in either Kokia or Gossypioides (n=∼6,000). As gene deletions outweigh insertions and identifiable sequence was not recovered from either Kokia or Gossypioides, we infer these missing sequences represent shared deletions that occurred in the ∼5–6 Myr between the divergence of Gossypium from proto-Kokia/Gossypioides and the divergence of the latter two genera from each other. While possible that incomplete assemblies in K. drynarioides and G. kirkii contributed to the observed rate of gene loss, we note that the gene space for both is well-assembled (by BUSCO score) and it is improbable that both assemblies missed the same 6,000 gene models. The rate of gene deletion inferred for the shared lineage is much higher than in either individual lineage, resulting in ∼1,000 deletions per million years in the proto-Kokia/Gossypioides lineage. Postdivergence, the rate of gene deletion between the two lineages was significantly slower and nearly equivalent. Notably, none of these lineages has a history of unshared paleopolyploidy (Paterson et al. 2012), although all three share an ancient polyploidy event whose redundancy could be differentially fractionated.

Static Genome Size in the Face of a Changing Repetitive Element Landscape

Repetitive elements are both labile in nature and potentially sensitive to population size, due to reduced efficiency of purifying selection in small populations because of the prominence of strong genetic drift (Lynch and Conery 2003; Yi and Streelman 2005; Lynch 2011; Lynch et al. 2011; Lefébure et al. 2017). In the context of genome size, strong drift should lead toward an overall increase in genome size as eukaryotic mutation patterns are typically biased toward insertions, although research addressing the validity and ubiquity of this hypothesis is both scant and conflicting (Yi and Streelman 2005; Gregory and Witt 2008; Whitney et al. 2010, 2011; Arnqvist 2015; Mohlhenrich and Mueller 2016; Lefébure et al. 2017). While we do know historical population sizes in the present study, it is clear that population bottlenecks must have been profound in Kokia, as described earlier. The demographic history of G. kirkii is less clear; the current distribution could also reflect a dispersal event to East Africa, as the ancestral range for the ancestor to these genera is unknown, and the fluctuation in population size for this species is not known. Regardless, given the small current population sizes for both and the population bottlenecks that have affected Kokia (minimally), the invariant nature of both their genome size and composition is perhaps surprising. Both species have an estimated genome size of 590 Mb (Wendel et al. 2002), representing genome size stasis during ∼5 Myr of divergence. Analysis of their global repetitive content suggests that there is only a trivial (∼1 Mb) difference in total (identifiable) repeat content, with very similar overall repetitive profiles for each. We note that this result contrasts with the expectation based on small effective population size alone and stands in contrast to abundant literature that reports genome size differences among closely related species (Hawkins et al. 2006; Piegu et al. 2006; Novak et al. 2014; Kelly et al. 2015; Macas et al. 2015; Vu et al. 2015; Tetreault and Ungerer 2016) and even within species (Biemont 2008; Smarda et al. 2008; Diez et al. 2013; Duchoslav et al. 2013). Furthermore, given the stresses associated with dispersal and colonization of new environments, it may be even more surprising that the genome size has remained the same; however, although biotic and abiotic stresses have been associated with TE release from suppression, this activation can be genotype dependent and repression under stress is also commonly observed (reviewed in Horvath et al. 2017). Finally, research on plant invasiveness supports an association between small genome size and invasive potential, possibly due to the traits associated with invasive success (e.g., seedling growth rate and water/nutrient use; reviewed in Suda et al. 2014). While these species are not invasive per se, colonization (such as that of the Hawaiian Islands and East Africa/Madagascar) may favor similar characteristics. Notwithstanding the relative genomic stasis of the two genera, it is clear that the differences that do exist between the two species reflect both gain and loss of repetitive sequence. Most of the “younger” differentially abundant clusters that distinguish K. drynarioides and G. kirkii are overrepresented in K. drynarioides, a result consistent with the observation that a reduction in population size and concomitant increase in the severity of genetic drift can lead to an increase in insertional mutations, possibly due to activation of TEs under stress conditions (Kalendar et al. 2000; Liu and Wendel 2003; Grandbastien 2004; Parisod et al. 2010). Ancestral state reconstructions of TE amounts (fig. 4) also suggest both gain and loss in K. drynarioides and G. kirkii of approximately the same magnitude, which accounts for the static genome size of these species in the face of a changing TE landscape and maybe indicative of nucleotypic or other phenotypic restrictions associated with genome size in these species.

Rates of Indel Formations Compensate for Biased TE Proliferation

While transposable elements are capable of substantially altering genome size and structure, the presence of indels also contributes to genome size and collinearity (Petrov 2002; Gregory 2003; Vitte and Bennetzen 2006; Hjelmen and Johnston 2017; Kapusta et al. 2017). Previous work in cotton suggests there exist small differences in rates between species with large and small genomes that contribute to overall genome size change (Grover et al. 2008a). Global patterns of indel formation, as inferred from modern sequencing, can further extend our understanding of sequence gain and loss by providing a genome-wide view agnostic of sequence type (e.g., TE-derived) or region. As with the repetitive elements, K. drynarioides and G. kirkii vary in their rate of indel formation despite their equivalent genome sizes. In general, K. drynarioides experiences insertions and deletions more frequently, and the insertions tend to be longer than those found in G. kirkii (deletion sizes are equivalent on an average). These small biases lead to overall gain in sequence for K. drynarioides (+68.6 kb) and loss for G. kirkii (−113.2 kb), further exaggerating the gain experienced by K. drynarioides attributable to “younger” transposable elements (i.e., recent proliferation). In addition, these differences also explain why K. drynarioides has more “young” TEs whereas G. kirkii has more repetitive sequence overall, that is, the greater deletion rate in K. drynarioides is likely contributing to accelerated decay in that lineage. The abundance of indels in these two small genomes may be a consequence of their presumably small effective population sizes during the course of evolution. An extension of the drift-barrier hypothesis to indel formation suggests that selection for DNA fidelity and repair is less efficient in small populations, leading to a higher rate of retained indel mutation events (Sung et al. 2016). The predicted decline in DNA fidelity associated with decreasing population sizes has been broadly demonstrated across the tree of life (Sung et al. 2016) and may explain the relative abundance of indels in these two lineages, including the increased propensity for indel formation in K. drynarioides which was likely subject to severe population size restrictions as it colonized the Hawaiian Islands. Differences in double-strand break repair have been invoked for genome size reduction in across the tree of life, including plant species (Kirik et al. 2000; Orel and Puchta 2003; Puchta 2005; Hu et al. 2011; Vu et al. 2015).

Conclusions

External influences on genome evolution are many and complex, affecting genomes in sometimes predictable, and sometimes enigmatic, ways. Despite the strong pressures associated with repeated genetic bottlenecks as Kokia and Gossypioides underwent island dispersal, the most labile component of the genome (i.e., transposable elements) remained surprisingly constant. Furthermore, the changes in size due to differential transposable element occupation were ultimately offset by differential rates of deletion in the two species, resulting in similar genome sizes despite circa 5 Myr of independent evolution, strong founder effects, and intense genetic drift. This is perhaps even more remarkable considering that, in approximately the same timeframe (the last 5–10 Myr), the related genus Gossypium has experienced far more significant changes in genome size due to differential transposable element proliferation, which has led to a 3-fold difference in genome size among cotton species, and similar rates of indel formation (Grover et al. 2008b). Perhaps more unexpected were the presence of >10,000 genes in the Gossypium raimondii genome where no K. drynarioides or G. kirkii homolog was detected, resulting in nearly 8,000 more annotated genes in the Go. raimondii genome than in either K. drynarioides or G. kirkii. While some of these additional gene models may be due to differences in annotation methods between Go. raimondii and K. drynarioides/G. kirkii, it nevertheless suggests a higher rate of gene deletion in these sister genera. The deletions inferred here, both lineage-specific and those occurring in proto-Kokia/Gossypioides, are not only interesting from an evolutionary standpoint but are also germane to the selection of either species as an outgroup to Gossypium. While both species can individually serve as useful representatives of the cotton ancestor, it is clear that enough differences exist between the two outgroup genera to warrant inclusion of both as representatives of the ancestral cotton genome.

Supplementary Material

Supplementary data are available at Genome Biology and Evolution online. Click here for additional data file.
  98 in total

1.  Is small indel bias a determinant of genome size?

Authors:  T Ryan Gregory
Journal:  Trends Genet       Date:  2003-09       Impact factor: 11.639

2.  Pervasive gene content variation and copy number variation in maize and its undomesticated progenitor.

Authors:  Ruth A Swanson-Wagner; Steven R Eichten; Sunita Kumari; Peter Tiffin; Joshua C Stein; Doreen Ware; Nathan M Springer
Journal:  Genome Res       Date:  2010-10-29       Impact factor: 9.043

3.  Using native and syntenically mapped cDNA alignments to improve de novo gene finding.

Authors:  Mario Stanke; Mark Diekhans; Robert Baertsch; David Haussler
Journal:  Bioinformatics       Date:  2008-01-24       Impact factor: 6.937

4.  Evolution of alcohol dehydrogenase genes in the palm and grass families.

Authors:  B R Morton; B S Gaut; M T Clegg
Journal:  Proc Natl Acad Sci U S A       Date:  1996-10-15       Impact factor: 11.205

5.  Do genetic recombination and gene density shape the pattern of DNA elimination in rice long terminal repeat retrotransposons?

Authors:  Zhixi Tian; Carene Rizzon; Jianchang Du; Liucun Zhu; Jeffrey L Bennetzen; Scott A Jackson; Brandon S Gaut; Jianxin Ma
Journal:  Genome Res       Date:  2009-09-29       Impact factor: 9.043

6.  Graph-based clustering and characterization of repetitive sequences in next-generation sequencing data.

Authors:  Petr Novák; Pavel Neumann; Jirí Macas
Journal:  BMC Bioinformatics       Date:  2010-07-15       Impact factor: 3.169

7.  Drift and genome complexity revisited.

Authors:  Kenneth D Whitney; Bastien Boussau; Eric J Baack; Theodore Garland
Journal:  PLoS Genet       Date:  2011-06-09       Impact factor: 5.917

8.  Sealer: a scalable gap-closing application for finishing draft genomes.

Authors:  Daniel Paulino; René L Warren; Benjamin P Vandervalk; Anthony Raymond; Shaun D Jackman; Inanç Birol
Journal:  BMC Bioinformatics       Date:  2015-07-25       Impact factor: 3.169

9.  Genome-wide analysis of repeat diversity across the family Musaceae.

Authors:  Petr Novák; Eva Hřibová; Pavel Neumann; Andrea Koblížková; Jaroslav Doležel; Jiří Macas
Journal:  PLoS One       Date:  2014-06-16       Impact factor: 3.240

10.  Evolution of the Insertion-Deletion Mutation Rate Across the Tree of Life.

Authors:  Way Sung; Matthew S Ackerman; Marcus M Dillon; Thomas G Platt; Clay Fuqua; Vaughn S Cooper; Michael Lynch
Journal:  G3 (Bethesda)       Date:  2016-08-09       Impact factor: 3.154

View more
  7 in total

1.  Whole-Genome Resequencing Deciphers New Insight Into Genetic Diversity and Signatures of Resistance in Cultivated Cotton Gossypium hirsutum.

Authors:  Athar Hussain; Muhammad Farooq; Rubab Zahra Naqvi; Muhammad Qasim Aslam; Hamid Anees Siddiqui; Imran Amin; Chengcheng Liu; Xin Liu; Jodi Scheffler; Muhammad Asif; Shahid Mansoor
Journal:  Mol Biotechnol       Date:  2022-07-01       Impact factor: 2.695

2.  The genomic landscape of molecular responses to natural drought stress in Panicum hallii.

Authors:  John T Lovell; Jerry Jenkins; David B Lowry; Sujan Mamidi; Avinash Sreedasyam; Xiaoyu Weng; Kerrie Barry; Jason Bonnette; Brandon Campitelli; Chris Daum; Sean P Gordon; Billie A Gould; Albina Khasanova; Anna Lipzen; Alice MacQueen; Juan Diego Palacio-Mejía; Christopher Plott; Eugene V Shakirov; Shengqiang Shu; Yuko Yoshinaga; Matt Zane; Dave Kudrna; Jason D Talag; Daniel Rokhsar; Jane Grimwood; Jeremy Schmutz; Thomas E Juenger
Journal:  Nat Commun       Date:  2018-12-06       Impact factor: 14.919

3.  Insights into the Evolution of the New World Diploid Cottons (Gossypium, Subgenus Houzingenia) Based on Genome Sequencing.

Authors:  Corrinne E Grover; Mark A Arick; Adam Thrash; Justin L Conover; William S Sanders; Daniel G Peterson; James E Frelichowski; Jodi A Scheffler; Brian E Scheffler; Jonathan F Wendel
Journal:  Genome Biol Evol       Date:  2019-01-01       Impact factor: 3.416

4.  De Novo Genome Sequence Assemblies of Gossypium raimondii and Gossypium turneri.

Authors:  Joshua A Udall; Evan Long; Chris Hanson; Daojun Yuan; Thiruvarangan Ramaraj; Justin L Conover; Lei Gong; Mark A Arick; Corrinne E Grover; Daniel G Peterson; Jonathan F Wendel
Journal:  G3 (Bethesda)       Date:  2019-10-07       Impact factor: 3.154

5.  Genome sequencing of the Australian wild diploid species Gossypium australe highlights disease resistance and delayed gland morphogenesis.

Authors:  Yingfan Cai; Xiaoyan Cai; Qinglian Wang; Ping Wang; Yu Zhang; Chaowei Cai; Yanchao Xu; Kunbo Wang; Zhongli Zhou; Chenxiao Wang; Shuaipeng Geng; Bo Li; Qi Dong; Yuqing Hou; Heng Wang; Peng Ai; Zhen Liu; Feifei Yi; Minshan Sun; Guoyong An; Jieru Cheng; Yuanyuan Zhang; Qian Shi; Yuanhui Xie; Xinying Shi; Ying Chang; Feifei Huang; Yun Chen; Shimiao Hong; Lingyu Mi; Quan Sun; Lin Zhang; Baoliang Zhou; Renhai Peng; Xiao Zhang; Fang Liu
Journal:  Plant Biotechnol J       Date:  2019-09-13       Impact factor: 9.803

6.  Deleterious Mutations Accumulate Faster in Allopolyploid Than Diploid Cotton (Gossypium) and Unequally between Subgenomes.

Authors:  Justin L Conover; Jonathan F Wendel
Journal:  Mol Biol Evol       Date:  2022-02-03       Impact factor: 16.240

7.  Evolutionary divergence of duplicated genomes in newly described allotetraploid cottons.

Authors:  Renhai Peng; Yanchao Xu; Shilin Tian; Turgay Unver; Zhen Liu; Zhongli Zhou; Xiaoyan Cai; Kunbo Wang; Yangyang Wei; Yuling Liu; Heng Wang; Guanjing Hu; Zhongren Zhang; Corrinne E Grover; Yuqing Hou; Yuhong Wang; Pengtao Li; Tao Wang; Quanwei Lu; Yuanyuan Wang; Justin L Conover; Hassan Ghazal; Qinglian Wang; Baohong Zhang; Marc Van Montagu; Yves Van de Peer; Jonathan F Wendel; Fang Liu
Journal:  Proc Natl Acad Sci U S A       Date:  2022-09-19       Impact factor: 12.779

  7 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.