Literature DB >> 34100895

Transposable Elements Contribute to Genome Dynamics and Gene Expression Variation in the Fungal Plant Pathogen Verticillium dahliae.

David E Torres^1,2, Bart P H J Thomma^2,3, Michael F Seidl¹.

Abstract

Transposable elements (TEs) are a major source of genetic and regulatory variation in their host genome and are consequently thought to play important roles in evolution. Many fungal and oomycete plant pathogens have evolved dynamic and TE-rich genomic regions containing genes that are implicated in host colonization and adaptation. TEs embedded in these regions have typically been thought to accelerate the evolution of these genomic compartments, but little is known about their dynamics in strains that harbor them. Here, we used whole-genome sequencing data of 42 strains of the fungal plant pathogen Verticillium dahliae to systematically identify polymorphic TEs that may be implicated in genomic as well as in gene expression variation. We identified 2,523 TE polymorphisms and characterize a subset of 8% of the TEs as polymorphic elements that are evolutionary younger, less methylated, and more highly expressed when compared with the remaining 92% of the total TE complement. As expected, the polyrmorphic TEs are enriched in the adaptive genomic regions. Besides, we observed an association of polymorphic TEs with pathogenicity-related genes that localize nearby and that display high expression levels. Collectively, our analyses demonstrate that TE dynamics in V. dahliae contributes to genomic variation, correlates with expression of pathogenicity-related genes, and potentially impacts the evolution of adaptive genomic regions.

Entities: Chemical

Keywords: genome evolution; plant pathogen; structural variation; transposable element; transposon

Mesh：

Substances：

Year: 2021 PMID： 34100895 PMCID： PMC8290119 DOI： 10.1093/gbe/evab135

Source DB: PubMed Journal: Genome Biol Evol ISSN： 1759-6653 Impact factor: 3.416

Introduction

Repetitive sequences such as TEs are ubiquitous components of prokaryote and eukaryote genomes (Bowen and Jordan 2002; Wicker et al. 2007). In eukaryotes, the repeat content can differ between species and range from only ∼5.6% in the fish Tetraodon nigroviridis (Jaillon et al. 2004) to up to 85% in the maize genome (Schnable et al. 2009; Jiao et al. 2017). TEs can transpose to different locations within the host genome, and thus are often considered “mobile genetic elements” (McClintock 1950). TEs have been traditionally classified into two classes based on their mode of transposition: class I retrotransposons transpose via a copy-and-paste mechanism, whereas class II DNA transposons mainly mobilize via a cut-and-paste system (Wicker et al. 2007; Piégu et al. 2015). TEs can impact host genome organization and gene function by disrupting protein-coding regions, modifying gene expression, or causing large-scale chromosomal rearrangements (Eichler and Sankoff 2003; Feschotte 2008; Bennetzen and Wang 2014; Le et al. 2015; Mita and Boeke 2016). Consequently, TEs and their activities have been considered to generally decrease fitness, and TEs and TE-induced mutations are typically rapidly purged from a population. Furthermore, TE activities are generally suppressed by host defense processes. These processes include DNA methylation (Selker et al. 2003; Zemach et al. 2010), histone modifications associated with highly condensed heterochromatin (Slotkin and Martienssen 2007; Hocher et al. 2018; Hocher and Taddei 2020), targeted processes such as RNA-silencing (Lippman and Martienssen 2004; Nicolas et al. 2013; Yadav et al. 2018), or repeat-induced point (RIP) mutation, a fungal specific process in in which duplicated or repetitive sequences are targeted by C to T mutations (Cambareri et al. 1989; Galagan and Selker 2004; Fudal et al. 2009; Wang et al. 2020). Over the last decades, besides their detrimental effects, TEs have been increasingly recognized to play important roles in adaptive genome evolution (Venner et al. 2009; Hua-Van et al. 2011; Lynch et al. 2011; Bourque et al. 2018). For example, TEs can be co-opted as a source of cis-regulatory elements (Chuong et al. 2016; Kitano et al. 2018) and mediate swift changes in the transcriptional landscape under biotic and abiotic stresses (Sinzelle et al. 2009; Bourque et al. 2018; Muszewska et al. 2019). A classic example is the dark pigmentation of the peppered moth in response to industrialization that was caused by a single TE insertion into the first intron of a cell-cycle regulator gene, changing the transcript abundance of this gene and allowing the insect to camouflage on polluted surfaces (Cook and Saccheri 2013; Hof et al. 2016). Changes in mutation rates in genes in proximity TEs can be caused by leakage of TE silencing mechanisms (e.g., RIP) or DNA repair following TE insertion (Fudal et al. 2009; Rouxel et al. 2011; Yoshida et al. 2016; Wang et al. 2020). Furthermore, abundant TEs and other repetitive elements can act as an ectopic substrate for double-strand break repair pathways, thereby fostering chromosomal rearrangements (structural variants; SVs), such as chromosomal translocations and inversions as well as large-scale duplications and deletions (Chan and Kolodner 2011; Faino et al. 2016; Bourque et al. 2018). For example, TE expansions in crops such as tomato, rice, wheat, and soybean, have been shown to foster an increased accumulation of SVs associated with important traits during domestication (Fuentes et al. 2019; Alonge et al. 2020; Liu et al. 2020; Walkowiak et al. 2020). Interestingly, as TEs and SVs colocalize, they tend to form clusters in discrete regions in many eukaryotic organisms (Hastings et al. 2009; Lin and Gokcumen 2019; Marand et al. 2019). This colocalization, and its consequences for genome variation and organization, are important to understand the genome evolution of eukaryotic organisms (Kidwell and Lisch 2001; Wright and Finnegan 2001; Lynch and Conery 2003; Hurst et al. 2004; Lynch 2007). Remarkably, these TE-rich regions are often enriched in environmentally responsive genes, for example, in genes associated with roles in immunity or in secondary metabolism in plants (Sakamoto et al. 2004; Kawakatsu et al. 2016; Seidl et al. 2016; van Wersch and Li 2019). Therefore, TE-rich compartments are considered relevant for contribution to adaptation to various abiotic and biotic stresses (Crombach and Hogeweg 2007; Knibbe et al. 2007). TE-rich compartments play important roles in the coevolutionary “arms-race” between pathogens and their hosts. Pathogens secrete molecules known as effectors to establish a successful parasitic relationship on their hosts (Rovenich et al. 2014; Sánchez-Vallet et al. 2018). In turn, hosts evolve receptors to activate immunity and halt pathogen colonization (Cook et al. 2015; Han 2019). Thus, in order to overcome recognition by host immune systems, pathogens have to purge or diversify effector genes once their gene products become recognized (Moller and Stukenbrock 2017; Fouche et al. 2018; Badet and Croll 2020). In many filamentous pathogens, effector genes are enriched in TE-rich and gene-sparse compartments, whereas they are typically underrepresented in TE-poor and gene-dense genomic regions that typically harbor housekeeping genes (Seidl and Thomma 2014; Dong et al. 2015; Frantzeskakis et al. 2020). TE-rich compartments are often characterized by increased substitution rates and increased occurrence of SVs and presence/absence polymorphisms (Raffaele et al. 2010; Croll and McDonald 2012; de Jonge et al. 2012, 2013; Dutheil et al. 2016; Hartmann and Croll 2017; Hartmann et al. 2017; Wang et al. 2017; Fokkens et al. 2018; Plissonneau et al. 2018; Grandaubert et al. 2019; Gan et al. 2021; Wyka et al. 2021). Notably, a similar association of TEs with genes involved in immune responses has also been observed in plant hosts (Leister 2004; Kawakatsu et al. 2016; Mascher et al. 2017; Seidl and Thomma 2017). Therefore, TE-rich compartments have been suggested to facilitate the generation of variation that is necessary to fuel the “arms-race” between pathogens and their hosts (Seidl and Thomma 2017; Frantzeskakis et al. 2019, 2020; Schrader and Schmitz 2019; Torres et al. 2020). Verticillium dahliae is an asexual soil-borne fungal plant pathogen that colonizes the xylem of hundreds of susceptible host plants (Fradin and Thomma 2006; Klosterman et al. 2009, 2011; Inderbitzin and Subbarao 2014). Genomic comparisons between V. dahliae strains revealed that their genomes evolved by extensive chromosomal rearrangements (de Jonge et al. 2013; Faino et al. 2016). These rearrangements include chromosomal translocations, inversions, large-scale deletions, and duplications that collectively contributed to the formation of adaptive genomic compartments that originally have been referred to as lineage-specific (LS) regions, as these are variable between V. dahliae strains (Klosterman et al. 2011; de Jonge et al. 2012, 2013; Faino et al. 2016; Depotter et al. 2019). Initially, LS regions were solely defined by presence/absence polymorphisms among sequenced strains (de Jonge et al. 2013; Faino et al. 2016). However, subsequent work on the chromatin landscape in V. dahliae has revealed that these LS regions display a unique chromatin profile (Cook et al. 2020). Moreover, it was shown that a substantial number of additional genomic regions share chromatin characteristics with LS regions and, accordingly, that the amount of LS DNA has been underestimated (Cook et al. 2020). The originally identified LS regions, together with the additional genomic regions that share a similar chromatin profile, are now referred to as adaptive genomic regions (Cook et al. 2020). Importantly, these adaptive genomic regions are enriched for in planta-induced effector genes that contribute to host colonization (Klosterman et al. 2011; de Jonge et al. 2012, 2013; Kombrink et al. 2017; Gibriel et al. 2019; Cook et al. 2020; Chavarro-Carrero et al. 2021). Furthermore, these regions are enriched in relatively young and transcriptionally active TEs (Faino et al. 2016; Cook et al. 2020). Although a key role of TEs to drive formation and maintenance of adaptive genomic regions has been revealed (Amyotte et al. 2012; Faino et al. 2015, 2016; Shi-Kunne et al. 2018; Depotter et al. 2019; Cook et al. 2020; Seidl et al. 2020), it remains unclear what the impact of TE variation is on the biology of individual strains of V. dahliae. Changes in the TE landscape between V. dahliae strains, involving novel insertions as well as deletions, can cause structural as well as gene expression variation. Here, we exploited a collection of V. dahliae strains to analyze their TE landscape and systematically identify TE polymorphisms. Furthermore, we analyzed the polymorphic TEs in detail to study its impact on genome function and organization. Our analysis extends previous hypotheses on the relevance of TE dynamics in adaptive genomic regions (Faino et al. 2016; Cook et al. 2020), and further characterizes a TE subset implicated in genomic variation and gene expression in V. dahliae.

Results

Structural Variants Are Common in V. dahliae

Previous genomic comparisons among a narrow selection of V. dahliae strains have revealed extensive chromosomal rearrangements, which were proposed to have contributed to the establishment of adaptive genomic regions (de Jonge et al. 2013; Faino et al. 2016). To systematically assess their prevalence and abundance in a considerably larger collection of V. dahliae strains, we used paired-end sequencing reads for de novo prediction of SVs in 42 V. dahliae strains (Cook et al. 2020; Chavarro-Carrero et al. 2021) together with the gapless genome assembly of strain JR2 as a reference (Faino et al. 2015). With a combination of four commonly used SV-callers (supplementary fig. 1, Supplementary Material online), 919 high-confidence SVs were identified in total, comprising 410 deletions, 49 inversions, 24 duplications, and 436 translocations (fig. 1 and supplementary fig. 1, Supplementary Material online). Importantly, we successfully recalled all 19 chromosomal rearrangements that were previously manually identified between V. dahliae strains JR2 and VdLs17 (Faino et al. 2016). To investigate the occurrence of the 919 SVs across the V. dahliae strains, we calculated the frequency of each SV over all the strains included in the analysis. We observed that 87% (n = 803) of the SVs is shared by <20% of the strains, that is, they occur in less than eight strains (fig. 1). Among the SVs, deletions occur at the highest frequency (fig. 1; P < 2 × 10−16, Kruskal–Wallis test). Interestingly, the number of SVs per strain does not correspond with the phylogenetic relationship among the V. dahliae strains (fig. 1). Collectively, our data confirm that SVs occur commonly in V. dahliae, but that individual SVs are typically only shared by a small subset of strains.

Structural variants (SVs) in a collection of 42 Verticillium dahliae strains. (A) The number of SVs is depicted per type for each V. dahliae strain included in the analysis; the color-code for each variant type as indicated in panel (B); (B) Frequency distribution of the 919 SVs in the collection of V. dahliae strains; (C) Unrooted neighbor-joining tree based on presence–absence of 919 high confident SVs relative to the reference genome assembly of V. dahliae strain JR2 (red), color code depicts different groups; (D) Unrooted neighbor-joining tree of the collection of V. dahliae strains based on 287,251 high-confident SNVs relative to the reference genome assembly of V. dahliae strain JR2 (red); the color of the strain names corresponds to the colors represented in (C); (E) Circular plot displaying the genomic distribution of the 919 SVs along the eight chromosomes of the V. dahliae strain JR2 reference genome assembly. The tracks are shown in the following order from outside to inside: centromeric regions (Seidl et al. 2020), gene density in 10 kb windows, TE distribution, adaptive genomic regions distribution (Cook et al. 2020), 410 deletions (light blue), 49 inversions (yellow), 24 duplications (red), and 436 translocations (blue lines). The color intensity of the lines for an individual translocation is based on its frequency in the V. dahliae collection; (F) Number of TEs around SV breakpoints is displayed in 10 kb windows; (G) The distribution of 10,000 random permutations of distances between SVs and TEs in adaptive genomic regions. The vertical dashed line represents the mean number of random occurrences, the yellow line indicates the average distance at 4,681.85 bp, significance P = 9.99 × 10−5, z score = −2.32. Previously, we have demonstrated that SVs cluster at adaptive genomic regions in V. dahliae (de Jonge et al. 2013; Faino et al. 2016). Here, we similarly observed distinct genomic regions comprising multiple nonrandomly distributed SVs (P < 2 × 10−16, Kolmogorov–Smirnov test, supplementary fig. 2, Supplementary Material online), enriched in 17 regions (Pignatelli et al. 2009) (hypergeometric test; P < 0.05 after Benjamini–Hochberg correction; supplementary table 2, Supplementary Material online), and colocalized with the previously identified adaptive genomic regions (one-sided Fisher’s exact test, P = 0.00049; supplementary fig. 3, Supplementary Material online). Interestingly, SVs occur largely independently of nucleotide mutations in adaptive genomic regions (supplementary table 3, Supplementary Material online). We observed that only 0.06% of the nucleotides in adaptive genomic regions are affected by single-nucleotide variations, whereas 14.8% are affected by SVs. Thus, we conclude that SVs are a more likely phenomenon to increase variation in adaptive genomic regions than nucleotide mutations, especially when considering the previously reported high levels of sequence conservation in adaptive genomic regions (de Jonge et al. 2013; Faino et al. 2016; Shi-Kunne et al. 2018; Depotter et al. 2019).

A Subset of TEs in V. dahliae Is Polymorphic

TEs and other repetitive elements are often considered to drive the formation of SVs (Bourque et al. 2018). In V. dahliae, TEs have been proposed to mediate genomic rearrangements (Faino et al. 2016), and TEs and repetitive elements encompass 12.3% of the V. dahliae strain JR2 genome (Faino et al. 2015). Class I long-terminal repeat (LTR) retroelements are the most abundant, whereas Class II DNA is less abundant (supplementary table 4, Supplementary Material online). In line with previous observations (Klosterman et al. 2011; de Jonge et al. 2013; Faino et al. 2015, 2016; Cook et al. 2020), TEs are significantly enriched (3.5-fold) in adaptive genomic regions when compared with the core genome and colocalize with SVs (fig. 1; supplementary fig. 4, Supplementary Material online). As TE dynamics may promote the formation of SVs, insertions, and deletions of TEs (i.e., polymorphic TEs) may directly impact host genome organization. We identified TE insertion and deletion polymorphisms using TEPID (Stuart et al. 2016) by querying paired-end sequencing data for 42 V. dahliae strains relative to the V. dahliae strain JR2 reference genome; deletions are defined as TEs that are present in the reference genome but absent in other strains, whereas insertions are those present in at least one other strain and absent from the syntenic region in the reference genome. Our approach uses a single V. dahliae strains as a reference, and thus we will be limited by the reference TEs annotation when describing TE polymorphisms. For example, two DNA/Tc1-Mariner elements annotated in V. dahliae strain VdLs17 (Amyotte et al. 2012) are absent in the reference strain JR2 (Faino et al. 2015), and consequently these two elements would have not been considered in our analysis. In total, we identified 135 and 30 unique loci that underwent TE deletions and insertions, respectively; these loci are responsible for a total of 2,387 deletions and 136 insertions we identified across the 42 V. dahliae strains (fig. 2). In the case of insertions, we decided to keep only elements that belong to designated TE superfamilies, removing 12 “unspecified” elements that do not show association to a specific TE superfamily or the reference sequence length is incomplete. Only around 8% of all TEs are polymorphic, that is, they show insertion or deletion events in different V. dahliae strains compared with the reference strain JR2 (fig. 2). We observed that polymorphic TEs are significatively larger than nonpolymorphic TEs (supplementary fig. 6, Supplementary Material online). To further analyze the polymorphic TEs, we first calculated their frequency in the V. dahliae strains relative to the reference strain JR2. This revealed that TE deletions occurred more frequently than insertions (P = 0.0349, one-sided Fisher’s exact test; fig. 2); a similar frequency distribution was also observed for SV deletions (fig. 1). Only five TEs display both deletions and insertions, of which one belongs to the LTR/Gypsy and four to the LTR/Copia superfamily. We observed that 33 polymorphic TEs (37% of which belong to the LTR/Copia superfamily) overlap protein-coding genes, 22 of which overlap introns and 11 exons. Thus, only a limited subset of TEs in the V. dahliae genome is polymorphic, whereas the vast majority is shared by all strains analyzed.

Transposable element polymorphisms in a collection of 42 Verticillium dahliae strains. (A) The percentage of polymorphic (shades of blue) and nonpolymorphic (gray) TEs in V. dahliae strain JR2 is depicted as a pie chart. The percentages of polymorphic TEs separated by superfamily relative to the TE annotation of the V. dahliae strain JR2 is shown in blue, and the percentage of polymorphic TEs within each superfamily is shown in red; (B) The proportion of identified TE polymorphisms is shown for every V. dahliae strain, depicting deletions (blue) and insertions (red) per superfamily; (C) Frequency distribution of 165 polymorphic TEs in the V. dahliae strain collection, depicted by TE variant type.

Polymorphic TEs Are Dynamic in V. dahliae

Fungi evolved a range of mechanisms to suppress TE activity and avoid TE expansions (Castanera et al. 2016). However, we previously observed that a fraction of TEs is transcriptionally active in V. dahliae, suggesting it comprises TEs that are potentially able to multiply and accumulate in the V. dahliae genome (Faino et al. 2016; Cook et al. 2020). Therefore, we hypothesized that polymorphic TEs are not suppressed and still able to multiply. To assess if polymorphic TEs lack typical signatures of suppression when compared with nonpolymorphic TEs in V. dahliae strain JR2, we assessed DNA methylation levels (in the three methylation contexts; mCG, mCHG, and mCHH), RIP signature (Composite Repeat-Induced-Mutations Index; CRI), Kimura distance (divergence from the TE consensus sequence), GC content, and expression levels in vitro (counts per million; CPM) by quantifying the abundance of TE-derived reads across all different TE copies in the genome using TEtranscripts (Jin et al. 2015; Jin and Hammell 2018). Dimensional reduction by principal component analysis (PCA) revealed that polymorphic TEs cluster and are largely separated from the nonpolymorphic TEs (fig. 3). On the first dimension of the PCA, explaining 31.6% of the variation, polymorphic and nonpolymorphic TEs is separated based on CRI, DNA methylation, and GC content (supplementary fig. 5, Supplementary Material online). This observation can be explained by the association of TEs with strictly heterochromatic regions, in which TEs are characterized by high CRI and DNA methylation levels as well as by low GC content (Cook et al. 2020). The second dimension of the PCA, explaining 14.5% of the variation, revealed a more distinct separation between polymorphic and nonpolymorphic TEs mainly based on Kimura distance, DNA methylation, and transcriptional activity in vitro (supplementary table 5, Supplementary Material online). Specifically, we observed that the sequence of polymorphic TEs is more similar to the consensus sequence of the TE family (Kimura distance) and is characterized by a higher GC content when compared with nonpolymorphic TEs, indicating that polymorphic TEs are evolutionary younger when compared with nonpolymorphic TEs (fig. 3). Notably, the CRI of polymorphic TEs is lower than nonpolymorphic TEs and resembles that of genomic regions located in the core genome (supplementary fig. 6, Supplementary Material online). Together with lower CG methylation levels in polymorphic TEs, our data suggest that polymorphic TEs are not silenced and are likely more active, as proxied by an increased transcriptional activity when compared with other TEs (fig. 3). As expected, polymorphic TEs are expressed at higher levels in vitro and in planta when compared with nonpolymorphic TEs (fig. 3 and supplementary fig. 6, Supplementary Material online). Collectively, our analyses identified a subset of TEs as polymorphic among strains, and these tend to be evolutionary younger, less suppressed by RIP and methylation, as well as more highly expressed in V. dahliae.

The TE landscape across the Verticillium dahliae genome is dynamic. (A) Principal component analysis of TEs in V. dahliae strain JR2 based on methylation (mCG, mCHG, mCHH), GC proportion, Kimura distance, Composite Repeat-Induced Point Mutation index (CRI), frequency, and in vitro expression. Each point represents a single TE and colored ellipses represent the confidence interval for the polymorphic and nonpolymorphic TEs; (B) Polymorphic (n = 165) and nonpolymorphic TEs (n = 1,956) were compared based on GC proportion, the Kimura distance to the TE family consensus sequence, CRI, cytosine methylation, and in vitro expression (CPM). Statistical significance was assessed using one-sided Wilcoxon rank-sum test; (C) The association between polymorphic TEs and adaptive genomic regions was tested using a permutation test. The vertical dashed line shows the mean number of polymorphic TEs overlaps expected at 10,000 random permutations and the blue line indicates the empirical number of overlaps, significance P = 9.99 × 10−5, z score = 3.2927; (D) Circular plot displays the genomic distribution of 165 polymorphic TEs along the eight chromosomes of V. dahliae strain JR2. The tracks display the centromeric regions, gene density (in 10 kb windows), TE annotation, adaptive genomic regions, TE deletions, and TE insertions (with arrows indicating the insertion direction) from outside to inside. Adaptive genomic regions in V. dahliae have been previously shown to be enriched for transcriptionally active TEs (supplementary fig. 4, Supplementary Material online; Faino et al. 2016; Cook et al. 2020). We hypothesized adaptive genomic regions are enriched for polymorphic TEs. Therefore, we analyzed the distribution of polymorphic TEs in V. dahliae strain JR2 (fig. 3), leading to the identification of clusters of polymorphic TEs that colocalize with adaptive genomic regions (P = 0.00024, one-sided Fisher’s exact test; supplementary table 6, Supplementary Material online; fig. 3). Interestingly, polymorphic TEs do not localize closer to SVs than nonpolymorphic TEs, suggesting that both types of TEs can contribute to the formation of SVs. Although deleted TEs were enriched in adaptive genomic regions, we could not observe a similar enrichment for TE insertions in the JR2 strain (supplementary fig. 7, Supplementary Material online). Polymorphic TEs belonging to LTR/Copia superfamilies or unspecified repetitive elements are highly enriched in adaptive regions (P = 8.24 × 10−9 and P = 1.36 × 10−5 respectively, one-sided Fisher’s exact test), and for instance, LTR/Copia represents 51.1% of all polymorphic TEs in adaptive regions (supplementary table 7, Supplementary Material online). Therefore, our results suggest that the LTR superfamily is an important component of the dynamic TE landscape.

Polymorphic TEs Impact the Functional Genome of V. dahliae

Next to their association with SVs, TE presence and activity have been suggested to directly or indirectly impact the functional genome, for example by modifying the expression of genes in their vicinity or by inducing changes in gene structure (Bourque et al. 2018; Schrader and Schmitz 2019). To establish to which extent TEs may affect gene expression in V. dahliae strain JR2, we first summarized the occurrence of TEs up to 5 kb upstream and downstream of all protein-coding genes (supplementary fig. 8, Supplementary Material online). The majority of annotated TEs are located within 5 kb range of a gene (67% of total TEs, n = 1,418; P = 0.0094, one-sided Fisher’s exact test), with an overrepresentation of polymorphic TEs belonging to the LTR/Copia, LINE/I, and DNA/Tc-1-Mariner superfamilies (fig. 4 and supplementary fig. 8, Supplementary Material online). By assessing the expression of TEs nearby genes (n = 1,418) as a proxy for their activity, we observed that many TE families are expressed in vitro and typically at higher levels during infection (P < 0.05, fig. 4; supplementary fig. 9, Supplementary Material online). As expected, these families are abundant in polymorphic TEs (fig. 4), suggesting that environmental changes induce the expression of TEs.

The presence of polymorphic transposable elements is correlated with expression of genes that localize nearby. (A) Ridgeline shows the density distribution of polymorphic TEs, classified by TE superfamily, upstream and downstream of genes; (B) Unsupervised clustering of TE expression per TE family and superfamily, color-coded based on the mean-per-family expression value (log2(CPM + 1)). The column depicts the abundance of polymorphic TEs within each group (yellow coded). Statistical significance for comparisons between in vitro (growth in MS media) and in planta (colonization of Arabidopsis thaliana at 28 days postinoculation) was assessed using a one-sided Wilcoxon rank-sum test *P ≤ 0.05, **P ≤ 0.01, ***P ≤ 0.001, ****P ≤ 0.0001; (C) Relationship of gene expression ranks in vitro (blue) and in planta (red) and the number of TEs. Low rank = low expression and high rank = high expression. The plots display linear regression (dark line) and confidence interval (light gray) as well as the R and P values after linear regression. The histograms on top of the graph show the number of polymorphic TEs per expression rank and the dot-size reflects the proportion of polymorphic TEs relative to the total number of TEs per expression rank; (D) Relationship of gene expression rank and mean distance to TEs (bp) with annotations as for (C); (E) Gene expression in relation to the TE context (TEs within 5 kb); no TE in proximity = Distal (n = 8,742), or TEs flanked upstream (n = 1,098), downstream (n = 1,051), or in between (n = 546; blue). Significant differences were assessed using one-sided Wilcoxon rank-sum test; (F) TE distance to genes in 5 kb windows; TEs in upstream (n = 1,609), downstream (n = 1,511) or in between (n = 1,582). Statistical differences were assessed using one-sided Wilcoxon rank-sum test. To further assess the correlation between the proximity of TEs and the expression of nearby genes, we compared the expression of genes located within 5 kb windows from TEs under in vitro and in planta conditions. We observed that 23% of all annotated genes reside within 5 kb of a TE (n = 2,696). Genes in proximity to TEs are transcribed at significantly lower levels when compared with genes that do not reside in close proximity to TEs (P < 2 × 10−16, in vitro, P = 1.9 × 10−9 in planta; Kruskal–Wallis test; fig. 4, supplementary fig. 10, Supplementary Material online). Therefore, we hypothesized that the presence of polymorphic TEs correlates with expression in V. dahliae. To further investigate this, we aggregated genes and TEs within 5 kb by their expression rank and obtained the number of TEs in each rank. We observed only a moderate negative correlation between the number of polymorphic TEs and gene expression ranks in vitro (fig. 4; R = –0.21, P = 0.034 linear regression). Conversely, the distance between genes and polymorphic TEs is moderately positively correlated with higher expression ranks (fig. 4; R = 0.16, P = 0.013 linear regression). These results suggest that polymorphic TEs are associated, albeit weakly, with genes expressed at low levels under in vitro conditions. However, we observed a different pattern when considering host colonization, as gene expression ranks in planta did not correlate with TE density (fig. 4R = 0.025, P = 0.8 linear regression) whereas, similar to in vitro conditions, TE distance and expression are moderately positively correlated (fig. 4R = 0.19, P = 0.06 linear regression). Collectively, our data suggest the polymorphic TE density nearby protein-coding genes correlates only weakly with gene expression in different conditions. To further assess the relationship between TEs and gene expression, we categorized genes into four categories based on the relative position of TEs; genes with no TEs in proximity (n = 8,742), genes with TEs localized upstream (n = 1,098), downstream (n = 1,051), or genes flanked by TEs on both sides (n = 546). The distance between TEs and genes is significantly shorter when genes are flanked by TEs, when compared with genes having TEs only upstream or downstream (fig. 4). Importantly, genes flanked by TEs show reduced expression in vitro when compared with genes in the proximity of TEs located either upstream or downstream (fig. 4), a pattern we similarly observed during host colonization. We similarly observed this pattern when only considering the subset of polymorphic TEs (supplementary figs. 10 and 11, Supplementary Material online). Interestingly, during host colonization genes in proximity to polymorphic TEs do not show a significant decrease in expression (supplementary fig. 10, Supplementary Material online). Thus, genes in proximity to nonpolymorphic TEs showed lower expression levels when compared with genes in proximity to polymorphic TEs.

Pathogenicity-Related Genes Colocalize with Polymorphic TEs

To understand which types of genes are enriched in proximity to TEs, we performed functional annotation of the V. dahliae strain JR2 gene catalog using a cluster for orthologs groups (COG) approach and subsequently tested for enrichment of genes with specific COG categories. No significant enrichment of genes in proximity to TEs could be observed for most of the categories belonging to “cellular processes” or to “information processing.” In contrast, genes annotated as “metabolism” and “poorly characterized” are significantly enriched (fig. 5). Especially, genes belonging to the subcategories “defense mechanisms” or “carbohydrate metabolism” are highly enriched (fig. 5). A total of 39% of genes that reside in proximity to TEs fall in the category “function unknown” (n = 579) or “not categorized” (n = 482), which refers to orthologs present in other organisms but without known function. Furthermore, we could not find orthologs for 14.9% of the genes that reside in proximity to TEs (n = 402), described in this work in the “not recognized orthologs” category, which can probably be explained as gene content specific to the V. dahliae lineage (fig. 5 and supplementary table 8, Supplementary Material online). Interestingly, most of these gene categories are associated with TEs (fig. 5) and overlap with various SVs (fig. 5). Subsequently, we annotated all protein-coding genes for functions that have previously been associated with pathogenicity, such as genes encoding secreted proteins (n = 672), carbohydrate-active enzymes (CAZymes, n = 498), secondary metabolites (n = 25), or effector candidates (n = 193). These genes can be broadly classified as “pathogenicity-related” and are enriched in the “poorly characterized” functional category, especially secreted proteins and effector candidates (P = 3.49 × 10−15 and n = 390, P = 2.85 × 10−6 and n = 127, respectively, one-sided Fisher’s exact test; supplementary table 9, Supplementary Material online). Genome-wide, we observed that 40.72% (n = 401) of predicted pathogenicity-related genes have a TE localized within 5 kb (fig. 5). Polymorphic TEs are enriched in proximity (5 kb) to effector candidates (Z score = 3.4838, P = 0.0001; 5,000 random permutations), even though we did not observe that these genes and other pathogenicity-related genes are significantly closer to polymorphic TEs (fig. 5). Thus, commonly considered pathogenicity-associated genes reside in proximity to TEs, yet only some genes are in proximity to polymorphic TEs.

Polymorphic transposable elements correlate with expression of a subset of genes. (A) Enrichment of TEs in proximity (within 5 kb range) to genes in different functional categories. Functional annotation was based on Clustering for Orthologous Groups (COGs) categories. The y axis shows the different functional categories and the x axis the z score after 5,000 random permutations; the gray shadow represents values with a negative z score indicating depletion rather than enrichment, and the −log10(P-value) after Benjamini–Hochberg correction for false-discovery rate is color-coded. The size is relative to the number of genes associated with TEs in each category; (B) The number of genes overlapping deletions (blue) and duplications (red), and inversions (yellow); (C) The relative proportions of genes localized in the adaptive genomic regions (blue) and core (gray) genomic regions; (D) Gene expression distribution in two conditions (in vitro and in planta; log2(CPM + 1)) and separated by core and adaptive genomic regions; (E) Predicted pathogenicity-related genes (n = 1,388) overlapping with COG categories and depicting the proportion of genes with TEs in proximity (5 kb); (F) Gene expression of pathogenicity-related genes in two conditions (in vitro and in planta; log2(CPM + 1)) and separated by core and adaptive genomic regions; (G) Distance between genes and TEs separated by polymorphic and nonpolymorphic TEs. Statistical significance was assessed using Wilcoxon rank-sum test; (H) The composition of pathogenicity-related gene categories separated in adaptive (blue) and core (gray) genomic regions; Pathogenicity-related gene expression in vitro (I) and in planta (J) separated by polymorphic and nonpolymorphic TEs. Statistical significance was assessed by a Kruskal–Wallis rank-sum test; ****P ≤ 0.0001. When compared with the core genome, adaptive genomic regions in V. dahliae are relatively gene-poor (fig. 5) and SV-rich (supplementary fig. 3, Supplementary Material online), as well as enriched for in planta-expressed genes (de Jonge et al. 2012, 2013; Kombrink et al. 2017; Cook et al. 2020; Chavarro-Carrero et al. 2021) (fig. 5). Although a broad range of functional categories can be observed in adaptive genomic regions (fig. 5), these are enriched for genes encoding secreted proteins and candidate effector genes (fig. 5; P = 0.032 and P = 0.014, respectively; one-sided Fisher’s exact test). Since we have shown that gene expression levels correlate with the presence of both polymorphic as well as nonpolymorphic TEs (fig. 4), we queried to which extent polymorphic TEs in proximity of pathogenicity-related genes correlate with gene expression patterns in the adaptive genomic regions. We observed a significant enrichment of polymorphic TEs colocalizing with highly expressed genes in vitro and in planta (P = 0.0116 and P = 0.0010 for 50% and 75% upper quartiles for expression, respectively, after one-sided Fisher’s exact test; supplementary table 10, Supplementary Material online). Under in vitro conditions, expressed pathogenicity-related genes located in the core genome are expressed significantly higher than genes located in adaptive genomic regions (fig. 5). As expected, we observed the opposite trend in planta (fig. 5), where expressed pathogenicity-related genes that are localized in adaptive genomic regions are highly expressed during host colonization (de Jonge et al. 2012, 2013). Furthermore, we observed an enrichment of polymorphic TEs located within 5 kb of pathogenicity-related genes in adaptive genomic regions (P = 0.0062 after one-sided Fisher’s exact test). As expected, a higher proportion of these highly expressed genes are effector candidates (fig. 5). These results suggest that polymorphic TEs occur more likely in close proximity to highly expressed pathogenicity-related genes in adaptive genomic regions.

Discussion

Transposable elements can contribute to genomic and transcriptomic variation (Bourque et al. 2018) and, consequently, TEs have been often considered to play crucial roles in genome evolution and function (Feschotte 2008). Here, we analyzed the TE landscape in the plant pathogen V. dahliae. We show that the presence of TEs is associated with abundant SVs that emerged independently in individual V. dahliae strains. The SVs form discrete clusters and colocalize within adaptive genomic regions. Importantly, we demonstrate that these SVs colocate with TEs, many of them polymorphic, and typically evolutionary younger, less silenced, and higher expressed. Furthermore, we show that the presence of polymorphic TEs is associated with gene expression of genes that can be assigned to diverse functional categories, particularly with host–pathogen interaction genes. Collectively, our results provide evidence for the hypothesis that polymorphic TEs contribute to increased genomic diversity, correlate with the expression of pathogenicity-related genes, and play important roles in the evolution of adaptive genomic regions. TEs occur commonly in eukaryote genomes, yet there is an enormous variation in the TE content and diversity between different organisms (Wicker et al. 2007; Huang et al. 2012; Dietrich et al. 2013). Typically, TE expansions are suppressed by various genome defense mechanisms, but some TE copies can escape suppression and remain active and mobile. Polymorphic TEs have been associated with recent transposition activity (Huang et al. 2012), and in V. dahliae, a subset of TEs shares these characteristics. Only a subset of all predicted TEs is young and dynamic, which is in contrast to some other fungi in which there is a considerable amount of young TEs associated with recent TE expansions, such as in Zymoseptoria tritici (Badet et al. 2020; Lorrain et al. 2021; Oggenfuss et al. 2020), Leptosphaeria maculans (Grandaubert et al. 2014), Blumeria graminis (Frantzeskakis et al. 2018), and Magnaporthe oryzae (Kang et al. 2016). In many plants and fungi, TE expansions are mainly driven by LTR retrotransposons (Vitte and Bennetzen 2006; Tsukahara et al. 2009; Muszewska et al. 2011; Huang et al. 2012; Amselem et al. 2015; Donnart et al. 2017; Galindo-Gonzalez et al. 2017; Lorrain et al. 2021). We observed an enrichment of LTRs in facultative heterochromatic regions and at the centromeres in V. dahliae (Cook et al. 2020; Seidl et al. 2020). Similarly, LTRs are an important fraction of the polymorphic TEs in V. dahliae that are localized in the adaptive genomic regions. TE expansions have occurred at least twice during the evolution of V. dahliae, once during Verticillium diversification, and then right after V. dahliae speciation (Faino et al. 2016). The active polymorphic TEs we identified are likely the product of the most recent TE expansion in V. dahliae, and likely contributed to the formation of the adaptive genomic compartments. Polymorphic TEs are a small fraction of the total TE repertoire, which suggests that most TEs are suppressed. Three main processes have been associated with TE suppression in fungi: RIP mutation, DNA methylation, and RNA-mediated silencing. RIP is usually associated with sexual fungal organisms (Galagan and Selker 2004), yet RIP mutations have been also observed in V. dahliae that is considered asexual (Clutterbuck 2011; Klosterman et al. 2011; Amyotte et al. 2012; Cook et al. 2020). RIP mutations have been associated with meiosis, but alternative mechanisms during vegetative propagation could lead to RIP-like mutations (Clutterbuck 2011). This has been recently observed in Z. tritici and N. crassa in which propagation through mitotic divisions generate RIP-like mutations (Möller et al. 2021; Wang et al. 2020). Intriguingly, polymorphic TEs are depleted in RIP mutations. Consequently, we consider different RIP scenarios, such as that RIP activity was lost during an evolutionary transition to asexuality in V. dahliae. In such a scenario, the mutations we observed in most TEs are remains of ancestral RIP activity. A second scenario is that V. dahliae is not strictly asexual, but rare sexual activity occurs in the population. In this scenario, the RIP machinery may be active during the occasional meiosis. Thirdly, a RIP-like mechanism could act outside the typical meiotic cycles and might be active in V. dahliae. These hypotheses are not mutually exclusive, and it remains challenging to elucidate the mechanistic origin of RIP mutations in V. dahliae. Next to RIP, DNA methylation targets repetitive elements in many fungi (Nakayashiki et al. 1999; Zemach et al. 2010; Montanini et al. 2014; Bewick et al. 2019). We have previously demonstrated that the majority of TEs in V. dahliae is methylated (Cook et al. 2020; Seidl et al. 2020). We now show that polymorphic TEs are typically less methylated and not affected by RIP, collectively indicating that these elements are not suppressed. In the model plant Arabidopsis and the fruit-fly Drosophila, polymorphic and active TEs occur in discrete regions where novel insertions are not efficiently suppressed, whereas TE insertions outside of these regions are mostly suppressed by methylation (Hollister and Gaut 2009; Tsukahara et al. 2009; Stuart et al. 2016; Lee and Karpen 2017). We observed that TEs in adaptive genomic regions display low methylation levels (Cook et al. 2020). Whereas effective TE suppression outside adaptive regions might be an important mechanism to restrict TE expansions in the core genome, the activity TEs in adaptive regions suggests that that these regions are prone to TE-mediated variation. Environmental changes can alter suppression mechanisms and drive a derepression of TEs (Bouvet et al. 2008; Carr et al. 2010; Fouche et al. 2020). We observed TE expression upon changes in environmental conditions, that is, different in vitro growth media as well as in planta growth. In addition, TE expression has been observed previously in different abiotic conditions in V. dahliae (Amyotte et al. 2012; Faino et al. 2016; Cook et al. 2020). For example, some Tc1/Mariner and Mutator elements are induced upon heat stress (Amyotte et al. 2012), whereas we observed that some LTR/Copia and LTR/Gypsy families highly expressed in planta. However, the function of TE derepression under changing environmental conditions remains unknown (Galhardo et al. 2007; Slotkin and Martienssen 2007; Koonin and Wolf 2009). Modifications in suppression patterns induced by changes in the environment could imply a trade-off between the functional TE derepression and increased probability of TE expansions and TE mobilization on nearby genes, or even changes in the genome structure (Seidl et al. 2016). In Schizosaccharomyces pombe, for instance, changes in temperature and nutrient availability induce TE insertions in promoters of stress-dependent genes, and these insertions have been associated with higher expression of the targeted genes (Feng et al. 2013; Maxwell 2020). In the plant-pathogen Z. tritici, changes in nutrient composition led to increased TE expression, and a TE insertion in proximity of an effector gene contribute to reduced expression and virulence (Fouche et al. 2020). Although the mechanisms in which the environment and TE derepression influence genome function and structure remain unclear, the expression of polymorphic TEs induced by changes in the environment could be similarly an important driver of the genome dynamics in V. dahliae. Changes in the TE landscape have often been associated with gene expression changes (Goubert et al. 2020). Insertion of TEs within or near genes usually reduce gene expression, for example by spreading of silencing marks, such as DNA methylation or inducing changes in chromatin conformation (Castanera et al. 2016; Winter et al. 2018). TE insertions are associated with gene silencing and mostly have negative effects, yet also examples for an adaptive advantage exist. For example, in Z. tritici, the insertion of a TE in the promoter of a melanin biosynthesis gene causes reduced melanin synthesis and quicker growth, which would be beneficial for colonization (Krishnan et al. 2018). We observed that genes in proximity to TEs were typically less expressed during in vitro and in planta conditions, which was particularly clear for genes flanked by TEs. However, we observed that genes located in TE-rich regions are highly expressed in planta, raising a conundrum to link these opposing observations. Studies in adaptive genomic regions in V. dahliae and analogous regions in other fungi revealed that chromatin represents an additional layer of genomic regulation that creates a chromatin state allowing for accessibility to the transcriptional machinery (Schotanus et al. 2015; Fokkens et al. 2018; Soyer et al. 2019; Cook et al. 2020). We observed an enrichment of polymorphic TEs in close proximity of these highly expressed genes in adaptive regions. This observation raises the possibility that polymorphic TEs form part of the regulation of genes in the vicinity as cis-regulatory sequences or as a consequence of changes in the chromatin conformation (Le Rouzic et al. 2007; Hollister and Gaut 2009; Huang et al. 2012; Szitenberg et al. 2016; Choi and Lee 2020). For instance, TEs and TE-derived sequences can be co-opted in promoter sequences and contribute to the control of gene expression (Sundaram and Wysocka 2020; Cosby et al. 2021), as for example, Tf1 retrotransposon insertions in S. pombe occur in promoters and induce consistently higher expression of neighboring genes (Leem et al. 2008; Guo and Levin 2010; Maxwell 2020). Furthermore, this could explain the maintenance of polymorphic TEs in V. dahliae, as the TE suppression of these elements could negatively affect the expression levels of neighboring genes, favoring their colocalization. Genome structure can differ significantly between species and even between strains of the same species (Lynch 2007). Here, we demonstrated that even closely related V. dahliae strains are characterized by their unique SV repertoire and colocalization of polymorphic TEs and SVs in adaptive regions. As previously reported, we similarly observed high levels of sequence conservation in adaptive regions (de Jonge et al. 2013; Faino et al. 2016; Shi-Kunne et al. 2018; Depotter et al. 2019). Recently, adaptive regions were characterized by an intermediate chromatin state between heterochromatin and euchromatin (Cook et al. 2020), which could indicate that this unique chromatin organization can allow for lower mutation rates in these adaptive regions (Prendergast et al. 2007). We observed that SVs emerge rapidly due to their association with polymorphic TEs, and they represent a relevant mechanism to generate variation, rather than single-nucleotide changes in adaptive regions. The colocalization of TEs and SVs in distinct regions has been reported previously associated with genome compartmentalization in diverse fungi (Schotanus et al. 2015; Faino et al. 2016; Plissonneau et al. 2016, 2018; Shi-Kunne et al. 2018; Ola et al. 2020; Torres et al. 2020). Relaxed selection has been proposed to explain the association of variation and TEs in genomic compartments, for example, the TE-rich accessory chromosomes in Z. tritici (Croll and McDonald 2012; Hartmann et al. 2017; Grandaubert et al. 2019). Conversely, adaptive regions in V. dahliae do not show signs of relaxed or strong positive selection (Depotter et al. 2019). Since V. dahliae is an asexual fungus and consequently has a low effective population size, genetic drift could represent an important evolutionary force that maintains SVs in population (Kelkar and Ochman 2012; Cvijovic et al. 2015), and these can therefore contribute to genome dynamics. Compartmentalized genomes have been observed in diverse organisms (Torres et al. 2020). In filamentous plant pathogens, this genome organization is described in the “two-speed” genome model (Raffaele et al. 2010) in which dynamic genome compartments serve as cradles of variation (Croll and McDonald 2012; Frantzeskakis et al. 2019; Torres et al. 2020). Additionally, changes in chromatin organization have been shown to be relevant for compartmentalized genomes to further increase or facilitate genomic variation (Schotanus et al. 2015; Moller et al. 2019; Cook et al. 2020). Thus, further insights into chromatin organization and TE dynamics are necessary to further disentangle the impact of chromatin modifications on the emergence of genetic variation and unravel how these facilitate the evolution of virulence in plant pathogens.

Materials and Methods

Verticillium dahliae JR2 Genome, Repetitive Element, and Functional Gene Annotation

Repetitive element annotation of the chromosome-level genome assembly of V. dahliae strain JR2 (Faino et al. 2015) was based on available annotation (Faino et al. 2015; Cook et al. 2020; Seidl et al. 2020). Briefly, repetitive elements were annotated by using a combination of LTRharvest (Ellinghaus et al. 2008), LTRdigest (Steinbiss et al. 2009), and RepeatModeler. The repeat predictions were further curated and classified (Wicker et al. 2007) by a combination of PASTEC (Hoede et al. 2014) and sequence similarity to known TEs. The genome-wide occurrence of repeats was determined with RepeatMasker v 4.0.9, and the output was further processed using “one code to find them all” (Bailly-Bechet et al. 2014). Simple repeats and low-complexity regions were excluded (Cook et al. 2020; Seidl et al. 2020). GC-content, Kimura distance from a consensus TE family, and weighted DNA methylation data were previously summarized (Cook et al. 2020; Seidl et al. 2020). The CRI per TE was calculated by obtaining the RIP substrate and the RIP product index as defined by nucleotide frequencies: RIP product index = TpA/ApT and the RIP substrate index= (CpA + TpG)/(ApC + GpT). We also calculated the genome-wide RIP index (CRI) using RIPper (van Wyk et al. 2019) using sliding windows (1 kb, 500 bp slide). Functional gene annotation was based on the available V. dahliae JR2 gene annotation (Faino et al. 2015) and was performed using eggNOG 5.0 (Huerta-Cepas et al. 2019) based on an hierarchical nonsupervised orthology search, restricted to a fungal database with e-values ≥0.001 and query coverage ≥50%. To define the V. dahliae secretome, we predicted N-terminal signal peptides using SignalP v.4.1 (Nielsen 2017). Subsequently, we predicted putative effectors with EffectorP v.2.0 (Sperschneider et al. 2018), using default parameters. Secondary metabolite biosynthetic genes were previously predicted using antiSMASH fungal version 4.0.2 (Weber et al. 2015; Shi-Kunne et al. 2018). Similarly, we used previously predicted CAZymes that cover signatures of glycoside hydrolases, polysaccharide lyases, carbohydrate esterases, glycosyltransferases, and carbohydrate-binding molecules (Seidl et al. 2015; Shi-Kunne et al. 2018). For further analysis, we considered the groups of “secreted proteins” as the secreted proteomes excluding candidate effectors, secondary metabolite genes, and CAZymes.

Gene and TE RNA-Sequencing Analysis

Transcriptional activity of genes and repetitive elements in V. dahliae strain JR2 was assessed using previously generated transcriptome data (Cook et al. 2020) of three in vitro conditions, (Potato Dextrose Broth, Murashige-Skoog, and Czapeck-Dox media) and of in planta colonization (Arabidopsis thaliana; 28 dpi). For further analysis, we considered only MS media for in vitro and in planta comparisons. Single-end sequencing reads of three biological replicates per condition were mapped to V. dahliae strain JR2 genome assembly (Faino et al. 2015) using STAR v.2.4.2.a, allowing multiple mapped reads with the following settings: –outFilterMultimapNmax 100 –winAnchorMultimapNmax 200 –outSAMtype BAM Unsorted –outFilterMismatchNmax 3 (Jin et al. 2015; Jin and Hammell 2018; Fouche et al. 2020). The bam files were sorted by read name with samtools v.1.2 and the transcriptional activity level was quantified using TEcount from the TEtrancripts package 2.2.1 (Jin et al. 2015), with the following parameters: –stranded no –mode multi, -iteration 1,000. TEcount considers multiple-mapped reads aligned to genes and TE regions to determine transcript abundance per condition/replicate. Furthermore, TEcount only considers reads that map in its entirety to TEs and reads mapping to only a fraction were discarded (Jin et al. 2015; Jin and Hammell 2018). Sequencing reads summarized over TEs and genes were normalized between three replicates in the different conditions using R/Bioconductor package EdgeR v.3.8 (Robinson et al. 2010; Robinson and Oshlack 2010). For normalization, we only considered genes and TEs ≥1 reads in all samples (Anders et al. 2013). TEs with read count 0 were assumed to have no transcriptional activity. Libraries were normalized with TMM method (Robinson and Oshlack 2010), and converted to CPM-mapped reads using R/Bioconductor package EdgeR v.3.8 (Robinson et al. 2010). Comparisons between transcriptional levels were computed in R 3.6.3 (R Core Team 2019).

Single-Nucleotide Variant Detection and Analysis

Single-nucleotide variants were detected using paired-end sequencing reads of each 42 previously sequenced V. dahliae strains (supplementary table 1, Supplementary Material online). Each strain was mapped independently to the reference genome V. dahliae JR2 using BWA -MEM v.0.7 with default options (Li and Durbin 2010). Library artifacts were marked and removed using Picard tools v.2.18 with -MarkDuplicates followed by -SortSam to sort the reads (http://broadinstitute.github.io/picard/). Single-nucleotide variants were identified using the -HaplotypeCaller of the Genome Analysis Toolkit (GATK) v.4.0 (Poplin et al. 2018). Variations were detected individually per strain, using -ploidy 1 and -emitRefConfidence GVCF. Then, all strains were joined by -Jointgenotyping with -maxAltAlleles 2, and all non-SNVs were removed using SelectVariants -selectType SNP. To obtain high-quality SNVs, we applied the following cut-off filters: QUAL < 250, MQ < 40, QD < 20, FS > 60, SOR > 3.0, ReadPosRankSum < −5.0, MQRankSum between −2 and 2. Finally, we excluded variants with missing genotype calls in 10% of the strains. To explore the similarity between strains, an unrooted phylogenetic tree was constructed using a Neighbor-Joining approach based on the final single-nucleotide variance set, using distance function in R 3.6.3 (R Core Team 2019). To further assess the sequence diversity between V. dahliae strains, we calculated the nucleotide diversity (π) based on Nei and Jin (1989) using a sliding window of 1 kb (500 bp sliding) as implemented in the PopGenome package (Pfeifer et al. 2014) in R. Single-nucleotide variants were annotated using SNPeff v.3.2 (Cingolani et al. 2012) using the refined annotation of V. dahliae JR2 strain. Based on the SNPeff prediction, we calculated the number of synonymous and nonsynonymous mutations.

Structural Variant Calling and Analysis

To predict SVs, we used the “sv-callers” workflow with few modifications that enabled parallel execution of multiple SV callers (Kuzniar et al. 2020), an approach that is considered optimal as it exploits often complementary information to predict SVs (Goerner-Potvin and Bourque 2018; Cameron et al. 2019). Briefly, the workflow takes every strain and maps the genomic reads to the reference genome V. dahliae JR2 using BWA-MEM v.0.7 with default options (Li and Durbin 2010). Then, library artifacts were marked and removed using Picard tools v.2.18 with -MarkDuplicates followed by -SortSam to sort the reads (http://broadinstitute.github.io/picard/). We used four different callers: DELLY v.0.8.1 (Rausch et al. 2012), LUMPY v.2.13 (Layer et al. 2014), Manta v.1.6.0 (Chen et al. 2016), and Wham v.1.0 (Kronenberg et al. 2015) in single-sample mode, using each 42 previously sequenced V. dahliae strains independently (supplementary table 1, Supplementary Material online). These callers incorporate a diverse range of information for SV detection such as split reads (DELLY, LUMPY, Wham), discordant read pairs (DELLY, LUMPY, Wham), read depth signal (LUMPY, Wham), short-read assembly (Manta, Wham), and soft-clipping detection (Wham). The four SV callers were used with default parameters, except for DELLY in which we used >1 as minimum quality for further processing. All outputs were first filtered using bcftools -filter v.1.3.2 with default settings used in the “sv-callers” workflow (Li 2011), except for Manta since the score model for postprocessing assumes a diploid genome (Chen et al. 2016). Therefore, we used the unscored Manta predictions for further processing. For final filtering, the “sv-callers” workflow postprocesses the vcfs. Briefly, the results of the four different callers per strain were merged using SURVIVOR v.1.0.6 (Jeffares et al. 2017), keeping SVs predicted by at least three callers, allowing 1,000 bp as the maximum distance from breakpoints predicted by the different tools and considering only same SV types. Subsequently, only SVs with minimum size >50 bp and maximum size 1 Mb as well as localization outside of a low-quality region, defined as MQ = 0 and read support <10 for every strain, were kept. Finally, the 42 independent data sets we combined into a single vcf file using SURVIVOR by merging SVs up to 1,000 bp apart of the same SV type.

Transposable Element Polymorphism Prediction and Analysis

Transposable element presence/absence polymorphisms were analyzed using TEPID v.2.0 (Stuart et al. 2016), using paired-end short reads of 42 previously sequenced V. dahliae strains (supplementary table 1, Supplementary Material online). TEPID integrates split and discordant read mapping, mapping quality, sequencing breakpoints, and local variations in coverage to identify variants with respect to a reference TE annotation (Stuart et al. 2016). We mapped each of the 42 V. dahliae strains individually to V. dahliae JR2 reference genome, using tepid-map (average insert size -s 500), which uses Bowtie2 v.2.2.5 (Langmead and Salzberg 2012). TE variant discovery was computed using tepid-discover, considering insertion and deletion prediction and a conservative discovery through –strict option. We subsequently refined the discovered variants using tepid-refine to reduce false-negative calls within the group of 42 strains.

Statistical Analysis and Visualization

The distribution of SVs over the V. dahliae strain JR2 genome was determined considering the breakpoints (±1 bp) overlapping in 10 kb nonoverlapping windows. We predicted an expected probability distribution assuming a Poisson distribution, using λ = 0.77 (mean number of SVs breakpoints per 10 kb windows). To investigate if SVs colocalize, we performed a clustering analysis using CROC (Pignatelli et al. 2009) based on a hypergeometric distribution test and posterior multiple-testing correction using Benjamini–Hochberg; CROC scans every chromosome with a sliding window (SV-breakpoints per window = 10, slide = 1, >3 SV-breakpoints as a minimum). We computed the same test for TE clustering using TEs per window = 10, slide = 1, >3 polymorphic TEs as a minimum. Permutation tests were computed using R/Bioconductor regionR v.1.18.1 package (Gel et al. 2016) and performed 10,000 iterations, using the mean distance to evaluate the closest relationship (bp distance) between SV-breakpoints and TE elements, and circular randomization to maintain the order and distance of the regions in the chromosomes. We assumed the same parameters to evaluate the association of TEs and adaptive regions but considering the number of overlaps instead of distance. Finally, we computed enrichment association tests using bedtools -fisher v.2.25.0 based on a one-sided Fisher’s test (Quinlan and Hall 2010) and random association permutation tests using R/Bioconductor region v.1.18.1 (Gel et al. 2016), performing 5,000 iterations with a resampling randomization. Principle component analyses were performed in R v.3.6.3, using packages FactoMineR v.1.42 and factoextra v.1.0.5 (Lê et al. 2008). The used variables for each TE were methylation (mCG, mCHG, mCHH), GC content, Kimura distance, CRI, Frequency, and in vitro (MS media) expression. To further investigate the relationship of TEs with gene expression, we summarized TEs in 5 kb windows upstream and downstream genes, using bedtools -window (Quinlan and Hall 2010). We classified genes in “upstream” or “downstream” if ≥1 TE is located in the position in relation to the gene. We considered genes to be located in “between” TEs if these had ≥1 TEs in both positions (upstream and downstream), and we excluded them from the other categories. We ranked each gene near TE based on their expression profile and counted the number of TEs associated with each independent rank. These rank counts were then binned to obtain the distribution of ranks and fit a linear model for the data and calculated the R2 and P value for the fit of the model in R v.3.6.3 (R Core Team 2019). All statistical analyses and comparison tests were performed in R v.3.6.3 (R Core Team 2019), and visualization with R packages ggplot2 and circlize (Gu et al. 2014; Wickham et al. 2016).

Supplementary Material

Supplementary data are available at Genome Biology and Evolution online. Click here for additional data file.

173 in total

Review 1. Genome evolution: sex and the transposable element.

Authors: S Wright; D Finnegan
Journal: Curr Biol Date: 2001-04-17 Impact factor: 10.834

Review 2. The impact of transposable elements in adaptive evolution.

Authors: Lukas Schrader; Jürgen Schmitz
Journal: Mol Ecol Date: 2018-08-04 Impact factor: 6.185

Review 3. Mechanisms of change in gene copy number.

Authors: P J Hastings; James R Lupski; Susan M Rosenberg; Grzegorz Ira
Journal: Nat Rev Genet Date: 2009-08 Impact factor: 53.242

Review 4. RIP: the evolutionary cost of genome defense.

Authors: James E Galagan; Eric U Selker
Journal: Trends Genet Date: 2004-09 Impact factor: 11.639

5. Telomeres and a repeat-rich chromosome encode effector gene clusters in plant pathogenic Colletotrichum fungi.

Authors: Pamela Gan; Ryoko Hiroyama; Ayako Tsushima; Sachiko Masuda; Arisa Shibata; Akiko Ueno; Naoyoshi Kumakura; Mari Narusaka; Trinh Xuan Hoat; Yoshihiro Narusaka; Yoshitaka Takano; Ken Shirasu
Journal: Environ Microbiol Date: 2021-04-12 Impact factor: 5.491

6. regioneR: an R/Bioconductor package for the association analysis of genomic regions based on permutation tests.

Authors: Bernat Gel; Anna Díez-Villanueva; Eduard Serra; Marcus Buschbeck; Miguel A Peinado; Roberto Malinverni
Journal: Bioinformatics Date: 2015-09-30 Impact factor: 6.937

7. Pangenome analyses of the wheat pathogen Zymoseptoria tritici reveal the structural basis of a highly plastic eukaryotic genome.

Authors: Clémence Plissonneau; Fanny E Hartmann; Daniel Croll
Journal: BMC Biol Date: 2018-01-11 Impact factor: 7.431