Literature DB >> 34424154

Population genomics of transposable element activation in the highly repressive genome of an agricultural pathogen.

Danilo Pereira^1,2, Ursula Oggenfuss³, Bruce A McDonald¹, Daniel Croll³.

Abstract

The activity of transposable elements (TEs) can be an important driver of genetic diversity with TE-mediated mutations having a wide range of fitness consequences. To avoid deleterious effects of TE activity, some fungi have evolved highly sophisticated genomic defences to reduce TE proliferation across the genome. Repeat-induced point mutation (RIP) is a fungal-specific TE defence mechanism efficiently targeting duplicated sequences. The rapid accumulation of RIPs is expected to deactivate TEs over the course of a few generations. The evolutionary dynamics of TEs at the population level in a species with highly repressive genome defences is poorly understood. Here, we analyse 366 whole-genome sequences of Parastagonospora nodorum, a fungal pathogen of wheat with efficient RIP. A global population genomics analysis revealed high levels of genetic diversity and signs of frequent sexual recombination. Contrary to expectations for a species with RIP, we identified recent TE activity in multiple populations. The TE composition and copy numbers showed little divergence among global populations regardless of the demographic history. Miniature inverted-repeat transposable elements (MITEs) and terminal repeat retrotransposons in miniature (TRIMs) were largely underlying recent intra-species TE expansions. We inferred RIP footprints in individual TE families and found that recently active, high-copy TEs have possibly evaded genomic defences. We find no evidence that recent positive selection acted on TE-mediated mutations rather that purifying selection maintained new TE insertions at low insertion frequencies in populations. Our findings highlight the complex evolutionary equilibria established by the joint action of TE activity, selection and genomic repression.

Entities: Chemical

Keywords: adaptation; genetic diversity; pathogen; repeat-induced point mutations; selfish elements; transposable elements

Mesh：

Substances：

Year: 2021 PMID： 34424154 PMCID： PMC8549362 DOI： 10.1099/mgen.0.000540

Source DB: PubMed Journal: Microb Genom ISSN： 2057-5858

Data Summary

All Illumina sequence data is available from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) under BioProject numbers PRJNA606320, PRJNA398070 and PRJNA476481 (https://www.ncbi.nlm.nih.gov/bioproject). The Methods and Figs S1–S11, Tables S1–S6 (available with the online version of this article) provide all information on strain locations and outcomes of genome analyses. Selfish genetic elements or transposable elements (TEs) are ubiquitous in the genomes of many microbial eukaryotes. In fungal pathogens, TEs have reshaped genomes by copying and reinserting into new locations, potentially causing an expansion of the genome. The most intriguing effects of TEs occur over short evolutionary timescales by creating beneficial new variants in populations leading to drug resistance or increased virulence. Fungal genomes have also powerful defence mechanisms including repeat-induced point mutations (RIPs), which can deactivate TEs over a few generations. Hence, species are under complex selection regimes to modulate the activity of TEs. Yet, species-wide analyses of TE activity and deactivation are largely lacking. Here, we take a comprehensive view of TE dynamics based on 366 whole-genome sequenced strains representing the global diversity of the fungal wheat pathogen Parastagonospora nodorum. We are able to show that the species retained active TEs despite supposedly highly efficient genomic defences and the impact of purifying selection. Some recent TE activity may even have contributed to adaptive evolution. Our study establishes a microbial eukaryote model for species-wide investigations into mechanisms governing TE activity in genomes.

Introduction

Genetic diversity in natural populations largely determines the evolutionary potential of populations. Polymorphism and diversity in haplotypes can arise from various processes, including single base mutations [1], reshuffling of alleles through recombination [2], chromosomal rearrangements [3] and the action of selfish DNA sequences [4]. Transposable elements (TEs) are ubiquitous selfish elements capable of proliferating throughout the genomic landscape [5]. Transposition activity can increase with senescence [6] or under environmental stress conditions increasing the risk for genetic modifications [7]. TE-mediated genetic changes can impact the fitness of the host. During transposition, TEs can create genetic variation by altering coding sequences, gene regulation or triggering chromosomal rearrangements through non-homologous recombination [8, 9]. The impact of TEs is in most cases negative or neutral, but can rarely also be positive by contributing to adaptation in humans and other organisms [10, 11]. In plant pathogens, TEs are important drivers of genome evolution and adaptation to the host [12-14]. For example, TE activity caused gene deletions and sequence reshuffling in the fungal pathogen Blumeria graminis f. sp. hordei ultimately underpinning host specialization [15]. In Zymoseptoria tritici, a gain in virulence was observed after TE-mediated deletion of a single gene [16], and fungicide resistance emerged from overexpression and splicing alterations of target genes after upstream TE insertions [17, 18]. Despite the major impact on genome stability and the expression of phenotypic traits, ongoing TE activity at the intra-specific level is poorly understood in plant pathogens and other organisms. Major questions remain regarding the genome-wide distribution of active TEs and their impact on fitness. A number of fungal genomes, including the genomes of various plant pathogens, contain regions enriched in TEs and are evolving at faster evolutionary rates largely through sequence rearrangements [12, 19–21]. Polymorphism in genes located within such regions is usually higher than in more conserved regions [21, 22]. Furthermore, key virulence factors tend to localize in the vicinity of repetitive regions [13, 23, 24]. Genes showing evidence for presence–absence polymorphism were shown to be closer to TEs than conserved genes [24-26]. Gene deletions in some pathogens are associated with gains in virulence if the affected proteins are recognized by the host [26]. Controlling the activity of TEs faces a costly trade-off between genomic defence mechanisms such as silencing and the bona fide expression of nearby genes [27]. Suppression of TE proliferation can be mediated by heterochromatin modifications, DNA methylation, meiotic silencing and post-transcriptional regulation [28, 29]. A powerful fungal-specific mechanism is to hypermutate duplicated sequences through repeat-induced point mutation (RIP), a mechanism first described in Neurospora crassa [30]. During sexual recombination, RIP acts via homology recognition changing C:G nucleotides to T:A nucleotides, leaving sequence hallmarks found in genomes of many plant pathogenic fungi [31-35]. Although genomic defences against TEs are frequent across fungi, substantial variation in TE abundance was found among genomes between species but also within species [36-39]. The population frequency of individual TE sequences is restrained by purifying selection [40]. A loss of control or neutral effects can permit copy number expansions [40, 41]. In rare cases, a TE insertion locus can undergo a selective sweep indicating that the insertion may have created adaptive genetic variation. Considering that genomic defences such as RIP were shown to vary between species [35, 42], the interplay of genomic defences, selection and demographics has the potential to drive or restrain TE proliferation among populations. In the wheat pathogen Parastagonospora nodorum, TEs have caused significant genetic alterations and likely facilitated the transfer of a key virulence gene. P. nodorum is a pathogen with high evolutionary potential to adapt to local conditions [24, 43, 44] and a worldwide distribution causing significant wheat-yield losses [45]. The pathogen harbours a repertoire of various effector genes (named Tox genes) that confer host adaptation by causing cell death in susceptible plants. Interestingly, the effector gene ToxA was involved in a horizontal gene transfer (HGT) in the triad of the pathogens P. nodorum, Pyrenophora tritici-repentis and Bipolaris sorokiniana [46, 47]. The HGT was facilitated by the TE environment in which ToxA is embedded. Overall, repetitive elements in the genome of P. nodorum comprise less than 6.2 % of the total genome size [24, 48]; proximity to TEs was correlated with gene presence–absence polymorphisms within populations [24]. Despite extensive evidence for RIP-mediated TE control in the reference genome [32, 49], the species display remarkable examples of TE-mediated phenotypic trait variation. Hence, TE activity may not be fully constrained by genomic defences and shape the evolutionary trajectory of the species. In this study, we screened whole-genome sequences of a global collection of 366 P. nodorum isolates for evidence of recent TE activity and genetic footprints of recent selection. TEs were exhaustively identified based on sequence similarity searches in three complete genomes of the species. We analysed the TE load among populations in relationship with the demographic history of the pathogen. Finally, we identified TE loci potentially contributing to local adaptation and inferred the effectiveness of genomic defence mechanisms shaping TE variation.

Methods

Populations characterized using Illumina whole-genome sequencing

We analysed a total of 366 single spore isolates of P. nodorum sampled between 1991 and 2016. All isolates were collected from naturally infected wheat fields. The sampling locations included: Australia (2001 and 2010; n=23), Brazil (year unknown; n=1), Canada (2005; n=1), Finland (year unknown; n=1), Iran (2005 and 2010; n=20), South Africa (1995; n=23), Sweden (year unknown; n=2), Switzerland (1999A and 1999B; n=46), Arkansas (USA; 1995; n=6), Georgia (USA; 2008; n=5), Maryland (USA; 2008; n=3), Minnesota (USA; 2002, 2003 and 2005; n=14), New York (USA; 1991; n=21), North Carolina (USA; 2008; n=9), North Dakota (USA; 1998, 2003, 2005, 2007, 2008, 2010 and 2016; n=82), Ohio (USA; 2003; n=16), Oklahoma (USA; 2016; n=17), Oregon (USA; 1993 and 2011; n=28), South Carolina (USA; 2008; n=4), South Dakota (USA; 2016; n=8), Tennessee (USA; 2008; n=5), Texas (USA; 1992; n=27) and Virginia (USA; 2008; n=4). Earlier publications referred to the Switzerland 1999B population as being of Chinese origin [50-54]. A more recent study corrected the population origin [44]. Illumina whole-genome sequence data was generated for all 366 isolates with paired-end sequencing and a read length of 100–150 bp. Raw data was accessed from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) under BioProject ID numbers PRJNA606320, PRJNA398070 and PRJNA476481 [24, 37, 44].

Genome alignment, variant calling and quality filtering

Raw reads were trimmed for remaining Illumina adaptors and read quality was assessed using Trimmomatic version 0.36 [55] with the following parameters: illuminaclip=TruSeq3 PE.fa:2 : 30 : 10, leading=10, trailing=10, slidingwindow=5 : 10, minlen=50. Trimmed reads were aligned against the reference genome established for the isolate Sn2000 [56] using the short-read aligner Bowtie2 version 2.3.3 [57] with the –very-sensitive-local option. PCR duplicates were marked using the MarkDuplicates option in Picard tools version 2.17.2 (http://broadinstitute.github.io/picard). All sequence alignment (SAM) files were sorted and converted to binary (BAM) files using SAMtools version 1.2 [58]. SNP calling and variant filtration were performed using the Genome Analysis Toolkit (GATK) version 3.8-0 [59]. We used HaplotypeCaller on each alignment file individually with the --emit-ref-confidence GVCF and -ploidy one options. Joint variant calls were produced using GenotypeGVCFs with the flag -maxAltAlleles 2. Finally, SelectVariants and VariantFiltration were used for hard filtering SNPs failing the following cut-offs: QUAL >200; QD >10.0; MQ >20.0; –260].

Phylogenomic and population structure analyses

We inferred the evolutionary history of all 366 P. nodorum isolates using two phylogenetic tree reconstruction methods. A maximum-likelihood (ML) tree was estimated for all isolates using the GTRCAT model in RAxML version 8.2.12 [61]. We performed 100 rapid bootstraps with 20 ML searches. The best ML tree was chosen based on support values. We analysed evidence for reticulation among P. nodorum genotypes by building a Neighbor-Net (NN) network with SplitsTree4 version 4.15.1 [62]. The ML tree and the NN network were visualized using the online tool iTOL version 4 [63]. VCF files were converted to RAxML input (phylip) and SplitsTree4 input formats (Nexus) using PGDSpider version 2.1.1.5 [64]. Population structure was inferred based on a principal component analysis (PCA) in tassel version 5.2.56 and a model-based clustering implemented in structure version 2.3.4 [65, 66]. We performed the PCA based on SNP data filtered for a minor allele frequency above 5 % and based on TE frequency insertion sites filtered for a minor allele frequency above 1 %. We visualized the two first principal components using the ggplot2 package in R [67, 68]. For structure analyses, we used an admixture model independent of prior population information and with correlated allele frequencies. The algorithm ran with a burn-in length of 50 000 and a simulation length of 100 000 Markov chain Monte Carlo (MCMC) repetitions. We explore a range of K between 1 and 10, with 10 repetitions per K. The most likely number of populations (K) was estimated using the Delta K (ΔK) method [69] implemented in the R package pophelper version 2.3.0 [70]. For all phylogenetic and structure analyses, we randomly selected SNPs at a mean distance of 15 kb using the vcftools parameter --thin 15 000 to reduce linkage disequilibrium (LD) among loci and computational demands.

Population genetics and selective sweeps

Allele frequencies and nucleotide diversity (p) per site were determined using the options --freq and --site-pi in vcftools [60]. Histograms for minor allele spectrum were generated in ggplot2. LD decay was estimated per population based on 50 kb windows on chromosome one using r 2 with the option --hap-r2 in vcftools [60]. We performed Tajima’s D neutrality tests [71] using the vcftools --haploid flag and analysed non-overlapping bins of 1 kb across the entire genome [60]. Selective sweeps were identified using a likelihood-based detection method implemented in the program SweeD version 3.0 using the option -folded [72]. We ran SweeD 3.0 individually for each of the 23 chromosomes of the reference genome Sn2000 in grids of 1 kb. If LD decay was slow in a population (i.e. r 2 ≥0.2 at 10 kb), we merged together adjacent genomic regions under selection if these were separated by less than 20 kb. We analysed regions of interest by adding 10 kb on each side of the window identified by SweeD.

Analysis of regions with signatures of selection using gene ontology (GO)

The gene annotation for the reference genome Sn2000 was previously established [43]. Briefly, predicted protein sequences of P. nodorum Sn15 strain were aligned and queries with a at least 95 % identity score were maintained. Gene identifiers were retained. Genomic regions showing signatures of selection were used as coordinates in BEDtools version 2.29.0 [73] to intersect annotated genes in the Sn2000 reference strain annotation [56]. Genes were annotated for protein family (PFAM) domain and GO terms using interproscan version 5.36-75.0 with default parameters and a local pre-calculated match lookup service [74]. Protein secretion signals were predicted using SignalP version 4.01 [75], Phobius [76] and tmhmm version 2.0 [77]. GO enrichment analyses were performed using the packages GSEABase version 1.35.0 and GOstats version 2.38.1 in R [78]. We used the false discovery rate of 5 % as a cut-off and minimum GO term size of 5 for the hypergeometric test.

TE consensus sequence identification and classification

To obtain consensus sequences for TE families, we performed individual runs of RepeatModeler version 1.0.8 on the three complete genomes Sn2000, Sn4 and Sn79-1087. Identified sequences were annotated based on GIRI Repbase using RepeatMasker version 4.0.7 [56, 79, 80]. Further classification and processing of consensus sequences was performed using WICKERsoft [81]. We screened the three complete genomes for copies of the above detected consensus sequences with blastn filtering for sequence identity of >80 % over >80 % of the length of the sequence [82]. We then added flanks of at least 300 bp both upstream and downstream of each sequence. We made multiple sequence alignments using ClustalW and defined TE boundaries by visual inspection before updating consensus sequences [83]. If possible, consensus sequences were classified according to the presence and type of terminal repeats, superfamily-specific start and end bases or target site duplications, as well as homology of encoded proteins using blastx in the NCBI nr database. We excluded duplicated consensus sequences using dotter for inspection [84]. We assigned consensus sequence names according to the three-letter naming system [85]. Two predicted TE families showed strong length polymorphism and only weak sequence similarity between all individual insertions. Using blastn on individual insertions, we found that the sequences matched one of three different regions of the entire consensus sequences. We visualized the consensus sequence alignment with genoPlotR version 0.8.9 in R [86] and used blastn to identify matching regions in the Sn2000, Sn4 and Sn79-1087 complete genomes [56, 87, 88]. To re-define consensus sequences, we proceeded as described above. In a second round of TE annotation, we focused on protein-encoding sequences matching previously identified fungal TE superfamilies. We screened the three complete genomes for matches of protein sequences representative of each superfamily from other fungi using tblastn. We filtered hits for a minimal alignment length of 80 bp and a sequence similarity >25 %. Identified sequences were retrieved including 300 bp flanking sequences. Hits were analysed for their matching sequencing with dotter and grouped into families based on visual inspection. We made further multiple sequence alignments using ClustalW and defined TE boundaries by visual inspection. Finally, the TE family consensus sequences from the two methods above were used to annotate the reference genomes using RepeatMasker version 4.0.7 [56, 79, 80]. We estimated the guanine–cytosine (G+C) content as G+C/G+C+A+T on the TE families retained in the consensus sequence: the fasta consensus file was converted to emboss sequence query using emboss seqret [89], and run on emboss geecee (https://www.bioinformatics.nl/cgi-bin/emboss/geecee). TE presence–absence annotation in the global collection of isolates was performed with the method proposed by Linheiro and Bergman [90], with ngs_te_mapper version 1 (https://github.com/bergmanlab/ngs_te_mapper/commit/fb23590200666fe66f1c417c5d5934385cb77ab9) implemented in R. Dependences to ngs_te_mapper pipeline were BWA version 0.7.17-r1188 [91], to map Illumina reads, and SAMtools version 1.2 [58], to perform output file conversion. Depth of coverage can impact robust TE discovery; therefore, we set a minimum depth of 10×, which allows recovery of ≥90 % of identified TEs across populations (Fig. S1). As a consequence, we removed 18 isolates and retained a total of 348 isolates for downstream TE polymorphism analyses. We clustered nearby TE insertions into a single locus if the insertions (i) belonged to the same TE family and (ii) were located within 100 bp distance. Clustering of TE insertion sites was performed using the package GenomicRanges version 1.38.0 implemented in R [92].

Results

Global population structure and phylogenomics

We analysed genome sequencing data for 366 P. nodorum isolates and identified a total of 487 477 high-confidence SNP markers with a minor allele frequency of 5 %. We identified three main groups by performing PCAs (Figs 1a and S2). Isolates from Oklahoma (USA) formed a distinct group flanked by the two largest clusters. Unsupervised Bayesian clustering analyses revealed a similar pattern of high admixture among genotypes from different continents with the optimal number of clusters being K=2 (Figs 1b and S3). At K=3, isolates from Iran and the northern United States (North Dakota, South Dakota and Minnesota) grouped as genetic cluster two, while isolates from the southern United States shared membership with cluster one. Genotypes from Australia and South Africa were assigned mainly to cluster three. Genotypes from Swiss populations and Oklahoma (USA) showed significant admixture. It is important to note that North America is overrepresented in our genomic sampling compared to other regions and the high local genotypic diversity is at least partially explained by the high number of samples. We used genome-wide SNPs to infer a ML tree and identified four well-supported major clades largely independent of sampling origin (Fig. 1c). We found evidence of reticulation indicative of recombination based on a SplitsTree network (Fig. 1d). The network also showed that multiple populations harboured groups of highly similar genotypes indicative of recent ancestry. Both the ML tree and the SplitsTree network identified four major genetic groups including a group of isolates from Australia and South Africa, a group of isolates from Brazil, Finland, Sweden, Switzerland and the southern United States, as well as a group of isolates from Canada, Iran and the northern United States. The fourth group is composed of isolates solely from Oklahoma (USA). The emergence of the genetic groups could stem from historic gene flow and may be shaped by environmental similarities. Our results indicate significant levels of gene flow among populations from the same continent, as well as among a set of populations from different continents. Populations constituting genetic group two were mostly sampled in regions classified as temperate climates (Cfa and Cfb; Table S1), while in genetic group three, samples from continental climates are predominant (Dfa and Dfb; Table S1). Although climate characteristics are not identical among sampling locations (e.g. recent populations of Australia and South Africa both group in genetic group one), similarities in climate regime could represent an important factor affecting global gene flow patterns in P. nodorum.

Fig. 1.

Global population structure and phylogenomics of P. nodorum. (a) PCA with dots representing individual isolates coloured according to the genetic group. PC, Principal component. (b) Estimated population genetic structure. Each vertical bar represents one individual, coloured according to cluster membership values. The cluster designation is represented by K numbers. Iran (Middle East) is highlighted as matching the region of origin of P. nodorum. (c) ML phylogenetic tree. (d) Phylogenetic network reconstruction. Scale bars represents the mean number of nucleotide substitutions per site and the split support for the edges, respectively. Population genetic structure and phylogenetic reconstructions were based on 2278 genome-wide SNP markers spaced evenly at 15 kb. The PCA was based on the complete SNP dataset.

Population-level diversity and demographics

We used nucleotide diversity and minor allele frequency spectra to quantify variation in genetic diversity across major genetic groups and local populations (Fig. 2). Genetic group three (northern USA and Iran) had the highest nucleotide diversity (4.39×10−3) but also the most considerable variation among sites (Fig. 2a). North Dakota (USA) and Iran were highly diverse in contrast to South Dakota (USA). Genetic groups two (mainly southern USA) and 4 (Oklahoma, USA) had intermediate levels of nucleotide diversity (3.96×10−3 and 3.80×10−3, respectively). Genetic group one had the lowest nucleotide diversity (3.63×10−3). We analysed minor allele frequency spectra across populations to detect evidence for past demographic events (e.g. bottlenecks and expansions). Except for South Dakota (USA), populations showed low-frequency alleles being more abundant than high-frequency alleles, suggesting random mixing and low degrees of recent admixture (Fig. 2b). Next, we estimated LD decay and found that the distance for r 2 <0.2 varied between ~2.2 kb in Switzerland 1999B and Australia to 44.1 kb in Oklahoma (USA) (Fig. 2b). Most populations (10 out of 16) showed a fast LD decay (r 2 <0.2 within 10 kb), indicating high levels of diversity and ongoing recombination. The slow LD decay in South Africa and Oklahoma (USA) may indicate recent admixture or more dominant roles of asexual reproduction.

Fig. 2.

Worldwide analyses of demographics and population diversity. (a) Nucleotide diversity for genetic groups and sampling locations. Colours represent different genetic groups. (b) Minor allele frequency spectrum for individual sampling locations. Sampling locations are represented on the map, highlighting northern United States and Iran in yellow; Oklahoma (USA) in brown; the southern United States and Switzerland in blue, and Australia and South Africa in pink. The number in the top right corner of histograms represents the distance for LD to decay below r 2 <0.2. All estimates are based on SNPs on chromosome one.

TE landscape and insertion dynamics among populations

Adaptation to local conditions may emerge from genetic variation produced by TE-mediated sequence rearrangements. Here, we performed de novo TE annotation using three complete genomes of P. nodorum [56] and combined this information into a single set of TE consensus sequences. Our consensus annotation identified 25 TE families in P. nodorum and revealed that the genomes Sn2000, Sn4 and Sn79-1087 are composed of 4.40, 4.23 and 1.57 % TEs, respectively (Fig. 3, Table S2). The TE density among the three reference genomes was heterogeneous (Fig. 3a). The reference isolate Sn79-1087 showed an overrepresentation of TEs mostly in subtelomeric regions. TEs in Sn2000 and Sn4 showed TE blocks along chromosomes with a particularly high density on chromosome 10 (Fig. 3a). Among the TE classes, class I elements comprised 3.76, 3.37 and 0.82 % of the genome, and class II elements comprised 0.50, 0.72 and 0.42 % of the genome in Sn2000, Sn4 and Sn79-1087, respectively. These findings are consistent with the previous estimates [48, 56]. Moreover, we show that complete genomes vary in the content and density of TEs highlighting the interest to screen TE dynamics at the population level.

Fig. 3.

Genome-wide detection of TEs. (a) Annotation of the completely assembled reference genomes. The outer black ring delimitates the 23 chromosomes and positions based on the Sn2000 reference genome (in Mb). Grey stars show approximate positions of the genes ToxA, Tox1 and Tox3 encoding effector proteins. The gene density is calculated for 100 kb windows (darker green for higher densities). The TE density was calculated for 100 kb windows (darker orange for higher densities). Inner grey rings highlight the occurrence of TE families with the highest number of loci in the genome (DTT_Tarvos with blue circles) and the highest degree of singleton (RLC_Phobus with red circles). The position on the y-axis shows the relative difference in the allele frequencies among populations. (b) Linear correlation plot of G+C content and TE copy number across populations. (c) Linear correlation plot of distinct loci per TE family across populations. (d) Analyses of TE copies in the three reference genomes (Sn2000, Sn4 and Sn79-1087) for G+C content and the mean TE length. Red vertical dashed lines indicate the genome-wide mean G+C content (52 mol%) and the dotted blue vertical line shows the 400 bp threshold. We inferred population-level variation in TE activity by analysing the 348 isolates with genome sequences for evidence of newly inserted or deleted TEs. We identified 3850 TE insertions across all isolates with the insertions clustered into 167 unique loci in the genome (Fig. S4a). Recently inserted TEs were mainly DNA transposons (class II; n=2470) and included fewer retrotransposons (class I; n=1380). At the order level, we found only terminal inverted repeats (TIRs) and long terminal repeats (LTRs), respectively (Fig. 4a). Superfamilies were composed mainly of elements classified as class II Tc1-Mariner (n=1742), followed by class II hAT (n=182) and class I Copia (n=104; Fig. 4a). The remaining 1229 LTRs and 546 TIRs were not conserved enough to assign a superfamily. We found a stark difference between autonomous (n=602) and non-autonomous elements (n=3248; Figs 4a and S4b). Non-autonomous elements included 2019 miniature inverted-repeat transposable elements (MITEs; 62.1 %) and 1229 terminal repeat retrotransposons in miniature (TRIMs; 37.9 %). The total count of recent TE insertions in a population represents the load generated through TE activity. TEs were highly homogeneous between populations and genetic groups, with the highest mean numbers of new insertions in isolates of genetic group one (South Africa and Australia; mean=13.1 insertions) and the lowest in genetic group four (Oklahoma; mean=8.5 insertions; Figs 4c and S5). The TE superfamily composition is highly similar among genetic groups with a predominance in Tc1-Mariner and an unknown LTR (Fig. 4d). Recently inserted Gypsy elements were found only in genetic groups two and three. At the family level, insertions of DTX_MITE_Sirius were mostly found at a single locus on chromosome 19 (99 %), and insertions of DTA_Mimas were mostly found at a single locus on chromosome 23 (67 %). Chromosome 23 was identified as an accessory chromosome in P. nodorum [56, 93]. Similarly, TE families DTX_MITE_Ceti and DTX_MITE_Galatea were found only at a single locus on chromosomes 12 and 19, respectively. Most variation in TE loci numbers was driven by DTT_Tarvos (269 copies in total among isolates distributed across 59 loci; Fig. 3a). The RLC_Phobus showed a high degree of singletons with different 44 loci (Fig. 3a). Finally, TE copy expansion was driven by non-autonomous elements DTT_MITE_Geminga (821 copies), RLX_TRIM_Sinope (793 copies) and DTT_MITE_Eridani (652 copies). The species-wide screens of recently inserted TEs show that populations harbour fairly balanced TE loads with only a minor divergence in TE composition.

Fig. 4.

Classification and differential load of TEs among genetic groups. (a) Bar plots of TE counts according to family and transposition mode. (b) TE insertion frequency distribution per genetic group. (c) The mean number of TEs per isolate within genetic groups. (d) TE counts per isolate and TE family across genetic groups. Each bar represents a single isolate and colours identify TE families.

Differential activity among TEs and evidence for purifying selection

P. nodorum has an active genome defence called RIP, which induces C:G to T:A mutations within and proximal to duplicated sequences. The preferential AT-mutation bias of RIP will gradually lower the G+C content in RIP-affected genomic regions [32, 94]. We estimated G+C content as a proxy for RIP activity at the level of TE families. Active TEs not affected by RIP are expected to have a G+C content comparable to the genome-wide mean of coding sequences. We considered the population-level TE activity as the total number of copies per TE family across the 348 isolates. Interestingly, we found that TE families with higher copy numbers across the populations have higher G+C content than low-copy TE families (Fig. 3b). The three most-expanded TE families (DTT_MITE_Geminga, RLX_TRIM_Sinope and DTT_MITE_Eridani) showed the highest G+C content (mean G+C of 41.8, 43.7 and 48.2 mol%, respectively). In contrast, TE families with copies divided across more loci were lower in G+C content, while those in fewer unique loci showed higher G+C content (Fig. 3c). The majority of TEs with high G+C content in both analyses were non-autonomous MITEs and TRIMs. We also found consistent variation in G+C content within TE families, with TEs of short length usually being of higher G+C content (e.g. MITEs and TRIMs) (Fig. 3d). Overall, recently active TEs in the P. nodorum genome are mostly of short length and constituted of non-autonomous elements. TE frequencies across the genome are impacted by the joint actions of TE activity, selection on the insertion and demographics. We analysed whether the TE insertion frequency spectrum reflected neutral processes such as drift and migration, or selection. We performed a PCA using TE presence at insertion loci as a genetic marker. In contrast to genome-wide SNPs, we found no geographical signal for TE insertion loci (Fig. S6a–c). The majority of all TEs were present at <10 % frequency in the global set of isolates, and about 48.5 % of all TEs were singletons (Fig. S7). We found a particularly pronounced shift towards singletons in genetic group two, suggesting strong purifying selection acting against the TEs in this group (Figs 4b and S8).

Genomic signatures of recent selective sweeps

We analysed genetic groups for footprints of recent selective sweeps. We calculated the composite likelihood ratio implemented in the software SweeD for genetic groups one, two and three separately (group four composed of Oklahoma was omitted given the high admixture signatures). We detected a total of 46 genomic regions with significant signatures of recent sweeps across all groups (Fig. S9, Table S3). Selective sweeps were detected on all chromosomes except for chromosomes 1, 14 and 16. The size of the regions showing signatures of selective sweeps ranged from 20 to 40.7 kb. Interestingly, no overlaps in selective sweep regions were found among the genetic groups one, two and three, indicating strong heterogeneity in selection pressures and local adaptation. In genetic group one, a total of 16 genomic regions with sweep signatures were detected on chromosomes 2, 4–6, 8, 9, 11, 12, 15, 18–20 and 23 based on a 99.9 % outlier threshold (Fig. 5a, Table S3). The mean length of the sweep regions was 21.3 kb spanning a total of 341.5 kb or 0.9 % of the genome. The selective sweep region with the highest likelihood score (136) was located on chromosome two (Fig. 5b). We analysed Tajima’s D values in the same region and found a positive but below-average value for the chromosome (Fig. 5c). LD decay was slower in the sweep region with a mean of r 2=0.48 over ~20.8 kb compared to a decay of r 2 <0.2 within 5.7 kb for genetic group one (Fig. 5e). The above-average LD is consistent with a recent selective sweep. In genetic groups two and three, we detected a total of 16 genomic regions with signatures of selection in total (Fig. S9, Table S3).

Fig. 5.

Selective sweep loci in genetic group one. (a) Distribution of the composite likelihood scores in windows of 1 kb. The dashed horizontal line indicates the 99.9th percentile threshold and the grey highlighting shows the strongest sweep region. (b) Distribution of the composite likelihood scores within the strongest sweep region. (c) Tajima’s D values calculated for 1 kb windows. (d) Schematic of genes highlighting predicted function and TEs. (e) Heatmap of pairwise LD r 2 within the sweep region. Gene annotation and chromosomal positions are based on the reference genome Sn2000.

Genes potentially underlying local adaptation

Globally, we found 334 genes in selective sweep regions (genetic groups one, two and three; Table S3). The number of genes per sweep region ranged from 2 to 22. In genetic group one (South Africa and Australia), we identified a total of 99 genes (Table S4). The mean number of genes per sweep region was 6.2 ranging from 3 to 10 genes. We tested whether genes in sweep regions were overrepresented for genes encoding specific protein functions. We found 27 significantly overrepresented GO terms, including basal metabolic processes (e.g. cellular macromolecule biosynthetic/metabolic process), nutrient mobilization (e.g. nitrate assimilation) and transcriptional regulation (e.g. regulation of transcription, zinc ion binding; Table S5). The genomic region with the highest likelihood score in genetic group one contained two genes with roles in transcription, including SNOG_02334 (a zinc-finger transcription factor) and SNOG_30155 (a histidine phosphatase superfamily). We also identified genes encoding transporters, including SNOG_02348 (amino acid permease) and SNOG_02336 (cytochrome b5-like haem; Fig. 5d). Overall, genes in sweep regions were found to control metabolic functions and regulatory processes. Among the 99 genes, 2 genes (SNOG_07292, SNOG_30828) were previously ranked as strong candidate effectors [37]. The genetic signal for a sweep found on chromosome eight (from 520 242 to 540 242 bp; coordinates on the reference genome Sn2000) was in close proximity to the locus containing the necrotrophic effector ToxA. A detailed analysis revealed a strong LD with the downstream region of ToxA and several TE insertion loci (Fig. S10). The presence of Copia, Gypsy and Tc1-Mariner-like elements were previously reported at the ToxA locus [47], here we identified additional MITEs and unclassified TEs (Fig. S10). In addition to TE insertions, the region covers 18 genes encoding among other proteins a kinase activator, a major facilitator superfamily transporter and signalling proteins (Table S6). We examined whether the 99 genes located in selective sweep regions from genetic group one were in close physical proximity of TEs, but we detected no overrepresentation of TEs in proximity of these genes compared to the genomic background (Fig. S11). Our findings show that the recent insertion dynamics of TEs in P. nodorum populations are unlikely to have driven recent selective sweeps. However, the insertion of individual TEs in proximity to known effector genes may well have had an impact on the evolution of virulence.

Discussion

TE activity can be a major source of genetic variation in the fungal genome. Yet, we lack a comprehensive view of how evolutionary forces including genetic drift and selection act on TE dynamics in natural populations. Here, we analysed the intra-species population genetic diversity, genome-wide dynamics of TEs and signatures of selection in a worldwide collection of the plant pathogen P. nodorum. This species exhibits relatively low population subdivisions consistent with recent gene flow among continents. The TE compositions and copy numbers were very similar among populations. Despite strong evidence for highly active genomic defences [32, 49], we identified recent TE activity in the species pool. TEs do not appear to be the main drivers of recent adaptive evolution and are rather under purifying selection. Yet, we show that some high-copy TEs have sequence signatures consistent with an escape from genomic defence mechanisms. For this work, we significantly expanded previous population genomic analyses of P. nodorum for both geographical coverage and sample size [24, 43, 44]. Our results corroborate evidence for genetic admixture among North America, Europe (Switzerland) and the Middle East (Iran) consistent with earlier analyses based on microsatellite loci [51]. High levels of gene flow among geographically widespread pathogen populations can lead to maladaptation and counteract local adaptation [95]. Yet, gene flow can be particularly advantageous for plant pathogens in the agro-ecosystem because agricultural practices and key susceptibility genes can be shared globally among crop growing areas [24, 95]. Indeed. we found high genetic diversity within and among field P. nodorum populations, with a rapid LD decay reflecting frequent sexual recombination. The slow LD decay found in more recently founded populations (e.g. South Africa) most likely reflects recent admixture. The historical expansion of wheat cultivation across the globe was likely accompanied by P. nodorum spreading from the Fertile Crescent to most continents. However, the global expansion of agriculture and pathogens likely imposed strong genetic bottlenecks in newly founded populations [45, 96]. Besides a reduction in genetic diversity, such demographic effects can influence genome-wide TE activity in some species [40, 97]. We found no striking differences in TE abundance among populations. Genetic group one, comprising the more recently founded populations of Australia and South Africa, showed similar TE composition as genetic group two (including Iranian isolates). Populations of the wheat pathogen Z. tritici showed striking expansion in most TE superfamilies [40, 98]. As Z. tritici and P. nodorum are thought to share common domestication origins and subsequent dispersal routes, differences in how TEs were impacted may be due to differences in genomic defences (e.g. cytosine DNA methylation polymorphism in Z. tritici [99]). Plant populations can similarly be shaped by bottleneck effects, but TEs did not show pronounced activation as found for example in the wild grass Brachypodium distachyon [97]. TE activity shaped by demographics highlights how genome-wide TE proliferation is complex and can be governed by stochastic effects. We identified differences in the genome-wide TE density and distribution among the three completely assembled and re-annotated P. nodorum reference genomes. Polymorphism within the species in terms of TE content is indicative of TE activity and the existence of evolutionary hot spots. Such rapidly evolving regions in the genome were repeatedly associated with rapid evolution in plant pathogens. Rapid diversification of effector genes, which constitute a key virulence component, was thought as conferring strong advantages in the arms race with the plant host [12, 14]. However, loss of control over TE proliferation can have deleterious consequences to genome stability and the persistence of lineages [8, 100]. Compartmentalization of repetitive regions in the genome of P. nodorum was previously shown [56], and can mediate local adaptation by causing gene gains and losses [24]. The fact that the strain with the lowest TE content (Sn79-1087) is also avirulent on wheat may be an element of the evolutionary transitions in the lineage to become established on wheat in the agro-ecosystem [46, 56]. However, we found little evidence for TEs playing a major role in selective sweeps leading to local adaptation or sweep regions enriched in virulence factors in our populations. This is similar to other plant pathogens [101, 102], and the lack of enrichment in virulence-related genes may be explained by selection mostly acting on traits determined by small effect loci and not major gene-for-gene interaction loci (i.e. effectors). The overall low number of repetitive sequences (<4.5 % of genome length) and extensive signatures of RIP strongly suggest that the P. nodorum genome is largely devoid of TE activity [32, 48, 48]. This is consistent with the view that TE multiplication in a genome is determined by the balance between the TE’s transposition potential and the effectiveness of genomic defences [103]. Here, we show that P. nodorum TE families exhibit a striking variation in G+C content, which is positively correlated with the TE copy numbers in the genome. The correlation suggests that an increase in TE copy numbers is associated with weak RIP activity on these TEs. RIP guards the genome against TE proliferation by targeting duplicated DNA (e.g. recent copies of TEs), but the RIP intensity can vary among TE families [104, 105]. The impact of RIP can also extend beyond repetitive regions, increasing mutation rates across the entire genome [106] and significantly impacting pathogenicity [22, 33]. In N. crassa, the action of RIP is impaired in duplicated sequences shorter than ~400 bp [94, 107], and with an identity below 80 % [108]. We found that high-copy TE families with high G+C content are mostly MITEs and TRIMs. These elements generally lack coding sequences [85] and are on average <400 bp in P. nodorum. Hence, the MITEs and TRIMs may have escaped RIP through their compactness. Both groups of TEs were found to mediate genome evolution and modulate gene expression across eukaryotes [41, 109–111]. In addition, MITEs are thought to increase in copy number despite genomic defences with possible benefits to DNA transposons [112, 113] and underlie the expression of pathogenicity [22, 114, 115]. MITEs and TRIMs are ubiquitous in the pangenome of the wheat pathogen Z. tritici, which has a similar genome size but more active TEs [39]. MITEs in Z. tritici become de-repressed upon encountering stressful conditions, such as during the colonization of the host plant [41]. The MITE activity in P. nodorum could also be a consequence of post-transcriptional regulation. In rice, short interfering RNA was found to preferentially suppress MITE expression [28, 116]. Genomic regions showing signatures of recent selective sweeps did not contain recently inserted TEs. Whether the activity of MITEs and TRIMs in P. nodorum is contingent on stressful conditions and whether transposition activity creates adaptive variation remains unknown. In this study, we resolved the global population structure and retraced TE activity in the repressive genome of an important wheat pathogen. TEs can play a major role in plant pathogen adaptation by generating adaptive genetic variation to resist pesticides or to overcome host defence mechanisms. Key pathogenicity genes show tightly associated expression patterns with TEs in several plant pathogens. However, the selfish nature of TEs can also impose severe costs on the pathogen. Population genetic analyses informed by highly contiguous genome sequences are powerful tools to disentangle the evolutionary equilibrium between TE activation and repression. Click here for additional data file.

111 in total

1. CLUSTAL: a package for performing multiple sequence alignment on a microcomputer.

Authors: D G Higgins; P M Sharp
Journal: Gene Date: 1988-12-15 Impact factor: 3.688

2. A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis.

Authors: E L Sonnhammer; R Durbin
Journal: Gene Date: 1995-12-29 Impact factor: 3.688

Review 3. Repeat-Induced Point Mutation and Other Genome Defense Mechanisms in Fungi.

Authors: Eugene Gladyshev
Journal: Microbiol Spectr Date: 2017-07

4. Analysis of Repeat Induced Point (RIP) Mutations in Leptosphaeria maculans Indicates Variability in the RIP Process Between Fungal Species.

Authors: Angela P Van de Wouw; Candace E Elliott; Kerryn M Popa; Alexander Idnurm
Journal: Genetics Date: 2018-11-02 Impact factor: 4.562

5. Natural selection drives population divergence for local adaptation in a wheat pathogen.

Authors: Danilo Pereira; Daniel Croll; Patrick C Brunner; Bruce A McDonald
Journal: Fungal Genet Biol Date: 2020-05-01 Impact factor: 3.495

6. Genome expansion and gene loss in powdery mildew fungi reveal tradeoffs in extreme parasitism.

Authors: Pietro D Spanu; James C Abbott; Joelle Amselem; Timothy A Burgis; Darren M Soanes; Kurt Stüber; Emiel Ver Loren van Themaat; James K M Brown; Sarah A Butcher; Sarah J Gurr; Marc-Henri Lebrun; Christopher J Ridout; Paul Schulze-Lefert; Nicholas J Talbot; Nahal Ahmadinejad; Christian Ametz; Geraint R Barton; Mariam Benjdia; Przemyslaw Bidzinski; Laurence V Bindschedler; Maike Both; Marin T Brewer; Lance Cadle-Davidson; Molly M Cadle-Davidson; Jerome Collemare; Rainer Cramer; Omer Frenkel; Dale Godfrey; James Harriman; Claire Hoede; Brian C King; Sven Klages; Jochen Kleemann; Daniela Knoll; Prasanna S Koti; Jonathan Kreplak; Francisco J López-Ruiz; Xunli Lu; Takaki Maekawa; Siraprapa Mahanil; Cristina Micali; Michael G Milgroom; Giovanni Montana; Sandra Noir; Richard J O'Connell; Simone Oberhaensli; Francis Parlange; Carsten Pedersen; Hadi Quesneville; Richard Reinhardt; Matthias Rott; Soledad Sacristán; Sarah M Schmidt; Moritz Schön; Pari Skamnioti; Hans Sommer; Amber Stephens; Hiroyuki Takahara; Hans Thordal-Christensen; Marielle Vigouroux; Ralf Wessling; Thomas Wicker; Ralph Panstruga
Journal: Science Date: 2010-12-10 Impact factor: 47.728

7. Recent Activity in Expanding Populations and Purifying Selection Have Shaped Transposable Element Landscapes across Natural Accessions of the Mediterranean Grass Brachypodium distachyon.

Authors: Christoph Stritt; Sean P Gordon; Thomas Wicker; John P Vogel; Anne C Roulin
Journal: Genome Biol Evol Date: 2018-01-01 Impact factor: 3.416

8. Pangenome analyses of the wheat pathogen Zymoseptoria tritici reveal the structural basis of a highly plastic eukaryotic genome.

Authors: Clémence Plissonneau; Fanny E Hartmann; Daniel Croll
Journal: BMC Biol Date: 2018-01-11 Impact factor: 7.431

9. Massive Expansion of Gypsy-Like Retrotransposons in Microbotryum Fungi.

Authors: Felix Horns; Elsa Petit; Michael E Hood
Journal: Genome Biol Evol Date: 2017-02-01 Impact factor: 3.416

10. Fast and accurate short read alignment with Burrows-Wheeler transform.

Authors: Heng Li; Richard Durbin
Journal: Bioinformatics Date: 2009-05-18 Impact factor: 6.937

2 in total

1. Variability in an effector gene promoter of a necrotrophic fungal pathogen dictates epistasis and effector-triggered susceptibility in wheat.

Authors: Evan John; Silke Jacques; Huyen T T Phan; Lifang Liu; Danilo Pereira; Daniel Croll; Karam B Singh; Richard P Oliver; Kar-Chun Tan
Journal: PLoS Pathog Date: 2022-01-06 Impact factor: 6.823

2. Genome-scale phylogenies reveal relationships among Parastagonospora species infecting domesticated and wild grasses.

Authors: D Croll; P W Crous; D Pereira; E A Mordecai; B A McDonald; P C Brunner
Journal: Persoonia Date: 2021-02-14 Impact factor: 11.658

2 in total