Literature DB >> 28292787

Extensive Copy Number Variation in Fermentation-Related Genes Among Saccharomyces cerevisiae Wine Strains.

Abstract

Due to the importance of Saccharomyces cerevisiae in wine-making, the genomic variation of wine yeast strains has been extensively studied. One of the major insights stemming from these studies is that wine yeast strains harbor low levels of genetic diversity in the form of single nucleotide polymorphisms (SNPs). Genomic structural variants, such as copy number (CN) variants, are another major type of variation segregating in natural populations. To test whether genetic diversity in CN variation is also low across wine yeast strains, we examined genome-wide levels of CN variation in 132 whole-genome sequences of S. cerevisiae wine strains. We found an average of 97.8 CN variable regions (CNVRs) affecting ∼4% of the genome per strain. Using two different measures of CN diversity, we found that gene families involved in fermentation-related processes such as copper resistance (CUP), flocculation (FLO), and glucose metabolism (HXT), as well as the SNO gene family whose members are expressed before or during the diauxic shift, showed substantial CN diversity across the 132 strains examined. Importantly, these same gene families have been shown, through comparative transcriptomic and functional assays, to be associated with adaptation to the wine fermentation environment. Our results suggest that CN variation is a substantial contributor to the genomic diversity of wine yeast strains, and identify several candidate loci whose levels of CN variation may affect the adaptation and performance of wine yeast strains during fermentation.

Entities: Chemical Disease Gene Species

Keywords: alcoholic fermentation; carbohydrate metabolism; domestication; genomics; structural variation

Mesh：

Year: 2017 PMID： 28292787 PMCID： PMC5427499 DOI： 10.1534/g3.117.040105

Source DB: PubMed Journal: G3 (Bethesda) ISSN： 2160-1836 Impact factor: 3.154

Saccharomyces cerevisiae, commonly known as baker’s or brewer’s yeast, has been utilized by humans for the production of fermented beverages since at least 1350 B.C.E. but may go as far back as the Neolithic period 7000 yr ago (Mortimer 2000; Cavalieri ). Phylogenetic analyses and archaeological evidence suggest wine strains originated from Mesopotamia (Bisson 2012), and were domesticated in a single event around the same time as the domestication of grapes (Schacherer ; Sicard and Legras 2011). Further phylogenetic, population structure and identity-by-state analyses of single nucleotide polymorphism (SNP) data reveal close affinity and low genetic diversity among wine yeast strains across the globe, consistent with a domestication-driven population bottleneck (Liti ; Schacherer ; Sicard and Legras 2011; Cromie ; Borneman ). These low levels of genetic diversity have led some to suggest that further wine strain development should be focused on introducing new variation into wine yeasts rather than exploiting their standing variation (Borneman ). Many wine strains have characteristic variants that have presumably been favored in the wine-making environment (Marsit and Dequin 2015). For example, adaptive point mutations, deletions, and rearrangements in the promoter and coding sequence of contribute to flocculation and floating, thereby increasing yeast cells’ ability to obtain oxygen in the hypoxic environment of liquid fermentations (Fidalgo ). Similarly, duplications of are strongly associated with resistance to copper (Warringer ), which, at high concentrations, can cause stuck fermentations, and , a gene involved in thiamine metabolism whose expression is associated with an undesirable rotten-egg sensory perception in wine, is absent or downregulated among wine strains and their derivatives (Bartra ; Brion ). As these examples illustrate, the mutations underlying these, as well as many other, presumably adaptive, traits are not only SNPs, but also genomic structural variants, such as duplications, insertions, inversions, and translocations (Pretorius 2000; Marsit and Dequin 2015). Copy number (CN) variants, a class of structural variants defined as duplicated or deleted loci ranging from 50 bp to whole chromosomes (Zhang ; Arlt ), have received considerable attention due to their widespread occurrence (Sudmant ; Bickhart ; Axelsson ; Pezer ) as well as their influence on gene expression and phenotypic diversity (Freeman ; Henrichsen ). Mechanisms of CN variant evolution include nonallelic homologous recombination (Lupski and Stankiewicz 2005) and retrotransposition (Kaessmann ). CN variants are well studied in various mammals, including humans (Homo sapiens; Sudmant ), cattle (Bos taurus; Bickhart ), the house mouse (Mus musculus; Pezer ), and the domestic dog (Canis lupus familiaris; Axelsson ), where they are important contributors to genetic and phenotypic diversity. Relatively few studies have investigated whole-genome CN profiles in fungi (Hu ; Farrer ; Steenwyk ). For example, the observed CN variation of chromosome 1 in the human pathogen Cryptococcus neoformans results in the duplications of ERG11, a lanosterol-14-α-demethylase and target of the triazole antifungal drug fluconazole (Lupetti ), and AFR1, an ATP binding cassette (ABC) transporter (Sanguinetti ), leading to increased fluconazole resistance (Sionov ). Similarly, resistance to itraconazole, a triazole antifungal drug, is attributed to the duplication of cytochrome P-450-depdendent C-14 lanosterol α-demethylase (pdmA)—a gene whose product is essential for ergosterol biosynthesis—in the human pathogen Aspergillus fumigatus (Osherov ). Finally, in the animal pathogen Batrachochytrium dendrobatidis, the duplication of Supercontig V is associated with increased fitness in the presence of resistance to an antimicrobial peptide, although the underlying genetic elements involved remain elusive (Farrer ). Similarly understudied is the contribution of CN variation to fungal domestication (Gibbons and Rinker 2015; Gallone ). Notable examples of gene duplication being associated with microbial domestication include those of α-amylase in Aspergillus oryzae, which is instrumental in starch saccharification during the production of sake (Hunter ; Gibbons ), and of the and loci in beer-associated strains of S. cerevisiae that metabolize maltose—the most abundant sugar in the beer wort (Gallone ; Gonçalves ). Beer strains of S. cerevisiae often contain additional duplicated genes associated with maltose metabolism, including and , two maltose permeases, and the putative maltose-responsive transcription factor, (Gonçalves ). Adaptive gene duplication in S. cerevisiae has also been detected in experimentally evolved populations (Dunham ; Gresham ; Dunn ). Specifically, duplication of the locus containing the high affinity glucose transporters and has been observed in adaptively evolved asexual strains (Kao and Sherlock 2008), as well as in populations grown in a glucose-limited environment (Brown ; Dunham ; Gresham ). Altogether, these studies suggest that CN variation is a significant contributor to S. cerevisiae evolution and adaptation. To determine the contribution of CN variation to genome evolution in wine strains of S. cerevisiae, we characterized patterns of CN variation across the genomes of 132 wine strains, and determined the functional impact of CN variable genes in environments reflective of wine-making. Our results suggest that there is substantial CN variation among wine yeast strains, including in gene families (such as CUP, FLO, HXT, and MAL) known to be associated with adaptation in the fermentation environment. More generally, it raises the hypothesis that CN variation is an important contributor to adaptation during microbial domestication.

Materials and Methods

Data mining, quality control, and mapping

Raw sequence data for 132 S. cerevisiae wine strains were obtained from three studies (Borneman , 127 strains, Bioproject ID: PRJNA303109; Dunn , two strains, Bioproject ID: SRA049752; Skelly , three strains, Bioproject ID: PRJNA186707) (Supplemental Material, File S1 and Figure S1 in File S2). Altogether, these 132 strains represent a diverse set of commercial and noncommercial isolates from the “wine” yeast clade (Borneman ). Sequence reads were quality-trimmed using Trimmomatic, version 0.36 (Bolger ), with the following parameters and values: leading:10, trailing:10, slidingwindow:4:20, minlen:50. Reads were then mapped to the genome sequence of the S. cerevisiae strain S288c (annotation release: R64.2.1; http://www.yeastgenome.org/) using Bowtie 2, version 1.1.2 (Langmead and Salzberg 2012), with the “sensitive” parameter on. For each sample, mapped reads were converted to the bam format, sorted and merged using SAMtools, version 1.3.1. Sample depth of coverage was obtained using the SAMtools depth function (Li ).

CN variant identification

To facilitate the identification of single nucleotide polymorphisms (SNPs), we first generated mpileup files for each strain using SAMtools, version 1.3.1 (Li ). Using the mpileup files as input to VarScan, version 2.3.9 (Koboldt , 2012), we next identified all statistically significant SNPs (Fisher’s Exact test; P < 0.05) present in the 132 strains that had a read frequency of at least 0.75 and minimum coverage of 8×. This step enabled us to identify 149,782 SNPs. By considering only SNPs that harbored a minor allele frequency of at least 10%, we retained 43,370 SNPs. These SNPs were used to confirm the evolutionary relationships among the strains using Neighbor-Net phylogenetic network analyses in SplitsTree, version 4.14.1 (Huson 1998) as well as the previously reported low levels of SNP diversity (Figure S2 in File S2, and Borneman ). To detect and quantify CN variants we used Control-FREEC, version 9.1 (Boeva , 2012), which we chose because of its low false positive rate, and high true positive rate (Duan ). Importantly, the average depth of coverage or read depth of the 132 strains was 30.1 ± 14.7× (minimum: 13.0×, maximum: 104.5×; Figure S3 in File S2), which is considered sufficient for robust CNV calling (Sims ). Control-FREEC uses LOESS modeling for GC-bias correction and a LASSO-based algorithm for segmentation. Implemented Control-FREEC parameters included window = 250, minExpectedGC = 0.35, maxExpectedGC = 0.55, and telocentromeric = 7000. To identify statistically significant CN variable loci (P < 0.05), we used the Wilcoxon Rank Sum test. The same Control-FREEC parameters, but with a window size of 25 bp, were used to examine CN variation within the intragenic Serine/Threonine-rich sequences of (Lo and Dranginis 1996). Bedtools, version 2.25 (Quinlan and Hall 2010) was used to identify duplicated or deleted genic loci (i.e., CN variable loci) that overlapped with genes by at least one nucleotide. The CN of each gene (genic CN) was then calculated as the average CN of the 250 bp windows that overlapped with the gene’s location coordinates in the genome. The same method was used to determine nongenic CN for loci that did not overlap with genes (i.e., nongenic CN variable loci). To identify statistically significant differences between CN variable loci that were duplicated vs. those that were deleted, we employed the Mann-Whitney U-test (Wilcoxon rank-sum test) with continuity correction (Wallace 2004).

Diversity in CN variation and GO enrichment

To identify CN diverse loci across strains, we used two different measures. The first measure calculates the statistical variance (s2) for each locus where CN variants were identified in one or more strains; s2 values were subsequently log10 normalized. Log10(s2) accounts for diversity in raw CN values, but not for diversity in CN allele frequencies. Thus, we also employed a second measure based on the Polymorphic Index Content (PIC) algorithm, which has previously been used to identify informative microsatellite markers for linkage analyses by taking into account both the number of alleles present and their frequencies (Keith ; Risch 1990). PIC has also been used to quantify population-level diversity of simple sequence repeat loci and restriction fragment length polymorphisms in maize (Smith ). PIC values were calculated for each locus harboring at least one CN variant based on the following formula:where is the squared frequency of a to z CN values (Smith ). PIC values may range from 0 (no CN diversity) to 1 (all CN alleles are unique). Both log10(s2) and PIC are measurements of CN diversity between strains, and do not account for any sequence variation between copies of the same locus. To create a list of loci exhibiting high CN diversity for downstream analyses, we retained only those loci that fell within the 50th percentile of log10(s2) values (min = −2.12, median = −1.02, and max = 2.40), or the 50th percentile of PIC values (min = 0.02, median = 0.14, and max = 0.96). Genes overlapping with loci exhibiting high CN diversity were used for Gene ontology (GO) enrichment analysis with AmiGO2, version 2.4.24 (Carbon ) using the PANTHER Overrepresentation Test (release 20160715) with default settings. This test uses the PANTHER GO database, version 11.0 (Thomas ; release date July 15, 2016), which is directly imported from the GO Ontology database, version 1.2 (Gene Ontology Consortium2004; release date 27 October, 2016), a reference gene list from S. cerevisiae, and a Mann-Whitney U-test (Wilcoxon rank-sum test) with Bonferroni multi-test corrected P-values to identify over- and under-represented GO terms (Mi ). Statistical analyses and figures were created using pheatmap, version 1.0.8 (Kolde 2012), gplots, version 3.0.1, ggplot2 (Wickham 2009), or standard functions in R, version 3.2.2 (R Development Core Team 2011).

Identifying loci absent in the reference strain

To identify loci absent from the reference strain but present in other strains, we assembled unmapped reads from the 20 strains with the lowest percentage of mapped reads. The percentage of mapped reads was determined using SAMtools (Li ); its average across strains was 96% (min = 70.5% and max = 99%; Figure S4 in File S2). Unmapped reads from the 20 strains with the lowest percentage of mapped reads were assembled using SPAdes, version 3.8.1 (Bankevich ). The identity of scaffolds longer than the average length of a S. cerevisiae’s gene (∼1400 bp) was determined using blastx from NCBI’s BLAST, version 2.3.0 (Madden 2013) against a local copy of the GenBank nonredundant protein database (downloaded on January 5, 2017).

Data availability

The authors state that all data necessary for confirming the conclusions presented in the article are represented fully within the article.

Results

Descriptive statistics of CN variation

To examine CN variation across wine yeasts, we generated whole genome CN profiles for 132 strains (File S1 and Figure S5 in File S2). Across all strains, we identified a total of 2820 CNVRs that overlapped with 2061 genes and spanned 3.7 Mb. The size distribution of CNVRs was skewed toward CN variants that were <1 kb in length (Figure 1A, and Figure S6A and Table S1 in File S2). Most CNVRs stemming from duplications were <1 kb in size, whereas most CNVRs stemming from deletions were between 5 and 7 kb (Figure S7 in File S2). These deletions were flanked by Ty transposable elements, which have been shown to frequently induce deletions (Xu and Boeke 1987). Strains had an average of 97.8 ± 9.5 CNVRs (median = 86) (Figure S6B in File S2) that affected an average of 4.3 ± 0.1% of the genome (median = 4.1%) (Figure S6C in File S2).

Figure 1

Size distribution and location of CN variable loci. (A) The fraction of CNVRs (y-axis) for a given size range. Most CNVRs are ≤1000 bp. (B, C) Deleted genic (B) and nongenic (C) CNVRs are more prevalent than duplicated ones (P < 0.01 for both comparisons). Note that 26.23% of genic duplications occurred in multiples of three. (D) Location of CN variable loci across the 16 yeast chromosomes. The small, black squares on either side of each chromosome denote centromere location. Chromosomes are oriented with the start of the chromosome on the bottom and the end on top. Loci (blue bars) and genes (orange bars) harboring high log10(s2) or PIC values are shown. (E) A total of 684 of the 1502 CN diverse loci and 243 of the 363 CN diverse genes reside in subtelomeric regions of the yeast genome; in contrast, very few are found in pericentromeric regions (28 loci and three genes). Due to the known influence of CN variable genes (Henrichsen ; Orozco ), we next quantified the number of genic and nongenic CNVRs (Figure 1, B and C). We found statistically significant differences in the number of duplicated and deleted loci that are genic or nongenic (Mann-Whitney U-test; P < 0.01 for both genic and nongenic comparisons), revealing that there were significantly more deleted genic and nongenic CNVRs than duplicated ones.

CN diversity in subtelomeres

To identify loci that exhibited high CN diversity, we retained only those loci that fell within the 50th percentile of at least one of our two different measures [log10(s2) and PIC] across the 132 strains. These loci covered ∼6% of the yeast genome. The distributions of the two measures (Figure S8 in File S2) were similar, with 1326 loci (Figure S8C in File S2) and 291 genes (Figure S8D in File S2) identified in the top 50% of CN diverse genes by both measures. In addition, the log10(s2) measure identified an additional 85 loci and 54 genes in its set of top 50% genes, and PIC an additional 85 loci and 18 genes. In total, our analyses identified 1502 loci and 363 genes showing high CN diversity. Among the genes harboring the highest log10(s2) and PIC values were [PIC = 0.96; log10(s2) = 2.16], [PIC = 0.96; log10(s2) = 2.16], [PIC = 0.96; log10(s2) = 2.16], [PIC = 0.96; log10(s2) = 2.16], [PIC = 0.96; log10(s2) = 2.16], [PIC = 0.96; log10(s2) = 2.16] and [PIC = 0.93; log10(s2) = 2.40]; these genes are all encoded within the 25S rDNA or 35S rDNA locus. The rDNA locus is known to be highly CN diverse (Gibbons ), thereby demonstrating the utility and efficacy of our CN calling protocol as well as our two measures of CN diversity. We next generated CN diversity maps for all 16 S. cerevisiae chromosomes (Figure 1D; Figure S9 in File S2). CN diversity was higher in loci and genes located in subtelomeres (defined as the 25 kb of DNA immediately adjacent to the chromosome ends; Barton ). Specifically, 684/1502 (45.5%) of CN diverse loci and 243/363 (66.9%) CN diverse genes were located in the subtelomeric regions. Conducting the same analysis using an alternative definition of subtelomere (defined as the DNA between the chromosome’s end to the first essential gene; Winzeler ) showed similar results. Specifically, 721/1502 (48%) of CN diverse loci and 233/363 (64.2%) of CN diverse genes were located in the subtelomeric regions.

GO enrichment of CN diverse genes

To determine the functional categories over- and under-represented in the 363 genes showing high CN diversity, we performed GO enrichment analysis. The majority of enriched GO terms were associated with metabolic functions such as α-glucosidase activity (P < 0.01) and carbohydrate transporter activity (P < 0.01) (Figure 2 and File S3 in File S1).

Figure 2

GO enriched terms from high CN diverse genes. Molecular function (black), biological process (gray), and cellular component (white) GO categories are represented by circles, and are enriched among the 363 genes that overlap with CN diverse loci. Enriched terms are primarily related to metabolic function, such as α-glucosidase activity (P < 0.01), carbohydrate transporter activity (P < 0.01) and flocculation (P < 0.01). Genes associated with these GO terms include (, involved in hydrolyzing sucrose), all six members from the MAL gene family (involved in the fermentation of maltose and other carbohydrates), and all five members of the IMA gene family (involved in isomaltose, sucrose, and turanose metabolism). Other enriched categories were associated with multi-cellular processes such as the flocculation (P < 0.01) and aggregation of unicellular organisms (P = 0.03). All members of the FLO gene family (involved in flocculation) and (a flocculin-like gene) were associated with these GO enriched terms. Contrary to over-represented GO terms, under-represented terms were associated with genes whose protein products are part of the interactome or protein–protein interactions such as protein complex (P < 0.01), macromolecular complex assembly (P = 0.03), transferase complex (P < 0.01), and ribonucleoprotein complex biogenesis (P = 0.04). Our finding of under-represented GO terms being associated with multi-unit protein complexes supports the gene balance hypothesis, which states that the stoichiometry of genes contributing to multi-subunit complexes must be maintained to conserve kinetics and assembly properties (Birchler and Veitia 2010, 2012). Thus, genes associated with multi-unit protein complexes are unlikely to exhibit CN variation.

Genic CN diversity

To further understand the structure of CN variation in highly diverse CN genes, we first calculated the absolute CN of 23 genes associated with GO enriched terms related to wine fermentation processes (e.g., metabolic functions; Figure 2 and File S3 in File S1) as well as 57 genes with the highest PIC or log10(s2) values (File S4 in File S1 and Figure S10 in File S2; 69 total unique genes). Among these 69 genes, gene CN ranged from 0 to 92; both the highest CN diversity and absolute CN values were observed in segments of the rDNA locus (mentioned above). Importantly, 35 of the 69 genes have also been reported to have functional roles in fermentation-related processes. The remaining 34 genes with high scores for CN diversity, to our knowledge, have no known association with wine fermentation (e.g., , , and among others). All genes with high CN diversity scores are listed in File S4 in File S1 and summarized in Figure S10 in File S2. For the purposes of this manuscript, we chose to focus on genes related to fermentation. For example, the CNs of (), a gene active during alcoholic fermentation, and its gene neighbor (), an alcohol dehydrogenase, both varied between zero and three. Similarly, the absolute CN of the locus containing both (; PIC = 0.868) and its paralog (; PIC = 0.879) ranged from 0 and 14 (Figure 3 and File S4 in File S1), with 90 strains (68.2%) showing duplications (i.e., a CN > 1), and another 11 strains (8.3%) a deletion (i.e., a CN of 0). Interestingly, multiple copies of confer copper resistance to wine strains of S. cerevisiae, with CN variation at this locus thought to be associated with domestication (Warringer ; Marsit and Dequin 2015).

Figure 3

CN variation of genes and gene families. Heat map of the CN profiles the CUP, THI, SNO, MAL, IMA, and HXT gene families; rows correspond to genes and columns to strains. Blue-colored cells correspond to deletions, black-colored cells to no CN variation and red-to-purple-colored cells to duplications (ranging from 2 to 14). Dots on the right side of the figure represent the proportion of individual strains that harbor CN variation in that gene—the larger the dot, the greater the proportion of the strains that is CN variable for that gene. The expression of SNO family members is induced just prior to, or after, the diauxic shift as a response to nutrient limitation, and is associated with vitamin B acquisition (Padilla ; Rodríguez-Navarro ). We found that () and () were among the 363 genes with highest CN diversity. was duplicated in 14 strains (10.6%) and deleted in nine strains (6.8%), while was deleted in 117 strains (88.6%) (Figure 3). The other two members of the SNO gene family, () and (), both showed a CN of 1 in all strains. Another gene family whose members show high CN diversity is the THI gene family, which is responsible for thiamine metabolism, and is activated at the end of the growth phase during fermentation (Brion ). Specifically, (; PIC = 0.759) was among the 57 genes with the highest CN diversity (File S4 in File S1), and () and () among the 363 most CN diverse genes (File S3 in File S1). was duplicated in 82 strains (62.1%) and deleted in two strains (1.5%) (Figure 3). In contrast, was deleted in 121 strains (91.67%), whereas was deleted in 23 strains (17.42%), and duplicated in only three strains (2.27%). Lastly, the CN of the last THI gene family member, (), did not exhibit CN variation. In addition to the high CN diversity observed in all six members of the and loci responsible for maltose metabolism and growth on sucrose (Stambuk ; Gallone ), (; PIC = 0.53) was among the 57 genes with the highest CN diversity (File S4 in File S1). Evaluation of the absolute CN of all locus genes (Figure 3) showed that , (), and were deleted in 65 (49.2%), 86 (65.2%), and 61 strains (46.2%), respectively. In contrast, the locus genes (), (), and () were duplicated in 100 (75.8%), 99 (75%), and 98 strains (74.2%), respectively. Interestingly, we did not observe any deletions in any of the locus genes across the 132 strains. When considering all members of the MAL gene family, we found that the 132 strains differed widely in their degree to which the locus had undergone expansion or contraction (Figure S11 in File S2). All members of the IMA gene family, composed of genes aiding in sugar fermentation (Teste ), were among the 363 genes with high CN diversity (File S3 in File S1) and (; PIC = 0.87) was among the top 57 genes with the highest CN diversity (File S4 in File S1). was deleted in 54 strains (40.9%) and duplicated in 50 strains (37.9%) (Figure 3). Although many duplications or deletions did not span the entirety of , there were four strains that harbored high CNs between four and six. These same four strains also had similar and unique duplications of and , suggesting that , , and , which are adjacent to each other in the genome, may have been duplicated as one locus. The other isomaltases (IMA2-5; , , , and ) were deleted in at least 11 strains (8.3%) and at most 55 strains (41.7%). No duplications in IMA2-5 were detected, and only rarely in (five strains, 3.8%). Altogether, the 132 strains exhibited both expansions and contractions of the IMA gene family (Figure S11 in File S2). We identified seven members of the HXT gene family (/, /, , , , /, and /), which is involved in sugar transport, that were among the 363 CN diverse genes (File S3 in File S1). Members of the HXT gene family were duplicated, deleted, or had mosaic absolute CN values across the 132 strains. For example, and were primarily duplicated in 25 (18.9%) and 22 strains (16.7%), respectively, while only three strains (2.3%) had deletions in either gene (Figure 3). , , and were deleted in 32 (24.2%), 57 (43.2%), and 53 strains (40.2%), respectively, while no strains had duplications. Finally, was duplicated in 12 strains (9.1%) and deleted in 17 strains (12.9%), and was duplicated in 37 strains (28%) and deleted in nine strains (6.8%). As expansions in the HXT gene family are positively correlated with aerobic fermentation in S. paradoxus and S. cerevisiae (Lin and Li 2011), we also examined the absolute CN of all other 10 members (, , , , , , , , , and ) of the HXT gene family (Figure 3). Interestingly, all remaining 10 members of the HXT gene family were not CN variable. Altogether, examination of the HXT family CN diversity patterns across the 132 strains suggests that wine yeast strains typically exhibit minor contractions (i.e., HXT gene deletions exceed those of duplications) relative to the S288c reference strain (Figure S11 in File S2). All five members of the FLO gene family, which is responsible for flocculation (Govender ), a trait shown to aid in the escape of oxygen limited environments during liquid fermentation (Fidalgo ; Govender ), were found to be among the 363 most CN diverse genes. Furthermore, (; PIC = 0.82) and (; PIC = 0.88) were among the 57 genes with the highest CN diversity (File S4 in File S1). Due to the importance of site-directed CN variation in FLO family genes (Fidalgo ), we modified our representation of CN variation to display intragenic CN variation using a 250 bp window (Figure S12 in File S2). was partially duplicated in 57 strains (43.2%), partially deleted in 47 strains (35.6%), and 115 strains (87.1%) had at least one region of the gene unaffected by CN variation. Duplications and deletions were primarily observed in the Threonine-rich region or Serine/Threonine-rich region located in the center or end of the gene, respectively. To better resolve intragenic CN variation of , whose repeat unit is shorter than that of , we recalled CN variants with a smaller window size of 25 bp and re-evaluated CN variation (Figure S13 in File S2). Using this window size, we found extensive duplications in 97 strains (73.5%) between gene coordinates 250–350 bp. Furthermore, duplications were observed in the hydrophobic Serine/Threonine-rich regions (Figure S13 in File S2), which are associated with the flocculation phenotype (Fidalgo ; Ramsook ). In contrast to and , other members of the FLO gene family did not exhibit intragenic CN variation. For example, CN variation in () and () typically spanned most or all of the sequence of each gene. Specifically, 125 strains (99.2%) had deletions spanning ≥80% of the gene in , and only two strains (1.5%) had the entirety of the gene intact. had deletions in 99 strains (75%) that spanned ≥75% of the gene, 11 strains (8.3%) that had a partial deletion spanning <75% of the gene, whereas one strain (0.8%) had a CN of two, and the remaining 21 strains (15.9%) had a CN of one. In contrast, () showed limited CN variation. Specifically, 108 strains (81.8%) had no CN variation while six strains (4.5%) had deletions spanning the entirety of the gene. No duplications spanned the entirety of the gene but partial duplications were observed in 17 strains (12.9%) and were located in, or just before, the Serine/Threonine-rich region.

Functional implications CN variable genes

To determine the functional impact of deleted CN variable genes, we examined the relative growth of deleted CN variable genes (denoted with the Δ symbol) relative to the wild-type (WT) S. cerevisiae strain S288c across 418 conditions using the Hillenmeyer data (File S5 in File S1 and Figure S14 in File S2). To determine the impact of duplicated genes, we examined growth fitness of the WT strain with low (∼2–3 gene copies) or high (∼8–24 gene copies) plasmid CN, where each plasmid contained a single gene of interest from previously published data, relative to WT (File S6 in File S1, Figure S15 in File S2, and Payen ). Among deleted genes, 42/69 genes for which data exist showed negative and positive fitness effects in at least one tested condition in the S288c genetic background. Furthermore, we found that 12/42 genes that are commonly deleted among wine strains typically resulted in a fitness gain in conditions that resembled the fermentation environment. These conditions include growth at 23° and at 25°, temperatures within the 15–28° range in which wine is fermented (Molina ), and growth in minimal media, which is commonly used to understand fermentation-related processes (Seki ; Govender ; Vilela-Moura ). When examining fitness effects when grown at 23° or at 25° for five or 15 generations for the 12 commonly deleted genes, we observed at least one deletion that resulted in a fitness gain or loss for each condition. However, we observed extensive deletions in the locus (Figure 3), and therefore prioritized reporting the fitness impact of deletions in , , and . Δ resulted in a fitness gain for growth at 23° and 25° for 5 (0.45× and 0.27×, respectively) and 15 generations (0.20× and 0.52×, respectively). Δ resulted in a gain of fitness at only 25° after 15 generations (0.46×) and in a loss of fitness ranging from −0.36× to −1.29× in the other temperature conditions. Similarly, Δ resulted in fitness gains and losses dependent on the number of generations. For example, when grown for 15 generations at 25° a fitness gain of 0.50× was observed, while a fitness loss of −0.82× was observed at 23°. We next determined the fitness effect of deleted genes in minimal media after 0, 5, and 10 generations. Similar patterns of complex fitness gain and loss were observed as for the other conditions. For example, Δ resulted in a loss of fitness of −4.13× and −1.97× after 0 and 5 generations, but a fitness gain of 0.63× after 10 generations. In contrast, other genes resulted in positive fitness effects. For example, Δ resulted in a fitness gain of 7.25× and 10.41× for 0 generations and 10 generations. Among duplicated genes, we focused on growth in glucose- and phosphate-limited conditions because glucose becomes scarce toward the end of fermentation prior to the diauxic shift and phosphate limitation is thought to contribute to stuck fermentations (Bisson 1999; Marsit and Dequin 2015). Among the 35 of the 69 genes where data were available, 14 genes had duplications among the 132 strains. When examining the fitness effects of duplicated genes in a glucose-limited environment in the S288c background, we found that fitness effects were small in magnitude, and dependent on condition and plasmid CN (File S6 in File S1). For example, low CN increased growth fitness by 0.02× but decreased fitness by −0.01× at a high CN (Figure S15 in File S2). Interestingly, the most prevalent CN for across the 132 strains was two (96 strains, 72.7%), with only three strains showing a CN of three and none a higher CN. Another gene found at low CN in 37 strains (28%) was . Low plasmid CN in a glucose-limited conditions resulted in a fitness gain of 0.06×. In contrast, low or high plasmid CN resulted in a negative growth fitness of −0.02× and −0.01×, respectively. Interestingly, duplication is observed in only four strains (3%), and deletions are observed in 61 strains (46.2%). Similar to the glucose-limited condition, we found fitness was dependent on high or low plasmid CN in the phosphate-limited condition. For example, , a gene present at low CN in 100 strains, had a fitness gain of 0.04× at high plasmid CN, but low plasmid CN resulted in a fitness loss of −0.02×. In contrast, , which was present at low CN in 99 strains, had a small fitness gain of 0.002× at low plasmid CN, and a fitness loss at a high plasmid CN of −0.02×. A total of six genes resulted in a disadvantageous growth effect when present at low CN, such as , which resulted in a fitness loss of −0.04×. Altogether, our results suggest that the deleted and duplicated CN variable genes we observe (Figure 4) modulate cellular processes that result in advantageous fitness effects in conditions that resemble the fermentation environment.

Figure 4

Model summary of CN variable genes in wine yeast strains and their cellular functions. Genes that are deleted among wine strains are blue, whereas those that are duplicated are in red. Genes that were observed to be both duplicated and deleted (IMA1, IMA3, HXT13, and CUP1-1) are purple. Disaccharides are in thick-lined boxes, monosaccharides in thin-lined boxes, and alcohols are unboxed.

Identifying loci absent from CN variation analysis

The present study was able to capture loci represented in the WT/S288c laboratory strain. To identify loci absent from the reference strain, we assembled unmapped reads for 20 strains with the lowest percentage of reads mapped and determined their identity (see Materials and Methods; Figure S4 in File S2). Across the 20 strains, we identified 429 loci absent from S288c but present in other sequenced S. cerevisiae strains. These loci had an average length of 6.9 kb and an average coverage of 107.2×. The 20 loci with the highest bitscore alongside with the number of strains containing the locus are shown in Table S2 in File S2. All but two of these loci were present only in one of the 20 strains we examined. The two exceptions were the EC1118_1N26_0012p locus, which we found in 8/20 strains, which originates from horizontal gene transfer from Zygosaccharomyces rouxii to the commercial EC1118 wine strain of S. cerevisiae (Novo ); and the EC1118_1O4_6656p locus, which we found in 7/20 strains. This locus was also originally found in the EC1118 strain (Novo ), and contains a gene similar to a conserved hypothetical protein found in S. cerevisiae strain AWRI1631 (Borneman ).

Discussion

CN variant loci are known to contribute to genomic and phenotypic diversity (Perry ; Cutler and Kassner 2008; Orozco ). However, the extent of CN variation in wine strains of S. cerevisiae, and its impact on phenotypic variation remains less understood. Our examination of structural variation in 132 yeast strains representative of the “wine clade” showed that CN variants are a significant contributor to the genomic diversity of wine strains of S. cerevisiae. Importantly, CN variant loci overlap with diverse genes and gene families functionally related to the fermentation environment such as CUP, FLO, THI, MAL, IMA, and HXT (summarized in Figure 4). The characteristics of CN variation in wine yeast (Figure 1A, and Figure S6 and Table S1 in File S2) were found to be similar to those of the recently described beer yeast lineage (Gallone ). For example, both lineages exhibited a similar size range of CNVRs (Figure 1A, and Figure S6 and Table S1 in File S2), as well as a higher prevalence of CNVRs in the subtelomeric regions (Figure 1D). However, wine strains had a smaller fraction of their genome affected by CN variation (Figure S6 in File S2) than beer strains (Gallone ). Wine yeast strains are thought to be partially domesticated due to the seasonal nature of wine-making, which allows for outcrossing with wild populations (Marsit and Dequin 2015; Gallone ; Gonçalves ). One human-driven signature of domestication is thought to be the duplication of the locus, because multiple copies confer copper resistance and copper sulfates have been used to combat powdery mildews in vineyards since the early 1800s (Warringer ; Marsit and Dequin 2015). Consistent with this “partial domestication” view (Marsit and Dequin 2015; Gallone ; Gonçalves ), many wine strains were not CN variable for and , or had one or both genes deleted (Figure 3). An alternative, albeit not necessarily conflicting, hypothesis is that wine yeasts underwent domestication for specific but diverse wine flavor profiles (Hyma ). Consistent with this view is the deletion (in >90% of the strains) of the gene (Figure 3), whose activity is known to produce an undesirable rotten-egg sensory perception via higher SH2 production and is associated with sluggish fermentations (Bartra ). In contrast to wine strains, duplications of have been observed across the Saccharomyces genus, including in several strains of S. cerevisiae (CBS1171, two copies; S288c, four copies; EM93, five copies), S. paradoxus (five copies), and the lager brewing yeast hybrid S. pastorianus (syn. S. carlsbergensis; 2+ copies) (Wightman and Meacock 2003). In contrast, , which is duplicated in 62.1% of strains, shows an increase in its expression of 6- to 100-fold in S. cerevisiae when grown on medium containing low concentrations of thiamine, allowing for the compensation of low thiamine levels (Li ). Low levels of thiamine in wine fermentation have been associated with stuck or slow fermentations (Ough ; Bataillon ). Similar to deletions, duplications may have also been driven by human activity due to the advantageous effect of increased expression within the fermentation environment. Two other gene families subject to CN variation were the MAL and HXT gene families. The S288c strain that we used as a reference contained two MAL loci ( and ), each containing three genes—a maltose permease (MALx1), a maltase (MALx2), and an MAL trans-activator (MALx3)—and located near the ends of different chromosomes (Michels ). has been observed to be duplicated in beer strains of S. cerevisiae (Gallone ; Gonçalves ), while wine strains primarily lack this locus (Figure 3; Gonçalves ). In contrast to the deletion of the locus, duplication in wine yeasts (Figure 3; Gonçalves ) is surprising because maltose is absent from the grape must (Gallone ). However, knockout studies have demonstrated is necessary for growth on turanose, maltotriose, and sucrose (Brown ), which are present in small quantities in wines (Victoria and Carmen 2013). Due to the prominent duplication of , in particular the enzymatic genes and , we speculate that the locus may be utilized to obtain sugars less prevalent in the wine environment, or serve other purposes. The HXT gene family in the S288c strain that we used as a reference contains 16 HXT paralogs, , , and . The expansion of the HXT gene family is positively correlated with aerobic fermentation in S. paradoxus and S. cerevisiae (Lin and Li 2011). and are high-affinity glucose transporters expressed at low glucose levels, and repressed at high glucose levels (Reifenberger ). In contrast to the recently described Asia (Sake), Britain (Beer) and Mosaic lineages (Gallone ), we detected duplications in the and genes in wine yeasts (Figure 3). This may confer an advantage toward the end of fermentation and before the diauxic shift when glucose becomes a scarce resource. Evidence potentially supporting this hypothesis is that and are upregulated by 9.8- and 5.6-fold, respectively, through wine fermentation in the S. cerevisiae strain Vin13 (Marks ). Furthermore, or is found to be duplicated in experimentally evolved populations in glucose-limited environments (Dunham ; Gresham ; Dunn ). In summary, these results, together with recent studies of CN variation in beer yeast strains (Gallone ; Gonçalves ), suggest that this type of variation significantly contributes to the genomic diversity of domesticated yeast strains. Furthermore, as most studies of CN variation, including ours, use reference strains, they are likely conservative in estimating the amount of CN variation present in populations. This caveat notwithstanding, examination of publically available data regarding the functional impact of duplicated or deleted genes (again in the context provided by the reference strain’s genetic background) suggests that CN variation in several, but not all, of the wine yeast genes confer fitness advantages in conditions that resemble the fermentation environment. Our results raise the questions of the extent to which CN variation contributes to fungal, and more generally microbial, domestication as well as whether the importance of CN variants in natural yeast populations, including those of other Saccharomyces yeasts, is on par to their importance in domestication environments.

Supplementary Material

Supplemental material is available online at www.g3journal.org/lookup/suppl/doi:10.1534/g3.117.040105/-/DC1. Click here for additional data file. Click here for additional data file.

92 in total

1. A function for subtelomeric DNA in Saccharomyces cerevisiae.

Authors: Arnold B Barton; Yuping Su; Jacque Lamb; Dianna Barber; David B Kaback
Journal: Genetics Date: 2003-10 Impact factor: 4.562

2. Rapid expansion and functional divergence of subtelomeric gene families in yeasts.

Authors: Chris A Brown; Andrew W Murray; Kevin J Verstrepen
Journal: Curr Biol Date: 2010-05-13 Impact factor: 10.834

3. Fast gapped-read alignment with Bowtie 2.

Authors: Ben Langmead; Steven L Salzberg
Journal: Nat Methods Date: 2012-03-04 Impact factor: 28.547

4. Resistance to itraconazole in Aspergillus nidulans and Aspergillus fumigatus is conferred by extra copies of the A. nidulans P-450 14alpha-demethylase gene, pdmA.

Authors: N Osherov; D P Kontoyiannis; A Romans; G S May
Journal: J Antimicrob Chemother Date: 2001-07 Impact factor: 5.790

Review 5. Bread, beer and wine: yeast domestication in the Saccharomyces sensu stricto complex.

Authors: Delphine Sicard; Jean-Luc Legras
Journal: C R Biol Date: 2011-02-01 Impact factor: 1.583

6. Role of AFR1, an ABC transporter-encoding gene, in the in vivo response to fluconazole and virulence of Cryptococcus neoformans.

Authors: Maurizio Sanguinetti; Brunella Posteraro; Marilena La Sorda; Riccardo Torelli; Barbara Fiori; Rosaria Santangelo; Giovanni Delogu; Giovanni Fadda
Journal: Infect Immun Date: 2006-02 Impact factor: 3.441

7. Copy number variation influences gene expression and metabolic traits in mice.

Authors: Luz D Orozco; Shawn J Cokus; Anatole Ghazalpour; Leslie Ingram-Drake; Susanna Wang; Atila van Nas; Nam Che; Jesus A Araujo; Matteo Pellegrini; Aldons J Lusis
Journal: Hum Mol Genet Date: 2009-07-31 Impact factor: 6.150

8. Control-FREEC: a tool for assessing copy number and allelic content using next-generation sequencing data.

Authors: Valentina Boeva; Tatiana Popova; Kevin Bleakley; Pierre Chiche; Julie Cappo; Gudrun Schleiermacher; Isabelle Janoueix-Lerosey; Olivier Delattre; Emmanuel Barillot
Journal: Bioinformatics Date: 2011-12-06 Impact factor: 6.937

9. Analysis of the Saccharomyces cerevisiae pan-genome reveals a pool of copy number variants distributed in diverse yeast strains from differing industrial environments.

Authors: Barbara Dunn; Chandra Richter; Daniel J Kvitek; Tom Pugh; Gavin Sherlock
Journal: Genome Res Date: 2012-02-27 Impact factor: 9.043

10. Domestication and Divergence of Saccharomyces cerevisiae Beer Yeasts.

Authors: Brigida Gallone; Jan Steensels; Troels Prahl; Leah Soriaga; Veerle Saels; Beatriz Herrera-Malaver; Adriaan Merlevede; Miguel Roncoroni; Karin Voordeckers; Loren Miraglia; Clotilde Teiling; Brian Steffy; Maryann Taylor; Ariel Schwartz; Toby Richardson; Christopher White; Guy Baele; Steven Maere; Kevin J Verstrepen
Journal: Cell Date: 2016-09-08 Impact factor: 41.582

18 in total

1. Adaptation by Loss of Heterozygosity in Saccharomyces cerevisiae Clones Under Divergent Selection.

Authors: Timothy Y James; Lucas A Michelotti; Alexander D Glasco; Rebecca A Clemons; Robert A Powers; Ellen S James; D Rabern Simmons; Fengyan Bai; Shuhua Ge
Journal: Genetics Date: 2019-08-01 Impact factor: 4.562

2. Variation Among Biosynthetic Gene Clusters, Secondary Metabolite Profiles, and Cards of Virulence Across Aspergillus Species.

Authors: Matthew E Mead; Sonja L Knowles; Jacob L Steenwyk; Huzefa A Raja; Christopher D Roberts; Oliver Bader; Jos Houbraken; Gustavo H Goldman; Nicholas H Oberlies; Antonis Rokas
Journal: Genetics Date: 2020-08-17 Impact factor: 4.562

3. An evolutionary genomic approach reveals both conserved and species-specific genetic elements related to human disease in closely related Aspergillus fungi.

Authors: Matthew E Mead; Jacob L Steenwyk; Lilian P Silva; Patrícia A de Castro; Nauman Saeed; Falk Hillmann; Gustavo H Goldman; Antonis Rokas
Journal: Genetics Date: 2021-06-24 Impact factor: 4.562

4. Environmental change drives accelerated adaptation through stimulated copy number variation.

Authors: Ryan M Hull; Cristina Cruz; Carmen V Jack; Jonathan Houseley
Journal: PLoS Biol Date: 2017-06-27 Impact factor: 8.029

Review 5. Copy Number Variation in Fungi and Its Implications for Wine Yeast Genetic Diversity and Adaptation.

Authors: Jacob L Steenwyk; Antonis Rokas
Journal: Front Microbiol Date: 2018-02-22 Impact factor: 5.640

6. Expansion of a Telomeric FLO/ALS-Like Sequence Gene Family in Saccharomycopsis fermentans.

Authors: Beatrice Bernardi; Yeseren Kayacan; Jürgen Wendland
Journal: Front Genet Date: 2018-11-13 Impact factor: 4.599

7. Extensive loss of cell-cycle and DNA repair genes in an ancient lineage of bipolar budding yeasts.

Authors: Jacob L Steenwyk; Dana A Opulente; Jacek Kominek; Xing-Xing Shen; Xiaofan Zhou; Abigail L Labella; Noah P Bradley; Brandt F Eichman; Neža Čadež; Diego Libkind; Jeremy DeVirgilio; Amanda Beth Hulfachor; Cletus P Kurtzman; Chris Todd Hittinger; Antonis Rokas
Journal: PLoS Biol Date: 2019-05-21 Impact factor: 8.029

Review 8. Into the wild: new yeast genomes from natural environments and new tools for their analysis.

Authors: D Libkind; D Peris; F A Cubillos; J L Steenwyk; D A Opulente; Q K Langdon; A Rokas; C T Hittinger
Journal: FEMS Yeast Res Date: 2020-03-01 Impact factor: 2.796

9. A population genomic characterization of copy number variation in the opportunistic fungal pathogen Aspergillus fumigatus.

Authors: Shu Zhao; John G Gibbons
Journal: PLoS One Date: 2018-08-02 Impact factor: 3.240

10. Traditional Norwegian Kveik Are a Genetically Distinct Group of Domesticated Saccharomyces cerevisiae Brewing Yeasts.

Authors: Richard Preiss; Caroline Tyrawa; Kristoffer Krogerus; Lars Marius Garshol; George van der Merwe
Journal: Front Microbiol Date: 2018-09-12 Impact factor: 5.640