
Complete genome sequence and analysis of the industrial Saccharomyces cerevisiae strain N85 used in Chinese rice wine production.

Weiping Zhang¹, Yudong Li^1,2, Yiwang Chen², Sha Xu¹, Guocheng Du¹, Huidong Shi³, Jingwen Zhou¹, Jian Chen¹.

Abstract

Chinese rice wine is a popular traditional alcoholic beverage in China, but its brewing processes have rarely been explored. We herein report the first gapless, near-finished genome sequence of the yeast strain Saccharomyces cerevisiae N85 for Chinese rice wine production. Several assembly methods were used to integrate Pacific Biosciences (PacBio) and Illumina sequencing data to achieve high-quality genome sequencing of the strain. The genome encodes more than 6,000 predicted proteins and 238 long non-coding RNAs, which were validated by RNA-sequencing data. Moreover, our annotation predicts 171 novel genes that are not present in the reference S288c genome. We also identified 65,902 single nucleotide polymorphisms and small indels, many of which are located within genic regions. Dozens of larger copy-number variations and translocations were detected, mainly enriched in the subtelomeres, suggesting these regions may be related to genomic evolution. This study will serve as a milestone in the study of Chinese rice wine and related beverages in China and in other countries. It will help to develop more scientific and modern fermentation processes for Chinese rice wine, and to explore the metabolic pathways of desired and harmful components in Chinese rice wine to improve its taste and nutritional value.


Keywords: annotation; genome sequence; rice wine yeast; transcriptomics

Year: 2018 PMID: 29415277 PMCID: PMC6014378 DOI: 10.1093/dnares/dsy002

Source DB: PubMed Journal: DNA Res ISSN: 1340-2838 Impact factor: 4.458

1. Introduction

As early as 5,000 years ago, Chinese rice wine (Huangjiu) was being consumed as a fermented alcoholic beverage. Chinese rice wine is made from sticky rice, which distinguishes it from beverages made from malt or fruit. The steamed sticky rice is saccharified by raw wheat koji (a variety of Aspergillus) and fermented by a yeast strain to generate alcohol. Nowadays, Chinese rice wine is still hugely popular in China due to its pleasant taste and its high nutritional and pharmacological value. However, some harmful byproducts are generated during the fermentation and storage of Chinese rice wine, such as ethyl carbamate (EC), a genotoxic carcinogen of widespread occurrence in fermented food and beverages, with particularly high concentrations in stone fruit spirits. In Chinese rice wine, EC is mainly generated from the reaction of ethanol and urea, and its concentration can reach 160 µg/kg. As increasing attention is paid to food safety, some researchers have focused on the mechanism of EC formation in Chinese rice wine, and attempts have been made to reduce its concentration. Over the past few decades, researchers have identified proteins responsible for the uptake and catabolism of poorly utilized nitrogen sources such as proline and urea, and found that their genes are repressed in Saccharomyces cerevisiae cells cultured in a nitrogen-rich environment. Conversely, in nitrogen-deficient conditions, their transcription is derepressed. This regulatory phenomenon has been termed nitrogen catabolite repression (NCR). During wine fermentation, NCR plays an important role in the accumulation of urea, the major precursor of EC, by repressing the transcription of urea catabolism-related genes. NCR is reportedly controlled by complicated regulatory systems, such as global GATA family regulators, the TOR pathway, and the Ssy1p-Ptr3p-Ssy5p sensing pathway. However, the exact regulatory mechanism of NCR is still unclear.
Above all, a complete genome sequence is fundamental for understanding the genetic and regulatory systems of Chinese rice wine strains. The draft genome sequence of the S. cerevisiae Chinese rice wine strain YHJ7, which is closely related to the industrial N85 strain, has been published (Supplementary Fig. S1). However, there are still many gaps in the YHJ7 assembly, and detailed annotation of the genome is not available. In the mid-2000s, the emergence of next-generation sequencing (NGS) platforms dramatically reduced the run time and cost (around US $1000 for the human genome) of genome sequencing, and increased the throughput to hundreds of Gbp per run compared with first-generation sequencing platforms (Sanger sequencers). However, high/low G + C regions, tandem repeat regions, and interspersed repeat regions remain difficult to sequence using NGS platforms. Furthermore, the limited read length from NGS platforms can limit the completeness of sequencing and the accuracy of analysis, resulting in assemblies that are incomplete and fragmented into several thousand contigs. In 2011, PacBio RS II was developed as the first commercially available third-generation sequencer, and it was marketed to address this issue. The system uses single molecule real-time (SMRT) technology, which enables the generation of long reads (half of reads >20 kb, maximum read length >60 kb) and reduces the degree of bias (even coverage across regions of differing G + C content). Because long reads can span complex regions such as repeats, PacBio long reads have the potential to provide accurate and improved genome assemblies. In this study, a gapless, near-finished genome sequence of the S. cerevisiae Chinese rice wine N85 strain was achieved by combining second- and third-generation sequencing technologies. Several complementary approaches were used to assemble and annotate the N85 genome, and over 6,000 proteins and 238 long non-coding RNAs (lncRNAs) were identified.
Moreover, many genomic variants were identified that could be responsible for genomic evolution. The first complete genome sequence of a Chinese rice wine brewing yeast, strain N85, is a milestone in the study of Chinese rice wine and the brewing of other related beverages. It will help to understand the evolutionary history of beverage brewing strains and to move the complicated brewing processes from a traditional, experience-based approach to a modern, scientific one. In addition, further detailed analysis of the genome of strain N85 will provide more genetic information for improving the general quality of Chinese rice wine.

2. Materials and methods

2.1. Yeast strain and growth conditions

Saccharomyces cerevisiae strain N85 (MATa/) is the strain generally used for Chinese rice wine production and was provided by Guyuelongshan wine company (Shaoxing, China), one of the biggest and oldest Chinese rice wine producers. The yeast strain was pre-cultured on YPD plates (10 g/l yeast extract, 20 g/l peptone, 20 g/l glucose, 20 g/l agar) at 30°C for 24 h. A single colony was activated in YPD media (30°C, 200 rpm) and then grown in sole nitrogen medium (1.7 g/l yeast nitrogen base, 20 g/l glucose, 10 mM each of glutamine, arginine, and urea). All cultivations were performed in shake flasks (200 rpm) at 30°C, and growth was monitored by determining the optical density at 600 nm (OD600).

2.2. DNA and RNA sequencing

Following a 1 h incubation under the conditions described above, genomic DNA and RNA were isolated as described previously. DNA/RNA quantity was determined using a Nanodrop 1000 (Thermo Scientific, MA, USA), and integrity was determined with an Agilent 2100 Bioanalyzer (Palo Alto, CA). For NGS of genomic DNA, the libraries were constructed using the Nextera DNA sample preparation kit (Illumina, CA, USA). Next, libraries were sequenced using the MiSeq reagent kit v3 (Illumina, CA, USA) on the MiSeq platform, or the TruSeq SBS kit v3-HS (Illumina, CA, USA) on the HiSeq 2000 platform. For third-generation DNA sequencing, the DNA sample was sequenced using the PacBio RS II technology with a C2 chemistry sequencing kit (Pacific Biosciences, Menlo Park, CA). For RNA sequencing (RNA-Seq), all RNA samples were prepared as biological duplicates and subjected to removal of rRNA or poly-A filtering before cDNA library generation. From these libraries, 100 bp paired-end or strand-specific reads were produced using an Illumina HiSeq 2000. Library preparation and sequencing of DNA and RNA on the Illumina platform were performed at BioMarker Inc. (Beijing, China), and the PacBio sequencing was performed at Shanghai Bohao Inc. (Shanghai, China).

2.3. Genome assembly

The genome of strain N85 was assembled using two approaches, one using a classical reference assembly, and the other using a hybrid de novo assembly (Supplementary Fig. S2). The hybrid strategy for genome assembly was carried out in three consecutive steps: (i) de novo generation of contigs; (ii) ordering of the de novo contigs and their concatenation into supercontigs; (iii) closing of the remaining gaps using a de novo iterative approach. For the reference-guided genome assembly, filtered short reads were aligned directly to the publicly available S. cerevisiae S288c (version Scer3) and YHJ7 genome sequences using the CLC Genomics Workbench version v8.0 (https://www.qiagenbioinformatics.com/products/clc-genomics-workbench/ (19 January 2018, date last accessed)). For the hybrid de novo assembly, de novo genome assembly was first performed using one or both of the PacBio and Illumina datasets by several methods, including the CLC Genomics Workbench version v8.0, A5-miseq v2, SSPACE-LongRead v1.1, SPAdes v3.5.0, PBcR, and hierarchical genome assembly process (HGAP) v2.0. Second, the order and orientation of contigs from the de novo assembly were determined by aligning contigs to the reference-guided assembly, requiring a minimum identity of 90% between the de novo contigs and the reference genome, using CLC Genomics Workbench version v8.0 software. In this manner, most de novo contigs could be concatenated into supercontigs by overlapping regions.
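The contig-ordering step can be illustrated with a minimal Python sketch. It is not the CLC Genomics Workbench procedure used here; it only assumes that each contig alignment has already been reduced to a record holding the reference chromosome, start position, strand, and percent identity. The `Alignment` record and `order_contigs` function are hypothetical names introduced for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Alignment(NamedTuple):
    """One contig-to-reference alignment (hypothetical record layout)."""
    contig: str
    ref_chrom: str
    ref_start: int
    strand: str
    identity: float  # percent identity to the reference


def order_contigs(alignments, min_identity=90.0):
    """Group contigs by reference chromosome and sort them by start position.

    Alignments below the minimum identity (90% in the text) are discarded.
    Returns {chromosome: [(contig, strand), ...]} in reference order.
    """
    by_chrom = defaultdict(list)
    for aln in alignments:
        if aln.identity >= min_identity:
            by_chrom[aln.ref_chrom].append(aln)
    return {
        chrom: [(a.contig, a.strand) for a in sorted(alns, key=lambda a: a.ref_start)]
        for chrom, alns in by_chrom.items()
    }
```

Concatenating the ordered contigs into supercontigs would additionally require detecting and merging the overlapping ends, which this sketch does not attempt.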

2.4. Genome annotation

Gene features were annotated in the high-quality genome sequence using three different approaches: ab initio, evidence-based, and homology-based predictions (Supplementary Fig. S3). For the ab initio prediction, AUGUSTUS v3.0.3 was employed with the predefined parameter set for the S. cerevisiae genome. For the evidence-based prediction of transcripts, two annotation methods were used: one used TopHat2 to map RNA-Seq reads to the N85 genome with default parameters to identify the possible locations of introns and exons, which were subsequently integrated in a hint-based run of AUGUSTUS; the other used TRINITY to run mapping-free transcript assembly with the RNA-Seq data. For the homology-based prediction of transcripts, S288c open reading frames (ORFs) were downloaded from the Saccharomyces Genome Database (SGD), and exonerate was employed to align the ORFs to the genome of strain N85. To compare coding differences between N85 and S288c, local BLASTp searches were carried out using the amino acid sequences of the predicted ORFs as queries and the amino acid sequences of all ORFs in S288c as the database. The following threshold settings were used: e-value < 1 × 10^-5, identity >80%, and an alignment length >50%.
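The BLASTp thresholds above can be applied to parsed hits as in the following Python sketch. The `Hit` record and function names are hypothetical, and the sketch assumes that "alignment length >50%" means the alignment covers more than half of the query protein, which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLASTp hit (hypothetical record; fields parsed from tabular output)."""
    query: str
    subject: str
    identity: float   # percent identity
    aln_len: int      # alignment length in residues
    query_len: int    # length of the query protein in residues
    evalue: float


def passes_thresholds(hit, max_evalue=1e-5, min_identity=80.0, min_coverage=0.5):
    """Apply the three cut-offs from the text to a single hit.

    Coverage is computed relative to the query length (an assumption).
    """
    coverage = hit.aln_len / hit.query_len
    return (
        hit.evalue < max_evalue
        and hit.identity > min_identity
        and coverage > min_coverage
    )


def s288c_homologues(hits):
    """Return the set of query ORFs with at least one hit passing all cut-offs."""
    return {h.query for h in hits if passes_thresholds(h)}
```

Predicted ORFs that never appear in the returned set would be candidates for the non-S288c gene list.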

2.5. Detection of isoforms and ncRNAs

For the detection of lncRNAs, TopHat2 was used for spliced read mapping with the following non-standard parameters: ‘no-mixed’, ‘no-discordant’, ‘b2-very-sensitive’, ‘max-intron-length 10 100’, and for strand-specific samples ‘libtype fr-first strand’. The numbers of reads within exons and genes were calculated using the Cufflinks pipeline, used as gene expression values, and normalized as reads per kilobase of exon per million mapped reads (RPKM). Alternative splicing events (isoforms) were also predicted using the Cufflinks software, and mapped reads were visualized with SpliceGrapher. All identified isoforms and ncRNAs were manually checked using the Integrative Genomics Viewer.
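The normalization described above follows the standard definition of reads per kilobase of exon per million mapped reads. A minimal Python sketch of that formula is given here for reference; it is not the Cufflinks implementation, and the function name is chosen for illustration.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads.

    RPKM = reads * 10^9 / (exon length in bp * total mapped reads)
    """
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total mapped reads must be positive")
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)
```

For example, 500 reads on a transcript with 2,000 bp of exon sequence in a library of 20 million mapped reads gives `rpkm(500, 2000, 20_000_000)`, which evaluates to 12.5.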

2.6. Detection of single nucleotide polymorphisms and analysis of copy-number variation compared with S288c

To estimate the genetic distance between the N85 and S288c reference strains, all Illumina reads were mapped to the reference genomes using BWA-MEM. Biallelic single nucleotide polymorphisms (SNPs) were called by the SAMtools v0.1.19 ‘mpileup’ command and the VCFutils ‘varFilter’ command with ‘-D 200 -d 5’. Per-position read counts were calculated from BAM files of mapped reads using the ‘genomecov’ utility of BEDTools v2.15.0. Reads were counted using a sliding window (1 kb) and used to find copy-number variations (CNVs). CNVs were identified using HMMcopy, based on the ReadDepth method. To control false-positives, only CNVs with a length >2 kb were selected. The boundaries of CNVs were confirmed by visual inspection in the Integrative Genomics Viewer. We calculated the overlap of each gene with SNPs/CNVs from gene ontology (GO) gene sets using FungiFun2 (https://elbe.hki-jena.de/fungifun/fungifun.php (19 January 2018, date last accessed)). Bonferroni-adjusted P-values and the false discovery rate were recorded.
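The windowing and length-filter steps can be sketched in Python as follows. This is only an illustration: it uses non-overlapping 1 kb windows as a simplifying assumption, it does not reproduce the HMMcopy segmentation, and the segment tuple layout `(chrom, start, end, state)` is hypothetical.

```python
def window_means(depth, window=1000):
    """Mean read depth in consecutive, non-overlapping windows.

    `depth` is a list of per-position read counts for one chromosome.
    The final window may be shorter than `window`.
    """
    means = []
    for start in range(0, len(depth), window):
        chunk = depth[start:start + window]
        means.append(sum(chunk) / len(chunk))
    return means


def filter_cnvs(segments, min_length=2000):
    """Keep only candidate CNV segments longer than `min_length` bp.

    Each segment is (chrom, start, end, state) with end > start.
    """
    return [seg for seg in segments if seg[2] - seg[1] > min_length]
```

A segmentation method would operate on the window means to assign gain or loss states; the length filter is then applied to the resulting segments.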

2.7. Analysis of the genome variations between N85 and sake strains

The genome sequences of sake strains K7 and K11 were downloaded from SGD (http://www.yeastgenome.org/ (19 January 2018, date last accessed)). The differences between Chinese rice wine strain N85 and sake strains K7 and K11 were analysed at the contig and sequencing read levels. First, the genome sequences of strains K7 and K11 were compared with that of strain N85 through local BLASTn with the following parameters: -outfmt 17 -evalue 1e-5 -num_threads 8 -parse_deflines. The results in SAM format were visualized and manually inspected in the Integrative Genomics Viewer. Second, all Illumina sequencing reads of strain N85 that could be aligned to the N85 genome were mapped to the scaffolds of strains K7 and K11 using CLC Genomics Workbench version v8.0. The resulting unmapped reads were assembled again to detect the unique regions of the N85 genome.

2.8. Nucleotide sequence accession number

Data from the whole-genome shotgun project have been deposited at the EMBL database under the accession numbers LN907784-LN907800.

3. Results

Genome sequencing and assembly of the N85 genome

Genomic DNA from S. cerevisiae strain N85 was sequenced using PacBio RS and Illumina HiSeq/MiSeq platforms. The average read length and coverage value from both sequencing platforms are summarized in Supplementary Table S1. De novo assembly of PacBio reads was performed using the RS HGAP assembly protocol version 3.3 in SMRT analysis version 2.2.0 (Pacific Biosciences) and resulted in 324 contigs (each with a length >500 bp, and an N50 value of 75,828 bp for all contigs). Subsequently, the hybrid assemblers PBcR and SPAdes were used to assemble the PacBio and Illumina reads together, and this generated 601 and 284 contigs, respectively. Additionally, we assembled the genome using only Illumina MiSeq and HiSeq reads, resulting in >1,000 contigs, and the N50 value was <10,000 bp when assembled with only paired-end MiSeq reads. The gaps between contigs from each assembly method were further closed using SSPACE-LongRead. The assembly processes and results are summarized in Table 1 and Supplementary Fig. S2.
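For readers unfamiliar with the N50 statistic reported here, the following Python sketch shows the standard calculation: the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. The `min_length` argument mirrors the >500 bp cut-off applied to each assembly; the function name is chosen for illustration.

```python
def n50(contig_lengths, min_length=500):
    """N50 of an assembly, considering only contigs longer than `min_length` bp."""
    lengths = sorted((n for n in contig_lengths if n > min_length), reverse=True)
    if not lengths:
        raise ValueError("no contigs above the minimum length")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # First contig at which the cumulative length reaches half the total.
        if running * 2 >= total:
            return length
```

A higher N50 at a similar total length, as seen when moving from the MiSeq-only assembly to the gap-closed assembly, indicates a less fragmented assembly.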

Table 1

Summary of the de novo hybrid assembly results

Library type	No. of contigs^a	Maximum contig size (bp)	N50 contig (bp)	Total length (bp)	Software
HiSeq	1,379	200,122	50,511	12,737,887	CLC genomic workbench
MiSeq	2,382	39,836	7,243	11,888,873	A5-miseq pipeline
PacBio	324	242,389	75,828	11,883,288	HGAP
Hybrid^b	601	27,619	6,928	3,893,255	PBcR
Hybrid	284	862, 32	201,497	11,857,571	SPAdes
Close Gap	204	1,107,090	477,098	11,917,338	SSPACE-LongRead

^a For each assembly, only contigs >500 bp in length were considered.

^b Hybrid represents the combination of HiSeq, MiSeq, and PacBio datasets.

De novo genome assemblies were further improved by combining the different hybrid assembly contigs and the reference-guided assembly (see Section 2.3). Illumina reads were mapped to the S288c reference genome sequence using CLC Genomics Workbench to generate consensus sequences, which were used to place these contigs into 16 chromosomal sequences and 1 mitochondrial sequence. The final assembly of the strain N85 genome showed high collinearity and structural conservation with that of the S288c reference (Fig. 1), excluding some repetitive or telomeric regions. Genome sequences that are derived from assembling reads potentially suffer from errors, especially around regions with repetitive sequences. Specifically, the ribosomal DNA repeats in Chromosome XII appeared to be incorrectly assembled, and were manually adjusted based on the assembly of the closely related YHJ7 strain. Finally, the gapless and near-finished genome sequence of S. cerevisiae strain N85 was obtained, which contains 16 chromosomal sequences and one mitochondrial sequence and is 12.09 million bp in total (Supplementary Table S2).

Figure 1

Dot plot of sequence similarity between the assembly scaffolds of the N85 and S288c strains. The majority of N85 assembly sequences are collinear with the chromosomes of the reference S288c strain.

Multiple approaches used to annotate the N85 genome

In order to achieve a high-quality gene annotation of the S. cerevisiae strain N85 genome, four different annotation approaches were performed. First, all ORFs in the model S. cerevisiae S288c were downloaded from the SGD and aligned to the N85 genome using exonerate. A total of 8,082 predicted ORFs were generated, and 6,236 of these were retained when the percentage coverage of the alignment cut-off was >50%. Second, ab initio gene prediction was performed with AUGUSTUS, which generated 6,786 draft predicted ORFs. Third, the total reads from RNA-Seq experiments were mapped onto the genome of S. cerevisiae strain N85 using TopHat2 to generate exon and intron ‘hints’ for AUGUSTUS. With the help of these hints, local AUGUSTUS genome prediction was carried out, resulting in 5,320 hints-based predicted ORFs. Lastly, total RNA-Seq reads were used to create an unbiased, mapping-free transcriptome assembly using TRINITY, which generated 15,371 draft transcripts. The results of the final three methods were filtered by amino acid length (>100), and 6,405, 5,339, and 10,487 ORFs were retained, respectively. To compare coding differences between N85 and S288c, the amino acid sequences of the predicted ORFs were aligned to those of all S288c ORFs. In total, 5,187, 4,968, and 5,267 S288c homologous genes were extracted from the AUGUSTUS, AUGUSTUS-Tophat, and TRINITY predictions, respectively (Supplementary Table S3). Combining these S288c homologous genes and those from the exonerate prediction, 6,464 S288c homologous genes were identified in strain N85. Among them, 4,694 S288c homologous genes were present in all four predictions (Fig. 2A). Moreover, we identified 111, 86, and 82 non-S288c genes that were missing from the S288c genome annotation using the AUGUSTUS, AUGUSTUS-Tophat, and TRINITY predictions, respectively (Supplementary Table S3). In total, 171 non-S288c genes were identified by merging all these non-S288c genes, and 17 non-S288c genes were common to all three predictions (Fig. 2B). Consistent with a previous study, we detected a large region (24 kb) in Chromosome XIV that includes three non-S288c genes (g5159, g5160, and g5161) that are only present in Asian yeast strains (Fig. 2C). In addition, we detected some N85-specific regions that are not present in YHJ7, including a region in Chromosome XII that includes two novel genes, g4241 and g4242 (Fig. 2C). The expression of these novel non-S288c genes was validated by RNA-Seq data, but their function requires further investigation.
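The merging of gene sets across prediction methods amounts to taking a union (genes found by any method) and an intersection (genes found by every method). A minimal Python sketch is shown below; the gene identifiers in the usage example are toy placeholders, not genes from this study.

```python
def merge_predictions(predictions):
    """Combine gene sets from several prediction methods.

    `predictions` maps a method name to an iterable of gene identifiers.
    Returns (union, common): genes found by any method and by all methods.
    """
    sets = [set(ids) for ids in predictions.values()]
    if not sets:
        return set(), set()
    union = set().union(*sets)
    common = set.intersection(*sets)
    return union, common


# Toy identifiers for illustration only.
toy = {
    "AUGUSTUS": ["geneA", "geneB", "geneC"],
    "AUGUSTUS-Tophat": ["geneB", "geneC"],
    "TRINITY": ["geneC", "geneD"],
}
union, common = merge_predictions(toy)
```

In this toy case the union holds four identifiers and only `geneC` is common to all three methods, which is the same relationship as the 171 merged and 17 shared non-S288c genes in Fig. 2B.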

Figure 2

Annotation of the S. cerevisiae N85 genome. Analysis of S288c homologous and non-S288c genes in S. cerevisiae N85 through different approaches. (A) Number of S288c homologous genes identified using exonerate (yellow), AUGUSTUS (red), AUGUSTUS-Tophat (blue), and TRINITY (green). (B) Number of non-S288c genes identified using AUGUSTUS (red), AUGUSTUS-Tophat (blue), and TRINITY (green). (C) Genomic architecture of non-S288c genes in N85. Gene locations are shown below each gene box. Color figures available in online version.

Transcription of lncRNAs and isoforms

RNA-Seq was used to identify lncRNAs in S. cerevisiae strain N85. Approximately 20 million poly-adenylated RNA-Seq reads and 2 billion strand-specific ribosome-removed RNA-Seq reads were obtained, which allowed us to confirm the orientation of transcripts and predict anti-sense transcripts. After read mapping and transcript assembly, we classified all expressed transcripts longer than 200 nucleotides into coding genes and lncRNAs. Using these sequencing datasets, we detected 238 lncRNAs, most of which are novel lncRNAs not annotated in the databases (Fig. 3A). Consistent with previous studies, lncRNAs were expressed at significantly lower levels than coding genes (Fig. 3B, Wilcoxon test, P < 10^-5). However, the function of these novel lncRNAs requires further investigation. Besides lncRNAs, 619 transcript isoforms of N85 genes were also annotated using transcriptome analysis (Fig. 3A), and the expression levels of isoforms were slightly higher than those of genes without alternative splicing (Fig. 3B).

Figure 3

General characteristics of coding and non-coding transcripts. (A) The number of different transcript variants in S. cerevisiae N85. (B) Box-plots of transcript expression levels in log2 (FPKM) units. FPKM, fragments per kilobase of exon per million reads mapped.

Genome variations in strain N85 compared with the model strain S288c

To examine genetic variation in N85, the Illumina short reads were mapped to the S288c reference genome, which identified 57,278 SNPs and 8,624 indels in N85 (Table 2). As expected, ∼99% of these SNPs have been observed in the closely related YHJ7 strain. Although N85 is almost homozygous, 2,131 heterozygous sites were found in the diploid sequenced N85 strain (Table 2). The SNP distribution was not random, and approximately one-third of all detected SNPs were intergenic (Table 2), even though only about 25% of the S. cerevisiae genome is noncoding. More than 30% of SNPs detected in coding regions are nonsynonymous, resulting in changes to the encoded protein sequence. Interestingly, nonsynonymous mutations are more frequent in homozygous SNPs than in heterozygous SNPs (24 vs. 22%, respectively; P < 0.01, Chi-squared test). Moreover, several genes have gained or lost stop codons (Table 2).

Table 2

Genetic variations identified in the S. cerevisiae strain N85

	N85			YHJ7-homologues
	Total	Hom^a	Het	Total	Hom	Het
Exonic	38,478	37,564	914	214	73	141
Synonymous	23,717	23,350	367	75	21	54
Nonsynonymous	12,380	12,121	259	72	26	46
Frameshift	274	243	31	6	2	4
Nonframeshift	1,996	1,748	248	55	20	35
Stop gain or loss	89	82	7	6	4	2
Intronic	428	424	4	5	3	2
Intergenic	26,996	25,783	1,213	1,175	852	323

^a Hom: homozygous; Het: heterozygous. YHJ7-homologues: SNPs that are shared with YHJ7 were filtered from the total N85 SNPs.
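The totals quoted in the text can be cross-checked against the N85 columns of Table 2 with a few lines of Python. The values below are copied from the table; the variable names are chosen for illustration.

```python
# N85 columns of Table 2: category -> (total, homozygous, heterozygous)
table2_n85 = {
    "Exonic": (38478, 37564, 914),
    "Intronic": (428, 424, 4),
    "Intergenic": (26996, 25783, 1213),
}

# All variants (SNPs plus small indels) across the three genomic categories.
total_variants = sum(total for total, _, _ in table2_n85.values())

# Heterozygous sites across the three genomic categories.
het_sites = sum(het for _, _, het in table2_n85.values())

# Fraction of exonic variants that are nonsynonymous (12,380 of 38,478).
nonsyn_fraction = 12380 / table2_n85["Exonic"][0]

print(total_variants)             # 65902
print(het_sites)                  # 2131
print(round(nonsyn_fraction, 3))  # 0.322
```

These values agree with the 57,278 SNPs plus 8,624 indels (65,902 in total), the 2,131 heterozygous sites, and the statement that more than 30% of variants in coding regions are nonsynonymous.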

The nonsynonymous to synonymous substitution rate (Ka/Ks) relative to the S288c strain was assessed to identify fast-evolving genes. Genes with Ka/Ks values significantly >1 were classified as being under positive selection. We identified 232 genes with Ka/Ks values > 1 (Supplementary Table S4). GO term analysis revealed that these genes were mostly associated with transcriptional control. Several stress-responsive transcription factors appear to have evolved rapidly in N85, including the GATA zinc finger protein GZF3, which regulates nitrogen catabolic gene expression, and two genes (HKR1, Ka/Ks = 1.94; MSB2, Ka/Ks = 1.22) involved in the HOG pathway, which has been shown to play a key role in the adaptation of Chinese rice wine strains. Sequencing of the N85 strain in its natural ploidy state allowed analysis of gross chromosomal rearrangements and aneuploidies. Based on the sequencing read depth, there were no gains or losses of whole chromosomes in the N85 genome (Supplementary Fig. S4), ruling out polyploidy or aneuploidy. A total of 76 amplification and deletion events were detected in N85, covering 1.57 Mb of the genome (Fig. 4). The size of these regions ranges from 1 to 18 kb, and most were detected in subtelomeric regions. GO enrichment analysis of CNV regions revealed that genes involved in cellular responses to nitrogen starvation, asparagine metabolic processes, and cellular aldehyde metabolic processes are most heavily influenced by CNVs.

Figure 4

Genetic variation in the S. cerevisiae N85 genome. The first and second circles represent SNPs and INDELs relative to the S. cerevisiae S288c reference genome, in which the specific variation in N85 relative to YHJ7 is highlighted in red. The third and fourth circles represent larger duplication/deletion or translocation events relative to YHJ7. Most of the structural rearrangements are localized in subtelomeric regions. Color figures available in online version.

The genome differences between Chinese rice wine strain N85 and sake strains

In general, manual review of the BLASTn results between strain N85 and strains K7 and K11 showed that the genome of the Chinese rice wine strain is quite similar to those of the sake strains. However, some genomic variations may explain, to some extent, the differences between Chinese rice wine and sake. Four hundred indels were identified in the genome of strain N85 compared with both sake yeasts K7 and K11. Among them, 116 indels were found in gene coding regions. Some affected genes are involved in the production of organic acids (KGD2, HSP31, BNA3) and in amino acid catabolism (STR2, HOM6, SPE2, CYS3, ARG2). A previous study suggested that a strain deficient in α-ketoglutarate dehydrogenase (KGD2) produced less ethanol during wine brewing. The low ethanol content was compensated by an increase in organic acids, such as citrate, succinate, fumarate, and malate. BNA3 encodes a putative carbon-sulphur lyase responsible for volatile-thiol release. Deletion of BNA3 reduced the release of 4-mercapto-4-methylpentan-2-one, which makes an important contribution to the aroma of wine, in both laboratory and wine yeast backgrounds. In addition, mutation of CYS3, which also encodes a putative carbon-sulphur lyase, led to overproduction of another sulphur compound, methyltetrathiophen-3-one, which was previously shown to contribute to wine aroma. ARG2 belongs to the arginine biosynthesis pathway and encodes acetylglutamate synthase, which catalyses the first step in the biosynthesis of the arginine precursor, ornithine. In addition to indels, 25 regions in the N85 genome were found to be absent from the genome of either strain K7 or strain K11, by mapping unmapped reads back to the N85 genome. Five of these regions were absent from the genomes of both sake strains.
All of them are coding regions in the genome of strain N85, which encode the alpha-glucoside permease (MPH3), the sorbitol dehydrogenase (SOR1), a component of the vacuolar cation channel (YVC1), and two putative proteins with unknown function (YGL262W and YGL263W). SOR1 and MPH3 are located in the subtelomeric paralogous blocks of Chromosome X. However, the effects of the gain or loss of these genes on Chinese rice wine and sake are still unclear.

Genome browser for N85

To visualize gene sequences, annotated genes, and genetic variants in the genome of S. cerevisiae strain N85, a JBrowse-based genome browser was developed and deposited on the website (http://www.ligene.cn/hygd (19 January 2018, date last accessed)). As shown in Fig. 5, the basic genome browser functionality can provide genome annotation views via an overhead bar that offers a visual indication of the chromosome position. Although only gene annotation and mutation information is currently available, further large-scale datasets will be integrated in the future. We also aim to implement additional analysis tools for BLAST searching, primer design, and sequence alignment, to help the scientific community to use the developed genome resources.

Figure 5

Screenshot of the homepage of the Huangjiu Yeast Genome Database and genome browser displaying Chromosome 1 of S. cerevisiae N85. Gene regions are represented as a horizontal box, and genetic variants are represented as blue dots. Color figures available in online version.

4. Discussion

Chinese rice wine is a traditional fermented alcoholic beverage with many health benefits. Recently, the annual consumption of Chinese rice wine has been steadily increasing, and it is now more than 2 million kilolitres. However, the complete genome sequence of the Chinese rice wine strain used for fermentation has not been reported, which makes it difficult to further improve the quality of this popular product. The purpose of the current study was to genetically characterize the S. cerevisiae strain N85 by sequencing the genome using both PacBio and Illumina platforms. Assembly was achieved by combining the long but relatively low-quality PacBio reads with the short but higher-quality Illumina reads using a complex approach, and a high-quality genome of the N85 strain was generated. The assembled gapless and near-complete genome is equivalent in length to that of the model S288c strain. Owing to the high integrity of the assembled genome, a larger number of protein-coding genes were annotated through multiple annotation approaches than in the gapped genome of the YHJ7 strain. This comprehensive annotation also identified additional non-S288c genes, which are important for functional and evolutionary analysis. Furthermore, the annotated genes in the N85-specific regions may help to explain the different evolutionary routes of Chinese rice wine strains. In addition, genome comparison between the Chinese strain and the Japanese strains revealed a series of variations in coding regions, affecting genes involved in the production of organic acids and the catabolism of amino acids. These variations might explain the differences in flavour and nutritional value between Chinese rice wine and sake. Notably, some variants were identified in the coding sequence of ARG2, which encodes acetylglutamate synthase in the arginine biosynthesis pathway. Arginine is the precursor of urea, which is considered to be the major precursor of EC.
The mutation in ARG2 may lead to a higher concentration of EC in Chinese rice wine than in sake. In addition, a previous study revealed that two subtelomeric paralogous blocks in the genome of strain S288c, containing HXT15-SOR2-MPH2 on Chromosome IV and HXT16-SOR1-MPH3 on Chromosome X, were lost in the genome of strain K7. However, a part of the second block, containing SOR1-MPH3, was identified in the genome of strain N85. These differences among the genomes of laboratory, Chinese rice wine, and sake strains may suggest that the strains followed different evolutionary paths to cope with their respective environments. LncRNAs and alternative splicing are two important elements of transcription. LncRNAs are claimed to play an important role in the regulation of gene expression at transcriptional, post-transcriptional, and translational levels. Moreover, alternative splicing is a major contributor to the environmental fitness of an organism. Recently, various isoforms of the transcription factor GAT1 that are involved in the regulation of nitrogen catabolite repression were discovered in S. cerevisiae. In this work, numerous isoforms and novel lncRNAs were identified, enriching our existing knowledge in this area. Furthermore, future analysis of their functions could deepen our understanding of Chinese rice wine fermentation. It should be emphasized that current methods for annotating isoforms are not reliable; some annotated isoforms may be false-positives, and the accuracy of the results will improve as sequencing technology evolves and generates longer sequences. The presence of highly conserved SNP sites shared by the N85 and YHJ7 genomes but not the S288c genome indicates that the two strains originated from a common ancestor that diverged from the ancestor of the model S288c strain, and underwent adaptation during Chinese rice wine fermentation.
However, the enrichment of SNPs in intergenic regions of the N85 genome might be due to reduced functional constraints on intergenic sequences during the evolutionary history of this strain. The enrichment of nonsynonymous mutations in homozygous genes suggests that the function of heterozygous genes is much more stable, and that homozygous SNPs make a greater contribution to functional adaptation in strain N85. During the fermentation of Chinese rice wine, yeast cells are exposed to high concentrations of sugars released by the digestion of rice starch in the fermentation mash, and hence to osmotic stress. The enrichment of nonsynonymous SNPs in the HOG pathway is indicative of adaptive evolution of the strain over its long history in Chinese rice wine fermentation. Notably, this phenomenon was also observed in the genome of another Chinese rice wine strain, YHJ7. Similar enrichment was also identified in GZF3, one of four global transcriptional regulators of nitrogen catabolite repression, which may be responsible for the accumulation of EC in Chinese rice wine.

Variations in genome structure, such as polyploidy, aneuploidy, and copy-number variation, have repeatedly been associated with domestication and with adaptation to specific niches in experimentally evolved microbes. Moreover, owing to the plastic and dynamic nature of subtelomeric regions, loss or gain of genes occurs more frequently there than in other regions, which may accelerate adaptive evolution. Analysis of structural variation suggests that CNVs in the N85 genome may underlie niche adaptation.
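
The enrichment statements above (for example, nonsynonymous SNPs in the HOG pathway) can be illustrated with a simple over-representation test. The following is a minimal sketch of a one-sided hypergeometric test. It is not necessarily the procedure used in this study; the function name `hypergeom_enrichment_p` is illustrative, and all counts in the example are hypothetical placeholders rather than values from the N85 analysis.

```python
from math import comb


def hypergeom_enrichment_p(total, total_hits, subset, subset_hits):
    """One-sided hypergeometric tail probability P(X >= subset_hits).

    total       -- all coding SNPs considered (population size)
    total_hits  -- nonsynonymous SNPs among them
    subset      -- SNPs that fall in the gene set of interest
    subset_hits -- nonsynonymous SNPs within that gene set
    """
    upper = min(subset, total_hits)
    denom = comb(total, subset)
    # Sum the probabilities of observing subset_hits or more hits.
    tail = sum(
        comb(total_hits, k) * comb(total - total_hits, subset - k)
        for k in range(subset_hits, upper + 1)
    )
    return tail / denom


# Hypothetical counts, chosen only for illustration (NOT from the paper):
# 1000 coding SNPs, 300 of them nonsynonymous; a pathway gene set
# carries 20 SNPs, 12 of them nonsynonymous.
p_value = hypergeom_enrichment_p(total=1000, total_hits=300,
                                 subset=20, subset_hits=12)
print(f"enrichment p-value: {p_value:.3g}")
```

A small p-value would indicate that the gene set carries more nonsynonymous SNPs than expected from the genome-wide proportion. In practice, the result would also need correction for multiple testing when many gene sets are examined.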

5. Conclusion

We combined data from the Illumina and PacBio sequencing platforms to generate the first gapless, near-complete genome assembly of S. cerevisiae N85, an industrial strain used in Chinese rice wine brewing. Our study revealed many genes and genetic variations that may help the strain to cope with the high glucose and ethanol concentrations in the fermentation environment. Industrial rice wine producers, and producers of other beverages, will likely benefit from access to a complete N85 genome to improve their production processes and product quality. In addition, our findings provide a rich genetic resource for the S. cerevisiae fundamental and applied research communities.
