Markus Ralser, Heiner Kuhl, Meryem Ralser, Martin Werber, Hans Lehrach, Michael Breitenbach, Bernd Timmermann.
Abstract
Saccharomyces cerevisiae strain W303 is a widely used model organism. However, little is known about its genetic origins, as it was created in the 1970s by crossing yeast strains of uncertain genealogy. To obtain insights into its ancestry and physiology, we sequenced the genome of its variant W303-K6001, a yeast model used in ageing research. The combination of two next-generation sequencing (NGS) technologies (Illumina and Roche/454 sequencing) yielded an 11.8 Mb genome assembly at an N50 contig length of 262 kb. Although sequencing was substantially more precise and sensitive than whole-genome tiling arrays, both NGS platforms produced a number of false positives. At a 378× average coverage, only 74 per cent of called differences from the S288c reference genome were confirmed by both techniques. The consensus W303-K6001 genome differs from S288c in 8133 positions, predicting altered amino acid sequence in 799 proteins, including factors of ageing and stress resistance. Most of the W303-K6001 genome (85.4%) is virtually identical (≤0.5 variations per kb) to S288c, and thus originates from the same ancestor. Non-S288c regions are distributed unequally over the genome, with chromosome XVI the most (99.6%) and chromosome XI the least (54.5%) S288c-like. Several of these clusters are shared with Σ1278B, another widely used S288c-related model, indicating that these strains share a second ancestor. Thus, the W303-K6001 genome reveals details of complex genetic relationships between the model strains that date back to the early days of experimental yeast genetics. Moreover, this study underlines the necessity of combining multiple NGS and genome-assembly techniques to achieve accurate variant calling in genomic studies.
Keywords: mapping; next-generation sequencing; phylogeny reconstruction; yeast models
Year: 2012 PMID: 22977733 PMCID: PMC3438534 DOI: 10.1098/rsob.120093
Source DB: PubMed Journal: Open Biol ISSN: 2046-2441 Impact factor: 6.411
Sequencing of W303-K6001 on two different NGS platforms.
| | 454: K6001 | 454: K6001-B7 | 454: total | Illumina: K6001 | Illumina: K6001-B7 | Illumina: total | combined |
|---|---|---|---|---|---|---|---|
| high-quality reads | 641 083 | 684 545 | 1 325 628 | 20 407 268 | 18 266 146 | 38 673 414 | 39 999 042 |
| average read length | 507.22 | 492.26 | 499.48 | 101.16 | 101.28 | 101.22 | |
| median read length | 529.0 | 524.0 | 527.0 | ||||
| average insert size (paired end) | single read | single read | single read | 225.0 | 221.3 | 223.15 | |
| bases (Mb) | 325 157 | 336 965 | 662 122 | 2 064 395 | 1 849 944 | 3 914 339 | 4 576 462 |
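The 378× average coverage quoted in the abstract can be sanity-checked from the read counts and mean read lengths in this table. The following is a minimal sketch, not the authors' calculation. It rests on two assumptions that the table does not state: that the "bases" row is in thousands of bases (which is what read count × mean read length suggests), and that the haploid reference is roughly 12.1 Mb. The function names are illustrative.

```python
def total_bases(n_reads: int, mean_read_length: float) -> float:
    """Approximate sequenced bases as read count times mean read length."""
    return n_reads * mean_read_length


def average_coverage(sequenced_bases: float, genome_size: int) -> float:
    """Average depth: sequenced bases divided by genome length."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sequenced_bases / genome_size


# Values copied from the 'total' columns of the table above.
READS_454, MEAN_LEN_454 = 1_325_628, 499.48
READS_ILLUMINA, MEAN_LEN_ILLUMINA = 38_673_414, 101.22

# ASSUMPTION: approximate size of the S. cerevisiae reference genome (bp).
ASSUMED_GENOME_SIZE = 12_100_000

bases_454 = total_bases(READS_454, MEAN_LEN_454)
bases_illumina = total_bases(READS_ILLUMINA, MEAN_LEN_ILLUMINA)
depth = average_coverage(bases_454 + bases_illumina, ASSUMED_GENOME_SIZE)

print(f"454 bases:      {bases_454 / 1e6:.1f} Mb")
print(f"Illumina bases: {bases_illumina / 1e6:.1f} Mb")
print(f"average depth:  {depth:.0f}x")
```

Under these assumptions the combined depth comes out close to the reported 378×; a different genome-size assumption shifts the result proportionally.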
De novo assembly of the W303-K6001 genome.
| | | | | combination of reference-guided and |
|---|---|---|---|---|
| assembled data | Illumina | 454 | Illumina + 454 | 454 + 454 mapped contigs |
| number of contigs | 3095 | 477 | 2846 | 375 |
| number of bases | 11 819 873 | 11 637 892 | 13 865 032 | 11 642 694 |
| average contig size | 3819 | 24 398 | 4871 | 31 047 |
| N50 contig size | 39 717 | 66 531 | 42 713 | 149 943 |
| largest contig size | 165 547 | 260 577 | 164 236 | 466 025 |
| number of scaffolds | 374 | — | 357 | 78 |
| number of bases | 11 351 824 | — | 11 856 226 | 11 591 176 |
| average scaffold size | 30 352 | — | 33 210 | 148 604 |
| N50 scaffold size | 68 612 | — | 102 849 | 367 966 |
| largest scaffold size | 226 549 | — | 386 246 | 833 844 |
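The N50 values in these assembly tables follow the usual definition: the length of the contig (or scaffold) at which the cumulative sum of lengths, taken from longest to shortest, first reaches half of the total assembled bases. A minimal sketch of that definition, with a made-up list of contig lengths rather than the W303-K6001 contigs:

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first contig that brings the running sum to >= 50%.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Hypothetical contig lengths, for illustration only.
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_contigs))
```

Because N50 depends only on the length distribution, a few long contigs raise it sharply, which is why it is reported alongside contig count and average contig size.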
Reference-guided assembling of the W303-K6001 genome.
| assembled data | Newbler v. 2.6: Illumina | Newbler v. 2.6: 454 | Newbler v. 2.6: Illumina + 454 | CLC reference mapper: Illumina | CLC reference mapper: 454 | CLC reference mapper: Illumina + 454 |
|---|---|---|---|---|---|---|
| number of contigs | 268 | 272 | 265 | 593 | 289 | 329 |
| number of bases | 11 450 524 | 11 844 486 | 11 771 493 | 11 730 567 | 11 874 714 | 11 892 270 |
| average contig size | 42 725 | 43 545 | 44 420 | 19 782 | 41 089 | 36 147 |
| N50 contig size | 107 251 | 261 861 | 262 228 | 202 565 | 231 374 | 266 295 |
| largest contig size | 414 906 | 666 566 | 555 078 | 538 743 | 743 077 | 743 360 |
Comparison of S288c and W303-K6001 genomes using two mapping algorithms.
| | CLC bio (both technologies) | Newbler (both technologies) | both mappers and both technologies |
|---|---|---|---|
| single nucleotide polymorphism | 8815 | 8471 | 8049 |
| single insertion/deletion | 397 | 280 | 25 |
| multiple number variations | 370 | 322 | 59 |
| sum | 9582 | 9073 | 8133 |
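The last column is a consensus: only differences reported by both mappers are kept, so each entry is at most the smaller of the two per-mapper counts. The sketch below illustrates one simple way to form such a consensus, as an exact-match set intersection on (chromosome, position, reference allele, alternative allele). The `Variant` type, the function names and the toy calls are hypothetical, and the authors' actual matching rules may differ.

```python
from typing import Iterable, NamedTuple, Set


class Variant(NamedTuple):
    chrom: str
    pos: int  # 1-based coordinate on the reference
    ref: str
    alt: str


def consensus_calls(*call_sets: Iterable[Variant]) -> Set[Variant]:
    """Variants present in every call set (exact match on all fields)."""
    sets = [set(calls) for calls in call_sets]
    if not sets:
        return set()
    return set.intersection(*sets)


def confirmed_fraction(*call_sets: Iterable[Variant]) -> float:
    """Share of all distinct calls that every call set agrees on."""
    sets = [set(calls) for calls in call_sets]
    union = set().union(*sets)
    if not union:
        return 0.0
    return len(consensus_calls(*sets)) / len(union)


# Toy call sets, invented for illustration.
mapper_a = {
    Variant("chrXI", 1200, "A", "G"),
    Variant("chrXI", 5400, "C", "T"),
    Variant("chrXIV", 730100, "G", "GA"),
}
mapper_b = {
    Variant("chrXI", 1200, "A", "G"),
    Variant("chrXIV", 730100, "G", "GA"),
    Variant("chrXVI", 88, "T", "C"),
}

shared = consensus_calls(mapper_a, mapper_b)
print(len(shared), confirmed_fraction(mapper_a, mapper_b))
```

Exact matching is strict for insertions and deletions, where two mappers can represent the same event at slightly different coordinates; that is one plausible reason why indel counts shrink much more than SNP counts in a consensus.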
Figure 1. Comparing the W303-K6001 genome sequence with whole-genome tiling arrays. The tiling array correctly identified the yeast background, as a significant number of SNV positions overlapped. However, the tiling array did not reach the sensitivity and accuracy of whole-genome resequencing. All unrelated non-W303 yeast strains share a number of mutant coordinates, indicating errors in the reference genome or tiling array, or private mutations of the S288c line.
Auxotrophic marker mutations found in W303-K6001.
| allele name | locus | detected mutation |
|---|---|---|
| | YOR128C | nonsense, |
| | YDR007W | nonsense, |
| | YEL063C | frameshift, |
| | YCL018W | frameshift, |
| | YOR202W | 2x frameshift, |
Gene ontology (GO) categories containing two or more genes with a single nucleotide insertion or deletion.
| GO ID | GO term | frequency | genome frequency | gene(s) |
|---|---|---|---|---|
| 2181 | cytoplasmic translation | five of 41 genes, 12.2% | 174 of 6311 genes, 2.8% | |
| 42274 | ribosomal small subunit biogenesis | three of 41 genes, 7.3% | 124 of 6311 genes, 2% | |
| 6520 | cellular amino acid metabolic process | three of 41 genes, 7.3% | 240 of 6311 genes, 3.8% | |
| 6811 | ion transport | two of 41 genes, 4.9% | 132 of 6311 genes, 2.1% | |
| 6364 | rRNA processing | two of 41 genes, 4.9% | 294 of 6311 genes, 4.7% | |
| 32200 | telomere organization | two of 41 genes, 4.9% | 67 of 6311 genes, 1.1% | |
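The "frequency" and "genome frequency" columns can be compared directly: their ratio gives a fold enrichment, and a one-sided hypergeometric test estimates how surprising the overlap is. The sketch below applies this to the cytoplasmic translation row (five of 41 genes against 174 of 6311). It is an illustration of a common approach and is not stated in the source to be the authors' method. The function names are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb


def fold_enrichment(hits: int, sample: int, annotated: int, population: int) -> float:
    """Ratio of the sample frequency to the genome-wide frequency."""
    return (hits / sample) / (annotated / population)


def hypergeom_upper_tail(hits: int, sample: int, annotated: int, population: int) -> float:
    """P(X >= hits) when drawing `sample` genes without replacement from
    `population` genes, of which `annotated` carry the GO term."""
    upper = min(sample, annotated)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(hits, upper + 1)
    )
    return favourable / comb(population, sample)


# Row 'cytoplasmic translation' from the table above.
HITS, SAMPLE, ANNOTATED, POPULATION = 5, 41, 174, 6311

fold = fold_enrichment(HITS, SAMPLE, ANNOTATED, POPULATION)
p_value = hypergeom_upper_tail(HITS, SAMPLE, ANNOTATED, POPULATION)
print(f"fold enrichment: {fold:.2f}, one-sided p: {p_value:.4f}")
```

By the same arithmetic, rRNA processing (4.9% against 4.7%) shows essentially no enrichment, so the table lists categories with two or more affected genes rather than only over-represented ones.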
Figure 2. Unequal SNV distribution in the W303 genome illustrates its mutt ancestry. (a) Regions with high sequence divergence from S288c cluster together. Chromosomal sequences with high identity (≤0.5 SNVs per kb) to the S288c reference genome EF4 are depicted in grey, indicating that 85.4% of the W303-K6001 genome is an S288c descendant. Regions with higher variability form clusters. (b) Median percentage of genetic material with greater than 0.5 SNVs per kb divergence from S288c, per chromosome. (c) Distribution of SNV frequencies per 5 kb segment, taking into account all non-S288c clusters larger than 15 kb.
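The classification behind Figure 2 can be pictured as counting SNVs in fixed windows along a chromosome and labelling windows at or below 0.5 SNVs per kb as S288c-like. The sketch below uses non-overlapping 5 kb windows, borrowing the segment size from panel (c). How the authors actually delimited regions is not given here, so the windowing scheme, function names and toy positions are assumptions.

```python
from typing import List, Sequence


def snv_density_per_kb(positions: Sequence[int], chrom_length: int,
                       window: int = 5000) -> List[float]:
    """SNVs per kb in consecutive non-overlapping windows (1-based positions)."""
    n_windows = -(-chrom_length // window)  # ceiling division
    counts = [0] * n_windows
    for pos in positions:
        if not 1 <= pos <= chrom_length:
            raise ValueError(f"position {pos} lies outside the chromosome")
        counts[(pos - 1) // window] += 1
    densities = []
    for i, count in enumerate(counts):
        # The last window may be shorter than `window`.
        size = min(window, chrom_length - i * window)
        densities.append(count / (size / 1000))
    return densities


def s288c_like_fraction(positions: Sequence[int], chrom_length: int,
                        window: int = 5000, threshold: float = 0.5) -> float:
    """Fraction of bases lying in windows with density <= threshold."""
    densities = snv_density_per_kb(positions, chrom_length, window)
    like_bases = 0
    for i, density in enumerate(densities):
        if density <= threshold:
            like_bases += min(window, chrom_length - i * window)
    return like_bases / chrom_length


# Toy chromosome of 20 kb with four SNVs, invented for illustration.
toy_positions = [100, 200, 300, 6000]
print(snv_density_per_kb(toy_positions, 20_000))
print(s288c_like_fraction(toy_positions, 20_000))
```

In the toy example only the first window exceeds the threshold (three SNVs in 5 kb), so three quarters of the sequence counts as S288c-like. Applied per chromosome, the same calculation would give percentages of the kind reported in panel (b).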
Figure 3. W303-K6001 contains clusters that are identical to Σ1278B but differ from S288c. (a) S288c is the main ancestor of W303 and Σ1278B; however, part of the non-S288c-derived W303-K6001 genome is also found in Σ1278B. Shown are two exemplary multiple alignments each from Chr XIV, XIII and XI, and the 3′ breakpoint of the cluster on Chr XI. (b) Distance diagram of S288c, Σ1278B and W303-K6001 for the non-S288c cluster on Chr XIV 730 000–760 000. (c) W303-K6001: non-S288c sequence clusters with high sequence similarity to Σ1278B.