Oliver Rupp, Madolyn L MacDonald, Shangzhong Li, Heena Dhiman, Shawn Polson, Sven Griep, Kelley Heffner, Inmaculada Hernandez, Karina Brinkrolf, Vaibhav Jadhav, Mojtaba Samoudi, Haiping Hao, Brewster Kingham, Alexander Goesmann, Michael J Betenbaugh, Nathan E Lewis, Nicole Borth, Kelvin H Lee.
Abstract
Accurate and complete genome sequences are essential in biotechnology to facilitate genome-based cell engineering efforts. The current genome assemblies for Cricetulus griseus, the Chinese hamster, are fragmented and replete with gap sequences and misassemblies, consistent with most short-read-based assemblies. Here, we completely resequenced C. griseus using single molecule real time sequencing and merged this with Illumina-based assemblies. This generated a more contiguous and complete genome assembly than either technology alone, reducing the number of scaffolds by >28-fold, with 90% of the sequence in the 122 longest scaffolds. Most genes are now found in single scaffolds, including up- and downstream regulatory elements, enabling improved study of noncoding regions. With >95% of the gap sequence filled, important Chinese hamster ovary cell mutations have been detected in draft assembly gaps. This new assembly will be an invaluable resource for continued basic and pharmaceutical research.
Keywords: Chinese hamster; assembly; biopharmaceuticals; genome
Year: 2018 PMID: 29704459 PMCID: PMC6045439 DOI: 10.1002/bit.26722
Source DB: PubMed Journal: Biotechnol Bioeng ISSN: 0006-3592 Impact factor: 4.395
Figure 1. The PICR assembly ranked against other mammalian assemblies. (a) The PICR assembly was compared with other candidate assemblies of Cricetulus griseus based on 80 different assembly metrics. This shows for each test how the assemblies compare. The best assembly for each test is plotted on the outer rim, whereas the worst is near the center. Eighty tests were defined (see Supporting Information Table S3) in six different categories. On average, the PICR assembly was the most highly ranked, with the PIRC assembly closely following. (b) Weighted histogram of the contig lengths for the PICR assembly (red) compared with the Ensembl mouse (salmon), rat (purple), and the previous Chinese hamster RefSeq assemblies (green) [Color figure can be viewed at wileyonlinelibrary.com]
Four different orders were used to merge the four initial assemblies with the Metassembler tool, where PICR starts with the PacBio SMRT assembly, after which the Illumina assembly is merged into it, followed by the CSA assembly and the RefSeq assembly
| Base assembly | Added in Step 1 | Step 2 | Step 3 | Name |
|---|---|---|---|---|
| PacBio SMRT | Illumina | CSA | RefSeq | PICR |
| PacBio SMRT | Illumina | RefSeq | CSA | PIRC |
| Illumina | PacBio SMRT | CSA | RefSeq | IPCR |
| Illumina | PacBio SMRT | RefSeq | CSA | IPRC |
Note. CSA, chromosome‐sorted assembly; PacBio SMRT, Pacific Biosciences SMRT assembly; SMRT, single molecule real time.
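
The naming scheme in the table above can be read as the initials of the assemblies in the order they were merged. The following is a minimal sketch of that scheme only, not of the Metassembler tool itself; the `CODES` mapping and the `merge_orders` function are hypothetical names introduced for illustration.

```python
from itertools import permutations

# One-letter codes assumed from the merged-assembly names in the table.
CODES = {"PacBio SMRT": "P", "Illumina": "I", "CSA": "C", "RefSeq": "R"}


def merge_orders():
    """List the four merge orders and the name each one produces.

    The long-read and short-read assemblies come first (in either order),
    followed by CSA and RefSeq (in either order).
    """
    orders = []
    for first_two in (("PacBio SMRT", "Illumina"), ("Illumina", "PacBio SMRT")):
        for last_two in permutations(("CSA", "RefSeq")):
            order = first_two + last_two
            name = "".join(CODES[assembly] for assembly in order)
            orders.append((name, order))
    return orders


for name, order in merge_orders():
    # The first entry is the base assembly; the rest are merged in turn.
    print(name, "<-", " + ".join(order))
```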
Assembly metrics of the Illumina scaffolds and PacBio SMRT curated assembly compared with the previously published assemblies
| Metric | RefSeq (Lewis et al.) | CSA (Brinkrolf et al.) | Pooled Illumina scaffolds | Curated PacBio SMRT contigs |
|---|---|---|---|---|
| Scaffolds (No.) | 52,710 | 28,749 | 17,373 | 1,659 |
| Length (Gb) | 2.36 | 2.33 | 2.39 | 2.31 |
| Min length (bp) | 201 | 830 | 898 | 100,560 |
| Max length (Mb) | 8.32 | 14.66 | 25.84 | 16.08 |
| Mean length (kb) | 44.78 | 81.14 | 137.45 | 1394.69 |
| Median length (bp) | 363 | 1,927 | 2,063 | 693,156 |
| N50 length (kb) | 1558.30 | 1236.52 | 5951.71 | 2906.73 |
| N50 (No.) | 450 | 501 | 128 | 223 |
| N90 length (kb) | 395.29 | 180.69 | 1003.29 | 623.9 |
| N90 (No.) | 1,558 | 2,251 | 468 | 884 |
| Total N gaps (No.) | 166,152 | 290,660 | 110,314 | 0 |
| Total N (%) | 2.49 | 10.45 | 2.66 | 0 |
Note. CSA, chromosome‐sorted assembly; PacBio SMRT, Pacific Biosciences SMRT assembly; N, undefined base in scaffolds.
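
The N50 and N90 rows follow the usual definition: scaffolds are sorted from longest to shortest and their lengths are accumulated until 50% (or 90%) of the total assembly length is reached. The length of the scaffold that crosses the threshold is the N50 (N90) length, and the number of scaffolds needed is the N50 (N90) count. The sketch below illustrates this with toy lengths that are not taken from any of the assemblies; `nx_stats` is a hypothetical helper name.

```python
def nx_stats(lengths, percent):
    """Return (Nx length, Nx count) for a list of scaffold lengths.

    `percent` is an integer such as 50 or 90. Integer arithmetic is used
    for the threshold test to avoid floating-point rounding at the boundary.
    """
    if not lengths:
        raise ValueError("no scaffold lengths given")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running * 100 >= percent * total:
            return length, count
    raise AssertionError("unreachable: the full list always reaches the threshold")


# Toy example (hypothetical lengths, total = 300).
toy_lengths = [100, 80, 60, 40, 20]
n50_length, n50_count = nx_stats(toy_lengths, 50)
n90_length, n90_count = nx_stats(toy_lengths, 90)
print(n50_length, n50_count)
print(n90_length, n90_count)
```

A larger N50 length together with a smaller N50 count indicates a more contiguous assembly, which is how the columns of the table can be compared.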
Assembly metrics of the four merged assemblies
| Metric | PICR | PIRC | IPCR | IPRC |
|---|---|---|---|---|
| Scaffolds (No.) | 1,829 | 1,825 | 2,317 | 2,304 |
| Length (Gb) | 2.37 | 2.37 | 2.36 | 2.36 |
| Min length (bp) | 568 | 568 | 915 | 915 |
| Max length (Mb) | 80.58 | 80.58 | 66.35 | 66.35 |
| Mean length (kb) | 1295.21 | 1298.43 | 1019.33 | 1024.64 |
| Median length (bp) | 37,019 | 38,181 | 13,201 | 14,241 |
| N50 length (kb) | 20188.72 | 19582.71 | 21744.88 | 21262.79 |
| N50 (No.) | 32 | 33 | 33 | 34 |
| N90 length (kb) | 4400.57 | 4422.38 | 3545.61 | 3650.27 |
| N90 (No.) | 121 | 122 | 122 | 122 |
| Total N gaps (No.) | 3,237 | 3,250 | 72,528 | 72,536 |
| Total Ns (%) | 0.12 | 0.12 | 1.13 | 1.13 |
Note. N, undefined base in scaffolds.
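
As a rough arithmetic check, the scaffold counts in the two metric tables can be compared directly. The sketch below uses only values copied from those tables; the variable and function names are hypothetical. Note that the last figure concerns the number of N gaps, which is not the same quantity as the amount of gap sequence filled.

```python
# Values copied from the assembly-metric tables in this record.
scaffolds = {"RefSeq": 52_710, "CSA": 28_749, "PICR": 1_829}
n_gaps = {"RefSeq": 166_152, "PICR": 3_237}


def fold_reduction(before, after):
    """How many times smaller `after` is than `before`."""
    return before / after


def fraction_removed(before, after):
    """Fraction of the original count that is no longer present."""
    return 1 - after / before


refseq_fold = fold_reduction(scaffolds["RefSeq"], scaffolds["PICR"])
csa_fold = fold_reduction(scaffolds["CSA"], scaffolds["PICR"])
gap_count_drop = fraction_removed(n_gaps["RefSeq"], n_gaps["PICR"])

print(f"RefSeq -> PICR scaffolds: {refseq_fold:.1f}-fold fewer")
print(f"CSA -> PICR scaffolds: {csa_fold:.1f}-fold fewer")
print(f"RefSeq -> PICR N gaps: {gap_count_drop:.1%} fewer")
```

The RefSeq-to-PICR ratio comes out between 28 and 29, which is consistent with the ">28-fold" reduction stated in the abstract.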
Gene and transcript information from the Maker annotation of the PICR and IPCR genome assemblies
| Metric | PICR assembly | IPCR assembly |
|---|---|---|
| All genes | | |
| Gene count | 24,686 | 23,410 |
| Transcript count | 24,948 | 23,656 |
| Transcripts per gene | 1.01 | 1.01 |
| Average length transcript | 17615.04 | 18089.17 |
| Total length transcript | 439,460,104 | 427,917,413 |
| Average coding length | 1324.93 | 1316.11 |
| Total coding length | 33,054,355 | 31,133,905 |
| Average exons per transcript | 7.49 | 7.54 |
| Total exons | 186,939 | 178,277 |
| Complete transcripts | | |
| Transcript count | 18,476 | 17,557 |
| Total exons | 138,358 | 131,262 |
| Incomplete transcripts | | |
| Transcript count | 6,472 | 6,099 |
| Total exons | 48,581 | 47,015 |
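
The derived rows of the annotation table can be recomputed from the raw counts, which also confirms that the complete and incomplete transcript groups add up to the totals. This is a minimal consistency check using values copied from the table; the dictionary keys and the `summarize` function are hypothetical names.

```python
# Counts copied from the Maker annotation table; keys are illustrative labels.
annotation = {
    "PICR": {
        "genes": 24_686, "transcripts": 24_948, "exons": 186_939,
        "complete": 18_476, "incomplete": 6_472,
        "complete_exons": 138_358, "incomplete_exons": 48_581,
    },
    "IPCR": {
        "genes": 23_410, "transcripts": 23_656, "exons": 178_277,
        "complete": 17_557, "incomplete": 6_099,
        "complete_exons": 131_262, "incomplete_exons": 47_015,
    },
}


def summarize(a):
    """Recompute derived ratios and check that the subgroups sum to the totals."""
    return {
        "transcripts_per_gene": round(a["transcripts"] / a["genes"], 2),
        "exons_per_transcript": round(a["exons"] / a["transcripts"], 2),
        "transcripts_add_up": a["complete"] + a["incomplete"] == a["transcripts"],
        "exons_add_up": a["complete_exons"] + a["incomplete_exons"] == a["exons"],
    }


summaries = {name: summarize(a) for name, a in annotation.items()}
for name, summary in summaries.items():
    print(name, summary)
```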
Figure 2. Importance of correct assembly of genes and noncoding regions. (a) Chromatin states defined by histone marks: Left: histone marks for CSA assembly (Brinkrolf et al., 2013; Feichtinger et al., 2016); center: histone marks for PICR assembly; right: histone marks from the Human Epigenome Project (Kundaje et al., 2015). (b) A total of 1,538 genes associated with mitochondria were blasted from TSS to TES against the CSA and RefSeq assemblies. The number of hits completely found on a single scaffold is displayed for each assembly. (c) Mouse coding sequences were blasted against Chinese hamster assemblies from the start of translation to the end. (d) The 1,011 complete genes found in PICR were extended 5 kb upstream and 1.5 kb downstream to include promoters and other regulatory noncoding regions and blasted against existing assemblies. (e) Chromatin states around three genes, as found in the previously published CSA‐based chromatin state model (Feichtinger et al., 2016; top for each gene) and the PICR assembly (bottom for each gene), showing promoter and regulatory elements in addition to active transcription. CSA, chromosome‐sorted assembly; TES, transcription end site; TSS, transcription start site [Color figure can be viewed at wileyonlinelibrary.com]
Figure 3. Important variants are located in sequence gaps in previous assemblies. (a) More than 95% of sequence gaps were filled in the PICR meta‐assembly (inset shows the log frequency of gaps to highlight the low frequency of PICR gaps not visible in the normal histogram). (b) The missing sequence in gaps in the RefSeq assembly was identified by aligning the RefSeq sequence flanking the gaps to the PICR sequence. (c) Across 13 cell lines, we found 65,842 SNP and indel mutations in the RefSeq gap regions, and 1.3% of these were found in coding regions. (d) A legacy CHO cell line, pgsA745, which is deficient in glycosaminoglycan biosynthesis, was used to identify Xylt2 as the glycosyltransferase responsible for the first step in glycosaminoglycan biosynthesis. Because of a gap in the RefSeq assembly, the causal variant can be identified only in the new PICR meta‐assembly. A G→T mutation introduces an early stop codon in exon 1, resulting in a loss of Xylt2 activity. The genotype is shown for a variety of CHO cell lines (Feichtinger et al., 2016; Lewis et al., 2013; van Wijk et al., 2017), with only pgsA745 showing the early stop codon. CHO, Chinese hamster ovary; SNP, single nucleotide polymorphism; Xylt2, xylosyltransferase 2 [Color figure can be viewed at wileyonlinelibrary.com]