Tyler S Alioto1,2, Fernando Cruz1, Jèssica Gómez-Garrido1, Miriam Triyatni3, Marta Gut1,2, Leonor Frias1,2, Anna Esteve-Codina1, Stephan Menne4, Anna Kiialainen3, Nadine Kumpesa3, Fabian Birzele3, Roland Schmucki3, Ivo G Gut1,2, Olivia Spleiss5.
Abstract
The Eastern woodchuck (Marmota monax) has been extensively used in research of chronic hepatitis B and liver cancer because its infection with the woodchuck hepatitis virus closely resembles a human hepatitis B virus infection. Development of novel immunotherapeutic approaches requires genetic information on immune pathway genes in this animal model. The woodchuck genome was assembled with a combination of high-coverage whole-genome shotgun sequencing of Illumina paired-end, mate-pair libraries and fosmid pool sequencing. The result is a 2.63 Gigabase (Gb) assembly with a contig N50 of 74.5 kilobases (kb), scaffold N50 of 892 kb, and genome completeness of 99.2%. RNA sequencing (RNA-seq) from seven different tissues aided in the annotation of 30,873 protein-coding genes, which in turn encode 41,826 unique protein products. More than 90% of the genes have been functionally annotated, with 82% of them containing open reading frames. This genome sequence and its annotation will enable further research in chronic hepatitis B and hepatocellular carcinoma and contribute to the understanding of immunological responses in the woodchuck.
Keywords: Chronic Hepatitis B; Eastern Woodchuck; Genome Assembly; Hepatocellular Carcinoma; Immune Response; Marmota monax; Whole Genome Sequencing
Year: 2019 PMID: 31645421 PMCID: PMC6893209 DOI: 10.1534/g3.119.400413
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154
Output of Sequencing Libraries
| Library type | Library name | Insert size (bp) | Yield (Mb) | Coverage | Avg. pct. duplicates | Avg. PhiX error, read 1 | Avg. PhiX error, read 2 |
|---|---|---|---|---|---|---|---|
| WGS PE | 523J_B | 471 | 83996 | 29.3x | 0.33 | 0.238 | 0.265 |
| WGS PE | 523J_C | 589 | 78061 | 27.2x | 0.27 | 0.238 | 0.265 |
| WGS PE | 523J_D | 692 | 107291 | 37.4x | 0.26 | 0.238 | 0.265 |
| WGS MP | 541J-1 | 3772 | 123816 | 43.1x | 77.85 | 0.305 | 0.308 |
| WGS MP | 238L | 6782 | 104709 | 36.5x | 82.21 | 0.258 | 0.338 |
| FE | Z047 | 34543 | 42194 | | 88.67 | 0.265 | 0.280 |
| FE | Z048 | 34542 | 47746 | | 87.45 | 0.265 | 0.280 |
| FP | pools 1-96 | 333 | 10608 | ∼108x | 0.86 | 0.268 | 0.397 |
Values for pools 1-96 are averages across the 96 fosmid pools.
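The Coverage column is consistent with dividing each library's yield by a single genome size. As a rough check, and not the authors' own calculation, the sketch below back-computes the genome size implied by the whole-genome shotgun (WGS) rows. The function name and dictionary are illustrative only; the numbers are copied from the table.

```python
# Yield (Mb) and reported coverage (x) for the WGS rows of the library table.
WGS_LIBRARIES = {
    "523J_B": (83996, 29.3),
    "523J_C": (78061, 27.2),
    "523J_D": (107291, 37.4),
    "541J-1": (123816, 43.1),
    "238L": (104709, 36.5),
}


def implied_genome_size_mb(yield_mb, coverage):
    """Genome size (Mb) implied by the relation coverage = yield / genome size."""
    return yield_mb / coverage


# Hypothetical helper output: one implied genome size per library.
sizes = {
    name: implied_genome_size_mb(yield_mb, coverage)
    for name, (yield_mb, coverage) in WGS_LIBRARIES.items()
}

for name, size in sizes.items():
    print(f"{name}: {size:.0f} Mb")
```

All five libraries imply a genome size of roughly 2.87 Gb, somewhat larger than the 2.63 Gb final assembly. Which genome size estimate was used for the Coverage column is not stated here.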
Figure 1. Overview of the assembly workflow. Main assemblies are shown as orange rectangles. Processing steps are shown as colored hexagons. The annotations are represented as blue rectangles.
Figure 2. Overview of the protein annotation pipeline. Input data for annotation are shown at the top of the flow chart. Computational steps are shown in light blue and intermediate data are shown in white.
RNA-seq data for tissue samples
| Sample | Individual | Tissue | Number of reads | % of reads mapped |
|---|---|---|---|---|
| | F6849 | Liver | 211,888,262 | 95.113 |
| | F6849 | Kidney | 185,982,482 | 95.506 |
| | F6849 | Spleen | 225,337,732 | 94.303 |
| | F6849 | Lung | 206,453,434 | 91.87 |
| | F6849 | Heart | 138,856,972 | 92.539 |
| | F6852 | Liver | 134,795,402 | 92.029 |
| | F9150 | Thymus | 186,450,732 | 94.791 |
| | F9150 | Liver | 144,701,830 | 94.478 |
| | F9150 | Kidney | 179,703,418 | 93.636 |
| | F9150 | Spleen | 142,020,270 | 89.978 |
| | F9150 | Pancreas | 127,666,526 | 95.619 |
| | M4046 | Liver | 202,080,726 | 94.951 |
| | M4046 | Kidney | 127,534,952 | 93.614 |
| | M4046 | Spleen | 125,489,746 | 94.784 |
| | M4075 | Liver | 147,677,126 | 90.497 |
| | M4075 | Kidney | 196,174,530 | 95.174 |
| | M4075 | Spleen | 126,030,468 | 93.961 |
| | M4091 | Liver | 145,226,684 | 95.157 |
| | M4091 | Kidney | 125,790,226 | 93.04 |
| | M4091 | Spleen | 122,753,104 | 91.042 |
Summary statistics of major assembly steps
| Contig N50 (kb) | Contig N90 (kb) | Contig length (Gb) | Scaffold N50 (kb) | Scaffold N90 (kb) | Scaffold length (Gb) | CEGMA complete (%) | CEGMA partial (%) |
|---|---|---|---|---|---|---|---|
| 26.1 | 3.4 | 2.79 | 74.6 | 8.7 | 2.79 | 88.71 | 98.39 |
| 48.8 | 9.1 | 2.54 | 49.5 | 9.2 | 2.55 | | |
| 74.5 | 15.6 | 2.55 | 712.2 | 112.8 | 2.62 | 96.37 | 99.19 |
| 74.5 | 15.6 | 2.55 | 892.2 | 124.4 | 2.63 | 96.37 | 99.19 |
Gene completeness was not determined for assembly monax3.
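For readers unfamiliar with the contiguity metrics reported for these assemblies: N50 is the length L such that sequences of length at least L together account for at least half of the total assembly length, and N90 is the same with a 90% threshold. The following is a minimal sketch of that definition on made-up lengths. The function name `nx` and the toy contig lengths are illustrative and are not taken from the paper or its pipeline.

```python
def nx(lengths, fraction):
    """Return the Nx value: the largest length L such that sequences of
    length >= L together cover at least `fraction` of the total length."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    threshold = fraction * sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating covered length.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length


toy_contigs = [100, 80, 60, 40, 20]  # hypothetical lengths in kb, total 300
n50 = nx(toy_contigs, 0.5)
n90 = nx(toy_contigs, 0.9)
print(n50, n90)
```

Because the metric weights sequences by length, joining contigs into scaffolds can raise N50 sharply even when total length barely changes, as in the rows where the contig N50 is 74.5 kb and the scaffold N50 is 712.2 kb or 892.2 kb.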
Protein-coding annotation statistics
| | Monax5E |
|---|---|
| Protein-coding genes | 30873 |
| | 6123 |
| | 44630 |
| Unique protein products | 41826 |
| | 6232 |
| | 217018 |
| | 204230 |
| | 1.45 |
| | 6.1 |
| | 0.73 |
| | 52.81% |
Figure 3. Hierarchical clustering of RNA-seq gene expression levels in six tissues.
Orthoinspector results. A pairwise comparison of the number of one-to-one orthologs or in-paralogs detected by Orthoinspector among human, mouse and woodchuck protein products. Out-paralogs are not shown.
| 11,481 | 1,788 | 2,054 | 485 | |
| 12,291 | 1,018 | 3,102 | 892 | |
| 10,535 | 3,444 | 1,540 | 749 |