| Literature DB >> 31418539 |
Yu Fan1,2, Mao-Sen Ye1,3, Jin-Yan Zhang1,3, Ling Xu1,2, Dan-Dan Yu1,2, Tian-Le Gu1,3, Yu-Lin Yao1,3, Jia-Qi Chen4, Long-Bao Lv4, Ping Zheng2,3,4,5,6,7, Dong-Dong Wu2,6, Guo-Jie Zhang2,6, Yong-Gang Yao8,2,3,4,5.
Abstract
Chinese tree shrews (Tupaia belangeri chinensis) have become an increasingly important experimental animal in biomedical research due to their close relationship to primates. An accurately sequenced and assembled genome is essential for understanding the genetic features and biology of this animal. In this study, we used long-read single-molecule sequencing and high-throughput chromosome conformation capture (Hi-C) technology to obtain a high-qualitychromosome-scale scaffolding of the Chinese tree shrew genome. The new reference genome (KIZ version 2: TS_2.0) resolved problems in presently available tree shrew genomes and enabled accurate identification of large and complex repeat regions, gene structures, and species-specific genomic structural variants. In addition, by sequencing the genomes of six Chinese tree shrew individuals, we produced a comprehensive map of 12.8 M single nucleotide polymorphisms and confirmed that the major histocompatibility complex (MHC) loci and immunoglobulin gene family exhibited high nucleotide diversity in the tree shrew genome. We updated the tree shrew genome database (TreeshrewDB v2.0: http://www.treeshrewdb.org) to include the genome annotation information and genetic variations. The new high-quality reference genome of the Chinese tree shrew and the updated TreeshrewDB will facilitate the use of this animal in many different fields of research.Entities:
Keywords: Chromosomal level assembly genome; Database; Population sequencing; Tupaia belangeri
Mesh:
Year: 2019 PMID: 31418539 PMCID: PMC6822927 DOI: 10.24272/j.issn.2095-8137.2019.063
Source DB: PubMed Journal: Zool Res ISSN: 2095-8137
Comparison of Chinese tree shrew assembly quality between assemblies TS_1.0 and TS_2.0
| Version | Item | Contig length (bp) | Scaffold length (bp) | Contig No. | Scaffold No. |
|---|---|---|---|---|---|
|
Short-read assembly (KIZ version 1: TS_1.0) | Total | 2 719 442 484 | 2 861 790 358 | 374 120 | 150 513 |
| Max_length | 187 505 | 19 269 909 | – | – | |
| >2 000 bp | – | – | 180 802 | 4 525 | |
| >100 kb | – | – | 305 | 1 418 | |
| N50 | 22 000 | 3 655 608 | 36 335 | 234 | |
| N60 | 17 500 | 3 042 664 | 50 199 | 319 | |
| N70 | 13 431 | 2 302 651 | 67 915 | 427 | |
| N80 | 9 571 | 1 648 848 | 91 810 | 573 | |
|
Long-read assembly (KIZ version 2: TS_2.0) | Total | 2 667 337 536 | 2 667 507 236 | 3 344 | 1 647 |
| Max_length | 16 177 999 | 224 450 918 | – | – | |
| >2 000 bp | – | – | 3 344 | 1 647 | |
| >100 kb | – | – | 1963 | 281 | |
| N50 | 3 217 288 | 104 643 080 | 229 | 10 | |
| N60 | 2 462 062 | 94 037 081 | 323 | 13 | |
| N70 | 1 641 093 | 71 760 103 | 457 | 16 | |
| N80 | 995 871 | 57 328 337 | 664 | 20 |
Contig: Contiguous length of genomic sequence in which the order of bases has a high confidence level. Gaps occur where reads from two sequenced ends of at least one fragment overlap with other reads in two different contigs. Scaffolds are composed of contigs and gaps. N50: N50 statistic defines assembly quality in terms of contiguity. Given a set of contigs ranked by contig size, N50 is defined as the size of the shortest contig, which adds contigs with larger size to reach 50% of total genome length. –: Not available.
Figure 1Assembly, annotation, and nucleotide diversity of Chinese tree shrew genome
A: Contig length distribution of long-read assembly (TS_2.0) in comparison with short-read assembly (TS_1.0) (Fan et al., 2013). B: Circos plot showing genome-wide distribution profiles of genes, SNPs, and indels across Chinese tree shrew genome, and values of population genetic parameters (π and Tajima’s D). C: Manhattan plot of nucleotide diversity (π) at gene level based on SNPs located in coding regions of six wild tree shrews. Top 30 genes are shown in plot, with a cut-off π value of 0.025.
Assessment of assembly completeness in Chinese tree shrew using BUSCO
| Parameter | No. | Percentage (%) |
|---|---|---|
| Complete genes | 3 749 | 91.40 |
| Complete and single-copy genes | 3 696 | 90.10 |
| Complete and duplicated genes | 53 | 1.30 |
| Fragmented genes | 184 | 4.50 |
| Missing genes | 171 | 4.10 |
| Total genes | 4 104 | – |
BUSCO: Benchmarking with Universal Single-Copy Orthologs. A total of 4 104 benchmarking universal single-copy orthologs of the mammalian dataset were retrieved from BUSCO (Simao et al., 2015). These genes were mapped to the TS_2.0 assembly using tBlastn (Altschul et al., 1997). –: Not available.
Statistics of Hi-C data for mapping
| Parameter | Hi-C data |
|---|---|
| Clean data | 705 Gb |
| Clean paired-end reads | 2 351 150 069 |
| Unmapped paired-end reads | 47 514 976 |
| Unmapped paired-end reads rate (%) | 2.02 |
| Paired-end reads with singleton | 303 352 692 |
| Paired-end reads with singleton rate (%) | 12.9 |
| Multi mapped paired-end reads | 443 734 174 |
| Multi mapped ratio (%) | 18.87 |
| Unique mapped paired-end reads | 1 556 548 227 |
| Unique mapped ratio (%) | 66.2 |
Pseudo-chromosome sizes and assignment of Hi-C scaffolds
| Pseudo-chromosome | Contig No. | Length (bp) of pseudo-chromosome |
|---|---|---|
| chr1 | 154 | 224 402 198 |
| chr2 | 107 | 187 971 973 |
| chr3 | 111 | 137 178 494 |
| chr4 | 61 | 121 533 334 |
| chr5 | 74 | 120 860 892 |
| chr6 | 88 | 117 379 583 |
| chr7 | 67 | 108 205 678 |
| chr8 | 56 | 108 052 698 |
| chr9 | 62 | 104 638 498 |
| chr10 | 64 | 101 327 006 |
| chr11 | 71 | 97 509 983 |
| chr12 | 49 | 94 027 333 |
| chr13 | 54 | 92 296 458 |
| chr14 | 58 | 89 547 586 |
| chr15 | 69 | 71 741 294 |
| chr16 | 42 | 69 742 744 |
| chr17 | 48 | 66 945 814 |
| chr18 | 41 | 63 456 188 |
| chr19 | 27 | 57 308 528 |
| chr20 | 32 | 54 551 840 |
| chr21 | 35 | 49 758 179 |
| chr22 | 53 | 52 049 165 |
| chr23 | 27 | 43 809 476 |
| chr24 | 23 | 42 251 409 |
| chr25 | 33 | 41 996 642 |
| chr26 | 100 | 30 565 635 |
| chr27 | 50 | 25 814 610 |
| chr28 | 34 | 26 506 761 |
| chr29 | 27 | 22 607 893 |
| chr30 | 28 | 21 670 314 |
| chrX | 452 | 118 492 391 |
| Total anchored | 2 197 | 2 564 200 597 |
| Unanchored | 1 616 | 102 561 400 |
We used Lachesis (Burton et al., 2013) to cluster, order, and direct the assembled contigs onto 31 pseudo-chromosomes, which were defined according to number of haploid chromosomes of the tree shrew (Liu et al., 1989). Contig No.: Number of contigs assembled onto each chromosome by Hi-C. Total anchored: total number of contigs that could be anchored into 31 pseudo-chromosomes. Unanchored: total number of contigs that could not be anchored into 31 pseudo-chromosomes.
Assembly quality score value statistics
| Parameter | Long-read assembly (TS_2.0) | Short-read assembly (TS_1.0) |
|---|---|---|
| Quality value | 28.56 | 26.75 |
| Translocation | 2 824 | 6 034 |
| Deletion | 3 733 | 12 607 |
| Duplication | 142 | 438 |
| Inversion | 80 | 99 |
| Errors Per 100 Mbp | 253.89 | 718.27 |
| HIGH_COV_PE | 12 016 | 66 655 |
| HIGH_NORM_COV_PE | 12 415 | 69 902 |
| HIGH_OUTIE_PE | 137 | 1 594 |
| HIGH_SINGLE_PE | 10 | 151 |
| HIGH_SPAN_PE | 1 237 | 32 751 |
| LOW_NORM_COV_PE | 536 | 72 38 |
| STRECH_PE | 31 741 | 66 763 |
| COMPR_PE | 13 818 | 20 437 |
Quality value was estimated based on number of non-matching base calls from FreeBayes (Garrison & Marth, 2012). Errors per 100 Mbp were calculated as a sum ratio of Lumpy (Layer et al., 2014) structural variants (SV) to a standardized genome size of 2.67 Gbp. FRC features (Vezzi et al., 2012) can assess assembly errors, including LOW_COV_PE: Low read coverage; HIGH_COV_PE: High read coverage; LOW_NORM_COV_PE: Low coverage of normal paired-end reads; HIGH_NORM_COV_PE: High coverage of normal paired-end reads; COMPR_PE: Areas with low CE statistics; STRECH_PE: Areas with high CE statistics; HIGH_SINGLE_PE: Regions with high numbers of unmapped pairs; HIGH_SPAN_PE: Regions with high numbers of discordant pairs that map to different contigs/scaffolds; HIGH_OUTIE_PE: Regions with high numbers of misoriented or distant pairs. With the exception of the QV score, lower counts are indicative of better assembly.
Gap closure statistics of the two genome assemblies
| Parameter | Long-read assembly TS_2.0) | Short-read assembly (TS_1.0) |
|---|---|---|
| Total number of gaps | 1 697 | 223 607 |
| Partially closed gap using TS_1.0 | 476 | – |
| Partially closed gap using TS_2.0 | – | 0 |
| Fully closed gap using TS_1.0 | 39 | – |
| Fully closed gap using TS_2.0 | – | 163 220 |
| Fully closed gap in genic region | 0 | 65 222 |
| Trans-scaffold gaps | 264 | 4 112 |
Partially closed gap: Gap in one assembly was filled by a scaffold of another assembly, but still had some ambiguous (N) bases within the filled region. Fully closed gap: Gap in one assembly was filled by a contig of another assembly, without any ambiguous (N) bases. Trans-scaffold gap: Flanking sequences of a gap were aligned to two separate scaffolds or pseudo-chromosomes, which was most likely to be assembly errors. –: Not available.
Comparison of transposable elements in Chinese tree shrews between short-read assembly (KIZ version 1: TS_1.0) and long-read assembly (KIZ version 2: TS_2.0)
| Type | Long-read assembly (TS_2.0) | Short-read assembly (TS_1.0) | ||
|---|---|---|---|---|
| Length (Mp) | % in genome | Length (Mp) | % in genome | |
| DNA | 96.8 | 3.6 | 76.6 | 2.7 |
| LINE | 553.3 | 20.8 | 295.2 | 10.3 |
| SINE | 663.1 | 24.9 | 527.2 | 18.8 |
| LTR | 138.0 | 5.2 | 113.1 | 4 |
| Other | 0.0005 | 0.0 | 0.06 | 0.002 |
| Unknown | 68.0 | 2.6 | 0.9 | 0.03 |
| Total | 1 310.5 | 49.1 | 1 001.9 | 35 |
DNA: Deoxyribonucleic acid transposon. LINE: Long interspersed nuclear element. SINE: Short interspersed nuclear element. LTR: Long terminal repeat.
Comparison of transposable element subtypes in Chinese tree shrews between short-read assembly (KIZ version 1: TS_1.0) and long-read assembly (KIZ version 2: TS_2.0)
| TE subtype | Long-read assembly (TS_2.0) | Short-read assembly (TS_1.0) | ||
|---|---|---|---|---|
| Length (Mp) | % in genome | Length (Mp) | % in genome | |
| DNA/En-Spm | 7.92 | 0.30 | 4.87 | 0.17 |
| DNA/hAT | 34.29 | 1.28 | 33.77 | 1.18 |
| DNA/TcMar | 52.11 | 1.96 | 26.90 | 0.94 |
| LINE/CR1 | 5.57 | 0.21 | 2.00 | 0.07 |
| LINE/L1 | 494.51 | 18.54 | 267.29 | 9.34 |
| LINE/L2 | 49.48 | 1.86 | 22.04 | 0.77 |
| LINE/Penelope | 2.02 | 0.08 | 2.29 | 0.08 |
| LTR/ERV1 | 37.66 | 1.41 | 31.77 | 1.11 |
| LTR/ERVK | 18.55 | 0.70 | 8.87 | 0.31 |
| LTR/ERVL | 78.13 | 2.93 | 68.40 | 2.39 |
| LTR/Gypsy | 2.59 | 0.10 | 2.86 | 0.1 |
| SINE/Alu | 10.60 | 0.40 | 3.15 | 0.11 |
| SINE/B4 | 3.02 | 0.11 | 1.72 | 0.06 |
| SINE/MIR | 47.40 | 1.78 | 23.75 | 0.83 |
| SINE/tRNA-Lys | 15.04 | 0.56 | 1.14 | 0.04 |
| SINE/Tu-III | 404.49 | 15.17 | 410.09 | 14.33 |
Comparison of Chinese tree shrew gene annotation between short-read assembly (KIZ version 1: TS_1.0) and long-read assembly (KIZ version 2: TS_2.0)
| Parameter | Long-read assembly (TS_2.0) | Short-read assembly (TS_1.0) |
|---|---|---|
| Total number of genes | 23 568 | 22 121 |
| Complete ORFs | 21 117 | 21 085 |
| Annotated genes | 20 811 | 20 225 |
| Average mRNA length | 40 114 | 33 712 |
| Average CDS length | 1 527 | 1 404 |
| Average exon number | 8.86 | 7.54 |
| Average exon length | 172 | 186 |
| Average intron length | 4 907 | 4 937 |
ORF: open reading frame. CDS: coding-region sequences.
Comparison of Chinese tree shrew gene functional annotation between short-read assembly (KIZ version 1: TS_1.0) and long-read assembly (KIZ version 2: TS_2.0)
| Functional annotation | Short-read assembly (TS_1.0) | Long-read assembly (TS_2.0) | ||
|---|---|---|---|---|
| No. | Percent (%) | No. | Percent (%) | |
| Total | 22 121 | – | 23 568 | – |
| InterPro | 17 420 | 78.7 | 17 534 | 74.4 |
| KEGG | 16 593 | 75.0 | 16 964 | 72.0 |
| Swissprot & TrEMBL | 20 225 | 91.4 | 20 811 | 88.3 |
| Unannotated | 1 896 | 8.6 | 2 309 | 11.7 |
InterPro (http://www.ebi.ac.uk/interpro/). KEGG (https://www.kegg.jp/). Swissprot & TrEMBL (https://web.expasy.org/docs/swiss-prot_guideline.html). –: Not available.
Figure 2Chinese tree shrew and human MHC genes
A: Synteny of MHC genes between Chinese tree shrews and humans. HLA class I & II genes are in red, other genes are in black. Tree shrew TS_2.0 assembly and human genome (hg38; https://www.ncbi.nlm.nih.gov/grc/human) were used for comparison. B: Phylogenetic relationship of MHC-class I genes in humans, tree shrews, and mice.
Figure 3Examples of structural variants in mouse, macaque, tree shrew, and human genomes
A: Chinese tree shrews and humans, but not mice, shared a specific genomic structure in the region from MYSM1 to SLC35D1. B: Chinese tree shrews and mice shared a specific genomic structure in the region from PRKAB2 to POLR3GL, which has undergone dramatic changes in humans. GPR89B (G protein-coupled receptor 89B) and NOTCH2NL (notch 2 N-terminal like A) genes, marked in green, were only present in the human genome. These genomes were retrieved from public domains (mouse: GRCm38; https://www.ncbi.nlm.nih.gov/grc/mouse; macaque: rheMac3(Yan et al., 2011); human: hg38; https://www.ncbi.nlm.nih.gov/grc/human) or generated in this study (tree shrew: TS_2.0).
Figure 4Overview of updated tree shrew database (TreeshrewDB version 2.0)
Inclusion of the high-quality reference genome assembly (TS_2.0) in TreeshrewDB version 2.0 provided a comprehensive update of gene annotation information, genomic variations, population genetic features, and mRNA expressions. Population genetic parameters (including π, Watterson theta estimate (θw) (Watterson, 1975), Tajima’s D (Tajima, 1989), Fu and Li’s D (Fu & Li, 1993), Fu and Li’s F (Fu & Li, 1993), and Fay and Wu’s H (Fay & Wu, 2000)) were estimated based on SNPs located in coding regions in the whole genome sequences of six wild tree shrews. The database is freely accessible at http://www.treeshrewdb.org.