Dainan Cao1, Meng Wang2, Yan Ge1, Shiping Gong3.
Abstract
The big-headed turtle, Platysternon megacephalum, the sole member of the monotypic family Platysternidae, has a number of distinct characteristics, including an extra-large head, a long tail, a flat carapace, and a preference for low water temperatures. We performed whole genome sequencing, assembly, and gene annotation of an adult male big-headed turtle using the Illumina HiSeq X sequencing platform. We generated ~497.1 Gb of raw sequencing data (208.9× depth) and produced a draft genome with a total length of 2.32 Gb and contig and scaffold N50 sizes of 41.8 kb and 7.22 Mb, respectively. We also identified 924 Mb (39.84%) of repetitive sequences, 25,995 protein-coding genes, and 19,177 non-coding RNAs. This is the first de novo genome of the big-headed turtle; these data will be essential for understanding and exploring the genomic innovations and molecular mechanisms that contribute to its unique morphological and physiological features.
Year: 2019 PMID: 31097710 PMCID: PMC6522511 DOI: 10.1038/s41597-019-0067-9
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Fig. 1 A representative big-headed turtle, Platysternon megacephalum, in China.
Statistics of big-headed turtle genome sequencing data.
| Insert size | Libraries | Read length (bp) | Raw data (Gb) | Raw sequence coverage (×)* | Clean data (Gb) | Clean sequence coverage (×)* |
|---|---|---|---|---|---|---|
| 250 bp | 2 | 150 | 120.9 | 50.8 | 112.6 | 47.3 |
| 500 bp | 2 | 150 | 103.6 | 43.5 | 96.5 | 40.5 |
| 2 Kbp | 2 | 150 | 98.9 | 41.6 | 92.2 | 38.7 |
| 5 Kbp | 2 | 150 | 78.6 | 33.0 | 75.5 | 31.7 |
| 10 Kbp | 2 | 150 | 95.1 | 40.0 | 92.4 | 38.8 |
| Total | 10 | — | 497.1 | 208.9 | 469.2 | 197.1 |
*Sequence coverage was calculated based on the genome size of 2.38 Gb according to k-mer analysis.
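The coverage columns follow directly from the footnote: coverage is the data volume divided by the 2.38 Gb genome size. The sketch below recomputes them from the table values; the variable and function names are illustrative only and do not come from the original work.

```python
# Data volumes (Gb) per insert-size library, copied from the table above:
# (raw data, clean data)
GENOME_SIZE_GB = 2.38  # k-mer based genome size used in the table footnote

libraries = {
    "250 bp": (120.9, 112.6),
    "500 bp": (103.6, 96.5),
    "2 Kbp": (98.9, 92.2),
    "5 Kbp": (78.6, 75.5),
    "10 Kbp": (95.1, 92.4),
}


def coverage(data_gb, genome_size_gb=GENOME_SIZE_GB):
    """Sequence coverage (x) = total sequenced bases / genome size."""
    return data_gb / genome_size_gb


total_raw = sum(raw for raw, _ in libraries.values())
total_clean = sum(clean for _, clean in libraries.values())

for name, (raw, clean) in libraries.items():
    print(f"{name}: raw {coverage(raw):.1f}x, clean {coverage(clean):.1f}x")
print(f"Total: raw {coverage(total_raw):.1f}x, clean {coverage(total_clean):.1f}x")
```

Rounded to one decimal, the recomputed totals agree with the 208.9× and 197.1× reported in the table.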
Fig. 2 Distribution of 17-mer frequency. In total, 170.2 Gb of high-quality short-insert reads (350 bp) were used to generate the 17-mer depth-frequency distribution curve.
Estimation of the genome size using K-mer analysis.
| Kmer | K-mer number | K-mer Depth | Estimated genome size (Mb) | Heterozygous Rate (%) | Repeat Rate (%) |
|---|---|---|---|---|---|
| 17 | 134,817,220,976 | 56 | 2,383.87 | 0.33 | 53 |
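As a rough cross-check, a common first-order k-mer estimate divides the total k-mer count by the peak k-mer depth. The sketch below applies that textbook formula to the table values; it is an assumption that the reported size started from this ratio, and the record does not describe any correction that was applied.

```python
# Values copied from the k-mer table above.
KMER_NUMBER = 134_817_220_976
KMER_DEPTH = 56
REPORTED_SIZE_MB = 2_383.87


def naive_genome_size_mb(kmer_number, kmer_depth):
    """First-order estimate: genome size = k-mer count / peak k-mer depth."""
    return kmer_number / kmer_depth / 1e6


naive_mb = naive_genome_size_mb(KMER_NUMBER, KMER_DEPTH)
relative_gap = (naive_mb - REPORTED_SIZE_MB) / REPORTED_SIZE_MB

print(f"naive estimate: {naive_mb:.2f} Mb")
print(f"gap to reported value: {100 * relative_gap:.2f}%")
```

The uncorrected ratio comes out about 1% above the reported 2,383.87 Mb, so the published figure presumably includes an adjustment that this sketch does not reproduce.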
Summary of the genome assembly.
| Statistic | Contig length (bp) | Scaffold length (bp) | Contig number | Scaffold number |
|---|---|---|---|---|
| Total | 2,282,988,448 | 2,319,520,870 | 470,184 | 360,291 |
| Max | 453,655 | 40,162,337 | — | — |
| Number >= 2000 | — | — | 100,848 | 13,027 |
| N50* | 41,757 | 7,221,511 | 15,524 | 84 |
| N60 | 32,248 | 5,116,999 | 21,745 | 121 |
| N70 | 23,516 | 3,324,563 | 30,018 | 177 |
| N80 | 15,027 | 1,848,865 | 42,079 | 269 |
| N90 | 5,528 | 257,323 | 65,609 | 576 |
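*N50 is the length of the contig or scaffold at which the cumulative length, summed from the longest sequence to the shortest, first reaches half of the total assembly length; N60 to N90 are defined analogously.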
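The N50 to N90 rows can be reproduced from a list of sequence lengths. The following is a minimal sketch of the standard Nx calculation, not the authors' pipeline; the function name `nx` and the toy lengths are hypothetical. It returns both the Nx length and the number of sequences needed to reach the threshold, which is presumably what the two "number" columns report for those rows.

```python
def nx(lengths, percent=50):
    """Return (Nx length, number of sequences needed) for a list of lengths.

    Sequences are sorted from longest to shortest and accumulated until the
    running total reaches `percent` percent of the total length.
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer arithmetic avoids floating-point edge cases at the threshold.
        if running * 100 >= percent * total:
            return length, count
    raise ValueError("lengths must be a non-empty list of positive integers")


# Toy assembly (hypothetical lengths, total = 300).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
n50_length, n50_count = nx(toy_lengths, 50)
n90_length, n90_count = nx(toy_lengths, 90)
print(n50_length, n50_count)
print(n90_length, n90_count)
```

For the toy data, the two longest sequences already cover half of the total, so N50 is the length of the second sequence.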
Summary statistics of four turtle genomes.
| Statistic | | | | Big-headed turtle |
|---|---|---|---|---|
| Sequencing technology | Sanger + NGS | NGS | NGS | NGS |
| Assembly size (Gb) | 2.59 | 2.24 | 2.21 | 2.32 |
| Sequence coverage (×) | 18.0 | 82.3 | 105.6 | 204.2 |
| Contig N50 (kb) | 11.9 | 20.4 | 21.9 | 41.8 |
| Scaffold N50 (kb) | 5,212 | 3,778 | 3,331 | 7,222 |
| GC content (%) | 43 | 43.5 | 44.4 | 44.63 |
| Gene number | 21,796 | 19,633 | 19,327 | 22,400 |
Prediction of repeat elements in the big-headed turtle genome.
| Type | Repeat Size (bp) | % of genome |
|---|---|---|
| TRF | 47,338,094 | 2.04 |
| RepeatMasker | 874,588,835 | 37.71 |
| Proteinmask | 267,323,903 | 11.52 |
| Total | 924,094,854 | 39.84 |
Statistics of repeat elements in the big-headed turtle genome.
| Type | Length(bp) | % in Genome |
|---|---|---|
| DNA | 125,220,062 | 5.40 |
| LINE | 637,155,697 | 27.47 |
| SINE | 9,369,097 | 0.40 |
| LTR | 217,629,821 | 9.38 |
| Other | 52 | 0.00 |
| Satellite | 1,062,050 | 0.05 |
| Simple_repeat | 990,682 | 0.04 |
| Unknown | 15,561,759 | 0.67 |
| Total | 903,544,627 | 38.95 |
Statistics of predicted genes.
| Gene set | Method | Number | Average transcript length (bp) | Average CDS length (bp) | Average exons per gene | Average exon length (bp) | Average intron length (bp) |
|---|---|---|---|---|---|---|---|
| Ab initio | Augustus | 29,895 | 20,126.36 | 1,083.78 | 5.51 | 196.7 | 4,222.41 |
| Ab initio | GlimmerHMM | 163,252 | 12,105.64 | 451.4 | 3.91 | 115.37 | 4,001.41 |
| Ab initio | SNAP | 83,162 | 36,326.76 | 549.05 | 3.61 | 152.28 | 13,731.58 |
| Homolog prediction | | 57,621 | 9,467.09 | 794.29 | 3.45 | 230.02 | 3,535.42 |
| Homolog prediction | | 37,629 | 15,048.25 | 1,093.96 | 4.9 | 223.12 | 3,575.20 |
| Homolog prediction | | 50,264 | 11,904.42 | 997.94 | 4.25 | 234.74 | 3,354.46 |
| Homolog prediction | | 48,759 | 9,535.70 | 977.63 | 3.55 | 275.28 | 3,354.26 |
| Homolog prediction | | 45,081 | 11,902.30 | 950.46 | 3.91 | 243.04 | 3,762.56 |
| Homolog prediction | | 34,511 | 14,745.27 | 1,136.85 | 4.8 | 236.7 | 3,578.43 |
| Homolog prediction | | 45,954 | 11,607.88 | 1,038.09 | 3.75 | 276.62 | 3,839.62 |
| Homolog prediction | | 41,843 | 13,708.70 | 1,120.03 | 4.2 | 266.87 | 3,937.77 |
| Homolog prediction | | 46,338 | 11,270.86 | 981.39 | 4.05 | 242.04 | 3,368.40 |
| Homolog prediction | | 74,533 | 7,567.03 | 696.45 | 3.09 | 225.04 | 3,279.82 |
| Homolog prediction | | 38,766 | 11,659.15 | 1,052.39 | 4.05 | 260 | 3,480.37 |
| RNASeq | | 106,250 | 23,132.29 | 1,040.89 | 5.91 | 176.10 | 4,498.48 |
| RNASeq | | 101,071 | 32,403.47 | 3,407.83 | 6.79 | 502.11 | 5,010.49 |
| EVM | | 39,695 | 17,966.58 | 987.68 | 5.19 | 190.45 | 4,056.02 |
| Pasa-update* | | 39,212 | 21,180.40 | 1,029.81 | 5.39 | 191.11 | 4,591.49 |
| Final set* | | 25,995 | 30,713.64 | 1,298.80 | 7.33 | 177.28 | 4,649.67 |
*UTR regions are included.
Statistics of functional annotation.
| Type | Number | Percentage (%) |
|---|---|---|
| Total | 25,995 | — |
| NR | 22,357 | 86.0 |
| Swiss-Prot | 21,536 | 82.8 |
| KEGG | 19,560 | 75.2 |
| InterPro | 21,227 | 81.7 |
| Pfam | 19,277 | 74.2 |
| GO | 15,735 | 60.5 |
| Annotated | 22,400 | 86.2 |
| Unannotated | 3,595 | 13.8 |
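The percentage column is each database's hit count divided by the 25,995 genes in the final set. A small sketch that recomputes it from the table values (the dictionary and variable names are illustrative only):

```python
# Gene counts copied from the functional annotation table above.
TOTAL_GENES = 25_995

hits = {
    "NR": 22_357,
    "Swiss-Prot": 21_536,
    "KEGG": 19_560,
    "InterPro": 21_227,
    "Pfam": 19_277,
    "GO": 15_735,
    "Annotated": 22_400,
    "Unannotated": 3_595,
}

# Percentage of the final gene set with a hit in each database.
percentages = {name: round(100 * count / TOTAL_GENES, 1) for name, count in hits.items()}

for name, pct in percentages.items():
    print(f"{name}: {pct}%")

# Annotated and unannotated genes should partition the final gene set.
assert hits["Annotated"] + hits["Unannotated"] == TOTAL_GENES
```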
Summary of non-coding RNA.
| Type | Subtype | Number | Average length (bp) | Total length (bp) | % of genome |
|---|---|---|---|---|---|
| miRNA | | 16,050 | 85.58 | 1,373,585 | 0.05921 |
| tRNA | | 2,089 | 74.90 | 156,473 | 0.00675 |
| rRNA | Total | 409 | 160.37 | 65,591 | 0.00283 |
| rRNA | 18S | 77 | 164.01 | 12,629 | 0.00054 |
| rRNA | 28S | 272 | 173.51 | 47,196 | 0.00204 |
| rRNA | 5.8S | 4 | 113.50 | 454 | 0.00002 |
| rRNA | 5S | 56 | 94.86 | 5,312 | 0.00023 |
| snRNA | Total | 629 | 129.16 | 81,241 | 0.00350 |
| snRNA | CD-box | 179 | 97.50 | 17,452 | 0.00075 |
| snRNA | HACA-box | 132 | 142.79 | 18,848 | 0.00081 |
| snRNA | Splicing | 294 | 137.37 | 40,387 | 0.00174 |
Base content statistics of the genome.
| Type | Number (bp) | % of genome |
|---|---|---|
| A | 632,158,059 | 27.25 |
| T | 631,959,166 | 27.25 |
| C | 509,623,076 | 21.97 |
| G | 509,248,147 | 21.97 |
| N | 36,532,422 | 1.57 |
| Total | 2,319,520,870 | — |
| GC* | 1,018,871,223 | 44.63 |
*GC content of the genome without N.
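The footnote means that the GC percentage uses only the called bases (A, T, C, G) as its denominator, whereas the per-base percentages are relative to the full scaffold length including N. A short sketch recomputing both from the table (names are illustrative):

```python
# Base counts copied from the base content table above.
base_counts = {
    "A": 632_158_059,
    "T": 631_959_166,
    "C": 509_623_076,
    "G": 509_248_147,
    "N": 36_532_422,
}

total_length = sum(base_counts.values())       # full scaffold length, N included
non_n_length = total_length - base_counts["N"]  # called bases only
gc_count = base_counts["G"] + base_counts["C"]

# GC content is computed against the non-N length, per the footnote.
gc_percent = 100 * gc_count / non_n_length

# Per-base percentages are computed against the full length.
a_percent = 100 * base_counts["A"] / total_length

print(f"total length: {total_length:,} bp")
print(f"non-N length: {non_n_length:,} bp")
print(f"GC content (without N): {gc_percent:.2f}%")
print(f"A content: {a_percent:.2f}%")
```

The non-N length equals the total contig length in the assembly summary (2,282,988,448 bp), and the full length equals the total scaffold length (2,319,520,870 bp).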
Statistics of mapping ratio in genome.
| Type | Content | Value |
|---|---|---|
| Reads | Mapping rate (%) | 98.84 |
| Genome | Average sequencing depth (×) | 72.05 |
| Genome | Coverage (%) | 99.57 |
| Genome | Coverage at least 4× (%) | 98.73 |
| Genome | Coverage at least 10× (%) | 97.25 |
| Genome | Coverage at least 20× (%) | 95.38 |
Number and density of SNPs in big-headed turtle genome.
| Type | Number | Proportion (%) |
|---|---|---|
| All SNPs | 5,319,363 | 0.233 |
| Heterozygous SNPs | 4,999,745 | 0.219 |
| Homozygous SNPs | 319,618 | 0.014 |
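The proportions can be reproduced by dividing each SNP count by the number of non-N bases in the assembly (2,282,988,448 bp, the total scaffold length minus N). That choice of denominator is inferred from the numbers rather than stated in the record, so treat the sketch below as an assumption-laden check.

```python
# SNP counts copied from the table above.
snp_counts = {
    "All SNPs": 5_319_363,
    "Heterozygous SNPs": 4_999_745,
    "Homozygous SNPs": 319_618,
}

# Assumed denominator: called (non-N) bases, i.e. 2,319,520,870 - 36,532,422.
NON_N_BASES = 2_319_520_870 - 36_532_422


def snp_density_percent(count, genome_bases=NON_N_BASES):
    """SNP density as a percentage of the given number of bases."""
    return 100 * count / genome_bases


densities = {name: round(snp_density_percent(n), 3) for name, n in snp_counts.items()}

for name, pct in densities.items():
    print(f"{name}: {pct}%")

# Heterozygous and homozygous SNPs together account for all SNPs.
assert snp_counts["Heterozygous SNPs"] + snp_counts["Homozygous SNPs"] == snp_counts["All SNPs"]
```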
Assessment of CEGMA.
| Species | Complete: proteins | Complete: completeness (%) | Complete + Partial: proteins | Complete + Partial: completeness (%) |
|---|---|---|---|---|
| Big-headed turtle | 202 | 81.45 | 226 | 91.13 |
Assessment of BUSCO.
| Species | Size | BUSCO notation assessment results* |
|---|---|---|
| Big-headed turtle | 2320 Mb | C: 95.2% [S: 94.2%, D: 1.0%], F: 2.6%, M: 2.2%, n: 2586 |
*C: Complete BUSCOs; S: Complete and single-copy BUSCOs; D: Complete and duplicated BUSCOs; F: Fragmented BUSCOs; M: Missing BUSCOs; n: Total BUSCO groups searched.
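The BUSCO notation packs several percentages into one string, with C = S + D and C + F + M = 100%. The sketch below parses the string from the table and checks those two identities; `parse_busco` is a hypothetical helper written for this illustration and is not part of BUSCO itself.

```python
import re


def parse_busco(summary):
    """Parse a BUSCO short summary string into a dict of category values.

    Percent categories (C, S, D, F, M) become floats; n becomes an int.
    """
    fields = dict(re.findall(r"([CSDFMn]):\s*([\d.]+)", summary))
    return {key: (int(value) if key == "n" else float(value)) for key, value in fields.items()}


busco = parse_busco("C: 95.2% [S: 94.2%, D: 1.0%], F: 2.6%, M: 2.2%, n: 2586")
print(busco)

# Complete = single-copy + duplicated (within rounding).
complete_matches = abs(busco["C"] - (busco["S"] + busco["D"])) < 0.05
# Complete + fragmented + missing should cover all searched groups.
sums_to_100 = abs(busco["C"] + busco["F"] + busco["M"] - 100.0) < 0.05
print(complete_matches, sums_to_100)
```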
| Design Type(s) | sequence annotation objective • sequence assembly objective |
| Measurement Type(s) | whole genome sequencing assay |
| Technology Type(s) | next generation DNA sequencing |
| Factor Type(s) | |
| Sample Characteristic(s) | Platysternon megacephalum • stream |