Yao Ming1, Jianbo Jian1, Xueying Yu2, Jingzhen Wang3, Wenhua Liu4.
Abstract
The Indo-Pacific humpback dolphin (Sousa chinensis) is a threatened marine mammal listed in the First Order of the National Key Protected Wild Aquatic Animals List in China. However, limited genomic information is available for studies of its population genetics and biological conservation. Here, we have assembled a genomic sequence of this species using a whole genome shotgun (WGS) sequencing strategy after a pilot low-coverage genome survey. The total assembled genome size was 2.34 Gb, with a contig N50 of 67 kb and a scaffold N50 of 9 Mb (107.6-fold sequencing coverage). The S. chinensis genome contained 24,640 predicted protein-coding genes and approximately 37% repeated sequences. The completeness of the genome assembly was evaluated by benchmarking universal single-copy orthologous genes (BUSCOs): 94.3% of a total of 4,104 expected mammalian genes were identified as complete, and 2.3% were identified as fragmented. This newly produced high-quality genome assembly and annotation will greatly promote future studies of the genetic diversity, conservation and evolution of this species.
Year: 2019 PMID: 31118413 PMCID: PMC6531461 DOI: 10.1038/s41597-019-0078-6
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Fig. 1 Geographical distribution and photograph of S. chinensis. (a) Distribution of S. chinensis reported in Chinese waters and the sampling site of this study. (b) S. chinensis photographed during the boat surveys in Guangxi Beibu Gulf, China.
Comparison of the new genome with our previously published survey assembly of the S. chinensis genome.
| Content | Pilot study (previously published) | This study |
|---|---|---|
| Sequencing data and depth | 107.6 Gb (~32.9X clean data) | 290.5 Gb (~107.6X clean data) |
| The number of insert size libraries | 2 (500 bp and 2 Kb) | 6 (300 bp, 500 bp, 800 bp, 2 Kb, 5 Kb and 10 Kb) |
| Genome assembly methods | SOAPdenovo2 | Platanus v1.2.4 |
| Assembled genome size | 2.29 Gb | 2.34 Gb |
| Assembly quality | contig N50: 13 Kb; scaffold N50: 163 Kb | contig N50: 67 Kb; scaffold N50: 9 Mb |
| Assembly completeness evaluation (BUSCO) | 76% | 94.3% |
Statistics of raw and clean data.
| Pair-end Libraries | Insert Size | Reads Length (bp) | Raw Data (Gb) | Clean Data (Gb) | Sequence Depth (X) |
|---|---|---|---|---|---|
| | | 150 | 137.6 | 108.1 | 40 |
| | | 125 | 67 | 60.3 | 22.3 |
| | | 125 | 59 | 51.2 | 19 |
| | | 50 | 40.7 | 28.5 | 10.6 |
| | | 50 | 19 | 11.6 | 4.3 |
| | | 50 | 46.9 | 30.8 | 11.4 |
| Total | | | 370.2 | 290.5 | 107.6 |
Note: Sequence depth is calculated assuming a genome size of 2.7 Gb. *These data were used in the previous pilot study[26].
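The depth column can be reproduced from the clean data and the assumed genome size stated in the note. The following is a minimal sketch of that arithmetic, not the authors' pipeline; the variable and function names are illustrative, and the per-library values are copied from the table above.

```python
# Clean data per library in Gb, in the row order of the table above.
clean_gb = [108.1, 60.3, 51.2, 28.5, 11.6, 30.8]

# Genome size assumed in the table note (Gb).
ASSUMED_GENOME_SIZE_GB = 2.7


def sequence_depth(data_gb, genome_gb=ASSUMED_GENOME_SIZE_GB):
    """Fold coverage = sequenced bases / genome size (both in Gb)."""
    return data_gb / genome_gb


# Per-library depth and total depth, rounded to one decimal as in the table.
per_library = [round(sequence_depth(x), 1) for x in clean_gb]
total_depth = round(sequence_depth(sum(clean_gb)), 1)

print(per_library)
print(total_depth)
```

Under this assumption the total of 290.5 Gb of clean data corresponds to roughly 107.6X, matching the coverage quoted in the abstract.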
Statistics of the assembled sequence length.
| | Contig Length (bp) | Contig Number | Scaffold Length (bp) | Scaffold Number |
|---|---|---|---|---|
| N10 | 160,909 | 1,135 | 21,984,446 | 9 |
| N20 | 124,084 | 2,787 | 17,517,993 | 21 |
| N30 | 100,087 | 4,874 | 14,735,920 | 36 |
| N40 | 81,924 | 7,437 | 11,330,947 | 54 |
| N50 | 66,998 | 10,567 | 9,008,636 | 78 |
| N60 | 54,491 | 14,403 | 6,903,794 | 108 |
| N70 | 42,832 | 19,193 | 5,150,637 | 147 |
| N80 | 31,804 | 25,446 | 3,635,400 | 202 |
| N90 | 19,905 | 34,515 | 2,124,572 | 283 |
| Max length | 541,590 | | 40,839,098 | |
| Total length | 2,315,724,921 | 84,941 | 2,339,085,850 | 20,903 |
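For readers unfamiliar with the Nx rows, the sketch below shows the usual definition: sort sequences from longest to shortest and report the length at which the running total first reaches x% of the assembly, together with how many sequences were needed. This is a generic illustration on made-up lengths; the function name `nx` is hypothetical, and whether the authors' tooling uses exactly this tie-breaking convention is an assumption.

```python
def nx(lengths, fraction=0.5):
    """Return (Nx length, number of sequences needed) for a list of lengths.

    fraction=0.5 gives N50, fraction=0.9 gives N90, and so on.
    """
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("lengths must be a non-empty list of positive numbers")


# Toy assembly (hypothetical lengths in bp), total length 300.
toy_contigs = [100, 80, 60, 40, 20]
print(nx(toy_contigs, 0.5))  # N50 length and contig count
print(nx(toy_contigs, 0.9))  # N90 length and contig count
```

Read this way, the N50 row of the table says that the 10,567 longest contigs (or the 78 longest scaffolds) cover half of the assembled length.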
Evaluation of genome assembly completeness.
| BUSCO benchmark | Number | Percentage (%) |
|---|---|---|
| Complete BUSCOs | 3,870 | 94.3 |
| Complete and single-copy BUSCOs | 3,802 | 92.6 |
| Complete and duplicated BUSCOs | 68 | 1.7 |
| Fragmented BUSCOs | 94 | 2.3 |
| Missing BUSCOs | 140 | 3.4 |
| Total BUSCO groups searched | 4,104 | 100 |
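The percentages in this table follow directly from the counts. The sketch below recomputes them as a consistency check; the dictionary keys are illustrative names, and the counts are copied from the table above.

```python
# BUSCO category counts for the genome assembly, from the table above.
busco_counts = {
    "complete_single_copy": 3802,
    "complete_duplicated": 68,
    "fragmented": 94,
    "missing": 140,
}

# The four categories partition the searched BUSCO groups.
total_groups = sum(busco_counts.values())

# Percentage of each category, rounded to one decimal as in the table.
busco_percent = {
    name: round(100 * count / total_groups, 1)
    for name, count in busco_counts.items()
}

# "Complete" is the sum of the single-copy and duplicated categories.
complete = busco_counts["complete_single_copy"] + busco_counts["complete_duplicated"]
complete_percent = round(100 * complete / total_groups, 1)

print(total_groups, busco_percent, complete_percent)
```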
General statistics of repeats in genome.
| Type | Repeat Size (bp) | % of genome |
|---|---|---|
| Trf | 27,926,236 | 1.19 |
| Repeatmasker | 592,428,741 | 25.23 |
| Proteinmask | 67,881,250 | 2.89 |
| De novo | 813,811,498 | 34.66 |
| Total | 878,297,072 | 37.41 |
General statistics of predicted protein-coding genes (Note: The average transcript length does not include UTRs).
| Gene set | Number | Average transcript length (bp) | Average CDS length (bp) | Average exon per gene | Average exon length (bp) | Average intron length (bp) |
|---|---|---|---|---|---|---|
| | 30,592 | 17,124 | 1,122 | 6 | 182 | 3,101 |
| | 23,909 | 22,700 | 1,315 | 7 | 180 | 3,398 |
| | 27,223 | 20,725 | 1,260 | 7 | 180 | 3,251 |
| | 30,618 | 12,062 | 1,025 | 6 | 180 | 2,360 |
| | 27,938 | 13,517 | 1,682 | 6 | 298 | 2,546 |
| | 24,640 | 24,148 | 1,283 | 7 | 174 | 3,516 |
Statistics of function annotation.
| | Database | Number | Percent (%) |
|---|---|---|---|
| Total | | 24,640 | 100 |
| Annotated | InterPro | 21,313 | 86.50 |
| | GO | 15,120 | 61.36 |
| | KEGG | 19,276 | 78.23 |
| | Swissprot | 21,734 | 88.21 |
| | TrEMBL | 22,235 | 90.24 |
| Annotated overall | | 22,472 | 91.20 |
| Unannotated | | 2,168 | 8.80 |
Note: Five protein databases were used to assist in predicting gene function: InterPro, Gene Ontology (GO), KEGG, Swissprot and TrEMBL. The table shows the number of genes that matched each database.
Evaluation of genome annotation completeness.
| BUSCO benchmark | Number | Percentage (%) |
|---|---|---|
| Complete BUSCOs | 3,900 | 95.1 |
| Complete and single-copy BUSCOs | 3,803 | 92.7 |
| Complete and duplicated BUSCOs | 97 | 2.4 |
| Fragmented BUSCOs | 61 | 1.5 |
| Missing BUSCOs | 143 | 3.4 |
| Total BUSCO groups searched | 4,104 | 100 |
Statistics of the assembled sequence length of published cetacean genomes (S. chinensis included).
| Species | Assembled genome size (Gb) | Genome coverage (X) | Contig N50 (Kb) | Scaffold N50 (Kb) | Number of genes | Reference |
|---|---|---|---|---|---|---|
| | 2.3 | 154.3 | 34.8 | 877 | 22,677 | |
| | 2.44 | 128 | 22.6 | 12,800 | 20,605 | |
| | 2.53 | 114.6 | 30 | 2,260 | 22,168 | |
| | 2.37 | 200 | 70.3 | 12,735 | 27,924 | |
| S. chinensis | 2.34 | 107.6 | 67 | 9,008 | 24,640 | |
| Design Type(s) | sequence assembly objective • sequence annotation objective |
| Measurement Type(s) | whole genome sequencing assay |
| Technology Type(s) | DNA sequencing |
| Factor Type(s) | |
| Sample Characteristic(s) | Sousa chinensis • skin of body |