Baohua Chen, Zhixiong Zhou, Qiaozhen Ke, Yidi Wu, Huaqiang Bai, Fei Pu, Peng Xu.
Abstract
Larimichthys crocea is a marine fish endemic to East Asia that belongs to the family Sciaenidae in the order Perciformes. L. crocea is now recognized as an "iconic" marine fish species in China: it is a popular food fish, a representative victim of overfishing, and still a source of high-value fish products supported by the modern large-scale mariculture industry. Here, we report a chromosome-level reference genome of L. crocea generated using PacBio single-molecule real-time (SMRT) sequencing and high-throughput chromosome conformation capture (Hi-C) technologies. The genome sequences were assembled into 1,591 contigs with a total length of 723.86 Mb and a contig N50 length of 2.83 Mb. After chromosome-level scaffolding, 24 scaffolds were constructed with a total length of 668.67 Mb (92.48% of the total length). Genome annotation identified 23,657 protein-coding genes and 7,262 ncRNAs. This highly accurate, chromosome-level reference genome of L. crocea provides an essential genomic resource to support the development of genome-scale selective breeding and restocking strategies for L. crocea.
Year: 2019 PMID: 31575853 PMCID: PMC6773841 DOI: 10.1038/s41597-019-0194-3
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Summary of obtained data using multiple sequencing technologies.
| Library Type | Insert Size (bp) | Raw Data (Gb) | Clean Data (Gb) | Average Read Length of Raw Reads (bp) | Sequencing Coverage (X) |
|---|---|---|---|---|---|
| Illumina | 250 | 105.23 | 105.01 | 150 | 148.54 |
| PacBio | 20,000 | 80.61 | — | 8,530.75 | 113.78 |
| Hi-C | — | 119.15 | 58.97 | 150 | 168.18 |
| Total | — | 304.99 | — | — | 430.50 |
Note: The genome size of L. crocea used to calculate sequencing coverage was 708.47 Mbp, which was estimated using a K-mer analysis of the short reads.
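The coverage column can be reproduced from the note's genome-size estimate. The following is a minimal sketch, assuming coverage is raw data volume divided by the 708.47 Mbp estimate and that 1 Gb equals 1,000 Mb; the function and variable names are illustrative and not taken from the paper. Results agree with the table to within rounding in the second decimal place.

```python
# Sketch: sequencing coverage = raw data volume / estimated genome size.
GENOME_SIZE_MBP = 708.47  # k-mer estimate quoted in the table note

# Raw data per library in Gb, copied from the table above.
RAW_DATA_GB = {"Illumina": 105.23, "PacBio": 80.61, "Hi-C": 119.15}


def coverage(raw_gb: float, genome_mbp: float = GENOME_SIZE_MBP) -> float:
    """Return fold coverage, assuming 1 Gb = 1,000 Mb (an assumption)."""
    return raw_gb * 1000.0 / genome_mbp


per_library = {lib: round(coverage(gb), 2) for lib, gb in RAW_DATA_GB.items()}
total_coverage = round(coverage(sum(RAW_DATA_GB.values())), 2)

for lib, cov in per_library.items():
    print(f"{lib}: {cov}X")
print(f"Total: {total_coverage}X")
```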
Fig. 1 Illustration of the complete genome assembly pipeline.
Summary of the L. crocea genome assembly and structural annotation.
| Statistic | Value |
|---|---|
| Genome assembly | |
| Contig N50 length (Mbp) | 2.83 |
| Number of contigs longer than N50 | 68 |
| Contig N90 size (Kbp) | 0.26 |
| Number of contigs longer than N90 | 376 |
| Number of contigs | 1,591 |
| Maximum contig length (Mbp) | 11.8 |
| Median contig length (Mbp) | 0.64 |
| Total contig length (Mbp) | 723.86 |
| Structural annotation | |
| Number of protein-coding genes | 23,172 |
| Number of unannotated genes | 73 |
| Average transcript length (bp) | 11,839.98 |
| Average exons per gene | 9.27 |
| Average exon length (bp) | 158.16 |
| Average CDS length (bp) | 1,465.51 |
| Average intron length (bp) | 1,255.04 |
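For readers unfamiliar with the N50 and N90 statistics reported above, here is a minimal sketch of the standard definition: sort contigs from longest to shortest and report the length at which the running total first reaches the chosen fraction of the assembly. The helper name `nxx` and the toy contig lengths are hypothetical; the paper's actual contig lengths are not listed here, and the authors' tooling may differ in edge cases.

```python
from typing import Iterable, Tuple


def nxx(lengths: Iterable[int], fraction: float = 0.5) -> Tuple[int, int]:
    """Return (Nxx length, number of contigs needed to reach it).

    fraction=0.5 gives N50, fraction=0.9 gives N90.
    """
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("lengths must be non-empty")


# Toy example with made-up contig lengths (not data from the paper).
toy_contigs = [10, 8, 6, 4, 2]
n50_length, n50_count = nxx(toy_contigs, 0.5)
n90_length, n90_count = nxx(toy_contigs, 0.9)
print(n50_length, n50_count)
print(n90_length, n90_count)
```

The second value in each pair is the count of contigs at or above the threshold length, which is the kind of quantity the "Number of contigs longer than N50" row reports.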
Detailed results of chromosome-level scaffolding using Hi-C technology.
| Chromosomes | Length (Mbp) | Number of Contigs |
|---|---|---|
| Chr1 | 34.89 | 34 |
| Chr2 | 24.81 | 19 |
| Chr3 | 28.07 | 17 |
| Chr4 | 29.96 | 22 |
| Chr5 | 33.77 | 25 |
| Chr6 | 24.87 | 16 |
| Chr7 | 31.52 | 27 |
| Chr8 | 32.80 | 24 |
| Chr9 | 24.26 | 18 |
| Chr10 | 27.49 | 16 |
| Chr11 | 34.65 | 24 |
| Chr12 | 26.70 | 25 |
| Chr13 | 16.24 | 24 |
| Chr14 | 29.81 | 21 |
| Chr15 | 27.79 | 19 |
| Chr16 | 20.01 | 23 |
| Chr17 | 25.06 | 18 |
| Chr18 | 32.81 | 20 |
| Chr19 | 29.92 | 30 |
| Chr20 | 32.24 | 39 |
| Chr21 | 27.85 | 20 |
| Chr22 | 27.44 | 11 |
| Chr23 | 23.57 | 27 |
| Chr24 | 22.13 | 29 |
| Linked Total | 668.67 | 548 |
| Unlinked Total | 54.39 | 1,043 |
| Linked Percent (%) | 92.48 | 34.44 |
| Total | 723.06 | 1,591 |
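The linked percentages follow directly from the linked and unlinked totals in the table. A minimal sketch of that arithmetic, with illustrative variable names:

```python
# Totals copied from the scaffolding table above.
linked_length_mbp = 668.67
unlinked_length_mbp = 54.39
linked_contigs = 548
unlinked_contigs = 1043


def percent(part: float, whole: float) -> float:
    """Return part as a percentage of whole, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


linked_length_pct = percent(
    linked_length_mbp, linked_length_mbp + unlinked_length_mbp
)
linked_contig_pct = percent(linked_contigs, linked_contigs + unlinked_contigs)

print(f"Linked length: {linked_length_pct}%")
print(f"Linked contigs: {linked_contig_pct}%")
```

Roughly a third of the contigs carry more than nine tenths of the sequence, which suggests that the unplaced contigs are mostly short.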
List of RNA-seq datasets used for gene structural prediction.
| Run | Tissue | Sample Name | Study | BioProject | MBases | Load Date |
|---|---|---|---|---|---|---|
| SRR6474596 | gonad | Male5 | SRP128079 | PRJNA368644 | 3,824 | 2018/1/15 |
| SRR6474594 | gonad | Female3 | SRP128079 | PRJNA368644 | 4,845 | 2018/1/15 |
| SRR6474588 | gonad | Female5 | SRP128079 | PRJNA368644 | 4,052 | 2018/1/15 |
| SRR6474586 | gonad | Male4 | SRP128079 | PRJNA368644 | 3,742 | 2018/1/15 |
| SRR5121288 | embryo | pharyngula | SRP095312 | PRJNA357970 | 4,399 | 2016/12/23 |
| SRR5121287 | embryo | gastrulation | SRP095312 | PRJNA357970 | 4,392 | 2016/12/23 |
| SRR5121286 | embryo | 1_cell_embryo | SRP095312 | PRJNA357970 | 4,567 | 2016/12/23 |
| SRR5121204 | embryo | blastula_L1 | SRP095312 | PRJNA357970 | 4,695 | 2016/12/23 |
| SRR5121203 | embryo | 256_cell_embryo_L1 | SRP095312 | PRJNA357970 | 4,730 | 2016/12/23 |
| SRR5121202 | embryo | 16_cell_embryo_L1 | SRP095312 | PRJNA357970 | 4,688 | 2016/12/23 |
| SRR5121194 | embryo | 8_cell_embryo_L1 | SRP095312 | PRJNA357970 | 4,425 | 2016/12/23 |
| SRR5121193 | embryo | 2_cell_embryo_L1 | SRP095312 | PRJNA357970 | 4,495 | 2016/12/23 |
| SRR5000825 | spleen | BS24h | SRP092778 | PRJNA340054 | 5,229 | 2016/11/7 |
| SRR5000824 | spleen | BS0h | SRP092778 | PRJNA340054 | 5,278 | 2016/11/7 |
| SRR3711298 | liver | The raw sequence reads of | SRP076957 | PRJNA326556 | 4,758 | 2016/6/27 |
| SRR3711297 | liver | The raw sequence reads of | SRP076957 | PRJNA326556 | 4,878 | 2016/6/27 |
| SRR2984347 | skin | stress_0.5h_1 | SRP066525 | PRJNA303096 | 2,963 | 2015/12/11 |
| SRR2984346 | skin | control | SRP066525 | PRJNA303096 | 2,913 | 2015/12/11 |
| SRR2473991 | muscle | GSM1890206 | SRP063956 | PRJNA296537 | 5,073 | 2015/9/21 |
| SRR2473990 | muscle | GSM1890205 | SRP063956 | PRJNA296537 | 6,310 | 2015/9/21 |
| SRR1509885 | mixture | a composite sample of large yellow croaker | SRP044199 | PRJNA254539 | 6,122 | 2014/7/10 |
| SRR1284627 | brain | GSM1385502 | SRP041934 | PRJNA246784 | 6,144 | 2015/12/29 |
| SRR1284623 | brain | GSM1385498 | SRP041934 | PRJNA246784 | 4,399 | 2015/9/13 |
Fig. 2 Circos plot of 24 chromosome-level scaffolds, representing annotation results of genes, ncRNAs and transposable elements on these scaffolds. The tracks from inside to outside are: 24 chromosome-level scaffolds, gene abundance of positive strand (red), gene abundance of negative strand (blue), TE abundance of positive strand (orange), TE abundance of negative strand (green), ncRNA abundance of both strands, and contigs that comprised the scaffolds (adjacent contigs on a scaffold are shown in different colours).
Detailed results of ncRNA annotation.
| Type | Subtype | Copy | Average Length (bp) | Total Length (bp) | Proportion in Genome (‰) |
|---|---|---|---|---|---|
| miRNA | | 1,246 | 100.90 | 125,725 | 0.17 |
| tRNA | | 3,517 | 75.58 | 265,811 | 0.37 |
| rRNA | 18S | 68 | 227.37 | 15,461 | 0.02 |
| rRNA | 28S | 70 | 208.07 | 14,565 | 0.02 |
| rRNA | 5.8S | 1 | 45 | 45 | 0.00 |
| rRNA | 5S | 1,619 | 111.3 | 180,190 | 0.25 |
| rRNA | Subtotal | 1,758 | 119.6 | 210,261 | 0.29 |
| snRNA | C/D-box | 153 | 118.72 | 18,164 | 0.03 |
| snRNA | H/ACA-box | 119 | 156.36 | 18,607 | 0.03 |
| snRNA | Splicing | 469 | 124.25 | 58,271 | 0.08 |
| snRNA | Subtotal | 741 | 129.85 | 95,042 | 0.14 |
| Total | | 7,262 | 95.96 | 696,839 | 0.97 |
Note: The genome size of L. crocea was estimated to be 708.47 Mbp by genome K-mer analysis.
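The overall ncRNA totals can be cross-checked against the abstract's count of 7,262 ncRNAs by summing the four category rows (miRNA, tRNA, and the rRNA and snRNA subtotals). The sketch below does only that aggregation; the dictionary layout and names are illustrative, and the average length is taken to be total length divided by copy number.

```python
# (copies, total length in bp) per ncRNA category, copied from the table above.
NCRNA = {
    "miRNA": (1246, 125_725),
    "tRNA": (3517, 265_811),
    "rRNA": (1758, 210_261),   # rRNA subtotal row
    "snRNA": (741, 95_042),    # snRNA subtotal row
}

total_copies = sum(copies for copies, _ in NCRNA.values())
total_length_bp = sum(length for _, length in NCRNA.values())
# Assumption: average length is total length divided by copy number.
average_length_bp = round(total_length_bp / total_copies, 2)

print(total_copies, total_length_bp, average_length_bp)
```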
Detailed classification of repeat sequences.
| Type | De novo: Length (Mbp) | De novo: Proportion in Genome (%) | TE proteins: Length (Mbp) | TE proteins: Proportion in Genome (%) | Combined TEs: Length (Mbp) | Combined TEs: Proportion in Genome (%) |
|---|---|---|---|---|---|---|
| DNA | 66.39 | 9.17 | 5.58 | 0.77 | 69.11 | 9.54 |
| LINE | 45.38 | 6.26 | 14.50 | 2.00 | 51.37 | 7.09 |
| SINE | 3.45 | 0.48 | 0.00 | 0.00 | 3.45 | 0.48 |
| LTR | 51.19 | 7.07 | 9.51 | 1.31 | 52.41 | 7.24 |
| Simple Repeat | 16.86 | 2.33 | 0.00 | 0.00 | 16.86 | 2.33 |
| Unknown | 11.85 | 1.64 | 0.00 | 0.00 | 11.85 | 1.64 |
| Total | 183.50 | 25.33 | 29.51 | 4.07 | 189.27 | 26.13 |
Note: “De novo” represents the de novo identified transposable elements using RepeatMasker, RepeatModeler, RepeatScout, and LTR_FINDER. “TE proteins” indicates homologous transposable elements in Repbase identified with RepeatProteinMask, while “Combined TEs” refers to the combined results of transposable elements identified in these two ways. “Unknown” represents transposable elements that could not be classified by RepeatMasker.
Fig. 3 Divergence distribution of TEs in the L. crocea genome.
Details of accuracy and completeness validation of genome assembly.
| Metric | Value | Proportion (%) |
|---|---|---|
| Mapping ratio | 97.61% | |
| Mapping coverage | 99.89% | |
| Number of heterozygous SNPs | 3,735,880 | |
| Number of homozygous SNPs | 3,568 | |
| | | |
| Total number of reference genes | 233 | |
| Number of completely assembled CEGs | 231 | |
| Proportion of completely assembled CEGs (%) | 99.14 | |
| Number of assembled CEGs | 232 | |
| Proportion of assembled CEGs (%) | 99.57 | |
| | | |
| All orthologues used | 4,584 | 100.00 |
| Complete and fragmented orthologues | 4,419 | 97.1 |
| Missing orthologues | 135 | 2.9 |
| | | |
| All orthologues used | 4,584 | 100.00 |
| Complete and fragmented orthologues | 4,182 | 91.2 |
| Missing orthologues | 402 | 8.8 |
| Measurement(s) | reference genome data |
| Technology Type(s) | DNA sequencing |
| Sample Characteristic - Organism | Larimichthys crocea |