Xi-Wen Xu, Weiwei Zheng, Zhen Meng, Wenteng Xu, Yingjie Liu, Songlin Chen.
Abstract
Turbot (Scophthalmus maximus), a commercially important flatfish species, is widely cultivated in Europe and China. As intensive farming continues to expand, turbot is exposed to various stresses, which greatly impedes the healthy development of the turbot industry. Here, we present an improved, high-quality chromosome-scale genome assembly of turbot generated with a combination of PacBio long-read and Illumina short-read sequencing technologies. The genome assembly spans 538.22 Mb and comprises 27 contigs with a contig N50 of 25.76 Mb. Annotation of the assembly identified 104.45 Mb of repetitive sequences, 22,442 protein-coding genes and 3,345 ncRNAs. Moreover, a total of 345 stress-responsive candidate genes were identified by gene co-expression network analysis based on 14 published stress-related RNA-seq datasets comprising 165 samples. The significantly improved genome assembly and the stress-related candidate gene pool will provide valuable resources for further research on turbot functional genomics and stress response mechanisms, as well as theoretical support for the development of molecular breeding technology for resistant turbot varieties.
Year: 2022 PMID: 35768602 PMCID: PMC9243025 DOI: 10.1038/s41597-022-01458-4
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 8.501
Data statistics of whole genome sequencing reads of S. maximus.
| Library Type | Sequencing Platform | Insert Size (bp) | Raw data (Gb) | Sequence coverage (X) |
|---|---|---|---|---|
| Illumina | Illumina HiSeq 4000 | 350 | 51.80 | 90 |
| PacBio | PacBio Sequel II | 20,000 | 150.30 | 265 |
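The coverage column is the raw data volume divided by a genome size. The record does not state which genome size was used, so the sketch below back-calculates the size implied by each row; the function names are illustrative and not taken from the source.

```python
def coverage(raw_gb: float, genome_gb: float) -> float:
    """Fold coverage = sequenced bases / genome size (both in Gb)."""
    return raw_gb / genome_gb


def implied_genome_gb(raw_gb: float, reported_coverage: float) -> float:
    """Genome size (Gb) implied by a raw-data volume and a reported coverage."""
    return raw_gb / reported_coverage


# (raw data in Gb, reported coverage) taken from the table above.
libraries = {"Illumina": (51.80, 90), "PacBio": (150.30, 265)}

for name, (raw_gb, reported) in libraries.items():
    size_mb = implied_genome_gb(raw_gb, reported) * 1000
    print(f"{name}: implied genome size ~{size_mb:.0f} Mb")
```

Both rows imply a genome size somewhat larger than the 538.22 Mb assembly, which would be consistent with coverage having been computed against a separate genome-size estimate; that reading is an inference from the arithmetic, not a statement in the source.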
Fig. 1 The workflows of genome assembly and gene co-expression network inference used in this study. (a) The genome assembly and annotation pipeline. (b) The gene co-expression network inference and analysis pipeline.
Comparison of the S. maximus genome assembly from this study with previous assemblies.
| Genome assembly | This study | Martínez | Xu (female) | Xu (male) | Figueras |
|---|---|---|---|---|---|
| Scaffold N50 (Mb) | 25.76 | 25.95 | 25.17 | 5.93 | 24.81 |
| Contig N50 (Mb) | 25.76 | 20.47 | 0.028 | 0.045 | 0.054 |
| Total scaffold number | 27 | 127 | 28,256 | 9,724 | 22 |
| Total contig number | 27 | 178 | 65,796 | 36,500 | 21,326 |
| Total length (Mb) | 538.22 | 556.70 | 568.47 | 587.19 | 524.98 |
| GC Content (%) | 43.53 | 43.30 | 43.42 | 43.70 | 43.30 |
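For readers unfamiliar with the N50 rows above: N50 is the length L such that sequences of length L or longer together cover at least half of the assembly. A minimal Python sketch follows, using made-up contig lengths (the real turbot contig lengths are not listed in this record).

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest and accumulated until
    the running sum reaches half of the total length; the length of the
    sequence that crosses that threshold is the N50.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical contig lengths in Mb, for illustration only.
toy_contigs = [30, 25, 20, 10, 10, 5]
print(n50(toy_contigs))
```

When every scaffold consists of a single contig, as the equal scaffold and contig counts (27) in the "This study" column suggest, scaffold N50 and contig N50 coincide.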
Classified statistics of repeat sequences of S. maximus.
| Type | RepBase TEs length (bp) | RepBase TEs % in genome | TE proteins length (bp) | TE proteins % in genome | De novo length (bp) | De novo % in genome | Combined TEs length (bp) | Combined TEs % in genome |
|---|---|---|---|---|---|---|---|---|
| DNA | 38,217,303 | 7.10 | 2,321,886 | 0.43 | 23,128,062 | 4.30 | 54,159,141 | 10.06 |
| LINE | 13,026,936 | 2.42 | 6,871,234 | 1.28 | 7,405,321 | 1.38 | 16,693,988 | 3.10 |
| SINE | 2,309,601 | 0.43 | 0 | 0 | 857,212 | 0.16 | 2,740,574 | 0.51 |
| LTR | 11,363,027 | 2.11 | 2,222,887 | 0.41 | 4,790,157 | 0.89 | 15,901,294 | 2.95 |
| Satellite | 2,989,136 | 0.56 | 0 | 0 | 499,041 | 0.09 | 3,462,111 | 0.64 |
| Simple_repeat | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Other | 2,814 | 0 | 135 | 0 | 0 | 0 | 2,949 | 0 |
| Unknown | 537,749 | 0.10 | 13,890 | 0 | 23,176,727 | 4.31 | 23,566,810 | 4.38 |
| Total | 58,685,000 | 10.90 | 11,419,271 | 2.12 | 58,413,629 | 10.85 | 104,452,847 | 19.41 |
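The "% in genome" values can be checked against the assembly length. The sketch below assumes the 538.22 Mb figure from the abstract as the denominator (the exact base-pair count is not given in this record) and uses the last length column of a few rows of the table above.

```python
ASSEMBLY_BP = 538.22e6  # assumption: 538.22 Mb assembly length from the abstract


def percent_of_genome(length_bp: float, assembly_bp: float = ASSEMBLY_BP) -> float:
    """Share of the assembly covered by a repeat class, in percent."""
    return 100 * length_bp / assembly_bp


# Lengths (bp) from the last length column of the table, selected rows.
combined_bp = {
    "DNA": 54_159_141,
    "LINE": 16_693_988,
    "SINE": 2_740_574,
    "LTR": 15_901_294,
    "Total": 104_452_847,
}

for repeat_class, length_bp in combined_bp.items():
    print(f"{repeat_class}: {percent_of_genome(length_bp):.2f}%")
```

The total reproduces the 104.45 Mb (19.41%) of repetitive sequence quoted in the abstract.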
General statistics of predicted protein-coding genes in the S. maximus genome.
| Gene set | Protein-coding gene number | Average gene length (bp) | Average CDS length (bp) | Average exons per gene | Average exon length (bp) | Average intron length (bp) |
|---|---|---|---|---|---|---|
| Genscan | 30,320 | 12,927 | 1,595 | 8.92 | 178.87 | 1,431 |
| AUGUSTUS | 40,007 | 8,114 | 1,220 | 6.53 | 186.85 | 1,246 |
| Homolog | 38,658 | 12,345 | 1,120 | 6.69 | 167.55 | 1,974 |
| Homolog | 40,864 | 12,956 | 1,153 | 6.74 | 171.11 | 2,056 |
| Homolog | 35,093 | 11,413 | 1,114 | 6.89 | 161.56 | 1,748 |
| Homolog | 39,404 | 14,059 | 1,167 | 6.69 | 174.59 | 2,267 |
| Homolog | 37,758 | 12,151 | 1,163 | 6.84 | 169.92 | 1,880 |
| Homolog | 40,717 | 14,894 | 1,149 | 6.36 | 180.75 | 2,566 |
| Homolog | 48,770 | 12,386 | 950.24 | 5.65 | 168.14 | 2,458 |
| trans.orf/RNAseq | 16,356 | 19,894 | 2,040 | 12.87 | 358.55 | 1,287 |
| MAKER | 22,442 | 15,828 | 1,703 | 10.51 | 327.83 | 1,302 |
Fig. 2 Comparisons of gene features among S. maximus, Anabas testudineus, Cynoglossus semilaevis, Danio rerio, Gasterosteus aculeatus, Oryzias latipes, Scophthalmus maximus and Takifugu rubripes. (a) Gene length distributions of the species. (b) CDS length distributions of the species. (c) Exon length distributions of the species. (d) Intron length distributions of the species.
General statistics of gene function annotation of S. maximus.
| Type | Number | Percent (%) |
|---|---|---|
| Total | 22,442 | |
| Annotated | 21,360 | 95.18 |
| InterPro | 19,732 | 87.92 |
| GO | 15,096 | 67.27 |
| KEGG_ALL | 20,917 | 93.20 |
| KEGG_KO | 13,810 | 61.54 |
| Swissprot | 19,137 | 85.27 |
| TrEMBL | 21,313 | 94.97 |
| TF | 3,328 | 14.83 |
| Pfam | 19,126 | 85.22 |
| NR | 21,065 | 93.86 |
| KOG | 17,738 | 79.04 |
| Unannotated | 1,082 | 4.82 |
General statistics of non-coding annotation of S. maximus.
| Type | Subtype | Copy | Average length (bp) | Total length (bp) | % of genome |
|---|---|---|---|---|---|
| miRNA | | 430 | 85 | 36,407 | 0.006764 |
| tRNA | | 1,796 | 75 | 134,264 | 0.024946 |
| rRNA | Total | 538 | 138 | 74,432 | 0.013829 |
| rRNA | 18S | 6 | 1,849 | 11,094 | 0.002061 |
| rRNA | 28S | 0 | 0 | 0 | 0 |
| rRNA | 5.8S | 8 | 156 | 1,247 | 0.000232 |
| rRNA | 5S | 524 | 118 | 62,091 | 0.011536 |
| snRNA | Total | 581 | 137 | 79,403 | 0.014753 |
| snRNA | CD-box | 193 | 121 | 23,313 | 0.004332 |
| snRNA | HACA-box | 75 | 151 | 11,302 | 0.002100 |
| snRNA | splicing | 306 | 141 | 43,069 | 0.008002 |
| snRNA | scaRNA | 7 | 246 | 1,719 | 0.000319 |
Overview of the RNA-seq datasets used in this study.
| Stress | Stress detail | SRA Study | SRA experiments | Number of individuals | Platform (Illumina) | Size (GB) |
|---|---|---|---|---|---|---|
| Crowding | — | SRP129900 | 12 | 500 | HiSeq 4000 | 68.20 |
| Feeding | | SRP188583 | 15 | 300 | HiSeq 4000 | 115.45 |
| Feeding | fish meal, soybean meal | SRP074811 | 2 | 360 | NextSeq 500 | 42.56 |
| Feeding | sodium butyrate, soybean meal | SRP275545 | 6 | 270 | HiSeq 2000 | 50.23 |
| Heat | — | SRP152627 | 10 | — | HiSeq 4000 | 88.99 |
| Oxygen | — | SRP167318 | 9 | 9 | HiSeq 2500 | 58.99 |
| Pathogens | | SRP308109 | 49 | 280 | HiSeq 4000 | 381.62 |
| Pathogens | | SRP255305 | 10 | 120 | HiSeq 4000 | 17.55 |
| Pathogens | | SRP065375 | 12 | — | HiSeq 2000 | 31.48 |
| Pathogens | | SRP050607 | 12 | 120 | HiSeq 2000 | 36.02 |
| Pathogens | | SRP191266 | 4 | 90 | HiSeq 2500 | 53.34 |
| Salinity | — | SRP277001 | 6 | 360 | HiSeq 4000 | 49.35 |
| Salinity | | SRP238143 | 9 | 180 | HiSeq 2000 | 70.48 |
| Salinity | | SRP153594 | 9 | — | HiSeq 4000 | 70.86 |
| Total | — | — | 165 | — | — | 1135.12 |
Fig. 3 Gene co-expression network analysis of different stresses. (a) Cluster dendrogram of genes and modules. The branches and color bands represent the assigned modules; the tips of the branches represent genes. (b) Correlation between modules and stresses. The value in each box is the correlation coefficient. Coefficients marked with ** or *** indicate an extremely significant correlation, and those marked with * indicate a significant correlation.
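Panel (b) reports a correlation coefficient for each module-stress pair. This record does not describe how those coefficients were computed. One common approach, shown here only as an illustrative sketch, is the Pearson correlation between a per-sample summary profile of a module and a 0/1 indicator of whether each sample belongs to a given stress condition. The profile and indicator values below are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-sample summary of one module (e.g. a module eigengene)
# and a 0/1 indicator marking which samples come from a heat-stress study.
module_profile = [0.9, 1.1, 1.0, -1.0, -0.9, -1.1]
heat_indicator = [1, 1, 1, 0, 0, 0]

r = pearson(module_profile, heat_indicator)
print(f"module-stress correlation r = {r:.3f}")
```

A coefficient near +1 or -1 would mark the module as strongly associated with that stress; the significance stars in the figure would additionally require a p-value, which this sketch does not compute.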
| Measurement(s) | whole genome sequencing |
| Technology Type(s) | PacBio long-read and Illumina short-read sequencing technologies |