Gangcai Xie, Xu Zhang, Feng Lv, Mengmeng Sang, Hairong Hu, Jinqiu Wang, Dong Liu.
Abstract
Trachidermus fasciatus is a roughskin sculpin fish widespread across the coastal areas of East Asia. Due to environmental destruction and overfishing, the population of this species is under threat. In order to protect this endangered species, it is important to have the genome sequenced. Reference genomes are essential for studying population genetics, domestic farming, and genetic resource protection. However, currently, no reference genome is available for Trachidermus fasciatus, and this has greatly hindered the research on this species. In this study, we integrated nanopore long-read sequencing, Illumina short-read sequencing, and Hi-C methods to thoroughly assemble the Trachidermus fasciatus genome. Our results provided a chromosome-level high-quality genome assembly with a predicted genome size of 542.6 Mbp (2n = 40) and a scaffold N50 of 24.9 Mbp. The BUSCO value for genome assembly completeness was higher than 96%, and the single-base accuracy was 99.997%. Based on EVM-StringTie genome annotation, a total of 19,147 protein-coding genes were identified, including 35,093 mRNA transcripts. In addition, a novel gene-finding strategy named RNR was introduced, and in total, 51 (82) novel genes (transcripts) were identified. Lastly, we present here the first reference genome for Trachidermus fasciatus; this sequence is expected to greatly facilitate future research on this species.
Keywords: Hi-C; Trachidermus fasciatus; genome assembly; nanopore; novel gene
Year: 2021 PMID: 34066304 PMCID: PMC8148166 DOI: 10.3390/genes12050692
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. Pipeline of Trachidermus fasciatus genome assembly and annotation.
Hisat2 mapping statistics for the RNA-seq datasets of seven tissues.
| Sample | Number of Bases | Total (Pairs of Reads) | Mapped (Unique) | Mapped (Multiple) | Mapped (All) | Mapped (Percentage) |
|---|---|---|---|---|---|---|
| skin | 8,250,236,700 | 27,500,789 | 22,963,898 | 2,672,115 | 25,636,013 | 93.22% |
| stomach | 10,785,480,600 | 35,951,602 | 26,050,112 | 7,422,957 | 33,473,069 | 93.11% |
| gill | 8,196,614,400 | 27,322,048 | 23,854,282 | 1,804,852 | 25,659,134 | 93.91% |
| gallbladder | 10,626,212,100 | 35,420,707 | 30,789,079 | 1,382,971 | 32,172,050 | 90.83% |
| kidney | 9,276,090,300 | 30,920,301 | 27,633,067 | 1,376,568 | 29,009,635 | 93.82% |
| heart | 8,720,360,400 | 29,067,868 | 24,859,462 | 1,093,286 | 25,952,748 | 89.28% |
| liver | 9,334,292,100 | 31,114,307 | 27,014,139 | 2,174,980 | 29,189,119 | 93.81% |
| Average | 9,312,755,229 | 31,042,517 | 26,166,291 | 2,561,104 | 28,727,395 | 92.57% |
| Total | 65,189,286,600 | 217,297,622 | 183,164,039 | 17,927,729 | 201,091,768 | |
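As a consistency check on the table above, the "Mapped (All)" and "Mapped (Percentage)" columns can be recomputed from the other columns. The sketch below is illustrative only: the counts are copied from the table, while the function and variable names are hypothetical and not part of the original analysis.

```python
# Per-tissue counts copied from the Hisat2 mapping table:
# (total read pairs, uniquely mapped pairs, multi-mapped pairs).
samples = {
    "skin": (27_500_789, 22_963_898, 2_672_115),
    "stomach": (35_951_602, 26_050_112, 7_422_957),
    "gill": (27_322_048, 23_854_282, 1_804_852),
    "gallbladder": (35_420_707, 30_789_079, 1_382_971),
    "kidney": (30_920_301, 27_633_067, 1_376_568),
    "heart": (29_067_868, 24_859_462, 1_093_286),
    "liver": (31_114_307, 27_014_139, 2_174_980),
}


def mapping_rate(total, unique, multiple):
    """Return (all mapped pairs, percentage of read pairs mapped)."""
    mapped = unique + multiple
    return mapped, 100.0 * mapped / total


rates = {}
for name, counts in samples.items():
    mapped, pct = mapping_rate(*counts)
    rates[name] = pct
    print(f"{name}: {mapped:,} mapped ({pct:.2f}%)")

# Mean of the per-tissue percentages, as opposed to a pooled percentage.
mean_rate = sum(rates.values()) / len(rates)
print(f"mean of per-tissue rates: {mean_rate:.2f}%")
```

Under this reading, the "Average" row's percentage corresponds to the mean of the seven per-tissue percentages rather than to total mapped pairs divided by total pairs.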
The sources and versions of the software used.
| Software | Version | Source Link |
|---|---|---|
| | v3.3.1 | |
| | v1.8.0 | |
| | v2.9 | |
| | 3.1.0 | |
| | 0.7.12-r1039 | |
| | v2 | |
| | v1.1.1 | |
| | 0.19.4 | |
| | v1.6.1 | |
| | v2.2 | |
| | v1.1.2 | |
| | 2.17(r941) | |
| | v2.0-beta.1 | |
| | v1.0.5 | |
| | v2.3.3 | |
| | V3.5.2 | |
| | Revision 1.331 | |
| | v1.4 | |
| | v2.0 | |
| | v2.2.1 | |
| | v2.1.4 | |
| | v5.5.0 | |
| | v0.12.3 | |
| | v2.11.0 | |
| | v1.4.0 | |
| | - | |
| | version open-1.0.11 | |
Summary of sequencing datasets.
| Library | Number of Bases | Number of Reads | Reads Length (Mean, bp) | Reads Length (Max, bp) |
|---|---|---|---|---|
| nanopore-seq-lib1 | 45,424,703,117 | 2,049,727 | 22,161 | 240,976 |
| nanopore-seq-lib2 | 41,854,167,498 | 2,031,772 | 20,599 | 243,222 |
| Total | 87,278,870,615 | 4,081,499 | 21,384 | 243,222 |
| | | | | |
| nanopore-seq-lib1 | 31,576 | 75.85 | 48.11 | 13.48 |
| nanopore-seq-lib2 | 30,199 | 70.82 | 42.38 | 11.99 |
| Total | 30,943 | 73.35 | 45.26 | 12.74 |
| Library | Number of Bases | Number of Reads | Reads Length (Mean, bp) | Reads Length (Max, bp) |
| NGS-Genome | 54,380,742,900 | 362,538,286 | 150 | 150 |
| NGS-HiC | 60,291,587,700 | 401,943,918 | 150 | 150 |
| NGS-RNASeq | 65,189,286,600 | 434,595,244 | 150 | 150 |
| | | | | |
| | 160.836 | 542,656,829 | | |
| | 100.212 | | | |
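The last two rows of this table have lost their labels. The values 160.836 and 100.212 are numerically consistent with sequencing depth, that is, total sequenced bases divided by a genome size of 542,656,829 bp. That interpretation is an assumption drawn from the arithmetic, not something the surviving table text states. A minimal sketch of the check, with hypothetical names:

```python
# Assumption: 542,656,829 is the genome size in bp, and depth = total bases / genome size.
GENOME_SIZE_BP = 542_656_829

# Total bases copied from the table rows above.
NANOPORE_TOTAL_BASES = 87_278_870_615    # nanopore "Total" row
ILLUMINA_GENOME_BASES = 54_380_742_900   # "NGS-Genome" row


def depth(total_bases, genome_size=GENOME_SIZE_BP):
    """Average sequencing depth (fold coverage) for a library."""
    return total_bases / genome_size


print(round(depth(NANOPORE_TOTAL_BASES), 3))
print(round(depth(ILLUMINA_GENOME_BASES), 3))
```

If the assumption holds, the first value matches the nanopore libraries combined and the second matches the Illumina genomic library.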
Summary of genome assembly.
| Metric | Preliminary: Contig Length (bp) | Preliminary: Contig Number | Polished: Contig Length (bp) | Polished: Contig Number |
|---|---|---|---|---|
| N50 | 23,408,022 | 10 | 23,556,738 | 10 |
| N60 | 20,522,422 | 13 | 20,673,841 | 13 |
| N70 | 18,786,353 | 16 | 18,887,485 | 16 |
| N80 | 17,637,255 | 18 | 17,756,054 | 18 |
| N90 | 7,606,484 | 23 | 7,646,988 | 23 |
| Longest | 35,774,240 | 1 | 36,041,605 | 1 |
| Total | 539,115,043 | 62 | 542,654,729 | 62 |
| Length ≥ 5 kb | 539,115,043 | 62 | 542,654,729 | 62 |
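The N50 to N90 rows follow the usual Nx definition: contigs are sorted from longest to shortest and accumulated until x% of the total assembly length is reached. The length of the contig that crosses the threshold is Nx, and the number of contigs accumulated so far appears to be what the "Contig Number" column reports. The sketch below illustrates the calculation on made-up contig lengths; it is not the authors' code, and the function name is hypothetical.

```python
def nx(lengths, fraction):
    """Return (Nx length, number of contigs needed) for a fraction, e.g. 0.5 for N50."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    # Only reachable if floating-point rounding pushes the threshold above the total.
    return ordered[-1], len(ordered)


# Hypothetical contig lengths (total 300), not taken from the assembly.
toy_contigs = [100, 80, 60, 40, 20]

print(nx(toy_contigs, 0.5))  # (80, 2): 100 + 80 = 180 reaches half of 300
print(nx(toy_contigs, 0.9))  # (40, 4): 100 + 80 + 60 + 40 = 280 reaches 270
```

Applied to the real contig lengths, the same procedure would be expected to reproduce rows such as N50 = 23,556,738 bp reached with 10 contigs in the polished assembly.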
Figure 2. Quality evaluation of genome assembly. (A) GC content and sequencing depth distribution. (B) CEGMA evaluation. (C) BUSCO evaluation. (D) Illumina read mappability and genome coverage. The assembled genome is 99.35% covered at least once by Illumina reads, and 97.93% is covered at least 20 times. In addition, 99.48% of the Illumina reads were mapped to the assembled genome, and 96.22% were properly mapped (mapped paired reads with flag 0x2 set). (E) Genome contamination evaluation.
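The "properly mapped" criterion in panel D refers to bit 0x2 of the SAM FLAG field, which an aligner sets when both mates of a pair are aligned in the expected orientation and distance. A minimal sketch of testing that bit is given below; the helper name and example flag values are illustrative and do not come from the study.

```python
PROPER_PAIR = 0x2  # SAM FLAG bit: each segment properly aligned according to the aligner


def is_properly_paired(flag: int) -> bool:
    """True if the proper-pair bit is set in a SAM FLAG value."""
    return bool(flag & PROPER_PAIR)


# 99 = 0x1 (paired) + 0x2 (proper pair) + 0x20 (mate reverse) + 0x40 (first in pair)
print(is_properly_paired(99))  # True
# 65 = 0x1 (paired) + 0x40 (first in pair); proper-pair bit not set
print(is_properly_paired(65))  # False
```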
Figure 3. Hi-C enhanced genome assembly. (A) Comparison of cumulative sums of the contigs between the primary assembled genome and the Hi-C enhanced genome. (B) Reduced number of scaffolds/contigs in Hi-C enhanced genome. (C) Chromosomal level all-by-all Hi-C interaction heatmap (legend value is the natural log transformed linkage intensity). (D) Genomic percentage of the chromosome-level scaffolds. (E) Length distribution for each chromosome.
Figure 4. Summary of genome annotation for Trachidermus fasciatus. (A) Number of protein-coding and non-protein-coding genes. (B) Gene annotation length distribution compared between close species. (C) Percentage of genome occupied by repetitive elements. (D) Number of repetitive elements in each class.
Figure 5. Identification of novel genes and transcripts. (A) Flowchart of RNR novel gene/transcript identification method. (B) Heatmap of the expression (log1p(FPKM)) of the novel transcripts in seven tissues.