Chromosomal-Level Genome Assembly of Silver Sillago (Sillago sihama).

Xinghua Lin, Yang Huang, Dongneng Jiang, Huapu Chen, Siping Deng, Yulei Zhang, Tao Du, Chunhua Zhu, Guangli Li, Changxu Tian.

Abstract

Silver sillago (Sillago sihama) is a member of the family Sillaginidae and is found in all Chinese inshore waters. It is an emerging commercial marine aquaculture species in China. In this study, the first high-quality chromosome-level reference genome of S. sihama was constructed using PacBio Sequel sequencing and the high-throughput chromosome conformation capture (Hi-C) technique. A total of 66.16 Gb of clean reads were generated on the PacBio sequencing platform. The assembled genome was 521.63 Mb in 556 contigs, with a contig N50 of 13.54 Mb. Hi-C scaffolding anchored 96.93% of the total assembled sequence to 24 chromosomes. A total of 23,959 protein-coding genes were predicted in the genome, and 96.51% of these genes were functionally annotated in public databases. A total of 71.86 Mb of repetitive elements were detected, accounting for 13.78% of the genome. Phylogenetic analysis of silver sillago and other teleosts showed that silver sillago diverged from its common ancestor with Sillago sinica ∼7.92 Ma. Comparative genomic analysis identified 45 unique and 100 expanded gene families in silver sillago. These genomic resources provide a valuable reference for functional genomics research on silver sillago.
© The Author(s) 2020. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.

Keywords:  Hi-C; PacBio; chromosomal assembly; genome; silver sillago

Year:  2021        PMID: 33367716      PMCID: PMC7875006          DOI: 10.1093/gbe/evaa272

Source DB:  PubMed          Journal:  Genome Biol Evol        ISSN: 1759-6653            Impact factor:   3.416


Significance

Sillago sihama is a commercial marine fish species of considerable economic value in China. A high-quality chromosome-level reference genome of S. sihama was constructed in this study. The genome is an important resource for advancing research on physiology, reproduction, and breeding.

Introduction

The family Sillaginidae (also known as smelt-whitings or sand borers) belongs to the order Perciformes. Its members are bottom-dwelling fishes that are widely distributed in the shallow seas of the Indo-West Pacific (Xu et al. 2018). Sillaginidae consists of 31 species in three genera and three subgenera, of which the genus Sillago comprises 24 species. Sillago species burrow into sand to avoid seine nets and other environmental hazards (Lou et al. 2020). Sillago flesh is white and very tender, with excellent flavor. Steamed fillets of Sillago fishes are low in fat and easy to digest. Owing to the ecological and economic importance of Sillago, its inshore fishery has developed rapidly in the past decades. However, natural populations of Sillago spp. have declined in recent years because of overfishing and deterioration of the demersal environment, such as localized oxygen depletion, sulfide accumulation, and high turbidity (Lou et al. 2020). Therefore, it is necessary to develop genomic resources to protect these natural resources and to accelerate genome-assisted improvement of economically important traits. Silver sillago, Sillago sihama, is found in all Chinese waters, including beaches, sandbars, mangrove creeks, and estuaries (Guo et al. 2014). This species has been widely cultured in China because of its high meat quality. However, the decline of natural S. sihama populations and a low survival rate in artificial breeding hinder the development of S. sihama marine aquaculture. To date, the complete mitogenome (Siyal et al. 2016), simple sequence repeats (Guo et al. 2014; Qiu et al. 2020), transcriptomes (Tian et al. 2019; Saetan et al. 2020), and draft genomic survey data (Li et al. 2019) have been reported for S. sihama. The genome of S. sinica was the first and only reference genome for Sillaginidae (Lou et al. 2020).
However, large-scale genomic analysis at the chromosome level has not been possible in Sillago because the available assemblies are fragmented. Our study reports the first chromosome-level genome of S. sihama. Genomic and comparative genomic analyses provide insights into the genes related to environmental stress. The genome can be used as a basis for research on the evolution and biology of S. sihama.

Materials and Methods

Ethics Statement

All experimental protocols were approved by the Animal Research and Ethics Committees of the Institute of Aquatic Economic Animals of Guangdong Ocean University, Zhanjiang, Guangdong, China (201903003). The study does not involve endangered or protected species.

Sample Collection and Sequencing

A Sillago sihama specimen (length of 19.3 cm) was obtained from Donghai Island, Guangdong, China. Genomic DNA (gDNA) was extracted from muscle samples and used to construct two Pacific Biosciences (PacBio) sequencing libraries (insert size of 20 kb). DNA samples were sheared with g-TUBE, and adaptors were ligated to the DNA fragments. The libraries were purified with an exonuclease, and the sequencing fragments were size-selected with BluePippin. Sequencing was conducted on the PacBio platform. Adaptors, low-quality reads, and short fragments were filtered out to obtain high-quality subreads. A high-throughput chromosome conformation capture (Hi-C) library (insert size of 350 bp) was constructed and sequenced to obtain the chromosome-level assembly of the genome. The samples were fixed with formaldehyde, and the DNA was digested with a restriction enzyme, followed by repair of the 5′ ends with biotin-labeled residues. Sequencing was done on the Illumina platform. Adapter sequences were trimmed from the raw reads, and low-quality paired-end (PE) reads were removed to obtain clean data. RNA was extracted from eight tissues of S. sihama: liver, heart, head kidney, gonad, muscle, brain, stomach, and gill. The Illumina HiSeq platform was used for transcriptome sequencing.

Genome Assembly

The filtered data were corrected with Canu (Koren et al. 2017), and the corrected reads were then used to assemble the primary genome with WTDBG. After the primary assembly was completed, the chromosome-level genome was assembled using the Hi-C data. The clean Hi-C reads were aligned to the primary assembly with the Burrows–Wheeler Aligner (Li and Durbin 2009). HiC-Pro (Rusk 2014) was used to filter the Hi-C data and evaluate their quality. The genome sequences were divided into groups, and then sorted and oriented. The assembly results were evaluated by LACHESIS (Servant et al. 2015).
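Contig N50, the statistic used in this report to summarize assembly contiguity, is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The following is a minimal sketch of that definition, assuming the contig lengths are already available as a list of integers; the function name and the toy lengths are illustrative and are not part of the study's pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (in bp)."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy example with made-up contig lengths (not data from the study).
toy_contigs = [2, 3, 4, 5, 6]
toy_n50 = n50(toy_contigs)
```

The same function applies to scaffold lengths, which gives the scaffold N50.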

Genome Prediction and Annotation

A repetitive sequence database of the S. sihama genome was constructed by structure-based and de novo prediction using LTR FINDER v1.05 (Xu and Wang 2007), RepeatScout v1.0.5 (Price et al. 2005), and PILER-DF v2.4 (Edgar and Myers 2005). PASTEClassifier (Wicker et al. 2007) was used to classify the repetitive sequences, and the classified database was then merged with the Repbase (Jurka et al. 2005) database to form the final repetitive sequence database. The repetitive sequences of S. sihama were predicted with RepeatMasker v4.0.6 (Tarailo-Graovac and Chen 2009). Protein-coding genes in the genome were predicted by combining ab initio prediction, homologous alignment, and transcriptome data. The ab initio prediction was done using Genscan (Burge and Karlin 1997), Augustus v2.4 (Stanke and Waack 2003), GlimmerHMM v3.0.4 (Majoros et al. 2004), GeneID v1.4 (Alioto et al. 2018), and SNAP (version 2006-07-28) (Korf 2004). The protein sequences of Larimichthys crocea, Oreochromis niloticus, Oryzias latipes, Danio rerio, and Sillago sinica were downloaded from the National Center for Biotechnology Information (NCBI) and GIGA databases. Homology-based prediction of protein-coding genes was performed with GeMoMa v1.3.1 (Keilwagen et al. 2016). Reference transcripts were assembled with Hisat v2.0.4 and Stringtie v1.2.3 (Pertea et al. 2016), and TransDecoder v2.0 (Haas et al. 2013) and GeneMarkS-T v5.1 (Tang et al. 2015) were used for gene prediction. Based on transcriptome data, unigene sequences were predicted by PASA v2.0.2 (Campbell et al. 2006). EVM v1.1.1 (Haas et al. 2008) was used to integrate the prediction results obtained by the above three methods. We performed homology searches in public gene databases, including NCBI Refseq (NR, Marchler-Bauer et al. 2011), Kyoto Encyclopedia of Genes and Genomes (KEGG, Ogata et al. 1999), Clusters of orthologous groups for eukaryotic complete genomes (KOG, Tatusov 2001), Translation of EMBL nucleotide sequence database (TrEMBL, Boeckmann 2003), and Gene Ontology (GO, Dimmer et al. 2012). Functional annotation of the predicted gene sequences was performed with BLAST v2.2.31 (Altschul et al. 1990) (-evalue 1e-5). Based on the alignment results against the NR database, GO functional annotation was performed with Blast2GO (Conesa et al. 2005). The rRNA and microRNA sequences were predicted with Infernal 1.1 (Nawrocki and Eddy 2013) against the Rfam (Griffiths-Jones et al. 2005) and miRBase (Griffiths-Jones et al. 2006) databases. The tRNAs were identified with tRNAscan-SE v1.3.1 (Lowe and Eddy 1997).
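To illustrate the E-value cutoff used for functional annotation (-evalue 1e-5), the sketch below keeps, for each query gene, the highest-scoring hit whose E-value does not exceed the threshold. It assumes BLAST tabular output with the default 12 columns (E-value in column 11, bit score in column 12); the study does not state which output format was used, and the gene and protein identifiers here are made up.

```python
# Zero-based column positions in default 12-column BLAST tabular output (assumed format).
QUERY, SUBJECT, EVALUE, BITSCORE = 0, 1, 10, 11


def best_hits(tabular_text, max_evalue=1e-5):
    """Return {query: (subject, evalue, bitscore)} for the top hit per query."""
    best = {}
    for line in tabular_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 12 or line.startswith("#"):
            continue  # skip blank, comment, or malformed lines
        evalue = float(fields[EVALUE])
        bitscore = float(fields[BITSCORE])
        if evalue > max_evalue:
            continue  # hit is weaker than the annotation threshold
        query = fields[QUERY]
        if query not in best or bitscore > best[query][2]:
            best[query] = (fields[SUBJECT], evalue, bitscore)
    return best


# Hypothetical hits: gene1 has two acceptable hits, gene2 only a weak one.
example = "\n".join([
    "\t".join(["gene1", "protA", "98.0", "200", "4", "0", "1", "200", "1", "200", "1e-50", "350"]),
    "\t".join(["gene1", "protB", "80.0", "180", "36", "0", "1", "180", "5", "184", "1e-20", "150"]),
    "\t".join(["gene2", "protC", "35.0", "60", "39", "0", "1", "60", "10", "69", "0.01", "30"]),
])
annotated = best_hits(example)
```

In this toy input, gene1 is assigned to protA and gene2 stays unannotated because its only hit is above the threshold.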

Assessment of Completeness of the Genome Assembly

The completeness of the assembly and gene annotation was assessed with the core eukaryotic genes mapping approach (CEGMA, v2.5) (http://korflab.ucdavis.edu/Datasets/cegma/, last accessed January 13, 2021) (Parra et al. 2007) and benchmarking universal single-copy orthologs (BUSCO, v2) (http://busco.ezlab.org/, last accessed January 13, 2021) (Simao et al. 2015).

Genome Evolution Analysis

The evolution between species and the classification of gene families were analyzed based on the protein sequences of S. sihama and 10 other teleosts: Takifugu rubripes (GCA_000180615.2), Gasterosteus aculeatus (GCA_006229165.1), O. latipes (GCA_004347445.1), D. rerio (GCA_000002035.4), O. niloticus (GCA_001858045.3), Latimeria chalumnae (GCF_000225785.1), S. sinica (http://dx.doi.org/10.5524/100490, last accessed January 13, 2021), L. crocea (GCA_003845795.1), Lepisosteus oculatus (GCA_000242695.1), and Xiphophorus maculatus (GCA_002775205.2). The protein sequences of the 11 teleosts were classified into gene families, and single-copy genes were extracted with OrthoMCL (Li et al. 2003). To study the evolutionary relationships among the 11 teleosts, their single-copy protein sequences were used to construct a maximum-likelihood (ML) phylogenetic tree with PHYML (Guindon et al. 2010). Divergence times were estimated with MCMCTree in PAML and calibrated using the TimeTree database (http://www.timetree.org/, last accessed January 13, 2021). Larimichthys crocea was phylogenetically closely related to S. sihama. The 24 S. sihama chromosomes were aligned with L. crocea chromosomes by MCScanX to visualize the consistency between the genomes of S. sihama and L. crocea (Wang et al. 2012).
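The step of extracting single-copy genes from the clustered gene families can be pictured as a simple filter: a family qualifies only if every species contributes exactly one member. The sketch below assumes the clustering result has already been parsed into a dictionary of family to per-species gene lists; it is not OrthoMCL's own code, and the family and gene names are hypothetical.

```python
def single_copy_families(families, species):
    """Return the sorted IDs of families with exactly one gene in every species."""
    return sorted(
        family_id
        for family_id, members in families.items()
        if all(len(members.get(sp, ())) == 1 for sp in species)
    )


# Toy input with three species and made-up family and gene names.
species = ["S_sihama", "S_sinica", "L_crocea"]
families = {
    "fam1": {"S_sihama": ["g1"], "S_sinica": ["g2"], "L_crocea": ["g3"]},
    "fam2": {"S_sihama": ["g4", "g5"], "S_sinica": ["g6"], "L_crocea": ["g7"]},
    "fam3": {"S_sihama": ["g8"], "S_sinica": ["g9"]},
}
single_copy = single_copy_families(families, species)
```

Here only fam1 qualifies: fam2 has a duplicated gene in one species and fam3 lacks a member in another. The protein sequences of the qualifying families are what would be aligned and passed on to tree building.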

Gene Family Expansion and Contraction Analysis

Expanded and contracted gene families among T. rubripes, G. aculeatus, O. latipes, D. rerio, O. niloticus, L. chalumnae, S. sinica, L. crocea, L. oculatus, X. maculatus, and S. sihama were identified with CAFE (De Bie et al. 2006). The number of gene families in each ancestor was estimated with a birth-death model, from which the numbers of expanded and contracted gene families were predicted.
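As an illustration of what a birth-death model of gene family size means, the sketch below simulates a linear birth-death process in which each gene copy is gained or lost at the same per-copy rate along a branch. This is only a toy simulation under stated assumptions: it does not reproduce CAFE's likelihood calculation or its P-values, the rate value is hypothetical, and the branch length simply reuses the ∼7.92 Ma divergence estimate reported in this study.

```python
import random


def simulate_family_size(start_size, rate, duration, rng):
    """Simulate gene family size after `duration` under equal per-copy gain and loss rates."""
    size, elapsed = start_size, 0.0
    while size > 0 and rate > 0:
        # Waiting time to the next event; the total event rate is 2 * rate * size.
        elapsed += rng.expovariate(2.0 * rate * size)
        if elapsed > duration:
            break
        # Gain and loss are equally likely for each event.
        size += 1 if rng.random() < 0.5 else -1
    return size


def classify(ancestral_size, descendant_size):
    """Label a family by comparing descendant and ancestral sizes."""
    if descendant_size > ancestral_size:
        return "expanded"
    if descendant_size < ancestral_size:
        return "contracted"
    return "unchanged"


rng = random.Random(0)  # fixed seed so the toy run is repeatable
# Hypothetical rate (events per gene per Myr) over a 7.92 Myr branch.
sizes = [simulate_family_size(5, 0.002, 7.92, rng) for _ in range(1000)]
labels = [classify(5, s) for s in sizes]
```

A family that reaches size zero stays extinct, which is why the loop stops when no copies remain.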

Results and Discussion

Genome Sequencing and Assembly

After quality filtering, 66.16 Gb of subread data were obtained from two long-insert (20 kb) libraries (sequence coverage: ∼126×; subread N50: 15,715 bp; supplementary table S1, Supplementary Material online). A total of 89.08 Gb of Hi-C data were obtained from the Hi-C sequencing library (sequence coverage: ∼170×; GC content: 43.95%; Q30: 90.92%; supplementary table S1, Supplementary Material online). The PacBio data were used to construct the primary assembly. The primary genome assembly size was 522.06 Mb, and contig N50 was 13.55 Mb. The mapping rate of the Hi-C reads to the primary assembly was 90.79% (uniquely mapped read pairs: 77.18%). Total effective Hi-C data were 153.18 Mb. The genome was reassembled after errors in the primary assembly were corrected using the Hi-C data. The chromosome-level genome size was 521.63 Mb, and contig N50 was 13.54 Mb (table 1). Using Hi-C data, 556 contigs were mapped to 24 chromosomes (supplementary fig. S1, Supplementary Material online). A total length of 498.82 Mb of the genomic sequence was anchored to 24 chromosomes, accounting for 96.93% of the entire genomic sequence (supplementary table S2 and fig. S2, Supplementary Material online).
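The sequence coverage figures quoted above follow from dividing the sequenced bases by the genome size. A minimal sketch, assuming 1 Gb = 10^9 bases and 1 Mb = 10^6 bases and using the 521.63 Mb chromosome-level assembly as the genome size (the study does not state which size it used as the denominator):

```python
def sequencing_depth(total_bases, genome_size):
    """Average fold coverage: sequenced bases divided by genome size."""
    return total_bases / genome_size


GENOME_SIZE = 521.63e6  # bp, chromosome-level assembly size (assumed denominator)

pacbio_depth = sequencing_depth(66.16e9, GENOME_SIZE)  # PacBio subreads
hic_depth = sequencing_depth(89.08e9, GENOME_SIZE)     # Hi-C reads

print(f"PacBio: ~{pacbio_depth:.1f}x, Hi-C: ~{hic_depth:.1f}x")
```

Both values come out in line with the reported ∼126× and ∼170×.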

Table 1. Statistics of Sillago sihama Genome Assembly and Annotation Data

Item                                                      Chromosome-Level Genome Assembly
Assembly
  Assembly size (bp)                                      521,631,495
  Number of scaffolds                                     470
  Scaffold N50 (bp)                                       21,469,626
  Longest scaffold (bp)                                   28,013,376
  Number of contigs                                       556
  Contig N50 (bp)                                         13,543,514
  Longest contig (bp)                                     22,111,180
  GC (%)                                                  44.66
BUSCO (% of total BUSCO)
  Complete                                                4,463 (97.36%)
  Single-copy                                             4,345 (94.79%)
  Duplicated                                              118 (2.57%)
  Fragmented                                              27 (0.6%)
  Missing                                                 94 (2.05%)
CEGMA
  CEGs (% of all CEGs)                                    453 (98.97%)
  Highly conserved CEGs (% of all highly conserved CEGs)  246 (99.16%)
Repetitive sequences (% of genome)
  SINE (bp)                                               60,396 (0.01%)
  LINE (bp)                                               7,497,699 (1.44%)
  LTR (bp)                                                6,955,457 (1.33%)
  DNA (bp)                                                17,803,273 (3.00%)
  SSR (bp)                                                101,169 (0.02%)
  Unclassified (bp)                                       39,549,161 (7.58%)
  Total (bp)                                              71,864,242 (13.78%)
Gene annotations (% of all genes)
  GO annotation                                           12,408 (51.79%)
  KEGG annotation                                         14,510 (60.59%)
  KOG annotation                                          15,991 (66.74%)
  TrEMBL annotation                                       22,953 (95.8%)
  NR annotation                                           23,101 (96.42%)
  All annotated                                           23,123 (96.51%)
Noncoding RNA genes (% of genome)
  Number of miRNA                                         419
  Number of tRNA                                          1,587
  Number of rRNA                                          67
  Length of miRNA (bp)                                    34,211 (0.00656%)
  Length of tRNA (bp)                                     160,051 (0.03068%)
  Length of rRNA (bp)                                     60,018 (0.00575%)

According to the BUSCO results, the genome contained 4,463 (97.36%) complete BUSCOs, including 4,345 single-copy BUSCOs and 118 duplicated BUSCOs (table 1). The CEGMA v2.5 database contains 248 conserved core eukaryotic genes, of which 246 (99.19%) were present in this genome (table 1). These results indicate that the genome assembly has high coverage and completeness.
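The completeness percentages can be recomputed from the counts. The sketch below assumes that the total number of BUSCO groups is the sum of the complete, fragmented, and missing counts in table 1 (the usual way BUSCO reports its categories); the helper name is illustrative.

```python
def percent(part, whole):
    """Percentage of `whole` represented by `part`, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


# BUSCO counts from table 1; the total is assumed to be their sum.
busco = {"complete": 4463, "fragmented": 27, "missing": 94}
busco_total = sum(busco.values())

complete_pct = percent(busco["complete"], busco_total)
single_copy_pct = percent(4345, busco_total)
duplicated_pct = percent(118, busco_total)

# CEGMA: 246 of the 248 highly conserved core eukaryotic genes were found.
cegma_pct = percent(246, 248)
```

Under that assumption the BUSCO total is 4,584 groups, and the complete, single-copy, and duplicated percentages match the values given for the assembly.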

Genome Annotation

De novo prediction and Repbase database results showed that repeated sequences accounted for 13.78% of the S. sihama genome, which is lower than in D. rerio (63.12%), O. latipes (42.83%), and L. crocea (20.31%), and higher than in S. sinica (10.92%) and T. rubripes (9.37%). DNA transposons (3%) were the most common transposons in the S. sihama genome, followed by long interspersed nuclear elements (LINEs, 1.44%) and long terminal repeats (LTRs, 1.33%) (table 1, supplementary table S3, Supplementary Material online, fig. 1).
Fig. 1. Genome landscape and evolutionary analysis of Sillago sihama. (A) Genome landscape of S. sihama. (a) Chromosome length, (b) GC content, (c) gene density, (d) repeat sequence, (e) long terminal repeat (LTR), (f) long interspersed nuclear elements (LINE), and (g) simple sequence repeat (SSR). (B) Phylogenetic analysis of 11 teleost fishes. At each branch point, the predicted species divergence time (million years ago) is marked. The red number on each evolutionary branch represents the number of expanding gene families, and the blue number represents the number of contracting gene families. (C) Collinearity analysis of S. sihama and Larimichthys crocea genomes. Blue and orange outer circles represent the chromosomes of S. sihama and L. crocea, respectively.

A total of 23,959 protein-coding genes (supplementary table S4, Supplementary Material online) were predicted in the S. sihama genome by ab initio, homology-based, and RNA-seq prediction methods, with an average length of 11,241.51 bp. In terms of the length distributions of genes, coding sequences (CDS), exons, and introns, S. sihama was similar to other teleosts. Sillago sihama gene proportions were lower than those of other fishes but similar to those of S. sinica (supplementary fig. S3, Supplementary Material online). The functions of the protein-coding genes were annotated in the NR, TrEMBL, KOG, KEGG, and GO databases. A total of 23,123 genes were annotated, accounting for 96.51% of all protein-coding genes (table 1). Rfam, miRBase, and tRNAscan-SE were used to predict noncoding RNAs, and a total of 1,587 tRNAs, 67 rRNAs, and 419 miRNAs were predicted (table 1, supplementary table S5, Supplementary Material online).

Comparative Genome Analysis

The genomes of 11 teleosts were compared to study the phylogenetic relationships between S. sihama and other teleosts. A total of 16,856 gene families and 5,950 single-copy orthologs were identified (supplementary table S6 and fig. S4, Supplementary Material online). The ML phylogenetic tree was constructed from the single-copy orthologs. The phylogenetic tree showed that S. sinica was closely related to S. sihama, with a divergence time of ∼7.92 (2.45–16.57) Ma (fig. 1). The genomes of S. sihama and L. crocea were compared to analyze chromosomal evolutionary events (fig. 1). The results showed that the 24 chromosomes of S. sihama aligned with 22 chromosomes of L. crocea. Chromosomes III and XIII of L. crocea aligned with LG2 and LG10 and with LG5 and LG16 of S. sihama, respectively. The common ancestor of L. crocea and S. sihama underwent a chromosome breakage and recombination event during evolution, which increased the number of chromosomes.

Gene Family Analysis

The expansion and contraction of gene families is one of the most important factors in the evolution of phenotypic diversity and environmental adaptation. Sillago sihama is sensitive to environmental factors, such as sound, vibration, light, and shadow. To explore the adaptation of S. sihama to environmental factors, the gene families of 11 teleost fishes (T. rubripes, G. aculeatus, O. latipes, D. rerio, O. niloticus, L. chalumnae, S. sinica, L. crocea, L. oculatus, X. maculatus, and S. sihama) were compared. A total of 57 unique, 100 expanded (P < 0.05), and 25 contracted (P < 0.05) gene families were identified in S. sihama (supplementary table S7, Supplementary Material online), including immune-related gene families (immunoglobulin domain, immunoglobulin V-set domain, immunoglobulin I-set domain, and NACHT domain) and the olfactory receptor gene family (seven transmembrane receptor).

Conclusions

This study determined the chromosome-level genome assembly of S. sihama. The continuity and completeness of the S. sihama genome reached the level of other high-quality teleost genomes, providing a useful reference for systems biology and comparative genome evolution analysis. Genome evolution analysis provided insights into the high sensitivity of S. sihama to environmental stimuli. This reference genome is important for aquaculture and artificial breeding of S. sihama and provides a basis for further research.

Supplementary Material

Supplementary data are available at Genome Biology and Evolution online.

References (44 in total)

1.  Isolation and characterization of microsatellite DNA loci from Sillago sihama.

Authors:  Yu-Song Guo; Zhong-Duo Wang; Cheng-Zhong Yan; Yu-Lan Zhang; Jin-Nan Zheng; Yuan-Min Xu; Tao Du; Chu-Wu Liu
Journal:  J Genet       Date:  2012-03-28       Impact factor: 1.166

2.  tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence.

Authors:  T M Lowe; S R Eddy
Journal:  Nucleic Acids Res       Date:  1997-03-01       Impact factor: 16.971

3.  The COG database: new developments in phylogenetic classification of proteins from complete genomes.

Authors:  R L Tatusov; D A Natale; I V Garkavtsev; T A Tatusova; U T Shankavaram; B S Rao; B Kiryutin; M Y Galperin; N D Fedorova; E V Koonin
Journal:  Nucleic Acids Res       Date:  2001-01-01       Impact factor: 16.971

4.  A draft genome assembly of the Chinese sillago (Sillago sinica), the first reference genome for Sillaginidae fishes.

Authors:  Shengyong Xu; Shijun Xiao; Shilin Zhu; Xiaofei Zeng; Jing Luo; Jiaqi Liu; Tianxiang Gao; Nansheng Chen
Journal:  Gigascience       Date:  2018-09-01       Impact factor: 6.524

5.  MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity.

Authors:  Yupeng Wang; Haibao Tang; Jeremy D Debarry; Xu Tan; Jingping Li; Xiyin Wang; Tae-ho Lee; Huizhe Jin; Barry Marler; Hui Guo; Jessica C Kissinger; Andrew H Paterson
Journal:  Nucleic Acids Res       Date:  2012-01-04       Impact factor: 16.971

6.  Comprehensive analysis of alternative splicing in rice and comparative analyses with Arabidopsis.

Authors:  Matthew A Campbell; Brian J Haas; John P Hamilton; Stephen M Mount; C Robin Buell
Journal:  BMC Genomics       Date:  2006-12-28       Impact factor: 3.969

7.  Identification of protein coding regions in RNA transcripts.

Authors:  Shiyuyun Tang; Alexandre Lomsadze; Mark Borodovsky
Journal:  Nucleic Acids Res       Date:  2015-04-13       Impact factor: 16.971

8.  Using intron position conservation for homology-based gene prediction.

Authors:  Jens Keilwagen; Michael Wenk; Jessica L Erickson; Martin H Schattat; Jan Grau; Frank Hartung
Journal:  Nucleic Acids Res       Date:  2016-02-17       Impact factor: 16.971

9.  Transcriptome Analysis of Male and Female Mature Gonads of Silver Sillago (Sillago sihama).

Authors:  Changxu Tian; Zhiyuan Li; Zhongdian Dong; Yang Huang; Tao Du; Huapu Chen; Dongneng Jiang; Siping Deng; Yulei Zhang; Saetan Wanida; Hongjuan Shi; Tianli Wu; Chunhua Zhu; Guangli Li
Journal:  Genes (Basel)       Date:  2019-02-11       Impact factor: 4.096

10.  Fast and accurate short read alignment with Burrows-Wheeler transform.

Authors:  Heng Li; Richard Durbin
Journal:  Bioinformatics       Date:  2009-05-18       Impact factor: 6.937
