Literature DB >> 28327943

Whole genome sequencing of Chinese clearhead icefish, Protosalanx hyalocranius.

Kai Liu¹, Dongpo Xu¹, Jia Li², Chao Bian², Jinrong Duan¹, Yanfeng Zhou¹, Minying Zhang¹, Xinxin You², Yang You¹, Jieming Chen², Hui Yu², Gangchun Xu¹, Di-An Fang¹, Jun Qiang¹, Shulun Jiang¹, Jie He¹, Junmin Xu^2,3,4, Qiong Shi^2,3,4,5, Zhiyong Zhang⁶, Pao Xu^1,4.

Abstract

Background: Chinese clearhead icefish, Protosalanx hyalocranius , is a representative icefish species with economic importance and special appearance. Due to its great economic value in China, the fish was introduced into Lake Dianchi and several other lakes from the Lake Taihu half a century ago. Similar to the Sinocyclocheilus cavefish, the clearhead icefish has certain cavefish-like traits, such as transparent body and nearly scaleless skin. Here, we provide the whole genome sequence of this surface-dwelling fish and generated a draft genome assembly, aiming at exploring molecular mechanisms for the biological interests. A total of 252.1 Gb of raw reads were sequenced. Subsequently, a novel draft genome assembly was generated, with the scaffold N50 reaching 1.163 Mb. The genome completeness was estimated to be 98.39 % by using the CEGMA evaluation. Finally, we annotated 19 884 protein-coding genes and observed that repeat sequences account for 24.43 % of the genome assembly. We report the first draft genome of the Chinese clearhead icefish. The genome assembly will provide a solid foundation for further molecular breeding and germplasm resource protection in Chinese clearhead icefish, as well as other icefishes. It is also a valuable genetic resource for revealing the molecular mechanisms for the cavefish-like characters.

Entities: Chemical Disease Gene Species

Keywords: Clearhead icefish; Gene prediction; Genome assembly; Protosalanx hyalocranius; Repetitive sequences; Whole genome sequencing

Mesh：

Year: 2017 PMID： 28327943 PMCID： PMC5530312 DOI： 10.1093/gigascience/giw012

Source DB: PubMed Journal: Gigascience ISSN： 2047-217X Impact factor: 6.524

Data description

Introduction

Icefishes (Osmeriformes, Salangidae) are widely distributed in freshwater, coastal, and estuarine habitats in East Asian countries [1-3]. Chinese clearhead icefish (Protosalanx hyalocranius; Fig. 1), a diadromous fish, mainly inhabits in coastal areas and adjacent freshwaters [4-6]. As an economically important fish in China, the clearhead icefish was widely introduced into some lakes from the original Lake Taihu half a century ago, and it has developed a resident life history in these water areas [2, 7, 8]. Because of its transparent body and nearly scaleless skin, similar to the Sinocyclocheilus cavefishes [9], we are very interested in this surface-dwelling fish and are performing comparative genomics studies to explore the mechanisms for these biological phenotypes. However, with the rapid development of the Chinese economy in recent decades, the population size of the clearhead icefish has been seriously declining because of overfishing, construction of water conservancy facilities, and water pollution in the ecological systems [10]. To maintain its sustainable development in China, here we performed whole genome sequencing of Chinese clearhead icefish to support its biological and economic importance.

Figure 1:

Picture of a Chinese clearhead icefish. It was captured from the Taihu Lake of Jiangsu Province, China.

Sample and sequencing

In this study, we applied Illumina whole genome sequencing strategy to sequence the genome of Chinese clearhead icefish (NCBI taxonomy ID: 418454; Fishbase ID: 12236). Genomic DNA was isolated from the muscle tissue of an individual collected from the Lake Taihu of Jiangsu Province in China. We constructed seven paired-end libraries with three short-insert libraries (250, 500, and 800 bp) and four long-insert libraries (2, 5, 10, and 20 kb) using the standard protocol provided by Illumina (San Diego, CA, USA). Subsequent paired-end sequencing was performed by the Illumina HiSeq 2000 platform for each library. Finally, we obtained 252.1 Gb of raw reads for further analysis.

Genome size estimation and genome assembly

The SOAPfilter v2.2 software [11] with optimized parameters (-y -p -g 1 -o clean -M 2 -f 0) was utilized to remove low-quality raw reads (including reads with 10 or more Ns and low-quality bases) and PCR replicates as well as adaptor sequences. In total, we obtained 169.0 Gb of clean reads. Subsequently, we estimated the genome size based on the 17-mer depth frequency distribution method [12]. We applied the following formula to calculate the genome size: G = k_num/k_depth = b_num/b_depth (k_num is the total number of K-mers from the sequencing data, k_depth is the expected coverage depth for k-mers, b_num is the total number of bases, b_depth is the expected coverage depth of bases; As one read with length L generates L-K+1 k-mers, k_num/b_num = (L-K+1)/L). In our current study, the K_num was 10 500 000 000 and the K_depth was 20. Hence, we estimated that the genome size of Chinese clearhead icefish is 525 Mb. The filtered reads were assembled using SOAPdenovo2 v2.04.4 software [13] with optimized parameters (pregraph -K 79 -d 1; contig -M 1; scaff -F -b 1.5 -p 16) to generate contigs and original scaffolds. The gaps were filled using GapCloser v1.12 software [14] with default parameters and –p set to 25. Finally, we generated a draft genome assembly of 536 Mb, with the scaffold N50 reaching 1.163 Mb (Table 1).

Table 1:

The statistics of genome assembly and annotation for P. hyalocranius

Genome assembly
Contig N50 size (kb)	17.2
Scaffold N50 size (Mb)	1.163
Estimated genome size (Mb)	525
Assembled genome size (Mb)	536
Genome coverage (X)	315
The longest scaffold (bp)	5 398 389
Gap length (Mb)	122
Genome annotation
Protein-coding gene number	19 884
Annotated functional gene number	19 125 (96.2 %)
Unannotated functional gene number	759 (3.8 %)
Repeat content	24.43 %

The statistics of genome assembly and annotation for P. hyalocranius The completeness of our assembly was evaluated by using both CEGMA [15] and BUSCO [16]. The CEGMA program (Core Eukaryotic Genes Mapping Approach; version 2.4) assessment with 248 conserved Core Eukaryotic Genes was performed for evaluation of the gene space completeness. Our results revealed that the assembled genome had a CEGMA completeness score at 90.32 % and 98.39 %, which was calculated from the complete gene set and the partial gene set, respectively. Meanwhile, we used the representative metazoa gene set [17], which contains 843 single-copy genes that are widely present in metazoan, as a reference. The assessment demonstrated that the BUSCO value is 89 %, containing [D: 10 %], F: 7.7 %, M: 2.9 %, n: 843 (C: complete [D: duplicated], F: fragmented, M: missed, n: genes). These data from CEGMA and BUSCO indicate that the assembled genome covered majority of the gene space.

Repeat annotation

Firstly, a de novo repeat library was constructed by the RepeatModeller v1.05 [18] and LTR_FINDER.x86_64-1.0.6 [10] with default parameters. Then, the assembled genome sequences were aligned against the RepBase v21.01 [19] and the de novo repeat libraries to recognize the known and novel transposable elements using the RepeatMasker v4.06 [20]. Meantime, the Tandem Repeat Finder v4.07 [21] with parameters “Match = 2, Mismatch = 7, Delta = 7, PM = 80, PI = 10, Minscore = 50, and MaxPeriod = 2000” was utilized for annotation of tandem repeats. Furthermore, the RepeatProteinMask software v4.0.6 [20] was used to predict transposable element relevant proteins in our genome assembly. Finally, we observed that the repeat sequences account for 24.43 % of the assembled genome (Table 1), and the de novo annotation method predicted the most abundant repeat sequence among the four methods (Table 2).

Table 2:

Detailed classification of repeat sequences in the assembled genome

Type	Repeat size (bp)	% of Genome
ProteinMask	9 925 152	1.85
RepeatMasker	5 948 136	1.11
Tandem Repeat Finder	66 595 756	12.41
De novo	93 726 009	17.47
Total	131 090 229	24.43

Detailed classification of repeat sequences in the assembled genome

Genome Annotation

In brief, we utilized two different methods to predict total gene set of the clearhead icefish.

de novo annotation

The AUGUSTUS v2.5 [22] and GENSCAN v1.0 [23] were executed to ab initio predict genes within the assembled genome, with the repetitive sequences masked as “N” to discard pseudo gene prediction. Those low-quality genes with short length (<150 bp), premature termination, or frame-shifting were removed. Finally, we identified 23 132 and 21 379 pro-coding genes by using the AUGUSTUS and GENSCAN software (Table 3).

Table 3:

Gene annotation statistics of the genome of P. hyalocranius

			Average transcript	Average CDS	Average Exons	Average Exons	Average Intron
Method		Number	length (bp)	length (bp)	Per Gene	Length (bp)	Length (bp)
De novo	AUGUSTUS	23 132	4897.24	1264.61	5.78	218.81	760.04
	GeneScan	21 379	17 213.49	1973.56	10.22	193.05	1652.41
Homolog	Danio rerio	25 390	7156.92	1312.32	6.17	212.62	1129.99
	Oryzias latipes	25 319	6411.36	1194.58	5.89	202.73	1066.29
	Takifugu rubripes	16 563	7990.91	1759.17	11.59	151.75	588.32
	Tetraodon nigroviridis	19 128	8335.40	1351.98	7.44	181.78	1084.78
	Esox lucius	24 861	8019.18	1375.58	6.92	198.85	1122.70
	Gasterosteus aculeatus	25 354	6819.62	1183.46	6.18	191.44	1087.68
Final gene set		19 884	12 889.35	1821.79	9.13	199.49	1360.92

Gene annotation statistics of the genome of P. hyalocranius

Homology annotation

We aligned the protein sequences from six published genomes, including Danio rerio [24], Oryzias latipes [25], Takifugu rubripes [26], Tetraodon nigroviridis [27], Esox lucius [28], and Gasterosteus aculeatus [29], against our assembly to predict homology-based genes. The potential homology-based genes were searched by TblastN [30] with an e-value of 10−5. The TblastN results were then processed by Sorting Out Local Alignment Result [31] to obtain the best hit of each alignment. Subsequently, GeneWise v2.2.0 [32] was performed to detect the possible gene structure for the best hit of each alignment. The low-quality genes were also removed as described in the above-mentioned de novo annotation.

Integration of annotation results

We employed the GLEAN [33] to generate a nonredundant and comprehensive gene set. Finally, the best hit of each protein was obtained through all protein sequences from the GLEAN results aligned to the databases of the SwissProt and TrEMBL [34] (Uniprot release 2011.06) by BlastP with an e-value of 10−5. Overall, we generated a final gene set with 19,884 genes for the Chinese clearhead icefish (Table 3). CEGMA was performed again to evaluate the coverage rate between eukaryotic orthologous group genes predicted by CEGMA and the predicted total gene set. It demonstrates that the predicted gene set mapped 96.4 % of the eukaryotic orthologous groups. Simultaneously, the BUSCO was implemented again to assess completeness of the predicted gene set. The BUSCO values were calculated as follows: C: 79 % [D: 16 %], F: 9.8 %, M: 10, n: 843 (C: complete [D: duplicated], F: fragmented, M: missed, n: genes). The assessment values from both CEGMA and BUSCO proved high accuracy of the annotation.

Function annotation

The predicted protein sequences of the clearhead icefish were aligned against several public databases (Pfam [35], PRINTS [36], ProDom [37], and SMART [38]) for detection of functional motifs and domains. Finally, we found that 96.2 % of the predicted total gene set had been annotated with at least one functional assignment from other public databases (Swiss-Prot [39], Interpro [40], TrEMBL [41], and KEGG [42]).

Genome evolution

We performed phylogenomic analyses with orthologues from representative species for each clade. We used the Ensembl BioMart (www.ensembl.org; Ensembl version 76) to extract orthologues for zebrafish [24], fugu [26], stickleback [29], medaka [25], and spotted gar [43]. This generated orthologue dataset from six species was filtered out to retain only one-to-one orthologues. Meanwhile, a new Asian arowana gene set stemmed from our recent work [44]. To extrapolate the Biomart orthologues to the arowana and clearhead icefish gene sets, we used zebrafish as the reference. We ran InParanoid [45] for the three species pairs (zebrafish-arowana and zebrafish-clearhead icefish) at default settings (i.e., a minimum BLASTP score of 40 bits, minimum 50 % alignment span, minimum 25 % alignment coverage, and minimum inparalog confidence level of 0.05). By comparing the three InParanoid outputs, we narrowed down the list of one-to-one orthologues, presented in all seven species, to 454 genes. Multiple alignments were subsequently performed on proteins of each selected family using MUSCLE (version 3.8.31) [46], and protein alignments were converted to their corresponding CDS alignments using an in-house perl script (see supporting data). All the translated CDS sequences were linked into one “supergene” for each species. Nondegenerated sites extracted from the supergenes were subsequently joined into the new sequence of each species to construct a phylogenetic tree (Fig. 2) using MrBayes [47] (GTR+gamma model, Version 3.2). Our phylogenetic data demonstrate the phylogenetic position of the clearhead icefish (Fig. 2).

Figure 2:

Phylogeny of seven representative ray-finned fishes. The spotted gar was used as the outgroup species.

Synteny blocks and genome duplication

Genomic homology between the clearhead icefish and Nile tilapia [48] was examined using i-ADHoRe 3.0 [49] using the following settings: alignment method gg2, gap size 30, tandem gap 30, cluster gap 35, q value of 0.85, prob cutoff 0.01, anchor points 5, and using multiple hypothesis correction FDR. The output of this was processed by the pipeline and incorporated in a relational database to which visualization programs can connect and on which additional statistical analysis can then be performed. For synteny detection, the cloud mode was enabled (cluster_type = cloud) and appropriate settings were selected as follows: cloud_gap_size 20, cloud_cluster_gap 20, cloud_filter_method binomial, prob cutoff 0.01, anchor points 5, multiple hypothesis correction FDR, and level_2_only true. Finally, we identified 771 synteny blocks containing 7057 genes between the clearhead icefish and Nile tilapia. Subsequently, protein sequences of homologous gene pairs in the identified syntenic regions were aligned using MUSCLE [46], and the protein alignments were then converted to the CDS alignments. Finally, 4-fold degenerative third-codon transversion (4DTV) values were calculated on these CDS alignments and corrected using the HKY model in the PAML package [50]. These data indicate that the clearhead icefish also experienced the teleost-specific whole genome duplication (Fig. 3).

Figure 3:

Distribution of 4DTV distances between the clearhead icefish and tilapia. The horizontal axis stands for the 4DTV distance corrected using the HKY model. The vertical axis represents the percentage of collinear gene pairs.

Conclusion

We generated a draft genome assembly of the Chinese clearhead icefish. The novel genome data were deposited in publicly accessible repositories to promote further biological research, molecular breeding, and resource protection of this representative and valuable icefish.

Availability of supporting data

Supporting data and materials are available in the GigaScience GigaDB database [51], with the raw genome sequences deposited in the SRA under the bioproject number PRJNA328051.

Competing interests

The authors declare that they have no competing interests.

Funding

This study was supported by a grant from the Natural Science Foundation of Jiangsu Province (No. BK2012093), fish investigation in Taihu Lake (No. TH2016WT007), National Infrastructure of Fishery Germplasm Resources (No. 2016DKA30470), Basic Research Funds from Freshwater Fisheries Research Center (No. 2013JBFM07), Special Project on the Integration of Industry, Education and Research of Guangdong Province (No. 2013B090800017), Shenzhen Special Program for Future Industrial Development (N o. JSGG20141020113728803), and Zhenjiang Leading Talent Program for Innovation and Entrepreneurship.

Author Contributions

KL, PX, QS, DX, JX, CB, and ZZ conceived the project. MZ, XY, HY, JC, GX, DF, JQ, SJ, and JH collected the samples and extracted the genomic DNA. JL, CB, and HY performed the genome assembly and data analysis. JL, CB, QS, KL, XP, KL, YY, and ZZ wrote the paper. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file.

40 in total

1. GeneWise and Genomewise.

Authors: Ewan Birney; Michele Clamp; Richard Durbin
Journal: Genome Res Date: 2004-05 Impact factor: 9.043

2. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes.

Authors: Genis Parra; Keith Bradnam; Ian Korf
Journal: Bioinformatics Date: 2007-03-01 Impact factor: 6.937

3. SOAP2: an improved ultrafast tool for short read alignment.

Authors: Ruiqiang Li; Chang Yu; Yingrui Li; Tak-Wah Lam; Siu-Ming Yiu; Karsten Kristiansen; Jun Wang
Journal: Bioinformatics Date: 2009-06-03 Impact factor: 6.937

4. Tandem repeats finder: a program to analyze DNA sequences.

Authors: G Benson
Journal: Nucleic Acids Res Date: 1999-01-15 Impact factor: 16.971

5. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs.

Authors: Felipe A Simão; Robert M Waterhouse; Panagiotis Ioannidis; Evgenia V Kriventseva; Evgeny M Zdobnov
Journal: Bioinformatics Date: 2015-06-09 Impact factor: 6.937

6. Long-term stability of diagnosis and symptom dimensions in a systematic sample of patients with onset of schizophrenia in childhood and early adolescence. II: Postnegative distinction and childhood predictors of adult outcome.

Authors: M Maziade; S Bouchard; N Gingras; L Charron; A Cardinal; M A Roy; B Gauthier; G Tremblay; S Côté; C Fournier; P Boutin; M Hamel; C Mérette; M Martinez
Journal: Br J Psychiatry Date: 1996-09 Impact factor: 9.319

7. Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype.

Authors: Olivier Jaillon; Jean-Marc Aury; Frédéric Brunet; Jean-Louis Petit; Nicole Stange-Thomann; Evan Mauceli; Laurence Bouneau; Cécile Fischer; Catherine Ozouf-Costaz; Alain Bernot; Sophie Nicaud; David Jaffe; Sheila Fisher; Georges Lutfalla; Carole Dossat; Béatrice Segurens; Corinne Dasilva; Marcel Salanoubat; Michael Levy; Nathalie Boudet; Sergi Castellano; Véronique Anthouard; Claire Jubin; Vanina Castelli; Michael Katinka; Benoît Vacherie; Christian Biémont; Zineb Skalli; Laurence Cattolico; Julie Poulain; Véronique De Berardinis; Corinne Cruaud; Simone Duprat; Philippe Brottier; Jean-Pierre Coutanceau; Jérôme Gouzy; Genis Parra; Guillaume Lardier; Charles Chapple; Kevin J McKernan; Paul McEwan; Stephanie Bosak; Manolis Kellis; Jean-Nicolas Volff; Roderic Guigó; Michael C Zody; Jill Mesirov; Kerstin Lindblad-Toh; Bruce Birren; Chad Nusbaum; Daniel Kahn; Marc Robinson-Rechavi; Vincent Laudet; Vincent Schachter; Francis Quétier; William Saurin; Claude Scarpelli; Patrick Wincker; Eric S Lander; Jean Weissenbach; Hugues Roest Crollius
Journal: Nature Date: 2004-10-21 Impact factor: 49.962

8. i-ADHoRe 3.0--fast and sensitive detection of genomic homology in extremely large data sets.

Authors: Sebastian Proost; Jan Fostier; Dieter De Witte; Bart Dhoedt; Piet Demeester; Yves Van de Peer; Klaas Vandepoele
Journal: Nucleic Acids Res Date: 2011-11-18 Impact factor: 16.971

9. OrthoDB v8: update of the hierarchical catalog of orthologs and the underlying free software.

Authors: Evgenia V Kriventseva; Fredrik Tegenfeldt; Tom J Petty; Robert M Waterhouse; Felipe A Simão; Igor A Pozdnyakov; Panagiotis Ioannidis; Evgeny M Zdobnov
Journal: Nucleic Acids Res Date: 2014-11-26 Impact factor: 16.971

10. The genome and linkage map of the northern pike (Esox lucius): conserved synteny revealed between the salmonid sister group and the Neoteleostei.

Authors: Eric B Rondeau; David R Minkley; Jong S Leong; Amber M Messmer; Johanna R Jantzen; Kristian R von Schalburg; Craig Lemon; Nathan H Bird; Ben F Koop
Journal: PLoS One Date: 2014-07-28 Impact factor: 3.240

4 in total

Review 1. Evolutionary Genomics Reveals Multiple Functions of Arylalkylamine N-Acetyltransferase in Fish.

Authors: Yu Huang; Jia Li; Chao Bian; Ruihan Li; Xinxin You; Qiong Shi
Journal: Front Genet Date: 2022-05-19 Impact factor: 4.772

2. Whole genome sequencing of Chinese clearhead icefish, Protosalanx hyalocranius.

Authors: Kai Liu; Dongpo Xu; Jia Li; Chao Bian; Jinrong Duan; Yanfeng Zhou; Minying Zhang; Xinxin You; Yang You; Jieming Chen; Hui Yu; Gangchun Xu; Di-An Fang; Jun Qiang; Shulun Jiang; Jie He; Junmin Xu; Qiong Shi; Zhiyong Zhang; Pao Xu
Journal: Gigascience Date: 2017-04-01 Impact factor: 6.524

3. Genome Sequencing of the Japanese Eel (Anguilla japonica) for Comparative Genomic Studies on tbx4 and a tbx4 Gene Cluster in Teleost Fishes.

Authors: Weiwei Chen; Chao Bian; Xinxin You; Jia Li; Lizhen Ye; Zhengyong Wen; Yunyun Lv; Xinhui Zhang; Junmin Xu; Shaosen Yang; Ruobo Gu; Xueqiang Lin; Qiong Shi
Journal: Mar Drugs Date: 2019-07-20 Impact factor: 5.118

4. Female-specific genomic regions and molecular sex identification of the clearhead icefish (Protosalanx hyalocranius).

Authors: Teng-Fei Xing; Yu-Long Li; Jin-Xian Liu
Journal: BMC Genomics Date: 2021-07-02 Impact factor: 3.969

4 in total