Literature DB >> 28444302

Draft genome of the lined seahorse, Hippocampus erectus.

Qiang Lin¹, Ying Qiu^2,3, Ruobo Gu^2,3,4, Meng Xu⁵, Jia Li³, Chao Bian^3,6,7, Huixian Zhang¹, Geng Qin¹, Yanhong Zhang¹, Wei Luo¹, Jieming Chen³, Xinxin You^3,6, Mingjun Fan³, Min Sun³, Pao Xu^2,6, Byrappa Venkatesh⁸, Junming Xu^3,4,6, Hongtuo Fu^2,6, Qiong Shi^3,4,6,9.

Abstract

Background: The lined seahorse, Hippocampus erectus , is an Atlantic species and mainly inhabits shallow sea beds or coral reefs. It has become very popular in China for its wide use in traditional Chinese medicine. In order to improve the aquaculture yield of this valuable fish species, we are trying to develop genomic resources for assistant selection in genetic breeding. Here, we provide whole genome sequencing, assembly, and gene annotation of the lined seahorse, which can enrich genome resource and further application for its molecular breeding. A total of 174.6 Gb (Gigabase) raw DNA sequences were generated by the Illumina Hiseq2500 platform. The final assembly of the lined seahorse genome is around 458 Mb, representing 94% of the estimated genome size (489 Mb by k-mer analysis). The contig N50 and scaffold N50 reached 14.57 kb and 1.97 Mb, respectively. Quality of the assembled genome was assessed by BUSCO with prediction of 85% of the known vertebrate genes and evaluated using the de novo assembled RNA-seq transcripts to prove a high mapping ratio (more than 99% transcripts could be mapped to the assembly). Using homology-based, de novo and transcriptome-based prediction methods, we predicted 20 788 protein-coding genes in the generated assembly, which is less than our previously reported gene number (23 458) of the tiger tail seahorse ( H. comes ). We report a draft genome of the lined seahorse. These generated genomic data are going to enrich genome resource of this economically important fish, and also provide insights into the genetic mechanisms of its iconic morphology and male pregnancy behavior.

Entities: Chemical Disease Species

Keywords: genome; assembly; annotation; Hippocampus erectus

Mesh：

Year: 2017 PMID： 28444302 PMCID： PMC5459928 DOI： 10.1093/gigascience/gix030

Source DB: PubMed Journal: Gigascience ISSN： 2047-217X Impact factor: 6.524

Data Description

Background

Syngnathidae, an interesting teleost family, exhibit special morphological innovations and reproductive behavior, and these phenotypes have come into being through long-term molecular evolution [1, 2]. Seahorses (Hippocampinae) are a group of popular and iconic species because of their unique body plan and male pregnancy. As an interesting model, seahorses could provide exceptional clues for studying evolution in virtue of their closed brood pouch, male pregnancy, and seasonal migration [3, 4]. Recently, we have reported whole genome sequence of the tiger tail seahorse (Hippocampus comes) [5] and provided primary insights into the genetic basis of its iconic morphology. The work also dealt with a number of fascinating areas, such as the patristacin subfamily of astacin metalloproteases that may be closely related to the unusual male pregnancy in this species since they were expanded and highly expressed in the male brood pouch during mid- and late pregnancy [5]. Here, we provide a draft genome of the lined seahorse (H. erectus; Fig. 1), which inhabits coastal waters in the Western Atlantic such as Nova Scotia, Canada, and northern Gulf of Mexico to Panama and Venezuela [6]. It has been treated as vulnerable or endangered in the Red List of Threatened Species (IUCN 2015) [7]. Moreover, the lined seahorse is easily domesticated for breeding, and it has become a popular and commercially important ingredient for traditional Chinese medicine in China [8-12]. In order to study the evolutionary history of the lined seahorse and improve its aquaculture yield, we are trying to develop genomic resources for assisted selection in genetic breeding. Hence, we performed whole genome sequencing, assembly, and gene annotation of the lined seahorse, which should facilitate further studies on species conservation and molecular breeding of this economically important fish.

Figure 1:

Photo of a cultivated line seahorse in Shenzhen, China.

Preparation and sequencing of DNA samples

Genomic DNA was extracted from a pool of four male lined seahorses (NCBI Taxonomy ID: 109281; Fishbase ID: 3283). All animal experiments conformed to the guidelines of the Animal Ethics Committee and were approved by the Institutional Review Board on Bioethics and Biosafety of BGI (approval ID: FT16091). Seven libraries, including three short-insert libraries (200, 500, and 800 bp) and four long-insert libraries (2, 5, 10, and 20 kb), were constructed based on the standard protocol of Illumina (CA, USA) and sequenced using the Illumina HiSeq2500 platform (the read length is 125 bp). Finally, we generated a total of 174.6-Gb raw sequences.

Processing of the raw sequencing reads

These raw sequences contained some sequencing errors, which may reduce the quality of genome assembly. Hence we filtered these raw sequences with the following stringent filtering processes through SOAPfilter (v. 2.2) software [13]: (i) filtered reads with 40% low-quality bases (quality scores ≤7); (ii) removed reads with N bases more than 10%; (iii) trimmed reads with five low-quality bases at the 5΄ end; (iv) discarded reads with adapter contamination and/or PCR duplicates; (v) corrected raw reads from the short-insert libraries based on the k-mer spectrum. Finally, we obtained 111.3 Gb of clean reads in total, in which 12.0, 14.5, 13.7, 18.0, 15.9, 18.9, and 18.3 Gb were kept from the seven sequencing libraries (from 200 bp to 20 kb), respectively.

Estimation of the genome size and assembly of the genome sequences

The genome size was estimated based on the k-mer spectrum [14] with the following formula: G = k-mer_number/k-mer_depth, where G is the genome size, k-mer_number is the total number of k-mer, and k-mer_depth means the peak frequency that was higher than any other frequencies. For the lined seahorse, the k-mer_number is 24 445 959 200 (based on 17-mer), and the k-mer_depth is 50. Therefore, the genome size was estimated to be approximately 489 Mb, which is much smaller than our estimation (695 Mb) for the tiger tail seahorse [5]. The generated clean reads were further assembled by SOAPdenovo2 (v. 2.04) [15] with optimized parameters (pregraph -K 27 -d 1; contig -M 1; scaff -b 1.5) to construct contigs and original scaffolds. Subsequently the gaps in the intra-scaffolds were filled using the reads of short-insert libraries by GapCloser 1.12 [13]. Finally, the achieved total scaffold length reached up to 457 759 912 bp with 2.8% gaps (12.8 Mb), which is smaller than that of the reported tiger tail seahorse (501 592 652 bp) [5]. The calculated scaffold N50 and contig N50 are 1.97 Mb and 14.57 kb, respectively (Table 1), which are comparable with the values from the tiger tail seahorse (see more details about the comparison in Table 1)[5].

Table 1:

Comparison of genome assembly and annotation between the lined seahorse and the reported tiger tail seahorse

Genome Assembly	Lined Seahorse	Tiger Tail Seahorse
Contig N50 size (kb)	14.57	34.67
Scaffold N50 size (Mb)	1.97	1.87
Estimated genome size (Mb)	489	695
Assembled genome size (Mb)	457.76	501.59
Genome coverage (×)	243.05	192.05
Longest scaffold (bp)	7 855 128	9 810 584
Genome annotation
Protein-coding gene number	20 788	23 458
Annotated functional gene number	18 776 (90.32%)	22 245 (94.83%)
Unannotated functional gene number	2012 (9.68%)	1213 (5.17%)
Transposable elements content	28.1%	24.8%

Comparison of genome assembly and annotation between the lined seahorse and the reported tiger tail seahorse

Assessment of genome completeness

Benchmarking Universal Single-Copy Orthologs (BUSCO) [16] is a software that can be used to evaluate the completeness of a genome assembly by genes selected from appropriate lineage-specific orthologous groups. For the lined seahorse, the analysis data proved that our assembly contains 73% complete and 12% partial sequences of vertebrate BUSCO orthologues (3023 genes in total). Simultaneous completeness of the lined seahorse genome was also evaluated using the de novo assembled RNA-seq transcripts from different developmental stages of the lined seahorse (downloaded from our recent paper [5]) to map the lined seahorse genome assembly with Blat [17]. All the results showed that more than 99% of transcripts could be mapped to the assembly (Table 2), suggesting that our assembly is of high quality.

Table 2:

Assessment of the completeness of the lined seahorse genome using transcriptome data

Dataset	Number	Total Length (bp)	Base Covered by Assembly (%)	Sequence Covered by Assembly (%)	With >90% Sequence in 1 Scaffold		With >50% Sequence in 1 Scaffold
Dataset	Number	Total Length (bp)	Base Covered by Assembly (%)	Sequence Covered by Assembly (%)	Number	Percent	Number	Percent
All	71 765	52 877 091	98.22	99.52	68 292	95.16	71 255	99.29
>200 bp	71 765	52 877 091	98.22	99.52	68 292	95.16	71 255	99.29
>500 bp	29 811	40 111 717	98.12	99.68	27 902	93.60	29 640	99.43
>1000 bp	14 780	29 612 539	97.92	99.70	13 561	91.75	14 686	99.36

Assessment of the completeness of the lined seahorse genome using transcriptome data

Repeat analysis

Tandem repeats were searched in the generated genome assembly by utilizing Tandem Repeats Finder (v. 4.04) [18]. Transposable elements (TEs) were identified with an approach that combined both homology-based and de novo predictions. First, RepeatMask (v. 3.3.0) [19] was employed to detect known TEs based on a homologous search against the Repbase TE library (release 17.01) [20]. RepeatProteinMask (v. 3.3.0) [19], an updated software included in the RepeatMasker package, was used to identify the TE relevant proteins. Subsequently, LTR_FINDER [21] and RepeatModeler (v. 1.05) [22] were used with the default parameters to construct the de novo repeat library. Then we used RepeatMask [19] to identify and classify novel TEs against this de novo repeat library. All the repeats were finally combined together with a filtering of those redundant repetitive sequences. In total, the lined seahorse genome comprises approximately 30.43% repetitive sequences, in which 28.12% are TEs. Interestingly, the most abundant type of TE is class II DNA transposon, which covered around 15% of the genome. Our data are similar to the report of the tiger tail seahorse [5], in which 24.82% are TEs, with class II DNA transposon as the most abundant.

Gene annotation

De novo prediction

Repetitive regions in the genome sequence were replaced with “N” to reduce the ratio of pseudogene annotations. Then we chose 1000 full-length but randomly selected genes from zebrafish homology gene set to train the model parameters for AUGUSTUS. We subsequently employed AUGUSTUS 3.0.1 [23] and GenScan 1.0 [24] for de novo prediction of repeat-masked genome sequences. Short genes (less than 150 bp) and premature or frame-shifted genes were removed.

Homology-based annotation

Protein sequences of zebrafish (Danio rerio), medaka (Oryzias latipes), fugu (Takifugu rubripes), stickleback (Gasterosteus aculeatus), and Nile tilapia (Oreochromis niloticus) were downloaded from Ensembl (release 83) [25]. Protein sequences of the tiger tail seahorse (H. comes) were downloaded from our recently published genome data (Bioproject ID: PRJNA314292) [5]. Protein sets of these species were mapped to the assembled lined seahorse genome using tBlastn (v. 2.2.19) [26] with E-value ≤ 1e-5. Genewise (v. 2.2.0) [27] was applied to refine the potential gene models of all alignments. Ultimately, we filtered short genes (less than 150 bp) and premature or frame-shifted genes.

Transcriptome-based prediction

We downloaded the transcriptome data of the lined seahorse from our previous work [10]. The raw reads were mapped onto the genome using TopHat (v. 2.0) [28] with the default parameters and assembled into transcripts using Cufflinks [29].

Gene set integration and optimization

The gene models based on de novo prediction, homology-based annotation, and transcriptome-based prediction were merged to form a comprehensive and non-redundant gene set using GLEAN [30]. Finally, we obtained a gene set containing 20 788 genes, which is less than the reported gene number (23 458) of the tiger tail seahorse [5].

Annotation of patristacin gene family

The patristacin subfamily of the astacin metalloprotease family may be closely related to the unusual male pregnancy in seahorses since we identified six patristacin genes in the tiger tail seahorse and confirmed their expansion and high expression in the male brood pouch [5]. We also analyzed patristacin in the lined seahorse genome. Related patristacin protein sequences were downloaded from the tiger tail seahorse genome data [5] and used for homology searches against the lined seahorse genome using tBlastn (v. 2.2.19) [26]. We chose alignments with coverage >50% and identity >50% and then used Genewise (v. 2.2.0) [27] to predict the gene structures. We also downloaded the RNA-seq data at the pregnancy stage of the male lined seahorse from our recently published paper [10] to confirm the existence of the six patristacin genes in the lined seahorse. The RNA-seq reads were mapped by TopHat [28], and gene expression levels were measured by RPKM (reads per kilobases per million reads). Finally, we observed that all the six patristacin genes were expressed during pregnancy in the male lined seahorse.

Functional assignment

The protein sequences predicted from the lined seahorse genome were aligned to the Swiss-Prot and TrEMBL databases [31] using BlastP at E-value ≤ 1e-5. The motifs and domains were annotated using InterProScan [32] by searching publicly available databases including Pfam [33], ProDom [34], SMART [35], PRINTS [36], and PANTHER [37], and then we retrieved Gene Ontology (GO) [38] annotation from the results of InterProScan. The gene pathways were assigned based on the best blast hit against the KEGG database [39]. In summary, approximately 90.32% of the genes are supported by at least one related function from the searched databases (Swiss-Prot, Interpro, TrEMBL, and KEGG).

Construction of gene families

Protein sequences of seven ray-fin fishes, including zebrafish, medaka, fugu, stickleback, Nile tilapia, platyfish (Xiphophorus maculatus), and spotted gar (Lepisosteus oculatus), were downloaded from Ensembl (release 83) [25]. Protein sequences of the tiger tail seahorse (H. comes) were downloaded from our recently published genome data [5]. Protein sequences of Gulf pipefish (Sygnathus scovelli) were downloaded from the Cresko Lab web server (http://creskolab.uoregon.edu) [40]. The consensus proteome set of the above nine species and the lined seahorse were composed of a final data set of 209 747 protein sequences. Finally, we used OrthoMCL [41] to cluster gene families and obtained 19 053 OrthoMCL families with all-to-all BLASTP strategy (E-value ≤ 1e-5) and a Markov Chain Clustering (MCL) default inflation parameter.

Phylogenetic analysis

We extracted 2812 one-to-one orthologous genes from the above-mentioned gene family set. The protein sequences of each selected family were aligned using MUSCLE (v. 3.8.31) [42] with the default parameters. The protein alignments were then converted to corresponding coding sequences (CDS) using an in-house Perl script. All these nucleotide sequences were concatenated into a supergene for each species, all of which were used to construct a phylogenetic tree using PhyML (Fig. 2) [43].

Figure 2:

Phylogeny of ray-finned fishes. The Spotted gar was used as the outgroup species. See more details of the protein sequence sources in the main context.

Conclusions

Seahorses are a fascinating teleost group with special morphological innovations and reproductive behavior. In our previous genome paper about the tiger tail seahorse [5], we paid much attention to the genetic bases of their unique morphology and reproductive system. However, besides the spectacular aspects of the phenotype, seahorses have been very popular in traditional Chinese medicine. Here we report the first draft genome assembly of the lined seahorse, an economically important aquaculture fish in China. With availability of these genomic data, we can develop genetic markers for construction of a high-density genetic linkage map, and subsequently further genetic selection and molecular breeding in the future. These works will support a significant increase of the aquaculture yield, which could produce remarkable economic benefits and realize the ecological protection of seahorses in the world. Our genome data will also facilitate the genetic mechanism study and evolutionary history analysis of the lined seahorse.

Availability of supporting data

Supporting data are available in the GigaScience database [44], and the raw data have been deposited in NCBI with the project accession PRJNA347499. Phylogeny of ray-finned fishes. The Spotted gar was used as the outgroup species. See more details of the protein sequence sources in the main context. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file.

39 in total

1. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.

Authors: M Ashburner; C A Ball; J A Blake; D Botstein; H Butler; J M Cherry; A P Davis; K Dolinski; S S Dwight; J T Eppig; M A Harris; D P Hill; L Issel-Tarver; A Kasarskis; S Lewis; J C Matese; J E Richardson; M Ringwald; G M Rubin; G Sherlock
Journal: Nat Genet Date: 2000-05 Impact factor: 38.330

2. BLAT--the BLAST-like alignment tool.

Authors: W James Kent
Journal: Genome Res Date: 2002-04 Impact factor: 9.043

3. Male pregnancy in seahorses and pipefishes (family Syngnathidae): rapid diversification of paternal brood pouch morphology inferred from a molecular phylogeny.

Authors: A B Wilson; A Vincent; I Ahnesjö; A Meyer
Journal: J Hered Date: 2001 Mar-Apr Impact factor: 2.645

4. GeneWise and Genomewise.

Authors: Ewan Birney; Michele Clamp; Richard Durbin
Journal: Genome Res Date: 2004-05 Impact factor: 9.043

5. SOAP2: an improved ultrafast tool for short read alignment.

Authors: Ruiqiang Li; Chang Yu; Yingrui Li; Tak-Wah Lam; Siu-Ming Yiu; Karsten Kristiansen; Jun Wang
Journal: Bioinformatics Date: 2009-06-03 Impact factor: 6.937

6. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs.

Authors: Felipe A Simão; Robert M Waterhouse; Panagiotis Ioannidis; Evgenia V Kriventseva; Evgeny M Zdobnov
Journal: Bioinformatics Date: 2015-06-09 Impact factor: 6.937

7. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation.

Authors: Cole Trapnell; Brian A Williams; Geo Pertea; Ali Mortazavi; Gordon Kwan; Marijke J van Baren; Steven L Salzberg; Barbara J Wold; Lior Pachter
Journal: Nat Biotechnol Date: 2010-05-02 Impact factor: 54.908

8. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler.

Authors: Ruibang Luo; Binghang Liu; Yinlong Xie; Zhenyu Li; Weihua Huang; Jianying Yuan; Guangzhu He; Yanxiang Chen; Qi Pan; Yunjie Liu; Jingbo Tang; Gengxiong Wu; Hao Zhang; Yujian Shi; Yong Liu; Chang Yu; Bo Wang; Yao Lu; Changlei Han; David W Cheung; Siu-Ming Yiu; Shaoliang Peng; Zhu Xiaoqian; Guangming Liu; Xiangke Liao; Yingrui Li; Huanming Yang; Jian Wang; Tak-Wah Lam; Jun Wang
Journal: Gigascience Date: 2012-12-27 Impact factor: 6.524

9. Pfam: the protein families database.

Authors: Robert D Finn; Alex Bateman; Jody Clements; Penelope Coggill; Ruth Y Eberhardt; Sean R Eddy; Andreas Heger; Kirstie Hetherington; Liisa Holm; Jaina Mistry; Erik L L Sonnhammer; John Tate; Marco Punta
Journal: Nucleic Acids Res Date: 2013-11-27 Impact factor: 16.971

10. De Novo Transcriptome Analysis of Two Seahorse Species (Hippocampus erectus and H. mohnikei) and the Development of Molecular Markers for Population Genetics.

Authors: Qiang Lin; Wei Luo; Shiming Wan; Zexia Gao
Journal: PLoS One Date: 2016-04-29 Impact factor: 3.240

10 in total

1. Seadragon genome analysis provides insights into its phenotype and sex determination locus.

Authors: Meng Qu; Yali Liu; Yanhong Zhang; Shiming Wan; Vydianathan Ravi; Geng Qin; Han Jiang; Xin Wang; Huixian Zhang; Bo Zhang; Zexia Gao; Ann Huysseune; Zhixin Zhang; Hao Zhang; Zelin Chen; Haiyan Yu; Yongli Wu; Lu Tang; Chunyan Li; Jia Zhong; Liming Ma; Fengling Wang; Hongkun Zheng; Jianping Yin; Paul Eckhard Witten; Axel Meyer; Byrappa Venkatesh; Qiang Lin
Journal: Sci Adv Date: 2021-08-18 Impact factor: 14.136

2. Evolutionary divergence of a Hoxa2b hindbrain enhancer in syngnathids mimics results of functional assays.

Authors: Allison M Fuiten; William A Cresko
Journal: Dev Genes Evol Date: 2021-05-18 Impact factor: 0.900

3. Draft genome of the lined seahorse, Hippocampus erectus.

Authors: Qiang Lin; Ying Qiu; Ruobo Gu; Meng Xu; Jia Li; Chao Bian; Huixian Zhang; Geng Qin; Yanhong Zhang; Wei Luo; Jieming Chen; Xinxin You; Mingjun Fan; Min Sun; Pao Xu; Byrappa Venkatesh; Junming Xu; Hongtuo Fu; Qiong Shi
Journal: Gigascience Date: 2017-06-01 Impact factor: 6.524

4. Phylogenomics investigation of sparids (Teleostei: Spariformes) using high-quality proteomes highlights the importance of taxon sampling.

Authors: Paschalis Natsidis; Alexandros Tsakogiannis; Pavlos Pavlidis; Costas S Tsigenopoulos; Tereza Manousaki
Journal: Commun Biol Date: 2019-11-01

5. A highly contiguous nuclear genome assembly of the mandarinfish Synchiropus splendidus (Syngnathiformes: Callionymidae).

Authors: Martin Stervander; William A Cresko
Journal: G3 (Bethesda) Date: 2021-12-08 Impact factor: 3.154

6. Hepcidin Gene Co-Option Balancing Paternal Immune Protection and Male Pregnancy.

Authors: Wanghong Xiao; Zelin Chen; Yanhong Zhang; Yongli Wu; Han Jiang; Huixian Zhang; Meng Qu; Qiang Lin; Geng Qin
Journal: Front Immunol Date: 2022-04-22 Impact factor: 8.786

7. Identification of neurohypophysial hormones and the role of VT in the parturition of pregnant seahorses (Hippocampus erectus).

Authors: Huixian Zhang; Yali Liu; Geng Qin; Qiang Lin
Journal: Front Endocrinol (Lausanne) Date: 2022-07-29 Impact factor: 6.055

8. High-Throughput Identification of Putative Antimicrobial Peptides from Multi-Omics Data of the Lined Seahorse (Hippocampus erectus).

Authors: Xiyang Chen; Yunhai Yi; Xinxin You; Jie Liu; Qiong Shi
Journal: Mar Drugs Date: 2019-12-29 Impact factor: 5.118

9. Genome and population sequencing of a chromosome-level genome assembly of the Chinese tapertail anchovy (Coilia nasus) provides novel insights into migratory adaptation.

Authors: Gangchun Xu; Chao Bian; Zhijuan Nie; Jia Li; Yuyu Wang; Dongpo Xu; Xinxin You; Hongbo Liu; Jiancao Gao; Hongxia Li; Kai Liu; Jian Yang; Quanjie Li; Nailin Shao; Yanbing Zhuang; Dian Fang; Tao Jiang; Yunyun Lv; Yu Huang; Ruobo Gu; Junmin Xu; Wei Ge; Qiong Shi; Pao Xu
Journal: Gigascience Date: 2020-01-01 Impact factor: 6.524

10. Genome and gene evolution of seahorse species revealed by the chromosome-level genome of Hippocampus abdominalis.

Authors: Libin He; Xin Long; Jianfei Qi; Zongji Wang; Zhen Huang; Shuiqing Wu; Xingtan Zhang; Huiyu Luo; Xinxin Chen; Jinbo Lin; Qiuhua Yang; Shiyu Huang; Qi Zhou; Leyun Zheng
Journal: Mol Ecol Resour Date: 2021-11-07 Impact factor: 8.678

10 in total