Draft genome of the Northern snakehead, Channa argus.

Jian Xu¹, Chao Bian^2,3,4, Kunci Chen⁵, Guiming Liu⁶, Yanliang Jiang¹, Qing Luo⁵, Xinxin You^2,3, Wenzhu Peng^1,7, Jia Li³, Yu Huang³, Yunhai Yi³, Chuanju Dong^1,8, Hua Deng⁹, Songhao Zhang¹, Hanyuan Zhang¹, Qiong Shi^2,3,10, Peng Xu^1,7.

Abstract

Background: The Northern snakehead (Channa argus), a member of the Channidae family of the Perciformes, is an economically important freshwater fish native to East Asia. In North America, it has become notorious as an intentionally released invasive species. Its ability to breathe air with gills and migrate short distances over land makes it a good model for research on bimodal breathing. Therefore, recent research has focused on the identification of relevant candidate genes. Here, we performed whole genome sequencing of C. argus to construct its draft genome, aiming to offer useful information for further functional studies and identification of target genes related to its unusual facultative air breathing.
Findings: We assembled the C. argus genome from a total of 140.3 Gb of raw reads, which were sequenced using the Illumina HiSeq2000 platform. The final draft genome assembly was approximately 615.3 Mb, with a contig N50 of 81.4 kb and a scaffold N50 of 4.5 Mb. The identified repeat sequences account for 18.9% of the whole genome. A total of 19 877 protein-coding genes were predicted from the genome assembly, with an average of 10.5 exons per gene.
Conclusion: We generated a high-quality draft genome of C. argus, which will provide a valuable genetic resource for further biomedical investigations of this economically important teleost fish.

Keywords: Channa argus; annotation; gene prediction; genome assembly

Year: 2017 PMID: 28327946 PMCID: PMC5530311 DOI: 10.1093/gigascience/gix011

Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524

Data description

Introduction of C. argus

The Northern snakehead (Channa argus) is a snakehead fish cultivated mainly in Asia and Africa for food, especially in China, where annual production is about 510 000 tons (worth ∼1.6 billion US dollars) (Fig. 1). Genetic degradation caused by inbreeding in C. argus cultivation has led to higher susceptibility to diseases. Furthermore, C. argus is considered a serious invasive species in North America, due to its wide-range diet, parental care, and rapid colonization and expansion [1]. C. argus has a specialized aerial breathing organ, the suprabranchial chamber, which facilitates its aquatic–aerial bimodal breathing. Because of its aggressive behavior in river, lake, and pond ecosystems and its limited consumption as food in America, it threatens the balance of these ecosystems. For both economic and ecological reasons, it is vital to develop genomic resources for further genetic breeding studies and ecological research. So far, the genome sequence of C. argus has not been reported; hence, in the current study we performed genome sequencing, assembly, and annotation of this teleost species.

Figure 1: The Northern snakehead fish, Channa argus.

C. argus genome sequencing on the Illumina platform

Genomic DNA was extracted from a blood sample of a single female C. argus (Fishbase ID: 4799) using Qiagen GenomicTip100 (Qiagen). The fish was obtained from the Pearl River Fisheries Research Institute, Chinese Academy of Fishery Sciences, Guangzhou, China. A whole-genome shotgun sequencing strategy was applied, and short-insert libraries (180, 500, and 800 bp) and long-insert libraries (3 and 5 kb) were constructed using the standard protocol provided by Illumina (San Diego, CA, USA). Paired-end sequencing with a 2 × 100-bp read length was performed on the short-insert and long-insert libraries using the Illumina HiSeq2000 platform. In total, we generated about 140.3 Gb of raw reads, including 33.0, 36.9, 17.4, 26.5, and 26.5 Gb of reads from the 180-bp, 500-bp, 800-bp, 3-kb, and 5-kb libraries, respectively. After removal of low-quality and redundant reads, we obtained about 138.2 Gb of clean data for de novo assembly of the C. argus genome.
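As a quick arithmetic check of the yields quoted above, the following minimal Python sketch sums the per-library raw data and computes the fraction retained after filtering. The variable names are illustrative only; the numbers are taken from the text.

```python
# Raw read yield per library in Gb, as reported in the text.
library_yield_gb = {
    "180 bp": 33.0,
    "500 bp": 36.9,
    "800 bp": 17.4,
    "3 kb": 26.5,
    "5 kb": 26.5,
}

# Sum of all libraries, rounded to the one decimal place used in the text.
total_raw_gb = round(sum(library_yield_gb.values()), 1)

# Clean data remaining after removal of low-quality and redundant reads.
clean_gb = 138.2

# Fraction of the raw data retained after filtering.
retained_fraction = clean_gb / total_raw_gb

print(f"total raw reads: {total_raw_gb} Gb")
print(f"retained after filtering: {retained_fraction:.1%}")
```

The per-library values add up to the reported 140.3 Gb, and roughly 98.5% of the raw data is retained as clean data.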

Estimation of C. argus genome size and sequencing coverage

All the cleaned reads were subjected to 17-mer frequency distribution analysis [2]. As the total number of k-mers was about 5.90 × 10^10 and the k-mer peak was at a depth of 88, the genome size of C. argus was calculated to be 670.4 Mb using the following formula: genome size = k-mer_number / peak_depth. Therefore, the sequencing coverage was found to be ∼124.5 × based on the estimated genome size.
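The k-mer formula can be sketched in a few lines of Python. The function name `estimate_genome_size` is hypothetical, and the k-mer count used here is the rounded value quoted in the text, so the result only approximates the reported 670.4 Mb.

```python
def estimate_genome_size(kmer_number: float, peak_depth: float) -> float:
    """Return the estimated genome size in bp as k-mer_number / peak_depth."""
    if peak_depth <= 0:
        raise ValueError("peak_depth must be positive")
    return kmer_number / peak_depth


# Values quoted in the text: about 5.90 x 10^10 17-mers, peak depth of 88.
kmer_number = 5.90e10
peak_depth = 88

genome_size_bp = estimate_genome_size(kmer_number, peak_depth)
genome_size_mb = genome_size_bp / 1e6

print(f"estimated genome size: {genome_size_mb:.2f} Mb")
```

With the rounded inputs the estimate lands within about 0.1 Mb of the reported figure; the small difference is expected because the k-mer total is given only to three significant digits.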

De novo genome assembly and quality assessment

For whole genome assembly, SOAPdenovo2 [3] was used with optimized parameters (-K 75) to construct contigs and original scaffolds by using the reads from short-insert libraries. All reads were then mapped onto contigs for scaffold construction by utilizing the paired-end information of long-insert libraries. Some intra-scaffold gaps were filled by local software using read-pairs in which one end uniquely mapped to a contig and the other end was located within a gap. Finally, a draft C. argus genome of 615.3 Mb was assembled, with a contig N50 size of 81.4 kb and a scaffold N50 size of 4.5 Mb (Table 1).
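For readers unfamiliar with the N50 statistic quoted above, here is a minimal sketch of the usual definition: the length L such that sequences of length at least L together cover at least half of the total assembly length. The function `n50` and the toy contig lengths are illustrative and are not taken from the C. argus assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example: total length 300, so half is 150.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 80 + 70 = 150 reaches half, so N50 is 70
```

Applied to the real contig and scaffold length lists, the same calculation would be expected to yield the contig N50 and scaffold N50 reported in the text.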

Table 1: Summary of the Channa argus genome assembly and annotation

Genome assembly
Contig N50 size (kb)	81
Contig number (>100 bp)	29 146
Scaffold N50 size (Mb)	4.5
Scaffold number (>100 bp)	5297
Total length (Mb)	615.3
Genome coverage (X)	224.6
The longest scaffold (bp)	18 736 006
Genome annotation
Protein-coding gene number	19 877
Mean transcript length (kb)	16.5
Mean exons per gene	10.5
Mean exon length (bp)	175.0
Mean intron length (bp)	1537.3

Subsequently, the Core Eukaryotic Genes Mapping Approach (CEGMA) software [4] (version 2.3) with 248 conserved Core Eukaryotic Genes was used to evaluate gene completeness. The generated genome assembly covered 242 of the 248 Core Eukaryotic Gene sequences, suggesting a high level of completeness. We also used BUSCO (version 1.22) [5], with the representative vertebrate gene set containing 3023 single-copy genes that are highly conserved in vertebrates, to assess the quality of the generated genome assembly. The assessment gave a BUSCO value of 82.9%, with C: 66% [D: 1.4%], F: 16%, M: 17%, n: 3023 (C: complete [D: duplicated], F: fragmented, M: missing, n: number of genes), suggesting a high quality of the generated assembly.
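The completeness figures can be turned into percentages with a short Python sketch. The helper `parse_busco` is hypothetical and simply parses the summary string as printed in the text; it is not part of BUSCO itself.

```python
import re

# Core Eukaryotic Genes found in the assembly, from the text.
cegma_found, cegma_total = 242, 248
cegma_fraction = cegma_found / cegma_total

# BUSCO summary string as quoted in the text.
busco_summary = "C: 66% [D: 1.4%], F: 16%, M: 17%, n: 3023"


def parse_busco(summary: str) -> dict:
    """Parse a BUSCO-style summary into a dict of floats keyed by category."""
    fields = re.findall(r"([CDFMn]):\s*([\d.]+)", summary)
    return {key: float(value) for key, value in fields}


busco = parse_busco(busco_summary)

# Complete plus fragmented genes, using the rounded percentages.
found_percent = busco["C"] + busco["F"]

print(f"CEGMA completeness: {cegma_fraction:.1%}")
print(f"BUSCO complete + fragmented: {found_percent:.0f}%")
```

The rounded categories sum to 82%, slightly below the 82.9% stated in the text; the assumption here is that the reported value was computed from unrounded counts.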

Repeat sequence within the C. argus genome assembly

To identify tandem repeats in the C. argus genome, we employed Tandem Repeats Finder [6] (version 4.04) with core parameters set as “Match = 2, Mismatch = 7, Delta = 7, PM = 80, PI = 10, Minscore = 50, and MaxPeriod = 2000”. Simultaneously, RepeatModeler (version 1.04) and LTR_FINDER [7] were used with default parameters to construct a de novo repeat library. Subsequently, we used RepeatMasker [8] (version 3.2.9) to map our assembled sequences against the Repbase TE (version 14.04) [9] and the de novo repeat libraries to search for known and novel transposable elements (TEs). In addition, the TE-related proteins were annotated using RepeatProteinMask software [8] (version 3.2.2). In summary, the total identified repeat sequences accounted for 18.94% of the C. argus genome (Table 2). Among them, long interspersed nuclear elements were the most abundant type of repeat sequences and occupied 8.92% of the whole genome.

Table 2: Detailed classification of repeat sequences of Channa argus

	Repbase TEs		TE proteins		De novo		Combined TEs
Type	Length (bp)	% in genome	Length (bp)	% in genome	Length (bp)	% in genome	Length (bp)	% in genome
DNA	17 984 515	2.92	6 784 728	1.10	25 663 752	4.17	35 435 946	5.76
LINE	16 799 343	2.73	17 563 763	2.85	54 890 557	8.92	60 651 866	9.86
SINE	4 512 385	0.73	0	0	6 672 552	1.08	9 026 285	1.47
LTR	4 421 728	0.72	3 031 607	0.49	24 144 657	3.92	26 983 318	4.39
Other	8125	0.001	0	0	0	0	8125	0.001
Unknown	0	0	0	0	9 413 375	1.53	9 413 375	1.53
Total	41 585 442	6.76	27 363 267	4.45	103 162 115	16.77	116 545 270	18.94
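The "% in genome" columns of Table 2 can be reproduced by dividing each length by the assembly size. The sketch below does this for the Combined TEs column, assuming the rounded assembly size of 615.3 Mb from Table 1 as the denominator (the exact base count is not given in the text).

```python
# Assumption: rounded assembly size from Table 1 (615.3 Mb).
ASSEMBLY_SIZE_BP = 615.3e6

# Combined TE lengths in bp, copied from Table 2.
combined_te_bp = {
    "DNA": 35_435_946,
    "LINE": 60_651_866,
    "SINE": 9_026_285,
    "LTR": 26_983_318,
    "Other": 8_125,
    "Unknown": 9_413_375,
    "Total": 116_545_270,
}


def percent_of_genome(length_bp, assembly_bp=ASSEMBLY_SIZE_BP):
    """Return the share of the assembly covered by length_bp, in percent."""
    return 100.0 * length_bp / assembly_bp


for repeat_type, length_bp in combined_te_bp.items():
    print(f"{repeat_type}\t{length_bp}\t{percent_of_genome(length_bp):.2f}")
```

Under this assumption the computed percentages agree with the table to two decimal places, for example 9.86% for LINE and 18.94% for the total. The per-type lengths do not sum to the total because elements of different types can overlap.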

Gene annotation

Gene annotation of the C. argus genome was conducted using several approaches, including transcriptome-based prediction, de novo prediction, and homology-based prediction. RNA-seq datasets of 13 pooled tissues were obtained from our previous work [10]. We mapped these RNA reads onto our genome assembly using TopHat1.2 software [11], and then employed Cufflinks (version 2.2.1) [12] to predict the gene structures. Furthermore, we ran Augustus (version 2.5.5) [13], GlimmerHMM (version 3.0.1) [14], and GenScan (version 1.0) [15] for de novo prediction on the repeat-masked C. argus genome assembly. The protein sequences of zebrafish (Danio rerio) [16], Japanese puffer (Fugu rubripes) [17], medaka (Oryzias latipes) [18], spotted green pufferfish (Tetraodon nigroviridis) [19] (the above 5 species were downloaded from Ensembl release 75), blue spotted mudskipper (Boleophthalmus boddarti) [20], and golden arowana (Scleropages formosus) [21] were mapped onto the C. argus genome using TblastN with e-value ≤ 1e-5. Subsequently, Genewise2.2.0 software [22] was employed to predict the potential gene structures on all alignments. Finally, the above three datasets were integrated to yield a comprehensive and nonredundant gene set using GLEAN (https://sourceforge.net/projects/glean-gene/) [23] with several filter steps (removing partial sequences, genes shorter than 150 bp, and prematurely terminated or frame-shifted genes). The final gene set was composed of 19 877 genes, with an average of 10.5 exons per gene (Table 1).
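The filter steps named above (partial sequences, genes shorter than 150 bp, premature termination, frame shifts) could be expressed roughly as follows. This is a sketch under stated assumptions, not the filtering code used with GLEAN: the function `passes_filters` is hypothetical, "partial" is interpreted here as lacking an ATG start codon or a terminal stop codon, and the standard genetic code is assumed.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)
MIN_CDS_LENGTH = 150                 # threshold in bp, from the text


def passes_filters(cds: str) -> bool:
    """Return True if a coding sequence survives all four filters."""
    cds = cds.upper()
    if len(cds) < MIN_CDS_LENGTH:
        return False  # shorter than 150 bp
    if len(cds) % 3 != 0:
        return False  # length not a multiple of 3: treated as frame-shifted
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False  # treated as a partial sequence
    if any(codon in STOP_CODONS for codon in codons[1:-1]):
        return False  # internal stop codon: prematurely terminated
    return True


# Toy sequences for illustration only.
complete = "ATG" + "GCT" * 50 + "TAA"                    # 156 bp, intact
too_short = "ATG" + "GCT" * 10 + "TAA"                   # 36 bp
premature = "ATG" + "GCT" * 25 + "TGA" + "GCT" * 25 + "TAA"
shifted = complete[:-4] + complete[-3:]                  # one base deleted

for name, seq in [("complete", complete), ("too_short", too_short),
                  ("premature", premature), ("shifted", shifted)]:
    print(name, passes_filters(seq))
```

Only the `complete` toy sequence passes; each of the other three fails exactly one of the filters.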

Construction of gene families and phylogenetic tree

We downloaded the protein sequences of zebrafish [17], Japanese puffer [18], stickleback (Gasterosteus aculeatus) [24], spotted green pufferfish [20], and medaka [19] from the Ensembl Core database (release 75), and we obtained the protein sequences of Asian seabass (Lates calcarifer) [25], blue spotted mudskipper [21], and golden arowana [22] from their corresponding ftp websites. The consensus proteome set of the above eight species and snakehead fish was filtered to remove protein sequences shorter than 50 amino acids, resulting in a dataset of 190 566 protein sequences, which was used as the input file for OrthoMCL [26] to construct gene families. A total of 17 954 OrthoMCL families were built using an effective database size of 190 566 sequences for the all-to-all BLASTP strategy with an E-value of 1e-5 and the default Markov Chain Clustering inflation parameter. We further identified 24 gene families that were specific to the snakehead fish (Fig. 2a).
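The length filter applied before clustering (dropping proteins shorter than 50 amino acids) can be sketched with a small FASTA reader. The helpers `read_fasta` and `filter_short` are hypothetical, and ignoring a trailing `*` stop symbol when measuring length is an assumption of this sketch.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_short(records, min_length=50):
    """Keep records whose protein sequence has at least min_length residues."""
    return [(header, seq) for header, seq in records
            if len(seq.rstrip("*")) >= min_length]


# Toy input: one 60-residue protein and one 49-residue protein.
toy_fasta = [
    ">protein_a",
    "M" * 30,
    "K" * 30,
    ">protein_b",
    "M" * 49,
]

kept = filter_short(read_fasta(toy_fasta))
print([header for header, _ in kept])
```

In this toy input only `protein_a` is retained; the 49-residue `protein_b` falls below the threshold.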

Figure 2: Genome evolution. (a) Orthologous gene families across five fish genomes (Snakehead fish, Zebrafish, Asian seabass, Mudskipper, and Arowana). (b) Phylogeny of ray-finned fishes (the arowana as the outgroup species).

Subsequently, we selected 1918 single-copy (only one gene from each species) orthogroups from the above-mentioned 9 teleost species. We used MUSCLE (version 3.8.31) [27] to align the protein sequences within each of the 1918 orthogroups. We then converted the protein alignments to their corresponding coding DNA sequence alignments using an in-house perl script. All the resulting coding DNA sequences were combined into one “supergene” for each species. Fourfold degenerate (4D) sites extracted from the supergenes were then joined into a new sequence for each species to construct a phylogenetic tree (Fig. 2b) using MrBayes [28] (version 3.2, with the GTR+gamma model).
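To illustrate the 4D-site step, the sketch below extracts third codon positions from a codon-aligned "supergene" alignment, reading 4D in its conventional sense of fourfold degenerate sites. It is an illustration under stated assumptions, not the in-house script: the function `fourfold_sites` is hypothetical, the standard genetic code is assumed, and a column is kept only when every species shares the same fourfold-degenerate codon prefix and has an unambiguous third base.

```python
# First two bases of the eight fourfold-degenerate codon families
# in the standard genetic code (Leu, Val, Ser, Pro, Thr, Ala, Arg, Gly).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}


def fourfold_sites(aligned_cds: dict) -> dict:
    """Return, per species, the concatenated third bases of shared 4D codons."""
    names = list(aligned_cds)
    lengths = {len(seq) for seq in aligned_cds.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    length = lengths.pop()
    if length % 3 != 0:
        raise ValueError("alignment length must be a multiple of 3")

    sites = {name: [] for name in names}
    for start in range(0, length, 3):
        codons = [aligned_cds[name][start:start + 3].upper() for name in names]
        prefixes = {codon[:2] for codon in codons}
        # Require one shared prefix that belongs to a fourfold family.
        if len(prefixes) != 1 or prefixes.pop() not in FOURFOLD_PREFIXES:
            continue
        # Skip columns with gaps or ambiguous bases at the third position.
        if any(codon[2] not in "ACGT" for codon in codons):
            continue
        for name, codon in zip(names, codons):
            sites[name].append(codon[2])
    return {name: "".join(bases) for name, bases in sites.items()}


# Toy alignment of five codons for three made-up species.
toy_alignment = {
    "sp1": "GCTAAACCGCTAGGA",
    "sp2": "GCCAAACCGTTAGG-",
    "sp3": "GCAAAGCCTCTAGGA",
}
print(fourfold_sites(toy_alignment))
```

In the toy alignment only the first (Ala) and third (Pro) codon columns qualify, so each species contributes two sites. The resulting per-species strings play the role of the input alignment for tree building.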

Conclusion

We report the first whole genome sequencing, assembly, and annotation of the Northern snakehead (Channa argus). The final draft genome assembly is approximately 615.3 Mb, accounting for 91.8% of the estimated genome size (670.4 Mb). We also predicted 19 877 protein-coding genes from the generated assembly. The draft genome assembly will be a valuable resource for genetic breeding, environmental DNA detection of invasive species, and biological studies on this economically important teleost fish. Based on these genomic data, researchers will be able to develop genetic markers for further quantitative trait locus and genome-wide association studies on growth traits. These markers will also be very useful for DNA barcoding in screening invasive C. argus for ecological protection.

Availability of supporting data

The raw sequencing reads of all libraries have been deposited at NCBI (SRP078899). Further supporting data are available in the GigaScience database, GigaDB [29].

Abbreviation

TE: transposable element.

Author contributions

PX designed the study. JX, CB, GL, JL, HD, YH, YX, and QS assembled and annotated the genome. CB and YY performed the evolution analysis. JX, YJ, XY, QL, and HZ analyzed the data. WP, CD, SZ, and KC collected the sample and prepared the quality control. JX, CB, QS, and PX wrote the manuscript. QS and PX participated in discussions and provided advice. All authors read and approved the final manuscript.

26 in total

1. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers.

Authors: Guillaume Marçais; Carl Kingsford
Journal: Bioinformatics Date: 2011-01-07 Impact factor: 6.937

Review 2. Repbase Update, a database of eukaryotic repetitive elements.

Authors: J Jurka; V V Kapitonov; A Pavlicek; P Klonowski; O Kohany; J Walichiewicz
Journal: Cytogenet Genome Res Date: 2005 Impact factor: 1.636

3. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs.

Authors: Felipe A Simão; Robert M Waterhouse; Panagiotis Ioannidis; Evgenia V Kriventseva; Evgeny M Zdobnov
Journal: Bioinformatics Date: 2015-06-09 Impact factor: 6.937

4. The medaka draft genome and insights into vertebrate genome evolution.

Authors: Masahiro Kasahara; Kiyoshi Naruse; Shin Sasaki; Yoichiro Nakatani; Wei Qu; Budrul Ahsan; Tomoyuki Yamada; Yukinobu Nagayasu; Koichiro Doi; Yasuhiro Kasai; Tomoko Jindo; Daisuke Kobayashi; Atsuko Shimada; Atsushi Toyoda; Yoko Kuroki; Asao Fujiyama; Takashi Sasaki; Atsushi Shimizu; Shuichi Asakawa; Nobuyoshi Shimizu; Shin-Ichi Hashimoto; Jun Yang; Yongjun Lee; Kouji Matsushima; Sumio Sugano; Mitsuru Sakaizumi; Takanori Narita; Kazuko Ohishi; Shinobu Haga; Fumiko Ohta; Hisayo Nomoto; Keiko Nogata; Tomomi Morishita; Tomoko Endo; Tadasu Shin-I; Hiroyuki Takeda; Shinichi Morishita; Yuji Kohara
Journal: Nature Date: 2007-06-07 Impact factor: 49.962

5. OrthoMCL: identification of ortholog groups for eukaryotic genomes.

Authors: Li Li; Christian J Stoeckert; David S Roos
Journal: Genome Res Date: 2003-09 Impact factor: 9.043

6. MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space.

Authors: Fredrik Ronquist; Maxim Teslenko; Paul van der Mark; Daniel L Ayres; Aaron Darling; Sebastian Höhna; Bret Larget; Liang Liu; Marc A Suchard; John P Huelsenbeck
Journal: Syst Biol Date: 2012-02-22 Impact factor: 15.683

7. Mudskipper genomes provide insights into the terrestrial adaptation of amphibious fishes.

Authors: Xinxin You; Chao Bian; Qijie Zan; Xun Xu; Xin Liu; Jieming Chen; Jintu Wang; Ying Qiu; Wujiao Li; Xinhui Zhang; Ying Sun; Shixi Chen; Wanshu Hong; Yuxiang Li; Shifeng Cheng; Guangyi Fan; Chengcheng Shi; Jie Liang; Y Tom Tang; Chengye Yang; Zhiqiang Ruan; Jie Bai; Chao Peng; Qian Mu; Jun Lu; Mingjun Fan; Shuang Yang; Zhiyong Huang; Xuanting Jiang; Xiaodong Fang; Guojie Zhang; Yong Zhang; Gianluca Polgar; Hui Yu; Jia Li; Zhongjian Liu; Guoqiang Zhang; Vydianathan Ravi; Steven L Coon; Jian Wang; Huanming Yang; Byrappa Venkatesh; Jun Wang; Qiong Shi
Journal: Nat Commun Date: 2014-12-02 Impact factor: 14.919

8. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler.

Authors: Ruibang Luo; Binghang Liu; Yinlong Xie; Zhenyu Li; Weihua Huang; Jianying Yuan; Guangzhu He; Yanxiang Chen; Qi Pan; Yunjie Liu; Jingbo Tang; Gengxiong Wu; Hao Zhang; Yujian Shi; Yong Liu; Chang Yu; Bo Wang; Yao Lu; Changlei Han; David W Cheung; Siu-Ming Yiu; Shaoliang Peng; Zhu Xiaoqian; Guangming Liu; Xiangke Liao; Yingrui Li; Huanming Yang; Jian Wang; Tak-Wah Lam; Jun Wang
Journal: Gigascience Date: 2012-12-27 Impact factor: 6.524

9. TopHat: discovering splice junctions with RNA-Seq.

Authors: Cole Trapnell; Lior Pachter; Steven L Salzberg
Journal: Bioinformatics Date: 2009-03-16 Impact factor: 6.937

10. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons.

Authors: Zhao Xu; Hao Wang
Journal: Nucleic Acids Res Date: 2007-05-07 Impact factor: 16.971
