Literature DB >> 32917720

Draft Genome Assembly of the Freshwater Apex Predator Wels Catfish (Silurus glanis) Using Linked-Read Sequencing.

Mikhail Yu Ozerov^1,2,3, Martin Flajšhans⁴, Kristina Noreikiene⁵, Anti Vasemägi^6,5, Riho Gross⁵.

Abstract

The wels catfish (Silurus glanis) is one of the largest freshwater fish species in the world. This top predator plays a key role in ecosystem stability, and represents an iconic trophy-fish for recreational fishermen. S. glanis is also a highly valued species for its high-quality boneless flesh, and has been cultivated for over 100 years in Eastern and Central Europe. The interest in rearing S. glanis continues to grow; the aquaculture production of this species has almost doubled during the last decade. However, despite its high ecological, cultural and economic importance, the available genomic resources for S. glanis are very limited. To fulfill this gap we report a de novo assembly and annotation of the whole genome sequence of a female S. glanis The linked-read based technology with 10X Genomics Chromium chemistry and Supernova assembler produced a highly continuous draft genome of S. glanis: ∼0.8Gb assembly (scaffold N 50 = 3.2 Mb; longest individual scaffold = 13.9 Mb; BUSCO completeness = 84.2%), which included 313.3 Mb of putative repeated sequences. In total, 21,316 protein-coding genes were predicted, of which 96% were annotated functionally from either sequence homology or protein signature searches. The highly continuous genome assembly will be an invaluable resource for aquaculture genomics, genetics, conservation, and breeding research of S. glanis.

Entities: Chemical Disease Gene Species

Keywords: 10X Genomics Chromium linked-reads; Silurus glanis; de novo assembly; teleost; wels catfish; whole genome sequencing

Mesh：

Year: 2020 PMID： 32917720 PMCID： PMC7642921 DOI： 10.1534/g3.120.401711

Source DB: PubMed Journal: G3 (Bethesda) ISSN： 2160-1836 Impact factor: 3.154

The wels catfish, also known as the European catfish or sheatfish, Silurus glanis (NCBI Taxonomy ID: 94993; Figure 1), is a non-migratory benthopelagic predatory fish that lives mainly in freshwater habitats (large warm lakes and deep, slow-flowing rivers), but also occurs in brackish waters in the Baltic and Black Seas (Frimodt 1995). S. glanis is native to the North, Baltic, Black, Caspian and Aral Sea basins; its distribution extends from Germany eastwards through Poland and the Baltic States to Russia, and from Southern Sweden southwards through Eastern European countries to Turkey, Northern Iran and the Aral Sea basin of Kazakhstan and Uzbekistan (Kottelat and Freyhof 2007; Copp ). Following introduction outside its native range, S. glanis has become established in a number of Western and Southern European countries, as well as in Syria and China (Froese and Pauly 2019).

Figure 1

Wels catfish (Silurus glanis). Photo by Filip Staes, http://www.fsfotografie.be/.

Wels catfish (Silurus glanis). Photo by Filip Staes, http://www.fsfotografie.be/. S. glanis belongs to the family Siluridae, a large group of freshwater fishes native to Europe, Asia and Africa. The family includes more than 100 species from 12 genera. There are 18 species in the genus Silurus, two of which are native to Europe: S. glanis and S. aristotelis (Copp ). S. glanis is the largest fish of the order Siluriformes and the largest among European freshwater teleost fishes. Its maximum reported size was 5 m and 306 kg, caught in the River Dniepr (Berg 1949), although the typical length and weight vary from 1.3 to 1.6 m and from 15 to 65 kg, respectively (Frimodt 1995; Alp ). S. glanis is an ecologically and economically important species. It is the freshwater apex predator, the target of commercial and recreational fisheries, and a rising species in aquaculture (Linhart ; Vejřík ; Cucherousset ). S. glanis is particularly popular among European anglers, and its economic importance in many Central and Eastern European countries has increased because the species possesses many characteristics desirable for profitable aquaculture (e.g., its growth rate is among the highest of any fish, and it has tasty white boneless flesh, high carcass yield, and high feed utilization efficiency; Pruszynski and Pistelok 1999; Linhart ; Jankowska ; Adamek ). The aquaculture production of S. glanis in Europe, Asia and Africa has grown from 14 tons in 1984 to 2,026 tons in 2018. Currently the largest producers are Uzbekistan, Poland, Hungary, Bulgaria, France, Germany and Czech Republic (FAO 2020). Despite its high ecological, cultural and economic significance, the available genomic resources for S. glanis are very limited. To date, only a mitochondrial genome (Vittas ) and a few gene sequences of S. glanis are available in the NCBI database (https://www.ncbi.nlm.nih.gov/). There are a number of genome assemblies of other Siluriformes (Liu ; Gong ; Kim ; Li ; Zhang ; Jiang ), however, these species are evolutionarily rather distant from S. glanis (Kappas ). Therefore, given the growing interest in the aquaculture of S. glanis, there is a strong need for improving the databases of its genetic resources and for developing novel genetic tools to increase the efficiency of breeding programs. Moreover, given that the fisheries based on wild-caught S. glanis exceed aquaculture production (FAO 2020), genomic information is essential for developing sustainable conservation strategies for wild populations. To assemble the S. glanis genome, we used the 10X Genomics Chromium technology, a genomic library preparation technique designed to build accurate and highly continuous assemblies from short sequencing reads. In brief, the 10X Genomics library preparation technology incorporates unique molecular barcodes into individual high molecular weight DNA molecules, after which genomic libraries undergo standard Illumina short-read sequencing (Zheng ). The construction of highly continuous scaffolds is achieved by using these barcodes in a phased assembly strategy algorithm implemented in the Supernova software to tag short reads that come from the same long DNA fragment (known as linked-reads; Weisenfeld ). Here, we report a high-quality, highly continuous, and nearly complete assembly of the S. glanis genome generated using 10X Genomics linked-read sequencing. This resource will serve as a backbone for aquaculture genomics, genetics, conservation, and breeding research of S. glanis and other fish species of the Siluridae family.

Materials And Methods

Ethics statement

The requirements outlined in the Annex III (Requirements for establishments and for the care and accommodation of animals) and Annex IV (Methods of killing animals) Section B point 11 of the “Directive 2010/63/EU of the European parliament and of the council of 22 September 2010 on the protection of animals used for scientific purpose” were fully met. The authors have followed the principles of the 3Rs (Replacement, Reduction and Refinement) and have involved the minimum number of animals to produce statistically reproducible results.

Samples, library preparation and sequencing

The sample tissues were obtained from one 160-day-old S. glanis female (weight 346 g, total length 393 mm) of the Hodonin strain, collected in the experimental aquaculture facilities of the Chair of Aquaculture, Estonian University of Life Sciences (Tartu, Estonia). The Hodonin strain originated from the Danube river drainage, and the broodstock are maintained at the Research Institute of Fish Culture and Hydrobiology (RIFCH) of the Faculty of Fisheries and Protection of Waters, University of South Bohemia (Vodňany, Czech Republic). The specimen was killed by an overdose of 2-phenoxyethanol before sampling. A blood sample was collected from the caudal vein into a 2 ml sterile vacuum blood collection tube containing 3.6 mg K2EDTA (VACUTEST KIMA, Italy) using a sterile needle with a safety device (21G × 1.5”, VACUTEST KIMA, Italy). The tube was kept on ice during transport to the laboratory, and then stored at +4° until DNA isolation. For transcriptome characterization, tissues from eleven organs of the same individual (barbel, whole eye, fin, gill, gonad (ovary), heart, kidney, muscle, liver, spleen and swim bladder) were dissected, snap frozen in liquid nitrogen and kept at -80° until RNA isolation. High molecular weight genomic DNA (gDNA) was isolated from blood using the MagAttract HMW DNA Kit (Qiagen, Halden, Germany) according to manufacturer’s instructions, with a few modifications. As fish red blood cells contain a nucleus, we used only 3 µl of buffered blood (instead of the recommended 200 µl). In addition, to avoid fragmentation of high molecular weight DNA, vortexing and mixing procedures were very brief and gentle during the DNA isolation procedure. Total gDNA was eluted in 80 µl of buffer AE (Qiagen, Halden, Germany). The quantity of gDNA was measured by Qubit Fluorometric Quantitation (ThermoFisher), and the average length of the gDNA fragments was determined by the FEMTOPule using the Femto gDNA 165Kb FP-1002 kit (Agilent Technologies). The average estimated fragment size of gDNA was > 65 Kb. Whole genome sequencing libraries were prepared from 0.6 ng DNA using the Chromium Genome Library preparation kit (cat# 120257/58/61/62) according to the manufacturer’s protocol (#CG00043 Chromium Genome Reagent Kit v2 User Guide), and 0.75 ng of template gDNA was loaded on a Chromium Genome Chip. The library was sequenced on two lanes of an Illumina HiSeq X sequencer in rapid run mode, using paired-end sequencing (2 × 150 bp) to generate 655.87 M linked-reads, with a mean read length of 139.5 bp after trimming. The weighted mean molecule size was estimated as 37.8 Kb, and the mean read coverage was ∼102×. The WGS library preparation and sequencing was performed by the SNP&SEQ Technology Platform (Uppsala, Sweden). Total RNA was extracted from eleven pulverized frozen tissues using a NucleoSpin RNA extraction kit (MACHEREY-NAGEL, Duren, Germany) according to the manufacturer’s guidelines. The concentration and integrity of the extracted RNA samples were measured with TapeStation 2200 (Agilent Technologies). Prior the preparation of sequencing libraries, the quality of the RNA samples was ensured using Bioanalyzer 2100 (Agilent Technologies) electrophoresis, and sample concentration was measured with a Qubit Fluorometric Quantitation (Life Technologies). The RNA integrity number scores (RIN) for all samples were high, ranging from 8.2 to 10.0 (mean = 9.5). The libraries were prepared from 100 ng of total RNA according to the Illumina TruSeq Stranded mRNA Sample Preparation Guide (part no. 15031047) to generate libraries with an insert size of 300-bp. The libraries were sequenced using an Illumina HiSeq 3000 (2 × 75 bp configuration, single lane) at the Finnish Functional Genomics Centre (Turku, Finland).

Estimation of the genome metrics based on raw reads

Jellyfish v.2.2.6 (Marçais and Kingsford 2011) was applied to generate k-mer counting and frequency distributions of 17-, 21- and 25-mers for quality and barcode trimmed Illumina reads (jellyfish histo -h 3000000). Genome size, heterozygosity and repeat content were estimated based on the generated histograms using GenomeScope (Vurture ; high frequency k-mer cutoff = 10,000) and findGSE (Sun ).

de novo genome assembly

The Supernova v.2.1.1 assembler (Weisenfeld ) was used to assemble the linked-reads data on a 28 core and 240 Gb RAM CSC – IT Center for Science cPouta virtual private server, based on Intel Xeon CPU E5-2680 v.4 2.4 GHz processors. The Supernova run was performed with default parameters, except maximum reads (–maxreads) was set for 386 million input reads to achieve 56× raw coverage, as suggested in the Supernova protocol. The 824.5 Mb initial draft genome assembly was presented in pseudohaplotype format and consisted of 31,193 of scaffold sequences (≥ 1 Kb), of which 83.5 Mb (10.1%) represented unknown bases. Duplicated scaffolds (1,359 scaffolds) were removed using GenomeTools sequniq v.1.5.10 (Gremme ), and only scaffolds with more than 10% of unique sequence were retained (29,812 scaffolds). Further, the redundancy of the genome assembly was reduced in two steps. First, CD-HIT v.4.7 package (Fu ; Li and Godzik 2006) was applied to cluster all the scaffolds < 2 Mb. When two or more scaffolds showed ≥ 99% similarity, all but the longest scaffold were removed to generate a non-redundant set of < 2 Mb scaffolds. This resulted in removal of 3,568 potentially redundant scaffolds from the assembly. Second, to further reduce potential redundancy, the assembly including the non-redundant set of < 2 Mb scaffolds was self-aligned using LAST v.926 (Kiełbasa ; identity ≥ 99%, coverage of query sequence ≥ 95%), resulting in exclusion of 540 additional scaffolds. In total, 31.1 Mb (3.9%) were removed from the initial assembly due to potential duplication or redundancy. The potentially redundant scaffolds varied in size from 1 Kb to 62.5 Kb and most of them (84.5%) did not exceed 10 Kb. The final S. glanis genome assembly consisted of 25,703 scaffolds. The assembly was screened for vectors and contaminants using a Kraken v.1.0 (Wood and Salzberg 2014) customized database, which included standard Kraken viral, bacterial, archaeal, plasmid and human databases, additional genomes of Trypanosoma brucei (GCF_000210295.1, Jackson ) and seven fish species (Cyprinus carpio GCF_000951615.1, Li J.-T., Chinese Academy of Fishery Science; Danio rerio GCF_000002035.6, Howe ); Esox lucius GCF_000721915.3. Rondeau ); Lates calcarifer GCF_001640805.1, Vij ); Nothobranchius furzeri GCF_001465895.1, Senf et al. Leibniz Institute for Age Research – Fritz Lipmann; Oncorhynchus mykiss GCF_002163495.1 Lien et al. Norwegian University of Life Sciences; and Takifugu rubripes GCF_000180615.1, Kai ). In total, 30 and 810 scaffolds were detected as potentially contaminated by unicellular organisms or human DNA, respectively. NCBI’s blastn v.2.9.0 (Boratyn ) was further applied to align those scaffolds to viral, bacterial, trypanosoma or to human refseq gene sequences. As most of the significant hits did not cover more than 2% of a query sequence, all of the scaffolds were considered as non-contaminated and were thus retained for further analyses. The quality metrics of the genome assembly were generated using QUAST v.4.5.4 (Gurevich ). Genome assembly completeness was assessed with BUSCO v.3.0.2b (Simão ) using a database of ray-finned fishes (Actinopterygii obd9) consisting of 4,584 orthologs from 20 fish species. Furthermore, to compare the scaffold-level genome assembly of S. glanis in this study with a reported chromosome-level genome assembly of I. punctatus (GCA_001660625.1, Liu ), we performed a synteny analysis of these two genome assemblies using LAST v.926 (Kiełbasa ), which only considered the reliably aligned regions longer than 1 Mb (e-value ≤ 1e-5). The results were visualized using Circos v.0.96-9 (Krzywinski ).

Retrieval of the mitochondrial genome

The mitochondrial genome sequence was identified as a single scaffold by searching the available mitochondrial genome of the S. glanis (NC_014261.1, from Kastoria Lake, Greece; Vittas ) against the generated genome assembly using NCBI’s blastn v.2.9.0 (Boratyn ; evalue 1e-10, -soft_masking true, -lcase_masking, and a hit fraction filter to include only hits of > 90% target length, -qcov_hsp_perc 90). Further, both the retrieved and published (NC_014261.1) mitochondrial genomes were aligned, and genetic variants were identified. The genomic classification of genetic variants into different categories (synonymous, non-synonymous, exonic and intergenic) was performed using SnpEff v.4.3t (Cingolani ).

Transcriptome assembly

To assist the subsequent genome annotation, we performed RNA sequencing and de novo transcriptome assembly based on eleven tissues. A total of 364.0 M RNA-seq read-pairs were generated. Short (< 50 bp) and low-quality reads (average quality ≤ 25) were trimmed using Trimmomatic v.0.35 (Bolger ); SLIDINGWINDOW:5:25 MINLEN:50). Erroneous k-mers were removed from the Illumina paired-end reads, and random sequencing errors were corrected using rCorrector (Song and Florea 2015). The corrected and trimmed reads were mapped to an rRNA database (SILVA Release 132; Pruesse ) to further reduce bias in downstream analyses due to over ribo-depletion (Lahens ). Finally, 318.5 M filtered read-pairs were assembled de novo using Trinity v.2.5.1 (Haas ) with default parameters. The redundancy of the transcriptome assembly was reduced as described in Eccles . Briefly, the quality-controlled RNA-seq reads were mapped to the Trinity-generated transcriptome assembly using Salmon v.1.1.0 (Love ). A threshold of 100 mapped reads in BUSCO genes was chosen, as 98% of complete BUSCO sequences had more than 100 mapped reads, retaining 80,005 of the initial 250,220 transcripts. The longest Met to Stop open reading frame was identified for each transcript for protein-based clustering with CD-HIT v.4.7 (Li and Godzik 2006; Fu ) at 98% identity, retaining 56,266 transcripts. Finally, the longest isoform of each gene was identified, producing an isoform-collapsed transcriptome subset, which was clustered at the protein level by CD-HIT v.4.7 at 90% identity. This resulted in a final dataset of 48,141 transcripts.

Repetitive-sequence discovery

First, a de novo repeat library was generated using RepeatModeler v.1.0.11 (Smit and Hubley 2008-2015) with default parameters. Further, RepeatMasker v.4.0.8 (Smit ) was applied to screen for repeats and low complexity regions in the assembly using the generated de novo repeat library in combination with Dfam consensus 20171107 (Hubley ) and RepBase 20181026 (Bao ) repeat libraries. In addition, putative repeat regions were identified using Red v. 05/22/2015, a repeat-detection tool applying machine learning, which is capable of labeling its training data and training itself automatically on an entire genome (Girgis 2015). Finally, a consensus bed-file containing repeat coordinates derived from both methods was merged using bedtools v.2.27.1 (Quinlan and Hall 2010).

Gene prediction and annotation

Gene models were predicted with MAKER v 3.01.2-beta (Holt and Yandell 2011), which combines ab initio gene prediction, RNA-seq assisted prediction and homology-based prediction. Repetitive genomic regions of the S. glanis genome were masked based on the repeat annotation results. Three MAKER runs were performed. During the first run, the S. glanis transcripts and protein sequences of 75 other fish species from the Ensembl 99 database (January 2020) were aligned to the genome as evidence to retrain Augustus v.3.3.3 (Stanke ) and SNAP v. 2013-11-29 (Korf 2004) ab initio gene prediction tools. Gene annotations generated in the first (and second) runs of ab initio gene prediction tools were used in the second and third runs of MAKER. The BUSCO v.3.0.2b pipeline was applied to retrain Augustus using genomic regions containing mRNA annotations from the first and second MAKER run (including additional 1,000 bp on each side). BUSCO runs were performed using the–long option to optimize the HMM settings of the raw zebrafish HMM (–sp zebrafish; first run) or trained S. glanis HMM (second run), and to generate the final trained S. glanis HMM. SNAP was retrained using gene models from the first and second MAKER run with an annotation edit distance (AED) ≤ 0.25 and a length of amino acids ≥ 50. The AED quantifier ranges from 0 to 1 and shows the match between a gene annotation and its supporting evidence (EST, protein and mRNA-seq alignments). Lower AED values indicate higher congruency between the intron-exon coordinates of an annotation and its aligned evidence, whereas AED = 1 indicates no evidence for support of predicted genes. Only sequences with AED < 0.5 and coding sequences (CDS) ≥ 90 bp were retained in the final set of predicted genes. The completeness of the predicted gene transcripts was evaluated with BUSCO v.3.0.2b (Simão ) using a database of ray-finned fishes (Actinopterygii obd9). In addition, the number of reads mapped to the predicted genes was estimated by mapping the quality-controlled RNA-seq reads to the predicted transcriptome using Salmon v.1.1.0 (Love ). Functional annotation of the predicted proteins against vertebrate protein sequences in NCBI’s non-redundant database was performed using NCBI’s blastp v.2.9.0 (Boratyn ; -evalue 1e-10, -soft_masking true, -lcase_masking, and a hit fraction filter to include only hits of > 70% target length, -qcov_hsp_perc 70). Next, non-annotated sequences were searched against all the protein sequences in the NCBI non-redundant database. In addition, Interproscan v.5.30-69.0 (Jones ) was applied to search for protein domains, motifs and signatures present in the predicted protein sequences by searching against publicly available databases, including PANTHER 12.0 (Thomas ), Pfam v.31.0 (Finn ), PRINTS v.42.0 (Attwood ), PROSITE v.2018_02 (Sigrist ), SFLD 3 (Akiva ), SMART v.7.1 (Letunic ), SUPERFAMILY v.1.75 (de Lima Morais ), and TIGRFAM v.15.0 (Haft ).

Inference of demographic history

The pairwise sequentially Markovian coalescent (PSMC) method was applied to infer the demographic history of S. glanis from a diploid sequence using psmc v.0.6.5 (Li and Durbin 2011). In brief, the quality and barcode-trimmed Illumina reads were aligned to the S. glanis reference genome assembly using bowtie2 v.2.3.5.1 (Langmead and Salzberg 2012) applying default parameters, except the modified score minimum threshold (–score-min L,-0.3,-0.3) and the maximum fragment length for valid paired-end alignments (-X 700). SAMtools v.1.10 (Li 2011) pipeline was further applied to generate the diploid consensus sequences with default settings (https://github.com/lh3/psmc), except that the minimum and maximum read depths were set to 15 and 90, respectively (-d 15 -D 90) to obtain high quality SNP data. The input file for PSMC modeling was generated with the fq2psmcfa tool (-q 20) and processed in psmc applying the following parameters: -N25 -t15 -r5 -p ‘4+25*2+4+6’ (Liu ). A generation time of 6 years (Froese and Pauly 2019) and a mutation rate of 2.5e-08 (Liu ) were applied for time calibration.

Data availability

Short Illumina linked-reads are available in the NCBI Sequence Read Archive (SRA; SSR11087275-SSR11087276 and SSR11087282-SSR11087283), and the Whole Genome Assembly has been deposited at DDBJ/EMBL/GenBank under the accession JAAIIK000000000, both under BioProject PRJNA605930. Transcriptome reads are available in the NCBI SRA (SRR11087269-SRR11087274 and SRR11087277-SRR11087281), and the Transcriptome Shotgun Assembly has been deposited at DDBJ/EMBL/GenBank under accession number GIPM00000000, as a part of BioProject PRJNA605930.

Results And Discussion

The S. glanis genome characteristics

Genome size estimates from GenomeScope ranged from 723.4 to 769.9 Mb, whereas estimates based on findGSE (fitted and original counts with corrected k-mer coverage) were higher, ranging from 822.6 Mb to 906.5 Mb (Table 1). The estimated genome size was comparable to the most closely related S. asotus (831 – 1,411 Mb, Jianxun ) and other Siluriformes, whose genome sizes vary from 599 Mb in B. yarrelli (Jiang ) to 1,200 Mb in C. batrachus (Li ).

Table 1

The S. glanis genome size, heterozygosity and repeat content as estimated by the GenomeScope and findGSE software

Genome characteristics	k-mer size
	k = 17	k = 21	k = 25
GenomeScope
Genome haploid length (Mb)	723.4	753.6	769.9
Genome repeat length (Mb)	337.8	216.8	205.4
Genome unique length (Mb)	385.6	536.7	564.5
Heterozygosity, %	0.24	0.25	0.23
Estimated repetitive ratio,%	46.7	28.8	26.7
Read error rate, %	0.30	0.36	0.36
findGSE
Genome haploid length (Mb)	822.6	901.3	906.5
Genome repeat length (Mb)	395.1	326.9	300.2
Genome unique length (Mb)	427.5	574.5	606.3
Estimated repetitive ratio, %	48.0	36.3	33.1

The GenomeScope analysis revealed low heterozygosity of the S. glanis genome (0.23 – 0.25%, Table 1) in comparison with other species (e.g., Kajitani ; Vurture ). The estimated proportion of repetitive regions in the S. glanis genome was high, ranging from 26.7% (k = 25, GenomeScope) to 48.0% (k = 17, findGSE). These estimates are comparable to other Siluriformes, in which the proportion of repeats in the genome varies from 30% in C. batrachus (Li ) to 44% in I. punctatus (Yuan ).

Genome assembly

The total length of the assembly was 793.4 Mb, which included ca. 10% (80.4 Mb) of unknown bases (Table 2). Such a high proportion of unknown bases is typical for genome assemblies generated using linked reads with 10X Chromium technology (e.g., Mohr ; Hulse-Kemp ), as the Supernova assembler estimates gap sizes rather than introducing an arbitrary value of Ns during scaffolding (Weisenfeld ). In the presented S. glanis genome assembly, repeat regions were estimated to account for 39.5% (313.3 Mb). The contig N50 and scaffold N50 sizes were 13.9 Kb and 3.2 Mb, respectively (Table 2). Altogether, 282 of the longest scaffolds (≥ 250 Kb; 1.1% of all scaffolds) covered more than 80% of the genome assembly.

Table 2

The S. glanis genome assembly and annotation statistics

	Genome assembly^a
Contig statistics
Number of contigs	105,816
Total contig size (bp)	712,999,588
Contig N₅₀ size (bp)	13,869
Largest contig (bp)	140,841
Scaffold statistics
Number of scaffolds	25,703
Total scaffold size (bp)	793,358,859
Scaffold N₅₀ size (bp)	3,169,562
Largest scaffold (bp)	13,715,129
GC content (%)	39.2
Unknown base (%)	10.1
BUSCO genome completeness
Complete	3,859 (84.2%)
Complete and single copy	3,717 (81.1%)
Complete and duplicated	142 (3.1%)
Fragmented	312 (6.8%)
Missing	413 (9.0%)
Annotation
Number of protein-coding genes	21,316
with partial EST support	10,260
with > 90% EST support	4,989
with full length EST support	3,795
with > 100 RNAseq reads aligned	17,330
with > 10 RNAseq reads aligned	19,855
Number of functionally-annotated proteins	20,532
Mean protein length (interquartile range, aa)	501 (218-617)
Longest protein (aa)	27,306 (titin-like)
Average number of exons per gene (mean length, interquartile range)	9 (212, 89-194 bp)
Average number of introns per gene (length, interquartile range)	8 (1,208, 133-1,274 bp)
BUSCO completeness of the predicted gene models
Complete	3,427 (74.8%)
Complete and single copy	3,248 (70.9%)
Complete and duplicated	179 (3.9%)
Fragmented	403 (8.8%)
Missing	754 (16.4%)

Minimum scaffold length: 1 Kb.

Minimum scaffold length: 1 Kb. The presented S. glanis genome assembly covered 84.2% complete and 6.8% partial ray-finned fishes BUSCOs (Table 2). Despite that the completeness of genome assemblies in other catfishes, generated with the assistance of long-read technologies, was ca. 5–10% higher (e.g., Kim ; Liu ; Jiang ) compared to the present genome assembly of S. glanis (Figure 2), this is comparable to other genome assemblies generated using only linked-read technology (e.g., Hammond ; Jones ; Ozerov ; Lu ). In addition, high completeness and accuracy of the presented S. glanis genome assembly was supported by high sequence similarities and concordant alignment with the chromosome regions of I. punctatus (Liu ; Figure 3).

Figure 2

BUSCO assessment of the S. glanis and other Siluriformes genomes.

Figure 3

Circos plot showing the high level of synteny between S. glanis (this study) and I. punctatus (Liu ).

BUSCO assessment of the S. glanis and other Siluriformes genomes. Circos plot showing the high level of synteny between S. glanis (this study) and I. punctatus (Liu ). The S. glanis transcriptome assembly based on multiple tissues consisted of 48,133 transcripts covering 92.1% complete and 2.7% partial ray-finned fish BUSCOs. The total transcriptome and N50 transcript size were 80.8 Mb and 2.4 Kb, respectively (Table 3).

Table 3

The S. glanis transcriptome assembly statistics

	Transcriptome assembly (multiple tissues)
Transcript statistics
Number of transcripts	48,133
Total transcript size (bp)	80,812,654
Transcript N₅₀ size (bp)	2,394
Largest transcript (bp)	69,646
BUSCO transcriptome completeness
Complete	4,222 (92.1%)
Complete and single copy	3,844 (83.9%)
Complete and duplicated	378 (8.2%)
Fragmented	123 (2.7%)
Missing	239 (5.2%)

Genome annotation

In total 21,316 protein-coding genes were predicted in the present S. glanis genome with the MAKER annotation pipeline (Table 2). The putative function annotation based on homology with NCBI’s blastp was obtained for 19,627 proteins (92.1%). Further, domains, motifs and signatures were detected for 19,931 proteins (93.5%) with Interproscan. As a result, 20,532 genes were annotated by at least one of the two methods, accounting for approximately 96.3% of the gene models of S. glanis (Table 2).

Mitochondrial genome

In addition to the nuclear genome, we successfully retrieved a nearly complete S. glanis mitochondrial genome consisting of 16,527 bp (scf113625_mtDNA). By comparing the newly assembled genome with the previously published mito-genome of S. glanis (NC_014261.1, Vittas ), we detected 79 SNPs (of which 29 were non-synonymous) corresponding to 0.4% sequence divergence. Assuming a conventional mtDNA molecular clock of 2% sequence divergence per million years (Brown ), this places the estimated time of divergence of these two lineages from a common ancestor to approximately 200,000 years ago. Such a deep divergence is consistent with the earlier work on mtDNA variation based on PCR-RFLP analysis of cytochrome b, control region and ND-5/6 regions in S. glanis, which have revealed two major evolutionary lineages in Europe (Krieg ).

Population history of S. glanis

Using the pairwise sequentially Markovian coalescent model (Li and Durbin 2011), we inferred the S. glanis historic population dynamics (Figure 4). The large effective population size of the S. glanis populations observed ca. 500,000 – 250,000 years ago declined to its minimum during the Last Glacial Period (ca. 115,000 – 11,700 years ago) with the following recovery in the earlier Holocene (ca. 11,650 – 8,000 years ago). Thereafter, the effective population size of S. glanis started to show another declining trend.

Figure 4

Inference of the S. glanis demographic history as revealed by PSMC analysis. The inferred population size is presented as a bold red line, and the surrounding thin pink lines are the estimates of population size generated after 100 rounds of bootstrapping. Green and white bars above the figure represent interglacial and glacial periods, respectively. The names of the last four geological epochs are indicated above the bars: the Holocene, the Last Glacial Period (LGP), the Eemian and the Penultimate Glacial Period (PGP). It should be noted that the estimates of historical fluctuations of the effective population size observed for S. glanis should be taken with caution, as the model itself estimates the rate of coalescence at each point in time, which in turn can be used as a proxy for population size under certain assumptions (Mather ). Thus, there are many factors affecting the relationship between coalescence times and population size, such as natural selection and non-random mating (Mazet ). Furthermore, when the migration rate between populations is low (e.g., island model as an example of non-random mating; Wright 1931), the inferred coalescent effective population size may be overestimated (Li & Durbin 2011), leading to a reflection of increased population structure rather than increased population size. In other words, the estimates of the effective population size at the time of separation will be biased upwards with increased within-species genetic structure (Nadachowska-Brzyska ). The pattern observed for S. glanis in our study may therefore indicate that within-species population structure remained at lower levels during the Eemian interglacial (130,000 – 115,000 years ago) compared to the earlier Holocene, due to a higher connectivity of water bodies during the former (Tudryn ; Krijgsman ). In the earlier Holocene, however, S. glanis recolonization of western Europe from the eastern refugium – presumably located around the Ponto-Caspian region (Krieg ) – increased the within-species structuring as the ice sheet gradually retreated.

Conclusions

We performed whole-genome assembly, using a combination of 10X Chromium linked-read technology and accurate short-read sequencing, to generate the first genome reference for S. glanis. More than 21,000 protein coding genes were identified; among these, 19,627 genes were annotated with known homology, which will facilitate further functional genomic and gene ontology analyses. The scaffold length (N50 = 3.2 Mb) of the generated genome assembly will facilitate the detection of genomic variants, including copy number variations and large insertions/deletions. Given the ecological, cultural and economic importance of S. glanis, as well as its increasing aquaculture production, the generated genome assembly will serve as a backbone for future genomics, genetics, conservation, and breeding research of S. glanis and other fish species of the Siluridae family.

66 in total

1. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers.

Authors: Guillaume Marçais; Carl Kingsford
Journal: Bioinformatics Date: 2011-01-07 Impact factor: 6.937

2. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences.

Authors: Weizhong Li; Adam Godzik
Journal: Bioinformatics Date: 2006-05-26 Impact factor: 6.937

3. Repbase Update, a database of repetitive elements in eukaryotic genomes.

Authors: Weidong Bao; Kenji K Kojima; Oleksiy Kohany
Journal: Mob DNA Date: 2015-06-02

4. PANTHER: a browsable database of gene products organized by biological function, using curated protein family and subfamily classification.

Authors: Paul D Thomas; Anish Kejariwal; Michael J Campbell; Huaiyu Mi; Karen Diemer; Nan Guo; Istvan Ladunga; Betty Ulitsky-Lazareva; Anushya Muruganujan; Steven Rabkin; Jody A Vandergriff; Olivier Doremieux
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

5. The genome and linkage map of the northern pike (Esox lucius): conserved synteny revealed between the salmonid sister group and the Neoteleostei.

Authors: Eric B Rondeau; David R Minkley; Jong S Leong; Amber M Messmer; Johanna R Jantzen; Kristian R von Schalburg; Craig Lemon; Nathan H Bird; Ben F Koop
Journal: PLoS One Date: 2014-07-28 Impact factor: 3.240

6. Chromosomal-Level Assembly of the Asian Seabass Genome Using Long Sequence Reads and Multi-layered Scaffolding.

Authors: Shubha Vij; Heiner Kuhl; Inna S Kuznetsova; Aleksey Komissarov; Andrey A Yurchenko; Peter Van Heusden; Siddharth Singh; Natascha M Thevasagayam; Sai Rama Sridatta Prakki; Kathiresan Purushothaman; Jolly M Saju; Junhui Jiang; Stanley Kimbung Mbandi; Mario Jonas; Amy Hin Yan Tong; Sarah Mwangi; Doreen Lau; Si Yan Ngoh; Woei Chang Liew; Xueyan Shen; Lawrence S Hon; James P Drake; Matthew Boitano; Richard Hall; Chen-Shan Chin; Ramkumar Lachumanan; Jonas Korlach; Vladimir Trifonov; Marsel Kabilov; Alexey Tupikin; Darrell Green; Simon Moxon; Tyler Garvin; Fritz J Sedlazeck; Gregory W Vurture; Gopikrishna Gopalapillai; Vinaya Kumar Katneni; Tansyn H Noble; Vinod Scaria; Sridhar Sivasubbu; Dean R Jerry; Stephen J O'Brien; Michael C Schatz; Tamás Dalmay; Stephen W Turner; Si Lok; Alan Christoffels; László Orbán
Journal: PLoS Genet Date: 2016-04-15 Impact factor: 5.917

7. The North American bullfrog draft genome provides insight into hormonal regulation of long noncoding RNA.

Authors: S Austin Hammond; René L Warren; Benjamin P Vandervalk; Erdi Kucuk; Hamza Khan; Ewan A Gibb; Pawan Pandoh; Heather Kirk; Yongjun Zhao; Martin Jones; Andrew J Mungall; Robin Coope; Stephen Pleasance; Richard A Moore; Robert A Holt; Jessica M Round; Sara Ohora; Branden V Walle; Nik Veldhoen; Caren C Helbing; Inanc Birol
Journal: Nat Commun Date: 2017-11-10 Impact factor: 14.919

8. European catfish (Silurus glanis) as a freshwater apex predator drives ecosystem via its diet adaptability.

Authors: Lukáš Vejřík; Ivana Vejříková; Petr Blabolil; Antti P Eloranta; Luboš Kočvara; Jiří Peterka; Zuzana Sajdlová; Son Hoang The Chung; Marek Šmejkal; Mikko Kiljunen; Martin Čech
Journal: Sci Rep Date: 2017-11-21 Impact factor: 4.379

9. Whole-Genome Sequencing of Chinese Yellow Catfish Provides a Valuable Genetic Resource for High-Throughput Identification of Toxin Genes.

Authors: Shiyong Zhang; Jia Li; Qin Qin; Wei Liu; Chao Bian; Yunhai Yi; Minghua Wang; Liqiang Zhong; Xinxin You; Shengkai Tang; Yanshan Liu; Yu Huang; Ruobo Gu; Junmin Xu; Wenji Bian; Qiong Shi; Xiaohui Chen
Journal: Toxins (Basel) Date: 2018-11-23 Impact factor: 4.546

10. CD-HIT: accelerated for clustering the next-generation sequencing data.

Authors: Limin Fu; Beifang Niu; Zhengwei Zhu; Sitao Wu; Weizhong Li
Journal: Bioinformatics Date: 2012-10-11 Impact factor: 6.937

2 in total

1. A highly contiguous nuclear genome assembly of the mandarinfish Synchiropus splendidus (Syngnathiformes: Callionymidae).

Authors: Martin Stervander; William A Cresko
Journal: G3 (Bethesda) Date: 2021-12-08 Impact factor: 3.154

2. Insights into chromosomal evolution and sex determination of Pseudobagrus ussuriensis (Bagridae, Siluriformes) based on a chromosome-level genome.

Authors: Chuankun Zhu; Haiyang Liu; Zhengjun Pan; Lei Cheng; Yanhong Sun; Hui Wang; Guoliang Chang; Nan Wu; Huaiyu Ding; Haitao Zhao; Lei Zhang; Xiangsheng Yu
Journal: DNA Res Date: 2022-06-25 Impact factor: 4.477

2 in total