Literature DB >> 31188893

De novo European eel transcriptome provides insights into the evolutionary history of duplicated genes in teleost lineages.

Christoffer Rozenfeld1, Jose Blanca2, Victor Gallego1, Víctor García-Carpintero2, Juan Germán Herranz-Jusdado1, Luz Pérez1, Juan F Asturiano1, Joaquín Cañizares2, David S Peñaranda1.   

Abstract

Paralogues pairs are more frequently observed in eels (Anguilla sp.) than in other teleosts. The paralogues often show low phylogenetic distances; however, they have been assigned to the third round of whole genome duplication (WGD), shared by all teleosts (3R), due to their conserved synteny. The apparent contradiction of low phylogenetic difference and 3R conserved synteny led us to study the duplicated gene complement of the freshwater eels. With this aim, we assembled de novo transcriptomes of two highly relevant freshwater eel species: The European (Anguilla anguilla) and the Japanese eel (Anguilla japonica). The duplicated gene complement was analysed in these transcriptomes, and in the genomes and transcriptomes of other Actinopterygii species. The study included an assessment of neutral genetic divergence (4dTv), synteny, and the phylogenetic origins and relationships of the duplicated gene complements. The analyses indicated a high accumulation of duplications (1217 paralogue pairs) among freshwater eel genes, which may have originated in a WGD event after the Elopomorpha lineage diverged from the remaining teleosts, and thus not at the 3R. However, very similar results were observed in the basal Osteoglossomorpha and Clupeocephala branches, indicating that the specific genomic regions of these paralogues may still have been under tetrasomic inheritance at the split of the teleost lineages. Therefore, two potential hypotheses may explain the results: i) The freshwater eel lineage experienced an additional WGD to 3R, and ii) Some duplicated genomic regions experienced lineage specific rediploidization after 3R in the ancestor to freshwater eels. The supporting/opposing evidence for both hypotheses is discussed.

Entities:  

Mesh:

Year:  2019        PMID: 31188893      PMCID: PMC6561569          DOI: 10.1371/journal.pone.0218085

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

Large accumulations of gene duplications can originate from one single event, like a whole genome duplication (WGD) [1] or from multiple small duplication events such as small segmental duplications (SDs) [2], which are often found in tandem. Any of these duplication events may contribute to species evolution by providing raw genetic material for new phenotypic variation [1-3]. Relatively recent SDs are often found in tandem and have been found in high abundance in several organisms including yeast [4], daphnia [5], humans [2,6,7] and teleosts [8-12]. Soon after a SD, one paralogue is most commonly lost [1] possibly due to an accumulation of deleterious mutations or genetic drift [13]. In a few cases, a high abundance of SDs can persist for millions of years as seen in yeast [4], common carp [9] and humans [2,6,14]. This process has been associated with adaptation to new environments [5,15,16]. On the other hand, WGDs are presumed rare in mammals [17], but are recurrently found in amphibians and reptiles [18] and have frequently been suggested in insects [18], fungi [19], and plants [20-23]. Recent WGD events have traditionally been observed by cytological studies through the observation of additional chromosomes [22]; however, ancient WGD events are often hidden [22,24-26] by massive gene losses [27-29] and the fusion or loss of duplicated chromosomes [19,30-33]. Therefore, an ancient WGD event can only be discovered through specific analysis at a whole genome level. Consequently, discoveries of WGD events have accelerated as sequencing techniques have improved and genome-scale data has become more accessible [22,24,26]. It has been suggested that early on in the vertebrate lineage two WGDs (1R and 2R) occurred resulting in species radiation and evolution of new traits [1-3,34]. In teleosts, strong genomic evidence supports the existence of an additional WGD called the teleost specific 3rd round of WGD (3R), which occurred in the base of the teleost lineage between 350 and 320 million years ago (MYA) [35,36]. In addition to 3R, WGD events appear to be a reoccurring phenomenon in Actinopterygii even when only considering cytological evidence [37,38]. Furthermore, Inoue et al. [27] found that 70–80% of the genes originating from the 3R WGD get lost after just 60 million years. Similarly, other studies have found that in most teleosts 3–20% of the genes generated during 3R are conserved today [31]. Moreover, extensive chromosome reorganizations have been suggested in the teleost lineage associated with 3R [39,40] and after the salmonid specific 4th round of WGD (Ss4R) [41]. Therefore, it has been suggested that further discoveries of new WGDs in teleosts may increase following the development of sequencing techniques and the increase in the number of studies specifically analysing the temporal distribution and quantity of gene duplications [31]. This phenomenon of accelerating rates of WGD discoveries is currently observed in plant genomics [20-22,42]. Following a WGD event, paralogues genes will start to diverge after the recombination between duplicated genes has stopped at the transition from tetrasomic to disomic inheritance [41,43,44], also referred to as cytological rediploidization. However, after autotetraploidization tetrasomic segregation may continue due to the high similarity between the duplicated chromosomes, and thus rediploidization may be vastly delayed after a WGD event [43]. Therefore, variations in phylogenetic divergences between paralogue gene pairs originating from the same WGD event can appear in cases where a genomic region is under tetrasomic inheritance, at the time of a speciation event [43]. The resulting phylogenetic gene family trees from such event are virtually indistinguishable from gene trees where additional gene duplications have occurred [44]. In particular, in salmonids, strong evidence suggests that rediploidization after the Ss4R has been protracted in time for approximately a quarter of the genome [41,45]. In turn, this mechanism has led to several salmonid gene duplicates to not present 1:1 orthology relationships among different salmonid species, despite being created at the Ss4R [41,46,47]. A protracted pseudotetraploid period has also been suggested in teleosts after 3R [44]. In particular, the peculiar hox gene complement of the African butterfly fish (Pantodon buchholzi) is most parsimoniously explained by a hypothesis which includes protracted rediploidization for some genomic regions [44]. However, unequivocal support of protracted rediploidization beyond salmonids will require further careful phylogenomic analysis [43]. Several studies have revealed a high occurrence of duplicated genes in freshwater eels (Anguilla spp., Elopomorpha) [48-56]. While these duplicated genes often present weak conserved synteny, suggesting a 3R origin, they also present low phylogenetic divergence between paralogues, indicating that they recently started to diverge. For example, Lafont et al. [50] hypothesize that the entire genomic region containing the gene gper could have been duplicated in freshwater eels, and maybe also in other teleosts; and that the retention of duplicated genes may be higher in these eels than in other teleosts. The occurrence of duplicated genes in freshwater eels seems to be higher than for most teleost lineages, and specifically, the remarkably high conservation of duplicated gene sequences since 3R, often hypothesized for freshwater eel genes [48-56], would be unique [57]. Owing to the fact that the availability of genetic raw material has been suggested to increase the potential of novel adaptation [42], information on the duplicated gene complement of eels may prove valuable in understanding the biology of these endangered species. Therefore, the peculiarity of the published data led us to quantify and analyse duplications in the most relevant freshwater eel species and investigate the temporal distribution of the events that created them. To this end, we assembled de novo transcriptomes of Japanese (Anguilla japonica) and European eel (Anguilla anguilla) from downloaded and newly generated Illumina RNA sequencing data, respectively. Furthermore, we performed phylogenetic reconstructions, assigned paralogue pairs to branches of the resulting species tree, and calculated fourfold synonymous third-codon transversion (4dTv) distances for each paralogue pair identified within these transcriptomes. These analyses were run on our de novo transcriptomes and on multiple other fish transcriptomes and genomes. Our analysis supports the commonly suggested hypothesis of a high abundance of paralogue pairs, unique to the freshwater eel species. However, the phylogenetic and 4dTv analyses suggest a post 3R origin, and a strong signal of synteny between the genomic environments of these paralogues opposes a hypothesis of a SD origin. Similar results were also obtained from the included Osteoglossomorpha branches and the basal Clupeocephala branch. This, in turn, suggests that the results were generated by protracted rediploidization in teleosts after the 3R. These results thus open a discussion on whether these duplicated genes are the result of a 4R WGD in a common ancestor to freshwater eels or rather have been conserved on chromosomal regions, which have experienced delayed rediploidization after the 3R.

Materials and methods

Fish husbandry

Ten immature farm European eel males (mean body mass 96.7±3.6 g ± SEM) supplied by Valenciana de Acuicultura S.A. (Puzol, Valencia, Spain) were transported to the Aquaculture Laboratory at the Universitat Politècnica de València, Spain. The fish were kept in a 200-L tank, equipped with individual recirculation systems, a temperature control system (with heaters and coolers), and aeration. The fish were gradually acclimatized to seawater (final salinity 37 ± 0.3‰), over the course of two weeks. The temperature, oxygen level and pH of rearing were 20°C, 7–8 mg/L and ~ 8.2, respectively. The tank was covered to maintain, as far as possible, a constant dark photoperiod, and the fish were starved throughout the holding period. After acclimation, the fish were sacrificed in order to collect samples of the forebrain (telencephalon, diencephalon, and olfactory bulb), pituitary, and testis tissues.

Human and animal rights

This study was carried out in strict accordance with the recommendations given in the Guide for the Care and Use of Laboratory Animals of the Spanish Royal Decree 53/2013 regarding the protection of animals used for scientific purposes (BOE 2013), and in accordance with the European Union regulations concerning the protection of experimental animals (Dir 86/609/EEC), Guidelines of the European Union (2010/63/EU). The protocol was approved by the Experimental Animal Ethics Committee from the Universitat Politècnica de València (UPV) and final permission was given by the local government (Generalitat Valenciana, Permit Number: 2014/VSC/PEA/00147). The fish were sacrificed using an overdose of anaesthesia.

RNA extraction and sequencing

High quality RNA was extracted from the forebrain, pituitary, and testis samples of one individual male eel (weight: 105.4 g, length: 38.5 cm, and eye index: 4.62), following the protocol developed by Peña-Llopis and Brugarolas [58]. The quantity and quality were tested using a bioanalyser (Agilent Technologies, USA), the samples with sufficient RNA integrity number (RIN) values (RIN > 8.2) and RNA amounts (>3 μg of total RNA) were selected. Total RNA of the three samples were shipped to the company Macrogen Korea (Seoul, South Korea). mRNA purification was carried out on these samples, using Sera-mag Magnetic Oligo (dT) Beads, followed by buffer fragmentation. Reverse transcription was followed by PCR amplification to prepare the samples for sequencing using the TruSeq stranded mRNA LT sample prep kit (Illumina, San Diego, USA). The strand information was kept in an Illumina Hiseq-4000 sequencer (Illumina, San Diego, USA). Resulting raw sequences were 101bp paired-end reads which are available at the NCBI Sequence Read Archive (SRA) under the accession no. SRP126643.

Transcriptome assemblies and genomes

The bioinformatics methodology described below is illustrated in Fig 1. Specifically, FastQC [59] software was used to assess the quality of the raw reads generated by Macrogen. Thereafter, trimmomatic [60] was used to trim the reads, eliminating known adaptor sequences, and low quality regions. Finally, trimmed reads shorter than 50 bp were filtered out. European eel reads were digitally normalized before assembly by Khmer software [61] using a k-mer length of 25 and a coverage of 100. Furthermore, the RNA-Seq raw reads of a Japanese eel Fertilized egg (SRA, NCBI: SRR1930110), preleptocephalus (SRA, NCBI: SRR1930112), leptocephalus (SRA, NCBI: SRR1930115) and glass eel (SRA, NCBI: SRR1930117) were downloaded from NCBI. The RNA-Seq raw reads for Northern pike (Esox lucius), elephantnose fish (Gnathonemus petersii) and silver arowana (Osteoglossum bicirrhosum) were downloaded from the PhyloFish project [62]. All transcriptomes were then assembled using Trinity software [63], taking the strand orientation (for European eel) into account. Naturally produced transcripts may include intervals with a high bias for specific nucleotides (low-complexity), such transcripts may give high-scoring blast results but in fact be biologically insignificant. Therefore, the transcripts assembled were filtered according to their complexity (with a DUST score threshold of 7 and a DUST window of 64), length (with a minimum length of 500 bp), and level of expression (with a transcripts per million (TPM) threshold of 1). The DUST module from BLAST (http://blast.ncbi.nlm.nih.gov/Blast.cgi) was used for this filtering, and Salmon software [64] was used to estimate TPM. After assembly, the coding DNA sequences (CDSs) and proteins were annotated using the Trinotate functional annotation pipeline [63]. Transcripts that share k-mers were clustered by Trinity. However, these transcripts might correspond to different transcript forms of the same gene or to closely related genes from a gene family. We split these transcripts into genes by running a transitive clustering based on a blast search. In this clustering, transcripts, which shared at least 100 bp with a minimum identity of 97%, were considered to be isoforms of the same gene. Thus, some Trinity clusters were split into several genes. For each gene, the most expressed transcript, according to the Salmon software [64], was chosen as its representative (Fig 1).
Fig 1

Methodology.

Pipeline of the bioinformatics methodology. Folders describe the software used, light grey boxes describe the action taken, light brown bubbles describes the rationale for selected actions, and light blue boxes describe the specific goal of each section. Finally, green boxes represent external data input.

Methodology.

Pipeline of the bioinformatics methodology. Folders describe the software used, light grey boxes describe the action taken, light brown bubbles describes the rationale for selected actions, and light blue boxes describe the specific goal of each section. Finally, green boxes represent external data input.

Technical point

The transcriptomes of one eel was used for the European eel de novo transcriptome assembly due to the fact that transcriptome assemblies based on multiple individuals are more prone to mistake allelic variants for recent gene duplications. Despite our transitive clustering, alleles could still be present in our transcriptome. However, these would resemble very similar paralogues, and be assigned a very low 4dTv distance. Therefore, it is implausible that local density maximum of eel paralogues found at higher 4dTv values could be alleles. Additionally, since our assessment of synteny (see Synteny section) is based on the genome, the paralogue pairs from which synteny can be assessed are highly unlikely to include these potential alleles. The Atlantic salmon (Salmo salar) genome assembled by the International Cooperation to Sequence the Atlantic Salmon Genome [41] and the Asian arowana (Scleropages formosus) genome [40] were downloaded from NCBI. The genomes of zebrafish (Danio rerio) [65], fugu (Takifugu rubripes) [66], spotted gar (Lepisosteus oculatus) [39], and platyfish (Xiphophorus maculatus) [67] were downloaded from ENSEMBL (release 87). The Northern pike genome [12] was downloaded from the Northern Pike Genome web site (Genbank accession GCA_000721915.1). For each gene in the genomes, the longest transcript was chosen as the representative. For the synteny analysis, the available European [49] and Japanese [68] eel genomes were downloaded from the ZF-Genomics and the DDBJ web site, respectively (Fig 1).

Genome and transcriptome quality assessment

In order to assess the quality of the transcriptomes and genomes, we looked for the Benchmarking set of Universal Single-Copy Orthologues (BUSCO) conserved gene set in them [69]. BUSCOs are conserved proteins which are expected to be found in complete genomes or transcriptomes. Therefore, the number of present, missing, or fragmented BUSCOs can be used as a quality control of a genome or transcriptome assembly. For this assessment, the Actinopterygii (odb9) gene set, which consists of 4584 single-copy genes that are present in at least 90% of Actinopterygii species, was used. As an additional comparison between the transcriptome and genomes of pike and eels, RNA-seq reads were mapped both to the genome and transcriptome assemblies using HISAT2 [70] and BWA-MEM [71] software, respectively (Fig 1), using default settings in both programs.

Gene families

Genes were clustered into gene families by the OrthoMCL web service [72], which uses the Markov Cluster algorithm to group homologs of all the included datasets, based on all against all BLASTP searches. Therefore, the OrthoMCL gene families were also considered gene families for this study. For each gene family, a multiple protein alignment was built. To avoid transcriptome assembly artefacts, proteins longer than 1,500 amino acids, transcripts with a DUST score higher than 7 and sequences with more than 40% gaps in the alignment, were filtered out. The software Clustal Omega [73] carried out the protein multiple alignment and trimAl [74] removed the regions with too many gaps or those difficult to align. The protein alignment was used as a template to build the codon alignment by aligning the transcript sequences against the corresponding protein using the protein2dna exonerate algorithm [75] (Fig 1).

Phylogenetic reconstruction and duplication dating

The resulting protein alignments were used by PHYLDOG [76] software to generate a species tree as well as a gene family tree corresponding to each alignment. Due to the high memory requirements of PHYLDOG, not all the gene families could be run in the same analysis, therefore 10 analyses were carried out, with 8,000 protein alignments being chosen at random for each. Once all runs were finished, we checked that the species tree topology of all the 10 species trees matched exactly. PHYLDOG uses a maximum likelihood approach to simultaneously co-estimate the species and gene family trees from all individual alignments. In order to confirm the tree topology of the PHYLDOG species tree, the species phylogeny was also reconstructed using a Bayesian approach with PhyloBayes MPI version 1.7 [77]. Furthermore, from the gene families that had one gene for each species, 100 were chosen at random to create a concatenated alignment of 43,566 amino acids. The model used was CAT-GTR and three independent MCMC chains were run for 39,872, 56,328, and 39,285 iterations (Fig 1). PHYLDOG further tagged duplications and assigned these to specific tree branches based on the gene family trees. Between any pair of duplicated sequences, the number of transversions found in the third base of the codon was divided by the number of four-fold degenerated codons resulting in the 4dTv distance. A correction to the 4dTv was applied: ln (1–2 * distance) / -2. The 4dTv was calculated for all the duplications tagged by PHYLDOG within any gene family. These calculations are implemented by the function calculate_4dTv found in the Python scripts (S1 Material). The distribution of 4dTvs was fitted with a lognormal mixture model using the scikit-learn Gaussian Mixture class (Fig 1).

Synteny

The kind of event that created each duplication was characterized by analysing the conserved synteny between the paralogues created by that duplication within a particular genome. Tandem SDs would create paralogues found close to each other in the genome, whereas the paralogues created by a WGD would be far apart, but surrounded by similar genes in each of the duplicated regions. Also, we have to consider that several phylogenetically close species can be affected by the same older duplication event. With this in mind, we categorized duplications as one of 4 classes (Fig 2): i) the paralogue genes that were found close to each other in the genome, within the 50 neighbouring genes to either side, were labelled as “close”, ii) the paralogues which were found in syntenic regions where 2 or more paralogues from other gene families were located within the 50 neighbouring genes to either side, not necessarily in the same collinear order, were labelled as “some synteny”, iii) the cases in which fewer than 2 gene families could be identified within the 50 neighbouring genes to either side, from either of the paralogues genes, were labelled as “no info”, and iv) the cases where conflicting evidence was found in the genomes of the different species affected by the duplication were labelled as “conflicting syntenies”.
Fig 2

Synteny illustration.

Visualization of the assigned synteny types: “some synteny” (), paralogues of genes found close to one duplicate are also found close to the other duplicate; “no synteny” (), less than two paralogues for other genes are found close to both paralogue duplicates; “close” (), duplicated genes are close in the genome; “no information” (), the duplicated genes are located in small scaffolds with too few gene families close by; “conflicting syntenies” (), different synteny classification found in the genomes of the different species affected by the duplication. Sand coloured boxes represent genes which have not been assigned to a gene family, pink boxes represent the gene from which synteny is being assessed; all other colour boxes represent other genes which have been assigned to a gene family.

Synteny illustration.

Visualization of the assigned synteny types: “some synteny” (), paralogues of genes found close to one duplicate are also found close to the other duplicate; “no synteny” (), less than two paralogues for other genes are found close to both paralogue duplicates; “close” (), duplicated genes are close in the genome; “no information” (), the duplicated genes are located in small scaffolds with too few gene families close by; “conflicting syntenies” (), different synteny classification found in the genomes of the different species affected by the duplication. Sand coloured boxes represent genes which have not been assigned to a gene family, pink boxes represent the gene from which synteny is being assessed; all other colour boxes represent other genes which have been assigned to a gene family. This labelling of the duplications was carried out by the Python function “determine_if_pair_is_close_or_syntenic”and the Python class GenomeLocator, found in the scripts (S1 Material). The location of each gene in a genome was obtained by performing a BLAST search with its representative transcript against the genome (Fig 1).

Investigation of functional category enrichment

The EggNOG database has gene ontology (GO) annotations for each of its gene families [78]. To match our gene families with those from the EggNOG database, the protein sequence with least gaps per each of our families was selected and a HMMER search [79] was carried out against the EggNOG position weight matrices with an e-value threshold of 10−4. The GO annotation of the best EggNOG hit in this search was transferred to our family. The enrichment analysis was carried out using the Fisher statistic and the weight algorithm of the topGO library [80] from the Bioconductor project. The R script go_enrichment_analysis found in the scripts (S1 Material) implements this analysis. Freshwater eel transcripts were annotated using the BlastKOALA KEGG service [81] and a Fisher exact test was carried out, using the scipy implementation, to look for overrepresented KEGG pathways in the duplications assigned to the basal freshwater eel branch (Fig 1).

Results

Transcriptome assemblies

Forebrain, testis, and pituitary RNA samples, from an individual European eel, were sequenced, generating a total of 191 million Illumina reads (66, 60 and 65 million from the forebrain, testis, and pituitary, respectively), with a length of 101 bp. These reads were assembled into one de novo transcriptome, using the Trinity assembler after a digital normalization step [61] that left 75 million representative reads. The same procedure was used to generate one de novo transcriptome from Illumina RNA-sequencing reads of the Japanese eel, which was downloaded from the NCBI's Sequence Read Archive [82]. The transcriptomes of Northern pike, elephantnose fish and silver arowana were also assembled by Trinity using Illumina reads from the Phylofish database [62]. The number of unigenes (henceforth referred to as transcripts) assembled ranged from 64,857 to 78,610 (Table 1) and the number of transcript clusters ranged from 46,585 to 55,667 (henceforth referred to as genes; Table 2).
Table 1

Included transcriptomes.

SpeciesN.° ReadsQ30TranscriptsMean GC content (%)
European eel181,322,1060.99477,24751.17
Northern Pike553,710,2180.98968,48948.05
Elephantnose fish498,451,6160.99374,64249.75
Silver arowana490,649,2540.99278,61049.18
Japanese eel458,032,1260.98664,85748.13

Metrics of included raw read datasets from European eel (Anguilla anguilla), Japanese eel (Anguilla japonica), northern Pike (Esox lucius), elephantnose fish (Gnathonemus petersi), and silver arowana (Osteoglossum bicirrhosum).

Table 2

Gene quantities far each species included.

SpeciesTranscriptsGenesRepresentative transcripts with predicted proteinGene family transcripts% of genes assigned to a gene family
European eel77,24754,87927,69625,86293.38
Japanese eel64,85746,58523,78023,09897.13
Zebrafish58,27432,18925,79022,70388.03
Northern pike68,48949,15423,84321,69690.99
Elephantnose fish74,64250,45524,85722,03688.65
Spotted gar22,48318,34118,34117,87297.44
Silver arowana78,61055,66724,93821,60486.63
Asian arowana43,35423,79922,74020,63790.75
Atlantic salmon109,58455,10448,59342,62587.72
Fugu47,84118,52318,52317,69895.55
Platyfish20,45420,37920,37919,80797.19

Quantities of included genes per included species: European eel (Anguilla anguilla), Japanese eel (Anguilla japonica), zebrafish (Danio rerio), northern pike (Esox lucius), elephantnose fish (Gnathonemus petersi), spotted gar (Lepisosteus oculatus), Asian arowana (Scleropages formosus), silver arowana (Osteoglossum bicirrhosum), Atlantic salmon (Salmo salar), fugu (Takifugu rubripes), and platyfish (Xiphophorus maculatus). “Transcripts” represents unigenes, “Genes” represents the number of transcript clusters, “Representative transcripts with predicted protein” represents the number of genes with a successful protein annotation, “Gene family transcripts” represents the representative transcripts with predicted protein with a successful gene family annotation, and “% of genes assigned to a gene family” represents the percentage of representative transcripts with predicted protein with successful gene family annotation.

Metrics of included raw read datasets from European eel (Anguilla anguilla), Japanese eel (Anguilla japonica), northern Pike (Esox lucius), elephantnose fish (Gnathonemus petersi), and silver arowana (Osteoglossum bicirrhosum). Quantities of included genes per included species: European eel (Anguilla anguilla), Japanese eel (Anguilla japonica), zebrafish (Danio rerio), northern pike (Esox lucius), elephantnose fish (Gnathonemus petersi), spotted gar (Lepisosteus oculatus), Asian arowana (Scleropages formosus), silver arowana (Osteoglossum bicirrhosum), Atlantic salmon (Salmo salar), fugu (Takifugu rubripes), and platyfish (Xiphophorus maculatus). “Transcripts” represents unigenes, “Genes” represents the number of transcript clusters, “Representative transcripts with predicted protein” represents the number of genes with a successful protein annotation, “Gene family transcripts” represents the representative transcripts with predicted protein with a successful gene family annotation, and “% of genes assigned to a gene family” represents the percentage of representative transcripts with predicted protein with successful gene family annotation.

Genome and transcriptome quality

The genomes and transcriptomes considered for inclusion in the analysis were quality tested by a BUSCO assessment of completeness. In general, when available, genomes were used instead of transcriptomes, except for pike, and European eel, where the transcriptomes outperformed the genomes according to the BUSCO assessment (Fig 3). Furthermore, the Japanese eel transcriptome was preferred due to a problem with the Japanese eel genome annotation. These transcriptomes also provided a higher mapping of RNA sequencing reads compared to their corresponding genomes. The percentage of reads that mapped concordantly against the genome and the transcriptome were 65.8 and 91.9%, respectively, for European eel, 74.3 and 88.4% for Japanese eel and 44.6 and 85.8% for pike. Furthermore, previously published European eel RNA-sequencing experiments were also mapped to the available European eel genome and our de novo transcriptome. In this case, 52.2% [83], 57.9% [84], and 66.18% [85] reads mapped concordantly against the eel genome whereas 84.3% [83], 69.5% [84], and 87.32% [85] mapped against the transcriptome.
Fig 3

BUSCO analysis.

BUSCO (Benchmarking set of Universal Single-Copy Orthologues) result for included genomes and transcriptomes. The sequence of a BUSCO gene can be found complete or fragmented in each genome and it can be found once (single copy), more than once (duplicated) or not found (missing). Included genomes are: European eel (Anguilla anguilla), Japanese eel (Anguilla japonica), Asian arowana (Scleropages formosus), zebrafish (Danio rerio), northern pike (Esox lucius), spotted gar (Lepisosteus oculatus), fugu (Takifugu rubripes), platyfish (Xiphophorus maculatus) and Atlantic salmon (Salmo salar). Included transcriptomes: European eel, Japanese eel, northern pike, elephantnose fish (Gnathonemus petersii) and silver arowana (Osteoglossum bicirrhosum).

BUSCO analysis.

BUSCO (Benchmarking set of Universal Single-Copy Orthologues) result for included genomes and transcriptomes. The sequence of a BUSCO gene can be found complete or fragmented in each genome and it can be found once (single copy), more than once (duplicated) or not found (missing). Included genomes are: European eel (Anguilla anguilla), Japanese eel (Anguilla japonica), Asian arowana (Scleropages formosus), zebrafish (Danio rerio), northern pike (Esox lucius), spotted gar (Lepisosteus oculatus), fugu (Takifugu rubripes), platyfish (Xiphophorus maculatus) and Atlantic salmon (Salmo salar). Included transcriptomes: European eel, Japanese eel, northern pike, elephantnose fish (Gnathonemus petersii) and silver arowana (Osteoglossum bicirrhosum). Genes were assigned to gene families according to the gene family categorization of OrthoMCL [72]. The percentage of genes with predicted proteins assigned to a family by the OrthoMCL web service [72] ranged from 86.6% (silver arowana) to 97.4% (spotted gar; Table 2). Overall, 15,771 gene families were covered, from which 13,972 protein and codon alignments were built. These families contained between 2 and 172 genes, with 11 genes per family being the mode.

Phylogenetic reconstruction and duplication characteristics

PHYLDOG software was used to tag gene duplications, create a species tree, and assign duplications to tree branches, based on gene family phylogenetic trees. Overall, trees for 10,714 gene families were created by PHYLDOG and based on the tree topology, branches in which a gene appeared to duplicate were labelled. The resulting PHYLDOG species tree matched the species tree topology created by phylobayes [77] and the resulting tree of the concatenated alignment; a cladogram of these trees is included in Fig 4. Since PHYLDOG distinguishes between gene divergence at speciation events and duplications, all genes resulting from tagged duplications are assumed to be paralogues. The assigned duplications were subsequently characterized by synteny and 4dTv distance. The 4dTv distance is used to estimate the accumulation of synonymous mutations, which can be used to estimate the time that has passed from when mutations started to accumulate. The assigned synteny classes include: “close” which indicates SDs that are a result of tandem duplications; “some synteny” which indicates a potential WGD origin (or at least a potential duplication event containing >100 genes); and “no synteny”, which supports neither a SD nor a WGD origin (Fig 2).
Fig 4

4dTv and synteny distributions of duplications per branch of the PHYLDOG species tree.

Quantity, 4dTv and synteny distributions of duplications assigned to each branch of the PHYLDOG species tree. Each panel represents the branch with the corresponding number in the cladogram in the bottom right-hand corner. Species included in this study are: European eel (Anguilla anguilla), Japanese eel (Anguilla japonica), zebrafish (Danio rerio), northern pike (Esox lucius), spotted gar (Lepisosteus oculatus), fugu (Takifugu rubripes), platyfish (Xiphophorus maculatus), Atlantic salmon (Salmo salar), elephantnose fish (Gnathonemus petersii), Asian arowana (Scleropages formosus) and silver arowana (Osteoglossum bicirrhosum). The synteny types are the following: close (), duplicated genes are close in the genome; some synteny (), paralogues of genes found close to one duplicate are also found close to the other duplicate; no synteny (), less than two paralogeus for other genes are found close to both paralogue duplicates; no information (), the duplicated genes are located in small scaffolds with too few genes close by; conflicting syntenies (), different synteny classifications found in the genomes of the different species affected by the duplication.

4dTv and synteny distributions of duplications per branch of the PHYLDOG species tree.

Quantity, 4dTv and synteny distributions of duplications assigned to each branch of the PHYLDOG species tree. Each panel represents the branch with the corresponding number in the cladogram in the bottom right-hand corner. Species included in this study are: European eel (Anguilla anguilla), Japanese eel (Anguilla japonica), zebrafish (Danio rerio), northern pike (Esox lucius), spotted gar (Lepisosteus oculatus), fugu (Takifugu rubripes), platyfish (Xiphophorus maculatus), Atlantic salmon (Salmo salar), elephantnose fish (Gnathonemus petersii), Asian arowana (Scleropages formosus) and silver arowana (Osteoglossum bicirrhosum). The synteny types are the following: close (), duplicated genes are close in the genome; some synteny (), paralogues of genes found close to one duplicate are also found close to the other duplicate; no synteny (), less than two paralogeus for other genes are found close to both paralogue duplicates; no information (), the duplicated genes are located in small scaffolds with too few genes close by; conflicting syntenies (), different synteny classifications found in the genomes of the different species affected by the duplication. PHYLDOG labelled 5,063 duplications to the basal teleost branch, after the split of the spotted gar, with a 4dTv mode of 0.75 (Fig 4, Node 3). Of the paralogues created by these duplications, 73.8% were located in regions with some synteny, 1.5% were close to each other, and 22.5% had no synteny (Fig 4, Node 3). These percentages were calculated without taking into account the duplications where no information regarding the physical location of the genes could be established. The duplications assigned to this basal teleost branch (Fig 4, Node 3) included all gene families with members in both sister clades and thus are assumed to have originated at the 3R. This branch further included hundreds of duplications found in the eels. From these duplications, 62 families had conserved 2 paralogue pairs, one of which had started to diverge at 3R and one in a common ancestor of freshwater eels after the split with osteoglossomorphs. From the paralogue pairs, which had started to diverge in a common ancestor of freshwater eels (Fig 4, Node 9), from these 62 families; 30 were located in regions with some synteny, 9 were close to each other, 11 had no synteny and for 12 pairs no information regarding the physical location of the genes could be established. 1,280 duplications were assigned to the branch basal to the included Clupeocephalan teleosts: zebrafish, fugu, platyfish, northern pike, and Atlantic salmon (Fig 4, Node 4). These duplications showed a very similar distribution with those of the 3R branch, with an overall 4dTv mode of 0.75 (Fig 4, Node 3). The basal freshwater eel branch was assigned 1,217 duplications of which 55.3, 15.8, and 24.3% were labelled as some synteny, close and without synteny, respectively (Fig 4, Node 9). The European and Japanese eel specific branches were assigned 510 and 127 duplications, from which 32.2, and 34.7% were labelled as some synteny, 50.0 and 48.4% were labelled as close, and 17.1 and 14.7% were labelled as without synteny, respectively (Fig 4, Nodes 14 and 15). The basal Osteoglossomorpha and the basal arowana branches were assigned 618 and 661 duplications, from which 95.7, and 76.2% were labelled as some synteny, 0.9 and 17.7% were labelled as close, and 3.5 and 5.1% were labelled as without synteny, respectively (Fig 4, Nodes 8 and 12). The salmon and zebrafish specific branches were assigned 8,787 and 1,525 duplications, respectively, and most of these duplications seemed recent, according to their 4dTv distances. In the salmon branch, most of the duplications (87.0%) were characterized by paralogues located in syntenic regions, whereas most of the zebrafish paralogues (60.5%) were characterized as “close” (Fig 4, Nodes 16 and 7). For all the included species the “close” paralogues (tandem SDs) tended to show low divergence according to their 4dTv, whereas the duplications found in synteny and most of the duplications without sufficient genomic location information, were more often found to have higher 4dTv distance (Fig 4). The duplications assigned to the basal freshwater eel branch showed a 4dTv mode of ~0.4 (Fig 4, Node 9). In order to investigate the relative age of all the homolog pairs found in the eels, we ran a 4dTv distance analysis independent of the PHYLDOG tree topology. In this analysis, we compared the 4dTv distribution found for European eel homologs with Japanese eel, elephantnose fish, silver arowana and Asian arowana (Fig 5). The results showed a homolog density mode at 4dTv of ~0.4 for the European and Japanese eel, and 0.5 for the speciation event that separated elephantnose fish, silver arowana, and Asian arowana from the freshwater eels (Fig 5). Furthermore, in order to obtain comparisons between older eel paralogues (>0.2 of 4dTv) and other teleosts, we produced histograms of the 4dTv distances calculated between all paralogues within each species from 0.2 to 1.4 of 4dTv (Fig 6). Additionally, a nonparametric probability density estimate was calculated, using the Gaussian mixture model and plotted on top of the histogram (Fig 6). These results show an older local density maximum (likely originating from 3R) for all teleosts ranging from 0.62 (Zebrafish) to 0.88 (Fugu) of 4dTv. Furthermore, European eel, Japanese eel, Asian arowana and possibly silver arowana showed an additional 4dTv local density maximum at 0.41, 0.42, 0.56 and 0.55 of 4dTv, respectively (Fig 6). A more recent local density maximum was seen in the Atlantic salmon distribution at 0.15 (Fig 4, Node 16).
Fig 5

4dTv distribution between European eel, elephantnose fish, and the arowanas homologs.

4dTv distribution of European eel (Anguilla anguilla) and Japanese eel homologs (), European eel and elephantnose fish (Gnathonemus petersii) homologs (), and European eel, silver arowana (Osteoglossum bicirrhosum) homologs (), and European eel and Asian arowana (Scleropages formosus) homologs ().

Fig 6

Density distribution of all 4dTv distances between teleost paralogues.

Histograms of all 4dTv distances between paralogues of the included teleosts, presented with yellow and blue bars. Furthermore, a probability density estimate curve is plotted on top of the histograms in red. Density values (y-axis) do not correspond to the density estimate. The included species are: European eel (Anguilla anguilla), Japanese eel (Anguilla japonica), zebrafish (Danio rerio), northern pike (Esox lucius), spotted gar (Lepisosteus oculatus), fugu (Takifugu rubripes), platyfish (Xiphophorus maculatus), Atlantic salmon (Salmo salar), elephantnose fish (Gnathonemus petersii), Asian arowana (Scleropages formosus) and silver arowana (Osteoglossum bicirrhosum).

4dTv distribution between European eel, elephantnose fish, and the arowanas homologs.

4dTv distribution of European eel (Anguilla anguilla) and Japanese eel homologs (), European eel and elephantnose fish (Gnathonemus petersii) homologs (), and European eel, silver arowana (Osteoglossum bicirrhosum) homologs (), and European eel and Asian arowana (Scleropages formosus) homologs ().

Density distribution of all 4dTv distances between teleost paralogues.

Histograms of all 4dTv distances between paralogues of the included teleosts, presented with yellow and blue bars. Furthermore, a probability density estimate curve is plotted on top of the histograms in red. Density values (y-axis) do not correspond to the density estimate. The included species are: European eel (Anguilla anguilla), Japanese eel (Anguilla japonica), zebrafish (Danio rerio), northern pike (Esox lucius), spotted gar (Lepisosteus oculatus), fugu (Takifugu rubripes), platyfish (Xiphophorus maculatus), Atlantic salmon (Salmo salar), elephantnose fish (Gnathonemus petersii), Asian arowana (Scleropages formosus) and silver arowana (Osteoglossum bicirrhosum).

Functional category enrichment

To investigate whether some functional categories were overrepresented among the paralogues assigned to the basal eel branch, two enrichment tests were carried out. First, 1,041 unique GO terms were assigned to 3,607 genes from the basal eel branch, by comparing them to the annotated EggNOG gene families. The full GO annotation can be found in S1 Table. From these terms, we performed an enrichment analysis using the topGO R library [80]. The resulting enriched GO-terms are presented in Table 3. In many cases, these terms were involved either in signalling (e.g. receptor activity, molecular transducer activity, or small GTPase mediated signal transduction), development (e.g. embryonic camera-type eye morphogenesis, gastrulation with mouth forming second, or cell migration involved in gastrulation), ion transport (e.g. anion binding, ATP hydrolysis coupled proton transport, or organic anion transmembrane transporter activity), metabolism (e.g. carbohydrate phosphorylation, ubiquitin-dependent protein catabolic process, or lipopolysaccharide biosynthetic process), or neuronal function (e.g. forebrain development, motor neuron axon guidance, or neuromast development; Table 3). Secondly, KEGG terms were assigned to 1,674 freshwater eel genes using BlastKOALA [81] and mapped onto the KEGG pathways using the KEGG Mapper tool. A Fisher test, corrected for multiple comparisons using False Discovery Rate, was used to look for enriched KEGG pathways in the basal eel branch (Table 4). Most of the KEGG pathways found to be enriched were related to the immune system, nervous system, metabolism and signal transduction. Interestingly, the most significantly enriched KEGG pathway was “Dopaminergic synapse”.
Table 3

Enriched Go-terms from the shared freshwater eel branch of Fig 4.

AspectGO IDTermAnnotatedSignificantExpectedFDR
Biological ProcessGO:0007264small GTPase mediated signal transductio …2564322.970.000064
Biological ProcessGO:0045176apical protein localization330.270.00072
Biological ProcessGO:0008045motor neuron axon guidance1050.90.00099
Biological ProcessGO:0048514blood vessel morphogenesis1212010.860.00257
Biological ProcessGO:0000132establishment of mitotic spindle orienta …840.720.00335
Biological ProcessGO:0048596embryonic camera-type eye morphogenesis1140.990.00625
Biological ProcessGO:0015991ATP hydrolysis coupled proton transport2061.790.00661
Biological ProcessGO:0008333endosome to lysosome transport220.180.00804
Biological ProcessGO:0015031protein transport2843925.480.00900
Biological ProcessGO:0006886intracellular protein transport15619140.00907
Biological ProcessGO:0007160cell-matrix adhesion1651.440.01084
Biological ProcessGO:0001756somitogenesis50104.490.01200
Biological ProcessGO:0060042retina morphogenesis in camera-type eye3793.320.01887
Biological ProcessGO:0072358cardiovascular system development2804425.120.01905
Biological ProcessGO:0040023establishment of nucleus localization430.360.02262
Biological ProcessGO:0009826unidimensional cell growth320.270.02268
Biological ProcessGO:0030326embryonic limb morphogenesis320.270.02268
Biological ProcessGO:0008202steroid metabolic process3433.050.02280
Biological ProcessGO:0007179transforming growth factor beta receptor …1341.170.02382
Biological ProcessGO:0071840cellular component organization or bioge …8468175.910.02822
Biological ProcessGO:0048884neuromast development1541.350.02854
Biological ProcessGO:0001569patterning of blood vessels830.720.02858
Biological ProcessGO:0016998cell wall macromolecule catabolic proces …830.720.02858
Biological ProcessGO:0046835carbohydrate phosphorylation830.720.02858
Biological ProcessGO:0043473pigmentation5775.110.02863
Biological ProcessGO:0060059embryonic retina morphogenesis in camera …1441.260.03104
Biological ProcessGO:0001702gastrulation with mouth forming second2362.060.04241
Biological ProcessGO:0060034notochord cell differentiation630.540.04259
Biological ProcessGO:0061035regulation of cartilage development730.630.04260
Biological ProcessGO:0009103lipopolysaCellular Componentharide biosynthetic process420.360.04268
Biological ProcessGO:0043114regulation of vascular permeability420.360.04268
Biological ProcessGO:0015721bile acid and bile salt transport420.360.04268
Biological ProcessGO:0006511ubiquitin-dependent protein catabolic pr …91148.170.04728
Biological ProcessGO:0030900forebrain development53104.760.04832
Biological ProcessGO:0042074cell migration involved in gastrulation3793.320.04885
Cellular ComponentGO:0031105septin complex640.540.00084
Cellular ComponentGO:0030018Z disc1050.90.00099
Cellular ComponentGO:0031461cullin-RING ubiquitin ligase complex2882.520.00555
Cellular ComponentGO:0008290F-actin capping protein complex220.180.00807
Cellular ComponentGO:0005915zonula adherens220.180.00807
Cellular ComponentGO:0005737cytoplasm1750175157.310.01734
Cellular ComponentGO:0005768endosome4784.220.01771
Cellular ComponentGO:0033180proton-transporting V-type ATPase V1 do …1040.90.01913
Cellular ComponentGO:0030424axon730.630.01920
Cellular ComponentGO:0000159protein phosphatase type 2A complex320.270.02275
Cellular ComponentGO:0005667transcription factor complex79127.10.02336
Cellular ComponentGO:0005912adherens junction740.630.04255
Cellular ComponentGO:0031519PcG protein complex940.810.04258
Cellular ComponentGO:0005890sodium:potassium-exchanging ATPase compl …420.360.04281
Cellular ComponentGO:0043198dendritic shaft420.360.04281
Cellular ComponentGO:0005885Arp2/3 protein complex420.360.04281
Cellular ComponentGO:0005765lysosomal membrane1641.440.04914
Molecular FunctionGO:0005525GTP binding2514322.241.6e-05
Molecular FunctionGO:0043168anion binding1289142114.240.0018
Molecular FunctionGO:0060089molecular transducer activity7785668.950.0078
Molecular FunctionGO:0004331fructose-2 6-bisphosphate 2-phosphatase …220.180.0078
Molecular FunctionGO:0045296cadherin binding220.180.0078
Molecular FunctionGO:0046933proton-transporting ATP synthase activit …1040.890.0083
Molecular FunctionGO:0004702receptor signaling protein serine/threon …1971.680.0112
Molecular FunctionGO:0008242omega peptidase activity630.530.0113
Molecular FunctionGO:0031683G-protein beta/gamma-subunit complex bin …630.530.0113
Molecular FunctionGO:0008013beta-catenin binding730.620.0185
Molecular FunctionGO:0016820hydrolase activity acting on acid anhyd …3883.370.0219
Molecular FunctionGO:0004749ribose phosphate diphosphokinase activit …320.270.0221
Molecular FunctionGO:0008601protein phosphatase type 2A regulator ac …320.270.0221
Molecular FunctionGO:0003796lysozyme activity320.270.0221
Molecular FunctionGO:0008146sulfotransferase activity43103.810.0249
Molecular FunctionGO:0051287NAD binding2251.950.0298
Molecular FunctionGO:0003714transcription corepressor activity930.80.0388
Molecular FunctionGO:0008514organic anion transmembrane transporter …2251.950.0399
Molecular FunctionGO:0004872receptor activity6954161.590.0495

Enriched Go-terms from the duplicated genes shared by freshwater eels. “Aspects” indicates the specific GO-term aspect of each enriched GO-term. “GO ID” indicates the identification number of each enriched GO-term. “Term “indicates the verbal description of each enriched GO-term. “Annotated” indicates the number of GO-terms which are associated with each enriched GO-term. “Significant” indicates the number of GO-terms associated to each enriched GO-term found among the duplicated genes. “Expected” indicates the number of GO-terms expected to be found linked to each enriched GO-term. “FDR” indicates the False Discovery Rate adjusted P-value from the Fisher exact test of enrichment.

Table 4

Enriched KEGG-terms from the shared freshwater eel branch of Fig 4.

KEGG IDTermAnnotatedSignificantExpectedFDR
04728Dopaminergic synapse38283120,000001
03015mRNA surveillance pathway2112950,000082
04660T cell receptor signaling pathway2720480,000082
04071Sphingolipid signaling pathway29238100,000112
05142Chagas disease (American trypanosomiasis)2318070,000518
04659Th17 cell differentiation2318470,000596
05162Measles2116870,001190
04390Hippo signaling pathway29282110,001190
04658Th1 and Th2 cell differentiation1914860,001601
04261Adrenergic signaling in cardiomyocytes28291120,004378
05100Bacterial invasion of epithelial cells2018370,005845
05032Morphine addiction1815560,005845
00625Chloroalkane and chloroalkene degradation51000,005845
04640Hematopoietic cell lineage127930,006536
04910Insulin signaling pathway26276110,006551
04630Jak-STAT signaling pathway1817170,012920
04016MAPK signaling pathway—plant62210,014553
04022cGMP-PKG signaling pathway29350140,015915
04917Prolactin signaling pathway1513350,015915
05130Pathogenic Escherichia coli infection1310840,019059
05418Fluid shear stress and atherosclerosis22245100,022738
00020Citrate cycle (TCA cycle)84720,022808
04080Neuroactive ligand-receptor interaction31395160,023393
04151PI3K-Akt signaling pathway37499200,023393
04514Cell adhesion molecules (CAMs)2123190,023393
04391Hippo signaling pathway—fly1615860,034750
05340Primary immunodeficiency74120,036069
04350TGF-beta signaling pathway1616470,038388
05133Pertussis1211150,045104
05152Tuberculosis21247100,047421
04664Fc epsilon RI signaling pathway1211350,048089
00510N-Glycan biosynthesis97130,048943
04144Endocytosis37533220,048943
00350Tyrosine metabolism63410,048943
04510Focal adhesion28379150,048943

Enriched KEGG-terms from the duplicated genes shared by freshwater eels. “KEGG ID” indicates the identification number of each enriched KEGG pathway. “Term”indicates the verbal description of each enriched KEGG pathway. “Annotated” indicates the number of KEGG pathways, which are associated with each enriched KEGG pathway. “Significant” indicates the number of KEGG pathways associated with each enriched KEGG pathway found among the duplicated genes. “Expected” indicates the number of KEGG pathways expected to be found associated with each enriched KEGG pathway. “FDR” indicates the False Discovery Rate adjusted P-value from the Fisher exact test of enrichment.

Enriched Go-terms from the duplicated genes shared by freshwater eels. “Aspects” indicates the specific GO-term aspect of each enriched GO-term. “GO ID” indicates the identification number of each enriched GO-term. “Term “indicates the verbal description of each enriched GO-term. “Annotated” indicates the number of GO-terms which are associated with each enriched GO-term. “Significant” indicates the number of GO-terms associated to each enriched GO-term found among the duplicated genes. “Expected” indicates the number of GO-terms expected to be found linked to each enriched GO-term. “FDR” indicates the False Discovery Rate adjusted P-value from the Fisher exact test of enrichment. Enriched KEGG-terms from the duplicated genes shared by freshwater eels. “KEGG ID” indicates the identification number of each enriched KEGG pathway. “Term”indicates the verbal description of each enriched KEGG pathway. “Annotated” indicates the number of KEGG pathways, which are associated with each enriched KEGG pathway. “Significant” indicates the number of KEGG pathways associated with each enriched KEGG pathway found among the duplicated genes. “Expected” indicates the number of KEGG pathways expected to be found associated with each enriched KEGG pathway. “FDR” indicates the False Discovery Rate adjusted P-value from the Fisher exact test of enrichment.

Discussion

The present study found more than one thousand gene families in which the gene family tree topology indicates a duplication in a common ancestor of freshwater eels sometime after the split of Elopomorpha and Osteoglossomorpha. Only phylogenetic species tree branches with previously documented WGDs (Fig 4, Nodes 1, 3, 4, and 16) and the zebrafish specific branch (Fig 4, Node 7) were assigned more duplications than the basal freshwater eel branch (Fig 4, Node 9). The vast majority of the assigned zebrafish specific duplications formed a 4dTv local density maximum at ~0 and were found “close” in the genome, thus these duplications appear to be tandem SDs, the presence of which concurs with previous studies [8,11,65].

The origin of the duplications assigned to the basal freshwater eel branch

In some cases, it has been shown that SDs could be retained at specific points in time [5,15,16]. However, most duplications assigned to the basal freshwater eel branch were detected in large syntenic blocks which opposes a hypothesis of a SD origin. Rather the synteny results suggest that the duplications assigned to the basal freshwater eel branch originated in larger portions e.g. whole regions (large SDs), chromosomes or genomes. In particular, a WGD origin is consistent with the number of duplications observed and the 4dTv distribution (Fig 4, Node 9), which showed one distinct 4dTv density mode (4dTv ~0.4) placed along the long branch leading to the freshwater eels. Fig 6 (and Fig 5) further shows duplications which started diverging at the 3R in the eel transcriptome, as a 4dTv local density maximum of ~0.75. The notable similarity between this local density maximum and the local density maximum of the other 3R generated genes from all the included teleosts (Figs 4 and 6) suggests that these paralogues (4dTv ~ 0.75) were created by the 3R. This hypothesis is further supported by the results of the phylogenetic analysis (Fig 4, Nodes 3), which assigned hundreds of duplications, which are still present in the eel, to the 3R branch. Moreover, no modes of comparable magnitude at 4dTv ~ 0.75 can be seen in any of the post 3R branches leading to the freshwater eels (Fig 4, Nodes 5 and 9). Therefore, if the duplications assigned to the basal freshwater eel branch were created by a WGD event, and assuming instant rediploidization after the 3R, this event would be more recent and different to the 3R, and thus should be named a 4R WGD event. However, cytological rediploidization is not always completed immediately after an autotetraploidization WGD event, as shown in the case of salmonids [43]. Therefore, the origin of the duplications assigned to the basal freshwater eel branch could also be explained by a hypothesis of lineage-specific rediploidization after the 3R. Protracted rediploidization could result in lower rates of gene losses since deleterious mutations have less time to accumulate in one paralogue and thereby create a pseudogene, which could explain the high number of paralogue pairs found in eel. This hypothesis could also explain both the PHYLDOG and 4dTv results, as paralogue genes only start to diverge after the rediploidization of their genomic region [41,43,44]. If the duplications assigned to the basal freshwater eel branch had, in fact, experienced delayed rediploidization from the 3R, the same genomic regions would have also experienced delayed rediploidization in the lineage of the remaining teleosts. Interestingly, relatively large quantities of duplications, with conserved synteny, were also assigned to the basal Clupeocephala branch and the Osteoglossomorpha branches (Fig 4, Nodes 4, 8, and 12). This observation supports the hypothesis that the duplications assigned to the basal freshwater eel branch were located in genomic regions, which experienced delayed rediploidization after the 3R. However, lineage-specific rediploidization has only been unequivocally documented in salmonids, and more studies are needed to demonstrate this process in other species. Therefore, it remains to be determined if this mechanism is a salmonid specific phenomenon. Furthermore, due to the observed 4dTv distances, the mechanisms would have protracted rediploidization for a longer time in eels than in salmonids. Moreover, the 4dTv analysis revealed very similar results for the 3R branch and the basal Clupeocephala branch (Fig 4, Nodes 3 and 4) and for the shared Osteoglossomorpha and Arowana branches, respectively (Fig 4, Nodes 8 and 12). This result supports the hypothesis that these duplications were divided due to a potential PHYLDOG artefact, explained below. Additionally, 62 gene families were found which showed a topology concurring with a 3R event followed by a 4R event, since duplication had been conserved from both events in the same gene family. From these 62 families, 30 paralogues pairs of the suspected 4R duplications were located in regions with some synteny. These trees directly oppose the hypothesis of protracted rediploidization; however, only for these 30 families. In the event of an eel 4R WGD, more such trees would be expected.

Possible PHYLDOG artefact

According to PHYLDOG, the shared teleost duplications split into two events placed in the 3R branch and the basal Clupeocephala branch (Fig 4, Nodes 3 and 4). These results could be caused by an artefact from the phylogenetic analysis. Specifically, PHYLDOG software assigns duplications to branches based on the successful identification of the daughter genes on the branches of both sister clades, thus lower genomic information (fewer or less complete genomes/transcriptomes) increases the chance of not finding a gene and thus misplacing duplications. In this case, although the number of genomes/transcriptomes are the same, the amount of genomic information is substantially different between the two daughter clades, since the genomes/transcriptomes on one side (Fig 4, from Node 5), are generally much less complete than those on the other side (Figs 3 and 4, from Node 4). In support of this hypothesis is the 4dTv analysis, in which the duplications assigned to the 3R and the basal Clupeocephala branch indicate approximately the same mode (Fig 4, Nodes 3 and 4), suggesting that they started to diverge at the same time. Therefore, a PHYLDOG artefact, in which duplications can leak down to a daughter branch which is basal to a clade containing more genomic information, is also a parsimonious explanation for most of the duplications assigned to the basal Clupeocephala branch.

Arowana results

As an unexpected result of our analysis, the included Osteoglossomorphas also appear to contain a high quantity of duplications, which likely started diverging after the split between Elopomorphas and Osteoglossomorphas. These duplications also included a high occurrence of paralogues with some conserved synteny between them. This result suggests that these genes were duplicated in larger portions e.g. whole regions (large SDs), chromosomes or genome and not by smaller SDs. When combining the basal Osteoglossomorpha and the basal arowana branches these were assigned a similar quantity of duplications as the basal freshwater eel branch. This result supports the hypothesis that some genomic regions were still under tetrasomic inheritance, from the 3R, at the time of the split between Elopomorphas and Osteoglossomorphas. However, it is also possible that the duplications were generated by a separate duplication event in a common ancestor to the included Osteoglossomorphas but have leaked into the basal arowana branch and the Asian arowana specific branch in the phylogenetic analysis due to the PHYLDOG artefact described above. The PHYLDOG artefact hypothesis is supported by the 4dTv analysis, as the 4dTv modes of these branches are very similar, and since the Elephantnose fish transcriptomes are the least complete dataset of these branches. To generate a better supported hypothesis of the origin of these duplications a study dedicated to this purpose should be conducted.

Start of divergence of the duplications assigned to the basal freshwater eel branch

In the independent 4dTv analysis, without considering phylogenetic tree topologies, the 4dTv of the homologs between the European eel and the Japanese eel, the elephantnose fish and the arowanas, showed that European eel and Japanese eel homologs have a 4dTv mode at ~ 0.4. On the other hand, the homologs between the European eel and any Osteoglossomorpha species form a 4dTv mode at ~ 0.5 (Fig 5). This result indicates that the duplications found in the freshwater eel species started diverging after the split between Elopomorphas and Osteoglossomorphas (Fig 5). Therefore, the phylogenetic reconstruction and the 4dTv distances together suggest that the duplications assigned to the basal freshwater eel branch (4dTv ~ 0.4) started diverging after the teleost specific 3R duplication event (320–350 MYA) [35,36] and after the split between eels and Osteoglossomorphas, but before the Ss4R (88–103 MYA) [86]. If the 4dTv mode observed in the basal freshwater eel branch was the result of new duplications, then these duplications would likely have originated in a common ancestor to all members of the anguillidae family, as these first appear 20–50 MYA [87]. Due to the 4dTv observed, this event could also be shared by wider Elopomorpha; however, without analysing other anguilliforms or Elopomorpha transcriptomes or genomes, this hypothesis remains speculative.

Previously published related data

In concurrence with the present study, other studies have reported data suggesting an unusually high quantity of gene duplications in eels. In the additional data included by Inoue et al. [27], the eel and zebrafish are the species with the highest percentage of duplicated genes (36.6% and 31.9%, respectively). Furthermore, an unexpectedly high number of Hox genes (73 genes) were found in the analysis of the draft eel genome [49]. In this study [49], the phylogenetic distance between Hox clusters was remarkably short, making it impossible to distinguish between the 3R “a” or “b” association of 3 out of 4 cluster pairs based on DNA sequence alone. Several other studies focusing on particular genes have likewise found paralogue pairs in eels, which are not found in other teleosts [48-56] and similarly, an unexpected short phylogenetic distance is often found between eel paralogue pairs. These results support both a 4R hypothesis and a hypothesis of a 3R origin followed by protracted rediploidization. However, many of the referenced studies also presented results of weak conserved local synteny indicating a 3R origin. These synteny results are unexpected following both a 4R hypothesis and the hypothesis of protracted rediploidization. We draw this conclusion based on the notion that the close genomic region of genes, which experienced delayed rediploidization, is highly expected to also have been under tetrasomic inheritance for an extended period [43]. Thus, these neighbouring genes should not accumulate mutations similarly to homolog regions of other teleosts, which experienced immediate rediploidization, and thus the synteny of these regions are unlikely to match.

Conclusions

The data presented in this study support the hypothesis that a remarkably high amount of paralogues pairs started to diverge in a common ancestor of the freshwater eel lineage after the split from the Osteoglossomorpha lineage. The 4dTv and phylogenetic analyses revealed a clear clustering of these paralogues in the basal freshwater eel branch with a 4dTv mode at ~0.4. The synteny of these paralogue pairs suggests they originated in large portions, most likely from a WGD event. However, the results do not unequivocally support/oppose whether i) These paralogues originated from the 3R but are located in genomic regions which have experienced protracted rediploidization; ii) These paralogues originated in a 4R WGD in a common ancestor to freshwater eels; or iii) Both i and ii have contributed to the evolution of these paralogues. The present results offer robust information on the duplicated gene complement of freshwater eels, thus providing novel insights into the peculiar biology of the critically endangered European eel. However, additional high quality genome resources of other Elopomorpha members are needed to further study the dynamics of gene duplication and conservation in early teleost evolution.

GO annotation of eel duplication gene families from EggNOG.

GO annotation from EggNOG of eel duplication gene families shared by freshwater eels. Including: “Orthogroup”, “Descriptions”, “Functional annotations”, “GO categories”, and “GO IDs”. (CSV) Click here for additional data file.

All scripts used for analysis.

(ZIP) Click here for additional data file.
  79 in total

1.  The polyploidy revolution then…and now: Stebbins revisited.

Authors:  Douglas E Soltis; Clayton J Visger; Pamela S Soltis
Journal:  Am J Bot       Date:  2014-07-20       Impact factor: 3.844

2.  Three nuclear and two membrane estrogen receptors in basal teleosts, Anguilla sp.: Identification, evolutionary history and differential expression regulation.

Authors:  Anne-Gaëlle Lafont; Karine Rousseau; Jonna Tomkiewicz; Sylvie Dufour
Journal:  Gen Comp Endocrinol       Date:  2015-11-30       Impact factor: 2.822

Review 3.  The Divergent Genomes of Teleosts.

Authors:  Vydianathan Ravi; Byrappa Venkatesh
Journal:  Annu Rev Anim Biosci       Date:  2018-02-15       Impact factor: 8.923

Review 4.  Dopaminergic inhibition of reproduction in teleost fishes: ecophysiological and evolutionary implications.

Authors:  S Dufour; F-A Weltzien; M-E Sebert; N Le Belle; B Vidal; P Vernier; C Pasqualini
Journal:  Ann N Y Acad Sci       Date:  2005-04       Impact factor: 5.691

5.  Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown.

Authors:  Mihaela Pertea; Daehwan Kim; Geo M Pertea; Jeffrey T Leek; Steven L Salzberg
Journal:  Nat Protoc       Date:  2016-08-11       Impact factor: 13.491

6.  The genome of the platyfish, Xiphophorus maculatus, provides insights into evolutionary adaptation and several complex traits.

Authors:  Manfred Schartl; Ronald B Walter; Yingjia Shen; Tzintzuni Garcia; Julian Catchen; Angel Amores; Ingo Braasch; Domitille Chalopin; Jean-Nicolas Volff; Klaus-Peter Lesch; Angelo Bisazza; Pat Minx; LaDeana Hillier; Richard K Wilson; Susan Fuerstenberg; Jeffrey Boore; Steve Searle; John H Postlethwait; Wesley C Warren
Journal:  Nat Genet       Date:  2013-03-31       Impact factor: 38.330

7.  trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses.

Authors:  Salvador Capella-Gutiérrez; José M Silla-Martínez; Toni Gabaldón
Journal:  Bioinformatics       Date:  2009-06-08       Impact factor: 6.937

8.  Enigmatic orthology relationships between Hox clusters of the African butterfly fish and other teleosts following ancient whole-genome duplication.

Authors:  Kyle J Martin; Peter W H Holland
Journal:  Mol Biol Evol       Date:  2014-06-27       Impact factor: 16.240

9.  The spotted gar genome illuminates vertebrate evolution and facilitates human-teleost comparisons.

Authors:  Ingo Braasch; Andrew R Gehrke; Jeramiah J Smith; Kazuhiko Kawasaki; Tereza Manousaki; Jeremy Pasquier; Angel Amores; Thomas Desvignes; Peter Batzel; Julian Catchen; Aaron M Berlin; Michael S Campbell; Daniel Barrell; Kyle J Martin; John F Mulley; Vydianathan Ravi; Alison P Lee; Tetsuya Nakamura; Domitille Chalopin; Shaohua Fan; Dustin Wcisel; Cristian Cañestro; Jason Sydes; Felix E G Beaudry; Yi Sun; Jana Hertel; Michael J Beam; Mario Fasold; Mikio Ishiyama; Jeremy Johnson; Steffi Kehr; Marcia Lara; John H Letaw; Gary W Litman; Ronda T Litman; Masato Mikami; Tatsuya Ota; Nil Ratan Saha; Louise Williams; Peter F Stadler; Han Wang; John S Taylor; Quenton Fontenot; Allyse Ferrara; Stephen M J Searle; Bronwen Aken; Mark Yandell; Igor Schneider; Jeffrey A Yoder; Jean-Nicolas Volff; Axel Meyer; Chris T Amemiya; Byrappa Venkatesh; Peter W H Holland; Yann Guiguen; Julien Bobe; Neil H Shubin; Federica Di Palma; Jessica Alföldi; Kerstin Lindblad-Toh; John H Postlethwait
Journal:  Nat Genet       Date:  2016-03-07       Impact factor: 38.330

10.  eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences.

Authors:  Jaime Huerta-Cepas; Damian Szklarczyk; Kristoffer Forslund; Helen Cook; Davide Heller; Mathias C Walter; Thomas Rattei; Daniel R Mende; Shinichi Sunagawa; Michael Kuhn; Lars Juhl Jensen; Christian von Mering; Peer Bork
Journal:  Nucleic Acids Res       Date:  2015-11-17       Impact factor: 16.971

View more
  5 in total

1.  Investigating the Transcriptomic and Expression Presence-Absence Variation Exist in Japanese Eel (Anguilla japonica), a Primitive Teleost.

Authors:  Yung-Sen Huang; Chung-Yen Lin; Wen-Chih Cheng
Journal:  Mar Biotechnol (NY)       Date:  2021-10-29       Impact factor: 3.619

2.  An atlas of fish genome evolution reveals delayed rediploidization following the teleost whole-genome duplication.

Authors:  Elise Parey; Alexandra Louis; Jérôme Montfort; Yann Guiguen; Hugues Roest Crollius; Camille Berthelot
Journal:  Genome Res       Date:  2022-08-12       Impact factor: 9.438

3.  Genome-Wide Reconstruction of Rediploidization Following Autopolyploidization across One Hundred Million Years of Salmonid Evolution.

Authors:  Manu Kumar Gundappa; Thu-Hien To; Lars Grønvold; Samuel A M Martin; Sigbjørn Lien; Juergen Geist; David Hazlerigg; Simen R Sandve; Daniel J Macqueen
Journal:  Mol Biol Evol       Date:  2022-01-07       Impact factor: 16.240

4.  DNA Transposon Expansion is Associated with Genome Size Increase in Mudminnows.

Authors:  Robert Lehmann; Aleš Kovařík; Konrad Ocalewicz; Lech Kirtiklis; Andrea Zuccolo; Jesper N Tegner; Josef Wanzenböck; Louis Bernatchez; Dunja K Lamatsch; Radka Symonová
Journal:  Genome Biol Evol       Date:  2021-10-01       Impact factor: 3.416

5.  Vertebrate Alpha2,8-Sialyltransferases (ST8Sia): A Teleost Perspective.

Authors:  Marzia Tindara Venuto; Mathieu Decloquement; Joan Martorell Ribera; Maxence Noel; Alexander Rebl; Virginie Cogez; Daniel Petit; Sebastian Peter Galuska; Anne Harduin-Lepers
Journal:  Int J Mol Sci       Date:  2020-01-14       Impact factor: 5.923

  5 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.