Whole genome sequencing data and de novo draft assemblies for 66 teleost species.

Martin Malmstrøm1, Michael Matschiner1, Ole K Tørresen1, Kjetill S Jakobsen1, Sissel Jentoft1,2.   

Abstract

Teleost fishes comprise more than half of all vertebrate species, yet genomic data are only available for 0.2% of their diversity. Here, we present whole genome sequencing data for 66 new species of teleosts, vastly expanding the availability of genomic data for this important vertebrate group. We report on de novo assemblies based on low-coverage (9-39×) sequencing and present detailed methodology for all analyses. To facilitate further utilization of this data set, we present statistical analyses of the gene space completeness and verify the expected phylogenetic position of the sequenced genomes in a large mitogenomic context. We further present a nuclear marker set used for phylogenetic inference and evaluate each gene tree in relation to the species tree to test for homogeneity in the phylogenetic signal. Collectively, these analyses illustrate the robustness of this highly diverse data set and enable extensive reuse of the selected phylogenetic markers and the genomic data in general. This data set covers all major teleost lineages and provides unprecedented opportunities for comparative studies of teleosts.

Year:  2017        PMID: 28094797      PMCID: PMC5240625          DOI: 10.1038/sdata.2016.132

Source DB:  PubMed          Journal:  Sci Data        ISSN: 2052-4463            Impact factor:   6.444


Background & Summary

Fueled by recent advances in comparative genomics, teleost fishes are becoming increasingly important research objects in several scientific disciplines, ranging from ecology, physiology and evolution to medicine, cancer research and aquaculture[1-7]. Genome information from non-model organisms is highly important in these comparative genomic analyses as they represent specific phenotypes that aid in disentangling the common parts of gene sets from those that have evolved as adaptations to specific ecosystems. In a quest to identify the evolutionary origin of the MHC II pathway loss first observed in the Atlantic cod (Gadus morhua)[8,9], we applied a single sequencing library procedure to cost-efficiently produce draft assemblies for 66 teleost species, representing all major lineages within teleost fishes[10]. Since the alternative immune system, characterized by both the lack of MHC II and an expansion of MHC I, has so far only been identified in the Atlantic cod, we sampled the cod-like fishes of the order Gadiformes more densely than other groups, including 27 species of this order. Based on these genome sequence data, we were able to reconstruct the evolutionary history of the sampled lineages, to pinpoint the loss of the MHC II pathway to the common ancestor of all Gadiformes, and to identify several independent expansions in MHC I copy number within and outside the order Gadiformes. While these analyses and results are reported in a companion paper (Malmstrøm et al.[11]), we here present in greater detail the underlying data sets used for these analyses, including samples, sequencing reads (Data Citation 1), draft assemblies (Data Citation 2), and both mitochondrial and nuclear phylogenetic markers. By providing these data and the applied methodology in a coherent manner we aim to supply the scientific community with a highly diverse, reliable, and easy-to-use genomic resource for future comparative studies. 
Our sequencing strategy was chosen on the basis of several pseudo-replicates of the budgerigar (Melopsittacus undulatus) genome[12] (Data Citation 3), comprising different combinations of read lengths and coverages to determine the most cost-effective manner to produce genome data of sufficient quality for a reliable determination of gene presence or absence. These budgerigar data sets were furthermore assembled with two of the most used assemblers, the de Bruijn graph based SOAPdenovo[13] and the Overlap-Layout-Consensus based Celera Assembler[14] to investigate which assembly algorithm performed best on the various data replicates. On the basis of these in silico experiments, all species were sequenced on the Illumina HiSeq2000 platform, aiming for ~15× coverage. The sequenced reads were then quality controlled, error corrected and trimmed before performing assembly with Celera Assembler. The continuity of the assemblies was subsequently assessed through N50 statistics and the assembly quality was evaluated on the basis of gene space completeness of highly conserved genes. The assemblies were further used to identify mitochondrial genome sequences, which we used in combination with previously available sequences of related teleosts to verify the phylogenetic positions of sampled taxa (Data Citations 4 to 124). By recovering all taxa in their expected positions, clustering with conspecific or congeneric sequences where such were available, our phylogenetic analysis corroborates the correct identification of all sampled taxa and the absence of DNA contamination. Figure 1 illustrates the total workflow, and detailed information for each analysis step is further provided in the Methods section and in Tables 1–7 (available online only). The data sets presented here contain sequencing reads and assembled draft genomes for non-model species adapted to a wide variety of habitats, ranging from the deep sea and tropical coral reefs, to rivers and freshwater lakes. 
These data sets can be used individually or collectively, as resources for studies such as gene family evolution, adaptation to different habitats, phylogenetic inference of teleost orders, transposons and repeat content evolution as well as many other applications regarding gene and genome evolution in a comparative or model organism framework.
Figure 1

Flowchart illustrating the processes involved in creating and validating sequence data for 66 teleost species.

(1) A full overview of species, sample supplier and tissue used for DNA extraction is provided in Table 1 (available online only). (2) The DNA extraction method is also found in Table 1 (available online only). (3) All sequencing libraries were created using the Illumina TruSeq Sample Prep v2 Low-Throughput Protocol. Adaptor indexes are provided in Table 2 (available online only). (4) Sequencing statistics and insert sizes for all species are also listed in Table 2 (available online only). (5) FastQC and SGA PreQC analyses were performed for all read sets prior to assembly. (6) Estimated genome sizes, coverages and assembly statistics for all species are presented in Table 3 (available online only), and accession links are provided in Table 4 (available online only). (7) CEGMA and BUSCO statistics are reported in Table 5 (available online only). (8) GenBank accession numbers and UTG IDs for all mitochondrial genomes used in phylogenetic analyses are provided in Tables 6 and 7 (available online only). (9) The maximum-likelihood phylogeny based on mitochondrial genomes is presented in Fig. 3.

Methods

Sample acquisition and DNA extraction

The majority of samples were taken from validated species (mostly voucher specimens) and were provided by museums or university collections. Some samples were obtained from wild-caught specimens, in collaboration with local fishermen. All samples were stored in either 96% ethanol or RNA-later (Ambion). Genomic DNA was extracted using either the EZNA Tissue DNA Kit (Omega Bio-Tek), following the manufacturer’s instructions, or the ‘High salt DNA extraction’ method as described by Phill Watts (https://www.liverpool.ac.uk/~kempsj/IsolationofDNA.pdf). Detailed information about all samples, including origin, voucher specimen ID and DNA extraction method, is provided in Table 1 (available online only).
Table 1

Sample information for all species in the reported data set

| Order | Species | Supplier | Tissue | Sample ID | Voucher ID |
|---|---|---|---|---|---|
| Osmeriformes | Osmerus eperlanus | 1 | Fin* | Osep_1_#2 | NA |
| Stomiatiformes | Borostomias antarcticus | 2 | Muscle | JYP 598 | ZMUC 8046 |
| Aulopiformes | Parasudis fraserbrunneri | 3 | Muscle* | A430 | CFM 117870 |
| Ateleopodiformes | Guentherus altivela | 3 | Muscle* | B375 | NA |
| Myctophiformes | Benthosema glaciale | 2 | Muscle | JYP 403 | ZMUC 8477 |
| Polymixiiformes | Polymixia japonica | 4 | Muscle* | NSMT-P 79586.1 | NSMTNAP 79586 |
| Percopsiformes | Percopsis transmontana | 5 | Muscle* | KU:KUIT:1890 | KU:KUI:29775 |
| Percopsiformes | Typhlichthys subterraneus | 5 | Muscle* | KU:KUIT:8754 | UAIC 14148.01 |
| Zeiformes | Zeus faber | 3 | Muscle* | B11 | ZSCM 32795 |
| Zeiformes | Cyttopsis roseus | 3 | Muscle | B361 | ZSCM 32479 |
| Stylephoriformes | Stylephorus chordatus | 5 | Muscle* | KU:KUIT:8138 | MCZ 165920 |
| Gadiformes | Bregmaceros cantori | 5 | Muscle | KU:KUIT:5133 | KU:KUI:30244 |
| Gadiformes | Merluccius polli | 3 | Muscle | B116 | ZSCM 40336 |
| Gadiformes | Merluccius merluccius | 6 | Thymus* | Meme(Ly)_IOF_1_#2 | NA |
| Gadiformes | Merluccius capensis | 3 | Muscle | B16 | ZSCM 32773 |
| Gadiformes | Melanonus zugmayeri | 3 | Muscle* | B304 | ZSCM 32519 |
| Gadiformes | Muraenolepis marmoratus | 3 | Muscle* | #95 | NA |
| Gadiformes | Trachyrincus scabrus | 3 | Muscle* | A35 | CFM 117888 |
| Gadiformes | Trachyrincus murrayi | 7 | Fin* | A9-2012-420-171-1 | NA |
| Gadiformes | Mora moro | 8 | Muscle* | Momo(Dy)_Sula_4_#1 | NA |
| Gadiformes | Laemonema laureysi | 3 | Muscle | B43 | ZSCM 32710 |
| Gadiformes | Bathygadus melanobranchus | 3 | Muscle* | B365 | ZSCM 40344 |
| Gadiformes | Macrourus berglax | 6 | Muscle* | Mabe_1_#1 | NA |
| Gadiformes | Malacocephalus occidentalis | 3 | Muscle* | A25 | CFM 117884 |
| Gadiformes | Phycis blennoides | 7 | Fin* | A9-2012-418-76-1 | NA |
| Gadiformes | Phycis phycis | 3 | Muscle | Phph_X5 | NA |
| Gadiformes | Lota lota | 3 | Muscle* | Lolo_X10 | NA |
| Gadiformes | Molva molva | 6 | Thymus* | Momo(Br)_IOF_1_#2 | NA |
| Gadiformes | Brosme brosme | 6 | Spleen* | Brbr_LO_1_#2 | NA |
| Gadiformes | Trisopterus minutus | 6 | Spleen | Trmi_IOF_1_#1 | NA |
| Gadiformes | Gadiculus argenteus | 6 | Spleen | Gaar_IOF_1_#2 | NA |
| Gadiformes | Pollachius virens | 6 | Spleen | Povi_LO_1_#1 | NA |
| Gadiformes | Melanogrammus aeglefinus | 6 | Spleen | Meae_LO_1_#2 | NA |
| Gadiformes | Merlangius merlangus | 6 | Thymus* | Meme(Hy)_OOF_1_#2 | NA |
| Gadiformes | Arctogadus glacialis | 9 | Fin | 0A-08-045_#3 | NA |
| Gadiformes | Boreogadus saida | 7 | Fin* | B3-2012-189-6_#1 | NA |
| Gadiformes | Theragra chalcogramma | 10 | Fin | HS-08.010_#1 | NA |
| Gadiformes | Gadus morhua | 6 | Blood | NEAC_001 | NA |
| Lampriformes | Regalecus glesne | 3 | Muscle* | Regl_X3 | NA |
| Lampriformes | Lampris guttatus | 3 | Muscle* | Lagu_X8 | NA |
| Beryciformes | Monocentris japonica | 4 | Muscle* | NSMT-P 75883.1 | NSMTNAP 75883 |
| Holocentriformes | Myripristis jacobus | 3 | Blood* | KV124 | NA |
| Holocentriformes | Holocentrus rufus | 3 | Blood | X2 | NA |
| Holocentriformes | Neoniphon sammara | 5 | Muscle* | KU:KUIT:6925 | SAIAB 77852 |
| Beryciformes | Beryx splendens | 3 | Muscle | A252 | NA |
| Beryciformes | Rondeletia loricata | 5 | Muscle* | KU:KUIT:8426 | MCZ 167869 |
| Beryciformes | Acanthochaenus luetkenii | 5 | Muscle | KU:KUIT:8430 | MCZ 167873 |
| Ophidiiformes | Brotula barbata | 3 | Muscle* | B392 | ZSCM 32626 |
| Ophidiiformes | Lamprogrammus exutus | 3 | Muscle | A69 | CFM 118100 |
| Ophidiiformes | Carapus acus | 3 | Muscle | B358 | ZSCM 32503 |
| Batrachoidiformes | Chatrabus melanurus | 3 | Muscle | B25 | ZSCM 32594 |
| Scombriformes | Thunnus albacares | 3 | Muscle | Sp569 | NA |
| Gobiiformes | Lesueurigobius cf. sanzi | 3 | Muscle* | B265 | NA |
| Perciformes | Perca fluviatilis | 3 | Muscle | Pefl_NEG_1_#1 | NA |
| Perciformes | Myoxocephalus scorpius | 3 | Muscle* | Seescorpion_1 | NA |
| Perciformes | Sebastes norvegicus | 6 | Spleen | Seno_LO_1_#1 | NA |
| Perciformes | Chaenocephalus aceratus | 9 | Muscle | ANT-XXVII/3,#299 | NA |
| Labriformes | Symphodus melops | 6 | Muscle* | Kg47 | NA |
| Spariformes | Spondyliosoma cantharus | 3 | Muscle* | Sp28 | NA |
| Lophiiformes | Antennarius striatus | 3 | Muscle | B133 | ZSCM 32591 |
| Carangiformes | Selene dorsalis | 3 | Muscle | B97 | NA |
| Anabantiformes | Helostoma temminckii | 11 | Muscle* | LR10256 | NA |
| Anabantiformes | Anabas testudineus | 11 | Muscle | LR11643 | NA |
| Blenniiformes | Parablennius parvicornis | 11 | Muscle* | LR00424 | NA |
| Ovalentariae incertae sedis | Chromis chromis | 3 | Muscle* | Sp51 | NA |
| Ovalentariae incertae sedis | Pseudochromis fuscus | 12 | Muscle* | F5-C4_1 | NA |

Voucher collections:
ZSCM numbers are vouchers from the Zoological State Collection Munich.
CFM numbers are vouchers from the Chicago Field Museum collection.
ZMUC numbers refer to vouchers from the Zoological Museum University of Copenhagen collection.
KUI numbers refer to vouchers from the University of Kansas Biodiversity Institute Ichthyology collection.
UAIC numbers refer to vouchers from the University of Alabama Ichthyology collection.
SAIAB numbers refer to vouchers from the South African Institute for Aquatic Biodiversity collection.
MCZ numbers refer to vouchers from the Museum of Comparative Zoology, Harvard University collection.
NSMT-P numbers refer to vouchers from the National Museum of Nature and Science, Tsukuba, Japan.

Sample suppliers: 1 Kjartan Østbye (University of Oslo, Norway); 2 Jan Yde Poulsen (Greenland Institute of Natural Resources, Greenland); 3 Reinhold Hanel (Thünen-Institute of Fisheries Ecology, Germany); 4 Masaki Miya (Natural History Museum & Institute in Chiba, Japan); 5 Andrew Bentley (University of Kansas Biodiversity Institute, USA); 6 Martin Malmstrøm (University of Oslo, Norway); 7 Christophe Pampoulie (Marine Research Institute of Iceland, Iceland); 8 Irvin Kilde (Norwegian University of Science and Technology in Trondheim, Norway); 9 Walter Salzburger (University of Basel, Switzerland); 10 Ian Bradbury (Memorial University, Canada); 11 Lukas Rüber (Natural History Museum in Bern, Switzerland); 12 Fabio Cortesi (University of Queensland, Australia).

*Isolated with EZNA spin columns

†Isolated with high salt method

‡Isolated with blood plug protocol (see Star et al.[8])

Fragmentation and library preparation

Genomic DNA samples were diluted to 120 μl (50 ng μl−1) with Qiagen Elution Buffer (Qiagen) if necessary and fragmented to lengths of ~400 bp by sonication using a Covaris S220 (Life Technologies) with the following settings: 200 cycles for 90 s with ω-peak at 105. All sequencing libraries were constructed following the Illumina TruSeq Sample Prep v2 Low-Throughput Protocol.

Sequencing and quality control

All sequencing was performed on an Illumina HiSeq 2000 platform with additional chemicals added to extend the number of cycles, yielding paired reads of 150 bp each. The read quality was then assessed using FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Prior to assembly we used SGA PreQC[15] to estimate coverage, per-base error rates, level of heterozygosity, repeat content and genome size in order to assess whether more sequencing would be needed. Some samples were then subjected to a second round of sequencing of the same library. Sequencing statistics are presented in Table 2 (available online only).
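The decision on whether a library needs a second sequencing round comes down to simple coverage arithmetic. The following is a minimal sketch of that arithmetic, not the SGA PreQC procedure itself; the function names and the use of the ~15× target as a threshold are assumptions made for illustration, and the example takes the genome size of Osmerus eperlanus from the Celera Assembler estimate in Table 3 rather than from SGA PreQC.

```python
def expected_coverage(n_reads_millions: float, read_length_bp: int,
                      genome_size_mb: float) -> float:
    """Raw fold coverage: total sequenced bases divided by genome size.

    Millions of reads times read length (bp) gives megabases, so the
    result is unitless when the genome size is also given in Mb.
    """
    total_bases_mb = n_reads_millions * read_length_bp
    return total_bases_mb / genome_size_mb


def needs_more_sequencing(coverage: float, target: float = 15.0) -> bool:
    """Hypothetical rule: flag libraries below the ~15x target coverage."""
    return coverage < target


# Osmerus eperlanus: 84.7 million reads of 150 bp (Table 2),
# estimated genome size 489 Mb (Table 3).
raw = expected_coverage(84.7, 150, 489)
print(f"raw coverage: {raw:.1f}x, resequence: {needs_more_sequencing(raw)}")
```

This raw value is computed before trimming and error correction, so it is higher than the coverage reported in Table 3, which is based on the bases actually used in the assembly.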
Table 2

Sequencing information for all species in the reported data set (Data Citation 1)

| Species | N reads* | N bases | Bases used† | Insert size‡ | Adaptor index |
|---|---|---|---|---|---|
| Osmerus eperlanus | 84.7 | 12,709 | 74% | 366 | AD023 |
| Borostomias antarcticus | 131.9 | 19,789 | 83% | 439 | AD002 |
| Parasudis fraserbrunneri | 149.6 | 22,436 | 82% | 366 | AD022 |
| Guentherus altivela | 101.7 | 15,256 | 97% | 380 | AD020 |
| Benthosema glaciale | 194.4 | 29,154 | 87% | 453 | AD002 |
| Polymixia japonica | 90.2 | 13,525 | 85% | 367 | AD016 |
| Percopsis transmontana | 147.8 | 22,168 | 82% | 350 | AD001 |
| Typhlichthys subterraneus | 142.7 | 21,398 | 81% | 336 | AD003 |
| Zeus faber | 151.5 | 22,722 | 74% | 335 | AD008 |
| Cyttopsis roseus | 161.0 | 24,156 | 75% | 340 | AD013 |
| Stylephorus chordatus | 171.7 | 25,752 | 87% | 428 | AD014 |
| Bregmaceros cantori | 310.3 | 46,537 | 84% | 343 | AD013 |
| Merluccius polli | 112.9 | 16,931 | 80% | 353 | AD015 |
| Merluccius merluccius | 146.5 | 21,968 | 82% | 359 | AD013 |
| Merluccius capensis | 107.0 | 16,047 | 81% | 353 | AD014 |
| Melanonus zugmayeri | 179.2 | 26,884 | 81% | 351 | AD016 |
| Muraenolepis marmoratus | 127.3 | 19,101 | 80% | 345 | AD009 |
| Trachyrincus scabrus | 179.5 | 26,928 | 82% | 358 | AD008 |
| Trachyrincus murrayi | 66.8 | 20,050 | 89% | 519 | AD014 |
| Mora moro | 153.0 | 22,945 | 85% | 345 | AD014 |
| Laemonema laureysi | 105.7 | 15,862 | 80% | 348 | AD015 |
| Bathygadus melanobranchus | 122.4 | 18,362 | 83% | 330 | AD008 |
| Macrourus berglax | 119.7 | 17,948 | 82% | 356 | AD001 |
| Malacocephalus occidentalis | 140.8 | 21,117 | 78% | 352 | AD003 |
| Phycis blennoides | 80.6 | 24,191 | 87% | 505 | AD015 |
| Phycis phycis | 130.9 | 19,631 | 76% | 328 | AD005 |
| Lota lota | 125.1 | 18,758 | 83% | 341 | AD007 |
| Molva molva | 152.3 | 22,852 | 79% | 329 | AD006 |
| Brosme brosme | 122.5 | 18,373 | 80% | 342 | AD012 |
| Trisopterus minutus | 151.6 | 22,737 | 76% | 325 | AD005 |
| Gadiculus argenteus | 138.1 | 20,709 | 79% | 320 | AD004 |
| Pollachius virens | 112.4 | 16,854 | 79% | 328 | AD006 |
| Melanogrammus aeglefinus | 110.9 | 16,639 | 82% | 324 | AD007 |
| Merlangius merlangus | 165.2 | 24,787 | 85% | 339 | AD012 |
| Arctogadus glacialis | 165.2 | 22,006 | 81% | 323 | AD002 |
| Boreogadus saida | 155.1 | 25,243 | 77% | 316 | AD004 |
| Theragra chalcogramma | 155.1 | 23,268 | 84% | 322 | AD002 |
| Gadus morhua | 62.8 | 18,847 | 92% | 531 | AD019 |
| Regalecus glesne | 183.9 | 27,578 | 83% | 344 | AD027 |
| Lampris guttatus | 162.7 | 24,407 | 82% | 353 | AD010 |
| Monocentris japonica | 90.5 | 13,580 | 87% | 468 | AD012 |
| Myripristis jacobus | 64.3 | 19,274 | 84% | 353 | AD001 |
| Holocentrus rufus | 61.5 | 18,463 | 83% | 354 | AD003 |
| Neoniphon sammara | 90.9 | 13,631 | 87% | 474 | AD005 |
| Beryx splendens | 90.1 | 13,509 | 97% | 465 | AD005 |
| Rondeletia loricata | 127.8 | 19,167 | 89% | 467 | AD004 |
| Acanthochaenus luetkenii | 129.4 | 19,411 | 90% | 430 | AD014 |
| Brotula barbata | 131.8 | 19,769 | 81% | 352 | AD015 |
| Lamprogrammus exutus | 63.3 | 9,500 | 93% | 348 | AD014 |
| Carapus acus | 102.7 | 15,410 | 79% | 349 | AD016 |
| Chatrabus melanurus | 333.0 | 49,954 | 82% | 353 | AD020 |
| Thunnus albacares | 120.5 | 18,075 | 86% | 461 | AD015 |
| Lesueurigobius cf. sanzi | 66.2 | 19,846 | 92% | 529 | AD018 |
| Perca fluviatilis | 96.9 | 14,532 | 78% | 364 | AD025 |
| Myoxocephalus scorpius | 105.7 | 15,847 | 80% | 423 | AD019 |
| Sebastes norvegicus | 131.0 | 19,653 | 80% | 349 | AD027 |
| Chaenocephalus aceratus | 301.2 | 45,178 | 83% | 355 | AD022 |
| Symphodus melops | 86.4 | 12,694 | 86% | 440 | AD019 |
| Spondyliosoma cantharus | 97.6 | 14,642 | 90% | 443 | AD014 |
| Antennarius striatus | 71.6 | 10,742 | 71% | 363 | AD022 |
| Selene dorsalis | 107.9 | 16,179 | 86% | 458 | AD016 |
| Helostoma temminckii | 85.3 | 12,795 | 87% | 464 | AD016 |
| Anabas testudineus | 96.8 | 14,520 | 88% | 459 | AD016 |
| Parablennius parvicornis | 141.9 | 21,288 | 88% | 459 | AD019 |
| Chromis chromis | 166.1 | 24,917 | 87% | 341 | AD018 |
| Pseudochromis fuscus | 54.7 | 16,404 | 87% | 418 | AD018 |

*In millions

†Percentage of total bases used in Celera assembly

‡After merging with FLASH (bp)

Draft genome assembly

The methods used for genome assembly are also described in the Supplementary Note of Malmstrøm et al.[11]. We expand on these methods here, describing the different parameters and settings in greater detail in order to present a complete overview of our analyses. All draft genomes were created using Celera Assembler, and the version used was downloaded from the CVS (Concurrent Versions System, http://wgs-assembler.sourceforge.net/) repository on January 12th 2013.

The program meryl, included in the Celera Assembler package, was used to create a database of k-mers from the pairs of sequencing reads. Lower k-mer sizes might not resolve repetitive regions, while higher k-mer sizes might not overlap, leading to a loss of information required to correct the reads. Thus, an intermediate k-mer size of 22 was used for all assemblies. Meryl was run with the following options, where the sequences from the reads were concatenated into a file named ‘reads.fa’:

    meryl -B -v -m 22 -memory 55000 -threads 16 -C -s reads.fa -o reads

In this command, -B specifies that a k-mer database should be created, and -v that this should be done using the verbose setting. The -m option denotes the ‘merSize’, while -C specifies that canonical k-mers (both strands) should be used for creating the k-mer database. The options -threads and -memory specify the computational resources that meryl can utilize and only influence run time.

Most of the computational time used by Celera Assembler is required to identify overlaps between reads. To reduce analysis time and generate longer input sequences, overlapping paired reads were merged with the software FLASH v1.2 (ref. 16), executed with the following command, where -d denotes the path to the output directory (with the prefix given with the -o option), -r is the read length, -f is the insert size, and -s is the standard deviation of the insert size:

    flash input_1.fastq input_2.fastq -d . -r 150 -f 290 -s 50 -o output_prefix

Celera Assembler’s merTrim program (see Tørresen et al.[17]) was used to trim, error correct and remove adapters from all reads. The merTrim program estimates the coverage of the sequencing library by analysing the abundance of k-mers versus the number of k-mers at that abundance. By default, k-mers occurring at a frequency corresponding to at least one fourth of the coverage peak can be used to correct reads with k-mers that occur with a frequency of at most one third of the coverage peak. Reads were trimmed to the largest region containing k-mers with a frequency of more than one third of the coverage peak. The trimming of reads removes sequences not supported by other reads and reduces the possible fragmentation of the assembly. Adaptor sequences are not part of the genome and could lead to assembly fragmentation in the same way as repeated regions would. To remove adaptor sequences and other unsupported sequences from the read data, merTrim was executed with the following command:

    merTrim -F reads.fastq -m 22 -mc meryl_db -mCillumina -t 16 -o out.fastq

In this command, -F specifies the reads, -m the k-mer size, -mc the database of trusted k-mers, and -mCillumina specifies that Illumina type adapters should be removed. The -t option defines the number of threads and thus only influences run time.

Following correction and trimming, the files in frg format were created with the following commands, as implemented in Celera Assembler:

    fastqToCA -technology illumina -insertsize 500 50 -libraryname lib_name -mates read1_clean.fastq,read2_clean.fastq > paired_reads.frg
    fastqToCA -technology illumina-long -insertsize 500 50 -libraryname lib_name -reads merged_reads.fastq > merged_reads.frg

The frg files contain information about the sequencing data, such as the expected insert size, location of the fastq files and the prefix for determining the species. Providing this information in the form of frg files is a prerequisite for Celera Assembler. Celera Assembler was then used to assemble the sequencing reads, with the following command specifying the prefix (-p) and the directory for the output (-d):

    runCA -p prefix -d CA -s spec_file

The ‘spec_file’ contains a list of settings and run options for Celera Assembler. Some of the settings and options are specific to the computing system used for the assembly (such as the number of parallel overlap processes, ‘ovlConcurrency’), but as mentioned above, the k-mer size as specified with the option -m (‘merSize’) can affect the contiguity of the assembly. The option ‘doFragmentCorrection’ was set to 0 because the reads were corrected with merTrim. The content of this file was:

    ovlConcurrency=4
    ovlThreads=8
    cnsConcurrency=32
    merSize=22
    merylMemory=50000
    merylThreads=32
    merThreshold=5000
    doOBT=0
    overlapper=ovl
    ovlRefBlockSize=6000000
    ovlHashBits=24
    ovlHashBlockLength=800000000
    doFragmentCorrection=0
    unitigger=bogart
    batMemory=55
    batThreads=32
    doExtendClearRanges=0
    doToggle=0
    paired_reads.frg
    merged_reads.frg

The output of Celera Assembler consists of a set of three fasta files with increasing continuity that contain unitigs, contigs and scaffolds, respectively. Unitigs are either a unique DNA sequence found in a genome or a repeat, and unique unitigs are used as seeds to create contigs and scaffolds. In cases where Celera Assembler was not able to place a unitig confidently in the assembly, this unitig was not included in the contigs and scaffolds, but output separately. As a result, some additional sequence information is available in the assembled unitig fasta file compared to the assembled scaffolds. These additional sequences can include repeated sequences like transposable elements and tandem repeats, but also repeated gene fragments, conserved gene family domains, and other sequences that conflict with the biological assumptions of the assembler. As multiple copies of the mitochondrial genome are present in each cell, it is sequenced to a much higher coverage than the nuclear genome, and may therefore also be excluded from contigs due to false classification as a repetitive region. For these reasons, unitigs instead of contigs were used for both the identification of fragmented genes (see Malmstrøm et al.[11]) and for the mitochondrial phylogeny analysis described below. Assembly statistics for all draft genomes are provided in Table 3 (available online only).
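The idea behind the canonical k-mer database and the coverage-peak threshold described above can be illustrated with a short Python sketch. This is a conceptual illustration only and does not reproduce the internals of meryl or merTrim; the function names are hypothetical, and the one-fourth fraction is taken from the default described in the text.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_BASES = frozenset("ACGT")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_canonical_kmers(reads, k: int = 22) -> Counter:
    """Count canonical k-mers over all reads, skipping k-mers with ambiguous bases."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            if _BASES.issuperset(kmer):
                counts[canonical(kmer)] += 1
    return counts


def trusted_kmers(counts: Counter, coverage_peak: float, fraction: float = 0.25) -> set:
    """Select k-mers seen at least `fraction` times the coverage peak."""
    threshold = coverage_peak * fraction
    return {kmer for kmer, n in counts.items() if n >= threshold}


# Toy example with k=3 so the result is easy to inspect by hand.
toy_counts = count_canonical_kmers(["ACGTAC"], k=3)
print(dict(toy_counts))
```

Counting both strands under one canonical key means that a read and its reverse complement contribute to the same k-mer count, which is why the database is strand-independent.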
Table 3

Assembly statistics for all species in the reported data set (Data Citation 2).

| Species | Genome size* (Mb) | Coverage† | N50 contig length (bp) | N50 scaffold length (bp) | Total span of scaffolds (Mb) | Recovered‡ |
|---|---|---|---|---|---|---|
| Osmerus eperlanus | 489 | 19.16 | 4,524 | 6,798 | 342 | 70% |
| Borostomias antarcticus | 865 | 18.91 | 3,928 | 5,352 | 429 | 50% |
| Parasudis fraserbrunneri | 935 | 19.55 | 4,177 | 6,366 | 706 | 76% |
| Guentherus altivela | 1,701 | 8.73 | 2,928 | 3,199 | 538 | 32% |
| Benthosema glaciale | 1,304 | 19.54 | 4,393 | 6,091 | 674 | 52% |
| Polymixia japonica | 635 | 18.09 | 5,803 | 9,534 | 553 | 87% |
| Percopsis transmontana | 509 | 35.57 | 8,161 | 15,134 | 457 | 90% |
| Typhlichthys subterraneus | 759 | 22.85 | 7,314 | 9,640 | 555 | 73% |
| Zeus faber | 732 | 22.92 | 4,642 | 6,313 | 609 | 83% |
| Cyttopsis roseus | 640 | 28.14 | 4,843 | 7,060 | 545 | 85% |
| Stylephorus chordatus | 971 | 23.04 | 3,373 | 4,661 | 487 | 50% |
| Bregmaceros cantori | 1,650 | 23.60 | 4,452 | 5,909 | 1,142 | 69% |
| Merluccius polli | 609 | 22.35 | 3,471 | 4,468 | 400 | 66% |
| Merluccius merluccius | 611 | 29.65 | 3,670 | 5,094 | 400 | 66% |
| Merluccius capensis | 653 | 19.89 | 3,792 | 4,760 | 413 | 63% |
| Melanonus zugmayeri | 589 | 37.09 | 4,562 | 7,599 | 432 | 73% |
| Muraenolepis marmoratus | 840 | 18.19 | 3,126 | 3,549 | 415 | 49% |
| Trachyrincus scabrus | 579 | 37.96 | 3,900 | 6,346 | 369 | 64% |
| Trachyrincus murrayi | 678 | 26.47 | 6,231 | 19,931 | 450 | 66% |
| Mora moro | 499 | 39.10 | 3,267 | 4,412 | 344 | 69% |
| Laemonema laureysi | 524 | 24.28 | 3,431 | 4,696 | 305 | 58% |
| Bathygadus melanobranchus | 577 | 26.32 | 4,956 | 6,466 | 430 | 75% |
| Macrourus berglax | 693 | 21.18 | 3,353 | 4,278 | 399 | 58% |
| Malacocephalus occidentalis | 504 | 32.68 | 3,697 | 4,907 | 349 | 69% |
| Phycis blennoides | 674 | 31.24 | 4,532 | 10,570 | 414 | 61% |
| Phycis phycis | 468 | 32.04 | 3,458 | 4,486 | 345 | 74% |
| Lota lota | 512 | 30.51 | 3,803 | 4,876 | 397 | 78% |
| Molva molva | 539 | 33.68 | 4,136 | 5,251 | 437 | 81% |
| Brosme brosme | 551 | 26.78 | 3,682 | 4,636 | 412 | 75% |
| Trisopterus minutus | 517 | 33.50 | 3,248 | 3,962 | 334 | 65% |
| Gadiculus argenteus | 567 | 29.03 | 3,379 | 3,942 | 396 | 70% |
| Pollachius virens | 513 | 25.84 | 3,457 | 4,331 | 394 | 77% |
| Melanogrammus aeglefinus | 543 | 25.17 | 3,215 | 3,690 | 374 | 69% |
| Merlangius merlangus | 566 | 37.16 | 3,538 | 4,430 | 423 | 75% |
| Arctogadus glacialis | 646 | 27.54 | 3,282 | 3,696 | 429 | 66% |
| Boreogadus saida | 641 | 30.48 | 3,221 | 3,566 | 412 | 64% |
| Theragra chalcogramma | 661 | 29.43 | 3,603 | 4,323 | 448 | 68% |
| Gadus morhua | 674 | 25.86 | 5,765 | 16,731 | 492 | 73% |
| Regalecus glesne | 750 | 30.64 | 6,781 | 9,753 | 654 | 87% |
| Lampris guttatus | 1,405 | 14.20 | 4,051 | 5,212 | 847 | 60% |
| Monocentris japonica | 706 | 16.71 | 8,046 | 18,610 | 554 | 79% |
| Myripristis jacobus | 778 | 20.75 | 9,816 | 21,260 | 719 | 92% |
| Holocentrus rufus | 735 | 20.72 | 9,243 | 21,323 | 648 | 88% |
| Neoniphon sammara | 696 | 16.96 | 8,761 | 21,687 | 657 | 94% |
| Beryx splendens | 897 | 14.64 | 4,286 | 5,972 | 532 | 59% |
| Rondeletia loricata | 1,049 | 16.21 | 5,112 | 7,444 | 567 | 54% |
| Acanthochaenus luetkenii | 825 | 21.10 | 5,636 | 8,398 | 544 | 66% |
| Brotula barbata | 519 | 30.96 | 17,578 | 45,713 | 484 | 93% |
| Lamprogrammus exutus | 901 | 9.80 | 4,213 | 5,459 | 492 | 55% |
| Carapus acus | 448 | 27.21 | 9,554 | 16,897 | 387 | 86% |
| Chatrabus melanurus | 1,965 | 20.79 | 4,581 | 5,906 | 1,126 | 57% |
| Thunnus albacares | 836 | 18.60 | 16,808 | 46,871 | 726 | 87% |
| Lesueurigobius cf. sanzi | 1,349 | 13.57 | 6,729 | 11,439 | 808 | 60% |
| Perca fluviatilis | 903 | 12.51 | 4,140 | 5,951 | 629 | 70% |
| Myoxocephalus scorpius | 759 | 16.64 | 5,716 | 9,443 | 518 | 68% |
| Sebastes norvegicus | 782 | 20.04 | 9,467 | 16,530 | 716 | 92% |
| Chaenocephalus aceratus | 1,050 | 35.91 | 5,460 | 7,309 | 623 | 59% |
| Symphodus melops | 628 | 17.80 | 9,362 | 21,217 | 533 | 85% |
| Spondyliosoma cantharus | 767 | 17.19 | 11,633 | 28,109 | 679 | 89% |
| Antennarius striatus | 552 | 13.78 | 6,086 | 9,743 | 441 | 80% |
| Selene dorsalis | 576 | 24.13 | 11,209 | 32,351 | 527 | 92% |
| Helostoma temminckii | 686 | 16.14 | 17,055 | 71,662 | 599 | 87% |
| Anabas testudineus | 576 | 22.08 | 18,817 | 50,098 | 524 | 91% |
| Parablennius parvicornis | 623 | 30.04 | 7,343 | 16,734 | 598 | 96% |
| Chromis chromis | 907 | 23.88 | 8,509 | 14,185 | 832 | 92% |
| Pseudochromis fuscus | 740 | 19.39 | 12,029 | 24,629 | 656 | 89% |

*Estimated by Celera Assembler

†Based on Celera Assembler genome estimation

‡Span of scaffolds divided by estimated genome size
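The N50 and "Recovered" columns of Table 3 follow standard definitions that can be sketched in a few lines of Python. This is a generic illustration under the usual definition of N50 (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length), not the script used by the authors; the function names are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths (bp)."""
    if not lengths:
        raise ValueError("n50() requires at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def recovered_fraction(scaffold_span_mb: float, genome_size_mb: float) -> float:
    """Span of scaffolds divided by the estimated genome size."""
    return scaffold_span_mb / genome_size_mb


# Toy assembly: total length 20, so N50 is reached once 10 bp are covered.
print(n50([2, 3, 4, 5, 6]))

# Osmerus eperlanus (Table 3): 342 Mb of scaffolds, 489 Mb estimated genome.
print(f"{recovered_fraction(342, 489):.0%}")
```

For Osmerus eperlanus the ratio rounds to the 70% listed in the table.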

Code availability

The most crucial commands are given in the Methods section, while additional scripts (used in phylogenetic analyses) are available in the code repository on GitHub (https://github.com/uio-cees/teleost_genomes_data_descriptor).

Data Records

All raw sequencing reads have been deposited in the European Nucleotide Archive (ENA) with study accession number PRJEB12469 (Data Citation 1). Table 4 (available online only) lists the sample identifiers for each species. Each read file is available as a compressed file in fastq format (with extension fastq.gz). For some of the species, more than one read set is available as these were sequenced in two rounds, aiming to increase coverage. Two versions of all assembled genomes, unitigs (utg) and scaffolds (scf), are deposited in the Dryad repository under the digital object identifier (DOI) doi:10.5061/dryad.326r8 (Data Citation 2). See Table 4 (available online only) for the specific DOI for each species and assembly type.
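After downloading a read set, a quick sanity check is to count the reads and bases in the compressed fastq file and compare them with Table 2. The sketch below uses only the Python standard library and assumes the common four-line fastq layout (sequences not wrapped over several lines); the function name and the example file name are hypothetical.

```python
import gzip


def fastq_stats(path: str):
    """Return (number of reads, number of bases) for a gzip-compressed fastq file.

    Assumes four lines per record: header, sequence, '+', qualities.
    """
    n_reads = 0
    n_bases = 0
    with gzip.open(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # the second line of each record is the sequence
                n_reads += 1
                n_bases += len(line.strip())
    return n_reads, n_bases


# Example usage with a hypothetical local file name:
# reads, bases = fastq_stats("species_R1.fastq.gz")
# print(f"{reads / 1e6:.1f} million reads, {bases / 1e6:.0f} Mb")
```

Streaming the file line by line keeps memory use constant, which matters for read sets of this size.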
Table 4

Individual identifiers for samples (read sets) in ENA and genome assemblies in the Dryad repository (Data Citation 2)

| Species | ENA sample accession | DOI for scaffold assembly | Size (Mb) | DOI for unitig assembly | Size (Mb) |
|---|---|---|---|---|---|
| Osmerus eperlanus | SAMEA4028764 | doi:10.5061/dryad.326r8/81 | 104.5 | doi:10.5061/dryad.326r8/82 | 207.1 |
| Borostomias antarcticus | SAMEA4028765 | doi:10.5061/dryad.326r8/98 | 128.6 | doi:10.5061/dryad.326r8/90 | 460.2 |
| Parasudis fraserbrunneri | SAMEA4028766 | doi:10.5061/dryad.326r8/71 | 213.6 | doi:10.5061/dryad.326r8/72 | 463.8 |
| Guentherus altivela | SAMEA4028767 | doi:10.5061/dryad.326r8/77 | 165.8 | doi:10.5061/dryad.326r8/78 | 858.8 |
| Benthosema glaciale | SAMEA4028768 | doi:10.5061/dryad.326r8/91 | 204.8 | doi:10.5061/dryad.326r8/92 | 684.3 |
| Polymixia japonica | SAMEA4028769 | doi:10.5061/dryad.326r8/47 | 167.5 | doi:10.5061/dryad.326r8/48 | 282.7 |
| Percopsis transmontana | SAMEA4028770 | doi:10.5061/dryad.326r8/49 | 140.0 | doi:10.5061/dryad.326r8/50 | 194.1 |
| Typhlichthys subterraneus | SAMEA4028771 | doi:10.5061/dryad.326r8/51 | 169.5 | doi:10.5061/dryad.326r8/52 | 286.6 |
| Zeus faber | SAMEA4028772 | doi:10.5061/dryad.326r8/53 | 186.1 | doi:10.5061/dryad.326r8/54 | 362.3 |
| Cyttopsis roseus | SAMEA4028773 | doi:10.5061/dryad.326r8/55 | 166.8 | doi:10.5061/dryad.326r8/56 | 294.8 |
| Stylephorus chordatus | SAMEA4028774 | doi:10.5061/dryad.326r8/103 | 147.4 | doi:10.5061/dryad.326r8/104 | 455.1 |
| Bregmaceros cantori | SAMEA4028775 | doi:10.5061/dryad.326r8/41 | 341.4 | doi:10.5061/dryad.326r8/42 | 879.0 |
| Merluccius polli | SAMEA4028776 | doi:10.5061/dryad.326r8/29 | 121.2 | doi:10.5061/dryad.326r8/30 | 295.7 |
| Merluccius merluccius | SAMEA4028777 | doi:10.5061/dryad.326r8/25 | 121.2 | doi:10.5061/dryad.326r8/26 | 328.6 |
| Merluccius capensis | SAMEA4028778 | doi:10.5061/dryad.326r8/27 | 125.1 | doi:10.5061/dryad.326r8/28 | 351.8 |
| Melanonus zugmayeri | SAMEA4028779 | doi:10.5061/dryad.326r8/31 | 130.3 | doi:10.5061/dryad.326r8/32 | 331.4 |
| Muraenolepis marmoratus | SAMEA4028780 | doi:10.5061/dryad.326r8/39 | 123.4 | doi:10.5061/dryad.326r8/40 | 398.6 |
| Trachyrincus scabrus | SAMEA4028781 | doi:10.5061/dryad.326r8/67 | 110.6 | doi:10.5061/dryad.326r8/68 | 316.0 |
| Trachyrincus murrayi | SAMEA4028782 | doi:10.5061/dryad.326r8/125 | 132.6 | doi:10.5061/dryad.326r8/126 | 326.5 |
| Mora moro | SAMEA4028783 | doi:10.5061/dryad.326r8/43 | 103.2 | doi:10.5061/dryad.326r8/44 | 292.2 |
| Laemonema laureysi | SAMEA4028784 | doi:10.5061/dryad.326r8/45 | 92.3 | doi:10.5061/dryad.326r8/46 | 266.4 |
| Bathygadus melanobranchus | SAMEA4028785 | doi:10.5061/dryad.326r8/37 | 129.8 | doi:10.5061/dryad.326r8/38 | 285.2 |
| Macrourus berglax | SAMEA4028786 | doi:10.5061/dryad.326r8/33 | 120.9 | doi:10.5061/dryad.326r8/34 | 312.6 |
| Malacocephalus occidentalis | SAMEA4028787 | doi:10.5061/dryad.326r8/35 | 106.0 | doi:10.5061/dryad.326r8/36 | 230.5 |
| Phycis blennoides | SAMEA4028788 | doi:10.5061/dryad.326r8/127 | 121.7 | doi:10.5061/dryad.326r8/128 | 455.0 |
| Phycis phycis | SAMEA4028789 | doi:10.5061/dryad.326r8/17 | 104.6 | doi:10.5061/dryad.326r8/18 | 240.8 |
| Lota lota | SAMEA4028790 | doi:10.5061/dryad.326r8/21 | 119.9 | doi:10.5061/dryad.326r8/22 | 254.5 |
| Molva molva | SAMEA4028791 | doi:10.5061/dryad.326r8/19 | 131.6 | doi:10.5061/dryad.326r8/20 | 243.0 |
| Brosme brosme | SAMEA4028792 | doi:10.5061/dryad.326r8/23 | 125.0 | doi:10.5061/dryad.326r8/24 | 251.4 |
| Trisopterus minutus | SAMEA4028793 | doi:10.5061/dryad.326r8/5 | 101.5 | doi:10.5061/dryad.326r8/6 | 269.0 |
| Gadiculus argenteus | SAMEA4028794 | doi:10.5061/dryad.326r8/15 | 119.2 | doi:10.5061/dryad.326r8/16 | 291.8 |
| Pollachius virens | SAMEA4028795 | doi:10.5061/dryad.326r8/7 | 120.2 | doi:10.5061/dryad.326r8/8 | 230.0 |
| Melanogrammus aeglefinus | SAMEA4028796 | doi:10.5061/dryad.326r8/9 | 114.6 | doi:10.5061/dryad.326r8/10 | 256.7 |
| Merlangius merlangus | SAMEA4028797 | doi:10.5061/dryad.326r8/11 | 128.6 | doi:10.5061/dryad.326r8/12 | 284.2 |
| Arctogadus glacialis | SAMEA4028798 | doi:10.5061/dryad.326r8/1 | 130.1 | doi:10.5061/dryad.326r8/2 | 286.9 |
| Boreogadus saida | SAMEA4028799 | doi:10.5061/dryad.326r8/3 | 124.9 | doi:10.5061/dryad.326r8/4 | 290.5 |
| Theragra chalcogramma | SAMEA4028800 | doi:10.5061/dryad.326r8/13 | 135.8 | doi:10.5061/dryad.326r8/14 | 304.1 |
| Gadus morhua | SAMEA4028801 | doi:10.5061/dryad.326r8/131 | 146.9 | doi:10.5061/dryad.326r8/132 | 336.6 |
| Regalecus glesne | SAMEA4028802 | doi:10.5061/dryad.326r8/73 | 200.3 | doi:10.5061/dryad.326r8/74 | 312.1 |
| Lampris guttatus | SAMEA4028803 | doi:10.5061/dryad.326r8/75 | 259.5 | doi:10.5061/dryad.326r8/76 | 671.2 |
| Monocentris japonica | SAMEA4028804 | doi:10.5061/dryad.326r8/99 | 169.7 | doi:10.5061/dryad.326r8/100 | 283.4 |
| Myripristis jacobus | SAMEA4028805 | doi:10.5061/dryad.326r8/63 | 220.3 | doi:10.5061/dryad.326r8/64 | 319.3 |
| Holocentrus rufus | SAMEA4028806 | doi:10.5061/dryad.326r8/65 | 198.9 | doi:10.5061/dryad.326r8/66 | 307.2 |
| Neoniphon sammara | SAMEA4028807 | doi:10.5061/dryad.326r8/97 | 201.5 | doi:10.5061/dryad.326r8/98 | 278.7 |
| Beryx splendens | SAMEA4028808 | doi:10.5061/dryad.326r8/95 | 163.5 | doi:10.5061/dryad.326r8/96 | 572.3 |
| Rondeletia loricata | SAMEA4028809 | doi:10.5061/dryad.326r8/93 | 173.6 | doi:10.5061/dryad.326r8/94 | 450.5 |
| Acanthochaenus luetkenii | SAMEA4028810 | doi:10.5061/dryad.326r8/101 | 167.4 | doi:10.5061/dryad.326r8/102 | 350.9 |
| Brotula barbata | SAMEA4028811 | doi:10.5061/dryad.326r8/59 | 148.2 | doi:10.5061/dryad.326r8/60 | 210.3 |
| Lamprogrammus exutus | SAMEA4028812 | doi:10.5061/dryad.326r8/57 | 151.5 | doi:10.5061/dryad.326r8/58 | 482.8 |
| Carapus acus | SAMEA4028813 | doi:10.5061/dryad.326r8/61 | 118.5 | doi:10.5061/dryad.326r8/62 | 200.3 |
| Chatrabus melanurus | SAMEA4028814 | doi:10.5061/dryad.326r8/69 | 347.3 | doi:10.5061/dryad.326r8/70 | 1001.0 |
| Thunnus albacares | SAMEA4028815 | doi:10.5061/dryad.326r8/107 | 222.2 | doi:10.5061/dryad.326r8/108 | 363.4 |
| Lesueurigobius cf. sanzi | SAMEA4028816 | doi:10.5061/dryad.326r8/129 | 244.7 | doi:10.5061/dryad.326r8/120 | 683.5 |
| Perca fluviatilis | SAMEA4028817 | doi:10.5061/dryad.326r8/83 | 193.8 | doi:10.5061/dryad.326r8/84 | 382.4 |
| Myoxocephalus scorpius | SAMEA4028818 | doi:10.5061/dryad.326r8/123 | 158.7 | doi:10.5061/dryad.326r8/124 | 375.4 |
| Sebastes norvegicus | SAMEA4028819 | doi:10.5061/dryad.326r8/85 | 219.4 | doi:10.5061/dryad.326r8/86 | 339.5 |
| Chaenocephalus aceratus | SAMEA4028820 | doi:10.5061/dryad.326r8/87 | 190.5 | doi:10.5061/dryad.326r8/88 | 573.8 |
| Symphodus melops | SAMEA4028821 | doi:10.5061/dryad.326r8/119 | 162.9 | doi:10.5061/dryad.326r8/120 | 238.1 |
| Spondyliosoma cantharus | SAMEA4028822 | doi:10.5061/dryad.326r8/105 | 209.0 | doi:10.5061/dryad.326r8/106 | 281.5 |
| Antennarius striatus | SAMEA4028823 | doi:10.5061/dryad.326r8/79 | 135.7 | doi:10.5061/dryad.326r8/80 | 280.1 |
| Selene dorsalis | SAMEA4028824 | doi:10.5061/dryad.326r8/113 | 161.7 | doi:10.5061/dryad.326r8/114 | 235.8 |
| Helostoma temminckii | SAMEA4028825 | doi:10.5061/dryad.326r8/109 | 183.5 | doi:10.5061/dryad.326r8/110 | 241.8 |
| Anabas testudineus | SAMEA4028826 | doi:10.5061/dryad.326r8/111 | 160.6 | doi:10.5061/dryad.326r8/112 | 206.3 |
| Parablennius parvicornis | SAMEA4028827 | doi:10.5061/dryad.326r8/117 | 182.7 | doi:10.5061/dryad.326r8/118 | 257.3 |
Chromis chromisSAMEA4028828doi:10.5061/dryad.326r8/115.253.3doi:10.5061/dryad.326r8/116.397.6
Pseudochromis fuscusSAMEA4028829doi:10.5061/dryad.326r8/121.199.4doi:10.5061/dryad.326r8/122.284.4

Technical Validation

Genome coverage and the N50 lengths of contigs and scaffolds are both considered important attributes for assessing a genome assembly. Assembly statistics for all species are reported in Table 2 (available online only). Another, perhaps more crucial, attribute is the completeness of the gene space, which is particularly important when investigating gene presence or absence. We used two programs, CEGMA[18] (Core Eukaryotic Genes Mapping Approach) v. 2.4.010312 and BUSCO[19] (Benchmarking Universal Single-Copy Orthologs) v. 1.1b, to assess the gene-space completeness of our draft genome assemblies. CEGMA generates a list of ‘partial’ and ‘complete’ gene hits for the 248 most conserved genes, which we used to validate assembly quality. BUSCO can be executed with several reference data sets, each optimized for a different taxonomic group. We used the ‘actinopterygii’ data set consisting of 3,698 highly conserved genes in acanthopterygian species (as of September 9th, 2016, this data set was not yet publicly available, but it was provided by the developers of BUSCO upon request). BUSCO identifies and classifies these genes in the target genomes as either ‘Complete’, ‘Complete and duplicated’, ‘Fragmented’ or ‘Missing’. Table 5 (available online only) lists the CEGMA and BUSCO results for all assembled draft genomes, while Fig. 2a,b shows the proportions of these conserved genes found (as partial hits) in relation to the read coverage and N50 scaffold length of all assemblies. In line with the results of our initial investigation of the budgerigar genome, we find no improvement in CEGMA or BUSCO gene set recovery when assembly coverage exceeds ~15× for the genomes included in this data set (linear regression of BUSCO versus coverage (>15×): R²=0.038, P=0.07; CEGMA versus coverage (>15×): R²=0.002, P=0.30) (Fig. 2a).
When comparing the fractions of partial CEGMA and BUSCO genes recovered in each assembly with the N50 scaffold lengths of these assemblies, an initial steep increase is evident, clearly illustrating the sensitivity of these methods to assembly continuity (linear regression of BUSCO versus N50 scaffold length: R²=0.55, P<10⁻¹²; CEGMA versus N50 scaffold length: R²=0.30, P<10⁻⁵) (Fig. 2b). Finally, we find that the N50 scaffold length is largely uncorrelated with coverage (linear regression: R²=0.015, P=0.17), indicating that the specific sequencing strategy (insert size and read length) and the properties of the sequenced genomes (repeat content etc.) are more likely the limiting factors for N50 scaffold length (Fig. 2c). The observed lack of a correlation across all assemblies seems to be influenced by generally low N50 scaffold lengths for species of the order Gadiformes despite relatively high coverage for these genomes (mean coverage: 28×, mean N50 scaffold length: 6 kbp) compared to all other genomes (mean coverage: 20×, mean N50 scaffold length: 16 kbp). Thus, the species of the order Gadiformes appear more difficult to assemble, which is likely explained by their high proportion of repetitive regions (see Tørresen et al.[17]). Collectively, these analyses illustrate that most of the variation in the recovery rate of the highly conserved genes is not due to low coverage, but rather reflects lineage-specific genomic features such as the amount and identity of repetitive elements that hamper the assembly of long continuous sequences.
Table 5

Gene space completeness metrics for all draft assemblies in this data set (Data Citation 2)

Species | CEGMA complete* | CEGMA partial* | BUSCO complete† | BUSCO duplicated† | BUSCO fragmented† | BUSCO missing†
Osmerus eperlanus | 175 | 220 | 2,071 | 71 | 760 | 867
Borostomias antarcticus | 116 | 180 | 1,101 | 37 | 869 | 1,728
Parasudis fraserbrunneri | 131 | 205 | 1,625 | 63 | 880 | 1,193
Guentherus altivela | 44 | 119 | 402 | 11 | 809 | 2,487
Benthosema glaciale | 171 | 214 | 1,478 | 128 | 768 | 1,452
Polymixia japonica | 188 | 230 | 2,474 | 80 | 663 | 561
Percopsis transmontana | 185 | 233 | 2,411 | 71 | 639 | 648
Typhlichthys subterraneus | 167 | 222 | 2,024 | 55 | 763 | 911
Zeus faber | 155 | 219 | 1,758 | 48 | 917 | 1,023
Cyttopsis roseus | 185 | 233 | 1,969 | 62 | 867 | 862
Stylephorus chordatus | 125 | 193 | 1,189 | 39 | 926 | 1,583
Bregmaceros cantori | 85 | 189 | 867 | 32 | 942 | 1,889
Merluccius polli | 133 | 204 | 1,188 | 44 | 1,019 | 1,491
Merluccius merluccius | 147 | 210 | 1,257 | 36 | 959 | 1,482
Merluccius capensis | 144 | 209 | 1,363 | 38 | 986 | 1,349
Melanonus zugmayeri | 163 | 218 | 1,803 | 56 | 895 | 1,000
Muraenolepis marmoratus | 143 | 209 | 1,258 | 37 | 1,040 | 1,400
Trachyrincus scabrus | 190 | 225 | 1,957 | 45 | 872 | 869
Trachyrincus murrayi | 218 | 235 | 2,806 | 69 | 464 | 428
Mora moro | 155 | 215 | 1,582 | 40 | 948 | 1,168
Laemonema laureysi | 163 | 223 | 1,814 | 41 | 844 | 1,040
Bathygadus melanobranchus | 179 | 223 | 2,026 | 60 | 831 | 841
Macrourus berglax | 147 | 206 | 1,172 | 61 | 927 | 1,599
Malacocephalus occidentalis | 147 | 210 | 1,419 | 40 | 996 | 1,283
Phycis blennoides | 190 | 231 | 2,155 | 44 | 717 | 826
Phycis phycis | 144 | 213 | 1,461 | 36 | 966 | 1,271
Lota lota | 169 | 213 | 1,740 | 49 | 902 | 1,056
Molva molva | 172 | 218 | 1,739 | 61 | 939 | 1,020
Brosme brosme | 167 | 215 | 1,712 | 36 | 907 | 1,079
Trisopterus minutus | 138 | 189 | 1,199 | 39 | 917 | 1,582
Gadiculus argenteus | 119 | 195 | 1,193 | 26 | 958 | 1,547
Pollachius virens | 147 | 208 | 1,405 | 35 | 968 | 1,325
Melanogrammus aeglefinus | 145 | 201 | 1,356 | 41 | 996 | 1,346
Merlangius merlangus | 136 | 195 | 1,429 | 39 | 946 | 1,323
Arctogadus glacialis | 141 | 199 | 1,177 | 34 | 1,036 | 1,485
Boreogadus saida | 137 | 205 | 1,261 | 43 | 955 | 1,482
Theragra chalcogramma | 153 | 210 | 1,486 | 36 | 971 | 1,241
Gadus morhua | 202 | 235 | 2,455 | 43 | 608 | 635
Regalecus glesne | 184 | 234 | 2,228 | 70 | 675 | 795
Lampris guttatus | 129 | 201 | 1,467 | 47 | 953 | 1,278
Monocentris japonica | 199 | 235 | 2,709 | 74 | 545 | 444
Myripristis jacobus | 200 | 230 | 2,844 | 102 | 461 | 393
Holocentrus rufus | 202 | 233 | 2,944 | 97 | 441 | 313
Neoniphon sammara | 195 | 228 | 2,742 | 81 | 505 | 451
Beryx splendens | 152 | 197 | 1,638 | 58 | 941 | 1,119
Rondeletia loricata | 140 | 204 | 1,828 | 62 | 921 | 949
Acanthochaenus luetkenii | 138 | 219 | 1,736 | 59 | 936 | 1,026
Brotula barbata | 230 | 243 | 3,278 | 97 | 247 | 173
Lamprogrammus exutus | 146 | 211 | 1,692 | 77 | 1,011 | 995
Carapus acus | 215 | 240 | 2,666 | 52 | 505 | 527
Chatrabus melanurus | 92 | 188 | 1,142 | 36 | 1,091 | 1,465
Thunnus albacares | 209 | 236 | 3,147 | 99 | 351 | 200
Lesueurigobius cf. sanzi | 163 | 206 | 2,130 | 58 | 625 | 943
Perca fluviatilis | 122 | 189 | 1,673 | 55 | 1,027 | 998
Myoxocephalus scorpius | 164 | 217 | 2,016 | 69 | 803 | 879
Sebastes norvegicus | 190 | 233 | 2,458 | 69 | 698 | 542
Chaenocephalus aceratus | 146 | 199 | 1,918 | 63 | 876 | 904
Symphodus melops | 199 | 229 | 2,755 | 76 | 537 | 406
Spondyliosoma cantharus | 215 | 240 | 3,001 | 78 | 426 | 271
Antennarius striatus | 195 | 226 | 2,312 | 73 | 656 | 730
Selene dorsalis | 215 | 231 | 2,968 | 80 | 427 | 303
Helostoma temminckii | 225 | 236 | 3,387 | 100 | 204 | 107
Anabas testudineus | 225 | 240 | 3,314 | 110 | 245 | 139
Parablennius parvicornis | 186 | 224 | 2,336 | 63 | 650 | 712
Chromis chromis | 180 | 230 | 2,451 | 75 | 670 | 577
Pseudochromis fuscus | 210 | 231 | 2,837 | 90 | 475 | 386

*Out of 248 highly conserved eukaryotic genes.

†Out of 3,698 highly conserved acanthopterygian genes.
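
As a reading aid for Table 5, the following minimal Python sketch shows how the BUSCO counts of one assembly could be turned into proportions. It is not the code used for the analyses. It assumes that the ‘complete’, ‘fragmented’ and ‘missing’ counts partition the 3,698 reference genes (with ‘duplicated’ counted within ‘complete’), and that a ‘partial’ hit means complete plus fragmented. The function name `busco_summary` is hypothetical, and the counts are copied from the Gadus morhua row.

```python
# Illustrative sketch only; not the authors' code.
BUSCO_TOTAL = 3698  # size of the reference gene set stated in the text


def busco_summary(complete, fragmented, missing, total=BUSCO_TOTAL):
    """Return the fractions of complete, partial and missing BUSCO genes.

    Assumption: 'partial' means complete + fragmented, and the three
    categories add up to the size of the reference set.
    """
    if complete + fragmented + missing != total:
        raise ValueError("counts do not add up to the reference set size")
    return {
        "complete": complete / total,
        "partial": (complete + fragmented) / total,
        "missing": missing / total,
    }


# Counts taken from the Gadus morhua row of Table 5.
gadus = busco_summary(complete=2455, fragmented=608, missing=635)
print({name: round(value, 3) for name, value in gadus.items()})
```

The consistency check in the function is also a convenient way to validate rows when the table is parsed from text.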

Figure 2

Correlation between gene space completeness, coverage, and N50 scaffold length for the 66 teleost genomes.

(a) Scatterplot illustrating the correlation of gene space completeness (evaluated on the basis of BUSCO and CEGMA partially complete genes detected) and the read coverage (linear regression of BUSCO versus coverage (>15×): R²=0.038, P=0.07; CEGMA versus coverage (>15×): R²=0.002, P=0.30). (b) Scatterplot showing the correlation of BUSCO / CEGMA scores and N50 scaffold length (linear regression of BUSCO versus N50 scaffold length: R²=0.55, P<10⁻¹² and CEGMA versus N50 scaffold length: R²=0.30, P<10⁻⁵) for all genomes presented in the data set. (c) Scatterplot illustrating the correlation of coverage and N50 scaffold length (linear regression: R²=0.015, P=0.17). Species within the order Gadiformes are represented by triangles in all three plots. The lines shown are smooth LOESS curves, also referred to as local regressions, and the gray shaded areas represent 95% confidence intervals in all three plots.

Phylogenetic analyses using mitochondrial genomes

To verify the correct identification of sampled species and the absence of contamination, we performed phylogenetic analyses of mitochondrial genomes extracted from all assemblies, in combination with previously available mitochondrial sequence data for sampled taxa and their close relatives. Mitochondrial genomes are particularly suitable for this comparison as the coverage of mitochondrial sequences is usually extremely high owing to the multiple copies of mitochondrial DNA (mtDNA) present in each mitochondrion and the large number of mitochondria per cell[20]. Furthermore, mitochondrial genomes are useful phylogenetic markers due to the very low frequency of recombination in animal mtDNA[21] and the large number of mitochondrial genome sequences already available in GenBank[22] (Data Citations 5 to 124). We downloaded mitochondrial genome sequences for 120 species, of which 14 (Lampris guttatus, Polymixia japonica, Percopsis transmontana, Zeus faber, Stylephorus chordatus, Lota lota, Gadus morhua, Monocentris japonicus, Rondeletia loricata, Beryx splendens, Antennarius striatus, Anabas testudineus, Helostoma temminkii, and Perca fluviatilis) were also included in our set of 66 new teleost genome assemblies and an additional 8 (Osmerus mordax, Polymixia lowei, Bregmaceros nectabanus, Beryx decadactylus, Myripristis berndti, Lamprogrammus niger, Carapus bermudensis, and Thunnus thynnus) were represented by a congener. GenBank accession numbers for the 120 downloaded genome sequences are given in Table 6 (available online only) (Data Citations 5 to 124). Protein-coding sequences for all mitochondrial genes except mt-ND6 (see Miya et al.[23]) were extracted from the 120 mitochondrial genomes, aligned with the software MAFFT[24] v.7.213 and translated to amino-acid sequences using AliView[25] v.1.16.
Table 6

GenBank accession numbers for 120 previously published mitochondrial genomes (Data Citations 5 to 124)

Species | GenBank accession | Species | GenBank accession
Abudefduf vaigiensis | NC_009064 | Kareius bicoloratus | NC_003176
Allocyttus niger | NC_004398 | Labracinus cyclophthalmus | NC_009054
Anabas testudineus | NC_024752 | Lampris guttatus | NC_003165
Anomalops katoptron | NC_008128 | Lamprogrammus niger | NC_004378
Anoplogaster cornuta | NC_004391 | Lophiomus setigerus | NC_008125
Antennarius striatus | AB282828 | Lophius americanus | NC_004380
Antigonia capros | NC_004391 | Lota lota | NC_004379
Aphredoderus sayanus | NC_004372 | Lycodes toyamensis | NC_004409
Aptocyclus ventricosus | NC_008129 | Mastacembelus favus | NC_003193
Arcos sp KU 149 | NC_004413 | Melanocetus murrayi | NC_004384
Aspasma minima | NC_008130 | Melanotaenia lacustris | NC_004385
Ateleopus japonicus | NC_003178 | Monocentris japonicus | NC_004392
Aulopus japonicus | NC_002674 | Monopterus albus | NC_003192
Bassozetus zenkevitchi | NC_004374 | Mugil cephalus | NC_003182
Batrachomoeus trispinosus | AP006738 | Myctophum affine | NC_003163
Beryx decadactylus | NC_004393 | Myripristis berndti | NC_003189
Beryx splendens | NC_003188 | Neocyttus rhomboidalis | NC_004399
Bregmaceros nectabanus | NC_008124 | Neolamprologus brichardi | NC_009062
Carangoides armatus | NC_004405 | Neoscopelus microchir | NC_003180
Caranx melampygus | NC_004406 | Odax cyanomelas | NC_009061
Carapus bermudensis | NC_004373 | Oncorhynchus mykiss | NC_001717
Cataetyx rubrirostris | NC_004375 | Oryzias latipes | NC_004387
Caulophryne jordani | NC_004383 | Osmerus mordax | NC_015246
Cetostoma regani | NC_004389 | Ostichthys japonicus | NC_004394
Champsocephalus gunnari | NC_018340 | Paralichthys olivaceus | NC_002386
Chauliodus sloani | NC_003159 | Parazen pacificus | NC_004396
Chaunax abei | NC_004381 | Perca fluviatilis | NC_026313
Chaunax tosaensis | NC_004382 | Percopsis transmontana | NC_003168
Chlorophthalmus agassizi | NC_003160 | Petroscirtes breviceps | NC_004411
Coelorinchus kishinouyei | NC_003169 | Pholis crassispina | NC_004410
Cololabis saira | NC_003183 | Physiculus japonicus | NC_004377
Coregonus lavaretus | NC_002646 | Polymixia japonica | NC_002648
Cottus reinii | NC_004404 | Polymixia lowei | NC_003181
Crossostoma lacustre | NC_001727 | Porichthys myriaster | NC_006920
Cyprinus carpio | NC_001606 | Poromitra oscitans | NC_003172
Dactyloptena peterseni | NC_003194 | Pterocaesio tile | NC_004408
Dactyloptena tiltoni | NC_004402 | Rhyacichthys aspro | NC_004414
Danacetichthys galathenus | NC_003185 | Rondeletia loricata | NC_003186
Diaphus splendidus | NC_003164 | Salarias fasciatus | AP004451
Diplacanthopoma brachysoma | NC_004376 | Sardinops melanostictus | NC_002616
Diplophos sp MM1999 | AB034825 | Sargocentron rubrum | NC_004395
Diretmoides veriginae | NC_008126 | Satyrichthys amiscus | NC_004403
Diretmus argenteus | NC_008127 | Saurida undosquamis | NC_003162
Eleotris acanthopoma | NC_004415 | Scarus schlegeli | NC_011936
Emmelichthys struhsakeri | NC_004407 | Scomber japonicus | NC_013723
Etheostoma radiosum | NC_005254 | Scopelogadus mizolepis | NC_003171
Exocoetus volitans | NC_003184 | Sigmops gracilis | NC_002574
Gadus morhua | NC_002081 | Sirembo imberbis | NC_008123
Gambusia affinis | NC_004388 | Sparus aurata | NC_024236
Gasterosteus aculeatus | AP002944 | Stephanolepis cirrhifer | NC_003177
Halieutaea stellata | AP005977 | Stylephorus chordatus | NC_009948
Harpadon microchir | NC_003161 | Sufflamen fraenatum | NC_004416
Helicolenus hilgendorfi | NC_003195 | Synbranchus marmoratus | AP004439
Helostoma temminkii | NC_022728 | Thunnus thynnus | NC_014052
Histrio histrio | AB282829 | Trachipterus trachipterus | NC_003166
Hoplostethus japonicus | NC_003187 | Zalieutes elater | AB282835
Hypoatherina tsurugae | NC_004386 | Zenion japonicum | NC_004397
Hypoptychus dybowskii | NC_004400 | Zenopsis nebulosus | NC_003173
Ijimaia dofleini | NC_003179 | Zeus faber | NC_003190
Indostomus paradoxus | NC_004401 | Zu cristatus | NC_003167
To extract mitochondrial genomes from the 66 new unitig assemblies, we generated nucleotide BLAST databases for a subset of each assembly, consisting of all unitigs matched by at least 1,000 reads. This threshold was selected based on observed coverage distributions and the assumption that mitochondrial unitigs have particularly high coverage due to the higher abundance of mitochondrial relative to nuclear DNA within each cell. The use of this threshold does not imply that all unitigs with higher coverage are mitochondrial, only that unitigs with lower coverage were ignored when mining for mitochondrial orthologs. For each mitochondrial gene, all 120 aligned amino-acid sequences were used as queries in searches with TBLASTN[26] v.2.2.29 to identify unitigs with orthologous sequences in each of the 66 BLAST databases. For comparison, we also performed TBLASTN searches with the same queries against 10 additional BLAST databases generated for genome assemblies downloaded from ENSEMBL[27] v.78 (Danio rerio, Astyanax mexicanus, Gadus morhua, Gasterosteus aculeatus, Oreochromis niloticus, Oryzias latipes, Takifugu rubripes, Tetraodon nigroviridis and Xiphophorus maculatus) and GenBank (Salmo salar; NCBI accession number AGKD00000000.3). For each of the 76 BLAST databases, the overall best TBLASTN hit for each mitochondrial gene was recorded and accepted as a homologous sequence if its e-value was below 1e-15. In cases where different unitigs matched different regions of the same gene (each with e-values below the threshold), these unitigs were jointly recorded as a single hit. Unitig identifiers for all hits are given in Table 7 (available online only). All hits were subsequently added to the untranslated mitochondrial gene alignments and realigned on the basis of amino-acid translations using TranslatorX[28].
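
The hit-selection rule described above (best TBLASTN hit per gene, accepted only below an e-value of 1e-15) can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used for the data set: it assumes the standard 12-column tabular BLAST output, hypothetical query identifiers of the form `<gene>|<species>`, made-up unitig names, and it ranks hits by e-value. The joint recording of several unitigs that cover different regions of one gene is not handled here.

```python
import csv
import io

EVALUE_CUTOFF = 1e-15  # acceptance threshold stated in the text

# Hypothetical tabular BLAST output (12 standard columns); all identifiers
# and scores below are invented for illustration.
EXAMPLE = """\
mt-co1|Gadus_morhua\tutg001\t92.1\t510\t40\t0\t1\t510\t100\t1629\t1e-180\t900
mt-co1|Lota_lota\tutg001\t91.0\t510\t46\t0\t1\t510\t100\t1629\t1e-175\t880
mt-co1|Zeus_faber\tutg002\t40.0\t60\t36\t0\t5\t64\t10\t189\t1e-05\t45
mt-nd1|Gadus_morhua\tutg003\t85.0\t320\t48\t0\t1\t320\t50\t1009\t1e-120\t600
"""


def best_hits(blast_tsv, cutoff=EVALUE_CUTOFF):
    """Return {gene: (unitig, evalue)} with the lowest e-value hit per gene.

    Hits whose e-value is not below the cutoff are discarded.
    """
    best = {}
    for row in csv.reader(io.StringIO(blast_tsv), delimiter="\t"):
        if not row:
            continue
        gene = row[0].split("|")[0]            # assumed query naming scheme
        unitig, evalue = row[1], float(row[10])  # column 11 holds the e-value
        if evalue >= cutoff:
            continue
        if gene not in best or evalue < best[gene][1]:
            best[gene] = (unitig, evalue)
    return best


for gene, (unitig, evalue) in sorted(best_hits(EXAMPLE).items()):
    print(gene, unitig, evalue)
```

Pooling the queries from all 120 reference species per gene, as done here, mirrors the description that the overall best hit per gene and database was recorded.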
Alignments were further analyzed with the software BMGE[29] v.1.0 to determine unreliably aligned regions, and we excluded all codons that included sites with a gap rate above 0.2 or a smoothed entropy-like score (see Criscuolo & Gribaldo[29]) above 0.5. Finally, we concatenated the alignments of all mitochondrial genes, excluding two taxa (Parasudis fraserbrunneri and Acanthochaenus luetkenii) for which no homologs could be identified for eight or more genes. The final alignment used for phylogenetic inference included 9,303 bp.
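
To make the codon-level gap filter concrete, the sketch below removes every codon that contains at least one site with a gap rate above 0.2. It only illustrates the gap-rate criterion; the smoothed entropy-like score computed by BMGE is not reproduced, and the function name and toy alignment are hypothetical.

```python
def filter_codons_by_gap_rate(alignment, max_gap_rate=0.2):
    """Drop every codon containing a site whose gap rate exceeds max_gap_rate.

    alignment: list of equal-length, in-frame nucleotide strings.
    Returns the filtered sequences in the same order.
    """
    n_seqs = len(alignment)
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment) or length % 3:
        raise ValueError("expected equal-length sequences made of whole codons")
    kept_starts = []
    for start in range(0, length, 3):
        # Gap rate of each of the three sites of this codon.
        gap_rates = [
            sum(seq[pos] == "-" for seq in alignment) / n_seqs
            for pos in range(start, start + 3)
        ]
        if max(gap_rates) <= max_gap_rate:
            kept_starts.append(start)
    return ["".join(seq[s:s + 3] for s in kept_starts) for seq in alignment]


# Toy alignment: the second codon has a gap rate of 0.4 and is removed,
# the third codon has a gap rate of exactly 0.2 and is kept.
toy = ["ATGAAACCC", "ATG---CCC", "ATG---CCT", "ATGAAACC-", "ATGAAGCCC"]
print(filter_codons_by_gap_rate(toy))
```

Filtering whole codons rather than single sites keeps the reading frame intact, which matters for the codon-position partitions used in the subsequent phylogenetic inference.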
Table 7

ID of all unitigs containing mitochondrial data for all species included in the data set

Species | UTG IDs
Acanthochaenus luetkenii | utg7180003914080, utg7180003914081
Anabas testudineus | utg7180000074085
Antennarius striatus | utg7180002097916, utg7180002097917, utg7180002097918, utg7180002097919
Arctogadus glacialis | utg7180001210258
Bathygadus melanobranchus | utg7180000000032
Benthosema glaciale | utg7180006223522, utg7180007030062, utg7180007030067, utg7180007030068, utg7180007152696, utg7180007434654, utg7180007609485, utg7180007660002, utg7180007660003, utg7180007673377, utg7180007673378
Beryx splendens | utg7180000469666, utg7180000771701, utg7180004939694, utg7180005165554, utg7180005165555, utg7180005341444, utg7180005341445, utg7180005366167, utg7180005385228, utg7180005385229, utg7180005385230, utg7180005385231, utg7180005476828, utg7180005476831, utg7180005476832, utg7180005569875, utg7180005569876, utg7180005673983, utg7180005673984, utg7180005673985, utg7180005691644, utg7180005691645, utg7180005705333, utg7180005869836, utg7180006066164, utg7180006120309, utg7180006133924
Boreogadus saida | utg7180001220567, utg7180001220611
Borostomias antarcticus | utg7180001274025, utg7180003691481, utg7180003691544, utg7180003691567, utg7180003754703, utg7180003754717, utg7180003754718, utg7180003754733, utg7180003811941, utg7180004025360, utg7180004025368, utg7180004025384, utg7180004102570, utg7180004469635, utg7180004469636, utg7180004492307, utg7180004584377, utg7180004605154, utg7180004605155, utg7180004625717
Bregmaceros cantori | utg7180000000000, utg7180000000028, utg7180000003257
Brosme brosme | utg7180001047115, utg7180001047140
Brotula barbata | utg7180000000018
Carapus acus | utg7180000000000
Chaenocephalus aceratus | utg7180002324592
Chatrabus melanurus | utg7180004205210, utg7180004208747, utg7180004208748, utg7180004208749, utg7180004208753, utg7180004208754, utg7180004208755, utg7180004208756, utg7180004212026, utg7180004212030, utg7180004212034, utg7180004212035, utg7180004212036, utg7180004215531, utg7180004215533
Chromis chromis | utg7180001202954, utg7180001209662, utg7180001209663, utg7180001266383, utg7180001266389, utg7180001322771, utg7180001335484, utg7180001335485, utg7180001335486, utg7180001335489, utg7180001415702, utg7180001623875, utg7180001623880, utg7180001660412
Cyttopsis roseus | utg7180001278658, utg7180001278979
Gadiculus argenteus | utg7180001379789, utg7180001379798, utg7180001387987
Guentherus altivela | utg7180000271871, utg7180000494472, utg7180000503520, utg7180001132842, utg7180001368068, utg7180001512825, utg7180001890828, utg7180002011637, utg7180002196109, utg7180002773145, utg7180007337091, utg7180008012272, utg7180008012273, utg7180008012274, utg7180008012275, utg7180008692799, utg7180008692802, utg7180008940047, utg7180008940048, utg7180008940049, utg7180009466012
Helostoma temminckii | utg7180000715927, utg7180000715930
Holocentrus rufus | utg7180000000000
Laemonema laureysi | utg7180001371677
Lampris guttatus | utg7180002509326, utg7180002509328, utg7180002509329, utg7180002509330, utg7180002509333, utg7180002509335, utg7180002509336, utg7180002509337, utg7180002509339, utg7180002509340, utg7180002509341, utg7180002509342, utg7180002511349, utg7180002511350, utg7180002511351, utg7180002512148, utg7180002817224
Lamprogrammus exutus | utg7180004205210, utg7180004208747, utg7180004208748, utg7180004208749, utg7180004208753, utg7180004208754, utg7180004208755, utg7180004208756, utg7180004212026, utg7180004212030, utg7180004212034, utg7180004212035, utg7180004212036, utg7180004215531, utg7180004215533
Lesueurigobius cf. sanzoi | utg7180000000879
Lota lota | utg7180000000000
Macrourus berglax | utg7180000034506, utg7180001621271, utg7180001623489, utg7180001624139
Malacocephalus occidentalis | utg7180000000000
Melanogrammus aeglefinus | utg7180000000000
Melanonus zugmayeri | utg7180000000010, utg7180001692953
Merlangius merlangus | utg7180000000000
Merluccius capensis | utg7180001513810
Merluccius merluccius | utg7180001887176, utg7180001917684, utg7180001972531, utg7180001981428, utg7180002025406, utg7180002025422, utg7180002097733, utg7180002097734, utg7180002097738, utg7180002079715, utg7180002146127, utg7180002218855, utg7180002218856, utg7180002307512
Merluccius polli | utg7180001442827
Molva molva | utg7180000000000
Monocentris japonica | utg7180000342919, utg7180000463143, utg7180000514479, utg7180000538369, utg7180001377029, utg7180001377031, utg7180001377032, utg7180001434412, utg7180001434417, utg7180001434418, utg7180001434429, utg7180001434430, utg7180001434446, utg7180001434447, utg7180001434448, utg7180001469573, utg7180001469581, utg7180001469582, utg7180001469586, utg7180001469587, utg7180001469589, utg7180001486715, utg7180001516364, utg7180001516372, utg7180001516373, utg7180001516374, utg7180001524523, utg7180001524552, utg7180001550906, utg7180001550907, utg7180001550908, utg7180001692212, utg7180001692214, utg7180001820920
Mora moro | utg7180000000000
Muraenolepis marmoratus | utg7180000000000, utg7180001973851, utg7180001973898
Myoxocephalus scorpius | utg7180002464675, utg7180002464676, utg7180002464677, utg7180002464678, utg7180002504481, utg7180002504482, utg7180002533598, utg7180002533599, utg7180002533605, utg7180002533606
Myripristis jacobus | utg7180000000000
Neoniphon sammara | utg7180000064300
Osmerus eperlanus | utg7180000000726
Parablennius parvicornis | utg7180000020269
Parasudis fraserbrunneri | utg7180003294189, utg7180003426264, utg7180003433528
Perca fluviatilis | utg7180001412776, utg7180001412933
Percopsis transmontana | utg7180000622724, utg7180000622787, utg7180000622789, utg7180000630769, utg7180000630770, utg7180000634249, utg7180000640180, utg7180000671918, utg7180000671919, utg7180000681531, utg7180000690201, utg7180000690202, utg7180000695306, utg7180000695308, utg7180000716419, utg7180000757684, utg7180000782025
Phycis blennoides | utg7180003799308
Phycis phycis | utg7180001189424
Pollachius virens | utg7180000000000
Polymixia japonica | utg7180001067565, utg7180001067570, utg7180001067565
Pseudochromis fuscus | utg7180001142570, utg7180001142583, utg7180001142600, utg7180001145451, utg7180001145455
Regalecus glesne | utg7180000000000
Rondeletia loricata | utg7180000491946, utg7180000516073, utg7180000842519, utg7180000847838, utg7180000928011, utg7180000966600, utg7180001048288, utg7180001149138, utg7180001161638, utg7180001206941, utg7180001478759, utg7180001623954, utg7180002730734, utg7180002730736, utg7180002730737, utg7180002814675, utg7180002976817, utg7180002976819, utg7180003297619, utg7180003297620, utg7180003394061, utg7180003394062, utg7180003438009, utg7180003438010, utg7180003438011, utg7180003438012, utg7180003438013, utg7180003936305, utg7180003936306, utg7180003936728, utg7180004045904, utg7180004045909, utg7180004317827, utg7180004338610, utg7180004338611, utg7180004601252, utg7180004621852, utg7180004746867, utg7180004746868
Sebastes norvegicus | utg7180001468849
Selene dorsalis | utg7180001234455
Spondyliosoma cantharus | utg7180001069401, utg7180001069405
Stylephorus chordatus | utg7180003402356, utg7180003402376, utg7180003402383, utg7180003402389, utg7180003428557, utg7180003428577, utg7180003428590, utg7180003428591, utg7180003428594, utg7180003428603, utg7180003428622, utg7180003444599, utg7180003444601, utg7180003444608, utg7180003456661, utg7180003456662, utg7180003456664, utg7180003456684, utg7180003456692, utg7180003514255, utg7180003514256, utg7180003560601, utg7180003560603, utg7180003560604, utg7180003560606, utg7180003560623, utg7180003727526, utg7180003727530, utg7180003727534, utg7180003727536, utg7180003727543, utg7180003727550, utg7180003727562, utg7180003917108, utg7180003917109, utg7180003917110, utg7180003917115, utg7180003917116, utg7180003917712, utg7180003917713, utg7180004070138, utg7180004070139
Symphodus melops | utg7180000868836, utg7180000889427, utg7180000868836
Theragra chalcogramma | utg7180000000000
Thunnus albacares | utg7180001817200, utg7180001817201, utg7180001817221, utg7180001817255, utg7180001817266, utg7180001817279, utg7180001817286, utg7180001817289, utg7180001817293, utg7180001958519, utg7180001958520, utg7180002005352, utg7180002005353, utg7180002005358, utg7180002030942, utg7180002030943, utg7180002031021, utg7180002031022, utg7180002163410, utg7180002163441
Trachyrincus murrayi | utg7180002283202, utg7180002334643
Trachyrincus scabrus | utg7180000000000
Trisopterus minutus | utg7180000000029
Typhlichthys subterraneus | utg7180000000573, utg7180000802674
Zeus faber | utg7180000003061, utg7180001638560
Maximum-likelihood phylogenetic inference was performed with the software RAxML[30] v.8.1.12, applying separate instances of the GTRCAT substitution model[31] to three partitions corresponding to all first, second, and third codon positions. To assess the impact of potentially saturated third codon positions on the phylogenetic inference, we conducted two additional analyses in which these positions were either ignored completely or coded as ‘R’ and ‘Y’ so that only transversions would be counted as state changes. Phylogenetic node support was estimated through bootstrapping with an automatically determined number of bootstrap replicates (RAxML option ‘autoMRE’). The topologies of the three resulting maximum-likelihood phylogenies, which differed in their treatment of third codon positions, were highly congruent; however, basal branches appeared to be best resolved in the analysis based on the alignment with three equally coded partitions. This maximum-likelihood phylogeny (Fig. 3) also received the highest mean bootstrap support (81.6, compared to 76.7 and 80.1 for the analyses in which third codon positions were ignored or coded as ‘R’ and ‘Y’, respectively). All taxa sampled for new genome assemblies occupied their expected phylogenetic positions; for the 14 species for which we included both a GenBank sequence and a mitochondrial genome extracted from new assembly data, the two sequences clustered monophyletically in each case and were connected by short branches (see, e.g., Polymixia japonica; Fig. 3). In other cases, mitochondrial genomes extracted from new assemblies clustered monophyletically with their congeneric counterparts downloaded from GenBank (see, e.g., the mitochondrial genomes of Osmerus eperlanus and Osmerus mordax; Fig. 3).
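
The two alternative treatments of third codon positions can be illustrated with a short Python sketch. It assumes an in-frame, concatenated coding alignment; the helper names are hypothetical and the code is not taken from the original analysis. Purines (A, G) are recoded as ‘R’ and pyrimidines (C, T) as ‘Y’, so that transitions at third positions no longer appear as state changes.

```python
# Illustrative sketch; assumes each sequence starts at codon position 1.
RY = {"A": "R", "G": "R", "C": "Y", "T": "Y"}


def ry_code_third_positions(seq):
    """Recode every third codon position as purine (R) or pyrimidine (Y)."""
    recoded = []
    for index, base in enumerate(seq.upper()):
        if index % 3 == 2:
            # Gaps and ambiguity codes are passed through unchanged.
            recoded.append(RY.get(base, base))
        else:
            recoded.append(base)
    return "".join(recoded)


def drop_third_positions(seq):
    """Remove every third codon position from an in-frame sequence."""
    return "".join(base for index, base in enumerate(seq) if index % 3 != 2)


example = "ATGAAACCT"
print(ry_code_third_positions(example))  # third positions G, A, T are recoded
print(drop_third_positions(example))     # only first and second positions remain
```

With RY-coding, an A/G or C/T difference at a third position collapses to a single state, whereas a purine/pyrimidine difference is retained.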
Figure 3

Maximum-likelihood phylogeny of teleost mitochondrial genome sequences.

Sequences extracted from the new assemblies are marked in bold; all other mitochondrial genome sequences were previously available from the GenBank or ENSEMBL (where noted) databases. Black circles on nodes are sized proportional to bootstrap support, and the circle sizes corresponding to support values of 50, 75, and 100 are shown. Clade labels indicate taxonomic orders of all species as well as (with smaller font size) the (sub)family of gadiform species, following Betancur-R. et al.[10] and Nelson[48]. Note that the orders Tetraodontiformes, Beloniformes, and Lophiiformes appear as non-monophyletic (marked with asterisks). For comparability, the color code is identical to Fig. 1 in Malmstrøm et al.[11]. The tree file in Newick format has been deposited on Figshare under doi:10.6084/m9.figshare.4224234 (Data Citation 4).

It should be noted that basal phylogenetic nodes generally received relatively weak bootstrap support values, indicating that mitochondrial sequence data may not be sufficient to reliably resolve these ancient divergence events. Furthermore, three orders (Tetraodontiformes, Beloniformes, and Lophiiformes) appeared non-monophyletic; however, in all of these cases only weakly supported nodes separated two subgroups of the order. Thus, our results do not contradict the monophyly of these orders, which has been strongly supported in previous studies[10,32,33]. Most importantly, despite the expected lower support values of basal nodes, our mitochondrial phylogeny corroborates the correct species identification and the absence of DNA contamination in the 66 new assemblies.

Phylogenetic analyses using nuclear markers

To reliably reconstruct the evolutionary history of the 66 sequenced teleost species, we further extracted a set of carefully selected phylogenetic markers from the nuclear genomes. Based on a strict filtering procedure (see Malmstrøm et al.[11]), we selected one-to-one orthologs for 567 exons of 111 genes from the 66 draft assemblies and from 10 genome assemblies available in the ENSEMBL database (Danio rerio, Astyanax mexicanus, Gasterosteus aculeatus, Oreochromis niloticus, Oryzias latipes, Takifugu rubripes, Tetraodon nigroviridis, Poecilia formosa and Xiphophorus maculatus) or GenBank (Salmo salar). The 111 selected genes were characterized by clock-like evolution, homogeneity in GC content among species, and no or only weak signals of selection and were therefore particularly well suited for the reconstruction of time-calibrated phylogenies. The 111 genes were distributed across all chromosomes of the zebrafish genome and included between 3 and 14 exons that were used in our analyses. Per gene, we concatenated sequences of these exons into a single alignment, which then included between 300 and 1,888 (mean: 643.4) bp, between 47 and 777 (mean: 240.5) variable sites and between 33 and 490 (mean: 157.5) parsimony-informative sites. As orthologous sequences for the 111 genes could be detected in almost all assemblies, the resulting 111 alignments contained only between 1.4 and 11.9% (mean: 7.3%) missing data (Table 8 (available online only)).
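
The per-gene alignment statistics quoted above (variable and parsimony-informative sites) follow standard definitions, which the sketch below spells out. It is an illustration rather than the original code: a site is counted as variable if at least two different nucleotides occur, and as parsimony-informative if at least two nucleotides each occur in at least two sequences. Ignoring gaps and ambiguity codes is an assumption made here, and the function name is hypothetical.

```python
from collections import Counter


def site_stats(alignment):
    """Return (variable sites, parsimony-informative sites) of an alignment.

    alignment: list of equal-length nucleotide strings. Gaps and ambiguity
    codes are ignored when counting states (an assumption of this sketch).
    """
    variable = 0
    informative = 0
    for column in zip(*alignment):
        counts = Counter(base for base in column if base in "ACGT")
        if len(counts) > 1:
            variable += 1
        # Informative: at least two states, each seen in two or more sequences.
        if sum(1 for count in counts.values() if count >= 2) >= 2:
            informative += 1
    return variable, informative


toy = ["ACGTA", "ACGTC", "ATGAC", "ATG-A"]
print(site_stats(toy))
```

In the toy alignment, columns 2, 4 and 5 are variable, but column 4 is not parsimony-informative because the alternative nucleotide occurs only once.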
Table 8

Nuclear markers used in phylogenetic analyses

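| ENSEMBL Gene ID | Gene name | Chr. in D. rerio | Num. exons | Alignment length | Num. variable sites | Num. PI sites* | Proportion of missing data | Mean node support | SH P-value† | K-score |
|---|---|---|---|---|---|---|---|---|---|---|
| ENSDARG00000003189 | psmd1 | chr 22 | 6 | 730 | 116 | 75 | 0.025 | 0.62 | 0.013 | 439.5 |
| ENSDARG00000003495 | madd | chr 7 | 5 | 594 | 140 | 86 | 0.091 | 0.502 | 0.007 | 419.9 |
| ENSDARG00000003984 | LTN1 | chr 10 | 4 | 472 | 183 | 120 | 0.115 | 0.518 | 0.006 | 352.7 |
| ENSDARG00000004302 | slc45a1 | chr 11 | 4 | 524 | 122 | 82 | 0.047 | 0.55 | 0.025 | 425.8 |
| ENSDARG00000005058 | ncapd2 | chr 2 | 7 | 820 | 357 | 221 | 0.079 | 0.683 | 0.008 | 360.2 |
| ENSDARG00000005236 | srcap | chr 12 | 8 | 910 | 261 | 177 | 0.059 | 0.791 | 0.002 | 361.5 |
| ENSDARG00000006169 | lrrk2 | chr 25 | 3 | 330 | 140 | 83 | 0.105 | 0.502 | 0.006 | 450.5 |
| ENSDARG00000007092 | xab2 | chr 3 | 4 | 466 | 135 | 84 | 0.042 | 0.52 | 0.015 | 410.9 |
| ENSDARG00000007744 | tsr1 | chr 15 | 4 | 578 | 320 | 220 | 0.118 | 0.703 | 0.003 | 323.5 |
| ENSDARG00000009953 | med14 | chr 9 | 4 | 444 | 126 | 72 | 0.096 | 0.628 | 0.001 | NA |
| ENSDARG00000009965 | mag | chr 15 | 4 | 792 | 309 | 207 | 0.038 | 0.744 | 0.002 | 286.9 |
| ENSDARG00000011764 | asun | chr 4 | 3 | 376 | 121 | 80 | 0.104 | 0.492 | 0.013 | 434.1 |
| ENSDARG00000012403 | ERCC6L2 | chr 8 | 4 | 456 | 239 | 176 | 0.081 | 0.684 | 0.002 | 296.2 |
| ENSDARG00000013150 | dhx16 | chr 15 | 5 | 606 | 184 | 123 | 0.111 | 0.611 | 0.002 | 387.9 |
| ENSDARG00000013240 | zgc172271 | chr 6 | 5 | 534 | 282 | 193 | 0.082 | 0.567 | 0.003 | 308.8 |
| ENSDARG00000016177 | eif4enif1 | chr 6 | 4 | 488 | 194 | 140 | 0.06 | 0.658 | 0.006 | 363.3 |
| ENSDARG00000016415 | dhtkd1 | chr 25 | 4 | 516 | 237 | 161 | 0.084 | 0.658 | 0.013 | 311.4 |
| ENSDARG00000016443 | eif3c | chr 12 | 6 | 724 | 133 | 92 | 0.095 | 0.523 | 0.012 | 395.8 |
| ENSDARG00000016775 | aqr | chr 17 | 5 | 624 | 156 | 101 | 0.083 | 0.646 | 0.001 | 411.4 |
| ENSDARG00000016936 | hmcn1 | chr 20 | 5 | 574 | 304 | 197 | 0.085 | 0.797 | 0.003 | 321.4 |
| ENSDARG00000017034 | sqrdl | chr 25 | 5 | 602 | 245 | 157 | 0.014 | 0.676 | 0.001 | 268 |
| ENSDARG00000017696 | diexf | chr 13 | 5 | 688 | 349 | 247 | 0.081 | 0.772 | 0.002 | NA |
| ENSDARG00000018296 | rev1 | chr 9 | 3 | 350 | 173 | 110 | 0.065 | 0.562 | 0.037 | NA |
| ENSDARG00000019000 | smc3 | chr 22 | 6 | 748 | 154 | 87 | 0.073 | 0.625 | 0.002 | 461.2 |
| ENSDARG00000019300 | ints7 | chr 20 | 3 | 394 | 130 | 82 | 0.058 | 0.512 | 0.006 | 408.6 |
| ENSDARG00000019834 | EDRF1 | chr 17 | 8 | 968 | 394 | 249 | 0.068 | 0.815 | 0.02 | 307.3 |
| ENSDARG00000022730 | aasdh | chr 20 | 4 | 462 | 251 | 186 | 0.076 | 0.607 | 0.016 | 289.8 |
| ENSDARG00000025011 | synj1 | chr 10 | 8 | 998 | 243 | 172 | 0.076 | 0.79 | 0.001 | 338.7 |
| ENSDARG00000025269 | pdcd6ip | chr 19 | 3 | 372 | 171 | 121 | 0.119 | 0.437 | 0.006 | NA |
| ENSDARG00000026180 | prpf8 | chr 15 | 14 | 1,874 | 173 | 128 | 0.018 | 0.791 | 0.007 | 388 |
| ENSDARG00000027353 | zmym2 | chr 9 | 3 | 366 | 131 | 99 | 0.096 | 0.498 | 0.003 | 396.9 |
| ENSDARG00000027689 | pold1 | chr 3 | 3 | 348 | 97 | 68 | 0.069 | 0.558 | 0.005 | 452.6 |
| ENSDARG00000029556 | **kansl3** | chr 8 | 4 | 444 | 164 | 110 | 0.034 | 0.67 | 0.069 | 330.4 |
| ENSDARG00000030945 | si:ch211-259g3.4 | chr 15 | 11 | 1,888 | 777 | 490 | 0.104 | 0.874 | 0.003 | 317.6 |
| ENSDARG00000031886 | ift140 | chr 24 | 4 | 474 | 265 | 176 | 0.097 | 0.659 | 0.007 | 363.3 |
| ENSDARG00000032459 | med24 | chr 12 | 3 | 332 | 126 | 77 | 0.072 | 0.5 | 0.002 | 394.2 |
| ENSDARG00000032704 | qrsl1 | chr 17 | 4 | 470 | 228 | 154 | 0.034 | 0.723 | 0.008 | 347.6 |
| ENSDARG00000034178 | cpsf1 | chr 19 | 4 | 458 | 119 | 72 | 0.09 | 0.479 | 0.003 | 460.6 |
| ENSDARG00000035330 | taf1 | chr 5 | 5 | 588 | 162 | 114 | 0.097 | 0.634 | 0.01 | 398.9 |
| ENSDARG00000035761 | mcm7 | chr 14 | 4 | 426 | 140 | 98 | 0.059 | 0.609 | 0.005 | 397.7 |
| ENSDARG00000035978 | ube3c | chr 7 | 4 | 526 | 154 | 83 | 0.095 | 0.574 | 0.001 | 418.3 |
| ENSDARG00000036338 | vps11 | chr 10 | 10 | 1,238 | 345 | 216 | 0.054 | 0.719 | 0.002 | 269.3 |
| ENSDARG00000036755 | prmt10 | chr 1 | 5 | 604 | 248 | 158 | 0.038 | 0.65 | 0.003 | 344.6 |
| ENSDARG00000037017 | ube4b | chr 23 | 4 | 532 | 98 | 60 | 0.073 | 0.534 | 0.003 | 476.2 |
| ENSDARG00000037898 | ppl | chr 3 | 4 | 438 | 255 | 181 | 0.08 | 0.579 | 0.003 | 310.8 |
| ENSDARG00000038882 | smc4 | chr 15 | 10 | 1,260 | 445 | 261 | 0.083 | 0.813 | 0.035 | 293.4 |
| ENSDARG00000039134 | MTR | chr 12 | 5 | 524 | 202 | 147 | 0.091 | 0.703 | 0.006 | 384.2 |
| ENSDARG00000041895 | cad | chr 20 | 6 | 790 | 200 | 124 | 0.088 | 0.582 | 0.005 | 355.1 |
| ENSDARG00000042530 | nup205 | chr 18 | 13 | 1,536 | 599 | 377 | 0.051 | 0.844 | 0.028 | 246.1 |
| ENSDARG00000042728 | plaa | chr 7 | 5 | 582 | 249 | 164 | 0.074 | 0.609 | 0.014 | NA |
| ENSDARG00000043019 | exoc1 | chr 20 | 4 | 528 | 137 | 83 | 0.017 | 0.606 | 0 | 351.6 |
| ENSDARG00000045626 | nek8 | chr 15 | 3 | 428 | 78 | 54 | 0.032 | 0.49 | 0.002 | 478.6 |
| ENSDARG00000045900 | agbl5 | chr 4 | 3 | 300 | 131 | 83 | 0.058 | 0.592 | 0.014 | 413.5 |
| ENSDARG00000051889 | dhodh | chr 7 | 5 | 566 | 233 | 171 | 0.072 | 0.669 | 0.012 | 287.7 |
| ENSDARG00000053087 | mthfr | chr 8 | 3 | 360 | 121 | 79 | 0.071 | 0.484 | 0.001 | NA |
| ENSDARG00000053200 | dis3l | chr 7 | 3 | 316 | 174 | 116 | 0.101 | 0.672 | 0.01 | 390.5 |
| ENSDARG00000053303 | map3k4 | chr 13 | 4 | 458 | 170 | 100 | 0.084 | 0.481 | 0.003 | 388.6 |
| ENSDARG00000054154 | bms1l | chr 12 | 6 | 700 | 285 | 199 | 0.057 | 0.663 | 0.001 | NA |
| ENSDARG00000056037 | **itih6** | chr 23 | 3 | 372 | 183 | 113 | 0.066 | 0.554 | 0.21 | 397.1 |
| ENSDARG00000056160 | hspd1 | chr 9 | 3 | 388 | 121 | 85 | 0.073 | 0.543 | 0 | 391.5 |
| ENSDARG00000056318 | kansl2 | chr 23 | 4 | 428 | 172 | 112 | 0.056 | 0.65 | 0.004 | 346.3 |
| ENSDARG00000056530 | CPAMD8 | chr 22 | 6 | 820 | 293 | 176 | 0.089 | 0.577 | 0.028 | 318.7 |
| ENSDARG00000056932 | tfip11 | chr 23 | 3 | 390 | 131 | 91 | 0.097 | 0.448 | 0.006 | NA |
| ENSDARG00000057508 | nbeal2 | chr 16 | 3 | 482 | 276 | 193 | 0.107 | 0.54 | 0.006 | 345.1 |
| ENSDARG00000057997 | tmf1 | chr 11 | 3 | 308 | 151 | 105 | 0.053 | 0.583 | 0.009 | NA |
| ENSDARG00000058533 | pole | chr 5 | 7 | 848 | 234 | 147 | 0.076 | 0.663 | 0.018 | NA |
| ENSDARG00000059553 | SYMPK | chr 5 | 7 | 900 | 308 | 196 | 0.075 | 0.626 | 0.01 | 355.6 |
| ENSDARG00000059631 | ints1 | chr 3 | 11 | 1,360 | 405 | 288 | 0.076 | 0.825 | 0.004 | 270.1 |
| ENSDARG00000059711 | nol6 | chr 5 | 6 | 624 | 330 | 226 | 0.076 | 0.665 | 0.009 | NA |
| ENSDARG00000059760 | wdtc1 | chr 16 | 3 | 386 | 132 | 85 | 0.025 | 0.662 | 0.004 | 354 |
| ENSDARG00000059846 | EPG5 | chr 5 | 7 | 930 | 515 | 354 | 0.086 | 0.728 | 0.009 | 310.1 |
| ENSDARG00000059925 | usp24 | chr 20 | 5 | 590 | 182 | 112 | 0.071 | 0.707 | 0 | 413.7 |
| ENSDARG00000060089 | btaf1 | chr 13 | 7 | 764 | 300 | 196 | 0.078 | 0.781 | 0.022 | 235.2 |
| ENSDARG00000061013 | ankfy1 | chr 5 | 6 | 726 | 203 | 127 | 0.068 | 0.653 | 0.013 | 340.2 |
| ENSDARG00000061394 | baz1b | chr 18 | 3 | 1,034 | 509 | 329 | 0.113 | 0.656 | 0.009 | 339.5 |
| ENSDARG00000061789 | **gnl1** | chr 16 | 3 | 322 | 162 | 109 | 0.048 | 0.577 | 0.091 | 356.2 |
| ENSDARG00000062198 | pcm1 | chr 1 | 6 | 790 | 328 | 185 | 0.076 | 0.754 | 0.005 | 378.6 |
| ENSDARG00000062632 | duox | chr 25 | 11 | 1,270 | 602 | 387 | 0.053 | 0.867 | 0.001 | 217.5 |
| ENSDARG00000062868 | eea1 | chr 18 | 3 | 362 | 142 | 94 | 0.071 | 0.576 | 0.008 | 413.8 |
| ENSDARG00000063558 | ARHGEF17 | chr 18 | 3 | 386 | 144 | 87 | 0.089 | 0.525 | 0.006 | 381.5 |
| ENSDARG00000063626 | ddx21 | chr 13 | 4 | 424 | 177 | 120 | 0.033 | 0.713 | 0.001 | 359.9 |
| ENSDARG00000067805 | ggcx | chr 10 | 5 | 516 | 173 | 119 | 0.059 | 0.619 | 0 | 343 |
| ENSDARG00000069274 | **ighmbp2** | chr 18 | 8 | 1,108 | 561 | 398 | 0.039 | 0.783 | 0.05 | 268.8 |
| ENSDARG00000070109 | ncapg | chr 1 | 3 | 326 | 135 | 94 | 0.092 | 0.523 | 0.003 | NA |
| ENSDARG00000071294 | TONSL | chr 17 | 3 | 332 | 184 | 137 | 0.068 | 0.607 | 0.021 | 321.2 |
| ENSDARG00000073862 | ptpn13 | chr 21 | 5 | 572 | 227 | 141 | 0.109 | 0.64 | 0.018 | 353.5 |
| ENSDARG00000074137 | C2CD3 | chr 15 | 5 | 620 | 340 | 233 | 0.094 | 0.638 | 0.012 | NA |
| ENSDARG00000074314 | TTC37 | chr 5 | 3 | 398 | 209 | 148 | 0.1 | 0.482 | 0.017 | NA |
| ENSDARG00000074410 | brip1 | chr 15 | 3 | 334 | 146 | 106 | 0.097 | 0.571 | 0.002 | NA |
| ENSDARG00000074424 | ibtk | chr 16 | 4 | 548 | 249 | 163 | 0.097 | 0.564 | 0.017 | 327.2 |
| ENSDARG00000074524 | CNTNAP1 | chr 3 | 9 | 1,186 | 439 | 309 | 0.039 | 0.862 | 0.001 | 292.3 |
| ENSDARG00000074571 | GPAA1 | chr 2 | 5 | 586 | 183 | 120 | 0.105 | 0.515 | 0.011 | 302.9 |
| ENSDARG00000074675 | pan2 | chr 23 | 5 | 758 | 230 | 137 | 0.064 | 0.761 | 0.002 | 382.5 |
| ENSDARG00000074749 | abca12 | chr 9 | 4 | 560 | 274 | 192 | 0.096 | 0.737 | 0.009 | 270.1 |
| ENSDARG00000074759 | ccar1 | chr 13 | 5 | 600 | 150 | 90 | 0.091 | 0.659 | 0.002 | NA |
| ENSDARG00000075108 | tmco3 | chr 1 | 4 | 460 | 194 | 113 | 0.071 | 0.64 | 0.002 | 444.1 |
| ENSDARG00000075672 | pms2 | chr 12 | 4 | 450 | 186 | 117 | 0.046 | 0.711 | 0.001 | NA |
| ENSDARG00000075798 | USP38 | chr 1 | 3 | 674 | 332 | 208 | 0.05 | 0.572 | 0.047 | NA |
| ENSDARG00000075826 | msh4 | chr 17 | 5 | 550 | 221 | 122 | 0.016 | 0.731 | 0.009 | 356.3 |
| ENSDARG00000076920 | ZNF335 | chr 8 | 5 | 538 | 191 | 121 | 0.083 | 0.556 | 0.002 | NA |
| ENSDARG00000076994 | gpr124 | chr 8 | 5 | 844 | 454 | 279 | 0.098 | 0.78 | 0.012 | 313.6 |
| ENSDARG00000077139 | col6a3 | chr 9 | 3 | 994 | 577 | 407 | 0.115 | 0.668 | 0.003 | 318.3 |
| ENSDARG00000077469 | polr1b | chr 13 | 6 | 708 | 331 | 223 | 0.09 | 0.786 | 0.001 | 336.1 |
| ENSDARG00000077536 | snrnp200 | chr 8 | 10 | 1,134 | 223 | 140 | 0.052 | 0.773 | 0.004 | 398.3 |
| ENSDARG00000077860 | ankhd1 | chr 21 | 3 | 398 | 47 | 33 | 0.039 | 0.43 | 0.007 | NA |
| ENSDARG00000077891 | npc1l1 | chr 2 | 4 | 486 | 206 | 148 | 0.097 | 0.729 | 0 | 384.1 |
| ENSDARG00000078135 | MRC2 | chr 3 | 3 | 318 | 177 | 110 | 0.086 | 0.681 | 0.006 | 419.7 |
| ENSDARG00000078890 | wdfy3 | chr 5 | 13 | 1,528 | 490 | 291 | 0.097 | 0.706 | 0.005 | 365.9 |
| ENSDARG00000079702 | wdr81 | chr 15 | 3 | 342 | 132 | 75 | 0.066 | 0.545 | 0.002 | 370.6 |
| ENSDARG00000079751 | megf8 | chr 16 | 12 | 1,498 | 571 | 336 | 0.097 | 0.703 | 0.028 | 291.5 |
| ENSDARG00000090858 | AREL1 | chr 17 | 5 | 654 | 190 | 112 | 0.028 | 0.627 | 0.037 | 375.9 |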

*The number of parsimony-informative sites.

†The P-values obtained from the Shimodaira-Hasegawa tests. Genes with non-significant P-value from the Shimodaira-Hasegawa tests are in bold.
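
The per-alignment statistics reported in Table 8 can in principle be recomputed from any of the marker alignments. The following Python sketch is illustrative only and is not the code used to produce the table; the function name `alignment_stats`, the toy sequences, and the choice to treat `-`, `?` and `N` as missing data are assumptions made for this example. A site is counted here as parsimony-informative when at least two character states each occur in at least two sequences.

```python
from collections import Counter

# Assumption for this sketch: gaps, '?' and 'N' all count as missing data.
MISSING = set("-?N")


def alignment_stats(seqs):
    """Return (variable sites, parsimony-informative sites, proportion missing)."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same aligned length")
    n_variable = n_pi = n_missing = 0
    # Walk through the alignment column by column.
    for column in zip(*(s.upper() for s in seqs)):
        n_missing += sum(1 for c in column if c in MISSING)
        counts = Counter(c for c in column if c not in MISSING)
        if len(counts) >= 2:
            n_variable += 1
        # Informative: at least two states that each occur at least twice.
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            n_pi += 1
    return n_variable, n_pi, n_missing / (length * len(seqs))


# Hypothetical toy alignment of four sequences and six sites.
toy = ["ACGTAC", "ACGTTC", "ATGTTC", "AT-TAG"]
n_var, n_pi, prop_missing = alignment_stats(toy)
print(n_var, n_pi, round(prop_missing, 3))
```

In the toy alignment, the last column is variable but not parsimony-informative because the state `G` occurs only once.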

These alignments were used for an extensive set of phylogenetic analyses to reconstruct the species tree as well as individual gene trees, using both maximum-likelihood and Bayesian inference. Detailed descriptions of these analyses and a discussion of the resulting species tree can be found in Malmstrøm et al.[11]. In addition, we here present analyses of gene tree discordance in relation to the 66 new assemblies, as a heterogeneous phylogenetic signal among gene trees could, among other causes (e.g. Fontaine et al.[34]; Gante et al.[35]), result from assembly issues such as contamination. To quantify gene tree discordance, we compared each gene tree to the species tree based on their K-scores[36] and using the Shimodaira-Hasegawa (SH) test[37] implemented in PAUP* v.4.0a150 (http://paup.csit.fsu.edu). All gene trees used in this comparison were maximum-clade-credibility (MCC) trees inferred with the software BEAST[38] v.2.2.0 for each of the 111 alignments. Similarly, we considered the MCC tree inferred with BEAST for a single concatenated alignment of all genes as the species tree (Fig. 1 in Malmstrøm et al.[11]) used in these comparisons. According to the results of the SH test, all but four gene trees were significantly different (P<0.05) from the species tree (Table 8 (available online only)). K-scores were calculated for 91 of the 111 gene trees, but could not be calculated for the remaining 20 gene trees due to negative branch lengths. The resulting K-scores ranged from 217.5 to 478.6, indicating considerable gene tree discordance in agreement with the results of the SH test (even though individual K-scores and SH test P-values did not correlate; P=0.64). However, such tree discordance does not necessarily indicate assembly issues but can arise from multiple factors including incomplete lineage sorting[39] or a lack of phylogenetic signal[40].
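
As a rough illustration of what a K-score measures, the sketch below assumes the definition commonly given for the K tree score[36]: the branch lengths of the gene tree are multiplied by the scaling factor K that minimizes the branch length distance to the reference tree, and the remaining distance is reported as the score. Trees are represented here simply as hypothetical dictionaries that map a bipartition label to a branch length, with branches absent from one tree counted as zero. This is a simplified stand-in written for this example and not the program used to obtain the values in Table 8.

```python
import math


def k_score(ref, cmp):
    """Return (K, score) for two trees given as {bipartition label: branch length}.

    Assumed definition: K is the least-squares factor that scales `cmp` onto
    `ref`; the score is the branch length distance remaining after scaling.
    """
    splits = set(ref) | set(cmp)
    num = sum(ref.get(s, 0.0) * cmp.get(s, 0.0) for s in splits)
    den = sum(cmp.get(s, 0.0) ** 2 for s in splits)
    if den == 0:
        raise ValueError("comparison tree has no non-zero branch lengths")
    k = num / den
    score = math.sqrt(
        sum((ref.get(s, 0.0) - k * cmp.get(s, 0.0)) ** 2 for s in splits)
    )
    return k, score


# Hypothetical four-taxon example: the two trees disagree on the internal split.
species_tree = {"AB|CD": 2.0, "A": 1.0}
gene_tree = {"AC|BD": 1.0, "A": 0.5}
k, score = k_score(species_tree, gene_tree)
print(round(k, 3), round(score, 3))
```

Under this definition, a gene tree that differs from the species tree only by a uniform rescaling of its branch lengths receives a score of zero, whereas conflicting bipartitions always contribute to the score.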
While high levels of incomplete lineage sorting have been shown to affect phylogenomic inference of rapidly radiating lineages like Neoavian birds[41] or cichlid fishes[42,43], its effect is expected to be limited in the analysis of ancient clades with long internode distances[44] such as the teleost species tree inferred from our set of 111 nuclear markers[11]. We investigated the presence of incomplete lineage sorting in this species tree in Malmstrøm et al.[11] by testing for a correlation of indel hemiplasy and branch length[45]. Since no such correlation could be detected in our data set, we concluded that incomplete lineage sorting was weak or absent in the teleost species tree reported in Malmstrøm et al.[11]. In addition, we here tested whether, instead of incomplete lineage sorting, a lack of phylogenetic signal in individual marker alignments could explain the observed gene tree discordance. To this end, we calculated the mean Bayesian posterior probability (BPP) of each gene tree as a measure of its phylogenetic signal and compared it to the K-score between this gene tree and the species tree. We found a highly significant negative correlation between the two measures (linear regression: R=0.34, P<10⁻¹⁵), which is illustrated in Fig. 4. Furthermore, we also detected a highly significant correlation between the number of parsimony-informative sites per marker and the respective K-score (linear regression: R²=0.49, P<10⁻¹³) (Fig. 4). These tests show that low phylogenetic signal in individual marker alignments, rather than contamination in the assemblies, is responsible for the observed gene tree discordance. This lack of signal in individual alignments, however, is not exclusive to our phylogenomic data set, but is a feature that is commonly observed in nuclear markers[40,44].
As demonstrated by Malmstrøm et al.[11] as well as other phylogenomic studies[45-47], the combination of such stringently filtered exonic markers nevertheless allows an extremely reliable inference of ancient species trees that could not be achieved with faster-evolving sequences such as mitochondrial genomes, intronic regions, or genes under selection. We therefore recommend the reuse of the marker set presented here as a highly suitable resource for future analyses of the teleost species tree with extended taxon sets.
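
A regression of this kind can be outlined in a few lines of Python. The sketch below uses only eight (mean node support, K-score) pairs copied from Table 8, under the assumption that the "Mean node support" column corresponds to the mean BPP, so it will not reproduce the statistics reported above. The function name `linear_fit` is hypothetical, and an ordinary least-squares fit is assumed.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept, Pearson r)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Illustrative subset of Table 8: gene -> (mean node support, K-score).
subset = {
    "psmd1": (0.62, 439.5),
    "ncapd2": (0.683, 360.2),
    "srcap": (0.791, 361.5),
    "tsr1": (0.703, 323.5),
    "mag": (0.744, 286.9),
    "ube4b": (0.534, 476.2),
    "nek8": (0.49, 478.6),
    "duox": (0.867, 217.5),
}
support = [v[0] for v in subset.values()]
kscores = [v[1] for v in subset.values()]
slope, intercept, r = linear_fit(support, kscores)
print(f"slope={slope:.1f}, r={r:.2f}, R^2={r * r:.2f}")
```

A negative slope in such a fit means that gene trees with higher mean node support lie closer to the species tree. For the full analysis, the 20 genes without a K-score would have to be excluded first.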
Figure 4

Distances between tree topologies compared to phylogenetic signal.

Topological distances are measured by the K-score between the gene trees and the species tree, and the phylogenetic signal of the gene trees is measured as the mean Bayesian posterior probability (BPP). Dots are colored according to the number of parsimony-informative (PI) sites. The black line represents the linear regression (R=0.34, P<10⁻¹⁵).

Usage Notes

Sequencing reads from all species can be downloaded from the European Nucleotide Archive (ENA) under the sample identifiers ERS1199874 to ERS1199939. Unitig- and scaffold-level assemblies are available for download from the Dryad repository, with individual assemblies found under the DOIs doi:10.5061/dryad.326r8/1 to doi:10.5061/dryad.326r8/132. See Table 7 (available online only) for individual identifiers for both the raw sequencing read sets and the two assembly versions. The mitochondrial phylogeny (Fig. 3) can be downloaded as a tree file in Newick format from Figshare under DOI: doi:10.6084/m9.figshare.4224234 (Data Citation 4).
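
For batch downloads, the sample identifier range quoted above can be expanded into an explicit list. The following Python sketch assumes that the range is contiguous, with one ENA sample accession per species; the helper name `accession_range` is hypothetical, and Table 7 remains the authoritative mapping between species and identifiers.

```python
def accession_range(first, last):
    """Expand e.g. 'ERS1199874'..'ERS1199939' into a list of accessions."""
    prefix = first.rstrip("0123456789")
    if not last.startswith(prefix):
        raise ValueError("both accessions must share the same prefix")
    width = len(first) - len(prefix)
    start, end = int(first[len(prefix):]), int(last[len(prefix):])
    # Zero-pad to the original width so the accession format is preserved.
    return [f"{prefix}{n:0{width}d}" for n in range(start, end + 1)]


samples = accession_range("ERS1199874", "ERS1199939")
print(len(samples), samples[0], samples[-1])  # 66 ERS1199874 ERS1199939
```

The resulting 66 identifiers match the number of sequenced species and can be passed to any ENA download tool of choice.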

Additional information

How to cite: Malmstrøm, M. et al. Whole genome sequencing data and de novo draft assemblies for 66 teleost species. Sci. Data 4:160132 doi: 10.1038/sdata.2016.132 (2017). Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.