Several attributes intuitively considered to be typical mammalian features, such as complex behavior, live birth and malignant diseases such as cancer, also appeared several times independently in lower vertebrates. The genetic mechanisms underlying the evolution of these elaborate traits are poorly understood. The platyfish, X. maculatus, offers a unique model to better understand the molecular biology of such traits. We report here the sequencing of the platyfish genome. Integrating the genome assembly with extensive genetic maps identified an unexpected evolutionary stability of chromosomes in fish, in contrast to mammals. Genes associated with viviparity show signatures of positive selection, identifying new putative functional domains and rare cases of parallel evolution. We also find that genes implicated in cognition show an unexpectedly high rate of duplicate gene retention after the teleost genome duplication event, suggesting a hypothesis for the evolution of the behavioral complexity in fish, which exceeds that found in amphibians and reptiles.
We sequenced the genome of a single platyfish female (XX, 2n=46 chromosomes, Jp163A strain; Fig. 1) from generation 104 of continuous brother-sister matings. Total sequence coverage of 19.6-fold (Supplementary Note) produced an assembly with N50 contig and supercontig lengths of 22 kb and 1.1 Mb, respectively (Supplementary Table 1). Assembly errors, mostly single-nucleotide indels, were corrected with Illumina paired-end reads. A total of 669 Mb of the estimated 750–950 Mb genome length was assembled in contigs. Gene predictions revealed 20,366 coding genes, 348 non-coding genes and 28 pseudogenes (Supplementary Note).
Figure 1
(A) Female (upper) and male (lower) platyfish, Xiphophorus maculatus, of strain Jp163A with black pigment spots in the dorsal fin that develop when activity of an X-chromosomal oncogene is appropriately controlled. In hybrid genotypes this control is compromised and malignant melanomas develop from the spots. (B) Phylogenetic position of the platyfish relative to other species mentioned herein.
As in other teleosts, platyfish transposable elements (TEs) are highly diverse, including many families absent from mammals[1] and birds (Supplementary Note; Supplementary Fig. 1-3, Supplementary Tables 2-3). TE sequences account for 4.8% of the transcriptome and represent about 40 different families, indicating that many of the platyfish TEs are most likely still active. The most active TEs are Tc1 DNA transposons (> 16,000 copies) followed by the RTE family (> 9,000 copies). Interestingly, we identified several almost intact envelope-encoding copies of a foamy retrovirus (Spumaviridae) integrated into the platyfish genome (Fig. 2). Foamy viruses (FV) are known as exogenous infectious agents in mammals[2]. Only recently have endogenous FV sequences representing the “fossil record” of infections been described in the genomes of sloth[3] and aye-aye[4] in mammals, and in the coelacanth[5]. An FV-like sequence in zebrafish[6], a sequence in cod discovered during this work, and the platyfish sequences reported here uncover an even broader spectrum of hosts. The molecular phylogeny of FV is consistent with their host phylogeny (Fig. 2). This result supports an ancient marine evolutionary origin of this type of virus, with possible host-virus coevolution[5]. The presence of nearly intact FV copies in the genomes of some divergent fish species, and their absence from other sequenced fish genomes, might indicate independent germ-line introduction through infection. Exogenous FV have not been described in fish; however, our results suggest that exogenous FV have been, and might still be, infectious in the fish lineage.
Figure 2
Phylogenetic tree of endogenous retroviruses based on reverse transcriptase protein sequences. Note that the endogenous foamy virus sequences form two distinct tetrapod- and fish-specific phylogenetic groups. The alignment was made with ClustalW (223 amino acids) and the phylogenetic tree was constructed with the PhyML package using maximum likelihood methods[38] with default bootstrap (shown at the beginning of branches) and optimized calculation options. FV, Foamy Virus; MuERV-L, Mus musculus Endogenous Retrovirus-L; BAEV, Baboon Endogenous Virus; FENV1, Feline Endogenous Virus; EFV, Endogenous Foamy Virus; MLV, Murine Leukemia Virus; HERV-K, Human Endogenous Retrovirus-K; MMTV, Mouse Mammary Tumour Virus; VISNA, Visna virus; HIV-1, Human Immunodeficiency Virus-1. Scale bar represents the number of substitutions per site.
Mammalian chromosome homology maps display a patchwork arrangement of about 35 large conserved synteny blocks (but about 80 in dog and 200 in mouse) and numerous small blocks assembled in different combinations among the varied species and spanning over 90 million years[7]. We constructed the most extensive meiotic genetic map for any vertebrate yet published, which allowed the ordering of X. maculatus scaffolds and the creation of precise conserved synteny comparing fish genomes (Supplementary Note). We used the innovative RAD-tag approach[8] to construct a meiotic map consisting of 16,245 polymorphic markers that define 24 linkage groups equivalent to the haploid chromosome number of the platyfish[9]. Thus 90.17% of the total sequences in contigs could be assigned a chromosomal position. Long-range comparisons of gene orders across species[10] revealed novel evolutionary relationships of platyfish and other teleost chromosomes. Medaka, the closest relative with a sequenced genome, also has 24 chromosomes and 19 of these show a strict one-to-one relationship (Fig. 3A, B). The remaining five platyfish chromosomes are also orthologous to a single medaka chromosome except for one or two short segments (≈1 Mb) that lie on another medaka chromosome (Fig. 3C, Supplementary Fig. 4). Thus, remarkably few short translocations have disrupted karyotypes since the divergence of medaka and platyfish 120 million years ago (mya)[11,12]. A similar picture emerges from comparisons of platyfish chromosomes to stickleback (divergence 180 mya[11,12]). These findings detail a previously unknown breadth to which the genetic content of chromosomes in these teleosts has been conserved over nearly 200 million years of evolution, a conservation much greater than that of mammals over about half that time[7,11,12]. 
This is somewhat surprising given the teleost genome duplication (TGD), because illegitimate pairing of the paralogous chromosomes arising from the TGD might have been expected to facilitate translocations. Mechanisms that might have mitigated such translocations remain unknown.
Figure 3
Conserved syntenies between platyfish and medaka. (A). The medaka orthologs of genes on X. maculatus chromosome 9 (Xma9) tend to lie on O. latipes chromosome 4 (Ola4), showing that the genic content of these chromosomes has remained intact with no translocations in the 120 million years since the lineages of these species diverged. Each grey dot along the horizontal axis at the position labeled Xma9 represents the position of a platyfish gene whose medaka ortholog (as judged by reciprocal best blast hit analysis) lies directly vertical to the Xma9 gene plotted on the appropriate medaka chromosome [10]. (B). Reciprocally, nearly all of the platyfish orthologs of genes on medaka chromosome Ola4 lie on Xma9. (C). Nearly all of the medaka orthologs of Xma19 lie on Ola22, except for a segment about 1 Mb long at position Ola22:20Mb that appears on Ola24.
The platyfish is a well-known model in cancer research[13]. Its genome contains a tumor control region (TCR). It includes the oncogene xmrk[14], which triggers melanoma development. The TCR also contains the tumor modifier mdl[15,16]. Mdl allelic variants control the body compartment, time of onset and severity of tumors[17]. In addition, mdl manifests in platyfish as a high diversity of genetically defined pigment patterns. The mapped genome allows us to rule out many pigment genes as the responsible factors for these sex-associated pigment variants and melanoma modifiers. All known pigment genes[18] are present in the XX female platyfish genome; thus, none is Y-specific. Only 6 of the 174 known pigment genes (asip2a, egfrb, muted, myca, rps20, tfap2a) are located on the X chromosome (Xma21). Of these six, only the proto-oncogene egfrb resides close enough to the melanoma oncogene xmrk (Supplementary Table 4) to be considered a candidate gene for mdl. Indeed, biochemical studies have shown that egfrb can cooperate with xmrk[19], but expression levels of both genes are inversely regulated in melanoma[20]. Further studies are needed to evaluate egfrb function and to find other non-classical pigmentation gene candidates from this genomic region that may control both pigment pattern and melanoma phenotype.
Another so-far unidentified genetic component of the Xiphophorus melanoma model is the R/Diff gene. R/Diff suppresses melanoma formation in wild platyfish and, when eliminated by interspecific hybridization, allows tumor growth. R/Diff was mapped to a 10 cM interval on Xma5 near the cdkn2a/b locus[21]. Although cdkn2a is a well-described tumor suppressor gene in certain human melanomas[22], it was excluded from being R/Diff because it is not mutated and is even overexpressed in Xiphophorus melanoma[23]. The Xma5 sequence now defines a number of R/Diff candidate genes for further exploration. For example, scaffold #182 (1,085,500 bp), which harbors cdkn2a/b, contains several genes that have a high potential to play a role as the R/Diff tumor suppressor (tet2, cxxc4, mtap, topo-rs, mdx4, pdcd4a, etc.).
Alternatively, the region may represent a complex locus comprising several genes that act in a synergistic or compensatory manner to regulate the xmrk oncogene, consistent with previous reports on spontaneous and induced carcinogenesis among the many Xiphophorus interspecies hybrid tumor models[24-26].
Viviparity is an elaborate reproductive mode involving diverse levels of maternal investment in offspring, ranging from fully provisioning eggs prior to fertilization and retaining them through development to minimally provisioning eggs prior to fertilization and instead provisioning them after fertilization via a placenta, as in mammals. The fish family Poeciliidae, a monophyletic clade of more than 260 species[27], is unusual in including species that span the spectrum from negligible to extensive post-fertilization provisioning[28,29]. The platyfish genome is the first from a non-mammalian viviparous vertebrate. In platyfish and, for confirmation, in a second livebearing fish, the swordtail X. hellerii, both of which provision eggs well prior to fertilization[30,31], we analyzed three groups of viviparity genes (yolk, placenta and egg coat genes; n=34) for gene loss and positive selection in comparison with four species of egg-laying teleosts (medaka, tetraodon, stickleback, zebrafish).
In mammals, the rise of viviparity has been proposed to involve the progressive loss of vitellogenins (yolk precursors)[32]. In platyfish and swordtail all yolk-related genes (vitellogenins and their transporter/receptors, Supplementary Table 5) are present and evolved under purifying selection, consistent with both species fully provisioning eggs prior to fertilization. The one exception is vitellogenin1, which evolved under positive selection (Supplementary Fig. 5A).
Three of 13 platyfish genes whose mammalian orthologs are related to placenta development evolved under positive selection (Supplementary Table 5, Fig. 4A, Supplementary Fig. 5B-D).
Igf2, which in mouse regulates placenta permeability[33], evolved under strong positive selection in platyfish (Fig. 4A), particularly in the region distal to the proteolysis site. The igf2 sequence[33] was also available from another poeciliid, the desert topminnow Poeciliopsis lucida, which shares a livebearing ancestor with Xiphophorus but differs in having evolved placentation recently. In the desert topminnow the same region as in platyfish evolved under even stronger positive selection (Supplementary Fig. 5B), suggesting ongoing molecular adaptive evolution since the two genera diverged several mya. The two other placental genes, pparg and ncoa6, have multiple regions with signals of positive selection outside known functional domains, suggesting novel regions important for viviparity. The genes under selection in livebearing fish, however, do not show positive selection signatures when orthologous genes from the egg-laying platypus, from marsupials and from placental mammals are analyzed (Supplementary Table 6). This result is in line with the fact that placentas of mammals and fish are convergent but not homologous structures.
Figure 4
Posterior probabilities for site classes under alternative models along the gene for each amino acid site, calculated by Bayes Empirical Bayes analysis. Class 1 (yellow) is the probability of the site being under purifying selection (ka/ks ratio about 0), class 2 (grey) the probability of the site evolving neutrally (ka/ks ratio about 1), class 3 (red) the probability of the site being under positive selection in Xiphophorus species. (A) Insulin-like growth factor 2 (igf2). The colored bars show known functional domains, from left to right: signal peptide (1 to 52), insulin motif (54 to 110) and IGF2 C-terminal domain (147 to 202). The arrowhead shows the position of the proteolysis site (between 118 and 119). (B) choriogeninH minor; upper: comparison of egg-laying vs. livebearing fish, lower: comparison of placental vs. non-placental mammals, showing the same regions under positive selection in fishes and mammals.
Zona pellucida (Zpc) genes, which produce a glycoprotein-rich coat surrounding the oocyte plasma membrane, show the most dramatic changes. Alveolin was lost from the platyfish genome. Conversely, choriogeninH minor, choriolysinL, choriolysinH and zvep evolved under positive selection (Fig. 4B, Supplementary Fig. 5E-G, Supplementary Table 5). In Xenopus, zpc genes control species-specific sperm binding and help ensure that only conspecific sperm released into the aqueous environment fertilizes eggs[34]. Viviparous fish, however, have internal fertilization, where species-specific sperm recognition would not be as crucial. Compared to egg-laying fish, the eggshell is expected to have adapted to development inside the mother because it is no longer essential for protection but must facilitate gas and material exchange. The hatching enzyme genes zvep and choriolysinH exhibit fast-evolving sites generally located adjacent to the catalytic domains (Supplementary Fig. 4F-G), indicating that during the evolution of viviparity these enzymes might have altered interactions with target or regulatory proteins. Interestingly, in choriogeninH minor the same regions, in particular in the zona pellucida domain, evolved under positive selection in both mammals and fishes (Fig. 4B). This is a striking example of how convergent evolution at the molecular level manifests at the physiological and, ultimately, morphological levels.
Our analyses of the consequences of the TGD uncovered a functional class of genes that raised our interest because Xiphophorus fish in particular, and teleosts in general, show a remarkably high level of behavioral complexity[35] that other groups of “cold-blooded” vertebrates like amphibians and reptiles do not achieve.
Utilizing the platyfish genome and gene annotations from six other sequenced teleosts, we asked whether duplicate gene retention from the TGD could have produced, through neofunctionalization and/or sub/neofunctionalization[36], the acquisition of more complex behaviors. We compared 190 cognition-related genes (Supplementary Note, Supplementary Table 7) to pigmentation genes (133 genes, for which increased gene repertoires have been connected to the high complexity and diversity of teleost coloration) and liver function genes (187 genes)[18] as a control. Analysis of cognition-related genes revealed an outstandingly high duplicate retention rate of 45% in platyfish and similar values in other teleosts (Fig. 5, Supplementary Fig. 6) compared to pigmentation (30%) and liver (15%) genes. The average duplicate retention rate over all genes in teleost genomes is estimated at 12-24%[37]. Among the genes retained after the TGD, we found no bias in any of the three functional categories (cognition/pigmentation/liver) toward dosage sensitivity or membership in protein complexes (Supplementary Note, Supplementary Table 8, 9), but we did find a bias in the cognition genes (and not in liver or pigmentation genes) toward particularly large proteins (>1000 amino acids) (Supplementary Note, Supplementary Table 10, Supplementary Fig. 7). Plotting gene losses on the phylogenetic tree revealed that cognition gene retention was already fixed shortly after the TGD and before teleost diversification. This finding supports the hypothesis that paralog retention from the TGD may have supported the high level of behavioral complexity in Xiphophorus and other teleosts.
Figure 5
Differential retention of gene duplicates in cognition, pigmentation and liver function classes in teleosts after the teleost genome duplication (TGD). (A) Retention rates for TGD duplicates of cognition, pigmentation, and liver genes in seven teleost genomes. Time points during teleost evolution that involve the lineage leading to Xiphophorus are connected by lines. (B) Phylogenetic mapping of gene losses for 190 pairs of cognition gene duplicates after the TGD. Losses are indicated with negative values. The number of retained TGD paralog pairs for each individual teleost genome is given in brackets. TGD paralog losses were mapped onto the teleost phylogeny provided by Setiamarga et al. [39] following the parsimony principle. The TGD event was set to 350 million years ago. The retention rate of TGD paralogs is defined as the number of pairs of TGD duplicates present in a specific lineage divided by the number of pairs of TGD duplicates present at the time of the TGD [18].
The platyfish genome reveals new perspectives for several prominent features of this fish model including its livebearing reproductive mode, variation in pigmentation patterns, sex chromosome evolution in action, complex behavior and both spontaneous and induced carcinogenesis[17]. Teleosts dominate the extant fish fauna, and within teleosts (Fig. 1B), the family Poeciliidae, including platyfish, swordtails, guppies and mollies, is a paradigmatic example of this wide spectrum of adaptations. Our study of this first genome of a poeciliid fish illuminates some teleost evolutionary adaptations and provides a critical resource to advance the study of melanoma and other segregating phenotypes.
Methods
Methods and any associated references are available in the online version of the paper.
Online Methods
Source Material
DNA for genome sequencing was derived from a single female of Xiphophorus maculatus, strain Jp 163A (sample id: XMAC-090115_JP163A) from the Xiphophorus Genetic Stock Center, Texas State University, San Marcos, Texas, USA (XGSC; http://www.xiphophorus.txstate.edu/). The Jp163A line is maintained exclusively by brother-sister matings. The sequenced fish came from generation 104. A female fish was chosen because of its XX sex chromosome constitution. RNA that was sequenced to assemble the Jp163A reference transcriptome was isolated from two stages of pooled embryos (stages 15 and 25), a single individual 5 day old and a 1 month old fry, a single male and female at 2 months of age, one 9 month old female, one 15 month old male fish, and the testes and ovaries from single 10 month old fish.
A Jp163A BAC library (average insert size 160 kb; 10× genome coverage with a total of 43,192 clones available at http://bacpac.chori.org/library.php?id=353)[40] was produced from subline WLC#1247 maintained at the Biocenter Fish Facility (BFF), University of Würzburg, Germany. WLC#1247 was separated from the XGSC Jp163A line after about generation 50 and then maintained by inbreeding at the BFF.
For RAD-tag mapping one X. maculatus Jp 163A male (WLC#1325, BFF) was crossed with a female of X. hellerii (strain Rio Lancetilla, Db-, WLC#1337, BFF). Two F1 hybrid females from this cross were then backcrossed to X. hellerii males and DNAs from 267 backcross individuals were used for analysis.
Genome Sequencing
All genomic sequences for de novo assembly were generated on Roche 454 Titanium and Illumina GAIIx instruments with the exception of the BAC-end sequences, which were generated on an ABI3730.
Physical map
A physical map indicating tiling paths of Xiphophorus maculatus contigs was constructed by generating fingerprints from the WLC-1247 BAC library (http://bacpac.chori.org, [40]).
Genome assembly
Two independent assemblies were built with all sequence data, using the Newbler (Roche) and PCAP [41] algorithms from ~19.6× total sequence coverage in whole-genome shotgun reads, a combination of 12× fragments, 9× 3kb, 0.38× 20kb and 0.02× BAC-end read pairs. A merged assembly was achieved by assigning the Newbler assembly as the reference and aligning the PCAP assembly via BLAT, followed by assimilation of all aligned scaffolds using an established graph accordance method [42]. Assembly consensus base error correction was accomplished by aligning Illumina reads (75 base paired-end reads, insert size 200bp) from the same DNA source used for the reference to the reference assembly using the Genomics Workbench v.4.03 software (CLC Bio). A consensus sequence was then created that factored in the quality scores of both the reference assembly and the individual Illumina reads.
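The N50 contig and supercontig lengths reported for the assembly follow the usual definition: the length of the sequence at which the running total, taken from longest to shortest, first reaches half of the assembled bases. The sketch below illustrates that calculation only; the function name and the toy lengths are assumptions and do not come from the platyfish assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first sequence that pushes the running total to half of the
        # assembled bases defines the N50.
        if running >= half_total:
            return length


# Hypothetical toy lengths (not real data): total 150, half 75.
toy_contigs = [50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

For the toy input the running total passes 75 at the second-longest sequence, so the function reports that sequence's length.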
Transcriptome sequencing and annotation
Total RNA was isolated from platyfish tissues using the RiboPure Total RNA Isolation kit (Ambion). mRNA was isolated from total RNA using the Micro-PolyA Purist kit (Ambion). mRNA was reverse transcribed with SuperScript III Reverse Transcriptase (Invitrogen) using random hexamer primers (Invitrogen). Second-strand cDNA was synthesized using random primers and 15 units of Klenow DNA polymerase exo-minus (Epicentre). Double stranded cDNA was sheared in a Bioruptor (Diagenode) for 30 cycles (30 sec on, 60 sec off). Sheared DNA was end repaired with the End-It DNA repair kit (Epicentre) and dA overhangs were added with Klenow DNA polymerase exo-minus. Adapters were ligated to cDNA overnight and 100 ng was PCR amplified for 12 cycles with Phusion DNA polymerase (New England Biolabs). Each mRNA sample was sequenced on an Illumina GAII sequencer (60 bp). The X. maculatus transcriptome was assembled combining sequences from several tissues including heart, liver, brain, ovaries, and testes, as well as from embryonic stages 15 and 25. For the X. hellerii transcriptome, RNA from 1 month old whole fish, and from brain, liver, ovaries and testes of mature fishes was sequenced and assembled. The transcriptome sequences were aligned to the genome assembly contigs using Bowtie[43], then assembled using the Velvet[44]/Oases package (http://www.ebi.ac.uk/~zerbino/oases/), reporting putative transcripts and splice variants using a coverage cutoff of 4, an insert length estimate of 120, and other parameters at default values.
Gene models and annotation
Gene annotation using Ensembl genebuild was done on assembly Xipmac4.4.2 (GenBank Assembly ID GCA_000241075.1; http://www.ncbi.nlm.nih.gov/genome/assembly/?term=GCA_000241075.1). The annotated platyfish genome can be found at http://www.ensembl.org/Xiphophorus_maculatus/Info/Index.
Another gene identification analysis was performed by a combination of gene prediction and transcriptome integration. We used ab initio modeling with Augustus [45] that had been trained on the medaka gene set and on the alignment of full-length gene models of medaka and zebrafish (both from Ensembl) using BLATX [46]. Transcriptome sequences were aligned to the assembly scaffolds using Bowtie [43], this alignment was then adjusted for the most likely exon-intron boundaries using TopHat [47], and gene models were created using Cufflinks [48]. Only those transcripts containing a complete ORF and a transcript read coverage of at least 3× were retained, and these were then reconciled into a single set of 33,756 unique potential protein-encoding genes. These gene models were further culled to a subset of 17,783 that are amenable to phylogenetic analysis for entry into a whole genome evolutionary interpretation using the PHRINGE (Phylogenetic Resources for the Interpretation of Genomes) system (http://genomeprojectsolutions.com/PHRINGE_pipeline.html) by eliminating any transcripts shorter than 300 nucleotides and retaining only the longest version of any splice variant at each locus (Supplementary Note).
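The final culling step (dropping transcripts shorter than 300 nucleotides and keeping the longest splice variant per locus) can be sketched as follows. This is only an illustration of the stated rule, not the PHRINGE pipeline itself; the `Transcript` record, its field names and the example identifiers are hypothetical.

```python
from typing import NamedTuple


class Transcript(NamedTuple):
    # Hypothetical record layout, chosen for illustration only.
    locus: str
    transcript_id: str
    length_nt: int


def longest_per_locus(transcripts, min_length=300):
    """Drop transcripts shorter than min_length, then keep the longest per locus."""
    best = {}
    for t in transcripts:
        if t.length_nt < min_length:
            continue
        current = best.get(t.locus)
        if current is None or t.length_nt > current.length_nt:
            best[t.locus] = t
    return best


# Made-up example: three variants at one locus, one short transcript at another.
example = [
    Transcript("locus1", "t1a", 250),
    Transcript("locus1", "t1b", 900),
    Transcript("locus1", "t1c", 1200),
    Transcript("locus2", "t2a", 299),
]
for locus, t in longest_per_locus(example).items():
    print(locus, t.transcript_id, t.length_nt)
```

Ties in length are resolved here by keeping the first transcript seen, which is an arbitrary choice made for the sketch.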
Estimation of gene number by transcriptome similarity
We identified known genes by reciprocal BLASTX [49] searches of the de novo transcriptome assembly against medaka, stickleback, fugu, tetraodon, zebrafish, and human Ensembl gene libraries. In order to control for the inclusion of alternate transcript forms, we grouped these by the ‘locus’ number as reported by Oases [50].
Estimation of novel genes
In order to identify novel genes, we first reduced the redundancy of the platyfish transcriptome by clustering similar (>95% identity) sequences. Sequences from clusters with no identifiable members were filtered to remove sequences that mapped (by GMAP [51]) with less than 99% identity to the genome or had predicted coding sequences shorter than 300bp. Finally, identities for the remaining sequences were sought in the non-redundant database (NCBI). A separate clustering by genomic distance (1kb) produced a very similar gene number estimate (Supplementary Note).
Annotation of non-coding RNAs
To detect snoRNA, snRNA, miRNA and rRNA, a homology-based prediction was done using the multispecies RNA database (http://www.ensembl.org/info/data/ftp/index.html) combined with zebrafish, stickleback, medaka and Takifugu ncRNA libraries. tRNAs were annotated using tRNAscan-SE .21 software locally on UNIX [52]. rRNA, miRNA, snRNA and snoRNA were predicted by BLASTN using the other fish ncRNA databases as queries, and duplicates were removed from the output files (Supplementary Tables 11-12). Fish databases were downloaded from Ensembl for the following genome versions: zv9 (Danio rerio), BROADS1 (Gasterosteus aculeatus), HdrR (Oryzias latipes) and FUGU4.0 (Takifugu rubripes). miRNA sequences were identified with the Vienna RNA package of MiRscan (http://genes.mit.edu/mirscan/).
Annotation of transposable elements
Both manual and automatic classification of transposable elements (TE), based on Wicker’s nomenclature [53], were performed and combined into a single library. Two TEs were considered different if their sequences diverged by more than 20% at the nucleotide level. Manual classification was done by searching TE sequence homology using CENSOR [54] software, by homology searching specific TE proteins using TBLASTN and BLASTP, by identifying terminal repeat features (TIRs, LTRs and TSDs) using BLASTN2 and LTR_FINDER software [55], and by reconstructing phylogeny using ClustalW alignment and maximum likelihood calculation (default aLRT) using the PhyML package [38]. Phylogenetic reconstructions for the DNA, LINE and LTR classes (Supplementary Fig. 1-3) were based either on transposase or reverse transcriptase proteins. An automatic repeat library was built with RepeatScout software using default parameters on the supercontig assembly corrected for the homopolymer errors. The percentage of transposable elements in the genome was determined from unassembled reads by running RepeatMasker software (A.F.A. Smit, R. Hubley & P. Green, RepeatMasker at http://repeatmasker.org) locally on the UNIX system.
Construction of a meiotic map using RAD-tags
Genomic DNA from map cross parents and progeny was digested with the restriction enzyme SbfI (New England Biolabs), and adapters with five-nucleotide barcodes, each differing by at least two nucleotides, were ligated onto fragments. RAD-tag libraries were made as described [8]. A 50 ng aliquot of size-selected DNA was PCR amplified for 12 cycles and fragments 200 to 500 bp long were gel purified and sequenced using 80 nt single-end reads on an Illumina HiSeq2000 sequencer. Equal quantities of bar-coded DNA from 16 progeny were loaded onto each lane. Low quality reads and ambiguous barcodes were discarded. We used Stacks software [56] to sort retained reads into loci and to genotype individuals by implementing the likelihood-based SNP calling algorithm [57] to distinguish SNPs from sequencing errors. Stacks exported data into JoinMap 4.0 (Wageningen, The Netherlands) for linkage analysis using markers that were present in at least 200 of 267 individuals.
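Two of the rules stated here lend themselves to a small illustration: barcodes must differ pairwise by at least two nucleotides, and a marker enters linkage analysis only if it was genotyped in enough individuals. The sketch below is not part of Stacks or JoinMap; the function names, barcode strings and data layout are assumptions made for illustration.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def barcodes_ok(barcodes, min_distance=2):
    """True if every pair of barcodes differs by at least min_distance bases."""
    return all(hamming(a, b) >= min_distance
               for a, b in combinations(barcodes, 2))


def markers_with_enough_calls(calls, min_individuals=200):
    """Keep markers genotyped (call is not None) in at least min_individuals.

    calls maps a marker name to a list of per-individual genotype calls;
    this layout is a hypothetical simplification.
    """
    return [marker for marker, genotypes in calls.items()
            if sum(g is not None for g in genotypes) >= min_individuals]


# Made-up five-nucleotide barcodes; the last two differ at only one position.
print(barcodes_ok(["AACCA", "CCGGT", "GGTTC"]))
print(barcodes_ok(["AACCA", "AACCT"]))
```

A minimum distance of two means that a single sequencing error in a barcode cannot turn it into another valid barcode, which is why such reads can be discarded as ambiguous rather than misassigned.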
Assigning scaffolds to map positions
To finalize assembly scaffold order and orientation we utilized the high density meiotic map for assigning genome contigs to the genetic map. Using 14,391 marker sequences, we could reliably align 1,950 scaffolds to all linkage groups. Of these, 231 scaffolds mapped to multiple linkage groups, suggesting a misassembly event and were manually split (Supplementary Note).
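Flagging scaffolds whose markers fall on more than one linkage group, the signal used here to identify candidate misassemblies, can be sketched as below. The input format (pairs of scaffold name and linkage group) and all names are hypothetical; the actual splitting was done manually, as stated above.

```python
from collections import defaultdict


def scaffolds_on_multiple_groups(marker_hits):
    """Return scaffolds whose aligned markers map to more than one linkage group.

    marker_hits is an iterable of (scaffold, linkage_group) pairs, one per
    aligned marker (a simplified, assumed representation).
    """
    groups = defaultdict(set)
    for scaffold, linkage_group in marker_hits:
        groups[scaffold].add(linkage_group)
    return {scaffold: sorted(found)
            for scaffold, found in groups.items()
            if len(found) > 1}


# Made-up marker placements: only scaffoldB spans two linkage groups.
hits = [
    ("scaffoldA", 1), ("scaffoldA", 1),
    ("scaffoldB", 3), ("scaffoldB", 7),
    ("scaffoldC", 5),
]
print(scaffolds_on_multiple_groups(hits))
```

In practice one would probably also require more than a single discordant marker before splitting a scaffold; no such threshold is given in the text, so none is applied here.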
Genome synteny
For the analysis of conserved syntenies, the Synteny Database [10] was employed using parameters as described. To construct the dot plots, for each gene along a specific platyfish chromosome, the Synteny Database identifies orthologs and paralogs by reciprocal best BLAST analysis and plots positive results on the chromosomes of the same or other species directly above the index gene on the index chromosome.
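The reciprocal best BLAST criterion can be illustrated with a minimal sketch: a pair of genes is retained only if each is the other's top-scoring hit. This is not the Synteny Database implementation; the tuple layout (query, subject, score) and the gene names are assumptions for illustration.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits is an iterable of (query, subject, score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a and b are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Made-up scores: x1 and o1 are each other's best hit; x2's best hit (o2)
# prefers x1, so x2 has no reciprocal partner.
platy_vs_medaka = [("x1", "o1", 200), ("x1", "o2", 150), ("x2", "o2", 90)]
medaka_vs_platy = [("o1", "x1", 190), ("o2", "x1", 140), ("o2", "x2", 80)]
print(reciprocal_best_hits(platy_vs_medaka, medaka_vs_platy))
```

The strictness of the reciprocal requirement is what makes it a conservative proxy for orthology: after a genome duplication, paralogs can each be a one-way best hit without forming a reciprocal pair.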
Analyses of viviparity genes
Thirty-four protein-coding genes known to function in yolk production, placenta-related characteristics, and zona pellucida structures were selected as candidate genes (Supplementary Notes) for the evolution of viviparity among Xiphophorus fishes. Eighteen randomly selected genes were used for control. Orthologous sequences for these genes from four fish species (O. latipes, G. aculeatus, T. nigroviridis, D. rerio) were retrieved from the Ensembl database and then aligned using the MAFFT translation alignment. PAML (version 4.4, linux 64bit) was implemented to test if genes are under positive selection using branch-site specific model [58]. Genes with p-value less than 0.05 from likelihood ratio tests were designated as positively selected in Xiphophorus and the Bayes empirical Bayes method [59] was further used to calculate the selection pressure at each site.
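The likelihood ratio test behind the p < 0.05 threshold compares the log-likelihoods of a null and an alternative model. The sketch below assumes the test statistic is compared against a chi-square distribution with one degree of freedom, a common convention for branch-site tests that is not stated in the text; the log-likelihood values are invented and PAML itself is not called.

```python
import math


def lrt_p_value(lnl_null, lnl_alt):
    """P-value of a likelihood ratio test against chi-square with 1 d.f.

    Assumption: one degree of freedom. For 1 d.f. the chi-square survival
    function has the closed form erfc(sqrt(x / 2)).
    """
    statistic = max(2.0 * (lnl_alt - lnl_null), 0.0)
    return math.erfc(math.sqrt(statistic / 2.0))


def positively_selected(lnl_null, lnl_alt, alpha=0.05):
    """Apply the significance threshold used for designating genes."""
    return lrt_p_value(lnl_null, lnl_alt) < alpha


# Invented log-likelihoods for illustration only.
print(lrt_p_value(-2504.8, -2501.1))
print(positively_selected(-2504.8, -2501.1))
```

With 34 candidate genes tested, one might also want a multiple-testing correction; the text does not mention one, so none is applied in the sketch.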
Analysis of post-TGD gene retention
Cognition, pigmentation, and liver gene orthologs of human, mouse, and teleosts were obtained from Ensembl65 and missing gene annotations identified with TBLASTN (Supplementary Notes, Supplementary Table 7). EnsemblCompara GeneTrees were checked for teleost duplications and TGD-based duplications were confirmed using the Synteny Database [10]. Xiphophorus orthologs were identified from transcriptome v4 and the genome using BLAST searches and assignment confirmed with the Synteny Database. Potential bias in TGD duplicated retention for dosage sensitivity, protein complex membership, and gene length was tested (Supplementary Notes).
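The retention rate used in this analysis is the number of TGD duplicate pairs still present in a lineage divided by the number of pairs present at the time of the TGD (see the legend of Figure 5). A minimal sketch of that ratio follows; the presence/absence table, gene family names and function names are hypothetical and do not reproduce the study's data.

```python
def retention_rate(retained_pairs, pairs_at_tgd):
    """Fraction of TGD duplicate pairs still present as pairs in a lineage."""
    if pairs_at_tgd <= 0:
        raise ValueError("pairs_at_tgd must be positive")
    if not 0 <= retained_pairs <= pairs_at_tgd:
        raise ValueError("retained_pairs must lie between 0 and pairs_at_tgd")
    return retained_pairs / pairs_at_tgd


def retention_from_presence(presence):
    """Compute the retention rate from a per-family presence table.

    presence maps a gene family to a (copy_a_present, copy_b_present) tuple;
    a pair counts as retained only if both TGD paralogs are present.
    """
    retained = sum(1 for a, b in presence.values() if a and b)
    return retention_rate(retained, len(presence))


# Made-up table of four gene families, one of which kept both paralogs.
toy = {
    "famA": (True, True),
    "famB": (True, False),
    "famC": (False, True),
    "famD": (True, False),
}
print(retention_from_presence(toy))
```

Families that lost both copies would need separate handling in a real analysis; the sketch simply counts them as not retained.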