Michael Hiller, Saatvik Agarwal, James H Notwell, Ravi Parikh, Harendra Guturu, Aaron M Wenger, Gill Bejerano.
Abstract
Many important model organisms for biomedical and evolutionary research have sequenced genomes, but occupy a phylogenetically isolated position, evolutionarily distant from other sequenced genomes. This phylogenetic isolation is exemplified for zebrafish, a vertebrate model for cis-regulation, development and human disease, whose evolutionary distance to all other currently sequenced fish exceeds the distance between human and chicken. Such large distances make it difficult to align genomes and use them for comparative analysis beyond gene-focused questions. In particular, detecting conserved non-genic elements (CNEs) as promising cis-regulatory elements with biological importance is challenging. Here, we develop a general comparative genomics framework to align isolated genomes and to comprehensively detect CNEs. Our approach integrates highly sensitive and quality-controlled local alignments and uses alignment transitivity and ancestral reconstruction to bridge large evolutionary distances. We apply our framework to zebrafish and demonstrate substantially improved CNE detection and quality compared with previous sets. Our zebrafish CNE set comprises 54 533 CNEs, of which 11 792 (22%) are conserved to human or mouse. Our zebrafish CNEs (http://zebrafish.stanford.edu) are highly enriched in known enhancers and extend existing experimental (ChIP-Seq) sets. The same framework can now be applied to the isolated genomes of frog, amphioxus, Caenorhabditis elegans and many others.
Year: 2013 PMID: 23814184 PMCID: PMC3753653 DOI: 10.1093/nar/gkt557
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Zebrafish is currently evolutionarily distant from all other available fish genomes. (A) Phylogeny with branch lengths and clade groupings (solid lines only). The ‘mousefish’, a desirable but currently unavailable teleost genome at human–mouse distance, is discussed in the text. Apart from zebrafish, frog (1.49 subs/site to chicken), lamprey (1.76 subs/site to zebrafish), amphioxus (>2.5 subs/site to lamprey) and C. elegans (1.07 subs/site to Caenorhabditis remanei) are also shown to have phylogenetically isolated genomes. Molecular distances were taken from the UCSC genome browser (28) for the hg18, braFlo1 and ce10 assemblies. (B) Evolutionary distances (neutral substitutions per site) between zebrafish (left) and human (right) to other sequenced species. In contrast to human, the zebrafish genome occupies a phylogenetic outgroup position with the closest sequenced teleosts at a distance of 1.25–1.41 subs/site, which exceeds the distance between the human and chicken genome (1.08 subs/site). (C) The portion of CNEs conserved to mouse that can be discovered in comparisons between human and evolutionarily more distant species can be used to estimate the fraction of zebrafish CNEs visible using the current availability of genomes.
Figure 2. Comparative genomics approach to detect CNEs in isolated genomes. (A) Several steps in this pipeline aim at detecting remote homologies. Still, we use strict filtering for alignment quality, synteny and conservative masking of potential genic sequences to achieve a high-quality CNE set. Total coverage in the zebrafish genome for each step is given as both fraction of the zebrafish genome and megabases (Mb). (B) The panels illustrate patching (highly sensitive alignment for a region bounded by up- and downstream aligning anchors in blue) and synteny filtering for chains and nets. Colored boxes are alignments, horizontal lines connect co-linear alignment blocks and different colors represent different chromosomes in the aligning species.
Figure 3. Transitivity can reveal orthology between distant genomes that is not directly visible. (A) Illustration of the transitivity principle. A zebrafish locus aligns to chicken but not directly to human. However, the chicken locus does align to human, allowing us to infer orthology and anchor an alignment between zebrafish and human. Conceptually, transitivity mimics a multiple alignment using the intermediate species as the reference species. (B) Sequence identity of zebrafish–human/mouse alignments, separating those alignments found only using transitivity (blue) and those directly aligning in the syntenic multiple alignments (gray), suggests that transitivity-inferred alignments also evolve under clear purifying selection. (C) An example where zebrafish has a weaker alignment to human that is not detected in the genome-wide pipeline. However, an anchored alignment using chicken as an intermediate species shows clear orthology between the diverged zebrafish and human sequence. (D) The CNE shown in (C) is in synteny with the PTPRE gene in all three species.
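The transitivity principle in panel (A) can be illustrated with a minimal sketch. It assumes a toy representation in which each alignment is a list of ungapped, same-strand blocks; the block format, the function names `lift` and `transitive_lift`, and all coordinates are hypothetical and are not taken from the pipeline described here. The sketch only composes coordinates through the intermediate species; the resulting human interval would then serve as an anchor for a sensitive local alignment.

```python
from typing import List, Optional, Tuple

# Toy ungapped alignment block: (query_start, query_end, target_start),
# half-open coordinates, same strand. Hypothetical format for illustration.
Block = Tuple[int, int, int]
Interval = Tuple[int, int]


def lift(interval: Interval, blocks: List[Block]) -> Optional[Interval]:
    """Map a query interval to target coordinates if one block fully contains it."""
    start, end = interval
    for q_start, q_end, t_start in blocks:
        if q_start <= start and end <= q_end:
            offset = t_start - q_start
            return (start + offset, end + offset)
    return None  # no block covers the whole interval


def transitive_lift(interval: Interval,
                    first_hop: List[Block],
                    second_hop: List[Block]) -> Optional[Interval]:
    """Map query -> intermediate -> target by composing two lifts."""
    mid = lift(interval, first_hop)
    if mid is None:
        return None
    return lift(mid, second_hop)


# Made-up example: the zebrafish locus aligns to chicken, chicken aligns to
# human, but no direct zebrafish-human alignment exists.
zebrafish_to_chicken = [(1000, 1200, 50_000)]
chicken_to_human = [(49_900, 50_400, 7_000_000)]
zebrafish_to_human: List[Block] = []

locus = (1050, 1150)
direct = lift(locus, zebrafish_to_human)
via_chicken = transitive_lift(locus, zebrafish_to_chicken, chicken_to_human)
print(direct, via_chicken)
```

Here the direct lookup fails while the two-hop lookup returns a human interval, which mirrors the situation in panel (C).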
The number and base pair (bp) coverage of zCNEs conserved between zebrafish and human/mouse obtained through our different processing steps
| Processing step | Human: number | Human: genome coverage (bp) | Mouse: number | Mouse: genome coverage (bp) |
|---|---|---|---|---|
| From multiple alignment | 9373 | 1 453 794 | 8757 | 1 386 647 |
| Detected by transitivity | 1115 | 143 380 | 1055 | 141 953 |
| Detected by ancestral reconstruction | 1262 | 146 303 | 1349 | 156 972 |
| Total | 11 573 | 1 769 804 | 10 989 | 1 707 381 |
Figure 4. Ancestral reconstruction reveals additional CNE alignments between distant species. (A) Large evolutionary distances between zebrafish and human/mouse can be substantially reduced if (B) reconstructed ancestral sequences are aligned. The phylogenetic tree contains the species used to reconstruct the percomorph and mammalian ancestor. Species used as outgroups are in blue in (B). (C) Sequence identity of zebrafish–human alignments is shown for CNEs that align to human in our multiple alignment and for 1262 CNEs where ancestral reconstruction but not direct alignment detects conservation to human (630 align to a tetrapod but not human in our multiple alignment; 632 have no alignment to any vertebrate). Although alignments detected only using reconstruction have lower sequence identities, even values ∼50% indicate clear conservation between species separated by ≥1.8 neutral substitutions per site. (D) An example where conservation within teleosts and within tetrapods can be used to reconstruct the percomorph and mammalian ancestor of the CNE (1 and 2). The reconstructed ancestral sequences align with high enough sequence identity to detect orthology and anchor an alignment between the human and zebrafish CNEs not visible otherwise (3). The CNE shares conserved synteny with the same putative target gene (4). Blue background is identity to the ancestor in (1 and 2) and sequence identity in (3).
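To make the idea of reconstructing an ancestral sequence concrete, the following sketch applies Fitch parsimony to each column of a small gap-free alignment. This is a deliberately simplified stand-in: the text above does not state which reconstruction method was used, and Fitch parsimony ignores branch lengths and indels. The tree topology, the sequences and the function names are hypothetical.

```python
from typing import Dict, Set, Tuple, Union

# A rooted binary tree: a leaf is a species name, an inner node is (left, right).
Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch_set(tree: Tree, column: Dict[str, str]) -> Set[str]:
    """Bottom-up Fitch pass: candidate bases at the root of `tree` for one column."""
    if isinstance(tree, str):
        return {column[tree]}
    left = fitch_set(tree[0], column)
    right = fitch_set(tree[1], column)
    common = left & right
    # Intersection if the children agree on something, otherwise the union.
    return common if common else left | right


def reconstruct_root(tree: Tree, alignment: Dict[str, str]) -> str:
    """Reconstruct the root sequence; ambiguous columns are reported as 'N'."""
    length = len(next(iter(alignment.values())))
    ancestor = []
    for i in range(length):
        column = {species: seq[i] for species, seq in alignment.items()}
        candidates = fitch_set(tree, column)
        ancestor.append(next(iter(candidates)) if len(candidates) == 1 else "N")
    return "".join(ancestor)


# Hypothetical topology and made-up sequences for four teleosts.
toy_tree: Tree = (("medaka", "stickleback"), ("fugu", "tetraodon"))
toy_alignment = {
    "medaka":      "ACGTA",
    "stickleback": "ACGAA",
    "fugu":        "ACTTC",
    "tetraodon":   "ACGTC",
}
ancestor = reconstruct_root(toy_tree, toy_alignment)
print(ancestor)
```

The point of the sketch is that lineage-specific substitutions (columns 3 and 4) are averaged out in the inferred ancestor, which is why two reconstructed ancestors can be more similar to each other than any pair of extant sequences.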
Comparison of our zCNE set to previous zebrafish CNE sets
| Pairwise comparison | Total (bp) | In both CNE sets (bp) | In both CNE sets (% of other set) | Unique in our zCNE set (bp) | Unique in our zCNE set (% of our set) | Unique in other CNE set (bp) | Unique in other CNE set (% of other set) |
|---|---|---|---|---|---|---|---|
| zCNEs all | 6 643 241 | ||||||
| zCNEs teleost + lamprey | 6 072 642 | ||||||
| zCNEs human | 1 769 804 | ||||||
| ECRBrowser | |||||||
| Fugu | 4 014 503 | 2 509 862 | 62.52% | 3 562 780 | 58.67% | 1 504 641 | 37.48% |
| Human | 1 262 653 | 874 451 | 69.26% | 895 353 | 50.59% | 388 202 | 30.74% |
| ANCORA | |||||||
| Fugu | 4 085 372 | 2 870 306 | 70.26% | 3 202 336 | 52.73% | 1 215 066 | 29.74% |
| Tetraodon | 3 582 692 | 2 634 273 | 73.53% | 3 438 369 | 56.62% | 948 419 | 26.47% |
| Stickleback | 5 020 698 | 3 220 581 | 64.15% | 2 852 061 | 46.97% | 1 800 117 | 35.85% |
| Medaka | 4 745 417 | 2 945 151 | 62.06% | 3 127 491 | 51.50% | 1 800 266 | 37.94% |
| Human | 1 403 083 | 996 067 | 70.99% | 773 737 | 43.72% | 407 016 | 29.01% |
| CNEViewer | |||||||
| All CNEs | 563 113 | 416 675 | 73.99% | 1 353 129 | 76.46% | 146 438 | 26.01% |
| Syntenic CNEs | 248 105 | 196 911 | 79.37% | 1 572 893 | 88.87% | 51 194 | 20.63% |
| Union of all previous sets | |||||||
| | Total (bp) | In zCNEs and other sets (bp) | % of other sets | Unique in our zCNE set (bp) | % of our set | Unique in other sets (bp) | % of other sets |
| Union of all previous resources | 10 765 678 | 5 204 466 | 48.34% | 1 438 775 | 21.66% | 5 561 212 | 51.66% |
| Breakdown of 5 561 212 bp unique to other sets | bp | % | |||||
| Do not align to zebrafish by our approach | 191 704 | 3.4% | |||||
| Are not syntenic according to our criteria | 3 714 572 | 66.8% | |||||
| Do not overlap well-aligning windows | 981 312 | 17.6% | |||||
| Overlap well-aligning windows but the region is <50 bp | 496 580 | 8.3% | |||||
| Overlap well-aligning window ≥50 bp but not supported by ≥2 species | 155 813 | 2.8% | |||||
| GREAT enrichment (top annotation term) | |||||||
| # GREAT version 2.0.1 | In zCNEs and other sets | | Unique in our zCNE set | | Unique in other sets | | |
| Ontology and term name | Binom FDR Q-Val | Binom fold enrichment | Binom FDR Q-Val | Binom fold enrichment | Binom FDR Q-Val | Binom fold enrichment | |
| GO molecular function | |||||||
| Sequence-specific DNA binding | 0 | 3.1 | 7 E-279 | 2.2 | n.d. | n.d. | |
| GO biological process | |||||||
| Regulation of transcription, DNA-dependent | 0 | 2.7 | 3 E-314 | 2.0 | n.d. | n.d. | |
| Wiki pathways | |||||||
| Nuclear receptors | 3 E-128 | 5.0 | 5 E-41 | 3.0 | n.d. | n.d. | |
| InterPro | |||||||
| Homeodomain-like | 0 | 3.7 | 5 E-199 | 2.5 | n.d. | n.d. | |
a. To exclude any differences because of our stringent filtering procedure, we applied the same filters for repeats and genic regions to the ECR Browser (pairwise zebrafish–fugu/human), Ancora (pairwise zebrafish–fugu/tetraodon/stickleback/medaka/human) and CNEViewer (pairwise zebrafish–human) sets (Supplementary Table S5). We used our set of CNEs built from the teleosts and lamprey multiple alignment (6 072 642 bp in 51 997 CNEs) to compare with pairwise zebrafish–teleost sets. We used our set of CNEs that are conserved to human (1 769 804 bp in 11 573 CNEs) to compare with pairwise zebrafish–human sets. We found that, despite our stringent synteny filter and requiring at least two other aligning species, our CNE set is substantially larger than any of these pairwise sets, as 44–89% of the bases in zCNEs are not contained in the other sets.
b. Compared with the union of all previous sets, zCNEs still add 1.4 Mb (22% of our set). Other sets contain 5.6 Mb that are not in our zCNE set for reasons listed in the table.
c. CNEs that are in both our zCNE and other sets, as well as CNEs that are unique to our set, show the expected enrichments for transcription factors using the zebrafish GREAT webserver http://great.stanford.edu. In contrast, these enrichments were not detected (n.d.) for CNEs unique to other sets. Only the top enrichment is shown. Size-matched sets were compared to ensure equal statistical power of GREAT.
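The base-pair columns of the pairwise comparison can be reproduced in principle with a simple set comparison. The sketch below is an illustration under stated assumptions: intervals are half-open `(chromosome, start, end)` tuples, the function names and the toy intervals are hypothetical, and the per-base set representation is chosen for clarity rather than for genome-scale efficiency.

```python
from typing import Dict, Iterable, Set, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open


def covered_bases(intervals: Iterable[Interval]) -> Set[Tuple[str, int]]:
    """Expand intervals into a set of (chromosome, position) pairs."""
    bases: Set[Tuple[str, int]] = set()
    for chrom, start, end in intervals:
        for pos in range(start, end):
            bases.add((chrom, pos))
    return bases


def compare_sets(ours: Iterable[Interval], other: Iterable[Interval]) -> Dict[str, float]:
    """Return the bp and percentage columns used in the comparison table."""
    a, b = covered_bases(ours), covered_bases(other)
    both = len(a & b)
    unique_ours = len(a - b)
    unique_other = len(b - a)
    return {
        "total_other_bp": len(b),
        "both_bp": both,
        "both_pct_of_other": round(100 * both / len(b), 2),
        "unique_ours_bp": unique_ours,
        "unique_ours_pct_of_ours": round(100 * unique_ours / len(a), 2),
        "unique_other_bp": unique_other,
        "unique_other_pct_of_other": round(100 * unique_other / len(b), 2),
    }


# Made-up intervals: 150 bp in our set, 100 bp in the other set, 50 bp shared.
our_set = [("chr1", 100, 200), ("chr1", 300, 350)]
other_set = [("chr1", 150, 250)]
stats = compare_sets(our_set, other_set)
print(stats)
```

The same bookkeeping holds in the table itself: for the ECR Browser fugu row, 2 509 862 bp (in both) plus 1 504 641 bp (unique in other set) equals the 4 014 503 bp total of the other set, and 2 509 862 bp plus 3 562 780 bp (unique in our set) equals the 6 072 642 bp of the teleost + lamprey zCNE set.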
The 10 genomic regions with the highest zCNE density are all associated with transcription factors
| Chromosome (danRer7) | Window start | Window end | % Bases in zCNEs | Putative target gene(s) |
|---|---|---|---|---|
| chr4 | 5805000 | 5905000 | 18.83% | |
| chr24 | 23860000 | 23960000 | 17.92% | |
| chr23 | 29215000 | 29315000 | 16.89% | |
| chr7 | 28480000 | 28580000 | 16.74% | |
| chr7 | 47940000 | 48040000 | 15.50% | |
| chr7 | 47810000 | 47910000 | 15.40% | |
| chr9 | 31905000 | 32005000 | 15.12% | |
| chr7 | 37245000 | 37345000 | 14.33% | |
| chr7 | 70460000 | 70560000 | 14.21% | |
| chr12 | 44145000 | 44245000 | 13.99% | |
Each window is 100 kb. The nearest gene(s) is listed. See Supplementary Table S7 for a longer table.
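A density ranking of this kind can be sketched as a sliding-window scan. The 100 kb window size comes from the note above; the 5 kb step is an assumption (the text does not state the step size), and the function name and toy coordinates are hypothetical. The sketch also assumes that the CNE intervals on a chromosome do not overlap each other.

```python
from typing import List, Tuple


def window_density(cnes: List[Tuple[int, int]],
                   chrom_length: int,
                   window: int = 100_000,
                   step: int = 5_000) -> List[Tuple[int, int, float]]:
    """Return (window_start, window_end, fraction of bases in CNEs) per window.

    `cnes` holds half-open, non-overlapping (start, end) intervals on one
    chromosome. The step size is an assumption made for this illustration.
    """
    results = []
    last_start = max(chrom_length - window, 0)
    for w_start in range(0, last_start + 1, step):
        w_end = w_start + window
        # Sum the overlap of every CNE with the current window.
        covered = sum(max(0, min(end, w_end) - max(start, w_start))
                      for start, end in cnes)
        results.append((w_start, w_end, covered / window))
    return results


# Made-up CNE coordinates on a 200 kb toy chromosome.
toy_cnes = [(10_000, 12_000), (50_000, 53_000), (120_000, 121_000)]
densities = window_density(toy_cnes, chrom_length=200_000)
best = max(densities, key=lambda row: row[2])
print(best)
```

Sorting `densities` by the third field and keeping the top entries would give a ranking analogous to the table; in practice overlapping windows around the same locus would likely need to be merged before reporting.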
Figure 5. Top: UCSC genome browser-like representation of the Sox5 locus shows clusters of zCNEs (blue), many of which align to human or mouse (red). Bottom: a screenshot of the details page that is available for each zCNE.