Genome-Level Analysis of Selective Constraint without Apparent Sequence Conservation.

Olga A Vakhrusheva, Georgii A Bazykin, Alexey S Kondrashov.

Abstract

Conservation of function can be accompanied by obvious similarity of homologous sequences which may persist for billions of years (Iyer LM, Leipe DD, Koonin EV, Aravind L. 2004. Evolutionary history and higher order classification of AAA+ ATPases. J Struct Biol. 146:11-31.). However, presumably homologous segments of noncoding DNA can also retain their ancestral function even after their sequences diverge beyond recognition (Fisher S, Grice EA, Vinton RM, Bessling SL, McCallion AS. 2006. Conservation of RET regulatory function from human to zebrafish without sequence similarity. Science 312:276-279.). To investigate this phenomenon at the genomic scale, we studied homologous introns in a quartet of insect species, and in a quartet of vertebrate species. Each quartet consisted of two pairs of moderately distant genomes, with a much larger evolutionary distance between the pairs. In both quartets, we found that introns that carry a regulatory segment or a conserved segment in the first pair tend to carry a conserved segment in the second pair, even though no similarity of these segments could be detected between the two pairs. Furthermore, introns from one pair that are preserved in the other pair tend to carry a conserved segment within the first pair, and be longer in the first pair, compared with the introns that were lost between pairs, even though no similarity between pairs could be detected in such preserved introns. These results indicate that selective constraint, presumably caused by conservation of the ancestral function, often persists even after the homologous DNA segments become unalignable.

Year: 2013 PMID: 23418180 PMCID: PMC3622294 DOI: 10.1093/gbe/evt023

Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416

Introduction

Conservation of function can be accompanied by obvious similarity of homologous sequences, which may persist for billions of years. Many bacterial proteins possess more than 50% similarity to their eukaryotic orthologs. Moreover, analysis of the reconstructed genome of the LUCA revealed a history of pre-LUCA gene duplications; such duplications produced paralogous proteins whose similarity is still statistically significant (Iyer et al. 2004). Even within noncoding segments of genomes, there are ultraconservative segments that retain strong similarity within, for example, all vertebrates (Dermitzakis et al. 2003; Lowe et al. 2011). However, conservation of the primary sequence, resulting in a meaningful sequence alignment, is not a sine qua non for conservation of other properties of the molecule. Proteins with unalignable amino acid sequences can have very similar 3D structures (Murzin and Bateman 1997). Single-stranded RNAs with dissimilar sequences can fold into identical secondary structures (Schuster et al. 1994) (e.g., AAAAAGGGTTTTT and GGGGGTTTCCCCC). At the level of functional noncoding DNA sequences, there are a number of described cases when sequences from different organisms perform similar functions, and are likely to be homologous, despite the lack of any meaningful alignment (Taher et al. 2011). For example, an enhancer of a murine gene can drive, in a transgenic assay, normal expression of its zebrafish ortholog (Fisher et al. 2006), although the murine and the zebrafish enhancers are not alignable. Another study in zebrafish has shown that the up-to-date alignment techniques are unable to detect many of the functional genomic regions (McGaughey et al. 2008). The noncoding elements that regulate homologous genes in nematodes and vertebrates are themselves alignable within both these lineages, but not between them, implying regulatory “rewiring” (Vavouri et al. 2007). However, to our knowledge, this phenomenon has not been investigated genome-wide. 
Here, we show that selective constraint, presumably caused by conservation of the ancestral function, often persists even after the homologous genome compartments diverge beyond alignability.

Materials and Methods

Data

We studied two quartets of species together with the two corresponding outgroup species (fig. 1). In each quartet, the number of synonymous substitutions per site Ks between the species of the two pairs is larger than 1, and therefore cannot be measured with precision. The approximate values given in figure 1 were obtained as follows. In dipterans, to roughly estimate the values of Ks between the two pairs, we scaled the protein identity-based trees by the known Ks values obtained for the more closely related species within pair 1. The first, dipteran, quartet consisted of two Drosophila species, D. melanogaster and D. mojavensis (pair 1: Ks1 ∼2.37; Heger and Ponting 2007), and two mosquito species, Culex quinquefasciatus and Aedes aegypti (pair 2: Ks2 ∼2.6); the estimated Drosophila–mosquito Ks3 is approximately 6.5. Ks between C. quinquefasciatus and Aed. aegypti, and between Drosophila and mosquitoes, were estimated by calibrating the Drosophila–mosquito tree of concatenated sequences of motor proteins (Odronitz et al. 2009) with the known Ks for D. melanogaster–D. mojavensis. The second, vertebrate, quartet consisted of two mammalian species, Homo sapiens and Mus musculus (pair 1: Ks1 ∼0.43; Jaillon et al. 2004), and two fish species, Tetraodon nigroviridis and Takifugu rubripes (pair 2: Ks2 ∼0.35; Jaillon et al. 2004). Ks3 is approximately 1.5 (Jaillon et al. 2004).
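The calibration described above is, in essence, a proportional rescaling of tree distances by one pair whose Ks is known. A minimal sketch follows, assuming that Ks scales linearly with protein-tree distance; the function name `calibrate_ks` and the tree distances are hypothetical and are not taken from the paper.

```python
def calibrate_ks(known_ks, known_tree_dist, target_tree_dist):
    """Rescale a protein-tree distance to Ks using one pair with known Ks.

    Assumes Ks grows in proportion to protein-tree distance, which is only
    a rough approximation at large divergence.
    """
    if known_tree_dist <= 0:
        raise ValueError("calibration distance must be positive")
    return known_ks * target_tree_dist / known_tree_dist


# Hypothetical tree distances (arbitrary units), for illustration only.
ks_pair1 = 2.37  # D. melanogaster - D. mojavensis, as quoted in the text
example_ks = calibrate_ks(ks_pair1, known_tree_dist=0.5, target_tree_dist=1.0)
```

With these made-up inputs the target distance is twice the calibration distance, so the estimate is simply twice the known Ks.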

Fig. 1. Two quartets of species used in the analysis, of 1) dipterans and 2) vertebrates, together with the corresponding outgroup species. Evolutionary distances within each pair of species, characterized by the estimated per site number of synonymous substitutions Ks, are presented.

Lists of orthologous proteins for each pairwise combination of species within a quartet were downloaded from the INPARANOID database (Ostlund et al. 2010) (http://inparanoid.sbc.su.se/cgi-bin/index.cgi, last accessed February 27, 2013). For each quartet, we selected unambiguous seed-ortholog pairs for each of the 6 (10 for analyses requiring an outgroup) unordered pairs chosen from the 4 (5) species. We further considered only those orthologs that comprised a four-species (five-species) clique. This procedure resulted in 5,189/3,565 and 8,179/2,522 unambiguous orthologs for the dipteran and vertebrate quartets, respectively; after exclusion of the coding sequences with internal stop codons in any of the species, the corresponding numbers were 5,183/3,541 and 8,159/2,518 for dipteran and vertebrate quartets, respectively. For sequence analysis, we used genome assembly versions identical to those given in INPARANOID to avoid ortholog misidentification due to differences in annotations between releases. Specifically, for H. sapiens, M. musculus, C. intestinalis, T. nigroviridis, Tak. rubripes, and Aed. aegypti, we used NCBI 36 (Lander et al. 2001; Wheeler et al. 2008), NCBI m37 (Waterston et al. 2002), JGI 2 (Dehal et al. 2002), TETRAODON 8.0 (Jaillon et al. 2004), FUGU 4.0 (Aparicio et al. 2002), and AaegL1 (Nene et al. 2007) assemblies, respectively, all corresponding to ENSEMBL release 52. For D. melanogaster and D. mojavensis, we used r5.13 (Adams et al. 2000) and r1.3 (Clark et al. 2007) assemblies, corresponding to ENSEMBL releases 58 and 63, respectively. For C. quinquefasciatus, we used the CpipJ1.2 (Arensburger et al. 2010) assembly. For A. mellifera (Honeybee Genome Sequencing Consortium 2006), we used NCBI build 4.1. All sequence and annotation data except for data on A. mellifera and C. quinquefasciatus were fetched from ENSEMBL (Kersey et al. 2009) using the ENSEMBL Perl API. Genome sequences and annotations for A. mellifera and C. quinquefasciatus were downloaded from NCBI (http://www.ncbi.nlm.nih.gov, last accessed February 27, 2013) (Sayers et al. 2012) and VectorBase (http://cquinquefasciatus.vectorbase.org/, last accessed February 27, 2013) (Lawson et al. 2009), respectively. Alignments of the orthologous proteins were performed with MUSCLE (Edgar 2004) with the default parameters.
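The clique requirement above (every pair of species in the set must agree on the same seed orthologs) can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function name `ortholog_cliques`, the dictionary layout of the pairwise ortholog tables, and the toy gene names are all hypothetical, and the pairwise tables are assumed to be one-to-one.

```python
from itertools import combinations


def ortholog_cliques(species, pairs):
    """Return tuples of genes, one per species, that are mutually orthologous.

    species: ordered list of species names.
    pairs[(a, b)]: dict mapping a gene of species a to its seed ortholog in
    species b, given for every a that precedes b in `species` (assumed layout).
    """
    anchor = species[0]
    cliques = []
    for gene in pairs[(anchor, species[1])]:
        members = {anchor: gene}
        for other in species[1:]:
            partner = pairs[(anchor, other)].get(gene)
            if partner is None:
                break  # no ortholog in this species, so no clique
            members[other] = partner
        else:
            # Keep the set only if every pairwise table agrees with it.
            if all(pairs[(a, b)].get(members[a]) == members[b]
                   for a, b in combinations(species, 2)):
                cliques.append(tuple(members[s] for s in species))
    return cliques


# Toy three-species example with hypothetical gene names.
toy_species = ["A", "B", "C"]
toy_pairs = {
    ("A", "B"): {"a1": "b1", "a2": "b2"},
    ("A", "C"): {"a1": "c1", "a2": "c2"},
    ("B", "C"): {"b1": "c1", "b2": "c9"},  # b2 disagrees with a2 -> c2
}
toy_cliques = ortholog_cliques(toy_species, toy_pairs)
```

In the toy input only the first gene family survives, because the B to C table contradicts the assignment implied by the other two tables for the second family.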

Identification and Analysis of Orthologous Introns

We selected introns orthologous in all the four species, defined as the introns in the orthologous positions of the coding sequences in the orthologous proteins, and also having the identical phase. For this purpose, coordinates of intron shadows were mapped onto protein alignments. To avoid analyzing nonorthologous introns, only introns mapping to regions of high-quality protein alignment were considered. For this purpose, we disallowed gaps in the two (for phase 0 introns) or one (for phase 1 or 2 introns) amino acid sites to which the intron mapped, and in the two immediately neighboring amino acid sites to the left and to the right of it. Furthermore, we required at least five alignment positions similar by BLOSUM62 matrix, and no more than two alignment gaps, in each of the species within 10 amino acids flanking the intronic shadow from each side. To ensure that we are studying noncoding sequences, we excluded from the analysis those introns which overlapped protein-coding exons in any known transcript for this gene. After all filtering, we identified 5,367 and 51,844 sets of orthologous introns in quartets 1 and 2, respectively. The first 6 and the last 16 nucleotides of the intronic sequences were excluded from analyses of conservation, as they are likely to be under selective constraint due to presence of elements crucial for the correct splicing of an intron (Haddrill et al. 2005). The remaining parts of the intronic sequences of the four species were then aligned with bl2seq (Altschul et al. 1990) with anchor length set to 7 and low complexity filtering on. Alignments were performed for each pairwise combination of species from the quartet. Sets of orthologous introns with significant similarity (bl2seq E-value ≤ 0.0001) between sequences from different pairs were excluded from analysis.
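Two of the filters above lend themselves to a short sketch: trimming the splice-signal regions, and the gap-based part of the alignment-quality check. The code below is illustrative only. The function names are hypothetical, the BLOSUM62 similarity requirement is omitted, and reading the limit of two gaps as applying to each 10-residue flank separately is an assumption about an ambiguous point in the description.

```python
def trim_splice_signals(intron_seq, head=6, tail=16):
    """Drop the first `head` and last `tail` nucleotides of an intron."""
    if len(intron_seq) <= head + tail:
        return ""  # nothing is left to analyze
    return intron_seq[head:len(intron_seq) - tail]


def passes_gap_filter(row, first_site, last_site, core=2, window=10, max_gaps=2):
    """Gap checks for one species' row of a protein alignment.

    first_site and last_site are 0-based alignment columns of the amino acid
    sites the intron maps to (two sites for phase 0, one site otherwise).
    """
    lo, hi = first_site - core, last_site + core
    if lo < 0 or hi >= len(row):
        return False  # too close to the alignment edge to be judged
    if "-" in row[lo:hi + 1]:
        return False  # gap at the intron site or its two neighbors per side
    left = row[max(0, first_site - window):first_site]
    right = row[last_site + 1:last_site + 1 + window]
    return left.count("-") <= max_gaps and right.count("-") <= max_gaps


trimmed = trim_splice_signals("A" * 6 + "CCCC" + "G" * 16)
clean_row_ok = passes_gap_filter("ACDEFGHIKLMNPQRSTVWY", 9, 10)
```

An intron would be retained only if the row of every species in the quartet passes such a check.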

Calculation of Expected Number of Introns Carrying a Segment of Similarity within Both Pairs

We started by counting, for a particular E-value, the numbers of introns N1(E) and N2(E) carrying a region of local sequence similarity in pairs 1 and 2, respectively. The expected number of introns carrying local sequence similarities within both species pairs was then calculated using four different randomization procedures: 1) without accounting for any potential confounding variables; 2) accounting for intron lengths; 3) accounting for gene identity; and 4) accounting for gene identity and for whether the intron was first or subsequent. In the first procedure, in each of the 10,000 reshuffling trials, we simply redistributed N1(E) and N2(E) introns among all introns of pairs 1 and 2 randomly, and counted the number of introns that have within-pair similarity in both species pairs in this resampled set. The resulting distributions were used to obtain the means and the confidence intervals in supplementary figure S1, Supplementary Material online. (The expected fraction of introns with similarity in both species pairs among all introns generated by this procedure is roughly equal to that obtained by simply multiplying the frequency of introns with within-pair similarity in the first pair by the corresponding frequency in the second pair.) As the lengths of orthologous introns are correlated between species pairs (e.g., for D. melanogaster and C. quinquefasciatus: Spearman’s rho = 0.29, P value < 2.2e−16; for H. sapiens and Tak. rubripes, Spearman’s rho = 0.215, P value < 2.2e−16), and longer introns are generally better conserved than the short ones (supplementary fig. S2, Supplementary Material online), the numbers of introns carrying local sequence similarities within both species pairs could be confounded by intron lengths. To account for this effect in the second procedure, we subdivided introns into 10 bins according to their length within each species pair.
Bins contained roughly equal numbers of introns; precisely equally sized bins were unobtainable, due to a relatively large number of introns that fell on the bin thresholds, particularly for short introns in dipterans. The bin thresholds were as follows: Homo–Mus (125, 279, 499, 757, 1,077, 1,479, 2,046, 3,036, and 5,620); Tetraodon–Takifugu (72, 77, 82, 89, 102, 130, 192, 337, and 709); D. melanogaster–D. mojavensis (56, 58, 60, 62, 64, 67, 73, 124, and 573); and Aedes–Culex (57, 59, 61, 63, 66, 71, 125, 666, and 4,476). We then classified the N1(E) and N2(E) introns carrying regions of similarity in pairs 1 and 2 by bins of intron length in pair 1 and 2, respectively. We thus obtained the distributions of introns with regions of similarity in pairs 1 and 2 by intron length. In each of the 10,000 resampling trials, we then randomly drew, from each of these distributions, N1(E) and N2(E) introns for pairs 1 and 2, respectively, and counted the number of introns with within-pair similarity in both species pairs in this resampled set. The resulting distributions of numbers of introns with within-pair similarity in both pairs were used to obtain the means and the confidence intervals in figure 2.
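The first two randomization procedures can be sketched as a label permutation, optionally restricted to length bins. This is one reading of the description rather than the authors' code: the function name `expected_overlap` and its arguments are hypothetical, and procedure 2 is interpreted here as redistributing, within each length bin of each pair, the same number of "similar" labels that the bin carried in the data.

```python
import random


def expected_overlap(flags1, flags2, bins1=None, bins2=None, trials=10_000, seed=0):
    """Null distribution of the number of introns similar in both pairs.

    flags1[i], flags2[i]: True if intron i carries a similar segment within
    pair 1 / pair 2. bins1[i], bins2[i]: length-bin index of intron i in each
    pair; None puts all introns in one bin (procedure 1).
    Returns (mean, 2.5th percentile, 97.5th percentile) of the null counts.
    """
    rng = random.Random(seed)
    n = len(flags1)
    bins1 = bins1 or [0] * n
    bins2 = bins2 or [0] * n

    def shuffle_within_bins(flags, bins):
        groups = {}
        for i, b in enumerate(bins):
            groups.setdefault(b, []).append(i)
        out = [False] * n
        for members in groups.values():
            k = sum(flags[i] for i in members)  # labels to place in this bin
            for i in rng.sample(members, k):
                out[i] = True
        return out

    counts = []
    for _ in range(trials):
        s1 = shuffle_within_bins(flags1, bins1)
        s2 = shuffle_within_bins(flags2, bins2)
        counts.append(sum(a and b for a, b in zip(s1, s2)))
    counts.sort()
    mean = sum(counts) / trials
    return mean, counts[int(0.025 * trials)], counts[int(0.975 * trials) - 1]
```

With a single bin the mean approaches N1·N2/N, which matches the remark above that the unbinned expectation is roughly the product of the two within-pair frequencies. The observed-to-expected ratio is then the observed count of doubly similar introns divided by this mean.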

Fig. 2. Introns that carry a segment of high similarity between species of one pair are more likely to also carry a segment of high similarity between species of the other pair within a quartet. Each blue dot corresponds to a specific BLAST E-value (shown next to the dot), with lower values corresponding to more stringent similarity thresholds. Each E-value was used to detect similar segments within orthologous introns of the two species belonging to pair 1, and of the two species belonging to pair 2. Horizontal axis, fraction of introns that carry similar segments within pair 1, among introns present in all four species. Vertical axis, number (A, B) or observed-to-expected ratio (C, D) for the number of introns that carry similar segments both within pair 1 and 2. Top, dipterans; bottom, vertebrates. Observed-to-expected ratio was defined as the ratio of the observed number of introns with similarity within both pairs to the expected number if the segments of similarity were distributed randomly over all introns, controlling for intron lengths (see text). The red line and the gray area correspond to the mean and 95% confidence intervals for the expected values calculated in 10,000 resampling trials.

Analysis of Data on Chromatin Modifications

Data on chromatin modifications for 51,541 human and 5,367 fruit fly introns were obtained from the ENCODE (ENCODE Project Consortium 2004; Ernst et al. 2011) (http://genome.ucsc.edu/ENCODE/, last accessed February 27, 2013) and modENCODE (Kharchenko et al. 2010; Roy et al. 2010) (http://www.modencode.org/, last accessed February 27, 2013) databases, respectively. ENCODE provides chromatin state segmentation and the corresponding predicted functional annotation for nine cell lines. We excluded from the analysis the two cancer cell lines (K562 and HepG2). The remaining cell lines were further subdivided into adult (GM12878, HMEC, HSMM, NHEK, and NHLF) and embryonic cell lines, the latter comprising embryonic stem cells (H1-hESC) and umbilical vein endothelial cells (HUVEC). We used segmentation tracks for Human Genome Build 36 (hg18), selecting the introns overlapping with strong enhancers (states 4 and 5) or insulators (state 8) either in all adult cell lines or in both embryonic cell lines. For D. melanogaster, modENCODE provides segmentation models for two cell lines (BG3 and S2). We used segmentation tracks for FlyBase release 5, selecting the introns overlapping with either Regulatory regions (enhancers) (state 3) or Active introns (state 4) in at least one of the cell lines. We estimated the statistical significance of the results in 10,000 resampling trials while correcting for intron length (procedure 2 of the previous section).
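The selection rules above reduce to an interval-overlap test combined with an "all cell lines" rule (human) or an "any cell line" rule (fly). A minimal sketch follows, assuming half-open coordinates and a simple tuple layout for segments; the function names and the toy coordinates are hypothetical and do not reflect the actual ENCODE or modENCODE file formats.

```python
def overlaps_state(intron, segments, states):
    """True if the intron overlaps any segment whose state is in `states`.

    intron: (chrom, start, end); segments: iterable of
    (chrom, start, end, state). Coordinates are assumed half-open.
    """
    chrom, start, end = intron
    return any(c == chrom and state in states and seg_start < end and start < seg_end
               for c, seg_start, seg_end, state in segments)


def in_all_lines(intron, tracks, states):
    """Rule used for human: overlap required in every listed cell line."""
    return all(overlaps_state(intron, segs, states) for segs in tracks.values())


def in_any_line(intron, tracks, states):
    """Rule used for fly: overlap required in at least one cell line."""
    return any(overlaps_state(intron, segs, states) for segs in tracks.values())


# Toy segmentation tracks with made-up coordinates.
toy_tracks = {
    "GM12878": [("chr1", 100, 200, 4)],
    "HMEC": [("chr1", 150, 300, 5)],
}
strong_enhancer_states = {4, 5}
```

For example, an intron at chr1:180-400 overlaps a strong-enhancer segment in both toy tracks, whereas one at chr1:250-400 does so only in the second.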

Identification of Intron Losses

For the third and the fourth tests, to trace the losses of introns on the phylogeny, the tree corresponding to each species quartet was rooted with an outgroup species (fig. 1). We then selected introns present in both species from pair 1 and in the outgroup species, assuming that such introns were present in the last common ancestor of both considered species pairs. Again, orthologous introns with significant similarity (bl2seq E-value ≤ 0.0001) between sequences from different pairs were excluded from analysis. These introns were then subdivided into 1) those also present in both pair 2 species and 2) those that had been lost in at least one of the pair 2 species. (Introns absent in one of the pair 2 species were usually also absent in the other, implying loss on the branch separating pair 2 from its common ancestor with pair 1; cases of intron loss mapping to external branches were rare.) For groups of introns indicated in 1) and 2), we compared the distributions of E-values within pair 1, and the distributions of intron lengths within pair 1. Reciprocal tests were performed analogously.
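The grouping step above can be sketched as a simple presence/absence classification. The function name `split_by_preservation`, the data layout, and the short species labels are hypothetical and serve only to illustrate the logic.

```python
def split_by_preservation(presence, pair1, pair2, outgroup):
    """Split ancestral introns into those preserved in pair 2 and those lost.

    presence maps an intron identifier to the set of species that carry it.
    An intron is treated as ancestral when it is present in both pair 1
    species and in the outgroup.
    """
    required = set(pair1) | {outgroup}
    preserved, lost = [], []
    for intron_id, species in presence.items():
        if not required <= species:
            continue  # cannot be assumed to predate the split of the pairs
        if set(pair2) <= species:
            preserved.append(intron_id)
        else:
            lost.append(intron_id)  # absent in at least one pair 2 species
    return preserved, lost


toy_presence = {
    "i1": {"dmel", "dmoj", "amel", "cqui", "aaeg"},
    "i2": {"dmel", "dmoj", "amel", "cqui"},
    "i3": {"dmel", "amel", "cqui", "aaeg"},
}
toy_preserved, toy_lost = split_by_preservation(
    toy_presence, pair1=("dmel", "dmoj"), pair2=("cqui", "aaeg"), outgroup="amel")
```

The reciprocal test corresponds to swapping the `pair1` and `pair2` arguments.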

Results and Discussion

We investigated, at the genomic scale, the common selective constraint (probably associated with conservation of function) in homologous, but highly divergent, noncoding sequences. We focus on the genomic segments that have diverged from their common ancestor to such an extent that they have lost all primary sequence similarity. Generally, any similarity of properties of orthologous sequences that are not alignable suggests the presence of such a selective constraint, unless this similarity can be explained otherwise. As a sample of genome segments orthologous between distant species, we used the introns of orthologous genes, because their orthology can be easily determined through flanking exons even at phylogenetic distances so large that the introns themselves are no longer alignable. We used four tests that can provide evidence of selective constraint without sequence similarity. Each test was done on a quartet of species. A quartet consists of two pairs of species, such that the evolutionary distance at selectively neutral sites within the first (Ks1) and the second (Ks2) pair is sufficiently large so that the sequence conservation between two species of a pair is indicative of selective constraint, but much shorter than the distance between the two pairs (Ks3). We studied two such quartets, of dipterans and of vertebrates (fig. 1). In each quartet, there are many introns that contain highly significant local sequence similarities within each pair, indicative of selective constraint. In contrast, there are only a few meaningful sequence similarities between introns from species that belong to different pairs within a quartet. First, we asked whether the presence of a conserved (and, by inference, functional) segment between two species of a pair within a quartet is a significant predictor for the presence of a conserved segment in the orthologous intron between the two species of the other pair. 
Consider orthologous noncoding segments that have a function conserved between all four species of a quartet. Such functional conservation should lead to above-neutral sequence conservation within each pair. In addition, strong functional conservation may lead to above-expected sequence conservation even at much higher evolutionary distances that separate the two pairs. In the latter case, conservation of a sequence segment spans the entire quartet. As our focus was functional conservation without sequence conservation, we excluded from the analysis 34 and 303 introns with significant local similarities between species from different pairs in dipteran and vertebrate quartets, respectively. The remaining 5,333 introns in the dipteran quartet, and 51,541 introns in the vertebrate quartet, therefore, contained only those introns that were unalignable between the two pairs. Still, an intron that contains a significant local similarity within one pair of species contains a significant local similarity within the other pair much more often than would be expected if these segments were distributed over the introns independently in each of the pairs (supplementary fig. S1, Supplementary Material online). However, this analysis can be confounded by differences in intron lengths. Indeed, longer introns are more likely to contain within-pair similarities (supplementary fig. S2, Supplementary Material online), in agreement with the data on their higher conservation, at least in Drosophila (Haddrill et al. 2005), and intron lengths are correlated between pairs of species (see Materials and Methods). This nonuniformity with respect to intron length should be controlled for. 
Therefore, the expected number of introns carrying a significant local similarity within both pairs should be obtained by summation over bins of introns of different lengths, to avoid underestimating this number due to correlated intron lengths in different pairs of species within a quartet, and therefore overestimating our effect. Nevertheless, even after this correction, we observed substantially more introns with conservation in both pairs than expected (fig. 2). Therefore, among those introns that are unalignable between the two pairs of species within a quartet, an excess of introns conserved in both species pairs was observed. This excess suggests the presence of a selective constraint that did not lead to observable sequence conservation between the two pairs. The excess of conservation was stronger under more stringent similarity thresholds, that is, when the parameters were chosen in such a way that only a small fraction of introns were aligned within a pair. When this fraction dropped below 5% for the dipteran quartet, or below 3% for the vertebrate quartet, the number of introns possessing an alignment in both pairs of species exceeded the random expectation by a factor of 3; in vertebrates, when only the 0.4% introns with the most stringent conservation were used, an 8-fold excess was observed (fig. 2). Nonrandom distribution of conservative elements among introns can be caused not only by conservation of the ancestral function but also by some common characteristics of introns not necessarily associated with their common origin. In particular, a pattern similar to that in figure 2 is expected if introns of a particular subset of genes (e.g., of highly expressed genes) are more likely to contain a conserved element, or if the first introns of genes tend to be more conserved. 
To test whether these common features of orthologous introns lead to the observed pattern, we reshuffled, for each species pair, the introns within each gene, and asked whether an excess of introns with coincident conservation is still observed in the data, compared with this control. This test is extremely conservative, because such reshuffling is expected to lead to false-negatives if the number of introns is low, and especially if there are many genes with only one intron. In fact, the mean number of introns with established orthology per gene in our data set was 1.95 (median = 1) for dipterans, and 7.00 (median = 5) for vertebrates. Still, even in this very conservative test, the excess of introns carrying conserved regions in both pairs remained (supplementary fig. S3, Supplementary Material online). It also remained after exclusion of the first introns of genes (supplementary fig. S4, Supplementary Material online), suggesting that it was not due to their higher conservation. Therefore, the observed coincident conservation of unalignable regions in phylogenetically remote species is not simply due to common characteristics of the orthologous introns or of the genes carrying them. Second, we hypothesized that if some regulatory elements persist for longer than the sequence similarity of the corresponding DNA segments, we would expect introns containing a regulatory element in pair 1 also to carry a segment of similarity in pair 2 (again, correcting for the differences in intron lengths). To study this, we used genome-wide predictions of regulatory regions based on patterns of chromatin modifications (Kharchenko et al. 2010; Roy et al. 2010; Ernst et al. 2011). As in the first test, we excluded from the analysis 34 and 303 introns with significant local similarities between species from different pairs in dipteran and vertebrate quartets, respectively. 
Nevertheless, we found that Drosophila melanogaster introns that overlap regions enriched in active chromatin modifications, and are therefore likely to be involved in regulation, are up to approximately three times more likely to carry a segment of similarity in the mosquito species pair (fig. 3). Analogously, introns overlapping insulators or enhancers in human are respectively up to 2.6 times or 1.4 times more likely to carry a segment of similarity between the two fish species (fig. 3). In enhancers, a stronger effect was observed if only embryonic cell lines were considered (embryonic stem cells and umbilical vein endothelial cells, fig. 3), in line with the observations that conserved noncoding regions tend to be associated with genes involved in developmental processes (Woolfe et al. 2005).

Fig. 3. Introns that carry a segment of similarity in pair 2 are more likely to overlap regulation-associated elements within pair 1. Each blue dot corresponds to a specific BLAST E-value (shown next to the dot). Each E-value was used to detect similar segments within orthologous introns of the two species belonging to pair 2. Horizontal axis, fraction of introns that carry similar segments within pair 2, among introns present in all four species. Vertical axis, observed-to-expected ratio for the number of introns that carry a regulation-associated element within pair 1, according to modENCODE (Kharchenko et al. 2010; Roy et al. 2010) (A, B) or ENCODE (ENCODE Project Consortium 2004; Ernst et al. 2011) (C–F) data, and also carry a segment of similarity within pair 2. (A, B) Dipterans; (C–F) vertebrates. (A) Active introns, any cell line; (B) enhancers, any cell line; (C) insulators, all adult cell lines; (D) insulators, both embryonic cell lines; (E) strong enhancers, all adult cell lines; (F) strong enhancers, both embryonic cell lines. Observed-to-expected ratio was defined as the ratio of the observed number of introns with a regulation-associated element within pair 1 and similarity within pair 2 to the same number expected if the regulation-associated elements and the segments of similarity were distributed randomly over all introns, controlling for intron lengths. The red line and the gray area correspond to the mean and 95% confidence intervals for the expected values calculated in 10,000 resampling trials.

Third, in addition to conservation of a sequence within an intron, presence of a functional segment may lead to a reduced rate of loss of such introns in evolution. We asked whether among the introns that were present in pair 1 species, the introns that were preserved in pair 2 species are more likely to carry a segment conserved within pair 1.
Because in this analysis, we need to discriminate between intron losses and gains, we used an outgroup species to determine the ancestral state. Again, to avoid dealing with sequence similarities spanning all four species of a quartet, we excluded from the analysis 14 out of 3,073, and 54 out of 6,609, introns with significant sequence similarities between species from different pairs, in quartets 1 and 2, respectively. In the remaining set, higher prevalence of conserved, within a pair, segments within introns that were preserved between pairs would imply functional conservation (with no detectable underlying sequence conservation) spanning both pairs. Among all introns, an intron present in both Drosophila species and in the outgroup (Apis mellifera) is also present in both mosquito species in 69% of cases. Those 69% of introns were significantly more likely to carry a segment of similarity between Drosophila species than the remaining 31% (P value = 5.71e−07, Fisher’s exact test; fig. 4A, inset). An intron present in both mammalian species and in Ciona intestinalis is also present in both species of fish in 98% of cases. A higher fraction of introns is preserved in the vertebrate quartet because the analyzed vertebrate species are more closely related than the dipteran species, and intron loss is less frequent in vertebrates (Putnam et al. 2007). Those 98% of introns were somewhat more likely to carry a segment of similarity between human and mouse than the remaining 2%, although the difference was not significant (P value = 0.167, Fisher’s exact test; fig. 4B, inset), probably because nearly all the introns carry some similarity between human and mouse under the thresholds used. 
Moreover, among introns carrying conserved segments, this similarity was higher within introns that were preserved in both pair 2 species, compared with the remaining introns, both in the dipteran (P value = 0.00103) and in the vertebrate quartets (P value = 0.0203, Wilcoxon rank sum test with continuity correction) (fig. 4). The results of the reciprocal tests were similar (supplementary fig. S5, Supplementary Material online).
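The comparison of preserved and lost introns above is a test on a 2×2 table (preserved or lost, with or without a segment of similarity). A self-contained sketch of a two-sided Fisher's exact test is given below for illustration; the function name is hypothetical, the two-sided rule used (summing all tables no more probable than the observed one) is a common convention and not necessarily the one the authors' software applied, and the counts in the example are made up because the paper reports only fractions and P values.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))  # tolerance for float ties
    return min(1.0, total)


# Made-up counts: rows = preserved / lost, columns = with / without similarity.
p_toy = fisher_exact_two_sided(3, 1, 1, 3)
```

For this toy table the margins are all 4, and the tables at least as extreme as the observed one have total probability 34/70.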

Fig. 4. Introns preserved in both pair 2 species are more likely to carry a conserved segment within pair 1. For each quartet, the distributions of E-values within pair 1 are shown for introns preserved in both pair 2 species (green), and for introns lost in at least one of the pair 2 species (blue). E-values indicated below the horizontal axis correspond to the lower E-value threshold. Insets show the fraction of introns with at least a marginal (E-value ≤ 1) similarity observed, for the same two groups.

Fourth, longer introns are more likely to carry a segment of conservation than short introns (Haddrill et al. 2005). Therefore, a function conserved in all four species of a quartet may be associated with a larger length in pair 1 of introns preserved in pair 2, compared with the introns lost in pair 2. Indeed, among the 3,059 (6,555) dipteran (vertebrate) introns found in both pair 1 species and in the outgroup, the introns also present in both pair 2 species tend to be longer in pair 1 species than introns lost in pair 2, in both quartets (dipterans: P = 2.64e−18; vertebrates: P = 0.00111; Wilcoxon rank sum test with continuity correction; fig. 5). Again, for both quartets, the results of the two reciprocal tests were similar (supplementary fig. S6, Supplementary Material online). Therefore, we observe a higher preservation of introns possessing long orthologs in a phylogenetically remote species, and therefore presumably most likely to carry a functional DNA segment in those species, compared with introns with short orthologs.

Fig. 5. Introns preserved in both pair 2 species tend to be longer within pair 1 species. For each quartet, the distributions of lengths within pair 1 are shown for introns preserved in both pair 2 species (green), and for introns lost in at least one of the pair 2 species (blue). Lengths indicated below the horizontal axis correspond to the lower length threshold. The length of the shorter of the two orthologous introns in pair 1 was used.

Thus, all four analyses provide evidence for selective constraint that keeps operating even after diverging orthologous introns have become unalignable, in the evolution of dipterans and vertebrates. Apparently, long-lived functional elements in orthologous genomic compartments, which persist longer than the conservation of the primary sequence, are common. The excess of introns with conserved segments (fig. 2) suggests that such elements reside within approximately 5% of introns in dipterans, and approximately 3% in vertebrates. Selective constraint acting on homologous, unalignable DNA segments is also likely to be common within intergenic regions, which carry numerous regulatory elements (Heintzman et al. 2009). However, it is difficult to subdivide unalignable intergenic regions into orthologous compartments, which are the core of our analysis. The precise nature of the constraint imposed by the conserved function on the evolution of homologous DNA segments that are no longer alignable also remains a mystery.

Supplementary Material

Supplementary figures S1–S6 are available at Genome Biology and Evolution online (http://www.gbe.oxfordjournals.org/).

References (36 in total)

1. Metrics of sequence constraint overlook regulatory sequences in an exhaustive analysis at phox2b.

Authors: David M McGaughey; Ryan M Vinton; Jimmy Huynh; Amr Al-Saif; Michael A Beer; Andrew S McCallion
Journal: Genome Res Date: 2007-12-10 Impact factor: 9.043

2. From sequences to shapes and back: a case study in RNA secondary structures.

Authors: P Schuster; W Fontana; P F Stadler; I L Hofacker
Journal: Proc Biol Sci Date: 1994-03-22 Impact factor: 5.349

3. Ensembl Genomes: extending Ensembl across the taxonomic space.

Authors: P J Kersey; D Lawson; E Birney; P S Derwent; M Haimel; J Herrero; S Keenan; A Kerhornou; G Koscielny; A Kähäri; R J Kinsella; E Kulesha; U Maheswari; K Megy; M Nuhn; G Proctor; D Staines; F Valentin; A J Vilella; A Yates
Journal: Nucleic Acids Res Date: 2009-11-01 Impact factor: 16.971

4. Evolutionary discrimination of mammalian conserved non-genic sequences (CNGs).

Authors: Emmanouil T Dermitzakis; Alexandre Reymond; Nathalie Scamuffa; Catherine Ucla; Ewen Kirkness; Colette Rossier; Stylianos E Antonarakis
Journal: Science Date: 2003-10-02 Impact factor: 47.728

5. Patterns of intron sequence evolution in Drosophila are dependent upon length and GC content.

Authors: Penelope R Haddrill; Brian Charlesworth; Daniel L Halligan; Peter Andolfatto
Journal: Genome Biol Date: 2005-07-27 Impact factor: 13.583

6. InParanoid 7: new algorithms and tools for eukaryotic orthology analysis.

Authors: Gabriel Ostlund; Thomas Schmitt; Kristoffer Forslund; Tina Köstler; David N Messina; Sanjit Roopra; Oliver Frings; Erik L L Sonnhammer
Journal: Nucleic Acids Res Date: 2009-11-05 Impact factor: 16.971

7. Sequencing of Culex quinquefasciatus establishes a platform for mosquito comparative genomics.

Authors: Peter Arensburger; Karine Megy; Robert M Waterhouse; Jenica Abrudan; Paolo Amedeo; Beatriz Antelo; Lyric Bartholomay; Shelby Bidwell; Elisabet Caler; Francisco Camara; Corey L Campbell; Kathryn S Campbell; Claudio Casola; Marta T Castro; Ishwar Chandramouliswaran; Sinéad B Chapman; Scott Christley; Javier Costas; Eric Eisenstadt; Cedric Feschotte; Claire Fraser-Liggett; Roderic Guigo; Brian Haas; Martin Hammond; Bill S Hansson; Janet Hemingway; Sharon R Hill; Clint Howarth; Rickard Ignell; Ryan C Kennedy; Chinnappa D Kodira; Neil F Lobo; Chunhong Mao; George Mayhew; Kristin Michel; Akio Mori; Nannan Liu; Horacio Naveira; Vishvanath Nene; Nam Nguyen; Matthew D Pearson; Ellen J Pritham; Daniela Puiu; Yumin Qi; Hilary Ranson; Jose M C Ribeiro; Hugh M Roberston; David W Severson; Martin Shumway; Mario Stanke; Robert L Strausberg; Cheng Sun; Granger Sutton; Zhijian Jake Tu; Jose Manuel C Tubio; Maria F Unger; Dana L Vanlandingham; Albert J Vilella; Owen White; Jared R White; Charles S Wondji; Jennifer Wortman; Evgeny M Zdobnov; Bruce Birren; Bruce M Christensen; Frank H Collins; Anthony Cornel; George Dimopoulos; Linda I Hannick; Stephen Higgs; Gregory C Lanzaro; Daniel Lawson; Norman H Lee; Marc A T Muskavitch; Alexander S Raikhel; Peter W Atkinson
Journal: Science Date: 2010-10-01 Impact factor: 63.714

8. VectorBase: a data resource for invertebrate vector genomics.

Authors: Daniel Lawson; Peter Arensburger; Peter Atkinson; Nora J Besansky; Robert V Bruggner; Ryan Butler; Kathryn S Campbell; George K Christophides; Scott Christley; Emmanuel Dialynas; Martin Hammond; Catherine A Hill; Nathan Konopinski; Neil F Lobo; Robert M MacCallum; Greg Madey; Karine Megy; Jason Meyer; Seth Redmond; David W Severson; Eric O Stinson; Pantelis Topalis; Ewan Birney; William M Gelbart; Fotis C Kafatos; Christos Louis; Frank H Collins
Journal: Nucleic Acids Res Date: 2008-11-21 Impact factor: 16.971

9. MUSCLE: a multiple sequence alignment method with reduced time and space complexity.

Authors: Robert C Edgar
Journal: BMC Bioinformatics Date: 2004-08-19 Impact factor: 3.169

10. Evolution of genes and genomes on the Drosophila phylogeny.

Authors: Andrew G Clark; Michael B Eisen; Douglas R Smith; Casey M Bergman; Brian Oliver; Therese A Markow; Thomas C Kaufman; Manolis Kellis; William Gelbart; Venky N Iyer; Daniel A Pollard; Timothy B Sackton; Amanda M Larracuente; Nadia D Singh; Jose P Abad; Dawn N Abt; Boris Adryan; Montserrat Aguade; Hiroshi Akashi; Wyatt W Anderson; Charles F Aquadro; David H Ardell; Roman Arguello; Carlo G Artieri; Daniel A Barbash; Daniel Barker; Paolo Barsanti; Phil Batterham; Serafim Batzoglou; Dave Begun; Arjun Bhutkar; Enrico Blanco; Stephanie A Bosak; Robert K Bradley; Adrianne D Brand; Michael R Brent; Angela N Brooks; Randall H Brown; Roger K Butlin; Corrado Caggese; Brian R Calvi; A Bernardo de Carvalho; Anat Caspi; Sergio Castrezana; Susan E Celniker; Jean L Chang; Charles Chapple; Sourav Chatterji; Asif Chinwalla; Alberto Civetta; Sandra W Clifton; Josep M Comeron; James C Costello; Jerry A Coyne; Jennifer Daub; Robert G David; Arthur L Delcher; Kim Delehaunty; Chuong B Do; Heather Ebling; Kevin Edwards; Thomas Eickbush; Jay D Evans; Alan Filipski; Sven Findeiss; Eva Freyhult; Lucinda Fulton; Robert Fulton; Ana C L Garcia; Anastasia Gardiner; David A Garfield; Barry E Garvin; Greg Gibson; Don Gilbert; Sante Gnerre; Jennifer Godfrey; Robert Good; Valer Gotea; Brenton Gravely; Anthony J Greenberg; Sam Griffiths-Jones; Samuel Gross; Roderic Guigo; Erik A Gustafson; Wilfried Haerty; Matthew W Hahn; Daniel L Halligan; Aaron L Halpern; Gillian M Halter; Mira V Han; Andreas Heger; LaDeana Hillier; Angie S Hinrichs; Ian Holmes; Roger A Hoskins; Melissa J Hubisz; Dan Hultmark; Melanie A Huntley; David B Jaffe; Santosh Jagadeeshan; William R Jeck; Justin Johnson; Corbin D Jones; William C Jordan; Gary H Karpen; Eiko Kataoka; Peter D Keightley; Pouya Kheradpour; Ewen F Kirkness; Leonardo B Koerich; Karsten Kristiansen; Dave Kudrna; Rob J Kulathinal; Sudhir Kumar; Roberta Kwok; Eric Lander; Charles H Langley; Richard Lapoint; Brian P Lazzaro; So-Jeong Lee; Lisa Levesque; 
Ruiqiang Li; Chiao-Feng Lin; Michael F Lin; Kerstin Lindblad-Toh; Ana Llopart; Manyuan Long; Lloyd Low; Elena Lozovsky; Jian Lu; Meizhong Luo; Carlos A Machado; Wojciech Makalowski; Mar Marzo; Muneo Matsuda; Luciano Matzkin; Bryant McAllister; Carolyn S McBride; Brendan McKernan; Kevin McKernan; Maria Mendez-Lago; Patrick Minx; Michael U Mollenhauer; Kristi Montooth; Stephen M Mount; Xu Mu; Eugene Myers; Barbara Negre; Stuart Newfeld; Rasmus Nielsen; Mohamed A F Noor; Patrick O'Grady; Lior Pachter; Montserrat Papaceit; Matthew J Parisi; Michael Parisi; Leopold Parts; Jakob S Pedersen; Graziano Pesole; Adam M Phillippy; Chris P Ponting; Mihai Pop; Damiano Porcelli; Jeffrey R Powell; Sonja Prohaska; Kim Pruitt; Marta Puig; Hadi Quesneville; Kristipati Ravi Ram; David Rand; Matthew D Rasmussen; Laura K Reed; Robert Reenan; Amy Reily; Karin A Remington; Tania T Rieger; Michael G Ritchie; Charles Robin; Yu-Hui Rogers; Claudia Rohde; Julio Rozas; Marc J Rubenfield; Alfredo Ruiz; Susan Russo; Steven L Salzberg; Alejandro Sanchez-Gracia; David J Saranga; Hajime Sato; Stephen W Schaeffer; Michael C Schatz; Todd Schlenke; Russell Schwartz; Carmen Segarra; Rama S Singh; Laura Sirot; Marina Sirota; Nicholas B Sisneros; Chris D Smith; Temple F Smith; John Spieth; Deborah E Stage; Alexander Stark; Wolfgang Stephan; Robert L Strausberg; Sebastian Strempel; David Sturgill; Granger Sutton; Granger G Sutton; Wei Tao; Sarah Teichmann; Yoshiko N Tobari; Yoshihiko Tomimura; Jason M Tsolas; Vera L S Valente; Eli Venter; J Craig Venter; Saverio Vicario; Filipe G Vieira; Albert J Vilella; Alfredo Villasante; Brian Walenz; Jun Wang; Marvin Wasserman; Thomas Watts; Derek Wilson; Richard K Wilson; Rod A Wing; Mariana F Wolfner; Alex Wong; Gane Ka-Shu Wong; Chung-I Wu; Gabriel Wu; Daisuke Yamamoto; Hsiao-Pei Yang; Shiaw-Pyng Yang; James A Yorke; Kiyohito Yoshida; Evgeny Zdobnov; Peili Zhang; Yu Zhang; Aleksey V Zimin; Jennifer Baldwin; Amr Abdouelleil; Jamal Abdulkadir; Adal Abebe; Brikti 
Abera; Justin Abreu; St Christophe Acer; Lynne Aftuck; Allen Alexander; Peter An; Erica Anderson; Scott Anderson; Harindra Arachi; Marc Azer; Pasang Bachantsang; Andrew Barry; Tashi Bayul; Aaron Berlin; Daniel Bessette; Toby Bloom; Jason Blye; Leonid Boguslavskiy; Claude Bonnet; Boris Boukhgalter; Imane Bourzgui; Adam Brown; Patrick Cahill; Sheridon Channer; Yama Cheshatsang; Lisa Chuda; Mieke Citroen; Alville Collymore; Patrick Cooke; Maura Costello; Katie D'Aco; Riza Daza; Georgius De Haan; Stuart DeGray; Christina DeMaso; Norbu Dhargay; Kimberly Dooley; Erin Dooley; Missole Doricent; Passang Dorje; Kunsang Dorjee; Alan Dupes; Richard Elong; Jill Falk; Abderrahim Farina; Susan Faro; Diallo Ferguson; Sheila Fisher; Chelsea D Foley; Alicia Franke; Dennis Friedrich; Loryn Gadbois; Gary Gearin; Christina R Gearin; Georgia Giannoukos; Tina Goode; Joseph Graham; Edward Grandbois; Sharleen Grewal; Kunsang Gyaltsen; Nabil Hafez; Birhane Hagos; Jennifer Hall; Charlotte Henson; Andrew Hollinger; Tracey Honan; Monika D Huard; Leanne Hughes; Brian Hurhula; M Erii Husby; Asha Kamat; Ben Kanga; Seva Kashin; Dmitry Khazanovich; Peter Kisner; Krista Lance; Marcia Lara; William Lee; Niall Lennon; Frances Letendre; Rosie LeVine; Alex Lipovsky; Xiaohong Liu; Jinlei Liu; Shangtao Liu; Tashi Lokyitsang; Yeshi Lokyitsang; Rakela Lubonja; Annie Lui; Pen MacDonald; Vasilia Magnisalis; Kebede Maru; Charles Matthews; William McCusker; Susan McDonough; Teena Mehta; James Meldrim; Louis Meneus; Oana Mihai; Atanas Mihalev; Tanya Mihova; Rachel Mittelman; Valentine Mlenga; Anna Montmayeur; Leonidas Mulrain; Adam Navidi; Jerome Naylor; Tamrat Negash; Thu Nguyen; Nga Nguyen; Robert Nicol; Choe Norbu; Nyima Norbu; Nathaniel Novod; Barry O'Neill; Sahal Osman; Eva Markiewicz; Otero L Oyono; Christopher Patti; Pema Phunkhang; Fritz Pierre; Margaret Priest; Sujaa Raghuraman; Filip Rege; Rebecca Reyes; Cecil Rise; Peter Rogov; Keenan Ross; Elizabeth Ryan; Sampath Settipalli; Terry Shea; Ngawang 
Sherpa; Lu Shi; Diana Shih; Todd Sparrow; Jessica Spaulding; John Stalker; Nicole Stange-Thomann; Sharon Stavropoulos; Catherine Stone; Christopher Strader; Senait Tesfaye; Talene Thomson; Yama Thoulutsang; Dawa Thoulutsang; Kerri Topham; Ira Topping; Tsamla Tsamla; Helen Vassiliev; Andy Vo; Tsering Wangchuk; Tsering Wangdi; Michael Weiand; Jane Wilkinson; Adam Wilson; Shailendra Yadav; Geneva Young; Qing Yu; Lisa Zembek; Danni Zhong; Andrew Zimmer; Zac Zwirko; David B Jaffe; Pablo Alvarez; Will Brockman; Jonathan Butler; CheeWhye Chin; Sante Gnerre; Manfred Grabherr; Michael Kleber; Evan Mauceli; Iain MacCallum
Journal: Nature Date: 2007-11-08 Impact factor: 49.962
