Bar Lavi1,2, Eli Levy Karin1,3, Tal Pupko1, Einat Hazkani-Covo2. 1. Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Israel. 2. Department of Natural and Life Sciences, The Open University of Israel, Ra'anana, Israel. 3. Department of Molecular Biology & Ecology of Plants, George S. Wise Faculty of Life Sciences, Tel Aviv University, Israel.
Abstract
Perfect short inverted repeats (IRs) are known to be enriched in a variety of bacterial and eukaryotic genomes. Currently, it is unclear whether perfect IRs are conserved over evolutionary time scales. In this study, we aimed to characterize the prevalence and evolutionary conservation of IRs across 20 proteobacterial strains. We first identified IRs in Escherichia coli K-12 substr MG1655 and showed that they are overabundant. We next aimed to test whether this overabundance is reflected in the conservation of IRs over evolutionary time scales. To this end, for each perfect IR identified in E. coli MG1655, we collected orthologous sequences from related proteobacterial genomes. We next quantified the evolutionary conservation of these IRs, that is, the presence of the exact same IR across orthologous regions. We observed high conservation of perfect IRs: out of the 234 examined orthologous regions, 145 were more conserved than expected, which is statistically significant even after correcting for multiple testing. Our results together with previous experimental findings support a model in which imperfect IRs are corrected to perfect IRs in a preferential manner via a template switching mechanism.
Perfect short inverted repeats (IRs) are known to be enriched in a variety of bacterial and eukaryotic genomes. Currently, it is unclear whether perfect IRs are conserved over evolutionary time scales. In this study, we aimed to characterize the prevalence and evolutionary conservation of IRs across 20 proteobacterial strains. We first identified IRs in Escherichia coli K-12 substr MG1655 and showed that they are overabundant. We next aimed to test whether this overabundance is reflected in the conservation of IRs over evolutionary time scales. To this end, for each perfect IR identified in E. coli MG1655, we collected orthologous sequences from related proteobacterial genomes. We next quantified the evolutionary conservation of these IRs, that is, the presence of the exact same IR across orthologous regions. We observed high conservation of perfect IRs: out of the 234 examined orthologous regions, 145 were more conserved than expected, which is statistically significant even after correcting for multiple testing. Our results together with previous experimental findings support a model in which imperfect IRs are corrected to perfect IRs in a preferential manner via a template switching mechanism.
A perfect inverted repeat (IR) is a stretch of DNA consisting of two DNA sequences (arms) in reverse complement orientation (e.g., 5′ AGAACAxxxxTGTTCT 3′). When the arms are directly adjacent to each other, the sequence is called a palindrome (e.g., 5′ AGAACATGTTCT 3′). When the arms do not perfectly match, the segment would be termed an imperfect IR, or a quasi-palindrome. The reverse symmetry between the arms allows them to form base pairs with each other, resulting in DNA secondary structures, such as hairpins and cruciforms (Lilley 1980; Panayotatos and Wells 1981).IRs are important for a wide range of biological processes, including replication, transcription, and DNA repair. IRs are found to be abundant in viral origins of replication (Pearson et al. 1996) and in bacterial plasmids (del Solar et al. 1998). In addition, in Escherichia coli, imperfections in the IR sequence of the lac repressor gene regulatory region were shown to decrease the binding affinity of the transcription machinery to that gene (Sadler et al. 1983). Moreover, IRs can be involved in alternative termination of bacterial genes (Li et al. 1997) and participate in the process of immunoglobulin V(D)J gene rearrangement (Cuomo et al. 1996). IRs are also important for RNA functionality. Endogenous transcripts bearing IRs could be processed to produce small RNAs that are involved in gene silencing (Okamura et al. 2008; Wroblewski et al. 2014).Numerous studies of the DNA repair mechanism have demonstrated the role of long IRs in promoting genome instability both in prokaryotes and eukaryotes. For example, the bacterial transposon Tn5 that includes a 1.5-kb long IR was found to be unstable when inserted into Saccharomyces cerevisiae (Gordenin et al. 1993). Following DNA replication, IRs can undergo a process in which their length increases, enhancing genome instability (Butler et al. 1996; Tanaka et al. 2002; Brewer et al. 2011). Instability of long IRs was shown to be associated with higher probability for translocation and deletion events (Richard et al. 2008; Mizuno et al. 2009; Branzei and Foiani 2010; Lu et al. 2015). It was additionally shown that nearby IRs in yeast fuse to form unstable dicentric chromosomes, resulting in translocations and other gross chromosomal changes (Paek et al. 2009). In addition, Alus inserted as IRs in the yeast genome (Alu-IR) induce double strand breaks (DSBs) that are terminated by hairpins and are mitotic recombination hotspots. Failure to process the hairpins leads to generation of chromosome inverted duplications (Lobachev et al. 2002). Furthermore, secondary structures of IRs can stall the DNA replication process, leading to DSBs and deletions, for example, Alu-IRs were shown to be highly unstable and to stall the replication fork (Voineagu et al. 2008). In addition, Alu-IRs are very rare in the human genome, suggesting that long IRs are selected against (Lobachev et al. 2000). Secondary structures formed by IRs were also found to be related to gene amplification (Pearson et al. 1996; Tanaka et al. 2002; Narayanan et al. 2006). For example, in breast cancer it was shown that palindrome formation proceeds gene amplification (Tanaka et al. 2005; Tanaka and Yao 2009).In contrast to long IRs, several reports showed that perfect short IRs are abundant in genomes. Cox and Mirkin (1997) showed that short perfect IRs are enriched in human, S. cerevisiae, E. coli, and Haemophilus influenzae. This enrichment of short perfect IRs in bacterial genomes was further reported by others (Lillo et al. 2002; van Noort, Snel, et al. 2003; Ladoukakis and Eyre-Walker 2008; Strawbridge et al. 2010).A striking characteristic of IRs is their tendency to undergo concerted evolution, which leads to the removal of variation between the two arms of an IR. For example, it was reported that in chloroplasts of land plants, long IRs undergo frequent recombination to produce identical IR forms (Kolodner and Tewari 1979; Grant et al. 1980; Turmel et al. 2017). Concerted evolution of IRs can be mediated by two mechanisms: homologous recombination and template switching (see below). Homologous recombination is known to occur in long IRs, while template switching can occur in both long and short IRs. While concerted evolution of long IRs through homologous recombination is relatively well studied (Warburton et al. 2004; Chen et al. 2007; Darmon and Leach 2014), the impact of template switching of short IRs on genome evolution received little to no attention.DNA template switching occurs when the DNA polymerase hops between templates. Template switching between IR arms was first identified in bacteriophages by Ripley (1982). The mechanism requires two template switching steps and can occur through an intra or inter-molecular template switching (Ripley 1982). An example of the template switching mechanism is shown in figure 1. Starting with a template represented by an imperfect IR of arm size 5 bp (lower sequence in fig. 1), the first switch occurs after the newly synthesized strand replicates the template of the 3′ arm. This switch is to the complementary arm, either in the nascent DNA strand (intramolecular, fig. 1) or across the replication fork (intermolecular, fig. 1). The second switch (fig. 1) occurs when the replication fork realigns back to the native template. At the end of this process, one strand has an imperfect IR and the other strand has a perfect IR. At the next round of replication, one of the daughter cells harbors a perfect IR and the other one—does not. The resulting perfect IR is shown as the upper sequence in figure 1.
. 1.
—Template switching converts an imperfect IR to a perfect one. (A) The upper and lower sequences represent a perfect and an imperfect IR, respectively, located in an orthologous locus in two genomes. (B) The first switch under intramolecular template switching. Here the nascent strand is used as template. (C) The first switch under intermolecular template switching. Here the strand across the fork is used as template. (D) The second switch returns the nascent strand into the original template, resulting in a perfect IR as represented by the upper sequence in A. Upper case letters represent the IR arms while red dots represent mismatches between the arms. The noncanonical template is marked with a red line. The direction of the replication fork is indicated with an arrow.
—Template switching converts an imperfect IR to a perfect one. (A) The upper and lower sequences represent a perfect and an imperfect IR, respectively, located in an orthologous locus in two genomes. (B) The first switch under intramolecular template switching. Here the nascent strand is used as template. (C) The first switch under intermolecular template switching. Here the strand across the fork is used as template. (D) The second switch returns the nascent strand into the original template, resulting in a perfect IR as represented by the upper sequence in A. Upper case letters represent the IR arms while red dots represent mismatches between the arms. The noncanonical template is marked with a red line. The direction of the replication fork is indicated with an arrow.Template switching is known to occur in numerous organisms including bacteriophage T4 (Ripley 1982), E. coli (Rosche et al. 1997; Viswanathan et al. 2000), S. cerevisiae (Hampsey et al. 1988), as well as human (Greenblatt et al. 1996; Bissler 1998). Template switching between IR arms was reported to be associated with mutation hotspots in the thyA and rpsL genes in bacteria (Mo et al. 1991; Viswanathan et al. 2000; Yoshiyama et al. 2001) and in numerous genetic diseases, such as thromboembolism, osteogenesis-related disease, familial hypercholesteremia, and Duchenne muscular dystrophy (Bissler 1998).Factors affecting template switching mutations include the directionality of the replication fork (Yoshiyama et al. 2001; Kim et al. 2013), the DNA strand (Seier et al. 2011), and the local sequence context (Schultz and Drake 2008). In yeast, quasi-palindrome to palindrome correction is enriched in highly transcribed regions (Kim et al. 2013). Fork replication problems as well as DNA repair proteins affect the rate of IR template switching. For example, the loss of the yeastRAD27 gene, a key player in Okazaki fragment maturation, increases the rate of template switching (Omer et al. 2017). Moreover, template switching of IRs is enriched in the absence of nucleotide excision repair genes and is dependent on translesion synthesis DNA polymerases (Kim et al. 2013).Previous studies analyzed short IRs in a genomic context (Cox and Mirkin 1997; van Noort, Worning, et al. 2003; Lisnić et al. 2005; Ladoukakis and Eyre-Walker 2008). These studies suggested that overrepresentation of short-perfect IRs in genomes stems from template switching. The studies mentioned above analyzed a single genome at a time, and hence the importance of template switching in the context of evolution was not directly evaluated. The analysis of orthologous IRs in closely related genomes, through a comparative evolutionary approach, is currently lacking.Here, we aimed to characterize IR dynamics from an evolutionary perspective in order to understand their long-term impact on genomes. To this end, we examined the abundance of perfect IRs and tested their evolutionary conservation in the context of 20 proteobacteria genomes.
Materials and Methods
Genome Sequences
All proteobacterial genomes reported in this study were obtained from GenBank (Actinobacillus equuli NZ_CP007715.1; Citrobacter amalonaticus NZ_CP011132.1; Citrobacter freundii NZ_CP012554.1; Enterobacter aerogenes NZ_CP011574.1; E. aerogenes NZ_CP011539.1; E. aerogenes NZ_CP011539.1; Enterobacter lignolyticus NZ_CP012871.1; Escherichia albertii NZ_CP007025; E. coli K-12 substr MG1655 U00096; Escherichia fergusonii NC_011740.1; Klebsiella variicola NZ_CP012871.1; K. variicola NC_013850.1; Kluyvera intermedia NZ_CP011602.1; Kosakonia radicincitans NZ_CP015113.1; Kosakonia sacchari NZ_CP016337.1; Leclercia adecarboxylata NZ_CP013990.1; Lelliottia amnigena NZ_CP015774.1; Raoultella ornithinolytica NC_021066.1; Salmonella enterica NZ_LN868943.1; Serratia fonticola NZ_CP011254.1; Shigella sonnei NC_007384).
Genome Partition
Noncoding (NC) regions of the E. coli K-12 substr MG1655 genome were determined according to the GenBank annotation. NC regions shorter than 10 bp were filtered out. This resulted in 3,436 NC sequence regions.
Single Genome Analyses
IRs Detection
IRs were detected in NC sequence regions of E. coli K-12 substr MG1655 as well as in simulated data sets (see below). We searched for IRs with an arm length of at least 5 bp using the EMBOSS palindrome package (Rice et al. 2000). In our search, we allowed a spacer of up to 70 bp between the two IR arms. This spacer was chosen since the probability of forming a cruciform with a longer spacer is small (Schroth and Ho 1995).
Sequence Simulation from the Null Model
Simulated data sets were generated according to two different algorithms. The first algorithm is based on the Fisher–Yates method (Fisher and Yates 1948), which takes the content of the input sequence and shuffles it to obtain a set of permutated sequences. The second algorithm is based on a second-order Markov chain. The parameters of the chain (base frequencies and transition probabilities, e.g., the probability to observe ‘A’ after the dinucleotide ‘AC’) were learnt from the concatenated NC sequence regions of E. coli K-12 substr MG1655. These parameters were used to generate random sequences. The second algorithm preserves the trinucleotide composition of the input sequence. For each algorithm, a data set of 100 simulated NC genomes (each simulation contained 3,436 NC sequence regions) were generated using a Perl script.
Statistical Test of Enrichment
In order to test whether the IR lengths observed in the real data come from the same distribution of IR lengths of the null data sets, we performed a likelihood ratio test. Let N (M) denote the total number of IRs across all real (simulated) NC regions. Let () denote the total number of IRs of length i across all NC regions in the real data set (simulated data sets). Due to small sample sizes, all IRs of length ≥ 13 were counted in (). We computed the proportion of IRs of length i in the real (simulated) collection as . Thus, a total of k = 9 length categories were considered (i = 5, …, 13). The set of values represents the maximum likelihood estimate for the proportions of each length-category in an alternative model (i.e., the real data set comes from its own distribution). The set of values represents the parameters for the proportions of each length-category under the null model. Either set of parameters can be used to compute the probability of observing the real data using a multinomial distribution. For computational reasons, the probability of observing the data under either model was first approximated using a multivariate normal distribution and then a likelihood ratio test was performed to select between the models.
Evolutionary Analysis
Sequence Segment Definitions and Orthologs Detection
The MG1655 genome contains 4,288 coding sequences (CDSs). We denote each CDS and the NC region downstream of it as “CDS–NC unit.” For each such CDS–NC unit, we added the CDS–NC unit immediately upstream and downstream of it. The resulting segment contained three CDSs and three NCs. We consider the middle CDS–NC unit as the focus. Such a three-unit segment was constructed for each identified CDS–NC unit, thus each CDS–NC unit served as the focus in its turn. Hereon, we denote each such resulting segment of three units as “reference query.” We have chosen to work with such a reference query to decrease the potential of erroneous classification of regions as orthologous due to horizontal gene transfer. Of note, each reference query has a middle NC region. We retained only reference queries for which the middle NC region contained at least one IR (2,320 out of 3,436). We then used each of the retained reference queries as input for a BLASTN search (BLAST+, Camacho et al. 2009) against the 20 proteobacterial genomes. The BLAST search was limited to hits with an E-value ≤ 0.001. For each such reference query, we chose the best hit in each of the 20 genomes, conditioning that this hit was similar in length to the reference query (at least 70% but no more than 130% in length of the reference query). If no hit met these conditions—no ortholog from that species was collected. Finally, we retained only orthologous sets with at least 10 orthologs. A total of 605 such orthologous sets were retained.
Aligning Orthologous Sequences and Tree Reconstruction
Each of the 605 orthologous sets was aligned using MAFFT V3.705 (Katoh et al. 2009) with default parameters for nucleotide alignment. The three CDS blocks were removed from the resulting multiple sequence alignment (MSA). Each resulting MSA thus contained only NC regions. Only MSAs of length >150 bp were retained for further analysis (234 out of 605). Each such MSA was given as input to PhyML V3.1 (Guindon et al . 2010 broken ref.) to reconstruct the maximum likelihood tree. The model of nucleotide substitution for the PhyML reconstruction was HKY (Hasegawa et al. 1985). Rate variation among sites was modeled by a discrete gamma distribution with four rate categories (Yang 1994).
Rooting Phylogenetic Trees
We generated rooted versions for each phylogenetic tree by first turning MG1655 into an internal node by adding a leaf below it with a branch of length 0. Next, the modified tree was rooted on MG1655 using the R package ape (Paradis et al. 2004).
Control Sequences
In order to compare conservation levels between an aligned IR segment and an aligned segment without an IR, we searched for IR-less alignment segments of the same length as the IR in the same orthologous set described above. We could identify such control regions for 207 NC regions.
Conservation Score
We defined an IR conservation score as the ratio between the number of species that displayed the exact IR sequence as the root (including the root) and the total number of species in the phylogenetic tree. The conservation score is defined as the average IR conservation score in a given NC focus region. The same computation was repeated with the control sequences replacing the IRs. This resulted in 234 conservation scores for IR regions and 207 conservation scores for control regions.
Sequence Simulations
Sequences were simulated along rooted phylogenetic trees using the module Simseq of the R package phangorn V2.2.0 (Schliep 2011). Each IR and each control sequence were simulated according to the phylogenetic tree using the inferred evolutionary model parameters in the region in which they reside. In each simulation, the starting root sequence was set to be the MG1655 sequence. For each IR and for each control sequence, we performed 1,000 simulations. IRs and control sequences in the same region were grouped together, thus obtaining a collection of 1,000 artificial copies for each region. In total, we computed 234,000 simulated conservation scores for IR regions and 207,000 simulated conservation scores for control regions.
Results
Characterization of IRs within the E. coli str. K-12 substr. MG1655 Genome
We first examined the prevalence and features of IRs in the E. coli str. K-12 substr. MG1655 genome (Blattner et al. 1997). The annotated genome is 4,639,221 bp in length and contains 4,288 CDSs. We identified 233,730 short IRs in coding regions and 27,678 short IRs in Noncoding (NC) regions (from here on “the NC collection”). All these short IRs had an arm length shorter than 20 bp and a spacer of up to 70 bp (see section Materials and Methods). Below, we focused on NC IRs as the selection and mutation processes on these IRs are not masked by other selective forces to preserve protein function.For the total of 27,678 IRs detected in the MG1655 NC collection, we found an average of nine IRs per NC region. As shown in figure 2, most NC regions include few IRs and only a small fraction of NC regions includes many IRs. We next accounted for the variation in the length of the NC regions by dividing either the number of detected IRs or the number of IR base-pairs by the total number of base-pairs in a given region. We found that the number of detected IRs increases linearly with the length of the region (R2 = 0.84, P < 2e−15, fig. 2) and an average ratio of 0.44 ± 0.13 of IR bases per NC region (fig. 2). We then examined the distribution of arm lengths across all detected IRs. We found an average arm length of 5.75 ± 1.16 bp per IR and an inverse relationship between the arm length and its prevalence across IRs (fig. 2).
. 2.
—Characteristics of detected IRs in the MG1655 genome. (A) The proportion of regions in the NC collection for which a certain number of IRs was detected. (B) The number of detected IRs in a region as a function of the length of the region. (C) The proportion of regions in a collection for which a certain ratio of IR base-pairs to total number of region base-pairs was detected. (D) The proportion of IRs for which a certain arm length was detected.
—Characteristics of detected IRs in the MG1655 genome. (A) The proportion of regions in the NC collection for which a certain number of IRs was detected. (B) The number of detected IRs in a region as a function of the length of the region. (C) The proportion of regions in a collection for which a certain ratio of IR base-pairs to total number of region base-pairs was detected. (D) The proportion of IRs for which a certain arm length was detected.Next, we aimed to determine whether the prevalence of IRs and their length distribution in the MG1655 genome are different from what one could expect when no special mechanism for IR creation or repair exists. Without such a special mechanism, IR regions are expected to evolve under the same mutation-selection regime as other NC regions (our null model). To this end, we first simulated sequence regions under these null conditions (see section Materials and Methods). Specifically, for each NC region in our data, we simulated 100 corresponding null regions. This yielded a total of 100 simulated null collections, each of which corresponds to the entire MG1655 collection. We then measured the prevalence and lengths of IRs in each simulated null collection in an identical manner to the characterization of the real MG1655 data set.When simulating the null collections, we examined two models: the Fisher and Yates (Fisher and Yates 1948) algorithm and a second-order Markov chain (see full details in the Materials and Methods section). The first algorithm offers a simple shuffle of the nucleotide base-pairs. The second model takes into account the trinucleotide composition of the region. The total number of IRs in the MG1655 NC collection was significantly greater than the total number of IRs seen in each null collection. This was true for both simulation schemes (Fisher–Yates and second-order Markov chain; empirical P < 0.01, fig. 3). Furthermore, we evaluated each region in the MG1655 collection individually by computing the proportion of null regions that contained an equal to or greater than number of IRs as detected for the real region. This proportion served as an empirical P-value for that region. Specifically, under null conditions one could expect a uniform distribution of P-values. However, when examining the distribution of the obtained empirical P-values, we noticed a strong inflation of small P-values (see supplementary fig. S1 for the two models, Supplementary Material online). This deviation is statistically significant (Kolmogorov–Smirnov test against a uniform distribution, P < 2e−7 in all cases). This suggests that null conditions, in which no mechanism exists to create or repair IRs are unlikely to explain the abundance of IRs observed in the MG1655 genome.
. 3.
—Detected IRs in the entire MG1655 genome compared to simulations. The total number of IRs in the MG1655 NC regions (bold dashed line) compared to the total number of IRs in each of its corresponding null collections.
—Detected IRs in the entire MG1655 genome compared to simulations. The total number of IRs in the MG1655 NC regions (bold dashed line) compared to the total number of IRs in each of its corresponding null collections.We have noticed that the length of the IR arm is an important factor in determining its deviation from the null expectation. Specifically, we first computed the proportion of IRs with arm length i in the MG1655 collection. We then computed these proportions in the null simulated collections. Next, a likelihood ratio test was performed to test whether the counts observed for the IR arm length categories in the real collection were likely to originate from the same distribution of IR arm lengths computed from the null simulations. We found that the IR arm length distribution of the MG1655 NC collection was significantly different from that computed from the null simulations using either the Fisher or the second-order Markov algorithm (likelihood ratio test with eight degrees of freedom; P < 1e−40 in both cases). We have further analyzed the deviation from expectation as a function of the IR arm length. We found that longer IRs are enriched in the MG1655 genome for all arms longer than 5 bp (table 1).
Table 1
The Observed/Expected Ratio for all IR Arm Length Categories
IR Arm Length Category
5
6
7
8
9
10
11
12
≥13
Fisher
0.89
1.07
1.39
2.3
3.88
8.26
24.64
49.08
152.86
Markov
0.91
1.04
1.29
2.04
3.36
7.12
20.44
38.2
148.86
Note.—Expected values were inferred using the Fisher and second-order Markov chain algorithms.
The Observed/Expected Ratio for all IR Arm Length CategoriesNote.—Expected values were inferred using the Fisher and second-order Markov chain algorithms.
Evolutionary Conservation of IRs in Proteobacteria
The above analysis suggests that perfect IRs are more abundant in a given genome than randomly expected. An important question follows this finding: are IRs conserved throughout evolution? To test the conservation of IRs in a broader evolutionary context, we collected orthologous sequence units for the MG1655 genome across 20 additional proteobacteria (see section Materials and Methods). Out of 27,678 IRs in MG1655 we were able to identify 914 IRs that appear in at least 10 species. These orthologs reside in 234 NC regions and were further analyzed. For each of these 234 regions, we computed a conservation score (see fig. 4 and Materials and Methods for full details).
. 4.
—Conservation analysis. (A) An example alignment of an IR and its mapping onto its corresponding phylogenetic tree, with the IR of the MG1655 as the root sequence. The IR conservation score is 7/11, since 7 out of 11 sequences are identical to the root IR. (B) Conservation score computation for the entire NC region, located between the ldtB and the yblT genes in the MG1655 genome, which contains three additional IRs. (C) Analysis of conservation of IRs in the region located between the ldtB and the yblT genes. The distribution of 1,000 conservation scores computed using simulated data is shown in blue and the conservation score value computed from real data is shown as a bold dashed line. The detailed template switching mechanism that can explain this example is shown in fig 1.
—Conservation analysis. (A) An example alignment of an IR and its mapping onto its corresponding phylogenetic tree, with the IR of the MG1655 as the root sequence. The IR conservation score is 7/11, since 7 out of 11 sequences are identical to the root IR. (B) Conservation score computation for the entire NC region, located between the ldtB and the yblT genes in the MG1655 genome, which contains three additional IRs. (C) Analysis of conservation of IRs in the region located between the ldtB and the yblT genes. The distribution of 1,000 conservation scores computed using simulated data is shown in blue and the conservation score value computed from real data is shown as a bold dashed line. The detailed template switching mechanism that can explain this example is shown in fig 1.Next, we aimed to compare the 234 conservation scores computed from the real NC regions to the expectation under null conditions, in which IRs evolve similarly to any other NC sequence. To this end, we generated 1,000 corresponding simulated data sets for each of the 234 real NC regions. For each such NC region, the empirical distribution of the 1,000 conservation scores served as a null distribution to which we compared the conservation score of the real region. An example of one NC region, which is more conserved than the null distribution, is shown in figure 4.Out of the 234 NC regions, 145 were found to have a significantly higher conservation score than the expectation under null conditions (empirical P < 0.05, fig. 5). Generally, the distribution of the 234 empirical P-values deviates from a uniform distribution with a strong inflation of small P-values (fig. 5). The high IR conservation level reported above could potentially result from a purifying selection regime acting on the NC region where the IRs are located. In order to test whether the elevated sequence conservation is specific to IRs or general to their location on the genome, we analyzed control sequences adjacent to the IRs. A lower conservation level of control sequences versus the IRs would suggest that IRs are more conserved than their surrounding sequences. We computed 234 conservation scores for IR regions and 207 control region conservation scores. Next, using simulations (see Materials and Methods), we computed empirical P-values for each such region, obtaining a total of 234 P-values for IR regions and 207 P-values for control regions. We found these distributions to be significantly different, with the IR conservation P-value distribution having a heavier left tail (Kolmogorov–Smirnov test, P < 0.028, fig. 5). These results indicate that the selective forces acting on IRs differ from their adjacent regions, suggesting the operation of a special mechanism to conserve IRs.
. 5.
—Comparison of conservation significance between IR regions and control regions. An empirical P-value was computed for each region based on its 1,000 corresponding simulated data sets. Shown in blue are the P-values for the IR regions and in green for the control regions.
—Comparison of conservation significance between IR regions and control regions. An empirical P-value was computed for each region based on its 1,000 corresponding simulated data sets. Shown in blue are the P-values for the IR regions and in green for the control regions.
Discussion
Theoretical studies suggested that genomes are enriched with short perfect IRs. This has been observed in individual genomes across the tree of life (Smith et al. 1995; Lisnić et al. 2005; Ladoukakis and Eyre-Walker 2008). The abundance of IRs in genomes implies a mechanism, acting to maintain perfect IRs in genomes. The mechanism of template switching that was reported in experimental systems to homogenize IR arms during replication is the best known mechanism to explain the enrichment of perfect IRs in genomes (Lovett 2017).Here, we chose to focus on E. coli, in which the template switching mechanism was studied the most. Considering the single E. coli MG1655 genome, we first validated that perfect IRs are more abundant in E. coli than expected by chance. The added value of our analysis of a single genome is the findings concerning the arm length. Our results clearly indicate that the deviation from the null expectation increases as a function of the arm length (table 1). This deviation can be explained as longer IRs are more likely to form secondary structures, which promote template switching (Strawbridge et al. 2010).The main contribution of this study is the analysis of IRs using an evolutionary genomics approach. If template switching occurs in genomes then we expect it to contribute to the correction of imperfect IRs to perfect ones, resulting in the conservation of IRs through evolution. This hypothesis is in line with the observed overabundance of perfect IRs when analyzing single genomes. However, the conservation of IRs was hitherto never demonstrated in a comparative evolutionary manner and here we have shown it for proteobacterial data. Specifically, the rejection of the null hypothesis, according to which there is no special mechanism to retain and preserve IRs, supports the model of IR arm corrections by template switching through evolution.We propose an illustrative model of IRs evolutionary dynamics (fig. 6). The model consists of two major evolutionary forces: substitution and small insertion–deletion (indel) events and a correction mechanism (template switching). The contribution of the correction mechanism likely depends on the arm and spacer lengths. Changes in these lengths should affect secondary structures of DNA, and hence the rate of template switching. Although we have shown the evolutionary conservation of IRs only in proteobacteria, given the prevalence of template switching in other organisms (Dutra and Lovett 2006; Löytynoja and Goldman 2017) we speculate that the model holds across large sections of the tree of life.
. 6.
—Two main processes (template switching in red and the formation of substitutions and indels in black) are dictating the dynamics of short IRs. In this scheme there are three sequence states and the transition between them occurs due to the two processes. Black arrows indicate substitutions and indels, red arrows indicate template switching. The width of the arrows indicates the rate of each process.
—Two main processes (template switching in red and the formation of substitutions and indels in black) are dictating the dynamics of short IRs. In this scheme there are three sequence states and the transition between them occurs due to the two processes. Black arrows indicate substitutions and indels, red arrows indicate template switching. The width of the arrows indicates the rate of each process.Much research effort was previously invested in accurate modeling of base pair substitutions and recently, also in indel events (Aris-Brosou et al. 2012; Ezawa 2016; Levy Karin et al. 2017). In contrast, very little effort was invested in modeling the evolutionary dynamics of IRs. Of note, a model that accounts for IR evolutionary dynamics and template switching in particular between pairs of closely related genomes was recently presented by Löytynoja and Goldman (2017). Their work and ours should ideally be extended to a generative model for sequence evolution, in which IR corrections and generations are explicitly accounted for. In the same vein, IRs are currently ignored in most evolutionary analyses, including alignment algorithms, phylogenetic tree inference, ancestral sequence reconstruction, and molecular dating. In such cases, differences between orthologous IRs are considered as independent events, which is clearly not the case. The impact of ignoring IRs in such evolutionary analyses remains to be studied.
Supplementary Material
Supplementary data are available at Genome Biology and Evolution online.Click here for additional data file.
Authors: F R Blattner; G Plunkett; C A Bloch; N T Perna; V Burland; M Riley; J Collado-Vides; J D Glasner; C K Rode; G F Mayhew; J Gregor; N W Davis; H A Kirkpatrick; M A Goeden; D J Rose; B Mau; Y Shao Journal: Science Date: 1997-09-05 Impact factor: 47.728
Authors: Jian-Min Chen; David N Cooper; Nadia Chuzhanova; Claude Férec; George P Patrinos Journal: Nat Rev Genet Date: 2007-09-11 Impact factor: 53.242
Authors: Kirill V Mikhailov; Boris D Efeykin; Alexander Y Panchin; Dmitry A Knorre; Maria D Logacheva; Aleksey A Penin; Maria S Muntyan; Mikhail A Nikitin; Olga V Popova; Olga N Zanegina; Mikhail Y Vyssokikh; Sergei E Spiridonov; Vladimir V Aleoshin; Yuri V Panchin Journal: Nucleic Acids Res Date: 2019-07-26 Impact factor: 16.971