
Conserved intergenic sequences revealed by CTAG-profiling in Salmonella: thermodynamic modeling for function prediction.

Le Tang1,2,3, Songling Zhu1,2, Emilio Mastriani1,2, Xin Fang1,2, Yu-Jie Zhou1,2, Yong-Guo Li4, Randal N Johnston5, Zheng Guo6, Gui-Rong Liu1,2, Shu-Lin Liu1,2,4,7.   

Abstract

Highly conserved short sequences help identify functional genomic regions and facilitate genomic annotation. We used Salmonella as the model to search the genome for evolutionarily conserved regions and focused on the tetranucleotide sequence CTAG for its potentially important functions. In Salmonella, CTAG is highly conserved across the lineages and large numbers of CTAG-containing short sequences fall in intergenic regions, strongly indicating their biological importance. Computer modeling demonstrated stable stem-loop structures in some of the CTAG-containing intergenic regions, and substitution of a nucleotide of the CTAG sequence would radically rearrange the free energy and disrupt the structure. The postulated degeneration of CTAG takes distinct patterns among Salmonella lineages and provides novel information about genomic divergence and evolution of these bacterial pathogens. Comparison of the vertically and horizontally transmitted genomic segments showed different CTAG distribution landscapes, with the genome amelioration process to remove CTAG taking place inward from both terminals of the horizontally acquired segment.

Year:  2017        PMID: 28262684      PMCID: PMC5337935          DOI: 10.1038/srep43565

Source DB:  PubMed          Journal:  Sci Rep        ISSN: 2045-2322            Impact factor:   4.379


Bacterial genomes may contain highly conserved short sequences that are lineage-specific, such as the tetranucleotide sequence CTAG in Escherichia coli and Salmonella12. Genomic regions containing such conserved short sequences carry information about the evolution and phylogenetic divergence of the bacteria, but so far such information, especially that embedded in the intergenic regions, has not been efficiently extracted, due largely to the lack of theoretical or methodological strategies to reveal such conserved sequences. As a result, intergenic regions of the bacterial genomes have been poorly annotated in contrast to genes, which researchers often compare with those of E. coli3 for annotation, taking the advantage that functions of many of the E. coli genes have been validated by molecular biology experiments. Such a situation calls for novel approaches to facilitate genomic annotation, and characterization of conserved short sequences may prove instrumental in the identification of functional intergenic sequences. For this purpose, we have focused on the identification and characterization of short sequences in the bacterial genome, using representative Salmonella lineages as the models. Previously, we reported that the CTAG-containing endonuclease cleavage sites such as that of XbaI (TCTAGA) are highly conserved45. In a recent study, we found that many of such cleavage sites are located in common intergenic regions across Salmonella and E. coli1. This finding strongly suggests the potential importance of such sequences and brings up the hopes that the identification and functional analyses of such genomic regions may facilitate bacterial genome annotation and functional studies. We anticipate that the highly conserved intergenic regions may be involved in gene expression regulation and pathogenic evolution. 
In the case of Salmonella, more enhanced genome annotation strategies are especially needed for understanding the divergence processes that have created diverse and distinct pathogens from common ancestors. One of the main reasons for us to use Salmonella as the models in this study is that this genus contains the most widely distributed bacterial lineages known to date, which are genetically highly similar among them but pathogenically each drastically distinct6, with the genetic factors that have made these closely related bacteria different pathogens remaining largely unknown. Since the first isolation of a Salmonella strain from a typhoid patient in 1881, more than 2600 Salmonella lineages, classified as serotypes based on their different combinations of the somatic (O) and flagellar (H) antigens, have been documented7. The genetic similarity among the Salmonella serotypes was revealed by comparison of genome structures in the last century89 and genomic sequences since the beginning of this century101112, although these bacteria cause different diseases, from self-limited gastroenteritis (such as S. typhimurium, S. enteritidis, etc.) to potentially fatal systemic infections like typhoid (such as S. typhi)13. The current bacterial taxonomy tends to classify the gastroenteritis-causing S. typhimurium and the typhoid-causing S. typhi into the same subspecies as Salmonella enterica subspecies enterica serovars Typhimurium and Typhi, respectively, however we and many other authors have continued using the traditional nomenclature to avoid confusion; we explained the rightfulness of treating S. typhimurium and S. typhi as separate species in a recent perspective article14. Previous studies have shown that the Salmonella lineages that elicit similar disease phenotypes may have evolved by either divergent or convergent processes12. However, whether similar pathogens may use the same set or overlapping sets of genes for the infections remains unclear. 
Further, very possibly, some elements encoded by intergenic sequences might be involved in the modulation of virulence expression. Complicating this issue is the fact that more than half of the genes in Salmonella genomes still await convincing annotation, and deciphering the seemingly non-coding sequences in the intergenic regions is even more challenging. In this study, we profiled the CTAG sequences, which are relatively rare in Escherichia coli and Salmonella15, in representative Salmonella serotypes. To evaluate their level of conservation in evolution, we determined their genomic distribution in comparison with E. coli and other bacteria. We hypothesize that it is the potential biological significance of the intergenic CTAG-containing sequences that have made them highly conserved, therefore they survived the elimination process by the Very Short Patch (VSP) repair mechanism2. We ranked the profiled CTAG sequences according to their theoretical importance in function by their levels of evolutionary conservation. As the most highly conserved sequences may retain key functions, we focused on the CTAG sequences that are present in the least annotated intergenic regions across Salmonella and E. coli. Computer modeling demonstrated that, for the CTAG sequences conserved in bacteria across the Enterobacteriaceae genera, substitution of any of the four nucleotides may disrupt the stem-loop structure of the CTAG-containing intergenic sequence, potentially affecting their biological functions. We also documented CTAG sequence degeneration patterns and found that the CTAG sequence decays in serotype-specific ways. Of particular interest, the degeneration processes of CTAG seemed to be current and still going on at different stages in the genome.

Methods

Bacterial strains

Information on all bacterial strains used in this study can be found at the Salmonella Genetic Stock Center (http://www.ucalgary.ca/~kesander/), and the accession numbers of the sequenced genomes can be found at http://www.ncbi.nlm.nih.gov/genome. We use the traditional Salmonella nomenclature for reasons detailed in a previous publication14.

Construction of phylogenetic trees

We aligned the 16S rDNA sequences of the bacterial strains in comparison using the Clustal X program and constructed the phylogenetic trees by MEGA6 using the Neighbor-Joining algorithm.

General strategies of computer modeling to predict secondary structures of short sequences

The modeling was conducted based on the Free Energy Minimization method161718. As free energy models typically assume that the total free energy of a given secondary structure for a molecule is the sum of independent contributions of adjacent or stacked base pairs in stems, we particularly focused on the stem-loop structures. Specifically, we employed the Vienna RNA Package (http://www.tbi.univie.ac.at/RNA/), which consists of a C code library and several stand-alone programs, including RNAfold. This program reads RNA sequences from standard input (stdin), calculates their minimum free energy (MFE) structure and prints it to standard output. The mountain.pl script produces a mountain plot, which is an xy-diagram plotting the number of base pairs enclosing a sequence position. The resulting plot shows three curves (red, black and green, respectively; Fig. 1), with the red one showing two peaks derived from the MFE structure, the black one demonstrating the pairing probabilities, and the green one indicating the positional entropy. Well-defined regions are identified by low entropy. By superimposing several mountain plots, the structures can easily be compared.
Figure 1

Mountain plot representing the modeled secondary structure by height versus position.

The height m(k) is given by the number of base pairs enclosed at position k. Three curves are shown: the MFE structure (red), the pairing probabilities (black) and a positional entropy curve (green). Well-defined regions are identified by low entropy.
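As an illustration of the height m(k) defined above, the following minimal Python sketch derives the MFE mountain curve from a dot-bracket structure string of the kind RNAfold prints. It is not the mountain.pl implementation. The function name `mountain_heights` is hypothetical, and the convention that a paired position is not counted as enclosed by its own pair is an assumption. The pairing-probability and positional-entropy curves are not covered.

```python
def mountain_heights(dot_bracket):
    """Return m(k), the number of base pairs enclosing each position k.

    Assumption: a paired position is not counted as enclosed by its own pair.
    """
    heights = []
    depth = 0  # number of base pairs currently open around the position
    for ch in dot_bracket:
        if ch == "(":
            heights.append(depth)   # enclosed by the pairs opened before it
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced structure: unmatched ')'")
            heights.append(depth)   # its own pair is closed and not counted
        elif ch == ".":
            heights.append(depth)   # unpaired base inside `depth` pairs
        else:
            raise ValueError(f"unexpected character {ch!r}")
    if depth != 0:
        raise ValueError("unbalanced structure: unmatched '('")
    return heights


if __name__ == "__main__":
    # Toy hairpin for illustration only, not a structure from this study.
    print(mountain_heights("((((...))))"))
```

A stem-loop appears as a single peak whose plateau corresponds to the loop, which is why superimposed mountain curves make structural differences easy to see.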

To test the reliability of the modeling, we also used the Maximum Expected Accuracy (MEA) method by the program CONTRAfold19 and predicted the most probable structure and the pseudoknot-free structures by maximizing the sum of the base-paired and single-stranded nucleotide probabilities, called the expected accuracy, where pairing probabilities can be weighted by a specific factor. CONTRAfold uses probabilistic parameters learned from a set of RNA secondary structures to predict base-pair probabilities P(i, j) and then predicts structures using the maximum expected accuracy approach. In this study, we used the MaxExpect program (http://rna.urmc.rochester.edu/RNAstructureWeb), which is part of the RNAstructure package by Mathews Lab, University of Rochester Medical Center, Department of Biochemistry and Biophysics. To run the programs for the prediction, we used an HPC cluster based on Ubuntu 14.04.2 LTS (Trusty), employing the Sun Grid Engine 6.2u5–7.3 amd64 as queue manager and scheduler to accept jobs. We ran the programs on the cluster in a “trivially parallel” manner and obtained the results from RNAfold v. 2.19 and MaxExpect v. 5.6 linked to Perl v.5.18.2. Throughout the modeling, we used both strategies to crosscheck each other’s performance and evaluate the quality of the obtained predictions.

Using the RNAfold program to model the structure

We used two scripts, mountain.pl and relplot.pl, which are part of the core routines shipped with the Vienna RNA Package (www.tbi.univie.ac.at/RNA/), with the former for predicting pair probabilities within the equilibrium ensemble and the latter for producing a diagram of the predicted structure containing also information about probability. The Perl script relplot.pl adds reliability information to an RNA secondary structure plot and computes a well-definedness measure, which we call “positional entropy” (Fig. 1).

Secondary structure prediction by the MEA approach

Using the MaxExpect program, we executed the following command: MaxExpect --sequence LT2-SpeI.fasta LT2-SpeI.out --gamma 1 --percent 10 --structures 20 --window 3. This will generate a file containing information about the predicted structure.

Illustration of CTAG frequency and landscape on genomic sequences

The frequency of the CTAG sequence was analyzed using our own Perl scripts. To profile the CTAG sequences, we first located all occurrences of CTAG and then extended the analysis to 50 bp of the genomic sequence both up- and down-stream of each CTAG. We used the resulting 104 bp sequence (50 + 4 + 50 bp) as the query in a search against a BLAST database of subject sequences. To illustrate the CTAG frequency on the genomic sequence, we showed the numbers of the tetranucleotide per 10 kb window.
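The profiling steps described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' Perl scripts. All function names are hypothetical. The genome is treated as a linear string, so flanks are truncated near the ends rather than wrapped around a circular chromosome. Positions are 0-based.

```python
def find_sites(genome, motif="CTAG"):
    """Return the 0-based start position of every occurrence of the motif."""
    positions = []
    start = genome.find(motif)
    while start != -1:
        positions.append(start)
        start = genome.find(motif, start + 1)
    return positions


def flanked_query(genome, pos, motif_len=4, flank=50):
    """Return the motif with up to `flank` bp on each side.

    The result is 104 bp long unless the site lies within `flank` bp of an
    end of the (assumed linear) sequence.
    """
    left = max(0, pos - flank)
    right = min(len(genome), pos + motif_len + flank)
    return genome[left:right]


def counts_per_window(positions, genome_len, window=10_000):
    """Count motif start positions in consecutive, non-overlapping windows."""
    n_windows = (genome_len + window - 1) // window  # ceiling division
    counts = [0] * n_windows
    for pos in positions:
        counts[pos // window] += 1
    return counts


if __name__ == "__main__":
    # Synthetic toy sequence, not genomic data.
    toy = "A" * 60 + "CTAG" + "T" * 60
    sites = find_sites(toy)
    print(sites, len(flanked_query(toy, sites[0])))
    print(counts_per_window(sites, len(toy), window=50))
```

Because CTAG is its own reverse complement, scanning one strand is enough to find every site on both strands.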

Determining the level of sequence conservation

To determine how a given sequence is conserved across different bacteria, we searched it against the NCBI databases by the BLAST service and ranked the level of conservation by the set of parameters including Max and Total scores, E-values and sequence coverage and percent identity. In the case of the inter-lpp-pykF sequence, we used the sequence as query in the search against the database (https://www.ncbi.nlm.nih.gov/genome) that contains all published genomes of Enterobacteriaceae family.

Results

Profiling the CTAG sequence in bacterial genomes

The tetranucleotide sequence CTAG is remarkably under-abundant in E. coli and Salmonella as previously evidenced by the relatively small numbers of endonuclease cleavage sites containing CTAG such as XbaI (TCTAGA), BlnI or AvrII (CCTAGG), and SpeI (ACTAGT)51520, a phenomenon of biased codon usage intensively studied in E. coli2122. E. coli and closely related bacteria contain the Very Short Patch repair system, which tends to eliminate CTAG where possible in the genome223. To quantitatively validate the scarcity of the CTAG sequence in Salmonella and E. coli, we profiled the tetranucleotides consisting of one each of C, T, A and G ordered randomly in the genome, estimating the expected frequency of any one such combination by the formula $f = p(C) \times p(T) \times p(A) \times p(G)$, where p(X) is the genomic frequency of nucleotide X. In S. typhimurium LT2, p(C), p(T), p(A) and p(G) are 0.26, 0.24, 0.24 and 0.26, respectively1022, so the frequency of a random combination of C, T, A and G comes to 0.00389376, which gives an estimated 18914 occurrences of each such combination in the genome of S. typhimurium LT2, which is 4857432 bp. When we actually profiled the tetranucleotide sequences consisting of one each of C, T, A and G in S. typhimurium LT2 and representative strains of S. typhi, S. paratyphi A, B, C, and S. gallinarum in comparison with E. coli K12, we found that the majority of the combinations have numbers greater than twelve thousand in all Salmonella genomes analyzed and the numbers of many of them are close to the estimated 18914 (Table 1), e.g., CAGT (18347), TGAC (19229), ACTG (18472) or GATC (19168).
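The estimate quoted above is consistent with taking the product of the four nucleotide frequencies and multiplying by the genome length. The sketch below reproduces that arithmetic and shows one way to tally the 24 tetranucleotides that contain each of C, T, A and G once. The function names are hypothetical, and counting on a single strand of a linear sequence is an assumption made for illustration.

```python
from itertools import permutations


def expected_count(base_freqs, genome_len):
    """Expected frequency and count of one ordering of C, T, A and G,
    assuming the four positions are independent."""
    freq = 1.0
    for base in "CTAG":
        freq *= base_freqs[base]
    return freq, freq * genome_len


def count_permutation_tetramers(genome):
    """Count every tetranucleotide made of one each of A, C, G and T."""
    counts = {"".join(p): 0 for p in permutations("ACGT")}
    for i in range(len(genome) - 3):
        word = genome[i:i + 4]
        if word in counts:
            counts[word] += 1
    return counts


if __name__ == "__main__":
    # Nucleotide frequencies and genome size as stated in the text for LT2.
    lt2 = {"C": 0.26, "T": 0.24, "A": 0.24, "G": 0.26}
    freq, n = expected_count(lt2, 4857432)
    print(round(freq, 8), round(n))
```

Against this expectation of roughly 18914, the 850 CTAG sites observed in LT2 amount to less than one twentieth of the estimate.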
Table 1

Numbers of tetranucleotide sequences consisting of one each of C, T, A and G in representative strains of Salmonella and E. coli.

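Tetranucleotide | S. typhimurium LT2 | S. typhi Ty2 | S. paratyphi A ATCC9150 | S. paratyphi B SPB7 | S. paratyphi C RKS4594 | S. gallinarum 287/91 | E. coli K12
CTAG | 850 | 1025 | 858 | 861 | 928 | 924 | 885
AGTC | 9810 | 9800 | 9350 | 9985 | 9975 | 9269 | 9377
GACT | 9693 | 9625 | 9142 | 9644 | 9686 | 9366 | 9528
GCTA | 13201 | 12923 | 12504 | 12937 | 13015 | 12551 | 10608
GTAC | 12993 | 12788 | 12319 | 13063 | 13053 | 12549 | 12036
TAGC | 12983 | 13053 | 12516 | 13317 | 13168 | 12835 | 10606
AGCT | 13948 | 14029 | 13220 | 14084 | 14016 | 13355 | 13333
TCGA | 14306 | 13999 | 13511 | 14336 | 14308 | 13695 | 15457
ACGT | 15426 | 15168 | 14624 | 15435 | 15423 | 14895 | 14545
TGCA | 15872 | 15904 | 14995 | 15947 | 15812 | 15041 | 19761
CATG | 16194 | 16140 | 15360 | 16326 | 16174 | 15542 | 15246
TACG | 16501 | 16186 | 15580 | 16380 | 16402 | 15993 | 14101
CGTA | 16431 | 16339 | 15759 | 16658 | 16374 | 15855 | 14324
ACTG | 18472 | 18294 | 17357 | 18406 | 18318 | 17524 | 20435
CAGT | 18347 | 18496 | 17303 | 18571 | 18614 | 17431 | 20477
GTCA | 19434 | 19281 | 18510 | 19691 | 19871 | 18250 | 18388
GATC | 19168 | 18787 | 18097 | 19138 | 18922 | 18455 | 19120
TGAC | 19229 | 18976 | 17926 | 18981 | 18742 | 18654 | 18580
ATGC | 21823 | 21318 | 20664 | 21794 | 21473 | 21041 | 21733
GCAT | 21915 | 21835 | 20568 | 22005 | 21991 | 20890 | 21685
CTGA | 24470 | 23934 | 22577 | 24204 | 24023 | 22834 | 24365
TCAG | 23808 | 24177 | 22699 | 24459 | 24418 | 23114 | 24638
CGAT | 26823 | 26068 | 25427 | 26801 | 26339 | 26146 | 24248
ATCG | 26940 | 26430 | 25439 | 27020 | 27157 | 25781 | 24354
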
To test the postulation that the CTAG sequence scarcity seen in Salmonella1 is not a general feature of bacteria at large, we screened sequenced strains of phylogenetically diverse bacteria to document the number of the CTAG sequence and determine its frequency per kb genome (Supplementary Table 1). We found that the CTAG sequence frequency varies widely among the bacteria, from as low as 0.023 per kb in Thermodesulfovibrio yellowstonii DSM 11347 (Phylum: Nitrospira, Class: Nitrospira, Order: Nitrospirales, Family: Nitrospiraceae; Chromosome 1, GC percentage: 34%) to as high as 3.933 per kb in Thermobaculum terrenum ATCC BAA-798 (Thermobaculum; GC percentage: 48%). E. coli K12 and S. typhimurium LT2 had a CTAG sequence frequency of 0.191 and 0.175 per kb, respectively. It is worth noting that bacteria with CTAG sequence frequencies lower than those of E. coli and Salmonella were seen in both low and high GC categories, demonstrating that the CTAG sequence frequency is not correlated with GC contents. Even within a narrow range of GC compositions, such as GC percentages of >45 and <=55%, which cover those of E. coli and Salmonella (around 51–52%), CTAG sequence frequencies vary remarkably, e.g., from 0.095 per kb in Prosthecochloris aestuarii DSM 271 (GC 50%) to 3.933 per kb in T. terrenum ATCC BAA-798 (GC 48%). Additionally, at the level of phyla, bacteria having very different GC contents and CTAG sequence frequencies are mixed without a noticeable phylogenetic tendency. For example, bacteria as closely related as Aminobacterium colombiense DSM 12261 and Thermanaerovibrio acidaminovorans DSM 6589 may have GC compositions as different as 45% and 64%, and CTAG sequence frequencies as different as 0.988 and 0.684 per kb, respectively (see arrows in Fig. 2A, and Supplementary Table 1). Therefore, at the highest evolutionary branches, bacteria do not exhibit phylogenetic tendencies of CTAG frequency or GC content.
Figure 2

Phylogenetic tree of bacterial strains based on 16S rDNA sequence comparison.

(A) Bacterial strains representing a wide range of phyla; color categories for GC percentages: black, GC up to 45%; red, GC 46–55%; blue, GC >55% (see Supplementary Table 1). (B) Bacterial strains representing main branches of the Proteobacteria Phylum; purple color indicates bacteria that had lowest CTAG frequencies among the strains compared (see Supplementary Table 2). (C) Bacterial strains representing main branches of the Gammaproteobacteria Class; orange color indicates bacteria that had lowest CTAG frequencies among the strains compared (see Supplementary Table 3).

Within the Proteobacteria Phylum, bacteria among different Classes have a much narrower range of GC compositions, i.e., from 45 to 55%, but their CTAG sequence frequencies still vary broadly, from 0.125 to 4.590 per kb (Supplementary Table 2), demonstrating that low CTAG sequence frequencies are not common either within the Proteobacteria branch. Of interest, bacteria with lower CTAG frequencies tended to cluster to the Gammaproteobacteria Class, which contains the Enterobacteriaceae including the Genera Salmonella and Escherichia (Fig. 2B). When we focused on representative bacteria among the Gammaproteobacteria branches, we found that bacteria with low CTAG frequencies, such as Salmonella and Escherichia, where the CTAG frequencies are around 0.2 per kb or lower, mostly belong to the Enterobacteriaceae Family (Fig. 2C), although many Enterobacteriaceae bacteria had much higher CTAG frequencies, e.g., >0.7 per kb in Yersinia (Supplementary Table 3). Therefore, CTAG frequency as low as those of Salmonella and Escherichia is not a general feature of bacteria of the Enterobacteriaceae Family but is common only to Salmonella, Escherichia and their close relatives (Table 2).
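The per-kb frequencies compared above follow directly from the number of CTAG sites and the genome size. A minimal sketch, using three entries whose values are taken from Table 2, is given below. The function names are hypothetical, and the cutoff of 0.2 per kb is only an illustrative reading of "around 0.2 per kb or lower".

```python
def ctag_per_kb(n_ctag, genome_size_bp):
    """CTAG occurrences per kilobase of genome."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return n_ctag * 1000 / genome_size_bp


# (strain, genome size in bp, number of CTAG) as listed in Table 2
STRAINS = [
    ("S. typhimurium LT2", 4857432, 850),
    ("E. coli K-12 MG1655", 4641652, 885),
    ("Erwinia tasmaniensis Et1/99", 3883467, 996),
]


def low_ctag_strains(strains, cutoff=0.2):
    """Names of strains whose CTAG frequency is at or below the cutoff.

    The default cutoff is an illustrative assumption, not a threshold
    defined in the study.
    """
    return [name for name, size, n in strains if ctag_per_kb(n, size) <= cutoff]


if __name__ == "__main__":
    for name, size, n in STRAINS:
        print(name, round(ctag_per_kb(n, size), 3))
    print(low_ctag_strains(STRAINS))
```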
Table 2

Phylogenetic distribution of bacteria having low CTAG sequence frequencies.

Bacterial strain | Genome size (bp) | Number of CTAG | CTAG/kb | GC %
Morganella morganii KT | 3799539 | 408 | 0.107 | 0.51
Citrobacter koseri ATCC BAA-895 | 4720462 | 609 | 0.129 | 0.54
Enterobacter cloacae EcWSU1 | 4734438 | 743 | 0.157 | 0.55
Rahnella
R. sp. Y9602 | 4864217 | 810 | 0.167 | 0.52
R. aquatilis HX2 | 4962173 | 920 | 0.185 | 0.52
Salmonella
S. enteritidis P125109 | 4685848 | 807 | 0.172 | 0.52
S. typhi CT18 | 4809037 | 1026 | 0.213 | 0.52
S. paratyphi A ATCC9150 | 4585229 | 858 | 0.187 | 0.52
S. paratyphi C RKS4594 | 4833080 | 928 | 0.192 | 0.52
S. typhimurium LT2 | 4857432 | 850 | 0.175 | 0.52
S. typhimurium DT104 | 4933631 | 957 | 0.194 | 0.52
S. bongori NCTC 12419 | 4460105 | 716 | 0.161 | 0.51
Escherichia
E. fergusonii ATCC 35469 | 4588711 | 784 | 0.171 | 0.50
E. coli K-12 MG1655 | 4641652 | 885 | 0.191 | 0.51
E. coli IAI39 | 5132068 | 951 | 0.185 | 0.51
E. coli SE15 | 4717338 | 916 | 0.194 | 0.51
E. coli UM146 | 4993013 | 1034 | 0.207 | 0.51
E. coli O157:H7 EDL933 | 5528445 | 1176 | 0.213 | 0.50
Shigella boydii Sb227 | 4519823 | 1066 | 0.236 | 0.51
Pantoea
P. ananatis LMG 5342 | 4605545 | 987 | 0.214 | 0.53
P. sp. At-9b | 4368708 | 555 | 0.127 | 0.55
Erwinia tasmaniensis Et1/99 | 3883467 | 996 | 0.256 | 0.54

Evolutionary implications of the CTAG sequence

Assuming that the common ancestor of Salmonella and E. coli had a much higher frequency of CTAG 200 million years back in evolution, one would anticipate seeing different patterns of the degeneration process of the CTAG sequence among the descendants of the assumed ancestor. In Salmonella, all serotypes have diverged from their ancestors during adaptation to their specific niches under distinct selection pressures. To look into the hypothesized degeneration patterns of the CTAG sequence in different lineages of the bacteria, we profiled CTAG and those deemed homologous to CTAG but with one or two of the four nucleotides replaced by other nucleotides in representative Salmonella strains and found lineage-specific patterns of the degeneration. For example, the CTAG in gene yiaE is conserved in S. typhimurium and S. heidelberg but is degenerated to CTGG in the other Salmonella lineages compared; similarly, the CTAG between genes yjjY and lasT is conserved in most analyzed Salmonella lineages except S. heidelberg, where this tetranucleotide is degenerated to ATAG (Table 3). Overall, each analyzed Salmonella lineage has a distinct pattern of CTAG degeneration, reflected by “signature CTAG degeneracies” such as the ATAG between genes yjjY and lasT in S. heidelberg or combinations of specific CTAG degeneracies, which are present in all strains of a given Salmonella lineage (lineage-specific CTAG degeneration pattern; Supplementary Table 4). For instance, all eight strains of S. typhimurium analyzed in this study have a common pattern and all three strains of S. typhi have another pattern (e.g., the sequence CTAG at the genomic location 2948212–2948215 of S. typhimurium LT2 and all other analyzed Salmonella strains is conserved except S. typhi, in which the sequence is CTAT; Supplementary Table 4). Interestingly, many degenerated sequences are common to several bacterial lineages (e.g., CTAG in cdsA of S. typhimurium strain D23580 is CTGG in all other Salmonella strains compared here including those of S. bongori). Among the degenerated forms, CTGG is the most common degenerated sequence across all bacterial lineages compared throughout the genome, and many other degenerated forms, such as CTCG, are also conserved in certain locations, both reflecting a tendency of preferred nucleotide composition in the amelioration process. We anticipated that comparison of similar and dissimilar CTAG degeneration patterns among the Salmonella lineages may lead to novel discoveries about the divergence and evolution of the diverse pathogens originating from a common ancestor.
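The notion of a tetranucleotide "homologous to CTAG but with one or two of the four nucleotides replaced" can be illustrated by counting mismatches against CTAG at an already aligned site. The sketch below is a hedged illustration, not the authors' procedure. It assumes the homologous positions have been established beforehand by genome comparison, and the function names and labels are hypothetical.

```python
def classify_site(tetramer, reference="CTAG"):
    """Return (number of substitutions, label) for an aligned tetranucleotide."""
    tetramer = tetramer.upper()
    if len(tetramer) != len(reference):
        raise ValueError("tetramer must have the same length as the reference")
    mismatches = sum(a != b for a, b in zip(tetramer, reference))
    if mismatches == 0:
        label = "conserved"
    elif mismatches == 1:
        label = "one substitution"
    elif mismatches == 2:
        label = "two substitutions"
    else:
        label = "not considered homologous"
    return mismatches, label


def degeneration_pattern(site_by_lineage):
    """Map each lineage to its mismatch count at one aligned site."""
    return {lineage: classify_site(seq)[0] for lineage, seq in site_by_lineage.items()}


if __name__ == "__main__":
    # Tetranucleotides at the yjjY-lasT site for three lineages, as in Table 3.
    yjjY_lasT = {"S. tm": "CTAG", "S. ty": "CTAG", "S. he": "ATAG"}
    print(degeneration_pattern(yjjY_lasT))
```

Collecting such per-site results over many sites gives a profile per lineage, and lineages that share the same set of degenerated sites would show the same profile.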
Table 3

Lineage-specific CTAG degeneration patterns in Salmonella and E. coli K12.

LT2 siteGene nameS. tmS. tyS. pAS. puS. gaS. enS. chS. pCS. duS. heS. agS. neS. scS. jaS. arS. boE. coli
257125pyrH&frrCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAC
264404yaeTCTAGCTAGCTAG  CTAG  CTAGCTAGCTAG      
440769yaiI&aroLCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG   
1426591aroH&ydiACTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG CTAG 
1459627lpp&pykFCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG
1597093marB&marACTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG @ 
1818367trpH&trpLCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG 
1818375trpH&trpLCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG 
1977607eda&eddCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTACCTTGCTTG
2023278yecGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAACTTGCTTG
2149464yeeZ&hisGCTAG CTAGCTAGCTAGCTAG  CTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG
2174259rfbICTAGCTAGCTAGCTAGCTAGCTAG  CTAGCTAGCTAGCTAGCTAGCTAG   
2399521yfaXCTAGCTTGCTTGCTTGCTTGCTTGCTAGCTAGCTTGCTAGCTTGCTAGCTTG    
2440495lrhACTAGCTATCTAGCTAGCTAGCTAGCTAACTATCTAGCTAACTAACTAGCTAACTATCTATCTAA 
2506054argW&pgtECTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG   
2800018gltWCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG
2816155rimMCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGATAGCGAGCGAG
2914911hinCTAG CTAG   CTACCTTG CTAG CTAGCTAGCTAG   
3098680eno&pyrGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG
3142826amiC&argACTAGCTAGCTAACTAGATAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAA  
3528054argRCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG 
3576883smf&defCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG 
3597875bfr&bfdCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG 
3834463yiaECTAGCTGGCTGG CTGGCTGGCTGGCTGGCTGGCTAGCTGGCTGGCTGG    
3910806rfaK&rfaZCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTGGCTAGCTAGCTAGCTAGCTAG@  
4101769gltUCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG
4127619rfeCTAGCTAGCTAGCTAGCTAGCTAG@@CTAGCTAGCTAGCTAGCTAGCTAA@  
4396313gltVCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG
4526230fdhFCTAG@CTGGCTGGCTGGCTGGCTGGCTGGCTGGCTGGCTGGCTGGCTGGCTGG CTGG 
4606231hflXCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG 
4856458yjjY&lasTCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGATAGCTAGCTAGCTAGCTAGCTAGCTAG 

S. tm: S. typhimurium LT2; S. ty: S. typhi Ty2; S. pA: S. paratyphi A ATCC9150; S. pu: S. pullorum RKS5078; S. ga: S. gallinarum 287/91; S. en: S. enteritidis P125109; S. ch: S. choleraesuis B67; S. pC: S. paratyphi C RKS4594; S. du: S. dublin CT_02021853; S. he: S. heidelberg B182; S. ag: S. agona SL483; S. ne: S. newport SL254; S. sc: S. schwarzengrund CVM19633; S. ja: S. javiana CFSAN001992; S. ar: S. arizonae RKS2980; S. bo: S. bongori NCTC12419; E. coli: E. coli K12. Note 2: The degenerated sequences with a different nucleotide from CTAG are in italic and underlined. Note 3: @ denotes a degenerated sequence at a location homologous to CTAG in LT2 or another Salmonella genome but with two nucleotides substituted. Genomic locations of CTAG are given only for LT2.

Differential levels of evolutionary conservation of the CTAG sequence

We postulated that most ancestral CTAG sequences were degenerated in evolution and became unrecognizable by sequence and this postulation is at least partly supported by the results in Supplementary Table 4, in which the degeneration follows a phylogenetic trend among the bacteria. Based on this finding, we ranked the existing CTAG sequences according to their range of prevalence across bacteria of different taxonomic ranks (e.g., within a bacterial lineage, across different lineages within a genus, or between different genera) and divided them into three levels of conservation: level 1, conserved in Salmonella, as represented by Salmonella subgroup I and V strains, and E. coli, represented by strain K12 MG1655 (Table 4); level 2, conserved among Salmonella but not with E. coli; and level 3, conserved among only Salmonella subgroup I strains; genomic locations of CTAG conserved at the three levels are summarized in Supplementary Table 5.
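The three conservation levels defined above can be written as a small decision rule. The sketch below is only an illustration. The function and argument names are hypothetical, and returning None for combinations that the text does not define, such as presence in subgroup I and E. coli but not in subgroup V, is an assumption.

```python
def conservation_level(in_subgroup_i, in_subgroup_v, in_ecoli):
    """Assign a CTAG site to conservation level 1, 2 or 3.

    Level 1: conserved in Salmonella (subgroups I and V) and in E. coli.
    Level 2: conserved among Salmonella (subgroups I and V) but not E. coli.
    Level 3: conserved only among Salmonella subgroup I strains.
    Returns None for any combination the text does not define (assumption).
    """
    if in_subgroup_i and in_subgroup_v and in_ecoli:
        return 1
    if in_subgroup_i and in_subgroup_v and not in_ecoli:
        return 2
    if in_subgroup_i and not in_subgroup_v and not in_ecoli:
        return 3
    return None


if __name__ == "__main__":
    # A site present in subgroup I, subgroup V and E. coli K12 is level 1.
    print(conservation_level(True, True, True))
```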
Table 4

CTAG sequences conserved between Salmonella and E. coli.

CTAG site in LT2 | Gene_id | Annotation
284175 | STM0242 | proline tRNA synthetase
459521 | STM0403 & STM0404 | intergenic region between yajB and queA
500950 | STM0445 & STM0446 | intergenic region between yajG and bolA
1280403 | STM1196 & STM1197 | intergenic region between acpP and fabF
1280562 | STM1197 | 3-oxoacyl-[acyl-carrier-protein] synthase II
1459628 | STM1377 & STM1378 | intergenic region between lpp and pykF
1519898 | STM1444 | transcriptional regulator SlyA
1794367 | STM1702 | RNase II
1877803 | STM1780 | phosphoribosylpyrophosphate synthetase
2035488 | STM1943 | tRNA-Cys
2496632 | STM2385 & STM2386 | intergenic region between yfcB and STM2386
2544539 | STM2430 & STM2431 | intergenic region between cysK and ptsH
2797006 | STM2657 | 23S ribosomal RNA
2797988 | STM2657 | 23S ribosomal RNA
2798973 | STM2657 | 23S ribosomal RNA
2799967 | STM2658 | tRNA-Sec
2800314 | STM2659 | 16S ribosomal RNA
2801372 | STM2659 | 16S ribosomal RNA
2844094 | STM2692 & STM2693 | intergenic region between STM2692 and STM2693
3221860 | STM3060 | putative cytoplasmic protein
3346112 | STM3182 | putative esterase
3414851 | STM3245 & STM3246 | intergenic region between tdcA and rnpB
3494593 | STM3330 | glutamate synthase, large subunit
3585835 | STM3418 & STM3419 | intergenic region between rpsM and rpmJ
3589847 | STM3427 | 30S ribosomal subunit protein S14
3593146 | STM3434 & STM3435 | intergenic region between rpsC and rplV
4141162 | STM3933 | tRNA-Leu
4631227 | STM4392 | primosomal replication protein N
4810992 | STM4555 & STM4556 | intergenic region between leuQ and rsmC

Hypothesizing that the most highly conserved sequences may retain biological functions across the bacteria that contain them, we focused on the level 1 CTAG sequences in the least annotated intergenic regions of Salmonella in comparison with E. coli. Although intergenic sequences may be involved in replicating the genome, coordinating the expression of other functionalities, mediating recombination24, etc., most intergenic sequences are known for little more than insulating the genes. Since the CTAG profiles of different bacterial lineages, such as S. typhi vs S. typhimurium, may reflect differential selection pressures on specific nucleotides as suggested by phylogenetic analyses described above, we focused on the CTAG sequences in representative Salmonella lineages and E. coli as well as some more distantly related bacteria. One of the intergenic regions, inter-lpp-pykF between genes lpp and pykF (Fig. 3A), is conserved among bacteria of Salmonella, the E. coli complex (including E. coli and Shigella), Enterobacter, Klebsiella, Yersinia, Citrobacter, Cronobacter, Edwardsiella, Erwinia, Pantoea, Pectobacterium, Rahnella, Serratia, Shimwellia, Sodalis, Dickeya, etc (Supplementary Table 6). The high level of conservation of this intergenic region suggests that this segment might be functionally important and may form some conformational structure for a certain biological function.
Figure 3

Genomic location and computer modeling of inter-lpp-pykF.

(A) Location of lpp, the intergenic region and pykF, with the numbers at the bottom indicating the start and end nucleotides of genes lpp and pykF and the red vertical line indicating the location of CTAG (start and end nucleotides in the brackets); (B) Predicted stem-loop structure; (C) Changed structure when C in the CT(U)AG sequence is substituted by U. The positional entropy is coded by hues ranging from red (low entropy, well-defined) via green to blue and violet (high entropy, ill-defined).

Structure prediction of the highly conserved CTAG-containing intergenic sequences

We conducted computer modeling on inter-lpp-pykF and some other intergenic sequences that contain CTAG for further characterization using the Free Energy Minimization method161718, which is widely used for RNA secondary structure prediction based on empirical free energy change parameters derived from experiments. This thermodynamic model assumes that an RNA molecule folds into the structure that has the minimum free energy among all possible structures, the number of which grows exponentially with the length of the molecule. As exemplified by the analysis of inter-lpp-pykF, we conducted the modeling and predicted a stem-loop structure using services available at http://rna.tbi.univie.ac.at/cgi-bin/RNAfold.cgi (Fig. 3B). As demonstrated by the substitution of C with U, a single base change would dramatically redistribute the free energy among the nucleotides (Fig. 3C; compare the change of nucleotide colors) and disrupt the stem-loop structure. Note in Fig. 3C that, even though the positional relationship between U and G is the same as that between C and G before the substitution, the structure shown in Fig. 3B (before the substitution of C by U) becomes highly unstable (Fig. 3C, after the substitution of C by U) as judged by the changed topology and the radical changes in free energy on every nucleotide. The bioinformatics prediction and biological significance need to be validated by mutational experiments.

Continuing nucleotide refinement of the Salmonella genome to expel the CTAG tetranucleotide sequence – trend of degeneration

The differing numbers and distributions of the CTAG sequence among Salmonella lineages, together with their degeneration patterns, led us to postulate a continuing nucleotide amelioration tendency of the Salmonella genome to expel this tetranucleotide sequence. We observed that more recently diverged Salmonella lineages tend to have greater numbers of the CTAG sequence. For example, S. typhi, which appeared only about 35–50 thousand years ago25, has 1025 CTAG sequences compared to 850 in S. typhimurium or 885 in E. coli, which diverged millions of years before S. typhi262728. This is mostly because of the newly acquired genomic regions, which were probably not affected by the Very Short Patch repair system before integration into the Salmonella genome. We postulated that analysis of the newly acquired genomic regions may provide a snapshot of the possible scenarios of CTAG degeneration in the Salmonella genome. To test this postulate, we conducted systematic comparative analyses on two selected S. heidelberg strains, B18229 and SL47630, which have several large genomic insertions present in one strain but not in the other, and profiled the CTAG sequences. We first confirmed that S. heidelberg B182 and S. heidelberg SL476 belong to the same Salmonella natural cluster (as opposed to those of polyphyletic Salmonella serotypes like S. paratyphi B) based on their common genomic features31, and then compared their genomes for the numbers and distributions of the CTAG sequence. We found that across most of the genome the two strains have nearly identical CTAG profiles, with most of the differences seen only in the laterally acquired DNA segments. Among the genomic insertions present in S. heidelberg SL476 but not in B182 are Insertions 1 and 2 (marked as such in Panel A and shown in red in Panel B, Fig. 4).
As anticipated, both insertions have much more densely distributed CTAG sequences than the vertically transmitted genome (the core genome). S. heidelberg SL476 had 108 more CTAG sequences than S. heidelberg B182, most of which (97 out of the 108) were located in the two largest insertions (Insertions 1 and 2; Fig. 4), where we also found many tetranucleotide sequences with one degeneracy. We are inclined to believe that many of the tetranucleotides with 75% similarity to CTAG (that is, differing from it at one position) might be degeneration products of CTAG, but this postulate remains to be validated using homologous DNA segments/sequences from diverse bacteria. The CTAG sequences within each of Insertions 1 and 2 tended to be more densely distributed toward the middle of the insertion than in the upstream and downstream parts (Fig. 4). To see whether this reflects a tendency of Salmonella to expel CTAG sequences from a laterally acquired DNA segment by starting at the two ends and proceeding toward the center, we profiled the CTAG sequence frequencies (Table 5). Notably, the two insertions seemed to be at different stages of the nucleotide amelioration process that has degenerated the CTAG sequences since they became incorporated into a Salmonella genome, as judged by the comparison of their calculated and profiled numbers of CTAG (Table 5). S. heidelberg SL476 has 833 CTAG sequences in the core genome, a number very similar to that of S. heidelberg B182 (830) and of other Salmonella and E. coli strains (such as 850 in S. typhimurium LT2 and 885 in E. coli K12; see Table 1). Whereas Insertion 1 has a CTAG density more than eight times that of the core genome, Insertion 2 has about half the density of Insertion 1 (Table 6), suggesting that Insertion 2 has been in the genome for much longer than Insertion 1 and has had more time to degenerate its CTAG sequences.
Additionally, the excess of CTAG sequences in Insertion 1 over Insertion 2 lies mostly in the middle part of the 41.6 kb insertion, suggesting that the degeneration of CTAG sequences has proceeded inward from both terminals of the laterally acquired segment. This is illustrated in Fig. 4, where the CTAG sequences have a normal distribution in Insertion 1, whereas the distribution in Insertion 2 is already similar to that of the core genome. Further scrutiny of these genomic nucleotide composition refinement processes will provide novel information on the bacterial genomic evolution that leads to the creation of diverse and distinct pathogens.
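One simple way to examine the "inward from both terminals" pattern is to bin CTAG occurrences by their relative position along a segment. The sketch below is a minimal illustration, not the profiling pipeline used in this study: the function names, the choice of three equal-width bins, and the toy sequence are all assumptions made for the example.

```python
def find_motif(seq, motif="CTAG"):
    """Return the 0-based start positions of every occurrence (overlaps included)."""
    seq, motif = seq.upper(), motif.upper()
    hits = []
    start = seq.find(motif)
    while start != -1:
        hits.append(start)
        start = seq.find(motif, start + 1)
    return hits


def bin_counts(seq, motif="CTAG", bins=3):
    """Count motif start positions in equal-width bins along the segment."""
    counts = [0] * bins
    n = len(seq)
    for pos in find_motif(seq, motif):
        counts[min(pos * bins // n, bins - 1)] += 1
    return counts


# Hypothetical 42-nt segment: CTAG-free flanks around a CTAG-rich middle.
TOY_SEGMENT = "ACGTACGTACGTAC" + "CTAGACTAGACTAG" + "TGCATGCATGCATG"

if __name__ == "__main__":
    print(find_motif(TOY_SEGMENT))
    print(bin_counts(TOY_SEGMENT))  # counts for [5' third, middle third, 3' third]
```

Applied to a real insertion, a middle bin that is clearly richer in CTAG than the two terminal bins would be consistent with the pattern described for Insertion 1; more bins give a finer profile at the cost of noisier counts.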
Figure 4

Comparison of S. heidelberg B182 and SL476 for their differences in CTAG profiles.

(A) Whole genome alignment to show the two largest insertions in SL476 but not B182; (B) Distribution patterns of CTAG inside insertions 1 and 2. The red color indicates the regions of the insertions, with the start and end positions marked in both insertions, and black color indicates the up- and down-stream genomic sequences.

Table 5

Profiles of tetranucleotides consisting of one each of C, T, A and G in insertions 1 and 2 of S. heidelberg SL476.

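| Tetranucleotide | Insertion 1 | Insertion 2 |
| --- | --- | --- |
| Calculated CTAG | 162 | 221 |
| CTAG | 58 | 39 |
| GTAC | 86 | 126 |
| TAGC | 92 | 122 |
| GACT | 103 | 147 |
| CGTA | 111 | 146 |
| TACG | 112 | 186 |
| AGTC | 113 | 132 |
| GCTA | 119 | 130 |
| ACGT | 122 | 147 |
| GTCA | 141 | 221 |
| TGAC | 147 | 233 |
| CAGT | 156 | 301 |
| ACTG | 163 | 254 |
| CATG | 164 | 263 |
| TCGA | 168 | 126 |
| ATCG | 170 | 196 |
| GCAT | 178 | 289 |
| ATGC | 180 | 297 |
| AGCT | 189 | 173 |
| TGCA | 193 | 265 |
| CGAT | 200 | 219 |
| TCAG | 216 | 399 |
| GATC | 221 | 169 |
| CTGA | 224 | 388 |
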
Table 6

Profiles of the tetranucleotide CTAG in two recent insertions and the core genome of S. heidelberg SL476.

| | Insertion 1 | Insertion 2 | Core genome |
| --- | --- | --- | --- |
| Length of DNA (bp) | 41606 | 57892 | 4789272 |
| Number of profiled CTAG | 58 | 39 | 833 |
| Density of CTAG (number/kb) | 1.39 | 0.67 | 0.17 |
| Number of calculated CTAG | 162 | 221 | 18648 |
| CTAG profiled/calculated (%) | 35.8 | 17.6 | 4.5 |
| CTAG index (%) | 1.6 | 0.785 | 0.212 |

Note 1: Length of the core genome is the length of the whole genome of S. heidelberg SL476 (4888768 bp) minus the lengths of the two insertions.
Note 2: CTAG index is the ratio of CTAG to the total number of all 24 combinations of the tetranucleotides consisting of one each of C, T, A and G.
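The derived rows of Table 6 follow directly from the counts in Tables 5 and 6, and the arithmetic can be checked with a few lines of code. The sketch below only recomputes density, the profiled/calculated percentage, and the CTAG index from the tabulated values; the variable and function names are assumptions made for the example, and the "calculated" numbers are taken from the table as given rather than re-derived.

```python
# Values transcribed from Table 6 (S. heidelberg SL476).
SEGMENTS = {
    "Insertion 1": {"length_bp": 41606, "ctag": 58, "calculated": 162},
    "Insertion 2": {"length_bp": 57892, "ctag": 39, "calculated": 221},
    "Core genome": {"length_bp": 4789272, "ctag": 833, "calculated": 18648},
}

# Table 5 counts of all 24 tetranucleotides made of one each of C, T, A and G
# (CTAG first, then in the order listed in the table).
TABLE5 = {
    "Insertion 1": [58, 86, 92, 103, 111, 112, 113, 119, 122, 141, 147, 156,
                    163, 164, 168, 170, 178, 180, 189, 193, 200, 216, 221, 224],
    "Insertion 2": [39, 126, 122, 147, 146, 186, 132, 130, 147, 221, 233, 301,
                    254, 263, 126, 196, 289, 297, 173, 265, 219, 399, 169, 388],
}


def density_per_kb(count, length_bp):
    """Number of occurrences per 1000 bp."""
    return count / (length_bp / 1000)


def percent(part, whole):
    """part as a percentage of whole."""
    return 100 * part / whole


if __name__ == "__main__":
    for name, seg in SEGMENTS.items():
        print(name,
              round(density_per_kb(seg["ctag"], seg["length_bp"]), 2),
              round(percent(seg["ctag"], seg["calculated"]), 1))
    for name, counts in TABLE5.items():
        # CTAG index: CTAG count over the sum of all 24 permutations.
        print(name, round(percent(counts[0], sum(counts)), 3))
```

The core-genome CTAG index cannot be recomputed this way because Table 5 lists the 24 tetranucleotide counts only for the two insertions.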

Discussions

To reveal evolutionarily conserved intergenic regions for analyses of their potential functions, we previously profiled the XbaI cleavage site, a hexanucleotide sequence containing the tetranucleotide CTAG, in representative Salmonella serotypes and demonstrated that the XbaI cleavage patterns are serotype-specific and so could be used to delineate Salmonella into natural genetic clusters1. Of special significance, many of the profiled XbaI cleavage sites fell in intergenic regions, indicating the potential biological importance of these sequences, but their functions remain largely unknown. Because profiling the XbaI cleavage site TCTAGA samples only a subset of the sequences that contain the highly conserved CTAG tetranucleotide2, in this study we documented all CTAG sequences in the genomes of representative Salmonella lineages in comparison with E. coli. Overall, we found that the CTAG sequence is more than 20 times rarer than would be estimated for a random tetranucleotide sequence consisting of one each of C, T, A and G. Most existing CTAG sequences, except those acquired relatively recently through lateral transfer, are conserved at some level: within a Salmonella serotype, among different serotypes of Salmonella subgroup I, across Salmonella subgroups such as I and V, between Salmonella and E. coli, or even among more distantly related bacteria. An example of the last is the CTAG in the inter-lpp-pykF region, which is conserved across most genera of the Enterobacteriaceae family (see Supplementary Table 6). We believe that although many of the existing CTAG sequences in the Salmonella genomes may still be degenerating, some of them will stay unchanged in the genomes because of the importance of their sequences.
As such, comparing the CTAG degeneration patterns among different bacterial lineages should be a reasonably effective way to reveal hitherto unknown functional genomic regions according to their levels of evolutionary conservation. Computer modeling in this study supported this assumption, demonstrating that substituting any of the four nucleotides of CTAG (C, T, A or G) at highly conserved genomic regions like inter-lpp-pykF would disrupt the stem-loop structure (see Fig. 3). The CTAG profiles (Table 1 and Supplementary Table 5) and their degeneration patterns (each lineage having a specific degeneration pattern; Supplementary Table 4) are unique to each of the Salmonella serotypes analyzed in this study as a consequence of natural selection during the adaptation of the bacteria to a given niche, e.g., a particular host. In fact, the differential levels of conservation among the existing (and very rare) CTAG sequences at different genomic locations (see Table 2 and Supplementary Table 5) should reflect their functional importance and may lead to the discovery of motifs for gene expression regulation or other biological functions. We anticipate that other bacteria may have similarly under-abundant short sequences in their genomes, and that profiling and analyzing them may facilitate studies of the genomic evolution and biological divergence of the bacteria. One finding of special interest in this study is that the CTAG sequences in a relatively recent insertion have a normal distribution with a typical peak in the middle of the inserted DNA segment. This phenomenon reflects the way the genome "treats" an incoming DNA segment: if it does not treat the segment as a parasite or something useless, the genome may accept it and in time modify it according to the general genomic environment. The "modification" or amelioration process may take place inward from both terminals of the horizontally acquired segment.
Detailed analyses of these processes may help in understanding the biological meaning of differential codon usage in different organisms, in correlating natural selection pressure with a particular niche of the bacteria, and in uncovering novel mechanisms of genomic regulation and evolution through the recognition of highly conserved short sequences and their functions.
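The "more than 20 times rarer" figure in the Discussion compares an observed CTAG count with an estimated one. How the estimate was calculated is not spelled out in this section, so the sketch below assumes a simple zero-order composition model, in which the expected count is the number of tetranucleotide windows multiplied by the product of the four base frequencies. The function names are assumptions made for the example, and the final check reuses the core-genome numbers from Table 6 as given.

```python
from collections import Counter


def expected_ctag(seq):
    """Expected CTAG count if bases occurred independently at their observed frequencies.

    Assumption: zero-order composition model, (n - 3) * p(C) * p(T) * p(A) * p(G).
    """
    seq = seq.upper()
    n = len(seq)
    if n < 4:
        return 0.0
    freq = Counter(seq)
    p = 1.0
    for base in "CTAG":
        p *= freq[base] / n
    return (n - 3) * p


def observed_ctag(seq):
    """Observed number of CTAG occurrences (overlapping windows)."""
    seq = seq.upper()
    return sum(1 for i in range(len(seq) - 3) if seq[i:i + 4] == "CTAG")


def fold_depletion(expected, observed):
    """How many times rarer the motif is than expected."""
    return expected / observed


if __name__ == "__main__":
    # Core genome of S. heidelberg SL476, values from Table 6:
    # 18648 calculated versus 833 profiled CTAG sequences.
    print(round(fold_depletion(18648, 833), 1))
```

With the Table 6 values the ratio is a little over 22, consistent with the "more than 20 times" statement; on a real genome sequence, `expected_ctag` and `observed_ctag` would supply the two inputs under the stated model.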

Additional Information

How to cite this article: Tang, L. et al. Conserved intergenic sequences revealed by CTAG-profiling in Salmonella: thermodynamic modeling for function prediction. Sci. Rep. 7, 43565; doi: 10.1038/srep43565 (2017). Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
  30 in total

1.  A physical map of the Salmonella typhimurium LT2 genome made by using XbaI analysis.

Authors:  S L Liu; K E Sanderson
Journal:  J Bacteriol       Date:  1992-03       Impact factor: 3.490

2.  CONTRAfold: RNA secondary structure prediction without physics-based models.

Authors:  Chuong B Do; Daniel A Woods; Serafim Batzoglou
Journal:  Bioinformatics       Date:  2006-07-15       Impact factor: 6.937

3.  The complete genome sequence of Escherichia coli K-12.

Authors:  F R Blattner; G Plunkett; C A Bloch; N T Perna; V Burland; M Riley; J Collado-Vides; J D Glasner; C K Rode; G F Mayhew; J Gregor; N W Davis; H A Kirkpatrick; M A Goeden; D J Rose; B Mau; Y Shao
Journal:  Science       Date:  1997-09-05       Impact factor: 47.728

4.  Bacterial phylogenetic clusters revealed by genome structure.

Authors:  S L Liu; A B Schryvers; K E Sanderson; R N Johnston
Journal:  J Bacteriol       Date:  1999-11       Impact factor: 3.490

5.  Determining divergence times of the major kingdoms of living organisms with a protein clock.

Authors:  R F Doolittle; D F Feng; S Tsang; G Cho; E Little
Journal:  Science       Date:  1996-01-26       Impact factor: 47.728

6.  Comparative genomics of 28 Salmonella enterica isolates: evidence for CRISPR-mediated adaptive sublineage evolution.

Authors:  W Florian Fricke; Mark K Mammel; Patrick F McDermott; Carmen Tartera; David G White; J Eugene Leclerc; Jacques Ravel; Thomas A Cebula
Journal:  J Bacteriol       Date:  2011-05-20       Impact factor: 3.490

7.  The XbaI-BlnI-CeuI genomic cleavage map of Salmonella paratyphi B.

Authors:  S L Liu; A Hessel; H Y Cheng; K E Sanderson
Journal:  J Bacteriol       Date:  1994-02       Impact factor: 3.490

8.  The XbaI-BlnI-CeuI genomic cleavage map of Salmonella typhimurium LT2 determined by double digestion, end labelling, and pulsed-field gel electrophoresis.

Authors:  S L Liu; A Hessel; K E Sanderson
Journal:  J Bacteriol       Date:  1993-07       Impact factor: 3.490

9.  Supplement 2002 (no. 46) to the Kauffmann-White scheme.

Authors:  Michel Y Popoff; Jochen Bockemühl; Linda L Gheesling
Journal:  Res Microbiol       Date:  2004-09       Impact factor: 3.992

10.  Inversions over the terminus region in Salmonella and Escherichia coli: IS200s as the sites of homologous recombination inverting the chromosome of Salmonella enterica serovar typhi.

Authors:  Suneetha Alokam; Shu-Lin Liu; Kamal Said; Kenneth E Sanderson
Journal:  J Bacteriol       Date:  2002-11       Impact factor: 3.490
