Che L Martin1, Tika Y Sukarna1, Saymon Akther2, Girish Ramrattan2, Pedro Pagan2, Lia Di2, Emmanuel F Mongodin3, Claire M Fraser3, Steven E Schutzer4, Benjamin J Luft5, Sherwood R Casjens6, Wei-Gang Qiu7.
Abstract
Phylogenomic footprinting is an approach for ab initio identification of genome-wide regulatory elements in bacterial species based on sequence conservation. The statistical power of the phylogenomic approach depends on the degree of sequence conservation, the length of regulatory elements, and the level of phylogenetic divergence among genomes. Building on an earlier model, we propose a binomial model that uses synonymous tree lengths as neutral expectations for determining the statistical significance of conserved intergenic spacer (IGS) sequences. Simulations show that the binomial model is robust to variations in the value of evolutionary parameters, including base frequencies and the transition-to-transversion ratio. We used the model to search for regulatory sequences in the Lyme disease species group (Borrelia burgdorferi sensu lato) using 23 genomes. The model indicates that the currently available set of Borrelia genomes would not yield regulatory sequences shorter than five bases, suggesting that genome sequences of additional B. burgdorferi sensu lato species are needed. Nevertheless, we show that previously known regulatory elements are indeed strongly conserved in sequence or structure across these Borrelia species. Further, we predict with sufficient confidence two new RpoS binding sites, 39 promoters, 19 transcription terminators, 28 noncoding RNAs, and four sets of coregulated genes. These putative cis- and trans-regulatory elements suggest novel, Borrelia-specific mechanisms regulating the transition between the tick and host environments, a key adaptation and virulence mechanism of B. burgdorferi. Alignments of IGS sequences are available on BorreliaBase.org, an online database of orthologous open reading frame (ORF) and IGS sequences in Borrelia.
IMPORTANCE: While bacterial genomes contain mostly protein-coding genes, they also house DNA sequences regulating the expression of these genes. Gene regulatory sequences tend to be conserved during evolution. By sequencing and comparing related genomes, one can therefore identify regulatory sequences in bacteria based on sequence conservation. Here, we describe a statistical framework by which one may determine how many genomes need to be sequenced and at what level of evolutionary relatedness in order to achieve a high level of statistical significance. We applied the framework to Borrelia burgdorferi, the Lyme disease agent, and identified a large number of candidate regulatory sequences, many of which are known to be involved in regulating the phase transition between the tick vector and mammalian hosts.
Year: 2015 PMID: 25873371 PMCID: PMC4453575 DOI: 10.1128/mBio.00011-15
Source DB: PubMed Journal: MBio Impact factor: 7.867
Orthologous ORFs and IGSs
| Characteristic | Main chromosome | lp54 | cp26 |
|---|---|---|---|
| No. of orthologous ORF families | 750 | 62 | 26 |
| Synonymous tree length (dS) | 1.4965 | 1.8959 | 1.7931 |
| Nonsynonymous tree length (dN) | 0.1092 | 0.3684 | 0.1925 |
| Ratio (dN/dS) | 0.07297 | 0.1944 | 0.1074 |
| No. of orthologous IGS families | 203 | 27 | 17 |
| No. of convergent IGSs (no. conserved | 31 (6 | 3 (0; 0) | 3 (0; 0) |
| No. of tandem IGSs (no. conserved | 109 (41 | 19 (8 | 8 (0; 0) |
| No. of divergent IGSs (conserved | 63 (40 | 5 (0; 0) | (3 |
Median values obtained by PAML (35) among 23, 41, and 327 orthologous ORF families on cp26, lp54, and the main chromosome, respectively. The total number of sequences in individual ORF families is 22 for those on the main chromosome and 23 for those on lp54 and cp26.
Includes only IGSs with an alignment length of 30 bases or more.
With nucleotide substitution rates obtained by Rates4site (63) significantly lower (P < 0.001 by t test) than those of flanking third-codon sites.
Conserved convergent IGSs on the main chromosome (n = 6): bb0004-bb0005, bb0364-bb0365, bb0459-bb0460, bb0536-bb0537, bb0688-bb0689, bb0758-bb0759.
Conserved tandem IGSs on the main chromosome (n = 41): bb0034-bb0035, bb0057-bb0058, bb0089-bb0090, bb0146-bb0147, bb0163-bb0164, bb0172-bb0173, bb0208-bb0209, bb0219-bb0220, bb0247-bb0248, bb0250-bb0251, bb0255-bb0256, bb0278-bb0279, bb0328-bb0329, bb0339-bb0340, bb0347-bb0348, bb0380-bb0381, bb0381-bb0382, bb0389-bb0390, bb0390-bb0391, bb0430-bb0431, bb0434-bb0435, bb0539-bb0540, bb0542-bb0543, bb0567-bb0568, bb0584-bb0585, bb0603-bb0604, bb0608-bb0610, bb0642-bb0643, bb0647-bb0648, bb0671-bb0672, bb0679-bb0680, bb0693-bb0694, bb0715-bb0716, bb0726-bb0727, bb0744-bb0745, bb0755-bb0756, bb0770-bb0771, bb0773-bb0774, bb0776-bb0777, bb0808-bb0809, bb0830-bb0831.
Conserved divergent IGSs on the main chromosome (n = 40): bb0007-bb0008, bb0023-bb0024, bb0045-bb0046, bb0100-bb0101, bb0133-bb0134, bb0135-bb0136, bb0154-bb0155, bb0190-bb0192, bb0201-bb0202, bb0214-bb0215, bb0226-bb0227, bb0236-bb0237, bb0253-bb0254, bb0313-bb0314, bb0336-bb0337, bb0346-bb0347, bb0365-bb0366, bb0373-bb0374, bb0400-bb0401, bb0436-bb0437, bb0454-bb0455, bb0457-bb0458, bb0460-bb0461, bb0507-bb0508, bb0560-bb0561, bb0571-bb0572, bb0596-bb0597, bb0598-bb0599, bb0620-bb0621, bb0623-bb0624, bb0629-bb0630, bb0655-bb0656, bb0706-bb0707, bb0723-bb0724, bb0734-bb0735, bb0748-bb0749, bb0760-bb0761, bb0812-bb0814, bb0828-bb0829, bb0835-bb0836.
Conserved tandem IGSs on lp54 (n = 8): bba14-bba15, bba16-bba18, bba21-bba23, bba24-bba25, bba39-bba40, bba51-bba52, bba64-bba65, bba65-bba66.
Conserved divergent IGSs on cp26 (n = 3): bbb08-bbb09, bbb25-bbb26, bbb27-bbb28.
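The three IGS classes counted in the table are named for the relative transcription directions of the two flanking genes. The sketch below shows that usual convention in code; the function name and the '+'/'-' strand encoding are hypothetical and are not taken from the source.

```python
def igs_orientation(left_strand, right_strand):
    """Classify an intergenic spacer from the strands ('+' or '-') of the
    gene on its left and the gene on its right along the reference strand."""
    if left_strand not in "+-" or right_strand not in "+-" \
            or len(left_strand) != 1 or len(right_strand) != 1:
        raise ValueError("strands must be '+' or '-'")
    if left_strand == right_strand:
        return "tandem"        # both genes transcribed in the same direction
    if left_strand == "+":
        return "convergent"    # 3' ends of both genes face the spacer
    return "divergent"         # 5' ends of both genes face the spacer


if __name__ == "__main__":
    for pair in [("+", "+"), ("+", "-"), ("-", "+"), ("-", "-")]:
        print(pair, igs_orientation(*pair))
```

Under this convention a divergent spacer lies upstream of both neighbors and can carry promoters for each, while a convergent spacer lies downstream of both.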
FIG 1 Statistical power of phylogenomic footprinting. (A and B) Each data point represents the probability (y axis, in −log10) of an L-mer IGS segment having n substitutions (x axis) after evolving with an expected neutral distance of T0. These probabilities were calculated according to equation 1 in Materials and Methods and obtained using the R function pbinom (58). (A) Probabilities for IGSs on the main chromosome, with the neutral distance T0 approximated by T = 1.5 substitutions/site (Table 2); (B) probabilities for IGSs on the plasmids, with the neutral distance T0 approximated by T = 1.85 substitutions/site (Table 2). These two plots show that the statistical power of identifying regulatory elements using phylogenomic footprinting increases with the length of the element (L), the degree of its sequence conservation (n), and the total neutral divergence among the genomes (T). (C) Phylogenetic tree of neutrally evolved IGS sequences (each 10,199 bp long) simulated by Evolver (35) with parameters taken from a typical plasmid-borne gene (a39, with T = 1.85, %GC = 21.3%, and a transition-to-transversion ratio of 3.66). (D) Sensitivity of statistical power (y axis, calculated by equation 2 in Materials and Methods) to phylogenetic diversity (x axis, measured by T). Vertical gray lines indicate subtree distances from B31 up to a labeled strain.
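The dependence of power on L, n, and T described in this caption can be explored numerically. The sketch below is not the authors' equation 1, which is not reproduced in this record. It assumes that substitutions at a neutral site follow a Poisson process, so that a site varies with probability 1 − exp(−T), and that the number of variable sites in an L-mer is binomial. The function names are hypothetical; the cumulative sum plays the role of R's pbinom.

```python
import math


def site_change_prob(tree_length):
    """Assumed probability that a neutral site shows at least one substitution
    on a tree of total length `tree_length` (Poisson assumption)."""
    return 1.0 - math.exp(-tree_length)


def footprint_pvalue(L, n, tree_length):
    """P(X <= n) for X ~ Binomial(L, p): the chance that a neutrally evolving
    L-mer shows n or fewer variable sites (analogous to pbinom(n, L, p) in R)."""
    p = site_change_prob(tree_length)
    return sum(math.comb(L, k) * p**k * (1.0 - p) ** (L - k) for k in range(n + 1))


def min_significant_length(tree_length, alpha=0.001):
    """Shortest perfectly conserved block (n = 0) whose probability under
    neutrality falls below alpha."""
    L = 1
    while footprint_pvalue(L, 0, tree_length) >= alpha:
        L += 1
    return L


if __name__ == "__main__":
    for T in (1.5, 1.85):  # chromosome and plasmid tree lengths from the caption
        print(T, min_significant_length(T))
```

Under these assumptions and a threshold of 0.001, a tree length of 1.5 gives a minimum detectable length of five bases, which agrees with the abstract's statement that elements shorter than five bases cannot be recovered from the current genome set.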
Predicted regulatory elements
| IGS | Orientation | ncRNA | Promoter | Terminator |
|---|---|---|---|---|
| Tandem | + | |||
| Convergent | + | |||
| Tandem | + | + | ||
| Divergent | + | |||
| Tandem | + | |||
| Tandem | + | + | + | |
| Tandem | + | |||
| Divergent | + | |||
| Divergent | + | + | ||
| Tandem | + | + | ||
| Divergent | + | |||
| Convergent | + | |||
| Tandem | + | |||
| Divergent | + | |||
| Tandem | + | + | ||
| Divergent | + | + | + | |
| Tandem | + | |||
| Convergent | + | |||
| Convergent | + | + | ||
| Divergent | + | + | + | |
| Convergent | ++(2) | |||
| Tandem | + | + | ||
| Tandem | + | + | ||
| Tandem | + | |||
| Divergent | + | |||
| Tandem | + | |||
| Divergent | + | |||
| Tandem | + | |||
| Divergent | + | |||
| Divergent | + | |||
| Divergent | + | |||
| Divergent | + | |||
| Divergent | + | |||
| Divergent | + | |||
| Convergent | + | |||
| Tandem | + | |||
| Divergent | + | |||
| Divergent | + | |||
| Divergent | + | + | ||
| Convergent | + | |||
| Tandem | + | |||
| Tandem | + | |||
| Divergent | + | |||
| Tandem | + | + | ||
| Tandem | + | |||
| Divergent | + | + | ||
| Tandem | + | |||
| Divergent | + | + | ||
| Tandem | + | |||
| Convergent | + | |||
| Tandem | + | |||
| Divergent | + | |||
| Divergent | + | |||
| Tandem | + | |||
| Tandem | + | |||
| Divergent | + | |||
| Convergent | + | |||
| Tandem | + | + | + | |
| Tandem | + | |||
| Divergent | + | |||
| Divergent | + | |||
| Divergent | + | |||
| Tandem | + | |||
| Tandem | + | |||
| Divergent | + | |||
| Divergent | + |
Including IGSs on chromosome, lp54, and cp26 that are ≥30 nucleotides and present in at least seven of the eight B. burgdorferi sensu lato species.
Presence (+; n = 28) of a conserved RNA structure predicted by RNAz (61). Sequences are available in Table S3 in the supplemental material.
Presence (+; n = 39) of a conserved promoter predicted by PromPredict (51). Sequences are available in Table S4 in the supplemental material.
Presence (+; n = 19) of a conserved transcription terminator predicted by TransTermHP (52). Sequences are available in Table S5 in the supplemental material.
0577-0578 contains DsrA, a small ncRNA that regulates rpoS expression. It is not identified here due to an overlap with the 3′ end of 0577 (49).
FIG 2 Frequency distributions of base substitution rates. (A) Normalized base substitution rates (x axis), obtained by using concatenated IGS-ORF alignments and calculated by Rates4site (38), are distributed similarly among the three types of IGSs (top three rows) and the third-base sites (bottom row). Chromosomal sequences (right column) are more conserved than plasmid-borne sequences (left and middle columns). (B) A conserved divergent IGS (middle panel) shows a significantly higher density of low-rate sites (P = 5.6e−06 by a Wilcoxon rank sum test) than its flanking third-base sites (left and right panels).
FIG 3 Observed and predicted counts of perfectly conserved intergenic blocks (PCIBs) on the chromosome (A) and plasmids (B) (note the scale difference of the y axis). A PCIB has no nucleotide variations or alignment gaps. The minimum length of a PCIB is six nucleotides. “Observed,” length distribution of 935 PCIBs on the main chromosome and 276 PCIBs on the plasmids; “permuted,” counts of L-mer PCIBs from 10 rounds of permutations of original IGS alignments; “simulated,” PCIB counts from simulated sequences using Evolver (35); “expected,” PCIB counts obtained by equation 2 in Materials and Methods. Solid triangles represent L-mers having significantly higher counts than permuted counts (P < 0.001 by one-tailed t tests).
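The PCIB definition in this caption (no nucleotide variation, no alignment gaps, at least six columns) translates directly into a scan over alignment columns. The sketch below is illustrative only: the function names are hypothetical, sequences are assumed to be uppercase and of equal length, and shuffling whole columns is an assumed permutation scheme, since the record does not say how the alignments were permuted.

```python
import random


def conserved_block_lengths(alignment, min_len=6):
    """Return the lengths of maximal runs of gap-free, invariant columns,
    keeping only runs of at least `min_len` columns."""
    lengths, run = [], 0
    for column in zip(*alignment):
        if "-" not in column and len(set(column)) == 1:
            run += 1
        else:
            if run >= min_len:
                lengths.append(run)
            run = 0
    if run >= min_len:  # close a run that reaches the end of the alignment
        lengths.append(run)
    return lengths


def permute_columns(alignment, rng):
    """Shuffle alignment columns (assumed scheme): the number of conserved
    columns is unchanged, but their adjacency is broken up."""
    columns = list(zip(*alignment))
    rng.shuffle(columns)
    return ["".join(row) for row in zip(*columns)]


if __name__ == "__main__":
    toy = ["ATGCATTAAGC-TA",
           "ATGCATTAGGC-TA",
           "ATGCATTAAGCATA"]
    print(conserved_block_lengths(toy))
    shuffled = permute_columns(toy, random.Random(1))
    print(conserved_block_lengths(shuffled, min_len=1))
```

Comparing the length distribution from the original alignment with the distribution over several shuffled copies gives a null expectation of the kind the "permuted" series represents.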
FIG 4 Conserved RpoS-dependent promoter sequences. RpoS recognition sites, ribosome-binding sites, and inverted repeats are conserved among B. burgdorferi sensu lato species in the promoters of six genes known to be upregulated during transmission from the tick to mammalian hosts. A new putative RpoS consensus sequence is derived (inset). The inverted repeats upstream of ospC are conserved in secondary structure but not in primary sequence.
FIG 5 Predicted secondary structures of highly conserved putative ncRNAs. Structures of the eight longest inverted repeats (IRs) were predicted using RNAz (61) and plotted with B31 sequences using Varna (62). Arrows point to variations in the indicated strains. The Rfam accessions and annotations based on searches using Infernal (48) are as follows: IR0146-0147-RF00082, small RNA G (SraG); IR0243-0244-RF02152, long noncoding RNA (MINT_2); IR0434-0435-RF00074, pre-miRNA (mir-29); IR0346-0347-RF01350, CRISPR direct repeat element (CRISPR-DR41); IR0385-0386-RF01379, CRISPR direct repeat element (CRISPR-DR66); IR0602-0603-RF02066, bacterial small RNAs (STnc320); IRa37-a38-RF02058, bacterial small RNAs (STnc400); and IRb03-b04-RF00741, pre-miRNA (mir-378). Structures of another six long conserved IRs in Borrelia (IRb04-b05, IRb12-b13, IRb29-b01, IRa16-a18, IRa21-a23, and IRa34-a36) have been published earlier (32, 33).