Literature DB >> 11737947

Prediction of co-regulated genes in Bacillus subtilis on the basis of upstream elements conserved across three closely related species.

Abstract

BACKGROUND: Identification of co-regulated genes is essential for elucidating transcriptional regulatory networks and the function of uncharacterized genes. Although co-regulated genes should have at least one common sequence element, it is generally difficult to identify these genes from the presence of this element because it is very easily obscured by noise. To overcome this problem, we used conserved information from three closely related species: Bacillus subtilis, B. halodurans and B. stearothermophilus.
RESULTS: Even though such species have a limited number of clearly orthologous genes, we obtained 1,884 phylogenetically conserved elements from the upstream intergenic regions of 1,568 B. subtilis genes. Similarity between these elements was used to cluster these genes. No other a priori knowledge on genes and elements was used. We could identify some genes known or suggested to be regulated by a common transcription factor as well as genes regulated by a common attenuation effector.
CONCLUSIONS: We confirmed that our method generates relatively few false positives in clusters with higher scores and that general elements such as -35/-10 boxes and Shine-Dalgarno sequence are not major obstacles. Moreover, we identified some plausible additional members of groups of known co-regulated genes. Thus, our approach is promising for exploring potentially co-regulated genes.

Entities: Chemical Disease Species

Mesh：

Substances：

Year: 2001 PMID： 11737947 PMCID： PMC60312 DOI： 10.1186/gb-2001-2-11-research0048

Source DB: PubMed Journal: Genome Biol ISSN： 1474-7596 Impact factor: 13.583

Background

Transcriptional regulatory networks are important for controlling many biological phenomena, such as development and cell proliferation. Even in bacteria, elucidation of such networks or identification of co-regulated genes (regulons) is essential for understanding many cellular processes. Because co-regulated genes are likely to function for the same purpose, identifying them can also provide hints on gene function. The microarray technique, which enables us to monitor the expression levels of thousands of genes in parallel, appears very powerful for identifying co-regulated genes and several articles on this technique have been published [1,2,3]. Even if we can ignore experimental artifacts, however, it is not always easy to set experimental conditions to identify differential expression patterns of uncharacterized genes. Thus, it would be desirable to develop some computational methods that can supplement such experimental techniques. In recent years, several computational approaches to identifying co-regulated genes have been reported. Because transcription is regulated by transcription factors that bind DNA in a sequence-specific manner, comparison of gene upstream regions could, in principle, identify co-regulated genes. Thus, a classical and most widely used method for predicting co-regulated genes is to search upstream regions for sequence segments similar to known binding sites for transcription factors [4,5,6]. This approach is, however, applicable only when information on binding sites is available. Furthermore, as DNA sequences recognized by a single transcription factor are only about 6-10 base pairs (bp) long and are not strictly conserved, many false-positive matches would be unavoidable. One way to overcome this difficulty is to use conservation information across species. New members of co-regulated genes have been predicted on the basis of conservation of hypothetical transcriptional regulatory sites between several eubacteria such as Escherichia coli and Haemophilus influenzae [7,8,9]. A similar approach was also applied to the analysis of four archaeal candidate regulons [10]. In that approach, the heuristic that many binding sites are quasi-palindromic was also used. McGuire et al. have exploited the possibility of using conservation in a wider range of species [11,12]. To reduce false-positive hits, candidate genes were prescreened using a priori knowledge such as their function, the metabolic pathway they belong to, and their functional coupling predicted from conserved operons, protein fusions and correlated evolution. Techniques for detecting conserved elements in noncoding regions across species have also been studied [13,14,15]. For bacterial genes, McCue et al. developed an elaborate algorithm for detecting potential binding sites in sets of upstream regions of orthologous genes [16]. Their method also assumes the palindromic nature of binding sites. Thus, it is evident that such a method would fail to detect non-palindromic binding sites, of which there are many. It is also questionable whether the molecular mechanisms of transcription in distantly related bacteria have been well conserved and whether each orthologous transcription factor recognizes exactly the same consensus pattern in each species. Furthermore, the problem of detecting conserved elements is not simple; we should carefully observe each case of conservation and optimize parameters to detect as many known binding sites as possible. In this paper, we used three closely related genome sequences to predict co-regulated genes of Bacillus subtilis. Our method consists of two parts; first, we identified phylogenetically conserved elements (PCEs) in the upstream intergenic regions of B. subtilis genes; then they were clustered according to the similarity of PCEs in their upstream region. In addition, each of the obtained clusters, predicted to be co-regulated, was examined in terms of existing knowledge of regulons and functional information from downstream genes. The species used for this analysis are: B. subtilis [17], B. halodurans [18], and B. stearothermophilus (genome sequence incomplete; see Materials and methods). We selected these sequences for three reasons. First, the interpretation of the comparison of upstream regions of orthologous genes would be more straightforward because their regulatory mechanisms are also likely to be conserved. Second, we have constructed a database (DBTBS) of B. subtilis promoters and transcription factors by literature survey [19,20]. Therefore, it is easier to check the predictions and optimize parameters. Third, an international project on functional genomics, including transcriptome analysis, of B. subtilis is ongoing [21]. Thus, our predictions have more chance of being tested experimentally. Here we report the results of our prediction of co-regulated genes in B. subtilis without any prior knowledge or assumption. The extensive evaluation of these results is also described.

Results and discussion

Detection of PCEs and their verification

We could analyze the upstream regions of 1,568 B. subtilis genes. For 675 of them, orthologous genes were found in both B. halodurans and B. stearothermophilus, for 706 in B. halodurans only, and for 187 in B. stearothermophilus only. The genome sequence of B. stearothermophilus is still incomplete; its length was 3,286,068 bp on 21 February 2001. If we assume that the genome of B. stearothermophilus is about the same size as that of B. subtilis, the data roughly correspond to three-quarters of all genes. Within the upstream regions of these 1,568 genes, we identified 1,884 PCEs. For comparison, we generated five pseudogenomes of scrambled upstream regions; for this we took all upstream regions of these genes and randomly placed them in front of randomly chosen genes. Then, the same PCE identification procedure was applied to each pseudogenome. In these cases, we can basically regard detected PCEs as spurious. On average, 793 spurious PCEs were identified (the standard deviation is 26.7). Figure 1 shows the histogram of scores calculated against these PCEs. The score of spurious PCEs is relatively low, suggesting that their length is relatively short. We estimate that over half of the 1,884 PCEs are meaningful and that this ratio becomes higher for longer PCEs. These PCEs were also compared with known binding sites for transcription factors using the DBTBS database [19,20] and literature survey. Table 1 summarizes the result for each known transcription factor. In total, 52 of 122 known binding sites overlapped with the PCEs. For some transcription factors such as GltR, ComA and IolR, the orthologous genes themselves could not be identified, whereas orthologous genes of most genes regulated by some factors, such as DegU and GerE, could not be found. On the other hand, 6 of 11 known binding sites of CcpA overlapped with PCEs.

Figure 1

Histogram of PCE scores calculated from sequence alignments. (a) Three or (b) two sequences were aligned. Green bars correspond to the score of actual PCEs and yellow bars to the score of spurious PCEs generated by joining upstream regions with unrelated coding regions. In the yellow bars, the averaged values of five trials are shown with their error bars.

Table 1

Correspondence between known transcription factor binding sites and PCEs

Factor name	Orthologs^*	Number of known sites^†	Number of sites to be detected^‡	Number of overlaps^§
AbrB	H S	11 (1)	7	3
AhrC	H S	5 (1)	3	2
AraR	H	5 (1)	2	1
BirA	H S	1	0	-
BltR	H	1	1	0
BmrR	None	1	0	-
CcpA	H S	33 (17)	11	6
CodY	H S	2	1	1
ComA	None	5	2	0
ComK	None	1	0	-
CtsR	H S	6	6	4
DegU	H S	14 (3)	5	1
DeoR	H	1	1	0
LexA	H	8	6	3
ExuR	S	1	1	0
Fnr	H	2	2	1
GerE	H S	21 (2)	7	0
GlnR	S	6	3	0
GltC	H S	3	3	0
GltR	None	4	2	0
GntR	None	1	0	-
Hpr	H	8 (1)	3	0
HrcA	H S	2	2	2
IolR	None	2	1	0
LevR	S	3	0	-
LicT	H	1	1	1
LrpC	H	1	1	0
Mta	H S	3	2	1
MtrB	H S	1	1	0
PhoP	H S	6 (2)	2	0
PyrR	H S	3	3	3
PurR	H S	1	0	-
RibC	H S	1	0	-
RocR	H	4	2	0
SacT	None	1	0	-
SacY	None	1	0	-
SenS	None	1	0	-
SinR	H	6	5	5
Spo0A	H S	22 (1)	18	10
SpoIIID	H S	12 (5)	6	2
TnrA	H S	10	6	1
TreR	H S	2	2	2
Xre	H	4	0	-
XylR	H S	1	1	1
MntR	H S	2	1	1
Zur	H S	2	2	1
Total		232 (34)	122	52

*Name(s) of species having the orthologous gene with the B. subtilis gene. H: B. halodurans; S: B. stearothermophilus.†Total number of experimentally verified binding sites of < 50 bp. The number of binding sites in the coding region is shown in parentheses. ‡Number of known binding sites in the region analyzed in this work. §Number of analyzed sites overlapping with PCEs over 5 bp.

Clustering of PCEs and its verification

Using the clustering process, 188 clusters were obtained which contained many known or possible co-regulated genes (see below). To estimate the number of false positives, we performed the same clustering procedure five times against the 1,884 PCEs of randomly shuffled sequences. Figure 2 shows the histogram of similarity scores used during these clustering processes. It shows that many false-positive clusters can occur by chance around a cut-off score of 60, but that they are rare above score 80. Although about half of detected PCEs might be false positives, such PCEs are usually short (Figure 1) and the similarity score between them is relatively low (Figure 2, blue bar). We therefore conclude that non-meaningful PCEs are rarely included in our clustering results, at least in the clusters with higher scores.

Figure 2

Histogram of similarity scores used during the clustering process. Red bars represent clustering of PCEs within the upstream regions of orthologous genes, green bars the clustering of PCEs with randomly shuffled sequence, and blue bars the clustering of PCEs identified when the upstream regions are linked to unrelated coding regions. For the green and blue bars, average values are shown with their error bars.

Prediction of co-regulated genes

Among the 188 clusters obtained, we excluded 34 because of the alignment of hypothetical Shine-Dalgarno (SD) sequences (see below). The remaining clusters, ranked by the highest similarity score within each cluster, are available as a table (see Additional data files). We expect that many members of each cluster will be co-regulated by a common factor, especially when their similarity scores are above 80. We now discuss the clustered genes in terms of some typical regulons (Table 2).

Table 2

Comparison of some typical regulons with our results

Regulon	Gene^*	Cluster information^†	Sequence of PCE^‡
pyr operon (regulator: PyrR)	pyrR	34	AGTCCAGAGAGGCTGAGAAGGA-T
	pyrP	34	AATCCAGAGAGGTTG
	pyrB	C	CAGAGAGGCTT
S-box regulon (regulator: unknown)	metK	1,11
	yusC	1
	ykrW	1,5,11
	yjcI	5,11
	metE	5
	ykrT	5,11
	yitJ	5,11
	cysH	B
	yoaD	A
	yxjG	A
	yxjH	A
Hypothetical xanthine regulon (regulator: unknown)	purE	20
	xpt	14,20
	pbuG^*	14
Aminoacyl-tRNA synthetases (regulator: uncharged tRNA)	serS	2	AGGGTGGCAACGCGAG
	valS	2	AAAAAAGGTGGTACCGCGA
	thrS	2	GAAAAAAGGGTGGAACCACGA
	tyrS	2	TTAGTAGGGTGGTACCGCGA
	leuS	2	AGGGTGGTACCGCGGG
	tyrZ	2	AGGGTGGTACCGCGTG
	ilvB	2	AGGGTGGTACCGCGGAAAG
	pheS	2	AATAAGGGTGGTACCGCG
	hisS	2	AACTAGGGTGGCACCACGGGTAT..
	glyQ	2	GCAACTAGGGTGGAACCGCGGG
	alaS	2	AGGGTGGTACCGCGAG-A
	ileS	2	AGGGTGGTACCGCGAGA
	proB	2	AAGGTGGTACCACGGA
	cysE	D	C-AAACAGAGTGGAACCGCG
	trpS	C	AGGGTGG
	thrZ	A
Heat-shock regulon (regulator: CtsR)	ctsR	6	GTCAAATATAGTCAAAGTCA
	clpE	6	GGTCAAAGATAGTCAAA
	dnaJ^*	6	GAAAGTCAAAGTCAGGCAT
	clpP	B
CcpA regulon^§ (regulator: CcpA)	bglS	47	TAGAAAACGCTTTCAA
	msmX	47	GTAAACGCTTTCTT
	yvfK	47	...TCTT-TAAAGCGCTTTCAT
	mfd	47	GACCAAAGCGTTTTT
	bglP	59	AAATGAAAGCGTTGACA
	sucC	59	TATAGAATGAAAGCGC
	mmgA	D	ATTGTAAGCGCT
	hutP	D	AGTTAATAGTTATCAGA
	rbsR	D	GTAAACGGTTACATAAACA
	yxjC	B
	ackA	B
	licB	B
	acuA	B
	acsA	E
	xylA	E
	iolB	E
	galT	E
	uxaC	E
	ydhO	E
	acoA	E
	araB	E
	lcfA	E
	dra	E
	kdgA	E
	yobO	E
	treP	E
	yxkJ	A
	amyE	A
	gntR	A
	xynP	A
	levD	A
	dctP	A
	citM	A

*Probable new members identified by our analysis are shown with anasterisk. †Cluster number(s) are shown when available, otherwise, one of the situation codes is shown: A, orthologous genes not found; B, no overlaps between known binding site and PCE; C, PCE overlaps with known site but is too short; D, PCE overlaps with known site but is slightly different; E, binding site exists within the coding region. ‡PCE sequence in B. subtilis. The region overlapping with a known binding site is shown in bold. §CcpA-dependent genes identified by a systematic experiment [31] are not included.

Clusters 2 and 3: the T-box family

One of the most conspicuous clusters detected in our analysis was the so-called T-box family, which consists of many aminoacyl-tRNA synthetase operons and some operons related to amino-acid biosynthesis [22]. It is known that these operons are regulated by the attenuation mechanism, where an uncharged tRNA molecule is used as an effector. The PCE shared in cluster 2 is a part of the attenuation region where an uncharged tRNA is believed to bind (the T-box), whereas the PCE in cluster 3 is a region loosely complementary to the T-box. All the members of cluster 3 are included in cluster 2. In addition to 11 aminoacyl-tRNA synthetases, it makes sense that proB and ilvB were clustered because their function is related to amino-acid synthesis. However, three additional members could not be detected; two of them had less similar or shorter PCEs and the other did not have an orthologous counterpart.

Cluster 34: the pyr operon

The pyr operon contains at least three genes, each of which is directly regulated by PyrR, a transcription attenuation regulator ([23] and Figure 3a). Each leader region of these genes can form three different RNA secondary structures (terminator, antiterminator and anti-antiterminator) when transcribed ([24] and Figure 3b). PyrR then binds to the anti-antiterminator regions of the mRNAs. Cluster 34 contains pyrR and pyrP, their PCEs corresponding to a part of each anti-antiterminator. The other gene, pyrB, was, however, not detected because its PCE was not sufficiently well conserved to become long enough for clustering.

Figure 3

Post-transcriptional regulation of the pyr operon. (a) The three attenuation regions in the operon. (b) Two alternative secondary structures of the transcript of each attenuation region. In the presence of high UMP concentration, PyrR binds to the anti-antiterminator and stabilizes the formation of the terminator structure, while preventing the formation of the antiterminator.

Clusters 1, 5 and 11; S-box regulon

The S-box regulon is a hypothetical regulon relating to methionine and/or cysteine biosynthesis. The leader regions of its putative transcriptional units have considerable sequence similarity and seem to form complex secondary structures that are similar to those in the pyr operon [25]. Three different PCEs were identified in our analyses, each of which forms a cluster related to each of the others. The PCEs correspond to several parts of the hypothetical anti-antiterminator region, where an unidentified binding factor is postulated to stabilize its secondary structure [25]. Of the eleven putative members of this regulon, seven were included in at least one of these clusters whereas three could not be detected because of the lack of orthologous genes. The leader region of the remaining one, cysH, was very poorly conserved.

Clusters 14 and 20: hypothetical xanthine metabolic regulon

It has been suggested that the expression of the xpt-pbuX operon in B. subtilis is regulated by a termination-antitermination control mechanism similar to the mechanism suggested for the pur biosynthesis operon, purEKBCSLQFMNHD [26]. It has been speculated that the regulatory proteins of these two operons are the same because they seem to have the same effector - xanthine [26]. Our results support this hypothesis because xpt and purE were clustered in cluster 20. xpt also belongs to another cluster, 14, with pbuG. As the PbuG protein has the characteristic Pfam [27] domain of the xanthine/uracil permease family, pbuG is very likely to be a new member of the xanthine metabolism regulon.

Cluster 6: class III heat-shock regulon

This cluster corresponds to a part of the class III heat-shock regulon, which is regulated by CtsR. Cluster 6 contains two of the three known genes that have experimentally verified CtsR-binding sites [28,29]. Interestingly, cluster 6 contains dnaJ, which belongs to the dnaK operon - hrcA-grpE-dnaK-dnaJ-yqeT-yqeU-yqeV [30]. As the dnaK operon is involved in the class I heat-shock regulon (which corresponds to cluster 13) and as there is an internal promoter between dnaK and dnaJ [30], there is likely to be regulatory overlap between the class I and the class III heat-shock regulons.

Clusters 12, 47, 52 and 59: genes under glucose repression

The largest genetic network identified so far in B. subtilis is the regulatory system that is stimulated by glucose repression, in which the transcription factor CcpA has a central role [6]. In our analysis, not many known CcpA-dependent genes were clustered and they were even split into three subgroups (clusters 47, 52 and 59). Two members of cluster 47 have PCEs overlapping with the CcpA-binding site, and another member, yufK, was recently shown in a microarray experiment to be under glucose repression [31]. In cluster 52, araA was also shown to be under glucose repression. It seems very likely that CcpA regulates all members of this cluster because their PCEs are similar to the CcpA-binding site and their functions are consistent with this hypothesis. As for cluster 59, both of its two members, bglP and sucC, were shown to be under glucose repression [31]. Many other known genes are regulated by CcpA. As shown in Table 2, their CcpA-binding sites reside within their coding regions in most cases, whereas these sites can be less conserved in other cases. As noted above, many of the known binding sites overlap with PCEs. Therefore, it seems possible that the split into subgroups has some biological meaning. There are also co-expressed genes that are subject to CcpA-independent glucose repression. All three members of cluster 12 were shown to be under glucose repression, two of which, gapB and pckA, were shown to be CcpA-independent in a recent systematic experiment [31]. Our results support this because PCEs in cluster 12 are not similar to the CcpA-binding site.

Potentially new regulons/members

As described above, we found several potentially new members of known regulons: for example, dnaJ in cluster 6, pbuG in cluster 14, kduI and odhA in cluster 52 (see table in Additional data files for more examples). In addition, topA in cluster 15 is likely to belong to the SpooA regulon because PCEs of this cluster is very similar to the SpooA-binding site and its functions are related to sporulation. There are, however, potential regulons not reported so far. For example, aroA and aroF in cluster 29 seem to constitute a regulon related to the metabolism of aromatic amino acids. In this regard, clusters 24 and 16 are especially interesting. Cluster 24 contains two genes (dnaA and dnaN) related to DNA replication and its PCEs are very similar to the DnaA-binding site (DnaA-box:TTATCCACA). yqeG, another member of cluster 24, has two DnaA-like PCEs in its upstream region. It is known that DnaA box is often found in multiple copies. Moreover, cluster 16 contains yqeG and dnaA, its PCEs being very similar to the SpooA-binding site. Thus, it is likely that both DnaA and SpooA bind to the upstream regions of yqeG and dnaA, suggesting a new crosstalk of regulatory networks between DNA replication and sporulation. yqeG, whose function cannot be inferred by sequence similarity, may be involved in DNA replication and/or sporulation. As there are many additional cases where functionally related genes are included in the same cluster (see Additional data files), we expect that future experiments will prove that at least some of them are co-regulated.

On the possibility of misclustering due to general patterns

In our method, there is a concern that a set of functionally unrelated genes can be clustered from general motifs such as the -35/-10 boxes and the SD sequence. Thus, we investigated the occurrences of these motifs in the clusters. As the SD sequence is located at some relatively definite distances from the translation start site, which is known at least in principle, it is relatively easy to detect the SD sequence. With the criterion described in Materials and methods, we excluded 34 clusters, all members of which contain an SD-like PCE (Table 3). Apparently, many of these genes are translation related (that is, ribosomal proteins and elongation factors). Possibly their SD sequence has been highly conserved to maximize their translation efficiency. Another possibility is that there are some factors that recognize such SD-like PCEs and that these clusters are co-regulated by them.

Table 3

Clusters having SD-like PCEs

Gene	Functional classification^*
atpG	Membrane bioenergetics
spoVG	Sporulation
yyaA	Sporulation
rpsS	Ribosomal proteins
rpoC	Elongation
rpsL	Ribosomal proteins
rpoB	Elongation
ydaO	None
ydcD	None
secG	None
sspE	Sporulation
rplK	Ribosomal proteins
sspA	Sporulation
rpsF	Ribosomal proteins
rplJ	Ribosomal proteins
rplU	Ribosomal proteins
ftsA	Cell division
rpmE	Ribosomal proteins
fusA	Elongation
cysE	Metabolism of amino acids and related molecules
yeeI	None
rpoA	Elongation
gerE	Regulation
sigA	Initiation
gerM	Germination
asnS	Aminoacyl-tRNA synthetases
nusG	Termination
ypjB	None
yjcI	Metabolism of amino acids and related molecules
sigG	Initiation
acpA	Metabolism of lipids
prfA	Termination
thdF	Detoxification
minC	Cell division
cwlJ	Cell wall
hag	Mobility and chemotaxis
aprX	Metabolism of amino acids and related molecules
tsf	Elongation
yvgY	Transport/binding proteins and lipoproteins
yabR	Metabolism of nucleotides and nucleic acids
yqfC	None
ileS	Aminoacyl-tRNA synthetases
yocD	Detoxification
gcvH	Metabolism of amino acids and related molecules
rpsJ	Ribosomal proteins
rplQ	Ribosomal proteins
dnaA	DNA replication
thrS	Aminoacyl-tRNA synthetases
ysgA	RNA modification
yjzC	None
ytdA	Specific pathways
ywrD	Metabolism of amino acids and related molecules
spoVT	Regulation
dxs	Specific pathways
pyrP	Transport/binding proteins and lipoproteins
leuS	Aminoacyl-tRNA synthetases
ykkC	Protein folding
ylaN	None
yslB	None
thrS	Aminoacyl-tRNA synthetases
rpsD	Ribosomal proteins
yvdF	Specific pathways
citG	TCA cycle
ykoY	Detoxification
ripX	Phage-related functions
trpE	Metabolism of amino acids and related molecules
lepA	Elongation
greA	Elongation
ytaG	None
citZ	TCA cycle
ykzG	None
yocC	None
ybyB	None
pgk	Main glycolytic pathways
pheS	Aminoacyl-tRNA synthetases
pbpX	Cell wall
rpsB	Ribosomal proteins
ybxF	Ribosomal proteins
gcaD	Cell wall
yrrK	None
zur	Regulation
yjbK	None
pdhA	Main glycolytic pathways
yneF	None
fbaA	Main glycolytic pathways
rocD	Metabolism of amino acids and related molecules
cah	Detoxification
appA	Transport/binding proteins and lipoproteins
appA	Transport/binding proteins and lipoproteins
yobV	None
spoIVA	Sporulation

*Functional classification is obtained from the SubtiList website [42,43]. Genes belonging to the same cluster are grouped together

It is more difficult to detect the -35/-10 boxes than the SD sequence because the distance between the start sites of transcription and translation is rather variable. We investigated the number of known -35/-10 boxes overlapping with the PCEs using the DBTBS database [19,20]. As shown in Table 4,19% of them overlap with the PCEs on average. It is possible that the presence of the -35/-10 boxes might have affected the clustering of clusters 7, 22, 42, 53, 122, 129, 134 and 144. However, we do not regard this as a serious problem because the conservation of these boxes is relatively weak and because it is natural that many regulatory elements overlap with the -35/-10 boxes. Namely, if a PCE overlaps with the -35/-10 box in a cluster, it does not directly mean that the clustering is a mistake. On the other hand, it could be also problematic if no -35/-10 like elements were found around PCEs because it may not be a promoter region but an intergenic region within an operon. However, considering that it is still difficult to predict the position of promoters in bacterial genomes exactly, we did not use information of promoter existence in our scheme. In future, it seems to be reasonable to include the prediction of operon structure in our method [32,33,34].

Table 4

Number of -35/-10 boxes that overlap with PCEs for each sigma factor

Sigma factor	Number of sites^*	Number of -35 boxes^†	Number of -10 boxes^‡
SigA	62	9	14
SigB	9	0	0
SigD	5	0	2
SigE	19	3	7
SigF	7	2	1
SigG	14	3	2
SigH	8	0	2
SigK	13	3	2
SigL	1	0	1
SigW	9	4	2
SigX	2	0	0

-35/-10 boxes that overlap with a PCE by 5 bp or more were counted. If the box is shorter than 5 bp, those fully overlapping with PCE were counted. *Number of known -35/-10 boxes that exist in the regions analyzed in our work. †Number of -35 boxes that overlap with PCE. §Number of -10 boxes that overlap with PCE.

Conclusions

In this work, we aligned the upstream regions of orthologous genes between three closely related species and identified the PCEs within them. Genes of B. subtilis were then clustered according to the similarity of the PCEs in their upstream region. Most parameters in our method were determined such that as many known co-regulated genes are clustered together and the nature of the clustered genes was thoroughly investigated. In this sense, the use of closely related species, one of which has a long history of experimental research, was essential in our work. There are several potential difficulties in our approach. One is that the regulatory system of co-regulated genes must be conserved in a pair of species at least. In fact, even in the close relatives compared, only a proportion of genes had orthologous counterparts. However, this situation will be improved as the number of sequenced bacterial genomes increases. Another is that it is difficult to cluster genes harboring relatively short and/or variable elements. For example, although many of the known binding sites for CcpA, AbrB, SpooA and LexA overlap with PCEs, genes regulated by them were not clustered well with a reasonable value of the cut-off score. Currently, it is rather difficult to detect elements of about 6 bp long. It seems biologically reasonable, however, that in some large regulons, such as one regulated by CcpA, its binding affinity is modulated for each element. Thus, that all members of a known large regulon are not clustered is not always a failure of our approach. The third difficulty is related to the operon structure of bacterial genes. In some operons, the order of constituent genes is not conserved across species. Our method could not deal with cases when the position of the first gene was changed. As noted above, future incorporation of operon prediction may be useful. In fact, there is already research combining the predictions of transcription units and transcription factor binding sites [8]. On the other hand, our method could detect not only the DNA-binding sites for transcription factors but also some binding sites in RNA or conserved RNA secondary structure elements. This seems to reflect the fact that B. subtilis heavily exploits the antitermination mechanism to control gene expression [22]. Thus, our method could grasp a global feature of the gene regulatory mechanism in B. subtilis, without any a priori knowledge about it. In conclusion, although it is difficult to detect the entire set of co-regulated genes with our method, it can be used as a powerful tool to explore them. In addition, our results can be used as criteria for comparing results from other methods, and are useful for developing a more elaborate method. Thus, our approach is a model for further studies.

Materials and methods

Genome sequence data

Genome sequences of B. subtilis [17] and B. halodurans [18] with the annotation information were obtained from GenBank [35] (accession numbers: AL009126 and BA000004, respectively). Unfinished genome sequence of B. stearothermophilus was downloaded from the website of the B. stearothermophilus genome-sequencing project at the University of Oklahoma [36].

Identification of orthologous genes

Genes orthologous between B. subtilis and B. halodurans were obtained by finding the best match counterpart of amino-acid sequence from each genome with BLASTP [37]. As the annotation in the genome of B. stearothermophilus was not given, orthologs between B. subtilis and B. stearothermophilus were obtained as follows: a TBLASTN search was done against the contig sequences of B. stearothermophilus for each amino-acid sequence of B. subtilis. If the best-hit alignment started before the tenth residue of the query, this translated counterpart was used as a BLASTP query against all B. subtilis sequences. If its best hit was identical with the initial query, they were regarded as orthologous.

Alignment of upstream regions

Although binding sites for transcription factors can sometimes exist in coding regions, we excluded B. subtilis genes with upstream intergenic regions of less than 50 bp from further analyses, in order to reduce potential noise. Next, the upstream 300 bp region of each B. subtilis gene and that of an orthologous gene, if any, were aligned with a local pairwise alignment program LALIGN [38,39]. The open gap penalty was set to 20, which is higher than the default value. Locally conserved regions in an upstream region of 300 bp from closely related species were realigned with the entire upstream region of B. subtilis without overlap. The most conserved element of either B. halodurans or B. stearothermophilus was first aligned with the upstream 300 bp sequence of B. subtilis. Next, the second most conserved element is aligned, unless this element overlaps with the previous alignment. This procedure was repeated for all detected elements. The final alignments are shown in DBTBS [19,20].

Identification of phylogenetically conserved elements (PCEs)

On the basis of the alignments described above, we defined PCEs within the upstream noncoding region as follows: first, 3 bp segments where all of the nucleotides were conserved for three species were sought. Then, each segment was extended until a consecutive unconserved site appeared for each direction. Unless its score was less than 10, the sequence was designated a PCE (for the scoring of PCEs, see below). To increase the number of PCEs, we also identified PCEs even when they were conserved in only two species under a more stringent condition: segments of 6 bp where the nucleotides were conserved at all positions were first sought. Then, each sequence was extended in each direction until it faced a 3 bp segment in which two of the positions were unconserved. Unless its score was less than 20, it was assigned as a PCE (the cut-off score was chosen by observing the number of spurious PCEs detected when the upstream regions are joined to unrelated coding sequences). Thus, a PCE is an alignment of three or two conserved fragments from different species.

Scoring PCEs

Suppose a PCE, denoted by M, consists of a set of fragments of (two or three) species, S. The score of M was defined by Score(M) = -log2 [<Πx FxiNx>] (i ∈ S, x ∈ A, T, G, C), where the brackets (< >) denote an average over S, Fxidenotes the fraction of nucleotide x in the 300 bp upstream sequence of species i, and Nx is the number of positions at which nucleotide x is conserved over S in M. Thus, the score of PCEs becomes low if they are short and rich in frequent nucleotides.

Clustering genes

Genes were clustered according to the similarity of PCEs in their upstream region. A similarity measure (SMN) between two PCEs, M and N, was defined by the sum of all pairwise alignment scores between any constituent sequences from both PCEs: sMN = ΣmnLmn (m ∈ sequences in M, n ∈ sequences in N) Lmn = max [lmn, d • lmn'], where lmn denotes the score of the Smith-Waterman local alignment algorithm [40] between constituent sequences m and n (the match score, the mismatch cost and the gap cost were set to 1, 2 and 3, respectively); n' denotes the reverse complement of n; and d is an empirical cost for selecting n' (we set d = 0.9). As sMN becomes larger as the number of constituent sequences of M and N is larger, sMN was further normalized as follows: SMN = sMN • 9b/(km • kn), where km, and kn denote the number of constituent sequences of M and N, respectively; b is again an empirical cost for smaller values of km, or kn: b = 1.0 if both km and kn are 3 b = 0.9 if either km or kn is 2 b = 0.8 if both km and kn are 2 We used a simple algorithm UPGMA [41] to cluster genes. The UPGMA algorithm was continued until no pairs of PCEs have a normalized similarity value of more than 60. We chose all of the above-mentioned empirical parameters by observing the results for known co-regulated genes.

Discarding clusters with SD-like PCEs

We discarded clusters when all of their members contain the SD sequence-like elements. More specifically, a member is considered to have an SD-like element if the B. subtilis sequence of its PCE contains a 5 bp segment where there are at least two Gs and one A but no Cs, and if this segment lies within the region 20 bp upstream from the translation initiation site. Subsequently, the cluster was discarded if all of the other members also have its corresponding regions.

Additional data files

A table showing all clusters ranked by the highest similarity score within each cluster is available as an Excel file.

Additional data file 1

Table showing all clusters ranked by the highest similarity score within each cluster Click here for additional data file

37 in total

1. Prediction of operons in microbial genomes.

Authors: M D Ermolaeva; O White; S L Salzberg
Journal: Nucleic Acids Res Date: 2001-03-01 Impact factor: 16.971

2. Combined transcriptome and proteome analysis as a powerful approach to study genes under glucose repression in Bacillus subtilis.

Authors: K Yoshida ; K Kobayashi; Y Miwa; C M Kang; M Matsunaga; H Yamaguchi; S Tojo; M Yamamoto; R Nishi; N Ogasawara; T Nakayama; Y Fujita
Journal: Nucleic Acids Res Date: 2001-02-01 Impact factor: 16.971

Review 3. Aminoacyl-tRNA synthetase gene regulation in Bacillus subtilis.

Authors: C Condon; M Grunberg-Manago; H Putzer
Journal: Biochimie Date: 1996 Impact factor: 4.079

4. The dnaK operon of Bacillus subtilis is heptacistronic.

Authors: G Homuth; S Masuda; A Mogk; Y Kobayashi; W Schumann
Journal: J Bacteriol Date: 1997-02 Impact factor: 3.490

5. Function of RNA secondary structures in transcriptional attenuation of the Bacillus subtilis pyr operon.

Authors: Y Lu; R J Turner; R L Switzer
Journal: Proc Natl Acad Sci U S A Date: 1996-12-10 Impact factor: 11.205

6. Identification of common molecular subsequences.

Authors: T F Smith; M S Waterman
Journal: J Mol Biol Date: 1981-03-25 Impact factor: 5.469

7. DBTBS: a database of Bacillus subtilis promoters and transcription factors.

Authors: T Ishii; K Yoshida; G Terai; Y Fujita; K Nakai
Journal: Nucleic Acids Res Date: 2001-01-01 Impact factor: 16.971

8. A comparative genomics approach to prediction of new members of regulons.

Authors: K Tan; G Moreno-Hagelsieb; J Collado-Vides; G D Stormo
Journal: Genome Res Date: 2001-04 Impact factor: 9.043

9. Roles of the three transcriptional attenuators of the Bacillus subtilis pyrimidine biosynthetic operon in the regulation of its expression.

Authors: Y Lu; R J Turner; R L Switzer
Journal: J Bacteriol Date: 1995-03 Impact factor: 3.490

10. Conservation of the binding site for the arginine repressor in all bacterial lineages.

Authors: K S Makarova; A A Mironov; M S Gelfand
Journal: Genome Biol Date: 2001-03-22 Impact factor: 13.583

10 in total

1. The identification of functional modules from the genomic association of genes.

Authors: Berend Snel; Peer Bork; Martijn A Huynen
Journal: Proc Natl Acad Sci U S A Date: 2002-04-30 Impact factor: 11.205

2. Regulog analysis: detection of conserved regulatory networks across bacteria: application to Staphylococcus aureus.

Authors: Wynand B L Alkema; Boris Lenhard; Wyeth W Wasserman
Journal: Genome Res Date: 2004-07 Impact factor: 9.043

3. New RNA motifs suggest an expanded scope for riboswitches in bacterial genetic control.

Authors: Jeffrey E Barrick; Keith A Corbino; Wade C Winkler; Ali Nahvi; Maumita Mandal; Jennifer Collins; Mark Lee; Adam Roth; Narasimhan Sudarsan; Inbal Jona; J Kenneth Wickiser; Ronald R Breaker
Journal: Proc Natl Acad Sci U S A Date: 2004-04-19 Impact factor: 11.205

10. A regulatory similarity measure using the location information of transcription factor binding sites in Saccharomyces cerevisiae.

Authors: Wei-Sheng Wu; Ming-Liang Wei; Chia-Ming Yeh; Darby Tien-Hao Chang
Journal: BMC Syst Biol Date: 2014-12-12