Much is known regarding the structure and logic of genetic regulatory networks. Less understood is the contextual organization of promoter signals used during transcription initiation, the most pivotal stage during gene expression. Here we show that promoter networks organize spontaneously at a dimension between the 1-dimension of the DNA and 3-dimension of the cell. Network methods were used to visualize the global structure of E. coli sigma (σ) recognition footprints using published promoter sequences (RegulonDB). Footprints were rendered as networks with weighted edges representing bp-sharing between promoters (nodes). Serial thresholding revealed phase transitions at positions predicted by percolation theory, and nuclei denoting short steps through promoter space with geometrically constrained linkages. The network nuclei are fractals, a power-law organization not yet described for promoters. Genome-wide promoter abundance also scaled as a power-law. We propose a general model for the development of a fractal nucleus in a transcriptional grammar.
Keywords:
E. coli; power-law scaling; promoter footprint; systems biology; transcription
In prokaryotes, one of several sigma (σ) factors binds to a promoter upstream of a gene and helps position RNA polymerase during transcription initiation. Though consensus and canonical promoter motifs are frequently referenced in textbooks and the literature, genome-scale surveys have forced a reconsideration of the specific role played by these idealized sequences.1–3 Actual promoters can vary in sequence considerably while still binding the same σ, though efficiencies vary several-fold.4 Collectively these promoter sequences form a footprint in promoter space, defining a regulon of genes responsive to a particular environmental cue or cellular need. Each σ represents a hub, or highly connected node, in the overall gene regulatory network. Our concern in this study is with the structure of promoter variation, specifically the topology of a hub footprint.
Our use of networks to visualize promoter diversity departs from their traditional use in gene regulation research. Putting aside protein interaction networks (PINs), transcriptional interdependencies are visualized using two main approaches: (1) Most common is the gene regulatory network (GRN), often generated using gene expression data, which conveys information on the realized interdependencies among genes.5–8 Nodes represent genes, and certain of the protein products act as regulators of one or more of the genes in the network. Regulatory relationships are denoted by directed edges between nodes, and global studies of the transcriptome are now commonplace. (2) Studies that explicitly consider promoter diversity focus more on the nature and pattern of variation in the cis-element signals used to initiate transcription, but here global or large-scale network approaches are not typical. For example, one promoter diversity study2 examined the details of σ70 promoter variation in E. coli, but did not render relationships as a network. Another study9 developed a regulatory network for acid resistance genes in E. coli, but theirs was a conceptual model. Another3 produced a hierarchical clustering model representing sequence similarities among 441 E. coli promoters, yet hierarchical trees carry the unnecessary constraint that cycles must be avoided in the rendering of network relationships.
Here we explore the structure of promoter networks from E. coli using affiliation-based subgraph extractions, or serial thresholding. Promoter predictions were obtained from RegulonDB and include three regulons mediating a type of stress response10 (σ24, σ28, and σ54) along with the larger housekeeping σ70. Networks were generated with edge weights representing the number of bases shared between pairs of promoter sequences (nodes). Rather than exploring the network in its totality as a weighted graph, we broke the network into a series of subgraphs based on edge weights and examined subgraph features separately. In particular, attributes of the LCC (largest connected component) of each network were tracked across a range of critical edge values.
We consider the following specific questions: (1) What is the apparent role, if any, of the consensus promoter motif? What is the frequency of predicted promoters in the genome? (2) What is the topological structure of variation across promoter sequences in a regulon of genes, and does this structure vary across regulons? How does the organization of predicted promoter networks compare to that of networks built from random sequence promoters? (3) Do the results suggest a mechanism for promoter evolution?
Experimental Procedures
Promoter sequences
Promoter sequences were obtained from RegulonDB. The RegulonDB database11 (http://regulondb.ccg.unam.mx/) is the primary reference database for the transcriptional regulatory network of Escherichia coli K-12 (substr. MG1655, GenBank ref. seq. NC_000913.2, GI: 49175990). Predictions are anchored by experimental evidence on the location of transcription start sites determined by RegulonDB using a modified 5′ RACE procedure.
Predicted promoter data files (accessed 5.26.09) contained the base sequence of both boxes (−35 and −10 boxes) and the size in bp of the intervening spacer region, along with promoter positions in the genome. We studied three regulons in detail: σ24 (799 genes), σ28 (122 genes), σ54 (151 genes). The large housekeeping regulon σ70 (4010 genes) was added later in the study. Base sequence information included: σ24 and σ54, 11 bp (6 bp of −35 box, 5 bp of −10 box); σ28, 15 bp (7 and 8 bp, respectively); and σ70, 17 bp (9 and 8 bp, respectively). Alignments used were as provided by RegulonDB.
Power-law scaling of promoter abundances
We used a Perl script to survey the E. coli K-12 genome and assess the abundance of the predicted promoter motifs along with their inferred consensus sequence for each regulon. These distributions were evaluated for their fit relative to a Pareto distribution12 using Matlab. For this purpose we evaluated F(x) for each graph, the complementary cumulative distribution function (ccdf), which is a monotonically non-increasing function describing the probability that a random variable X takes a value greater than x:
$$F(x) = P(X > x) = 1 - \mathrm{cdf}(x) = \left(\frac{x}{x_{\min}}\right)^{-\alpha}$$
where cdf is the standard cumulative distribution function, and $x_{\min}$ is the minimum value taken by x. In the evaluation of promoter frequencies in the genome, we used both a measure of goodness of fit (R²) and an estimate of the scaling exponent (γ). After taking logs of both sides, α was obtained from the slope:
$$\log F(x) = -\alpha \log x + \alpha \log x_{\min}$$
and the scaling exponent as γ = α + 1 such that
$$p(x) \propto x^{-\gamma}$$
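The ccdf fit can be illustrated with a minimal Python sketch. It assumes the usual Pareto form, F(x) proportional to x^(−α) with γ = α + 1, and an ordinary least-squares line on the log-log plot. The function names and the motif counts are hypothetical, and this is not the Perl/Matlab code used in the study.

```python
import math
from collections import Counter


def empirical_ccdf(values):
    """Return (x, F(x)) pairs, with F(x) estimated as the fraction of values >= x.

    Assumption: ">=" is used rather than ">" so that the largest observed
    value keeps a non-zero probability and can be log-transformed.
    """
    n = len(values)
    freq = Counter(values)
    remaining = n
    points = []
    for x in sorted(freq):
        points.append((x, remaining / n))
        remaining -= freq[x]
    return points


def fit_power_law(points):
    """Least-squares line through (log10 x, log10 F); returns (alpha, gamma, R^2)."""
    lx = [math.log10(x) for x, _ in points]
    ly = [math.log10(f) for _, f in points]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxx = sum((a - mx) ** 2 for a in lx)
    syy = sum((b - my) ** 2 for b in ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    alpha = -(sxy / sxx)          # the slope of the log-log line is -alpha
    r_squared = (sxy * sxy) / (sxx * syy)
    return alpha, alpha + 1.0, r_squared


# Hypothetical data: genome occurrences of each promoter motif in a regulon.
counts = [1, 1, 1, 1, 2, 2, 3, 5]
alpha, gamma, r2 = fit_power_law(empirical_ccdf(counts))
print(f"alpha={alpha:.3f} gamma={gamma:.3f} R^2={r2:.3f}")
```

A regression on the log-log ccdf is the simplest estimator and matches the slope-based description in the text; maximum-likelihood estimators are a common alternative for small samples.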
Predicted promoter networks
Sequence and spacer information were used to calculate $A_{ij}$, the number of bp shared between promoter sequences i and j. A gap penalty (−1 per bp) was applied for mismatches in spacer sizes in the RegulonDB alignments. These weighted edge values populated the adjacency matrix, $\mathbf{A}$, which was used to construct a network, or graph G. Networks were visualized using Pajek13 and the Kamada-Kawai14 projection. Networks were analyzed using scripts written in Python that utilized NumPy, SciPy, and NetworkX,15 an open source Python package for the analysis of complex networks (http://networkx.lanl.gov/).
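The edge-weight calculation can be sketched as follows. The sketch rests on several assumptions: each promoter is represented as a (−35 box, spacer size, −10 box) tuple, bp-sharing is counted position by position within the aligned boxes, and the gap penalty is the absolute difference in spacer size. The sequences are invented for illustration, and the exact scoring used in the study may differ.

```python
def shared_bp(p, q, gap_penalty=1):
    """Edge weight between two promoters given as (box35, spacer_bp, box10).

    Counts identical bases at aligned positions of both boxes, then subtracts
    gap_penalty for every bp of difference in spacer size (assumed scheme).
    """
    seq_p = p[0] + p[2]
    seq_q = q[0] + q[2]
    matches = sum(a == b for a, b in zip(seq_p, seq_q))
    return matches - gap_penalty * abs(p[1] - q[1])


def adjacency_matrix(promoters):
    """Symmetric weighted adjacency matrix; the diagonal is left at 0 (no self-loops)."""
    n = len(promoters)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            A[i][j] = A[j][i] = shared_bp(promoters[i], promoters[j])
    return A


# Hypothetical 11-bp footprints (6 bp of the -35 box, 5 bp of the -10 box).
promoters = [
    ("GGAACT", 16, "TCTGA"),
    ("GGAACT", 16, "TCTGA"),
    ("GGAACC", 17, "TCTGA"),
    ("TTTACA", 16, "AATAA"),
]
for row in adjacency_matrix(promoters):
    print(row)
```

Under this scheme a weight can become negative when two promoters share few bases and differ in spacer size; how such cases were handled in the original analysis is not stated.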
Random promoter networks
Random promoter networks were generated for Monte Carlo tests by forming a set of n promoters, each through B random draws from a uniform base distribution (A, C, G, T). We considered the three RegulonDB systems, σ24, σ28, and σ54, with promoter numbers n and footprint sizes B as noted above. The size of the spacer separating the −10 and −35 boxes was randomly drawn from the distribution of sizes in the relevant data set. Random promoter networks were then produced in the same fashion as with the predicted promoter networks.
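A random promoter set of this kind could be generated roughly as below. The function name, the seed, and the list of observed spacer sizes are hypothetical; only the uniform base draws, the resampling of spacer sizes from the data, and the σ54 dimensions (n = 151, 6 + 5 bp) come from the text.

```python
import random

BASES = "ACGT"


def random_promoter_set(n, len35, len10, spacer_sizes, rng):
    """Build n random promoters as (box35, spacer_bp, box10) tuples.

    Bases are drawn uniformly from A, C, G, T; each spacer size is resampled
    from the sizes observed in the corresponding predicted promoter set.
    """
    promoters = []
    for _ in range(n):
        box35 = "".join(rng.choice(BASES) for _ in range(len35))
        box10 = "".join(rng.choice(BASES) for _ in range(len10))
        spacer = rng.choice(spacer_sizes)
        promoters.append((box35, spacer, box10))
    return promoters


rng = random.Random(42)                            # fixed seed for repeatability
observed_spacers = [15, 16, 16, 17, 17, 17, 18]    # hypothetical spacer sizes
sample = random_promoter_set(151, 6, 5, observed_spacers, rng)
print(sample[:3])
```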
Network extractions using thresholding
Subgraphs were extracted using serial thresholding, or affiliation-based extraction,16 performed as follows. For m-slices, we sequentially removed all edges from graph G below a sliding critical integer threshold m (1 < m < B), where B was the maximum number of bp in the promoter sequence. For x-sections, we used discrete intervals based on the same sliding scale of integer threshold values, removing edges above and below that value of x. At each step, we then extracted the largest (maximal) connected component (LCC), the largest set of nodes that remain interconnected after selective edge removal from G. For each LCC, the number of nodes (graph size) and number of edges were evaluated. LCC that retained at least half of the nodes in graph G were giant components, by definition.
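The two extraction modes can be sketched in plain Python. The sketch assumes that an m-slice keeps edges of weight at least m and that an x-section keeps edges of weight exactly x; the edge weights in the toy example are invented. The study used NetworkX, whereas this version uses only the standard library so that the traversal is explicit.

```python
from collections import defaultdict, deque


def largest_component(nodes, edges):
    """Return the node set of the largest connected component (breadth-first search)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, best = set(), set()
    for start in nodes:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    comp.add(v)
                    queue.append(v)
        if len(comp) > len(best):
            best = comp
    return best


def m_slice(weights, m):
    """Keep edges whose weight is at least m (edges below m are removed)."""
    return [e for e, w in weights.items() if w >= m]


def x_section(weights, x):
    """Keep only edges whose weight equals x."""
    return [e for e, w in weights.items() if w == x]


def lcc_profile(nodes, weights, thresholds, extract):
    """For each threshold: (threshold, LCC nodes, LCC edges, is_giant_component)."""
    rows = []
    for t in thresholds:
        kept = extract(weights, t)
        comp = largest_component(nodes, kept)
        n_edges = sum(1 for u, v in kept if u in comp and v in comp)
        rows.append((t, len(comp), n_edges, len(comp) >= len(nodes) / 2))
    return rows


# Toy example: four promoters with hypothetical bp-sharing weights.
nodes = [0, 1, 2, 3]
weights = {(0, 1): 8, (1, 2): 8, (0, 2): 5, (2, 3): 3, (0, 3): 2, (1, 3): 2}
for row in lcc_profile(nodes, weights, range(2, 9), x_section):
    print(row)
```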
Monte Carlo tests
We used Monte Carlo randomizations to compare the node and edge counts of the LCCs obtained from the predicted promoter networks with their random counterparts through a series of x-sections. Bonferroni corrections to α were used for the multiple tests within a regulon (tests were performed only on LCCs of size n ≥ 5). Each replicate involved the production of a random promoter network from which a series of x-sections were extracted, and the node and edge counts appraised and stored. The replicate stored counts were used to form 95% confidence intervals wherein the observed data value was treated as if drawn from a distribution with at least r = 1,000 replicates (σ24 and σ54, r = 1,320; σ28, r = 1,800).
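One way to turn stored replicate counts into a Bonferroni-adjusted interval is sketched below. The percentile rule, the function names, and the replicate values are assumptions made for illustration; the text does not specify how the interval bounds were computed.

```python
import math


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level after a Bonferroni correction."""
    return alpha / n_tests


def percentile_interval(replicates, alpha):
    """Two-sided (1 - alpha) interval read off the sorted Monte Carlo replicates."""
    s = sorted(replicates)
    r = len(s)
    lo_idx = math.floor((alpha / 2) * (r - 1))
    hi_idx = math.ceil((1 - alpha / 2) * (r - 1))
    return s[lo_idx], s[hi_idx]


def outside_interval(observed, replicates, alpha):
    """True if the observed count falls outside the random-network interval."""
    lo, hi = percentile_interval(replicates, alpha)
    return observed < lo or observed > hi


# Hypothetical node counts from 1,000 random-network replicates at one x-section.
replicate_counts = [(7 * i) % 1000 for i in range(1000)]
alpha_per_test = bonferroni_alpha(0.05, 5)   # e.g. five x-sections tested in a regulon
print(percentile_interval(replicate_counts, alpha_per_test))
print(outside_interval(1200, replicate_counts, alpha_per_test))
```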
Estimating the fractal dimension
Song et al.17 showed how to measure the fractal dimension in a network by implementing the standard box covering method as a network coloring problem. In brief, for a given box length ($l_B$), or shortest path length between nodes, each node is colored in a fashion such that neighbors of like color are no further away than that current box length. Then the network is renormalized by collapsing adjacent nodes into a single node if they share the same color. Considering a range of box lengths, where each determines the renormalized node count, or graph size n, a plot of $l_B$ versus n on a log-log scale will be linear for networks with a fractal topology. A Python script was written (using NetworkX) to implement this renormalization method. The fractal dimension d is obtained from linear regression of the log-log transformation of the general scaling relation:
$$n(l_B) \sim l_B^{-d}$$
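A compact sketch of box counting by greedy coloring is given here, assuming the scaling relation has the usual form n ~ l_B^(−d). Two nodes may share a color (box) only if their shortest-path distance is less than the box length. The coloring is greedy in node order, so it gives an upper bound on the minimum number of boxes. The sketch counts boxes rather than building the renormalized graph, it uses only the standard library, and it is not the authors' NetworkX script.

```python
import math
from collections import deque


def all_shortest_paths(adj):
    """Unweighted shortest-path lengths from every node (breadth-first search)."""
    dist = {}
    for s in adj:
        d = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in d:
                    d[v] = d[u] + 1
                    queue.append(v)
        dist[s] = d
    return dist


def box_count(adj, dist, l_box):
    """Number of boxes (colors) needed so that same-colored nodes are < l_box apart."""
    color = {}
    for u in adj:
        # A color is forbidden if any node already holding it is too far from u.
        forbidden = {c for v, c in color.items()
                     if dist[u].get(v, math.inf) >= l_box}
        c = 0
        while c in forbidden:
            c += 1
        color[u] = c
    return len(set(color.values()))


def fractal_dimension(adj, box_lengths):
    """d = -slope of log(box count) regressed on log(box length)."""
    dist = all_shortest_paths(adj)
    lx = [math.log(l) for l in box_lengths]
    ly = [math.log(box_count(adj, dist, l)) for l in box_lengths]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    return -sxy / sxx


# Sanity check on a path of 8 nodes, a 1-dimensional object (d close to 1).
path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}
print(fractal_dimension(path, [1, 2, 4]))
```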
Results and Discussion
Power-law scaling of promoter abundance
Consensus sequence promoter motifs were not present in the predicted promoter sets from RegulonDB, and were rare or absent in the E. coli K-12 genome, as noted elsewhere.2 Of the regulons we examined, only the inferred consensus for σ28 occurred in the genome (three copies).
A subsequent survey of the full predicted promoter sets against the E. coli K-12 genome revealed that promoter abundances approximated a power-law. Log-log plots of the complementary cumulative distribution functions (ccdf) for promoter motif counts are shown (Fig. 1). We included the large σ70 regulon and, generally, sets with more promoters gave a better fit to a power function. Power-law scaling has been described before for gene frequencies within and across genomes and often attributed to gene and genome duplication events.18
Figure 1.
Promoter frequencies in genomes: Log-log plots of complementary cumulative distribution functions for occurrences of promoter motifs in the full genome: σ28 (n = 122 genes, α = 0.300, R² = 0.615), σ54 (n = 151, α = 1.925, R² = 0.819), σ24 (n = 799, α = 2.567, R² = 0.907), and σ70 (n = 4010, α = 1.704, R² = 0.935).
These findings support the growing view that consensus and canonical promoter motifs generally play an indirect role in genome evolution. That they rarely participate directly in transcription has been attributed to the fact that they bind σ too firmly, preventing promoter clearance and elongation, and that there is functionality in a weak promoter that can be modulated with compensatory regulators.1,2,4,19–21 In many cases promoters appear to be chimeric combinations of canonical and non-canonical binding sites,1,22 supporting the view that ‘perfect promoters are not biologically relevant’.1 We accept this sentiment insofar as it conveys the fact that consensus promoters actually perform little of the transcriptional work in the cell. We nuance this perspective by suggesting that the ideal consensus promoter represents the optimal DNA-protein binding chemistry and therefore serves as an organizing principle for the evolution of the transcriptional grammar and of the resultant topologies seen in the promoter networks described in this study.
Phase transitions in promoter networks
Serial extractions revealed phase transitions in the promoter networks (Fig. 2) at positions predicted by percolation theory (Fig. 3). The unreduced promoter networks were highly dense (>0.999), occluded by numerous weak edges representing the sharing of few bases. Thresholding provided targeted windows of lowered edge density through which we examined attributes of the LCCs.
Figure 2.
Largest connected components following extraction of x-sections by thresholding of three E. coli regulons. Each promoter network was broken into subgraphs based on edge weights using a series of integer threshold values (X, shown along the top of the figure). An x-section retained only edges of weight x = X bp shared between promoter sequences (nodes); every other step is shown in the figure.
Figure 3.
x-sectional profiles of number of nodes and edges for predicted promoters (lines) along with 95% confidence intervals (CIs, shaded regions). CIs were based on Monte Carlo simulations of random promoter networks built from sets of promoters of random base sequence, each with footprint and spacer attributes drawn from a predicted promoter set. Whereas predicted promoters were in close juxtaposition in promoter space (sharing ∼7–8 bp out of 11 or 15), random promoter networks had significantly more diffuse footprints (2–4 bp shared) consistent with binomial expectations. Vertical dashed lines mark the phase transitions predicted by percolation theory.
A phase transition is an abrupt change in the state of a system associated with incremental change in a system parameter, such as the shift with temperature between liquid and gas phases described by van der Waals.23 In networks, as edges are added (removed) randomly to a graph, there is a sudden increase (decrease) in global connectivity with emergence (fracture) of a giant component, a connected component containing at least half of the nodes.24 In a random graph of n nodes, this occurs predictably around the percolation threshold p = 1/n.
In Figure 3, we indicate the positions of the phase transitions expected from percolation theory in our plots of node and edge numbers. In each case, an expected phase transition is marked as a vertical dashed line positioned at the edge density p = 1/n. The resulting alignment of these positions with the observed phase transitions in node counts is taken as evidence of concurrence with theory. With σ54 as an example, n = 151 promoters yields a percolation threshold p of ∼0.0066. Of the 11,325 possible edges, this translates into ∼75 edges. Though our discrete categories are coarse, this is roughly where we observed the formation of the largest connected component in our x-sectional profile: between 3–4 bp shared, the number of edges changed from 4 to 340, and largest connected component size jumped from 5 to 128 nodes.
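The σ54 arithmetic above can be reproduced in a few lines. The function name is hypothetical; the formulas are p = 1/n and n(n − 1)/2 possible edges, so the expected edge count at the threshold reduces to (n − 1)/2.

```python
def percolation_expectation(n):
    """Return (threshold p, possible edges, expected edges at p) for n nodes."""
    p_c = 1.0 / n
    possible_edges = n * (n - 1) // 2
    return p_c, possible_edges, p_c * possible_edges


p_c, possible, expected = percolation_expectation(151)
print(f"p = {p_c:.4f}, possible edges = {possible}, expected edges = {expected:.0f}")
# prints: p = 0.0066, possible edges = 11325, expected edges = 75
```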
Topology of promoter networks
Whereas the LCCs from lower thresholds were fairly homogeneous and dense, containing numerous edges representing low-value bp-sharing, the LCCs emerging from the upper phase transition displayed considerable structural complexity. These network nuclei represent a significantly constrained limiting similarity among promoters as they contain information on high levels of bp-sharing among many of the promoters in the regulon. Monte Carlo tests showed that LCCs built from RegulonDB promoter sets contained significantly higher-valued edge weights than those of random promoter networks (Fig. 3).
The network nuclei have a fractal topology, as implied by their self-similar appearance (Fig. 4). LCCs captured from the upper phase transition were evaluated using the box covering method of Song et al.17 In the regulons we examined, the average fractal dimension was d = 1.731 (Fig. 5). This has the biological interpretation that a unit increase in the log of the box length (modular extent of promoter sequence similarity) is met with a 1.731-unit decline in the log of the graph size (number of nodes). It is noteworthy that the weakest fractal structure was displayed by σ28, which was the regulon whose consensus sequence appeared in the genome. Regulons with a highly fractal nucleus did not utilize their consensus promoter in the E. coli genome.
Figure 4.
Fractal nuclei of the four regulons captured at upper phase transitions. A) σ24, d = 1.492; B) σ70, 1.911; C) σ28, 1.929; D) σ54, 1.590. Promoter abundance in the E. coli K-12 genome is shown as node size variation. The consensus sequence (orange node) for σ28 occurred in the genome, others did not and are included for heuristic purposes. Networks were rendered using Pajek.13
Figure 5.
Fractal analysis of the upper phase transition nucleus for the four E. coli σ regulons: Log-log plots of box length (l) versus graph size (normalized number of nodes) for LCC. Fractal dimensions (mean d = 1.731) and coefficients of determination (mean R² = 0.957): σ28 (d = 1.929, R² = 0.949), σ54 (d = 1.590, R² = 0.959), σ24 (d = 1.492, R² = 0.978), and σ70 (d = 1.911, R² = 0.943).
DLA model of promoter evolution
These findings, including the mean fractal dimension of d = 1.731, suggested a specific repulsive mechanism for development of a fractal nucleus in a promoter network. A dimension of d = 1.7 is typical of fractals arising by diffusion-limited aggregation (DLA).25 In the general 2-d model, particles diffuse randomly as a Brownian motion, occasionally sticking to a growing cluster. Growth is through preferential attachment, but not to the oldest particles as in a scale-free model of network growth.26 Instead, particles attach preferentially to the growing arms of the cluster since the arms increasingly obstruct access to the central region. It appears as though the center repulses any new additions.
A promoter network growing by DLA would be regulated by both repulsive and attractive forces, mediated on the micro-scale through DNA-protein binding chemistry, and on the macro-scale by population-level fitnesses, all organized around the consensus promoter. The consensus would form an attractor in transcriptional promoter networks because it represents the optimal binding chemistry for σ, and departures from the consensus would weaken and eventually eliminate this binding capacity.4 Yet it appears that the consensus and canonical motifs rarely participate directly in transcription perhaps because they bind σ too firmly.2,4,19–21 The resulting lowered population-level fitness would repulse additions from the network center.
These dynamics are analogous to the inter-atomic attractive and repulsive forces that include the van der Waals interactions.23 Our interpretation comports with the recent generalization that repulsion is a critical prerequisite to fractal development in most complex networks.27,28
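For readers unfamiliar with the general 2-d DLA model invoked above, a minimal lattice simulation is sketched here. It is a generic physical DLA on a square grid, not a model of promoter evolution. The launch radius, the escape radius, and the function name are arbitrary choices made for illustration.

```python
import math
import random

STEPS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def grow_dla(n_particles, seed=0):
    """Grow a lattice DLA cluster by adding n_particles random walkers to a seed site."""
    rng = random.Random(seed)
    cluster = {(0, 0)}
    r_max = 0.0                       # distance of the farthest particle from the seed
    while len(cluster) < n_particles + 1:
        launch = r_max + 5.0          # release each walker just outside the cluster
        theta = rng.uniform(0.0, 2.0 * math.pi)
        x = round(launch * math.cos(theta))
        y = round(launch * math.sin(theta))
        while True:
            # Stick as soon as any of the four lattice neighbors is occupied.
            if any((x + dx, y + dy) in cluster for dx, dy in STEPS):
                cluster.add((x, y))
                r_max = max(r_max, math.hypot(x, y))
                break
            dx, dy = rng.choice(STEPS)    # one step of the random walk
            x, y = x + dx, y + dy
            if math.hypot(x, y) > 3.0 * launch:
                break                     # walker escaped; release a new one
    return cluster


cluster = grow_dla(100, seed=1)
print(len(cluster), "sites; maximum radius",
      round(max(math.hypot(x, y) for x, y in cluster), 2))
```

Because walkers arrive from outside, they tend to meet the outer arms before reaching the interior, which is the screening effect the text describes as apparent repulsion from the center.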
Concluding Remarks
Our results suggest a link between the development of scaling relations in genome structure and function. This correspondence is in part anticipated by the Zipf-Mandelbrot law,29,30 though genome work to date has emphasized frequency (structural) scaling without integrating topological (functional) scaling.