Thomas D Schneider, Vishnu Jejjala.
Abstract
Restriction enzymes recognize and bind to specific sequences on invading bacteriophage DNA. Like a key in a lock, these proteins require many contacts to specify the correct DNA sequence. Using information theory, we develop an equation that defines the number of independent contacts, which is the dimensionality of the binding. We show that EcoRI, which binds to the sequence GAATTC, functions in 24 dimensions. Information theory represents messages as spheres in high-dimensional spaces. Better sphere packing leads to better communications systems. The densest known packing of hyperspheres occurs on the Leech lattice in 24 dimensions. We suggest that the single protein EcoRI molecule employs a Leech lattice in its operation. Optimizing density of sphere packing explains why 6-base restriction enzymes are so common.
Year: 2019 PMID: 31671158 PMCID: PMC6822723 DOI: 10.1371/journal.pone.0222419
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Sphere packing.
Circles demonstrate square and hexagonal sphere packing in two dimensions. The hexagonal packing is denser: it covers about 91% of the plane versus about 79% for the square packing, a difference of 12 percentage points. In higher-dimensional spaces sphere packing is less intuitive. When hyperspheres pack together there is an odd property, diagrammed on the right side of the figure (which is derived from Shannon’s proof of the channel capacity theorem, Theorem 2, in his figure 5 [15]). The vertical arrow represents moving from the center of one hypersphere to the center of a second hypersphere. For Shannon, working with electrical communications, this voltage is proportional to the square root of the power dissipation, √P. In a 100-dimensional space, the thermal noise in the second sphere (green circle) disturbs the signal in all directions, shown by splayed arrows with lengths proportional to √N. However, 99 of those dimensions do not perturb in the direction of the power dissipation. In his proof, Shannon neglected the 1% of the noise in the direction of the power, since this represents the error and can be made as small as desired by increasing the dimensionality; in 1000 dimensions the error is only 0.1%. So relative to the direction of the power, the received hypersphere can be treated as a flat surface, since all the other directions (splayed arrows) are at right angles to the power direction. If two hyperspheres are to be separated with as low an error as desired, then the power to get from one to the next must just exceed the thermal noise power of the first sphere, so √P > √N and P > N.
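The two numerical claims in this caption can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the noise calculation assumes isotropic noise, so that each of the D directions carries an equal 1/D share of the noise power.

```python
import math


def square_packing_density() -> float:
    """Fraction of the plane covered by circles on a square grid: pi / 4."""
    return math.pi / 4


def hexagonal_packing_density() -> float:
    """Fraction of the plane covered by circles on a hexagonal grid: pi / (2 * sqrt(3))."""
    return math.pi / (2 * math.sqrt(3))


def noise_fraction_along_signal(dimensions: int) -> float:
    """Share of isotropic noise power lying along one chosen direction.

    Assumption: noise power is spread equally over all dimensions.
    """
    if dimensions < 1:
        raise ValueError("dimensions must be a positive integer")
    return 1 / dimensions


if __name__ == "__main__":
    sq = square_packing_density()
    hx = hexagonal_packing_density()
    print(f"square:    {sq:.4f}")
    print(f"hexagonal: {hx:.4f}")
    print(f"absolute difference: {hx - sq:.4f}")
    print(f"relative gain:       {hx / sq - 1:.4f}")
    for d in (100, 1000):
        print(f"D = {d}: noise share along signal = {noise_fraction_along_signal(d):.3%}")
```

The absolute difference between the two coverings is about 0.12, which matches the 12% figure in the caption; as a ratio, the hexagonal covering is about 15% larger than the square one.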
Fig 2. Isothermal efficiency curve for molecular machines showing bounds that constrain the coding space dimensionality D.
Real molecular machines that select between two or more distinct states may have parameters anywhere in the shaded (green) area, in which the real isothermal efficiency ϵ_r is bounded above by the theoretical isothermal efficiency ϵ_t (Eq (6)) and to the left by the power-to-noise ratio ρ = P/N > 1 (Eq (7)). During evolution, they tend to lose unnecessary energy dissipation, which decreases ρ towards the lower limit of ρ = 1. Independently, they tend to increase their information use (R) for the energy dissipated, which increases ϵ_r toward the theoretical maximum ϵ_t determined by the channel capacity. These factors lead to an ‘optimal’ molecular machine in which ρ = 1 and ϵ_r = ϵ_t = ln 2. At that point the dimensionality has been squeezed as if by pincers (Eq (21)) until it reaches D = 2R.
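The shape of the bounding curve can be sketched numerically. Eq (6) is not reproduced in this excerpt, so the form used below, ln(1 + ρ)/ρ, is an assumption; it was chosen because it gives the caption's limiting value of ln 2 at ρ = 1. The function name is hypothetical.

```python
import math


def theoretical_efficiency(rho: float) -> float:
    """Assumed theoretical isothermal efficiency, ln(1 + rho) / rho.

    Assumption: this form stands in for Eq (6), which is not shown here.
    rho is the power-to-noise ratio P/N.
    """
    if rho <= 0:
        raise ValueError("rho must be positive")
    return math.log1p(rho) / rho


if __name__ == "__main__":
    # Efficiency falls as rho grows, so lowering rho towards 1 raises the ceiling.
    for rho in (1.0, 2.0, 5.0, 10.0):
        print(f"rho = {rho:5.1f}  upper bound on efficiency = {theoretical_efficiency(rho):.4f}")
```

Under this assumed form the ceiling is highest at the left boundary ρ = 1, which is consistent with the caption's description of the optimum.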
Table 1. Coding space dimensionality (D) and number (N) of restriction enzymes.
The information content in bits, R, of the recognition sequences of 4297 restriction enzymes from REBASE (restriction enzyme database) http://rebase.neb.com or ftp://ftp.neb.com/pub/rebase/ version allenz.801 (Dec 27 2017) [40] was computed. A fully conserved base (A, C, G, T) contributes 2 − log2 1 = 2 bits, a position with two possibilities (R = G/A, Y = C/T, M = A/C, K = G/T, S = C/G, W = A/T) contributes 2 − log2 2 = 1 bit, a position with three possibilities (B = C/G/T, D = A/G/T, H = A/C/T, V = A/C/G) contributes 2 − log2 3 ≈ 0.42 bits, and a position where any base is allowed (N) contributes 2 − log2 4 = 0 bits [23, 41]. The sum of the information over all positions, R, was used to find the corresponding number of compressed bases (λ = R/2) and then the coding dimension (D = 2R), assuming that each enzyme has an efficiency of ϵ = ln 2 and ρ = 1 so that there is a unique dimension according to Eq (21). The most commercially available enzymes and their reported recognition sequences are given as examples. When the DNA backbone cleavage site is known, it is indicated by an arrow (↓). The distance to cleavage sites outside the given sequence is shown in parentheses for the corresponding and complementary strands. Star activity (variation within the canonical site) and flanking sequence effects are found for many restriction enzymes [42]. However, the patterns in the database are reported as consensus sequences, which may distort the information content [43] and so may affect the results given here.
| Example enzyme | Recognition sequence | Compressed bases (λ) | Bits (R) | Dimension (D) | Number of enzymes (N) |
|---|---|---|---|---|---|
| AbaSI | C(11/9) | 1.00 | 2.00 | 4.00 | 20 |
| MspJI | CNNR(9/13) | 1.50 | 3.00 | 6.00 | 1 |
| RlaI | VCW | 1.71 | 3.42 | 6.83 | 1 |
| SgeI | CNNGNNNNNNNNN↓ | 2.00 | 4.00 | 8.00 | 6 |
| AspBHI | YSCNS(8/12) | 2.50 | 5.00 | 10.00 | 1 |
| PsuGI | BBCGD | 2.62 | 5.25 | 10.49 | 1 |
| SgrTI | CCDS(10/14) | 2.71 | 5.42 | 10.83 | 2 |
| CviJI | RG↓CY | 3.00 | 6.00 | 12.00 | 9 |
| LpnPI | CCDG(10/14) | 3.21 | 6.42 | 12.83 | 1 |
| EcoBLMcrX | RCSRC(-3/-2) | 3.50 | 7.00 | 14.00 | 1 |
| M.NgoDCXV | GCCHR | 3.71 | 7.42 | 14.83 | 1 |
| TaqI | T↓CGA | 4.00 | 8.00 | 16.00 | 1210 |
| Bsp1286I | GDGCH↓C | 4.42 | 8.83 | 17.66 | 16 |
| AvaII | G↓GWCC | 4.50 | 9.00 | 18.00 | 396 |
| Pin17FIII | GGYGAB | 4.71 | 9.42 | 18.83 | 2 |
| HincII | GTY↓RAC | 5.00 | 10.00 | 20.00 | 507 |
| Cco14983V | GGGTDA | 5.21 | 10.42 | 20.83 | 1 |
| PpuMI | RG↓GWCCY | 5.50 | 11.00 | 22.00 | 55 |
| EcoRI | G↓AATTC | 6.00 | 12.00 | 24.00 | 1864 |
| Rba2021I | CACGAGH | 6.21 | 12.42 | 24.83 | 10 |
| PspXI | VC↓TCGAGB | 6.42 | 12.83 | 25.66 | 1 |
| RsrII | CG↓GWCCG | 6.50 | 13.00 | 26.00 | 54 |
| SgrAI | CR↓CCGGYG | 7.00 | 14.00 | 28.00 | 99 |
| KpnBI | CAAANNNNNNRTCA | 7.50 | 15.00 | 30.00 | 2 |
| SfiI | GGCCNNNN↓NGGCC | 8.00 | 16.00 | 32.00 | 36 |
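The per-base rule in the caption (2 − log2 of the number of allowed bases) can be applied mechanically to any recognition sequence written in IUPAC codes. The following is a minimal sketch, not the authors' program: the function and constant names are hypothetical, and cleavage annotations such as ↓ or (10/14) are simply skipped.

```python
import math

# Number of bases allowed by each IUPAC nucleotide code listed in the caption.
IUPAC_CHOICES = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "M": 2, "K": 2, "S": 2, "W": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def information_bits(site: str) -> float:
    """Sum 2 - log2(choices) over the IUPAC letters of a recognition site.

    Characters that are not IUPAC codes (arrows, digits, slashes,
    parentheses) are ignored.
    """
    return sum(
        2 - math.log2(IUPAC_CHOICES[ch])
        for ch in site.upper()
        if ch in IUPAC_CHOICES
    )


def coding_summary(site: str) -> tuple[float, float, float]:
    """Return (compressed bases R/2, bits R, dimension 2R) for a site."""
    bits = information_bits(site)
    return bits / 2, bits, 2 * bits


if __name__ == "__main__":
    examples = [
        ("RlaI", "VCW"),
        ("SgrTI", "CCDS(10/14)"),
        ("EcoRI", "G↓AATTC"),
        ("SfiI", "GGCCNNNN↓NGGCC"),
    ]
    for name, site in examples:
        lam, bits, dim = coding_summary(site)
        print(f"{name:6s} {site:16s} {lam:5.2f} {bits:6.2f} {dim:6.2f}")
```

Rounded to two decimals, the printed values should agree with the corresponding rows of the table, which makes this a quick consistency check on any row.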
Fig 3. Comparison of restriction enzyme frequency and best known sphere packing density in different dimensions.
A. Coding dimensions used by restriction enzymes. The number of enzymes at each dimensionality is plotted from Table 1. B. Best known sphere packings in high dimensions were given by Conway and Sloane [26, 44]. The graph is equivalent to their Figure 1.5; see Table I.1(a), Table I.1(b) on pages xix and xx; and pages 14 to 16. The updated sphere center density formulas used here were from http://www.math.rwth-aachen.de/~Gabriele.Nebe/LATTICES/density.html (Last modified Feb. 2012, accessed Jan 06, 2018). The sphere center density, δ, is the number of sphere centers per unit volume when sphere radii are set to 1. Without the logarithm, a graph of δ versus D appears nearly flat from D = 7 to D = 18. Circles (○) represent lattice packings; crosses (×) represent nonlattice packings.
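Because δ counts unit-radius sphere centers per unit volume, multiplying it by the volume of a unit ball in D dimensions gives the fraction of space the spheres fill. The sketch below illustrates that conversion. The three center densities are the standard values for the hexagonal lattice, the E8 lattice and the Leech lattice, not values read from the figure, and the names are hypothetical.

```python
import math


def unit_ball_volume(dim: int) -> float:
    """Volume of a radius-1 ball in `dim` dimensions: pi^(dim/2) / Gamma(dim/2 + 1)."""
    return math.pi ** (dim / 2) / math.gamma(dim / 2 + 1)


def packing_density(center_density: float, dim: int) -> float:
    """Fraction of space filled by spheres, given the center density delta."""
    return center_density * unit_ball_volume(dim)


# Standard center densities for three well-known lattices (assumed inputs).
CENTER_DENSITY = {
    2: 1 / (2 * math.sqrt(3)),  # hexagonal lattice
    8: 1 / 16,                  # E8 lattice
    24: 1.0,                    # Leech lattice
}

if __name__ == "__main__":
    for dim, delta in CENTER_DENSITY.items():
        print(
            f"D = {dim:2d}  delta = {delta:.5f}  "
            f"log2(delta) = {math.log2(delta):6.2f}  "
            f"filled fraction = {packing_density(delta, dim):.6f}"
        )
```

The filled fraction shrinks quickly with dimension because the unit ball volume does, which is one reason comparisons across dimensions are usually made with δ (or its logarithm) rather than with the filled fraction.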