Kobe Florquin, Yvan Saeys, Sven Degroeve, Pierre Rouzé, Yves Van de Peer.
Abstract
DNA encodes at least two independent levels of functional information. The first level encodes proteins and sequence targets for DNA-binding factors, while the second is contained in the physical and structural properties of the DNA molecule itself. Although the physical and structural properties are ultimately determined by the nucleotide sequence, the cell exploits these properties in a way in which the sequence itself plays no role other than to support or facilitate certain spatial structures. In this work, we focus on these structural properties, comparing them between different organisms and assessing their ability to describe the core promoter. We demonstrate the existence of distinct types of core promoters, based on a clustering of their structural profiles. These results indicate that the structural profiles are highly conserved within plants (Arabidopsis and rice) and within animals (human and mouse), but differ considerably between plants and animals. Furthermore, we demonstrate that these structural profiles can be an alternative way of describing the core promoter, in addition to more classical motif or IUPAC-based approaches. Using the structural profiles as discriminatory elements to separate promoter regions from non-promoter regions, reliable models can be built to identify core-promoter regions using a strictly computational approach.
Year: 2005 PMID: 16049029 PMCID: PMC1181242 DOI: 10.1093/nar/gki737
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Sequence information is converted to numerical profiles. In this example, the trinucleotide bendability model by Brukner et al. (11) is used, based on DNase I cutting frequencies. The enzyme DNase I preferentially binds to the minor groove and cuts DNA that is bent, or bendable, toward the major groove. Therefore, DNase I cutting frequencies on naked DNA can be interpreted as a quantitative measure of major groove compressibility or bendability. These frequencies allow for the derivation of bendability parameters for 32 complementary trinucleotide pairs, which range from −0.280 to +0.194. To evaluate different smoothings of the raw profile signal (see text), a sliding window approach was used with a step of 1 and window sizes from 1 to 10.
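The conversion described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the values in `TOY_SCALE` are placeholders spread evenly across the stated range (they are not the published Brukner et al. parameters), and taking the mean within each window is an assumption, since the caption only states that a sliding window with a step of 1 was used.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def canonical(tri):
    """Represent a trinucleotide and its reverse complement by one key."""
    return min(tri, reverse_complement(tri))

# The 64 trinucleotides collapse into 32 complementary pairs.
PAIRS = sorted({canonical("".join(p)) for p in product("ACGT", repeat=3)})

# PLACEHOLDER scale: evenly spaced values across the range in the caption.
# These are NOT the published bendability parameters.
LOW, HIGH = -0.280, 0.194
TOY_SCALE = {
    tri: LOW + i * (HIGH - LOW) / (len(PAIRS) - 1)
    for i, tri in enumerate(PAIRS)
}

def raw_profile(seq, scale=TOY_SCALE):
    """One value per overlapping trinucleotide, moving in steps of 1."""
    seq = seq.upper()
    return [scale[canonical(seq[i:i + 3])] for i in range(len(seq) - 2)]

def smooth(profile, window):
    """Mean over a sliding window of the given size, moved in steps of 1."""
    if window < 1 or window > len(profile):
        raise ValueError("window must be between 1 and len(profile)")
    return [
        sum(profile[i:i + window]) / window
        for i in range(len(profile) - window + 1)
    ]

sequence = "ATGCGTATAAAGGCTCGA"  # toy input, not taken from any dataset
raw = raw_profile(sequence)
smoothed = {w: smooth(raw, w) for w in range(1, 11)}
```

A sequence of length n yields n − 2 raw values, and a window of size w shortens the profile by a further w − 1 positions. The sketch raises a `KeyError` on characters other than A, C, G and T; how ambiguous bases were handled is not stated in the caption.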
The different structural models that were considered in our analysis
| Structural feature | Property measured | Min | Max | Reference |
|---|---|---|---|---|
| Stacking energy | Dinucleotide base-stacking energy scale expressed in kilocalories per mol, derived from approximate quantum mechanical calculations on crystal structures. High peaks in base stacking reflect regions of the helix that de-stack or melt more easily; conversely a minimal peak would represent more stable regions | −14.59 kcal | −3.82 kcal | ( |
| Propeller twist | The dinucleotide propeller twist angle scale is measured in degrees and is based on X-ray crystallography of DNA oligomers. A region with high propeller twist would mean that the helix is quite rigid in this area. Correspondingly, regions that are quite flexible would have low propeller twist values | −18.66° | −8.11° | ( |
| Nucleosome position preference | NPP is a trinucleotide model based on the preferential location of sequences within a nucleosomal core. The study was performed on sequences wrapped around nucleosome cores and in closed circles of DNA. The fractional preference of each base pair triplet for a position facing out was calculated. High value peaks represent more rigid regions where nucleosomes are less likely to appear | −36% | +45% | ( |
| Bendability | The trinucleotide bendability model is based on DNase I cutting frequencies. The enzyme DNase I preferentially binds to the minor groove and cuts DNA that is bent, or bendable, toward the major groove. Thus DNase I cutting frequencies on naked DNA can be interpreted as a quantitative measure of major groove compressibility or bendability. DNA regions with a high peak correspond to regions that are more flexible than regions with a low peak value | −0.280 | +0.194 | ( |
| A-philicity | The free energy dinucleotide base pair scale, for the ethanol-induced B- to A-DNA conformational transitions in solution, was determined for a series of carefully designed synthetic duplexes. A region in the DNA with a high A-philicity value is more easily converted to the A-form than a low value region, which is more resistant to transition | | | ( |
| Protein-induced deformability | The dinucleotide protein deformability scale is derived from empirical energy functions extracted from the fluctuations and correlations of structural parameters as determined by the examination of more than a hundred crystal structures of DNA–protein complexes. On this scale, a larger value reflects a more deformable sequence while a smaller value indicates a region where the DNA helix is less likely to be changed dramatically by proteins | 1.6 | 12.1 | ( |
| Duplex disrupt energy | The DNA disrupt energy was calculated using calorimetric calculation on 19 DNA oligomers and 9 DNA polymers. It has been shown that the stability of a DNA duplex depends on its base sequence and that it is not the base composition that determines the stability of the duplex. Regions with a high disrupt energy value will be more stable than a region with a lower energy value | 0.9 kcal | 3.1 kcal | ( |
| Duplex free energy | For 50 DNA/DNA duplexes the thermodynamic parameters of the DNA free energy were calculated. The melting behavior of these duplexes was observed and the transition enthalpy was calculated giving dinucleotide values. Regions with a low free energy content will be more stable than regions with high thermodynamic energy content | −2.1 kcal/mol | −0.9 kcal/mol | ( |
| DNA denaturation | The denaturation equilibrium is calculated by UV electronic spectroscopy at 270 nm of high-resolution melting experiments on 42 plasmids containing synthetic repeated inserts. DNA regions with a low peak value are more likely to denature than regions with a higher peak value | 64.35 cal/mol | 135.38 cal/mol | ( |
| DNA-bending stiffness | The bending stiffness is regarded as the translational positioning of nucleosomes and more precisely the strong correlation with the anisotropic flexibility of the DNA. In the analysis, a simple algorithm is used that accounts for nucleosome translational positions in terms of bending free energy. The values are given in nm, which stand for the persistence length value that is derived from experimental data. High peak values correspond to DNA regions that are more rigid, while low peak values correspond to regions that will bend more easily | 20 nm | 130 nm | ( |
| B-DNA twist | The study focuses on the mean twist angles in B-DNA, calculated on 38 B-DNA crystal structures. Structures with a low twist region appear to unwind in response to steric clashes of large exocyclic groups in the major and minor grooves, and those with high twist values are subject to lesser contact | 30.6° | 43.2° | ( |
| Protein–DNA twist | Olson | 31.5° | 37.8° | ( |
| Stabilizing energy of Z-DNA | To search for particular DNA segments that can adopt a left-handed Z-conformation, empirically determined energetic parameters are used. The dinucleotide parameters represent the free energy values for a transition from B- to Z-DNA. Stretches of DNA with low energy minima are more likely to form Z-DNA than a high-energy region | 0.7 kcal/mol | 5.9 kcal/mol | ( |
Figure 2. For Arabidopsis and rice, in-house core-promoter datasets were constructed. The ARAPROM dataset (7088 promoter sequences) was constructed by aligning full-length cDNA sequences, generated by Seki et al. (31), with the genomic sequence. The RICEPROM dataset consists of 2195 putative promoter sequences. From each original promoter sequence, 100 bp upstream and 50 bp downstream of the TSS were selected. As negative datasets, we extracted 150 bp from the non-promoter sequence part, including intron, exon and intergenic sequences. In addition, three randomized datasets were constructed, based on randomizing the core-promoter sequences.
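The window selection and randomization described in the Figure 2 caption could look roughly like the sketch below. The function names are hypothetical, and several details are assumptions because the caption does not specify them: the TSS base is counted as the first downstream base, only the plus strand is handled, and the shuffle shown is a simple mononucleotide shuffle, which may differ from the three randomization schemes actually used.

```python
import random

UPSTREAM, DOWNSTREAM = 100, 50  # window sizes stated in the caption

def core_promoter(genomic, tss_index, upstream=UPSTREAM, downstream=DOWNSTREAM):
    """Return the window around a 0-based TSS index on the plus strand.

    Assumption: the TSS base is the first of the `downstream` bases.
    """
    start = tss_index - upstream
    end = tss_index + downstream
    if start < 0 or end > len(genomic):
        raise ValueError("window extends beyond the sequence")
    return genomic[start:end]

def shuffled(seq, rng):
    """Mononucleotide shuffle: keeps base composition, destroys base order."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)

rng = random.Random(0)  # fixed seed so the toy example is reproducible
genomic = "".join(rng.choice("ACGT") for _ in range(1000))  # toy sequence
window = core_promoter(genomic, tss_index=500)
control = shuffled(window, rng)
```

Each extracted window is 150 bp long, matching the length of the negative-set sequences, and the shuffled control has the same base composition as the window it was derived from.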
Figure 3. Profiles based on the structural model ‘duplex disrupt energy’ and window size 10 are shown for the four highest quality value clusters for Arabidopsis, rice, human and mouse (42). The position of the transcription start site is shown on the different structural profiles.
Figure 4. Influence of the window size on the classification results. This shows the discriminative power to distinguish core-promoter sequences from non-core-promoter sequences, for all structural models and for window sizes 1–10. For each structural model, all core-promoter sequences from the clusters with the highest quality value were mixed with 75% sequences coming from the dinucleotide-randomized dataset. The F-value, which combines sensitivity and specificity, is a measure of the overall performance in discriminating between core-promoter sequences and non-core-promoter sequences. Classification results were based on applying the LSVM classification method.
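The caption states that the F-value combines sensitivity and specificity but does not give the formula. One common choice, shown here as an assumption rather than as the definition used in the paper, is the harmonic mean of sensitivity, TP/(TP+FN), and specificity in the gene-prediction sense, TP/(TP+FP). The function name and the counts are illustrative only.

```python
def f_value(tp, fp, fn):
    """Harmonic mean of sensitivity and (gene-prediction style) specificity.

    Assumption: specificity is taken as TP / (TP + FP), not TN / (TN + FP).
    """
    if tp == 0:
        return 0.0
    sensitivity = tp / (tp + fn)  # fraction of true promoters recovered
    specificity = tp / (tp + fp)  # fraction of predictions that are correct
    return 2 * sensitivity * specificity / (sensitivity + specificity)

# Toy counts, not results from the paper.
score = f_value(tp=80, fp=20, fn=20)
```

Because it is a harmonic mean, the score is high only when both components are high, so a classifier that labels every sequence as a promoter is not rewarded.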
Figure 5. (a–j) The first 10 clusters, as inferred by the AQBC method, of human structural profiles obtained using bendability as a structural model with window size 10 are shown. All the core promoters are aligned based on the TSS and each profile corresponds to 100 bp upstream of the TSS and 50 bp downstream.