
DiProDB: a database for dinucleotide properties.

Maik Friedel, Swetlana Nikolajewa, Jürgen Sühnel, Thomas Wilhelm.

Abstract

DiProDB (http://diprodb.fli-leibniz.de) is a database of conformational and thermodynamic dinucleotide properties. It includes datasets for both DNA and RNA, as well as for single and double strands. The data have been shown to be important for understanding different aspects of nucleic acid structure and function, and they can also be used for encoding nucleic acid sequences. The database is intended to facilitate further applications of dinucleotide properties. A number of property datasets are highly correlated. Therefore, the database comes with a correlation analysis facility. Authors who have determined new sets of dinucleotide property values are invited to submit these data to DiProDB.

Year: 2008 PMID: 18805906 PMCID: PMC2686603 DOI: 10.1093/nar/gkn597

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Nucleic acid properties are governed by the corresponding nucleotide sequence. More specifically, many properties, such as nucleic acid stability, seem to depend primarily on the identity of nearest-neighbour nucleotides (1). The corresponding nearest-neighbour model is also the basis for RNA secondary structure prediction by free-energy minimization (2). Not only thermodynamic but also conformational nucleotide properties may play a role. It has been shown, for example, that promoter locations can be predicted using dinucleotide stiffness parameters derived from molecular dynamics simulations (3). Also, curved DNA is known to play a role in prokaryotic gene expression (4). In addition, physical DNA profiles have been used for improved promoter prediction (5,6). There are numerous other examples, but a comprehensive overview is beyond the scope of this brief database description.

We are currently developing a Genome Browser that encodes complete eukaryotic or prokaryotic genomes by thermodynamic and conformational dinucleotide properties. In this context, we have collected more than 100 sets of dinucleotide properties from the literature. Currently, there are two related data collections: the PROPERTY DB (srs6.bionet.nsc.ru/srs6bin/cgi-bin/wgetz?-page+LibInfo+-id+1pFZP1TuQpU+-lib+PROPERTY) with about 30 property sets (7) and plot.it (hydra.icgeb.trieste.it/dna/plot_it.html) with about 50 sets (Vlahovicek,K. and Pongor,S., unpublished data). Both databases lack many of the existing datasets and, in addition, make it difficult to trace back the original data sources. Neither is included in the NAR Database Collection. Therefore, we have set up the database DiProDB, which aims to be a one-stop resource for these properties. With DiProDB we want to provide reliable, easily accessible and comprehensive information on dinucleotide properties that may stimulate the application of these data to a diversity of biological problems.

DATABASE CONTENT

DiProDB currently includes 115 dinucleotide datasets. They were collected from the literature and are classified according to nucleic acid type (DNA and RNA), strand information (double or single), how the data were obtained (experimental, theoretical/calculated) and the general type of the dinucleotide property: thermodynamical (e.g. free energy), conformational (e.g. twist) or letter-based (e.g. GC content). We include the letter-based data to demonstrate relations to thermodynamical and conformational properties. Moreover, most of the current motif discovery approaches are letter-based. An example from our work is the identification of significant purine–pyrimidine patterns in restriction enzyme binding sites (8). The number of datasets for each category is shown in Table 1. For each dataset, the 16 dinucleotide values, the unit of measurement, the reference, the classification features and comments are provided. If a dataset refers to RNA, this is stated in the corresponding property name. If the name does not mention a nucleic acid, the dataset always refers to DNA.
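Because every dataset is a table of 16 dinucleotide values, encoding a sequence amounts to a lookup over its overlapping dinucleotide steps. The following is a minimal sketch of that idea, not DiProDB's own code. It uses the letter-based GC content, which can be computed directly from the letters, so no database values are assumed. The function names `gc_content_table` and `encode` are hypothetical.

```python
from itertools import product

BASES = "ACGT"

def gc_content_table():
    """Letter-based GC content (0, 1 or 2) for all 16 dinucleotides."""
    return {a + b: sum(base in "GC" for base in (a, b))
            for a, b in product(BASES, repeat=2)}

def encode(sequence, table):
    """Map each overlapping dinucleotide step to its property value."""
    seq = sequence.upper()
    return [table[seq[i:i + 2]] for i in range(len(seq) - 1)]

profile = encode("ATGCGA", gc_content_table())
print(profile)  # [0, 1, 2, 2, 1] for the steps AT, TG, GC, CG, GA
```

Any other 16-value property set could be passed as `table` in the same way. A sequence of length n yields a profile of n - 1 values.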

Table 1.

Number of dinucleotide property datasets for each category

Category	Class	Number of datasets
Nucleic acid type	DNA	93
Nucleic acid type	DNA/RNA	7
Nucleic acid type	RNA	15
Strand information	Double	103
Strand information	Single	12
Mode of property determination	Experimental	33
Mode of property determination	Theoretical/calculated	82
Property type	Thermodynamical	34
Property type	Conformational	74
Property type	Letter-based	7

USER INTERFACE

DiProDB displays all data in a single table (Figure 1). The number and type of columns shown can be customized by the user. Clicking on the ID button in the first column opens a new page containing all relevant information about the corresponding property. The database entries can be sorted according to three different criteria. There is also a search option for all or for specific columns. The complete table or parts of it can be saved as a text file or in a format directly importable into the Genome Browser mentioned in the Introduction section. The DiProDB website contains a Submit button, which users can use to submit new property datasets.

Figure 1.

Screenshot of the DiProDB table displaying search results for the term ‘twist’ (conformational dinucleotide property) in the property name.

DATA ANALYSES

The DiProDB website contains a Correlate option, where users can calculate Pearson's or Spearman's rank correlation coefficients for all or selected properties. This allows easy identification of dependencies between different dinucleotide properties. As an example, Figure 2 shows Pearson's correlation data for five different datasets quantifying the twist in B-DNA. All datasets are clearly correlated with each other. However, the extent of correlation differs considerably. Correlation coefficients >0.58 are considered statistically significant (P < 0.01, t-test).
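The two coefficients and the t-test behind the significance threshold can be sketched as follows. This is an illustration under stated assumptions, not the code behind the Correlate option. To avoid inventing property values, it uses only quantities computable from the letters: GC content, purine content and the Watson-Crick hydrogen-bond count of the two base pairs (2 per AT pair, 3 per GC pair). All function names are hypothetical.

```python
from itertools import product
from math import sqrt

DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]

def count(letters):
    """Letter-based property: number of bases per dinucleotide found in `letters`."""
    return [sum(b in letters for b in d) for d in DINUCLEOTIDES]

def pearson(x, y):
    """Pearson's product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rank correlation: Pearson's coefficient of the ranks."""
    return pearson(ranks(x), ranks(y))

def t_statistic(r, n):
    """t value for testing r against zero with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)

gc = count("GC")
h_bonds = [4 + v for v in gc]  # two base pairs: 2 H-bonds each, +1 per GC pair
purine = count("AG")

print(pearson(gc, h_bonds))    # perfectly correlated
print(spearman(gc, purine))    # uncorrelated
print(round(t_statistic(0.58, 16), 2))
```

For the 16 dinucleotides (14 degrees of freedom), r = 0.58 gives t of about 2.66, which is close to the tabulated one-sided 1% critical value of roughly 2.62. This suggests, but does not confirm, that the threshold quoted above corresponds to a one-sided test.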

Figure 2.

Pearson's correlation coefficients for five sets of twist angles. ID (Ref.): 1 (9), 61 (10), 88 (11), 92 (12) and 98 (13). Correlation coefficients >0.8 are coloured in green.

Based on these correlations, we performed different hierarchical clustering analyses to gain deeper insight into the overall correlation of the datasets. Figure 3 shows a single-linkage hierarchical clustering of all 23 B-DNA double-strand thermodynamical properties together with the three dinucleotide letter-based quantities GC content, purine (GA) content and keto (GT) content. This clustering is based on the distance measure 1−|rPearson|, because it is the absolute value of the correlation that indicates whether two properties contain similar information. Other correlation measures such as Spearman or Kendall's tau give very similar results. It can be seen that all free-energy data contain more or less the same information and that this is basically equivalent to the GC content. This is very likely due to the simple fact that GC pairs have three H-bonds instead of the two in AT base pairs. The complete single-linkage hierarchical clustering of all 115 properties is given in the Supplementary Material (Table 2), where a corresponding Ward clustering (14) is also shown. The latter shows a separation between a free energy/entropy/enthalpy/stacking energy/melting temperature cluster and another cluster containing all the conformational datasets. The complete single-linkage clustering reveals that the most uncorrelated dinucleotide properties are direction, inclination, twist–rise (conformational), stacking energy, tilt, shift, propeller twist and rise.
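The effect of the distance measure 1−|rPearson| can be illustrated with a naive single-linkage procedure. This is a sketch only: the clustering software actually used is not named in the text, and the function `single_linkage` is hypothetical. The input is restricted to letter-based quantities computable from the dinucleotide letters, plus an AT content (2 minus GC content) added purely to show how anti-correlated properties are treated.

```python
from itertools import product
from math import sqrt

DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]

def count(letters):
    """Letter-based property: number of bases per dinucleotide found in `letters`."""
    return [sum(b in letters for b in d) for d in DINUCLEOTIDES]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def single_linkage(properties):
    """Agglomerative clustering; returns merges as (distance, cluster_a, cluster_b)."""
    names = list(properties)
    # Pairwise distance 1 - |r|: strongly anti-correlated sets count as close.
    dist = {frozenset((a, b)): 1 - abs(pearson(properties[a], properties[b]))
            for i, a in enumerate(names) for b in names[i + 1:]}
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        # Single linkage: cluster distance is the smallest member-to-member distance.
        d, a, b = min(
            ((min(dist[frozenset((p, q))] for p in ca for q in cb), ca, cb)
             for i, ca in enumerate(clusters) for cb in clusters[i + 1:]),
            key=lambda item: item[0])
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((d, a, b))
    return merges

properties = {
    "GC content": count("GC"),
    "AT content": count("AT"),          # r = -1 with GC content
    "purine (GA) content": count("AG"),
    "keto (GT) content": count("GT"),
}
merges = single_linkage(properties)
for distance, left, right in merges:
    print(round(distance, 3), left, right)
```

GC and AT content merge first at distance 0, although their correlation is -1, because they carry the same information. The purine and keto contents are uncorrelated with both and with each other, so they join only at distance 1.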

Figure 3.

Hierarchical clustering of all 23 B-DNA double-strand physicochemical properties and the three dinucleotide letter-based quantities GC content, purine (GA) content and keto (GT) content. The property sets are designated by their IDs and names.

Table 2.

Content of supplementary material

Figure S1	Single linkage hierarchical clustering of 115 dinucleotide properties.
Figure S2	Ward hierarchical clustering of 115 dinucleotide properties.
Figure S3	115 dinucleotide properties in the first two principal components.
Figure S4	115 dinucleotide properties in the first and third principal component.
Figure S5	The 16 dinucleotides in the first two principal components.
Table S1	Percentage of importance of the 15 PCs carrying >10⁻¹⁴% of information.
Table S2	Percentage of importance of the first 10 dinucleotide properties in the first 15 PCs in decreasing order.
Table S3	Involvement of the 10 most important dinucleotide properties in the PCs 1–15.

In order to gain more insight into the data, we performed two principal component analyses (PCA) (15). The complete data of 115 properties for 16 dinucleotides correspond to 115 points in 16-dimensional space (or 16 points in 115-dimensional space). PCA helps to reveal the internal structure of such high-dimensional data by providing lower-dimensional pictures of the ‘cloud’ in coordinates corresponding to maximum variance of the data (http://en.wikipedia.org/wiki/Principal_components_analysis). The cloud of all 115 properties in the first two principal components (PCs, the new coordinates) is shown in Figure 4. Only the most uncorrelated property ‘direction’ lies outside the shown region, at (PC1, PC2) = (0.1, 1.6). The complete figure containing direction and a PC1–PC3 projection are given in the Supplementary Material. Note also that only the first three PCs carry relevant information: PC1 78.5%, PC2 16.9%, PC3 3.3%. The other two outliers are melting temperature and persistence length. This indicates that these three properties in particular carry information quite different from the others. Note that the latter two properties are not amongst the outliers according to the above-mentioned single-linkage clustering, because each one has (at least) one better correlation to other datasets (melting temperature to stacking energy, and persistence length to tilt–shift). Figure 4 also indicates three clusters containing all other properties: a stacking energy/entropy cluster, a twist cluster and the central main cluster.

Figure 4.

All dinucleotide properties plotted in the first two PCs. A few of them are designated by property name and ID.

Finally, we also performed a PCA calculating the 115 principal components for the 16 dinucleotides. The first 15 PCs carry information (23%, 21%, 14%, 12%, 6%, etc.), roughly indicating that about this number of weakly correlated properties is needed to represent all information of the complete set of 115 properties. The Supplementary Material also contains a corresponding PC1–PC2 plot, together with all detailed information about the performed PCAs.
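The percentage of information carried by a PC, as quoted above, is the share of the total variance explained by that component. The sketch below computes this share for PC1 by power iteration on a covariance matrix. It rests on several assumptions: the preprocessing used for the published PCAs (centring, scaling) is not stated in the text, so plain covariance of unscaled values is assumed here; power iteration is swapped in for a full eigendecomposition to keep the example dependency-free; and the input consists only of letter-derived quantities rather than DiProDB data. Function names are hypothetical.

```python
from itertools import product
from math import sqrt

DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]

def count(letters):
    """Letter-based property: number of bases per dinucleotide found in `letters`."""
    return [sum(b in letters for b in d) for d in DINUCLEOTIDES]

def covariance_matrix(columns):
    """Sample covariance between property columns (16 dinucleotides as observations)."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([v - mean for v in col])
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in centered]
            for ci in centered]

def leading_variance_share(cov, iterations=200):
    """Share of total variance carried by PC1, found by power iteration."""
    # Assumption: this fixed start vector is not orthogonal to PC1;
    # a random start is the usual choice for arbitrary data.
    vec = [1.0] * len(cov)
    for _ in range(iterations):
        vec = [sum(c * v for c, v in zip(row, vec)) for row in cov]
        norm = sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    # Rayleigh quotient of the unit vector gives the leading eigenvalue.
    eigenvalue = sum(v * sum(c * w for c, w in zip(row, vec))
                     for v, row in zip(vec, cov))
    total_variance = sum(cov[i][i] for i in range(len(cov)))
    return eigenvalue / total_variance

gc = count("GC")
columns = [
    gc,
    [4 + v for v in gc],   # Watson-Crick H-bond count, fully determined by GC content
    count("AG"),           # purine content
    count("GT"),           # keto content
]
share = leading_variance_share(covariance_matrix(columns))
print(round(share, 3))  # 0.5
```

In this toy input, GC content and the H-bond count collapse into one component that carries half of the total variance, while the two uncorrelated properties each keep their own. This mirrors the argument above that the number of informative PCs roughly indicates how many weakly correlated properties are needed.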

OUTLOOK

So far, DiProDB contains 115 sets of dinucleotide properties. We plan to increase this number in the future. We also invite other authors to submit their measured or calculated dinucleotide properties to DiProDB.

SUPPLEMENTARY DATA

Supplementary data are available at NAR Online.

FUNDING

Funding for open access charge: Biotechnology and Biological Sciences Research Council (BBSRC) IFR Core Strategic Grant.

Conflict of interest statement. None declared.

13 in total

1. Conformational and physicochemical DNA features specific for transcription factor binding sites.

Authors: J V Ponomarenko; M P Ponomarenko; A S Frolov; D G Vorobyev; G C Overton; N A Kolchanov
Journal: Bioinformatics Date: 1999 Jul-Aug Impact factor: 6.937

2. The relative flexibility of B-DNA and A-RNA duplexes: database analysis.

Authors: Alberto Pérez; Agnes Noy; Filip Lankas; F Javier Luque; Modesto Orozco
Journal: Nucleic Acids Res Date: 2004-11-23 Impact factor: 16.971

3. Combining structural analysis of DNA with search routines for the detection of transcription regulatory elements.

Authors: H Karas; R Knüppel; W Schulz; H Sklenar; E Wingender
Journal: Comput Appl Biosci Date: 1996-10

4. A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics.

Authors: J SantaLucia
Journal: Proc Natl Acad Sci U S A Date: 1998-02-17 Impact factor: 11.205

5. CURVATURE: software for the analysis of curved DNA.

Authors: E S Shpigelman; E N Trifonov; A Bolshoy
Journal: Comput Appl Biosci Date: 1993-08

6. B-DNA twisting correlates with base-pair morphology.

Authors: A A Gorin; V B Zhurkin; W K Olson
Journal: J Mol Biol Date: 1995-03-17 Impact factor: 5.469

Review 7. Promoters responsive to DNA bending: a common theme in prokaryotic gene expression.

Authors: J Pérez-Martín; F Rojo; V de Lorenzo
Journal: Microbiol Rev Date: 1994-06

8. Role of base-backbone and base-base interactions in alternating DNA conformations.

Authors: M Suzuki; N Yagi; J T Finch
Journal: FEBS Lett Date: 1996-01-29 Impact factor: 4.124

9. Large-scale structural analysis of the core promoter in mammalian and plant genomes.

Authors: Kobe Florquin; Yvan Saeys; Sven Degroeve; Pierre Rouzé; Yves Van de Peer
Journal: Nucleic Acids Res Date: 2005-07-27 Impact factor: 16.971

10. Common patterns in type II restriction enzyme binding sites.

Authors: Svetlana Nikolajewa; Andreas Beyer; Maik Friedel; Jens Hollunder; Thomas Wilhelm
Journal: Nucleic Acids Res Date: 2005-05-11 Impact factor: 16.971
