D-MATRIX: a web tool for constructing weight matrix of conserved DNA motifs.

Naresh Sen¹, Manoj Mishra, Feroz Khan, Abha Meena, Ashok Sharma.

Abstract

Despite considerable effort, DNA motif prediction across whole genomes remains a challenge. Current genome-wide motif prediction tools require either a direct pattern sequence (for a single motif) or a weight matrix (for multiple motifs). Although motif pattern databases and genome-level prediction tools exist, no tool is available for weight matrix construction. We therefore developed D-MATRIX, a tool that constructs different types of weight matrix from a user-defined set of aligned motif sequences and a motif width. Known motif sequences can be retrieved from commonly used databases such as TFD, RegulonDB, DBTBS and TRANSFAC. D-MATRIX uses a simple statistical approach to construct the weight matrix, which can then be converted into different file formats according to user requirements. This makes it possible to identify conserved motifs in co-regulated genes or in a whole genome. As an example, we constructed the weight matrix of the LexA transcription factor binding site from known SOS-box cis-regulatory elements in the Deinococcus radiodurans genome. The algorithm is implemented in C# and wrapped in ASP.NET to provide a user-friendly web interface. The D-MATRIX tool is accessible through the CIMAP domain network. AVAILABILITY: http://203.190.147.116/dmatrix/

Keywords: Weight matrix; file format; motif databases; motif prediction

Year: 2009 PMID: 19759861 PMCID: PMC2737498 DOI: 10.6026/97320630003415

Source DB: PubMed Journal: Bioinformation ISSN: 0973-2063

Background

An important task in molecular biology is to identify the DNA regulatory elements bound by transcription factors (TFs). These binding sites are short regions called 'motifs'. Despite considerable effort, finding DNA motifs across whole genomes remains a challenge. There are several approaches to identifying conserved motifs, the most recent of which is based on weight matrices. So far, no tool has been available to construct different types of weight matrices from a user-defined sequence set. Earlier tools use promoter sequences of co-regulated genes from a single genome and search for statistically over-represented motifs. However, most of these motif-finding tools work well in yeast and other lower organisms but perform significantly worse in higher organisms. Over the past few years, numerous tools have become available for the prediction of TF binding sites [1-3]. Especially popular are tools that use information on known binding sites collected in databases such as TRANSFAC [4], EpoDB [5] and TRANSCompel [6]. More sophisticated approaches include modelling nucleotide correlations between positions within a site, hidden Markov models (HMMs), taking flanking regions into account, and others [7-14]. Such complex approaches usually require large training sets, which is problematic because only small sets of binding patterns are known for a motif (i.e. up to 10 sites). Current genome-wide motif prediction tools require either a direct pattern sequence (for a single motif) or a weight matrix (for multiple motifs). Although motif pattern databases and genome-wide prediction tools exist, no tool is available for weight matrix construction. We therefore developed D-MATRIX, which constructs different types of weight matrices from user-defined motif sequences and a motif width. D-MATRIX can use upstream sequences of both orthologous and co-regulated genes as its input data set.
For demonstration, we used the known LexA transcription factor binding sites of Deinococcus radiodurans (a radiation-resistant bacterium) to construct a weight matrix similar to the one reported earlier [15]. The results were promising: comparing our weight matrix with the known one, we found 90% accuracy for aligned motifs of the same width. D-MATRIX can generate three types of matrix: alignment, frequency and weight matrices. It also converts the weight matrix into different file formats for the user's convenience. The converted files can then be used as input by genome-wide motif prediction tools, e.g. PoSSuMsearch [16] and RSAT-Patser [17]. Aligned motif sequences can be retrieved with available motif discovery tools, e.g. SIGNAL SCAN [7], MATRIX SEARCH [8], MatInspector [9], the Fuzzy clustering tool [10], FUNSITE [11], the Gibbs Sampling tool [13] and AliBaba2 [14]. D-MATRIX differs from existing tools in giving users the freedom to design their own weight matrix model and signature.
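To illustrate why genome-wide prediction tools need a weight matrix as input, the following minimal Python sketch slides a matrix along a sequence and reports windows whose summed score reaches a cutoff. It is only an illustration of the general idea, not the algorithm of PoSSuMsearch or RSAT-Patser; the function name `scan_sequence`, the toy matrix `TOY_WEIGHTS`, the test sequence and the cutoff are all hypothetical, and only the forward strand is scanned.

```python
# Hypothetical 3-column weight matrix (rows: bases, columns: motif positions).
# The values are invented for illustration and favour the word "ACG".
TOY_WEIGHTS = {
    "A": [2, -1, -1],
    "C": [-1, 2, -1],
    "G": [-1, -1, 2],
    "T": [-1, -1, -1],
}


def scan_sequence(sequence, weights, threshold):
    """Return (start, window, score) for every forward-strand window
    whose summed column weights are at least `threshold`."""
    width = len(weights["A"])
    seq = sequence.upper()
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        # Skip windows containing symbols the matrix has no row for (e.g. N).
        if any(ch not in weights for ch in window):
            continue
        score = sum(weights[ch][j] for j, ch in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


hits = scan_sequence("TTACGTTACG", TOY_WEIGHTS, threshold=5)
for start, window, score in hits:
    print(start, window, score)
```

A production scanner would normally also score the reverse complement and choose the cutoff statistically; both are left out here to keep the sketch short.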

Methodology

D-MATRIX takes a set of 'N' aligned DNA motif sequences and the motif width 'w' as input and computes the nucleotide frequency at each position, 'F(ij)'. It outputs the consensus patterns/motifs found, ranked by conservation according to 'F(ij)', together with the constructed frequency matrix, alignment matrix and weight matrix, the motif signature, and the degenerate consensus sequence following the IUPAC/IUB convention. The weight matrix was scored using equation 1 (see supplementary material), as described elsewhere [15,18].
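The three matrix types can be made concrete with a minimal Python sketch (this is not the authors' C# code). The function names, the toy input sites, the uniform background frequencies and the pseudocount are assumptions made for illustration. The log-odds form used for the weights is a common way to derive a weight matrix from counts and may differ from equation 1 in the supplementary material.

```python
import math

BASES = "ACGT"


def count_matrix(sites, width):
    """Alignment (count) matrix: counts[base][j] is the number of sites
    that carry `base` at column j."""
    counts = {b: [0] * width for b in BASES}
    for site in sites:
        site = site.upper()
        if len(site) != width:
            raise ValueError(f"site {site!r} does not have width {width}")
        for j, ch in enumerate(site):
            if ch not in counts:
                raise ValueError(f"unexpected symbol {ch!r} in {site!r}")
            counts[ch][j] += 1
    return counts


def frequency_matrix(counts, n_sites):
    """Frequency matrix: F(ij) = count of base i at column j divided by N."""
    return {b: [c / n_sites for c in row] for b, row in counts.items()}


def weight_matrix(counts, n_sites, background=None, pseudocount=1.0):
    """Log-odds weights (assumed form, natural log):
    ln( ((n_ij + p_i * pseudocount) / (N + pseudocount)) / p_i )."""
    if background is None:
        background = {b: 0.25 for b in BASES}  # assumption: uniform background
    weights = {}
    for b in BASES:
        p = background[b]
        weights[b] = [
            math.log(((c + pseudocount * p) / (n_sites + pseudocount)) / p)
            for c in counts[b]
        ]
    return weights


# Toy aligned sites (hypothetical, not real SOS-box sequences).
sites = ["ACGT", "ACGA", "ACTT", "TCGT"]
counts = count_matrix(sites, width=4)
freqs = frequency_matrix(counts, len(sites))
weights = weight_matrix(counts, len(sites))
for b in BASES:
    print(b, counts[b], [round(w, 2) for w in weights[b]])
```

The pseudocount keeps the logarithm finite for bases never observed in a column, which matters when only a handful of binding sites are known.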

Implementation

The D-MATRIX web tool is implemented in C# and wrapped in ASP.NET to provide a user-friendly web interface. The user interface is shown in Figure 1. It has been designed so that all necessary parameters are available on one screen. The top panel is used to paste the input sequences (or aligned known TF binding sites) and to specify the name and width of the motif to be searched. The results panel contains five major sections: consensus pattern/motif sequence, frequency matrix, alignment matrix, weight matrix, and signature sequence as per IUPAC code. Alongside these results, the right panel provides a matrix transformation tool that converts the derived matrix into the input file formats of various genomic motif discovery tools. Because the required input sequence set is experimentally derived, all weight matrices constructed with D-MATRIX can be considered a source of well-supported hypotheses for further experimental verification.
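As an illustration of the signature section of the results, the Python sketch below derives a degenerate consensus from aligned sites using the standard IUPAC nucleotide codes. The rule for deciding which bases count at a column (every base whose frequency is at least `min_fraction`) is an assumption made here; the source does not state the rule D-MATRIX applies. The function name and the toy sites are likewise hypothetical.

```python
from collections import Counter

# Standard IUPAC nucleotide codes, keyed by the sorted set of allowed bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}


def degenerate_consensus(sites, min_fraction=0.25):
    """Build an IUPAC signature from equal-length aligned sites.

    Assumed rule: a base contributes to a column's code when its
    frequency in that column is at least `min_fraction`.
    """
    sites = [s.upper() for s in sites]
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same width")
    if any(ch not in "ACGT" for s in sites for ch in s):
        raise ValueError("sites may contain only A, C, G and T")
    n = len(sites)
    signature = []
    for j in range(width):
        column = Counter(s[j] for s in sites)
        kept = sorted(b for b, c in column.items() if c / n >= min_fraction)
        # If no base reaches the threshold, fall back to the fully degenerate N.
        signature.append(IUPAC["".join(kept)] if kept else "N")
    return "".join(signature)


toy_sites = ["ACGT", "ACGA", "ACTT", "TCGT"]  # hypothetical aligned sites
print(degenerate_consensus(toy_sites))                    # WCKW
print(degenerate_consensus(toy_sites, min_fraction=0.5))  # ACGT
```

Raising `min_fraction` yields a stricter, less degenerate signature, while lowering it admits rarer bases and moves columns toward N.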

Figure 1

Snapshots of the D-MATRIX tool.

References (16 in total)

1. P Bucher. Regulatory elements and expression profiles (review). Curr Opin Struct Biol, 1999-06.

2. J W Fickett; W W Wasserman. Discovery and modeling of transcriptional regulatory regions (review). Curr Opin Biotechnol, 2000-02.

3. C J Stoeckert; F Salas; B Brunk; G C Overton. EpoDB: a prototype database for the analysis of genes expressed during vertebrate erythropoiesis. Nucleic Acids Res, 1999-01-01.

4. L Pickert; I Reuter; F Klawonn; E Wingender. Transcription regulatory region analysis using signal detection and fuzzy clustering. Bioinformatics, 1998.

5. D S Prestridge. SIGNAL SCAN 4.0: additional databases and sequence formats. Comput Appl Biosci, 1996-04.

6. A E Kel; Y V Kondrakhin; O V Kel; A G Romashenko; E Wingender; L Milanesi; N A Kolchanov. Computer tool FUNSITE for analysis of eukaryotic regulatory genomic sequences. Proc Int Conf Intell Syst Mol Biol, 1995.

7. C E Lawrence; S F Altschul; M S Boguski; J S Liu; A F Neuwald; J C Wootton. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 1993-10-08.

8. Q K Chen; G Z Hertz; G D Stormo. MATRIX SEARCH 1.0: a computer program that scans DNA sequences for transcriptional elements using a database of weight matrices. Comput Appl Biosci, 1995-10.

9. MatInd and MatInspector: new fast and versatile tools for detection of consensus matches in nucleotide sequence data.

10. Fast index based algorithms and software for matching position specific scoring matrices.
