ARGO: a web system for the detection of degenerate motifs and large-scale recognition of eukaryotic promoters.

Oleg V Vishnevsky, Nikolay A Kolchanov.

Abstract

Reliable recognition of the promoters in eukaryotic genomes remains an open issue. This is largely owing to the poor understanding of the features of the structural-functional organization of the eukaryotic promoters essential for their function and recognition. However, it was demonstrated that detection of ensembles of regulatory signals characteristic of specific promoter groups increases the accuracy of promoter recognition and prediction of specific expression features of the queried genes. The ARGO_Motifs package was developed for the detection of sets of region-specific degenerate oligonucleotide motifs in the regulatory regions of the eukaryotic genes. The ARGO_Viewer package was developed for the recognition of tissue-specific gene promoters based on the presence and distribution of oligonucleotide motifs obtained by the ARGO_Motifs program. Analysis and recognition of tissue-specific promoters in five gene samples demonstrated high quality of promoter recognition. The public version of the ARGO system is available at http://wwwmgs2.bionet.nsc.ru/argo/ and http://emj-pc.ics.uci.edu/argo/.

Substances: Transcription Factors

Year: 2005; PMID: 15980502; PMCID: PMC1160220; DOI: 10.1093/nar/gki459

Source DB: PubMed; Journal: Nucleic Acids Res; ISSN: 0305-1048; Impact factor: 16.971

INTRODUCTION

The assembly of the basal transcription complex and the tissue-specific and stage-specific features of eukaryotic gene transcription depend on the context and structural organization of the promoter core and the presence of transcription factor binding sites (TFBSs) in the 5′ regulatory region of the gene (1,2). Most approaches to promoter recognition use information about the distribution features of potential TFBSs (3,4) and of short oligonucleotides along promoters (5–10), derived from the analysis of Internet-accessible databases (11,12). It is suggested that oligonucleotide composition characteristics of various promoter regions may be determined not only by the presence of TFBSs in these regions but also by certain context-dependent specific features of promoter DNA local conformation (13,14). To reveal such specific oligonucleotide signals, there are methods based on the detection of short conserved motifs that are significantly overrepresented in a sample of promoter sequences compared with the number expected by chance. The methods include analysis of the frequencies of l-mers (l-letter substrings) (15,16), suffix trees (17,18), finding of the largest cliques in the graph induced by the edit distance between the l-mers (19), local multiple alignment approaches using a greedy algorithm (20), expectation–maximization algorithm (21,22) and stochastic sampling strategy (23). Genomic sequencing of an increasing number of eukaryotes (24,25) encouraged the development of methods that use multiple alignment of genomic sequences with expressed sequence tags and mRNA sequences during promoter recognition (26–30) and of those utilizing the information about the localization of promoters in orthologous genes (29,31). In addition, consensus-based methods are applied, which are based on the combined use of several independent methods for recognizing promoters (32). 
However, despite the diversity of the approaches, the reliable recognition of promoters in eukaryotic genomes remains an open issue (33,34). A major hindrance to the development of accurate recognition methods is the tremendous diversity of the structural–functional organization of promoters (11), which makes it difficult to find general context regularities that could serve as a basis for promoter recognition. Recent data indicate a certain similarity in the promoter organization of genes with similar expression patterns (e.g. promoters of genes expressed in particular tissues). This manifests itself as the presence of similar TFBS sets in the promoters of such genes (1,2,35,36). The detection of such ensembles of regulatory signals characteristic of specific groups of promoters increases the accuracy of both promoter recognition and the prediction of specific expression patterns of the analyzed genes.
The ARGO_Motifs package was developed for the analysis of functional nucleotide sequences (37). It allows the recognition of oligonucleotide motifs with the following properties: (i) degeneracy, i.e. the use of the extended IUPAC code (A, T, G, C, R = G/A, Y = T/C, M = A/C, K = G/T, W = A/T, S = G/C, B = T/G/C, V = A/G/C, H = A/T/C, D = A/T/G, N = A/T/G/C); (ii) region-specificity, i.e. the preferential occurrence in a certain region of a functional sequence; (iii) quasi-invariance, i.e. the occurrence in certain sequence subgroups only; (iv) contrast, i.e. much more frequent occurrence in functional than in random sequences. The ARGO_Viewer package was developed (37) for the recognition of tissue-specific gene promoters based on the presence and distribution of oligonucleotide motifs obtained by the ARGO_Motifs program.
Five samples of tissue-specific promoters from the Transcription Regulatory Regions Database, covering the [−300; +100] region relative to the transcription start, were studied using the ARGO_Motifs program (37). The resulting sets of motifs were used to construct methods for recognizing tissue-specific promoters by the ARGO_Viewer program. It was demonstrated that one overprediction error occurred per 100 000 bp, with a false negative error in the 4–8% range, for four of the five promoter samples.
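The extended IUPAC code listed under property (i) can be illustrated with a minimal Python sketch. The letter table is taken from the text; the function names `matches` and `find_motif` are hypothetical helpers introduced here for illustration and are not part of the ARGO packages.

```python
# Extended 15-letter IUPAC code, as listed in the text.
IUPAC = {
    "A": "A", "T": "T", "G": "G", "C": "C",
    "R": "GA", "Y": "TC", "M": "AC", "K": "GT", "W": "AT", "S": "GC",
    "B": "TGC", "V": "AGC", "H": "ATC", "D": "ATG", "N": "ATGC",
}


def matches(motif: str, oligo: str) -> bool:
    """True if each nucleotide of `oligo` is allowed by the motif letter at that position."""
    return len(motif) == len(oligo) and all(
        base in IUPAC[letter] for letter, base in zip(motif, oligo)
    )


def find_motif(motif: str, sequence: str) -> list[int]:
    """Return 0-based start positions where the degenerate motif occurs in `sequence`."""
    l = len(motif)
    return [
        i for i in range(len(sequence) - l + 1)
        if matches(motif, sequence[i:i + l])
    ]
```

For instance, `find_motif("ATAWAARG", seq)` scans one strand of `seq`; scanning the complementary strand would require reverse-complementing the sequence first.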

METHODS AND ALGORITHMS

ARGO_Motifs description

The search for degenerate motifs in a sample of functional sequences using the ARGO_Motifs program (Figure 1) is detailed in (37). It is implemented by grouping similar perfect oligonucleotides from the oligonucleotide vocabularies of the different sequences. Each oligonucleotide of each sequence vocabulary is considered in turn, and a group is formed for it. The group consists of the oligonucleotides from the vocabularies of the other sequences that differ from it in not more than R positions (R < r0, where r0 is the threshold similarity value). Then, a consensus in the extended IUPAC code is constructed for each oligonucleotide group using an iterative procedure. Each position of the consensus is occupied by the most significant of the 15 possible letters; the significance of each letter is estimated independently using the binomial criterion. The resulting oligonucleotide motifs are regarded as significant if they meet the requirements in Equation 1a–c. The significant motif with the smallest probability of occurring by chance is deposited in the databank, and all the perfect oligonucleotides it describes are removed from the vocabularies of the sequences. The motif ranking next in significance is then detected in the same way in the modified vocabularies. The procedure is iterated as long as common degenerate motifs satisfying Equation 1a–c can still be detected.
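The grouping step described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the ARGO_Motifs implementation: the consensus here is simply the smallest IUPAC letter covering all nucleotides seen at each position, which stands in for the iterative, binomial-criterion letter selection described in the text. All function names are hypothetical.

```python
# Extended IUPAC code from the text, and its inverse (nucleotide set -> letter).
IUPAC = {
    "A": "A", "T": "T", "G": "G", "C": "C",
    "R": "GA", "Y": "TC", "M": "AC", "K": "GT", "W": "AT", "S": "GC",
    "B": "TGC", "V": "AGC", "H": "ATC", "D": "ATG", "N": "ATGC",
}
LETTER_FOR = {frozenset(bases): letter for letter, bases in IUPAC.items()}


def vocabulary(sequence: str, l: int) -> set[str]:
    """All perfect oligonucleotides (l-mers) occurring in one sequence."""
    return {sequence[i:i + l] for i in range(len(sequence) - l + 1)}


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length oligonucleotides differ."""
    return sum(x != y for x, y in zip(a, b))


def group_for(seed: str, seed_index: int, vocabularies: list[set[str]], r0: int) -> list[str]:
    """Seed plus all oligonucleotides from OTHER vocabularies with distance R < r0."""
    group = [seed]
    for index, vocab in enumerate(vocabularies):
        if index == seed_index:
            continue
        group.extend(sorted(o for o in vocab if hamming(seed, o) < r0))
    return group


def covering_consensus(oligos: list[str]) -> str:
    """ASSUMPTION: smallest IUPAC letter covering each column (no significance test)."""
    return "".join(LETTER_FOR[frozenset(column)] for column in zip(*oligos))
```

A covering consensus tends to over-generalize as groups grow, which is presumably one reason the original procedure selects each letter by its statistical significance instead.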

Figure 1

Layout of the algorithm for the recognition of degenerate oligonucleotide motifs in a promoter sample.

The degenerate oligonucleotide motif obtained using this procedure is considered significant if it meets the following criteria:

$$F \ge f_0 \quad (1a), \qquad P(n, N) < p_0 \quad (1b), \qquad Q \le q_0 \quad (1c)$$

Here, $F$ is the proportion of promoters containing the motif in the window under analysis; $f_0$ is the threshold level of the motif occurrence in the promoter sample; $P(n, N)$ is the probability of the accidental occurrence of the motif in the analyzed window in not less than $n$ sequences of $N$; $p_0$ is the threshold probability level (see the estimation method below); $Q$ denotes the proportion of sequences of the negative sample containing the motif; $q_0$ is the threshold level of the motif occurrence in the negative sample. A set of 1000 randomly generated sequences of length $L$ is used as the negative sample. Thus, an oligonucleotide motif is accepted as significant if (i) it occurs frequently in the promoter sample, (ii) it occurs infrequently in the sample of random sequences and (iii) the probability of its occurrence by chance in the sample of promoter sequences is sufficiently low.
The probability $P(n, N)$ is calculated as follows. Consider the oligonucleotide motif $M = m_1, m_2, \ldots, m_l$ of length $l$ in the extended 15-letter IUPAC code. The occurrence probability $P(M)$ of the motif at a certain position of a sequence of length $L$ is estimated from the product of its letter frequencies, $\prod_{i=1}^{l} P(m_i)$, where $P(m_i)$ is the frequency of the letter $m_i$ calculated from the mononucleotide composition of promoters. The binomial occurrence probability of the motif $M$ in $\ge n$ sequences of $N$ is

$$P(n, N) = \sum_{k=n}^{N} \binom{N}{k} P(M)^k \, \bigl(1 - P(M)\bigr)^{N-k}$$

The probability $P(n, N)$ calculated in this way is used to assess the significance of the motif by the significance criterion (Equation 1b). Then, the oligonucleotides contained in this motif are removed from the oligonucleotide vocabularies of all the sequences.
The procedure of clustering and construction of a new motif is repeated on the current, continuously shrinking vocabularies for as long as new motifs meeting the criteria of significance (Equation 1a–c) can be constructed.
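The significance test can be sketched in Python as follows. Two points are assumptions made for this illustration rather than statements from the text: the per-sequence probability is taken as one minus the probability of no match at any of the L − l + 1 window positions, and the thresholds in Equation 1a and 1c are treated as non-strict. The function names are hypothetical.

```python
import math

IUPAC = {
    "A": "A", "T": "T", "G": "G", "C": "C",
    "R": "GA", "Y": "TC", "M": "AC", "K": "GT", "W": "AT", "S": "GC",
    "B": "TGC", "V": "AGC", "H": "ATC", "D": "ATG", "N": "ATGC",
}


def position_probability(motif: str, freqs: dict[str, float]) -> float:
    """Product over motif letters of the summed frequency of the permitted nucleotides."""
    p = 1.0
    for letter in motif:
        p *= sum(freqs[base] for base in IUPAC[letter])
    return p


def sequence_probability(motif: str, freqs: dict[str, float], L: int) -> float:
    """ASSUMPTION: chance of at least one match among the L - l + 1 start positions."""
    q = position_probability(motif, freqs)
    return 1.0 - (1.0 - q) ** (L - len(motif) + 1)


def binomial_tail(n: int, N: int, p: float) -> float:
    """Probability that an event of probability p occurs in at least n of N sequences."""
    return sum(math.comb(N, k) * p**k * (1.0 - p) ** (N - k) for k in range(n, N + 1))


def is_significant(F: float, P: float, Q: float, f0: float, p0: float, q0: float) -> bool:
    """Criteria 1a-c; strictness of the F and Q comparisons is an assumption."""
    return F >= f0 and P < p0 and Q <= q0
```

With uniform nucleotide frequencies (another assumption; the text uses the mononucleotide composition of promoters), `binomial_tail(19, 41, sequence_probability("ATAWAARG", freqs, 51))` gives a value far below the 10^−13 threshold used in the erythroid example.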

ARGO_Viewer description

Tissue-specific promoters are recognized by the ARGO_Viewer program [detailed in (37)] in a scanning window that slides with a specified step along the genomic sequence analyzed. In every window, the corresponding region-specific oligonucleotide motifs obtained by ARGO_Motifs (Figure 2) are detected. Then, the similarity between the distribution of the motifs found in this window and that in the promoters of the groups studied is assessed. As a measure of similarity between the jth promoter and the sequence studied, the value P_j, computed over the window positions k = 1, …, L, is used, where L is the size of the window analyzed and p_k is the product of nucleotide frequencies consistent with the motifs covering the kth position (Figure 2).

Figure 2

Example of determination of the set of permissible nucleotides for each position of the [−50; −10] region of an erythroid-specific promoter.

The greater the value of P_j, the lower the probability of chance occurrence of the motif set characteristic of the jth promoter in the sequence. Thus, the promoter displaying the maximum value of the similarity function is found. If this value exceeds a certain threshold, a promoter of the considered group is regarded as identified in the window.
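The scanning scheme can be illustrated with a minimal Python sketch. The exact similarity formula is not reproduced in this text, so the sketch assumes a score equal to the negative log-probability of the motifs of a promoter that are found at their expected offsets in the window; this is consistent with the statement that larger values mean a lower chance probability, but it is an assumption. The representation of a promoter as a list of (offset, motif) placements and all function names are hypothetical.

```python
import math

IUPAC = {
    "A": "A", "T": "T", "G": "G", "C": "C",
    "R": "GA", "Y": "TC", "M": "AC", "K": "GT", "W": "AT", "S": "GC",
    "B": "TGC", "V": "AGC", "H": "ATC", "D": "ATG", "N": "ATGC",
}


def motif_log_prob(motif: str, freqs: dict[str, float]) -> float:
    """Sum over motif positions of log(summed frequency of permitted nucleotides)."""
    return sum(math.log(sum(freqs[b] for b in IUPAC[c])) for c in motif)


def matches_at(motif: str, seq: str, pos: int) -> bool:
    """True if the degenerate motif matches `seq` starting at 0-based `pos`."""
    if pos < 0 or pos + len(motif) > len(seq):
        return False
    return all(seq[pos + i] in IUPAC[c] for i, c in enumerate(motif))


def window_score(seq, start, placements, freqs):
    """ASSUMED score: -log probability of the promoter's motifs found in the window."""
    score = 0.0
    for offset, motif in placements:
        if matches_at(motif, seq, start + offset):
            score -= motif_log_prob(motif, freqs)
    return score


def scan(seq, promoters, window, step, threshold, freqs):
    """Slide a window along `seq`; report (start, best score) where the best-matching
    promoter of the group exceeds the threshold."""
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        best = max(window_score(seq, start, p, freqs) for p in promoters)
        if best > threshold:
            hits.append((start, best))
    return hits
```

The sketch treats motif placements as non-overlapping and scans a single strand; both simplifications would need revisiting in a faithful reimplementation.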

IMPLEMENTATION

Description of the web-interface of the ARGO_Motifs program

The public version of ARGO_Motifs (Figure 3A) is available through the ARGO system sites at http://wwwmgs2.bionet.nsc.ru/argo/ and http://emj-pc.ics.uci.edu/argo/.

Figure 3

Example of ARGO_Motifs input and output windows. (A) Input window. The region [−50; +1] of promoters of erythroid-specific genes is analyzed. (B) A table containing the motifs detected and their characteristics. (C) A distribution pattern of the motifs found.

The user can paste a set of sequences of equal length in FASTA format into the sequence input box. All the parameters needed for the analysis are specified in the lower part of the window. The program was designed to search for region-specific motifs; therefore, once the sample of DNA sequences has been entered, the user can analyze the regions of interest consecutively. In addition, the length of the motifs to be detected, the Hamming distance and the degree of similarity between the perfect oligonucleotides clustered in a motif are specified. The user can search both for perfect oligonucleotide motifs in the 4-letter (A, T, G and C) code and for degenerate motifs in the 15-letter IUPAC code. The program can find motifs meeting the significance criteria in both DNA strands. For the motifs detected, the user can specify both the threshold binomial probability of their random occurrence in the examined sample and the threshold occurrence rate (%) of a motif, i.e. the fraction of analyzed sequences containing the motif. The results of the sequence analysis are displayed as a table containing the motifs detected and their characteristics. As an example, Figure 3B shows the motifs found in the [−50; +1] region of promoters of erythroid-specific genes. Motifs of length l = 8 meeting the following parameters of Equation 1a–c were considered significant: P(n, N) < $10^{-13}$, f0 = 20% and q0 = 100%. Consider the first oligonucleotide listed in Figure 3B: ATAWAARG = (A)(T)(A)(A/T)(A)(A)(A/G)(G), found in the [−50; +1] region relative to the transcription start. This motif was found in 19 of 41 promoters (46%), more than twice the threshold (20%). The random occurrence probability of this motif in 19 or more of the 41 promoters is $10^{-36}$. In the negative sample, this motif occurred in the queried region in only 4 of 1000 random sequences (0.4%). Hence, this motif meets the significance criteria (Equation 1a–c).
In addition to the table output mode, the user can get a distribution pattern of the motifs found in the selected window of the sample analyzed (Figure 3C). This representation may be useful for detecting ensembles of co-occurring motifs and for subgrouping the sequences of the total sample.
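The ATAWAARG example can be checked with a few lines of Python. The sketch expands the degenerate motif into the perfect oligonucleotides it describes and recomputes the occurrence rates quoted in the text; `expand` is a hypothetical helper, and only the letters needed for this motif are included in the table.

```python
from itertools import product

# Subset of the extended IUPAC code needed for this example.
IUPAC = {"A": "A", "T": "T", "G": "G", "C": "C", "W": "AT", "R": "GA"}


def expand(motif: str) -> list[str]:
    """All perfect (4-letter code) oligonucleotides described by a degenerate motif."""
    return ["".join(p) for p in product(*(IUPAC[c] for c in motif))]


perfect = expand("ATAWAARG")        # W and R each allow two nucleotides -> 4 oligos
F = 19 / 41                          # fraction of promoters containing the motif
Q = 4 / 1000                         # fraction of random sequences containing it
meets_1a = F >= 0.20                 # f0 = 20% (non-strict comparison assumed)
meets_1c = Q <= 1.00                 # q0 = 100% (non-strict comparison assumed)
```

With q0 = 100%, criterion 1c places no effective constraint in this example, so significance rests on the occurrence rate and the binomial probability.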

Web interface of the ARGO_Viewer program

The ARGO_Viewer package was developed for the recognition of tissue-specific gene promoters on the basis of the presence and distribution of oligonucleotide motifs obtained by the ARGO_Motifs program. The public version of ARGO_Viewer (Figure 4A) is available through the ARGO system sites at http://wwwmgs2.bionet.nsc.ru/argo/ and http://emj-pc.ics.uci.edu/argo/.

Figure 4

Profile of the promoter recognition function for the sequence of the human β-globin gene cluster (EMBL ID: HSHBB). Values of the recognition function (ordinate) are plotted versus positions of the sequence (abscissa). Arrows indicate the positions of the transcription starts of the genes in this cluster. The triangle shows the position of the 5′-terminal region of the pseudogene corresponding to the transcription start point.

The user can paste the genomic sequence to be analyzed in FASTA format into the sequence input box. The class of promoters to be searched for is specified at the bottom of the window. The program searches for promoters in both the direct and the complementary DNA strands. Two output modes are provided for the recognition results. In text mode, the user gets a list of positions of potential transcription starts. In graphic mode, the program constructs the profile of the recognition function. The program is illustrated (Figure 4) using the human β-globin region (ID HSHBB), 73 308 bp in length, mapped on chromosome 11. This sequence contains five experimentally detected transcription start sites at positions 19 487, 34 478, 39 414, 54 740 and 62 137, together with the promoter region of a pseudogene in the vicinity of position 45 557. The predicted positions of the transcription starts of the five genes of this cluster differed from the real starts by not more than 20 bp. Thus, in this example the proposed procedure recognizes promoters with high efficiency.

34 in total

1. The Eukaryotic Promoter Database EPD: the impact of in silico primer extension.

Authors: Christoph D Schmid; Viviane Praz; Mauro Delorenzi; Rouaïda Périer; Philipp Bucher
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

2. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences.

Authors: C E Lawrence; A A Reilly
Journal: Proteins Date: 1990

Review 3. [Mechanisms of transcriptional regulation of interferon-induced genes: description in the IIG-TRRD information system].

Authors: E A Anan'ko; S I Bazhan; O E Belova; A E Kel'
Journal: Mol Biol (Mosk) Date: 1997 Jul-Aug

Review 4. [Mechanisms of transcriptional regulation of erythroid specific genes].

Authors: O A Podkolodnaia; I L Stepanenko
Journal: Mol Biol (Mosk) Date: 1997 Jul-Aug

Review 5. Eukaryotic promoter recognition.

Authors: J W Fickett; A G Hatzigeorgiou
Journal: Genome Res Date: 1997-09 Impact factor: 9.043

6. Predicting Pol II promoter sequences using transcription factor binding sites.

Authors: D S Prestridge
Journal: J Mol Biol Date: 1995-06-23 Impact factor: 5.469

7. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment.

Authors: C E Lawrence; S F Altschul; M S Boguski; J S Liu; A F Neuwald; J C Wootton
Journal: Science Date: 1993-10-08 Impact factor: 47.728

8. Eukaryotic promoter recognition by binding sites for transcription factors.

Authors: Y V Kondrakhin; A E Kel; N A Kolchanov; A G Romashchenko; L Milanesi
Journal: Comput Appl Biosci Date: 1995-10

9. Dragon gene start finder: an advanced system for finding approximate locations of the start of gene transcriptional units.

Authors: Vladimir B Bajic; Seng Hong Seah
Journal: Genome Res Date: 2003-07-17 Impact factor: 9.043

10. PromoSer: A large-scale mammalian promoter and transcription start site identification service.

Authors: Anason S Halees; Dmitriy Leyfer; Zhiping Weng
Journal: Nucleic Acids Res Date: 2003-07-01 Impact factor: 16.971
