Alexis Vandenbon, Yuki Miyamoto, Noriko Takimoto, Takehiro Kusakabe, Kenta Nakai.
Abstract
Transcriptional regulation is the first level of regulation of gene expression and is therefore a major topic in computational biology. Genes with similar expression patterns can be assumed to be co-regulated at the transcriptional level by promoter sequences with a similar structure. Current approaches for modeling shared regulatory features tend to focus mainly on clustering of cis-regulatory sites. Here we introduce a Markov chain-based promoter structure model that uses both shared motifs and shared features from an input set of promoter sequences to predict candidate genes with similar expression. The model uses the positional preference, order, and orientation of motifs. The trained model is used to score a genomic set of promoter sequences: high-scoring promoters are assumed to have a structure similar to that of the input sequences and are thus expected to drive similar expression patterns. We applied our model to two datasets, one in Caenorhabditis elegans and one in Ciona intestinalis. Both computational and experimental verifications indicate that this model is capable of predicting candidate promoters that drive expression patterns similar to those of the input regulatory sequences. This model can be useful for finding promising candidate genes for wet-lab experiments and for increasing our understanding of transcriptional regulation.
Year: 2008 PMID: 18258700 PMCID: PMC2650632 DOI: 10.1093/dnares/dsm034
Source DB: PubMed Journal: DNA Res ISSN: 1340-2838 Impact factor: 4.458
Figure 1. A visual representation of the scoring process of the Markov chain-based promoter structure model. (A) A promoter sequence to score. The arrow on the right indicates the translation or transcription start site. The squares represent predicted TFBSs for motifs A, B, and C, with ‘+’ and ‘−’ indicating their orientation. The promoter sequence is divided into a proximal and a distal region; the boundary between these regions is here set at −500 bp. (B) and (C) A visual representation of the promoter model during the scoring process of the distal region and the proximal region, respectively. The states of the model are shown as circles. Each of the two regions has a ‘start’ and a ‘stop’ state, in addition to states for each motif type in both orientations. To score the sequence shown in (A), in the proximal region of the promoter a transition is made from ‘start’ to ‘C+’, from ‘C+’ to ‘A−’, from ‘A−’ to ‘A−’, and finally from ‘A−’ to ‘stop’, corresponding to the TFBSs predicted in the proximal region of the promoter. The score of the proximal region is the sum of the LLR values associated with each of these transitions (e.g. LLRproximal(C+ | start) for the transition from ‘start’ to ‘C+’, etc.). This process is repeated for the distal part of the promoter, and the final score of the promoter is the sum of the scores of both regions.
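The scoring procedure described in the Figure 1 legend can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the data layout, the upstream-to-downstream traversal order, the treatment of unseen transitions (scored as 0.0), and all LLR values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
from typing import Dict, List, Tuple

# A predicted site: (position relative to the start site, motif name, orientation).
Site = Tuple[int, str, str]
# One LLR table per region: maps (previous_state, next_state) to an LLR value.
LLRTable = Dict[Tuple[str, str], float]


def score_region(states: List[str], llr: LLRTable, default: float = 0.0) -> float:
    """Sum the LLR values along the path start -> states... -> stop.

    Assumption: transitions missing from the table contribute `default`.
    """
    path = ["start"] + states + ["stop"]
    return sum(llr.get((prev, nxt), default) for prev, nxt in zip(path, path[1:]))


def score_promoter(
    sites: List[Site],
    llr_distal: LLRTable,
    llr_proximal: LLRTable,
    boundary: int = -500,
) -> float:
    """Score a promoter as the sum of its distal and proximal region scores."""
    ordered = sorted(sites)  # assumption: sites are traversed upstream to downstream
    distal = [motif + strand for pos, motif, strand in ordered if pos < boundary]
    proximal = [motif + strand for pos, motif, strand in ordered if pos >= boundary]
    return score_region(distal, llr_distal) + score_region(proximal, llr_proximal)


# Hypothetical promoter: one distal site (B+) and the proximal path C+, A-, A-.
example_sites = [(-900, "B", "+"), (-400, "C", "+"), (-250, "A", "-"), (-100, "A", "-")]

# Made-up LLR values, for illustration only.
example_llr_distal = {("start", "B+"): 0.75, ("B+", "stop"): 0.25}
example_llr_proximal = {
    ("start", "C+"): 0.5,
    ("C+", "A-"): 1.0,
    ("A-", "A-"): 0.25,
    ("A-", "stop"): -0.25,
}

example_score = score_promoter(example_sites, example_llr_distal, example_llr_proximal)
print(example_score)  # distal 1.0 + proximal 1.5
```

Keeping a separate LLR table per region mirrors the legend's statement that each region has its own ‘start’ and ‘stop’ states, so the same motif transition can be scored differently in the distal and proximal regions.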
Table 1. The seven motifs used in the C. elegans pharyngeal muscle promoter model, with their consensus sequences
| Motif name | Consensus sequence | Positional bias score | Densest 300 bp window | Orientation bias: +/− site counts (ratio +) |
|---|---|---|---|---|
| Cel_PM1 | TTTSBVRRATTTTMR | 7.3e−9 | −862 to −562 | 25–14 (0.64) |
| Cel_PM2 | ACTCMGAGCA | 1.1e−4 | −337 to −37 | 12–12 (0.50) |
| Cel_PM3 | CGGGATCT | 9.1e−4 | −504 to −204 | 9–16 (0.36) |
| Cel_PM4 | GAATCAGCGC | 4.1e−3 | −605 to −305 | 18–7 (0.72) |
| Cel_PM5 | AAAAATTCAATTTT | 0.033 | −2240 to −1940 | 17–17 (0.50) |
| Cel_PM6 | GCARCAWA | 0.034 | −1742 to −1442 | 11–12 (0.48) |
| Cel_PM7 | CTCCCTGAGC | 0.086 | −1307 to −1007 | 21–7 (0.75) |
The third and fourth columns show the positional bias score of each motif and the positions of the densest window relative to the translation start site, respectively. The fifth column shows the number of predicted sites in the input promoters on each strand and the ratio of sites in the ‘plus’ orientation.
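As a quick check of the fifth column, the ‘ratio +’ values are consistent with the fraction of predicted sites on the plus strand, plus / (plus + minus). The short sketch below recomputes them from the counts in the table; the function name `plus_ratio` is a hypothetical helper, and the formula is inferred from the tabulated values rather than quoted from the authors.

```python
def plus_ratio(plus: int, minus: int) -> float:
    """Fraction of predicted sites found in the 'plus' orientation."""
    total = plus + minus
    if total == 0:
        raise ValueError("no predicted sites")
    return plus / total


# Plus and minus site counts copied from the fifth column of the table above.
orientation_counts = {
    "Cel_PM1": (25, 14),
    "Cel_PM2": (12, 12),
    "Cel_PM3": (9, 16),
    "Cel_PM4": (18, 7),
    "Cel_PM5": (17, 17),
    "Cel_PM6": (11, 12),
    "Cel_PM7": (21, 7),
}

for name, (plus, minus) in orientation_counts.items():
    print(f"{name}: {plus_ratio(plus, minus):.2f}")
```

A ratio near 0.50 (e.g. Cel_PM2, Cel_PM5) indicates no strand preference, whereas values such as 0.75 for Cel_PM7 indicate that most predicted sites share one orientation.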
The ten highest scoring non-input promoters for the C. elegans pharyngeal muscle promoter model, with their rank, sequence and transcript name and reported expression pattern
| Rank | Sequence name | Transcript name | Expression pattern as annotated on WormBase |
|---|---|---|---|
| 30 | Y24D9A.4 | Y24D9A.4a.2 | Nervous system, reproductive system, anal depressor muscle, body wall muscle, pharynx |
| 47 | F52C9.8 | F52C9.8e | Nervous system, intestine |
| 50 | T13C5.1 | T13C5.1a | Head neurons, hypodermis, vulval muscle, anterior ganglia, spermathecae |
| 72 | D1081.2 | D1081.2 | Stomato-intestinal muscle, anal depressor muscle, body wall muscle |
| 75 | F22B7.9 | F22B7.9 | E lineage, syncytial hypoderm |
| 79 | C53C11.3 | C53C11.3 | Head neurons, ventral nerve cord, tail neurons, nervous system |
| 80 | F10E9.6 | F10E9.6a.1 | Nervous system, reproductive system, body wall muscle, pharyngeal neurons, anal depressor muscle, vulval muscle |
| 83 | ZK652.8 | ZK652.8 | Head neurons, nervous system, intestine, tail neurons |
| 84 | C36E6.5 | C36E6.5.2 | Pharynx, pharyngeal muscle |
| 86 | R07B1.1 | R07B1.1 | Ventral cord motor neurons, seam cells, hypodermal, neuroblasts, head |
Of these ten promoters, five drive expression in one or more muscle tissues, and one of these drives expression specifically in pharyngeal muscles.
The ten motifs used in the C. intestinalis muscle promoter model, with their consensus sequences. See the legend of Table 1 for an explanation of each column.
| Motif name | Consensus sequence | Positional bias score | Densest 300 bp window | Orientation bias: +/− site counts (ratio +) |
|---|---|---|---|---|
| Cin_1 | TKGTGACGTCA | 1.2e−5 | −232 to +68 | 24–14 (0.63) |
| Cin_2 | GCCGGC | 1.9e−3 | −1020 to −720 | 19–10 (0.66) |
| Cin_3 | TGCAGCTGCR | 2.5e−3 | −407 to −107 | 12–14 (0.46) |
| Cin_4 | MACAACARA | 4.8e−3 | −328 to −28 | 15–9 (0.63) |
| Cin_5 | ATAAACGACANA | 6.9e−3 | −614 to −314 | 21–8 (0.72) |
| Cin_6 | ATGCCGAC | 0.037 | −214 to +86 | 14–13 (0.52) |
| Cin_7 | CATCGGGGTA | 0.040 | −398 to −98 | 14–9 (0.61) |
| Cin_8 | NVNNGACAACTG | 0.045 | −58 to +242 | 19–18 (0.51) |
| Cin_9 | AMTCAAGCAA | 0.094 | −150 to +150 | 17–10 (0.63) |
| Cin_10 | YTTCACTC | 0.13 | −191 to +109 | 19–5 (0.79) |
Here the positions of the densest window are given relative to the TATA-box.
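One simple way to locate a ‘densest 300 bp window’ is a sliding-window count over the predicted site positions. The sketch below is illustrative only: the record does not say how the authors computed the window, so the function name, the pooling of site positions into one list, the closed interval [start, start + 300], the restriction to windows that begin at a site, and the tie-breaking rule (most upstream window wins) are all assumptions, and the positions are made up.

```python
from typing import List, Tuple


def densest_window(positions: List[int], width: int = 300) -> Tuple[int, int, int]:
    """Return (start, end, count) for the window of `width` bp holding the most sites.

    Only windows that begin at a site are tried; this is sufficient because any
    window can be shifted downstream to its first site without losing a site.
    """
    if not positions:
        raise ValueError("no site positions given")
    pts = sorted(positions)
    best_start, best_count = pts[0], 0
    j = 0  # index of the first site lying beyond the current window
    for i, start in enumerate(pts):
        while j < len(pts) and pts[j] <= start + width:
            j += 1
        count = j - i
        if count > best_count:  # strict '>' keeps the most upstream window on ties
            best_start, best_count = start, count
    return best_start, best_start + width, best_count


# Hypothetical site positions relative to the reference point (e.g. the TATA-box).
example_positions = [-900, -860, -700, -600, -570, -100]
print(densest_window(example_positions))
```

Because both indices only move forward, the scan after sorting is linear in the number of sites.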
Figure 2. Expression signals of four high-scoring genes for the Ciona muscle promoter architecture model, determined by in situ hybridization experiments in C. intestinalis. These are the 20th, 31st, 41st, and 50th highest scoring sequences, respectively. These ranks include the input sequences and possible alternative transcripts. For each gene, the expression in the trunk and in the tail is shown. (A) A gene encoding a protein similar to human ‘vacuolar H+ ATPase E1’. This gene is conspicuously expressed in the central nervous system (brain, visceral ganglion, nerve cord) as well as in mesenchyme, but not in the muscle cells. (B) A gene encoding a protein similar to human ‘deformed epidermal autoregulatory factor 1’. In the trunk, this gene is specifically expressed in mesenchyme cells. In the tail, signals are predominantly found in muscle cells. Note that signals are not found in the notochord and epidermis. (C) A gene encoding a protein similar to human ‘glioma tumor suppressor candidate region gene 1 isoform 4’. It is expressed in endoderm of the trunk and also expressed weakly in muscle cells of the tail. (D) A gene encoding a protein similar to human ‘antigen p97 (melanoma associated) identified by monoclonal antibodies 133.2 and 96.5’. In the trunk, this gene is weakly expressed in endoderm cells. Signals are predominantly found in muscle cells, while signals are not found in the notochord and epidermis. Color versions of these pictures are available upon request.