Ian W Davis, Christopher Benninger, Philip N Benfey, Tedd Elich.
Abstract
Transcription factors and the short, often degenerate DNA sequences they recognize are central regulators of gene expression, but their regulatory code is challenging to dissect experimentally. Thus, computational approaches have long been used to identify putative regulatory elements from the patterns in promoter sequences. Here we present a new algorithm "POWRS" (POsition-sensitive WoRd Set) for identifying regulatory sequence motifs, specifically developed to address two common shortcomings of existing algorithms. First, POWRS uses the position-specific enrichment of regulatory elements near transcription start sites to significantly increase sensitivity, while providing new information about the preferred localization of those elements. Second, POWRS forgoes position weight matrices for a discrete motif representation that appears more resistant to over-generalization. We apply this algorithm to discover sequences related to constitutive, high-level gene expression in the model plant Arabidopsis thaliana, and then experimentally validate the importance of those elements by systematically mutating two endogenous promoters and measuring the effect on gene expression levels. This provides a foundation for future efforts to rationally engineer gene expression in plants, a problem of great importance in developing biotech crop varieties.
Availability: BSD-licensed Python code at http://grassrootsbio.com/papers/powrs/.
Year: 2012 PMID: 22792292 PMCID: PMC3390389 DOI: 10.1371/journal.pone.0040373
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Table 1. Comparison of motif finders on benchmark and de novo discovery data sets.
| Group | Target | Ref. | POWRS | POWRS-FL | Simple8 | Simple6 | Amadeus | Weeder | Trawler | YMF | AlignACE | MEME | Dispom |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Human TFs | CREB | MA0018.2 | 2 | 3 | 2 | 2 | 2 | X(2) | X | X | X | X | X(2) |
| | E2F | MA0024.1 | 2 | 2 | X | 1 | 1 | X(1) | 2 | X | X | X | 1 |
| | ETS1 | MA0098.1 | 2 | 1 | 1 | 1 | 3 | 1 | X | X(1) | X | X | 2 |
| | HNF1a | MA0046.1 | 1 | X | X | 2 | 3(1) | X | 1 | X | X(4) | X | 2 |
| | NFkB1 | MA0105.1 | 1 | X | X | X | 1 | X(4) | 2 | X | X | X | 2 |
| | P53 | MA106.1 | X | X | X | X | 1(X) | 3 | X | X | X | X | X |
| | Sox2 | MA0143.1 | X | X | X | X | X | X | X | X | X | X | X |
| | SRF | MA0083.1 | 1 | 1 | X | 1 | 1 | 2(1) | 2 | 4 | X | X | 1 |
| | YY1 | MA0095.1 | 1 | 2 | X | 1 | 1 | 1 | 1 | 1 | 1 | X | 1 |
| Human miRNAs | let7 (B) | MIMAT0000062 | X | X | X | X | X | X | X | X | X | X | nd |
| | let7 (J) | MIMAT0000062 | 1 | 1 | X | X | 1 | 1 | 1 | X | X | X | nd |
| | miR106b | MIMAT0000680 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | X | 1 | nd |
| | miR124 | MIMAT0000422 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | X | 1 | nd |
| | miR16 | MIMAT0000069 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | X | X | 1 | nd |
| | miR1 | MIMAT0000416 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | X | X | 1 | nd |
| | miR34 (C) | MIMAT0000686 | X | X | X | X | X | X | X | X(4) | X | X | nd |
| | miR34 (H) | MIMAT0000686 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | X | X | 2 | nd |
| | miR373 | MIMAT0000726 | 1 | 1 | 1 | 1 | 1 | 1 | 2(X) | 1 | X | 1 | nd |
| Arabidopsis | telo box | | 1 | 2 | 1 | 2 | nd | nd | nd | nd | nd | nd | nd |
| | Site II | AAGGCCCAWT | 2 | 1 | 2 | 1 | nd | nd | nd | nd | nd | nd | nd |
| | TATA box | TCTATAAAA | 3 | X | X | X | nd | nd | nd | nd | nd | nd | nd |
Rank of the “correct” motif in the output of various programs. “Target” refers to data sets defined in [3]. “Ref.” gives the accession number in the JASPAR or miRBase database, or the target consensus sequence. X, no match in the top 4 results; nd, not determined (i.e. the tools were not run due to licensing restrictions on non-academic use). Results for Amadeus, Weeder, Trawler, YMF, AlignACE, and MEME are quoted from [3], as several are not freely available outside academia. Results for Dispom are quoted from [17]. “POWRS-FL” is POWRS without position sensitivity (“full length”). “Simple8” and “Simple6” are the whole-sequence, binomial-scoring algorithm described in the text, using 8-mers and 6-mers respectively. A result was considered correct if at least 6 contiguous bases of the result matched the literature motif (except for ETS1 and YY1, whose motifs are effectively 4 bases long). Where the ranking from the more permissive PWM-based metric in [3] disagrees, it is shown in parentheses.
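The "at least 6 contiguous bases" criterion in this caption can be illustrated with a short sketch. This is not the authors' scoring code: the function names `longest_match` and `is_correct` are hypothetical, only ungapped alignments on one strand are tried, and IUPAC ambiguity codes in the reference (including N) are assumed to count as matches, which is more permissive than a manual comparison would likely be.

```python
# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def longest_match(word, reference):
    """Longest run of contiguous positions at which `word` (plain ACGT) is
    compatible with the IUPAC `reference`, over all ungapped alignments."""
    best = 0
    for offset in range(-(len(word) - 1), len(reference)):
        run = 0
        for i, base in enumerate(word):
            j = offset + i
            if 0 <= j < len(reference) and base in IUPAC[reference[j]]:
                run += 1
                best = max(best, run)
            else:
                run = 0  # a mismatch or overhang breaks the contiguous run
    return best

def is_correct(word, reference, min_run=6):
    """Assumed reading of the caption: correct if the run is >= min_run."""
    return longest_match(word, reference) >= min_run
```

For example, `longest_match("TAAGCACT", "AGCACTTC")` aligns the last six bases of the word with the first six of the reference. A fuller version would also test the reverse complement for motifs searched on both strands.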
Table 2. Detailed results of POWRS motif searches.
| Target | Ref. | Rank | Score | Motif | Start | End | Strands? |
|---|---|---|---|---|---|---|---|
| CREB | TGACGTNW | 2 | 45.4 | [Act][Gat][Tg][Gc]ACG[Tac] | −400 | 0 | Both |
| E2F | TTTSSCGC | 2 | 10.7 | TT[Ga][Gt]C[Ga]C[Gc] | −450 | −50 | Both |
| ETS1 | NWTCCN | 2 | 50.8 | [Cagt]A[Cagt]TTCCG | −550 | 0 | Both |
| HNF1a | GGTTAATNWTTNNN | 1 | 15.2 | TTA[Ac][Tc][Gac]A[Tcg] | −250 | 0 | Both |
| NFkB1 | GGGGRWYYCCC | 1 | 11.4 | G[Gt][Ag][At][At][Tac][Cat]C | −400 | 0 | Both |
| P53 | NNRRRCATGYCCGGGCATGT | – | – | – | – | – | – |
| Sox2 | CCWTTGTNNTNNNNN | – | – | – | – | – | – |
| SRF | GCCCWTATAWGG | 1 | 12.8 | C[Ca][At][Ta][Agt]T[At][Ta] | −300 | 0 | One |
| YY1 | NCCATN | 1 | 67.2 | GC[Cg]AT[Gact]T[Tc] | −350 | 0 | Both |
| let7 (B) | CTACCTCA | – | – | – | – | – | – |
| let7 (J) | CTACCTCA | 1 | 24.3 | [Ca]TA[Cag]CT[Cg][Ta] | 0 | 5000 | One |
| miR106b | GCACTTTA | 1 | 29.1 | GCACTTT[Act] | 0 | 4000 | One |
| miR124 | GTGCCTTA | 1 | 31.9 | [Gt]TGCCTT[Acgt] | 0 | 5000 | One |
| miR16 | TGCTGCTA | 1 | 31.3 | [Tac]GCTGCT[Agt] | 0 | 3500 | One |
| miR1 | ACATTCCA | 1 | 23.8 | [Agt]CATTCC[Agt] | 0 | 2000 | One |
| miR34 (C) | CACTGCCT | – | – | – | – | – | – |
| miR34 (H) | CACTGCCT | 1 | 45.9 | [Cagt]AC[Tag]GCC[Tag] | 0 | 2000 | One |
| miR373 | AGCACTTC | 1 | 18.4 | [Tac]A[Ag]GCACT | 0 | 1000 | One |
| telo box | | 1 | 28.0 | [Ag]C[Ct]C[Ta][At][Gat][Tacg] | −75 | +25 | Both |
| Site II | AAGGCCCAWT | 2 | 23.4 | [Agt][Gat][Ga]CC[Cg]A[Acgt] | −150 | −25 | One |
| TATA box | TCTATAAAA | 3 | 17.2 | [Tacg][Ca][Tg]ATAA[Ag] | −50 | −25 | One |
“Target” refers to the data sets from [3]. Reference motifs are IUPAC approximations of PWMs from JASPAR (human TFs), seed sequences from miRBase (human miRNAs), or manual consensus sequences (Arabidopsis). (See Table S2 for the same data with PWMs from JASPAR shown as sequence logos.) Motifs are written with the primary base at each position in uppercase and any variant bases in lowercase; degenerate positions are grouped in square brackets. Matching words are those that use at most one variant base, so [Tac]GCTGCT[Agt] = {TGCTGCTA, aGCTGCTA, cGCTGCTA, TGCTGCTg, TGCTGCTt}.
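The word-set notation above can be made concrete with a minimal sketch that expands a bracketed motif into its matching words. The function names `parse_motif` and `expand_motif` are illustrative and are not taken from the POWRS distribution; the sketch assumes exactly one uppercase (primary) base per position and returns all words in uppercase.

```python
import re

def parse_motif(motif):
    """Split a motif such as '[Tac]GCTGCT[Agt]' into (primary, variants) pairs."""
    positions = []
    for token in re.findall(r"\[[A-Za-z]+\]|[A-Za-z]", motif):
        letters = token.strip("[]")
        primary = [c for c in letters if c.isupper()]
        if len(primary) != 1:
            raise ValueError(f"expected one primary base in {token!r}")
        variants = [c.upper() for c in letters if c.islower()]
        positions.append((primary[0], variants))
    return positions

def expand_motif(motif):
    """Return every word that uses at most one variant base."""
    positions = parse_motif(motif)
    consensus = [primary for primary, _ in positions]
    words = {"".join(consensus)}          # zero variant bases
    for i, (_, variants) in enumerate(positions):
        for v in variants:                # exactly one variant base, at position i
            word = consensus.copy()
            word[i] = v
            words.add("".join(word))
    return words
```

Calling `expand_motif("[Tac]GCTGCT[Agt]")` yields the five words listed in the caption. Because only one position may vary at a time, the set grows linearly with the number of variant bases, not multiplicatively as it would for a regular-expression character class.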
Figure 1. Graphical depiction of Site II motif matches in Arabidopsis.
Smoothed histogram (kernel density estimate) of occurrences of the Site II motif in Arabidopsis promoters from the 118 constitutive genes of interest (solid line) or background genes (dashed line). The Site II motif is as defined in Table 2. Units of motif density are occurrences per base pair per sequence. POWRS reports maximal enrichment of Site II in the genes of interest relative to the background in the region from −150 to +25, in excellent agreement with what is seen here. Note that although Site II occurs more often near the TSS for all genes, the effect is significantly stronger among the genes of interest.
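A smoothed histogram of this kind can be sketched in a few lines. The Gaussian kernel, the 20 bp bandwidth, and the function name `motif_density` are assumptions made for illustration; the caption does not state which kernel or bandwidth was used.

```python
import math

def motif_density(positions, n_sequences, grid, bandwidth=20.0):
    """Gaussian kernel density of motif positions relative to the TSS.

    Each occurrence contributes a kernel that integrates to 1, and the sum is
    divided by the number of sequences, so the result is in occurrences per
    base pair per sequence.
    """
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    density = []
    for x in grid:
        total = sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in positions)
        density.append(norm * total / n_sequences)
    return density

# Hypothetical usage: positions pooled over a set of promoters.
grid = list(range(-1000, 1001))
example = motif_density([-120, -80, -60, 10], n_sequences=2, grid=grid)
```

Computing one curve for the genes of interest and one for the background genes on the same grid gives the two lines compared in the figure.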
Figure 2. Transversion scheme in GR2A and GR11A.
Endogenous sequence is shown in black; the sequence after transversion is shown above it in gray. Transcription start sites annotated by TAIR9 [22] and inferred from EST data are indicated. Blocks for transversion are numbered and delimited by spaces. Natural Site II and telo box motifs are marked on the endogenous sequence in green and yellow, respectively. Non-natural Site II and telo box motifs created by the transversions are marked on the transversion sequence; in some cases, these are split between natural and mutated sequences. Blocks whose transversion clearly disrupted promoter activity are numbered in red (compare to Figure 3).
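Block-wise transversion of a promoter can be sketched as below. The specific substitution (A↔C, G↔T) is an assumption: the caption says only that blocks were transverted, and A↔T with G↔C would also be a purine-pyrimidine swap. The 10 bp block size follows the Figure 3 caption, and `transvert_block` is a hypothetical helper name.

```python
# Assumed transversion: each purine is swapped for a pyrimidine (A<->C, G<->T).
TRANSVERSION = str.maketrans("ACGTacgt", "CATGcatg")

def transvert_block(promoter, block_index, block_size=10):
    """Return `promoter` with one numbered block (0-based) transverted."""
    start = block_index * block_size
    if start < 0 or start >= len(promoter):
        raise IndexError("block_index is outside the promoter sequence")
    end = start + block_size
    return promoter[:start] + promoter[start:end].translate(TRANSVERSION) + promoter[end:]

def all_block_mutants(promoter, block_size=10):
    """One mutant per block, leaving the rest of the promoter endogenous."""
    n_blocks = (len(promoter) + block_size - 1) // block_size
    return [transvert_block(promoter, i, block_size) for i in range(n_blocks)]
```

Because the mapping is its own inverse, applying `transvert_block` twice to the same block restores the endogenous sequence, which makes a convenient sanity check.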
Figure 3. Transversion results for GR2A and GR11A.
Mean and standard error of GFP expression driven by 10 bp transversion mutants of endogenous promoters GR2A and GR11A. Stable transgenic plants from 4–6 independent events per line were assayed by qRT-PCR and corrected for copy number.
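The summary statistic plotted for each line can be sketched as follows. Dividing expression by transgene copy number is an assumed form of the correction, since the caption does not describe how it was done, and `mean_and_sem` is a hypothetical name.

```python
import math
import statistics

def mean_and_sem(expression, copy_numbers):
    """Mean and standard error of copy-number-corrected expression values.

    `expression` holds one qRT-PCR value per independent transgenic event;
    `copy_numbers` holds the matching transgene copy numbers.
    """
    # Assumption: correction is a simple per-event division by copy number.
    corrected = [e / c for e, c in zip(expression, copy_numbers)]
    mean = statistics.mean(corrected)
    sem = statistics.stdev(corrected) / math.sqrt(len(corrected))
    return mean, sem
```

With only 4–6 events per line, the standard error is itself estimated from very few points, so small differences between neighbouring blocks should be read cautiously.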