Lucie N Hutchins, Sean M Murphy, Priyam Singh, Joel H Graber.
Abstract
MOTIVATION: Cis-acting regulatory elements are frequently constrained by both sequence content and positioning relative to a functional site, such as a splice or polyadenylation site. We describe an approach to regulatory motif analysis based on non-negative matrix factorization (NMF). Whereas existing pattern recognition algorithms commonly focus primarily on sequence content, our method simultaneously characterizes both positioning and sequence content of putative motifs.
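As a rough illustration of the factorization named in the abstract, the sketch below implements NMF with the standard Lee and Seung multiplicative updates in plain Python. The function names (`nmf`, `rss`), the choice of update rule, the iteration count and the toy matrix are all assumptions made for illustration; the abstract does not say how the authors build their input matrix or which NMF solver they use.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def rss(v, w, h):
    """Residual sum of squares between V and the product W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, r, iterations=200, seed=0, eps=1e-9):
    """Approximate a non-negative matrix V (n x m) as W (n x r) times H (r x m).

    Uses multiplicative updates for the squared-error objective
    (an assumed, generic solver, not necessarily the one used in the paper).
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(r)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(r)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(r)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(r)]
             for i in range(n)]
    return w, h


# Toy matrix (hypothetical): rows could stand for positions, columns for k-mers.
toy = [[1, 1, 0, 0],
       [1, 1, 0, 0],
       [0, 0, 1, 1],
       [0, 0, 1, 1]]
w, h = nmf(toy, r=2)
print(rss(toy, w, h))
```

Under the assumed orientation (rows as positions, columns as k-mers), each column of `w` would describe where an element occurs and each row of `h` would describe its sequence content.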
Year: 2008 PMID: 18852176 PMCID: PMC2639279 DOI: 10.1093/bioinformatics/btn526
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Summary of the six motifs used in artificial sequence sets
| Motif | Consensus | Bits/base (a) | χ² (b) |
|---|---|---|---|
| 1 | CCCCCC | 0.64 | 2.9 |
| 2 | GGTGGG | 1.5 | 7.1 |
| 3 | AATAAA | 1.5 | 23.2 |
| 4 | ACAC | 0.82 | 51.1 |
| 5 | TGTGTG | 0.82 | 18.2 |
| 6 | GCGCGCGC | 0.82 | 1.9 |
(a) Bits/base is the summed information content across the motif, divided by the length of the motif.
(b) χ² is a measure of the divergence of the motif positioning from a uniform distribution.
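The two summary statistics in the table notes can be sketched as follows. This is a minimal illustration that assumes a uniform base background (so a position carries at most 2 bits), no small-sample correction, and a plain Pearson-style χ² over positional bins; the function names and inputs are hypothetical, and the paper's exact normalization is not given in this record.

```python
import math


def bits_per_base(pwm):
    """Mean information content (bits) per position of a DNA motif.

    pwm: list of dicts mapping 'A', 'C', 'G', 'T' to probabilities,
    one dict per motif position. Assumes a uniform background.
    """
    total = 0.0
    for column in pwm:
        entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
        total += 2.0 - entropy  # 2 bits is the maximum for four bases
    return total / len(pwm)


def chi_square_vs_uniform(counts):
    """Chi-square divergence of positional counts from a uniform distribution.

    counts: number of motif occurrences observed in each positional bin.
    """
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)
```

A fully conserved position contributes 2 bits and a position with equal base frequencies contributes 0, so a bits/base value lies between those limits under these assumptions.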
Fig. 1. Variation of the RSS with the number of elements (r) provides a robust estimate of the proper number of vectors. (a) Test matrices, with (black triangles) and without (green crosses) inserted patterns. Line plots represent a least squares linear fit of the last five points in each series. (b) RSS versus r plots for random sequences (open boxes), sequences with six inserted patterns (red squares) and human 3′-processing site flanking sequences (blue diamonds).
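One plausible reading of the Fig. 1 procedure is sketched below: fit a least-squares line to the last five points of the RSS versus r series, then report the smallest r at which the RSS has already settled onto that line. The caption does not state the decision rule, so the `tolerance` threshold, the function names and the RSS values in the usage example are assumptions invented for illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_rank(rs, rss_values, tail=5, tolerance=0.05):
    """Smallest r whose RSS lies within a relative tolerance of the tail line.

    The line is fitted to the last `tail` points of the series.
    """
    slope, intercept = fit_line(rs[-tail:], rss_values[-tail:])
    for r, value in zip(rs, rss_values):
        predicted = slope * r + intercept
        if abs(value - predicted) <= tolerance * abs(predicted):
            return r
    return rs[-1]


# Made-up RSS series: steep drop up to r = 6, then a shallow linear decline.
rs = list(range(1, 13))
rss_values = [100, 80, 62, 46, 32, 20, 19, 18, 17, 16, 15, 14]
print(estimate_rank(rs, rss_values))
```

Comparing against the tail fit rather than a fixed RSS cutoff keeps the estimate independent of the overall scale of the matrix.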
Fig. 2. An example comparing NMF analysis with the actual motifs used to generate the T = 300 test set. All sequence logos (Schneider and Stephens, 1990) in this and subsequent figures have a vertical scale of 2 bits and were generated with WebLogo (Crooks et al., 2004). In this and subsequent figures, the colors of line plots and logo frames are matched in corresponding pairs for each element. Patterns that are interpreted as changes in background composition are not shown as sequence logos.
Improving performance of NMF pattern detection with increase in number of test sequences (T)
| | Number of training sequences (T) | | | | | |
|---|---|---|---|---|---|---|
| Motif | 30 | 100 | 300 | 1000 | 3000 | 10 000 |
| Panel A: sequence content | | | | | | |
| 1 | 0.14 | 0.23 | 0.62 | 0.90 | 0.90 | 0.96 |
| 2 | 0.55 | 0.67 | 0.88 | 0.96 | 0.98 | 0.98 |
| 3 | 0.91 | 0.97 | 0.99 | 0.99 | 0.99 | 0.99 |
| 4 | 0.37 | 0.82 | 0.93 | 0.95 | 0.98 | 0.97 |
| 5 | 0.77 | 0.85 | 0.94 | 0.97 | 0.98 | 0.96 |
| 6 | 0.35 | 0.50 | 0.70 | 0.73 | 0.89 | 0.91 |
| Panel B: positioning | | | | | | |
| 1 | 0.75 | 0.67 | 0.93 | 0.97 | 0.99 | 0.97 |
| 2 | 0.60 | 0.96 | 0.98 | 0.98 | 0.97 | 0.98 |
| 3 | 0.76 | 0.89 | 0.91 | 0.92 | 0.91 | 0.89 |
| 4 | 0.31 | 0.71 | 0.80 | 0.85 | 0.88 | 0.86 |
| 5 | 0.64 | 0.90 | 0.94 | 0.97 | 0.97 | 0.96 |
| 6 | 0.60 | 0.77 | 0.95 | 0.95 | 0.93 | 0.94 |
Table entries list the best-match Pearson's correlations between (Panel A) the exact k-mer counts generated by insertion of the patterns and rows of the NMF matrix; and (Panel B) the positioning distribution used for pattern insertion and columns of the matrix.
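The best-match correlation described in the table note can be sketched as follows: compute Pearson's correlation between a known profile (the k-mer counts or positioning distribution used for insertion) and every candidate vector from the factorization, then keep the highest value. The function names and the way vectors are passed in are hypothetical; this record does not describe how the authors implemented the matching.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant vector has no defined correlation
    return cov / (sd_x * sd_y)


def best_match(reference, candidates):
    """Return (index, correlation) of the candidate most correlated with reference."""
    scores = [pearson(reference, c) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

For Panel A the candidates would be the vectors holding sequence content, and for Panel B the vectors holding positioning, with one `best_match` call per inserted motif.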
Comparison of NMF with other pattern finders on the six-motif, T = 300 sequence set
| NMF best match | Gibbs sampler | The improbizer | Weeder | Oligo analysis | YMF | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Motif | seq | pos | seq | pos | seq | pos | seq | consensus | consensus | ||||||
| 1 | 8 | 1.37 | 0.94 | 6 | 0.69 | 0.91 | 28 | 0.94 | 0.62 | 6 | CCCCCC | 6 | CCCCCC | ||
| 2 | 8 | 1.1 | 0.98 | 6 | 0.4 | 0.96 | 49 | 0.17 | 0.8 | 6 | 0.24 | 10 | TGGTGGGTAA | ||
| 3 | 8 | 0.3 | 0.95 | 8 | AATAAACA | ||||||||||
| 4 | 8 | 0.37 | 0.79 | 6 | CATACC | 6 | CACASA | ||||||||
| 5 | 8 | 0.86 | 0.94 | 5 | 1.09 | 0.83 | 8 | 0.48 | 0.81 | 8 | ATGTGCTA | 6 | CGYGYG | ||
| 6 | 8 | 2.86 | 0.94 | 6 | 0.73 | 0.95 | 18 | 0.57 | 0.89 | 8 | 1 | 7 | CGCGCGC | 6 | GCGCGC |
Fig. 3. NMF analysis of the sequence elements that specify 3′-processing sites for a variety of eukaryotic organisms reveals a common arrangement of signals. Plots are shown for mouse (M.musculus), fugu (T.rubripes), mosquito (A.gambiae) and rice (O.sativa). NMF patterns that are interpreted as changes in background composition are not shown as sequence logos.