Micah Hamady, Erin Peden, Rob Knight, Ravinder Singh.
Abstract
BACKGROUND: Many vital biological processes, including transcription and splicing, require a combination of short, degenerate sequence patterns, or motifs, adjacent to defined sequence features. Although these motifs occur frequently by chance, they only have biological meaning within a specific context. Identifying transcripts that contain meaningful combinations of patterns is thus an important problem, which existing tools address poorly.
Year: 2006 PMID: 16393334 PMCID: PMC1360682 DOI: 10.1186/1471-2105-7-1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Fast-FIND indexing phase. Summary of the strategies for generating and storing (a) bitvectors for 32-base windows and (b) compositions of 3–20 base windows for each sequence.
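The two kinds of index entry named in this caption can be sketched as follows. This is a minimal illustration and not the authors' implementation: the 2-bit base codes in `CODE`, the function names, and the in-memory dictionary are all assumptions, since the caption does not give the encoding or the storage layout.

```python
# Hypothetical 2-bit codes; the paper's own notation is given in its Figure 1a.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def window_bitvectors(seq, width=32):
    """Yield (offset, integer) for every full window of `width` bases.

    Each base takes 2 bits, so a 32-base window fits in 64 bits. The first
    base of the window ends up in the most significant bits.
    """
    mask = (1 << (2 * width)) - 1
    value = 0
    for i, base in enumerate(seq):
        # Rolling update: shift in the new base, drop the base that left.
        value = ((value << 2) | CODE[base]) & mask
        if i + 1 >= width:
            yield i + 1 - width, value


def window_compositions(seq, min_len=3, max_len=20):
    """Map (offset, length) -> (A, C, G, T) counts, computed from prefix sums."""
    prefix = [(0, 0, 0, 0)]
    for base in seq:
        a, c, g, t = prefix[-1]
        prefix.append((a + (base == "A"), c + (base == "C"),
                       g + (base == "G"), t + (base == "T")))
    table = {}
    for length in range(min_len, max_len + 1):
        for start in range(len(seq) - length + 1):
            begin, end = prefix[start], prefix[start + length]
            table[(start, length)] = tuple(e - b for b, e in zip(begin, end))
    return table
```

Prefix sums keep the composition pass linear in sequence length per window size; a real index would presumably write these values to disk rather than hold them in a dictionary.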
Figure 2. Searches for different types of patterns. Searching for sequence patterns based on (a) base composition, (b) exact matches, and (c) degenerate base patterns. (d) A simplified example of the bit masking approach for 3-base patterns and 12-base windows. Calculate the integer X1 for the string S1 as the sum of 2^(bit-pos), where bit-pos refers to bit positions (0 to 23) for each bit set to 1. Each bitvector is followed by its corresponding decimal value (in parentheses). Similarly, calculate integer values for the overlapping string S2 and for the upper (U1 and U2) and lower (L1 and L2) bounds for two search patterns (TAT and ATC). The bit patterns for windows S1 and S2 are shown using the notation for bases in Figure 1a. The bit patterns for the search patterns, TAT and ATC, are indicated by an underline, and the remaining positions are masked with a value of either 0 or 1 for the lower and upper integer limits (as shown for S1), respectively. X1 is between L1 and U1, but not between L2 and U2. Similarly, X2 is between L2 and U2, but not between L1 and U1. This example demonstrates that S1 begins with TAT but not ATC, and S2 begins with ATC but not TAT.
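The range test in panel (d) can be illustrated with a short sketch. The 2-bit codes, the function names, and the 13-base example sequence are assumptions made for illustration; the figure's actual base notation and example strings are not reproduced in this text. The sketch places the first base in the most significant bits, so every window that begins with a given pattern maps to one contiguous integer range.

```python
# Hypothetical 2-bit codes; the paper's own notation is given in its Figure 1a.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode(seq):
    """Pack a DNA string into an integer, first base in the highest bits."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def prefix_bounds(pattern, window_len):
    """Return (lower, upper) integers spanning all windows that begin with `pattern`."""
    free_bits = 2 * (window_len - len(pattern))
    lower = encode(pattern) << free_bits        # masked positions set to 0
    upper = lower | ((1 << free_bits) - 1)      # masked positions set to 1
    return lower, upper


def begins_with(window, pattern):
    """True if the window's integer lies between the pattern's bounds."""
    lower, upper = prefix_bounds(pattern, len(window))
    return lower <= encode(window) <= upper


seq = "TATCGGATACCAG"            # hypothetical sequence, not the figure's
s1, s2 = seq[0:12], seq[1:13]    # two overlapping 12-base windows

print(begins_with(s1, "TAT"), begins_with(s1, "ATC"))
print(begins_with(s2, "ATC"), begins_with(s2, "TAT"))
```

Because the comparison is a plain integer range check, it can be evaluated against stored window values without decoding them back into bases.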
Figure 3. Identifying regions of interest through cDNA/EST matches. The region of interest located between 100 nucleotides upstream of the 3' end of the first EST set through the 3' end of the second EST set was indexed.
Candidates with desired binding sites adjacent to alternative polyadenylation sites. Identification of cDNAs with potential alternative 3' ends and various patterns – base composition, degenerate, and combinatorial patterns – located between 100 nucleotides upstream of the 3' end of the first EST set through the 3' end of the second EST set. # and ** cDNAs were used as examples for the alignment shown in Figure 4.
| Number of cDNAs with potential alternative 3' ends and search patterns | ||
| Search pattern | Number of Patterns | Number of cDNAs |
| A. Base composition | ||
| CstF64; U ≥ 4, G ≤ 4, A+C = 0; length = 8 | 163 | 276 |
| SXL; U ≥ 15, G ≤ 2, A+C = 0; length = 17 | 154 | 5 |
| SXL1; U ≥ 8, G+A+C = 0; length = 8 | 1 | 27 |
| SXL2; U ≥ 10, G ≤ 2, A+C = 0; length = 12 | 79 | 25 |
| B. Degenerate motifs | ||
| hnRNP F/H/H' (core); GGGA | 1 | 232 |
| hnRNP F/H/H'; GGGGA | 1 | 78 |
| Rbp1; DCADCUUA | 9 | 47 |
| PSI; RCYYCUURYRC | 12 | 8 |
| Rbp9; UUUNUUUU | 4 | 111 |
| C. Combinatorial motifs | ||
| CstF64 + SXL | 25,102 | 5# |
| CstF64 + hnRNP F/H/H' (core) | 163 | 178* |
| CstF64 + hnRNP F/H/H' | 163 | 59** |
| SXL + hnRNP F/H/H' (core) | 154 | 4*** |
| PSI + hnRNP F/H/H' (core) | 12 | 8*** |
# Since both SXL and CstF64 sites are GU rich, these motifs are not expected to be statistically independent. However, all three Monte Carlo analyses showed that the association was significant (P < 0.001) even when accounting for composition, indicating that SXL sites are more likely to also be CstF64 sites than chance predicts.
* and ** Associations are statistically significant by the G test (* G = 69.8, P = 3.3 × 10^-17, df = 1; ** G = 11.6, P = 0.00033, df = 1). However, these associations were not significant in the Monte Carlo analyses.
*** Associations not individually significant by the G test, but significant (P < 0.01) in all three Monte Carlo tests.
Associations of various other combinations of SXL, Rbp1, PSI, and Rbp9 motifs in cDNAs are not statistically significant.
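The "Number of Patterns" column can be partly reproduced by enumerating the concrete strings each search specification allows. The sketch below is an illustration only: the function names are hypothetical, the degenerate symbols are read as standard IUPAC nucleotide codes, and only the rows used in the example were checked against the table.

```python
from itertools import combinations, product

# Standard IUPAC codes for the symbols used in the rows checked here.
IUPAC = {"A": "A", "C": "C", "G": "G", "U": "U",
         "D": "AGU", "N": "ACGU"}


def composition_patterns(length, min_u, max_g):
    """All strings of U and G (A+C = 0) with at least min_u U's and at most max_g G's."""
    patterns = []
    for g_count in range(max_g + 1):
        if length - g_count < min_u:
            break
        for g_positions in combinations(range(length), g_count):
            chars = ["U"] * length
            for pos in g_positions:
                chars[pos] = "G"
            patterns.append("".join(chars))
    return patterns


def expand_degenerate(motif):
    """All concrete sequences matched by a degenerate motif."""
    return ["".join(p) for p in product(*(IUPAC[ch] for ch in motif))]


cstf64 = composition_patterns(length=8, min_u=4, max_g=4)
sxl = composition_patterns(length=17, min_u=15, max_g=2)
rbp1 = expand_degenerate("DCADCUUA")

print(len(cstf64), len(sxl), len(rbp1))   # 163 154 9
print(len(cstf64) * len(sxl))             # 25102
```

The product of the CstF64 and SXL counts equals the 25,102 listed for the CstF64 + SXL row, which is consistent with combinatorial searches pairing every pattern from one set with every pattern from the other.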
Figure 4. A schematic of potential candidates for alternative polyadenylation. Arrowheads show 3' ends, asterisks show the consensus polyadenylation signal, and potential SXL, hnRNP H/H'/F, and CstF64 sites are indicated.