Eric S Ho, Christopher D Jakubowski, Samuel I Gunderson.
Abstract
BACKGROUND: With the advent of high-throughput sequencing techniques, large amounts of sequencing data are readily available for analysis. Natural biological signals are intrinsically highly variable, making their complete identification a computationally challenging problem. Many statistical and combinatorial approaches have been applied with great success in the past. However, identifying highly degenerate and long (>20 nucleotides) motifs remains an unmet challenge, because high degeneracy diminishes the statistical significance of biological signals and increasing motif size causes combinatorial explosion. In this report, we present a novel rule-based method focused on finding degenerate and long motifs. Our proposed method, named iTriplet, avoids the costly enumeration present in existing combinatorial methods and is amenable to parallel processing.
Year: 2009 PMID: 19874606 PMCID: PMC2784457 DOI: 10.1186/1748-7188-4-14
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Figure 1. Inter-sequence algorithm. (A) For each lmer r1 in R1, identify 2d-mutants in sequences R2, S1, S2, ... The rectangular box represents the 2d-mutant of r1. The dotted-line triangle represents a triplet. (B) Hash table that keeps track of the span of each putative motif. The hash table consists of two parts: key and value. Here the key is the putative motif and the value is a list of unique sequence IDs. Putative motifs are produced by the Triplet algorithm; they are the motifs common to a triplet.
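The bookkeeping described in the Figure 1 caption can be illustrated with a minimal Python sketch. It assumes "2d-mutant" means an lmer within Hamming distance 2d of r1, and it uses a dictionary of sets as the hash table. The function names (`hamming`, `two_d_mutants`, `record_span`) are hypothetical and are not taken from the iTriplet implementation.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def lmers(seq, l):
    """All overlapping substrings of length l in seq."""
    return [seq[i:i + l] for i in range(len(seq) - l + 1)]


def two_d_mutants(r1, seq, d):
    """lmers of seq that differ from r1 at no more than 2d positions (assumed definition)."""
    return [m for m in lmers(seq, len(r1)) if hamming(r1, m) <= 2 * d]


def record_span(table, motif, seq_ids):
    """Hash table update: key = putative motif, value = unique sequence IDs."""
    table[motif].update(seq_ids)


span = defaultdict(set)
record_span(span, "ACGT", ["R1", "S1"])
record_span(span, "ACGT", ["S1", "S2"])  # duplicate ID "S1" is stored once
```

A set is used for the value so that sequence IDs stay unique without extra checks; the span of a putative motif is then simply `len(span[motif])`.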
Figure 2. Intuition of the Triplet algorithm. (A) A triplet consists of the 12-mers l1, l2 and l3. The pairs l1 and l2, l1 and l3, and l2 and l3 contain 4, 6 and 5 differences respectively, as labeled on the lines connecting them. Each 12-mer is the center of an imaginary circle that denotes the set of neighboring 12-mers differing from the center 12-mer at no more than 3 positions. In other words, each circle represents the set of putative motifs that could generate the center 12-mer. Note that we do not actually generate the set of putative motifs. The centroid lmer is denoted by a diamond-shaped dot. The goal of the algorithm is to uncover all members of the intersection (dark gray) of the three sets. (B) Centroid lmer construction. Three column patterns are shown: the same nucleotide in all three 12-mers, Pi (solid-line vertical boxes in positions 1, 5, 6, 8 and 10); all different nucleotides across the three 12-mers, Pnc (vertical box with dashed boundary in position 11); and two out of three 12-mers having the same nucleotide, Pmn (dotted-line vertical boxes in positions 2, 3, 4, 7, 9 and 12). The centroid lmer is constructed in stage 1 of the Triplet algorithm described in the text. The number of identical positions between the centroid lmer and each of l1, l2 and l3 is represented by the score vector, and the selection of nucleotides is encoded in the move vector. (C) Structure of the move vector. (D) Exploratory scheme discovery from stage 2 of the Triplet algorithm. The centroid lmer constructed in Figure 2B is modified by the composite operation of sac(P12) and nc(3,1) to create three extra motifs in its neighborhood. (E) Example of applying rule 13 to create a new move vector in (D).
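The column patterns and score vector from Figure 2B can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' stage 1 procedure: the centroid takes the shared nucleotide in Pi columns and the majority nucleotide in Pmn columns, and for Pnc columns it takes the nucleotide from one chosen lmer via a hypothetical `nc_source` parameter, because the caption does not say how that choice is made. All function names are hypothetical.

```python
def classify_column(a, b, c):
    """Label a triplet column as Pi, P12, P13, P23 or Pnc."""
    if a == b == c:
        return "Pi"   # identical in all three lmers
    if a == b:
        return "P12"  # l1 and l2 agree, l3 is the odd one
    if a == c:
        return "P13"
    if b == c:
        return "P23"
    return "Pnc"      # all three nucleotides differ


def centroid(l1, l2, l3, nc_source=0):
    """Build a centroid lmer column by column (assumed rule, see lead-in)."""
    out = []
    for col in zip(l1, l2, l3):
        kind = classify_column(*col)
        if kind == "Pnc":
            out.append(col[nc_source])  # assumption: copy from one chosen lmer
        elif kind == "P23":
            out.append(col[1])          # majority nucleotide shared by l2 and l3
        else:
            out.append(col[0])          # Pi, P12, P13: l1 carries the majority
    return "".join(out)


def score_vector(center, triplet):
    """Number of identical positions between center and each lmer of the triplet."""
    return tuple(sum(x == y for x, y in zip(center, m)) for m in triplet)


triplet = ("ACGT", "ACCA", "AGCG")  # toy 4-mers, not the 12-mers of Figure 2
c = centroid(*triplet)              # "ACCT"
scores = score_vector(c, triplet)   # (3, 3, 2)
```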
Five basic operations for triplet processing of iTriplet algorithm
| Operations | Description | Examples based on Figure 2D if possible |
|---|---|---|
| sac(Pmn) | Instead of choosing the dominant nucleotide from Pmn column, choose the odd nucleotide. | sac(P12), take 'G' at position 3 from |
| compl(Pmn) | Instead of choosing the dominant or odd nucleotide from Pmn column, choose nucleotides complementary to them. | Apply on the 2nd column, compl(P23), take nucleotides complementary to 'G' and 'T', i.e. choose 'A' or 'C' for position 2. |
| nc(i, j) | Instead of taking nucleotide from | Apply nc(3,1) to position 11. Instead of choose 'A' from |
| nc(i,0) | Instead of taking nucleotide from | Apply nc(3,0) to position 11. Instead of choose 'A' from |
| sac_i(Pi) | Instead of keeping the nucleotide identical to all lmers in the triplet, take the three complementary nucleotides. | Apply sac_i(Pi) to position 1. Take 'A', 'G' or 'T' instead of 'C' at position 1. |
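
Three of the operations in the table above can be expressed as small functions on a single triplet column. The sketch below covers only sac(Pmn), compl(Pmn) and sac_i(Pi), whose descriptions are complete in the table; it reads "complementary" as "the nucleotides not present in the column", which is what the table's examples suggest. The Python names are hypothetical.

```python
from collections import Counter

NUCLEOTIDES = set("ACGT")


def sac(column):
    """sac(Pmn): return the odd nucleotide of a two-versus-one column."""
    counts = Counter(column)
    if sorted(counts.values()) != [1, 2]:
        raise ValueError("sac expects a Pmn column (two identical, one odd)")
    return min(counts, key=counts.get)


def compl(column):
    """compl(Pmn): nucleotides other than the dominant and the odd one."""
    if len(set(column)) != 2:
        raise ValueError("compl expects a Pmn column")
    return sorted(NUCLEOTIDES - set(column))


def sac_i(column):
    """sac_i(Pi): the three nucleotides other than the shared one."""
    if len(set(column)) != 1:
        raise ValueError("sac_i expects a Pi column")
    return sorted(NUCLEOTIDES - set(column))


odd = sac(("G", "G", "T"))       # "T"
others = compl(("G", "T", "T"))  # ["A", "C"], as in the compl(P23) example
alts = sac_i(("C", "C", "C"))    # ["A", "G", "T"], as in the sac_i(Pi) example
```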
Methods comparison on simulated datasets.
| Models | Neighborhood Probability | MotifEnumerator | RISOTTO | PMSprune | iTriplet | iTriplet (parallel) |
|---|---|---|---|---|---|---|
| 11,2 | 0.7% | 6 s | 2.2 s | 1 s | 2 s | 1 s |
| 12,3 | 5.4% | 1 m | 40 s | 4 s | 33 s | 18 s |
| 13,3 | 2.4% | 2 m | 33 s | 2 s | 6 s | 4 s |
| 14,4 | 11% | - (a) | 8 m | 1 m | 3 m | 2 m |
| 15,4 | 5.6% | - | 6 m | 16 s | 36 s | 19 s |
| 16,5 | 19% | - | 82 m | 13.5 m | 26 m | 13 m |
| 18,6 | 28% | - | - (b) | - (b) | 3 h | 1.5 h |
| 19,6 | 18% | - | - | - | 27 m | 14 m |
| 24,8 | 23% | - | - | - | 4 h | 2 h |
| 28,8 | 3% | - | - | - | 19 s | 10 s |
| 30,9 | 5% | - | - | - | 2.3 m | 1.5 m |
| 38,12 | 7% | - | - | - | 1 h | 33 m |
| 40,12 | 3% | - | - | - | 5 m | 4 m |
Neighborhood probability is the probability that two lmers differ at no more than 2d positions. The formula used to calculate it is stated in Additional file 1. Time is measured in seconds (s), minutes (m) or hours (h). (a) MotifEnumerator ran out of memory for l greater than 13. (b) The program took more than 6 hours on model <18,6> or longer. For the parallel version of iTriplet, the reported runtime is the longest elapsed time required for all nodes to finish.
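The exact formula is given in Additional file 1, which is not reproduced here. As a hedged illustration, the sketch below assumes two random lmers with independent, uniformly distributed nucleotides, so each position mismatches with probability 3/4 and the number of mismatches is binomial. The function name and the `p_mismatch` parameter are assumptions.

```python
from math import comb


def neighborhood_probability(l, d, p_mismatch=0.75):
    """P(two random lmers differ at <= 2d positions), assuming i.i.d. uniform nucleotides."""
    p_match = 1.0 - p_mismatch
    return sum(
        comb(l, k) * p_mismatch**k * p_match**(l - k)
        for k in range(2 * d + 1)
    )


p_12_3 = neighborhood_probability(12, 3)
p_13_3 = neighborhood_probability(13, 3)
```

Under this assumption the sketch gives roughly 5.4% for model <12,3> and 2.4% for model <13,3>, in line with the values in the table.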
iTriplet prediction using real biological sequences.
| Preproinsulin (IEB1) promoter+5' UTR | Motif | Remarks |
|---|---|---|
| iTriplet | GTYYGGAAAYTGCAGC | <25,2> model |
| PMSprune | CAGC | Ref. [ |
| MITRA | C | Ref. [ |
| Published | CTCAGCCCCCAGCCATCTGCCGACCCCCCC | Transfac ID: R04457 |
| iTriplet | | <15,3> model |
| PMSprune | | Ref. [ |
| MITRA | TGCA | Ref. [ |
| Published | ATTTCGCGCCAAA | Transfac ID: R01928 |
| iTriplet | TTT | <15,1> model |
| PMSprune | CTC | Ref. [ |
| MITRA | | Ref. [ |
| Published | TGCGCCCGG | Transfac ID: R08298 |
| iTriplet | | <20,1> model |
| PMSprune | | Ref. [ |
| MITRA | | Ref. [ |
| Published | CAGGATGTCCATATTAGGACATC | Transfac ID: R00466 |
| AU-rich (ARE) | WWTTATTTATTWW | <14,3> model |
| Cytoplasmic Polyadenylation element (CPE) | TTTTAT and TTTTAAT | <6,1> model |
| Pumillio binding element (PBE) | TGTAAATA | <8,1> model |
Motifs predicted by iTriplet are presented as consensus sequences. Bold and underlined sequences represent correctly predicted nucleotides. Transfac IDs are obtained from the TRANSFAC database [37].
Prediction accuracy of iTriplet versus four other motif-finding methods.
| Algorithms | nPC | nSn | nSp | nF | sPC | sSn | sSp | sF | mSr | sSr |
|---|---|---|---|---|---|---|---|---|---|---|
| iTriplet | 0.195 | 0.292 | 0.322 | 0.286 | 0.319 | 0.489 | 0.418 | 0.422 | 0.853 | 0.591 |
| MEME | 0.180 | 0.551 | 0.214 | 0.296 | 0.258 | 0.733 | 0.280 | 0.397 | 1.000 | 0.817 |
| WEEDER | 0.128 | 0.274 | 0.245 | 0.208 | 0.263 | 0.538 | 0.332 | 0.367 | 0.833 | 0.532 |
| BioProspector | 0.102 | 0.372 | 0.129 | 0.179 | 0.212 | 0.704 | 0.224 | 0.328 | 0.986 | 0.670 |
| MotifSampler | 0.052 | 0.257 | 0.068 | 0.091 | 0.106 | 0.422 | 0.111 | 0.162 | 0.461 | 0.392 |
PC, Sn, Sp and F are performance coefficient, sensitivity, specificity and F-measure level respectively. Prefixes 'n' and 's' represent nucleotide or binding site level measurements respectively. mSr and sSr are motif and sequence level accuracy respectively.
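The note above names the measures but does not define them. The sketch below uses the definitions common in motif-finder benchmarks and should be read as an assumption rather than the paper's stated formulas. In particular, Sp is taken here to be TP/(TP+FP) (positive predictive value), which fits tabulated values well below 1; a true-negative-based specificity would normally sit close to 1. The same function applies at nucleotide or binding-site level depending on what is counted.

```python
def accuracy_metrics(tp, fp, fn):
    """Sn, Sp, PC and F from true positives, false positives and false negatives.

    Assumed definitions (not confirmed by the source):
      Sn = TP / (TP + FN)
      Sp = TP / (TP + FP)
      PC = TP / (TP + FP + FN)
      F  = 2 * Sn * Sp / (Sn + Sp)
    """
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tp / (tp + fp) if (tp + fp) else 0.0
    pc = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    f = 2 * sn * sp / (sn + sp) if (sn + sp) else 0.0
    return {"PC": pc, "Sn": sn, "Sp": sp, "F": f}


# Toy counts for illustration only; these are not data from the paper.
m = accuracy_metrics(tp=6, fp=2, fn=4)
```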
Figure 3. Confirmation of predicted polyA downstream elements by dual Luciferase reporter system. (A) pRL-GAPDHwt was made from a standard pRL-SV40 Renilla expression plasmid by replacing the SV40-derived 3'UTR and polyA signal sequences with the human GAPDH 3'UTR (NM_002046) and 116 nt past the PAS. pRL-GAPDHmt matches pRL-GAPDHwt except that Motif A is mutated as shown. Plasmids were transfected into HeLa cells and Luciferase activity was measured 24 hours later. Values for Renilla Luciferase were normalized to those obtained from a co-transfected Firefly Luciferase plasmid. The pRL-GAPDHwt plasmid expresses 2.2-fold more Renilla than the pRL-GAPDHmt plasmid; thus Motif A enhances expression 2.2-fold. (B) pRL-RAFwt (NM_002880) was made like pRL-GAPDHwt but from the human RAF gene sequences as indicated. pRL-RAFmt matches pRL-RAFwt except that Motif A is mutated as shown. These plasmids were transfected and analyzed as in panel A. (C) pRL-U1Awt (NM_004596) was made like pRL-GAPDHwt but from the human U1A gene sequences as indicated. pRL-U1Amt matches pRL-U1Awt except that Motif A is mutated as shown. These plasmids were transfected and analyzed as in panel A.