Guido H Jajamovich, Xiaodong Wang, Adam P Arkin, Michael S Samoilov.
Abstract
Finding conserved motifs in genomic sequences is one of the essential problems in bioinformatics. However, achieving high discovery performance without imposing substantial auxiliary constraints on possible motif features remains a key algorithmic challenge. This work describes BAMBI, a sequential Monte Carlo motif-identification algorithm that is based on a position weight matrix model, does not require additional constraints, and is able to estimate such motif properties as length, logo, number of instances and their locations solely on the basis of primary nucleotide sequence data. Furthermore, should biologically meaningful information about motif attributes be available, BAMBI takes advantage of this knowledge to further refine the discovery results. In practical applications, we show that the proposed approach can be used to find sites of such diverse DNA-binding molecules as the cAMP receptor protein (CRP) and Din-family site-specific serine recombinases. Results obtained by BAMBI in these and other settings demonstrate better statistical performance, as measured by the nucleotide-level correlation coefficient, than any of four widely used profile-based motif discovery methods: MEME, BioProspector with BioOptimizer, SeSiMCMC and Motif Sampler. Additionally, in the case of Din-family recombinase target site discovery, the BAMBI-inferred motif is found to be the only one that is functionally accurate from the standpoint of the underlying biochemical mechanism. C++ and Matlab code is available at http://www.ee.columbia.edu/~guido/BAMBI or http://genomics.lbl.gov/BAMBI/.
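As a concrete illustration of the position weight matrix (PWM) model that the abstract refers to, the following minimal Python sketch builds a PWM from aligned sites and scans a sequence for its best-scoring window. It is not BAMBI's sequential Monte Carlo procedure; the function names, the pseudocount, the uniform background and the toy sequences are all assumptions made for illustration.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0):
    """Estimate per-position base probabilities from equal-length aligned sites."""
    width = len(sites[0])
    pwm = []
    for pos in range(width):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({base: counts[base] / total for base in BASES})
    return pwm

def log_odds(window, pwm, background=0.25):
    """Log2-odds of a window under the PWM versus a uniform background (assumed)."""
    return sum(math.log2(pwm[i][base] / background) for i, base in enumerate(window))

def best_hit(sequence, pwm):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pwm)
    return max(
        (log_odds(sequence[start:start + width], pwm), start)
        for start in range(len(sequence) - width + 1)
    )

# Hypothetical toy data, not taken from the paper.
toy_sites = ["TGTGA", "TGTGA", "TGAGA", "TGTCA"]
toy_pwm = build_pwm(toy_sites)
score, start = best_hit("ACGTTGTGAACGT", toy_pwm)
print(f"best window starts at {start} with log-odds {score:.2f}")
```

In this sketch the motif width is fixed by the input sites; according to the abstract, BAMBI instead estimates the length, the number of instances and their locations from the sequence data.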
Year: 2011 PMID: 21948794 PMCID: PMC3241671 DOI: 10.1093/nar/gkr745
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Performance comparison of different methods using synthetic data with varied motif length.
Statistics of the recombinase database
| Statistic | Value |
|---|---|
| Number of sequences | 10 |
| Shortest sequence (nucleotides) | 546 |
| Longest sequence (nucleotides) | 4335 |
| Average sequence length (nucleotides) | 2436.4 |
| Total data set size (nucleotides) | 24364 |
Figure 2. Length PDF estimated by BAMBI for the CRP binding site motif.
Figure 3. Logos of the CRP binding site motif. Empirical (‘True’) versus those inferred by the different algorithms. (A) True motif logo, (B) BAMBI's motif logo, (C) MEME's motif logo, (D) BioProspector's motif logo, (E) SeSiMCMC's motif logo and (F) Motif Sampler's motif logo.
Performance comparison using the CRP database
| BAMBI | MEME | BioProspector (+BioOptimizer) | SeSiMCMC | Motif Sampler |
|---|---|---|---|---|
| 21 | 24 | 24 | 19 | – |
| 0.6763 | 0.5358 | 0.5745 | 0.63633 | 0.5590 |
The value of M was found to be 22 empirically.
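The abstract states that the methods are compared by the nucleotide-level correlation coefficient (nCC). The sketch below computes it from per-nucleotide site labels, assuming the standard definition used in motif-finder benchmarks: nCC = (TP·TN − FN·FP) / sqrt((TP+FN)(TN+FP)(TP+FP)(TN+FN)). The helper names, the choice to return 0.0 when the denominator vanishes, and the toy coordinates are illustrative assumptions, not details from the paper.

```python
import math

def mask_from_sites(seq_len, sites):
    """Mark every nucleotide covered by a (start, width) site as True."""
    mask = [False] * seq_len
    for start, width in sites:
        for i in range(start, start + width):
            mask[i] = True
    return mask

def nucleotide_cc(true_mask, pred_mask):
    """Nucleotide-level correlation coefficient between true and predicted masks."""
    if len(true_mask) != len(pred_mask):
        raise ValueError("masks must have the same length")
    tp = sum(t and p for t, p in zip(true_mask, pred_mask))
    fp = sum((not t) and p for t, p in zip(true_mask, pred_mask))
    fn = sum(t and (not p) for t, p in zip(true_mask, pred_mask))
    tn = len(true_mask) - tp - fp - fn
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # Returning 0.0 for an undefined coefficient is a convention assumed here.
    return (tp * tn - fn * fp) / denom if denom else 0.0

# Hypothetical example: a 6-nt site at position 5, predicted 2 nt too far right.
truth = mask_from_sites(20, [(5, 6)])
shifted = mask_from_sites(20, [(7, 6)])
print(round(nucleotide_cc(truth, truth), 4))
print(round(nucleotide_cc(truth, shifted), 4))
```

A value of 1 corresponds to perfect nucleotide-level agreement, while values near zero or below indicate predictions that overlap the true sites no better than chance.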
Target sites of Din-family recombinases
| dix (consensus) | TTC———AAAC– | –A | –GTTT———GAA |
|---|---|---|---|
| hixL | TTCTTGAAAACC | | GGTTTTTGATAA |
| hixR | TTTTCCTTTTGG | | GGTTTTTGATAA |
| gixL | TTCCTGTAAACC | | GGTTTTGGATAA |
| gixR | TTCCTGTAAACC | | GGTTTTGGATAA |
| cixL | TTCTCTTAAACC | | GGTTTAGGATTG |
| cixR | TTCTCTTAAACC | | GGTATTGGATAA |
| pixL | TTCTCCCAAACC | | GGTTTTCGAGAG |
| pixR | TTCTCCCAAACC | | CGTTTATGAAAA |
| mixMI′′L′ | TTCCCCCAAACC | | CGTTTTAGTCTT |
| mixMr′′N′ | TTCCCCTAAACC | | CGTTTTTATGCC |
| mixN′′O′ | TTCCCCCAAACC | | CGTTTTTATGTG |
| mixO′′P′ | TTCCCCTAAACC | | CGTTTTTATGCC |
| mixP′′Q′ | TTCCCCTAAACC | | CGTTTTTATGCC |
| mixQ′′R′ | TTCCCCCAAACC | | GGTAATCAAGAA |
| nix1 | TTTCCCAGAAGC | | CCTTAAGTAAAA |
| nix2 | TTTCGCAGAAGC | | CCTTACGTCAAA |
| nix3 | AGACGAAGAAGC | | CCTTAAGTCAAA |
| nix4 | TTTCCCAGAAGC | | CCTTAAGTCAAA |
| bixL | TTCCTGTAAACC | | GGTATTCGATAA |
| bixR | TTCCTGTAAACC | | GGTTTTAGATAA |
Recombination sites for Din subfamily members: Hin (hixL and hixR), Gin (gixL and gixR), Cin (cixL and cixR), Pin (pixL and pixR), Min [mixMI′′L′, mixMr′′N′, mixN′′O′, mixO′′P′, mixP′′Q′ and mixQ′′R′—labeled according to the convention used in (32)], D. nodosus [nix1, nix2, nix3 and nix4—with sequences taken from the updated GenBank record rather than as specified in Moses et al. (31)], and PinB (bixL and bixR) (29,31–34). Din palindromic consensus binding site (dix) is as discussed in (35). The two core residues at the centers of the sites where strand breakage and exchange occur are highlighted in bold.
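Because the Din consensus site is described as palindromic, one simple way to inspect an individual site is to compare the reverse complement of its left flank with its right flank. The sketch below does this for a few of the 12-nucleotide flanking segments listed in the table; the function names and the match-count measure are illustrative assumptions and are not part of the paper's analysis.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]

def symmetry_matches(left, right):
    """Count positions where the mirrored left flank agrees with the right flank."""
    mirrored = reverse_complement(left)
    return sum(a == b for a, b in zip(mirrored, right))

# Flanking segments copied from the table of Din-family target sites.
HALF_SITES = {
    "hixL": ("TTCTTGAAAACC", "GGTTTTTGATAA"),
    "gixL": ("TTCCTGTAAACC", "GGTTTTGGATAA"),
    "nix3": ("AGACGAAGAAGC", "CCTTAAGTCAAA"),
}

for name, (left, right) in HALF_SITES.items():
    print(name, symmetry_matches(left, right), "of", len(left))
```

A perfectly palindromic site would match at every position; natural sites typically match only in part, which is one reason a consensus is reported separately.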
Database of recombination sites
| GenBank accession number | Start sequence | End sequence | Recombination sites |
|---|---|---|---|
| FN424405 | 2907699 | 2908805 | hixL, hixR |
| AF083977 | 31913 | 35084 | gixL, gixR |
| NC_005856 | 32206 | 36541 | cixL, cixR |
| X01805 | 21 | 1929 | pixL, pixR |
| X62121 | 2743 | 4447 | mixR′M1′′, mixMr′′N′ |
| X62121 | 4848 | 5465 | mixN′′O′, mixO′′P′ |
| X62121 | 5868 | 6414 | mixP′′Q′, mixQ′′L′ |
| U02462 | 182 | 4049 | nix1, nix2 |
| U02462 | 4489 | 8411 | nix3, nix4 |
| D00660 | 600 | 3788 | bixL, bixR |
Sequence start and end labels are given by the nucleotide number in the corresponding GenBank record.
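The summary statistics reported for the recombinase database can be reproduced from these coordinates. The short Python check below takes each region length as end minus start, which is the convention under which the figures match the reported values (an inclusive count would add one nucleotide per region). The variable names are illustrative.

```python
from statistics import mean

# (GenBank accession, start, end) copied from the table of recombination sites.
REGIONS = [
    ("FN424405", 2907699, 2908805),
    ("AF083977", 31913, 35084),
    ("NC_005856", 32206, 36541),
    ("X01805", 21, 1929),
    ("X62121", 2743, 4447),
    ("X62121", 4848, 5465),
    ("X62121", 5868, 6414),
    ("U02462", 182, 4049),
    ("U02462", 4489, 8411),
    ("D00660", 600, 3788),
]

# Length convention assumed here: end - start.
lengths = [end - start for _, start, end in REGIONS]

summary = {
    "number_of_sequences": len(lengths),
    "shortest": min(lengths),
    "longest": max(lengths),
    "average": round(mean(lengths), 1),
    "total": sum(lengths),
}
print(summary)
```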
Figure 4. Logos of the Din recombinase binding site motif. Empirical (‘True’) versus those inferred by the different algorithms. (A) True motif logo, (B) BAMBI's motif logo, (C) MEME's motif logo, (D) BioProspector's motif logo, (E) SeSiMCMC's motif logo and (F) Motif Sampler's motif logo.
Performance comparison using the recombinase database
| BAMBI | MEME | BioProspector | SeSiMCMC | Motif Sampler |
|---|---|---|---|---|
| 0.7711 | 0.7618 | 0.7618 | −0.0153 | −0.0182 |
MEME, BioProspector, SeSiMCMC and Motif Sampler did not produce a functionally correct site.