
Comparative analysis of regulatory motif discovery tools for transcription factor binding sites.

Wei Wei, Xiao-Dan Yu.

Abstract

In the post-genomic era, identification of specific regulatory motifs or transcription factor binding sites (TFBSs) in non-coding DNA sequences, which is essential for elucidating transcriptional regulatory networks, has emerged as an obstacle that frustrates many researchers. Consequently, numerous motif discovery tools and related databases have been applied to this problem. However, these methods are based on different computational algorithms and show diverse motif prediction efficiency in non-coding DNA sequences. Understanding the similarities and differences among the algorithms, and enriching the motif discovery literature, is therefore important for users who need to choose the most appropriate of the tools available online. Moreover, there is still no credible criterion for assessing motif discovery tools, nor guidance for researchers on choosing the best tool for their own projects. Integration of the related resources might thus be a good approach to improving the accuracy of their application. Recent studies integrate regulatory motif discovery tools with experimental methods to offer a complementary approach for researchers, and also provide a much-needed model for current research on transcriptional regulatory networks. Here we present a comparative analysis of regulatory motif discovery tools for TFBSs.

Year:  2007        PMID: 17893078      PMCID: PMC5054109          DOI: 10.1016/S1672-0229(07)60023-0

Source DB:  PubMed          Journal:  Genomics Proteomics Bioinformatics        ISSN: 1672-0229            Impact factor:   7.691


Introduction

Biological processes in prokaryotic and eukaryotic organisms are guided by genomic information in coding and non-coding DNA sequences. Both kinds of sequences coordinate the construction of transcriptional regulatory networks that drive gene expression with temporal and spatial variation. Compared with the pre-genomic era, which concentrated on deciphering coding DNA sequences and completed the blueprint of the human genome, the post-genomic era puts more emphasis on mining the information hidden in non-coding DNA sequences. The identification of specific motifs or transcription factor binding sites (TFBSs) has become one of the key steps in this task. Interaction between transcription factors (TFs) and non-coding DNA sequences is a prerequisite for transcription initiation of genes. The function of TFs is to recognize short conserved regions in non-coding DNA sequences, which are called motifs or TFBSs. However, experimental methods alone are not sufficient to find motifs or TFBSs in non-coding DNA sequences. For example, systematic evolution of ligands by exponential enrichment (SELEX), serial analysis of gene expression (SAGE), and DNA microarray are only for transcript profiling in vitro [1, 2]. Chromatin immunoprecipitation (ChIP) can be combined with DNA microarray, namely ChIP-on-chip, to identify protein-DNA interaction in vivo, but it is limited by antibody performance and availability. For this reason, a wide range of motif discovery tools and databases have been applied to motif or TFBS prediction in biological studies. Unfortunately, 99.9% of their predictions are reported to have no function in vivo, an observation known as the futility theorem. Motifs or TFBSs are usually represented in databases as consensus IUPAC strings, position frequency matrices (PFMs), position weight matrices (PWMs), or position specific scoring matrices (PSSMs).
Commonly, motifs or TFBSs in non-coding DNA sequences are conserved but still tend to be degenerate, which can influence the interaction between TFs and motifs or TFBSs. Therefore, after motif or TFBS data are collected and aligned from experimental or computational results, relevant consensus IUPAC strings can be constructed by selecting a degenerate base symbol for each position in the alignment. The motif or TFBS data can also be modeled as a PFM by aligning identified sites and counting the frequency of each base at each position of the alignment. Usually, a PFM is then converted into a PWM or PSSM by published formulas [5, 7]. A candidate site in a non-coding DNA sequence can be scored by summing the PWM or PSSM values for the base observed at each position. Moreover, by using sequence logos, a PWM can be displayed with color and height proportional to the base frequency and information content at each position. In the 1970s, scientists predicted that the pivotal difference between human and chimpanzee was located in non-coding DNA sequences rather than coding DNA sequences. Since then many essential elements of transcriptional regulatory networks have been identified in non-coding DNA sequences, including promoters, enhancers, insulators, silencers, and locus control regions. At present, the discovery of motifs is mainly limited to the canonical 5’ ends of known genes, where TFs are generally thought to bind. Nevertheless, recent studies have shown that only a small proportion of motifs or TFBSs lie in the immediate upstream sequences of well-characterized protein-coding genes, while the rest exist in either introns or 3’ regions [6, 10, 11]. A number of algorithms to discover motifs have been applied previously, for example, BE95, KYD96, DB97, vHRCV00, BJVU98, EP20, KFQW99, and so on. However, many of these algorithms were designed for finding longer or more common motifs rather than for identifying TFBSs.
The price paid for this generality is that many of the cited algorithms are not guaranteed to find globally optimal solutions, since they employ some forms of local search, such as Gibbs sampling, expectation maximization (EM), and phylogenetic algorithms. In this study, we give a brief introduction to the algorithm design and analysis for TFBSs with a focus on problems in comparative motif discovery.
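
The PFM, PWM, and site-scoring steps described above can be illustrated with a minimal Python sketch. It is not the formula of any specific cited tool: the log2-odds conversion, the pseudocount of 1, the uniform background of 0.25 per base, the example sites, and all function names are assumptions chosen for illustration.

```python
import math

BASES = "ACGT"

def build_pfm(sites):
    """Count each base at each position of a set of aligned, equal-length sites."""
    width = len(sites[0])
    pfm = {b: [0] * width for b in BASES}
    for site in sites:
        for i, base in enumerate(site):
            pfm[base][i] += 1
    return pfm

def pfm_to_pwm(pfm, background=None, pseudocount=1.0):
    """Convert counts into log2-odds weights (assumed formula, with a pseudocount)."""
    background = background or {b: 0.25 for b in BASES}
    width = len(pfm["A"])
    n_sites = sum(pfm[b][0] for b in BASES)
    pwm = {b: [0.0] * width for b in BASES}
    for b in BASES:
        for i in range(width):
            freq = (pfm[b][i] + pseudocount * background[b]) / (n_sites + pseudocount)
            pwm[b][i] = math.log2(freq / background[b])
    return pwm

def score_site(pwm, site):
    """Sum the weight of the observed base at each position."""
    return sum(pwm[base][i] for i, base in enumerate(site))

def best_hit(pwm, sequence):
    """Slide the PWM along one strand and return (best score, start index)."""
    width = len(pwm["A"])
    return max((score_site(pwm, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))

# Hypothetical aligned sites, for illustration only.
sites = ["TTGACT", "TTGACA", "TTGACT", "TTTACT"]
pfm = build_pfm(sites)
pwm = pfm_to_pwm(pfm)
score, start = best_hit(pwm, "ACGTTGACTAAC")
```

Only one strand is scanned here; a practical scanner would also score the reverse complement and apply a score threshold.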

Results and Discussion

Combinatorial approaches

Among the possible algorithmic approaches, combinatorial approaches try to exhaustively explore all the ways that a molecular process could happen. This leads to hard combinatorial problems for which efficient algorithms are required. Algorithms of this kind must therefore make use of complex data representations and techniques.

Sequence-driven or Sample-driven (SD) algorithms

SD algorithms try to find comparative patterns by comparing strings of a given length and looking for local similarities between them. They construct a local multiple alignment of the given non-coding DNA sequences and then extract the comparative patterns from the alignment by combining the segments that are common to most of the sequences.

Pattern-driven (PD) algorithms

PD algorithms are based on enumerating candidate patterns of a given length and selecting those with high fitness. The advantage of PD algorithms is that they can find the best comparative patterns within some limited size. Compared with SD algorithms, PD algorithms can be run intelligently so that patterns not present in the data are never generated. For example, if a pattern α is not frequently present in the data, then no refinement that makes α more specific (and so matches in even fewer places) can be frequent in the data either.
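
The pruning argument above can be sketched as a level-wise enumeration in Python. This is a simplified illustration under stated assumptions, not the procedure of any cited tool: patterns are exact words grown one base at a time, support is the number of sequences containing the word, and the function names and example sequences are hypothetical.

```python
def support(pattern, sequences):
    """Number of sequences that contain the pattern at least once."""
    return sum(1 for seq in sequences if pattern in seq)

def frequent_patterns(sequences, max_len, min_support, alphabet="ACGT"):
    """Grow patterns one base at a time, extending only the frequent ones.

    An infrequent pattern is dropped from the frontier, so none of its
    more specific extensions is ever generated or counted.
    """
    frequent = {}
    frontier = [""]
    for _ in range(max_len):
        next_frontier = []
        for prefix in frontier:
            for base in alphabet:
                candidate = prefix + base
                count = support(candidate, sequences)
                if count >= min_support:
                    frequent[candidate] = count
                    next_frontier.append(candidate)
        frontier = next_frontier
    return frequent

# Hypothetical input sequences, for illustration only.
sequences = ["ACGTGACT", "TTGACTAA", "GGTGACTC"]
found = frequent_patterns(sequences, max_len=5, min_support=3)
```

Real PD tools also allow mismatches or degenerate symbols, which enlarges the search space but leaves the pruning principle unchanged.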

Multiprofiler

This algorithm mainly utilizes multi-profiles that generalize a notion of a profile to detect subtle patterns that might escape detection by standard profiles. It is designed for finding particularly subtle motifs even in the case when real motifs may be blurred by random ones. The advantage of Multiprofiler is that it takes much less time. Kravchenko et al. used Multiprofiler to search and statistically assess putative motifs in promoter regions of co-regulated genes, where the discovered over-represented sites could be verified by cell transfection experiments.

Consensus

This approach first pairs each word with every remaining word to create all possible two-sequence alignments. It scores the two-sequence alignments by information content and saves the highest-scoring ones. Each saved two-sequence matrix is then paired with each word that is not already in the matrix, and the resulting three-sequence matrices are scored for information content, among which the highest are kept again. This process continues until each sequence has contributed exactly once to each saved alignment. In practice, Lenz et al. scanned the upstream regions of the known Vibrio cholerae σ54-regulated genes and obtained a 16 bp motif, which perfectly matches the known σ54 binding sites in V. cholerae with the consensus sequence “TGGCAC-N5-TTGCA/T”. In another study, to test the hypothesis that IL-2-regulated genes in T1 cells may be influenced by STAT5, Fung et al. searched for motifs in 5,000 bp upstream regions by using the Consensus approach, and the classic motif obtained, “TTCNNNGAA”, was verified by ChIP experiments.
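
The information-content score that drives each greedy step can be illustrated with a short Python sketch. It assumes a uniform background of 0.25 per base and applies no small-sample correction; the function names and example words are hypothetical, and the single greedy step shown is a simplification rather than the Consensus program itself.

```python
import math
from collections import Counter

def information_content(words, background=0.25):
    """Sum over columns of f * log2(f / background) for each observed base."""
    total = 0.0
    for column in zip(*words):
        for count in Counter(column).values():
            freq = count / len(words)
            total += freq * math.log2(freq / background)
    return total

def best_extension(alignment, candidates):
    """One greedy step: add the candidate word that maximizes information content."""
    return max(candidates, key=lambda w: information_content(alignment + [w]))

# Hypothetical words, for illustration only.
saved = ["TTGACT", "TTGACA"]
chosen = best_extension(saved, ["GGGCCC", "TTGACT", "TTTACT"])
```

A fully conserved column contributes 2 bits under these assumptions, so two identical 4-base words score 8 bits in total.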

Teiresias

Teiresias is a two-phase combinatorial approach for general pattern discovery. This algorithm considers the instance in which every motif is present in every sequence; that is, it finds all the maximal patterns with a minimum support. Its running time scales quasi-linearly with the size of the output. One property that differentiates Teiresias from other algorithms is the type of structural restriction that users are allowed to impose on the patterns to search for. For example, only the parameter W needs to be set. It thus becomes possible to discover patterns of arbitrary length as long as preserved positions are not more than W residues away. In 2005, Kiesler et al. scanned 23 Hrp59 target exons by using Teiresias and found the known “GGAGG” core motif. This result was confirmed by ChIP, IP, and RT-PCR experiments.

Winnower, SP-STAR, and cWinnower

Winnower first represents motif instances as vertices, then it tries to delete spurious edges and recover motifs with the remaining vertices. SP-STAR is a local sum-of-pairs score improvement algorithm, which considers only the subsequences present in the dataset and iteratively updates the scores of the motifs. cWinnower improves its running time by a stronger constraint function.

MobyDick

In some cases, motifs can be defined as strings whose probability of occurrence greatly exceeds the background expectation. One problem is to decide which part constitutes the background and where the natural limits of a motif lie, since large pieces of a motif will also show up in a list of improbable strings. MobyDick is designed to resolve this issue. It is suitable for discovering motifs from a large collection of sequences, for example, all of the upstream regions in the yeast genome or all of the genes regulated during sporulation. In 2003, based on two clusters of genes obtained from microarray experiments, Murphy et al. scanned 1,000 bp promoter regions of each gene in each cluster and found a motif “T(G/A)TTTAC”, which had been previously validated to be bound by a known TF. Moreover, they found a new motif “CTTATCA” that may control gene transcription.

Smile, Verbumculus, and Weeder

The Smile algorithm takes into consideration the fact that TFBSs may be multiple and present a constrained spatial structure in genomes. The algorithm is therefore able to identify what are called “structured motifs”. A suffix tree is used for finding such motifs. The inner core of Verbumculus rests on subtly interwoven properties of statistics, pattern matching, and combinatorics on words, which makes it feasible to both detect and visualize over-represented words in a fast and practically useful way. Weeder extends exhaustive enumeration to signals and does not require the exact length of the pattern as input.

Mitra

Mitra can be extended to handle insertions and deletions in addition to mismatches in selected sequences. It takes advantage of a pattern-pruning insight that allows more efficient use of pairwise similarity than in Winnower. For example, unlike previous PD or SD algorithms, Mitra is able to discover composite motifs of a combined length over 30 bp.

Projection

This algorithm ameliorates the limitations of existing algorithms by using random projections of the input. It extends previous projection-based searching techniques to solve a multiple alignment problem that is not effectively addressed by pairwise alignments. It is designed to efficiently solve problems from the planted (l, d) motif model, and can solve substantially more difficult instances, and solve them more reliably, than previous algorithms. For t = 20 sequences of length n = 600, this algorithm achieves performance close to the best possible, being limited primarily by statistical considerations.
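
The bucketing idea behind random projection can be sketched as follows in Python. This shows a single projection trial only, without the refinement step a full tool would apply to promising buckets; the function names, the choice of k, and the example sequences (in which the motif is planted without mutations) are assumptions for illustration.

```python
import random
from collections import defaultdict

def projection_buckets(sequences, l, k, rng):
    """Hash every l-mer by the bases at k randomly chosen positions."""
    positions = sorted(rng.sample(range(l), k))
    buckets = defaultdict(list)
    for seq_index, seq in enumerate(sequences):
        for start in range(len(seq) - l + 1):
            lmer = seq[start:start + l]
            key = "".join(lmer[p] for p in positions)
            buckets[key].append((seq_index, start))
    return positions, buckets

def enriched_buckets(buckets, threshold):
    """Keep buckets that receive l-mers from at least `threshold` sequences."""
    return {key: hits for key, hits in buckets.items()
            if len({seq_index for seq_index, _ in hits}) >= threshold}

# Hypothetical sequences sharing one planted 8-mer, for illustration only.
motif = "TTGACTAG"
sequences = ["CCGA" + motif + "GGAC", "A" + motif + "CCCTGAA", "GGCATCA" + motif]
positions, buckets = projection_buckets(sequences, l=8, k=4, rng=random.Random(0))
hits = enriched_buckets(buckets, threshold=len(sequences))
```

When motif instances carry up to d mutations, they share a bucket only if the sampled positions avoid the mutated ones, which is why the projection is repeated over many random trials.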

EC and MoDEL

The evolutionary computation (EC) approach allows variation of motifs by the measurement of a similarity score. Compared with SD algorithms, which are not always easy to define and rely on the accuracy of PSSM, the EC approach does not rely on any pre-defined or estimated weight matrices [39, 40]. MoDEL uses a hybrid strategy consisting of an evolutionary algorithm (global search) and hill-climbing optimizations (local search) according to Brazma’s classification. It addresses a well-known problem: given a set of functionally related sequences, how to choose exactly one occurrence per sequence in a way that all chosen occurrences are maximally similar. Such a set of occurrences will be referred to as ungapped local multiple alignment.

Probabilistic approaches

Probabilistic or randomized approaches make certain decisions randomly. This concept extends the classical model of deterministic algorithms and has generated many useful and provably efficient algorithms over the last twenty years. Probabilistic approaches are often faster, simpler, or more elegant than their combinatorial counterparts. Probabilistic algorithms that identify gene modules based on motif discovery are highly appropriate for analyzing synthetic lethal genetic interaction datasets, and have great potential in the integrative analysis of heterogeneous datasets.

EM

The EM algorithm is used to estimate the probability density of a given dataset by employing the Gaussian mixture model. The probability density of a dataset is modeled as the weighted sum of a number of Gaussian distributions. The main advantage of EM is its speed, while the disadvantages are that it requires “appropriate” starting values and has difficulty handling constrained parameters.

Gibbs Sampler

The Gibbs sampling algorithm is one of the simplest Markov chain Monte Carlo algorithms. By Gibbs sampling, the joint distribution of the parameters will converge to the joint probability of the parameters in the given dataset. Gibbs sampling strategies are claimed to be fast and sensitive. The sampler generally finds an optimized local alignment model for N sequences in N-linear time, avoiding the local-maximum problem that the EM algorithm falls into. However, it requires a relatively large dataset (15 or more sequences) for weakly conserved patterns to reach statistical significance. In 2000, Petersen et al. used Gibbs Sampler to look for motifs that are not necessarily 100% conserved in 17 putative promoter regions obtained from microarray experiments. The search was performed for motif widths ranging from 6 to 16 bp, and Gibbs Sampler repeatedly found the motifs “TTGACT” and “GACTWWHC”, both of which had been identified by previous experiments.
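
A minimal site sampler in Python illustrates the sampling loop described above. It is a sketch under simplifying assumptions, not the published Gibbs Sampler: exactly one site per sequence, a fixed motif width, a uniform background of 0.25, a pseudocount of 1, and hypothetical function names and input sequences. Because the procedure is stochastic, the returned positions depend on the random seed.

```python
import random

BASES = "ACGT"

def profile_without(seqs, starts, width, skip, pseudo=1.0):
    """Per-position base frequencies from all current sites except sequence `skip`."""
    counts = [{b: pseudo for b in BASES} for _ in range(width)]
    for i, (seq, start) in enumerate(zip(seqs, starts)):
        if i == skip:
            continue
        for j in range(width):
            counts[j][seq[start + j]] += 1
    total = len(seqs) - 1 + 4 * pseudo
    return [{b: column[b] / total for b in BASES} for column in counts]

def gibbs_sampler(seqs, width, iterations, rng):
    """Resample each sequence's site in proportion to its motif/background odds."""
    starts = [rng.randrange(len(seq) - width + 1) for seq in seqs]
    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            profile = profile_without(seqs, starts, width, skip=i)
            weights = []
            for start in range(len(seq) - width + 1):
                odds = 1.0
                for j in range(width):
                    odds *= profile[j][seq[start + j]] / 0.25
                weights.append(odds)
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Hypothetical sequences containing a shared 6-mer, for illustration only.
seqs = ["ACGCTTGACTGCA", "TTGACTCCGATAG", "GGCATTTGACTAC", "CATGCAGTTGACT"]
starts = gibbs_sampler(seqs, width=6, iterations=100, rng=random.Random(0))
```

Sampling the new site rather than always taking the best-scoring one is what lets the procedure escape poor local alignments.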

MEME

The MEME algorithm extends the EM algorithm for identifying motifs in unaligned sequences. A drawback of EM is that the maximum it finds is only local. MEME can either favor motifs that appear exactly once (one-per model) or appear zero or once (zero-or-one-per model) in each sequence in a training set, or give no preference to the number of occurrences (zero-or-more-per model). In 2005, Hall et al. acquired a set of correlated genes from genomic, transcriptomic, and proteomic analyses. They applied MEME to scan 1,000 bp downstream of the stop codon, where a 47 bp motif was found in six of the analyzed sequences. The motif was then used to search the entire genome, and 20 additional genes were identified as having the same motif. This motif was known to be bound by Puf protein, implying that Puf protein may control the transcription of the analyzed genes.

LOGOS and MotifPrototyper

LOGOS consists of two interacting submodels: HMDM, a model for aligned selected sequences, and HMM, a model for the global distribution of motif instances. HMDM is a hidden Markov-Dirichlet multinomial model that captures rich biological prior knowledge and positional dependence in motif local structure in a principled way. HMM is a standard hidden Markov model, which allows formal and efficient inference of motif locations, and is potentially capable of capturing their dependencies. Model parameters can be fit on training motifs by using a variational EM algorithm within an empirical Bayesian framework. MotifPrototyper is later used to train the model’s parameters and to scan for known regulatory motifs and discover unknown ones.

Motif Sampler

Motif Sampler uses higher-order Markov models to represent the intergenic background in non-coding DNA sequences. It can incorporate higher-order background models to update the probabilities of finding a motif at a certain position. To search for the consensus binding site of a known TF, Yrrp1, in yeast, Le Crom et al. used Motif Sampler to search for motifs in the genes regulated by Yrrp1, and the resulting motif “(T/A)CCG(C/T)(G/T)(G/T)(A/T)(A/T)” was confirmed by EMSA experiments.

AlignACE

AlignACE is based on the Gibbs sampling algorithm, but it differs from Gibbs sampling in the following ways. Firstly, the motif model is changed so that base frequencies for non-site sequences are fixed according to the source genome. Secondly, both strands of input sequences are simultaneously considered at each step of this algorithm. Overlapping sites are not allowed even if these sites are on opposite strands. Thirdly, simultaneous multiple searching is replaced by an approach in which a single motif is found and iteratively masked [52-54].

ANN-Spec

The objective function for ANN-Spec is designed to find patterns that distinguish the positive dataset from background. It succeeds in identifying the desired patterns specific for the positive dataset. For example, Gibbs sampling and ANN-Spec both work very well when the background is assumed to be random, while ANN-Spec finds patterns with higher specificity and higher correlation coefficients when it is provided with background sequences [55, 56].

BioProspector

BioProspector uses a Markov background to model base dependencies of non-motif bases, which greatly improves the specificity of reported motifs. The parameters of the Markov background model are either estimated from user-specified sequences or precomputed from the whole genome. A new motif scoring function is adopted to allow each input sequence to contain zero to multiple copies of the motif. In addition, BioProspector can model gapped motifs and motifs with palindromic patterns, which are prevalent motif patterns in prokaryotes [57, 58].
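
How a Markov background captures base dependencies can be sketched in Python. This is a generic illustration, not BioProspector's implementation: the model order, the pseudocount, the fallback probability of 0.25 for unseen contexts, and the function names and training sequence are all assumptions.

```python
import math
from collections import defaultdict

BASES = "ACGT"

def train_markov(background_seqs, order, pseudo=1.0):
    """Estimate P(base | previous `order` bases) from background sequences."""
    counts = defaultdict(lambda: {b: pseudo for b in BASES})
    for seq in background_seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return {context: {b: c[b] / sum(c.values()) for b in BASES}
            for context, c in counts.items()}

def log_prob(model, seq, order):
    """Log-probability of a sequence under the background, given its first `order` bases."""
    total = 0.0
    for i in range(order, len(seq)):
        probs = model.get(seq[i - order:i])
        p = probs[seq[i]] if probs else 0.25  # unseen context: assume uniform
        total += math.log(p)
    return total

# Hypothetical AT-rich background, for illustration only.
model = train_markov(["ATATATATAT"], order=1)
```

A candidate site that is likely under such a background (here, an alternating AT run) earns less credit as a motif than it would under an independent-base background.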

MDscan and Motif Regressor

MDscan mainly examines ChIP-on-chip selected sequences. It combines the advantages of two widely adopted motif search strategies, word enumeration and PSSM, and incorporates ChIP enrichment information to accelerate the searching and enhance its success rate. Motif Regressor uses linear regression analysis to select motifs whose sequence matching scores are significantly correlated with ChIP-on-chip enrichment or downstream gene expression values. Ranking motifs by linear regression p-value, Motif Regressor automatically picks the best one with optimal width [59-61].
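
The regression-based ranking can be illustrated with a small Python sketch. It substitutes the coefficient of determination r² for the p-value, which gives the same ordering for simple linear regression when every motif is scored on the same number of sequences; the function names and all numbers are hypothetical, and this is not the Motif Regressor code.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)

def rank_motifs(scores_by_motif, enrichment):
    """Order candidate motifs by how well their match scores track enrichment."""
    return sorted(scores_by_motif,
                  key=lambda m: r_squared(scores_by_motif[m], enrichment),
                  reverse=True)

# Hypothetical enrichment values and per-sequence match scores.
enrichment = [1.0, 2.0, 3.0, 4.0, 5.0]
scores_by_motif = {
    "motif_a": [0.9, 2.1, 2.9, 4.2, 5.1],
    "motif_b": [3.0, 1.0, 4.0, 1.0, 3.0],
}
ranking = rank_motifs(scores_by_motif, enrichment)
```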

Improbizer

Improbizer searches for motifs that occur with improbable frequency by using a variation of the EM algorithm. It works by finding the patterns that occur more frequently than they should by chance. The simplest way to estimate how frequently a particular sequence should occur by chance is to raise one quarter to the power of the number of nucleotides in the sequence. Optionally, Improbizer constructs a Gaussian model of motif placement, so that motifs occurring in similar positions in the input sequences are more likely to be found.
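
The chance estimate described above is simple arithmetic and can be written out in Python. The sketch assumes equal base frequencies of one quarter, independent positions, and a single strand; the function names are hypothetical. For a 6-nucleotide word the chance probability at a given position is (1/4)^6 = 1/4096, and a 1,000 bp sequence offers 995 possible start positions, so about 0.24 occurrences are expected by chance.

```python
def chance_probability(word_length, base_probability=0.25):
    """Probability that a specific word occurs at one given position."""
    return base_probability ** word_length

def expected_occurrences(word_length, sequence_length, base_probability=0.25):
    """Expected number of chance occurrences along one strand."""
    windows = max(sequence_length - word_length + 1, 0)
    return windows * chance_probability(word_length, base_probability)

p6 = chance_probability(6)          # 1/4096
e6 = expected_occurrences(6, 1000)  # 995 windows times 1/4096
```

A word observed far more often than this expectation across the input sequences is a candidate motif.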

SeSiMCMC

SeSiMCMC is a tool for multiple local alignment of a set of non-coding DNA sequences, which is based on a modification of the Gibbs sampling algorithm. Its primary objective is to provide a computationally efficient tool that uses user-defined motif symmetry and estimates motif length from the dataset. Sequence fragments in a training set can have arbitrary orientation, and there is a probability for a sequence to contain no sites.

GMS-MP

GMS-MP performs significantly better than standard PWM-based Gibbs sampling methods. Compared with the Bayesian network approach, GMS-MP has a simpler model, a prior that is easier to prescribe, and much faster computation. The step of sampling pairwise correlations takes up only about 3% of the total computing time, which is much faster than the Bayesian network. This method also does not suffer from over-fitting, which is likely due to the use of a rather conservative prior distribution on the model pattern.

Phylogenetic footprinting approaches

Phylogenetic footprinting approaches discover regulatory elements in a set of orthologous regulatory regions from multiple species by identifying the best conserved motifs in those orthologous regions.

PhyloCon

Phylogenetic-Consensus (PhyloCon) takes into account both conserved orthologous genes and co-regulated genes within a species. The key idea of PhyloCon is to compare aligned sequence profiles from orthologous genes or co-regulated genes rather than unaligned sequences. PhyloCon integrates the knowledge of co-regulated genes in single species with sequence conservation across multiple species to improve the performance of motif discovery. An advantage of PhyloCon is that it reports motifs of varying lengths, instead of requiring the motif length to be input [66, 67].

EMnEM and OrthoMeme

Expectation-maximization on evolutionary mixtures (EMnEM) considers special motifs that are generated from ancestral sequences. The ancestral sequences are made of two-component mixtures of motifs and background, each with its own evolutionary model. The value of varying evolutionary models has been realized in other contexts as well, and such models have been successfully trained by using EM. Normally, MEME scores better than EMnEM with a substitution model, except at higher evolutionary distances, where EMnEM takes the lead. OrthoMeme is the first algorithm to deal with heterogeneous data sources in a truly integrated manner by using all the data from the onset of analysis.

PhyME

PhyME integrates two different axes of information content in evaluating the significance of candidate motifs. One axis is the overrepresentation that depends on the number of occurrences of motifs in each species. The other axis is the level of conservation of each motif instance across species. An important feature of PhyME is that it allows motifs to occur in evolutionarily conserved as well as unconserved regions in orthologous sequences. PhyME treats the two kinds of occurrences differently when it scores a motif.

FootPrinter

The distinctive feature of FootPrinter is that it takes as input a set of unaligned homologous sequences from various species together with a phylogenetic tree relating these species. It then searches for short regions of the sequences that are highly conserved according to a parsimony criterion. The regions identified are good candidates for regulatory elements.

CompareProspector

CompareProspector identifies regulatory elements by using information content from both intraspecies pattern enrichment and interspecies sequence conservation. This distinguishes it from other phylogenetic footprinting programs that use orthologous sequences of a single gene from multiple species to identify regulatory elements.

Conclusion

In the last decade, computational identification of motifs or TFBSs by analyzing non-coding DNA sequences has emerged as a major new technology for elucidating transcriptional regulatory networks. Combinatorial algorithms assume a discrete model and search for motifs with a high rate of occurrence in non-coding DNA sequences. One major drawback of combinatorial algorithms is that they are sometimes difficult to understand, and many “hidden” details make them hard to implement. Probabilistic algorithms often run faster than their corresponding combinatorial algorithms. Moreover, many probabilistic algorithms are easier to implement and describe than combinatorial algorithms of comparable performance. However, these algorithms may miss a great deal of useful information when searching in non-coding DNA sequences. Phylogenetic footprinting assumes that functional sequences tend to be conserved through evolution. Motifs or TFBSs can thus be identified by looking for conservation of small regions within multiple alignments of non-coding DNA sequences. To date, more than 120 motif discovery tools have been applied in biological research. The main challenge for motif discovery tools has always been the application of effective algorithms that can handle all the intrinsic complexities associated with the nature of motifs or TFBSs. However, some considerations should be borne in mind when applying computational approaches to biological problems. One is the futility theorem, which means that we still have no good method other than traditional molecular biology to find out whether a predicted motif or TFBS has a clear function in vivo. Another is that pattern discovery methods are severely restricted by the signal-to-noise problem, because the information content of a motif is strictly limited by its intrinsic nature. Additionally, some algorithms that work well for yeast might not work for human due to the complexity of DNA structure. Therefore, all observed patterns must be carefully considered.

Materials and Methods

Web-based resources for non-coding DNA sequence datasets

The non-coding DNA sequence datasets available through web-based resources give biologists tools for relating their applications to experimental research. These dataset tools include views, wizards, editors, and other features that make it easy for users to predict and test the experimental elements of their applications (partially in Table 1).
Table 1. Selected web-based resources for promoter databases

Database | Explanation | URL
EPD | Eukaryotic promoter database | http://www.epd.isb-sib.ch/
DBTSS | Database of transcriptional start sites (human) | http://dbtss_old.hgc.jp/hg17/
SCPD | Saccharomyces cerevisiae promoter database | http://rulai.cshl.edu/SCPD/
DCPD | Drosophila core promoter database | http://www-biology.ucsd.edu/labs/Kadonaga/DCPD.html
PlantProm DB | Plant promoter database | http://mendel.cs.rhul.ac.uk/mendel.php?topic=plantprom
CSHLmpd | Cold Spring Harbor Laboratory mammalian promoter database | http://rulai.cshl.edu/CSHLmpd2/
TRED | Transcriptional regulatory element database | http://rulai.cshl.edu/cgi-bin/TRED/tred.cgi?process=home

Web-based resources for regulatory motif or TFBS datasets

The relational motif or TFBS datasets help biologists create and manipulate the data definitions for their own projects, in terms of relational dataset schemas. Users can access relational motif or TFBS datasets under the analysis perspective, which allows users to browse or import dataset schemas in the servers view, create and work with dataset schemas in the data definition view, and change dataset schemas in the table editor. Users can also export data definitions to another dataset installed either locally or remotely (partially in Table 2).
Table 2. Selected web-based resources for regulatory motifs or TFBSs

Database | Explanation | URL
JASPAR | A collection of transcription factor DNA-binding preferences | http://mordor.cgb.ki.se/cgi-bin/jaspar2005/jaspar_db.pl
TRANSFAC | Database on eukaryotic transcription factors, their genomic binding sites and DNA-binding profiles | http://www.gene-regulation.com/pub/databases.html#transfac
TRRD | Transcription regulatory regions database | http://wwwmgs.bionet.nsc.ru/mgs/gnw/
RegulonDB | A computational model of mechanisms of transcriptional regulation | http://regulondb.ccg.unam.mx/html/What_is_RegulonDB.jsp
TFD | Transcription factor databases | http://www.ifti.org/

Web-based resources for motif or TFBS discovery algorithms

Emphasis is placed on the development of general algorithm designs and data structures that are particularly suited to biological problems, with applications in a variety of areas such as genetic information systems, computer graphics, alignments, and computer-aided design (partially in Table 3).
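Table 3. Selected web-based resources for motif discovery tools

Algorithm | Motif model | Match model | Ref.
AlignACE | matrix | PWM | 52
ANN-Spec | matrix | PWM | 55
BioOptimizer | - | PWM | 72
BioProspector | matrix, dyad | PWM | 57
CAGER | - | - | 73
Cis-analyst | - | PWM | 74
CisModule | - | PWM | 75
Cister | - | PWM | 76
Clover | - | PWM | 77
ClusterScan | - | PWM | 78
CoBind | matrix, dyad | PWM | 79
COMET | - | - | 80
CompareProspector | - | - | 57
ConsecID | - | PWM | 81
Consensus | matrix | PWM | 24
ConSite | - | PWM | 82
COOP | - | reg.exp | 83
cWinnower | string | mismatch | 31
DMotifs | string | reg.exp | 84
DMS | - | PWM | 85
Dyad analysis | string, dyad | oligos | 15
EC | string | fitness | 39
EMnEM | - | - | 68
FastM | - | PWM | 18
FootPrinter | - | mismatch | 71
FrameWorker | - | PWM | 86
GANN | - | flexible | 87
Gibbs sampler | matrix | PWM | 44
Gibbs recursive | matrix | PWM | 88
GLAM | string | - | 89
GMS-MP | GWM | HMM | 64
HMDM | - | DM | 90
Improbizer | - | PWM | 62
LOGOS | HMDM | DM | 48
MAPPER | - | HMM | 91
MCAST | - | PWM | 92
MDScan | matrix | PWM | 59
MEME | matrix | PWM | 46
MERMAID | string | PWM | 93
MISAE | - | mismatch | 94
Mitra | string, dyad | mismatch | 48
Mitra-dyad | - | mismatch | 17
MITRA-PSSM | matrix | PWM | 95
MM | - | PWM | 96
MobyDick | string | mismatch | 32
MoDEL | string | PWM | 41
ModelGenerator | - | PWM | 97
ModelInspector | - | PWM | 97
Modulescanner | - | PWM | 98
ModuleSearcher | - | PWM | 98
MotifLocator | - | PWM | 98
MotifPrototyper | - | DM | 49
Motif regressor | - | PWM | 41
Motif sampler | - | PWM | 50
MSCAN | - | PWM | 99
MultiProfiler | string | mismatch | 21
NestedMICA | - | PWM | 100
NONPAR | - | mixture | 101
Oligo-analysis | string | oligos | 102
OrthoMEME | - | PWM | 69
Pattern-assembly | - | - | 103
PhyloCon | - | PWM | 66
PhyME | - | - | 70
Pratt2 | - | reg.exp | 104
Projection | string | PWM | 38
ProMapper | - | DM | 105
PromoterInsp | - | oligos | 106
QuickScore | string | IUPAC | 107
REDUCE | - | PWM | 108
SCORE | - | - | 109
SeSiMCMC | - | PWM | 63
SMILE | string, mult | mismatch | 34
SOMBERO | - | PWM | 110
Splash | - | reg.exp | 111
Stubb | - | PWM | 42
Teiresias | string | reg.exp | 27
TFBScluster | - | PWM | 112
Verbumculus | string | mismatch | 35
Weeder | string | mismatch | 36
Winnower | string | mismatch | 30
YMF | string | reg.exp | 113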

Authors’ contributions

WW carried out the study, and YX supervised the research. Both authors read and approved the final manuscript.

Competing interests

The authors have declared that no competing interests exist.