As an increasing number of eukaryotic genomes are being sequenced, comparative studies aimed at detecting regulatory elements in intergenic sequences are becoming more prevalent. Most comparative methods for transcription factor (TF) binding site discovery make use of global or local alignments of orthologous regulatory regions to assess whether a particular DNA site is conserved across related organisms, and thus more likely to be functional. Since binding sites are usually short, sometimes degenerate, and often independent of orientation, alignment algorithms may not align them correctly. Here, we present a novel, alignment-free approach for using conservation information for TF binding site discovery. We relax the definition of conserved sites: we consider a DNA site within a regulatory region to be conserved in an orthologous sequence if it occurs anywhere in that sequence, irrespective of orientation. We use this definition to derive informative priors over DNA sequence positions, and incorporate these priors into a Gibbs sampling algorithm for motif discovery. Our approach is simple and fast. It requires neither sequence alignments nor the phylogenetic relationships between the orthologous sequences, yet it is more effective on real biological data than methods that do.
As an increasing number of eukaryotic genomes are being sequenced, comparative studies aimed at detecting regulatory elements in intergenic sequences are becoming more prevalent. Most comparative methods for transcription factor (TF) binding site discovery make use of global or local alignments of orthologous regulatory regions to assess whether a particular DNA site is conserved across related organisms, and thus more likely to be functional. Since binding sites are usually short, sometimes degenerate, and often independent of orientation, alignment algorithms may not align them correctly. Here, we present a novel, alignment-free approach for using conservation information for TF binding site discovery. We relax the definition of conserved sites: we consider a DNA site within a regulatory region to be conserved in an orthologous sequence if it occurs anywhere in that sequence, irrespective of orientation. We use this definition to derive informative priors over DNA sequence positions, and incorporate these priors into a Gibbs sampling algorithm for motif discovery. Our approach is simple and fast. It requires neither sequence alignments nor the phylogenetic relationships between the orthologous sequences, yet it is more effective on real biological data than methods that do.
Due to advances in DNA sequencing technologies, the number of closely related genomes being sequenced has increased tremendously (1–4). This has consequently led to the emergence of comparative studies focused on detecting functional elements in intergenic DNA sequences. Functional elements, including transcription factor (TF) binding sites, are known to evolve at a slower rate than non-functional elements, and therefore well-conserved DNA sites should be good candidates for TF binding sites.Many algorithms use evolutionary conservation information for de novo TF motif discovery, either by filtering the putative regions according to their conservation levels and then applying conventional motif finders (5) or by incorporating the conservation information into the motif finder itself (5–7, and many more). The former approach, although straightforward, has the main drawback that any region with a conservation level below the chosen cutoff is completely ignored, and thus motifs that are not well-conserved are not found by such methods. Thus, most conservation-based motif finders take the latter approach and incorporate the conservation information into the algorithm itself.These methods can be further divided into two main categories: ‘single gene’ and ‘multiple genes’. Methods in the first category [e.g. FootPrinter (8), the phylogenetic Gibbs sampler of Newberg et al. (9)] take as input the regulatory region of a single gene, together with its orthologs from related organisms. Methods in the second category [e.g. the method of Kellis et al. (2), Converge (5), PhyloCon (6), PhyME (7), PhyloGibbs (10), OrthoMEME (11), EMnEM (12), Compare Prospector (13)] are designed to search for motifs that are both overrepresented in a set of sequences from a reference species and conserved across related organisms. Our method falls into this category, so for the rest of the article we will focus only on ‘multiple genes’ approaches.
Conserved TF binding sites are sometimes misaligned
Most conservation-based approaches to TF binding site discovery rely on multiple or pairwise alignments of orthologous regulatory regions to assess whether a particular DNA site is conserved across related organisms (2,5,7,10–14). However, since binding sites are usually short, sometimes degenerate, and often in reverse orientation or even relocated (15), alignment algorithms may not correctly align the binding sites within orthologous regulatory sequences. Especially when the sequences are very divergent, the background ‘noise’ of non-functional regions may be stronger than the ‘signal’ of conserved motifs, preventing a correct alignment. In Figure 1, we illustrate four scenarios where known motifs in orthologous yeast sequences are not correctly aligned, and thus would almost surely be missed by alignment-based motif finders. When a motif changes position or orientation, as in Figure 1C and D, correct alignment of motifs may even be impossible. Alignment algorithms may also incorrectly align orthologous regulatory regions from higher organisms: Kheradpour et al. (16) have shown examples of Drosophila melanogasterMef-2 binding sites that are not correctly aligned to the orthologous sites in 11 related Drosophila species.
Figure 1.
Examples of conserved TF binding sites in aligned orthologous yeast sequences (18) that are likely to be missed by alignment-based motif discovery programs. The sites matching the motifs of the respective TFs are underlined and marked in color. (A) Alignment algorithms may incorrectly insert gaps in orthologous motif occurrences. (B) Non-functional regions that are conserved in closely related organisms may prevent a correct alignment of the binding sites. (C) Binding sites are sometimes free to change orientation, which is probably the case for the Rpn4 binding site in S. kluyveri. (D) Motifs may change their position relative to each other, as shown by the Leu3 and Ume6 sites.
Examples of conserved TF binding sites in aligned orthologous yeast sequences (18) that are likely to be missed by alignment-based motif discovery programs. The sites matching the motifs of the respective TFs are underlined and marked in color. (A) Alignment algorithms may incorrectly insert gaps in orthologous motif occurrences. (B) Non-functional regions that are conserved in closely related organisms may prevent a correct alignment of the binding sites. (C) Binding sites are sometimes free to change orientation, which is probably the case for the Rpn4 binding site in S. kluyveri. (D) Motifs may change their position relative to each other, as shown by the Leu3 and Ume6 sites.In consequence, motif finding algorithms based on alignments of orthologous promoter regions are only applicable when the promoters in the species of reference align well with the promoters in the related species; this is not true, for example, for many promoters in Saccharomyces cerevisiae and their orthologs in non-sensu stricto Saccharomyces species. Even when the orthologous promoters align well, depending on the exact algorithm used to construct the alignments, different sites may appear to be conserved. For example, while some studies report a significant number of S. cerevisiae TF binding sites that are conserved in related Saccharomyces species (17,18), Siggia et al. (19) found that among 407 experimentally verified binding sites in S. cerevisiae, only about half appear to be conserved in an alignment of sensu stricto promoter sequences [in this study, the sequences were aligned using a method by Morgenstern (20)]. Similarly, Stark et al. (21) investigated the effect of alignment choice on finding conserved TF binding sites in D. melanogaster and found large discrepancies between the alignments (only 59% agreement).
Relaxed definition of conserved binding sites
To overcome the limitations of using aligned orthologous sequences to distinguish between conserved and non-conserved sites, we have designed a novel, alignment-free method for conservation-based motif discovery that relies on a relaxed definition of conserved DNA sites: we consider a site within a reference regulatory region to be conserved in an orthologous sequence if it occurs anywhere in that sequence, irrespective of orientation. We show that this simple definition of conserved sites can be used to detect regulatory motifs more successfully than using existing alignment-based approaches.
MATERIALS AND METHODS
In this section, we first provide a brief description of PRIORITY, a general framework for motif discovery (22–24). Based on a Gibbs sampling strategy, PRIORITY can easily incorporate additional biological data that may be relevant for finding DNA motifs. Second, we show how evolutionary conservation data (both alignment-based and alignment-free) can be incorporated into PRIORITY in the form of positional priors. It is important to note that the present article is not about the PRIORITY framework per se, but rather about a simple, but clever, method for exploiting conservation information for more accurate motif discovery. Consequently, the approach introduced here can easily be adapted to other motif finders. For example, Bailey et al. (personal communication) have now incorporated our alignment-free conservation-based priors into MEME (25).
The PRIORITY framework
Given n DNA sequences to believed to be bound by a common TF, PRIORITY searches for a DNA motif that occurs in most sequences and is overrepresented with respect to background. We model the motif as a position-specific scoring matrix [(PSSM) (26)] ϕ of length W, while the rest of the sequence follows some background model parameterized by ϕ0. Let be a vector of length n denoting the starting location of the binding site in each sequence with the convention that Z = 0 if contains no binding site. For simplicity, we model at most one binding site in each sequence.We use collapsed Gibbs sampling (27) to find ϕ and that maximize the joint posterior distribution of all the unknowns given the data:
We randomly initialize and then sample repeatedly from the joint posterior over ϕ and for a predetermined number of iterations, while keeping track of the highest scoring PSSM. In each run of PRIORITY, we start from several random starting points (by default, the number of trials is 50) and output the highest scoring PSSM across all the trials. Further details are available in our earlier work (22–24,28) and in the Supplementary Data. All the results reported here were obtained using version 2.1.0 of our PRIORITY software implementation, which is available online.
Incorporating positional priors
The Gibbs sampling technique described above has been used in several motif finders, often with additional parameters and heuristics. Usually, these motif finders assume a uniform prior over the locations . We believe that some positions are a priori more likely to be starting locations of TF binding sites, and therefore use informative positional priors P().A positional prior can be built from any score 𝒮 that defines, for each site of size W in the input sequences, the a priori probability of that site being a TF binding site: 𝒮(, j)=P(the W-mer starting at position j in sequence is a binding site). Given 𝒮( for all positions j in the sequence , we define the positional prior P(Z) as:
for 1 ≤ j ≤ l, where l is the length of sequence . We then normalize P(Z) so that under the assumptions of our model we have for 1 ≤ i ≤ n.As described in the next sections, we substitute 𝒮 with different scores based on evolutionary conservation to obtain different positional priors and thus different versions of the PRIORITY algorithm.
Derivation of alignment-free conservation score 𝒮
Based on the relaxed definition of conserved binding sites, we develop the alignment-free conservation score 𝒮 as described in Figure 2.
Figure 2.
Computation of the alignment-free conservation score. For the W-mer starting at position j in sequence , the conservation score 𝒮(, j) is the fraction of sequences orthologous to that contain the word , in either forward or reverse orientation.
Computation of the alignment-free conservation score. For the W-mer starting at position j in sequence , the conservation score 𝒮(, j) is the fraction of sequences orthologous to that contain the word , in either forward or reverse orientation.Let be a DNA sequence from the reference species (i.e. the species for which we have TF binding data). Let ( for 1 ≤ s ≤ k be orthologous sequences from related species, obtained via a genome alignment or by searching for regions near orthologous genes. A sequence is permitted to be empty if no such region is found in the genome of the corresponding organism.We compute the alignment-free conservation score 𝒮 by searching orthologous sequences for occurrences of all W-mers present in . We assume that a W-mer has a high probability of being conserved if it occurs in most of the orthologous sequences regardless of its orientation or specific position in the sequences. For the W-mer at position j in the bound sequence , the score is defined as:
where I[·] is an indicator function and denotes the W-mer at position j in sequence . In other words, the score 𝒮() is directly proportional to the number of orthologous sequences in which the W-mer appears. The values of 𝒮 range from 0 to1. To avoid singularities, we scale 𝒮 linearly so that the values lie between 0.1 and 0.9. We convert this score into a positional prior by substituting 𝒮 for 𝒮 in (1) and (2) and then normalizing P(Z). We call this prior 𝒞, and the Gibbs sampling algorithm that uses this prior, PRIORITY-𝒞.
Derivation of alignment-based conservation scores 𝒮 and 𝒮
The alignment-based score 𝒮 is built directly from the aligned orthologous sequences
. First, for every position j in sequence , we compute the fraction of orthologous sequences in which the nucleotide at position j is conserved according to the alignment. Next, we average this fraction over the W-mer starting at position j in sequence X and call this score 𝒮().The score 𝒮 is based on the conservation track computed by Siepel et al. (18) from multiple alignments of seven yeast species, and available at the UCSC genome browser (29). For every position j in sequence , the track specifies the probability of that position being conserved (as computed by PhastCons from the multiple alignments). We define 𝒮() as the average of the conservation track over the W-mer starting at position j in sequence .As with 𝒮, we scale 𝒮 and 𝒮 linearly to lie between 0.1 and 0.9 instead of 0 and 1 to avoid singularities in the model. We then convert the scores into positional priors by substituting 𝒮 and 𝒮 for 𝒮 in (1) and (2) and normalizing P(Z). We call the new positional priors 𝒜 and 𝒯.
Derivation of discriminative scores
A ChIP-chip experiment gives rise to sequences that are bound by the profiled TF as well as sequences that are not bound. Assume we are given m such unbound sequences. As in the case of , we have orthologous sequences to where 1 ≤ s ≤ k. We compute a discriminative score 𝒮() by taking into account the conservation score 𝒮 over both sets and as follows:
Intuitively, this prior ensures a high score for W-mers that are conserved only in the bound set but not W-mers that are conserved in general throughout the genome. Please refer to the Supplementary Data for details.As in the case of previous conservation scores, we convert 𝒮 into a positional prior which we call 𝒟𝒞. Similarly, we compute the discriminative scores 𝒮 and 𝒮 using the scores 𝒮 and 𝒮 across both bound and unbound sequences, and convert the scores into positional priors 𝒟𝒯 and 𝒟𝒜, respectively.
RESULTS
Our motif finding approach can be applied to any set of DNA sequences believed to be bound by a common TF. It is common practice for motif finding to be evaluated on synthetically generated promoter data. However, our framework uses informative priors that capture information of biological relevance from true genomic sequences so evaluation with simulated data is not appropriate.Instead, we gather TF binding data from chromatin immunoprecipitation (ChIP-chip) experiments performed by Harbison et al. (5) in S. cerevisiae. We choose this data because it contains experiments for a large number of TFs, both with known and unknown consensus binding motifs. More precisely, Harbison et al. profiled the intergenic binding locations of 203 TFs under various environmental conditions over 6140 yeast intergenic regions. For each TF, we define its sequence-set for a particular condition to be those intergenic sequences reported to be bound with P-value ≤0.001 in that condition. Similarly, we define set Y to be all intergenic sequences reported to be bound with P-value ≥0.5. We restrict our attention to sequence-sets of size at least 10. This results in 238 sets that can be subdivided into 156 sets corresponding to TFs with known binding motifs [as summarized by Harbison et al. (5) at the time their article was published or as reported by Dorrington and Cooper (30), Jia et al. (31), Zhao et al. (32), Liu et al. (33) or Tan et al. (34)], which we can use for assessment of accuracy, and 82 sequence-sets without known motifs, which we can use for prediction of novel motifs.
Conservation information is most useful when used in an alignment-free manner
The PRIORITY framework easily incorporates evolutionary conservation information in the form of positional priors. As described in detail in the ‘Materials and Methods’ section, a positional prior over a DNA sequence in a sequence-set is computed from a probabilistic score 𝒮(), which is simply an indication of the a priori probability that the W-mer at position j in sequence represents a TF binding site.Here, we compute probabilistic scores from evolutionary conservation information derived from the UCSC genome browser. For each intergenic region in S. cerevisiae, we obtain orthologous sequences from six related organisms (S. paradoxus, S. mikatae, S. kudriavzevii, S. bayanus, S. castelli and S. kluyveri) based on the MULTIZ and BLASTZ alignments (18). We compute the conservation-based scores 𝒮 and 𝒮, which are alignment-based, and 𝒮, which is alignment-free.
Comparison of scores
We first evaluate the conservation scores on 62 sequence-sets that correspond to TFs with minimally degenerate consensus motifs reported in the literature. More precisely, we chose the sequence-sets for which the known TF motif (or its core) satisfied the following criteria: (i) it was six to eight positions long, (ii) it was not too degenerate (i.e. it had at most four non-degenerate variants) and (iii) it occurred, in the forward or reverse orientation, at least 10 times in the sequence-set. We chose minimally degenerate consensus binding motifs with the hope that matches to these motifs are functional binding sites and not spurious matches. We imposed the other restrictions to insure that there were enough occurrences of the motifs to conduct our analysis. The exact sequence-sets and W-mers used in this analysis are available in Supplementary Table S1.For each of the 62 sequence-sets, we compute the average conservation score for all possible words of length W, where W is the size of the reported TF motif. Next, we look at the W-mers that match the motif and use their ranks to evaluate how promising each conservation score is: the higher the rank, the better the conservation score. The alignment-free score 𝒮 is better than the alignment-based scores 𝒮 and 𝒮: when sorted in decreasing order by conservation score, the W-mers that match the TF motifs are ranked highest according to 𝒮 in the majority of sequence-sets (Figure 3). Therefore, the alignment-free score 𝒮 provides a better indication of which DNA sites are truly conserved, and thus likely to be functional, than the scores based on alignments.
Figure 3.
Comparison of alignment-free and alignment-based conservation scores. (A) 𝒮, 𝒮 and 𝒮 for all W-mers in three sequence-sets: Tec1_BUT14, Mac1_H2O2Hi and Met32_SM. Higher scores for the W-mers that match the TF motifs indicate better conservation. In all these sequence-sets, the alignment-free score 𝒮 is the highest for the W-mers that match the TF motifs. (B) The fraction of W-mers with conservation scores smaller than the bound W-mers (i.e. the ones that match the TF motifs). Higher fractions indicate better conservation scores, with a fraction of 1 being ideal. Of the 62 tested sequence-sets, both 𝒮 (B) and 𝒮 (C) outperform 𝒮 in only six sequence-sets.
Comparison of alignment-free and alignment-based conservation scores. (A) 𝒮, 𝒮 and 𝒮 for all W-mers in three sequence-sets: Tec1_BUT14, Mac1_H2O2Hi and Met32_SM. Higher scores for the W-mers that match the TF motifs indicate better conservation. In all these sequence-sets, the alignment-free score 𝒮 is the highest for the W-mers that match the TF motifs. (B) The fraction of W-mers with conservation scores smaller than the bound W-mers (i.e. the ones that match the TF motifs). Higher fractions indicate better conservation scores, with a fraction of 1 being ideal. Of the 62 tested sequence-sets, both 𝒮 (B) and 𝒮 (C) outperform 𝒮 in only six sequence-sets.
Comparison of priors
Next, we use the conservation scores 𝒮, 𝒮 and 𝒮 to derive the informative positional priors 𝒞, 𝒯 and 𝒜, respectively. We incorporate each prior into PRIORITY and run the resulting algorithms (PRIORITY-𝒞, PRIORITY-𝒯 and PRIORITY-𝒜) on the 156 sequence-sets with known TF motifs. To assess the contribution of the conservation priors, we also run PRIORITY with a uniform prior that we call 𝒰. We count the number of successes for each algorithm, considering an algorithm successful on a particular sequence-set if the reported motif matches the known literature motif [according to a standard inter-motif distance criterion (5,28)]. PRIORITY is a stochastic algorithm, so different runs may give slightly different results. For this reason, we run each version of PRIORITY 50 times and report the median number of successes, along with the first and the third quartiles (Figure 4).
Figure 4.
Number of successes obtained by PRIORITY with different positional priors on the 156 sequence-sets. Each algorithm was run 50 times with the default settings (Supplementary Data). The height of each bar represents the median number of successes among the 50 runs. The plot also shows, as confidence intervals, the first and third quartiles for each algorithm.
Number of successes obtained by PRIORITY with different positional priors on the 156 sequence-sets. Each algorithm was run 50 times with the default settings (Supplementary Data). The height of each bar represents the median number of successes among the 50 runs. The plot also shows, as confidence intervals, the first and third quartiles for each algorithm.While all three conservation priors perform better than the uniform prior, the improvement is greater when the conservation information is used in an alignment-free manner: the median number of successes obtained by PRIORITY-𝒞 is 69, compared with 63 for PRIORITY-𝒯 and PRIORITY-𝒜 and 58 for PRIORITY-𝒰. Since the computation of the C prior uses our relaxed definition of conserved sites, we believe this approach is able to sidestep the alignment artifacts described in Figure 1, and hence pick up the true motif signal more often than the alignment-based priors 𝒯 and 𝒜. A detailed analysis for each sequence-set is available in the Supplementary Data.
Adopting a discriminative perspective improves results further
The scores 𝒮, 𝒮 and 𝒮 used to compute the priors 𝒞, 𝒯 and 𝒜, respectively, reflect the probability that a W-mer is conserved. While it is true that regions bound by the TF are more likely to be conserved, it does not follow that every conserved region is more likely to be bound by the profiled TF. Some conserved regions could be binding sites of other TFs or other functional DNA elements. To address this, we developed the discriminative conservation scores 𝒮, 𝒮 and 𝒮, which are specific to each profiled TF (see ‘Materials and Methods’ section). Intuitively, the discriminative conservation scores give higher weights to W-mers that are more conserved in the bound sequences than overall in the genome.Figure 5 shows the scores 𝒮 and 𝒮 over an intergenic sequence belonging to a sequence-set of Ste12. As can be seen, the conservation score computed with a discriminative perspective is effective in filtering out false conservation peaks, i.e. conserved DNA regions that are not specific to the profiled factor.
Figure 5.
(A) Scores 𝒮 and 𝒮 computed over intergenic region iYJL157C, bound by Ste12 after treatment with alpha factor (5). Binding sites of Dal80, Ste12 and Mcm1 are shown as annotated by MacIsaac et al. (14). 𝒮 has five tall peaks, marked with asterisks: two of them correspond to Ste12 binding sites, one to the Dal80 binding site and the two remaining peaks correspond to conserved A-T rich regions (not annotated here). The 𝒮 score is specific to the profiled factor: it retains only the peaks corresponding to the Ste12 sites, and filters out non-specific peaks corresponding to A-T rich regions or other conserved sites. (B) 𝒮 and 𝒮 for all W-mers in the Ste12_Alpha sequence-set. (C) The fraction of W-mers with conservation scores smaller than the bound W-mers. 𝒮 outperforms 𝒮 in 57 of the 62 tested sequence-sets, and is essentially equally good in four others. Furthermore, the fraction corresponding to 𝒮 is highly enriched towards the ideal fraction of 1.
(A) Scores 𝒮 and 𝒮 computed over intergenic region iYJL157C, bound by Ste12 after treatment with alpha factor (5). Binding sites of Dal80, Ste12 and Mcm1 are shown as annotated by MacIsaac et al. (14). 𝒮 has five tall peaks, marked with asterisks: two of them correspond to Ste12 binding sites, one to the Dal80 binding site and the two remaining peaks correspond to conserved A-T rich regions (not annotated here). The 𝒮 score is specific to the profiled factor: it retains only the peaks corresponding to the Ste12 sites, and filters out non-specific peaks corresponding to A-T rich regions or other conserved sites. (B) 𝒮 and 𝒮 for all W-mers in the Ste12_Alpha sequence-set. (C) The fraction of W-mers with conservation scores smaller than the bound W-mers. 𝒮 outperforms 𝒮 in 57 of the 62 tested sequence-sets, and is essentially equally good in four others. Furthermore, the fraction corresponding to 𝒮 is highly enriched towards the ideal fraction of 1.We use the scores 𝒮, 𝒮 and 𝒮 to build the positional priors 𝒟𝒞, 𝒟𝒯 and 𝒟𝒜, respectively. Note that if we assume a constant level of conservation across all W-mers, then priors 𝒞, 𝒯 and 𝒜 simplify to the widely used uniform prior U. Priors DC, 𝒟𝒯 and 𝒟𝒜, however, simplify to a special prior 𝒟 that reflects the relative frequency of each W-mer in versus both and [see (23) for benefits of using the discriminative prior 𝒟]. Here, we analyze the benefits of using conservation information in a discriminative manner.Indeed, for each of the priors 𝒞, 𝒯, 𝒜 and 𝒰, adopting a discriminative perspective helps find the true motif in many more instances than without doing so (Figure 4). PRIORITY-𝒟𝒞 is the most accurate, with a median number of successes of 77. A detailed analysis is available in the Supplementary Data.For both simple and discriminative priors, the conservation information is most useful when used in an alignment-free manner: PRIORITY-𝒞 is superior to PRIORITY-𝒯 and PRIORITY-A, and PRIORITY-𝒟𝒞 is superior to PRIORITY-𝒟𝒯 and PRIORITY-𝒟𝒜 (Figure 4). We will henceforth focus on the performance of our alignment-free motif finders: PRIORITY-𝒞 and PRIORITY-𝒟𝒞.
PRIORITY-𝒞 and -𝒟𝒞 perform better than current conservation-based methods
In this section, we compare the results of PRIORITY-𝒞 and PRIORITY-𝒟𝒞 with the results of seven conservation-based motif finders: MEME_c (5), a method of Kellis et al. (2), Converge (14), PhyloCon (6), PhyME (7), PhyloGibbs (10) and CompareProspector (13). All methods fall into the ‘multiple genes’ category, and thus search for motifs that are both overrepresented in a set of bound sequences from a species of reference and conserved across related species. Our alignment-free algorithms PRIORITY-𝒞 and PRIORITY-𝒟𝒞 are more effective at finding the true motif than all these methods, with PRIORITY-𝒟𝒞 performing best (Table 1).
Table 1.
Number of successes for different conservation-based methods
Program
Description
Number of successes
MEME_c
Alignment-based; masks non-conserved bases and then applies MEME
49
Kellis et al.
Alignment-based; searches for significantly conserved 3-gap-3 motifs, then extends them
47
Converge
Alignment-based; uses EM; incorporates conservation and evolutionary distances into the model
68
PhyloCon
Locally aligns conserved regions into profiles, compares profiles and merges them using a greedy approach
19
PhyME
Alignment-based; uses EM; evolutionary model accounts for binding site specificities
21
PhyloGibbs
Alignment-based; similar to PhyME, but uses Gibbs sampling; searches for multiple motifs simultaneously
54
CompareProspector
Alignment-based; uses Gibbs sampling; biases the search towards conserved windows
64
PRIORITY-
Alignment-free; incorporates a prior based on conserved W-mers into a Gibbs sampler
69
PRIORITY-
Alignment-free; incorporates a prior based on conserved W-mers in both bound and unbound sequences
77
For MEME_c and the method of Kellis et al. we use the results reported by Harbison et al. [5]; for Converge we use the results reported by MacIsaac et al. [14]. We ran PhyloCon, PhyME, PhyloGibbs and CompareProspector, as described in the Supplementary Data. For PRIORITY- and PRIORITY-, the numbers (shown in bold) represent medians across 50 runs of the algorithms (see main text for details).
Number of successes for different conservation-based methodsFor MEME_c and the method of Kellis et al. we use the results reported by Harbison et al. [5]; for Converge we use the results reported by MacIsaac et al. [14]. We ran PhyloCon, PhyME, PhyloGibbs and CompareProspector, as described in the Supplementary Data. For PRIORITY- and PRIORITY-, the numbers (shown in bold) represent medians across 50 runs of the algorithms (see main text for details).We did not compare with a few other methods in the ‘multiple genes’ category (8,11,12,35) due to one or more of the following reasons: some are so computationally expensive that running them on all 156 sequence-sets was practically impossible; some are designed for only two related organisms; and some have been reported to perform worse than methods we have included in our analysis (see Supplementary Data for details).
PRIORITY-𝒞 and -𝒟𝒞 are much faster than current conservation-based methods
PRIORITY with the alignment-free conservation priors outperforms other methods not only in terms of accuracy, but also speed (Supplementary Figure S3). Since the running times of PRIORITY-C and PRIORITY-𝒟𝒞 are comparable (with minor differences in the prior computation making PRIORITY-𝒞 slightly faster), we only discuss the times for PRIORITY-𝒟𝒞 here.The running time of PRIORITY-𝒟𝒞 varies only slightly with increasing numbers of sequences, and PRIORITY-𝒟𝒞 is faster than PhyloCon, PhyME and PhyloGibbs on all sequence-sets. On sets of 50 or more sequences, our algorithm is 2–3 orders of magnitude faster than the three methods. These results are not surprising: PRIORITY-𝒟𝒞 uses the conservation information only during prior computation, while PhyloCon, PhyME and PhyloGibbs analyze the orthologous regions at every step of the algorithm. The running times of CompareProspector and PRIORITY-𝒟𝒞 are more comparable, but PRIORITY-𝒟𝒞 scales better as the number of sequences grows, overtaking CompareProspector on sequence-sets of size 40 or more.
Predicting novel TF binding motifs using PRIORITY-𝒟𝒞
Since PRIORITY-𝒟𝒞 performed best among the tested algorithms, we next use it to predict novel yeast TF binding motifs. The yeast ChIP-chip data set contains 82 sequence-sets with at least 10 bound probes, which correspond to TFs without a consensus binding motif at the time the ChIP-chip experiments were performed. Subsequently, a few groups (14,36) have reported putative motifs in these sequence-sets using computational approaches, but often with little discussion of their biological significance. Here, we discuss in detail novel motifs obtained using PRIORITY-𝒟𝒞 on these sequences-sets, along with their implications in yeast biology.For each sequence-set, PRIORITY-𝒟𝒞 returns the top-scoring motif along with its log-posterior score (28). We assess the significance of a motif score by running PRIORITY on randomized sequence-sets and using the resulting scores to compute empirical distributions and consequently empirical P-values. For each novel motif reported by PRIORITY-𝒟𝒞 we compute the associated P-value, which reflects our confidence in the accuracy of the motif. We choose a P-value cutoff of 2 × 10−7, which corresponds to an estimated false discovery rate of 15% (see Supplementary Data for details). Out of the 82 motifs predicted for sequence-sets without a known TF binding motif, 16 motifs have a P-value smaller than the chosen cutoff (Figure 6). Since PRIORITY-𝒟𝒞 is a stochastic algorithm, for each of the 16 predicted motifs we verified that even if we run the algorithm several times on a particular sequence-set, the reported motifs are highly similar (or exactly the same), and the P-values associated with the motif scores are always below the chosen cutoff.
Figure 6.
Novel TF binding motifs predicted by PRIORITY-𝒟𝒞.
Novel TF binding motifs predicted by PRIORITY-𝒟𝒞.The first three motifs shown in Figure 6 correspond to Dig1 profiled under different environmental conditions: Alpha (treatment with the alpha pheromone, which induces mating), BUT90 (butanol treatment for 90 min) and BUT14 (butanol treatment for 14 h, which induces filamentation). Dig1 is not currently known to bind DNA directly, but only through Ste12 or Tec1 during mating and filamentation, respectively (37). The predictions made by PRIORITY-𝒟𝒞 are consistent with the literature: the motif found in the Dig1_Alpha sequence-set is a very good match to the Ste12 motif, while the motif found in Dig1_BUT14 matches the Tec1 motif. It is not clear what complexes bind DNA when cells are treated with butanol for a short duration (90 min in this case), so currently we cannot evaluate the prediction made by PRIORITY-𝒟𝒞 in the Dig1_BUT90 sequence-set.The next three motifs correspond to Fhl1 under different cellular conditions, and they all match the Rap1 consensus motif. Both Fhl1 and Rap1 associate with promoters of ribosomal protein genes (38), and a recent study (39) has shown that among the three main factors that control transcription of ribosomal protein genes (Rap1, Fhl1 and Ifh1), only Rap1 binds DNA directly. This evidence, together with evidence of direct interaction between Rap1 and Fhl1 (39), supports the hypothesis that Fhl1 binds DNA indirectly through Rap1, and thus our predictions for the sequence-sets of Fhl1 are very likely to be correct. Similarly, Sfp1 may bind DNA indirectly through Rap1 (40), and the Rap1 motif is indeed predicted by PRIORITY-𝒟𝒞 in the Sfp1_SM sequence-set.For Ime4_YPD and Mal33_H2O2Hi, we find the repetitive motif TG to be highly significant. Although these motifs may play a role in disrupting the chromatin structure (41), we do not believe they represent motifs of Ime4 or Mal33. To our knowledge, DNA binding motifs of these factors have not yet been reported in the literature. If we mask the TG repeats and run PRIORITY-𝒟𝒞 again, in both sequence-sets we obtain motifs with lower scores, which do not pass our significance criterion.As for predictions 10–16, in at least four of these cases we believe the motifs found by PRIORITY-𝒟𝒞 are correct: Phd1_BUT90, Rfx1_YPD, Ydr026c_YPD and Stb1_YPD. The motif predicted for the Phd1_BUT90 sequence-set is consistent with the Phd1 motif reported in two recent in vitro studies (42,43). The DNA binding motif of Rfx1 has been recently listed in TRANSFAC 11.1, and also reported by Badis et al. (42) and Zhu et al. (43), matching the motif predicted by PRIORITY-𝒟𝒞. The predicted motif for the putative TF Ydr026c is a very good match to the known Reb1 binding motif (44). Although an experimentally verified binding site is not currently available for Ydr026c, it is known that it has a strong similarity to the DNA-binding protein Reb1 (45). In the Stb1_YPD sequence-set, we predict a motif that matches the known Swi6 binding specificity (46), which is consistent with what is known in the literature: Stb1 plays a role in the regulation of MBF-specific transcription, and it is known to bind in vitro to Swi6 (47), a member of the MBF complex.The motifs we predict for sequence-sets Phd1_YPD, Snt2_YPD and Sok2_BUT14 remain to be verified. We note, however, that our predictions for these sequence-sets are in good agreement with motifs obtained using other computational methods (5,14,36).
DISCUSSION
We are not the first to use alignment-free conservation across species to find motifs. Elemento and Tavazoie (48) look for conserved regulatory elements by scanning a pair of related genomes for on the order of 400 highly enriched W-mers. They then use a hypergeometric distribution to evaluate the significance of each of these W-mers in bound ChIP-chip sets. However, using this method, they are able to assign a W-mer that matches to the true motif to only 15 TFs. Since they limit their analysis to reporting W-mers, it is possible that they are not able to find TF motifs that have greater sequence variation. In contrast, though our scores 𝒮 are also computed over W-mers, we use them only to construct positional priors; our Gibbs sampler returns a full PSSM, which handles sequence variations in the binding sites. In addition, the approach of Elemento and Tavazoie (48) is limited to pairs of related organisms, and thus the choice of organisms becomes crucial for the success of the algorithm.One potential limitation of our approach is that the conservation priors are computed by counting only exact matches between the W-mers in the reference genome and W-mers in the related genomes. We have also tried computing priors similar to 𝒞 and 𝒟𝒞 that allow for a mismatch when searching for conserved words. Specifically, an 8-mer was defined as ‘conserved’ in an orthologous sequence if the sequence contained either an exact match to that 8-mer or any of the 24 8-mers that differed at exactly one position. The effect of allowing one mismatch anywhere in the W-mer was that the signal of truly conserved sites was mixed with noise due to the 24 possible 8-mers, and overall these priors were not as effective as 𝒞 and 𝒟𝒞. Allowing for more than one mismatch may further dilute the signal of conserved sites. However, prior knowledge about the structure of the binding site (for example, if we know we should be searching for a gapped motif) might be useful in restricting the mismatches to certain positions.
Our alignment-free approach can be used with both closely and distantly related organisms
In this article, we show how multiple unaligned genomes can be successfully used for motif discovery. Our method can be applied to any number of genomes. For instance, we independently computed six variant 𝒟𝒞 priors using: only the single closest species (S. paradoxus); the two closest species (S. paradoxus and S. mikatae); the three closest species (S. paradoxus, S. mikatae and S. kudriavzevii); and so on. PRIORITY-𝒟𝒞 consistently found 69 or more motifs with each of these variant priors. The general trend indicated that adding more organisms improves performance. For other algorithms, however, the choice of related organisms is crucial for the success of the algorithm. PhyloGibbs, for example, works well when using the sensu stricto Saccharomyces species, but the performance drops dramatically if we include more distantly related species.The sensu stricto species (S. paradoxus, S. mikatae, S. kudriavzevii and S. bayanus) provide most of the conservation information in the priors. However, since these species are closely related to S. cerevisiae, their intergenic regions may contain many non-functional conserved sites, simply because not enough evolutionary time has passed since the species diverged from their common ancestor. Although in this case many DNA sites will appear to be conserved—and thus functional—this does not pose a problem for our conservation-based algorithm because the information in the cobound sequences helps reduce the space of putative TF binding sites to those conserved DNA sites that also appear in most of the cobound sequences. Furthermore, the more distantly related species S. castelli and S. kluyveri provide some of the sequence divergence necessary for filtering out the conserved non-functional sites. According to a study by Cliften et al. (49), only a small number of the intergenic regions in the S. castelli and S. kluyveri genomes can be aligned to S. cerevisiae regions, and only after the corresponding orthologous genes have been identified. Even then, the conserved regulatory sites may be hard, if not impossible, to align correctly. Hence, alignment-based motif finders may not be able to fully exploit the information provided by the two distantly related species, while our alignment-free algorithm can.
PRIORITY-𝒞 and -𝒟𝒞 scale well with the size of the sequence-set and the number of organisms
Our conservation-based approach is much faster than current motif finding algorithms. It only needs a few minutes to compute a motif, even on the largest sequence-set, while other methods required days or in some cases months. Interestingly, other methods become slower precisely because they use conservation information, but our method actually speeds up: the informative prior computed from conservation information facilitates rapid convergence to the posterior, as evidenced by the fact that PRIORITY-𝒟𝒞 reaches convergence faster than PRIORITY-𝒰 (data not shown).In Supplementary Figure S3, we show that PRIORITY-𝒟𝒞 scales well with the size of the sequence-set. A similar analysis can be done by keeping the size of the sequence-set fixed but varying the number of orthologs for each sequence. The running time for PRIORITY-𝒟𝒞 varies only slightly when we increase the number of orthologous sequences, while the running time of other methods increases substantially.
PRIORITY-𝒞 and -𝒟𝒞 do not require a phylogenetic tree
Currently, the derivation of our alignment-free priors 𝒞 and 𝒟𝒞 does not take phylogenetic information into account, mainly because high-quality phylogenetic trees are usually hard to compute. However, when such a tree is available, our algorithm can easily incorporate the phylogenetic information into the priors. A simple approach is to weight the sequences in each organism (and thus the occurrences of W-mers in these sequences) according to the evolutionary distance between that organism and the reference organism. We have derived such a weighting scheme for the Saccharomyces species using the phylogenetic tree reported by Siepel et al. (18). However, on these data sets, the conservation priors computed using the weighted sequences did not show any improvement over the initial conservation priors, 𝒞 and 𝒟𝒞. Other approaches for incorporating phylogenetic information may also be used. For example, one could use the branch length score developed by Stark et al. (21)—which takes into account phylogenetic information—as a probabilistic score (see ‘Materials and Methods’ section). This score would then be converted into a prior and incorporated into PRIORITY.It is important to make the distinction between discovering the TF binding motifs (which is the focus of our analysis) and finding the exact locations of the binding sites of a TF across the genome based on a known model (this latter problem is sometimes called motif scanning). In our work, we show that using conservation information in the form of aligned sequences or phylogenetic models is not the best solution for improving the prediction accuracy of computational motif finders. In a recent analysis, Hawkins and Bailey (50) have shown that motif scanning also does not benefit significantly from alignment-based conservation information and complex phylogenetic models. In another recent paper, Ward and Bussemaker (51) have shown the advantages of using orthologous promoter sequences in an alignment-free manner to solve yet another problem: the discovery of functional regulatory targets of TFs.
Using TF binding data from higher organisms
PRIORITY is not restricted to S. cerevisiae data, but can also be used on TF binding data from higher organisms. Unfortunately, an experimental study similar to the one performed by Harbison et al. (5) in yeast is not currently available in more complex organisms, so a thorough analysis of the performance of our alignment-free approach on data from more complex organisms is not currently possible. We did, however, as a proof of principle, apply PRIORITY on fly, mouse and human TF binding data. On all the sequence-sets described below, PRIORITY was applied with the same parameters as on the yeast sequence-sets, except for the conservation-based priors and the background model, which are specific to the organism and the sequence-set. For each sequence-set, each version of PRIORITY was run 50 times, and for each run we computed the distance between the literature motif and and the PRIORITY motif. We consider that an algorithm finds the correct motif in a sequence-set if the median distance between the literature motif and the reported motif is <0.25 (the distance criterion used for the yeast data). Details on the literature motifs and the exact organisms used in each analysis are available in the Supplementary Data.We used fly TF binding data from Zhou and Wong (52), who collected over 60 enhancer sequences controlling 20 different genes expressed during the early development of D. melanogaster. Based on known regulatory interactions, they built three sequences-sets, each of which contained all enhancers believed to be bound by one of the three TFs Bicoid (Bcd), Hunchback (Hb) and Krüppel (Kr). As described earlier for the yeast ChIP-chip sequence-sets, we used orthologous regulatory regions to compute conservation priors 𝒞, 𝒯 and 𝒜, and we applied PRIORITY-𝒞, PRIORITY-𝒯, PRIORITY-𝒜 and PRIORITY-𝒰 on the three fly sequence-sets. (Note that we cannot compute discriminative priors since a set of unbound sequences is not available.) PRIORITY-𝒰 and PRIORITY-𝒜 find only one correct motif (Hb), PRIORITY-𝒯 finds two correct motifs (Hb and Kr), while PRIORITY-C finds all three motifs correctly (Supplementary Figures S4 and S5).We also applied PRIORITY-𝒰, -𝒜, -𝒯, -𝒞, -𝒟, -𝒟𝒜, -𝒟𝒯 and -𝒟𝒞 on 12 mouse ChIP-seq data sets from Chen et al. (53), as compiled by Bailey et al. (personal communication). All eight algorithms were successful in 9 of the 12 mouse sequence-sets (Supplementary Figures S6–S10). However, the alignment-free priors seem to perform better than the other priors: an analysis of the median distance between the PRIORITY motifs and the literature motifs showed that PRIORITY-𝒞 is at least as good as PRIORITY-𝒰 in 7 of the 9 successful sequence-sets, while PRIORITY-𝒯 and PRIORITY-𝒜 are at least as good as PRIORITY-𝒰 in only 4 of the 9 successful sets.Discovery of human TF binding motifs can also benefit from using alignment-free conservation information. We applied PRIORITY-𝒰, PRIORITY-𝒜, PRIORITY-𝒯 and PRIORITY-𝒞 on three sequence-sets containing promoters bound by the human TFs HNF1, HNF4 and HNF6 in human hepatocytes, derived from ChIP-chip experiments by Odom et al. (54). Supplementary Figures S11 and S12 show the results on the three sequence-sets: PRIORITY-𝒰, PRIORITY-𝒜 and PRIORITY-𝒯 find the corrrect motif in only one sequence-set (HNF6), while PRIORITY-𝒞 is successful in the HNF4 and HNF6 sequence-sets.Although the fly, mouse and human TF binding data sets discussed in this section are not as comprehensive as the yeast ChIP-chip data of Harbison et al. (5), the results show that incorporating conservation information in an alignment-free manner improves motif discovery not only in the case of yeast TFs but also TFs from more complex organisms. Furthermore, our approach is not restricted to ChIP-chip data, but can be used on ChIP-seq data or any set of co-regulated sequences.
CONCLUSION
Sequence alignments are undoubtedly very useful for the analysis of genomic data. For example, many genes are detected in newly sequenced organisms based on their homology to genes in related, well-studied species. Once homologous genes are detected, one can also align their regulatory regions with the hope of finding conserved TF binding motifs. We show, however, that for this purpose, alignments of orthologous regions can be misleading. Due to the short length of most binding sites, alignment algorithms are very likely to misalign true functional sites that are actually conserved across species (Figure 1). Furthermore, different algorithms may build very different alignments, in which different DNA sites appear to be conserved, so choosing the alignment algorithm becomes crucial for finding the conserved TF binding sites. Our method overcomes these issues because it uses cross-species conservation information without relying on alignments. In doing so, it outperforms currently used conservation-based programs in both speed and accuracy.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
FUNDING
National Institutes of Health (P50-GM081883-01, R01-ES015165-01); DARPA (HR0011-08-1-0023, HR0011-09-1-0040); a National Science Foundation CAREER award (NSF 0347801); Alfred P. Sloan Research Fellowship (to A.J.H.). Funding for open access charge: DARPA HR0011-09-1-0040.Conflict of interest statement. None declared.
Authors: W James Kent; Charles W Sugnet; Terrence S Furey; Krishna M Roskin; Tom H Pringle; Alan M Zahler; David Haussler Journal: Genome Res Date: 2002-06 Impact factor: 9.043
Authors: Alexander Stark; Michael F Lin; Pouya Kheradpour; Jakob S Pedersen; Leopold Parts; Joseph W Carlson; Madeline A Crosby; Matthew D Rasmussen; Sushmita Roy; Ameya N Deoras; J Graham Ruby; Julius Brennecke; Emily Hodges; Angie S Hinrichs; Anat Caspi; Benedict Paten; Seung-Won Park; Mira V Han; Morgan L Maeder; Benjamin J Polansky; Bryanne E Robson; Stein Aerts; Jacques van Helden; Bassem Hassan; Donald G Gilbert; Deborah A Eastman; Michael Rice; Michael Weir; Matthew W Hahn; Yongkyu Park; Colin N Dewey; Lior Pachter; W James Kent; David Haussler; Eric C Lai; David P Bartel; Gregory J Hannon; Thomas C Kaufman; Michael B Eisen; Andrew G Clark; Douglas Smith; Susan E Celniker; William M Gelbart; Manolis Kellis Journal: Nature Date: 2007-11-08 Impact factor: 49.962
Authors: Lee A Newberg; William A Thompson; Sean Conlan; Thomas M Smith; Lee Ann McCue; Charles E Lawrence Journal: Bioinformatics Date: 2007-05-08 Impact factor: 6.937
Authors: Paul Cliften; Priya Sudarsanam; Ashwin Desikan; Lucinda Fulton; Bob Fulton; John Majors; Robert Waterston; Barak A Cohen; Mark Johnston Journal: Science Date: 2003-05-29 Impact factor: 47.728
Authors: Andrew G Clark; Michael B Eisen; Douglas R Smith; Casey M Bergman; Brian Oliver; Therese A Markow; Thomas C Kaufman; Manolis Kellis; William Gelbart; Venky N Iyer; Daniel A Pollard; Timothy B Sackton; Amanda M Larracuente; Nadia D Singh; Jose P Abad; Dawn N Abt; Boris Adryan; Montserrat Aguade; Hiroshi Akashi; Wyatt W Anderson; Charles F Aquadro; David H Ardell; Roman Arguello; Carlo G Artieri; Daniel A Barbash; Daniel Barker; Paolo Barsanti; Phil Batterham; Serafim Batzoglou; Dave Begun; Arjun Bhutkar; Enrico Blanco; Stephanie A Bosak; Robert K Bradley; Adrianne D Brand; Michael R Brent; Angela N Brooks; Randall H Brown; Roger K Butlin; Corrado Caggese; Brian R Calvi; A Bernardo de Carvalho; Anat Caspi; Sergio Castrezana; Susan E Celniker; Jean L Chang; Charles Chapple; Sourav Chatterji; Asif Chinwalla; Alberto Civetta; Sandra W Clifton; Josep M Comeron; James C Costello; Jerry A Coyne; Jennifer Daub; Robert G David; Arthur L Delcher; Kim Delehaunty; Chuong B Do; Heather Ebling; Kevin Edwards; Thomas Eickbush; Jay D Evans; Alan Filipski; Sven Findeiss; Eva Freyhult; Lucinda Fulton; Robert Fulton; Ana C L Garcia; Anastasia Gardiner; David A Garfield; Barry E Garvin; Greg Gibson; Don Gilbert; Sante Gnerre; Jennifer Godfrey; Robert Good; Valer Gotea; Brenton Gravely; Anthony J Greenberg; Sam Griffiths-Jones; Samuel Gross; Roderic Guigo; Erik A Gustafson; Wilfried Haerty; Matthew W Hahn; Daniel L Halligan; Aaron L Halpern; Gillian M Halter; Mira V Han; Andreas Heger; LaDeana Hillier; Angie S Hinrichs; Ian Holmes; Roger A Hoskins; Melissa J Hubisz; Dan Hultmark; Melanie A Huntley; David B Jaffe; Santosh Jagadeeshan; William R Jeck; Justin Johnson; Corbin D Jones; William C Jordan; Gary H Karpen; Eiko Kataoka; Peter D Keightley; Pouya Kheradpour; Ewen F Kirkness; Leonardo B Koerich; Karsten Kristiansen; Dave Kudrna; Rob J Kulathinal; Sudhir Kumar; Roberta Kwok; Eric Lander; Charles H Langley; Richard Lapoint; Brian P Lazzaro; So-Jeong Lee; Lisa Levesque; Ruiqiang Li; Chiao-Feng Lin; Michael F Lin; Kerstin Lindblad-Toh; Ana Llopart; Manyuan Long; Lloyd Low; Elena Lozovsky; Jian Lu; Meizhong Luo; Carlos A Machado; Wojciech Makalowski; Mar Marzo; Muneo Matsuda; Luciano Matzkin; Bryant McAllister; Carolyn S McBride; Brendan McKernan; Kevin McKernan; Maria Mendez-Lago; Patrick Minx; Michael U Mollenhauer; Kristi Montooth; Stephen M Mount; Xu Mu; Eugene Myers; Barbara Negre; Stuart Newfeld; Rasmus Nielsen; Mohamed A F Noor; Patrick O'Grady; Lior Pachter; Montserrat Papaceit; Matthew J Parisi; Michael Parisi; Leopold Parts; Jakob S Pedersen; Graziano Pesole; Adam M Phillippy; Chris P Ponting; Mihai Pop; Damiano Porcelli; Jeffrey R Powell; Sonja Prohaska; Kim Pruitt; Marta Puig; Hadi Quesneville; Kristipati Ravi Ram; David Rand; Matthew D Rasmussen; Laura K Reed; Robert Reenan; Amy Reily; Karin A Remington; Tania T Rieger; Michael G Ritchie; Charles Robin; Yu-Hui Rogers; Claudia Rohde; Julio Rozas; Marc J Rubenfield; Alfredo Ruiz; Susan Russo; Steven L Salzberg; Alejandro Sanchez-Gracia; David J Saranga; Hajime Sato; Stephen W Schaeffer; Michael C Schatz; Todd Schlenke; Russell Schwartz; Carmen Segarra; Rama S Singh; Laura Sirot; Marina Sirota; Nicholas B Sisneros; Chris D Smith; Temple F Smith; John Spieth; Deborah E Stage; Alexander Stark; Wolfgang Stephan; Robert L Strausberg; Sebastian Strempel; David Sturgill; Granger Sutton; Granger G Sutton; Wei Tao; Sarah Teichmann; Yoshiko N Tobari; Yoshihiko Tomimura; Jason M Tsolas; Vera L S Valente; Eli Venter; J Craig Venter; Saverio Vicario; Filipe G Vieira; Albert J Vilella; Alfredo Villasante; Brian Walenz; Jun Wang; Marvin Wasserman; Thomas Watts; Derek Wilson; Richard K Wilson; Rod A Wing; Mariana F Wolfner; Alex Wong; Gane Ka-Shu Wong; Chung-I Wu; Gabriel Wu; Daisuke Yamamoto; Hsiao-Pei Yang; Shiaw-Pyng Yang; James A Yorke; Kiyohito Yoshida; Evgeny Zdobnov; Peili Zhang; Yu Zhang; Aleksey V Zimin; Jennifer Baldwin; Amr Abdouelleil; Jamal Abdulkadir; Adal Abebe; Brikti Abera; Justin Abreu; St Christophe Acer; Lynne Aftuck; Allen Alexander; Peter An; Erica Anderson; Scott Anderson; Harindra Arachi; Marc Azer; Pasang Bachantsang; Andrew Barry; Tashi Bayul; Aaron Berlin; Daniel Bessette; Toby Bloom; Jason Blye; Leonid Boguslavskiy; Claude Bonnet; Boris Boukhgalter; Imane Bourzgui; Adam Brown; Patrick Cahill; Sheridon Channer; Yama Cheshatsang; Lisa Chuda; Mieke Citroen; Alville Collymore; Patrick Cooke; Maura Costello; Katie D'Aco; Riza Daza; Georgius De Haan; Stuart DeGray; Christina DeMaso; Norbu Dhargay; Kimberly Dooley; Erin Dooley; Missole Doricent; Passang Dorje; Kunsang Dorjee; Alan Dupes; Richard Elong; Jill Falk; Abderrahim Farina; Susan Faro; Diallo Ferguson; Sheila Fisher; Chelsea D Foley; Alicia Franke; Dennis Friedrich; Loryn Gadbois; Gary Gearin; Christina R Gearin; Georgia Giannoukos; Tina Goode; Joseph Graham; Edward Grandbois; Sharleen Grewal; Kunsang Gyaltsen; Nabil Hafez; Birhane Hagos; Jennifer Hall; Charlotte Henson; Andrew Hollinger; Tracey Honan; Monika D Huard; Leanne Hughes; Brian Hurhula; M Erii Husby; Asha Kamat; Ben Kanga; Seva Kashin; Dmitry Khazanovich; Peter Kisner; Krista Lance; Marcia Lara; William Lee; Niall Lennon; Frances Letendre; Rosie LeVine; Alex Lipovsky; Xiaohong Liu; Jinlei Liu; Shangtao Liu; Tashi Lokyitsang; Yeshi Lokyitsang; Rakela Lubonja; Annie Lui; Pen MacDonald; Vasilia Magnisalis; Kebede Maru; Charles Matthews; William McCusker; Susan McDonough; Teena Mehta; James Meldrim; Louis Meneus; Oana Mihai; Atanas Mihalev; Tanya Mihova; Rachel Mittelman; Valentine Mlenga; Anna Montmayeur; Leonidas Mulrain; Adam Navidi; Jerome Naylor; Tamrat Negash; Thu Nguyen; Nga Nguyen; Robert Nicol; Choe Norbu; Nyima Norbu; Nathaniel Novod; Barry O'Neill; Sahal Osman; Eva Markiewicz; Otero L Oyono; Christopher Patti; Pema Phunkhang; Fritz Pierre; Margaret Priest; Sujaa Raghuraman; Filip Rege; Rebecca Reyes; Cecil Rise; Peter Rogov; Keenan Ross; Elizabeth Ryan; Sampath Settipalli; Terry Shea; Ngawang Sherpa; Lu Shi; Diana Shih; Todd Sparrow; Jessica Spaulding; John Stalker; Nicole Stange-Thomann; Sharon Stavropoulos; Catherine Stone; Christopher Strader; Senait Tesfaye; Talene Thomson; Yama Thoulutsang; Dawa Thoulutsang; Kerri Topham; Ira Topping; Tsamla Tsamla; Helen Vassiliev; Andy Vo; Tsering Wangchuk; Tsering Wangdi; Michael Weiand; Jane Wilkinson; Adam Wilson; Shailendra Yadav; Geneva Young; Qing Yu; Lisa Zembek; Danni Zhong; Andrew Zimmer; Zac Zwirko; David B Jaffe; Pablo Alvarez; Will Brockman; Jonathan Butler; CheeWhye Chin; Sante Gnerre; Manfred Grabherr; Michael Kleber; Evan Mauceli; Iain MacCallum Journal: Nature Date: 2007-11-08 Impact factor: 49.962
Authors: Gabriel Cuellar-Partida; Fabian A Buske; Robert C McLeay; Tom Whitington; William Stafford Noble; Timothy L Bailey Journal: Bioinformatics Date: 2011-11-08 Impact factor: 6.937
Authors: Dominic Schmidt; Michael D Wilson; Benoit Ballester; Petra C Schwalie; Gordon D Brown; Aileen Marshall; Claudia Kutter; Stephen Watt; Celia P Martinez-Jimenez; Sarah Mackay; Iannis Talianidis; Paul Flicek; Duncan T Odom Journal: Science Date: 2010-04-08 Impact factor: 47.728
Authors: Khalid Mahmood; Geoffrey I Webb; Jiangning Song; James C Whisstock; Arun S Konagurthu Journal: Nucleic Acids Res Date: 2011-12-30 Impact factor: 16.971
Authors: Holger Hartmann; Eckhart W Guthöhrlein; Matthias Siebert; Sebastian Luehr; Johannes Söding Journal: Genome Res Date: 2012-09-18 Impact factor: 9.043