Kjetil Klepper, Geir K Sandve, Osman Abul, Jostein Johansen, Finn Drablos.
Abstract
BACKGROUND: Computational discovery of regulatory elements is an important area of bioinformatics research and more than a hundred motif discovery methods have been published. Traditionally, most of these methods have addressed the problem of single motif discovery - discovering binding motifs for individual transcription factors. In higher organisms, however, transcription factors usually act in combination with nearby bound factors to induce specific regulatory behaviours. Hence, recent focus has shifted from single motifs to the discovery of sets of motifs bound by multiple cooperating transcription factors, so-called composite motifs or cis-regulatory modules. Given the large number and diversity of methods available, independent assessment of methods becomes important. Although there have been several benchmark studies of single motif discovery, no similar studies have previously been conducted concerning composite motif discovery.
Year: 2008 PMID: 18302777 PMCID: PMC2311304 DOI: 10.1186/1471-2105-9-123
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Datasets
| Dataset | Sequences | Modules | Total length (bp) | Module size range (bp) |
| AP1-Ets | 16 | 17 | 14860 | 14 – 99 (27) |
| AP1-NFAT | 8 | 11 | 6893 | 14 – 19 (16) |
| AP1-NFκB | 7 | 8 | 6532 | 18 – 135 (53) |
| CEBP-NFκB | 8 | 8 | 7308 | 44 – 118 (84) |
| Ebox-Ets | 4 | 6 | 3489 | 16 – 50 (25) |
| Ets-AML | 5 | 5 | 4053 | 13 – 30 (19) |
| IRF-NFκB | 6 | 6 | 5344 | 23 – 71 (43) |
| NFκB-HMGIY | 6 | 7 | 5393 | 10 – 32 (13) |
| PU1-IRF | 5 | 5 | 4530 | 12 – 14 (13) |
| Sp1-Ets | 7 | 8 | 5787 | 16 – 117 (37) |
| Liver | 12 | 14 | 11943 | 26 – 176 (112) |
| Muscle | 24 | 24 | 20427 | 14 – 294 (120) |
A brief overview of the ten TRANSCompel sequence sets and the liver and muscle datasets used in the assessment. Further information can be found in Additional File 1.
Description of module discovery tools
| CisModule | CisModule models the structure of sequences with a two-level hierarchical mixture-model and uses a Bayesian approach with Gibbs sampling to simultaneously infer the modules, TFBSs and PWMs based on their joint posterior distribution, which is the probability of a model given the input sequence set. At the first level, sequences are viewed as a mixture of module instances and background. At the second level, modules are modelled as a mixture of motifs and inter-module background. Parameters of the model include the widths and representations (PWMs) of single motifs and parameters related to distances between modules and between TFBS within modules. From a random initialization, CisModule iteratively cycles through steps of parameter update and module-motif detection. New parameter values are sampled from their conditional posterior distributions based on the currently predicted modules and motifs, and new predictions of modules and TFBSs are then sampled based on these updated parameter values. Positions in the sequences where the marginal posterior probability of being sampled within modules was greater than 0.5 were output as module predictions. |
| Cister | Given a set of PWMs and parameters specifying the expected number of motifs in modules, the expected distances between motifs in modules and the expected distance between modules, |
| Cluster-Buster | Cluster-Buster is developed by the same group that made |
| Composite Module Analyst (CMA) | The promoter model in CMA is expressed as a Boolean combination of one or more |
| MCAST | MCAST builds a HMM-model consisting of an intra-module state, an inter-module state and motif-states based on the supplied PWMs. The score for a motif-state is called a |
| ModuleSearcher | Given a list of PWM hits with match scores for putative TFBSs in a sequence set, ModuleSearcher finds the module model (set of |
| MSCAN | MSCAN discovers modules by evaluating the combined statistical significance of sets of potential non-overlapping TF binding sites in a sliding window along the input sequence. PWMs are compared against each position within the window to obtain match scores, and |
| Stubb | The HMM used by Stubb consists of motif states based on supplied PWMs and a single background state based on a |
The table contains short descriptions of the eight methods included in the assessment. All methods except for CisModule rely on supplied PWMs and consider matches on both strands, usually with equal probability (however, Stubb can estimate strand biases for all PWMs in a preprocessing step). Not all methods are able to consider overlapping single binding sites, which do occur in a few modules.
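As a rough illustration of what "relying on supplied PWMs and considering matches on both strands" can look like, the following minimal sketch scores every window of a sequence against a PWM on the forward strand and on the reverse complement. It is not taken from any of the eight tools: the function names, the pseudocount, the uniform background of 0.25 and the toy matrix are all assumptions made for the example, and sequences are assumed to contain only uppercase A, C, G and T.

```python
import math

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def log_odds(pwm, background=0.25, pseudo=0.01):
    """Turn per-column base probabilities into log2-odds scores.

    The pseudocount and uniform background are illustrative assumptions.
    """
    return [
        {b: math.log2((col[b] + pseudo) / (1 + 4 * pseudo) / background)
         for b in BASES}
        for col in pwm
    ]


def scan_both_strands(seq, pwm, threshold):
    """Return (start, strand, score) for each window scoring >= threshold."""
    scores = log_odds(pwm)
    width = len(scores)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        # Score the window as written and as read from the opposite strand.
        for strand, site in (("+", window), ("-", reverse_complement(window))):
            score = sum(scores[i][base] for i, base in enumerate(site))
            if score >= threshold:
                hits.append((start, strand, round(score, 2)))
    return hits


# Toy 4-column PWM that prefers TGAC (made-up values, not a TRANSFAC matrix).
toy_pwm = [
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
]
hits = scan_both_strands("AATGACGGGTCATT", toy_pwm, threshold=4.0)
```

In this toy sequence the forward-strand site TGAC and its reverse complement GTCA are both reported, which is the behaviour the caption describes as considering both strands with equal probability.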
Figure 1. Nucleotide-level correlation scores on the TRANSCompel dataset. The graphs show nCC scores for each of the ten sequence sets in the TRANSCompel dataset when methods are supplied with TRANSFAC PWM sets (a) and custom matrices (b).
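The text here does not spell out how nCC is computed. A minimal sketch, assuming nCC is the usual correlation coefficient between predicted and annotated per-nucleotide labels (the form commonly used in motif discovery benchmarks), might look as follows. The function name, the 0/1 label encoding and the convention of returning 0.0 when the denominator vanishes are assumptions of this example.

```python
import math


def nucleotide_cc(predicted, annotated):
    """Correlation between two equal-length 0/1 label lists.

    A label of 1 means the position lies inside a (predicted or
    annotated) module. Returns 0.0 when the coefficient is undefined.
    """
    if len(predicted) != len(annotated):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(predicted, annotated))
    tp = sum(1 for p, a in pairs if p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return (tp * tn - fn * fp) / denom if denom else 0.0


# Toy example: a 4 bp annotated module and a prediction shifted by one base.
annotated = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
predicted = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
ncc = nucleotide_cc(predicted, annotated)  # 14 / 24
```

A score of 1 corresponds to a prediction that matches the annotation exactly, and scores near 0 correspond to predictions no better than chance.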
Correlations between dataset properties and nCC scores
| Property | Average | Highest | Average | Highest |
| Number of sequences | -0.23 | -0.16 | -0.23 | -0.05 |
| Length of shortest sequence | 0.30 | 0.18 | 0.30 | 0.13 |
| Average sequence length | 0.40 | 0.33 | 0.42 | 0.43 |
| Total sequence set length | -0.19 | -0.12 | -0.18 | -0.02 |
| Number of module instances | -0.38 | -0.32 | -0.40 | -0.19 |
| Size of smallest module | ||||
| Size of largest module | 0.26 | 0.34 | 0.19 | 0.35 |
| Average module size | 0.59 | |||
| Module size standard deviation | 0.23 | 0.29 | 0.13 | 0.29 |
| IC-content (lowest) | 0.46 | 0.45 | 0.47 | |
| IC-content (total) | 0.54 | |||
| Module/background-ratio | 0.53 | 0.61 | 0.51 | |
We conducted a simple correlation analysis to examine which properties of the TRANSCompel sequence sets and PWMs correlated best with the highest and average nCC scores obtained by the methods on these sets. "IC-content (lowest)" is the information content (IC) of the PWM with the lowest IC of the two involved in each sequence set. The information content of a PWM is inversely related to the amount of variability in the binding patterns from which the PWM is constructed [38]. PWMs with higher information content are more specific and match only sites with a high degree of similarity to the consensus motif. "IC-content (total)" is the sum of IC-contents for the two motifs (for TRANSFAC PWMs we used the PWM with the highest IC in each equivalence set to represent the motif). The three highest values are highlighted in each column. The properties that seem to correlate best with methods' performances are the minimum and average size of modules (in basepairs) and the total IC-content, which would imply that module discovery is harder for datasets containing short and degenerate modules.
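The two quantities used in this analysis can be sketched in a few lines. The caption does not state which correlation measure or which IC formula was used, so this example assumes a Pearson correlation and the common definition of information content against a uniform background (2 bits minus the column entropy, summed over columns). All names and input values below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def information_content(pwm):
    """Total IC in bits, assuming a uniform background over A, C, G, T."""
    total = 0.0
    for col in pwm:
        # 2 bits is the maximum per column; subtract the column entropy.
        total += 2.0 + sum(p * math.log2(p) for p in col.values() if p > 0)
    return total


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# A fully conserved column carries 2 bits; a uniform column carries none.
conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
ic = information_content([conserved, uniform])

# Hypothetical property values and scores, NOT data from the study.
toy_property = [10.0, 20.0, 30.0, 40.0]
toy_scores = [0.1, 0.2, 0.3, 0.4]
r = pearson(toy_property, toy_scores)
```

Under these assumptions, a degenerate PWM (columns close to uniform) has low IC, which matches the statement that higher-IC matrices are more specific.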
Figure 2. Combined performance scores on the full TRANSCompel dataset. Combined nucleotide-level scores obtained for different performance measures when using TRANSFAC PWM sets (a) and custom matrices (b).
Figure 3. Nucleotide-level correlation scores with 50% noise in the PWM sets. The graphs show nCC scores when using TRANSFAC PWM sets (a) and custom matrices (b) with an equal proportion of decoy matrices added. Each value represents the average score over ten runs with different decoy selections.
Figure 4. Motif-level correlation scores with 50% noise in the PWM sets. The graphs show mCC scores when using TRANSFAC PWM sets (a) and custom matrices (b) with an equal proportion of decoy matrices added. Each value represents the average score over ten runs with different decoy selections.
Figure 5. Nucleotide-level correlation scores at different noise levels. Plot of nCC scores at increasing noise levels when methods are supplied with TRANSFAC PWM sets (a) and custom matrices (b). Scores shown are averages over all sequence sets and decoy selections at each noise level. MCAST was unable to function properly with very large PWM sets and was therefore assigned a score of zero at the 99% level.
Figure 6. Motif-level correlation scores at different noise levels. Plot of mCC scores at increasing noise levels when methods are supplied with TRANSFAC PWM sets (a) and custom matrices (b). Scores shown are averages over all sequence sets and decoy selections at each noise level.
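The noise experiments in Figures 3 to 6 can be pictured with a small sketch. It assumes that the noise level is the fraction of decoy matrices in the final PWM set, which agrees with the 50% case described above (an equal proportion of decoys) but is an interpretation for the other levels. The function names, the random seed and the stand-in scoring function are hypothetical and do not reproduce the authors' pipeline.

```python
import random


def decoys_needed(n_true, noise):
    """Number of decoys so that they make up `noise` of the final set."""
    if not 0 <= noise < 1:
        raise ValueError("noise must be in [0, 1)")
    return round(n_true * noise / (1 - noise))


def noisy_pwm_set(true_pwms, decoy_pool, noise, rng):
    """Return the true PWMs plus randomly selected decoys from the pool."""
    k = decoys_needed(len(true_pwms), noise)
    return list(true_pwms) + rng.sample(decoy_pool, k)


def average_over_runs(score_fn, true_pwms, decoy_pool, noise, runs=10, seed=0):
    """Average a score over several independent decoy selections."""
    rng = random.Random(seed)
    scores = [
        score_fn(noisy_pwm_set(true_pwms, decoy_pool, noise, rng))
        for _ in range(runs)
    ]
    return sum(scores) / len(scores)


# Two "true" matrices and a pool of placeholder decoy identifiers.
true_pwms = ["true_1", "true_2"]
decoy_pool = [f"decoy_{i}" for i in range(500)]

# len() stands in for running a module discovery tool and scoring it.
mean_size_50 = average_over_runs(len, true_pwms, decoy_pool, noise=0.5)
```

With two true matrices, a 99% noise level under this interpretation already requires 198 decoys, which gives a sense of how large the supplied PWM sets become at the highest noise level.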
Figure 7. Performances on the liver dataset. Scores obtained on the liver dataset for different performance measures at nucleotide-level (a) and motif-level (b).
Figure 8. Performances on the muscle dataset. Scores obtained on the muscle dataset for different performance measures at nucleotide-level (a) and motif-level (b).