Geir Kjetil Sandve, Osman Abul, Finn Drabløs.
Abstract
BACKGROUND: Computational discovery of motifs in biomolecular sequences is an established field, with applications both in the discovery of functional sites in proteins and regulatory sites in DNA. In recent years there has been increased attention towards the discovery of composite motifs, typically occurring in cis-regulatory regions of genes.
Year: 2008 PMID: 19063744 PMCID: PMC2614996 DOI: 10.1186/1471-2105-9-527
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Compo workflow. The general workflow of Compo, from a list of genes defining regulatory regions of interest, to a Pareto front or ranked list of composite motifs as potential regulators of the genes.
Figure 2. Search tree. Implicit search tree, where numbers inside nodes correspond to single motifs (z), and paths from the root to a node correspond to composite motifs. The values H{1,3} and P{1,3} correspond to the path in bold. The X symbol indicates that some composite motifs will be pruned during search.
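The idea of an implicit search tree can be illustrated with a minimal sketch. It is not Compo's implementation: the function name `enumerate_composites`, the `keep` predicate, and the `max_size` limit are hypothetical, and the actual pruning criterion is not given in this record. The sketch assumes that single motifs are numbered 1..n, that each composite motif is a set of single motifs visited in increasing order (so every set corresponds to exactly one root-to-node path), and that pruning a node also discards its whole subtree.

```python
from typing import Callable, Iterator, Tuple

Path = Tuple[int, ...]


def enumerate_composites(
    n_motifs: int,
    keep: Callable[[Path], bool],
    max_size: int,
) -> Iterator[Path]:
    """Yield composite motifs as root-to-node paths of an implicit tree.

    `keep` is a placeholder pruning rule (an assumption, not Compo's rule):
    when it returns False, the node and everything below it are skipped.
    """

    def extend(path: Path) -> Iterator[Path]:
        # Children only use motif indices larger than the last one on the
        # path, so each motif set is generated exactly once.
        start = path[-1] + 1 if path else 1
        for z in range(start, n_motifs + 1):
            child = path + (z,)
            if not keep(child):
                continue  # pruned: never descend into this subtree
            yield child
            if len(child) < max_size:
                yield from extend(child)

    yield from extend(())


# Toy usage: three single motifs, pruning any path that holds both 2 and 3.
found = list(enumerate_composites(3, lambda p: not {2, 3} <= set(p), max_size=3))
print(found)
```

Discarding a subtree is only safe when the pruning rule is monotone, meaning that a path rejected by `keep` cannot become acceptable after more motifs are appended; that property is assumed here rather than taken from the source.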
Prediction performance
| Noise | Compo | CMA | ModuleSearcher | Stubb | MSCAN | MCAST | Cister | Cluster-Buster |
| No | 0.34 | 0.33 | 0.17 | 0.29 | 0.39 | 0.23 | 0.28 | |
| 50% | 0.33 | 0.32 | 0.17 | 0.24 | 0.30 | 0.21 | 0.27 | |
| 75% | 0.31 | 0.32 | 0.15 | 0.07 | 0.22 | 0.16 | 0.20 | |
| 90% | 0.29 | 0.30 | 0.13 | 0.07 | 0.14 | 0.07 | 0.13 | |
| 95% | 0.27 | 0.31 | 0.09 | 0.04 | 0.09 | 0.03 | 0.08 | |
| 99% | 0.23 | 0.20 | 0.01 | 0.02 | 0.00 | 0.01 | 0.05 | |
Prediction performance on the TransCompel benchmark data sets. The given scores are for the custom matrices version of the benchmark, with different levels of randomly selected matrices (noise) added to the data set. Score values equal to or better than Compo are shown in bold.
Influence of background models and support
| Compo setup | TransCompel, no noise | TransCompel, 50% noise |
| Default Compo | | |
| Random DNA model bg | 0.36 | 0.35 |
| Independent sequence runs | 0.39 | 0.31 |
The table shows how prediction performance is influenced by using only a random DNA model in background computations (no real background DNA sequence), and by making predictions on sequences independently (no support).
Results on muscle and liver data sets
| Method | Muscle | Liver |
| Compo, independent sequence runs | | |
| Compo, support and allowing non-perfect matches | 0.42 | |
| Compo, support and standard set-model | 0.37 | 0.55 |
| CMA | 0.46 | 0.36 |
| ModuleSearcher | 0.46 | 0.43 |
| Stubb | 0.24 | 0.48 |
| MSCAN | 0.51 | |
| MCAST | 0.30 | 0.50 |
| Cister | 0.36 | 0.31 |
| Cluster-Buster | 0.41 |
Prediction performance on the muscle and liver data sets. Score values equal to or better than main Compo run are shown in bold.
Results on Drosophila data sets
| Method | #sign. results |
| Compo | |
| CisModule | 4 |
| MCD | 4 |
| D2z | |
| Stubb | |
| CSAM | |
Prediction performance (the number of data sets with significant predictions at the 0.05 level) on the Drosophila data sets. There is a total of 33 data sets in the benchmark. Score values equal to or better than Compo are shown in bold. Results for other methods have been taken from Table 5 in the supplementary material for [28].
Figure 3. Pareto front. a) Pareto front of optimal composite motifs corresponding to a multi-objective optimization with respect to support, distance restriction and specificity (hit-probability). Red colors show high specificity and blue colors show low specificity. b) Corresponding layout of motifs where colors instead denote combined motif score (significance). Red colors correspond to highest-ranked motifs according to combined score. The top-ranked composite motif is located at support 5 and distance window 200.
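To make the notion of a Pareto front over these three objectives concrete, the following sketch filters a list of candidates down to the non-dominated ones. It is an illustration under stated assumptions, not the procedure used by Compo: the `Candidate` fields, the function names, and the toy values are hypothetical, and the sketch assumes that higher support, a smaller distance window, and a lower hit-probability are each preferable.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    support: int      # number of sequences containing the motif (higher is better, assumed)
    window: int       # distance window in bp (smaller is better, assumed)
    hit_prob: float   # probability of a chance hit (lower is better, assumed)


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is no worse than `b` on every objective and better on one."""
    no_worse = (
        a.support >= b.support
        and a.window <= b.window
        and a.hit_prob <= b.hit_prob
    )
    strictly_better = (
        a.support > b.support
        or a.window < b.window
        or a.hit_prob < b.hit_prob
    )
    return no_worse and strictly_better


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep the candidates that no other candidate dominates (O(n^2) scan)."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


# Toy values invented for illustration; they are not results from the paper.
toy = [
    Candidate("A", support=4, window=150, hit_prob=1e-4),
    Candidate("B", support=2, window=50, hit_prob=1e-3),
    Candidate("C", support=2, window=150, hit_prob=1e-3),  # dominated by A and B
    Candidate("D", support=6, window=400, hit_prob=1e-2),
]
print([c.name for c in pareto_front(toy)])
```

A front like this leaves the trade-off between objectives to the user; producing a single ranked list, as in panel b, additionally requires a combined score, which is not specified in this record and is therefore not sketched.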