Cinzia Pizzi, Stefania Bortoluzzi, Andrea Bisognin, Alessandro Coppe, Gian Antonio Danieli.
Abstract
Detecting functionally relevant DNA motifs in real biological sequences is difficult because of a number of biological, statistical and computational issues, and because little is known about the structure of the patterns being searched for. Many algorithms are implemented as fully automated processes, which often depend on input parameters guessed by the user at the very first step. In this paper, we present a novel method for the detection of seeded DNA motifs, which are composed of regions with different degrees of variability. The method is based on a multi-step approach, which was implemented in a motif searching web tool (MOST). Overrepresented exact patterns are extracted from the input sequences and clustered to produce motif core regions, which are then extended and scored to generate seeded motifs. Combining automated pattern discovery algorithms with different display tools for evaluating and selecting results at several analysis steps can potentially lead to much more meaningful results than complete automation can produce. Experimental results on different real yeast and human datasets showed the methodology to be a promising solution for finding seeded motifs. The MOST web tool is freely available at http://telethon.bio.unipd.it/bioinfo/MOST.
Year: 2005 PMID: 16141193 PMCID: PMC1197136 DOI: 10.1093/nar/gni131
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Multi-step architecture of MOST.
Results of MOST evaluation on yeast datasets
| Dataset | yst01g | yst02g | yst03m | yst04r | yst05r | yst06g | yst08r | yst09g | Total | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| Number of sequences | 8 | 4 | 8 | 6 | 3 | 7 | 11 | 16 | | |
| Sequence length | 1000 | 500 | 500 | 1000 | 500 | 500 | 1000 | 1000 | | |
| Number of known signals | 6 | 5 | 18 | 7 | 4 | 7 | 14 | 13 | 47 | |
| Phase 1: solid words extracted | 65 | 255 | 286 | 162 | 337 | 214 | 88 | 40 | | |
| Phase 2: clusters (80% similarity threshold) | 50 | 126 | 141 | 84 | 157 | 111 | 46 | 31 | | |
| Phase 3A: ext. clusters (95% threshold for the MV function) | 50 | 123 | 137 | 81 | 154 | 106 | 44 | 29 | | |
| Phase 3B: ext. clusters (80% threshold for the MV function) | 50 | 126 | 141 | 84 | 157 | 110 | 46 | 31 | | |
| Phase 1: signals found in solid words | 2 | 5 | 14 | 6 | 4 | 7 | 10 | 12 | 38 | |
| Phase 2: signals found in clusters | 2 | 5 | 14 | 6 | 4 | 7 | 10 | 12 | 38 | |
| Phase 3A: signals found in ext. clusters | 0 | 5 | 10 | 7 | 3 | 5 | 7 | 6 | 30 | |
| Phase 3B: signals found in ext. clusters | 1 | 5 | 16 | 7 | 4 | 7 | 9 | 11 | 40 | |
| Phase 1: sensitivity | 0.33 | 1.00 | 0.78 | 0.86 | 1.00 | 1.00 | 0.71 | 0.92 | 0.83 | |
| Phase 2: sensitivity | 0.33 | 1.00 | 0.78 | 0.86 | 1.00 | 1.00 | 0.71 | 0.92 | 0.83 | |
| Phase 3A: sensitivity | 0.00 | 1.00 | 0.56 | 1.00 | 0.75 | 0.71 | 0.50 | 0.46 | 0.67 | |
| Phase 3B: sensitivity | 0.17 | 1.00 | 0.89 | 1.00 | 1.00 | 1.00 | 0.64 | 0.85 | 0.84 | |
| Phase 2 (maximal cluster) | | | | | | | | | | |
| Maximum number of signals per cluster | 1 | 4 | 9 | 5 | 3 | 6 | 8 | 8 | 28 | |
| Number of maximal clusters | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | | |
| Sensitivity | 0.17 | 0.80 | 0.50 | 0.71 | 0.75 | 0.86 | 0.57 | 0.62 | 0.63 | |
| Phase 3A (maximal cluster) | | | | | | | | | | |
| Maximum number of signals per cluster | 0 | 1 | 5 | 4 | 3 | 5 | 4 | 3 | 18 | |
| Number of maximal clusters | - | 11 | 3 | 2 | 1 | 1 | 1 | 4 | | |
| Sensitivity | 0.00 | 0.20 | 0.28 | 0.57 | 0.75 | 0.71 | 0.29 | 0.23 | 0.42 | |
| Phase 3B (maximal cluster) | | | | | | | | | | |
| Maximum number of signals per cluster | 1 | 3 | 8 | 3 | 3 | 6 | 4 | 7 | 24 | |
| Number of maximal clusters | 1 | 1 | 1 | 3 | 1 | 1 | 2 | 1 | | |
| Sensitivity | 0.17 | 0.60 | 0.44 | 0.43 | 0.75 | 0.86 | 0.29 | 0.54 | 0.54 | |
Datasets are identified by the names originally used by Tompa et al. For each dataset, the number of sequences included, the length of the promoter sequences and the number of known signals included are reported. Rows four to seven describe the results obtained at the different MOST analysis steps and, for the third step, under different conditions. The following eight rows show the number and the proportion (sensitivity) of known signals per dataset represented in the results of the previously described analysis steps. In the last part of the table, the number and the proportion (sensitivity) of known signals represented in the maximal cluster are shown for each dataset.
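As a worked check of how the sensitivity rows relate to the count rows, the following sketch recomputes the Phase 1 sensitivities as signals found divided by known signals. The numbers are copied from the table; the variable and function names are illustrative and are not part of MOST.

```python
# Values copied from the yeast table (Phase 1 rows only).
datasets = ["yst01g", "yst02g", "yst03m", "yst04r",
            "yst05r", "yst06g", "yst08r", "yst09g"]
known_signals = [6, 5, 18, 7, 4, 7, 14, 13]
found_phase1 = [2, 5, 14, 6, 4, 7, 10, 12]


def sensitivity(found, known):
    """Proportion of known signals that were recovered."""
    return found / known


# Per-dataset sensitivity, rounded to two decimals as in the table.
phase1_sensitivity = {
    name: round(sensitivity(found, known), 2)
    for name, found, known in zip(datasets, found_phase1, known_signals)
}

# Unweighted mean of the per-dataset sensitivities.
mean_sensitivity = round(
    sum(sensitivity(f, k) for f, k in zip(found_phase1, known_signals))
    / len(datasets),
    2,
)

print(phase1_sensitivity)
print(mean_sensitivity)
```

The per-dataset values match the Phase 1 sensitivity row, and the unweighted mean matches the 0.83 reported at the end of that row.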
Description of the transcription factor binding sites with experimentally proven activity that are represented in the group of human promoter sequences composing the Mixed signals benchmark dataset
| Transcription factor | Number of sequence elements | Average length | Minimum length | Maximum length |
|---|---|---|---|---|
| AP-1 | 9 | 9.6 | 7 | 17 |
| c-Myc | 4 | 8.0 | 6 | 14 |
| CREB | 4 | 8.5 | 8 | 10 |
| CRE-BP1 | 7 | 8.3 | 7 | 10 |
| CTF | 4 | 6.8 | 6 | 7 |
| E2F | 5 | 9.0 | 8 | 12 |
| E2F-1 | 5 | 9.8 | 8 | 12 |
| HIF-1 | 7 | 7.4 | 6 | 8 |
For each transcription factor, the number of binding sites in the considered group of sequences, their average, minimum and maximum length are reported.
Results of MOST evaluation on Mixed signals dataset
| Threshold | No. of 6 bp words | TP | FN | Sensitivity |
|---|---|---|---|---|
| Occ_obs/Occ_exp | | | | |
| 1.0 | 1164 | 41 | 4 | 0.91 |
| 1.1 | 1152 | 41 | 4 | 0.91 |
| 1.3 | 929 | 37 | 8 | 0.82 |
| 1.5 | 710 | 36 | 9 | 0.80 |
| 1.7 | 560 | 35 | 10 | 0.78 |
| 2.0 | 382 | 29 | 16 | 0.64 |
| 2.2 | 334 | 27 | 18 | 0.60 |
| 2.5 | 221 | 27 | 18 | 0.60 |
| 3.0 | 109 | 4 | 41 | 0.09 |
| Seq_obs/Seq_exp | | | | |
| 1.0 | 1173 | 34 | 11 | 0.76 |
| 1.1 | 999 | 34 | 11 | 0.76 |
| 1.3 | 773 | 34 | 11 | 0.76 |
| 1.5 | 579 | 34 | 11 | 0.76 |
| 1.7 | 469 | 28 | 17 | 0.62 |
| 2.0 | 246 | 21 | 24 | 0.47 |
| 2.2 | 293 | 18 | 27 | 0.40 |
| 2.5 | 102 | 4 | 41 | 0.09 |
| 3.0 | 48 | 1 | 44 | 0.02 |
Experiments on the first step of MOST: identification of surprising words. The sensitivity of the first phase of MOST, carried out with different overrepresentation measures, was evaluated. All the 6 bp sequences representing known binding sites, or all of the substrings of binding sites longer than 6 bp, were searched for in the list of 6 nt strings extracted as overrepresented according to different measures and different thresholds (first column). The sensitivity was calculated as the number of known sites represented in the list over the total number of known sites [sensitivity = TP/(TP+FN); TP, true positives; FN, false negatives].
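The evaluation described in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MOST implementation. The expected count uses an independent-base background estimated from the input itself, which is an assumption because the background model is not described here. A known site is counted as a true positive when at least one of its 6-long substrings appears in the word list, which is one reading of the caption. All function names are hypothetical.

```python
from collections import Counter
from math import prod


def count_kmers(sequences, k=6):
    """Count every overlapping k-mer across all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def expected_count(word, base_freq, n_positions):
    """Expected occurrences under an independent-base background (assumption)."""
    return n_positions * prod(base_freq[base] for base in word)


def overrepresented_words(sequences, k=6, threshold=1.0):
    """Words whose observed/expected occurrence ratio reaches the threshold."""
    base_counts = Counter("".join(sequences))
    total = sum(base_counts.values())
    base_freq = {base: n / total for base, n in base_counts.items()}
    n_positions = sum(max(len(seq) - k + 1, 0) for seq in sequences)
    observed = count_kmers(sequences, k)
    return {
        word
        for word, obs in observed.items()
        if obs / expected_count(word, base_freq, n_positions) >= threshold
    }


def site_is_found(site, words, k=6):
    """True if any k-long substring of the known site is in the word list."""
    return any(site[i:i + k] in words for i in range(len(site) - k + 1))


def sensitivity(known_sites, words, k=6):
    """sensitivity = TP / (TP + FN), as defined in the caption."""
    tp = sum(site_is_found(site, words, k) for site in known_sites)
    fn = len(known_sites) - tp
    return tp / (tp + fn)


# Toy input, invented for illustration only.
toy_sequences = ["AAAAAAAAAA", "ACGTACGTAC"]
toy_words = overrepresented_words(toy_sequences, k=6, threshold=3.0)
print(sorted(toy_words))
print(sensitivity(["TGACTCA", "CACGTG"], {"TGACTC"}))
```

With the counts in the table, the same formula gives 41/(41+4), which rounds to 0.91 for the lowest Occ_obs/Occ_exp threshold.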
Results of MOST evaluation on Mixed signals dataset
| Clustering similarity threshold | Transcription factor | Known instances: maximum per cluster | Known instances: phase 1 found | Known instances: total | Maximum per cluster/phase 1 found | Maximum per cluster/total |
|---|---|---|---|---|---|---|
| 80% | AP-1 | 7 | 8 | 9 | 0.88 | 0.78 |
| | c-Myc | 4 | 4 | 4 | 1.00 | 1.00 |
| | CREB | 3 | 4 | 4 | 0.75 | 0.75 |
| | CRE-BP1 | 6 | 6 | 7 | 1.00 | 0.86 |
| | CTF | 3 | 4 | 4 | 0.75 | 0.75 |
| | E2F | 4 | 4 | 5 | 1.00 | 0.80 |
| | E2F-1 | 3 | 5 | 5 | 0.60 | 0.60 |
| | HIF-1 | 5 | 6 | 7 | 0.83 | 0.71 |
| | Total | 35 | 41 | 45 | Average 0.85 | Average 0.78 |
| 60% | AP-1 | 1 | 8 | 9 | 0.13 | 0.11 |
| | c-Myc | 1 | 4 | 4 | 0.25 | 0.25 |
| | CREB | 3 | 4 | 4 | 0.75 | 0.75 |
| | CRE-BP1 | 7 | 6 | 7 | 1.00 | 0.86 |
| | CTF | 2 | 4 | 4 | 0.50 | 0.50 |
| | E2F | 3 | 4 | 5 | 0.75 | 0.60 |
| | E2F-1 | 4 | 5 | 5 | 0.80 | 0.80 |
| | HIF-1 | 2 | 6 | 7 | 0.33 | 0.29 |
| | Total | 23 | 41 | 45 | Average 0.56 | Average 0.52 |
Experiments on the second step of MOST: clustering of exact patterns to build the motif core. In two clustering experiments, core motifs were built by grouping the 1164 surprising words with the similarity threshold set to 60% and 80%, respectively. For each specific group of known sequence elements (e.g. the AP-1 group of nine elements), the number of elements represented in the obtained clusters is reported.
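To make the "maximum per cluster" column concrete, here is a minimal sketch of threshold-based clustering of equal-length words. It is not the clustering algorithm used by MOST. Both the similarity measure (fraction of matching positions) and the greedy seed-based assignment are assumptions chosen for brevity, and the function names and toy words are hypothetical.

```python
def similarity(a, b):
    """Fraction of positions at which two equal-length words agree (assumed measure)."""
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_words(words, threshold=0.8):
    """Greedy clustering: a word joins the first cluster whose seed is similar enough."""
    clusters = []  # each cluster is a list whose first element acts as the seed
    for word in words:
        for cluster in clusters:
            if similarity(word, cluster[0]) >= threshold:
                cluster.append(word)
                break
        else:
            # No existing seed was similar enough, so start a new cluster.
            clusters.append([word])
    return clusters


def max_instances_per_cluster(clusters, known_words):
    """Largest number of known words captured by any single cluster."""
    known = set(known_words)
    return max((sum(w in known for w in c) for c in clusters), default=0)


# Toy 6-mers, invented for illustration only.
toy_words = ["TGACTC", "TGACTA", "TGAGTC", "CACGTG"]
toy_clusters = cluster_words(toy_words, threshold=0.8)
print(toy_clusters)
print(max_instances_per_cluster(toy_clusters, ["TGACTC", "TGAGTC", "CACGTG"]))
```

Dividing the value returned by `max_instances_per_cluster` by the number of known instances found in phase 1, or by the total number of known instances, gives the two ratio columns of the table.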