Ngoc Tam L Tran, Chun-Hsi Huang.
Abstract
BACKGROUND: Previous studies show that different motif finders produce different results for the same dataset, largely because these tools use different strategies and have distinct features for discovering motifs. Using multiple tools and methods has therefore been suggested, because motifs reported by several of them are more likely to be biologically significant.
Keywords: Binding sites; DNA motif; Merging similar motifs; Motif clustering; Motif detection tool; Motif similarity comparison
Year: 2018 PMID: 30574025 PMCID: PMC6299673 DOI: 10.1186/s12575-018-0088-3
Source DB: PubMed Journal: Biol Proced Online ISSN: 1480-9222 Impact factor: 3.244
Four distance metrics used in pair-wise comparisons with MOTIFSIM
| Metric | Formula | Description | Ref. |
|---|---|---|---|
| Average Kullback-Leibler (AKL) | | | 21 |
| Average Log-likelihood Ratio (ALLR) | | | 24 |
| Pearson Correlation Coefficient (PCC) | | | 24 |
| χ² Distance | | | 16 |
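For orientation, the four metrics named in the table have well-known textbook forms. The sketch below shows one common definition of each for a single pair of motif columns. It is an illustration under stated assumptions, not MOTIFSIM's implementation: the exact formulas in the cited references [16, 21, 24] may differ in normalization, pseudocounts, or logarithm base, and the function names, the uniform background, and the pseudocount values are all hypothetical choices.

```python
import math

BACKGROUND = (0.25, 0.25, 0.25, 0.25)  # assumed uniform A, C, G, T background


def average_kl(p, q, eps=1e-9):
    """Symmetrised Kullback-Leibler divergence between two probability columns."""
    p = [x + eps for x in p]  # eps avoids log(0); an illustrative choice
    q = [x + eps for x in q]
    kl_pq = sum(a * math.log(a / b) for a, b in zip(p, q))
    kl_qp = sum(b * math.log(b / a) for a, b in zip(p, q))
    return (kl_pq + kl_qp) / 2


def allr(counts_i, counts_j, background=BACKGROUND, pseudo=1.0):
    """Average log-likelihood ratio between two columns of nucleotide counts."""
    n_i, n_j = sum(counts_i), sum(counts_j)
    # Pseudocount-smoothed frequencies so that zero counts do not give log(0).
    f_i = [(c + pseudo * b) / (n_i + pseudo) for c, b in zip(counts_i, background)]
    f_j = [(c + pseudo * b) / (n_j + pseudo) for c, b in zip(counts_j, background)]
    numerator = sum(
        c_j * math.log(fi / b) + c_i * math.log(fj / b)
        for c_i, c_j, fi, fj, b in zip(counts_i, counts_j, f_i, f_j, background)
    )
    return numerator / (n_i + n_j)


def pearson(p, q):
    """Pearson correlation coefficient between two probability columns."""
    n = len(p)
    mean_p, mean_q = sum(p) / n, sum(q) / n
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_q = sum((b - mean_q) ** 2 for b in q)
    if var_p == 0 or var_q == 0:
        return 0.0  # assumption: a flat column is treated as uncorrelated
    return cov / math.sqrt(var_p * var_q)


def chi2_distance(p, q):
    """Chi-square distance: 0.5 * sum((p - q)^2 / (p + q)), skipping empty cells."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)


# Two columns that prefer different nucleotides (A versus C).
col_a = [0.7, 0.1, 0.1, 0.1]
col_c = [0.1, 0.7, 0.1, 0.1]
print("AKL :", average_kl(col_a, col_c))
print("PCC :", pearson(col_a, col_c))
print("chi2:", chi2_distance(col_a, col_c))
print("ALLR:", allr([14, 2, 2, 2], [2, 14, 2, 2]))
```

AKL and the χ² distance are dissimilarities (0 for identical columns), whereas PCC and ALLR are similarities (larger means more alike), so the scores need to be put on a common footing before they are compared or combined.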
Sixteen benchmark sequence datasets [29]. They are grouped by species. Each sequence dataset has an embedded transcription factor
| Sequence Dataset | Dataset Type | Species | Transcription Factor | Number of Sequences | Sequence Length (bp) |
|---|---|---|---|---|---|
| hm01g | Generic | Homo sapiens | AP-1 | 18 | 2000 |
| hm04g | Generic | Homo sapiens | c-Jun | 13 | 2000 |
| hm08m | Markov | Homo sapiens | CREB | 15 | 500 |
| hm15g | Generic | Homo sapiens | NF-1 | 4 | 2000 |
| hm17g | Generic | Homo sapiens | NF-kappaB | 11 | 500 |
| hm19g | Generic | Homo sapiens | Sp1 | 5 | 500 |
| hm22g | Generic | Homo sapiens | USF1 | 6 | 500 |
| hm22m | Markov | Homo sapiens | USF1 | 6 | 500 |
| mus04m | Markov | Mus musculus | C/Ebalpha | 7 | 1000 |
| mus06g | Generic | Mus musculus | GATA-1 | 3 | 500 |
| mus10g | Generic | Mus musculus | Sp1 | 13 | 1000 |
| mus11m | Markov | Mus musculus | Sp1 | 12 | 500 |
| yst02g | Generic | Saccharomyces cerevisiae | GAL04 | 4 | 500 |
| yst03m | Markov | Saccharomyces cerevisiae | GCN4 | 8 | 500 |
| yst06g | Generic | Saccharomyces cerevisiae | MCM1 | 7 | 500 |
| yst09g | Generic | Saccharomyces cerevisiae | CAR1 | 16 | 1000 |
Nineteen motif finders used in the assessment. In the source table, an x mark associates a sequence dataset with a tool; only the number of x marks per tool survives here, so it is listed in place of the per-dataset marks. The sequence datasets are grouped by species. The thirteen tools citing [29] are the older tools used by Tompa et al. [29]; they were set in italic face and their names are missing. The rest are newer tools. The datasets with (*) were used to run newer tools

| Species | Sequence Datasets |
|---|---|
| Homo sapiens | hm01g*, hm04g*, hm08m, hm15g*, hm17g*, hm19g*, hm22g*, hm22m |
| Mus musculus | mus04m, mus06g, mus10g, mus11m |
| Saccharomyces cerevisiae | yst02g, yst03m, yst06g, yst09g* |

| Tool Name | Number of Datasets Marked (x) | Ref. |
|---|---|---|
| | 6 | 29 |
| | 9 | 29 |
| ChIPMunk | 7 | 31 |
| | 5 | 29 |
| DMINDA | 7 | 32 |
| | 9 | 29 |
| | 9 | 29 |
| | 8 | 29 |
| MEME (v. 4.11.4) | 4 | 3 |
| | 9 | 29 |
| | 8 | 29 |
| | 8 | 29 |
| peak-motifs | 7 | 33 |
| | 6 | 29 |
| | 9 | 29 |
| STEME | 4 | 34 |
| | 9 | 29 |
| XXmotif | 4 | 35 |
| | 9 | 29 |
Four datasets used in motif clustering comparisons. The motifs in each dataset were selected from the Jaspar database [28]
| Dataset | Number of Motifs | Taxonomic Group |
|---|---|---|
| pfm_fungi | 78 | Fungi |
| pfm_insect | 42 | Insects |
| pfm_plant | 65 | Plants |
| pfm_vertebrate | 73 | Vertebrates |
Performance comparisons for USW, AKL, ALLR, PCC, CS, and MOTIFSIM for the predicted motifs in the collection. The number of motifs that were correctly identified by each method per sequence dataset is listed. The percentage of motifs that were correctly identified by each method per dataset was also calculated
| Sequence Dataset | USW (#) | AKL (#) | ALLR (#) | PCC (#) | CS (#) | MOTIFSIM (#) | Total # of Tools | USW (%) | AKL (%) | ALLR (%) | PCC (%) | CS (%) | MOTIFSIM (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| hm08m | 0 | 0 | 0 | 0 | 0 | 1 | 12 | 0% | 0% | 0% | 0% | 0% | 8% |
| hm17g | 2 | 0 | 0 | 0 | 3 | 2 | 5 | 40% | 0% | 0% | 0% | 60% | 40% |
| hm22m | 1 | 0 | 0 | 0 | 2 | 1 | 10 | 10% | 0% | 0% | 0% | 20% | 10% |
| mus04m | 0 | 0 | 0 | 0 | 0 | 2 | 12 | 0% | 0% | 0% | 0% | 0% | 17% |
| mus06g | 1 | 1 | 0 | 0 | 1 | 2 | 13 | 8% | 8% | 0% | 0% | 8% | 15% |
| mus10g | 3 | 0 | 0 | 0 | 0 | 5 | 11 | 27% | 0% | 0% | 0% | 0% | 45% |
| mus11m | 2 | 0 | 0 | 0 | 0 | 3 | 11 | 18% | 0% | 0% | 0% | 0% | 27% |
| yst02g | 6 | 0 | 0 | 0 | 7 | 6 | 11 | 55% | 0% | 0% | 0% | 64% | 55% |
| yst03m | 3 | 0 | 0 | 0 | 1 | 9 | 13 | 23% | 0% | 0% | 0% | 8% | 69% |
| yst06g | 5 | 0 | 0 | 0 | 2 | 3 | 11 | 45% | 0% | 0% | 0% | 18% | 27% |
| yst09g | 2 | 0 | 0 | 0 | 1 | 1 | 3 | 67% | 0% | 0% | 0% | 33% | 33% |
| Total | 25 | 1 | 0 | 0 | 17 | 35 | 112 | 22% | 1% | 0% | 0% | 15% | 31% |
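The percentage columns above appear to be the number of correctly identified motifs divided by the total number of tools for that dataset, rounded to a whole percent. A minimal sketch that reproduces a few MOTIFSIM entries from the table (the function name `percent_correct` is hypothetical):

```python
def percent_correct(correct, total):
    """Share of correctly identified motifs, rounded to a whole percent."""
    return round(100 * correct / total)


# (dataset, motifs correctly identified by MOTIFSIM, total # of tools),
# copied from the table above.
rows = [("hm08m", 1, 12), ("yst03m", 9, 13), ("yst09g", 1, 3), ("Total", 35, 112)]

for name, correct, total in rows:
    print(f"{name}: {correct}/{total} = {percent_correct(correct, total)}%")
```

Because the denominators differ widely (3 tools for yst09g against 13 for yst03m), a single motif shifts the percentage far more on the small datasets than on the large ones.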
Performance comparisons for USW, AKL, ALLR, PCC, CS, and MOTIFSIM for the selected motifs from TRANSFAC database in the collection. The number of motifs that were correctly identified by each method per species is listed. The percentage of motifs that were correctly identified by each method per species was also calculated
| Species | USW (#) | AKL (#) | ALLR (#) | PCC (#) | CS (#) | MOTIFSIM (#) | Total # of Motifs by Species | USW (%) | AKL (%) | ALLR (%) | PCC (%) | CS (%) | MOTIFSIM (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 11 | 19 | 19 | 19 | 17 | 19 | 19 | 58% | 100% | 100% | 100% | 89% | 100% |
| | 7 | 15 | 15 | 15 | 14 | 14 | 15 | 47% | 100% | 100% | 100% | 93% | 93% |
| | 4 | 5 | 5 | 5 | 4 | 5 | 5 | 80% | 100% | 100% | 100% | 80% | 100% |
| | 6 | 7 | 7 | 7 | 4 | 7 | 7 | 86% | 100% | 100% | 100% | 57% | 100% |
| Total | 28 | 46 | 46 | 46 | 39 | 45 | 46 | 61% | 100% | 100% | 100% | 85% | 98% |
Average percentage for the predicted motifs and the selected motifs by each method. MOTIFSIM achieves the highest average percentage (64.5%) of all methods
| Motif Category | USW (%) | AKL (%) | ALLR (%) | PCC (%) | CS (%) | MOTIFSIM (%) |
|---|---|---|---|---|---|---|
| Predicted motifs | 22% | 1% | 0% | 0% | 15% | 31% |
| Selected motifs from TRANSFAC | 61% | 100% | 100% | 100% | 85% | 98% |
| Average percentage | 41.5% | 50.5% | 50% | 50% | 50% | 64.5% |
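The last row of the table is consistent with a plain, unweighted mean of the two category percentages. A short sketch that recomputes it from the two rows above (the variable names are illustrative):

```python
# Percentages of correctly identified motifs, copied from the table above.
predicted = {"USW": 22, "AKL": 1, "ALLR": 0, "PCC": 0, "CS": 15, "MOTIFSIM": 31}
selected = {"USW": 61, "AKL": 100, "ALLR": 100, "PCC": 100, "CS": 85, "MOTIFSIM": 98}

# Unweighted mean of the two categories for each method.
average = {method: (predicted[method] + selected[method]) / 2 for method in predicted}
best = max(average, key=average.get)

for method, value in average.items():
    print(f"{method}: {value}%")
print("Highest average:", best)
```

Note that this mean gives the two categories equal weight even though they differ in size (112 predicted-motif cases against 46 selected motifs).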
Comparison results for Matrix-clustering and MOTIFSIM for four taxonomic datasets. The number of motifs that were correctly classified and the percentage of correct classification by each tool for each dataset are shown. MOTIFSIM has a similar or better performance than Matrix-clustering
| Dataset | Total Number of Motifs | MOTIFSIM: # of Motifs Correctly Clustered | MOTIFSIM: % of Correct Classification | Matrix-clustering: # of Motifs Correctly Clustered | Matrix-clustering: % of Correct Classification |
|---|---|---|---|---|---|
| Fungi | 78 | 48 | 62% | 45 | 58% |
| Insects | 42 | 24 | 57% | 23 | 55% |
| Plants | 65 | 63 | 97% | 63 | 97% |
| Vertebrates | 73 | 66 | 90% | 66 | 90% |
Fig. 1. Performance comparison of MOTIFSIM and the RSAT Matrix-clustering tool on four taxonomic datasets: Fungi, Insects, Plants, and Vertebrates. MOTIFSIM is more accurate than Matrix-clustering on the Fungi and Insects datasets, achieving 62% and 57% respectively, compared with 58% and 55% for Matrix-clustering. On the Plants and Vertebrates datasets, both tools achieve the same accuracy, 97% and 90% respectively
Fig. 2. Average statistics for six newer motif finders (ChIPMunk, DMINDA, MEME v. 4.11.4, peak-motifs, STEME, XXmotif) and MOTIFSIM on seven additional sequence datasets. The first four statistics at the bottom of the figure are nucleotide-level statistics. The next two are site-level statistics. STEME performs worse than all other tools, owing to its design and implementation. MOTIFSIM performs better than MEME and STEME and falls in an intermediate range compared with the other tools
Fig. 3. Average statistics for thirteen older motif finders and MOTIFSIM on nine sequence datasets. The older tools and the nine sequence datasets were used by Tompa et al. in their study [29]. The first four statistics at the bottom of the figure are nucleotide-level statistics. The next two are site-level statistics. MOTIFSIM performs better than ten of the thirteen older tools; the exceptions are Oligodyad-analysis, Weeder, and YMF