Abstract
BACKGROUND: Identifying repeated factors that occur in a string of letters, or common factors that occur in a set of strings, is an important task in computer science and biology. Such patterns are called motifs, and the process of identifying them is called motif extraction. In biology, motif extraction constitutes a fundamental step in understanding the regulation of gene expression. State-of-the-art tools for motif extraction have their own constraints. Most of these tools are designed only for single motif extraction; structured motifs additionally allow for distance intervals between their single motif components. Moreover, motif extraction from large-scale datasets (for instance, large-scale ChIP-Seq datasets) cannot be performed by current tools. Other constraints include high time and/or space complexity for identifying long motifs with higher error thresholds.
Year: 2014 PMID: 25004797 PMCID: PMC4227134 DOI: 10.1186/1471-2105-15-235
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
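To make the abstract's terminology concrete, the following minimal sketch checks whether a two-component structured motif occurs in a sequence, allowing a bounded number of mismatches per component and a distance interval between the components. The function names (`hamming`, `occurs`, `support`), the choice of Hamming distance as the error measure, and the toy sequences are illustrative assumptions. This is a brute-force illustration only, not the algorithm used by MoTeX-II or the other tools.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs(seq, m1, e1, m2, e2, dmin, dmax):
    """True if m1 (at most e1 mismatches) is followed by m2 (at most e2
    mismatches) with a gap of dmin..dmax letters between the two components."""
    for i in range(len(seq) - len(m1) + 1):
        if hamming(seq[i:i + len(m1)], m1) > e1:
            continue  # first component does not match at position i
        for gap in range(dmin, dmax + 1):
            j = i + len(m1) + gap
            if j + len(m2) > len(seq):
                break  # second component would run past the end
            if hamming(seq[j:j + len(m2)], m2) <= e2:
                return True
    return False


def support(seqs, *motif):
    """Number of sequences that contain at least one occurrence."""
    return sum(occurs(s, *motif) for s in seqs)


# Toy data (hypothetical): "ACGT" with up to 1 mismatch, a gap of exactly
# 2 letters, then "GCA" with no mismatches.
seqs = ["ACGTTTGCA", "ACCTAAGCA", "TTTTTTTTT"]
print(support(seqs, "ACGT", 1, "GCA", 0, 2, 2))  # 2
```

A motif would then be reported only if its support reaches a chosen threshold; exhaustive tools must consider every candidate motif, which is where the time and space costs mentioned in the abstract arise.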
Number of motifs identified by MoTeX-II using a synthetic dataset
| <(8,1)[3,3](8,1),7> | 100 | 100 | 100 |
| <(8,1)[3,3](8,1),15> | 100 | 100 | 105 |
| <(8,1)[3,3](9,2),7> | 100 | 100 | 100 |
| <(8,1)[3,3](9,2),15> | 100 | 100 | 100 |
| <(9,2)[3,3](8,1),7> | 100 | 100 | 128 |
| <(9,2)[3,3](8,1),15> | 100 | 100 | 120 |
| <(9,2)[3,3](9,2),7> | 100 | 100 | 101 |
| <(9,2)[3,3](9,2),15> | 100 | 100 | 100 |
The number of motifs identified by MoTeX-II using a synthetic dataset. The basic input dataset consists of 1,062 upstream sequences of Bacillus subtilis genes of total size 240 KB.
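The row labels in the table above use a compact notation such as `<(8,1)[3,3](8,1),7>`. A minimal sketch for parsing such labels follows. The reading of each field is an assumption that this excerpt does not state: each `(k,e)` pair is taken to be a component length and its maximum number of errors, each `[dmin,dmax]` pair the distance interval between consecutive components, and the trailing integer a quorum. The `StructuredMotif` type and `parse_motif` helper are hypothetical names.

```python
import re
from typing import NamedTuple


class StructuredMotif(NamedTuple):
    boxes: list   # [(k, e), ...]; assumed: component length, max errors
    gaps: list    # [(dmin, dmax), ...]; assumed: distance intervals
    quorum: int   # trailing integer; assumed to be a quorum


def parse_motif(label: str) -> StructuredMotif:
    """Split a label like '<(8,1)[3,3](8,1),7>' into its numeric fields."""
    m = re.fullmatch(r"<(.+),(\d+)>", label.strip())
    if m is None:
        raise ValueError(f"unrecognised label: {label!r}")
    body, quorum = m.group(1), int(m.group(2))
    boxes = [(int(a), int(b)) for a, b in re.findall(r"\((\d+),(\d+)\)", body)]
    gaps = [(int(a), int(b)) for a, b in re.findall(r"\[(\d+),(\d+)\]", body)]
    # n components must be separated by exactly n - 1 distance intervals
    if len(boxes) != len(gaps) + 1:
        raise ValueError(f"malformed label: {label!r}")
    return StructuredMotif(boxes, gaps, quorum)


print(parse_motif("<(8,1)[3,3](8,1),7>"))
```

The same helper handles labels with more than two components, such as `<(8,1)[2,3](9,2)[3,5](10,3),5>`, because components and intervals are collected as lists.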
Statistical evaluation of motifs identified by MoTeX-II using a synthetic dataset
| <(3,0)[2,2](5,0),7> | 1 | 1 | 5 | 1/1 |
| <(5,0)[2,2](3,0),7> | 1 | 1 | 6 | 1/1 |
| <(3,0)[2,2](6,1),7> | 1 | 1 | 2,475 | 1/1 |
| <(6,1)[2,2](3,0),7> | 1 | 1 | 2,753 | 1/1 |
| <(5,1)[2,2](6,1),7> | 1 | 1 | 17,118 | 1/1 |
| <(6,1)[2,2](5,1),7> | 1 | 1 | 17,135 | 1/1 |
Ranking stands for the z-score ranking of the identified implanted motif based on support/weighted support.
The statistical evaluation of the motifs identified by MoTeX-II using a synthetic dataset. The basic input dataset consists of 1,062 upstream sequences of Bacillus subtilis genes of total size 240 KB.
Elapsed-time comparison of RISOTTO, EXMOTIF, and MoTeX-II using a small-scale real dataset
| <(8,1)[2,3](8,1),7> | 286s | 898s | 1,885s | 46s |
| <(8,1)[2,3](8,1),15> | 217s | 626s | 1,860s | 48s |
| <(8,1)[2,3](9,2),7> | 2,086s | 2,253s | 1,871s | 49s |
| <(8,1)[2,3](9,2),15> | 1,103s | 2,222s | 1,860s | 48s |
| <(9,2)[2,3](8,1),7> | 4,868s | 2,222s | 1,868s | 48s |
| <(9,2)[2,3](8,1),15> | 4,279s | 2,197s | 1,856s | 49s |
| <(9,2)[2,3](9,2),7> | 39,488s | 22,862s | 1,871s | 47s |
| <(9,2)[2,3](9,2),15> | 21,274s | 22,739s | 1,865s | 47s |
Elapsed-time comparison of RISOTTO, EXMOTIF, and MoTeX-II using a small-scale real dataset. The input dataset consists of 250 upstream sequences of Homo sapiens genes of total size 250 KB.
Elapsed-time comparison of RISOTTO, EXMOTIF, and MoTeX-II using a medium-scale real dataset
| Motif | RISOTTO | EXMOTIF | MoTeX-II |
|---|---|---|---|
| <(8,1)[3,5](8,1),10> | 1,015s | ** | 6,853s |
| <(8,1)[3,5](8,1),20> | 423s | ** | 6,848s |
| <(8,1)[3,5](10,3),10> | * | ** | 6,865s |
| <(8,1)[3,5](10,3),20> | 41,310s | ** | 6,915s |
| <(10,3)[3,5](8,1),10> | 492,282s | ** | 7,002s |
| <(10,3)[3,5](8,1),20> | * | ** | 6,976s |
| <(10,3)[3,5](10,3),10> | * | ** | 7,008s |
| <(10,3)[3,5](10,3),20> | * | ** | 7,005s |
*The programme did not terminate after one week of execution.
**The programme was terminated by a segmentation fault.
Elapsed-time comparison of RISOTTO, EXMOTIF, and MoTeX-II using the full upstream yeast genes dataset. The input dataset consists of 5,796 upstream sequences of total size 3.7 MB.
Elapsed-time comparison of RISOTTO, EXMOTIF, and MoTeX-II using a large-scale real dataset
| Motif | RISOTTO | EXMOTIF | MoTeX-II |
|---|---|---|---|
| <(8,1)[2,3](9,2)[3,5](10,3),5> | * | * | 12,068s |
| <(8,1)[2,3](10,3)[3,5](9,2),5> | * | * | 12,371s |
| <(9,2)[2,3](8,1)[3,5](10,3),5> | * | * | 11,953s |
| <(9,2)[2,3](10,3)[3,5](8,1),5> | * | * | 12,095s |
| <(10,3)[2,3](8,1)[3,5](9,2),5> | * | * | 12,035s |
| <(10,3)[2,3](9,2)[3,5](8,1),5> | * | * | 11,729s |
*The programme did not terminate after one week of execution.
Elapsed-time comparison of RISOTTO, EXMOTIF, and MoTeX-II using the full upstream Homo sapiens genes dataset. The input dataset consists of 19,535 upstream sequences of total size 22.2 MB.
Extraction of transcription factors for the Zinc factors by EXMOTIF and MoTeX-II
| TF name | Extracted motifs | Ranking | Extracted motifs | Ranking |
|---|---|---|---|---|
| GAL4 | | | | |
| GAL4 chips | 1634(3346) | 1/1 | 1634(3346) | 1/1 |
| CAT8 | 1621(3356) | 451/73 | 1621(3356) | 359/51 |
| HAP1 | 1621(3356) | 84/96 | 1621(3356) | 73/85 |
| LEU3 | 1588(3366) | 2/2 | 1588(3366) | 1/2 |
| LYS | 1605(3360) | 39/25 | 1605(3360) | 32/17 |
| PPR1 | 1621(3356) | 1/2 | 1621(3356) | 1/2 |
| PUT3 | 727(4035) | 1/1 | 727(4035) | 1/1 |
TF name stands for transcription factor name; Known Motif stands for the known binding sites corresponding to the transcription factors in TF name column; Predicted Motif stands for the motifs extracted by EXMOTIF and MoTeX-II, respectively; Extracted motifs gives the final (original) number of motifs extracted (original is before excluding the motifs that are also valid in the ORF regions); Ranking stands for the z-score ranking based on support/weighted support.