Qiang Yu, Dingbang Wei, Hongwei Huo.
Abstract
BACKGROUND: Given a set of t DNA sequences of length n, a quorum q satisfying 0 < q ≤ 1, and integers l and d satisfying 0 ≤ d < l < n, quorum planted motif search (qPMS) finds the l-length strings that occur in at least qt input sequences with up to d mismatches. It is mainly used to locate transcription factor binding sites in DNA sequences. Existing qPMS algorithms efficiently process small standard datasets (e.g., t = 20 and n = 600), but they are too time-consuming for large DNA datasets, such as ChIP-seq datasets that contain thousands of sequences or more.
Keywords: Quorum planted motif search; Sample sequences; Transcription factor binding sites
Year: 2018 PMID: 29914360 PMCID: PMC6006848 DOI: 10.1186/s12859-018-2242-y
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
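The qPMS definition in the abstract can be made concrete with a brute-force sketch. This is only an illustration of the problem statement, not any of the algorithms evaluated in the paper; the function names (`hamming`, `occurs_in`, `qpms_bruteforce`) are hypothetical, and enumerating all 4^l candidates is feasible only for very small l.

```python
from itertools import product
from math import ceil


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs_in(motif: str, seq: str, d: int) -> bool:
    """True if some l-length window of seq is within d mismatches of motif."""
    l = len(motif)
    return any(hamming(motif, seq[i:i + l]) <= d
               for i in range(len(seq) - l + 1))


def qpms_bruteforce(seqs, l, d, q):
    """Return all l-length strings occurring in at least q*t sequences.

    Assumption: "at least qt sequences" is read as ceil(q * t).
    """
    need = ceil(q * len(seqs))
    hits = []
    # Candidates are generated in lexicographic order over A, C, G, T.
    for cand in map("".join, product("ACGT", repeat=l)):
        if sum(occurs_in(cand, s, d) for s in seqs) >= need:
            hits.append(cand)
    return hits
```

For example, `qpms_bruteforce(["ACGTAC", "TTACGA", "GGGGGG", "CCCCCC"], 3, 0, 0.5)` reports the 3-mers that appear exactly in at least two of the four toy sequences.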
Notations used in this paper
| Notation | Explanation |
|---|---|
| | The length of a string or the size of a set. |
| Σ | The DNA alphabet, Σ = {A, C, G, T}. |
| | An |
| | The |
| | A substring of the string |
| | The concatenation of two strings |
| | The string |
| | The string |
| | Notations for the input. |
| | Notations for the output. |
| | The count (number of occurrences) of a string |
| | The count (number of occurrences) of a string |
| | The Hamming distance between two strings |
| | The set of |
| | The integer obtained by conversion from a string |
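The last notation, an integer obtained by conversion from a string over Σ, can be sketched as a base-4 encoding. The digit assignment A=0, C=1, G=2, T=3 and the names `to_int` and `from_int` are assumptions made for illustration; the notation table does not state which mapping the paper uses.

```python
# Assumed digit assignment; the source does not specify the mapping.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
LETTERS = "ACGT"


def to_int(s: str) -> int:
    """Interpret a DNA string as a base-4 number, most significant base first."""
    value = 0
    for ch in s:
        value = value * 4 + CODE[ch]
    return value


def from_int(value: int, length: int) -> str:
    """Inverse of to_int for a string of known length."""
    chars = []
    for _ in range(length):
        value, digit = divmod(value, 4)
        chars.append(LETTERS[digit])
    return "".join(reversed(chars))
```

With such an encoding, a w-length substring can index directly into an array of 4^w counters instead of a hash table.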
Real datasets selected from the ENCODE TF ChIP-seq data
| Dataset | Motif | (l, d) | t | q |
|---|---|---|---|---|
| egr1 | CCGCCCCCGCA | (11, 3) | 15,400 | 0.68 |
| elf1 | AACCCGGAAGT | (11, 3) | 8611 | 0.54 |
| hnf4 | GGGTCAAAGTCCA | (13, 4) | 11,045 | 0.53 |
| myc | ACCACGTGCTC | (11, 3) | 4542 | 0.49 |
| nfy | ACTAACCAATCAG | (13, 4) | 9781 | 0.44 |
| sp1 | GGGGCGGGG | (9, 2) | 14,779 | 0.52 |
| srf | TGACCATATATGGTC | (15, 5) | 4903 | 0.36 |
| yy1 | CGGCCATCT | (9, 2) | 2077 | 0.49 |
Real datasets in the mESC data
| Dataset | Motif | (l, d) | t | q |
|---|---|---|---|---|
| c-Myc | GCACGTGGC | (9, 2) | 3422 | 0.60 |
| CTCF | CCACCAGGGGGCG | (13, 4) | 39,601 | 0.58 |
| Esrrb | GGTCAAGGTCA | (11, 3) | 21,644 | 0.54 |
| Klf4 | GGGTGTGGC | (9, 2) | 10,872 | 0.61 |
| Nanog | CCTTGTCATGC | (11, 3) | 10,342 | 0.26 |
| n-Myc | GCACGTGGC | (9, 2) | 7181 | 0.57 |
| Oct4 | CATTGTTATGCAAAT | (15, 5) | 3775 | 0.29 |
| Smad1 | CCTTTGTTATGCA | (13, 4) | 1126 | 0.36 |
| Sox2 | CATTGTTATGCAAAT | (15, 5) | 4525 | 0.39 |
| STAT3 | TTCCCGGAA | (9, 2) | 2546 | 0.61 |
| Tcfcp2I1 | CCGGTTCAAACCG | (13, 4) | 26,907 | 0.29 |
| Zfx | GCTAGGCCGCG | (11, 3) | 10,336 | 0.49 |
Fig. 1 Illustration of word count with mismatches. This figure shows an illustration of word count with up to k mismatches.
Fig. 2 Illustration of obtaining high-frequency substrings. This figure illustrates the process of obtaining high-frequency substrings. N and N are count(w-mer) for a background substring and a motif instance in the random case, respectively.
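The word count with up to k mismatches named in the Fig. 1 caption can be sketched as follows. This is a naive illustration under the assumption that the count is the total number of window positions, over all sequences, whose substring is within k mismatches of the word; the function name `count_with_mismatches` is hypothetical and the paper's own counting procedure may differ.

```python
def count_with_mismatches(word: str, seqs, k: int) -> int:
    """Count windows across all sequences within k mismatches of word."""
    w = len(word)
    total = 0
    for s in seqs:
        for i in range(len(s) - w + 1):
            mismatches = 0
            for a, b in zip(word, s[i:i + w]):
                if a != b:
                    mismatches += 1
                    if mismatches > k:
                        break  # too many mismatches, skip this window
            else:
                # Loop finished without break: window is within k mismatches.
                total += 1
    return total
```

For instance, `count_with_mismatches("ACG", ["ACGACG", "TCGAAG"], 1)` counts the two exact windows in the first sequence plus the windows `TCG` and `AAG` in the second.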
Fig. 3 Results on the ENCODE TF ChIP-seq data. This figure shows the results on the eight Homo sapiens datasets selected from the ENCODE TF ChIP-seq data.
Fig. 4 Results on the mESC data. This figure shows the results on the 12 mouse datasets in the mESC data.
Results of stpd qPMS algorithms on the first group of simulated datasets
| (l, d) | Conservation | SamSelect time | FMotif T | FMotif R | FMotif T′ | FMotif R′ | Speedup |
|---|---|---|---|---|---|---|---|
| (9, 2) | High | 33.0 s | 1.6 m | 1 | 1.2 s | 1 | 3 |
| (9, 2) | Intermediate | 17.0 s | 1.7 m | 1 | 0.7 s | 1 | 6 |
| (9, 2) | Low | 12.8 s | 1.7 m | 1 | 0.5 s | 1 | 8 |
| (11, 3) | High | 26.8 s | 21.1 m | 1 | 7.0 s | 1 | 37 |
| (11, 3) | Intermediate | 18.0 s | 21.1 m | 1 | 6.0 s | 1 | 53 |
| (11, 3) | Low | 13.0 s | 21.3 m | 1 | 5.7 s | 1 | 68 |
| (13, 4) | High | 28.8 s | 3.0 h | 1 | 1.0 m | 1.2 | 119 |
| (13, 4) | Intermediate | 20.2 s | 3.0 h | 1 | 1.0 m | 1 | 130 |
| (13, 4) | Low | 13.0 s | 3.4 h | 1 | 56.2 s | 1.2 | 174 |
| (15, 5) | High | 29.4 s | 37.7 h | 1 | 10.4 m | 1 | 208 |
| (15, 5) | Intermediate | 20.2 s | 34.1 h | 1 | 9.6 m | 1 | 207 |
| (15, 5) | Low | 13.0 s | 35.9 h | 1 | 10.5 m | 1 | 200 |
| (17, 6) | High | 29.4 s | N | N | 1.7 h | 1.2 | > 28 |
| (17, 6) | Intermediate | 19.8 s | N | N | 1.5 h | 1 | > 31 |
| (17, 6) | Low | 13.0 s | N | N | 1.3 h | 1 | > 36 |
| (19, 7) | High | 32.0 s | N | N | 17.3 h | 1 | > 3 |
| (19, 7) | Intermediate | 21.0 s | N | N | 15.9 h | 1 | > 3 |
| (19, 7) | Low | 12.8 s | N | N | 13.0 h | 1 | > 4 |

s seconds, m minutes, h hours, N no result because the running time exceeds 48 h; SamSelect time: running time of SamSelect; T and T′: running time of a qPMS algorithm on the original dataset D and the sample sequence sets D′, respectively; R and R′: the rank of the implanted motif among the identified motifs obtained on D and D′, respectively; speedup: T / (SamSelect time + T′)
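The speedup column can be recomputed from the time cells. The sketch below assumes the footnote's speedup is the running time on D divided by the sum of the SamSelect running time and the running time on D′, and that a cell of `N` stands for the 48 h limit, which turns the result into a lower bound. This reading reproduces the tabulated values; the helper names `to_seconds` and `speedup` are hypothetical.

```python
UNIT_SECONDS = {"s": 1.0, "m": 60.0, "h": 3600.0}
LIMIT_SECONDS = 48 * 3600.0  # runs longer than 48 h are reported as "N"


def to_seconds(cell: str) -> float:
    """Convert a table cell such as '1.6 m' or '33.0 s' to seconds."""
    value, unit = cell.split()
    return float(value) * UNIT_SECONDS[unit]


def speedup(t_full: str, t_samselect: str, t_sample: str):
    """Return (ratio, is_lower_bound) for one table row.

    t_full: time on the original dataset D, or 'N' if it exceeded 48 h.
    t_samselect: running time of SamSelect.
    t_sample: time on the sample sequence sets D'.
    """
    is_lower_bound = t_full.strip() == "N"
    numerator = LIMIT_SECONDS if is_lower_bound else to_seconds(t_full)
    ratio = numerator / (to_seconds(t_samselect) + to_seconds(t_sample))
    return ratio, is_lower_bound
```

For the (9, 2) high-conservation row, `speedup("1.6 m", "33.0 s", "1.2 s")` gives a ratio of about 2.8, which rounds to the tabulated 3.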
Results of stpd qPMS algorithms on the second group of simulated datasets
| q | SamSelect time | FMotif T | FMotif R | FMotif T′ | FMotif R′ | Speedup |
|---|---|---|---|---|---|---|
| 0.2 | 13.0 s | 2.4 m | 1 | 0.5 s | 1 | 11 |
| 0.3 | 13.2 s | 2.2 m | 1 | 0.5 s | 1 | 10 |
| 0.4 | 13.0 s | 2.0 m | 1 | 0.5 s | 1 | 9 |
| 0.5 | 13.0 s | 1.9 m | 1 | 0.6 s | 1 | 8 |
| 0.6 | 13.0 s | 1.2 m | 1 | 0.5 s | 1 | 5 |
| 0.7 | 14.0 s | 1.1 m | 1 | 0.5 s | 1 | 4 |
| 0.8 | 14.0 s | 1.0 m | 1 | 0.7 s | 1 | 4 |
| 0.9 | 14.0 s | 54.9 s | 1 | 0.5 s | 1 | 4 |

s seconds, m minutes; SamSelect time: running time of SamSelect; T and T′: running time of a qPMS algorithm on the original dataset D and the sample sequence sets D′, respectively; R and R′: the rank of the implanted motif among the identified motifs obtained on D and D′, respectively; speedup: T / (SamSelect time + T′)
Results on the simulated datasets of non-challenging (l, d) instances
| (l, d) | Conservation | SamSelect time | FMotif T | FMotif R | FMotif T′ | FMotif R′ | Speedup |
|---|---|---|---|---|---|---|---|
| (9, 1) | High | 27.6 s | 8.7 s | 1 | 0.1 s | 1 | < 1 |
| (9, 1) | Intermediate | 22.8 s | 7.6 s | 1 | 0.1 s | 1 | < 1 |
| (9, 1) | Low | 15.8 s | 7.7 s | 1 | 0.2 s | 1.2 | < 1 |
| (11, 2) | High | 21.2 s | 2.1 m | 1 | 1.1 s | 1 | 6 |
| (11, 2) | Intermediate | 15.4 s | 2.0 m | 1 | 0.9 s | 1 | 7 |
| (11, 2) | Low | 12.0 s | 2.1 m | 1 | 1.0 s | 1 | 10 |
| (13, 3) | High | 18.2 s | 18.0 m | 1 | 11.2 s | 1.2 | 37 |
| (13, 3) | Intermediate | 15.0 s | 18.2 m | 1 | 10.1 s | 1 | 43 |
| (13, 3) | Low | 12.0 s | 18.1 m | 1 | 9.7 s | 1 | 50 |
| (15, 4) | High | 18.2 s | 3.4 h | 1 | 1.7 m | 1.2 | 102 |
| (15, 4) | Intermediate | 15.0 s | 3.4 h | 1 | 1.6 m | 1 | 111 |
| (15, 4) | Low | 11.0 s | 3.3 h | 1 | 1.5 m | 1 | 116 |
| (17, 5) | High | 18.0 s | 37.5 h | 1 | 15.7 m | 1 | 141 |
| (17, 5) | Intermediate | 15.0 s | 40.6 h | 1 | 16.3 m | 1 | 147 |
| (17, 5) | Low | 10.8 s | 38.8 h | 1 | 13.8 m | 1 | 166 |
| (19, 6) | High | 21.0 s | N | N | 2.9 h | 1 | > 17 |
| (19, 6) | Intermediate | 16.2 s | N | N | 2.3 h | 1 | > 21 |
| (19, 6) | Low | 10.6 s | N | N | 2.1 h | 1 | > 22 |

s seconds, m minutes, h hours, N no result because the running time exceeds 48 h; SamSelect time: running time of SamSelect; T and T′: running time of a qPMS algorithm on the original dataset D and the sample sequence sets D′, respectively; R and R′: the rank of the implanted motif among the identified motifs obtained on D and D′, respectively; speedup: T / (SamSelect time + T′)
Results of spd qPMS algorithms on the first group of simulated datasets
| (l, d) | Conservation | SamSelect time | qPMS9 T | qPMS9 T′ | qPMS9 speedup | TravStrR T | TravStrR T′ | TravStrR speedup |
|---|---|---|---|---|---|---|---|---|
| (9, 2) | High | 33.0 s | N | 2.3 s | > 4895 | N | 0.3 s | > 5189 |
| (9, 2) | Intermediate | 17.0 s | N | 1.8 s | > 9191 | N | 0.2 s | > 10,047 |
| (9, 2) | Low | 12.8 s | N | 1.7 s | > 11,917 | 24.2 h | 0.1 s | 6766 |
| (11, 3) | High | 26.8 s | N | 3.5 s | > 5703 | N | 0.6 s | > 6307 |
| (11, 3) | Intermediate | 18.0 s | N | 3.1 s | > 8190 | N | 0.3 s | > 9443 |
| (11, 3) | Low | 13.0 s | N | 3.0 s | > 10,800 | N | 0.3 s | > 12,992 |
| (13, 4) | High | 28.8 s | N | 8.4 s | > 4645 | N | 2.8 s | > 5468 |
| (13, 4) | Intermediate | 20.2 s | N | 7.6 s | > 6216 | N | 1.9 s | > 7819 |
| (13, 4) | Low | 13.0 s | N | 7.0 s | > 8640 | N | 1.4 s | > 12,000 |
| (15, 5) | High | 29.4 s | N | 25.3 s | > 3159 | N | 29.5 s | > 2934 |
| (15, 5) | Intermediate | 20.2 s | N | 13.6 s | > 5112 | N | 10.6 s | > 5610 |
| (15, 5) | Low | 13.0 s | N | 12.5 s | > 6776 | N | 5.6 s | > 9290 |
| (17, 6) | High | 29.4 s | N | 9.1 m | > 300 | N | 6.4 m | > 415 |
| (17, 6) | Intermediate | 19.8 s | N | 47.8 s | > 2556 | N | 36.8 s | > 3053 |
| (17, 6) | Low | 13.0 s | N | 16.1 s | > 5938 | N | 14.0 s | > 6400 |
| (19, 7) | High | 32.0 s | N | N | N | N | 1.1 h | > 43 |
| (19, 7) | Intermediate | 21.0 s | N | 5.0 m | > 541 | N | 4.5 m | > 598 |
| (19, 7) | Low | 12.8 s | N | 30.7 s | > 3972 | N | 42.1 s | > 3148 |

s seconds, m minutes, h hours, N no result because the running time exceeds 48 h; SamSelect time: running time of SamSelect; T and T′: running time of a qPMS algorithm on the original dataset D and the sample sequence sets D′, respectively; speedup: T / (SamSelect time + T′)
Results of spd qPMS algorithms on the second group of simulated datasets
| q | SamSelect time | qPMS9 T | qPMS9 T′ | qPMS9 speedup | TravStrR T | TravStrR T′ | TravStrR speedup |
|---|---|---|---|---|---|---|---|
| 0.2 | 13.0 s | N | 1.3 s | > 12,084 | N | 0.1 s | > 13,191 |
| 0.3 | 13.2 s | N | 1.5 s | > 11,755 | N | 0.1 s | > 12,992 |
| 0.4 | 13.0 s | N | 1.7 s | > 11,755 | 41.8 h | 0.1 s | 11,490 |
| 0.5 | 13.0 s | N | 1.7 s | > 11,755 | 24.3 h | 0.1 s | 6671 |
| 0.6 | 13.0 s | 24.2 h | 1.7 s | 5919 | 11.2 h | 0.1 s | 3088 |
| 0.7 | 14.0 s | 7.2 h | 1.7 s | 1651 | 3.1 h | 0.1 s | 785 |
| 0.8 | 14.0 s | 1.5 h | 1.7 s | 338 | 1.5 h | 0.1 s | 377 |
| 0.9 | 14.0 s | 9.2 m | 1.7 s | 35 | 4.1 m | 0.1 s | 17 |

s seconds, m minutes, h hours, N no result because the running time exceeds 48 h; SamSelect time: running time of SamSelect; T and T′: running time of a qPMS algorithm on the original dataset D and the sample sequence sets D′, respectively; speedup: T / (SamSelect time + T′)
Results of spd qPMS algorithms on the third group of simulated datasets
| t | SamSelect time | qPMS9 T | qPMS9 T′ | qPMS9 speedup | TravStrR T | TravStrR T′ | TravStrR speedup |
|---|---|---|---|---|---|---|---|
| 3000 | 13.0 s | N | 1.7 s | > 11,755 | 24.3 h | 0.1 s | 6671 |
| 4000 | 14.0 s | N | 1.7 s | > 11,006 | N | 0.1 s | > 12,255 |
| 5000 | 15.0 s | N | 1.7 s | > 10,347 | N | 0.1 s | > 11,444 |
| 6000 | 15.8 s | N | 1.7 s | > 9874 | N | 0.1 s | > 10,868 |
| 7000 | 16.4 s | N | 1.7 s | > 9547 | N | 0.1 s | > 10,473 |
| 8000 | 17.0 s | N | 1.8 s | > 9191 | N | 0.1 s | > 10,105 |
| 9000 | 18.0 s | N | 1.7 s | > 8772 | N | 0.1 s | > 9547 |
| 10,000 | 18.8 s | N | 1.8 s | > 8388 | N | 0.2 s | > 9095 |

s seconds, h hours, N no result because the running time exceeds 48 h; SamSelect time: running time of SamSelect; T and T′: running time of a qPMS algorithm on the original dataset D and the sample sequence sets D′, respectively; speedup: T / (SamSelect time + T′)