Qiang Yu, Hongwei Huo, Dazheng Feng.
Abstract
Identifying conserved patterns in DNA sequences, known as motif discovery, is an important and challenging computational task. High-throughput sequencing data sets, which contain hundreds of sequences or more, help improve the identification accuracy of motif discovery but demand higher computing performance. To identify motifs efficiently in large DNA data sets, a new algorithm called PairMotifChIP is proposed, which extracts and combines pairs of l-mers in the input that have a relatively small Hamming distance. In particular, a method for rapidly extracting pairs of l-mers is designed, which can be used not only in PairMotifChIP but also in other DNA data mining tasks with the same requirement. Experimental results on simulated data show that the proposed algorithm finds motifs successfully and runs faster than state-of-the-art motif discovery algorithms. The validity of the proposed algorithm has also been verified on real data.
Year: 2016 PMID: 27843946 PMCID: PMC5098105 DOI: 10.1155/2016/4986707
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Algorithm 1. PairMotifChIP.
Figure 1. An example for extracting pairs of l-mers.
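The pair-extraction step named in the abstract can be illustrated with a brute-force sketch. This is not the rapid extraction method of the paper, whose details are not given here; it only shows what "pairs of l-mers with relatively small Hamming distance" means. The function names, the threshold parameter `k`, and the choice to compare only l-mers from different sequences are assumptions made for illustration.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def lmers(seq: str, l: int):
    """Yield (start, l-mer) for every window of length l in seq."""
    for i in range(len(seq) - l + 1):
        yield i, seq[i:i + l]


def extract_pairs(sequences, l: int, k: int):
    """Brute-force extraction of l-mer pairs with Hamming distance <= k.

    Assumption: only l-mers taken from two different sequences are paired.
    Returns tuples (seq_i, pos_i, seq_j, pos_j, distance).
    """
    pairs = []
    for (i, s), (j, t) in combinations(enumerate(sequences), 2):
        for p, x in lmers(s, l):
            for q, y in lmers(t, l):
                dist = hamming(x, y)
                if dist <= k:
                    pairs.append((i, p, j, q, dist))
    return pairs
```

The nested loops compare every l-mer of one sequence with every l-mer of another, so the cost grows with the square of the total input size. That quadratic cost is what makes a faster extraction method worthwhile on data sets with hundreds of sequences.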
Data for probabilistic analysis.
| | | | | | | occ | occ |
|---|---|---|---|---|---|---|---|
| 8 | 1 | 0 | 0.57 | 284715 | 12784 | 2.95 | 32.00 |
| 9 | 2 | 0 | 0.14 | 69930 | 511 | 0.73 | 1.28 |
| 10 | 2 | 0 | 0.04 | 19980 | 511 | 0.19 | 1.28 |
| 11 | 3 | 1 | 0.29 | 144855 | 3577 | 1.54 | 8.95 |
| 12 | 3 | 1 | 0.08 | 39960 | 3196 | 0.42 | 8.00 |
| 13 | 4 | 2 | 0.39 | 194805 | 4581 | 2.08 | 11.46 |
| 14 | 4 | 2 | 0.11 | 54945 | 4059 | 0.60 | 10.16 |
| 15 | 5 | 3 | 0.43 | 214785 | 5274 | 2.30 | 13.20 |
| 16 | 5 | 3 | 0.13 | 64935 | 4584 | 0.70 | 11.47 |
Running time on the first group of simulated data sets.
| | | PairMotifChIP | MEME-ChIP | F-motif | PairMotif+ | qPMS9 |
|---|---|---|---|---|---|---|
| 9 | 0.2 | 26.3 s | 1510.1 s | 9.2 s | 300.7 s | 247.4 s |
| 9 | 0.5 | 21.3 s | 1507.1 s | 9.2 s | 212.9 s | 234.7 s |
| 9 | 0.8 | 18.7 s | 1462.6 s | 9.1 s | 217.9 s | 226.0 s |
| 15 | 0.2 | 35.9 s | 1325.0 s | 16655.1 s | 73048.5 s | — |
| 15 | 0.5 | 25.6 s | 1354.9 s | 16403.4 s | 23549.0 s | — |
| 15 | 0.8 | 19.5 s | 1466.7 s | 15982.7 s | 845.6 s | — |
| 21 | 0.2 | 47.4 s | 1425.5 s | — | — | — |
| 21 | 0.5 | 30.7 s | 1148.5 s | — | — | — |
| 21 | 0.8 | 20.5 s | 1349.2 s | — | — | — |
Note. s: seconds; —: over 24 hours.
Site-level identification accuracy on the first group of simulated data sets.
| | | PairMotifChIP | MEME-ChIP | F-motif | PairMotif+ |
|---|---|---|---|---|---|
| 9 | 0.2 | 0.942 | 0.866 | 0.942 | 0.942 |
| 9 | 0.5 | 0.902 | 0.734 | 0.902 | 0.902 |
| 9 | 0.8 | 0.907 | n/a | 0.907 | 0.907 |
| 15 | 0.2 | 0.995 | 0.960 | 0.995 | 0.995 |
| 15 | 0.5 | 0.969 | 0.916 | 0.969 | 0.969 |
| 15 | 0.8 | 0.936 | n/a | 0.936 | 0.936 |
| 21 | 0.2 | 1.000 | 0.947 | — | — |
| 21 | 0.5 | 0.988 | 0.953 | — | — |
| 21 | 0.8 | 0.981 | 0.844 | — | — |
Note. —: the result is not obtained because the running time is over 24 hours; n/a: the result is not obtained because motif sites are not provided by MEME-ChIP on the corresponding data sets. The site-level identification accuracy is evaluated by the site-level performance coefficient sPC. Since qPMS9 and F-motif report the same motifs and have the same identification accuracy, the results of qPMS9 are not listed in this table.
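As a rough illustration of how a site-level performance coefficient can be computed, the sketch below assumes the commonly used definition sPC = sTP / (sTP + sFN + sFP), where a known site counts as found when some predicted site overlaps it. The site representation `(sequence index, start, length)` and the `min_overlap` parameter are illustrative assumptions; the exact overlap criterion used for the table is not stated here.

```python
def overlap(a, b) -> int:
    """Overlap length of two sites given as (seq_index, start, length)."""
    seq_a, start_a, len_a = a
    seq_b, start_b, len_b = b
    if seq_a != seq_b:
        return 0
    return max(0, min(start_a + len_a, start_b + len_b) - max(start_a, start_b))


def site_performance_coefficient(known, predicted, min_overlap: int = 1) -> float:
    """sPC = sTP / (sTP + sFN + sFP) under the assumed definition.

    sTP: known sites overlapped by at least one predicted site.
    sFN: known sites overlapped by no predicted site.
    sFP: predicted sites overlapping no known site.
    """
    s_tp = sum(
        any(overlap(k, p) >= min_overlap for p in predicted) for k in known
    )
    s_fn = len(known) - s_tp
    s_fp = sum(
        not any(overlap(k, p) >= min_overlap for k in known) for p in predicted
    )
    denom = s_tp + s_fn + s_fp
    return s_tp / denom if denom else 0.0
```

With three known sites, two of which are overlapped by predictions, and one stray prediction, the coefficient is 2 / (2 + 1 + 1) = 0.5.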
Nucleotide-level identification accuracy on the first group of simulated data sets.
| | | PairMotifChIP | | | MEME-ChIP | | | F-motif | | | PairMotif+ | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | nCC | nSn | nSp | nCC | nSn | nSp | nCC | nSn | nSp | nCC | nSn | nSp |
| 9 | 0.2 | 0.969 | 0.970 | 0.999 | 0.927 | 0.899 | 0.999 | 0.969 | 0.970 | 0.999 | 0.969 | 0.970 | 0.999 |
| 9 | 0.5 | 0.947 | 0.949 | 0.998 | 0.849 | 0.762 | 0.999 | 0.947 | 0.949 | 0.998 | 0.947 | 0.949 | 0.998 |
| 9 | 0.8 | 0.921 | 0.929 | 0.996 | n/a | n/a | n/a | 0.950 | 0.951 | 0.998 | 0.950 | 0.951 | 0.998 |
| 15 | 0.2 | 0.997 | 0.997 | 1.000 | 0.978 | 0.997 | 0.998 | 0.997 | 0.997 | 1.000 | 0.997 | 0.997 | 1.000 |
| 15 | 0.5 | 0.983 | 0.984 | 0.999 | 0.952 | 0.950 | 0.997 | 0.983 | 0.984 | 0.999 | 0.983 | 0.984 | 0.999 |
| 15 | 0.8 | 0.965 | 0.967 | 0.998 | n/a | n/a | n/a | 0.965 | 0.967 | 0.998 | 0.965 | 0.967 | 0.998 |
| 21 | 0.2 | 1.000 | 1.000 | 1.000 | 0.969 | 1.000 | 0.994 | — | — | — | — | — | — |
| 21 | 0.5 | 0.993 | 0.994 | 0.999 | 0.972 | 0.955 | 0.996 | — | — | — | — | — | — |
| 21 | 0.8 | 0.989 | 0.990 | 0.999 | 0.921 | 0.906 | 0.996 | — | — | — | — | — | — |
Note. —: the result is not obtained because the running time is over 24 hours; n/a: the result is not obtained because motif sites are not provided by MEME-ChIP on the corresponding data sets. Since qPMS9 and F-motif report the same motifs and have the same identification accuracy, the results of qPMS9 are not listed in this table. Besides the nucleotide-level identification accuracy nCC, the sensitivity nSn and specificity nSp are also listed in this table.
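The three nucleotide-level measures can be sketched from per-nucleotide counts, assuming the standard definitions: nSn = TP / (TP + FN), nSp = TN / (TN + FP), and nCC as the correlation coefficient (TP·TN − FN·FP) / sqrt((TP + FN)(TN + FP)(TP + FP)(TN + FN)). The function name and the representation of positions as sets of `(sequence index, position)` tuples are assumptions for illustration.

```python
import math


def nucleotide_metrics(known: set, predicted: set, total: int):
    """Return (nCC, nSn, nSp) from sets of (seq_index, position) tuples.

    known:     nucleotide positions covered by true motif sites
    predicted: nucleotide positions covered by predicted motif sites
    total:     total number of nucleotides in the data set
    """
    tp = len(known & predicted)
    fp = len(predicted - known)
    fn = len(known - predicted)
    tn = total - tp - fp - fn

    n_sn = tp / (tp + fn) if tp + fn else 0.0
    n_sp = tn / (tn + fp) if tn + fp else 0.0
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    n_cc = (tp * tn - fn * fp) / denom if denom else 0.0
    return n_cc, n_sn, n_sp
```

Because motif sites cover only a small fraction of each sequence, TN dominates the counts, which is consistent with nSp staying close to 1 in every row of the table while nCC and nSn vary more.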
Running time and identification accuracy on the second group of simulated data sets.
| | Number of sequences | PairMotifChIP | F-motif | PairMotif+ |
|---|---|---|---|---|
| 9 | 500 | 14.4 s (0.955) | 7.8 s (0.955) | 68.1 s (0.628) |
| 9 | 1000 | 60.6 s (0.945) | 17.2 s (0.945) | 410.1 s (0.945) |
| 9 | 1500 | 133.3 s (0.953) | 27.8 s (0.953) | 989.9 s (0.953) |
| 9 | 2000 | 231.4 s (0.953) | 40.0 s (0.953) | 1704.1 s (0.953) |
| 9 | 2500 | 361.4 s (0.951) | 52.8 s (0.951) | 3012.7 s (0.951) |
| 9 | 3000 | 519.2 s (0.955) | 67.2 s (0.955) | 4307.4 s (0.955) |
| 15 | 500 | 17.9 s (0.986) | 13581.7 s (0.986) | 14394.4 s (0.986) |
| 15 | 1000 | 74.8 s (0.983) | 30293.2 s (0.983) | 35172.2 s (0.983) |
| 15 | 1500 | 150.9 s (0.980) | 50102.5 s (0.980) | — |
| 15 | 2000 | 253.0 s (0.981) | 66344.7 s (0.981) | — |
| 15 | 2500 | 396.9 s (0.982) | — | — |
| 15 | 3000 | 554.4 s (0.981) | — | — |
| 21 | 500 | 22.9 s (0.995) | — | — |
| 21 | 1000 | 90.5 s (0.996) | — | — |
| 21 | 1500 | 171.6 s (0.995) | — | — |
| 21 | 2000 | 277.2 s (0.995) | — | — |
| 21 | 2500 | 423.8 s (0.996) | — | — |
| 21 | 3000 | 592.2 s (0.996) | — | — |
Note. s: seconds; —: over 24 hours. The number after each running time is the corresponding nucleotide-level identification accuracy nCC.
Running time of methods for extracting pairs of l-mers.
| Number of sequences | Method in this paper | Method in [ |
|---|---|---|
| 200 | 2.2 s | 23.6 s |
| 400 | 8.7 s | 96.1 s |
| 600 | 19.7 s | 197.9 s |
| 800 | 34.7 s | 331.7 s |
| 1000 | 54.3 s | 518.1 s |
| 1200 | 78.4 s | 741.6 s |
| 1400 | 109.0 s | 1015.4 s |
| 1600 | 140.5 s | 1334.1 s |
| 1800 | 178.3 s | 1731.2 s |
| 2000 | 223.2 s | 2163.5 s |
Note. s: seconds.
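The running times in this table can be summarized with a little arithmetic: the ratio between the two methods at each input size, and the slope of running time against number of sequences on a log-log scale. The sketch below uses only the values listed in the table; the variable names and the least-squares fit are illustrative choices, not part of the original analysis.

```python
import math

# Values copied from the table above (running times in seconds).
n_seqs = [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]
this_paper_s = [2.2, 8.7, 19.7, 34.7, 54.3, 78.4, 109.0, 140.5, 178.3, 223.2]
compared_s = [23.6, 96.1, 197.9, 331.7, 518.1, 741.6, 1015.4, 1334.1, 1731.2, 2163.5]


def loglog_slope(xs, ys) -> float:
    """Least-squares slope of log(y) against log(x).

    A slope near 2 indicates roughly quadratic growth.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den


# Per-row speedup of the method in this paper over the compared method.
speedups = [c / t for c, t in zip(compared_s, this_paper_s)]

for n, s in zip(n_seqs, speedups):
    print(f"{n:5d} sequences: speedup {s:.1f}x")
print(f"slope (this paper): {loglog_slope(n_seqs, this_paper_s):.2f}")
print(f"slope (compared):   {loglog_slope(n_seqs, compared_s):.2f}")
```

On these values the speedup stays between about 9x and 11x at every input size, and both slopes are close to 2. The gain therefore looks like a roughly constant factor rather than a change in how running time scales with the number of sequences.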
Results on the mESC data.
| Data set | Published motif | PairMotifChIP time | PairMotifChIP predicted motif | PairMotif+ time | PairMotif+ predicted motif |
|---|---|---|---|---|---|
| c-Myc | | 37.2 s | | 4106.1 s | |
| CTCF | | 29.1 s | | 23584.3 s | |
| Esrrb | | 25.6 s | | 7424.6 s | |
| Klf4 | | 29.3 s | | 3558.5 s | |
| Nanog | | 24.3 s | | 1975.6 s | |
| n-Myc | | 36.3 s | | 33962.6 s | — |
| Oct4 | | 8.9 s | | 2608.8 s | |
| Smad1 | | 20.3 s | | 5296.1 s | — |
| Sox2 | | 23.1 s | | 4115.2 s | |
| STAT3 | | 22.9 s | | 6342.6 s | — |
| Tcfcp2l1 | | 23.5 s | | 2269.5 s | — |
| Zfx | | 42.2 s | | 3617.2 s | |
Note. —: there is no motif overlapping the published motif in the top ten predicted motifs.