Jader M Caldonazzo Garbelini, André Y Kashiwabara, Danilo S Sanches.
Abstract
BACKGROUND: De novo prediction of Transcription Factor Binding Sites (TFBS) using computational methods is a difficult and important problem in Bioinformatics. Correctly recognizing TFBS plays an important role in understanding the mechanisms of gene regulation and helps in the development of new drugs.
Keywords: Evolutionary algorithms; Heuristics; Memetic algorithms; Motif; Transcription factor binding sites
Year: 2018 PMID: 29298679 PMCID: PMC5751424 DOI: 10.1186/s12859-017-2005-1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 MFMD pipeline. (1) In the Preprocessing step, MFMD uses DUST to remove low-complexity (low-entropy) sub-sequences. If DUST cannot be run, MFMD uses an objective function defined in [21] to mitigate this problem. (2) In the Pattern Discovery step, MFMD attempts to find the best PSSM using the GRASP and VNS heuristics. (3) In the Pattern Matching step, MFMD uses the PSSM found in the previous step to predict the initial positions of the motifs in the dataset.
child c2. The mutation is performed through the following rule (Eq. 5):
Fig. 2 a Sequence dataset. b Splitting the dataset into w-mers (w=13). For each window, the score is calculated using the PSSM found in the Pattern Discovery step. c Transformation of the scores into z-scores. d The p-values are calculated from the z-scores. A cut-off point can be used to sort new motifs.
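
The scanning procedure in panels b to d can be sketched in Python as follows. This is a minimal illustration, not MFMD's implementation: the function names `score_window` and `scan`, the dictionary-based PSSM layout, and the use of a one-sided standard normal tail to turn z-scores into p-values are all assumptions made for the example, since the caption does not say how the p-values are obtained.

```python
import math

def score_window(window, pssm):
    """Sum the PSSM entries selected by each base of the window."""
    return sum(pssm[i][base] for i, base in enumerate(window))

def scan(sequences, pssm, cutoff=0.05):
    """Score every w-mer, convert scores to z-scores, then to p-values.

    Assumption: p-values are the upper tail of a standard normal.
    Returns (sequence index, start position, p-value) for p < cutoff.
    """
    w = len(pssm)
    # Enumerate every window of width w in every sequence.
    windows = [(s, i, seq[i:i + w])
               for s, seq in enumerate(sequences)
               for i in range(len(seq) - w + 1)]
    scores = [score_window(win, pssm) for _, _, win in windows]
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((x - mean) ** 2 for x in scores) / len(scores))
    hits = []
    for (s, i, _), x in zip(windows, scores):
        z = (x - mean) / std if std > 0 else 0.0
        p = 0.5 * math.erfc(z / math.sqrt(2))  # one-sided upper-tail p-value
        if p < cutoff:
            hits.append((s, i, p))
    return hits
```

A PSSM here is a list with one dictionary per motif position, mapping each base to its score, so `len(pssm)` plays the role of w. The cut-off of 0.05 is only a placeholder default.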
Fig. 3 a Motif parameters were calculated using a uniform background probability Pr(a)=Pr(c)=Pr(g)=Pr(t)=0.25 and pseudocounts=1. b Information Content calculated using the motif parameters. c Semi-greedy insertion. d Greedy insertion.
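
Panels a and b can be illustrated with a short Python sketch, offered as one plausible reading rather than the authors' code. It assumes that the motif parameters are position-wise base frequencies smoothed by adding the pseudocount to every base, and that Information Content is the relative entropy against the background, in bits. The function names and the input layout (a list of equal-length lowercase sites) are hypothetical.

```python
import math

BASES = "acgt"

def motif_parameters(sites, pseudocount=1.0):
    """Position-wise base probabilities with additive pseudocounts."""
    width = len(sites[0])
    n = len(sites)
    params = []
    for j in range(width):
        column = [site[j] for site in sites]
        params.append({
            b: (column.count(b) + pseudocount) / (n + 4 * pseudocount)
            for b in BASES
        })
    return params

def information_content(params, background=None):
    """Sum over positions and bases of p * log2(p / background[b])."""
    background = background or {b: 0.25 for b in BASES}
    return sum(p * math.log2(p / background[b])
               for column in params
               for b, p in column.items())
```

With a pseudocount of 1, no probability is ever zero, so the logarithm is always defined. A column in which all four bases are equally frequent contributes zero bits under the uniform background.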
Summary of JASPAR datasets
| ID | Name | Species | Number of sequences |
|---|---|---|---|
| MA0003.2 | TFAP2A | H. sapiens | 5098 |
| MA0036.2 | GATA2 | H. sapiens | 4380 |
| MA0037.2 | GATA3 | H. sapiens | 4628 |
| MA0050.2 | IRF1 | H. sapiens | 1362 |
| MA0150.2 | NFE2L2 | M. musculus | 726 |
Summary of real datasets experiments
| ID | Name | Site | Number of sequences | Number of motifs |
|---|---|---|---|---|
| CREB | cAMP Response Element | ABS | 17 | 19 |
| HNF-1 | Hepatocyte Nuclear Factor-1 | ABS | 22 | 27 |
| MEF2 | Myocyte Enhancer Factor-2 | ABS | 17 | 17 |
| MyoD | Myogenic Differentiation-1 | ABS | 17 | 21 |
| NF-kB | NF Kappa-Light-Chain-Enhancer | ABS | 6 | 8 |
| SRF | Serum Response Factor | ABS | 20 | 36 |
| TBP | TATA-Binding Protein | ABS | 95 | 95 |
| PDR3 | Pleiotropic Drug Response | SCPD | 7 | 18 |
| REB1 | RNA polymerase I enhancer | SCPD | 15 | 20 |
| MCB | Mlu I cell cycle boxes | SCPD | 6 | 12 |
| CRP | cAMP Receptor Protein | Stormo and Hartzell | 18 | 24 |
Results achieved by predictors in JASPAR datasets
| Dataset | Predictor | Precision | Recall | F-Score |
|---|---|---|---|---|
| GATA2 | MFMD | 0.968±0.011 | 0.972±0.021 | 0.970±0.057 |
| | MEME | 0.948 | 0.948 | 0.948 |
| | GIBBS | 0.826 | 0.188 | 0.307 |
| GATA3 | MFMD | 0.971±0.015 | 0.965±0.011 | 0.968±0.019 |
| | MEME | 0.965 | 0.965 | 0.965 |
| | GIBBS | 0.440 | 0.094 | 0.156 |
| IRF1 | MFMD | 0.829±0.018 | 0.835±0.023 | 0.832±0.022 |
| | MEME | 0.903 | 0.903 | 0.903 |
| | GIBBS | 0.695 | 0.510 | 0.588 |
| NFE2L2 | MFMD | 0.879±0.011 | 0.881±0.031 | 0.880±0.041 |
| | MEME | 0.866 | 0.866 | 0.866 |
| | GIBBS | 0.754 | 0.754 | 0.754 |
| TFAP2A | MFMD | 0.951±0.013 | 0.949±0.070 | 0.950±0.010 |
| | MEME | 0.515 | 0.515 | 0.515 |
| | GIBBS | 0.950 | 0.186 | 0.311 |
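
The F-Score column is consistent with the usual harmonic mean of precision and recall (the F1 score). The table does not restate the formula, so treating it as F1 is an assumption here. The short Python check below, with a hypothetical helper `f_score`, recomputes two rows of the table from their precision and recall.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1); 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall copied from the JASPAR results table.
print(round(f_score(0.950, 0.186), 3))  # TFAP2A, GIBBS: table reports 0.311
print(round(f_score(0.948, 0.948), 3))  # GATA2, MEME: table reports 0.948
```

Because the table lists rounded precision and recall, a recomputed F-Score can differ from the printed one in the last digit for some rows.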
Results achieved by predictors in real datasets experiments
| Dataset | Predictor | Precision | Recall | F-Score |
|---|---|---|---|---|
| CREB | MFMD | 0.647±0.024 | 0.578±0.044 | 0.611±0.031 |
| | MEME | | | |
| | GIBBS | 0.529 | 0.473 | 0.500 |
| CRP | MFMD | 0.909±0.039 | 0.833±0.033 | 0.869±0.027 |
| | MEME | 0.904 | 0.791 | 0.844 |
| | GIBBS | 0.941 | 0.666 | 0.780 |
| HNF1 | MFMD | 0.772±0.013 | 0.629±0.032 | 0.693±0.019 |
| | MEME | 0.136 | 0.111 | 0.122 |
| | GIBBS | 0.500 | 0.222 | 0.307 |
| MCB | MFMD | 0.999±0.030 | 0.667±0.042 | 0.800±0.030 |
| | MEME | 0.692 | 0.750 | 0.719 |
| | GIBBS | 0.750 | 0.750 | 0.750 |
| MEF2 | MFMD | 0.700±0.033 | 0.823±0.030 | 0.756±0.024 |
| | MEME | 0.705 | 0.705 | 0.705 |
| | GIBBS | 0.176 | 0.176 | 0.176 |
| MYOD | MFMD | 0.363±0.016 | 0.380±0.024 | 0.372±0.018 |
| | MEME | 0.235 | 0.190 | 0.210 |
| | GIBBS | 0.208 | 0.238 | 0.222 |
| NFKB | MFMD | 0.667±0.040 | 0.500±0.099 | 0.571±0.062 |
| | MEME | | | |
| | GIBBS | 0.667 | 0.500 | 0.571 |
| PDR3 | MFMD | 0.850±0.035 | 0.944±0.046 | 0.894±0.034 |
| | MEME | 0.653 | 0.944 | 0.772 |
| | GIBBS | 0.928 | 0.722 | 0.812 |
| REB1 | MFMD | 0.800±0.027 | 0.600±0.025 | 0.685±0.021 |
| | MEME | 0.333 | 0.350 | 0.341 |
| | GIBBS | 0.266 | 0.200 | 0.228 |
| SRF | MFMD | 0.477±0.007 | 0.583±0.014 | 0.525±0.008 |
| | MEME | 0.440 | 0.611 | 0.511 |
| | GIBBS | 0.514 | 0.500 | 0.507 |
| TBP | MFMD | 0.657±0.004 | 0.768±0.008 | 0.708±0.006 |
| | MEME | 0.578 | 0.578 | 0.578 |
| | GIBBS | 0.308 | 0.347 | 0.326 |
Some predictors failed to score in these experiments because the initial positions they found had a deviation greater than 2. These data are highlighted in bold.
Wins and losses in JASPAR and real datasets experiments
| Predictor | Dataset | Wins | Losses | Total (wins - losses) |
|---|---|---|---|---|
| MFMD | JASPAR | 9 | 1 | 8 |
| | Real | 21 | 0 | 21 |
| MEME | JASPAR | 6 | 4 | 2 |
| | Real | 5 | 17 | -12 |
| GIBBS | JASPAR | 0 | 10 | -10 |
| | Real | 6 | 15 | -9 |
Ranking of algorithms according to Table 5 (from best to worst)
| Experiments | 1st | 2nd | 3rd |
|---|---|---|---|
| JASPAR datasets | MFMD | MEME | GIBBS |
| Real datasets experiments | MFMD | GIBBS | MEME |
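
The Total column of the wins-and-losses table equals wins minus losses, and sorting the predictors by that net score reproduces the ranking above. The Python sketch below recomputes both from the tabulated counts. The dictionary layout and the function name `rank_predictors` are illustrative choices, and ranking by net score is an assumption about how the ordering was derived.

```python
# Wins and losses copied from the wins-and-losses table.
RESULTS = {
    "JASPAR": {"MFMD": (9, 1), "MEME": (6, 4), "GIBBS": (0, 10)},
    "Real": {"MFMD": (21, 0), "MEME": (5, 17), "GIBBS": (6, 15)},
}

def rank_predictors(counts):
    """Order predictors by net score (wins - losses), best first."""
    net = {name: wins - losses for name, (wins, losses) in counts.items()}
    return sorted(net, key=net.get, reverse=True), net

for group, counts in RESULTS.items():
    order, net = rank_predictors(counts)
    print(group, order, net)
```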
Statistical test between MFMD vs GIBBS and MFMD vs MEME approaches
| Type | Group/Dataset | Approach | Result (value) | Result (sign) | Approach | Result (value) | Result (sign) |
|---|---|---|---|---|---|---|---|
| ChIP | GATA2 | MFMD vs GIBBS | 2.2 | + | MFMD vs MEME | 1.327 | + |
| | GATA3 | MFMD vs GIBBS | 2.2 | + | MFMD vs MEME | 0.1599 | = |
| | IRF1 | MFMD vs GIBBS | 2.2 | + | MFMD vs MEME | 2.200 | - |
| | NFE2L2 | MFMD vs GIBBS | 2.2 | + | MFMD vs MEME | 0.0476 | + |
| | TFAP2A | MFMD vs GIBBS | 2.2 | + | MFMD vs MEME | 2.200 | + |
| Real | SRF | MFMD vs GIBBS | 3.736 | + | MFMD vs MEME | 1.401 | + |
| | TBP | MFMD vs GIBBS | 2.2 | + | MFMD vs MEME | 2.200 | + |
Legend: "+" there is a statistical difference (MFMD better); "=" there is no difference; "-" there is a statistical difference (MFMD worse).
Fig. 4 Comparison between real logos and logos found by MFMD in ChIP-seq datasets. a TFAP2A real Logo. b TFAP2A MFMD Logo. c GATA2 real Logo. d GATA2 MFMD Logo. e GATA3 real Logo. f GATA3 MFMD Logo. g IRF1 real Logo. h IRF1 MFMD Logo. i NFE2L2 real Logo. j NFE2L2 MFMD Logo
Fig. 5 Comparison between real logos and logos found by MFMD in real datasets. a CRP real Logo. b CRP MFMD Logo. c MYOD real Logo. d MYOD MFMD Logo. e TBP real Logo. f TBP MFMD Logo. g PDR3 real Logo. h PDR3 MFMD Logo