Liangxin Gao, Wenzhen Bao, Hongbo Zhang, Chang-An Yuan, De-Shuang Huang.
Abstract
In both DNA and protein contexts, the position weight matrix (PWM) is an important method for modelling motifs in biological sequences. With the development of genome sequencing technology, the quantity of sequence data is increasing explosively, so faster search algorithms are needed to meet the growing demand. In this paper, we propose a method for speeding up the search for candidate transcription factor binding sites (TFBS). Users can specify a p-value threshold to get the desired trade-off between speed and sensitivity for a particular sequence analysis. Moreover, the proposed method can also be generalized to large-scale annotation and sequence projects.
Year: 2018 PMID: 29953448 PMCID: PMC6023231 DOI: 10.1371/journal.pone.0198922
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
A position weight matrix in the nucleotide sequence databases.
| | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| A | 0.14 | -4.16 | 1.03 | -4.16 | 0.58 | -0.36 |
| C | 0.17 | -2.31 | -4.16 | -4.16 | -2.31 | -1.32 |
| G | -1.06 | 1.64 | -2.32 | -0.85 | -1.06 | 1.12 |
| T | 0.12 | -4.16 | -2.64 | 1.18 | 0.07 | -0.77 |
The pattern was obtained from the JASPAR count matrix for GATA-3, which was transformed into a log-odds matrix using the background distribution qA = 0.278, qC = 0.312, qG = 0.212, qT = 0.198. A pseudocount q was first added to the counts for each alphabet symbol s.
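The count-to-log-odds transformation described in this note can be sketched as follows. This is a minimal illustration, not the authors' code: the base-2 logarithm, the pseudocount scheme (a total pseudocount mass `pseudo_total` split in proportion to the background), and the toy counts are all assumptions, since the record states neither the log base nor the pseudocount value.

```python
import math

# Background distribution quoted in the table note.
BACKGROUND = {"A": 0.278, "C": 0.312, "G": 0.212, "T": 0.198}


def log_odds_matrix(counts, background, pseudo_total=1.0):
    """Convert per-position symbol counts into a log-odds PWM.

    counts: list of dicts, one per motif position, mapping symbol -> count.
    pseudo_total: ASSUMED total pseudocount mass, split by background frequency.
    """
    pwm = []
    for column in counts:
        n = sum(column.values())
        scores = {}
        for s, q in background.items():
            # Smoothed frequency: observed count plus a background-weighted pseudocount.
            freq = (column.get(s, 0) + pseudo_total * q) / (n + pseudo_total)
            # Log-odds of the motif frequency against the background (base 2 is an assumption).
            scores[s] = math.log2(freq / q)
        pwm.append(scores)
    return pwm


# Hypothetical toy counts for two positions (NOT the JASPAR GATA-3 matrix).
toy_counts = [
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 2, "C": 3, "G": 3, "T": 2},
]
pwm = log_odds_matrix(toy_counts, BACKGROUND)
```

A symbol that is over-represented relative to the background gets a positive score, and an unseen symbol gets a finite negative score because of the pseudocount.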
Fig 1. Overview of the faster lookahead scoring algorithm with its main steps.
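For orientation, the classical lookahead scoring idea that the figure's algorithm builds on can be sketched as below. This is the basic textbook technique, not the paper's faster variant: a window is abandoned as soon as its partial score plus the best score still attainable from the remaining positions falls below the threshold. The matrix is the one from the table above, with rows assumed to be in A, C, G, T order; the function names and the example sequence are hypothetical.

```python
# Log-odds matrix from the table above; the A, C, G, T row order is an assumption.
PWM = {
    "A": [0.14, -4.16, 1.03, -4.16, 0.58, -0.36],
    "C": [0.17, -2.31, -4.16, -4.16, -2.31, -1.32],
    "G": [-1.06, 1.64, -2.32, -0.85, -1.06, 1.12],
    "T": [0.12, -4.16, -2.64, 1.18, 0.07, -0.77],
}


def max_suffix_scores(pwm):
    """best[i] = highest score attainable from positions i..m-1 (best[m] = 0)."""
    m = len(next(iter(pwm.values())))
    best = [0.0] * (m + 1)
    for i in range(m - 1, -1, -1):
        best[i] = best[i + 1] + max(row[i] for row in pwm.values())
    return best


def lookahead_scan(seq, pwm, threshold):
    """Return (start, score) for every window scoring >= threshold.

    Assumes seq contains only the symbols A, C, G, T.
    """
    m = len(next(iter(pwm.values())))
    best = max_suffix_scores(pwm)
    hits = []
    for start in range(len(seq) - m + 1):
        score = 0.0
        for i in range(m):
            score += pwm[seq[start + i]][i]
            # Prune: even a perfect remainder cannot reach the threshold.
            if score + best[i + 1] < threshold:
                break
        else:
            hits.append((start, score))
    return hits


def naive_scan(seq, pwm, threshold):
    """Reference scan without pruning, used to check the lookahead version."""
    m = len(next(iter(pwm.values())))
    hits = []
    for start in range(len(seq) - m + 1):
        score = sum(pwm[seq[start + i]][i] for i in range(m))
        if score >= threshold:
            hits.append((start, score))
    return hits


hits = lookahead_scan("CCAGATAGCC", PWM, threshold=4.0)
```

Pruning never changes the set of reported hits; it only skips work on windows that cannot qualify, which is why a higher threshold (a smaller p-value) tends to make the scan faster.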
Fig 2. Average running times (in seconds, preprocessing excluded) of different algorithms for model MOD1 with p-value γ = 0.0001.
Fig 4. Average running times (in seconds, preprocessing excluded) of different algorithms for real-world model MOD3 and sequence SEQ2 with p-value γ = 0.0001.
Average running times for various p-values.
| 10⁻¹ | 10⁻² | 10⁻³ | 10⁻⁴ | 10⁻⁵ | 10⁻⁶ | |
|---|---|---|---|---|---|---|
| 34.83 | 32.53 | 32.48 | 32.99 | 33.85 | 34.02 | |
| 30.48 | 24.06 | 21.22 | 19.30 | 18.08 | 17.19 | |
Average running times (in seconds, preprocessing excluded) of different algorithms for a DNA pattern (m = 21) from JASPAR at varying p-values. Each reported time is an average of 10 runs.
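To make the role of the p-value concrete, the sketch below converts a p-value into a score threshold for a PWM. It rests on assumptions not stated in this record: the p-value is taken in the common sense of the probability that a random background window scores at or above the threshold, positions are treated as independent draws from the background, and scores are rounded to a grid of 0.01. The matrix is the 6-column GATA-3 example (rows assumed A, C, G, T), not the m = 21 pattern timed here, and the function names are hypothetical.

```python
from collections import defaultdict

BACKGROUND = {"A": 0.278, "C": 0.312, "G": 0.212, "T": 0.198}

# GATA-3 log-odds example from this record; A, C, G, T row order is an assumption.
PWM = {
    "A": [0.14, -4.16, 1.03, -4.16, 0.58, -0.36],
    "C": [0.17, -2.31, -4.16, -4.16, -2.31, -1.32],
    "G": [-1.06, 1.64, -2.32, -0.85, -1.06, 1.12],
    "T": [0.12, -4.16, -2.64, 1.18, 0.07, -0.77],
}


def score_distribution(pwm, background, precision=0.01):
    """Distribution of window scores under an i.i.d. background.

    Keys are integer scores in units of `precision`; values are probabilities.
    """
    m = len(next(iter(pwm.values())))
    dist = {0: 1.0}
    for i in range(m):
        nxt = defaultdict(float)
        for total, prob in dist.items():
            for s, q in background.items():
                nxt[total + round(pwm[s][i] / precision)] += prob * q
        dist = dict(nxt)
    return dist


def threshold_for_pvalue(pwm, background, p, precision=0.01):
    """Lowest score whose upper-tail probability stays <= p, or None if none exists."""
    dist = score_distribution(pwm, background, precision)
    tail = 0.0
    threshold = None
    # Accumulate probability mass from the highest score downwards.
    for total in sorted(dist, reverse=True):
        if tail + dist[total] > p:
            break
        tail += dist[total]
        threshold = total * precision
    return threshold


t_loose = threshold_for_pvalue(PWM, BACKGROUND, 0.01)
t_strict = threshold_for_pvalue(PWM, BACKGROUND, 0.001)
```

A smaller p-value maps to a higher threshold, so fewer windows qualify. For a motif this short, very small p-values can be unattainable, because even the single best-scoring word is more probable than p under the background; the function then returns `None`.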
Fig 3. Average running times (in seconds, preprocessing excluded) of different algorithms for model MOD2 with p-value γ = 0.0001.