Stefania Bortoluzzi, Alessandro Coppe, Andrea Bisognin, Cinzia Pizzi, Gian Antonio Danieli.
Abstract
BACKGROUND: Searching for approximate patterns in large promoter sequences frequently produces an exceedingly high number of results. Our aim was to exploit biological knowledge to define a sheltered search space and appropriate search parameters, in order to develop a method for identifying a tractable number of sequence motifs.
Year: 2005 PMID: 15904489 PMCID: PMC1173081 DOI: 10.1186/1471-2105-6-121
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Flow-chart of the COOP program. Input, output and main steps are shown.
Number of sequences in which most represented patterns were found in different retinal datasets.
| 44 | 42 | 32 | 40 | |
| 19 | 19 | 10 | 12 | |
| 11 | 11 | 5 | 7 |
Statistics about patterns found in different groups of retina gene promoter sequences and in the corresponding negative control random datasets.
| 719 | 351.3 | 0.017 | |||
| 0 | - | - | |||
| 0 | - | - | |||
| 18683 | 12846.4 | 0.016 | |||
| 41 | 35.4 | 0.324 | |||
| 0 | - | - | |||
| 714 | 410.2 | 0.060 | |||
| 1537 | 429.4 | 0.001 | |||
Figure 2. Comparison of pattern discovery results in retinal gene promoter sequences and in 1,000 negative control datasets. Plots of number of patterns (12 bp long, with at most two variable positions) vs number of sequences in which they were found, in retinal gene promoter sequences (open squares) and in 1,000 negative control datasets (filled diamonds). For negative control datasets, the average value over the 1,000 sets of sequences is given, with a two-standard-deviation interval. Statistically significant differences (0.05 threshold) are marked by stars. (A) Comparison between the 1000M52 dataset (52 promoter sequences of genes overexpressed in the retina) and the RAN1000M52i dataset (1,000 groups of 52 randomly chosen human promoters); (B) Comparison between the 1000M91 dataset (91 retinal gene promoter sequences) and the RAN1000M91i dataset (1,000 groups of 91 randomly chosen human promoters).
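The comparison described in this caption can be illustrated with a minimal Python sketch. The record does not say which statistical test was used, so the one-sided empirical p-value below is an assumption, and the function name `compare_to_controls` and the toy counts are hypothetical illustrations, not data from the paper.

```python
from statistics import mean, stdev


def compare_to_controls(observed, control_counts, alpha=0.05):
    """Summarise one observed pattern count against negative control counts.

    observed       -- number of patterns found in the real promoter set
    control_counts -- the same quantity measured in each random control set
    """
    avg = mean(control_counts)
    sd = stdev(control_counts)
    # Assumption: one-sided empirical p-value, i.e. the share of control
    # datasets whose count is at least as large as the observed one.
    p_value = sum(c >= observed for c in control_counts) / len(control_counts)
    return {
        "mean": avg,
        "interval": (avg - 2 * sd, avg + 2 * sd),  # two standard deviations
        "p": p_value,
        "significant": p_value < alpha,
    }


# Toy control counts (made up for illustration only).
toy_controls = [10, 12, 8, 11, 9, 10, 13, 7, 10, 10]
summary = compare_to_controls(13, toy_controls)
print(summary["mean"], summary["p"], summary["significant"])
```

With 1,000 control datasets, as in the figure, the smallest non-zero p-value this estimator can report is 0.001.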
List of 60 consensus sequences corresponding to selected motifs showing most conserved central regions. For each motif, consensus sequence, length and total number of occurrences in the 1000M dataset are reported, along with LocusLink symbols of corresponding genes. In the last column, for each consensus, the list of mammalian transcription factors recognising similar DNA sequences is reported.
| Consensus sequence | Length | Occurrences in 1000M | Genes (LocusLink symbols) | Transcription factors recognising similar sequences |
|---|---|---|---|---|
| AAAAAAAAAAAAAA | 14 | 151 | EFEMP1, CCNI, CNGB3, KCNV2, IMPDH2, SLC24A1, DHRS3, G2AN, RTP801, MGC15WIF11, USH3A, CRX, 18, HMGA1, SLC24A2, RDS, TULP1, DC-TM4F2, OPN1SW, RP1, MGAT4B, GAPD, ELOVL4, RRAD, ARR3 | |
| NGGCCCCGCCCCCN | 14 | 114 | EEF1G, HMGA1, EFEMP1, CYBA, KRT18, OPA1, DPYSL4, RAX, FLJ1415, MGC15WIF11, FLJ1415, ALMS1, EIF3S8, G2AN, ALMS1, DC-TM4F2, MSH6, RCV1, KRT19, DHRS3, PITPNC1, RRAD, HPCAL1, MGAT4B, SLC38A3, IMPDH2, CNGB1, RDH5, EFEMP1, CRABP1, C7orf20, CCNI, GNB1, CRX, GAPD, ARF4L, AIPL1, DKFZP564K0822 | AP-1, GCF, Sp1, Sp3, TFIID |
| GCACCCCCAGCCCCN | 15 | 101 | RHO, G2AN, EFEMP1, SLCO4A1, CYBA, HPCAL1, KIFC3, RCV1, NK4, KRT18, CRX, ARR3, PPP1R3F, MGAT4B, NRL, RRAD, CCNI, SAG, ALMS1, MGC15WIF11, DKFZP564K0822, VMD2, DPYSL4, GNAT1, GAPD, OPN1SW, RAX, DHRS3, COPEB, SLC38A3, TMEM16B, SLC24A1 | Sp1 |
| NGAGGGCAGGGGCNN | 15 | 94 | GNB1, KRT19, ELOVL4, VMD2, MSH6, HMGA1, RHO, NK4, SLC38A3, LRRCGUCA1B, CYBA, RCV1, RRAD, GUCY2D, MGC15WIF11, AIPL1, MGAT4B, KIFC3, CRX, CRABP1, G2AN, ALMS1, RTP801, EEF1G, COPEB, OPA1, EFEMP1, KCNV2, PDE6A, AOC2, RLBP1, FLJ1415, RAX, DPYSL4, WIF1, DC-TM4F2 | Sp1 |
| CCTCCCTCCCTCCC | 14 | 76 | ARF4L, COPEB, RHO, SLC38A3, FLJ1415, WDR17, ELOVL4, DHRS3, KCNV2, OPA1, CCNI, GUCA1B, RDH5, RAX, ALMS1, DKFZP564K0822, NK4, RGS19IP1, RRAD, KIFC3, KRT19, SLCO4A1, HPCAL1, DPYSL4, TNFRSF6, CNGB1, DC-TM4F2 | MAZ |
| NCTCCCCCTCCCCC | 14 | 43 | CNGB1, GAPD, RPE65, ALMS1, COPEB, MSH6, RRAD, CRABP1, TNFRSF6, CRX, WIF1, FLJ1415, DKFZP564K0822, PDE6A, RDH5, SLC38A3, CYBA, GNB1, MERTK, WDR17 | Sp1, AP-2, MAZ |
| GNNTGGGGGAGGGGN | 15 | 41 | CYBA, RLBP1, KCNV2, CNGB1, COPEB, KIFC3, RDH5, CCNI, FLJ1415, MGC15WIF11, AIPL1, NK4, HPCAL1, CNGB1, GUCA1A, ALMS1 | MAZ, Sp1 |
| CNCCCCCACCCCCACC | 16 | 40 | RCV1, SLC38A3, HPCAL1, KIFC3, RLBP1, RPE65, DHRS3, RTP801, CYBA, DPYSL4, RDH5, RRAD, COPEB | AP-2alphaB, Sp1, WT1 |
| CTCCCCCTCCCCNNC | 15 | 26 | CNGB1, CRX, GAPD, RHO, CNGB1, COPEB, CYBA, AIPL1, RAX | AP-2, MAZ, Sp1 |
| CCCCAGCCCCNCA | 13 | 23 | CCNI, EFEMP1, SLCO4A1, MGC15WIF11, ARR3, CYBA, HPCAL1, KIFC3, RAX, RLBP1, MGAT4B, AIPL1, RGS19IP1, ALMS1 | Sp1 |
| NNGGCCCCTGCCCN | 14 | 23 | HMGA1, NK4, LRRCGUCA1B, FLJ1415, GNB1, KRT19, AIPL1, GUCA1A, DHRS3 | Sp1 |
| NCCCCCTCCACCN | 13 | 22 | ARR3, HMGA1, KRT19, VMD2, DHRS3, ARF4L, RAX, CCNI, SIRT3, GUCA1B, DC-TM4F2 | Sp1 |
| NCNGGGCTGGGGN | 13 | 22 | CYBA, HPCAL1, RRAD, GAPD, GUCA1A, RHO, G2AN, EFEMP1 | Sp1 |
| NNTCCCCCTCCCNN | 14 | 22 | TNFRSF6, CNGB1, CRX, EEF1G, GAPD, RPE65, ALMS1, DKFZP564K0822, COPEB, AIPL1 | AP-2alphaB, MAZ, Sp1, WT1 -KTS |
| NNCCCAGCCCCCAN | 14 | 20 | RDH5, SLC38A3, EFEMP1, ARR3, CYBA, GAPD, HPCAL1, NK4, PPP1R3F | Sp1 |
| NTGGGGGAGGGGNA | 14 | 20 | COPEB, CYBA, RLBP1, PITPNC1, CNGB1, CRX, GAPD, MERTK, CCNI | MAZ, Sp1, Sp3 |
| CCNGCCCTGGCCT | 13 | 18 | GUCA1A, GUCY2D, RCV1, VMD2, EFEMP1, LRRCGUCA1B, C7orf20, 4, RRAD, UNC119, MERTK | Sp1 |
| GCNGCCCCTGCCN | 13 | 18 | CRX, CYBA, GNB1, HMGA1, RHO, SLC38A3, MGAT4B, FLJ1415, KRT18 | |
| NCNGGGGGCGGGG | 13 | 18 | CYBA, RRAD, FLJ1415, HMGA1, RDH5, RGS19IP1, G2AN, RTP801, DC-TM4F2 | AP-1, ER, Sp1 |
| CTNCCCCTCCCC | 12 | 17 | RLBP1, AIPL1, PITPNC1, CNGB1, GAPD, RHO, CNGB1, EFEMP1, COPEB, CYBA, GNB1, PDE6A | AP-2alphaB, MAZ, Sp1 |
| GGGGTGGGGNTG | 12 | 17 | GUCY2D, FLJ1415, AIPL1, RDH5, CRABP1, HPCAL1, KIFC3, DHRS3, RTP801, CYBA, RLBP1 | AP-2alphaB, Sp1, Sp3 |
| CCCGCCCCTGNCC | 13 | 16 | GNB1, HPCAL1, KRT19, MGAT4B, G2AN, | Sp1 |
| NGGGGGTGGGGGN | 13 | 16 | HPCAL1, RRAD, DHRS3, FLJ1415, CYBA, GNB1, DPYSL4 | Sp1 |
| NNCCCCCGCCCCNN | 14 | 16 | GNB1, RGS19IP1, LRRCGUCA1B, ALMS1, DC-TM4F2, KRT18, SAG | AP-1, AP-2alphaB, ER, Krox-20, Sp1, WT1, WT1 I, WT1 I -KTS |
| AGNGGGAGGGGCN | 13 | 14 | CYBA, EFEMP1, RAX, MGC15WIF11, ARF4L, CRX, SLCO4A1 | MAZ, Sp1, Sp3 |
| CCCTGTCCCTGGAN | 14 | 14 | ARR3, HPCAL1, FLJ1415, DC-TM4F2, KRT19, LRRCGUCA1B, TMEM16B | GR |
| CGGGGCCGCCNCN | 13 | 14 | FLJ1415, DC-TM4F2, MGC15WIF11, COPEB, MGAT4B, SLCO4A1, RAX | CUP, Sp1 |
| CTCTCTCTCCNTN | 13 | 14 | GAPD, GUCA1A, NRL, RRAD, FLJ1415, GNAT2, KCNV2 | |
| NANCTCTGCACCC | 13 | 14 | LRAT, TNFRSF6, CYBA, KIFC3, DPYSL4, G2AN, RTP801 | |
| NCCGCCCCCGCCN | 13 | 14 | GNB1, IMPDH2, SLC38A3, COPEB, CYBA, KRT18, SLCO4A1 | AP-1, ER, Krox-20, Sp1, WT1 I -KTS, WT1-del2 |
| NGGCCTCTGGNCN | 13 | 14 | CYBA, GAPD, KRT19, RDH5, DPYSL4, HPCAL1, MGAT4B | |
| NGGGAGGGGGAAG | 13 | 14 | GAPD, AIPL1, FLJ1415, EEF1G, RPE65, ALMS1, WDR17 | AP-2alphaB, MAZ, Sp1, WT1 I -KTS |
| NGNCCCCAGCCCC | 13 | 14 | GAPD, GUCA1A, RHO, ARR3, CYBA, NK4, PPP1R3F | AP-2, Sp1 |
| NNCCCAGCCCAGNN | 14 | 14 | GAPD, RHO, ARR3, CRABP1, CYBA, RRAD, MGAT4B | Sp1 |
| TGGGGGTGGGGGN | 13 | 14 | HPCAL1, RLBP1, DHRS3, CYBA, HMGA1, RRAD, DPYSL4 | Sp1 |
| NGGCGGGGGCGGGG | 14 | 13 | EFEMP1, KRT18, RRAD, SLCO4A1, IMPDH2, EFEMP1, COPEB | AP-1, Krox-20, Sp1, WT1 I -KTS, WT1-del2 |
| GGNAGGGGCGGG | 12 | 11 | ELOVL4, REA, G2AN, GNB1, MSH6, GUCY2D, RGS19IP1, LRRC21, SLCO4A1, PITPNC1 | MAZ, Sp1 |
| CCCGCCCGCCCC | 12 | 9 | GNB1, RGS19IP1, WIF1, PITPNC1, DC-TM4F2, HMGA1, DPYSL4, KRT18, RAX | Sp1 |
| GGGCGGGGCNGG | 12 | 9 | CYBA, DPYSL4, MGAT4B, MSH6, RCV1, ALMS1, FLJ1415 | ER, GCF, Sp1 |
| GGGCTGGGGGTG | 12 | 9 | CYBA, HPCAL1, KIFC3, RCV1, RHO, G2AN, DKFZP564K0822 | Sp1 |
| GGGGAAGGGNGG | 12 | 9 | TULP1, CRX, MSH6, KRT19, CNGB1, SLC38A3, AIPL1, HMGA1, FLJ1415 | |
| GGGGCGGGCNNG | 12 | 9 | EEF1G, KRT19, DC-TM4F2, GUCY2D, RGS19IP1, PITPNC1, C7orf20, RTP801 | ER, Sp1 |
| GGNGCGGGCGGG | 12 | 9 | HMGA1, KRT19, DPYSL4, DC-TM4F2, RGS19IP1, WIF1, PITPNC1, FLJ1415 | AP-2, ETF, Krox-20, Sp1, WT1 I -KTS |
| GNNGGGGCTGGG | 12 | 9 | GAPD, HPCAL1, KIFC3, RCV1, RAX, COPEB, RDH5 | WT1 -KTS |
| CAGGGGGCGGGG | 12 | 8 | CYBA, EFEMP1, HPCAL1, FLJ1415, GAPD, HMGA1, G2AN, DC-TM4F2 | AP-1, ER, Sp1, Yi |
| CNCCCCCACCCC | 12 | 8 | CYBA, HMGA1, RCV1, SLC38A3, HPCAL1, RLBP1, DHRS3 | AP-2alphaB, CACCC-binding factor, Sp1, WT1 |
| GAGTGGGGGAGG | 12 | 8 | DHRS3, KCNV2, COPEB, CYBA, HMGA1, WIF1, FLJ1415, MGC15WIF11 | |
| GCCTGGGGGAGG | 12 | 8 | CYBA, SIRT3, KIFC3, CCNI, DKFZP564K0822, DC-TM4F2, MGC15WIF11 | AP-2 |
| GGGCAGGGGCNG | 12 | 8 | CYBA, GNB1, HPCAL1, HMGA1, RHO, SLC38A3, MGAT4B, G2AN | Sp1 |
| GGGCGGGGCTGG | 12 | 8 | CYBA, HPCAL1, RAX, MSH6, RCV1, ALMS1, DC-TM4F2 | ER, GCF, Sp1 |
| CCCTGTCCCTGG | 12 | 7 | CNGB1, GNB1, FLJ1415, KRT19, ELOVL4, TMEM16B, FLJ1415 | GR |
| CCTTCCCCCNGC | 12 | 7 | GNB1, SLC38A3, AIPL1, SLCO4A1, RDH5, TULP1, NK4 | MAZ |
| CNCCTCCTGCNC | 12 | 7 | CRABP1, GUCA1A, PDE6A, RGR, DPYSL4, WIF1, HPCAL1 | PPUR, Sp1 |
| CNGCCCCCAGNC | 12 | 7 | RHO, EFEMP1, DC-TM4F2, CNGB1, CYBA, NK4, MERTK | Sp1 |
| GCNCCCCTCCCC | 12 | 7 | COPEB, CRX, HPCAL1, RGR, CNGB1, MERTK, RAX | MAZ, Sp1 |
| GGGCAGGGGCGG | 12 | 7 | ELOVL4, HMGA1, HPCAL1, RHO, SLC38A3, MGAT4B, G2AN | Sp1 |
| GGGGCTGGGGNC | 12 | 7 | ARR3, CYBA, HPCAL1, NK4, RAX, PPP1R3F, RLBP1 | AP-2alphaB, Sp1 |
| GNAGGGGGCAGG | 12 | 7 | GAPD, NK4, GUCA1B, SLC38A3, WIF1, G2AN, EFEMP1 | Sp1 |
| TGGGGGAGGNNA | 12 | 7 | KCNV2, COPEB, HMGA1, KIFC3, RDH5, CCNI, FLJ1415 | MAZ, Sp1 |
| TTTTTTTTTNTA | 12 | 7 | IMPDH2, G2AN, SLC24A2, RTP801, KCNV2, USH3A-PROMB, CCNI | TBP |
Figure 3. Statistics comparing the accuracy of COOP and of 14 different motif discovery tools on 26 human positive control datasets. Combined measures of correctness over all 26 human datasets, as defined in Methods. The number of datasets (out of 26) for which no motif was predicted by each tool is reported in brackets, following the name of the tool.
Description of COOP parameters.
| Search for pattern occurrences | 1 - N | ≥ 20 out of 100 sequences | ||
| Clustering | Physical distance between 5'-ends of occurrences of patterns of length p | 0 - |p| | ≤ 2 nucleotides | |
| Ratio between observed overlapping occurrences of two patterns and their average number of occurrences | 0 – 1 | ≥ 0.8 | ||
| Consensus building | Ratio between the number of nucleotides per alignment position and the total number of lines in the alignment. The maximal number of adjacent positions exceeding the threshold | 0 – 1 | ≥ 0.5 | |
| Nucleotide length of the lateral region of the motif | 0 - m/2 | 3 bp (out of 10) | ||
| Frequency of a single nucleotide in each position of the lateral region to be considered specified | 0 – 1 | ≥ 0.6 | ||
| Frequency of a single nucleotide in each position of the core region to be considered specified | 0 – 1 | ≥ 0.8 | ||
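The two clustering parameters in the table above (distance between 5'-ends of at most 2 nucleotides, and an overlap ratio of at least 0.8) can be illustrated with a minimal Python sketch. The exact counting rule used by COOP is not given here, so the code assumes that an occurrence of one pattern "overlaps" when the other pattern starts within the distance limit in the same sequence. The names `overlap_ratio` and `same_cluster` and the toy occurrences are hypothetical.

```python
def overlap_ratio(occ_a, occ_b, max_dist=2):
    """Ratio of overlapping occurrences to the average occurrence count.

    occ_a, occ_b -- lists of (sequence_id, five_prime_position) tuples
    max_dist     -- maximum distance between 5'-ends, in nucleotides
    """
    positions_b = {}
    for seq_id, pos in occ_b:
        positions_b.setdefault(seq_id, []).append(pos)
    # Assumption: an occurrence of A counts as overlapping if any occurrence
    # of B in the same sequence starts within max_dist nucleotides of it.
    overlapping = sum(
        1
        for seq_id, pos in occ_a
        if any(abs(pos - q) <= max_dist for q in positions_b.get(seq_id, []))
    )
    mean_occurrences = (len(occ_a) + len(occ_b)) / 2
    return overlapping / mean_occurrences if mean_occurrences else 0.0


def same_cluster(occ_a, occ_b, max_dist=2, min_ratio=0.8):
    """Apply the two clustering thresholds listed in the parameter table."""
    return overlap_ratio(occ_a, occ_b, max_dist) >= min_ratio


# Toy occurrences (made up): four of five pairs start within 2 nucleotides.
pattern_a = [("g1", 10), ("g2", 50), ("g3", 7), ("g4", 100), ("g5", 3)]
pattern_b = [("g1", 11), ("g2", 52), ("g3", 7), ("g4", 99), ("g5", 40)]
print(overlap_ratio(pattern_a, pattern_b), same_cluster(pattern_a, pattern_b))
```

Dividing by the average number of occurrences keeps the ratio symmetric with respect to pattern order when both patterns occur equally often, which is the situation in the toy data.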
Procedure for building a consensus sequence starting from a matrix of nucleotide counts, according to selected parameters. Rows from two to five represent the matrix of nucleotide counts in different positions of an alignment associated to a cluster of pattern occurrences. The sixth row contains, for each alignment position, the ratio between number of sequences in the position and the total number of lines in the alignment. Out of 11 positions of the matrix, positions from one to ten (shaded in grey) fulfil the minimum i (0.5) and are considered for building the consensus. If the lateral region length is set to 3 nucleotides, a 3-4-3 motif is obtained. The f(0.6) threshold is applied to the positions in the lateral regions, whereas the f(0.8) is applied to positions in the core region. Cells containing values fulfilling the condition reported on the left are in bold. In the last row, the derived consensus sequence is shown.
| 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 0 | 0 | 5 | 0 | 5 | 2 | 0 | 0 | 0 | 0 | 2 | |
| 0 | 4 | 0 | 0 | 0 | 3 | 5 | 5 | 0 | 4 | 0 | |
| 3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | |
| 0.4 | |||||||||||
| 0.6 | |||||||||||
| Consensus sequence | |||||||||||
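The consensus-building procedure described in the caption can be sketched in Python using the count matrix above. The row labels are not preserved in this record, so the row order A, C, G, T is an assumption. The total of 5 alignment lines is inferred from the surviving 0.4 ratio for the last position, which has 2 counts. Dividing each count by the total number of lines when applying the frequency thresholds is a further assumption. The function name `build_consensus` is hypothetical.

```python
def build_consensus(counts, n_lines, min_cov=0.5, lateral=3,
                    f_lateral=0.6, f_core=0.8):
    """Derive a consensus string from per-position nucleotide counts.

    counts  -- dict mapping nucleotide to a list of counts per position
    n_lines -- total number of lines in the alignment
    """
    n_pos = len(next(iter(counts.values())))
    coverage = [sum(counts[b][j] for b in counts) / n_lines
                for j in range(n_pos)]

    # Longest run of adjacent positions whose coverage reaches min_cov.
    best_start, best_len, start = 0, 0, None
    for j in range(n_pos + 1):
        ok = j < n_pos and coverage[j] >= min_cov
        if ok and start is None:
            start = j
        elif not ok and start is not None:
            if j - start > best_len:
                best_start, best_len = start, j - start
            start = None

    consensus = []
    for k in range(best_len):
        j = best_start + k
        in_lateral = k < lateral or k >= best_len - lateral
        threshold = f_lateral if in_lateral else f_core
        base = max(counts, key=lambda b: counts[b][j])
        # Assumption: frequency is the count divided by the number of lines.
        frequency = counts[base][j] / n_lines
        consensus.append(base if frequency >= threshold else "N")
    return "".join(consensus)


# Count matrix from the table above; the A, C, G, T row order is assumed.
matrix = {
    "A": [0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0],
    "C": [0, 0, 5, 0, 5, 2, 0, 0, 0, 0, 2],
    "G": [0, 4, 0, 0, 0, 3, 5, 5, 0, 4, 0],
    "T": [3, 0, 0, 1, 0, 0, 0, 0, 5, 0, 0],
}
print(build_consensus(matrix, n_lines=5))  # TGCACNGGTG under the assumed row order
```

Under these assumptions, positions one to ten pass the 0.5 coverage threshold and form a 3-4-3 motif, as the caption states. The sixth position becomes N because its most frequent nucleotide reaches only 0.6, below the 0.8 core threshold.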