Niina Haiminen, Heikki Mannila, Evimaria Terzi.
Abstract
BACKGROUND: Event sequences where different types of events often occur close together arise, e.g., when studying potential transcription factor binding sites (TFBS, events) of certain transcription factors (TF, types) in a DNA sequence. These events tend to occur in bursts: in some genomic regions there are more genes and therefore potentially more binding sites, while in some, possibly very long regions, hardly any events occur. Also, some types of events may occur in the sequence more often than others. Tendencies of co-occurrence of binding sites of two or more TFs are interesting, as they may imply a co-operative role between the TFs in regulatory processes. Determining a numerical value to summarize the tendency for co-occurrence between two TFs can be done in a number of ways. However, testing for the significance of such values should be done with respect to a relevant null model that takes into account the global sequence structure.
Year: 2008 PMID: 18691400 PMCID: PMC2547115 DOI: 10.1186/1471-2105-9-336
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Example of burstiness in a DNA sequence. A 20 kbp region from the chromosome 1 sequence described in Results and Discussion, showing locations of matches to the Jaspar motifs. Short bursts are visible, e.g., 6 closely located matches around 8.876 Mbp.
Figure 2. Illustration of the null models. An event sequence S of length n = 100 with event types r1, r2, r3, r4 and illustration of null models UL, FL, and FL(r) w.r.t. S with distance parameter w = 20. The sequence regions from which the W, C, and D scores are computed w.r.t. sequence S and event type r1 are shown in the top half of the figure (n/w = 5 regions for the W score, and c(r1) = 4 regions for the C and D scores). Models FL and FL(r) keep the locations of the events fixed, while UL randomly assigns new locations. In addition, here FL(r1) keeps the labels of events of type r1 fixed. All methods maintain the total number of events of each type. The co-occurrence counts for the pair (r1, r2) in the original sequence are W(r1, r2, S) = 3, C(r1, r2, S) = 4, and D(r1, r2, S) = 3. For the randomized sequences the counts are W(r1, r2, RUL(S)) = 1, C(r1, r2, RUL(S)) = 3, D(r1, r2, RUL(S)) = 3, W(r1, r2, RFL(S)) = 2, C(r1, r2, RFL(S)) = 4, D(r1, r2, RFL(S)) = 2, W(r1, r2, RFL(r1)(S)) = 3, C(r1, r2, RFL(r1)(S)) = 4, and D(r1, r2, RFL(r1)(S)) = 3.
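The three null models in the caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: events are represented as hypothetical (position, type) tuples, UL draws new positions uniformly and independently (so two events may land on the same position), and the function names are invented for this example.

```python
import random
from collections import Counter

# Assumed representation: an event is a (position, event_type) tuple.

def randomize_ul(events, n, rng):
    """UL: keep each event's type, draw a new position uniformly from 0..n-1."""
    return sorted((rng.randrange(n), t) for _, t in events)

def randomize_fl(events, rng):
    """FL: keep the event positions fixed, shuffle the type labels among them."""
    positions = [p for p, _ in events]
    labels = [t for _, t in events]
    rng.shuffle(labels)
    return sorted(zip(positions, labels))

def randomize_fl_r(events, r, rng):
    """FL(r): keep positions fixed, keep type-r events untouched,
    and shuffle the remaining labels among the remaining positions."""
    fixed = [(p, t) for p, t in events if t == r]
    rest = [(p, t) for p, t in events if t != r]
    return sorted(fixed + randomize_fl(rest, rng))

# Toy sequence of length n = 100 (made-up positions, not those of Figure 2).
rng = random.Random(0)
events = [(3, "r1"), (10, "r2"), (15, "r1"), (40, "r3"), (41, "r2"), (77, "r4")]
for name, randomized in [
    ("UL", randomize_ul(events, 100, rng)),
    ("FL", randomize_fl(events, rng)),
    ("FL(r1)", randomize_fl_r(events, "r1", rng)),
]:
    # Every model preserves the number of events of each type.
    assert Counter(t for _, t in randomized) == Counter(t for _, t in events)
    print(name, randomized)
```

All three functions preserve the per-type event counts; FL and FL(r) additionally preserve the set of occupied positions, which is how they retain the burst structure of the original sequence.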
Number of significant pairs in synthetic data
| Dataset | W: UL | W: FL | W: FL(r) | C: UL | C: FL | C: FL(r) | D: UL | D: FL | D: FL(r) | Note |
| Number of significant pairs | | | | | | | | | | |
| 1. Uncorrelated | 39 | 0 | 0 | 54 | 1 | 0 | 54 | 1 | 0 | |
| 2. Correlated | 33 | 1 | 2 | 54 | 2 | 2 | 53 | 2 | 2 | |
| 3. Directed | 38 | 1 | 1 | 54 | 1 | 1 | 54 | 2 | 1.5 | (1) |
| 4. Distinct correlated | 38 | 1 | 1 | 54 | 1 | 1 | 54 | 1 | 1 | |
| 5. Distinct directed | 39 | 1 | 1 | 54 | 1 | 1 | 54 | 1 | 1 | |
| Number of randomizations where (a, b) is significant | | | | | | | | | | |
| 1. Uncorrelated | 85 | 1 | 0 | 97 | 0 | 0 | 90:92 | 1:1 | 0:1 | |
| 2. Correlated | 100 | 100 | 100 | 100 | 100 | 100 | 100:100 | 100:100 | 100:100 | |
| 3. Directed | 100 | 93 | 99 | 100 | 88 | 99 | 100:98 | 100:1 | 100:2 | (2) |
| 4. Distinct correlated | 94 | 34 | 35 | 97 | 33 | 34 | 95:94 | 35:33 | 36:33 | |
| 5. Distinct directed | 93 | 29 | 31 | 99 | 5 | 17 | 96:97 | 31:5 | 33:0 | |
(1) Median number of pairs of event types, over 100 randomly generated sequences, whose co-occurrence score is significant. Results are shown for five types of synthetic datasets. (2) The number of randomizations in which the planted pair (a, b) is found significant. UL, FL, and FL(r) correspond to the null models, and W, C, and D to the window, undirected, and directed co-occurrence scores. For the D score, the two values s1:s2 denote the number of randomizations in which (a → b) and (b → a) are found significant. The empirical p-values are based on 1000 randomizations. Results are shown for p-value threshold 0.01, with 10 event types, w = 50, burst lengths in [100, 200], sequence length 100000, and parameter values p1 = 0.01, p2 = 0.1, with 50 bursts per sequence. For datasets 4 and 5, the number of bursts containing correlations was randomly chosen from [5, 25].
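The following sketch shows how an empirical p-value of this kind might be computed. The score definitions are assumptions inferred from the captions (W counts fixed windows of length w containing both types, C counts type-a events with a type-b event within distance w, D counts type-a events followed by a type-b event within distance w); the minimum distance parameter d is ignored, a and b are assumed to be different types, and the add-one correction in the p-value is a common convention rather than something this record states. All function names are hypothetical.

```python
import random

def w_score(events, a, b, w):
    """Assumed W score: number of fixed windows of length w containing both types."""
    win_a = {p // w for p, t in events if t == a}
    win_b = {p // w for p, t in events if t == b}
    return len(win_a & win_b)

def c_score(events, a, b, w):
    """Assumed C score: type-a events with a type-b event within distance w (a != b)."""
    pos_b = [p for p, t in events if t == b]
    return sum(any(abs(p - q) <= w for q in pos_b) for p, t in events if t == a)

def d_score(events, a, b, w):
    """Assumed D score: type-a events followed by a type-b event within distance w."""
    pos_b = [p for p, t in events if t == b]
    return sum(any(0 < q - p <= w for q in pos_b) for p, t in events if t == a)

def shuffle_labels(events, rng):
    """FL-style randomization: positions stay fixed, labels are permuted."""
    labels = [t for _, t in events]
    rng.shuffle(labels)
    return [(p, t) for (p, _), t in zip(events, labels)]

def empirical_p_value(events, score, randomize, n_rand=1000, seed=0):
    """Fraction of randomized sequences scoring at least as high as the original,
    with an add-one correction (an assumption of this sketch)."""
    rng = random.Random(seed)
    observed = score(events)
    hits = sum(score(randomize(events, rng)) >= observed for _ in range(n_rand))
    return (hits + 1) / (n_rand + 1)

# Toy usage with made-up events and w = 50.
events = [(10, "a"), (12, "b"), (500, "a"), (900, "b"), (950, "c"), (990, "c")]
p = empirical_p_value(events, lambda e: c_score(e, "a", "b", 50), shuffle_labels)
print(p, p <= 0.01)
```

A pair would be reported as significant when the returned value is at or below the chosen threshold (0.01 in this table).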
Number of significant pairs in chromosome data
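| score | C | C | C | C | C | C | D | D | D | D | D | D |
| null model | UL | UL | FL | FL | FL(r) | FL(r) | UL | UL | FL | FL | FL(r) | FL(r) |
| chr | 0.01 | 0.001 | 0.01 | 0.001 | 0.01 | 0.001 | 0.01 | 0.001 | 0.01 | 0.001 | 0.01 | 0.001 |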
| 1 | 134 | 68 | 108 | 60 | 96 | 41 | 143 | 71 | 123 | 62 | 87 | 40 |
| 2 | 138 | 73 | 36 | 29 | 84 | 41 | 153 | 83 | 33 | 22 | 87 | 35 |
| 3 | 146 | 90 | 118 | 62 | 90 | 47 | 162 | 87 | 131 | 71 | 98 | 45 |
| 4 | 192 | 120 | 116 | 60 | 104 | 53 | 217 | 122 | 110 | 64 | 107 | 50 |
| 5 | 138 | 85 | 90 | 60 | 98 | 51 | 146 | 79 | 92 | 51 | 88 | 37 |
| 6 | 146 | 83 | 119 | 60 | 107 | 59 | 165 | 79 | 131 | 58 | 112 | 40 |
| 7 | 147 | 78 | 86 | 52 | 87 | 37 | 161 | 93 | 117 | 62 | 100 | 43 |
| 8 | 130 | 76 | 96 | 57 | 79 | 30 | 159 | 86 | 115 | 65 | 93 | 39 |
| 9 | 200 | 119 | 158 | 101 | 125 | 58 | 243 | 125 | 196 | 102 | 137 | 54 |
| 10 | 154 | 100 | 126 | 70 | 93 | 45 | 164 | 97 | 137 | 75 | 103 | 50 |
Number of significant pairs for 10 Mbp regions in human chromosomes 1–10. Results are shown for window size w = 300 and two p-value thresholds, for null models UL, FL, and FL(r), for co-occurrence scores C and D. Minimum distance parameter d = 20 bp. Number of event types is 115, and thus the total number of undirected pairs is (115² - 115)/2 + 115 = 6670 and number of directed pairs is 115² = 13225.
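The pair counts quoted in the caption follow from simple counting: undirected pairs include each type paired with itself, and directed pairs are all ordered combinations. A short check, with a hypothetical helper name:

```python
def count_pairs(k):
    """Return (undirected, directed) pair counts for k event types.

    Undirected pairs: unordered pairs of distinct types plus the k self-pairs.
    Directed pairs: all ordered pairs, self-pairs included.
    """
    undirected = (k * k - k) // 2 + k
    directed = k * k
    return undirected, directed

print(count_pairs(115))  # (6670, 13225)
```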
Number of significant pairs per window size
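| score | C | C | C | C | C | C | D | D | D | D | D | D |
| null model | UL | UL | FL | FL | FL(r) | FL(r) | UL | UL | FL | FL | FL(r) | FL(r) |
| w | 0.01 | 0.001 | 0.01 | 0.001 | 0.01 | 0.001 | 0.01 | 0.001 | 0.01 | 0.001 | 0.01 | 0.001 |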
| 100 | 90 | 44 | 67 | 31 | 70 | 23 | 102 | 55 | 89 | 40 | 63 | 20 |
| 300 | 134 | 68 | 108 | 60 | 96 | 41 | 143 | 71 | 123 | 62 | 87 | 40 |
| 500 | 151 | 80 | 130 | 67 | 107 | 39 | 171 | 83 | 142 | 72 | 106 | 37 |
Number of significant pairs for 10 Mbp regions in human chromosome 1. Results are shown for window sizes w ∈ {100, 300, 500} and two p-value thresholds, for null models UL, FL, and FL(r), for co-occurrence scores C and D. Minimum distance parameter was d = 20 bp. Number of event types is 115, and thus the total number of undirected pairs is 6670 and number of directed pairs is 13225.
Differences between FL and FL(r) significant pairs in chromosome data
| chr | FL only | both | FL(r) only |
| 1 | 28 | 38 | 3 |
| 2 | 11 | 21 | 20 |
| 3 | 33 | 34 | 13 |
| 4 | 30 | 34 | 19 |
| 5 | 26 | 38 | 13 |
| 6 | 25 | 42 | 17 |
| 7 | 25 | 30 | 7 |
| 8 | 35 | 27 | 3 |
| 9 | 57 | 49 | 9 |
| 10 | 35 | 39 | 6 |
The number of significant pairs (p ≤ 0.001) in chromosome data according to C score and null models FL and FL(r). Parameters used are w = 300 bp, minimum distance d = 20 bp. The number of FL- and FL(r)-specific pairs is shown, and the number of pairs that both models find significant.
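The three columns of this table correspond to a plain set comparison between the pairs each null model reports. A minimal sketch, assuming each result is available as a collection of (TF, TF) tuples; the helper names and the toy TF labels are hypothetical:

```python
def normalize(pairs):
    """Treat pairs as undirected: (a, b) and (b, a) are the same pair."""
    return {frozenset(p) for p in pairs}

def compare_models(sig_fl, sig_flr):
    """Return counts of (FL-only, both, FL(r)-only) significant pairs."""
    fl, flr = normalize(sig_fl), normalize(sig_flr)
    return len(fl - flr), len(fl & flr), len(flr - fl)

# Toy input with made-up labels, not data from the table.
sig_fl = [("TF_A", "TF_B"), ("TF_A", "TF_C"), ("TF_D", "TF_D")]
sig_flr = [("TF_B", "TF_A"), ("TF_C", "TF_E")]
print(compare_models(sig_fl, sig_flr))  # (2, 1, 1)
```

Using frozenset keys keeps the comparison independent of the order in which the two TFs of a pair are listed, which matches the undirected C score.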
Significant pairs in chromosome data
| chr | TF 1 | TF 2 | # TF 1 | # TF 2 | C | num | FL | ref |
| 1 | MA0042, FOXI1 | MA0045, HMG-IY | 1656 | 3958 | 550 | 10 | Y | |
| 1 | MA0041, Foxd3 | MA0045, HMG-IY | 1643 | 3958 | 547 | 10 | Y | |
| 9 | MA0045, HMG-IY | MA0119, TLX1-NFIC | 3771 | 985 | 458 | 7 | Y | |
| 2 | MA0073, RREB1 | MA0079, SP1 | 1968 | 447 | 317 | 6 | Y | [ |
| 1 | MA0045, HMG-IY | MA0088, Staf | 3958 | 583 | 180 | 4 | Y | |
| 9 | MA0045, HMG-IY | MA0137, STAT1 | 3771 | 642 | 173 | 3 | Y | |
| 4 | MA0045, HMG-IY | MA0079, SP1 | 2661 | 744 | 170 | 2 | Y | [ |
| 9 | MA0042, FOXI1 | MA0119, TLX1-NFIC | 1439 | 985 | 170 | 4 | Y | |
| 6 | MA0045, HMG-IY | MA0082, SQUA | 4029 | 579 | 164 | 2 | N | |
| 9 | MA0003, TFAP2A | MA0073, RREB1 | 856 | 1756 | 131 | 4 | Y | [ |
| 1 | MA0029, Evi1 | MA0045, HMG-IY | 856 | 1756 | 131 | 4 | Y | |
| 4 | MA0045, HMG-IY | MA0112, ESR1 | 465 | 3958 | 129 | 2 | Y | [ |
| 9 | MA0041, Foxd3 | MA0119, TLX1-NFIC | 2661 | 605 | 121 | 1 | N | |
| 9 | MA0022, dl_1 | MA0045, HMG-IY | 1162 | 985 | 118 | 4 | Y | |
| 1 | MA0045, HMG-IY | MA0049, hb | 365 | 3771 | 110 | 4 | Y | |
| 9 | MA0003, TFAP2A | MA0123, ABI4 | 3958 | 387 | 110 | 1 | N | |
| 4 | MA0073, RREB1 | MA0123, ABI4 | 856 | 355 | 107 | 10 | Y | |
| 4 | MA0045, HMG-IY | MA0048, NHLH1 | 1971 | 303 | 103 | 4 | Y | |
| 4 | MA0079, SP1 | MA0119, TLX1-NFIC | 744 | 1013 | 102 | 1 | Y | |
| 6 | MA0073, RREB1 | MA0138, REST | 1868 | 615 | 100 | 1 | Y |
The significant pairs (p ≤ 0.001) in chromosome data with the highest C scores. The parameters used are w = 300 bp, minimum distance d = 20 bp, and significance is determined according to the FL(r) null model for the C score. There are in total 241 unique significant pairs in chromosomes 1–10 with these parameters. The chromosome where the C score is highest, and the total number of times that each TF occurs in the corresponding chromosome are given, as well as the C score. The following two columns state the number of chromosomes (1–10) in which the pair is significant, and if the pair is significant according to the FL null model. The last column gives a reference when one exists. Additional file 2 contains the full names of the factors.
Figure 3. Co-localization of FOXI1 and HMG-IY. A visualization of the potential binding sites for TFs FOXI1 and HMG-IY in a 500 kbp subsequence from the chromosome 1 sequence described in Results and Discussion. In several cases the starting positions of the TF matches are located within a very short distance from each other.