Yuri V Kravatsky, Vladimir R Chechetkin, Nikolai A Tchurikov, Galina I Kravatskaya.
Abstract
A broad class of tasks in genetics and epigenetics can be reduced to the study of various features that are distributed over the genome (genome tracks). Rapid and efficient processing of the huge amount of data stored in genome-scale databases cannot be achieved without software packages based on analytical criteria. However, the strong inhomogeneity of genome tracks hampers the development of relevant statistics. We developed criteria for the assessment of genome track inhomogeneity and of correlations between two genome tracks. We also developed a software package, Genome Track Analyzer, based on this theory. The theory and software were tested on simulated data and were applied to the study of correlations between CpG islands and transcription start sites in the Homo sapiens genome, between profiles of protein-binding sites in chromosomes of Drosophila melanogaster, and between DNA double-strand breaks and histone marks in the H. sapiens genome. Significant correlations between transcription start sites on the forward and the reverse strands were observed in the genomes of D. melanogaster, Caenorhabditis elegans, Mus musculus, H. sapiens, and Danio rerio. The observed correlations may be related to the regulation of gene expression in eukaryotes. Genome Track Analyzer is freely available at http://ancorr.eimb.ru/.
Keywords: bioinformatic tool; epigenetics; gene expression; genome tracks; transcription start sites
Year: 2015 PMID: 25627242 PMCID: PMC4379982 DOI: 10.1093/dnares/dsu044
Source DB: PubMed Journal: DNA Res ISSN: 1340-2838 Impact factor: 4.458
FDR for the tests on the absence of correlations between exons in human chromosomes and random sets
| Number of points in random sets relative to that of exons | Fraction of events with | | | | | |
|---|---|---|---|---|---|---|
| | Exons on forward strand | | | Exons on reverse strand | | |
| | 2-Fold less | Equal, ±10% | 2-Fold more | 2-Fold less | Equal, ±10% | 2-Fold more |
| Genomic HyperBrowser | | | | | | |
| | 0.253 | 0.253 | 0.247 | 0.249 | 0.250 | 0.248 |
| Adjusted | 0.227 | 0.228 | 0.227 | 0.227 | 0.231 | 0.228 |
| | 0.538 | 0.597 | 0.670 | 0.588 | 0.635 | 0.773 |
| Adjusted | 0.523 | 0.582 | 0.656 | 0.573 | 0.628 | 0.768 |
| GenometriCorr | | | | | | |
| Relative distance KS | 0.047 | 0.049 | 0.051 | 0.051 | 0.048 | 0.050 |
| Absolute distance | 0.109 | 0.106 | 0.100 | 0.110 | 0.107 | 0.109 |
| Relative distance KS | 0.537 | 0.437 | 0.428 | 0.579 | 0.450 | 0.215 |
| Absolute distance | 0.705 | 0.697 | 0.662 | 0.725 | 0.693 | 0.599 |
| Genome Track Analyzer | | | | | | |
| United | 0.056 | 0.052 | 0.051 | 0.044 | 0.046 | 0.048 |
All tests were performed with predicted P-values <0.05. The expected mean value and standard deviation for FDR per 100 MC realizations should be 0.05 ± 0.005.
FDR, false discovery rate; MC, Monte Carlo; KS, Kolmogorov–Smirnov test.
Correlations between transcription start sites on the forward and reverse strands
| Chromosome | z [Equation (23)] | P | z [Equation (11)] | P | TSS forward | TSS reverse | NN |
|---|---|---|---|---|---|---|---|
| A. | | | | | | | |
| chr1 | 2.59 | 0.010 | 5.93 | <0.001 | 1,226 | 1,156 | 230 |
| chr2 | 3.77 | <0.001 | 4.28 | <0.001 | 817 | 728 | 177 |
| chr3 | 3.01 | 0.003 | 4.60 | <0.001 | 643 | 663 | 133 |
| chr4 | 0.28 | 0.779 | 2.07 | 0.038 | 446 | 471 | 103 |
| chr5 | 4.39 | <0.001 | 4.42 | <0.001 | 570 | 497 | 113 |
| chr6 | 4.34 | <0.001 | 5.22 | <0.001 | 609 | 617 | 132 |
| chr7 | 0.65 | 0.516 | 2.97 | 0.003 | 562 | 509 | 109 |
| chr8 | 2.76 | 0.006 | 4.48 | <0.001 | 397 | 423 | 72 |
| chr9 | 2.97 | 0.003 | 3.39 | <0.001 | 443 | 507 | 95 |
| chr10 | 1.97 | 0.049 | 4.27 | <0.001 | 471 | 475 | 106 |
| chr11 | 3.52 | <0.001 | 5.02 | <0.001 | 696 | 656 | 129 |
| chr12 | 4.21 | <0.001 | 5.36 | <0.001 | 623 | 656 | 147 |
| chr14 | 4.28 | <0.001 | 4.64 | <0.001 | 382 | 363 | 90 |
| chr15 | 1.94 | 0.052 | 5.74 | <0.001 | 338 | 359 | 82 |
| chr16 | 3.73 | <0.001 | 6.30 | <0.001 | 571 | 422 | 113 |
| chr17 | 5.08 | <0.001 | 6.37 | <0.001 | 633 | 725 | 150 |
| chr19 | 4.28 | <0.001 | 7.65 | <0.001 | 823 | 772 | 178 |
| chr20 | 2.57 | 0.010 | 4.06 | <0.001 | 320 | 291 | 70 |
| chr22 | 1.77 | 0.077 | 4.16 | <0.001 | 268 | 269 | 63 |
| chrX | 2.24 | 0.025 | 1.43 | 0.153 | 477 | 453 | 96 |
| B. | | | | | | | |
| chr2L | 11.98 | <0.001 | 15.68 | <0.001 | 1,393 | 1,379 | 454 |
| chr2R | 13.72 | <0.001 | 14.73 | <0.001 | 1,578 | 1,529 | 481 |
| chr3L | 12.57 | <0.001 | 15.59 | <0.001 | 1,528 | 1,570 | 473 |
| chr3R | 14.19 | <0.001 | 16.61 | <0.001 | 1,762 | 1,852 | 583 |
| chrX | 12.42 | <0.001 | 14.08 | <0.001 | 1,176 | 1,198 | 387 |
The positions of TSS on the reverse strand were projected onto the forward strand. The two z values in each row are calculated by Equations (23) and (11) and characterize the positional and ordering correlations between TSS, respectively. For random correlations, the 1% significance threshold for |z| is 2.58 and the 5% threshold is 1.96. Positive values of the ordering z indicate that projected TSS on the reverse strand precede TSS on the forward strand. The corresponding P-values were calculated using Gaussian statistics. The data were filtered to keep only chromosomes with more than 50 pairs of nearest neighbours (NN), to ensure the applicability of Gaussian statistics.
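The Gaussian P-values in the table can be reproduced from the z values with a two-sided normal tail probability. This is a minimal sketch using only the Python standard library; the function name `two_sided_p` is illustrative and is not taken from Genome Track Analyzer.

```python
import math

def two_sided_p(z: float) -> float:
    """Two-sided P-value for a standard normal z-score: P(|Z| >= |z|)."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# z values taken from the chr1 and chr4 rows of the table above.
p_chr1 = two_sided_p(2.59)
p_chr4 = two_sided_p(0.28)

# The significance thresholds quoted in the table note.
p_at_5_percent = two_sided_p(1.96)
p_at_1_percent = two_sided_p(2.58)
```

Rounded to three decimals, z = 2.59 gives P ≈ 0.010 and z = 0.28 gives P ≈ 0.779, matching the tabulated values, which indicates that the reported P-values are two-sided.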
Figure 1. (a) The distributions of the nearest neighbouring transcription start sites (NN TSS) on the forward and reverse strands across particular chromosomes of the Homo sapiens genome. The cytobands across the corresponding chromosomes and the relevant length scales are shown below the TSS. The length scale is in megabases. The blue vertical lines correspond to the pairs of NN TSS. The 15 closest pairs on each chromosome are marked by red lines, and the names of the corresponding NN TSS are indicated. Names shown above the red lines correspond to the TSS on the forward strand, whereas names shown below the red lines correspond to the TSS on the reverse strand (names are given according to EPD notation). (b) Particular examples of NN TSS pairs in the H. sapiens genome. The transcription factors (TF) participating in the regulation of expression of a particular gene are listed after the name of the gene. The TF that match genes on both strands are marked in red. The data on binding sites for TF associated with genes were taken from http://www.genecards.org.
Figure 2. (a) The binding profiles for proteins E(Z), Pc-S2, and Psc, and for H3me3K27 histone marks over chromosome 3R of Drosophila melanogaster. For the study of correlations, these profiles were first filtered by the cut-off threshold mean + 2 SD and then clustered with a distance of 50 nt [Preprocessing of input genetic data and Equation (12)]. The input data after preprocessing are shown below the initial profiles. (b) z-ratios [Equation (23)] characterizing pairwise positional correlations between profiles for proteins E(Z), Pc-S2, and Psc, and for the H3me3K27 mark in the different chromosomes of D. melanogaster. The input data were preprocessed as described above. The numbers below the chromosome names give the numbers of nearest neighbours. The horizontal broken lines for z-ratios correspond to 5% (|z| = 1.96) and 1% (|z| = 2.58) significance thresholds for random correlations. (c) z-ratios characterizing positional correlations between profiles for proteins E(Z), Pc-S2, and Psc, and for H3me3K27 histone marks in chromosome 2R of D. melanogaster at different clustering lengths. The profiles were first filtered by the cut-off threshold mean + 2 SD. The positive values of zcorr reflect a trend towards shorter distances between profiles relative to the reference model (correlations), whereas the negative values of zcorr reflect a trend towards longer distances between profiles (anticorrelations).
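The preprocessing named in the Figure 2 caption (a mean + 2 SD cut-off followed by clustering at 50 nt) can be illustrated as follows. This is only one plausible reading: the preprocessing section and Equation (12) are not part of this record, so the use of the population standard deviation, the merging of kept positions separated by at most `max_gap` nucleotides, and the function name `filter_and_cluster` are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def filter_and_cluster(positions, signal, n_sd=2.0, max_gap=50):
    """Keep positions whose signal exceeds mean + n_sd * SD, then merge kept
    positions separated by at most max_gap nt into (start, end) clusters.
    Assumption: population SD and gap-based merging; the source may differ."""
    cutoff = mean(signal) + n_sd * pstdev(signal)
    kept = sorted(p for p, s in zip(positions, signal) if s > cutoff)
    clusters = []
    for p in kept:
        if clusters and p - clusters[-1][1] <= max_gap:
            clusters[-1][1] = p          # extend the current cluster
        else:
            clusters.append([p, p])      # start a new cluster
    return [tuple(c) for c in clusters]

# Toy profile: 20 background points plus three strong peaks (hypothetical data).
positions = [100 * i for i in range(20)] + [1010, 1040, 1750]
signal = [1.0] * 20 + [10.0] * 3
clusters = filter_and_cluster(positions, signal)
```

In this toy profile the peaks at 1010 and 1040 lie within 50 nt of each other and merge into one cluster, while the peak at 1750 stays separate. Varying `max_gap` mimics the different clustering lengths compared in panel (c).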
Figure 3. (a) The distributions of DNA double-strand breaks (DSBs) and H3K4me3 histone marks over human chromosome 7. The distributions of DSBs and histone marks were coarse-grained over bins of 100 kb, i.e. the heights in these distributions correspond to the number of points in each 100-kb bin. Both sets were preprocessed as described in the main text. The distribution of cytobands across chromosome 7 is shown above the length scale. (b) z-ratios [Equation (23)] characterizing pairwise positional correlations between distributions of DSBs and H3K4me3 in the human chromosomes. The correlations for the Y chromosome are not shown due to poor statistics. The numbers below the chromosome names give the numbers of nearest neighbours. The horizontal broken lines for z-ratios correspond to 5% (|z| = 1.96) and 1% (|z| = 2.58) significance thresholds for random correlations.
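The coarse-graining described in panel (a) amounts to counting points per fixed-width bin. A minimal sketch, assuming 0-based integer coordinates in nucleotides; the function name `bin_counts` and the example coordinates are hypothetical.

```python
from collections import Counter

def bin_counts(positions, bin_size=100_000):
    """Count points per bin of bin_size nt; returns a list indexed by bin number."""
    counts = Counter(p // bin_size for p in positions)
    n_bins = max(counts) + 1 if counts else 0
    return [counts.get(i, 0) for i in range(n_bins)]

# Hypothetical coordinates: two points in bin 0, one in bin 1, two in bin 2.
example = bin_counts([5_000, 99_999, 100_000, 250_000, 250_001])
```

Each entry of the returned list corresponds to the bar height for one 100-kb bin in a plot of this kind.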