Sagi Shporer, Benny Chor, Saharon Rosset, David Horn.
Abstract
BACKGROUND: The generalization of the second Chargaff rule states that the count of any string of nucleotides of length k on a single chromosomal strand equals the count of its inverse (reverse-complement) k-mer. This Inversion Symmetry (IS) holds for many species, both eukaryotes and prokaryotes, for ranges of k which may vary from 7 to 10 as chromosomal lengths vary from 2 Mbp to 200 Mbp. The existence of IS has been demonstrated in the literature, and other pair-wise candidate symmetries (e.g. reverse or complement) have been ruled out.
Keywords: Chromosome k-mer distributions; Generalized Chargaff rules; Inversion symmetry
Year: 2016 PMID: 27580854 PMCID: PMC5006273 DOI: 10.1186/s12864-016-3012-8
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
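The inversion symmetry described in the abstract can be probed on any sequence by counting k-mers on one strand and comparing each count with that of its reverse complement. The following is a minimal sketch, not the authors' pipeline: the function names are hypothetical, and the normalized difference |N(s) − N(inv s)| / (N(s) + N(inv s)) is an assumed form of the X measure, which this record does not define explicitly. Self-inverse k-mers are skipped because they are trivially symmetric.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Return the inverse (reverse-complement) of a k-mer."""
    return kmer.translate(_COMPLEMENT)[::-1]


def kmer_counts(seq: str, k: int) -> Counter:
    """Count overlapping k-mers on a single strand, skipping windows with non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def mean_inversion_asymmetry(seq: str, k: int) -> float:
    """Average normalized count difference over inverse pairs (assumed form of X)."""
    counts = kmer_counts(seq, k)
    seen = set()
    diffs = []
    for kmer in list(counts):
        inv = reverse_complement(kmer)
        pair = min(kmer, inv)  # canonical representative of the pair
        if kmer == inv or pair in seen:
            continue  # skip self-inverse k-mers and pairs already handled
        seen.add(pair)
        n, m = counts[kmer], counts[inv]
        diffs.append(abs(n - m) / (n + m))
    return sum(diffs) / len(diffs) if diffs else 0.0
```

A value near 0 indicates that k-mers and their inverses occur about equally often; a value near 1 indicates that one member of each pair dominates.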
Fig. 1 Averages of normalized differences between occurrences of k-mers and their inverses (reverse-complements), Ek[X], for different chromosomes of the HG38 human assembly, plotted vs k
Fig. 2 HG38 chr1: Histogram (probability distribution in bins of Δx = 0.02) of relative occurrences of k-mer pairs vs x for different values of k (4 to 10). a inverse pairs; plotted range is x < 0.3, above which the histogram values are negligibly small. b random pairs for full x range. c reverse pairs for full x range
Comparison of the averages Ek[X] for μka (inverse pairs), μkb (random pairs), and μkc (reverse pairs) on chr1 of HG38
| k | μka | μkb | μkc |
|---|---|---|---|
| 1 | 0.0009 | 0.083 | 0 |
| 2 | 0.0008 | 0.20 | 0.15 |
| 3 | 0.0031 | 0.26 | 0.21 |
| 4 | 0.0055 | 0.33 | 0.27 |
| 5 | 0.0090 | 0.40 | 0.32 |
| 6 | 0.013 | 0.44 | 0.36 |
| 7 | 0.017 | 0.49 | 0.40 |
| 8 | 0.025 | 0.52 | 0.43 |
| 9 | 0.043 | 0.55 | 0.46 |
| 10 | 0.085 | 0.57 | 0.49 |
| 11 | 0.18 | 0.60 | 0.53 |
| 12 | 0.32 | 0.67 | 0.60 |
Results of the evaluation of averages and variances over k-mers of X and Z distributions on human chr 1. Large k-values approach the results Ek(|Z|) = 0.8 and σk(|Z|) = 0.6 expected from standard normal Z distributions
| k | Ek[X] | Ek[\|Z\|] = Ek[X/σX] | σk[\|Z\|] |
|---|---|---|---|
| 1 | 0.0004 | 4.56 | 3.7 |
| 2 | 0.0006 | 3.26 | 2.4 |
| 3 | 0.00075 | 1.98 | 1.58 |
| 4 | 0.00125 | 1.34 | 1.12 |
| 5 | 0.002 | 1.07 | 0.86 |
| 6 | 0.004 | 0.93 | 0.75 |
| 7 | 0.0085 | 0.89 | 0.72 |
| 8 | 0.018 | 0.866 | 0.72 |
| 9 | 0.038 | 0.843 | 0.69 |
| 10 | 0.083 | 0.825 | 0.67 |
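The limiting values quoted in the caption, Ek(|Z|) = 0.8 and σk(|Z|) = 0.6, match the mean and standard deviation of |Z| when Z is standard normal (the half-normal distribution). A short check using the closed-form expressions, with hypothetical variable names:

```python
import math

# For Z ~ N(0, 1): E|Z| = sqrt(2/pi), and Var|Z| = E[Z^2] - (E|Z|)^2 = 1 - 2/pi.
mean_abs_z = math.sqrt(2 / math.pi)
std_abs_z = math.sqrt(1 - 2 / math.pi)

print(f"E|Z| = {mean_abs_z:.3f}, sigma|Z| = {std_abs_z:.3f}")
```

The large-k rows of the table (for example 0.825 and 0.67 at k = 10) sit close to these limits, while the small-k rows lie far above them.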
Fig. 3 a Z-distribution for inverse k-mer pairs of k = 8 shows high consistency with the expected standard normal distribution. b Z-distribution of reverse pairs of k = 8 displays a completely different behavior from inverse pairs, having variance = 1600
k-limits for human data as well as other eukaryotes and prokaryotes
| Species | Length | KL |
|---|---|---|
| HG38 chr1 | 230 M | 10 |
| HG18 chr1 | 225 M | 10 |
| Chimpanzee chr1 | 217 M | 10 |
| Mouse chr1 | 192 M | 10 |
| HG18 chrX | 151 M | 9 |
| Zebrafish chr7 | 77 M | 9 |
| D. melanogaster chr3R | 28 M | 9 |
| C. elegans chrV | 21 M | 9 |
| HG18 chrY | 26 M | 8 |
| Human section 10 M | 10 M | 8 |
| E. coli K12 | 4.6 M | 8 |
| B. subtilis | 4.2 M | 8 |
| Human section 5 M | 5 M | 7 |
| M. avium paratuberculosis | 4.8 M | 7 |
| P. furiosus | 1.91 M | 7 |
| T. maritima | 1.86 M | 7 |
| S. cerevisiae chr IV | 1.53 M | 7 |
| Human section 1 M | 1 M | 6 |
| Human section 100 K | 100 K | 5 |
| Human section 50 K | 50 K | 4 |
| Human section 10 K | 10 K | 3 |
| Human section 5 K | 5 K | 2 |
Fig. 4 k-limits vs chromosomal length, based on Table 3. The figure displays a universal logarithmic behavior. Boxes are human data, stars denote examples of other eukaryotes, and circles represent examples of prokaryotes. The linear regression shown for this data set has a slope of 0.73 in ln(length), which agrees with our theoretical expectation
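The logarithmic trend in Fig. 4 can be examined directly from the k-limit table by an ordinary least-squares fit of KL against ln(length). The sketch below transcribes the table values and uses a hand-written fit; the helper name `fit_line` is hypothetical, and the fitted slope need not coincide exactly with the 0.73 in the caption, since the points and weighting used for the published fit are not given here.

```python
import math

# (length in bp, k-limit KL) pairs transcribed from the k-limit table
K_LIMITS = [
    (230e6, 10), (225e6, 10), (217e6, 10), (192e6, 10),
    (151e6, 9), (77e6, 9), (28e6, 9), (21e6, 9),
    (26e6, 8), (10e6, 8), (4.6e6, 8), (4.2e6, 8),
    (5e6, 7), (4.8e6, 7), (1.91e6, 7), (1.86e6, 7), (1.53e6, 7),
    (1e6, 6), (100e3, 5), (50e3, 4), (10e3, 3), (5e3, 2),
]


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


log_lengths = [math.log(length) for length, _ in K_LIMITS]
limits = [kl for _, kl in K_LIMITS]
slope, intercept = fit_line(log_lengths, limits)
print(f"KL ~ {slope:.2f} * ln(length) + {intercept:.2f}")
```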
Evaluation of E[|Z|], E[X], fraction of unrealized inverse pairs, and chromosomal length
| k | HG38 | HG38M | HG18 | HG18M | Mouse | MouseM | C eleg | Cerevisiae | Ecoli | |
|---|---|---|---|---|---|---|---|---|---|---|
| E[\|Z\|] | 1 | 4.154 | 3.943 | 4.560 | 5.406 | 6.928 | 11.001 | 2.814 | 2.057 | 1.273 |
| 2 | 2.581 | 2.316 | 3.260 | 3.417 | 3.695 | 5.652 | 1.548 | 1.682 | 1.479 | |
| 3 | 1.707 | 1.769 | 1.983 | 2.152 | 2.780 | 3.904 | 1.589 | 1.434 | 1.318 | |
| 4 | 1.446 | 1.392 | 1.339 | 1.492 | 1.809 | 2.342 | 1.397 | 1.000 | 1.012 | |
| 5 | 1.202 | 1.186 | 1.069 | 1.133 | 1.262 | 1.490 | 1.216 | 0.867 | 0.921 | |
| 6 | 1.057 | 1.001 | 0.930 | 0.943 | 0.990 | 1.070 | 1.075 | 0.791 | 0.852 | |
| 7 | 0.984 | 0.935 | 0.894 | 0.884 | 0.892 | 0.902 | 0.980 | 0.780 | 0.837 | |
| 8 | 0.929 | 0.883 | 0.867 | 0.845 | 0.843 | 0.839 | 0.893 | 0.787 | 0.815 | |
| 9 | 0.881 | 0.855 | 0.843 | 0.828 | 0.823 | 0.819 | 0.851 | 0.841 | 0.811 | |
| 10 | 0.844 | 0.831 | 0.825 | 0.816 | 0.815 | 0.813 | 0.824 | 0.902 | 0.815 | |
| 11 | 0.825 | 0.821 | 0.816 | 0.814 | 0.813 | 0.814 | 0.835 | 0.940 | 0.856 | |
| 12 | 0.824 | 0.829 | 0.821 | 0.826 | 0.822 | 0.828 | 0.881 | 0.956 | 0.916 | |
| E[X] | 1 | 0.00038 | 0.00050 | 0.00041 | 0.00067 | 0.00068 | 0.00152 | 0.00099 | 0.00672 | 0.00083 |
| 2 | 0.00046 | 0.00058 | 0.00058 | 0.00083 | 0.00070 | 0.00150 | 0.00111 | 0.01021 | 0.00196 | |
| 3 | 0.00067 | 0.00095 | 0.00077 | 0.00106 | 0.00121 | 0.00218 | 0.00260 | 0.01752 | 0.00350 | |
| 4 | 0.00134 | 0.00179 | 0.00115 | 0.00170 | 0.00165 | 0.00283 | 0.00474 | 0.02527 | 0.00554 | |
| 5 | 0.00247 | 0.00329 | 0.00206 | 0.00284 | 0.00260 | 0.00397 | 0.00839 | 0.04547 | 0.01067 | |
| 6 | 0.00470 | 0.00593 | 0.00402 | 0.00537 | 0.00461 | 0.00636 | 0.01535 | | 0.02075 | |
| 7 | 0.00942 | 0.01205 | 0.00852 | 0.01123 | 0.00941 | 0.01222 | 0.02905 | 0.18223 | 0.04362 | |
| 8 | 0.01918 | 0.02472 | 0.01809 | 0.02355 | 0.01954 | 0.02505 | 0.05593 | 0.38663 | | |
| 9 | 0.03951 | 0.05169 | 0.03850 | 0.04998 | 0.04226 | 0.05334 | | 0.64905 | 0.18551 | |
| 10 | | | | | | | 0.24551 | 0.82906 | 0.36850 | |
| 11 | 0.17518 | 0.22274 | 0.17538 | 0.21909 | 0.19196 | 0.23044 | 0.47655 | 0.91443 | 0.61571 | |
| 12 | 0.31969 | 0.38249 | 0.32051 | 0.37838 | 0.33829 | 0.38843 | 0.68957 | 0.94564 | 0.81471 | |
| Fraction of null pairs | 7 | 0.00110 | ||||||||
| 8 | 0.04863 | 0.00079 | ||||||||
| 9 | 0.00001 | 0.00001 | 0.00002 | 0.48397 | 0.00954 | |||||
| 10 | 0.00042 | 0.00130 | 0.00042 | 0.00127 | 0.00101 | 0.00217 | 0.00552 | 2.39166 | 0.06538 | |
| 11 | 0.01460 | 0.02590 | 0.01471 | 0.02515 | 0.02289 | 0.03312 | 0.14178 | 9.83436 | 0.30279 | |
| 12 | 0.09259 | 0.14336 | 0.09292 | 0.13934 | 0.11537 | 0.15551 | 0.85693 | 39.18 | 0.66052 | |
| length | 2.3E+08 | 1.1E+08 | 2.2E+08 | 1.2E+08 | 1.9E+08 | 1.1E+08 | 1.5E+07 | 230218 | 4639664 |
Displayed results are for chr1 of HG38, HG18, mouse, C. elegans, and S. cerevisiae, and for the full bacterial chromosome of E. coli. M refers to masked chromosomes. Centromere regions were removed from the HG38 data. Highlighted results are the ones determining the k-limit, KL, of the different chromosomes
Violations of the 2nd Chargaff rule on HG38. Columns contain the values of #T/#A, #G/#C on different chromosomes, as well as their Y and Z values. The latter reflect the significance of the inequality
| chr | T/A | G/C | Y(T,A) | Y(G,C) | Z(T,A) | Z(G,C) |
|---|---|---|---|---|---|---|
| chr1 | 1.002593 | 1.001175 | 0.001295 | 0.000587 | 15 | 5.76 |
| chr2 | 1.00274 | 1.002747 | 0.001368 | 0.001372 | 16.41 | 13.49 |
| chr3 | 1.002416 | 1.002824 | 0.001207 | 0.00141 | 13.19 | 12.5 |
| chr4 | 1.001062 | 1.002595 | 0.000531 | 0.001296 | 5.75 | 11.04 |
| chr5 | 1.004679 | 1.004144 | 0.002334 | 0.002068 | 24.44 | 17.5 |
| chr6 | 1.000537 | 1.001981 | 0.000268 | 0.000989 | 2.72 | 8.12 |
| chr7 | 1.003332 | 1.001884 | 0.001663 | 0.000941 | 16.15 | 7.57 |
| chr8 | 0.999241 | 1.002536 | −0.00038 | 0.001266 | −3.53 | 9.65 |
| chr9 | 1.001327 | 1.002823 | 0.000663 | 0.001409 | 5.61 | 9.99 |
| chr10 | 1.0039 | 1.002911 | 0.001946 | 0.001454 | 17.18 | 10.82 |
| chr11 | 1.001915 | 1.002815 | 0.000956 | 0.001405 | 8.48 | 10.51 |
| chr12 | 1.003102 | 1.003317 | 0.001548 | 0.001656 | 13.75 | 12.2 |
| chr13 | 1.003831 | 1.005012 | 0.001912 | 0.002499 | 14.83 | 15.36 |
| chr14 | 1.008943 | 1.007342 | 0.004451 | 0.003658 | 32.58 | 22.24 |
| chr15 | 1.001842 | 1.00411 | 0.00092 | 0.002051 | 6.44 | 12.23 |
| chr16 | 1.009601 | 1.007001 | 0.004778 | 0.003488 | 32.17 | 21.07 |
| chr17 | 1.002905 | 1.006812 | 0.00145 | 0.003395 | 9.77 | 20.81 |
| chr18 | 1.005494 | 1.016917 | 0.00274 | 0.008388 | 19.03 | 47.34 |
| chr19 | 1.009276 | 1.007636 | 0.004617 | 0.003803 | 25.46 | 20.13 |
| chr20 | 1.011147 | 1.012815 | 0.005542 | 0.006367 | 33.22 | 33.7 |
| chr21 | 1.003017 | 1.005026 | 0.001506 | 0.002507 | 7.33 | 10.15 |
| chr22 | 0.998893 | 1.009337 | −0.00055 | 0.004647 | −2.52 | 19.94 |
| chrX | 1.003463 | 1.005699 | 0.001728 | 0.002842 | 16.73 | 22.23 |
| chrY | 1.008873 | 1.000209 | 0.004417 | 0.000105 | 17.58 | 0.34 |
All Z values are highly significant except Z(G,C) on chrY, which corresponds to a p-value of 0.367. All others have inequality p-values < 0.01. On all chromosomes we observe #G > #C on the positive strand. The same is true for #T > #A, except on chr8 and chr22, where #T < #A, which is also a significant observation (|Z| > 2.575 corresponds to an inequality p-value < 0.005)
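The p-values quoted in this note are consistent with a one-sided tail of the standard normal distribution. A minimal sketch of that conversion, assuming a one-sided test (an inference from the two quoted Z and p-value pairs, not something the record states) and using a hypothetical function name:

```python
import math


def one_sided_p(z: float) -> float:
    """Upper-tail probability P(Z' > z) for a standard normal Z'."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# Values taken from the note above
print(f"Z = 0.34  -> p = {one_sided_p(0.34):.3f}")
print(f"Z = 2.575 -> p = {one_sided_p(2.575):.4f}")
```

For the negative Z values on chr8 and chr22, the same function applied to |Z| gives the significance of the opposite inequality, #T < #A.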
Gene occurrences on the plus (#P) and minus (#M) strands of HG38 display abundance of the former
| chr | P | M | Y(P,M) | Z(P,M) | p values | Z(T,A) | Z(G,C) | corr |
|---|---|---|---|---|---|---|---|---|
| 1 | 4488 | 4291 | 0.022 | 2.103 | 0.018 | 15.00 | 5.76 | v |
| 2 | 4106 | 3367 | 0.099 | 8.549 | 0 | 16.41 | 13.49 | v |
| 3 | 2938 | 2516 | 0.077 | 5.714 | 5.65E-09 | 13.19 | 12.50 | v |
| 4 | 2542 | 1792 | 0.173 | 11.392 | 0 | 5.75 | 11.04 | v |
| 5 | 2777 | 2186 | 0.119 | 8.389 | 0 | 24.44 | 17.50 | v |
| 6 | 4840 | 3563 | 0.152 | 13.931 | 0 | 2.72 | 8.12 | v |
| 7 | 3024 | 2402 | 0.115 | 8.444 | 0 | 16.15 | 7.57 | v |
| 8 | 2135 | 2032 | 0.025 | 1.596 | | −3.53 | 9.65 | |
| 9 | 3032 | 2180 | 0.163 | 11.802 | 0 | 5.61 | 9.99 | v |
| 10 | 2532 | 2156 | 0.080 | 5.492 | 2.01E-08 | 17.18 | 10.82 | v |
| 11 | 2879 | 4047 | −0.169 | | 0 | 8.48 | 10.51 | x |
| 12 | 3003 | 2771 | 0.040 | 3.053 | 0.0011 | 13.75 | 12.20 | x |
| 13 | 1261 | 1227 | 0.014 | 0.682 | | 14.83 | 15.36 | |
| 14 | 2092 | 1906 | 0.047 | 2.942 | 0.0016 | 32.58 | 22.24 | v |
| 15 | 4226 | 3547 | 0.087 | 7.702 | 6.77E-15 | 6.44 | 12.23 | v |
| 16 | 2529 | 1875 | 0.149 | 9.855 | 0 | 32.17 | 21.07 | v |
| 17 | 3582 | 2902 | 0.105 | 8.445 | 0 | 9.77 | 20.81 | v |
| 18 | 1182 | 1490 | −0.115 | | 1.26E-09 | 19.03 | 47.34 | x |
| 19 | 3287 | 3036 | 0.040 | 3.157 | 0.00079 | 25.46 | 20.13 | v |
| 20 | 1258 | 1193 | 0.027 | 1.313 | | 33.22 | 33.70 | |
| 21 | 670 | 779 | −0.075 | | 0.00212 | 7.33 | 10.15 | x |
| 22 | 1429 | 1793 | −0.113 | | 7.28E-11 | −2.52 | 19.94 | ? |
| X | 1927 | 1572 | 0.101 | 6.001 | 9.87E-10 | 16.73 | 22.23 | v |
| Y | 491 | 184 | 0.455 | 11.816 | 0.00E+00 | 17.58 | 0.34 | |
Three of the results are insignificant (highlighted; p > 0.05, q > 0.044 using FDR correction). Four chromosomes show the opposite preference; entries with P < M or T < A are set in italics. Among the significant results, 16 chromosomes display P > M, T > A, and G > C together. Chr 22 has both P < M and T < A. The last column indicates significant correlations of T-A and G-C with gene counts (positive by v and negative by x)
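The Y and Z columns in the two tables above can be reproduced from the raw counts. The formulas in this sketch, Y = (a − b)/(a + b) and Z = (a − b)/√(a + b), are not stated in this record; they are inferred because they match the tabulated values (for example chr1 and chr2 of the gene table), so treat them as an assumption. Function names are hypothetical.

```python
import math


def relative_excess(a: float, b: float) -> float:
    """Y(a, b): difference of two counts relative to their sum (inferred form)."""
    return (a - b) / (a + b)


def relative_excess_from_ratio(ratio: float) -> float:
    """Same Y expressed through the ratio a/b, e.g. the #T/#A column."""
    return (ratio - 1) / (ratio + 1)


def z_score(a: float, b: float) -> float:
    """Z(a, b): difference of two counts in units of sqrt(a + b) (inferred form)."""
    return (a - b) / math.sqrt(a + b)


# chr1 gene counts on the plus (P) and minus (M) strands, from the table above
p_count, m_count = 4488, 4291
print(f"Y(P,M) = {relative_excess(p_count, m_count):.3f}")
print(f"Z(P,M) = {z_score(p_count, m_count):.3f}")
```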
Comparison of two measures of inversion symmetry on chr1 of HG18 and HG38
| k | HG18 chr1: 1-S1 | HG18 chr1: Ek[X] | HG38 chr1: 1-S1 | HG38 chr1: Ek[X] |
|---|---|---|---|---|
| 5 | 0.0016 | 0.0021 | 0.0072 | 0.009 |
| 6 | 0.0026 | 0.0040 | 0.010 | 0.013 |
| 7 | 0.0048 | 0.0085 | 0.014 | 0.017 |
| 8 | 0.0091 | 0.018 | 0.018 | 0.025 |
| 9 | 0.017 | 0.038 | 0.027 | 0.043 |
| 10 | 0.033 | 0.083 | 0.043 | 0.085 |