
A study in entire chromosomes of violations of the intra-strand parity of complementary nucleotides (Chargaff's second parity rule).

B R Powdel, Siddhartha Sankar Satapathy, Aditya Kumar, Pankaj Kumar Jha, Alak Kumar Buragohain, Munindra Borah, Suvendra Kumar Ray.

Abstract

Chargaff's rule of intra-strand parity (ISP) between complementary mono/oligonucleotides in chromosomes is well established in the scientific literature. Although a large number of papers discussing ISP have been published in the genomic era, scientists are yet to find all the factors responsible for such a universal phenomenon in chromosomes. In the present work, we have tried to address the issue from a new perspective, using a feature that parallels ISP. The compositional abundance values of mono/oligonucleotides were determined in all non-overlapping sub-chromosomal regions of a specific size, and the frequency distributions of the mono/oligonucleotides among the regions were compared using the Kolmogorov-Smirnov test. Interestingly, the frequency distributions of complementary mono/oligonucleotides revealed statistical similarity, which we named intra-strand frequency distribution parity (ISFDP). ISFDP was observed as a general feature in chromosomes of bacteria, archaea and eukaryotes. Violation of ISFDP was also observed in several chromosomes. Chromosomes of different strains belonging to a species in bacteria/archaea (Haemophilus influenzae, Xylella fastidiosa etc.) and chromosomes of a eukaryote were found to differ from each other with respect to ISFDP violation. ISFDP correlates weakly with ISP in chromosomes, suggesting that the latter is not entirely responsible for the former. Asymmetry of replication topography and the composition of forward-encoded sequences between the strands were found to be insufficient to explain the ISFDP feature in all chromosomes. This suggests that multiple factors in chromosomes are responsible for establishing ISFDP.

Year:  2009        PMID: 19861381      PMCID: PMC2780954          DOI: 10.1093/dnares/dsp021

Source DB:  PubMed          Journal:  DNA Res        ISSN: 1340-2838            Impact factor:   4.458


Introduction

Chargaff's first parity rule, based on the nucleotide composition of double-stranded DNA, states that complementary nucleotides have the same abundance values.[1,2] This is explained by the DNA double-helix model, in which A pairs only with T and G pairs only with C.[3] Chargaff and his colleagues[4,5] made a similar observation of a compositional relationship between the complementary nucleotides even within individual DNA strands of bacterial chromosomes. In the post-genomic era, this intra-strand relationship between complementary nucleotides has been observed in the double-stranded genomes of viruses, bacteria, archaea and eukaryotes, and is known as Chargaff's second parity rule or intra-strand parity (ISP).[2] Unlike the first parity rule, which follows from the base-pairing rule, there is no defined rule that explains ISP in chromosomes. ISP is also observed between complementary oligonucleotides in chromosomes,[6-9] which has been attributed to genome-wide large-scale inversion, inversion transposition[10] and coding sequence compositional symmetry between the strands.[9] Violation of ISP is observed in the organellar (mitochondrial and plastid) genomes of some organisms, in single-stranded viral genomes and in RNA genomes.[11-13] Theoretically, in the absence of strand bias in mutation and selection, the base complementarity relationship readily explains the presence of ISP in chromosomes.[14,15] However, several lines of evidence now indicate that the two strands are not identical in terms of mutation/selection.[16] This results in violation of ISP in sub-chromosomal regions.
The longer the sub-chromosomal region, the smaller the observed violation of ISP.[17] The mechanisms responsible for the violation fall under three categories.[18] First, DNA replication: the leading strand (LeS) is found to be composed of more K nucleotides (G and T) than the complementary M nucleotides (A and C), and the reverse holds true for the lagging strand (LaS).[19] This is because the LeS, which functions as the template for Okazaki fragment synthesis (i.e. as the template for the LaS), remains exposed as single-stranded DNA more than the LaS (which functions as the template for the LeS) during replication, resulting in higher deamination of cytosine residues[20,21] in the LeS (cytosine is deaminated 140 times faster in ssDNA than in dsDNA[22]). In addition, the influence of Okazaki fragments and the sliding DNA clamp proteins associated with LaS synthesis creates a functional asymmetry of the mismatch repair system on DNA.[23] Second, transcription: genes are preferentially located on the LeS rather than the LaS to avoid head-on collision between the replication and transcription machineries.[24] During transcription, the non-template strand remains more exposed as single-stranded DNA than the template strand, which causes asymmetry in cytosine deamination between the strands.[22] The transcription-coupled repair system also acts only upon the template strand and thereby contributes to the strand asymmetry.[25] Third, translation: the use of synonymous codons is influenced by the differential abundance of tRNA molecules, which results in differential abundance of complementary nucleotides at the third position of family box codons.
This causes parity violation.[14] In spite of these factors favoring parity violations in chromosomes, ISP is observed over an entire chromosome because local violations in opposite directions cancel each other out.[14] Evolutionary biologists are more interested in understanding the role of mutation and/or selection in the violation of ISP by analyzing weakly selected or selectively neutral regions (third positions of family box codons and non-coding regions) in chromosomes.[14,26] Whether any specific feature is associated with chromosomes exhibiting ISP is yet to be understood. Shioiri and Takahata[27] studied ISP by determining the total AT skew (ATS) and GC skew (GCS) in the chromosomes of several bacteria. In their study, out of 36 bacterial chromosomes, that of Xylella fastidiosa exhibited the maximum ATS and GCS. They observed variable ATS/GCS among chromosomes of different strains of a species as well as among chromosomes within a bacterial cell. They also observed that ATS and GCS may differ from each other within a chromosome. Since they did not perform any statistical analysis of the skew, the significance of the variability observed among chromosomes was not discussed. The usual statistical tool used to assess ISP in chromosomes is the correlation analysis of oligonucleotide abundance described by Prabhu.[6] The study of ISP between complementary mononucleotides is important because oligonucleotide parity and mononucleotide parity have been reported to be independent.[8] Baisnée et al.[8] studied parity in chromosomes by measuring the S1 index, defined as the sum of the absolute values of the differences between the frequencies of complementary oligonucleotides (n-mers, with n varying from 1 to 9). Neither of these methods measures the statistical significance of the difference between the abundance values of a mono/oligonucleotide and its reverse complement.
For example, if a chromosome shows significant similarity between the abundance values of A and T but a significant difference between the abundance values of G and C, the two cases will not be identified separately. Similarly, the above methods cannot detect parity violations in chromosomes with respect to the abundance values of an individual oligonucleotide and its reverse complement. We have developed here a methodology that can study ISP independently between S nucleotides (G and C) and between W nucleotides (A and T), or between any oligonucleotide and its reverse complement, using their abundance values. We use the well-known Kolmogorov–Smirnov (KS) test to compare the frequency distributions of the compositional abundance values of the mononucleotides in a chromosome sequence, which gives the statistical significance of the similarity between the distributions of complementary nucleotides. We call this similarity intra-strand frequency distribution parity (ISFDP) and use it here to study the chromosomes of bacteria, archaea and eukaryotes.
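To make the limitation of a single summed index concrete, the following is a minimal Python sketch of an S1-style index as it is described in the Introduction (a sum of absolute differences between the frequencies of complementary n-mers). The function names, the use of overlapping n-mers and the choice to sum over every word (so that each complementary pair contributes twice) are assumptions made for illustration; the exact normalisation used by Baisnée et al. is not given in this text.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Return the reverse complement of an upper-case DNA word."""
    return word.translate(COMPLEMENT)[::-1]


def kmer_frequencies(seq, k):
    """Relative frequencies of overlapping k-mers (assumption: overlapping counts)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Ignore words containing ambiguous symbols such as N.
    counts = {w: c for w, c in counts.items() if set(w) <= set("ACGT")}
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def s1_index(seq, k):
    """Sum over all k-mers w of |f(w) - f(reverse complement of w)|.

    Every complementary pair is visited from both sides, so it contributes
    twice; palindromic words contribute zero.
    """
    freq = kmer_frequencies(seq, k)
    words = ("".join(p) for p in product("ACGT", repeat=k))
    return sum(
        abs(freq.get(w, 0.0) - freq.get(reverse_complement(w), 0.0))
        for w in words
    )
```

A single number of this kind is zero only when every complementary pair is balanced, but a non-zero value does not say which pair (A/T or G/C, for instance) is responsible, which is the gap the per-pair KS comparison is meant to fill.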

Materials and methods

Frequency distribution calculation

Chromosome sequences of different bacteria, archaea and eukaryotes (Tables 1–3) were obtained from the Genome Information Broker at the DDBJ site (www.gib.genes.nig.ac.jp). Bacterial chromosomes were chosen randomly from the database, with genus names ranging from A to Z. In several cases, chromosome sequences of different strains belonging to the same bacterial species were taken for intra-species comparison. Each chromosome sequence was divided, starting from the beginning, into consecutive segments of 1000 nucleotides each, and the abundance values of the four nucleotides in each segment were determined using a computer program developed for this study. The distributions of the abundance values of complementary nucleotides across the segments were compared by the non-parametric KS test using the XLSTAT program[28-30] (Kovach Computing Services, Anglesey, Wales). H0: the distribution patterns of any two nucleotides/oligonucleotides in a chromosome are similar; HA: there is a difference between the two distributions. Owing to the large sample size, a P-value of >0.01 was considered to indicate similarity, a P-value between 0.01 and 10⁻⁴ weak similarity, and a P-value of <10⁻⁴ strong violation of similarity. Group-frequency distributions of the abundance values were plotted to observe the frequency-distribution parity. For the di- and trinucleotides, the abundance values of the 16 dinucleotides and 64 trinucleotides in the segments were determined using a different computer program (also developed for this study). The analysis was then done as described above for the mononucleotides.
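The procedure described above can be sketched in a few lines of Python. This is not the authors' program or XLSTAT: the function names are hypothetical, a trailing segment shorter than the window is simply dropped (an assumption), and the P-value uses the standard asymptotic Kolmogorov series with a small-sample correction, which may differ from the values XLSTAT reports.

```python
import math
from bisect import bisect_right


def window_counts(seq, size=1000):
    """Count A, C, G and T in consecutive non-overlapping windows of `size` nt."""
    seq = seq.upper()
    n_windows = len(seq) // size  # assumption: a trailing partial window is dropped
    return [
        {base: seq[i * size:(i + 1) * size].count(base) for base in "ACGT"}
        for i in range(n_windows)
    ]


def ks_two_sample(x, y):
    """Two-sample KS statistic D and an asymptotic two-sided P-value."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    # D is the largest gap between the two empirical distribution functions.
    d = max(
        abs(bisect_right(xs, v) / n - bisect_right(ys, v) / m)
        for v in set(xs) | set(ys)
    )
    n_eff = n * m / (n + m)
    lam = (math.sqrt(n_eff) + 0.12 + 0.11 / math.sqrt(n_eff)) * d
    if lam < 0.3:  # the series is numerically 1 here and converges slowly
        return d, 1.0
    p = 2.0 * sum(
        (-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam) for j in range(1, 101)
    )
    return d, min(max(p, 0.0), 1.0)


def classify(p):
    """Apply the thresholds stated in the text."""
    if p > 0.01:
        return "similar"
    if p >= 1e-4:
        return "weak"
    return "strong violation"


def isfdp(seq, size=1000):
    """KS P-values for the W pair (A vs T) and the S pair (G vs C)."""
    rows = window_counts(seq, size)
    _, p_w = ks_two_sample([r["A"] for r in rows], [r["T"] for r in rows])
    _, p_s = ks_two_sample([r["G"] for r in rows], [r["C"] for r in rows])
    return {"KS (W)": p_w, "KS (S)": p_s}
```

The same comparison extends to di- and trinucleotides by replacing the per-window base counts with counts of a word and of its reverse complement.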
Table 1

ISFDP analysis in bacterial chromosomes

Serial number | Strain name | Size (kb) | GC% | KS (W) | KS (S) | ATS | GCS | Bacterial group | TB (°)
1 | Acinetobacter sp. ADP1 | 3598 | 40.43 | 0.745 | 0.006 | 0.00068 | 0.00484 | G-Proteobacteria | 7.07
2 | Actinobacillus pleuropneumoniae L20 serotype 5b | 2274 | 41.3 | 0.436 | 0.819 | 0.00187 | 0.00109 |  | NA
3 | Actinobacillus succinogenes 130Z | 2319 | 44.91 | 0.312 | 0.291 | 0.00232 | 0.00291 |  | 
4 | Aeromonas hydrophila subsp. hydrophila ATCC 7966 | 4744 | 61.55 | 0.88 | 0.19 | 0.00141 | 0.00139 |  | 
5 | Aeromonas salmonicida subsp. salmonicida A449 | 4702 | 58.51 | 0.04 | 0.959 | 0.00215 | 0.00073 |  | 
6 | Agrobacterium tumefaciens C58 (circular chromosome) | 2841 | 59.38 | <0.0001 | <0.0001 | 0.00694 | 0.00967 | A-Proteobacteria | 7.37
7 | Alkaliphilus oremlandii OhILAs | 3123 | 36.26 | <0.0001 | <0.0001 | 0.00615 | 0.01324 | Firmicutes | NA
8 | Anaeromyxobacter dehalogenans 2CP-C | 5013 | 74.9 | 0.077 | 0.001 | 0.00476 | 0.00249 | D-Proteobacteria | 70.57
9 | Anaeromyxobacter sp. Fw109-5 | 5277 | 73.53 | 0.712 | 0.008 | 0.00073 | 0.00216 |  | 7.48
10 | Bacillus anthracis Ames | 5227 | 35.38 | 0.004 | <0.0001 | 0.00215 | 0.00581 | Firmicutes | NA
11 | Bacillus anthracis 'Ames Ancestor' | 5227 | 35.38 | 0.003 | <0.0001 | 0.00215 | 0.00582 |  | 7.48
12 | Bacillus anthracis Sterne | 5228 | 35.38 | 0.008 | <0.0001 | 0.00221 | 0.00588 |  | 7.46
13 | Bacillus subtilis | 4214 | 43.52 | 0.219 | 0.234 | 0.00212 | 0.00224 |  | 13.69
14 | Bacillus thuringiensis Al Hakam | 5257 | 35.43 | 0.123 | 0.002 | 0.00042 | 0.00081 |  | NA
15 | Bacillus thuringiensis serovar konkukian 97-27 | 5237 | 35.41 | 0.015 | <0.0001 | 0.00194 | 0.00438 |  | 3.98
16 | Bordetella parapertussis 12822 | 4773 | 68.1 | 0.433 | <0.0001 | 0.00247 | 0.00776 | B-Proteobacteria | 37.01
17 | Bordetella pertussis Tohama 1 | 4086 | 67.72 | 0.861 | <0.0001 | 0.00022 | 0.00390 |  | 71.28
18 | Bradyrhizobium japonicum USDA 110 | 9105 | 64.06 | 0.512 | 0.31 | 0.00070 | 0.00038 | A-Proteobacteria | 7.07
19 | Bradyrhizobium sp. BTAi1 | 8264 | 64.92 | 0.381 | 0.01 | 0.00100 | 0.00163 |  | NA
20 | Brucella melitensis 16M | 1177 | 57.35 | 0.472 | 0.008 | 0.00227 | 0.00312 |  | 
21 | Campylobacter concisus 13826 | 2052 | 39.43 | 0.033 | 0.048 | 0.00038 | 0.00599 | E-Proteobacteria | 
22 | Campylobacter curvus 525.92 | 1971 | 44.54 | 0.028 | 0.752 | 0.00745 | 0.00282 |  | 
23 | Campylobacter jejuni RM1221 | 1777 | 30.31 | 0.574 | 0.23 | 0.00330 | 0.00436 |  | 8.69
24 | Campylobacter jejuni subsp. jejuni 81116 | 1628 | 30.54 | 0.491 | 0.029 | 0.00250 | 0.00613 |  | NA
25 | Campylobacter jejuni subsp. jejuni NCTC 11168 | 1641 | 30.55 | 0.067 | 0.132 | 0.00296 | 0.00457 |  | 10.25
26 | Candidatus Desulfococcus oleovorans Hxd3 | 3944 | 56.17 | 0.258 | 0.133 | 0.00199 | 0.00157 | Firmicutes | NA
27 | Caulobacter crescentus CB15 | 4016 | 67.22 | 0.042 | 0.171 | 0.00396 | 0.00188 | A-Proteobacteria | 8.56
28 | Chlamydia muridarum Nigg | 1072 | 40.34 | 0.221 | 0.853 | 0.00107 | 0.00337 | Chlamydiae | 1.17
29 | Chlamydia trachomatis AHAR-13 | 1044 | 41.31 | 0.228 | 0.284 | 0.00230 | 0.00059 |  | 1.30
30 | Chlamydophila abortus S263 | 1144 | 39.87 | 0.534 | 0.002 | 0.00065 | 0.00361 |  | 0.57
31 | Coxiella burnetii Dugway 7E9-12 | 2158 | 42.44 | 0.004 | 0.001 | 0.00592 | 0.00573 | G-Proteobacteria | NA
32 | Coxiella burnetii RSA 493 | 1995 | 42.66 | 0.014 | 0.467 | 0.00198 | 0.00029 |  | 31.15
33 | Desulfovibrio desulfuricans G20 | 3730 | 57.84 | 0.59 | 0.001 | 0.00189 | 0.00322 | Firmicutes | 10.70
34 | Desulfovibrio vulgaris subsp. vulgaris DP4 | 3462 | 63.01 | 0.3 | 0.159 | 0.00152 | 0.00106 | D-Proteobacteria | NA
35 | Desulfovibrio vulgaris subsp. vulgaris Hildenborough | 3570 | 63.14 | 0.557 | 0.082 | 0.00143 | 0.00024 |  | 4.78
36 | Enterobacter sakazakii ATCC BAA-894 | 4368 | 56.77 | 0.167 | 0.388 | 0.00359 | 0.00044 | G-Proteobacteria | NA
37 | Enterobacter sp. 638 | 4518 | 52.98 | 0.645 | 0.39 | 0.00169 | 0.00163 |  | NA
38 | Escherichia coli 536 | 4938 | 50.52 | 0.714 | 0.084 | 0.00062 | 0.00328 |  | 7.40
39 | Escherichia coli APEC O1 | 5082 | 50.55 | 0.779 | 0.576 | 0.00032 | 0.00070 |  | NA
40 | Escherichia coli CFT073 | 5231 | 50.48 | 0.112 | 0.92 | 0.00173 | 0.00080 |  | 5.66
41 | Escherichia coli E24377A | 4979 | 50.62 | 0.736 | 0.128 | 0.00205 | 0.00212 |  | NA
42 | Escherichia coli HS | 4643 | 50.82 | 0.328 | 0.469 | 0.00151 | 0.00207 |  | 
43 | Escherichia coli K12 MG1655 | 4639 | 50.79 | 0.732 | 0.587 | 0.00054 | 0.00113 |  | 4.28
44 | Escherichia coli UTI89 | 5065 | 50.6 | 0.51 | 0.237 | 0.00076 | 0.00203 |  | 3.70
45 | Escherichia coli W3110 | 4646 | 50.8 | 0.873 | 0.729 | 0.00073 | 0.00091 |  | 12.64
46 | Frankia alni ACN14A chromosome | 7497 | 72.82 | 0.463 | 0.036 | 0.00141 | 0.00139 | Actinobacteria | NA
47 | Frankia sp. CcI3 | 5433 | 70.08 | 0.808 | 0.662 | 0.00129 | 0.00017 |  | 
48 | Haemophilus influenzae 86-028NP | 1914 | 38.16 | 0.886 | 0.654 | 0.00089 | 0.00044 | G-Proteobacteria | 
49 | Haemophilus influenzae PittEE | 1813 | 38.04 | 0.544 | 0.038 | 0.00054 | 0.00317 |  | 
50 | Haemophilus influenzae PittGG | 1887 | 38.01 | 0.125 | <0.0001 | 0.00005 | 0.01016 |  | 
51 | Haemophilus influenzae Rd KW20 | 1830 | 38.15 | 0.154 | 0.004 | 0.00298 | 0.00472 |  | 46.61
52 | Helicobacter acinonychis Sheeba | 1553 | 38.18 | 0 | 0.596 | 0.00869 | 0.00164 | E-Proteobacteria | NA
53 | Helicobacter hepaticus ATCC 51449 | 1799 | 35.93 | 0.161 | <0.0001 | 0.00499 | 0.01518 |  | 46.54
54 | Helicobacter pylori J99 | 1643 | 39.19 | 0.246 | 0.256 | 0.00259 | 0.00510 |  | 10.97
55 | Lactobacillus acidophilus NCFM | 1993 | 34.72 | 0.382 | <0.0001 | 0.00066 | 0.01644 | Firmicutes | 19.54
56 | Lactobacillus brevis ATCC 367 | 2291 | 46.22 | 0.023 | <0.0001 | 0.00271 | 0.02882 |  | NA
57 | Lactobacillus delbrueckii subsp. bulgaricus ATCC BAA-365 | 1856 | 49.69 | 0.491 | 0.264 | 0.00201 | 0.00087 |  | 
58 | Lactobacillus reuteri F275 | 1999 | 38.87 | 0.001 | <0.0001 | 0.00122 | 0.01040 |  | 
59 | Lactococcus lactis subsp. cremoris MG1363 | 2529 | 35.75 | 0.233 | 0.056 | 0.00352 | 0.00524 |  | 
60 | Lactococcus lactis subsp. cremoris SK11 | 2438 | 35.86 | 0.399 | 0.521 | 0.00147 | 0.00136 |  | 
61 | Magnetococcus sp. MC-1 | 4719 | 54.17 | 0.001 | <0.0001 | 0.00490 | 0.01198 | Magnetococcus | 
62 | Magnetospirillum magneticum AMB-1 | 4967 | 65.09 | 0.031 | <0.0001 | 0.00339 | 0.00288 | A-Proteobacteria | 2.14
63 | Methylobacillus flagellatus KT | 2971 | 55.72 | 0.03 | 0.916 | 0.00226 | 0.00135 | B-Proteobacteria | 10.57
64 | Methylococcus capsulatus Bath | 3304 | 63.59 | 0.145 | 0.004 | 0.00150 | 0.00287 | G-Proteobacteria | NA
65 | Mycobacterium leprae TN | 3268 | 57.8 | 0.003 | <0.0001 | 0.00378 | 0.00609 | Actinobacteria | 7.04
66 | Mycobacterium sp. KMS | 5737 | 68.44 | 0.389 | 0.478 | 0.00030 | 0.00060 |  | NA
67 | Mycobacterium tuberculosis F11 | 4424 | 65.62 | 0.366 | 0.007 | 0.00006 | 0.00198 |  | 
68 | Mycobacterium ulcerans Agy99 | 5631 | 65.47 | <0.0001 | <0.0001 | 0.00433 | 0.00374 |  | 
69 | Mycoplasma gallisepticum R | 996 | 31.45 | 0.18 | 0.615 | 0.00626 | 0.00021 | Tenericutes | 9.32
70 | Mycoplasma genitalium G37 | 580 | 31.69 | 0 | 0.148 | 0.01219 | 0.00433 |  | 3.75
71 | Mycoplasma hyopneumoniae J | 897 | 28.52 | 0.033 | 0.599 | 0.01020 | 0.00067 |  | NA
72 | Mycoplasma pneumoniae M129 | 816 | 40.01 | 0.001 | 0.115 | 0.01767 | 0.00243 |  | 16.23
73 | Neisseria gonorrhoeae FA 1090 | 2153 | 52.69 | 0.07 | 0.033 | 0.00601 | 0.00144 | B-Proteobacteria | 9.20
74 | Neisseria meningitidis MC58 | 2273 | 51.52 | 0.695 | 0.004 | 0.00135 | 0.00806 |  | NA
75 | Nitrobacter hamburgensis X14 | 4406 | 61.72 | 0.332 | 0.53 | 0.00112 | 0.00041 | A-Proteobacteria | 
76 | Nitrobacter winogradskyi Nb-255 | 3402 | 62.05 | 0.011 | <0.0001 | 0.00323 | 0.00294 |  | 37.15
77 | Nitrosococcus oceani ATCC 19707 | 3481 | 50.32 | 0.02 | 0.056 | 0.00530 | 0.00243 | G-Proteobacteria | 8.39
78 | Nitrosomonas eutropha C91 | 2661 | 48.49 | 0.992 | 0.318 | 0.00043 | 0.00162 | B-Proteobacteria | NA
79 | Nostoc sp. PCC 7120 | 6413 | 41.35 | 0.134 | 0.857 | 0.00129 | 0.00162 | Cyanobacteria | 
80 | Pseudomonas entomophila L48 chromosome | 5888 | 64.16 | 0.657 | 0.251 | 0.00078 | 0.00173 | G-Proteobacteria | 1.99
81 | Pseudomonas fluorescens PfO-1 | 6438 | 60.52 | 0.003 | 0.028 | 0.00443 | 0.00222 |  | 3.18
82 | Pseudomonas putida F1 | 5959 | 61.86 | 0.602 | 0.013 | 0.00113 | 0.00187 |  | 36.81
83 | Ralstonia eutropha H16 | 2912 | 66.78 | 0.238 | 0.47 | 0.00483 | 0.00023 | B-Proteobacteria | NA
84 | Ralstonia solanacearum GMI1000 chromosome | 3716 | 67.04 | 0.056 | <0.0001 | 0.00636 | 0.00581 |  | 22.40
85 | Rhizobium etli CFN 42 | 4381 | 61.27 | 0.107 | <0.0001 | 0.00175 | 0.01177 | A-Proteobacteria | 17.65
86 | Rhizobium leguminosarum bv. viciae 3841 | 5057 | 61.09 | 0.001 | <0.0001 | 0.00363 | 0.01196 |  | NA
87 | Rickettsia bellii RML369-C | 1522 | 31.65 | 0 | <0.0001 | 0.00859 | 0.01514 |  | 26.08
88 | Rickettsia conorii Malish 7 | 1268 | 32.44 | 0.584 | 0.052 | 0.00294 | 0.00634 |  | 16.28
89 | Rickettsia rickettsii 'Sheila Smith' | 1257 | 32.47 | 0.575 | 0.002 | 0.00182 | 0.00767 |  | NA
90 | Rickettsia typhi Wilmington | 1111 | 28.92 | 0.919 | 0.007 | 0.00020 | 0.01395 |  | 26.15
91 | Salmonella enterica subsp. enterica serovar Typhi CT18 | 4809 | 52.09 | 0.267 | 0.043 | 0.00151 | 0.00152 | G-Proteobacteria | 9.85
92 | Salmonella typhimurium LT2 | 4857 | 52.22 | 0.89 | 0.585 | 0.00043 | 0.00008 |  | 3.58
93 | Shigella boydii Sb227 | 4519 | 51.21 | 0.571 | 0.001 | 0.00022 | 0.00249 |  | 11.05
94 | Shigella flexneri 58401 | 4574 | 50.92 | 0.48 | 0.268 | 0.00147 | 0.00214 |  | NA
95 | Staphylococcus aureus RF122 | 2742 | 32.78 | 0.788 | 0.427 | 0.00130 | 0.00247 | Firmicutes | 0.10
96 | Staphylococcus epidermidis ATCC 12228 | 2499 | 32.1 | <0.0001 | <0.0001 | 0.01246 | 0.01087 |  | 21.12
97 | Staphylococcus haemolyticus JCSC1435 | 2685 | 32.79 | 0.001 | 0 | 0.00584 | 0.00643 |  | NA
98 | Streptococcus mutans UA159 | 2030 | 36.83 | 0.111 | 0.046 | 0.00403 | 0.00679 |  | 
99 | Streptococcus pyogenes MGAS2096 | 1860 | 38.73 | 0.619 | 0.15 | 0.00133 | 0.00154 |  | 3.71
100 | Streptococcus thermophilus CNRZ1066 | 1796 | 39.08 | 0.05 | 0.863 | 0.00537 | 0.00459 |  | 2.63
101 | Streptomyces coelicolor A3(2) | 8667 | 72.12 | 0.001 | 0.037 | 0.00394 | 0.00134 | Actinobacteria | NA
102 | Thermotoga maritima MSB8 | 1860 | 46.25 | 0.171 | <0.0001 | 0.00344 | 0.01548 | Thermotogae | 39.15
103 | Thermotoga petrophila RKU-1 | 1824 | 46.09 | 0.733 | <0.0001 | 0.00013 | 0.01687 |  | NA
104 | Thiobacillus denitrificans ATCC 25259 | 2909 | 66.07 | 0.962 | 0.086 | 0.00027 | 0.00059 | B-Proteobacteria | 5.70
105 | Vibrio cholerae O395 | 3024 | 47.78 | <0.0001 | 0.069 | 0.00514 | 0.00105 | G-Proteobacteria | NA
106 | Vibrio fischeri ES114 | 1332 | 37.03 | 0.001 | 0.037 | 0.00994 | 0.00491 |  | 
107 | Xanthomonas campestris pv. campestris ATCC 33913 | 5076 | 65.07 | 0.196 | 0.719 | 0.00302 | 0.00038 |  | 
108 | Xanthomonas oryzae pv. oryzae KACC 10331 | 4941 | 63.69 | 0.87 | 0.499 | 0.00104 | 0.00065 |  | 
109 | Xylella fastidiosa 9a5c | 2679 | 52.68 | <0.0001 | <0.0001 | 0.04727 | 0.05291 |  | 62.97
110 | Xylella fastidiosa Temecula 1 | 2519 | 51.78 | 0.044 | 0 | 0.00379 | 0.01093 |  | 6.44
111 | Yersinia pestis CO92 | 4653 | 47.64 | 0.649 | 0.001 | 0.00090 | 0.00520 |  | NA
112 | Yersinia pseudotuberculosis IP32953 | 4744 | 47.61 | 0.969 | 0.001 | 0.00124 | 0.00496 |  | 

TB, termination bias. Chromosomes of bacteria analyzed in this study. The P-values of the KS test comparing the frequency distributions of complementary nucleotides are given as KS (W) for A and T and KS (S) for G and C. In bacteria, archaea and eukaryotes, P-values of <10⁻⁴ (strong violation of ISFDP) are shown in bold and P-values of <0.01 but ≥10⁻⁴ (weak violation of ISFDP) are shown in italics. P-values between 10⁻⁴ and 10⁻³ are shown as 0.000. The relative absolute difference between the abundance values of complementary nucleotides is given by |(∑A − ∑T)|/(∑A + ∑T) and |(∑G − ∑C)|/(∑G + ∑C) for ATS and GCS, respectively. In the chromosome of X. fastidiosa 9a5c, the GCS/ATS values are the highest, suggesting that the difference between the abundance values of complementary nucleotides is high. The P-value of the KS test is in concordance with the ATS/GCS, suggesting that the abundance difference can be represented by the frequency distribution study of the nucleotides. A similar relation is also observed in other chromosomes.
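The two relative skew values defined in the legend, |(∑A − ∑T)|/(∑A + ∑T) and |(∑G − ∑C)|/(∑G + ∑C), can be computed over a whole chromosome sequence as in the short Python sketch below. The function name and the choice to return 0.0 when a pair is absent from the sequence are illustrative assumptions.

```python
def relative_skews(seq):
    """Return (ATS, GCS) as the relative absolute differences defined in the legend."""
    seq = seq.upper()
    a, t, g, c = (seq.count(base) for base in "ATGC")
    # Assumption: a pair that does not occur at all is reported as zero skew.
    ats = abs(a - t) / (a + t) if a + t else 0.0
    gcs = abs(g - c) / (g + c) if g + c else 0.0
    return ats, gcs
```

For instance, a sequence with three A, one T, three G and one C gives 0.5 for both values.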

Table 2

ISFDP analysis in archaeal chromosomes

Serial number | Strain name | Size (kb) | GC% | KS (W) | KS (S) | ATS | GCS | Archaea group
1 | Aeropyrum pernix K1 | 1670 | 56.3 | 0.001 | 0.025 | 0.01292 | 0.00695 | Crenarchaeota
2 | Archaeoglobus fulgidus DSM 4304 | 2179 | 48.5 | 0.037 | 0.093 | 0.00365 | 0.00350 | Euryarchaeota
3 | Caldivirga maquilingensis IC-167 | 2078 | 43.08 | 0.586 | 0.643 | 0.00146 | 0.00104 | Crenarchaeota
4 | Candidatus Methanoregula boonei 6A8 | 2543 | 54.51 | 0.058 | 0.191 | 0.00311 | 0.00108 | Euryarchaeota
5 | Cenarchaeum symbiosum | 2046 | 57.34 | 0.101 | 0.006 | 0.00574 | 0.00161 | Crenarchaeota
6 | Haloarcula marismortui ATCC 43049 chromosome 1 | 3132 | 62.35 | 0.252 | 0.905 | 0.01075 | 0.00024 | Euryarchaeota
7 | Halobacterium sp. NRC-1 | 2015 | 67.88 | 0.862 | 0.313 | 0.00056 | 0.00151 | 
8 | Haloquadratum walsbyi DSM 16790 | 3133 | 47.85 | 0.578 | 0.027 | 0.00160 | 0.00523 | 
9 | Hyperthermus butylicus DSM 5456 | 1668 | 53.7 | 0.019 | 0.908 | 0.00531 | 0.00100 | Crenarchaeota
10 | Ignicoccus hospitalis KIN4/I | 1298 | 56.5 | 0.118 | 0.901 | 0.00199 | 0.00014 | 
11 | Metallosphaera sedula DSM 5348 | 2192 | 46.21 | 0 | <0.0001 | 0.00668 | 0.01423 | Crenarchaeota
12 | Methanobrevibacter smithii ATCC 35061 | 1854 | 31.02 | <0.0001 | <0.0001 | 0.02048 | 0.03768 | Euryarchaeota
13 | Methanocaldococcus jannaschii DSM 2661 | 1666 | 31.4 | 0.132 | 0.031 | 0.00450 | 0.01128 | 
14 | Methanococcoides burtonii DSM 6242 | 2576 | 40.74 | 0.078 | 0.002 | 0.00266 | 0.00845 | 
15 | Methanococcus aeolicus Nankai-3 | 1570 | 30.02 | 0.218 | 0.52 | 0.00399 | 0.00063 | 
16 | Methanococcus maripaludis C5 | 1781 | 32.99 | 0.001 | 0.065 | 0.00846 | 0.00454 | 
17 | Methanococcus maripaludis C6 | 1745 | 33.4 | 0.045 | 0.045 | 0.00553 | 0.00224 | 
18 | Methanococcus maripaludis C7 | 1773 | 33.27 | 0.256 | 0.784 | 0.00430 | 0.00088 | 
19 | Methanococcus maripaludis S2 | 1662 | 33.08 | 0.021 | 0.08 | 0.00619 | 0.00259 | 
20 | Methanococcus vannielii SB | 1721 | 31.31 | 0.505 | 0.519 | 0.00364 | 0.00400 | 
21 | Methanocorpusculum labreanum Z | 1806 | 49.97 | 0.606 | 0.05 | 0.00097 | 0.00404 | 
22 | Methanoculleus marisnigri JR1 | 2479 | 62.04 | 0.816 | 0.745 | 0.00234 | 0.00000 | 
23 | Methanopyrus kandleri AV19 | 1696 | 61.22 | 0.556 | 0.032 | 0.00230 | 0.00471 | 
24 | Methanosaeta thermophila PT | 1880 | 53.53 | 0.673 | 0.004 | 0.00018 | 0.00595 | 
25 | Methanosarcina acetivorans C2A | 5752 | 42.67 | <0.0001 | 0.839 | 0.00628 | 0.00083 | 
26 | Methanosarcina barkeri Fusaro | 4838 | 39.27 | <0.0001 | 0.003 | 0.00475 | 0.00391 | 
27 | Methanosarcina mazei Goe1 | 4097 | 41.47 | 0.252 | 0.812 | 0.00212 | 0.00079 | 
28 | Methanosphaera stadtmanae DSM 3091 | 1768 | 27.62 | 0.002 | 0.275 | 0.00897 | 0.00652 | 
29 | Methanospirillum hungatei JF-1 | 3545 | 45.14 | <0.0001 | 0.015 | 0.00951 | 0.00411 | 
30 | Methanothermobacter thermautotrophicus Delta H | 1752 | 49.52 | 0.022 | 0.114 | 0.00566 | 0.00166 | 
31 | Nanoarchaeum equitans Kin4-M | 491 | 31.55 | 0.549 | 0.177 | 0.00000 | 0.00127 | Nanoarchaeota
32 | Natronomonas pharaonis DSM 2160 | 2596 | 63.42 | 0.473 | 0.228 | 0.00174 | 0.00091 | Euryarchaeota
33 | Nitrosopumilus maritimus SCM1 | 1646 | 31.15 | <0.0001 | 0.002 | 0.00921 | 0.00855 | Crenarchaeota
34 | Picrophilus torridus DSM 9790 | 1546 | 35.96 | 0.296 | 0.008 | 0.00096 | 0.00793 | Euryarchaeota
35 | Pyrobaculum aerophilum IM2 | 2223 | 51.34 | 0.001 | <0.0001 | 0.00727 | 0.01022 | Crenarchaeota
36 | Pyrobaculum arsenaticum DSM 13514 | 2122 | 54.98 | 0.795 | 0.431 | 0.00138 | 0.00316 | 
37 | Pyrobaculum calidifontis JCM 11548 | 2010 | 57.13 | 0.148 | 0.337 | 0.00294 | 0.00008 | 
38 | Pyrobaculum islandicum DSM 4184 | 1827 | 49.58 | 0.305 | 0.436 | 0.00085 | 0.00183 | 
39 | Pyrococcus abyssi | 1766 | 44.69 | 0.652 | 0.574 | 0.00219 | 0.00342 | Euryarchaeota
40 | Pyrococcus furiosus DSM 3638 | 1909 | 40.75 | 0.754 | 0.757 | 0.00004 | 0.00094 | 
41 | Pyrococcus horikoshii OT3 | 1739 | 41.86 | 0.133 | 0.002 | 0.00229 | 0.01262 | 
42 | Staphylothermus marinus F1 | 1571 | 35.71 | <0.0001 | <0.0001 | 0.02078 | 0.02726 | Crenarchaeota
43 | Sulfolobus acidocaldarius DSM 639 | 2227 | 36.69 | 0.413 | 0.526 | 0.00309 | 0.00124 | 
44 | Sulfolobus solfataricus P2 | 2993 | 35.77 | 0.007 | 0.747 | 0.00533 | 0.00241 | 
45 | Sulfolobus tokodaii 7 | 2695 | 32.78 | 0.005 | 0.029 | 0.00521 | 0.00659 | 
46 | Thermococcus kodakarensis KOD1 | 2089 | 51.98 | 0.062 | 0.328 | 0.00418 | 0.00160 | Euryarchaeota
47 | Thermofilum pendens Hrk 5 | 1782 | 57.66 | 0.014 | 0.005 | 0.00346 | 0.00665 | Crenarchaeota
48 | Thermoplasma acidophilum DSM 1728 | 1565 | 45.99 | 0.016 | 0.016 | 0.00680 | 0.00383 | Euryarchaeota
49 | Thermoplasma volcanium GSS1 | 1585 | 39.91 | 0.055 | 0.361 | 0.00404 | 0.00263 | 

Chromosomes of archaea analyzed in this study. The P-values of the KS test comparing the frequency distributions of complementary nucleotides are given as KS (W) for A and T and KS (S) for G and C. In bacteria, archaea and eukaryotes, P-values of <10⁻⁴ (strong violation of ISFDP) are shown in bold and P-values of <0.01 but ≥10⁻⁴ (weak violation of ISFDP) are shown in italics. P-values between 10⁻⁴ and 10⁻³ are shown as 0.000. The relative absolute difference between the abundance values of complementary nucleotides is given by |(∑A − ∑T)|/(∑A + ∑T) and |(∑G − ∑C)|/(∑G + ∑C) for ATS and GCS, respectively. In the chromosome of X. fastidiosa 9a5c, the GCS/ATS values are the highest, suggesting that the difference between the abundance values of complementary nucleotides is high. The P-value of the KS test is in concordance with the ATS/GCS, suggesting that the abundance difference can be represented by the frequency distribution study of the nucleotides. A similar relation is also observed in other chromosomes.

Table 3

ISFDP analysis in eukaryotic chromosomes

Serial number | Strain name | Size (kb) | GC% | KS (W) | KS (S) | ATS | GCS | Eukaryotes group
1 | Guillardia theta nucleomorph chromosome 01 | 197 | 25.64 | 0.411 | 0.468 | 0.00080 | 0.00517 | Cryptophyta
2 | Guillardia theta nucleomorph chromosome 02 | 181 | 26.7 | 0.435 | 0.35 | 0.00451 | 0.00356 | 
3 | Guillardia theta nucleomorph chromosome 03 | 175 | 26.81 | 0.671 | 0.403 | 0.00051 | 0.00622 | 
4 | Leishmania major Friedlin chromosome 01 | 270 | 62.84 | 0.055 | <0.0001 | 0.01290 | 0.02500 | Euglenozoa
5 | Plasmodium falciparum 3D7 chromosome 01 | 644 | 20.52 | 0.001 | 0.69 | 0.02184 | 0.01210 | Alveolata
6 | Plasmodium falciparum 3D7 chromosome 05 | 1344 | 19.32 | 0.006 | 0.005 | 0.01288 | 0.01482 | 
7 | Plasmodium falciparum 3D7 chromosome 11 | 2036 | 18.95 | 0.043 | 0.027 | 0.00339 | 0.00994 | 
8 | Plasmodium falciparum 3D7 chromosome 12 | 2272 | 19.31 | 0.05 | 0.677 | 0.00597 | 0.00376 | 
9 | Plasmodium falciparum 3D7 chromosome 13 | 2733 | 19.11 | 0.105 | 0.266 | 0.00422 | 0.00914 | 
10 | Plasmodium falciparum 3D7 chromosome 14 | 3292 | 18.43 | 0.258 | 0.062 | 0.00275 | 0.00730 | 
11 | Saccharomyces cerevisiae S288C chromosome 01 | 231 | 39.14 | 0.731 | 0.088 | 0.00100 | 0.01231 | Fungi
12 | Saccharomyces cerevisiae S288C chromosome 04 | 1532 | 37.9 | 0.807 | 0.379 | 0.00240 | 0.00345 | 
13 | Saccharomyces cerevisiae S288C chromosome 07 | 1091 | 38.05 | 0.285 | 0.85 | 0.00136 | 0.00080 | 
14 | Saccharomyces cerevisiae S288C chromosome 12 | 1079 | 38.44 | 0.055 | 0.461 | 0.00325 | 0.00173 | 
15 | Saccharomyces cerevisiae S288C chromosome 15 | 1092 | 38.13 | 0.181 | 0.64 | 0.00584 | 0.00387 | 
16 | Schizosaccharomyces pombe 972h chromosome 01 | 5574 | 36.09 | 0.4 | 0.076 | 0.00073 | 0.00086 | 
17 | Schizosaccharomyces pombe 972h chromosome 02 | 4510 | 35.92 | 0.461 | 0.825 | 0.00207 | 0.00039 | 
18 | Schizosaccharomyces pombe 972h chromosome 03 | 2453 | 36.23 | 0.152 | 0.012 | 0.00217 | 0.00369 | 

Chromosomes of eukaryotes analyzed in this study. The P-values of the KS test comparing the frequency distributions of complementary nucleotides are given as KS (W) for A and T and KS (S) for G and C. In bacteria, archaea and eukaryotes, P-values of <10⁻⁴ (strong violation of ISFDP) are shown in bold and P-values of <0.01 but ≥10⁻⁴ (weak violation of ISFDP) are shown in italics. P-values between 10⁻⁴ and 10⁻³ are shown as 0.000. The relative absolute difference between the abundance values of complementary nucleotides is given by |(∑A − ∑T)|/(∑A + ∑T) and |(∑G − ∑C)|/(∑G + ∑C) for ATS and GCS, respectively. In the chromosome of X. fastidiosa 9a5c, the GCS/ATS values are the highest, suggesting that the difference between the abundance values of complementary nucleotides is high. The P-value of the KS test is in concordance with the ATS/GCS, suggesting that the abundance difference can be represented by the frequency distribution study of the nucleotides. A similar relation is also observed in other chromosomes.

Angular replication asymmetry of the chromosomes was calculated using the information on ori (origin) and ter (termination) given on the websites (http://www.cbs.dtu.dk/services/GenomeAtlas/suppl/origin/ and http://pbil.univ-lyon1.fr/software/Oriloc/oriloc.html). The chromosomal region from ori to ter was considered the leading region of the Watson strand (Ws) and the remaining portion of the chromosome the lagging region. For a circular chromosome, the angular replication asymmetry was calculated as the angular extent by which the leading region deviates from 180°.
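One way to read this definition is sketched below in Python: the ori-to-ter region is mapped onto the 360° of a circular chromosome and its absolute deviation from 180° is reported. The function name, the 0-based coordinate convention and the wrap-around handling when ter lies before ori in the sequence file are assumptions for illustration, not details taken from the text.

```python
def angular_replication_asymmetry(ori, ter, length):
    """Absolute deviation (in degrees) of the ori-to-ter region from 180 degrees.

    `ori` and `ter` are positions on the Watson strand and `length` is the
    chromosome size, all in the same units (for example nucleotides).
    """
    # The modulo handles a leading region that wraps past the end of the sequence.
    leading = (ter - ori) % length
    return abs(360.0 * leading / length - 180.0)
```

A perfectly symmetric chromosome, with ter diametrically opposite ori, gives 0°.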

Proportionate distribution of forward- and reverse-encoded sequences in a DNA strand

Only the coding sequences were downloaded from the DDBJ site. A continuous stretch of nucleotide sequence was made from all these sequences by removing the gene names. This resembled a DNA strand composed only of forward-encoded sequences, and frequency distribution analysis was done on it. In another approach, 50% of the above strand was reverse-complemented in silico and then joined with the rest. This resembled a DNA strand composed of 50% forward-encoded and 50% reverse-encoded sequences. The frequency-distribution study was carried out as described above.
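The construction of the two artificial strands can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that the first half of the concatenated strand is the part that gets reverse-complemented; the text does not say which 50% was used.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def forward_only_strand(coding_sequences):
    """Join coding sequences (gene names already removed) into one strand."""
    return "".join(coding_sequences)


def half_reversed_strand(coding_sequences):
    """Reverse-complement half of the joined strand and rejoin it with the rest.

    Assumption: the first half is reverse-complemented and the second half
    is left as it is.
    """
    strand = forward_only_strand(coding_sequences)
    half = len(strand) // 2
    return reverse_complement(strand[:half]) + strand[half:]
```

Both strands can then be cut into 1000-nucleotide segments and compared with the KS test in the same way as the real chromosome sequences.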

Identification of leading and lagging strand regions

ATS and GCS analyses of the chromosome sequences were done as described earlier.[21] This was used to find out the tentative leading and lagging portions in a DNA strand.

Relative proportion of coding sequence distribution

This was calculated by taking the difference between the numbers of ORFs on the Ws (top strand) and the Crick strand (Cs: bottom strand) and dividing it by the total number of ORFs. Gene orientation information was obtained from the website (http://cmr.jcvi.org/tigr-scripts/CMR/CmrHomePage.cgi).
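As a small worked form of this ratio, the Python sketch below computes (ORFs on Ws − ORFs on Cs) / (total ORFs). The sign convention (Ws minus Cs) and the function name are assumptions; the text only states that the difference is divided by the total.

```python
def relative_cds_distribution(orfs_watson, orfs_crick):
    """Signed strand bias of ORF counts: (Ws - Cs) / (Ws + Cs)."""
    total = orfs_watson + orfs_crick
    if total == 0:
        raise ValueError("no ORFs supplied")
    return (orfs_watson - orfs_crick) / total
```

With 60 ORFs on the Ws and 40 on the Cs, for example, the value is 0.2; equal numbers on both strands give 0.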

Results

ISFDP in chromosomes of bacteria

In this study, a total of 112 bacterial chromosomes were considered, covering different lineages of bacteria such as proteobacteria, cyanobacteria, firmicutes, actinobacteria etc. Samples from each group were taken randomly. The sampled bacteria span a GC% range from a minimum of 28% to a maximum of 75% and a chromosome size range from 580 kb to a maximum of 9105 kb. We studied the frequency distributions of the abundance values of mononucleotides in sub-chromosomal segments of a uniform length of 1000 nucleotides. A collective analysis of the nucleotide abundance values from all the segments of a chromosome was done by drawing smooth frequency distribution curves using Microsoft Excel, and the similarity of the distributions of two complementary nucleotides was tested using the KS test (XL-Stat; http://www.xlstat.com/en/download). Figure 1A(i), B(i), C(i), D(i) and E(i) represent the smooth curves of the frequency distributions of nucleotides in the chromosomes of Campylobacter jejuni RM1221 (30.31%), Escherichia coli K12 MG1655 (50.79%), Xanthomonas campestris pv. campestris (Xcc; 65.07%), X. fastidiosa 9a5c (52.68%) and X. fastidiosa Temecula (51.78%). The smooth curves of complementary nucleotides overlap with each other in the first three chromosomes, whereas those of non-complementary ones do not. In the fourth chromosome, none of the curves overlap with each other. In the E. coli chromosome [Fig. 1B(i)], all four smooth frequency curves are close to each other because the abundance values of the nucleotides are close, whereas in the graphs of C. jejuni and Xcc, the smooth frequency curves of W (A and T) and S (G and C) nucleotides are distinctly separated, as the GC% values of these chromosomes lie toward the two extremes. The distributions were studied by the KS test and the results for these chromosomes are shown in Fig. 1A(ii, iii), B(ii, iii), C(ii, iii), D(ii, iii) and E(ii, iii).
The graphs generated by the KS test suggest complete overlap between the complementary nucleotides in all of these chromosomes except that of the X. fastidiosa 9a5c strain, which is in concordance with the smooth frequency curves. The distributional similarity between the complementary nucleotides is called ISFDP. A total of 112 bacterial, 49 archaeal and 18 eukaryotic chromosomes (Tables 1–3) were analyzed by the KS test to study ISFDP. The P-values between the A and T distributions as well as between the G and C distributions are given in Tables 1–3.
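The group-frequency distributions behind the smooth curves can be tabulated as in the Python sketch below, which bins the per-segment abundance values of one nucleotide. The authors used Microsoft Excel for this step; the function name and the bin width of 10 are assumptions, since the class interval used in the paper is not stated here.

```python
from collections import Counter


def group_frequency(abundance_values, bin_width=10):
    """Return sorted (bin start, number of segments) pairs.

    `abundance_values` holds the count of one nucleotide in each 1000-nt segment.
    """
    bins = Counter((value // bin_width) * bin_width for value in abundance_values)
    return sorted(bins.items())
```

Plotting the resulting pairs for A, T, G and C on the same axes gives curves of the kind described for Fig. 1, where overlapping curves for a complementary pair indicate ISFDP.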
Figure 1

(A–E) Frequency distribution of nucleotides in chromosomes. Smooth curves present the group-frequency distribution of the four nucleotides a (square), t (asterisk), g (triangle) and c (rhombus). The X-axis represents the abundance values of the nucleotide spanning a range, whereas the Y-axis represents the frequency of the abundance values. In (A), the chromosome is AT rich; in (B), the chromosome is composed of similar AT and GC and in (C), the chromosome is GC rich. This is also evident from the group-frequency distribution curve. The smooth frequency curves of complementary nucleotides in these chromosomes are overlapping with each other. The KS test is shown for S and W nucleotides separately adjacent to the figures, respectively [a(ii, iii)–e(ii, iii)]. The KS test is in concordance with the curve obtained by smoothing group-frequency distribution. In (D) and (E), the group-frequency distribution for the chromosomes of two strains of X. fastidiosa is shown. In 9a5c strain chromosome, the smooth frequency curve between the complementary nucleotides does not overlap which is also suggested by the KS test. However, in Temecula 1 strain chromosome, the parity is maintained.

Tables 1–3: ISFDP analysis in bacterial chromosomes (TB, termination bias), archaeal chromosomes and eukaryotic chromosomes analyzed in this study. The KS test results for significance between the frequency distributions of complementary nucleotide values are given as KS (W) between A and T and KS (S) between G and C. In bacteria, archaea and eukaryotes, P-values of <10−4 (strong violation of ISFDP) are shown in bold and P-values of <0.01 but ≥10−4 (weak violation of ISFDP) are shown in italics. The P-value between 10−4 and 10−3 is shown as 0.000. The relative absolute abundance value difference between the complementary nucleotides is given by |(∑A − ∑T)|/(∑A + ∑T) for ATS and |(∑G − ∑C)|/(∑G + ∑C) for GCS. In the chromosome of X. fastidiosa 9a5c, the GCS/ATS value is the highest, suggesting that the difference between the abundance values of complementary nucleotides is high. The P-value from the KS test is in concordance with the ATS/GCS, suggesting that the abundance difference can be represented by the frequency distribution study of the nucleotides. A similar relation is also observed in other chromosomes.

Out of the 112 bacterial chromosomes, 60 exhibited ISFDP, 15 exhibited violation between S nucleotides as well as between W nucleotides, 30 exhibited violation only between S nucleotides and 7 exhibited violation only between W nucleotides (Table 4). Chromosomes of Alkaliphilus oremlandii OhILAs (36.26%), Agrobacterium tumefaciens C58 (circular; 59.38%), Mycobacterium ulcerans Agy99 (65.47%), Staphylococcus epidermidis ATCC 12228 (32.1%) and X. fastidiosa 9a5c (52.68%) exhibited strong violations between S nucleotides as well as between W nucleotides. Chromosomes of the three Bacillus anthracis (35.35%) strains, Lactobacillus reuteri F275 (38.87%), Magnetococcus sp. MC-1 (54.17%), Mycobacterium leprae TN (57.8%), Rhizobium leguminosarum bv. viciae 3841 (61.09%) and Rickettsia bellii RML369-C (31.65%) exhibited strong violation between S nucleotides and weak violation between W nucleotides. Chromosomes of Coxiella burnetii Dugway 7E9-12 (42.44%) and Staphylococcus haemolyticus JCSC1435 (32.79%) exhibited weak violation between S nucleotides as well as between W nucleotides. The chromosome of Vibrio cholerae O395 (47.78%) exhibited strong violation of ISFDP only between W nucleotides. Similarly, six chromosomes exhibited weak violations only between W nucleotides.
Chromosomes of Bacillus thuringiensis serovar konkukian 97-27 (34.41%), Bordetella parapertussis 12822 (68.1%), Bordetella pertussis Tohama 1 (67.72%), Haemophilus influenzae PittGG (38.01%), Helicobacter hepaticus ATCC 51449 (35.93%), Lactobacillus acidophilus NCFM (34.72%), Lactobacillus brevis ATCC 367 (46.22%), Nitrobacter winogradskyi Nb-255 (62.05%), Ralstonia solanacearum GMI1000 (67.04%), Rhizobium etli CFN 42 (61.27%), Thermotoga maritima MSB8 (46.25%) and Thermotoga petrophila RKU-1 (46.09%) exhibited strong violation only between S nucleotides. Similarly, 17 chromosomes exhibited weak violation only between S nucleotides. An interesting finding of this study is that violations of ISFDP within a chromosome with respect to S and W nucleotides may not be of similar magnitude. This study suggests that although ISFDP is commonly observed among chromosomes, its violation is not as rare as described earlier.[13] ISFDP violations were found in bacteria belonging to different groups, with different GC% and different genome sizes.
Table 4

Summary of ISFDP violations in chromosomes of Bacteria, Archaea and Eukaryotes

Organism | Number of chromosomes | Number of chromosomes exhibiting ISFDP for both W and S | Number of chromosomes violating* ISFDP for both W and S | Number of chromosomes violating ISFDP only between S nucleotides | Number of chromosomes violating ISFDP only between W nucleotides
Bacteria | 112 | 60 | 15 (5a+8b+0c+2d) | 30 (13e+17f) | 7 (1g+6h)
Archaea | 49 | 30 | 6 (2a+2b+2c+0d) | 6 (0e+6f) | 7 (2g+5h)
Eukaryotes | 18 | 15 | 1 (0a+0b+0c+1d) | 1 (1e+0f) | 1 (0g+1h)

* Violation of ISFDP includes both weak (10−2 > P ≥ 10−4) and strong (P < 10−4).

a: Strong violation between S nucleotides as well as between W nucleotides.

b: Strong violation between S nucleotides but weak violation between W nucleotides.

c: Weak violation between S nucleotides but strong violation between W nucleotides.

d: Weak violation between S nucleotides as well as between W nucleotides.

e: Strong violation only between S nucleotides.

f: Weak violation only between S nucleotides.

g: Strong violation only between W nucleotides.

h: Weak violation only between W nucleotides.
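The thresholds and footnote categories of Table 4 can be expressed compactly in code. The sketch below is illustrative only: the function and constant names are hypothetical, and it assumes the P-values of KS (S) and KS (W) for a chromosome are already available.

```python
STRONG_THRESHOLD = 1e-4  # P < 10^-4: strong violation of ISFDP
WEAK_THRESHOLD = 1e-2    # 10^-2 > P >= 10^-4: weak violation of ISFDP

def violation_level(p_value):
    """Classify one KS-test P-value as 'strong', 'weak' or 'none'."""
    if p_value < STRONG_THRESHOLD:
        return "strong"
    if p_value < WEAK_THRESHOLD:
        return "weak"
    return "none"

# Footnote letters of Table 4, keyed by (level for S, level for W).
FOOTNOTE = {
    ("strong", "strong"): "a",
    ("strong", "weak"): "b",
    ("weak", "strong"): "c",
    ("weak", "weak"): "d",
    ("strong", "none"): "e",
    ("weak", "none"): "f",
    ("none", "strong"): "g",
    ("none", "weak"): "h",
}

def table4_category(p_s, p_w):
    """Return the Table 4 footnote letter, or 'parity' when neither test is significant."""
    key = (violation_level(p_s), violation_level(p_w))
    return FOOTNOTE.get(key, "parity")
```

For example, a chromosome with a KS (S) P-value below 10−4 and a non-significant KS (W) P-value falls in category e.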

Usually, different strains within a species are similar with respect to ISFDP: the eight E. coli strains all exhibited ISFDP between S nucleotides as well as between W nucleotides, and the three B. anthracis strains were similar in their ISFDP violation (strong violation between S nucleotides and weak violation between W nucleotides). However, variation among the strains of a bacterial species with respect to ISFDP was also observed. Of the two strains of C. burnetii, the Dugway 7E9-12 strain violated ISFDP, whereas the RSA 493 strain exhibited ISFDP. Of the four H. influenzae strains, 86-028NP and PittEE exhibited ISFDP, whereas PittGG and Rd KW20 exhibited strong and weak violations only between S nucleotides, respectively. Xylella fastidiosa 9a5c exhibited strong violation of ISFDP, whereas X. fastidiosa Temecula 1 exhibited weak violation of ISFDP only between S nucleotides. We call these intra-species ISFDP violations. Chromosomes of four species of the genus Mycobacterium differed widely from each other with respect to ISFDP. The chromosome of Mycobacterium sp. KMS (68.44%) exhibited parity between S nucleotides as well as between W nucleotides, whereas the chromosome of M. ulcerans Agy99 (65.47%) exhibited strong violation of the parity between S nucleotides as well as between W nucleotides.

ISFDP in chromosomes of archaea and eukaryotes

Out of the 49 archaeal chromosomes, 30 exhibited ISFDP, 6 exhibited violations between S nucleotides as well as between W nucleotides, 6 exhibited violations only between S nucleotides and 7 exhibited violations only between W nucleotides (Table 4). Chromosomes of Methanobrevibacter smithii ATCC 35061 (31.02%) and Staphylothermus marinus F1 (35.71%) exhibited strong violation of ISFDP between S nucleotides as well as between W nucleotides. Chromosomes of Metallosphaera sedula DSM 5348 (46.21%) and Pyrobaculum aerophilum IM2 (51.34%) exhibited strong violations between S nucleotides but weak violations between W nucleotides. Strong violation between W nucleotides and weak violation between S nucleotides were observed in chromosomes of Methanosarcina barkeri Fusaro (39.27%) and Nitrosopumilus maritimus SCM1 (31.15%). This suggests that, in archaea as in bacteria, the magnitude of parity violation between S nucleotides within a chromosome may differ from that between W nucleotides. Intra-species parity violation was also observed in archaea, in the case of Methanococcus maripaludis: the C5 strain exhibited ISFDP violation between W nucleotides but parity between S nucleotides, whereas the C6, C7 and S2 strains exhibited ISFDP between S nucleotides as well as between W nucleotides. Out of the 18 eukaryotic chromosomes belonging to five species, 15 exhibited ISFDP (Table 4). Strong violation of ISFDP only between S nucleotides was observed in Leishmania major Friedlin chromosome 01 (62.84%). Plasmodium falciparum 3D7 chromosome 05 exhibited weak violation of parity between S nucleotides as well as between W nucleotides, whereas chromosome 01 exhibited violation of parity only between W nucleotides. The other four chromosomes of P. falciparum exhibited parity between S nucleotides as well as between W nucleotides.
Similarly, although the eight chromosomes of Saccharomyces cerevisiae exhibited parity between S nucleotides as well as between W nucleotides, the P-values for either S or W nucleotides differed by more than 10-fold among the chromosomes. This differential ISFDP violation observed among chromosomes of an organism suggests that there may not be any strict rule inside a cell to maintain ISFDP.

ISFDP between complementary oligonucleotides in chromosomes

ISP between the compositional abundance values of complementary oligonucleotides is well reported. Here we studied the frequency distributions of complementary di- and trinucleotides in chromosomes as described for mononucleotides. The smooth curves of oligonucleotide frequencies are shown in Supplementary data. In Supplementary Fig. S1a and b, the frequency distributions of dinucleotides are shown for the chromosomes of E. coli K12 MG1655 and Pseudomonas entomophila L48 (64.16%). Among the 12 smooth frequency curves (the four palindromic dinucleotides were excluded), the curves of complementary dinucleotides overlap. In Fig. 2, although the abundance values of the aa, tt, tg and ca dinucleotides in the E. coli chromosome are close, only the distributions of complementary dinucleotides overlap, whereas those of non-complementary ones differ. The distributions of aa and tt have a higher standard deviation (values not shown) than those of tg and ca. Similarly, the distributions of the gg and cc dinucleotides have a higher standard deviation (values not shown) than those of the tc and ga dinucleotides, although the abundance values of the four dinucleotides are close to each other. The significance of the similarity was studied by the KS test, which suggested that the frequency distributions of complementary dinucleotides are statistically similar. Apart from this, dinucleotide distribution parity was studied in three more bacterial chromosomes, two archaeal chromosomes and one eukaryotic chromosome (data not shown), and similar results were observed. In Supplementary Fig. S2i and ii, the distribution of 22 trinucleotides of the E. coli K12 MG1655 chromosome is shown. As with dinucleotides, the distributions of complementary trinucleotides overlap.
Distribution similarity between complementary trinucleotides was studied by the KS test for the 64 trinucleotides, which suggested that the distributions of complementary trinucleotides within a strand are similar. The same study was done in one more bacterial chromosome (data not shown) and similar results were obtained. Although we did not analyze the chromosomes of archaea and eukaryotes for trinucleotide distribution parity, it is expected to be present because these chromosomes exhibited ISFDP for mononucleotides as well as dinucleotides.
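To make the pairing of complementary oligonucleotides concrete, the sketch below enumerates reverse-complementary k-mer pairs (palindromic k-mers, which are their own reverse complement, are excluded) and counts a k-mer within non-overlapping segments. The helper names are hypothetical. Whether occurrences were counted with overlap, or across segment boundaries, is not stated in this section, so overlapping counts within each segment are an assumption here.

```python
from itertools import product

COMPLEMENT = str.maketrans("acgt", "tgca")

def reverse_complement(kmer):
    """Reverse complement of a lower-case oligonucleotide, e.g. 'tg' -> 'ca'."""
    return kmer.translate(COMPLEMENT)[::-1]

def complementary_pairs(k):
    """Sorted list of (kmer, reverse complement) pairs, excluding palindromic k-mers."""
    pairs = set()
    for letters in product("acgt", repeat=k):
        kmer = "".join(letters)
        rc = reverse_complement(kmer)
        if kmer != rc:  # skip palindromes such as 'at' or 'gc'
            pairs.add(tuple(sorted((kmer, rc))))
    return sorted(pairs)

def kmer_window_counts(seq, kmer, size=1000):
    """Count (possibly overlapping) occurrences of `kmer` in each complete window."""
    seq, kmer = seq.lower(), kmer.lower()
    k = len(kmer)
    counts = []
    for start in range(0, len(seq) - size + 1, size):
        window = seq[start:start + size]
        counts.append(sum(1 for i in range(size - k + 1) if window[i:i + k] == kmer))
    return counts
```

For k = 2 this yields the six complementary pairs formed by the 12 non-palindromic dinucleotides, including aa/tt, ca/tg and ga/tc; for k = 3 no trinucleotide is palindromic, so all 64 form 32 pairs.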
Figure 2

A schematic representation of coding sequence arrangement studied. In the upper row, the entire DNA strand is composed of forward encoded sequences (black color). Parity is not observed in this case. In the lower row, the DNA strand is made up of 50% forward encoded sequences and the other 50% is the reverse encoded sequences (white color). Parity is observed in this case.


ISFDP weakly correlates with Chargaff's second parity

ISFDP was compared with the ATS/GCS in chromosomes to find out whether one can define the other. GCS was compared with ISFDP violation between S nucleotides, and ATS was compared with ISFDP violation between W nucleotides. Among the bacterial chromosomes, the maximum GCS was found in X. fastidiosa 9a5c, with a value of 0.0529. All of the 16 chromosomes with GCS ≥0.01 violated ISFDP (14 strongly and 2 weakly). Out of the 18 chromosomes with GCS ≥0.005 but <0.01, 6 exhibited no significant violation, 7 exhibited strong violation and 5 exhibited weak violation of ISFDP. Similarly, out of the 56 chromosomes with GCS ≥0.001 but <0.005, 5 exhibited strong violation, 11 exhibited weak violation and 40 exhibited no significant violation. Out of the 22 chromosomes with GCS <0.001, all exhibited no significant violation except the B. thuringiensis Al Hakam chromosome, which exhibited weak violation of ISFDP with a GCS value of 0.00081. The maximum ATS was found in X. fastidiosa 9a5c, with a value of 0.04727. Out of the five chromosomes with ATS ≥0.01, four violated ISFDP (two strongly and two weakly), whereas Mycoplasma hyopneumoniae J (ATS 0.0102) exhibited no significant violation. Out of the 14 chromosomes with ATS ≥0.005 but <0.01, 6 exhibited no significant violation, 3 exhibited strong violation and 5 exhibited weak violation of ISFDP. Out of the 67 chromosomes with ATS ≥0.001 but <0.005, 57 exhibited parity, 1 violated it strongly and 9 violated it weakly between the W nucleotides. All of the 26 chromosomes with ATS ≤0.001 exhibited no significant violation of ISFDP. These results suggest that chromosomes with high ATS/GCS (≥0.01) have a stronger propensity to violate ISFDP and chromosomes with low ATS/GCS (≤0.001) have a stronger propensity to exhibit ISFDP. However, chromosomes with intermediate ATS/GCS (≥0.001 and ≤0.01) may either exhibit or violate the parity.
Correlation analysis was done between ATS and the P-values from the KS test between W nucleotides, as well as between GCS and the P-values from the KS test between S nucleotides. The r-values are −0.5572 and −0.4526 for W and S nucleotides, respectively. This suggests that the correlation between the two ISP features is weak. The correlation between ATS and GCS is 0.629, which suggests that parity violation between S nucleotides weakly correlates with parity violation between W nucleotides within a chromosome. Unlike ATS and GCS, no correlation was found between the KS test P-values of W nucleotides and those of S nucleotides, which supports the view that ISFDP and Chargaff's second parity are not the same. In the archaeal chromosomes, the ISFDP analysis gave results similar to those of the bacterial chromosomes. The maximum GCS, 0.03768, was found in the chromosome of M. smithii ATCC 35061 (31.02%), followed by 0.02726 in S. marinus F1 (35.71%); significant ISFDP violation was also observed in both. In the interval 0.005 < GCS ≤ 0.01, there were eight chromosomes, of which five exhibited weak violation and three exhibited no significant violation of ISFDP. Out of the 24 chromosomes in the interval 0.001 < GCS ≤ 0.005, 2 exhibited weak violation and 22 exhibited no significant violation of ISFDP. These results suggest that chromosomes with high ATS/GCS (≥0.01) are most likely to violate ISFDP and chromosomes with low ATS/GCS (≤0.001) are most likely to exhibit ISFDP. However, chromosomes with intermediate ATS/GCS (≥0.001 and ≤0.01) may either exhibit or violate the parity. Pearson's correlation coefficient between ATS and GCS was found to be 0.707847, which is similar to that of the bacterial analysis.
The r-values between ATS and the P-values of KS (W) and between GCS and the P-values of KS (S) were found to be −0.57495 and −0.47557, respectively, suggesting a weak correlation.
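The quantities compared in this section can be computed as in the sketch below: ATS and GCS follow the definitions |(∑A − ∑T)|/(∑A + ∑T) and |(∑G − ∑C)|/(∑G + ∑C), and Pearson's r is computed from its textbook formula. The function names are hypothetical, and the lists passed to `pearson_r` would hold one value per chromosome (for example, ATS values paired with KS (W) P-values).

```python
import math

def relative_skew(seq, x, y):
    """|count(x) - count(y)| / (count(x) + count(y)) over a whole sequence."""
    seq = seq.upper()
    nx, ny = seq.count(x), seq.count(y)
    return abs(nx - ny) / (nx + ny) if (nx + ny) else 0.0

def ats(seq):
    """Relative absolute abundance difference between A and T."""
    return relative_skew(seq, "A", "T")

def gcs(seq):
    """Relative absolute abundance difference between G and C."""
    return relative_skew(seq, "G", "C")

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equally long lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)
```

A negative r between ATS (or GCS) and the KS P-values, as reported above, means that larger abundance differences tend to go with smaller P-values, that is, with stronger ISFDP violation.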

The chromosomes with asymmetric replication topography are more prone to ISFDP violation in bacteria

A bacterial chromosome is a single replicon. Owing to the bidirectional mode of replication, one part of a strand is synthesized as LeS whereas the other part is synthesized as LaS. In most chromosomes, the mutational strand asymmetry causes K nucleotides > M nucleotides in LeS and the reverse (K nucleotides < M nucleotides) in LaS. In an ideal case where the termination site is located symmetrically with respect to the origin of replication, the excess of K nucleotides in LeS will be similar to the excess of M nucleotides in LaS, and the two will therefore cancel each other so that the chromosome exhibits Chargaff's second parity. Potential replication origin and termination sites for different chromosomes, based on ATS, GCS, coding sequence skew, nucleotide skew at the third position of codons and oligonucleotide skew, have been reported[31,32] and reviewed in detail.[33] Out of the 112 bacterial chromosomes analyzed in this study, information regarding the potential sites of origin and termination is available for 56 chromosomes. ISFDP violation between S nucleotides was compared with the angular deviation of the termination site because G > C in LeS is a more universal feature of chromosomes than T > A in LeS. Of the 112 chromosomes, the maximum angular deviation, 71.28°, is reported in B. pertussis Tohama 1. Out of the 14 chromosomes in which ≥20° angular deviation was observed, 12 exhibited violation of ISFDP between S nucleotides. Pseudomonas putida F1 (61.86%), with 36.8°, and C. burnetii RSA 493 (42.66%), with 31.14° angular deviation, exhibited no significant parity violation. Out of the 11 chromosomes with deviation ≥10° but <20°, 4 exhibited ISFDP violation between S nucleotides. Out of the 30 chromosomes with deviation ≥1.0° and ≤10°, 9 exhibited parity violation between S nucleotides. In Chlamydophila abortus S263, with an angular deviation of only 0.569°, parity violation was observed only between S nucleotides.
This study indicates that chromosomes with more asymmetric replication topography are more prone to violate the parity. However, chromosomes with symmetric replication topography were also observed to violate the parity. The correlation coefficients between angular deviation and the GCS and ATS values are 0.474 and 0.357, respectively, suggesting a weak correlation. The correlations between angular deviation and the P-values of S (the KS test between S nucleotides) and of W (the KS test between W nucleotides) are −0.259 and −0.048, respectively. The angular deviation in X. fastidiosa 9a5c is 62.96°, whereas that in Temecula 1 is 6.44°. The difference in the magnitude of ISFDP violation between the strains might be attributed to the chromosome topography. A comparison of the four H. influenzae strains could not be done because this information is not available for all the strains. The Rd KW20 chromosome (which violated ISFDP) has an angular deviation of 46°, which might be an important factor in its violation of ISFDP. Archaeal chromosomes have been reported to have more than one replication origin, like eukaryotic chromosomes. Therefore, replication topography is not applicable to the study of ISFDP violations in these cases.
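The angular deviation of the termination site is not defined in this section. As a rough illustration only, the sketch below assumes it is the angle, on a circular map of the chromosome, between the reported terminus and the position diametrically opposite the origin, so that a perfectly symmetric replication topography gives 0°. The function name and this definition are assumptions, not the method of the cited reports.

```python
def angular_deviation(origin, terminus, length):
    """Degrees by which the terminus departs from the point opposite the origin.

    Assumed definition: positions are coordinates on a circular chromosome of
    `length` nucleotides; 0 means origin and terminus are exactly 180 degrees apart.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    arc = (terminus - origin) % length  # distance from origin to terminus, one direction
    angle = 360.0 * arc / length        # the same distance expressed in degrees
    return abs(180.0 - angle)
```

Under this assumed definition, a terminus at 60% of the way round a chromosome from the origin deviates by 36°, whichever direction is taken as positive.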

Composition of forward- and reverse-encoded sequences within DNA strands might influence the parity

Most regions in prokaryotic chromosomes are composed of coding sequences. The presence of both forward- and reverse-encoded sequences in bacterial chromosomes has been proposed as an explanation for Chargaff's second parity in chromosomes.[8,9] We therefore analyzed only the coding sequences in chromosomes of bacteria and archaea to study ISFDP in two ways (Fig. 2): in Case I, a DNA strand is composed only of forward-encoded sequences, and in Case II, a DNA strand is composed of 50% forward-encoded and 50% reverse-encoded sequences. The result is shown for the E. coli chromosome (Fig. 3A and B). The smooth frequency curves of complementary nucleotides overlap in Fig. 3B, whereas in Fig. 3A they do not. The significance of these overlaps was studied by the KS test, which suggests similarity between the distributions of complementary nucleotides in Case II. Similar results were obtained from the analysis of several bacterial (10) and archaeal (15) chromosomes.
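The two artificial strands can be sketched as below. The function names are hypothetical, and the construction of Case II is an assumption based on the description of joining 50% of the sequence with the rest after reverse complementation: here the concatenated coding sequence is split at its midpoint by nucleotide position, which may cut through a gene, whereas the original procedure (see the Materials and methods section) may have split on gene boundaries.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an upper- or lower-case DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def case_one(coding_seqs):
    """Case I: a strand made only of forward-encoded sequences."""
    return "".join(s.upper() for s in coding_seqs)

def case_two(coding_seqs):
    """Case II: first half forward-encoded, second half reverse-complemented."""
    forward = case_one(coding_seqs)
    half = len(forward) // 2  # assumption: split by position, not by gene
    return forward[:half] + reverse_complement(forward[half:])
```

Either strand can then be cut into 1000-nucleotide segments and analyzed in the same way as a natural chromosome.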
Figure 3

(A and B) Frequency distribution study of nucleotides in coding sequences. Smooth curves present the group-frequency distribution of the four nucleotides a (square), t (asterisk), g (triangle) and c (rhombus). The X-axis represents the abundance values of the nucleotide spanning a range, whereas the Y-axis represents the frequency of the abundance values. In (A), the frequency of the nucleotides in a DNA strand composed only of forward-encoded sequences of E. coli is shown (coding sequences analyzed for other chromosomes exhibited a similar feature). It is evident from (A) that the frequency distributions of the complementary nucleotides do not overlap. In (B), the frequency of the nucleotides is shown for the same DNA strand after 50% of the sequence was joined with the rest following reverse complementation (see the Materials and methods section). This resembles a strand composed of 50% forward-encoded sequences and 50% reverse-encoded sequences. It is evident from the figure that parity between the complementary nucleotides is observed in this case. These observations have been confirmed by the KS test.

A comparative analysis between the Ws and Cs in a chromosome with respect to their composition of forward-encoded sequences was done in X. fastidiosa as well as in H. influenzae. The relative differences in the compositional abundance values of forward-encoded sequences in Ws and Cs of the X. fastidiosa 9a5c and X. fastidiosa Temecula 1 chromosomes are 0.078 and 0.015, respectively, which indicates that the proportion of forward- and reverse-encoded sequences in the 9a5c strain is more disproportionate than that in the Temecula 1 strain; this might be the reason for the stronger parity violation in the former. The relative differences in the compositional abundance values of forward-encoded sequences in Ws and Cs of the H. influenzae 86-028NP (exhibits parity) and H. influenzae Rd KW20 (violates parity) chromosomes are 0.030 and 0.005, respectively, which suggests that the proportion of forward- and reverse-encoded sequences in the 86-028NP strain is more disproportionate than that in the Rd KW20 strain. This is in contrast to the result for X. fastidiosa: parity violation is observed in the strain (Rd KW20) with the more proportionate gene distribution between Ws and Cs, whereas no significant parity violation is observed in the chromosome with the disproportionate gene distribution between the strands. A quantitative estimation of the coding sequences in both strands of the chromosomes was done in a few other bacteria and archaea, namely A. tumefaciens, B. subtilis, E. coli, M. smithii and S. marinus (Fig. 4). The maximum difference in ORF numbers between Ws and Cs was found in S. marinus, in which parity violation was also observed. However, the relative difference in ORFs between the strands is larger in B. subtilis than in M. smithii, yet the former exhibited the parity whereas the latter violated it. Agrobacterium tumefaciens was shown to possess the minimum relative difference in ORF numbers between the strands but violates parity. These results indicate that a more disproportionate composition of forward- and reverse-encoded sequences within a strand gives a greater propensity for parity violation. However, a proportionate composition of the sequences does not necessarily imply the exhibition of parity.
Figure 4

Relative disproportionate composition of ORFs between Ws and Cs in chromosomes. The composition of ORFs in Ws and Cs of seven bacteria and two archaea was studied. The relative disproportionate composition was obtained by taking the difference in ORF numbers between the two strands and dividing it by the total number of ORFs present in both strands. In A. tumefaciens, the relative disproportionate value was found to be the minimum, suggesting that the difference in the number of ORFs between the strands is relatively small compared with the others. In the archaeon S. marinus, the value was found to be the maximum among these nine strains. Both A. tumefaciens and S. marinus exhibited ISFDP violations, whereas no significant ISFDP violation was observed in E. coli and B. subtilis. A comparison between the strains of X. fastidiosa and H. influenzae is shown.
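The measure described in this caption amounts to a one-line calculation. The sketch below uses a hypothetical function name and assumes the absolute difference is intended, so that the value does not depend on which strand is labeled Ws.

```python
def relative_orf_disproportion(n_watson, n_crick):
    """|ORFs on Ws - ORFs on Cs| divided by the total number of ORFs on both strands."""
    total = n_watson + n_crick
    if total == 0:
        raise ValueError("no ORFs on either strand")
    return abs(n_watson - n_crick) / total
```

A value of 0 means that the ORFs are shared equally between the two strands, and a value of 1 means that all ORFs lie on one strand.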


Discussion

We have described in this study a new ISP feature in chromosomes, which is found in bacteria, archaea and eukaryotes. The methodology used to study this parity gives the statistical significance of the similarity between the two distributions of complementary nucleotides/oligonucleotides. The basic qualitative feature of ISFDP does not change for a chromosome even when the segmentation is started at a point chosen randomly from the first 1000 nucleotides. In other words, sampling fluctuation does not affect the feature. The correlation between ISFDP and ISP is not strong, which is in accordance with the view that similarity in the total abundance values of two complementary nucleotides will not always yield similarity in their frequency distribution patterns. However, violation of ISP will definitely exhibit violation of ISFDP. Around 50% of the chromosomes in bacteria were found to exhibit ISFDP violations. Chromosomes of H. influenzae Rd KW20, M. tuberculosis F11, etc., which have been reported to exhibit ISP, were found to violate ISFDP.[27] ISFDP violation was observed in all possible combinations in chromosomes: (i) violation of parity between S nucleotides as well as between W nucleotides; (ii) violation only between S nucleotides and (iii) violation only between W nucleotides. The correlation between ATS and GCS was found to be not strong, suggesting that parity violation between S nucleotides is not necessarily associated with parity violation between W nucleotides and vice versa. This can be called intra-chromosomal parity violation. ISFDP violations of different magnitudes were found among chromosomes of different strains belonging to a species, which can be referred to as intra-species parity violations. Examples are C. burnetii, H. influenzae and X. fastidiosa. These intra-chromosomal and intra-species violations suggest that there may not be any strict rule in cells to maintain ISFDP in chromosomes.
Differential ISP among chromosomes within a species and between chromosomes within a bacterium has already been reported in Chlamydophila pneumoniae strains and Deinococcus radiodurans R1 chromosomes,[27] respectively. However, these differences were not considered significant in that study owing to the lack of statistical proof. Oligonucleotide skew patterns have also been found to vary among strains of Yersinia pestis. These intra-species variations in chromosomal features are interesting and call for in-depth analysis of the genome sequences, which might reveal the reasons for ISP/ISFDP violation in chromosomes and for the differences between the two ISP features. Enrichment of LeS with K nucleotides over M nucleotides, and vice versa in LaS, owing to mutational strand asymmetry is a general observation in chromosomes. Owing to bidirectional replication, GCS/ATS in LeS is cancelled by GCS/ATS in LaS, which results in the establishment of parity in chromosomes. The cancellation effect indirectly suggests that the compositional abundance values of the two complementary nucleotides can be similar over the whole chromosome even though they differ within a sub-chromosomal region. This supports the observation here that chromosomes with higher GCS/ATS values violate ISFDP and chromosomes with lower GCS/ATS exhibit the parity. However, chromosomes with intermediate GCS/ATS are found both to exhibit and to violate parity, and this violation is independent of genome GC%. For example, the chromosomes of Streptococcus mutans UA159, Rickettsia conorii Malish 7, C. jejuni subsp. jejuni 81116, Campylobacter concisus 13826, Lactococcus lactis subsp. cremoris MG1363 and Helicobacter pylori J99 (all AT-rich organisms) have GCS≥0.005 and exhibit ISFDP between S nucleotides, whereas chromosomes of B. anthracis strains (AT rich) with similar GCS (>0.005) violate ISFDP between S nucleotides. ISFDP in these chromosomes is therefore an interesting subject for future research.
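The cancellation effect can be made concrete with a toy strand. The sequences are hypothetical, and the signed skew (G - C)/(G + C) is an assumed definition that may differ from the GCS measure used in this study.

```python
def gc_skew(seq):
    """Signed GC skew (G - C) / (G + C); 0.0 if the sequence has no G or C."""
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if g + c else 0.0


# Toy strand: the part replicated as LeS is G-rich, the part replicated as LaS is C-rich.
leading_part = "GGGCAT" * 100
lagging_part = "CCCGAT" * 100
strand = leading_part + lagging_part

print(gc_skew(leading_part))  # 0.5
print(gc_skew(lagging_part))  # -0.5
print(gc_skew(strand))        # 0.0
```

The two halves carry equal and opposite skews, so the skew of the whole strand is zero even though G and C are unequal within every sub-region.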
In concordance with the view that bidirectional replication establishes parity in chromosomes, several chromosomes with more asymmetric replication topography were found to violate ISFDP. The exceptions are the P. putida F1 and C. burnetii RSA 493 chromosomes, with angular deviations of 36° and 31°, respectively. Chromosomes of C. abortus S263 and Magnetospirillum magneticum AMB-1, with very small angular deviations of 0.57° and 2.14°, respectively, are found to violate ISFDP. This indicates that features apart from replication topography might contribute to the establishment of parity in chromosomes. Although proportionate composition of forward-encoded sequences between the two strands was thought to be responsible for establishing the parity after the analysis of artificially constructed chromosomes, several observations went against it. The extreme case is A. tumefaciens, where the composition is very much proportionate but violations of ISFDP are strong. So the two factors that were assumed to play important roles in determining ISFDP violations, namely asymmetric replication topography and disproportionate composition of forward-encoded sequences between the strands, were found to be insufficient. In spite of the different selection/mutation pressures on chromosomes, as exemplified by codon usage,[34] replication topography,[31] isochores[35] and GCS/ATS,[21] the tendency of chromosomes of all types toward maintaining ISFDP is interesting. Since ISFDP and ISP are both outcomes of the compositional abundance of nucleotides (mono/oligo), theories proposed for ISP might hold true for ISFDP. The Nussinov–Forsdyke hypothesis, that stem–loop potential has an adaptive advantage and is therefore an important factor driving the compositional symmetry (ISP) between complementary oligonucleotides,[36,37] has been challenged recently by Chen and Zhao[38] for human chromosomes.
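Why asymmetric replication topography would be expected to disturb the parity can be seen in a toy model, sketched here under strong assumptions: the strand is an idealized, noise-free repeat, the window size matches the repeat unit, and exact equality of the frequency tables stands in for a statistical test. All names and sequences are hypothetical.

```python
from collections import Counter


def window_count_distribution(seq, base, size):
    """Frequency table of `base` counts over complete non-overlapping windows."""
    counts = (seq[i:i + size].count(base)
              for i in range(0, len(seq) - size + 1, size))
    return Counter(counts)


def toy_strand(leading_units, lagging_units):
    """Hypothetical strand: a G-rich LeS part followed by a C-rich LaS part."""
    return "GGGCAT" * leading_units + "CCCGAT" * lagging_units


symmetric = toy_strand(50, 50)   # LeS and LaS parts of equal length
asymmetric = toy_strand(80, 20)  # LeS part four times longer than the LaS part

for name, seq in (("symmetric", symmetric), ("asymmetric", asymmetric)):
    g = window_count_distribution(seq, "G", 6)
    c = window_count_distribution(seq, "C", 6)
    print(name, "G and C distributions identical:", g == c)
```

In this model the symmetric strand gives identical G and C distributions and the asymmetric strand does not. Real chromosomes are far noisier, and the exceptions listed above show that topography alone does not decide the outcome.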
This indicates that the stem–loop (recombination) hypothesis might not be the only explanation for ISP in chromosomes. Baisnée et al.[8] have argued that reverse complement symmetry does not result only from point mutation or from recombination, but from the combined effect of different mechanisms at different orders. Two independent reports have theoretically shown that multiple inversion events in chromosomes can establish ISP.[10,39] Though this hypothesis looks fine theoretically, frequent inversion is unable to explain the universal observation of opposite GCS/ATS in LeS and LaS,[40] gene distribution asymmetry between the strands[41] and the maintenance of gene order among different bacterial chromosomes.[42] This hypothesis also does not describe any functional significance/advantage of the ISP/ISFDP feature, which is so widespread in chromosomes. Theoretically, it has also been argued that the mismatch error repair system is responsible for establishing Chargaff's second parity rule in chromosomes.[13] However, the intra-chromosomal parity violation observed in eukaryotes (this study) goes against this hypothesis. We think that the important factor determining ISP/ISFDP in chromosomes is bidirectional replication. This makes one part of a strand (Ws/Cs) LeS and the other part LaS. The strand mutational asymmetry and gene distribution asymmetry between LeS and LaS therefore cancel each other out within the strand to exhibit the parity. In the case of ssDNA/ssRNA viruses, gene distribution is restricted to one strand only, depending on which they are called either plus- or minus-strand viruses. The genome size is also not large (<10 kb) in these phages[43,44] and, during replication, only one strand acts as the template on which the other strand is made. Most likely, these features are responsible for the violation of parity in these genomes.
The advantages of bidirectional replication in bacteria and archaea, where the nucleus is absent, are as follows: (i) quicker completion of replication than with the unidirectional mode and (ii) the meeting of the two replication forks might send a signal to the cell that chromosome replication is complete. Symmetric replication topography will help replication started at the origin to terminate in less time than with an asymmetric topography. So the selection pressure to maintain symmetric replication topography is likely to be greater in fast-growing bacteria than in slow-growing bacteria. This proposition is similar to the Selection Mutation Drift theory proposed for codon usage[45] in bacteria. However, our study of ISFDP in Vibrio species (generation time 0.2–0.3 h; fast-growing) does not support this proposition, because their chromosomes violate ISFDP between W nucleotides. Moreover, comparison of generation time[40] with asymmetry in replication topography of chromosomes[32] shows no correlation (data not shown). More research on this aspect will show conclusively whether growth rate has any relation to parity establishment in chromosomes. In conclusion, our study has revealed an interesting aspect of ISP. Future research will reveal the reason for the presence of this parity in chromosomes.

Supplementary data

Supplementary data are available at www.dnaresearch.oxfordjournals.org.
  39 in total

1.  Translation-coupled violation of Parity Rule 2 in human genes is not the cause of heterogeneity of the DNA G+C content of third codon position.

Authors:  N Sueoka
Journal:  Gene       Date:  1999-09-30       Impact factor: 3.688

2.  Gene essentiality determines chromosome organisation in bacteria.

Authors:  Eduardo P C Rocha; Antoine Danchin
Journal:  Nucleic Acids Res       Date:  2003-11-15       Impact factor: 16.971

3.  Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid.

Authors:  J D WATSON; F H CRICK
Journal:  Nature       Date:  1953-04-25       Impact factor: 49.962

Review 4.  Chargaff's legacy.

Authors:  D R Forsdyke; J R Mortimer
Journal:  Gene       Date:  2000-12-30       Impact factor: 3.688

5.  Negative correlation between compositional symmetries and local recombination rates.

Authors:  Liang Chen; Hongyu Zhao
Journal:  Bioinformatics       Date:  2005-08-30       Impact factor: 6.937

6.  Deviations from Chargaff's second parity rule in organellar DNA Insights into the evolution of organellar genomes.

Authors:  Christoforos Nikolaou; Yannis Almirantis
Journal:  Gene       Date:  2006-06-28       Impact factor: 3.688

Review 7.  Identification of replication origins in prokaryotic genomes.

Authors:  Natalia V Sernova; Mikhail S Gelfand
Journal:  Brief Bioinform       Date:  2008-07-26       Impact factor: 11.622

8.  Structure and function of nucleic acids as cell constituents.

Authors:  E CHARGAFF
Journal:  Fed Proc       Date:  1951-09

9.  Strong doublet preferences in nucleotide sequences and DNA geometry.

Authors:  R Nussinov
Journal:  J Mol Evol       Date:  1984       Impact factor: 2.395

10.  Base composition skews, replication orientation, and gene orientation in 12 prokaryote genomes.

Authors:  M J McLean; K H Wolfe; K M Devine
Journal:  J Mol Evol       Date:  1998-12       Impact factor: 2.395

