High order intra-strand partial symmetry increases with organismal complexity in animal evolution.

Shengqin Wang¹, Jing Tu¹, Zhongwei Jia², Zuhong Lu³.

Abstract

For a sufficiently long genomic sequence, the frequency of any short nucleotide fragment on one strand is approximately equal to the frequency of its reverse complement on the same strand. Although this phenomenon has been studied for over two decades, the precise mechanism involved has not yet been made clear. In this study, we calculated the high order intra-strand partial symmetry (IPS) for 14 animal species by using a fixed sliding window method to scan each genome sequence. The study showed that the IPS was positively associated with organismal complexity, measured by the number of distinct cell types. The results indicated that the IPS might result from the increase in functional non-coding DNA, and that it plays an important role in the evolution of complex body plans.

Year: 2014 PMID: 25263801 PMCID: PMC4178289 DOI: 10.1038/srep06400

Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379

It is well known that, within a single strand of DNA, the frequency of each nucleotide is very similar to the frequency of its complement (%A = %T and %C = %G) in most complete genome sequences. This intra-strand compositional symmetry is also called Chargaff's second parity rule, and was discovered in the late 1960s [1]. As the number of deposited genome sequences has grown, such symmetry has commonly been found in sequences longer than 50 Kb [2]. Interestingly, extensions of Chargaff's second parity rule reveal that the symmetry is also strongly supported for di-, tri- and higher order N-mer oligonucleotides in sufficiently long genomic sequences [2,3]. A large number of published papers have described intra-strand symmetry in the genomic era, but the origin of this universal phenomenon is still controversial; proposed explanations include point mutation [4], local recombination rates [5], and inversions and inverted transpositions [6,7]. Stem-loop structure is another proposed factor [8,9], though its contribution is limited [10]. Moreover, the symmetry has been described as an intrinsic property of genome evolution, established by the cumulative effects of a number of mechanisms acting at multiple orders and length scales [11,12]. Notably, fragments in some genome sequences do not obey intra-strand symmetry [13,14]. In addition, the two replicating strands of most genomes, even mammalian ones, differ in nucleotide composition, which affects intra-strand symmetry in local regions [15,16]. In particular, intra-strand symmetry for higher order oligonucleotides drops sharply as the order N increases [5]. It is also known that organismal complexity, measured by the number of different cell types, has increased greatly during the course of evolution.
To determine whether there is a relationship between the intra-strand partial symmetry (IPS) of a genome and organismal complexity, we investigated IPS values for high order N-mer oligonucleotides in the available animal genomes using a fixed sliding window approach. The growing number of complete genome sequences provides an opportunity to analyze the global properties of this kind of symmetry. We found that high order IPS is significantly correlated with organismal complexity, and that the increase in functional non-coding DNA might be one of the reasons.

Results

Here, we measured intra-strand symmetry at several high orders, and found that the ability to discriminate IPS values among organisms is maximized by using 9-mer oligonucleotides (N = 9) (Figure 1). For a given sequence, the IPS values of N-mer oligonucleotides range from 0 to 1 and usually drop sharply as the order N increases [11]. To maximize the ability to detect differences in IPS values, we calculated the standard deviation of the IPS values at each oligonucleotide order and took the order with the largest spread as the optimal parameter (Figure 1).
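The choice of N described above can be sketched as follows. This is a minimal illustration, assuming the per-organism average IPS values have already been computed for each candidate order. The function name `pick_order`, the use of the sample standard deviation, and the numbers in the usage note are assumptions made for illustration, not details or values from this study.

```python
from statistics import stdev

def pick_order(ips_by_order):
    """Return (best_n, spread).

    ips_by_order maps each candidate order N (int) to a list of average
    IPS values, one per organism (at least two organisms per order).
    spread maps each N to the sample standard deviation of those values;
    best_n is the order with the largest spread.
    """
    spread = {n: stdev(values) for n, values in ips_by_order.items()}
    best_n = max(spread, key=spread.get)
    return best_n, spread
```

With invented inputs such as `{6: [0.80, 0.81, 0.82], 9: [0.20, 0.25, 0.30], 12: [0.05, 0.06, 0.07]}`, the order 9 would be picked because its values are the most spread out across organisms.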

Figure 1

Plot of standard deviation of the IPS values under different N-mer nucleotides.

9-mer was selected to maximize the divergence among organisms in this study.

Fourteen organisms, all with organismal complexity greater than 64, were selected by three filter steps that required more than 10 fragments to remain. Although there are many sequenced genomes from organisms with low organismal complexity, they contain a large number of degenerate nucleotides within contiguous 50 Kb sliding windows. These degenerate nucleotides could artificially increase the degree of symmetry if fuzzy matching were allowed. Therefore, we simply discarded all fragments containing degenerate nucleotides. We built a simple linear regression model to examine the relationship between the IPS value and organismal complexity. Because the GC content of a genomic fragment might influence its IPS value, the comparison was made within given GC ranges, and the average IPS values increase with organismal complexity in each range. Figure 2 shows the relationship between average IPS and organismal complexity in 0.01 intervals of GC content from 0.4 to 0.51, as well as for the combined range of 0.4 to 0.6. The average IPS values range from 0.21 to 0.29, and all of the correlations with organismal complexity are strong (Figure 2). We also calculated IPS values for other N-mer oligonucleotides: the 7-, 8- and 10-mers also give significant correlations between the IPS value and organismal complexity, whereas the 6-mer does not (Supplementary Figure S1). One reason for using fragments with GC content between 0.4 and 0.6 is that such sequences usually contain functional elements; for instance, more than 75% of UCSC human genes have a GC content within this range.
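The regression step can be illustrated with a short ordinary least squares fit. This is a sketch only: the study used R for its statistics, the function name `linear_fit` is hypothetical, the data points are invented placeholders rather than the published values, and no significance test is computed here.

```python
def linear_fit(x, y):
    """Ordinary least squares fit of y = slope * x + intercept.

    Returns (slope, intercept, r_squared). Assumes len(x) == len(y) >= 2
    and that neither x nor y is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared

# Invented placeholder data: number of cell types per organism and the
# average 9-mer IPS of its fragments within one GC interval.
cell_types = [70, 100, 120, 150, 170]
avg_ips = [0.21, 0.23, 0.24, 0.27, 0.28]
slope, intercept, r2 = linear_fit(cell_types, avg_ips)
```

Repeating the fit once per GC interval, with one point per organism, would give one panel of the kind shown in Figure 2.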

Figure 2

The average value of 9-mer order intra-strand symmetry is positively correlated with organismal complexity (measured by number of different cell types).

Scatterplots were generated using the ggplot2 R package [33]. The top of each scatterplot shows the GC content of the fragments. The R-squared and significance are marked at the top-right corner of each scatterplot (P value: *P<0.05, **P<0.01, ***P<0.001). Fourteen organisms were used in this study: Aga (Anopheles gambiae), Bta (Bos taurus), Cfa (Canis familiaris), Cin (Ciona intestinalis), Dme (Drosophila melanogaster), Dre (Danio rerio), Gga (Gallus gallus), Hsa (Homo sapiens), Mmu (Mus musculus), Ptr (Pan troglodytes), Rno (Rattus norvegicus), Tni (Tetraodon nigroviridis), Tru (Takifugu rubripes), and Xtr (Xenopus tropicalis).

To probe the functional role of this kind of symmetry, average 9-mer IPS values were compared with the density of UCSC genes across chromosomes in the well-annotated human genome. Interestingly, we found that the average IPS values, which range from 0.269 to 0.289, differ among chromosomes and are significantly correlated with chromosome gene density (Figure 3). High IPS could help produce compact structures, which might protect the genome from degradation by folding into stable secondary structures. It should also be noted that UCSC genes contain many functional non-coding regions, such as UTRs and introns. Compared with coding DNA, these functional non-coding regions are expected to be subject to more evolutionary pressure.

Figure 3

Positive correlation between high order intra-strand symmetry (9-mer) and density of UCSC genes in human chromosomes.

The GC content of the fragments is in the range from 0.4 to 0.6.

Non-coding DNA differs from coding DNA in that it is not constrained by codon choice [17]. It should therefore accommodate symmetry more readily and have more opportunity to fold into secondary structure. Secondary structure in non-coding regions, such as introns, can be selected to optimize expression of the pre-mRNA by increasing the folding free energy [18,19]. Recent research also shows that folding into secondary structure at the 3′ end, which is associated with longer half-lives, is a major determinant of mRNA stability [20]. To check whether the emergence of functional non-coding DNA is accompanied by an increase in the degree of symmetry, we downloaded coding and functional non-coding regions in FASTA format from the UCSC Genome Browser, based on the human RefSeq annotation [21], and compared their IPS values. Sequences containing degenerate bases were removed. For each dataset, a contiguous stretch of 50 Mb of nucleotide sequence was built from a non-redundant subset of sequences by removing the gene names and joining the sequences, and 9-mer symmetry was then quantified as the length of the contiguous sequence increased. To remove any effect of the order in which genes were pooled, each dataset was randomly shuffled and the calculation was repeated 10 times. Our results show that, as the size of the pool increases, the functional non-coding regions approach a higher 9-mer symmetry than the coding regions (Figure 4), which suggests that the emergence of functional non-coding sequence may be one reason for the increase in the degree of symmetry during animal evolution.
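The shuffle-and-pool procedure can be sketched as below. This is an illustrative outline under stated assumptions, not the authors' Perl or R code. The function name `pooled_curve`, the `score` callable (standing in for a 9-mer IPS function), the fixed random seed, and the standard error taken as the sample standard deviation divided by the square root of the number of repeats are all assumptions.

```python
import random
from statistics import mean, stdev

def pooled_curve(sequences, score, pool_sizes, repeats=10, seed=0):
    """Score growing prefixes of a shuffled, concatenated sequence pool.

    sequences:  list of strings (e.g. coding or non-coding regions with
                FASTA headers already removed)
    score:      function mapping a string to a number
    pool_sizes: prefix lengths (in nucleotides) at which to evaluate score
    Returns {size: (mean_score, standard_error)} over `repeats` shuffles
    (repeats must be at least 2).
    """
    rng = random.Random(seed)
    per_size = {size: [] for size in pool_sizes}
    for _ in range(repeats):
        order = list(sequences)
        rng.shuffle(order)            # remove any gene-order effect
        pooled = "".join(order)       # one contiguous stretch
        for size in pool_sizes:
            per_size[size].append(score(pooled[:size]))
    return {
        size: (mean(values), stdev(values) / len(values) ** 0.5)
        for size, values in per_size.items()
    }
```

Passing a 9-mer IPS function as `score`, with `pool_sizes` growing towards 50 Mb, would produce curves of the kind summarized in Figure 4, one for each dataset.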

Figure 4

Plot shows the mean of high order intra-strand partial symmetry (9-mer) for each pooled dataset.

The mean for each dataset was calculated by subsampling it 10 times, and the standard error of the IPS values for each dataset is shown by the dashed lines.

Discussion

The significant correlation between IPS and organismal complexity suggests a functional implication for such symmetry, perhaps in promoting the structural evolution of functional DNA (Figure 2, Supplementary Figure S1). It is known that genomes change in both size and structure over long-term evolution [22]. These changes can be explained by duplication, insertion, transposable elements, recombination, mutation and other events. Variants that correlate with phenotype accumulate mutations, unfettered by restraining selective forces, before being fixed in the genome. In other words, functional genome structures can be generated by random genetic drift and selection pressure during evolution. For example, under the hypothesis of the “kissing” model, stem-loop structure contributes to the initiation of meiotic recombination [9], so variation that increases symmetry will tend to be fixed over long-term evolution. As shown in Figure 2, the average IPS values of Homo sapiens and Pan troglodytes are quite similar at each GC content, suggesting that differences in nucleotide composition between the two replicating strands make only a limited contribution to the IPS. One important consequence of increasing symmetry is the enrichment of homogeneous features in the sequence, which should favor the formation of primary, secondary and tertiary structure in both DNA and RNA. Over the course of evolution, the complexity of an organism is not well related to its total number of genes, as described by the G-value paradox [23]. The regulatory potential of coding DNA is not sufficient to explain the evolution of complex organisms.
In contrast to the limited diversity of proteins across phylogeny, functional non-coding DNA, which has much greater chemical versatility and interacts easily with other molecules, can take on new roles that dramatically extend the regulatory framework of complex organisms [24,25]. Organismal complexity can be attributed to the expansion of functional non-coding genome sequence, at least to a first approximation [26,27]. For example, the length of the 3′ and 5′ untranslated regions has expanded with the evolution of complex organisms, particularly in animals, suggesting an increase in cis-acting regulatory sequences that control translation and mRNA stability [28,29]. Many lines of evidence indicate that the emergence of functional non-coding DNA can increase organismal complexity. Our results show that the IPS is well correlated with the increase in functional non-coding DNA, providing clues to the evolution of organismal complexity (Figures 3 and 4). It has also been shown that the upstream region of coding DNA promotes symmetry better than the coding region [11]. Since functional non-coding DNA is not constrained by codon choice, variations in it could be retained more easily, increasing secondary structure and yielding higher gene expression levels. Given stable secondary structure, functional DNA could not only fold into regular secondary structure that protects it from degradation, but also form a large number of complex conformations for further biological functions. This process can be driven by genome duplication events [30]. To our knowledge, this is the first report that high order IPS is strongly correlated with organismal complexity in animal evolution. More importantly, we showed that functional non-coding DNA has a higher IPS value than coding DNA.
The genomes of complex organisms usually contain abundant functional non-coding DNA, which agrees with the conclusion that highly complex organisms have high IPS values. During evolution, organisms produce new cell types and expand their post-transcriptional regulation. Therefore, the increase in functional non-coding DNA can be associated with the increase in compositional symmetry, which might help explain the increased regulation of gene transcription and post-transcription in complex organisms.

Methods

Fourteen animals were used in this study, including six mammalian genomes: Homo sapiens (version hg19), Pan troglodytes (panTro4), Mus musculus (mm10), Rattus norvegicus (rn5), Bos taurus (bosTau7) and Canis familiaris (canFam3); five vertebrate genomes: Gallus gallus (galGal4), Xenopus tropicalis (xenTro3), Takifugu rubripes (fr3), Tetraodon nigroviridis (tetNig2) and Danio rerio (danRer7); two insect genomes: Drosophila melanogaster (dm3) and Anopheles gambiae (anoGam1); and one deuterostome genome: Ciona intestinalis (ci2). For each organism, we extracted the complete genome sequence from the UCSC Genome Browser [21]. Organismal complexity, measured by the number of distinct cell types, was taken from a recent study [31]. We used Perl-based bioinformatics scripts and the R language for the statistical analysis. To remove the effect of sequence length, we employed a sliding window approach that divides each downloaded chromosome sequence into non-overlapping 50 Kb fragments, starting from the first nucleotide of each sequence. Sequences of this length have commonly been found to obey Chargaff's second parity rule [2]. For each fragment, the high order IPS was calculated using the measure defined by a previously developed algorithm [11], in which $f_i$ is the frequency of the $i$-th N-mer oligonucleotide in the fragment and $f'_i$ is the frequency of its reverse complement in the same fragment. For palindromic oligonucleotides such as “GAATTC”, which read the same as their reverse complement, we simply split the count into two equal parts, so that $f_i$ is equal to $f'_i$. In this way the IPS of N-mer oligonucleotides was determined for each fragment. To decrease the impact of other factors, such as local recombination [5] and the highly repetitive DNA in the Y chromosome sequence [32], we performed three filter steps. Firstly, we discarded fragments containing degenerate bases, which make $f_i$ and $f'_i$ appear more similar.
Secondly, low-complexity sequence tends to show intra-strand symmetry, so we discarded fragments with extreme GC content (<0.4 or >0.6) to reduce putative false positives. Thirdly, we used the interquartile range (IQR) to filter outliers. It is defined as IQR = Q3 - Q1, where Q1 and Q3 are the first quartile (25th percentile) and the third quartile (75th percentile), respectively. Any score below the lower fence (Q1 - 1.5 * IQR) or above the upper fence (Q3 + 1.5 * IQR) was considered an outlier. Finally, we kept only organisms with more than 10 fragments within the defined GC content range across all chromosomes after removing the outliers.
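The windowing, scoring and filtering steps above can be sketched in Python as follows. This is a hedged illustration rather than the authors' Perl/R implementation. In particular, the index computed by `ips` is an assumed form, 1 - Σ|f - f'| / Σ(f + f') taken over reverse-complement pairs, chosen only because it ranges from 0 to 1 and uses the two frequencies described in the text; the exact expression of the cited algorithm may differ. Dropping a trailing fragment shorter than the window, uppercasing soft-masked bases, and the "inclusive" quartile method are likewise assumptions, and all function names are hypothetical.

```python
from collections import Counter
from statistics import quantiles

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(kmer):
    return kmer.translate(_COMPLEMENT)[::-1]

def windows(seq, size=50_000):
    """Yield non-overlapping fragments starting at the first nucleotide.
    A trailing piece shorter than `size` is dropped (assumption)."""
    for start in range(0, len(seq) - size + 1, size):
        yield seq[start:start + size]

def gc_content(fragment):
    return (fragment.count("G") + fragment.count("C")) / len(fragment)

def ips(fragment, n=9):
    """ASSUMED symmetry index: 1 - sum|f - f'| / sum(f + f'), summed once
    per reverse-complement pair. Palindromic N-mers are split into two
    equal halves, so they add nothing to the numerator."""
    counts = Counter(fragment[i:i + n] for i in range(len(fragment) - n + 1))
    diff = total = 0.0
    seen = set()
    for kmer, count in counts.items():
        if kmer in seen:
            continue
        rc = reverse_complement(kmer)
        seen.update((kmer, rc))
        if kmer == rc:
            f, f_rc = count / 2, count / 2
        else:
            f, f_rc = count, counts.get(rc, 0)
        diff += abs(f - f_rc)
        total += f + f_rc
    return 1.0 - diff / total if total else 0.0

def iqr_filter(values):
    """Keep values inside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if low <= v <= high]

def fragment_ips(seq, size=50_000, n=9, gc_range=(0.4, 0.6)):
    """Apply the three filter steps and return the retained IPS scores."""
    scores = []
    for frag in windows(seq.upper(), size):   # uppercasing is an assumption
        if set(frag) - set("ACGT"):           # step 1: degenerate bases
            continue
        if not gc_range[0] <= gc_content(frag) <= gc_range[1]:  # step 2: GC
            continue
        scores.append(ips(frag, n))
    return iqr_filter(scores) if len(scores) >= 2 else scores   # step 3: IQR
```

An organism would then be retained only if more than 10 scores remain after calling `fragment_ips` on each chromosome and pooling the results.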

Author Contributions

Z.L. and S.W. conceived of the study and wrote the main manuscript text. S.W. and J.T. performed the data analysis. Z.L., Z.J. and S.W. advised and revised the manuscript text. All authors reviewed the manuscript.

31 in total

1. The g-value paradox.

Authors: Matthew W Hahn; Gregory A Wray
Journal: Evol Dev Date: 2002 Mar-Apr Impact factor: 1.930

Review 2. Chargaff's legacy.

Authors: D R Forsdyke; J R Mortimer
Journal: Gene Date: 2000-12-30 Impact factor: 3.688

Review 3. Eukaryotic regulatory RNAs: an answer to the 'genome complexity' conundrum.

Authors: Kannanganattu V Prasanth; David L Spector
Journal: Genes Dev Date: 2007-01-01 Impact factor: 11.361

4. A meta-analysis of the genomic and transcriptomic composition of complex life.

Authors: Ganqiang Liu; John S Mattick; Ryan J Taft
Journal: Cell Cycle Date: 2013-06-06 Impact factor: 4.534

5. Causes and effects of N-terminal codon bias in bacterial genes.

Authors: Daniel B Goodman; George M Church; Sriram Kosuri
Journal: Science Date: 2013-09-26 Impact factor: 47.728

6. Limited contribution of stem-loop potential to symmetry of single-stranded genomic DNA.

Authors: Shang-Hong Zhang; Ya-Zhi Huang
Journal: Bioinformatics Date: 2009-12-22 Impact factor: 6.937

7. Lengthening of 3'UTR increases with morphological complexity in animal evolution.

Authors: Cho-Yi Chen; Shui-Tein Chen; Hsueh-Fen Juan; Hsuan-Cheng Huang
Journal: Bioinformatics Date: 2012-10-18 Impact factor: 6.937

8. Sequence finishing and mapping of Drosophila melanogaster heterochromatin.

Authors: Roger A Hoskins; Joseph W Carlson; Cameron Kennedy; David Acevedo; Martha Evans-Holm; Erwin Frise; Kenneth H Wan; Soo Park; Maria Mendez-Lago; Fabrizio Rossi; Alfredo Villasante; Patrizio Dimitri; Gary H Karpen; Susan E Celniker
Journal: Science Date: 2007-06-15 Impact factor: 47.728

9. A study in entire chromosomes of violations of the intra-strand parity of complementary nucleotides (Chargaff's second parity rule).

Authors: B R Powdel; Siddhartha Sankar Satapathy; Aditya Kumar; Pankaj Kumar Jha; Alak Kumar Buragohain; Munindra Borah; Suvendra Kumar Ray
Journal: DNA Res Date: 2009-10-27 Impact factor: 4.458

10. Selection on codon bias in yeast: a transcriptional hypothesis.

Authors: Edoardo Trotta
Journal: Nucleic Acids Res Date: 2013-08-13 Impact factor: 16.971

2 in total

1. Inversion symmetry of DNA k-mer counts: validity and deviations.

Authors: Sagi Shporer; Benny Chor; Saharon Rosset; David Horn
Journal: BMC Genomics Date: 2016-08-31 Impact factor: 3.969

2. Systematic Characteristic Exploration of the Chimeras Generated in Multiple Displacement Amplification through Next Generation Sequencing Data Reanalysis.

Authors: Jing Tu; Jing Guo; Junji Li; Shen Gao; Bei Yao; Zuhong Lu
Journal: PLoS One Date: 2015-10-06 Impact factor: 3.240
