
A Brief Review: The Z-curve Theory and its Application in Genome Analysis.

Abstract

In theoretical physics, there exist two basic mathematical approaches, algebraic and geometrical methods, which, in most cases, are complementary. In the area of genome sequence analysis, however, algebraic approaches have been widely used, while geometrical approaches have been less explored for a long time. The Z-curve theory is a geometrical approach to genome analysis. The Z-curve is a three-dimensional curve that represents a given DNA sequence in the sense that each can be uniquely reconstructed given the other. The Z-curve, therefore, contains all the information that the corresponding DNA sequence carries. The analysis of a DNA sequence can then be performed through studying the corresponding Z-curve. The Z-curve method has found applications in a wide range of areas in the past two decades, including the identification of protein-coding genes, replication origins, horizontally-transferred genomic islands, promoters, translational start sites and isochores, as well as studies on phylogenetics, genome visualization and comparative genomics. Here, we review the progress of Z-curve studies from aspects of both theory and applications in genome analysis.


Keywords: GC profile; Gene finding; Genomic island; Replication origin; Z-curve.

Year: 2014 PMID: 24822026 PMCID: PMC4009844 DOI: 10.2174/1389202915999140328162433

Source DB: PubMed Journal: Curr Genomics ISSN: 1389-2029 Impact factor: 2.236

INTRODUCTION

In theoretical physics, there exist two basic mathematical approaches, algebraic and geometrical methods, which, in most cases, are complementary. In the area of genome studies, however, algebraic approaches, such as Markov chain models and hidden Markov chain models, have been widely used, while geometrical approaches have been less explored for a long time. The Z-curve theory is a geometric approach to genome analysis. The Z-curve is a 3-dimensional curve that represents a given DNA sequence in the sense that each can be uniquely reconstructed given the other [1-3]. The Z-curve, therefore, contains all the information that the corresponding DNA sequence carries. The analysis of a DNA sequence can then be performed through studying the corresponding Z-curve. Historically, various methods for the graphical representation of DNA sequences were proposed, such as the H curve [4] and the 2-dimensional DNA walk [5]. It has been shown that most of these methods are, in fact, special cases of the Z-curve, and an extensive comparison between the Z-curve and other representations was detailed in reference [2]. One of the advantages of the Z-curve is its intuitiveness, enabling global and local compositional features of genomes to be grasped quickly in a perceivable form. The methodology of the Z-curve is a suitable platform on which other methods, such as statistics, can be integrated to address bioinformatics questions. The Z-curve method [1, 2] has found many applications in genome analysis since its initiation two decades ago. Here, we review the progress of the Z-curve studies from aspects of both theory and applications in genome research.

PART-1: THEORY OF THE Z-CURVE

Symmetry of Four DNA Bases and its Geometric Representation

The DNA sequence is composed of 4 kinds of nucleotides, adenine, cytosine, guanine and thymine, denoted by A, C, G and T, respectively. The number of possible combinations when taking 2 bases at a time from 4 bases is 6. The 6 combinations are: R (A/G) and Y (C/T); M (A/C) and K (G/T); W (A/T) and S (G/C), where R, Y, M, K, W and S represent puRine, pYrimidine, aMino, Keto, Weak hydrogen bonds and Strong hydrogen bonds, respectively, according to the NC-IUB recommendation [6]. The chemical structures of the four bases are shown in (Fig. ), illustrating the symmetry among the four bases. According to each of the following three criteria, the four bases can be classified into two categories:
(i) Criterion 1, according to the chemical structure of having single or double rings: pyrimidines Y (C, T) have a single ring and purines R (A, G) have double rings;
(ii) Criterion 2, according to the chemical structure of having an amino or keto group: amino bases M (A, C) and keto bases K (G, T);
(iii) Criterion 3, according to the structure of the double helix forming two or three hydrogen bonds in the Watson-Crick pair: weak bases W (A, T) form two hydrogen bonds and strong bases S (G, C) form three.
We seek a geometrical representation of the above symmetry. If a 2-dimensional (plane) graph is adopted, we find that the symmetry can be represented by (Fig. ). If a 3-dimensional graph is adopted, the regular cube, as shown in (Fig. ), seems to be the unique choice to represent the symmetry. Each face of the cube is assigned to one, and only one, of the six characters R, Y, M, K, W and S, thereby keeping the rule that R and Y, M and K, as well as W and S, are on opposite sides. To prepare (Fig. ), readers may cut (Fig. ) and fold it along the dashed lines. Diagonals of the regular cube form a regular tetrahedron ACGT, as shown in (Fig. ). Assigning one of A, C, G and T to each vertex of the tetrahedron as shown in (Fig. ) is not arbitrary. Note that the vertex A of the tetrahedron is also the vertex of the cube at which three faces of the cube, R, M and W, meet. The only base common to R (A/G), M (A/C) and W (A/T) is A. Similar assignments can be applied to the vertices C, G and T, as shown in (Fig. and ). To further the study, a coordinate system needs to be established (Fig. ). The line connecting the middle point of an edge and that of the opposite edge of the tetrahedron is called the middle line. There are a total of three middle lines in a tetrahedron, crossing at the center O, and they are perpendicular to each other. A Cartesian coordinate system OXYZ can be set up by using the three middle lines, as shown in (Fig. ). Thus, the cube-tetrahedron geometric entity established here correctly reflects the symmetry of the four DNA bases.

The DNA Group

A regular tetrahedron is a geometric entity of high symmetry. All possible rotational motions which map the tetrahedron onto itself form a group, called the tetrahedron group or T-group. As shown in (Fig. ), the tetrahedron group consists of 12 operational elements, which are described below. $I$, i.e., the identity operation; $R_x$, $R_y$ and $R_z$, i.e., the 180° rotations about the x, y and z axes, respectively; $R_A$, $R_C$, $R_G$ and $R_T$, i.e., the 120° rotations about the AO, CO, GO and TO axes, respectively; $R_A^2$, $R_C^2$, $R_G^2$ and $R_T^2$, i.e., the 240° rotations about the AO, CO, GO and TO axes, respectively. A point with coordinates x, y and z will be transformed accordingly under the operational elements of the T-group. For example, under $R_x$ the point $(x, y, z)$ is transformed to $(x, -y, -z)$. The transforms of x, y and z under the 12 operations of the T-group are listed in (Table ). The four elements $I$, $R_x$, $R_y$ and $R_z$ form an invariant subgroup of the T-group, which is isomorphic to the Klein-4 group, or K4 group. The K4 group and its cosets exhaust the T-group. All the 12 elements of the T-group can be divided into four classes, which are $(I)$, $(R_x, R_y, R_z)$, $(R_A, R_C, R_G, R_T)$ and $(R_A^2, R_C^2, R_G^2, R_T^2)$. On the other hand, the set of all possible permutations of four objects forms a symmetric group, denoted by S4. Among the 24 elements of the symmetric group S4, the set of all 12 even permutations forms an invariant subgroup of S4, referred to as the alternating group of degree 4, denoted by A4. The DNA group is defined as a particular A4 group, in which the permuted objects are the four DNA bases A, C, G and T. According to group theory, the T-group and the A4 group are isomorphic to each other. From the perspective of the abstract group, the T-group and the A4 group are the same group, because they have the same group structure and matrix representation. The four bases A, C, G and T are assigned to the four vertices of the tetrahedron, as shown in (Fig. ).
The four characters A, C, G and T will be transformed accordingly under the 12 operational elements of the DNA group or the A4 group. For example, $R_x$ exchanges A with G and C with T, $R_y$ exchanges A with C and G with T, and $R_z$ exchanges A with T and C with G. Biologically, the transform $R_x$ is called transition, whereas the transforms $R_y$ and $R_z$ are called transversion. Here $R_z$ is termed the complementary transform. We have previously established that the T-group and the A4 group are the same group [3], and thus their elements have one-to-one corresponding relations, as shown in (Table ). Both the A4 group and the T-group are called the DNA group, which forms the basis of the Z-curve theory.
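The group structure described above can be checked with a short script. This is a minimal sketch, not taken from the original paper: the names `A4`, `K4`, `compose` and `apply` are illustrative, and permutations are written as index tuples over the fixed order A, C, G, T. It enumerates the 12 even permutations of the four bases and confirms that the identity, the transition and the two transversions (one of them the complementary transform) form a closed four-element subgroup.

```python
from itertools import permutations

BASES = "ACGT"

def is_even(perm):
    """Return True if the permutation (a tuple of indices) has even parity."""
    inversions = sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )
    return inversions % 2 == 0

def compose(p, q):
    """Composition 'q first, then p': result[i] = p[q[i]]."""
    return tuple(p[q[i]] for i in range(len(q)))

def apply(perm, seq):
    """Apply a base permutation to every character of a DNA string."""
    return "".join(BASES[perm[BASES.index(b)]] for b in seq)

# The 12 even permutations of four objects form the alternating group A4.
A4 = [p for p in permutations(range(4)) if is_even(p)]

# Images of (A, C, G, T) under the identity and the three double swaps.
IDENTITY = (0, 1, 2, 3)
TRANSITION = (2, 3, 0, 1)    # A<->G, C<->T
TRANSVERSION = (1, 0, 3, 2)  # A<->C, G<->T
COMPLEMENT = (3, 2, 1, 0)    # A<->T, C<->G
K4 = [IDENTITY, TRANSITION, TRANSVERSION, COMPLEMENT]

# Closure check: the product of any two K4 elements stays in K4.
K4_IS_CLOSED = all(compose(p, q) in K4 for p in K4 for q in K4)
```

For example, `apply(COMPLEMENT, "ACGT")` maps each base to its Watson-Crick partner, and composing the transition with the first transversion gives the complementary transform.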

The Z-transform Formulas

Let the occurrence frequencies of the four bases A, C, G and T in a DNA sequence be denoted by a, c, g and t, respectively. The normalization condition reads $a + c + g + t = 1$ (1), indicating that among the four real numbers a, c, g and t, only three are independent. Suppose that X, Y and Z are the coordinates of a point P in the coordinate system shown in (Fig. ), which can be expressed by linear combinations of the four frequencies a, c, g and t, as follows: $X = a_{11}a + a_{12}c + a_{13}g + a_{14}t$, $Y = b_{11}a + b_{12}c + b_{13}g + b_{14}t$, $Z = c_{11}a + c_{12}c + c_{13}g + c_{14}t$ (2), where $a_{11}, a_{12}, \ldots, c_{13}, c_{14}$ are real coefficients. Eqs. (2) can be re-written in matrix form as eqs. (3). The coordinates of the four vertices of the regular tetrahedron A, C, G and T are already known, and shown in (Table ). Based on the 12 numbers in (Table ), the 12 coefficients can be uniquely determined, which turns eqs. (3) into eqs. (4), or equivalently eqs. (5). Eqs. (4) are called the Z-transform formulas, which were first derived in 1991 in a totally different way [1]. The Z-transform formulas transform the four base frequencies into three coordinates of a point (called a mapping point) in a three-dimensional space. As previously indicated [1], for convenience, we introduce the reduced coordinate system x, y and z, in which the Z-transform formulas read $x = (a + g) - (c + t)$, $y = (a + c) - (g + t)$, $z = (a + t) - (g + c)$, with $x, y, z \in [-1, 1]$ (6). In what follows, we always use the Z-transform formulas based on the reduced coordinate system, eqs. (6), unless otherwise indicated. Equivalently, eqs. (6) can be re-written in matrix form, $(x, y, z)^T = Z\,(a, c, g, t)^T$, where $Z$ is the 3×4 matrix with rows $(1, -1, 1, -1)$, $(1, 1, -1, -1)$ and $(1, -1, -1, 1)$. The reverse equation of eqs. (6) is $a = (1 + x + y + z)/4$, $c = (1 - x + y - z)/4$, $g = (1 + x - y - z)/4$, $t = (1 - x - y + z)/4$. These expressions show that, regardless of the values of x, y and z, $a + c + g + t = 1$. In fact, it was shown in 1991 that the mapping point P (x, y, z), corresponding to a, c, g and t, is always situated within the tetrahedron ACGT shown in (Fig. ) [1]. To provide a clear visualization, the tetrahedron and the mapping points within it are projected onto the coordinate planes. Referring to (Fig. and ), note that the tetrahedron ACGT has four vertices A, C, G and T, and six edges AC, AG, AT, CG, CT and GT.
Interestingly, the projection of the six edges onto any coordinate plane forms a square and its two diagonals, where the projections of the four vertices of the tetrahedron form the four vertices of the square, as shown in (Fig. , and ), for the x-y, x-z and y-z planes, respectively. Note that (Fig. , and ) are in accordance with (Fig. and ), respectively. It should be noted that A, C, G and T are sometimes used to denote DNA bases, while the same symbols can represent vertices of the tetrahedron or squares. Refer to (Fig. ) first. Projections of the four edges AG, AC, CT and GT form the four sides of the square, whereas those of AT and GC form the two diagonals of the square. Based on the Z-transform formulas eqs. (6), the base composition of a DNA sequence, i.e., the values of a, c, g and t, can be visualized by observing the position of the mapping point in the square. For example, if the DNA sequence has only one kind of base, say, A, then a = 1, c = g = t = 0. The corresponding mapping point is situated at the vertex A in (Fig. ). Similar results for (Fig. ) are summarized as follows. Similar deductions for the annotation of (Fig. and ) are left to interested readers.
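As an illustration of the mapping between base frequencies and a point in the reduced coordinate system, the sketch below implements the Z-transform in the form x = (a + g) - (c + t), y = (a + c) - (g + t), z = (a + t) - (g + c), together with its inverse. The function names are illustrative, and the sequence is assumed to contain only the characters A, C, G and T.

```python
def base_frequencies(seq):
    """Return (a, c, g, t) for a sequence assumed to contain only A, C, G, T."""
    seq = seq.upper()
    return tuple(seq.count(b) / len(seq) for b in "ACGT")

def z_transform(a, c, g, t):
    """Map four base frequencies to the mapping point (x, y, z)."""
    x = (a + g) - (c + t)  # purine vs. pyrimidine
    y = (a + c) - (g + t)  # amino vs. keto
    z = (a + t) - (g + c)  # weak vs. strong hydrogen bonds
    return x, y, z

def inverse_z_transform(x, y, z):
    """Recover (a, c, g, t) from a mapping point; the result always sums to 1."""
    a = (1 + x + y + z) / 4
    c = (1 - x + y - z) / 4
    g = (1 + x - y - z) / 4
    t = (1 - x - y + z) / 4
    return a, c, g, t
```

A sequence made only of A maps to the vertex (1, 1, 1), and one made only of C maps to (-1, 1, -1), which matches the assignment of the bases to the vertices of the tetrahedron.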

Linear Representation of the DNA Group

Based on the reduced coordinate system, the coordinates of the four vertices A, C, G and T can be represented by the column vectors $(A) = (1, 1, 1)^T$, $(C) = (-1, 1, -1)^T$, $(G) = (1, -1, -1)^T$ and $(T) = (-1, -1, 1)^T$ (11). We previously established the linear representation of the DNA group, i.e., the tetrahedron group or the alternating group A4, in 1997 [3]. For readers' convenience, the result is re-written here (see eqs. (6) in [3]). This matrix representation depicts correct relationships among the 12 elements of the DNA group. For example, a rotation of 180° about the x-axis in (Fig. ), followed by another similar rotation, leads to the original state, i.e., $R_x R_x = I$. That is to say, the matrix representation not only results in a one-to-one correspondence among elements of the DNA group, but also correctly reflects their relations based on the multiplication of matrices. In the following, we show that the 3×4 transform matrix of eq. (8a) and its variants also constitute a one-to-one representation of each element of the DNA group. For this purpose, eq. (8a) can be re-written as $Z = [(A), (C), (G), (T)]$, where (A), (C), (G) and (T) are given by eqs. (11). Referring to (Table ), we find that the order of the four nucleotides above corresponds to the element I of the A4 group. Similarly, its 11 variants, one for each of the remaining elements of the A4 group, can be derived. Each Z matrix corresponds to one R in a one-to-one manner. Therefore, the Z matrix also constitutes a representation of the DNA group in this sense. However, it is not an ordinary representation of the DNA group, because a multiplication relation such as eq. (13) does not exist among the Z matrices. It should also be noted that the Z-transform formulas eqs. (6), which transform the nucleotide frequencies into the coordinates of a point in a three-dimensional space, are unique and invariant under the operations of the DNA group. The Z-transform formulas shown in eqs. (6) represent the unique set of equations which reflect the inherent features of the DNA group.
The Z-transform formulas are the core of the Z-curve theory.

The Z-transform Formulas for Studying Correlations of Multiple Nucleotides

To extract features of a given DNA sequence, in addition to considering occurrence frequencies of a single nucleotide, correlations of multiple nucleotides should also be considered. Therefore, the Z-transform formulas should be extended to consider the correlations of multiple nucleotides.
The case of a single nucleotide: $x = (a + g) - (c + t)$, $y = (a + c) - (g + t)$, $z = (a + t) - (g + c)$ (16). Eqs. (16) are equivalent to eqs. (9). Here we have 3 (= 3×4^0) parameters.
The case of di-nucleotides: $x_H = [p(HA) + p(HG)] - [p(HC) + p(HT)]$, $y_H = [p(HA) + p(HC)] - [p(HG) + p(HT)]$, $z_H = [p(HA) + p(HT)] - [p(HG) + p(HC)]$, $H = A, C, G, T$ (17), where p(HA) represents the occurrence frequency of the di-nucleotide HA, and so forth. Here we have 12 (= 3×4^1) parameters.
The case of tri-nucleotides: $x_{HI}$, $y_{HI}$ and $z_{HI}$ are defined as in eqs. (17) with p(HA) replaced by p(HIA), $H, I = A, C, G, T$ (18), where p(HIA) represents the occurrence frequency of the tri-nucleotide HIA, and so forth. Here we have 48 (= 3×4^2) parameters.
The case of tetra-nucleotides: $x_{HIJ}$, $y_{HIJ}$ and $z_{HIJ}$ are defined in the same way (19), where p(HIJA) represents the occurrence frequency of the tetra-nucleotide HIJA, and so forth. Here we have 192 (= 3×4^3) parameters.
The case of penta-nucleotides: $x_{HIJK}$, $y_{HIJK}$ and $z_{HIJK}$ are defined in the same way (20), where p(HIJKA) represents the occurrence frequency of the penta-nucleotide HIJKA, and so forth. Here we have 768 (= 3×4^4) parameters.
The case of hexa-nucleotides: $x_{HIJKL}$, $y_{HIJKL}$ and $z_{HIJKL}$ are defined in the same way (21), where p(HIJKLA) represents the occurrence frequency of the hexa-nucleotide HIJKLA, and so forth. Here we have 3072 (= 3×4^5) parameters.
To calculate the occurrence frequencies of multiple nucleotides for a given DNA sequence, we use a moving window with size = 1, 2, 3, 4, 5 and 6. Starting from the first nucleotide or base, move the sliding window rightward one base at a time, and then the frequencies can be calculated. Substitute the frequencies into eqs. (16) to (21), and then the Z-curve parameters can be obtained. For some applications, eqs. (16), i.e., 3 parameters, are sufficient. However, in some cases, more parameters are needed. The space spanned by the 3 parameters of eqs. (16) is denoted by V1, and similarly we have V2, V3, V4, V5 and V6, corresponding to eqs. (17) to (21), respectively. Usually, the direct sum among different spaces is needed. For most applications, there are six possible choices: $V_1$, $V_1 \oplus V_2$, $V_1 \oplus V_2 \oplus V_3$, and so on up to $V_1 \oplus V_2 \oplus \cdots \oplus V_6$, where the symbol ⊕ represents the direct sum of two spaces. The dimensions of these six spaces are 3, 15 (= 3+12), 63 (= 15+48), 255 (= 63+192), 1023 (= 255+768) and 4095 (= 1023+3072), respectively. Generally, for the space $V_1 \oplus V_2 \oplus \cdots \oplus V_n$, the dimension is $3(1 + 4 + \cdots + 4^{n-1}) = 4^n - 1$.
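The parameter counts above can be reproduced with a short sketch. It assumes the usual construction in which, for each (k-1)-nucleotide prefix, three parameters are formed from the frequencies of the four k-nucleotides ending in A, C, G and T; the function names and the dictionary keys are illustrative, and the sequence is assumed to contain only A, C, G and T.

```python
from collections import Counter
from itertools import product

def kmer_frequencies(seq, k):
    """Frequencies of k-nucleotides, counted with a window moved one base at a time."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

def z_parameters(seq, k):
    """Return the 3 * 4**(k - 1) Z-curve parameters for k-nucleotides.

    Keys are (axis, prefix) pairs, e.g. ("x", "A") for the di-nucleotide
    parameter x_A; the prefix is the empty string when k == 1.
    """
    p = kmer_frequencies(seq, k)
    params = {}
    for prefix in map("".join, product("ACGT", repeat=k - 1)):
        f = {b: p.get(prefix + b, 0.0) for b in "ACGT"}
        params[("x", prefix)] = (f["A"] + f["G"]) - (f["C"] + f["T"])
        params[("y", prefix)] = (f["A"] + f["C"]) - (f["G"] + f["T"])
        params[("z", prefix)] = (f["A"] + f["T"]) - (f["G"] + f["C"])
    return params

def direct_sum_dimension(n):
    """Dimension of V1 (+) V2 (+) ... (+) Vn, which equals 4**n - 1."""
    return sum(3 * 4 ** (k - 1) for k in range(1, n + 1))
```

Concatenating the values of `z_parameters(seq, k)` for k = 1, ..., n gives one feature vector in the direct-sum space, for example 15 numbers for n = 2.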

Quadratic Form of x, y and z

Starting from eq. (9) and using the transpose operation of a matrix (denoted by “T”), a simple derivation shows that $x^2 + y^2 + z^2 = 4S - 1$ (25), where S is defined by $S = a^2 + c^2 + g^2 + t^2$ (26). S, named the “genome order index” [7], is useful for designing a fast genome segmentation algorithm [8, 9]. We also observed that for most genomes $S < 1/3$ (27). Eq. (27) has a clear geometrical explanation. The surface of the sphere inscribed in the tetrahedron is described by the equation $x^2 + y^2 + z^2 = 1/3$ (28). Therefore, S < 1/3 implies that the mapping point is within the inscribed sphere [7].
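The relation between the quadratic form and the genome order index can also be verified by direct expansion. The following worked derivation is a sketch that assumes the reduced Z-transform $x = (a+g)-(c+t)$, $y = (a+c)-(g+t)$, $z = (a+t)-(g+c)$ and the definition $S = a^2 + c^2 + g^2 + t^2$; it is not the matrix derivation of the original paper.

```latex
\begin{align*}
x^2 + y^2 + z^2
  &= 3\,(a^2 + c^2 + g^2 + t^2) - 2\,(ac + ag + at + cg + ct + gt) \\
  &= 3S - \bigl[(a + c + g + t)^2 - S\bigr] \\
  &= 4S - 1 , \qquad \text{since } a + c + g + t = 1 .
\end{align*}
% Each cross term appears with sign +2 in one of x^2, y^2, z^2 and with
% sign -2 in the other two, which gives the net coefficient -2.
% Consequently S < 1/3 is equivalent to x^2 + y^2 + z^2 < 1/3,
% i.e. the mapping point lies inside the inscribed sphere of radius 1/\sqrt{3}.
```

Two quick checks: a sequence of a single base gives S = 1 and $x^2 + y^2 + z^2 = 3$ (a vertex of the tetrahedron), while equal base frequencies give S = 1/4 and the mapping point at the origin.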

The Z-curve

One of the most important applications of the Z-transform formulas is to derive the equation of the Z-curve. Consider a DNA sequence with N bases that are inspected one base at a time. From the first base to the nth base, compute the cumulative numbers of the bases A, C, G and T, denoted by $A_n$, $C_n$, $G_n$ and $T_n$, respectively. Applying the Z-transform formulas eqs. (6) to the frequencies $A_n/n$, $C_n/n$, $G_n/n$ and $T_n/n$ of this sub-sequence gives eq. (29). Multiplying both sides of eq. (29) by n and denoting the resulting coordinates by $x_n$, $y_n$ and $z_n$, we have $x_n = (A_n + G_n) - (C_n + T_n)$, $y_n = (A_n + C_n) - (G_n + T_n)$, $z_n = (A_n + T_n) - (G_n + C_n)$, $n = 0, 1, \ldots, N$ (31), or equivalently, using $A_n + C_n + G_n + T_n = n$, $x_n = 2(A_n + G_n) - n$, $y_n = 2(A_n + C_n) - n$, $z_n = 2(A_n + T_n) - n$, which was first derived in 1994 by an entirely different method [2]. It should be noted that $A_n$, $C_n$, $G_n$ and $T_n$ are the cumulative occurrence numbers of A, C, G and T, respectively, in the sub-sequence from the 1st base to the nth base in the sequence with length N. We define $A_0 = C_0 = G_0 = T_0 = 0$, therefore $x_0 = y_0 = z_0 = 0$. The Z-curve is defined as the connection of the nodes $P_0 (x_0, y_0, z_0)$, $P_1 (x_1, y_1, z_1)$, $P_2 (x_2, y_2, z_2)$, …, $P_N (x_N, y_N, z_N)$ one by one sequentially with straight lines. The connection results in a curve with a zigzag shape, hence the name Z-curve. Note that the Z-curve always starts from the origin of the three-dimensional coordinate system. Once the coordinates $x_n$, $y_n$ and $z_n$ (n = 1, 2, …, N) of a Z-curve are given, the corresponding DNA sequence can be reconstructed uniquely from the so-called inverse Z-transform formulas $A_n = (n + x_n + y_n + z_n)/4$, $C_n = (n - x_n + y_n - z_n)/4$, $G_n = (n + x_n - y_n - z_n)/4$, $T_n = (n - x_n - y_n + z_n)/4$, where the normalization relation $A_n + C_n + G_n + T_n = n$ is used. The three components of the Z-curve, $x_n$, $y_n$ and $z_n$, represent three independent distributions, that is, those of purine/pyrimidine (R/Y), amino/keto (M/K) and weak-H bond/strong-H bond (W/S) bases, respectively, and they completely describe the DNA sequence being studied. In the sub-sequence from the 1st base to the nth base of the sequence, when purine bases (A/G) are in excess of pyrimidine bases (C/T), $x_n > 0$, otherwise $x_n < 0$, and when the numbers of purine (A/G) and pyrimidine bases (C/T) are identical, $x_n = 0$.
Similarly, when amino bases (A/C) are in excess of keto bases (G/T), $y_n > 0$, otherwise $y_n < 0$, and when the numbers of amino (A/C) and keto bases (G/T) are identical, $y_n = 0$. Finally, when weak H-bond bases (A/T) are in excess of strong H-bond bases (G/C), $z_n > 0$, otherwise $z_n < 0$, and when the numbers of (A/T) and (G/C) bases are identical, $z_n = 0$. The $x_n$ and $y_n$ components are termed the RY and MK disparity curves, respectively. Similarly, the AT and GC disparity curves are defined by $(x_n + y_n)/2 = A_n - T_n$ and $(x_n - y_n)/2 = G_n - C_n$, which show the excess of A over T and of G over C, respectively, along the genome. The RY and MK disparity curves, as well as the AT and GC disparity curves, can be used to predict replication origins of various genomes.
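The construction of the Z-curve and the unique reconstruction of the sequence from it can be sketched as follows. The code assumes the cumulative form x_n = (A_n + G_n) - (C_n + T_n), y_n = (A_n + C_n) - (G_n + T_n), z_n = (A_n + T_n) - (G_n + C_n) and its inverse A_n = (n + x_n + y_n + z_n)/4, and so on; the function names are illustrative and the input is assumed to contain only A, C, G and T.

```python
# Step added to (x, y, z) by each base.
STEP = {"A": (1, 1, 1), "C": (-1, 1, -1), "G": (1, -1, -1), "T": (-1, -1, 1)}

def z_curve(seq):
    """Return lists x, y, z of length N + 1, starting at the origin (n = 0)."""
    x, y, z = [0], [0], [0]
    for base in seq.upper():
        dx, dy, dz = STEP[base]
        x.append(x[-1] + dx)
        y.append(y[-1] + dy)
        z.append(z[-1] + dz)
    return x, y, z

def cumulative_counts(n, x, y, z):
    """Inverse Z-transform: cumulative counts (A_n, C_n, G_n, T_n)."""
    return (
        (n + x + y + z) // 4,
        (n - x + y - z) // 4,
        (n + x - y - z) // 4,
        (n - x - y + z) // 4,
    )

def reconstruct(x, y, z):
    """Rebuild the DNA sequence from the three components of its Z-curve."""
    seq = []
    previous = (0, 0, 0, 0)
    for n in range(1, len(x)):
        current = cumulative_counts(n, x[n], y[n], z[n])
        # Exactly one cumulative count grows by one at each position.
        for base, before, after in zip("ACGT", previous, current):
            if after == before + 1:
                seq.append(base)
        previous = current
    return "".join(seq)

def disparity_curves(x, y):
    """AT disparity (A_n - T_n) and GC disparity (G_n - C_n)."""
    at = [(xn + yn) // 2 for xn, yn in zip(x, y)]
    gc = [(xn - yn) // 2 for xn, yn in zip(x, y)]
    return at, gc
```

Round-tripping any A/C/G/T string through `z_curve` and `reconstruct` returns the original string, which is the sense in which the curve and the sequence carry the same information.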

The GC Profile

For most genome sequences, Chargaff Parity Rule II holds, i.e., $A_N \approx T_N$ and $C_N \approx G_N$, where N is the length of a genome or a chromosome. According to eqs. (31), we then find $x_N \approx 0$ and $y_N \approx 0$. The curves of $z_n$ ~ n are roughly straight lines in this case. To amplify the variations of the straight-line-like curve, the curve of $z_n$ ~ n is first fitted by a straight line using the least squares technique, $z = kn$, where (z, n) is the coordinate of a point on the fitted straight line and k is its slope. We define the z' curve by $z'_n = z_n - kn$. Therefore, the deviations of the $z_n$ ~ n curve from the straight line, which corresponds to a constant G+C content (see eq. (36) below), are made prominent by the z' curve. One may also use the average slope of the $z_n$ ~ n curve to compute k, $k = z_N / N$, where $z_N$ is the terminal coordinate of the $z_n$ ~ n curve and N is the sequence length. The essence of the z' curve is to display the variations of the G+C content along a genome or chromosome based on the cumulative count of G and C bases. Let $\overline{GC}$ denote the average G+C content within a region Δn in a sequence. It was shown that [10] $\overline{GC} = \frac{1}{2}\left(1 - k - \frac{\Delta z'_n}{\Delta n}\right) \equiv \frac{1}{2}(1 - k - k')$ (36), where $k' = \Delta z'_n / \Delta n$ is the average slope of the z' curve within the region Δn. Both $\Delta z'_n$ and Δn can be read from the z' curve. It is clear from eq. (36) that a jump in the z' curve, i.e., $k' > 0$, indicates a decrease of G+C content or an increase of A+T content, whereas a drop in the z' curve, i.e., $k' < 0$, indicates an increase of G+C content or a decrease of A+T content. The region Δn is usually chosen to be a fragment of a DNA sequence. The above method to calculate G+C content is called the windowless technique [10]. The GC profile is defined as $-z'_n$, because it is more intuitive in the sense that a jump denotes an increase in GC content. We emphasize the importance of the GC profile for genome studies, because it represents a windowless technique to calculate the G+C content along genome sequences.
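The windowless calculation can be illustrated with a small sketch. It assumes z'_n = z_n - k n with the average slope k = z_N / N (the least-squares slope could be used instead) and the relation GC = (1 - k - k') / 2 for a region with average z' slope k'; the function names are illustrative, and only A, C, G and T are accepted.

```python
def z_prime_curve(seq):
    """Return (z_prime, k): the detrended z component and the slope k = z_N / N."""
    z = [0]
    for base in seq.upper():
        if base in "AT":
            z.append(z[-1] + 1)   # weak hydrogen-bond base
        elif base in "GC":
            z.append(z[-1] - 1)   # strong hydrogen-bond base
        else:
            raise ValueError(f"unexpected character: {base!r}")
    n_total = len(z) - 1
    k = z[-1] / n_total
    z_prime = [z[n] - k * n for n in range(n_total + 1)]
    return z_prime, k

def gc_profile(z_prime):
    """GC profile, defined as the negative of the z' curve."""
    return [-value for value in z_prime]

def average_gc(z_prime, k, start, end):
    """Average G+C content of seq[start:end], read from two points of the z' curve."""
    k_prime = (z_prime[end] - z_prime[start]) / (end - start)
    return 0.5 * (1 - k - k_prime)
```

For the toy sequence `"GCGCATATAT"`, the region `[0, 4)` gives a G+C content of 1 and the region `[4, 10)` gives 0, without any sliding window; only the end points of each region are needed.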

A Segmentation Algorithm Based on the Z-transform

Let n be a point within a DNA sequence of length N, which divides the whole sequence into two parts, the right and left sub-sequences, and denote the frequencies of bases in the right sub-sequence and left sub-sequence by $(a_r, c_r, g_r, t_r)$ and $(a_l, c_l, g_l, t_l)$, respectively. The frequencies are mapped onto two points, $P_r (x_r, y_r, z_r)$ and $P_l (x_l, y_l, z_l)$, in a 3-D space by the Z-transform, eqs. (37). The square of the Euclidean distance between the two points is denoted by D, where $D = (x_r - x_l)^2 + (y_r - y_l)^2 + (z_r - z_l)^2$ (38). Substituting eqs. (37) into eq. (38) yields eq. (39), which expresses D through the base frequencies of the two sub-sequences and a constant C. Note that D is a function of n. Suppose that D(n) reaches its maximum when n = n*. Then the point n* is called a compositional segmentation point [8]. The segmentation algorithm is recursive, i.e., after n* is determined, the same procedure is applied to both the left and right sub-sequences recursively, until D(n) is less than a given threshold. For more details refer to [8]. Eq. (39) can be extended to the case of a binary sequence. For example, by replacing the bases G and C with S, and the bases A and T with W, a DNA sequence can be transformed into a binary sequence of S and W. In this case, the algorithm results in compositional segmentation points according to GC content. A software tool, called GC-Profile, was developed to implement the algorithm for genome segmentation [9].
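The recursive idea can be sketched as below. This is an illustration under stated assumptions rather than the published algorithm of [8]: D(n) is taken as the squared Euclidean distance between the reduced Z-transform mapping points of the left and right sub-sequences (which equals four times the summed squared frequency differences), and the `min_len` parameter, which keeps very short and therefore noisy sub-sequences out of the search, is an addition of this sketch. All function names are illustrative.

```python
def frequencies(seq):
    """Base frequencies (a, c, g, t) of a non-empty A/C/G/T string."""
    return [seq.count(b) / len(seq) for b in "ACGT"]

def squared_distance(seq, n):
    """D(n) between the mapping points of seq[:n] and seq[n:]."""
    left, right = frequencies(seq[:n]), frequencies(seq[n:])
    # For the reduced Z-transform the squared distance reduces to
    # 4 * sum of squared frequency differences.
    return 4 * sum((r - l) ** 2 for l, r in zip(left, right))

def best_split(seq, min_len):
    """Position n* maximizing D(n), with both parts at least min_len long."""
    candidates = range(min_len, len(seq) - min_len + 1)
    return max(candidates, key=lambda n: squared_distance(seq, n))

def segment(seq, threshold, min_len=10, offset=0):
    """Return sorted segmentation points (0-based cut positions in the full sequence)."""
    if len(seq) < 2 * min_len:
        return []
    n = best_split(seq, min_len)
    if squared_distance(seq, n) < threshold:
        return []
    return (
        segment(seq[:n], threshold, min_len, offset)
        + [offset + n]
        + segment(seq[n:], threshold, min_len, offset + n)
    )
```

On a toy sequence made of an AT-only half followed by a GC-only half, the single segmentation point found is the junction. This naive version recomputes frequencies for every candidate n, so it is quadratic in the sequence length; cumulative counts would make each evaluation constant-time.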

PART-2: APPLICATIONS IN GENOME ANALYSIS

The Z-curve theory has been successfully applied in many different research areas in analyzing genomes of bacteria, archaea, eukaryotes and viruses. The applications include, to name a few, the identifications of protein-coding genes, replication origins, horizontally-transferred genomic islands, isochore structures, genome segmentation points, promoters and translational start sites, as well as studies on nucleosome positioning, DNA curvature profiles, phylogenetics and comparative genomics, in various organisms (Table ). It is not practical to cover all these areas in detail in a single review, and thus we will only highlight some studies.

Identification of Protein-coding Genes

One of the most important applications of the Z-curve theory is gene-finding in various genomes. The principle in using the Z-curve theory to identify protein-coding genes is straightforward. Based on the Z-transform formulas, the occurrence frequencies of 4 bases in a DNA sequence are mapped onto a point in a 3-dimensional (3-D) or 15-, 63-, 255-, 1023- and 4095-D space, depending on the number of correlated bases under consideration (eqs. (16) to (22)). The first application was for gene recognition in the budding yeast genome, where a 3-D space (eqs. (16)) was adopted [11]. However, since the protein coding sequence has 3 phases, the 3 Z-curve parameters are expanded to 9 (3x3 phases) parameters. Adding the genome order index S (eq. (26)) into the set of 9 Z-curve parameters, a 10-D space is spanned by the 10 parameters. It was observed that the mapping points of protein coding sequences and non-coding sequences are distributed in two distinct regions in the 10-D space, although there is minor overlapping [12]. Therefore, the two kinds of points can be discriminated by the Fisher discriminant method, or other classifiers, such as support vector machines. The Z-curve algorithm was first applied to recognize protein coding genes in the budding yeast (Saccharomyces cerevisiae) genome with an accuracy better than 95%, where the accuracy is defined as the average of sensitivity and specificity [11]. The same algorithm achieved an accuracy rate over 98% in the Vibrio cholerae genome, based on 9 parameters only [13]. The success of the above studies led to the development of a series of ab initio gene-finding software for various species with different numbers of Z-curve parameters.
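The expansion from 3 to 9 parameters can be illustrated with a short sketch. It assumes that the Z-transform is applied separately to the base frequencies at codon positions 1, 2 and 3 of a candidate open reading frame; the function name is illustrative, the input is assumed to contain only A, C, G and T, and the classifier step (Fisher discriminant or otherwise) is not shown.

```python
def phase_specific_parameters(orf):
    """Return the 9 parameters (x1, y1, z1, x2, y2, z2, x3, y3, z3).

    Position i uses the frequencies of A, C, G and T among the bases at
    codon position i of the sequence, read in frame from its first base.
    """
    orf = orf.upper()
    params = []
    for phase in range(3):
        bases = orf[phase::3]  # every third base, starting at this codon position
        a, c, g, t = (bases.count(b) / len(bases) for b in "ACGT")
        params.extend([
            (a + g) - (c + t),
            (a + c) - (g + t),
            (a + t) - (g + c),
        ])
    return params
```

For the toy input `"ATGATGATG"` every codon position holds a single base type, so each triple of parameters is a vertex of the tetrahedron. Feature vectors of this kind, computed for known coding and non-coding sequences, are what a discriminant method would be trained on.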

Gene-finding in Bacterial, Archaeal, Phage and Virus Genomes

Based on 33 Z-curve parameters, we developed ZCURVE 1.0, which is an ab initio gene-finding software for bacterial and archaeal genomes [12]. Based on the 9 Z-curve parameters, ZCURVE_V was developed for identifying protein-coding genes in viral and phage genomes [14]. We also developed the software ZCURVE_CoV for gene-finding in coronavirus genomes, with special applications for SARS-coronavirus genomes [15, 16]. The above set of gene-finding software has been widely used in various laboratories worldwide. For example, ZCURVE 1.0 has been used for annotating protein-coding genes in many newly sequenced bacterial genomes, such as those of Acinetobacter baumannii [17], Variovorax paradoxus [18], Amycolatopsis mediterranei [19], Bacillus thuringiensis [20], Streptomyces tendae [21], Phaeobacter gallaeciensis [22], Desulfobacterium autotrophicum [23], Mycobacterium tuberculosis [24], Magnetospirillum gryphiswaldense [25] and Beggiatoa [26]. ZCURVE 1.0 was also used for annotating archaeal genomes, e.g., archaea of the ANME-1 group [27] (Table ). For some genomes, e.g., those of the bacterium Mycobacterium tuberculosis H37Ra [24] and Me Tri virus [28], ZCURVE 1.0 was the only software used for genome annotation. More frequently, however, results of ZCURVE 1.0 were combined with those of other programs, such as Glimmer [29] and GeneMark [30]. For instance, ZCURVE 1.0 is integrated into the meta-gene-finding tools YACOP [31] and GARSA [32]. It is noteworthy that ZCURVE 1.0 is especially suitable for genomes with high GC contents, e.g., GC content > 56% [12]. Likewise, ZCURVE_V and ZCURVE_CoV have been widely used for annotating protein-coding genes in newly sequenced genomes of viruses, coronaviruses [28, 33-37] and SARS coronaviruses [38-53].

Gene-finding in Eukaryotic Genomes

Algorithms based on the Z-curve theory have been used for recognizing protein coding genes in a number of eukaryotic genomes, e.g., the budding yeast genome [11], Leptospira interrogans genome [54] and Drosophila genomes [55]. The Z-curve algorithm has also been used in recognizing short coding sequences of human genes [56]. The algorithm based on the 189 Z-curve parameters was shown to be the most accurate among those tested for a given database, with the second one being an algorithm based on the Markov chain of order five [56], and the result was later confirmed by an independent study [57]. Recognition of exons and introns of human genes was also studied by using the Z-curve method [58].

Gene-finding Using the Fast Fourier Transform (FFT) Technique

The standard genetic code defines a mapping between a codon and an amino acid. According to this mapping, protein coding regions are divided into a series of tri-nucleotides (codons or triplets), resulting in a period-3 property in coding regions. Therefore, it is possible to find coding regions by exploring the 3-periodicity of DNA sequences. The first step is to transform the DNA sequence into a digital sequence or signal, and the Z-curve is especially suitable for this purpose. According to eqs. (31), the increments $\Delta x_n = x_n - x_{n-1}$, $\Delta y_n = y_n - y_{n-1}$ and $\Delta z_n = z_n - z_{n-1}$ each take the value 1 or -1. Applying the FFT to $\Delta x_n$, $\Delta y_n$ and $\Delta z_n$, respectively, we are able to detect the 3-periodicity in the FFT power spectrum for each of the three numerical sequences. To increase the sensitivity, a lengthen-shuffle FFT algorithm was proposed for finding protein coding regions [59]. For example, the method was used to detect introns in C. elegans chromosome III [60], and was later improved by using an adaptive filter to predict the exons in DNA sequences [61]. The relationship between the Z-curve and the Fourier transform for DNA sequence classification was studied in detail [62].
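The period-3 signal can be illustrated without a full FFT by evaluating the discrete Fourier transform only at the frequency 1/3. The sketch below is not the lengthen-shuffle algorithm of [59]: it assumes the three ±1 increment sequences of the Z-curve as input signals, uses a direct single-frequency DFT in place of an FFT library, and the normalization by the sequence length is an arbitrary choice. Function names are illustrative.

```python
import cmath

# Increment (dx, dy, dz) contributed by each base.
STEP = {"A": (1, 1, 1), "C": (-1, 1, -1), "G": (1, -1, -1), "T": (-1, -1, 1)}

def z_increments(seq):
    """Return the three +1/-1 sequences (dx, dy, dz) for an A/C/G/T string."""
    steps = [STEP[b] for b in seq.upper()]
    return tuple([s[i] for s in steps] for i in range(3))

def power_at(signal, k):
    """Power of the DFT component with index k, divided by the signal length."""
    n_total = len(signal)
    total = sum(
        value * cmath.exp(-2j * cmath.pi * k * n / n_total)
        for n, value in enumerate(signal)
    )
    return abs(total) ** 2 / n_total

def period3_power(seq):
    """Summed power at frequency 1/3 over the three increment sequences."""
    usable = len(seq) - len(seq) % 3  # trim so that N/3 is an integer index
    seq = seq[:usable]
    return sum(power_at(signal, usable // 3) for signal in z_increments(seq))
```

A strictly codon-periodic toy sequence such as `"ATG" * 30` gives a large value, while a period-2 sequence such as `"AT" * 45` gives a value close to zero. On real sequences the statistic would be compared against the background spectrum within a sliding window.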

Prediction of Replication Origins

Prediction of Replication Origins of Archaeal Genomes

Bacterial and eukaryotic genomes contain single and multiple replication origins, respectively. It was once a mystery whether archaea could have multiple oriCs. Using the Z-curve method, we first predicted three oriCs as well as their precise locations for Sulfolobus solfataricus [63], and the prediction was consistent with later experimental evidence [64-67]. Methanococcus jannaschii was the first archaeon to have its genome sequenced; however, its oriCs were notoriously difficult to locate by both theoretical and experimental methods. The Z-curve method predicted 2 oriCs [68] that were supported by later experimental evidence [66]. Similarly, we predicted a single oriC in the genome of Methanosarcina mazei [69] and 2 oriCs in the genome of A. pernix [70], which were also supported by experimental evidence [71]. The Z-curve method has been commonly used for annotating newly sequenced archaeal genomes, such as those of Sulfolobus acidocaldarius [72], Haloferax volcanii [73], Desulfurococcus kamchatkensis [74], Thermococcus sibiricus [75], and Sulfolobus islandicus [76].

Prediction of Replication Origins in Bacterial Genomes

The Z-curve method is an effective technique that detects the asymmetrical nucleotide distribution around replication origins. The Z-curve contains all the information of its corresponding DNA sequence, and therefore the GC-skew [77] is a special case of the Z-curve. Thus, the Z-curve can reveal nucleotide asymmetry that is not detectable by GC skew [70]. For instance, RY, MK and AT disparity curves show an oriC in the archaeon Methanosarcina mazei Tuc01 (Fig. ), while RY, MK, and GC disparity curves show an oriC in the bacterium Salmonella enterica str. CT18 (Fig. ). Ori-Finder, an integrated in silico method to predict oriC regions of bacterial genomes, has been developed, based on the Z-curve method, along with distributions of DnaA box patterns, indicator genes, and phylogenetic relationships [78]. Ori-Finder has become a commonly used annotation tool for identifying oriCs in newly sequenced archaeal and bacterial genomes, e.g., those of Moraxella catarrhalis [79], Sorangium cellulosum [80], Microcystis aeruginosa [80], Cyanothece [81], Cupriavidus metallidurans [82], Azolla filiculoides [83], Variovorax paradoxus [18], Corynebacterium pseudotuberculosis [84, 85], Orientia tsutsugamushi [86], Propionibacterium freudenreichii [87], Laribacter hongkongensis [88], Legionella pneumophila [89], and Ehrlichia canis [90] (Table ).
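A first step in this kind of analysis is to locate the extrema of the disparity curves, since they mark positions where the nucleotide asymmetry changes sign. The sketch below only does that: it assumes the GC and AT disparity curves are the cumulative counts G_n - C_n and A_n - T_n, reports the positions of their minima and maxima as candidates, and does not decide which extremum corresponds to an origin or a terminus. It is not the Ori-Finder method, which also uses DnaA boxes and indicator genes; the function name is illustrative, and a circular genome is treated as a linear string starting at its first base.

```python
def disparity_extrema(seq):
    """Return {'GC': (argmin, argmax), 'AT': (argmin, argmax)} of the disparity curves.

    Positions are indices n in 0..N of the cumulative curves; ties are
    resolved in favour of the first occurrence.
    """
    gc = [0]  # G_n - C_n
    at = [0]  # A_n - T_n
    for base in seq.upper():
        gc.append(gc[-1] + (base == "G") - (base == "C"))
        at.append(at[-1] + (base == "A") - (base == "T"))

    def extrema(curve):
        positions = range(len(curve))
        return (
            min(positions, key=curve.__getitem__),
            max(positions, key=curve.__getitem__),
        )

    return {"GC": extrema(gc), "AT": extrema(at)}
```

For the artificial sequence `"C" * 50 + "G" * 50`, the GC disparity falls for 50 bases and then rises, so its minimum is reported at position 50. On real genomes the candidate positions would be checked against the other disparity curves and against sequence features near the extremum.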

Studies of Genome Domain Structures

G+C content is an important characteristic of genome sequences. In the human genome, based on density gradient ultra-centrifugation experiments, it was found that long domains of relatively homogenous G+C content exist, and these domains are referred to as isochores [91, 92]. Traditionally, the G+C content along the genome is calculated using an overlapping or non-overlapping sliding window technique, based on which, however, isochores are hard to identify [93]. We developed a windowless technique in G+C content calculation, the GC profile [9, 10], which was used to study isochore structures in genomes of human [94], mouse [95], Arabidopsis thaliana [96] and chicken [97]. Based on the GC profile, the technique of wavelet multi-resolution analysis was used to identify isochore boundaries in the human genome [98]. For instance, a clear domain structure is revealed by the GC profile in chromosome 11 of finch (Fig. ). Other groups also used GC-Profile to study isochores in the pig genome [99] and to assess DNA curvature profiles for Aspergillus fumigatus [100].

Identification of Horizontally-transferred Genomic Islands in Bacterial Genomes

It is generally accepted that horizontal gene transfer (HGT) plays an important role in the genome evolution of prokaryotes, because HGT alters the genotype of a bacterium and could potentially lead to new traits [101]. Genomic islands (GIs) contain clusters of horizontally transferred genes; therefore, identification of horizontally transferred GIs is an important biological issue. Because the GC profile is sensitive to changes in GC content, it is a powerful tool for identifying GIs [102]. Based on the GC profile method, GIs in many bacteria have been identified, e.g., Bacillus cereus [103], Corynebacterium glutamicum [104], Corynebacterium efficiens [105], Vibrio vulnificus CMCP6 [104] and Rhodopseudomonas palustris [106]. For instance, it was once believed that R. palustris does not have GIs, but analysis based on the GC profile identified 3 GIs that help explain how this bacterium survives in a versatile environment [106]. As another example, Corynebacterium efficiens can grow and produce glutamate at temperatures above 40°C; unexpectedly, however, an aspartate kinase is less thermostable. This kinase gene is located in a GI that we identified, which suggests an explanation for its lower thermostability, i.e., adaptive mutations have not occurred extensively because the HGT was recent [105]. Horizontally transferred elements in Streptococcus pneumoniae ATCC 700669 can also be clearly shown by the GC profile (Fig. ). The GC profile method has also been used to identify GIs in other genomes, e.g., those of plant pathogens [107], Streptomyces lividans [108], Parachlamydiaceae UWE25 [109], epsilon proteobacteria Sulfurovum and Nitratiruptor [110], Acinetobacter oleivorans [111] and Silicibacter pomeroyi [112].

Identification of Promoters, Translation Start Sites and Nucleosome Positioning

Based on the behavior of the Z-curve near bacterial gene translation start sites (TSS), a self-training method was proposed to find TSS with high accuracy [113]. It is likely that methods based on the same principle can also be used to recognize TSS in archaea and eukaryotes. Indeed, the Z-curve method was used to recognize human Pol II promoters [114] and promoters in bacterial genomes [115]. The positioning of nucleosomes, the elementary structural units of eukaryotic chromatin, is pivotal in regulating many cellular processes, such as gene transcription. The Z-curve algorithm has been used to construct a genome-wide dynamic nucleosome positioning map for the budding yeast [116].

Visualization of DNA Sequences, Comparative Genomics and the Z-curve Database

One of the aims of developing the Z-curve theory is to visualize DNA sequences. By using the Z-curve, features of related DNA sequences can be grasped quickly in a perceivable form [1, 2]. Therefore, we constructed the Z-curve database (www.zcurve.net), which contains Z-curves for currently available genomes, online Z-curve drawing tools and other Z-curve related software [117]. For instance, human chromosome 6 and chimpanzee chromosome 6 are homologous, and they have visibly similar Z-curve patterns (Fig. and ). Another typical example is the visualization of the genomes of related SARS-coronaviruses. Based on the 3-D coordinates of the corresponding Z-curves, a phylogenetic tree was constructed and was found to be in agreement with that based on sequence alignment [118]. Comparative genomics based on the GC profile was used to identify genomic islands [103, 119]. According to Eqs. (6), the base composition of a DNA sequence can be represented by a point in a 3-D space, thus providing an intuitive method to display base compositions. This method was used to study the codon usage in the genomes of AIDS virus [120], human [121], E. coli [122], Vibrio cholerae [123], Aeropyrum pernix K1 [124], Streptomyces coelicolor A3(2) [125] and seven GC-rich bacteria [126]. In prokaryotic genomes with high GC content, coding ORFs and non-coding ORFs are located in distinct regions of a 9-dimensional space revealed by the Z-curve method, forming a flower-like pattern (Fig. and ).
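
The mapping from base composition to a point in 3-D space can be illustrated with a short Python sketch. It assumes that Eqs. (6) take the usual Z-curve form, with x, y and z given by the purine/pyrimidine, amino/keto and weak/strong differences of the base frequencies, and that the 9-dimensional space is built from the same three quantities computed separately at the three codon positions. Both function names are hypothetical and introduced only for illustration.

```python
def composition_point(seq):
    """Map the base frequencies of a sequence to a point (x, y, z)."""
    seq = seq.upper()
    total = sum(seq.count(base) for base in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    a, c, g, t = (seq.count(base) / total for base in "ACGT")
    return (
        a + g - c - t,  # x: purines minus pyrimidines
        a + c - g - t,  # y: amino minus keto bases
        a + t - g - c,  # z: weak minus strong H-bond bases
    )


def codon_phase_features(orf):
    """Concatenate (x, y, z) for codon positions 1, 2 and 3: 9 values."""
    return tuple(
        value
        for phase in range(3)
        for value in composition_point(orf[phase::3])
    )


print(composition_point("AAAA"))
print(codon_phase_features("ATGATGATG"))
```

Each ORF then becomes one point in the 9-dimensional space, and plotting low-dimensional projections of many such points is one way a separation between coding and non-coding ORFs could be inspected.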

SUMMARY

The three components of the Z-curve, x, y and z, which display the distributions of purine/pyrimidine (R/Y), amino/keto (M/K) and strong-H bond/weak-H bond (S/W) bases, respectively, are independent and completely describe the DNA sequence. The x and y components are related to the disparities of RY, MK, AT and GC bases, and can therefore be used to identify oriC regions in prokaryotic and eukaryotic genomes. The z component is related to G+C content, and can therefore be used to identify domain structures of eukaryotic genomes and genomic islands of prokaryotic genomes. The set of all three components can be used to identify protein-coding genes, promoters and translational start sites, or to address other bioinformatics issues. Generally, further applications are expected to benefit from the use of functions based on the three components, i.e., f (x, y, z), with potential integration of other parameters. In conclusion, the methodology of the Z-curve provides a geometrical approach to analyzing genomic DNA sequences. Considerable progress in applying the Z-curve method has been achieved, and the Z-curve theory provides a solid basis for future developments.

Table 1.

Twelve Elements of the DNA Group (A4 Group or the Tetrahedron Group).

Element	A4 Group	Tetrahedron Group
I	A C G T	x y z
Rx	G T A C	x -y -z
Ry	C A T G	-x y -z
Rz	T G C A	-x -y z
RA	A T C G	z x y
RC	G C T A	z -x -y
RG	T A G C	-z -x y
RT	C G A T	-z x -y
R2A	A G T C	y z x
R2C	T C A G	-y -z x
R2G	C T G A	-y z -x
R2T	G A C T	y -z -x
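
The correspondence in Table 1 can be checked numerically. The sketch below assumes that the second column of each row lists the images of A, C, G and T (in that order) under the base permutation, that the third column lists the resulting coordinates in terms of the original x, y and z, and that x, y and z are the usual purine/pyrimidine, amino/keto and weak/strong count differences. The dictionary and function names are illustrative.

```python
BASES = "ACGT"

# element name -> (images of A, C, G, T; transformed coordinates), from Table 1
DNA_GROUP = {
    "I":   ("ACGT", "x y z"),
    "Rx":  ("GTAC", "x -y -z"),
    "Ry":  ("CATG", "-x y -z"),
    "Rz":  ("TGCA", "-x -y z"),
    "RA":  ("ATCG", "z x y"),
    "RC":  ("GCTA", "z -x -y"),
    "RG":  ("TAGC", "-z -x y"),
    "RT":  ("CGAT", "-z x -y"),
    "R2A": ("AGTC", "y z x"),
    "R2C": ("TCAG", "-y -z x"),
    "R2G": ("CTGA", "-y z -x"),
    "R2T": ("GACT", "y -z -x"),
}


def z_endpoint(seq):
    """End point (x, y, z) of the Z-curve, computed from base counts."""
    a, c, g, t = (seq.count(base) for base in BASES)
    return (a + g - c - t, a + c - g - t, a + t - g - c)


def transform(expression, xyz):
    """Apply a signed axis permutation such as 'z -x -y' to a point."""
    result = []
    for token in expression.split():
        sign = -1 if token.startswith("-") else 1
        result.append(sign * xyz["xyz".index(token[-1])])
    return tuple(result)


def check_dna_group(seq):
    """For each element, compare the permuted sequence with the transformed point."""
    xyz = z_endpoint(seq)
    results = {}
    for name, (image, expression) in DNA_GROUP.items():
        permuted = seq.translate(str.maketrans(BASES, image))
        results[name] = z_endpoint(permuted) == transform(expression, xyz)
    return results


print(check_dna_group("AAAAACCG"))
```

The test sequence is chosen so that x, y and z differ in magnitude, which means that a wrong axis permutation or sign would be detected.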

Table 2.

Coordinates of the 4 Vertices of the Regular Tetrahedron ACGT a.

Coordinates	A	C	G	T
X	√3/4	-√3/4	√3/4	-√3/4
Y	√3/4	√3/4	-√3/4	-√3/4
Z	√3/4	-√3/4	-√3/4	√3/4

a Refer to Fig. 2 (d) for the original coordinate system, where the height of the tetrahedron is 1. Consequently, the edge length of the tetrahedron is √6/2, and the edge length of the cube is √3/2.
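
The geometry stated in the footnote can be verified with a few lines of Python. The sketch assumes that the four vertices sit at alternating corners of a cube centered at the origin, at (±a, ±a, ±a) with the sign pattern of Table 2, and that a = √3/4 is the value that makes the height equal to 1. The variable and function names are illustrative.

```python
import math
from itertools import combinations

a = math.sqrt(3) / 4  # assumed coordinate magnitude giving height 1

VERTICES = {
    "A": (a, a, a),
    "C": (-a, a, -a),
    "G": (a, -a, -a),
    "T": (-a, -a, a),
}


def tetrahedron_metrics(vertices):
    """Return (edge lengths, height from A to face CGT, cube edge length)."""
    edges = [math.dist(p, q) for p, q in combinations(vertices.values(), 2)]
    face = [vertices[base] for base in "CGT"]
    centroid = tuple(sum(coords) / 3 for coords in zip(*face))
    height = math.dist(vertices["A"], centroid)
    cube_edge = 2 * abs(vertices["A"][0])
    return edges, height, cube_edge


edges, height, cube_edge = tetrahedron_metrics(VERTICES)
print(edges, height, cube_edge)
```

All six edges come out equal, which confirms that the sign pattern describes a regular tetrahedron inscribed in the cube.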

Table 3.

A Partial List of Z-curve Applications in Genome Analysis.

Research areas	Involved Z-Curve Components	Algorithm, Software or Database	Life Domains or Virus	Species
Protein-coding gene recognition a	x, y, z, S	Z-curve algorithm [1, 2], Zcurve [12]	Bacteria	Acinetobacter baumannii [17], Variovorax paradoxus [18], Amycolatopsis mediterranei [19], Bacillus thuringiensis [20], Streptomyces tendae [21], Phaeobacter gallaeciensis [22], Desulfobacterium autotrophicum [23], Mycobacterium tuberculosis [24], Magnetospirillum gryphiswaldense [25], Beggiatoa [26]
			Phage, plasmid	Fosmids of marine Planctomycetes [127], plasmids in the human gut [128], phage Rtp [129]
			Archaea	Archaea of the ANME-1 group [27]
			Eukaryotes	Leptospira interrogans [130], Yeast [11], Short human protein-coding genes [56, 131], Drosophila [55]
		Zcurve_V [14], Zcurve_CoV [15]	Virus, Coronavirus, phages	Prophage [33], Me Tri virus [28], novel human coronaviruses NL63 and HKU1 [34], novel bat coronaviruses [35], bat coronaviruses 1A, 1B and HKU8 [36], novel human coronavirus [37]
		Zcurve_V [14], Zcurve_CoV [15]	SARS_CoV	Various strains of SARS_CoV [38-53]
Replication origin identification	AT, GC, MK and RY disparity b	Ori-Finder [78], DoriC [132, 133]	Archaea	Methanosarcina mazei [69], Halobacterium species NRC-1 [63], Methanocaldococcus jannaschii [68], Sulfolobus acidocaldarius [72], Haloferax volcanii [73], Desulfurococcus kamchatkensis [74], Thermococcus sibiricus [75], Sulfolobus islandicus [76]
			Bacteria	Moraxella catarrhalis [79], Sorangium cellulosum [80], Microcystis aeruginosa [80], Cyanothece [81], Cupriavidus metallidurans [82], Azolla filiculoides [83], Variovorax paradoxus [18], Corynebacterium pseudotuberculosis [84, 85], Orientia tsutsugamushi [86], Propionibacterium freudenreichii [87], Laribacter hongkongensis [88], Legionella pneumophila [89], Ehrlichia canis [90]
			Phage, plasmid	Streptococcus pneumoniae Virulent Phage Dp-1 [134], R-plasmid pPRS3a from Bacillus cereus [135]
Genomic island identification	z’	GC profile [9, 10]	Bacteria	Corynebacterium efficiens [105], Rhodopseudomonas palustris [106], Corynebacterium glutamicum [104], Vibrio vulnificus and Bacillus cereus [103], Agrobacterium tumefaciens, Ralstonia solanacearum, Xanthomonas axonopodis, Xanthomonas campestris, Xylella fastidiosa and Pseudomonas syringae [107], Streptomyces lividans [108], Parachlamydiaceae UWE25 [109], epsilon proteobacteria Sulfurovum and Nitratiruptor [110], Acinetobacter oleivorans [111], Silicibacter pomeroyi [112]
Genomic island identification	z’	GC profile [9, 10]	Archaea	Haloquadratum walsbyi [136]
GC content variation, isochore, genome segmentation	z’, S	GC profile [9, 10]	Eukaryotes	Human genome: isochores [94, 98, 137] and replication time zones [138]; Isochores for chicken [97], Arabidopsis thaliana [96], mice [95] and pig [99]; DNA curvature profile for Aspergillus fumigatus [100]
GC content variation, isochore, genome segmentation	z’, S	GC profile [9, 10]	Bacteria	Bifidobacterium longum [139], Streptomyces avermitilis [140], Erwinia amylovora [141], Ralstonia pickettii [142]
Promoter, translational start sites, nucleosome positioning	x, y, z	Z-curve algorithm [11, 12], GS-finder [113]	Bacteria	Translational start sites [113] and promoters [115] of Escherichia coli and Bacillus subtilis
Promoter, translational start sites, nucleosome positioning	x, y, z	Z-curve algorithm [11, 12], GS-finder [113]	Eukaryotes	Human Pol II promoter [114], Yeast genome for stable and dynamic nucleosome positioning [116]
Comparative genomics, genome visualization	x, y, z, z’	Z-curve database [117]	Bacteria, archaea, eukaryotes and viruses	Bacillus cereus [103], Bacillus cereus ATCC 10987 [119], Coronavirus [118], human immunodeficiency virus [120], human [121, 143], E. coli [122], Seven GC-rich bacteria [126], 90 species [1], Aeropyrum pernix K1 [124], Streptomyces coelicolor [125]

136 in total

Review 1. Lateral gene transfer and the nature of bacterial innovation.

Authors: H Ochman; J G Lawrence; E A Groisman
Journal: Nature Date: 2000-05-18 Impact factor: 49.962

2. Genome analysis of Moraxella catarrhalis strain BBH18, [corrected] a human respiratory tract pathogen.

Authors: Stefan P W de Vries; Sacha A F T van Hijum; Wolfgang Schueler; Kristian Riesbeck; John P Hays; Peter W M Hermans; Hester J Bootsma
Journal: J Bacteriol Date: 2010-05-07 Impact factor: 3.490

3. A nucleotide composition constraint of genome sequences.

Authors: Chun-Ting Zhang; Ren Zhang
Journal: Comput Biol Chem Date: 2004-04 Impact factor: 2.877

4. Genetic characterization of the HrpL regulon of the fire blight pathogen Erwinia amylovora reveals novel virulence factors.

Authors: R Ryan McNally; Ian K Toth; Peter J A Cock; Leighton Pritchard; Pete E Hedley; Jenny A Morris; Youfu Zhao; George W Sundin
Journal: Mol Plant Pathol Date: 2011-08-10 Impact factor: 5.663

5. Characterization of a new plasmid-like prophage in a pandemic Vibrio parahaemolyticus O3:K6 strain.

Authors: Shih-Feng Lan; Chung-Ho Huang; Chuan-Hsiung Chang; Wei-Chao Liao; I-Hsuan Lin; Wan-Neng Jian; Yueh-Gin Wu; Shau-Yan Chen; Hin-Chung Wong
Journal: Appl Environ Microbiol Date: 2009-03-13 Impact factor: 4.792

6. Growth phase-dependent global protein and metabolite profiles of Phaeobacter gallaeciensis strain DSM 17395, a member of the marine Roseobacter-clade.

Authors: Hajo Zech; Sebastian Thole; Kerstin Schreiber; Daniela Kalhöfer; Sonja Voget; Thorsten Brinkhoff; Meinhard Simon; Dietmar Schomburg; Ralf Rabus
Journal: Proteomics Date: 2009-07 Impact factor: 3.984

7. The major components of the mouse and human genomes. 1. Preparation, basic properties and compositional heterogeneity.

Authors: G Cuny; P Soriano; G Macaya; G Bernardi
Journal: Eur J Biochem Date: 1981-04

8. Haloquadratum walsbyi: limited diversity in a global pond.

Authors: Mike L Dyall-Smith; Friedhelm Pfeiffer; Kathrin Klee; Peter Palm; Karin Gross; Stephan C Schuster; Markus Rampp; Dieter Oesterhelt
Journal: PLoS One Date: 2011-06-20 Impact factor: 3.240

9. Human Pol II promoter recognition based on primary sequences and free energy of dinucleotides.

Authors: Jian-Yi Yang; Yu Zhou; Zu-Guo Yu; Vo Anh; Li-Qian Zhou
Journal: BMC Bioinformatics Date: 2008-02-24 Impact factor: 3.169

10. Performance and scalability of discriminative metrics for comparative gene identification in 12 Drosophila genomes.

Authors: Michael F Lin; Ameya N Deoras; Matthew D Rasmussen; Manolis Kellis
Journal: PLoS Comput Biol Date: 2008-04-18 Impact factor: 4.475
