Jianbo Gao1, Yan Qi, Yinhe Cao, Wen-wen Tung. 1. Department of Electrical & Computer Engineering, University of Florida, Gainesville, FL 32611-6200, USA. gao@ece.ufl.edu
Abstract
Most codon indices used today are based on highly biased nonrandom usage of codons in coding regions. The background of a coding or noncoding DNA sequence, however, is fairly random, and can be characterized as a random fractal. When a gene-finding algorithm incorporates multiple sources of information about coding regions, it becomes more successful. It is thus highly desirable to develop new and efficient codon indices by simultaneously characterizing the fractal and periodic features of a DNA sequence. In this paper, we describe a novel way of achieving this goal. The efficiency of the new codon index is evaluated by studying all of the 16 yeast chromosomes. In particular, we show that the method automatically and correctly identifies which of the three reading frames is the one that contains a gene.
Most codon indices used today are based on highly biased nonrandom usage of codons in coding regions. The background of a coding or noncoding DNA sequence, however, is fairly random, and can be characterized as a random fractal. When a gene-finding algorithm incorporates multiple sources of information about coding regions, it becomes more successful. It is thus highly desirable to develop new and efficient codon indices by simultaneously characterizing the fractal and periodic features of a DNA sequence. In this paper, we describe a novel way of achieving this goal. The efficiency of the new codon index is evaluated by studying all of the 16 yeast chromosomes. In particular, we show that the method automatically and correctly identifies which of the three reading frames is the one that contains a gene.
Gene identification is one of the most important tasks in the
study of genomes. In
order to be successful, a gene-finding algorithm has to
incorporate good indices for the protein coding regions. In
the past two decades, a number of useful codon indices have
been proposed.
They include the codon bias index (CBI) (Bennetzen and Hall
[1]), the
codon adaptation index (CAI) (Sharp and Li [2]; Jansen
et al [3]), the YZ score (Zhang and Wang [4]),
measures based on differences in codon usage (Staden and
McLachlan [5]), hexamer counts (Claverie and Bougueleret
[6]; Farber et al [7]; Fickett and Tung [8]),
codon position asymmetry (Fickett [9]), autocorrelations
and nucleotide frequencies (Shulman et al [10]; Fickett
[9]; Borodovsky et al [11]), entropy (Almagor
[12]), and periodicities, especially the period-3 feature
of a nucleotide sequence in the coding regions (Fickett
[9]; Silverman and Linsker [13]; Chechetkin and
Turygin [14]; Tiwari et al [15]; Trifonov [16];
Yan et al [17]; Anastassiou [18]; Issac et al
[19]; Kotlar and Lavner [20]). Most of them mainly
capture the feature of highly biased nonrandom usage of codons
in the coding regions. The background of a DNA sequence, be it
a coding or noncoding sequence, however, is fairly random.
Consequentially, a DNA sequence can be characterized as a
random fractal. Is it possible to develop a new codon index by
simultaneously incorporating the fractal and periodic features
of a DNA sequence? The aim of this paper is to develop a
simple method to achieve such a goal. Since the codon index
obtained this way complements existing codon indices, it has
the potential of being incorporated into existing gene
identification algorithms so that the accuracy of those
algorithms can be improved and their training be simplified.The novel codon index proposed here is based on two incompatible
features of DNA sequences: the period-3 and the fractal features.
It has been known for a while
that a DNA sequence exhibits fractal properties, with
noncoding regions often possessing, but coding regions often
lacking long-range correlations (Li and Kaneko [21]; Peng
et al [22]; Voss [23]). Roughly speaking, a fractal
means a part is similar to another part or to the whole, and
hence, does not possess any well-defined scale (Mandelbrot
[24]). However, it has been recognized that the biased
nonrandom use of codons in coding regions often defines a
period-3 feature in the coding regions. Period-3 is a specific
scale and hence, is incompatible with the concept of fractal.
We show here that a convenient codon index can be developed by
exploiting this incompatibility. This is achieved by
quantifying the deviation of a DNA sequence from its fractal
behavior due to the period-3 feature. Amazingly, this simple
measure not only correlates well with coding regions, but
also automatically and correctly identifies which of the three
reading frames is the correct one (ie, containing a gene). In
this paper, we will illustrate the idea and evaluate the
proposed index by studying all of the 16 yeast chromosomes.
DATABASES AND METHODS
Database
The yeast chromosome sequences and the associated
annotation data used for the analysis are based on sequence dated 1
October 2003 in the Saccharomyces Genome Database
(http://www.yeastgenome.org) and can be obtained from
ftp://genome-ftp.stanford.edu/pub/yeast/data_download.
The period-3 feature
The period-3 feature of a DNA sequence has been used to
develop important codon indices (Fickett [9]; Silverman
and Linsker [13]; Chechetkin and Turygin [14]; Tiwari
et al [15]; Trifonov [16]; Yan et al [17];
Anastassiou [18]; Issac et al [19]; Kotlar and Lavner
[20]). The existence of this feature can be shown, for
example, by Fourier spectral analysis. In order to apply
Fourier transform, first one has to obtain one or more
numerical sequences from a DNA sequence. A common mapping
scheme is to construct four binary sequences from a DNA
sequence, one for each base. For instance, when nucleotide
base “A” is concerned, a sequence u(n) is assigned 1's at
those positions where “A” is present, and 0's otherwise.
Take sequence
as an example. One obtains the following
binary sequences:
Such a scheme has been used, for example, by Voss [23]
and Kotlar and Lavner [20]. Alternatively, one can obtain
numerical sequences using the following mapping rules. (a) C
or G → u(n) = +1; A or T → u(n) = −1. This
rule suggested by Azbel [25] maps a DNA sequence into a
sequence of weak/strong hydrogen bonds. (b) C or T → u(n) = +1; A or G → u(n) = −1. This
scheme was proposed by Peng et al [22] and maps a DNA
sequence into a sequence of purine/pyrimidine. When a
numerical sequence u(n) is obtained, the discrete fourier
transform (DFT) can be used to compute its spectrum U(k),
which is given by
where N is the length of u(n) and k corresponds to the
discrete frequency of (2π/N)k or a period of (N/k).
U(k) can be conveniently used to identify characteristic
periodicities of u(n). Since a coding DNA sequence is
comprised of codons (units of three nucleotide bases) and the
nucleotide usage in a coding sequence is highly biased and
nonrandom, a period of 3 is often present in the coding
sequence u(n). This feature is usually referred to as
“period-3.” Consequently, the DFT magnitude or power
spectrum density of u(n) often displays a distinct peak at
k = N/3 (or at a frequency around [N/3] when N/3 is not an
integer). However, the period-3 feature is usually lacking or
weak in noncoding regions (Fickett [9]; Silverman and
Linsker [13]; Chechetkin and Turygin [14]; Tiwari et
al [15]; Trifonov [16]; Yan et al [17];
Anastassiou [18]; Issac et al [19]; Kotlar and Lavner
[20]). To illustrate this idea, we use Peng's mapping
rule to construct u(n) from the coding/noncoding DNA
sequences of yeast and perform DFT on u(n) (for
simplicity, we have chosen N = 1026). Typical DFT magnitudes
|U(k)| for coding and noncoding regions are shown in
Figure 1. A strong peak is observed for |U(k)|
at k = 1026/3 of the coding region while no such feature is
observed for the noncoding region.
Figure 1
Representative DFT magnitudes for (a) coding and (b) noncoding
regions in yeast chromosome I.
Fractal property and the DFA technique
For other analyses, especially fractal analysis, it is more handy
to construct a random walk (called DNA walk) from the DNA
sequence. The walk y(n) is generated by forming a partial sum of
the u(i) sequence constructed from the DNA sequence
Several different versions of DNA walks have been proposed
based on different mapping rules for u(n). The recently
proposed 3D DNA walk is also called Z curve (Yan et al
[17]; Zhang et al [26]). Note that the DNA walks
based on the two 1D mapping rules mentioned above (Azbel
[25]; Peng et al [22]) are equivalent to the z-component and
x-component of the Z curve, respectively. Other types of
multidimensional DNA walks have also been suggested
(Berthelsen et al [27]; Cebrat et al [28]). In this
work, we will employ the x-component of the Z curve (A or
T → u(n) = +1; C or G →
u(n) = −1) for further analysis because of its simplicity and
efficiency (Stanley et al [29]). An example is shown in
Figure 2 for the first 20 000 bases of the
chromosome I of yeast. Note that such a mapping generates an
equivalent (except differing by a sign) DNA walk for the
reverse strand of a DNA sequence. Hence, analysis based on
such a mapping processes both strands of a DNA sequence
simultaneously.
Figure 2
An example of a DNA walk constructed from the first
20 000 bases of the chromosome I of yeast.
While DNA walks are useful in many applications,
interpretation of some computational results, such as
long-range correlations (Stanley et al [29]), are
sometimes problematic, due to patchiness effects along a DNA
sequence (Karlin and Brendel [30]). To remove such
patchiness effects, a method called detrended fluctuation
analysis (DFA) was developed by Peng et al [31] and has
been used to identify characteristic patch sizes (Viswanathan
et al [32]). DFA works as follows: first divide a given
DNA walk of length N into ⌊N/l⌋
nonoverlapping segments (where the notation
⌊x⌋ denotes the largest integer that is not
greater than x), each containing l nucleotides; then
define the local trend in each segment to be the ordinate of a
linear least-squares fit for the DNA walk in that segment;
finally compute the “detrended walk,” denoted by y(n),
as the difference between the original walk y(n) and the
local trend. The following scaling behavior (ie, fractal
property) has been found for many DNA walks studied:
where the angle brackets denote ensemble average of all the
segments and F(l) is the average variance over all
segments. The exponent H is often called the “Hurst
parameter” (Mandelbrot [24]). When H = 0.5, the DNA
walk is similar to a standard random walk. When H > 0.5, the
DNA walk possesses long-range correlations. Statistically
speaking, a noncoding region is often more likely to possess
the long-range correlation properties (Stanley et al
[29]). This feature, together with the DFA technique, was
used by Ossadnik et al [33] to develop a coding sequence
finder for genomes with long noncoding regions. To further
illustrate the ideas, we analyze the coding and noncoding
sequences of the yeast genome using the DFA technique. For a
coding/noncoding sequence of length N, first a DNA walk
y(n) is constructed according to Peng's mapping rule. Then
the detrended fluctuation F(l) is computed according to
(5) for a series of segment sizes l (l < N). In
practice, l is often chosen to be the power of a common base
r, that is, l(j) =
r, j = 1,2,...,
log.
Notice that
log F(l) ~ H log l, F(l) is approximately
linear on double logarithmic scale when l is within a
certain range [l0, l1]. A linear least-squares fit of data
in this range produces a straight line with slope a and
intersect b from which we can get an estimate of the Hurst
parameter H = a/2.
Figure 3 shows a log-log plot
of F(l) versus
l for (a) a coding and (b) a noncoding
sequence of yeast chromosome I. The two sequences are of
lengths (a) N = 1742 and (b) N = 3598. We choose
l to
increment with the base r = 2 (also for all following analysis
that concerns DFA) and the fitting range with the best scaling
property is found to be [l0, l1]
= [22, 28] for both (a) and
(b). Within this range, the Hurst parameters are (a) H = 0.54
and (b) H = 0.62. The nice scaling law in [l0,
l1]
indicates that DNA sequences are fractals. Often, the Hurst
parameters in noncoding regions are larger than those in
coding regions, suggesting that noncoding regions often
possess stronger long-range correlations. Sometimes this
feature is termed lesser complexity in noncoding regions
(Ossadnik et al [33]; Stanley et al
[29]).
Figure 3
Representative period-3 fractal deviation
(PFD) for (a) coding and (b) noncoding regions in yeast chromosome I.
Deviation from fractal scaling due to period-3 signal
Intuitively, when a
periodicity exists in a sequence, the fractal
scaling law does not hold at that particular “scale” defined by
the periodicity. Specifically, for DNA sequences, a normally
strong period-3 signal in coding regions causes a “deviation”
from the sequence's fractal “background.” On the contrary, for a
noncoding region, such deviation, if any, is typically much
smaller than that of a coding region. Based on the DFA technique,
we have developed a novel codon index which quantifies such
deviation from the fractal scaling law due to the period-3
feature, which we will denote by period-3 fractal
deviation (PFD) for simplicity.For a DNA walk y(n) of length N, after computing its detrended
fluctuation using DFA and identifying the best fitting range
[l0, l1], an approximation of F(l) can be obtained by
The deviation of y(n) is defined as the difference between
and logF(l) at l = 3, that is,To verify our intuition about the capacity of this index in
distinguishing coding and noncoding regions, we have computed
the PFD value for a large number of the verified open reading
frames (ORFs)
and noncoding segments from all of the 16 yeast
chromosomes. An example of representative PFD values for
coding and noncoding regions is shown in Figure 3.
We observe that for the coding region, the fluctuation
F(l) at l = 3 deviates severely from the power-law
relation (ie, the straight line in a log-log plot in
Figure 3),
while the deviation for the noncoding
region is relatively small.One may wonder if any DNA segment that belongs to a coding region
has a large PFD. In fact, this is not the case. The
quantification of the period-3 feature by the deviation from
fractal scaling is reading-frame dependent. When the coding
segment starts with the gene-containing reading frame (the first
nucleotide of a codon), the period-3 feature collides
with the DFA technique at the scale of l = 3 and results in a large
PFD. When the segment starts with an incorrect reading frame,
the periodicity of 3 cannot be captured by DFA and the deviation
value is small. For noncoding regions where the period-3 feature is
usually lacking or weak, the PFD does not change much for the
three reading frames. Note that the DFT magnitude for a coding
region is similar for all three reading frames while the DFT phase
is not. Codon indices based on the latter has improved performance
compared with algorithms that use DFT magnitude (Kotlar and Lavner
[20]). The PFD measure, whose value also varies for different
reading frames, not only quantifies a sequence's coding strength
well but also locates the reading frame correctly. The latter
statement will be made more concrete shortly.
Algorithm for computing period-3 fractal deviations along a DNA sequence
Based on the observations above, we employ a sliding window
technique (with window size w) to calculate PFD along a DNA
sequence in a systematic fashion. The algorithm can be stated in
four steps.Step 1. Given a DNA sequence of length N, construct a DNA
walk of length N based on the simple purine/pyrimidine rule
(Peng et al [22]). Let w be the size of the sliding
window. When successive windows overlap by w − 1 bases, a
total of N − w + 1 windows can be obtained. For each window, the
value of PFD can be computed based on (7).Step 2. By common sense, one would associate each PFD for a
window with the center of the window. In order to preserve
information about the reading frames, however, this rule is
slightly modified as follows: denote the position of the
window along the DNA sequence by [n, n + w − 1]. We associate its
PFD with the position n + 3j, where j is the largest
integer such that 3j ≤ w/2.Step 3. Form three reading-frame-specific deviation sequences
by dividing the PFD(n) sequence into three subsets,
PFD1(1 + 3m), PFD2(2 + 3m),
PFD3(3 + 3m),
m = 0, 1, 2, ..., corresponding to the positions
(1, 4, 7, ...), (2, 5, 8, ...), (3, 6, 9, ...),
respectively. For later convenience, we will denote
PFD1(1 + 3m),
PFD2(2 + 3m),
PFD3(3 + 3m) by PFD1(m),
PFD2(m),
PFD3(m), m = 0, 1, 2, ....As we will illustrate in the next section, the above three
steps automatically exhibit which reading frame is the correct
one. Step 4 defines a simple but efficient codon index
MAXFD.Step 4. After PDF, i = 1, 2, 3 are obtained,
we compute
Let δ0 be a threshold value. A segment under study is
declared “coding” if the codon index
MAXFD is greater than δ0 and “noncoding”
otherwise. In practice, the threshold value δ0 is
usually chosen to be where the cumulative distribution for the
coding regions intersects with the complementary cumulative
distribution for the noncoding regions (See
Figure 4).
Figure 4
Distributions of MAXFD for the coding (solid
curves) and noncoding (dashed curves) subsets of the 16 yeast
chromosomes. The sliding window size is w = 512. The
parameters n1 and n2 designate the coding/noncoding
segments with lengths greater than n1 and n2,
respectively. (a) n1 = n2 = 1, (b) n1 = n2 = 256,
(c) n1 = n2 = 512, (d) n1 = n2 = 1026. See the text and Table 1 for more
details.
RESULTS AND DISCUSSION
The above algorithm has been used to calculate the PFD values
of all of the 16 yeast chromosomes. To show that the algorithm
is largely independent of the sliding window size as well as
to show that the method is applicable to short DNA sequences,
sliding window sizes of 128, 256, 512, and 1024 have
been tried, with the best scaling regions identified as [22,
24], [22, 25], [22, 26], and [22, 27],
respectively. For each window size, after a PFD sequence is
obtained for an entire chromosome sequence, the three
reading-frame-specific deviation sequences PFD(m),
i = 1, 2, 3, are plotted against their nucleotide positions
along the chromosome in red, green, and black, respectively.
For all four window sizes, the three deviation curves thus
obtained exhibit similar and very interesting patterns. As
examples, we have shown in Figures 5 and 6
the three reading-frame-specific PFD(m) sequences for DNA
segments in yeast chromosome I, from nucleotide 58 000 to
nucleotide 70 000, and from 141 500 to nucleotide 160 000,
respectively, where the window size w is chosen to be 512.
To appreciate the correlations between the patterns of the
variations of the PFD(m) sequences and the
coding/noncoding regions, the locations of the genes from both
positive and reverse strands of the DNA sequence are also
shown below the PFD(m) curves. We observe a few
interesting features. (i) Generally, the three curves,
corresponding to three different colors, do not overlap with
one another. This is a necessary condition for the three
reading frames to be separable. (ii) In coding regions, both
in the positive and the reverse strands, typically one of the
three PFD curves displays a large value and separates
considerably from the other two curves. By systematically
comparing the yeast genome annotation data with the PFD
curve with the largest values among the three, we have found
that this is indeed the correct reading frame. Presumably, by
searching for start and stop codons, one can aptly find out
whether the gene is on the positive or the reverse strand of
the genome, and determine which region(s) define(s) the gene.
If this simple assumption would work, then a gene-finding
algorithm that employs MAXFD as a codon index would require
minimal amount of training. (iii) In noncoding regions, the
three PFD curves are mixed. That means the three reading
frames are more or less equivalent and inseparable.
Figure 5
The
reading-frame-specific PFD, i = 1,2,3 curves for a segment of
DNA in yeast chromosome I (from nucleotide 58 000 to nucleotide 70 000).
The sliding window size is w = 512. A 5th-order moving average
filter has been applied. Colored horizontal bars on the two lines
below the deviation curves are the open reading frames on the two
strands of the chromosome, (first line: positive strand; second
line: reverse strand). The orange and blue bars represent verified
ORFs while a gray bar represents a dubious ORF.
Figure 6
Period-3 fractal deviations (PFDs) of yeast chromosome I
(a segment from nucleotide 141 500 to nucleotide 160 000).
We now evaluate the efficiency of the MAXFD as a codon
index by studying all of the 16 yeast chromosomes. Our sample
pool is comprised of two sets of DNA segments: the coding set,
which contains 4125 verified ORFs (fully coding regions or
exons), and the noncoding set, which contains 5993 segments
(fully noncoding regions or introns). Different subsets of
coding/noncoding segments are extracted according to
the lengths of the sequence segments. These subsets are
described by four parameters, N1,
n1, N2,
n2,
where N1 and N2 are the numbers of coding and
noncoding sequences with length greater than
n1 and n2, respectively. After subsets of
coding/noncoding sequences are chosen, MAXFD is then
computed for all segments in those subsets. We denote the
cumulative distribution of MAXFD over the two subsets as
P(δ) and
P(δ). Then
1 − P(δ) is the
proportion of coding segments in C with
MAXFD > δ and
P(δ) is the proportion
of noncoding segments in NC with
MAXFD < δ. We define sensitivity as the proportion of
segments in set C correctly labeled as
“coding“ and specificity as the proportion of segments in
set NC correctly labeled as “noncoding.”
Given a threshold δ0, the sensitivity and specificity
are 1 − P(δ0) and P(δ0), respectively. If
we define the percentage accuracy as the average of
sensitivity and specificity, an optimal decision threshold is
often set at where sensitivity equals specificity. By plotting
1 − P(f) and
P(f) together, the optimal
decision threshold is the abscissa of the point where the two
curves intersect. The corresponding percentage accuracy is
then simply 1 −
P(δ0)
(or P(δ0),
since
P(δ0) = 1 − P(δ0)). Figure 4
shows the sensitivity/specificity curves
for four configurations of (n1, n2).
More detailed
statistics for all five configurations of
(n1, n2) studied
are summarized in Table 1, for two window sizes,
w = 512 and 128. With sliding window size w = 512, the
percentage accuracy on the entire sample pool is 82.5%.
When only those segments longer than the window size are
concerned, the accuracy is increased to 89.8%. For coding
and noncoding subsets with segment lengths greater than 1026,
the accuracy is further improved to 95.4%. While one might
think the statistics shown in Table 1 may become a
lot worse when a much smaller sliding window size is used,
this is not the case. In fact, when the sliding window size is
reduced to 128, the accuracy for the long coding/noncoding
sequences is only slightly degraded, while the accuracy for
the entire coding/noncoding sequences is actually improved.
Overall, we would conclude that the codon index proposed is
fairly independent of the sliding window size.
Table 1
Accuracy of the PFD-based coding-region identification
algorithm on different coding/noncoding subsets. The parameters
N1 and N2 are the numbers of coding and noncoding sequences
with length greater than n1 and n2, respectively. A DNA
segment is declared “coding” if MAXFD > δ0 and
“noncoding” otherwise. Accuracy is defined as the average of
sensitivity and specificity. The threshold δ0 is set at
where sensitivity equals specificity.
n1
N1
n2
N2
δ0
Sensitivity/specificity
w = 512
w = 128
1
4125
1
5993
0.1800
82.5%
84.7%
256
4067
256
4186
0.1660
85.7%
86.7%
512
3756
512
1948
0.1500
89.8%
89.4%
1026
2674
512
1948
0.1620
92.5%
91.2%
1026
2674
1026
650
0.1320
95.4%
94.4%
In experiments involving expressed sequence tags (ESTs), the
sequences available may all be short. Can the MAXFD index
proposed still be useful? The answer is yes. When a sequence
is very short, it is not necessary to use a sliding window to
obtain three deviation curves. Instead one can simply obtain
three values, PFD1, PFD2, PFD3, from the sequence
and find MAXFD using (8). If the value is very
large, one has good reason to assume that the suspected EST
indeed belongs to a coding region. Otherwise, it may not. When
the former is the case, the reading frame with the largest
PFD, where i ∈ {1, 2, 3}, very likely
indicates the correct reading frame (assuming there is no error
in the sequence). When the sequence under study is not too
short, one can then employ the sliding window technique. The
PFD1(m), PFD2(m), PFD3(m) curves that can be
obtained this way will look like those obtained for a short
segment of those shown in Figures 5 and 6.
We note that the procedures outlined in this here have been
applied to some experimentally obtained short DNA segments
provided by Drs. Farmerie and Liu of the Institute of
Biotechnology at the University of Florida.
Authors: Vladimir Paar; Nenad Pavin; Ivan Basar; Marija Rosandić; Matko Gluncić; Nils Paar Journal: BMC Bioinformatics Date: 2008-11-03 Impact factor: 3.169
Authors: Chuanhua Xing; Donald L Bitzer; Winser E Alexander; Mladen A Vouk; Anne-Marie Stomp Journal: Nucleic Acids Res Date: 2008-12-10 Impact factor: 16.971
Authors: A R Galimov; A A Kruglov; N L Bol'sheva; O Iu Iurkevich; D Ia Lipin'sh; I A Mufazalov; D V Kuprash; S A Nedospasov Journal: Mol Biol (Mosk) Date: 2008 Jul-Aug