Qinqin Wu1, Jiajun Wang, Hong Yan. 1. Department of Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong. suzhouqinqin@gmail.com
Abstract
Nucleosome positioning can affect the accessibility of the underlying DNA to the nuclear environment and as such plays an essential role in the regulation of cellular processes. Specific patterns have been found in the underlying DNA sequences of the nucleosome, and one of the most important patterns includes dinucleotides distributed every 10 to 11 base pairs. Based on this property, we propose to match each dinucleotide in the sequence against its mirror occurrences for 10 to 11 base pairs on both left-hand and right-hand sides. A large number of matches in a local region will then signify the existence of a nucleosome. In this paper, we propose the matched mirror position filters for efficient matching of periodic dinucleotide patterns and computationally predict the nucleosome positions. Experimental results on the Saccharomyces cerevisiae (yeast) genome show that the proposed algorithm can predict nucleosome positions effectively. More than 50% of our predicted nucleosomes are within 35 base pairs of those detected by biological experiments.
Nucleosome positioning can affect the accessibility of the underlying DNA to the nuclear environment and as such plays an essential role in the regulation of cellular processes. Specific patterns have been found in the underlying DNA sequences of the nucleosome, and one of the most important patterns includes dinucleotides distributed every 10 to 11 base pairs. Based on this property, we propose to match each dinucleotide in the sequence against its mirror occurrences for 10 to 11 base pairs on both left-hand and right-hand sides. A large number of matches in a local region will then signify the existence of a nucleosome. In this paper, we propose the matched mirror position filters for efficient matching of periodic dinucleotide patterns and computationally predict the nucleosome positions. Experimental results on the Saccharomyces cerevisiae (yeast) genome show that the proposed algorithm can predict nucleosome positions effectively. More than 50% of our predicted nucleosomes are within 35 base pairs of those detected by biological experiments.
The nucleosome, which is the primary unit of a eukaryotic
chromatin, contains about 147 base pairs (bp) of DNA
which are sharply bent and tightly wrapped around a
histone octamer. This sharp bending occurs at every DNA
helical repeat (∼10 bp) where the major groove of the DNA
faces inwards towards the histone octamer, and again ∼5
bp away, with opposite direction, the major groove faces
outward [1]. Along the chromatin, the neighboring
nucleosomes are separated from each other by 10 to 50-bp
long stretches of unwrapped linker DNA [Figure 1], thus,
75–90% of genomic DNA is wrapped in nucleosomes and
as such affects sequence accessibility [2]. The positioning
of nucleosomes along the chromatin is known to play an
important role in the regulation of gene expression in
eukaryotic cells.
Figure 1
Examples of nucleosomes, each containing 147 bp of nucleotides, connected by linkers of different lengths.
Extensive research has been carried out on nucleosome
positioning recently. Some of the research tries to explore
the mechanic properties of nucleosomes, while the others
develop mathematical models for the prediction of the
nucleosome positions [1,3>–5].
In their study of nucleosome
placement in chromatin, Vincent Miele proposed that the
physical properties of DNA may determine the nucleosome
occupancy from yeast to fly [6]. Besides, I.P. Ioshikhes
predicted some nucleosomes on the basis of comparative
genomics while M. Yassour located the nucleosomes by
analyzing microarray data [7,8]. Despite the fact that
genome-wide maps of nucleosome locations have been
generated [2,6,9,
10], the problem of accurately
predicting the nucleosome positions computationally at
high resolution remains unresolved.Statistical analysis suggests that the periodicities of the
underlying DNA sequence might help solve this problem
[2]. Indeed, there is evidence that the distinctive sequence
motifs recurring periodically at the DNA helical repeat
facilitate the sharp bending of DNA around the nucleosome and hence favor nucleosome formation. These
motifs include ∼10-bp periodic AA/TT/TA dinucleotides
that oscillate in phase with each other and out of phase with
∼10-bp periodic GC dinucleotides [2,4,11]
. Based on this property, we propose a computational method using
the so-called matched mirror position filter (MMPF) for
the prediction of nucleosome positions. An advantage of
our method is that it does not require training data and thus
expensive wet-lab experiments are not needed. The
computational experiment results demonstrate that our
approach can detect the positions of the nucleosome
effectively. On average, more than 50% of our predicted
stable nucleosomes are within 35bp of those detected by
biological experiments and reported in literature.
Methodology
Nucleotide base set
Let us define the set B1 = }A, C, G, T{, which contains the
four nucleotide bases. Then consequently there should be
4×4 =16 dinucleotides. S. C. Satchwell has validated that
in the 16 possible dinucleotides only 10 of them are
unique. He explained that it is because some of them are
related to the two fold axis that passes between the two
strands of the double helix, and the reverse complementary
dinucleotides are considered to be equivalent [11].
Considering this factor, we define the unique dinucleotides
as set B2 = {AA/TT, AT, AC/GT, AG/CT, TA, TC/GA,
TG/CA, CC/GG, CG, GC}.
Dirac delta function
Assume that a DNA sequence is represented by a discrete
function x(n), with x(n) єB1 and n = 0, 1, 2, …, N ∼1. For
each dinucleotide b єB2, a delta function is defined as in
Equation 1 (see supplementary material). The notation
of this delta function has been used to represent the
positions of nucleotide bases [12,
13]. Here we generalize
the notation to dinucleotides. For example, if x(n) = {GCAGGTACAGTCGAG},
then x = δ(n∼2) + δ(n∼8)
+ δ(n∼13).
Matched mirror position filter
Ideally, to produce a periodicity of ∼10.5 bp, an impulse δ
(n) at each position n should be like a double-side
mirror. That is, it should reflect an impulse δ(n∼nk) on the
left-hand side 10 to 11 bp away and an impulse δ(n)
on the right-hand side 10 to 11 bp away (Figure 2). As
defined in Equations (2) and (3) (see supplementary material) respectively,
d is the distance of
δ(n) to
the impulse closest to the position n 10.5, and d is
the distance of δ(n) to the impulse closest to the position
n 10.5. Then, ideally we should have d = 10 or 11
and d = 10 or 11 and there should be many of these
impulses in a nucleosome to produce the periodicity of
∼10.5 bp and no such impulse in a linker.
Figure 2
Positions of dinucleotides along a DNA sequence. Ideally, a dinucleotide should have two mirror images, on the
left-hand and right-hand sides respectively, 10 to 11 bp away to produce the periodicity of ∼10.5 bp in a nucleosome.
In practice, d and d may deviate from the
ideal values. To take this deviation into account, we use a
matching function f(d) to measure the contributions of δ(n),
δ(n) and δ(n) to the periodicity. The matching
function should be large for d close to 10.5 and decrease as
d moves away from this optimal value. Several choices of
f(d) are shown in Figure 3. Given the matching function,
the contributions of the three impulses to the periodicity
can be described as Equation 4 (see supplementary material).
Figure 3
Possible choices of the matching function f(d) with d representing the distance from the current position. Diagrams
in (a) and (b) show trapezoids with different base lengths. Diagram (c) shows a matching function consisting of the
summation of six Gaussian functions with standard deviation equal to 0.5 and means at 8, 9, 10, 11, 12 and 13 bp.
To detect the presence of nucleosomes, we move a window
with the size of 2W + 1 along the DNA sequence and
accumulate the contributions from all dinucleotides within
the window. Consequently, the nucleosome matching score
function can be defined as Equation 5 (supplementary material). In fact, our proposed method involves flexible
matching of mirrored dinucleotide positions. To some
extent, it is similar to the matched filter used in radar
systems for detecting echo signals. Thus, our system is
named the matched mirror position filter (MMPF) with
S(n) being the output.
Threshold discriminant
Following the ideas discussed above, the nucleosomes can
be predicted by comparing S(n) as defined in Equation 5
(see supplementary material) with a threshold
distinguishing the nucleosome and the non-nucleosome.
The threshold here is determined empirically by
performing a lot of experiments on a number of nucleotide
sequences. After a lot of experimental trials, we found that
T = 1.2(2W + 1) is the optimal threshold. In the
implementation of our proposed method, the window
length is selected as 2W + 1=147 according to the fact that
the lengths of nucleosomes in these sequences studied are
mostly 147 bp [2,6,9,
10]. A window size in the range of
100 to 200 gives similar results, but the algorithm achieves
the best result in terms of the false positives in the
sequences studied with the window size of 147. For this
window size, the threshold for the nucleosome matching
score is 1.22W + 1) ≂ 180.
Discussion
In order to verify the performance of the MMPF, we have
performed nucleosome position prediction experiments on
the Saccharomyces cerevisiae (yeast) genome, which is
downloaded from the database of National Center for
Biotechnology Information (NCBI) [14]. The genome
contains 16 chromosomes with lengths from 230k to 1532k
bp. We choose the yeast genome since it has been
extensively studied and there are high-resolution
nucleosome mapping results reported in literature [7]. A
criterion similar to that in [2] is used here to compute the
prediction accuracy. That is, we consider a prediction
correct if the nucleosome center position determined by the
MMPF is within 35 bp of that reported in the highresolution
result of [7]. The prediction accuracy is then
computed as the ratio of correctly predicted nucleosomes
over the total number of nucleosomes predicted by our
algorithm.The prediction results achieved by using different matching
functions in the MMPF are summarized in Table 1
(supplementary material). From this table, it can be seen
that with the isosceles trapezoid function (upper base 9-12)
the least nucleosomes can be predicted and the accuracy is
the worst, while with superimposed Gaussian function
(delta = 0.5) the accuracy achieves the best occasionally
but its predicted nucleosomes are still less than using the
isosceles trapezoid function (upper base 8-13). In
consideration of both the number of correctly predicted
nucleosomes and the prediction accuracy, we choose the
isosceles trapezoid function (upper base 8-13) as the
optimal matching function f(d) to predict nucleosome
positions for further analysis.For a better understanding of the correlation between the
predicted nucleosomes and the regulatory function sites,
such as promoters and the transcription factor binding
sites, a comparison between the nucleosomes predicted by
our method and those obtained from biological
experiments and published in literature is shown in
Figures4, 5 and 6. Besides, the effect of different
thresholds on the accuracy is also discussed. From Figure 4a, it can be seen that in the segment near the genes CHA1
and VAC17 of chromosome III, ten nucleosomes are
reported in literature. Seven stable ones are predicted with
our proposed method, and four of them coincide well with
those determined in biological experiments. In Figure 4b,
when the threshold is set to be 190, although the number of
false positives decreases, the accuracy of the whole
genomic-scale prediction degrades to ∼45%.
Figure 4
Genome-wide prediction of nucleosome positions in yeast chromosome III using the MMPF with n denoting the
position in the chromosome and S(n) denoting the matching score. (a) The distribution of the nucleosomes near genes CHA1
and VAC17. The upper green diamonds show the nucleosomes determined from biological experiments [7] and the lower red
diamonds show the nucleosomes detected using our computational method. The dashed line represents the threshold at 180
for the nucleosome matching score. (b) The threshold is changed to 190.
Figure 5
Genome-wide prediction of nucleosome positions in chromosome XV and chromosome XIV using the MMPF
with n denoting the position in the chromosome and S(n) denoting the matching score: (a) near gene HIS3 and (b) gene
MFA2.
Figure 6
Genome-wide prediction of nucleosome positions near the transcription factor motifs [15] in chromosome XVI
and chromosome III using the MMPF with n denoting the position in the chromosome and S(n) denoting the matching score.
Figure 5 shows the comparison between our predicted
stable nucleosomes and those reported in literature near the
genes HIS3 and MFA2 in chromosomes XV and XIV,
respectively. A similar comparison of the nucleosomes near
some transcription factor motifs [15] (in chromosomes
XVI and III) is shown in Figure 6. The results shown in
Figures4, 5 and 6 suggest that the potential nucleosomes
can be effectively predicted by the MMPF. Although in
some segments the rate of correctly predicting
nucleosomes is low (Figure 6b), this algorithm performs well on a genomic scale. Our analysis also indicates that
nucleosomes may play a role in the regulation of the DNA
sequences. From Figures 4a and 5, it can be seen that there
may be correlations existing between the positions of
nucleosomes and genes. It can also be seen from these
figures that the nucleosomes have a strong affinity to the
genes in some genomic locations and exhibit lower
occupancy in other positions such as the promoters, usually
found ∼100 to 500 bp upstream of the start codon in the
gene or in the intergenic regions. Besides, our findings in
Figure 6 support the hypothesis that the functional
transcription factor binding sites may be predominantly
nucleosome-free [15].
Conclusion
In this paper, we have presented a computational method
based on the MMPF for nucleosome position prediction.
This technique, while offering a level of accuracy
comparable to existing ones, has some distinct advantages.
Being based on the periodicity of nucleosomes and a
flexible pattern matching scheme, it is independent of a
training set that must be obtained through biological
experiments. The MMPF can provide a useful tool to study
the eukaryotic genome chromatin structure, protein-DNA
interactions, and transcriptional regulations. Our future
work aims to improve the prediction accuracy for large
genomic sequences based on more sophisticated
mathematical models, such as probabilistic relaxation
labeling [16].
Authors: Dmitry K Pokholok; Christopher T Harbison; Stuart Levine; Megan Cole; Nancy M Hannett; Tong Ihn Lee; George W Bell; Kimberly Walker; P Alex Rolfe; Elizabeth Herbolsheimer; Julia Zeitlinger; Fran Lewitter; David K Gifford; Richard A Young Journal: Cell Date: 2005-08-26 Impact factor: 41.582
Authors: Guo-Cheng Yuan; Yuen-Jong Liu; Michael F Dion; Michael D Slack; Lani F Wu; Steven J Altschuler; Oliver J Rando Journal: Science Date: 2005-06-16 Impact factor: 47.728
Authors: Christopher T Harbison; D Benjamin Gordon; Tong Ihn Lee; Nicola J Rinaldi; Kenzie D Macisaac; Timothy W Danford; Nancy M Hannett; Jean-Bosco Tagne; David B Reynolds; Jane Yoo; Ezra G Jennings; Julia Zeitlinger; Dmitry K Pokholok; Manolis Kellis; P Alex Rolfe; Ken T Takusagawa; Eric S Lander; David K Gifford; Ernest Fraenkel; Richard A Young Journal: Nature Date: 2004-09-02 Impact factor: 49.962
Authors: Heather E Peckham; Robert E Thurman; Yutao Fu; John A Stamatoyannopoulos; William Stafford Noble; Kevin Struhl; Zhiping Weng Journal: Genome Res Date: 2007-07-09 Impact factor: 9.043