Venkata Rajesh Yella1, Akkinepally Vanaja1,2, Umasankar Kulandaivelu2, Aditya Kumar3. 1. Department of Biotechnology, Koneru Lakshmaiah Education Foundation, Guntur 522502, Andhra Pradesh, India. 2. KL College of Pharmacy, Koneru Lakshmaiah Education Foundation, Vaddeswaram, Guntur 522502, Andhra Pradesh, India. 3. Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur 784028, Assam, India.
Abstract
DNA replication in eukaryotes is an intricate process, which is precisely synchronized by a set of regulatory proteins, and the replication fork emanates from discrete sites on chromatin called origins of replication (Oris). These spots are considered as the gateway to chromosomal replication and are stereotyped by sequence motifs. The cognate sequences are noticeable in a small group of entire origin regions or totally absent across different metazoans. Alternatively, the use of DNA secondary structural features can provide additional information compared to the primary sequence. In this article, we report the trends in DNA sequence-based structural properties of origin sequences in nine eukaryotic systems representing different families of life. Biologically relevant DNA secondary structural properties, namely, stability, propeller twist, flexibility, and minor groove shape were studied in the sequences flanking replication start sites. Results indicate that Oris in yeasts show lower stability, more rigidity, and narrow minor groove preferences compared to genomic sequences surrounding them. Yeast Oris also show preference for A-tracts and the promoter element TATA box in the vicinity of replication start sites. On the contrary, Drosophila melanogaster, humans, and Arabidopsis thaliana do not have such features in their Oris, and instead, they show high preponderance of G-rich sequence motifs such as putative G-quadruplexes or i-motifs and CpG islands. Our extensive study applies the DNA structural feature computation to delve into origins of replication across organisms ranging from yeasts to mammals and including a plant. Insights from this study would be significant in understanding origin architecture and help in designing new algorithms for predicting DNA trans-acting factor recognition events.
The genetic information
between generations is preserved by the
mechanism of DNA replication, and it forms the basis for heredity.
The precision in the process is achieved by activating it only once
during each cycle of cell division.[1] The
invigoration of DNA replication initiation is accomplished through
two successive regulated steps: origin licensing and origin activation.[2] The first step, origin licensing, occurs in the
G1 phase, where highly conserved replication initiation proteins are
sequentially loaded on a DNA sequence known as origins of replication
(Oris) to form the pre-replicative complex (preRC).[3] Next, origin activation occurs through the S phase when
additional proteins are recruited to the preRC. The unwinding and
the synthesis of two daughter DNAs start simultaneously after the
origin activation. A diverse set of factors such as regulatory proteins,
remodelers, replicator sequences, small noncoding RNA, epigenetic
mechanisms, chromatin configuration and domains, nuclear envelopes,
and subcellular compartmentalization coordinate the replication mechanisms
temporally and spatially.[2] Oris, the genetic
determinants of the cell, form the anchoring site for the replication
machinery and are known to have a signature sequence context.

In consonance with the classic replicon model, replication
is initiated through recognition of cis-acting elements by trans-acting
initiator proteins.[4] However, this cis-acting code
is strictly applicable only to bacteria and a few lower eukaryotes, and
the code is not yet clear in complex genomes. In metazoans, it has
been revealed that various proteins involved in the orchestration
of replication are conserved, whereas the genetic determinants
are rapidly evolving. In Saccharomyces cerevisiae, Oris have AT-rich autonomously replicating sequences (ARSs), which
encompass 11–17 nucleotide consensus motifs.[5] In S. pombe, the ARSs are
not observed, but long AT tracts can play their role instead.[6] Origins of the yeast species S.
japonicus have GC-rich sequences.[7] In comparison, metazoan Oris are preponderant
in genomic regions with high GC content,[8] such as CpG islands and G-rich elements, the G-quadruplex-forming
sequences. Further, it has been reported that an origin G-rich repeated
element (OGRE), which can favor the G-quadruplex, was identified in
the majority of D. melanogaster and
mammalian Oris.[8] The metazoan initiation
sites may have some common genetic determinants, but robust consensus
sequence signals are not displayed widely. In short, research indicates
that primary DNA sequences at Oris vary across families of
life, and the current understanding of Oris is still far from satisfactory.

The replication process, which involves intricate DNA-protein recognition
events (access and orchestration of replication machinery) and melting
of origin DNA, occurs in the context of the three-dimensional structure
of DNA. Hence, it is pivotal to study Oris in terms of descriptors
of the DNA structure. B-DNA is the most common form of structure subsisting
in physiological conditions. It can display extensive structural polymorphism
both at the gross level of 3D structure (non-B-DNA) and at local scales (DNA
structural features). Literature has reported more than 20 noncanonical
DNA secondary structures,[9] while biologically
relevant structures include cruciform DNA, G-quadruplexes, intercalated
motifs, hairpins, triple helices (H-DNA), and slipped structures.
Recent work suggests that DNA conformation may diverge from the canonical
B-form in approximately 13% (or 394.2 Mb) of the human genome.[10] Quantitative studies on various X-ray crystal
structures of free B-DNA illustrated that naked DNA occupies a significant
amount of conformational space,[11,12] and sequence-dependent
perturbations have been characterized.[13] The
sequence-dependent fluctuations in local helicoidal parameters (rotational
and translational), which lead to variations at a gross level, play a role
in DNA melting, DNA-protein recognition, nucleosome organization,
chromatin configuration, and genome integrity. Extensive experimental
measurements, theoretical simulations, and computational studies have
led to the establishment of various DNA structural features, namely,
duplex stability, intrinsic curvature, protein-induced bendability,
groove shape, topography,[11,14−23] and DNA crookedness.[24] Further studies
on the phylogenetics of conserved regions/regulatory elements showed
that the local topography and DNA shape are found to be conserved
and evolutionarily constrained compared to the DNA base sequences
in various vertebrates.[25] Various tools
based on DNA structural features were designed for large-scale applications
in genomics during the last decade.[26] In
our earlier works, we have extensively studied DNA structural parameters
to characterize prokaryotic and eukaryotic promoters,[21,25] to understand DNA transcription factor recognition[27] and conservation of DNA structural properties in promoter
regions,[28] to predict promoter regions
in genomes,[19] to delineate TATA-containing
and TATA-less promoters,[17] to link the
DNA structure with gene expression variability,[16,18] etc. Fewer studies have characterized origins
of replication in S. cerevisiae,[29] D. melanogaster, and humans.[30] These reports are limited
to one or two organisms with smaller genomic regions surrounding the
replication start sites. The current study focuses on DNA structural
features in the vicinity of origin start sites in nine different eukaryotic
systems including species of yeasts, humans, and plants.
Results and Discussion
The origins of replication in nine eukaryotic systems, Saccharomyces cerevisiae, Kluyveromyces
lactis, Candida glabrata, Pichia pastoris, Schizosaccharomyces pombe, Drosophila
melanogaster, mice, humans, and Arabidopsis
thaliana, are examined; these are regarded as model
organisms for discerning eukaryotic replication at different levels
and aspects. The systems vary in their genomic GC content (36–42%)
(Table 1) and nucleotide
base composition, are widely investigated, and have experimentally inferred
replisome information available.[31] Here,
we computed the sequence composition and sequence-dependent
structural features in the Oris of these systems to understand their
similarities and differences. Our analysis included long flanking
regions, from −5000 to +5000 relative to the starting genomic
loci for origins of replication listed in the DeOri database.[31] Throughout this manuscript, we refer to
the position “0” as the Ori start site and these regions
as Ori regions or Ori sequences. The strategy implemented for this
work is outlined in Figure 1.
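The alignment of sequences on the Ori start site can be illustrated with a minimal Python sketch. It assumes a chromosome sequence held as a string and a 0-based Ori start index; the function name `ori_window`, the padding character, and the handling of chromosome ends are hypothetical choices made for illustration and are not taken from the study.

```python
def ori_window(chrom_seq, ori_start, flank=5000, pad="N"):
    """Return the window from ori_start - flank to ori_start + flank (inclusive).

    ori_start is a 0-based index into chrom_seq (an assumption; database
    coordinates may be 1-based). Positions beyond either chromosome end are
    filled with `pad`, so every window has length 2 * flank + 1 and the Ori
    start site always sits at index `flank` (position "0").
    """
    left = ori_start - flank
    right = ori_start + flank + 1
    core = chrom_seq[max(left, 0):min(right, len(chrom_seq))]
    left_pad = pad * max(0, -left)
    right_pad = pad * max(0, right - len(chrom_seq))
    return left_pad + core + right_pad
```

Padding keeps all windows the same length, so position-wise averages over many origins stay aligned even for origins close to a chromosome end.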
Table 1
Genomic Features of Data Sets and Compositional Analysis (Most Occurring Di, Tri, Tetra, and Heptamers) of Oris in Eukaryotes (a)

| name of organism | genome size (Mb) | no. of Chr | no. of Ori sites | genome GC % | GC % of Ori region | di | tri | tetra | hepta |
|---|---|---|---|---|---|---|---|---|---|
| Saccharomyces cerevisiae | 12.07 | 16 | 357 | 38.15 | 36.05 | AA (11.59) | AAA (4.49) | AAAA (1.85) | AAAAAAA (0.217) |
| | | | | | | TT (11.53) | TTT (4.43) | TTTT (1.79) | TTTTTTT (0.198) |
| | | | | | | AT (9.77) | AAT (3.23) | AAAT (1.22) | ATATATA (0.097) |
| | | | | | | TA (8.56) | ATT (3.21) | ATTT (1.21) | TATATAT (0.091) |
| | | | | | | TG (6.20) | ATA (3.05) | ATAT (1.14) | AAAAAAT (0.077) |
| Kluyveromyces lactis | 10.68 | 6 | 144 | 38.76 | 35.02 | TT (11.57) | AAA (4.16) | AAAA (1.49) | AAAAAAA (0.133) |
| | | | | | | AA (11.34) | TTT (4.16) | TTTT (1.44) | TTTTTTT (0.116) |
| | | | | | | AT (10.12) | AAT (3.32) | ATAT (1.24) | ATATATA (0.105) |
| | | | | | | TA (8.95) | ATT (3.29) | ATTT (1.21) | TATATAT (0.101) |
| | | | | | | TG (6.42) | TAT (3.21) | AAAT (1.17) | AAATAAA (0.063) |
| Candida glabrata strain CBS138 | 4.81 | 13 | 256 | 39.03 | 33.85 | AA (11.61) | AAA (4.29) | AAAA (1.64) | AAAAAAA (0.128) |
| | | | | | | TT (11.41) | TTT (4.20) | TTTT (1.61) | TTTTTTT (0.116) |
| | | | | | | AT (10.73) | TAT (3.61) | ATAT (1.41) | ATGTTTT (0.102) |
| | | | | | | TA (9.82) | ATA (3.58) | AAAT (1.31) | ACCAAAA (0.087) |
| | | | | | | TG (6.57) | AAT (3.54) | TATT (1.25) | TTTTTAT (0.084) |
| Pichia pastoris | 9.35 | 4 | 294 | 41.13 | 39.51 | AA (10.16) | AAA (3.42) | AAAA (1.21) | AAAAAAA (0.091) |
| | | | | | | TT (10.08) | TTT (3.40) | TTTT (1.19) | TTTTTTT (0.088) |
| | | | | | | AT (8.39) | AAT (2.75) | AAAT (0.89) | AAAAAAT (0.046) |
| | | | | | | TA (6.92) | ATT (2.68) | AATT (0.86) | ATTTTTT (0.045) |
| | | | | | | GA (6.66) | TTG (2.40) | ATTT (0.85) | TCTTTTT (0.043) |
| Schizosaccharomyces pombe | 12.59 | 3 | 345 | 36.06 | 30.79 | AA (14.27) | TTT (6.26) | TTTT (2.81) | TTTTTTT (0.459) |
| | | | | | | TT (14.25) | AAA (6.24) | AAAA (2.78) | AAAAAAA (0.428) |
| | | | | | | AT (10.55) | AAT (4.12) | AAAT (1.75) | TTTATTT (0.162) |
| | | | | | | TA (9.80) | ATT (4.09) | ATTT (1.74) | ATTTTTT (0.160) |
| | | | | | | TG (5.53) | TAA (3.59) | TAAA (1.59) | AAATAAA (0.160) |
| Drosophila melanogaster (S2) | 137.55 | 4 | 7156 | 42.29 | 43.8 | TT (9.40) | TTT (3.38) | TTTT (1.23) | AAAAAAA (0.140) |
| | | | | | | AA (9.37) | AAA (3.37) | AAAA (1.23) | TTTTTTT (0.131) |
| | | | | | | AT (7.59) | ATT (2.53) | ATTT (0.97) | TTTATTT (0.071) |
| | | | | | | CA (6.84) | AAT (2.51) | AAAT (0.96) | AAAAATA (0.062) |
| | | | | | | TG (6.84) | TTG (2.16) | AATT (0.83) | TTTTATT (0.061) |
| Arabidopsis thaliana | 119.16 | 5 | 1533 | 36.05 | 41.53 | AA (9.75) | AAA (3.33) | AAAA (1.15) | AAAAAAA (0.099) |
| | | | | | | TT (9.64) | TTT (3.28) | TTTT (1.14) | TTTTTTT (0.098) |
| | | | | | | AT (7.70) | AGA (2.41) | AAGA (0.87) | AAGAAGA (0.059) |
| | | | | | | GA (7.13) | TCT (2.39) | TCTT (0.87) | TCTTCTT (0.059) |
| | | | | | | TC (7.13) | GAA (2.32) | AGAA (0.85) | AGAAGAA (0.057) |
| mouse (P19) | 2716.96 | 20 | 2412 | 42 | 50.38 | CT (7.97) | CTG (2.65) | TTTT (0.87) | TTTTTTT (0.167) |
| | | | | | | AG (7.87) | CAG (2.61) | AAAA (0.82) | AAAAAAA (0.153) |
| | | | | | | TG (7.65) | TTT (2.45) | CTGG (0.78) | TGTGTGT (0.100) |
| | | | | | | CA (7.58) | AAA (2.37) | CCAG (0.78) | ACACACA (0.099) |
| | | | | | | CC (7.21) | CCT (2.33) | CCTG (0.78) | GTGTGTG (0.096) |
| human (MCF7) | 3259.56 | 23 | 94,195 | 41 | 57.76 | GG (10.33) | GGG (3.61) | CAGG (1.20) | CCCTCCC (0.064) |
| | | | | | | CC (10.32) | CCC (3.61) | CCTG (1.20) | GGGAGGG (0.063) |
| | | | | | | CT (8.17) | CAG (3.31) | CTGG (1.15) | GGCTGGG (0.062) |
| | | | | | | AG (8.17) | CTG (3.30) | CCCC (1.14) | GGGCAGG (0.062) |
| | | | | | | TG (8.05) | CCT (3.00) | GGGG (1.14) | CCCAGCC (0.062) |
(a) Origin sequences were downloaded
from the DeOri database for computing the GC percent and k-mer calculations (k = 2, 3, 4, and 7). The numbers
in parentheses indicate the absolute percentage frequency of oligonucleotides
observed in the data sets. The five most frequent words are displayed
in the table. The frequency of a k-mer depends on the GC
percentage and also on the arrangement of nucleotide steps, which is characteristic
of Ori regions. Word composition for data sets from different cell types of D. melanogaster, mouse, and human is shown in Supplementary Table 2.
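The word-composition figures described in this footnote can be approximated with a short Python sketch. It assumes that overlapping k-mers are counted on the given strand only and that words containing ambiguous bases are skipped; these counting rules, and the function names `kmer_percent`, `gc_percent`, and `top_words`, are illustrative assumptions and may differ from the procedure used for Table 1.

```python
from collections import Counter

VALID = set("ACGT")

def kmer_percent(seqs, k):
    """Percentage frequency of each overlapping k-mer across all sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) <= VALID:  # skip words with N or other ambiguity codes
                counts[word] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: 100.0 * n / total for word, n in counts.items()}

def gc_percent(seqs):
    """GC percentage over all unambiguous bases in the sequences."""
    gc = at = 0
    for seq in seqs:
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        at += seq.count("A") + seq.count("T")
    return 100.0 * gc / (gc + at) if gc + at else 0.0

def top_words(seqs, k, n=5):
    """The n most frequent k-mers with their percentages (ties broken alphabetically)."""
    freqs = kmer_percent(seqs, k)
    return sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
```

Calling `top_words(ori_sequences, 7)` on a list of Ori sequences would return the five most frequent heptamers in the same "word (percentage)" spirit as the table.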
Figure 1
Analysis outline for computation of DNA structural features or
motifs in origins of replication in the eukaryotic genomes. Experimentally
mapped endogenous replication initiation sites are retrieved from
the DeOri database (http://tubic.org/deori/).[31] Various physiologically
relevant DNA structural features and motifs, including stability,
propeller twist, minor groove shape, G-quadruplexes, i-motifs, etc.,
were computed using lookup tables of di/tri/tetra nucleotide descriptors
or regular expression patterns.
Ori Regions Display Signature Structural Profiles
In
recent years, intensive experimental and computational analysis has
been carried out on the sequence-dependent secondary structural properties
of regulatory genomic sequences. Also, studies have been carried out
on DNA secondary structure or shape analysis of origins of replication
in yeast[29,32] and in D. melanogaster and humans.[30] It has been observed that
common DNA shape signatures in D. melanogaster and humans are marked by elevated propeller twist, roll angles,
and minor groove width and reduced helical twist.[30] These studies cover few data sets and only one or two systems.
The current study focuses on the characterization of the DNA structure
in the vicinity of origins of replication. To explore the structural
properties, we first aligned sequences encompassing origins of replication,
relative to their Ori start sites [−5000 nt to +5000 nt relative
to Ori where 0 indicates the genomic beginning locus for Ori sequences
compiled in the DeOri database] and then computed the structural features
of DNA sequences using lookup tables. We calculated the structural
features for every k-mer (k = 2–4)
in each DNA sequence as described in Materials and also in our previous work.[18,33] The averaged
structural profile based on the nucleotide position can be considered
as a consensus numerical signature or structural profile for a given
organism.[34] Figure 2 displays the signature features, DNA duplex
stability, melting temperature, propeller twist, bendability (DNase
1 and NPP models), and groove shape (minor groove width) of Ori regions
of S. cerevisiae, K.
lactis, S. pombe, D. melanogaster, humans, and A. thaliana.
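The lookup-table computation and position-wise averaging described above can be sketched in Python as follows. The dinucleotide free-energy values are placeholders that loosely follow published unified nearest-neighbor parameters; they are an assumption for illustration, and the lookup tables actually used in this work are those described in Materials. The function names are hypothetical, and any smoothing applied before plotting is not shown.

```python
from math import sqrt

# Placeholder dinucleotide free energies (kcal/mol). These are illustrative
# values, NOT the lookup table used in the study.
DELTA_G = {
    "AA": -1.00, "TT": -1.00, "AT": -0.88, "TA": -0.58,
    "CA": -1.45, "TG": -1.45, "GT": -1.44, "AC": -1.44,
    "CT": -1.28, "AG": -1.28, "GA": -1.30, "TC": -1.30,
    "CG": -2.17, "GC": -2.24, "GG": -1.84, "CC": -1.84,
}

def property_track(seq, table, k=2):
    """Property value for each overlapping k-mer; None if the k-mer is not in the table."""
    seq = seq.upper()
    return [table.get(seq[i:i + k]) for i in range(len(seq) - k + 1)]

def mean_profile(seqs, table, k=2):
    """Position-wise mean and standard error over aligned, equally long sequences."""
    tracks = [property_track(s, table, k) for s in seqs]
    means, sems = [], []
    for column in zip(*tracks):
        vals = [v for v in column if v is not None]
        n = len(vals)
        if n == 0:
            means.append(None)
            sems.append(None)
            continue
        m = sum(vals) / n
        # Sample variance (n - 1 denominator); zero when only one value exists.
        var = sum((v - m) ** 2 for v in vals) / (n - 1) if n > 1 else 0.0
        means.append(m)
        sems.append(sqrt(var / n))
    return means, sems
```

Passing a tri- or tetranucleotide table with `k=3` or `k=4` gives the corresponding profile for properties such as bendability or minor groove width.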
Figure 2
DNA structural profiles of S. cerevisiae, K. lactis, S. pombe, D. melanogaster, human, and A. thaliana Ori sequences. The x-axis in all the plots represents the sequences spanning from the
−5000 to +5000 region with respect to Ori start sites. The
rows indicate the property, while the columns represent genomes. Average
free energy, normalized melting temperature, propeller twist, flexibility
(two models, DNase 1 sensitivity and nucleosome positioning preference),
and minor groove width are shown. The models of normalized melting
temperature, DNase 1 sensitivity, and nucleosome positioning preference
measure the properties in arbitrary units. Blue-colored error bars
indicate the standard error of the mean property values. Experimentally
identified genomic locations of Ori start sites are retrieved from
the DeOri database (http://tubic.org/deori/). The y-axis for each structural property is maintained
with equal ranges.
In S.
cerevisiae and K. lactis, a low negative free-energy value is
observed from the Ori start sites up to 1000 nucleotides
relative to the Ori start site, or over a more extended region in S. cerevisiae, and a sharp free-energy maximum is
observed around nucleotides 298 and 327 (Figure 2). Meanwhile, in S. pombe, the free-energy profile shows a sharp transition from
highly stable to less stable sequences in the vicinity of Ori start
sites. The highly stable region spans from the −1000 region
to 0, while the less stable region extends to 2000 nucleotides.
To understand the unexpected behavior of S. pombe, we computed the word composition in the abovementioned regions
separately. Composition analysis revealed that in the region −1000
to −1, the hexamers CCACCG, GCGGTC, GACCAC,
CTGGGC, CGGGCC, and CTGGCG occur at least 4 times more often than in the region
0 to 2000. In contrast, the latter region displays a preference
for TTTTTT, TATTTA, AAAAAA, and AATTTA that is at least 4 times higher than in
the former region. Overall, yeasts display a low-stability region
in the vicinity of Ori regions. In D. melanogaster, humans, and Arabidopsis, the trends of the free-energy
profiles are reversed, with high-stability peaks in the region
downstream of Oris. The melting temperature profiles in Figure 2 are similar to the free-energy
profiles. It should be noted that lower DNA stability or melting temperature
is mainly influenced by AT/GC composition. AT-rich sequences are intrinsically
prone to melting, and in our study, we have observed these regions
at Oris in lower eukaryotes. The results in this study are comparable
to our previous work on free-energy profiles. A low-stability region
(free-energy maximum) in core promoters is a characteristic structural feature
of all classes of bacteria and eukaryotes.[18,19,21,23,28] However, in Oris, the low-stability region spans
a broad area of up to 1500 nucleotides relative to Ori start sites
in yeast (while in promoters, it spans only −200 to −300
nucleotides relative to transcription start sites),[16] and the trends are not observed in Drosophila, plants, and mammals.
Melting of the dsDNA origin is essential for propagation of the replication
fork. Several initiator proteins and helicases orchestrate this process.[35] The exact mechanism of DNA melting and unwinding
is not clearly understood due to the lack of high-resolution structures.[36] AT-rich regions can facilitate replication.
However, this principle applies only to prokaryotes[37] and lower eukaryotes (Figure 2).

Another dinucleotide property, the
propeller twist, displays profiles similar to those of
free energy and melting temperature. The propeller minima
are observed at 255, 328, and 628 for S. cerevisiae, K. lactis, and S.
pombe, respectively. Meanwhile, in D. melanogaster, high propeller angles are observed
immediately downstream of the Ori start site (Figure 2). The propeller twist angle
is the rotation of the two bases in a base pair relative to each other and influences
the rigidity of DNA. Sequences with larger negative propeller twist
values (such as A-tracts) are more rigid. The thermodynamic dinucleotide models,
stability and melting temperature, and the conformational property,
propeller twist, revealed differences between Ori regions
and the surrounding regions in six eukaryotes. A recent study
utilized six helicoidal properties for predicting origins in S. cerevisiae based on the significant differences
between Ori and non-Ori sequences.[38] Researchers
have reported a prediction accuracy of 84% with the tool PseKNC for S. cerevisiae Ori sequences.[39] Meanwhile, another predictor was developed for human Oris from Hela cell
types.[40] So, we compared our results
by plotting the six rotational and translational features for six
systems to see whether such tools can be applied globally (Supplementary figure 1). The trends are consistent
with the three dinucleotide features studied. In S.
cerevisiae and K. lactis, Oris display lower roll, tilt, slide, and shift compared to the
flanking sequences. In contrast, the opposite trends in the profiles
are observed in D. melanogaster and
humans. Here, we suggest that these tools can be applied to all species
by accounting for the differences in these properties across species,
with additional strategies for implementation in Oris of flies, mammals,
plants, etc.

The trinucleotide bendability models (DNase 1 and
NPP) can predict
flexibility of DNA in the context of genomic-scale experiments. The
two models revealed that the Ori regions in yeasts are rigid compared
to surrounding sequences, while mammal and plant Oris are highly flexible.
Earlier work by Chen et al. on 270 replication origins in S. cerevisiae showed that replication origins are
significantly rigid relative to neighboring genomic DNA.[32] Our result on S. cerevisiae is consistent with their work. It is known that rigid DNA in genomes
can enhance the sliding of DNA binding proteins.[34,41,42] The proteins of replication machinery may
utilize the property of DNA rigidity for scanning the genomes or efficient
orchestration of replication machinery at Oris in lower eukaryotes.
A common theme of regulatory regions such as promoters and origins
is that they have nucleosome-free regions, or the DNA in these sequences
is less conducive to nucleosome formation.[43]

Further, the DNA shape feature, groove shape, reveals that yeasts
and fungi prefer sequences with narrow minor grooves in the vicinity of
Oris. In contrast, in D. melanogaster, humans, and A. thaliana, wider minor
grooves are predicted near the Oris over longer sequences. Here, we
have examined minor groove preferences over larger regions of DNA. Our
results on minor groove width and propeller twist are consistent with
earlier published results on D. melanogaster.[30] It has been observed
that common DNA shape signatures in D. melanogaster and humans are marked by elevated propeller twist, roll angles,
and minor groove width and reduced helical twist.[30] Altogether, Oris in lower organisms are less stable and
more rigid and prefer narrow minor grooves, while humans and Arabidopsis show the opposite trends, with high GC content, high stability,
and flexible and wider minor groove sequence preference. These observations
could be due to the prevalence of CpG islands and GC-rich sequence
motifs in these genomic regions.[2] The structural
features observed for Oris are comparable to promoter features
reported in earlier studies, with the exception that CpG islands
are not observed in promoters of D. melanogaster.[17,44] However, one key difference between the
profiles of Oris and promoters is that the structural feature signatures
can extend up to 5000 nucleotides surrounding Ori start sites, while
the signals extend only up to 1000 nt flanking transcription start
sites in mammals.[19] In summary, unique
structural signatures demarcate Oris from surrounding genomic regions
in eukaryotes.

The DNA replication initiation program is highly flexible: origins
may differ among cell lineages, and cell type-specific
origins display unique epigenetic signatures.[2] So, it is necessary to understand the structural features of cell
type-specific Oris in eukaryotes. Here, we have also carried out a separate
structural feature computation for various cell types in D. melanogaster, mice, and humans. The data sets
retrieved from DeOri constitute three cell types for D. melanogaster (Kc, Bg3, and S2), three for the
mouse (ES, MEF, and P19), and three for humans (K562, MCF7, and Hela).
The Ori sequences of three different cell types in the same species
display similar structural profiles. However, we cannot draw conclusions about
commonalities in mice and humans, as the data sets for human Hela and
all three cell types in the mouse are too small for statistical comparisons.
Ori Sequences Are Enriched with Characteristic Sequence and
Structural Motifs
We also revisited the earlier studied features
such as GC content and sequence word composition to supplement the
structural property preferences. The similarities and distinctions
in the structural signatures of Ori regions in the systems shown above
can be ascribed to varying nucleotide base compositions along the
sequence or to selective preference for a few oligonucleotides. Table 1 lists the preponderant
words or k-mers (k =
2, 3, 4, and 7) and their frequencies in the sequences between the start and end positions
of origins of replication (listed in the DeOri database) for the
nine systems. Word compositions for various cell types in D. melanogaster, mice, and humans have also been
computed (Supplementary table 2). The
Oris have a typical nucleotide composition with a preference for AT-rich k-mers in lower eukaryotes and plants (Table 1 and Supplementary table 2). The dinucleotides (AA and TT), trinucleotides (AAA
and TTT), tetranucleotides (AAAA and TTTT), and heptanucleotides
(AAAAAAA and TTTTTTT) are over-represented in the Oris of S. cerevisiae, K. lactis, Candida glabrata, S. pombe, and D. melanogaster, while human Oris are enriched with G- or C-rich
heptamer sequences, for instance, CCCTCCC, GGGAGGG, GGCTGGG, GGGCAGG,
and GGGTGGG. Our results are in line with recent work reported by
Lin’s group.[45] The authors extensively
investigated sequence motifs in Oris using the MEME tool and reported
that CpG-rich sequence motifs were observed in humans, mice, and A. thaliana, while three yeasts, K.
lactis, P. pastoris, and S. pombe, and D. melanogaster display preferences for AT-rich motifs.
It should be noted that though D. melanogaster has a composition similar to that of yeasts, the trends of its structural property
profiles are in congruence with those of humans (Figure 2). On closer inspection of word composition,
we observed long repeats of CA or TG and TA steps. The cell type-specific
composition analysis also reveals common trends (Supplementary table 2). In D. melanogaster, heptamers with CA steps are observed in all cell types (S2,
Bg3, and Kc). Mouse data sets (MEF, P19, ES1, and ES2) have A-tracts
and CA-containing oligonucleotides. Human cell types MCF7 and K562
have similar word composition, with GGGAGGG or its complementary sequence
CCCTCCC being enriched, while the Hela data sets (Hela1 and Hela2)
show an abundance of A-tracts. The cell type similarities and differences
are also consistent with earlier published results.[45] However, it should be noted that the ES1 and ES2 data sets
for mouse and the human Hela data sets are too small, statistically or
on a genome-wide scale, to derive strong conclusions.[45] The high incidence of AT-rich sequences in Oris of lower
eukaryotes is reflected in their lower DNA duplex stability, higher
propeller angles, and rigidity. Higher eukaryotes, like humans in
this data set, seem to be enriched with G-quadruplex-forming G4 motifs,
i-motifs, and oligo G-tracts, besides A-tracts. Following on from this, we
analyzed the preponderance of various structural motifs along
with CpG islands in detail (Table 2 and Figure 3).
Table 2
Propensity of Well Characterized Sequence Motifs in Oris in Eukaryotes (a)

| organism | i-motif density | G-quad density | A-tracts | G-tracts | ARS | TATA box |
|---|---|---|---|---|---|---|
| S. cerevisiae | 0.01 | 0.02 | 0.99 | 0.11 | 0.26 | 0.95 |
| K. lactis | 0.00 | 0.01 | 0.94 | 0.08 | 0.13 | 0.89 |
| P. pastoris | 0.01 | 0.02 | 0.94 | 0.04 | 0.07 | 0.75 |
| C. glabrata | 0.09 | 0.05 | 0.96 | 0.35 | 0.21 | 0.93 |
| S. pombe | 0.01 | 0.01 | 1.00 | 0.08 | 0.33 | 0.97 |
| D. melanogaster | 0.19 | 0.20 | 0.96 | 0.33 | 0.28 | 0.92 |
| A. thaliana | 0.02 | 0.02 | 0.98 | 0.04 | 0.21 | 0.88 |
| mouse | 0.64 | 0.66 | 0.92 | 0.70 | 0.09 | 0.61 |
| human | 0.57 | 0.57 | 0.86 | 0.34 | 0.10 | 0.49 |
(a) Densities of i-motifs,
G-quadruplexes,
A-tracts, G-tracts, autonomously replicating sequences (ARS), and
TATA boxes are shown in the table. The 1000-nt sequences downstream
of the Ori start sites were considered for this table.
Figure 3
Positional distribution of (a) A-tracts, (b) G-tracts, G-quadruplexes,
and intercalated motifs, and (c) CpG islands in Ori regions of various
eukaryotes. The regular expressions “A{7} or T{7}”, “G{7} or C{7}”,
“G{3–5}N{1–7}G{3–5}N{1–7}G{3–5}N{1–7}G{3–5}”, and “C{3–5}N{1–7}C{3–5}N{1–7}C{3–5}N{1–7}C{3–5}” were searched in the −5000 to +5000 region relative
to origin start sites and summed for each 200 nucleotide bin to define
A-tracts, G-tracts, G-quadruplexes, and intercalated motifs, respectively. In yeasts
(S. cerevisiae, K. lactis, and S. pombe), A-tracts are prevalent
in the vicinity of Oris, while in D. melanogaster and humans, G-tracts, G-quadruplexes, and i-motifs are preferred.
CpG islands are observed in D. melanogaster, humans, and A. thaliana. CpG islands
in the −5000 to +5000 regions were searched using the “CpG island
searcher” program with a 500 nt window.[46]
The replication process involves
the generation of ssDNA, which
can provide an opportunity for the formation of secondary structure
elements such as i-motifs, G-quadruplexes, and cruciform DNA. These
structures may affect both the fidelity and processivity of the
polymerization reaction. It is not yet clear how organisms handle
the resulting genome instability and how such regions are conserved in metazoans.
H-DNA can induce the stalling of the replication machinery.[47,48] Here, we report the preponderance of structurally constrained B-DNA
sequence motifs (A-tracts and G-tracts) and non-B-DNA-forming sequence
motifs (G-quadruplexes and i-motifs). The occurrence of some well-characterized
sequence elements, such as oligo-A or G-tracts and G4 motifs,
in the Ori regions of nine eukaryotic organisms is listed in Table 2. Oris
of S. cerevisiae, S. pombe, and D. melanogaster are highly enriched in oligo-A tracts and moderately enriched
in TATA box-like sequences (Table 2). G-tracts, another structural motif, are observed
in D. melanogaster and humans along
with putative G-quadruplex and i-motif sequences. Further, the previously
established feature of CpG islands in D. melanogaster and human origins of replication is now also revealed in A. thaliana. Altogether, the composition and motif
search analyses reveal that motif preferences in origins of replication
differ among systems: yeasts are AT-rich, with A-tracts in particular,
while mammals show a high preference for GC-rich motifs.
Although the GC composition differs among eukaryotes, the
conservation of antinucleosomal sequences (A-tracts,
G-tracts, and G4 motifs) is a common principle in eukaryotic origins.
Eukaryotic Origins of Replication May Be Linked to Promoter Regions
Promoters are crucial for transcription, and
their activity is conferred by stereotypical sequence motifs, such as Inr
(initiator element), the TATA box, BRE (TFIIB recognition element), and DPE
(downstream promoter element), at well-defined locations relative
to the transcription start sites.[20] Origins
of replication have a similar chromatin environment and share
some genetic features with transcription-activating sequences
or promoters.[2] Mounting evidence shows
that Oris tend to lie in the vicinity of
transcription start sites (TSSs).[2,49] The commonly
noticed links between eukaryotic replication and transcription are
due to shared nucleosome-depleted regions.[50] In metazoans, Oris are concentrated near core promoter regions.[8] This may further be due to the preferential association
of CpG islands with both promoter regions and origins of replication.[51,52] In yeast, Oris are associated with ARS, antinucleosomal sequences,
and precisely positioned nucleosomes (+1 and −1 nucleosomes).[53] In yeasts, the distance between Ori start sites
and transcription start sites is less than 500 nucleotides in 31.46%
of the sequences studied.[54] We therefore
addressed the link between Oris and promoters by analyzing the distribution
of consensus transcription factor binding sequences, or promoter elements,
in the vicinity of Ori start sites, searching the Ori regions for the known
core promoter elements. We observed no common trend
in the relation between Oris and promoters across all
the systems studied. However, a few promoter elements are prominent
in the majority of the systems (Supplementary figure 3). Figure 4 shows the density of general transcription factor binding sites
in yeasts and A. thaliana. The distribution
of TATA boxes [consensus site: TATAWAWR] in S. cerevisiae, K. lactis, P. pastoris, and S. pombe is shown in blue,
and the distributions of BREu [SSRCGCC], DCE-I [CTTC], DCE-III [AGC],
and Pause-button [KCGRWCG] in A. thaliana are shown as green bar plots. A preponderance of TATA boxes
is observed in all yeast species in our data set, with peak occurrence
at approximately positions 200, 400, 800, and 600 for S. cerevisiae, K. lactis, P. pastoris, and S. pombe, respectively (Figure 4a). However, it should be noted that the
Ori regions in yeasts are AT-rich, so some enrichment of
TATA boxes is expected from composition alone. The plant genome, A. thaliana, displays typical results in connection with Oris and promoters
(Figure 4b). The promoter
elements BREu, DCE-I, DCE-III, and Pause-button are
overrepresented in these regions. From this result, we speculate that
TATA box-containing genes are associated with origins of replication
in yeasts. An apparent link can also be observed in A. thaliana, where a few core promoter elements
are preponderant, suggesting that promoters and origins of
replication are linked.
Figure 4
Positional distribution of promoter sequence
elements in Ori regions
in (a) yeasts and (b) A. thaliana.
The plot shows the preponderance of the TATA box [TATAWAWR] in Ori sequences
[−5000 to +5000 relative to Ori start sites at position 0] in the
yeast species S. cerevisiae, K. lactis, P. pastoris, and S. pombe. Plots with green-colored
bars indicate the occurrence of promoter elements BREu [SSRCGCC],
DCE-I [CTTC], DCE-III [AGC], and Pause-button [KCGRWCG] in A. thaliana. The IUPAC nucleotide code is K = G or
T, R = A or G, S = C or G, and W = A or T. Promoter sequence motif information
was retrieved from the literature (eukaryotic core promoters and the
functional basis of transcription initiation). The positional distribution
of promoter sequence elements in Ori regions in all systems is also
displayed in Supplementary figure 3.
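The consensus search described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the forward-strand-only scan, and the `origin_index` parameter (the index of the Ori start site within the extracted sequence) are assumptions introduced here. Only the IUPAC codes and the consensus strings come from the text.

```python
import re

# Standard IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}

def iupac_to_regex(consensus):
    """Translate an IUPAC consensus such as TATAWAWR into a compiled regex."""
    body = "".join(IUPAC[base] for base in consensus.upper())
    # A lookahead is used so that overlapping occurrences are all reported.
    return re.compile("(?=(" + body + "))")

def motif_positions(seq, consensus, origin_index=5000):
    """Return motif start positions relative to the Ori start site (position 0).

    Assumption: only the given strand is scanned; a reverse-complement scan
    would be needed to cover both strands.
    """
    pattern = iupac_to_regex(consensus)
    return [m.start() - origin_index for m in pattern.finditer(seq.upper())]
```

For a 10 001-nucleotide window centred on the Ori start site, `motif_positions(window, "TATAWAWR")` would give positions on the −5000 to +5000 scale used in the figure.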
Conclusions
Our comprehensive work
focuses on unveiling DNA structural features
in the origins of replication of eukaryotic systems and concludes
that eukaryotic Oris have characteristic signature structural profiles.
We observed that Oris of lower eukaryotes are more meltable and rigid
than the surrounding sequences. The complex replication process
depends on the interaction between cis-regulatory modules and a set
of regulatory proteins. The structural signals may aid this interaction
by making DNA nucleosome-free (antinucleosomal sequences such as A-tracts)
and easy to melt (reduced free energy for DNA melting). This work
is a conceptual update to the current knowledge of Ori sequences
in the region where the replication fork emanates. The molecular mechanisms
regulating DNA replication may be highly conserved, but the secondary
structural elements of Oris vary among yeasts, invertebrates, vertebrates,
and plants. Further, the CG-rich sequence motifs, which act as hot
spots for DNA methylation in higher eukaryotes, suggest that epigenetic
features may modulate the replication mechanism precisely. Our approach
can contribute to a better understanding of the mechanisms involved in replication.
Unraveling the DNA structure of dormant, constitutive, and
facultative Oris is a future direction of this work.
Materials and Methods
Origins of Replication Data Sets
Experimentally mapped
endogenous replication initiation sites (Table ) were retrieved from DeOri version 6 (http://tubic.org/deori/).[31] The database features eukaryotic DNA replication
origins identified by genome-wide experimental studies. The genomic
locations of Oris for Saccharomyces cerevisiae, Kluyveromyces lactis, Candida glabrata strain CBS138, Pichia
pastoris, Schizosaccharomyces pombe, Drosophila melanogaster, mice, humans,
and Arabidopsis thaliana were retrieved
from the database. It should be noted that current experimental
methods, such as ChIP-seq, SNS-seq (sequencing of RNA-primed short nascent
DNA strands), and replication bubble- and Okazaki fragment-based methods,
cannot determine the replication start sites precisely; their resolution
varies from a few bases to kilobases.[49,55] They can only delimit small regions that contain Oris.[38] In this work, we have chosen the starting locus
of the Ori regions provided by DeOri and refer to them as Ori start
sites.
The genome locations were mapped to the genomes, and sequences
of −5000 to +5000 nucleotides relative to Ori start sites (position
0 is the genomic start location provided by DeOri) were extracted for
the analysis. The numbers of sequences used in this study are 357,
144, 256, 294, 345, 7156, 2412, 94195, and 1533 for Saccharomyces cerevisiae, Kluyveromyces
lactis, Candida glabrata strain CBS138, Pichia pastoris, Schizosaccharomyces pombe, Drosophila
melanogaster, mice, humans, and Arabidopsis
thaliana, respectively. The data set covers various
families of eukaryotic life and can thus be used to draw representative
conclusions. Whole-genome sequences for S. cerevisiae, Kluyveromyces lactis, Candida glabrata strain CBS138, Pichia
pastoris, Schizosaccharomyces pombe, Drosophila melanogaster, and Arabidopsis thaliana were retrieved from the NCBI data
bank (https://www.ncbi.nlm.nih.gov/genome). Mouse (mm8) and human (hg19) genomes were downloaded from the
UCSC Genome site (http://genome.ucsc.edu/).[56] Further, we also included tissue-specific
Ori data sets for D. melanogaster (Kc,
Bg3, and S2), mice (MEF, P19, ES1, and ES2), and humans (MCF7, K562,
Hela1, and Hela2) in our analysis. The sequence length of −5000
to +5000 relative to Ori start sites was chosen based on empirical
evidence from DNA structural features, such as free energy and
flexibility, relative to the Ori in humans. The
span of signature regions in humans extends beyond 4000 nucleotides
on both sides of the Ori. Hence, for comparative analysis, we selected
the same region for all the organisms in our data set.
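The window extraction described in this section can be sketched as below. The function name, the 1-based coordinate convention, and the choice to discard windows that would run past a chromosome end are assumptions made for illustration; the text does not state how DeOri coordinates are indexed or how truncated windows were handled.

```python
def extract_ori_window(chrom_seq, ori_start, flank=5000):
    """Return the sequence from -flank to +flank around an Ori start site.

    Assumptions (not stated in the source): ori_start is a 1-based genomic
    coordinate, and windows that would extend past either chromosome end are
    skipped (None) so that every retained sequence has length 2 * flank + 1.
    """
    center = ori_start - 1        # convert the 1-based coordinate to a 0-based index
    left = center - flank
    right = center + flank + 1    # slice end is exclusive
    if left < 0 or right > len(chrom_seq):
        return None
    return chrom_seq[left:right].upper()
```

With the default `flank=5000`, position 0 of the paper's coordinate scale sits at index 5000 of the returned string.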
DNA Structural Profile Enumeration
The initiation of
replication involves the search for proper Ori sequences by the replication
machinery proteins, orchestration of different trans factors, DNA-protein
recognition, formation of stable complexes, and finally, open
complex formation. Here, we used k-mer (k = 2–4) nucleotide descriptors to relate to the various processes
of replication. DNA stability and melting temperature models can explain
the sequence preferences for open complex formation, DNA bending models
may explain sequence search and orchestration, and propeller twist
and minor groove models explain DNA-protein recognition. The propeller
twist can also explain the rigidity of DNA.
DNA Stability and Melting Temperature Models
DNA duplex
stability or free energy of the fragment of DNA depends on hydrogen
bonds between bases and the stacking interaction between consecutive
bases and can be computed by summing the free energy of the constituent
dinucleotides.[20] The melting temperature
of a DNA fragment directly depends on DNA stability. A dinucleotide
descriptor based on the collection of melting studies of 108 oligonucleotides[57] has been used for computing DNA stability. Further,
another model based on normalized dinucleotide empirical melting temperature
descriptors[58] was also utilized for comparison.
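As a worked illustration of summing dinucleotide free energies, the sketch below uses the widely quoted unified nearest-neighbour ΔG°(37 °C) values in kcal/mol. Treating these as the descriptor used in this study is an assumption, since the paper's parameter table is not reproduced in this section. The function name is hypothetical, and helix initiation and terminal corrections are deliberately omitted.

```python
# Assumed descriptor: unified nearest-neighbour free energies (kcal/mol, 37 degrees C),
# written 5'->3' on one strand; complementary steps share a value.
NN_DG = {
    "AA": -1.00, "TT": -1.00,
    "AT": -0.88, "TA": -0.58,
    "CA": -1.45, "TG": -1.45,
    "GT": -1.44, "AC": -1.44,
    "CT": -1.28, "AG": -1.28,
    "GA": -1.30, "TC": -1.30,
    "CG": -2.17, "GC": -2.24,
    "GG": -1.84, "CC": -1.84,
}

def duplex_free_energy(fragment, table=NN_DG):
    """Sum the free energies of all overlapping dinucleotide steps.

    Initiation and terminal penalties are not included; more negative totals
    indicate a more stable duplex.
    """
    fragment = fragment.upper()
    return sum(table[fragment[i:i + 2]] for i in range(len(fragment) - 1))
```

Under these values an AT-rich fragment such as `AAAA` sums to a less negative total than a GC-rich fragment such as `GCGC`, which is the sense in which AT-rich Oris are described as less stable.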
DNA Bendability Models
Bending flexibility, or bendability,
of a sequence is the anisotropic bending of DNA under the influence
of DNA-binding factors such as proteins. The bending propensity of
sequences was computed using two genome context-derived trinucleotide
descriptors, the DNase I sensitivity model[59] and the nucleosome positioning preference (NPP) model.[60] More negative values from the DNase I sensitivity model,
or less positive values from the NPP model,
indicate greater rigidity of a given DNA fragment.
Propeller Twist and Minor Groove Width
The propeller
twist is the inherent or induced non-planarity of a base pair, quantified
as the relative angle of rotation between the paired bases about their
common y-axis. DNA sequences with more negative
propeller twist values are more rigid (e.g., A-tracts). The propeller twist
angle values based on X-ray crystal structures[12] were retrieved from DiProDB (dinucleotide property database)[61] for all 16 dinucleotides. In B-DNA,
grooves arise because the two glycosyl bonds branch off from one
side of the hydrogen-bonded base pair. In the minor groove, the backbones
lie closer together, and its width is a key factor in indirect readout
for DNA-protein recognition. The tetranucleotide model derived from
protein-DNA crystal structure complexes[14] was employed for minor groove width computation in this study.
With the value of each unique dimer, trimer, or tetramer feature known,
a one-nucleotide sliding window model can convert a
given sequence into a numerical profile. Smoothing windows with a
size of 15 nucleotides (corresponding to 14 dinucleotide steps) for
dinucleotide models and 30 nucleotides for tri- or tetranucleotide
structural descriptors were employed based on our previous studies.[18,20,21]
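The sliding-window conversion can be sketched generically for any k-mer descriptor table. The function names are hypothetical, and taking the mean (rather than the sum) over each window is an assumption, as the text does not specify the aggregation. A window of w nucleotides is taken to contain w − k + 1 k-mer steps, which reproduces the 14 dinucleotide steps quoted for a 15-nucleotide window.

```python
def kmer_values(seq, table, k):
    """Look up one descriptor value per k-mer, sliding by one nucleotide.

    Raises KeyError for k-mers absent from the table (e.g. those containing N).
    """
    seq = seq.upper()
    return [table[seq[i:i + k]] for i in range(len(seq) - k + 1)]

def moving_average(values, n_steps):
    """Average n_steps consecutive values, advancing one step at a time."""
    if n_steps <= 0 or len(values) < n_steps:
        return []
    window_sum = sum(values[:n_steps])
    profile = [window_sum / n_steps]
    for i in range(n_steps, len(values)):
        # Add the value entering the window and drop the one leaving it.
        window_sum += values[i] - values[i - n_steps]
        profile.append(window_sum / n_steps)
    return profile

def structural_profile(seq, table, k, window_nt):
    """Numerical profile of seq for a k-mer descriptor and a window in nucleotides."""
    return moving_average(kmer_values(seq, table, k), window_nt - k + 1)
```

For a dinucleotide property table `t`, `structural_profile(seq, t, k=2, window_nt=15)` averages 14 steps per window. For a trinucleotide table, `k=3, window_nt=30` averages 28 steps.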
Computation of Structural Motifs
A DNA G-quadruplex
is defined as a four-stranded DNA structure composed of stacked
guanine tetrads.[62] G-quadruplex-forming
sequences in genomes are predicted from the primary DNA
sequence. Putative G-quadruplex sequences were
identified using a simple pattern match, G3–5N1–7G3–5N1–7G3–5N1–7G3–5,[63] where N represents a linker nucleobase and
can be any of the four nucleotides. The complementary sequence on the
other strand of the G-quadruplex [C3–5N1–7C3–5N1–7C3–5N1–7C3–5] can form an intercalated
motif. We searched for i-motifs (intercalated motifs) separately,
as their significant role in the human genome has been described in a recent
study.[65] G-quadruplex motifs, i-motifs,
A-tracts, and G-tracts were computed using pattern search methods.
Long stretches of A or G can act as antinucleosomal sequences. A-tracts
constitute a stretch of four or more continuous A/T base pairs,
excluding the flexible TA dinucleotide step. G-tracts (G7 or C7) were also computed, as poly(A) and poly(G) can act
as antinucleosomal sequences.[66,67]
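The pattern searches in this section can be sketched with regular expressions. The names below are hypothetical, and several choices are assumptions rather than the authors' implementation: matches are non-overlapping, a run of seven or more G (or C) counts as one G-tract, and an A-tract is taken as a run of A followed by T (the only A/T arrangement with no TA step) of at least four nucleotides. The 200-nucleotide bin size follows the Figure 3 caption.

```python
import re

# Patterns transcribed from the text; N is any of A, C, G, T.
G4 = re.compile(r"G{3,5}[ACGT]{1,7}G{3,5}[ACGT]{1,7}G{3,5}[ACGT]{1,7}G{3,5}")
I_MOTIF = re.compile(r"C{3,5}[ACGT]{1,7}C{3,5}[ACGT]{1,7}C{3,5}[ACGT]{1,7}C{3,5}")
G_TRACT = re.compile(r"G{7,}|C{7,}")
# An A/T stretch without a TpA step is a run of A followed by a run of T.
AT_RUN = re.compile(r"A+T*|T+")

def motif_starts(seq, pattern, min_len=1):
    """Start indices of non-overlapping matches at least min_len long."""
    return [m.start() for m in pattern.finditer(seq.upper())
            if len(m.group()) >= min_len]

def a_tract_starts(seq, min_len=4):
    """A-tracts: A/T runs of min_len or more nucleotides containing no TA step."""
    return motif_starts(seq, AT_RUN, min_len)

def bin_counts(starts, seq_len, bin_size=200):
    """Count motif starts in consecutive bins of bin_size nucleotides."""
    n_bins = -(-seq_len // bin_size)  # ceiling division
    counts = [0] * n_bins
    for s in starts:
        counts[s // bin_size] += 1
    return counts
```

For one extracted Ori window `w`, `bin_counts(motif_starts(w, G4), len(w))` gives per-bin counts for that sequence. Summing these lists across all sequences of a species yields a positional distribution of the kind plotted in Figure 3.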
CpG Island Calculations and Promoter Motif Element Search
CpG islands (CGIs) are
described as DNA sequences with a length greater
than 500 nucleotides, a GC percentage ≥55, and a ratio of observed/expected
CpG content ≥0.65. CGI start locations in the Ori regions were
predicted using a published method.[46]
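The three criteria can be checked for a single window as sketched below. This is not the published CpG island searcher algorithm, which also handles window sliding and island extension. The function names are hypothetical, the observed/expected ratio uses the standard definition (number of CpG × length) / (number of C × number of G), and accepting a window of exactly 500 nucleotides is an assumption made so that a 500 nt window can qualify.

```python
def cpg_stats(window):
    """Return (GC percentage, observed/expected CpG ratio) for a sequence window."""
    window = window.upper()
    n = len(window)
    c = window.count("C")
    g = window.count("G")
    cpg = window.count("CG")
    gc_percent = 100.0 * (c + g) / n if n else 0.0
    # Expected CpG count under independence is (C count * G count) / length.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_percent, obs_exp

def meets_cgi_criteria(window, min_len=500, min_gc=55.0, min_oe=0.65):
    """True if the window satisfies the length, GC, and CpG o/e thresholds."""
    gc_percent, obs_exp = cpg_stats(window)
    return len(window) >= min_len and gc_percent >= min_gc and obs_exp >= min_oe
```

Sliding such a check along the −5000 to +5000 region in steps of one nucleotide would flag candidate CGI start positions, though the published method should be used to reproduce the reported results.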