Delving into Eukaryotic Origins of Replication Using DNA Structural Features.

Venkata Rajesh Yella¹, Akkinepally Vanaja¹,², Umasankar Kulandaivelu², Aditya Kumar³.

Abstract

DNA replication in eukaryotes is an intricate process, which is precisely synchronized by a set of regulatory proteins, and the replication fork emanates from discrete sites on chromatin called origins of replication (Oris). These spots are considered as the gateway to chromosomal replication and are stereotyped by sequence motifs. The cognate sequences are noticeable in a small group of entire origin regions or totally absent across different metazoans. Alternatively, the use of DNA secondary structural features can provide additional information compared to the primary sequence. In this article, we report the trends in DNA sequence-based structural properties of origin sequences in nine eukaryotic systems representing different families of life. Biologically relevant DNA secondary structural properties, namely, stability, propeller twist, flexibility, and minor groove shape were studied in the sequences flanking replication start sites. Results indicate that Oris in yeasts show lower stability, more rigidity, and narrow minor groove preferences compared to genomic sequences surrounding them. Yeast Oris also show preference for A-tracts and the promoter element TATA box in the vicinity of replication start sites. On the contrary, Drosophila melanogaster, humans, and Arabidopsis thaliana do not have such features in their Oris, and instead, they show high preponderance of G-rich sequence motifs such as putative G-quadruplexes or i-motifs and CpG islands. Our extensive study applies the DNA structural feature computation to delve into origins of replication across organisms ranging from yeasts to mammals and including a plant. Insights from this study would be significant in understanding origin architecture and help in designing new algorithms for predicting DNA trans-acting factor recognition events.

Year: 2020 PMID: 32566825 PMCID: PMC7301376 DOI: 10.1021/acsomega.0c00441

Source DB: PubMed Journal: ACS Omega ISSN: 2470-1343

Introduction

The genetic information between generations is preserved by the mechanism of DNA replication, which forms the basis of heredity. The precision of the process is achieved by activating it only once during each cycle of cell division.[1] Initiation of DNA replication is accomplished through two successive regulated steps: origin licensing and origin activation.[2] The first step, origin licensing, occurs in the G1 phase, where highly conserved replication initiation proteins are sequentially loaded onto DNA sequences known as origins of replication (Oris) to form the pre-replicative complex (preRC).[3] Next, origin activation occurs during the S phase, when additional proteins are recruited to the preRC. The unwinding and the synthesis of the two daughter DNAs start simultaneously after origin activation. A diverse set of factors, such as regulatory proteins, remodelers, replicator sequences, small noncoding RNAs, epigenetic mechanisms, chromatin configuration and domains, nuclear envelopes, and subcellular compartmentalization, regulate the replication mechanisms temporally and spatially.[2]

Oris, the genetic determinants of the cell, form the anchoring site for the replication machinery and are known to have a signature sequence context. In line with the classic replicon model, replication is initiated through recognition of cis-acting elements by trans-acting initiator proteins.[4] However, this cis-acting model applies strictly only to bacteria and a few lower eukaryotes, and the code is not yet clear in complex genomes. In metazoans, the various proteins involved in the orchestration of replication are conserved, whereas the genetic determinants are rapidly evolving. In Saccharomyces cerevisiae, Oris have AT-rich autonomously replicating sequences (ARSs), which encompass 11–17 nucleotide consensus motifs.[5] In S. pombe, ARSs are not observed, but long AT tracts can play their role instead.[6] Origins of the yeast species S. japonicus have GC-rich sequences.[7] In comparison, metazoan Oris are preponderant in genomic regions with high GC content,[8] such as CpG islands and G-rich elements, the G-quadruplex-forming sequences. Further, an origin G-rich repeated element (OGRE), which can favor G-quadruplex formation, has been identified in the majority of D. melanogaster and mammalian Oris.[8] The metazoan initiation sites may have some common genetic determinants, but robust consensus sequence signals are not widely displayed. In short, research indicates that primary DNA sequences at Oris vary across families of life, and the current understanding of Oris is still far from satisfactory.

The replication process, which involves intricate DNA-protein recognition events (access and orchestration of replication machinery) and melting of origin DNA, occurs in the context of the three-dimensional structure of DNA. Hence, it is pivotal to study Oris in terms of descriptors of the DNA structure. B-DNA is the most common structural form under physiological conditions. It can display extensive structural polymorphism both at the gross 3D level (non-B DNA) and at local scales (DNA structural features). The literature reports more than 20 noncanonical DNA secondary structures,[9] and the biologically relevant ones include cruciform DNA, G-quadruplexes, intercalated motifs, hairpins, triple helices (H-DNA), and slipped structures. Recent work suggests that DNA conformation may diverge from the canonical B-form in approximately 13% (or 394.2 Mb) of the human genome.[10] Quantitative studies on various X-ray crystal structures of free B-DNA illustrated that naked DNA explores a significant amount of conformational space,[11,12] and sequence-dependent perturbations have been characterized.[13] The sequence-dependent fluctuations in local helicoidal parameters (rotational and translational), which lead to variations at a gross level, are involved in DNA melting, DNA-protein recognition, nucleosome organization, chromatin configuration, and genome integrity. Extensive experimental measurements, theoretical simulations, and computational studies have led to the establishment of various DNA structural features, namely, duplex stability, intrinsic curvature, protein-induced bendability, groove shape, topography,[11,14−23] and DNA crookedness.[24] Further studies on the phylogenetics of conserved regions/regulatory elements showed that local topography and DNA shape are more conserved and evolutionarily constrained than the DNA base sequences in various vertebrates.[25] Various tools based on DNA structural features were designed for large-scale applications in genomics during the last decade.[26]

In our earlier works, we have extensively studied DNA structural parameters to characterize prokaryotic and eukaryotic promoters,[21,25] to understand DNA transcription factor recognition[27] and conservation of DNA structural properties in promoter regions,[28] to predict promoter regions in genomes,[19] to delineate TATA-containing and TATA-less promoters,[17] and to link the DNA structure with gene expression variability.[16,18] Fewer studies have characterized origins of replication, in S. cerevisiae,[29] D. melanogaster, and humans.[30] These reports are limited to one or two organisms with smaller genomic regions surrounding the replication start sites. The current study focuses on DNA structural features in the vicinity of origin start sites in nine different eukaryotic systems, including species of yeasts, humans, and plants.

Results and Discussion

The origins of replication in nine eukaryotic systems, Saccharomyces cerevisiae, Kluyveromyces lactis, Candida glabrata, Pichia pastoris, Schizosaccharomyces pombe, Drosophila melanogaster, mice, humans, and Arabidopsis thaliana, are examined here; all are regarded as model organisms for discerning eukaryotic replication at different levels and aspects. The systems vary in their genomic GC content (36–42%) (Table 1) and nucleotide base composition, are widely investigated, and have experimentally inferred replisome information available.[31] Here, we computed the sequence composition and sequence-dependent structural features in the Oris of these systems to understand their similarities and differences. Our analysis included long flanking regions, from −5000 to +5000 nt relative to the starting genomic loci of the origins of replication listed in the DeOri database.[31] Throughout this manuscript, we refer to position “0” as the Ori start site and to these regions as Ori regions or Ori sequences. The strategy implemented for this work is outlined in Figure 1.

Table 1

Genomic Features of Data Sets and Compositional Analysis (Most Occurring Di, Tri, Tetra, and Heptamers) of Oris in Eukaryotes^a

						most represented oligonucleotides
name of organism	genome size (in mb)	no. of Chr	no. of Ori sites	genome GC %	GC % of Ori region	di	tri	tetra	hepta
Saccharomyces cerevisiae	12.07	16	357	38.15	36.05	AA (11.59)	AAA (4.49)	AAAA (1.85)	AAAAAAA (0.217)
						TT (11.53)	TTT (4.43)	TTTT (1.79)	TTTTTTT (0.198)
						AT (9.77)	AAT (3.23)	AAAT (1.22)	ATATATA (0.097)
						TA (8.56)	ATT (3.21)	ATTT (1.21)	TATATAT (0.091)
						TG (6.20)	ATA (3.05)	ATAT (1.14)	AAAAAAT (0.077)
Kluyveromyces lactis	10.68	6	144	38.76	35.02	TT (11.57)	AAA (4.16)	AAAA (1.49)	AAAAAAA (0.133)
						AA (11.34)	TTT (4.16)	TTTT (1.44)	TTTTTTT (0.116)
						AT (10.12)	AAT (3.32)	ATAT (1.24)	ATATATA (0.105)
						TA (8.95)	ATT (3.29)	ATTT (1.21)	TATATAT (0.101)
						TG (6.42)	TAT (3.21)	AAAT (1.17)	AAATAAA (0.063)
Candida glabrata strain CBS138	4.81	13	256	39.03	33.85	AA (11.61)	AAA (4.29)	AAAA (1.64)	AAAAAAA (0.128)
						TT (11.41)	TTT (4.20)	TTTT (1.61)	TTTTTTT (0.116)
						AT (10.73)	TAT (3.61)	ATAT (1.41)	ATGTTTT (0.102)
						TA (9.82)	ATA (3.58)	AAAT (1.31)	ACCAAAA (0.087)
						TG (6.57)	AAT (3.54)	TATT (1.25)	TTTTTAT (0.084)
Pichia pastoris	9.35	4	294	41.13	39.51	AA (10.16)	AAA (3.42)	AAAA (1.21)	AAAAAAA (0.091)
						TT (10.08)	TTT (3.40)	TTTT (1.19)	TTTTTTT (0.088)
						AT (8.39)	AAT (2.75)	AAAT (0.89)	AAAAAAT (0.046)
						TA (6.92)	ATT (2.68)	AATT (0.86)	ATTTTTT (0.045)
						GA (6.66)	TTG (2.40)	ATTT (0.85)	TCTTTTT (0.043)
Schizosaccharomyces pombe	12.59	3	345	36.06	30.79	AA (14.27)	TTT (6.26)	TTTT (2.81)	TTTTTTT (0.459)
						TT (14.25)	AAA (6.24)	AAAA (2.78)	AAAAAAA (0.428)
						AT (10.55)	AAT (4.12)	AAAT (1.75)	TTTATTT (0.162)
						TA (9.80)	ATT (4.09)	ATTT (1.74)	ATTTTTT (0.160)
						TG (5.53)	TAA (3.59)	TAAA (1.59)	AAATAAA (0.160)
Drosophila melanogaster (S2)	137.55	4	7156	42.29	43.8	TT (9.40)	TTT (3.38)	TTTT (1.23)	AAAAAAA (0.140)
						AA (9.37)	AAA (3.37)	AAAA (1.23)	TTTTTTT (0.131)
						AT (7.59)	ATT (2.53)	ATTT (0.97)	TTTATTT (0.071)
						CA (6.84)	AAT (2.51)	AAAT (0.96)	AAAAATA (0.062)
						TG (6.84)	TTG (2.16)	AATT (0.83)	TTTTATT (0.061)
Arabidopsis thaliana	119.16	5	1533	36.05	41.53	AA (9.75)	AAA (3.33)	AAAA (1.15)	AAAAAAA (0.099)
						TT (9.64)	TTT (3.28)	TTTT (1.14)	TTTTTTT (0.098)
						AT (7.70)	AGA (2.41)	AAGA (0.87)	AAGAAGA (0.059)
						GA (7.13)	TCT (2.39)	TCTT (0.87)	TCTTCTT (0.059)
						TC (7.13)	GAA (2.32)	AGAA (0.85)	AGAAGAA (0.057)
mouse (P19)	2716.96	20	2412	42	50.38	CT (7.97)	CTG (2.65)	TTTT (0.87)	TTTTTTT (0.167)
						AG (7.87)	CAG (2.61)	AAAA (0.82)	AAAAAAA (0.153)
						TG (7.65)	TTT (2.45)	CTGG (0.78)	TGTGTGT (0.100)
						CA (7.58)	AAA (2.37)	CCAG (0.78)	ACACACA (0.099)
						CC (7.21)	CCT (2.33)	CCTG (0.78)	GTGTGTG (0.096)
human (MCF7)	3259.56	23	94,195	41	57.76	GG (10.33)	GGG (3.61)	CAGG (1.20)	CCCTCCC (0.064)
						CC (10.32)	CCC (3.61)	CCTG (1.20)	GGGAGGG (0.063)
						CT (8.17)	CAG (3.31)	CTGG (1.15)	GGCTGGG (0.062)
						AG (8.17)	CTG (3.30)	CCCC (1.14)	GGGCAGG (0.062)
						TG (8.05)	CCT (3.00)	GGGG (1.14)	CCCAGCC (0.062)

Origin sequences were downloaded from the DeOri database for computing the GC percent and k-mer frequencies (k = 2, 3, 4, and 7). The numbers in parentheses indicate the absolute percentage frequency of oligonucleotides observed in the data sets. The five most frequent words are displayed in the table. The frequency of a k-mer depends on the GC percentage and also on the arrangement of nucleotide steps, which is characteristic of Ori regions. Word composition for the different cell-type data sets of D. melanogaster, mouse, and human is shown in Supplementary Table 2.
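
The percentage frequencies and GC percent described in this footnote can be illustrated with a short counting routine. The Python sketch below is illustrative only: the function names `kmer_percentages` and `gc_percent`, the use of overlapping windows, and the skipping of windows with non-ACGT characters are assumptions, since the footnote does not state how words were counted.

```python
from collections import Counter

def kmer_percentages(sequences, k):
    """Return {kmer: percent of all overlapping k-mers}, pooled over sequences.

    Assumptions (not stated in the source): windows overlap, input is
    upper-cased, and windows containing anything other than A/C/G/T are skipped.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) <= set("ACGT"):
                counts[word] += 1
    total = sum(counts.values())
    return {w: 100.0 * c / total for w, c in counts.items()} if total else {}

def gc_percent(sequences):
    """GC percentage over all A/C/G/T bases in the pooled sequences."""
    bases = "".join(sequences).upper()
    acgt = sum(bases.count(b) for b in "ACGT")
    gc = bases.count("G") + bases.count("C")
    return 100.0 * gc / acgt if acgt else 0.0

# Toy input for illustration, not DeOri data
toy = ["AAAATTTT", "AATT"]
top_dinucleotides = sorted(kmer_percentages(toy, 2).items(), key=lambda kv: -kv[1])[:5]
```

Pooling all sequences before normalizing weights long origins more heavily than short ones; averaging per-sequence percentages would be an equally plausible choice.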

Figure 1

Analysis outline for computation of DNA structural features or motifs in origins of replication in the eukaryotic genomes. Experimentally mapped endogenous replication initiation sites are retrieved from the DeOri database (http://tubic.org/deori/).[31] Various different physiologically relevant DNA structural features and motifs, including stability, propeller twist, minor groove shape, G-quadruplexes, i-motifs, etc., were computed using lookup tables of di/tri/tetra nucleotide descriptors or regular expression patterns.

Ori Regions Display Signature Structural Profiles

In recent years, intensive experimental and computational analysis has been carried out on the sequence-dependent secondary structural properties of regulatory genomic sequences. Studies have also been carried out on DNA secondary structure or shape analysis of origins of replication in yeast[29,32] and in D. melanogaster and humans.[30] It has been observed that common DNA shape signatures in D. melanogaster and humans are marked by elevated propeller twist, roll angles, and minor groove width and reduced helical twist.[30] These studies cover few data sets and only one or two systems. The current study focuses on the characterization of the DNA structure in the vicinity of origins of replication. To explore the structural properties, we first aligned sequences encompassing origins of replication relative to their Ori start sites [−5000 nt to +5000 nt relative to Ori, where 0 indicates the genomic beginning locus for Ori sequences compiled in the DeOri database] and then computed the structural features of DNA sequences using lookup tables. We calculated the structural features for every k-mer (k = 2–4) in each DNA sequence as described in Materials and also in our previous work.[18,33] The averaged structural profile based on the nucleotide position can be considered as a consensus numerical signature or structural profile for a given organism.[34] Figure 2 displays the signature features, namely DNA duplex stability, melting temperature, propeller twist, bendability (DNase 1 and NPP models), and groove shape (minor groove width), of Ori regions of S. cerevisiae, K. lactis, S. pombe, D. melanogaster, humans, and A. thaliana.
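
The lookup-and-average procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `TOY_TABLE` is a placeholder descriptor (the number of G/C bases in each dinucleotide step) standing in for published parameter tables such as stacking free energy or propeller twist, and the function names are hypothetical.

```python
from itertools import product

# Placeholder dinucleotide descriptor: number of G/C bases in the step (0, 1 or 2).
# This is NOT a published parameter set; a real analysis would substitute
# measured values (e.g. free energy or propeller twist per dinucleotide).
TOY_TABLE = {a + b: float((a in "GC") + (b in "GC"))
             for a, b in product("ACGT", repeat=2)}

def step_values(seq, table, k=2):
    """Look up the descriptor of each overlapping k-mer; None if the k-mer is unknown."""
    seq = seq.upper()
    return [table.get(seq[i:i + k]) for i in range(len(seq) - k + 1)]

def average_profile(aligned_seqs, table, k=2):
    """Position-wise mean over equally long sequences aligned at the Ori start site.

    Positions where no sequence has a known k-mer yield None.
    """
    profiles = [step_values(s, table, k) for s in aligned_seqs]
    mean = []
    for column in zip(*profiles):
        vals = [v for v in column if v is not None]
        mean.append(sum(vals) / len(vals) if vals else None)
    return mean

# Two toy "aligned" sequences; index 0 plays the role of position -flank
profile = average_profile(["AAGG", "AACC"], TOY_TABLE)
```

With sequences spanning −5000 to +5000, list index i corresponds to position i − 5000 relative to the Ori start site; the same functions accept tri- or tetranucleotide tables by passing k = 3 or k = 4.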

Figure 2

DNA structural profiles of S. cerevisiae, K. lactis, S. pombe, D. melanogaster, human, and A. thaliana Ori sequences. The x-axis in all the plots represents the sequences spanning from the −5000 to +5000 region with respect to Ori start sites. The rows indicate the property, while the columns represent genomes. Average free energy, normalized melting temperature, propeller twist, flexibility (two models, DNase 1 sensitivity and nucleosome positioning preference), and minor groove width are shown. The models of normalized melting temperature, DNase 1 sensitivity, and nucleosome positioning preference measure the properties in arbitrary units. Blue-colored error bars indicate the standard error of the mean property values. Experimentally identified genomic locations of Ori start sites are retrieved from the DeOri database (http://tubic.org/deori/). The y-axis for each structural property is maintained with equal ranges.

In S. cerevisiae and K. lactis, low negative free-energy values are observed from the Ori start site up to about 1000 nucleotides relative to the Ori start site (or over a more extended region in S. cerevisiae), and a sharp free-energy maximum is observed around nucleotides 298 and 327 (Figure 2). Meanwhile, in S. pombe, the free-energy profile shows a radical departure from highly stable to less stable sequences in the vicinity of Ori start sites. The highly stable region spans from −1000 to 0, while the less stable region extends to 2000 nucleotides. To understand the unexpected behavior of S. pombe, we computed the word composition in the abovementioned regions separately. Composition analysis revealed that the region −1000 to −1 shows a preponderance of the words CCACCG, GCGGTC, GACCAC, CTGGGC, CGGGCC, and CTGGCG, at least 4 times more than the region 0 to 2000. In contrast, the latter region displays a higher preference for TTTTTT, TATTTA, AAAAAA, and AATTTA, at least 4 times that of the former region. Overall, yeasts display a low-stability region in the vicinity of Ori regions. In D. melanogaster, humans, and Arabidopsis, the trends of the free-energy profiles are reversed, with high-stability peaks in the region downstream of Oris. The melting temperature profiles in Figure 2 are similar to the free-energy profiles. It should be noted that lower DNA stability or melting temperature is mainly influenced by AT/GC composition. AT-rich sequences are intrinsically prone to melting, and in our study, we observed these regions at Oris in lower eukaryotes.

The results in this study are comparable to our previous work on free-energy profiles. The low-stability region or maxima in core promoters is a characteristic structural feature of all classes of bacteria and eukaryotes.[18,19,21,23,28] However, in Oris, the low-stability region spans a broad area of up to 1500 nucleotides relative to Ori start sites in yeast (while in promoters, it spans only −200 to −300 nucleotides relative to transcription start sites),[16] and the trends are not observed in Drosophila, plants, and mammals. Melting of the dsDNA origin is essential for propagation of the replication fork. Several initiator proteins and helicases orchestrate this process.[35] The exact mechanism of DNA melting and unwinding is not clearly understood due to the lack of high-resolution structures.[36] AT-rich regions can facilitate replication. However, this principle applies only to prokaryotes[37] and lower eukaryotes (Figure 2).

Another dinucleotide property, the propeller twist, displays profiles similar to those for free energy and melting temperature. The propeller twist minima are observed at 255, 328, and 628 for S. cerevisiae, K. lactis, and S. pombe, respectively. Meanwhile, in D. melanogaster, high propeller angles are observed immediately downstream of the Ori start site (Figure 2). The propeller twist angle is the rotation of the two bases of a base pair relative to each other and influences the rigidity of DNA. Sequences with higher negative propeller twist values are more rigid (A-tracts). The thermodynamic dinucleotide models, like stability and melting temperature, and the conformational property, propeller twist, revealed here the differences between Ori regions and the surrounding regions in six eukaryotes.

A recent study has utilized six helicoidal properties for predicting origins in S. cerevisiae based on the significant differences between Ori and non-Ori sequences.[38] Researchers have reported a prediction accuracy of 84% with the tool PseKNC for S. cerevisiae Ori sequences.[39] Meanwhile, another predictor was developed for human Oris in Hela cells.[40] So, we compared our results by plotting the six rotational and translational features for six systems to see whether the tool can be applied globally (Supplementary figure 1). The trends are consistent with the three dinucleotide features studied. In S. cerevisiae and K. lactis, Oris display lower roll, tilt, slide, and shift compared to the flanking sequences. In contrast, opposite trends in the profiles are observed in D. melanogaster and humans. Here, we suggest that these tools can be applied to all species by understanding the differences of these properties across species, with additional strategies for implementation in Oris of flies, mammals, plants, etc.

The trinucleotide bendability models (DNase 1 and NPP) can predict the flexibility of DNA in the context of genomic-scale experiments. The two models revealed that the Ori regions in yeasts are rigid compared to surrounding sequences, while mammal and plant Oris are highly flexible. Earlier work by Chen et al. on 270 replication origins in S. cerevisiae showed that replication origins are significantly rigid relative to neighboring genomic DNA.[32] Our result on S. cerevisiae is consistent with their work. It is known that rigid DNA in genomes can enhance the sliding of DNA binding proteins.[34,41,42] The proteins of the replication machinery may utilize DNA rigidity for scanning the genomes or for efficient orchestration of the replication machinery at Oris in lower eukaryotes. The common theme of regulatory regions such as promoters and origins is that they have nucleosome-free regions, or the DNA in these sequences is less conducive to nucleosome formation.[43]

Further, the DNA shape feature, groove shape, reveals that yeasts and fungi prefer sequences with narrow minor grooves in the vicinity of Oris. In contrast, in D. melanogaster, humans, and A. thaliana, wider minor grooves are predicted near the Oris with longer sequences. Here, we have used minor groove preferences over larger regions of DNA. Our results on minor groove width and propeller twist are consistent with earlier published results on D. melanogaster.[30] It has been observed that common DNA shape signatures in D. melanogaster and humans are marked by elevated propeller twist, roll angles, and minor groove width and reduced helical twist.[30] Altogether, Oris in lower organisms are less stable and more rigid and prefer narrow minor grooves, while humans and Arabidopsis show opposite trends, with high GC content, high stability, and a preference for flexible sequences with wider minor grooves. These observations could be due to the prevalence of CpG islands and GC-rich sequence motifs in these genomic regions.[2] The structural features observed for Oris are comparable to promoter features reported in earlier studies, with the exception that CpG islands are not observed in promoters of D. melanogaster.[17,44] However, one key difference between the profiles of Oris and promoters is that the structural feature signatures can extend up to 5000 nucleotides surrounding Ori start sites, while the signals extend up to 1000 nt flanking transcription start sites in mammals.[19] In summary, unique structural signatures demarcate Oris from surrounding genomic regions in eukaryotes.

The DNA replication initiation program is highly flexible: origins may differ among cell lineages, and cell type-specific origins display unique epigenetic signatures.[2] It is therefore necessary to understand the structural features of cell type-specific Oris in eukaryotes. Here, we also carried out a separate structural feature computation for various cell types in D. melanogaster, mice, and humans. The data sets retrieved from DeOri constitute three cell types for D. melanogaster (Kc, Bg3, and S2), three for the mouse (ES, MEF, and P19), and three for humans (K562, MCF7, and Hela). The Ori sequences of three different cell types in the same species display similar structural profiles. However, we cannot draw conclusions about the commonalities in mice and humans, as the data sets for human Hela and all three cell types in the mouse are too small for statistical comparisons.

Ori Sequences Are Enriched with Characteristic Sequence and Structural Motifs

We also revisited the earlier studied features such as GC content and sequence word composition to supplement the structural property preferences. The similarities and distinctions in the structural signatures of Ori regions in the above-shown systems can be ascribed to varying nucleotide base compositions along the sequence or to selective preference for a few oligonucleotides. Table 1 lists the preponderant word frequencies or k-mers (k = 2, 3, 4, and 7) in the sequences between the start and end positions of origins of replication (listed in the DeOri database) for the nine systems. Word composition analysis for various cell types in D. melanogaster, mice, and humans has also been carried out (Supplementary table 2). The Oris have a typical nucleotide composition with a preference for AT-rich k-mers in lower eukaryotes and plants (Table 1 and Supplementary table 2). The dinucleotides (AA and TT), trinucleotides (AAA and TTT), tetranucleotides (AAAA and TTTT), and heptanucleotides (AAAAAAA and TTTTTTT) are over-represented in the Oris of S. cerevisiae, K. lactis, C. glabrata, S. pombe, and D. melanogaster, while human Oris are enriched with G- or C-rich heptamer sequences, for instance, CCCTCCC, GGGAGGG, GGCTGGG, GGGCAGG, and GGGTGGG. Our results are in line with recent work reported by Lin’s group.[45] The authors extensively investigated sequence motifs in Oris using the MEME tool and reported that CpG-rich sequence motifs were observed in humans, mice, and A. thaliana, while three yeasts, K. lactis, P. pastoris, and S. pombe, and D. melanogaster display preferences for AT-rich motifs. It should be noted that although D. melanogaster has a composition similar to that of yeasts, the trends of its structural property profiles agree with those of humans (Figure 2). On closer inspection of word composition, we observed long repeats of CA or TG and TA steps. The cell type-specific composition analysis also reveals common trends (Supplementary table 2). In D. melanogaster, heptamers with CA steps are observed in all cell types (S2, Bg3, and Kc). Mouse data sets (MEF, P19, ES1, and ES2) have A-tracts and CA-containing oligonucleotides. Human cell types MCF7 and K562 have similar word composition, with GGGAGGG or its complementary sequence CCCTCCC being enriched, while the Hela data sets (Hela1 and Hela2) show an abundance of A-tracts. The cell type similarities and differences are also consistent with earlier published results.[45] However, it should be noted that the ES1 and ES2 data sets for the mouse and the human Hela data sets are too small, statistically or on a genome-wide scale, to derive strong conclusions.[45]

The high incidence of AT-rich sequences in Oris of lower eukaryotes is reflected in their lower DNA duplex stability, higher propeller angles, and rigidity. Higher eukaryotes, like humans in this data set, seem to be enriched with G-quadruplex-forming G4 motifs, i-motifs, and oligo G-tracts, besides A-tracts. Continuing this analysis, we examined in detail the preponderance of various structural motifs along with CpG islands (Table 2 and Figure 3).

Table 2

Propensity of Well Characterized Sequence Motifs in Oris in Eukaryotes^a

organism	i-motif density	G-quad density	A-tracts	G-tracts	ARS	TATA box
S. cerevisiae	0.01	0.02	0.99	0.11	0.26	0.95
K. lactis	0.00	0.01	0.94	0.08	0.13	0.89
P. pastoris	0.01	0.02	0.94	0.04	0.07	0.75
C. glabrata	0.09	0.05	0.96	0.35	0.21	0.93
S. pombe	0.01	0.01	1.00	0.08	0.33	0.97
D. melanogaster	0.19	0.20	0.96	0.33	0.28	0.92
A. thaliana	0.02	0.02	0.98	0.04	0.21	0.88
mouse	0.64	0.66	0.92	0.70	0.09	0.61
human	0.57	0.57	0.86	0.34	0.10	0.49

Densities of i-motifs, G-quadruplexes, A-tracts, G-tracts, autonomously replicating sequences (ARS), and TATA boxes are shown in the table. Sequences of 1000 nucleotides downstream of the Ori start sites were considered in this table.

Figure 3

Positional distribution of (a) A-tracts, (b) G-tracts, G-quadruplexes, and intercalated motifs, and (c) CpG islands in Ori regions of various eukaryotes. The regular expressions “A₇ or T₇”, “G₇ or C₇”, “G₃₋₅N₁₋₇G₃₋₅N₁₋₇G₃₋₅N₁₋₇G₃₋₅” and “C₃₋₅N₁₋₇C₃₋₅N₁₋₇C₃₋₅N₁₋₇C₃₋₅” are searched in the −5000 to +5000 region relative to origin start sites and summed for each 200 nucleotide bin for defining A-tracts, G-tracts, G-quadruplexes, and intercalated motifs. In yeasts (S. cerevisiae, K. lactis and S. pombe), A-tracts are prevalent in the vicinity of Oris, while in D. melanogaster and humans, G-tracts, G-quadruplexes, and i-motifs are preferred. CpG islands are observed in D. melanogaster, humans, and A. thaliana. CpG islands in −5000 to +5000 regions are searched using the “CpG island searcher” program with a 500 nt window.[46]

The replication process involves the generation of ssDNA, which can provide an opportunity for the formation of secondary structure elements such as i-motifs, G-quadruplexes, and cruciform DNA. These structures may affect both the fidelity and processivity of the polymerization reaction. It is not yet clear how organisms handle the genome instability and how these regions are conserved in metazoans. H-DNA can induce stalling of the replication machinery.[47,48] Here, we report the preponderance of structurally constrained B-DNA sequence motifs (A-tracts and G-tracts) and non-B-DNA-forming sequence motifs (G-quadruplexes and i-motifs). The occurrence of some well-characterized sequence elements, like oligo-A or G-tracts and G4 motifs, in the Ori regions of nine eukaryotic organisms is listed in Table 2. It is clearly seen that Oris of S. cerevisiae, S. pombe, and D. melanogaster are highly enriched in oligo-A tracts while moderately enriched in TATA box-like sequences (Table 2). G-tracts, another structural motif, are observed in D. melanogaster and humans along with putative G-quadruplex and i-motif sequences. Further, the earlier established feature of CpG islands in D. melanogaster and human origins of replication is now also revealed in A. thaliana. Altogether, the composition and motif search analyses reveal that motif preferences in origins of replication differ between systems: yeasts are AT-rich, particularly in A-tracts, while mammals have a high preference for GC-rich motifs. Though the GC composition is different in various eukaryotes, the common principle of conservation of antinucleosomal sequences (A-tracts, G-tracts, and G4 motifs) is ubiquitous in eukaryotic origins.
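
The motif definitions and 200-nucleotide binning given in the Figure 3 caption translate directly into regular expressions. The Python sketch below is one possible reading, not the authors' code: the names `PATTERNS` and `binned_counts` are hypothetical, matches are counted without overlap (the behavior of `re.finditer`), only the given strand is scanned, and N is taken to mean any of A, C, G, or T.

```python
import re
from collections import Counter

LOOP = "[ACGT]{1,7}"  # N(1-7): loop of any nucleotides (assumed to exclude ambiguity codes)
PATTERNS = {
    "A-tract": re.compile(r"A{7}|T{7}"),
    "G-tract": re.compile(r"G{7}|C{7}"),
    "G-quadruplex": re.compile(LOOP.join(["G{3,5}"] * 4)),
    "i-motif": re.compile(LOOP.join(["C{3,5}"] * 4)),
}

def binned_counts(seqs, pattern, flank=5000, bin_size=200):
    """Sum motif hits per bin over all sequences.

    Each sequence is assumed to start at position -flank relative to the
    Ori start site (0). Bins are keyed by their left edge, e.g. -200 covers
    positions -200..-1 and 0 covers positions 0..199.
    """
    bins = Counter()
    for seq in seqs:
        for match in pattern.finditer(seq.upper()):
            rel = match.start() - flank          # position relative to Ori start
            bins[(rel // bin_size) * bin_size] += 1
    return dict(bins)

# Toy example with a 10-nt flank and 5-nt bins, not DeOri data
toy_hits = binned_counts(["ACGTACGTACAAAAAAACGT"], PATTERNS["A-tract"],
                         flank=10, bin_size=5)
```

Assigning each hit to the bin of its start position is a simplification; long G-quadruplex matches can span a bin boundary, and a midpoint-based assignment would be an equally reasonable convention.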

Eukaryotic Origins of Replication May Be Linked to Promoter Regions

Promoters are crucial for transcription, and their activity is conferred by stereotypical sequence motifs such as Inr (initiator element), the TATA box, BRE (TFIIB recognition element), and DPE (downstream promoter element) at well-defined locations relative to the transcription start sites.[20] Origins of replication have a similar chromatin environment and share some genetic features with transcription-activating sequences or promoters.[2] Mounting evidence shows that Oris tend to lie in the vicinity of transcriptional start sites (TSSs).[2,49] The commonly noticed links between eukaryotic replication and transcription are due to shared nucleosome-depleted regions.[50] In metazoans, Oris are concentrated near the core promoter regions.[8] Further, this can be due to preferential association with CpG islands in both promoter regions and origins of replication.[51,52] In yeast, they are associated with ARS and antinucleosomal sequences and precisely positioned nucleosomes (+1 and −1 nucleosome).[53] In yeasts, the distance between Ori start sites and transcription start sites is less than 500 nucleotides in 31.46% of the sequences studied.[54]

We therefore addressed the link between Oris and promoters by analyzing the distribution of consensus transcription factor binding sequences or promoter elements in the vicinity of Ori start sites. We searched for the known core promoter elements in the Ori regions. We observed no common trend in the relation between Oris and promoters across all the systems studied. However, a few promoter elements are prominent in the majority of the systems (Supplementary figure 3). Figure 4 shows the density of general transcription factor binding sites in yeasts and A. thaliana. The distribution of TATA boxes [consensus site TATAWAWR] in S. cerevisiae, K. lactis, P. pastoris, and S. pombe is shown in blue, and the distribution of BREu [SSRCGCC], DCE-I [CTTC], DCE-III [AGC], and Pause-button [KCGRWCG] in A. thaliana is shown in green bar plots. A preponderance of TATA boxes is observed in all species of yeasts in our data set, with peak occurrence approximately at positions 200, 400, 800, and 600 for S. cerevisiae, K. lactis, P. pastoris, and S. pombe, respectively (Figure 4a). However, it should be noted that the Ori regions in yeasts are AT-rich, so natural enrichment of TATA boxes can be expected. The plant genome, A. thaliana, displays typical results in connection with Oris and promoters (Figure 4b). The promoter elements BREu, DCE-I, DCE-III, and Pause-button are over-represented in these regions. From this result, we speculate that TATA box-containing genes are associated with origins of replication in yeasts. A more apparent link is observed in A. thaliana, where a few core promoter elements are preponderant, suggesting that promoters and origins of replication are linked.

Figure 4

Positional distribution of promoter sequence elements in Ori regions in (a) yeasts and (b) A. thaliana. The plot shows the preponderance of the TATA box [TATAWAWR] in Ori sequences [−5000 to +5000 relative to Ori start sites at position 0] in the lower yeast species S. cerevisiae, K. lactis, P. pastoris, and S. pombe. Plots with green-colored bars indicate the occurrence of promoter elements BREu [SSRCGCC], DCE-I [CTTC], DCE-III [AGC], and Pause-button [KCGRWCG] in A. thaliana. The IUPAC nucleotide code is K = G or T, R = A or G, S = C or G, and W = A or T. Promoter sequence motif information was retrieved from the literature (eukaryotic core promoters and the functional basis of transcription initiation). The positional distribution of promoter sequence elements in Ori regions in all systems is also displayed in Supplementary figure 3.

Conclusions

Our comprehensive work focuses on unveiling DNA structural features in the origins of replication of eukaryotic systems and concludes that eukaryotic Oris have characteristic signature structural profiles. We observed that Oris of lower eukaryotes are more meltable and more rigid than the surrounding sequences. The complex replication process depends on the interaction between cis-regulatory modules and a set of regulatory proteins. The structural signals may aid this interaction by making DNA nucleosome-free (antinucleosomal sequences such as A-tracts) and easy to melt (reduced free energy for DNA melting). This work is a conceptual update to the current knowledge of Ori sequences in the region where the replication fork emanates. The molecular mechanisms regulating DNA replication may be highly conserved, but the secondary structural elements of Oris vary among yeasts, invertebrates, vertebrates, and plants. Further, the CG-rich sequence motifs, which act as hot spots for DNA methylation in higher eukaryotes, suggest that epigenetic features may precisely modulate the replication mechanism. Our approach can support a better understanding of the mechanisms involved in replication. Unraveling the DNA structure of dormant, constitutive, and facultative Oris is a natural extension of this work.

Materials and Methods

Origins of Replication Data Sets

Experimentally mapped endogenous replication initiation sites (Table ) were retrieved from DeOri version 6 (http://tubic.org/deori/).[31] The database features eukaryotic DNA replication origins identified by genome-wide experimental studies. The genomic locations of Oris for Saccharomyces cerevisiae, Kluyveromyces lactis, Candida glabrata strain CBS138, Pichia pastoris, Schizosaccharomyces pombe, Drosophila melanogaster, mice, humans, and Arabidopsis thaliana were retrieved from the database. It should be noted that current experimental methods such as ChIP-seq, SNS-seq (sequencing of RNA-primed short nascent DNA strands), and replication bubble and Okazaki fragment-based methods cannot determine the replication start sites precisely, and the resolution of the methods varies from a few bases to kilobases.[49,55] They can only delimit small regions that contain Oris.[38] In this work, we have chosen the starting locus of the Ori regions provided by DeOri and refer to them as Ori start sites. The genomic locations were mapped to the genomes, and sequences of −5000 to 5000 nucleotides relative to Ori start sites (position 0 is the genomic start location provided by DeOri) were extracted for the analysis. The numbers of sequences used in this study are 357, 144, 256, 294, 345, 7156, 2412, 94,195, and 1533 for Saccharomyces cerevisiae, Kluyveromyces lactis, Candida glabrata strain CBS138, Pichia pastoris, Schizosaccharomyces pombe, Drosophila melanogaster, mice, humans, and Arabidopsis thaliana, respectively. The data set covers various families of eukaryotic life and can thus be used for representative analysis. Whole-genome sequences for S. cerevisiae, Kluyveromyces lactis, Candida glabrata strain CBS138, Pichia pastoris, Schizosaccharomyces pombe, Drosophila melanogaster, and Arabidopsis thaliana were retrieved from the NCBI data bank (https://www.ncbi.nlm.nih.gov/genome). Mouse (mm8) and human (hg19) genomes were downloaded from the UCSC Genome site (http://genome.ucsc.edu/).[56] Further, we have also included tissue-specific Ori data sets for D. melanogaster (Kc, Bg3, and S2), mice (MEF, P19, ES1, and ES2), and humans (MCF7, K562, Hela1, and Hela2) in our analysis. The sequence length of −5000 to 5000 relative to Ori start sites was chosen based on empirical evidence observed in DNA structural features such as free energy and flexibility relative to the Ori in humans. It was observed that the span of signature regions in humans extends beyond 4000 nucleotides on both sides of the Ori. Hence, for comparative analysis, we selected the same region for all the organisms in our data set.
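The window extraction described above can be sketched as follows. This is a minimal illustration, assuming the chromosome is already loaded as a string and the Ori start site is given as a 0-based index; the function name `extract_window` is hypothetical, and dropping windows that run past a chromosome end is an assumption, since the text does not say how such cases were handled.

```python
def extract_window(chrom_seq, ori_start, flank=5000):
    """Return the sequence from -flank to +flank around an Ori start site.

    `ori_start` is a 0-based index into `chrom_seq`; the returned string
    has length 2 * flank + 1, with the Ori start site at index `flank`.
    Returns None when the window would extend past either chromosome end
    (an assumed policy, not one stated in the text).
    """
    left = ori_start - flank
    right = ori_start + flank + 1  # +1 so that position +flank is included
    if left < 0 or right > len(chrom_seq):
        return None
    return chrom_seq[left:right].upper()
```

With the default `flank=5000`, each retained window has 10,001 nucleotides, and position 0 of the profiles corresponds to index 5000 of the returned string.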

DNA Structural Profile Enumeration

The initiation of replication involves the search of proper Ori sequences by the replication machinery proteins, orchestration of different trans factors, DNA-protein recognition, formation of stable complexes, and finally, the open complex formation. Here, we used k-mer (k = 2–4) nucleotide descriptors to relate various processes of replication. DNA stability and melting temperature models can explain the sequence preferences for open complex formation, DNA bending models may explain sequence search and orchestration, and propeller twist and minor groove models explain DNA-protein recognition. The propeller twist can also explain the rigidity of DNA.

DNA Stability and Melting Temperature Models

DNA duplex stability, or the free energy of a DNA fragment, depends on the hydrogen bonds between bases and the stacking interactions between consecutive bases, and can be computed by summing the free energies of the constituent dinucleotides.[20] The melting temperature of a DNA fragment directly depends on DNA stability. A dinucleotide descriptor based on the collection of melting studies of 108 oligonucleotides[57] has been used for computing DNA stability. Further, another model based on normalized dinucleotide empirical melting temperature descriptors[58] was also utilized for comparison.
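The summation over constituent dinucleotides can be sketched as below. The numerical values are the widely quoted unified nearest-neighbor free energies (kcal/mol at 37 °C); they are included here as an assumption for illustration and are not confirmed to be identical to the descriptor table used in the study. The function name `duplex_free_energy` is hypothetical, and helix initiation and terminal corrections are deliberately ignored.

```python
# ASSUMPTION: commonly quoted unified nearest-neighbor free energies
# (kcal/mol, 37 degrees C). Each step and its reverse complement share a value.
NN_DELTA_G = {
    "AA": -1.00, "TT": -1.00,
    "AT": -0.88,
    "TA": -0.58,
    "CA": -1.45, "TG": -1.45,
    "GT": -1.44, "AC": -1.44,
    "CT": -1.28, "AG": -1.28,
    "GA": -1.30, "TC": -1.30,
    "CG": -2.17,
    "GC": -2.24,
    "GG": -1.84, "CC": -1.84,
}


def duplex_free_energy(seq):
    """Sum dinucleotide free energies over all overlapping steps of `seq`.

    A fragment of n nucleotides contributes n - 1 dinucleotide steps.
    Ambiguous bases such as N are not handled and raise KeyError.
    """
    seq = seq.upper()
    return sum(NN_DELTA_G[seq[i:i + 2]] for i in range(len(seq) - 1))
```

More negative sums indicate a more stable duplex, so an AT-rich fragment such as `AATT` scores higher (less stable) than a GC-rich fragment of the same length.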

DNA Bendability Models

Bending flexibility or bendability of a sequence is the anisotropic bending of DNA under the influence of DNA-binding factors such as proteins. The bending propensity of sequences was computed using genome context-derived trinucleotide descriptors, namely the DNase I sensitivity model[59] and the nucleosome positioning preference (NPP) model.[60] More negative values from the DNase I sensitivity model or less positive values from the NPP model indicate greater rigidity of a given DNA fragment.

Propeller Twist and Minor Groove Width

The propeller twist is the inherent or induced non-planarity of a base pair, quantified as the relative angle of rotation between the paired bases about their common y-axis. DNA sequences with more negative propeller twist values (e.g., A-tracts) are more rigid. The propeller twist angle values based on X-ray crystal structures[12] were retrieved from DiProDB (dinucleotide property database)[61] for all 16 dinucleotides. In B-DNA, grooves arise because the two glycosyl bonds branch off from one side of the hydrogen-bonded base pair. At the minor groove, the backbones lie closer together, and the groove shape is a key factor in indirect readout during DNA-protein recognition. The tetranucleotide model derived from protein-DNA crystal structure complexes[14] is employed for minor groove width computation in this study. Given the value of each unique dimer, trimer, or tetramer, a one-nucleotide sliding window converts a given sequence into a numerical profile. Smoothing windows of 15 nucleotides (corresponding to 14 dinucleotide steps) for dinucleotide models and 30 nucleotides for tri- or tetranucleotide structural descriptors were employed based on our previous studies.[18,20,21]
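The sliding-window conversion and smoothing can be sketched as follows. This is a minimal illustration, assuming a plain moving average and a lookup table passed in by the caller; the function names are hypothetical and no real descriptor values are reproduced here. A window of w nucleotides is taken to contain w − k + 1 k-mer steps, which matches the 15 nucleotides to 14 dinucleotide steps correspondence given in the text.

```python
def kmer_profile(seq, table, k):
    """Look up one descriptor value per k-mer, sliding one nucleotide at a time."""
    return [table[seq[i:i + k]] for i in range(len(seq) - k + 1)]


def moving_average(values, n_steps):
    """Average each run of `n_steps` consecutive values (running-sum form)."""
    if n_steps <= 0 or n_steps > len(values):
        return []
    window_sum = sum(values[:n_steps])
    smoothed = [window_sum / n_steps]
    for i in range(n_steps, len(values)):
        window_sum += values[i] - values[i - n_steps]
        smoothed.append(window_sum / n_steps)
    return smoothed


def structural_profile(seq, table, k, window_nt):
    """Smoothed profile for a k-mer descriptor table.

    `window_nt` is the smoothing window in nucleotides, e.g. 15 for
    dinucleotide models (14 steps) or 30 for tri/tetranucleotide models.
    """
    n_steps = window_nt - k + 1
    return moving_average(kmer_profile(seq.upper(), table, k), n_steps)
```

For a 10,001-nucleotide Ori window and a dinucleotide table, `structural_profile(seq, table, k=2, window_nt=15)` would return 9,987 smoothed values, one per 15-nucleotide window.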

Computation of Structural Motifs

A DNA G-quadruplex is defined as a four-stranded DNA structure that is composed of stacked guanine tetrads.[62] G-quadruplex-forming sequences in the genomes are predicted from the primary sequence of contextual DNA. A putative G-quadruplex consensus sequence has been identified using a simple pattern match, G3–5N1–7G3–5N1–7G3–5N1–7G3–5,[63] where N represents the linker nucleobases and can be any of the four nucleotides. The complementary sequence on the opposite strand of the G-quadruplex [C3–5N1–7C3–5N1–7C3–5N1–7C3–5] can form an intercalated motif. We have looked for i-motifs (intercalated motifs) separately, as their significant role in the human genome has been described in a recent study.[65] G-quadruplex motifs, i-motifs, A-tracts, and G-tracts are computed using pattern search methods. Long stretches of A or G can act as antinucleosomal sequences. A-tracts are runs of four or more consecutive A/T base pairs that exclude the flexible TA dinucleotide step. G-tracts (G7 or C7) are also computed because, like poly(A), poly(G) can act as an antinucleosomal sequence.[66,67]
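The pattern searches described above can be sketched with regular expressions. This is a minimal illustration under stated assumptions: matches are non-overlapping and greedy, G-tracts are taken as runs of seven or more G or C, and A-tracts are found by noting that any A/T run without a TA step must consist of A's followed by T's. The names (`G4`, `IMOTIF`, `G_TRACT`, `AT_RUN`, `find_spans`, `a_tracts`) are hypothetical.

```python
import re

# Putative G-quadruplex: four G-runs (3-5 nt) separated by 1-7 nt linkers.
G4 = re.compile(r"G{3,5}[ACGT]{1,7}G{3,5}[ACGT]{1,7}G{3,5}[ACGT]{1,7}G{3,5}")
# Putative i-motif: the same pattern written for the C-rich strand.
IMOTIF = re.compile(r"C{3,5}[ACGT]{1,7}C{3,5}[ACGT]{1,7}C{3,5}[ACGT]{1,7}C{3,5}")
# G-tract: assumed here to mean a run of at least seven G (or C on this strand).
G_TRACT = re.compile(r"G{7,}|C{7,}")
# A/T run with no TA step: A's followed by T's, or T's alone.
AT_RUN = re.compile(r"A+T*|T+")


def find_spans(pattern, seq, min_len=0):
    """Return (start, end) spans of non-overlapping matches of at least min_len."""
    return [
        (m.start(), m.end())
        for m in pattern.finditer(seq.upper())
        if m.end() - m.start() >= min_len
    ]


def a_tracts(seq):
    """A-tracts: A/T runs of four or more nucleotides without a TA step."""
    return find_spans(AT_RUN, seq, min_len=4)
```

For example, `a_tracts("TTTTAAAA")` reports two separate tracts, because the TA step between the T-run and the A-run breaks the tract.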

CpG Island Calculations and Promoter Motif Element Search

CpG islands (CGIs) are described as DNA sequences with length greater than 500 nucleotides, GC percentage ≥55, and the ratio of observed/expected CpG content ≥0.65. CGI start locations in the Ori regions are predicted using a published method.[46]
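The three criteria can be checked for a single candidate segment with the sketch below. It assumes the commonly used definition of the observed/expected ratio, (number of CpG × length) / (number of C × number of G); the published method cited above may differ in how it scans and merges windows, and the function names `cpg_stats` and `is_cpg_island` are hypothetical.

```python
def cpg_stats(seq):
    """Return (GC percentage, observed/expected CpG ratio) for a segment.

    ASSUMPTION: expected CpG count = (#C * #G) / length, the commonly
    used definition; ratios are 0.0 when no CpG can be expected.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")
    gc_percent = 100.0 * (c_count + g_count) / n
    expected = c_count * g_count / n
    obs_exp = cpg_count / expected if expected else 0.0
    return gc_percent, obs_exp


def is_cpg_island(seq, min_len=500, min_gc=55.0, min_obs_exp=0.65):
    """Apply the thresholds quoted in the text to one candidate segment."""
    gc_percent, obs_exp = cpg_stats(seq)
    return len(seq) > min_len and gc_percent >= min_gc and obs_exp >= min_obs_exp
```

This only evaluates a segment that has already been chosen; locating CGI start positions along a 10,001-nucleotide Ori window would additionally require a scanning strategy, which is left to the cited method.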
