Sequence analysis of origins of replication in the Saccharomyces cerevisiae genomes.

Wen-Chao Li¹, Zhe-Jin Zhong¹, Pan-Pan Zhu¹, En-Ze Deng¹, Hui Ding¹, Wei Chen², Hao Lin¹.

Abstract

DNA replication is a highly precise process that is initiated from origins of replication (ORIs) and is regulated by a set of regulatory proteins. Mining DNA sequence information is beneficial not only for understanding the regulatory mechanism of replication initiation but also for accurately identifying ORIs. In this study, the GC profile and GC skew were calculated to analyze the compositional bias in the Saccharomyces cerevisiae genome. We found that the GC profile in ORI regions is significantly lower than that in the flanking regions. By calculating the information redundancy, an estimate of the correlation between nucleotides, we found that the intensity of adjacent-nucleotide correlation in ORIs is markedly higher than that in flanking regions. Furthermore, the relationships between ORIs and nucleosomes as well as transcription start sites were investigated. The results showed that ORIs are usually not occupied by nucleosomes. Finally, we calculated the distribution of ORIs in yeast chromosomes and found that most ORIs are in transcription terminal regions. We hope that these results will contribute to the identification of ORIs and the study of DNA replication mechanisms.

Keywords: GC profile; GC skew; Saccharomyces cerevisiae; distribution of ORIs; information redundancy; origin of replication

Year: 2014; PMID: 25477864; PMCID: PMC4235382; DOI: 10.3389/fmicb.2014.00574

Source DB: PubMed; Journal: Front Microbiol; ISSN: 1664-302X; Impact factor: 5.640

INTRODUCTION

The well-known replicon theory was proposed in 1963 based on a large number of experiments using the sexual system of Escherichia coli (Jacob et al., 1963). DNA replication is an orchestrated process. When a cell enters the S phase of the cell cycle, its DNA double helix is unwound. Replication forks are then generated to allow the DNA synthesis machinery to copy each DNA strand in a bidirectional manner. The specific regions that are responsible for initiating replication of the genome are called origin of replication (ORI) regions. These regions are recognized by the origin recognition complex (ORC). Each ORI is usually activated only once per cell cycle to avoid amplification and maintain genome integrity (Cayrou et al., 2012). Although most bacterial genomes have only a single ORI region (Gao and Zhang, 2007) and some archaea use more than one ORI region to initiate DNA replication (Luo et al., 2014), the fungus Saccharomyces cerevisiae (S. cerevisiae) has multiple ORIs on its chromosomes so that replication can be completed in a reasonable period of time despite the large size of its genome and the limited rate of nucleotide incorporation during DNA synthesis. Therefore, predicting ORIs is more difficult in the S. cerevisiae genome than in bacterial genomes. Several experiments have revealed that the activity of ORIs in yeast depends on a cis-acting replicator sequence termed the autonomously replicating sequence (ARS). These regulatory sequences are generally found in AT-rich regions of the yeast genome. The ARS generally contains three domains: A, B, and C. An essential ARS consensus sequence (ACS), (T/A)TTTAT(A/G)TTT(T/A), usually appears in the A domain (Wu et al., 2014). The B domain contains a number of short sequence motifs that contribute to origin activity (Dhar et al., 2012).
The motifs in the C domain are responsible for the interaction between DNA and regulatory proteins (Crampton et al., 2008). However, these motif sequences are not conserved enough to be used to identify ORIs (Nieduszynski et al., 2006). Thus, the discovery of the hidden intrinsic characteristics at the sequence level is helpful not only for understanding the regulatory mechanism but also for accurately identifying ORIs. With the accumulation of experimental data (Levitsky et al., 2005; Yamashita et al., 2011; Gao et al., 2012), some researchers have analyzed features of replication. Recently, by analyzing four highly active origins, Chang et al. (2011) revealed that sequences adjacent to the ACS contributed substantially to origin activity and ORC binding. Yin et al. (2009) found that the nucleosome depletion regions are preferentially permissive for replication and proposed that the ORI organization imposed by nucleosome positioning is phylogenetically widespread in eukaryotes. DNA structure may also influence the distribution of ORIs. Chen et al. (2012) found that the DNA bendability and cleavage intensity in ORIs are dramatically lower than those in both upstream and downstream regions of ORIs. Although some characteristics of ORIs have been described, the available information about ORIs is still far from satisfactory. Therefore, to clarify replication mechanisms, it is still necessary to discover the intrinsic characteristics of ORIs. With this in mind, we performed a series of analyses to investigate the composition bias and correlation of nucleotides in ORIs, the distribution of ORIs in genomes, and the relationships between ORIs and regulatory elements.
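As an illustration of the ACS consensus quoted above, the following minimal Python sketch scans both strands of a sequence for exact matches to (T/A)TTTAT(A/G)TTT(T/A). It is not the procedure used in this study; the function names `find_acs` and `reverse_complement` are hypothetical, and the sketch assumes an uppercase or lowercase A/C/G/T sequence with no ambiguity codes.

```python
import re

# Regex form of the consensus (T/A)TTTAT(A/G)TTT(T/A) quoted in the text.
# The lookahead makes overlapping matches visible to finditer.
ACS_PATTERN = re.compile(r"(?=([TA]TTTAT[AG]TTT[TA]))")

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_acs(seq: str):
    """Return sorted (start, strand, match) tuples for exact ACS hits.

    `start` is 0-based and always given in forward-strand coordinates;
    `match` is the motif as read 5'->3' on the strand where it was found.
    """
    seq = seq.upper()
    n = len(seq)
    hits = [(m.start(), "+", m.group(1)) for m in ACS_PATTERN.finditer(seq)]
    for m in ACS_PATTERN.finditer(reverse_complement(seq)):
        # Convert a reverse-strand start into a forward-strand coordinate.
        start = n - (m.start() + len(m.group(1)))
        hits.append((start, "-", m.group(1)))
    return sorted(hits)
```

Because the text notes that these motifs are not conserved enough to identify ORIs on their own, exact matching of this kind should be read only as a way to locate candidate ACS sites, not as an ORI predictor.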

MATERIALS AND METHODS

DATASETS

The S. cerevisiae ORIs were collected from OriDB (Siow et al., 2012; http://www.oridb.org/). The ORI data are assigned one of three confidence levels: confirmed, likely, or dubious. To provide a reliable, high-quality dataset, only the 410 experimentally confirmed ORIs were selected and used in the following analyses. The complete S. cerevisiae genome was downloaded from GenBank (Benson et al., 2013). The 5015 transcription start sites (TSSs) of S. cerevisiae were taken from a previous publication (Lee et al., 2007). The in vitro nucleosome data and the nucleosome data from three growth conditions [ethanol; yeast extract, peptone, and dextrose (YPD) medium; and galactose] were previously reported (Yuan et al., 2005; Lee et al., 2007; Kaplan et al., 2009).

SEQUENCE COMPOSITION ANALYSIS

The GC profile represents the variation in GC content along the genomic sequence (Gao and Zhang, 2006), which can be defined by the following equation (Zhang et al., 2005; Xing et al., 2014):

$$\mathrm{GC\ profile} = \frac{f(C) + f(G)}{f(A) + f(C) + f(G) + f(T)} \tag{1}$$

where f(A), f(C), f(G), and f(T) are the frequencies of adenine (A), cytosine (C), guanine (G), and thymine (T), respectively, in the i-th sliding window along the sequence. The GC profile ranges from 0 to +1. Values from 0 to 0.5 indicate that the GC content in the i-th sliding window is lower than the AT content, while values from 0.5 to 1 indicate that the GC content in the i-th sliding window is higher than the AT content.

GC skew was the first computational method proposed to identify ORIs in bacterial genomes (Lobry, 1996a,b). For a given sequence, the GC skew is defined by the following equation (McLean et al., 1998):

$$\mathrm{GC\ skew} = \frac{f(G) - f(C)}{f(G) + f(C)} \tag{2}$$

where f(C) and f(G) are the frequencies of cytosine (C) and guanine (G), respectively, in the i-th sliding window along the sequence. The GC skew ranges from -1 to +1. Values from -1 to 0 indicate that f(G) < f(C), and values from 0 to +1 indicate that f(G) > f(C).
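The two definitions above can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used in the study: the function name `gc_profile_and_skew` is hypothetical, the default window and step follow the 50 bp / 1 bp setting reported in the Results, and returning 0.0 for a window with a zero denominator is an assumption made here for convenience.

```python
def gc_profile_and_skew(seq: str, window: int = 50, step: int = 1):
    """Return (profile, skew): one GC profile and GC skew value per window.

    profile = (f(C) + f(G)) / (f(A) + f(C) + f(G) + f(T))
    skew    = (f(G) - f(C)) / (f(G) + f(C))
    Windows whose denominator is zero yield 0.0 (an assumption of this sketch).
    """
    seq = seq.upper()
    profile, skew = [], []
    for start in range(0, len(seq) - window + 1, step):
        win = seq[start:start + window]
        a, c = win.count("A"), win.count("C")
        g, t = win.count("G"), win.count("T")
        total = a + c + g + t
        profile.append((g + c) / total if total else 0.0)
        skew.append((g - c) / (g + c) if (g + c) else 0.0)
    return profile, skew
```

For example, `gc_profile_and_skew("GGGCAATT", window=4, step=4)` evaluates the windows `GGGC` and `AATT` separately, so counts rather than normalized frequencies can be used directly because the window length cancels in both ratios.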

INFORMATION REDUNDANCY

As a genetic language, the nucleic acid sequence can be investigated through information-theoretic methods (Luo et al., 1998). In recent years, informational entropy has been widely applied in studies of the recognition and evolution of DNA sequences (Grosse et al., 2000; Yu and Jiang, 2001; Otu and Sayood, 2003; Xing et al., 2013). The average mutual information profile is an excellent candidate for a species signature (Bauer et al., 2008). Based on these studies, we introduced the k-order information redundancy, which can be defined as follows (Luo et al., 1998):

$$D_{k+2} = 2H + \sum_{i,j} p_{ij}(k) \log_2 p_{ij}(k) \tag{3}$$

where $p_{ij}(k)$ is the joint probability of base j occurring after base i at a distance k along the sequence. The term k = 0 indicates the adjacent correlation between two bases. $D_{k+2}$ describes the divergence of the sequence from independence and the correlation between nucleotides separated by a gap of k nucleotides. In general, the larger the $D_{k+2}$ value, the more strongly the sequence diverges from independence. The H value is the informational entropy and is defined by the following equation:

$$H = -\sum_{a} p_a \log_2 p_a \tag{4}$$

where $p_a$ is the probability of base a (a = A, G, C, or T) occurring in the sequence.
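To make the information redundancy concrete, the sketch below estimates it from a single sequence as twice the single-base entropy plus the sum of p log2 p over base pairs separated by k intervening nucleotides, following the description above. The function names are hypothetical, base-2 logarithms are assumed, and probabilities are estimated by simple counting; for short sequences the estimate is therefore only approximate.

```python
from collections import Counter
from math import log2


def entropy(seq: str) -> float:
    """Single-base informational entropy H = -sum_a p_a log2 p_a."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def information_redundancy(seq: str, k: int = 0) -> float:
    """Estimate D_{k+2} = 2H + sum_ij p_ij(k) log2 p_ij(k).

    p_ij(k) is estimated from all pairs (seq[n], seq[n + k + 1]), i.e.
    bases separated by k intervening nucleotides; k = 0 means adjacent.
    """
    seq = seq.upper()
    pairs = Counter(zip(seq, seq[k + 1:]))
    n_pairs = sum(pairs.values())
    if n_pairs == 0:
        raise ValueError("sequence is too short for the requested gap k")
    joint = sum((c / n_pairs) * log2(c / n_pairs) for c in pairs.values())
    return 2 * entropy(seq) + joint
```

A perfectly periodic sequence such as `"ACGT" * 50` has H = 2 bits and a fully determined successor for every base, so its adjacent redundancy is close to 2 bits, whereas a homopolymer gives 0 because both terms vanish.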

RESULTS AND DISCUSSION

GC CONTENT SURROUNDING ORIs

DNA sequence is the most basic yet important form of genetic information, and it plays an important role in determining ORIs in the S. cerevisiae genome. However, the extent to which ORIs are determined in vivo by cis-acting sequences is poorly understood. To investigate the compositional bias of ORIs, we calculated the GC content of a 300 bp segment of each ORI. For comparison, the GC content of the genome sequence was calculated using a 300 bp window with a step of 300 bp. The mean GC content of ORIs is 0.3168 (SD = 0.23 × 10⁻²), which is significantly lower (P < 2.3 × 10⁻¹³³, Mann–Whitney U-test) than the genome-wide GC content (0.3796; SD = 0.24 × 10⁻²). In other words, ORIs are AT-rich. Because AT-rich DNA is less thermally stable than GC-rich DNA, the high AT content of ORI sequences facilitates the opening of the DNA double helix for the initiation of DNA replication.
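The comparison above relies on the Mann–Whitney U-test. As a rough sketch of how such a test operates on two lists of window GC contents, the following pure-Python function computes the U statistic with average ranks for ties and an approximate two-sided P value from the normal approximation. The function name is hypothetical, and the omission of tie and continuity corrections is a simplification of this sketch; the exact variant used in the study is not stated in the text.

```python
from math import erfc, sqrt


def mann_whitney_u(x, y):
    """Return (U for sample x, approximate two-sided P value).

    Ties receive average ranks. The P value uses the normal approximation
    without tie or continuity correction (a simplification of this sketch).
    """
    n1, n2 = len(x), len(y)
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for _, grp in combined[i:j] if grp == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    return u, erfc(abs(z) / sqrt(2))
```

With hundreds of ORI windows and tens of thousands of genomic windows, the normal approximation is the usual choice; for very small samples an exact test would be preferable.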

GC PROFILE AND GC-SKEW SURROUNDING ORI

To investigate the compositional bias, the GC profile and GC skew surrounding ORIs were calculated using a 50 bp sliding window with a step of 1 bp. The average scores of the GC profile and GC skew are plotted in the figure below. As illustrated in panel A, the GC profile score in the ORI regions was significantly lower than that in the surrounding regions (P < 2.0 × 10⁻⁸⁶, Mann–Whitney U-test).

Figure: The GC profile (A) and the GC skew (B). The profiles are plotted using a 50-bp sliding window with a 1-bp step. The horizontal axis represents the nucleotide position, which ranges from -300 to +300 bp relative to ORIs (denoted as 0). The vertical axis represents the GC content score (A) and GC skew score (B).

To further investigate the sequence patterns of ORIs, MEME (Multiple EM for Motif Elicitation; Bailey and Elkan, 1994) was used to discover consensus motifs in ORI sequences. We found that the consensus sequences are all AT-rich motifs. It has been reported that ORIs contain AT-rich elements for interactions with regulatory proteins (Reeves and Beckerbauer, 2001; Takahashi et al., 2003). Previous research demonstrated that the information encoded in the high AT content can be recognized by the Orc4 subunit of ORC (Mojardin et al., 2013). This can be attributed to the enrichment of the ACS around ORIs in S. cerevisiae, which is an AT-rich motif that contains the binding site for ORC. Earlier research also revealed that a conspicuous feature of a replication regulatory protein was the presence of nine AT-hook domains in its amino terminus (Chuang and Kelly, 1999), which were essential for the binding of ORC to ORIs.

However, the GC skew in panel B displays a different trend. The GC skew score in the core ORI regions was significantly lower than that in the upstream regions (P < 2.3 × 10⁻⁸⁰, Mann–Whitney U-test) but higher than that in the downstream regions (P < 5.0 × 10⁻⁴⁰, Mann–Whitney U-test).
We noticed that the GC skew score switched from positive to negative at position 0, which corresponds to the DNA replication initiation site. In bacterial genomes, GC skew changes sign at the boundaries of the two replichores, which correspond to the DNA replication origin or terminus (Lobry, 1996a; Necsulea and Lobry, 2007). Thus, our finding suggests that the S. cerevisiae genome may have a replication mechanism that is similar to that of bacterial genomes.
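The averaged curves described in this section are obtained by aligning a per-position score on every ORI and averaging across ORIs at each offset. A minimal sketch of that aggregation is given below; the function name `average_profile`, the use of a plain list as the score track, and the choice to skip ORIs whose window runs off the end of the track are assumptions of this sketch, and strand orientation is not handled.

```python
def average_profile(track, centers, flank=300):
    """Average a per-position score over windows centred on each ORI.

    track   : list of numbers, one score per genomic position
    centers : 0-based positions treated as offset 0 (the ORI)
    flank   : number of positions kept on each side of the centre
    Returns a list of length 2 * flank + 1; index `flank` is offset 0.
    Centres whose window does not fit inside the track are skipped.
    """
    width = 2 * flank + 1
    sums = [0.0] * width
    used = 0
    for c in centers:
        lo, hi = c - flank, c + flank + 1
        if lo < 0 or hi > len(track):
            continue
        for offset, value in enumerate(track[lo:hi]):
            sums[offset] += value
        used += 1
    if used == 0:
        raise ValueError("no window fits inside the track")
    return [s / used for s in sums]
```

A sign change of the averaged GC skew around index `flank` would correspond to the switch from positive to negative values at position 0 reported above.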

CORRELATION OF NUCLEOTIDES SURROUNDING ORIs

Based on Eq. 3, we calculated the information redundancies $D_{k+2}$ of ORI sequences. The average values are illustrated in panel A of the figure below. For most ORI sequences, the main maximum is located at $D_2$. This result demonstrates that $D_2$ is the maximum among all considered $D_{k+2}$ (k = 0, 1, …, 48), indicating that ORI sequences have a short-range dominance of base correlations. Subsequently, we calculated $D_2$ in a 150 bp window with a step of 1 bp for ORI sequences. As shown in panel B, a peak near the ORIs and two valleys flanking the ORIs were observed, suggesting that ORI sequences have very strong short-range correlations. It has been reported that $D_2$ is correlated with evolutionarily active regions (Du et al., 2006). As a special region in the replication process, ORIs have a high probability of deletion, insertion, and mismatch (Umar and Kunkel, 1996). Thus, the evolutionary force reflected by the $D_2$ constraint indicates the diversity of ORI sequences. However, the evolutionary mechanism of fungal ORIs needs further investigation.

Figure: (A) Average $D_{k+2}$ vs. k+2 for the ORI sequences. The horizontal axis represents the gap k+2. The vertical axis represents the value of $D_{k+2}$. (B) The distribution of $D_2$ surrounding ORIs. The horizontal axis represents the nucleotide position, which ranges from -300 bp to +300 bp relative to ORIs (denoted as 0). The vertical axis represents the value of $D_2$.

DISTRIBUTION OF ORIs IN THE GENOME

It is widely accepted that functional regions are not randomly distributed in the genome (Zhang et al., 2007). Based on this hypothesis, we statistically analyzed the distribution of ORIs in the yeast genome. First, we investigated the positional relationship between ORIs and nucleosomes. Nucleosomes are the elementary units of chromatin organization and are composed of a ∼147 bp stretch of DNA that is tightly wrapped around a histone core (Richmond and Davey, 2003; Segal et al., 2006). Nucleosome positioning affects nearly every cellular process that requires protein access to genomic DNA (Lee et al., 2007; Kaplan et al., 2009). Thus, it is worth studying the nucleosome occupancy around ORIs. To examine the distribution of nucleosomes around ORIs, we selected regions from -1000 to +1000 bp flanking ORIs and then mapped the nucleosomes in these regions. The average nucleosome occupancy scores surrounding ORIs in vitro and in vivo (ethanol, YPD, and galactose) are shown in the figure below. The nucleosome occupancies around ORIs display a similar tendency in vitro and in vivo: the nucleosome occupancy scores in ORIs are significantly lower than those in flanking regions, indicating that ORIs tend to lie in nucleosome-free regions. This result can be explained as follows: once DNA is wrapped around the histone core, it is difficult for regulatory proteins to access the region, which makes it difficult to open the DNA double helix (Kass and Wolffe, 1998).

Figure: Nucleosome occupancy around ORIs. The black curve represents the in vitro data. The red, blue, and green curves represent in vivo experimental maps for three growth conditions (ethanol; yeast extract, peptone, and dextrose [YPD] medium; and galactose).

Gene transcription also requires the opening of the DNA double helix. Thus, there may be coupling effects between ORIs and promoters.
In fact, several studies have focused on replication–transcription interactions (Rocha, 2004; Sequeira-Mendes and Gomez, 2012; Helmrich et al., 2013; Lubelsky et al., 2014). Here, the distance between ORIs and TSSs in the yeast genome was calculated. In 31.46% of cases, the distance between the ORI and the TSS was less than 500 bp. These promoters are also AT-rich sequences (Lee et al., 2001). Thus, these promoters might share elements with ORIs. Origins of replication are associated with bias in gene density (Necsulea et al., 2009). To further investigate the relationship between replication and transcription, we analyzed the distribution of ORIs in three kinds of intergenic regions. Based on the orientations of adjacent gene pairs in the GenBank database, we obtained 2770 tandem, 1514 divergent, and 1497 convergent intergenic regions. Tandem and divergent intergenic regions usually contain promoters; in particular, each divergent intergenic region has two promoters for the transcription of two genes, whereas no promoter appears in convergent intergenic regions. By mapping ORIs to these regions, we found that 12.9% of ORIs are located in convergent regions, 25.1% are located in tandem regions, and 12.9% are located in divergent regions. The remaining ORIs (about 46.8%) overlap with coding regions, including 16.3% that are found in the tail of coding regions and 6.6% that are in the head of genes. These results suggest that most ORIs are not biased toward transcription start regions, which may help coordinate replication and transcription.
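The three kinds of intergenic regions described above follow directly from the strands of each adjacent gene pair. The sketch below shows one way to derive them from a list of gene coordinates on a single chromosome. It is an illustration under stated assumptions, not the pipeline used in the study: the function name `classify_intergenic` and the `(start, end, strand)` tuple layout are hypothetical, and overlapping neighbours are simply skipped.

```python
def classify_intergenic(genes):
    """Classify the gaps between adjacent genes on one chromosome.

    genes : iterable of (start, end, strand) with strand '+' or '-'
    Returns a list of (gap_start, gap_end, kind), where kind is
      'tandem'     : both genes on the same strand (one promoter in the gap)
      'divergent'  : '-' gene followed by '+' gene (two promoters in the gap)
      'convergent' : '+' gene followed by '-' gene (no promoter in the gap)
    Adjacent genes that overlap leave no gap and are skipped.
    """
    ordered = sorted(genes)
    regions = []
    for (s1, e1, strand1), (s2, e2, strand2) in zip(ordered, ordered[1:]):
        if s2 <= e1:
            continue  # overlapping pair: no intergenic region
        if strand1 == strand2:
            kind = "tandem"
        elif strand1 == "-" and strand2 == "+":
            kind = "divergent"
        else:
            kind = "convergent"
        regions.append((e1, s2, kind))
    return regions
```

An ORI can then be assigned to a category by checking which returned interval contains its position; ORIs that fall in no interval overlap an annotated gene.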

PREDICTION OF ORIs

The aim of the above statistical analysis was to understand the replication initiation mechanism and to provide information for ORI prediction. Thus, we evaluated the prediction accuracies of the GC profile, GC skew, information redundancy $D_2$, and nucleosome occupancy in discriminating ORIs from non-ORIs using a support vector machine. Here, 300 bp of each ORI was selected as the positive set, while the 300 bp upstream of each ORI was extracted as the negative set. The 10-fold cross-validated results are recorded in Table 1. The nucleosome occupancy features predict ORIs more accurately than GC skew and $D_2$, and comparable accuracy was obtained with the GC profile. However, these results are still far from satisfactory. The GC profile, GC skew, and $D_2$ features are based on nucleotide composition, in which little sequence-order effect is considered. In the future, we will consider the sequence-order effect to improve the prediction quality.
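The Sn, Sp, and Acc columns of Table 1 are not defined in the text; the sketch below assumes the standard definitions of sensitivity, specificity, and overall accuracy computed from the counts of true and false predictions. The function name `performance` and the example counts are hypothetical.

```python
def performance(tp: int, fn: int, tn: int, fp: int):
    """Return (Sn, Sp, Acc) from confusion-matrix counts.

    Sn  = TP / (TP + FN)   fraction of ORIs predicted as ORIs
    Sp  = TN / (TN + FP)   fraction of non-ORIs predicted as non-ORIs
    Acc = (TP + TN) / all  fraction of all samples predicted correctly
    These standard definitions are assumed, not taken from the text.
    """
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fn + tn + fp)
    return sn, sp, acc


# Hypothetical counts for a balanced set of 100 positives and 100 negatives.
sn, sp, acc = performance(tp=80, fn=20, tn=70, fp=30)
```

When the positive and negative sets have the same size, as with one upstream segment per ORI, Acc equals the mean of Sn and Sp; the GC profile row of Table 1 is consistent with this, since (0.7605 + 0.7728) / 2 ≈ 0.7667.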

CONCLUSION

Despite several studies focusing on DNA replication, the mechanism of replication initiation remains elusive. This study focused on the ORIs of S. cerevisiae and systematically analyzed the sequences surrounding ORIs. We found that the sequence around ORIs had a lower GC profile score and a higher nucleotide correlation than the sequence in flanking regions. DNA replication is a highly regulated process that relies on interactions between regulatory proteins and DNA sequences. The AT-rich motif is easily recognized by ORC. By studying the distribution of ORIs in genomes, we found that DNA replication initiation usually occurs in nucleosome-free regions. The short distance between ORIs and TSSs suggested that the expression of genes may be influenced by DNA replication. We expect that the observed properties of ORIs in this work will influence research related to ORIs and provide novel insights into regulatory mechanisms of DNA replication.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Table 1

Predicted results of different parameters using a support vector machine.

Method	Performance evaluation
	Sn	Sp	Acc
GC profile	0.7605	0.7728	0.7667
GC skew	0.6247	0.5778	0.6012
D₂	0.5309	0.5704	0.5506
Nucleosome (in vitro)	0.7448	0.7575	0.7511
Nucleosome (ethanol)	0.7071	0.7840	0.7456
Nucleosome (YPD)	0.7567	0.7811	0.7689
Nucleosome (galactose)	0.7485	0.7910	0.7697

REFERENCES

1. A divide-and-conquer approach to fragment assembly.

Authors: Hasan H Otu; Khalid Sayood
Journal: Bioinformatics Date: 2003-01 Impact factor: 6.937

2. New insights into replication origin characteristics in metazoans.

Authors: Christelle Cayrou; Philippe Coulombe; Aurore Puy; Stephanie Rialle; Noam Kaplan; Eran Segal; Marcel Méchali
Journal: Cell Cycle Date: 2012-02-15 Impact factor: 4.534

3. Genome-scale identification of nucleosome positions in S. cerevisiae.

Authors: Guo-Cheng Yuan; Yuen-Jong Liu; Michael F Dion; Michael D Slack; Lani F Wu; Steven J Altschuler; Oliver J Rando
Journal: Science Date: 2005-06-16 Impact factor: 47.728

4. The impact of nucleosome positioning on the organization of replication origins in eukaryotes.

Authors: Shanye Yin; Wenjun Deng; Landian Hu; Xiangyin Kong
Journal: Biochem Biophys Res Commun Date: 2009-05-20 Impact factor: 3.575

5. An analysis and prediction of nucleosome positioning based on information content.

Authors: Yong-qiang Xing; Guo-qing Liu; Xiu-juan Zhao; Lu Cai
Journal: Chromosome Res Date: 2013-02-22 Impact factor: 5.239

6. Transcription-replication encounters, consequences and genomic instability. (Review)

Authors: Anne Helmrich; Monica Ballarino; Evgeny Nudler; Laszlo Tora
Journal: Nat Struct Mol Biol Date: 2013-04 Impact factor: 15.369

7. A high-resolution atlas of nucleosome occupancy in yeast.

Authors: William Lee; Desiree Tillo; Nicolas Bray; Randall H Morse; Ronald W Davis; Timothy R Hughes; Corey Nislow
Journal: Nat Genet Date: 2007-09-16 Impact factor: 38.330

8. Base composition skews, replication orientation, and gene orientation in 12 prokaryote genomes.

Authors: M J McLean; K H Wolfe; K M Devine
Journal: J Mol Evol Date: 1998-12 Impact factor: 2.395

9. DNA-replication fidelity, mismatch repair and genome instability in cancer cells. (Review)

Authors: A Umar; T A Kunkel
Journal: Eur J Biochem Date: 1996-06-01

10. OriDB, the DNA replication origin database updated and extended.

Authors: Cheuk C Siow; Sian R Nieduszynska; Carolin A Müller; Conrad A Nieduszynski
Journal: Nucleic Acids Res Date: 2011-11-24 Impact factor: 16.971
