Literature DB >> 25349679

Investigating genomic structure using changept: A Bayesian segmentation model.

Abstract

Genomes are composed of a wide variety of elements with distinct roles and characteristics. Some of these elements are well-characterised functional components such as protein-coding exons. Other elements play regulatory or structural roles, encode functional non-protein-coding RNAs, or perform some other function yet to be characterised. Still others may have no functional importance, though they may nevertheless be of interest to biologists. One technique for investigating the composition of genomes is to segment sequences into compositionally homogenous blocks. This technique, known as 'sequence segmentation' or 'change-point analysis', is used to identify patterns of variation across genomes such as GC-rich and GC-poor regions, coding and non-coding regions, slowly evolving and rapidly evolving regions and many other types of variation. In this mini-review we outline many of the genome segmentation methods currently available and then focus on a Bayesian DNA segmentation algorithm, with examples of its various applications.

Entities: CellLine Chemical Disease Gene Species

Keywords: Bayesian modelling; Conservation levels; GC content; Generalised Gibbs sampler; Non-coding RNA; Sequence segmentation

Year: 2014 PMID： 25349679 PMCID： PMC4204429 DOI： 10.1016/j.csbj.2014.08.003

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN： 2001-0370 Impact factor: 7.271

Role of genome segmentation

Identifying the distinct components of the human and other genomes is a core task in current bioinformatics, and a necessary pre-requisite to a full understanding of the connections between genomes and phenotypes. Yet the annotation of complex eukaryotic genomes is still far from complete. Even the proportion of the genome that performs biological functions is still hotly debated, with estimates varying from 5% [1] to 80% [2]. Whatever the true figure may be, it is clear that a vast amount of the biology underlying the structure of genomes remains to be discovered. Bioinformatics has an important role to play in this endeavour, and one of its tasks is to identify segments of the genome representing elements that require annotation.

Segmentation methods

Several techniques have been developed to analyse variation in properties of interest across a genome and to provide clues to the nature of its components. In this article we review some of the most widely used segmentation methods and discuss the main ideas behind each technique.

Sliding window analysis

Although not technically a segmentation method, ‘sliding window analysis’ is the most commonly used way to profile variation in a property of interest across a genome. This technique involves averaging the property of interest over a sliding window of a predetermined length along the sequence. For example if the window size is 10, the first point is obtained by averaging the property of interest over nucleotides 1–10, the second point is the average over nucleotides 2–11, and so on. Determining the window size can be crucial: a smaller window allows for a more precise localisation of changes, however this can increase the noise. Tajima in 1991 has proposed an algorithm to determine window size [3]. The main drawback of the sliding window analysis is that it does not identify boundaries where statistically significant changes to the property in question occur. To avoid some of the disadvantages of the sliding window approach, a windowless technique based on the Z curve was introduced to analyse GC content of genomic sequences [4]. This method enables calculation of GC content at any resolution, even at a base position. Some applications of the sliding window analysis can be found in papers [5-16].

Hidden Markov models

More precise segmentation methods have been developed to identify homogenous segments as well as the locations (change-points) at which sharp changes in a particular property of interest occurs. Hidden Markov models (HMMs) are one approach capable of inferring segment boundaries. The HMM methodology is well-established, dating from the 1950s [17]. In these models, the observed sequence is considered to be composed of segments, with the sequence of each segment generated by a Markov process. The transition probabilities for each segment are determined by a hidden state, and transitions between hidden states occur at segment boundaries. The sequence of hidden states is also modelled as a Markov process. A key parameter of an HMM is the order of the Markov chain, that is, the number of preceding sequence positions required to condition the transition probabilities of the observed sequence. This is unknown a priori, and usually needs to be specified, although some approaches are able to infer the order, or determine it adaptively. HMMs were first used in biological sequence analysis by Churchill [18,19]. The parameters of the model, including segment boundaries, were estimated by using the maximum likelihood method based on the expectation–maximisation (EM) algorithm [20]. HMMs have since been widely used for sequence analysis problems in bioinformatics, and an extensive literature now exists. Two important developments were the 1998 GeneMark.hmm algorithm which used an HMM to find exact gene boundaries [21] and an HMM developed by Peshkin and Gelfand in 1999 to segment yeast DNA sequences [22]. Some other important examples are included in [23-29]. The Sarment package of Python modules built by Gueguen for easy building and manipulation of sequence segmentations uses both sliding window and HMM methods [30]. HMM models have also been implemented from a Bayesian perspective. One advantage of adopting a Bayesian approach is that it provides quantification of the uncertainties in parameter estimates in the form of probability distributions. In fact, one can dispense with point estimates of parameters altogether, instead reporting marginal distributions for key parameters, such as the locations of change-points. Boys et al. in 2000 presented a Bayesian method of segmentation using HMMs when the number of segments is known [31] and later generalised this method for an unknown number of segments [32]. In 2006, the segmentation method developed by Kedzierska and Husmeier was a combination of the sliding window analysis and the Bayesian HMM [33]. Nur and co-workers in 2009 performed sensitivity analysis on priors used in the Bayesian HMM to show the impact of prior choice on posterior inference [34]. One challenge for Bayesian HMM approaches is that they are computationally intensive and are typically infeasible for segmenting large-scale sequences, without simplifying heuristics.

Multiple change-point analysis

This approach arose independently of HMMs, and has an extensive literature dating back to the 1970s [35,36]. Change-point analysis differs from HMMs in that it typically assumes no Markov dependence in either the observed sequence or the underlying sequence of hidden states. In this sense change-point models are simpler than HMMs, and have fewer parameters. However, the two types of analysis are clearly related, and it may be useful to think of change-point models as zeroth order HMMs. A key advantage of change-point models, due to their simplicity, is their reduced computational burden, a point which is of particular relevance when implementing them within a Bayesian framework. The use of multiple change-point models in bioinformatics was pioneered by Liu and Lawrence in 1999, using a Bayesian framework [37]. In 2000, Ramensky et al. developed a similar method which uses a Bayesian estimator to measure the degree of homogeneity in segmentation [38]. In this method, optimal segmentation is obtained by maximising the likelihood function using the dynamic programming technique presented in [39]. After completion, the partition function approach is used to obtain segmentation with longer segments by filtering the boundaries. In contrast to the approach of Liu and Lawrence, this method does not use probability distributions for segment boundaries and does not use sampling. A related method is presented in [40], which uses reversible jump Markov chain Monte Carlo (RJMCMC) sampling method to estimate posterior probabilities [41]. In contrast to Liu and Lawrence, they have used Poisson intensity models as the underlying model (as opposed to multinomial likelihood). The method has been tested by applying to modelling the occurrence of ORFs along the human genome. Another Bayesian model can be found in [42]. The method on which we focus in the main part of this article [43,44] is also of this type. The method can be described as a segmentation–classification model as it not only detects change-points but also groups segments based on their sequence characteristics. The group to which a segment belongs is essentially a hidden state, in the terminology of HMMs, and the classification is unsupervised, in the terminology of machine learning. There are two main innovations in this method. The first is that the character frequencies (emission probabilities) for a given segment are not constant for all segments in a group. Instead, the character frequencies are drawn from a Dirichlet distribution specific to the group to which that segment belongs, and it is the parameters of this distribution that characterise the group. There is thus an additional layer to this hierarchical model, and this layer is another characteristic distinguishing the model from HMMs. Allowing variation in the character frequencies for segments in a group means that this model can be used to dissect multi-modal distributions of properties of interest, a central feature in recent applications [45,46]. The second innovation in this method is the use of the Generalised Gibbs Sampler (GGS) [47], a new technique in Markov chain Monte Carlo simulation. The GGS provides highly efficient sampling from a varying dimensional space (important here as the number of change points is variable).

Recursive segmentation method

The recursive segmentation method finds segment boundaries that maximise the difference in base compositions between adjacent segments with respect to some predefined compositional measure (Jensen–Shannon divergence — DJS). The process is repeated until further segmentation of sequence segments produces no statistically significant improvements. The recursive segmentation method has been widely applied to segmentation problems such as isochore detection or detection of CpG islands [48-52]. More recent applications include locating borders between coding and non-coding regions of bacteria genomes [53] and in developing IsoPlotter: a tool for studying the compositional architecture of genomes [54]. The recursive segmentation method presented in [55] is significant in that it does not require specification of the number of segment classes (something most of the other methods require). This method has been successfully used to identify alien DNAs in bacterial genomes, detect structural variants in cancer cell lines and perform alignment-free genome comparisons.

Other segmentation methods

Methods based on least squares estimation [56] and wavelet analysis [57] have also been used. Sequential importance sampling (SIS) [58], the cross-entropy method [59] and the Bayesian adaptive independence sampler [60] have also been used to find segment boundaries and parameters of the process in each segment. Olshen et al. developed the circular binary segmentation method (CBS) in 2004 for the analysis of array-based comparative genomic hybridisation (array-CGH) data [61]. CGH (comparative genomic hybridization) is a technique for measuring DNA copy numbers at thousands of locations on a genome. The modification of conventional CGH to obtain high resolution data is called array-CGH. The variation in DNA copy number is often used to identify cancer progression. The CBS algorithm divides the genome into regions of equal DNA copy number and identifies the genomic locations of copy number transitions (change-points). In 2007, changes were made to the original CBS algorithm to enhance the speed by introducing, (1) a hybrid approach for the computation of the p-value and (2) a stopping rule for early identification of change-points [62]. In 1996, Tibshirani proposed a new method called ‘lasso’ (least absolute shrinkage and selection operator) for estimation in regression models, which involves constraining the sum of the absolute values of the regression coefficients [63]. This produces some coefficients that are exactly zero and hence gives interpretable models. In 2006 ‘fused lasso’ — a generalisation of ‘lasso’ — was introduced to handle problems with features that can be ordered in some meaningful way [64]. The fused lasso penalises the sum of the absolute values of the coefficients and their successive differences. The method was applied along with the CBS method to estimate the copy number alterations in breast tumour data (CGH data of breast cancer cell line MDA157) [65]. CBS had difficulties in detecting change points whose alteration signals are weak (chromosome 7 and 15 of the selected cell line), but the fused lasso successfully recognised various copy number alterations. Besides identifying gains and losses in CGH data, the fused lasso can also be generalised to other analysis; for example, understanding the interactions between copy number alternations and mRNA expression levels. Determining the number of change-points is an important aspect of change-point analysis. In 2007, Zhang et al. proposed the modified Bayes Information Criterion (BIC) as a model selection procedure for array-CGH data analysis [66]. The first term of the modified BIC is similar to the classic BIC (consisting of the log likelihood), but it differs in the terms that penalise for model dimension. One of the advantages of using the modified BIC is that it does not require a specific prior or tuning parameters, but it can only be applied to normally distributed, uncorrelated and homoscedastic data. However the modified BIC is not limited to the analysis of array-CGH data. Some other methods that adaptively determine the number of change-points can be found in [41,46,67]. The multi-scale segmentation method developed by Futschik and co-workers also estimates the number of segments and their boundaries simultaneously [68]. One advantage of this method is that it does not require distributional assumptions regarding the lengths of segments. Another feature is that this method is able to choose an appropriate number of segments with user specified probability 1 − α. Many early statistical segmentation methods were reviewed in [69]. Elhaik et al. reviewed the performance of seven recent algorithms by segmenting human chromosome 1 based on variability of GC content [70].

Changept analysis

In the remainder of this mini-review, we focus on the changept program developed by Keith et al. [43,44]. This is a Bayesian multiple change-point algorithm capable of simultaneously segmenting a genomic alignment and classifying segments into one of a predefined number of segment classes. Segments can be classified according to multiple properties including level of evolutionary conservation between species, GC content and transition/transversion ratio. Program readcp is a part of the changept package that takes the outputs produced by changept and estimates, for each genomic position, the probability that genomic position belongs to each segment class. The package uses a highly efficient sampling technique known as the Generalised Gibbs Sampler [47] resulting in a highly efficient algorithm that enables chromosome or even genome-wide analysis. The algorithm can be used to segment a genomic alignment based on a single property of interest or multiple properties. There is no limit on number of aligned species.

Method

Transforming alignment

Segmentation based on a single property of interest (e.g.: conservation level)

Suppose we want to segment a pairwise alignment of size L based on the degree of conservation between two species. The first step is to convert the alignment into a binary sequence by replacing the alignment columns in which two DNA sequences match with a ‘1’ and replacing columns in which they mismatch with a ‘0’. The gaps between alignment blocks are marked by a ‘#’ symbol and these are considered as fixed change-points by the model. The indels (alignment gaps) in the reference species are not encoded while indels in other species are encoded using letter ‘I’ which will be excluded from the final analysis of the sequence. The binary sequence generated in this way is used as the input for the program changept.

Segmentation based on multiple properties (e.g.: conservation level, GC content and transition/transversion ratio)

In segmenting a pairwise alignment based on more than one property of interest, one possibility is to use a 16-character representation (A = (a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p) to encode the alignment (Table 1).

Table 1

16-character representation used to encode a pairwise alignment.

Species 1

Species 2

Symbol

In the case of a 3-way alignment, a 32-character representation (A = (a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, U, V, W, X, Y, Z) is used to transform the alignment into the changept input sequence. Table 2 depicts the possible encoding. Indel positions in Species 2 and Species 3 are encoded using letter ‘I’ which will be excluded from the final analysis.

Table 2

32-character representation used to encode a 3-way alignment.

Species 1

Species 2

Species 3

Symbol

In the 3-way alignment, alignment columns with complementary bases were encoded using the same characters. For example: Species 1 ‘A’, Species 2 ‘A’, Species 3 ‘A’ = Species 1 ‘T’, Species 2 ‘T’, Species 3 ‘T’ = ‘a’ Species 1 ‘A’, Species 2 ‘A’, Species 3 ‘C’ = Species 1 ‘T’, Species 2 ‘T’, Species 3 ‘G’ = ‘b’. In the 16-character representation, the symbols ‘a’, ‘f’, ‘k’ and ‘p’ represent the conserved bases in the alignment. Similar information is represented by symbols ‘a’ and ‘v’ in the 32-character representation. Both input sequences also contain other biologically significant information such as GC content in species and transition/transversion ratio. For example in the 16-character representation, symbols from ‘e’ to ‘l’ correspond to ‘C’ or ‘G’ content in Species 1 and similar information is represented by symbols from ‘q’ to ‘Z’ in the 32-character representation. In the case of more than 3 aligned species, we have proposed two methods that can be used to transform an alignment. The first method is known as ‘maximum frequency transformation’ in which a score is assigned for each alignment column equivalent to the maximum number of nucleotides that are identical. The second method uses Fitch's algorithm [71] to compute Parsimony score — the smallest number of mutations along the evolutionary tree. See [45] which uses both methods in transforming a 4-way alignment into the changept input sequence.

Modelling

The complete model is presented in [43,44]. Here we only present the main idea behind the model. The process of Bayesian modelling consists of 3 main steps [72]: (1) set up a joint probability distribution for all the variables in a problem; (2) calculate posterior distribution — the conditional probability distribution of the unobserved parameters of interest, given the observed data; (3) evaluate the model. Step (1) starts with writing down the likelihood function of the model, i.e. probability of the observed quantities given unknown parameters. This describes the stochastic process by which sequences are generated, and consequently it quantifies the probability of generating the observed sequence for any given parameter values. In writing down the likelihood function of our model, we denote the probability of starting a new segment by ϕ, the number of fixed change-points by k′ and the total number of change-points (including fixed change-points) by k. The positions of change-points are denoted by C = (c1, c2, …, c). We set c0 = 1. For each position in the sequence, except for the first position and those immediately following a fixed change-point (marked by ‘#’s), a decision has to be made whether to start a new segment. Thus the probability of generating a segmentation with k change-points at C = (c1, c2, …, c) positions is given by:where L is the length of the sequence S. Each segment is then assigned to one of ω conservation classes. Let π denotes the probability of assigning a segment to class t. We denote the class to which segment i is assigned by g ∈ {0, 1, …, ω − 1} and let g = (g0, g1, …, g). The probability that x0 segments are assigned to class 0, x1 segments are assigned to class 1, …, x segments are assigned to class ω − 1 is: In the case of the binary representation of the sequence S, let θ represent the probability of generating a ‘1’ in each position of segment i in class t. Each θ is independently drawn from the following beta distribution with unknown parameters α0( and α1(. Here θ = (θ0, θ1, …, θ), α( = (α0(, α1() and α = (α(0), α(1), …,α(). This can be generalised when S represents the alignment formed using a finite alphabet {1, …, D} (D-character representation). Let θ represent the probability of generating character j in segment i = 0, …, k. We denote Θ = (θ, …, θ). Then for each segment i in class g, Θs are drawn from a Dirichlet distribution p(Θ| α, g) with parameter vector α = (α1(, …, α() for each class. The binary sequence within each segment i is generated by independent Bernoulli trials at each position in the segment. Thus the probability that segment i contains specific sequence S including m number of ‘0’s and n number of ‘1’s is given by:where L = c − c is the length of segment i. In using the D-character representation, we assume that within each segment, the sequence is generated by independent trials with D possible outcomes. Let m be the number of times character j appears in segment i. Thus the likelihood of an observed DNA sequence can be written as: The final sequence is obtained by concatenating sequences S0,…., S. Therefore the joint distribution of parameters k, c, g, θ and S is given by: The prior probabilities assigned to parameters ϕ, π and α are given in [44]. Using Bayes theorem, integrating over ϕ and θ, and summing over g, the following posterior distribution is obtained:where In the case of the D-character representation, the posterior distribution is given by: Here p(ϕ), p(α) and p(π) denote the prior probabilities assigned to parameters ϕ, α and π [43]. In simplifying further, it is possible to integrate the above equation over ϕ and Θ and to take sum over g to obtain the posterior distribution of p(k, C, α, π|S). Fig. 1 shows the parameter dependencies of the model.

Fig. 1

Parameters of the changept model and their conditional dependencies. The parameter at the head of the arrow is conditionally dependent on the parameter at the tail.

Sampling

The posterior distribution is sampled using the Generalised Gibbs Sampler (GGS), a Markov chain Monte Carlo technique [47]. Unlike the conventional Gibbs sampler, the GGS takes into account the fact that the number of change-points is varying and thus provides an alternative to the reversible-jump sampler [41]. It cycles through each segment and either inserts a change-point, deletes a change-point or updates the change-point positions. These different types of updates are referred as ‘move-types’ which are analogous to the coordinate updates of the conventional Gibbs sampler. Once the alignment is transformed into the changept input sequence, it is then run through the program changept (source code is available upon request) to produce a user specified number of samples. The next step of changept analysis is to check if convergence to the limiting distribution has occurred. This is most commonly assessed by inspecting a time-series plot of the log-likelihood against the sample number. The same plot is used to decide the length of the ‘burn-in’ period. Changept currently requires the user to specify the number of segment classes (T). Selecting the model with the most appropriate number of classes can be done by using either of the following methods: (1) investigating AIC, BIC and DICV plots [67]; and (2) investigating the stability of each segment class [46]. The final model is then run through the program readcp to calculate profile values. The profile shows the probability that each position in the input sequence belongs to one of the segment classes in the selected model. These posterior probabilities are estimated using Monte Carlo integration. These outputs (a profile file for each segment class in the final model) are used to generate WIG/BED files that can be uploaded to a genome browser (e.g. http://genome.ucsc.edu/) for viewing gene-related information. This workflow is illustrated in Fig. 2 and a full description of how to use changept and readcp can be found in [73].

Fig. 2

The changept workflow. This figure illustrates the sequence of steps generally followed in analysing a set of DNA sequences by using the program changept. In step 3, T represents the number of segment classes specified by the user.

Applications of changept

In this section we discuss several applications of program changept. These can be categorised into sub-headings: Investigate segmentation patterns of genomic regions Identify alternatively spliced exons Identify putative transcription factor binding sites (TFBS) Identify putative non-coding RNAs Identify rapidly evolving genomic regions. In each sub-heading we provide examples to illustrate the performance of the program changept.

Investigate segmentation patterns of genomic regions

This section summarises the results of [46]. The program changept was applied to three possible pairwise alignments of 3′UTR among three closely related Drosophila species: Drosophila melanogaster, Drosophila simulans and Drosophila yakuba. We also segmented three randomly selected portions of the alignment of D. melanogaster to D. simulans protein-coding sequences of the same length as the 3′UTR alignment of that pair. This was required as the number of segment classes detectable is sensitive to the length of the changept input sequence. These alignments were obtained from http://genomics.princeton.edu/AndolfattoLab/Andolfatto_Lab.html. Each pairwise alignment is encoded using an 8-character representation (Table 3) that captures degree of conservation between two species, GC content and transition/transversion ratio.

Table 3

8-character representation used to encode a pairwise alignment.

Species 1

Species 2

Symbol

In order to select the optimal number of segment classes for each alignment, we performed separate segmentation analysis using models with 1–20 segment classes (T = 1,.., 20). After assessing stability of segment classes in each model of 3'UTRs, we selected the 15-class model for the D. melanogaster versus D. simulans alignment, the 16-class model for the D. melanogaster versus D. yakuba alignment and the 15-class model for the D. simulans versus D. yakuba alignment. Further we selected the 12-class model for the D. melanogaster versus D. simulans two protein-coding sequences (coding 1 and coding 2) and the 14-class model for the third protein-coding sequence (coding 3). The figure (Fig. 3) shows the segmentation patterns of each of the alignments based on the conservation levels between two species and the GC content of the first species in each pair. It can be seen that segment classes identified in D. melanogaster versus D.yakuba (Fig. 3C) and D. simulans versus D. yakuba (Fig. 3D) 3'UTR alignments have very similar characteristics. Although classes detected in the 3′UTR alignment of D. melanogaster versus D. simulans (Fig. 3A) show a similar pattern, corresponding classes appear to be compressed towards the right of the figure (i.e. higher conservation levels). This must be due to the shorter evolutionary distance between D. melanogaster and D. simulans. By contrast, the classes shown in Fig. 3B, representing the first coding sequence alignment of D. melanogaster versus D. simulans, exhibit a pattern distinct from the other three, making it difficult to identify class correspondences.

Fig. 3

GC content versus conservation level for selected models. GC content (in the first named species of each pair) versus the proportion of alignment matches, for each model is shown. The different colours represent different classes, and each class is plotted for the post burn-in samples; A) 15-class model for the D. melanogaster versus D. simulans 3′UTR alignment, B) 12-class model for the D. melanogaster versus D. simulans first coding sequence (coding 1) alignment, C) 16-class model for the D. melanogaster versus D. yakuba 3′UTR alignment and D) 15-class model for the D. simulans versus D. yakuba 3′UTR alignment.

Table 4 summarises further evidence of distinct segmentation patterns of two genomic regions; 3′UTR and protein-coding.

Table 4

Segmentation characteristics of two genomic regions.

Alignment	Component	Model	No. of alignment columns	No. of fixed change-points	Posterior average no. of change-points	Posterior average length of segments
Dmea vs Dsib	3′UTR	15	2,678,635	9112	50,001	54
Dme vs Dyac	3′UTR	16	2,486,711	8622	53,051	47
Dsi vs Dya	3′UTR	15	2,481,568	8607	51,547	48
Dme vs Dsi	Coding 1d	12	2,680,987	6760	11,086	242
Dme vs Dsi	Coding 2d	12	2,681,121	6626	10,190	263
Dme vs Dsi	Coding 3d	14	2,681,284	6463	9982	268

Dme: D. melanogaster.

Dsi: D. simulans.

Dya: D. yakuba.

Coding 1, 2, 3: three different randomly selected protein-coding sequences.

According to these segmentation results (Table 4) it is clear that a greater number of segment classes is identified in Drosophila 3′UTR components compared to protein-coding regions. The number of change-points estimated in 3′ UTRs is nearly five times that estimated for coding sequence, and consequently the average segment length in 3′UTRs is about one fifth of that in the coding sequence. This evidence suggests that Drosophila 3′UTRs contain more numerous sub-units than protein-coding sequences.

Identify alternatively spliced exons

This example was extracted from work presented by Boyd SE and co-workers in segmenting a 3-way alignment (human, mouse and rat DNA sequences) of the GFAP gene [74]. Fig. 4 shows a section of the WIG file (uploaded to the UCSC genome browser) of the segment class that corresponds to regions of high conservation among human, mouse and rat of the GFAP gene. In general, the start and end points of the conserved features occur at or very close to the boundaries of the exons (e.g. exon 6 in right of the screen). In the case of exons 7 and 7a (as labelled), the conserved features do not terminate immediately after the end of the annotated exon boundaries. The conserved feature corresponding to exon 7 extends for 30 nucleotides into intron 7 and the feature corresponding to exon 7a begins 50 nucleotides upstream of the start of exon 7a.

Fig. 4

Conserved features across exon 6/7/7a of GFAP. This profile corresponds to the most conserved segment class of the 4-class model. The profile value shows the probability that the base at each position of the GFAP gene belongs to the most conserved class. Exons (wide bars), UTRs (narrow bars) and introns (arrowed lines) are shown for three genes in the UCSC collection and one in RefSeq. HSF1 and HSF2 mark the actual and possible acceptor sites identified by Human Splice Finder (scores 93.19 and 76.63 respectively).

To find the possible novel splicing sites associated with exon 7a, the human DNA sequence of the extended region has been submitted to the Human Splicing Finder server (http://www.umd.be/HSF/HSF.html). The HSF predicts a potential acceptor splice site located 40 nt upstream of the conserved region (marked by HSF2 in Fig. 4), supporting the hypothesis of a new splice variant of the GFAP gene.

Predict transcription factor binding sites (TFBS)

Identifying putative TFBS is yet another interesting application of the program changept. To test this, we selected the pairwise alignment (human versus mouse) of the SHH gene which contains experimentally identified regulatory elements within the upstream regulatory region [75]. We used LAGAN (http://lagan.stanford.edu/lagan_web/index.shtml) [76] to align the two DNA sequences. The alignment was encoded using the 16-character representation. Based on the investigation of DICV values, the 6-class model was selected for human and mouse 2-way alignment. Interestingly, for SHH, the positions of annotated exons were not identified as belonging to the most conserved segment class (90% conservation), rather they were identified to belong to the second most conserved class (85% conservation). Fig. 5 depicts the WIG profiles of these two most conserved segment classes.

Fig. 5

WIG profiles of the two most conserved segment classes of the SHH gene. The figure shows the profiles (uploaded to UCSC genome browser) of the two most conserved classes (90% and 85% conservation levels), as identified by the program changept applied to the 2-way alignment of human and mouse DNA sequences. The two rows below the 2nd most conserved class profile display the exons (wide bars), the UTRs (narrow bars) and the introns (thin lines) of the SHH gene recorded in the UCSC and RefSeq collections respectively. The grey vertical lines with value − 1 represent the gaps (insertions and deletions) in the original alignment as assigned by program readcp.

Features A and B (Fig. 6) are regions identified as belonging to the most conserved class. These regions have been experimentally identified as regulatory elements [75].

Fig. 6

Conserved regions correspond to TFBS in SHH identified by program changept. The profile shows the conserved features predicted by program changept in the upstream of SHH gene, genomic coordinates — chr7:155,604,884–155,605,370. The locations of TFBS are marked by red arrows.

This result confirms that regions predicted by changept (features A and B) are in appropriate locations for transcription factor binding. We are currently investigating the potential of changept for genome-wide detection of TFBS.

Identify putative non-coding RNAs

Non-coding RNA (ncRNA) is an RNA molecule that is not translated into a protein. It has been estimated that 98% of human genomic output is ncRNAs, however what proportion of ncRNAs are functional and the functions of many ncRNAs remain unknown [77]. The program changept can be used to identify highly conserved non-coding regions in genomes that are likely to be functional. To provide an example, we can use the WIG profiles of the two most conserved segment classes of SHH gene (Fig. 5). The top profile shows features that are even more conserved than the annotated protein-coding regions. Further, changept has predicted conserved features in the 2nd most conserved class that are equally conserved as exons. These highly conserved elements could contain either ncRNAs or regulatory sequences. In a recent project, we are working with biologists to investigate these and other putative ncRNAs identified using changept in a number of genomes.

Identify rapidly evolving genomic regions

The work presented in [44] provides an example for this changept application. To summarise the main findings, program changept has been applied on three whole-genome and three partial-genome pairwise alignments of eight Drosophila species. Three main classes of conservation level have been identified, comprising slowly evolving, rapidly evolving and intermediate segments. In a recent project, we are applying changept to three malaria species to identify genomic regions likely to be involved in the ability of the malaria parasite to infect their host species.

Summary

In this mini-review, we discussed various algorithms that can be used to segment genomic sequences. We also outlined the mathematics and methods of program changept, a Bayesian segmentation algorithm that is capable of segmenting an alignment while simultaneously classifying segments into different segment classes that share similar properties. We have demonstrated the effectiveness of this method through examples. The program changept can be used to identify putative functional elements in genomes such as non-coding RNAs, alternatively spliced exons and transcription factor binding sites. Other applications of program changept include identifying rapidly evolving genomic regions and inferring various segmentation patterns in genomic regions.

60 in total

Review 1. Detecting anomalous gene clusters and pathogenicity islands in diverse bacterial genomes.

Authors: S Karlin
Journal: Trends Microbiol Date: 2001-07 Impact factor: 17.079

2. A sliding window-based method to detect selective constraints in protein-coding genes and its application to RNA viruses.

Authors: Mario A Fares; Santiago F Elena; Javier Ortiz; Andrés Moya; Eladio Barrio
Journal: J Mol Evol Date: 2002-11 Impact factor: 2.395

3. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA.

Authors: Michael Brudno; Chuong B Do; Gregory M Cooper; Michael F Kim; Eugene Davydov; Eric D Green; Arend Sidow; Serafim Batzoglou
Journal: Genome Res Date: 2003-03-12 Impact factor: 9.043

4. IsoFinder: computational prediction of isochores in genome sequences.

Authors: José L Oliver; Pedro Carpena; Michael Hackenberg; Pedro Bernaola-Galván
Journal: Nucleic Acids Res Date: 2004-07-01 Impact factor: 16.971

5. Genomic regions exhibiting positive selection identified from dense genotype data.

Authors: Christopher S Carlson; Daryl J Thomas; Michael A Eberle; Johanna E Swanson; Robert J Livingston; Mark J Rieder; Deborah A Nickerson
Journal: Genome Res Date: 2005-11 Impact factor: 9.043

6. GC composition of the human genome: in search of isochores.

Authors: Netta Cohen; Tal Dagan; Lewi Stone; Dan Graur
Journal: Mol Biol Evol Date: 2005-02-23 Impact factor: 16.240

Review 7. Construction of phylogenetic trees.

Authors: W M Fitch; E Margoliash
Journal: Science Date: 1967-01-20 Impact factor: 47.728

4. A Bayesian model based computational analysis of the relationship between bisulfite accessible single-stranded DNA in chromatin and somatic hypermutation of immunoglobulin genes.

Authors: Guojun Yu; Yingru Wu; Zhi Duan; Catherine Tang; Haipeng Xing; Matthew D Scharff; Thomas MacCarthy
Journal: PLoS Comput Biol Date: 2021-09-07 Impact factor: 4.475

4 in total

Investigating genomic structure using changept: A Bayesian segmentation model.

Role of genome segmentation

Segmentation methods

Sliding window analysis

Hidden Markov models

Multiple change-point analysis

Recursive segmentation method

Other segmentation methods

Changept analysis

Method

Transforming alignment

Segmentation based on a single property of interest (e.g.: conservation level)

Segmentation based on multiple properties (e.g.: conservation level, GC content and transition/transversion ratio)

Modelling

Sampling

Applications of changept

Investigate segmentation patterns of genomic regions

Identify alternatively spliced exons

Predict transcription factor binding sites (TFBS)

Identify putative non-coding RNAs

Identify rapidly evolving genomic regions

Summary

Review 1. Detecting anomalous gene clusters and pathogenicity islands in diverse bacterial genomes.

2. A sliding window-based method to detect selective constraints in protein-coding genes and its application to RNA viruses.

3. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA.

4. IsoFinder: computational prediction of isochores in genome sequences.

5. Genomic regions exhibiting positive selection identified from dense genotype data.

6. GC composition of the human genome: in search of isochores.

Review 7. Construction of phylogenetic trees.

8. Hidden Markov models in computational biology. Applications to protein modeling.

9. An isochore map of human chromosomes.

10. Evidence for large domains of similarly expressed genes in the Drosophila genome.

1. Discovery of putative small non-coding RNAs from the obligate intracellular bacterium Wolbachia pipientis.

2. Genome-wide identification of conserved intronic non-coding sequences using a Bayesian segmentation approach.

3. Detection and identification of cis-regulatory elements using change-point and classification algorithms.

4. A Bayesian model based computational analysis of the relationship between bisulfite accessible single-stranded DNA in chromatin and somatic hypermutation of immunoglobulin genes.