
Note on DNA Analysis and Redesigning Using Markov Chain.

Maciej Zakarczemny¹, Małgorzata Zajęcka¹.

Abstract

The paper discusses mathematically modifying and redesigning DNA with the use of Markov chains. We give a simple mathematical technique for overwriting missing parts of DNA. With a certain probability (without even knowing the function of the missing codon) we can find a synonymous codon, so that there is no frequency change in the amino acid sequences of proteins. We use a Markov chain to analyze the dependencies in the DNA sequence of the human gene Alpha 1,3-Galactosyltransferase 2. We include a theoretical introduction which facilitates the understanding of the paper for non-mathematicians, especially for biologists not familiar with the theory of Markov chains.

Entities: Chemical

Keywords: DNA; Markov chain; human gene

Mesh：

Substances：
Codon
DNA

Year: 2022 PMID： 35328107 PMCID： PMC8953815 DOI： 10.3390/genes13030554

Source DB: PubMed Journal: Genes (Basel) ISSN： 2073-4425 Impact factor: 4.096

1. Introduction

1.1. Motivation and Methods

Modeling and analyzing DNA sequences with statistical methods has been a challenge for statisticians and biologists for many years. For much of that time, the most common approach was based on the theory of Markov chains. Well-known simple models appeared in the mid-1980s in the papers by B.E. Blaisdell [1] and by V. Brendel, J.S. Beckmann, and E.N. Trifonov [2]. They were followed by more advanced models developed to study different biological aspects of DNA (see [3,4,5]), which also used Markov chains (both homogeneous and non-homogeneous, possibly of higher orders). Some statistical results show that first-order Markov chains are not an adequate model for DNA sequences (see, e.g., [6]). Moreover, the latest comparison studies (see [7]) show that, in general, models based on even higher-order Markov chains do not fit DNA sequences perfectly. Thus the latest research on DNA sequences in bioinformatics focuses on deep learning methods (see [8]). The methods presented in this paper, however, are local rather than statistical. We apply our model to DNA sequences less than 55 base pairs long, which is too short for statistical methods.
Markov chain theory is also applied to modeling music and literature. For example, a random song generated by a Markov chain based on a given piece of music can achieve a level of similarity comparable to that piece (see [9,10,11]). The question arises whether DNA can be handled similarly to a piece of music. Numerous attempts to write an understandable text as the realization of a low-order Markov chain have not proved successful. For high-order Markov chains, however, such a text becomes understandable because it contains complete sentences from the original text (see [12,13,14]). Therefore, the question whether Markov chains of any order are suitable for studying DNA sequences becomes a question about the complexity of the DNA structure.
In this paper we show methods for filling in short gaps in DNA sequences. The results obtained by our method are then compared with the original DNA sequence.

1.2. Notations

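We consider a probability space $(\Omega, \mathcal{F}, P)$, where $\Omega$ denotes a finite state space, i.e., the set of all possible results of an experiment, $\mathcal{F}$ is a $\sigma$-field of events, and $P$ denotes a probabilistic measure. By a random variable we denote a function $X\colon \Omega \to \mathbb{R}$ such that for any Borel set $B \subset \mathbb{R}$ the preimage $X^{-1}(B)$ is in the $\sigma$-field $\mathcal{F}$. In our case a random variable will allow us to assign natural numbers to states from the state space $\Omega$: let $E$ be a single experiment with possible results forming $\Omega = \{\omega_1, \omega_2, \ldots\}$; if in the $t$-th repetition (i.e., at time $t$) of experiment $E$ we obtain result $\omega_i$, then we put $X_t = i$, where $X_t$ is a random variable with natural values. A vector with nonnegative entries that sum to 1 is called a stochastic vector. A matrix with nonnegative entries is called right stochastic if each of its rows sums to 1 and left stochastic if each of its columns sums to 1. Rows of a right (columns of a left) stochastic matrix are row (column) stochastic vectors.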

1.3. Stochastic Matrices

Using basic algebra one can prove the following remark: the product of two right (two left) stochastic matrices is a right (left) stochastic matrix. Let $A$ be a square matrix. An eigenvalue of $A$ is a complex number $\lambda$ that is a root of the characteristic polynomial of $A$, i.e., $\det(A - \lambda I) = 0$. An eigenvector of a square matrix $A$ associated with $\lambda$ is a nonzero column vector $v$ such that $Av = \lambda v$.
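The remark on products of stochastic matrices can be checked on a small example. The Python sketch below is only an illustration: the matrices `A` and `B` and the helper names `matmul`, `matvec`, and `is_right_stochastic` are assumptions introduced here, not objects from the paper. Exact fractions are used so that row sums are compared without rounding error. The last check uses the fact that the rows of a right stochastic matrix sum to 1, so the all-ones column vector is an eigenvector for the eigenvalue 1.

```python
from fractions import Fraction as F


def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def matvec(a, v):
    """Product of a matrix and a column vector (given as a flat list)."""
    return [sum(a[i][k] * v[k] for k in range(len(v))) for i in range(len(a))]


def is_right_stochastic(m):
    """True if all entries are nonnegative and every row sums to exactly 1."""
    return (all(x >= 0 for row in m for x in row)
            and all(sum(row) == 1 for row in m))


# Two hypothetical right stochastic matrices.
A = [[F(1, 2), F(1, 2)], [F(1, 5), F(4, 5)]]
B = [[F(9, 10), F(1, 10)], [F(3, 10), F(7, 10)]]

AB = matmul(A, B)
print(is_right_stochastic(AB))   # the product is again right stochastic

ones = [F(1), F(1)]
print(matvec(A, ones) == ones)   # A * 1 = 1, i.e. eigenvalue 1
```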

2. Markov Chains

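A Markov chain is a sequence of random variables forming a probabilistic model describing a memoryless type of dependency: the future may depend only on the present and must be independent of the past. Throughout, $E$ is an experiment with possible results forming a finite set $\Omega = \{\omega_1, \omega_2, \ldots\}$; $S = \{1, 2, \ldots\}$ is the state space associated with $\Omega$; $(\Omega, \mathcal{F}, P)$ is a probability space, where $\mathcal{F}$ is a $\sigma$-field and $P$ is a probabilistic measure; $X_t$ is a random variable defined as follows: if in the $t$-th repetition of experiment $E$ we obtain result $\omega_i$, then we put $X_t = i$. A sequence of random variables $(X_t)_{t \ge 0}$ is called a Markov chain if for all $t$ and all states $i_0, \ldots, i_{t-1}, i, j \in S$
$$P(X_{t+1} = j \mid X_t = i, X_{t-1} = i_{t-1}, \ldots, X_0 = i_0) = P(X_{t+1} = j \mid X_t = i),$$
if only $P(X_t = i, X_{t-1} = i_{t-1}, \ldots, X_0 = i_0) > 0$. In the general case a state space need not be finite. The equation above is called the Markov property. Let $p_{ij}(t) = P(X_{t+1} = j \mid X_t = i)$ for all $i$ such that $P(X_t = i) > 0$, and let $\Pi(t) = [p_{ij}(t)]$ be the matrix of these transition probabilities. For our convenience we index states by $t$, but we remind the reader of the following property of homogeneous Markov chains: the transition probabilities do not depend on $t$, so that $p_{ij}(t) = p_{ij}$ and $\Pi(t) = \Pi$.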

2.1. Model

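Now we are going to consider probabilities of moving from state to state in a larger number of steps. Let $p_{ij}^{(n)} = P(X_{t+n} = j \mid X_t = i)$ for all $i$ such that $P(X_t = i) > 0$, and let $\Pi^{(n)} = [p_{ij}^{(n)}]$. Then for a homogeneous Markov chain $\Pi^{(n)} = \Pi^{n}$. For $n = 1$ the theorem is true because $\Pi^{(1)} = \Pi$. For $n \ge 1$, from the law of total probability we obtain $p_{ij}^{(n+1)} = \sum_{s \in S} p_{is}^{(n)} p_{sj}$. Hence $\Pi^{(n+1)} = \Pi^{(n)} \Pi$, thus $\Pi^{(n+1)} = \Pi^{n+1}$. □
(Random walk on a complex plane). Identify adenine, and likewise each of the other nucleotides, with a point of the complex plane, so that reading a DNA sequence base by base traces a walk on that plane.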
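For a homogeneous chain, the matrix of n-step transition probabilities is the n-th power of the one-step transition matrix, which is a standard fact. A minimal Python sketch of this computation follows; the two-state matrix `PI` and the function names are hypothetical and chosen only so that the arithmetic can be followed by hand.

```python
from fractions import Fraction as F


def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def n_step_matrix(pi, n):
    """Return the n-th power of pi (n >= 1) by repeated multiplication."""
    result = pi
    for _ in range(n - 1):
        result = matmul(result, pi)
    return result


# Hypothetical one-step transition matrix of a two-state chain.
PI = [[F(1, 2), F(1, 2)],
      [F(1, 4), F(3, 4)]]

PI2 = n_step_matrix(PI, 2)

# Law of total probability for moving from state 0 to state 1 in two steps:
# condition on the intermediate state (0 or 1).
p01_two_steps = PI[0][0] * PI[0][1] + PI[0][1] * PI[1][1]

print(PI2[0][1], p01_two_steps)  # both equal 5/8
```

The same function applied to a 4 x 4 nucleotide transition matrix gives the probabilities of seeing a given base n positions after another one.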

2.2. Classification of States and Chains

In this subsection, we give some mathematical background and state some well-known results, see [7,15]. A state $i$ is called accessible from state $j$ if there exists $n \ge 0$ such that $p_{ji}^{(n)} > 0$. Two states communicate if each of them is accessible from the other. Observe that communication is an equivalence relation that divides states into equivalence classes called communicating classes. A Markov chain is called irreducible if its state space forms a single communicating class. In other words, in an irreducible Markov chain it is possible to get from any state to any state (every two states communicate). A state $i$ is called inessential if there exists a state $j$ and $n$ such that $p_{ij}^{(n)} > 0$ but $p_{ji}^{(m)} = 0$ for all $m$; otherwise it is called essential.
Denote by $f_{ij}^{(n)}$ the probability that we access state $j$ from state $i$ for the first time in exactly $n$ steps, by $f_{ij} = \sum_{n=1}^{\infty} f_{ij}^{(n)}$ the probability that we ever access state $j$ from state $i$, by $\operatorname{tr} \Pi^{n}$ the trace of the transition matrix in $n$ steps, by $N_i$ a random variable that counts how many times state $i$ is accessed, and by $T_i$ a random variable that counts how many steps are needed to reappear in state $i$ for the first time. A state $i$ is called recurrent if $f_{ii} = 1$ and transient if $f_{ii} < 1$. We will need the following result (Theorem 2, see [15]): a state $i$ is recurrent if and only if $\sum_{n=1}^{\infty} p_{ii}^{(n)} = \infty$, and a state $i$ is transient if and only if $\sum_{n=1}^{\infty} p_{ii}^{(n)} < \infty$.
Note that for all states $i, j$ and natural $n$ we have $p_{ij}^{(n)} = \sum_{m=1}^{n} f_{ij}^{(m)} p_{jj}^{(n-m)}$, with $p_{jj}^{(0)} = 1$. It follows from the fact that accessing state $j$ from state $i$ after $n$ steps means that we access state $j$ for the first time after exactly $m$ steps (for some $m \le n$) and then after the next $n - m$ steps we return to it (perhaps reaching state $j$ a few times on the way). Summing these identities over $n$ and letting $n$ tend to infinity gives an inequality between the corresponding sums, which together with Theorem 2 yields a further characterization of transient and recurrent states. In an irreducible Markov chain either all states are recurrent or all states are transient.
Thus we call an irreducible Markov chain recurrent or transient, depending on the type of its states. One can show that for a finite Markov chain (a chain with a finite state space) a state is inessential if and only if it is transient; thus a state is essential if and only if it is recurrent. In the case of gene A3GALT2 each state is essential (because all entries in the transition matrix are nonzero), thus from Remark 9 each state is recurrent. This can also be shown using properties of the transition matrix. A recurrent state $i$ is called null-recurrent if the expected number of steps needed to return to it is infinite, and positive recurrent if it is finite. A state $i$ is called periodic with period $d > 1$ if $d$ is the greatest common divisor of all $n \ge 1$ such that $p_{ii}^{(n)} > 0$; if this greatest common divisor equals 1, the state is called aperiodic. A state which is aperiodic and positive recurrent is called ergodic. In the case of our matrix Π all states are ergodic. A special case of an irreducible Markov chain is a regular Markov chain. An irreducible Markov chain is called regular if there exists $n$ such that all entries of $\Pi^{n}$ are positive. Let Π be a transition matrix of a Markov chain. A stationary probability vector is a stochastic vector $\pi$ (see Definition 1) such that $\pi \Pi = \pi$. For regular Markov chains we have the following result. Let Π be a transition matrix of an irreducible aperiodic Markov chain with finite state space. Then for all states $i, j$ the limit $\lim_{n \to \infty} p_{ij}^{(n)} = \pi_j$ exists and does not depend on $i$, so that matrix $\Pi^{n}$ converges to a matrix whose rows all equal $\pi$; the chain is recurrent, $\pi$ is the unique stationary probability vector, and $\pi_j = 1/\mu_j$, where $\mu_j$ is the average number of steps needed to reappear in state $j$.
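Regularity can be tested mechanically, since it depends only on which entries of the transition matrix are positive. The Python sketch below is an illustration under stated assumptions: the function names and the two small test matrices are hypothetical, and the stopping rule relies on Wielandt's bound for primitive matrices (if some power of an n x n nonnegative matrix is entrywise positive, then the power with exponent (n - 1)^2 + 1 already is), a classical result that is not taken from the paper.

```python
def positive_pattern(matrix):
    """Replace every entry by True (positive) or False (zero)."""
    return [[x > 0 for x in row] for row in matrix]


def bool_matmul(a, b):
    """Boolean matrix product: entry (i, j) is True if some path i -> k -> j exists."""
    n = len(a)
    return [[any(a[i][k] and b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def is_regular(matrix):
    """True if some power of the transition matrix has only positive entries."""
    n = len(matrix)
    pattern = positive_pattern(matrix)
    power = pattern
    # Check powers 1, 2, ..., (n - 1)^2 + 1 (Wielandt's bound).
    for _ in range((n - 1) ** 2 + 1):
        if all(all(row) for row in power):
            return True
        power = bool_matmul(power, pattern)
    return False


# A chain that alternates deterministically between two states is periodic,
# hence not regular.
print(is_regular([[0.0, 1.0], [1.0, 0.0]]))   # False

# Allowing state 0 to stay in place breaks the periodicity.
print(is_regular([[0.5, 0.5], [1.0, 0.0]]))   # True
```

A transition matrix whose entries are all positive, as in the A3GALT2 case, is regular already at the first power.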

2.3. Analysis of Alpha 1,3-Galactosyltransferase 2

We show that a time-homogeneous Markov chain is an appropriate simple model of Alpha 1,3-Galactosyltransferase 2 (A3GALT2). We use transition matrices as a criterion for identifying similarities in the structure of this particular gene. A3GALT2 is a protein-coding gene (a region of DNA) located on chromosome 1, position 33,306,766, consisting of 14,333 bases [16]. Let $\Omega = \{\omega_1, \omega_2, \omega_3, \omega_4\}$, where $\omega_1 = a$, $\omega_2 = c$, $\omega_3 = g$, $\omega_4 = t$, and let $S = \{1, 2, 3, 4\}$ be the corresponding state space. We form a stochastic matrix Π (7) as follows. For example, we would like to know how probable it is that adenine ($a$) is followed by cytosine ($c$). We count all occurrences of the pair $ac$ in gene A3GALT2 and divide this count by the number of all pairs which start with $a$. The number of all such pairs is equal to the number of occurrences of $a$, provided that $a$ is not the last nucleotide base in gene A3GALT2. With the pair counts of Table 2 this gives
$$\Pi = \begin{bmatrix} \frac{769}{3195} & \frac{745}{3195} & \frac{1093}{3195} & \frac{588}{3195} \\ \frac{1140}{4052} & \frac{1392}{4052} & \frac{350}{4052} & \frac{1170}{4052} \\ \frac{811}{3715} & \frac{934}{3715} & \frac{1254}{3715} & \frac{716}{3715} \\ \frac{475}{3370} & \frac{980}{3370} & \frac{1018}{3370} & \frac{897}{3370} \end{bmatrix}. \qquad (7)$$
Note that because the last nucleotide base in gene A3GALT2 is $t$, the denominator in the last row is 3370 instead of 3371. One can pose the question of the biological interpretation of matrix Π. Can the occurrences of nucleotide bases be used to identify a specific gene or, in the general case, to identify an individual? Because all entries of matrix Π are positive, in the case of gene A3GALT2 the corresponding Markov chain is irreducible (see Definition 10) and $p_{ii} > 0$ for each $i \in S$; thus our chain is aperiodic (see Definition 14). From Theorem 5, for any $i, j \in S$ there exists a limit $\lim_{n \to \infty} p_{ij}^{(n)} = \pi_j$, the corresponding Markov chain is recurrent, $\pi = (\pi_1, \pi_2, \pi_3, \pi_4)$ is the unique stationary probability vector, and $\pi_j = 1/\mu_j$, where $\mu_j$ is the average number of steps needed to reappear in state $j$. The stationary probability vector of matrix Π is $\pi \approx (0.22292, 0.28265, 0.25923, 0.23521)$. Note that if the nucleotide bases in gene A3GALT2 are a good estimate of a possible sequence of values of a Markov chain, then from Remark 13 the probability of occurrence of a should be 0.22292, c: 0.28265, g: 0.25923 and t: 0.23521. Comparing this with the probabilities computed in Table 1, we see that the two sets of values differ by less than 0.0001. From Remark 13 we can also compute approximate values of the average return times $\mu_j$. All of the above considerations can be repeated for pairs of bases, see Table 2.

Table 2

Occurrence of nucleotide pairs of bases in A3GALT2.

Pairs of Bases	Occurrence in A3GALT2	Probability	Pairs of Bases	Occurrence in A3GALT2	Probability
AA	769	769/14,332	GA	811	811/14,332
AC	745	745/14,332	GC	934	934/14,332
AG	1093	1093/14,332	GG	1254	1254/14,332
AT	588	588/14,332	GT	716	716/14,332
CA	1140	1140/14,332	TA	475	475/14,332
CC	1392	1392/14,332	TC	980	980/14,332
CG	350	350/14,332	TG	1018	1018/14,332
CT	1170	1170/14,332	TT	897	897/14,332
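The construction of the transition matrix from pair counts, and the computation of its stationary probability vector, can be sketched in Python. The counts below are those of Table 2; everything else (the names `PAIR_COUNTS`, `transition_matrix`, `stationary_vector`, and the use of plain power iteration with 200 steps) is an assumption made for illustration and not necessarily the way the authors carried out the computation.

```python
BASES = "ACGT"

# Pair counts from Table 2: PAIR_COUNTS[x][k] is the number of times
# base x is immediately followed by BASES[k] in A3GALT2.
PAIR_COUNTS = {
    "A": [769, 745, 1093, 588],
    "C": [1140, 1392, 350, 1170],
    "G": [811, 934, 1254, 716],
    "T": [475, 980, 1018, 897],
}


def transition_matrix(pair_counts):
    """Divide each pair count by the number of pairs starting with the same base."""
    matrix = []
    for base in BASES:
        row = pair_counts[base]
        total = sum(row)
        matrix.append([count / total for count in row])
    return matrix


def stationary_vector(matrix, iterations=200):
    """Approximate the stationary vector by repeatedly applying pi -> pi * matrix."""
    n = len(matrix)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return pi


PI = transition_matrix(PAIR_COUNTS)
pi = stationary_vector(PI)

# Row denominators: T starts only 3370 pairs because the gene ends with T.
print({base: sum(PAIR_COUNTS[base]) for base in BASES})
print([round(p, 5) for p in pi])
```

Power iteration is adequate here because all entries of the matrix are positive, so the iterates converge to the unique stationary vector regardless of the starting distribution.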

3. The Markov Process Model of Nucleotide Substitution

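We assume that nucleotide substitution follows a homogeneous Markov process. We take $\Omega = \{\omega_1, \omega_2, \omega_3, \omega_4\}$, where $\omega_1 = a$, $\omega_2 = c$, $\omega_3 = g$, $\omega_4 = t$. Let $S = \{1, 2, 3, 4\}$ be the corresponding state space. Let $P(t) = [p_{ij}(t)]$ be the matrix of transition probabilities in time $t$. We assume that $P(t) = e^{Qt}$, where $Q$ is the rate matrix of the process.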
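To make the relation between a rate matrix and the transition probabilities concrete, the following Python sketch evaluates the matrix exponential by a truncated Taylor series. The rate matrix used is an assumption chosen for illustration: it is the Jukes-Cantor model, in which every substitution has the same rate `ALPHA`; the paper does not specify a rate matrix at this point. For this particular model the exponential has a well-known closed form, which the sketch uses as a check.

```python
import math


def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_probabilities(q, t, terms=40):
    """Approximate exp(q * t) by the first `terms` terms of its Taylor series."""
    n = len(q)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        # After this step, term equals (q * t)^k / k!.
        scaled = [[q[i][j] * t / k for j in range(n)] for i in range(n)]
        term = matmul(term, scaled)
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result


# Hypothetical Jukes-Cantor rate matrix: rate ALPHA for each substitution,
# diagonal chosen so that every row sums to zero.
ALPHA = 1.0
Q = [[-3 * ALPHA if i == j else ALPHA for j in range(4)] for i in range(4)]

T = 0.1
P = transition_probabilities(Q, T)

# Closed form for the Jukes-Cantor model.
p_same = 0.25 + 0.75 * math.exp(-4 * ALPHA * T)
p_diff = 0.25 - 0.25 * math.exp(-4 * ALPHA * T)
print(abs(P[0][0] - p_same) < 1e-12, abs(P[0][1] - p_diff) < 1e-12)
```

The truncated series is suitable only when the entries of the rate matrix times the elapsed time are small, as in this example.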

4. Application in DNA Sequencing, Redesigned DNA

The meaning of DNA sequencing is here deemed to cover all methods used to determine the order of nucleotides along a DNA strand. The objective of this section is to present an example of applying Markov chains to complete short fragments of a DNA strand. Markov chains have been applied as mathematical models of real-life processes. Real-life dynamical systems examined with Markov chain methods include queues of passengers arriving at an airport, currency exchange rates, and animal population dynamics. Markov chains are also applied to build algorithms calculating the PageRank value for a website (see [17]). The PageRank value of a website reflects the probability that a random internet surfer will land on this webpage upon clicking a link.
Markov processes of various orders are used to model DNA sequences (see also [6]). A Markov process of order m is one for which the probability of any event depends exclusively on the m preceding events. Statistically, DNA does not have the features of a first-order Markov chain. Higher-order models have been proposed for analyzing interrelations within a DNA sequence, see [6]. The statistical tests used in [6] properly determined the order of the Markov chain being tested only for sufficiently long sequences. Nevertheless, the authors consider it important to present the method described below for local problems, that is, for very short DNA sequences (54 base pairs in our example). We believe it is worth examining the results of local DNA completion based on the properties of first-order Markov chains when the DNA sequence is too short for such tests. The method presented below is a simple, local tool for completing short DNA segments. The method also has educational value. Moreover, instead of analyzing a DNA sequence of the four unit nucleotides, we may use first-order Markov chains to analyze codons or, more precisely, amino acids encoded with those codons.
It is worth recalling here some basic facts and conventions:
A codon is a sequence of three nucleotides (a triplet) occurring in mRNA, a unit encoding a specific amino acid during protein synthesis;
Proteins are built of 20 different amino acids;
The sequence of amino acids in a protein exactly follows the sequence of the relevant codons in mRNA;
Most amino acids are encoded in several ways (with different codons, which, however, usually differ from one another only in the third place of the triplet); owing to this, certain changes in the genetic information (mutations) do not affect the amino acid sequence;
There are 61 codons encoding amino acids and 3 non-encoding codons (the STOP codons: UAG, UAA, UGA); all in all, 64 different triplets;
The AUG codon, read as the first one in mRNA by a ribosome during protein synthesis, is known as the initiation or start codon;
Since a mutation of a single nucleotide changes a single amino acid, the genetic code has to be read as non-overlapping, i.e., any given codon may be followed by any other codon;
To get the form typical of DNA, each U in an mRNA codon should be replaced by T; for instance, TAA is the DNA equivalent of the mRNA codon UAA;
In the case of a sequence of amino acids, understood as resulting from the action of a first-order Markov process, the transition matrix is a square matrix of degree at most 21. One state is reserved for the three STOP codons, which do not encode amino acids.
Let us consider the human SATB1 gene, which, as research has revealed, is a major growth factor for breast cancer. The relevant state space comprises four nucleotides: A, C, G, T. A numbered 54-base-pair-long segment of the gene is given in Table 4. We give another sequence, that of the corresponding amino acids, in Table 3. In the application of the method described below, it is important that in both tables the last element occurs at least twice.
Assume that in the sequence in Table 4 some nucleotides are missing. In the other approach, the representation with amino acids (see Table 3) is used. A transition matrix corresponding to the sequence is formed, and its stationary probability vector, after rounding, is compared with the rounded relative frequencies of nucleotide occurrences. Note that the sequence of 54 nucleotides is short. Using the transition matrix, the completion procedure begins as follows: prepare four boxes, one labeled with each nucleotide.
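One possible reading of completion with a first-order Markov chain is sketched below in Python. It is an assumption-laden illustration and not necessarily the authors' exact procedure: each "box" is taken to hold the observed successors of one nucleotide, and every missing base (marked `?`, a convention introduced here) is drawn at random from the box of the base that precedes it. Drawing uniformly from a box is equivalent to sampling from the corresponding row of the empirical transition matrix. The segment is the one listed in Table 4; the function names and the position of the gap are hypothetical.

```python
import random
from collections import defaultdict

# 54-base segment of SATB1 as listed in Table 4.
SEGMENT = "GTCAAAAGACTCTCCGACAAAAACAAATCCAGTCTCTAGCAGTTATGTTGTTAG"


def successor_boxes(seq):
    """For each known base, collect the known bases that immediately follow it."""
    boxes = defaultdict(list)
    for current, following in zip(seq, seq[1:]):
        if current != "?" and following != "?":
            boxes[current].append(following)
    return boxes


def fill_gaps(seq, rng):
    """Replace each '?' by a base drawn from the box of its predecessor."""
    boxes = successor_boxes(seq)
    filled = list(seq)
    for position, base in enumerate(filled):
        if base != "?":
            continue
        if position == 0:
            raise ValueError("a gap at the very start has no predecessor")
        box = boxes[filled[position - 1]]
        if not box:
            raise ValueError("the predecessor has no observed successor")
        filled[position] = rng.choice(box)
    return "".join(filled)


# Hypothetical damage: bases 21-23 (1-based) are unknown.
damaged = SEGMENT[:20] + "???" + SEGMENT[23:]
completed = fill_gaps(damaged, random.Random(0))
print(damaged)
print(completed)
```

The second error case in `fill_gaps` shows why it matters that the last element of a sequence occurs at least twice: a symbol that appears only at the very end has an empty box, so nothing can be drawn after it.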

5. Conclusions

The method just presented is primarily of educational value. The procedure is described in an intuitive and, the authors believe, explanatory way, which makes it possible for the method to be used in a more general context, also by non-mathematicians. The authors do not claim that the probability of achieving the proper completion of a DNA genome is satisfactory; they only present a tool which may be used for such completion and with which they would like to familiarize the reader. The authors are aware that contemporary efforts in the area of DNA genome completion focus on deep learning rather than on Markov chains, even of higher orders, see [8]. This paper proposes an alternative tool that can be explained intuitively and in depth. It is now clear that deep learning methods lead to more exact completions than Markov chain methods. Yet our method allows one to understand what happens behind the process of proper completion and sequencing. It should therefore be treated as having explanatory and educational value, with a potential for future research. It is worth asking whether the algorithms presented in Example 2 might be used to easily generate test data for more advanced deep learning algorithms (see [21]).

Table 1

Occurrence of nucleotide bases in A3GALT2.

Nucleotide Bases	Occurrence in A3GALT2	Probability (Occurrence Divided by Length of Gene)
A	3195	0.222912
C	4052	0.282704
G	3715	0.259192
T	3371	0.235192

Table 3

The sequence of amino acids corresponding to 54-base-pair-long segment of SATB1.

1,2,3	4,5,6	7,8,9	10,11,12	13,14,15	16,17,18	19,20,21	22,23,24	25,26,27
Val	Lys	Arg	Leu	Ser	Asp	Lys	Asn	Lys
28,29,30	31,32,33	34,35,36	37,38,39	40,41,42	43,44,45	46,47,48	49,50,51	52,53,54
Ser	Ser	Leu	STOP	Gln	Leu	Cys	Cys	STOP

Table 4

Numbered 54-base-pair-long segment of SATB1.

1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18
G	T	C	A	A	A	A	G	A	C	T	C	T	C	C	G	A	C
19	20	21	22	23	24	25	26	27	28	29	30	31	32	33	34	35	36
A	A	A	A	A	C	A	A	A	T	C	C	A	G	T	C	T	C
37	38	39	40	41	42	43	44	45	46	47	48	49	50	51	52	53	54
T	A	G	C	A	G	T	T	A	T	G	T	T	G	T	T	A	G
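The correspondence between Table 4 and Table 3 can be reproduced by reading the 54 bases as 18 consecutive, non-overlapping codons and looking each one up in the standard genetic code. The Python sketch below is an illustration: the names `CODON_TABLE`, `THREE_LETTER`, and `translate` are assumptions introduced here, and the compact 64-letter string is the usual way of writing the standard code with bases ordered T, C, A, G.

```python
# 54-base segment of SATB1 as listed in Table 4.
SEGMENT = "GTCAAAAGACTCTCCGACAAAAACAAATCCAGTCTCTAGCAGTTATGTTGTTAG"

# Standard genetic code in DNA letters; '*' marks the three STOP codons.
BASE_ORDER = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASE_ORDER)
    for j, b in enumerate(BASE_ORDER)
    for k, c in enumerate(BASE_ORDER)
}

THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "*": "STOP",
}


def translate(dna):
    """Split a DNA string into non-overlapping triplets and name each codon."""
    codons = [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]
    return [THREE_LETTER[CODON_TABLE[codon]] for codon in codons]


amino_acids = translate(SEGMENT)
print(amino_acids)
```

Because the result has at most 21 distinct symbols (20 amino acids and STOP), it can serve directly as the state sequence for the amino acid version of the transition matrix.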

1. Note on DNA Analysis and Redesigning Using Markov Chain.

Authors: Maciej Zakarczemny; Małgorzata Zajęcka
Journal: Genes (Basel) Date: 2022-03-21 Impact factor: 4.096