
The number of reduced alignments between two DNA sequences.

Helena Andrade, Iván Area, Juan J Nieto, Angela Torres.

Abstract

BACKGROUND: In this study we consider DNA sequences as mathematical strings. Total and reduced alignments between two DNA sequences have been considered in the literature to measure their similarity. Explicit representations have already been obtained for some of these alignments.
RESULTS: We present exact, explicit and computable formulas for the number of different possible alignments between two DNA sequences and a new formula for a class of reduced alignments.
CONCLUSIONS: A unified approach for a wide class of alignments between two DNA sequences has been provided. The formula is computable and, if complemented by software development, will provide a deeper insight into the theory of sequence alignment and give rise to new comparison methods.
AMS SUBJECT CLASSIFICATION: Primary 92B05, 33C20; secondary 39A14, 65Q30.


Year:  2014        PMID: 24684679      PMCID: PMC3977907          DOI: 10.1186/1471-2105-15-94

Source DB:  PubMed          Journal:  BMC Bioinformatics        ISSN: 1471-2105            Impact factor:   3.169


Background

Let us consider a DNA sequence as a mathematical string x = x_1 x_2 … x_n, where each x_i ∈ {A,G,C,T} is one of the four nucleotides, i=1,2,…,n; here A denotes adenine, C cytosine, G guanine and T thymine. Under these conditions, the sequence x has length n. Our main goal is to compare the sequence x with another DNA sequence y of length m, to measure the similarity between both strings and to determine their residue-residue correspondences.

Sequence comparison and alignment is a central and crucial tool in molecular biology. For example, Pairwise Sequence Alignment is used to identify regions of similarity that may indicate functional, structural and/or evolutionary relationships between two biological sequences (protein or nucleic acid) [1]. For some recent developments and directions we refer the reader to [2-7], and to [8] for a general review of different alignment methods. To align the sequences CGT and ACTT, one can use EMBOSS Needle for nucleotide sequences [9], which creates an optimal global alignment of the two sequences using the Needleman-Wunsch algorithm.

Following Lesk [10], in order to compare the amino acids appearing at corresponding positions in two sequences, their correspondences must be assigned; a sequence alignment is the identification of residue-residue correspondences. For some references on sequence alignment we refer the reader to [10-16].

To compare two sequences, there are mainly three different possibilities, leading to three different numbers of alignments [10,11,13]:

1. The total number of alignments, denoted by f(n,m), which was solved in [13].

2. A gap in one sequence is followed by a gap in the other sequence, as in Alignments 1 and 2 for the sequences x=CGT and y=ACTT (see Tables 1 and 2 below). Considering these two alignments as equivalent to Alignment 3 (see Table 3), which has no gaps in those positions, we have the number of reduced alignments denoted by h(n,m), and obviously h(n,m)<f(n,m). This case was solved in [11], and we give here another representation in terms of hypergeometric series.
Table 1

Alignment 1

CGT
ACTT
Table 2

Alignment 2

CGT
ACTT
Table 3

Alignment 3

CGT
ACTT
3. In the interesting case that Alignments 1 and 2 are equivalent, but different from Alignment 3, we have a number of reduced alignments g(n,m), where h(n,m)<g(n,m)<f(n,m).
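The notion of "total number of alignments" in item 1 can be made concrete with a small brute-force enumeration. The sketch below is illustrative only and rests on one assumption that the text does not spell out: an alignment is a sequence of columns, each pairing a residue with a residue or a residue with a gap, and never a gap with a gap. The function name `alignments` and the gap symbol `-` are hypothetical choices.

```python
from typing import Iterator, Tuple

GAP = "-"  # assumed gap symbol, chosen for illustration


def alignments(x: str, y: str) -> Iterator[Tuple[str, str]]:
    """Yield every alignment of x and y as a pair of gapped strings.

    Assumption: a column holds residue/residue, residue/gap or gap/residue,
    and a column with gaps in both rows is not allowed.
    """
    if not x and not y:
        yield "", ""
        return
    if x and y:
        # Column with a residue of x over a residue of y.
        for top, bottom in alignments(x[1:], y[1:]):
            yield x[0] + top, y[0] + bottom
    if x:
        # Column with a residue of x over a gap.
        for top, bottom in alignments(x[1:], y):
            yield x[0] + top, GAP + bottom
    if y:
        # Column with a gap over a residue of y.
        for top, bottom in alignments(x, y[1:]):
            yield GAP + top, y[0] + bottom


all_pairs = list(alignments("CGT", "ACTT"))
print(len(all_pairs))  # total number of alignments for lengths 3 and 4
for top, bottom in all_pairs[:3]:
    print(top)
    print(bottom)
    print()
```

Under this assumption the enumeration yields 129 alignments for x=CGT and y=ACTT. Exhaustive listing is only practical for very short sequences, which is why closed formulas for the counts are of interest.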

Number of alignments

The total number of alignments f(n,m) satisfies the following recurrence relation [13]

\[ f(n,m) = f(n-1,m) + f(n,m-1) + f(n-1,m-1), \]

with initial conditions f(n,0)=f(0,m)=1 for n,m=1,2,3,…. The solution of the above partial difference equation is given by (see formula (10) in [13])

\[ f(n,m) = \sum_{k=0}^{\min(n,m)} 2^{k} \binom{n}{k} \binom{m}{k}, \]

and the generating function [17,18] is

\[ F(x,y) = \sum_{n,m \ge 0} f(n,m)\, x^{n} y^{m} = \frac{1}{1 - x - y - xy}. \]

Therefore the coefficients f(n,m) in the expansion are given in terms of a hypergeometric series by

\[ f(n,m) = {}_2F_1(-n,-m;\,1;\,2). \]

This relation seems to be new in this form. Here, the generalized hypergeometric series is defined as (see e.g. [19, Chapter 16])

\[ {}_pF_q(a_1,\dots,a_p;\, b_1,\dots,b_q;\, z) = \sum_{k=0}^{\infty} \frac{(a_1)_k \cdots (a_p)_k}{(b_1)_k \cdots (b_q)_k}\, \frac{z^{k}}{k!}, \]

and (A)_n = A(A+1)⋯(A+n−1), with (A)_0=1, denotes the Pochhammer symbol. It is assumed that b_j ≠ −k in order to avoid singularities in the denominators. If one of the parameters a_j equals a negative integer, then the sum becomes a terminating series.

For the reduced alignments h(n,m), the recurrence relation is given in [11], with initial conditions h(n,0)=h(0,m)=1. The generating function [17,18] follows from this recurrence, and the coefficients in its expansion can be written in terms of (terminating) hypergeometric series.

As indicated before, the main aim of this paper is to give an explicit representation for the number of reduced alignments g(n,m). The recurrence relation for the g(n,m) coefficients is given in [11], with initial conditions g(n,0)=g(0,m)=1, and the generating function [17,18] is obtained from it.
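As a numerical illustration of the hypergeometric representation of the total number of alignments, the following sketch evaluates a terminating generalized hypergeometric series exactly with rational arithmetic. It assumes that f(n,m) obeys the three-term recurrence f(n,m) = f(n−1,m) + f(n,m−1) + f(n−1,m−1) with f(n,0)=f(0,m)=1, and that the representation is 2F1(−n,−m; 1; 2); both are stated here as assumptions of the sketch. The helper names `pochhammer`, `hyp_terminating` and `f_hyp` are hypothetical.

```python
from fractions import Fraction
from typing import Sequence


def pochhammer(a: int, k: int) -> int:
    """Rising factorial (a)_k = a (a+1) ... (a+k-1), with (a)_0 = 1."""
    result = 1
    for i in range(k):
        result *= a + i
    return result


def hyp_terminating(upper: Sequence[int], lower: Sequence[int],
                    z: int, last_k: int) -> Fraction:
    """Sum the terms k = 0..last_k of pFq(upper; lower; z) exactly."""
    total = Fraction(0)
    factorial = 1
    for k in range(last_k + 1):
        if k > 0:
            factorial *= k
        numerator = 1
        for a in upper:
            numerator *= pochhammer(a, k)
        denominator = factorial
        for b in lower:
            denominator *= pochhammer(b, k)
        total += Fraction(numerator, denominator) * Fraction(z) ** k
    return total


def f_hyp(n: int, m: int) -> int:
    """Assumed representation f(n,m) = 2F1(-n, -m; 1; 2)."""
    # With upper parameters -n and -m, every term with k > min(n, m) vanishes.
    value = hyp_terminating([-n, -m], [1], 2, min(n, m))
    if value.denominator != 1:
        raise ValueError("expected an integer count")
    return value.numerator


print(f_hyp(3, 4))
print(f_hyp(10, 10))
```

Because one upper parameter is a negative integer, the series terminates and the result is an exact integer. Checking that `f_hyp` satisfies the assumed recurrence and boundary conditions for small n and m is a quick sanity test of the representation.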

Theorem 1. The coefficients α in the expansion of the generating function are given explicitly by a formula in which [x] denotes the integer part of x.

Proof. Expanding the generating function yields two summands to be computed. To compute the first sum (12), new summation indices are introduced, so that the summation is expressed in terms of quantities U, V, A and B, which must be computed in terms of the initial indices. The product of binomials can then be simplified and the summation evaluated. A similar treatment of the second summand (13) leads to the final result.

Some numerical values are g(10,10)=2003204, g(50,50)=2.71972×10^34, g(100,100)=7.55997×10^69, and we note that g(n,n)>10^80 for n≥115. This last inequality is relevant since 10^80 is an estimate of the number of protons in our universe [13].

Conclusions

A unified approach for a wide class of alignments between two DNA sequences has been provided. Our approach also gives an explicit formula that fills a gap in the theory of sequence alignment. The formula is computable and, if complemented by software development, will provide a deeper insight into the theory of sequence alignment and give rise to new comparison methods. In the future it may also be used to obtain explicit formulas for, and to compute, the number of total, reduced, and effective alignments for multiple sequences.

Methods

We have performed a number of numerical computations to compare our formulae with the Mathematica® [20] command Coefficient applied to the series expansion of (1), on a MacBook Pro featuring a 45 nm "Penryn" 2.66 GHz Intel "Core 2 Duo" processor (P8800), with two independent processor "cores" on a single silicon chip, and 8 GB of 1066 MHz DDR3 SDRAM (PC3-8500). Our approach is considerably faster: for example, g(100,100) is computed with Mathematica® in 0.125165 seconds using the new formulas presented in this paper, while the Mathematica® command Coefficient needs 99.167659 seconds.
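The kind of comparison described above can be imitated in a few lines, with important caveats. The sketch below is not the authors' Mathematica® experiment and does not compute g(n,m). It uses the total number of alignments f(n,m) as a stand-in, assuming the closed form f(n,m) = Σ 2^k C(n,k) C(m,k) and the three-term recurrence with f(n,0)=f(0,m)=1. It compares a direct closed-form evaluation with a full recurrence table, which plays the role of coefficient extraction. The function names are hypothetical, and the timings depend on the machine and are not comparable with those reported here.

```python
import time
from math import comb


def f_closed(n: int, m: int) -> int:
    """Assumed closed form: sum over k of 2^k * C(n,k) * C(m,k)."""
    return sum(2 ** k * comb(n, k) * comb(m, k) for k in range(min(n, m) + 1))


def f_table(n: int, m: int) -> int:
    """Fill the whole (n+1) x (m+1) table from the assumed recurrence."""
    table = [[1] * (m + 1) for _ in range(n + 1)]  # boundary values are 1
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = table[i - 1][j] + table[i][j - 1] + table[i - 1][j - 1]
    return table[n][m]


for label, func in (("closed form", f_closed), ("recurrence table", f_table)):
    start = time.perf_counter()
    value = func(100, 100)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.6f} s, result has {len(str(value))} digits")
```

The closed form needs on the order of min(n,m) big-integer operations, whereas the table needs on the order of n·m, which is the general reason an explicit formula can outperform generic coefficient extraction.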

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

Each of the authors, HA, IA, JJN and AT, contributed equally to each part of this study, and all read and approved the final version of the manuscript.
  5 in total

1.  An exact formula for the number of alignments between two DNA sequences.

Authors:  Angela Torres; Alberto Cabada; Juan J Nieto
Journal:  DNA Seq       Date:  2003-12

2.  Alignment methods: strategies, challenges, benchmarking, and comparative overview. (Review)

Authors:  Ari Löytynoja
Journal:  Methods Mol Biol       Date:  2012

3.  Oculus: faster sequence alignment by streaming read compression.

Authors:  Brendan A Veeneman; Matthew K Iyer; Arul M Chinnaiyan
Journal:  BMC Bioinformatics       Date:  2012-11-13       Impact factor: 3.169

4.  Efficient alignment of RNA secondary structures using sparse dynamic programming.

Authors:  Cuncong Zhong; Shaojie Zhang
Journal:  BMC Bioinformatics       Date:  2013-09-08       Impact factor: 3.169

5.  Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory.

Authors:  Mark J Chaisson; Glenn Tesler
Journal:  BMC Bioinformatics       Date:  2012-09-19       Impact factor: 3.169
