
Sorting by restricted-length-weighted reversals.

Thach Cam Nguyen¹, Hieu Trung Ngo, Nguyen Bao Nguyen.

Abstract

Classical sorting by reversals uses the unit-cost model, that is, each reversal consumes an equal cost. This model limits the biological meaning of sorting by reversals. Bender and his colleagues extended it by assigning a cost function $f(l) = l^{\alpha}$ for all $\alpha \ge 0$, where l is the length of the reversed subsequence. In this paper, we extend their results by considering a model in which long reversals are prohibited. Using the same cost function for permitted reversals, we present tight or nearly tight bounds for the worst-case cost of sorting by reversals. Then we develop algorithms to approximate the optimal cost to sort a given 0/1 sequence as well as a given permutation. Our proposed problems are more biologically meaningful and more algorithmically general and challenging than the problem considered by Bender et al. Furthermore, our bounds are tight or nearly tight, and our algorithms provide good approximation ratios compared to the optimal cost to sort 0/1 sequences or permutations by reversals.


Year: 2005 PMID: 16393148 PMCID: PMC5963005 DOI: 10.1016/s1672-0229(05)03016-0

Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 7.691

Introduction

Biological motivation

Given the gene order of a chromosome, a reversal takes a segment and reverses the order of the sequences in it. This is one of the most important mutations at the chromosome level. Indeed, reversal is believed to be the most common of these mutations. For instance, this has been reported for bacteria, plants, and the fruit fly. The minimum-cost reversal distance thus becomes a useful measure for reconstructing the evolutionary history of organisms, because the most parsimonious series of reversals transforming one sequence into another is likely to be the evolutionary path between two organisms. However, most of the algorithmic work to date on the problem of sorting by reversals has used the unit-cost model, that is, each reversal consumes a unit cost. This model implicitly assumes simple evolutionary paths in which reversal mutations of different lengths are considered equally likely to happen. It is therefore not very biologically meaningful. Instead, the mechanics of genome reversals suggest that the probabilities of reversals depend on the fragment lengths. Furthermore, not all reversals can occur in the evolutionary path. Reversals whose lengths exceed some limit should be forbidden because they affect the organisms seriously and destroy the specific characteristics of their genomes. Motivated by this characteristic of reversals, many works have assigned length-sensitive costs to reversal operations and indicated that length indeed plays an important role in biasing certain rearrangement patterns. In this work, we generalize previous results on the problems of sorting by length-weighted reversals and investigate more general variants by imposing the condition that reversals acting on fragments longer than a certain length are prohibited.

Problem definition

Since we can always relabel genes so that the resulting permutation is the identity permutation, the problem of finding the reversal distance is equivalent to that of finding the most economical series of reversals that transforms a permutation into the identity, and it is often referred to as “sorting by reversals”. Our problem, sorting by restricted-length-weighted reversals, takes three kinds of input: a 0/1 sequence or permutation S, a cost function f on the length of the reversals, and a positive integer k. The aim is to find a minimum-cost series of reversals whose lengths do not exceed k to sort S. In this paper, we consider a wide class of cost functions, namely $f(l) = l^{\alpha}$ for all α ≥ 0, where l is the length of the reversed segment. Denoting the minimum cost to sort S by SBRLR(S, k, α), we address the following two problems:
1. Determining the diameter of sorting by restricted-length-weighted reversals, that is, finding a lower bound (the largest cost that at least one permutation S is guaranteed to require) and an upper bound (the smallest cost within which any permutation can be sorted).
2. Approximating SBRLR(S, k, α).
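
To make the problem statement concrete, the following Python sketch applies a series of reversals to a sequence, rejects any reversal longer than k, and accumulates the cost $f(l) = l^{\alpha}$. The function names and the half-open (i, j) index convention are assumptions chosen for this illustration; they do not come from the paper.

```python
from typing import List, Sequence, Tuple


def reversal_cost(length: int, alpha: float) -> float:
    """Cost f(l) = l**alpha of reversing a segment of `length` elements."""
    return float(length) ** alpha


def apply_reversals(
    seq: Sequence[int],
    reversals: Sequence[Tuple[int, int]],
    k: int,
    alpha: float,
) -> Tuple[List[int], float]:
    """Apply reversals given as half-open index pairs (i, j).

    Raises ValueError if a reversal is out of range or longer than k.
    Returns the resulting sequence and the total cost.
    """
    s = list(seq)
    total = 0.0
    for i, j in reversals:
        if not (0 <= i < j <= len(s)):
            raise ValueError(f"reversal ({i}, {j}) is out of range")
        length = j - i
        if length > k:
            raise ValueError(f"reversal of length {length} exceeds k = {k}")
        s[i:j] = s[i:j][::-1]
        total += reversal_cost(length, alpha)
    return s, total


if __name__ == "__main__":
    # With k = 4 a single reversal sorts 1100 at cost 4**1 = 4.
    print(apply_reversals([1, 1, 0, 0], [(0, 4)], k=4, alpha=1.0))
    # With k = 2 only adjacent swaps are allowed: four swaps, cost 4 * 2 = 8.
    swaps = [(1, 3), (0, 2), (2, 4), (1, 3)]
    print(apply_reversals([1, 1, 0, 0], swaps, k=2, alpha=1.0))
```

The two calls show how tightening k raises the cost of sorting the same input under the linear cost model.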

Related work

There is a large body of literature on sorting by reversals. The most notable results are due to Kececioglu and Sankoff, Hannenhalli and Pevzner, and Caprara. Kececioglu and Sankoff gave a 2-approximation algorithm, which was the first approach with a performance guarantee for sorting by reversals. Bafna and Pevzner then developed the notion of the breakpoint graph of a permutation and gave a better approximation algorithm. Based on this notion, Hannenhalli and Pevzner stated a novel duality theorem and gave the first polynomial-time algorithm to sort a signed permutation, known as the HP algorithm. This algorithm took $O(n^2)$ time to calculate the reversal distance between the genomes and $O(n^4)$ time to find the optimal series of reversals to sort a permutation. On the other hand, Caprara proved that sorting by reversals is NP-hard for the general (unsigned) case. This result shifted the focus of the study of sorting by reversals to the signed case. Since then, many simplifications and improvements of the HP algorithm have been developed. Berman and Hannenhalli gave an algorithm that sorted a permutation in $O(n^2 \alpha(n))$ time, where α is the inverse Ackermann function. Kaplan et al. then simplified the concept of hurdles and gave a simple $O(n^2)$ algorithm. Bader et al. gave an optimal O(n) algorithm to calculate the reversal distance between genomes. Recently, Kaplan and Verbin gave an interesting randomized algorithm, which Tannier and Sagot derandomized to obtain a deterministic algorithm that sorts by reversals in subquadratic time.

The lengths of reversals were first taken into account by Chen and Skiena, who allowed only reversals of some constant length k. They gave an algorithm to sort all circular n-permutations using $O(n^2/k + kn)$ k-reversals, while proving that there exist permutations requiring $\Omega(n^2/k^2 + n)$ k-reversals to sort. Continuing on this track, Pinter and Skiena studied the linear cost model, which has an upper bound of O(n log2 n) on the cost of sorting any n-element permutation and a guaranteed approximation ratio of O(log2 n) times the optimal cost. Recently, Bender et al. presented tight and nearly tight lower and upper bounds for a wide class of cost models $f(l) = l^{\alpha}$ where α ≥ 0. They also gave good approximation algorithms for sorting linear permutations. Swidan et al. extended the results to signed and circular permutations. Table 1 summarizes the results obtained in Bender et al. Note that the lower and upper bounds shown in the table hold not only for linear permutations, but also for circular permutations. However, the approximation ratios are only for linear permutations.

Table 1

The Results Obtained in Bender et al.

α value	Lower bound	Upper bound (permutation)	Upper bound (0/1 sequence)	Approximation ratio (permutation)	Approximation ratio (0/1 sequence)
0 ≤ α < 1	Ω(n)	O(n log n)	Θ(n)		O(1)
α = 1	Ω(n log n)	O(n log²n)	Θ(n log n)	O(log n)	1
1 < α < 2	Ω(n^α)	Θ(n^α)	Θ(n^α)	O(log n)	O(1)
α ≥ 2	Ω(n²)	Θ(n²)	Θ(n²)	2 for α < 3; 1 for α ≥ 3	1


Results and Discussion

In this paper, we considered sorting 0/1 sequences and permutations using restricted-length-weighted reversals. First, we proved tight and nearly tight (upper and lower) bounds for all α ≥ 0. These bounds hold for linear permutations as well as circular permutations. When k = O(n/log n) (k is the maximum length of each reversal and n is the length of the permutation), our upper bounds are tight for both permutations and 0/1 sequences. Second, we gave approximation algorithms yielding nontrivial approximation ratios for all 1 ≤ α < 2. Table 2 summarizes our results. Note that the lower and upper bounds in the table hold for both linear and circular permutations. Also, when α ≥ 2, short reversals are always preferred to long ones; thus the results for our problem are identical to those for the problem considered in Bender et al.

Table 2

The Results Obtained in Our Work

α value	Lower bound	Upper bound (permutation)	Upper bound (0/1 sequence)
0 ≤ α < 1	Ω(n + n²k^(α−2))	O(n log n + n²k^(α−2))	Θ(n + n²k^(α−2))
α = 1	Ω(n log n + n²/k)	O(n log n log k + n²/k)	Θ(n log k + n²/k)
1 < α < 2	Ω(n²k^(α−2))	Θ(n²k^(α−2))	Θ(n²k^(α−2))
α ≥ 2	Ω(n²)	Θ(n²)	Θ(n²)


Lower bounds

We establish the lower bounds for sorting both linear and circular permutations of n elements based on the lower bounds for 0/1 sequences. We analyze only the case α < 2, since the other case is trivial. Our lower bounds are asymptotically tight because there exist algorithms that sort any 0/1 sequence with a cost asymptotically equal to these bounds.

Linear permutations

The lower bound in this case is established based on the notion of inversion. Given a sequence $S = s_1 s_2 \ldots s_n$, an inversion is a pair of positions $i < j$ such that $s_i > s_j$. The number of inversions in a 0/1 sequence of x 0’s and y 1’s is at most $xy$, which is at most $(x + y)^2/4$. Therefore, a reversal of length l can remove at most $l^2/4$ inversions from a 0/1 sequence. The sorted sequence has no inversions. Noticing that the sequence S = 11…100…0, where there are equal numbers of 0’s and 1’s, contains $n^2/4$ inversions, we establish our lower bounds for linear permutations based on S.

Theorem 1. When α < 2, the cost required to sort S is at least $n^2 k^{\alpha-2}$.

Proof: Let $l_1, l_2, \ldots, l_p$ be the lengths of the reversals in an optimal reversal series that sorts S, such that $l_i \le k$ for all 1 ≤ i ≤ p. Let $n_i$ be the number of inversions removed by performing reversal $l_i$; we have $n_i \le l_i^2/4$ and $\sum_{i=1}^{p} n_i = n^2/4$. The optimal cost is $\sum_{i=1}^{p} l_i^{\alpha}$. Since α < 2 and $l_i \le k$, we have $l_i^{\alpha-2} \ge k^{\alpha-2}$, and therefore:

$$\sum_{i=1}^{p} l_i^{\alpha} = \sum_{i=1}^{p} l_i^{2}\, l_i^{\alpha-2} \ge k^{\alpha-2} \sum_{i=1}^{p} l_i^{2} \ge 4k^{\alpha-2} \sum_{i=1}^{p} n_i = n^2 k^{\alpha-2}.$$

Hence, the theorem holds.

The cost of the optimal reversal series to sort a sequence when restricted to reversals whose lengths do not exceed k must be greater than or equal to the minimum cost to sort the same sequence when any reversal may be used. Thus, we combine Theorem 1 and the lower bounds established in Bender et al. to get the following lower bounds on the cost of the optimal reversal series in which no reversal is longer than k: $\Omega(n + n^2 k^{\alpha-2})$ for 0 ≤ α < 1, $\Omega(n \log n + n^2/k)$ for α = 1, and $\Omega(n^2 k^{\alpha-2})$ for 1 < α < 2.
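
The inversion argument can be checked numerically on small inputs. The Python sketch below counts inversions in a 0/1 sequence, brute-forces the largest number of inversions that reversing a window of length l can remove, and evaluates the bound $n^2 k^{\alpha-2}$. The helper names are illustrative assumptions, and the brute force is only practical for small l.

```python
from itertools import product


def inversions(bits):
    """Count pairs i < j with bits[i] > bits[j], i.e. a 1 placed before a 0."""
    count = 0
    ones_seen = 0
    for b in bits:
        if b == 1:
            ones_seen += 1
        else:
            count += ones_seen
    return count


def max_inversions_removed(l):
    """Largest drop in inversions over all 0/1 windows of length l.

    Reversing a window leaves its multiset of elements unchanged, so only
    inversions inside the window are affected.
    """
    best = 0
    for window in product((0, 1), repeat=l):
        drop = inversions(window) - inversions(window[::-1])
        best = max(best, drop)
    return best


def lower_bound(n, k, alpha):
    """The bound n^2 * k^(alpha - 2) for S = 1...10...0 with alpha < 2."""
    return n * n * k ** (alpha - 2)


if __name__ == "__main__":
    for l in range(1, 9):
        # The brute-force maximum matches floor(l^2 / 4).
        print(l, max_inversions_removed(l), (l * l) // 4)
    print(inversions([1, 1, 1, 0, 0, 0]))   # 9 = n^2 / 4 for n = 6
    print(lower_bound(6, 2, 1.0))           # 18.0
```

For n = 6, k = 2, and α = 1 the bound is 18, which is exactly the cost of the nine adjacent swaps (each of cost 2) needed to sort 111000.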

Circular permutations

Consider the circular sequence S = 0101…01. We choose an arbitrary 0 from S, and count the other 0’s and 1’s in that sequence starting from it. Let $d_i$ denote the distance between the ith 0 and the ith 1, and let $d = \sum_i d_i$. Initially, d = O(n). After performing an optimal reversal series $l_1, l_2, \ldots, l_p$, we obtain the sorted sequence S′ with $d = \Theta(n^2)$. The optimal cost to sort S is $\sum_{j=1}^{p} l_j^{\alpha}$. A reversal of length l can increase the distance between the ith 0 and the ith 1 of S by at most l, and it can increase at most l such distances. Thus a reversal of length l can increase d by at most $l^2$. Let $x_j$ be the amount by which d increases when the reversal $l_j$ in the series is performed; we have $x_j \le l_j^2$ and $\sum_{j=1}^{p} x_j = \Omega(n^2)$. Then, since α < 2 and $l_j \le k$:

$$\sum_{j=1}^{p} l_j^{\alpha} = \sum_{j=1}^{p} l_j^{2}\, l_j^{\alpha-2} \ge k^{\alpha-2} \sum_{j=1}^{p} l_j^{2} \ge k^{\alpha-2} \sum_{j=1}^{p} x_j = \Omega(n^2 k^{\alpha-2}).$$

Thus, we have the following theorem:

Theorem 2. When α < 2, the minimum cost required to sort S is $\Omega(n^2 k^{\alpha-2})$.

Again, combining Theorem 2 and the lower bounds established in Swidan et al. gives us the lower bounds for sorting circular permutations, which are identical to those for sorting linear permutations.

Upper bounds

Here we present the algorithms for sorting any n-element permutation and analyze their worst-case cost. The worst-case costs of these algorithms are upper bounds on the diameter of sorting by restricted-length-weighted reversals. We note that bubble sort gives a tight upper bound for the case α ≥ 2. Furthermore, by “cutting” a circular permutation at any point and treating it as a linear permutation, we obtain an algorithm with the same asymptotic cost to sort a circular permutation. Therefore, we only consider sorting a linear permutation with α < 2.

To sort a 0/1 sequence $S = s_1 s_2 \ldots s_n$ with only reversals whose lengths do not exceed k, we divide S into small segments such that for each segment (i, j):
1. There are at most ⌊k/2⌋ 0’s and at most ⌊k/2⌋ 1’s in $\{s_i, s_{i+1}, \ldots, s_j\}$.
2. There are either ⌊k/2⌋ 0’s or ⌊k/2⌋ 1’s in $\{s_i, s_{i+1}, \ldots, s_j\}$.

We first sort each segment separately using the algorithms described in Bender et al. for 0/1 sequences under the corresponding cost model. Then we perform bubble sort on blocks of 0’s and 1’s, in which each swap is mimicked by a reversal of length not greater than k. There are O(n/k) segments, and each segment has length O(k). Let B′(l) denote the cost to sort a subsequence whose length l does not exceed k. The bubble sort performs $O(n^2/k^2)$ block swaps, each of cost $O(k^{\alpha})$, so the cost to sort a 0/1 sequence is:

$$B(n) = O(n/k)\,B'(k) + O(n^2/k^2)\,O(k^{\alpha}).$$

To sort a permutation π, we divide the sequence around the median and recursively sort both halves. In order to divide around the median, we map each element less than the median to 0 and each element greater than or equal to the median to 1. Thus, the cost to sort a permutation is:

$$P(n) = 2P(n/2) + B(n).$$

Theorem 3. When the cost of reversals is $f(l) = l^{\alpha}$, 0 ≤ α < 1, for linear permutations, each 0/1 sequence can be sorted with cost $O(n + n^2 k^{\alpha-2})$, and each permutation can be sorted with cost $O(n \log n + n^2 k^{\alpha-2})$.

Proof: The cost to sort a 0/1 sequence of length not exceeding k is B′(l) = O(l). We have $B(n) = O(n/k)\,O(k) + O(n^2/k^2)\,O(k^{\alpha})$. Hence, $B(n) = O(n + n^2 k^{\alpha-2})$. For the permutation, the cost is $P(n) = 2P(n/2) + O(n + n^2 k^{\alpha-2})$. Hence, $P(n) = O(n \log n + n^2 k^{\alpha-2})$.

Theorem 4. When the cost of reversals is f(l) = l, for linear permutations, each 0/1 sequence can be sorted with cost $O(n \log k + n^2/k)$, and each permutation can be sorted with cost $O(n \log n \log k + n^2/k)$.

Proof: The cost to sort a 0/1 sequence of length not exceeding k is B′(l) = O(l log l). We have $B(n) = O(n/k)\,O(k \log k) + O(n^2/k^2)\,O(k)$. Hence, $B(n) = O(n \log k + n^2/k)$. For the permutation, the cost is $P(n) = 2P(n/2) + O(n \log k + n^2/k)$. Hence, $P(n) = O(n \log n \log k + n^2/k)$.

The above upper bounds for 0/1 sequences are tight. When k is small, i.e. $k^{2-\alpha} = O(n/\log n)$, we have $n \log n = O(n^2 k^{\alpha-2})$. For this case, the upper bounds for permutations are also tight. When k = n, our results are the same as those in Bender et al.

Next, we consider the case when 1 < α < 2.

Theorem 5. When the cost of reversals is $f(l) = l^{\alpha}$, 1 < α < 2, for linear permutations, each 0/1 sequence can be sorted with cost $O(n^2 k^{\alpha-2})$, and each permutation can be sorted with cost $O(n \log n + n^2 k^{\alpha-2})$.

Proof: The cost to sort a 0/1 sequence of length not exceeding k is $B'(l) = O(l^{\alpha})$. We have $B(n) = O(n/k)\,O(k^{\alpha}) + O(n^2/k^2)\,O(k^{\alpha})$. Hence, $B(n) = O(n^2 k^{\alpha-2})$. For the permutation, the cost is $P(n) = 2P(n/2) + O(n^2 k^{\alpha-2})$. Hence, $P(n) = O(n^2 k^{\alpha-2} + n \log n)$.

In this case, the upper bounds are tight for both 0/1 sequences and permutations. When k = n, our results are the same as those in Bender et al.
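
The remark that bubble sort is tight for α ≥ 2 can be illustrated with a short Python sketch. It sorts a permutation using only reversals of length 2 (adjacent swaps), which are permitted for any k ≥ 2, and charges $2^{\alpha}$ per swap. The function name and the returned tuple layout are assumptions made for this example.

```python
def bubble_sort_by_reversals(perm, alpha):
    """Sort `perm` with reversals of length 2 only.

    Returns the sorted list, the reversals used as half-open index
    pairs (i, i + 2), and the total cost under f(l) = l**alpha.
    """
    s = list(perm)
    reversals = []
    n = len(s)
    for end in range(n - 1, 0, -1):
        for i in range(end):
            if s[i] > s[i + 1]:
                # Reversing two adjacent elements is an adjacent swap.
                s[i:i + 2] = s[i:i + 2][::-1]
                reversals.append((i, i + 2))
    cost = len(reversals) * 2.0 ** alpha
    return s, reversals, cost


if __name__ == "__main__":
    s, reversals, cost = bubble_sort_by_reversals([3, 2, 1, 0], alpha=2.0)
    print(s)               # [0, 1, 2, 3]
    print(len(reversals))  # 6, one swap per inversion
    print(cost)            # 24.0
```

Each swap removes exactly one inversion, so the number of reversals equals the number of inversions, which is at most n(n − 1)/2; the total cost is therefore $O(n^2)$ for any fixed α.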

Approximation algorithms for 1 ≤ α < 2

Constant approximation algorithm for S = 11…100…0

We present 2-approximation algorithms for the sequence S = 11…100…0. Although they only solve the problem on one special sequence, these algorithms are important because they serve as the basis for sorting general 0/1 sequences. Let a and b be the lengths of the blocks of 1’s and 0’s in S, respectively. Without loss of generality, assume that a ≤ b. When b < k/2, the sequence can be sorted optimally by a single reversal. The following theorem states that there are 2-approximation algorithms to sort S in the remaining cases.

Theorem 6. There is a 2-approximation algorithm for SBRLR(S, k, α).

Proof: We consider four cases based on the parity of k and the relationship between a and k/2.

Case 1. k is even and a ≥ k/2. Let k = 2k′, a = k′i + p, and b = k′j + q for k′, i, j ≥ 1 and 0 ≤ p, q < k′. By an inversion argument similar to that used in deriving the lower bounds, we can prove that $L = 4ab\,k^{\alpha-2}$ is a lower bound on the cost of sorting S. We prove that the following algorithm sorts S with a cost of at most 2L.
1. Divide the block of 1’s into i blocks of length k′ and 1 block of length p such that the block of length p is at the right end of the original block.
2. Divide the block of 0’s into j blocks of length k′ and 1 block of length q such that the block of length q is at the left end of the original block.
3. Move the block of p 1’s to the right end of S.
4. Move the block of q 0’s to the left end of S.
5. Sort the remaining blocks by bubble sort, where each swap is mimicked by a reversal of length k.

The cost of this algorithm is analyzed as follows. Step 3 has cost $j(p + k')^{\alpha} + (p + q)^{\alpha}$, Step 4 has cost $i(q + k')^{\alpha}$, and Step 5 has cost $ij(2k')^{\alpha}$. Hence, the total cost of this algorithm is:

$$A = j(p + k')^{\alpha} + (p + q)^{\alpha} + i(q + k')^{\alpha} + ij(2k')^{\alpha}.$$

Let $F(p, q) = 2L - A$. We have $\partial F/\partial p \ge 0$ and, similarly, $\partial F/\partial q \ge 0$, and thus F is increasing in both p and q. Hence F(p, q) ≥ F(0, 0) > 0. So A < 2L.

Case 2. k is even and a < k/2. We augment the inversion argument with simple calculus. First, it can be shown that the function $f(x) = (x + a)^{\alpha}/(ax)$ is decreasing on $(0, a/(\alpha-1)]$ and increasing on $[a/(\alpha-1), \infty)$. Thus, $f(x) \ge f(t)$ for all $x \in (0, k - a]$, where $t = \min\{a/(\alpha-1),\, k - a\}$ (with $t = k - a$ when α = 1). Now let $l_1, l_2, \ldots, l_p$ be the lengths of the reversals in the optimal reversal series to sort S, such that $l_i \le k$ for all 1 ≤ i ≤ p, and let $n_i$ be the number of inversions removed by performing reversal $l_i$. If $l_i \le 2a$, we have $n_i \le l_i^2/4$ and hence $l_i^{\alpha}/n_i \ge 4 l_i^{\alpha-2} \ge 4(2a)^{\alpha-2} = f(a) \ge f(t)$. Otherwise we have $n_i \le a(l_i - a)$ and hence $l_i^{\alpha}/n_i \ge f(l_i - a) \ge f(t)$. Hence $l_i^{\alpha} \ge f(t)\,n_i$ for all 1 ≤ i ≤ p. The optimal cost to sort S is then:

$$\sum_{i=1}^{p} l_i^{\alpha} \ge f(t) \sum_{i=1}^{p} n_i = f(t)\,ab.$$

Let $L = f(t)\,ab = b(t + a)^{\alpha}/t$. The following algorithm sorts S with a cost of at most 2L:
1. Since (a + b) > 2k′ ≥ (t + a), b = ti + r for some i ≥ 1 and 0 ≤ r < t; we divide the block of 0’s into i blocks of length t and one block of length r.
2. Move the block of 1’s to the right end of S using i reversals of length t + a and a reversal of length a + r.

The cost of this algorithm is:

$$A = i(t + a)^{\alpha} + (a + r)^{\alpha} \le 2i(t + a)^{\alpha} \le 2b(t + a)^{\alpha}/t = 2L.$$

Case 3. k is odd and a ≥ k/2. Similar to Case 1.

Case 4. k is odd and a < k/2. Similar to Case 2.

(2 log2|S| + 1)-approximation algorithm for 0/1 sequences

We utilize the algorithms in the previous section to sort general 0/1 sequences. First we sort a sequence S where $|S| = 2^t$ by the following algorithm:
1. Divide S into two halves L and R.
2. Sort L and R recursively. After this step we have the sequence 00…011…100…011…1.
3. Use the algorithms in the previous section to swap the first block of 1’s with the second block of 0’s.

The performance of this algorithm is given by Theorem 7. In proving this theorem, we make intensive use of the concept of the reduction of a reversal on a subsequence S′ of a sequence S. If r is a reversal on S, we define the reduction $r|_{S'}$ of r on S′ to be the transformation that reverses the order of the elements of S′ that are affected by r. For example, assume that r affects the segment $s_1 s_2 s_3 s_4 s_5$, and $s_1, s_3, s_4$ are the elements of S′; then $r|_{S'}$ reverses the order of $s_1, s_3, s_4$ in S′, that is, it puts $s_1$ in the position of $s_4$ and vice versa. It is readily verified that $r|_{S'}$ is a reversal on S′, that is, it reverses the order of consecutive elements of S′.
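
The reduction of a reversal can be sketched in Python as follows. The subsequence S′ is represented by the sorted list of positions in S that its elements currently occupy; the function returns the induced reversal on S′ (as a half-open index range into S′) together with the updated positions. This representation and the function name are assumptions made for illustration only.

```python
from bisect import bisect_left


def reduce_reversal(reversal, positions):
    """Reduce a reversal on S to the subsequence S'.

    `reversal` is a half-open pair (i, j) of indices into S.
    `positions` is the sorted list of indices in S holding the elements of S'.
    Returns ((lo, hi), new_positions): the half-open range of S' that is
    reversed, and the positions of the elements of S' after the reversal.
    """
    i, j = reversal
    lo = bisect_left(positions, i)   # first element of S' inside [i, j)
    hi = bisect_left(positions, j)   # first element of S' at or after j
    # A position p inside the reversed segment moves to i + j - 1 - p.
    mirrored = sorted(i + j - 1 - p for p in positions[lo:hi])
    new_positions = positions[:lo] + mirrored + positions[hi:]
    return (lo, hi), new_positions


if __name__ == "__main__":
    # The example from the text with 0-based indices: r reverses s1..s5,
    # and S' consists of s1, s3, s4.
    print(reduce_reversal((0, 5), [0, 2, 3]))   # ((0, 3), [1, 2, 4])
```

The reduced reversal covers hi − lo consecutive elements of S′, which is never more than the length j − i of the original reversal.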
Furthermore, if the series $\mathcal{R} = r_1, r_2, \ldots, r_p$ sorts S, then $r_1|_{S'}, r_2|_{S'}, \ldots, r_p|_{S'}$ sorts S′.

Proposition 1. Let $OPT_S$, $OPT_L$, $OPT_R$ be the optimal costs to sort S, L, R, respectively; we have $OPT_S \ge OPT_L + OPT_R$.

Proof: Let $\mathcal{R}$ be an optimal reversal series that sorts S. For a reversal r in $\mathcal{R}$, we have $|r|_L| + |r|_R| = |r|$, where |x| denotes the length of reversal x. Since 1 ≤ α ≤ 2, we have $|r|_L|^{\alpha} + |r|_R|^{\alpha} \le |r|^{\alpha}$. Hence, the total cost of the reduced series that sort L and R is at most the cost of $\mathcal{R}$. This completes the proof.

Proposition 2. Let S be a 0/1 sequence and k be a position. If there are i 1’s on the left of k and j 0’s on the right of k, then the cost of sorting S is at least that of sorting $T = 1^i 0^j$.

Proof: We map the tth 1 on the left of k in S to the tth 1 of T, and the tth 0 on the right of k in S to the tth 0 of T. Let T′ be the subsequence of S whose elements are mapped to those of T. The reduction on T′ of any reversal series that sorts S sorts T′ and hence sorts T. Hence the cost of sorting T does not exceed the cost of sorting S.

Theorem 7. The algorithm above sorts a 0/1 sequence S with a cost of at most $2\,OPT_S \log_2|S|$ when $|S| = 2^t$ for some integer t.

Proof: Let C(S), C(L), C(R) be the costs of sorting S, L, and R by the algorithm, and D be the cost of Step 3. Proposition 2 shows that $D \le 2\,OPT_S$. Hence, we have the following recurrence:

$$C(S) = C(L) + C(R) + D \le C(L) + C(R) + 2\,OPT_S.$$

By induction on |S| using Proposition 1, we can verify that $C(S) \le 2\,OPT_S \log_2|S|$.

To sort a general sequence S, let $2^t \le |S| < 2^{t+1}$; we insert $(2^{t+1} - |S|)$ 0’s to the left of S to get a sequence S′ of length $2^{t+1}$. It is obvious that the cost to sort S is at most the cost to sort S′, and any algorithm that sorts S also sorts S′ with the same cost. Hence $OPT_S = OPT_{S'}$, and applying Theorem 7 to S′ bounds the cost of sorting S. Hence, there is a (2 log2 |S| + 1)-approximation algorithm to sort general 0/1 sequences.

(2(log2 n)^2 + log2 n)-approximation algorithm for linear permutations

Again, we first give the algorithm to sort a permutation S where $|S| = 2^t$ for some integer t:
1. Find the median of the permutation.
2. Divide the permutation into halves, with the right half containing all the elements bigger than the median and the left half containing all the elements smaller than the median. We can do this by treating the elements bigger than the median as 1’s and the elements smaller than the median as 0’s, and invoking the algorithm in the previous section.
3. Sort the two halves recursively.

The performance of this algorithm is given in the following theorem:

Theorem 8. Let OPT be the optimal cost of sorting a permutation S. The above algorithm sorts S with a cost of at most $2(\log_2|S|)^2\,OPT$.

Proof: Since the cost of Step 2 is at most $2(\log_2|S|)\,OPT$, the cost of this algorithm satisfies the recurrence:

$$C(S) \le C(L) + C(R) + 2(\log_2|S|)\,OPT,$$

where L and R denote the two halves produced by Step 2. The theorem then follows by a simple induction.

To sort a general permutation S where $2^t \le |S| < 2^{t+1}$, we concatenate S with the sequence |S|(|S| + 1)…$2^{t+1}$ to get a permutation S′ of length $2^{t+1}$, and apply the above algorithm. By arguments similar to those in the previous section, we obtain the (2(log2 n)^2 + log2 n) approximation ratio.

A more exact analysis of the approximation ratio when k = Ω(|S|)

Here, we prove that the above approximation algorithms for 1 ≤ α < 2 and k = Ω(|S|) have approximation ratios of O(1) for 0/1 sequences and O(log n) for permutations. For a 0/1 sequence S of length n, let w(i, S) denote the number of wrong-sided elements with respect to position i, that is, the number of extra 1’s in the first i elements of S plus the number of extra 0’s in the last (n − i) elements of S when compared with the sorted sequence. A potential function W(S) is defined from these quantities. Using a result in Bender et al., the optimal cost can be bounded from below in terms of this potential with a constant c = 1/8 [(3/4) – (1/4)], where x is the number of 0’s in S. Let $k \ge c_1 n$. We need to prove that the cost of Step 3 is bounded accordingly for some constant T. In other words, this is equivalent to proving that the cost to sort the sequence S′ = 11…100…0 whose length is w(x, S) does not exceed $T\,w(x, S)^{\alpha}$. Let k′ = k/2, and suppose there are pk′ 1’s and qk′ 0’s in S′. The cost to sort S′ using the above approximation algorithm is at most $2pq\,k^{\alpha}$, whereas w(x, S) = (p + q)k/2. Besides, pk′ + qk′ = n, thus either pk′ ≤ n/2 or qk′ ≤ n/2. Assume pk′ ≤ n/2, hence $p \le n/k \le 1/c_1$. Bounding the cost using this inequality, we can choose T = 1/2 cc1. By induction, we have C(S) = O(W(S)), that is, the above algorithm is an O(1)-approximation algorithm for 0/1 sequences. When S is a permutation, using the approximation algorithm for permutations, we have C(S) = 2C(S/2) + O(OPT). An easy induction shows that C(S) = O(OPT log n), that is, the above algorithm is an O(log n)-approximation algorithm for permutations.
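
The divide-and-conquer scheme for 0/1 sequences described in this section can be sketched in Python as below. The recursion (sort both halves, then swap the middle block of 1’s with the middle block of 0’s) follows the three steps given above. The block swap, however, is a simplified stand-in: it moves chunks of at most ⌊k/2⌋ 1’s across the 0’s in hops that keep every reversal within length k. This chunking rule is an assumption made for brevity; it does not reproduce the four-case analysis of the 2-approximation proof and carries no approximation guarantee. All function names are illustrative.

```python
def reverse(s, i, j, k, lengths):
    """Reverse s[i:j] in place and record its length, which must not exceed k."""
    assert j - i <= k
    s[i:j] = s[i:j][::-1]
    lengths.append(j - i)


def swap_blocks(s, start, a, b, k, lengths):
    """Turn s[start:start+a+b] from 1^a 0^b into 0^b 1^a with reversals <= k."""
    chunk_max = max(1, k // 2)
    ones_left = a
    while ones_left > 0 and b > 0:
        c = min(chunk_max, ones_left)
        pos = start + ones_left - c      # rightmost unmoved chunk is s[pos:pos+c]
        zeros_left = b
        while zeros_left > 0:
            h = min(k - c, zeros_left)   # hop over h zeros: 1^c 0^h -> 0^h 1^c
            reverse(s, pos, pos + c + h, k, lengths)
            pos += h
            zeros_left -= h
        ones_left -= c


def sort01(s, lo, hi, k, lengths):
    """Sort s[lo:hi] in place: sort both halves, then swap the middle blocks."""
    if hi - lo <= 1:
        return
    mid = (lo + hi) // 2
    sort01(s, lo, mid, k, lengths)
    sort01(s, mid, hi, k, lengths)
    a = sum(s[lo:mid])                   # 1's at the end of the sorted left half
    b = (hi - mid) - sum(s[mid:hi])      # 0's at the start of the sorted right half
    swap_blocks(s, mid - a, a, b, k, lengths)


def sort_bits(bits, k, alpha):
    """Return the sorted sequence, the reversal lengths used, and the total cost."""
    if k < 2:
        raise ValueError("k must be at least 2")
    s = list(bits)
    lengths = []
    sort01(s, 0, len(s), k, lengths)
    cost = sum(l ** alpha for l in lengths)
    return s, lengths, cost


if __name__ == "__main__":
    print(sort_bits([1, 1, 0, 0], k=3, alpha=1.0))   # ([0, 0, 1, 1], [3, 3], 6.0)
```

Because the recursion splits at the midpoint, the sketch also accepts lengths that are not powers of two, so the padding step described in the text is not needed here.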

Conclusion

We have presented tight or nearly tight lower and upper bounds for the problem of sorting by restricted-length-weighted reversals, as well as approximation algorithms for a wide range of α. These results can be extended in various directions. One direction is to improve the approximation ratios of the algorithms or to determine the hardness of the problem as a function of k and/or α. Furthermore, finding an approximation algorithm for the case where α < 1 remains open. We can also work with other length-weighted functions that are consistent with some meaningful distribution. Another extension is to consider the more general problem in which f is a step function. Finally, it is interesting and important to find a cost function that can be verified in practice.


1. A linear-time algorithm for computing inversion distance between signed permutations with an experimental study.

Authors: D A Bader; B M Moret; M Yan
Journal: J Comput Biol Date: 2001 Impact factor: 1.479

2. Genomic sorting with length-weighted reversals.

Authors: Ron Y Pinter; Steven Skiena
Journal: Genome Inform Date: 2002
