| Literature DB >> 20333186 |
István Miklós1, Aaron E Darling.
Abstract
Inversions are among the most common mutations acting on the order and orientation of genes in a genome, and polynomial-time algorithms exist to obtain a minimal length series of inversions that transform one genome arrangement to another. However, the minimum length series of inversions (the optimal sorting path) is often not unique as many such optimal sorting paths exist. If we assume that all optimal sorting paths are equally likely, then statistical inference on genome arrangement history must account for all such sorting paths and not just a single estimate. No deterministic polynomial algorithm is known to count the number of optimal sorting paths nor sample from the uniform distribution of optimal sorting paths. Here, we propose a stochastic method that uniformly samples the set of all optimal sorting paths. Our method uses a novel formulation of parallel Markov chain Monte Carlo. In practice, our method can quickly estimate the total number of optimal sorting paths. We introduce a variant of our approach in which short inversions are modeled to be more likely, and we show how the method can be used to estimate the distribution of inversion lengths and breakpoint usage in pathogenic Yersinia pestis. The proposed method has been implemented in a program called "MC4Inversion." We draw comparison of MC4Inversion to the sampler implemented in BADGER and a previously described importance sampling (IS) technique. We find that on high-divergence data sets, MC4Inversion finds more optimal sorting paths per second than BADGER and the IS technique and simultaneously avoids bias inherent in the IS technique.Entities:
Keywords: MCMC; Yersinia; genome rearrangement; inversion
Year: 2009 PMID: 20333186 PMCID: PMC2817410 DOI: 10.1093/gbe/evp015
Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416
FThe unsigned representation of the signed permutation −2, +4, −1, +3, and its graph of desire and reality. In the graph of desire and reality, every second number of the unsigned permutation is connected with a line (reality edges, the numbers that are next to each other), and each even number is connected to the next odd number by an arc (desire edges, which numbers should be neighbors to get the +1, +2, …, +n permutation). The graph of desire and reality can be decomposed into cycles. This graph contains two intersecting cycles. Arrows on the reality edges indicates a possible walk around the cycles.
FOverview of the sampler. The set of all optimal sorting paths can be represented as a tree where nodes represent genome arrangements (signed permutations) and edges represent application of a sorting inversion. The root is the starting genome arrangement, and the leaves are the target arrangement so that every path from root to leaf represents an optimal sorting path. In most cases, the tree will be too large to compute in its entirety so we desire instead to take a representative sampling of the sorting paths embedded in the tree. To do so, we construct a sampler with d chains, where d is the minimum sorting distance. Each chain C samples uniformly from the nodes at depth i in the tree, thereby sampling uniformly from the i-long prefixes of optimal sorting paths. Chain C samples complete paths. The chains perform a random walk through the tree by swapping with each other and within themselves (see text). In practice, we do not even compute the full set of edges (sorting reversals) below a node, instead use rejection sampling to find an arbitrary edge quickly.
FThe network representing all shortest sorting paths of signed permutation +2,+1,−3, and the corresponding tree representing all shortest sorting paths.
Data Sets Used for Empirical Evaluation of Mixing Time
| Divergence | Organisms | Breakpoints | Min. Inversions | |
| Low | 16 | 11 | ||
| 1, 5, −13, 7, −10, 9, −8, 11, −12, −6, 3, −14, 16, −4, 2, −15, 17 | ||||
| Medium | 36 | 25 | ||
| 1, −11, 7, 6, −26, −23, −24, −25, −22, 21, −20, 19, −18, 17, −16, 15, −14, 13, −12, −36, 28, −34, 27, −4, −8, −5, 2, 32, 29, −9, −3, −33, 31, −10, −30, 35, 37 | ||||
| Medium 2 | 51 | 35 | ||
| 1, 2, −41, 6, −37, −33, 32, −31, −49, −4, −34, −30, 13, −15, −17, −11, 29, −12, −16, 14, 18, −19, 20, −22, −23, 21, 24, 28, −27, 26, −25, −10, 9, −36, −3, 45, 5, −40, 47, 51, 38, 7, −8, 42, −43, 44, −50, 48, −46, −39, −35, 52 | ||||
| High | 97 | 79 | ||
| 1, −81, 11, 8, −84, −13, 75, 59, −60, 51, 49, 56, −85, 9, −77, 2, −95, 6, 93, 83, −97, 91, −69, 72, 76, −88, −92, −62, 5, −94, −30, 21, 32, 31, 22, −33, −29, 50, 61, −80, 52, 24, 23, 34, −39, 35, 37, −42, 41, −40, −38, −43, −36, 44, −20, 54, −87, −47, 46, −53, −63, 70, 82, 26, 64, 28, −55, 18, −65, −27, 25, −17, 67, 14, 16, 66, 79, −19, 68, −71, 86, 48, −45, 58, −57, −10, 78, −74, −12, 15, −7, −3, 96, −4, −73, −89, 90, 98 | ||||
NOTE.—The four data sets were chosen to represent low, medium, and high degrees of genomic rearrangement. All genomes are circular. Signed permutations for each data set are given in the next row. *Chain et al. (2006); **Deng et al. (2002); ***Song et al. (2004).
Mixing Speed of Our Method and of BADGER (Larget et al. 2005) with Various Numbers of Parallel Chains
| Unique Optimal Samples per Second | All Unique Samples per Second | |||||||
| Program | Low | Medium | Medium 2 | High | Low | Medium | Medium 2 | High |
| MC4 | 1887.0 | 398.0 | 206.0 | 29.0 | 1887.0 | 398.0 | 206.0 | 29.0 |
| BADGER | 1493.0 | 102.0 | 112.0 | 0.08 | 2577.0 | 645.0 | 365.0 | 55.0 |
| BADGER | 437.0 | 30.0 | 34.0 | 0.97 | 790.0 | 184.0 | 113.0 | 18.0 |
| BADGER | 33.0 | 2.6 | 4.7 | 0.0 | 57.0 | 17.0 | 12.0 | 3.2 |
| BADGER | 5.7 | 1.3 | 1.0 | 0.0 | 10.0 | 4.2 | 2.8 | 0.7 |
NOTE.—For each level of divergence, the mixing time is given as the number of unique optimal sorting paths discovered per second and the total sorting paths discovered per second. Each program was run for exactly 10 min of CPU time. BADGER only used updates which modify sorting paths (i.e., tree topology updates were disabled) and a number of parallel chains given by c with temperatures ranging from 1 to 1.04. Other BADGER parameters were fixed at default values. At low levels of divergence, BADGER and the new method perform comparably. At medium and high divergence, the new method samples optimal paths more efficiently.
FEstimated number of sorting paths for the highly divergent permutation in table 1. The y axis shows the base-10 logarithm of the estimated number of sorting paths, and the x axis shows the index of the samples from the Markov chain. The estimation is based on our MCMC sampler, see the text for details. In total 50 million Markov chain steps were made, and the total sorting path count estimates were sampled every 25,000 steps. The orange curve shows the estimations based on all the samples up to the current index, and the black curve shows the estimations when the first 10 million steps were discarded as burn-in of the MCMC. As more samples are taken, our estimates converge on the true number of sorting paths.
FDistribution of inferred inversion lengths. Estimates based on three methods are shown: IS (orange), MC4 (blue), and MC4 modeling a preference for short inversions (black). Short inversions are more common than long inversions, and the IS method exhibits an unexplained bias away from short inversions that can be ascribed to its high sampling variance. When modeling a preference for short inversions we use equation (12) with T = 100.
FEstimated number of times each block boundary has been used as an inversion endpoint. Block boundaries are arranged according to their occurrence in the reference genome. Some blocks are extremely short, leading to points that appear to overlap. Estimates made by three methods are shown: IS (orange dots), MC4 (blue squares), and MC4 with a preference for short inversions (black triangles). When modeling a preference for short inversions, we set T = 100 in equation (12) as described in the text. As previously reported (Darling et al. 2008), we find fewer inversions surrounding the replication terminus, which is marked with a cyan vertical bar.