Fábio V Martinez, Pedro Feijão, Marília DV Braga, Jens Stoye.
Abstract
Structural variation in genomes can be revealed by many (dis)similarity measures. Rearrangement operations, such as the so-called double-cut-and-join (DCJ), are large-scale mutations that can create complex changes and produce such variations in genomes. A basic task in comparative genomics is to find the rearrangement distance between two given genomes, i.e., the minimum number of rearrangement operations that transform one given genome into the other. In a family-based setting, genes are grouped into gene families, and efficient algorithms have already been presented to compute the DCJ distance between two given genomes. In this work we propose the problem of computing the DCJ distance of two given genomes without prior gene family assignment, directly using the pairwise similarities between genes. We prove that this new family-free DCJ distance problem is APX-hard and provide an integer linear program for its solution. We also study a family-free DCJ similarity and prove that its computation is NP-hard.
Keywords: DCJ; Family-free genome comparison; Genome rearrangement
Year: 2015 PMID: 25859276 PMCID: PMC4391664 DOI: 10.1186/s13015-015-0041-9
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Figure 1. The adjacency graph for two unichromosomal linear genomes.
Figure 2. A possible gene similarity graph for the two unichromosomal linear genomes A and B.
Figure 3. Reduced genomes and their weighted adjacency graph. Considering the genomes A={(∘ 1 2 3 4 5 ∘)} and B={(∘ 6 −7 −8 −9 10 11 ∘)} as in Figure 2, let M1 (dotted edges) and M2 (dashed edges) be two distinct matchings in GS(A,B), shown in the upper part. The two resulting weighted adjacency graphs are shown in the lower part: the first has two odd paths and three cycles, and the second has two odd paths and two cycles.
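The idea of reducing two genomes with respect to a matching can be sketched as follows: genes not covered by the matching are dropped, and each matched pair receives a shared label while keeping its orientation. This is a minimal sketch under stated assumptions. The helper `reduce_genomes` and the matching `M` below are hypothetical; `M` is neither M1 nor M2 from the figure, whose edges are not listed in the caption.

```python
def reduce_genomes(a, b, matching):
    """Keep only matched genes and give each matched pair a shared label."""
    label_a = {x: k for k, (x, _) in enumerate(matching, start=1)}
    label_b = {y: k for k, (_, y) in enumerate(matching, start=1)}

    def relabel(genome, label):
        # Preserve each gene's orientation (sign) while renaming it;
        # genes that are not in the matching are removed.
        return [(1 if g > 0 else -1) * label[abs(g)] for g in genome if abs(g) in label]

    return relabel(a, label_a), relabel(b, label_b)


A = [1, 2, 3, 4, 5]
B = [6, -7, -8, -9, 10, 11]
# Hypothetical matching between genes of A and genes of B (gene 11 stays unmatched).
M = [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10)]
reduced_a, reduced_b = reduce_genomes(A, B, M)
```

After this step both reduced genomes are defined over the same set of labels, so a family-based distance can be computed on them.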
Figure 4. Gene similarity graph constructed from the input genomes of an EXDCJ-DISTANCE instance, where all edge weights are 1. Highlighted edges represent a maximal matching M in GS(A_F, B_F).
Figure 5. Gene similarity graph.
ILP running-time results for datasets with different genome sizes and evolutionary rates
| | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| Finished | 35/45 | 10/45 | 2/45 | 45/45 | 9/45 | 1/45 | 45/45 | 7/45 | 3/45 |
| Avg. Time (s) | 99.66 | 6.97 | 0.53 | 0.47 | 0.70 | 3.31 | 0.45 | 2.03 | 213.15 |
| Avg. Gap (%) | 0.3 | 3.0 | 4.3 | 0 | 3.6 | 6.5 | 0 | 5.3 | 4.8 |
Each dataset has 10 genomes, giving 45 pairwise comparisons. The maximum running time was set to 60 minutes. For each dataset, the table shows the number of runs that found an optimal solution within the allowed time and the average running time of those runs in seconds. For the runs that did not finish, the last row shows the relative gap between the upper bound and the current solution. A rate of r=1 is the default rate for ALF evolution; r=2 and r=5 denote 2-fold and 5-fold increases in the gene duplication, gene deletion and rearrangement rates.
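The count of 45 comparisons follows from choosing unordered pairs among 10 genomes, i.e. 10·9/2. A small sketch of that enumeration, with hypothetical genome labels:

```python
from itertools import combinations

# Hypothetical labels standing in for the 10 genomes of one dataset.
genomes = [f"genome_{i}" for i in range(1, 11)]

# Every unordered pair of distinct genomes is one pairwise comparison.
pairs = list(combinations(genomes, 2))
```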