Diego P Rubert1, Edna A Hoshino1, Marília D V Braga2, Jens Stoye2, Fábio V Martinez3.
Abstract
BACKGROUND: The genomic similarity is a large-scale measure for comparing two given genomes. In this work we study the (NP-hard) problem of computing the genomic similarity under the DCJ model in a setting that does not assume that the genes of the compared genomes are grouped into gene families. This problem is called family-free DCJ similarity.
Keywords: Double-cut-and-join; Family-free genomic similarity; Genome rearrangement
Year: 2018 PMID: 29745861 PMCID: PMC5998916 DOI: 10.1186/s12859-018-2130-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1. The adjacency graph for two genomes
Fig. 2. Representation of a gene similarity graph GS(A,B) for two unichromosomal linear genomes A and B
Fig. 3. The weighted adjacency graph AG(A,B) for two unichromosomal linear genomes A and B
Fig. 4. Considering, as in Fig. 2, the genomes A and B, let M1 (dashed edges) and M2 (dotted edges) be two distinct maximal matchings in GS(A,B), shown in the upper part. The two resulting weighted adjacency graphs are shown in the lower part: the one induced by M1 has two cycles and two even paths, and the one induced by M2 has two odd paths.
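The notion of a maximal matching used in the Fig. 4 caption can be illustrated with a minimal Python sketch. The function name `is_maximal_matching` and the toy graph are assumptions made for illustration only; they are not the graph drawn in the figure, and edge weights are ignored here.

```python
def is_maximal_matching(edges, matching):
    """Check that `matching` is a matching of the graph given by `edges`
    and that no further edge of the graph can be added to it."""
    edge_set = {frozenset(e) for e in edges}
    saturated = set()
    for u, v in matching:
        # Every matching edge must exist in the graph.
        if frozenset((u, v)) not in edge_set:
            return False
        # No vertex may be covered twice.
        if u in saturated or v in saturated:
            return False
        saturated.update((u, v))
    # Maximal: every edge of the graph touches an already saturated vertex.
    return all(u in saturated or v in saturated for u, v in edges)


# Hypothetical gene similarity graph: genes a1, a2 in one genome, b1, b2 in the other.
toy_edges = [("a1", "b1"), ("a1", "b2"), ("a2", "b1")]
m1 = [("a1", "b1")]                # maximal, one edge
m2 = [("a1", "b2"), ("a2", "b1")]  # maximal, two edges
m3 = [("a1", "b2")]                # not maximal: ("a2", "b1") can still be added
```

In this toy graph `m1` and `m2` are both maximal although they differ in size, which mirrors the caption's point that distinct maximal matchings can lead to different adjacency graphs.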
Fig. 5. Consider genomes A and B and their gene similarity graph GS(A,B). The selection of the dashed cycle in AG(A,B) adds to the matching M in GS(A,B) the edges connecting gene 1 to gene 4 and gene 2 to gene 5. After this selection, although the matching M is not yet maximal, there are no more consistent cycles in AG(A,B). Observe that in GS(A,B) gene 6 is unsaturated and its single neighbor, gene 2, is already saturated. Since gene 6 can no longer be saturated by M, it is a disposable gene and is deleted from AG(A,B), resulting in AGσ′(A,B), where a new consistent cycle appears. The selection of this new cycle adds to the matching M the edge connecting gene 3 to gene 7. Both AG(A,B) and AGσ′(A,B) have a simplified representation, in which the edge weights, as well as two of the four null edges of the capping, are omitted. Furthermore, for the sake of clarity, in this simplified representation each edge has a label describing the extremities connected by it.
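The disposable-gene test described in the Fig. 5 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `disposable_genes` is hypothetical, and the toy graph contains only the edges named in the caption (1-4, 2-5, 2-6, 3-7); the actual figure may contain more.

```python
def disposable_genes(neighbors, matching):
    """Return the genes that are unsaturated by `matching` and whose
    neighbors in the gene similarity graph are all already saturated,
    so that they can no longer be saturated."""
    saturated = set()
    for u, v in matching:
        saturated.update((u, v))
    return {
        gene
        for gene, nbrs in neighbors.items()
        # A gene without any neighbor also qualifies: it can never be matched.
        if gene not in saturated and all(n in saturated for n in nbrs)
    }


# Hypothetical adjacency lists built from the edges named in the caption.
toy_gs = {
    1: {4}, 2: {5, 6}, 3: {7},
    4: {1}, 5: {2}, 6: {2}, 7: {3},
}
# Matching after selecting the first cycle: gene 1 to gene 4, gene 2 to gene 5.
toy_matching = [(1, 4), (2, 5)]
found = disposable_genes(toy_gs, toy_matching)
```

Here `found` contains only gene 6: genes 3 and 7 are also unsaturated, but each still has an unsaturated neighbor and can therefore be matched later.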
Results of experiments for simulated genomes
| | ILP: Time (s) | ILP: Not finished | ILP: Gap (%) | Gap (%) | Gap (%) | Gap (%) | Gap (%) |
|---|---|---|---|---|---|---|---|
| 25 genes, | 19.50 | 0 | – | 16.26 | 5.03 | 5.84 | 5.97 |
| 25 genes, | 84.60 | 2 | 69.21 | 58.69 | 30.77 | 43.57 | 43.00 |
| 25 genes, | 49.72 | 0 | – | 81.39 | 43.83 | 55.38 | 55.38 |
| 50 genes, | 265.23 | 12 | 23.26 | 63.02 | 24.76 | 27.86 | 26.94 |
| 50 genes, | 463.50 | 29 | 38.12 | 123.71 | 65.41 | 66.52 | 64.78 |
| 50 genes, | 330.88 | 29 | 259.72 | 281.70 | 177.58 | 206.60 | 206.31 |
Results of experiments for 10 simulated genomes (45 pairwise comparisons) with smaller PAM distance
| | ILP: Time (s) | ILP: Not finished | ILP: Gap (%) | Gap (%) | Gap (%) | Gap (%) | Gap (%) |
|---|---|---|---|---|---|---|---|
| 50 genes, | 840.59 | 41 | 329.53 | 415.57 | 163.00 | 172.02 | 168.58 |
Results for heuristics on real genomes
| | Smaller genome | Matching size | Matching size | Matching size | Matching size | Time (s) | Time (s) | Time (s) | Time (s) | Similarity | Similarity | Similarity | Similarity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Human/mouse | 696 | 643 | 643 | 643 | 643 | 0.07 | 19.6 | 0.1 | 8.6 | 404.56 | 420.64 | 421.48 | 420.72 |
| Human/rat | 672 | 613 | 611 | 611 | 612 | 0.05 | 11.6 | 0.04 | 3.3 | 358.36 | 374.17 | 374.27 | 373.82 |
| Mouse/rat | 746 | 690 | 689 | 689 | 689 | 0.17 | 0.18 | 0.13 | 0.18 | 481.53 | 500.59 | 500.57 | 500.36 |
The "Smaller genome" column shows, for each pair of genomes, the number of genes in the smaller one, which is an upper bound for the matching size. Heuristics are represented by their initials (e.g. Greedy-Length = GL).