Matteo Comin, Davide Verzotto.
Abstract
BACKGROUND: With the progress of modern sequencing technologies, a large number of complete genomes are now available. Traditionally, two related genomes are compared by sequence alignment. There are cases where alignment cannot be applied: for example, when two genomes do not share the same set of genes, or when they cannot be aligned to each other because of low sequence similarity, rearrangements and inversions, or, more specifically, because of their lengths when the organisms belong to different species. In these cases, complete genomes can be compared only with ad hoc methods, usually called alignment-free methods.
Year: 2012 PMID: 23216990 PMCID: PMC3549825 DOI: 10.1186/1748-7188-7-34
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Example of matching statistics for the strings ACACGTAC and TACGTGTA
| Position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| ACACGTAC against TACGTGTA | 2 | 1 | 4 | 3 | 3 | 3 | 2 | 1 |
| TACGTGTA against ACACGTAC | 3 | 4 | 3 | 2 | 1 | 3 | 2 | 1 |
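The values in the table above can be reproduced with a few lines of Python. This is a naive, quadratic-time sketch for illustration only, not the authors' implementation (efficient versions are usually built on suffix trees or suffix arrays). The function name `matching_statistics` is a hypothetical helper introduced here.

```python
def matching_statistics(s: str, t: str) -> list[int]:
    """For each position i of s, return the length of the longest
    substring of s starting at i that also occurs somewhere in t."""
    stats = []
    for i in range(len(s)):
        length = 0
        # Extend the match one character at a time while it still occurs in t.
        while i + length < len(s) and s[i:i + length + 1] in t:
            length += 1
        stats.append(length)
    return stats


print(matching_statistics("ACACGTAC", "TACGTGTA"))  # [2, 1, 4, 3, 3, 3, 2, 1]
print(matching_statistics("TACGTGTA", "ACACGTAC"))  # [3, 4, 3, 2, 1, 3, 2, 1]
```

The two printed lists correspond to the two rows of the table, one for each direction of the comparison.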
Benchmark for prokaryotes – Archaea & Bacteria domains
| Organism and strain | Genome size |
| Aeropyrum pernix str. K1 | 1.7 Mbp |
| Archaeoglobus fulgidus str. DSM 4304 | 2.2 Mbp |
| Methanopyrus kandleri str. AV19 | 1.7 Mbp |
| Methanosarcina acetivorans str. C2A | 5.8 Mbp |
| Pyrobaculum aerophilum str. IM2 | 2.3 Mbp |
| Pyrococcus abyssi | 1.8 Mbp |
| Pyrococcus furiosus str. DSM 3638 | 1.9 Mbp |
| Treponema pallidum sp. pall. str. Nichols | 1.2 Mbp |
| Bacillus anthracis str. Sterne | 5.3 Mbp |
| Bacillus subtilis subsp. subtilis str. 168 | 4.3 Mbp |
| Buchnera aphidicola str. Sg | 651 kbp |
| Campylobacter jejuni sp. jej. str. NCTC 11168 | 1.7 Mbp |
| Chlamydia muridarum str. MoPn/Wiess-Nigg | 1.1 Mbp |
| Chlamydia trachomatis str. L2/434/Bu | 1.1 Mbp |
| Coxiella burnetii str. RSA 493 | 2.0 Mbp |
| Desulfovibrio vulgaris sp. vulg. str. Hildenb. | 3.6 Mbp |
| Haemophilus influenzae str. Rd KW20 | 1.9 Mbp |
| Nostoc punctiforme str. PCC 73102 | 8.4 Mbp |
Prokaryotic taxa used in our experiments, divided by domain. For each entity, we list the accession number in the NCBI genome database, the complete name and strain, and the genome size.
Plasmodium are parasites known as causative agents of malaria in different hosts and geographic regions
| Species | Host | Region | Genome size |
| P. berghei | rodent | Africa | 18.5 Mbp |
| P. chabaudi | rodent | Africa | 18.8 Mbp |
| P. falciparum | human | Africa, Asia & S./C. America | 23.3 Mbp |
| P. knowlesi | macaque | Southeast Asia | 23.7 Mbp |
| P. vivax | human | Africa, Asia & S./C. America | 22.6 Mbp |
The right-most column lists the size of each complete DNA genome.
Comparison of whole-genome phylogeny reconstructions
| Influenza A | Viruses | 84/102 | 100/102 | 96/102 | |
| Prokaryotes | 6/10 | 6/10 | |||
| Prokaryotes | 10/14 | 10/14 | |||
| Arch. & Bact. | Prokaryotes | 22/30 | 22/30 | ||
| Plasmodium | Eukaryotes | 4/4 |
Normalized Robinson-Foulds scores with the corresponding reference tree. For each dataset the best results are in bold.
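As an illustration of how a score such as 22/30 can arise, the sketch below counts the bipartitions that appear in only one of two trees and reports that count against the maximum 2(n − 3) for binary unrooted trees with n leaves. This maximum is consistent with the denominators above (4 for the five Plasmodium genomes, 30 for the 18 prokaryotes). The nested-tuple tree encoding and all function names are assumptions made for this example and are not taken from the paper.

```python
def leaf_sets(tree):
    """Return (leaves below this node, leaf sets of all internal nodes)."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, clades = frozenset(), []
    for child in tree:
        child_leaves, child_clades = leaf_sets(child)
        leaves = leaves | child_leaves
        clades.extend(child_clades)
    clades.append(leaves)
    return leaves, clades


def splits(tree):
    """Return (all leaves, set of non-trivial bipartitions of the tree)."""
    all_leaves, clades = leaf_sets(tree)
    anchor = min(all_leaves)  # fixed leaf used to pick a canonical side
    result = set()
    for clade in clades:
        # Always store the side of the split that does NOT contain the anchor.
        side = all_leaves - clade if anchor in clade else clade
        # Skip trivial splits (a single leaf against the rest, or no split).
        if 2 <= len(side) <= len(all_leaves) - 2:
            result.add(side)
    return all_leaves, result


def robinson_foulds(tree_a, tree_b):
    """Return (differing splits, maximum 2(n - 3)) for two binary trees."""
    leaves_a, splits_a = splits(tree_a)
    leaves_b, splits_b = splits(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("both trees must have the same leaf set")
    return len(splits_a ^ splits_b), 2 * (len(leaves_a) - 3)


# Hypothetical five-leaf trees that disagree on one internal branch.
tree_a = ((("A", "B"), "C"), ("D", "E"))
tree_b = ((("A", "C"), "B"), ("D", "E"))
print(robinson_foulds(tree_a, tree_b))  # (2, 4)
```

Returning the pair rather than a float keeps the result in the same "differences/maximum" form used in the table.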
Figure 1. Whole-genome phylogeny of the 2009 world pandemic Influenza A (H1N1) generated by UA. The two main clades are shown in green and red; the green Mexico/4108 is probably the closest isolate to the origin of the influenza. Two of the possible early evolutions of the viral disease are shown in blue and orange. The organisms that, according to the literature, do not fall into one of the two main clades are in black.
Figure 2. Whole-genome phylogeny of prokaryotes by UA. In red are the branches of the Archaea domain, while in green are those of the Bacteria domain. Clusters of other organisms are highlighted with different colors. Only two organisms do not fall into the correct clade: Methanosarcina acetivorans – archaea (in cyan) and Desulfovibrio vulgaris subspecies vulgaris – bacteria (in black).
Figure 3. Whole-genome phylogeny of the genus Plasmodium by UA, with our whole-genome distance highlighted on the branches.
Comparison of whole-genome phylogeny of influenza virus
| | Reference | UA | ACS | FFP | FFP |
| Reference | 0.0 | 0.60 | 0.63 | 0.86 | 0.88 |
| UA | **0.60** | 0.0 | 0.30 | 0.81 | 0.74 |
| ACS | 0.63 | 0.30 | 0.0 | 0.83 | 0.81 |
| FFP | 0.86 | 0.81 | 0.83 | 0.0 | 0.73 |
| FFP | 0.88 | 0.74 | 0.81 | 0.73 | 0.0 |
Normalized triplet distance between all trees. The best results are in bold.
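To make the measure concrete, the sketch below computes a normalized triplet distance between two rooted trees: for every set of three leaves it checks which pair is grouped together in each tree, and it reports the fraction of triples on which the trees disagree. The nested-tuple encoding, the function names, and the treatment of the trees as rooted are assumptions made for this example; the paper's own tooling may differ.

```python
from itertools import combinations


def clades(tree):
    """Return (all leaves, leaf sets of every internal node) of a rooted tree."""
    found = []

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(visit(child) for child in node))
        found.append(leaves)
        return leaves

    return visit(tree), found


def triplet_topology(clade_list, a, b, c):
    """Return the pair grouped together apart from the third leaf, or None."""
    for x, y, z in ((a, b, c), (a, c, b), (b, c, a)):
        # x and y are siblings relative to z if some clade holds x, y but not z.
        if any(x in cl and y in cl and z not in cl for cl in clade_list):
            return frozenset([x, y])
    return None  # unresolved triple (only possible in non-binary trees)


def normalized_triplet_distance(tree_a, tree_b):
    """Fraction of leaf triples whose topology differs between the trees."""
    leaves_a, clades_a = clades(tree_a)
    leaves_b, clades_b = clades(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("both trees must have the same leaf set")
    triples = list(combinations(sorted(leaves_a), 3))
    differing = sum(
        triplet_topology(clades_a, *t) != triplet_topology(clades_b, *t)
        for t in triples
    )
    return differing / len(triples)


# Hypothetical five-leaf trees: only the triple {A, B, C} is resolved differently.
tree_a = ((("A", "B"), "C"), ("D", "E"))
tree_b = ((("A", "C"), "B"), ("D", "E"))
print(normalized_triplet_distance(tree_a, tree_b))  # 0.1
```

With five leaves there are ten triples, so every value is a multiple of 0.1.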
Comparison of whole-genome phylogeny of prokaryotes
| | Reference | UA | ACS | FFP | FFP |
| Reference | 0.0 | 0.24 | 0.37 | 0.62 | 0.39 |
| UA | **0.24** | 0.0 | 0.37 | 0.55 | 0.47 |
| ACS | 0.37 | 0.37 | 0.0 | 0.59 | 0.48 |
| FFP | 0.62 | 0.55 | 0.59 | 0.0 | 0.57 |
| FFP | 0.39 | 0.47 | 0.48 | 0.57 | 0.0 |
Normalized triplet distance between all trees. The best results are in bold.
Comparison of whole-genome phylogeny of Plasmodium
| | Reference | UA | ACS | FFP | FFP |
| Reference | 0.0 | 0.0 | 0.0 | 0.4 | 0.0 |
| UA | **0.0** | 0.0 | 0.0 | 0.3 | 0.0 |
| ACS | **0.0** | 0.0 | 0.0 | 0.3 | 0.0 |
| FFP | 0.4 | 0.3 | 0.0 | 0.0 | 0.3 |
| FFP | **0.0** | 0.0 | 0.0 | 0.3 | 0.0 |
Normalized triplet distance between all trees. The best results are in bold.
Main statistics for the underlying approach averaged over all experiments
| | Influenza A | Prokaryotes | Plasmodium |
| Min genome size | 12,976 b | 650 kbp | 18,524 kbp |
| Max genome size | 13,611 b | 8,350 kbp | 23,730 kbp |
| Average genome size | 13,230 b | 2,700 kbp | 21,380 kbp |
| Irredundants | 3,722 | 3,167 k | 16,354 k |
| Underlying subwords | 60 | 112 k | 706 k |
| Min subword length | 6 | 10 | 12 |
| Max subword length | 1,615 | 25 | 266 |
| Average subword length | 264 | 14 | 20 |
| Untied inversions | 28% | 31% | 33% |
| Untied complements | 22% | 20% | 19% |