Yves-Pol Deniélou, Marie-France Sagot, Frédéric Boyer, Alain Viari.
Abstract
BACKGROUND: The automatic identification of syntenies across multiple species is a key step in comparative genomics that helps biologists shed light on both evolutionary and functional problems.
Year: 2011 PMID: 21605461 PMCID: PMC3121647 DOI: 10.1186/1471-2105-12-193
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Whole-genome alignment methods in the literature
| edition operations | |||||||
|---|---|---|---|---|---|---|---|
| GRIMM | pairwise | one-to-one | yes | yes | yes | minimise evolutionary distance | [ |
| C | pairwise | many-to-many | yes | yes | yes | minimise evolutionary distance | [ |
| U | pairwise | one-to-one | yes | no | no | find common intervals | [ |
| H | multiple | one-to-one | yes | no | no | find common intervals | [ |
| D | pairwise | many-to-many | yes | yes | no | find common intervals | [ |
| G | multiple | equivalence | yes | no | yes | divide and conquer | [ |
| H | pairwise | equivalence | yes | yes | yes | divide and conquer | [ |
| D | multiple | equivalence | yes | yes | yes | divide and conquer | [ |
| M | multiple | equivalence | yes | yes | yes | divide and conquer | [ |
| MCP | pairwise | equivalence | yes | yes | yes | divide and conquer | [ |
| MCM | multiple | equivalence | yes | yes | yes | divide and conquer | [ |
| C3P | multiple | many-to-many | yes | yes | yes | partition the | [ |
| FISH | pairwise | many-to-many | local | no | yes | dynamic programming | [ |
| DAG | pairwise | many-to-many | no | no | yes | dynamic programming | [ |
| C | pairwise | many-to-many | no | no | yes | dynamic programming | [ |
| S | multiple | many-to-many | no | yes | yes | dynamic programming on POG | [ |
| C | multiple | many-to-many | no | yes | yes | same + phylogeny | [ |
| ADH | pairwise | many-to-many | no | tandem | yes | clustering | [ |
| | multiple | many-to-many | no | tandem | yes | greedy heuristic | [ |
Principal approaches for whole-genome alignment and the search for syntenies. Note that the real goal of GRIMM and CINTENY is to reconstruct evolutionary scenarios; the extraction of syntenies is only a by-product of those algorithms. Note also that DOMAIN TEAMS uses a protein domain granularity, whereas all other methods operate at the gene level. Among the multiple comparison tools, MCGS (and its extension MCMUSEC), SYNTENATOR (and its extension CYNTENATOR) and I-ADHORE can handle (directly or indirectly) a gene quorum.
Figure 1. Example of Network Alignment Multigraph. A simple example of a layered data graph (top) and the network alignment multigraph, NAM (bottom). The layered data graph represents three genomes (blue, red and green). Vertices represent genes and coloured edges represent strict gene adjacency along each genome (there are no gap edges in this example). The inter-genomic gene-to-gene correspondence relation S is represented by black dotted edges (notice that S is neither one-to-one nor transitive). If we choose to associate genes that form cliques of S (other choices are possible, see text), then the corresponding network alignment multigraph (NAM) is the one displayed at the bottom. The vertices of the NAM are 3-tuples (cliques) of genes, also called spines. The coloured edges between spines correspond to the original edges in the layered data graph. For instance, (a1, a2, a3) is red-connected to (b1, b2, b3) because a2 is connected to b2 in the red layer of the layered data graph. The same is true for (b1, b2, b3) and (c1, c2, c3), since b2 and c2 are connected in the red layer. Conversely, (a1, a2, a3) is not blue-connected to (b1, b2, b3), since one gap gene (d1) on the blue genome separates a1 from b1 (see text on how to introduce gaps). Syntons are the sets of spines that are connected for all colours. They form a partition of the PNAM vertices. In this case there are 2 syntons: {(a1, a2, a3)} and {(b1, a2, a3), (b1, b2, b3), (c1, c2, c3)}.
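The definition of syntons as sets of spines connected for all colours can be illustrated with a minimal Python sketch. It uses naive partition refinement, which is not necessarily the algorithm used by the authors: starting from all spines in one block, each block is repeatedly split into its connected components, one colour at a time, until no block splits any more. The function names `components` and `syntons`, the spine labels `s1` to `s4`, and the edge lists are hypothetical and do not reproduce the graph of Figure 1.

```python
from collections import defaultdict

def components(vertices, edges):
    """Connected components of the subgraph induced by `vertices`."""
    vertices = set(vertices)
    adj = defaultdict(set)
    for u, v in edges:
        # Keep only edges with both endpoints inside the current block.
        if u in vertices and v in vertices:
            adj[u].add(v)
            adj[v].add(u)
    seen, comps = set(), []
    for start in vertices:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:
            u = stack.pop()
            comp.add(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        comps.append(frozenset(comp))
    return comps

def syntons(spines, edges_by_colour):
    """Split the spines into blocks that are connected in every colour."""
    parts = [frozenset(spines)]
    changed = True
    while changed:
        changed = False
        for edges in edges_by_colour.values():
            refined = []
            for part in parts:
                comps = components(part, edges)
                if len(comps) > 1:
                    changed = True
                refined.extend(comps)
            parts = refined
    return set(parts)

# Hypothetical toy multigraph: s1 has no blue edge, so it ends up alone.
toy_edges = {
    "blue": [("s2", "s3"), ("s3", "s4")],
    "red": [("s1", "s2"), ("s2", "s3"), ("s3", "s4")],
    "green": [("s1", "s2"), ("s2", "s3"), ("s2", "s4")],
}
toy_syntons = syntons(["s1", "s2", "s3", "s4"], toy_edges)
```

In this toy case the result contains two blocks, one singleton and one of size 3, mirroring the shape of the two syntons described in the caption.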
Figure 2. Example of Partial Network Alignment Multigraph. A simple example of a layered data graph (top) and the partial network alignment multigraph, PNAM (bottom). As in Figure 1, the S gene-to-gene relation is represented by dotted edges. Vertices of the PNAM (spines) correspond to cliques of the S relation. The difference with a NAM (Figure 1) is that "don't care" genes (represented as *) are now allowed in spines (here we have a quorum q = 2, which means we cannot have more than one * in a spine). The set of vertices of this PNAM can be partitioned into four syntons. Three of them are singletons: {(a1, a2, a3)}, {(a1, *, a3)} and {(*, a2, a3)}; the fourth is of size 2: {(a1, a2, *), (b1, b2, *)}. Only two of them ({(a1, a2, a3)} and {(a1, a2, *), (b1, b2, *)}) are maximal.
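The effect of the quorum on spines can be sketched in a few lines of Python. The helper `partial_spines` is hypothetical: it only masks positions of one full spine with `*`, keeping at least `quorum` real genes, whereas an actual PNAM may also contain partial cliques of S that do not extend to a full spine (such as (b1, b2, *) in the caption).

```python
from itertools import combinations

def partial_spines(spine, quorum):
    """Variants of `spine` with '*' in some positions and at least `quorum` real genes."""
    k = len(spine)
    variants = []
    # With k genomes and quorum q, at most k - q positions may be "don't care".
    for n_stars in range(0, k - quorum + 1):
        for masked in combinations(range(k), n_stars):
            variants.append(
                tuple("*" if i in masked else gene for i, gene in enumerate(spine))
            )
    return variants

variants_q2 = partial_spines(("a1", "a2", "a3"), quorum=2)
variants_q3 = partial_spines(("a1", "a2", "a3"), quorum=3)
```

With three genomes and q = 2, this yields the full spine plus the three spines that contain exactly one `*`; with q = 3 only the full spine remains, which corresponds to the NAM case of Figure 1.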
The four groups of bacterial species used in this study
| acnum | Species | Mb | NbGenes |
|---|---|---|---|
| bacteria | Bacteria | | |
| AE000512_GR | Thermotoga maritima (strain JCM 10099/DSM 3109) | 1.9 | 1853 |
| AE009951_GR | Fusobacterium nucleatum nucleatum (strain JCM 8532) | 2.2 | 2069 |
| AL009126_GR | Bacillus subtilis (strain 168) | 4.2 | 4237 |
| BA000022_GR | Synechocystis sp. (strain PCC 6803) | 3.6 | 3166 |
| U00096_GR | Escherichia coli (strain K12) | 4.6 | 4320 |
| AE000520_GR | Treponema pallidum (strain Nichols) | 1.1 | 1028 |
| AE001273_GR | Chlamydia trachomatis (strain D/UW-3/Cx) | 1.0 | 895 |
| AM398681_GR | Flavobacterium psychrophilum (strain JIP02/86) | 2.8 | 2432 |
| BX248353_GR | Corynebacterium diphtheriae (strain NCTC 13129) | 2.5 | 2317 |
| CP000359_GR | Deinococcus geothermalis (strain DSM 11300) | 2.5 | 2330 |
| AE005673_GR | Caulobacter crescentus (strain CB15/ATCC 19089) | 4.0 | 3738 |
| AE016825_GR | Chromobacterium violaceum (strain IFO 12614) | 4.7 | 4407 |
| AE017282_GR | Methylococcus capsulatus (strain Bath/NCIMB 11132) | 3.3 | 2960 |
| CP000661_GR | Rhodobacter sphaeroides (strain ATCC 17025) | 3.2 | 3111 |
| U00096_GR | Escherichia coli (strain K12) | 4.6 | 4320 |
| CP000112_GR | Desulfovibrio desulfuricans (strain G20) | 3.7 | 3775 |
| AE001439_GR | Helicobacter pylori (strain J99) | 1.6 | 1491 |
| CP000251_GR | Anaeromyxobacter dehalogenans (strain 2CP-C) | 5.0 | 4346 |
| CP000744_GR | Pseudomonas aeruginosa (strain PA7) | 6.6 | 6286 |
| CP000814_GR | Campylobacter jejuni (subsp. jejuni, serovar O:6) | 1.6 | 1626 |
| AE017282_GR | Methylococcus capsulatus (strain Bath/NCIMB 11132) | 3.3 | 2960 |
| CP000127_GR | Nitrosococcus oceani (strain ATCC 19707/NCIMB 11848) | 3.5 | 2976 |
| CP000462_GR | Aeromonas hydrophila (subsp. hydrophila, ATCC 7966) | 4.7 | 4122 |
| CP000744_GR | Pseudomonas aeruginosa (strain PA7) | 6.6 | 6286 |
| U00096_GR | Escherichia coli (strain K12) | 4.6 | 4320 |
| CP000675_GR | Legionella pneumophila (strain Corby) | 3.6 | 3204 |
| CP000681_GR | Shewanella putrefaciens (strain CN-32/ATCC BAA-453) | 4.7 | 3972 |
| CP001091_GR | Actinobacillus pleuropneumoniae (serovar 7, AP6/AP76) | 2.3 | 2131 |
| CP001132_GR | Acidithiobacillus ferrooxidans (strain ATCC 53993) | 2.9 | 2826 |
| AM920689_GR | Xanthomonas campestris (pathovar campestris) | 5.1 | 4510 |
| AE006468_GR | Salmonella typhimurium (strain ATCC 700720) | 4.9 | 4455 |
| AE009952_GR | Yersinia pestis (biovar Mediaevalis, strain KIM5) | 4.6 | 4104 |
| CP000822_GR | Citrobacter koseri (strain ATCC BAA-895) | 4.7 | 5003 |
| CP000964_GR | Klebsiella pneumoniae (strain 342) | 5.6 | 5425 |
| U00096_GR | Escherichia coli (strain K12) | 4.6 | 4320 |
| AE005674_GR | Shigella flexneri (serovar 2a, strain 301) | 4.6 | 4395 |
| AP008232_GR | Sodalis glossinidius (strain morsitans) | 4.2 | 2432 |
| BX470251_GR | Photorhabdus luminescens laumondii (strain TT01) | 5.7 | 4897 |
| BX950851_GR | Erwinia carotovora (subsp. atroseptica, ATCC BAA-672) | 5.1 | 4491 |
| CP000653_GR | Enterobacter sp. (strain 638) | 4.5 | 4115 |
Detail of the four groups of bacterial species at various phylogenetic distances used in this study. Each group is composed of 5 (first 5 lines) or 10 species. acnum: accession number in the EMBL/Genome Reviews databank; Mb: size of the genome in Mb; NbGenes: number of protein-encoding genes in the genome.
Comparison of execution times and results of I-ADHORE and OTFQ
| OTFQ | ||||
|---|---|---|---|---|
| Set | execution time (s) | |||
| bacteria | 5 | 36 | 20 | |
| 10 | 122 | 55 | ||
| proteo | 5 | 110 | 28 | |
| 10 | 428 | 197 | ||
| gamma | 5 | 232 | 42 | |
| 10 | 825 | 373 | ||
| entero | 5 | 632 | 273 | |
| 10 | 2881 | |||
| bacteria | 5 | 460 (3%)1 | 572 | 457 (99%)2 |
| 10 | 831 (3%) | 919 | 756 (91%) | |
| proteo | 5 | 2012 (11%) | 2449 | 1991 (99%) |
| 10 | 4117 (11%) | 4935 | 4088 (99%) | |
| gamma | 5 | 4266 (21%) | 4777 | 4246 (100%) |
| 10 | 8005 (22%) | 9039 | 7977 (100%) | |
| entero | 5 | 15376 (66%) | 15792 | 15355 (100%) |
| 10 | 29306 (66%) | |||
Comparison of execution times (top) and results (bottom) of I-ADHORE and OTFQ for the four groups of 5 and 10 species in Table 1.
1 number of genes found/total number of genes
2 size of ∩ /number of genes found by I-ADHORE
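The two footnotes define simple ratios, which can be checked against the first results row (bacteria, 5 species). The sketch below rests on two assumptions that the table does not state: that the "total number of genes" is the sum of the NbGenes values of the first five genomes of the bacteria group, and that 460 and 457 are the I-ADHORE gene count and the intersection size, respectively. The variable and function names are illustrative only.

```python
# NbGenes of the first five genomes of the "bacteria" group (species table).
nb_genes_bacteria5 = [1853, 2069, 4237, 3166, 4320]

# Values read from the first results row (assumed column meaning, see above).
found_iadhore = 460   # genes found by I-ADHORE
intersection = 457    # genes in the intersection of both result sets

def percent(part, whole):
    """Ratio expressed as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)

# Footnote 1: number of genes found / total number of genes.
coverage = percent(found_iadhore, sum(nb_genes_bacteria5))

# Footnote 2: size of the intersection / number of genes found by I-ADHORE.
overlap = percent(intersection, found_iadhore)

print(coverage, overlap)
```

Under these assumptions the computed values agree with the 3% and 99% printed in that row.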
Figure 3. First Example of Operon with Permutations. The glycogen biosynthesis/degradation operon in 5 bacterial genomes. Connected components of the S relation are represented by coloured lines.
Figure 4. Second Example of Operon with Permutations. The biotin biosynthesis operon in 5 bacterial genomes. Connected components of the S relation are represented by coloured lines.
Figure 5. Distribution of d. Distribution of d within each group of max(size of the maximal synton in which a given pair of orthologs appears). The black curve corresponds to the median of d on the whole set, whereas the coloured curves correspond to the analysis split by pairs of species. We observe in all cases a clear tendency for d to decrease when the synteny size increases.