Jennifer Becq, Cécile Churlaud, Patrick Deschavanne.
Abstract
Horizontal gene transfer (HGT) appears to be important for the evolution of prokaryotic species. As a consequence, numerous parametric methods, which use only the information embedded in the genomes, have been designed to detect HGTs. Many reports have described incongruent results when different methods are applied to the same genomes. Artificial genomes in which all HGT parameters are controlled allow different methods to be tested under the same conditions. This benchmark of 16 representative parametric methods showed a wide range of efficiencies. Some methods perform very poorly whatever the type of HGT, and some depend on the conditions or on the metric used. In terms of total errors, the best window methods were those using tetranucleotides as the criterion, and the best gene-based methods were those using codon usage with the Kullback-Leibler divergence metric. Window methods are very sensitive but less specific, and they detect isolated single genes poorly. Gene-based methods, on the other hand, are often very specific but lack sensitivity. We propose using two methods in combination to get the best of each category: a gene-based one for specificity and a window-based one for sensitivity.
Year: 2010 PMID: 20376325 PMCID: PMC2848678 DOI: 10.1371/journal.pone.0009989
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Classification of the 6 gamma proteobacteria and the artificial genomes used in this study according to their distance to artificial E. coli.
| Species | GC% | Group | Distance* (AU) | Code | Color |
| | 50.8 | Gamma-Proteobacteria | - | Ecol | Grey |
| | 51.9 | Gamma-Proteobacteria | 84 | Sent | Light green |
| | 53 | Gamma-Proteobacteria | 103 | Epyr | Light blue |
| | 55 | Gamma-Proteobacteria | 132 | Spro | Green |
| | 47.7 | Gamma-Proteobacteria | 152 | Ypes | Red |
| | 47.5 | Gamma-Proteobacteria | 192 | Vcho | Light red |
| | 56.9 | Gamma-Proteobacteria | 194 | Kpne | Blue |
| | 52.7 | Beta-Proteobacteria | 247 | Ngon | Light blue |
| | 43.5 | Firmicute | 274 | Bsub | Dark blue |
| | 47.4 | Cyanobacteria | 294 | Ssyn | Cyan |
| | 48.1 | Archaea | 332 | Aful | Dark green |
| | 38.1 | Gamma-Proteobacteria | 385 | Hinf | Light green |
| | 62.2 | Alpha-Proteobacteria | 397 | Smel | Green |
| | 46.2 | Thermotogale | 402 | Tmar | Pink |
| | 67.0 | Deinococci | 463 | Drad | Brown |
| | 67.0 | Beta-Proteobacteria | 486 | Rsol | Fuchsia |
| | 31.3 | Archaea | 618 | Mjan | Orange |
*Distances are calculated using the Euclidean metric between the frequencies of the 256 tetranucleotides of each genome. The color code corresponds to the one used in Supplementary
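The distance described in the footnote can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `tetra_freqs` and `euclidean` are hypothetical, words are counted on a single strand with overlap, and the scaling that yields the arbitrary units (AU) in the table is not given in this section, so the values will not match the table.

```python
from itertools import product
import math

# All 256 possible tetranucleotides, in a fixed order.
TETRAS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_freqs(seq):
    """Relative frequency of each tetranucleotide (overlapping words, one strand).

    Assumption: words containing symbols other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    counts = dict.fromkeys(TETRAS, 0)
    for i in range(len(seq) - 3):
        word = seq[i:i + 4]
        if word in counts:
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in TETRAS]


def euclidean(f, g):
    """Euclidean distance between two frequency vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)))


if __name__ == "__main__":
    host = "ACGTTGCAAGCTTAGC" * 50      # toy stand-in for a host genome
    donor = "GGGCCCGGGCATGCCC" * 50     # toy stand-in for a donor genome
    print(euclidean(tetra_freqs(host), tetra_freqs(donor)))
```

Ordering the donor genomes by this value reproduces the kind of ranking shown in the Distance column, from the closest genome to the farthest.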
The sixteen horizontal transfer detection methods analyzed in this paper.
| Name | References | Criteria | Genome scanning | Metric |
| GC.windows | | GC% | 20 kb windows, 5 kb step | Manhattan |
| GCtotal | | GC% | Genes | None |
| GC1-GC3 | | GC% in positions 1 and 3 of genes | Genes | None |
| dint5 | | Normalized dinucleotides | 5 kb windows, 5 kb step | Delta* |
| dint.di31T2 | | Normalized dinucleotides in position 3:1 of codons | Genes | Mahalanobis |
| CU.chi2 | | Codons | Genes | Chi2 |
| CU.karlin | | Codons | Genes | Delta* |
| CU.karlin.aa | | Amino acids | Genes | Delta* |
| CU.KL | | Codons | Genes | Kullback-Leibler |
| CU.mahalanobis | | Codons | Genes | Mahalanobis |
| oli.Pearson | | Normalized tetranucleotides | 5 kb windows, 1 kb step | Correlation |
| oli.covariance | “ | Normalized tetranucleotides | 5 kb windows, 1 kb step | Covariance |
| oli.chi2 | “ | Normalized tetranucleotides | 5 kb windows, 1 kb step | Chi2 |
| oli.mahalanobis | “ | Normalized tetranucleotides | 5 kb windows, 1 kb step | Mahalanobis |
| oli.KL | “ | Normalized tetranucleotides | 5 kb windows, 1 kb step | Kullback-Leibler |
| signature | | Tetranucleotides | 5 kb windows, 0.5 kb step | Euclidean |
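To make the "Genome scanning" and "Metric" columns concrete, here is a rough sketch of a window method in the spirit of oli.KL: 5 kb windows moved in 1 kb steps, each compared with the whole genome by Kullback-Leibler divergence. It is an illustration under stated assumptions, not the published implementation. The names `word_freqs`, `kl_divergence` and `scan_windows` are hypothetical, raw tetranucleotide frequencies with a pseudocount are used in place of the "normalized" frequencies (whose normalization is not described in this section), and the direction of the divergence (window relative to genome) is a choice made here.

```python
from collections import Counter
from itertools import product
import math

WORDS = ["".join(p) for p in product("ACGT", repeat=4)]


def word_freqs(seq, pseudocount=1.0):
    """Tetranucleotide frequencies with a pseudocount so no frequency is zero."""
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[w] for w in WORDS) + pseudocount * len(WORDS)
    return [(counts[w] + pseudocount) / total for w in WORDS]


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for strictly positive vectors."""
    return sum(a * math.log(a / b) for a, b in zip(p, q))


def scan_windows(genome, window=5000, step=1000):
    """Return (start, score) for each window; higher scores are more atypical."""
    genome = genome.upper()
    reference = word_freqs(genome)
    scores = []
    for start in range(0, len(genome) - window + 1, step):
        local = word_freqs(genome[start:start + window])
        scores.append((start, kl_divergence(local, reference)))
    return scores


if __name__ == "__main__":
    # Toy genome: a 2 kb block of foreign composition inside a uniform host.
    toy = "AC" * 5000 + "GT" * 1000 + "AC" * 5000
    for start, score in scan_windows(toy):
        print(start, round(score, 3))
```

A gene-based method would follow the same pattern but iterate over annotated gene coordinates instead of fixed windows, and a different metric from the table can be swapped in for `kl_divergence`.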
Mean performances of all the 16 methods with “standard” model genomes.
| Methods | Sensitivity (%) | Specificity (%) | Threshold* ± SD |
| GC.windows | 56.6 | 51.6 | 1.8± |
| GCtotal | 49.1 | 96.1 | 1.9±. |
| GC1-GC3 | 23.9 | 98.2 | 1.4±. |
| dint5 | 79.4 | 84.4 | 1.8±. |
| dint.di31T2 | 16.8 | 9.5 | 0.5±. |
| CU.chi2 | 1.3 | 100 | 4.0±. |
| CU.karlin | 62.2 | 73.4 | 1.1±. |
| CU.karlin.aa | 65.9 | 26 | 0.5± |
| CU.KL | 77.2 | 87.8 | 1.4±. |
| CU.mahalanobis | 3.9 | 79.8 | 3.6±. |
| oli.Pearson | 92.5 | 85.5 | 3.2±. |
| oli.covariance | 38.8 | 91.5 | 2.2±. |
| oli.chi2 | 93.8 | 87.1 | 3.9±. |
| oli.mahalanobis | 64.9 | 81.6 | 1.1±. |
| oli.KL | 91.5 | 89.2 | 3.6± |
| signature | 98 | 67.3 | 1.5±. |
*Threshold is the value of r (see Materials and Methods) giving optimal performance; the standard deviation of the optimal r over the 5 tested genomes is also given.
Figure 1. ROC-like curves of the 16 methods.
Each dot on a curve gives the type I error (100 - sensitivity) and the type II error (100 - specificity) for one value of r (see Materials and Methods). The best methods are those with the fewest errors, i.e. those closest to the origin.
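The construction of such a curve can be sketched in a few lines. This is an illustration only: the exact definition of r is in the Materials and Methods and is not reproduced in this section, so a plain score cutoff stands in for it here, and picking the point nearest the origin is one assumed way of choosing an optimum. The function names are hypothetical.

```python
import math


def error_rates(scores, is_transferred, cutoff):
    """Return (type I, type II) errors in percent, as defined in the caption.

    A region is predicted as transferred when its score exceeds the cutoff.
    Assumes at least one transferred and one native region are present.
    """
    tp = fn = tn = fp = 0
    for score, truth in zip(scores, is_transferred):
        predicted = score > cutoff
        if truth and predicted:
            tp += 1
        elif truth:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    return 100.0 - sensitivity, 100.0 - specificity


def roc_like_curve(scores, is_transferred, cutoffs):
    """One (cutoff, type I error, type II error) point per cutoff."""
    return [(c, *error_rates(scores, is_transferred, c)) for c in cutoffs]


def closest_to_origin(curve):
    """Point of the curve with the smallest Euclidean distance to (0, 0)."""
    return min(curve, key=lambda point: math.hypot(point[1], point[2]))


if __name__ == "__main__":
    scores = [0.1, 0.2, 0.3, 0.9, 0.8, 0.4]            # toy atypicality scores
    truth = [False, False, False, True, True, True]     # known in artificial genomes
    curve = roc_like_curve(scores, truth, [0.0, 0.35, 0.85])
    print(curve)
    print(closest_to_origin(curve))
```

Because the artificial genomes record which regions were transferred, the `is_transferred` labels are known exactly, which is what makes this kind of evaluation possible.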
Figure 2. Mean errors of 7 methods according to (A) origin, (B) overall quantity, (C) size and (D) recipient genome.
The mean error is the mean of the type I (100 - sensitivity) and type II (100 - specificity) errors. It is shown here for the 7 efficient HT detection methods, grouped by criterion (codon usage: CU.KL; dinucleotide frequencies: dint5; GC content: GCtotal and GC1-GC3; and tetranucleotide frequencies: oli.chi2, oli.KL and signature), according to four parameters. A: the origin. The single donor genomes of the HTs are ordered by their distance to the host genome (E. coli) in terms of tetranucleotide frequencies, with the closest on the left and the farthest on the right. B: the overall quantity of HTs as a percentage of the genome. C: the size of the HTs. Small, Medium, Large and Very Large mean 1 to 5 genes, 5 to 10 genes, 10 to 20 genes and 20 to 30 genes, respectively. D: the host genome, i.e. the genome receiving the HTs.
Sensitivity, specificity and mean performance of the methods with HTs originating from real gamma-proteobacteria.
| Method | Sensitivity | Specificity | Mean error |
| GCtotal | 5.32 | 100 | 47.34 |
| GC1-GC3 | 2.64 | 100 | 48.68 |
| CU.KL | 6.19 | 96.95 | 48.43 |
| dint5 | 39.66 | 77.73 | 41.3 |
| oli.chi2 | 72.82 | 84.21 | 21.48 |
| oli.KL | 61.01 | 70.12 | 34.44 |
| signature | 84.82 | 59.23 | 27.97 |
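The Mean error column can be recomputed from the other two columns, taking the type I error as 100 - sensitivity and the type II error as 100 - specificity, as in the Figure 1 caption. The short check below uses the values from the table above; the function name `mean_error` is hypothetical.

```python
def mean_error(sensitivity, specificity):
    """Mean of type I (100 - sensitivity) and type II (100 - specificity) errors."""
    type_i = 100.0 - sensitivity
    type_ii = 100.0 - specificity
    return (type_i + type_ii) / 2.0


# (sensitivity, specificity) pairs copied from the table above.
REAL_GAMMA = {
    "GCtotal": (5.32, 100.0),
    "GC1-GC3": (2.64, 100.0),
    "CU.KL": (6.19, 96.95),
    "dint5": (39.66, 77.73),
    "oli.chi2": (72.82, 84.21),
    "oli.KL": (61.01, 70.12),
    "signature": (84.82, 59.23),
}

if __name__ == "__main__":
    for name, (sens, spec) in REAL_GAMMA.items():
        print(f"{name}: {mean_error(sens, spec):.3f}")
```

The recomputed values agree with the published column to within rounding, which confirms that a lower mean error means a better method.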
Mean performance of the combination of 2 methods over the “standard” model genomes and over the “real” E. coli genomes.
| Method combination | Sensitivity | Specificity | Mean error |
| “Standard” model genomes | | | |
| CU.KL – oli.chi2 | 97.0 | 79.9 | 11.6 |
| CU.KL – oli.KL | 96.2 | 81.6 | 11.1 |
| CU.KL – signature | 99.4 | 63.2 | 18.7 |
| “Real” E. coli genomes | | | |
| CU.KL – oli.chi2 | 89.4 | 81.0 | 14.8 |
| CU.KL – oli.KL | 84.5 | 80.1 | 17.7 |
| CU.KL – signature | 97.2 | 69.4 | 16.7 |
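This section does not state how the two methods are combined. One possible reading, shown here purely as an assumption, is a union rule: a gene is called transferred if either the gene-based method or the window-based method flags it. The sketch assumes the window calls have already been mapped onto genes, and all names are hypothetical.

```python
def combine_union(gene_calls, window_calls):
    """Assumed rule: flag a gene if either method flags it."""
    return [g or w for g, w in zip(gene_calls, window_calls)]


def sensitivity_specificity(calls, truth):
    """Sensitivity and specificity in percent for per-gene boolean calls.

    Assumes at least one transferred and one native gene are present.
    """
    tp = sum(1 for c, t in zip(calls, truth) if c and t)
    fn = sum(1 for c, t in zip(calls, truth) if not c and t)
    tn = sum(1 for c, t in zip(calls, truth) if not c and not t)
    fp = sum(1 for c, t in zip(calls, truth) if c and not t)
    return 100.0 * tp / (tp + fn), 100.0 * tn / (tn + fp)


if __name__ == "__main__":
    truth = [True, True, False, False]           # toy ground truth per gene
    gene_calls = [True, False, False, False]     # specific but insensitive
    window_calls = [True, True, False, True]     # sensitive but less specific
    combined = combine_union(gene_calls, window_calls)
    print(sensitivity_specificity(gene_calls, truth))
    print(sensitivity_specificity(window_calls, truth))
    print(sensitivity_specificity(combined, truth))
```

Under a union rule, sensitivity can only stay the same or rise and specificity can only stay the same or fall relative to each single method at the same threshold. Not every row of the table follows that pattern when compared with the single-method results, so the published combination may use different thresholds or a different rule.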