Rajeev K Azad, Jeffrey G Lawrence.
Abstract
Most parametric methods for detecting foreign genes in bacterial genomes use a scoring function that measures the atypicality of a gene with respect to the bulk of the genome. Genes whose features are sufficiently atypical (lying beyond a threshold value) are deemed foreign. Yet these methods fail when the range of features of donor genomes overlaps with that of the recipient genome, leading to misclassification of foreign and native genes; existing parametric methods choose threshold parameters to balance these error rates. To circumvent this problem, we have developed a two-pronged approach to minimize the misclassification of genes. First, beyond classifying genes as merely atypical, a gene clustering method based on Jensen-Shannon entropic divergence identifies classes of foreign genes that are also similar to each other. Second, genome position is used to reassign genes among classes whose composition features overlap. This process minimizes the misclassification of either native or foreign genes that are weakly atypical. The performance of this approach was assessed using artificial chimeric genomes, and the approach was then applied to the well-characterized Escherichia coli K12 genome. Not only were foreign genes identified with a high degree of accuracy, but genes originating from the same donor organism were effectively grouped.
Year: 2007 PMID: 17591616 PMCID: PMC1950545 DOI: 10.1093/nar/gkm204
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
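The abstract names Jensen-Shannon entropic divergence as the measure used to compare genes. The following is a minimal sketch of that divergence between two codon frequency distributions, not the authors' implementation: the helper names (`distribution`, `entropy`, `js_divergence`), the equal default weights, and the use of plain codon frequencies are assumptions made for illustration, and the significance test that turns a divergence into the thresholds reported in the tables is not reproduced here.

```python
import math
from collections import Counter

def distribution(seq, k=3):
    """Relative frequencies of non-overlapping k-mers (codons when k=3)."""
    words = [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def entropy(p):
    """Shannon entropy in bits of a frequency dictionary."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

def js_divergence(p, q, w_p=0.5, w_q=0.5):
    """Jensen-Shannon divergence: H(w_p*p + w_q*q) - w_p*H(p) - w_q*H(q).

    The weights are an assumption; they must sum to 1.
    """
    keys = set(p) | set(q)
    mix = {k: w_p * p.get(k, 0.0) + w_q * q.get(k, 0.0) for k in keys}
    return entropy(mix) - w_p * entropy(p) - w_q * entropy(q)
```

With equal weights and base-2 logarithms the value lies between 0 (identical distributions) and 1 (distributions with no shared codons), so two genes with similar codon usage score near 0.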
Figure 1. (A) Foreign gene identification by common threshold approaches; native and foreign genes overlap in sequence features. (B) Foreign genes detected using a clustering approach. Genes from a single source may have features that overlap with features of genes from other sources, making unambiguous delineation difficult. (C) Positional information may be used to accurately classify weakly atypical genes. Misclassified genes may be correctly identified using positional information.
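Panel (C) describes using genome position to classify weakly atypical genes. One way such a step could work is sketched below, purely as an illustration: the neighbour-majority rule, the `window` parameter, and the `ambiguous` flags are all assumptions, and the record above does not specify the rule the authors actually used.

```python
from collections import Counter

def reassign_by_position(labels, ambiguous, window=2):
    """Hypothetical positional reassignment.

    labels:    class label of each gene, listed in genome order.
    ambiguous: True where the compositional evidence for a gene is weak.
    Each ambiguous gene takes the most common label among the unambiguous
    genes lying within `window` positions on either side; genes with no
    unambiguous neighbour keep their original label.
    """
    result = list(labels)
    n = len(labels)
    for i in range(n):
        if not ambiguous[i]:
            continue
        lo, hi = max(0, i - window), min(n, i + window + 1)
        votes = Counter(labels[j] for j in range(lo, hi)
                        if j != i and not ambiguous[j])
        if votes:
            result[i] = votes.most_common(1)[0][0]
    return result
```

Under this assumed rule, a weakly atypical gene flanked by confidently native genes would be relabelled native, and one embedded in a run of foreign genes would be relabelled foreign.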
Grouping genes in Artificial Genome I using JS entropic divergence, with codon usage bias as the discriminant criterion; values are averages of 10 trials
| Significance threshold | Number of classes | Number of genes in largest class | Percent of genes in largest class | Type I error (%) | Type II error (%) | Mean error (%) |
|---|---|---|---|---|---|---|
| 1 | 4000 | 1 | 0.025 | | | |
| 0.99 | 446.7 | 69 ± 9 | 1.7 ± 0.2 | na | na | na |
| 0.95 | 258.8 | 234 ± 26 | 5.8 ± 0.6 | 0.01 ± 0.02 | 73.5 ± 1.0 | 36.7 ± 0.5 |
| 0.9 | 185.4 ± 6.0 | 452 ± 35 | 11.3 ± 0.8 | 0.01 ± 0.03 | 71.9 ± 0.9 | 35.9 ± 0.4 |
| 0.8 | 123.1 ± 5.6 | 782 ± 39 | 19.5 ± 0.9 | 0.07 ± 0.06 | 69.1 ± 1.0 | 34.5 ± 0.5 |
| 0.7 | 90.5 ± 3.4 | 1203 ± 34 | 30.0 ± 0.8 | 0.2 ± 0.1 | 64.5 ± 1.1 | 32.3 ± 0.6 |
| 0.6 | 69.8 ± 4.5 | 1473 ± 34 | 36.8 ± 0.8 | 0.3 ± 0.2 | 60.7 ± 1.3 | 30.5 ± 0.7 |
| 0.5 | 56.0 ± 3.6 | 1712 ± 41 | 42.8 ± 1.0 | 0.5 ± 0.2 | 56.7 ± 1.1 | 28.6 ± 0.5 |
| 0.4 | 44.3 ± 3.3 | 1912 ± 37 | 47.8 ± 0.9 | 0.7 ± 0.2 | 52.6 ± 1.1 | 26.7 ± 0.5 |
| 0.3 | 35.2 ± 2.0 | 2114 ± 35 | 52.8 ± 0.8 | 1.0 ± 0.2 | 47.7 ± 1.2 | 24.4 ± 0.6 |
| 0.2 | 27.4 ± 2.2 | 2290 ± 36 | 57.2 ± 0.9 | 1.3 ± 0.2 | 42.6 ± 1.2 | 21.9 ± 0.6 |
| 0.1 | 22.0 ± 1.5 | 2462 ± 33 | 61.5 ± 0.8 | 2.0 ± 0.3 | 36.6 ± 1.4 | 19.3 ± 0.8 |
| 10⁻² | 15.4 ± 1.5 | 2724 ± 43 | 68.1 ± 1.0 | 4.2 ± 0.7 | 25.3 ± 1.4 | 14.8 ± 0.9 |
| 10⁻³ | 12.8 ± 1.2 | 2848 ± 34 | 71.2 ± 0.8 | 6.2 ± 1.4 | 19.0 ± 1.8 | 12.6 ± 0.9 |
| 10⁻⁴ | 11.6 ± 1.2 | 2934 ± 33 | 73.3 ± 0.8 | 8.2 ± 1.7 | 14.4 ± 0.8 | 11.3 ± 0.9 |
| 10⁻⁵ | 10.9 ± 1.1 | 2986 ± 35 | 74.6 ± 0.8 | 10.3 ± 2.2 | 12.0 ± 0.9 | 11.1 ± 1.2 |
| 10⁻⁶ | 11.0 ± 1.1 | 3020 ± 35 | 75.5 ± 0.8 | 12.2 ± 2.5 | 10.9 ± 0.9 | 11.6 ± 1.5 |
| 10⁻⁷ | 10.6 ± 1.0 | 3049 ± 39 | 76.2 ± 0.9 | 13.9 ± 3.0 | 9.9 ± 1.1 | 11.9 ± 1.7 |
| 10⁻⁸ | 10.0 ± 1.3 | 3079 ± 45 | 76.9 ± 1.1 | 16.3 ± 3.5 | 9.5 ± 1.3 | 12.9 ± 2.1 |
| 10⁻⁹ | 9.4 ± 0.9 | 3117 ± 52 | 77.9 ± 1.3 | 18.1 ± 3.7 | 7.7 ± 1.1 | 12.9 ± 2.0 |
| 10⁻¹¹ | 9.0 ± 0.7 | 3160 ± 54 | 79.0 ± 1.3 | 21.4 ± 4.1 | 6.9 ± 1.2 | 14.2 ± 2.1 |
| 0 | 1 | 4000 | 100 | | | |
na, not applicable; the largest clusters did not correspond to native genes.
Error rates of the methods for foreign gene detection
| Classification method | Genome I: threshold | Genome I: Type I error (%) | Genome I: Type II error (%) | Genome I: mean error (%) | Genome II: threshold | Genome II: Type I error (%) | Genome II: Type II error (%) | Genome II: mean error (%) |
|---|---|---|---|---|---|---|---|---|
| JS-N | 0.25 | 13.1 ± 4.0 | 9.9 ± 2.6 | 11.5 ± 1.9 | 0.2 | 17.3 ± 4.6 | 15.7 ± 3.2 | 16.5 ± 1.9 |
| JS-N pos | 0.25 | 12.4 ± 4.3 | 3.4 ± 1.2 | 7.9 ± 2.0 | 0.2 | 18.9 ± 4.9 | 2.4 ± 1.1 | 10.6 ± 2.1 |
| JS-DN | 0.4 | 8.4 ± 7.7 | 10.3 ± 1.8 | 9.3 ± 3.1 | 0.2 | 13.2 ± 3.9 | 8.1 ± 1.1 | 10.6 ± 2.0 |
| JS-DN pos | 0.4 | 9.1 ± 8.6 | 4.8 ± 1.9 | 7.0 ± 3.5 | 0.4 | 11.1 ± 3.6 | 5.4 ± 2.1 | 8.2 ± 2.5 |
| JS-CB | 10⁻⁵ | 10.3 ± 2.2 | 12.0 ± 0.9 | 11.1 ± 1.2 | 10⁻⁸ | 14.8 ± 2.5 | 17.6 ± 1.8 | 16.2 ± 1.6 |
| JS-CB pos | 10⁻² | 4.1 ± 0.6 | 4.2 ± 1.5 | 4.1 ± 0.9 | 10⁻³ | 8.1 ± 2.4 | 6.2 ± 2.0 | 7.1 ± 1.4 |
| AIC-N | 0.5 | 12.5 ± 6.1 | 10.8 ± 6.9 | 11.6 ± 2.6 | 0.4 | 15.9 ± 3.4 | 9.6 ± 2.3 | 12.8 ± 2.1 |
| AIC-N pos | 0.5 | 11.4 ± 6.5 | 6.5 ± 5.6 | 8.9 ± 2.3 | 0.4 | 16.1 ± 3.0 | 9.7 ± 1.8 | |
| AIC-DN | 1.9 | 16.3 ± 6.7 | 5.7 ± 6.1 | 11.0 ± 2.4 | 1.8 | 14.4 ± 4.0 | 5.6 ± 4.4 | 10.0 ± 2.5 |
| AIC-DN pos | 1.4 | 13.0 ± 6.5 | 4.1 ± 4.2 | 8.6 ± 2.3 | 1.2 | 9.8 ± 5.8 | 6.4 ± 11.2 | 8.1 ± 5.2 |
| AIC-CB | 1.5 | 19.4 ± 5.0 | 4.7 ± 4.0 | 12.0 ± 2.9 | 1.8 | 16.9 ± 6.4 | 13.4 ± 10.8 | 15.2 ± 5.5 |
| AIC-CB pos | 1.1 | 16.0 ± 2.0 | 9.2 ± 1.4 | 1.6 | 19.6 ± 6.8 | 4.0 ± 4.1 | 11.8 ± 3.5 | |
| Karlin's dinuc | 0.15 | 34.2 ± 3.5 | 28.6 ± 0.8 | 31.4 ± 2.0 | 0.12 | 17.3 ± 2.3 | 56.4 ± 1.5 | 36.9 ± 1.7 |
| Karlin's dinuc pos | 0.15 | 40.6 ± 4.0 | 9.6 ± 0.8 | 25.1 ± 2.2 | 0.13 | 31.3 ± 4.0 | 25.8 ± 2.0 | 28.6 ± 2.9 |
| Karlin's codon | 0.49 | 18.9 ± 4.4 | 16.1 ± 0.8 | 17.5 ± 2.5 | 0.48 | 20.7 ± 2.8 | 29.8 ± 1.5 | 25.3 ± 2.0 |
| Karlin's codon pos | 0.47 | 19.3 ± 5.1 | 7.4 ± 0.7 | 13.3 ± 2.7 | 0.43 | 14.7 ± 3.4 | 21.0 ± 1.4 | 17.8 ± 2.2 |
| | N/A | 23.2 ± 4.0 | 4.4 ± 1.5 | 13.8 ± 2.4 | N/A | 41.2 ± 6.4 | 42.6 ± 27.7 | 41.9 ± 16.2 |
| | N/A | 21.7 ± 4.4 | 4.5 ± 1.6 | 13.1 ± 2.6 | N/A | 44.0 ± 6.4 | 28.6 ± 19.6 | 36.3 ± 12.4 |
The methods were applied to identify atypical genes in an artificial E. coli genome with foreign genes from five or ten donor organisms (see text for details). JS-N, JS-DN and JS-CB denote the Jensen–Shannon-divergence-based gene clustering method using, respectively, nucleotide composition, dinucleotide composition and codon usage bias as the discriminant criterion; AIC-N, AIC-DN and AIC-CB denote the AIC-based gene clustering method with the same criteria. ‘pos’ denotes the use of positional information.
Figure 2. Trade-offs in error rates of foreign gene identification in artificial genomes. JS-N, JS-DN and JS-CB denote the Jensen–Shannon divergence-based gene clustering method using, respectively, nucleotide composition, dinucleotide composition and codon usage bias as the discriminant criterion. AIC stands for the AIC-based gene clustering methods. (A) Artificial Genome I, with 5 donors. (B) Artificial Genome II, with 10 donors.
Misclassification error rates of Jensen–Shannon-divergence-based clustering methods in detecting foreign genes in artificial E. coli genomes
| Artificial gene donor | Genome I: percent contribution | Genome I: JS-N | Genome I: JS-DN | Genome I: JS-CB | Genome II: percent contribution | Genome II: JS-N | Genome II: JS-DN | Genome II: JS-CB |
|---|---|---|---|---|---|---|---|---|
| | 7.0 | 5.7 ± 2.7 | 0.1 ± 0.1 | 0.3 ± 0.3 | 1.0 | 32.0 ± 19.8 | 1.3 ± 3.3 | 0.6 ± 1.8 |
| | 6.0 | 0.0 ± 0.1 | 1.1 ± 0.7 | 0.2 ± 0.3 | 1.0 | 1.0 ± 1.6 | 2.0 ± 2.1 | 0.8 ± 1.2 |
| | 5.0 | 56.0 ± 18.8 | 38.3 ± 32.3 | 18.1 ± 6.4 | 1.0 | 66.2 ± 21.0 | 77.5 ± 27.0 | 60.8 ± 38.0 |
| | 4.0 | 3.5 ± 3.5 | 3.4 ± 3.0 | 3.3 ± 1.0 | 2.0 | 1.0 ± 1.1 | 2.8 ± 3.3 | 3.1 ± 2.0 |
| | 3.0 | 1.4 ± 1.9 | 2.1 ± 2.2 | 3.3 ± 2.2 | 2.0 | 1.9 ± 2.0 | 3.0 ± 2.0 | 4.4 ± 3.9 |
| | | | | | 2.0 | 4.4 ± 3.6 | 4.8 ± 4.6 | 2.1 ± 2.2 |
| | | | | | 1.0 | 66.7 ± 31.6 | 58.2 ± 31.2 | 36.4 ± 13.0 |
| | | | | | 2.0 | 5.3 ± 4.5 | 3.8 ± 2.3 | 1.8 ± 2.1 |
| | | | | | 1.0 | 99.6 ± 0.8 | 3.3 ± 4.9 | 12.5 ± 8.7 |
| | | | | | 2.0 | 8.2 ± 3.9 | 1.1 ± 0.9 | 0.8 ± 0.9 |
| Type I error (100 − sensitivity) | | 12.4 ± 4.3 | 9.1 ± 8.6 | 4.1 ± 0.6 | | 18.9 ± 4.9 | 11.1 ± 3.6 | 8.1 ± 2.4 |
| Type II error (100 − specificity) | | 3.4 ± 1.2 | 4.8 ± 1.9 | 4.2 ± 1.5 | | 2.4 ± 1.1 | 5.4 ± 2.1 | 6.2 ± 2.0 |
| Mean error | | 7.9 ± 2.0 | 7.0 ± 3.5 | 4.1 ± 0.9 | | 10.6 ± 2.1 | 8.2 ± 2.5 | 7.1 ± 1.4 |
The positional information of a gene was used to further minimize the classification errors.
Assessment of the ability of Jensen–Shannon-based gene clustering methods to identify the genes from a donor organism in the artificial E. coli genomes as a distinct group
| Artificial gene donor | JS-N class abundance (%) | JS-N class purity (%) | JS-DN class abundance (%) | JS-DN class purity (%) | JS-CB class abundance (%) | JS-CB class purity (%) |
|---|---|---|---|---|---|---|
| Artificial Genome I: 5 donors | | | | | | |
| | 92.9 ± 2.8 | 92.2 ± 2.9 | 93.4 ± 2.1 | 99.8 ± 0.2 | 99.4 ± 0.5 | 99.6 ± 0.2 |
| | 96.0 ± 2.2 | 99.5 ± 0.3 | 88.0 ± 3.5 | 99.8 ± 0.1 | 97.8 ± 1.5 | 99.6 ± 0.3 |
| | 33.5 ± 16.8 | 81.0 ± 5.9 | 55.2 ± 31.0 | 75.0 ± 7.2 | 80.0 ± 7.0 | 84.5 ± 9.9 |
| | 92.8 ± 4.4 | 98.9 ± 1.5 | 86.1 ± 3.2 | 98.4 ± 2.1 | 94.8 ± 3.8 | 98.2 ± 1.6 |
| | 92.8 ± 2.1 | 90.7 ± 6.4 | 82.2 ± 5.2 | 96.8 ± 2.8 | 91.1 ± 4.7 | 97.5 ± 1.5 |
| Artificial Genome II: 10 donors | | | | | | |
| | 8.7 ± 26.2 | 56.5 ± 0.0 | 80.3 ± 8.1 | 98.8 ± 2.2 | 86.1 ± 9.9 | 93.7 ± 6.7 |
| | 84.7 ± 14.3 | 98.3 ± 2.8 | 77.0 ± 8.9 | 99.5 ± 1.4 | 94.3 ± 5.7 | 98.3 ± 2.3 |
| | 0.0 ± 0.0 | – | 11.4 ± 23.1 | 63.2 ± 6.4 | 17.3 ± 30.2 | – |
| | 69.2 ± 28.5 | 61.2 ± 10.8 | 66.9 ± 23.4 | 77.4 ± 18.1 | 79.1 ± 14.8 | 83.5 ± 15.1 |
| | 93.2 ± 5.4 | 95.1 ± 3.6 | 77.3 ± 6.8 | 89.8 ± 6.1 | 88.8 ± 5.6 | 97.2 ± 2.3 |
| | 15.1 ± 28.3 | 63.1 ± 17.8 | 36.4 ± 34.0 | 80.2 ± 20.2 | 72.3 ± 20.1 | 89.6 ± 11.0 |
| | 22.0 ± 29.2 | 67.7 ± 15.4 | 28.6 ± 29.8 | 78.8 ± 11.7 | 58.2 ± 13.5 | 72.7 ± 6.5 |
| | 62.1 ± 40.7 | 67.1 ± 13.0 | 77.8 ± 11.2 | 85.4 ± 9.8 | 85.7 ± 23.6 | 84.1 ± 7.2 |
| | 0.0 ± 0.0 | – | 87.2 ± 6.6 | 89.1 ± 5.9 | 88.3 ± 6.1 | 88.6 ± 10.9 |
| | 81.1 ± 27.5 | 71.5 ± 9.6 | 89.8 ± 4.0 | 97.1 ± 4.0 | 96.6 ± 1.7 | 94.1 ± 4.1 |
Class abundance: the percentage of total contributory genes from a source organism identified correctly in a respective class.
Class purity: the percentage of genes in a class correctly assigned to that class.
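The two footnote definitions can be made concrete with a short sketch. It assumes each gene carries a known donor label and an assigned class label, which is only possible for the artificial genomes; the function name and argument layout are hypothetical.

```python
def class_abundance_and_purity(true_donor, assigned_class, donor, cls):
    """Return (abundance %, purity %) for one donor and the class matched to it.

    abundance: share of the donor's genes that ended up in class `cls`.
    purity:    share of the genes in class `cls` that come from `donor`.
    NaN is returned where a denominator is zero (e.g. an empty class).
    """
    in_class = [t for t, c in zip(true_donor, assigned_class) if c == cls]
    hits = sum(1 for t in in_class if t == donor)
    donor_total = sum(1 for t in true_donor if t == donor)
    abundance = 100.0 * hits / donor_total if donor_total else float("nan")
    purity = 100.0 * hits / len(in_class) if in_class else float("nan")
    return abundance, purity
```

For example, if all four genes of donor "A" and one gene of donor "B" fall into class 1, the abundance for "A" is 100% and the purity of class 1 is 80%.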
Figure 3. Performance of the Jensen–Shannon divergence-based gene clustering method in identifying the foreign genes introduced from artificial γ-proteobacterial genomes into an artificial E. coli genome. The percentage of the acquired genes was varied from 10 to 30% of the genome.
Performance of the methods for lateral gene transfer detection in identifying the putative horizontally transferred genes in the E. coli K12 genome
| Parameter | Karlin's codon usage | HGT-DB | Wn-SVM | Alien Hunter | AIC-DN | JS-CB |
|---|---|---|---|---|---|---|
| Number of predicted HT genes | 50 (>600 nt) | 306 | 490 | 1239 | 464 | 639 |
| True positives | 45 | 223 | 302 | 504 | 306 | 449 |
| False positives | 5 | 83 | 188 | 735 | 158 | 190 |
| False negatives | 577 | 668 | 589 | 387 | 585 | 442 |
| Type I error (%) | 92.76 | 74.97 | 66.10 | 43.43 | 65.65 | 49.60 |
| Type II error (%) | 10.0 | 27.12 | 38.36 | 59.32 | 34.05 | 29.73 |
| Mean error (%) | 51.38 | 51.04 | 52.23 | 51.37 | 49.85 | 39.66 |
The ‘positives’ and ‘negatives’ respectively mean the genes declared as foreign and native by a method.
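The error rows in the table above can be recomputed from the counts. The sketch below uses formulas that are consistent with the tabulated values, namely Type I error as the share of true foreign genes that were missed and Type II error as the share of predicted foreign genes that were false positives; these formulas are inferred from the numbers rather than quoted from the source, and the function name is hypothetical.

```python
def error_rates(tp, fp, fn):
    """Return (type I %, type II %, mean %) from confusion counts.

    type I:  foreign genes missed, FN / (TP + FN)
    type II: predicted foreign genes that are false, FP / (TP + FP)
    mean:    arithmetic mean of the two rates
    """
    type1 = 100.0 * fn / (tp + fn)
    type2 = 100.0 * fp / (tp + fp)
    return type1, type2, (type1 + type2) / 2

# JS-CB column of the table: 449 true positives, 190 false positives,
# 442 false negatives.
t1, t2, mean = error_rates(449, 190, 442)
print(f"{t1:.1f} {t2:.1f} {mean:.1f}")
```

For the JS-CB column this gives roughly 49.6, 29.7 and 39.7, matching the tabulated 49.60, 29.73 and 39.66 to within rounding of the last digit.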