Rajeev K Azad, Jeffrey G Lawrence.
Abstract
Because the properties of horizontally-transferred genes will reflect the mutational proclivities of their donor genomes, they often show atypical compositional properties relative to native genes. Parametric methods use these discrepancies to identify bacterial genes recently acquired by horizontal transfer. However, compositional patterns of native genes vary stochastically, leaving no clear boundary between typical and atypical genes. As a result, while strongly atypical genes are readily identified as alien, genes of ambiguous character are poorly classified when a single threshold separates typical and atypical genes. This limitation affects all parametric methods that examine genes independently, and escaping it requires the use of additional genomic information. We propose that the performance of all parametric methods can be improved by using a multiple-threshold approach. First, strongly atypical alien genes and strongly typical native genes would be identified using conservative thresholds. Genes with ambiguous compositional features would then be classified by examining gene context, including the class (native or alien) of flanking genes. By including additional genomic information in a multiple-threshold framework, we observed a remarkable improvement in the performance of several popular, but algorithmically distinct, methods for alien gene detection.
Year: 2011 PMID: 21297116 PMCID: PMC3089488 DOI: 10.1093/nar/gkr059
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Parametric approaches to alien gene detection (in order of introduction)
| Method/software | Discriminant criterion | Measure | Classes | References |
|---|---|---|---|---|
| GC bias | G+C content | Deviations in G+C content | 2 | ( |
| Karlin’s dinucleotide | Dinucleotide composition | Difference in dinucleotide relative abundances | 2 | ( |
| Karlin’s codon bias | Codon usage bias | Difference in codon frequencies | 2 | ( |
| Codon usage bias | Kullback–Leibler divergence | 2 and 3 | ( | |
| Naïve Bayesian classifier | Oligonucleotide bias | Maximum | Unspecified | ( |
| 3:1 Genomic signature | Dinucleotide composition at 3:1 codon positions | 2 | ( | |
| Biases in GC content, codon usage and amino acid usage | Abrupt variations in cumulative GC profile, deviations in codon and amino acid usage pattern | 2 | ( | |
| Horizontal transfer index | Hexamer frequencies | 2 | ( | |
| SIGI | Codon usage bias | Log likelihood ratio | 2 | ( |
| Wn | Covariance | 2 | ( | |
| Wn-SVM | One-class support vector machines | 2 | ( | |
| AIC clustering | Many | Maximum likelihood and Akaike Information Criterion | Many | ( |
| Chaos game representation | Tetranucleotide composition | Euclidean distance | 2 | ( |
| IVOM/Alien Hunter | Interpolated octamer frequencies | Kullback–Leibler divergence | 2 | ( |
| JSD clustering | Many | Jensen–Shannon divergence | Many | ( |
| Design-Island | Tetranucleotide composition | Difference in tetranucleotide frequencies | 2 | ( |
| MJSD | Dinucleotide and trinucleotide composition | Markovian Jensen–Shannon divergence | 2 | ( |
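Several measures in the table above are divergences between frequency distributions. As an illustration only, the sketch below computes the Kullback–Leibler and Jensen–Shannon divergences between the codon frequencies of two sequences. The function names, the equal weighting of the two distributions, and the toy sequences are assumptions made for this example; they are not taken from any of the listed methods.

```python
import math
from collections import Counter


def frequencies(seq, k=3):
    """Relative frequencies of non-overlapping k-mers (codons when k=3)."""
    words = [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits.

    Assumes q assigns non-zero probability to every word that p uses.
    """
    return sum(pv * math.log2(pv / q[w]) for w, pv in p.items() if pv > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence with equal weights (an assumption), in bits."""
    support = set(p) | set(q)
    # The mixture m covers the support of both p and q, so both KL terms are finite.
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in support}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy sequences for illustration only; these are not real genes.
gene = "ATGGCGGCGCTGGCGTAA"
background = "ATGAAAGCGATTAAACTGTAA"
print(js_divergence(frequencies(gene), frequencies(background)))
```

With base-2 logarithms the Jensen–Shannon divergence is bounded between 0 (identical distributions) and 1 (no shared words), which makes it convenient to threshold.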
Figure 1. Solving the intrinsic problems with single-threshold approaches. (A) In single-threshold approaches, genes are sorted into native and foreign classes according to degree of atypicality. A trade-off between type I and II error results when the threshold is determined because compositional features between native and foreign genes overlap. (B) In multiple-threshold approaches, compositionally ambiguous genes are classified as native or foreign based on genomic context. (C) Reassignment of short-length genes based on genomic context. (D) Assignment of ambiguous genes based on genomic context.
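The multiple-threshold idea in panels (B) and (D) can be sketched in a few lines. This is a simplified illustration, not the authors' procedure: the function name `classify`, the threshold values, the one-gene flanking window, and the midpoint fallback for ties are all assumptions introduced for the example.

```python
def classify(scores, native_max, alien_min, window=1):
    """Label genes with two conservative thresholds, then use flanking genes.

    scores     -- atypicality scores in chromosomal order (higher = more atypical)
    native_max -- scores at or below this are called native (hypothetical value)
    alien_min  -- scores at or above this are called alien (hypothetical value)
    window     -- number of flanking genes inspected on each side (assumption)
    """
    # Pass 1: only confident calls; everything in between stays ambiguous.
    labels = []
    for s in scores:
        if s >= alien_min:
            labels.append("alien")
        elif s <= native_max:
            labels.append("native")
        else:
            labels.append("ambiguous")

    # Pass 2: resolve ambiguous genes from the confident calls around them.
    midpoint = (native_max + alien_min) / 2
    resolved = list(labels)
    for i, label in enumerate(labels):
        if label != "ambiguous":
            continue
        lo, hi = max(0, i - window), min(len(labels), i + window + 1)
        flank = [labels[j] for j in range(lo, hi) if j != i]
        aliens, natives = flank.count("alien"), flank.count("native")
        if aliens > natives:
            resolved[i] = "alien"
        elif natives > aliens:
            resolved[i] = "native"
        else:
            # Uninformative context: fall back to a single threshold.
            resolved[i] = "alien" if scores[i] >= midpoint else "native"
    return resolved


# Toy scores for seven consecutive genes (illustrative values only).
print(classify([0.1, 0.5, 0.9, 0.6, 0.95, 0.4, 0.1], native_max=0.3, alien_min=0.8))
```

In this toy run the gene scoring 0.6 sits between two confidently alien genes and is therefore called alien, although a single threshold placed above 0.6 would have missed it.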
Figure 2. Improvement of threshold methods by including multiple thresholds and positional information. (A) Improvement in standard single-threshold methods. Here ‘nucleotides’, ‘dinucleotides’ and ‘codons’ refer to GC bias, Karlin’s dinucleotide and Karlin’s codon bias method, respectively. (B) Improvement in gene clustering methods. The standard Jensen–Shannon divergence (JSD) approach (14) is here annotated ‘JSD/codon bias’; the ‘proximity’ method groups similar genes first in order of their physical distance within a genome, whereas the ‘augmented’ method uses gene context and operon structure information within a multiple-threshold framework.
Improved performance of position-augmented parametric methods in detecting genomic islands in genuine genomes
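| Method for detection | Reference set: 327 Islander genes | | | Reference set: 710 tRNAcc genes | | | Reference set: 859 tRNAcc genes | | | Reference set: 903 genes (Vernikos and Parkhill) | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Predicted | Detected | Percent | Predicted | Detected | Percent | Predicted | Detected | Percent | Predicted | Detected | Percent |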
| Karlin’s codon bias | 1724 | 214 | 65 | 1715 | 460 | 65 | 1655 | 451 | 53 | 1194 | 444 | 49 |
| Karlin’s codon bias augmented | 1726 | 246 | 75 | 1712 | 532 | 75 | 1645 | 556 | 65 | 1194 | 574 | 64 |
| Karlin’s codon bias | 2245 | 246 | 75 | 2308 | 532 | 75 | 2202 | 556 | 65 | 1681 | 574 | 64 |
| Karlin’s dinucleotide | 1654 | 184 | 56 | 1581 | 387 | 55 | 1671 | 416 | 48 | 1112 | 373 | 41 |
| Karlin’s dinucleotide augmented | 1653 | 270 | 83 | 1580 | 539 | 76 | 1670 | 623 | 73 | 1112 | 552 | 61 |
| Karlin’s dinucleotide | 3106 | 272 | 83 | 2893 | 544 | 77 | 2694 | 626 | 73 | 1858 | 553 | 61 |
| HTI/hexamer | 1912 | 238 | 73 | 1921 | 523 | 74 | 2163 | 650 | 76 | 1537 | 605 | 67 |
| HTI/hexamer augmented | 1912 | 279 | 85 | 1920 | 572 | 81 | 2165 | 716 | 83 | 1536 | 678 | 75 |
| HTI/hexamer | 2725 | 279 | 85 | 2299 | 572 | 81 | 2570 | 715 | 83 | 1901 | 677 | 75 |
| Wn/heptamer | 1851 | 203 | 62 | 1736 | 444 | 63 | 1857 | 544 | 63 | 1593 | 621 | 69 |
| Wn/heptamer augmented | 1851 | 225 | 69 | 1735 | 486 | 68 | 1854 | 608 | 71 | 1594 | 701 | 78 |
| Wn/heptamer | 2176 | 225 | 69 | 2117 | 486 | 68 | 2233 | 608 | 71 | 2127 | 701 | 78 |
| JSD/codon bias | 1966 | 190 | 58 | 1938 | 531 | 74 | 1599 | 449 | 52 | 1189 | 457 | 51 |
| JSD/codon bias augmented | 1958 | 316 | 96 | 1928 | 667 | 93 | 1592 | 745 | 86 | 1162 | 650 | 72 |
| JSD/codon bias | 4050 | 311 | 95 | 3438 | 659 | 92 | 3677 | 741 | 86 | 1902 | 653 | 72 |
a. Augmented methods use multiple thresholds.
b. Predicted: total number of putative alien genes predicted. Detected: number of the 327 genes from the Islander database that were among the total number of predicted. Percent: fraction of the database-archived alien genes detected.
c. Seven hundred and ten genes from the tRNAcc database.
d. Eight hundred and fifty-nine genes from the tRNAcc database.
e. Nine hundred and three genes as reported by Vernikos and Parkhill (27).
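As a worked check of the 'Percent' column, the short sketch below recomputes two entries from the counts in the table, using the 327-gene Islander reference set named in the footnotes. The helper name `percent_detected` is introduced here for illustration only.

```python
def percent_detected(detected, reference_size):
    """Percentage of the reference alien genes that a method recovered."""
    if reference_size <= 0:
        raise ValueError("reference_size must be positive")
    return 100.0 * detected / reference_size


# Counts taken from the first column group of the table (327 Islander genes).
standard = percent_detected(214, 327)   # Karlin's codon bias
augmented = percent_detected(246, 327)  # Karlin's codon bias augmented
print(round(standard), round(augmented))  # prints: 65 75
```

Both values agree with the tabulated percentages, and the number of predicted genes is nearly unchanged between the two rows (1724 versus 1726), so the gain in detection does not come from simply predicting more genes.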
Improved performance of position-augmented parametric methods in detecting phylogenetically unique genes in S. enterica Typhi CT18 genome
| Method for detection | Predicted | Detected | Percent |
|---|---|---|---|
| Karlin’s codon bias | 1194 | 210 | 46 |
| Karlin’s codon bias augmented | 1194 | 303 | 67 |
| Karlin’s codon bias | 1956 | 303 | 67 |
| Karlin’s dinucleotide | 1112 | 120 | 26 |
| Karlin’s dinucleotide augmented | 1112 | 264 | 58 |
| Karlin’s dinucleotide | 2059 | 264 | 58 |
| HTI/hexamer | 1537 | 321 | 71 |
| HTI/hexamer augmented | 1536 | 359 | 79 |
| HTI/hexamer | 1829 | 359 | 79 |
| Wn/heptamer | 1593 | 367 | 81 |
| Wn/heptamer augmented | 1594 | 389 | 86 |
| Wn/heptamer | 1930 | 389 | 86 |
| JSD/codon bias | 1189 | 274 | 60 |
| JSD/codon bias augmented | 1162 | 320 | 71 |
| JSD/codon bias | 1501 | 322 | 71 |
a. Augmented methods use multiple thresholds.
b. Predicted: total number of alien genes predicted. Detected: number of the 453 unique CT18 genes (those not found in the genomes of related enteric bacteria including E. coli CFT073, E. coli W3110, E. fergusonii ATCC 35469, C. koseri ATCC BAA-895 and K. pneumoniae 342) that were among the total number of predicted. Percent: fraction of the unique CT18 genes detected.
Relative performance of the Karlin’s dinucleotide method versus its augmented version following the seven steps used in augmenting its classification ability
| Step | Method | Reference set: 327 Islander genes | | | | | Reference set: 710 tRNAcc genes | | | | | Reference set: 859 tRNAcc genes | | | | | Reference set: 903 genes (Vernikos and Parkhill) | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | Amb. | Native | Alien | TP | SN | Amb. | Native | Alien | TP | SN | Amb. | Native | Alien | TP | SN | Amb. | Native | Alien | TP | SN |
| 1 | Augmented | 2881 | 2061 | 418 | 59 | 18.0 | 2800 | 2053 | 455 | 128 | 18.0 | 2738 | 2018 | 622 | 142 | 16.5 | 2315 | 1740 | 338 | 119 | 13.1 |
| | Standard | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – |
| 2 | Augmented | 2749 | 2135 | 476 | 67 | 20.4 | 2699 | 2108 | 501 | 151 | 21.2 | 2641 | 2072 | 665 | 169 | 19.6 | 2208 | 1785 | 400 | 152 | 16.8 |
| | Standard | – | 4883 | 477 | 65 | 19.8 | – | 4807 | 501 | 138 | 19.4 | – | 4713 | 665 | 155 | 18.0 | – | 3993 | 400 | 137 | 15.1 |
| 3 | Augmented | 2884 | 2132 | 344 | 64 | 19.5 | 2826 | 2124 | 358 | 141 | 19.8 | 2784 | 2147 | 447 | 153 | 17.8 | 2318 | 1789 | 286 | 134 | 14.8 |
| | Standard | – | 5014 | 346 | 53 | 16.2 | – | 4949 | 359 | 100 | 14.0 | – | 4931 | 447 | 91 | 10.5 | – | 4106 | 287 | 97 | 10.7 |
| 4a | Augmented | 2297 | 2649 | 414 | 85 | 25.9 | 2226 | 2654 | 428 | 187 | 26.3 | 2208 | 2665 | 505 | 181 | 21.0 | 1797 | 2242 | 354 | 177 | 19.6 |
| | Standard | – | 4945 | 415 | 59 | 18.0 | – | 4879 | 429 | 114 | 16.0 | – | 4872 | 506 | 104 | 12.1 | – | 4038 | 355 | 123 | 13.6 |
| 4b | Augmented | 2442 | 2409 | 509 | 85 | 25.9 | 2419 | 2411 | 478 | 191 | 26.9 | 2420 | 2423 | 535 | 188 | 21.8 | 1958 | 2041 | 394 | 208 | 23.0 |
| | Standard | – | 4850 | 510 | 69 | 21.1 | – | 4839 | 478 | 131 | 18.4 | – | 4843 | 535 | 116 | 13.5 | – | 3999 | 394 | 136 | 15.0 |
| 4a and b | Augmented | 1855 | 2926 | 579 | 106 | 32.4 | 1819 | 2941 | 548 | 237 | 33.3 | 1844 | 2941 | 593 | 216 | 25.1 | 1437 | 2494 | 462 | 251 | 27.7 |
| | Standard | – | 4781 | 579 | 73 | 22.3 | – | 4760 | 548 | 150 | 21.1 | – | 4785 | 593 | 131 | 15.2 | – | 3930 | 463 | 162 | 17.9 |
| 5 | Augmented | 1855 | 2928 | 577 | 106 | 32.4 | 1819 | 2943 | 546 | 237 | 33.3 | 1844 | 2946 | 588 | 216 | 25.1 | 1437 | 2497 | 459 | 250 | 27.6 |
| | Standard | – | 4783 | 577 | 73 | 22.3 | – | 4762 | 546 | 150 | 21.1 | – | 4790 | 588 | 131 | 15.2 | – | 3934 | 459 | 161 | 17.8 |
| 6 | Augmented | 1447 | 2928 | 985 | 185 | 56.5 | 1506 | 2943 | 859 | 365 | 51.4 | 1313 | 2946 | 1119 | 453 | 52.7 | 1225 | 2497 | 671 | 384 | 42.5 |
| | Standard | – | 4373 | 987 | 113 | 34.5 | – | 4448 | 860 | 216 | 30.4 | – | 4259 | 1119 | 275 | 32.0 | – | 3722 | 671 | 236 | 26.1 |
| 7 | Augmented | 0 | 3707 | 1653 | 270 | 82.5 | 0 | 3728 | 1580 | 539 | 75.9 | 0 | 3708 | 1670 | 623 | 72.5 | 0 | 3281 | 1112 | 552 | 61.1 |
| | Standard | – | 3706 | 1654 | 184 | 56.2 | – | 3727 | 1581 | 387 | 54.5 | – | 3707 | 1671 | 416 | 48.4 | – | 3281 | 1112 | 373 | 41.3 |
a. Augmented methods use multiple thresholds.
b. Predicted: total number of putative alien genes predicted. Detected: number of the 327 genes from the Islander database that were among the total number of predicted. Percent: fraction of the database-archived alien genes detected.
c. Seven hundred and ten genes from the tRNAcc database.
d. Eight hundred and fifty-nine genes from the tRNAcc database.
e. Nine hundred and three genes as reported by Vernikos and Parkhill (27).
Amb., ambiguous.
Relative performance of the Karlin’s dinucleotide method versus its augmented version in detecting phylogenetically unique genes in S. enterica Typhi CT18 genome following the seven steps used in augmenting the method’s classification ability
| Step | Method | Ambiguous | Native | Alien | TP | SN |
|---|---|---|---|---|---|---|
| 1 | Augmented | 2315 | 1740 | 338 | 13 | 2.8 |
| | Standard | – | – | – | – | – |
| 2 | Augmented | 2208 | 1785 | 400 | 30 | 6.6 |
| | Standard | – | 3993 | 400 | 19 | 4.1 |
| 3 | Augmented | 2318 | 1789 | 286 | 30 | 6.6 |
| | Standard | – | 4106 | 287 | 7 | 1.5 |
| 4a | Augmented | 1797 | 2242 | 354 | 52 | 11.4 |
| | Standard | – | 4038 | 355 | 14 | 3.0 |
| 4b | Augmented | 1958 | 2041 | 394 | 68 | 15.0 |
| | Standard | – | 3999 | 394 | 18 | 3.9 |
| 4a and b | Augmented | 1437 | 2494 | 462 | 90 | 19.8 |
| | Standard | – | 3930 | 463 | 31 | 6.8 |
| 5 | Augmented | 1437 | 2497 | 459 | 90 | 19.8 |
| | Standard | – | 3934 | 459 | 30 | 6.6 |
| 6 | Augmented | 1225 | 2497 | 671 | 158 | 34.8 |
| | Standard | – | 3722 | 671 | 62 | 13.6 |
| 7 | Augmented | 0 | 3281 | 1112 | 264 | 58.2 |
| | Standard | – | 3281 | 1112 | 120 | 26.4 |
Performance in detecting island borne genes by the combined methods
| Predicted by at least | Genome analyzed | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Predicted | Detected | Percent | Predicted | Detected | Percent | Predicted | Detected | Percent | Predicted | Detected | Percent |
| 1 of 5 methods | 3150 | 326 | 99.6 | 3168 | 705 | 99.2 | 3228 | 821 | 95.5 | 2327 | 802 | 88.8 |
| 2 of 5 methods | 2298 | 321 | 98.1 | 2241 | 684 | 96.3 | 2242 | 765 | 89.0 | 1604 | 738 | 81.7 |
| 3 of 5 methods | 1731 | 292 | 89.2 | 1690 | 610 | 85.9 | 1712 | 692 | 80.5 | 1231 | 675 | 74.7 |
| 4 of 5 methods | 1259 | 243 | 74.3 | 1178 | 494 | 69.5 | 1168 | 578 | 67.2 | 904 | 568 | 62.9 |
| 5 of 5 methods | 692 | 154 | 47.0 | 598 | 303 | 42.6 | 576 | 392 | 45.6 | 532 | 372 | 41.1 |
| 1 of 4 methods or JSD | 3150 | 326 | 99.6 | 3168 | 705 | 99.2 | 3228 | 821 | 95.5 | 2327 | 802 | 88.8 |
| 2 of 4 methods or JSD | 2502 | 324 | 99.0 | 2491 | 691 | 97.3 | 2436 | 787 | 91.6 | 1729 | 761 | 84.2 |
| 3 of 4 methods or JSD | 2205 | 322 | 98.4 | 2206 | 685 | 96.4 | 2086 | 767 | 89.2 | 1454 | 720 | 79.7 |
| 4 of 4 methods or JSD | 2049 | 317 | 96.9 | 2027 | 673 | 94.7 | 1805 | 757 | 88.1 | 1280 | 686 | 75.9 |
a. Augmented methods using multiple thresholds. ‘5 methods’ denotes augmented versions of Karlin’s codon bias, Karlin’s dinucleotide, HTI/hexamer, Wn/heptamer and JSD/codon bias. ‘4 methods’ denotes augmented versions of Karlin’s codon bias, Karlin’s dinucleotide, HTI/hexamer and Wn/heptamer.
b. Predicted: total number of putative alien genes predicted. Detected: number of the 327 genes from the Islander database that were among the total number of predicted. Percent: fraction of the database-archived alien genes detected.
c. Seven hundred and ten genes from the tRNAcc database.
d. Eight hundred and fifty-nine genes from the tRNAcc database.
e. Nine hundred and three genes as reported by Vernikos and Parkhill (27).
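The 'k of n methods' rows amount to a voting rule over the gene sets predicted by each method. The sketch below shows one way to express that rule, reading 'k of 4 methods or JSD' as the k-of-4 consensus united with the JSD predictions; that reading is an assumption, though it is consistent with the '1 of 4 methods or JSD' row matching the '1 of 5 methods' row. The function names, method keys and gene identifiers are hypothetical.

```python
def consensus(predictions, min_votes):
    """Genes called alien by at least `min_votes` of the supplied methods.

    predictions -- dict mapping a method name to its set of predicted gene IDs
    """
    votes = {}
    for genes in predictions.values():
        for gene in genes:
            votes[gene] = votes.get(gene, 0) + 1
    return {gene for gene, count in votes.items() if count >= min_votes}


def consensus_or(predictions, min_votes, override):
    """k-of-n consensus among the other methods, plus every call by `override`."""
    others = {m: genes for m, genes in predictions.items() if m != override}
    return consensus(others, min_votes) | predictions[override]


# Hypothetical gene identifiers and method keys, for illustration only.
predictions = {
    "codon": {"g1", "g2", "g3"},
    "dinucleotide": {"g2", "g3"},
    "hexamer": {"g3", "g4"},
    "jsd": {"g5"},
}
print(sorted(consensus(predictions, 2)))            # ['g2', 'g3']
print(sorted(consensus_or(predictions, 2, "jsd")))  # ['g2', 'g3', 'g5']
```

Raising `min_votes` trades sensitivity for a smaller prediction set, which is the pattern visible in the table as the required number of agreeing methods increases.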