Abstract
Various methods have been developed to detect horizontal gene transfer in bacteria, based on anomalous nucleotide composition, assuming that compositional features undergo amelioration in the host genome. Evolutionary theory predicts the inevitability of false positives when essential sequences are strongly conserved. Foreign genes could become more detectable on the basis of their higher order compositions if such features ameliorate more rapidly and uniformly than lower order features. This possibility is tested by comparing the heterogeneities of bacterial genomes with respect to strand-independent first- and second-order features, (i) G + C content and (ii) dinucleotide relative abundance, in 1 kb segments. Although statistical analysis confirms that (ii) is less inhomogeneous than (i) in all 12 species examined, extreme anomalies with respect to (ii) in the Escherichia coli K12 genome are typically co-located with essential genes.
Year: 2008 PMID: 18799480 PMCID: PMC2575891 DOI: 10.1093/dnares/dsn021
Source DB: PubMed Journal: DNA Res ISSN: 1340-2838 Impact factor: 4.458
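The two features compared in the abstract, G + C content and dinucleotide relative abundance (DRA) in 1 kb segments, can be illustrated with a minimal sketch. It assumes the commonly used strand-symmetrized definition of DRA, rho*_xy = f*_xy / (f*_x f*_y), with frequencies pooled over a sequence and its reverse complement; the record above does not spell out the formula, and the function names `gc_content`, `dra` and `segments` are hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement; non-ACGT symbols are left unchanged."""
    return seq.translate(COMPLEMENT)[::-1]


def gc_content(seq):
    """Fraction of G or C among the unambiguous bases of seq."""
    counts = Counter(b for b in seq.upper() if b in BASES)
    total = sum(counts.values())
    return (counts["G"] + counts["C"]) / total if total else float("nan")


def dra(seq):
    """Strand-symmetrized DRA (assumed definition):
    rho*_xy = f*_xy / (f*_x * f*_y), pooling both strands."""
    seq = seq.upper()
    mono, di = Counter(), Counter()
    for strand in (seq, reverse_complement(seq)):
        mono.update(b for b in strand if b in BASES)
        for i in range(len(strand) - 1):
            pair = strand[i:i + 2]
            if pair[0] in BASES and pair[1] in BASES:
                di[pair] += 1
    n_mono, n_di = sum(mono.values()), sum(di.values())
    rho = {}
    for x, y in product(BASES, repeat=2):
        if n_mono == 0 or n_di == 0:
            rho[x + y] = float("nan")
            continue
        expected = (mono[x] / n_mono) * (mono[y] / n_mono)
        # Undefined when one of the two bases never occurs.
        rho[x + y] = (di[x + y] / n_di) / expected if expected > 0 else float("nan")
    return rho


def segments(seq, size=1000):
    """Yield (start, subsequence) for consecutive non-overlapping windows."""
    for start in range(0, len(seq) - size + 1, size):
        yield start, seq[start:start + size]
```

Pooling both strands makes the signature strand-independent, so each dinucleotide and its reverse complement (for example AC and GT) receive the same value. Applying `gc_content` and `dra` to every window from `segments` gives per-segment features that can then be compared with the genome-wide values.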
Figure 1. Scatter plot of total chi-squared divergence, on the left side of Equation (2), versus the sum of the statistics on the right side, in 1 kb segments of the E. coli K12 genome. The line of unit slope through the origin is also drawn.
Figure 2. Scatter plot of GC-divergence, describing fluctuations from the gross genomic G + C proportion, versus divergence with respect to DRA, in 1 kb segments of the E. coli K12 genome. The line of unit slope through the origin is also drawn.
Table 1. Twelve selected bacterial genomes indexed by serial number (SN), identified by species, strain and reference sequence number (Ref.Seq.No.)
| SN | Species | Strain | NCBI Ref.Seq.No. | Length (kb) | G + C (%) |
|---|---|---|---|---|---|
| 1 | Bacillus subtilis | 168 | 000964.2 | 4214 | 43 |
| 2 | Borrelia burgdorferi | B31 | 001318.1 | 910 | 28 |
| 3 | Campylobacter jejuni | RM221 | 003912.7 | 1777 | 30 |
| 4 | Chlamydophila pneumoniae | J138 | 002491.1 | 1226 | 42 |
| 5 | Escherichia coli | K12 | 000913.2 | 4639 | 50 |
| 6 | Haemophilus influenzae | Rd KW20 | 000907.1 | 1830 | 38 |
| 7 | Helicobacter pylori | J99 | 000921.1 | 1643 | 39 |
| 8 | Mycobacterium tuberculosis | CDC1551 | 002755.2 | 4403 | 65 |
| 9 | Mycoplasma genitalium | G37 | 000908.2 | 580 | 31 |
| 10 | Salmonella enterica serovar Typhi | CT18 | 003198.1 | 4809 | 52 |
| 11 | Staphylococcus aureus | N315 | 002745.2 | 2814 | 32 |
| 12 | Synechocystis sp. | PC6803 | 000911.1 | 3573 | 47 |
Table 2. Summary statistics for 12 selected bacterial genomes indexed by serial number (SN) as in Table 1
| SN | Mean divergence: Total | Mean divergence: DRA | Mean divergence: G + C | MAD | ER 95%: Total | ER 95%: DRA | ER 95%: G + C | ER 99.5%: Total | ER 99.5%: DRA | ER 99.5%: G + C |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 25.8 | 10.8 | 15.9 | 2.9 | 47.2 | 27.9 | 43.2 | 32.4 | 13.4 | 32.2 |
| 2 | 21.3 | 9.7 | 12.0 | 3.9 | 42.2 | 22.7 | 39.6 | 25.2 | 7.8 | 28.4 |
| 3 | 21.7 | 10.4 | 12.0 | 3.9 | 45.1 | 26.2 | 40.2 | 28.0 | 10.9 | 27.4 |
| 4 | 15.6 | 9.6 | 5.9 | 1.7 | 34.8 | 22.7 | 26.4 | 17.2 | 8.1 | 14.0 |
| 5 | 27.1 | 9.8 | 18.4 | 3.4 | 48.0 | 24.3 | 46.4 | 33.5 | 10.2 | 34.0 |
| 5(n) | 27.5 | 9.7 | 18.9 | 3.4 | 48.2 | 24.2 | 46.4 | 33.5 | 10.2 | 33.9 |
| 5(e) | 20.4 | 11.4 | 9.5 | 2.9 | 45.5 | 14.6 | 37.5 | 21.6 | 4.3 | 23.2 |
| 6 | 19.9 | 10.6 | 10.2 | 3.6 | 37.0 | 27.1 | 31.0 | 23.2 | 11.6 | 20.3 |
| 7 | 19.1 | 11.4 | 9.2 | 5.5 | 39.7 | 29.4 | 32.0 | 25.0 | 14.3 | 21.7 |
| 8 | 20.2 | 9.5 | 10.3 | 2.4 | 32.1 | 20.1 | 27.8 | 19.1 | 8.4 | 18.4 |
| 9 | 26.2 | 11.3 | 14.0 | 3.4 | 49.3 | 28.6 | 41.4 | 33.6 | 13.9 | 30.7 |
| 10 | 30.4 | 10.8 | 21.9 | 5.2 | 52.8 | 28.3 | 49.3 | 38.2 | 12.8 | 38.2 |
| 11 | 21.7 | 11.3 | 10.2 | 2.6 | 40.7 | 28.6 | 32.9 | 25.6 | 12.9 | 21.2 |
| 12 | 24.4 | 10.0 | 16.6 | 5.7 | 43.8 | 22.2 | 44.6 | 28.6 | 9.8 | 32.4 |
| Average | 22.8 | 10.4 | 13.0 | 3.7 | 42.7 | 25.7 | 37.9 | 27.5 | 11.2 | 26.6 |
Mean values of the chi-squared divergence statistics in Equation (2) are shown with the MAD. ERs at two percentile points are listed. Averages across all species are shown in the last row.
Table 3. Characterizing the 20 most anomalous 1 kb segments of the E. coli K12 genome based on dinucleotide signature dissimilarity as measured by (a) DRA-divergence, (b) delta-distance, (c) Euclidean distance and (d) the quadratic discriminant
| # | (a) chi-square: loc. | (a) chi-square: HT | (b) delta-distance: loc. | (b) delta-distance: HT | (c) Euclidean: loc. | (c) Euclidean: HT | (d) quadratic: loc. | (d) quadratic: HT | (e) G + C: loc. | (e) G + C: HT |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 200 | yaeT | 151 | −1 | 151 | −1 | 284 | M | 583 | M |
| 2 | 227 | 0 | 227 | 0 | 274 | M | 525 | M | 584 | M |
| 3 | 393 | M | 394 | −1 | 575 | M | 526 | M | 1212 | −1, −1 |
| 4 | 394 | −1 | 526 | M | 777 | −1 | 575 | M | 1636 | M |
| 5 | 526 | M | 777 | −1 | 978 | mukB | 777 | −1 | 2102 | −1 |
| 6 | 777 | −1 | 1287 | −1 | 1142 | rne | 1142 | rne | 2105 | +1 |
| 7 | 978 | mukB | 1427 | M | 1395 | M | 1427 | M | 2468 | M |
| 8 | 1142 | rne | 1465 | 0 | 1427 | M | 1465 | 0 | 2773 | M |
| 9 | 1465 | 0 | 1527 | 0 | 1465 | 0 | 1527 | 0 | 2783 | 0 |
| 10 | 1527 | 0 | 1707 | −1 | 1527 | 0 | 1707 | −1 | 2785 | 0 |
| 11 | 1707 | −1 | 2071 | M | 2101 | M | 2104 | +1 | 2989 | +1 |
| 12 | 2071 | M | 2072 | M | 2104 | +1 | 2105 | +1 | 2990 | −1 |
| 13 | 2072 | M | 2104 | +1 | 2105 | +1 | 2989 | +1 | 2994 | 0 |
| 14 | 2104 | +1 | 2994 | 0 | 2994 | 0 | 2990 | −1 | 3267 | +1, +1 |
| 15 | 3312 | −1, infB | 3312 | −1, infB | 3314 | −1 | 2992 | −1 | 3581 | +1 |
| 16 | 3450 | rplWD | 3602 | ftsY | 3450 | rplWD | 2994 | 0 | 3797 | +1, +1 |
| 17 | 3602 | ftsY | 3915 | −1 | 3602 | ftsY | 3602 | ftsY | 3798 | −1 |
| 18 | 3915 | −1 | 4058 | −1 | 4121 | −1 | 3620 | M | 3803 | −1, +1 |
| 19 | 4058 | −1 | 4187 | rpoC | 4503 | M | 4503 | M | 4267 | −1 |
| 20 | 4181 | rpoB | 4474 | −1, +1 | 4504 | M | 4504 | M, M | 4475 | +1 |

| | (a) chi-square | (b) delta-distance | (c) Euclidean | (d) quadratic | (e) G + C |
|---|---|---|---|---|---|
| Genes | 19 | 18 | 18 | 18 | 21 |
| Errors | 14(8) | 12(3) | 9(5) | 6(2) | 7(0) |
Each segment, indexed by location (loc.) in kb from the published origin, overlaps zero, one or two genes in the protein table. Intergenic segments are classified as ‘0’. HT is indicated by ‘+1’ or by ‘M’ (if the gene is a mobile element). False positives are indicated by gene locus (if essential) or by ‘−1’ (if not). False positives account for total errors (bottom row) and the number of essential genes is given (in parentheses). The analysis is repeated for the 20 most anomalous segments with respect to (e) GC-divergence.
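Two of the dissimilarity measures named in the table caption, (b) delta-distance and (c) Euclidean distance, can be sketched as follows. This assumes the widely used form of the delta-distance, the mean absolute difference between two 16-component dinucleotide signatures, and a plain unweighted Euclidean distance; the exact definitions used for the table are not given in this record. The signatures are passed in as dictionaries keyed by dinucleotide, and `top_anomalies` is a hypothetical helper for ranking segments.

```python
import math
from itertools import product

DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def delta_distance(rho_f, rho_g):
    """Mean absolute difference over the 16 dinucleotides (assumed form)."""
    return sum(abs(rho_f[d] - rho_g[d]) for d in DINUCLEOTIDES) / 16


def euclidean_distance(rho_f, rho_g):
    """Unweighted Euclidean distance between two signatures (assumed form)."""
    return math.sqrt(sum((rho_f[d] - rho_g[d]) ** 2 for d in DINUCLEOTIDES))


def top_anomalies(segment_signatures, genome_signature, distance, n=20):
    """Return the locations of the n segments farthest from the genome signature.

    segment_signatures maps a segment location (e.g. kb from the origin)
    to that segment's signature dictionary.
    """
    scored = [
        (distance(signature, genome_signature), location)
        for location, signature in segment_signatures.items()
    ]
    scored.sort(reverse=True)
    return [location for _, location in scored[:n]]
```

Because the two measures weight large single-dinucleotide deviations differently, the ranked lists they produce can differ, which is consistent with the partly overlapping locations in columns (b) and (c).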
Table 4. Comparing essential segments of the E. coli sequence to all segments and to non-essential segments
| Segments | Mean divergence: Total | Mean divergence: DRA | Mean divergence: G + C | ER 95%: Total | ER 95%: DRA | ER 95%: G + C | ER 99.5%: Total | ER 99.5%: DRA | ER 99.5%: G + C |
|---|---|---|---|---|---|---|---|---|---|
| All segments | 27.1 | 9.8 | 18.4 | 48.0 | 24.3 | 46.4 | 33.5 | 10.2 | 34.0 |
| Non-essential ( | 27.6 | 9.7 | 19.0 | 48.2 | 23.7 | 47.2 | 34.1 | 9.8 | 35.0 |
| Essential ( | 19.9 | 12.1 | 8.5 | 45.5 | 34.1 | 33.1 | 25.8 | 18.4 | 18.7 |
| Difference (non-essential − essential) | 8.6 | −2.4 | 10.5 | 2.7 | −10.4 | 14.1 | 8.3 | −8.6 | 16.3 |

| Segments | MSR: Total | MSR: DRA | MSR: G + C | (MSR − 0.5)/SD: Total | (MSR − 0.5)/SD: DRA | (MSR − 0.5)/SD: G + C | P-value: Total | P-value: DRA | P-value: G + C |
|---|---|---|---|---|---|---|---|---|---|
| Essential | 0.462 | 0.559 | 0.400 | −2.29 | +3.55 | −6.02 | 0.011 | 0.0002 | <10⁻⁹ |
Mean values and ERs of the divergence statistics are computed as in Table 2. The MSR of the divergence in essential segments is tested for significant departure from expected value (0.5) in the bottom row.
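The significance test in the bottom row can be checked with a short calculation. The sketch assumes that the standardized statistic (MSR − 0.5)/SD is referred to a standard normal distribution and that a one-sided tail probability is reported; neither assumption is stated in this record, and `one_sided_p` is a hypothetical helper.

```python
import math


def one_sided_p(z):
    """Probability that a standard normal variate lies beyond |z| in one tail."""
    return 0.5 * math.erfc(abs(z) / math.sqrt(2))


# Standardized statistics for the essential segments, taken from the table.
for label, z in [("Total", -2.29), ("DRA", 3.55), ("G + C", -6.02)]:
    print(f"{label}: z = {z:+.2f}, one-sided p = {one_sided_p(z):.2e}")
```

Under these assumptions the three tail probabilities agree with the tabulated values to the precision shown, which suggests the table reports one-sided probabilities.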
Figure 3. Comparing observed distributions of compositional divergence components in segments of the E. coli K12 genome. Means, medians, and indicated percentile points are plotted for (circles) essential and (squares) non-essential segments. Arrows depict the inferred time course of amelioration as explained in Section 3.4.