
Using Mahalanobis distance to compare genomic signatures between bacterial plasmids and chromosomes.

Haruo Suzuki, Masahiro Sota, Celeste J Brown, Eva M Top.

Abstract

Plasmids are ubiquitous mobile elements that serve as a pool of many host beneficial traits such as antibiotic resistance in bacterial communities. To understand the importance of plasmids in horizontal gene transfer, we need to gain insight into the 'evolutionary history' of these plasmids, i.e. the range of hosts in which they have evolved. Since extensive data support the proposal that foreign DNA acquires the host's nucleotide composition during long-term residence, comparison of nucleotide composition of plasmids and chromosomes could shed light on a plasmid's evolutionary history. The average absolute dinucleotide relative abundance difference, termed delta-distance, has been commonly used to measure differences in dinucleotide composition, or 'genomic signature', between bacterial chromosomes and plasmids. Here, we introduce the Mahalanobis distance, which takes into account the variance-covariance structure of the chromosome signatures. We demonstrate that the Mahalanobis distance is better than the delta-distance at measuring genomic signature differences between plasmids and chromosomes of potential hosts. We illustrate the usefulness of this metric for proposing candidate long-term hosts for plasmids, focusing on the virulence plasmids pXO1 from Bacillus anthracis, and pO157 from Escherichia coli O157:H7, as well as the broad host range multi-drug resistance plasmid pB10 from an unknown host.

Year:  2008        PMID: 18953039      PMCID: PMC2602791          DOI: 10.1093/nar/gkn753

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Plasmids are commonly found in bacteria, and often confer various phenotypes to their host such as resistance to antibiotics and heavy metals, production of toxins and other virulence factors, biotransformation of hydrocarbons, symbiotic nitrogen fixation, etc. (1). The overuse of antibiotics in the treatment of infectious diseases of humans and animals and for non-therapeutic purposes in agriculture has contributed to the emergence and spread of antibiotic-resistant pathogens, and this is partly due to horizontal gene transfer via conjugative plasmids (2,3). Many plasmids can transfer horizontally by conjugation, and replicate and be maintained in either a limited or much wider range of hosts (narrow versus broad host range) (4). Transmissible plasmids can be categorized into two types, according to their mobilization ability: (i) ‘self-transmissible’ or ‘conjugative’ plasmids that encode their own transfer machinery, and (ii) ‘mobilizable’ plasmids that can only transfer if provided with the transfer machinery by a co-resident self-transmissible plasmid (4). In spite of the critical role of plasmids in the spread of drug resistance and virulence factors, we have limited insight into their ecology and evolution. In particular we know little about which hosts serve as reservoirs of drug resistance, virulence and other plasmids. A first assessment of the bacterial hosts that are potential long-term carriers of specific plasmids can be based on plasmid-chromosome sequence comparison. By August 2008, there were 1490 completely sequenced plasmid genomes in the NCBI database (http://www.ncbi.nlm.nih.gov). Many of these plasmid sequences were obtained during whole bacterial genome sequencing projects. Among the 1490 plasmids, 1355 (90.9%) are derived from Bacteria, 57 (3.8%) from Archaea and 39 (2.6%) from Eukaryota. Thirty-nine plasmids or 2.6% of the sequences currently archived in GenBank come from uncultured bacteria, and therefore their hosts are unknown. 
This fraction is expected to increase because of the current rapid rise in metagenomic and other cultivation-independent studies that generate plasmid sequences, including those of the human microbiome (5–8). For example, a plasmid genome sequence analysis project currently performed by us will more than triple this number in a very short time. Thus, methods that can suggest candidate hosts of plasmids based on DNA sequence data alone would help us assess their evolutionary history. Extensive data support the proposal that foreign DNA will acquire the host's nucleotide composition during long-term residence, which is often referred to as ‘genome amelioration’ (9,10). Amelioration may result from restrictions in DNA conformation and mutational biases of replication and repair machinery in the host. The same pressures that homogenize the nucleotide composition throughout the chromosome will also drive a plasmid sequence's nucleotide composition towards that of the host (11). It follows that similar nucleotide compositions between a plasmid and a host's chromosome may indicate long-term evolution of the plasmid in that host, whereas a dissimilar nucleotide composition suggests independent evolutionary histories. Native genes have been distinguished from recently acquired foreign genes using compositional features such as G+C content and frequencies of short oligonucleotides (di-, tri- and tetra-nucleotides) (12–16). One such feature is the dinucleotide relative abundance, which is defined as the ratio of the observed to expected dinucleotide frequency (11,17–28). The profile of 16 dinucleotide relative abundance values is relatively constant throughout the genome, except for regions that were recently acquired via horizontal transfer, such as genomic islands. Additionally, closely related species have more similar profiles than distantly related species. This profile of the dinucleotide composition therefore has been termed a ‘genomic signature’ (17,29). 
The average absolute dinucleotide relative abundance difference, termed δ-distance, has been commonly used to measure the genomic signature difference between DNA sequences. Previous applications of the δ-distance to plasmids and their host chromosomes have led to contradictory conclusions. Campbell et al. (11) provided qualitative evidence that plasmids have similar genomic signatures to their known hosts using δ-distances averaged over 50 kb segments of the host chromosome. Van Passel et al. (25) used a more quantitative method taking into account the variability in composition around the chromosome, and concluded that plasmids were not more similar to their hosts than genomic aberrations such as horizontally transferred DNA and rRNA gene clusters. The results from van Passel's study indicate that the δ-distance can be improved upon by including information about the variability among dinucleotide relative abundance values along the bacterial chromosome. This motivated us to consider the Mahalanobis distance (30–32), which is well known in multivariate statistical analysis (e.g. discriminant analysis), but has not been considered so far for measuring genomic signature differences. The Mahalanobis distance corrects for the variability in the data (here, dinucleotide abundance changes along the chromosome), as well as for the covariance among the variables. It does this by giving less weight to correlated variables, proportional to their degree of correlation. Indeed, because the dinucleotide relative abundance is calculated on both strands of the DNA, the frequencies of the reverse complements of each dinucleotide are highly correlated. The Mahalanobis distance adjusts for this correlation, whereas the δ-distance does not. 
Understanding the host in which plasmids evolved is important because (i) for an increasing number of drug resistance and other plasmids the host is not known since they were obtained by cultivation-independent methods, and (ii) even for plasmids that were found in specific strains, these hosts may not always be the long-term hosts. The goal of this study was to use the Mahalanobis distance to measure the genomic signature difference between bacterial plasmids and chromosomes, and to use it as a tool to propose candidate long-term hosts of plasmids. We first compared the performance of the Mahalanobis distance with the commonly used δ-distance in detecting the host in which a particular plasmid was found (designated as ‘known host’). We then focused on the virulence plasmids pXO1 from Bacillus anthracis, and pO157 from Escherichia coli O157:H7 to illustrate that this method can be used to generate hypotheses about the potential long-term reservoirs of virulence plasmids. Finally, we proposed candidate long-term hosts for the broad host range multi-drug resistance plasmid pB10. Neither the recent nor long-term hosts of this plasmid are known because it was captured from a waste water treatment plant by a cultivation-independent method. The Mahalanobis distance performed better than the δ-distance in identifying the known plasmid hosts among 230 bacterial strains, and in proposing candidate long-term hosts that are plausible given our empirical knowledge of plasmid host range. The approach thus generates testable hypotheses about the bacteria that may act as reservoirs for plasmids of interest.

MATERIALS AND METHODS

Software

Genome analyses were conducted using the G-language Genome Analysis Environment version 1.8.3 (33,34), available at http://www.g-language.org/. Statistical tests and graphics were implemented in R version 2.6.0 (35), available at http://www.r-project.org/.

Genome sequences

Completely sequenced genomes in GenBank format (36) of bacterial plasmids and their corresponding host chromosomes were downloaded from the NCBI ftp site (ftp://ftp.ncbi.nih.gov/genomes/Bacteria). In cases where the host has multiple chromosomes, only the largest chromosome was used for the analysis because the definition of second chromosomes versus megaplasmids is still unclear (20). The final data set included 504 plasmids and 230 chromosomes. Complete listings of the plasmid genomes used in this study are shown in Supplementary Table S1.

Analyses

Representation of dinucleotide composition of a DNA sequence

Dinucleotide composition of a DNA sequence was represented by a vector, which consists of 16 dinucleotide relative abundance values (29). The dinucleotide relative abundance value (x) is defined as the observed dinucleotide frequency divided by the expected dinucleotide frequency, which is the product of the component mononucleotide frequencies:

$$x_{ij} = \frac{f_{ij}}{f_i f_j}$$

where $f_i$ and $f_j$ denote the frequency of the mononucleotide i and j, respectively (i, j ∈ {A, C, G, T}) and $f_{ij}$ the frequency of the dinucleotide ij. These values combine counts from both strands of the sequence.
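The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the G-language implementation used in the study: the function names are hypothetical, and it assumes that "both strands" means pooling counts from the sequence and its reverse complement, and that characters other than A, C, G and T are skipped.

```python
from itertools import product

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def relative_abundance(seq):
    """Return {dinucleotide: x_ij}, where x_ij = f_ij / (f_i * f_j).

    Counts are pooled over the given strand and its reverse complement
    (an assumption about how the two strands are combined).
    """
    seq = seq.upper()
    mono = dict.fromkeys(BASES, 0)
    di = {a + b: 0 for a, b in product(BASES, repeat=2)}
    for strand in (seq, reverse_complement(seq)):
        for base in strand:
            if base in mono:
                mono[base] += 1
        for k in range(len(strand) - 1):
            pair = strand[k:k + 2]
            if pair in di:  # skips pairs containing ambiguous bases
                di[pair] += 1
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    if n_mono == 0 or n_di == 0:
        raise ValueError("sequence contains no countable dinucleotides")
    x = {}
    for pair, count in di.items():
        expected = (mono[pair[0]] / n_mono) * (mono[pair[1]] / n_mono)
        # Undefined when a component mononucleotide is absent.
        x[pair] = (count / n_di) / expected if expected > 0 else float("nan")
    return x


signature = relative_abundance("ACGTACGTTTGACCA")
```

Because counts are pooled over both strands, each dinucleotide and its reverse complement receive the same value in this sketch (for example `signature["AA"] == signature["TT"]`).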

Measurement of dinucleotide composition difference between DNA sequences

A plasmid genome sequence was compared with non-overlapping 5-kb segments spanning a bacterial chromosomal sequence. The difference in dinucleotide composition, or ‘genomic signature’, between the plasmid and chromosome sequences was quantified by the following two distance measures. A high value of these distances represents a large difference in genomic signature between the DNA sequences. First, the average absolute dinucleotide relative abundance difference (δ-distance) between a plasmid and a set of 5-kb chromosomal segments was calculated as:

$$\delta = \frac{1}{16} \sum_{ij} \left| x_{ij} - \bar{y}_{ij} \right|$$

where $x_{ij}$ is the relative abundance value of the dinucleotide ij for the plasmid, $\bar{y}_{ij}$ is the mean of relative abundance values of dinucleotide ij calculated from the chromosomal segments, and the sum extends over all 16 dinucleotides. Second, the Mahalanobis distance ($D^2$) between a plasmid and a set of 5-kb chromosomal segments was calculated as:

$$D^2 = (X - Y)^{T} S^{-1} (X - Y)$$

where X is a vector of dinucleotide relative abundance values for a plasmid, Y is a mean vector of dinucleotide relative abundance values calculated from the chromosomal segments, S is the variance–covariance matrix of the dinucleotide relative abundances calculated from the chromosomal segments ($S^{-1}$ is the inverse matrix of S), and the superscript T is the transposition operator. In the S matrix, the variance values indicate the degree of variability in relative abundance of each dinucleotide among the chromosome segments, and the covariance values reflect the correlations among relative abundances of the dinucleotides. The higher the correlation, the greater the covariance, so that multiplying the difference vector (X − Y) by the inverse of S reduces the influence of highly correlated dinucleotides. Because the dinucleotides are counted on both strands, there is a high degree of correlation between the reverse complements of each dinucleotide (i.e. CC/GG, TT/AA, TG/CA, AG/CT, AC/GT and GA/TC).
The Mahalanobis distance therefore adjusts for double counting dinucleotides and takes into account the variability along the chromosome due to the presence of genomic islands, etc. To provide a P-value for the distance values (Mahalanobis or δ-distance), an empirical distribution of the distances for each chromosome segment was derived for each chromosome. This empirical distribution was constructed using each 5-kb chromosome segment as x or X in the equations for the δ- or Mahalanobis distance, respectively. The advantage of using the P-values is that it places all values between 0 and 1, whereas Mahalanobis distance and δ-distance have no upper bound. P-values close to 1 indicate small distances and similar dinucleotide compositions between a plasmid and chromosome, whereas P-values close to 0 indicate large distances and dissimilar dinucleotide compositions between a plasmid and chromosome.
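The two distances and the empirical P-value described in this section can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: all function names are hypothetical, the covariance uses the n − 1 denominator, a trailing chromosome piece shorter than the segment size is dropped, and the P-value is taken as the fraction of chromosomal segments whose own distance is at least as large as the plasmid's. The example uses a toy two-variable data set rather than 16 dinucleotides. With real signatures counted on both strands, reverse-complement pairs can be exactly equal, so S may be singular; the text does not say how this was handled, and the sketch simply raises an error in that case.

```python
def split_segments(seq, size=5000):
    """Non-overlapping segments; a short trailing piece is dropped (assumption)."""
    return [seq[k:k + size] for k in range(0, len(seq) - size + 1, size)]


def mean_vector(segments):
    n = len(segments)
    return [sum(col) / n for col in zip(*segments)]


def covariance_matrix(segments):
    """Sample variance-covariance matrix S (n - 1 denominator, an assumption)."""
    n, mu = len(segments), mean_vector(segments)
    p = len(mu)
    return [[sum((s[a] - mu[a]) * (s[b] - mu[b]) for s in segments) / (n - 1)
             for b in range(p)] for a in range(p)]


def solve(matrix, rhs):
    """Solve matrix * z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[k]] for k, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z


def delta_distance(x, segments):
    """Mean absolute difference between x and the segment mean."""
    mu = mean_vector(segments)
    return sum(abs(a - b) for a, b in zip(x, mu)) / len(mu)


def mahalanobis_sq(x, segments):
    """D^2 = (X - Y)^T S^-1 (X - Y), computed without forming S^-1 explicitly."""
    mu = mean_vector(segments)
    d = [a - b for a, b in zip(x, mu)]
    z = solve(covariance_matrix(segments), d)
    return sum(a * b for a, b in zip(d, z))


def empirical_p(distance_fn, x, segments):
    """Fraction of segments at least as distant from the mean as x (assumption)."""
    observed = distance_fn(x, segments)
    null = [distance_fn(s, segments) for s in segments]
    return sum(v >= observed for v in null) / len(null)


# Toy data: four "segments" described by two positively correlated variables.
segments = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
along = [4.0, 4.0]    # deviates from the mean along the correlation
against = [4.0, 1.0]  # deviates from the mean against the correlation
```

In this toy case `delta_distance` gives 1.5 for both `along` and `against`, whereas `mahalanobis_sq` gives 1.6875 and 6.75 respectively: the δ-distance cannot tell the two apart, while the Mahalanobis distance penalizes the deviation that runs against the covariance structure of the segments.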

Performance comparison of Mahalanobis distance and δ-distance

The Mahalanobis distance and δ-distance values between each of 504 plasmids and 230 chromosomes were determined. These values were then used to rank different chromosomes with respect to similarity to a given plasmid: a chromosome ranking the highest (1) was most similar in dinucleotide composition and that ranking lowest (230) was most dissimilar. Since each plasmid was originally sequenced as part of a whole-genome sequencing project for a particular bacterial strain, each plasmid had one ‘known’ host. It was previously shown that the dinucleotide composition similarities between plasmids and the chromosomes of their known host tend to rank high (11). The performances of the Mahalanobis and δ-distances were evaluated for their ability to rank the known host chromosome highest among 230 bacterial chromosomes.
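The ranking step can be illustrated with a short sketch. The helper name is hypothetical, and the handling of ties (first occurrence wins) is an assumption, since the text does not specify it. The distances are the Mahalanobis values reported for plasmid pXO1 in Table 1, restricted to five strains for brevity.

```python
def rank_of_known_host(distances, known_host):
    """Return the 1-based rank of known_host; rank 1 is the smallest distance.

    distances maps chromosome name -> distance to the plasmid.
    """
    ordered = sorted(distances, key=distances.get)
    return ordered.index(known_host) + 1


# Mahalanobis distances from pXO1 (values taken from Table 1).
pxo1_distances = {
    "Bacillus anthracis str. Ames Ancestor": 3.26,
    "Bacillus cereus ATCC 14579": 2.17,
    "Bacillus thuringiensis str. Al Hakam": 2.83,
    "Bacillus cereus E33L": 2.86,
    "Bacillus thuringiensis serovar konkukian str. 97-27": 2.91,
}
rank = rank_of_known_host(pxo1_distances, "Bacillus anthracis str. Ames Ancestor")
```

Here `rank` is 5, matching the rank reported for the known host of pXO1 under the Mahalanobis distance.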

RESULTS

Performance of Mahalanobis and δ-distances in identifying plasmid hosts based on genomic signature similarity

Compared to the δ-distance method, the Mahalanobis distance is an improved measure of the similarity in dinucleotide composition (here briefly called ‘genomic signature’) between plasmids and chromosomes because it takes into account variability of the data along the chromosome and correlations among dinucleotide relative abundance values on both DNA strands. Our first goal was to compare the ability of the Mahalanobis and δ-distance methods to identify among 230 bacteria the ‘known host’ of 504 plasmids, that is, the host in which the plasmid was found. Even though some plasmids may not have evolved in their known host, it was previously shown that dinucleotide composition similarities between plasmids and host chromosomes usually tend to rank high (11). Performance was measured by ranking the distances in genomic signature between each plasmid and all 230 chromosomes. Rank 1 represents the plasmid–host pair with the most similar signature (‘highest ranking’), and 230 corresponds to the most dissimilar pair. The distribution of ranks for all 504 plasmids and their known hosts is presented in Figure 1 as histograms, based on the Mahalanobis distance (Figure 1A) and δ-distance (Figure 1B) (see Supplementary Table S1 for details). Both distributions were heavily skewed toward high ranks, supporting previous findings that the similarities in signatures between each plasmid and its known host are among the highest (11). Of the 504 pairs of plasmids and their known hosts, 159 (32%) ranked number 1 based on the Mahalanobis distance, while only 94 (19%) ranked 1 based on the δ-distance. Thus for about one in three plasmids, the Mahalanobis distance identified the known plasmid host as the host with the most similar dinucleotide abundance among 230 strains. In 63% of cases, the known host ranked higher with the Mahalanobis distance than with the δ-distance; in 17% of cases, it ranked higher with the δ-distance; and in 20% of cases, the two measures gave the same rank.
The median value of the ranks based on the Mahalanobis distances (four) was higher than that based on the δ-distance (eight). A Wilcoxon signed rank test, which compared the ranks based on the two methods, was highly significant (P < 2.2 × 10^-16). Thus, in general, the genomic signature similarities between plasmids and their known hosts tended to rank higher when using the Mahalanobis distance than when using the δ-distance.
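The paired comparison of the two methods can be summarized with a few lines of Python. The function name is hypothetical and the rank lists are made-up illustrative values, not data from the study. The sketch reports only the proportions and medians; the Wilcoxon signed rank test itself is not implemented here (in Python it is available as `scipy.stats.wilcoxon`).

```python
from statistics import median


def compare_ranks(ranks_a, ranks_b):
    """Summarize paired known-host ranks from two methods (lower rank = better)."""
    n = len(ranks_a)
    a_better = sum(a < b for a, b in zip(ranks_a, ranks_b))
    b_better = sum(a > b for a, b in zip(ranks_a, ranks_b))
    return {
        "a_better": a_better / n,
        "b_better": b_better / n,
        "tied": (n - a_better - b_better) / n,
        "median_a": median(ranks_a),
        "median_b": median(ranks_b),
    }


# Hypothetical known-host ranks for five plasmids (illustration only).
mahalanobis_ranks = [1, 4, 2, 10, 1]
delta_ranks = [3, 8, 2, 6, 1]
summary = compare_ranks(mahalanobis_ranks, delta_ranks)
```

For these toy lists the first method ranks the known host better in 40% of pairs, worse in 20% and equal in 40%, with medians of 2 and 3.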
Figure 1.

Histograms showing the distribution of ranks of genomic signature similarities between 504 plasmids and their known hosts based on Mahalanobis distance (A) and δ-distance (B).

We tested the robustness of our result that Mahalanobis distance performs better than the conventional δ-distance by varying chromosomal segment size, word size (e.g. tri- and tetra-nucleotides) (12,29,37) and composition of the host data set. Campbell et al. (11) and Wong et al. (20) used 50-kb chromosomal segments, while van Passel et al. (25) used plasmid-size chromosomal segments; that is, the segment size was not fixed but depended on the plasmid size. In the present study, we used a fixed segment size. We found that our results remained similar when partitioning chromosomal sequences into different sizes of segments; e.g. 2, 5, 10 and 20 kb in length (data not shown). Among these, 5 kb was selected because the median rank of the genomic signature similarities between plasmids and their known hosts was maximized. The results were also consistent with those obtained when using relative abundances of different word sizes (data not shown). We also demonstrated that our results were robust to the composition of the host data set by testing different subsets of bacteria, e.g. Proteobacteria, Firmicutes, Gram-positive bacteria, Gram-negative bacteria and also when only one representative was randomly selected in the case of species for which multiple strains have been sequenced (data not shown). Thus, Mahalanobis distance performed better than δ-distance in identifying plasmid hosts based on genomic signature similarity, regardless of the datasets used.

Degree of similarity in genomic signature between a plasmid and its host

The ranks of the Mahalanobis distances indicate the similarity in genomic signature between a plasmid and its host relative to all other bacteria in the data set, but do not provide a measure of the degree of similarity within the genomes, and are very much biased by the available genome sequences. Additionally, the Mahalanobis distance between a specific plasmid and a host chromosome has no upper bound and is therefore hard to interpret. Therefore we expressed the degree of similarity in genomic signature between a plasmid and a particular host chromosome with a more intuitively meaningful value. The Mahalanobis distance between a plasmid and the collective set of chromosomal segments of each host was converted to a P-value associated with the empirical distribution of the Mahalanobis distances between the individual and the complete set of chromosomal segments of that host. Likewise, the δ-distance between a plasmid and a set of chromosomal segments was also converted to the P-value associated with an empirical distribution of the δ-distances between individual chromosomal segments and the mean for all chromosomal segments. P-values close to zero reflect large distances and dissimilar genomic signatures between plasmid and chromosome, whereas values close to one reflect small distances and similar genomic signatures. Plasmids whose hosts ranked first did not necessarily have P-values close to 1. For example, using the Mahalanobis distance, the known host of pXFPD1.3, Xylella fastidiosa Temecula1, ranked first among the 230 bacteria used in this study but had a P < 0.05 (Supplementary Table S1). This suggests that even though this host ranked first there must be other hosts, not included in this study, that have genomic signatures much more similar to that of plasmid pXFPD1.3, and therefore are more likely the long-term host of this plasmid. 
When all plasmids were compared to all 230 bacterial chromosomes, each plasmid showed P < 0.05 with 143 or more bacteria (at least 62%), and the median number of hosts with P < 0.05 was 198 (86%). This indicates that the vast majority of the bacteria considered here can be rejected as potential long-term hosts of these plasmids. The distribution of P-values for all 504 plasmids and their known hosts is presented in Figure 2 as histograms, based on the Mahalanobis distance (Figure 2A) and δ-distance (Figure 2B) (see Supplementary Table S1 for details). In 78% of cases, the Mahalanobis distance had a higher P-value than the δ-distance, and in 20% of cases, the δ-distance had the higher P-value. The median value of the P-values based on the Mahalanobis distances (0.77) was higher than that based on the δ-distance (0.23). A Wilcoxon signed rank test comparing the P-values based on the Mahalanobis distance and the δ-distance was significant (P < 2.2 × 10^-16). Of the 504 plasmids tested here, 138 (27%) had P > 0.95 based on the Mahalanobis distance, while only 42 (8%) had P > 0.95 based on the δ-distance. This indicates that Mahalanobis distance more frequently than δ-distance identifies the known plasmid host as the one with the most similar genomic signature. Of the 504 plasmids tested here, 76 (15%) had P < 0.05 based on the Mahalanobis distance, while 82 (16%) had P < 0.05 based on the δ-distance. Thus, fewer than one in six plasmids showed a significantly different genomic signature from that of their host chromosome, and apparently have not (yet) ameliorated their genomes to that of the host they were found in.
Figure 2.

Histogram showing the distribution of P-values derived from Mahalanobis distances (A) and δ-distances (B) between 504 plasmids and their known hosts.


Correlation of plasmid-host genomic signature difference with plasmid size

We reassessed the correlation of the genomic signature differences between plasmids and their host chromosomes with plasmid size, which was analyzed by van Passel et al. (25). Among the 504 plasmids tested here, the genome size ranged from 1286 bp (Xylella fastidiosa 9a5c plasmid pXF1.3) to 2 094 509 bp (Ralstonia solanacearum GMI1000 plasmid pGMI1000MP). As shown in Figure 3, the Mahalanobis distance between a plasmid and its known host was negatively correlated with the plasmid genome size (Spearman's rank correlation coefficient, r = −0.72; P < 2 × 10^-16). Thus, the larger the plasmid, the more similar was its genomic signature to that of the host chromosome. This observation is unlikely to be caused by the greater sensitivity of small sequences to small changes in dinucleotide relative abundance values, because all plasmid genomes tested here were larger than 1 kb.
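Spearman's rank correlation, as used above, is the Pearson correlation of the ranks of the two variables. A minimal standard-library sketch follows; the function names are hypothetical, ties receive average ranks, and no P-value is computed. The smallest and largest plasmid sizes are those quoted in the text, while the intermediate sizes and all distance values are invented for illustration.

```python
def average_ranks(values):
    """Rank values from 1, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(u, v):
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation coefficient of two equal-length lists."""
    return pearson(average_ranks(x), average_ranks(y))


# Plasmid sizes in bp (first and last from the text, the rest hypothetical)
# and hypothetical Mahalanobis distances to the known host.
sizes = [1286, 5000, 50000, 200000, 2094509]
distances = [40.0, 22.5, 9.1, 5.0, 3.2]
rho = spearman(sizes, distances)
```

Because the toy distances decrease strictly as size increases, `rho` is −1 here; real data would give an intermediate value such as the −0.72 reported above.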
Figure 3.

Plot of Mahalanobis distances between 504 plasmids and their known hosts, against plasmid sizes.

Our result is in contrast with van Passel et al. (25), who concluded that there is a positive relationship between plasmid size and difference between plasmid and host in genomic signature. It is hard to define exactly why our results are contradictory because there are several differences between our study and the former study [the data set (we did not exclude plasmids >100 kb, as was done in the previous study), chromosomal segment size (5 kb versus plasmid-size), and distance metric]. We examined the effect of plasmid size, chromosomal segment size and distance metric. First, in our dataset 35% of the plasmids were >100 kb, but the negative correlation between Mahalanobis distance and plasmid size was still observed when applied only to plasmids <100 kb. Second, the negative correlation was also observed when using δ-distance instead of Mahalanobis distance. Third, when we used plasmid-size chromosomal segments and our data set, there was no longer a significant correlation between plasmid size and either the Mahalanobis distance or the P-value associated with the distance (data not shown). Thus, we conclude that when taking into account variability along the chromosome by using the Mahalanobis distance method and fixed chromosomal segment sizes (5–20 kb), larger plasmids tend to be more similar in genomic signature to their host than smaller plasmids.

Proposing candidate long-term hosts of virulence plasmids

To illustrate our analysis, we compared the two methods in detail by focusing on plasmids derived from Bacillus anthracis ‘Ames Ancestor’ as examples. B. anthracis—the causative agent of anthrax—is a member of the B. cereus sensu lato group (B. cereus, B. anthracis and B. thuringiensis). The Ames Ancestor strain contains two plasmids, pXO1 and pXO2, which code for toxin production and encapsulation, respectively (38,39), and completely define the pathogenic potential of B. anthracis. Table 1 lists the 10 highest ranking bacterial strains based upon their Mahalanobis distance and δ-distance from plasmid pXO1. Seven bacterial strains of the B. cereus sensu lato group showed the smallest Mahalanobis distance from plasmid pXO1, and all top-10 strains are members of the Firmicutes (Table 1). In contrast, the top-10 smallest δ-distances from pXO1 were found among strains of the Cyanobacteria, Proteobacteria, Bacteroidetes, and Firmicutes. Given the known narrow host range of this plasmid (S. Khan, personal communication) (40), the first three phyla identified by δ-distance are unlikely to be true hosts (Table 1). Thus, among the top-10 ranking strains the Mahalanobis distance only identified strains that are very plausible hosts of the plasmids, while the δ-distance identified several bacteria in which the plasmids have not been shown to replicate. The known host of pXO1, B. anthracis Ames Ancestor, ranked 5 based on the Mahalanobis distance and only 12 based on the δ-distance (Table 1). The P-value for plasmid pXO1 and its known host B. anthracis Ames Ancestor was higher (0.99) when using the Mahalanobis distance than when using the δ-distance (0.38). Thus, the known host of plasmid pXO1 ranked higher and was estimated to be more similar in genomic signature to pXO1 when using the Mahalanobis distance than when using the δ-distance.
Table 1.

Ten highest ranking bacterial strains based on Mahalanobis distance and δ-distance for plasmid pXO1 from B. anthracis str. Ames Ancestor

Bacterial strain | Phylum | D2 | P(D2) | δ | P(δ)
Sorted by Mahalanobis distance
    Bacillus cereus ATCC 14579 | Firmicutes | 2.17 | 0.994 | 47.5 | 0.517
    Bacillus thuringiensis str. Al Hakam | Firmicutes | 2.83 | 0.987 | 50.6 | 0.415
    Bacillus cereus E33L | Firmicutes | 2.86 | 0.986 | 50.0 | 0.415
    Bacillus thuringiensis serovar konkukian str. 97-27 | Firmicutes | 2.91 | 0.989 | 51.0 | 0.423
    Bacillus anthracis str. Ames Ancestor^a | Firmicutes | 3.26 | 0.988 | 52.2 | 0.379
    Bacillus cereus ATCC 10987 | Firmicutes | 3.42 | 0.977 | 50.9 | 0.406
    Bacillus cereus subsp. cytotoxis NVH 391-98 | Firmicutes | 4.44 | 0.960 | 59.2 | 0.226
    Lactobacillus salivarius UCC118 | Firmicutes | 5.43 | 0.926 | 50.8 | 0.378
    Staphylococcus epidermidis RP62A | Firmicutes | 5.72 | 0.887 | 55.5 | 0.306
    Staphylococcus epidermidis ATCC 12228 | Firmicutes | 7.03 | 0.858 | 57.3 | 0.216
Sorted by δ-distance
    Nostoc sp. PCC 7120 | Cyanobacteria | 53.01 | 0.018 | 41.7 | 0.440
    Anabaena variabilis ATCC 29413 | Cyanobacteria | 62.56 | 0.012 | 42.7 | 0.429
    Buchnera aphidicola str. APS (Acyrthosiphon pisum) | Proteobacteria | 19.64 | 0.211 | 43.1 | 0.523
    Bacillus cereus ATCC 14579 | Firmicutes | 2.17 | 0.994 | 47.5 | 0.517
    Bacillus cereus E33L | Firmicutes | 2.86 | 0.986 | 50.0 | 0.415
    Bacillus thuringiensis str. Al Hakam | Firmicutes | 2.83 | 0.987 | 50.6 | 0.415
    Lactobacillus salivarius UCC118 | Firmicutes | 5.43 | 0.926 | 50.8 | 0.378
    Bacillus cereus ATCC 10987 | Firmicutes | 3.42 | 0.977 | 50.9 | 0.406
    Bacillus thuringiensis serovar konkukian str. 97-27 | Firmicutes | 2.91 | 0.989 | 51.0 | 0.423
    Bacteroides fragilis NCTC 9343 | Bacteroidetes | 9.91 | 0.711 | 51.2 | 0.297

D2, Mahalanobis distance; P(D2), P-value based on Mahalanobis distance; δ, δ-distance; P(δ), P-value based on δ-distance.

aKnown host.

The P-values are not completely negatively correlated with the distances because they are based on empirical distributions that differ between bacterial chromosomes.

To further illustrate the utility of the Mahalanobis distance measure, we focused on enterohemorrhagic Escherichia coli O157:H7, which is the predominant causative agent of hemorrhagic colitis, a bloody diarrhea. The virulence plasmid pO157 is an F-like plasmid found in most O157:H7 strains, and carries a number of potential virulence genes (41). Table 2 lists the 10 highest ranking bacterial strains based upon their Mahalanobis distance and δ-distance for plasmid pO157 of the E. coli strain O157:H7 EDL933. The top-10 strains based on the Mahalanobis distance were all members of the Proteobacteria, more specifically of the family Enterobacteriaceae, which are the typical hosts of narrow-host-range plasmids with F-like replication and maintenance features such as pO157 (42). In contrast, the top-10 strains based on the δ-distance included a Staphylococcus saprophyticus strain belonging to the Firmicutes, a very unlikely host of this plasmid. The known host of pO157 ranked 10 based on the Mahalanobis distance and only 42 based on the δ-distance. The P-value for plasmid pO157 and its known host was much higher (0.93) when using the Mahalanobis distance than when using the δ-distance (0.24). Interestingly, plasmid pO157 showed a smaller Mahalanobis distance (and higher P-values, ranging between 0.95 and 0.97) with seven Yersinia pestis and two Y. pseudotuberculosis strains than with the E. coli host strain. This suggests a potential long-term relationship between pO157-like plasmids and Yersinia, and requires further study.
Table 2.

Ten highest ranking bacterial strains based on Mahalanobis distance and δ-distance for plasmid pO157 from E. coli O157:H7 EDL933

Bacterial strain | Phylum | D2 | P(D2) | δ | P(δ)
Sorted by Mahalanobis distance
    Yersinia pestis Antiqua | Proteobacteria | 3.79 | 0.972 | 32.8 | 0.724
    Yersinia pestis KIM | Proteobacteria | 4.31 | 0.962 | 33.2 | 0.704
    Yersinia pestis Angola | Proteobacteria | 4.33 | 0.962 | 32.8 | 0.700
    Yersinia pestis CO92 | Proteobacteria | 4.41 | 0.966 | 33.3 | 0.696
    Yersinia pestis biovar Microtus str. 91001 | Proteobacteria | 4.42 | 0.968 | 33.7 | 0.687
    Yersinia pestis Nepal516 | Proteobacteria | 4.44 | 0.967 | 33.2 | 0.713
    Yersinia pestis Pestoides F | Proteobacteria | 4.49 | 0.963 | 33.6 | 0.673
    Yersinia pseudotuberculosis IP 31758 | Proteobacteria | 4.62 | 0.948 | 34.3 | 0.667
    Yersinia pseudotuberculosis IP 32953 | Proteobacteria | 4.73 | 0.955 | 35.3 | 0.631
    Escherichia coli O157:H7 EDL933^a | Proteobacteria | 5.51 | 0.926 | 59.3 | 0.238
Sorted by δ-distance
    Yersinia pestis Antiqua | Proteobacteria | 3.79 | 0.972 | 32.8 | 0.724
    Yersinia pestis Angola | Proteobacteria | 4.33 | 0.962 | 32.8 | 0.700
    Yersinia pestis Nepal516 | Proteobacteria | 4.44 | 0.967 | 33.2 | 0.713
    Yersinia pestis KIM | Proteobacteria | 4.31 | 0.962 | 33.2 | 0.704
    Yersinia pestis CO92 | Proteobacteria | 4.41 | 0.966 | 33.3 | 0.696
    Yersinia pestis Pestoides F | Proteobacteria | 4.49 | 0.963 | 33.6 | 0.673
    Yersinia pestis biovar Microtus str. 91001 | Proteobacteria | 4.42 | 0.968 | 33.7 | 0.687
    Yersinia pseudotuberculosis IP 31758 | Proteobacteria | 4.62 | 0.948 | 34.3 | 0.667
    Yersinia pseudotuberculosis IP 32953 | Proteobacteria | 4.73 | 0.955 | 35.3 | 0.631
    Staphylococcus saprophyticus subsp. saprophyticus ATCC 15305 | Firmicutes | 338.76 | 0.000 | 40.6 | 0.525

D2, Mahalanobis distance; P(D2), P-value based on Mahalanobis distance; δ, δ-distance; P(δ), P-value based on δ-distance.

aKnown host.

See Table 1 legend for explanation of P-values
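
As an illustration of how a δ value of the kind tabulated above could be obtained, the following Python sketch computes dinucleotide relative abundances for two sequences and averages their absolute differences over the 16 dinucleotides. It is a minimal sketch and not the authors' code: the definition rho_xy = f_xy / (f_x f_y), the pooling of both strands, the scaling by 1000 and all function names are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
DINUCLEOTIDES = ["".join(pair) for pair in product(BASES, repeat=2)]
DINUCLEOTIDE_SET = set(DINUCLEOTIDES)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement; non-ACGT symbols are left unchanged."""
    return seq.translate(COMPLEMENT)[::-1]


def relative_abundance(seq: str, both_strands: bool = True) -> dict:
    """Dinucleotide relative abundance rho_xy = f_xy / (f_x * f_y).

    Assumption: when both_strands is True, counts are pooled over the
    sequence and its reverse complement.
    """
    seq = seq.upper()
    strands = [seq, reverse_complement(seq)] if both_strands else [seq]
    mono, di = Counter(), Counter()
    for strand in strands:
        mono.update(base for base in strand if base in BASES)
        pairs = (strand[i:i + 2] for i in range(len(strand) - 1))
        di.update(pair for pair in pairs if pair in DINUCLEOTIDE_SET)
    mono_total = sum(mono.values())
    di_total = sum(di.values())
    if mono_total == 0 or di_total == 0:
        raise ValueError("sequence too short or contains no A, C, G, T")
    rho = {}
    for xy in DINUCLEOTIDES:
        fx = mono[xy[0]] / mono_total
        fy = mono[xy[1]] / mono_total
        fxy = di[xy] / di_total
        # A base that never occurs makes rho undefined; report 0.0 here.
        rho[xy] = fxy / (fx * fy) if fx * fy > 0 else 0.0
    return rho


def delta_distance(seq_a: str, seq_b: str, scale: float = 1000.0) -> float:
    """Average absolute relative-abundance difference (scale is assumed)."""
    rho_a = relative_abundance(seq_a)
    rho_b = relative_abundance(seq_b)
    total = sum(abs(rho_a[xy] - rho_b[xy]) for xy in DINUCLEOTIDES)
    return scale * total / len(DINUCLEOTIDES)
```

Because every dinucleotide contributes with equal weight, this measure ignores how much each relative abundance normally varies along a chromosome, which is the limitation the Mahalanobis distance is meant to address.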


Proposing candidate long-term hosts of multi-drug resistance plasmids

Since at least 50% of the plasmid–host pairs ranked first, second, third or fourth based upon the Mahalanobis distance, we can use this method to propose putative hosts for plasmids from unknown hosts. Therefore, we compared the Mahalanobis and δ-distance measures for a broad host range plasmid for which the host is not known, but whose host range has been investigated experimentally. Several broad host range plasmids of the IncP-1 group have been captured from bacterial communities using ‘exogenous plasmid isolation’ methods (43,44), which do not retrieve the plasmid host. Empirical work has shown that IncP-1 plasmids such as pB10 (45) can transfer and replicate in most species of Gram-negative bacteria, predominantly α-, β- and γ-Proteobacteria, and typically not in bacteria outside of these groups (46,47). We compared pB10's genomic signature with 663 complete bacterial chromosome sequences available from the NCBI ftp site (ftp://ftp.ncbi.nih.gov/genomes/Bacteria) in August 2008. Table 3 lists the 10 highest ranking bacterial strains based upon their Mahalanobis distance and δ-distance from plasmid pB10. The top-10 strains based on the Mahalanobis distance are members of the β- (first 9) and γ-Proteobacteria with P-values ranging between 0.80 and 0.98. The top-10 strains based on the δ-distance included strains belonging to the Actinobacteria and the Chlorobi group. The results from the Mahalanobis distances are consistent with the known host range of IncP-1 plasmids, while those from the δ-distance are not (46,47). Moreover, Ralstonia eutropha JMP134 (currently reclassified as Cupriavidus necator) is the known host of the IncP-1 plasmid pJP4, which is very closely related to plasmid pB10 (45). This strain ranked 14th in our comparison with pB10 using the Mahalanobis distance, and only 38th when the δ-distance was used (data not shown).
Nine of the top-10 strains had a higher G+C content than plasmid pB10 (64.2%) when the ranking was based on the Mahalanobis distance, compared to only four based on the δ-distance. Taking into account the finding that plasmids tend to have a lower G+C content than their hosts (25,48), the Mahalanobis distance again performed better than the δ-distance in identifying potential hosts of plasmid pB10. In conclusion, the Mahalanobis distance more correctly assessed the potential range of long-term hosts of this drug resistance plasmid than the δ-distance.
Table 3.

Ten highest ranking bacterial strains based on Mahalanobis distance and mean δ-distance for broad host range plasmid pB10 from an unknown host (%GC = 64.2)

Bacterial strain | Phylum | D2 | P(D2) | δ | P(δ) | %GC
Sorted by Mahalanobis distance
    Chromobacterium violaceum ATCC 12472 | Proteobacteria | 3.07 | 0.984 | 44.1 | 0.604 | 64.8
    Ralstonia eutropha H16 | Proteobacteria | 6.05 | 0.923 | 69.6 | 0.133 | 66.5
    Ralstonia solanacearum GMI1000 | Proteobacteria | 6.10 | 0.886 | 73.9 | 0.152 | 67.0
    Azoarcus sp. EbN1 | Proteobacteria | 6.32 | 0.922 | 93.6 | 0.061 | 65.1
    Bordetella petrii DSM 12804 | Proteobacteria | 6.44 | 0.902 | 66.0 | 0.226 | 65.5
    Azoarcus sp. BH72 | Proteobacteria | 7.31 | 0.856 | 47.3 | 0.481 | 67.9
    Dechloromonas aromatica RCB | Proteobacteria | 7.45 | 0.863 | 33.9 | 0.687 | 59.2
    Acidovorax avenae subsp. citrulli AAC00-1 | Proteobacteria | 7.46 | 0.844 | 64.4 | 0.243 | 68.5
    Bordetella bronchiseptica RB50 | Proteobacteria | 7.84 | 0.802 | 76.8 | 0.098 | 68.1
    Xanthomonas campestris pv. vesicatoria str. 85-10 | Proteobacteria | 8.09 | 0.823 | 93.9 | 0.062 | 64.7
Sorted by δ-distance
    Dechloromonas aromatica RCB | Proteobacteria | 7.45 | 0.863 | 33.9 | 0.687 | 59.2
    Renibacterium salmoninarum ATCC 33209 | Actinobacteria | 116.15 | 0.000 | 36.4 | 0.471 | 56.3
    Pseudomonas fluorescens PfO-1 | Proteobacteria | 13.02 | 0.445 | 39.7 | 0.547 | 60.5
    Chromobacterium violaceum ATCC 12472 | Proteobacteria | 3.07 | 0.984 | 44.1 | 0.604 | 64.8
    Chlorobaculum parvum NCIB 8327 | Chlorobi | 44.06 | 0.037 | 44.1 | 0.379 | 55.8
    Pseudomonas aeruginosa PAO1 | Proteobacteria | 10.74 | 0.657 | 44.1 | 0.482 | 66.6
    Pseudomonas aeruginosa UCBPP-PA14 | Proteobacteria | 8.28 | 0.790 | 44.4 | 0.486 | 66.3
    Pseudomonas syringae pv. syringae B728a | Proteobacteria | 30.23 | 0.065 | 46.5 | 0.369 | 59.2
    Pseudomonas aeruginosa PA7 | Proteobacteria | 9.05 | 0.758 | 46.9 | 0.432 | 66.4
    Hahella chejuensis KCTC 2396 | Proteobacteria | 31.08 | 0.077 | 47.0 | 0.353 | 53.9

D2, Mahalanobis distance; P(D2), P-value based on Mahalanobis distance; δ, δ-distance; P(δ), P-value based on δ-distance; %GC, genome G + C content defined as 100 × (G + C)/(A + T + G + C).

See Table 1 legend for explanation of P-values.
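
The G+C comparison made for Table 3 can be reproduced with a few lines of Python. The sketch below implements the %GC definition from the legend and counts how many of the listed strains exceed the G+C content of pB10. The %GC values are copied from Table 3, while the function and variable names are hypothetical.

```python
def gc_percent(seq: str) -> float:
    """Genome G + C content: 100 * (G + C) / (A + T + G + C)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * gc / (gc + at)


def count_higher_gc(host_gc_values, plasmid_gc: float) -> int:
    """Number of candidate hosts whose %GC exceeds that of the plasmid."""
    return sum(1 for value in host_gc_values if value > plasmid_gc)


PB10_GC = 64.2  # %GC of plasmid pB10, from the Table 3 title

# %GC column of Table 3, in the order listed under each ranking.
GC_TOP10_MAHALANOBIS = [64.8, 66.5, 67.0, 65.1, 65.5, 67.9, 59.2, 68.5, 68.1, 64.7]
GC_TOP10_DELTA = [59.2, 56.3, 60.5, 64.8, 55.8, 66.6, 66.3, 59.2, 66.4, 53.9]

print(count_higher_gc(GC_TOP10_MAHALANOBIS, PB10_GC))  # prints 9
print(count_higher_gc(GC_TOP10_DELTA, PB10_GC))        # prints 4
```

The two counts match the "nine" and "four" quoted in the text for the Mahalanobis and δ-distance rankings, respectively.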

DISCUSSION

Differences in dinucleotide composition (genomic signature) between bacterial chromosomes and plasmids have previously been measured using the δ-distance (11,20,25,49). Here, we introduce the Mahalanobis distance for this purpose. The Mahalanobis distance has an advantage over the δ-distance because it takes into account both variances and covariances among the 16 dinucleotide relative abundance values along the bacterial chromosome. We conclude that the Mahalanobis distance performs better than the δ-distance because it more often identified the known hosts of the plasmids as the host with greatest genomic signature similarity, and in contrast to δ-distance, its top-10 ranking strains were always plausible hosts based on empirical host range knowledge. We have converted the Mahalanobis distances to P-values so that we may provide more intuitively meaningful descriptions of the relationship between a plasmid and its known host. It is hard to know whether a distance of 20 reflects a large or a small difference in genomic signature, but it is easy to interpret that a P-value close to 1 indicates highly similar signatures, and values close to 0 indicate highly dissimilar signatures. We hypothesize that plasmid–host pairs with low Mahalanobis distances and very high P-values represent plasmids that have acquired the hosts’ signature through genome amelioration, and thus have been long-term residents of that host. More than a quarter of the plasmids tested here fit into this category, with P > 0.95 (Figure 2). They may have been exchanged rarely between hosts or only between closely related hosts with similar genomic signatures. Moreover, ∼85% of all plasmid–host pairs showed P > 0.05, indicating that the known host cannot be rejected as putative long-term host. 
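
The calculation described in the paragraph above can be sketched in Python using only the standard library. The sketch rests on assumptions that this section does not spell out: the chromosome is summarized as a set of signature vectors, one per sequence window; D2 is taken as the squared Mahalanobis distance from a query (plasmid) vector to the mean of those windows under their covariance; and the P-value is taken as the fraction of chromosome windows lying at least as far from the mean as the query. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def mean_vector(rows: Sequence[Vector]) -> List[float]:
    """Component-wise mean of the window signatures."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def covariance_matrix(rows: Sequence[Vector], mu: Vector) -> List[List[float]]:
    """Sample variance-covariance matrix (divisor n - 1)."""
    k = len(mu)
    cov = [[0.0] * k for _ in range(k)]
    for row in rows:
        dev = [row[i] - mu[i] for i in range(k)]
        for i in range(k):
            for j in range(k):
                cov[i][j] += dev[i] * dev[j]
    return [[value / (len(rows) - 1) for value in line] for line in cov]


def solve(a: Sequence[Vector], b: Vector) -> List[float]:
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    m = [list(a[i]) + [b[i]] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            factor = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * k
    for i in reversed(range(k)):
        rest = sum(m[i][j] * z[j] for j in range(i + 1, k))
        z[i] = (m[i][k] - rest) / m[i][i]
    return z


def mahalanobis_sq(x: Vector, mu: Vector, cov: Sequence[Vector]) -> float:
    """Squared Mahalanobis distance (x - mu)^T cov^-1 (x - mu)."""
    dev = [xi - mi for xi, mi in zip(x, mu)]
    return sum(d * z for d, z in zip(dev, solve(cov, dev)))


def distance_and_p_value(query: Vector,
                         windows: Sequence[Vector]) -> Tuple[float, float]:
    """Distance of the query to the chromosome and an empirical P-value.

    Assumption: P is the fraction of chromosome windows whose own distance
    to the chromosome mean is at least as large as that of the query.
    """
    mu = mean_vector(windows)
    cov = covariance_matrix(windows, mu)
    d_query = mahalanobis_sq(query, mu, cov)
    d_windows = [mahalanobis_sq(w, mu, cov) for w in windows]
    p_value = sum(1 for d in d_windows if d >= d_query) / len(d_windows)
    return d_query, p_value
```

Under this reading, a query close to the centre of the chromosome's own distribution gets a P-value near 1, and a query farther out than every window gets 0, which matches the interpretation given in the text. With real 16-dimensional signatures the covariance matrix can be close to singular, in which case a pseudo-inverse or a regularized estimate would be needed; the sketch simply raises an error.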
Coincident with the observation that many plasmids share their host's dinucleotide composition is the fact that large plasmids seem to be more similar in genomic signature to their hosts than small plasmids (Figure 3). Even though our results are in contradiction with those of van Passel et al. (25) (Results section), they can be explained based on known plasmid biology. Large plasmids are often not self-transferable, and even difficult to mobilize with a helper plasmid by conjugation, or they have a narrow host range (4,50). They are also less likely than smaller plasmids to persist as intact molecules outside a cell or to be taken up by transformation. It seems likely therefore that large plasmids spend most of their evolutionary time in a single bacterial host. For example, Sinorhizobium meliloti megaplasmid pSymB (1 683 333 bp) is not self-transmissible, and has not been successfully cured from S. meliloti (51). It has been suggested that pSymB has acquired genes essential to the host's viability and that its genomic signature became similar to that of the main chromosome due to long-term residence in that host (20). Similarly, large Rhizobium sym plasmids were shown to be strongly associated with specific chromosomal types, suggesting limited horizontal transfer (52,53). Only 15% of plasmids tested here showed P < 0.05 (Figure 2) due to a high Mahalanobis distance with their known host, suggesting significantly different genomic signatures. If genome amelioration is a general process, our results imply that 15% of our plasmids were recently acquired by the hosts they were found in. Based on the analysis of plasmid size versus plasmid-host signature similarity (Figure 3), these dissimilar plasmids are relatively small (smaller than 63 kb, Supplementary Table S1). 
Since smaller plasmids have a higher probability of being mobilized or transformed, or of actively transferring by conjugation, than very large plasmids, the strain they were found in may be only one of several recent hosts. The genomic signature analysis can thus generate hypotheses about the evolutionary history of plasmids, which can then be empirically tested. As more genome sequences of plasmids and their hosts become available, it should be possible in the near future to test the relationship between plasmid-host signature similarity and plasmid characteristics such as size, transferability and host range.

Another example of how the genomic signature analysis can be used to generate hypotheses about the evolutionary history of plasmids is shown in Table 1. The P-values for plasmid pXO1 and members of the B. cereus sensu lato group (B. cereus, B. anthracis and B. thuringiensis) were greater than 0.95, suggesting that not only the known host (B. anthracis) but also the other species (B. cereus and B. thuringiensis) could be the hosts of plasmid pXO1. This finding is strongly supported by the following two facts. First, the different species comprising the B. cereus sensu lato group are largely defined by differences in plasmids, while the chromosomes have been shown to be similar in both gene content and gene order (54,55). Second, conjugation studies have shown that plasmid pXO1 can be transferred from B. anthracis to B. cereus (56). A second example, as shown in Table 2, is plasmid pO157, which was more similar in genomic signature to Y. pestis, the causative agent of plague, than to its known host E. coli O157:H7. Moreover, pO157 is very similar in genomic signature to plasmid pCD1 of Y. pestis (data not shown). These results led us to hypothesize that plasmid pO157 was acquired by E. coli O157:H7 from Y. pestis. This is supported by evidence for the transfer of genes (57) and plasmids (58) between E. coli and Y. pestis.
The genomic signature analysis can also be used to propose candidate long-term hosts of plasmids taken directly from clinical or environmental samples using exogenous plasmid isolation methods (44) or metagenomic approaches (8). For example, the nine top-ranking bacteria identified as potential hosts of plasmid pB10 on the basis of their very similar genomic signatures are members of the β-Proteobacteria. The suggestion that β-Proteobacteria are the most likely hosts of pB10 is completely in agreement with the fact that other IncP-1 plasmids very similar to pB10 are often found in β-Proteobacteria (4). Thus, the genomic signature analysis using the Mahalanobis distance provides a list of potential hosts of plasmids, which can then be used to experimentally test the host range of that plasmid. It may also result in finding better cloning vectors for specific species.

An interesting observation is the low similarity in genomic signature between Buchnera aphidicola plasmids pTrp1 and pBBp1 and their B. aphidicola host chromosomes (Supplementary Table S1). Since B. aphidicola is an obligate intracellular symbiont of aphids, with which it has co-evolved, and since its genome shows little or no indication of horizontal gene acquisition (15), it is not expected to exchange much genetic material with distantly related bacteria. Yet, these two small plasmids must have been recently acquired by these two Buchnera strains from other Buchnera strains with very different genomic signatures, as previously suggested by van Passel et al. (25,26). The only alternative explanation is the absence of genome amelioration over evolutionary time, which would be exceptional considering that a third B. aphidicola plasmid, pLeu, shows clear indications of a genomic signature similar to that of its host chromosome.
This case is a good example of how results of the Mahalanobis distance analysis of dinucleotide relative abundance can form the basis of further research to better understand the evolutionary history of plasmids.

The approach we present here still has some limitations. First, an important caveat to this approach is the mosaic nature of plasmids, and the often large fraction of foreign accessory genes in plasmid genomes. A better approach might be to include only orthologous core genes (essential plasmid backbone genes involved in replication, maintenance and transfer) in this analysis. Determining this precise set of genes, however, can be difficult if phylogenetically closely related plasmids are not available (59–61). Even though we did not exclude accessory plasmid genes in our analysis, the similarity in genomic signature with that of the known plasmid hosts was still very high for many plasmids. Second, we cannot rule out the possibility that plasmids recently acquired by the host could have a genomic signature similar to that of the host by chance. To exclude these false positives, results should be double-checked by other criteria; for example, taking into account the finding that plasmids tend to have a lower G+C content than their hosts (25,48). We must also realize that several strains, even of different species, may have very similar genomic signatures, and that findings are always biased by which host genome sequences are available in the databases.

In conclusion, we showed that the Mahalanobis distance performs better than the conventional δ-distance in measuring genomic signature differences between plasmids and chromosomes of potential hosts.
In the future, the genomic signature analysis using the Mahalanobis distance can be applied to (i) inferring potential hosts of mobile genetic elements, including plasmids, phages and transposons obtained through cultivation-independent methods such as metagenomics and plasmid capture (5,8,62), and (ii) detecting anomalous genomic regions, e.g. genomic islands including pathogenicity islands that were recently acquired by horizontal transfer (18,22). The combined use of the genomic signature analysis and complementary analyses (e.g. analyses of G+C content, synonymous codon usage, amino acid usage and experimental evidence of host range) will improve our understanding of the potential long-term hosts and host range of plasmids, and how this is shaped by plasmid–host interactions.

SUPPLEMENTARY DATA

Supplementary data are available at NAR Online.

FUNDING

National Science Foundation (EF-0627988); National Institutes of Health (P20RR016454, P20RR16448). Funding for open access charges: National Science Foundation (EF-0627988).

Conflict of interest statement. None declared.