Multiple genome comparison within a bacterial species reveals a unit of evolution spanning two adjacent genes in a tandem paralog cluster.

Takeshi Tsuru, Ichizo Kobayashi

Abstract

It has been assumed that an open reading frame (ORF) represents a unit of gene evolution as well as a unit of gene expression and function. In the present work, we report a case in which a unit comprising the 3' region of an ORF linked to a downstream intergenic region that is in turn linked to the 5' region of a downstream ORF has been conserved, and has served as the unit of gene evolution. The genes are tandem paralogous genes from the bacterium Staphylococcus aureus, for which more than ten entire genomes have been sequenced. We compared these multiple genome sequences at a locus for the lpl (lipoprotein-like) cluster (encoding lipoprotein homologs presumably related to their host interaction) in the genomic island termed νSaα. A highly conserved nucleotide sequence found within every lpl ORF is likely to provide a site for homologous recombination. Comparison of phylogenies of the 5'-variable region and the 3'-variable region within the same ORF revealed significant incongruence. In contrast, pairs of the 3'-variable region of an ORF and the 5'-variable region of the next downstream ORF gave more congruent phylogenies, with distinct groups of conserved pairs. The intergenic region seemed to have coevolved with the flanking variable regions. Multiple recombination events at the central conserved region appear to have caused various types of rearrangements among strains, shuffling the two variable regions in one ORF, but maintaining a conserved unit comprising the 3'-variable region, the intergenic region, and the 5'-variable region spanning adjacent ORFs. This result has a strong impact on our understanding of gene evolution because most gene lineages underwent tandem duplication and then diversified. This work also illustrates the use of multiple genome sequences for high-resolution evolutionary analysis within the same species.

Year:  2008        PMID: 18765438      PMCID: PMC2568036          DOI: 10.1093/molbev/msn192

Source DB:  PubMed          Journal:  Mol Biol Evol        ISSN: 0737-4038            Impact factor:   16.240


Introduction

Gene duplication has long been recognized as an important mechanism in the evolution of genes. Since the essential role of gene duplication in the emergence of novel genes was proposed (Ohno 1970), numerous works have studied the mechanisms by which the duplicated genes diversified (Li 1997). Recent progress in genome sequencing has confirmed the significance of gene duplication by revealing a wide prevalence of multiple homologous genes within a genome, often referred to simply as paralogs (Lynch and Conery 2000; Friedman and Hughes 2001; Gevers et al. 2004). Comparison of multiple genome sequences has provided ample evidence that the present paralogs have undergone various patterns of diversification (Prince and Pickett 2002; Taylor and Raes 2004).

In bacteria, many paralogous gene groups have been shown to be involved in genome rearrangements that help the bacteria adapt to ever-changing environments. These processes are referred to as phase variation or antigenic variation, and the dynamics of paralog evolution in bacteria have been mostly examined with reference to these mechanisms (van der Woude and Baumler 2004; Villemur and Deziel 2005). Studies have revealed that these genes have become rearranged through simple processes such as inversion, deletion, or gene conversion, via various molecular mechanisms such as site-specific recombination, homologous recombination, or slipped-strand mispairing during DNA replication (Hughes and Norstrom 2005; Villemur and Deziel 2005). Many paralogous genes that have undergone these rearrangements have been discovered through whole-genome sequencing. Comparison of closely related prokaryotic genomes can help in the elucidation of the molecular mechanisms underlying the rearrangements of paralogous genes.

Paralogous genes are often present as a tandem cluster in prokaryotic genomes (Gevers et al. 2004; Reams and Neidle 2004). They often encode surface proteins (Kihara and Kanehisa 2000; Gevers et al. 2004), which may undergo phase variation or antigenic variation (Hughes and Norstrom 2005; Villemur and Deziel 2005). Because these genes are likely to have originated by duplication, tandem paralogs can be suitable targets for the study of paralog diversification. Some of these paralogous gene clusters are on genomic islands, which are likely to have been horizontally acquired, to be highly polymorphic among strains, and to confer strain-specific adaptive properties such as drug resistance or pathogenicity (Dobrindt et al. 2004). For these reasons, tandem paralogs found in the genomic islands of Staphylococcus aureus are suitable targets for the study of paralog evolution.

Staphylococcus aureus is a Gram-positive bacterium with low GC content. It is a major human pathogen and is notable for its expression of a vast variety of toxins (Kuroda et al. 2001). Whole-genome sequences have been determined for more than 10 S. aureus strains (Kuroda et al. 2001; Baba et al. 2002, 2008; Holden et al. 2004; Ohta et al. 2004; Gill et al. 2005; Diep et al. 2006; Highlander et al. 2007; http://genome.jgi-psf.org/finished_microbes/). Two genomic islands, νSaα and νSaβ, common and unique to this bacterial species, encode three tandem paralog clusters: exotoxins (ssl) and lipoproteins (lipoprotein-like [lpl]) encoded by νSaα and proteases (spl) encoded by νSaβ. All these paralogs encode secreted proteins that are inferred to be pathogenicity related (Williams et al. 2000; Kuroda et al. 2001; Reed et al. 2001; Chavakis et al. 2007; Kulig et al. 2007). The copy number and sequence composition vary among strains. In our previous study (Tsuru et al. 2006), we compared these three clusters from seven strains available at the time and inferred an involvement of homologous recombination in their evolution. Among the three, the lpl cluster was unique in two respects: highly conserved nucleotide sequences, representing a possible recombination site, were discovered in the middle of the protein-coding region of every paralogous open reading frame (ORF), and rearrangements there seemed to be far more extensive than in the other two clusters.

In the present study, we made a detailed comparison of the lpl tandem paralog clusters of nine sequenced S. aureus genomes to examine their evolutionary processes. Contrary to the general belief that a unit of evolution of a gene is an ORF itself, we came to the conclusion that a unit composed of the “3′ half of an ORF and the 5′ half of a downstream ORF” serves as a unit of evolution in this cluster.

Materials and Methods

Homology Search for lpl Genes

The annotated lpl amino acid sequences of S. aureus N315 were used as the query sequences to search for homologs of lpl genes using BlastP and TBlastN. The search was carried out against the complete genome sequences of 520 bacterial and 46 archaeal species (31 July 2007 data) from the National Center for Biotechnology Information (NCBI) Genome database (http://www.ncbi.nlm.nih.gov/Genomes/). All hits with an e value < 10^-5 were collected. Pseudogenes of lpl were identified through these analyses, some of which had been referred to as such in the original annotations. The lpl homologs were then manually refined at the nucleotide sequence level by macroscopic pairwise genome comparison using CGAT software (Uchiyama et al. 2006) and by multiple sequence alignment using ClustalW version 1.83 (http://www.ebi.ac.uk/Tools/clustalw/). The initiation and termination positions for ORFs (including pseudogenes) were reassigned consistently to resolve discrepancies among the original annotations. The resulting coordinates for all the lpl ORFs are listed in supplementary table S1 (Supplementary Material online).
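The e value cutoff described above can be illustrated with a minimal sketch. It assumes the BLAST results were exported in the common 12-column tabular layout, in which the subject identifier is column 2 and the e value is column 11; the function name `filter_hits` and the example rows are hypothetical and are not taken from the study.

```python
# Keep BLAST hits whose e value is below the 10^-5 cutoff stated in the text.
# Assumption: 12-column tabular output, with the e value in column 11 (index 10).
E_VALUE_CUTOFF = 1e-5

def filter_hits(lines, cutoff=E_VALUE_CUTOFF):
    """Return (subject_id, e_value) pairs with e_value < cutoff."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        e_value = float(fields[10])
        if e_value < cutoff:
            hits.append((fields[1], e_value))
    return hits

# Hypothetical rows: identifiers and scores are invented for illustration.
example = [
    "lpl_query\tsubject_1\t85.0\t250\t37\t0\t1\t250\t1\t250\t1e-120\t420",
    "lpl_query\tsubject_2\t31.0\t90\t60\t2\t5\t94\t10\t99\t0.002\t35.0",
    "lpl_query\tsubject_3\t40.0\t200\t110\t4\t3\t202\t8\t207\t3e-06\t58.0",
]
kept = filter_hits(example)
print(kept)
```

Hits that pass such a filter would still need the manual refinement and ORF boundary reassignment described in this section.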

Nomenclature of Genes

Names for ORFs and pseudogenes used throughout this study were set by modifying the corresponding locus_tag in the RefSeq database (http://www.ncbi.nlm.nih.gov/RefSeq/): a three-letter genome name used in the Kyoto Encyclopedia of Genes and Genomes database (http://www.genome.jp/kegg/) was followed by the locus_tag number, for example, SAA0410 for SAUSA300_0410 in RefSeq. The strains in which the relevant homologs were identified and their three-letter genome names are as follows. In S. aureus, SAU is for strain N315 (Kuroda et al. 2001), SAV for Mu50 (Kuroda et al. 2001; Ohta et al. 2004), SAM for MW2 (Baba et al. 2002), SAR for MRSA252 (Holden et al. 2004), SAS for MSSA476 (Holden et al. 2004), SAC for COL (Gill et al. 2005), SAO for NCTC8325 (http://microgen.ouhsc.edu/s_aureus/s_aureus_home.htm), SAA for USA300 (Diep et al. 2006), and SAB for RF122 (Herron-Olson et al. 2007). In Staphylococcus epidermidis, SEP is for strain ATCC12228 (Zhang et al. 2003) and SER for RP62A (Gill et al. 2005). In Staphylococcus haemolyticus, SHA is for strain JCSC1435 (Takeuchi et al. 2005). A pseudogene was indicated by adding “p” at the end of the name, for example, SAA0412p.
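The naming rule can be written as a small helper, shown here as a sketch rather than the authors' own tooling. The function `orf_name` is hypothetical, and the locus_tag SAUSA300_0412 in the second call is assumed by analogy with the SAUSA300_0410 example given above.

```python
import re

def orf_name(genome_code, locus_tag, pseudogene=False):
    """Build a name such as SAA0410 from a three-letter code and a locus_tag."""
    match = re.search(r"(\d+)$", locus_tag)  # trailing locus_tag number
    if match is None:
        raise ValueError(f"no trailing number in locus_tag: {locus_tag!r}")
    return genome_code + match.group(1) + ("p" if pseudogene else "")

print(orf_name("SAA", "SAUSA300_0410"))
# Assumed locus_tag for the pseudogene example SAA0412p.
print(orf_name("SAA", "SAUSA300_0412", pseudogene=True))
```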

Sequence Comparison

An initial multiple sequence alignment of the nucleotide sequences for all the lpl ORFs was constructed using ClustalW with the default parameters. Then a Neighbor-Joining (NJ) phylogeny for the alignment was constructed using MEGA 4.0 (Tamura et al. 2007; http://www.megasoftware.net/) with pairwise deletion mode for gaps and with a maximum composite likelihood model for substitutions (fig. 1 and supplementary fig. S1 [Supplementary Material online]).
Fig. 1. The lpl homologs in the Staphylococcus aureus genome and their phylogenetic tree. (A) Location of four lpl loci, Locus 0 through Locus III, on the genome of strain N315. Note that the lpl homologs are found in the corresponding loci in all the sequenced S. aureus strains. (B) A nucleotide NJ phylogenetic tree for the lpl ORFs and their homologs in two other Staphylococcus species. The uncondensed version of this tree is presented in supplementary figure S1 (Supplementary Material online).

More detailed analyses were carried out with lpl ORFs on genomic island νSaα of S. aureus as detailed below. For the nucleotide sequences of the ORFs, multiple sequence alignment was once again constructed using ClustalW. A multiple sequence alignment for the predicted amino acid sequences was also constructed, omitting those for the pseudogenes. The nucleotide sequence alignment was then manually refined considering their encoding amino acid sequences, using a function implemented in MEGA 4.0. The resulting nucleotide and amino acid sequence alignments are shown in supplementary figure S3 and (Supplementary Material online), respectively. The nucleotide sequences of the relevant regions including the ORFs and the intergenic regions are identical between N315 and Mu50 and between MW2 and MSSA476; therefore, the sequences of Mu50 and MSSA476 were omitted here and in the following analyses.

Definitions Related to the Structure of lpl ORFs on νSaα

The presence of the central conserved region was visualized by similarity plots of aligned nucleotide/amino acid sequences constructed using PLOTCON (http://emboss.sourceforge.net/), in which a similarity score, with window size of 5, was calculated with EDNAFULL score file for nucleotide sequences and with EBLOSUM62 for amino acid sequences (fig. 2). A region conserved both at the nucleotide sequence level and at the amino acid sequence level was determined in the alignments by visual inspection and defined as the central conserved region (supplementary fig. S3 and , Supplementary Material online). Divergent regions to its 5′ side and to its 3′ side were defined as the 5′-variable region and the 3′-variable region, respectively.
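The idea behind a windowed similarity plot can be sketched as follows. This is a deliberately simplified stand-in, not PLOTCON's scoring: it scores each column by the fraction of identical sequence pairs instead of using the EDNAFULL or EBLOSUM62 matrices, and the toy alignment is invented for illustration. Only the window size of 5 comes from the text.

```python
from itertools import combinations

def column_scores(alignment):
    """Fraction of sequence pairs that carry the same non-gap residue, per column."""
    pairs = list(combinations(range(len(alignment)), 2))
    scores = []
    for column in zip(*alignment):
        same = sum(1 for i, j in pairs
                   if column[i] == column[j] and column[i] != "-")
        scores.append(same / len(pairs))
    return scores

def windowed(scores, window=5):
    """Mean score of each run of `window` consecutive columns."""
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]

# Hypothetical toy alignment: variable flanks around a conserved core.
toy = [
    "ACGTAGGCTTAAGCA",
    "TTGCAGGCTTAACGT",
    "GACAAGGCTTAATTC",
]
profile = windowed(column_scores(toy))
print([round(x, 2) for x in profile])
```

In a profile of this kind, a conserved core appears as a plateau of high values flanked by lower values, which is the pattern used here to delimit the central conserved region by eye.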
Fig. 2. Structure of lpl genes (top) with similarity plots for nucleotide sequences (middle) and for amino acid sequences (bottom). The central conserved region is highlighted by gray shading. A predicted signal peptide region is indicated together with a conserved cysteine residue at the C-terminus.

Phylogenetic Comparison and Grouping of 5′-Variable Region and 3′-Variable Region

Nucleotide sequences of the 5′-variable region and the 3′-variable region were once again aligned using ClustalW, and NJ phylogenies for them were constructed using MEGA 4.0 with the complete deletion mode for gaps and with the maximum composite likelihood model for substitutions. The phylogenies were compared with each other by connecting operational taxonomic units (OTUs) in a pair of the 3′-variable region of an ORF and the 5′-variable region of its downstream ORF (fig. 4A) and in a pair within an ORF (fig. 4B). Shapes of the trees were modified by flipping and rerooting to reduce the number of crossings of the connecting lines in each comparison with the aid of TreeMap 2.0 (http://www.it.usyd.edu.au/∼mcharles/software/treemap/treemap.html). Grouping was made for each variable region based on the comparison of phylogenetic trees in figure 4. Pairwise identities within each group and across all the sequences were calculated using MEGA 4.0 (table 1).
Fig. 4. Phylogenetic comparison of the 5′ regions and 3′ regions of lpl ORFs. A nucleotide NJ phylogeny for 5′-variable regions and one for 3′-variable regions were compared with each other by connecting OTUs in a pair of 3′-variable region of an ORF and 5′-variable region of its downstream ORF (A) and in a pair within an ORF (B). The bootstrap values (%) were obtained from 1,000 resamplings. Groups of the conserved pairs are indicated in different colors of connecting lines in (A). A paraphyletic group, B1, and the other monophyletic groups in each phylogeny corresponding to the groups of the pairs in (A) are indicated by boxes both in (A) and (B).

Table 1

Statistics of Nucleotide Sequence Alignments for lpl ORFs and Their Intergenic Region

                                          Pairwise Identity
Name      Number of   Length in          Minimum–Maximum   Average
          Sequences   Alignment (bp)     (%)               (%)
5′-variable region
    All   48          120                41–100            65
    a     6           96                 89–100            94
    b     3           96                 94–98             96
    c     3           87                 100               100
    d     2           96                 NR                93
    e     3           108                99–100            99
    f     3           108                92–96             94
    g     1           108                NR                NR
    h     2           108                NR                98
    i     2           108                NR                93
    j     2           108                NR                95
    k     11          96                 80–100            88
    l     1           96                 NR                NR
    m     9           120                90–100            96
Central conserved region
    All   48          132                79–100            88
3′-variable region
    All   48          574                52–100            66
    A     8           561                87–100            94
    B1    3           543                98–100            99
    B2    2           543                NR                99
    C     2           561                NR                98
    D     3           546                98–100            99
    E     3           561                91–92             92
    F     3           565                88–93             91
    G     6           561                91–100            94
    H     2           557                NR                99
    I     1           567                NR                NR
    J     3           552                99–100            99
    K     2           558                NR                89
    L     1           552                NR                NR
    M     9           567                82–100            90
Intergenic region
    A–a   6           18                 100               100
    B1–e  3           31                 100               100
    B2–h  2           31                 NR                97
    C–g   1           18                 NR                NR
    C–k   1           18                 NR                NR
    D–f   3           30                 97–100            98
    E–i   2           51                 NR                98
    F–b   3           47                 89–98             93
    G–k   3           47                 92–96             94
    H–j   2           59                 NR                100
    J–c   3           69                 97–100            98
    K–d   2           28                 NR                93
    L–l   1           60                 NR                NR
    M–m   9           223                81–100            92

NOTE. NR, not relevant.

To test statistically whether our tree-based grouping reflects significant linkage between a 3′-variable region and its downstream 5′-variable region, we performed Fisher's exact test as implemented in R, version 2.7.0 (http://www.R-projects.org/), which is based on the FORTRAN program FEXACT (Mehta and Patel 1986; Clarkson et al. 1993).
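As an illustration of what such a test measures, the sketch below estimates a Fisher-type P value by Monte Carlo permutation instead of the exact FEXACT algorithm. The pair counts are read from the "Intergenic region" rows of table 1. The exact contingency table the authors tested is not given in the text, so using these 14 pair types as cells is an assumption, as are the function names, the number of permutations, and the random seed.

```python
import random
from collections import Counter
from math import lgamma

# Counts of (3'-variable group, downstream 5'-variable group) pairs,
# taken from the "Intergenic region" rows of table 1.
PAIR_COUNTS = {
    ("A", "a"): 6, ("B1", "e"): 3, ("B2", "h"): 2, ("C", "g"): 1,
    ("C", "k"): 1, ("D", "f"): 3, ("E", "i"): 2, ("F", "b"): 3,
    ("G", "k"): 3, ("H", "j"): 2, ("J", "c"): 3, ("K", "d"): 2,
    ("L", "l"): 1, ("M", "m"): 9,
}

def concentration(rows, cols):
    """Sum of log(n_ij!) over table cells. With fixed margins, a larger value
    means a less probable table under independence."""
    return sum(lgamma(n + 1) for n in Counter(zip(rows, cols)).values())

def monte_carlo_p(pair_counts, n_perm=2000, seed=0):
    """Share of label permutations giving a table no more probable than observed."""
    rows, cols = [], []
    for (three_prime, five_prime), n in pair_counts.items():
        rows += [three_prime] * n
        cols += [five_prime] * n
    observed = concentration(rows, cols)
    rng = random.Random(seed)
    shuffled = cols[:]
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # keeps both margins fixed
        if concentration(rows, shuffled) >= observed - 1e-9:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)

p_value = monte_carlo_p(PAIR_COUNTS)
print(f"Monte Carlo P estimate: {p_value:.4g}")
```

A permutation estimate cannot go below 1/(n_perm + 1), so it can only confirm that the association is strong. It does not reproduce the exact probability reported in the Results.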

Comparison of Intergenic Regions

Sequences of the intergenic regions were compared using ClustalW for each group. The presence of intergroup similarities was detected with the aid of NJ trees using MEGA 4.0 and multiple sequence alignments using ClustalW (fig. 5). Pairwise identities were calculated using MEGA 4.0 (table 1). Ribosome-binding sites were predicted in the multiple alignments, referring to the characterized consensus sequences (Novick 1991).
Fig. 5. Multiple alignments of the lpl intergenic regions. “A–a” represents, for example, an intergenic region sandwiched by the 3′-variable region of “A” group and the 5′-variable region of “a” group. A putative ribosome-binding site (Novick 1991), which could be found in all except for “K–d”, is indicated by asterisks.

Intergenomic Comparison

Intergenomic comparison was carried out using a map of ORFs shown in figure 6. For indels, boundaries of homology and nonhomology were compared by ClustalW as described previously (Tsuru et al. 2006) to identify recombination sites (fig. 6).
Fig. 6. (A) Schematic maps of lpl ORFs. Naming and coloring for 5′-variable region and 3′-variable region are after the grouping in figure 4. For an indel found between USA300 and COL and one between COL and NCTC8325, an apparently deleted region and the regions involved in recombination are indicated by dotted lines and numbered thick lines, respectively. (B) Alignment of the regions 1, 2, and 3 in (A) suggesting a recombination relationship between them (1 × 2 → 3). The central conserved region is indicated above the alignment. (C) Alignment of the regions 3, 4, and 5 in (A) suggesting a recombination relationship between them (3 × 4 → 5).

Analyses of Another Tandem Gene Cluster for Hypothetical Proteins in S. aureus

The above procedures for defining gene structure and for phylogenetic comparison were repeated for another tandem gene cluster for hypothetical proteins, homologs of SA1317 (figs. 8–10). Grouping was carried out for each of the two variable regions separately so that the pairwise evolutionary distance within a group was no greater than 0.15 (fig. 9).
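One way to apply a within-group distance ceiling of 0.15 is sketched below. The authors assigned their groups on the phylogeny, so the greedy procedure, the function name `group_by_distance`, and the distance values here are all assumptions made for illustration.

```python
def group_by_distance(names, distance, threshold=0.15):
    """Greedily place each sequence in the first group whose members are all
    within `threshold` of it; otherwise start a new group."""
    groups = []
    for name in names:
        for group in groups:
            if all(distance[frozenset((name, member))] <= threshold
                   for member in group):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

# Hypothetical pairwise distances between four variable regions.
d = {
    frozenset(("s1", "s2")): 0.04,
    frozenset(("s1", "s3")): 0.30,
    frozenset(("s1", "s4")): 0.12,
    frozenset(("s2", "s3")): 0.28,
    frozenset(("s2", "s4")): 0.10,
    frozenset(("s3", "s4")): 0.33,
}
groups = group_by_distance(["s1", "s2", "s3", "s4"], d)
print(groups)
```

A greedy pass depends on input order. It guarantees the distance ceiling inside each group but not that groups match clades of the tree.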
Fig. 8. Structure of genes of SA1317 homologs (top) with similarity plots for nucleotide sequences (middle) and for amino acid sequences (bottom). A central conserved region is highlighted by gray shading.

Fig. 9. Phylogenetic comparison of the 5′ regions and 3′ regions of SA1317 homologs. A nucleotide NJ phylogeny for 5′-variable regions and one for 3′-variable regions were compared with each other by connecting OTUs in a pair of 3′-variable region of an ORF and 5′-variable region of its downstream ORF (A) and in a pair within an ORF (B). The bootstrap values (%) were obtained from 1,000 resamplings. Groups in each phylogeny were assigned so that the pairwise evolutionary distances within a group were no greater than 0.15.

Fig. 10. Maps of SA1317 homolog clusters in various Staphylococcus aureus strains. The SA1317 homologs are drawn in bold lines. Naming of their 5′-variable region and 3′-variable region is after the tree-based grouping in figure 9. The larger intervening ORF, SAU1320 and its homologous genes, is observed in all the strains except for MRSA252, whereas the shorter intervening ORF, SAA1377 and its homologs, is observed in strains USA300, COL, and NCTC8325. SAB1350 and SAB1349 in RF122 are truncated genes homologous to SAU1320. Insertion of a prophage into the larger intervening ORF observed in USA300, NCTC8325, and MW2 is indicated by a black triangle.

Results

Distribution of lpl Homologues in Staphylococcus Genomes

In a previous study, we examined diversity of the lpl gene cluster on the genomic island νSaα of several S. aureus strains with a sequenced genome (Tsuru et al. 2006). It had been reported that the homologs of these lpl genes were found at loci other than the genomic island and that they compose the largest group of paralogous genes in S. aureus strain N315 (Kuroda et al. 2001; Baba et al. 2004). Therefore, we first carried out a search for their homologs against the microbial genome sequence database (NCBI Genome database; 31 July 2007 data, see Materials and Methods). This revealed that significantly related lpl homologs were found only in the genomes of the genus Staphylococcus. All nine sequenced strains of S. aureus examined (N315, Mu50, MW2, MRSA252, MSSA476, COL, NCTC8325, USA300, and RF122) and two sequenced strains of S. epidermidis (ATCC12228 and RP62A) carry multiple lpl homologs within their genome, and the sequenced strain of S. haemolyticus (JCSC1435) carries one lpl homolog (supplementary table S1, Supplementary Material online). In the only remaining Staphylococcus genome that has been sequenced, Staphylococcus saprophyticus strain ATCC15305 (Kuroda et al. 2005), and in the other bacterial or archaeal genomes, no homolog could be found.

In all the S. aureus genomes, paralogous lpl genes were found in four loci (fig. 1). First, in the genomic island νSaα, which we analyzed previously (Tsuru et al. 2006), the tandem cluster with three to ten lpl homologs is present in all nine strains (Locus 0). Second, one to five tandemly repeated lpl genes are found in a locus corresponding to SAU0092–SAU0096 of strain N315 in all the strains (Locus I). Third, only one lpl gene is present in a locus corresponding to SAU0203 of N315 in all the strains except for RF122 and MRSA252 (Locus II). Fourth, one to four tandemly repeated lpl genes, sometimes with intervening ORFs, can be found in a locus corresponding to SAU2269–SAU2275 of N315 in all the strains (Locus III). Note that all nine S. aureus strains show synteny along the entire genome and carry the lpl genes at the same loci in the same orientation.

In order to examine orthologous/paralogous relationships among the lpl homologs, all the nucleotide sequences were compared in a multiple alignment to construct an NJ phylogeny. The condensed version of the resulting tree is displayed in figure 1 and the uncondensed version in supplementary figure S1 (Supplementary Material online). The tree revealed the presence of three monophyletic groups completely corresponding to Locus I, Locus II, and Locus III of S. aureus. Within each group, sequences are relatively similar to each other. Homologs from S. epidermidis and S. haemolyticus were rather divergent and were apparently related to the family at Locus III of S. aureus. The genes on νSaα of S. aureus (Locus 0) did not form a closely related monophyletic group but rather a mixture of several phylogenetic groups. However, these are clearly distinct from the groups at the other three loci of S. aureus. These observations suggested that the lpl genes of S. aureus have evolved, to a first approximation, separately at each locus (see also Discussion). Therefore, in the following sections, we will focus on sequence comparison among lpl homologs from the νSaα genomic island of S. aureus.

Central Conserved Region in lpl Genes on νSaα

In our previous study (Tsuru et al. 2006), we identified a conserved sequence within the lpl genes on νSaα through a multiple dot-plot analysis using the genome sequences of seven strains available at the time. We here confirmed the presence of this conserved sequence in two additional strains, USA300 and RF122, through the same analysis (supplementary fig. S2, Supplementary Material online). The presence of the conserved sequence is seen as dots at the crossing points of a horizontal red line and a vertical red line in comparison within the same genome, as noted in our previous work (Tsuru et al. 2006). To verify this conserved sequence, multiple alignments for nucleotide sequences and for predicted amino acid sequences were constructed for all the relevant lpl ORFs (supplementary fig. S3 and , Supplementary Material online). These alignments revealed the presence of a region conserved at both the nucleotide sequence level and the amino acid sequence level, which we named the central conserved region (fig. 2). This region is 132 nt long or 44 amino acids long without any gap in the alignments. The regions 5′ and 3′ of the central conserved region are less conserved and have variable lengths, and, accordingly, they were defined as the 5′-variable region and the 3′-variable region, respectively.

Calculation of pairwise nucleotide sequence identities using all the relevant sequences (table 1, All) revealed that those for the central conserved region are high (minimum of 79% and an average of 88%). Out of the 1,128 pairwise relationships, 427 showed identity as high as 90%. Meanwhile, those for the 5′-variable region and the 3′-variable region show minimal values of 41% and 52%, respectively, whereas the averages are 65% and 66%, respectively.

Involvement of this conserved sequence in the rearrangement of the lpl cluster is visible in an intergenomic dot-plot comparison (supplementary fig. S2, Supplementary Material online) as the termination of long black lines at a crossover point of a horizontal red line and a vertical red line as noted earlier (Tsuru et al. 2006). Generally, homologous recombination requires two homologous sequences that are long enough and similar enough to each other. In Bacillus subtilis, the closest bacterium to S. aureus in which this process has been studied in detail, the minimal length is 70 bp (Khasanov et al. 1992), a size comparable to those reported for other prokaryotic systems (Shen and Huang 1986; Fujitani et al. 1995). Frequency of homologous recombination is very sensitive to homology length around this length: it was found to be proportional to the third power of the homology length (Fujitani et al. 1995). The frequency of homologous recombination decreases very rapidly as the two sequences diverge (Vulic et al. 1997; Majewski and Cohan 1998; Fujitani and Kobayashi 1999). Because the central conserved region is 132 bp long, well above this minimal length, and its sequences are on average 88% identical, homologous recombination between the central conserved regions is likely to occur.
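The pairwise statistics quoted here can be made concrete with a short sketch. It computes percent identity over gap-free columns and summarizes all pairs as minimum, maximum, and average. This is an assumed, simplified definition and is not claimed to match the MEGA 4.0 calculation, and the toy sequences are invented. The first line only checks the pair count stated in the text.

```python
from itertools import combinations
from math import comb

def pairwise_identity(seq_a, seq_b):
    """Percent identity over columns where neither sequence has a gap."""
    compared = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not compared:
        return float("nan")
    return 100.0 * sum(a == b for a, b in compared) / len(compared)

def identity_summary(alignment):
    """Return (minimum, maximum, average) identity over all sequence pairs."""
    values = [pairwise_identity(a, b) for a, b in combinations(alignment, 2)]
    return min(values), max(values), sum(values) / len(values)

# 48 sequences give 48 * 47 / 2 pairs, the 1,128 relationships quoted in the text.
print(comb(48, 2))

# Hypothetical aligned fragments, invented for illustration only.
toy = ["ACGTACGTAC", "ACGTACGTAA", "ACGAAC-TAC"]
print(identity_summary(toy))
```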

Phylogeny Comparison and Grouping of 5′-Variable Region and 3′-Variable Region

If the central conserved region served as a recombination site during diversification of this region, the crossing-over events there should have changed the combination of the 5′-variable region and the 3′-variable region of an ORF (fig. 3). These events would result in incongruent phylogenies between them within an ORF. On the other hand, these crossing-over events will not disturb the linkage between 3′-variable region of an ORF and 5′-variable region of the next downstream ORF (fig. 3). They would result in a congruent phylogeny between these combinations. To test these predictions, a phylogeny for 5′-variable region and that for 3′-variable region were constructed and compared (fig. 4).
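These two predictions can be checked on a toy model, sketched below under simplifying assumptions. Each ORF is reduced to a pair of group labels (5′-variable, 3′-variable), and a crossover at the central conserved region joins the 5′ half of one ORF to the 3′ half of another. The USA300 order is the one quoted later in the Results. The second array, `other`, and the function names are hypothetical.

```python
def crossover(upstream, i, downstream, j):
    """Recombine at the central conserved region of upstream[i] and downstream[j].

    Each ORF is a (5'-variable group, 3'-variable group) tuple. The product keeps
    everything up to the 5' half of upstream[i] and everything from the 3' half
    of downstream[j] onward.
    """
    hybrid = (upstream[i][0], downstream[j][1])
    return upstream[:i] + [hybrid] + downstream[j + 1:]

def adjacent_links(cluster):
    """3' group of each ORF paired with the 5' group of the next ORF."""
    return {(left[1], right[0]) for left, right in zip(cluster, cluster[1:])}

# USA300 order k-D-f-C-g-H-j-A-a-B1 as quoted in the Results, read as ORFs.
usa300 = [("k", "D"), ("f", "C"), ("g", "H"), ("j", "A"), ("a", "B1")]

# Recombination between the first and last ORF of the same array removes the
# interval D-f-C-g-H-j-A-a and leaves the k-B1 arrangement reported for COL.
col_like = crossover(usa300, 0, usa300, 4)
print(col_like)

# A crossover between two arrays (the second one is hypothetical).
other = [("m", "M"), ("b", "G"), ("k", "E")]
product = crossover(usa300, 1, other, 1)
print(product)

# Every adjacent 3'-5' link in the product already existed in a parent.
print(adjacent_links(product) <= adjacent_links(usa300) | adjacent_links(other))
```

In this model the hybrid ORF carries a new within-ORF combination, while no new link across an intergenic region is ever created, which is the asymmetry the phylogeny comparison is designed to detect.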
Fig. 3. An elementary process of diversification through homologous recombination between the central conserved regions. A crossing-over will change combinations of the 5′-variable region and the 3′-variable region of an ORF, but linkage of 3′-variable region of an ORF, its downstream intergenic region, and 5′-variable region of its downstream ORF will be maintained.

In figure 4B, for a pair of variable regions within an ORF, the two phylogenies were found to be significantly incongruent. In figure 4A, for a pair encompassing adjacent ORFs, the phylogenies appeared to be more congruent to each other. We observed that the connecting lines in figure 4A could be divided into groups of parallel lines with only a few exceptions. By contrast, in figure 4B, such clear groupings of the parallel connecting lines were difficult to identify (see Materials and Methods). We also observed that the grouping of the parallel lines in figure 4A coincided with monophyletic or paraphyletic grouping of each phylogeny. Based upon this observation, we were able to assign groups for the 5′-variable region and the 3′-variable region, respectively.

The 3′-variable regions were grouped into 13 distinct monophyletic groups named “A” through “M,” whereas the 5′-variable regions were also grouped into 13 distinct monophyletic groups named “a” through “m.” The group “B” was further divided into a paraphyletic group “B1” and a monophyletic group “B2” because these two are paired with distinct groups of the 5′-variable region in figure 4A: group B1 is paired with group “e” and group B2 is paired with “h.” The internal branch lengths for the resulting groups were relatively short, which indicated that sequences are similar to each other within each group. Indeed, pairwise nucleotide identities calculated for each of the resulting groups were ≥80%, which is in contrast to those calculated using all the sequences (table 1).

In all, 12 distinct groups of the pairs could be assigned in the combination of the 3′-variable region of an ORF and the 5′-variable region of its downstream ORF, which are displayed in different colors of connecting lines in figure 4A. The corresponding monophyletic or paraphyletic groups in each phylogeny are displayed in the same, but faint, colors as the connecting lines. One pair, “L” and “l,” forms a group of only one member. The other groups have multiple members. The sequences of two variable regions are conserved within each group but are diverged between groups. There are only two exceptional pairs in this grouping, which are indicated by gray connecting lines in figure 4A. These are discussed in detail later (see Discussion).

Our grouping is based on comparison of the two phylogenetic trees in figure 4. In order to examine whether it represents a statistically significant relationship, we performed Fisher's exact test (Agresti 1992; see Materials and Methods). The probability calculated for our data is 2.2 × 10^-16, which indicates that there is a highly significant linkage between a 3′-variable region and its downstream 5′-variable region. This presence/absence of the conserved pairs in figure 4 supports the hypothesis that the two variable regions of one ORF have been shuffled during the diversification processes via crossing-over events at the central conserved region, whereas the two variable regions encompassing adjacent ORFs have been conserved during these processes.

If the linkage between the 3′-variable region of an ORF and the 5′-variable region of its downstream ORF has been maintained, their intervening sequence should also have been conserved (fig. 3). To verify this, the intergenic regions were compared by multiple alignments (fig. 5). They turned out to be highly conserved within each group (see also table 1, intergenic region). Here “A–a” represents, for example, an intergenic region sandwiched by 3′-variable region of “A” and 5′-variable region of “a”. In most of the cases, the sequences from different groups differed in length and composition, showing almost no similarity. A putative ribosome-binding site (Novick 1991) could be identified in all their sequences except for “K–d,” and their sequences from the different groups are likewise distinctive. There are, however, several cases in which the intergenic sequences from different groups are similar to each other: among “B1–e,” “B2–h,” and “D–f”; between “F–b” and “G–k”; and between “H–j” and “L–l” (fig. 5). Additionally, “C–k” and “C–g,” the exceptional pairs of the conservation, also showed similarity to each other and to “A–a”.

Therefore, the conserved pair of the two variable regions can be extended to the conserved unit comprising a 3′-variable region, a downstream intergenic region, and a downstream 5′-variable region, which spans two adjacent ORFs. The units from different groups are substantially distinctive from each other, though some sequence families of intergenic regions are found to be common to a few different groups.

Intergenomic Comparison and Reconstruction of Genome Rearrangement Events

The result of the grouping based on the phylogeny comparison in figure 4 is displayed in a schematic map of ORFs in figure 6. Gene orders and gene compositions are highly variable among strains, which indicates that multiple rearrangement events occurred in this region in the past. Combinations of the 5′-variable region and the 3′-variable region within an ORF appear to have been extensively shuffled. For instance, “f” is paired with “G” in SAU0397/SAV0434 of N315/Mu50, but it is paired with “C” in SAA0411 of USA300. At the ORF level, 32 distinct combinations of the 13 groups of 5′ regions and the 13 groups of 3′ regions were found (fig. 6). In contrast, the presence of the conserved pairs of the 3′-variable region of one ORF and the 5′-variable region of its downstream ORF, as displayed by the same coloring, can be clearly observed here. These observations also support the hypothesis that the linkage encompassing two adjacent genes has been well maintained in spite of the extensive rearrangements.
Fig. 6. (A) Schematic maps of lpl ORFs. Naming and coloring for 5′-variable region and 3′-variable region are after the grouping in figure 4. For an indel found between USA300 and COL and one between COL and NCTC8325, an apparently deleted region and the regions involved in recombination are indicated by dotted lines and numbered thick lines, respectively. (B) Alignment of the regions 1, 2, and 3 in (A) suggesting a recombination relationship between them (1 × 2 → 3). The central conserved region is indicated above the alignment. (C) Alignment of the regions 3, 4, and 5 in (A) suggesting a recombination relationship between them (3 × 4 → 5).
Occurrence of numerous rearrangements of various types can be inferred from pairwise genome comparisons. For example, 1) an indel is found between USA300 (k-D-f-C-g-H-j-A-a-B1-) and COL (k-B1-), with an apparent deletion of “D-f-C-g-H-j-A-a”; 2) a translocation is found between N315/Mu50 (-D-f-G-k-) and MRSA252 (-G-k-D-f-); and 3) a substitution is found between RF122 (-b-K-d-B2-h-J-c-M-) and MW2/MSSA476 (-b-B1-e-E-i-M-). In addition, 4) the presence of two units of A–a in USA300 indicates an apparent gene conversion (or a replicative transposition). Among these rearrangements, two indels, one found between USA300 and COL and the other between COL and NCTC8325, are remarkable in that, in each case, the variation between the strains can be explained by a single deletion event. The close relationship between these three strains has been elucidated by phylogenetic analysis using the sequences of housekeeping genes (Baba et al. 2008). For these two indels, alignment of the sequences involved allowed determination of the recombination site (fig. 6). In the upstream region, the progeny sequence (second line of the three) aligns perfectly with one parental sequence (first line) but not with the other parental sequence (third line). In the downstream region, the progeny sequence aligns perfectly with the parental sequence in the third line but not with that in the first line. The transition region, around which recombination is likely to have taken place, coincides with the central conserved region in both cases. This indicates that these indels were generated by homologous recombination at the central conserved region.
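The three-line alignment reasoning above can be sketched in code. The function and the short sequences below are hypothetical illustrations, not data from the study; the sketch assumes gap-free sequences aligned to equal length and a single crossover.

```python
def breakpoint_interval(parent_up, progeny, parent_down):
    """Locate the region where the progeny switches parents.

    Returns (start, end) as 0-based half-open coordinates. Every column in
    [start, end) is uninformative because the two parents agree there, so
    the crossover could have happened anywhere inside it.
    """
    if not (len(parent_up) == len(progeny) == len(parent_down)):
        raise ValueError("sequences must be aligned to equal length")
    last_up = -1                 # last column supporting the upstream parent
    first_down = len(progeny)    # first column supporting the downstream parent
    for i, (a, c, b) in enumerate(zip(parent_up, progeny, parent_down)):
        if a == b:
            continue             # parents identical here: no information
        if c == a:
            last_up = i
        elif c == b and first_down == len(progeny):
            first_down = i
    if last_up >= first_down:
        raise ValueError("pattern is not a single crossover")
    return last_up + 1, first_down

# Hypothetical toy alignment: variable flanks around a shared core.
parent_up = "ACGTTGCAGGCTAACT"
progeny = "ACGTTGCAGGCTGACA"
parent_down = "ATGATGCAGGCTGACA"
start, end = breakpoint_interval(parent_up, progeny, parent_down)
print(start, end, progeny[start:end])
```

In this toy case the interval returned is the stretch shared by both parents, which plays the role of the central conserved region.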

A Model of Paralog Cluster Diversification through Homologous Recombination between Central Conserved Regions

Using the observations and arguments presented above, we propose a model for the diversification processes in this tandem paralog cluster (figs. 3 and 7). During sequence diversification, the central conserved region maintained its sequence while sandwiched between two variable regions, and it has provided a site for homologous recombination between paralogs. A recombination event could change the combination of the 5′-variable region and the 3′-variable region within one ORF but could not disturb the linkage of the 3′-variable region of one ORF, the intergenic region, and the 5′-variable region of its downstream ORF, resulting in maintenance of this conserved unit (fig. 3). These processes make it possible to generate an ORF with a novel combination of the two variable regions.
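The model can be sketched by treating each ORF as a pair of group labels and a crossover as a splice at the central conserved region. The helper names below are hypothetical, and the input is only the portion of the USA300 map quoted in the text, so this is an illustration of the logic rather than a reconstruction of the locus.

```python
def parse_cluster(text):
    """Split a map such as 'k-D-f-C' into ORFs of (5' group, 3' group)."""
    labels = text.strip("-").split("-")
    if len(labels) % 2:
        raise ValueError("expected alternating 5' and 3' labels")
    return [(labels[i], labels[i + 1]) for i in range(0, len(labels), 2)]

def spanning_units(orfs):
    """Pairs of a 3' region and the 5' region of the next ORF downstream."""
    return [(up[1], down[0]) for up, down in zip(orfs, orfs[1:])]

def crossover(left, i, right, j):
    """Recombine ORF i of `left` with ORF j of `right` at the central region.

    The product keeps `left` up to the 5' region of ORF i and `right` from
    the 3' region of ORF j onward.
    """
    hybrid = (left[i][0], right[j][1])
    return left[:i] + [hybrid] + right[j + 1:]

usa300 = parse_cluster("k-D-f-C-g-H-j-A-a-B1")
deleted = crossover(usa300, 0, usa300, 4)      # unequal: first ORF x fifth ORF
duplicated = crossover(usa300, 4, usa300, 0)   # the reciprocal product
print(deleted)
print(set(spanning_units(duplicated)) == set(spanning_units(usa300)))
```

The deletion product corresponds to the k-B1 arrangement quoted for COL, and the reciprocal product contains a new within-ORF combination while every unit spanning adjacent ORFs is one that already existed.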
Fig. 7. Formation of various types of rearrangements is explained by multiple rounds of crossing-over events at the central conserved region (gray bar). (A) An unequal crossing-over between sister chromosomes ([b]) will cause deletion ([a] to [c]) and duplication ([a] to [d]). An additional unequal crossing-over ([e]) will result in apparent conversion ([a] to [f]). An intra-chromosomal, unequal crossing-over ([g]) can form a circle ([h]) and a deletion ([a] to [i]), and ensuing re-integration of the circle ([j]) will result in apparent translocation ([a] to [k]). The two routes of deletion formation ([a] to [c]; [a] to [i]) can result in apparent substitution ([c] and [i]). (B) Inter-molecular recombination involving horizontal gene transfer also explains formation of various types of rearrangements. Double cross-overs between a donor [l] and a recipient [m] will cause deletion ([m] to [n]; [m] to [o]), resulting in apparent substitution ([n] and [o]). Additional inter-molecular recombination between [n] and [o] can result in an apparent conversion ([m] to [p]). Another round of recombination between [p] and [n] can result in an apparent translocation ([m] to [q]).

Multiple rounds of such crossing-over can cause various types of rearrangements including deletions, conversions, translocations, and substitutions, which are observed among the genomes studied (fig. 6). The occurrence of translocation can be explained by circle formation and ensuing reintegration, which has been proposed as a mechanism for gene amplification or phase variation (Mahan and Roth 1989; Howell-Adams and Seifert 2000; Barten and Meyer 2001), though multiple rounds of unequal crossing-over events can also result in translocation. An apparent gene conversion can be explained by repeated rounds of crossing-over (Yamamoto et al. 1988, 1992). If horizontal gene transfer takes place between closely related strains, intermolecular recombination might also occur (Hacker and Kaper 2000; Ochman et al. 2000) and could contribute to these rearrangements, as illustrated in figure 7. Genetic exchange via horizontal gene transfer has been suggested by sequence comparison of the hsdS gene of the Type I restriction–modification system, which is linked to the lpl cluster (Tsuru et al. 2006). This genomic island, however, has been considered to be no longer mobile because it harbors only the remnants of an integrase homologue (Baba et al. 2004). Another type of genomic island of S. aureus, called SaPIs, has intercellular mobility with the aid of a specific helper phage (Ruzin et al. 2001; Novick and Subedi 2007), but the helper phage for the νSaα island has not yet been identified. Natural transformation has not been reported in S. aureus. Occurrence of conjugation was proposed in order to explain large-scale chromosome replacement, suggested from multilocus sequence typing (Robinson and Enright 2004). It is likely that the diversity of this region was generated through accumulation of these intragenomic and intergenomic processes. This model can explain both formation of highly distorted gene orders and shuffling of the two variable regions of an ORF.

Search for Other Tandem Paralogous Gene Clusters in S. aureus with the Same Diversification Pattern as the lpl Cluster

One might expect that the present model for diversification is general to paralog clusters: any highly conserved sequence within tandem paralogous genes would create a site for recombination and could cause diversification similar to that of the lpl cluster. In order to explore this possibility, we searched for signs of this kind of rearrangement in other tandem gene clusters. Two other tandem clusters on the genomic islands of S. aureus, the ssl cluster on νSaα and the spl cluster on νSaβ, were examined in our previous study (Tsuru et al. 2006), which revealed that the possible recombination sites are located from the upstream region through the 5′-terminal region of the genes in both cases. The mechanism and evolutionary consequence of their diversification are thus different from those for the lpl genes. We searched for other tandem paralogous genes in strain N315 of S. aureus by performing a self–self genome comparison in CGAT (Uchiyama et al. 2006), which is based on all-against-all BlastN analysis and a local program to detect linked repetitive sequences. This screening uncovered 15 tandem clusters (table 2), which include the ssl, lpl, and spl clusters. We then carried out multiple sequence alignment of the member genes of each cluster to examine whether they carry a highly conserved sequence sandwiched by variable sequences. One tandem cluster, composed of SAU1317, SAU1318, SAU1319, and SAU1321 in N315 (encoding hypothetical proteins), showed such a structure.
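The search for a conserved sequence sandwiched by variable sequences can be sketched as a sliding-window scan over a multiple alignment. The functions, window size, threshold, and toy alignment below are assumptions for illustration and do not reproduce the parameters used in the study; gap characters are treated like any other symbol.

```python
def column_identity(alignment):
    """Fraction of sequences sharing the most common residue per column."""
    scores = []
    for column in zip(*alignment):
        top = max(column.count(base) for base in set(column))
        scores.append(top / len(column))
    return scores

def conserved_core(alignment, window=4, threshold=0.9):
    """Longest stretch whose sliding-window mean identity stays >= threshold.

    Returns 0-based half-open (start, end) column coordinates, or None.
    """
    scores = column_identity(alignment)
    best = None
    run_start = None
    for i in range(len(scores) - window + 1):
        mean = sum(scores[i:i + window]) / window
        if mean >= threshold:
            if run_start is None:
                run_start = i
            end = i + window
            if best is None or end - run_start > best[1] - best[0]:
                best = (run_start, end)
        else:
            run_start = None
    return best

# Hypothetical toy alignment: variable 5' part, shared core, variable 3' part.
toy_alignment = [
    "ACGT" + "GGATCCGA" + "TTAC",
    "TGCA" + "GGATCCGA" + "CAGT",
    "GATC" + "GGATCCGA" + "AGTG",
]
core = conserved_core(toy_alignment)
print(core)
```

A cluster would be flagged as lpl-like when the reported core lies in the interior of the genes with low-identity columns on both sides.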
Table 2

Tandem Gene Clusters in Staphylococcus aureus N315 Genome

Cluster Name                             Genes in N315
Lipoprotein-like (Lpl; Locus I)          SAU0092, SAU0092, SAU0093, SAU0094, SAU0095, SAU0096
Hypothetical protein                     SAU0282, SAU0286, SAU0287, SAU0288, SAU0289, SAU0290
Superantigen-like (Ssl; νSaα)            SAU0382, SAU0383, SAU0384, SAU0385, SAU0386, SAU0387, SAU0388, SAU0389, SAU0390
Lipoprotein-like (Lpl; νSaα)             SAU0396, SAU0397, SAU0398, SAU0399, SAU0400, SAU0401, SAU0402, SAU0403, SAU0404, SAU0405
Ser–Asp–rich fibrinogen-binding protein  SAU0519, SAU0520, SAU0521
Superantigen-like protein                SAU1009, SAU1010, SAU1011
ECM-binding protein homologue            SAU1267, SAU1268
Hypothetical protein                     SAU1317, SAU1318, SAU1319, SAU1321
Serine protease-like (Spl; νSaβ)         SAU1627, SAU1628, SAU1629, SAU1630, SAU1631
Enterotoxin                              SAU1642, SAU1643, SAU1644, SAU1645, SAU1646, SAU1647, SAU1648
Hemolysin                                SAU2207, SAU2208, SAU2209
Hypothetical protein                     SAU2263, SAU2264, SAU2265
Lipoprotein-like (Lpl; Locus III)        SAU2269, SAU2273, SAU2274, SAU2275
Fibronectin-binding protein              SAU2290, SAU2291
We screened for homologs of the genes in this SAU1317 cluster by BlastN search (E-value cutoff of 10^−5) against prokaryote complete genome sequences in the NCBI Genome database (31 July 2007 data). This screening revealed multiple (two or four) genes in all nine S. aureus strains analyzed and a single homologous gene in each of the sequenced S. epidermidis strains and in the S. haemolyticus strain. Some of these genes were annotated as lipoproteins (Holden et al. 2004), although not all the genes encode a cysteine residue around the signal peptide-like sequence, a hallmark of bacterial lipoproteins (Sibbald et al. 2006) (data not shown). These genes are termed SA1317 homologs in this study. The multiple homologous genes in the S. aureus strains are present in tandem, sometimes with intervening ORFs and/or an intervening prophage. Similarity plots using all the sequences of the S. aureus genomes confirmed the presence of the central conserved region that is sandwiched by two variable regions (fig. 8).
Fig. 8. Structure of genes of SA1317 homologs (top) with similarity plots for nucleotide sequences (middle) and for amino acid sequences (bottom). A central conserved region is highlighted by gray shading.
A phylogenetic comparison was carried out to detect a linkage encompassing two adjacent paralogous genes (fig. 9). The tree for the 5′-variable region is not as well resolved as that for the 3′-variable region, presumably because of its short and variable length. The two phylogenetic trees seem to be more congruent for pairs encompassing adjacent genes (fig. 9A) than for pairs within the same gene (fig. 9B). We could not apply clear-cut grouping based on the congruence of the two trees for this cluster. Therefore, we grouped the 5′-variable regions and the 3′-variable regions separately based on each tree.
The 3′-variable region was grouped into seven groups named “T” through “Z,” whereas the six 5′-variable region groups were named “u” through “z,” such that the evolutionary distance remains equal to or shorter than 0.15 within a group (fig. 9). We then tried to connect these 3′ groups and 5′ groups. This grouping is also indicated in the schematic map in figure 10. This revealed the presence of sequence conservation in the pairs of two variable regions (figs. 9 and 10). Fisher's exact test supported the significance of the linkage between these pairs; the P value is 0.000104.
Fig. 9. Phylogenetic comparison of the 5′ regions and 3′ regions of SA1317 homologs. A nucleotide NJ phylogeny for 5′-variable regions and one for 3′-variable regions were compared with each other by connecting OTUs in a pair of 3′-variable region of an ORF and 5′-variable region of its downstream ORF (A) and in a pair within an ORF (B). The bootstrap values (%) were obtained from 1,000 resamplings. Groups of each phylogeny were assigned so that the mutual evolutionary distances remain equal to or shorter than 0.15 within a group.
Fig. 10. Maps of SA1317 homolog clusters in various Staphylococcus aureus strains. The SA1317 homologs are drawn in bold lines. Naming of their 5′-variable region and 3′-variable region is after the tree-based grouping in figure 9. The larger intervening ORF, SAU1320 and its homologous genes, is observed in all the strains except for MRSA252, whereas the shorter intervening ORF, SAA1377 and its homologs, is observed in strains USA300, COL, and NCTC8325. SAB1350 and SAB1349 in RF122 are truncated genes homologous to SAU1320. Insertion of a prophage into the larger intervening ORF observed in USA300, NCTC8325, and MW2 is indicated by a black triangle.
Intergenomic comparison showed that the sequences of this cluster are similar to each other between N315 and Mu50; between USA300, COL, and NCTC8325; and between MW2 and MSSA476 (fig. 10).
This relatedness among the strains is also indicated in the phylogenetic trees shown in figure 9. Comparison between USA300, COL, and NCTC8325 showed that the linkage between “W” and “u” has been conserved regardless of the apparent insertion of prophages in USA300 and NCTC8325. The presence of intervening nonhomologous ORFs and/or an intervening prophage between SA1317 homologs does not seem to distort the linkage spanning two paralogs between the above closely related strains. On the other hand, comparison between the diverged strains did not provide direct evidence of their rearrangements. Taken together, a significant relationship between the two variable regions encompassing adjacent genes was also identified in the SA1317 homolog tandem cluster. However, we could not judge whether this linkage is formed by the diversification processes proposed in this study (figs. 3 and 7) because the current data sets provided no evidence of rearrangements.

Discussion

The phylogenetic trees for the 3′-variable region of the lpl ORF and the 5′-variable region of its downstream ORF (fig. 4) appeared congruent, but the congruence is not complete. This may be because the level of divergence among the groups is too extensive to reconstruct their evolutionary relationships in either or both of the trees (Nei and Kumar 2000). Another (not mutually exclusive) possibility is that the 5′-variable region may be too short to construct a reliable tree (Nei and Kumar 2000). The conserved units comprising the 3′-variable region, the intergenic region, and the 5′-variable region are maintained within a group in almost all the cases (fig. 4). There are, however, some exceptions in their one-to-one correspondence. One example is present in the 3′-variable region of “B”; there are two types, “B1” and “B2”, which are paired with “e” and “h” groups of the 5′-variable region, respectively. A similar situation is found in the 5′-variable region of “k”. “k” is paired with “G” group of the 3′-variable region in most cases. However, in SAU0400/SAV0440–SAU0401/SAV0441 of N315/Mu50, “k” is paired with “C”. Such imperfect conservation of the unit suggests the involvement of mechanisms other than the homologous recombination at the central conserved region. Illegitimate recombination that requires only short or no homology may have taken place at their intergenic region. The putative recombination points lie close to the 5′ end in both cases. The 3′-variable region of “C” group is involved in another exceptional pair: “C” is paired with “k” in SAU0400/SAV0440–SAU0401/SAV0441 of N315/Mu50 as mentioned above but paired with “g” in SAA0411–SAA0412p in USA300 (fig. 4, gray connecting lines; fig. 6). The prototype of the 5′-variable region of SAA0412p, a pseudogene, could be “k”. Fast accumulation of mutations in the 5′-variable region of the ancestor of SAA0412p may have led to its erroneous grouping into “g”.
Sequence similarity in their intergenic regions (fig. 5) and the common occurrence of the “H–j” pair downstream of them (fig. 6) support this hypothesis.
One prediction from our diversification model is that the 5′-variable region at the upstream end of this cluster and the 3′-variable region at the downstream end should be conserved among strains. In the case of the lpl cluster, conservation of “k” at the upstream end is consistent with this prediction (fig. 6). However, the 3′-variable region at the downstream end is not conserved (fig. 6), which indicates the involvement of a mechanism other than our model. Illegitimate recombination could be involved in the diversification of the downstream end. Interestingly, such a recombination event that involves a long homology at one joint and a short homology at the other joint has been reported in various bacteria (Kusano et al. 1997; de Vries and Wackernagel 2002; Prudhomme et al. 2002).
The present model attempts to explain the generation of a novel paralog by changing the combination of the two variable regions. It does not address the issue of the origin of the sequence diversity among these groups. The intergroup sequence diversity contrasts with the intragroup sequence conservation. Thus, it seems reasonable to assume that these represent two distinct processes. The genomic island carrying this lpl cluster is likely to have been acquired by an ancestor of S. aureus because this island is common to this species but has not been found in any other Staphylococcus species (Baba et al. 2004). The diverse repertoire of the lpl genes may have been formed before the island was acquired; yet, we do not know the precise molecular mechanisms generating the diverse repertoire.
Bacterial surface proteins are often more variable in their amino acid sequence than proteins encoded by housekeeping genes, presumably due to diversifying selection exerted by the host immune system (Caporale 2003). In S. aureus, the IgG-binding protein A gene (spa) and seven S. aureus surface protein genes (sas genes) are known to be so and have been used for strain typing (Shopsin et al. 1999; Mazmanian et al. 2001). In order to examine the evolutionary rate of the lpl genes, we chose SAU0404 (in N315), SAA0418 (in USA300), and SAR0444 (in MRSA252) as a possible orthologous gene set in the lpl cluster (figs. 4 and 6), and compared their evolutionary distances with those of the seven concatenated housekeeping genes (arcC, aroE, glpF, gmk, pta, tpi, and yqiL) used in multilocus sequence typing (Enright et al. 2000). The distances calculated for the lpl genes are much longer than those calculated for the concatenated sequences of the seven housekeeping genes: the average Kimura's two-parameter distance (Kimura 1980) for the lpl genes is 0.066 compared with 0.0077 for the housekeeping genes. Phylogenies based on these distances are presented in supplementary figure S1 (Supplementary Material online). This result indicates that the lpl genes have a relatively fast diversification rate. This fast sequence diversification, together with shuffling via recombination as proposed in our model, has very likely contributed to variability in this cluster.
The sequence divergence/conservation was detected using the whole sequences of the lpl region in this study. One might expect that a small portion might undergo gene conversion between different groups. To further explore this possibility, we performed detailed pairwise nucleotide sequence comparisons using Blast2 (Tatusova and Madden 1999). These comparisons detected several examples indicating gene conversion (Tsuru T, Kobayashi I, unpublished data). However, the contribution of such gene conversion seems so small that it did not affect the clear grouping in figure 4. Likewise, the presence of significantly related lpl homologs in loci of S. aureus other than νSaα and in some other Staphylococcus species leaves the possibility of interloci recombination and interspecies recombination. The same detailed Blast2 analysis, however, did not detect any sequence homology indicating such interactions; the maximal length of sequence homology with 95% identity or more was only 48 bp. Therefore, it is likely that the lpl genes of S. aureus have evolved separately at each locus, as was suggested by phylogeny analysis using full-length sequences in supplementary figure S1 (Supplementary Material online).
The sequences of the central conserved region defined by us are not always completely identical to each other along the entire length (table 1 and supplementary fig. S3 [Supplementary Material online]). This suggests that the precise recombination point within the central conserved region varies from event to event, because different recombination points would explain the generation of sequence diversity in the central conserved region. In this study, we were able to estimate the recombination site for the two indels found among USA300, COL, and NCTC8325 (fig. 6). In both these cases, the detected recombination points resided within the central conserved region, but their positions were different from each other. If we examined more sequences and could find recombination events between them, we might be able to test the above hypothesis. In addition, comparison of the sequences of very closely related strains will enable us to determine the position of recombination more precisely. Recently, entire genome sequences have become available for four more S. aureus strains: JH1 and JH9 (Mwangi et al. 2007), USA300-TCH156 (Highlander et al. 2007), and Newman (Baba et al. 2008). Additionally, ten genomes of USA300 derivatives were examined by comparative genome sequence analysis (Kennedy et al. 2008). These are closely related to each other or to the other sequenced strains.
Although we studied the sequences of these strains to obtain more examples of recombination events, we could not obtain any additional insight into the lpl gene cluster: JH1 and JH9 are almost identical to N315/Mu50 with only a few point mutations throughout the relevant region, and USA300-TCH156 and Newman are almost identical to USA300. From this analysis, we concluded that testing the above hypothesis would require many more genome sequences of appropriate divergence.
The frequent recombination at the central conserved region of the lpl genes on νSaα suggests, on the other hand, that an intrastrain homogenization process might occur in this region. However, we could not detect such a tendency in the phylogenetic analysis using the sequences of the central conserved region (data not shown).
Why has the central region been conserved? One possibility is that this region has been maintained because of its functional importance. At present, neither a functional role for the central conserved region nor a physiological role of these lpl paralogs has been discovered. Generally, bacterial lipoproteins are involved in cellular processes such as antibiotic resistance, adhesion to host cells or other bacterial cells, transport of substances, or intercellular communication (Sibbald et al. 2006). Several lipoproteins are recognized as antigens by the host's immune system, and they thus affect bacterial survival and pathogenicity (Henderson et al. 1996). Antigenic variation systems involving lipoproteins have been identified and characterized in Borrelia (Haake 2000; Norris 2006), Porphyromonas (Hall et al. 2005), and Mycoplasma (Lysnyansky et al. 2001; Glew et al. 2002; Ron et al. 2002). Tandemly encoded lipoproteins such as vsp lipoproteins in Borrelia hermsii (Restrepo et al. 1994; Barbour and Restrepo 2000) or p35 family lipoproteins in Mycoplasma penetrans (Neyrolles, Chambaud, et al. 1999; Neyrolles, Eliane, et al. 1999) cause antigenic variation.
We do not know whether the interstrain variation in the lpl gene arrangement represents a type of antigenic variation, but we think it is likely because this explains why many copies of these paralogs have been maintained (Jordan et al. 2001; Hooper and Berg 2003). The relatively high evolutionary rate observed in this cluster also supports this view. If this is the case, the central region may be conserved for its function as a recombination site for programmed diversification, similarly to DNA repeats in some other antigenic variation systems.
Shuffling of two variable regions may contribute to alteration of sets of Lpl proteins on the cell surface, partly because replacement of the signal peptides in the 5′-variable region will affect efficiency of secretion of each lipoprotein (Sibbald et al. 2006). The wide diversity in the sequence of the intergenic region including a putative ribosome-binding site (fig. 5) might contribute to diversity in expression efficiency of these genes. Recent transcriptome analyses using the sequenced strains have reported expression of some of their lpl genes (http://www.bioinformatics.org/sammd/; Sobral et al. 2007), although this has not been confirmed by proteome analyses (Nandakumar et al. 2005; Gatlin et al. 2006). How their expression is regulated has not been elucidated. Precise analysis to identify their differential expression within a strain and between strains is necessary to understand their biological role.
A prominent feature of our model of the diversification process is the presence of the conserved unit encompassing adjacent ORFs. Mosaic gene formation by intergenomic and intragenomic recombination has been reported in various bacteria for antibiotic resistance genes (Maiden 1998; Normark BH and Normark S 2002) and antigen genes (Bessen and Hollingshead 1994; Hobbs et al. 1998; Lachenauer et al. 2000).
For allelic variation in the por gene for porin protein in Neisseria gonorrhoeae, homologous recombination at a partially conserved sequence within the gene is proposed to be responsible for diversity (Cooke et al. 1998; Fudyk et al. 1999; Hamilton and Dillard 2006). However, a linkage of segments beyond adjacent ORFs has not been analyzed so far. The present characteristic mode of tandem paralog diversification, maintaining the 3′ part of a gene, the intergenic region, and the 5′ part of its downstream gene as a unit of evolution, is, to our knowledge, novel among studies of paralog rearrangements.
How general is the presented model of diversification among tandemly repeated genes? A further search for other tandem paralog clusters in the S. aureus genome revealed another tandem gene cluster, the SA1317 homologs, with a gene structure in which a highly conserved sequence is sandwiched between variable sequences (fig. 8). A significant linkage between the variable regions encompassing the genes was also identified in this cluster (figs. 9 and 10), although whether this linkage is formed by the present diversification mechanism is not clear because no evidence of rearrangement was found there. We also examined other well-characterized tandem paralog genes in other bacteria. In the p35 lipoprotein genes in M. penetrans, a conserved region is located at their 5′ end encoding a signal peptide (Sasaki et al. 2002). In the vsp genes in Mycoplasma bovis, a conserved sequence is found upstream of the ORF (Lysnyansky et al. 2001). In both these cases, a different diversification process from our model must be operating.
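The evolutionary-rate comparison earlier in this Discussion relies on Kimura's two-parameter distance. A minimal sketch of that distance for two aligned sequences is given below; the function name and the toy sequences are assumptions for illustration, columns with gaps or ambiguity codes are simply skipped, and the sketch is not the software used in the study.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES

def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned DNA sequences.

    d = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q)), where P and Q are the
    proportions of transitions and transversions among compared sites.
    """
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in BASES or b not in BASES:
            continue                      # skip gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1              # A<->G or C<->T
        else:
            transversions += 1            # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError when the sequences are too diverged.
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))

# Hypothetical toy pair: one transition (A -> G) in ten sites.
distance = k2p_distance("AAAAAAAAAA", "GAAAAAAAAA")
print(round(distance, 4))
```

Averaging such pairwise values over a gene set gives figures comparable in kind to the 0.066 and 0.0077 reported above, although the exact numbers depend on the alignment and on how gaps are handled.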

Conclusion

We tend to regard an ORF as a unit of gene evolution as well as a unit of gene expression and function. The present work with a tandem paralog cluster identified a unit consisting of the 3′ half of a gene, a downstream intergenic region, and the 5′ half of a downstream gene as the unit of evolution. This is because the central region of the gene provides a site at which the 5′ half and the 3′ half of a gene recombine to generate variation.

Supplementary Material

Supplementary table S1, figures S1–S3, and legends for the supplementary figures are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).

Review 1.  Duplication and divergence: the evolution of new genes and old ideas.

Authors:  John S Taylor; Jeroen Raes
Journal:  Annu Rev Genet       Date:  2004       Impact factor: 16.830

2.  Whole genome sequence of Staphylococcus saprophyticus reveals the pathogenesis of uncomplicated urinary tract infection.

Authors:  Makoto Kuroda; Atsushi Yamashita; Hideki Hirakawa; Miyuki Kumano; Kazuya Morikawa; Masato Higashide; Atsushi Maruyama; Yumiko Inose; Kimio Matoba; Hidehiro Toh; Satoru Kuhara; Masahira Hattori; Toshiko Ohta
Journal:  Proc Natl Acad Sci U S A       Date:  2005-08-31       Impact factor: 11.205

3.  Insights on evolution of virulence and resistance from the complete genome analysis of an early methicillin-resistant Staphylococcus aureus strain and a biofilm-producing methicillin-resistant Staphylococcus epidermidis strain.

Authors:  Steven R Gill; Derrick E Fouts; Gordon L Archer; Emmanuel F Mongodin; Robert T Deboy; Jacques Ravel; Ian T Paulsen; James F Kolonay; Lauren Brinkac; Mauren Beanan; Robert J Dodson; Sean C Daugherty; Ramana Madupu; Samuel V Angiuoli; A Scott Durkin; Daniel H Haft; Jessica Vamathevan; Hoda Khouri; Terry Utterback; Chris Lee; George Dimitrov; Lingxia Jiang; Haiying Qin; Jan Weidman; Kevin Tran; Kathy Kang; Ioana R Hance; Karen E Nelson; Claire M Fraser
Journal:  J Bacteriol       Date:  2005-04       Impact factor: 3.490

4.  Proteome analysis of membrane and cell wall associated proteins from Staphylococcus aureus.

Authors:  Renu Nandakumar; M P Nandakumar; Mark R Marten; Julia M Ross
Journal:  J Proteome Res       Date:  2005 Mar-Apr       Impact factor: 4.466

5.  Sequence diversity and antigenic variation at the rag locus of Porphyromonas gingivalis.

Authors:  Lucinda M C Hall; Stuart C Fawell; Xiaoju Shi; Marie-Claire Faray-Kele; Joseph Aduse-Opoku; Robert A Whiley; Michael A Curtis
Journal:  Infect Immun       Date:  2005-07       Impact factor: 3.441

6.  Evaluation of protein A gene polymorphic region DNA sequencing for typing of Staphylococcus aureus strains.

Authors:  B Shopsin; M Gomez; S O Montgomery; D H Smith; M Waddington; D E Dodge; D A Bost; M Riehman; S Naidich; B N Kreiswirth
Journal:  J Clin Microbiol       Date:  1999-11       Impact factor: 5.948

7.  Antigenic characterization and cytolocalization of P35, the major Mycoplasma penetrans antigen.

Authors:  Olivier Neyrolles; Jean-Pierre Eliane; Stéphane Ferris; Regina Ayr Florio da Cunha; Marie-Christine Prevost; Elmostafa Bahraoui; Alain Blanchard
Journal:  Microbiology       Date:  1999-02       Impact factor: 2.777

8.  Genetic diversity and mosaicism at the por locus of Neisseria gonorrhoeae.

Authors:  T C Fudyk; I W Maclean; J N Simonsen; E N Njagi; J Kimani; R C Brunham; F A Plummer
Journal:  J Bacteriol       Date:  1999-09       Impact factor: 3.490

9.  Whole-genome sequencing of staphylococcus haemolyticus uncovers the extreme plasticity of its genome and the evolution of human-colonizing staphylococcal species.

Authors:  Fumihiko Takeuchi; Shinya Watanabe; Tadashi Baba; Harumi Yuzawa; Teruyo Ito; Yuh Morimoto; Makoto Kuroda; Longzhu Cui; Mikio Takahashi; Akiho Ankai; Shin-ichi Baba; Shigehiro Fukui; Jean C Lee; Keiichi Hiramatsu
Journal:  J Bacteriol       Date:  2005-11       Impact factor: 3.490

Review 10.  Horizontal genetic exchange, evolution, and spread of antibiotic resistance in bacteria.

Authors:  M C Maiden
Journal:  Clin Infect Dis       Date:  1998-08       Impact factor: 9.079

View more
