Simple sequence repeats in Haemophilus influenzae.

Peter M Power, W A Sweetman, N J Gallacher, M R Woodhall, G A Kumar, E R Moxon, D W Hood.

Abstract

Simple sequence repeats (SSRs) of DNA are subject to high rates of mutation and are important mediators of adaptation in Haemophilus influenzae. Previous studies of the Rd KW20 genome identified the primacy of tetranucleotide SSRs in mediating phase variation (the rapid reversible switching of gene expression) of surface exposed structures such as lipopolysaccharide. The recent sequencing of the genomes of multiple strains of H. influenzae allowed the detailed comparison of SSRs (repeat units of one to nine nucleotides in length) across four complete H. influenzae genomes, followed by comparison with a further 12 genomes when they became available. The SSR loci were broadly classified into three groups: (1) those that did not vary; (2) those for which some variation between strains was observed but could not be linked to variation of gene expression; and (3) those that both varied and were located in regions consistent with mediating phase variable gene expression. Comparative analysis of 988 SSR-associated loci confirmed that tetranucleotide repeats were the major mediators of phase variation and extended the repertoire of known tetranucleotide SSR loci by identifying ten previously uncharacterised tetranucleotide SSR loci with the potential to mediate phase variation, which were unequally distributed across the H. influenzae pan-genome. Further, analysis of non-tetranucleotide SSRs in the 16 strains revealed a number of mononucleotide, dinucleotide, pentanucleotide, heptanucleotide, and octanucleotide SSRs that were consistent with mediating phase variation. This study substantiates previous findings as to the important role that tetranucleotide SSRs play in H. influenzae biology. Two Brazilian isolates showed the most variation in their complement of SSRs, suggesting the possibility of geographic and phenotypic influences on SSR distribution.

Year:  2008        PMID: 19095084      PMCID: PMC2651432          DOI: 10.1016/j.meegid.2008.11.006

Source DB:  PubMed          Journal:  Infect Genet Evol        ISSN: 1567-1348            Impact factor:   3.342


Introduction

Haemophilus influenzae (Hi), a common commensal bacterium of the upper respiratory tract of humans, is an important cause of diseases that include otitis media, pneumonia, meningitis, and septicaemia. The genome sequence of Hi strain Rd KW20, the first completed for a free-living organism, revealed a high prevalence of simple sequence repeats (SSRs) (Fleischmann et al., 1995; Hood et al., 1996b). SSRs are usually defined as direct, perfect DNA repeats consisting of repeat units (the smallest repeating DNA motif of the SSR) of between one and nine nucleotides in length. In many organisms, taking into account the nucleotide sequence composition of their respective genomes, SSRs are found less frequently than predicted (Mrázek et al., 2007). SSRs are hypermutable (e.g. tetranucleotide SSRs lose and gain units at a rate of 1 × 10⁻⁴ per generation (De Bolle et al., 2000) compared with a basal mutation rate of approximately 1 × 10⁻⁹) and, therefore, it has been suggested that their decreased prevalence reflects natural selection because the higher rates of mutation of these loci would be more often detrimental to fitness than beneficial. However, in some prokaryotes, predominantly host-adapted organisms, some SSRs are found in greater numbers than would be expected by chance (Mrázek et al., 2007). Analysis of SSRs in the Hi strain Rd KW20 genome revealed that long tracts of tetranucleotides were over-represented (Hood et al., 1996b). A striking feature of these tetranucleotide SSRs is their frequent association with genes whose functions are associated with microbial-host interactions relevant to commensal and virulence behaviour (Hood et al., 1996b). SSRs can be located in promoter regions or within open reading frames and changes in their length can result in the random, high frequency, reversible loss, gain or modulation of gene expression (phase variation).
Since these regions of localised hypermutation, often termed ‘contingency loci’, can each independently result in altered gene expression, a repertoire of phenotypic variants is generated (Moxon et al., 2006). Through selection of these variants, the adaptation of the bacterial population to changes in the host environment is facilitated. It has been suggested that this strategy has particular survival value when bacterial populations are subjected to periodic selection during transmission between genetically distinct hosts (Wolf et al., 2005). The advent of the genomic sequencing of multiple strains of the same species has revealed that the genomic sequence of a particular strain may not reflect the diversity and variety of the entire species. The term ‘pan-genome’ has been used to describe the superset of genes of a species (Tettelin et al., 2005). The characterisation of a pan-genome describes the core (genes contained in all genomes of a species) and dispensable genes (those genes absent from one or more strains or unique to each strain) of a species. We suggest that the concept of a pan-genome should also include explicit recognition of differences in gene sequence, organisation and variation that may better describe the adaptive and evolutionary potential of the species (Caporale, 2006). In this study, we have sought to identify the potential repertoire of variation mediated by SSRs in the currently available Hi pan-genome. Prior to this study, our understanding of SSRs in Hi has been predominantly based on analysis of the strain Rd KW20 genome sequence. Whilst selective studies of other Hi strains have provided some evidence to suggest variation in the number, location and nature of the SSRs compared to that seen in the Rd KW20 genome (Fox et al., 2005; van Belkum et al., 1997), the recent availability of a number of completely sequenced Hi genomes has provided us with the opportunity for a much more extensive analysis of SSRs in Hi. 
We describe in detail 223 SSRs identified in the four complete genome sequences of strains Rd KW20, 86-028NP, R2846 and R2866 plus 765 SSRs identified in the complete or partial genome sequences of a further 12 Hi strains. Previous reports of SSRs in Hi have been predominantly of tetranucleotide repeats. From these 16 genomes we describe 199 tetranucleotide SSRs in 28 different loci, including 10 which have not previously been described. However, we have also identified a number of mononucleotide, dinucleotide, pentanucleotide, heptanucleotide, and octanucleotide SSRs with a putative role in phase variable gene regulation. A preponderance of the novel SSRs identified occur in only two strains, F3031 and F3043 of the Hi biogroup aegyptius, suggesting that the distribution of SSRs across the Hi pan-genome may be linked with geographic and phenotypic profiles.

Materials and methods

The four Hi genome sequences that were available at the commencement of this study formed the basis of the Hi four genome study. Details of these genomes are given in Table 1. A list of SSRs with repeat unit lengths of between one and nine nucleotides, and a number of repeat unit iterations above an empirically determined threshold value (see below), was compiled for each of these genomes using a Perl script that we developed and named HiSSRfinder. The results from HiSSRfinder were used to generate an annotated EMBL file for each of the genomes, which then allowed manual analysis and curation of the SSRs identified in each genome using the Artemis and ACT genome viewing, annotation and comparison programs (Rutherford et al., 2000). Each SSR was manually evaluated with regard to its position relative to open reading frames (ORFs), whether an equivalent SSR was present in the other three strains, and whether there was any variation in the SSR between strains (see Supplementary Table 1).
Table 1

Characteristics of genome sequences used in this study.

Strain | GenBank accession number | Source strain information (a) | Reference (b)

Four genome study
Rd KW20 | NC_000907 | Acapsulate serotype d, nasopharynx, USA | Fleischmann et al. (1995)
86-028NP | NC_007146 | NTHi, isolated from a child with otitis media, USA | Harrison et al. (2005)
R2846 (c) | | NTHi, middle ear fluid of a child with acute otitis media | http://www.genome.washington.edu/UWGC/Hinf/index.cfm (strain 12, Barenkamp and Leininger, 1992)
R2866 (c) | | NTHi, blood of a child with meningitis | http://www.genome.washington.edu/UWGC/Hinf/index.cfm (Int1, Nizet et al., 1996)

Twelve genome study
PittGG | NC_009567 | NTHi, external ear discharge from otorrhea, USA | Hogg et al. (2007)
PittEE | NC_009566 | NTHi, chronic otitis media with effusion, USA | Hogg et al. (2007)
R3021 (d) | NZ_AAZJ00000000 | NTHi, nasopharynx of a healthy individual, USA | Hogg et al. (2007)
PittII (d) | NZ_AAZI00000000 | NTHi, chronic otitis media with effusion, USA | Hogg et al. (2007)
PittHH (d) | NZ_AAZH00000000 | NTHi, chronic otitis media with effusion, USA | Hogg et al. (2007)
PittAA (e) | NZ_AAZG00000000 | NTHi, middle ear effusion of a child with chronic otitis media, USA | Hogg et al. (2007)
3655 (d) | NZ_AAZF00000000 | NTHi, middle ear effusion of a child in Missouri with acute otitis media, USA | Hogg et al. (2007)
22.4.21 (d) | NZ_AAZE00000000 | NTHi, nasopharynx of a healthy individual, USA | Hogg et al. (2007)
22.1.21 (d) | NZ_AAZD00000000 | NTHi, nasopharynx of a healthy individual, USA | Hogg et al. (2007)
F3031 (d,e) | NZ_AADO00000000 | Hi biogroup aegyptius, Brazilian purpuric fever, Brazil | http://www.sanger.ac.uk/sequencing/Haemophilus/influenzae/F3031/
F3043 (d,e) | NZ_AADP00000000 | Hi biogroup aegyptius, purulent conjunctivitis, Brazil | http://www.sanger.ac.uk/sequencing/Haemophilus/influenzae/F3043/
10810 | | Serotype b, meningitis isolate, UK | http://www.sanger.ac.uk/Projects/H_influenzae/

The four strains for which the complete genome sequences were available at the commencement of this study are listed in the upper portion of the table; the further twelve strains for which full or partial sequences were later acquired are detailed in the lower portion.

(a) Lists the serotype, the site of isolation and associated clinical features, and country of isolation.

(b) Where there is no appropriate publication available, the www address from which sequences were obtained is listed.

(c) The complete genome sequences for these strains were made available to us courtesy of Dr Alice Erwin.

(d) Incomplete genomes.

(e) The genome sequences for these strains were made available to us courtesy of Prof. Simon Kroll and the Wellcome Trust Sanger Sequencing Centre.

The threshold values, i.e. the minimum number of repeat units required to be present in an uninterrupted tandem arrangement within a genome in order for that sequence to be counted as an SSR and included in further analysis, were determined for each repeat unit length from a comparison of the number of SSRs of different lengths and the frequency of polymorphisms between the four genomes (see Section 3 and Table 2). The thresholds determined in this way for this study were as follows (repeat unit length, threshold number of repeat units): 1, >8; 2, >4; 3, >3; 4, >2; 5, >2; 6, >2; 7, >2; 8, >2; and 9, >2.
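The thresholds listed above are enough to sketch a simple search for perfect tandem repeats. The Python example below is only an illustration and is not the authors' HiSSRfinder script, which was written in Perl and is not reproduced here. The function names, and the decision to skip motifs that are themselves repeats of a shorter unit (so that a mononucleotide tract is not also reported as a dinucleotide SSR), are assumptions.

```python
import re

# Minimum number of tandem repeat units for each repeat unit length,
# taken from the thresholds in the text (">8" means at least 9, and so on).
THRESHOLDS = {1: 9, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3, 7: 3, 8: 3, 9: 3}


def is_primitive(unit):
    """True if the unit is not itself a repeat of a shorter motif (e.g. 'ATAT')."""
    n = len(unit)
    return not any(n % k == 0 and unit == unit[:k] * (n // k) for k in range(1, n))


def find_ssrs(seq, thresholds=THRESHOLDS):
    """Return (start, unit, copies) for each perfect tandem repeat at or above threshold.

    Hypothetical helper: scans one strand only and reports 0-based start positions.
    """
    seq = seq.upper()
    hits = []
    for unit_len, min_copies in sorted(thresholds.items()):
        # One captured unit followed by at least (min_copies - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            if is_primitive(unit):
                hits.append((match.start(), unit, len(match.group(0)) // unit_len))
    return hits
```

For example, `find_ssrs("GG" + "CAAT" * 5 + "GTC" + "A" * 9 + "C")` reports one (CAAT)5 tract and one (A)9 tract, whereas two copies of CAAT fall below the tetranucleotide threshold and are ignored.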
Table 2

Determination of the minimum threshold for SSR detection in H. influenzae genome sequences, with reference to strain Rd KW20.

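Repeat unit length (No.) | Prefix | Threshold value (a) | Maximum expected length (b) | SSRs in Rd KW20 at threshold value −1 (c) | SSRs in Rd KW20 at threshold value (c) | SSRs in Rd KW20 at threshold value +1 (c)
1 | Mono | 9 | 13 | 163 | 18 | 2
2 | Di | 5 | 5 | 88 | 5 | 0
3 | Tri | 4 | 4 | 814 | 13 | 1
4 | Tetra | 3 | 3 | 38 | 12 | 12
5 | Penta | 3 | 2 | 1676 | 5 | 2
6 | Hexa | 3 | ND | 940 | 5 | 1
7 | Hepta | 3 | ND | 126 | 0 | 0
8 | Octa | 3 | ND | 45 | 0 | 0
9 | Nona | 3 | ND | 34 | 2 | 1
Total | | | | 3924 | 60 | 19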

This table sets out the minimum number of repeat units (threshold value) for the identification of SSRs of each repeat unit length used in this study. ND: not determined.

(a) The threshold value is the minimum number of repeat units required to be present in an uninterrupted tandem arrangement in order for that sequence to be defined as an SSR.

(b) Maximum expected length is the maximum number of repeat units of each designated length that would be expected to occur in the Rd KW20 genome as predicted by hidden Markov model analysis (Paul Swift, personal communication).

(c) The values in columns Threshold value −1 and Threshold value +1 show the number of SSRs identified of each motif length in the Rd KW20 genome if the threshold value is decreased or increased by one unit, respectively.
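The totals in the last row of Table 2 can be recomputed from the per-length counts. The short Python check below uses only numbers from the table; the variable and function names are illustrative.

```python
# SSR counts in the Rd KW20 genome from Table 2:
# repeat unit length -> (threshold value - 1, threshold value, threshold value + 1)
TABLE2_COUNTS = {
    1: (163, 18, 2),
    2: (88, 5, 0),
    3: (814, 13, 1),
    4: (38, 12, 12),
    5: (1676, 5, 2),
    6: (940, 5, 1),
    7: (126, 0, 0),
    8: (45, 0, 0),
    9: (34, 2, 1),
}


def column_totals(counts):
    """Sum each of the three threshold columns across all repeat unit lengths."""
    return tuple(sum(column) for column in zip(*counts.values()))


print(column_totals(TABLE2_COUNTS))  # (3924, 60, 19)
```

The three totals match the 3924, 60 and 19 SSRs quoted for strain Rd KW20. Tetranucleotide repeats are the only class whose count is unchanged when the threshold is raised by one.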

A database, named SSR_Hi_4G, was constructed to contain the nucleotide sequences of each of the tetranucleotide SSRs identified in the four genome study, together with their 500 bp upstream and downstream flanking sequences. This database was assembled using the formatdb program (SSR_Hi_4G is available at http://users.ox.ac.uk/~oxmicro/ssrblast.html; the formatdb program is available at http://www.ncbi.nlm.nih.gov/blast/download.shtml). A second collection of Hi genomes (herein termed the further 12 genome study) was then examined using the information from the initial four genome survey to guide analysis. Details of these additional genomes are given in Table 1. The SSRs in these genomes were identified using the HiSSRfinder script as described above. In order to determine which of the tetranucleotide SSRs identified in the 12 further genome sequences were equivalent to the tetranucleotide SSRs previously identified in the four genome study, each was compared to the SSR_Hi_4G database using the BLASTN program. Data and boxplot analyses of the SSR data were performed using the R statistical package (http://www.r-project.org/) and Microsoft Excel. Transmembrane helices were predicted using the TMHMM web server v2.0 (http://www.cbs.dtu.dk/services/TMHMM; Moxon et al., 2006; Sonnhammer et al., 1998).
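The following Python sketch shows how an SSR and its 500 bp flanking sequences might be cut from a genome and written as a FASTA record, ready to be formatted as a BLAST database. The text does not describe how SSR_Hi_4G entries were actually generated, so the function names, the 0-based coordinates and the treatment of the chromosome as linear are assumptions.

```python
def ssr_with_flanks(genome, start, end, flank=500):
    """Return the SSR at genome[start:end] plus up to `flank` bp on each side.

    Assumption: coordinates are 0-based and half-open, and the sequence is
    treated as linear, so flanks are simply truncated at the sequence ends.
    """
    left = max(0, start - flank)
    right = min(len(genome), end + flank)
    return genome[left:right]


def to_fasta(name, seq, width=60):
    """Format one sequence as a FASTA record with fixed-width lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">%s\n%s\n" % (name, "\n".join(lines))
```

Each record produced this way carries enough flanking sequence for a BLASTN hit to anchor on the unique sequence around the repeat as well as on the repeat tract itself.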

Results

Determination of threshold values used to identify SSRs in this study

Previous studies on Hi have described the SSRs present within the genome of strain Rd KW20 (Hood et al., 1996b). Our aim was to extend the analysis of SSRs by comprehensively investigating the repertoire present in the four complete Hi genome sequences that were available for different strains of Hi at the commencement of this study (Four Genome Analysis; see Table 1). In this study, SSRs are defined as tandem repeats of a repeat unit that consists of between one and nine nucleotides. In order to attain maximum sensitivity for the detection of SSRs the threshold values (see Section 2) were set as low as practically possible. Our rationale for adopting threshold values is described in Table 2, which shows the number of SSRs identified in the genome of strain Rd KW20 at the threshold values adopted and also the number of SSRs that would have been included in the subsequent analysis if the threshold value had been set one unit higher or lower for each repeat unit length. It can be seen from Table 2 that, for all but the tetranucleotide SSRs, increasing the threshold value by one substantially decreased the number of SSRs detected and resulted in a total of only 19 SSRs being identified in the Rd KW20 genome. In contrast, decreasing the threshold by one resulted in a large number of SSRs being identified which would be impractical for manual analysis (3924 in strain Rd KW20). At the adopted threshold values, 60 SSRs were identified in strain Rd KW20. The thresholds used in this study for all repeat unit lengths included at least all of the statistically unexpected SSRs determined by hidden Markov model analysis of the Hi genomes (see Table 2; Paul Swift, Oxford, personal communication). This further substantiates the threshold values chosen as being permissive for having a high degree of sensitivity in identifying SSRs with potential roles in mediating phase variation. 
Additionally, if SSRs were found to be above threshold in at least one genome the corresponding regions in the other genomes were also characterised. A total of 223 SSRs were identified in the four genome sequences when the threshold values described in Table 2 were applied and these 223 SSRs are summarised in Table 3. Comparison of the SSRs across the four Hi genomes for each of the repeat unit lengths, reveals that their numbers are not substantially different between the strains. Also, the total number of repeats found within any one strain is not substantially different from the others (total number of SSRs ranged between 53 and 60), despite the differences in the origin, associated disease and date of isolation of the four strains (see Table 1).
Table 3

The frequency and location of SSRs in the genome sequences of four H. influenzae strains.

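Repeat unit length (a) | Threshold value (b) | Genome (c): 86-028NP | Rd KW20 | R2846 | R2866
1 | 9 | 16 (7) | 18 (7) | 13 (5) | 17 (8)
2 | 5 | 6 (4) | 5 (3) | 4 (4) | 5 (3)
3 | 4 | 8 (6) | 13 (11) | 11 (10) | 9 (7)
4 | 4 | 14 (14) | 12 (12) | 13 (13) | 12 (12)
5 | 3 | 3 (2) | 5 (4) | 3 (3) | 3 (3)
6 | 3 | 4 (3) | 5 (4) | 3 (3) | 3 (3)
7 | 3 | 3 (0) | 0 (0) | 4 (1) | 2 (0)
8 | 3 | 1 (0) | 0 (0) | 1 (0) | 1 (0)
9 | 3 | 1 (0) | 2 (0) | 1 (1) | 2 (1)
Total | | 56 (36) | 60 (40) | 53 (39) | 54 (37)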

The frequency of SSRs, of each repeat motif length, is given for each of the four genomes in the four genome analysis.

(a) Repeat unit length: the number of nucleotides that compose a single repeat unit.

(b) Threshold value is the minimum number of repeat units required to be present in an uninterrupted tandem arrangement in order for that sequence to be defined as an SSR.

(c) Numbers in bold indicate the number of each type of SSR within each genome; numbers in parentheses show the number of each type of SSR located within predicted ORFs for each genome.

SSRs have previously been associated with hypermutation, as loss or gain of repeat units occurs at high frequency due to replication slippage (Moxon et al., 2006). Loss or gain of repeat units from an SSR located within an ORF may result in a frameshift mutation if the length of the repeat unit is not a multiple of three. The position of each of the 223 SSRs identified in this study was manually curated and the proportion of SSRs that were located within ORFs was recorded (see Table 3). Hi has a coding density of approximately 88% of the genome sequence and, of the repeats examined, only the trinucleotide (83%), tetranucleotide (100%) and hexanucleotide (87%) SSRs occur within ORFs at approximately this frequency, whilst SSRs with repeat unit lengths of one, two, five and seven nucleotides were all found to be located within ORFs with a frequency of less than 88%. SSRs with longer repeat unit lengths were not included in this analysis due to their low frequency in the genomes. This suggests that the selective pressure against trinucleotide and hexanucleotide SSRs occurring within an ORF may not be as high as that against SSRs of other repeat unit lengths whose expansion or contraction would result in inactivation of an ORF by frameshift mutation. It is noteworthy that tetranucleotide SSRs are found exclusively within ORFs, consistent with the known importance of this class of repeat in mediating phase variable expression at contingency loci in Hi.
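The percentages quoted above for trinucleotide, tetranucleotide and hexanucleotide SSRs can be reproduced from the counts in Table 3, and the frameshift argument reduces to a divisibility check. The Python sketch below is illustrative; the function names are assumptions, and only the three repeat classes whose percentages are stated in the text are included.

```python
# (total SSRs, SSRs within ORFs) summed over the four genomes in Table 3,
# in the column order 86-028NP, Rd KW20, R2846, R2866.
TABLE3 = {
    3: (8 + 13 + 11 + 9, 6 + 11 + 10 + 7),
    4: (14 + 12 + 13 + 12, 14 + 12 + 13 + 12),
    6: (4 + 5 + 3 + 3, 3 + 4 + 3 + 3),
}


def percent_in_orfs(unit_len, table=TABLE3):
    """Percentage of SSRs of this repeat unit length that lie within ORFs."""
    total, in_orf = table[unit_len]
    return round(100 * in_orf / total)


def shifts_frame(unit_len, units_changed):
    """True if gaining or losing `units_changed` repeat units alters the reading frame."""
    return (unit_len * units_changed) % 3 != 0


for unit_len in (3, 4, 6):
    print(unit_len, percent_in_orfs(unit_len), shifts_frame(unit_len, 1))
```

Under this rule a single-unit change in a trinucleotide or hexanucleotide tract keeps the reading frame, while a single-unit change in a tetranucleotide tract shifts it.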

Identification of repeat unit lengths likely to be associated with phase variable gene expression

Manual curation of each of the 223 SSRs allowed us to assess the likelihood of each SSR playing a role in modulating gene expression. Comparison of equivalent SSR loci (those located in the same relative genomic location) allowed the classification of each SSR into one of three categories: (1) SSRs that did not vary in length, sequence or position between the four genomes (invariable), (2) SSRs for which some variation between strains was observed but the variation was not considered likely to result in variation of gene expression (variable) and (3) SSRs that both varied in length and were located in regions consistent with mediating phase variation (potentially phase variable; see Table 4). Careful manual examination of each of the repeat associated loci was necessary to classify the SSRs into the above categories. Factors such as the location of the SSR within a gene, length of the SSR and replacement of a whole or partial tract of an SSR by another sequence contributed to the assessment of whether or not any observed variation in the SSR was likely to mediate phase variation. SSRs located outside ORFs were generally more difficult to assess as to their likely involvement in phase variable modulation of gene expression. SSRs have previously been shown to be mediators of phase variation through modulation of promoter activity and gene transcription (Dawid et al., 1999; Martin et al., 2005; van Ham et al., 1993), but promoter regions of individual genes often cannot be accurately defined. Thus, the influence of variation in SSRs located in non-coding regions on expression of adjacent genes is difficult to predict. The full assessment of the 223 SSRs identified within the four genomes can be found in Supplementary Table 1; a summary of the data is provided in Table 4.
Table 4

Assessment of SSR variability and potential to mediate phase variation in the four genome study.

Repeat unit length | Potentially phase variable (a) | Variable (a) | Invariable (a)
1 | 0 (0) | 24 (7) | 14 (7)
2 | 0 (0) | 7 (4) | 1 (1)
3 | 0 (0) | 12 (9) | 6 (6)
4 | 14 (14) | 0 (0) | 4 (4)
5 | 2 (2) | 3 (3) | 3 (1)
6 | 0 (0) | 8 (6) | 2 (2)
7 | 3 (0) | 2 (1) | 0 (0)
8 | 0 (0) | 1 (0) | 1 (0)
9 | 0 (0) | 3 (2) | 2 (0)

Each SSR identified in the four genome analysis was manually assessed to determine whether or not it was likely to mediate phase variable gene expression, from consideration of its position within an ORF or relative to adjacent ORFs, together with the variation observed between strains. Numbers in bold indicate the number of each type of SSR within each category; numbers in parentheses show the number of each type of SSR located within predicted ORFs for each category.

(a) This assessment assigned each SSR to one of three categories: potentially phase variable – SSRs that both varied in length and were located in positions likely to influence gene expression; variable – SSRs that showed differences in length or sequence between strains but not of a manner consistent with mediating phase variable expression; and invariable – SSRs that did not show any variation in length, position or sequence between the strains.
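The three categories can be written as a small decision function. This Python sketch is a deliberate simplification: in the study the assignment was made by manual curation that weighed SSR location, tract length and sequence context, and the two boolean inputs used here are hypothetical stand-ins for those judgements.

```python
def classify_locus(varies_between_strains, consistent_with_phase_variation):
    """Assign an SSR locus to one of the three categories used in Table 4.

    Both arguments are assumed to come from manual inspection of the locus;
    nothing here is computed from sequence data.
    """
    if not varies_between_strains:
        return "invariable"
    if consistent_with_phase_variation:
        return "potentially phase variable"
    return "variable"
```

A locus that differs between strains only by a base substitution within the tract would be recorded as `classify_locus(True, False)`, giving "variable".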

The manual classification of the SSRs into the three categories indicated that despite the considerable variation seen between strains for many of the SSRs (especially the mononucleotide SSRs), the tetranucleotide, pentanucleotide and heptanucleotide repeat tracts were the only types of SSR considered to have a potential role in mediating phase variable gene expression in these four strains of Hi (Table 4). The potentially phase variable ORFs associated with each of these types of repeat are detailed below.

Tetranucleotide SSRs identified in the Hi four genome study

A previous analysis of the Rd KW20 genome sequence identified the primacy of tetranucleotide SSR in mediating phase variation in Hi (Hood et al., 1996b). This study extends that work by comparing the tetranucleotide SSRs across four Hi genomes. We identified 18 different tetranucleotide SSR loci that are distributed fairly uniformly between genomes with each genome containing from 12 to 14 tetranucleotide SSR loci (Table 5). Eight of the tetranucleotide SSR loci were found in all four of the strains, two of the loci were found in three of the strains, three of the loci in two of the strains and five of the loci were unique to one strain (two unique loci in each of the strains Rd KW20 and 86-028NP and one unique locus in R2846).
Table 5

The sequence and number of repeat units that comprise each of the 199 tetranucleotide SSRs identified in 16 H. influenzae genomes.

SSR locus number; Associated ORF or genome location (a); Description (b); Repeat unit sequence; Strain
Strain columns: Rd KW20, R2866, R2846, 86-028NP, 22.1.21, 22.4.21, 3655, PittAA, PittHH, PittII, R3021, PittEE, PittGG, 10810, F3031, F3043
1 HI_0258 lgtC glycosyltransferase GACA 222620101312121412161283817824
2 HI_0352 lic3A lipopolysaccharide sialyltransferase CAAT 3320261919192436201431.8d1920382222
3 HI_0550 lic2A lipopolysaccharide glycosyltransferase CAAT 235251414e122511II222325201514
4 HI_1537 licA lipopolysaccharide phosphocholine transferase CAAT 1736715221815253727152451472616
5 licA2 lipopolysaccharide phosphocholine transferase CAAT 193117
6 HI_0635 hgp hemoglobin and hemoglobin–haptoglobin binding protein CCAA 21282520281684.10d411e816332924
7 HI_0661 hgp hemoglobin and hemoglobin–haptoglobin binding protein CCAA 202739126.6d1723.8d192392010
8 HI_1565 hgp hemoglobin and hemoglobin–haptoglobin binding protein CCAA 1928178.12d169301513
9 HI_0712 hgp hemoglobin and hemoglobin–haptoglobin binding protein CCAA 37
10 HI_0687 Drug/metabolite exporter TTTA 64
11 HI_1058 mod type III restriction/modification system modification methylase AGCC (TGAC)32c161420122722IS
12 Hi_1386US Putative glycosyltransferase CCAA 1681213101010714101016
13 Within cryptic yadA-like gene and 245 bp upstream of tolC-like receptor gene CAAG 25132414171311301315171314
14 r2846v6.916 pgtl putative glycosyltransferase GACA 16146135
15 r2846v6.1528c lpt3 region AGTC 141014
16 r2846v6.1683 lex2A CAAG 2417141815152416212218261431
17 r2866v6.124c lav AIDA-I/VirG/PerT family of virulence-associated autotransporters CAAG 203014141728
18 r2846v6.202 oafA O-antigen lipopolysaccharide acetylase CAAG 91481214111451187141111
19 NTHI1034 lic3B lipopolysaccharide sialyltransferase CAAT 1521101224
20 PITTII Gene encoding a YadA domain containing protein CAAG 201524
21 F3043-1499724 Gene encoding a YadA domain containing protein CAAG 19
22 F3043-196894 Gene encoding a YadA domain containing protein CAAA 1513
23 F3043-756964225 bp upstream gene encoding a YadA domain containing protein CAAA 1631
24 F3043-609747 Glycosyltransferase (family 8) with frameshift CAAT 1833
25 F3043-1083776 Gene encoding a YadA domain containing protein CAAG 23
26 F3043-1500170 SAM-dependent methyltransferase CAAT 21
27 F3043-1598734225 bp upstream formamidopyrimidine-DNA glycosylase (mutM) ATTA 9
28 F3031-112163458 bp upstream adenine specific methylase (EcoRI) and 202 bp upstream of htpX (heat shock protein) CAAG 32

Listed in this table are the 199 tetranucleotide SSRs that are associated with the 28 loci identified in this analysis of the genomes of 16 Hi strains.

(a) The associated ORF designation in the Rd KW20 genome or, when not present in the Rd KW20 genome, in the R2846, R2866 or 86-028NP genomes. SSRs from unannotated genomes are identified by the strain, a hyphen and the nucleotide number at which they are found.

(b) Genes associated with the SSR. The gene names and predicted functions are from BLAST similarity searches and the published genome annotations.

(c) The sequence of the tetranucleotide SSR in this locus is different to that associated with this gene in all other strains (5′AGCC).

(d) x…y indicates that there are x number of the tetranucleotide repeat unit, followed by an interruption, followed by y number of tetranucleotide repeat units.

(e) The presumptive position of the SSR locus is at the end of a contig and its presence or absence cannot be determined.

Two of the tetranucleotide SSR loci that we have identified in the four genome analysis have not previously been described as potential mediators of phase variation in Hi. The first of these novel loci contains 14 tandem 5′AGTC repeats and is unique to strain R2846 (starting at nucleotide 1505819; see Table 5). This SSR is found immediately downstream of the presumptive start codon of an ORF encoding a 294 aa protein with homology to the glycosyltransferase 2 family of proteins (Pfam PF00535). In Hi, phase variable glycosyltransferases are frequently involved in LPS biosynthesis (Hood et al., 1996a). In strain 86-028NP this glycosyltransferase gene is replaced by a different gene (NTHi_1053) that has high homology (e value of 1 × 10⁻¹⁴¹, BLASTN) to the phosphoethanolamine transferase gene, lpt3, of Neisseria meningitidis (Mackinnon et al., 2002). This is the first report of a gene with significant homology to lpt3 in Hi. Both NTHi_1053 and the gene encoding the putative glycosyltransferase have an atypically low G + C content (<30%), suggesting that they have been acquired by horizontal transfer. The finding that multiple, distinct gene insertions have occurred in the same region of the bacterial genome in different strains may indicate that this is a hotspot for recombination. The second novel tetranucleotide SSR locus contains a 5′CCAA tract associated with a putative glycosyltransferase (gene NTHi_1769 in strain 86-028NP). This SSR is present in all four genomes examined, with between 8 and 16 repeat units, and constitutes the first example in Hi of a 5′CCAA tract associated with a gene other than an iron utilisation gene (Jin et al., 1996; Morton and Stull, 1999).

Pentanucleotide SSRs identified as potential mediators of phase variation in Hi

A total of eight pentanucleotide SSR loci were identified across the four Hi strains investigated; six of these were located within ORFs (see Supplementary Table 1). The length of the pentanucleotide SSRs ranged from three to twelve units but the majority were of the minimum threshold value of three units. Two of the pentanucleotide SSRs located within ORFs are of particular interest. The first is associated with the type I modification enzyme, HsdM (the ORF in the Rd KW20 genome, HI1287, is truncated due to the repeat), and has previously been implicated in the phase variable expression of this type I restriction-modification gene (Zaleski et al., 2005). The SSRs identified in the four genome study are one, two or four units in length. van Belkum et al. (1997) and van Belkum (1999) described length variation in the region of this pentanucleotide repeat in a survey of 20 Hi strains. Zaleski et al. (2005) estimated the phase variation rates of the (5′GACGA)4 pentanucleotide repeat (four tandem repeats of the sequence 5′GACGA3′) at this locus from observations on the degree of bacterial lysis induced by exposure to phage HP1c1. The rates they recorded for a change from four to three pentanucleotide repeats in strain RM118 were high and equivalent to those previously measured for much longer tetranucleotide repeat tracts in the same strain (De Bolle et al., 2000). The second coding pentanucleotide SSR of interest (5′TCAGC) was found in a gene of the hmg locus that encodes a high molecular weight glycoform of the LPS (Hood et al., 2004). The two-unit pentanucleotide SSRs present in Rd KW20 and R2846 (within the ORFs HI0867 and Hflu103000281, respectively) are consistent with the expression of a putative LPS flippase, whilst the three-unit SSR in R2866 is inconsistent with expression of this gene.
It is noteworthy that these two potential phase variation-mediating pentanucleotide SSRs relate to gene functions (restriction-modification and LPS modification) whose expression has previously been reported to be phase varied by tetranucleotide SSRs.
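The effect of repeat number on expression of an ORF that contains the tract can be sketched as a reading-frame calculation. The Python example below assumes that the SSR lies within the coding sequence and that one known repeat count (here called `on_count`, a hypothetical parameter) leaves the gene in frame; it ignores any other mutations in the gene.

```python
def in_frame_counts(unit_len, on_count, max_count):
    """Repeat counts from 1 to max_count that keep the reading frame of `on_count`.

    A change of k repeat units adds or removes k * unit_len nucleotides, so the
    frame is preserved only when that number is a multiple of three.
    """
    return [n for n in range(1, max_count + 1)
            if ((n - on_count) * unit_len) % 3 == 0]


# Pentanucleotide tract assumed to be in frame with two units
print(in_frame_counts(5, 2, 8))  # [2, 5, 8]
```

With two units taken as the expressing state, a three-unit tract is out of frame, which matches the observation above for the 5′TCAGC repeat in R2866. The same function applies to tetranucleotide tracts, for which every third repeat count restores the frame.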

Heptanucleotide SSRs as mediators of phase variation in Hi

Four heptanucleotide SSRs were found in the survey of the four Hi genomes, three of which we have designated as potential mediators of phase variation. Two of these heptanucleotide SSRs are located approximately 100 bp upstream of the hmw1a and hmw2a genes and have previously been described by Dawid et al. (1999), who reported that these SSRs are within the promoters of the hmw1a and hmw2a genes and that alteration of the number of repeat units present in these SSRs results in a modulation of gene expression. The exact mechanism by which these SSRs influence transcription of these genes remains to be determined but may involve modulation of transcription from two alternative start sites (Dawid et al., 1999). Strain R2846 has (5′TGAAAGA)17 and (5′TGAAAGA)16 for hmw1a and hmw2a, respectively, and strain 86-028NP has (5′TGAAAGA)17 and (5′TGAAAGA)23 for hmw1a and hmw2a, respectively, but there are no equivalent loci or repeat tracts in the other two genomes. The third heptanucleotide SSR with the potential to mediate phase variation is the (5′AACAACC)1–7 tract situated only 13 bp upstream of a gene encoding a member of the TonB-dependent receptor family (Pfam PF00593) that has similarity to Fe transport proteins. One unit of the repeat is found in the genomes of strains Rd KW20 and R2846, seven in R2866 and six in 86-028NP. Rd KW20 and R2866 appear to have full length ORFs but the 86-028NP ORF is disrupted by a frameshift unrelated to the SSR. The observed variation in the length of this SSR, together with its position so close to the start of the downstream ORF, led us to postulate that it may mediate phase variation in Hi.

Other types of SSRs identified in the Hi four genomes study are not considered to mediate phase variable gene expression

Analysis of mononucleotide, dinucleotide, trinucleotide and hexanucleotide SSRs in the four genome study did not provide any evidence to suggest to us that these classes of repeat were associated with phase variable gene expression as detailed below.

Mononucleotide SSRs in the four Hi genomes are predominantly short A or T tracts

Mononucleotide SSRs have previously been documented as important mediators of phase variation in species such as Neisseria meningitidis (Schoen et al., 2007), Bordetella pertussis (Gogol et al., 2007) and Campylobacter jejuni (Hofreuter et al., 2006; Pearson et al., 2007). Perhaps surprisingly, they have not been implicated in phase variation in Hi, although partial sequencing of the iga gene from some Hi biogroup aegyptius strains led the investigators to suggest that a G10 tract found in only one strain may have mediated phase variable expression of the gene (Kilian et al., 2002). Our analysis of the mononucleotide SSR loci present in the four Hi genomes revealed a considerable degree of heterogeneity in this class of SSR between these strains. Sixty-four homopolymeric tracts were identified across the four genomes and Supplementary Table 1 summarises their characteristics. Of these, 28 (44%) were found within ORFs and, although variations were frequently observed between strains, they were not consistent with mediating phase variation (see Supplementary Table 1). The findings from the genome of strain Rd KW20 were representative of the distribution of mononucleotide SSRs found in the three other strains. All of the mononucleotide SSRs in this strain were A or T tracts (18/18) and most were of the minimum threshold length of 9 units (16/18). Comparison across the four strains revealed that the variation observed in the equivalent mononucleotide SSRs of 8–10 units usually occurred by the substitution of one of the bases within the homopolymeric tract with a different base (e.g. an (A)9 tract was found as (A)7CA in some strains). All substitutions interrupting the A or T homopolymeric SSRs were found to be G or C nucleotides, suggesting an uneven pattern of mutation. Examination of the further three genomes identified some anomalous mononucleotide SSRs. The first is an exceptionally long (A)34 tract identified in strain R2866. 
This SSR was located 120 bp upstream of the start of the ORF encoding the autotransporter adhesin Hia, which contains the YadA domain and is believed to bind vitronectin and aid survival in human serum (Cotter et al., 2005a; Hallström et al., 2006; Meng et al., 2006). This SSR is not obviously associated with a promoter region and its function, if any, remains unclear. The second and third are a (G)12 and a (C)11 repeat tract, both found in the genome of strain 86-028NP, which are noteworthy because mononucleotide SSRs of G or C residues are uncommon in Hi, reflecting the low G + C content of this organism (38%). The (G)12 SSR was within the 5′ end of ORF ntHI0694. This gene shows homology with genes encoding methyltransferases of the FkbM family, some of which are involved in the biosynthesis of methylated sugars in Rhizobium etli LPS (Duelli et al., 2001). This gene has not been identified in other Hi strains and its presence suggests that 86-028NP LPS may be O-methylated. The (C)11 SSR was located 230 bp upstream of the acpP gene (ntHI0243). Members of the AcpP family are short proteins which are involved in the transfer of acyl groups and are considered housekeeping proteins. In the three other genomes the tract at the same location contains five C residues.

Dinucleotide SSRs in the Hi four genome study

Phase variation mediated by dinucleotide repeats has been documented previously in Hi. A (TA)9-11 tract, located in the promoter region of two divergently transcribed genes, hifA and hifB was shown to control fimbriae biogenesis in some strains (van Ham et al., 1993). The hif locus is present in only 20% of NTHi strains and, of the four strains analysed here, only R2866 contains the hifA and hifB genes. In this strain, however, the 5′TA tract was present as a 5′(TA)4ATTA sequence. The threshold value set for dinucleotide SSRs in this study was five, therefore this tract was not identified as an SSR; further discussion of this locus is found later in this paper. Eight dinucleotide SSR loci were identified in this four genome analysis, all of which were found to be of the threshold value of five repeat units in length. Five were located within coding regions. In a similar fashion to the variation observed for many of the mononucleotide SSRs, seven of the dinucleotide SSR loci were found to have sequence variations that did not alter the overall length of the sequence between strains and so would not cause frameshifts consistent with phase variation. For example, a (CA)5 repeat conserved in the genomes of strains Rd KW20, R2846 and 86-028NP was found to be replaced with CACG(CA)3 in strain R2866.
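The effect of a minimum-unit threshold on what is reported as an SSR can be sketched as follows. This is a minimal illustration only, not the search method used in the study: it detects perfect tandem repeats and uses only the thresholds stated in this section (mononucleotide 9, dinucleotide 5, pentanucleotide 3). The function names and the flanking bases in the example sequences are hypothetical.

```python
# Minimum number of tandem units per repeat-unit length, as stated in this section.
# Thresholds for the other repeat classes are not restated here, so they are omitted.
THRESHOLDS = {1: 9, 2: 5, 5: 3}


def is_primitive(unit):
    """True if `unit` is not itself a tandem repeat of a shorter motif (e.g. 'AA')."""
    k = len(unit)
    return not any(k % d == 0 and unit == unit[:d] * (k // d) for d in range(1, k))


def find_ssrs(seq, thresholds=THRESHOLDS):
    """Return (start, unit, n_units) for perfect tandem repeats meeting the threshold."""
    seq = seq.upper()
    hits = []
    for k, min_units in sorted(thresholds.items()):
        i = 0
        while i + k <= len(seq):
            unit = seq[i:i + k]
            if not is_primitive(unit):
                i += 1
                continue
            n = 1
            while seq[i + n * k:i + (n + 1) * k] == unit:
                n += 1
            if n >= min_units:
                hits.append((i, unit, n))
                i += n * k  # skip past the tract just reported
            else:
                i += 1
    return hits


print(find_ssrs("GC" + "TA" * 9 + "GC"))  # a (TA)9 tract is reported
print(find_ssrs("TATATATAATTA"))          # 5'(TA)4ATTA: only four perfect units, nothing reported
```

With a dinucleotide threshold of five, the interrupted 5′(TA)4ATTA sequence described for R2866 falls below the cut-off, whereas an uninterrupted (TA)9 tract is detected.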

Trinucleotide SSRs were predominantly found to be located within ORFs

Eighteen trinucleotide SSRs were identified in this study, the majority of which (15/18) were located within coding regions. All of these 15 SSRs consisted of no more than four repeat units and, where variation in the repeats was observed between strains, it either resulted in a reduction in length of the SSR or disruption of the sequence whilst maintaining the same length. The three trinucleotide SSRs that were found in non-coding regions showed greater variation in overall length but were not within identified promoter or other regulatory regions.

Hexanucleotide SSRs identified in the four genome study

Ten hexanucleotide SSRs were identified within the four genomes, eight within ORFs and two in non-coding regions. Variation in coding hexanucleotide repeats can lead to altered amino acid sequence but not phase variable gene expression. The coding region hexanucleotide SSRs identified in this study were either conserved or, like the mononucleotide and dinucleotide variations discussed above, showed changes in sequence but not length and thus were inconsistent with modulating phase variable expression. Of the two non-coding region associated hexanucleotide repeats, one is conserved across all four strains and is present downstream of the closest ORFs, whilst the other 5′TTAAAA SSR is present as three repeat units in Rd KW20, two units in 86-028NP and as two units plus an interrupted third repeat unit in R2866 and R2846. This SSR is situated 19 bp from the start codon of HI0525 in strain Rd KW20, which encodes a phosphoglycerate kinase involved in central metabolism and the influence of this SSR on the expression of this ORF is unknown.

SSRs with repeat units greater than 7 nucleotides are not found at high frequency in the four genomes

Of the limited number of SSRs with repeat unit lengths greater than seven nucleotides that were identified in the four genomes study, most were found in only one strain. These include a nonanucleotide SSR found in the genome of strain 86-028NP. This (5′GTTTTCTTA)19 SSR was found to be located 92 bp upstream of the hmw2C gene. As discussed above, variations in the heptanucleotide SSRs associated with the hmw2A loci are thought to modulate gene expression, but the function of this nonanucleotide SSR is not known. An octanucleotide (5′ATTATTTG) SSR, however, was found in multiple strains, varying in length between one and six repeat units. It was located between the divergently transcribed cmkB and pdxS genes (designated HI1646 and HI1647 in strain Rd KW20), which encode a cytidylate kinase 2 and a pyridoxal biosynthesis lyase, respectively. Both are suggested to play roles in metabolism, and so it is uncertain whether this SSR would actually be utilised in modulating their expression.

Analysis of the SSRs of a further 12 genomes

Whilst the SSR analysis of the four Hi genomes was ongoing, 12 further Hi genomes were sequenced and the resulting full or partial sequences made publicly available (listed in Table 1). These 12 additional genome sequences offered us the chance to confirm and extend our detailed SSR analysis of the four Hi genomes. Using the same SSR search methods and threshold values described for the four genome study, 765 SSRs were identified in these 12 additional genomes (summarised in Table 6). From these data it was seen that mononucleotide SSRs are found in 10 out of 12 of the additional genomes at a higher frequency than was observed in the four genome study. It should be noted that the 454 sequencing technology used to generate the majority of the further genome sequences has a decreased fidelity for mononucleotide tracts, which may account, to some extent, for the higher number of mononucleotide SSRs detected in these strains. However, the F3031 and F3043 genomes, for which the highest number of mononucleotide SSRs were identified, were sequenced using ABI Sanger dideoxy sequencing technology.
Table 6

The number of SSRs, of each repeat unit length, in each genome.

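| Repeat unit length | Range in four genome study (a) | Strains (b): 22.1.21 | 22.4.21 | 3655 | PittAA | PittHH | PittII | R3021 | PittEE | PittGG | 10810 | F3031 | F3043 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mono | 13–17 | 25 | 21 | 22 | 20 | 20 | 20 | 28 | 12 | 19 | 16 | 30 | 25 |
| Di | 4–6 | 6 | 8 | 2 | 3 | 6 | 6 | 5 | 2 | 6 | 4 | 7 | 5 |
| Tri | 8–13 | 12 | 13 | 11 | 11 | 6 | 10 | 12 | 13 | 14 | 12 | 10 | 6 |
| Tetra | 12–14 | 11 | 9 | 17 | 13 | 9 | 15 | 8 | 14 | 13 | 10 | 18 | 21 |
| Penta | 3–5 | 4 | 2 | 3 | 4 | 3 | 4 | 2 | 3 | 6 | 2 | 6 | 5 |
| Hexa | 3–5 | 6 | 4 | 2 | 6 | 6 | 5 | 4 | 7 | 5 | 8 | 9 | 5 |
| Hepta | 0–4 | 2 | 1 | 2 | 4 | 1 | 2 | 2 | 4 | 1 | 0 | 3 | 1 |
| Octa | 0–1 | 1 | 1 | 2 | 1 | 2 | 1 | 0 | 0 | 0 | 1 | 5 | 5 |
| Nona | 1–2 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 |
| Total | 53–60 | 67 | 59 | 63 | 63 | 53 | 63 | 61 | 55 | 67 | 53 | 88 | 73 |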

Analysis of the number of SSRs identified in the 12 genome study compared to the four genome study. Numbers given in bold or italic indicate that the value is higher or lower than the four-genome study range, respectively.

(a) The minimum and maximum number of each type of SSR observed per genome in the four genome study.

(b) The number of SSRs per genome, of each repeat unit length, in the 12 genome study.

In a high proportion of the additional genomes, tetranucleotide and hexanucleotide SSRs were also observed more frequently than in the four genome study.
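The comparison between the 12 additional genomes and the four genome ranges can be checked with simple arithmetic. The sketch below uses the counts transcribed from Table 6; the variable and function names are hypothetical, and "above range" is taken here to mean strictly greater than the four genome maximum.

```python
STRAINS = ["22.1.21", "22.4.21", "3655", "PittAA", "PittHH", "PittII",
           "R3021", "PittEE", "PittGG", "10810", "F3031", "F3043"]

# (minimum, maximum) per genome in the four genome study, transcribed from Table 6.
FOUR_GENOME_RANGE = {
    "Mono": (13, 17), "Di": (4, 6), "Tri": (8, 13), "Tetra": (12, 14),
    "Penta": (3, 5), "Hexa": (3, 5), "Hepta": (0, 4), "Octa": (0, 1), "Nona": (1, 2),
}

# SSR counts per strain (same order as STRAINS), transcribed from Table 6.
COUNTS = {
    "Mono":  [25, 21, 22, 20, 20, 20, 28, 12, 19, 16, 30, 25],
    "Di":    [6, 8, 2, 3, 6, 6, 5, 2, 6, 4, 7, 5],
    "Tri":   [12, 13, 11, 11, 6, 10, 12, 13, 14, 12, 10, 6],
    "Tetra": [11, 9, 17, 13, 9, 15, 8, 14, 13, 10, 18, 21],
    "Penta": [4, 2, 3, 4, 3, 4, 2, 3, 6, 2, 6, 5],
    "Hexa":  [6, 4, 2, 6, 6, 5, 4, 7, 5, 8, 9, 5],
    "Hepta": [2, 1, 2, 4, 1, 2, 2, 4, 1, 0, 3, 1],
    "Octa":  [1, 1, 2, 1, 2, 1, 0, 0, 0, 1, 5, 5],
    "Nona":  [0, 0, 2, 1, 0, 0, 0, 0, 3, 0, 0, 0],
}


def column_totals(counts):
    """Total number of SSRs per strain, summed over all repeat classes."""
    return [sum(col) for col in zip(*counts.values())]


def genomes_above_range(cls):
    """Number of the 12 genomes whose count exceeds the four genome maximum."""
    _, upper = FOUR_GENOME_RANGE[cls]
    return sum(c > upper for c in COUNTS[cls])


totals = column_totals(COUNTS)
print(dict(zip(STRAINS, totals)))
print("all genomes:", sum(totals))
for cls in COUNTS:
    print(cls, genomes_above_range(cls), "of", len(STRAINS), "genomes above range")
```

The per-strain totals sum to the 765 SSRs reported for the 12 additional genomes, and the mononucleotide row reproduces the 10 out of 12 genomes noted in the text.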

Tetranucleotide SSRs in the complete genome collection

The nine NTHi genomes, sequenced by the Center for Genomic Sciences (Hogg et al., 2007) (see Table 1), and the genome of strain 10810, contained a similar number of tetranucleotide SSRs to that previously observed in the four genome study (12-14 per genome) and only two novel tetranucleotide SSR loci were identified. Conversely, in the genomes of strains F3031 and F3043, 18 and 21 tetranucleotide SSR loci were identified, respectively, and eight of these tetranucleotide SSRs were not identified in any of the previously analysed genomes (see Table 5).

Ten novel tetranucleotide SSRs

A total of ten novel tetranucleotide SSR loci were identified in the additional twelve genomes. One locus, licA2, is a duplication of the licA locus reported in the four genome study (Fox et al., 2008). Five of the novel tetranucleotide SSRs were associated with genes encoding members of the trimeric autotransporter protein family, which commonly contain a C-terminal YadA domain (PFAM03895) (Cotter et al., 2005b; Koretke et al., 2006). All five of these paralogous loci were present in the two Hi biogroup aegyptius strains F3031 and F3043 and one of the loci was also present in the genome of the NTHi strain, PittHH. Previously described members of this family of proteins from Hi include the adhesins Hsf and Hia, which have been implicated in virulence (Cotter et al., 2005a; Surana et al., 2004). It can be envisaged that the expression of adhesins may not be advantageous in all growth conditions as they are possible targets for the host immune system and are large proteins (up to 1016 aa) whose expression would require considerable resources. Indeed, the NadA protein from N. meningitidis, which is a member of this family of proteins, has previously been shown to be phase variably expressed (Capecchi et al., 2005; Martin et al., 2005). Hi biogroup aegyptius strains have been associated with atypical invasive disease and it is, therefore, tempting to speculate that the high number of putative phase variable adhesins identified in strains F3031 and F3043 may somehow contribute to the unusual clinical outcomes associated with these strains. An additional four novel tetranucleotide SSRs were identified from strains F3031 and/or F3043. The first, a (5′ATTA)9 SSR, is found 225 bp upstream of a gene encoding a putative DNA repair enzyme, formamidopyrimidine-DNA glycosylase MutM, in strain F3043. The equivalent position in other Hi strains contains 3 copies of the 5′ATTA repeat unit. 
The role of this repeat in expression of MutM is unknown but variations in the expression of mutM could potentially result in altered mutation rates in Hi (Horst et al., 1999). The second novel tetranucleotide SSR identified from strains F3031 and F3043 is a 5′CAAT SSR contained within the 5′ region of an ORF that encodes a putative glycosyltransferase with homologies to glycosyltransferase family 8 (PFAM01501). Homologues of this gene are found in other strains of Hi (including HI0223 in strain Rd KW20) but without the associated SSR. The function of this gene is unknown but it may contribute to LPS expression in strains F3031 and F3043. The third of the four additional novel tetranucleotide SSR loci contains (5′CAAT)21 and was found only in strain F3043. It is located within the 5′ end of an ORF that encodes a putative S-adenosylmethionine (SAM)-dependent methyltransferase and shows some homology to HI0096 in strain Rd KW20. SAM-dependent methyltransferases have been implicated in various cellular processes including protein trafficking and sorting, signal transduction, biosynthesis, metabolism, and gene expression. The final novel tetranucleotide SSR identified in strain F3031 is a (5′CAAG)32 SSR located 58 bp upstream of a gene encoding an adenine specific methylase homologue (EcoRI) and 202 bp upstream of the divergently transcribed htpX (which encodes a putative protease protein, induced by heat shock in E. coli). HtpX has not been investigated in Hi but in E. coli it is part of the membrane-localised proteolytic system and may play a part in the degradation of unstable membrane proteins (Sakoh et al., 2005).

Consideration of characteristics of tetranucleotide repeats from the complete genome collection

In total, 199 tetranucleotide SSRs associated with 28 different loci and consisting of nine different repeat unit sequences have been identified in the complete genome collection. The distribution of tetranucleotide SSR length, and the relationship between length of tetranucleotide SSR and strain are shown in Fig. 1. The length of an individual tetranucleotide SSR does not appear to be dependent on strain background, repeat unit sequence or locus (Fig. 1B), and a wide degree of variation and considerable overlap between groupings is observed. Fig. 1A shows that despite differences in the source, date of isolation and associated clinical symptoms of the different strains there is an approximately normal distribution of tetranucleotide SSR lengths. Fig. 1B shows that the two Hi biogroup aegyptius strains F3031 and F3043, which are associated with unusual clinical symptoms and have the highest number of tetranucleotide SSR, display a similar distribution of SSR lengths to all other strains.
Fig. 1

Histogram and boxplot representations of the length distribution, sequence, strain and loci associations of tetranucleotide SSRs. (A) Frequency histogram of tetranucleotide SSR length distribution in the complete genome study. (B) Boxplot analysis of the relationship between the strain from which the genome was derived and the length of the tetranucleotide SSRs.

Consequences of the sequence and location of tetranucleotide SSR

As noted previously, the tetranucleotide SSRs identified in the four genome study of Hi are located within ORFs and, with only two exceptions, are located immediately adjacent to or just downstream of the translational start site. In this position, any frameshift due to variation in length of the SSR would result in a peptide being made from the incorrect reading frame and a premature stop to translation. The location of tetranucleotide SSRs within the 5′ region of the ORFs limits the encoded tetrapeptide repeat to the N-terminus of the respective protein. The two exceptions to this pattern are the 5′GCAA tetranucleotide SSR located in the middle of the oafA gene that has been previously described (Fox et al., 2005), and a 5′GACA tetranucleotide SSR located in the 3′ region of a gene encoding a putative glycosyltransferase (pgt1) in the genomes of strains R2846, 86-028NP, 3655 and PittEE. These repeats may modulate the protein function rather than control ON/OFF switching of its expression. Tetranucleotide SSRs may constitute a substantial proportion of the coding region of a gene and thus the repeat unit sequence will have a significant influence on the amino acid composition of the encoded protein. The constraints that this may impose, in terms of permissible tetranucleotide sequences, have not been well characterised, although High et al. (1996) suggest that the peptides encoded by the repeat regions form structurally flexible regions that loop out of the protein structure and therefore do not interfere with tertiary structure. An in silico analysis of the repeat sequences identified in the 16 Hi genomes showed that hydrophilic amino acids are over-represented in the SSR encoded peptides, compared with their frequency in the normal proteome. 
Of the eight tetranucleotide repeat sequences found within ORFs in Hi, five encode hydrophilic peptides with no net charge (5′CAAT, 5′GACA, 5′CCAA, 5′AGCC, and 5′AGTC), one encodes a hydrophobic peptide with no net charge (5′TTTA) and two encode hydrophilic peptides with a net positive charge (5′GCAA and 5′CAAA). The high proportion of hydrophilic peptides encoded by the tetranucleotide SSRs and their frequent N-terminal location suggests that they are likely to be surface exposed and have the opportunity to ‘loop out’ of the folded protein structure and thus be less likely to interfere with the tertiary structure of the protein and, therefore, its function. The exception is a 5′TTTA tetranucleotide SSR which encodes a hydrophobic peptide within a putative drug/metabolite exporter (HI0687 in strain Rd KW20). Transmembrane helices predictions (TMHMM server v2.0, Sonnhammer et al., 1998) suggest that the portion of this protein encoded by the SSR lies entirely within a transmembrane domain. Examination of homologues of the HI0687 gene indicates that the hydrophobic nature of such transmembrane helices is well conserved but often the primary sequence is not (data not shown). Another observation of this study was that although previously SSRs of a particular tetranucleotide repeat unit sequence have been associated with genes of related function, e.g. 5′CCAA tracts with genes encoding iron utilisation proteins (Jin et al., 1996; Morton and Stull, 1999), in this study, we have found no evidence of a particular tetranucleotide repeat unit sequence being restricted to a particular class of gene.
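The tetrapeptide repeat encoded by a tetranucleotide tract can be derived by translating three tandem units (12 nt, four codons) with the standard genetic code. The sketch below is illustrative only and is not the in silico analysis performed in the study. The function name is hypothetical, the reading frame actually used in each gene is not given in the text, and frame 0 is an arbitrary choice. For a tetranucleotide tract the three frames yield the same repeating tetrapeptide, differing only by rotation.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def repeat_peptide(unit, frame=0):
    """Peptide encoded by one full period (three units) of a tandem tract of `unit`,
    read from nucleotide offset `frame` (0, 1 or 2)."""
    period = 3 * len(unit)                     # shortest length that is a whole number of codons
    window = (unit * 6)[frame:frame + period]  # tract long enough for any offset
    return "".join(CODON_TABLE[window[i:i + 3]] for i in range(0, period, 3))


for unit in ["CAAT", "GACA", "CCAA", "AGCC", "AGTC", "TTTA", "GCAA", "CAAA"]:
    print(unit, [repeat_peptide(unit, f) for f in range(3)])
```

For example, 5′CAAT is read as CAA TCA ATC AAT, giving the repeat Gln-Ser-Ile-Asn, while 5′TTTA gives Phe-Ile-Tyr-Leu and 5′GCAA gives a lysine-containing repeat. Classifying each peptide as hydrophilic, hydrophobic or charged would require a hydropathy scale, which is not specified in the text and is therefore left out of the sketch.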

Interrupted tetranucleotide SSR may be an indication of intra-genome recombination between paralogous loci

One feature of certain tetranucleotide SSRs noted during the course of this study was their interruption by an imperfect repeat unit. All of the genomes in this study were found to contain between one and four related hemoglobin/hemoglobin–haptoglobin-binding (hgp) genes containing 5′CCAA SSRs that show considerable variation in length. Hi lacks most of the genes of the heme biosynthetic pathway and requires hemoglobin/hemoglobin–haptoglobin-binding proteins to capture heme-containing compounds required for growth (Morton and Stull, 1999). Seven interrupted tetranucleotide SSRs were observed in total in this analysis, of which six were found to be associated with hgp genes. We postulate that homologous recombination occurring between these paralogous loci may occasionally generate imperfect repeats, and it will be of interest to ascertain whether similar events occur between other duplicated loci, e.g. paralogous adhesin genes (discussed below), partial or fully duplicated hifA loci and duplicated lic1A genes (strain PittGG).

Mononucleotides identified as potential mediators of phase variation in the analysis of the 12 further genomes

In contrast to the four genome study, the analysis of the 12 additional Hi genomes has identified a number of mononucleotide SSRs with the potential to mediate phase variable gene expression. These mononucleotide SSRs were located within the 5′ coding regions of ORFs, associated with frameshift mutations, or located within potential promoter regions. The potential phase variable genes include those encoding virulence-related factors such as glycosyltransferases, type-I restriction modification systems, haemagglutinins, YadA domain containing proteins, pilin genes and a Fe–S cluster assembly scaffold protein (see Table 7). This study offers the first indication that mononucleotide SSRs may mediate phase variation in Hi.
Table 7

Notable non-tetranucleotide SSRs identified in complete H. influenzae genome collection.

| Genome | Repeat unit length | Repeat unit sequence | Number of repeats (a) | Description of location (c) | Genome location (bp) (b) |
|---|---|---|---|---|---|
| F3043 | 1 | A | 17 | 67 bp upstream of hifA (pilin) | 1039014 |
| F3043 | 1 | A | 12 | 93 bp upstream of hifA (pilin) | 1207519 |
| F3031 | 1 | G | 13 | 10 bp within putative YadA-domain containing protein encoding gene | 548152 |
| F3031 | 1 | G | 9 | FS 625 bp into putative glycosyltransferase gene | 556417 |
| F3031 | 1 | G | 9 | FS 641 bp into type I restriction-modification system gene | 751900 |
| F3031 | 1 | T | 12 | 3 bp upstream of putative YadA-domain containing protein encoding gene | 1367593 |
| F3031 | 1 | T | 12 | 93 bp upstream of hifA (pilin) | 1705139 |
| PittEE | 1 | C | 12 | FS within 3′ end of a putative O-methyltransferase encoding gene (31% GC) | 10582 |
| PittEE | 1 | C | 9 | 115 bp within Fe–S cluster assembly scaffold gene | 697640 |
| PittEE | 1 | G | 9 | 324 bp within transferrin-binding protein 2 gene | 1384490 |
| R3655 | 1 | G | 15 | 118 bp upstream of exonuclease ABC subunit C | 1266118 |
| R3655 | 1 | T | 12 | upstream of hemoglobin-binding protein A encoding gene | 1338394 |
| R3655 | 1 | G | 10 | FS 114 bp within Fe–S cluster assembly scaffold protein encoding gene | 1547902 |
| PittAA | 1 | C | 9 | 114 bp within Fe–S cluster assembly scaffold protein encoding gene | 1350019 |
| PittAA | 1 | C | 11 | 454 bp within O-methyltransferase gene | 1764602 |
| PittII | 1 | C | 9 | within truncated lic3B, not in other strains | 237746 |
| PittII | 1 | T | 36 | 120 bp upstream of YadA-domain containing autotransporter adhesin gene | 983892 |
| R3021 | 1 | T | 49 | 122 bp upstream of YadA-domain containing autotransporter adhesin gene | 701090 |
| 22.1.21 | 1 | T | 20 | 121 bp upstream of YadA-domain containing autotransporter adhesin gene | 1278422 |
| 10810 | 1 | A | 38 | 120 bp upstream of YadA-domain containing autotransporter adhesin gene | 1960236 |
| F3031 | 2 | TA | 10 | 225 bp upstream of hifA, 121 bp upstream of hifB (chaperon) | 45074 |
| F3031 | 2 | TA | 10 | 225 bp upstream of hifA, 121 bp upstream of hifB (chaperon) | 150062 |
| F3031 | 2 | TA | 8 | 104 bp upstream of YadA-domain containing protein gene | 1258000 |
| 22.1.21 | 2 | TA | 9 | 166 bp upstream of hifA (pilin), truncated hifB (chaperon) | 52984 |
| 22.4.21 | 2 | TA | 9 | 133 bp upstream of hifA (pilin), truncated hifB (chaperon) | 937926 |
| PittEE | 7 | TGAAAGA | 13 | 69 bp upstream of hmw1A | 750553 |
| PittEE | 7 | TGAAAGA | 38 | 104 bp upstream of hmw2A, 154 bp upstream of NTHI1451 homolog | 1118607 |
| PittAA | 7 | AATTTTG | 14 | FS 3.5 kb within hmw1A | 1861090 |
| PittAA | 7 | TGAAAGA | 16 | 106 bp upstream of hmw1A | 876652 |
| R3655 | 7 | AACAACC | 8 | FS within HI1369 (Fe ligand_gated_channel, TonB dependent) | 763559 |
| 22.4.21 | 7 | AACAACC | 13 | 77 bp within HI1369 (Fe ligand_gated_channel, TonB dependent) | 1296925 |
| F3043 | 8 | ATTATTTG | 6 | 12 bp upstream of pyridoxine biosynthesis protein gene | 1003128 |
| F3043 | 8 | GCATCATC | 13 | 213 bp upstream of hmw1A | 1315267 |
| F3043 | 8 | GCATCATC | 12 | 209 bp upstream of hmw2A | 192933 |
| F3031 | 8 | GCATCATC | 15 | 200 bp upstream of hmw2A | 1349049 |
| F3031 | 8 | GCATCATC | 14 | 213 bp upstream of hmw1A | 1603867 |
| R3655 | 8 | ATTATTTG | 6 | 12 bp upstream of pyridoxine biosynthesis protein (near end of contig) | 1203993 |
| PittHH | 8 | ATTATTTG | 4 | 12 bp upstream of pyridoxine biosynthesis protein gene | 1799917 |
| PittAA | 8 | ATTATTTG | 4 | 12 bp upstream of pyridoxine biosynthesis protein gene | 430306 |
| PittII | 8 | ATTATTTG | 6 | 12 bp upstream of pyridoxine biosynthesis protein gene, 77 bp upstream of cytidylate kinase gene | 1077736 |
| 22.1.21 | 8 | ATTATTTG | 4 | 12 bp upstream of pyridoxine biosynthesis protein gene, 77 bp upstream of cytidylate kinase gene | 1370808 |
| 10810 | 8 | ATTATTTG | 6 | 12 bp upstream of pyridoxine biosynthesis protein gene, 77 bp upstream of cytidylate kinase gene | 1862249 |
| PittGG | 9 | CTTGTTTTT | 6 | 8 bp within/13 bp upstream of low similarity to O-antigen polymerases encoding gene | 1453483 |

Analysis of the twelve additional Hi genomes identified 43 non-tetranucleotide SSRs that could potentially mediate phase variation.

(a) The number of tandem repeat units that comprise the SSR.

(b) The base pair at which the 5′ base of the SSR is located.

(c) Location of the SSR relative to, and description of the function of, the ORF whose expression it is proposed to modulate. FS: ORF associated with SSR has a frameshift mutation; bp: base pair.

Further support for the role of mononucleotide SSRs in mediating phase variation in Hi comes from the observation that some genes identified in the 12 genome study have previously been determined to be phase variable, but mediated by other classes of SSRs. An example is the divergently transcribed pilin genes, hifA and hifB, which Geluk et al. (1998) demonstrated to be phase variable due to variation in the length of a 5′TA SSR located between them and 104–225 bp upstream of the hifA gene. Changes in the length of the dinucleotide SSR were proposed to alter the spacing between the −10 and −35 promoter sequences and therefore alter expression of the genes. In strain F3031, there are four hifA loci in total. Two of the loci have an arrangement similar to that described by Geluk et al. (1998), with the 5′TA SSR located between the divergently transcribed hifA and hifB genes, whilst the other two hifA loci have mononucleotide (A17 or A12) instead of dinucleotide SSRs located either 63 or 93 bp upstream of hifA (see Table 7). There is no hifB gene associated with these latter loci. Phase variation of pilin expression mediated by mononucleotide SSRs has not been previously reported in Hi. The exact location and extent of the promoter region of hifA have not been mapped in these strains, but the position of the mononucleotide SSRs makes them candidates to mediate phase variation. In the four genome study, homopolymeric A or T tracts of less than 11 bp were found with only one exception, an A34 tract found in the R2866 genome. In the additional twelve genomes a similar tract of between 20 and 49 bp was found in four strains in the same genomic location, approximately 120 bp upstream of the nearest ORF, which encodes a protein with homology to YadA-domain containing proteins such as Hsf. The function of this SSR is unknown but it is tempting to speculate that it may play a role in regulating the expression of the downstream Hsf-like encoded protein. 
Similarly, in the genome of strain F3031 the expression of a number of YadA domain containing proteins was suggested to be mediated by tetranucleotide SSRs (see Table 5). However, in one instance, the expression of a YadA domain containing protein in this strain is potentially mediated by a G13 SSR (at base 548152) located 10 bp within the ORF (see Table 7). The association of mononucleotide SSRs, in certain strains, with paralogs of genes which are phase variable by other SSRs offers strong circumstantial evidence that these mononucleotide SSRs may mediate phase variation in Hi.

Other potentially phase variable SSRs in the complete genome collection

The heptanucleotide SSRs associated with the hmw1a and hmw2a genes in the four genome study were also identified in the genomes of strains PittEE and R3655 in the 12 further Hi genomes analysed. In PittEE, 13 copies of the heptanucleotide repeat are present 69 bp upstream and 38 copies 104 bp upstream of the hmw1a and hmw2a genes, respectively, and in R3655, 16 copies of the repeat are present 106 bp upstream of hmw1a. In the further genome study, an additional novel heptanucleotide SSR associated with hmw1a was also identified in the genome of strain PittAA. Interestingly, this SSR, consisting of (5′AATTTTG)14, was 3.5 kb within the 7.3 kb putative full length ORF rather than in the promoter region, and a frameshift had occurred which is consistent with this being caused by variation in the length of the SSR. In the further genome analysis, an octanucleotide SSR was found associated with hmw loci. This SSR contained twelve to fifteen copies of a 5′GCATCATC repeat and was identified 200–213 nucleotides upstream of the hmw1a and hmw2a loci of strains F3043 and F3031. A further novel heptanucleotide SSR with the potential to mediate phase variation was identified in strains 22.4.21 and R3655, within an ORF which is a homologue of the HI1369 gene (encoding a putative TonB dependent iron ligand gated channel). Thirteen units of the 5′AACAACC repeat are found in 22.4.21, and eight repeat units in strain R3655, which results in a truncated ORF due to a frameshift. An octanucleotide SSR identified in the four genome study as containing one, four or six copies of a 5′ATTATTTG unit 12 bp upstream of a gene encoding a pyridoxine biosynthesis protein was also identified in seven strains of the further twelve genome collection (four copies of the SSR in strains PittHH, PittAA and 22.1.21 and six copies in strains R3655, PittII, F3043 and Hib; see Table 7). 
However, the limited range of variation and relatively short length of this SSR are not what would be expected at a classically phase variable locus and so the significance of this SSR at this location remains uncertain.

Discussion

The complete genome sequence of Hi, strain Rd KW20 (Fleischmann et al., 1995), provided for the first time the means to analyse the gene content, organisation and sequence structure of a free-living organism. One of the major findings in Hi strain Rd KW20 was the association of SSRs, especially tetranucleotide SSRs, with genes involved in host adaptation, commensalism and virulence (Hood et al., 1996b). SSRs are hypermutable and mediate a high frequency of reversible increases or decreases in the number of repeat units resulting in phase variable expression of the associated genes (Moxon et al., 2006). As sequencing techniques have progressed, the ease with which sequencing data can be gathered has increased. As a result, the sequences of multiple strains of a single species have become available for comparison and the extent of genomic variation between strains has become evident. In this study, our aim was to extend our understanding of the role of SSRs in the biology and pathogenicity of Hi by an analysis of four complete genome sequences and a survey of available sequence data for a further twelve strains. SSRs, consisting of repeat units of between one and nine nucleotides in length, were characterised. For this analysis to be practical, it was necessary to establish threshold values, at or above which tandem repeat units were designated as SSRs. Data pertaining to the genomic location, position relative to the nearest ORF and the types of polymorphism observed by comparison between genomes were compiled for each of 223 SSRs in the initial survey of the four complete Hi genome sequences from strains Rd KW20, R2846, R2866 and 86-028NP. These SSRs were broadly classified into three categories: invariant, variant and potentially phase variable. Invariant SSRs showed no variation in sequence, position or length between strains whilst variant SSRs showed some variation between strains but not of a type that would mediate phase variation, i.e. they usually showed some variation in sequence but not overall length. 
Potentially phase variable SSRs showed variation in the number of repeat units constituting the SSR between strains and were in positions consistent with mediating phase variation either within ORFs or promoter regions. The majority of SSRs examined fell into the first two classes. From the further 12 partial and complete genome sequences, 765 additional SSRs were identified. These studies have confirmed that tetranucleotides are the predominant class of SSR to mediate phenotypic variation via phase variation in Hi. A total of 199 tetranucleotide SSRs were found distributed across the 16 strains, associated with 28 different loci (see Table 5). Of these, 10 were novel tetranucleotides, eight of which were identified in the genome sequences of only two strains, the Hi biogroup aegyptius strains F3031 and F3043. Tetranucleotide SSRs were found associated with a number of paralogous adhesin genes in these strains and, intriguingly, with a mutM locus that could potentially modulate mutation rates due to oxidative damage (Horst et al., 1999). The Hi biogroup aegyptius strains F3043 and F3031, isolated in Brazil, were associated with conjunctivitis and BPF, respectively. A relevant question is whether the increased number of tetranucleotide SSRs in these strains may contribute to their unusual virulence phenotype. A detailed analysis of the characteristics of the tetranucleotide SSRs across all strains showed that whilst the number of tetranucleotide SSRs was higher in the biogroup aegyptius strains (Table 6), the length or sequence of the SSRs was similar between all the strains (Fig. 1 and unpublished data). Indeed, no relationship was found between the sequence, length, genomic locus or protein function of tetranucleotide SSRs. 
Other tetranucleotide SSR loci identified included those encoding two glycosyltransferases, one of which contains a 5′-CCAA repeat; this is the first time in Hi that this particular SSR unit sequence has been associated with genes encoding proteins of any function other than hemoglobin and hemoglobin–haptoglobin binding. Although tetranucleotide SSRs are the most frequent mediators of phase variation in Hi, other SSRs may also play a role, particularly in strains such as F3031 and F3043. This study has identified a number of novel mononucleotide, dinucleotide, pentanucleotide, heptanucleotide and octanucleotide SSRs as potential mediators of phase variation. Mononucleotide SSRs have not previously been described as frequent mediators of phase variation in Hi, in contrast to other bacterial species such as N. meningitidis. There is only one report in the literature of a mononucleotide SSR potentially mediating phase variation in Hi: a G10 SSR is suspected to mediate phase variation of the iga gene (AF522258) in Hi biogroup aegyptius strain HK266 (Kilian et al., 2002). However, the distribution of the mononucleotide SSR loci identified in this study suggests that there may be some strain-dependent differences in the use of mononucleotide SSRs to mediate phase variation. The potential mononucleotide SSR-mediated phase variable genes identified include those encoding factors associated with virulence, such as glycosyltransferases, type-I restriction modification systems, haemagglutinins, YadA-domain containing proteins (Cotter et al., 2005b; Koretke et al., 2006), pilin and an Fe–S cluster assembly scaffold protein (see Table 7). A number of the genes in which phase variation is potentially mediated by homopolymeric tracts are phase variable by other mechanisms in other strains. For example, hifA and hifB expression is usually mediated by a dinucleotide SSR (van Ham et al., 1993).
Similarly, the expression of YadA-domain containing proteins is potentially mediated by tetranucleotide SSRs in some loci identified in this study and by mononucleotide SSRs in other loci, whilst the expression of the hmw1A and hmw2A genes is potentially mediated by upstream heptanucleotide, octanucleotide or nonanucleotide SSRs in different strains. Differences in the classes of SSRs which mediate phase variation between species, or even between different strains of one species, may be determined by inter-species or inter-strain differences in DNA metabolism, as the efficiency with which different types of slippage intermediates are recognised and repaired depends on the complement of DNA repair mechanisms in the given strain or species. Investigation of the molecular basis of these differences will be aided by the availability of full genome sequences in conjunction with experimental assays. The strains examined in this study were isolated in the United Kingdom (one strain), Brazil (two strains) and the United States of America (13 strains). A majority of the novel SSRs identified were in the F3031 and F3043 genome sequences (the Brazilian strains), and it remains unknown whether the population or geographical structure of Hi strains is a significant factor in determining the complement of SSRs within a strain. Until the population structure of Hi is better understood, it is difficult to predict the size of the SSR pan-genome and its potential role in mediating phase variation. In the strains studied, with the exception of F3031 and F3043, there was no association between the ability of a strain to cause disease or commensal infection and its complement of SSRs with the potential to mediate phase variation. For each strain, the contribution of the number and complement of phase variable genes to pathogenic potential remains unknown.
In conclusion, this study has reaffirmed the primacy of tetranucleotide SSRs as mediators of phase variation in Hi and has characterised and compared 28 tetranucleotide SSR loci (10 of them previously unreported) across 16 strains. Additionally, this study has identified a number of previously unrecognised mononucleotide, dinucleotide, pentanucleotide, heptanucleotide and octanucleotide SSRs as potential mediators of phase variation that will be the focus of future research efforts. Thus, the utility of whole genome sequences in the investigation of the biology of pathogenic bacteria has been confirmed and, further, the analysis of multiple genomes has revealed non-intuitive subtleties in the population structure concerning the distribution of SSRs across the Hi pan-genome.
  43 in total

1.  The Haemophilus influenzae Hia autotransporter contains an unusually short trimeric translocator domain.

Authors:  Neeraj K Surana; David Cutter; Stephen J Barenkamp; Joseph W St Geme
Journal:  J Biol Chem       Date:  2004-01-15       Impact factor: 5.157

2.  DNA repeats identify novel virulence genes in Haemophilus influenzae.

Authors:  D W Hood; M E Deadman; M P Jennings; M Bisercic; R D Fleischmann; J C Venter; E R Moxon
Journal:  Proc Natl Acad Sci U S A       Date:  1996-10-01       Impact factor: 11.205

3.  Tandem repeats of the tetramer 5'-CAAT-3' present in lic2A are required for phase variation but not lipopolysaccharide biosynthesis in Haemophilus influenzae.

Authors:  N J High; M P Jennings; E R Moxon
Journal:  Mol Microbiol       Date:  1996-04       Impact factor: 3.501

4.  Phase variation of H. influenzae fimbriae: transcriptional control of two divergent genes through a variable combined promoter region.

Authors:  S M van Ham; L van Alphen; F R Mooi; J P van Putten
Journal:  Cell       Date:  1993-06-18       Impact factor: 41.582

5.  Evolution of the paralogous hap and iga genes in Haemophilus influenzae: evidence for a conserved hap pseudogene associated with microcolony formation in the recently diverged Haemophilus aegyptius and H. influenzae biogroup aegyptius.

Authors:  Mogens Kilian; Knud Poulsen; Hans Lomholt
Journal:  Mol Microbiol       Date:  2002-12       Impact factor: 3.501

6.  Cloning of a DNA fragment encoding a heme-repressible hemoglobin-binding outer membrane protein from Haemophilus influenzae.

Authors:  H Jin; Z Ren; J M Pozsgay; C Elkins; P W Whitby; D J Morton; T L Stull
Journal:  Infect Immun       Date:  1996-08       Impact factor: 3.441

7.  Use of the complete genome sequence information of Haemophilus influenzae strain Rd to investigate lipopolysaccharide biosynthesis.

Authors:  D W Hood; M E Deadman; T Allen; H Masoud; A Martin; J R Brisson; R Fleischmann; J C Venter; J C Richards; E R Moxon
Journal:  Mol Microbiol       Date:  1996-12       Impact factor: 3.501

8.  Whole-genome random sequencing and assembly of Haemophilus influenzae Rd.

Authors:  R D Fleischmann; M D Adams; O White; R A Clayton; E F Kirkness; A R Kerlavage; C J Bult; J F Tomb; B A Dougherty; J M Merrick
Journal:  Science       Date:  1995-07-28       Impact factor: 47.728

9.  Cloning, expression, and DNA sequence analysis of genes encoding nontypeable Haemophilus influenzae high-molecular-weight surface-exposed proteins related to filamentous hemagglutinin of Bordetella pertussis.

Authors:  S J Barenkamp; E Leininger
Journal:  Infect Immun       Date:  1992-04       Impact factor: 3.441

10.  A virulent nonencapsulated Haemophilus influenzae.

Authors:  V Nizet; K F Colina; J R Almquist; C E Rubens; A L Smith
Journal:  J Infect Dis       Date:  1996-01       Impact factor: 5.226

