Peter M Power, W A Sweetman, N J Gallacher, M R Woodhall, G A Kumar, E R Moxon, D W Hood.
Abstract
Simple sequence repeats (SSRs) of DNA are subject to high rates of mutation and are important mediators of adaptation in Haemophilus influenzae. Previous studies of the Rd KW20 genome identified the primacy of tetranucleotide SSRs in mediating phase variation (the rapid reversible switching of gene expression) of surface exposed structures such as lipopolysaccharide. The recent sequencing of the genomes of multiple strains of H. influenzae allowed SSRs (repeat units of one to nine nucleotides in length) to be compared in detail across four complete H. influenzae genomes, and then with a further 12 genomes when they became available. The SSR loci were broadly classified into three groups: (1) those that did not vary; (2) those for which some variation between strains was observed but could not be linked to variation of gene expression; and (3) those that both varied and were located in regions consistent with mediating phase variable gene expression. Comparative analysis of 988 SSR associated loci confirmed that tetranucleotide repeats were the major mediators of phase variation and extended the repertoire of known tetranucleotide SSR loci by identifying ten previously uncharacterised tetranucleotide SSR loci with the potential to mediate phase variation; these loci were unequally distributed across the H. influenzae pan-genome. Further, analysis of non-tetranucleotide SSRs in the 16 strains revealed a number of mononucleotide, dinucleotide, pentanucleotide, heptanucleotide, and octanucleotide SSRs whose features were consistent with these tracts mediating phase variation. This study substantiates previous findings as to the important role that tetranucleotide SSRs play in H. influenzae biology. Two Brazilian isolates showed the most variation in their complement of SSRs, suggesting the possibility of geographic and phenotypic influences on SSR distribution.
Year: 2008 PMID: 19095084 PMCID: PMC2651432 DOI: 10.1016/j.meegid.2008.11.006
Source DB: PubMed Journal: Infect Genet Evol ISSN: 1567-1348 Impact factor: 3.342
Characteristics of genome sequences used in this study.
| Strain | GenBank accession number | Source strain information | Reference |
|---|---|---|---|
| Four genome study | | | |
| Rd KW20 | | Acapsulate serotype d, nasopharynx, USA | |
| 86-028NP | | NT | |
| R2846 | | NT | |
| R2866 | | NT | |
| Twelve genome study | | | |
| PittGG | | NT | |
| PittEE | | NT | |
| R3021 | | NT | |
| PittII | | NT | |
| PittHH | | NT | |
| PittAA | | NT | |
| 3655 | | NT | |
| 22.4.21 | | NT | |
| 22.1.21 | | NT | |
| F3031 | | | |
| F3043 | | | |
| 10810 | | Serotype b, meningitis isolate, UK | |
The four strains for which the complete genome sequences were available at the commencement of this study are listed in the upper portion of the table; the further twelve strains for which full or partial sequences were later acquired are detailed in the lower portion.
Lists the serotype, the site of isolation and associated clinical features, and country of isolation.
Where there is no appropriate publication available, the www address from which sequences were obtained is listed.
The complete genome sequences for these strains were made available to us courtesy of Dr Alice Erwin.
Incomplete genomes.
The genome sequences for these strains were made available to us courtesy of Prof. Simon Kroll and the Wellcome Trust Sanger Sequencing Centre.
Determination of the minimum threshold for SSR detection in H. influenzae genome sequences, with reference to strain Rd KW20.
| Repeat unit length (No.) | Prefix | Threshold value | Maximum expected length | SSRs in Rd KW20 at Threshold value −1 | SSRs in Rd KW20 at Threshold value | SSRs in Rd KW20 at Threshold value +1 |
|---|---|---|---|---|---|---|
| 1 | Mono | 9 | 13 | 163 | 18 | 2 |
| 2 | Di | 5 | 5 | 88 | 5 | 0 |
| 3 | Tri | 4 | 4 | 814 | 13 | 1 |
| 4 | Tetra | 3 | 3 | 38 | 12 | 12 |
| 5 | Penta | 3 | 2 | 1676 | 5 | 2 |
| 6 | Hexa | 3 | ND | 940 | 5 | 1 |
| 7 | Hepta | 3 | ND | 126 | 0 | 0 |
| 8 | Octa | 3 | ND | 45 | 0 | 0 |
| 9 | Nona | 3 | ND | 34 | 2 | 1 |
| Total | | | | 3924 | 60 | 19 |
This table sets out the minimum number of repeat units (threshold value) for the identification of SSRs of each repeat unit length used in this study. ND: not determined.
The threshold value is the minimum number of repeat units required to be present in an uninterrupted tandem arrangement in order for that sequence to be defined as an SSR.
Maximum expected length is the maximum number of repeat units of each designated length that would be expected to occur in the Rd KW20 genome as predicted by hidden Markov model analysis (Paul Swift, personal communication).
The values in the columns Threshold value −1 and Threshold value +1 show the number of SSRs of each motif length identified in the Rd KW20 genome if the threshold value is decreased or increased by one unit, respectively.
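The threshold definition above can be illustrated with a short Python sketch. This is not the software used in the study: it is a minimal regular-expression scan in which the function names (`find_ssrs`, `is_primitive`), the test sequence and the threshold values are all assumptions made for illustration. It reports uninterrupted tandem repeats that reach a per-unit-length minimum copy number, and shows how lowering one threshold by a single unit admits an additional tract.

```python
import re


def is_primitive(unit):
    """Return True if `unit` is not itself a tandem repeat of a shorter motif."""
    n = len(unit)
    return not any(n % d == 0 and unit == unit[:d] * (n // d) for d in range(1, n))


def find_ssrs(seq, thresholds):
    """Return sorted (start, unit, copies) tuples for uninterrupted tandem repeats.

    `thresholds` maps repeat unit length -> minimum number of tandem copies.
    """
    seq = seq.upper()
    hits = []
    for unit_len, min_copies in thresholds.items():
        # One captured unit followed by at least (min_copies - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            # Skip units such as "AAAA" so that a mononucleotide run
            # is not also counted as a tetranucleotide SSR.
            if is_primitive(unit):
                hits.append((match.start(), unit, len(match.group(0)) // unit_len))
    return sorted(hits)


# Invented sequence: (CAAT)6, a run of 10 A and (GACA)3, separated by short spacers.
seq = "GGC" + "CAAT" * 6 + "TTG" + "A" * 10 + "CG" + "GACA" * 3 + "T"

strict = find_ssrs(seq, {1: 9, 4: 4})   # illustrative thresholds only
relaxed = find_ssrs(seq, {1: 9, 4: 3})  # tetranucleotide threshold lowered by one

print(strict)
print(relaxed)
```

With this invented sequence the strict scan reports the CAAT and A tracts, and the relaxed scan additionally reports the three-copy GACA tract. The reported unit is whichever rotation of the motif the leftmost match starts on, so a real analysis would probably need to normalise rotations and reverse complements before comparing loci.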
The frequency and location of SSRs in the genome sequences of four H. influenzae strains.
| Repeat unit length | Threshold value | 86-028NP | Rd KW20 | R2846 | R2866 |
|---|---|---|---|---|---|
| 1 | 9 | | | | |
| 2 | 5 | | | | |
| 3 | 4 | | | | |
| 4 | 4 | | | | |
| 5 | 3 | | | | |
| 6 | 3 | | | | |
| 7 | 3 | | | | |
| 8 | 3 | | | | |
| 9 | 3 | | | | |
| Total | | | | | |
The frequency of SSRs, of each repeat motif length, is given for each of the four genomes in the four genome analysis.
Repeat unit length: the number of nucleotides that compose a single repeat unit.
Threshold value is the minimum number of repeat units required to be present in an uninterrupted tandem arrangement in order for that sequence to be defined as an SSR.
Numbers in bold indicate the number of each type of SSR within each genome; numbers in parentheses show the number of each type of SSR located within predicted ORFs for each genome.
Assessment of SSR variability and potential to mediate phase variation in the four genome study.
| Repeat unit length | Potentially phase variable | Variable | Invariable |
|---|---|---|---|
| 1 | | | |
| 2 | | | |
| 3 | | | |
| 4 | | | |
| 5 | | | |
| 6 | | | |
| 7 | | | |
| 8 | | | |
| 9 | | | |
Each SSR identified in the four genome analysis was manually assessed to determine whether or not it was likely to mediate phase variable gene expression, from consideration of its position within an ORF or relative to adjacent ORFs, together with the variation observed between strains. Numbers in bold indicate the number of each type of SSR within each category; numbers in parentheses show the number of each type of SSR located within predicted ORFs for each category.
This assessment assigned each SSR to one of three categories: potentially phase variable (SSRs that both varied in length and were located in positions likely to influence gene expression); variable (SSRs that showed differences in length or sequence between strains, but not in a manner consistent with mediating phase variable expression); and invariable (SSRs that did not show any variation in length, position or sequence between the strains).
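The three categories can be expressed as a simple decision rule. The assessment in the study was manual, so the Python sketch below is only an illustration under stated assumptions: the function name `classify_ssr` is hypothetical, variation is reduced to differences in copy number alone (ignoring position and sequence), and the judgement about whether a tract lies where it could influence expression is supplied by the caller as a boolean.

```python
def classify_ssr(copies_by_strain, may_affect_expression):
    """Assign an SSR locus to one of the three categories.

    copies_by_strain: dict mapping strain name -> number of repeat units.
    may_affect_expression: caller's judgement (assumed input) that the tract
        lies within an ORF or close enough to one to modulate its expression.
    """
    varies = len(set(copies_by_strain.values())) > 1
    if not varies:
        return "invariable"
    if may_affect_expression:
        return "potentially phase variable"
    return "variable"


# Hypothetical loci and strain labels, for illustration only.
print(classify_ssr({"s1": 22, "s2": 26, "s3": 20, "s4": 10}, True))
print(classify_ssr({"s1": 5, "s2": 6, "s3": 5, "s4": 5}, False))
print(classify_ssr({"s1": 4, "s2": 4, "s3": 4, "s4": 4}, True))
```

Checking variation first mirrors the category definitions: a tract that never varies is invariable regardless of where it sits.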
The sequence and number of repeat units that comprise each of the 199 tetranucleotide SSRs identified in 16 H. influenzae genomes.
| SSR loci number | Associated ORF or genome location | Description | Repeat Unit Seq. | RdKW20 | R2866 | R2846 | 86-028NP | 22.1.21 | 22.4.21 | 3655 | PittAA | PittHH | PittII | R3021 | PittEE | PittGG | 10810 | F3031 | F3043 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | HI_0258 | GACA | 22 | 26 | 20 | 10 | 13 | 12 | 12 | 14 | 12 | 16 | 12 | 8 | 38 | 17 | 8 | 24 | |
| 2 | HI_0352 | CAAT | 33 | 20 | 26 | 19 | 19 | 19 | 24 | 36 | 20 | 14 | 31…8 | 19 | 20 | 38 | 22 | 22 | |
| 3 | HI_0550 | CAAT | 23 | 5 | 25 | 14 | 14 | 12 | 25 | 11 | II | 22 | 23 | 25 | 20 | 15 | 14 | ||
| 4 | HI_1537 | CAAT | 17 | 36 | 7 | 15 | 22 | 18 | 15 | 25 | 37 | 27 | 15 | 24 | 51 | 47 | 26 | 16 | |
| 5 | CAAT | 19 | 31 | 17 | |||||||||||||||
| 6 | HI_0635 | CCAA | 21 | 28 | 25 | 20 | 28 | 16 | 8 | 4…10 | 4 | 11 | 8 | 16 | 33 | 29 | 24 | ||
| 7 | HI_0661 | CCAA | 20 | 27 | 39 | 12 | 6…6 | 17 | 23…8 | 19 | 23 | 9 | 20 | 10 | |||||
| 8 | HI_1565 | CCAA | 19 | 28 | 17 | 8…12 | 16 | 9 | 30 | 15 | 13 | ||||||||
| 9 | HI_0712 | CCAA | 37 | ||||||||||||||||
| 10 | HI_0687 | Drug/metabolite exporter | TTTA | 6 | 4 | ||||||||||||||
| 11 | HI_1058 | AGCC | (TGAC)32 | 16 | 14 | 20 | 12 | 27 | 22 | IS | |||||||||
| 12 | HI_1386US | Putative glycosyltransferase | CCAA | 16 | 8 | 12 | 13 | 10 | 10 | 10 | 7 | 14 | 10 | 10 | 16 | ||||
| 13 | Within cryptic | CAAG | 25 | 13 | 24 | 14 | 17 | 13 | 11 | 30 | 13 | 15 | 17 | 13 | 14 | ||||
| 14 | r2846v6.916 | GACA | 16 | 14 | 6 | 13 | 5 | ||||||||||||
| 15 | r2846v6.1528c | lpt3 region | AGTC | 14 | 10 | 14 | |||||||||||||
| 16 | r2846v6.1683 | lex2A | CAAG | 24 | 17 | 14 | 18 | 15 | 15 | 24 | 16 | 21 | 22 | 18 | 26 | 14 | 31 | ||
| 17 | r2866v6.124c | CAAG | 20 | 30 | 14 | 14 | 17 | 28 | |||||||||||
| 18 | r2846v6.202 | CAAG | 9 | 14 | 8 | 12 | 14 | 11 | 14 | 5 | 11 | 8 | 7 | 14 | 11 | 11 | |||
| 19 | NTHI1034 | CAAT | 15 | 21 | 10 | 12 | 24 | ||||||||||||
| 20 | PITTII | Gene encoding a YadA domain containing protein | CAAG | 20 | 15 | 24 | |||||||||||||
| 21 | F3043-1499724 | Gene encoding a YadA domain containing protein | CAAG | 19 | |||||||||||||||
| 22 | F3043-196894 | Gene encoding a YadA domain containing protein | CAAA | 15 | 13 | ||||||||||||||
| 23 | F3043-756964 | 225 bp upstream gene encoding a YadA domain containing protein | CAAA | 16 | 31 | ||||||||||||||
| 24 | F3043-609747 | Glycosyltransferase (family 8) with frameshift | CAAT | 18 | 33 | ||||||||||||||
| 25 | F3043-1083776 | Gene encoding a YadA domain containing protein | CAAG | 23 | |||||||||||||||
| 26 | F3043-1500170 | SAM-dependent methyltransferase | CAAT | 21 | |||||||||||||||
| 27 | F3043-1598734 | 225 bp upstream formamidopyrimidine-DNA glycosylase ( | ATTA | 9 | |||||||||||||||
| 28 | F3031-1121634 | 58 bp upstream adenine specific methylase (EcoRI) and 202 bp upstream of | CAAG | 32 | |||||||||||||||
Listed in this table are the 199 tetranucleotide SSRs that are associated with the 28 loci identified in this analysis of the genomes of 16 Hi strains.
The associated ORF designation in the Rd KW20 genome or, when the locus is not present in Rd KW20, in the R2846, R2866 or 86-028NP genomes. SSRs from unannotated genomes are identified by the strain name, a hyphen and the nucleotide position at which they are found.
Genes associated with the SSR. The gene names and predicted functions are from BLAST similarity searches and the published genome annotations.
The sequence of the tetranucleotide SSR in this locus is different to that associated with this gene in all other strains (5′AGCC).
x…y indicates that there are x number of the tetranucleotide repeat unit, followed by an interruption, followed by y number of tetranucleotide repeat units.
The presumptive position of the SSR loci is at the end of a contig and its presence or absence cannot be determined.
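The x…y notation for interrupted tracts can be read mechanically. The following Python sketch is only an illustration; the helper name `parse_repeat_cell` is hypothetical, and it assumes a table entry is either a single integer or integers separated by the "…" character.

```python
def parse_repeat_cell(cell):
    """Split a table entry such as '31…8' into its uninterrupted blocks.

    Returns a list of repeat-unit counts, one per uninterrupted block.
    """
    return [int(part) for part in cell.split("…") if part.strip()]


blocks = parse_repeat_cell("31…8")   # two blocks separated by an interruption
single = parse_repeat_cell("22")     # an uninterrupted tract
print(blocks, sum(blocks))
print(single)
```

Keeping the blocks separate rather than summing them preserves the distinction between an interrupted tract and an uninterrupted tract with the same total number of units.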
The number of SSRs, of each repeat unit length, in each genome.
| Repeat unit length | Range in four genome study | 22.1.21 | 22.4.21 | 3655 | PittAA | PittHH | PittII | R3021 | PittEE | PittGG | 10810 | F3031 | F3043 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mono | 13–17 | 12 | 16 | ||||||||||
| Di | 4–6 | 6 | 8 | 2 | 3 | 6 | 6 | 5 | 2 | 6 | 4 | 5 | |
| Tri | 8–13 | 12 | 13 | 11 | 11 | 6 | 10 | 12 | 13 | 12 | 10 | 6 | |
| Tetra | 12–14 | 9 | 13 | 14 | 13 | ||||||||
| Penta | 3–5 | 4 | 2 | 3 | 4 | 3 | 4 | 2 | 3 | 5 | |||
| Hexa | 3–5 | 4 | 2 | 5 | 4 | 5 | 5 | ||||||
| Hepta | 0–4 | 2 | 1 | 2 | 4 | 1 | 2 | 2 | 4 | 1 | 0 | 3 | 1 |
| Octa | 0–1 | 1 | 1 | 2 | 1 | 2 | 1 | 0 | 0 | 0 | 1 | ||
| Nona | 1–2 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| Total | 53–60 | 67 | 59 | 63 | 63 | 53 | 63 | 61 | 55 | 67 | 53 | 88 | 73 |
Analysis of the number of SSRs identified in the 12 genome study compared to the four genome study. Numbers given in bold or italic indicate that the value is higher or lower than the four-genome study range, respectively.
The minimum and maximum number of each type of SSR observed per genome in the four genome study.
The number of SSRs per genome, of each repeat unit length, in the 12 genome study.
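The comparison against the four genome range can be reproduced for the Total row, which is the only row above with a value for every strain. The Python sketch below is an illustration rather than the authors' procedure; the function name `flag_against_range` is hypothetical, and the counts and the 53–60 range are copied from the Total row of the table.

```python
def flag_against_range(counts, low, high):
    """Label each strain's count as 'higher', 'lower' or 'within' the range [low, high]."""
    flags = {}
    for strain, n in counts.items():
        if n > high:
            flags[strain] = "higher"
        elif n < low:
            flags[strain] = "lower"
        else:
            flags[strain] = "within"
    return flags


# Total SSRs per genome in the 12 genome study (Total row of the table).
totals = {
    "22.1.21": 67, "22.4.21": 59, "3655": 63, "PittAA": 63,
    "PittHH": 53, "PittII": 63, "R3021": 61, "PittEE": 55,
    "PittGG": 67, "10810": 53, "F3031": 88, "F3043": 73,
}

flags = flag_against_range(totals, low=53, high=60)
for strain, flag in flags.items():
    print(strain, totals[strain], flag)
```

Values equal to either end of the range are treated as within it, which is an assumption about how the boundaries were handled.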
Fig. 1. Histogram and boxplot representations of the length distribution, sequence, strain and loci associations of tetranucleotide SSRs. (A) Frequency histogram of tetranucleotide SSR length distribution in the complete genome study. (B) Boxplot analysis of the relationship between the strain from which the genome was derived and the length of the tetranucleotide SSRs.
Notable non-tetranucleotide SSRs identified in complete H. influenzae genome collection.
| Genome | Repeat unit length | Repeat unit sequence | Number of repeats | Description of location | Genome location (bp) |
|---|---|---|---|---|---|
| F3043 | 1 | A | 17 | 67 bp upstream of | 1039014 |
| F3043 | 1 | A | 12 | 93 bp upstream of | 1207519 |
| F3031 | 1 | G | 13 | 10 bp within putative YadA-domain containing protein encoding gene | 548152 |
| F3031 | 1 | G | 9 | FS 625 bp into putative glycosyltransferase gene | 556417 |
| F3031 | 1 | G | 9 | FS 641 bp into type I restriction-modification system gene | 751900 |
| F3031 | 1 | T | 12 | 3 bp upstream of putative YadA-domain containing protein encoding gene | 1367593 |
| F3031 | 1 | T | 12 | 93 bp upstream of | 1705139 |
| PittEE | 1 | C | 12 | FS within 3′ end of a putative O-methyltransferase encoding gene (31% GC) | 10582 |
| PittEE | 1 | C | 9 | 115 bp within Fe–S cluster assembly scaffold gene | 697640 |
| PittEE | 1 | G | 9 | 324 bp within transferrin-binding protein 2 gene | 1384490 |
| r3655 | 1 | G | 15 | 118 bp upstream of exonuclease ABC subunit C | 1266118 |
| r3655 | 1 | T | 12 | Upstream of hemoglobin-binding protein A encoding gene | 1338394 |
| r3655 | 1 | G | 10 | FS 114 bp within Fe–S cluster assembly scaffold protein encoding gene | 1547902 |
| PittAA | 1 | C | 9 | 114 bp within Fe–S cluster assembly scaffold protein encoding gene | 1350019 |
| PittAA | 1 | C | 11 | 454 bp within O-methyltransferase gene | 1764602 |
| PittII | 1 | C | 9 | within truncated | 237746 |
| PittII | 1 | T | 36 | 120 bp upstream of YadA-domain containing autotransporter adhesin gene | 983892 |
| R3021 | 1 | T | 49 | 122 bp upstream of YadA-domain containing autotransporter adhesin gene | 701090 |
| 22.1.21 | 1 | T | 20 | 121 bp upstream of YadA-domain containing autotransporter adhesin gene | 1278422 |
| 10810 | 1 | A | 38 | 120 bp upstream of YadA-domain containing autotransporter adhesin gene | 1960236 |
| F3031 | 2 | TA | 10 | 225 bp upstream of | 45074 |
| F3031 | 2 | TA | 10 | 225 bp upstream of | 150062 |
| F3031 | 2 | TA | 8 | 104 bp upstream of YadA-domain containing protein gene | 1258000 |
| 22.1.21 | 2 | TA | 9 | 166 bp upstream of | 52984 |
| 22.4.21 | 2 | TA | 9 | 133 bp upstream of | 937926 |
| PittEE | 7 | TGAAAGA | 13 | 69 bp upstream of | 750553 |
| PittEE | 7 | TGAAAGA | 38 | 104 bp upstream of | 1118607 |
| PittAA | 7 | AATTTTG | 14 | FS 3.5 kb within | 1861090 |
| PittAA | 7 | TGAAAGA | 16 | 106 bp upstream of | 876652 |
| r3655 | 7 | AACAACC | 8 | FS within | 763559 |
| 22.4.21 | 7 | AACAACC | 13 | 77 bp within | 1296925 |
| F3043 | 8 | ATTATTTG | 6 | 12 bp upstream of pyridoxine biosynthesis protein gene | 1003128 |
| F3043 | 8 | GCATCATC | 13 | 213 bp upstream of | 1315267 |
| F3043 | 8 | GCATCATC | 12 | 209 bp upstream of | 192933 |
| F3031 | 8 | GCATCATC | 15 | 200 bp upstream of | 1349049 |
| F3031 | 8 | GCATCATC | 14 | 213 bp upstream of | 1603867 |
| r3655 | 8 | ATTATTTG | 6 | 12 bp upstream of pyridoxine biosynthesis protein (near end of contig) | 1203993 |
| PittHH | 8 | ATTATTTG | 4 | 12 bp upstream of pyridoxine biosynthesis protein gene | 1799917 |
| PittAA | 8 | ATTATTTG | 4 | 12 bp upstream of pyridoxine biosynthesis protein gene | 430306 |
| PittII | 8 | ATTATTTG | 6 | 12 bp upstream of pyridoxine biosynthesis protein gene, 77 bp upstream of cytidylate kinase gene | 1077736 |
| 22.1.21 | 8 | ATTATTTG | 4 | 12 bp upstream of pyridoxine biosynthesis protein gene, 77 bp upstream of cytidylate kinase gene | 1370808 |
| 10810 | 8 | ATTATTTG | 6 | 12 bp upstream of pyridoxine biosynthesis protein gene, 77 bp upstream of cytidylate kinase gene | 1862249 |
| PittGG | 9 | CTTGTTTTT | 6 | 8 bp within/13 bp upstream of low similarity to O-antigen polymerases encoding gene | 1453483 |
Analysis of the twelve additional Hi genomes identified 43 non-tetranucleotide SSRs that could potentially mediate phase variation.
The number of tandem repeat units that comprise the SSR.
The base pair at which the 5′ base of the SSR is located.
Location of the SSR relative to, and description of the function of, the ORF whose expression it is proposed to modulate. FS: ORF associated with SSR has a frameshift mutation, bp: base pair.