A Retrospective Study on Genetic Heterogeneity within Treponema Strains: Subpopulations Are Genetically Distinct in a Limited Number of Positions.

Darina Čejková1, Michal Strouhal2, Steven J Norris3, George M Weinstock4, David Šmajs2.   

Abstract

BACKGROUND: Pathogenic uncultivable treponemes comprise human and animal pathogens including agents of syphilis, yaws, bejel, pinta, and venereal spirochetosis in rabbits and hares. A set of 10 treponemal genome sequences including those of 4 Treponema pallidum ssp. pallidum (TPA) strains (Nichols, DAL-1, Mexico A, SS14), 4 T. p. ssp. pertenue (TPE) strains (CDC-2, Gauthier, Samoa D, Fribourg-Blanc), 1 T. p. ssp. endemicum (TEN) strain (Bosnia A) and one strain (Cuniculi A) of Treponema paraluisleporidarum ecovar Cuniculus (TPLC) was examined with respect to the presence of nucleotide intrastrain heterogeneous sites.
METHODOLOGY/PRINCIPAL FINDINGS: The number of identified intrastrain heterogeneous sites in individual genomes ranged between 0 and 7. Altogether, 23 intrastrain heterogeneous sites (in 17 genes) were found in 5 out of 10 investigated treponemal genomes including TPA strains Nichols (n = 5), DAL-1 (n = 4), and SS14 (n = 7), TPE strain Samoa D (n = 1), and TEN strain Bosnia A (n = 5). Although only one heterogeneous site was identified among 4 tested TPE strains, 16 such sites were identified among 4 TPA strains. Heterogeneous sites were mostly strain-specific and were identified in four tpr genes (tprC, GI, I, K), in genes involved in bacterial motility and chemotaxis (fliI, cheC-fliY), in genes involved in cell structure (murC), translation (prfA), general and DNA metabolism (putative SAM dependent methyltransferase, topA), and in seven hypothetical genes.
CONCLUSIONS/SIGNIFICANCE: Heterogeneous sites likely represent both the selection of adaptive changes during infection of the host as well as an ongoing diversifying evolutionary process.

Year:  2015        PMID: 26436423      PMCID: PMC4593590          DOI: 10.1371/journal.pntd.0004110

Source DB:  PubMed          Journal:  PLoS Negl Trop Dis        ISSN: 1935-2727


Introduction

The genus Treponema comprises several uncultivable human and animal pathogens including Treponema pallidum ssp. pallidum (TPA), the causative agent of syphilis, T. p. ssp. pertenue (TPE, the causative agent of yaws), and T. p. ssp. endemicum (TEN, the causative agent of bejel). A treponemal isolate Fribourg-Blanc isolated from a baboon (Papio cynocephalus) in West Africa [1],[2] was recently reclassified as a TPE strain [3]. Another animal pathogen closely related to uncultivable human treponemal pathogens is T. paraluisleporidarum ecovar Cuniculus (TPLC; formerly denoted as Treponema paraluiscuniculi) [4-6], the causative agent of venereal spirochetosis in rabbits. In addition, T. paraluisleporidarum ecovar Lepus [6] causes venereal spirochetosis in hares [7-10]. The human disease pinta is caused by a morphologically identical organism called T. carateum, but this organism has not been propagated in experimentally infected animals and has not been characterized genetically. The first complete genome sequence of TPA strain Nichols was determined in 1998 [11]. In the last several years, whole genome sequences of twelve treponemal pathogens (including re-sequenced TPA strains Nichols and SS14) were completed and published [3],[12-20]. In general, genome analyses performed in these studies revealed that genome differences between individual treponemal strains are very subtle, amounting to less than 2% of the genome sequence between TPA strains and TPLC [21] and 0.2% between TPA and TPE strains [12]. Genetic diversity among the uncultivable pathogenic treponemes is localized mainly within tpr [22-25], arp [25-27], TP0470 [25], TP0136 [28],[29], TP0548 [29],[30], tp92 [31],[32], and mcp genes [15]. In addition, relatively high interstrain genetic diversity has been detected in several other genes, e.g. in TP0304 (hypothetical protein), TP0346 (lipoprotein), TP0515 (outer membrane protein), TP0558 (nickel-cobalt transporter) [33] and TP0967 (hypothetical protein) [25].
The presence of different treponemal subpopulations infecting the same host has been suggested by several early findings, e.g. by detection of two subpopulations using velocity sedimentation during the Hypaque separation procedure [34], and by the identification of a subpopulation that is resistant to phagocytosis [35]. Genetic diversity within individual treponemal strains, i.e. intrastrain genetic diversity, was first found in tprJ and tprK genes during infection of human or animal hosts [36-38]. Several other examples of intrastrain heterogeneity were found in the TPA Nichols [21] and TPA SS14 genomes [14],[16]. In general, intrastrain heterogeneity was found within tpr genes, in sequences paralogous to tpr genes and in the intergenic regions between tpr genes [14],[16],[36-40]. Other loci with identified intrastrain heterogeneity comprised TP0402 (encoding flagellum specific ATP synthase), TP0971 (encoding Tp34 lipoprotein, membrane antigen), TP1029 (encoding hypothetical protein), TP0341 (encoding MurC), and TP0967 (encoding hypothetical protein) [14],[16]. The occurrence of genome heterogeneity (including point mutations, insertions or deletions and gain and loss of mobile genetic elements such as plasmids or phages) within strains is common to many pathogenic bacteria [41-44], and has been found to occur during the course of infection [45-51]. In general, heterogeneous sites may contribute to immune evasion [49] and/or represent adaptive changes during infection of disparate host tissues and compartments [52]. The identification of within-host heterogeneity is an important step in studies tracking transmission networks or in studies mapping bacterial populations during colonization, dissemination and immune clearance [53],[54]. In this communication, whole genome sequences of 10 treponemal strains were systematically analyzed for the presence of intrastrain nucleotide heterogeneous sites. Distinct patterns in the frequency and locations of intrastrain heterogeneous sites were identified among the individual genomes examined.

Materials and Methods

Strains used in this study

The original sequencing data obtained during next-generation sequencing of pathogenic treponemes (Table 1) were used to analyze intrastrain genetic variability. In total, 10 treponemal strains were examined in this study including 4 TPA strains (Nichols, DAL-1, Mexico A, SS14), 4 TPE strains (CDC-2, Gauthier, Samoa D, Fribourg-Blanc), 1 TEN strain (Bosnia A) and one strain of TPLC (Cuniculi A). For the two remaining whole genome sequences (TPA strains Chicago and Sea84-1), the original sequencing data were not deposited in the SRA database.
Table 1

Treponemal genomes used in this study.

Genome | Place and year of isolation | Reference | GenBank accession number, SRA accession number (genome reference) | Average coverage (Illumina/454) | Average Illumina read length (bp) | Estimated Illumina error rate from BWA a (%)
TPA Nichols | Washington, D.C., USA; 1912 | [93] | CP004010.2, SRX012305 [16] | 31x/30x | 36 | 1.65
TPA DAL-1 | Dallas, USA; 1991 | [94] | CP003115.1, SRX012302 [18] | 38x/33x | 36 | 2.07
TPA SS14 | Atlanta, USA; 1977 | [95] | CP004011.1, SRX012306 [16] | 40x/29x | 36 | 1.93
TPA Mexico A | Mexico City, Mexico; 1953 | [96] | CP003064.1, SRX012304 [15] | 43x/- | 36 | 1.51
TPE CDC-2 | Akorabo, Ghana; 1980 | [97] | CP002375.1, SRX012301 [12] | 38x/28x | 36 | 2.07
TPE Gauthier | Brazzaville, Congo; 1960 | [98] | CP002376.1, SRX104412 [12] | 56x/33x | 35 | 0.80
TPE Samoa D | Apia, Samoa; 1953 | [96] | CP002374.1, SRX012307 [12] | 42x/21x | 36 | 2.19
TPE Fribourg-Blanc | Guinea; 1966 | [1],[2] | CP003902.1, SRX104411 [3] | 66x/52x | 35 | 0.32
TEN Bosnia A | Bosnia; 1950 | [99] | CP007548, SRX144510, SRX144511, SRX144514, SRX144515 [20] | 194x/72x | 100 | 0.30
TPLC Cuniculi A | unknown; before 1957 | [96] | CP002103.1, SRX012308 [17] | 20x/9x | 36 | 1.61

a Error rate per nucleotide was estimated using the Burrows-Wheeler Aligner (BWA) [55],[56]

To examine intrastrain heterogeneity within a single strain, selected intrastrain heterogeneous sites were tested in the TPA SS14 strain using four different DNA preparations (4933, 4934, 4950 and 4951), originating from two different rabbit passages. The original treponemal SS14 cells were obtained from Dr. D. L. Cox as stock 2735 (dated 09/24/97) and 2736 (dated 06/20/97), which were used to inoculate rabbits and to harvest treponemal cells of stocks 2839 and 2840, respectively. Bacterial stock 2839 of TPA SS14 was used for two independent isolations of genomic DNA using the Wizard Genomic DNA Purification Kit (Promega, Madison, WI, USA), resulting in DNA isolates numbered 4933 and 4950. Similarly, bacterial stock 2840 of TPA SS14 was used for two independent isolations of genomic DNA designated as 4934 and 4951. At least one independent rabbit passage between stock 2735 and stock 2736 was performed.

Ethics statement

No animal was used in the study.

Identification of intrastrain heterogeneous sites

To ascertain intrastrain heterogeneity within individual treponemal strains, Illumina and 454 reads obtained during whole-genome sequencing procedures were used. The data analysis workflow is depicted in Fig 1. Initially, individual reads were mapped to the corresponding complete genome sequence using the Burrows-Wheeler Aligner (BWA) [55],[56] with default parameters, requiring at least 95% read identity relative to the reference genome. Duplicated reads were identified with the rmdup algorithm in the SAMtools package [55] and removed. To determine the frequency of each nucleotide (allele frequency) at every genome position, the mpileup function in the SAMtools package and a Python script were used [57]. Because of their higher depth of coverage and lower indel error rate, the Illumina sequencing reads were used for intrastrain allele identification.
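The per-position allele frequency mentioned above can be illustrated with a minimal sketch. It is not the script cited as [57]; it assumes the pileup column has already been reduced to a plain string of base calls from non-duplicated reads, and the function name `allele_frequencies` is hypothetical.

```python
from collections import Counter


def allele_frequencies(base_calls):
    """Return {base: frequency} for one genome position.

    base_calls: iterable of single-character base calls from the
    non-duplicated reads covering the position (hypothetical input;
    a real mpileup column needs additional parsing first).
    """
    counts = Counter(b.upper() for b in base_calls if b.upper() in "ACGT")
    depth = sum(counts.values())
    if depth == 0:
        return {}
    return {base: n / depth for base, n in counts.items()}


# Illustrative position: 30 reads, 8 of which carry the less frequent allele.
freqs = allele_frequencies("T" * 22 + "C" * 8)
minor_base, minor_freq = min(freqs.items(), key=lambda kv: kv[1])
print(minor_base, round(minor_freq, 3))
```

In this made-up example the less frequent allele sits at about 27%, which would exceed the 20% threshold applied in the study.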
Fig 1

Data analysis workflow.

(A) An automated identification pipeline and optimization process. (B) An application of further restrictions and verification of identified putative candidates.

To filter out sequencing errors present in the raw data [58-65], nucleotide positions showing at least six independent (not duplicated) individual reads with a frequency ≥ 20% of the less frequent allele were further examined. Moreover, several other restrictions were applied during identification of treponemal heterogeneous sites (Fig 1). First, nucleotide positions located within homopolymeric tracts (defined as a stretch of 6 or more identical nucleotides) or within a 2-nt distance of these tracts were omitted from further analysis. Second, at least three independent reads from both directions were required. Third, individual reads supporting a less frequent allele located at the 3’ terminus of the reads (i.e. four or fewer nucleotides from the 3’ terminus) were omitted. Fourth, heterogeneous positions separated from each other by less than 7 bp were also omitted. The resulting candidate sites for heterogeneous nucleotide positions were subsequently visually inspected using the Integrative Genomics Viewer (IGV) [63-66]. Using the above-mentioned workflow applied to Illumina reads, putative heterogeneous sites were identified. Identified heterogeneous positions were confirmed using a parallel 454 workflow or by Sanger sequencing (Fig 1 and Table 2 and S2 Table). A detailed description of the regions omitted from Illumina analysis, comprising paralogous sequence regions and/or direct repeats, is shown in S1 and S2 Tables. Altogether, 32 genomic regions covering 26,636 bp (2.34% of the entire genome length) were omitted in the TPA Nichols genome (S1 Table). Since paralogous regions in individual genomes are not identical, slightly different regions were omitted from the automated analyses of Illumina sequencing reads in each examined genome (S2 Table). Moreover, the TEN Bosnia A genome was sequenced using pooled segment genome sequencing (PSGS) [12] as separate sequencing runs; therefore, the total length of the excluded regions was lower than in other examined genomes (S2 Table).
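The filtering criteria listed above can be sketched as follows. This is an illustrative reading of the text, not the pipeline used in the study. The `Candidate` record and all function names are hypothetical. The sketch assumes that minor-allele reads within four nucleotides of the 3’ terminus were already dropped before counting, that the "three reads from both directions" rule applies to reads carrying the less frequent allele, and that the 7-bp spacing rule is applied to sites that passed the other filters.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    position: int       # 1-based genome coordinate
    minor_forward: int  # non-duplicated minor-allele reads, forward strand
    minor_reverse: int  # non-duplicated minor-allele reads, reverse strand
    depth: int          # all non-duplicated reads covering the position


def homopolymer_mask(genome, min_run=6, flank=2):
    """1-based positions inside a homopolymeric tract (>= min_run identical
    nucleotides) or within `flank` nt of such a tract."""
    masked = set()
    i, n = 0, len(genome)
    while i < n:
        j = i
        while j < n and genome[j] == genome[i]:
            j += 1
        if j - i >= min_run:
            start = max(0, i - flank)
            end = min(n, j + flank)
            masked.update(range(start + 1, end + 1))
        i = j
    return masked


def passes_read_filters(c, min_minor=6, min_freq=0.20, min_per_strand=3):
    """Apply the read-count, frequency and strand thresholds."""
    minor = c.minor_forward + c.minor_reverse
    return (
        c.depth > 0
        and minor >= min_minor
        and minor / c.depth >= min_freq
        and c.minor_forward >= min_per_strand
        and c.minor_reverse >= min_per_strand
    )


def select_sites(candidates, masked, min_spacing=7):
    """Return positions that survive all filters, dropping sites that lie
    closer than `min_spacing` bp to another surviving site."""
    kept = sorted(
        (c for c in candidates
         if passes_read_filters(c) and c.position not in masked),
        key=lambda c: c.position,
    )
    result = []
    for k, c in enumerate(kept):
        near_prev = k > 0 and c.position - kept[k - 1].position < min_spacing
        near_next = (k + 1 < len(kept)
                     and kept[k + 1].position - c.position < min_spacing)
        if not (near_prev or near_next):
            result.append(c.position)
    return result


# Toy genome with one A6 tract at positions 9-14 (masked: 7-16).
genome = "ACGTACGT" + "AAAAAA" + "CGTACGTACGTACGTACGTACGTACG"
candidates = [
    Candidate(15, 4, 4, 30),  # next to the homopolymeric tract
    Candidate(20, 4, 3, 30),  # 4 bp away from position 24
    Candidate(24, 5, 5, 30),
    Candidate(30, 6, 1, 30),  # too few reverse-strand reads
    Candidate(35, 3, 3, 30),  # passes every filter
]
print(select_sites(candidates, homopolymer_mask(genome)))
```

Only position 35 survives in this toy example; the other candidates fail on the homopolymer, spacing, or strand criteria.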
Table 2

Summary of the intrastrain variable sites identified within Illumina sequencing reads in investigated treponemal genomes.

T. p. strain (average coverage Illumina/454) a | Genome sequence | Verified by 454 or Sanger sequencing | Major/minor allele | Gene/Genome position | Amino acid change b | Protein function/Functional group | Cell localization c
TPA Nichols (31x/30x) | T | 454 | T/C | TPANIC_0006/7179 | *56S; read through stop codon | Hypothetical protein/Unknown | cytoplasm
 | T | 454 | T/C | TPANIC_0051/59894 | S104P | PrfA/Translation | cytoplasm
 | A | 454 | A/C | TPANIC_0222/228259 | E46D; conservative | Hypothetical protein/Unknown | unknown
 | G | Sanger | G/A | TPANIC_0471/500905 | D357N | Hypothetical protein/Unknown | cytoplasmic membrane
 | T | 454 | G/T | upstream of TPANIC_0584/635418 | n/a d | n/a | n/a
TPA DAL-1 (38x/33x) | C | 454 | C/T | TPADAL_0065/71972 | R70W | SAM dependent methyltransferase/General metabolism | cytoplasm
 | G | Sanger | G/A | TPADAL_0720/789942 | A155V; conservative | CheC-FliY/Motility, Chemotaxis | cytoplasm, flagellar
 | T | 454 | T/C | TPADAL_0720/790038 | N123S | CheC-FliY/Motility, Chemotaxis | cytoplasm, flagellar
 | T | 454 | T/G | TPADAL_0897/976768 | K338Q | TprK/Virulence | periplasm [85]
TPA SS14 (40x/29x) | G | 454 | G/C | TPASS_20117/135108 | N533K | TprC/Virulence | outer membrane [100]
 | A | 454 | A/G | TPASS_20117/135261 | Y483H | TprC/Virulence | outer membrane [100]
 | T | 454 | C/T | TPASS_20341/364888 | L64P | MurC/Cell structure | cytoplasm
 | A | Sanger | A/C | TPASS_20394/420117 | H107P | TopA/DNA metabolism | cytoplasm
 | T | 454 | T/C | TPASS_20402/428628 | L134P | FliI/Motility | cytoplasm
 | G | 454 | G/T | TPASS_20402/428930 | A235S | FliI/Motility | cytoplasm
 | G | 454 | G/A | TPASS_21029/1125352 | D12D; synonymous | Hypothetical protein/Unknown | cytoplasm
TPE Samoa D (42x/21x) | C | 454 | C/T | TPESAMD_0134/155544 | C284Y | Hypothetical protein/Unknown | unknown
TEN Bosnia A (194x/72x) | C | 454 | C/G | TENDBA_0314/331578 | E215Q | Hypothetical protein/Unknown | unknown
 | A | 454 | A/T | TENDBA_0314/331618 | H201Q | Hypothetical protein/Unknown | unknown
 | A | 454 | A/G | TENDBA_0316/333355 | V240A; conservative | chimeric TprGI e /Virulence | unknown
 | C | 454 | C/T | TENDBA_0621/672156 | T104T; synonymous | TprI/Virulence | unknown
 | S | 454 | C/G | TENDBA_0897/974407 | E347Q | TprK/Virulence | periplasm [69]
 | TCCTCCCCC | 454 | 9 bp indel f | TENDBA_0967/1049918-1049951 | n/a | Hypothetical protein/Unknown | unknown

Illumina-identified intrastrain variable sites were verified using 454 or Sanger sequencing.

a No intrastrain heterogeneous sites were identified in the TPA Mexico A, TPE CDC-2, TPE Gauthier, TPE Fribourg-Blanc and TPLC Cuniculi A genomes

b Nonconservative amino acid replacements are not listed

c If not indicated, localization was predicted by PSORTb

d Not applicable

e [20],[23]

f Variable number of direct repeats (TCCTCCCCC)

DNA amplification and DNA sequencing

Altogether, 26 putative heterogeneous positions identified in the Illumina workflow but not confirmed by the 454 sequences (Fig 2, Table 2 and S3 Table) were subjected to DNA amplification and Sanger sequencing. Moreover, six heterogeneous positions identified in the TPA SS14 genome in this study or by Matějková et al. [14] were tested in four different SS14 DNA preparations originating from two different rabbit passages (Table 3). Primers used for DNA amplification and sequencing are specified in S4 and S5 Tables. PCR was performed as follows: an initial step at 94°C (1 minute) was followed by 30 cycles of 94°C (30 seconds), 55°C (30 seconds), and 72°C (1 minute), and by a final extension step at 72°C (7 minutes). PCR products were sequenced with the amplification primers using dye-terminator Sanger sequencing technology. The frequency of alternative alleles in heterogeneous positions was calculated from the ratio of corresponding areas under the chromatogram curves. Sequence analysis of Sanger reads was performed using Lasergene software (DNASTAR, Inc., Madison, WI, USA).
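The peak-area calculation described above can be written as a short sketch. It assumes that the reported proportion is the area of the alternative-allele peak divided by the summed areas of both peaks at that position; the function name and the peak areas below are hypothetical and are not data from the study.

```python
def alternative_allele_fraction(area_reference, area_alternative):
    """Fraction of the alternative allele at one chromatogram position,
    computed from the areas under the two overlapping peaks."""
    total = area_reference + area_alternative
    if total <= 0:
        raise ValueError("peak areas must sum to a positive value")
    return area_alternative / total


# Hypothetical peak areas (arbitrary units) for a mixed T/C position.
print(round(alternative_allele_fraction(650.0, 350.0), 2))
```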
Fig 2

A schematic representation of the identified heterogeneous positions in all investigated genomes.

The proportion of alternative alleles is based on nucleotide frequency within individual Illumina reads. While red cells represent identified sites of intrastrain heterogeneity, grey cells represent sites of intrastrain homogeneity. The numbers within cells indicate the number of alternative/standard reads in the sites where the proportion of alternative reads exceeded 10% but was lower than 20% and therefore remained below the threshold used in this study. Blue cells show nucleotide positions omitted from analysis due to excluded paralogous sequences (S2 Table). For the Bosnia A strain, the intrastrain heterogeneous sites TENDBA_0314/331578, TENDBA_0314/331618, TENDBA_0317/333355 and TENDBA_0621/672156 are not shown because in all other genomes these positions were excluded from analysis due to paralogous sequences. Note that the TPADAL_0897/976768 and TENDBA_0897/974407 positions are the same.

Table 3

Selected intrastrain heterogeneous sites identified in TPA SS14, examined in four different DNA preparations.

Bacterial stock no. | DNA preparation no. | G/C a, c, d | A/G c, d | T/C c | T/C c, d | G/T c, d | T/C d
 | | TPASS_20117/135108 | TPASS_20117/135261 | TPASS_20341/364888 | TPASS_20402/428628 | TPASS_20402/428930 | TPASS_20971/1056002
2839 | 4933 | G/C (0.0–0.1) | A/G (0.0–0.2) | T/C (0.5–0.6) | T (0.0) | T (1.0) | T/C (0.5–0.6)
 | 4950 b | G (0.0) | A (0.0) | T/C (0.5–0.6) | T (0.0) | T (1.0) | T/C (0.7)
2840 | 4934 | G/C (0.3–0.4) | A/G (0.4–0.6) | T/C (0.7) | T/C (0.2–0.3) | G/T (0.4–0.7) | T/C (0.3)
 | 4951 b | G/C (0.3–0.4) | A/G (0.4–0.5) | T/C (0.5) | T/C (0.3–0.4) | G/T (0.3–0.6) | T/C (0.1)

DNA preparations originated from two different rabbit passages. Relative proportions of alleles not stated in the reference genome are shown in parentheses as derived from repeated Sanger sequencing.

a The first nucleotide corresponds to the sequence published in the SS14 genome sequence CP004011.1 [16]

b DNA preparations 4950 and 4951 were used for whole genome sequencing of the TPA SS14 strain by Matějková et al. [14]; preparation 4951 was used for re-sequencing of this strain [16]

c Heterogeneous positions identified in this study (Table 2)

d Heterogeneous positions identified by Matějková et al. [14]

Conserved protein domain database search

The NCBI Conserved Domain Database [67] and InterProScan [68] were used to predict protein domains. Putative protein localization within a cell was determined using the PSORTb program [69].

Results

A set of 10 treponemal whole genome sequences including those of 4 TPA strains (Nichols, DAL-1, Mexico A, SS14), 4 TPE strains (CDC-2, Gauthier, Samoa D, Fribourg-Blanc), 1 TEN strain (Bosnia A) and one strain of TPLC (Cuniculi A) were examined with respect to the presence of intrastrain heterogeneous sites. All but one (TPA Mexico A) genomes were sequenced using both Illumina and 454 sequencing methods. Characteristics of the sequence data obtained with each strain, including the average coverage attained during Illumina and 454 sequencing, are shown in Table 1. Altogether, 890 potentially heterogeneous positions among investigated genomes were identified using an automated pipeline (Fig 1). Several criteria (see Materials and methods) were used to filter out sequencing errors from genetic heterogeneity naturally occurring in treponemal strains (i.e. representing intrastrain heterogeneous sites), which reduced the 890 nucleotide positions to 46 candidates (Fig 1). Regions containing paralogous sequences and tandem repeats (summarized in S1 and S2 Tables) were omitted from the automated analyses of intrastrain heterogeneity due to the risk of ambiguously mapped reads. Using these criteria, 32 genomic regions covering 26,636 bp (2.34% of the entire genome length) were excluded from the analysis of Illumina sequencing reads in the TPA Nichols genome (S1 Table). Except for the TEN strain Bosnia A, similar regions were also excluded in whole genome sequences in other tested genomes (S2 Table) (see Materials and Methods). An instance of intrastrain heterogeneity was considered to be present if 1) two different nucleotides (or an indel) were detected at a given genome coordinate, and 2) this heterogeneity was present in at least two sequencing analyses using different sequencing chemistry. The automated analysis of Illumina reads revealed 46 candidates (Fig 1), of which 20 heterogeneous sites were directly verified by automated analysis of 454 reads. 
The remaining 26 candidate sites, solely found in Illumina reads, were sequenced using Sanger technology, and in three of them, heterogeneous sites were identified (Tables 2 and S3).
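The candidate counts reported above can be checked with simple arithmetic. The snippet below only restates numbers given in the text; the variable names are illustrative.

```python
# Counts taken from the Results text above.
raw_candidates = 890        # positions flagged by the automated pipeline
filtered_candidates = 46    # candidates left after the additional restrictions
verified_by_454 = 20        # confirmed by automated analysis of 454 reads
verified_by_sanger = 3      # confirmed among the Sanger-sequenced candidates

sanger_tested = filtered_candidates - verified_by_454
verified_total = verified_by_454 + verified_by_sanger

print(sanger_tested)                         # candidates sent to Sanger sequencing
print(verified_total)                        # verified heterogeneous sites
print(verified_total / filtered_candidates)  # verified fraction of candidates
```

The totals agree with the 26 Sanger-tested candidates and the 23 verified sites reported in the text.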

Intrastrain heterogeneous sites are mainly present in TPA and TEN but not in TPE strains

The 23 intrastrain heterogeneous sites, identified using the automated analysis of Illumina sequencing reads and either 454 or Sanger sequencing reads, were found in 5 out of 10 investigated treponemal genomes (Table 2), including TPA strains Nichols, DAL-1, and SS14, TPE strain Samoa D and TEN strain Bosnia A. No intrastrain heterogeneous sites were identified in TPA Mexico A, TPE CDC-2, Gauthier, Fribourg-Blanc and TPLC Cuniculi A genomes. Up to 7 intrastrain heterogeneous sites were identified in individual genomes. Whereas only one heterogeneous site was identified in the 4 examined TPE strains, 16 heterogeneous sites were detected among the 4 TPA strains analyzed. The TEN strain Bosnia A contained 5 single nucleotide heterogeneous sites, however, four of these heterogeneous sites (TENDBA_0314/331578, TENDBA_0314/331618, TENDBA_0317/333355 and TENDBA_0621/672156) were located within paralogous regions that had been excluded from analysis in all other genomes (S2 Table). In contrast to other genomes, the TEN Bosnia A genome was sequenced using the pooled segment genome sequencing method (PSGS) [20] as four distinct samples, whereas other treponemal genomes were not subdivided prior to Illumina sequencing. Therefore, orthologous genes to TENDBA_0314, TENDBA_0317 and TENDBA_0621 genes were not completely analyzed in other genomes. In contrast, the same heterogeneous site found in the tprK gene of TEN Bosnia A (TENDBA_0897/974407) was also identified in the TPA DAL-1 strain (TPADAL_0897/976768). Interestingly, this genome position is included in tprK variable regions of the TPA SS14 and Mexico A genomes, however, it was included in non-variable regions in all other genomes [37]. Therefore, in TPA SS14 and Mexico A genomes, these tprK hypervariable regions were excluded from analyses (Fig 2). 
In four cases, comprising genes TPASS_20117 (tprC), TENDBA_0314 (hypothetical gene), TPASS_20402 (fliI) and TPADAL_0720 (fliY), two heterogeneous sites were found in each gene (Fig 2 and Table 2).

Characteristics of identified intrastrain heterogeneous sites

All but one of the heterogeneous sites represented alternative nucleotides resulting from substitutions; the remaining site was indel-variable (Table 2). Out of 23 identified heterogeneous sites, one was localized in an intergenic region and all others (n = 22) were within the predicted coding regions comprising 17 genes. The heterogeneous genes encode Tpr proteins (TprC, TprI, TprK and a chimeric TprGI), proteins involved in bacterial motility and chemotaxis (FliI and CheC-FliY), translation (PrfA), peptidoglycan synthesis (MurC), general metabolism (putative SAM dependent methyltransferase), DNA metabolism (TopA), and hypothetical proteins of unknown function (TPANIC_0006, TPANIC_0222, TPANIC_0471; TPASS_21029; TPESAMD_0134; TENDBA_0314, TENDBA_0967). One alternative allele replaced a stop codon and resulted in protein elongation, while the others resulted in synonymous (n = 2) or nonsynonymous mutations (n = 18). Of the nonsynonymous mutations, 3 resulted in conservative and 15 in nonconservative amino acid replacements (Table 2). Transitions (n = 13) were found more frequently than transversions (n = 9). Most frequent were C→T and G→A (n = 9) transitions while T→C and A→G transitions were less frequent (n = 4). C→A and T→A transversions were not found.
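The transition and transversion totals can be reproduced from the major/minor allele column of Table 2. The sketch below classifies each substitution pair without regard to direction, since the table does not state which allele is ancestral; the dictionary simply transcribes the Table 2 alleles (the indel is left out) and the function name is illustrative.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_class(allele_pair):
    """Classify a 'X/Y' allele pair as a transition or a transversion."""
    a, b = allele_pair.upper().split("/")
    if a == b:
        raise ValueError("alleles must differ")
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


# Major/minor alleles transcribed from Table 2 (substitutions only).
table2_alleles = {
    "TPA Nichols": ["T/C", "T/C", "A/C", "G/A", "G/T"],
    "TPA DAL-1": ["C/T", "G/A", "T/C", "T/G"],
    "TPA SS14": ["G/C", "A/G", "C/T", "A/C", "T/C", "G/T", "G/A"],
    "TPE Samoa D": ["C/T"],
    "TEN Bosnia A": ["C/G", "A/T", "A/G", "C/T", "C/G"],
}

tally = Counter(
    substitution_class(pair)
    for pairs in table2_alleles.values()
    for pair in pairs
)
print(tally["transition"], tally["transversion"])
```

The tally gives 13 transitions and 9 transversions, matching the counts stated in the text.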

Identification of the intrastrain heterogeneous sites in different passages of TPA SS14

To test whether intrastrain heterogeneous sites were present stably within different rabbit passages, a set of intrastrain heterogeneous sites identified in the TPA SS14 were examined in four different DNA preparations originating from two different rabbit passages (see Materials and methods, Table 3). While DNA samples 4933 and 4950 were isolated from the same batch of treponemal cells (batch 2839), DNA samples 4934 and 4951 were prepared from bacterial stock 2840. Only minimal differences in the presence and frequency of alternative alleles were found between 4933 and 4950 (and also between 4934 and 4951), whereas clear differences between DNA preparations obtained from bacterial stocks 2839 and 2840 were found (Table 3).

Discussion

In this study, correct identification of intrastrain variable sites was considered of critical importance. To filter out sequencing errors, several restrictions in the detection algorithms were applied. Paralogous genome regions were omitted from analyses due to the risk of incorrect mapping of individual reads belonging to different genome regions. Duplicated reads, i.e. reads that showed identical start and end points, were automatically identified and removed from further analyses in order to analyze only uniquely generated sequencing reads and to remove potential bias arising during DNA amplification. Since most Illumina errors are nucleotide substitutions located at the 3’ DNA end [58],[70], sequence differences close to the 3’ DNA end of individual reads (at positions 4 or fewer nucleotides from the end) were filtered out. An increased error rate within and in close proximity to homopolymeric regions was also reported for the original Solexa chemistry [71]. Therefore, we also filtered out differences in homopolymeric tracts and in close vicinity (defined as a 2-nt distance) to homopolymeric tracts, although we are aware that variations in the length of homopolymeric tracts, especially those composed of guanosine tandem repeats, are of biological importance. These tandem repeats are known to regulate transcription (if located in promoter regions) and have been identified in the T. pallidum genomes [72],[73]. To further increase the validity of the results, only alternative reads reaching at least a 20% frequency were analyzed. In summary, these relatively stringent measures certainly led to a number of missed heterogeneous sites both in the analyzed and in the non-analyzed genome regions. In addition to missed single nucleotide heterogeneous sites, larger sequences showing genetic heterogeneity were likely also missed due to the relatively short length of Illumina reads and due to the restrictions applied in the detection algorithm. 
An example of such sites could be the 1.3 kb-long tprK-like sequence between TP0126 and TP0127 or the 64 bp-long indel between TP0135 and TP0136, previously identified in the TPA Nichols genome [25],[39]. Another example comes from this work where one region of intrastrain heterogeneity comprising a 9 nt-long insertion sequence in TENDBA_0967 was found in the Bosnia A strain during manual inspection of individual reads. The insertion represents an additional tandem repetition within a larger region between coordinates 1044918 and 1044951. Despite the possibility of missed sites of intrastrain heterogeneity, the automated analysis pipeline used in this study revealed 46 putative heterogeneous sites and 23 of them (50.0%) were verified using an independent sequencing method with different sequencing chemistry. The remaining, non-verified 23 positions likely represent falsely identified sites, likely as a consequence of accumulated error-containing Illumina reads. The majority of heterogeneous sites identified in this study represented transitions and not transversions, which, in general, are common Illumina sequencing errors; A→C was most common, followed by G→T transversions [59],[70]. The number of heterogeneous sites in a particular genome did not correlate with average sequencing coverage nor with estimated percent Illumina error rate per nucleotide (Table 1). Although heterogeneous sites were found to be mostly strain-specific, several examples revealed the same heterogeneous site was identified in two genomes. The same heterogeneous site was found in the tprK gene of the DAL-1 and Bosnia A genomes. Interestingly, the same position was also found to be heterogeneous in the Nichols genome, although the number of Illumina reads supporting the less frequent nucleotide remained below threshold (SRX012305, Fig 2). A similar situation was also found in two other sites, one in SS14 and Cuniculi A genomes and the other one in Samoa D and Nichols genomes (Fig 2). 
These findings indicate that the number of intrastrain heterogeneous sites per genome is limited and that different treponemal strains tend to display variability in the same positions of several genes. The abundance of nonsynonymous mutations, nonconservative amino acid replacements and the fact that most of the heterogeneous sites were located within coding regions suggest that the heterogeneous sites represent beneficial adaptive mutations [74]. In this study, 23 intrastrain heterogeneous sites in 17 genes were identified in 5 out of 10 investigated treponemal genomes, predominantly in TPA strains. The reason why most of the heterogeneous sites were identified in the TPA, but not in TPE strains, is not clear, however, it might reflect different tissue tropism of TPA and TPE strains, different growth rate in experimental rabbits, differences in pathogenesis or other reasons. Regardless, this finding indicates distinct genetic characteristics of TPA and TPE strains. Although the TEN strain Bosnia A resembled TPA strains in this respect, most of the heterogeneous positions were identified in paralogous regions which were excluded from the automated analysis of other genomes (Fig 2). The single heterogeneous site identified in nonparalogous regions in the Bosnia A genome thus resembles TPE strains. In fact, the Bosnia A genome is more related to TPE strains than to TPA strains, although several sequences similar to TPA sequences were identified in the Bosnia A genome [20]. In contrast to other TPA strains, analysis of the TPA Mexico A strain did not reveal any heterogeneous sites (Fig 1 and Table 2). Unlike other TPA strains, the Mexico A genome has been shown to contain two TPE-like sequences [15]. However, it remains unclear whether these two observations are related. A comparison of our results with a previously published paper describing heterogeneous sites in the TPA SS14 strain [14] is shown in the Table 4. 
In the analyzed portion of the SS14 genome, Matějková et al. found 18 heterogeneous sites. Of these 18 sites, 5 were detected automatically in our study. At 4 other sites, the frequency of the alternative allele was below the threshold and/or did not meet the restriction criteria; nonetheless, manual inspection revealed the presence of the alternative allele. In two additional cases, the heterogeneity was identified in 454 reads (SRX000109) but not in Illumina reads. Comparison of our results with those published by Matějková et al. [14] thus identified a substantial overlap; however, 7 sites (38.9%) detected by Matějková et al. were not found in our study. Interestingly, all non-detected heterogeneous sites were located in tpr genes (including tprC, tprI, and tprJ) or in the intergenic regions between them. At least two independent explanations can be proposed. One explanation is that the BWA (Burrows-Wheeler Aligner) mapping algorithm used in this study was not able to detect closely spaced heterogeneous sites representing a specific haplotype in relatively short Illumina or 454 reads because of alignment restrictions. To align an individual read to the reference sequence, 95% identity with the reference genome sequence was required in our study. However, no such reads were found in the raw data set (SRX012306, SRX000109). The other explanation involves falsely identified heterogeneous sites resulting from PCR-based errors introduced during amplification of diluted target DNA and subsequent cloning of PCR products, as was done in the work of Matějková et al. [14]. The latter explanation is also supported by the fact that the undetected heterogeneous sites were often supported by low numbers of alternative clones (Table 4). Deeper sequencing of the identified heterogeneous genome sites will be needed to answer these questions.
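The effect of the 95% identity requirement on closely spaced heterogeneous sites can be illustrated with a short calculation. The sketch below is only an illustration under stated assumptions: identity is taken as matching bases divided by read length, only mismatches (no gaps) are counted, and the read lengths in the usage note are hypothetical values, not those of the data sets analyzed here. The function names are likewise hypothetical and do not correspond to BWA parameters.

```python
def max_mismatches(read_length: int, min_identity_percent: int = 95) -> int:
    """Largest mismatch count that still satisfies the identity threshold.

    Assumption: identity = (read_length - mismatches) / read_length.
    Integer arithmetic is used to avoid floating-point rounding.
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    if not 0 <= min_identity_percent <= 100:
        raise ValueError("min_identity_percent must be within 0..100")
    return read_length * (100 - min_identity_percent) // 100


def read_passes(read_length: int, mismatches: int,
                min_identity_percent: int = 95) -> bool:
    """True if a read with this many mismatches meets the threshold."""
    return (read_length - mismatches) * 100 >= min_identity_percent * read_length


if __name__ == "__main__":
    # Hypothetical read lengths, chosen only for illustration.
    for length in (36, 100, 250):
        print(length, "nt read tolerates", max_mismatches(length), "mismatches")
```

Under these assumptions, a hypothetical 36 nt read tolerates a single mismatch, so a read carrying two or more linked variant positions of an alternative haplotype would fail to align, whereas longer reads tolerate proportionally more.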
Table 4

Comparison of heterogeneous positions identified in TPA SS14 strain by Matějková et al. [14] and by the automated pipeline used in this study.

| Gene | Genome position in the SS14 genome CP000805.1 (CP004011.1) ᵃ | Heterogeneity identified by Matějková et al. [14] ᵇ | Nucleotide frequency identified in this study ᵇ | Heterogeneity detected in Illumina reads |
| --- | --- | --- | --- | --- |
| TPASS_20117 | 135098 (135108) | G or C (5/6) | G or C (32/12) | yes |
| | 135107 (135117) | T or C (3/4) | T or C (50/1) | yes ᶜ |
| | 135235 (135245) | G or A (2/10) | A (46) | no |
| | 135239 (135249) | C or T (2/10) | T (49) | no |
| | 135251 (135261) | A or G (6/6) | A or G (41/11) | yes |
| TPASS_20402 | 427435 (428628) | C or T (NA) | C or T (15/21) | yes |
| | 427737 (428930) | G or T (NA) | G or T (25/14) | yes |
| TPASS_20620 | 671746 (673228) | T or C (9/3) | T (23) | no |
| | 671751 (673233) | T or G (19/10) | T (22) | no (but detected by 454) ᵈ |
| | 671753 (673235) | T or C (19/10) | T (22) | no (but detected by 454) ᵈ |
| | 671763 (673245) | C or T (8/4) | C or T (24/5) | yes ᶜ (also detected by 454) ᵈ |
| | 672286 (673768) | G or A (4/12) | A (29) | no |
| Upstream of TPASS_20620 | 672916–7 (674399–674400) | (-) or C (6/6) | (-) or C (7/5) | yes ᶜ |
| | 672944 (674427) | A or G (14/6) | A (14) | no |
| TPASS_20621 | 673425 (674908) | C or T (2/8) | T (44) | no |
| | 673428 (674911) | A or G (2/8) | G (44) | no |
| TPASS_20971 ᵉ | 1054447 (1056002) | T or C (NA) | T or C (35/3) | yes ᶜ |
| TPASS_21029 | 1123796 (1125352) | G or A (5/6) | G or A (24/18) | yes |

ᵃ Additional intrastrain heterogeneous genome positions identified by Matějková et al. [14], including 135141, 135144, 135149, 135220, 135227, 671982, 672004, 672016, 672025, 672026, 672027, 672028, 672036, 672039, 672040, 672041, 672042, 672043, 672044, 672154, 673088, 673119, 673511, 673545, 673550, and 673554 (according to CP000805.1), were located in paralogous regions and therefore were excluded from the automated pipeline (S2 Table).

ᵇ Numbers in parentheses show numbers of sequenced clones [14] or nucleotide frequency within individual Illumina sequence reads (this study); NA, not available.

ᶜ Not present in Table 2; heterogeneous positions were detected in raw Illumina sequencing reads but were excluded due to study criteria.

ᵈ These heterogeneous sites were not found among Illumina reads, but were identified among 454 reads (SRX000109).

ᵉ See also Table 3; independent DNA preparations showed clear differences in proportions of alternative alleles, ranging from 0.1 to 0.7.

In bacterial genomes, most mutations represent C→T transitions arising via deamination of cytosine [75], T→C transitions via oxidation of thymine and/or inefficient DNA repair [76], A→G transitions via deamination of adenine [76], and G→T transversions via oxidation of guanine [76]. In fact, these 4 (out of 12 possible) mutations were observed in 11 out of 22 single nucleotide substitutions (50%), indicating that the most common types of substitutions overlap with the most frequently seen bacterial mutations. In contrast, sample oxidation frequently results in C→A and G→T changes [77], while Illumina errors are predominantly A→C transversions [59],[70]. Only three such substitutions (out of 22; 13.6%) were, in fact, found in this study, indicating that these substitutions are not overrepresented.
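The substitution classes discussed above can be expressed as a small classifier. This is a minimal sketch rather than the procedure used in this study: the category sets are transcribed from the text, the function and constant names are hypothetical, and the substitutions in the usage note are arbitrary examples, not sites from this work. Note that G→T appears both among the common bacterial mutations and among the oxidation-associated changes, so the categories are reported independently.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

# Category sets as described in the text (ref base, alt base).
COMMON_BACTERIAL = {("C", "T"), ("T", "C"), ("A", "G"), ("G", "T")}
OXIDATION_ASSOCIATED = {("C", "A"), ("G", "T")}
ILLUMINA_ASSOCIATED = {("A", "C")}


def is_transition(ref: str, alt: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES


def classify(ref: str, alt: str) -> dict:
    """Classify a single nucleotide substitution ref -> alt."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid or ref == alt:
        raise ValueError("ref and alt must be two different bases from A, C, G, T")
    pair = (ref, alt)
    return {
        "kind": "transition" if is_transition(ref, alt) else "transversion",
        "common_bacterial": pair in COMMON_BACTERIAL,
        "oxidation_associated": pair in OXIDATION_ASSOCIATED,
        "illumina_associated": pair in ILLUMINA_ASSOCIATED,
    }


if __name__ == "__main__":
    # Arbitrary example substitutions, not data from the study.
    for ref, alt in [("C", "T"), ("A", "C"), ("G", "T")]:
        print(f"{ref}->{alt}", classify(ref, alt))
```

Applied to a list of verified and non-verified candidate sites, such a classifier would give the per-category fractions compared in the text.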
Interestingly, the candidate sites identified using the Illumina pipeline but not verified by other sequencing techniques (S3 Table) frequently (in 73.9%) included these types of mutations, which points to Illumina sequencing as a source of errors and false-positive results. TPA SS14 bacterial stocks 2839 and 2840 differed by at least 12–14 treponemal generations of separate cultivation, corresponding to two rabbit subcultivations each, with an approximately 100-fold increase in the number of treponemes per subcultivation. Heterogeneous sites were clearly different in DNA preparations obtained from different bacterial stocks, indicating the dynamic nature of this heterogeneity. This observation could also explain the strain-specificity of intrastrain heterogeneous sites identified in this study. The role of rabbit passages in the occurrence of heterogeneous sites remains unknown; however, genetic heterogeneity has also been identified in treponemes isolated directly from the human host (Natasha Arora, personal communication). The occurrence of intrastrain heterogeneity in TPA from human samples suggests its potential significance for molecular typing of syphilis treponemes by both sequencing approaches [78],[79] and RFLP analysis of amplified genes [80],[81]. Out of 22 heterogeneous sites showing alternative nucleotides, 16 were found in conserved genome positions (where all investigated genomes had identical sequences), while 6 were found in genome positions in which the analyzed genomes differed in sequence. In 5 out of these 6 sites, the alternative nucleotides of the heterogeneous positions matched nucleotide sequences present in the analyzed genomes. Considering the highest divergence observed in treponemal genomes, which represents 0.84% sequence diversity between the conserved regions of the TPA and TPLC genomes [17], the theoretical probability that a heterogeneous site would be located at a nonconserved genome position is 8.4 × 10⁻³.
In our study, heterogeneous sites were found more frequently (in 6 out of 22) in nonconserved genome positions (2.7 × 10⁻¹; p < 0.001), suggesting a role for heterogeneous sites in the process of treponemal genome diversification. This study identified heterogeneous sites in four tpr genes, in genes involved in bacterial motility and chemotaxis (2), in cell structure (1), translation (1), general and DNA metabolism (2), and in seven hypothetical genes. The average expression rate of these 17 genes (1.33) during experimental rabbit infection was greater than the whole genome average (1.0) [82], indicating that these genes are expressed during host infection. Interestingly, heterogeneous sites were identified in tprC, tprI, tprK and chimeric tprGI genes. Several studies have shown that Tpr antigens are expressed during infection and are able to elicit antibody and cellular immune responses in the infected host [23],[83],[84]. Moreover, several Tpr proteins have been predicted to be outer membrane proteins [23],[85]. In addition, the tprK gene undergoes antigenic changes in seven variable regions, and TprK variants are selected by the immune response [86],[87]. It has also been shown that tprK variants accumulate during infection of the host [88],[89] and that individual TprK variants helped to disseminate T. pallidum infections [87]. As demonstrated by LaFond et al. [90], variable regions elicited a variant-specific antibody response, indicating that minor sequence changes may affect antibody binding. In this context, nonconservative changes could result in strain-specific surface-exposed epitopes that are crucial for immune evasion, as previously predicted for discrete variable regions within TprC and TprD [23]. In E. coli, a mutation in topA (corresponding to TPASS_20394) has been shown to affect fitness relative to isogenic constructs [91].
Moreover, topA and genes involved in cell wall biosynthesis and translation have been shown to mutate repeatedly in independent lines of E. coli during a long-term cultivation experiment [74]. Heterogeneous sites in pathogenic treponemal strains may therefore represent adaptive changes that take place during infection of various host tissues and compartments, as described in other bacteria [52]. At the same time, these sites may represent snapshots of an ongoing evolutionary trajectory. Advances in deep sequencing techniques and prospective whole genome sequencing or metagenomic studies will help, in the future, to identify a larger and perhaps more complete set of treponemal intrastrain heterogeneous sites [53],[54],[92].
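The enrichment of heterogeneous sites at nonconserved genome positions reported in this Discussion (6 of 22 sites against a theoretical probability of 0.0084) can be checked with a one-sided binomial tail. The text does not state which statistical test produced the reported p-value, so the exact binomial test used here is an assumption, and the function name is hypothetical.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), summed exactly."""
    if not 0 <= k <= n:
        raise ValueError("k must be within 0..n")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


# Values taken from the text.
p_expected = 8.4e-3      # 0.84% divergence between TPA and TPLC conserved regions
n_sites = 22             # heterogeneous sites showing alternative nucleotides
n_nonconserved = 6       # sites located at nonconserved genome positions

observed_fraction = n_nonconserved / n_sites
p_value = binomial_upper_tail(n_nonconserved, n_sites, p_expected)

if __name__ == "__main__":
    print(f"observed fraction: {observed_fraction:.2f}")
    print(f"one-sided binomial p-value: {p_value:.3g}")
```

Under this assumption the tail probability falls well below 0.001, in line with the significance level reported in the text; other tests could give a different exact value.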

Chromosomal paralogous regions not included in the automated analysis of Illumina sequencing reads of the TPA Nichols genome.

(XLS)

Chromosomal paralogous regions not included in the automated analyses of Illumina sequencing reads of all investigated genomes.

(XLS)

A set of 23 putative heterogeneous positions identified solely by the Illumina workflow, but not verified by other sequencing methods.

(XLS)

Primers used for DNA amplification and Sanger sequencing of 26 heterogeneous candidate positions (not-verified by 454 workflow).

(XLS)

List of primers used for DNA amplification and Sanger sequencing of selected intrastrain heterogeneous sites in four different TPA SS14 DNA preparations.

(XLS)