Nikolaus F Zwickl, Nancy Stralis-Pavese, Christina Schäffer, Juliane C Dohm, Heinz Himmelbauer.
Abstract
BACKGROUND: Tannerella forsythia is a bacterial pathogen implicated in periodontal disease. Numerous virulence-associated T. forsythia genes have been described; however, it is necessary to expand the knowledge of T. forsythia's genome structure and genetic repertoire to further elucidate its role in pathogenesis. Tannerella sp. BU063, a putative periodontal health-associated sister taxon and the closest known relative of T. forsythia, is available for comparative analyses. In the past, strain confusion involving the T. forsythia reference type strain ATCC 43037 led to discrepancies between results obtained from in silico analyses and wet-lab experimentation.
Keywords: Codon usage bias; Comparative genomics; Computational analysis; Genome assembly; Glycosylation gene cluster; Pan-genome; Pathogenicity island; Periodontitis; Tannerella; Virulence
Year: 2020 PMID: 32046654 PMCID: PMC7014623 DOI: 10.1186/s12864-020-6535-y
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Tannerella genome assemblies analysed including the ATCC 43037 assembly generated in this work
| Strain name | GenBank Accession | Genome size [bp] | # of sequences | % GC | RefSeq Annotation Date |
|---|---|---|---|---|---|
| ATCC 43037 | VFJI00000000 (this work) | 3,296,274 | 87 | 47.1 | – |
| ATCC 43037 | JUET00000000.1 | 3,281,748 | 141 | 47.1 | 06/12/2017 |
| FDC 92A2 | NC_016610.1 | 3,405,521 | 1 | 47.0 | 10/21/2017 |
| 3313 | NZ_AP013044.1 | 3,350,939 | 1 | 47.1 | 04/04/2017 |
| KS16 | NZ_AP013045.1 | 3,393,002 | 1 | 47.2 | 04/04/2017 |
| UB4 | FMMN00000000.1 | 3,233,032 | 71 | 47.2 | 06/12/2017 |
| UB22 | FMML00000000.1 | 3,272,368 | 98 | 47.1 | 06/12/2017 |
| UB20 | FMMM00000000.1 | 3,252,894 | 93 | 47.1 | 06/12/2017 |
| 9610 | MEHX00000000.1 | 3,201,941 | 79 | 47.3 | 06/12/2017 |
| W11663 | NSLJ00000000.1 | 3,300,179 | 140 | 47.1 | 10/14/2017 |
| W10960 | NSLK00000000.1 | 3,312,685 | 98 | 47.2 | 10/14/2017 |
| n/a | CP017038.1 | 2,973,531 | 1 | 56.5 | 04/13/2017 |
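The columns of the table above (genome size, number of sequences, % GC) can be recomputed from any assembly FASTA file. The following is a minimal sketch, not the authors' pipeline: the function names `parse_fasta` and `assembly_stats` are hypothetical, and it assumes that ambiguous bases (e.g. `N`) count towards genome size but not towards the GC percentage.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {sequence_id: sequence}."""
    parts: Dict[str, list] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the identifier.
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line.upper())
    return {n: "".join(chunks) for n, chunks in parts.items()}


def assembly_stats(records: Dict[str, str]) -> Dict[str, float]:
    """Return total length, sequence count, and % GC of an assembly."""
    total = sum(len(seq) for seq in records.values())
    gc = sum(seq.count("G") + seq.count("C") for seq in records.values())
    # Assumption: only unambiguous bases enter the GC denominator.
    acgt = sum(seq.count(base) for seq in records.values() for base in "ACGT")
    return {
        "genome_size_bp": total,
        "n_sequences": len(records),
        "gc_percent": round(100.0 * gc / acgt, 1) if acgt else 0.0,
    }
```

For a real assembly, the lines would come from an open file handle, e.g. `assembly_stats(parse_fasta(open(path)))`, where `path` points to a downloaded FASTA file.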
Fig. 1 Comparison of our assembled scaffolds to a previously published T. forsythia sequence. The sequence KP715369 (black bar in the middle) aligns partially to our scaffold 1 (bottom) and partially to scaffold 2 (top). The sections named A to F represent the scaffolded contigs; gaps between them are indicated by vertical bars. Coverage tracks are shown for two different mapping strategies (allowing zero mismatches versus allowing only uniquely mapping reads); the differences between the two tracks highlight repetitive content found especially at the contig ends. Numbers of linking read pairs between contigs are indicated (based on the uniquely-mapping strategy) along with the numbers of unique mapping positions (read 1 / read 2). There were only 20 read pairs that supported the linkage of contig C to contig E as suggested by the alignment of KP715369. All adjacent contigs as scaffolded by us were supported by more than 5000 pairs for each link
Fig. 2 Multiple whole genome alignment of eight T. forsythia strains. Each coloured block represents a genomic region that aligned to a region in at least one other genome, plotted in the same colour, to which it was predicted to be homologous based on sequence similarity. Blocks above the centre line indicate forward orientation; blocks below the line indicate reverse orientation relative to strain 92A2. A histogram within each block shows the average similarity of a region to its counterparts in the other genomes. Red vertical lines indicate contig boundaries. Strain ATCC 43037 displayed two translocations compared to strain 92A2 with lengths of approximately 500 kbp (blue and yellow blocks at the right end of 92A2 and in the centre of ATCC) and 30 kbp (pink block at approx. 1.25 Mbp in 92A2 and at approx. 2.7 Mbp in ATCC), respectively. Previously described large-scale inversions in strain KS16 could be confirmed (reverted blocks in the left half of the alignment)
Alignable fraction of nine T. forsythia strains and Tannerella sp. BU063 in whole-genome alignments against T. forsythia strain FDC 92A2 as reference sequence. Results are based on blastn output. The scaffolded ATCC 43037 assembly generated in this work was used
| Strain name | ≥ 99% seq identity | ≥ 95% seq identity | ≥ 80% seq identity | ≥ 70% seq identity | ≥ 50% seq identity |
|---|---|---|---|---|---|
| ATCC 43037 | 40.58 | 88.52 | 91.46 | 92.15 | 92.59 |
| 3313 | 44.27 | 87.68 | 92.00 | 92.56 | 92.76 |
| KS16 | 43.43 | 90.63 | 92.72 | 93.24 | 93.55 |
| UB4 | 42.61 | 88.47 | 92.59 | 93.14 | 93.29 |
| UB22 | 51.94 | 90.99 | 92.02 | 93.01 | 93.36 |
| UB20 | 49.89 | 90.54 | 93.30 | 93.68 | 93.89 |
| 9610 | 42.58 | 87.86 | 90.35 | 90.87 | 91.21 |
| W11663 | 47.50 | 90.30 | 92.50 | 92.94 | 93.06 |
| W10960 | 44.83 | 88.75 | 91.70 | 92.55 | 92.92 |
| average | 45.29 | 89.30 | 92.07 | 92.68 | 92.96 |

| Strain name | ≥ 95% seq identity | ≥ 80% seq identity | ≥ 70% seq identity | ≥ 50% seq identity | ≥ 30% seq identity |
|---|---|---|---|---|---|
| n/a | 0.00 | 0.97 | 24.38 | 38.25 | 38.37 |
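The alignable fractions above are stated to be derived from blastn output. One plausible way to obtain such a percentage is sketched below, under assumptions that are not taken from the source: hits are given as `(percent_identity, ref_start, ref_end)` tuples on a single reference sequence, overlapping hits are merged so that each reference base is counted once, and the fraction is expressed relative to the reference length. The function names are hypothetical.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or directly adjacent 1-based inclusive intervals."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def alignable_fraction(
    hits: Iterable[Tuple[float, int, int]],
    ref_length: int,
    min_identity: float,
) -> float:
    """Percentage of the reference covered by hits with identity >= threshold."""
    # Reverse-strand hits may be reported with start > end, so normalise them.
    kept = [(min(s, e), max(s, e)) for pid, s, e in hits if pid >= min_identity]
    covered = sum(end - start + 1 for start, end in merge_intervals(kept))
    return 100.0 * covered / ref_length


# Toy hits on a hypothetical 1000 bp reference (not data from the study).
toy_hits = [(99.5, 1, 100), (96.0, 51, 200), (82.0, 400, 301)]
for threshold in (99, 95, 80):
    print(threshold, alignable_fraction(toy_hits, 1000, threshold))
```

Because lower thresholds admit a superset of the hits, the fraction can only grow from left to right across a row, which matches the pattern in the table.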
Fig. 3 Phylogenetic tree showing the topology (a) and the distances (b) as computed by MASH applied on the whole-genome assemblies of T. forsythia strains and Tannerella sp. BU063, including Bacteroides vulgatus ATCC 8482 as outgroup
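The distances shown in such a tree come from the Mash distance, which converts a Jaccard index j of k-mer sets into an estimated mutation rate, D = -(1/k) ln(2j / (1 + j)). The sketch below illustrates only that conversion. It is a simplification, not a reimplementation of the tool: it uses exact k-mer sets instead of MinHash sketches and ignores reverse-complement strands, and the function names are hypothetical.

```python
import math
from typing import Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all k-mers in a sequence (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Exact Jaccard index of two k-mer sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mash_distance(j: float, k: int) -> float:
    """Mash distance for Jaccard index j and k-mer size k."""
    if j <= 0.0:
        return 1.0  # no shared k-mers: maximal distance
    if j >= 1.0:
        return 0.0  # identical k-mer sets
    return -math.log(2.0 * j / (1.0 + j)) / k
```

A pairwise matrix of such distances is what a distance-based tree-building method would take as input; the tree construction method itself is not specified in the caption.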
Fig. 4 Whole genome alignment between the six frame amino acid translations of both Tannerella sp. BU063 and the scaffolded and ordered ATCC 43037 assembly. Whereas the amino acid alignment reflects similarity with respect to gene content, the order of genes is not preserved
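A six-frame translation turns a nucleotide sequence into six amino acid sequences: three reading frames on the forward strand and three on the reverse complement. The following minimal sketch shows the idea using the standard codon assignments; it is an illustration only, and which tool the authors used for this step is not stated in the caption. Codons containing ambiguous bases are rendered as `X`, which is an assumption of this sketch.

```python
from itertools import product
from typing import Dict

BASES = "TCAG"
# Standard genetic code in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq: str) -> Dict[str, str]:
    """Return translations keyed by frame label (+1..+3, -1..-3)."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames: Dict[str, str] = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames
```

Aligning at the amino acid level tolerates synonymous nucleotide differences, which is why it can recover shared gene content between genomes that align poorly at the nucleotide level.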
Fig. 5 Blast Score Ratio (BSR) values plotted as heatmap for 45 suggested virulence genes in ten T. forsythia strains and the genome of putative health-associated Tannerella sp. BU063. Gene sequences were blasted against the complete genomic sequences of each genome. Tannerella sp. BU063 achieved considerable BSR values for several genes that were actually suggested as virulence factors in pathogenic T. forsythia strains. On the other hand, some of the pathogenic strains show reduced similarity to some predicted virulence factors
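A Blast Score Ratio normalises the bit score of a gene's best hit in a target genome by the bit score the gene obtains when aligned against itself, giving a value between 0 (no hit) and about 1 (identical match). The sketch below shows how a heatmap matrix of this kind could be assembled. The function names, the input dictionaries, and the treatment of missing hits as a score of 0 are assumptions of this example, not details given in the caption.

```python
from typing import Dict


def blast_score_ratio(target_bits: float, self_bits: float) -> float:
    """Ratio of the best-hit bit score in a target to the self-hit bit score."""
    if self_bits <= 0:
        raise ValueError("self-hit bit score must be positive")
    return target_bits / self_bits


def bsr_matrix(
    self_scores: Dict[str, float],
    hit_scores: Dict[str, Dict[str, float]],
) -> Dict[str, Dict[str, float]]:
    """Build {genome: {gene: BSR}}; genes without a hit get a BSR of 0.0."""
    return {
        genome: {
            gene: blast_score_ratio(hits.get(gene, 0.0), self_bits)
            for gene, self_bits in self_scores.items()
        }
        for genome, hits in hit_scores.items()
    }


# Hypothetical bit scores for two genes in two genomes (not study data).
self_scores = {"geneA": 200.0, "geneB": 500.0}
hit_scores = {
    "genome_1": {"geneA": 200.0, "geneB": 450.0},
    "genome_2": {"geneA": 50.0},
}
matrix = bsr_matrix(self_scores, hit_scores)
```

In this toy input, `geneB` has no hit in `genome_2`, so its ratio is 0.0, which would appear as the lowest colour in a heatmap.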
Fig. 6 Predicted core- and pan-genome sizes for T. forsythia based on ten genome assemblies using a sampling approach that iteratively adds genomes to the analysis. The species’ core genome has a saturated size of 1900 genes, i.e. genes that are found to be conserved throughout the ten analysed strains are likely to be conserved throughout the whole species (left panel). In contrast, novel genes are expected to be found in newly sequenced T. forsythia genomes as indicated by the pan-genome curve that has not yet reached a saturation plateau (right panel)
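The sampling approach described in this caption can be illustrated as follows: genomes are added one at a time in random order, and after each addition the core genome (genes shared by all genomes so far) and the pan-genome (genes seen in any genome so far) are counted. Averaging over many random orders gives the two curves. This is a minimal sketch under the assumption that each genome is already represented as a set of gene-cluster identifiers; the function name, the number of permutations, and the seed are hypothetical and not taken from the source.

```python
import random
from statistics import mean
from typing import Dict, List, Set, Tuple


def sample_core_pan(
    genomes: Dict[str, Set[str]],
    n_permutations: int = 100,
    seed: int = 0,
) -> Tuple[List[float], List[float]]:
    """Mean core and pan-genome sizes after adding 1..N genomes."""
    rng = random.Random(seed)
    names = list(genomes)
    core_sizes: List[List[int]] = [[] for _ in names]
    pan_sizes: List[List[int]] = [[] for _ in names]
    for _ in range(n_permutations):
        order = rng.sample(names, len(names))  # one random addition order
        core = set(genomes[order[0]])
        pan = set(genomes[order[0]])
        for i, name in enumerate(order):
            core &= genomes[name]  # genes present in every genome so far
            pan |= genomes[name]   # genes present in any genome so far
            core_sizes[i].append(len(core))
            pan_sizes[i].append(len(pan))
    return [mean(c) for c in core_sizes], [mean(p) for p in pan_sizes]


# Toy gene-cluster sets for three hypothetical genomes.
toy = {"g1": {"a", "b", "c"}, "g2": {"a", "b", "d"}, "g3": {"a", "e"}}
core_curve, pan_curve = sample_core_pan(toy, n_permutations=50)
```

A core curve that flattens indicates saturation, whereas a pan curve that keeps rising with each added genome indicates that further genomes are expected to contribute novel genes.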
Positions of putative glycosylation (PGL) loci in T. forsythia strain FDC 92A2
| Locus tag (RefSeq, GenBank) | Position | Strand | Protein ID, Description | Conserved Domains | dbCAN |
|---|---|---|---|---|---|
| PGL_1 | |||||
| BFO_RS00485, BFO_0104 | 119,936–121,180 | + | WP_014223582.1, glycosyltransferase group 1 family protein | cl28208 RfaB superfamily | GT4 |
| BFO_RS00535, BFO_0114 | 132,395–133,165 | + | WP_014223590.1, hypothetical protein | cl11394 Glyco_tranf_GTA_type superfamily | – |
| BFO_RS00545, BFO_0116 | 133,715–134,380 | + | WP_041590509.1, hypothetical protein | cl11394 Glyco_tranf_GTA_type superfamily | – |
| BFO_RS00550, BFO_0117 | 134,417–135,097 | + | WP_014223593.1, glycosyl transferase | cl01298 Glyco_transf_25 superfamily | GT25 |
| PGL_2 | |||||
| BFO_RS02100, BFO_0467 | 500,734–501,384 | – | WP_052299218.1, hypothetical protein | cl02988 Glyco_transf_10 superfamily | GT10 |
| BFO_RS02105, BFO_0468 | 502,333–504,648 | – | WP_014223924.1, penicillin-binding protein 1C | TIGR02073 PBP_1c | GT51 |
| BFO_RS02135, BFO_0475 | 513,630–514,787 | + | WP_014223931.1, mannosyltransferase | cd03809 GT1_mtfB_like | GT4 |
| PGL_3 | |||||
| BFO_RS02420, BFO_0544 | 586,421–587,608 | + | WP_014223997.1, glycosyl transferase family 1 | cl10013 Glycosyltransferase_GTB_type | GT4 |
| BFO_RS02430, BFO_0547 | 588,656–589,774 | + | WP_014223999.1, glycosyl transferase family 1 | cl10013 Glycosyltransferase_GTB_type | GT4 |
| BFO_RS02435, BFO_0564 | 589,763–590,959 | – | WP_014224000.1, hypothetical protein | cl10013 Glycosyltransferase_GTB_type | – |
| PGL_4 | |||||
| BFO_RS07405, BFO_1699 | 1,808,692–1,809,366 | – | WP_014225043.1, glycosyl transferase | cl01298 Glyco_transf_25 superfamily | – |
| BFO_RS07410, BFO_1700 | 1,809,356–1,810,438 | – | WP_014225044.1, glycosyl transferase | cl10013 Glycosyltransferase_GTB_type | GT4 |
| BFO_RS14425, BFO_1705 | 1,812,769–1,814,883 | – | WP_052299248.1, hypothetical protein | cd03801 GT1_YqgM_like | GT4 |
| PGL_5 | |||||
| BFO_RS08625, BFO_1977 | 2,106,020–2,107,996 | – | WP_014225302.1, hypothetical protein | cl28208 RfaB superfamily | – |
| BFO_RS08630, BFO_1978 | 2,108,002–2,108,745 | – | WP_014225303.1, glycosyl transferase family 2 | cd04179 DPM_DPG-synthase_like | GT2 |
| BFO_RS14090, BFO_1987 | 2,122,302–2,123,087 | + | WP_052299260.1, hypothetical protein | cd04186 GT_2_like_c | GT2 |
| BFO_RS08670, BFO_ | 2,123,084–2,124,346 | + | WP_041590821.1, hypothetical protein | cl10013 Glycosyltransferase_GTB_type | GT4 |
| BFO_RS08675, BFO_1990 | 2,124,694–2,126,031 | + | WP_014225312.1, glycosyltransferase group 1 family protein | cl10013 Glycosyltransferase_GTB_type | GT4 |
| BFO_RS08680, BFO_1989 | 2,126,026–2,127,159 | – | WP_014225313.1, glycosyl transferase | cl10013 Glycosyltransferase_GTB_type | GT4 |
| PGL_6 | |||||
| BFO_RS10550, BFO_2565 | 2,598,381–2,599,619 | – | WP_014225708.1, hypothetical protein | cl10013 Glycosyltransferase_GTB_type | GT4 |
| BFO_RS10555, BFO_2566 | 2,599,616–2,600,713 | – | WP_014225709.1, UDP-N-acetylglucosamine 2-epimerase (non-hydrolyzing) | cd03786 GT1_UDP-GlcNAc_2-Epimerase | – |
| BFO_RS10600, BFO_2575 | 2,607,474–2,608,256 | + | WP_014225718.1, glycosyl transferase | cd04179 DPM_DPG-synthase_like | GT2 |
Fig. 7 Analysis of codon usage for ATCC 43037 (left panel) and BU063 (right panel). The continuous curves indicate the NC values to be expected for a given GC3s content in the absence of other factors shaping codon usage. Every dot represents a protein-coding gene; dots not positioned near the curve therefore represent genes that display a considerable codon usage bias. GC3s: G + C content at synonymous positions, NC: effective number of codons used within the sequence of a gene
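Expected-NC curves in plots of this type are conventionally drawn with Wright's (1990) approximation, NC = 2 + s + 29 / (s² + (1 − s)²), where s is the GC3s fraction. The caption does not state which formula was used, so the sketch below assumes this conventional one; the function names are hypothetical.

```python
def expected_nc(gc3s: float) -> float:
    """Expected effective number of codons for a GC3s fraction in [0, 1]."""
    if not 0.0 <= gc3s <= 1.0:
        raise ValueError("gc3s must be a fraction between 0 and 1")
    return 2.0 + gc3s + 29.0 / (gc3s ** 2 + (1.0 - gc3s) ** 2)


def nc_deviation(observed_nc: float, gc3s: float) -> float:
    """How far a gene's observed NC lies below the expected curve."""
    return expected_nc(gc3s) - observed_nc
```

A gene with unbiased base composition at synonymous sites (GC3s = 0.5) has an expected NC of 60.5, close to the maximum of 61 sense codons. Genes with a large positive `nc_deviation` correspond to the dots lying well below the curve.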