Ikuo Uchiyama, Jacob Albritton, Masaki Fukuyo, Kenji K Kojima, Koji Yahara, Ichizo Kobayashi.
Abstract
Genomes of a given bacterial species can show great variation in gene content, and thus systematic analysis of the entire gene repertoire, termed the pan-genome, is important for understanding bacterial intra-species diversity, population genetics, and evolution. Here, we analyzed the pan-genome from 30 completely sequenced strains of the human gastric pathogen Helicobacter pylori belonging to various phylogeographic groups, focusing on 991 accessory (not fully conserved) orthologous groups (OGs). We developed a method to evaluate the mobility of genes within a genome, using the gene order in the syntenically conserved regions as a reference, and classified the 991 accessory OGs into five classes: Core, Stable, Intermediate, Mobile, and Unique. Phylogenetic networks based on the gene content of the Core and Stable classes are highly congruent with the network created from the concatenated alignment of fully conserved core genes, in contrast to those of the Intermediate and Mobile classes, which show quite different topologies. By clustering the accessory OGs on the basis of phylogenetic pattern similarity and chromosomal proximity, we identified 60 co-occurring gene clusters (CGCs). In addition to known genomic islands, including the cag pathogenicity island, bacteriophages, and integrating conjugative elements, we identified some novel ones. One island encodes a TerY-phosphorylation triad, which includes the eukaryote-type protein kinase/phosphatase gene pair, and components of a type VII secretion system. Another contains a reverse-transcriptase homolog, which may be involved in defense against phage infection through altruistic suicide. Many of the CGCs contained restriction-modification (RM) genes. Different RM systems sometimes occupied the same (orthologous) locus in different strains. We anticipate that our method will facilitate pan-genome studies in general and help identify novel genomic islands in various bacterial species.
Year: 2016 PMID: 27504980 PMCID: PMC4978471 DOI: 10.1371/journal.pone.0159419
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 5. Co-occurring gene clusters (CGCs).
(A) The 60 CGCs ordered according to cluster size (the number of OGs included). An occurrence pattern represents the presence or absence of a CGC in each strain: a large box indicates that the strain contains all OGs in the CGC, and a small box indicates that the strain contains only some of the OGs. In the occurrence pattern, strains are ordered in the same way as in S1 Table, and colors are assigned according to the phylogeographical groups in the same way as in Fig 4. (B-F) The five largest CGCs displayed on the RECOG system. (B) CGC-1, corresponding to the cag pathogenicity island; (C) CGC-2, corresponding to a part of bacteriophage 1961P; (D) CGC-3, containing protein kinase C and protein phosphatase C2 homologs; (E) CGC-4, corresponding to a part of an ICE containing a type IV secretion system; (F) CGC-5, corresponding to a part of bacteriophage 1961P. The left part shows a hierarchical clustering tree based on occurrence pattern similarity. The central part shows occurrence patterns, where the order of strains is the same as in (A) and the colors are assigned according to the neighborhood clustering, i.e., cells filled with the same color in each column contain genes that are located close together on the chromosome (here, an 8000 bp window was used as the neighborhood criterion). Enlarged versions of panels B-F are shown in S1 Fig.
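The 8000 bp neighborhood criterion used for the coloring in panels B-F can be illustrated with a minimal sketch. It assumes that distance is measured between gene start coordinates on a linear chromosome and that neighborhoods are built by chaining adjacent genes. The function name `neighborhoods` and the example coordinates are hypothetical, and the procedure actually used by RECOG may differ.

```python
from typing import Dict, List

def neighborhoods(positions: Dict[str, int], window: int = 8000) -> List[List[str]]:
    """Group genes of one strain into chromosomal neighborhoods.

    positions: gene name -> start coordinate in bp (assumed input format).
    window: maximum distance between adjacent genes in one neighborhood.
    Assumption: a linear chromosome; circularity is ignored in this sketch.
    """
    ordered = sorted(positions.items(), key=lambda item: item[1])
    groups: List[List[str]] = []
    previous = None
    for gene, start in ordered:
        if previous is not None and start - previous <= window:
            groups[-1].append(gene)   # close enough: extend current neighborhood
        else:
            groups.append([gene])     # too far: start a new neighborhood
        previous = start
    return groups

# Hypothetical coordinates for four genes of one strain.
example = {"geneA": 1000, "geneB": 5000, "geneC": 12000, "geneD": 40000}
print(neighborhoods(example))
```

In this toy input the first three genes are chained into one neighborhood because each adjacent pair lies within 8000 bp, while the fourth gene forms its own group.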
Fig 1. Definition of the mobility classes.
Genes in the (syntenic) core OGs include the universal core genes (boxes in dark blue) and the remaining syntenic core genes (boxes in pale blue). Each core OG is assigned a core index (the number above it) representing its order in the core alignment. Each of the non-core OGs (A, B, C, and D boxes) is assigned a pair of core indices representing the left- and right-neighboring core OGs. A set of genes located in the equivalent locus is enclosed in a dashed box. The mobility_extent of each OG is defined as the number of distinct loci where the OG can be located, which is one for OG-A, three for OG-B, and one for OG-C. Note that we ignored the genes in OG-C in genomes 4 and 5, in which the difference between the left- and right-neighboring core indices is too large (non-consecutive), which indicates that the gene is located around a break point of a large rearrangement (in these cases, inversion and transposition). OG-D appears in only one genome and thus is classified as Unique. The mobility class is defined on the basis of mobility_extent and N (see text), where N is the effective number of genes, calculated as (number of genes in the OG) − (number of genes with non-consecutive neighboring core indices).
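The counting described in this caption can be sketched in a few lines. This is only an illustration under stated assumptions: two genes are treated as sharing a locus when they have the same pair of neighboring core indices, and a pair counts as non-consecutive when the indices differ by more than `max_gap` (a hypothetical cutoff, set to 1 here). The function name and the index values are invented for the example and are not taken from the figure.

```python
from typing import Iterable, Tuple

def mobility_summary(neighbors: Iterable[Tuple[int, int]],
                     max_gap: int = 1) -> Tuple[int, int]:
    """Return (mobility_extent, N) for one OG.

    neighbors: one (left, right) core-index pair per gene in the OG.
    max_gap: assumed cutoff; pairs whose indices differ by more than this
             are treated as non-consecutive and ignored.
    """
    # Sort each pair so that an inverted orientation maps to the same locus.
    pairs = [tuple(sorted(pair)) for pair in neighbors]
    kept = [pair for pair in pairs if pair[1] - pair[0] <= max_gap]
    mobility_extent = len(set(kept))   # number of distinct loci
    n_effective = len(kept)            # genes in OG minus non-consecutive ones
    return mobility_extent, n_effective

# Hypothetical core-index pairs for three OGs across five genomes.
og_a = [(2, 3)] * 5
og_b = [(1, 2), (1, 2), (4, 5), (6, 7)]
og_c = [(3, 4), (3, 4), (3, 4), (3, 9), (1, 7)]
for name, og in [("OG-A", og_a), ("OG-B", og_b), ("OG-C", og_c)]:
    print(name, mobility_summary(og))
```

With these invented inputs the sketch gives a mobility_extent of one for OG-A, three for OG-B, and one for OG-C, the same values the caption states; the two OG-C genes with widely separated neighbors are dropped from N.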
Fig 2. Pan-genome and core genome among Helicobacter pylori.
(A) Histogram showing the distribution of the number of strains in each OG among the 30 strains. Sets of OGs corresponding to pan-genome, universal core, accessory, and unique OGs are indicated. (B) Sizes of the syntenic core, universal core, and pan-genome as functions of the number of strains. An ordered list of the 30 strains was randomly generated, and the sets of n strains (n = 2, 4, …, 30) generated from this list were subjected to core- and pan-genome analysis. The test was repeated 20 times, and the average core- and pan-genome sizes were plotted with error bars that represent standard deviations. The syntenic core between two genomes is not well defined and thus is not plotted. (C) The number of new OGs added to the pan-genome as a function of the number of strains. The number of new OGs in n strains (n = 4, 6, …, 30) was calculated as the difference between the pan-genome size in n strains and that in n − 2 strains.
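The resampling procedure in panels B and C can be sketched as follows. This is a minimal illustration, assuming each strain is represented as a set of OG identifiers and that the first n strains of each random ordering form the subset. Only the universal core (OGs present in every strain of the subset) is modeled; the syntenic core is not. The function names and the toy data are hypothetical.

```python
import random
from statistics import mean, stdev
from typing import Dict, List, Set, Tuple

def accumulation(strain_ogs: Dict[str, Set[int]], sizes: List[int],
                 repeats: int = 20, seed: int = 0
                 ) -> Dict[int, Tuple[float, float, float, float]]:
    """Return {n: (mean_pan, sd_pan, mean_core, sd_core)} over random orderings."""
    rng = random.Random(seed)
    strains = sorted(strain_ogs)
    pan = {n: [] for n in sizes}
    core = {n: [] for n in sizes}
    for _ in range(repeats):
        order = strains[:]
        rng.shuffle(order)                       # one random ordered list
        for n in sizes:
            subset = [strain_ogs[s] for s in order[:n]]
            pan[n].append(len(set.union(*subset)))          # OGs in any strain
            core[n].append(len(set.intersection(*subset)))  # OGs in all strains
    return {n: (mean(pan[n]), stdev(pan[n]), mean(core[n]), stdev(core[n]))
            for n in sizes}

def new_ogs(curve: Dict[int, Tuple[float, float, float, float]],
            n: int, step: int = 2) -> float:
    """Mean pan-genome size at n minus that at n - step (as in panel C)."""
    return curve[n][0] - curve[n - step][0]

# Toy data: four strains, OG identifiers are arbitrary integers.
toy = {"s1": {1, 2, 3, 4}, "s2": {1, 2, 3, 5}, "s3": {1, 2, 6}, "s4": {1, 2, 3, 7}}
curve = accumulation(toy, sizes=[2, 4])
print(curve[4], new_ogs(curve, 4))
```

For the 30-strain analysis described above, `sizes` would be `list(range(2, 31, 2))` and `repeats` would stay at 20.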
Fig 3. The number of OGs classified in each mobility class.
(A) Histogram showing the strain number distribution of each mobility class among non-unique accessory OGs. The histogram is equivalent to Fig 2(A) except that the rightmost bar representing the universal core OGs (num_strain = 30) and the leftmost bar representing the unique OGs (num_strain = 1) are omitted. (B) Frequencies of the mobility classes among the accessory OGs in each strain. The order of strains is the same as in S1 Table. Note that each strain also has the same number (1248) of universal core OGs, which are not shown in this graph.
Fig 4. Phylogenetic networks among 30 H. pylori strains.
(A) From the concatenated alignment of the universal core OGs. (B) From the gene content (presence vs. absence) of the entire accessory OGs. (C) From the gene content of Core class OGs. (D) From the gene content of Stable class OGs. (E) From the gene content of Intermediate class OGs. (F) From the gene content of Mobile class OGs. Strain names are assigned colors according to the phylogeographic groups as follows: brown, Africa2; purple, Africa1; dark blue, SJM180; light blue, Europe; green, Asia2; khaki, PeCan4; orange, Amerind; red, East Asia.
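A gene-content network such as those in panels B-F starts from pairwise distances between strains computed on presence/absence data. The sketch below uses the Jaccard distance purely as an illustrative choice; the caption does not state which distance or network method was used, and the function names and toy data are hypothetical.

```python
from itertools import combinations
from typing import Dict, Set, Tuple

def jaccard_distance(a: Set[int], b: Set[int]) -> float:
    """1 - |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def gene_content_distances(strain_ogs: Dict[str, Set[int]]
                           ) -> Dict[Tuple[str, str], float]:
    """Pairwise gene-content distances for all strain pairs."""
    return {(s, t): jaccard_distance(strain_ogs[s], strain_ogs[t])
            for s, t in combinations(sorted(strain_ogs), 2)}

# Toy accessory-OG content for three strains (OG ids are arbitrary).
toy = {"s1": {1, 2}, "s2": {2, 3}, "s3": {1, 2}}
for pair, dist in gene_content_distances(toy).items():
    print(pair, round(dist, 3))
```

Restricting each strain's set to the OGs of a single mobility class (for example, only Mobile class OGs) would give the class-specific inputs corresponding to panels C-F.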
Co-occurring gene clusters (CGCs).
| CGC ID | NCGC ID | Num OGs | Occurrence pattern (a) | Comments | Mobility (b) | RM (c) |
|---|---|---|---|---|---|---|
| 1 | | 23 | | cag pathogenicity island | core[ | |
| 2 | 2 | 19 | | Bacteriophage 1961P | intermediate[ | |
| 3 | 3 | 15 | | Amerind+Europe; TerY-P triad cluster incl. Ser/Thr protein kinase and protein phosphatase | intermediate[ | IV |
| 4 | 1 | 13 | | ICE; type IV secretion system tfs4 | mobile[ | |
| 5 | 2 | 11 | | Bacteriophage 1961P | mobile[ | |
| 6 | 1 | 9 | | ICE; type IV secretion system tfs3 | mobile[ | |
| 7 | 1 | 8 | | ICE; type IV secretion system (common in tfs3 and tfs4) | mobile[ | |
| 8 | 1 | 8 | | ICE; relaxase, protease, gyrase | mobile[ | |
| 9 | | 6 | | Amerind specific; incl. Exodeoxyribonuclease VII large subunit and HNH/ENDO VII nuclease | stable[ | |
| 10 | | 6 | | incl. reverse transcriptase and phage-associated protein | stable[ | |
| 11 | 3 | 6 | | Amerind x 3, Europe x 2; incl. AAA family ATPase | intermediate[ | |
| 12 | 1 | 6 | | ICE; type IV secretion system tfs3; VirB2, VirB3, VirB4 | mobile[ | |
| 13 | | 6 | | DNA exonuclease RecJ, conserved domain DUF262 | core[ | |
| 14 | 1 | 6 | | SAfrica7 specific; incl. type IV secretion system protein VirB11 | unique[ | |
| 15 | 1 | 5 | | ICE; relaxase VirD2, conjugal transfer protein TraG, VirD4 | mobile[ | |
| 16 | 6 | 5 | | Hypothetical (putative ATP-ase or ATP/GTP-binding protein) | core[ | |
| 17 | 4 | 5 | | N-acetylneuraminic acid synthetase, N-acylneuraminate cytidylyltransferase, sialyltransferase | stable[ | |
| 18 | | 5 | | Dam and other restriction endonuclease and methyltransferase | core[ | II |
| 19 | 1 | 5 | | ICE; DNA topoisomerase, Integrase/recombinase, toprim-like family protein | mobile[ | |
| 20 | 5 | 5 | | DnaK homolog, WxG100 family | core[ | |
| 21 | | 5 | | Puno135 specific; urease alpha/beta, phage resistance protein RloAB | unique[ | |
| 22 | | 4 | | Hypothetical | stable[ | |
| 23 | 1 | 4 | | ICE; VirB6 | mobile[ | |
| 24 | 3 | 4 | | AAA ATPase | intermediate[ | |
| 25 | | 4 | | Phage lysozyme | core[ | |
| 26 | 1 | 4 | | ICE; VirB4-2 | mobile[ | |
| 27 | 6 | 4 | | incl. CrfC homolog (dynamin-like GTPase family) | core[ | |
| 28 | 7 | 4 | | PeCan4 specific; methyltransferase, Type II restriction endonuclease | unique[ | II,III |
| 29 | 4 | 4 | | Thiamine biosynthesis, hsdR | core[ | I |
| 30 | | 4 | | P12 specific; Chorismate synthase, pyrophosphatase, menaquinone biosynthesis protein | unique[ | |
| 31 | | 3 | | Type II restriction endonuclease and methyltransferase | stable[ | II |
| 32 | | 3 | | Site-specific DNA methylase Dcm | stable[ | II |
| 33 | 1 | 3 | | Hypothetical (incl. weak homolog of tyrosine recombinase XerC) | intermediate[ | |
| 34 | | 3 | | Hypothetical | stable[ | |
| 35 | 5 | 3 | | Hypothetical (incl. weak homolog of chromosome segregation protein SMC) | core[ | |
| 36 | 7 | 3 | | Type III restriction endonuclease and methyltransferase | stable[ | III |
| 37 | | 3 | | Hypothetical (OMP) | stable[ | |
| 38 | | 3 | | Type II methyltransferase | core[ | II |
| 39 | | 3 | | Type II restriction endonuclease and methyltransferase | core[ | II |
| 40 | | 3 | | Hypothetical | core[ | |
| 41 | | 3 | | Hypothetical | intermediate[ | |
| 42 | 4 | 3 | | Type II restriction endonuclease and methyltransferase | intermediate[ | II |
| 43 | | 3 | | Predicted metal-dependent hydrolase | stable[ | |
| 44 | | 3 | | Type II restriction endonuclease and methyltransferase | core[ | II |
| 45 | 8 | 3 | | Type II restriction endonuclease and methyltransferase | intermediate[ | II |
| 46 | | 3 | | Type II restriction endonuclease and methyltransferase | stable[ | II |
| 47 | | 3 | | incl. alginate O-acetylation protein AlgI | stable[ | |
| 48 | | 3 | | Type III restriction endonuclease and methyltransferase | core[ | III |
| 49 | | 3 | | SAfrica7 specific; incl. Multidrug resistance protein | unique[ | |
| 50 | | 3 | | incl. P-loop containing NTPase | stable[ | |
| 51 | | 3 | | Cuz20 specific; incl. thiamine pyrophosphokinase | unique[ | |
| 52 | | 3 | | Molybdenum cofactor, Molybdopterin-guanine dinucleotide | core[ | |
| 53 | | 3 | | SJM specific; restriction endonuclease, methyltransferase, Addiction module antidote protein | unique[ | II |
| 54 | 1 | 3 | | 51 specific; Type IV secretion system, methyltransferase | unique[ | |
| 55 | 5 | 3 | | FtsK/SpoIIIE family, nuclease of HNH/ENDO VII superfamily | core[ | |
| 56 | 1 | 3 | | India7 specific; Chromosome partitioning protein, cag1 | unique[ | |
| 57 | 1 | 3 | | F32 specific; Type IV secretion system | unique[ | |
| 58 | | 3 | | Uncharacterized conserved proteins DUF262, DUF1524 | core[ | |
| 59 | | 3 | | Methyltransferase | stable[ | II |
| 60 | 8 | 3 | | G27 specific; Methyltransferase, glycosyltransferase | unique[ | III |
a Summary of the occurrence patterns of the OGs included in the CGC. Each letter indicates the presence or absence of the OGs of the CGC in one strain: an upper-case letter indicates that the strain contains all OGs; a lower-case letter indicates that it contains at least half of the OGs; a period indicates that it contains fewer than half of the OGs; and an underscore indicates that it does not contain any OG in the CGC. The strains are ordered in the same way as in S1 Table and Fig 3B. Each strain is denoted by a letter according to its phylogeographical group as follows: A, Africa2; B, Africa1; C, SJM180; D, Europe; E, Asia2; F, PeCan4; G, Amerind; H, EastAsia.
b Mobility classes of the OGs in each CGC. The number of OGs in each class is indicated in brackets.
c Types of RM genes included in each CGC, assigned according to REBASE.
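The encoding described in footnote a can be written as a small function, shown here as a hedged sketch. The group-to-letter mapping follows the footnote (with group names as in Fig 4); the function name, its signature, and the example strains are hypothetical.

```python
from typing import Iterable, Tuple

# Letter per phylogeographical group, as listed in footnote a.
GROUP_LETTER = {
    "Africa2": "A", "Africa1": "B", "SJM180": "C", "Europe": "D",
    "Asia2": "E", "PeCan4": "F", "Amerind": "G", "EastAsia": "H",
}

def occurrence_symbol(group: str, n_present: int, n_total: int) -> str:
    """Encode how many OGs of a CGC one strain contains.

    Upper case: all OGs; lower case: at least half; '.': fewer than half
    but at least one; '_': none.
    """
    if n_total <= 0 or not 0 <= n_present <= n_total:
        raise ValueError("need 0 <= n_present <= n_total and n_total > 0")
    if n_present == 0:
        return "_"
    letter = GROUP_LETTER[group]
    if n_present == n_total:
        return letter
    if 2 * n_present >= n_total:
        return letter.lower()
    return "."

def occurrence_pattern(strains: Iterable[Tuple[str, int]], n_total: int) -> str:
    """Concatenate the symbols of (group, n_present) pairs in strain order."""
    return "".join(occurrence_symbol(g, k, n_total) for g, k in strains)

# Hypothetical strains for a 15-OG cluster.
print(occurrence_pattern(
    [("Europe", 0), ("Europe", 3), ("Amerind", 8), ("Amerind", 15)], 15))
```

For the four invented strains this yields one symbol each: absent, fewer than half, at least half, and complete.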
Neighboring co-occurring gene clusters (NCGCs).
| NCGCID | Num OGs | Component CGCs | Comments |
|---|---|---|---|
| 1 | 80 | 4,6,7,8,12,14,15,19,23,26,33,54,56,57 | ICE/TnPZ |
| 2 | 30 | 2,5 | Bacteriophage |
| 3 | 25 | 3,11,24 | TerY-P triad cluster |
| 4 | 12 | 17,29,42 | Cell surface + RMs |
| 5 | 11 | 20,35,55 | WXG100 secretion system |
| 6 | 9 | 16,27 | Cluster of P-loop containing NTPases |
| 7 | 7 | 28,36 | Type II and III RMs |
| 8 | 6 | 45,60 | Type II and III RMs |
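As a quick consistency check, the size of each NCGC can be compared with the summed sizes of its component CGCs. The sketch below transcribes the Num OGs values from the two tables; it assumes that, in the CGC table, the number preceding each comment is the cluster size. The variable and function names are hypothetical.

```python
# Num OGs per CGC, transcribed from the CGC table (only CGCs assigned to an NCGC).
CGC_NUM_OGS = {
    2: 19, 3: 15, 4: 13, 5: 11, 6: 9, 7: 8, 8: 8, 11: 6, 12: 6, 14: 6,
    15: 5, 16: 5, 17: 5, 19: 5, 20: 5, 23: 4, 24: 4, 26: 4, 27: 4, 28: 4,
    29: 4, 33: 3, 35: 3, 36: 3, 42: 3, 45: 3, 54: 3, 55: 3, 56: 3, 57: 3,
    60: 3,
}

# NCGC ID -> (Num OGs, component CGC IDs), transcribed from the NCGC table.
NCGCS = {
    1: (80, [4, 6, 7, 8, 12, 14, 15, 19, 23, 26, 33, 54, 56, 57]),
    2: (30, [2, 5]),
    3: (25, [3, 11, 24]),
    4: (12, [17, 29, 42]),
    5: (11, [20, 35, 55]),
    6: (9, [16, 27]),
    7: (7, [28, 36]),
    8: (6, [45, 60]),
}

def size_mismatches() -> dict:
    """Return {ncgc_id: (reported, summed)} for NCGCs whose sizes disagree."""
    mismatches = {}
    for ncgc_id, (reported, components) in NCGCS.items():
        summed = sum(CGC_NUM_OGS[c] for c in components)
        if summed != reported:
            mismatches[ncgc_id] = (reported, summed)
    return mismatches

print(size_mismatches())
```

An empty result means every reported NCGC size equals the sum of its component CGC sizes.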
Fig 6. Chromosomal context of CGCs.
(A) CGC-2 and CGC-5 (bacteriophage 1961P). (B) CGC-3 (the cluster containing the TerY-P triad). Genes in the target CGCs are centered and colored cyan. In the flanking regions, genes in the syntenic core are colored according to their location in the reference genome (whose strain name, shown on the left side, is colored red). Thus, for a Mobile class CGC (such as CGC-5), the flanking core genes are assigned different colors in different strains.
Fig 7. Gene cluster containing TerY-P triad conserved among three H. pylori strains and two other bacteria.
Orthologous genes are drawn with the same colors. Gene numbers or names are presented in or near the arrows. Regions of sequence similarity between loci are indicated by red bands. The diagram was drawn using GenomeMatcher [47].
Fig 8. Integrating conjugative elements (ICEs).
(A) Locations of ICE genes displayed on the RECOG system. Colors are assigned by CGC groups (CGC-4, red; 6, light green; 7, yellow; 8, magenta; 12, cyan; 15, purple; 19, brown; 23, blue; 26, dark green; other mobile OGs, dark gray). The strains are ordered such that the first 10 strains correspond to type ICEHptfs3 and the rest correspond to ICEHptfs4. (B) A phylogenetic network created from the concatenated sequence of three OGs, virB9, virB11 and virD4. (C) A phylogenetic network created from the putative DNA methyltransferase (DNMT) conserved in all ICE subtypes. Strain names are assigned colors according to the phylogeographical groups as in Fig 4.
Fig 9. Example of different RM systems occupying the same orthologous position.
(A) Location of three RM systems designated A (blue; OG-81, 1424, 1544 and 1524 containing HP0050, HP0051, and HP0052 in strain 26695), B (green; OG-1668, 1667 and 1691 containing jhp0045 and jhp0046 in strain J99), and C (red; OG-1782, 1615, 1727 and 1785 containing HP0053 and HP0054 in strain 26695). See S2 Table for details of each OG. (B) A phylogenetic network created from the concatenated sequence of the RM-A. (C) A phylogenetic network created from the concatenated sequence of the RM-B. Strain names are assigned colors according to the phylogeographical groups as in Fig 4.