José Lourenço, Eleanor R Watkins, Uri Obolski, Samuel J Peacock, Callum Morris, Martin C J Maiden, Sunetra Gupta.
Abstract
Populations of Streptococcus pneumoniae (SP) are typically structured into groups of closely related organisms or lineages, but it is not clear whether they are maintained by selection or neutral processes. Here, we attempt to address this question by applying a machine learning technique to SP whole genomes. Our results indicate that lineages evolved through immune selection on the groEL chaperone protein. The groEL protein is part of the groESL operon and enables a large range of proteins to fold correctly within the physical environment of the nasopharynx, thereby explaining why lineage structure is so stable within SP despite high levels of genetic transfer. SP is also antigenically diverse, exhibiting a variety of distinct capsular serotypes. Associations exist between lineage and capsular serotype but these can be easily perturbed, such as by vaccination. Overall, our analyses indicate that the evolution of SP can be conceptualized as the rearrangement of modular functional units occurring on several different timescales under different pressures: some patterns have locked in early (such as the epistatic interactions between groESL and a constellation of other genes) and preserve the differentiation of lineages, while others (such as the associations between capsular serotype and lineage) remain in continuous flux.
Year: 2017 PMID: 28831154 PMCID: PMC5567354 DOI: 10.1038/s41598-017-08990-z
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Random forest classification. (A) Random forest analysis (RFA) for serotype classification. (A, top) Density function of RFA scores obtained for each gene in the dataset. The 95% boundaries are marked by the dashed lines. Small bars highlight the RFA scores of genes within particular groups (yellow for MLST genes, blue for capsular locus genes). (A, bottom) Genomic position for each gene in the dataset against their RFA score (normalised to [0,1]). The circular genome is presented in a linear form on the y-axis, with the first gene being dnaA and the last gene parB. MLST genes are marked in yellow diamonds (spi, xpt, glkA, aroE, ddlA, tkt) and genes within the capsular locus with blue diamonds (pseudogenes tagged with ‘x’). (B) RFA for sequence cluster classification; figure details the same as in A. Blue shaded areas in both A and B subplots mark the capsular locus (genes within aliA and dexB).
Figure 2. Random forest classification excluding data with gene mismatches. (A) Random forest analysis (RFA) for serotype classification when excluding genes for which the allelic notation process had <99% positive matches with the reference genome. (A, top) Density function of RFA scores obtained for each gene in the dataset. The 95% boundaries are marked by the dashed lines. Small bars highlight the RFA scores of genes within particular groups (red for serotype, green for SC genes). (A, bottom) Genomic position for each gene in the dataset against their RFA score (normalised to [0,1]). The circular genome is presented in a linear form on the y-axis, with the first gene being dnaA and the last gene parB. Red and green diamonds mark the top 2.5% ranking genes for serotype and Sequence Cluster classification, respectively. (B) RFA for Sequence Cluster classification; figure details the same as in A. Blue shaded areas in both A and B mark the capsular locus (genes within aliA and dexB). Green shaded areas in both A and B mark the genes contiguous and including the groESL operon (Table 1).
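The captions above mention scores normalised to [0,1] and a top 2.5% ranking of genes. As a minimal sketch of those two steps only, and not the authors' pipeline, the following assumes min-max rescaling and a simple rank cut-off; the gene names, raw scores, and function names are hypothetical placeholders.

```python
def normalise_scores(scores):
    """Min-max rescale raw per-gene scores to the interval [0, 1] (assumed scheme)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores identical: no spread to rescale, so map everything to 0.
        return {gene: 0.0 for gene in scores}
    return {gene: (s - lo) / (hi - lo) for gene, s in scores.items()}


def top_fraction(scores, fraction=0.025):
    """Return the genes ranked in the top `fraction` of scores, best first."""
    n_top = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_top]


# Hypothetical raw scores for 200 placeholder genes (not data from the study).
raw = {f"gene{i:03d}": float(i) for i in range(200)}
norm = normalise_scores(raw)
top = top_fraction(norm)  # 2.5% of 200 genes = 5 genes
```

With real importance scores the cut-off could equally be expressed as a percentile of the score distribution; the rank-based version is used here only because it is easy to verify by hand.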
Top genes for Serotype prediction.
| SPN23F | Group | Name | Type/Function |
|---|---|---|---|
| 00400 | | | Hypothetical protein |
| 02300 | a | pitA | Ferric iron ABC transporter, permease protein |
| 02320 | a | pitB | Ferric iron ABC transporter, ATP-binding protein |
| 02540 | b | glmS | Glucosamine-fructose-6-phosphate aminotransferase |
| 02550 | b | | Luciferase-like monooxygenase/Oxidoreductase |
| 02560 | b | spuA | Surface-anchored pullulanase |
| 02600 | | polC | DNA polymerase III PolC-type |
| 02870 | c | | Maltodextrin glucosidase |
| 02880 | c | basA | Glutathione peroxidase family protein |
| 03060 * | d | mraW | 16S rRNA cytosine-methyltransferase |
| 03070 * | d | ftsL | Cell division protein |
| 03080 * | d | pbpX | Penicillin binding protein/cell division protein |
| 03090 * | d | mraY | Phospho-N-acetylmuramoyl-pentapeptide-transferase |
| 03110 * | d | clpL | ATP-dependent Clp proteinase |
| 03130 * | d | luxS | S-ribosylhomocysteinase lyase |
| 03140 * | d | | ATP-dependent Zinc protease |
| 03150 * | d | dexB | Glucan 1,6-alpha-glucosidase |
| 03390 * | e | aliA | Oligopeptide ABC transporter |
| 03410 * | e | pbp1A | Transpeptidase/Penicillin-binding protein |
| 03420 * | e | recU | Holliday junction resolvase |
| 03430 * | e | | Hypothetical protein |
| 03450 * | e | | 23S rRNA/guanine-methyltransferase |
| 03470 * | e | gnd | 6-phosphogluconate dehydrogenase |
| 03480 * | e | ritR | Response regulator |
| 03540 | f | mvaD | Mevalonate diphosphate decarboxylase |
| 03550 | f | mvaK2 | Mevalonate kinase |
| 03560 | f | fni | Isopentenyl-diphosphate delta-isomerase |
| 03570 | f | vraT | Cell wall-active antibiotics response protein |
| 03580 | f | vraS | Sensor histidine kinase |
| 03840 | g | glyP | Sodium glycine symporter |
| 03860 | g | shetA | Exfoliative toxin |
| 03870 | g | serS | Seryl-tRNA synthetase |
| 03890 | g | lysC | Aspartokinase |
| 03960 | | fabG | 3-oxoacyl-acyl-carrier protein reductase |
| 04740 | h | ecsA | ABC transporter ATP-binding protein |
| 04790 | h | blpH | Histidine kinase of the competence regulon ComD |
| 15900 | | lytC | Glucan-binding domain/Lysozyme M1 |
| 18330 | | trpF | Phosphoribosylanthranilate isomerase |
| 20980 | | patB | Multidrug resistance ABC transporter |
Genes marked with * lie within 10 genes upstream or downstream of the capsular locus. Letters a to h denote groups of contiguous genes (minimum proximity of 2 genes).
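The lettered groups in the table can be illustrated with a small clustering routine. This is a sketch under one reading of the footnote, namely that two top-ranked genes share a group when they sit at most 2 positions apart in the ordered list of analysed genes; the `max_gap` parameter, the function name, and the positions below are hypothetical and are not taken from the study.

```python
def contiguous_groups(positions, max_gap=2):
    """Cluster gene positions so that neighbours within a cluster are at most
    `max_gap` positions apart; singletons are dropped (they get no letter)."""
    groups, current = [], []
    for pos in sorted(positions):
        # A jump larger than max_gap closes the current cluster.
        if current and pos - current[-1] > max_gap:
            groups.append(current)
            current = []
        current.append(pos)
    if current:
        groups.append(current)
    return [g for g in groups if len(g) > 1]


# Hypothetical ordinal positions of top-ranked genes along the genome.
hits = [4, 23, 25, 54, 55, 56, 60]
groups = contiguous_groups(hits)

# Assign letters in genomic order, as in the table.
labels = dict(zip("abcdefgh", groups))
```

Here positions 23 and 25 form group a and positions 54 to 56 form group b, while 4 and 60 remain unlabelled because no other hit lies within the assumed gap.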
Top genes for Sequence Cluster prediction.
| SPN23F | Group | Name | Type/Function |
|---|---|---|---|
| 00090 | | | Phosphoglycerate mutase |
| 00540 | | recO | DNA recombination and repair protein |
| 00660 | | vanZ | Teicoplanin resistance protein |
| 02370 | | | Transcriptional regulator |
| 03790 | | spi | Signal peptidase I |
| 04050 | | | Hypothetical protein |
| 04730 | | | Histidine triad nucleotide-binding protein |
| 06210 | | | ABC transporter, ATP-binding protein |
| 06880 | | sodA | Manganese superoxide dismutase |
| 07240 | | | Hypothetical protein |
| 07340 | | | Hydrolase/Haloacid dehalogenase-like family |
| 07930 | | iscU | Putative iron-sulfur cluster assembly scaffold protein |
| 08320 | | | Putative membrane protein |
| 09040 | | | O-methyltransferase family protein C1 |
| 09280 | | lmb | Laminin-binding protein |
| 09460 | | | N-acetyltransferase GNAT family protein |
| 10040 | | | Cytosolic protein containing multiple CBS domains |
| 10480 | | | Hypothetical protein |
| 10670 | | pdhB | Acetoin dehydrogenase E1 component |
| 11320 | | | Acetyltransferase GNAT family protein |
| 11630 | | licA | Choline kinase |
| 11660 | | carB | Membrane protein/O-antigen and teichoic acid |
| 13490 | | | Hypothetical protein |
| 14640 | | lta | Bacteriocin transport accessory protein |
| 15100 | | pclA | Putative NADPH-dependent FMN reductase |
| 16930 | | | Hypothetical protein |
| 17080 | | | Hypothetical protein |
| 18130 | | | Hypothetical protein |
| 19240 * | a | recX | Regulatory protein |
| 19250 * | a | | Cysteinyl-tRNA synthase related protein |
| 19300 * | b | groEL | Heat shock protein 60 family chaperone |
| 19310 * | b | groES | Heat shock protein 60 family co-chaperone |
| 19330 * | b | | Short-chain dehydrogenase |
| 19340 * | b | ytpR | Phenylalanyl-tRNA synthetase domain protein |
| 19360 * | b | | Hypothetical protein |
| 19370 * | b | | Hypothetical protein |
| 19380 * | b | | Membrane protein |
| 19390 * | b | | Response regulator of LytR/AlgR family |
| 20880 | c | | Hydrolase, haloacid dehalogenase-like family |
| 20900 | c | thrC | Threonine synthase |
| 22500 | | mreD | Rod shape-determining protein |
Genes marked with * lie within 10 genes upstream or downstream of the groESL operon. Letters a to c denote groups of contiguous genes (minimum proximity of 2 genes).
Figure 3. Population structure and vaccination. Conceptual representation of phylogenetic relationships between serotypes and Sequence Clusters (SC), where the former are defined by variation at the cps locus (arbitrarily designated X, W, Y, Z, M, and L, respectively coloured purple, yellow, green, orange, cyan and pink) and the latter are linked to variation in the groESL operon (arbitrarily designated A and B and respectively coloured red and blue). Circles symbolize genotypes, with size relative to their prevalence at the population level. Inner genome arcs represent epistatic links: those with the groESL operon extend across the genome, while links with the cps locus are more local. Within our framework and according to observed patterns[8], serotypes will be dominantly associated with an SC. Current vaccine strategies (white inner area) that target a selection of capsular serotypes can lead to the expansion of non-vaccine serotypes (VISR[73, 74]), potentially within the same sequence cluster (VIMS[17]). Vaccine strategies based on groESL variants (grey area) would target entire lineages instead, including all uncommon serotypes within and thereby preventing their expansion.