Magdalena Ksiezarek1,2, Filipa Grosso1,2, Teresa Gonçalves Ribeiro1,2, Luísa Peixe1,2. 1. Laboratory of Microbiology, UCIBIO - Applied Molecular Biosciences Unit, REQUIMTE, Department of Biological Sciences, Faculty of Pharmacy, University of Porto, 4050-313 Porto, Portugal. 2. Associate Laboratory i4HB - Institute for Health and Bioeconomy, Faculty of Pharmacy, University of Porto, 4050-313 Porto, Portugal.
Abstract
Entities:
Keywords:
Lactobacillaceae; animals; antimicrobial resistance; comparative genomics; human microbiome; pangenome
The Supplementary Material associated with this article is available in the data repository Figshare under the link: https://doi.org/10.6084/m9.figshare.20052164 [1]The recently proposed genus
(formerly included in
) comprises 23 species, some of which are well known for their importance in the food industry and probiotic potential. In this study, we applied comparative genomics approaches to investigate the differences between
species from different sources (human, animal and food), with particular focus on human isolates. The data presented here demonstrated that the genus
is highly diverse, comprising species that are present in multiple hosts and niches (e.g.
), those with more animal-specific adaptation (e.g.
) and putative novel
species not yet characterized. Although mostly antibiotic susceptible,
strains from food-producing animals are more prone to present acquired antibiotic-resistance genes. The common presence of bacteriocin-encoding regions, genes related to lactic acid production and CRISPR-Cas systems highlights that most
strains have defence mechanisms against other bacteria and/or foreign DNA. Additionally, to our knowledge, this is the first study to present comprehensive functional and metabolic predictions of human
, providing important insights for understanding the growth requirements and contribution of these bacteria to human nutrition and health.
Introduction
For several years, the genus
(family
) has been under investigation aiming for proper classification of this bacterial group characterized by high levels of phenotypic and genotypic diversity [2]. In early 2020, the taxonomy of the genus
was revisited based on a polyphasic approach (core-genome phylogeny, pairwise average amino acid identity, clade-specific signature genes, physiological criteria and ecology), resulting in the reclassification of the genus
into 25 genera, including
gen. nov. (formerly the
group) [3].To date,
comprises 23 validly published species (
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
), of which 7 were recently described [4, 5].members ferment a relatively broad spectrum of carbohydrates, yet several species do not ferment glucose [3]. Some
species, particularly
and
, are produced commercially for use as starter and probiotic cultures [6, 7]. All but four species (
,
,
and
) were isolated from intestinal or faecal samples or were shown experimentally to have adapted to the intestine of vertebrate animals [3-5].Advances in the healthy human microbiome reveal that some
species are often found in the gut, vagina and other human body niches, with
being the most understood [7]. Of note, the healthy human microbiome appears to be colonized to a higher extent with
than
species [8, 9]. Nonetheless, it seems that in the urogenital tract, species belonging to
co-exist with
species [3, 9]. Thus,
species seem to be also relevant for human health and deserve as extensive a characterization as their
relatives.The aim of this study was to evaluate the taxonomic diversity of the genus
and investigate the genomic diversity of
species from different sources (human, animal and food). We further performed comparative genomics of human isolates, including functional and metabolic characterization and niche-specific genomic content. To the best of our knowledge, this is the first study since the recent
reclassification focusing on the pangenome and characterizing the genomic diversity of the genus
.
Methods
Genomes database
A total of 338 genomes representing complete and draft genomes of 22
species, including the 2 recently published, i.e.
and
[4], were retrieved from the National Center for Biotechnology Information (NCBI) Assembly database on 6th September 2020 (Table S1, available with the online version of this article). Six genomes were excluded from further analysis based on worse assembly statistics since there was better representative assembly available for the same strains, narrowing down the final collection to 332
genomes (Table S1). Metadata related to downloaded assemblies was retrieved from the NCBI BioSample database. The genome sizes, guanine-cytosine content (G+C mol%), number of coding sequences (CDSs), tRNA and rRNA were extracted from Prokka v1.14.6 [10] and checkM v1.1.3 [11]. Draft genome completeness was accessed by checkM v1.1.3 [11] (Table S1).
Average nucleotide identity and phylogenomic analysis
Average nucleotide identity based in blast+ (ANIb) analysis on 332 assemblies was performed with pyani (v0.2) [12]. The ANIb results were interpreted according to widely established thresholds [13]. A percentage identity matrix was used to create a heatmap representing ANIb clusters by the R base heatmap package in R v4.0.3 [14] (Table S2). Phylogenomic analysis was performed using anvi’o v7.1 [15]. Single-copy core genes based on the Bacteria_71 collection from hidden Markov model profiles [16] were identified and 71 proteins were concatenated. FastTree version 2.1.11 [17] was used to recreate a maximum-likelihood phylogenomic tree with the Jones–Taylor–Thornton substitution model, local support of SH-like 1000 and “CAT” approximation (model that accounts for evolutionary rates across sites) with 20 rate categories. The resulting phylogenomic tree was edited in iTOL [18], while the ANIb figure was edited using Inkscape [19]. Genomes representing putative novel species were also submitted to the Type (Strain) Genome Server for genome-based taxonomy developed by the Leibniz Institute DSMZ (https://tygs.dsmz.de/).
Pangenome analysis
The anvi’o v7.1 [15] pipeline was used to profile hidden Markov models and predict genes using Prodigal [20]. Proteins were annotated with Clusters of Orthologous Groups (COGs; 2014 release) and Kyoto Encyclopedia of Genes and Genomes (KEGG) orthologues (KOs) and KEGG modules [21] using anvi’o v7.1 (anvi’o tutorials, https://merenlab.org). Pangenome analysis was performed by anvi’o v7.1, using muscle for sequence alignment [22], the Markov cluster algorithm [23] for clustering and NCBI blastp to assess amino acid sequence similarity. Partial gene calls were included in the analysis, and remaining parameters required to define gene clusters according to amino acid sequence homology were left as default [minbit heuristics of 0.5 and MCL (Markov Cluster algorithm) inflation of 2]. Gene collections were classified as follows: core genome included gene clusters present in 100 % of genomes; accessory genome included three collections – softcore (gene clusters present in more than 95%), dispensable (gene clusters present in at least two genomes and in less than 95 % of genome) and unique (singletons, genes present in just one unique genome).The pangenome figure was visualized by anvi’o [15] and edited in Inkscape [19]. COGs categories distribution among strains was visualized in the R v4.0.3 [14] ggplot2 package v3.3.5 [24]. In the case of multiple COGs categories predicted per gene cluster, the most frequent was used as a representative, and in the case of multiple COGs categories predicted per gene, the first one was used as the most significant hit. Pangenome accumulation curves for species that had representation of at least 50 genomes was performed with Roary v3.13.0 [25].Analysis of the accessory genome specific to isolation source was performed in R [14] v4.0.3 and the Venn diagram was created using VennDiagram R package v1.6.20 [26]. Only amino acid sequences with annotated COGs were included in this analysis. COGs specific to each origin group were identified with the VennDiagram package. The heatmap of accessory gene clusters for human strains was performed in R v4.0.3 [14] with phyloseq package v1.34.0 [27], hierarchical clustering was computed using Euclidean distance and the figure was edited in Inkscape [19].Functional enrichment analysis based on enrichment scores [28] was performed in anvi’o. In short, functional enrichment analysis identifies functions (and/or KEGG modules) that might be associated only with a specific collection of genomes (anvi’o tutorials, https://merenlab.org). This analysis provides statistical support including Rao test, uncorrected P value and q value (P value adjusted for false detection rate) [28]. Only results with a q value less than 0.05 were considered significant and interpreted in this study. Additionally, metabolic reconstruction, including summary of KOs and estimation of pathway completeness using 75 % as a threshold, was performed with anvi’o according to available tutorials (https://merenlab.org).
Other whole genome sequence analyses
Additional analyses included in silico bacteriocin-encoding gene prediction performed by bagel4 [29], detection of CDSs for Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR), CRISPR-associated genes (Cas) and spacers performed with CRISPRCasTyper v1.6.0 [30], and identification of prophage sequences performed by PhiSpy v4.2.19 [31]. Additionally, the detection of acquired genes mediating antimicrobial resistance (AMR) was performed by blastn from the NCBI blast2.8.1+ package [32] using a reference database extracted from ResFinder v4.1 [33].
Results
Major genomic features
A total of 332 genomes representing 22
species were analysed in this study (Tables 1 and S1). At the date of the genome retrieval, the NCBI Assembly database was lacking a representative genome for
. Most strains were isolated from animals (n=167 genomes, 50.3%), followed by humans (n=74, 22.3%) and food products (n=59, 17.8%).
strains were the most sequenced, representing 60.2 % of all available
genomes (Table 1).
Table 1.
List of
species, number of genomes and origin available in NCBI database
List of
species, number of genomes and origin available in NCBI databaseSpeciesNo. of genomesHost2Rodents2Rodents, lemur2Homo sapiens2Rodents, pheasant2Homo sapiens3Horse2Rodents79Homo sapiens, cheese, fermented food products3Sourdough3Homo sapiens1Gorilla3Pigeon13Homo sapiens, pig, boar, cattle3Homo sapiens1Sourdough3Homo sapiens, sourdough1Homo sapiens200Homo sapiens, rodents, poultry, pig, cattle, horse, sheep, goat, badger, sourdough, dairy products2Rodents1Unknown1Homo sapiens3Homo sapiensMost of the available
genomes were draft genomes (83.4 %; n=277/332 genomes; 15 species), and only 7 species (
,
,
,
,
,
,
) had at least one complete genome (16.6 %; n=55/332 genomes) (Table S1). Nevertheless, the mean completeness of draft genomes was 98.9 %. The size of the genomes within
range from 1.49 Mbp (
strain W1P28.032) to 2.65 Mbp (
strain KLR1002), with a mean genome size of 2.08 Mbp. The mean number of CDSs was 2024, with
W1P28.032 presenting the fewest CDSs (1388) and
MTCC 8711 the most (2854). There is also a striking range in the G+C content, which ranges from 37.89 mol% (
strain BG-MG3-A) to 53.45 mol% (
strain DSM 8475). The mean G+C content of genomes from human and food sources was similar (46.8 and 45.9 mol%, respectively), while for animals it was lower, 39.2 mol%. Moreover, the mean genome size and number of CDSs were similar between
genomes from different origins (2.14 Mbp and 2086 CDSs in animals, 2.04 Mbp and 1980 CDSs in human, and 1.99 Mbp and 1952 CDSs in food).
Taxonomy
Analysis of ANIb showed clear
species separation based on the widely accepted threshold of 95 % for species discrimination [13] (Fig. 1, Table S2). We observed near cut-off (94–95 %) variations in ANIb values inside the
clade supporting the existence of several subspecies as recently characterized by Li et al. [5]. Moreover, publicly available genomes of strains VA24_5, Lr4000, W1P44.042 and W1P28.032 deposited as
and strain UMB0683 deposited as
should be reclassified since the ANIb values between these strains and the type strain of
and
were all below 95 %. The ANIb values between VA24_5, Lr4000 and
Lr3000T were >95 % and between W1P44.042 and
DSM 13345T was 96 % (Fig. 1, Table S2). Moreover, the ANIb value between strains W1P28.032 and UMB0683 was 81%, and between each strain and the closely related species
DSM 8475T was 82.6 and 84 %, respectively (Fig. 1, Table S2), suggesting that these strains may represent distinct and putative novel species.
Fig. 1.
Heatmap representing percentage identity of ANIb for 332
genomes. The clusters of species according to the identification under which strains were deposited in the public NCBI database is provided on the left of the heatmap.
Heatmap representing percentage identity of ANIb for 332
genomes. The clusters of species according to the identification under which strains were deposited in the public NCBI database is provided on the left of the heatmap.The taxonomic position of strains was also elucidated by phylogenomic analysis based on 71 single-copy core proteins (Fig. 2). The results showed a clear distinction between genomes of different species. Moreover, strains VA24_5 and Lr4000 clustered with the type strain of
, strain W1P44.042 grouped with the type strain of
, and strains W1P28.032 and UMB0683 each formed an independent branch. Additionally, we submitted the two genomes from strains W1P28.032 and UMB0683 to the DSMZ Type (Strain) Genome Server, which predicted that both strains are putative novel species, with the closest relationship to
DSM 8475T, with digital DNA-DNA hybridization (dDDH) values of 47.1 and 58.2 %, respectively. On the basis of these analyses, we propose that strains VA24_5 and Lr4000 should be classified as members of the species
, and strain W1P44.042 classified as a member of
. Furthermore, two putative novel
species, here designated as
sp. nov. 1 (strain W1P28.032) and
sp. nov. 2 (UMB0683), were identified.
strains were not clustered by their host origin (Fig. 2), except for species with few representative genomes. The population structure of the 167 animal (mammals and birds) isolates was mostly associated with
, and there were only four species (
,
,
and
) containing both human and animal isolates.
Fig. 2.
Phylogenomic tree based on 71 single-copy core proteins created with the anvi’o 7.1 pipeline and edited in iTOL. The maximum-likelihood tree with Jones–Taylor–Thornton substitution model was built with FastTree v 2.1.11 with SH-like 1000 support. The tree includes 332 strains of which type strains identifiers are bold and misidentified species are marked with a red background. Additional data layers, i.e. species and strain origin, are incorporated in the figure according to the key. Bootstrap values are represented with a triangle symbol (only values ≥0.75 are shown). The scale bar represents 0.1 nucleotide substitutions per site.
Phylogenomic tree based on 71 single-copy core proteins created with the anvi’o 7.1 pipeline and edited in iTOL. The maximum-likelihood tree with Jones–Taylor–Thornton substitution model was built with FastTree v 2.1.11 with SH-like 1000 support. The tree includes 332 strains of which type strains identifiers are bold and misidentified species are marked with a red background. Additional data layers, i.e. species and strain origin, are incorporated in the figure according to the key. Bootstrap values are represented with a triangle symbol (only values ≥0.75 are shown). The scale bar represents 0.1 nucleotide substitutions per site.
Pangenome of
spp.
The
pangenome (332 genomes, 22 species) was represented by 20 401 gene clusters (699 830 gene calls), of which only 39 % (7937 gene clusters) had known COGs functions (Table S3). Only a minor part of the gene clusters presented metabolic predictions (21%, 4294 gene clusters, with known KOs, and 3%, 637 gene clusters, with known KEGG class). A high genomic variability was observed in this genome collection. The core genome contained only 266 gene clusters (1.3 %; n=266/20401; 95 017 core genes), and accessory genomes included 20 135 gene clusters (98.7 %; 2.8 % softcore, 577 gene clusters, 202 378 genes; 57.2 % dispensable, 11 659 gene clusters, 394 321 genes; 38.7 % singletons, 7899 gene clusters, 8114 genes).We analysed accumulation curves including core genes and the total number of genes for species that had more than 50 sequenced genomes i.e.
(n=197 genomes) and
(n=79) (Fig. S1). The pangenome accumulation curve did not reach a plateau for these two species, suggesting that the pangenome of
and
will continuously increase with new genomes. These findings indicate that these species have an open pangenome. In contrast, the core genome demonstrated a power trend line that plateaued. For the remaining species, only a limited number of genomes had been sequenced; thus, further sequencing is required to evaluate their genomic diversity.
Functional and metabolic characterization of
from different sources
As expected, core genes were mostly involved in housekeeping processes, including metabolic functions (e.g. COGs categories E, F, G, H, C), and non-metabolic-related central processes such as ribosomal biogenesis (category J) and replication (L) (Table S3). Similar functions were commonly observed within softcore genes. Regarding the accessory genome, the largest number of genes was related to translation, ribosomal structure and biogenesis (category J), while singletons were commonly associated with carbohydrate metabolism (category G), cell wall biogenesis (category M) and transcription (category K). Of note, most strains encode d-lactate dehydrogenase (n=329) and l-lactate dehydrogenase (n=331), which are associated with lactic acid production (Table S3).To investigate whether surviving in different niches requires special features, we analysed the accessory genome of
with respect to the isolation source, including only gene calls that had annotated COGs (n=261 256 gene calls; 2116 unique COGs). We found that 1301 COGs are shared between strains independently of their origin (Fig. 3, Tables S1 and S3), with many of them corresponding to the softcore group (508 COGs functions), such as ribosomal proteins (category J), translation elongation factors (category J), aminotransferases (category E) or various genes related to carbohydrate metabolism (category G). We also found 560 niche-specific COGs (312 from animals, 217 from humans and 31 from food), and 255 dual-host shared COGs. The most prevalent COGs function unique to human isolates was phosphoribosyl-ATP pyrophosphohydrolase involved in amino acid metabolism (54 %; 40/74 human strains), while N-acetylmuramoyl-l-alanine amidase CwlA, involved in cell wall organization, was the most frequent in animal isolates (21 %; 35/167 animal strains), and the restriction-modification system DNA methylase subunit related to defence mechanisms was the most prevalent in food isolates (12 %; 7/59 food strains). However, none of the COGs was present in all strains isolated from a particular niche.
Fig. 3.
VennDiagram image representing the number of accessory genes with annotated COGs functions (n=261 256 gene calls; 2116 unique COGs) that are shared and/or unique between
strains of different origin (i.e. human, animal, food).
VennDiagram image representing the number of accessory genes with annotated COGs functions (n=261 256 gene calls; 2116 unique COGs) that are shared and/or unique between
strains of different origin (i.e. human, animal, food).Metabolic enrichment was performed according to strain isolation source, i.e. identification of KOs that were more frequently associated with a particular origin (i.e. isolated from animals, humans or food products) (Table S4). Overall, strains isolated from humans were enriched in KOs associated with galactose (K02773, K02774), glycogen (K00688), starch and sucrose (K00703), and sulfur (K15554, K15553) metabolism, folate catabolism (K12941), the secretion system (K12063), and tetracycline antibiotic resistance (K08151). Strains isolated from animals were enriched in KOs associated with methane (K05884, K01007) and carbohydrate (e.g. K01624, K01625, K01686, K01685), and amino acid (K00558) metabolism, lipoic acid (K16869), cobalamin (e.g. K02224, K03394, K05934, K05895, K05936, K06042), and haem (e.g., K01772, K01698, K01749, K01845) biosynthesis, and tetracycline resistance (K08168). Strains isolated from food were enriched in KOs associated with amino acid metabolism (K00383), lipid biosynthesis proteins (K00142), KDP operon response regulator KdpE (K07667), ArsR family transcriptional regulator (K21903) and membrane transport (K10117). Detailed information on KOs and KEGG modules associated with strains of particular origin is provided in Table S4.
from humans
The 74
genomes from human strains belonged to 11 species, including
,
,
,
,
,
,
,
,
,
and
, and
sp. nov. 2 (strain UMB0683) (Table S5). The highest number of genomes belonged to
(n=36) and
(n=17), while remaining species had relatively poor representation (1–4 genomes each). Most strains were isolated from the gastrointestinal tract (n=31), followed by urogenital sources (n=9 vagina, n=7 urine), the oral cavity (n=6) or breast milk (n=5). For 16 strains, the human body site was not reported (Table S1).The pangenome of human
spp. was represented by 10 499 gene clusters (151 749 genes), with 5052 clusters of known COGs function (Fig. 4). Human isolates demonstrated high genomic variability, since the core genome was represented only by 453 gene clusters, which correspond to 4.3 % of all gene clusters (37 033 genes, 24.4%) (Fig. 4). Within the accessory genome (10 046 gene clusters, 95.7 %; 114 716 genes, 75.6%), the softcore group was represented by 316 gene clusters (24 553 genes), the dispensable genome had 5967 gene clusters (86 215 genes) and singletons comprised 3763 clusters (3948 genes). Most of the gene clusters in the core genome have known COGs function (97.8%), contrary to the accessory genome where most have unknown COGs function (54%).
Fig. 4.
Pangenome of all
strains isolated from human hosts (11 species, 1 putative novel species, 74 genomes) generated by anvi’o. Genomes are organized based on the tree of frequencies of gene clusters (top right). Each colour represents different species. External rings incorporate additional data regarding single-copy genes (SCG) clusters, number of genes per gene cluster and number of contributing genomes. The external white–green ring represents COGs functionality annotation, with green standing for known and white for unknown functions. Outside highlights represent particular gene collections: core, red; softcore, yellow; and singletons, blue. Additional information such as total length, G+C content, completion, redundancy, number of genes, mean gene length, number of genes per kbp, singleton gene clusters and number of gene clusters are represented by bars at the top right.
Pangenome of all
strains isolated from human hosts (11 species, 1 putative novel species, 74 genomes) generated by anvi’o. Genomes are organized based on the tree of frequencies of gene clusters (top right). Each colour represents different species. External rings incorporate additional data regarding single-copy genes (SCG) clusters, number of genes per gene cluster and number of contributing genomes. The external white–green ring represents COGs functionality annotation, with green standing for known and white for unknown functions. Outside highlights represent particular gene collections: core, red; softcore, yellow; and singletons, blue. Additional information such as total length, G+C content, completion, redundancy, number of genes, mean gene length, number of genes per kbp, singleton gene clusters and number of gene clusters are represented by bars at the top right.Most of the core genes were involved in information, storage and processing [translation, ribosomal structure and biogenesis (category J, 8236 genes), replication, recombination and repair (category L, 3183 genes)], and metabolism [nucleotide transport and metabolism (category F, 3172 genes), amino acid transport and metabolism (category E, 2440 genes) and carbohydrate transport and metabolism (category G, 2311 genes)]. Nearly the same functional pattern was observed for softcore genes compared with core genes (Fig. 5a, b). Many dispensable genes were related to metabolism (amino acids, 7492 genes; carbohydrates, 4520 genes; and coenzyme transport and metabolism, 4032 genes), followed by those related to cell wall biogenesis (category M, 3243 genes) and defence mechanisms (category V, 2293 genes), transcription (category K, 4983 genes), replication (category L, 3556 genes) and translation (category J, 1723 genes), and those related to mobilome (category X, 7942 genes). Singletons with a known function were associated with cell wall/membrane/envelope biogenesis (category M, 130 genes), metabolism of carbohydrates (category G, 113 genes), transcription (category K, 110 genes), replication, recombination and repair (category L, 93 genes) and mobilome – prophages, transposons (category X, 86 genes).
Fig. 5.
Pangenome-specific gene collections grouped by COGs functional categories. Specific categories included: category J, K, L into information storage and processing; categories D, V, T, M, N, U, O into cellular processes and signalling; categories C, G, E, F, H, I, P, Q into metabolism; category X into mobilome; categories R, S and not classified are designated as poorly characterized/unknown. The transparent bars represent the number of gene calls, while non-transparent bars represent the number of gene clusters. In the case of multiple COGs categories predicted per gene, the first one was used as the most significant hit. In the case of multiple COGs categories predicted per gene cluster, the most frequent was used as a representative. (a) Core and accessory genome grouped by COGs functional categories. (b) Accessory genome represented by softcore, dispensable and singleton genes.
Pangenome-specific gene collections grouped by COGs functional categories. Specific categories included: category J, K, L into information storage and processing; categories D, V, T, M, N, U, O into cellular processes and signalling; categories C, G, E, F, H, I, P, Q into metabolism; category X into mobilome; categories R, S and not classified are designated as poorly characterized/unknown. The transparent bars represent the number of gene calls, while non-transparent bars represent the number of gene clusters. In the case of multiple COGs categories predicted per gene, the first one was used as the most significant hit. In the case of multiple COGs categories predicted per gene cluster, the most frequent was used as a representative. (a) Core and accessory genome grouped by COGs functional categories. (b) Accessory genome represented by softcore, dispensable and singleton genes.We identified 15 and 4 gene clusters including duplicated genes (estimation of 2 genes in the gene cluster/strain) in the core and softcore genome, respectively. The majority of duplicated genes in the core were involved in amino acids (COGs category E, e.g. ABC-type polar amino acid transport system, ATPase component), followed by carbohydrates (COGs category G, e.g. Na+/melibiose symporter or related transporter) metabolism, and genes related to cell wall biogenesis (COGs category M, e.g. LysM repeat). For the softcore genome, the duplicated genes were mostly involved in energy production and conversion (category C, e.g. malate/lactate dehydrogenase). A detailed list of genes within each gene cluster, their functionality and homogeneity indices is provided in Table S6.Metabolic reconstruction revealed that all human
spp. strains were predicted to utilize galactose (Leloir pathway), synthesize UDP-N-acetyl-d-glucosamine and had F-type ATPase for energy metabolism. Also, the majority of strains use the pentose phosphate pathway and/or the Embden–Meyerhof pathway for carbohydrate metabolism, biosynthesis of arginine, ornithine, proline, cysteine, methionine, lysine and threonine for amino acid metabolism, the phosphate acetyltransferase-acetate kinase pathway associated with carbon fixation, and complete pathways for coenzyme A, thiamine, tetrahydrofolate, riboflavin and molybdenum cofactor biosynthesis (Table S7).To understand whether human strains cluster according to their isolation source in the human body, we performed clustering based on Euclidean distance using their accessory genome (Fig. 6). This comparison was based on the presence/absence of softcore, dispensable and singleton gene clusters. Overall, the distribution of these 10 046 gene clusters was different in the human strains, showing that the accessory gene clusters were conserved within species. This clustering showed that 74 human
strains cluster independently of their human isolation source, with some exceptions (e.g. urogenital
,
,
,
and gastrointestinal
), for which only a few genomes were available. Of note, some human strains belonging to
clustered according to their origin (e.g. urogenital strains); however, further observations are not possible due to the presence of strains of unknown origin (e.g. cluster of strains from gastrointestinal origin intermingled with those of unknown origin).
Fig. 6.
Heatmap representing 10 046 accessory gene clusters (presence/absence) of 74 human
strains. The dendrogram on the left represents hierarchical clustering based on Euclidean distance. Additional information regarding the isolation site within a human body is presented in the bar on the left of the heatmap. GI, Gastrointestinal.
Heatmap representing 10 046 accessory gene clusters (presence/absence) of 74 human
strains. The dendrogram on the left represents hierarchical clustering based on Euclidean distance. Additional information regarding the isolation site within a human body is presented in the bar on the left of the heatmap. GI, Gastrointestinal.To further clarify biological functions of human-associated
, we first performed COGs enrichment analysis in strains for which the isolation source was detailed, although few genomes were available for some body sites (e.g. six oral strains). Five COGs were found to be significantly enriched (q value <0.05) in oral or oral and urogenital strains. The enriched COGs for oral strains were associated with transcription (COG4977, COGs category K) and carbohydrate metabolism (COG3533, COGs category G), while those identified in oral and urogenital strains were more enriched in information, storage and signalling (COG2231, COGs category L; COG3695, category K) and cellular processes and signalling (COG3290, COGs category T).Subsequently, we compared the enriched KOs to explore the difference between the
strains in terms of their biological capabilities and to highlight adaptation to particular human sources (only entries with q value below 0.05 were considered) (Table 2). Some oral-associated
genomes uniquely presented categories related to metabolism (i.e. glycosidases) that were absent in other human sources, whereas some urogenital-associated genomes only presented categories related to defence mechanisms (i.e. toxin ParE1/3/4), signal transduction (i.e. sensor histidine kinase YcbA), and fructoselysine degradation pathway (i.e. fructoselysine 6-phosphate deglycase and fructoselysine 6-kinase).
Table 2.
A list of 12 KOs associated with particular isolation sources (only strains with a known human isolation source were considered)
Only KOs found as significant were included (q value <0.05). UGT, Urogenital tract; GI, Gastrointestinal.
Source
KO accession no.
Enzyme entry
KO family
Detection in strains from specific human body site (%)
MFS transporter, OPA family, phosphoglycerate transporter protein
31.25
0
0
0
K00407
–
Cytochrome c oxidase cbb3-type subunit IV
31.25
0
0
0
K10710
EC:2.7.1.218
Fructoselysine 6-kinase
31.25
0
0
0
A list of 12 KOs associated with particular isolation sources (only strains with a known human isolation source were considered)Only KOs found as significant were included (q value <0.05). UGT, Urogenital tract; GI, Gastrointestinal.SourceKO accession no.Enzyme entryKO familyDetection in strains from specific human body site (%)UGTOral cavityGIBreast milkOralK18205EC:3.2.1.185Non-reducing end β-l-arabinofuranosidase033.3300K12983EC:2.4.1.-UDP-glucose:(glucosyl)LPS β-1,3-glucosyltransferase18.7510019.3520UGT, oralK07457–Endonuclease III related protein62.533.336.450K07443–Methylated-DNA-protein-cysteine methyltransferase related protein62.533.336.450UGTK19092–Toxin ParE1/3/456.25000K19802EC:5.1.1.20l-Ala-d/l-Glu epimerase37.5000K07717EC:2.7.13.3Two-component system, sensor histidine kinase YcbA37.5000K01751EC:4.3.1.15Diaminopropionate ammonia-lyase31.25000K10708EC:3.5.-.-Fructoselysine 6-phosphate deglycase31.25000K11382–MFS transporter, OPA family, phosphoglycerate transporter protein31.25000K00407–Cytochrome c oxidase cbb3-type subunit IV31.25000K10710EC:2.7.1.218Fructoselysine 6-kinase31.25000
AMR determinants, bacteriocins, CRISPR-Cas system and phages in
spp.
The majority of
spp. were predicted as antibiotic susceptible (83%, n=268/322). Nonetheless, 18 AMR genes were identified among the remaining 54 strains (Table S8), with tetracycline [tet(L), tet(W), tet(M), tet(C), tet(O/W)], aminoglycoside [aadD, aph(3')-III, ant(6)-Ia, ant(9)-Ia], macrolide-lincosamide-streptogramin [erm(A), erm(B)] and lincomycin [lnu(A)] resistance genes being commonly identified. Moreover, AMR genes were mostly identified in
isolated from animals (72%, n=39/54, mainly from chickens and pigs). In human strains, only genes conferring resistance to tetracyclines [tet(W), tet(C)] and/or to lincomycin [lnu(A)] were identified. Interestingly, AMR genes from at least three antimicrobial classes, consistent with a multidrug-resistant genotype, were identified in 11 strains (
strains, isolated from animals).The in silico analysis of putative bacteriocin gene clusters (Table S9) showed that 233 strains (70.2 %; 17
species) were potential bacteriocin producers. In total, 396 putative bacteriocins were predicted and identified as: enterolysin A (class III; n=318; 15 species), sactipeptides (class I; n=48;
,
,
,
), gassericin K7B (class II; n=23;
,
), carnolysin (class I; n=4;
,
), gassericin T (class II; n=1
.
), acidocin 8912 (class II; n=1
.
) and cytolysin ClyLs (class I; n=1
.
). Forty-nine strains, mostly
from animals, encoded more than one bacteriocin. Overall, the bacteriocins were mostly identified in
strains from animal sources (n=143/332), covered all three classes, yet class III was the most detected.Six types of CRISPR-Cas systems (I-C, I-E, I-G, II-A, III-A and III-D) were identified in 106
strains (32%), of which 41 had more than one type (Table S10). The type II-A was the most frequent (n=73 strains), followed by I-E (n=46 strains) and III-A (n=14 strains). Interestingly, in strains belonging to
, five different CRISPR-Cas types were observed, while
had only type II-A. Overall, CRISPR-Cas-positive strains were common in
(60 isolates; 76%), while in
they were much less frequent (26 isolates; 13%). Regarding isolation source, CRISPR-Cas systems were more frequent among
isolated from humans (32 %; n=34/106), followed by food (26.4 %; n=28/106) and animals (25.5 %; n=27/106). CRISPR-Cas types I-C, I-E, II-A and III-A were detected in strains from humans, animals and food origins, while I-G and III-D were observed each in a human isolate. Among the
strains harbouring CRISPR-Cas systems, 2 to 111 spacers (median 20) were identified that can target foreign DNA. However, the nucleic acid source of the majority of spacer sequences remains unknown as no positive hits were identified.We identified complete or partial putative prophages in 55.4 % (n=184/332)
genomes (Table S11). The lengths of these prophage elements ranged from 4.9 to 72.1 kb (mean 20.5 kb). Fifty-five strains had two to four prophages. Moreover, the prevalence of predicted prophages varied considerably among strains collected from different sources. The highest prevalence was observed in animal strains (n=110; 1–4 prophages), followed by food (n=29; 1–3 prophages) and humans (n=30; 1–2 prophages). Most predicted phage genes were uncharacterized proteins (89.4 %; n=5425/6066 genes), and various mobilome-associated genes, e.g. tyrosine recombinases, transposases (e.g. IS3, IS30, IS200/IS605).
Discussion
This study is, to the best of our knowledge, the first to present a comprehensive pangenome analysis of
, including the largest number of
genomes (n=332) and species (n=22) retrieved from publicly available repositories. The number of complete genomes is still low and reserved mostly to species with well-explored beneficial activities, such as
and
[34], or/and isolated from animals. Thus, comprehensive detection of certain genomic characteristics associated with particular species or host-/niche-specificity can be challenging.The mean genome size of
was 2.08 Mbp with an average G+C content of 42.75 mol%, which was consistent with data from Zheng and colleagues [3]. Overall, ANIb and phylogenomic analysis revealed that public databases still contain misidentified strains (VA24_5 and Lr4000 here identified as
and W1P44.042 as
), and strains (W1P28.032 and UMB0683) that should be further characterized to confirm their potential as novel members of genus
(Fig. 2).Previous comparative genomics studies included members of
(
,
,
and
) and other related genera (former
;
,
,
,
,
,
,
,
,
,
) [35, 36]; however, a pangenome analysis comprising all
species had not been performed yet. Our findings revealed that the core genome of
spp. is relatively small (1.3%), demonstrating the high genomic diversity of this genus. Additionally, our findings support that
and
have an open pangenome, as previously noted [37, 38]. An open pangenome is characteristic of species for which genomic content is still not completely defined, and its diversity increases with constant acquisition of genes [39].It is noteworthy that our analyses showed that strains from humans, animal and food cannot be differentiated based on core phylogenomic analysis (Fig. 2); however, within the accessory genome, a subset of functions were uniquely present in some strains from each origin, which might enable their better adaptation to these niches. In fact, pangenome characterization showed that members of
shared little genomic similarity (98.7 % of pangenome comprises accessory gene clusters), suggestive of a diverse gene pool, customized more to suit species-specific instead of niche-specific needs. Nonetheless, a more conclusive view regarding the
genomic heterogeneity can be obtained only after including more genomes for members with few representations.While the majority of
genomic studies have focused on species most relevant to industry, e.g. the probiotic market [40-43], we focused our analysis on human
, since several species were identified in different human body sites, and occasionally in high relative abundance. For instance,
and
were found in high relative abundance in the vaginal microbiome of healthy women [44; M. Ksiezarek and others, unpublished data],
in the vagina of Indian women and the faeces of healthy children [45],
in breast milk and in saliva from patients with dental caries [46, 47], and
in the intestinal microbiota of obese adults [48].We reported here that 4.3 % of human
pan-gene clusters correspond to the core genome, which is higher than previously reported for other
inter-species studies (up to 2.8%) [35, 49, 50]. Analysis of the core genome supports that fundamental processes such as translation, ribosomal structure and biogenesis, replication, recombination and repair, and information storage and processing were conserved and essential for bacterial survival, similarly to previous observations [36, 51, 52]. Of note, we observed duplications for core genes related to, for example, amino acid and carbohydrates metabolism, which seems to play a significant role in the biological evolution [53].We also identified predicted in silico conserved metabolic pathways for human
spp. including sugar metabolism (pentose phosphate pathway and Embden–Meyerhof pathway), energy metabolism (F-type ATPase), amino acids metabolism (arginine, ornithine, proline, cysteine, methionine, lysine and threonine biosynthesis) or metabolism of cofactors and vitamins (coenzyme A, thiamine, tetrahydrofolate, riboflavin and molybdenum cofactor), which could be of interest to understand growth requirements and contribution of these bacteria to human nutrition and health [54, 55].Since the pangenome of human
revealed high genomic heterogeneity (75.6 % accessory genes), we analysed their accessory genome to understand whether there are signatures suggesting that these strains have specialized to succeed in the human body. Overall, the accessory genes of human
do not cluster according to isolation sources, appearing to be ‘promiscuous’ in terms of human body sites. A free-living lifestyle was previously suggested for
, both in terms of host range and body site [56]. In comparison, the accessory genes from urogenital
clustered together, suggesting that the accessory genes were affected by the habitat. In fact,
was previously suggested to diversify in a strict host-specific manner into host-adapted lineages by a long-term evolutionary process, allowing the development of a highly specialized symbiosis [57, 58]. Also,
isolated from gastrointestinal tract or
,
,
,
from urogenital tract were similar in the composition of their accessory genes, suggesting adaptation to specific human body sites. However, few genomes per species were available.According to our enrichment analyses, KOs involved in fructoselysine degradation were detected only in some urogenital strains. Fructoselysine, abundant in cooked foods via the Maillard reaction, is a key product leading to the formation of glycation end products in the human body that have been associated with chronic diseases and development of diabetes complications and ageing. Previously, the conversion of fructoselysine has been reported for a few bacteria, including
,
and
[59]. Urogenital
might be able to catabolize fructoselysine from undigested food, as this Amadori product is excreted in urine, and the frlD and frlB genes encoding fructoselysine 6-kinase and fructoselysine 6-phosphate deglycase, respectively, were identified. However, it is not yet known whether urogenital
are able to grow on fructoselysine as the sole carbon and energy source.AMR genes were found in higher frequency among strains isolated from animals, with few presenting multidrug-resistant genotypes. These findings may be explained by the extensive use of antibiotics in food-producing animals, since most of these strains were isolated from farm animals (e.g. chicken, pig) [60]. Moreover, the link between this antimicrobial usage in animals and the occurrence of AMR in the human microbiome/human infections has been deeply discussed in recent years [61, 62]. It should be noted, however, that most of AMR genes were identified in
, which was also the species that showed less frequent CRISPR-Cas systems. This is in agreement with the idea that the consumption and dissemination of antibiotics in the environment is favouring the deficient forms of immunity provided by CRISPR-Cas systems [63].Nearly one-third of
strains harboured the CRISPR-Cas system, being more frequent among strains recovered from humans, which might reflect the high prevalence of foreign DNA in the human host rather than in non-human sources, in particular the type II-A CRISPR-Cas system. The type II CRISPR-Cas system, a well‐known molecular mechanism that provides adaptive immunity against exogenous genetic elements such as bacteriophages and plasmids in bacteria [64], might indicate an advantage in promoting
genome stability by acting as a barrier to entry of foreign DNA elements. However, the functionality of the identified type II-A CRISPR-Cas system must be investigated. As previously observed, most of the spacer sequences present in the CRISPR-Cas system remain without a match, representing the vast CRISPR 'dark matter' [65], which might be attributed to the presence of substantial mobile genetic elements that had not been sequenced to date [66].A considerable diversity of bacteriocins, covering all classes, was found among
strains, which may contribute not only for a competitive advantage for the producing strain and modulate the neighbouring microbial community, but also benefit the host by inhibiting potential pathogens. Interestingly, our data suggests an enhanced ability of animal strains to produce different bacteriocins, with those from the animal encoding over 3.5 and 4.5 times as many putative bacteriocins as those from humans and food, respectively. Bacteriocin production might be a competitive advantage for strains from complex environments, such as the microbiota of animals. Enterolysin A, produced by some
species [67], was the most common bacteriocin, distributed among several niches (animal, food, human). This bacteriocin has been found to have a broad spectrum of activity due to its mode of action that results in cell wall degradation, being active against some pathogens (e.g. some strains of
and
), but with increased activity towards
[68, 69]. The second most common bacteriocin belonged to sactipeptides, also referred to as sactibiotics when they possess antimicrobial activity, and to date none from
have been characterized [70]. GassericinK7B, the third most common, was found among strains from animal origin, with the exception of one vaginal strain. It should be noted that gassericinK7B has shown potent activity against
strains that have been linked with the development of bacterial vaginosis and its production in strains from the urogenital tract may represent a beneficial trait [71]. Nevertheless, appropriate phenotypic tests should be performed to evaluate the expression of these genomic features.Prophages are a common feature among prokaryotic genomes, including in
[72, 73]. In this study, approximately 40 % of
human strains contained prophage regions, which is congruent with previous reports of the high prevalence of phages in strains colonizing humans, especially the gut [74]. Within the phage genes, we did not observe genes related to AMR or putatively related to bacterial virulence, which suggests that presence of bacteriophages in
strains can bring advantages on environmental adaptation rather than pathogenicity.In summary, this study presents a comprehensive comparative genomic analyses of the genus
and provides a scientific basis toward understanding the biology and evolution of this genus. Here, we demonstrate high genomic diversity within
, the existence of putative novel species in the public databases and different bacterial lifestyles, depending on the species (free-living or more host-specific). Furthermore, a large number of accessory genes within
suggests their high species- and strain-specificity. According to the accessory genome, human
appear to have a ubiquitous adaptation to different body sites, while for other species the evolution could be directed by a particular human body site (
). Additionally, several common genomic determinants (e.g. putative bacteriocins or lactic acid production) support that
can have a role in maintaining host homeostasis. Nevertheless, further sequencing to enlarge the genome collection and appropriate phenotypic testing is essential to evaluate their potential contribution to human health.Click here for additional data file.Click here for additional data file.Click here for additional data file.Click here for additional data file.Click here for additional data file.Click here for additional data file.Click here for additional data file.Click here for additional data file.Click here for additional data file.Click here for additional data file.Click here for additional data file.Click here for additional data file.
Authors: Fuyong Li; Christopher C Cheng; Jinshui Zheng; Junhong Liu; Rodrigo Margain Quevedo; Junjie Li; Stefan Roos; Michael G Gänzle; Jens Walter Journal: Int J Syst Evol Microbiol Date: 2021-02 Impact factor: 2.747
Authors: Sergey A Shmakov; Vassilii Sitnik; Kira S Makarova; Yuri I Wolf; Konstantin V Severinov; Eugene V Koonin Journal: mBio Date: 2017-09-19 Impact factor: 7.867
Authors: Daniel Belstrøm; Florentin Constancias; Yang Liu; Liang Yang; Daniela I Drautz-Moses; Stephan C Schuster; Gurjeet Singh Kohli; Tim Holm Jakobsen; Palle Holmstrup; Michael Givskov Journal: NPJ Biofilms Microbiomes Date: 2017-10-02 Impact factor: 7.290