
Comparative genome analysis of Clostridium beijerinckii strains isolated from pit mud of Chinese strong flavor baijiu ecosystem.

Wei Zou¹, Guangbin Ye¹, Chaojie Liu¹, Kaizheng Zhang¹, Hehe Li², Jiangang Yang¹.

Abstract

Clostridium beijerinckii is a well-known anaerobic solventogenic bacterium that inhabits a wide range of niches. Previously, we isolated five butyrate-producing C. beijerinckii strains from pit mud (PM) of strong-flavor baijiu (SFB) ecosystems. Genome annotation of the five strains showed that they could assimilate various carbon sources as well as ammonium to produce acetate, butyrate, lactate, hydrogen, and esters but did not produce the undesirable flavor compounds isopropanol and acetone, making them useful for further exploration in SFB production. Our analysis of 233 C. beijerinckii genomes revealed a pangenome that is open based on current sampling and will likely grow as additional genomes are added. The core genome, accessory genome, and strain-specific genes comprised 1567, 8851, and 2154 genes, respectively. A total of 298 genes were found only in the five C. beijerinckii strains from PM, among which only 77 genes were assigned to Clusters of Orthologous Genes categories. In addition, 15 transposase and 12 phage integrase families were found in all five C. beijerinckii strains from PM. Between 18 and 21 genome islands were predicted for the five C. beijerinckii genomes. The existence of a large number of mobile genetic elements indicated that the genomes of the five C. beijerinckii strains evolved with the loss or insertion of DNA fragments in the PM of SFB ecosystems. This study presents a genomic framework of C. beijerinckii strains from PM that could be used for genetic diversification studies and further exploration of these strains.


Keywords: Clostridium beijerinckii; baijiu; butyrate; mobile genetic elements; pangenome; pit mud


Year: 2021; PMID: 34542586; PMCID: PMC8527462; DOI: 10.1093/g3journal/jkab317

Source DB: PubMed; Journal: G3 (Bethesda); ISSN: 2160-1836; Impact factor: 3.154

Introduction

Clostridium beijerinckii is a Gram-positive anaerobic solventogenic bacterium that is well known for its ability to produce acetone, butanol, and ethanol (ABE) or isopropanol, butanol, and ethanol (IBE) (Qureshi and Blaschek 2001; Survase ; dos Santos Vieira ) using low-cost carbon sources, such as different straw hydrolysates (Bellido ; Dalal ), lignocellulosic hydrolysate (Reddy ), and molasses (Li ; Fonseca ). In addition, it has been used to produce hydrogen with great success (Seelert ). Other metabolites that can be biosynthesized by wild or engineered C. beijerinckii strains include butyric acid (Drahokoupil and Patakova 2020), polyhydroxyalkanoate (Hassan ), butyl butyrate, butyl acetate (Fang ), and 1,3-propanediol (Wischral ). Furthermore, C. beijerinckii is widely used in co-cultures with other microorganisms for the production of butanol or hydrogen (Du ).

Chinese baijiu, previously also called Chinese liquor or Chinese spirits, is a traditional distilled liquor that has existed in China for hundreds of years (Zheng and Han 2016; Jin ; Liu and Sun 2018). Strong-flavor baijiu (SFB) is the most common type, accounting for more than 70% of total baijiu yields (Xu ). At present, more than 1300 flavor compounds have been identified in SFB (Yao ), among which ethyl hexanoate, ethyl lactate, ethyl acetate, and ethyl butanoate are the representative compounds (Zheng and Han 2016). Butanoate is an important precursor of ethyl butanoate (Xu ), the content of which is of great importance to the quality of SFB. Many butyrate-producing strains have been isolated and identified in the pit mud (PM) of SFB ecosystems, and among them, Clostridium strains are considered the main producers (Zou ; Chai ; Luo ). Recently, we isolated five C. beijerinckii strains from the PM of different SFB plants in Sichuan province in China and found that they were capable of producing butyrate with yields of 3.7–6.8 g L−1 (Tian ). These C. beijerinckii strains are candidate baijiu additives for butyrate production and potentially for cultivating new PM. Moreover, they could be of further use in the industrial fermentation of butyrate, butanol, and hydrogen. However, due to the uniqueness of the PM habitat, the physiological and metabolic features of the five C. beijerinckii strains may differ from those of other C. beijerinckii isolates, which may hinder their further exploration.

The large amount of genomic information and the growing number of bioinformatics tools and databases allow for the analysis of genomes across species or genera (Zhong ; Jiang ). In addition, pangenomic methods can be used to analyze the genomes of a species, a genus, or a given phylogenetic clade (Udaondo ). Pangenomes can contribute to our understanding of species diversity and the metabolic capabilities of species isolated from different niches (Vernikos ; Golicz ). In this study, we sequenced four C. beijerinckii strains isolated from PM, named 3-8, G3-1, G3-3, and G3-5 (Tian ). The four genomes as well as the previously sequenced genome of strain 2-1 were analyzed and compared. The metabolic features of the five strains related to assimilation of different carbon sources and the biosynthetic pathways of butyrate, butanol, and hydrogen were studied. Then, a pangenomic analysis of C. beijerinckii species was carried out to analyze the phylogenetic relationships, unique gene functions, and mobile genetic elements (MGEs) of the five C. beijerinckii strains isolated from PM.

Materials and methods

DNA extraction, genome sequencing, assembly, and annotation

Four C. beijerinckii strains, 3-8, G3-1, G3-3, and G3-5, were isolated from the PM of Chinese baijiu ecosystems (Tian ). The four strains were grown on reinforced Clostridium medium (Munch-Petersen and Boundy 1963) and cultivated at 35°C for 60 h. Bacterial cells were then collected by centrifugation at 8000 rpm for 10 min, and genomic DNA was extracted using a previously described method (Tian ). The entire genomes of all four C. beijerinckii strains were sequenced on an Illumina NovaSeq platform by Shanghai Personal Biotechnology Co., Ltd. The TruSeq™ DNA Sample Prep Kit was used to prepare the Illumina libraries. After obtaining the raw sequence data, AdapterRemoval (version 2.1.7) (Schubert ) and SOAPec (version 2.0) (Luo ) were used for data filtering and quality adjustments. Quality-filtered reads were then assembled using A5-MiSeq (Coil ) and SPAdes (Bankevich ), and base correction was performed on the assembled data using Pilon (Walker ) to obtain the final assemblies. Gene prediction and annotation were carried out for the four C. beijerinckii genomes sequenced in this study and the previously sequenced genome of strain 2-1 using the Rapid Annotation using Subsystem Technology (RAST) server (Aziz ). The amino acid sequences of proteins were further annotated using the Kyoto Encyclopedia of Genes and Genomes (KEGG) Automatic Annotation Server (KAAS) pipeline (Moriya ). Metabolic pathways of C. beijerinckii were reconstructed with KEGG Mapper from the KAAS annotations. MGEs, including genome islands (GIs) and prophage sequences, were predicted for the five C. beijerinckii genomes. GIs were predicted using IslandViewer 4, which involves three methods: SIGI-HMM, IslandPath-DIMOB, and IslandPick (Bertelli ). Prophage sequences were predicted using PHASTER (Arndt ). Default parameters were used for all software in this study unless otherwise specified.

Pangenome analysis

A total of 233 C. beijerinckii genomes were used for pangenomic analysis, including 229 genomes downloaded from the National Center for Biotechnology Information (NCBI) and the genomes of the four C. beijerinckii strains sequenced in this study. The detailed genomic information used in this study is provided in Supplementary Table S1. Pangenome analysis of the 233 C. beijerinckii strains was performed using the Bacterial Pan Genome Analysis (BPGA) pipeline (version 1.3) (Chaudhari ). USEARCH was used as the clustering tool, and the sequence identity cutoff was set to 50%. A gene presence-absence binary matrix (pan-matrix) was obtained from BPGA and input into PanGP to analyze the pangenome and core genome profiles (Zhao ). The mathematical formula used for pangenome profile fitting is a power-law regression based on Heaps’ law ($y = Ax^{B} + C$, where y represents the pangenome size; x represents the genome number; and A, B, and C are fitting parameters). When 0 < B < 1, the pangenome size increases as new genomes are added and the pangenome is considered open. When B < 0 or B > 1, the pangenome size does not increase when new genomes are added and the pangenome is considered closed. The mathematical formula used for fitting the core genome size is an exponential regression model ($y = Ae^{Bx} + C$, where y represents the core genome size; x represents the number of analyzed genomes; and A, B, and C are fitting parameters). eggNOG-mapper was used to annotate genes in the pangenome with Clusters of Orthologous Groups (COG) of proteins categories (Huerta-Cepas ).
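To make the use of the pan-matrix concrete, the following is a minimal sketch of how gene clusters can be split into core, accessory, and strain-specific sets from a presence-absence matrix. It is not the BPGA implementation; the function name `partition_pangenome`, the dictionary layout, and the toy matrix are assumptions made for illustration.

```python
from typing import Dict, List


def partition_pangenome(pan_matrix: Dict[str, List[int]]) -> Dict[str, List[str]]:
    """Split gene clusters into core, accessory, and strain-specific sets.

    pan_matrix maps a cluster ID to a 0/1 presence vector with one entry
    per genome (hypothetical layout, not the BPGA file format).
    """
    groups: Dict[str, List[str]] = {"core": [], "accessory": [], "strain_specific": []}
    for cluster, row in pan_matrix.items():
        n_genomes = len(row)
        n_present = sum(row)
        if n_present == n_genomes:
            groups["core"].append(cluster)             # present in every genome
        elif n_present == 1:
            groups["strain_specific"].append(cluster)  # present in exactly one genome
        elif n_present >= 2:
            groups["accessory"].append(cluster)        # present in several, not all
    return groups


# Toy matrix with four genomes (illustrative values only).
toy_matrix = {
    "cluster_A": [1, 1, 1, 1],
    "cluster_B": [1, 0, 1, 0],
    "cluster_C": [0, 0, 0, 1],
}
partition = partition_pangenome(toy_matrix)
print(partition)
```

Counting set membership per row keeps the classification independent of the clustering tool, so the same function applies to any binary matrix with one column per genome.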

Phylogenetic tree reconstruction

To construct the C. beijerinckii phylogenetic tree based on the core genome, the amino acid sequences of the core genome were concatenated and aligned using the MAFFT online service (Katoh ). A phylogenetic tree was constructed using PhyML with the maximum likelihood algorithm (Guindon ). The phylogenetic tree was rendered using FigTree (version 1.4.4).
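The concatenation step can be sketched as follows, assuming each core cluster contributes exactly one protein per strain and that clusters are joined in a fixed order so that every strain's concatenated sequence is built the same way. The function names, the input layout, and the toy sequences are hypothetical; the resulting FASTA text would then be passed to an aligner such as MAFFT.

```python
from typing import Dict


def concatenate_core_proteins(core_clusters: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Join the core proteins of each strain in a consistent cluster order.

    core_clusters maps cluster ID -> {strain name -> protein sequence}.
    """
    cluster_order = sorted(core_clusters)
    strains = sorted(core_clusters[cluster_order[0]])
    concatenated: Dict[str, str] = {}
    for strain in strains:
        parts = []
        for cluster in cluster_order:
            members = core_clusters[cluster]
            if strain not in members:
                raise ValueError(f"{cluster} is not a core cluster: missing {strain}")
            parts.append(members[strain])
        concatenated[strain] = "".join(parts)
    return concatenated


def to_fasta(sequences: Dict[str, str], width: int = 60) -> str:
    """Render sequences as FASTA text with fixed-width lines."""
    lines = []
    for name, seq in sequences.items():
        lines.append(f">{name}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Toy input: two core clusters, two strains (made-up sequences).
toy_core = {
    "core_0002": {"strain_x": "MKV", "strain_y": "MRV"},
    "core_0001": {"strain_x": "MAL", "strain_y": "MAI"},
}
supersequences = concatenate_core_proteins(toy_core)
print(to_fasta(supersequences))
```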

Results and discussion

Genome properties of Clostridium beijerinckii strains isolated from PM

Previously, the genome of C. beijerinckii strain 2-1 was sequenced, and the data were deposited into NCBI under accession number PRJNA428897. In this study, the genomes of C. beijerinckii strains 3-8, G3-1, G3-3, and G3-5 were sequenced and deposited into GenBank under accession numbers PRJNA690962, PRJNA695099, PRJNA695100, and PRJNA698586, respectively. The genome sizes of the five C. beijerinckii strains isolated from PM of SFB ecosystems ranged from 5,461,616 bp to 5,637,955 bp, and the number of scaffolds ranged from 92 to 328 (Table 1). The GC content of the five strains ranged from 29.58% to 29.78%. The RAST server was used to predict and annotate the genomes of the five isolated C. beijerinckii strains. The average number of annotated protein-encoding genes among the five genomes was 5040, and the average number of proteins assigned to RAST subsystems was 2039. The gene distributions according to RAST categories of the five C. beijerinckii strains were also compared (Supplementary Table S2). Carbohydrates; amino acids and derivatives; and cofactors, vitamins, prosthetic groups, and pigments were the three largest subsystems, representing an average of 17.0%, 12.4%, and 10.7% of each genome, respectively. Strain 3-8 had fewer proteins in the categories protein metabolism and dormancy and sporulation than the other four strains (Supplementary Table S2).

Table 1

Genome features of the five Clostridium beijerinckii strains isolated from pit mud (PM)

Features	2-1	3-8	G3-1	G3-3	G3-5
Genome size (bp)	5,626,308	5,461,616	5,637,955	5,609,807	5,637,825
No. of all scaffolds	328	92	129	105	130
Total reads	10,554,482	6,891,264	10,865,966	9,147,036	8,625,822
Total reads length (bp)	1,704,906,022	1,021,795,131	1,615,799,558	1,352,042,586	1,282,836,405
Largest scaffold length (bp)	246,790	235,144	235,510	279,509	235,510
Scaffold N50 (bp)	60,705	83,607	92,783	122,976	92,771
G+C content (%)	29.78	29.58	29.64	29.60	29.64
Coding protein number	5035	4932	5081	5071	5083
Proteins annotated in RAST subsystems	2078	1869	2083	2084	2083
rRNA	6	8	37	17	36
tRNA	59	52	81	58	80
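Two of the assembly statistics in Table 1, scaffold N50 and G+C content, follow from simple definitions. The sketch below shows one common way to compute them; the function names and toy inputs are illustrative assumptions and are not taken from the assembly software used in this study.

```python
from typing import Iterable


def n50(scaffold_lengths: Iterable[int]) -> int:
    """Return the length L such that scaffolds of length >= L hold at least half of the assembly."""
    ordered = sorted(scaffold_lengths, reverse=True)
    if not ordered:
        raise ValueError("no scaffold lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]


def gc_percent(sequence: str) -> float:
    """Percentage of G and C among unambiguous bases (N and other symbols are ignored)."""
    seq = sequence.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


# Toy values, not real scaffolds.
print(n50([10, 20, 30, 40]))
print(gc_percent("ATGCNN"))
```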


Metabolic features of the five Clostridium beijerinckii strains from PM

The microbiota of PM in SFB ecosystems is an important inoculum for SFB production (Zou ). A variety of substrates are present during the saccharification and fermentation of cereals, mostly starch or hydrolysis products of starch such as disaccharides and monosaccharides. The main nitrogen source is ammonium (Tao ). Genome annotation of the five C. beijerinckii strains from PM revealed that complete metabolic and transport pathways of many sugars were predicted to be present in all five strains. Furthermore, all five strains were predicted to be able to convert starch, dextrin, trisaccharides (raffinose and manninotriose), disaccharides (sucrose, melibiose, epimelibiose, maltose, trehalose, cellobiose, and isomaltose), monosaccharides (fructose, mannose, galactose, xylose, xylulose, and ribose), organic acids (l-gulonate, d-glucuronate, d-tagaturonate, d-fructuronate, d-galacturonate, d-glycerate, and 4-aminobutanoate), alcohols (mannitol and galactitol), arbutin, salicin, and N-acetyl-d-glucosamine into glucose or intermediates of glycolysis, which subsequently enter the central carbon metabolism pathway to produce biomass or energy. In addition, the predicted transport systems for these substrates were annotated by KEGG, including ATP-binding cassette (ABC) transporters and phosphotransferase (PTS) systems. Sugars, including glucose, fructose, maltose, sucrose, cellobiose, trehalose, mannose, mannitol, galactitol, and N-acetyl-d-glucosamine, are mainly transported into the cell via the PTS system, but others, such as raffinose, isomaltotriose, maltotriose, melibiose, ribose, d-xylose, arbutin, and salicin, are transported in part by ABC transporters. In addition, d-xylose and melibiose can be transported by the d-xylose proton-symporter XylT and the Na+/melibiose symporter, respectively. The presence of transporters and the complete metabolic pathways of various substrates indicates that the five C. beijerinckii strains likely assimilate various substrates in PM, allowing them to adapt to the environment of PM of SFB ecosystems.

We reconstructed the biosynthetic pathways of ABE and IBE from glucose (Figure 1). All five C. beijerinckii strains appear to possess the complete biosynthetic pathways of formate, acetate, and ethanol. Although formate production was not detected in strains 3-8 or G3-3 in our previous study (Tian ), we identified five and six gene copies encoding pyruvate formate-lyase (EC: 2.3.1.54), which catalyzes the formation of formate from pyruvate, in the genomes of strains 3-8 and G3-3, respectively. In addition, formate efflux transporters were identified in the five strains. Formate may be transformed into formyltetrahydrofolate by reaction with tetrahydrofolate, catalyzed by formate-tetrahydrofolate ligase (EC: 6.3.4.3), which was found in all five strains. C. beijerinckii strains 2-1, G3-1, G3-3, and G3-5 possessed the complete biosynthetic pathways of butyrate and butanol, but the gene encoding 3-hydroxybutyryl-CoA dehydrogenase (paaH, EC: 1.1.1.157) was not annotated in the genome of strain 3-8. However, strain 3-8 was shown to be capable of producing butyrate with a yield of 6.81 g L−1 in our previous report (Tian ). This inconsistency between experimental results and genome annotation calls for further examination of the annotation of this gene or experimental validation.

Figure 1

Biosynthetic pathways of acetone, butanol, and ethanol (ABE) and isopropanol, butanol, and ethanol (IBE) from glucose in the five Clostridium beijerinckii strains isolated from pit mud (PM). Genes shown in gray were absent in all five strains. Genes shown in red were absent in strain 3-8 but present in strains 2-1, G3-1, G3-3, and G3-5.

Unlike other C. beijerinckii strains used in industrial biotechnology, the five C. beijerinckii strains isolated from PM are not currently predicted to be able to produce acetone or isopropanol due to the absence of acetoacetate decarboxylase (Adc, EC: 4.1.1.4), which converts acetoacetate into acetone. This feature may be beneficial for the further application of these strains in SFB manufacturing, as isopropanol is an undesirable flavor compound that causes dizziness and sleepiness when its content is above threshold values (Minqian ). In addition, acetone is also an undesired compound that causes SFB to have a peppery taste. The organic acids produced by the five C. beijerinckii strains, including acetate, butyrate, and lactate, may act as precursors of the main flavor compounds of SFB, such as ethyl acetate, ethyl butanoate, and ethyl lactate. All five C. beijerinckii strains produced ethyl butanoate, one of the representative flavor compounds in SFB, with a yield of 38–51 mg L−1 (Zheng and Han 2016). In addition, genes predicted to encode two esterases, carboxylesterase NA and carboxyl esterase, a/b hydrolase, were found in the genomes of all five strains and may contribute to the biosynthesis of ethyl butanoate.

Clostridium beijerinckii has been used for the production of hydrogen (Fonseca ). In PM of SFB, hydrogen also plays important roles in maintaining the stability of the microbial community of the PM ecosystem (Zou ). The hydrogen biosynthetic pathway in C. beijerinckii is predicted to be the same as that in Clostridium butyricum and involves ferredoxin hydrogenase (HydA, EC: 1.12.7.2), ferredoxin, and nitrogenase (EC: 1.18.6.1) (Zou ). These three genes were all found in the five C. beijerinckii strains from SFB ecosystems, including 21 copies of the gene encoding ferredoxin in each genome.

Ammonium is present at high levels in PM of SFB ecosystems, with concentrations ranging from 1.86 to 4.2 g kg−1 in PM of different ages (Tao ). Two genes coding for ammonium transporters, as well as a gene involved in the ammonium assimilation pathway, were found in all five C. beijerinckii strains. Ammonium is assimilated mainly by glutamine synthetase (EC: 6.3.1.2), which adds ammonium to glutamate for glutamine biosynthesis. Other enzymes catalyzing the assimilation of ammonia include aspartate-ammonia ligase (EC: 6.3.1.1) and asparagine synthase (EC: 6.3.5.4).
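The reasoning used in this section, inferring a metabolic capability from the presence or absence of annotated enzymes, can be sketched as a lookup of EC numbers. The example below uses only the three enzymes and the presence/absence pattern stated in the text for strains 2-1 and 3-8. The step names, the data layout, and the reduction of each pathway to a single marker enzyme are simplifying assumptions, not the KEGG Mapper procedure.

```python
from typing import Dict, List, Set

# Marker enzymes named in the text (one per step; real pathways have more steps).
MARKER_STEPS: Dict[str, Set[str]] = {
    "formate from pyruvate (pyruvate formate-lyase)": {"2.3.1.54"},
    "3-hydroxybutyryl-CoA dehydrogenase step (butyrate/butanol)": {"1.1.1.157"},
    "acetone from acetoacetate (acetoacetate decarboxylase)": {"4.1.1.4"},
}


def missing_steps(annotated_ecs: Set[str]) -> List[str]:
    """Return the marker steps whose EC numbers are not all annotated."""
    return [name for name, required in MARKER_STEPS.items()
            if not required <= annotated_ecs]


# Presence/absence as described in the text; only the listed ECs are modeled.
strain_annotations: Dict[str, Set[str]] = {
    "2-1": {"2.3.1.54", "1.1.1.157"},
    "3-8": {"2.3.1.54"},
}

for strain, ecs in strain_annotations.items():
    print(strain, "lacks:", missing_steps(ecs))
```

A missing annotation is evidence of absence only up to annotation quality, as the butyrate yield reported for strain 3-8 illustrates.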

The pangenome and core genome of Clostridium beijerinckii

A total of 233 C. beijerinckii genomes were used for pangenomic analysis (Supplementary Table S3). The average genome size of the 233 C. beijerinckii genomes was 6.11 Mb, and the average number of proteins was 5182. The genes in the 233 C. beijerinckii genomes were grouped into 12,572 gene clusters. Among these, 1567 gene clusters were shared among all 233 genomes and made up the core C. beijerinckii genome (Supplementary Table S3). The genes in the core genome account for 30.2% of the average number of genes, which is greater than that of C. butyricum (26.3%) (Zou ). In addition, we found 8851 gene clusters present in two or more, but not all, of the genomes studied; these made up the accessory genome. The average number of accessory genes for each genome was 3153, representing 60.8% of the average number of genes for each genome (Supplementary Table S3). A total of 2154 gene clusters were found in only one genome (strain-specific genes). A total of 179 C. beijerinckii strains had no strain-specific gene clusters. Forty-one C. beijerinckii strains had between 1 and 10 strain-specific genes in each genome, and 13 strains had more than 10 strain-specific genes. Strains HUN142 and NCIMB 14988, isolated from rumen contents and garden soil, had the largest number of strain-specific genes, with 387 and 312, respectively (Supplementary Table S3). The five C. beijerinckii strains isolated from PM of SFB ecosystems had a total of 33 strain-specific genes. We fit a curve for the pangenome profile using power-law regression based on Heaps’ law (Figure 2). The exponent of the fitted curve was positive (B = 0.2), indicating that the pangenome was predicted to be open based on current sampling of genomes. The C. beijerinckii pangenome increased in size when newly analyzed genomes were added. This result is similar to those of previous studies on members of the genus Clostridium (Udaondo ), C. perfringens (Kiu ; Feng ), C. botulinum (Bhardwaj and Somvanshi 2017), and C. butyricum (Zou ). However, it should be noted that the pangenome analysis results may be limited by the number of C. beijerinckii genomes analyzed. The pangenome profile fitting curve showed that once the number of analyzed genomes exceeds 4298, each newly added genome is expected to add less than one gene to the pangenome. Thus, more C. beijerinckii genomes from different ecological niches need to be added in further pangenome studies.
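The statement about the number of genomes beyond which a new genome adds less than one gene can be reproduced from the power-law form of the fit. The sketch below assumes the Heaps'-law model y = A·x^B + C; the value A = 4000 is a placeholder chosen for illustration and is not the fitted parameter of this study, and the function names are hypothetical. Only B = 0.2 is taken from the text.

```python
from typing import Optional


def is_open(b: float) -> bool:
    """A power-law exponent strictly between 0 and 1 gives sublinear but unbounded growth."""
    return 0.0 < b < 1.0


def new_genes(n: int, a: float, b: float) -> float:
    """Expected pangenome growth when genome n + 1 is added to n genomes (C cancels out)."""
    return a * ((n + 1) ** b - n ** b)


def saturation_point(a: float, b: float, limit: int = 1_000_000) -> Optional[int]:
    """Smallest n at which adding one more genome contributes fewer than one new gene."""
    for n in range(1, limit + 1):
        if new_genes(n, a, b) < 1.0:
            return n
    return None  # not reached within the search limit


A_PLACEHOLDER = 4000.0  # assumption for illustration, not the fitted value
B_REPORTED = 0.2        # exponent reported in the text

print("open pangenome:", is_open(B_REPORTED))
print("saturation point for the placeholder A:", saturation_point(A_PLACEHOLDER, B_REPORTED))
```

Because the increment decreases monotonically for 0 < B < 1, a linear scan is sufficient; with the actual fitted A, B, and C the same search would return the threshold implied by the fitted curve.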

Figure 2

Mathematical formula fitting the pangenome and core genome size when the genome number of Clostridium beijerinckii strains varied from 1 to 233. The cumulative curve (in blue) indicates an open pangenome.

Phylogenetic analysis of Clostridium beijerinckii

The C. beijerinckii strains used for pangenomic analysis were isolated from soil, anaerobic sludge, human or pig feces, rumen contents, and PM of SFB ecosystems (Supplementary Table S1). A total of 192 sequenced C. beijerinckii strains (named with the prefix “DJ”) are commercial solventogenic Clostridia strains, but detailed information about them, including their location of isolation and origin, remains unknown. We constructed a phylogenetic tree for the 233 C. beijerinckii strains in this study based on concatenated core genome alignments (Figure 3). The five strains isolated from PM of SFB ecosystems were located in the same clade. Strains 3-8, G3-1, and G3-3 were located in a sub-branch close to a separate sub-branch containing strains 2-1 and G3-5, indicating that the five strains were closely related even though they were isolated from different pits in different SFB factories. In addition, we found that 11 strains isolated from fecal material were located in a single cluster, which was in the same clade as the five strains from the SFB ecosystems. The five strains isolated from different soil environments were scattered throughout the other clade and showed no phylogeographic relationships.

Figure 3

Phylogenetic tree of Clostridium beijerinckii strains based on concatenated amino acid sequences of the core genome. Red: strains from pit mud (PM); green: strains from fecal material; blue: strains from soil.

COG functional annotation of the Clostridium beijerinckii pangenome

The genes predicted to be in the pangenome were assigned to COG categories using eggNOG-mapper (Huerta-Cepas ). The results showed that 91.1% of the total number of genes in the core genome, 67.6% of the accessory genome, and 61.7% of the strain-specific genes were assigned COG terms. The largest category was function unknown (S), representing 19.4% of the core genome, 24.8% of the accessory genome, and 20.8% of strain-specific genes (Figure 4). Aside from genes assigned to the S category, most genes in the core genome were assigned to transcription (K), amino acid transport and metabolism (E), carbohydrate transport and metabolism (G), or energy production and conversion (C) (Figure 4). COG annotation of accessory genome and strain-specific genes revealed that the largest three categories were similar: K (11.9% and 12.3%, respectively), replication, recombination, and repair (L) (9.0% and 9.9%, respectively), and G (8.8% and 9.0%, respectively). A total of 584 genes in the accessory genome were assigned to the L category, among which were 88 transposases and 49 phage integrase family proteins. However, no transposases or phage-related proteins in the L category were found in the core genome. Similar results were observed in pangenomic analyses of C. perfringens (Kiu ) and C. butyricum (Zou ).

Figure 4

Distribution of Clusters of Orthologous Genes (COG) categories between the core genome, accessory genome, and strain-specific genes of Clostridium beijerinckii strains. (B) chromatin structure and dynamics; (C) energy production and conversion; (D) cell cycle control, cell division, chromosome partitioning; (E) amino acid transport and metabolism; (F) nucleotide transport and metabolism; (G) carbohydrate transport and metabolism; (H) coenzyme transport and metabolism; (I) lipid transport and metabolism; (J) translation, ribosomal structure, and biogenesis; (K) transcription; (L) replication, recombination, and repair; (M) cell wall/membrane/envelope biogenesis; (N) cell motility; (O) posttranslational modification, protein turnover, chaperones; (P) inorganic ion transport and metabolism; (Q) secondary metabolite biosynthesis, transport, and catabolism; (S) function unknown; (T) signal transduction mechanisms; (U) intracellular trafficking, secretion, and vesicular transport; (V) defense mechanisms; (W) extracellular structures; and (Z) cytoskeleton.

Genes shared only by strains isolated from PM of SFB ecosystems

According to the gene pan-matrix obtained from the BPGA pipeline, 298 gene clusters were found only in the five strains isolated from PM, of which 265 belonged to the accessory genome (including 228 genes shared by all five strains) and 33 were strain-specific genes. Of the 265 genes in the accessory genome, only 73 were assigned to COG categories, representing 27.5% of these genes (Table 2). This proportion was much lower than that of the accessory genome as a whole (67.6%). COG category distributions showed that the three largest groups were L, S, and K, representing 26.3%, 22.5%, and 11.3% of all genes, respectively. For the L category, five transposase-related genes and four genes belonging to the phage integrase family were found. For metabolic function, three genes, encoding N-acetylmuramoyl-l-alanine amidase, PFAM polysaccharide deacetylase, and Zn peptidase, may play roles in the degradation of macromolecules. N-acetylmuramoyl-l-alanine amidase (EC: 3.5.1.28) hydrolyzes the link between N-acetylmuramoyl residues and l-amino acid residues in certain cell-wall glycopeptides. PFAM polysaccharide deacetylase has hydrolase activity and acts on carbon-nitrogen (but not peptide) bonds.
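Selecting clusters that occur only in a chosen group of strains is a simple filter on the pan-matrix. The following sketch illustrates the idea on a toy matrix; the function name, the genome labels, and the data layout are assumptions for illustration and do not reproduce the BPGA output format.

```python
from typing import Dict, List, Set


def group_exclusive_clusters(pan_matrix: Dict[str, List[int]],
                             genomes: List[str],
                             group: Set[str]) -> Dict[str, int]:
    """Return clusters found only inside `group`, with the number of group members carrying each."""
    group_idx = [i for i, name in enumerate(genomes) if name in group]
    other_idx = [i for i, name in enumerate(genomes) if name not in group]
    exclusive: Dict[str, int] = {}
    for cluster, row in pan_matrix.items():
        in_group = sum(row[i] for i in group_idx)
        in_others = sum(row[i] for i in other_idx)
        if in_group > 0 and in_others == 0:
            exclusive[cluster] = in_group
    return exclusive


# Toy data: two pit-mud genomes and two genomes from other niches (hypothetical names).
genomes = ["pm_1", "pm_2", "soil_1", "feces_1"]
toy_matrix = {
    "c1": [1, 1, 0, 0],  # shared by both pit-mud genomes only
    "c2": [1, 0, 0, 0],  # specific to a single pit-mud genome
    "c3": [1, 1, 1, 0],  # also present outside the group
    "c4": [0, 0, 0, 1],  # absent from the group
}
pm_group = {"pm_1", "pm_2"}
exclusive = group_exclusive_clusters(toy_matrix, genomes, pm_group)
shared_by_all = [c for c, n in exclusive.items() if n == len(pm_group)]
strain_specific = [c for c, n in exclusive.items() if n == 1]
print(exclusive, shared_by_all, strain_specific)
```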

Table 2

Clusters of Orthologous Genes (COG) annotation of accessory genes shared only by Clostridium beijerinckii strains isolated from pit mud (PM)

COG category	Function description	2-1	3-8	G3-1	G3-3	G3-5
D	Phage tail tape measure protein	1^#	1	1	1	1
S	von Willebrand factor, type A	1	1	1	1	1
L	Subunit R is required for both nuclease and ATPase activities, but not for modification	1	1	1	1	1
T	Nacht domain	1	1	1	1	1
K	Bacterial RNA polymerase, alpha chain C terminal domain	1	1	1	1	1
L	DNA primase	1	1	1	1	1
D	DNA recombination	1	1	1	1	1
U	Dynamin family	1	1	1	1	1
S	Dynamin family	1	1	1	1	1
S	Dynamin family	1	1	1	1	1
L	Helicase activity	1	1	1	1	1
M	PFAM Glycosyl transferase family 2	1	1	1	1	1
S	Phage minor structural protein	1	1	1	1	1
L	Domain of unknown function (DUF4277)	1	0^#	1	1	1
L	Uncharacterized conserved protein (DUF2075)	1	1	1	1	1
L	TIGRFAM type I restriction system adenine methylase (hsdM)	1	1	1	1	1
EGP	Major facilitator superfamily	1	1	1	1	1
H	Catalyzes the cyclization of GTP to (8S)-3′,8-cyclo-7,8-dihydroguanosine 5′-triphosphate	1	1	1	1	1
G	N-Acetylmuramoyl-l-alanine amidase	1	1	1	1	1
K	DNA binding	1	1	1	1	1
V	Type I restriction modification DNA specificity domain	1	1	1	1	1
V	Type I restriction modification DNA specificity domain	1	1	1	1	1
EGP	Major facilitator superfamily	1	1	1	1	1
GM	Methyltransferase FkbM domain	1	1	1	1	1
GM	Methyltransferase FkbM domain	1	1	1	1	1
M	transferase activity, transferring glycosyl groups	1	1	1	1	1
S	Protein of unknown function DUF262	1	1	1	1	1
M	transferase activity, transferring glycosyl groups	1	1	1	1	1
L	Belongs to the “phage” integrase family	1	1	1	1	1
S	PFAM transposase YhgA family protein	1	1	1	1	1
L	Psort location Cytoplasmic, score	1	1	1	1	1
L	Transposase	0	0	1	1	1
K	Bacterial regulatory proteins, tetR family	1	1	1	1	1
K	LysR family	1	1	1	1	1
M	Catalyzes the reduction of dTDP-6-deoxy-l-lyxo-4-hexulose to yield dTDP-l-rhamnose	1	1	1	1	1
S	Glycosyltransferase like family 2	1	1	1	1	1
D	Cell division	1	1	1	1	1
S	Protein of unknown function (DUF2971)	1	1	1	1	1
S	PD-(D/E)XK nuclease family transposase	1	1	1	1	1
S	PFAM Abortive infection protein	1	1	1	1	1
K	PFAM Helix-turn-helix	1	1	1	1	1
E	Pfam: DUF955	1	1	1	1	1
S	head morphogenesis protein, SPP1 gp7 family	1	1	1	1	1
KT	Lecithin retinol acyltransferase	1	1	1	1	1
T	Diguanylate cyclase	1	1	1	1	1
L	Transposase DDE domain	1	1	1	1	1
M	Cell wall binding	1	1	1	1	1
D	Cell wall binding repeat	1	1	1	1	1
L	Transposase DDE domain	1	1	1	1	1
L	Resolvase, N terminal domain	1	1	1	1	1
S	Putative restriction endonuclease	1	1	1	1	1
E	Zn peptidase	1	1	1	1	1
S	Protein of unknown function (DUF2691)	1	1	1	1	1
G	PFAM Polysaccharide deacetylase	0	1	1	1	1
S	NADPH-dependent FMN reductase	1	1	1	1	1
S	Helix-turn-helix domain	1	1	1	1	1
S	Protein of unknown function (DUF3268)	1	1	1	1	1
S	Domain of unknown function (DUF4258)	1	1	1	1	1
L	Belongs to the “phage” integrase family	1	1	1	1	1
L	Psort location Cytoplasmic, score 8.87	1	1	1	1	1
L	Psort location Cytoplasmic, score	1	1	1	1	1
L	Staphylococcal protein of unknown function (DUF960)	1	1	1	1	1
L	Transposase	1	0	1	1	1
L	Belongs to the “phage” integrase family	1	1	1	1	1
L	Belongs to the “phage” integrase family	1	1	1	1	1
L	Psort location Cytoplasmic, score	1	1	1	1	1
K	Helix-turn-helix XRE-family like proteins	1	1	1	1	1
K	Helix-turn-helix XRE-family like proteins	1	1	1	1	1
K	PFAM helix-turn-helix HxlR type	1	1	1	1	1
S	Domain of unknown function (DUF3797)	1	1	1	1	1
C	Electron transfer flavoprotein	0	0	1	0	1
V	Mate efflux family protein	1	1	1	1	1
L	PFAM transposase, mutator	1	1	1	1	1

# 1: present; 0: absent.

Of the 33 strain-specific genes, only four were assigned to COG categories, all of which were from strain 2-1. The four genes encoded TIGRFAM phage replisome; permease for cytosine/purines, uracil, thiamine, allantoin; PFAM alpha amylase, catalytic; and tRNA cytidylyltransferase. The other 221 gene clusters that were found only in the strains isolated from PM were not assigned to COG categories. This unassigned fraction (74.1%) is far higher than those of the accessory genome (32.4%) or strain-specific genes (38.3%). The PM of SFB ecosystems is a relatively closed environment, and the microbial community in PM may have evolved for more than 100 years (based on the age of the PM) (Zhang ). Based on the transposases and phage integrases found in the genomes, this particular niche may have led to the evolution of the genomes of the five C. beijerinckii strains via gene loss or gain from other microbes in the PM of SFB ecosystems.

MGE analysis of five Clostridium beijerinckii strains from SFB ecosystems

The average genome size of the 233 sequenced strains was 6.11 Mb, which was larger than that of the five strains isolated from PM (5.59 Mb), suggesting that gene loss events had occurred. In addition, 265 gene clusters were shared only by the strains from SFB ecosystems, suggesting the acquisition of new gene clusters from other microbes in the SFB ecosystem. To investigate MGEs in the five C. beijerinckii strains isolated from PM, we first identified a total of 15 transposase and 12 phage integrase families present in all five strains according to COG analysis. Then, GIs were identified in the five C. beijerinckii genomes. The number of predicted GIs in the five C. beijerinckii genomes ranged from 18 to 21, and the ratios of the total size of GIs to the corresponding genome were between 3.3% and 7.2% (Table 3). Strain 2-1 had the largest total size of GIs, 406,702 bp, representing 7.2% of its total genome. More than 43% of the genes identified in GIs encoded hypothetical proteins with unknown functions. In addition, bacteriophage sequences were identified in the five strains from SFB ecosystems using PHASTER (Arndt ). A total of 33 bacteriophage sequences were found, of which 10 were questionable and 23 were incomplete (Supplementary Table S4). Twenty-four bacteriophage sequences matched phages of the genus Clostridium, including Clostridium phage phiCT19406A (5), Clostridium phage phiCT19406B (2), Clostridium phage phiCT453A (5), Clostridium phage phiCT453B (6), Clostridium phage phiCT9441A (4), and Clostridium phage phiCTC2B (2). The other nine bacteriophage sequences belonged to Bacillus phage BM5 (5) and Thermoanaerobacterium phage THSA-485A (4). The existence of a large number of MGEs, including transposases, phage integrases, GIs, and bacteriophage sequences, indicated that the genomes of the five C. beijerinckii strains isolated from PM evolved via the loss or insertion of DNA fragments from the microbiota of SFB ecosystems.

Table 3

Distribution of genome islands in Clostridium beijerinckii strains isolated from pit mud (PM)

Strains	Total size of GIs (bp)	Number of GIs	Total length of GIs/genome size (%)	Total proteins	Hypothetical proteins	Phage-related proteins
2-1	406,702	20	7.2	290	155	23
3-8	199,431	19	3.7	232	112	25
G3-1	269,305	21	4.8	290	125	20
G3-3	266,058	21	4.7	290	127	20
G3-5	185,086	18	3.3	230	124	16
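The percentage column of Table 3 can be recomputed from the GI sizes in Table 3 and the genome sizes in Table 1. The short worked example below does this and also derives the share of hypothetical proteins per strain; the variable names are illustrative, and all numbers are copied from the two tables.

```python
# Genome sizes (bp) from Table 1.
genome_size_bp = {
    "2-1": 5_626_308, "3-8": 5_461_616, "G3-1": 5_637_955,
    "G3-3": 5_609_807, "G3-5": 5_637_825,
}

# From Table 3: (total GI size in bp, total proteins in GIs, hypothetical proteins in GIs).
genome_islands = {
    "2-1": (406_702, 290, 155),
    "3-8": (199_431, 232, 112),
    "G3-1": (269_305, 290, 125),
    "G3-3": (266_058, 290, 127),
    "G3-5": (185_086, 230, 124),
}

gi_percent = {}
hypothetical_percent = {}
for strain, (gi_bp, proteins, hypothetical) in genome_islands.items():
    gi_percent[strain] = 100.0 * gi_bp / genome_size_bp[strain]
    hypothetical_percent[strain] = 100.0 * hypothetical / proteins
    print(f"{strain}: GIs cover {gi_percent[strain]:.1f}% of the genome; "
          f"{hypothetical_percent[strain]:.1f}% of GI proteins are hypothetical")
```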


Conclusions

In this study, we analyzed the genomes of five C. beijerinckii strains isolated from PM. Metabolic capabilities that are beneficial in the PM environment include assimilation of various carbon sources and ammonium; production of acetate, butyrate, lactate, hydrogen, and esters; and an inability to produce the undesirable flavors isopropanol and acetone. Our analysis of the genomes of 233 C. beijerinckii strains revealed an open pangenome with 12,572 gene clusters. A total of 298 genes were found only in the five C. beijerinckii strains isolated from PM, among which only 77 genes were assigned to COG categories. The existence of many MGEs indicated that the genomes of the five C. beijerinckii strains from the SFB ecosystem evolved in PM. This study will be helpful for future genetic diversification studies and further exploration of C. beijerinckii strains isolated from PM.

Data availability

The strains used in the present study are available upon request. The raw sequencing reads of C. beijerinckii strains 3-8, G3-1, G3-3, and G3-5 were deposited into the Sequence Read Archive (SRA) under accession numbers SRR15304600, SRR15316868, SRR15316869, and SRR15318628, respectively. The genomes of C. beijerinckii strains 3-8, G3-1, G3-3, and G3-5 were deposited into GenBank under accession numbers PRJNA690962, PRJNA695099, PRJNA695100, and PRJNA698586, respectively. The genome sequence of Clostridium beijerinckii strain 2-1 was deposited into NCBI under accession number PRJNA428897. The other C. beijerinckii genome sequences utilized for the pangenome analysis are listed in Supplementary Table S1. All sequences were downloaded from the National Center for Biotechnology Information (NCBI) genome database. Supplementary material is available at G3 online.
