Literature DB >> 36048731

Toward the Complete Functional Characterization of a Minimal Bacterial Proteome.

David M Bianchi¹, James F Pelletier², Clyde A Hutchison³, John I Glass³, Zaida Luthey-Schulten¹.

Abstract

Recently, we presented a whole-cell kinetic model of the genetically minimal bacterium JCVI-syn3A that described the coupled metabolic and genetic information processes and predicted behaviors emerging from the interactions among these networks. JCVI-syn3A is a genetically reduced bacterial cell that has the fewest number and smallest fraction of genes of unclear function, with approximately 90 of its 452 protein-coding genes (that is less than 20%) unannotated. Further characterization of unclear JCVI-syn3A genes strengthens the robustness and predictive power of cell modeling efforts and can lead to a deeper understanding of biophysical processes and pathways at the cell scale. Here, we apply computational analyses to elucidate the functions of the products of several essential but previously uncharacterized genes involved in integral cellular processes, particularly those directly affecting cell growth, division, and morphology. We also suggest directed wet-lab experiments informed by our analyses to further understand these "missing puzzle pieces" that are an essential part of the mosaic of biological interactions present in JCVI-syn3A. Our workflow leverages evolutionary sequence analysis, protein structure prediction, interactomics, and genome architecture to determine upgraded annotations. Additionally, we apply the structure prediction analysis component of our work to all 452 protein coding genes in JCVI-syn3A to expedite future functional annotation studies as well as the inverse mapping of the cell state to more physical models requiring all-atom or coarse-grained representations for all JCVI-syn3A proteins.

Entities: Chemical

Mesh：

Substances：
Proteome

Year: 2022 PMID： 36048731 PMCID： PMC9483919 DOI： 10.1021/acs.jpcb.2c04188

Source DB: PubMed Journal: J Phys Chem B ISSN： 1520-5207 Impact factor: 3.466

Introduction

JCVI-syn3A is a genetically reduced bacterium containing 493 genes and 543 kbp DNA, that was reduced from its parent organism Mycoplasma mycoides subsp. capri GM12,[1−3] giving it one of the most minimal genomes of any living cell. To briefly describe its development, in 2008, the J. Craig Venter Institute (JCVI) demonstrated the chemical synthesis of an entire bacterial genome.[4] This was followed by the synthesis of a M. mycoides subsp. capri GM12 genome and its later transplantation into M. capricolum recipient cells to produce JCVI-syn1.0.[1,5] Further cycles of the synthetic biology “design, build, test” strategy of the JCVI to reduce the genome by omission of nonessential genes followed by genome synthesis, transplantation, and growth testing gave JCVI-syn3.0.[2] Because JCVI-syn3.0 cells were fragile and difficult to manipulate and also because they appeared pleomorphic (with irregular shapes across a population), 19 genes from JCVI-syn1.0 were reintroduced to give JCVI-syn3A.[3] Crucially, only approximately 90 of its 452 protein coding genes (or less than 20%) remain without an annotated function[3] (compared to approximately 40% for Escherichia coli).These 90 genes can be further subdivided into approximately equivalent thirds which are essential, quasi-essential, and nonessential to JCVI-syn3A development, survival, and proliferation.[3] Thus, JCVI-syn3A has one of the most well-characterized genomes of any living cell. The simplicity and small number of genes with unclear function along with genome scale proteomics and essentiality assignments[3] makes JCVI-syn3A an attractive system from which to probe the fundamental principles of cellular life (i.e., in essence a “hydrogen atom” for cell biology) as well as a potential platform from which to base future efforts in metabolic engineering of useful industrial or biomedical compounds.[6] While experimental efforts to characterize JCVI-syn3A genes of unknown function have been undertaken,[7] these remain time-consuming and costly, relative to guidance that might be obtained by computational means. Many bioinformatics approaches exist for characterization of genes of unknown function in relatively understudied organisms such as JCVI-syn3A or the mycoplasmas as a whole. Several previous efforts have sought to further clarify the contents of the JCVI-syn3A genome and that of its precursor JCVI-syn3.0. First, Danchin and Fang[8] studied the genome of JCVI-syn3.0 via an engineering perspective, by streamlining their investigative approach to the paleome content, that is the functional units common to all bacteria of this type, thereby suggesting the necessary generic cellular functions of JCVI-syn3.0. In this way they filled in previously unconsidered gaps connecting genes to macro-scale cellular functions and assigned probable functions to 13 of the unknown genes that could be interrogated further. In a second effort, Yang and Tsui[9] arrived at functional annotations based on secondary structure element alignments with the program SSEalign. This program connects prediction of structural motifs to a database of sequence motifs, leveraging the fact that amino acid sequences have evolved in a much more expansive fashion relative to protein secondary structures and individual protein folds. The algorithm can also take into account the protein–protein interaction network of JCVI-syn1.010 and generates support for an assignment if the identified E. coli homologue of the JCVI-syn1.0 protein from the SSE procedure shares a similar interaction network. Uptake of nutrients and metabolic building blocks from the growth media or external environment has been demonstrated to be crucial for JCVI-syn3A by both experiments and theory.[3,11] With this in mind, Antczak et al.[12] used several approaches, including domain identification,[13] prediction of transmembrane helices,[14] structural modeling (Phyre2),[15] ligand prediction,[16] among others,[17] that identified 24 genes likely coding for transporters. Along the same way of thinking, we examined cellular response to growth medium changes in JCVI-syn3A via the construction of whole-cell models that can respond to environmental changes.[11,18] Most recently, Zhang et al. employed a deep-learning, contact-assisted structure prediction method, followed by structure-based annotation to suggest biological roles and protein–protein interactions for several JCVI-syn3A proteins[19] by the C-I-TASSER-COFACTOR method. This method initially uses the I-TASSER protein structure prediction program[20] combined with deep-learning based protein residue–residue contact map predictions. Then, protein–ligand binding sites were predicted using COFACTOR[21] to help elucidate possible gene product functions. The previous bioinformatic processes, while often high-performance and high-throughput in their ability to be applied across the entire JCVI-syn3A genome, fail to account for certain factors in casting a wider net. For example, pipelines that depend heavily on mapping to existing structural motifs can suffer when no experimental structures for analogous bacterial proteins are available from the relevant databases. Using the recent and highly successful deep-learning based protein structure prediction method AlphaFold2,[22] these missing structures can be generated. While structure can offer hints as to the function of a protein, additional computational and experimental studies are required. Energy landscape theory[23] has shown the importance of nearby conformations in considering both the folding pathways as well as a function of a protein. For this reason, we provide predicted structures relaxed via AMBER[24] and use a flexible structural alignment approach when comparing predicted structures to experimental structures.[25] In taking a holistic and individualized view that accounts for the genome locality (i.e., which genes are neighboring to an unknown gene), interactomics, and where a gene product may fit in as a missing component of a larger cellular network and combining this with recent protein structure prediction tools,[22] we are able to provide increased resolution and predictive power for specific and essential JCVI-syn3A genes of unknown function. The processes of cell growth and division in JCVI-syn3A are of keen interest following recent work, both computational[11] and experimental.[26] First, Pelletier and co-workers[26] demonstrated that, of the 19 genes retained in JCVI-syn3A but not JCVI-syn3.0, 7 (including 5 genes of unclear function) were required together to reduce morphological variation in JCVI-syn3A. Two of the genes, ftsZ/0522 and sepF/0521, contribute to cell division in most bacteria, while five do not have a known biomolecular function. Then, Thornburg et al.[11] constructed a whole-cell kinetic model of JCVI-syn3A, describing growth emerging from metabolism and gene expression (including the contribution of integral membrane proteins and lipids etc.). Currently, the model has few instances of direct regulation of growth by specific genes, such as phoU/0428. Regulation of growth could further be influenced by the products of genes of unknown function. With this in mind, we sought to further clarify gene products involved in regulating JCVI-syn3A growth and division, to inform future experimental and modeling efforts. For these processes to be characterized fully, it is important to examine directly the individual molecular actors that contribute to the required increase of biomass and physical force necessary to produce daughter cells. In addition to those genes implicated in cell growth and division, we examined all of the 30 genes of unclear function that were classified as essential for JCVI-syn3A growth by transposon mutagenesis and computational analysis.[3] We demonstrate that several of these genes code for products involved in membrane maintenance, assembly of protein complexes, and regulatory and other crucial developmental processes of JCVI-syn3A cells. Of the approximately 90 protein-coding genes of unclear function reported by Breuer et al.,[3] we have upgraded the annotations for 25 genes with high confidence, while providing additional information for tens of the remaining protein-coding JCVI-syn3A genes. Our computational analysis was applied to the sequences of all the 452 proteins giving us complete set of structural models for JCVI-syn3A. We then analyzed our results to generate a selection of gene product annotations of interest provided in Supporting Information section S1, in addition to a larger annotation summary given in Supporting Information Spreadsheet S1 and Supporting Information section S3, and structural database (with an associated analysis notebook) found in Supporting Information section S2.

Materials and Methods

Methods

Critical to the computational workflow that we apply is the development and usage of the AlphaFold2 protein structure prediction software.[22] This novel, deep-learning based method utilizes existing sequence analysis tools (such as HHsuite3[27] and HHSearch[27]) in conjunction with a massive genetic database and structural training data that propelled it to win the CASP14 structure prediction contest as well as being named the Nature “Method of the Year”.[28] Here we review the techniques and concepts that we utilized in concert with AlphaFold2 to develop functional characterizations and structure predictions of JCVI-syn3A protein-coding genes.

Sequence Alignment of Conserved Motifs

Sequence Alignment of key sequence motifs can also be a powerful tool as we will demonstrate in the case of secDF/0412. Multiple sequence alignment tools such as CLUSTALW[29] and the HMMER Server from the Max Planck Institute (MPI) Bioinformatics Toolkit,[30] which allows for sensitive sequence based searching using profile hidden Markov models (HMMs), provide an avenue for such investigations. Another useful tool, the HHPred server from the MPI Bioinformatics toolkit allows for homology detection by multiple sequence alignment followed by HMM-HMM comparison.[31,32] Such a protein sequence analysis was generated during prediction runs using the HH-suite3[27] capabilities present within AlphaFold2, the results of which are provided in the Supplemental Database given in Supporting Information section S2. Finally, amino acid sequence to KEGG functional orthology analysis by BlastKOALA[33] and previous bioinformatic analysis of genes in the related Mycoplasma organism Mesoplasma florum L1(34) allowed for a narrowing based on their findings of the general functional categories in which genes of unknown function may lie. The continuing expansion of available sequence data alongside novel coevolutionary analysis methods that examine individual sequence-structural components[35] will allow further leveraging of sequence motif conservation for gene functional identification.

Protein Structure Prediction

To improve upon previous work, we used a protein structure prediction based pipeline via the software package AlphaFold2[22] to further analyze annotation updates for JCVI-syn3A genes leveraging structure-to-function relationships (with comparison to the related method RoseTTAFold[36] for gpsB/0353). This software suite, which has revolutionized protein structure prediction via deep learning-based methodologies following their performances in vastly outpacing all other competitors at the Critical Assessment of Structure Prediction (CASP) contest CASP14, allows for a prediction of structure from amino acid sequence with remarkable accuracy when comparing to experimentally solved structures from X-ray crystallography, nuclear magnetic resonance (NMR), or cryo-electron microscopy (cryo-EM). All AlphaFold2 predictions were generated using the default parameters that are most similar to those used in the CASP14 event. This includes number of recycles = 3, number of models = 5, and the “full” 3+ terabyte structural and sequence genetic databases, and with PDB entries deposited on or before 9-15-2021. We provide both “unrelaxed” predictions and structures relaxed via the molecular dynamics platform AMBER,[24] as well as the associated lDDT confidence scores[22] for each model. Simulations were run on a local heterogeneous CPU-GPU compute cluster containing NVIDIA A40 and NVIDIA Tesla V100 GPUs and Intel Xeon Gold 6154 CPUs, with 1 GPU and 8 CPUs being used for each protein structure predicted.

Protein Sequence-to-Structure Threading

Protein structure prediction methods are useful either containing or in conjunction with other bioinformatic tools such as Phyre2,[15] HMMER,[30] and LOMETS,[37,38] which provide a threading capability that connects amino acid sequence to specific protein fold motifs that are associated with these programs. The predicted number of known amino acid sequences is several orders of magnitude greater than the approximately 1000 predicted similar protein fold families (such as those deposited in Pfam[39]) that have been observed in experimental structures deposited in the RCSB PDB (Protein Data Bank).[40] This sequence-to-structure threading investigation routine is useful in guiding candidates for pairwise structural alignment to extract significant structural similarity in conjunction with AlphaFold2 outputs. Such structural analysis was also generated during prediction runs using the HHSearch[27] capabilities present within AlphaFold2, the results of which are provided in the supplemental database given in Supporting Information section S2.

Protein Pairwise Structural Alignment

Various tools then exist that can be used to compare proteins via structural alignment methods such as STAMP,[41] and aligned structures can then be visualized in VMD.[42] When significant agreement between the top ranking predicted structures from AlphaFold2 to previously solved crystal structures from related bacterial organisms deposited in RCSB PDB[40] is obtained by this methodology, such a result motivates an increased gene functional annotation and the development of experiments to more specifically determine the function the gene in question. The quality of alignment can be interrogated and visually represented using Qres scoring and coloring, which calculates the fraction of similar native contacts between aligned residues in two or more protein structures via an energy landscape theory approach described by Eastwood et al.,[43] derived from the original theory.[44,45] An alternative structural alignment tool that is preferable for quantitative pairwise structural alignment is FATCAT.[25] FATCAT provides a flexible protein structural alignment algorithm that uses rotations and translations of one protein structure to minimize the root-mean square deviation (RMSD) between the two structures being aligned, thereby accounting for structural rearrangements due to crystallization conditions or the presence of various biological isoforms within a living cell. The FATCAT algorithm uses a scheme that creates various linked fragments of a protein in its optimization routine, which are chained and twisted to arrive at a final structural alignment with a p-value describing significance of structural similarity. The equations involved in this process are given in further detail in the Formulas section and in eqs , 3, and 4. The FATCAT alignment process was attempted for genes given in the main text and in Tables S1 and S2).

Interactomics and Genomic Locality

Interactomics data, such as that presented in the SynWiki resource compilation of JCVI-syn3A-related data[10] and deposited in the STRING database,[46] is helpful in determining the function of a gene of unknown function when genes from a related cellular process or with known physical interactions are observed in its interactomic network. Important to note is that this resource takes the STRING data for a highly phylogentically related organism, Mycoplasma mycoides subsp. mycoides SC PG1, to populate data for JCVI-syn3A genes. In conjunction with this, biological insight into genome locality (i.e., operonal or transcription unit structure) of related bacterial organisms can assist in clarifying the function of unknown JCVI-syn3A genes since genome architecture/gene locality is often significantly conserved in related organisms.[47,48] This is especially useful for proteins where experimentally determined structures are absent or sparse and sequence is not necessarily well conserved (e.g., atpI/0797, which is part of a similar operon structure in Bacillus subtilis(49)) that can be visualized in SubtiWiki.[50]

Formulas

Structural alignment via the FATCAT pairwise alignment tool[25] is a key metric by which we were able to assess confidence in functional assignments of JCVI-syn3A genes of unknown function based on the alignments of their predicted structures from AlphaFold2[22] and RoseTTAFold[36] to experimentally determined structures. For clarity, the process of FATCAT alignment scoring metric calculation is summarized below. FATCAT uses a flexible alignment methodology that takes advantage of aligned fragment pairs (AFPs), which define transformations of local structural elements of a protein. This representation allows for alignment that is not detracted by potential artifacts that may be due to crystallization conditions or regions of high error for a structure generated by protein structure prediction. These constructs are then translated to the mathematical graph formulation used for the structural scoring method detailed in eq . For a more detailed explanation, we direct the reader to the work by Li and co-workers.[25] FATCAT alignments are initially characterized with the chaining score given bywhere S(k) is the best score along the AFP k, a(k) is the score of AFP k itself, c(m → k) is the score of introducing a connection between AFP m and AFP k, and T(k) is the number of twists required to connect the chain of AFPs leading up to S(k). Such that And with similarity score:where cs is the previously mentioned chaining score from eq , L is the number of equivalent positions in the alignment, RMSD is the overall RMSD between the two structures when one structure is rearranged at the positions where twists are detected by FATCAT, and N is the number of blocks in the alignment (numberoftwists + 1). Generating p-value:where the location and the scale parameter of the extreme value distribution of FATCAT similarity scores of unrelated structures were determined by empirical simulation described by Li et al.[25]

Results

From Mother to Daughter: Characterization of Molecular Actors of Cell Growth, Division, and Morphology

Of keen interest are the additional biomolecular actors regulating cell division in JCVI-syn3A. Recently, it was shown that a set of seven genes that were deleted in the genome reduction from JCVI-syn1.0 to JCVI-syn3.0 and restored in the construction of JCVI-syn3A are necessary to maintain proper cell morphology.[26] Of this group, one gene, JCVISYN3A_0520 (that from now on will be referred to as /0520, and likewise for the other JCVI-syn3A genes discussed) is adjacent to genes of the highly conserved division and cell wall (dcw) cluster[51−53] and has been previously characterized for Mycoplasma genitalium.[54] Another of these genes (/0527) immediately precedes the dcw cluster on the reverse strand of the circular genome. From previous bioinformatic investigations,[12,19] /0520 is characterized as a putative member of the α–β hydrolase superfamily while /0527 is annotated as a protein of unknown function containing a domain of the DUF177 family whose members have been posited to participate in membrane protein biosynthesis among other roles.[26,55] Other dcw cluster region genes include sepF/0521, ftsZ/0522, and ftsA/0523 that are known to participate in cell division,[56]mraW/0524 which codes for a regulatory protein that can methylate the 16S rRNA, and mraZ/0525 that acts as a transcriptional repressor of the dcw cluster[53,54] but curiously was not detected in JCVI-syn3A proteomics.[3] In the following paragraphs, we suggest annotations for genes coding for common products of the Gram-positive divisome and provide possible explanations for the causation of irregular morphologies by deletion of the aforementioned dcw cluster adjacent genes.

gpsB/0353

Gene/0353 was previously annotated as a quasi-essential gene of unknown function. Here, we demonstrate that this gene codes for a gpsB-like protein that interacts with ftsZ/0522 to ensure proper cell division behavior. In many bacterial species, GpsB and its homologue DivIVA contribute to cell division, cell growth, and chromosome segregation.[57] Predicted structures for gene gpsB/0353 from AlphaFold2[22] (and the related protein structure prediction software RoseTTAFold[36]) align nearly exactly with the crystal structure of the GpsB protein from B. subtilis (RCSB PDB: 4UG3(58)) when aligned only at the N-terminal coiled-coil domain using STAMP[41] and visualizing in VMD[42] as is shown in Figure A. A long 65 amino acid, completely α-helical C-terminal domain of /0353, extending beyond the length of the crystal structure chosen for structural alignment, is not shown. The C-terminal domain of GpsB has yet to be experimentally determined in conjunction with the N-terminal domain to our knowledge. Nonetheless, even without the C-terminal domain of the experimental structure present, the predicted protein structure from AlphaFold2 and the experimental structure from B. subtilis are still judged as significantly similar by FATCAT alignment[25] with a p-value of 2.72 × 10–6 and all 60 residues of the crystal structure being evaluated as matching the predicted structure with a RMSD of only 1.10 Å and 1 twist (via the flexible alignment procedure described in the Formulas section). The gene locality of gpsB/0353 near that of recU/0351 within the JCVI-syn1.0 genome has been demonstrated previously in B. subtilis(59) and is thought to be significant due to the fact that recU is required for the segregation of chromosomes into daughter cells[60] in conjunction with cell division that is mediated by GpsB. Further support exists to distinguish gene gpsB/0353 from its orthologue divIVA, that is also involved in the Gram-positive cell division landscape and has been included as a potential functional assignment via previous bioinformatic work.[12] GpsB proteins are observed to have approximately 120 amino acids or less across a variety of organisms (gpsB/0353 has 125 residues) and contain the conserved “TNFDILK” consensus sequence in the C-terminal region (the portion not shown in structural modeling but analyzed here that can be seen aligned to sequences from B. subtilis and Staphylococcus aureus in Figure C), while cell division DivIVA homologs are always greater than 160 amino acids in length.[59]

Figure 1

(A) Predicted structures of gpsB/0353 from AlphaFold2 (red) and RoseTTAFold (blue) aligned with the experimental structure of gpsB from B. subtilis (RCSB PDB: 4UG3)[58] at the N-terminal coiled-coil domain using STAMP.[41] (B) Predicted structure of lemA/0604 (see section 3) from RoseTTAFold (blue) aligned by FATCAT[25] with the lemA two-component cell growth regulator from T. maritima (yellow, RCSD PDB: 2ETD(61)). 142 of the 210 residues (68%) are well-aligned with a RMSD of only 1.67 Å. (C) Multiple sequence alignment of /0353 from JCVI-syn3A with sequences of gpsB from B. subtilis (strain 168) and S. aureus (strain NCTC 8325/PS 47) shows the conserved C-terminal domain consensus sequence,[59] lending further evidence to the assignment of gpsB/0353 instead of its paralogue divIVA. All images visualized in VMD.[42]

ezrA/0239

When considering candidate genes for those unaccounted for in the Gram-positive bacterial divisome, which are excellently reviewed in a study regarding ftsZ,[56] we considered the fact that some of these genes may interact with the well-known proteins ftsA/0523 and ftsZ/0522. Of this pool of candidates, one showed similarities to /0239 in terms of both sequence and structural motifs, namely the ftsZ protofilament bundle regulator ezrA. This protein has been demonstrated to prevent aberrant FtsZ Z-ring formation in low GC Gram-positive bacteria that are closely related to JCVI-syn3A.[62] First, ezrA is well-known to interact with ftsZ via the sequence motif known as the “QNR” patch,[63,64] that is conserved across several bacterial organisms including B. subtilis, Staphylococcus aureus, Streptococcus pnuemonia, Lactobacillus acidophilus, and Enterococcus faecilis. When the amino acid sequence corresponding to /0239 is aligned against the sequences of ezrA in B. subtilis and S. aureus via CLUSTALW,[29] the “QNR” region from residue 504–510 is relatively conserved at the same residues as it is for the other bacteria. Most importantly, the central asparagine residue that is conserved across nearly all bacterial species (see Figure C) is present. When a structural alignment of the closely phylogenetically related Gram-positive bacterium B. subtilis ezrA to the predicted protein structure of /0239 from JCVI-syn3A is generated by AlphaFold2, a high degree of similarity is observed (Figure A). By FATCAT alignment, the structures are characterized as significantly similar to a p-value of approximately 3.01 × 10–10 (with values < 5.00 × 10–2 indicating significant structural similarity) and with 498 equivalent positions with a RMSD of 4.64 Å with 4 twists (see Figure A). Gene /0239 contains 595 residues, meaning that approximately 84% of these residues are well-aligned to the EzrA protein from B. subtilis. Furthermore, the predicted structure for /0239 is heavily covered with hydrophobic residues at both the N and C terminus regions and the same sites that are predicted to bind to the membrane and also have high hydrophobicity in the experimental structure from B. subtilis (see Figure B). Finally, further support is provided by the genome architecture of ezrA in the closely related Gram-positive bacterial organism B. subtilis where the ezrA gene is located just upstream of thiL on the reverse strand (separated by only a few genes) and immediately precedes rpsD on the forward strand. This exact genome arrangement of ezrA between thiL and rpsB is seen for ezrA/0239 in JCVI-syn3A (see Figure D).

Figure 2

(A) Structural alignment by FATCAT[25] of the predicted AlphaFold2 structure for gene /0239 (blue) to the structure of the cell division regulatory protein EzrA from B. subtilis (yellow, RCSB PDB: 4UXV(65)) with 498 or 84% of the residues of /0239 being well-aligned to the experimental structure with a p-value of 3.01 × 10–10 for significant structural similarity. (B) Hydrophobicity coloring (red, hydrophobic resiudes; blue, hydrophilic residues) for the AlphaFold2 predicted structure for /0239. Hydrophobic regions at the N and C termini match the previous findings that EzrA binds the membrane at each terminal domain.[65] (C) Sequence Alignment of gene /0239 with the sequences for ezrA from S. aureus and B. subtilis with BLOSUM50 similarity score coloring. While the entire characteristic “QNR” patch[63,64] is not present in JCVI-syn3A, significant sequence similarity remains and the nearly universally conserved asparagine residue located centrally in the patch is present. (D) The genome architecture of JCVI-syn3A also lends support to the assignment of ezrA/0239. Just as in the related Gram-positive organism B. subtilis, the gene coding for EzrA lies between the genes for thiL coding for thiamine monophosphate kinase and rpsD coding for ribosomal protein S2 on opposing strands of the circular genome.

yqkD/0520

If we consider gene /0520 in particular, while it is expressed well below the value of approximately 200 protein copies observed to be expressed on average for JCVI-syn3A genes,[3] it still retains a proteomic value on par with the dcw gene adjacent to it, sepF/0521, and the nearby ftsA/0523 having 50 and 51 copies respectively, with /0520 having 30 copies measured in a proteomic analysis of JCVI-syn3A.[3] Previously, Pelletier et al. and others[19,26] reported this gene as being a member of the α–β hydrolase family. Interestingly, this superfamily, containing proteins whose functions are wide-ranging and are reviewed in ref (66), has members with functions related to cell growth, such as lipases and serine aminopeptidases that may significantly modify lipids and proteins that make up the JCVI-syn3A membrane and bacterial membranes in general. A review of the similarity in sequence and structural motifs between α–β fold lipases and peptidases in this type has been conducted,[67] that is especially of interest in terms of the prolyl-oligopeptidase family proteins. Of the aforementioned candidates, the predicted structure of gene /0520 aligns especially well to the AlphaFold2 predicted structure of the esterase/α–β hydrolase yqkD from B. subtilis(68) with over 96% of residues well-aligned, a RMSD of 3.00 Å, and a p-value for similarity greater than 1.00 × 10–15 (see Figure A and Table ). Providing further support, structural alignment of the AlphaFold2 predicted protein structure for /0520 to a Pfam PF2001 α–β hydrolase from Pyrococcus furiosis (RCSB PDB: 5G59(69)), that also has esterase-like properties, shows that the hydrolase helical bundle is highly structurally conserved, which can been seen by Qres coloring of the STAMP structural alignment in VMD (see Figure B, C). This catalytic site conservation is additionally corroborated by previous bioinformatic work by Zhang et al.[19] via the COFACTOR analysis pipeline where peptidase ester or lipase cosubstrate binding sites are predicted at residues 96 and 166–168 especially. Amino acid sequence analysis[27] also detects the previously observed lipase sequence motifs HG and GXSXG (where X is a wildcard) from Mycoplasma mycoides subsp. mycoides LC.[70] Experimentally determined interactomics data reported in SynWiki[10] shows interactions of /0520 orthologues with the membrane protease lon/0394, that degrades FtsZ among other related substrates in Mycoplasma pneumoniae,[71] so /0520 and/or Lon could be involved in degrading controllers of cell division. If /0520 does indeed code for a yqkD-like serine peptidase and is of the chymotrypsin like substrate-specificity family, it would then degrade membrane proteins (thus explaining the presence of Lon in interactomic data) in which medium sized hydrophobic residues such as Tyr, Phe, and Trp are exposed.[72] Interestingly, this gene has also been observed to be positively correlated with the expression of other lipases and esterases in B. subtilis.[73] We have previously demonstrated that the JCVI-syn3A membrane is quite protein-rich[11] with over 10% of the proteome and approximately 50% of cell membrane surface area composed of membrane proteins. Thus, the absence of such a yqkD-like aminopeptidase/protease could substantially modify the membrane composition of JCVI-syn3A cells that give it regular daughter cell morphologies. Additionally, if /0520 was to have lipase activity, one could also anticipate that its deletion could have substantial effects on daughter cell morphologies, due to the elimination of membrane remodeling abilities that may allow for scavenging or transfer of acyl chains between lipids that may have played a key role in the pathogenic behaviors of the JCVI-syn3A Mycoplasma parent organism as it does in related organisms.[74]

Figure 3

(A) Structural alignment of the AlphaFold2 predicted structure of the YqkD esterase/lipase/serine aminopeptidase from B. subtilis,[68] (yellow) to the predicted protein structure of /0520 from AlphaFold2 (blue) using FATCAT 2.0.[25] The alignment characterizes the structures as being significantly similar to more than 95% of the residues of /0520 being well-aligned to the B. subtilis protein predicted structure. (B) Structural alignment via STAMP of P. furiosis PF2001 hydrolase (yellow, RCSB PDB: 5G59(69)) to the predicted structures for /0520 from both AlphaFold2 (red) and RoseTTAFold (blue). (C) The alignment with Qres coloring[43] shows that the hydrolase helical bundle is highly conserved (blue-white-red color scheme of decreasing conservation), supporting the assignment of /0520 as an aminopeptidase/esterase/lipase, of the α–β hydrolase superfamily (see ref (66) and Pfam), and suggesting that PF2001 might play a similar role beyond its confirmed general esterase activity.

Table 2

FATCAT[25] Structural Alignment Statistics for Proteins of Formerly Unknown Function in JCVI-syn3A Cell Growth, Maintenance, and Divisiona

gene	exp. struct. PDB ID	p-value	RMSD, Å (% aligned)
ezrA/0239	4UXV[65]	3.01 × 10^–10	4.64 (84%)
gpsB/0353	4UG3[58]	2.72 × 10^–6	1.10 (100%)
secD/0412	5XAM[102]	9.20 × 10^–4	6.10 (41%)
secF/0412	5MG3[103]	6.49 × 10^–5	3.05 (53%)
yloU/0421	B. subtilis (AF2)[104]	1.11 × 10^–11	2.29 (99%)
ylxM/0430	1S7O[82]	3.21 × 10^–10	1.99 (99%)
yqgP-gplG/0516	B. subtilis (AF2)[87] and 2NR9[89]	7.19 × 10^–9	3.07 (82%)
yqkD/0520	B. subtilis (AF2)[68]	0.00 × 10^–0	3.00 (96%)
ylbN/0527	B. subtilis (AF2)[76]	3.03 × 10^–7	2.45 (84%)
lemA-gacS/0604	2ETD[61]	6.78 × 10^–5	1.67 (68%)
atpI/0797	B. subtilis (AF2)[98] and 6FKI[101]	1.81 × 10^–11	1.22 (92%)
gabP/0878	B. subtilis (AF2[105])	4.47 × 10^–11	3.23 (87%)
yidC/0908	M. pneumoniae (AF2[106])	0.00 × 10^–0	2.33 (92%)

Alignment values generated from experimentally solved crystal structures or with AlphaFold2 (AF2) structures where this abbreviation is indicated, aligned with predicted AlphaFold2 structures for JCVI-syn3A genes. A p-value of less than 5.0 × 10–2 demonstrates significant structural similarity. A value of 0.00 indicates a p-value of less than 1.0 × 10–20. The percent aligned relates to the percentage of residues in the protein of shorter length that are well-aligned to the larger structure by FATCAT, with the associated RMSD. Equations generating these statistical values are given in the Formulas section.

Drivers of Cell Growth: Ribosomal Assembly and Protein Translocation Machinery

Previously, we have developed with our dynamic whole-cell model of JCVI-syn3A a cellular growth module that predicts doubling times across a population cells. The growth of the JCVI-syn3A cells is dependent on lipid uptake and synthesis, and membrane protein translocation, since both of these components contribute surface area to the membrane as it doubles.[11] For this reason, clarifying those genes coding for proteins involved the processes of translocation and degradation of membrane proteins and remodeling of the membrane is of paramount importance in improving our modeling efforts and deepening our understanding of JCVI-syn3A cell growth.

ylbN/0527

With gene yqkD/0520 identified, this leaves only one gene unannotated from the unannotated genes surrounding the JCVI-syn3A dcw cluster. This gene, /0527, had previously been bioinformatically characterized to contain a DUF177 domain.[26] Not far from the dcw operon in the genome of B. subtilis is the gene ylbN that is involved in the regulation and assembly of the 23S rRNA Large Subunit (LSU) component. Like gene /0527, YlbN contains a reported DUF177 domain, from residues 56–165 in B. subtilis, with a conserved 158Arg residue[75] that is also present in JCVI-syn3A when the sequence of /0527 is aligned to the B. subtilis sequence (see Figure C). The locality of gene /0527 adjacent to LSU component rpmF/0526 coding for ribosomal protein L32 (and that is also adjacent to ylbN in B. subtilis; see Figure A) lends additional support to this hypothesis. Structural alignment of the Alphafold2 prediction for /0527 to the predicted structure from AlphaFold2 for ylbN from B. subtilis,[76] that is the best available structure since none have been experimentally solved, results in an excellent agreement demonstrated in Figure B. Also, the interactome of /0527 reported in SynWiki[10] includes several LSU ribosomal proteins (L1, L13, L21, L32) including rplA/0809 (protein L1) that binds directly to the 23S rRNA.

Figure 4

(A) Genome locality of ylbN and ylbN/0527 in B. subtilis and JCVI-syn3A, respectively. The ylbN genes are colored in red, while adjacent genes located prior to the dcw cluster in both organisms include rpmF, mraZ, and rsmH (also known as mraW) (blue), with additional genes only present in B. subtilis shown (green). Both gene clusters show similar orientation and arrangement on the reverse strand. (B) Structural alignment of AlphaFold2 prediction of ylbN/0527 (blue) and the predicted AlphaFold2 structure for ylbN in B. subtilis 168 (yellow[76]). (C) CLUSTALW sequence alignment of ylbN/0527 and ylbN from B. subtilis with BLOSUM50 similarity coloring. The conserved 158R reported in ref (75) is aligned in the “DPR” region of both sequences. Due to its reduced genome and pathogenic origins, JCVI-syn3A has been considered to be a cell that “lives on the edge” in terms of maintaining its energetic requirements and cellular homeostasis.[11] Further investigation into these biological roles may shed some light onto why ylbN/0527 is a pivotal component in regulating JCVI-syn3A daughter cell morphology and proper development behavior.[26] A study in which a DUF177 domain ylbN-orthologue knockout strain was generated in Zea mays led to the observation of a significant decrease in 23S rRNA production.[55] Previously, in B. subtilis,[77]ylbN has been observed to be downregulated by the stringent response (for example, by the alarmone metabolite ppGpp that is synthesized by a gene present in JCVI-syn3A relA/0414[78]), whereby a cell reduces its growth activity to conserve energy often by decreasing the generation of ribosomes.[79] These two studies implicate YlbN in 23S rRNA accumulation and in ribosome biogenesis. However, at this point, this remains a speculative suggestion as to the possible linkage between gene ylbN/0527 removal and the generation of JCVI-syn3A daughter cells with irregular morphologies and further experimental study is needed (see section 4).

secDF/0412

Related to cell surface growth and more specifically to protein translocation across the membrane, we have further support to annotate /0412 as a putative secDF protein export enhancing membrane component. SecDF has been demonstrated to interact with the SecYEG translocon, which comprises secY/0652, secE/0839, and secG/0774, and enhance its translocation efficacy by proton-motive force that is transduced to SecYEG.[80] We have determined that SecF likely comprises the amino acids from 948 to 1384 while SecD comprises the residues from 1 to 947 given in the JCVI-syn3A NCBI entry (NCBI GenBank: CP002027.1). Each of these two regions contains conserved sequence motifs that have been observed for both SecD and SecF across a variety of organisms,[81] that previously supported the putative assignment of the analogous gene in M. florum L1 as secDF-like.[34] Structural alignments of the AlphaFold2 predicted protein structures for both SecD and SecF (blue) were aligned with FATCAT[25] to the analogous structures from Deinococcus radiodurans and E. coli, respectively. The secF portion has 52% of its residues well-aligned to the experimental structure with a RMSD of 3.05 Å, while the SecD portion has 43% alignment and a p-value of 9.20 × 10–4 denoting structural significance (see Table ). The interactome of this gene including other translocation related genes such as secY/0652, secA/0095, and the translocation regulator ftsY/0429 as well as the gene we will discuss next also support secDF assignment. There is additional support present for this functional assignment in conserved genome architecture of genes: relA/0414 and apt/0413 are adjacent to secDF/0412, that is also seen in B. subtilis (see Figure A).

Figure 5

(A) Genome architecture of the related Gram-positive bacterium B. subtilis near the gene coding for SecDF translocon components and that adjacent to gene/0412 show conservation. (B) STAMP structural alignment with visualization in VMD of the AlphaFold2 predicted structure of gene 0430/ylxM (red) and the experimentally determined structure for YlxM-like effector of the signal recognition particle from S. aureus (yellow, RCSD PDB: 1S7O(82)). (C) Qres coloring[43] (with a blue-white-red color palette of decreasing structural similarity) shows a high degree of structural conservation between the predicted and experimental structures.

ylxM/0430

Additionally, a previously unknown essential JCVI-syn3A gene that likely has a critical role in cell growth, gene ylxM/0430, has been identified as a putative effector of the signal recognition particle (SRP) ribonucleoprotein (whose RNA component is coded for by gene ffs/0049). When the predicted structure from AlphaFold2[22] is aligned to the crystal structure of the YlxM-like effector of the SRP from Streptococcus pygenes (RCSB PDB: 1S7O(82)), significant structural conservation is observed (see Figure B and C, regions where structural motifs are shaded blue via the Qres[43] coloring scheme). The effector of SRP, YlxM, is thought to modulate GTPase activity and thereby influence recycling of SRP components in the membrane protein translocation process that has been observed in Streptococcus mutans.[83] Improved kinetic modeling attained by including this protein in the JCVI-syn3A cell growth module implemented in Thornburg et al.[11] may have substantial effects via interactions with the 4.5S RNA signal recognition particle ffs/0049. The SRP system consists of the secYEG translocon consisting of the genes secY/0652, secE/0839, and secG/0774 that is responsible for receiving newly translated membrane proteins and translocating them through the lipid bilayer via the SecY polypetide conducting channel, while also modulating the activity of the SRP receptor GTPase ftsY/0429 that negatively regulates the translocation of membrane proteins,[84] as well as yidC/0908 discussed in the following paragraphs, all of which are observed in its JCVI-syn3A interactome.[10]

yidC/0908

Besides YlxM and the SecYEG,DF proteins, other biomolecules play a key role in membrane protein translocation. One such protein seems to be coded for by gene /0908 that was previously annotated as a probable translocase component.[2,3] Here, we further characterize this gene based on structural comparison, genome locality and interactomics. In the JCVI-syn3A genome, gene /0908 lies adjacent to rnpA/0907, the tRNA processing RNase P. In both of the related organisms B. subtilis and M. pneumoniae, RNase P exists immediately prior on the reverse strand to the gene coding for the translocase associated protein YidC. When an AlphaFold2 prediction is generated for /0907, an excellent agreement in terms of structural similarity is found to the analogous AlphaFold2 prediction for YidC from M. pneumoniae (see Table ). Lending further support is the observed interactomic data for this gene with SecA, SecY, FtsY, and ffh, that are all members of the translocon and signal recognition particle machinery.[10] YidC specifically acts to facilitate the translocation of membrane proteins across the membrane via the SRP-mediated network, by acting in conjunction with SecYEG to reduce the hydrophobicity experienced by newly translated membrane proteins as they are inserted,[85,86] making it a valuable component of the JCVI-syn3A translocation system to model moving forward.

Stewards of Cellular Maintenance: Protease, Assembly, Uptake and Regulatory Components

In a reduced genome such as that of JCVI-syn3A, regulatory components that are typically present in other bacterial organisms may not be present. For this reason, cellular maintenance molecules play a key role in preserving a stable biophysical environment as we have discussed in our previous whole-cell modeling work.[11] Here, we further characterize the players involved in these cellular housekeeping processes.

yqgP-gplG/0516

Gene /0516, located not far downstream on the reverse strand from yqkD/0520 and ylbN/0527, is also of interest as a quasi-essential gene for JCVI-syn3A viability with multiple predicted transmembrane helices (Figure ).[3] When the gene is analyzed via HHPred[31] and a predicted structure is generated via AlphaFold2, it aligns very well with the structure for Haemophilus influenzae gplG membrane protease, especially at the α-helical portion of the protein. When the /0516 structure is aligned to the predicted AlphaFold2 structure of the paralogous yqgP membrane protease[87] from the more closely phylogenetically related organism B. subtilis, an excellent FATCAT alignment is achieved across the entire structure with high coverage and statistically significant structural similarity (see Table ). yqgP-glpG is a membrane protease that partners with the membrane protease ftsH/0039 and is responsible for degradation of metal transporters such as mgtA/0787 to prevent ion toxicity.[88,89] This assignment is supported by the observation of ftsH/0039 in the interactomics for JCVI-syn3A[10] as well as the observation of the JCVI-syn1.0 ion transporter mgtE/0157 (deleted in genome reduction from JCVI-syn1.0 to JCVI-syn3A).

Figure 6

(A) FATCAT[25] structural alignment of AlphaFold2 predicted structure of /0516 (blue) and gplG protease from H. influenzae (RCSB PDB: 2NR9(89)), where only the lower helical portion of the protease is available in the experimental structure. (B) FATCAT structural alignment of the predicted structure from (A) with the Alphafold2 predicted structure for the gplG paralogue protease yqgP from B. subtilis shows excellent agreement to both the upper β- sheet and the lower -a-helical regions of the JCVI-syn3A structure, with a p-value of 7.19 × 10–9 showing significant structural similarity.

gabP/0878

Gene /0878 was previously described as encoding a quasi-essential transmembrane protein of unknown function.[2] More recent work has determined that this protein could have likely amino acid binding and transport functionality.[3,19] Here, we strengthen the certainty of this functional assignment by characterizing the gene as coding for a GabP-like amino acid permease. When the AlphaFold2 predicted structure for /0878 is aligned to the AlphaFold2 predicted structure for GabP from B. subtilis, a strong agreement is achieved, with 87% of residues being well-aligned with a p-value of 4.47 × 10–11 for significant similarity from FATCAT alignment.[25] The specific functionality of GabP lies in the uptake of proline, that is present in the JCVI-syn3A growth media[11] and is necessary to take up for generation of proteins, since no JCVI-syn3A amino acid synthesis capability exists. The speculated ability of this protein to bind amino acids nonspecifically (albeit with a preference for proline)[90] may shed some light onto the quasi-essential nature of this gene along with the other identified amino acid permeases and transporters gltP/0886 (with glutamine preference) and the Opp system coded for by genes oppB/0165–oppA/0169 that bind amino acids at an ATP cost[3] unlike the previously mentioned proton-symport permease systems.

yloU/0421

Gene/0421 presents an interesting case, in that it is an essential gene of unknown function adjacent to the fatty acid kinase fakA/0420 and lipid metabolism acyltransferase gene plsX/0419. Upon further examination, due to both genome locality and structural alignment, the suspicion that this gene may be involved in cellular processing of fatty acids is confirmed. When the AlphaFold2 prediction for /0421 is structurally aligned to the predicted structure for the fatty acid metabolism regulatory element YloU from B. subtilis, a convincing agreement is obtained (see Figure B) with a significant similarity p-value exceeding 10–11 and 99% of the residues being well-aligned by FATCAT[25] (see Table ). In addition, in B. subtilis, the gene yloU immediately precedes the fatty acid kinase gene fakA on the forward strand, just as it does in JCVI-syn3A in yet another case of conserved genome architecture. Previous bioinformatic work[12] predicted /0421 to be a member of the Asp23 protein family, that includes YloU and whose members are often involved in stress response processes. YloU is thought to regulate the processing of fatty acids and phospholipid generation as a whole, however its exact function remains uncertain.[91]

Figure 7

(A) Structural alignment of AlphaFold2 prediction for gabP/0878 (blue) with the predicted structure of the B. subtilis GabP permease (yellow). (B) FATCAT structural alignment of the predicted structure of yloU/0421 from JCVI-syn3A (blue) with the AlphaFold2 predicted structure for B. subtilis YloU fatty acid regulatory protein (yellow). (C) The genomic context in which yloU is found, adjacent to the fatty acid processing genes fakA and plsX, is conserved for gene yloU/0421 in JCVI-syn3A and has been observed to be conserved previously in the Firmicutes.[91] (D) FATCAT structural alignment of predicted structure for gene yidC/0908 and predicted structure for M. pneumoniae show significant structural similarity.

lemA-gacS/0604

Gene /0604 was among the seven genes necessary to restore normal cell division in JCVI-syn3.0, but its biomolecular function in JCVI-syn3A remains unknown.[26] When the predicted AlphaFold2 structure of /0604 is aligned via FATCAT to the structure of lemA from Thermotoga maritima a convincing agreement is obtained, that can be seen in Figure B. Typically, lemA is part of a two-component regulatory system, which have been well-studied across bacterial systems[92−94] (but that have not yet been observed in Mycoplasma to our knowledge), and are involved in regulation of various cellular processes varying from phosphate uptake, to respiration, to sporulation. While the exact function of this gene remains unclear, several two-component systems present in B. subtilis contribute to regulation of cell growth and division processes including the pairs comA and comP and the only essential two-component system in the bacterium walK and walR.[95]

atpI/0797

Gene /0797 provides an important case-study in the computational analysis of JCVI-syn3A genes, in taking a holistic view in characterizing genes, rather than using a more high-throughput sequence or structurally based approach that can be applied with relative speed to the entire genome. When techniques such as the TASSER-I based structural pipeline[19] or HHPred[31] are applied to /0797, a wide-ranging set of possible functionalities are generated, ranging from secretory proteins to sugar transporter components. However, when a closer look was taken at its position within the genome, we observed that it lies just prior to the genes coding for the ATP synthase complex, a critical molecular machine for maintaining an appropriate membrane pH gradient in Mycoplasmas[96,97] as well as in some instances generating ATP for cellular energy expenses,[11] on the reverse strand. When the ATP synthase containing operon is examined in B. subtilis,[49] we observe that the gene atpI is located immediately prior the remainder of the ATP synthase coding genes, in the identical order as they appear in JCVI-syn3A. With this in mind, we sought to compare the predicted protein structure of /0797 to existing structures of AtpI; however, existing structural information is sparse to nonexistent, especially for related organisms like B. subtilis. Faced with this dilemma, we utilized a predicted structure for the analogous gene in B. subtilis,[98] and when it is structurally aligned with the AlphaFold2 structure of /0797, the structures resemble each other strongly (see Figure A, B), with a significantly similar FATCAT p-value of 1.81 × 10–11 and 122 equivalent positions (approximately 92% of the 132 total residues in /0797) with a RMSD of 2.99 Å without twists. In terms of specific function, atpI is thought to possibly be involved in guiding the assembly of the ATP synthase complex, with mutants of this gene causing a 20% decrease in overall synthase efficiency[99] and a modulation of atpB/0796 expression, likely via a post-translational regulatory process.[100]

Figure 8

(A) STAMP structural alignment of predicted structure for B. subtilis atpI (yellow) with predicted protein structures for atpI/0797 from AlphaFold2 (red) and RoseTTAFold (blue). Images generated using VMD.[42] Alignment of these two structures obtained a FATCAT[25]p-value of greater than 1.0 × 10–11, showing significant similarity with 92% of the JCVI-syn3A structure being well-aligned to that predicted for B. subtilis. (B) FATCAT alignment of the Spinicia oleracea atpI structure (RSCB PDB: 6FKI(101)) with the AlphaFold2 predicted structure of atpI/0797 with conservation of the central two α-helical regions especially. Alignment of these two structures obtained a FATCAT[25] gave 130 residues or 99% of the JCVI-syn3A structure being well-aligned to the larger 247 residue cryo-EM structure for Spinicia oleracea. (C) Genomic organization from JCVI-syn3A and B. subtilis of the ATP synthase operon[3,49] is identical lending further support to the assignment of atpI/0797 (red, bottom-right).

Summary of Computationally Annotated JCVI-syn3A Genes of Unknown Function

Table contains functional annotations and generalized gene process descriptions for each of the previously unannotated JCVI-syn3A genes (of essential and quasi-essential nature[3]) involved in the crucial processes of cell growth, cell division, and membrane maintenance that we elucidated in this work. This knowledge can be used to extend the experimental and computational work in characterizing JCVI-syn3A given in previous works.[2,3,11]

Table 1

Table of Gene Products Characterized Computationallya

gene	cellular process	function
ezrA/0239	cell division	FtsZ filament regulator
gpsB/0353	cell division	divisome localization regulator
secDF/0412	cell membrane	protein translocase subunit
yloU/0421	cell growth	lipid synthesis regulation
ylxM/0430	cell membrane	effector of translocation SRP
yqgP-gplG/0516	cell membrane maintenance	membrane protease
yqkD/0520	cell growth and morphology	membrane lipase/aminopeptidase
ylbN/0527	cell growth and morphology	23S rRNA regulator
lemA-gacS/0604	cell growth and morphology	two component sensor/membrane
atpI/0797	membrane pH gradient	ATP synthase component
gabP/0878	nutrient uptake	amino acid transporter
yidC/0908	cell membrane	protein translocon component

JCVI-syn3A genes involved in cell growth, maintenance, and division in ascending order of gene locus tag (NCBI Entry - NCBI GenBank: CP002027.1), with associated bacterial orthologue gene names given. The general cellular process of each gene is given as well as its more specific cellular function.

Protein Structure Prediction Alignment Data

Table contains structural alignment data for each of the previously unannotated genes implicated in cell division, growth, and membrane maintenance that we have clarified here in this work. Data comparing alignment and structural similarity of experimental structures of these genes from related bacterial organisms such as B. subtilis compared to AlphaFold2 prediction protein structures for JCVI-syn3A genes is presented with analysis conducted via FATCAT.[25] Alignment values generated from experimentally solved crystal structures or with AlphaFold2 (AF2) structures where this abbreviation is indicated, aligned with predicted AlphaFold2 structures for JCVI-syn3A genes. A p-value of less than 5.0 × 10–2 demonstrates significant structural similarity. A value of 0.00 indicates a p-value of less than 1.0 × 10–20. The percent aligned relates to the percentage of residues in the protein of shorter length that are well-aligned to the larger structure by FATCAT, with the associated RMSD. Equations generating these statistical values are given in the Formulas section.

Discussion

Benefits of Computational Analysis and Suggested Directed Functional Experiments

We have hereby demonstrated the effectiveness of a comprehensive approach integrating protein structure prediction, interactomics, and genome architecture to determine the function of previously unannotated genes in JCVI-syn3A. Of the approximately 90 protein-coding genes of unclear function reported by Breuer et al. in 2019,[3] we have upgraded the annotations for 25 genes, while providing additional information for tens of the remaining protein-coding JCVI-syn3A genes, that will be useful in future studies. This workflow can be further used in elucidating the functional roles of the remaining JCVI-syn3A genes of unclear and unknown function by the design and execution of “directed wet lab assays” developed hand-in-hand with bioinformatics and simulation. Through this work, we believe the community can discover the roles of the relevant enzymes that are “missing puzzle pieces” in the biological reaction networks of JCVI-syn3A,[3] which serves as a platform from which to understand the fundamental behaviors of living cells. Nearly all of the genes annotated in this work were denoted as either essential or quasi-essential by analysis of transposon mutagenesis of JCVI-syn3A,[3] underscoring their importance in JCVI-syn3A cellular processes. We note that phenotypes of such gene products are difficult to study due to the inability to obtain deletion strains of the genes in question due to their essentiality. Several of these newly annotated components may play an important role in a more detailed model for cell growth and morphological development for JCVI-syn3A. First, incorporation of gpsB/0353 and ezrA/0239 regulatory effects into a cell division and filamentation model based on the activities of ftsZ/0522 and ftsA/0523 may be crucial in modeling the proper dynamics of this process. Previously, we have demonstrated the coupling of metabolism and gene expression via whole cell modeling of JCVI-syn3A,[11] and recent work in B. subtilis(107) suggests that similar couplings exist between metabolism and cell division, with pyruvate possibly allowing for sensitivity to metabolic activity by cell division components. In analyzing the possible functionality of ezrA/0239 as coding for an EzrA-like cell division protein, it is interesting to to consider its connection to the metabolic enzyme pdhA/0225 which was deleted in the reduction from JCVI-syn1.0 to JCVI-syn3.0. A previous study conducted in B. subtilis(107) demonstrated that the presence of either ezrA or pdhA, which mediates pyruvate production in the central metabolism as a signaling molecule, maintains the morphology of daughter cells, while a double knockout strain of these genes causes oblong cells. Cellular actors involved in cell growth and specifically the translocation of membrane proteins have also been clarified, including secDF/0412 which is part of the Sec translocase system, ylxM/0430 which is a regulator of the SRP which assists in recognition of proteins to be translocated across the membrane, and yidC/0908 which also mediates the SRP based translocation process. yqgP-gplG/0516 additionally plays a role in cell growth by its activity in conjunction with ftsH/0039 in the degradation of membrane proteins, thereby negatively regulating the addition of membrane proteins that is carried out by the Sec system. Inclusion of these specific proteins and their associated gene expression profiles from experimental proteomics[3] and theory[11] allows for enhanced molecular detail of the growth and membrane translocation model presented by Thornburg et al.,[11] making the model even more responsive to changes in metabolism by these biophysical avenues. By explicitly modeling the interactions of proteins involved in cell division and cell growth, we can go beyond the sizer model[108,109] which was assumed in previous work[11] that modeled cell doubling. In this way, we can consider more carefully the specific cell-to-cell doubling time variation in the populations of JCVI-syn3A cells that may be due to the aforementioned cell division and growth processes. In parallel to construction of such models, experiments to more confidently assign annotations to these genes can be undertaken. Of special interest are genes yqkD/0520 and ylbN/0527 which were demonstrated to cause significant changes in JCVI-syn3A daughter cell morphologies when these genes were singly deleted.[26] Our predictions suggest that the gene coding for yqkD/0520 may be involved in maintaining cell morphology due to its role in processing membrane proteins, while the gene product ylbN/0527 is involved in this process through its role in regulating accumulation of the 23S (large subunit associated) rRNA and thus generation of ribosomes overall. A possible experiment by which to further understand the role of ylbN/0527 would be a proteomic or transcriptomic experiment that gauges the number of ribosomal proteins or rRNA that is present in a ylbN/0527 deletion strain, such as the one reported by Pelletier et al. in that work.[26] With such an investigation, researchers might further understand the role of this gene in regulating accumulation of the 23S rRNA and how this might affect other properties tied to cell growth behavior, such as number of membrane proteins and number of cellular ribosomes present. Additionally, introduction of B. subtilis ylbN/0527 into the JCVI-syn3A Lox-insertion site genetic “landing pad” in a gene ylbN/0527 deletion strain provides another experimental avenue. Ideally, this gene complementation experiment would result in JCVI-syn3A cells with normal morphology and validate our hypothesis that gene /0527 encodes a YlbN orthologue. Gene yqkD/0520 remains a more difficult case to examine experimentally. However, the membrane protease activity of this gene could be observed in E. coli or a cell-free system using a recoded version of the gene that uses the standard genetic code (such a recoded set of JCVI-syn3A genes is available through the “Free Genes” Initiative[110]) and then is subjected to a membrane protease activity assay such as that developed by Yoshitani et al.,[111] or by a more specialized aminopeptidase assay using solid-phase fluorophore chemistry.[112] The secondary assignment of prolyl-oligopeptidase superfamily lipase activity could then be tested with the free genes purified protein in a similar manner using established lipase activity assays.[113] Reconstitution of /0239 in E. coli via the codon-optimized platform “FreeGenes”[110] might provide an avenue by which to test the GTPase activity of gene /0239 if it indeed codes for an EzrA-like protein that modulates the GTPase activity of FtsZ and formation of FtsZ filaments. This could be accomplished with an experiment such as a Malachite Green Phosphate Assay such as the one described in ref (7) to observe the changes in FtsZ GTPase activity that were seen in B. subtilis.[64] Confirming the annotation of gene atpI/0797 could be ascertained by fluorescence microscopy or immunoprecipitation experiments, whereby atpI/0797 could be colocalized alongside labeled copies of one of the ATP synthase genes that provides the base component of the molecular assembly, such as atpB/0796, consistent with the role of atpI in modulating assembly of the ATP synthase complex.[99] Likewise, gene ylxM/0430 could be confirmed as an effector of the regulation of the SRP by such colocalization to the SRP itself (ffs/0049) as well as the other members of the Sec translocase system such as secA/0095 and secY/0652. The predicted membrane protease yqgP-gplG/0516 might be subjected to “Free Genes”[110] reconstitution in E. coli or a similar technique and then be tested for activity on purified membrane proteins from E. coli that are also present in JCVI-syn1.0 and JCVI-syn3A such as mgtA/0787 and mgtE/0157. The assay could use the previously mentioned method developed by Yoshitani et al.,[111] which might be transferable to proteases such as this. The presence in the JCVI-syn3A genome of moonlighting proteins, i.e., multifunctional proteins that are not the result of gene fusions, could confound efforts to predict protein functions. Because of evolutionary pressure to delete nonessential genes, mycoplasmas are known for having such multifunctional proteins.[114−117] Some predict that mycoplasmas will have more moonlighting proteins than bacteria not under evolutionary pressure to shed genes.[118,119] This may result in a protein evolving to have part of its structure or amino acid sequence be similar to that of functionally characterized proteins with one function and another part similar to a different protein with a different function. Thus, protein function prediction software may offer more than one prediction with significant confidence. It is vital that both computational and experimental biologists keep this in mind when evaluating predictions of protein function. Furthermore, dissimilar protein function predictions with varying confidence levels may both be correct for moonlighting proteins. Additionally, an analysis of some transporter proteins such as that coded for by ptsG/0779 predicted via AlphaFold2 serves as a cautionary note for computational methodologies and motivation for further experimentation. Low prediction confidence (see Supporting Information section S2 analysis notebook) can be observed especially in and around membrane embedded domains where it is less likely that experimental structures were available at the time of CASP14 and during AlphaFold2 development. Note that we also provide in Supporting Information section S2 a database containing AlphaFold2 protein structure predictions that we have generated for all of the approximately 450 protein-coding genes present in the JCVI-syn3A genome, along with an associated Jupyter analysis and visualization notebook. We hope that these predictions in conjunction with the computational workflow we have demonstrated here and with additional directed experiments will help clarify and resolve our understanding of the remaining genes of unknown function in the JCVI-syn3A genome, moving us as a scientific community toward the complete characterization of the genetically minimal cell JCVI-syn3A. Additionally, we provide a supplemental table summarizing gene functionality for all of the genes of previously unclear function[3] in Supporting Information section S3 and Supporting Information Spreadsheet S1. Assignments of gene functional annotation were made when agreement of a predicted structure to an experimentally solved structure was observed via FATCAT.[25] These assignments were further supported by sequence analysis and genome architecture examination, in a similar manner to that demonstrated in this work (such high confidence predictions from the main text and Supporting Information Table S1 are highlighted).

Conclusions

Individualized computational analysis of JCVI-syn3A genes allowed us to further characterize the JCVI-syn3A genome. We uncovered the functions of several genes whose products include important players in fundamental cellular processes, such as cell growth, membrane pH maintenance, divisome formation, and membrane protein translocation, to name a few. These findings combined with ongoing work will result in better modeling and experimental characterization of these processes and their dependencies in the context of this minimized genome. Further results from this treatment and associated directed experiments can inform the “design, build, test” synthetic biology development of JCVI-syn3A moving forward. Bioinformatic, computational, and experimental efforts related to the remaining JCVI-syn3A genes of unknown function can precipitate a self-improving cycle, where in silico analysis and predictions can inform more robust computational models for cell growth and division, that in turn generate predictions that can be examined by experiments. We hope that the techniques used in this work and the suggested annotation updates and directed characterization experiments will drive increased understanding and motivation to uncover the secrets of the genetically minimal cell, JCVI-syn3A. In this way, we can elucidate the outstanding unknowns of its biological milieu and work toward the complete characterization of the genome of a living cell.

103 in total

1. Protein Structure and Function Prediction Using I-TASSER.

Authors: Jianyi Yang; Yang Zhang
Journal: Curr Protoc Bioinformatics Date: 2015-12-17

2. The Synchronization of Replication and Division Cycles in Individual E. coli Cells.

Authors: Mats Wallden; David Fange; Ebba Gregorsson Lundius; Özden Baltekin; Johan Elf
Journal: Cell Date: 2016-07-28 Impact factor: 41.582

3. A Life in Bacillus subtilis Signal Transduction.

Authors: James A Hoch
Journal: Annu Rev Microbiol Date: 2017-09-08 Impact factor: 15.500

Review 4. Hidden protein functions and what they may teach us.

Authors: Petra Schwille; Béla P Frohn
Journal: Trends Cell Biol Date: 2021-10-12 Impact factor: 20.808

5. Single cell analysis of gene expression patterns during carbon starvation in Bacillus subtilis reveals large phenotypic variation.

Authors: Imke G de Jong; Jan-Willem Veening; Oscar P Kuipers
Journal: Environ Microbiol Date: 2012-10-04 Impact factor: 5.491