João Victor Paccini Coutinho1, Janaina Macedo-da-Silva1, Simon Ngao Mule1, Thales Kronenberger2, Livia Rosa-Fernandes1, Carsten Wrenger1, Giuseppe Palmisano3. 1. Department of Parasitology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil. 2. Department of Internal Medicine VIII, University Hospital Tuebingen, Tuebingen, Germany; Department of Pharmaceutical and Medicinal Chemistry, Institute of Pharmaceutical Sciences, Eberhard-Karls-Universität, Tuebingen, Germany; Cluster of Excellence iFIT (EXC 2180) "Image-Guided and Functionally Instructed Tumor Therapies", University of Tuebingen, Tuebingen, Germany; Tuebingen Center for Academic Drug Discovery & Development (TüCAD2), Tuebingen, Germany. 3. Department of Parasitology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil; Faculty of Science and engineering, Macquarie University, Sydney, NSW, Australia. Electronic address: palmisano.gp@gmail.com.
Abstract
Molecular Dynamics (MD) is a method used to calculate the movement of atoms and molecules broadly applied to several aspects of science. It involves computational simulation, which makes it, at first glance, not easily accessible. The rise of several automated tools to perform molecular simulations has allowed researchers to navigate through the various steps of MD. This enables to elucidate structural properties of proteins that could not be analyzed otherwise, such as the impact of glycosylation. Glycosylation dictates the physicochemical and biological properties of a protein modulating its solubility, stability, resistance to proteolysis, interaction partners, enzymatic activity, binding and recognition. Given the high conformational and compositional diversity of the glycan chains, assessing their influence on the protein structure is challenging using conventional analytical techniques. In this manuscript, we present a step-by-step workflow to build and perform MD analysis of glycoproteins focusing on the SPIKE glycoprotein of SARS-CoV-2 to appraise the impact of glycans in structure stabilization and antibody occlusion.
Molecular Dynamics (MD) is a method used to calculate the movement of atoms and molecules broadly applied to several aspects of science. It involves computational simulation, which makes it, at first glance, not easily accessible. The rise of several automated tools to perform molecular simulations has allowed researchers to navigate through the various steps of MD. This enables to elucidate structural properties of proteins that could not be analyzed otherwise, such as the impact of glycosylation. Glycosylation dictates the physicochemical and biological properties of a protein modulating its solubility, stability, resistance to proteolysis, interaction partners, enzymatic activity, binding and recognition. Given the high conformational and compositional diversity of the glycan chains, assessing their influence on the protein structure is challenging using conventional analytical techniques. In this manuscript, we present a step-by-step workflow to build and perform MD analysis of glycoproteins focusing on the SPIKE glycoprotein of SARS-CoV-2 to appraise the impact of glycans in structure stabilization and antibody occlusion.
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a highly pathogenic virus and the agent responsible for the pandemic that began in late 2019 (Baloch, Baloch, Zheng, & Pei, 2020), which has already led to the death of more than 6 million people in the world and 434 million cases (WHO Coronavirus COVID-19 Dashboard, n.d.). Like other viruses in the Coronovidae family, SARS-CoV-2 also encodes the SPIKE (S) glycoprotein, which has an important role in infection, including virus entry into the target cells (Ou et al., 2020). This glycoprotein is also essential in the host's immune evasion, representing the main target of therapeutic interventions (Duan et al., 2020). Multiple SARS-CoV-2 variants of concern arose since the beginning of the pandemic, some of which initiated a new wave of infections. These variants include delta (Planas et al., 2021), omicron (Wolter et al., 2022), P1 (Naveca et al., 2021), among others (Hacisuleyman et al., 2021) and have mutations mainly in the S protein, which facilitates successful infection with higher transmission rate. The most relevant mutations in protein S are D614G, present in all variants of Concern (VOC) and it is responsible for accumulating higher viral load and infectivity (Volz et al., 2021). On the other hand, the N501Y mutation promotes new stabilizing hydrogen bonds between the receptor binding domain (RBD) and ACE-2 receptor, with ~ 10 times higher binding affinity (Liu et al., 2021, Liu et al., 2021). The presence of HV69-70del is required to induce this effect. The absence of this deletion does not allow immune evasion in B.1.1.7 strains (Meng et al., 2021). The L452R, T478K, and P681H/R mutations promote an increase in infectivity, possibly due to an increase in the positive electrostatic potential of the surface and greater steric impediments (Pascarella et al., 2021; Tian, Sun, Zhou, & Ye, 2022). In contrast, the E484K mutation promotes less antibody neutralization and greater escape from the immune system (Lippi, Mattiuzzi and Henry, 2022, Liu et al., 2021).Another additional layer of structural complexity comes from the conformational changes induced by post-translational modifications (PTMs). Protein glycosylation is a PTM that can have profound effects on the structure and function of a glycoprotein (Macedo-da-Silva, Santiago, Rosa-Fernandes, Marinho, & Palmisano, 2021; Reily, Stewart, Renfrow, & Novak, 2019; Schjoldager, Narimatsu, Joshi, & Clausen, 2020). N-linked glycosylation (N-X-S/T motif, where X is any amino acid residue except proline) is characterized by the addition of oligosaccharides of variable size and complexity to the side chains of asparagine, whereas in O-linked glycosylation, sugar addition occurs at the serine and threonine residues (Aebi, 2013; Hanisch, 2001). The glycan chains can impact in different ways on the protein proprieties, such as stabilizing loops and conformations (Wormald & Dwek, 1999), providing occlusion against antibodies (Ab) of a host cell (Watanabe et al., 2020), promoting correct folding of the protein (Dalziel, Crispin, Scanlan, Zitzmann, & Dwek, 2014), or enhancing lectin-based cell-adhesion (Lasky, 1991). The glycan composition and occupancy of glycosites can vary significantly. This, combined with the high flexibility of glycans, makes a definition of a glycosylated structural model very challenging (Lis & Sharon, 1993; Nagae & Yamaguchi, 2012). Both N- and O-glycosylation represents a key process in viral proteins in multiple aspects of their pathobiology. Since viruses have no metabolic pathways to perform this modification by themselves, they hijack the host-cell glycosylation machinery to their benefit. Glycosylation has been shown to control the infection of several viruses (Sugrue, 2007; Vigerust & Shepherd, 2007). The role of O-linked glycosylation during viral infection is poorly understood. There are reports in the literature that indicate CD99 binding to herpes simplex virus type 1 B glycoprotein (Wang et al., 2009) and the role of the glycan shield and occlusive neutralization during viral infection. Moreover, O-linked glycan epitopes have be shown to be highly immunogenic in Gammaherpesvirus (Machiels et al., 2011). On the other hand, N-glycan alterations during viral infections have been extensively described, from host cell entry, particle release, and immune evasion as seen in the SPIKE protein of Coronavirus, influenza HA, Ebola GP, and Lassa-virus GPC (Hastie et al., 2017; Kwon et al., 2015; Lee et al., 2014; Mohan, Li, Ye, Compans, & Yang, 2012; Walls et al., 2016; Watanabe, Bowden, Wilson, & Crispin, 2019; Zhao et al., 2016). Due to that, viral glycoproteins represent the major components of the viral envelope and have important roles associated to host cell recognition, attachment, infection, invasion, transmission and immune evasion (Banerjee & Mukhopadhyay, 2016; Cook & Lee, 2013; Watanabe et al., 2019).Currently, mass-spectrometry-based approaches are the primary choice for the analytical characterization of both site and structure-specific glycosylation (Oliveira, Thaysen-Andersen, Packer, & Kolarich, 2021; Pasing, Sickmann, & Lewandrowski, 2012). This enables a more accurate protein modeling in its biological context. In general, the enrichment of glycopeptides from complex samples is performed by applying strategies based on TiO2 (Larsen, Jensen, Jakobsen, & Heegaard, 2007; Palmisano et al., 2010), lectin affinity chromatography (Yang & Hancock, 2004), and hydrophilic interaction liquid chromatography (HILIC) (Ongay, Boichenko, Govorukhina, & Bischoff, 2012), followed by glycan release or analysis of intact glycopeptides (Parker, Gupta, Cordwell, Larsen, & Palmisano, 2011; Thaysen-Andersen & Packer, 2014; Thaysen-Andersen, Packer, & Schulz, 2016). For N-linked, the enzyme N-glycosidase F (PNGase F) is usually used, which cleaves between GlcNAc and asparagine residues of N-linked glycoproteins and glycopeptides (Kaji et al., 2003). The release of O-linked occurs mainly by reductive alkaline β-elimination (Wilkinson & Saldova, 2020). The isolated portion of N/O-glycans can be derivatized by permethylation (Kang, Mechref, & Novotny, 2008), and then analyzed by mass spectrometry techniques, as well the glycopeptides, which includes matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF-MS) (Reiding, Blank, Kuijper, Deelder, & Wuhrer, 2014) or by electrospray ionization (ESI-MS/MS) (Morelle & Michalski, 2007; Palmisano, Larsen, Packer, & Thaysen-Andersen, 2013; Palmisano, Melo-Braga, Engholm-Keller, Parker, & Larsen, 2012). Ion mobility mass spectrometry allows access to the native glycocode giving information on the monosaccharide composition and glycosidic linkages (Schindler et al., 2017). Moreover, novel software solutions have been applied for a comprehensive characterization of thousands of intact glycopeptides in different biological matrices (Alocci et al., 2019; Campbell, 2017; Chernykh, Kawahara, & Thaysen-Andersen, 2021; Kawahara et al., 2021). Native MS allows to transmit large protein ions such as intact glycoprotein complexes which allow the determination of subunits stoichiometry, ligand binding and infer glycoprotein function (Struwe & Robinson, 2019; Wu & Robinson, 2022). Recently, a new method, termed limited deglycosylation assay, was introduced to probe 3D conformational changes of glycoproteins on a proteome-wide scale using limited amount of sample (Mule et al., 2021). This method takes advantage of the accessibility-dependent PNGaseF cleavage to quantitative study changes in glycoprotein conformational changes. X-ray crystallography and cryoEM have limitation in solving the detailed structure of intact glycoproteins due to stoichiometry, heterogeneity and flexibility of the glycan chains. This is known as the glycosylation problem (Chang et al., 2007; Davis & Crispin, 2010). Another approach used in the structural study of glycoproteins is nuclear magnetic resonance (NMR) spectroscopy, a versatile, quantitative, and non-destructive analytical technique (Unione, Ardá, Jiménez-Barbero, & Millet, 2021). NMR provides absolute quantification information for both glycan and glycoprotein. However, the application of the technique has limitations that include the need for large amounts of samples, glycoprotein size and access to adequate labeling (Valverde, Quintana, Santos, Ardá, & Jiménez-Barbero, 2019). NMR overcomes one of the main challenges encountered in glycoproteomics, which consists of differentiating the stereochemistry of a glycan, with isobaric species such as glucose (Glc), galactose (Gal) and mannose (Man) in a single analysis (Hargett et al., 2021). Taken together, the analytical toolbox available to the scientific community allows structural characterization of glycoproteins and correlation with their functions.The complexity of glycans has always been a challenge to be overcome in molecular modeling, mainly due to the great variety of monomer units that can be organized into stereo and regiospecific-linked oligosaccharides within each glycosylation site. Initial techniques applied to carbohydrate modeling, allowed the elucidation of glycans' conformational and spatial preferences (Lemieux & Koto, 1974; Peters, Meyer, Stuike-Prill, Somorjai, & Brisson, 1993). Subsequently, the development of force fields capable of refining carbohydrate modeling during MD analysis changed the scope of oligosaccharide chain analysis in these simulations, taking into account their fluctuating nature and their interactions with solvent molecules (Woods, 2018). Development of reliable force-field parameters from experimental data suffers from the lack of training data, given the shortage of 3D structures of branched carbohydrates. These groups are often not resolved or removed from the initial crystal structures, due to considerable thermal fluctuations and flexibility. Among the computational tools to perform MD analysis of glycans and evaluate the force fields, there are AMBER (Pearlman et al., 1995), GLYCAM (Kirschner et al., 2008) and CHARMM (Vanommeslaeghe et al., 2010). MD simulations can reveal several features about the mechanisms and biological functions of glycoproteins and the effects that glycans have on them, as well as evaluate optimal conformations for binding lectins (Cheong, Shim, Kang, & Kim, 1999) or elucidate protein stability in a given conformation (Casalino et al., 2020).In view of the great importance of the SPIKE glycoprotein in the success of SARS-CoV-2 infection, understanding the structure and impact of adding glycans to this protein can provide important insights to elucidate the mechanisms of viral infection, and highlight possible therapeutic targets. Based on this, we present a step-by-step bioinformatics pipeline to perform the MD of the SARS-CoV-2 SPIKE protein, including carbohydrate structures, solvent, and antibody accessibility. The analysis of the energy states of the performed simulation indicated that it correctly followed the parameters entered and allowed to evaluate and quantify the differential protection of the glycan shield in residues exposed to the solvent. In addition, it was also possible to identify a wide variety of protein-glycan hydrogen bonds and evaluate their interactions within the structure. MD of glycoproteins presents some challenges due to the increase in atoms and in the waterbox size, together with the extension and mobility of the glycans. The MD simulation steps reported here can be applied to other glycoproteins to evaluate various effects that glycosylation can impose on structure and accessibility with potential implications on the biological activities.
Methods
MD simulation workflow applied to SARS-CoV-2 SPIKE glycoprotein
A step-by-step bioinformatic pipeline to perform MD analysis is shown in Fig. 1
. This workflow was applied to a structural model of the SPIKE glycoprotein published by Woo et al., 2020. The models were based on the PDB-6vxx (https://doi.org/10.2210/pdb6vxx/pdb) crystallographic structure and presented a structural resolution of 2.80 Å (Walls et al., 2020). The chosen structure presents a range of modeled residues, which covers most of the head portion of SPIKE, disregarding the membrane and intracellular domains, not considered for this workflow. The detailed structural characterization of the glycosylation sites and glycan occupancy was based on Watanabe et al. (Watanabe et al., 2020; Woo et al., 2020). Mass spectrometric data were used to determine the most frequent glycan structures in each glycosite. A structural model of the protein of interest and site-specific glycan structures are needed to perform MD simulations as described below.
Fig. 1
Computational workflow and software tools used to perform molecular dynamics and evaluate residue specific exposure to solvent and antibody (Ab). The protein of interest structure is downloaded from PDB, in-silico glycosylated and processed to obtain MD simulation using CHARMM-GUI tool. Then, GROMACS is used to run both the simulation (e.g., 100 ns as used in this study) and perform the solvent accessible surface area and average antibody accessible surface area (SASA/AbASA) calculations. VMD software is used to visualize trajectories and RMSD (root mean square deviation) and RMSF (root mean square fluctuation) calculations and Jalview to visualize the calculated proprieties of the protein.
Computational workflow and software tools used to perform molecular dynamics and evaluate residue specific exposure to solvent and antibody (Ab). The protein of interest structure is downloaded from PDB, in-silico glycosylated and processed to obtain MD simulation using CHARMM-GUI tool. Then, GROMACS is used to run both the simulation (e.g., 100 ns as used in this study) and perform the solvent accessible surface area and average antibody accessible surface area (SASA/AbASA) calculations. VMD software is used to visualize trajectories and RMSD (root mean square deviation) and RMSF (root mean square fluctuation) calculations and Jalview to visualize the calculated proprieties of the protein.In general, it is advisable to use high-resolution structures whenever available. When selecting a structure for simulations, it is important to analyze the coverage of the structural motifs including the conserved ones, as well as to compare the conformation of individual amino-acids from the catalytic site. The information regarding catalytic/relevant residues could be found in the literature and obtained using site-directed mutagenesis, or from the conserved known motifs of the protein class of interest. The use of multiple structures as starting points as well as multiple simulation replicas as an ensemble is highly advised.
Computational tools to model glycoproteins
There are multiple computational platforms to model a glycoprotein, such as GLYCAM, Amber, Glycosilator, among others. For this work CHARMM-GUI was used (Jo et al., 2017; Jo, Kim, Iyer, & Im, 2008; Park et al., 2019), due to its web-based intuitive interface for the addition of glycans, an integrated tool for modeling missing residues belonging to unmodelled/low confidence regions, and MD model preparation. It is relevant to mention that full missing loops can be treated by ab initio or homology models, despite this being out of the scope of this manuscript (Rodriguez, Chinea, Lopez, Pons, & Vriend, 1998).The CHARMM-GUI tool can be used to model a large array of biologically relevant structures, such as lipids, glycans, glycoproteins, lipopolysaccharide, solutions and others. For the modeling of glycoproteins, the “Glycan Reader and Modeler” was used, Fig. 2A (Jo, Song, Desaire, MacKerell, & Im, 2011; Park et al., 2019, Park et al., 2017). PDB models without the oligosaccharide chains can be directly downloaded from the Research Collaboratory for Structural Bioinformatics (RCSB), but can also be uploaded by the user. A relevant note should be made regarding uploading glycosylated models. The glycans should be listed as hetero atoms (HETATM) with a CONECT table indicating the bonds among the glycans and the ASN side chain for N-linked glycoproteins. These changes can be added to the PDB by using PyMol software to build a new .pdb file with the coordinates (Fig. 2A).
Fig. 2
Defining the model to build the glycoprotein of interest. (A) Depiction of the CHARMM-GUI “Glycan Reader and Modeler” tool used to build glycoproteins. The PDB protein model code is inputted in the “Download PDB File” field on the “Protein/Glycan System.” (B) The tool will proceed to download and identify protein chains, connected glycans, ligands, experimental method for structure acquisition and name. The user is able to select all the elements of the model of interest before proceeding for glycoprotein modeling. The glycans code will be displayed by the glycosylated residue site and may be edited on further steps.
Defining the model to build the glycoprotein of interest. (A) Depiction of the CHARMM-GUI “Glycan Reader and Modeler” tool used to build glycoproteins. The PDB protein model code is inputted in the “Download PDB File” field on the “Protein/Glycan System.” (B) The tool will proceed to download and identify protein chains, connected glycans, ligands, experimental method for structure acquisition and name. The user is able to select all the elements of the model of interest before proceeding for glycoprotein modeling. The glycans code will be displayed by the glycosylated residue site and may be edited on further steps.Once the model is uploaded, the software will detect the protein sequence, chains, missing residues, disulfide bridges, connected glycans, and ligands, all of which might be edited/added/removed in the “Glycan reader” step by deselecting the marked boxes (Fig. 2B).Upon advancing to the next section, the user will be prompted to edit/fix the protein. Any missing residues and disulfide bridges can be edited before MD simulations and the CHARMM-GUI can perform this modeling in an integrated manner (Fig. 3A and B). The glycans can also be edited in the glycosylation section, where a pop-up menu will bring an editable glycan builder, in which the user can input the appropriate glycans of interest by adding the sugars with the correct linkage (Fig. 3C). With the disulfide bridges added, missing residues modeled and glycans inserted, the glycosylated model is ready to be generated. The CHARMM-GUI will build the structure with the requested properties and prompt the user to the solvation tool, to build a waterbox for the solvent and ions of the protein.
Fig. 3
Editing steps of CHARMM-GUI Glycan reader. The information and selected chains of the downloaded model are displayed and available to be edited at this moment. Preceding the simulation, multiple factors must be accounted for the glycoprotein during its modeling, such as modeling missing residues from low resolution regions of the model, for which the user can use the built-in tool to mark missing residues to model (A), checking for the presence of all structural disulfide bridges and adding missing ones (B), adding explicit hydrogens in the model and, finally, adding and editing glycans (C). Upon selecting a glycan to edit, a pop-up window emerges, allowing for precise glycan editing parameters, including sugar monomer selection and glycosidic bond position. Upon updating the sequence, the glycan code will be updated to reflect the new glycan incorporated in the model.
Editing steps of CHARMM-GUI Glycan reader. The information and selected chains of the downloaded model are displayed and available to be edited at this moment. Preceding the simulation, multiple factors must be accounted for the glycoprotein during its modeling, such as modeling missing residues from low resolution regions of the model, for which the user can use the built-in tool to mark missing residues to model (A), checking for the presence of all structural disulfide bridges and adding missing ones (B), adding explicit hydrogens in the model and, finally, adding and editing glycans (C). Upon selecting a glycan to edit, a pop-up window emerges, allowing for precise glycan editing parameters, including sugar monomer selection and glycosidic bond position. Upon updating the sequence, the glycan code will be updated to reflect the new glycan incorporated in the model.The standard parameters found in the tool will generate a waterbox to match the protein size with an additional 10 Å to the edge and place a physiological concentration of KCl using the Monte-Carlo method, to correct for electrostatic interactions (Fig. 4A).
Fig. 4
Solvation and molecular dynamics input generation. After editing the structure for MD preparation, a solvent box with a physiological concentration of ions is added to the protein to both neutralize and add ionic strength to the system (A). The waterbox size, ion position method, salt type and concentration parameters can be selected in this section. In this step is critical to make sure that the waterbox encompass the whole protein and the glycans. The Periodic boundary conditions are set to define the “system size” during the simulation (B). Lastly, CHARMM-GUI generates the topology files using the selected “Force field” and generates inputs for an array of commonly used molecular dynamics programs, including parameter files for both equilibration and dynamics (C). The output structure of glycosylated protein with the solvent and ions occluded is presented as side view (D) and top view (E). Each chain of the protein is represented in either cyan, magenta or orange (water molecules were occluded to facilitate the visualization).
Solvation and molecular dynamics input generation. After editing the structure for MD preparation, a solvent box with a physiological concentration of ions is added to the protein to both neutralize and add ionic strength to the system (A). The waterbox size, ion position method, salt type and concentration parameters can be selected in this section. In this step is critical to make sure that the waterbox encompass the whole protein and the glycans. The Periodic boundary conditions are set to define the “system size” during the simulation (B). Lastly, CHARMM-GUI generates the topology files using the selected “Force field” and generates inputs for an array of commonly used molecular dynamics programs, including parameter files for both equilibration and dynamics (C). The output structure of glycosylated protein with the solvent and ions occluded is presented as side view (D) and top view (E). Each chain of the protein is represented in either cyan, magenta or orange (water molecules were occluded to facilitate the visualization).Then, after setting the periodic boundary conditions (Fig. 4B) and selecting the “force field” CHARMM36m (Huang et al., 2016) to be used for topology generation and inserting standard parameters for the equilibration of the system, the CHARMM-GUI will generate input files for several MD simulation programs as described below (Fig. 4C). The selection of the forcefield of choice can vary on the simulation specificities, but CHARMM36m was chosen due to its improved accuracy for backbone conformational ensembles, besides improved H-bond J coupling. Selecting the GROMACS box as “input generation option,” Fig. 4C, will add a folder with all the different files needed for GROMACS MD simulations:MD ready model, with glycans, ions and periodic boundaries;Topology files;Molecular dynamics parameter files (.mdp) for minimization, equilibration and a 100 ns md run.The final model for MD simulation can be visualized using Pymol or other molecular visualization tools such as Chimera, Coot or VMD. The file is located inside the GROMACS folder named “step3_input.gro” (Fig. 4D and E).
MD simulation parameters
There are several MD programs available for use, such as AMBER, CHARMM and LAMMPS. In this guide, we will present the use of GROMACS software (Bauer, Hess and Lindahl, 2022a, Bauer, Hess and Lindahl, 2022b; Van Der Spoel et al., 2005) for our simulations. The .mdp output from CHARMM-GUI can be used as input for the simulation.In ordsser to proceed with the MD simulation, the generated model needs to be compiled with all the topologies, restrains, run parameters, output options and pressure/temperature coupling algorithms parameters in a singular file, named the .tpr file. This compilation is done by the gmx grommp command as follow:-f: Molecular dynamics parameters file (.mdp)-o: name of the output portable binary run input file, necessary for the MD run (.tpr)-c: structure file, the generated glycoprotein model (.gro; .pdb; .tpr; etc)-p: topology file (.top)-n: index file, used to set/create a new set of atoms or just using the GROMACS default groups.ndxAlt-text: Unlabelled BoxIt is important to mention that the files: .mdp, .gro/.pdb, .top and others are obtained from the CHARMM-GUI output. The .mdp file contains detailed information on the parameters for MD that can be run in the default mode or adapted to best fit the simulation needs.All of the .mdp options and their respective inputs can be checked in the GROMACS documentation (https://manual.gromacs.org/documentation/2018/user-guide/mdp-options.html). A full discussion is outside the scope of this article, so we will focus on selected parameters/options to be tailored to the simulation needs. The parameters can be divided in five categories: (a) preprocessing and run options; (b) output control; (c) neighbor search; (d) bonds and constrains and e) pressure and temperature coupling.Molecular dynamics preprocessing and run optionsThe preprocessing and run parameters define directories to include in the topology beyond the .itp files inside the .top file (using the -include input) and add position restrictions for user-defined atoms (using the -define input). The run options include the integrator algorithms for running the simulation. Among the options available for the integrator there are: the md algorithm, that uses the leap-frog method for integrating equations of motion, or the steep algorithm, for energy minimization using the steepest descent, which would also depend of a emtol input for tolerance of energy variation before declaring the system minimized. Lastly, the nsteps and the dt parameters can be set specifying the number of molecular dynamic steps to be taken in the simulation and the time step for integration in picoseconds.Molecular dynamics output files control optionsThe output parameters must be set in order to specify the frequency GROMACS will write information in the trajectory and log files. These parameters are set as an integral number for the number of steps that elapse in the simulation before writing: nstlog—energies to the log file; nstvout—velocities to trajectory file; nstxcout—coordinates to trajectory; nstfout—forces to trajectory; nstcalcenergy is used to set the number of steps between calculating energies. The nstcalcenergy is relevant only with integrator set to dynamics simulation algorithms and must be a multiple of nstenergy, a parameter for writing energies to the energy file.Neighbor searching, electrostatics and interatomic interactionsThe interatomic interactions are accounted during the simulations by determining the interacting atomic vicinity using a list of user defined distances and other parameters. They account for both short and long distances interactions.These parameters can be used as default or modified according to the user needs. For example, the nstlist is the frequency of update of the neighbor list; rlist sets the cutoff for short-range interactions (in nm); rvdw distance set to apply to van der waals interactions; Cutoff-scheme sets the algorithm to search for the short range interactions list, based on the rvdw and rlist values set; vdwtype set the algorithm for van der waals interactions search using both rlist and rvdw; rvdw-switch and vdw-modifier work in tandem to apply a smooth transition between short and vdw interations, based on the algorithm chosen for vdw modifier and the distance set in rvdw-switch, which must be smaller than the one set for rvdw. Lastly, colombtype sets the algorithm for electrostatics interactions and rcoulomb sets the distance for the Coulomb electrostatic interactions cut-off.Setting up bonds and constrains optionsThis class of parameters set constrains in otherwise variable components of the simulation to facilitate the simulation processing. The constrain parameters can encompass either only the bonds or bonds and angles, and can be set to affect all (all-bonds;all-angles), only bonds with H-atoms (h-bonds;h-angles) or none of them (none). The constraint _algorithm must also be selected between the LINCS (Hess, Bekker, Berendsen, & Fraaije, 1997) or SHAKE algorithm.Applying temperature and pressure coupling algorithmsThese parameters define the algorithms, using both tcoupl and pcoupl to temperature and pressure respectively, and other variables for temperature and pressure equilibration during the simulation. It is important to define both tau-t and tau-p for both couplings to set the time constant for coupling (in picoseconds) and the target temperature (ref_t in Kelvin) and pressure (ref_p in bar) for the system. For this manuscript focusing on the SARS-CoV-2 SPIKE protein, the algorithms used were the nose-hoover (Evans & Holian, 1998) thermostat and the Parrinello-Rahman (Parrinello & Rahman, 1980) barostat for temperature and pressure, respectively. The selection and use of compatible ensembles for adequate monitoring of these variables is of high importance to take in account during the simulations (Hollingsworth & Dror, 2018; Hünenberger, 2005) but the usage of these ensembles is already validated and successfully used in past studies (Casarotto et al., 2021; Lupala, Ye, Chen, Su, & Liu, 2022; Mehdipour & Hummer, 2021).Glycosides linked to proteins are often on the solvent exposed surface and are known to be involved in the formation of water mediated hydrogen bonds (Stanca-Kaposta et al., 2008; Tachibana et al., 2004). Since the waterbox generation often creates a vacuum space between solute (protein) and solvent, it is relevant to allow for additional equilibration with restraints on the glycoprotein structure. This allows the water molecules in the vicinity of the carbohydrate-protein interface to relax and reconstitute the solvation shell, therefore reducing many artificial conformational changes in the glycosides.
Molecular dynamics simulation run
With the .tpr file generated by the gmx grommp command for the simulation, the command gmx mdrun can be used to generate the .xtc file.This command requires:-deffnm: set the default file name for all file output options-nt: total number of threads to start (make sure to never start more threads them your hardware as of number of cores)-s: portable xdr input file outputted by grompp (.tpr)-pin: tries to generate thread affinities to optimize run time of the simulation-dlb: optimizes the ratios of threads distributed between long and short distances interactions calculations-cpi: checkpoint file for appending to an interrupted run (.cpt)-append: append to a previously available output file when starting from a checkpoint fileAlt-text: Unlabelled BoxThe parameters in the .mpd files must be correctly set to perform 4 different steps:Minimization: a step performed to minimize the energy of the system, promoting solvent interactions and structure relaxation to represent better a biological context. The critical parameters to be set in this step is that the integrator for the simulation must be set to steep, the h-bonds can be used as constrains and no temperature and pressure coupling must be implemented in this step.Temperature equilibration: integrator set to md and apply the tcoupl using the nose-hoover ensemble to simulate the system in a specific temperature and modulate the forces in effect in the simulation taking in account the temperature effect on them. Tau_t was set to update every 1 ps.Pressure equilibration: same parameters set for the temperature equilibration, with addition of pcoupl = Parrinello-Rahman to encompass the pressure effect in the system. tau_p was set to update every 2 ps.Production: upon completing the equilibration step, the system may be released of all constrains and allowed to run for a defined simulation time (in this manuscript 100 ns were used for this step), using a longer time step for integration of 2 fs.Using this segmented relaxation of the system before production is of high relevance, as adding multiple variables in the step could lead to artifactual trajectories of protein dynamics (Walton & VanVliet, 2006).
Processing of trajectories and molecular dynamics outputs
The production step generates the .xtc and .trr files. These files must be preprocessed before the analysis. Indeed, during the simulation, the protein could present atoms jumping across the periodic boundary conditions (PBC) and that could present itself as artifacts to calculate solvent accessibility and simulation RMSD/RMSF, among other analyses. To correct this artifact, the PBC must be recentered in the simulation cell.To perform this correction, it is needed to generate an index using it to envelop all the sugars and protein chains in the same group, so the common center of mass can be centered in the PBC later on. That requires the extraction of the first frame of the trajectory, which is extracted by using the gmx trjconv command:-f: trajectory file (.xtc)-s: structure file (.gro)-o: output file name-dump 0 To extract the first frame of the simulation-sep to isolate the initial frame of the production stepAlt-text: Unlabelled BoxThen, to build the index for all the glycans in the protein, a group was created using gmx make_ndx -f firstframe.gro and selecting all the glycans added to the structure during the model building.To recenter the glycoprotein in the trajectory file in the simulation box, the gmx trjconv command is used twice during this step.1st: removing all jumps of bonded atoms across the PBC, making the glycoprotein as a whole but allowing atoms to diffuse out of the PBC.-pdc nojump-n: generated .ndx file with the protein and sugars grouped togetherAlt-text: Unlabelled Box2nd: set the whole glycoprotein center of mass to the center of the PBC using the nojump.tpr file and the group of glycans and protein described in the step before.-pbc mol-center-n: generated .ndx file with the protein and sugars grouped togetherAlt-text: Unlabelled BoxThe trajectory file with the PBC recentered in the glycoprotein can be used to check if the simulation variables agree with the input parameters. To perform this, the command gmx energy was used together with the .edr file as input to plot:Potential energy: in all steps of the simulation to check if the energy minimization was successful and if the energy in the system kept itself regular through the equilibrations and productions.Temperature: in the equilibrations and production steps to check if the system was correctly kept at the inputted temperature of ref_t.Pressure and density: in the pressure equilibration and production steps to check it the system was accurately equilibrated at the inputted pressure of ref_t and if the density of the system followed that trend.The output plots for the energy, temperature, pressure and density for the SPIKE glycoproteins analyzed in this manuscript are reported in Fig. 5
.
Fig. 5
output of the gmx energy for energy minimization (A), constant number volume and temperature (NVT) and constant number pressure and temperature (NPT) equilibration (first and second part of B, respectively), and production steps (C), for temperature during the NVT and NPT equilibration (D) and production steps (E), and pressure and density for the NPT (F) and production steps (G).
output of the gmx energy for energy minimization (A), constant number volume and temperature (NVT) and constant number pressure and temperature (NPT) equilibration (first and second part of B, respectively), and production steps (C), for temperature during the NVT and NPT equilibration (D) and production steps (E), and pressure and density for the NPT (F) and production steps (G).
Trajectory file analysis (RMSD/RMSF/H-bridges)
The recentered trajectory file (.xtc or .trr) was analyzed using the VMD v.1.9.4. (Humphrey, Dalke, & Schulten, 1996) to perform trajectory visualization and other analyses to evaluate the behavior and interactions of the different segments of the glycoproteins along the simulation.To load the trajectory file into the VMD software, a “new molecule” must be created and add the trajectory file. The molecule file browser (File > new molecule) is used to open the firstframe.gro and then loading the .xtc file into the molecule (Fig. 6A and B). The VMD will start loading frame by frame into the molecule. To improve the observation of the trajectory (representation > graphics…), we chose to hide the waters and ions of the representation, change the depiction of the protein to NewCartoon, and apply a 5 frames trajectory smoothing window to observe the trajectory (Fig. 6C–J).
Fig. 6
Analysis and visualization workflow using the VMD software. (A and B) Load the molecule and its trajectory into VMD. In VMD main, go to File - > New molecule (A). Load the first frame of the production simulation, browsing for the .gro file, then load the .xtc file into the molecule (B). (C–G) Define the representation of the molecule. Open the Graphical Representation panel by selecting Graphics → Representations… (C). In this panel (D), change the “selected atoms” panel to represent only the glycoprotein, by excluding the water atoms and the ions from the representation (E). Create a new representation, select only the protein atoms and change the “Drawing method” to “New Cartoon” (F). The final setup in the Graphical Representation panel is showed in (G). (H) All atom representation. (I) glycoprotein representation. (J) Glycoprotein representation using NewCartoon drawing method for the protein.
Analysis and visualization workflow using the VMD software. (A and B) Load the molecule and its trajectory into VMD. In VMD main, go to File - > New molecule (A). Load the first frame of the production simulation, browsing for the .gro file, then load the .xtc file into the molecule (B). (C–G) Define the representation of the molecule. Open the Graphical Representation panel by selecting Graphics → Representations… (C). In this panel (D), change the “selected atoms” panel to represent only the glycoprotein, by excluding the water atoms and the ions from the representation (E). Create a new representation, select only the protein atoms and change the “Drawing method” to “New Cartoon” (F). The final setup in the Graphical Representation panel is showed in (G). (H) All atom representation. (I) glycoprotein representation. (J) Glycoprotein representation using NewCartoon drawing method for the protein.To perform the calculations of RMSF/RMSD and search for H-bonds we used some command lines in the Tk console of VMD (Extensions“>”Tk console) as described below.##RMSD/RMSF setupset start [atomselect top "protein and backbone" frame 0]set current [atomselect top "protein and backbone"]set num_steps [molinfo top get numframes]set carbA [atomselect top "name CA"]##RMSDset outfile [open rmsd.dat w]for {set frame 0} {$frame < $num_steps} {incr frame} {$current frame $frameset trans_mat [measure fit $current $start]$current move $trans_matset rmsd [measure rmsd $current $start ]puts $outfile "$frame $rmsd"}close $outfile##RMSFset outfile [open rmsf.dat w]set sel [atomselect top "name CA"]set rmsf [measure rmsf $sel first 0 last [expr {$num_steps - 1}] step 1]for {set i 0} {$i < [$sel [expr {$num_steps - 1}]]} {incr i} {puts $outfile "[expr {$i + 1}] [lindex $rmsf $i]"}close $outfile## H-bondspackage require hbondshbonds -sel1 [atomselect top protein] -sel2 [atomselect top “all not water and not ion and not protein”] -writefile yes -plot yesAlt-text: Unlabelled BoxThis will generate 3 output files:hbonds-details.dat, referring to each residue that interacted with a glycan during the simulation and that interaction occupancy, Fig. 7
.
Fig. 7
Analysis of H-bridges content. (A) Relation of hydrogen bonds with above 20% occupancy and involving a glycan chain with a protein chain found in the simulation. BGLCN: N-acetyl-D-glucosamine; AFUC: L-fucose; A/BMAN: D-Mannose. (B) Graphical representation of a monomer of the SPIKE protein with its glycans. Aminoacid residues that presented a high occupancy H-bridge with glycans where highlighted in red and labeled using Pymol.
Analysis of H-bridges content. (A) Relation of hydrogen bonds with above 20% occupancy and involving a glycan chain with a protein chain found in the simulation. BGLCN: N-acetyl-D-glucosamine; AFUC: L-fucose; A/BMAN: D-Mannose. (B) Graphical representation of a monomer of the SPIKE protein with its glycans. Aminoacid residues that presented a high occupancy H-bridge with glycans where highlighted in red and labeled using Pymol.RMSD.txt, which outputs a comparison of a frame of the simulation with the first frame, Fig. 8A;
Fig. 8
Calculation of the root mean square deviation (A) during the simulation and root mean square fluctuation (B) per residue.
Calculation of the root mean square deviation (A) during the simulation and root mean square fluctuation (B) per residue.RMSF.txt, which represents the average fluctuations of each aminoacid in the model during the simulations, Fig. 8B.
SASA/AbASA calculations and analysis
To perform the SASA and AbASA measurements, the gmx sasa command was used within the GROMACS software (Eisenhaber, Lijnzaad, Argos, Sander, & Scharf, 1995):-f: the recentered trajectory file-s: the protein coordinates file-n: the generated index with the whole glycoprotein group (.ndx)-o: the output file name for total exposed area (.xvg)-or: the output file name for exposed area for each aminoacid residue (.xvg)-probe: radius of the solvent in nm (0.14 for water; 0.7 for Ab)Alt-text: Unlabelled BoxThe selected probe radius for Ab was based on the estimated size of the hypervariable loop performed by Grant, Montgomery, Ito, & Woods, 2020. The gmx sasa command performs a double cubic lattice method for surface area calculation and uses as input the group of interest to have its area measured and which group to be in the results. For the first two iterations of this command, the surface selection should be the whole glycoprotein group defined in the index file, and for the output the group selection that implies only the protein. This will generate a resarea.xvg file with the average GlySASA (for the probe radius of 0.14) and GlyAbASA (for the probe radius of 0.7) for each amino acid residue considering the presence of the oligosaccharides. Following this, a second calculation with the same inputs should be performed, using only the protein for both surface calculation and output writing SASA (for the probe radius of 0.14) and AbASA (for the probe radius of 0.7). This step is performed to evaluate the effect of site-specific glycosylation on solvent and antibody accessibility. Based on the ratios of GlyAbASA/AbASA for each residue, the glycan-dependent occlusion for the whole protein can be determined for each residue. Also, using the ratios of SASA/Max.SASA for each residue, being Max.SASA defined as the maximum available area per residue calculated by Tien, Meyer, Sydykova, Spielman, and Wilke (2013), it is possible to discern if a residue is buried into the structure (SASA/Max.SASA < 0.15). Comparing the GlyAbASA/AbASA in a specific region against the whole glycoprotein and removing from the analysis residues that are buried based on the SASA/Max.SASA < 0.15, it is possible to calculate which residues are statistically occluded (Fig. 9
). Other analysis can be derived from the output trajectories using this methodology, such as salt bridges, that could indicate protein stabilization, or radius of gyration, that is related to protein compactness, but they were not the scope of this project.
Fig. 9
Representation of Ab accessibility for each aminoacid residue. Residues in red indicate antibody occlusion dependent on the glycan chains. Residues in blue indicate solvent exposure in the protein without considering the glycan chains. Residues in green indicate asparagine with N-linked glycosylation. Blue box: RBD (465–630). Green box: Furin cleavage loop (675–692). Red box: Modulated region (792–804) with a significant shift toward higher glycan occlusion (P = 0.0003). Orange box: Heptad repeat 1 (920–970).
Representation of Ab accessibility for each aminoacid residue. Residues in red indicate antibody occlusion dependent on the glycan chains. Residues in blue indicate solvent exposure in the protein without considering the glycan chains. Residues in green indicate asparagine with N-linked glycosylation. Blue box: RBD (465–630). Green box: Furin cleavage loop (675–692). Red box: Modulated region (792–804) with a significant shift toward higher glycan occlusion (P = 0.0003). Orange box: Heptad repeat 1 (920–970).There is extensive literature discussing the relevance of the simulation length and sampling, which must be decided based on the hypothesis (Henzler-Wildman & Kern, 2007). Convergence in the carbohydrate structure conformations has been discussed as one of the main issues and, phenomena such as ring puckering were shown to require microsecond length simulations and cannot be addressed by nanoscecond length simulations (Sattelle, Hansen, Gardiner, & Almond, 2010). That being said, there are papers in the literature that successfully reported interesting findings by studying glycoprotein molecular dynamics trajectory files with the nanosecond timescale (Yokoyama et al., 2017; Bernardi, Kirschner, & Faller, 2017). In the latter timescale, it is not possible to explore all the glycan conformations that could be populated in the system; however, glycan-protein interactions can still be assessed by this methodology.
Results and discussion
The applied methodology was already performed in other MD simulations (Barros et al., 2021; Liguori, Croce, Marrink, & Thallmair, 2020; Mehdipour & Hummer, 2021; Turoňová et al., 2020). The analysis of the simulation energy files (Fig. 4) showed that the simulation correctly followed the defined parameters and the plot in JalView 2.11.1.7 (Waterhouse, Procter, Martin, Clamp, & Barton, 2009) allowed to evaluate and quantify the differential glycan shield protection in solvent-exposed residues (Fig. 9), such as the modulated regions in 792–804. We were also able to identify a large array of protein-glycan hydrogen bonds and assess their interactions inside the structure (Fig. 7, Fig. 9). This methodology can be applied to other glycoproteins to evaluate a large array of effects that the glycosylation might impose to the structure and accessibility of the protein in question, such as evaluating glycan promoted structure stability, glycan occlusion from solvent or antibody and ligand interaction.
Conclusions
The growing interest in protein glycosylation and structure to function relationship calls for reliable analytical and computational tools. MD simulation of glycoproteins can help in elucidating the role of oligosaccharides on protein structure, antibody binding, receptor interaction and function. In infectious diseases, glycoproteins can serve as chemotherapeutic targets and biomarkers giving a deeper understanding on the biological mechanisms of host-pathogen interaction. In this step-by-step guide, we provide a detailed view of the different computational platforms that can be used to perform MD simulation of glycoproteins focusing on the SARS-CoV-2 SPIKE glycoprotein. We believe that this guide will serve as a platform to start MD simulations of glycoproteins in several infectious diseases and help in improving the tools currently available.
Authors: David Van Der Spoel; Erik Lindahl; Berk Hess; Gerrit Groenhof; Alan E Mark; Herman J C Berendsen Journal: J Comput Chem Date: 2005-12 Impact factor: 3.376
Authors: Veronica T Chang; Max Crispin; A Radu Aricescu; David J Harvey; Joanne E Nettleship; Janet A Fennelly; Chao Yu; Kent S Boles; Edward J Evans; David I Stuart; Raymond A Dwek; E Yvonne Jones; Raymond J Owens; Simon J Davis Journal: Structure Date: 2007-03 Impact factor: 5.006
Authors: Erik Volz; Verity Hill; John T McCrone; Anna Price; David Jorgensen; Áine O'Toole; Joel Southgate; Robert Johnson; Ben Jackson; Fabricia F Nascimento; Sara M Rey; Samuel M Nicholls; Rachel M Colquhoun; Ana da Silva Filipe; James Shepherd; David J Pascall; Rajiv Shah; Natasha Jesudason; Kathy Li; Ruth Jarrett; Nicole Pacchiarini; Matthew Bull; Lily Geidelberg; Igor Siveroni; Ian Goodfellow; Nicholas J Loman; Oliver G Pybus; David L Robertson; Emma C Thomson; Andrew Rambaut; Thomas R Connor Journal: Cell Date: 2020-11-19 Impact factor: 41.582