Conformational variability of loops in the SARS-CoV-2 spike protein.

Abstract

The SARS-CoV-2 spike (S) protein facilitates viral infection, and has been the focus of many structure determination efforts. Its flexible loop regions are known to be involved in protein binding and may adopt multiple conformations. This article identifies the S protein loops and studies their conformational variability based on the available Protein Data Bank structures. While most loops had essentially one stable conformation, 17 of 44 loop regions were observed to be structurally variable with multiple substantively distinct conformations based on a cluster analysis. Loop modeling methods were then applied to the S protein loop targets, and the prediction accuracies discussed in relation to the characteristics of the conformational clusters identified. Loops with multiple conformations were found to be challenging to model based on a single structural template.

Keywords: COVID-19; conformational ensembles; decoy selection; loop modeling; protein structure prediction; sequence variants

Year: 2021 PMID: 34661307 PMCID: PMC8662175 DOI: 10.1002/prot.26266

Source DB: PubMed Journal: Proteins ISSN: 0887-3585

INTRODUCTION

The COVID‐19 disease is caused by the SARS‐CoV‐2 strain of coronavirus, and its spread has remained a concern since the first reported infections in late 2019. The SARS‐CoV‐2 viral genome encodes four main structural proteins: spike, envelope, membrane, and nucleocapsid. The spike (S) protein is of particular importance as it facilitates viral entry into host cells via its receptor binding domain (RBD), which recognizes human angiotensin‐converting enzyme 2 (ACE2). Current vaccines being administered achieve efficacy against SARS‐CoV‐2 by enabling the human body to produce a modified version of its S protein; this in turn induces the production of neutralizing antibodies against the disease. Toward the development of such therapeutic interventions, many structure determination efforts have focused on the S protein, with the first standalone experimental structure of the full‐length S protein obtained via cryo‐electron microscopy in mid‐February 2020. Soon thereafter, the structure of the S protein RBD bound in a complex with ACE2 was also determined. As of January 13, 2021, there were 203 structures deposited in the Protein Data Bank (PDB) associated with the SARS‐CoV‐2 S protein. These include studies of the standalone S protein, the S protein interacting with potential antibodies, and the S protein interacting with various forms of ACE2. Finally, with the emergence of S protein sequence variants, structures corresponding to mutations are also being studied, with D614G being a common example. While individual PDB structures generally provide static snapshots of protein conformations, it is well known that proteins exhibit dynamic movement. The local dynamics of atoms and residues are partially depicted via crystallographic B‐factors.
Larger motions are also possible: for the SARS‐CoV‐2 S protein, a well‐documented example is the ability of its RBD to adopt “up” (or open) and “down” (or closed) states, where the “up” state is the conformation capable of binding to ACE2. Overall then, the PDB is a rich source of data for examining the conformational variability of the S protein, given the number of times its structure has been solved experimentally. This article focuses on the loop conformations of the S protein. Protein loops are the flexible connecting regions between regular secondary structures, and are where protein disorder is most likely to occur. This greater disordered nature of loops may be manifest in a PDB structure via missing atomic coordinates or atoms with high B‐factors. Accurate structure prediction for loops is both challenging and necessary, to construct useful models for downstream therapeutic applications. Loops are of particular importance as they are often associated with protein function, such as providing binding recognition sites and facilitating protein–protein interactions. For example, an extended loop of the SARS‐CoV‐2 S protein RBD interacts directly with loops of ACE2, as evidenced by the PDB structure of the RBD‐ACE2 complex. Dynamic structural changes can occur both in larger regions of a protein (e.g., the SARS‐CoV‐2 RBD), as well as in individual loops adopting conformational rearrangements to carry out protein function in accordance with their environment. Thus, when a protein has been solved many times in the PDB, we may be able to observe distinct conformations among some of its loops, given their potential for disorder and structural variability. In particular for the SARS‐CoV‐2 S protein, the PDB also documents sequence variants arising from mutations to some of its loop regions, and the possible structural impacts of mutations can also be studied more broadly via computational methods. 
Mutations to the S protein are especially of concern as they can lead to more infectious variants of SARS‐CoV‐2. The task of structure prediction for flexible loops with multiple distinct conformations has been found to be more challenging than for rigid or inflexible ones. Most loop prediction methods are designed to identify the most likely conformation, for example, the one with the lowest potential energy. Such methods are typically trained on loop sets where a single conformation for each loop is taken from the PDB and assumed to represent the ground truth, and thus tend to be more successful at accurately predicting inflexible loops with one “correct” solution. Accuracy is typically measured by computing the root‐mean‐squared deviation (RMSD) of the backbone atoms from the predicted loop conformation to the corresponding one in the PDB. In order to study loops that can adopt multiple conformations, prediction methods might instead be applied to generate an ensemble of decoys, which often involves a combination of sampling and scoring steps. Then, the success of different methods could be assessed on the basis of whether their generated ensembles include decoys that are close to each of the known conformations. For the SARS‐CoV‐2 S protein, this kind of assessment is a good test of the ability of current methods to explore a range of likely conformations, especially if further mutations were to occur in the flexible loop regions. These considerations motivate the main contributions of this article. First, we identify the loop regions and sequence variants from the known PDB structures of the SARS‐CoV‐2 S protein, and use cluster analysis to classify each loop according to whether it has been observed to adopt multiple distinct conformations or a single conformation only. Second, we apply four current loop prediction methods to the identified loop regions, to generate ensembles of decoys for each one.
Third, we discuss the results of these methods and the effectiveness of their application to modeling the loops of the S protein, along with the insights gained via our analyses.

MATERIALS AND METHODS

Data preparation and selection of loop targets

The 3‐D structures of the SARS‐CoV‐2 S protein were downloaded from the PDB at the RCSB website (https://rcsb.org) on January 13, 2021, by navigating to the page in the “COVID‐19 coronavirus resources” section entitled “Spike proteins and receptor binding domains.” We extracted the S protein structures that are not bound to other molecules and have sequence length greater than 1000. This facilitates study of the S protein loop conformations within the context of a (mostly) full‐length S protein structure, while without explicit interaction with other proteins. A total of 63 S protein PDB structures satisfied these criteria, most of which are provided as S protein trimers. We treated each chain as an individual sample and thus extracted a total of 193 S protein chains. Some realignments of the corresponding amino acid sequences were required in order to keep the residue numbers consistent across all chains; this was accomplished with the ClustalO service in Jalview. For each S protein chain, we first used DSSP to determine the secondary structure classification of each residue. The eight‐state DSSP classification was reduced to the traditional three types of helix (H), sheet (E), and coil (C) following the conventions in the SPIDER3 secondary structure prediction method: we map DSSP's “G,” “H,” and “I” to H; “E” and “B” to E; the remaining three states are mapped to C. Due to structural variability, the classified type (H, E, or C) for a given residue position may not always agree among the 193 S protein chains. Thus, we define a loop region for our study as follows: a segment of five or more consecutive residues where over 50% of the protein chains at each position are classified as type C. Further, if two such segments are separated by only one E or H type residue (i.e., where less than 50% of the chains are type C at that position), we treat the two combined segments (including that connecting residue) as a single loop region. 
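The loop definition above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names `reduce_dssp` and `find_loop_regions` are hypothetical, the chains are assumed to be already aligned to equal length with no gaps, and the returned positions are 0-based alignment indices rather than S protein residue numbers.

```python
# Map the eight DSSP states to helix (H), sheet (E), or coil (C).
# G, H, I -> H and E, B -> E as stated in the text; every other state
# (assumed here to be T, S, and blank/'-') falls through to C.
DSSP_TO_3 = {"G": "H", "H": "H", "I": "H", "E": "E", "B": "E"}


def reduce_dssp(states):
    """Reduce a per-residue DSSP string to the three-state H/E/C alphabet."""
    return "".join(DSSP_TO_3.get(s, "C") for s in states)


def find_loop_regions(chains, min_len=5, threshold=0.5):
    """Return (start, end) pairs, 0-based and inclusive, of loop regions.

    `chains` is a list of equal-length three-state strings, one per chain.
    """
    n = len(chains[0])
    # A position counts as coil when over `threshold` of the chains label it C.
    coil = [sum(c[i] == "C" for c in chains) / len(chains) > threshold
            for i in range(n)]

    # Collect maximal runs of consecutive coil positions.
    runs, start = [], None
    for i, is_coil in enumerate(coil + [False]):  # sentinel closes the last run
        if is_coil and start is None:
            start = i
        elif not is_coil and start is not None:
            runs.append((start, i - 1))
            start = None

    # Keep segments of at least `min_len` residues, then merge neighbouring
    # segments that are separated by exactly one non-coil position.
    segments = [r for r in runs if r[1] - r[0] + 1 >= min_len]
    merged = []
    for seg in segments:
        if merged and seg[0] - merged[-1][1] == 2:
            merged[-1] = (merged[-1][0], seg[1])
        else:
            merged.append(seg)
    return merged
```

For instance, two five-residue coil segments separated by a single position that is helical in most chains are reported as one eleven-residue loop region.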
With the starting and ending positions of loops defined in this manner, we check for the presence of sequence variants in each loop region among the S protein chains. If multiple distinct residue sequences are observed for a loop region, we shall treat each unique sequence separately for further analysis. This allows us to document the possible impact of mutations on the loop conformations. Thus, we shall say that a loop instance consists of its starting and ending positions together with its unique residue sequence. We then consider the structural variability of each loop instance. To account for the potentially disordered nature and structural uncertainties of loops, we extract both the atomic coordinates and B‐factors from the PDB chains. Taking all chains that have no missing coordinates or B‐factors within the loop residues, we compute their pairwise RMSD matrix based on the loop's backbone (N, Cα, C, and O) atoms. The RMSD calculation is applied after the backbone atoms of the loop residues for each pair are optimally superimposed using the Kabsch algorithm. This is the “local RMSD” that compares the loop region only, and so is not sensitive to orientation differences in the rest of the structure. Based on that distance matrix, we apply hierarchical clustering with average linkage (UPGMA) and a distance cutoff of 1.5 Å to form initial clusters of loop conformations. Next, we incorporate B‐factors to ensure that the clusters formed are statistically distinct. Recall that the B‐factor can be expressed in terms of the mean‐square amplitude of atomic oscillations $\langle u^2 \rangle$ around their measured positions: $B = 8\pi^2 \langle u^2 \rangle$. Using an isotropic Gaussian approximation for the corresponding coordinate uncertainties, we can determine whether the backbone coordinates of a loop pair differ significantly at the 95% confidence level (see Appendix A for details).
If none of the chains in one cluster are significantly different from any chains in another cluster, we merge them into a single cluster. Clusters composed entirely of chains with poor structure resolution (>3 Å) after this step are removed from further analysis as the atomic coordinates are unlikely to be sufficiently reliable for making detailed structural comparisons. Each remaining cluster then represents a distinct group of S protein chains which have a similar conformation for that loop instance. We consider a loop instance to have multiple distinct conformations if this analysis results in two or more such clusters of conformations; otherwise, we say that loop instance essentially adopts only a single conformation. We select a representative from each cluster by taking the chain with resolution ≤3 Å that is closest to the geometric centroid of the cluster. Our full list of S protein loop targets for study thus consists of all the cluster representatives obtained from the above steps.
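The local RMSD and the initial clustering step can be sketched as follows. This is an illustrative sketch, not the authors' implementation: it assumes NumPy is available, the names `kabsch_rmsd`, `pairwise_rmsd`, and `upgma_clusters` are hypothetical, and each loop is assumed to be an (N, 3) array of backbone coordinates in matching atom order. The B‐factor significance test, the cluster merging, and the resolution filter are omitted because their details are in Appendix A.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate arrays after optimal superposition."""
    P = P - P.mean(axis=0)  # centre both point sets on their centroids
    Q = Q - Q.mean(axis=0)
    U, S, Vt = np.linalg.svd(P.T @ Q)
    # If the best orthogonal fit is a reflection, flip the smallest
    # singular value so that only proper rotations are considered.
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[-1] = -S[-1]
    msd = (np.sum(P * P) + np.sum(Q * Q) - 2.0 * S.sum()) / len(P)
    return float(np.sqrt(max(msd, 0.0)))


def pairwise_rmsd(loops):
    """Symmetric matrix of local RMSDs between all pairs of loop structures."""
    n = len(loops)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = kabsch_rmsd(loops[i], loops[j])
    return dist


def upgma_clusters(dist, cutoff=1.5):
    """Average-linkage (UPGMA) clustering, stopped at a distance cutoff in Å."""
    clusters = [[i] for i in range(len(dist))]

    def avg(a, b):
        # Mean of all pairwise distances between members of two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        d, x, y = min((avg(clusters[x], clusters[y]), x, y)
                      for x in range(len(clusters))
                      for y in range(x + 1, len(clusters)))
        if d > cutoff:
            break  # the closest remaining pair is farther apart than the cutoff
        merged = clusters.pop(y)  # y > x, so index x is unaffected
        clusters[x].extend(merged)
    return clusters
```

Calling `upgma_clusters(pairwise_rmsd(loops))` returns lists of chain indices, one list per initial cluster. A library routine for hierarchical clustering would be the usual choice for large inputs; the explicit loop here is only meant to make the average-linkage rule visible.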

Loop modeling methods

To study the conformational variability of the identified S protein loop targets, we make use of several loop modeling methods. We focus on methods that incorporate sampling‐based techniques for loop construction, which are suitable for stochastically generating an ensemble of decoys that represent plausible conformations for a loop. We include Rosetta's next‐generation KIC (NGK) algorithm, the DiSGro algorithm, and the PETALS algorithm, which are ab initio methods that explore the conformational space with the guidance of an energy or scoring function; these do not directly make use of any structure templates of known loop conformations. We also include the Sphinx algorithm, which is a hybrid method that begins with loop structure fragments obtained from sequence alignment and then completes the loop construction by ab initio sampling. Using each of the methods, we generate an ensemble of 500 decoys for each loop target. The input (or template) structure is the loop target's representative PDB chain, prepared by removing the coordinates of the loop residues: following loop modeling conventions, we treat the backbone atoms from the starting residue's C atom to the ending residue's C atom as unknown. The generated decoys are compared with the loop structures from each known conformation for that loop region. The backbone RMSD is used to assess the accuracy of the decoys. Two types of RMSDs are calculated, as in Choi and Deane: local RMSD (which superimposes the backbone of the loop residues, as in Section 2.1) and global RMSD, which superimposes the backbone atoms of the two residues on either side of the loop (rather than the backbone of the loop residues themselves) prior to the calculation. Global RMSD, as often reported in loop modeling studies, also considers the decoy's orientation to the rest of the structure.
For loop regions with multiple conformations or mutations, decoy generation is carried out multiple times, once using each representative PDB as input; taken together, we may thus assess whether decoys generated from different PDB inputs have good coverage of the conformational space for that loop region. The scoring function associated with each method provides a ranking of its 500 generated decoys for a loop target. Thus, it is of interest to assess how well each method's top‐ranking decoys can predict the possible conformations of the loop region. We use three RMSD statistics for this purpose: (a) lowest RMSD among the 500 decoys, (b) RMSD of the top‐ranked decoy, and (c) lowest RMSD among the top‐five ranked decoys. The first RMSD statistic evaluates the method according to its ability to construct native‐like conformations, without regard to whether its scoring function can select the best prediction. The second RMSD statistic corresponds to typical loop modeling assessment, where the top‐ranked decoy is selected as the prediction. However, this approach of selecting a single prediction would be less informative if the loop region has multiple conformations. Thus, we also use the third RMSD statistic: by selecting multiple (i.e., the top five) decoys, we can examine whether these top‐ranking decoys are structurally distinct and accurately represent the different known conformations. We briefly describe how each of the loop modeling methods is run. The NGK algorithm is included in the Rosetta protein modeling suite (available at https://www.rosettacommons.org/), and we used the version provided in Rosetta release 2020.50 on December 18, 2020. NGK improves on a previous kinematic closure method, which consists of local conformational sampling and Monte Carlo minimization steps performed over two (coarse and full‐atom) stages. 
The program outputs the lowest energy loop structure found in each run, and so to obtain the desired ensemble of decoys we ran the program 500 times, following the recommended settings in the online guide (https://guybrush.ucsf.edu/benchmarks/benchmarks/loop_modeling). The DiSGro algorithm uses a distance‐guided sequential chain‐growth method to stochastically sample loop structures. We ran the authors' program to generate 100 000 conformations for the best possible coverage of the conformational space, then used their scoring function to select the 500 decoys with the lowest energy. The PETALS algorithm uses a sequence of propagation and filtering steps to explore the conformational space and locate low‐energy structures. We ran the authors' program with 60 000 seeds and outputted 30 000 decoys, then used an updated scoring function to select the 500 top‐ranked decoys, see Appendix B for details. The Sphinx algorithm begins by searching a database for suitable fragments according to loop sequence alignments; loop decoy backbones are then constructed by sampling and ranked with a coarse‐grained energy function, after which side chains are added and SOAP‐Loop is used to obtain the final ranking of decoys. Sphinx is hosted on the SAbPred server, for which we automated the loop target submissions and used the “general protein” option; no PDB blacklist was necessary as the fragment database had not yet been updated to contain any COVID‐19 S protein structures.
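The three RMSD statistics described above can be computed from one ensemble of scored decoys as in the following sketch. The function name `rmsd_statistics` is hypothetical, and the sketch assumes that a lower score means a better rank, which holds for energy-like scores but would need to be reversed for methods whose scores increase with quality.

```python
def rmsd_statistics(scores, rmsds, top_n=5):
    """Return (lowest, top-ranked, best-of-top-n) RMSDs for one decoy ensemble.

    scores[i] is the method's score for decoy i (lower is assumed better) and
    rmsds[i] is that decoy's RMSD to the reference loop conformation.
    """
    # Rank decoy indices from best score to worst score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    lowest = min(rmsds)                                  # statistic (a)
    top = rmsds[ranked[0]]                               # statistic (b)
    top_n_best = min(rmsds[i] for i in ranked[:top_n])   # statistic (c)
    return lowest, top, top_n_best
```

By construction, statistic (a) is never larger than (c), and (c) is never larger than (b), so the gap between them indicates how much accuracy is lost at the ranking step rather than at the sampling step.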

RESULTS AND DISCUSSION

Loop targets of the SARS‐CoV‐2 S protein

Applying the procedures in Section 2.1 to the 193 standalone S protein chains, a total of 44 loop regions were identified in the SARS‐CoV‐2 S protein. Their starting and ending residue positions are listed in the first column of Table 1. Of the 44 loops, 32 lie within the S1 subunit, with 13 in the N‐terminal domain and 11 in the RBD; for example, loops 475–487 and 495–506 have been previously noted to form contacts with ACE2 during binding. Loop sequences are shown in the second column of Table 1. There are five loop regions with sequence variants in the PDB: 380–394, 410–416, 600–608, 614–620, and 891–897. For these loop regions, the most common variant in the PDB is shown first, followed by the other variants which have their mutated residue indicated in bold. The mutation that has received the most attention thus far is D614G. In total, there are 50 loop instances, that is, combinations of a loop's residue positions and unique amino acid sequence. The third column of Table 1 shows the number of PDB chains that contain a complete backbone (i.e., atomic coordinates and B‐factors) for each loop instance.

TABLE 1

Region	Sequence	#Chains	Representative conformations
14–27	QCVNLTTRTQLPPA	36	6zgeA(24), 7dddC(12)
31–46	SFTRGVYYPDKVFRSS	185	7a4nB(185)
56–60	LPFFS	185	6xr8A(185)
66–83	HAIHVSGTNGTKRFDNPV	11	none (all PDBs >3 Å resolution)
108–116	TTLDSKTQS	169	6zoxB(169)
130–140	VCEFQFCNDPF	168	6xluB(145), 7kdkC(5), 7kdlA(4)
146–168	HKNNKSWMESEFRVYSSANNCTF	38	6zgiB(27), 7dddC(9)
172–187	SQPFLMDLEGKQGNFK	52	7df3B(39), 6zp0B(12)
210–222	INLVRDLPQGFSA	154	6vxxA(152)
230–236	PIGINIT	185	6vxxA(185)
245–263	HRSYLTPGDSSSGWTAGAA	26	6zgiB(24)
280–284	NENGT	185	6x79B(185)
304–310	KSFTVEK	185	7a4nB(185)
320–324	VQPTE	185	6zoxC(181), 6xm3A(4)
329–338	FPNITNLCPF	180	6x29A(155), 7kdlB(21)
343–348	NATRFA	181	6zgeC(181)
370–375	NSASFS	182	6vxxA(139), 6zgiC(42)
380–394	YGVSPTKLNDLCFTN	170	7kdlC(164)
	YGVCPTKLNDLCFTN	12	6x79B(12)
410–416	IAPGQTG	179	7kdkA(178)
	IAPCQTG	3	6zoxB(3)
422–430	NYKLPDDFT	182	6xr8B(178), 6xm0B(2)
438–451	SNNLDSKVGGNYNY	93	6xr8A(85), 7kdlB(4)
454–472	RLFRKSNLKPFERDISTEI	96	6zgeC(95)
475–487	AGSTPCNGVEGFN	92	7dddA(87), 6xm0B(1)
495–506	YGFQPTNGVGYQ	124	6zp0A(118), 6xm0B(2), 7kdlB(3)
517–523	LLHAPAT	168	6zoxA(163), 6xm0A(2), 6xm0B(1), 6xm3A(2)
526–537	GPKKSTNLVKNK	181	7ad1B(26), 6x29B(154)
555–564	SNKKFLPFQQ	185	7kdkC(185)
578–583	DPQTLE	185	6zoxB(185)
600–608	PGTNTSNQV	170	7kdlA(169)
	PGTNTSNEV	12	none (all PDBs >3 Å resolution)
614–620	DVNCTEV	103	6xm4C(98)
	GVNCTEV	42	7kdkA(42)
	NVNCTEV	6	7a4nB(6)
624–641	IHADQLTPTWRVYSTGSN	26	6xm0B(18)
656–663	VNNSYECD	185	7kdkB(185)
697–710	MSLGAENSVAYSNN	185	6vxxB(185)
783–816	AQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRS	144	6zp0C(142)
825–836	KVTLADAGFIKQ	39	6xluB(2), 6xm3B(5), 6xm3C(1), 6zgiA(25)
841–848	LGDIAARD	43	6xluC(6), 6xm4B(1), 6zgeB(20), 6xm3B(6), 7dddB(6)
862–866	PPLLT	185	6zoxB(185)
891–897	GAALQIP	176	7kdkB(176)
	GPALQIP	9	7a4nB(9)
908–913	GIGVTQ	185	7a4nB(185)
968–976	SNFGAISSV	188	6zp0C(185), 6xraC(3)
1033–1046	VLGQSKRVDFCGKG	188	7kdkA(188)
1106–1112	QRNFYEP	188	7kdkC(188)
1124–1132	GNCDVVIGI	188	6xm0A(185), 6xraC(3)
1135–1141	NTVYDPL	161	7kdkB(158), 6xraC(3)

Abbreviation: PDB, Protein Data Bank.

SARS‐CoV‐2 S protein loops. The first column shows the starting and ending positions of each identified loop region. The second column shows the loop sequences; if there are sequence variants in the PDB, the most common variant is listed first, and other variants have their mutated residues marked in bold. The number of PDB chains containing that loop instance is shown in the third column. The rightmost column lists the representative PDB chains for each loop instance; if a loop instance has multiple conformations, each chain listed corresponds to one distinct conformation (cluster). The number of PDB chains represented by each cluster is shown in parentheses; these may not sum up to the third column since clusters with poor structure resolution (all chains >3 Å) are omitted.

The final column lists the representative PDB chains for each loop instance, obtained by the procedure for constructing clusters as described in Section 2.1. Thus, for example, there are 180 S protein chains that contain the loop at positions 329–338; clustering by pairwise RMSD identified two distinct conformations among structures with resolution ≤3 Å; 6x29A and 7kdlB were chosen to represent these clusters (which included 155 and 21 chains, respectively), being the chains with resolution ≤3 Å closest to the cluster centroids. We illustrate the 329–338 loop example in the top panels of Figure 1: a histogram of all pairwise RMSDs of the loop backbone (among the 180 S protein chains that contain this loop) is shown on the left, while a close‐up of the part of the S protein chain containing the loop is shown on the right. The histogram shows distinct peaks at pairwise loop RMSDs of 0.4–0.6 Å and 2.0–2.4 Å, from which clustering identified the two distinct conformations colored dark blue and turquoise.
In contrast, the bottom panels of Figure 1 show another length 10 loop region (555–564) but with little structural variability: the pairwise RMSDs do not exceed around 1.5 Å and clustering identified just one main conformation (colored in red).

FIGURE 1

Two examples of SARS‐CoV‐2 S protein loops of length 10: 329–338 (top panels) and 555–564 (bottom panels). The histograms (left panels) show the pairwise root‐mean‐squared deviations (RMSDs) of the loop backbone among all S protein chains containing that loop: it can be seen that 329–338 exhibits higher structural variability than 555–564, due to the presence of two distinct clusters. The right panels display close‐ups of the representative loop conformations: 329–338 has two distinct conformations, colored in dark blue and turquoise; 555–564 has essentially one conformation, colored in red.

The initial hierarchical clustering step resulted in 137 clusters for the 50 loop instances. Based on the B‐factor calculations, 17 of the 137 clusters did not have statistically distinct atomic coordinates compared to other clusters, and so merging these resulted in 120 clusters. All of the 17 clusters being merged had also failed to contain structures with sufficient resolution (≤3 Å). A further 45 of the 120 clusters contained no ≤3 Å structures, which led to two of the loop instances being omitted: 66–83 and 600–608 with the Q607E mutation. The final 75 clusters thus covered 48 loop instances; 17 of the 48 had multiple distinct conformations (ranging from 2 to 5). By choosing the centroid of each cluster as its representative conformation, a diverse set of 41 different PDB chains with ≤3 Å resolution can be seen in Table 1. It should be noted that the exact number and composition of clusters will depend on the algorithm (i.e., cutoff and criterion) chosen. Here, using a cutoff of 1.5 Å with UPGMA, the average RMSD between members of different clusters will be at least 1.5 Å. For example, if we used a cutoff of 1.5 Å with WPGMA instead, 42 of the 50 loop instances maintain the same final clustering results; WPGMA would have found 82 representative conformations for the 48 loop instances.
Overall, we consider the clusters in Table 1 to provide a fairly stable characterization of the structural variability present in these loops. The final 75 clusters in Table 1 differ in their size and within‐cluster variation. There were 4 singleton clusters (defined by a single chain only), and 61 clusters were defined by at least four chains and two distinct PDB codes (and often significantly more). These high chain counts per cluster enable more cluster statistics to be examined, compared to related studies, for example, Marks et al., where clusters were defined by at most five chains (except in one case). Here, loop instances with multiple conformations tend to have a dominant cluster that is defined by at least two‐thirds of the available chains; the one exception is 841–848, which is also the most structurally variable loop with five distinct clusters. For each of the 61 well‐represented clusters, we computed the average within‐cluster RMSD (i.e., between all pairs of members in that cluster) as a measure of its breadth of movement, and a histogram is shown in Figure 2. The average breadth over all 61 clusters is 0.72 Å. The list of clusters grouped according to their breadth d is shown in Table 2, where 15 clusters are fairly tight with d ≤ 0.5 Å, 36 clusters have 0.5 < d ≤ 1.0 Å, and the 10 loosest clusters have d > 1.0 Å. It might be expected that shorter loops tend to form tighter clusters as they have a smaller conformational space; indeed, this pattern can be seen as the average loop lengths of clusters in these three groups are 6.5, 12.1, and 13.0, respectively. The larger clusters also tend to be tighter: the average cluster sizes in these three groups are 127, 108, and 49, respectively. However, we note that these are overall patterns only; for example, the cluster for the longest loop 783–816 is defined by 142 chains and has only a moderate d = 0.81 Å.
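The breadth measure can be written down directly from its definition. The sketch below is illustrative only: `cluster_breadth` and `breadth_group` are hypothetical names, `dist` is assumed to be a symmetric matrix of pairwise local RMSDs in Å, and `members` is a list of chain indices belonging to one cluster.

```python
from itertools import combinations


def cluster_breadth(dist, members):
    """Average pairwise RMSD (in Å) among the members of one cluster."""
    pairs = list(combinations(members, 2))
    if not pairs:
        # A singleton cluster has no pairs, so its breadth is undefined.
        raise ValueError("breadth needs at least two cluster members")
    return sum(dist[i][j] for i, j in pairs) / len(pairs)


def breadth_group(d):
    """Assign a breadth d (in Å) to one of the three groups used in Table 2."""
    if d <= 0.5:
        return "d <= 0.5"
    if d <= 1.0:
        return "0.5 < d <= 1.0"
    return "d > 1.0"
```

With these definitions, the cluster for loop 783–816 (d = 0.81 Å) falls into the middle group, in agreement with Table 2.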

FIGURE 2

TABLE 2

Clusters grouped according to their breadth of movement d as defined by their average within‐cluster RMSDs. Each cluster is listed based on its representative conformation (Table 1) together with its starting and ending residues. The average loop length and size of clusters in the three groups are shown in the rightmost columns

Breadth (d)	Clusters	Avg. length	Avg. size
d ≤ 0.5 Å	6xr8A_56_60, 6x79B_280_284, 7a4nB_304_310, 6zoxC_320_324, 6xm3A_320_324, 7kdkA_614_620, 7a4nB_614_620, 7kdkB_656_663, 7dddB_841_848, 6zoxB_862_866, 7kdkB_891_897, 7a4nB_891_897, 7a4nB_908_913, 6zp0C_968_976, 7kdkC_1106_1112	6.5	127
0.5 < d ≤ 1.0 Å	6zgeA_14_27, 7a4nB_31_46, 6zoxB_108_116, 6xluB_130_140, 7kdlA_130_140, 7dddC_146_168, 7df3B_172_187, 6zp0B_172_187, 6vxxA_230_236, 6x29A_329_338, 7kdlB_329_338, 6zgeC_343_348, 6vxxA_370_375, 6zgiC_370_375, 7kdlC_380_394, 6x79B_380_394, 7kdkA_410_416, 6xr8B_422_430, 6xr8A_438_451, 6zgeC_454_472, 6zp0A_495_506, 7ad1B_526_537, 6x29B_526_537, 7kdkC_555_564, 6zoxB_578_583, 7kdlA_600_608, 6xm4C_614_620, 6xm0B_624_641, 6vxxB_697_710, 6zp0C_783_816, 6xm3B_825_836, 6zgiA_825_836, 6zgeB_841_848, 7kdkA_1033_1046, 6xm0A_1124_1132, 7kdkB_1135_1141	12.1	108
d > 1.0 Å	7dddC_14_27, 7kdkC_130_140, 6zgiB_146_168, 6vxxA_210_222, 6zgiB_245_263, 7kdlB_438_451, 7dddA_475_487, 6zoxA_517_523, 6xluC_841_848, 6xm3B_841_848	13.0	49

Abbreviation: RMSD, root‐mean‐squared deviation.

Figure 2. The amount of within‐cluster variation for the 61 clusters defined by at least four chains and two distinct Protein Data Bank (PDB) codes. The breadth of movement observed within a cluster is measured by its average within‐cluster root‐mean‐squared deviation (RMSD); 36 of the clusters have an average between 0.5 and 1 Å.

It is well‐known that the SARS‐CoV‐2 RBD as a whole can adopt an “up” or “down” conformational state. Here, 7 of the 17 loop instances with multiple conformations were located within the RBD. Notably, both 475–487 and 495–506 which interact with ACE2 are among these. Thus, we examined whether this higher propensity for multiple conformation loops within the RBD might be associated with the chains having an “up” or “down” RBD state, even when the S protein chain is considered in isolation. We took PDB 6zge, where it is known that chain A has a “down” RBD and chain B has an “up” RBD. Then, each of the 193 S protein chains was classified as “up” or “down” according to whether its backbone RMSD to 6zgeB or 6zgeA was smaller. Based on this criterion, the loop at 370–375 has both distinct conformations coming from “down” RBD chains, while four other loops with two conformations (329–338, 422–430, 438–451, 475–487) indeed have one conformation associated with the “up” state and the other associated with the “down” state. Of the two remaining loops, 495–506 has one conformation from a “down” RBD and two from an “up” RBD, while 517–523 has two conformations from each.
Overall then, five RBD loop regions have structures that do not vary significantly with the RBD state (370–375 and the four single conformation loops in the RBD), while the other six do potentially vary. Five loop regions had sequence variants present in the PDB, each consisting of a single point mutation. All of these loop instances had only a single conformation. Taking the representative chain for each sequence variant listed in Table 1, we computed the local loop backbone RMSD between the representatives and the results are shown in Table 3. For example, for the loop region 380–394, the sequence variants are S and C at position 383, represented by 7kdlC and 6x79B respectively; these structures have backbone RMSD 0.54 Å computed on the loop residues. For the loop 600–608, there were no high‐resolution PDB structures containing the Q607E mutation. Overall, these sequence variants do not have large impacts on the loop conformations with observed backbone differences all <1 Å, such that the conformational space of these loop regions (including variants) could be represented by a single cluster.

TABLE 3

Region	Sequence 1	Sequence 2	RMSD
380–394	YGVSPTKLNDLCFTN	YGVCPTKLNDLCFTN	0.54
410–416	IAPGQTG	IAPCQTG	0.40
614–620	DVNCTEV	GVNCTEV	0.67
614–620	DVNCTEV	NVNCTEV	0.62
614–620	GVNCTEV	NVNCTEV	0.51
891–897	GAALQIP	GPALQIP	0.23

Abbreviations: PDB, Protein Data Bank; RMSD, root‐mean‐squared deviation.

Backbone RMSDs between the PDB chains representing the different sequence variants, in loop regions where mutations are present. Local RMSDs are computed on the loop residues. The residues that differ between the sequence variants are highlighted in bold.

Four of the loop targets were omitted from consideration for loop modeling, as all of their PDB chains were missing a residue immediately next to the loop: 14–27 (both conformations missing residue 13), and 614–620 with the D614G and D614N mutations (both missing residue 621). Thus, the loop modeling methods were applied to a total of 71 targets.

Loop modeling results

The four methods described in Section 2.2 were applied to model the conformations of the 71 loop targets identified in Section 3.1. Of these, 66 targets could be run successfully using all four methods. NGK and PETALS completed decoy generation for all 71 targets, while DiSGro completed 68 targets and Sphinx completed 66 targets. We focus the discussion on the results of the 66 loop targets for which all the methods could successfully generate decoys; the 5 remaining cases are discussed briefly at the end. First, we assess the ability of methods to predict a correct loop structure. We define this loop prediction accuracy by calculating the RMSD to the closest loop structure among all chains containing that loop instance. Thus, for this task, a good prediction can be close to any cluster member among any of the loop's known conformations (clusters), which accounts for the possible within‐cluster variation (Figure 2) and treats loop structures in all the chains as an equi‐energetic ensemble. Loop targets representing regions with multiple conformations can score well by this definition as long as a method can predict any one of the known conformations. For example, there are three targets for the loop 130–140 corresponding to its three conformations, represented by 6xluB, 7kdkC, and 7kdlA; decoys generated using 6xluB as input are compared to loop structures in all 154 chains of the three clusters combined, and likewise for 7kdkC and 7kdlA. We categorized the targets according to whether they belong to loop instances with multiple conformations or not; these categories are denoted as “Multiple conf.” and “Single conf.” in Table 4, containing 40 and 26 loop targets, respectively. Table 4 displays the three RMSD statistics described in the Section 2—lowest RMSD among the 500 decoys, RMSD of the top‐ranked decoy, and lowest RMSD among the top‐five ranked decoys—using both local and global RMSD calculations and averaged over the loop targets for each method. 
On average, all four methods can generate decoys at <1 Å local RMSD and <1.5 Å global RMSD from a correct structure. However, it remains difficult to correctly rank the generated decoys, with the RMSD of the top‐ranked decoy often substantially higher than that of the best decoy available. When each method is allowed to choose five decoys, it is more likely that at least one of the five is close to a correct structure; for example, NGK's average accuracy improves from 2.31 to 1.60 Å (global RMSD). Further, the difficulty of the loop prediction task tends to vary by target category: for all four methods, the average top decoy RMSD for loops with multiple conformations is higher than for single conformation loops, whether considering local or global RMSDs.
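The three summary statistics reported in Table 4 can be sketched in a few lines. This is a minimal illustration and not the authors' code: the function name `summarize_decoys`, the input layout (one row per decoy, sorted by the method's own ranking, one column per known loop structure), and the toy numbers are all assumptions made for the example.

```python
from typing import Dict, List


def summarize_decoys(rmsd_rows: List[List[float]]) -> Dict[str, float]:
    """Summarize one loop target.

    rmsd_rows[i][j] is the RMSD (in Angstroms) between decoy i and known
    loop structure j. Rows are assumed to be sorted by the method's own
    ranking, best-ranked decoy first (a hypothetical input layout).
    """
    if not rmsd_rows or not all(rmsd_rows):
        raise ValueError("need at least one decoy and one reference structure")
    # Accuracy of a decoy = RMSD to the closest known structure in any chain.
    closest = [min(row) for row in rmsd_rows]
    return {
        "min": min(closest),       # best decoy in the whole set
        "top": closest[0],         # decoy ranked first by the method
        "top5": min(closest[:5]),  # best among the five top-ranked decoys
    }


# Toy example (made-up values): six ranked decoys, three known structures.
toy = [
    [2.4, 3.1, 2.9],
    [1.8, 2.2, 3.0],
    [3.5, 1.2, 2.7],
    [2.0, 2.6, 2.1],
    [1.9, 2.8, 3.3],
    [0.7, 1.5, 2.2],
]
stats = summarize_decoys(toy)
print(stats)
```

In this toy case the best decoy overall is ranked last, so "min" is much lower than "top", which mirrors the ranking difficulty described above.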

TABLE 4

		Local RMSD			Global RMSD
Method	Target category	Min.	Top	Top‐5	Min.	Top	Top‐5
DiSGro	Single conf.	0.76	1.81	1.28	0.97	2.66	1.73
	Multiple conf.	0.96	1.95	1.56	1.47	3.60	2.95
	All	0.88	1.90	1.45	1.27	3.23	2.47
NGK	Single conf.	0.42	1.06	0.85	0.58	1.93	1.62
	Multiple conf.	0.66	1.42	1.08	1.07	2.55	1.59
	All	0.56	1.28	0.99	0.87	2.31	1.60
PETALS	Single conf.	0.68	1.24	0.98	0.98	2.06	1.51
	Multiple conf.	0.85	1.58	1.33	1.42	3.00	2.32
	All	0.78	1.44	1.19	1.25	2.63	2.00
Sphinx	Single conf.	0.64	1.49	1.15	1.11	2.75	2.09
	Multiple conf.	0.74	1.77	1.31	1.34	3.53	2.46
	All	0.70	1.66	1.25	1.25	3.22	2.31

RMSD metrics for assessing the loop prediction accuracy of the four methods. The loop backbone RMSDs shown are averaged over single conformation targets (n = 26), multiple conformation targets (n = 40), and all targets (n = 66). The columns “Min.,” “Top,” and “Top‐5” refer, respectively, to the lowest RMSD among the 500 decoys, RMSD of the top‐ranked decoy, and lowest RMSD among the top‐five ranked decoys. Prediction accuracy is defined as the RMSD to the closest loop structure among all chains containing that loop instance.

Abbreviations: PDB, Protein Data Bank; RMSD, root‐mean‐squared deviation.

To visualize these results, the global RMSD of the top decoy is plotted against loop length for each method in Figure 3. It is clear that the prediction difficulty and the variance of prediction RMSDs tend to increase with loop length, with methods consistently achieving <2 Å RMSD accuracy only for the shortest loops (≤6 residues). This is sensible since the size of the conformational space increases with loop length, with long loops (>12 residues) often posing a challenge for methods to sample adequately. The plots also indicate that the hardest targets for a given loop length tend to be those from multiple conformations, especially for the two most accurate methods (NGK and PETALS). The average lengths of loop targets in the “Single conf.” and “Multiple conf.” categories are similar (9.7 vs. 10.0 residues). The detailed results for each target individually are given in Table S1 of the Supporting Information.

FIGURE 3

Loop prediction accuracy for each of the four methods, visualized by plotting the global root‐mean‐squared deviation (RMSD) of the top decoy versus loop length. Prediction difficulty increases with loop length, with methods consistently achieving <2 Å RMSD only for the shortest loops (≤6 residues). The hardest targets for a given loop length tend to be those from multiple conformations, especially for the two most accurate methods (NGK and PETALS). Slight jitter is added along the x‐axis to the points for readability.

If one is allowed to select the best prediction among all targets for a loop instance, then results for loops with multiple conformations improve dramatically (e.g., taking the lowest RMSD of all decoys generated from 6xluB, 7kdkC, and 7kdlA together as the result for the loop 130–140); the average global RMSD for the top decoy in multiple conformation loops decreases to just 1.05 Å for NGK and 1.74 Å for PETALS. However, this is generally not a realistic scenario in practice, as often just a single template would be available for constructing predictions. In this sense, our findings on the difficulty of predicting multiple conformation loops are less categorical compared to Marks et al. for the targets in this S protein dataset. For these S protein targets, multiple conformation loops are more difficult to predict when a single template is used, but not when we can choose the best prediction among all available templates; for the dataset considered by Marks et al., the difficulty still remained when choosing the best prediction among all available templates, albeit accounting for less possible within‐cluster variation as their clusters had much less representation in the PDB.

In addition to loop length, we also examine whether the cluster characteristics, namely their size (as measured by the number of chains) and breadth (as measured by the average within‐cluster RMSD in Figure 2), are associated with prediction difficulty.
For each method, we consider a target to be successfully predicted if the top decoy has a global RMSD of <2 Å, and to be a failure otherwise. Based on this criterion, DiSGro, NGK, PETALS, and Sphinx had 25 (48%), 35 (67%), 31 (60%), and 25 (48%) successes, respectively, out of the 52 loop targets representing conformational clusters defined by at least four chains and two distinct PDB codes. We use the Welch t test to provide a simple assessment of whether the mean of each variable is significantly different between successes and failures, and the results are shown in Table 5 for the four methods. The sign of the t‐statistic indicates whether successes (positive t‐statistic) or failures (negative t‐statistic) are associated with larger values of that variable; for example, the t‐statistics for loop length are all negative, so successes are associated with shorter loop lengths as expected from Figure 3. Each of the three variables is significantly associated with prediction success (p < .01 for all tests, except cluster size for the Sphinx method with p = .011). Targets with longer loop lengths, smaller cluster sizes, and larger cluster breadths tend to be more difficult to predict successfully, regardless of which loop modeling method is used.
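For readers unfamiliar with the Welch test, the statistic and its degrees of freedom can be computed as in the sketch below. It is an illustration only: the function name `welch_t` and the loop lengths are hypothetical and are not data from this study, and the p‐value is omitted because it requires the CDF of the t distribution.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(successes: Sequence[float], failures: Sequence[float]) -> Tuple[float, float]:
    """Return (t, df) for Welch's unequal-variance t-test.

    A positive t means the successes have the larger mean, matching the
    sign convention used in the text.
    """
    n1, n2 = len(successes), len(failures)
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least two observations")
    # Squared standard error contributed by each group.
    v1 = variance(successes) / n1
    v2 = variance(failures) / n2
    if v1 + v2 == 0:
        raise ValueError("both groups have zero variance")
    t = (mean(successes) - mean(failures)) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical loop lengths, for illustration only (not data from the study).
success_lengths = [4.0, 5.0, 6.0, 7.0, 8.0]
failure_lengths = [9.0, 11.0, 12.0, 14.0]
t_stat, dof = welch_t(success_lengths, failure_lengths)
print(f"t({dof:.1f}) = {t_stat:.2f}")
```

The negative statistic in this toy case indicates that the successes are the shorter loops. In practice, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and also returns the p‐value.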

TABLE 5

	Welch t‐test results
Variable	DiSGro	NGK	PETALS	Sphinx
Loop length	t(41.1) = −6.32, p < .001	t(30.3) = −3.53, p = .0015	t(36.5) = −5.20, p < .001	t(49.4) = −3.23, p = .0022
Cluster size	t(49.2) = 4.18, p < .001	t(31.1) = 3.91, p < .001	t(41.9) = 3.10, p = .0034	t(5.0) = 2.63, p = .011
Cluster breadth	t(47.7) = −4.62, p < .001	t(23.1) = −3.52, p = .0018	t(35.0) = −3.08, p = .0040	t(48.2) = −2.94, p = .0050

Comparing prediction successes and failures of the four methods, according to loop length, cluster size, and cluster breadth. Prediction success is defined as a global RMSD of <2 Å for the top decoy. The Welch t‐statistics (with degrees of freedom in brackets) and p‐values for each variable are shown. Positive t‐statistics indicate that successes have a larger mean than failures. The tests are based on the loop targets representing conformational clusters defined by at least four chains and two distinct PDB codes.

Abbreviations: PDB, Protein Data Bank; RMSD, root‐mean‐squared deviation.

Next, we focus on the loop instances with multiple distinct conformations, to assess how well the decoys generated from a specific PDB input can represent all the known conformations for that loop instance. Taking the loop 130–140, for example: the decoys generated using 6xluB are compared to the loop structures in the clusters represented by 6xluB, 7kdkC, and 7kdlA, and the RMSD to the closest structure in each cluster is recorded; the average of the RMSDs to these three clusters then provides an overall result for 6xluB; the same is done using the decoys from 7kdkC and 7kdlA. The results are summarized in Table 6 using the same RMSD metrics, averaged over the targets in the multiple conformation category. This task is noticeably more challenging than the prior prediction task, as evidenced by the RMSDs in Table 6, which are all larger than the corresponding values in the “Multiple conf.” rows of Table 4 for all four methods. While the top decoy RMSDs are expected to increase relative to Table 4, a substantial increase still occurs when taking the entire decoy set (“Min.” column, e.g., from 1.07 to 2.18 Å global RMSD for NGK) and when allowing methods to choose the top five decoys (“Top‐5” column, e.g., from 1.59 to 2.85 Å global RMSD for NGK), whether considering local or global RMSD.
This suggests that building the loop using the atomic environment of a single structural template may preclude the methods from being able to locate and predict all the possible loop conformations; Marks et al. observed a similar phenomenon in their dataset. The detailed results for each loop target individually are given in Table S2 of the Supporting Information.
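The coverage measure used for Table 6 can be sketched as follows. This reflects one reading of the description above (take the closest decoy‐to‐member RMSD within each cluster, then average across clusters); the function name `conformation_coverage`, the nested‐list input layout, the cluster labels, and the numbers are assumptions for illustration, not the authors' implementation or data.

```python
from statistics import mean
from typing import Dict, List, Optional


def conformation_coverage(
    rmsd_by_cluster: Dict[str, List[List[float]]],
    top_k: Optional[int] = None,
) -> float:
    """Average, over clusters, of the RMSD from the decoy set to the
    closest structure in each cluster.

    rmsd_by_cluster[c][i][j] is the RMSD between decoy i and member j of
    cluster c; decoys are assumed sorted by rank, best first. top_k limits
    the comparison to the k top-ranked decoys (None uses every decoy).
    """
    if not rmsd_by_cluster:
        raise ValueError("need at least one cluster")
    per_cluster = []
    for rows in rmsd_by_cluster.values():
        considered = rows if top_k is None else rows[:top_k]
        # Closest (decoy, cluster member) pair for this conformation.
        per_cluster.append(min(min(row) for row in considered))
    return mean(per_cluster)


# Toy example (made-up values): two ranked decoys from one template,
# compared against three conformational clusters of different sizes.
toy = {
    "cluster_1": [[0.8, 1.0], [1.5, 1.2]],
    "cluster_2": [[3.0], [2.2]],
    "cluster_3": [[4.1, 3.9], [2.6, 3.0]],
}
top_only = conformation_coverage(toy, top_k=1)
whole_set = conformation_coverage(toy)
print(round(top_only, 2), round(whole_set, 2))
```

In the toy data the decoys sit close to the first cluster only, so the average stays high even when the whole decoy set is used, which is the pattern the text attributes to building from a single template.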

TABLE 6

	Local RMSD			Global RMSD
Method	Min.	Top	Top‐5	Min.	Top	Top‐5
DiSGro	1.36	2.40	2.00	2.50	4.76	4.05
NGK	1.19	2.01	1.65	2.18	3.84	2.85
PETALS	1.28	2.11	1.86	2.56	4.26	3.60
Sphinx	1.14	2.24	1.80	2.28	4.70	3.65

RMSD metrics for the loop instances with multiple conformations. The loop backbone RMSDs shown are averaged over the targets in the multiple conformation category, where decoys generated from each target are compared to all known conformations for that loop instance and RMSDs are calculated to the closest structure in each cluster. The columns “Min.,” “Top,” and “Top‐5” refer, respectively, to the lowest RMSD among the 500 decoys, RMSD of the top‐ranked decoy, and lowest RMSD among the top‐five ranked decoys.

Abbreviation: RMSD, root‐mean‐squared deviation.

The multiple conformation loop instances in the RBD were not more difficult to predict. Methods located known conformations from their loop targets at a comparable level of accuracy versus those outside the RBD; for example, average global RMSDs for assessing the representation of all the conformations in the top five decoys were 2.55 versus 3.07 Å for NGK and 4.21 versus 3.94 Å for DiSGro. The average length of these loop targets in the RBD is 9.9 residues, similar to the average length (10.0) among all multiple conformation targets.

The loop regions with sequence variants in the PDB had little structural variability (Table 3) and were not expected to pose additional challenges for the loop modeling methods. Detailed results for each sequence variant confirm this, and are provided in Table S3 of the Supporting Information.

Five loop targets were omitted from the above analyses due to challenges encountered when running the methods. The two very long loops in the set, namely 146–168 and 783–816, were particularly difficult, with DiSGro and Sphinx unable to generate decoys, possibly due to their lengths. The 146–168 loop has two conformations, both of which could be predicted moderately well by PETALS (top decoy global RMSDs: 2.18 Å for the 6zgiB conformation, 2.39 Å for the 7dddC conformation) and NGK (top decoy global RMSDs: 2.80 Å for the 6zgiB conformation, 2.45 Å for the 7dddC conformation).
The length‐34 loop (783–816) is very challenging, and no method could give useful results (top decoy global RMSDs: 26.8 Å for NGK, 12.0 Å for PETALS). The Sphinx webserver was also unable to generate decoys for 31–46 and 320–324 (6xm0A conformation), possibly due to a lack of suitable templates. Further, some of Sphinx's jobs were unable to complete the full SOAP‐Loop ranking steps; thus, we used the 500 SOAP‐Loop ranked decoys if they were available, and otherwise selected its top 500 decoys from the coarse‐grained ranking stage for our analysis. Detailed results for these five targets are provided in Table S4 of the Supporting Information.

CONCLUSION

In this article, we studied the conformations of loops in the SARS‐CoV‐2 S protein. We extracted all SARS‐CoV‐2 S protein loop regions, examined their sequence and structural variability based on the available structures in the PDB, and applied loop modeling methods to assess how well the loop conformations could be predicted. In total, 44 loop regions were identified and, as the structure of the S protein has been experimentally solved many times, 17 loop instances were observed to have substantive structural variability and to adopt multiple distinct conformations according to a cluster analysis. The clusters gave insights into the amount of structural uncertainty present in these loops, and there were quantifiable differences in their sizes and breadths. Loops' frequent association with protein function, together with their more disordered nature compared to regular secondary structures, means that their accurate modeling is an important problem in structural biology. Specifically for the S protein, loop regions we identified include 475–487 and 495–506, which correspond to key loops known to be involved in binding with ACE2. These are referred to as “Loop 3” and “Loop 4” in Williams et al., where molecular dynamics simulations revealed “Loop 3” to be highly flexible in the unbound state, including the possibility of a conformation that inhibits ACE2 binding. Interestingly, our results also showed that 475–487 was one of the most difficult loops to predict, with all four methods struggling with the 6xm0B template (global RMSD of top decoy >10 Å, Table S2). Exploring the conformational variability of “Loop 3” thus provides a fuller range of structural states that the development of therapeutics might target before the S protein binds to ACE2. More generally, high‐quality loop models are a crucial part of protein structures used in the computational drug discovery process.
We found that the structurally flexible loops with multiple conformations in the S protein tended to be more challenging for loop modeling methods to predict a correct structure, compared to relatively inflexible loops with a single conformation. Prediction accuracies were strongly associated with loop length, due to the larger conformational space of longer loops. Further, it was very challenging for methods to predict all known conformations from a single structural template. Our results thus highlight limitations of current loop prediction methods, most of which were designed to predict a single “correct” conformation. These echo some of the findings in Marks et al., but with some important distinctions. First, we were able to more fully consider cluster size and breadth in the analysis, thanks to the large number of S protein chains in the PDB. Second, we did not construct a curated set of high and low flexibility loops specifically, but rather considered all S protein loops which cover a wider range of loop structural variability. In effect, a much larger proportion of loops (17 of 44 in our study) may be considered highly flexible, if other structures were to be solved this many times. Third, the multiple conformation targets in our dataset were easier to predict than those of Marks et al. when allowing the best decoys across all structural templates to be chosen. Overall, this work provides insight into the abilities of current loop prediction methods for a key protein associated with the ongoing COVID‐19 disease, and identifies the loops where structural flexibility could play a role as the SARS‐CoV‐2 virus continues to evolve. Future study in loop modeling protocols might better incorporate multiple conformation loops in their training data and improve prediction accuracies for longer loops. Finally, we note one limitation of this study, namely our focus on loops rather than more global protein structure. 
In this sense, more global structural variability across S protein chains may have hindered the ability of methods to locate all the distinct loop conformations from a single input structure, since the rest of the protein chain is held fixed. Additionally, we found the observable changes to loop structures from known sequence variants in the PDB to be small. There could be more global structural changes due to mutation not detected by the current analysis, for example, the D614G mutation. Nonetheless, loops deserve careful study in their own right, due to their functional importance. Further study could focus on larger‐scale variability in the S protein structure, leveraging the rich source of experimental data available in the PDB to better understand COVID‐19.

CONFLICT OF INTEREST

Both authors declare no potential conflict of interest.

PEER REVIEW

The peer review history for this article is available at https://publons.com/publon/10.1002/prot.26266.

Appendix S1: Supporting Information
