Samuel W K Wong, Zongjun Liu.
Abstract
The SARS-CoV-2 spike (S) protein facilitates viral infection, and has been the focus of many structure determination efforts. Its flexible loop regions are known to be involved in protein binding and may adopt multiple conformations. This article identifies the S protein loops and studies their conformational variability based on the available Protein Data Bank structures. While most loops had essentially one stable conformation, 17 of 44 loop regions were observed to be structurally variable with multiple substantively distinct conformations based on a cluster analysis. Loop modeling methods were then applied to the S protein loop targets, and the prediction accuracies discussed in relation to the characteristics of the conformational clusters identified. Loops with multiple conformations were found to be challenging to model based on a single structural template.
Keywords: COVID-19; conformational ensembles; decoy selection; loop modeling; protein structure prediction; sequence variants
Year: 2021 PMID: 34661307 PMCID: PMC8662175 DOI: 10.1002/prot.26266
Source DB: PubMed Journal: Proteins ISSN: 0887-3585
SARS‐CoV‐2 S protein loops. The first column shows the starting and ending positions of each identified loop region. The second column shows the loop sequences; if there are sequence variants in the PDB, the most common variant is listed first, and other variants have their mutated residues marked in bold. The number of PDB chains containing that loop instance is shown in the third column. The rightmost column lists the representative PDB chains for each loop instance; if a loop instance has multiple conformations, each chain listed corresponds to one distinct conformation (cluster). The number of PDB chains represented by each cluster is shown in parentheses; these may not sum up to the third column since clusters with poor structure resolution (all chains >3 Å) are omitted
| Region | Sequence | #Chains | Representative conformations |
|---|---|---|---|
| 14–27 | QCVNLTTRTQLPPA | 36 | 6zgeA(24), 7dddC(12) |
| 31–46 | SFTRGVYYPDKVFRSS | 185 | 7a4nB(185) |
| 56–60 | LPFFS | 185 | 6xr8A(185) |
| 66–83 | HAIHVSGTNGTKRFDNPV | 11 | none (all PDBs >3 Å resolution) |
| 108–116 | TTLDSKTQS | 169 | 6zoxB(169) |
| 130–140 | VCEFQFCNDPF | 168 | 6xluB(145), 7kdkC(5), 7kdlA(4) |
| 146–168 | HKNNKSWMESEFRVYSSANNCTF | 38 | 6zgiB(27), 7dddC(9) |
| 172–187 | SQPFLMDLEGKQGNFK | 52 | 7df3B(39), 6zp0B(12) |
| 210–222 | INLVRDLPQGFSA | 154 | 6vxxA(152) |
| 230–236 | PIGINIT | 185 | 6vxxA(185) |
| 245–263 | HRSYLTPGDSSSGWTAGAA | 26 | 6zgiB(24) |
| 280–284 | NENGT | 185 | 6x79B(185) |
| 304–310 | KSFTVEK | 185 | 7a4nB(185) |
| 320–324 | VQPTE | 185 | 6zoxC(181), 6xm3A(4) |
| 329–338 | FPNITNLCPF | 180 | 6x29A(155), 7kdlB(21) |
| 343–348 | NATRFA | 181 | 6zgeC(181) |
| 370–375 | NSASFS | 182 | 6vxxA(139), 6zgiC(42) |
| 380–394 | YGVSPTKLNDLCFTN | 170 | 7kdlC(164) |
| | YGV | 12 | 6x79B(12) |
| 410–416 | IAPGQTG | 179 | 7kdkA(178) |
| | IAP | 3 | 6zoxB(3) |
| 422–430 | NYKLPDDFT | 182 | 6xr8B(178), 6xm0B(2) |
| 438–451 | SNNLDSKVGGNYNY | 93 | 6xr8A(85), 7kdlB(4) |
| 454–472 | RLFRKSNLKPFERDISTEI | 96 | 6zgeC(95) |
| 475–487 | AGSTPCNGVEGFN | 92 | 7dddA(87), 6xm0B(1) |
| 495–506 | YGFQPTNGVGYQ | 124 | 6zp0A(118), 6xm0B(2), 7kdlB(3) |
| 517–523 | LLHAPAT | 168 | 6zoxA(163), 6xm0A(2), 6xm0B(1), 6xm3A(2) |
| 526–537 | GPKKSTNLVKNK | 181 | 7ad1B(26), 6x29B(154) |
| 555–564 | SNKKFLPFQQ | 185 | 7kdkC(185) |
| 578–583 | DPQTLE | 185 | 6zoxB(185) |
| 600–608 | PGTNTSNQV | 170 | 7kdlA(169) |
| | PGTNTSN | 12 | none (all PDBs >3 Å resolution) |
| 614–620 | DVNCTEV | 103 | 6xm4C(98) |
| | | 42 | 7kdkA(42) |
| | | 6 | 7a4nB(6) |
| 624–641 | IHADQLTPTWRVYSTGSN | 26 | 6xm0B(18) |
| 656–663 | VNNSYECD | 185 | 7kdkB(185) |
| 697–710 | MSLGAENSVAYSNN | 185 | 6vxxB(185) |
| 783–816 | AQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRS | 144 | 6zp0C(142) |
| 825–836 | KVTLADAGFIKQ | 39 | 6xluB(2), 6xm3B(5), 6xm3C(1), 6zgiA(25) |
| 841–848 | LGDIAARD | 43 | 6xluC(6), 6xm4B(1), 6zgeB(20), 6xm3B(6), 7dddB(6) |
| 862–866 | PPLLT | 185 | 6zoxB(185) |
| 891–897 | GAALQIP | 176 | 7kdkB(176) |
| | G | 9 | 7a4nB(9) |
| 908–913 | GIGVTQ | 185 | 7a4nB(185) |
| 968–976 | SNFGAISSV | 188 | 6zp0C(185), 6xraC(3) |
| 1033–1046 | VLGQSKRVDFCGKG | 188 | 7kdkA(188) |
| 1106–1112 | QRNFYEP | 188 | 7kdkC(188) |
| 1124–1132 | GNCDVVIGI | 188 | 6xm0A(185), 6xraC(3) |
| 1135–1141 | NTVYDPL | 161 | 7kdkB(158), 6xraC(3) |
Abbreviation: PDB, Protein Data Bank.
FIGURE 1. Two examples of SARS‐CoV‐2 S protein loops of length 10: 329–338 (top panels) and 555–564 (bottom panels). The histograms (left panels) show the pairwise root‐mean‐squared deviations (RMSDs) of the loop backbone among all S protein chains containing that loop: it can be seen that 329–338 exhibits higher structural variability than 555–564, due to the presence of two distinct clusters. The right panels display close‐ups of the representative loop conformations: 329–338 has two distinct conformations, colored in dark blue and turquoise; 555–564 has essentially one conformation, colored in red.
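The pairwise RMSDs summarized in the histograms can be illustrated with a minimal sketch. It assumes the loop backbone coordinates of every chain have already been superimposed in a common frame; the source does not state the superposition protocol, so no fitting is done here. The chain names and coordinates are made up for illustration.

```python
import math
from itertools import combinations


def backbone_rmsd(coords_a, coords_b):
    """RMSD (in the units of the input) between two equal-length lists of
    (x, y, z) backbone atom coordinates.

    Assumption: the two loops are already superimposed; no fitting is done.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("loops must contain the same number of atoms")
    squared = sum(
        (a - b) ** 2
        for p, q in zip(coords_a, coords_b)
        for a, b in zip(p, q)
    )
    return math.sqrt(squared / len(coords_a))


def pairwise_rmsds(loops):
    """Map each unordered pair of chain names (sorted) to its RMSD."""
    return {
        (i, j): backbone_rmsd(loops[i], loops[j])
        for i, j in combinations(sorted(loops), 2)
    }


# Hypothetical toy data: two atoms per "loop", three chains.
loops = {
    "chainA": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    "chainB": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    "chainC": [(0.0, 3.0, 0.0), (1.0, 3.0, 0.0)],
}
rmsds = pairwise_rmsds(loops)
```

In this toy case chainA and chainB coincide while chainC is shifted by 3 units, so a histogram of the three values would show two well-separated groups, which is the kind of pattern the caption describes for loop 329–338.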
FIGURE 2. The amount of within‐cluster variation for the 61 clusters defined by at least four chains and two distinct Protein Data Bank (PDB) codes. The breadth of movement observed within a cluster is measured by its average within‐cluster root‐mean‐squared deviation (RMSD); 36 of the clusters have an average between 0.5 and 1 Å.
Clusters grouped according to their breadth of movement d as defined by their average within‐cluster RMSDs. Each cluster is listed based on its representative conformation (Table 1) together with its starting and ending residues. The average loop length and size of clusters in the three groups are shown in the rightmost columns
| Breadth d (Å) | Clusters | Avg. length | Avg. size |
|---|---|---|---|
| d ≤ 0.5 | 6xr8A_56_60, 6x79B_280_284, 7a4nB_304_310, 6zoxC_320_324, 6xm3A_320_324, 7kdkA_614_620, 7a4nB_614_620, 7kdkB_656_663, 7dddB_841_848, 6zoxB_862_866, 7kdkB_891_897, 7a4nB_891_897, 7a4nB_908_913, 6zp0C_968_976, 7kdkC_1106_1112 | 6.5 | 127 |
| 0.5 < d ≤ 1 | 6zgeA_14_27, 7a4nB_31_46, 6zoxB_108_116, 6xluB_130_140, 7kdlA_130_140, 7dddC_146_168, 7df3B_172_187, 6zp0B_172_187, 6vxxA_230_236, 6x29A_329_338, 7kdlB_329_338, 6zgeC_343_348, 6vxxA_370_375, 6zgiC_370_375, 7kdlC_380_394, 6x79B_380_394, 7kdkA_410_416, 6xr8B_422_430, 6xr8A_438_451, 6zgeC_454_472, 6zp0A_495_506, 7ad1B_526_537, 6x29B_526_537, 7kdkC_555_564, 6zoxB_578_583, 7kdlA_600_608, 6xm4C_614_620, 6xm0B_624_641, 6vxxB_697_710, 6zp0C_783_816, 6xm3B_825_836, 6zgiA_825_836, 6zgeB_841_848, 7kdkA_1033_1046, 6xm0A_1124_1132, 7kdkB_1135_1141 | 12.1 | 108 |
| d > 1 | 7dddC_14_27, 7kdkC_130_140, 6zgiB_146_168, 6vxxA_210_222, 6zgiB_245_263, 7kdlB_438_451, 7dddA_475_487, 6zoxA_517_523, 6xluC_841_848, 6xm3B_841_848 | 13.0 | 49 |
Abbreviation: RMSD, root‐mean‐squared deviation.
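The breadth measure used above, the average within‐cluster RMSD, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the pairwise RMSDs are taken as given, the cluster labels are hypothetical, and treating the 0.5 Å and 1 Å cutoffs as inclusive upper bounds is an assumption, since the source only indicates the cutoffs themselves.

```python
from itertools import combinations
from statistics import mean


def cluster_breadths(rmsd, labels):
    """Average pairwise RMSD among the members of each cluster.

    rmsd:   dict mapping a sorted (chain_i, chain_j) tuple to RMSD in Å
    labels: dict mapping chain name to cluster id
    """
    clusters = {}
    for chain, cluster_id in labels.items():
        clusters.setdefault(cluster_id, []).append(chain)

    breadths = {}
    for cluster_id, members in clusters.items():
        pairs = [rmsd[tuple(sorted(p))] for p in combinations(members, 2)]
        if pairs:  # a singleton cluster has no pairs, hence no breadth
            breadths[cluster_id] = mean(pairs)
    return breadths


def breadth_group(d, low=0.5, high=1.0):
    """Assign a breadth d (Å) to one of three groups.

    Assumption: both cutoffs are inclusive upper bounds.
    """
    if d <= low:
        return "d <= 0.5"
    if d <= high:
        return "0.5 < d <= 1"
    return "d > 1"


# Hypothetical toy data: one tight cluster and one broad cluster.
labels = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2}
rmsd = {("a", "b"): 0.2, ("a", "c"): 0.4, ("b", "c"): 0.3, ("d", "e"): 1.4}
breadths = cluster_breadths(rmsd, labels)
groups = {cid: breadth_group(d) for cid, d in breadths.items()}
```

Only within‐cluster pairs are looked up, so RMSDs between chains of different clusters do not need to be supplied.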
Backbone RMSDs between the PDB chains representing the different sequence variants, in loop regions where mutations are present. Local RMSDs are computed on the loop residues. The residues that differ between the sequence variants are highlighted in bold
| Region | Sequence 1 | Sequence 2 | RMSD |
|---|---|---|---|
| 380–394 | YGV | YGV | 0.54 |
| 410–416 | IAP | IAP | 0.40 |
| 614–620 | | | 0.67 |
| 614–620 | | | 0.62 |
| 614–620 | | | 0.51 |
| 891–897 | G | G | 0.23 |
Abbreviations: PDB, Protein Data Bank; RMSD, root‐mean‐squared deviation.
RMSD metrics for assessing the loop prediction accuracy of the four methods. The loop backbone RMSDs shown are averaged over single conformation targets (n = 26), multiple conformation targets (n = 40), and all targets (n = 66). The columns “Min.,” “Top,” and “Top‐5” refer, respectively, to the lowest RMSD among the 500 decoys, RMSD of the top‐ranked decoy, and lowest RMSD among the top‐five ranked decoys. Prediction accuracy is defined as the RMSD to the closest loop structure among all chains containing that loop instance
| Method | Target category | Local RMSD, Min. | Local RMSD, Top | Local RMSD, Top‐5 | Global RMSD, Min. | Global RMSD, Top | Global RMSD, Top‐5 |
|---|---|---|---|---|---|---|---|
| DiSGro | Single conf. | 0.76 | 1.81 | 1.28 | 0.97 | 2.66 | 1.73 |
| DiSGro | Multiple conf. | 0.96 | 1.95 | 1.56 | 1.47 | 3.60 | 2.95 |
| DiSGro | All | 0.88 | 1.90 | 1.45 | 1.27 | 3.23 | 2.47 |
| NGK | Single conf. | 0.42 | 1.06 | 0.85 | 0.58 | 1.93 | 1.62 |
| NGK | Multiple conf. | 0.66 | 1.42 | 1.08 | 1.07 | 2.55 | 1.59 |
| NGK | All | 0.56 | 1.28 | 0.99 | 0.87 | 2.31 | 1.60 |
| PETALS | Single conf. | 0.68 | 1.24 | 0.98 | 0.98 | 2.06 | 1.51 |
| PETALS | Multiple conf. | 0.85 | 1.58 | 1.33 | 1.42 | 3.00 | 2.32 |
| PETALS | All | 0.78 | 1.44 | 1.19 | 1.25 | 2.63 | 2.00 |
| Sphinx | Single conf. | 0.64 | 1.49 | 1.15 | 1.11 | 2.75 | 2.09 |
| Sphinx | Multiple conf. | 0.74 | 1.77 | 1.31 | 1.34 | 3.53 | 2.46 |
| Sphinx | All | 0.70 | 1.66 | 1.25 | 1.25 | 3.22 | 2.31 |
Abbreviations: PDB, Protein Data Bank; RMSD, root‐mean‐squared deviation.
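The three summary columns can be illustrated with a short sketch for a single target. It assumes each decoy carries a score from the modeling method and that a lower score means a better rank; that convention, the function name, and the toy numbers are assumptions for illustration, not details from the source.

```python
def decoy_metrics(decoys):
    """Summarize one target's decoy set.

    decoys: list of (score, rmsd) pairs, one per decoy.
    Assumption: lower score = better rank.
    Returns the "Min.", "Top", and "Top-5" RMSDs described in the caption.
    """
    if not decoys:
        raise ValueError("at least one decoy is required")
    ranked = sorted(decoys, key=lambda d: d[0])
    rmsds = [rmsd for _, rmsd in ranked]
    return {
        "min": min(rmsds),       # lowest RMSD among all decoys
        "top": rmsds[0],         # RMSD of the top-ranked decoy
        "top5": min(rmsds[:5]),  # lowest RMSD among the five best-ranked
    }


# Hypothetical toy decoy set (7 decoys instead of 500).
decoys = [
    (-10.0, 2.1), (-12.5, 1.8), (-9.0, 0.6), (-11.0, 1.2),
    (-8.0, 0.9), (-7.5, 0.7), (-3.0, 0.4),
]
metrics = decoy_metrics(decoys)
```

By construction Min. ≤ Top‐5 ≤ Top for every target, which matches the ordering seen in each row of the table. The "All" rows are also consistent with target‐weighted averages of the two categories, for example (26 × 0.76 + 40 × 0.96) / 66 ≈ 0.88 for the DiSGro local Min. column.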
FIGURE 3. Loop prediction accuracy for each of the four methods, visualized by plotting the global root‐mean‐squared deviation (RMSD) of the top decoy versus loop length. Prediction difficulty increases with loop length, with methods consistently achieving <2 Å RMSD only for the shortest loops (≤6 residues). The hardest targets for a given loop length tend to be those from multiple conformations, especially for the two most accurate methods (NGK and PETALS). Slight jitter is added along the x‐axis to the points for readability.
Comparing prediction successes and failures of the four methods, according to loop length, cluster size, and cluster breadth. Prediction success is defined as a global RMSD of <2 Å for the top decoy. The Welch t‐statistics (with degrees of freedom in brackets) and p‐values for each variable are shown. Positive t‐statistics indicate that successes have a larger mean than failures. The tests are based on the loop targets representing conformational clusters defined by at least four chains and two distinct PDB codes
| Variables | DiSGro | NGK | PETALS | Sphinx |
|---|---|---|---|---|
| Loop length | | | | |
| Cluster size | | | | |
| Cluster breadth | | | | |
Abbreviations: PDB, Protein Data Bank; RMSD, root‐mean‐squared deviation.
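For reference, the Welch t‐statistic and its degrees of freedom (Welch–Satterthwaite approximation) can be computed as in the sketch below, with the same sign convention as the caption: a positive value means successes have the larger mean. The function name and the toy loop lengths are hypothetical. The p‐value is omitted because it needs a t‐distribution CDF, which the Python standard library does not provide.

```python
import math
from statistics import mean, variance


def welch_t(successes, failures):
    """Welch's unequal-variance t-statistic and approximate degrees of freedom.

    Positive t means the successes have a larger mean than the failures.
    Each group needs at least two observations.
    """
    n1, n2 = len(successes), len(failures)
    # Sample variance of each group's mean (squared standard error).
    v1 = variance(successes) / n1
    v2 = variance(failures) / n2
    t = (mean(successes) - mean(failures)) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical loop lengths for successful and failed predictions.
success_lengths = [5, 6, 6, 7]
failure_lengths = [9, 10, 12, 13, 14]
t_stat, dof = welch_t(success_lengths, failure_lengths)
```

In this toy example the statistic is negative, meaning the successfully predicted loops are shorter on average than the failures.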
RMSD metrics for the loop instances with multiple conformations. The loop backbone RMSDs shown are averaged over the targets in the multiple conformation category, where decoys generated from each target are compared to all known conformations for that loop instance and RMSDs are calculated to the closest structure in each cluster. The columns “Min.,” “Top,” and “Top‐5” refer, respectively, to the lowest RMSD among the 500 decoys, RMSD of the top‐ranked decoy, and lowest RMSD among the top‐five ranked decoys
| Method | Local RMSD, Min. | Local RMSD, Top | Local RMSD, Top‐5 | Global RMSD, Min. | Global RMSD, Top | Global RMSD, Top‐5 |
|---|---|---|---|---|---|---|
| DiSGro | 1.36 | 2.40 | 2.00 | 2.50 | 4.76 | 4.05 |
| NGK | 1.19 | 2.01 | 1.65 | 2.18 | 3.84 | 2.85 |
| PETALS | 1.28 | 2.11 | 1.86 | 2.56 | 4.26 | 3.60 |
| Sphinx | 1.14 | 2.24 | 1.80 | 2.28 | 4.70 | 3.65 |
Abbreviation: RMSD, root‐mean‐squared deviation.