Lue Ping Zhao, Terry P Lybrand, Peter B Gilbert, Thomas R Hawn, Joshua T Schiffer, Leonidas Stamatatos, Thomas H Payne, Lindsay N Carpp, Daniel E Geraghty, Keith R Jerome.
Abstract
The emergence and establishment of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants of interest (VOIs) and variants of concern (VOCs) highlight the importance of genomic surveillance. We propose a statistical learning strategy (SLS) for identifying and spatiotemporally tracking potentially relevant Spike protein mutations. We analyzed 167,893 Spike protein sequences from coronavirus disease 2019 (COVID-19) cases in the United States (excluding 21,391 sequences from VOI/VOC strains) deposited at GISAID from 19 January 2020 to 15 March 2021. Alignment against the reference Spike protein sequence led to the identification of viral residue variants (VRVs), i.e., residues harboring a substitution compared to the reference strain. Next, generalized additive models were applied to model VRV temporal dynamics and to identify VRVs with significant and substantial dynamics (false discovery rate q-value < 0.01; maximum VRV proportion > 10% on at least one day). Unsupervised learning was then applied to hierarchically organize VRVs by spatiotemporal patterns and identify VRV-haplotypes. Finally, homology modeling was performed to gain insight into the potential impact of VRVs on Spike protein structure. We identified 90 VRVs, 71 of which had not previously been observed in a VOI/VOC, and 35 of which have emerged recently and are durably present. Our analysis identified 17 VRVs ~91 days earlier than their first corresponding VOI/VOC publication. Unsupervised learning revealed eight VRV-haplotypes of four VRVs or more, suggesting two emerging strains (B.1.1.222 and B.1.234). Structural modeling supported a potential functional impact of the D1118H and L452R mutations. The SLS approach equally monitors all Spike residues over time, independently of existing phylogenetic classifications, and is complementary to existing genomic surveillance methods.
Keywords: SARS-CoV-2; Spike protein; homology modelling; statistical learning; unsupervised learning; variants of concern; variants of interest; viral residue variant
Year: 2021 PMID: 35062214 PMCID: PMC8777887 DOI: 10.3390/v14010009
Source DB: PubMed Journal: Viruses ISSN: 1999-4915 Impact factor: 5.818
Figure 1. Viral residue variant (VRV) spatiotemporal patterns in the United States. (A) Locally averaged proportions over time of five VRVs (V382, L452, T478, P681 and T732), modeled using sequences from California. The horizontal gray dotted line denotes the maximum proportion (Pmax) cutoff of 10% (see the formula for calculating Pmax in Section 3.3.1 of Materials and Methods). V382 exceeded the Pmax cutoff of 10% on day 259 and dropped below the Pmax cutoff of 10% on day 275 (marked by the vertical gray lines). (B) Heatmap of the 267 identified geo-VRVs, with color designating the state/territory-specific VRV proportion at the sampling time as designated on the left-hand vertical axis. Geo-VRVs with similar temporal dynamics are grouped into 10 clusters (TP1 through TP10), as designated by the color bar at the top of the heatmap. (C,D) Venn diagrams showing the relationships between AA-subs in VOIs, AA-subs in VOCs, and (C) VRVs or (D) pressing VRVs. AA-subs, amino acid positions that have been shown to harbor substitutions within US-circulating variants; VOCs, variants of concern; VOIs, variants of interest.
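The rise and fall days marked in panel (A) can be read off a daily proportion curve with a simple threshold scan. The following is a minimal sketch, not the authors' code: the function name `threshold_intervals` and the toy curve are hypothetical, and the input is assumed to be an already smoothed (locally averaged) proportion per day.

```python
from typing import List, Optional, Tuple


def threshold_intervals(
    proportions: List[float], cutoff: float = 0.10
) -> List[Tuple[int, Optional[int]]]:
    """Return (rise_day, fall_day) pairs for a daily proportion curve.

    rise_day is the first day the curve exceeds `cutoff`; fall_day is the
    first later day the curve is below `cutoff`, or None if it never drops.
    Days are indices into `proportions` (day 0 is the first entry).
    """
    intervals: List[Tuple[int, Optional[int]]] = []
    rise: Optional[int] = None
    for day, p in enumerate(proportions):
        if rise is None and p > cutoff:
            rise = day  # curve has just crossed above the cutoff
        elif rise is not None and p < cutoff:
            intervals.append((rise, day))  # curve has dropped back below
            rise = None
    if rise is not None:
        intervals.append((rise, None))  # still above the cutoff at the end
    return intervals


# Toy curve with made-up values (not data from the paper).
curve = [0.02, 0.05, 0.12, 0.30, 0.11, 0.08, 0.04]
print(threshold_intervals(curve))  # [(2, 5)]
```

A curve that rises and stays above the cutoff yields a pair whose second element is `None`, which corresponds to a table entry with a single day rather than a range.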
Comparison by state/territory of the time to detect an emerging VRV by the SLS method vs. the first reported appearance of the AA-sub.
| | L5 | S13 | V70 | T95 | W152 | D253 | L452 | S477 | E484 | N501 | A570 | D614 | Q677 | P681 | A701 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Reporting Day * | 301 | 301 | 301 | 301 | 301 | 87 | 301 | 331 | 87 | 362 | 362 | 362 | 362 | 362 | 362 |
| Earliest SLS Detection Day Across All States | 11 | 159 | 329 | 149 | 381 | 98 | 381 | 405 | 371 | 206 | 404 | 10 | 20 | 11 | 176 |
| Alabama | 63–186 | - | - | - | - | - | - | - | 404 | - | - | 63 | 253 | - | - |
| Alaska | - | - | - | - | - | - | - | - | - | - | - | 56 | 305–323 | 383 | - |
| Arizona | - | - | - | - | - | - | - | - | - | - | - | 26 | 357–400 | 353 | - |
| Arkansas | - | - | - | - | - | - | - | - | - | - | - | 56 | 329 | - | - |
| California | - | 398 | - | - | 402 | - | 390 | - | - | - | - | 45 | - | 374 | - |
| Colorado | 286 | - | - | - | - | - | - | - | - | - | - | 45 | 286 | 370 | - |
| Connecticut | - | - | - | - | - | - | - | - | - | 314 | - | 43 | 191 | 378 | - |
| DC | - | - | - | - | - | - | - | - | 391 | - | - | 47 | - | 344 | - |
| Delaware | - | - | - | - | - | - | - | - | - | - | - | 52 | 384 | 288 | - |
| Florida | - | - | - | - | - | - | - | - | - | - | - | 33 | 368 | 389 | - |
| Georgia | - | - | - | - | - | - | - | - | - | - | - | 41 | 345 | 407 | - |
| Hawaii | 175–190 | - | - | - | - | - | - | - | - | - | - | 46 | 374 | 174–376 | - |
| Idaho | - | - | - | - | - | - | - | - | - | - | - | 53 | - | - | - |
| Illinois | - | - | - | - | - | - | - | - | - | - | - | 24 | 366 | 380 | - |
| Indiana | - | - | - | - | - | - | - | - | - | - | - | 48 | 370 | 383 | - |
| Iowa | - | - | - | - | - | - | - | - | - | - | - | 48 | 273–388 | - | - |
| Kansas | - | - | - | - | - | 260–291 | - | - | - | - | - | 47 | - | 385 | - |
| Kentucky | - | - | - | - | - | - | - | - | - | - | - | 59 | 389–397 | - | - |
| Louisiana | - | - | - | - | - | - | - | - | - | - | - | 50 | 307 | 368 | - |
| Maine | - | - | - | - | - | - | - | - | 404 | 371 | - | 51 | - | - | - |
| Maryland | - | - | - | - | 407 | - | 394 | - | 390 | - | - | 45 | - | 230 | - |
| Massachusetts | - | - | - | - | - | - | - | - | - | 206 | - | 10 | 346 | 298 | - |
| Michigan | - | - | - | 149–177 | - | - | - | - | - | 264–273 | - | 50 | 361 | - | - |
| Minnesota | 186–294 | - | - | - | - | - | - | - | - | - | - | 46 | 297 | 387 | - |
| Mississippi | 144–215 | - | - | - | - | - | - | - | - | - | - | 42 | 353 | 392 | - |
| Missouri | - | - | - | - | - | - | - | - | - | - | - | 47 | - | 364–384 | - |
| Montana | - | - | - | - | - | - | - | - | - | - | - | 68 | - | - | - |
| Nebraska | - | - | - | - | - | - | - | - | - | - | - | 46 | - | 387 | - |
| Nevada | - | 392 | - | - | 396 | - | 393 | - | - | - | - | 37 | 391 | 394 | - |
| New Hampshire | - | - | - | - | - | - | - | - | - | 374 | - | 41 | 349–390 | 364 | - |
| New Jersey | - | - | - | - | - | - | 402 | - | - | - | - | 44 | - | 276 | - |
| New Mexico | - | - | - | - | - | - | - | - | - | - | - | 50 | 291 | 387 | 233–252 |
| New York | 11–13 | - | - | - | - | - | 386 | - | 414 | - | - | 11 | - | 11 | - |
| North Carolina | - | - | - | - | - | - | - | - | - | - | - | 44 | - | 382 | - |
| North Dakota | - | - | - | - | - | - | - | - | - | - | - | 57 | 328–363 | - | - |
| Ohio | - | - | - | - | - | - | 405 | - | - | - | - | 20 | 20 | 391 | - |
| Oklahoma | - | - | - | - | - | - | - | - | - | - | - | 54 | 306 | - | - |
| Oregon | - | 397 | - | - | - | - | 398 | - | - | - | - | 45 | - | - | - |
| Pennsylvania | - | - | - | - | - | - | - | - | - | - | - | 44 | 394 | 317 | - |
| Puerto Rico | - | - | - | - | - | 185–252 | - | - | - | - | - | 49 | - | 347 | - |
| Rhode Island | - | - | - | - | - | - | - | 405 | 371–384 | 356 | - | 40 | 358–398 | 379 | - |
| South Carolina | - | - | - | - | - | - | - | - | - | - | - | 46 | 405 | 368–389 | - |
| Tennessee | 50–141 | - | - | - | - | - | - | - | - | - | - | 50 | 318 | - | - |
| Texas | - | - | - | - | - | - | - | - | - | - | - | 23 | 360 | 378 | - |
| Utah | - | 159–173 | - | - | - | 98–190 | - | - | - | - | - | 44 | - | 358 | - |
| Virginia | - | - | 329–331 | - | - | - | - | - | 397 | - | - | 47 | 384 | 359 | 176–185 |
| Washington | - | 406 | - | - | 411 | - | 403 | - | - | 410 | - | 50 | 373 | - | - |
| Wisconsin | - | - | - | - | - | - | 409 | - | - | 265–271 | - | 12 | 299–407 | 411 | - |
| Wyoming | - | 381 | - | - | 381 | - | 381 | - | - | - | - | 51 | - | 392 | - |
| Other States | - | - | - | - | - | - | - | - | 393 | 404 | 404 | 34 | - | 368 | 375–381 |
For each of 15 amino acid positions shown to harbor a substitution in a VOI or VOC, the top two rows show the first reported date in the literature of a VOI or VOC harboring a substitution at the designated site vs. the date of VRV detection at the same amino acid position by the SLS method (across all states/territories). * “Reporting Day” was set to the 15th day of the month in which the relevant publication appeared. “SLS Detection Day” was set to the day on which the locally averaged proportion of the specific VRV exceeded 10%, based on temporality models fitted in each state/territory. If the locally averaged proportion of the VRV later declined below 10%, the second day is shown after a dash. All numbers in the table express the number of days post-19 January 2020. A dash alone means that the VRV for that column was not detected in the state for that row. VOI, variant of interest; VOC, variant of concern.
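The day numbers in the table can be converted to and from calendar dates with a few lines of date arithmetic. This sketch assumes day 0 is 19 January 2020, the start of the GISAID sampling window given in the abstract; under that assumption the 15th of April 2020, November 2020, December 2020, and January 2021 map to the Reporting Day values 87, 301, 331, and 362 that appear in the table. The function names are hypothetical.

```python
from datetime import date, timedelta

# Assumed origin: the first sampling date stated in the abstract.
DAY_ZERO = date(2020, 1, 19)


def day_index(d: date) -> int:
    """Number of days between DAY_ZERO and the calendar date `d`."""
    return (d - DAY_ZERO).days


def reporting_day(year: int, month: int) -> int:
    """Day index of the 15th of the given month (the table's convention)."""
    return day_index(date(year, month, 15))


def to_date(day: int) -> date:
    """Calendar date corresponding to a day index from the table."""
    return DAY_ZERO + timedelta(days=day)


print(reporting_day(2020, 11))  # 301
print(to_date(362))             # 2021-01-15
```

The lead time of an SLS detection over a publication is then the difference between the two day indices for the same amino acid position.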
Figure 2. Heatmap showing the presence of 10 selected VRVs among 9147 cases in Washington state. Unsupervised learning was used to organize the 10 VRVs into 5 residue groups (RG1 through RG5) and to organize the 9877 cases into 12 patient groups (PG1 through PG12).
VRV-haplotypes identified in Washington and in New York: state-specific frequencies of cases, number of VRVs per VRV-haplotype, and haplotypic polymorphisms (state-specific frequencies). Unimputable residues are denoted with an “X”.
| ID | VRV-Haplotype | Freq | L + | Haplotypic Polymorphism-Number of Substitutions (Frequency) |
|---|---|---|---|---|
| Washington | ||||
| W1 | S13-W152-L452-V483-N501-D614-A684 | 104 | 4 | ICRVNGA-4(20)/IWRVNGA-3(4)/SCRVNGA-3(5)/SLLVNGA-2(5)/SRLVNGA-2(4)/SWLVTGA-2(4)/SWLVYDA-1(1)/SWLVYGA-2(5)/SWQVNGA-2(2)/SWRVNGA-2(54) |
| W2 | D614-Q677-T732 | 12 | 3 | GHS-3(11)/XXX-3(1) |
| W3 | D614-T732 | 128 | 2 | GA-2(126)/GI-2(2) |
| W4 | D614-Q677 | 208 | 2 | DH-1(9)/GH-2(110)/GP-2(89) |
| W5 | D178-D614 | 74 | 2 | GG-2(70)/NG-2(4) |
| W6 | D614 | 7130 | 1 | G-1(7125)/N-1(5) |
| New York | ||||
| N1 | L5-L54-E132-Y453-T478-E484-D614-P681-T732 | 172 | 9 | LLEYKEGHA-4(168)/LLEYKEGHT-3(4) |
| N2 | L5-L54-E132-Y453-T478-E484-D614-T732 | 651 | 8 | FLEYREGT-3(4)/FLEYTEDT-1(11)/FLEYTEGA-3(3)/FLEYTEGS-3(1)/FLEYTEGT-2(266)/FLEYTKGA-4(1)/FLEYTKGT-3(44)/LLEYKEGT-2(3)/LLEYTAGT-2(1)/LLEYTEGA-2(51)/LLEYTEGI-2(2)/LLEYTEGS-2(24)/LLEYTKGS-3(2)/LLEYTKGT-2(171)/LLEYTQGT-2(8)/LLQYTEGT-2(59) |
| N3 | D80-F157-L452-D614-P681-T859-D950 | 132 | 7 | DFLGHID-3(108)/DFLGPID-2(18)/DFLGPNH-3(4)/DSLGPNH-4(2) |
| N4 | D80-F157-L452-D614-T859-D950 | 637 | 6 | DFQGND-3(4)/DFRGID-3(15)/DFRGNH-4(1)/DFRGTD-2(120)/DFRNTD-2(2)/DSRGNH-5(3)/DSRGTD-3(2)/GFRGND-4(1)/GFRGNH-5(1)/GSLGNH-5(9)/GSRGND-5(10)/GSRGNH-6(455)/GSRGNY-6(1)/GSRGTD-4(13) |
| N5 | S494-D614-P681-T716 | 514 | 4 | PGHI-4(367)/PGHT-3(55)/PGPT-2(52)/SGHI-3(19)/SGHT-2(8)/SGPI-2(13) |
| N6 | D614-P681 | 1161 | 2 | GH-2(1124)/GL-2(4)/GR-2(32)/GS-2(1) |
| N7 | D614 | 10822 | 1 | D-0(1)/G-1(10821) |
+ Number of substitutions in VRV-haplotypes from the reference amino acids.
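The polymorphism labels in the table (for example `GHS-3` under `D614-Q677-T732`) pair the observed residues with a count of positions that differ from the reference. A minimal sketch of that labeling follows; the function name `polymorphism_label` is hypothetical, and counting an unimputable `X` as a mismatch is an assumption inferred from the `XXX-3` entry in the table.

```python
def polymorphism_label(haplotype: str, observed: str) -> str:
    """Build a label such as 'GHS-3' from a VRV-haplotype and observed residues.

    `haplotype` lists reference residues with positions, e.g. 'D614-Q677-T732'.
    `observed` gives one amino acid letter per position, in the same order.
    """
    # The first character of each token is the reference amino acid.
    reference = [token[0] for token in haplotype.split("-")]
    if len(reference) != len(observed):
        raise ValueError("observed residues must match the haplotype length")
    # Count positions where the observed residue differs from the reference
    # (an unimputable 'X' therefore counts as a difference).
    n_subs = sum(1 for ref, obs in zip(reference, observed) if ref != obs)
    return f"{observed}-{n_subs}"


print(polymorphism_label("D614-Q677-T732", "GHS"))  # GHS-3
print(polymorphism_label("D614-Q677", "DH"))        # DH-1
```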
VRV-haplotypes. Cross-tabulation of individual VRV-haplotypes with GISAID-assigned lineages in all 167,893 sequences, excluding lineages with fewer than 10 occurrences. “Freq”, corresponding haplotype frequencies; “Unknown”, sequences not assigned to any lineage.
| Hap-Load | Freq | Unknown | A.2.4 | B.1 | B.1.1 | B.1.1.1 | B.1.1.171 | B.1.1.222 | B.1.1.29 | B.1.1.304 | B.1.1.317 | B.1.152 | B.1.165 | B.1.166 | B.1.2 | B.1.215 | B.1.234 | B.1.256 | B.1.324 | B.1.350 | B.1.354 | B.1.360 | B.1.399 | B.1.94 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (1) D80-F157-L452-D614-T859-D950 | ||||||||||||||||||||||||
| DSRGNH-5 | 63 | 58 | 5 | |||||||||||||||||||||
| GSLGNH-5 | 9 | 9 | ||||||||||||||||||||||
| GSRGND-5 | 21 | 19 | 1 | |||||||||||||||||||||
| GSRGNH-6 | 539 | 522 | 1 | 2 | 3 | 5 | ||||||||||||||||||
| (2) D80-S155-F157-L452-T859-D950 | ||||||||||||||||||||||||
| DRSRNH-5 | 39 | 39 | ||||||||||||||||||||||
| GRSRND-5 | 3 | 3 | ||||||||||||||||||||||
| GRSRNH-6 | 30 | 30 | ||||||||||||||||||||||
| GSSRNH-5 | 509 | 492 | 1 | 2 | 3 | 5 | ||||||||||||||||||
| (3) G142-E180-D614-Q677-S940 | ||||||||||||||||||||||||
| SEGHF-4 | 3 | 1 | 1 | 1 | ||||||||||||||||||||
| SVGHF-5 | 353 | 353 | ||||||||||||||||||||||
| SVGHS-4 | 273 | 2 | 1 | 262 | 8 | |||||||||||||||||||
| (4) S155-F157-L452-T859-D950 | ||||||||||||||||||||||||
| RSRND-4 | 3 | 3 | ||||||||||||||||||||||
| RSRNH-5 | 69 | 69 | ||||||||||||||||||||||
| SSRNH-4 | 533 | 511 | 1 | 7 | 3 | 5 | ||||||||||||||||||
| (5) S13-W152-L452-D614 | ||||||||||||||||||||||||
| ICLG-3 | 43 | 1 | 36 | 3 | ||||||||||||||||||||
| ICRG-4 | 795 | 51 | 557 | 1 | 4 | 10 | 14 | 10 | 34 | 2 | 72 | |||||||||||||
| IWRG-3 | 120 | 1 | 77 | 7 | 2 | 28 | ||||||||||||||||||
| SCRG-3 | 30 | 4 | 16 | 4 | ||||||||||||||||||||
| (6) S494-D614-P681-T716 | ||||||||||||||||||||||||
| PGHI-4 | 521 | 467 | 1 | 1 | 1 | 20 | 3 | |||||||||||||||||
| PGHT-3 | 194 | 100 | 8 | 3 | 31 | 2 | 3 | 29 | 1 | |||||||||||||||
| RGHI-4 | 3 | 3 | ||||||||||||||||||||||
| SGHI-3 | 38 | 19 | 3 | 1 | 4 | |||||||||||||||||||
| (7) T478-D614-P681-T732 | ||||||||||||||||||||||||
| KGHA-4 | 2132 | 11 | 17 | 2 | 14 | 2029 | 18 | 1 | 12 | 2 | ||||||||||||||
| KGHS-4 | 6 | |||||||||||||||||||||||
| KGHT-3 | 159 | 4 | 57 | 3 | 67 | 8 | 1 | |||||||||||||||||
| KGPA-3 | 5 | 1 | 3 | 1 | ||||||||||||||||||||
| TGHA-3 | 85 | 13 | 63 | 2 | 1 | 2 | 2 | |||||||||||||||||
| (8) F157-L452-D614-T859 | ||||||||||||||||||||||||
| FQGN-3 | 22 | 22 | ||||||||||||||||||||||
| FRGI-3 | 15 | 14 | 1 | |||||||||||||||||||||
| FRGN-3 | 5 | 5 | ||||||||||||||||||||||
| SLGN-3 | 11 | 10 | 1 | |||||||||||||||||||||
| SRGN-4 | 625 | 601 | 1 | 7 | 3 | 6 | ||||||||||||||||||
| SRGT-3 | 37 | 33 | ||||||||||||||||||||||
Green shading: >100 occurrences. Light green shading: >10 occurrences.
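A cross-tabulation of this kind can be built by counting (haplotypic polymorphism, lineage) pairs and dropping rare lineages. The sketch below is illustrative only: it assumes the 10-occurrence cutoff applies to a lineage's total count across all sequences, and the function name, record layout, lineage `A.9`, and toy counts are all made up rather than taken from the paper.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def cross_tabulate(
    records: Iterable[Tuple[str, str]], min_lineage_count: int = 10
) -> Dict[str, Dict[str, int]]:
    """Count sequences per (polymorphism, lineage), skipping rare lineages.

    `records` holds one (polymorphism_label, lineage) pair per sequence.
    Lineages seen fewer than `min_lineage_count` times overall are excluded.
    """
    records = list(records)
    lineage_totals = Counter(lineage for _, lineage in records)
    table: Dict[str, Counter] = defaultdict(Counter)
    for label, lineage in records:
        if lineage_totals[lineage] >= min_lineage_count:
            table[label][lineage] += 1
    return {label: dict(counts) for label, counts in table.items()}


# Toy input with made-up counts.
toy = (
    [("ICRG-4", "B.1.2")] * 12
    + [("IWRG-3", "B.1.2")] * 3
    + [("ICRG-4", "A.9")] * 2  # rare lineage, dropped by the cutoff
)
print(cross_tabulate(toy))  # {'ICRG-4': {'B.1.2': 12}, 'IWRG-3': {'B.1.2': 3}}
```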
Figure 3. Homology modeling of Spike mutations and haplotypic polymorphisms over time of the S13-W152-L452 VRV-haplotype. (A,B) Modeled structure of the Spike protein trimer with (A) D1118 or (B) H1118 (homology-modelled using PDB entry 7KRS as the template structure). Spike protein monomers are displayed in blue, salmon, and aquamarine; aspartic acid and histidine residues are rendered as CPK images. (C–F) Frequencies over time for seven commonly observed haplotypic polymorphisms of the S13-W152-L452 VRV-haplotype, out of its polymorphisms in the US. Only haplotypic polymorphisms with at least 50 observations are included. Nomenclature is as follows: The first three letters designate the amino acids present at positions 13, 152, and 452, respectively; the number after the hyphen designates the number of amino acids at these three positions that do not match their reference strain equivalents. Numbers of sequences harboring each S13-W152-L452 haplotypic polymorphism (across the entire US) are shown in parentheses. Frequencies of seven common S13-W152-L452 VRV-haplotypic polymorphisms (C) in the entire US; (D) in California, Oregon, and Washington combined; (E) in Arizona, Colorado, Nevada, New Mexico, and Tennessee combined; and (F) in Florida and Georgia combined. (G) Homology-modeled complex of the receptor-binding domain of the Spike protein (salmon), harboring the L452R mutation, bound to the angiotensin-converting enzyme 2 (ACE2) receptor (aquamarine). Within the R452 residue, nitrogen atoms are shown in blue and carbon atoms are shown in grey.