Angelo Boccia1, Rossella Tufano1, Veronica Ferrucci1,2, Leandra Sepe1,2, Martina Bianchi3, Stefano Pascarella3, Massimo Zollo1,2,4, Giovanni Paolella1,2.
Abstract
Tracing the appearance and evolution of virus variants is essential in the management of the COVID-19 pandemic. Here, we focus on SARS-CoV-2 spread in Italian patients by using viral sequences deposited in public databases and a tracing procedure that monitors the evolution of the pandemic and detects the spread, within the infected population, of emergent sub-clades under potential positive selection. Analyses of a collection of monthly samples focused on Italy highlighted the appearance and evolution of all the main viral sub-trees emerging at the end of the first year of the pandemic. They also identified additional expanding subpopulations which spread during the second year (i.e., 2021). Three-dimensional (3D) modelling of the main amino acid changes in mutated viral proteins, including ORF1ab (nsp3, nsp4, 2'-O-ribose methyltransferase, nsp6, helicase, nsp12 [RdRp]), N, ORF3a, ORF8, and spike proteins, shows the potential of the analysed structural variations to result in epistatic modulation and positive/negative selection pressure. These analyses will be of importance to the early identification of emerging clades, which can develop into new "variants of concern" (i.e., VOC). These analyses and settings will also help SARS-CoV-2 coronet genomic centers in other countries to trace emerging worldwide variants.
Keywords: SARS-CoV-2; variant of concern (VOC); viral variants tracing
Year: 2022 PMID: 35456974 PMCID: PMC9029933 DOI: 10.3390/ijms23084155
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 6.208
Figure 1. Phylogenetic analysis of the 2020 dataset. The analysis is focused on viral sequences from Italy and neighboring countries, sampled during 2020. The number of samples from each country of the focus is reported in the inset. The samples in the tree are colored on the basis of their origin, according to the colors used in the inset; the samples in gray are additional samples from the rest of the world, added to ensure the presence of all major globally spread virus subpopulations. The occurrence of known clades is indicated by the labels near the top node of the corresponding sub-tree. The bottom bars indicate the time periods in which the Italian government implemented stringent (black) or relaxed (gray) mobility restrictions.
Figure 2. Growing subsets. The ten identified subsets are shown in color in the context of the phylogenetic tree. Subsets contained in a larger one are indicated by their names, separated by a dash, as for 2–3, in which S2 is contained in S3. The occurrence of known clades is indicated by the labels near the top node of the corresponding sub-tree. The bottom bars indicate the time periods in which the Italian government implemented stringent (black) or relaxed (gray) mobility restrictions.
Table 1. Annotated list of growing subsets identified in the 2020 dataset.
| Family | Subset | Date | Internal Subsets | # Viral Seqs | Score | Skew | Origin: Italy | Origin: Others | # Sources: Div. | # Sources: Cent. | Sex: F | Sex: M | Sex: U | Clade: Parent | Clade: Is | Related Pango Lineage: Name | Related Pango Lineage: Rel. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 1 | 4/2020 | | 61 | 11.32 | −0.84 | 41 (67%) | 20 (33%) | 6 | 10 | 18 | 31 | 12 | 20A | | B.1.258 | ovr |
| | 2 | 6/2020 | | 47 | 7.21 | −0.68 | 47 (100%) | 0 (0%) | 3 | 5 | 23 | 24 | 0 | 20D | | C.18 | not |
| | 3 | 4/2020 | 2 | 91 | 14.33 | −0.93 | 56 (62%) | 35 (38%) | 8 | 12 | 34 | 34 | 23 | 20D | | B.1.1.1 | par |
| | 4 | 10/2020 | | 43 | 54.79 | −3.07 | 25 (58%) | 18 (42%) | 6 | 12 | 18 | 16 | 9 | 20B | 20I/501Y.V1 | B.1.1.7 | ovr |
| | 5 | 3/2020 | | 45 | 6.65 | −0.66 | 4 (9%) | 41 (91%) | 3 | 3 | 7 | 6 | 32 | 20B | | B.1.1.39 | ovr |
| | 6 | 4/2020 | | 31 | 7.77 | −0.77 | 20 (65%) | 11 (35%) | 4 | 7 | 6 | 17 | 8 | 20A.EU2 | | B.1.160 | par |
| | 7 | 6/2020 | 6 | 164 | 8.66 | −0.56 | 68 (41%) | 96 (59%) | 6 | 12 | 29 | 56 | 79 | 20A | 20A.EU2 | B.1.160 | ovr |
| | 8 | 6/2020 | | 52 | 20.53 | −1.33 | 27 (52%) | 25 (48%) | 5 | 8 | 19 | 13 | 20 | 20E (EU1) | | B.1.177 | par |
| | 9 | 7/2020 | | 33 | 5.81 | −0.66 | 22 (67%) | 11 (33%) | 5 | 6 | 9 | 16 | 8 | 20E (EU1) | | B.1.177 | par |
| | 10 | 8/2020 | | 49 | 13.77 | −1 | 47 (96%) | 2 (4%) | 4 | 5 | 13 | 34 | 2 | 20E (EU1) | | B.1.177.33 | not |
The ‘Family’ column indicates the overlapping subsets. ‘Date’ is the inferred date of the subset's earliest appearance. ‘Score’ is the parameter used to identify emerging subsets, as detailed under ‘Methods’. ‘Origin’ indicates the number and percent of samples collected in Italy and in other countries. ‘# Sources’ reports the number of administrative divisions (Div.) and laboratories (Cent.) from which the sequences are sourced. ‘Clade’ reports the parent as well as, when available, the corresponding Nextstrain clade. ‘Related Pango Lineage’ contains the closest Pango lineage and its relationship with the subset, annotated by using the following schema: ovr, the Pango lineage overlaps with the subset; par, the Pango lineage corresponds with the Nextstrain parental clade from which the subset derives; and not, the Pango lineage was not initially recognized as it was not available in the Pango collection at the time of analysis.
Table 2. List of mutated sites observed in the subsets, sorted by genomic coordinates.
| Variant | Variant | Nucleotide | Subset | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |||
| - | 241 | C | C | C | C | C | C | C | C | C | C |
| leader: V60V | 445 | C | C | C | |||||||
| nsp2: S36S | 913 | B | |||||||||
| nsp3: F106F | 3037 | C | C | C | C | C | C | C | C | C | C |
| nsp3: T183I | 3267 | B | |||||||||
| nsp3: T428I | 4002 | C | C | ||||||||
| nsp3: T608T | 4543 | C | B | ||||||||
| nsp3: Y817Y | 5170 | B | |||||||||
| nsp3: A890D | 5388 | B | |||||||||
| nsp3: T970T | 5629 | C | B | ||||||||
| nsp3: F1089F | 5986 | B | |||||||||
| nsp3: T1189T | 6286 | C | C | C | |||||||
| nsp3: I1412T | 6954 | B | |||||||||
| nsp3: T1456I | 7086 | B | |||||||||
| nsp3: I1683T | 7767 | B | |||||||||
| nsp3: Y1776Y | 8047 | B | |||||||||
| nsp3: S1807S | 8140 | b | |||||||||
| nsp4: S76S | 8782 | ||||||||||
| nsp4: V192V | 9130 | B | |||||||||
| nsp4: M324I | 9526 | C | B | ||||||||
| 3C-like proteinase: G15S | 10,097 | C | C | ||||||||
| 3C-like proteinase: F66F | 10,252 | B | b | ||||||||
| nsp6: A54S | 11,132 | B | |||||||||
| nsp6: S106- | 11,288–11,290 | B | |||||||||
| nsp6: G107- | 11,291–11,293 | B | |||||||||
| nsp6: F108- | 11,294–11,296 | B | |||||||||
| nsp6: Y175Y | 11,497 | C | C | ||||||||
| nsp7: D38D | 11,956 | B | b | ||||||||
| RdRp: Y32Y | 13,536 | C | C | ||||||||
| RdRp: A185S | 13,993 | C | B | ||||||||
| RdRp: P323L | 14,408 | C | C | C | C | C | C | C | C | C | C |
| RdRp: P412P | 14,676 | B | |||||||||
| RdRp: H613H | 15,279 | B | |||||||||
| RdRp: V776L | 15,766 | C | B | ||||||||
| RdRp: L891L | 16,111 | b | |||||||||
| RdRp: T912T | 16,176 | B | |||||||||
| helicase: K218R | 16,889 | C | B | ||||||||
| helicase: E261D | 17,019 | C | B | ||||||||
| helicase: H290Y | 17,104 | B | |||||||||
| helicase: K460R | 17,615 | B | |||||||||
| 3′-to-5′ exon.: L280L | 18,877 | C | B | ||||||||
| endoRNAse: L216L | 20,268 | B | |||||||||
| 2′-o-MT: K160K | 21,138 | B | b | ||||||||
| 2′-o-MT: S166A | 21,154 | B | b | ||||||||
| 2′-o-MT: A199A | 21,255 | C | C | C | |||||||
| - | 21,767–21,769 | B | |||||||||
| - | 21,770 | B | |||||||||
| - | 21,992–21,993 | B | |||||||||
| - | 22,227 | C | C | C | |||||||
| - | 22,879 | B | |||||||||
| - | 22,992 | C | B | ||||||||
| - | 23,063 | B | |||||||||
| - | 23,271 | B | |||||||||
| - | 23,403 | C | C | C | C | C | C | C | C | C | C |
| - | 23,477 | B | |||||||||
| - | 23,587 | b | |||||||||
| - | 23,604 | B | |||||||||
| - | 23,709 | B | |||||||||
| - | 23,731 | C | C | ||||||||
| - | 24,506 | B | |||||||||
| - | 24,914 | B | |||||||||
| - | 25,563 | C | B | ||||||||
| - | 25,710 | C | B | ||||||||
| - | 26,735 | C | B | ||||||||
| - | 26,801 | C | C | C | |||||||
| - | 26,876 | C | B | ||||||||
| - | 27,319 | B | |||||||||
| - | 27,800 | B | |||||||||
| - | 27,944 | B | |||||||||
| - | 27,972 | B | |||||||||
| - | 27,982 | b | |||||||||
| - | 28,048 | B | |||||||||
| - | 28,111 | B | |||||||||
| - | 28,280–28,282 | B | |||||||||
| - | 28,487 | B | |||||||||
| - | 28,868 | B | |||||||||
| - | 28,881–28,882 | C | C | C | C | ||||||
| - | 28,883 | C | C | C | C | ||||||
| - | 28,932 | C | C | C | |||||||
| - | 28,975 | C | B | ||||||||
| - | 28,977 | B | |||||||||
| - | 29,366 | B | |||||||||
| - | 29,399 | C | B | ||||||||
| - | 29,645 | C | C | C | |||||||
| - | 29,706 | b | |||||||||
| - | 29,734 | B | |||||||||
Columns 1 and 2 report mutations detected by comparing the consensus sequence of each subset to Wuhan-Hu-1 (RefSeq NC_045512.2), used as reference. Mutations are described following the standard naming convention, where the standard one-letter code indicates the original and mutated amino acids, and ‘*’ and ‘-’ indicate stop codons and deleted amino acids, respectively. Numbers indicate the amino acid position: in column 1 they refer to the main protein encoded by each gene, while in column 2 they refer to the peptide produced by cleavage of the polyprotein encoded by ORF1ab. In the ‘Subset’ columns, mutations shared among multiple subsets deriving from the same clade are indicated with a ‘C’, while subset-specific mutations are reported as ‘B’ or ‘b’, depending on whether their frequency within the subset is greater or less than 80%, respectively.
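The notation described in the note above can be illustrated with a short Python sketch. The names `AAChange`, `parse_change`, and `label_mutation` are hypothetical and are not taken from the paper; the sketch also assumes that a frequency of exactly 80% falls in the ‘b’ class, which the note does not specify.

```python
import re
from typing import NamedTuple, Optional


class AAChange(NamedTuple):
    protein: str    # e.g. "nsp3"
    original: str   # one-letter code of the reference amino acid
    position: int   # amino acid position within the protein/peptide
    mutated: str    # one-letter code, '*' (stop codon) or '-' (deletion)


# "<protein>: <original><position><mutated>", e.g. "nsp6: S106-"
_CHANGE = re.compile(
    r"^(?P<protein>[^:]+):\s*(?P<orig>[A-Z])(?P<pos>\d+)\s*(?P<mut>[A-Z*\-])$"
)


def parse_change(text: str) -> Optional[AAChange]:
    """Parse one variant cell; return None if it is not an amino acid change."""
    match = _CHANGE.match(text.strip())
    if match is None:
        return None
    return AAChange(
        protein=match["protein"].strip(),
        original=match["orig"],
        position=int(match["pos"]),
        mutated=match["mut"],
    )


def label_mutation(frequency: float, shared_in_clade: bool,
                   threshold: float = 0.80) -> str:
    """Return 'C' for clade-shared mutations, else 'B' or 'b' by frequency."""
    if not 0.0 <= frequency <= 1.0:
        raise ValueError("frequency must be a fraction between 0 and 1")
    if shared_in_clade:
        return "C"
    # Assumption: exactly 80% is treated as 'b' (not above the threshold).
    return "B" if frequency > threshold else "b"
```

For instance, `parse_change("nsp3: T183I")` yields the protein `nsp3`, position 183 and the T-to-I change, while a non-coding entry such as `-` yields `None`.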
Figure 3. Genome layout of genes/proteins carrying mutations that emerged during the year 2020. The approximate positions on the SARS-CoV-2 genome are shown for the genes coding for proteins that carry mutations from the year 2020 and were subjected to structural analysis. The protein structures are displayed as ribbon models and enclosed in frames colored as the corresponding ORFs.
Table 3. Impact of mutations on protein conformation and stability.
| Subset | Gene | Protein/Peptide | Nucleotide | AA Change | Interactions | Stability (ΔΔG) | Structure |
|---|---|---|---|---|---|---|---|
| S1 | ORF1ab | nsp3 | 7767 | I1683T | Decreased hydrophobic interactions to V1678, L1685 | | I-Tasser model |
| | | helicase | 17,104 | H290Y | H-bond to peptide bond of E261. Stacking with F262 | Stabilizing ++ (+2.790 kcal/mol) | 5RL9 |
| | S | spike | 22,879 | N439K | | | |
| S2–3 | ORF1ab | 2′-O-ribose methyltransferase | 21,154 | S166A | Hydrophobic interactions to L59 and L126 (H-bond to N210 abolished) | Stabilizing + (+0.425 kcal/mol) | 6W75 |
| S4 | ORF1ab | nsp3 | 3267 | T183I | Removes an OH group. No polar interaction to Q185 or Q180 | | I-Tasser model |
| | | | 5388 | A890D | No specific interaction. At the N-terminus of an α-helix | | |
| | | | 6954 | I1412T | Smaller side chain. No specific interaction | | |
| | | nsp6 | 11,288–11,290 | S106- | | | |
| | | | 11,291–11,293 | G107- | | | |
| | | | 11,294–11,296 | F108- | | | |
| | | helicase | 17,615 | K460R | H-bond to Y457 and electrostatic interaction to D458. Contact to F437 | Stabilizing ++ (+1.254 kcal/mol) | 5RL9 |
| | S | spike | 21,767–21,769 | H69- | | | |
| | | | 21,770 | V70- | | | |
| | | | 21,992–21,993 | Y144- | | | |
| | | | 23,063 | N501Y | | | |
| | | | 23,271 | A570D | | | |
| | | | 23,604 | P681H | | | |
| | | | 23,709 | T716I | | | |
| | | | 24,506 | S982A | Loss of an inter-protomer H-bond between the S982 and T547 side chains D1118H: S2 | | |
| | | | 24,914 | D1118H | | | |
| | ORF8 | ORF8 | 27,972 | Q27* | | | 7JX6 |
| | | | 28,048 | R52I | Solvent exposed. Contact to S54 | | |
| | | | 28,111 | Y73C | Solvent exposed | | |
| | N | nucleocapsid | 28,280–28,282 | D3L | Hydrophobic interactions to V324 | | |
| | | | 28,977 | S235F | Exposed at the N-terminus of an α-helix | | |
| S5 | N | nucleocapsid | 28,487 | V72I | Increases hydrophobic interactions | | |
| | | | 28,868 | P199S | H-bond to S197 | | |
| S7 | ORF1ab | nsp4 | 9526 | M324I | Hydrophobic interactions to L321, L323 of the opposite helix | | I-Tasser model |
| | | RdRp | 13,993 | A185S | Adds H-bond to V182 and N213 peptide bond | | 6YYT |
| | | | 15,766 | V776L | Increases hydrophobic interactions to H752, F753 and Y748 | | |
| | | helicase | 16,889 | K218R | Exposed to the solvent | | 5RL9 |
| | | | 17,019 | E261D | H-bond to S259 (H-bonds to Y324 and H290 are abolished) | | |
| | S | spike | 22,992 | S477N | | | |
| | ORF3a | ORF3a | 25,563 | Q57H | Contact to His57 from the other subunit. Wall of the central pore. Interacts with Lys61 | Stabilizing ++ (+1.620 kcal/mol) | 6XDC |
| | N | nucleocapsid | 28,975 | M234I | C-terminus of an α-helix | | |
| | | | 29,399 | A376T | Potential H-bond to K374 | | |
| S8 | ORF1ab | nsp6 | 11,132 | A3623S | | | |
| | S | spike | 23,587 | Q675H | | | |
| | ORF8 | ORF8 | 27,982 | P30L | Solvent exposed | Stabilizing ++ (+1.620 kcal/mol) | 7JX6 |
| S10 | ORF1ab | nsp3 | 7086 | T1456I | Exposed in a loop at the C-terminus of an α-helix. Removes polar interaction to N1457 | | I-Tasser model |
| | S | spike | 23,477 | G639S | | | |
| | N | nucleocapsid | 29,366 | P365S | Exposed in a loop. Increases local flexibility? | | |
List of mutations organized by subset and, on a second level, by subgenomic mRNA/ORF. Mutations are expressed as already described in Table 2. The prediction of protein stability (produced using as template the structures indicated in ‘Structure’) is reported in ‘Stability’, where ‘stabilizing’ indicates that the structure of the protein is expected to increase in stability following the corresponding mutation. ‘Interactions’ reports the potential effect of amino acid changes on interaction with neighboring sites. ‘*’ indicates a stop codon.
Figure 4. Tracing of growing subsets over the year 2021. Each horizontal line corresponds to a dataset consisting of the sequences available at the beginning of the month indicated on the left side. The circles correspond to the subsets identified with the procedure used for Figure 2 and Table 1, with the area proportional to the number of sequences; colors and patterns highlight the subsets that share the same consensus sequence. The connectors highlight relations between subsets taken from subsequent months, based on the sequences present in both datasets and shared by the two subsets, as described in Methods, with the gray intensity proportional to the number of shared sequences. A symmetric relationship is indicated by connectors of constant thickness, meaning that most (>80%) sequences from each subset are contained in the other one. In contrast, lines with progressively reduced thickness indicate that the sequences of only one subset are mostly (>80%) contained in the target one, but the opposite is not true. The colored sticker on the top right corner of each subset indicates the clade to which the subset corresponds or from which it derives, as indicated in the legend. Subsets in the ‘January’ dataset are labeled with a number indicating the corresponding subset in the ‘first year’ dataset of Figure 2. On the left, the numbers of positive cases and deaths are reported over time. The gray gradients in the background indicate the percent of the population vaccinated with a first, second, and third dose, respectively, on the right, center, and left side. Red, orange, and yellow indicate the prevalent level of restrictions imposed by the Italian government (from the most to the least restrictive), as defined in the Italian Decree (G.U. n° 275 4 November 2020).
The reference graph on the left was produced by using cases/deaths and vaccination numbers published by “Civil Protection Department” [47] and by “Ministry of Health” [48], respectively.
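The connector rule described in the Figure 4 caption (symmetric versus one-directional containment at the 80% level) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `subset_relation`, the use of plain sets of sequence identifiers, the restriction of each subset to sequences present in both monthly datasets before computing fractions, and the extra `"partial"` and `"none"` outcomes are all hypothetical choices made for the example.

```python
from typing import Set


def subset_relation(prev_subset: Set[str], next_subset: Set[str],
                    prev_dataset: Set[str], next_dataset: Set[str],
                    threshold: float = 0.80) -> str:
    """Classify the link between subsets found in two consecutive months."""
    # Only sequences present in both monthly datasets can be compared.
    common = prev_dataset & next_dataset
    prev_c = prev_subset & common
    next_c = next_subset & common
    shared = prev_c & next_c
    if not shared:
        return "none"  # hypothetical label: no connector would be drawn

    # Fraction of each (restricted) subset that is contained in the other.
    prev_in_next = len(shared) / len(prev_c) > threshold
    next_in_prev = len(shared) / len(next_c) > threshold

    if prev_in_next and next_in_prev:
        return "symmetric"      # connector of constant thickness
    if prev_in_next:
        return "prev_in_next"   # earlier subset mostly contained in later one
    if next_in_prev:
        return "next_in_prev"   # later subset mostly contained in earlier one
    return "partial"            # hypothetical label: overlap below threshold


# Toy example with made-up sequence identifiers.
january = {f"s{i}" for i in range(1, 11)}
february = {f"s{i}" for i in range(1, 13)}
print(subset_relation({"s1", "s2"}, {"s1", "s2", "s3", "s4", "s5", "s6"},
                      january, february))
```

In the toy example both sequences of the earlier subset fall inside the later one, but only a third of the later subset comes from the earlier one, so the relation is one-directional.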