Bethany Dearlove1,2,3,4, Eric Lewitus1,2,3,4, Hongjun Bai1,2,3,4, Yifan Li1,2,3,4, Daniel B Reeves5, M Gordon Joyce1,3, Paul T Scott1, Mihret F Amare1,3, Sandhya Vasan2,3,4, Nelson L Michael4, Kayvon Modjarrad6,4, Morgane Rolland6,2,3,4.
Abstract
The magnitude of the COVID-19 pandemic underscores the urgency for a safe and effective vaccine. Many vaccine candidates focus on the Spike protein, as it is targeted by neutralizing antibodies and plays a key role in viral entry. Here we investigate the diversity seen in severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) sequences and compare it to the sequence on which most vaccine candidates are based. Using 18,514 sequences, we perform phylogenetic, population genetics, and structural bioinformatics analyses. We find limited diversity across SARS-CoV-2 genomes: Only 11 sites show polymorphisms in >5% of sequences; yet two mutations, including the D614G mutation in Spike, have already become consensus. Because SARS-CoV-2 is being transmitted more rapidly than it evolves, the viral population is becoming more homogeneous, with a median of seven nucleotide substitutions between genomes. There is evidence of purifying selection but little evidence of diversifying selection, with substitution rates comparable across structural versus nonstructural genes. Finally, the Wuhan-Hu-1 reference sequence for the Spike protein, which is the basis for different vaccine candidates, matches optimized vaccine inserts, being identical to an ancestral sequence and one mutation away from the consensus. While the rapid spread of the D614G mutation warrants further study, our results indicate that drift and bottleneck events can explain the minimal diversity found among SARS-CoV-2 sequences. These findings suggest that a single vaccine candidate should be efficacious against currently circulating lineages.
Keywords: SARS-CoV-2; evolution; vaccine
Year: 2020 PMID: 32868447 PMCID: PMC7519301 DOI: 10.1073/pnas.2008281117
Source DB: PubMed Journal: Proc Natl Acad Sci U S A ISSN: 0027-8424 Impact factor: 12.779
Fig. 1. SARS-CoV-2 diversity across 18,514 genomes. (A) Distribution representing the location and date of sample collection. (B) Location and frequency of sites with polymorphisms across the genome. Proportion of sequences that showed polymorphisms compared to the reference sequence, Wuhan-Hu-1 (GISAID: EPI_ISL_402125, GenBank: NC_045512). ORFs are shown in gray for nonstructural proteins and in color for structural proteins (S, purple; E, blue; M, green; N, red). (C) Global phylogeny of 18,514 independent genome sequences. The tree was rooted at the reference sequence, Wuhan-Hu-1, and tips are colored by collection location. The scale indicates the distance corresponding to one substitution. Lineages are labeled following PANGOLIN (22).
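The per-site proportion plotted in Fig. 1B can be illustrated with a minimal Python sketch. It assumes the sequences are already aligned to the reference and contain no gaps or ambiguous bases; the function name `polymorphic_sites`, the toy sequences, and the handling of the 5% threshold are illustrative assumptions, not the authors' pipeline.

```python
def polymorphic_sites(reference, sequences, threshold=0.05):
    """Return {1-based site: fraction of sequences differing from the reference}
    for sites where that fraction exceeds `threshold`.

    Assumption: every sequence is aligned to `reference` and has the same length.
    """
    if not sequences:
        raise ValueError("at least one sequence is required")
    if any(len(s) != len(reference) for s in sequences):
        raise ValueError("all sequences must match the reference length")

    n = len(sequences)
    result = {}
    for i, ref_base in enumerate(reference):
        # Count sequences whose base at this column differs from the reference.
        differing = sum(1 for s in sequences if s[i] != ref_base)
        fraction = differing / n
        if fraction > threshold:
            result[i + 1] = fraction  # report sites with 1-based numbering
    return result


# Toy alignment (hypothetical data): sites 2 and 4 differ from the reference.
toy_reference = "ACGT"
toy_sequences = ["ACGT", "ACGT", "ATGT", "ATGA"]
print(polymorphic_sites(toy_reference, toy_sequences))
```

On real data, gaps and ambiguity codes such as N would need to be excluded from both the numerator and the denominator before applying the threshold.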
Fig. 2. The S mutation D614G quickly became dominant. The mutation D614G was found in 69% of sequences sampled globally as of May 18, 2020; the second most frequent mutation in S was found in only ∼2% of sequences. (A) Number of sequences with D (gray) or G (purple) by continent and sampling date shown cumulatively through the outbreak. (B) Phylogenetic tree reconstructed from all of the ORFs showing the linkage between D614G in S and P4715L in ORF1ab. Tips are colored by continent. The phylogeny suggests that these mutations were linked to a bottleneck event when SARS-CoV-2 viruses were introduced in Europe; this mutation was first seen in Europe in a sequence sampled in Germany at the end of January. There is no evidence that the increasing predominance of this mutation was caused by convergent selection events that would have occurred in multiple individuals. (C) Overall number of sequences with D614 or D614G across continents; the predominance of D614G in Europe is suggestive of a founder event. (D) Distribution of Hamming distances between sequences with D614, G614 or discordant pairs. The median is marked with a dashed line.
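The pairwise comparison summarized in Fig. 2D can be sketched as follows. This is a toy illustration only: the short strings stand in for aligned sequences, the character at index `site` stands in for the residue at Spike position 614, and the names `hamming` and `distances_by_pair_type` are hypothetical helpers rather than the authors' code.

```python
from itertools import combinations
from statistics import median


def hamming(a, b):
    """Number of positions at which two equal-length aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def distances_by_pair_type(seqs, site):
    """Group pairwise Hamming distances by the residues found at `site` (0-based).

    Assumption: every sequence carries either 'D' or 'G' at `site`;
    any pair that is not D-D or G-G is counted as discordant.
    """
    out = {"D-D": [], "G-G": [], "discordant": []}
    for a, b in combinations(seqs, 2):
        if a[site] == b[site] == "D":
            key = "D-D"
        elif a[site] == b[site] == "G":
            key = "G-G"
        else:
            key = "discordant"
        out[key].append(hamming(a, b))
    return out


# Hypothetical toy sequences; index 2 plays the role of position 614.
toy = ["ACDKL", "ACDKM", "ACGKL", "TCGKL"]
dists = distances_by_pair_type(toy, site=2)
medians = {k: median(v) for k, v in dists.items() if v}
print(medians)
```

With 18,514 genomes the number of pairs exceeds 170 million, so a real analysis would likely subsample or use a vectorized distance routine instead of this quadratic loop.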
Fig. 3. Evolution across the SARS-CoV-2 genome. (A) Bar plot of the average percentage of branch length under diversifying selection (dN/dS > 1) for each site. (B) Bar plot of dN/dS per gene (dN = dS is shown as dashed line). Error bars indicate SD across subsampled alignments. (C) Box plot of nonsynonymous substitutions per lineage per site across structural and nonstructural genes. Values across subsampled alignments for each gene are plotted. (D) Average percentage (over subsampled alignments) of branch lengths evolving under neutral (or negative) selection per site for each structural gene. Median values are shown by dashed lines.
Fig. 4. Limited evidence of adaptation of the viral population. (A–C) Bootstrapped global estimates of Nei’s GST and Jost’s D for population differentiation for each structural gene. (A) Estimates of Nei’s GST (closed circles) and Jost’s D (open circles) comparing sequences sampled from the Hubei province to sequences subsequently sampled globally. Estimates of (B) Nei’s GST and (C) Jost’s D comparing sequences sampled before or after a specific date. Lines connect the median estimates across datasets for each gene. (D) Ln-transformed phylogenetic η, indicative of the number of iterative events in the sampled subtree, for subtrees from each internal node (after the root) of a down-sampled SARS-CoV-2 whole-genome phylogeny (dark gray), of a phylogeny simulated under neutral parameters (gold), and of a phylogeny simulated under positive time-dependent rates (b(t) = 0.01e^(0.4t), green). (E) Box plot of ln-transformed phylogenetic η estimates across all down-sampled SARS-CoV-2 whole-genome phylogenies, phylogenies simulated under neutral parameters, and phylogenies simulated under different positive time dependencies. Asterisks indicate significant differences in mean values (Student’s t test, P < 0.05) between the SARS-CoV-2 and positive time-dependent phylogenies at each time dependency.
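For readers unfamiliar with the two differentiation statistics named in Fig. 4, the sketch below computes their basic single-locus forms, GST = (HT − HS)/HT and D = ((HT − HS)/(1 − HS)) · k/(k − 1), where HS is the mean within-group heterozygosity, HT the heterozygosity of the averaged allele frequencies, and k the number of groups. It omits the sample-size corrections and the bootstrapping mentioned in the caption, so it should be read as an illustration under those assumptions, not as the estimator the authors used; the function names and toy data are hypothetical.

```python
from collections import Counter


def allele_freqs(alleles):
    """Relative frequency of each allele observed in one group."""
    counts = Counter(alleles)
    n = len(alleles)
    return {a: c / n for a, c in counts.items()}


def gst_and_jost_d(groups):
    """Uncorrected Nei's GST and Jost's D for one locus.

    `groups` is a list of allele collections, one per population
    (e.g. the residue at one site in each sampled sequence).
    """
    k = len(groups)
    if k < 2:
        raise ValueError("need at least two groups")
    freqs = [allele_freqs(g) for g in groups]
    alleles = set().union(*freqs)

    # HS: mean expected heterozygosity within groups.
    hs = sum(1 - sum(f.get(a, 0.0) ** 2 for a in alleles) for f in freqs) / k
    # HT: expected heterozygosity of the unweighted mean allele frequencies.
    mean_freq = {a: sum(f.get(a, 0.0) for f in freqs) / k for a in alleles}
    ht = 1 - sum(p ** 2 for p in mean_freq.values())

    if ht == 0:
        return 0.0, 0.0  # no variation at all, so no differentiation
    gst = (ht - hs) / ht
    jost_d = (ht - hs) / (1 - hs) * k / (k - 1)
    return gst, jost_d


# Hypothetical toy groups: fully differentiated, then identical.
print(gst_and_jost_d(["DDDD", "GGGG"]))
print(gst_and_jost_d(["DDGG", "DDGG"]))
```

Both statistics range from 0 (identical allele frequencies across groups) to 1 (no shared alleles) in the two-group, fixed-difference case shown here, which is why the caption reports them side by side.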
Fig. 5. Mutations across SARS-CoV-2 S sequences. (A) Structure of SARS-CoV (Protein Data Bank ID code 5X58) (shown instead of SARS-CoV-2 for completeness of the Receptor Binding Motif [RBM]). (B–D) The three protomers in the closed SARS-CoV-2 S glycoprotein (Protein Data Bank ID code 6VXX) are colored in yellow, cyan, and white. Sites with mutations are shown as spheres. (B) Near-identity of potential vaccine candidates. The MRCA and Wuhan-Hu-1 reference sequences were identical, while the consensus derived from all circulating sequences showed a mutation (D614G). Site 614 is located at the interface between two subunits. (C) Sequence segments that differed between human and pangolin or bat hosts. Amino acid segments 439 to 455 and 482 to 501 impact receptor binding, while the 574 to 690 segment corresponds to the S2 cleavage site. (D) Sites with shared mutations across SARS-CoV-2 circulating sequences. The colors of the spheres correspond to the proportion of SARS-CoV-2 sequences that differed from the Wuhan-Hu-1 sequence (GISAID: EPI_ISL_402125, GenBank: NC_045512). Mutations that were found only in one or two sequences were not represented.