Fatima Mostefai, Isabel Gamache, Arnaud N'Guessan, Justin Pelletier, Jessie Huang, Carmen Lia Murall, Ahmad Pesaranghader, Vanda Gaonac'h-Lovejoy, David J Hamelin, Raphaël Poujol, Jean-Christophe Grenier, Martin Smith, Etienne Caron, Morgan Craig, Guy Wolf, Smita Krishnaswamy, B Jesse Shapiro, Julie G Hussin.
Abstract
The genome of the Severe Acute Respiratory Syndrome coronavirus 2 (SARS-CoV-2), the pathogen that causes coronavirus disease 2019 (COVID-19), has been sequenced at an unprecedented scale, generating a tremendous amount of viral genome sequencing data. To assist in tracing infection pathways and designing preventive strategies, a deep understanding of the viral genetic diversity landscape is needed. We present here a set of genomic surveillance tools from population genetics that can be used to better understand the evolution of this virus in humans. To illustrate the utility of this toolbox, we detail an in-depth analysis of the genetic diversity of SARS-CoV-2 in the first year of the COVID-19 pandemic. We analyzed 329,854 high-quality consensus sequences published in the GISAID database during the pre-vaccination phase. We demonstrate that, compared to standard phylogenetic approaches, haplotype networks can be computed efficiently on much larger datasets. This approach enables real-time lineage identification, a clear description of the relationships between variants of concern, and efficient detection of recurrent mutations. Furthermore, the change in Tajima's D over time, computed per haplotype, provides a powerful metric of lineage expansion. Finally, principal component analysis (PCA) highlights key steps in variant emergence and facilitates the visualization of genomic variation in the context of SARS-CoV-2 diversity. The computational framework presented here is simple to implement and insightful for real-time genomic surveillance of SARS-CoV-2, and it could be applied to any pathogen that threatens the health of populations of humans and other organisms.
Keywords: SARS-CoV-2; haplotype network; lineage annotation; population genomics; principal component analysis; variant detection; viral surveillance
Year: 2022 PMID: 35265640 PMCID: PMC8899026 DOI: 10.3389/fmed.2022.826746
Source DB: PubMed Journal: Front Med (Lausanne) ISSN: 2296-858X
Figure 1. A data-driven methodological pipeline for analyzing viral genomic data. This workflow recapitulates the major steps used to analyze SARS-CoV-2 consensus sequence data submitted to GISAID during the first year of the COVID-19 pandemic. Dark blue arrows represent steps where all positions are kept (except spurious sites); blue arrows represent steps where subsets of positions are kept (indicated next to the arrow); yellow boxes represent filtering steps at the level of sequences; light blue boxes represent the methodological steps. The main steps are numbered from 1 to 8. These population genetic and unsupervised learning methods constitute a comprehensive toolbox that allows the scientific community to monitor the evolution of the virus efficiently. Box plot modified from Bejarano (15).
Figure 2. Viral genetic diversity during the first year of the pandemic. (A) Top panel shows Derived Allele Frequencies (DAF) over time of representative high-frequency substitutions during the first year of the pandemic. Only positions that exceed a DAF of 10% in a given month are shown. Positions with highly correlated DAF trajectories (r² > 0.99) share the same line color. Solid lines mark mutations appearing in the first wave of the pandemic (January–July), and dotted lines mark mutations appearing during the second wave (August–December). Bottom panel shows the daily case counts in the five countries that contributed the most GISAID sequences. The y-axis represents the percentage of maximum cases per day, rolled over 14 days. On the x-axis, only ticks for the first of each month are shown. (B) Haplotype network representing genetic subtypes based on representative mutations (positions underlined in A). Genomic positions that differ between two nodes (haplotypes) are specified on the edges. Nodes are colored by haplotype, and node size represents the number of consensus sequences for each haplotype. The 17 main haplotypes are annotated with Roman numerals. (C) Divergence tree built from 15,690 SARS-CoV-2 consensus sequences with FastTree under a GTR+Gamma20 model and refined with TreeTime. The haplotype network built from prevalent mutations using all high-quality consensus sequences recapitulates the phylogeny well.
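The DAF filter described for panel A can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: it assumes sequences are already aligned to a reference that stands in for the ancestral state, that each record is a hypothetical (month, sequence) pair, and that ambiguous bases such as N are simply skipped. The function names `monthly_daf` and `high_frequency_positions` are invented for this example.

```python
from collections import defaultdict


def monthly_daf(records, reference):
    """Return {month: {position: DAF}} using 1-based genomic positions.

    records: iterable of (month, aligned_sequence) pairs (assumed layout).
    reference: sequence treated as the ancestral state (an assumption).
    """
    derived = defaultdict(lambda: defaultdict(int))  # month -> pos -> derived count
    called = defaultdict(lambda: defaultdict(int))   # month -> pos -> A/C/G/T count
    for month, seq in records:
        for pos, (ref_base, base) in enumerate(zip(reference, seq), start=1):
            if base not in "ACGT":
                continue  # skip ambiguous or missing calls
            called[month][pos] += 1
            if base != ref_base:
                derived[month][pos] += 1
    return {
        month: {pos: derived[month][pos] / n for pos, n in counts.items()}
        for month, counts in called.items()
    }


def high_frequency_positions(daf_by_month, threshold=0.10):
    """Positions whose DAF exceeds the threshold in at least one month."""
    return sorted(
        {pos
         for dafs in daf_by_month.values()
         for pos, freq in dafs.items()
         if freq > threshold}
    )


# Toy data: four-base "genomes" sampled in two months.
reference = "ACGT"
records = [
    ("2020-01", "ACGT"),
    ("2020-01", "ACGT"),
    ("2020-02", "ATGT"),
    ("2020-02", "ACGT"),
    ("2020-02", "ANGA"),
]
daf = monthly_daf(records, reference)
kept = high_frequency_positions(daf)
print(kept)
```

On real data the per-position loop would normally be replaced by a vectorized pass over an alignment matrix, but the thresholding logic stays the same.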
Figure 3. Mutational signatures of the 17 major haplotypes. Aligned histograms for each of the 17 main haplotype groups. The y-axis of each histogram represents the frequency within each haplotype of mutations that differ from the reference nucleotide in at least 90% of the sequences represented. On the x-axis, each bar represents a mutated position colored by its substitution type and is labeled with the corresponding amino acid change (no labels are displayed for synonymous mutations). The annotation was done using SnpEff (30).
Figure 4. Distribution of SARS-CoV-2 sequences in space and time. Haplotype network of the first (A) and second (B) waves. Node size represents the number of consensus sequences for each haplotype, and pie charts represent continental proportions for each haplotype. (C) GISAID consensus sequence counts (on a log10 scale) of the most prevalent haplotypes on each continent during the first year of the pandemic. (D) Tajima's D estimates of the three most prevalent haplotypes on each continent for the first year of the pandemic. Box plots represent 500 estimates of Tajima's D from random resamplings of 20 genome sequences for each month with at least 20 sequences. Both the haplotype network and Tajima's D are insightful tools for detecting expanding lineages at a given point in time.
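The resampling scheme in panel D can be illustrated with the standard Tajima's D estimator. The sketch below is written under stated assumptions rather than taken from the authors' code: sequences are aligned, contain no ambiguous bases, and are drawn without replacement (the caption does not say which sampling scheme was used). The function names `tajimas_d` and `resampled_tajimas_d` are hypothetical.

```python
import math
import random
import statistics
from collections import Counter


def tajimas_d(seqs):
    """Tajima's D for equal-length aligned sequences; None if no site varies."""
    n = len(seqs)
    if n < 4:
        raise ValueError("Tajima's D needs at least 4 sequences")
    n_pairs = n * (n - 1) / 2
    seg_sites = 0
    pi = 0.0  # mean number of pairwise differences
    for column in zip(*seqs):
        counts = Counter(column)
        if len(counts) > 1:
            seg_sites += 1
            differing_pairs = (n * n - sum(c * c for c in counts.values())) / 2
            pi += differing_pairs / n_pairs
    if seg_sites == 0:
        return None
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / math.sqrt(variance)


def resampled_tajimas_d(seqs, n_draws=500, sample_size=20, seed=0):
    """Tajima's D on repeated random subsamples (without replacement: an assumption)."""
    if len(seqs) < sample_size:
        return []  # mirrors the caption: months with fewer than 20 sequences are skipped
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_draws):
        d = tajimas_d(rng.sample(seqs, sample_size))
        if d is not None:
            estimates.append(d)
    return estimates


# Toy "expanding lineage": 30 genomes, each carrying one private mutation.
star_like = ["A" * i + "G" + "A" * (29 - i) for i in range(30)]
estimates = resampled_tajimas_d(star_like)
print(len(estimates), statistics.median(estimates))
```

An excess of rare variants, as in this star-like toy sample, pulls D below zero, which is the signature the caption associates with expanding lineages. Applying the function per haplotype and per month would give the distributions summarized by each box plot.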
Figure 5. Recurrent mutations visualized using the haplotype network. Haplotype networks colored according to the presence of specific alleles at genomic positions 14,805 (A), 1,163 (B), 22,992 (C), and 23,063 (D). Node size represents the number of consensus sequences from the first year of the pandemic for each haplotype.
Figure 6. Viral population structure during the first and second waves of the pandemic. Principal Component Analyses (PCA) of the genetic diversity of consensus sequences from the first (A,C) and second (B,D) waves reveal the population structure of the 17 main haplotypes in each wave. Only genetic variation present in at least 10 genomes is used. The PCA is computed with all sequences, and only the sequences from the 17 main haplotypes are projected. Identical sequences are projected onto the same coordinates; dot size is therefore proportional to the number of sequences represented by each point, with added transparency. PC1 and PC2 show differentiation between the main lineages from the two waves (A,B). The variant responsible for the Australian outbreak stands out clearly on PC3 from the first wave (C), and the Lambda variant sequences (XIII) form the most distal subgroup on PC4 and PC5 from the second wave (D), in opposition to sequences from haplotype VI (on PC4) and subgroups of haplotype III (on PC5). The PCA recapitulates insightful characteristics of the evolutionary relationships among sequences and identifies major lineages from the two pandemic waves.
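The steps named in this caption (filter variants, fit the PCA on all sequences, project a subset, collapse identical sequences) can be sketched with NumPy. This is an illustrative reading, not the authors' implementation: it assumes a binary genotype matrix where 1 marks a non-reference allele, and it interprets "present in at least 10 genomes" as a minimum carrier count per site. The function name `pca_project` and its parameters are hypothetical.

```python
import numpy as np


def pca_project(genotypes, projected_rows, min_carriers=10, n_components=5):
    """Fit PCA on all sequences, then return coordinates for selected rows.

    genotypes: (n_sequences, n_sites) array of 0/1 calls (assumed encoding).
    projected_rows: indices of the sequences to report (e.g. main haplotypes).
    """
    genotypes = np.asarray(genotypes, dtype=float)
    keep = genotypes.sum(axis=0) >= min_carriers      # drop rare variants
    filtered = genotypes[:, keep]
    centered = filtered - filtered.mean(axis=0)       # center each site
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    k = min(n_components, len(singular_values))
    coords = centered @ vt[:k].T                      # PC coordinates, all rows
    explained = singular_values[:k] ** 2 / np.sum(singular_values ** 2)
    return coords[projected_rows], explained


# Toy data: two haplotype groups of 12 genomes plus one genome with a private variant.
group_a = np.tile([1, 1, 0, 0], (12, 1))
group_b = np.tile([0, 0, 1, 0], (12, 1))
outlier = np.array([[0, 0, 0, 1]])
genotypes = np.vstack([group_a, group_b, outlier])

coords, explained = pca_project(genotypes, projected_rows=list(range(24)))

# Identical sequences land on identical coordinates; count them to size the dots.
points, counts = np.unique(np.round(coords, 8), axis=0, return_counts=True)
print(points.shape[0], counts, explained[0])
```

The `counts` array is what a plotting call would map to marker size, so that each dot reflects how many identical sequences it represents.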
Figure 7. Mutational landscape of haplotype VIII and its descendant lineages. (A) Haplotype network colored according to the alleles at positions 28881-28883. Two additional low-frequency combinations emerged at this locus, with genotypes AGG and TGG (arrow). (B) PCA generated from sequences of haplotype VIII and its descendants V, X, XI, and XV. PC1, PC2, and PC3, which explain 3.5% of the variation, are plotted on three axes. (C) PCA visualization of 0.6% of the variation within Alpha-annotated sequences, with PC1, PC2, and PC3 plotted on three axes. PC1/2/3 reveal 9 major groups, arbitrarily labeled G1 to G9. (D) Mutational graphs reporting mutations seen in at least 25% of the sequences in each group in C. Bars are colored by substitution type, and the corresponding amino acid changes are shown, as in Figure 3. Genomic position annotation was done using SnpEff (30).