The current coronavirus disease 2019 (COVID-19) pandemic, caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), was initially declared a Public Health Emergency of International Concern in January 2020[1,2]. Despite global efforts to interrupt transmission chains with quarantine, isolation, and travel restrictions at the onset of the pandemic, SARS-CoV-2 spread rapidly across the globe, creating a global pandemic and threat to human health worldwide. By November 15th, 2021, the World Health Organization (WHO) reported 253 million confirmed COVID-19 cases in 222 countries and over 5 million deaths[3,4]. SARS-CoV-2 reached all 50 states of the United States and associated territories, including Puerto Rico, by March 2020, after multiple introductions by travelers with infection[5-7]. The rapid spread across the United States was primarily propelled by interstate transmission chains and air travel to the associated territories[6,8,9].SARS-CoV-2 is an enveloped virus with a single-stranded positive-sense RNA genome of ~30,000 base pairs. During replication, a virus-encoded exonuclease provides a proof-reading activity that contributes to the observed low mutation rate and stable genome[10,11]. Nevertheless, the unprecedented spread of SARS-CoV-2 globally and the wealth of genomic sequence data available through the international initiative for genomic studies and surveillance has facilitated phylodynamic approaches to infer viral evolutionary rate, growth rate, and the estimated time of origin for specific outbreaks[11]. Studies have revealed that the viral genome has been accumulating mutations of concern, especially in the spike protein region, which confer phenotypes with increased fitness and pathogenicity[12-14]. Increased infectivity, resistance to monoclonal antibody therapy and evasion of the immune response were among the most frequently observed phenotypes attributed to WHO-monitored variants; these phenotypes often dominated transmission and replacement of other lineages upon emergence[15-18]. The Variant Being Monitored (VBM) B.1.1.7 (Alpha) was first identified in the United States in late December 2020 and was then characterized by a considerable increase in COVID-19 incidence associated with increased infectivity and occasionally more severe disease manifestations that increased hospitalization rates[19,20]. Alpha became the dominant variant, especially in Europe and the United States, until the emergence of Variant of Concern (VOC) B.1.617.2/AY.x (Delta), first identified in the United States in May 2021, which developed into a prominent variant with an apparent higher virulence and pathogenic phenotype[21-23]. Because of the potential for increased transmissibility, morbidity mortality, and decreased efficacy of vaccines and other intervention strategies, monitoring the spread of variants (VBMs and VOCs) rapidly became a public health concern and priority[24,25].Puerto Rico, an unincorporated territory of the United States, is a densely populated island and a popular tourist destination located in the Caribbean basin. SARS-CoV-2 was first identified in Puerto Rico on March 13th, 2020, in two European travelers who arrived on a cruise ship and in one local resident who had close contact with family members with recent travel history. Additional travel-related and local cases were confirmed within the following weeks[26]. In response to the emerging threat, the government of Puerto Rico executed the most restrictive (compared to the United States) national stay-at-home order on March 15th, 2020, to mitigate transmission while preparing the public health infrastructure for the imminent impact[27,28]. Travel restrictions imposed by the United States during the initial pandemic minimized international traffic to Puerto Rico, although domestic travel from the United States continued. Puerto Rico represents a unique epidemiologic setting in a geographically isolated location (an island), but with a regular influx of travelers mostly from the United States. This is an ideal setting to monitor introduction and spread of SARS-CoV-2 variants and answer questions to help inform SARS-CoV-2 spread and disease prevention strategies. Puerto Rico’s public health response incorporated extensive molecular surveillance to the increased laboratory capacity, which presented a unique opportunity to study the impact of SARS-CoV-2 variant turnover, local dissemination, and evolution during a period of changing epidemiology and public health responses.In response to the impending local epidemic, we established a partnership with the local health authorities and academia to conduct a genomic surveillance initiative to sample complete genomes of SARS-CoV-2 across the island through time, monitor lineage circulation, and understand the genomic epidemiology of the COVID-19 pandemic in Puerto Rico. This report presents the results from 19 months of genomic surveillance and phylogenetic analyses, which identified multiple introduction events that propelled the rapid expansion and persistent transmission of the virus on the island and lead to the establishment of an autochthonous lineage between August 2020 and January 2021.
Methods
Epidemiological data
We retrieved the number COVID-19 cases reported by the Puerto Rico Department of Health (PRDH) from March 2020 to 30 September 2021 from the PRDH database dashboard on 1 December 2021 available here https://covid19datos.salud.gov.pr/. The collection of cases includes cases classified as confirmed (by molecular tests) or probable (by antigen tests) and plotted by date of sample collection.
Patient sample selection
Nasopharyngeal swab samples pre-selected for genomic surveillance were received from COVID-19 passive surveillance conducted by PRDH, the Ponce Health Sciences University (PHSU), and hospital-based acute febrile illness surveillance conducted by the Centers for Disease Control and Prevention (CDC) Dengue Branch. A total of 785 samples were collected from March 2020 to September 30th, 2020, from the seven health regions of the island, including 63 out of the 78 municipalities, and selection criteria included all samples with SARS-CoV-2 detected by reverse-transcriptase polymerase chain reaction (RT-PCR), viral load (CT < 28) and sufficient residual sample volume stored at −80 °C [29]. All samples were de-linked from patient identifiable information and processed under the guidelines approved by the CDC and Ponce School of Medicine institutional review boards (IRB) protocol 6731, which waived the need for informed consent for sequencing of residual samples.
Lineage frequency analysis
The frequency of SARS-CoV-2 lineage detection in Puerto Rico was calculated using the total number of SARS-CoV-2 genomes published in the Global Initiative on Sharing All Influenza Data (GISAID) (https://www.gisaid.org) with collection dates ranging between March 1st, 2020 and September 30th, 2021. All complete genome sequences and metadata were retrieved from the GISAID database as of October 31st, 2021. The dataset was filtered for complete genome data, high-coverage data, and complete collection date for a final dataset of 2514 entries. Lineage assignment on GISAID was determined by the Phylogenetic Assignment of Named Global Outbreak Lineages (Pangolin)[30,31]. R with ggplot package was used to calculate lineage frequency and plot the graph focusing on the following lineages of interest: B.1.1.7 (Alpha), P.1 + P.1.1 (Gamma), B.1.588, Delta (B.1.617.2+AY.x), B.1.427 + B.1.429 (Epsilon), B.1.526 (Iota), B.1.621/1 (Mu), and all other Pangolin-designated B lineages grouped as Other. No genomes collected in May 2020 have been published in GISAID by October 31st, 2021.
Complete genome sequencing and assembly
Complete SARS-CoV-2 genome sequences were generated directly from clinical nasopharyngeal samples. Viral RNA was extracted from viral transport media using the automated MagNA Pure 96 system (Roche) with the MagNA Pure 96 DNA and Viral Nucleic Acid Small Volume Kit (Roche) following manufacturer-recommended protocols for 0.2 mL sample input volume and 0.1 mL RNA elution volume. MP96 external lysis buffer was used to pre-treat the samples for neutralization and assist the lysis process. First strand cDNA was synthesized with random hexamers using SuperScript IV reverse transcriptase (ThermoFisher), and tiling PCR amplicons were generated using Q5® high-fidelity DNA polymerase (New England Biolabs) and the ARTIC nCoV-2019 V3 primer scheme purchased from Integrated DNA Technologies (https://github.com/artic-network/artic-ncov2019/blob/master/primer_scheme/nCoV-2019/V3/nCov-2019.tsv). Candidate samples for sequencing presented clearly visible bands of target size (~400 bp) in DNA gel electrophoresis for both primer pools. PCR products were purified with AMPure XP magnetic beads (Beckman Coulter) and quantified using Qubit 4.0 fluorometer (ThermoFisher). DNA libraries were generated using the NEBNext Ultra II DNA Library Prep Kit for Illumina (New England Biolabs), reducing all reagents volumes to 25% from the manufacturer's recommended protocol to increase throughput. The resulting products were screened for size and quality using the Bioanalyzer 2100 instrument (Agilent Technologies) and quantified with Qubit 4 fluorometer (ThermoFisher). Qualifying libraries were pooled and run in the MiSeq sequencer instrument (Illumina) using the MiSeq Reagent Kit v3 in 600-cycle program.The resulting sequence reads were screened for quality, trimmed, and assembled into complete consensus SARS-CoV-2 genomes using the Genome Detective Virus Tool v1.136[32] (https://www.genomedetective.com) and assembly confirmed with iVar[33]. The Pangolin COVID-19 Lineage Assigner tool was used for lineage assignment[34] (https://pangolin.cog-uk.io). A total of 753 samples were sequenced with more than 95% genome coverage at a minimum of 10x sequence depth. All sequence data obtained for this study was submitted to GISAID, accession numbers available in Supplementary Data 1.
Phylogenetic analysis
Our Puerto Rico SARS-CoV-2 genomes dataset was analyzed against a diverse panel of genomes from across the world which provide regional phylogenetic context. Initially, we downloaded the Genomic Epidemiology metadata package for all entries from GISAID on August 18th, 2021 to screen genomes for subsampling. However, due to the large number of genomes available in GISAID, we downloaded and combined the following pre-sampled datasets for regional studies: NextRegion-North America, NextRegion-South America, and NextRegion-Global. We then used the standard ncov augur/auspice multiple input workflow available in the Nextstrain platform[35] (https://github.com/nextstrain/ncov) to subsample contextual genomes and phylogenetic inference with time-stamped trees. The custom subsampling scheme program selected 2611 contextual genomes from the United States, North America, the Caribbean, Central America, South America, Africa, Europe, Asia, and Oceania, with higher proportions from The Americas and selected based on collection dates and genetic proximity to our Puerto Rico dataset. The combined dataset of 3364 genomes was aligned using MAFFT[36] and a global maximum likelihood (ML) phylogenetic inference was reconstructed with IQ-TREE[37]. The ncov workflow then transferred the ML tree to TreeTime[38] for time calibration and ancestral state reconstruction of the tree topology at constate rate of 8 × 10−4 nucleotide substitutions per site per year. The resulting global ML tree was visualized with Nextstrain auspice[35] and annotated with iTol for region of origin and emerging variants[39]. Subsampling from the Genomic Epidemiology metadata package retrieved in August and from the combined NextRegions produced phylogenetic inference trees with similar topologies. A list of all the sequences used in this study, including sequence labels and authors can be found in the “Data availability” section.Selected lineages of interest were studied further by reconstruction of phylogenetic focus trees. For the B.1.588 lineage-focused tree, we selected all B.1.588 genomes published in GISAID by October 31st, 2021. Contextual B.1 lineage genomes were selected based on phylogenetic clustering near the base of the B.1.588 clade in the global tree and by temporal proximity to the date range of B.1.588 circulation between June 2020 and January 2021. Maximum likelihood phylogenetic trees were reconstructed with the resulting dataset of 239 genomes under the GTR + G + I nucleotide substitution model and 1000 bootstrap replicates using IQ-TREE v1.6.12[37]. The resulting tree topology and node support were compared to Bayesian maximum clade credibility (MCC) tree reconstruction using BEAST v1.10.4[40]. Briefly, we used time-calibrated genomes trimmed to coding sequence with sample collection dates and the nucleotide substitution model parametrized using Yang96 model under strict molecular clock, and Bayesian Skyline coalescent model. Markov chains were run for a total of 100 million steps with sampling every 10,000 steps in the chain. Run results were evaluated in Tracer (http://tree.bio.ed.ac.uk/software/tracer/) to ensure stationary parameters with statistical errors reflected in 95% highest probability density ranges with effective sample size (ESS) higher than 200 for each tree prior. MCC trees were generated in TreeAnnotator from BEAST package after discarding 10% of runs as burn-in. The resulting ML and MCC trees were visualized in FigTree v1.4.4 (http://tree.bio.ed.ac.uk/software/figtree).For the B.1.1.7 (Alpha VBM) lineage-focused tree, we selected all B.1.1.7 + Q.x designated genomes from the global tree and supplemented the dataset with additional B.1.1.7 + Q.x genomes from Puerto Rico and the United States to understand the lineage emergence and spread in the island. A custom subsampling scheme was selected on the Nextstrain ncov workflow to select genomes from samples collected between November 1st, 2020 and February 28th, 2021. Maximum likelihood phylogenetic trees were reconstructed with the resulting dataset of 729 genomes under the GTR + G + I nucleotide substitution model and 1000 bootstrap replicates using IQ-TREE v1.6.12[37]. The resulting ML tree was visualized in FigTree v1.4.4 (http://tree.bio.ed.ac.uk/software/figtree).For the B.1.617.2 (Delta VOC) lineage-focused tree, we selected all B.1.617.2 + AY.x designated genomes from the global tree and supplemented the dataset with additional B.1.617.2 + AY.x genomes from Puerto Rico to understand lineage spread and sub-lineage clustering patterns across the island. The genome selector Python script designed by Anderson Brito (https://github.com/andersonbrito/ncov) was used to select additional Delta-designated genomes from Puerto Rico with collection dates range from June 1st, 2021 to September 30th, 2021. Maximum likelihood phylogenetic trees were reconstructed with the resulting dataset of 815 genomes under the GTR + G + I nucleotide substitution model and 1000 bootstrap replicates using IQ-TREE v1.6.12[37]. The resulting ML tree was visualized in FigTree v1.4.4 (http://tree.bio.ed.ac.uk/software/figtree).The date of the most recent common ancestor (tMRCA) determined by the Nextstrain ML phylogenetic inference was confirmed with Bayesian coalescent analyses for B.1.588, Alpha VBM and Delta VOC lineage trees. Due to the large size of the datasets, each focus tree was reduced by tree-pruning to datasets with <150 genomes. tMRCA analyses were performed with BEAST v1.10.4 under a strict molecular clock fixed at 8 × 10-4 substitutions per site per year and 150 million Markov chains sampling every 10,000 steps. Results were evaluated in Tracer (http://tree.bio.ed.ac.uk/software/tracer/) for convergence and ESS >200.
Authors: Yatish Turakhia; Nicola De Maio; Bryan Thornlow; Landen Gozashti; Robert Lanfear; Conor R Walker; Angie S Hinrichs; Jason D Fernandes; Rui Borges; Greg Slodkowicz; Lukas Weilguny; David Haussler; Nick Goldman; Russell Corbett-Detig Journal: PLoS Genet Date: 2020-11-18 Impact factor: 5.917
Authors: Ekaterina Minskaia; Tobias Hertzig; Alexander E Gorbalenya; Valérie Campanacci; Christian Cambillau; Bruno Canard; John Ziebuhr Journal: Proc Natl Acad Sci U S A Date: 2006-03-20 Impact factor: 11.205
Authors: James Hadfield; Colin Megill; Sidney M Bell; John Huddleston; Barney Potter; Charlton Callender; Pavel Sagulenko; Trevor Bedford; Richard A Neher Journal: Bioinformatics Date: 2018-12-01 Impact factor: 6.931
Authors: Diana M Tordoff; Alexander L Greninger; Pavitra Roychoudhury; Lasata Shrestha; Hong Xie; Keith R Jerome; Nathan Breit; Meei-Li Huang; Mike Famulare; Joshua T Herbeck Journal: Lancet Reg Health Am Date: 2021-07-13
Authors: Bette Korber; Will M Fischer; Sandrasegaram Gnanakaran; Hyejin Yoon; James Theiler; Werner Abfalterer; Nick Hengartner; Elena E Giorgi; Tanmoy Bhattacharya; Brian Foley; Kathryn M Hastie; Matthew D Parker; David G Partridge; Cariad M Evans; Timothy M Freeman; Thushan I de Silva; Charlene McDanal; Lautaro G Perez; Haili Tang; Alex Moon-Walker; Sean P Whelan; Celia C LaBranche; Erica O Saphire; David C Montefiori Journal: Cell Date: 2020-07-03 Impact factor: 66.850