Sequencing identifies multiple early introductions of SARS-CoV-2 to the New York City region.

Matthew T Maurano^1,2, Sitharam Ramaswami³, Paul Zappile³, Dacia Dimartino³, Ludovic Boytard⁴, André M Ribeiro-Dos-Santos^1,2, Nicholas A Vulpescu^1,2, Gael Westby³, Guomiao Shen², Xiaojun Feng², Megan S Hogan^1,2, Manon Ragonnet-Cronin⁵, Lily Geidelberg⁵, Christian Marier³, Peter Meyn³, Yutong Zhang³, John Cadley^1,2, Raquel Ordoñez^1,2, Raven Luther^1,2, Emily Huang^1,2, Emily Guzman³, Carolina Arguelles-Grande⁴, Kimon V Argyropoulos², Margaret Black², Antonio Serrano², Melissa E Call⁶, Min Jae Kim⁶, Brendan Belovarac², Tatyana Gindin², Andrew Lytle², Jared Pinnell², Theodore Vougiouklakis², John Chen⁷, Lawrence H Lin², Amy Rapkiewicz², Vanessa Raabe⁸, Marie I Samanovic⁸, George Jour^2,6, Iman Osman^4,6, Maria Aguero-Rosenfeld², Mark J Mulligan⁸, Erik M Volz⁵, Paolo Cotzia^2,4, Matija Snuderl², Adriana Heguy^2,3.

Abstract

Effective public response to a pandemic relies upon accurate measurement of the extent and dynamics of an outbreak. Viral genome sequencing has emerged as a powerful approach to link seemingly unrelated cases, and large-scale sequencing surveillance can inform on critical epidemiological parameters. Here, we report the analysis of 864 SARS-CoV-2 sequences from cases in the New York City metropolitan area during the COVID-19 outbreak in spring 2020. The majority of cases had no recent travel history or known exposure, and genetically linked cases were spread throughout the region. Comparison to global viral sequences showed that early transmission was most linked to cases from Europe. Our data are consistent with numerous seeds from multiple sources and a prolonged period of unrecognized community spreading. This work highlights the complementary role of genomic surveillance in addition to traditional epidemiological indicators.

Year: 2020 PMID: 33093069 PMCID: PMC7706732 DOI: 10.1101/gr.266676.120

Source DB: PubMed Journal: Genome Res ISSN: 1088-9051 Impact factor: 9.438

In December 2019, the novel pneumonia COVID-19 emerged in the city of Wuhan, in Hubei Province, China. Shotgun metagenomics rapidly identified the new pathogen as SARS-CoV-2, a betacoronavirus related to the etiological agent of the 2002 SARS outbreak (SARS-CoV), and of possible bat origin (Andersen et al. 2020; Zhou et al. 2020). Building on infrastructure from past outbreaks (Carroll et al. 2015; Park et al. 2015), genomic epidemiology has been applied to track the worldwide spread of SARS-CoV-2 using mutations in viral genomes to link otherwise unrelated infections (Grubaugh et al. 2019; Zhang and Holmes 2020). Collaborative development of targeted sequencing protocols (Quick et al. 2017; Tyson et al. 2020), open sharing of sequences through the Global Initiative on Sharing All Influenza Data (GISAID) repository (Shu and McCauley 2017), and rapid analysis and visualization of viral phylogenies using Nextstrain (Hadfield et al. 2018) have provided unprecedented and timely insights into the spread of the pandemic. Notably, community transmission was identified using surveillance sequencing in the Seattle area in time to implement preventative measures (Bedford et al. 2020; Worobey et al. 2020). The New York City metropolitan region rapidly became an epicenter of the pandemic following the identification of the first community-acquired case on March 3, 2020 (a resident of New Rochelle in nearby Westchester County who worked in Manhattan). As of May 10, 2020, New York State had 337,055 cases: the highest in the United States and 8% of the worldwide total. Fully 55% of New York State cases lay within the five boroughs of New York City (185,357 cases), followed by the Nassau and Suffolk counties to the east on Long Island (75,248 cases) (NYS Department of Health 2020). The outlying boroughs and suburban counties reported markedly higher infection rates than Manhattan. 
The outbreak overlaps with the catchment area of the NYU Langone Health (NYULH) hospital system, including hospitals on the east side of Manhattan (Tisch/Kimmel), Brooklyn (formerly Lutheran Hospital), and Nassau County (Winthrop). Because even early COVID-19 cases presented mostly without travel history to countries with existing outbreaks, determining the extent of asymptomatic community spread and transmission paths became paramount. In parallel with increased clinical capacity for diagnostic PCR-based testing, we sought to trace the origin of NYULH-treated COVID-19 cases using phylogenetic analysis to compare them to previously deposited COVID-19 viral sequences. We further aimed to develop an approach to integrate sequencing as a complementary epidemiological indicator of outbreak trajectory.

Results

To assess the spread of SARS-CoV-2 within the NYULH COVID-19 inpatient and outpatient population, we deployed and optimized a viral sequencing, quality control (QC), and analysis pipeline by repurposing existing genomics infrastructure. Samples from unique individuals were selected for sequencing from those confirmed positive between March 12 and May 10, 2020. During this period, positive tests within the NYULH system mirrored those of New York City and nearby counties (Supplemental Fig. S1; Petrilli et al. 2020). Illumina RNA-seq libraries were generated using a ribodepletion strategy starting from total RNA from nasopharyngeal swabs. Hybridization capture with custom biotinylated baits targeting the full SARS-CoV-2 sequence was used to enrich RNA-seq libraries before sequencing (Methods; Supplemental Fig. S2). Of 1107 libraries generated and sequenced, fully 78% yielded a sequence passing QC (see Methods). Pass rates were lower for samples with qRT-PCR Ct values greater than 30 (Supplemental Fig. S3A,B). We observed that high-quality sequences could be generated directly from shotgun libraries for qPCR Ct values less than 30, thereby simplifying pooling and logistical constraints by skipping the capture step. Up to 23 samples were multiplexed in a single capture pool (Supplemental Fig. S3C,D). Samples with similar Ct values were grouped to minimize the range of target cDNA representation across a single capture pool (Supplemental Fig. S3E,F). Our pipeline was verified using a positive control synthetic RNA spiked in to total human RNA, as well as PCR negative and no-sample controls (Supplemental Table S1). This resulted in 864 sequences passing QC, comprising 10% of COVID-19-positive cases in NYULH over that time period (Supplemental Fig. S1; Supplemental Table S2). The cohort of 864 sequenced cases included a range of ages (Fig. 1A). 
Cases originated throughout the NYULH system, which comprises hospitals in the New York City boroughs of Manhattan and Brooklyn, as well as Nassau County, a suburb to the east of the city on Long Island (Fig. 1B). Sixty-six percent of cases resided within New York City; 86%, within New York State (Fig. 1C). Analysis of residential ZIP codes showed that cases reflected the hospital catchment area within the New York metropolitan region (Fig. 1D). Our data set included few cases from Westchester County to the north of the city, where the earliest detected regional outbreak was concentrated, as it is outside of the NYULH catchment area.
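The grouping of libraries into capture pools by similar Ct value, with up to 23 libraries per pool, can be illustrated with a minimal sketch. The sort-and-chunk heuristic, the function name `pool_by_ct`, and the example Ct values are assumptions made for illustration; the text does not state how pools were actually assembled.

```python
def pool_by_ct(samples, max_pool_size=23):
    """Group (sample_id, ct) pairs into capture pools of similar Ct.

    Sorting by Ct and cutting the sorted list into consecutive chunks keeps
    the Ct range inside each pool small. This is an illustrative heuristic,
    not the authors' documented procedure.
    """
    ordered = sorted(samples, key=lambda s: s[1])
    return [ordered[i:i + max_pool_size]
            for i in range(0, len(ordered), max_pool_size)]


# Hypothetical Ct values for 50 libraries (not data from the study)
samples = [(f"S{i:02d}", 18 + (i * 7) % 20) for i in range(50)]
pools = pool_by_ct(samples)
for n, pool in enumerate(pools, 1):
    cts = [ct for _, ct in pool]
    print(f"pool {n}: {len(pool)} libraries, Ct {min(cts)}-{max(cts)}")
```

Keeping the Ct range narrow within a pool is one way to limit how much a high-titer library can dominate the reads recovered from a shared capture reaction.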

Figure 1.

Demographic parameters of sequenced SARS-CoV-2 cases in the NYULH system. Cases are broken down as follows: (A) Age and sex; (B) collecting hospital; (C) residential location, grouped by borough and outlying counties; “Other” includes counties with few cases. (D) Localization of case residences within the New York City region. The color scale indicates number of cases per ZIP code. Collecting hospitals are labeled with rounded boxes. (E) Potential exposure status, categorized by occupation as healthcare worker, travel history, and contact with a COVID-19-positive individual. The pie chart depicts the geographical destination of the potential travel-related exposures. (F) Potential exposure status by collection date.

We compiled a database for 820 of these cases from electronic medical records, including potential exposures from health care worker status, travel history, and close contact with a COVID-19-positive individual (Methods). We found no recorded potential exposures for 43% of cases (Fig. 1E). Multiple potential exposures were less common: 113 cases were both health care workers and noted a COVID-19 contact, and three health care workers had travel history. Travel history was present for only 17 cases (2%), and all of these cases but one were collected in March (Fig. 1F). Of the 14 cases in which travel destination information was available, nine destinations were within the United States, four were in Europe, and one was in South Asia. This assessment relies upon clinical notes taken during a period in which clinical capacity was stretched and thus likely underestimates potential exposures. Conversely, the potential exposure may have been coincidental given the uncontrolled community spread at the time.

We inferred a maximum likelihood phylogeny to assess relatedness among cases (Fig. 2). Coloring cases by county of residence within the New York region showed identical or related viral sequences present across multiple counties from the onset of our sampling (Fig. 2).
We detected 890 nucleotide and 547 amino acid mutations across all cases (Supplemental Fig. S4). Mutation of D614G in the spike protein, which has been suggested to affect transmission or virulence (Zhang et al. 2020), was present in >95% of sequences. Functional analysis will be required to determine whether functional changes can be ascribed to any of these mutations and what role mutations might play in shaping the ongoing pandemic.

Figure 2.

Phylogenetic relationship of regional viral sequences. Maximum likelihood phylogeny inferred from 864 cases. Nodes with bootstrap support values above 75 are colored. Inner rings indicate groups of clade-defining mutations. Outer ring indicates county of residence. Scale bar, nucleotide substitutions per site.

We then assessed the relatedness of our cases to 5004 sequences from across the world from the GISAID EpiCov repository (Supplemental Fig. S5; Supplemental Table S3). A maximum likelihood tree showed that cases from the New York region had broader diversity than that initially reported in Seattle (Bedford et al. 2020), the only other U.S. region with a comparable number of viral sequences (Supplemental Fig. S6). To investigate the timing of introductions to New York City, we inferred a rooted timescaled phylogeny (Fig. 3A; Supplemental Fig. S7A). Analysis of our cases within this phylogeny identified 109 genotypes introduced to the northeast United States (Fig. 3B; Supplemental Table S4). Identification of source nodes ancestral to at least one sequence from outside the northeast United States in addition to these transmission chains placed most introductions broadly in late February and early March, slightly earlier than the first detected transmissions within New York City (Fig. 3C; Supplemental Fig. S7B). The timing of these introductions did not differ substantially under alternative nucleotide substitution models or rates (Supplemental Fig. S7C). The number of samples in each transmission chain varied widely, and two early transmission chains each comprised over 300 cases. Only a minority of transmission chains included samples from Asia, whereas samples from Europe and the rest of the United States were well represented (Fig. 3D).

Figure 3.

Timescaled phylogeny showing global sequence context. (A) Colored edges highlight transmission chains. Black squares indicate source nodes; dots, detected presence in the northeast United States. (B) Schematic of approach to infer introductions and transmission chains. (C,D) Transmission chains in the New York City region ordered by inferred divergence date from source. (C) Dates estimated for source transmission (orange) and earliest detected local transmission (purple) inferred from sequenced cases; lines represent 90% confidence intervals. Point size corresponds to the number of strains under source and all transmission chains. (D) Representation of global regions in each source transmission. Bar at top shows overall representation of regions in the phylogeny.

To assess the ongoing trajectory of the outbreak, we applied phylodynamic analysis to estimate viral effective population size from a subsample of sequences (Methods) (Pybus and Rambaut 2009). Under moderate assumptions, effective population size will be proportional to epidemic prevalence, and growth rates of effective population size will correspond to epidemic growth (Volz et al. 2013). This analysis identified a period of rapid growth, followed by return nearly to the start point (Fig. 4A,B). We estimate that the peak effective population size occurred on March 29 (95% CI: March 19–April 5). The growth rate decreased steadily after March 1 and was negative with high confidence by mid-April (Fig. 4C), consistent with the epidemic curve of confirmed infections in the New York City region (Supplemental Fig. S1A).

Figure 4.

Phylodynamic analysis of outbreak trajectory. (A) Timeline of New York City outbreak, highlighting (i) announcement of first community-acquired case (March 3); (ii) ban on gatherings exceeding 500 people (March 12); (iii) closure of schools, restaurants, bars, and other venues (March 16); (iv) closure of nonessential businesses (March 22). (B,C) Outbreak trajectory estimated from genetic data showing effective population size relative to March 1 (B) and growth rate of effective population size (C; units of 1/yr). Shaded regions represent 95% credible interval.

Discussion

Our work documents the genomic epidemiology of the COVID-19 outbreak in the New York City region in the spring of 2020. Analysis of the genetic data suggests that the New York outbreak was seeded by mid-February, largely by way of Europe, which can be placed within the context of reduced travel flows from Asia to the United States, the earlier spread of the pandemic from Asia to Europe, and the low overall prevalence in the rest of the United States. Several other reports of the initial stages of the New York City region outbreak have identified early community spread on a similar time frame (Davis et al. 2020; Fauver et al. 2020; Gonzalez-Reiche et al. 2020). Although the low rate of travel history among our cases could reflect incomplete ascertainment of potential exposures, the extent of uncontrolled community spread likely reduces the representation of index travel cases in our data set. Indeed, the ability to track past transmissions is a key advantage of a genetic approach in the face of inadequate testing. It is important to caution that fine-scale delineation of individual introductions and transmissions through genomic epidemiology is limited by viral mutation rate, incomplete sampling, and incomplete availability of exposure history (Villabona-Arenas et al. 2020). In particular, many early sequences show identical genotypes, which could be consistent with additional transmission events, possibly by way of unsampled regions. Although our estimate of 109 introductions is thus likely to underestimate the total number of introductions, the genomic data are sufficiently informative to outline an unrecognized early spread in February that enabled rapid development of the outbreak in March. Further analysis (Worobey et al. 2020) and sequencing of archival samples will be needed to refine assessments of the initial spread.
Our demonstration of rapid sample processing, deposition, and analysis underscores the potential for genomic epidemiology to provide an independent estimate of disease transmission, as well as its potential to recognize impending resurgence of a regional outbreak. Further surveillance by medical centers, regional public health departments, and national efforts will be needed to monitor genomic epidemiology, pandemic spread, and public responses (Supplemental Fig. S5). Given the logistical, regulatory, and methodological challenges to establishing such surveillance during an outbreak, it is critical to have this infrastructure already in place (Kim et al. 2020) for future waves of COVID-19 or other future pandemics.

Methods

Bioethics statement

The collection of COVID-19 human biospecimens for research has been approved by NYULH institutional review board under S16-00122, Universal Mechanism of Human Bio-Specimen Collection and Storage for Research. The approved IRB protocol allows for the collection and analysis of clinical, travel, exposure, and demographic data (Osman et al. 2020). Electronic medical records were reviewed to compile a clinical database for 820 cases listing health care worker status, travel history, and close contact with a known COVID-19 case. For cases in which a given exposure was not directly stated in the clinical record, we recorded that field as missing data but included other exposures in our analysis. A summary field of exposure history per case was generated from the presence of a COVID-19 contact, travel history, or health care worker status, in that order.
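The summary exposure field described above assigns one label per case by checking the three exposures in a fixed order. A minimal sketch of that precedence rule follows; the function name, the label strings, and the use of `None` for fields missing from the clinical record are all illustrative assumptions, not the authors' code.

```python
from typing import Optional


def summarize_exposure(covid_contact: Optional[bool],
                       travel: Optional[bool],
                       health_care_worker: Optional[bool]) -> str:
    """Return one summary label, checking exposures in the stated order.

    None stands for a field that is missing from the clinical record; a
    missing field is skipped, and the remaining exposures are still used.
    """
    if covid_contact:
        return "COVID-19 contact"
    if travel:
        return "Travel history"
    if health_care_worker:
        return "Health care worker"
    return "None recorded"


# A health care worker who also reported a COVID-19 contact
print(summarize_exposure(True, None, True))
# Contact field missing, travel recorded
print(summarize_exposure(None, True, False))
```

Under this rule a case with several recorded exposures is counted once, under the first category that applies.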

Sample collection

All samples were collected as part of clinical diagnostics. Nasopharyngeal swabs were collected and placed in 3 mL of viral transport medium (VTM; Copan universal transport medium) following clinical protocols. Samples were transported to the clinical microbiology laboratory at room temperature and tested for SARS-CoV-2 the same day. Remnant samples were stored at −70°C.

Clinical testing

All initial detections of COVID-19 cases were performed as part of the clinical care. Clinical testing was performed using the following three FDA emergency use authorization (EUA)–approved COVID-19 PCR-based tests:
1. NYULH-validated PCR test using the U.S. CDC primer design, targeting three regions of the virus nucleocapsid (N) gene, and an internal control primer targeting the human RNase P (RP) gene (https://www.cdc.gov/coronavirus/2019-ncov/lab/rt-pcr-panel-primer-probes.html) with PCR performed on an ABI7500 Dx system. The limit of detection is 10,000 copies per milliliter.
2. The Roche Cobas 6800 RT-PCR platform targeting the Orf1/a and E sequences, per the manufacturer's instructions. The limit of detection is 180 copies per milliliter.
3. The Cepheid Xpert Xpress RT-PCR platform targeting the N2 and E viral sequences, per the manufacturer's instructions. The limit of detection is 250 copies per milliliter.

RNA extraction

RNA extraction was performed using two platforms for parallel sample processing:
1. By using the Maxwell RSC instrument (Promega AS4500), total RNA was extracted from 300 µL of VTM with the buccal swab DNA kit (Promega AS1640). The following modifications were introduced to extract total RNA as opposed to total nucleic acids: Samples were incubated for 30 min at 65°C for Proteinase K digestion and virus deactivation, and DNase I (Promega) was added to the reagents cartridge to remove genomic DNA during nucleic acids extraction. Total RNA was eluted in 50 µL of nuclease-free water.
2. By using the KingFisher flex system (Thermo Fisher Scientific), RNA was extracted from heat-inactivated nasopharyngeal swab samples in batches of 96 samples, following the manufacturer's instructions and the MagMax mirVana total RNA isolation kit (Thermo Fisher Scientific A27828). Briefly, 250 µL of nasopharyngeal swab collection was lysed in lysis buffer and β-mercaptoethanol and subsequently bound to magnetic beads and loaded into the KingFisher flex instrument. A DNase I treatment step was performed as part of the instrument protocol, and RNA samples were eluted in 50 µL of elution buffer and immediately stored at −80°C.

Library preparation and sequencing

Illumina sequencing libraries were prepared from 10 µL of total RNA. Two ribodepletion methods for cDNA RNA-seq library preps were used:
1. KAPA RNA HyperPrep kit with RiboErase (HMR; Roche Kapa KK8561). We followed the manufacturer's protocol with the following modifications: For the adapter ligation step, we prepared a plate of barcoded adapters (IDT) at a concentration of 500 nM and performed 15 cycles of PCR amplification of the final library.
2. Nugen trio with human rRNA depletion (Tecan Genomics 0606-96), including DNase I treatment, cDNA synthesis, single primer isothermal amplification (SPIA), enzymatic fragmentation, library construction, final PCR amplification (12–16 cycles), and an AnyDeplete step to remove host rRNA transcripts. An automated protocol was implemented on a Biomek FXP liquid handler integrated with a Biometra TRobot 96-well thermal cycler (Beckman Coulter).

Purified libraries were quantified using qPCR (Kapa Biosystems KK4824). Library size distribution was checked using an Agilent TapeStation 2200.

Libraries presumed more suitable for capture (generally, qPCR Ct value greater than 30) were enriched for SARS-CoV-2 genomic sequences using custom biotinylated DNA probe pools either from Twist Biosciences or from Integrated DNA Technologies:
1. For capture using the IDT xGen COVID capture panel (Integrated DNA Technologies 10006764), we followed the manufacturer's protocol. Briefly, hybridization of 500 ng–1 µg of combined library DNA with 4 µL of xGen Lockdown probes was performed for 4–16 h at 65°C, followed by PCR amplification for six to 10 cycles.
2. For capture using the Twist Bioscience custom panel (Twist Design ID: TE-95888003, generously shared by the Seattle Flu Study), we followed the manufacturer's protocol using the Twist hybridization and wash kit (Twist Biosciences 101025). Hybridization of 1–2 µg combined library DNA was performed for 16–20 h at 70°C. Postcapture PCR amplification cycles ranged from 12 to 14 cycles.

In general, we pooled samples with similar Ct values and accounted for variations in parent library concentration, multiplexing up to 23 libraries per reaction. Positive and negative control samples are described in Supplemental Table S1.

Samples were sequenced as paired-end 100- or 150-cycle reads on the NextSeq 500 or NovaSeq 6000 (using SP or S1 flow cells). All flow cells were loaded such that indexing barcode sequences for multiplexed samples differed by ≥3 bp.
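The requirement that indexing barcodes on a flow cell differ by at least 3 bp can be checked with a pairwise Hamming distance. The sketch below is one way to do that; the function names and the example barcodes are hypothetical, and equal-length barcodes are assumed.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length barcodes."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def conflicting_barcodes(barcodes, min_distance=3):
    """Return every barcode pair that differs at fewer than min_distance positions."""
    return [(a, b) for a, b in combinations(barcodes, 2)
            if hamming(a, b) < min_distance]


# Hypothetical 8-bp index sequences (not the barcodes used in the study)
barcodes = ["ACGTACGT", "ACGTACGA", "TTGCAAGC"]
print(conflicting_barcodes(barcodes))
```

An empty result means the set satisfies the ≥3 bp criterion; any returned pair would need to be moved to a different flow cell or lane.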

Sequence read processing

Reads were demultiplexed with Illumina bcl2fastq v2.20, requiring a perfect match to indexing barcode sequences. All RNA-seq and Capture-seq data were processed using a uniform mapping pipeline. Illumina sequencing adapters were trimmed with Trimmomatic v0.39 (Bolger et al. 2014). Reads were aligned using BWA v0.7.17 (Li and Durbin 2009) to a custom index containing human genome reference (GRCh38/hg38), including unscaffolded contigs and alternate references plus the reference SARS-CoV-2 genome (NC_045512.2, wuhCor1). Presumed PCR duplicates were marked using SAMBLASTER v0.1.24 (Faust and Hall 2014). Only sequences with >23,000 bp of sequence with ≥20× coverage depth were analyzed, resulting in 864 final sequences (Supplemental Table S2). Variants were called across all samples using BCFtools v1.9 (Li et al. 2009):

    bcftools mpileup --redo-BAQ --adjust-MQ 50 --gap-frac 0.05 --max-depth 10000 --max-idepth 200000 --output-type u | bcftools call --ploidy 1 --keep-alts --multiallelic-caller -f GQ

Raw pileups were filtered using

    bcftools norm --check-ref w --output-type u | bcftools filter -i "INFO/DP>=10 & QUAL>=10 & GQ>=99 & FORMAT/DP>=10" --SnpGap 3 --IndelGap 10 --set-GTs . --output-type u | bcftools view -i 'GT="alt"' --trim-alt-alleles

Viral sequences were generated by applying VCF files to the reference sequence using `bcftools consensus` with -m to mask sites below 20× with Ns, and -m N to mask sites of ambiguous genotypes with N.
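The coverage-based QC criterion (more than 23,000 bp covered at ≥20×) and the masking of low-coverage sites with N can be sketched in a few lines. This is an illustration only: it assumes a per-position depth list in reference coordinates, and the function names are hypothetical rather than part of the authors' pipeline, which used `bcftools consensus` for masking.

```python
def passes_qc(depths, min_depth=20, min_covered=23000):
    """True if more than min_covered positions have depth >= min_depth."""
    covered = sum(1 for d in depths if d >= min_depth)
    return covered > min_covered


def mask_low_coverage(sequence, depths, min_depth=20):
    """Replace bases at positions with depth below min_depth by N."""
    return "".join(base if d >= min_depth else "N"
                   for base, d in zip(sequence, depths))


# Toy depth profile over the 29,903-bp reference (NC_045512.2):
# 23,001 well-covered positions followed by a low-coverage stretch.
depths = [25] * 23001 + [5] * 6902
print(passes_qc(depths))
print(mask_low_coverage("ACGT", [30, 10, 20, 0]))
```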

Geoplotting

The regional case heat map was generated using R v3.6.2 (R Core Team 2020), using the packages ggplot2 v3.3.0 (Wickham 2016) for plotting and sf v0.8 for geospatial data manipulation. Maps were generated based on the 2018 ZIP code tabulated area geographical boundaries obtained from the U.S. Census Bureau (United States Census Bureau 2018).

Phylogenetic analysis

Sequences for non-NYULH cases were downloaded from GISAID EpiCov on June 14, 2020, and filtered to sequences collected on or before May 10, 2020. Sequences from non-human hosts, annotated by Nextstrain as duplicate individuals or highly divergent, with fewer than 27,000 nonambiguous nucleotides or with improperly formatted dates or location were excluded. Sequences from outside New York State were subsampled to a maximum of 20 samples per admin division (United States) or country (outside United States) per month, prioritizing sequences most similar to the focal set of 864 NYULH samples. This priority was penalized if many non-US samples were most similar to the same U.S. sample, and mutations were weighted 333× more heavily than masked sites. Global sequences were then combined with the sequences from this study. Sequences were analyzed using the augur v7.0.2 pipeline (Hadfield et al. 2018). Sequences were aligned along with the reference genome using MAFFT v7.453 (Katoh and Standley 2013), and the resulting alignment was masked to remove 100 bp from the beginning, 50 from the end, and uninformative point mutations (positions 11083, 13402, 21575, 24389, 24390). Maximum likelihood phylogenetic reconstruction was performed with IQ-TREE v1.6.12 (Nguyen et al. 2015) using a GTR substitution model and the -czb option. Support values were generated with the ultrafast bootstrapping option with 1000 replicates. This tree was used to tabulate nucleotide and amino acid changes specific to lineages and cases; gaps with respect to the reference were reported as deletions. TreeTime v0.7.4 (Sagulenko et al. 2018) was used to generate a timetree rooted at the reference sequence, using the --keep-polytomies option, and under a strict mutational clock under a skyline coalescent prior with a rate of 8 × 10^-4 mutations per site per year and a standard deviation of 4 × 10^-4.
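The alignment masking step described above (100 bp at the start, 50 bp at the end, and five listed positions) can be illustrated as follows. This sketch assumes each aligned row is in reference coordinates with 1-based site positions, and that masking means replacing a base with N; the function name is hypothetical, and the authors performed this step within the augur pipeline.

```python
# Site positions taken from the text, interpreted as 1-based reference coordinates
MASKED_SITES = (11083, 13402, 21575, 24389, 24390)


def mask_alignment_row(seq, mask_start=100, mask_end=50,
                       sites=MASKED_SITES, mask_char="N"):
    """Mask the alignment ends and selected sites of one aligned sequence."""
    chars = list(seq)
    n = len(chars)
    for i in range(min(mask_start, n)):          # first mask_start columns
        chars[i] = mask_char
    for i in range(max(n - mask_end, 0), n):     # last mask_end columns
        chars[i] = mask_char
    for pos in sites:                            # individual 1-based sites
        if 1 <= pos <= n:
            chars[pos - 1] = mask_char
    return "".join(chars)


# Toy row with the length of the reference genome (29,903 bp)
row = "A" * 29903
masked = mask_alignment_row(row)
print(masked.count("N"))
```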
For each NYULH case, the inferred earliest New York City transmission was identified as the most ancestral node or tip with >70% of sequences originating in the Northeast (defined as the states of New York, Connecticut, New Jersey, Pennsylvania) on the timescaled phylogeny using the ape (Paradis and Schliep 2019) and phangorn (Schliep 2011) R packages. The transmission source was identified as the first ancestral node defined by a unique mutation and ancestral to a sequence originating outside the Northeast. Transmissions with identical source nodes were grouped to yield transmission chains. Trees were plotted with the tidygraph and ggraph R packages.
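The rule for locating the earliest regional transmission of a case (the most ancestral node whose descendant sequences are more than 70% from the Northeast) can be sketched on a toy tree. The authors used the ape and phangorn R packages; the Python below is a simplified illustration with a hypothetical tree, hypothetical tip names, and one assumed reading of the rule, namely that the qualifying node closest to the root on the tip-to-root path is chosen.

```python
NORTHEAST = {"New York", "Connecticut", "New Jersey", "Pennsylvania"}


def descendant_tips(node, children):
    """All tips below a node (a tip returns itself)."""
    kids = children.get(node, [])
    if not kids:
        return [node]
    tips = []
    for kid in kids:
        tips.extend(descendant_tips(kid, children))
    return tips


def earliest_regional_node(tip, parent, children, tip_location, threshold=0.7):
    """Most ancestral node on the tip-to-root path with >threshold Northeast tips."""
    path = [tip]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    best = None
    for node in path:  # ordered from tip to root
        tips = descendant_tips(node, children)
        frac = sum(tip_location[t] in NORTHEAST for t in tips) / len(tips)
        if frac > threshold:
            best = node
    return best


# Hypothetical toy tree: root -> (n1, eu1); n1 -> (n2, eu2); n2 -> four regional tips
children = {"root": ["n1", "eu1"], "n1": ["n2", "eu2"],
            "n2": ["ny1", "ny2", "ny3", "nj1"]}
parent = {kid: node for node, kids in children.items() for kid in kids}
tip_location = {"ny1": "New York", "ny2": "New York", "ny3": "New York",
                "nj1": "New Jersey", "eu1": "Europe", "eu2": "Europe"}
print(earliest_regional_node("ny1", parent, children, tip_location))
```

In this toy tree, node n1 has four of five descendant tips in the Northeast (80%), whereas the root has four of six (67%), so n1 is selected.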

Phylodynamic analysis

To minimize ascertainment and sampling bias, analysis was performed on a subset of sequenced cases residing in New York City and the outlying Westchester, Nassau, and Suffolk counties and excluded outpatients and known health care workers. Sequence data were aligned to reference (accession NC_045512.2) and ends trimmed using MAFFT 7.450 (Katoh and Standley 2013). A maximum likelihood tree was estimated using IQ-TREE 1.6.1 using a HKY substitution model (Nguyen et al. 2015). A further 20 phylogenies were derived by randomly resolving polytomies and enforcing a small minimum branch length of 7 × 10^-6 substitutions per site using the ape R package (Paradis and Schliep 2019). Rooted timescaled phylogenies were estimated using the treedater R package version 0.5.1 (Volz and Frost 2017) and a strict molecular clock. The skygrowth R package version 0.3.1 (Volz and Didelot 2018) was used to estimate effective population size through time with an exponential prior for the smoothing parameter with rate 10^-4. The final estimates were generated by averaging results over the 20 estimated timetrees. A script for reproducing these results is available at GitHub (https://gist.github.com/emvolz/d58cce01c3310a01df09faf615b77070).
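The growth rate of effective population size reported in units of 1/yr is the time derivative of log Ne(t). As a rough illustration of that quantity only, and not of the skygrowth model itself, it can be approximated by finite differences on a trajectory; the function name and the toy trajectory below are assumptions.

```python
import math


def growth_rates(times_yr, ne):
    """Finite-difference estimate of d/dt log Ne(t), in units of 1/yr."""
    return [(math.log(ne[i + 1]) - math.log(ne[i])) / (times_yr[i + 1] - times_yr[i])
            for i in range(len(ne) - 1)]


# Toy trajectory: Ne doubles each week for two weeks, then halves in the third
times = [0.0, 7 / 365, 14 / 365, 21 / 365]
ne = [1.0, 2.0, 4.0, 2.0]
rates = growth_rates(times, ne)
print([round(r, 2) for r in rates])
```

A positive value indicates a growing effective population size and a negative value a shrinking one, which is how a sustained decline such as the one reported by mid-April would appear.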

Software availability

Code used in sequencing data processing is available at GitHub (https://github.com/mauranolab/mapping/tree/master/dnase) and as Supplemental Code.

Data access

All raw sequencing data generated in this study have been submitted to the NCBI BioProject database (https://www.ncbi.nlm.nih.gov/bioproject/) under accession number PRJNA650245; sequencing reads have been filtered to remove the host genome. Sequences have been deposited into the GISAID repository immediately upon QC with virus name “NYUMC”.

Competing interest statement

The authors declare no competing interests.

28 in total

1. Multiplex PCR method for MinION and Illumina sequencing of Zika and other virus genomes directly from clinical samples.

Authors: Joshua Quick; Nathan D Grubaugh; Steven T Pullan; Ingra M Claro; Andrew D Smith; Karthik Gangavarapu; Glenn Oliveira; Refugio Robles-Sikisaka; Thomas F Rogers; Nathan A Beutler; Dennis R Burton; Lia Laura Lewis-Ximenez; Jaqueline Goes de Jesus; Marta Giovanetti; Sarah C Hill; Allison Black; Trevor Bedford; Miles W Carroll; Marcio Nunes; Luiz Carlos Alcantara; Ester C Sabino; Sally A Baylis; Nuno R Faria; Matthew Loose; Jared T Simpson; Oliver G Pybus; Kristian G Andersen; Nicholas J Loman
Journal: Nat Protoc Date: 2017-05-24 Impact factor: 13.491

2. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies.

Authors: Lam-Tung Nguyen; Heiko A Schmidt; Arndt von Haeseler; Bui Quang Minh
Journal: Mol Biol Evol Date: 2014-11-03 Impact factor: 16.240

Review 3. Tracking virus outbreaks in the twenty-first century.

Authors: Nathan D Grubaugh; Jason T Ladner; Philippe Lemey; Oliver G Pybus; Andrew Rambaut; Edward C Holmes; Kristian G Andersen
Journal: Nat Microbiol Date: 2018-12-13 Impact factor: 17.745

4. A pneumonia outbreak associated with a new coronavirus of probable bat origin.

Authors: Peng Zhou; Xing-Lou Yang; Xian-Guang Wang; Ben Hu; Lei Zhang; Wei Zhang; Hao-Rui Si; Yan Zhu; Bei Li; Chao-Lin Huang; Hui-Dong Chen; Jing Chen; Yun Luo; Hua Guo; Ren-Di Jiang; Mei-Qin Liu; Ying Chen; Xu-Rui Shen; Xi Wang; Xiao-Shuang Zheng; Kai Zhao; Quan-Jiao Chen; Fei Deng; Lin-Lin Liu; Bing Yan; Fa-Xian Zhan; Yan-Yi Wang; Geng-Fu Xiao; Zheng-Li Shi
Journal: Nature Date: 2020-02-03 Impact factor: 69.504

5. Factors associated with hospital admission and critical illness among 5279 people with coronavirus disease 2019 in New York City: prospective cohort study.

Authors: Christopher M Petrilli; Simon A Jones; Jie Yang; Harish Rajagopalan; Luke O'Donnell; Yelena Chernyak; Katie A Tobin; Robert J Cerfolio; Fritz Francois; Leora I Horwitz
Journal: BMJ Date: 2020-05-22

6. Coast-to-Coast Spread of SARS-CoV-2 during the Early Epidemic in the United States.

Authors: Joseph R Fauver; Mary E Petrone; Emma B Hodcroft; Kayoko Shioda; Hanna Y Ehrlich; Alexander G Watts; Chantal B F Vogels; Anderson F Brito; Tara Alpert; Anthony Muyombwe; Jafar Razeq; Randy Downing; Nagarjuna R Cheemarla; Anne L Wyllie; Chaney C Kalinich; Isabel M Ott; Joshua Quick; Nicholas J Loman; Karla M Neugebauer; Alexander L Greninger; Keith R Jerome; Pavitra Roychoudhury; Hong Xie; Lasata Shrestha; Meei-Li Huang; Virginia E Pitzer; Akiko Iwasaki; Saad B Omer; Kamran Khan; Isaac I Bogoch; Richard A Martinello; Ellen F Foxman; Marie L Landry; Richard A Neher; Albert I Ko; Nathan D Grubaugh
Journal: Cell Date: 2020-05-07 Impact factor: 41.582

7. Trimmomatic: a flexible trimmer for Illumina sequence data.

Authors: Anthony M Bolger; Marc Lohse; Bjoern Usadel
Journal: Bioinformatics Date: 2014-04-01 Impact factor: 6.937

8. The urgency of utilizing COVID-19 biospecimens for research in the heart of the global pandemic.

Authors: Iman Osman; Paolo Cotzia; Una Moran; Douglas Donnelly; Carolina Arguelles-Grande; Sandra Mendoza; Andre Moreira
Journal: J Transl Med Date: 2020-06-01 Impact factor: 5.531

9. The emergence of SARS-CoV-2 in Europe and North America.

Authors: Michael Worobey; Jonathan Pekar; Brendan B Larsen; Martha I Nelson; Verity Hill; Jeffrey B Joy; Andrew Rambaut; Marc A Suchard; Joel O Wertheim; Philippe Lemey
Journal: Science Date: 2020-09-10 Impact factor: 47.728

10. The proximal origin of SARS-CoV-2.

Authors: Kristian G Andersen; Andrew Rambaut; W Ian Lipkin; Edward C Holmes; Robert F Garry
Journal: Nat Med Date: 2020-04 Impact factor: 87.241

31 in total

1. Stability of SARS-CoV-2 phylogenies.

Authors: Yatish Turakhia; Nicola De Maio; Bryan Thornlow; Landen Gozashti; Robert Lanfear; Conor R Walker; Angie S Hinrichs; Jason D Fernandes; Rui Borges; Greg Slodkowicz; Lukas Weilguny; David Haussler; Nick Goldman; Russell Corbett-Detig
Journal: PLoS Genet Date: 2020-11-18 Impact factor: 5.917

2. The emergence, genomic diversity and global spread of SARS-CoV-2. (Review)

Authors: Juan Li; Shengjie Lai; George F Gao; Weifeng Shi
Journal: Nature Date: 2021-12-08 Impact factor: 49.962

3. Determinants of SARS-CoV-2 transmission to guide vaccination strategy in an urban area.

Authors: Sarah C Brüningk; Juliane Klatt; Madlen Stange; Alfredo Mari; Myrta Brunner; Tim-Christoph Roloff; Helena M B Seth-Smith; Michael Schweitzer; Karoline Leuzinger; Kirstine K Søgaard; Diana Albertos Torres; Alexander Gensch; Ann-Kathrin Schlotterbeck; Christian H Nickel; Nicole Ritz; Ulrich Heininger; Julia Bielicki; Katharina Rentsch; Simon Fuchs; Roland Bingisser; Martin Siegemund; Hans Pargger; Diana Ciardo; Olivier Dubuis; Andreas Buser; Sarah Tschudin-Sutter; Manuel Battegay; Rita Schneider-Sliwa; Karsten M Borgwardt; Hans H Hirsch; Adrian Egli
Journal: Virus Evol Date: 2022-03-17

4. Reopening During the Unprecedented: The Association of Biomolecular Resource Facilities Community Coronavirus Disease 2019 Pandemic Response. Part 2: Efforts to Effectively Ramp Up Core Facility Activities.

Authors: Joshua Z Rappoport; DeLaine D Larsen; Benjamin Abrams; Andrew Vinard; Justine Kigenyi; Isabelle Girard; A Nicole White; Desiree M Porter; Sheenah M Mische
Journal: J Biomol Tech Date: 2021-12-15

5. Statistical Challenges in Tracking the Evolution of SARS-CoV-2.

Authors: Lorenzo Cappello; Jaehee Kim; Sifan Liu; Julia A Palacios
Journal: Stat Sci Date: 2022-05-16 Impact factor: 4.015

6. The role of multi-omics in the diagnosis of COVID-19 and the prediction of new therapeutic targets. (Review)

Authors: Jianli Ma; Yuwei Deng; Minghui Zhang; Jinming Yu
Journal: Virulence Date: 2022-12 Impact factor: 5.428

7. Genomic epidemiology of SARS-CoV-2 in Esteio, Rio Grande do Sul, Brazil.

Authors: Vinícius Bonetti Franceschi; Gabriel Dickin Caldana; Amanda de Menezes Mayer; Gabriela Bettella Cybis; Carla Andretta Moreira Neves; Patrícia Aline Gröhs Ferrareze; Meriane Demoliner; Paula Rodrigues de Almeida; Juliana Schons Gularte; Alana Witt Hansen; Matheus Nunes Weber; Juliane Deise Fleck; Ricardo Ariel Zimerman; Lívia Kmetzsch; Fernando Rosado Spilki; Claudia Elizabeth Thompson
Journal: BMC Genomics Date: 2021-05-20 Impact factor: 3.969

8. Molecular evidence of SARS-CoV-2 in New York before the first pandemic wave.

Authors: Matthew M Hernandez; Ana S Gonzalez-Reiche; Hala Alshammary; Shelcie Fabre; Zenab Khan; Adriana van De Guchte; Ajay Obla; Ethan Ellis; Mitchell J Sullivan; Jessica Tan; Bremy Alburquerque; Juan Soto; Ching-Yi Wang; Shwetha Hara Sridhar; Ying-Chih Wang; Melissa Smith; Robert Sebra; Alberto E Paniz-Mondolfi; Melissa R Gitman; Michael D Nowak; Carlos Cordon-Cardo; Marta Luksza; Florian Krammer; Harm van Bakel; Viviana Simon; Emilia Mia Sordillo
Journal: Nat Commun Date: 2021-06-08 Impact factor: 14.919

9. Ultrafast Sample placement on Existing tRees (UShER) enables real-time phylogenetics for the SARS-CoV-2 pandemic.

Authors: Yatish Turakhia; Bryan Thornlow; Angie S Hinrichs; Nicola De Maio; Landen Gozashti; Robert Lanfear; David Haussler; Russell Corbett-Detig
Journal: Nat Genet Date: 2021-05-10 Impact factor: 41.307

10. Extracellular vesicles carry SARS-CoV-2 spike protein and serve as decoys for neutralizing antibodies.

Authors: Zach Troyer; Najwa Alhusaini; Caroline O Tabler; Thomas Sweet; Karina Inacio Ladislau de Carvalho; Daniela M Schlatzer; Lenore Carias; Christopher L King; Kenneth Matreyek; John C Tilton
Journal: J Extracell Vesicles Date: 2021-06-18