Single cell RNA sequencing (scRNAseq) studies have provided critical insight into the pathogenesis of Severe Acute Respiratory Syndrome CoronaVirus 2 (SARS-CoV-2), the causative agent of COronaVIrus Disease 2019 (COVID-19). scRNAseq workflows are generally designed for the detection and quantification of eukaryotic host mRNAs and not viral RNAs. The performance of different scRNAseq methods to study SARS-CoV-2 RNAs has not been thoroughly evaluated. Here, we compare different scRNAseq methods for their ability to quantify and detect SARS-CoV-2 RNAs with a focus on subgenomic mRNAs (sgmRNAs), which are produced only during active viral replication and not present in viral particles. We present a data processing strategy, single cell CoronaVirus sequencing (scCoVseq), which quantifies reads unambiguously assigned to sgmRNAs or genomic RNA (gRNA). Compared to standard 10X Genomics Chromium Next GEM Single Cell 3' (10X 3') and Chromium Next GEM Single Cell V(D)J (10X 5') sequencing, we find that 10X 5' with an extended R1 sequencing strategy maximizes the unambiguous detection of sgmRNAs by increasing the number of reads spanning leader-sgmRNA junction sites. Differential gene expression testing and KEGG enrichment analysis of infected cells compared with bystander or mock cells showed an enrichment for COVID19-associated genes, supporting the ability of our method to accurately identify infected cells. Our method allows for quantification of coronavirus sgmRNA expression at single-cell resolution, and thereby supports high resolution studies of the dynamics of coronavirus RNA synthesis. IMPORTANCE: Single cell RNA sequencing (scRNAseq) has emerged as a valuable tool to study host-viral interactions particularly in the context of COronaVIrus Disease-2019 (COVID-19). scRNAseq has been developed and optimized for analyzing eukaryotic mRNAs, and the ability of scRNAseq to measure RNAs produced by Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) has not been fully characterized. Here we compare the performance of different scRNAseq methods to detect and quantify SARS-CoV-2 RNAs and develop an analysis workflow to specifically quantify unambiguous reads derived from SARS-CoV-2 genomic RNA and subgenomic mRNAs. Our work demonstrates the strengths and limitations of scRNAseq to measure SARS-CoV-2 RNA and identifies experimental and analytical approaches that allow for SARS-CoV-2 RNA detection and quantification. These developments will allow for studies of coronavirus RNA biogenesis at single-cell resolution to improve our understanding of viral pathogenesis.
Single cell RNA sequencing (scRNAseq) studies have provided critical insight into the pathogenesis of Severe Acute Respiratory Syndrome CoronaVirus 2 (SARS-CoV-2), the causative agent of COronaVIrus Disease 2019 (COVID-19). scRNAseq workflows are generally designed for the detection and quantification of eukaryotic host mRNAs and not viral RNAs. The performance of different scRNAseq methods to study SARS-CoV-2 RNAs has not been thoroughly evaluated. Here, we compare different scRNAseq methods for their ability to quantify and detect SARS-CoV-2 RNAs with a focus on subgenomic mRNAs (sgmRNAs), which are produced only during active viral replication and not present in viral particles. We present a data processing strategy, single cell CoronaVirus sequencing (scCoVseq), which quantifies reads unambiguously assigned to sgmRNAs or genomic RNA (gRNA). Compared to standard 10X Genomics Chromium Next GEM Single Cell 3' (10X 3') and Chromium Next GEM Single Cell V(D)J (10X 5') sequencing, we find that 10X 5' with an extended R1 sequencing strategy maximizes the unambiguous detection of sgmRNAs by increasing the number of reads spanning leader-sgmRNA junction sites. Differential gene expression testing and KEGG enrichment analysis of infected cells compared with bystander or mock cells showed an enrichment for COVID19-associated genes, supporting the ability of our method to accurately identify infected cells. Our method allows for quantification of coronavirus sgmRNA expression at single-cell resolution, and thereby supports high resolution studies of the dynamics of coronavirus RNA synthesis. IMPORTANCE: Single cell RNA sequencing (scRNAseq) has emerged as a valuable tool to study host-viral interactions particularly in the context of COronaVIrus Disease-2019 (COVID-19). scRNAseq has been developed and optimized for analyzing eukaryotic mRNAs, and the ability of scRNAseq to measure RNAs produced by Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) has not been fully characterized. Here we compare the performance of different scRNAseq methods to detect and quantify SARS-CoV-2 RNAs and develop an analysis workflow to specifically quantify unambiguous reads derived from SARS-CoV-2 genomic RNA and subgenomic mRNAs. Our work demonstrates the strengths and limitations of scRNAseq to measure SARS-CoV-2 RNA and identifies experimental and analytical approaches that allow for SARS-CoV-2 RNA detection and quantification. These developments will allow for studies of coronavirus RNA biogenesis at single-cell resolution to improve our understanding of viral pathogenesis.
Severe Acute Respiratory Syndrome CoronaVirus 2 (SARS-CoV-2) is the causative agent of COronaVIrus Disease-2019 (COVID-19), which as of November 2021 has caused over 250 million cases and over 5 million deaths globally(1, 2). Global efforts to understand the pathogenesis of SARS-CoV-2 infection have led to the development of vaccines and antivirals, which have reduced morbidity and mortality(3). “Omics” methods have been instrumental in studying SARS-CoV-2 in part because they have generated large amounts of data regarding host-viral interactions at unprecedented speed(4–15). Single-cell RNA sequencing (scRNAseq) studies in particular have been used to study viral tropism(16–22), peripheral immune changes(23–33), transcriptional changes induced by infection(34, 35), and to develop cell atlases of COVID-19 pathology(23, 24, 36, 37). Of note, most scRNAseq workflows have been developed and optimized for studies of eukaryotic transcription but not viral, specifically SARS-CoV-2, transcription. The performance of different scRNAseq methods to detect and quantify viral RNAs may impact the analysis and interpretation of such studies.SARS-CoV-2 is a betacoronavirus with a 29 kB positive-sense, single stranded RNA genome(38, 39). SARS-CoV-2 generates genomic RNA (gRNA), subgenomic mRNAs (sgmRNAs), and negative-sense antigenomic RNA during active infection(40, 41). Both gRNA and sgmRNAs are poly-adenylated, which enables detection by scRNAseq protocols that rely on poly-T primed reverse transcription(39–41). Translation of gRNA results in the production of one of two polyproteins, pp1a and pp1ab, which are subsequently cleaved into an array of non-structural proteins involved in pathogenesis and replication(39, 41). Translation of sgmRNAs generates structural and accessory viral proteins critical for virion production and pathogenesis(39, 41). sgmRNAs are produced only in cells with actively replicating virus, while gRNA is present in both infected cells and virions(40, 41). Therefore, specific detection of sgmRNAs can allow for: 1) specific identification of cells with actively replicating virus and 2) analysis of the dynamics of viral gene expression within and across cells and viruses.sgmRNAs are generated by discontinuous transcription events during negative strand synthesis(40). Transcription Regulatory Sequences (TRS), present in the 5′ leader sequence of the virus (TRS-L) and upstream of each ORF body (TRS-B), regulate this process(40). Template switching of the viral polymerase from a TRS-B to a TRS-L generates sgmRNAs with the 5′ leader sequence fused to the sgmRNA ORF body (Figure 1A)(40). These “nested” sgmRNAs share the viral ORF sequence downstream of the junction site in addition to a common leader sequence upstream of the junction site(40). This redundancy poses a challenge for standard scRNAseq data processing pipelines because reads mapping to redundant sgmRNA sequences are categorized as “ambiguous” and typically excluded from quantification. This problem has been addressed in bulk RNAseq by quantifying SARS-CoV-2 reads spanning leader-ORF junctions, which unambiguously identify sgmRNAs(12, 15). However, many scRNAseq methods do not sequence this region of sgmRNAs at significant coverage due to differences in library format and configuration of sequencing reads.
Figure 1:
A. Illustration of SARS-CoV-2 genomic RNA, gRNA, and subgenomic RNAs, sgmRNAs. B.
Top: Reads included for analysis by scCoVseq. Either: 1) contiguous reads mapping to ORF1a/b and therefore derived from gRNA or 2) discontinuous reads spanning the leader region and ORFS transcribed by sgmRNAs Bottom: Reads excluded from analysis by scCoVseq. Either: 1) discontinuous reads that do not include sequence mapping to the leader region and downstream of S or 2) contiguous reads that map to ORFs other than ORF1a/b, which are ambiguous. C. Activity diagram of scCoVseq pipeline. Blue rectangles indicate inputs/outputs for each stage. Orange rounded rectangles indicate a process in bold with software indicated.
We hypothesized that both experimental (i.e. scRNAseq method) and data processing decisions influence the ability to detect, resolve, and quantify SARS-CoV-2 RNA species with scRNAseq. We developed a data processing workflow, single cell coronavirus sequencing (scCoVseq), to quantify only unambiguous SARS-CoV-2 reads in scRNAseq data. We found that SARS-CoV-2 RNA detection differed by 10X Genomics Chromium scRNAseq method, due in part to ambiguity of the library fragments generated by each method. We show that 10X Chromium Next GEM Single Cell V(D)J (10X 5′) scRNAseq with an extended read 1 (R1) sequencing strategy maximized unambiguous SARS-CoV-2 reads and thereby increased detection of SARS-CoV-2 RNAs. Using this method, we identify infected and uninfected “bystander” cells within the same culture and determine differentially expressed genes between infected, bystander, and mock cells.
Materials and Methods
Cell lines and Viral Infection
Vero-E6 cells (ATCC, CRL-1586) were maintained in Dulbecco’s Modified Eagle Medium (DMEM, Corning #10–017-CV) supplemented with 10% fetal-bovine serum (FBS) and 1% Penicillin Streptomycin (PSN, Fisher scientific #15-140-122), and routinely cultured at 37° C with 5% CO2.SARS-CoV-2 (isolate USA-WA1/2020, BEI resource NR-52281) and control media (mock infected) stocks were grown by inoculating a confluent T175 flask of Vero-E6 cells (passage 2). Mock and SARS-CoV-2 infected cultures were maintained in reduced serum DMEM (2% FBS) for 72 hours, after which culture media was collected and filtered by centrifugation (8000 × g, 15 minutes) using an Amicon Ultra-15 filter unit with a 100KDa cutoff filter (Millipore # UFC910024). Concentrated stocks in reduced-serum media (2% FBS), supplemented with 50mM HEPES buffer (Gibco #15630080) were stored at −80°C. Viral titers were determined by plaque assay as previously described(42). All SARS-CoV-2 propagations and experiments were performed in a Biosafety Level 3 facility in compliance with institutional protocols and federal guidelines.
scRNAseq
For scRNAseq experiments, Vero-E6 cells in 6 well plates were infected with SARS-CoV-2 at a MOI of 0.1, or with an equivalent volume of control media, in reduced-serum media (2% FBS) for 24 hours. To prepare cells for scRNAseq, mock and SARS-CoV-2 infected cultures were washed with calcium/magnesium-free PBS and disassociated using TrypLE (Gibco # 12605010, 5 minutes at 37° C), after which samples were centrifuged (200 × g, 5 minutes), resuspended in calcium/magnesium-free PBS supplemented with 1% BSA, and counted. Mock and SARS-CoV-2 infected cell culture samples were filtered through a 40μm FlowMi strainer (ScienceWare # H13680–0040) and counted prior to loading on the 10X Genomics Chromium Controller according to manufacturer’s protocol. Mock and infected samples were loaded on separate lanes of a 10X Genomics Chromium Controller for either NextGEM Single Cell 3′ v3.1 (10X 3′), or NextGEM Single Cell V(D)J v1.1, (10X 5′).Gene expression libraries were prepared for 10X 3′ and 5′ samples according to manufacturer’s guidelines. Final 10X 3′ mock and infected gene expression libraries and the 10X 5′ infected gene expression library were PCR amplified for 16 cycles while the 10X 5′ mock gene expression library was amplified for 14 cycles. 10X 3′ gene expression libraries were pooled and sequenced by short-read sequencing on an Illumina NextSeq 500 using a high output 150 cycle reaction kit according to manufacturers’ protocol with the following read lengths: read 1 28 nt; i7 index 8 nt; and read 2 130 nt. 10X 5′ gene expression libraries were also pooled and sequenced with 10X recommended read lengths (read 1 26 nt; i7 index 8 nt; and read 2 132 nt) or with extended R1 protocol (read 1 158 nt; i7 index 8 nt; no read 2).
scRNAseq Pre-Processing
Conversion of Illumina BCL files to fastq
Fastq files for standard sequencing 3′ and 5′ gene expression libraries were generated using the mkfastq command in cellranger v.3.1.0 (10X Genomics). Fastq files for 5′ libraries sequenced with the extended R1 strategy were generated using bcl2fastq v2.20.0 (Illumina, Inc). Extended R1 fastqs were then separated into pseudo R1 fastqs, containing the cell barcode and UMI, and pseudo read 2 (R2) fastqs, containing cDNA sequence, using a customized Python/3.7.3 script (available at github link pending) as follows. The cell barcode and UMI are selected from the first 26 bp of R1. The subsequent 13 bp derive from the template switch oligonucleotide and are ignored. The remaining nucleotides (and corresponding quality scores) are reverse complemented and stored as pseudo R2. The read header of the pseudo R2 fastqs are modified to reflect the format for standard R2 fastqs.
Downsampling fastqs to control for sequencing depth
To control for differences in sequencing depth for each library, read depth per library was downsampled to approximately 50,000 reads per cell. To generate a whitelist of cell barcodes for downsampling while accounting for transcriptional shutdown in SARS-CoV-2 infected cells (35), we generated preliminary gene x cell matrices for our dataset using cellranger/3.1.0 count (10X Genomics, Inc) to quantify and align reads to a host reference (African Green Monkey, ChlSab1.1) combined with SARS-CoV-2 transcripts as annotated by the NCBI SARS-CoV-2 reference (NC_045512.2) with modifications for USA/WA01 strain for each dataset. The resulting output was analyzed in R/4.0.4 with Seurat/4.0.1(43–45) to filter out putative doublets and empty droplets according to total UMIs/cell, number of genes/cell, and percent of mitochondrial gene expression. After filtering, putative cell-containing cell barcodes were output to a whitelist per library. Based on the these whitelists, the initial fastq files were downsampled using seqtk (version 1.2)(46) to a total read depth of 50,000 multiplied by the number of cells in the library.
Preparation of empirically derived SARS-CoV-2 genome reference
Downsampled fastq files were then mapped using cellranger count/3.1.0 (10X Genomics, Inc) to an empirically defined reference of SARS-CoV-2 sgmRNAs derived from previously reported SARS-CoV-2 (BetaCoV/Korea/KCDC03/2020) RNAs sequenced with long-read direct RNA Nanopore sequencing(12). These were downloaded from the UCSC Genome Browser Table Browser(47) after filtering for TRS-dependent transcripts and score > 900 and exporting to gtf format. Transcripts for previously unknown ORFs were excluded from the annotation. An additional annotation for genomic RNA was included which covered the entire length of the SARS-CoV-2 genome. Aligning the BetaCoV/Korea/KCDC03/2020 genome with USA/WA-CDC-WA1/2020 genome showed that the USA/WA-CDC-WA1/2020 had an additional 21 3′ adenosine nucleotides annotated. To account for this in our reference, we extended any annotations from BetaCoV/Korea/KCDC03/2020 that ended at the 3′ end of the genome by an additional 21 bases. This SARS-CoV-2 reference was appended to the host ChlSab1.1 Ensembl reference.
scCoVseq
To unambiguously assign and quantify scRNAseq reads to SARS-CoV-2 RNAs, the cellranger output BAM was filtered for reads mapping to SARS-CoV-2 or ChlSab1.1 references using samtools (version 1.11)(48). SARS-CoV-2 aligned reads were then subset to likely genomic RNA reads or sgmRNA reads. Genomic reads were defined as those containing no gaps in their alignment and mapping upstream of the start of the most 5′ sgmRNA, S. sgmRNA reads were defined as SARS-CoV-2 reads containing a gap and mapping in part to the 5′ leader sequence, defined as the 5′ proximal 80 nucleotides of the SARS-CoV-2 genome, and in part 3′ to the start of S. All other reads mapping to the SARS-CoV-2 genome were discarded. Reads passing these filtering steps were quantified with umi_tools (version 1.0.0)(49). An R/3.5.3 script using the Matrix (version 1.2–18)(50) and readr (version 1.3.1)(51) packages was used to convert this to a sparse matrix and save as an rds file to decrease file size. UMIs that were assigned to multiple genes were removed from the resulting matrix during analysis.
scRNAseq Analysis
Sashimi Plots
Reads from 10X 3′, 10X 5′, and 10X 5′ extended R1 data that were aligned to the SARS-CoV-2 reference by cellranger were subset from the cellranger output BAM file. Each BAM file was downsampled to approximately 1 × 106 reads to control for differences in sequencing depth across libraries. Sashimi plots were generated with ggsashimi (version 1.0.0)(52).
Classification of SARS-CoV-2 Infected Cells
scCoVseq-derived gene by cell matrices were loaded into R/4.0.4 and analyzed with the Seurat/4.0.1(43–45) package. For each 10X method, mock and infected gene × cell matrices were merged with the Seurat merge command. Scaled SARS-CoV-2 UMI expression of 600 sampled cells were clustered with five methods (k means clustering, hierarchical/Ward clustering, DIANA, mixture model-based clustering, and k medoids clustering) using the clValid (version 0.7) package(53). Based on optimal performance as measured by average distance, average distance between means, average proportion of non-overlap, connectivity, Dunn index, figure of merit, and silhouette width, k-medoids clustering implemented with the PAM algorithm and k set to 2 optimally separated infected from uninfected cells. Therefore to identify infected and bystander cells within SARS-CoV-2 treated cultures, euclidean distance between the z-scaled expression of SARS-CoV-2 sgmRNA UMIs per cell was clustered using pam (k = 2) implemented in the cluster (version 2.1.2) package(54). Output clusters were then compared for viral UMI expression per cell, and the cluster with more viral UMIs was classified as infected and the other as uninfected.
Comparison of SARS-CoV-2 RNA UMIs per scRNAseq Method
To examine the distribution of SARS-CoV-2 UMIs per cell by scRNAseq method, the 25th percentile of total UMIs was quantified for all infected cells from each 10X method. Any cells with fewer UMIs than the minimum 25th percentile of all samples were discarded, and all cells were subsequently downsampled to this same number of total UMIs/cell using the Seurat SampleUMI command. Each dataset was randomly downsampled to the same number of infected cells to equalize for differences in cell numbers, and viral sgmRNA UMIs/cell were plotted by scRNAseq method.
SARS-CoV-2 Read Distribution by scRNAseq Method
SARS-CoV-2 reads were defined as genomic or subgenomic using scCoVseq. Reads aligning to the SARS-CoV-2 reference that were excluded from scCoVseq were classified as ambiguous. The number of genomic, subgenomic, or ambiguous reads per million SARS-CoV-2 reads was calculated and plotted for each scRNAseq method.
Differential Expression Analysis
To explore expression differences between infected, bystander, and mock cells, differential expression testing with edgeR (version 3.32.1) was performed with modifications for scRNAseq as previously described(55, 56). Viral genes were excluded from analysis, and only host genes expressed in at least 10% of cells were tested. To account for differences in RNA content of infected cells due to virally-induced transcriptional shutdown, all cells were downsampled to the 25th percentile of total UMIs of infected cells. Cells with fewer UMIs than the threshold were excluded from analysis. Differential gene expression was performed with edgeR using a generalized linear model quasi-likelihood F test adapted with a term for gene detection rate(55, 56). Genes with an absolute log2 fold change greater than or equal to 1 and false discovery rate less than 0.05 were considered significant. For KEGG enrichment analysis, pairwise tests between mock, bystander, and infected cells were performed. Differentially expressed genes with an absolute log2 fold change greater than or equal to 1 and false discovery rate less than 0.05 were considered significant and subject to KEGG enrichment analysis using the KEGG annotations for African Green Monkey as implemented in the edgeR function kegga.
Quantification of SARS-CoV-2 sgmRNA Junction Sites
We explored the ability of our extended R1 sequencing to detect SARS-CoV-2 sgmRNA junctions using STARsolo (version 2.7.8a)(57). Aligned reads were re-mapped to the empirical SARS-CoV-2 annotation described above and junction sites per cell were quantified. The resulting junction per cell matrix was plotted in R/v4.0.4.
Flow Cytometry
Vero E6 cells were fixed with 4% paraformaldehyde at room temperature for a minimum of 24 hours, washed once with PBS and permeabilized with 1X perm-wash buffer (BDBiosciences #554723) for 5 minutes. SARS-CoV nucleocapsid (N) antibody (clone 1C7C7) (kindly provided by Thomas Moran, Icahn School of Medicine at Mount Sinai, New York, NY), conjugated to AlexaFluor 647 was diluted 1:400 in perm-wash buffer, and added directly to samples. Samples were then incubated at room temperature for 40 minutes in the dark. After staining, samples were washed once with 1X perm-wash buffer, once with PBS, resuspended in FACS buffer (PBS supplemented with 1% FBS), and acquired on a Gallios flow cytometer (Beckman-Coulter). For all viral infections, analysis was performed with FlowJo software (version 10.7.1, Becton Dickinson), excluding cell doublets and debris and gating according to mock infected populations.
Immunofluorescence microscopy
Vero E6 were seeded in 6-well plates (Falcon REF-353046) with one coverslip (Fisher Scientific 12-550-143) per well. After 24 hours post infection, cells were washed with PBS and fixed with 4% paraformaldehyde (Fisher Scientific AA433689M) overnight. Fixed cells were permeabilized using 0.1% Triton-X (Fisher Scientific AC327371000) in PBS and blocked with 4% bovine serum albumin (BSA, Fisher Scientific BP1600–100) in PBS. Blocked coverslips were incubated with mouse anti-SARS-CoV N antibody (clone 1C7, 1:500 in 4% BSA PBS) overnight at 4C, washed three times with PBS, and incubated for 45 minutes with 1:500 AlexaFluor 488-conjugated anti-mouse (Invitrogen A11001, 1:500 in 4% BSA PBS) plus DAPI (Thermo Fisher Scientific D1306, 1:1000 in 4% BSA PBS) at room temperature. Coverslips were then stained with phalloidin (1:400 in PBS) for 1 hour at room temperature and washed again three times with PBS. Coverslips were mounted using Prolong Diamond (Life Technologies P36970). Confocal laser scanning was performed using a Leica SP5 DMI (ISMMS Microscopy CoRE and Advanced Bioimaging Center) with a ×40/1.25 oil objective. Images were collected at a resolution of 512 × 512 pixels in triplicate per slide. Images were processed and analyzed using LAS X and CellProfiler v4(58).
Data Availability
Raw and processed scRNAseq data are available at (GEO accession number pending) and code is available at (github pending).
Results
SARS-CoV-2 generates gRNA and sgmRNAs during infection, which are highly redundant in their sequences (Figure 1A). Reads mapping to redundant sequences are assigned to all genes which contain that sequence and are typically excluded from quantification steps in scRNAseq processing pipelines. We therefore identified read structures which could unambiguously identify gRNA or different species of sgmRNAs to allow for their quantification (Figure 1B). Reads derived from gRNA should be contiguous and could map anywhere on the SARS-CoV-2 genome. Reads derived from sgmRNA could be either gapped or contiguous and could map to the 5′ leader and/or downstream of the start site of S, the most 5′ sgmRNA. Because contiguous reads mapping downstream of S could derive either from gRNA or sgmRNAs, they were excluded from quantification. Only reads aligning in part to the 5′ leader and in part downstream of S could be confidently derived from sgmRNAs. We therefore defined gRNA reads as contiguous reads aligning upstream of regions contained in sgmRNAs. sgmRNA reads were defined as discontinuous reads spanning the leader region and regions used by sgmRNAs. Reads that did not match either of these formats could not unambiguously be assigned to gRNA or an sgmRNA and were therefore excluded from quantification (Figure 1B). With this framework, we developed scCoVseq to quantify unambiguous genomic and subgenomic viral reads (Figure 1C). Using scCoVseq, we compared the abilities of different scRNAseq methods to quantify SARS-CoV-2 RNAs.In the widely available Chromium scRNAseq method developed by 10X Genomics, Inc, there are two formats for droplet-based scRNAseq: 10X 3′ and 10X 5′. 10X 3′ generates library fragments derived from the 3′ regions of poly-adenylated RNAs within a cell (Figure 2A). Because sgmRNAs share all viral sequence 3′ of the leader-body junction site, 10X 3′ library fragments derived from SARS-CoV-2 heavily cover the 3′ end of the viral genome and do not contain leader-ORF junctions (Figure 2D). These reads cannot differentiate gRNA from sgmRNA or distinguish different sgmRNA species. 10X 5′ generates library fragments from the 5′ termini of poly-adenylated RNAs (Figure 2B). These fragments are on average approximately 500 bp long (according to the manufacturer’s documentation) and should contain leader-ORF junctions of SARS-CoV-2 sgmRNAs. The transcript read (R2), however, derives from the 3′ end of these fragments and at the recommended read length of 91 bases is not long enough to consistently sequence into the leader-sgmRNA junction site (Figure 2B). 10X 5′ can therefore detect some but not all junctions (Figure 2E). We reasoned that we could use 10X 5′ library fragments to detect junction-spanning reads by sequencing from the 5′ end of the fragment. To do this we extended R1, which is normally used to sequence the cell barcode and UMI, to sequence into the leader-body junction site (Figure 2C). Using 10X 5′ with Extended R1, we were able to sequence more leader-sgmRNA junction sites and increase our ability to unambiguously quantify sgmRNAs (Figure 2F). Indeed 10X 5′ Extended R1 increased the number of leader-sgmRNA spanning reads over 10X 5′ and 10X 3′ (Figure 2G). When quantified with scCoVseq, we found that 10X 5′ Extended R1 quantifies more UMIs per sgmRNA per cell compared to 10X 5′ or 10X 3′ (Figure 2H). Importantly, the average host gene expression per sample was significantly correlated across methods, suggesting that host gene measurements were minimally affected by 10X 5′ Extended R1 (Supplemental Figure 1). Taken together, 10X 5′ libraries sequenced with extended R1 sequencing results in a greater number of unambiguous reads derived from sgmRNAs over 10X 3′ or 10X 5′, and consequently recovers more sgmRNA UMIs/cell.
Figure 2:
A-C. Illustration of gRNA and S and ORF3a sgmRNAs. Red box indicates regions contained in final 10X library. Lower: Example illustration of 10X library fragments derived from gRNA and S and ORF3a sgmRNAs with sequencing read 1 and read 2 indicated. 10X 3′ (A), 10X 5′ (B), and 10X 5′ Extended R1 (C) libraries are illustrated. D-F. Sashimi plot of 10X 3′ (D), 10X 5′ (E), and 10X 5′ Extended R1 (F) reads mapped to the SARS-CoV-2 genome filtered to show only junctions supported by at least 1,000 reads. Total number of reads visualized is indicated in the bottom right. G. Reads per million reads mapped to SARS-CoV-2 reads mapping to a single viral gene in 10X 3′, 10X 5′, or 10X 5′ with Extended R1 data. Reads are colored by their mapping with contiguous reads mapping to ORF1a/b in yellow, leader-sgmRNA junction-spanning reads in blue, and ambiguously mapped reads in grey. H. UMIs per cell for all sgmRNAs in infected cells in each dataset. Each dataset was downsampled to an equal number of infected cells and each cells’ total UMIs were downsampled to the same value to control for differences in sequencing depth. The leader region is enlarged in illustrations of the genome for visibility. L = Leader.
Using this method, we analyzed Vero E6 cells 24 hours post infection with SARS-CoV-2 at an MOI of 0.1 (Figure 3A). We were able to quantify sgmRNAs and gRNA at single-cell resolution (Figure 3B). Using expression values for sgmRNAs, we compared multiple unsupervised methods to identify infected cells. We found that a k-medoid clustering approach implemented with the pam algorithm performed best as indicated by multiple metrics to separate infected from uninfected cells (Supplemental Figure 2A–C). We found that this classification method detected a similar percentage of infected cells as detected using flow cytometry and immunofluorescence microscopy of the same cultures (Supplemental Figure 2D). Using our infection classification, we performed differential expression testing of infected cells compared to bystander cells within the same culture as well as to cells from a mock culture. As previously described(35), we observed downregulation of many host genes in infected cells accompanied by an upregulation of cellular stress response genes (Figure 3D, E). We further observed that, while bystander and mock cells had similar gene expression, a small number of genes were upregulated in bystander cells compared to mock cells (Figure 3D, E). This is especially notable given the inability of Vero E6 cells to produce interferons in response to viral infection (59). KEGG enrichment analysis of differentially expressed genes in pairwise comparisons of infected, mock, and bystander cells showed that genes related to COVID19 were enriched in our infected cells supporting our method for infection classification (Figure 3F).
Figure 3:
A. Experimental design. Vero E6 cells were infected or mock infected with SARS-CoV-2 (USA-WA1/2020) at an MOI of 0.1. At 24 hours post-infection, cells were analyzed by scRNAseq using 10X 5′ with Extended R1 sequencing. B-C. 3,047 mock and infected cells embedded in tSNE space derived from euclidean distance of scaled viral sgmRNA expression. Cells are colored by (B) indicated viral RNA expression, or (C) experimental condition and assigned infection status of cells. D. Heatmap of genes differentially expressed in infected, bystander, or mock cells. Differential expression testing was performed on host gene expression downsampled to an equal number of UMIs/cell across cells to account for infection-induced transcriptional shutdown. Genes were selected for visualization based on false discovery rate of less than 0.05 and absolute log2 fold change of at least 1. Non-downsampled gene expression data is shown. Along the top, infection status, total viral UMIs and genomic RNA as quantified by CellRanger and scCoVseq are indicated. Cells and genes are clustered with ward d2 clustering based on euclidean distance. E. Expression of selected host genes per cell by infection status. Data shown is not downsampled. Top: genes induced in infected cells. Middle: genes repressed in infected cells. Bottom: genes upregulated in bystander cells compared to mock. F. KEGG pathway enrichment in genes differentially expressed in pairwise comparisons of downsampled infected, bystander, and mock cells. Dot size and fill indicates the −log10 p value of enrichment with red dots indicating enrichment in the first infection state and blue in the second infection state noted above each panel.
Discussion
In this study, we examined the ability of two commonly used scRNAseq methods, 10X 3′ and 10X 5′, to detect and quantify SARS-CoV-2 derived RNAs with a focus on sgmRNAs. Because of the redundant nature of coronavirus sgmRNA sequences, we developed scCoVseq, which unambiguously quantifies both sgmRNAs and gRNAs in 10X data. We found that 10X methods detect unambiguous leader-sgmRNA junction-spanning reads to different degrees. We were able to increase the detection of leader-sgmRNA junction-spanning reads by extending the length of R1 during sequencing of 10X 5′ libraries, an approach we term 10X 5′ Extended R1 sequencing. Combining 10X 5′ Extended R1 with scCoVseq maximized quantification of sgmRNA UMIs compared to 10X 5′ or 10X 3′.The ability to use sgmRNA expression to identify cells with actively replicating virus may improve the utility of scRNAseq in studies of coronavirus tropism. A challenge in many scRNAseq studies, particularly studies of primary tissues, has been identifying cells with active infection as opposed to cells with large amounts of ambient or extracellular viral RNA (such as phagocytic cells and/or cells not supporting active infection)(16). Because sgmRNAs are produced only during viral replication and are absent from virions, our method allows us to distinguish between infected and uninfected cells associated with “background” viral RNA (Supplemental Figure 2C). In downstream analyses of host transcriptomic changes induced by infection, accurate classification of infected cells is important for robust analyses of transcriptional differences between infected and uninfected cells because incorrect classifications may dilute effect sizes and resultant significance values. This method also enables the comparison of sgmRNA expression dynamics at single cell resolution. Such analyses may be particularly relevant for comparing viral gene expression between different cell types, coronaviruses, or between SARS-CoV-2 variants of interest, which have been described to have different kinetics of sgmRNA expression(60). This approach could be extended to any coronavirus or nidovirus, including potentially novel emerging coronaviruses. Finally, scCoVseq can be used to examine differential junction site usage within single cells (Supplemental Figure 3). Several groups have identified TRS-independent SARS-CoV-2 sgmRNAs(12, 13, 15), the significance of which remains unknown. It is possible that changes in junction site usage between cell types or during the course of infection may play a role in pathogenesis.It should be noted that there are some limitations to our study. With our dataset, we are unable to know the true infection state of a cell processed for scRNAseq, and therefore we cannot assess the true accuracy of our method to classify infected cells. An additional limitation of our method is that quantification of viral genes with scCoVseq is dependent on accurate annotation of viral RNAs. We derived our annotation based on published empirically-defined TRS-dependent RNAs(12), but this does not preclude the existence of other viral RNAs at time points or in cell types not studied. Importantly, we explicitly exclude TRS-independent RNAs from our analyses. Methods such as STARsolo(57) or sequencing 10X libraries with long-read sequencing(61) may allow for detection and quantification of viral RNAs without reference annotation and irrespective of TRSs.Supplemental Figure 1: Average counts of host gene expression of cells analyzed by 10X 3′, 10X 5′, and 10X 5′ with Extended R1 sequencing. Each point represents the average UMIs/cell of a single gene assayed in the indicated assays. At the top left, the Pearson correlation coefficient and resulting p value are indicated.Supplemental Figure 2: A. Comparison of performance metrics (average distance, AD; average distance between means, ADM; average proportion of non-overlap, APN; connectivity; Dunn Index; figure of merit, FOM; and silhouette index) by several clustering methods (diana, model-based, hierarchical, kmeans, and pam) run on sgmRNA expression of 600 randomly sampled cells analyzed with 10X 5′ Extended R1 and scCoVseq. Left: Performance metrics for each method across k values from 2 to 5. Right: performance metrics for each method with k = 2. B. Visualization of infection classification by different methods. C. Viral gene expression of cells by infection status, determined by pam clustering method. D. Percent of infected cells per sample as measured by flow cytometry, immunofluorescence, and infection classification with unsupervised (pam) method or supervised infection classification by classifying infected cells as those with at least 375 total viral UMIs. Because the same sample was sequenced with 10X 5′ and 10X 5′ Extended R1, flow cytometry and immunofluorescence results are duplicated for ease of visualization. Error bars for immunofluorescence indicate mean ± one standard deviation of percent infected cells based on three fields per sample.Supplemental Figure 3: Detection of junction sites in SARS-CoV-2 reads with 10X 5′ Extended R1. Junction sites are represented by the 5′ start site and 3′ end site on the y and x-axis, respectively. The color indicates the log2 total UMIs/junction across all cells in the SARS-CoV-2 infected sample. Below each axis, the number of UMIs supporting a position as a junction start or end site is indicated with a density plot.
Authors: Evan Z Macosko; Anindita Basu; Rahul Satija; James Nemesh; Karthik Shekhar; Melissa Goldman; Itay Tirosh; Allison R Bialas; Nolan Kamitaki; Emily M Martersteck; John J Trombetta; David A Weitz; Joshua R Sanes; Alex K Shalek; Aviv Regev; Steven A McCarroll Journal: Cell Date: 2015-05-21 Impact factor: 41.582
Authors: Changfu Yao; Stephanie A Bora; Tanyalak Parimon; Tanzira Zaman; Oren A Friedman; Joseph A Palatinus; Nirmala S Surapaneni; Yuri P Matusov; Giuliana Cerro Chiang; Alexander G Kassar; Nayan Patel; Chelsi E R Green; Adam W Aziz; Harshpreet Suri; Jo Suda; Andres A Lopez; Gislâine A Martins; Barry R Stripp; Sina A Gharib; Helen S Goodridge; Peter Chen Journal: Cell Rep Date: 2020-12-16 Impact factor: 9.423
Authors: Ryan A Flynn; Julia A Belk; Yanyan Qi; Yuki Yasumoto; Jin Wei; Mia Madel Alfajaro; Quanming Shi; Maxwell R Mumbach; Aditi Limaye; Peter C DeWeirdt; Cameron O Schmitz; Kevin R Parker; Elizabeth Woo; Howard Y Chang; Tamas L Horvath; Jan E Carette; Carolyn R Bertozzi; Craig B Wilen; Ansuman T Satpathy Journal: Cell Date: 2021-03-11 Impact factor: 66.850
Authors: Jason D Fernandes; Angie S Hinrichs; Hiram Clawson; Jairo Navarro Gonzalez; Brian T Lee; Luis R Nassar; Brian J Raney; Kate R Rosenbloom; Santrupti Nerli; Arjun A Rao; Daniel Schmelter; Alastair Fyfe; Nathan Maulding; Ann S Zweig; Todd M Lowe; Manuel Ares; Russ Corbet-Detig; W James Kent; David Haussler; Maximilian Haeussler Journal: Nat Genet Date: 2020-10 Impact factor: 38.330
Authors: Neal G Ravindra; Mia Madel Alfajaro; Victor Gasque; Nicholas C Huston; Han Wan; Klara Szigeti-Buck; Yuki Yasumoto; Allison M Greaney; Victoria Habet; Ryan D Chow; Jennifer S Chen; Jin Wei; Renata B Filler; Bao Wang; Guilin Wang; Laura E Niklason; Ruth R Montgomery; Stephanie C Eisenbarth; Sidi Chen; Adam Williams; Akiko Iwasaki; Tamas L Horvath; Ellen F Foxman; Richard W Pierce; Anna Marie Pyle; David van Dijk; Craig B Wilen Journal: PLoS Biol Date: 2021-03-17 Impact factor: 8.029