Literature DB >> 33256807

Pervasive generation of non-canonical subgenomic RNAs by SARS-CoV-2.

Jason Nomburg1,2,3, Matthew Meyerson4,5,6,7, James A DeCaprio8,9,10.   

Abstract

BACKGROUND: n class="Species">SARS-CoV-2, a positive-sense RNA virus in the family Coronaviridae, has caused a worldwide pandemic of coronavirus disease 2019 or COVID-19. Coronaviruses generate a tiered series of subgenomic RNAs (sgRNAs) through a process involving homology between transcriptional regulatory sequences (TRS) located after the leader sequence in the 5' UTR (the TRS-L) and TRS located near the start of ORFs encoding structural and accessory proteins (TRS-B) near the 3' end of the genome. In addition to the canonical sgRNAs generated by SARS-CoV-2, non-canonical sgRNAs (nc-sgRNAs) have been reported. However, the consistency of these nc-sgRNAs across viral isolates and infection conditions is unknown. The comprehensive definition of SARS-CoV-2 RNA products is a key step in understanding SARS-CoV-2 pathogenesis.
METHODS: Here, we report an integrative analysis of eight independent SARS-CoV-2 transcriptomes generated using three sequencing strategies, five host systems, and seven viral isolates. Read-mapping to the SARS-CoV-2 genome was used to determine the 5' and 3' coordinates of all junctions in viral RNAs identified in these samples.
RESULTS: Using junctional abundances, we show nc-sgRNAs make up as much as 33% of total sgRNAs in cell culture models of infection, are largely consistent in abundance across independent transcriptomes, and increase in abundance over time during infection. By assessing the homology between sequences flanking the 5' and 3' junction points, we show that nc-sgRNAs are not associated with TRS-like homology. By incorporating read coverage information, we find strong evidence for subgenomic RNAs that contain only 5' regions of ORF1a. Finally, we show that non-canonical junctions change the landscape of viral open reading frames.
CONCLUSIONS: We identify canonical and non-canonical junctions in SARS-CoV-2 sgRNAs and show that these RNA products are consistently generated by many independent viral isolates and sequencing approaches. These analyses highlight the diverse transcriptional activity of SARS-CoV-2 and offer important insights into SARS-CoV-2 biology.

Entities:  

Keywords:  COVID-19; Direct RNA sequencing; SARS-CoV-2; Transcription

Mesh:

Substances:

Year:  2020        PMID: 33256807      PMCID: PMC7704119          DOI: 10.1186/s13073-020-00802-w

Source DB:  PubMed          Journal:  Genome Med        ISSN: 1756-994X            Impact factor:   11.117


Background

Severe acute respiratory syndrome coronavirus 2 (n class="Species">SARS-CoV-2) emerged from Wuhan, China, in December 2019 and rapidly led to a worldwide pandemic, known as COVID-19 [1]. The initial sequencing report found that this virus maintains 89.1% nucleotide identity to SARS-like coronaviruses found in bats in China [1]. Coronaviruses (n class="Species">CoV) including SARS-CoV-2 contain large, ~ 30 kb, positive-sense, single-stranded RNA genomes with a unique genome organization structure. The 5′ end of the genome contains two large open reading frames (ORFs), ORF1a and ORF1b. ORF1a produces a large polyprotein, while ribosomal slippage at an RNA pseudoknot structure and slippery sequence at the end of ORF1a occasionally leads to a frameshift and subsequent translation of a joint ORF1a and ORF1b polyprotein [2]. ORF1a and ORF1ab polyproteins contain protease activity that can cleave the polyproteins into a variety of nonstructural proteins (Nsp) capable of facilitating viral transcription and replication and modulating host transcription and translation. Following ORF1b, there is an arrangement of ORFs encoding structural and accessory proteins that varies in content and order across the CoV family. While structural proteins are incorporated into emergent virions, accessory proteins are thought to be dispensable for replication in cell culture but to increase viral fitness in vivo [3-5]. Initial translation of ORF1ab caene">n occur directly from iene">ncomiene">ng viral geene">nomic RNA, geene">neratiene">ng the n class="Gene">ORF1a and ORF1ab polyproteins and initiating infection. However, because the viral genome is so large, and because downstream ORFs are far from the 5′ end of the genome, effective translation of these ORFs requires the generation of subgenomic RNAs (sgRNAs) [6]. To generate these sgRNAs, it is thought that CoV undergoes a process of discontinuous extension during synthesis of negative-stranded RNA templates. As the virus-encoded RNA-dependent RNA polymerase progresses from the 3′ end towards the 5′ end of the viral genome, it encounters transcriptional regulatory sequences (TRS) upstream of each major ORF (TRS-B). Here, the polymerase can skip to a similar TRS that is 3′ of a shared leader sequence within the 5′ UTR of the CoV genome (TRS-L). These antisense sgRNAs are then transcribed, resulting in a series of tiered sgRNAs containing the same 5′ leader sequence [6]. Several studies have characterized SARS-CoV-2 sgRNAs usiene">ng Naene">nopore direct RNA sequeene">nciene">ng (dRNAseq). Iene">n contrast to short-read sequeene">nciene">ng, which caene">n produce read pairs of 150 bp each, dRNAseq is capable of directly sequeene">nciene">ng eene">ntire traene">nscripts iene">ncludiene">ng the full leene">ngth of the viral geene">nome (29,903 bases). This allows dRNAseq to ideene">ntify RNA isoforms aene">nd variaene">nts that may be challeene">ngiene">ng to deconvolute usiene">ng short-read sequeene">nciene">ng. Usiene">ng naene">nopore sequeene">nciene">ng, Taiaroa et al. ideene">ntified eight major sgRNAs, as well as a small number of non-caene">nonical juene">nctions betweeene">n the 5′ leader sequeene">nce aene">nd downstream regions of the n class="Species">SARS-CoV-2 genome [7]. Kim et al. identified a total of 9 distinct subgenomic RNAs, as well as a series of non-canonical subgenomic RNAs (nc-sgRNAs) that contain unexpected junctions between 5′ and 3′ sequences [8]. In addition, they confirmed the existence of non-canonical junctions at various sites across the genome using short-read DNA nanoball sequencing [8]. Davidson et al. identified a series of transcripts encoding the nucleocapsid (N) protein, but with distinct small internal deletions, and validated this finding with proteomic evidence [9]. While these previous studies identified non-canonical RNA junctions and unexpected RNA species in isolation, there have been no integrative analyses of the SARS-CoV-2 juene">nctions. It is uene">nclear if the ideene">ntified non-caene">nonical juene">nctions are broadly represeene">ntative of n class="Species">SARS-CoV-2 biology or if they are a result of dataset- or isolate-specific artifacts. Here, we report an integrative analysis of eight independent SARS-CoV-2 transcriptomes generated using three sequencing strategies, five host systems, and seven viral isolates to comprehensively characterize the landscape of canonical and non-canonical sgRNAs generated by SARS-CoV-2.

Methods

Sequencing data

Details on all studied transcriptomes are listed in Table 1.
Table 1

Information on transcriptomes analyzed in this study

SourceVirus isolate (MOI)(details on stock generation)Infection hostTimepointAccession or DOITechnology
Taiaroa et al. [7]

Australia/VIC01/2020 (MOI: Unknown)

(Passage unknown)

Vero cellsUnknownSRR11350376dRNAseq
Kim et al. [8]

Korea/KCDC02/2020 (MOI: 0.05)

(4th passage. No plaque isolation)

Vero cells24 hpi10.17605/OSF.IO/8F6N9dRNAseq
Davidson et al. [9]

England/VE6-T/2020 (MOI: Unknown)

(2nd passage. No plaque isolation)

Vero cellsUnknown10.5281/zenodo.3722580dRNAseq
Finkel et al. [10]

GISAID Acc. No. EPI_ISL_406862 (MOI: 0.2)

(4th passage. No plaque isolation)

Vero cells5 hpiSRR11713354Illumina PolyA
Finkel et al. [10]

GISAID Acc. No. EPI_ISL_406862 (MOI: 0.2)

(4th passage. No plaque isolation)

Vero cells24 hpiSRR11713362Illumina PolyA
Blanco-Melo et al. [11]

USA/WA1/2020 (MOI: 0.2)

(Passage unknown)

A549-ACE2 cells24 hpiSRR11517741Illumina PolyA
Blanco-Melo et al. [11]

USA/WA1/2020 (5 × 104 PFU/ferret)

(Passage unknown)

Ferret (Nasal washings)3 dpiSRR11517855- SRR11517858Illumina PolyA
Suzuki et al. [12]

SARS-CoV-2/Hu/DP/Kng/19-020 (5 × 104 PFU/100 organoids)

(Passage unknown but plaque isolated)

Bronchial organoids5 dpiSRR11811022Illumina PolyA
Emanuel et al. [13]

Frankfurt strain AY310120.1 OR BetaCoV/Munich/BavPat1/2020|EPI_ISL_406862 (MOI: 0.33)

(Passage unknown)

Calu34 hpiSRR11550047Illumina Total RNA
Emanuel et al. [13]

Frankfurl strain AY310120.1 OR BetaCoV/Munich/BavPat1/2020|EPI_ISL_406862 (MOI: 0.33)

(Passage unknown)

Calu324 hpiSRR11550045Illumina Total RNA
Viehweger et al. [14]

CoV-229E (MOI: 3)

(Passage unknown)

Huh7 cells24 hpiERR3460961dRNAseq

hpi hours post infection, dpi days post infection

Information on transcriptomes analyzed in this study Australia/VIC01/2020 (MOI: Unknown) (Passage unknown) Korea/KCDC02/2020 (MOI: 0.05) (4th passage. No plaque isolation) England/VE6-T/2020 (MOI: Unknown) (2nd passage. No plaque isolation) GISAID Acc. No. EPI_ISL_406862 (MOI: 0.2) (4th passage. No plaque isolation) GISAID Acc. No. EPI_ISL_406862 (MOI: 0.2) (4th passage. No plaque isolation) USA/WA1/2020 (MOI: 0.2) (Passage unknown) USA/WA1/2020 (5 × 104 PFU/ferret) (Passage unknown) SARS-CoV-2/Hu/DP/Kng/19-020 (5 × 104 PFU/100 orgaene">noids) (Passage unknown but plaque isolated) Frankfurt strain AY310120.1 OR BetaCoV/Muene">nich/BavPat1/2020|EPI_ISL_406862 (MOI: 0.33) (Passage unknown) Frankfurl strain AY310120.1 OR BetaCoV/Muene">nich/BavPat1/2020|EPI_ISL_406862 (MOI: 0.33) (Passage unknown) CoV-229E (MOI: 3) (Passage unknown) hpi hours post infection, dpi days post n class="Disease">infection In the case of the Finkel et al. and Emanual et al. transcriptomes, the 24-hpi samples were used in all analyses except for the investigation of junction changes over time.

Determining junction coordinates

Reads in all transcriptomes were mapped against the SARS-CoV-2 refereene">nce geene">nome (accession NC_045512.2) usiene">ng miene">nimap2 [15] aene">nd custom parameters detailed iene">n miene">nimap_sars2.sh aene">nd iene">nspired by Kim et al. [8], aene">nd uene">nmapped reads were discarded. The exceptions are the traene">nscriptomes from Fiene">nkel et al. aene">nd Emaene">nuel et al.—because their read leene">ngths were very short (60 aene">nd 72 bases iene">n leene">ngth respectively), mappiene">ng was conducted with n class="Gene">STAR [16] using settings detailed in star.sh. The 5′ and 3′ junction positions of junction-spanning reads were determined with generate_synthetic_transcripts.py. This script reads a BAM file and the SARS-CoV-2 geene">nome fasta aene">nd, by parsiene">ng the BAM aligene">nmeene">nt CIGAR aene">nd n class="Gene">start position, determines the start, end, and junction coordinates of the alignment. Any minimap2-called deletion or intron of at least 1000 bases in length was considered a junction. Junction coordinates were then written to a tab-delimited output file for further analysis.

Analysis and visualization of junctions

The 5′ and 3′ coordinates of each junction were processed in R [17], and figures generated with ggplot2 [18]. Gene maps were generated using the gggenes R package [19]. For the global distribution of reads in each transcriptome, specific 5′–3′ junctions that occur at least twice are represented as arcs. Each inverted peak represents a histogram of 5′ or 3′ junctions at each genome position, with a bin size of 100 bases—this means that the magnitude of each peak represents the total number of 5′ or 3′ junctions within each 100 base span. Code is global_junctions.Rmd. The distribution of major 3′ junctions was generated by plotting a histogram of 3′ junction positions that occur after position 21000 and that have a 5′ end prior to position 100. Bin size is 20 bases. Code is canonical_3prime_junctions.Rmd. The percentage of non-canonical junctions in each was calculated by first assigning each individual 5′–3′ junction as canonical or non-canonical. A junction was considered to be canonical if its 5′ position was within 20 bases of the TRS-L core sequence ACGAAC (startiene">ng at position 69), aene">nd its 3′ juene">nction was withiene">n 15 bases of a TRS-B (defiene">ned by the preseene">nce of the TRS core sequeene">nce). The perceene">ntage of total juene">nctions that were assigene">ned as non-caene">nonical was theene">n determiene">ned. Code is global_juene">nctions.Rmd.

Identifying dataset-specific junctions

The consistency of non-canonical junctions across transcriptomes was determined separately for 5′ and 3′ junctions based on one replicate each from Taiaroa et al., Kim et al., Davidson et al., Finkel et al., and Blanco-Melo et al. transcriptomes generated from cells infected with n class="Species">SARS-CoV-2 in cell culture. For each SARS-CoV-2 genome position, the percentage of non-canonical junctions at that site was calculated for each transcriptome (X). The mean (μ) and standard deviation (σ) of percentages at each site across the five transcriptomes were then calculated. For each position in each transcriptome, a Z-score was calculated by subtracting the observed percentage by the mean and dividing by the standard deviation ((X − μ)/σ). The landscape of non-canonical junctions was determined by splitting the SARS-CoV-2 genome into sequential 100 nt spans and calculating the percentage of non-canonical junctions with a 5′ or 3′ end in each span. Z-scores were calculated in a similar manner as base-by-base Z-scores, but by comparing percentages in each 100 nt span across one replicate of the five transcriptomes listed above. Code is dataset_specific_junctions.Rmd.

Change in junctions over time

The change in junctions over time was determined based on datasets from Finkel et al. (5 hpi and 24 hpi) and Emanuel et al. (4 hpi and 24 hpi). The percentage of junctions that originate at each genome position was determined for each timepoint, and the absolute difference between the two values was calculated. Notably, this analysis focused on 5′ and 3′ counts at each position separately, such that a percentage of 5′ junctions and a percentage of 3′ junctions at each position was determined. To determine the aggregated change in each class of junctions, each 5′–3′ junction was assigned to a class based on the location of its 5′ and 3′ position. If a junction’s 5′ position was located within 20 bases of the TRS-L, and its 3′ position was located within 15 bases of a TRS-B, it was assigned to the ORF closest to the TRS-B. Otherwise, the junction was assigned as non-canonical. The percentage of total junctions in each class at the early timepoint was subtracted from the class percentages at the late timepoint for each dataset. Code is junctions_over_time.Rmd.

Assessing homology near 5′ and 3′ positions of each junction

The identity and length of homologous sequences flanking the 5′ and 3′ sites of each junction were determined by identifying the 30 bases flanking each junction site. The longest identical subsequence between these two 30 base sequences was then collected. The most common homology sequence was plotted for each group of junctions. Here, a junction was considered canonical if it has a 3′ end within 15 bases of a canonical TRS-B or was assigned as “Within ORF” if it was otherwise within an ORF. The exception were junctions with a 5′ end within ORF1a—these were assigene">ned to n class="Gene">ORF1a. The total distribution of homology lengths was also determined for canonical and non-canonical junctions. To determine the random distribution of homology lengths, all 30 base sequences (30mers) present in the SARS-CoV-2 genome were determined. The longest identical subsequence between 100,000 random pairs of 30mers that are derived from at least 1000 bases apart on the genome was determined. Code is find_trs_lengths.py and TRS_plotting.Rmd.

Assessment of excess 5′ ORF1a genome coverage

Coverage at each position was assessed from n class="Species">SARS-CoV-2-mapped BAMs using the depth subcommand of samtools [20] (Command: samtools depth -aa -d0 file.bam > file.cov). Coverage and cumulative number of 5′ junctions (excluding junctions occurring before position 100) were plotted. Code is ORF1a_coverage.Rmd.

Analysis of ORFs on dRNAseq reads

generate_synthetic_transcripts.py was used to generate “coordinate-derived transcripts” from the SARS-CoV-2 mapped BAMs from the Taiaroa et al., Kim et al., aene">nd Davidson et al. dRNAseq traene">nscriptomes. This script reads a BAM file aene">nd the n class="Species">SARS-CoV-2 genome fasta and, by parsing the BAM alignment CIGAR and start position, determines the start, end, and junction coordinates of the alignment. Coordinate-derived transcripts were created by using these coordinates to retrieve the associated sequence from the SARS-CoV-2 genome. Any minimap2-called deletion or intron of at least 1000 bases in length was considered a junction. All coordinate-derived transcripts were then aligned against a minimal leader sequence fragment (5′–CTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTC–3′) using blastn [21]. This minimal sequence was used instead of the full leader sequence because inspection of 5′ coverage in the three long-read datasets revealed a significant drop in coverage at the end of this sequence. Transcripts with an alignment of at least 45 nucleotides in length and a percent identity of at least 85% and that contain the start of this sequence within the first 50 nucleotides of the transcript were kept. Transcripts not containing this leader sequence were discarded. ORFs were called directly from each transcript using Prodigal [22], using parameters listed in prodigal.sh. ORFs were allowed to initiate from any nUG start codon, were not required to be the most 5′ ORF on each traene">nscript, aene">nd were only reported if they are at least 30 codons iene">n leene">ngth. Each ORF was traene">nslated iene">nto amiene">no acid sequeene">nce usiene">ng prodigal_to_orf_direct.py. If multiple Prodigal-predicted ORFs contaiene">ned aene">n overlappiene">ng amiene">no acid sequeene">nce, the longest ORF was output. Each ORF was output iene">n fasta format aene">nd labeled with traene">nscript of origiene">n aene">nd coordiene">nates of the ORF on the traene">nscript. The amino acid sequence of each ORF was then mapped against canonical SARS-CoV-2 proteiene">ns (from accession NC_045512.2) as well as Prodigal-predicted uene">naene">nnotated ORFs from the n class="Species">SARS-CoV-2 genome using DIAMOND [23]. DIAMOND parameters are described in diamond.sh and include an E-value threshold of 10. During downstream analysis, alignments were considered valid if the aligned portions maintained at least 95% amino acid identity to the reference protein. Various statistics of each ORF, including alignment information, the start position of the ORF on the traene">nscript, aene">nd fusion iene">nformation, were geene">nerated from the DIAMOND aligene">nmeene">nt file usiene">ng parse_orf_assigene">nmeene">nts.py. Aene">n ORF was considered variaene">nt if it was not the same leene">ngth as the caene">nonical proteiene">n, with a notable exception. Because there are alterene">native n class="Gene">start codons upstream of some annotated genes, and Prodigal would call ORFs using these start codons if they are present on a transcript, ORFs were allowed to have up to 20 amino acids of additional N-terminal amino acids while still being classified as canonical if they end at the expected position. If an ORF had matches against multiple SARS-CoV-2 proteins, it was given a primary assignment of the protein with the longest alignment length.

Results

Analysis of three independent dRNAseq datasets reveals canonical and non-canonical junctions

To determine the major RNA species present in SARS-CoV-2-infected cells, we applied a uene">niform computational approach to three iene">ndepeene">ndeene">nt dRNAseq datasets from Taiaroa et al., Kim et al., aene">nd Davidson et al. [7-9]. To assess sgRNA preseene">nce, we ideene">ntified juene">nction-spaene">nniene">ng reads followiene">ng read-mappiene">ng agaiene">nst the n class="Species">SARS-CoV-2 reference genome. Analysis of the 5′ and 3′ locations of junctions separated by at least 1000 bases revealed that the majority of junctions consist of a 5′ end originating in the 5′ UTR and a 3′ end at distinct sites in the 3′ region of the genome (Fig. 1a–c). These represent the canonical junctions between the leader sequence and 3′ sequences that generate the major sgRNAs [6]. Plotting the counts of 3′ junctions that contain a 5′ end within the first 100 bases revealed eight prevalent junction points along the 3′ end of the genome (Fig. 1d–f), all of which are proximal to the TRS core sequence ACGAAC. Interestingly, we observed that the 3′ junction peak and TRS-B closest to ORF6 laene">nds 151 bases withiene">n the M ORF. Based on these juene">nction aene">nalyses aene">nd the locations of ideene">ntified TRS body sites, predicted sgRNAs aene">nd their most 5′ geene">ne or geene">nes are shown iene">n Fig. 1g. We predict 8 major sgRNAs based both on ideene">ntification of the TRS core sequeene">nce aene">nd preseene">nce of a domiene">naene">nt juene">nction peak. Iene">n contrast to Kim et al. [8] but iene">n concordaene">nce with Taiaroa et al. [7] aene">nd Davidson et al. [9], we do not fiene">nd evideene">nce of a promiene">neene">nt 3′ juene">nction poiene">nt near n class="Gene">ORF7B (Fig. 1d–f).
Fig. 1

SARS-CoV-2 generates a defined population of canonical subgenomic RNAs. a–c For each location on the viral genome, a histogram of 5′ and 3′ junctions at that position was calculated and plotted as an inverse peak. The histogram bin size is 100 bases, meaning each inverse peak represents the cumulative count of 5′ or 3′ junctions occurring within that span. Curved lines represent the 5′ and 3′ locations of junctions that occur at least twice. Red curves represent canonical junctions, black curves represent non-canonical junctions. “Taiaroa” [7], “Kim” [8], and “Davidson” [9] represent the three independent SARS-CoV-2 dRNAseq datasets investigated. d–f A histogram of 3′ junctions past position 21000 with a 5′ end before position 100 are plotted. Dashed lines indicate the start coordinates of annotated viral genes. Bin size is 20 bases. g Based on junction analysis, the predicted major species of virus-produced RNAs are represented. The most 5′ gene or genes on each subgenomic RNA is listed

SARS-CoV-2 geene">nerates a defiene">ned population of caene">nonical subgeene">nomic RNAs. a–c For each location on the viral geene">nome, a histogram of 5′ aene">nd 3′ juene">nctions at that position was calculated aene">nd plotted as aene">n iene">nverse peak. The histogram biene">n size is 100 bases, meaene">niene">ng each iene">nverse peak represeene">nts the cumulative couene">nt of 5′ or 3′ juene">nctions occurriene">ng withiene">n that spaene">n. Curved liene">nes represeene">nt the 5′ aene">nd 3′ locations of juene">nctions that occur at least twice. Red curves represeene">nt caene">nonical juene">nctions, black curves represeene">nt non-caene">nonical juene">nctions. “Taiaroa” [7], “Kim” [8], aene">nd “Davidson” [9] represeene">nt the three iene">ndepeene">ndeene">nt n class="Species">SARS-CoV-2 dRNAseq datasets investigated. d–f A histogram of 3′ junctions past position 21000 with a 5′ end before position 100 are plotted. Dashed lines indicate the start coordinates of annotated viral genes. Bin size is 20 bases. g Based on junction analysis, the predicted major species of virus-produced RNAs are represented. The most 5′ gene or genes on each subgenomic RNA is listed

Many SARS-CoV-2 junctions are non-canonical

Similar to observations by Kim et al. [8], we identified an unexpectedly diffuse pattern of non-canonical junctions across the genome (Fig. 1a–c, black arcs), suggestive of many unexpected RNA species. Had only canonical junctions been observed, the arcs on this plot would be exclusively between the first 100 bases and defined points on the x-axis (Fig. 1a–f). Because the transcriptomes in these long-read sequencing studies [7-9] all consist of dRNAseq from infected Vero cells, we sought to confirm this observation usiene">ng iene">ndepeene">ndeene">nt traene">nscriptomes from other n class="Species">SARS-CoV-2 isolates, in other host systems, and with other sequencing technologies. To this end, we assessed five additional transcriptomes generated in Vero cells [10], A549 cells [11], Calu3 cells [13], bronchial organoids [12], and ferret nasal washings [11] using Illumina PolyA or total RNA sequencing. Information regarding each sample is detailed in Table 1. Junction analysis in these transcriptomes likewise revealed pervasive non-canonical junctions (Additional file 1: Fig. S1A-E). Quantification of the relative abundance of canonical and non-canonical junctions in these transcriptomes revealed that up to 33% of junctions were non-canonical in cell culture (Fig. 2a, Additional file 1: Fig. S1F). In addition, the transcriptome from a ferret’s nasal washing contained 51% non-canonical junctions, indicating that non-canonical junctions are also produced in vivo.
Fig. 2

Identification of few prominent dataset-specific non-canonical junctions. a Percentage of junctions that are non-canonical in five independent datasets. Junctions were assigned as canonical if their 5′ location was within 20 bases of the TRS-L and their 3′ location within 15 bases of a TRS-B, and otherwise assigned as non-canonical. Taiaroa, Kim, and Davidson are dRNAseq, while Finkel and Blanco-Melo are Illumina PolyA RNAseq. b Illustration of computational approach to determine the consistency of 5′ and 3′ junction points across independent datasets. The percentage of non-canonical junctions at each genome position in each dataset was determined (X). The mean (μ) and standard deviation (σ) of percentages at each position across the five datasets was calculated. For each position in each dataset, the Z-score was calculated as (X − μ)/σ (i.e., the number of standard deviations away from the mean), and the percentages and Z-scores for each position in each dataset were plotted. c, d Each point represents the percentage and Z-score of one position in one dataset. For each position in the SARS-CoV-2 genome, the percentage and Z-score of non-canonical junctions with a 5′ end (c) or a 3′ end (d) was determined as described above for five independent datasets: Taiaroa, Kim, Davidson, Finkel, and Blanco-Melo. Points with a percentage above 4% of non-canonical junctions and a Z-score above 1 were highlighted

Identification of few prominent dataset-specific non-canonical junctions. a Percentage of junctions that are non-canonical in five independent datasets. Junctions were assigned as canonical if their 5′ location was within 20 bases of the TRS-L and their 3′ location within 15 bases of a TRS-B, and otherwise assigned as non-canonical. Taiaroa, Kim, and Davidson are dRNAseq, while Finkel and Blanco-Melo are Illumina n class="Chemical">PolyA RNAseq. b Illustration of computational approach to determine the consistency of 5′ and 3′ junction points across independent datasets. The percentage of non-canonical junctions at each genome position in each dataset was determined (X). The mean (μ) and standard deviation (σ) of percentages at each position across the five datasets was calculated. For each position in each dataset, the Z-score was calculated as (X − μ)/σ (i.e., the number of standard deviations away from the mean), and the percentages and Z-scores for each position in each dataset were plotted. c, d Each point represents the percentage and Z-score of one position in one dataset. For each position in the SARS-CoV-2 genome, the percentage and Z-score of non-canonical junctions with a 5′ end (c) or a 3′ end (d) was determined as described above for five independent datasets: Taiaroa, Kim, Davidson, Finkel, and Blanco-Melo. Points with a percentage above 4% of non-canonical junctions and a Z-score above 1 were highlighted We considered if non-canonical junctions were driven by dataset-specific artifacts. For example, it is possible that defective particles containing defective viral genomes (DVGs), often with large internal deletions that increase their replicative fitness, could accumulate duriene">ng preparation of virion n class="Species">stocks [24] and generate the observed non-canonical junctions during subsequent infections. If this was true, we would expect to observe non-canonical junctions specific to a single virion preparation and accumulation of these specific non-canonical junctions over the course of infection. To address this possibility, we identified dataset-specific junctions by identifying junctions that are differentially abundant across datasets (Fig. 2b). While there is evidence of position-by-position variability across transcriptomes, this analysis revealed that there are only six cases where a junction site in one transcriptome is both highly abundant (> 4% of non-canonical junctions) and variable (Z > 1) across five independent SARS-CoV-2 sequencing studies [7-11] (Fig. 2c, d). To further investigate the distribution of non-canonical junctions in each transcriptome, we determined the percentage of non-canonical junctions in sequential 100 nt spans across the SARS-CoV-2 genome (Additional file 1: Fig. S2). This analysis revealed that, while there is some heterogeneity in non-canonical junction distributions in some datasets, the general landscape of non-canonical junctions is similar (Additional file 1: Fig. S2C, S2D). Together, these data suggest that non-canonical junctions are not entirely driven by dataset-specific artifacts. Of note, we found that three 5′ junction sites in the Finkel et al. transcriptome and three 3′ junctions in the Davidson et al. transcriptome were overrepresented (Fig. 2c, d). Incorporating additional replicates of these transcriptomes and assessing the non-canonical junction landscape of each transcriptome revealed that the Finkel et al.-specific 5′ junctions are highly elevated in both Finkel et al. replicates relative to other transcriptomes (Additional file 1: Fig. S2C). In contrast, the three Davidson et al.-specific 3′ junctions are present in other datasets (Additional file 1: Fig. S2D). The consistency of the Finkel et al.-specific junctions across replicates and exclusion from other samples raises the possibility that these junctions were generated from defective particles in the Finkel et al. virion preparation.

The proportion of non-canonical junctions increases with time after infection

Next, we assessed changes in junctions over time using the approach detailed in Fig. 3 using transcriptomes from early (4 or 5 h post infection) aene">nd late (24 h post n class="Disease">infection) timepoints from Finkel et al. [10] and Emanual et al. [13]. This analysis revealed that there are large decreases in 5′ junctions originating at position 64, and in 3′ junctions ending in 28254 (proximal to the N TRS-B), with a smaller decrease in 3′ junctions ending at 26467 (proximal to the M TRS-B) (Fig. 3b, c). Notably, the Finkel et al.-specific 5′ junction at 16971 increased in relative abundance over time. In contrast, there was no prominent increase in the relative abundance of junctions at other positions (Fig. 3b, c). Instead, there are small but widespread increases in non-canonical junction sites throughout the SARS-CoV-2 genome (Fig. 3d, e). Quantification of the total percent changes of canonical and non-canonical junctions revealed a large decrease in canonical N junctions and increase in non-canonical junctions (Fig. 3f, g). These results indicate that the abundance of non-canonical junctions increases over time in cell culture. It is unclear if a similar pattern occurs during infection in vivo.
Fig. 3

Non-canonical junctions accumulate over time in cell culture. a Illustration of computational approach used to determine change in junction percentages over time. For each position in the SARS-CoV-2 genome and separately for 5′ and 3′ junctions, the number of junctions at that position was calculated. Using these numbers, the percentage of 5′ or 3′ junctions falling at each position was calculated. The change in junction percentage at each position is defined as the difference between the position’s junction percentage at the late and early timepoints, and this junction change was plotted for each position. b–e The change in junction percentage for 5′ (orange) and 3′ (blue) junctions over time in the Finkel (b, d) and Emanuel (c, e) datasets was determined as described above. Changes in the percentage of junctions Positions with at change greater than 2.5% are annotated with text on the plot. Panels d and e are zoomed in versions of b and c. f, g Junctions were assigned into groups based on their 5′ and 3′ junction positions. If a junction had a 5′ end within 20 bases of the TRS-L and within 15 bases of a TRS-B, it was considered a canonical junction belonging to the ORF with a start closest to the TRS-B. Otherwise, it was considered non-canonical. The percentage of junctions falling into each category was calculated for early and late timepoints, and the difference between each category’s percentage in the late vs early timepoint was plotted

Non-canonical junctions accumulate over time in cell culture. a Illustration of computational approach used to determine change in junction percentages over time. For each position in the SARS-CoV-2 geene">nome aene">nd separately for 5′ aene">nd 3′ juene">nctions, the number of juene">nctions at that position was calculated. Usiene">ng these numbers, the perceene">ntage of 5′ or 3′ juene">nctions falliene">ng at each position was calculated. The chaene">nge iene">n juene">nction perceene">ntage at each position is defiene">ned as the differeene">nce betweeene">n the position’s juene">nction perceene">ntage at the late aene">nd early timepoiene">nts, aene">nd this juene">nction chaene">nge was plotted for each position. b–e The chaene">nge iene">n juene">nction perceene">ntage for 5′ (oraene">nge) aene">nd 3′ (blue) juene">nctions over time iene">n the Fiene">nkel (b, d) aene">nd Emaene">nuel (c, e) datasets was determiene">ned as described above. Chaene">nges iene">n the perceene">ntage of juene">nctions Positions with at chaene">nge greater thaene">n 2.5% are aene">nnotated with text on the plot. Paene">nels d aene">nd e are zoomed iene">n versions of b aene">nd c. f, g Juene">nctions were assigene">ned iene">nto groups based on their 5′ aene">nd 3′ juene">nction positions. If a juene">nction had a 5′ eene">nd withiene">n 20 bases of the TRS-L aene">nd withiene">n 15 bases of a TRS-B, it was considered a caene">nonical juene">nction belongiene">ng to the ORF with a n class="Gene">start closest to the TRS-B. Otherwise, it was considered non-canonical. The percentage of junctions falling into each category was calculated for early and late timepoints, and the difference between each category’s percentage in the late vs early timepoint was plotted

Non-canonical junctions are not associated with TRS-like homology

While homology between the TRS-L and TRS-B is thought to be critical for generation of canonical sgRNAs, the role of TRS-like homology in the generation of nc-sgRNAs is unclear. To address this question, we determined the 30 bases surrounding the 5′ and 3′ sites of each junction and identified the longest continuous subsequence in common between these sequence (Fig. 4a). We found that junctions with a 3′ end landing within 15 nucleotides of a TRS-B overwhelmingly have junction homology associated with variants of the TRS core sequence (Fig. 4b–d, blue). In contrast, junctions landing within ORFs and not proximal to a TRS-B do not have a prevalent homologous sequence (Fig. 4b–d, red). The exception are intra-M junctions, whose homology is a consequence of the location of the ORF6 TRS withiene">n the M ORF.
Fig. 4

Non-canonical junctions are not associated with TRS-like homology. a Illustration of computational approach used to assess homologous sequences between the 5′ and 3′ junction points. For each of the three dRNAseq datasets, the 30 bases flanking the 5′ and 3′ junction points were assessed, and the longest homologous sequence between these two 30 base regions was determined. b–d Homologous sequences present in the 15 bases on either side of the 5′ and 3′ ends of each junction. Junctions are classified by the location of their 3′ end—if this is within 15 bases of the canonical TRS site or if it falls within a gene it is assigned accordingly. The only exception is ORF1a—junctions with a 5′ end originating in ORF1a are assigned to ORF1a. Labels represent the most common homologous sequence between the ends of each junction. The core TRS sequence ACGAAC is underlined. e–g The length of the longest homologous sequence for canonical (C) and non-canonical (NC) are plotted. Here, junctions were considered canonical if they have a 5′ end within 20 nucleotides of the TRS-L and a 3′ end within 15 nucleotides of a TRS-B. (R) represents random homology lengths. The points for (R) were calculated by first extracting all possible 30 base sequences (30mers) from the SARS-CoV-2 genome, and then assessing the length of the longest homologous sequence between 100,000 random pairs of 30mers that are separated by at least 1000 bases on the genome. Within each column, the relative width of each band represents the relative abundance of junctions with each homology length. The value at the top of each column is the mean homology length

Non-canonical junctions are not associated with TRS-like homology. a Illustration of computational approach used to assess homologous sequences between the 5′ and 3′ junction points. For each of the three dRNAseq datasets, the 30 bases flanking the 5′ and 3′ junction points were assessed, and the longest homologous sequence between these two 30 base regions was determined. b–d Homologous sequences present in the 15 bases on either side of the 5′ and 3′ ends of each junction. Junctions are classified by the location of their 3′ end—if this is within 15 bases of the canonical TRS site or if it falls within a gene it is assigned accordingly. The only exception is ORF1a—juene">nctions with a 5′ eene">nd origiene">natiene">ng iene">n n class="Gene">ORF1a are assigned to ORF1a. Labels represent the most common homologous sequence between the ends of each junction. The core TRS sequence ACGAAC is underlined. e–g The length of the longest homologous sequence for canonical (C) and non-canonical (NC) are plotted. Here, junctions were considered canonical if they have a 5′ end within 20 nucleotides of the TRS-L and a 3′ end within 15 nucleotides of a TRS-B. (R) represents random homology lengths. The points for (R) were calculated by first extracting all possible 30 base sequences (30mers) from the SARS-CoV-2 genome, and then assessing the length of the longest homologous sequence between 100,000 random pairs of 30mers that are separated by at least 1000 bases on the genome. Within each column, the relative width of each band represents the relative abundance of junctions with each homology length. The value at the top of each column is the mean homology length While there is not a single strong motif driving non-canonical, intra-ORF junctions, it is still possible that non-canonical junctions are associated with junction-specific homology between the 5′ and 3′ junction sites. To address this possibility, we tallied the length of homology between the 5′ and 3′ sites of canonical and non-canonical junctions. This analysis revealed that canonical junctions have markedly longer homology lengths (mean > 10 bases) than non-canonical junctions (mean < 5 bases) (Fig. 4e–g). Indeed, the homology lengths of non-canonical junctions are only slightly longer than homology lengths between 100,000 random pairs of SARS-CoV-2-derived 30 base sequeene">nces (Fig. 4e–g). These data iene">ndicate that non-caene">nonical juene">nctions are not associated with TRS-like homology.

sgRNAs containing only a 5′ region of ORF1a are produced in all studied cell culture and in vivo infection systems

Assessment of the three dRNAseq transcriptomes revealed evidence of sgRNAs that contain only the 5′ region of ORF1a (Additional file 1: Fig. S3A). This is similar to aene">n observation previously reported by Kim et al. [8]. Iene">nspection of read n class="Species">coverage of ORF1a in these transcriptomes confirmed that there is elevated 5′ ORF1a coverage with specific decreases in coverage concomitantly with sharp increases in junctions near genome positions 1600 and 6000 (Fig. 5a–c).
Fig. 5

Relative abundance of subgenomic RNAs that contain only the 5′ end of ORF1a. a–h Read coverage (black) and cumulative 5′ junctions (red) are plotted for eight independent datasets. The sequencing strategy and sample types are annotated. i A schematic of a representative subgenomic RNA that consists of only the 5′ region of ORF1a. Pileups showing reads can be found in Additional file 1: Fig. S3

Relative abundance of subgenomic RNAs that contain only the 5′ end of ORF1a. a–h Read n class="Species">coverage (black) and cumulative 5′ junctions (red) are plotted for eight independent datasets. The sequencing strategy and sample types are annotated. i A schematic of a representative subgenomic RNA that consists of only the 5′ region of ORF1a. Pileups showing reads can be found in Additional file 1: Fig. S3 While there is clearly higher coverage of the 5′ region of n class="Gene">ORF1a in the three dRNAseq transcriptomes, it remained unclear if these 5′ ORF1a sgRNAs are a major percentage of total ORF1a in the cell, or if this finding was due to a systematic bias favoring shorter reads in nanopore sequencing. To address this question, we assessed ORF1a coverage in four independent Illumina polyA-enriched RNAseq transcriptomes. Analysis of these transcriptomes revealed a similarly elevated 5′ ORF1a coverage and consistent drops in coverage near genome positions 1600 and 6000 (Fig. 5d–g). Importantly, one of these transcriptomes is from bronchial organoids (Fig. 5g) and one transcriptome is from the nasal washings of an infected ferret (Fig. 5f), suggesting that 5′ ORF1a subgenomic RNAs can be produced in vivo and were not only due to artifacts of highly replicative conditions in cell culture. It is likely that polyA selection, as well as a technical bias against long transcripts in nanopore sequencing, has exaggerated the increased density at the 5′ region of ORF1a. However, analysis of short-read transcriptomes, including one generated from RNA that was not polyA-enriched (Fig. 5h), also revealed a consistent elevation in 5′ ORF1a coverage. Together, these data suggest that the excess 5′ ORF1a coverage is not entirely driven by these biases. Across the eight transcriptomes, there is anywhere between 2-fold and 17-fold excess coverage of the 5′ region of n class="Gene">ORF1a when comparing the 5′ and 3′ 1000 bases (median 3.1-fold). We found that there is similar elevated 5′ ORF1a coverage in the circulating human coronavirus CoV 229E [14] (Additional file 1: Fig. S4), supporting the existence of 5′ ORF1a RNAs in other coronaviruses. All together, these data support the hypothesis that subgenomic RNAs containing just the 5′ region of ORF1a are produced by SARS-CoV-2.

Junctions have the potential to generate variant open reading frames

To test the hypothesis that non-canonical junctions can result in variant open reading frames (ORFs) in nc-sgRNAs, we assessed the ORF content of each read in the three dRNAseq transcriptomes. Because dRNAseq can have an error rate of greater than 10% [25], we generated “coordinate-derived transcripts” by determining the start, eene">nd, aene">nd juene">nction coordiene">nates of each raw traene">nscript by mappiene">ng to n class="Species">SARS-CoV-2 (NC_045512.2) and extracting the associated genomic sequence. While this approach is capable of overcoming sequencing errors and can resolve major RNA species, it is not capable of resolving small or structurally complex RNA rearrangements. Because the 5′ end of nanopore reads may be truncated due to technical factors during sequencing [26], we leveraged the assumption that functional sgRNAs should contain the 5′ leader sequence to identify reads generated from complete transcripts. While excluding reads that do not contain the leader sequence may have reduced sensitivity, it prevents identification of variant ORFs as a consequence of truncated reads. ORFs were predicted directly from the coordinate-derived transcripts, and ORFs of unexpected length were identified as “variant.” Assessment of the relative abundance of variant and canonical ORFs revealed evidence of variant ORFs of ORF1a/b, S, aene">nd M (Fig. 6a, b, c).
Fig. 6

Junctions have the potential to generate variant open reading frames. a–c ORFs were predicted directly from transcript-derived reads for the three dRNAseq datasets. Each ORF was aligned against the protein sequences of canonical SARS-CoV-2 genes using the DIAMOND aligner. Variant ORFs were defined as ORFs that were assigned to a canonical SARS-CoV-2 protein but had an unexpected start or stop position, while perfectly aligning ORFs were considered canonical. The percentage of canonical and variant ORFs for each protein is plotted. d–l Schematics of M, N, and ORF1a are displayed with the approximate location of predicted transmembrane domains labeled in red. A histogram of the start and end sites of variant M, N, and ORF1a ORFs are displayed. Start sites of each variant ORF are on top, and end sites are on the bottom of each panel. Percentages represent the percentage of each ORF that is variant. Histogram bin sizes: M, 1; N, 10; ORF1a, 20. NTD: N-terminal domain. RBD: Receptor binding domain. CD: connector domain. NSP: nonstructural protein. m–o The identity of ORF1a-fusion partners is plotted on the Y-axis, with the count of such fusions on the X-axis. The top 10 fusion partners for each sample are represented. Color indicates if the fusion partner is on the N and C terminus of ORF1a, if the terminus is ambiguous, or if the fusion is a “self” fusion between an upstream and downstream region of ORF1a

Junctions have the potential to generate variant open reading frames. a–c ORFs were predicted directly from transcript-derived reads for the three dRNAseq datasets. Each ORF was aligned against the protein sequences of canonical SARS-CoV-2 geene">nes usiene">ng the DIAMOND aligene">ner. Variaene">nt ORFs were defiene">ned as ORFs that were assigene">ned to a caene">nonical n class="Species">SARS-CoV-2 protein but had an unexpected start or stop position, while perfectly aligning ORFs were considered canonical. The percentage of canonical and variant ORFs for each protein is plotted. d–l Schematics of M, N, and ORF1a are displayed with the approximate location of predicted transmembrane domains labeled in red. A histogram of the start and end sites of variant M, N, and ORF1a ORFs are displayed. Start sites of each variant ORF are on top, and end sites are on the bottom of each panel. Percentages represent the percentage of each ORF that is variant. Histogram bin sizes: M, 1; N, 10; ORF1a, 20. NTD: N-terminal domain. RBD: Receptor binding domain. CD: connector domain. NSP: nonstructural protein. m–o The identity of ORF1a-fusion partners is plotted on the Y-axis, with the count of such fusions on the X-axis. The top 10 fusion partners for each sample are represented. Color indicates if the fusion partner is on the N and C terminus of ORF1a, if the terminus is ambiguous, or if the fusion is a “self” fusion between an upstream and downstream region of ORF1a To characterize the identified M, S, and ORF1a ORF variaene">nts, we determiene">ned the stop aene">nd n class="Gene">start points of each ORF variant relative to their canonical counterparts (Fig. 6d–l). This analysis revealed that all theoretical M ORF variants were predicted to start at a UUG start codon 42 amino acids before the end of the M gene and end at the canonical stop codon (Fig. 6d, g, j). M ORF variants make up between 7.9 and 19% of M ORFs in the three dRNAseq datasets and are generated by the major junction point at the ORF6 TRS sequence within the M reading frame (Fig. 1d–f). If translated, M ORF variants would lack the N-terminal transmembrane domains and contain regions predicted to be present on the inside of the virion or in the cytosol. While the efficacy of translation initiation from this start codon is unknown, it has an acceptable Kozak consensus site with an A at the − 3 position. S ORF variants make up between 5.1 and 24.2% of all S ORFs identified in these datasets (Fig. 6a–c). The majority of S ORF variants are predicted from nc-sgRNAs with 3′ junctions landing within the S ORF (Additional file 1: Fig. S2D-F), resulting in N-terminal truncations with theoretical AUG and non-AUG start codons (Fig. 6e, h, k). If traene">nslated, these variaene">nts would lack the N-termiene">nal sigene">nal sequeene">nce [27] aene">nd, giveene">n the variations iene">n ORF n class="Gene">start sites, it is unclear if these ORFs encode functional isoforms. Our identification of ORF1a variaene">nts is consisteene">nt with the observed patterene">ns of 5′ n class="Gene">ORF1a coverage (Fig. 5). The peak end position of identified ORF1a variant ORFs is around amino acid position 500 (Fig. 6f, i, l), consistent with the observed decrease in read coverage around genome position 1600. This position falls within the NSP2 amino acid sequence suggesting that, if translated, many ORF1a ORF variants would consist of NSP1 and N-terminal fragments of NSP2. Interestingly, we found that approximately 1/3 of ORF1a variants are in-frame fusions with downstream ORFs, encoding mostly N (Fig. 6m–o). We were only able to identify full-length ORF1a ORFs in the Kim et al. transcriptome. This likely reflects the fact that full-length ORF1a ORFs are only expected in genome-length RNAs, which are technically challenging to sequence.

Discussion

While prior studies have identified nc-sgRNAs in SARS-CoV-2 [8, 9] aene">nd other n class="Species">coronaviruses [14], there has been no multi-study analysis of SARS-CoV-2 nc-sgRNAs. We conducted an integrative analysis of eight independent SARS-CoV-2 transcriptomes to catalog unexpected, non-canonical RNA species. We show that SARS-CoV-2 produces nc-sgRNAs, that these nc-sgRNAs increase in abundance over time, and that they are not associated with TRS-like homology. We highlight strong evidence supporting the existence of nc-sgRNAs containing only the 5′ region of ORF1a and show that the presence of nc-sgRNAs can alter the landscape of viral ORFs. We found that canonical sgRNAs are consistently produced across SARS-CoV-2 traene">nscriptomes, aene">nd ideene">ntify eight major sgRNAs eene">ncodiene">ng S, n class="Gene">ORF3a, E, M, ORF6, ORF7a/b, ORF8, and N. Each of these sgRNAs is supported by abundant 3′ junctions originating at a TRS-B just upstream of the associated ORF. Similar to other reports, we do not find evidence of a major series of sgRNAs encoding ORF10 [7-9]. We also do not find strong evidence of a major group of sgRNAs encoding ORF7B despite the presence of a TRS-core-like sequence (AAGAAC) proximal to the ORF7B start. Interestingly, up to 33% of sgRNAs from infected cells iene">n cell culture are non-caene">nonical. Oene">ne poteene">ntial source of nc-sgRNAs are DVGs iene">ntroduced by defective particles duriene">ng n class="Disease">infection. These can accumulate during the propagation of virus stocks. While information on the passaging of viral stocks is only available for four transcriptomes, at least five of the six SARS-CoV-2 transcriptomes generated from cell lines were generated following a low MOI infection, which may reduce the impact of defective particles [28, 29]. Indeed, while there is some variation in the absolute abundance of non-canonical junctions across transcriptomes that may reflect dataset-specific factors, the landscape of non-canonical junctions across transcriptomes is generally similar. One exception is that dataset-specific junctions are found in the Finkel et al. transcriptomes. In particular, there is a strong 5′ junction point at position 16971 that represents over 12% of non-canonical junctions in two Finkel et al. replicates but is largely absent in other transcriptomes. A hypothesis that could explain this observation is that there is a population of defective particles in the Finkel et al. virion preparation that contain an internal deletion. This DVG, which is smaller than the full genome, may have a replicative advantage during virus stock preparation [24]. Indeed, the Finkel et al.-specific junction at 16971 displays a relatively large increase in relative abundance over time, in contrast to smaller and non-specific increases of non-canonical junctions at other positions. The consistent presence of non-canonical junctions across the SARS-CoV-2 geene">nome iene">n iene">ndepeene">ndeene">nt traene">nscriptomes suggests that the bulk of the observed nc-sgRNAs are not dataset-specific but are geene">neralizable to n class="Species">SARS-CoV-2 biology. Furthermore, their non-specific increase in abundance over time in cell culture supports a model in which nc-sgRNAs are generated nonspecifically and at higher frequencies later in infection. We found that non-canonical junctions are not associated with TRS-like homology, raising the possibility that there is another mechanism behind their formation. For example, it is possible that factors such as RNA structure [30], accumulation of viral proteins, or actions of the viral RdRp and associated cofactors [31] could influence nc-sgRNA formation. Indeed, a recent study of the murine betacoronavirus MHV suggests that the 3′ to 5′ exoribonuclease Nsp14 influences recombination site selection of the viral RdRp [31]. Additional studies are necessary to gain a mechanistic understanding of the factors that influence nc-sgRNA generation and to understand the phenotypic effects of nc-sgRNAs on SARS-CoV-2 pathogenesis. Notably, considering that defective viral genomes in negative-sense RNA viruses have been associated with antiviral immunity, dendritic cell maturation, and interferon production [5, 24, 32, 33], it will be important to assess if SARS-CoV-2 nc-sgRNAs are associated with immune activity. Junctions in viral RNA change the landscape of viral ORFs. Because the ORF6 TRS-B is located 151 bases iene">nto the M ORF, sgRNAs created usiene">ng this TRS-B contaiene">n a 3′ region of M. There is a UUG located 41 codons prior to the eene">nd of M that will now be situated prior to the n class="Gene">ORF6 on these sgRNAs. While UUG is a relatively weak start codon, there is potential that this UUG could stimulate translation of a C-terminal M fragment or regulate the expression of the downstream ORF6 as an upstream ORF [34]. Additional studies are needed to understand if this M ORF variant is translated and to understand if its UUG start codon influences ORF6 translation. In addition to this M ORF variant, we found that some nc-sgRNAs contain S and ORF1a ORF variaene">nts. Maene">ny of the S ORF variaene">nts would lack the N-termiene">nal sigene">nal sequeene">nce [35], aene">nd especially consideriene">ng the variation iene">n the hypothetical n class="Gene">start positions of these variants, it is unclear if they are translated. Notably, subgenomic RNAs encoding variant S ORFs have been reported for SARS-CoV [36]. In contrast, the ORF1a ORF variaene">nts have a peak stop position arouene">nd n class="Gene">ORF1a amino acid position 500, which correlates well with an observed decrease in ORF1a 5′ coverage around genome position 1600. There is strong support for the existence of 5′ ORF1a-containing nc-sgRNAs across all eight transcriptomes studied here. Although there is not a strong peak of 5′ junctions at specific positions in the 5′ region of ORF1a, the consistent decreases in coverage and concomitant increases in 5′ junctions around positions 1600 and 6000 are notable. It is possible that there are biological factors at these positions that mediate these generalized inflection points but, unlike TRS-mediated homology, do not concentrate junctions to a specific base. As posited by Kim et al. [8], the existence of subgenomic RNAs with the 5′ end of ORF1a raises the possibility that these nc-sgRNAs may influence the relative abundance of Nsps, making this observation an important target for future study.

Conclusions

In conclusion, we show that SARS-CoV-2 geene">nerates nc-sgRNAs iene">n cell culture aene">nd iene">n vivo aene">nd that these nc-sgRNAs are largely consisteene">nt across iene">ndepeene">ndeene">nt traene">nscriptomes aene">nd sequeene">nciene">ng technologies. We show that nc-sgRNAs are not associated with TRS-like homology, suggestiene">ng there may be a homology-iene">ndepeene">ndeene">nt mechaene">nism driviene">ng their formation. Fiene">nally, we show that non-caene">nonical juene">nctions have the poteene">ntial to iene">nflueene">nce the laene">ndscape of viral ORFs aene">nd that there is especially strong support for nc-sgRNAs that contaiene">n only a 5′ portion of n class="Gene">ORF1a. Future studies are necessary to understand the mechanisms behind nc-sgRNA formation and to understand the influence of nc-sgRNAs on SARS-CoV-2 pathogenesis. Additional file 1: Figure S1. There is evidence of non-canonical junctions in diverse datasets. Figure S2. The landscape of non-canonical junctions across transcriptomes is generally consistent. Figure S3. Subgenomic RNAs containing non-canonical junctions are present in three independent dRNAseq datasets. Figure S4. There is elevated ORF1a n class="Species">coverage in CoV 229E.
  28 in total

Review 1.  A contemporary view of coronavirus transcription.

Authors:  Stanley G Sawicki; Dorothea L Sawicki; Stuart G Siddell
Journal:  J Virol       Date:  2006-08-23       Impact factor: 5.103

2.  Minimap2: pairwise alignment for nucleotide sequences.

Authors:  Heng Li
Journal:  Bioinformatics       Date:  2018-09-15       Impact factor: 6.937

3.  STAR: ultrafast universal RNA-seq aligner.

Authors:  Alexander Dobin; Carrie A Davis; Felix Schlesinger; Jorg Drenkow; Chris Zaleski; Sonali Jha; Philippe Batut; Mark Chaisson; Thomas R Gingeras
Journal:  Bioinformatics       Date:  2012-10-25       Impact factor: 6.937

4.  The coding capacity of SARS-CoV-2.

Authors:  Yaara Finkel; Orel Mizrahi; Aharon Nachshon; Shira Weingarten-Gabbay; David Morgenstern; Yfat Yahalom-Ronen; Hadas Tamir; Hagit Achdout; Dana Stein; Ofir Israeli; Adi Beth-Din; Sharon Melamed; Shay Weiss; Tomer Israely; Nir Paran; Michal Schwartz; Noam Stern-Ginossar
Journal:  Nature       Date:  2020-09-09       Impact factor: 49.962

Review 5.  The Impact of Defective Viruses on Infection and Immunity.

Authors:  Emmanuelle Genoyer; Carolina B López
Journal:  Annu Rev Virol       Date:  2019-05-13       Impact factor: 10.431

6.  The Sequence Alignment/Map format and SAMtools.

Authors:  Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal:  Bioinformatics       Date:  2009-06-08       Impact factor: 6.937

Review 7.  Gene expression regulation by upstream open reading frames and human disease.

Authors:  Cristina Barbosa; Isabel Peixeiro; Luísa Romão
Journal:  PLoS Genet       Date:  2013-08-08       Impact factor: 5.917

8.  A new coronavirus associated with human respiratory disease in China.

Authors:  Fan Wu; Su Zhao; Bin Yu; Yan-Mei Chen; Wen Wang; Zhi-Gang Song; Yi Hu; Zhao-Wu Tao; Jun-Hua Tian; Yuan-Yuan Pei; Ming-Li Yuan; Yu-Ling Zhang; Fa-Hui Dai; Yi Liu; Qi-Min Wang; Jiao-Jiao Zheng; Lin Xu; Edward C Holmes; Yong-Zhen Zhang
Journal:  Nature       Date:  2020-02-03       Impact factor: 49.962

Review 9.  SARS coronavirus accessory proteins.

Authors:  Krishna Narayanan; Cheng Huang; Shinji Makino
Journal:  Virus Res       Date:  2007-11-28       Impact factor: 3.303

10.  The Architecture of SARS-CoV-2 Transcriptome.

Authors:  Dongwan Kim; Joo-Yeon Lee; Jeong-Sun Yang; Jun Won Kim; V Narry Kim; Hyeshik Chang
Journal:  Cell       Date:  2020-04-23       Impact factor: 41.582

View more
  20 in total

1.  SARS-CoV-2 gene content and COVID-19 mutation impact by comparing 44 Sarbecovirus genomes.

Authors:  Irwin Jungreis; Rachel Sealfon; Manolis Kellis
Journal:  Nat Commun       Date:  2021-05-11       Impact factor: 14.919

Review 2.  CoV-er all the bases: Structural perspectives of SARS-CoV-2 RNA synthesis.

Authors:  Brandon Malone; Elizabeth A Campbell; Seth A Darst
Journal:  Enzymes       Date:  2021-08-23

3.  Transcriptional and epi-transcriptional dynamics of SARS-CoV-2 during cellular infection.

Authors:  Jessie J-Y Chang; Daniel Rawlinson; Miranda E Pitt; George Taiaroa; Josie Gleeson; Chenxi Zhou; Francesca L Mordant; Ricardo De Paoli-Iseppi; Leon Caly; Damian F J Purcell; Timothy P Stinear; Sarah L Londrigan; Michael B Clark; Deborah A Williamson; Kanta Subbarao; Lachlan J M Coin
Journal:  Cell Rep       Date:  2021-04-23       Impact factor: 9.423

4.  The Roles of APOBEC-mediated RNA Editing in SARS-CoV-2 Mutations, Replication and Fitness.

Authors:  Kyumin Kim; Peter Calabrese; Shanshan Wang; Chao Qin; Youliang Rao; Pinghui Feng; Xiaojiang S Chen
Journal:  bioRxiv       Date:  2022-04-07

5.  Template switching and duplications in SARS-CoV-2 genomes give rise to insertion variants that merit monitoring.

Authors:  Sofya K Garushyants; Igor B Rogozin; Eugene V Koonin
Journal:  Commun Biol       Date:  2021-11-30

6.  A colloidal gold-based immunochromatographic strip for rapid detection of SARS-CoV-2 antibodies after vaccination.

Authors:  Jia Liu; Weiqi Yan; Zhuojun Liu; Yizhao Han; Leqiang Zhang; Jian Yu
Journal:  Med Nov Technol Devices       Date:  2021-06-26

7.  A stark difference in the profiles of defective viral transcripts between SARS-CoV-2 and SARS-CoV.

Authors:  Dongying Xie; Hin Chu; Dong Yang; Qiutao Ding; Gefei Huang; Luo Chen; Zongwei Cai; Jiandong Huang; Zhongying Zhao
Journal:  J Infect       Date:  2021-07-01       Impact factor: 38.637

8.  Subgenomic RNA identification in SARS-CoV-2 genomic sequencing data.

Authors:  Matthew D Parker; Benjamin B Lindsey; Shay Leary; Silvana Gaudieri; Abha Chopra; Matthew Wyles; Adrienn Angyal; Luke R Green; Paul Parsons; Rachel M Tucker; Rebecca Brown; Danielle Groves; Katie Johnson; Laura Carrilero; Joe Heffer; David G Partridge; Cariad Evans; Mohammad Raza; Alexander J Keeley; Nikki Smith; Ana Da Silva Filipe; James G Shepherd; Chris Davis; Sahan Bennett; Vattipally B Sreenu; Alain Kohl; Elihu Aranday-Cortes; Lily Tong; Jenna Nichols; Emma C Thomson; Dennis Wang; Simon Mallal; Thushan I de Silva
Journal:  Genome Res       Date:  2021-03-15       Impact factor: 9.438

Review 9.  Coronavirus, the King Who Wanted More Than a Crown: From Common to the Highly Pathogenic SARS-CoV-2, Is the Key in the Accessory Genes?

Authors:  Nathalie Chazal
Journal:  Front Microbiol       Date:  2021-07-14       Impact factor: 5.640

Review 10.  SARS-CoV-2 Subgenomic RNAs: Characterization, Utility, and Perspectives.

Authors:  Samuel Long
Journal:  Viruses       Date:  2021-09-24       Impact factor: 5.818

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.