
Translation landscape of SARS-CoV-2 noncanonical subgenomic RNAs.

Kai Wu¹, Dehe Wang¹, Junhao Wang¹, Yu Zhou².

Abstract

The ongoing COVID-19 pandemic is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), a virus with a positive-stranded RNA genome. Current proteomic studies of SARS-CoV-2 mainly focus on the proteins encoded by its genomic RNA (gRNA) or canonical subgenomic RNAs (sgRNAs). Here, we systematically investigated the translation landscape of SARS-CoV-2, especially its noncanonical sgRNAs. We first constructed a strict pipeline, named vipep, for identifying reliable peptides derived from RNA viruses using RNA-seq and mass spectrometry data. We applied vipep to analyze 24 sets of mass spectrometry data related to SARS-CoV-2 infection. In addition to known canonical proteins, we identified many noncanonical sgRNA-derived peptides, whose levels increase steadily after viral infection. Furthermore, we explored the potential functions of the proteins encoded by noncanonical sgRNAs and found that they can bind to viral RNAs and may have immunogenic activity. The generalized vipep pipeline is applicable to any RNA virus, and these results expand the SARS-CoV-2 translation map, providing new insights into the functions of SARS-CoV-2 sgRNAs.


Keywords: Mass spectrometry; RNA binding; SARS-CoV-2; Subgenomic RNA (sgRNA); Translation

Year: 2022 PMID: 36075564 PMCID: PMC9444306 DOI: 10.1016/j.virs.2022.09.003

Source DB: PubMed Journal: Virol Sin ISSN: 1995-820X Impact factor: 6.947

Introduction

Coronavirus disease 2019 (COVID-19) is a worldwide pandemic respiratory disease caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) (Graham, 2020; Liu et al., 2020; Zhu et al., 2020). Up to June 2022, SARS-CoV-2 has infected more than 530 million people and caused more than 6 million deaths worldwide (Dong et al., 2020). SARS-CoV-2 is a positive-stranded enveloped RNA virus in the Betacoronavirus genus with an approximately 29,000-nucleotide (nt) RNA genome that contains a 5′-cap structure and a 3′ poly(A) tail (Wu et al., 2020; Zhou et al., 2020). The genomic RNA (gRNA) of SARS-CoV-2 contains two large open reading frames (ORF1a/1b) encoding 16 viral nonstructural proteins (nsps) through −1 ribosomal frameshifting, and 13 other open reading frames (ORFs) encoding four structural proteins and nine accessory factors predicted by gene homology (Chan et al., 2020; Perlman and Netland, 2009; Zhou et al., 2020). The nonstructural proteins play important roles, such as replicating and transcribing the genome (nsp7, nsp8, and nsp12), capping viral RNA (nsp13 and nsp14), and cleaving the viral polyprotein (3C-like protease) (Gordon et al., 2020b).

Interestingly, proteins other than those in ORF1a/1b are believed to be translated from subgenomic RNAs (sgRNAs) generated through discontinuous transcription (Kim et al., 2020; Parker et al., 2021; Stern and Kennedy, 1980). This discontinuous transcription step, called “template switch”, is mainly mediated by RNA-RNA interactions between the RNA template and the nascent RNA sequence flanking the junction sites during sgRNA biogenesis (Wang et al., 2021). Previously, we defined three types of sgRNAs: leader-type sgRNAs (canonical sgRNAs), and ORF1ab-type and S/N-type sgRNAs (noncanonical sgRNAs) (Wang et al., 2021). Nomburg et al. also analyzed SARS-CoV-2 noncanonical sgRNAs and found that they accounted for 33% of the total sgRNAs at the transcript level, and that the proportion increased over time after infection in cell culture (Nomburg et al., 2020). With the widespread application of bioinformatics methods for homology comparison and codon analysis, new ORFs encoded by canonical sgRNAs continue to be discovered, such as ORF3c and ORF3d (Firth, 2020; Jungreis et al., 2021; Nelson et al., 2020). Furthermore, many potential novel proteins encoded by noncanonical sgRNAs have also been found. Finkel et al. observed that some noncanonical sgRNAs may have translational potential based on ribosome footprints (Finkel et al., 2021). Davidson et al. found a novel S glycoprotein with an 8 amino acid (aa) deletion and an N protein with a 17 aa deletion by mass spectrometry (MS) (Davidson et al., 2020). We also found that two ORF1ab-type sgRNAs have translation potential using a limited number of MS datasets (Wang et al., 2021). These sporadic lines of evidence indicate that noncanonical sgRNAs have coding capacity, but a systematic study of noncanonical sgRNAs using large sets of MS data is still missing for SARS-CoV-2. Considering that sgRNA levels increase over the course of viral infection and are much higher than gRNA levels, the protein products of sgRNAs and their potential functions are important questions requiring more comprehensive studies.

Here, we developed a general and strict pipeline called vipep to systematically identify novel viral RNA-encoded peptides using RNA-seq and mass spectrometry data. We collected 18 sets of RNA-seq data and 24 sets of mass spectrometry data related to SARS-CoV-2 that are currently available. We used the vipep pipeline to identify reliable peptides from all ORFs in both strands of gRNAs and sgRNAs. We obtained 473 and 53 peptides derived from canonical and noncanonical sgRNAs, respectively. We found that the translation level of noncanonical sgRNAs increases steadily across time points during viral infection, consistent with the increase at the RNA level. Furthermore, we explored the biological functions of the proteins encoded by noncanonical sgRNAs and found that they are capable of binding viral RNAs and may be immunogenic.

Materials and methods

Data collection and curation

Public mass spectrometry data were downloaded from iProX (Ma et al., 2019) and PRIDE (Perez-Riverol et al., 2019) databases. We manually screened 24 sets of data with SARS-CoV-2 infection (5 in iProX and 19 in PRIDE). The sgRNA junction sites of SARS-CoV-2 were extracted from 18 RNA-seq [9 next-generation sequencing (NGS) and 9 nanopore] datasets using our published method (Wang et al., 2021). Public Ribo-seq data were downloaded from the NCBI SRA database with accession number: SRP260279. The annotated SARS-CoV-2 protein sequences were collected from the NCBI gene reference (NC_045512.2) and UniProt database (version 20210416, 16 entries) (UniProt Consortium, 2021). The domain annotations of those proteins were downloaded from UCSC SARS-CoV-2 Genome Browser (Fernandes et al., 2020).

Mass spectrometry data analysis

We constructed a complete and non-redundant MS search database composed of three parts: 1) the 6-frame translation of the SARS-CoV-2 gRNA, which contains the annotated ORFs; 2) the host proteome from the human UniProt (version 2020/09/28 with 192,656 entries) or monkey UniProt (version 2020/10/22 with 19,525 entries) database; and 3) sgRNA junction-spanning ORFs from the 3-frame translation of SARS-CoV-2 sgRNAs in both forward and reverse strands. In the vipep workflow, the raw data were divided into tandem mass tag (TMT) and label-free groups according to their experimental design. Each group was then searched with MaxQuant (1.6.12.0) using the corresponding default parameters. A fixed modification of carbamidomethyl (C) and variable modifications of oxidation (M) and acetyl (protein N-term) were included in the search, and the maximum number of missed cleavages was set to 2. For novel peptide identification, the peptide-spectrum match false discovery rate (FDR) threshold was set to 0.01. To obtain more reliable results, potential contaminants and reversed peptides were discarded first. Peptides with a leucine and isoleucine content higher than 25% were also discarded, because the two amino acids cannot be distinguished by mass, which may lead to false positives. The remaining peptides were then strictly filtered: the search score was required to be larger than 120 for peptides detected in a single dataset and larger than 90 for peptides detected in multiple datasets. Only junction-spanning peptides with overhangs of at least 6 nt at both ends were kept for further analysis.
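The filtering rules in this subsection can be summarized as a single predicate. The following is a minimal sketch, not the vipep implementation: the `Peptide` record and its field names are hypothetical, the leucine and isoleucine content is assumed to be the combined fraction of L and I residues, and the junction-overhang check is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class Peptide:
    # Hypothetical record; field names are illustrative only.
    sequence: str            # amino-acid sequence
    score: float             # search score of the best match
    n_datasets: int          # number of MS datasets supporting the peptide
    is_contaminant: bool = False
    is_reverse: bool = False


def leu_ile_fraction(seq: str) -> float:
    """Fraction of residues that are leucine (L) or isoleucine (I)."""
    return sum(aa in "LI" for aa in seq) / len(seq)


def passes_filters(p: Peptide) -> bool:
    """Apply the contaminant, L/I-content, and score filters in order."""
    if p.is_contaminant or p.is_reverse:
        return False
    if leu_ile_fraction(p.sequence) > 0.25:
        return False
    # Stricter score threshold for peptides seen in only one dataset.
    threshold = 120 if p.n_datasets == 1 else 90
    return p.score > threshold
```

For example, a peptide with a score of 100 would be kept if it is supported by three datasets but rejected if it is supported by only one.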

SNPs and RNA editing analysis

The human single nucleotide polymorphisms (SNPs) were downloaded from UCSC common dbSNP (version V153, hg38) (Sherry et al., 2001). For all coding isoforms in the GENCODE basic annotation (version 41, hg38), the reference base was replaced with the corresponding variant base in the coding sequence (CDS) region to generate SNP-derived protein sequences that differ from the reference proteome. The human A-to-I editing sites (hg38) were downloaded from the REDIportal database (Mansi et al., 2021). The RNA-editing-derived protein sequences were constructed in the same way as for SNPs, by replacing A with G in coding regions. The novel SARS-CoV-2 peptides were then scanned against the SNP-mutated and RNA editing-mutated proteins.
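The variant scan described here can be illustrated as follows. This is a minimal sketch under stated assumptions: one substitution is applied at a time, positions are 0-based offsets within the CDS, sequences use the DNA alphabet (A, C, G, T), and the function names are hypothetical. An A-to-I editing site is handled as an A-to-G substitution, as in the text.

```python
from itertools import product

# Standard genetic code, built in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds: str) -> str:
    """Translate a coding sequence codon by codon ('*' marks a stop)."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds) - 2, 3))


def apply_substitution(cds: str, pos: int, alt: str) -> str:
    """Replace the base at 0-based CDS offset `pos` with `alt`."""
    return cds[:pos] + alt + cds[pos + 1:]


def explained_by_variant(peptide: str, cds: str, variants) -> bool:
    """True if any single substitution yields a protein containing the peptide."""
    return any(
        peptide in translate(apply_substitution(cds, pos, alt))
        for pos, alt in variants
    )
```

A peptide that matches none of the variant-derived proteins cannot be explained by a host SNP or editing site under this simplified model.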

Ribo-seq data analysis

Ribo-seq reads were first mapped to the host-virus merged genomes, SARS-CoV-2 gRNA (MN996528) with sgRNA junctions plus the Vero E6 (Chlorocebus sabaeus Ensembl v99) or Calu-3 (human hg38) genome, using the STAR (v2.7.2b) program (Dobin et al., 2013) with parameters “--sjdbFileChrStartEnd JSDB.txt --outFilterMultimapNmax 1 --alignSJoverhangMin 6 --outSJfilterOverhangMin 6 6 6 6 --outSJfilterCountUniqueMin 3 3 3 3 --outSJfilterCountTotalMin 3 3 3 3 --outSJfilterDistToOtherSJmin 0 0 0 0 --scoreGap -4 --scoreGapNoncan -4 --scoreGapATAC -4 --alignIntronMax 30000 --alignMatesGapMax 30000 --alignSJstitchMismatchNmax -1 -1 -1 -1”. The mapped reads with lengths between 27 and 33 nt were used in further analysis.
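The read-length filter and the extraction of junctions from spliced alignments can be sketched from two SAM fields, the 1-based leftmost position and the CIGAR string. This is an illustrative sketch, not the authors' code: the function names are hypothetical, and it assumes that read length is the number of sequenced bases (M, I, S, =, X operations) and that a junction is reported as the last aligned base before an `N` gap and the first aligned base after it.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_alignment(pos: int, cigar: str):
    """Return (read_length, junctions) for one alignment.

    Each junction is (last base before the gap, first base after the gap),
    in 1-based reference coordinates.
    """
    read_len = 0
    ref = pos          # next reference position to be consumed
    junctions = []
    for count, op in CIGAR_RE.findall(cigar):
        n = int(count)
        if op in "MIS=X":      # operations that consume read bases
            read_len += n
        if op == "N":          # skipped region, i.e. a junction
            junctions.append((ref - 1, ref + n))
        if op in "MDN=X":      # operations that consume reference bases
            ref += n
    return read_len, junctions


def keep_read(pos: int, cigar: str, min_len: int = 27, max_len: int = 33) -> bool:
    """Keep reads whose length falls in the ribosome-footprint range."""
    read_len, _ = parse_alignment(pos, cigar)
    return min_len <= read_len <= max_len
```

Junctions returned this way can then be compared with the junction sites of the MS-supported peptides.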

RNA-protein interactome MS data analysis

The RNA-protein interactome data were obtained from iProX accession PXD024808. The raw data related to SARS-CoV-2 infection were searched with MaxQuant (1.6.12.0) using default label-free parameters. We divided the data into three groups according to the corresponding probes. Only peptides that were also among those identified from the 24 MS datasets above were used in further analysis.

Epitope prediction

Potential epitopes were predicted with the netMHCpan (v4.1) program using default parameters (Reynisson et al., 2020), except that the %Rank_EL and %Rank_BA cutoffs were set to 0.02 to be more stringent. Novel epitopes were defined as those derived from noncanonical sgRNA-encoded proteins that do not appear in any of the annotated SARS-CoV-2 proteins in UniProt (version 2021/04, containing 16 entries) or NCBI RefSeq (reference NC_045512.2, containing 38 entries).
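The selection of novel epitopes can be sketched as a two-step filter over predicted binders. This is a hedged illustration: the input is assumed to be a list of `(peptide, rank_el, rank_ba)` tuples already extracted from the predictor output, the function names are hypothetical, and requiring both ranks to pass an inclusive cutoff is an assumption, since the text does not say how the two cutoffs are combined.

```python
def is_strong_binder(rank_el: float, rank_ba: float, cutoff: float = 0.02) -> bool:
    """Lower %Rank means stronger predicted binding; require both ranks to pass."""
    return rank_el <= cutoff and rank_ba <= cutoff


def novel_epitopes(predictions, annotated_proteins, cutoff: float = 0.02):
    """Keep strong binders whose sequence occurs in no annotated protein."""
    return [
        peptide
        for peptide, rank_el, rank_ba in predictions
        if is_strong_binder(rank_el, rank_ba, cutoff)
        and not any(peptide in protein for protein in annotated_proteins)
    ]
```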

Results

Overview of vipep pipeline and SARS-CoV-2 proteome

To explore the global landscape of the SARS-CoV-2 proteome, we downloaded and analyzed 24 mass spectrometry (MS) datasets with SARS-CoV-2 infection from the iProX (Ma et al., 2019) and PRIDE (Perez-Riverol et al., 2019) databases (Fig. 1A and Supplementary Table S1). Within the vipep pipeline, we first built a comprehensive search database for SARS-CoV-2 that includes its annotated proteins, additional ORFs from the in silico 6-frame translation of the gRNA, predicted junction-spanning ORFs from sgRNAs, and host proteins. The SARS-CoV-2 sgRNA junction sites come from our integrated analysis of multiple NGS and nanopore sequencing datasets (Wang et al., 2021). We identified peptides in the collected MS data using MaxQuant (Tyanova et al., 2016). After quality control and several strict filtering steps, we kept high-confidence novel SARS-CoV-2 peptides for downstream analyses (Fig. 1A and Supplementary Fig. S1A).
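The 6-frame in silico translation used to build the search database can be illustrated with a short sketch. It assumes a DNA-alphabet sequence without ambiguous bases and defines an ORF as any stop-to-stop segment; the ORF definition actually used by vipep is not stated here, so `six_frame_orfs` and its `min_len` parameter are hypothetical.

```python
from itertools import product

# Standard genetic code, built in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons only ('*' marks a stop)."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def six_frame_orfs(genome: str, min_len: int = 1):
    """Yield (strand, frame, protein) for each stop-to-stop segment."""
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            for segment in translate(seq[frame:]).split("*"):
                if len(segment) >= min_len:
                    yield strand, frame, segment
```

In practice a larger `min_len` would be used so that only segments long enough to yield detectable peptides enter the database.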

Fig. 1

The vipep pipeline and global landscape of SARS-CoV-2 encoded peptides. A Schematic diagram for identifying and annotating novel SARS-CoV-2 encoded peptides. A custom database was built to search SARS-CoV-2 peptides in 24 mass spectrometry datasets currently available in the literature. Pink represents the source of the datasets, blue represents the construction method of the nonredundant mass spectrometry database, green represents the peptides identified by vipep, and orange represents the functional exploration of the novel peptides. B Genome browser view of annotated gORFs, numbers of peptide spectrum matches (PSMs) for annotated ORFs, and predicted gORFs by six frames across the SARS-CoV-2 genome. C Total counts of peptides detected in all mass spectrometry datasets for the annotated ORFs of SARS-CoV-2. D The numbers of annotated and novel peptides derived from genomic or subgenomic ORFs in sense or antisense (±) strands. E The statistics of peptides supported by different numbers of MS datasets for annotated (left) and novel peptides (right). F An example of multi-datasets supported novel peptides spanning a sgRNA junction. The peptide is represented underneath the sgRNA junction spanning two ORFs (top) with the MS/MS spectra found in multiple datasets (bottom). The dataset for the shown spectra is highlighted in red. G The numbers of annotated (red) and novel peptides (blue) in all datasets grouped by virus infection and mock samples. The sample origin is labeled for each dataset (C, cell; P, patient). The datasets with mock samples are marked in purple.

We used 194,309 sgRNA junction sites (JSs) found in RNA-seq datasets (Supplementary Fig. S1B) to predict the junction-spanning ORFs that are different from the annotated SARS-CoV-2 proteins. We identified a total of 649 and 440 peptides derived from annotated and novel junction-spanning ORFs, respectively. To minimize false positives, we strictly screened the peptides detected by MS (Supplementary Fig. S1A).
First, potential contaminants and reversed peptides were discarded. Second, for a peptide spanning a sgRNA junction, the RNA sequence encoding the peptide was required to have overhangs of at least 6 nt at both ends. Third, the leucine and isoleucine content of the peptide was required to be no higher than 25%, since these two amino acids cannot be distinguished by molecular weight. Fourth, we set a higher MaxQuant search score threshold for peptides detected in a single dataset. We found that 92.73% (408/440) of novel peptides appear in only one dataset, versus 42.2% of annotated peptides (Supplementary Fig. S1C). Based on the distribution of search scores for annotated peptides, we set the threshold to 90 for peptides supported by multiple samples (Supplementary Fig. S1D left). For peptides supported by a single sample, we increased the threshold to 120 to reduce potential false positives (Supplementary Fig. S1D right). Finally, we identified 473 annotated and 53 novel peptides using the vipep workflow (Supplementary Table S2 and Supplementary Table S3). For annotated ORFs, ORF1ab has the largest number of peptides, followed by the N and S proteins (Fig. 1B and C), which may be due to their long lengths and high abundances. Interestingly, no novel peptide was found in any of the predicted gORFs (Fig. 1B) beyond annotated ORFs (Fig. 1D). Although (−)gRNA has many predicted ORFs, none of them has any peptide from MS (Fig. 1D). In principle, there should be no viral peptides in non-infected (mock) samples, and we used this expectation to evaluate the data quality of all MS datasets. In the MS dataset IPX0002166001, five annotated peptides were detected in seven healthy people (Supplementary Fig. S1E). We suspect that some of these healthy people were asymptomatic carriers at the time of sampling, because three of them (H02, H03, and H04) could not be distinguished from the recovered patients (Li et al., 2020), and we removed this dataset from the following analyses.
All novel peptides derived from sgRNAs are junction-spanning, except two peptides (Supplementary Table S2 and Supplementary Fig. S2A), which are located in a different frame inside the N gene and result from different degrees of proteinase digestion (Supplementary Fig. S2A). As expected, all but one of the novel sgRNA-derived peptides are from the sense strand of the viral genome (Fig. 1D and Supplementary Fig. S2B). Because of the limitations of MS technology, not all proteins can be detected by MS; detection depends on the presence of appropriate protease digestion sites. We defined an ORF as ‘MS theoretical’ if trypsin digestion can produce from it at least one peptide fragment of 6–20 aa. Of 16 annotated ORFs, 14 can be considered ‘MS theoretical’ ORFs, and 10 of them can be detected by MS with vipep (Supplementary Fig. S2C left). There are 44 and 94 ‘MS theoretical’ gORFs in the sense and antisense strands, respectively; however, no novel peptide was detected (Supplementary Fig. S2C middle). There are about 42,000 and 13,000 ‘MS theoretical’ predicted sgORFs in the sense and antisense strands, of which 732 and 1, respectively, are detected by MS (Supplementary Fig. S2C right). These proteome-based results suggest that the sense strand of SARS-CoV-2 RNA is indeed more likely to produce proteins, which is consistent with previous works reporting that SARS-CoV-2 showed a high degree of bias and efficiency in generating sense strand RNA (Zhao et al., 2021) and that the majority of Ribo-seq reads were derived from the sense strand (Puray-Chavez et al., 2022). We reason that the more datasets in which a peptide is observed, the higher the probability that it is a real peptide. Not unexpectedly, about 76.5% of annotated peptides have support from two or more MS datasets (Fig. 1E left). For novel peptides, about 41.5% (22/53) are detected in two or more MS datasets (Fig. 1E right).
The low consistency between different datasets could be partially caused by the inclusion of some datasets with very few viral peptides or of a specific dataset with very high sensitivity. As a representative example, the novel peptide in Fig. 1F, which spans the sgRNA junction (7792–29440) fusing nsp3 and the N protein, is detected in up to ten datasets. Furthermore, to exclude the possibility that the novel peptides are generated from unannotated host RNAs or from RNAs newly transcribed in response to viral infection, we performed 3-frame and 6-frame translations of the host transcriptome and genome, respectively. We kept the predicted ORFs with RNA-seq signals and translated them into protein sequences, which we then searched for the novel peptides. None of the novel peptides was identified in these host protein sequences. Notably, we searched the novel peptides against the NCBI non-redundant protein database using tblastn and found that the best hits are all derived from SARS-CoV-2. In addition, we performed the SNP and RNA editing analyses and confirmed that the novel peptides derive neither from SNP-mutated nor from RNA editing-mutated proteins. The mock datasets, which contain no annotated peptides as controlled in the analysis, do not contain any novel peptides either (Fig. 1G). Besides, the amount of detected peptides in samples derived from virus-infected cells is significantly higher than that in samples derived from patients. Thus, we believe that the novel peptides we found are reliable based on this set of quality controls, suggesting that the translation of noncanonical sgRNAs is a widespread phenomenon.
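The ‘MS theoretical’ criterion defined above can be checked with a simple in silico digestion. The sketch below makes two assumptions that the text does not state: trypsin is modeled as cleaving after K or R except when the next residue is P, and no missed cleavages are considered. The function names are hypothetical.

```python
def tryptic_peptides(protein: str):
    """Fully digest a protein: cleave after K/R unless followed by P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):       # C-terminal fragment without K/R
        peptides.append(protein[start:])
    return peptides


def is_ms_theoretical(protein: str, min_len: int = 6, max_len: int = 20) -> bool:
    """True if at least one tryptic peptide is 6-20 aa long."""
    return any(min_len <= len(p) <= max_len for p in tryptic_peptides(protein))
```

Under these assumptions, a short ORF whose lysines and arginines are closely spaced yields only fragments below 6 aa and would not count as ‘MS theoretical’.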

Global landscape of SARS-CoV-2 sgRNA-derived peptides

To display the novel junction-spanning peptides identified above more intuitively, we mapped the corresponding sgRNAs onto the SARS-CoV-2 genome. We observed that many novel peptides could be derived from multiple sgRNAs (Supplementary Figs. S3A–B). We reasoned that this phenomenon may be caused by multiple mappings of the coding sequences of the peptides or by synonymous codons around the junction sites. For further analysis, we retained only the 42 single-positional novel peptides (Supplementary Fig. S3C). Similar to our previous work, we classified the SARS-CoV-2 sgRNAs into four types by the positions of the junction sites (JSs): Leader type, ORF1ab/ORF1ab, ORF1ab/S–N, and S–N/S–N (Fig. 2A right) (Wang et al., 2021). The first type is also named canonical sgRNA, while the latter three types are called noncanonical sgRNAs. Although it was reported that the expression levels of canonical sgRNAs are higher than those of noncanonical sgRNAs (Kim et al., 2020; Wang et al., 2021), we found that all the novel peptides detected by MS were derived from noncanonical sgRNAs (Fig. 2A left, Fig. 2B), suggesting that canonical sgRNAs do not produce novel proteins. Only two types of noncanonical sgRNAs, ORF1ab/S–N and S–N/S–N, have detected novel peptides (14 and 28 peptides, respectively), which are scattered across different datasets (Fig. 2C). It is worth noting that the dataset PXD018241 (virus-infected cells at 72 h, divided into 20 MS fractions) has significantly higher sensitivity than all other datasets, and 19 novel peptides are detected only in this dataset. When this most sensitive dataset is excluded, 47.8% (11/23) of the remaining novel peptides are detected in two or more datasets.
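The four-way classification by junction position can be sketched as a lookup on the two junction coordinates. The region boundaries below are assumptions for illustration, taken from the NC_045512.2 annotation (5′ UTR up to position 265, ORF1ab at 266–21555, everything downstream treated as the S–N region); the exact boundaries used in Wang et al. (2021) are not given in this text. Type labels use an ASCII hyphen in place of the en dash.

```python
# Assumed region boundaries (1-based, NC_045512.2 annotation).
LEADER_END = 265      # end of the 5' UTR, which contains the leader
ORF1AB_END = 21555    # end of ORF1ab


def region(pos: int) -> str:
    """Assign a genomic position to the leader, ORF1ab, or S-N region."""
    if pos <= LEADER_END:
        return "Leader"
    if pos <= ORF1AB_END:
        return "ORF1ab"
    return "S-N"


def sgrna_type(donor: int, acceptor: int) -> str:
    """Classify a junction by the regions of its 5' and 3' sites."""
    five, three = region(donor), region(acceptor)
    if five == "Leader":
        return "Leader"               # canonical sgRNA
    return f"{five}/{three}"          # noncanonical sgRNA
```

With these boundaries, the nsp3-N junction 7792–29440 shown in Fig. 1F is classified as ORF1ab/S-N.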

Fig. 2

Global landscape of sgRNA-derived peptides. A Global view of sgRNA junctions with mass spectrometry evidence (left) and all theoretical sgRNA junctions (right) in SARS-CoV-2. The canonical Leader-type and 3 noncanonical sgRNA junctions are shown in different colors (Leader, red; ORF1ab/S–N, blue; S–N/S–N, green; ORF1ab/ORF1ab, purple). B The statistics of mass spectrometry (MS) peptides, theoretical sgRNA junctions, and the reads spanning theoretical sgRNA junctions by the sgRNA type as in A. C The presence (black) or absence (white) heatmap of MS peptides (row) in different datasets (column). The peptides are grouped by sgRNA type as in A. The sgRNA junction positions and peptide sequences are annotated at the right. The total number of peptides in each dataset and the number of supporting datasets for each peptide are shown on the top and on the left side, respectively. D Global arc-view of sgRNA junctions with evidence from mass spectrometry only (black) and from both MS and Ribo-seq (red). E An exemplar junction-spanning peptide of noncanonical sgRNA with evidence from both Ribo-seq reads and mass spectrometry. The MS/MS spectra of the peptide are shown on the right.

To further confirm the translation of the sgRNAs encoding those novel peptides, we downloaded and re-analyzed the Ribo-seq data published by Finkel et al. (2021). Using junction-spanning Ribo-seq reads to verify the translation of junction-spanning peptides, we found that five novel peptides are supported by Ribo-seq signals (Fig. 2D). A representative peptide, which is supported by three different Ribo-seq datasets, is shown in Fig. 2E. These results further support the translational capability of the novel ORFs at the level of ribosome binding.

Annotation of sgRNA ORFs containing novel peptides

To explore the characteristics of those translated noncanonical sgRNAs, we classified the ORFs from which the novel peptides originate into in-frame, frame-shift, and novel types according to whether they change the coding frame of the corresponding annotated ORFs (Fig. 3A). The in-frame type ORF fuses upstream and downstream annotated ORFs or results from a deletion in one annotated ORF, without a frame change. The frame-shift type ORF fuses two parts of annotated ORFs, in which the upstream part, the downstream part, or both are in a different frame. The novel type ORF contains a sequence that is not annotated as a coding region (Fig. 3A). Two representative ORFs of the frame-shift type are shown in Fig. 3B and C, while a novel type ORF is shown in Fig. 3D.

Fig. 3

Characterization of the sgRNA-derived peptides. A Graphical illustration of the classification for peptides derived from sgRNAs. ORF1 and ORF2 represent two different ORFs or two terminals of the same protein. Blue represents the upstream non-changed ORF, green represents the downstream non-changed ORF, orange represents the upstream ORF with frame-shift, pink represents the downstream ORF with frame-shift, and grey represents the ORF located in unannotated regions. B–D Representative examples for three types of peptides spanning sgRNA junctions (B, Upstream frame-shift; C, Downstream frame-shift; D, Novel). The spanned ORF(s) and the sgRNA ORF are shown at the top. The junction position, RNA and protein sequences of annotated ORF(s), and novel peptide sequence are presented at the bottom. The MS/MS spectra with both y and b ion information for the peptide are shown on the right. E The statistics of novel peptides grouped by sgRNA junction spanned ORF(s). F The statistics of novel peptides by peptide type as in A. G The numbers of RNA-seq junction-spanning (JS) reads and detected novel peptides from theoretical noncanonical sgRNA ORFs with deletion and frame-shift inside the same gene (N–N type). The statistical significance is based on a two-sided Fisher's exact test (P = 0.03).

Based on the locations of annotated ORFs, we found that most novel peptides are derived from S–N/S–N type sgRNAs (19 for N–N), followed by two kinds of ORF1ab/S–N type sgRNAs (9 for nsp3-N and 5 for nsp2-N, Fig. 3E). The novel peptides derived from in-frame type ORFs far outnumbered those from the frame-shift type, and no novel peptide was found in both-end frame-shift ORFs (Fig. 3F). To evaluate whether the ORFs containing novel peptides tend to maintain the coding frame, we counted the RNA-seq junction-spanning (JS) reads from theoretical ORFs and the numbers of novel peptides whose upstream and downstream sub-peptides are both located in the same annotated ORF.
We collected N–N type peptides for this analysis. We found that a larger proportion of in-frame type novel ORFs can be translated compared to frame-shift type ones (P = 0.03, Fig. 3G). This result suggests that noncanonical sgRNAs tend to retain the coding frame of their corresponding annotated ORFs. It is worth noting that among N–N frame-shift type peptides, five out of seven were downstream frame-shift (Fig. 3E). Interestingly, less than 20% (8 out of 42) of novel peptides used an upstream frame different from that of the corresponding annotated ORFs (Fig. 3F). These results indicate that novel sgRNA-derived ORFs are likely to share the same start codon with the annotated ORFs.
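The comparison between in-frame and frame-shift ORFs relies on a two-sided Fisher's exact test on a 2×2 table. The sketch below implements that test with the standard library only. The counts underlying P = 0.03 are not reported in this text, so the function is shown without them, and any numbers passed to it here would be made up.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9)
    )
```

Here the rows would be the ORF types (in-frame, frame-shift) and the columns whether a theoretical ORF has a detected peptide or not.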

A quantitative map of peptides from SARS-CoV-2 sgRNAs

To investigate the dynamics of proteins encoded by noncanonical sgRNAs, we chose a time-course MS dataset (PXD018594) with SARS-CoV-2 infection at two different multiplicities of infection (MOIs). We found that the amounts of both novel peptides from noncanonical sgRNAs and annotated peptides increase with infection time (Fig. 4A). By counting the numbers of spectra in specific ORFs, we found that, as expected, all annotated proteins increase with infection time (Fig. 4B), consistent with a previous report (Bojkova et al., 2020). We also observed the same increasing trend for the two types of noncanonical sgRNA-derived peptides. Interestingly, we found that a higher MOI makes it easier to detect both annotated and novel peptides at an earlier time point, and that the levels of both annotated and novel peptides plateau at the late stage of viral infection, independent of the MOI (Fig. 4B). These results support the reliability of the novel peptides and indicate that noncanonical sgRNAs can be translated and are regulated in the same manner as canonical sgRNAs.

Fig. 4

The increasing dynamics of viral peptides after SARS-CoV-2 infection. A Heatmap of the spectra counts for novel and annotated SARS-CoV-2 peptides at different time points in dataset PXD020019. The total spectra counts for each time point are shown on the top. B The counts of novel sgRNA-derived peptides and peptides in different annotated ORFs. The counts are summed by spectra with different virus MOI. C Graphical illustration of the continuous junction-covering peptides (blue), junction-spanning peptides (red), and other continuous peptides (grey) located in ORF9b and N genes. All peptides are from the dataset PXD018594 (day4.rep2). D Estimated proportions of novel peptides at different time points as in C. E Estimated proportions of the novel (red) and annotated (blue) peptides by spectra in five different datasets as in C.

To estimate the relative proportion of noncanonical sgRNA-derived proteins in the viral proteome, we devised a computational scheme that uses junction-spanning peptides to represent novel proteins and continuous junction-covering peptides to represent annotated proteins (Fig. 4C). The rationale is that continuous peptides not covering the junctions are ambiguous: they may be derived from either annotated ORFs or noncanonical sgORFs. As shown in Fig. 4D (using dataset PXD018594), the relative abundance of novel peptides reached a maximum of around 13% of the viral proteome using ORF9b and N as representative ORFs. Furthermore, we extended the same analysis to the top 5 datasets with the largest numbers of novel peptides, and found that the average abundances of novel peptides are similar (Fig. 4E). These results imply that the translation of noncanonical sgRNAs is widespread and reaches a considerable proportion of the viral proteome.
Together, these findings revealed a global and wide translation phenomenon for SARS-CoV-2 noncanonical sgRNAs, and the stable level of novel peptides indicates that the translated proteins may perform certain functions in infected cells.
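The proportion estimate described in this section can be sketched in two steps: label each peptide relative to a junction, then take the share of junction-spanning spectra among the unambiguous ones. This is an illustrative reading, not the authors' code. It assumes each peptide is described by the 1-based inclusive genomic intervals that encode it, and that the proportion is spanning / (spanning + covering); the function names are hypothetical.

```python
def classify_peptide(blocks, donor: int, acceptor: int) -> str:
    """Label a peptide relative to one junction (donor -> acceptor).

    blocks: list of (start, end) genomic intervals encoding the peptide.
    """
    if len(blocks) == 2 and blocks[0][1] == donor and blocks[1][0] == acceptor:
        return "junction-spanning"        # evidence for the novel ORF
    if len(blocks) == 1:
        start, end = blocks[0]
        # Reads continuously through the donor or the acceptor site.
        if start <= donor < end or start < acceptor <= end:
            return "junction-covering"    # evidence for the annotated ORF
    return "other"                        # ambiguous, not counted


def novel_fraction(spanning_spectra: int, covering_spectra: int) -> float:
    """Share of junction-spanning spectra among unambiguous spectra."""
    total = spanning_spectra + covering_spectra
    if total == 0:
        raise ValueError("no junction-spanning or junction-covering spectra")
    return spanning_spectra / total
```

For instance, 13 junction-spanning spectra against 87 junction-covering spectra would give a fraction of 0.13.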

Functional potentials of sgRNA-encoded proteins

To further explore the biological functions of noncanonical sgRNAs, we used recently published RNA-protein interactome data to explore the potential RNA binding ability of sgRNA-derived proteins. This dataset used two separate pools of RNA probes targeting the ORF1ab and S–N regions, respectively (Fig. 5A top). The ORF1ab probes are expected to hybridize specifically with the genomic RNA, while the S–N probes are expected to enrich both the genomic RNA and sgRNAs (Lee et al., 2021). Using the novel peptides identified above, we found that 13 novel peptides are pulled down by the RNA probes. Of the 10 single-positional peptides, 5 are detected in both groups of RNA probes and 5 are detected in only one group (Fig. 5A bottom). Not unexpectedly, many annotated proteins are pulled down by both groups of RNA probes (Fig. 5B left), and similarly, most novel peptides (7 out of 13) are shared between the two groups of RNA probes (Fig. 5B). From the detailed analysis of each peptide, we found that the amounts of novel peptides in the two groups of SARS-CoV-2 probes are much higher than that in the rRNA probe control, similar to the annotated peptides (Fig. 5C). As expected, the mock control samples do not contain any peptides encoded by SARS-CoV-2 RNAs. These results indicate that the proteins encoded by noncanonical sgRNAs have the potential to bind SARS-CoV-2 RNAs.

Fig. 5

The functional potentials of noncanonical sgRNA-derived proteins. A Global arc-view of unique-positional novel peptides detected in RAP-MS dataset PXD024808. The targeted regions by two groups of probes are marked in red and blue at the top, and the detected sgRNA junction-spanning peptides are shown at the bottom. B Venn diagram of detected peptides between the two types of probes in C for annotated (left) and novel peptides (right). C Spectra counts heatmap of annotated and novel peptides in dataset PXD024808. The total spectra counts for different time points are shown on the top, and the sequences of novel peptides are shown on the right side colored as in A.

Previous studies have shown that T cell responses play an essential role in SARS-CoV-2 immunity and viral clearance (Altmann and Boyton, 2020; Grifoni et al., 2020; Le Bert et al., 2020). In a recent study, Weingarten-Gabbay et al. found that HLA-I peptides are derived not only from canonical ORFs but also from internal out-of-frame ORFs, which are not captured by current vaccines (Weingarten-Gabbay et al., 2021). We thus predicted the potential antigenic peptides derived from SARS-CoV-2 proteins. After applying a strict binding score cutoff, we identified a total of 12 novel epitope clusters from nine noncanonical sgRNAs (Supplementary Fig. S4A and Supplementary Table S4). Together, these data suggest that noncanonical sgRNA-derived proteins may function in immunophysiology.

Discussion

Many noncanonical sgRNAs of SARS-CoV-2 have been identified (Kim et al., 2020; Nomburg et al., 2020; Wang et al., 2021), but only a few studies have reported that they can be translated (Davidson et al., 2020; Finkel et al., 2021; Wang et al., 2021). This study combined RNA-seq and mass spectrometry data to explore the translation landscape of SARS-CoV-2. Under strict criteria, we found that many noncanonical sgRNAs can be translated into proteins. These proteins are stably present and may play important roles in the same manner as annotated proteins, such as binding to viral RNAs. The SARS-CoV-2 noncanonical sgRNA-derived proteins can be regarded as structural variants of the corresponding annotated proteins that arise from deletions or frameshifts. We further evaluated the functional potential of these proteins at the domain level. Some deletion-type novel proteins lack complete functional domains. For example, the protein derived from the noncanonical sgRNA (28,359–28,962) lacks an RNA-binding domain relative to the annotated N protein (Supplementary Fig. S4B). The resulting smaller protein may lose RNA-binding activity (Banerjee et al., 2020) and may be unable to participate in forming phase-separated complexes (Lu et al., 2021), whereas its dimerization domain may allow it to retain the ability to interact with other proteins (Gordon et al., 2020a; Lu et al., 2021). For fusion-type noncanonical sgRNA-derived proteins, combining functional domains from different annotated proteins may produce proteins with new biological functions. For example, the protein derived from the sgRNA (6,060–28,599) is a fusion protein that combines the macro and peptidase C16 domains of the nsp3 protein with the dimerization domain of the N protein (Supplementary Fig. S4C).
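The deletion-type and fusion-type distinction can be illustrated with a short sketch that looks up which annotated gene each side of a junction falls in and checks whether the reading frame is preserved across the junction. This rests on several assumptions. The gene coordinates are those of the reference annotation (NC_045512.2) as recalled here, and only three genes are listed. The junction is taken to skip bases donor+1 through acceptor. The helper names are hypothetical and not part of vipep.

```python
# Assumed gene coordinates (1-based, inclusive) from the reference annotation
# NC_045512.2; only the genes needed for this illustration are listed.
GENES = {
    "ORF1ab": (266, 21555),
    "S": (21563, 25384),
    "N": (28274, 29533),
}


def gene_at(pos):
    """Return the name of the listed gene containing pos, or None."""
    for name, (start, end) in GENES.items():
        if start <= pos <= end:
            return name
    return None


def classify_junction(donor, acceptor):
    """Classify a junction as deletion-type (same gene) or fusion-type.

    Assumed convention: bases donor+1..acceptor are skipped. The ORF1ab frame
    is measured from position 266, which holds only upstream of the -1
    ribosomal frameshift site (the ORF1a portion).
    """
    g5, g3 = gene_at(donor), gene_at(acceptor)
    if g5 is None or g3 is None:
        return {"type": "unannotated", "genes": (g5, g3), "in_frame": None}
    kind = "deletion" if g5 == g3 else "fusion"
    # The frame is preserved when both ends sit at the same codon offset.
    off5 = (donor - GENES[g5][0]) % 3
    off3 = (acceptor - GENES[g3][0]) % 3
    return {"type": kind, "genes": (g5, g3), "in_frame": off5 == off3}


for donor, acceptor in [(28359, 28962), (6060, 28599)]:
    print(donor, acceptor, classify_junction(donor, acceptor))
```

Under these assumptions, the 28,359–28,962 junction lies entirely within N (deletion-type) and the 6,060–28,599 junction joins ORF1ab to N (fusion-type). Both preserve the downstream N reading frame, consistent with the retention of the N dimerization domain described above. A different junction coordinate convention could shift the frame check by one base.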
It is worth noting that current mass spectrometry technology has certain limitations, such as low sensitivity for low-abundance proteins and the inability to detect all kinds of proteins because detection depends on the presence of appropriate protease digestion sites. Thus, noncanonical sgRNA-derived proteins are difficult to detect comprehensively.

COVID-19, caused by SARS-CoV-2, has been a pandemic for more than two years. One reason the pandemic has not ended is that the virus is continuously mutating (Kaushik, 2021; Sharma et al., 2021; Varahachalam et al., 2021). Some new variants, such as the delta and omicron variants, have higher transmissibility and immune evasion than the original virus strain (Fisman et al., 2020; Funk et al., 2021; Kim et al., 2021; Thomas et al., 2021). Currently, researchers focus on mutations that change the spike protein of SARS-CoV-2 and affect viral transmissibility, pathogenicity, and immune escape (Harvey et al., 2021; Tian et al., 2021). Nevertheless, we hypothesize that many variants in non-coding regions may alter the canonical and noncanonical sgRNAs by breaking or creating the RNA-RNA interactions that regulate sgRNA biogenesis (Wang et al., 2021), such that novel proteins encoded by noncanonical sgRNAs may play important roles in regulating viral pathogenicity. We anticipate that more MS data, especially from infections with different variant strains, will allow further investigation of noncanonical sgRNA-encoded proteins.

Currently, the novel peptides identified from the computational analysis of MS data lack supporting biochemical experimental data. We propose two possible experimental methods. First, the novel peptides could be used to generate specific antibodies for validation by Western blotting. Second, a GFP tag could be added to the N gene in a SARS-CoV-2 pseudovirus, and the proteins translated from noncanonical N sgRNAs could then be enriched for validation by mass spectrometry. However, both strategies are challenging and time-consuming, and thus merit a separate study to complement our current computational study.

Conclusions

We developed a general pipeline, vipep, for analyzing mass spectrometry datasets from RNA viruses, and used it to construct the translation landscape of SARS-CoV-2. We identified many peptides translated from noncanonical sgRNAs. These novel peptides are dynamically regulated in the same manner as annotated proteins, suggesting that they are stably present and functional. The novel proteins may play important roles by combining domains from different annotated proteins or by losing domains present in annotated proteins. Some of the proteins are found to bind viral RNAs or are predicted to have immunogenic activity. Our results expand the SARS-CoV-2 translation map and indicate that noncanonical sgRNA-derived proteins are a non-negligible component that merits further study.

Data availability

All data relevant to the study are included in the article or uploaded as supplementary information.

Ethics statement

This article does not contain any studies with human or animal subjects performed by any of the authors.

Author contributions

Kai Wu: conceptualization, formal analysis, methodology, writing-original draft preparation. Dehe Wang: conceptualization, formal analysis, methodology. Junhao Wang: downloaded and organized the MS datasets. Yu Zhou: conceptualization, data curation, funding acquisition, methodology, supervision, validation, writing – review & editing.

Conflict of interest

The authors declare that they have no competing interests.

49 in total

1. Manipulative magnetic nanomedicine: the future of COVID-19 pandemic/endemic therapy.

Authors: Ajeet Kaushik
Journal: Expert Opin Drug Deliv Date: 2020-12-14 Impact factor: 6.648

2. SARS-CoV-2-specific T cell immunity in cases of COVID-19 and SARS, and uninfected controls.

Authors: Nina Le Bert; Anthony T Tan; Kamini Kunasegaran; Christine Y L Tham; Morteza Hafezi; Adeline Chia; Melissa Hui Yen Chng; Meiyin Lin; Nicole Tan; Martin Linster; Wan Ni Chia; Mark I-Cheng Chen; Lin-Fa Wang; Eng Eong Ooi; Shirin Kalimuddin; Paul Anantharajah Tambyah; Jenny Guek-Hong Low; Yee-Joo Tan; Antonio Bertoletti
Journal: Nature Date: 2020-07-15 Impact factor: 49.962

3. The Translational Landscape of SARS-CoV-2-infected Cells Reveals Suppression of Innate Immune Genes.

Authors: Maritza Puray-Chavez; Nakyung Lee; Kasyap Tenneti; Yiqing Wang; Hung R Vuong; Yating Liu; Amjad Horani; Tao Huang; Sean P Gunsten; James B Case; Wei Yang; Michael S Diamond; Steven L Brody; Joseph Dougherty; Sebla B Kutluay
Journal: mBio Date: 2022-05-23 Impact factor: 7.786

4. Proteomics of SARS-CoV-2-infected host cells reveals therapy targets.

Authors: Denisa Bojkova; Kevin Klann; Benjamin Koch; Marek Widera; David Krause; Sandra Ciesek; Jindrich Cinatl; Christian Münch
Journal: Nature Date: 2020-05-14 Impact factor: 69.504

5. The SARS-CoV-2 nucleocapsid phosphoprotein forms mutually exclusive condensates with RNA and the membrane-associated M protein.

Authors: Shan Lu; Qiaozhen Ye; Digvijay Singh; Yong Cao; Jolene K Diedrich; John R Yates; Elizabeth Villa; Don W Cleveland; Kevin D Corbett
Journal: Nat Commun Date: 2021-01-21 Impact factor: 14.919

6. Characteristics of SARS-CoV-2 variants of concern B.1.1.7, B.1.351 or P.1: data from seven EU/EEA countries, weeks 38/2020 to 10/2021.

Authors: Tjede Funk; Anastasia Pharris; Gianfranco Spiteri; Nick Bundle; Angeliki Melidou; Michael Carr; Gabriel Gonzalez; Alejandro Garcia-Leon; Fiona Crispie; Lois O'Connor; Niamh Murphy; Joël Mossong; Anne Vergison; Anke K Wienecke-Baldacchino; Tamir Abdelrahman; Flavia Riccardo; Paola Stefanelli; Angela Di Martino; Antonino Bella; Alessandra Lo Presti; Pedro Casaca; Joana Moreno; Vítor Borges; Joana Isidro; Rita Ferreira; João Paulo Gomes; Liidia Dotsenko; Heleene Suija; Jevgenia Epstein; Olga Sadikova; Hanna Sepp; Niina Ikonen; Carita Savolainen-Kopra; Soile Blomqvist; Teemu Möttönen; Otto Helve; Joana Gomes-Dias; Cornelia Adlhoch
Journal: Euro Surveill Date: 2021-04

7. Urine proteome of COVID-19 patients.

Authors: Yanchang Li; Yihao Wang; Huiying Liu; Wei Sun; Baoqing Ding; Yinghua Zhao; Peiru Chen; Li Zhu; Zhaodi Li; Naikang Li; Lei Chang; Hengliang Wang; Changqing Bai; Ping Xu
Journal: Urine (Amst) Date: 2021-03-05

8. Coronaviruses post-SARS: update on replication and pathogenesis. (Review)

Authors: Stanley Perlman; Jason Netland
Journal: Nat Rev Microbiol Date: 2009-06 Impact factor: 60.633

9. Profiling SARS-CoV-2 HLA-I peptidome reveals T cell epitopes from out-of-frame ORFs.

Authors: Shira Weingarten-Gabbay; Susan Klaeger; Siranush Sarkizova; Leah R Pearlman; Da-Yuan Chen; Kathleen M E Gallagher; Matthew R Bauer; Hannah B Taylor; W Augustine Dunn; Christina Tarr; John Sidney; Suzanna Rachimi; Hasahn L Conway; Katelin Katsis; Yuntong Wang; Del Leistritz-Edwards; Melissa R Durkin; Christopher H Tomkins-Tinch; Yaara Finkel; Aharon Nachshon; Matteo Gentili; Keith D Rivera; Isabel P Carulli; Vipheaviny A Chea; Abishek Chandrashekar; Cansu Cimen Bozkus; Mary Carrington; Nina Bhardwaj; Dan H Barouch; Alessandro Sette; Marcela V Maus; Charles M Rice; Karl R Clauser; Derin B Keskin; Daniel C Pregibon; Nir Hacohen; Steven A Carr; Jennifer G Abelin; Mohsan Saeed; Pardis C Sabeti
Journal: Cell Date: 2021-06-03 Impact factor: 66.850