Camilla Ugolini, Logan Mulroney, Adrien Leger, Matteo Castelli, Elena Criscuolo, Maia Kavanagh Williamson, Andrew D Davidson, Abdulaziz Almuqrin, Roberto Giambruno, Miten Jain, Gianmaria Frigè, Hugh Olsen, George Tzertzinis, Ira Schildkraut, Madalee G Wulf, Ivan R Corrêa, Laurence Ettwiller, Nicola Clementi, Massimo Clementi, Nicasio Mancini, Ewan Birney, Mark Akeson, Francesco Nicassio, David A Matthews, Tommaso Leonardi.
Abstract
The SARS-CoV-2 virus has a complex transcriptome characterised by multiple, nested subgenomic RNAs (sgRNAs) used to express structural and accessory proteins. Long-read sequencing technologies such as nanopore direct RNA sequencing can recover full-length transcripts, greatly simplifying the assembly of structurally complex RNAs. However, these techniques do not detect the 5′ cap, thus preventing reliable identification and quantification of full-length, coding transcript models. Here we used Nanopore ReCappable Sequencing (NRCeq), a new technique that can identify capped full-length RNAs, to assemble a complete annotation of SARS-CoV-2 sgRNAs and annotate the location of capping sites across the viral genome. We obtained robust estimates of sgRNA expression across cell lines and viral isolates and identified novel canonical and non-canonical sgRNAs, including one that uses a previously un-annotated leader-to-body junction site. The data generated in this work constitute a useful resource for the scientific community and provide important insights into the mechanisms that regulate the transcription of SARS-CoV-2 sgRNAs.
Year: 2022 PMID: 35244721 PMCID: PMC8989550 DOI: 10.1093/nar/gkac144
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 19.160
Figure 1. NRCeq allows sequencing of full-length viral sgRNAs. (A) Schematic representation of the landscape of the SARS-CoV-2 transcriptome, ORFs, cleavage sites and protein structures (Adrien Leger (2020): SARS-COV-2 replication cycle. doi: 10.6084/m9.figshare.12229013.v1). (B) Read coverage across the viral genome calculated from the aggregated standard Nanopore DRS datasets used in this study (see Supplementary Table S1). The figure also reports the coverage fold change between the 5′ (genomic region from 15 to 60) and 3′ (genomic region from 29805 to 29850). The reported p-value was calculated with the two-sided Welch's test. (C) Schematic overview of the NRCeq recapping protocol. (D) Number (left) and percentage (right) of basecalled, trimmed and mapped reads for the NRCeq datasets. (E) Read coverage across the viral genome calculated using aggregated NRCeq data from CaCo2 and Vero cells. The figure also reports the coverage fold change between the 5′ (genomic region from 15 to 60) and 3′ (genomic region from 29805 to 29850). The reported p-value was calculated with the two-sided Welch's test. (F) Coverage of the viral genome calculated using only the alignment start sites (top) or alignment termination sites (bottom) for the NRCeq data from CaCo2 and Vero cells, aggregated in a single dataset.
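The 5′/3′ coverage comparison described in panels (B) and (E) can be illustrated with a minimal sketch. It assumes a per-base coverage list indexed from genomic position 1, takes the fold change as mean 5′ coverage divided by mean 3′ coverage (the caption does not state the direction), and uses toy coverage values that are not data from the study. The function names are hypothetical. Only the Welch t statistic and its degrees of freedom are computed here; a two-sided p-value would additionally need the t distribution, for example via `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
from math import sqrt
from statistics import mean, variance

GENOME_LENGTH = 29903  # assumed length of the SARS-CoV-2 reference genome


def region_values(coverage, start, end):
    """Per-base coverage for 1-based, inclusive positions start..end."""
    return coverage[start - 1:end]


def fold_change(region_a, region_b):
    """Ratio of mean coverage in region_a to mean coverage in region_b."""
    return mean(region_a) / mean(region_b)


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Toy coverage track (hypothetical values, for illustration only).
coverage = [0] * GENOME_LENGTH
for pos in range(15, 61):
    coverage[pos - 1] = 10 if pos % 2 == 0 else 12
for pos in range(29805, 29851):
    coverage[pos - 1] = 100 if pos % 2 == 0 else 120

five_prime = region_values(coverage, 15, 60)
three_prime = region_values(coverage, 29805, 29850)
fc = fold_change(five_prime, three_prime)
t_stat, dof = welch_t(five_prime, three_prime)
```

Welch's test is a natural choice for this comparison because the two regions can have very different coverage variances, which the pooled-variance t-test does not allow for.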
Figure 2. NRCeq assembly identifies and quantifies viral sgRNAs. (A) UCSC Genome Browser track showing the SARS-CoV-2 transcriptome assembly obtained with NRCeq data. The figure reports both canonical and non-canonical (NC) transcript models. SARS-CoV-2 ORFs are reported for reference. The colour coding indicates the number of identical amino acids between the first ORF of the sgRNA and the best match in the reference SARS-CoV-2 proteome (Uniprot), expressed as a fraction of the reference protein's length. (B) Quantification of the ORFs performed by NRCeq and northern blot. NRCeq data from CaCo2 and Vero cells were aggregated in a single dataset. For each bin of 400 nt (x-axis), the cumulative expression of all assembled transcript models was calculated and expressed as a percentage (y-axis). The northern blot quantification data were obtained from Ogando et al. (43).
Figure 3. Independent sgRNAs encode ORF9d and ORF10. (A, B) UCSC Genome Browser track showing the alignments of NRCeq reads assigned to ORF9d (A) and ORF10 (B). The figure also includes tracks showing the location of TRS-B sequences with 0, 1 or 2 mismatches, NCBI Genes, Uniprot Protein Products, ORF predictions and ribosome footprints (46). Arrows indicate the genomic position of the products found in the bands of the gel in (C). (C) Agarose gel electrophoresis after PCR amplification of short sgRNA species. The bands at 1500 nt, 650 nt and 190 nt correspond to the expected sizes for the amplicons of full-length N, ORF9d and ORF10, respectively. The bands at 450 nt and 1100 nt did not correspond to the size of any assembled transcript model. (D) Representative alignments of Illumina DNA sequencing data (250 nt × 2) of short, PCR-amplified sgRNAs. The bands at 1500 nt, 650 nt, 450 nt and 190 nt from the gel in (C) were purified and sequenced. (E) Heatmap showing the location and abundance of split alignments connecting the viral 5′UTR with downstream regions. The figure was generated using the Illumina DNA sequencing data as in (D). The y-axis reports the genomic coordinate upstream of the junction, whereas the x-axis reports the genomic coordinate downstream of the junction. The colour scale reports the number of reads that support each junction (log10 transformed) after binning the genome in intervals of 10 nt (x-axis) or 20 nt (y-axis). Axis labels report the midpoint of the interval.
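The binning behind the heatmap in panel (E) can be sketched as follows. This is an illustration only: it assumes junctions are given as (upstream, downstream) pairs of 1-based coordinates, that bins start at position 1, and that each cell stores the log10 of its read count. The function names and the example junction coordinates are hypothetical and do not come from the study.

```python
from collections import Counter
from math import log10


def bin_midpoint(pos, width):
    """Midpoint of the width-nt bin that contains 1-based position pos."""
    start = ((pos - 1) // width) * width + 1  # first position of the bin
    return start + (width - 1) / 2


def junction_heatmap(junctions, x_width=10, y_width=20):
    """Map (y midpoint, x midpoint) cells to log10 of supporting reads.

    y bins the coordinate upstream of the junction (20 nt by default),
    x bins the coordinate downstream of the junction (10 nt by default).
    """
    counts = Counter(
        (bin_midpoint(up, y_width), bin_midpoint(down, x_width))
        for up, down in junctions
    )
    return {cell: log10(n) for cell, n in counts.items()}


# Hypothetical split alignments: one (upstream, downstream) pair per read.
junctions = [(65, 28255)] * 100 + [(70, 28259)] * 900 + [(65, 29540)]
heatmap = junction_heatmap(junctions)
```

In this toy input the first two junction groups fall in the same cell, so their reads are pooled before the log transform, which is the behaviour the caption describes for nearby junction coordinates.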
Figure 4. Expression profiling of sgRNAs in different cell lines. (A, B) Number (A) and percentage (B) of reads basecalled and mapped to the SARS-CoV-2 genome for each dataset. For the NRCeq datasets, the percentage is calculated as a fraction of the total number of capped reads. (C) Structure and mean expression of sgRNA transcript models across all datasets. The labels on the y-axis refer to the first ORF identified in each sgRNA. (D) Cumulative expression of all transcripts that code for each ORF. The values correspond to the mean expression of ORFs across all samples. The error bars report the combined standard deviation of the expression of the transcript models encoding each ORF (see Materials and Methods). (E) Scatter plot showing transcripts per million (TPM) versus transcript length in each dataset.
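As a rough illustration of the quantities in panels (D) and (E), the sketch below converts per-transcript read counts to TPM and sums TPM over all transcript models that share the same first ORF. It assumes that each full-length read counts once towards its transcript and that no length normalisation is applied, which is a common convention for long-read data but is not confirmed by the captions. The transcript names, counts and function names are hypothetical, and the combined standard deviation used for the error bars is not reproduced because its formula is given only in the Materials and Methods.

```python
from collections import defaultdict


def tpm(read_counts):
    """Scale per-transcript read counts so that they sum to one million."""
    total = sum(read_counts.values())
    return {tx: n * 1_000_000 / total for tx, n in read_counts.items()}


def cumulative_by_orf(tpm_values, first_orf):
    """Sum TPM over all transcript models that encode the same first ORF."""
    totals = defaultdict(float)
    for tx, value in tpm_values.items():
        totals[first_orf[tx]] += value
    return dict(totals)


# Hypothetical transcript models and read counts (not data from the study).
read_counts = {"N_model_1": 600, "N_model_2": 200, "S_model_1": 200}
first_orf = {"N_model_1": "N", "N_model_2": "N", "S_model_1": "S"}

tpm_values = tpm(read_counts)
orf_expression = cumulative_by_orf(tpm_values, first_orf)
```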