Abstract
Six candidate overlapping genes have been detected in SARS-CoV-2, yet current methods struggle to detect overlapping genes that originated recently. However, such genes might encode proteins beneficial to the virus, and provide a model system to understand gene birth. To complement existing detection methods, I first demonstrated that selection pressure to avoid stop codons in alternative reading frames is a driving force in the origin and retention of overlapping genes. I then built a detection method, CodScr, based on this selection pressure. Finally, I combined CodScr with methods that detect other properties of overlapping genes, such as a biased nucleotide and amino acid composition. I detected two novel ORFs (ORF-Sh and ORF-Mh), overlapping the spike and membrane genes respectively, which are under selection pressure and may be beneficial to SARS-CoV-2. ORF-Sh and ORF-Mh are present, as ORFs uninterrupted by stop codons, in 100% and 95% of the SARS-CoV-2 genomes, respectively.
Keywords: Codon usage; Membrane protein; Multivariate statistics; Overlapping reading frame; Selection pressure; Spike protein; Virus evolution
Year: 2021 PMID: 34339929 PMCID: PMC8317007 DOI: 10.1016/j.virol.2021.07.011
Source DB: PubMed Journal: Virology ISSN: 0042-6822 Impact factor: 3.513
List of the six overlapping ORFs detected in SARS-CoV-2.
| ORF name | Type of experimental evidence for expression or function | Type of computational method for detection |
|---|---|---|
| ORF2b | Ribosome profiling | – |
| ORF3b | Interferon antagonist, when expressed from a plasmid in Sendai-virus infected cells | – |
| ORF3c | Ribosome profiling | Synplot2 |
| ORF3d | Ribosome profiling | Codon permutation method |
| ORF9b | Ribosome profiling | GOFIX |
| ORF9c | Suppressor of antiviral response, when expressed from a plasmid in transfected cells | GOFIX |
Fig. 1. Location of the eight overlapping ORFs detected in the 3′ genome region of SARS-CoV-2.
Fig. 2. Example workflow for CodScr + SeqComp analysis. As input data, CodScr + SeqComp requires the nucleotide sequence of a protein-coding region (the ancestral reading frame) that contains an overlapping ORF shifted one nucleotide 3′ (+1 overlapping ORF) or two nucleotides 3′ (+2 overlapping ORF). Here, ORF indicates a contiguous stretch of codons that begins with an AUG start codon, ends with a stop codon, is not interrupted by premature stop codons, and has a length ≥ 90 nt. A detailed example of the calculation of the five prediction scores (P-value, PLS-DA score, LDA score, LDA-ancestral score, and LDA-novel score) is shown in Supplementary File S1.
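The ORF definition in this caption can be illustrated with a minimal Python sketch. This is not the CodScr + SeqComp code: the function name `find_overlapping_orfs` and its return format are hypothetical, and the sketch assumes that the 90 nt threshold counts the stop codon (as the lengths in the results table suggest). It only enumerates candidate ORFs in the +1 and +2 frames of an ancestral coding sequence and computes none of the five prediction scores.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
MIN_ORF_NT = 90  # threshold from the caption; assumed to include the stop codon


def find_overlapping_orfs(ancestral_cds, min_len=MIN_ORF_NT):
    """List ORFs in the +1 and +2 frames of an ancestral coding sequence.

    Hypothetical helper for illustration only. Returns tuples
    (shift, start, end, length) with 0-based, end-exclusive coordinates
    relative to the first nucleotide of ancestral_cds.
    """
    seq = ancestral_cds.upper().replace("U", "T")  # accept RNA or DNA letters
    hits = []
    for shift in (1, 2):
        open_start = None  # first AUG seen since the previous stop codon
        for i in range(shift, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and open_start is None:
                open_start = i
            elif codon in STOP_CODONS:
                if open_start is not None:
                    length = i + 3 - open_start  # start codon through stop codon
                    if length >= min_len:
                        hits.append((shift, open_start, i + 3, length))
                open_start = None  # a stop closes any open ORF in this frame
    return hits
```

The sketch reports only the longest ORF ending at each stop codon, that is, the one starting at the first AUG. Shorter ORFs that start at a downstream in-frame AUG would need an extra pass over the AUG codons inside each hit.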
Fig. 3. Nucleotide and amino acid sequence of the two predicted overlapping ORFs in the 3′ genome region of SARS-CoV-2. (A) Overlapping ORF-Sh: the nucleotide sequence (from nt 24,050 to 24,172) encodes the region of protein S spanning residues 830–868, while the +1 overlapping ORF-Sh (from nt 24,051 to 24,170) encodes a predicted protein of 39 aa (underlined characters). Bold characters indicate a predicted transmembrane helix. (B) Overlapping ORF-Mh: the nucleotide sequence (from nt 26,691 to 26,873) encodes the region of protein M spanning residues 57–116, while the +2 overlapping ORF-Mh (from nt 26,693 to 26,872) encodes a predicted protein of 59 aa (underlined characters). Bold characters indicate two predicted transmembrane helices.
List of the eight overlapping ORFs in the 3′ genome region of SARS-CoV-2 meeting all five prediction criteria of the CodScr + SeqComp method.
| Overlapping ORF | Genome position | Length (nt) | Within gene (genome position) | Shift of the overlapping ORF | P-value from the CodScr method | PLS-DA score | LDA score | LDA-ancestral score | LDA-novel score | Prediction criteria met |
|---|---|---|---|---|---|---|---|---|---|---|
| nORF1 | 24051–24170 | 120 | S (24050–24172) | +1 | 0.0001 | −0.45 | −37.03 | 37.07 | −1.54 | 5 |
| nORF2* | 24072–24170 | 99 | S (24071–24172) | +1 | 0.0001 | −0.13 | −35.41 | 37.08 | −8.74 | 5 |
| nORF3 | 25457–25582 | 126 | ORF3a (25456–25584) | +1 | 0.04 | −1.46 | −41.89 | 21.48 | 12.76 | 5 |
| nORF4 | 25524–25697 | 174 | ORF3a (25522–25698) | +2 | 0.02 | −1.32 | −40.44 | −27.80 | −59.59 | 5 |
| nORF5 | 26693–26872 | 180 | M (26691–26873) | +2 | 0.003 | −0.02 | −36.26 | −17.74 | −60.62 | 5 |
| nORF6 | 28284–28577 | 294 | N (28283–28579) | +1 | 0.0001 | −0.52 | −39.17 | 27.12 | 11.56 | 5 |
| nORF7* | 28305–28577 | 273 | N (28304–28579) | +1 | 0.0001 | −0.35 | −39.29 | 26.88 | 12.29 | 5 |
| nORF8* | 28359–28577 | 219 | N (28358–28579) | +1 | 0.0001 | −0.21 | −37.77 | 27.78 | 14.76 | 5 |
The prefix “n” stands for “new”; an asterisk indicates an overlapping ORF whose AUG start codon is in frame with the previous overlapping ORF.
The boundaries of the overlapping ORFs refer to the reference genome sequence of SARS-CoV-2 (NC_045512.2).
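The coordinate columns of this table are linked by simple arithmetic, which the following sketch checks. The helper names are hypothetical and not part of the published method. The sketch assumes 1-based inclusive genome positions, that the start of the "Within gene" region lies in the ancestral reading frame, and that the listed length includes the stop codon.

```python
# (name, ORF start, ORF end, listed length in nt, start of gene region, listed shift)
ROWS = [
    ("nORF1", 24051, 24170, 120, 24050, 1),
    ("nORF2", 24072, 24170, 99, 24071, 1),
    ("nORF3", 25457, 25582, 126, 25456, 1),
    ("nORF4", 25524, 25697, 174, 25522, 2),
    ("nORF5", 26693, 26872, 180, 26691, 2),
    ("nORF6", 28284, 28577, 294, 28283, 1),
    ("nORF7", 28305, 28577, 273, 28304, 1),
    ("nORF8", 28359, 28577, 219, 28358, 1),
]


def orf_length(start, end):
    """Length in nt for 1-based, inclusive genome coordinates."""
    return end - start + 1


def frame_shift(orf_start, region_start):
    """Offset (0, 1 or 2) of the ORF relative to the ancestral frame."""
    return (orf_start - region_start) % 3


def protein_length(length_nt):
    """Amino acids encoded, assuming the final codon is the stop codon."""
    return length_nt // 3 - 1


def inconsistent_rows(rows):
    """Names of rows whose listed length or shift disagrees with the coordinates."""
    return [
        name
        for name, start, end, length, region_start, shift in rows
        if orf_length(start, end) != length
        or length % 3 != 0
        or frame_shift(start, region_start) != shift
    ]
```

Under these assumptions all eight rows are internally consistent, and the lengths of nORF1 (120 nt) and nORF5 (180 nt) correspond to proteins of 39 and 59 aa.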
List of the five candidate overlapping ORFs in the 3′ genome region of SARS-CoV-2 (italic characters indicate ORFs predicted by other methods; bold characters indicate ORFs predicted only by the CodScr + SeqComp method).
| Candidate overlapping ORF (length) | Ancestral overlapping gene | Boundaries of candidate overlapping ORF | Shift vs. ancestral gene | P value from CodScr | PLS-DA score | LDA score | LDA-ancestral score | LDA-novel score | Prediction criteria met |
|---|---|---|---|---|---|---|---|---|---|
a Boundaries of the overlapping ORFs refer to the reference genome sequence of SARS-CoV-2 (NC_045512.2).