Timothy E Schlub, Jan P Buchmann, Edward C Holmes.
Abstract
Overlapping genes in viruses maximize the coding capacity of their genomes and allow the generation of new genes without major increases in genome size. Despite their importance, the evolution and function of overlapping genes are often not well understood, in part due to difficulties in their detection. In addition, most bioinformatic approaches for the detection of overlapping genes require the comparison of multiple genome sequences that may not be available in metagenomic surveys of virus biodiversity. We introduce a simple new method for identifying candidate functional overlapping genes using single virus genome sequences. Our method uses randomization tests to estimate the expected length of open reading frames and then identifies overlapping open reading frames that significantly exceed this length and are thus predicted to be functional. We applied this method to 2548 reference RNA virus genomes and found that it has both high sensitivity and low false discovery for genes that overlap by at least 50 nucleotides. Notably, this analysis provided evidence for 29 previously undiscovered functional overlapping genes, some of which are coded in the antisense direction, suggesting there are limitations in our current understanding of RNA virus replication.
Year: 2018 PMID: 30099499 PMCID: PMC6188560 DOI: 10.1093/molbev/msy155
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240
Fig. 1. Method to detect frameshifted open reading frames (ORFs) in viruses. (A) The expected ORF length based on codon composition is calculated, and ORFs longer than expected by random chance are identified. The expected ORF length is estimated by one of three tests. For the codon permutation test (B), the codon sequence on the original frame is permuted and ORF lengths on alternative reading frames are measured for each permutation. For the synonymous mutation test (C), codons that preserve the original amino acid sequence are randomly generated and the length of ORFs on alternative reading frames is subsequently measured (note that codon replacement is not restricted to the example mutations shown in the figure, all of which occur in the third nucleotide positions, and that codon replacement with the original codon is also possible). The third test requires both the codon permutation test and the synonymous mutation test P values to be below some cut-off value.
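The three tests described in the caption above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the input is assumed to be an uppercase coding sequence whose length is a multiple of three, an "ORF" is simplified to the longest stop-free run of codons in the alternative frame, synonymous codons are drawn uniformly (the original method may weight them differently), and all function names are hypothetical.

```python
import random

STOPS = {"TAA", "TAG", "TGA"}
BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

# Amino acid -> list of codons that encode it.
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def split_codons(seq):
    """Split a sequence into codons on the original (offset 0) frame."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def longest_open_run(seq, offset):
    """Longest stop-free run, in codons, when reading from `offset`."""
    best = run = 0
    for i in range(offset, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            run = 0
        else:
            run += 1
            best = max(best, run)
    return best


def permute_codons(seq, rng):
    """Codon permutation: shuffle the codons of the original frame."""
    codons = split_codons(seq)
    rng.shuffle(codons)
    return "".join(codons)


def resample_synonymous(seq, rng):
    """Synonymous mutation: redraw each codon, keeping the amino acid."""
    return "".join(rng.choice(SYNONYMS[CODON_TABLE[c]])
                   for c in split_codons(seq))


def empirical_p(seq, offset, randomize, n=1000, seed=0):
    """Fraction of randomized sequences with a run at least as long."""
    rng = random.Random(seed)
    observed = longest_open_run(seq, offset)
    hits = sum(longest_open_run(randomize(seq, rng), offset) >= observed
               for _ in range(n))
    return (hits + 1) / (n + 1)  # add-one correction avoids P = 0


def combined_call(seq, offset, cutoff=0.01, n=1000):
    """Combined test: both P values must fall below the cut-off."""
    p_perm = empirical_p(seq, offset, permute_codons, n)
    p_syn = empirical_p(seq, offset, resample_synonymous, n)
    return p_perm < cutoff and p_syn < cutoff
```

A call such as `combined_call(cds, offset=1)` would test the frame shifted by one nucleotide relative to the annotated frame. Both randomizations leave the codon composition (permutation) or the protein sequence (synonymous resampling) of the original frame intact, so an unusually long open run in the shifted frame is not explained by those properties alone.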
Fig. 2. Proof of concept for ORF detection using Potato Latent virus as an example. (A) Schematic of genomic structure for the Potato Latent virus. This virus contains a known overlapping gene, I870_gp1, in frame 2+. (B) The expected distribution of ORF lengths in frame 1+ (shaded area) calculated by the permutation test, and the actual open reading frame (ORF) lengths in frame 1+ (black dots). (C) The expected distribution of ORF lengths in frame 2+ (shaded area) calculated by the permutation test, and the actual ORF lengths in frame 2+ (black dots). The known frameshifted gene, I870_gp1, was clearly identified using the permutation test, as its length was much larger than that expected by chance alone (P < 0.0001).
Fig. 3. Receiver operating characteristic curves showing the sensitivity and false discovery rate of each test for different P value cut-offs and different minimum overlapping lengths in nucleotides. Table 1 gives the precise values for minimum overlapping lengths of 50, 100, 200, and 300 nucleotides and P value cut-offs of 0.01 and 0.001.
Table 1. Sensitivity, false discovery, and area under the curve for each test across a range of P value cut-offs and overlapping lengths.
| Test | Minimum overlap length (nucleotides) | Sensitivity (P < 0.001) | Sensitivity (P < 0.01) | False discovery rate (P < 0.001) | False discovery rate (P < 0.01) | Area under the curve |
|---|---|---|---|---|---|---|
| Codon perm. | 50 | 0.33 | 0.51 | 0.16 | 0.40 | 0.56 |
| Codon perm. | 100 | 0.43 | 0.65 | | | 0.69 |
| Codon perm. | 200 | 0.54 | 0.77 | | | 0.78 |
| Codon perm. | 300 | 0.77 | 0.90 | | | 0.87 |
| Synonymous mut. | 50 | 0.35 | 0.55 | 0.29 | 0.41 | 0.55 |
| Synonymous mut. | 100 | 0.45 | 0.71 | | | 0.67 |
| Synonymous mut. | 200 | 0.57 | 0.84 | | | 0.74 |
| Synonymous mut. | 300 | 0.82 | 0.95 | | | 0.83 |
| Combined | 50 | 0.25 | 0.43 | 0.097 | 0.22 | 0.59 |
| Combined | 100 | 0.33 | 0.56 | | | 0.72 |
| Combined | 200 | 0.43 | 0.68 | | | 0.81 |
| Combined | 300 | 0.69 | 0.86 | | | 0.89 |
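For readers unfamiliar with the two metrics in the table above, the following sketch shows how sensitivity and false discovery rate are conventionally computed at a given P value cut-off. It assumes the textbook definitions (sensitivity = TP / (TP + FN), false discovery rate = FP / (TP + FP)); the exact rules the authors used to decide what counts as a false discovery are not stated in this record. The ORF names and P values are invented purely for illustration.

```python
def evaluate(p_values, known, cutoff):
    """Return (sensitivity, false discovery rate) at a P value cut-off.

    p_values: dict mapping ORF name -> P value from a randomization test
    known:    set of ORF names annotated as real overlapping genes
    """
    called = {orf for orf, p in p_values.items() if p < cutoff}
    tp = len(called & known)   # known genes that were detected
    fp = len(called - known)   # detections with no known gene
    fn = len(known - called)   # known genes that were missed
    sensitivity = tp / (tp + fn) if known else float("nan")
    fdr = fp / (tp + fp) if called else 0.0
    return sensitivity, fdr


# Hypothetical toy data, not taken from the study.
toy_p = {"orfA": 0.0005, "orfB": 0.004, "orfC": 0.2,
         "orfX": 0.008, "orfY": 0.5}
toy_known = {"orfA", "orfB", "orfC"}

for cutoff in (0.001, 0.01):
    sens, fdr = evaluate(toy_p, toy_known, cutoff)
    print(f"cutoff={cutoff}: sensitivity={sens:.2f}, FDR={fdr:.2f}")
```

Relaxing the cut-off can only add detections, so sensitivity never decreases as the cut-off grows, while the false discovery rate may rise. This is the trade-off traced by the curves in the preceding figure caption.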
Comparison of the codon permutation, synonymous mutation and combined methods to Synplot2 for the Synplot2 validation data set (table 1 from Firth 2014).
| Taxon | RefSeq | Gene overlap | Genomic location (nt) | ORF length (nuc) | Codon perm. | Synonymous mut. |
|---|---|---|---|---|---|---|
| Picornaviridae, Cardiovirus, Theilovirus | NC_001366.1 | L/L* | 1081–1551 | 470 | 0.0005 | 0.002 |
| Arteriviridae, Arterivirus, PRRSV | NC_001961.1 | GP2/GP3 | 12696–12843 | 147 | 0.87 | 0.52 |
| | | GP3/GP4 | 13241–13460 | 219 | 0.06 | 0.03 |
| Bromoviridae, Cucumovirus, Cucumber mosaic virus | NC_002035.1 | ORF2a/2b | 2419–2660 | 241 | 0.002 | 0.0007 |
| Hepeviridae, Hepevirus, HEV | NC_001434.1 | CP/ORF3 | 5123–5453 | 330 | 0.15 | 0.02 |
| Betaflexiviridae, Capillovirus, Apple stem grooving virus | NC_001749.2 | replicase-CP/MP | 4787–5749 | 962 | <0.0001 | <0.0001 |
| Betaflexiviridae, Trichovirus, Apple chlorotic leaf spot virus | NC_001409.1 | MP/CP | 6784–7100 | 316 | 0.004 | <0.0001 |
| Alphaflexiviridae, Potexvirus, Pepino mosaic virus | NC_004067.1 | TGB2/TGB3 | 5340–5488 | 148 | 0.19 | 0.23 |
| Sobemovirus, Rice yellow mottle virus | NC_001575.2 | replicase/CP | 3447–3607 | 160 | 0.57 | 0.56 |
| Nodaviridae, Betanodavirus, Striped jack nervous necrosis virus | NC_003448.1 | replicase/B2 | 2756–2983 | 227 | 0.15 | 0.007 |
| Tombusviridae, Tombusvirus, Tomato bushy stunt virus | NC_001554.1 | MP/p19 | 3888–4406 | 518 | <0.0001 | <0.0001 |
| Birnaviridae, Aquabirnavirus, Infectious pancreatic necrosis virus | NC_001915.1 | VP5/VP2 | 120–514 | 394 | 0.002 | 0.0005 |
| Birnaviridae, Avibirnavirus, Infectious bursal disease virus | NC_004178.1 | VP5/VP2 | 130–533 | 403 | 0.002 | 0.03 |
| Reoviridae, Orthoreovirus, Mammalian orthoreovirus 3 | NC_004277.1 | σ1/σ1s | 71–433 | 362 | 0.002 | 0.001 |
| Totiviridae, Totivirus, | NC_003745.1 | gag/pol | 1964–2072 | 108 | 0.88 | 0.97 |
| Bunyaviridae, Orthobunyavirus, La Crosse virus | NC_004110.1 | N/NSs | 101–379 | 278 | 0.01 | 0.03 |
| Paramyxoviridae, Morbillivirus, Measles virus | NC_001498.1 | P/C | 1829–2389 | 560 | 0.0001 | <0.0001 |
| | | P/V | 2499–2705 | 206 | 0.18 | 0.007 |
| Paramyxoviridae, Respirovirus, Human parainfluenza virus 3 | NC_001796.2 | P/C | 1794–2393 | 599 | <0.0001 | <0.0001 |
| | | P/V | 2505–2903 | 398 | 0.0001 | 0.0003 |
| Paramyxoviridae, Rubulavirus, Mumps virus | NC_002200.1 | P/V | 2442–2653 | 211 | 0.02 | 0.02 |
New putative ORF discoveries made here.
| Family, virus name | RefSeq | Coding region | Coding product | ORF position | Reading frame | ORF length (codons) |
|---|---|---|---|---|---|---|
| Reoviridae, Aedes pseudoscutellaris reovirus | NC_007673 | 17..1054 | VP8 | 510–890 | +1 | 126 |
| Betaflexiviridae, Ligustrum necrotic ringspot virus | NC_010305 | 6604..6924 | Triple gene block protein | 6605–6919 | +1 | 105 |
| Unassigned, Circulifer tenellus virus 1 | NC_014360 | 643..4044 | Proline-alanine-rich protein | 1652–2647 | +1 | 331 |
| Rhabdoviridae, Infectious hematopoietic necrosis virus | NC_001652 | 1466..2158 | Polymerase-associated protein | 1690–2085 | +2 | 131 |
| Paramyxoviridae, Bovine respirovirus 3 | NC_002161 | 1784..3574 | Phosphoprotein P | 2500–3021 | +2 | 173 |
| Pneumoviridae, Avian metapneumovirus | NC_007652 | 6111..7868 | Attachment glycoprotein | 6560–7675 | +2 | 371 |
| Unassigned, Cassava virus C | NC_013112 | 186..1055 | Putative movement protein | 209–646 | +2 | 145 |
| Unassigned, Circulifer tenellus virus 1 | NC_014360 | 643..4044 | Proline-alanine-rich protein | 645–1757 | +2 | 370 |
| Unassigned, Halastavi arva RNA virus | NC_016418 | 828..6278 | Replicase protein | 1610–2155 | +2 | 181 |
| Reoviridae, Spissistilus festinus reovirus | NC_016874 | 9..3740 | RNA directed RNA polymerase | 380–1267 | +2 | 295 |
| Paramyxoviridae, Bat Paramyxovirus Eid_hel/GH-M74a/GHA/2009 | NC_025256 | 2053..4665 | Phosphoprotein | 2958–3479 | +2 | 173 |
| Arenaviridae, Okahandja mammarenavirus | NC_027137 | 58..339 | Z protein | 60–332 | +2 | 91 |
| Potyviridae, Sweet potato virus 2 | NC_017970 | 118..10518 | Polyprotein | 118–1119 | −c0 | 333 |
| Filoviridae, Marburg marburgvirus | NC_024781 | 5941..7986 | Glycoprotein | 6046–6753 | −c0 | 235 |
| Rhabdoviridae, Oak-Vale virus | NC_025399 | 3393..4988 | Putative glycoprotein | 3837–4721 | −c0 | 294 |
| Virgaviridae, Macrophomina phaseolina tobamo-like virus | NC_025674 | 208..6594 | RNA-dependent RNA polymerase | 406–1422 | −c0 | 338 |
| Rhabdoviridae, Northern cereal mosaic virus | NC_002251 | 142..1437 | Nucleocapsid protein | 810–1436 | −c1 | 208 |
| Flaviviridae, Nhumirim virus | NC_024017 | 103..10440 | Polyprotein | 2328–4454 | −c1 | 708 |
| Rhabdoviridae, Infectious hematopoietic necrosis virus | NC_001652 | 2999..4525 | Glycoprotein | 3555–4037 | −c2 | 160 |
| Tombusviridae, Hibiscus chlorotic ringspot virus | NC_003608 | 2603..3277 | Hypothetical protein | 2745–3248 | −c2 | 167 |
| Paramyxoviridae, Tioman virus | NC_004074 | 2033..2667 | W protein | 2147–2665 | −c2 | 172 |
| Paramyxoviridae, Tioman virus | NC_004074 | 2033..3188 | Phosphoprotein | 2038–2595 | −c2 | 186 |
| Totiviridae, Magnaporthe oryzae virus 1 | NC_006367 | 575..2815 | Putative coat protein | 1314–1931 | −c2 | 205 |
| Alphaflexiviridae, Hydrangea ringspot virus | NC_006943 | 5549..6022 | Virally coded protein | 5553–6020 | −c2 | 156 |
| Tymoviridae, Scrophularia mottle virus | NC_011537 | 127..1980 | Putative movement protein | 776–1438 | −c2 | 220 |
| Peribunyaviridae, Simbu orthobunyavirus | NC_018477 | 50..325 | Nonstructural protein | 60–323 | −c2 | 87 |
| Reoviridae, Umatilla virus | NC_024503 | 13..3912 | RNA-dependent RNA polymerase | 1826–2515 | −c2 | 229 |
| Paramyxoviridae, Sosuga virus | NC_025343 | 1908..3105 | Phosphoprotein | 1913–2542 | −c2 | 210 |
| Paramyxoviridae, Salmon aquaparamyxovirus | NC_025360 | 2535..3667 | V protein | 2538–3164 | −c2 | 209 |