Angelo Pavesi, Gkikas Magiorkinis, David G Karlin.
Abstract
A well-known mechanism through which new protein-coding genes originate is the modification of pre-existing genes, e.g. by duplication or horizontal transfer. In contrast, many viruses generate protein-coding genes de novo, via the overprinting of a new reading frame onto an existing ("ancestral") frame. This mechanism is thought to play an important role in viral pathogenicity, but has been poorly explored, perhaps because identifying the de novo frames is very challenging. Therefore, a new approach to detect them was needed. We assembled a reference set of overlapping genes for which we could reliably determine the ancestral frames, and found that their codon usage was significantly closer to that of the rest of the viral genome than the codon usage of de novo frames. Based on this observation, we designed a method that identifies de novo frames from their codon usage with very good specificity but intermediate sensitivity. Using our method, we predicted that the Rex gene of deltaretroviruses originated de novo by overprinting the Tax gene. Intriguingly, several genes in the same genomic region have also originated de novo and encode proteins that regulate the functions of Tax. Such "gene nurseries" may be common in viral genomes. Finally, our results confirm that the genomic GC content is not the only determinant of codon usage in viruses and suggest that a constraint linked to translation must influence codon usage.
Year: 2013 PMID: 23966842 PMCID: PMC3744397 DOI: 10.1371/journal.pcbi.1003162
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Figure 1. Rationale for our approach.
Benchmark dataset of 27 overlapping genes with known genealogy.
| Viral family and nature of the genome | Genus and genome accession number | Species | Ancestral frame [function] | De novo frame [function] | Length of overlapping regions (nt) | Length of non-overlapping regions (nt) | Host organism | Reference |
| | | | Capsid protein [capsid] | Replicase, C-term domain | 1833 | 3951 | Insect | |
| | | | Capsid protein [capsid] | NABP, N-term domain | 303 | 7011 | Plant | |
| | | | TGBp2 [viral movement] | TGBp3, N-term domain | 153 | 5991 | Plant | |
| | | | MP [viral movement] | Polyprotein | 966 | 5352 | Plant | |
| | | | Coat protein [capsid] | MP, C-term domain | 318 | 6804 | Plant | |
| | | | VP2 [capsid] | VP5 [apoptotic factor] | 396 | 5112 | Fish | |
| | | | N [nucleoprotein] | NSs [blocks interferon and cellular transcription] | 309 | 11412 | Insect | |
| | | | Capsid protein [capsid] | VF1 [virulence factor] | 642 | 6639 | Mammal | |
| | | | VP2 [phosphatase?] | Apoptin [apoptotic factor] | 366 | 1275 | Bird | |
| | | | Capsid protein [capsid] | Pog | 315 | 8115 | Insect | |
| | | | AL1 [rolling circle replication initiator] | AC4 [silencing suppressor] | 426 | 2928 | Plant | |
| | | | Pol, central domain [reverse transcriptase] | L [envelope glycoprotein] | 708 | 1578 | Mammal | |
| | | | Pol, C-term domain [RNAse H] | X [virulence factor] | 252 | 1578 | Mammal | |
| | | | P5 [capsid] | P4 [viral movement] | 468 | 4311 | Plant | |
| | | | NS1 [rolling circle replication initiator] | NS2 | 1122 | 2142 | Insect | |
| | | | NS1 [rolling circle replication initiator] | NS2 | 792 | 4005 | Insect | |
| | | | VP2 [capsid] | AAP [capsid assembly co-factor] | 618 | 3429 | Mammal | |
| | | | Capsid protein [capsid] | SAT [virulence factor] | 210 | 3639 | Mammal | |
| | | | p104 | p130 | 2682 | 3312 | Insect | |
| | | | Capsid protein [capsid] | p17 | 384 | 6591 | Insect | |
| | | | p28 [replicase cofactor] | p23 [virulence factor] | 633 | 2163 | Plant | |
| | | | Capsid protein [capsid] | p25 [viral movement] | 678 | 2163 | Plant | |
| | | | p22 [viral movement] | p19 [silencing suppressor] | 522 | 3648 | Plant | |
| | | | Capsid protein | p31 | 453 | 2625 | Plant | |
| | | | Replicase [Methyltransferase-Guanylyltransferase] | MP [viral movement] | 1881 | 4230 | Plant | |
| | | | TGBp2 [viral movement] | TGBp3, N-term domain | 192 | 8796 | Plant | |
| Unassigned ssRNA(+) | | | ORF4 [viral movement] | ORF3 [viral movement] | 699 | 2619 | Plant | |
gene overlaps described previously (see reference [3]).
additional overlaps collected for this study.
The function is that of the overlapping region of the protein; if it is not known, the field is left blank.
The NS2 proteins of brevidensoviruses and that of densoviruses are not homologous (they are encoded in different frames relative to NS1).
The alphacarmotetravirus polymerase and machlomovirus capsid originated by horizontal transfer; the two corresponding overlaps are therefore not part of the benchmark dataset, although we performed the same analyses on them as on the other overlaps (see text).
Abbreviations: AAP, assembly-activating protein; dsRNA, double-stranded RNA; C-term, C-terminal; L, large envelope protein; MP, movement protein; NABP, nucleic-acid binding protein; NS, non-structural protein; NSs, non-structural protein of the small RNA segment; N-term, N-terminal; Pog, predicted overlapping gene; Pol, Polymerase; SAT, small alternatively translated protein; ssDNA, single-stranded DNA; ssRNA, single-stranded RNA (+, positive or −, negative); TGBp2, Triple Gene Block protein 2; TGBp3, Triple Gene Block protein 3; VP, viral protein.
Figure 2. Definition used for overlapping regions and non-overlapping regions of a viral genome.
If a viral genome contained overlapping genes other than those under study (e.g. the genes to the right), we considered only the non-overlapping regions of these genes; their overlapping regions (in grey) were excluded from the analysis.
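The region definition in this caption can be illustrated with a minimal sketch. It assumes hypothetical gene coordinates (0-based, end-exclusive) on a toy genome and ignores strand and reading frame; the function name `classify_positions` and the gene names are illustrative and do not come from the study.

```python
def classify_positions(genes, study_pair):
    """Split covered genome positions into three groups.

    genes: dict mapping a gene name to (start, end), 0-based, end exclusive.
    study_pair: the two gene names whose overlap is under study.
    Returns (overlap, non_overlap, excluded) as sorted lists of positions:
      overlap      covered by exactly the two genes under study
      non_overlap  covered by exactly one gene (any gene)
      excluded     covered by any other combination of two or more genes
    Positions covered by no gene are not reported.
    """
    cover = {}
    for name, (start, end) in genes.items():
        for pos in range(start, end):
            cover.setdefault(pos, set()).add(name)

    overlap, non_overlap, excluded = [], [], []
    for pos in sorted(cover):
        names = cover[pos]
        if len(names) == 1:
            non_overlap.append(pos)
        elif names == set(study_pair):
            overlap.append(pos)
        else:
            excluded.append(pos)
    return overlap, non_overlap, excluded


# Toy genome (hypothetical): A/B is the overlap under study, C/D is another overlap.
toy_genes = {"A": (0, 30), "B": (20, 50), "C": (60, 90), "D": (80, 100)}
overlap, non_overlap, excluded = classify_positions(toy_genes, ("A", "B"))
print(len(overlap), len(non_overlap), len(excluded))
```

In this toy layout the C/D overlap plays the role of the grey region: it is set aside, while the parts of C and D that do not overlap still count as non-overlapping sequence.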
Analysis of the codon usage of overlapping frames from the benchmark dataset.
| | | | | | Calculations performed on actual frames | | | | | Calculations performed on simulated frames | | | | |
| Genus | Ancestral frame | De novo frame | NA | NN | rsA | rsN | d | t-Hotelling | P< | rsA | rsN | d | P< | Agreement between t-Hotelling and simulation |
| | Capsid | p17 | 49 | 48 | 0.70 | 0.04 | 0.66 | 4.26 | 0.001 | 0.252 | 0.239 | 0.012 | 0.001 | Yes |
| | Capsid | VF1 | 59 | 57 | 0.68 | 0.23 | 0.45 | 3.57 | 0.001 | 0.304 | 0.222 | 0.083 | 0.002 | Yes |
| | VP2 | AAP | 59 | 57 | 0.61 | 0.19 | 0.42 | 3.03 | 0.005 | 0.312 | 0.255 | 0.056 | 0.001 | Yes |
| | Replicase | p23 | 59 | 57 | 0.42 | −0.10 | 0.52 | 3.00 | 0.005 | 0.305 | 0.245 | 0.060 | 0.001 | Yes |
| | VP2 | VP5 | 53 | 57 | 0.62 | 0.24 | 0.38 | 2.74 | 0.005 | 0.285 | 0.239 | 0.046 | 0.004 | Yes |
| Luteo | P5 | P4 | 59 | 52 | 0.44 | 0.01 | 0.43 | 2.72 | 0.005 | 0.314 | 0.284 | 0.030 | 0.0005 | Yes |
| | Replicase | MP | 57 | 59 | 0.79 | 0.59 | 0.20 | 2.65 | 0.01 | 0.294 | 0.292 | 0.002 | 0.073 | No |
| | MP | Replicase | 59 | 59 | 0.65 | 0.40 | 0.25 | 2.30 | 0.025 | 0.357 | 0.288 | 0.069 | 0.080 | No |
| | Capsid | NABP | 37 | 44 | 0.58 | 0.22 | 0.36 | 2.22 | 0.025 | 0.229 | 0.247 | −0.019 | 0.011 | Yes |
| | Capsid | p25 | 59 | 59 | 0.35 | −0.05 | 0.40 | 2.13 | 0.025 | 0.316 | 0.227 | 0.089 | 0.003 | Yes |
| | Capsid | Replicase | 59 | 59 | 0.40 | 0.11 | 0.29 | 1.88 | 0.05 | 0.306 | 0.290 | 0.017 | 0.003 | Yes |
| | VP2 | Apoptin | 59 | 51 | 0.29 | −0.01 | 0.30 | 1.78 | 0.05 | 0.259 | 0.217 | 0.042 | 0.020 | Yes |
| | TGBp2 | TGBp3 | 20 | 33 | 0.42 | 0.03 | 0.39 | 1.75 | 0.05 | 0.237 | 0.206 | 0.031 | 0.015 | Yes |
| | VP2 | SAT | 29 | 35 | 0.47 | 0.12 | 0.35 | 1.69 | 0.10 | 0.234 | 0.224 | 0.010 | 0.063 | Yes |
| | p22 | p19 | 59 | 59 | 0.33 | 0.13 | 0.20 | 1.24 | 0.15 | 0.310 | 0.289 | 0.021 | 0.051 | Yes |
| | Capsid | Pog | 37 | 49 | 0.25 | 0.02 | 0.23 | 1.04 | 0.20 | 0.276 | 0.262 | 0.014 | 0.061 | Yes |
| | NS1 | NS2 | 59 | 57 | 0.36 | 0.19 | 0.17 | 1.04 | 0.20 | 0.278 | 0.303 | −0.024 | 0.083 | Yes |
| | Pol | L | 59 | 55 | 0.42 | 0.29 | 0.13 | 0.98 | 0.20 | 0.241 | 0.252 | −0.011 | 0.146 | Yes |
| | ORF4 | ORF3 | 59 | 55 | 0.40 | 0.23 | 0.17 | 0.97 | 0.20 | 0.280 | 0.279 | 0.001 | 0.101 | Yes |
| | Replicase | AC4 | 53 | 55 | 0.18 | 0.10 | 0.08 | 0.49 | 0.50 | 0.317 | 0.281 | 0.036 | 0.377 | Yes |
| | TGBp2 | TGBp3 | 29 | 27 | 0.24 | 0.36 | −0.12 | 0.39 | 0.50 | 0.195 | 0.209 | −0.014 | 0.297 | Yes |
| | Pol | X | 48 | 44 | 0.06 | 0.10 | −0.04 | 0.24 | 0.50 | 0.221 | 0.217 | 0.004 | 0.438 | Yes |
| | NS1 | NS2 | 59 | 59 | 0.62 | 0.63 | −0.01 | 0.09 | 0.50 | 0.348 | 0.353 | −0.005 | 0.556 | Yes |
| | N | NSs | 55 | 55 | 0.28 | 0.26 | 0.02 | 0.06 | 0.50 | 0.308 | 0.235 | 0.073 | 0.655 | Yes |
| | Capsid | MP | 43 | 41 | 0.31 | 0.32 | −0.01 | 0.03 | 0.50 | 0.313 | 0.278 | 0.035 | 0.426 | Yes |
| Recombinant: |
| | Replicase | p130 | 59 | 59 | 0.00 | 0.51 | −0.51 | 2.94 | 0.005 | 0.279 | 0.303 | −0.024 | 0.001 | Yes |
| | Capsid | p31 | 43 | 41 | 0.34 | 0.18 | 0.16 | 1.01 | 0.20 | 0.273 | 0.230 | 0.044 | 0.189 | Yes |
Abbreviations are the same as in Table 1. The last two overlaps have entered their genome by horizontal transfer (see text).
rsA is the Spearman rank correlation coefficient between the codon usage of the ancestral frame and that of its genome; rsN is the equivalent coefficient for the de novo frame, and d is the difference rsA − rsN. NA and NN are the numbers of codons on which rsA and rsN were calculated. The first row indicates whether calculations are presented for the actual overlapping frames or for the corresponding simulated frames. For the actual frames, P is based on Hotelling's t-test, whereas for simulated frames it is based on the distribution of the simulated d (see text). Agreement between t-Hotelling and simulation is scored according to whether the corresponding P-values are both <0.05 or both >0.05.
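The correlation coefficients in this table can be illustrated with a short sketch of how a Spearman rank correlation between the codon usage of a frame and that of a reference sequence might be computed. This is a minimal sketch under stated assumptions, not the authors' pipeline: it counts all 64 codons (the codon numbers in the table never exceed 59, so the study evidently used a more restricted set), the sequences are toy strings rather than real data, and every function name is hypothetical.

```python
from collections import Counter
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_counts(seq, frame=0):
    """Count complete codons of seq read in the given frame (0, 1 or 2)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return [counts.get(c, 0) for c in CODONS]


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ValueError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for constant input")
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy sequences (not real data): the same overlap read in two frames.
genome = "ATGGCTGCTAAAGCTGAAGCTTTAGCTGCTTAA"
overlap = "ATGGCTGCTAAAGCTGAAGCTTAA"
rs_frame0 = spearman(codon_counts(overlap, 0), codon_counts(genome))
rs_frame1 = spearman(codon_counts(overlap, 1), codon_counts(genome))
print(rs_frame0, rs_frame1, rs_frame0 - rs_frame1)  # last value plays the role of d
```

The significance test on d (Hotelling's t-test or the simulation) is deliberately left out, because the source defers its details to the main text.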
Prediction of the ancestral frame in overlapping genes from the benchmark dataset.
| Genus | rsA | rsN | d | t-Hotelling | P< | Predicted ancestral frame | Prediction correct? |
| | 0.70 | 0.04 | 0.66 | 4.26 | 0.001 | Capsid | Yes |
| | 0.68 | 0.23 | 0.45 | 3.57 | 0.001 | Capsid | Yes |
| | 0.61 | 0.19 | 0.42 | 3.03 | 0.005 | VP2 | Yes |
| | 0.42 | −0.10 | 0.52 | 3.00 | 0.005 | Replicase | Yes |
| | 0.62 | 0.24 | 0.38 | 2.74 | 0.005 | VP2 | Yes |
| | 0.44 | 0.01 | 0.43 | 2.72 | 0.005 | P5 | Yes |
| | 0.79 | 0.59 | 0.20 | 2.65 | 0.01 | Replicase | Yes |
| | 0.65 | 0.40 | 0.25 | 2.30 | 0.025 | MP | Yes |
| | 0.58 | 0.22 | 0.36 | 2.22 | 0.025 | Capsid | Yes |
| | 0.35 | −0.05 | 0.40 | 2.13 | 0.025 | Capsid | Yes |
| | 0.40 | 0.11 | 0.29 | 1.88 | 0.05 | Capsid | Yes |
| | 0.29 | −0.01 | 0.30 | 1.78 | 0.05 | VP2 | Yes |
| | 0.42 | 0.03 | 0.39 | 1.75 | 0.05 | TGBp2 | Yes |
| | 0.47 | 0.12 | 0.35 | 1.69 | 0.10 | - | - |
| | 0.33 | 0.13 | 0.20 | 1.24 | 0.15 | - | - |
| | 0.25 | 0.02 | 0.23 | 1.04 | 0.20 | - | - |
| | 0.36 | 0.19 | 0.17 | 1.04 | 0.20 | - | - |
| | 0.42 | 0.29 | 0.13 | 0.98 | 0.20 | - | - |
| | 0.40 | 0.23 | 0.17 | 0.97 | 0.20 | - | - |
| | 0.18 | 0.10 | 0.08 | 0.49 | 0.50 | - | - |
| | 0.24 | 0.36 | −0.12 | 0.39 | 0.50 | - | - |
| | 0.06 | 0.10 | −0.04 | 0.24 | 0.50 | - | - |
| | 0.62 | 0.63 | −0.01 | 0.09 | 0.50 | - | - |
| | 0.28 | 0.26 | 0.02 | 0.06 | 0.50 | - | - |
| | 0.31 | 0.32 | −0.01 | 0.03 | 0.50 | - | - |
| Recombinant |
| | 0.00 | 0.51 | −0.51 | 2.94 | 0.005 | p130 | No |
| | 0.34 | 0.18 | 0.16 | 1.01 | 0.20 | - | - |
The last two overlaps have entered their genome by horizontal transfer and are not taken into account for calculations of specificity and sensitivity of the method.
Abbreviations and conventions are the same as in Table 2. A frame is predicted ancestral if its rs is positive and significantly higher than the rs of the other frame (P<0.05, corresponding to t-Hotelling >1.70). If no prediction is possible, the field is left blank. Numerical values are the same as in Table 3 for actual frames, but are reproduced here for clarity.
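The decision rule stated in this note can be written as a small function. This is a minimal sketch, assuming the t-Hotelling value is reported unsigned (as it appears to be in the tables, where rows with a negative difference still carry a positive t) and using the quoted threshold of 1.70; the function name `predict_ancestral` is hypothetical. The four example rows reuse values from the tables above, with frame 1 being the frame listed as ancestral and frame 2 the frame listed as de novo.

```python
T_CRITICAL = 1.70  # threshold quoted in the table note for P < 0.05


def predict_ancestral(rs_1, rs_2, t_hotelling, t_critical=T_CRITICAL):
    """Return 1 or 2 for the frame predicted ancestral, or None for no call.

    A frame is called ancestral only if the difference between the two
    correlations is significant (t above the threshold) and the higher
    of the two correlations is positive.
    """
    if t_hotelling <= t_critical:
        return None
    frame, rs_best = (1, rs_1) if rs_1 > rs_2 else (2, rs_2)
    return frame if rs_best > 0 else None


# (frame 1, frame 2, rs of frame 1, rs of frame 2, t-Hotelling), from the tables
rows = [
    ("Capsid", "p17", 0.70, 0.04, 4.26),
    ("TGBp2", "TGBp3", 0.42, 0.03, 1.75),
    ("VP2", "SAT", 0.47, 0.12, 1.69),
    ("Replicase", "p130", 0.00, 0.51, 2.94),
]
calls = []
for frame_1, frame_2, rs_1, rs_2, t in rows:
    call = predict_ancestral(rs_1, rs_2, t)
    calls.append({1: frame_1, 2: frame_2, None: "-"}[call])
print(calls)
```

The sketch reproduces the corresponding entries of the "Predicted ancestral frame" column: a call for the capsid and TGBp2 frames, no call for the VP2/SAT overlap (t just below the threshold), and the incorrect call of p130 for the recombinant overlap.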
Prediction, by codon usage, of the ancestral frame in overlapping reading frames with identical phylogenetic distribution.
| Phylogenetic distribution | Genome accession number | Species | Frame 1 [function] | Frame 2 [function] | Length of overlap (nt) | Length of non-overlapping regions (nt) | rs1 | rs2 | t-Hotelling | P< | Predicted ancestral frame | Predicted de novo frame |
| Genus | NC_004142 | | Replicase A, C-term domain | B2 [silencing suppressor] | 300 | 3930 | 0.62 | 0.16 | 2.74 | | Replicase A | B2 |
| Genus | HTU19949 | | Rex [post-transcriptional regulator] | Tax [transcription activator] | 510 | 6021 | 0.32 | 0.58 | 2.06 | | Tax | Rex |
| Genus | NC_003448 | | Replicase A, C-term domain | B2 | 231 | 3744 | 0.04 | 0.47 | 1.59 | 0.10 | - | - |
| Genus | NC_003842 | | Replicase, C-term domain | 2b [silencing suppressor] | 276 | 7338 | 0.32 | 0.26 | 0.37 | 0.50 | - | - |
| Genus | NC_001747 | | P0 [silencing suppressor] | P1, N-term domain | 612 | 3789 | 0.30 | 0.24 | 0.38 | 0.50 | - | - |
| Genus | NC_001747 | | P1 | Replicase, N-term domain | 456 | 3789 | 0.35 | 0.33 | 0.10 | 0.50 | - | - |
| Genus | NC_002035 | | Replicase, C-term domain | 2b [silencing suppressor] | 243 | 6900 | 0.14 | 0.11 | 0.12 | 0.50 | - | - |
Conventions are the same as in Table 3. A frame is predicted ancestral if its rs is positive and significantly higher than the rs of the other frame (P<0.05, corresponding to t-Hotelling >1.70).
Figure 3. A “gene nursery”: the pX region of deltaretroviruses.
The pX region of HTLV1 encodes five genes unique to deltaretroviruses through a complex pattern of alternative splicing and leaky scanning [36], [39]. The initial exons of these genes are very short and are not represented, nor are the shorter, alternatively expressed versions of p12 and p30. Only the 3′ end of the Env gene is represented. The figure is approximately to scale. Ancestral regions are shown in red and de novo regions in blue. Frame numbering is as in [45], with the Tax frame taken as “0”. Protein regions with unusually low sequence complexity are indicated by dashed grey lines.
Figure 4. Presumed evolution of the deltaretrovirus pX region.
The deltaretrovirus phylogeny is shown as a cladogram. Conventions are the same as in Figure 3.
Figure 5. A genomic hotspot of origination of silencing suppressors in plus-strand RNA viruses.
The replicases of Nodaviridae and Bromoviridae contain C-terminal extensions, predicted to be disordered (thin boxes), downstream of their homologous polymerase (RdRP) domain. These extensions encode, in different reading frames, structurally unrelated suppressors of RNA silencing, B2 and 2b (PDB accession codes 2AZ2 and 2ZI0, respectively). Neither the C-terminal extensions nor the suppressors of RNA silencing have detectable sequence similarity, even between closely related genera. Which region is ancestral in each overlap could not be determined (see text).