Diego Simón1,2,3, Juan Cristina2, Héctor Musto1.
Abstract
The genetic material of the three domains of life (Bacteria, Archaea, and Eukaryota) is always double-stranded DNA, and their GC content (molar content of guanine plus cytosine) varies between ≈ 13% and ≈ 75%. Nucleotide composition is the simplest way of characterizing genomes. Despite this simplicity, it has several implications: it is the main factor that determines, among other features, dinucleotide frequencies, repeated short DNA sequences, and codon and amino acid usage. Which forces drive this strong variation is still a matter of controversy. For rather obvious reasons, most studies of this variation and its consequences have been done in free-living organisms. However, no recent comprehensive study of all known viruses (that is, one covering all available sequences) has been done. Viruses, by far the most abundant biological entities on Earth, are the causative agents of many diseases. An overview of these entities is also important because their genetic material is not always double-stranded DNA: certain viruses have single-stranded DNA, double-stranded RNA, or single-stranded RNA genomes, and some are retro-transcribing. Therefore, one may wonder whether what we have learned about the evolution of GC content and its implications in prokaryotes and eukaryotes also applies to viruses. In this contribution, we attempt to describe compositional properties of ∼ 10,000 viral species: base composition (globally and according to Baltimore classification), correlations among non-coding regions and the three codon positions, and the relationship of the nucleotide frequencies and codon usage of viruses with the same features of their hosts. This allowed us to determine that the base composition of phages correlates strongly with that of their respective hosts, while that of eukaryotic viruses does not (with fungi and protists as exceptions).
Finally, we discuss some of these results concerning codon usage: reinforcing previous results, we found that phages and hosts exhibit moderate to high correlations, while for eukaryotes and their viruses the correlations are weak or do not exist.
Keywords: GC-content; base composition; codon usage; compositional correlations; viral diversity
Year: 2021 PMID: 34262534 PMCID: PMC8274242 DOI: 10.3389/fmicb.2021.646300
Source DB: PubMed Journal: Front Microbiol ISSN: 1664-302X Impact factor: 5.640
The total number of viruses analyzed and within each Baltimore classification group.
| Total* | dsDNA | ssDNA | dsRNA | +ssRNA | –ssRNA | +ssRNA-RT | dsDNA-RT |
| 9,994 | 4,165 | 1,951 | 388 | 1,551 | 621 | 78 | 107 |
The total number of hosts represented in this study and within each taxonomic group considered.
| Total | Animals | Archaea | Bacteria | Fungi | Plants | Protists |
| 1,170 | 378 | 31 | 486 | 72 | 181 | 22 |
FIGURE 1. Base composition of (A) all viruses, and by Baltimore classification groups (B–H); i.e., (B) double-stranded DNA (dsDNA), (C) single-stranded DNA (ssDNA), (D) double-stranded RNA (dsRNA), (E) positive single-stranded RNA (+ssRNA), (F) negative single-stranded RNA (–ssRNA), (G) +ssRNA retro-transcribing (+ssRNA-RT), and (H) dsDNA retro-transcribing (dsDNA-RT).
FIGURE 2. Association between the GC content of non-coding regions and (A) GC1, (B) GC2, and (C) GC3. The slope and adjusted R² (adjR²) of the linear regression model (dashed line) are given for all colored dots, regardless of color.
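The slope and adjusted R² reported in the figure captions can be illustrated with a minimal sketch of an ordinary least-squares fit with one predictor. The function name `linear_fit` and the sample values are hypothetical, and the source does not state which software or exact regression settings the authors used.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit y = intercept + slope * x.

    Returns (slope, intercept, adjusted R^2). Illustrative only; this is
    an assumption about the kind of model behind the captions.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two equal-length sequences with at least 3 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each contain at least two distinct values")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares of the fitted line
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    r2 = 1 - ss_res / syy
    # Adjusted R^2 for a single predictor: penalize by (n - 1) / (n - 2)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)
    return slope, intercept, adj_r2


# Made-up GC fractions: non-coding GC (x) versus GC at one codon position (y)
slope, intercept, adj_r2 = linear_fit([0.30, 0.40, 0.50, 0.60], [0.35, 0.42, 0.55, 0.61])
print(f"slope={slope:.2f} adjR2={adj_r2:.2f}")
```

A slope near 1 with a high adjusted R² would indicate that the two compositional measures move together almost one-to-one; a shallow slope indicates that one measure varies much less than the other.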
Spearman’s rank correlation coefficients between the GC content of non-coding regions and GC1, GC2, and GC3 (GC content at the first, second, and third codon positions), when available within viral genomes, sorted by Baltimore classification group.
| Baltimore | GC1 | GC2 | GC3 | n |
| dsDNA | 0.97 | 0.94 | 0.94 | 3,884 |
| ssDNA | 0.47 | 0.56 | 0.42 | 1,881 |
| dsRNA | 0.57 | 0.60 | 0.57 | 352 |
| +ssRNA | 0.50 | 0.58 | 0.52 | 1,486 |
| –ssRNA | 0.61 | 0.54 | 0.71 | 597 |
| +ssRNA-RT | 0.51 | 0.60 | 0.69 | 76 |
| dsDNA-RT | 0.48 | 0.48 | 0.65 | 105 |
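As an illustration of the quantities correlated in this table, GC1, GC2, and GC3 can be computed from an in-frame coding sequence as sketched below. The function name `gc_by_codon_position` and the toy sequence are hypothetical; the source does not describe how the authors handled ambiguous bases or partial codons, so the choices here (ignore non-G/C symbols, drop a trailing partial codon) are assumptions.

```python
def gc_by_codon_position(cds: str) -> tuple:
    """Return (GC1, GC2, GC3) as fractions for an in-frame coding sequence."""
    seq = cds.upper().replace("U", "T")  # accept RNA or DNA alphabets
    usable = len(seq) - len(seq) % 3     # assumption: drop a trailing partial codon
    n_codons = usable // 3
    if n_codons == 0:
        raise ValueError("sequence is shorter than one codon")
    counts = [0, 0, 0]
    for i in range(usable):
        if seq[i] in "GC":
            counts[i % 3] += 1           # i % 3 is the codon position (0, 1, 2)
    return tuple(c / n_codons for c in counts)


# Toy sequence with codons ATG, GCC, TAA
gc1, gc2, gc3 = gc_by_codon_position("ATGGCCTAA")
print(round(gc1, 3), round(gc2, 3), round(gc3, 3))
```

For a whole genome, the per-position counts would be accumulated over all annotated coding sequences before dividing by the total number of codons.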
FIGURE 3. Association between the GC content of hosts and that of their infecting viruses for (A) all host-virus pairs, (B) prokaryotic pairs, and (C) eukaryotic pairs. The slope and adjusted R² (adjR²) of the linear regression model (dashed line) are given for all colored dots (regardless of color), omitting pairs that do not apply, which are shown as gray dots in the background: (B) eukaryotic or (C) prokaryotic pairs.
Spearman’s correlation coefficients (ρ) and adjusted R² (adjR²) coefficients between codon frequencies of phages (first and second columns) or eukaryotic viruses (third and fourth columns), and the respective values of their hosts.
| Codon | Phages ρ | Phages adjR² | Eukaryotic viruses ρ | Eukaryotic viruses adjR² |
| UUU | 0.85 | 0.70 | 0.01 | 0.00 |
| UUC | 0.71 | 0.50 | −0.13 | 0.00 |
| UUA | 0.92 | 0.85 | 0.14 | 0.00 |
| UUG | 0.49 | 0.20 | 0.14 | 0.02 |
| CUU | 0.64 | 0.36 | 0.10 | 0.01 |
| CUC | 0.85 | 0.65 | 0.07 | 0.01 |
| CUA | 0.74 | 0.49 | −0.09 | 0.01 |
| CUG | 0.77 | 0.57 | 0.19 | 0.05 |
| AUU | 0.82 | 0.66 | 0.13 | 0.01 |
| AUC | 0.80 | 0.62 | 0.00 | 0.00 |
| AUA | 0.85 | 0.75 | 0.06 | 0.00 |
| AUG | 0.60 | 0.29 | 0.10 | 0.00 |
| GUU | 0.67 | 0.46 | 0.21 | 0.04 |
| GUC | 0.81 | 0.67 | 0.05 | 0.01 |
| GUA | 0.77 | 0.52 | −0.00 | 0.00 |
| GUG | 0.75 | 0.52 | 0.14 | 0.02 |
| UAU | 0.84 | 0.69 | 0.07 | 0.00 |
| UAC | 0.57 | 0.30 | 0.13 | 0.02 |
| UAA | 0.73 | 0.54 | 0.09 | 0.00 |
| UAG | 0.48 | 0.13 | 0.04 | 0.00 |
| CAU | 0.70 | 0.51 | 0.19 | 0.04 |
| CAC | 0.83 | 0.67 | −0.03 | 0.00 |
| CAA | 0.87 | 0.72 | 0.09 | 0.00 |
| CAG | 0.59 | 0.47 | 0.16 | 0.03 |
| AAU | 0.86 | 0.72 | 0.23 | 0.03 |
| AAC | 0.26 | 0.09 | 0.04 | 0.00 |
| AAA | 0.90 | 0.82 | 0.02 | 0.00 |
| AAG | 0.35 | 0.11 | 0.23 | 0.01 |
| GAU | 0.73 | 0.57 | 0.16 | 0.03 |
| GAC | 0.78 | 0.65 | 0.25 | 0.05 |
| GAA | 0.82 | 0.67 | 0.06 | 0.01 |
| GAG | 0.72 | 0.46 | −0.02 | 0.00 |
| UCU | 0.61 | 0.29 | 0.05 | 0.00 |
| UCC | 0.71 | 0.44 | −0.01 | 0.00 |
| UCA | 0.80 | 0.59 | 0.09 | 0.00 |
| UCG | 0.82 | 0.72 | 0.25 | 0.04 |
| CCU | 0.55 | 0.27 | −0.03 | 0.00 |
| CCC | 0.84 | 0.70 | 0.13 | 0.02 |
| CCA | 0.65 | 0.35 | 0.07 | 0.00 |
| CCG | 0.81 | 0.62 | 0.05 | 0.00 |
| ACU | 0.69 | 0.37 | −0.06 | 0.00 |
| ACC | 0.85 | 0.67 | 0.24 | 0.05 |
| ACA | 0.81 | 0.72 | −0.02 | 0.00 |
| ACG | 0.65 | 0.41 | −0.00 | 0.00 |
| GCU | 0.49 | 0.19 | 0.07 | 0.01 |
| GCC | 0.82 | 0.61 | 0.08 | 0.02 |
| GCA | 0.51 | 0.25 | −0.04 | 0.00 |
| GCG | 0.76 | 0.57 | 0.04 | 0.01 |
| UGU | 0.72 | 0.49 | 0.09 | 0.01 |
| UGC | 0.62 | 0.39 | 0.08 | 0.01 |
| UGA | 0.66 | 0.51 | −0.00 | 0.00 |
| UGG | 0.39 | 0.19 | 0.07 | 0.00 |
| CGU | 0.54 | 0.28 | 0.28 | 0.05 |
| CGC | 0.79 | 0.56 | 0.12 | 0.02 |
| CGA | 0.13 | 0.05 | −0.01 | 0.00 |
| CGG | 0.84 | 0.67 | −0.02 | 0.00 |
| AGU | 0.82 | 0.67 | 0.10 | 0.01 |
| AGC | 0.35 | 0.14 | 0.09 | 0.02 |
| AGA | 0.79 | 0.62 | 0.04 | 0.00 |
| AGG | 0.33 | 0.37 | 0.15 | 0.02 |
| GGU | 0.48 | 0.20 | 0.10 | 0.01 |
| GGC | 0.78 | 0.56 | 0.21 | 0.04 |
| GGA | 0.55 | 0.39 | −0.07 | 0.00 |
| GGG | 0.59 | 0.30 | 0.16 | 0.02 |
| Median | 0.73 | 0.52 | 0.08 | 0.01 |
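The per-codon comparison summarized in this table can be sketched as follows: compute the relative frequency of each codon in a coding sequence, then, for a given codon, correlate the virus values with the host values across host-virus pairs using Spearman's rank correlation. The function names and sample numbers are hypothetical, and the exact frequency definition used by the authors (for example, per-genome relative frequency versus per-thousand counts) is not stated in the source, so this is a sketch under those assumptions.

```python
from collections import Counter


def codon_frequencies(cds: str) -> dict:
    """Relative frequency of each codon in an in-frame coding sequence."""
    seq = cds.upper().replace("T", "U")  # report codons in RNA notation
    usable = len(seq) - len(seq) % 3     # assumption: drop a trailing partial codon
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1         # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    return cov / (sd_x * sd_y)


# Made-up frequencies of one codon in four viruses and in their hosts
virus_freq = [0.010, 0.022, 0.031, 0.045]
host_freq = [0.012, 0.020, 0.035, 0.040]
print(spearman_rho(virus_freq, host_freq))
```

Repeating the last step for each of the 64 codons would yield one ρ per codon, which is the shape of the table above.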