Vladimir A Belyi, Arnold J Levine, Anna Marie Skalka.
Abstract
Vertebrate genomes contain numerous copies of retroviral sequences, acquired over the course of evolution. Until recently they were thought to be the only type of RNA viruses to be so represented, because integration of a DNA copy of their genome is required for their replication. In this study, an extensive sequence comparison was conducted in which 5,666 viral genes from all known non-retroviral families with single-stranded RNA genomes were matched against the germline genomes of 48 vertebrate species, to determine if such viruses could also contribute to the vertebrate genetic heritage. In 19 of the tested vertebrate species, we discovered as many as 80 high-confidence examples of genomic DNA sequences that appear to be derived, as long ago as 40 million years, from ancestral members of 4 currently circulating virus families with single strand RNA genomes. Surprisingly, almost all of the sequences are related to only two families in the Order Mononegavirales: the Bornaviruses and the Filoviruses, which cause lethal neurological disease and hemorrhagic fevers, respectively. Based on signature landmarks some, and perhaps all, of the endogenous virus-like DNA sequences appear to be LINE element-facilitated integrations derived from viral mRNAs. The integrations represent genes that encode viral nucleocapsid, RNA-dependent-RNA-polymerase, matrix and, possibly, glycoproteins. Integrations are generally limited to one or very few copies of a related viral gene per species, suggesting that once the initial germline integration was obtained (or selected), later integrations failed or provided little advantage to the host. 
The conservation of relatively long open reading frames for several of the endogenous sequences, the virus-like protein regions represented, and a potential correlation between their presence and a species' resistance to the diseases caused by these pathogens, are consistent with the notion that their products provide some important biological advantage to the species. In addition, the viruses could also benefit, as some resistant species (e.g. bats) may serve as natural reservoirs for their persistence and transmission. Given the stringent limitations imposed in this informatics search, the examples described here should be considered a low estimate of the number of such integration events that have persisted over evolutionary time scales. Clearly, the sources of genetic information in vertebrate genomes are much more diverse than previously suspected.
Year: 2010 PMID: 20686665 PMCID: PMC2912400 DOI: 10.1371/journal.ppat.1001030
Source DB: PubMed Journal: PLoS Pathog ISSN: 1553-7366 Impact factor: 6.823
Figure 1. Organization and transcription maps of Borna disease virus (BDV), Marburgvirus (MARV) and Ebolavirus (EBOV) genomes.
Open reading frames are labeled and indicated by colored boxes, non-coding regions by empty boxes. For BDV, the locations of transcription initiation (S) and termination (T) sites are shown on the scale beneath the genome map. The horizontal arrows below the scale depict the origins of primary transcripts. The two longest BDV transcripts are subjected to alternative splicing to form multiple mature mRNAs. For MARV and EBOV, vertical arrows indicate transcription initiation and termination sites, except for regions of overlap, where these sites are not marked. The pink arrowhead points to the location of an editing site in the GP gene of EBOV.
Sequences derived from single strand RNA viral genes, which are integrated in mammalian genomes.1)
| Borna Disease Virus | Filoviruses | Midway (or similar) Virus | Tamana Bat Virus (or other Flaviviridae) | |||||
| N | M | L | NP | L | VP35 | L | NS3 | |
| Primates | + | |||||||
| Bushbaby | + | |||||||
| Lemur | + | + | ||||||
| Tarsier | + | + | ||||||
| Mouse | + | + | ||||||
| Rat | + | + | ||||||
| Squirrel | + | |||||||
| Guinea pig | + | +/− | ||||||
| Cow | +/− | |||||||
| Microbat | + | + | + | + | ||||
| Shrew | +/− | |||||||
| Opossum | + | + | + | |||||
| Wallaby | +/− | + | + | + | ||||
| Medaka | +/− | + | +/− | |||||
| Takifugu | + | |||||||
| Zebrafish | + | |||||||
| Lamprey | + | |||||||
Integrations with a BLAST E-value below 10⁻¹⁰ are labeled with a plus sign “+”. Integrations with an E-value as high as 10⁻⁵ are marked “+/−”: these may be derived from earlier infections or infections with a different strain of the virus. All integrations were cross-checked against the NCBI database of protein and nucleotide sequences to confirm the viral origin of the sequence. Species are listed in reverse chronological order of the time at which they shared a common ancestor with humans.
While Ebolavirus and Marburgvirus are now recognized as different virus genera, their sequences are closely related. Accordingly, it is not possible to uniquely associate integrated fragments with either virus.
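The labeling rule in the table footnote can be written as a small function. This is a minimal sketch, not the authors' code: the function name `label_integration` is hypothetical, the ASCII string `"+/-"` stands in for the “+/−” symbol, and the treatment of a hit exactly at 10⁻¹⁰ (counted as “+/−” here) is an assumption, since the footnote only says "below".

```python
def label_integration(e_value: float) -> str:
    """Return the table symbol for a BLAST hit, or "" if the hit is too weak to list.

    Thresholds follow the table footnote:
      E-value below 1e-10          -> "+"
      E-value from 1e-10 to 1e-5   -> "+/-"  (boundary handling is an assumption)
    """
    if e_value < 1e-10:
        return "+"
    if e_value <= 1e-5:
        return "+/-"
    return ""


# E-values taken from the table of selected endogenous sequences
print(label_integration(2e-65))  # hsEBLN-1
print(label_integration(5e-07))  # olEBLM
```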
Selected endogenous viral sequences found in vertebrate genomes.
| Species | Scaffold or Chromosome | Virus | Integrated gene | Location within present-day virus protein | Viral protein length | BLAST hit, E-value and percent identity | Label | Significant large ORFs (length and position) |
| Human ( | chr10 | Bornavirus | N | 28–349 | 370aa | 2E-65/41% | hsEBLN-1 | 366aa (full protein) |
| Squirrel ( | scaffold113120 | Bornavirus | N | 40–368 | 370aa | 1E-155/77% | stEBLN | 203aa (residues 170–370) |
| Microbat ( | scaffold144630 | Reston Ebolavirus | VP35 | 74–329 | 329aa | 5E-23/30% | mlEEL35 | 281aa (residues 52–329) |
| Tarsier ( | scaffold521 | Reston Ebolavirus | VP35 | 138–329 | 329aa | 5E-16/34% | tsEEL35 | 131aa (residues 137–261) |
| Grey Mouse Lemur ( | scaffold5488 | Bornavirus | M | 1–123 | 142aa | 4E-13/45% | mmEBLM | 93aa (residues TSS-102) |
| Medaka | scaffold1213 | Bornavirus | M | 15–138 | 142aa | 5E-07/33% | olEBLM | 69aa (residues TSS-71) |
| Microbat ( | scaffold114379 | Bornavirus | L | 189–1066 | 1608aa | 3E-96/42% | mlEBLL-1B | 149aa |
| Microbat ( | scaffold131047 | Lake Victoria Marburgvirus | N | 63–437 | 695aa | 2E-36/32% | mlEELN-1 | 158aa (residues 72–228) and 164aa (residues 228–391) |
| Opossum (Monodelphis domestica) | chr2 | Reston Ebolavirus | NP | 175–409 | 739aa | 4E-39/46% | mdEELN | no significant ORF found |
| Wallaby ( | scaffold117569 | Sudan Ebolavirus | NP | 22–312 | 738aa | 1E-28/33% | meEELN-5 | >218aa likely (incomplete scaffold) |
| Opossum (Monodelphis domestica) | chr3 | Lake Victoria Marburgvirus | L | 605–1354 | 2331aa | 5E-72/ | mdEELL | no significant ORF found |
| Zebrafish ( | chr25 | Midway Virus | L | 238–962 | 1935aa | 8E-027/21% | drEMLL-3 | 761aa (residues TSS-756) and 180aa (residues 792–971) |
Only the top BLAST E-value and average percent identity are shown when BLAST alignment returns multiple gene fragments. Please refer to supplementary data (Tables S1, S2, S3, S4, S5, S6 and S7) for a complete list of integrations and individual BLAST hits.
Presence of direct repeats, viral transcription start sites, and poly-A sequences in some virus-related genomic integrations.
| Insertion | Direct repeat and 5′ TSS sequence | TSS location | 3′ Poly-A sequence and direct repeat | Poly-A location |
| Human hsEBLN-2 | | −3 | … | 1107 |
| Human hsEBLN-3 | | −10 | … | 1133 |
| Human hsEBLN-4 | | −3 | … | 1133 |
| Human hsEBLN-1 | | −21 | … | 1126 |
| Microbat mlEELN-2 | | +21 | … | 2212 |
| Microbat mlEEL35 | | −65 | … | 1869 |
| Tarsier tsEEL35 | | −134 | … | 1311 |
Sequences most resembling the canonical transcription start site (TSS) and canonical poly-A sequences are underlined. Direct repeat sequences flanking virus-derived integrations are shown in bold.
Location of the TSS relative to the estimated position of the coding sequence start, based on the present day viral protein. The expected location is −11 for Bornavirus EBLN insertions, −414 for EBOV EELN insertions, −55 for MARV EELN insertions; and −92 to −97 for EEL35 insertions.
Location of the poly-A sequence relative to the estimated position of the coding sequence start, based on the present day viral protein. The expected location is 1110 for Bornavirus EBLN insertions, 2545 for EBOV EELN insertions, 2730 for MARV EELN insertions, and 1268 to 1455 for EEL35 insertions.
The Bornavirus integration hsEBLN-2 in the human genome is directly followed by an AluSx repeat element, as also observed by Horie et al. [5]. These two integrated sequences are flanked by a common direct repeat.
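To see how closely the observed landmarks match the expected positions, the differences can be computed directly. The sketch below covers only the human Bornavirus EBLN insertions, for which the footnotes give single expected values (−11 for the TSS, 1110 for the poly-A sequence). The names `EXPECTED_EBLN`, `OBSERVED_EBLN` and `landmark_offsets` are hypothetical and used for illustration only.

```python
# Expected (TSS, poly-A) positions for Bornavirus EBLN insertions, from the footnotes
EXPECTED_EBLN = (-11, 1110)

# Observed (TSS, poly-A) positions, copied from the table rows
OBSERVED_EBLN = {
    "hsEBLN-1": (-21, 1126),
    "hsEBLN-2": (-3, 1107),
    "hsEBLN-3": (-10, 1133),
    "hsEBLN-4": (-3, 1133),
}


def landmark_offsets(observed, expected):
    """Return observed-minus-expected offsets for the TSS and poly-A positions."""
    exp_tss, exp_polya = expected
    return {
        name: (tss - exp_tss, polya - exp_polya)
        for name, (tss, polya) in observed.items()
    }


for name, (d_tss, d_polya) in landmark_offsets(OBSERVED_EBLN, EXPECTED_EBLN).items():
    print(f"{name}: TSS offset {d_tss:+d}, poly-A offset {d_polya:+d}")
```

All four insertions fall within about two dozen positions of the expected landmarks, which is the pattern one would expect if the integrated copies derive from viral mRNAs.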
Figure 2. Phylogenetic tree of vertebrates that encode Bornavirus- and Filovirus-like proteins in their genomes.
Bornavirus-related sequences are denoted by icosahedrons and Filovirus-related sequences by triangles. Times of the viral gene integrations are approximate, unless discussed in the text.
Figure 3. Phylogeny of endogenous Filovirus VP35-like gene integrations.
The tree was built with PHYLIP based on a ClustalW alignment, using only aligned residues present in all sequences. The tree is unrooted (the wallaby integration was used as an outgroup for the given representation). Bootstrap values are at least 92, with the exception of Sudan EBOV (54), Côte d'Ivoire EBOV (77), and MARV in bats (70).
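Restricting an alignment to residues present in all sequences amounts to dropping every column that contains a gap. The following is a minimal sketch of that filtering step under stated assumptions: the function name `ungapped_columns` is hypothetical, gaps are assumed to be written as `-`, and the toy sequences are invented for illustration. It does not reproduce the ClustalW or PHYLIP steps themselves.

```python
def ungapped_columns(aligned: dict[str, str], gap: str = "-") -> dict[str, str]:
    """Keep only alignment columns in which no sequence has a gap character."""
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    # Indices of columns where every sequence has a residue
    keep = [
        i for i, column in enumerate(zip(*aligned.values()))
        if gap not in column
    ]
    return {name: "".join(seq[i] for i in keep) for name, seq in aligned.items()}


# Toy alignment (invented sequences, not data from the paper)
toy = {"seqA": "MK-LV", "seqB": "MKALV", "seqC": "M-ALV"}
print(ungapped_columns(toy))
```

The trimmed alignment could then be passed to a tree-building program; columns 2 and 3 of the toy input are removed because at least one sequence has a gap there.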
List of representative vertebrate integrations found by BLAST search and total number of stop codons inside aligned peptide regions.1)
| Integration | Species | Virus | Integrated gene | Total number of stop codons | Total length of BLAST alignments | Sequence identity | Number of stop codons per 100 amino acids |
| drEMLL-4 | Zebrafish | Midway Virus | L | 0 | 365 | 22% | 0.0 |
| stEBLN | Squirrel | Bornavirus | N | 0 | 329 | 77% | 0.0 |
| hsEBLN-1 | Human | Bornavirus | N | 0 | 318 | 41% | 0.0 |
| mlEEL35 | Microbat | Ebola/Marburgvirus | VP35 | 0 | 263 | 30% | 0.0 |
| laEBLN-2 | Elephant | Bornavirus | N | 0 | 256 | 32% | 0.0 |
| mlEBLL-2B | Microbat | Bornavirus | L | 0 | 229 | 37% | 0.0 |
| tsEEL35 | Tarsier | Ebola/Marburgvirus | VP35 | 0 | 191 | 34% | 0.0 |
| olENS3 | Medaka | Tamana Bat Virus | NS3 | 0 | 190 | 28% | 0.0 |
| ogEBLN-1 | Bushbaby | Bornavirus | N | 0 | 168 | 29% | 0.0 |
| mlEBLN-1 | Microbat | Bornavirus | N | 0 | 168 | 29% | 0.0 |
| saEBLN-1 | Shrew | Bornavirus | N | 0 | 167 | 31% | 0.0 |
| ogEBLN-3 | Bushbaby | Bornavirus | N | 0 | 138 | 34% | 0.0 |
| drEMLL-3 | Zebrafish | Midway Virus | L | 1 | 712 | 21% | 0.1 |
| rodEBLN-4 | Mouse | Bornavirus | N | 1 | 269 | 40% | 0.4 |
| rodEBLN-3 | Rat | Bornavirus | N | 1 | 263 | 38% | 0.4 |
| mlEELN-3 | Microbat | Ebola/Marburgvirus | NP | 1 | 204 | 42% | 0.5 |
| rodEBLN-3 | Mouse | Bornavirus | N | 2 | 314 | 37% | 0.6 |
| trEBLL | Fugu | Bornavirus | L | 3 | 458 | 43% | 0.7 |
| olEBLL | Medaka | Bornavirus | L | 3 | 340 | 44% | 0.9 |
| mmEBLM | Lemur | Bornavirus | M | 1 | 112 | 45% | 0.9 |
| cpEBLN | Guinea Pig | Bornavirus | N | 2 | 207 | 41% | 1.0 |
| hsEBLN-2 | Human | Bornavirus | N | 3 | 309 | 38% | 1.0 |
| rodEBLN-1 | Rat | Bornavirus | N | 3 | 289 | 43% | 1.0 |
| meEELN-5 | Wallaby | Ebola/Marburgvirus | NP | 3 | 280 | 33% | 1.1 |
| meEBLL-1 | Wallaby | Bornavirus | L | 4 | 364 | 39% | 1.1 |
| rodEBLN-2 | Mouse | Bornavirus | N | 3 | 266 | 39% | 1.1 |
| saEBLN-2 | Shrew | Bornavirus | N | 2 | 152 | 31% | 1.3 |
| mdEELL | Opossum | Ebola/Marburgvirus | L | 8 | 520 | 31% | 1.5 |
| tsEBLN | Tarsier | Bornavirus | N | 2 | 130 | 37% | 1.5 |
| hsEBLN-4 | Human | Bornavirus | p40 | 4 | 234 | 37% | 1.7 |
| olEBLM | Medaka | Bornavirus | M | 2 | 114 | 33% | 1.8 |
| mimEBLN | Lemur | Bornavirus | N | 5 | 277 | 35% | 1.8 |
| rodEBLL | Rat | Bornavirus | L | 12 | 640 | 35% | 1.9 |
| meEEL35 | Wallaby | Ebola/Marburgvirus | VP35 | 4 | 201 | 31% | 2.0 |
| hsEBLN-3 | Human | Bornavirus | N | 6 | 288 | 43% | 2.1 |
| meEELN-4 | Wallaby | Ebola/Marburgvirus | NP | 6 | 272 | 41% | 2.2 |
| rodEBLL | Mouse | Bornavirus | L | 16 | 693 | 28% | 2.3 |
| mdEBLN-1 | Opossum | Bornavirus | N | 7 | 302 | 32% | 2.3 |
| laEBLN-3 | Elephant | Bornavirus | N | 7 | 297 | 33% | 2.4 |
| mdEELN | Opossum | Ebola/Marburgvirus | NP | 6 | 237 | 46% | 2.5 |
| btEBLN | Cow | Bornavirus | N | 5 | 192 | 28% | 2.6 |
| mlEELN-2 | Microbat | Ebola/Marburgvirus | NP | 11 | 419 | 44% | 2.6 |
| meEELN-12 | Wallaby | Ebola/Marburgvirus | NP | 8 | 262 | 39% | 3.1 |
Only representative hits per species are listed. Additional integrations are shown in Supplementary Table S8.
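The last column of the table follows from the two columns before it: the number of stop codons divided by the total alignment length, scaled to 100 amino acids. A minimal sketch of that arithmetic is shown below; the function name `stops_per_100_aa` is hypothetical, and rounding to one decimal place is inferred from the values printed in the table.

```python
def stops_per_100_aa(stop_codons: int, aligned_length: int) -> float:
    """Stop codons per 100 aligned amino acids, rounded to one decimal place."""
    if aligned_length <= 0:
        raise ValueError("aligned_length must be positive")
    return round(100 * stop_codons / aligned_length, 1)


# (stop codons, alignment length) pairs copied from the table
rows = {
    "drEMLL-3": (1, 712),
    "mdEELL": (8, 520),
    "rodEBLL (Rat)": (12, 640),
    "rodEBLL (Mouse)": (16, 693),
}
for label, (stops, length) in rows.items():
    print(label, stops_per_100_aa(stops, length))
```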
Borna disease virus integrations and known host susceptibility to Borna disease.
| Host | Known gene integrations | Natural viral host | Experimental infection |
| Primates/humans | N | No | Yes |
| Rodents (mice, rats) | N, L | No | Yes |
| Lemur | N, M | ||
| Tarsier | N | ||
| Dogs | - | Yes | Yes |
| Horses | - | Yes | |
| Cows | N (?) | Yes | Yes |
| Rabbits | - | Yes | |
| Donkeys | not sequenced | Yes | |
| Sloth | - | Yes | |
| Sheep | not sequenced | Yes | |
| Pigs | - | ||
| Birds | - | Yes | Yes |
| Opossum, wallaby | N, L | ||
| Guinea pig | N | Yes | |
| Squirrel | N |
Whole genome sequences of donkey and sheep were not available at the time of writing.
Figure 4. Domain structure of BDV N (p40) protein, and its alignment with open reading frames encoded in human and squirrel endogenous BDV N-like sequences.
Shaded blue rectangles show open reading frames as seen in today's integrations. Solid black lines show total alignment found by BLAST.
Figure 5. Domain structure of the EBOV N protein, and its alignment with several related endogenous sequences identified by the BLAST program.
Amino acid coordinates marked with (&) have been mapped to the Zaire strain of Ebolavirus and may differ slightly from coordinates in Supplemental Table S4.
Figure 6. Comparisons of Filovirus VP35 protein sequences with those of related endogenous sequences.
A) Domain structure of the EBOV (Zaire) VP35 protein, and its alignment with related endogenous sequences in the microbat and tarsier genomes. Shaded blue rectangles show open reading frames as seen in today's integrations. Solid black lines show total alignment found by BLAST; B) multiple alignment of endogenous sequences in wallaby, tarsier, and microbat, with the present day strains of EBOV and MARV. We used the default color scheme for ClustalW alignment in the Jalview program.
Glycoprotein integration sites.
| Species | Total Number of BLAST hits | Hits that have no retroviral | Virus | Glycoprotein residues | BLAST E-value and percent identity |
| Human | 1 | chr1: 46259885–46260178 | Bornavirus | 32–126 | 1E-07/37% |
| Chimp | 1 | chr1: 46259885–46260178 | Bornavirus | 32–126 | 6E-08/38% |
| Baboon | 1 | on several partial scaffolds | Bornavirus | 6–155 | 3E-13/38% |
| Gorilla | 1 | gene scaffold 2544: 11117–11408 | Bornavirus | 59–155 | 6E-11/43% |
| Macaque | 1 | chr1: 48,422,406–48,422,783 | Bornavirus | 6–126 | 1E-10/35% |
| Tarsier | 7 | scaffold 99624:1,564–1,884 | Reston Ebolavirus | 497–610 | 2E-11/30% |
| Kangaroo rat | 1 | scaffold 40120: 1783–2128 | Marburgvirus | 509–628 | 4E-08/40% |
| Stickleback | 1 | chrVIII: 8,031,996–8,032,262 | Reston Ebolavirus | 533–628 | 2E-07/32% |
| Shrew | 4 | scaffold 231484:15,984–16,259 | Reston Ebolavirus | 559–648 | 3E-06/29% |
| Horse | 13 | chr10: 13,359,493–13,359,900 | Reston Ebolavirus | 508–652 | 3E-06/27% |
| Zebrafish | 10 | chr15: 6,268,907–6,269,284 | Sudan Ebolavirus | 516–651 | 3E-05/27% |
| | | zv8_NA3400: 7,797–8,099 | Reston Ebolavirus | 519–628 | 2E-07/32% |
| Tetraodon | 1 | chr1:15,715,939–15,716,202 | Zaire Ebolavirus | 532–626 | 1E-06/29% |
| Fugu | 1 | chrUn:120,943,623–120,943,895 | Zaire Ebolavirus | 530–627 | 1E-07/31% |
| Sloth | 4 | None | |||
| Cow | 1 | None | |||
| Squirrel | 1 | None | |||
| Platypus | 3 | None | |||
| Chicken | 8 | None | |||
| Zebrafinch | 21 | None |
All regions were tested for nearby gag, pol, and LTR elements to eliminate sequences of retroviral origin, as described in the methods section.
Only the most similar strain of virus is shown for filovirus-like integrations.
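The retroviral screen mentioned in the footnote can be pictured as a proximity filter: a hit is discarded if an annotated gag, pol, or LTR element lies nearby on the same chromosome or scaffold. The sketch below illustrates that idea only. The names `Interval`, `is_near` and `non_retroviral_hits` are hypothetical, the 10,000 bp default window is an assumed value that is not taken from the paper, and the coordinates are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    chrom: str
    start: int
    end: int


def is_near(hit: Interval, element: Interval, window: int) -> bool:
    """True if the element lies within `window` bp of the hit on the same sequence."""
    if hit.chrom != element.chrom:
        return False
    # Distance between the two intervals; 0 when they overlap
    gap = max(hit.start - element.end, element.start - hit.end, 0)
    return gap <= window


def non_retroviral_hits(hits, retro_elements, window=10_000):
    """Keep hits with no annotated retroviral element within `window` bp (assumed value)."""
    return [
        hit for hit in hits
        if not any(is_near(hit, element, window) for element in retro_elements)
    ]


# Invented coordinates for illustration only
hits = [
    Interval("chrA", 1_000, 1_300),
    Interval("chrA", 50_000, 50_300),
    Interval("chrB", 1_000, 1_300),
]
ltr_like = [Interval("chrA", 3_000, 9_000)]
print(non_retroviral_hits(hits, ltr_like, window=5_000))
```

In this toy case only the first hit is removed, because it sits 1,700 bp from the annotated element; the other two are either far away or on a different sequence.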