Emanuele Marchi, Alex Kanapin, Gkikas Magiorkinis, Robert Belshaw.
Abstract
One lineage of human endogenous retroviruses (HERVs), HERV-K(HML2), is upregulated in many cancers, some autoimmune/inflammatory diseases, and HIV-infected cells. Despite 3 decades of research, it is not known if these viruses play a causal role in disease, and there has been recent interest in whether they can be used as immunotherapy targets. Resolution of both these questions will be helped by an ability to distinguish between the effects of different integrated copies of the virus (loci). Research so far has concentrated on the 20 or so recently integrated loci that, with one exception, are in the human reference genome sequence. However, this viral lineage has been copying in the human population within the last million years, so some loci will inevitably be present in the human population but absent from the reference sequence. We therefore performed the first detailed search for such loci by mining whole-genome sequences generated by next-generation sequencing. We found a total of 17 loci, and the frequency of their presence ranged from only 2 of the 358 individuals examined to over 95% of them. On average, each individual had six loci that are not in the human reference genome sequence. Comparing the number of loci that we found to an expectation derived from a neutral population genetic model suggests that the lineage was copying until at least ∼250,000 years ago.

IMPORTANCE: About 5% of the human genome sequence is composed of the remains of retroviruses that over millions of years have integrated into the chromosomes of egg and/or sperm precursor cells. There are indications that protein expression of these viruses is higher in some diseases, and we need to know (i) whether these viruses have a role in causing disease and (ii) whether they can be used as immunotherapy targets in some of them. Answering both questions requires a better understanding of how individuals differ in the viruses that they carry. We carried out the first careful search for new viruses in some of the many human genome sequences that are now available thanks to advances in sequencing technology. We also compared the number that we found to a theoretical expectation to see if it is likely that these viruses are still replicating in the human population today.
Year: 2014 PMID: 24920817 PMCID: PMC4136357 DOI: 10.1128/JVI.00919-14
Source DB: PubMed Journal: J Virol ISSN: 0022-538X Impact factor: 5.103
FIG 1 Detection of integrations not in the human reference sequence. (A) Schematic of pipeline for finding loci showing how mapping of trimmed reads is linked to result of RetroSeq analysis. Mapping creates a cluster of trimmed reads that are derived from HK2 loci, which are inside the cluster of RetroSeq anchor reads. In contrast, trimmed reads derived from other regions by chance sequence similarity are scattered around the genome. The next stage is confirmation of integration by BreakAlign analysis. Chr, chromosome. (B) Example of the Integrative Genomics Viewer genome browser (49) screenshot showing evidence for the 4q22.3 locus (from chromosome 4 [chr4], coordinates 9602941 to 9603548). (Top) Mapping of all reads with colored ones representing RetroSeq anchors (see Materials and Methods; the color shows the chromosome on which the mate has been mapped to another HK2 locus in the reference genome); (middle) mapping of trimmed reads, with the coverage at each nucleotide position being shown above the reads. The short overlap representing the 6-nt target site duplication causes a doubling of coverage at these 6 nt, forming the tower in the characteristic submarine-shaped profile of the coverage. (Bottom) RepeatMasker track. In this instance, the HK2 virus has integrated into an existing ERV belonging to another lineage, HERVS71.
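The coverage "tower" described in panel B can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the read intervals are invented toy data, and the helper names `coverage` and `tower` are hypothetical. It assumes that trimmed reads from the upstream flank end at the far edge of the target site duplication and that reads from the downstream flank start at its near edge, so the two groups overlap only across the duplicated 6 nt.

```python
from typing import List, Tuple

def coverage(reads: List[Tuple[int, int]], length: int) -> List[int]:
    """Per-base read depth for half-open (start, end) intervals."""
    depth = [0] * length
    for start, end in reads:
        for pos in range(start, end):
            depth[pos] += 1
    return depth

def tower(depth: List[int]) -> Tuple[int, int]:
    """Half-open span of the positions with maximal depth (the 'tower')."""
    peak = max(depth)
    positions = [i for i, d in enumerate(depth) if d == peak]
    return positions[0], positions[-1] + 1

# Toy data (hypothetical): upstream reads end at 56, downstream reads start at 50.
upstream = [(20, 56), (30, 56), (40, 56)]
downstream = [(50, 80), (50, 70), (50, 90)]

depth = coverage(upstream + downstream, length=100)
start, end = tower(depth)
print(f"tower at {start}-{end}, width {end - start} nt, depth {depth[start]}")
```

In this toy case the depth doubles only where the two groups of reads overlap, which is the signature the caption attributes to the target site duplication. Real data would need tolerance for uneven depth rather than an exact maximum.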
FIG 2 Validation of integrations. Edited output of the BreakAlign program showing a few representative chimeric NGS reads that span the integration site of unfixed loci. In each read, part of the sequence is viral (red lowercase nucleotides) and the other part aligns to the reference preintegration sequence shown above (on a black background). For each locus, we have chimeric reads from upstream and downstream flanks of the integration, both of which contain the target site duplication (TSD), which is 6 nt long unless indicated as 5 nt long. Loci found in TCGA patients that are not shown here are described by Marchi et al. (33).
FIG 3 How chimeric reads result from ERV integration. (A to D) A guide to interpretation of outputs by use of locus 5q12.3 as an example. After reverse transcription, viral double-stranded DNA (red) is integrated into the chromosome. The viral integrase enzyme makes a staggered cut, typically of 6 nt, into which the viral DNA is inserted. DNA repair of the now single-stranded DNA on either side of the integration produces six identical nucleotides (the target site duplication) flanking the virus. (E) However, in some cases the virus has integrated in reverse orientation, and an example of where this has occurred is shown for locus 1p21.1. Note the changed viral sequence.
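The mechanism in this caption can be mimicked with a short Python sketch. The sequences are made up, and the function names `integrate` and `reverse_complement` are hypothetical. It models only the end result on one strand: the nucleotides spanned by the staggered cut appear on both sides of the inserted viral DNA, and a reverse-orientation integration shows the reverse complement of the viral sequence.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq: str) -> str:
    """Reverse complement, preserving upper/lower case."""
    return seq.translate(COMPLEMENT)[::-1]

def integrate(target: str, cut: int, virus: str,
              stagger: int = 6, reverse: bool = False) -> str:
    """Insert `virus` at a staggered cut starting at `cut`.

    The `stagger` nucleotides after `cut` end up duplicated on both
    flanks of the insertion (the target site duplication).
    """
    tsd = target[cut:cut + stagger]
    insert = reverse_complement(virus) if reverse else virus
    return target[:cut] + tsd + insert + tsd + target[cut + stagger:]

# Toy preintegration site (uppercase) and toy viral sequence (lowercase).
site = "TTGACCATGCAAGTCCGA"
virus = "tgtgggg"

forward = integrate(site, cut=6, virus=virus)
flipped = integrate(site, cut=6, virus=virus, reverse=True)
print(forward)
print(flipped)
```

A read that spans either junction of `forward` is chimeric in the sense of FIG 2: part host (uppercase), part viral (lowercase), with the duplicated 6 nt at the boundary.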
The 17 HK2 loci that are not in the human reference genome
| Cytoband | Coordinate | Other name(s) | Flanking region | Frequency, TCGA | Frequency, WGS500 | Frequency, Lee et al. |
|---|---|---|---|---|---|---|
| 1p21.1 | 106015874–106015881 | | — | 0.04 | 0.003 | 0 |
| 1p13.2 | 111802591–111802598 | DE5, ERVK1 | L1 | 0.62 | 0.593 | 0.35 |
| 1q41 | 223578303–223578310 | ERVK2 | L1 | 0 | 0.006 | 0.02 |
| 4q22.3 | 9603239–9603245 | ERVK6 | ERV | 0.96 | 0.958 | 0.86 |
| 5q12.3 | 64388439–64388446 | ERVK9 | L1 | 0.15 | 0.075 | 0.12 |
| 5q14.1 | 80442265–80442272 | DE6, NE1, ERVK10 | RASGRF2 intron | 0 | 0.093 | 0.14 |
| 6p21.32 | 32648035–32648041 | | L1 | 0.46 | 0.443 | 0 |
| 6q26 | 161270898–16127090 | DE2, ERVK12 | — | 0.96 | 0.834 | 0.70 |
| 9q34.11 | 132205208 | DE7, ERVK16 | MaLR | 1.00 | 0.961 | 0.33 |
| 11q12.2 | 60449889 | DE4, ERVK18 | L1 | 0 | 0.003 | 0.02 |
| 12q12 | 44313656–44313662 | ERVK20 | L1 in TMEM117 intron | 0.31 | 0.241 | 0.14 |
| 12q24.31 | 124066476–124066483 | ERVK21 | Alu | 0.35 | 0.238 | 0.14 |
| 13q31.3 | 90743182–90743189 | NE2, ERVK22 | AT rich | 0.15 | 0.190 | 0.12 |
| 15q22.2 | 63374593–63374600 | ERVK24 | Alu | 0.81 | 0.889 | 0.79 |
| 19p12 | 21841536–21841542 | K113, DE1, ERVK26 | — | 0.08 | 0.087 | 0.08 |
| 19q12 | 29855781–29855787 | DE3, ERVK28 | — | 0.54 | 0.678 | 0.56 |
| 20p12.1 | 12402386–12402392 | ERVK30 | — | 0 | 0.015 | 0.05 |
The 5- or 6-nt difference between coordinates is the length of the target site duplication (hg19).
From 41 germ line genomes from cancer patients plus 3 healthy individuals from the HapMap project (21).
Single-copy nontranscribed DNA region.
Also found by Kahyo et al. (35).
Locus present in some publicly available HLA haplotype sequences.
We found evidence for only one side of the integration.
Within long noncoding RNA.
As validated by PCR by Lee et al. (21), one integration and one preintegration site for both loci.
The locus is also 12 nt from an Alu.
The distribution of loci among individuals and zygosity in the 26 TCGA patients are given in Table S1 in the supplemental material. Cytobands are taken from http://www.tallphil.co.uk/bioinformatics/cytobands.
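As a rough cross-check of the table against the abstract, the per-locus frequencies can be summed. The Python sketch below assumes each frequency is the fraction of individuals in that data set who carry the locus, regardless of zygosity. The source does not state this explicitly, so treat it as an assumption. Under it, the sum over loci is the expected number of non-reference loci per individual. The names `FREQ`, `expected_loci`, and `observed_loci` are illustrative.

```python
# Presence frequencies copied from the table: (TCGA, WGS500, Lee et al.)
FREQ = {
    "1p21.1":   (0.04, 0.003, 0),
    "1p13.2":   (0.62, 0.593, 0.35),
    "1q41":     (0,    0.006, 0.02),
    "4q22.3":   (0.96, 0.958, 0.86),
    "5q12.3":   (0.15, 0.075, 0.12),
    "5q14.1":   (0,    0.093, 0.14),
    "6p21.32":  (0.46, 0.443, 0),
    "6q26":     (0.96, 0.834, 0.70),
    "9q34.11":  (1.00, 0.961, 0.33),
    "11q12.2":  (0,    0.003, 0.02),
    "12q12":    (0.31, 0.241, 0.14),
    "12q24.31": (0.35, 0.238, 0.14),
    "13q31.3":  (0.15, 0.190, 0.12),
    "15q22.2":  (0.81, 0.889, 0.79),
    "19p12":    (0.08, 0.087, 0.08),
    "19q12":    (0.54, 0.678, 0.56),
    "20p12.1":  (0,    0.015, 0.05),
}

def expected_loci(column: int) -> float:
    """Expected loci per individual: sum of carrier fractions (assumption)."""
    return sum(freqs[column] for freqs in FREQ.values())

def observed_loci(column: int) -> int:
    """Number of loci seen at least once in the data set."""
    return sum(1 for freqs in FREQ.values() if freqs[column] > 0)

for name, col in (("TCGA", 0), ("WGS500", 1), ("Lee et al.", 2)):
    print(f"{name}: {observed_loci(col)} loci observed, "
          f"{expected_loci(col):.2f} expected per individual")
```

Under this reading, the TCGA and WGS500 columns each sum to a little over six loci per individual, in line with the abstract. The TCGA column also has 13 loci with nonzero frequency, matching the observed number quoted in the FIG 4 caption.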
FIG 4 Comparison of the observed and expected numbers of loci. The number of loci in the 26 TCGA patients predicted by the genetic drift model is shown. Along the x axis are the expectations assuming either that the rate of copying until the present day is constant (the date of extinction is year 0) or that the copying of loci ceased at different dates in the last half million years. The red line across the figure shows the observed number (n = 13). The boxes show the medians, interquartile ranges, and the most extreme values from 10,000 replicates.