Jainy Thomas, Hervé Perron, Cédric Feschotte.
Abstract
BACKGROUND: Human endogenous retroviruses (HERVs) occupy a substantial fraction of the genome and impact cellular function with both beneficial and deleterious consequences. The vast majority of HERV sequences descend from ancient retroviral families no longer capable of infection or genomic propagation. In fact, most are no longer represented by full-length proviruses but by solitary long terminal repeats (solo LTRs) that arose via non-allelic recombination events between the two LTRs of a proviral insertion. Because LTR-LTR recombination events may occur long after proviral insertion but are challenging to detect in resequencing data, we hypothesize that this mechanism is a source of genomic variation in the human population that remains vastly underestimated.
Keywords: Endogenous retrovirus; HERV-H; HERV-K; HERV-W; Long terminal repeats; Provirus; Solo LTR; Transposable elements
Year: 2018 PMID: 30568734 PMCID: PMC6298018 DOI: 10.1186/s13100-018-0142-3
Source DB: PubMed Journal: Mob DNA
Fig. 1 Structure of a provirus, generation of a solo LTR, and their detection from whole genome sequence data. Structure of a typical provirus (a) with its internal region (red line) encoding gag, pol and env genes flanked by two long terminal repeats (LTRs). Ectopic recombination occurs between the two LTRs of the provirus (b), leading to the deletion of the internal region along with one LTR and resulting in the formation of a solo LTR (c). Note that the 5′ and 3′ junction sequences between the element and the flanking host DNA (black line), including the target site duplication (not shown), remain the same after recombination. When the reference allele is a solo LTR, the presence of a provirus is identified from whole genome resequencing data aligned to the reference assembly using the findprovirus pipeline (d). The findprovirus pipeline infers the presence of a provirus from the mates of discordant reads that have significant homology to the internal region of the respective HERV family. The discordant reads are colored light green, and the forward and reverse reads originating from the same fragment are matched by numbers (e.g. F1 and R1). The findsoloLTR pipeline identifies the presence of a solo LTR when the reference allele is a provirus (e). It infers the presence of a solo LTR based on the deviation of read depth across the provirus and across the flank.
Fig. 2 Flowchart of the findprovirus pipeline. The first step indexes the coordinates of solo LTRs of a HERV family in the reference genome. Mapped reads (with mapping quality score (MAPQ) equal to or greater than 30) and mates of discordant reads are extracted in a window extending ±100 bp from each LTR. Homology-based searches are performed with the mates of discordant reads against the consensus internal sequence of the respective HERV family to infer the presence of a provirus allele at the locus. The read depth for each locus is calculated and compared to the average read depth for all solo LTRs of that family in an individual. Increased read depth may be observed for some candidate loci, reflecting the presence of a provirus allele. A local de novo assembly of the reads is also performed to infer the presence or absence of a solo LTR allele at the locus. These two additional approaches (enclosed by dashed lines) are performed by the pipeline but are not primarily used to infer the presence of a provirus.
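The windowing and filtering steps of the flowchart can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the findprovirus code itself: the `Read` record, the function names, and the `mate_hits_internal` flag (which stands in for the outcome of the homology search against the internal consensus) are all hypothetical, and the window is assumed to span the LTR plus 100 bp on each side.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence, Tuple

MIN_MAPQ = 30   # mapping quality threshold stated in the caption
FLANK_BP = 100  # window extension on each side of the solo LTR


@dataclass(frozen=True)
class Read:
    """Hypothetical, simplified read record (not a real BAM record)."""
    name: str
    chrom: str
    pos: int                  # 1-based leftmost mapped position
    mapq: int
    discordant: bool          # mate maps elsewhere or at an unexpected distance
    mate_hits_internal: bool  # assumed result of the homology search on the mate


def ltr_window(start: int, end: int, flank: int = FLANK_BP) -> Tuple[int, int]:
    """Extend the solo LTR interval by `flank` bp on both sides, clamped at 1."""
    return max(1, start - flank), end + flank


def provirus_support(reads: Iterable[Read], chrom: str,
                     start: int, end: int) -> List[str]:
    """Names of well-mapped discordant reads in the window whose mates
    matched the internal HERV sequence (evidence for a provirus allele)."""
    lo, hi = ltr_window(start, end)
    return [
        r.name for r in reads
        if r.chrom == chrom and lo <= r.pos <= hi
        and r.mapq >= MIN_MAPQ and r.discordant and r.mate_hits_internal
    ]


def depth_relative_to_family(locus_depth: float,
                             family_depths: Sequence[float]) -> float:
    """Ratio of one locus's read depth to the mean depth of all solo LTRs
    of the same family in the same individual (secondary evidence only)."""
    family_mean = sum(family_depths) / len(family_depths)
    return locus_depth / family_mean
```

In this sketch a locus with one or more supporting reads would be flagged as a candidate; how many supporting reads the actual pipeline requires is not stated in the caption, so no cutoff is encoded here.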
Dimorphic HERV-K, HERV-H and HERV-W candidates
| HERV name | Coordinate (GRCh38/hg38) | Reference allele | Previously reported | Evidence |
|---|---|---|---|---|
| 1p31.1_K2 | chr1:73129298–73130265 | Solo LTR | [ | b |
| K111/K105 | chrUn_GL000219v1:175210–176178 | Solo LTR | [ | b |
| 1p31.1_K3 | chr1:75377086–75383458 | Provirus | [ | c |
| 11q22.1_K1 | chr11:101695063–101704528 | Provirus | [ | c |
| 12q14.1_K1 | chr12:58327459–58336915 | Provirus | [ | c |
| 3q13.2_K2 | chr3:113024277–113033435 | Provirus | [ | c |
| 3q27.2_K1 | chr3:185562548–185571727 | Provirus | [ | |
| 5q33.3_K1 | chr5:156657706–156666885 | Provirus | [ | c |
| 6q14.1_K1 | chr6:77716945–77726366 | Provirus | [ | c |
| 7p22.1_K1 | chr7:4582426–4591897 | Provirus | [ | |
| 5p13.3_K2 | chr5:30486653–30496098 | Provirus | This study | c, a |
| 4q22.1_H8 | chr4:91045790–91046151 | Solo LTR | This study | b, a |
| 5p15.31_H2 | chr5:7262337–7262742 | Solo LTR | This study | b, a |
| 11q13.2_H5 | chr11:68633778–68639439 | Provirus | This study | c |
| 2q34_H4 | chr2:209078020–209084376 | Provirus | This study | a, c |
| 2p14_H2 | chr2:64252414–64257646 | Provirus | This study | c |
| 13q21.32_H1 | chr13:66141332–66147036 | Provirus | This study | c |
| 1q32.2_H3 | chr1:210111090–210116207 | Provirus | This study | c |
| 1p32.3_H6 | chr1:54897904–54903584 | Provirus | This study | c |
| 11q24.3_H2 | chr11:130753498–130759137 | Provirus | This study | c |
| 11p14.3_H1 | chr11:23183934–23189744 | Provirus | This study | c |
| 12p12.1_H2 | chr12:25163213–25169508 | Provirus | This study | c |
| 13q21.1_H1 | chr13:55578228–55584087 | Provirus | This study | c |
| 13q22.3_H1 | chr13:77933223–77939379 | Provirus | This study | c |
| 2q36.1_H5 | chr2:224296633–224302363 | Provirus | This study | c |
| 2p12_H2 | chr2:75213731–75219537 | Provirus | This study | c |
| 3q22.3_H2 | chr3:137595601–137601190 | Provirus | This study | c |
| 3p14.3_H1 | chr3:54634484–54640204 | Provirus | This study | a, c |
| 4q32.3_H5 | chr4:166716125–166722054 | Provirus | This study | c |
| 6q23.2_H3 | chr6:131338800–131344564 | Provirus | This study | c |
| 6p22.3_H3 | chr6:18754144–18759870 | Provirus | This study | c |
| 6p12.2_H1 | chr6:51938241–51944426 | Provirus | This study | c |
| 6q16.1_H1 | chr6:93830156–93835749 | Provirus | This study | c |
| 18q21.1_W2 | chr18:50449151–50449914 | Solo LTR | This study | b, a |
Footnotes: This table only lists candidates identified by our pipeline and supported by at least one additional piece of evidence: (a) PCR validation, (b) alternative allele genomic sequence, (c) annotation in the Database of Genomic Variants. Other candidates not listed here are in Additional file 2 and Additional file 9. The notations ‘K’, ‘H’ and ‘W’ in the HERV name represent the HERV-K(HML2), HERV-H and HERV-W families, respectively.
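For readers who want to load the table coordinates programmatically, the following sketch parses a `chrom:start–end` string, tolerating optional thousands separators and either an en dash or a hyphen. The `Interval` type and `parse_coordinate` function are hypothetical helpers introduced here, and the length calculation assumes 1-based inclusive coordinates, which the table does not state explicitly.

```python
import re
from typing import NamedTuple


class Interval(NamedTuple):
    """Hypothetical container for one table coordinate."""
    chrom: str
    start: int
    end: int

    @property
    def length(self) -> int:
        # Assumes 1-based, inclusive coordinates (an assumption, see lead-in).
        return self.end - self.start + 1


# chrom, then ':', then start and end separated by an en dash or a hyphen;
# digits may carry comma thousands separators.
_COORD = re.compile(r"^(?P<chrom>[^:]+):(?P<start>[\d,]+)[\u2013-](?P<end>[\d,]+)$")


def parse_coordinate(text: str) -> Interval:
    """Parse a coordinate string such as 'chr18:50449151-50449914'."""
    match = _COORD.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised coordinate: {text!r}")
    start = int(match["start"].replace(",", ""))
    end = int(match["end"].replace(",", ""))
    if end < start:
        raise ValueError(f"end precedes start in {text!r}")
    return Interval(match["chrom"], start, end)
```

Under that coordinate assumption, the solo LTR reference alleles in the table span well under 1 kb, while the provirus reference alleles span several kilobases, which is a quick sanity check on parsed rows.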
Fig. 3 Experimental validation of dimorphic HERV loci. The type of HERV allele in the reference assembly is shown within brackets after the name of the element. a PCR amplification of the HERV-W solo LTR at the 18q21.1 locus in the human reference assembly. Primers were designed flanking the solo LTR. PCR amplification of the 18q21.1_W2 provirus with primers designed to the flank and internal gag sequence and with primers to the env sequence and flank. b PCR amplification of the HERV-H solo LTR at the 4q22.1 locus in the reference assembly with primers flanking the solo LTR. PCR amplification of the 4q22.1_H8 provirus with primers designed to the internal env sequence and flank. c PCR amplification of the HERV-H provirus at the 5p15.31 locus with primers designed to the internal env sequence and flank. The reference allele is a solo LTR. d PCR amplification of the HERV-K solo LTR at the 5p13.3 locus with primers flanking the solo LTR. PCR amplification of the reference allele 5p13.3_K2 provirus with primers designed to the internal env sequence and flank. e PCR amplification of the HERV-H solo LTR at the 2q34 locus with primers flanking the solo LTR. PCR amplification of the reference provirus 2q34_H4 with primers designed to the internal env sequence and flank. f PCR amplification of the HERV-H solo LTR at the 3p14.3 locus with primers flanking the solo LTR. PCR amplification of the reference provirus 3p14.3_H1 with primers designed to the internal gag sequence and flank. The DNA samples of various South Asian populations and an African individual used for validation are listed in the key. LTRs are shown as green boxes, the internal region as a red line, and the flanking region as a black line. The primer positions are shown as black arrows.
Fig. 4 Flowchart of the findsoloLTR pipeline. The first step indexes the coordinates of proviruses of a HERV family in the reference genome. The average read depth (counting reads with mapping quality score (MAPQ) equal to or greater than 30 and base call accuracy equal to or greater than 20) is calculated at the HERV locus and in the flanking windows extending 250 bp from both LTRs. The average read depth at each HERV locus is then expressed as a percentage of the average of the read depths in the two flanking 250-bp windows. A percentage equal to or greater than 50% is used to infer the presence of a provirus, and a percentage lower than 50% to infer the presence of a solo LTR allele.
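The depth comparison described in this caption reduces to a short calculation, sketched below in Python. The function names are hypothetical, and the per-base depth lists are assumed to have already been filtered by the MAPQ and base quality thresholds; this illustrates the stated 50% rule rather than reproducing the findsoloLTR code.

```python
from statistics import mean
from typing import Sequence

THRESHOLD_PCT = 50.0  # cutoff stated in the caption


def depth_percentage(locus_depths: Sequence[float],
                     left_flank_depths: Sequence[float],
                     right_flank_depths: Sequence[float]) -> float:
    """Average depth across the provirus locus as a percentage of the
    average of the two flanking-window depths (250 bp each in the caption)."""
    flank_mean = mean([mean(left_flank_depths), mean(right_flank_depths)])
    if flank_mean == 0:
        raise ValueError("no coverage in the flanking windows")
    return 100.0 * mean(locus_depths) / flank_mean


def infer_allele(percentage: float) -> str:
    """Apply the 50% rule: at or above the cutoff the provirus is inferred
    to be present, below it a solo LTR allele is inferred."""
    return "provirus" if percentage >= THRESHOLD_PCT else "solo LTR"
```

For example, a locus with roughly the same depth as its flanks yields a percentage near 100 and is called a provirus, while a locus where depth collapses across the internal region falls below 50 and is called a solo LTR.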
Fig. 5 Karyotypic view of the locations of the candidate dimorphic HERVs. The dimorphic candidates of HERV-K(HML2) are shown as blue triangles, HERV-H as red triangles and HERV-W as a golden yellow triangle. Candidates supported by at least one additional piece of evidence, such as PCR validation, alternative allele genomic sequence, or annotation in the Database of Genomic Variants, are marked with a blue arrow. The genomic coordinates and other details of the candidates are given in Additional file 2 and Additional file 9. The ideograms were generated using the genome decoration page at NCBI https://www.ncbi.nlm.nih.gov/genome/tools/gdp