Nicole Grandi, Marta Cadeddu, Jonas Blomberg, Enzo Tramontano.
Abstract
BACKGROUND: Human endogenous retroviruses (HERVs) are ancient sequences that integrated into germ line cells and have been vertically transmitted to the offspring; they constitute about 8 % of our genome. Over time, HERVs accumulated mutations that compromised their coding capacity. A prominent exception is the HERV-W locus 7q21.2, which produces a functional Env protein (Syncytin-1) coopted for placental syncytiotrophoblast formation. While the expression of HERV-W sequences has been investigated for its correlation with disease, an exhaustive description of the group composition and characteristics is still not available, and current HERV-W group information derives from studies published several years ago that necessarily relied on the rough human genome assemblies available at that time. This hampers comparison and correlation with current human genome assemblies.
Keywords: Endogenous retroviruses; HERV; HERV-W; Syncytin
Year: 2016 PMID: 27613107 PMCID: PMC5016936 DOI: 10.1186/s12977-016-0301-x
Source DB: PubMed Journal: Retrovirology ISSN: 1742-4690 Impact factor: 4.602
Fig. 1 Overview of HERV-W structural completeness. The presence of each retroviral element in the three classes of HERV-W sequences is depicted. An element is considered retained if at least 20 % of its sequence is present (lengths refer to the LTR17-HERV17 reference elements). With respect to the LTR17 RepBase consensus (780 nucleotides), proviral sequences show complete LTRs, while processed pseudogenes typically present truncated LTRs due to L1-mediated retroposition, lacking U3 in the 5′ LTR (R-U5 structure, nucleotides 256–780) and U5 in the 3′ LTR (U3-R structure, nucleotides 1–326)
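The retention criterion described in this caption can be illustrated with a minimal sketch. The function names `covered_fraction` and `is_retained` are hypothetical and are not taken from the study; only the 20 % threshold, the 780-nucleotide LTR17 length and the nucleotide ranges come from the caption.

```python
LTR17_LENGTH = 780  # RepBase LTR17 consensus length, as stated in the caption


def covered_fraction(start: int, end: int, reference_length: int) -> float:
    """Fraction of a reference element covered by the 1-based inclusive range start..end."""
    if not (1 <= start <= end <= reference_length):
        raise ValueError("range must lie within the reference element")
    return (end - start + 1) / reference_length


def is_retained(start: int, end: int, reference_length: int,
                threshold: float = 0.20) -> bool:
    """Apply the caption's rule: retained if at least 20 % of the element is present."""
    return covered_fraction(start, end, reference_length) >= threshold


# Truncated LTRs of processed pseudogenes, using the ranges given in the caption.
r_u5 = covered_fraction(256, 780, LTR17_LENGTH)  # 5' LTR lacking U3
u3_r = covered_fraction(1, 326, LTR17_LENGTH)    # 3' LTR lacking U5
print(f"R-U5 covers {r_u5:.1%} of LTR17, U3-R covers {u3_r:.1%}")
```

Under this rule, both truncated LTR forms of processed pseudogenes would still count as retained elements, since each covers well over 20 % of the LTR17 consensus.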
Fig. 2 Insertions and deletions of the 59 proviral sequences >2.5 Kb with respect to the LTR17-HERV17-LTR17 reference
Fig. 3 Comparison between the HERV-W RepBase consensus LTR17-HERV17-LTR17 (black) and the consensus generated from the proviral dataset (grey). Nucleotide identity between the two consensus sequences is represented by the colored upper bar (green 100 % identity; greeny-brown between 100 and 30 % identity; red identity <30 %), while single nucleotide differences of the new consensus with respect to LTR17-HERV17-LTR17 are represented with black lines. The localization of the retroviral LTRs and genes is shown below
Fig. 4 Neighbor joining trees of HERV-W proviruses 5′ and 3′ LTRs (a), processed pseudogenes 5′ LTRs (b) and processed pseudogenes 3′ LTRs (c). The RepBase LTR17 consensus is labeled with a black square
Recurrent mutations in HERV-W subgroup 2 LTRs
| Position (nt)a | Substitutionb | Frequencyc: PVd subgroup 2 | Frequencyc: PGe subgroup 2 | Frequencyc: Solo LTRs subgroup 2 | Frequencyc: Subgroup 1 |
|---|---|---|---|---|---|
| 43 | C>T | 100 | 100 | 98 | 0.7 |
| 95 | C>T | 100 | 95.8 | 96 | 3.4 |
| 100 | T>C | 97.3 | 100 | 95 | 2.2 |
| 180 | C>T | 97.3 | 100 | 95 | 0 |
| 254 | A>G | 97.3 | 96 | 87 | 1.4 |
| 706 | A>G | 97.4 | 73.3 | 88 | 1.7 |
| 765 | G>A | 95 | 73.3 | 90 | 1.7 |
aNucleotide positions refer to the RepBase Update LTR17 consensus
bSubstitutions are indicated with the original nucleotide and the newly acquired variant separated by the symbol >
cRelative percentage based on the total of sequences that hold the position in an alignment
dProviruses
ePseudogenes
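Footnote c defines the frequency as a percentage over the sequences that hold the position in the alignment. A minimal sketch of that calculation follows; the function name `variant_frequency` and the choice of which symbols count as "not holding the position" (gap, N, ?) are assumptions made for illustration, not details given in the source.

```python
def variant_frequency(column, variant, missing=("-", "N", "?")):
    """Percentage of sequences carrying `variant` at one alignment column.

    Sequences with a gap or an ambiguous symbol at the column are excluded
    from the denominator (assumed reading of footnote c).
    Returns None when no sequence holds the position.
    """
    informative = [base.upper() for base in column if base.upper() not in missing]
    if not informative:
        return None
    hits = sum(base == variant.upper() for base in informative)
    return 100.0 * hits / len(informative)


# Toy column: five sequences hold the position, four of them carry T.
print(variant_frequency("TTTC-T", "T"))
```

Excluding gapped sequences from the denominator explains why the percentages in each column of the table can rest on different numbers of sequences.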
Fig. 5 Boxplot representations of the estimated period of integration of the HERV-W subgroups, based on divergence. The approximate age (in million years) was calculated from the divergence between the 5′ and 3′ LTRs of the same provirus (proviral sequences only); between each LTR and a consensus generated for each subgroup; and between a 150–300 nucleotide region of the gag, pro, pol RT, pol IN and env genes of each HERV-W internal element and a generated consensus (proviruses and pseudogenes). a Averaged age values obtained for each sequence, with sequences divided into proviruses and pseudogenes for each subgroup. b Single-method estimations for the two HERV-W subgroups. c Highlight of the heterogeneous action of divergence on different genic regions
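The caption does not state the formula or the substitution rate used. The sketch below is therefore an illustration under stated assumptions: a simple p-distance, a neutral rate of 0.2 % per nucleotide per million years (a value commonly assumed for such estimates, not taken from this text), and a factor of 2 for the LTR-versus-LTR comparison because both LTRs of a provirus diverge independently after integration. All function names are hypothetical.

```python
ASSUMED_RATE = 0.002  # substitutions per site per million years (assumption)


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences, skipping gapped columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        mismatches += a != b
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return mismatches / compared


def age_from_ltr_pair(divergence: float, rate: float = ASSUMED_RATE) -> float:
    """Age in million years from 5' vs 3' LTR divergence (both LTRs accumulate changes)."""
    return divergence / (2 * rate)


def age_from_consensus(divergence: float, rate: float = ASSUMED_RATE) -> float:
    """Age in million years from the divergence between one sequence and a consensus."""
    return divergence / rate


d = p_distance("ACGT-A", "ACGA-A")  # toy alignment: 1 mismatch over 5 compared sites
print(age_from_ltr_pair(d), age_from_consensus(d))
```

With a different assumed rate or a corrected distance model the absolute ages change, but the relative ordering of sequences by age stays the same within each method.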
Fig. 6 PBS types among all HERV-W sequences and diversity between subgroup 1 and subgroup 2. The PBS types are identified by the single-letter amino acid code of the corresponding cellular tRNA. W = tryptophan, R = arginine, F = phenylalanine, I = isoleucine, S = serine, P = proline. The “Others” category includes Leucine (L), Glutamic Acid (E) and Glycine (G), each found in only one sequence. Elements that lost the PBS sequence (–) or whose PBS assignment is ambiguous (?) are also included
Fig. 7 Logos representing the HERV-W main features. a PBS nucleotide sequence; b Gag nucleocapsid Zinc fingers amino acid composition and c Pol IN C-terminal GPY/F motif amino acid composition. The overall height indicates the sequence conservation, while the height of symbols indicates the relative frequency of each amino or nucleic acid. Created at http://weblogo.berkeley.edu/logo.cgi
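The relationship between overall height and symbol height in a sequence logo can be sketched as follows. This is a simplified illustration rather than the exact procedure of the logo tool: it uses the Shannon information content of one alignment column and omits the small-sample correction that logo tools may apply. The function names are hypothetical.

```python
import math
from collections import Counter


def column_information(column: str, alphabet_size: int = 4) -> float:
    """Information content (bits) of one alignment column: log2(alphabet) minus entropy."""
    counts = Counter(column.upper())
    total = sum(counts.values())
    if total == 0:
        raise ValueError("column is empty")
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


def symbol_heights(column: str, alphabet_size: int = 4) -> dict:
    """Height of each symbol: its relative frequency times the column information."""
    info = column_information(column, alphabet_size)
    counts = Counter(column.upper())
    total = sum(counts.values())
    return {symbol: info * n / total for symbol, n in counts.items()}


print(column_information("AAAA"))  # fully conserved nucleotide column
print(column_information("ACGT"))  # uniformly variable column
print(symbol_heights("AAAT"))
```

For amino acid logos such as panels b and c, the same functions would be called with `alphabet_size=20`.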
HERV-W genomic context: insertions into human coding genes
| HERV-W | Human gene | Gene or relative protein function and associations |
|---|---|---|
| 1p34.2 (+) | | Transcription factor, binds Ig and T-cell receptors recombination signal |
| 1q25.2 (−) | | RAS superfamily of small GTPases protein activator like. Associations: BMI, weight |
| 1q42.13* (−) | | Zinc Finger protein. Associations: body height |
| 2p23.1a* (+) | | Predominantly remodels anionic phospholipids in endoplasmic reticulum |
| 2p16.2 (+) | | Suppressor of cytokine signaling proteins and their binding partners |
| 2q22.2* (−) | | NAD cofactors biosynthesis from tryptophan. Associations: body height, cholesterol, schizophrenia |
| 2q24.3 (−) | | Cordon bleu WH2 repeat protein-like 1. Associations: BMI, Cholesterol, HDL, triglycerides, stroke, response to statin therapy, anthropometric sexual dimorphism |
| 2q31.2a (−) | | [603051] Mutations are cause of rhizomelic chondrodysplasia punctata type 3 |
| 2q35 (−) | | Disrupted in renal carcinoma long non-coding RNA. Associations: diabetes mellitus |
| 3p22.2* (−) | | Solute carrier transmembrane protein |
| 3q22.1 (−) | | Never In mitosis kinase. Involved in DNA replication and G2/M checkpoint response to DNA damage. Related to embryonic lethality and preeclampsia |
| 3q23b (+) | | Exoribonuclease involved in Long noncoding RNA decapping and miRNA regulation |
| 3q26.32 (+) | | Zinc finger matrin. Acts as a bona fide target gene of p53/TP53 |
| 4p16.3* (−) | | Zinc finger protein. Function as transcription factor |
| 4p16.1* (+) | | Oxidizes the CoA-esters of 2-methyl-branched fatty acids |
| 4q31.3 (+) | | ADP ribosylation factor interacting protein1. [605928] Enhance the cholera toxin activity |
| 5q12.1* (+) | | Significantly upregulated in nonsmall cell lung carcinoma cell lines (reduced patient survival) |
| 5q22.2* (+) | | Acyl-CoA thioesterase. Involved in regulation of lipid composition and metabolism |
| 6q12 (+) | | [612424] In photoreceptor layer: mutated in autosomal recessive retinitis pigmentosa |
| 6q14.3a* (−) | | Role in embryonic development. Associations: cholesterol, coronary disease |
| 6q21a (+) | | Autophagy related apoptosis specific protein. Associations: lipoproteins, LES |
| 6q21b° (+) | | Prenyl (decaprenyl) diphosphate synthase, subunit 2. Synthesizes the side-chain of coenzyme Q. [610564] Coenzyme Q10 deficiency, primary, 3: fatal encephalomyopathy and nephrotic syndrome |
| 6q21c (+) | | Na(+)-independent transport of aromatic amino acids across the plasma membrane. Associations: cholesterol, LDL |
| 6q24.2a (−) | | Androgen-induced. Associations: C-reactive protein, insulin, myocardial infarction |
| 7p21.1 (−) | | Homo sapiens basic leucine zipper and W2 domains 2 |
| 7p14.1* (−) | | [609187] Mutations are associated with glutaric aciduria type III. Others: BMI, fat distribution, cardiomegaly, coronary disease, pancreatic and prostatic neoplasms |
| 7q31.1a (+) | | Neuronal Cell Adhesion Molecule. Associations: autism, obsessive compulsive disorder, schizophrenia |
| 7q31.1b (−) | | [605317] Required for development of speech and language regions of the brain during embryogenesis. Associated to speech-language disorders |
| 8p21.3 (+) | | Involved in vesicular transport of biogenic amines. Associations: bipolar disorder, major depressive disorder |
| 8q12.3a (−) | | Na+/K+ transporting ATPase interacting proteins. Associations: mental competency, neuroblastoma, stroke |
| 8q12.3b (+) | | [603711] Cyp450 enzyme. Associations: bile acid synthesis congenital defect, spastic paraplegia. Others: Alzheimer disease, lipoproteins, schizophrenia |
| 8q21.11 (+) | | Ubiquitin-conjugating enzyme. Along with ubiquitin-activating (E1) and ligating (E3) enzymes, coordinates the ubiquitin addition to proteins. [614277] Interacts with FANCL and regulates the monoubiquitination of Fanconi anemia protein FANCD2 |
| 8q21.13 (+) | | Zinc finger protein |
| 9p24.1 (+) | | Protein tyrosine phosphatase, receptor type, D. [601598] Restless Legs Syndrome. Associations: asthma, BMI, cholesterol, lipids, lipoproteins, triglycerides, diabetes |
| 9p13.3* (−) | | B-cell proliferation and differentiation antigen. Associations: lupus erythematosus |
| 10q23.33 (−) | | [124020] Cyp450 enzyme, responsible for therapeutic agents metabolism. Associated to metabolic defects and variants |
| 10q24.1 (−) | | [601752] Triphosphate Diphosphohydrolase. Associated with Spastic Paraplegia |
| 11p14.2* (−) | | [610110] May act as a chloride channel. Associations: Dystonia 24. Others: BMI, obesity, C-reactive protein, cholesterol, coronary disease, schizophrenia |
| 11q14.1 (−) | | Adipogenesis associated Mth938 domain containing |
| 11q14.2 (−) | | Encodes a conserved member of the trypsin family of serine proteases |
| 12p13.31b (−) | | Catalyses ATP-dependent condensation of NAA and glutamate to produce NAAG |
| 12q23.3 (+) | | Solute carrier family 41 member 2 |
| 13q13.3 (+) | | Participates in N-linked glycosylation of proteins |
| 14q11.2* (+) | | T cell receptor alpha locus |
| 14q21.2* (−) | | Homo sapiens family with sequence similarity 179 member B |
| 14q23.1 (+) | | Associations: attention deficit disorder with hyperactivity |
| 17q12a (+) | | Implicated in regulation of cell growth and T-cell development (studies in mouse) |
| 17q12b° (−) | | Biogenesis of long-chain fatty acid. Associations: BMI, breast cancer |
| 17q22 (−) | | Translocation of transport vesicles from cytoplasm to plasma membrane, like the insulin-stimulated GLUT4 translocation in adipocytes. Associations: BMI, cholesterol |
| 19p12a (+) | | Zinc finger protein 90. May be involved in transcriptional regulation. [603973] |
| 19q13.2a (+) | | Zinc finger protein 780A |
| 19q13.2b (−) | | Cytochrome P450, family 2, subfamily A, polypeptide 7 |
| 21q22.2 (−) | | Participates at tight-junctions (kidney, gut) or acts as adhesion molecule (testis). Associations: coronary disease, lipoproteins, Parkinson disease, stroke |
| Xp11.21 (−) | | Degradation and inactivation of bioactive fatty acid amides |
| Yq11.222* (+) | | Mature granulocytes and B cells surface antigen |
Proviruses and undefined sequences are labeled with * and °, respectively. For HERV-W sequences and genes, the strand direction is reported in round brackets. Genes in bold are listed as associated with OMIM diseases, and the relevant accession number is reported in square brackets. Underlined genes are reported to be positively associated with specific phenotypes in UCSC Gene annotations
aAlready reported in Li et al. [70]
bAlready reported in Schmitt et al. [20]
HERV-W genomic context: insertions into human non-coding genes
| HERV-W | Human gene | Gene function and associations |
|---|---|---|
| 1p12 (−) | | Uncharacterized antisense long non-coding RNA |
| 1p13.3* (−) | | Large intergenic non coding RNA |
| 1q32.1 (−) | | Uncharacterized antisense long non-coding RNA |
| 2q11.2 (−) | | StAR-related lipid transfer domain protein 7 antisense long non coding RNA (LOC285033) |
| 2q24.3 (−) | | Long intergenic non coding RNA |
| 2q31.2b (+) | | Homo sapiens microRNA 548n |
| 3q25.1b (+) | | CLRN1 antisense non-coding RNA |
| 4p13* (−) | | Long intergenic non coding RNA |
| 4q23 (−) | | Uncharacterized antisense long non-coding RNA |
| 4q28.3 (+) | | Long intergenic non coding RNA |
| 4q32.3 (+) | | MicroRNA involved in post-transcriptional regulation of gene expression |
| 6q15 (−) | | Long intergenic non coding RNA |
| 6q27a° (+) | | Long intergenic non coding RNAs |
| 7p14.2* (+) | | Antisense non coding RNA |
| 8q12.1 (−) | | Long intergenic non coding RNA |
| 9p21.3 (+) | | Uncharacterized long non-coding RNA |
| 10q11.22 (−) | | Long intergenic non coding RNA |
| 11q14.2 (−) | | Protease serine 23 near-coding RNA |
| 11q23.3 (−) | | Antisense non-coding RNA |
| 13q21.33* (+) | | Long intergenic non coding RNA |
| 13q31.3° (+) | | Long intergenic non coding RNA |
| 21q21.1* (−) | | MIRNA548X host gene long non-coding RNA |
| 14q22.1 (+) | | Long non-coding RNA |
| 19p12d (+) | | Antisense non coding RNA |
| Xq13.3* (−) | | Long intergenic non coding RNA |
Proviruses and undefined sequences are labeled with * and °, respectively. For HERV-W sequences and genes, the strand direction is reported in round brackets
aAlready reported in Li et al. [70]
bAlready reported in Schmitt et al. [20]
HERV-W genomic context: transcription factor (TF) binding sites
| HERV-W | TF recognized | Position | Score (0–1000) |
|---|---|---|---|
| 2p12a* | POLR2A | chr2:76098843–76099352 | 803 |
| | E2F1 | chr2:143661226–143661546 | 958 |
| | CTCF | chr3:38331061–38331485 | 900 |
| | POLR2A | chr4:8429472–8430544 | 1000 |
| | TCF7L2 | chr4:8424096–8424592 | 922 |
| 6p12.2* | TFAP2C | chr6:52783052–52783485 | 1000 |
| 6p12.2* | FOXA2 | chr6:52783244–52783462 | 848 |
| 6p12.2* | STAT3 | chr6:52784270–52784579 | 808 |
| | STAT3 | chr6:85427859–85428174 | 1000 |
| | CEBPB | chr6:85427862–85428118 | 817 |
| | TCF7L2 | chr7:92103429–92103733 | 1000 |
| | TCF7L2 | chr7:107981247–107981830 | 1000 |
| | E2F1 | chr7:107981308–107981897 | 1000 |
| 7q33* | YY1 | chr7: 34270591–134271127 | 1000 |
| 7q36.1* | FOXA1 | chr7:149370177–149370408 | 1000 |
| 9q22.1 | TCF7L2 | chr9:91556701–91556965 | 1000 |
| 10q21.2* | GATA3 | chr10:62797340–62797529 | 1000 |
| 10q21.2* | E2F1 | chr10:62796837–62797697 | 806 |
| 10q23.1* | GATA1 | chr10:86284672–86285183 | 1000 |
| 10q23.1* | MAFK | chr10:86285572–86285911 | 1000 |
| 10q23.1* | MAFF | chr10:86285647–86285793 | 1000 |
| 10q23.1* | TBL1XR1 | chr10:86284785–86285185 | 824 |
| | TAL1 | chr2:97480654–97480806 | 928 |
| | TEAD4 | chr10:97480630–97480810 | 859 |
| 10q21.3 | MAFK | chr10:65805045–65805364 | 1000 |
| | TFAP2C | chr21:20128637–20128925 | 1000 |
| | YY1 | chr21:20131977–20132464 | 859 |
Data obtained from the Genome Browser ENCODE Transcription Factor ChIP-seq database
Proviruses and undefined sequences are labeled with * and °, respectively. Bold loci are those inserted into human genes
Env puteins analysis
| Sequence | ORF length (amino acids) | Stop codons | Frameshifts |
|---|---|---|---|
| | | | |
| 1p32.3b* | 559 | 3 | 2 |
| 6q21a | 552 | 4 | 2 |
| 15q21.3 | 543 | 2 | 2 |
| | 542 | 0 | 2 |
| | 542 | 0 | 4 |
| 5q21.3* | 542 | 6 | 2 |
| 12q13.12* | 542 | 2 | 1 |
| 14q21.2* | 542 | 3 | 1 |
| | 542 | 1 | 0 |
| 3q11.2* | 541 | 1 | 3 |
| 4q31.1* | 541 | 2 | 2 |
| 17q12b | 540 | 2 | 2 |
| | 529 | 0 | 6 |
| | 483 | 0 | 1 |
| 11p15.4 | 475 | 2 | 3 |
| 3p24.1* | 466 | 2 | 0 |
| 9q31.3 | 462 | 3 | 1 |
| 3q23a | 453 | 2 | 1 |
| | 443 | 1 | 2 |
| | 361 | 0 | 3 |
| 5p12* | 355 | 6 | 0 |
| 1p34.2 | 352 | 2 | 1 |
| Xq27.1 | 352 | 4 | 1 |
| | 320 | 0 | 1 |
| | 296 | 0 | 2 |
| | 267 | 1 | 0 |
Proviruses are labeled with *, Syncytin-1 ORF is highlighted in bold. Underlined sequences retain an ORF without internal stop codons; italic sequences did not present frameshifts
HERV-W loci homology of previously described MSRV sequences and probes
| MSRV GenBank entry | HERV-W locus/loci | Query cover | n° of discordant bases | Mapped portion in LTR17-HERV17-LTR17 |
|---|---|---|---|---|
| AF127227 (544 bp) | 3q23a* (99.5 %) | 1–544 | 3 | env (8208–8752) |
| AF127228 (1932 bp) | Xq22.3b* (99.6 %) | 1–1932 | 9 | pol-env (5444–5838 and 7682–9200) |
| AF127229 (2004 bp) | 3p12.3* (99.9 %) | 1–1084 | 2 | pol-env-3′ LTR (5452–6792 and 8290–8318 and 9115–9732) |
| | 18q21.32* (99.9 %) | 1055–2004 | 2 | |
| AF123882 (2477 bp) | 12q21.3* (99.8 %) | 1–2477 | 7 | pol-env (5720–8199) |
| AF331500 (1629 bp) | Xq22.3b* (99.7 %) | 1–1332 | 4 | env (7720–9348) |
| | 5p12* (99.4 %) | 1308–1629 | 2 | |
| AF123881 (1511 bp) | 3q26.32* (99.9 %) | 1–1511 | 2 | gag-pro (2765–4269) |
| AF009668 (2304 bp) | 1p34.2 (99.1 %) | 1–633 | 6 | pro-pol (4178–6480) |
| | 2p12a (100 %) | 623–736 | 0 | |
| | 2p24.2 (100 %) | 717–871 | 0 | |
| | 6q27b (98.5 %) | 837–1424 | 11 | |
| | 6q15 (97.2 %) | 1415–1763 | 10 | |
| | 3p12.3 (99.4 %) | 1719–2304 | 4 | |
| AF009666 (324 bp) | 1p34.2 (99.5 %) | 1–324 | 3 | pro-pol (4178–4521) |
| AF009667 (118 bp) | 17q22 (98.2 %) | 1–118 | 2 | pol (5031–5148) |
| AF123880 (1003 bp) | 5p12 (99.6 %) | 1–203 | 1 | 5′ LTR (255–803) |
| | 3p24.1 (100 %) | 198–593 | 1 | |
| | 3q26.32 (98 %) | 592–1003 | 11 | |
| AF072494 pol probe (678 bp) | 6q21b (99.6 %) | 1–678 | 5 | pol (4660–5338) |
| AF072496 gag probe (536 bp) | 6q21b (99.6 %) | 1–536 | 2 | pre gag-gag (2706–3199) |
| AF072497 pro probe (364 bp) | 1p34.2 (99.2 %) | 1–364 | 4 | pro-pol (4166–4522 and 5641–5549) |
| AF072498 env probe (591 bp) | Xq22.3b (99.5 %) | 1–591 | 3 | env (8606–9196) |
Previously published MSRV sequences and probes (column 1) were analyzed for their homology to one or more HERV-W loci by BLAT search, considering the best match in the human genome (reported in column 2 next to each HERV-W element). The portion of each MSRV element similar to the HERV-W locus or loci (column 3), the number of discordant nucleotides with respect to the identified HERV-W locus or loci (column 4) and the corresponding positions in the LTR17-HERV17-LTR17 reference (column 5) were obtained through Mafft alignment and Geneious platform analysis. MSRV sequences were characterized by analyzing each element against the whole HERV-W dataset with the Recco software
* Already investigated by Laufer et al. [89]
° 95 % similarity with AF135487, a retroviral-related sequence reported to be schizophrenia associated and mapped to multiple sites