Literature DB >> 30072963

Related Endogenous Retrovirus-K Elements Harbor Distinct Protease Active Site Motifs.

Matthew G Turnbull1, Renée N Douville1,2.   

Abstract

Background: Endogenous retrovirus-K is a group of related genomic elements descending from retroviral infections in human ancestors. HML2 is the clade of these viruses which contains the most intact provirus copies. These elements can be transcribed and translated in healthy and diseased tissues, and some of them produce active retroviral enzymes, such as protease. Retroviral gene products, including protease, contribute to illness in exogenous retroviral infections. There are ongoing efforts to test anti-retroviral regimens against endogenous retroviruses. Herein, we examine the potential activity and diversity of human endogenous retrovirus-K proteases, and their potential for impact on immunity and human disease.
Results: Sequences similar to the endogenous retrovirus-K HML2 protease and reverse transcriptase were identified in the human genome, classified by phylogenetic inference and compared to Repbase reference sequences. The topologies of trees inferred from protease and reverse transcriptase sequences were similar and agreed with the classification using reference sequences. Surprisingly, only 62/480 protease sequences identified by BLAST were classified as HML2; the remainder were classified as other HML groups, with the majority (216) classified as HML3. Variation in functionally significant protease motifs was explored, and two major active site variants were identified - the DTGAD variant is common in all groups, but the DTGVD motif appears limited to HML3, HML5, and HML6. Furthermore, distinct RNA expression patterns of protease variants are seen in disease states, such as amyotrophic lateral sclerosis, breast cancer, and prostate cancer.
Conclusion: Transcribed ERVK proteases exhibit a diversity which could impact immunity and inhibitor-based treatments, and these facets should be considered when designing therapeutic regimens.

Entities:  

Keywords:  RNAseq; active site motifs; amyotrophic lateral sclerosis; breast cancer; endogenous retrovirus-K (ERVK); prostate cancer; protease; protease inhibitor

Year:  2018        PMID: 30072963      PMCID: PMC6058741          DOI: 10.3389/fmicb.2018.01577

Source DB:  PubMed          Journal:  Front Microbiol        ISSN: 1664-302X            Impact factor:   5.640


Introduction

Retroviridae is a diverse family composed of both exogenous infectious viral species whose life cycle includes stages with a ssRNA genome inside virions which is converted into a dsDNA provirus, and endogenous proviral species transmitted in a Mendelian fashion. The evolutionary pressures on ERVs and exogenous retroviruses (XRVs) are very different; ERVs experience much stronger negative selection against pathogenicity since their survival depends directly on reproduction of their host (Holmes, 2011; Stewart et al., 2011; Barbeau and Mesnard, 2015). If fact, some ERVs are directly involved in placentation (Mi et al., 2000), and some host genes descend from retroviral ancestors (Kaneko-Ishino and Ishino, 2012). Indeed, the role of ERVs in the evolution and function of their host genome is becoming more clear (Bannert and Kurth, 2004; Cowley and Oakey, 2013). The apparent inactivity of many ERVs is probably a direct consequence of this pressure, as many ERV integrations contain inactivating mutations that disrupt the function or expression of viral genes. The most severe inactivating mutation is the case of solo LTRs, where the entire coding region of the ERV is deleted. Despite this, a surprising number of ERVs appear to lack obvious inactivating mutations, and some are even infectious, blurring the line between ERV and XRV (Boller et al., 2008; Jha et al., 2011; Denner and Young, 2013; Kozak, 2014). Although no infectious ERV has been proven to exist in the human genome, some loci are polymorphic and population genetics analysis and functional studies have not ruled out low-level ongoing infectious replication of some human ERVs (Turner et al., 2001; Jha et al., 2011; Naveira et al., 2014; Wildschutte et al., 2014, 2016). Here we focus on the Betaretrovirus-like human ERVK, and particularly the HML2 clade (HK2), which includes loci encoding functional enzymes, and whose youngest members may be less than 200,000 years old (Franklin et al., 1988; Boller et al., 2008). Actively transcribed ERVs, such as ERVK-10, are assigned names by the Human Gene Nomenclature Committee as recommended by Mayer et al. (2011). The biological role of transcribed ERVK elements is becoming more clear in recent years (Ruda et al., 2004; Macfarlan et al., 2012; Manghera et al., 2014, 2015; Michaud et al., 2014; Schlesinger et al., 2014; Grow et al., 2015; Li et al., 2015; Bray et al., 2016; Christensen, 2016; Becker et al., 2017; Prudencio et al., 2017), but the effects of individual viral proteins is understudied as compared to their XRV counterparts. Protease-mediated maturation of retroviral structural proteins and enzymes is critical to the XRV life-cycle; viruses lacking PR activity produce virions with an immature morphology and are not infectious (Kohl et al., 1988). Even though ERVs are not known to replicate by infection of new cells, they can increase in copy number by reinfection of their host cell; this process called reintegration requires the activity of all the viral enzymes (Dewannieux et al., 2006). Because PR function is conserved across Retroviridae, a virus with an inactive protease may be complemented by another retroviral protease. The factors which permit or exclude this complementation must be context dependent; for example, the HK2 protease encoded by ERVK-10 cleaves HIV-1 Gag in vitro, although not in virio (Towler et al., 1998). The contextual nature of complementation is not perfectly understood, so care must be taken in applying findings across model systems, and particularly in the context of human disease (Towler et al., 1998; Contreras-Galindo et al., 2012; Morandi et al., 2015). Structural analysis proceeds most easily using crystallographic structural determination, which is not yet available for the HK2 PR; however, biochemical and genetic analyses reveal a typical A2 aspartic protease, with each 106-residue monomer of the mature dimer processed autocatalytically from a larger precursor (Mueller-Lantzsch et al., 1993; Towler et al., 1998; Kuhelj et al., 2001; Dunn et al., 2002). The active HK2 PR is moderately sensitive to pepstatin, with a pH optimum around 4.5 (Towler et al., 1998; Kuhelj et al., 2001). Some HK2 elements (such as ERVK-10) encode a functional protease, and HK2 virions with condensed cores have been observed budding from human cells (Mueller-Lantzsch et al., 1993; Sauter et al., 1995; Towler et al., 1998; Boller et al., 2008). Furthermore, the two reconstructed HK2 viruses Phoenix and ERVKCON have a PR-dependent infectivity (Dewannieux et al., 2006; Lee and Bieniasz, 2007). Although the substrate affinity of the HK2 PR has not been extensively studied, the PR cleavage sites in the HK2 Gag polyprotein are known (Kraus et al., 2011), as is its susceptibility to select PIs (Towler et al., 1998; Kuhelj et al., 2001). In addition to catalyzing maturation of viral proteins, PR may play a role in immunity and pathology of human diseases. Protease pathogenicity has been best studied in HIV-1, whose PR is known to cleave a variety of host proteins (Shoeman et al., 1991; Impens et al., 2012), and to trigger apoptosis under specific conditions (Shoeman et al., 1991; Rumlova et al., 2014). HIV-1 PR also directly combats innate immunity by trafficking the dsRNA sensor RIG-I to the lysosome (Solis et al., 2011). Another example of HIV PR inhibition of innate immunity is its ability to cleave RIPK1 and RIPK2 proteins, thus successfully abrogating NF-κB signaling (Wagner et al., 2015). The effect of proteases is not limited to innate immune signaling proteins, but also effector proteins such as intrinsic restriction factors. Both FIV and MLV cleave APOBEC3 in their respective hosts as a means to evade this antiviral defense (Abudu et al., 2006; Yoshikawa et al., 2017). Given the known pathogenic potential of retroviral proteases, it seems prudent to explore the potential effects of PRs encoded by ERVK which, in contrast to HIV-1, are present in all human beings – and which have been associated with specific human diseases, such as ALS, schizophrenia, rheumatic disease and cancer (Frank et al., 2005; Seifarth et al., 2005; Reynier et al., 2009; Douville et al., 2011; Li et al., 2015; Christensen, 2016). Here, we undertake genomic, sequence-function, and transcriptomic analysis of the diverse ERV PRs in the human genome. We draw on the existing ERV and XRV literature to evaluate the phylogenetic distribution of sequence motifs with potential functional relevance, focusing on HK2 PRs. We accomplish this by identifying human ERV protease sequences, inferring their phylogenetic affiliation to each other, and comparing them to representative sequences. The classification process was carried out in parallel using RT as a standard comparator. Here we show that distinct ERVK PR variants with predicted differences in their functional activity are differentially transcribed in disease-relevant human tissues.

Results

Representative Retroviral Motifs

With the goal of using the most recent human genome build GRCh38 to identify loci encoding retroviral PR and RT enzymes, HMMs were used as a search tool. Pfam-A was searched for all relevant protein families related to retroviruses. Table describes the output from this search, revealing 92 HMMs in total, with 40 directly related to retroviral core and accessory proteins. Most important amongst these are the retroviral protease domain RVP, and the RT core domain RVT_1. Retrovirus-associated HMMs (Table ), the PR sequence of ERVK-10 (Towler et al., 1998), and the RVT_1 domain of Phoenix (Dewannieux et al., 2006), were subsequently used to identify pro and pol sequences in representative retroviral sequences and in the human genome. Retroviral HMMs from Pfam sorted according to the retroviral protein product from which they were derived.

Representative Endogenous Retroviral Sequences

As expected, BLAST identified more, but less diverse sequences, than HMMER in both the human genome and Repbase consensa. tBLASTn for HK2 PR matched every ERVK clade except HK10, but no other ERV. HMMER identified an RVP domain in every ERVK clade, and most other ERVs. The regions identified by each method overlap. tBLASTn for the Phoenix RVT_1 domain matched most ERVs. HMMER also identified an RVT_1 domain in most ERVs. The results of these methods mostly overlapped, with some notable differences. BLAST identified only part of the RVT_1 domain on ERVE, and the RVT_1 domain on ERV9 and ERVFc covers only part of the BLAST result. Each of the BLAST and HMMER results from ERVFb only partially overlapped. HMMER identified an RVT_1 domain on ERVT, but BLAST found no match. Both methods disproportionately detected HML2 loci; BLAST did so because the query sequences were HML2-derived, whereas HMMER did so because HML2 loci are more intact and therefore more easily discovered by LTRharvest. The nucleotide sequences from each search were aligned using their translation and their phylogeny was inferred using RAxML from both nucleotide and amino acid (AA) alignments (see Figure for PR tBLASTn results). The resulting alignments and trees were used to classify the genomic sequences identified by the corresponding search. These reference trees have very low bootstrap support values; however, pol and pro trees have similar topology and the classifications based on evolutionary placement agree with the tree topology inferred directly from genomic pro sequences. Sequences identified by tBLASTn which are similar to the HK2 protease. Representative ERV sequences from Repbase were searched using tBLASTn for the HK2 PR. These were aligned by MACSE and their phylogeny was inferred using RAxML under the JTT + G model of evolution selected using ProtTest 3. Bootstrap support values are represented as branch labels. This figure was produced using Geneious, FigTree, and GIMP.

Classification of ERV Sequences in the Human Genome

Having established a system to identify PR and RT elements, ERV nucleotide sequences were identified in GRCh38 using BLAST and HMMER as directed by GenomeTools. These were then curated to eliminate highly divergent sequences with rare insertions (less frequent than 1/20 sequences) and to minimize the size of their multiple alignment before phylogenetic inference. LTRdigest hits that did not co-occur with another retroviral domain were eliminated. The genomic distribution of the resulting ERV annotations is presented in Table , along with curation data. Distribution of PR and RT search results in the human genome GRCh38. When examining PR hits, tBLASTn for ERVK-10 PR identified 480 sequences in GRCh38 (Additional Files ). Forty-two are identical to another PR sequence, of which 15 are from an alternative assembly. LTRdigest identified 150 RVP domains in GRCh38 (Additional Files ). Only 8 of these were shorter than half of the expected length of an RVP domain. Occasionally, sequences from different loci were indistinguishable; twelve were identical to another RVP locus. RVP positions 38–42 are absent in 83 PRs; 48 of these were confidently classified into one of ERVW, ERV9, ERVE, ERVT, or ERV3. The remaining 35 PR sequences were not confidently classified. In contrast, the PR sequences identified by LTRdigest are less numerous and more diverse than those identified by BLAST. There is a substantial overlap in the sequences recognized by each method, but LTRdigest identified HK2 PR with a disproportionately higher frequency than other ERVs, given their relative occurrence in the human genome (Gifford and Tristem, 2003; Bannert and Kurth, 2006). When examining RT hits, tBLASTn for the RVT_1 domain of Phoenix identified 1412 sequences in GRCh38, of which 840 were removed because they induced gaps (Additional Files ). LTRdigest identified 1766 RVT_1 domains in GRCh38, of which 1617 were removed either because they did not co-occur with another core retroviral domain or they contained uncommon insertions which induced gaps in the alignment (Additional Files ). Standalone RVT_1 domains were eliminated. This is because non-ERV retroelements, such as LINEs (Boissinot et al., 2000), also encode a RT with an RVT_1 domain, but lack other retroviral proteins. These non-ERV retroelements would skew the analysis, since they are much more numerous; retroelements as a whole make up 42.2% of the human genome, but ERVs make up less than 8.3% [47]. Although LINEs are not flanked by LTRs, they are nonetheless annotated when LTRharvest mistakes other appropriately spaced repetitive sequences for LTRs. In order to assign appropriate nomenclature to the identified sequences, we classified 489 loci (one or more PR or RT encoding regions separated by no more than 10,000 bp) (Table ). We observe 53 loci contain only one BLAST or LTRdigest result, 324 loci contain 2, 57 loci contain 3, 53 loci contain 4, and one each contain 5 and 6 results. Four loci had all results placed on internal nodes by RAxML and were not classified at all. Eighteen loci contained sequences whose classification varied; in most cases these sequences were truncated or insertion of one retroelement inside another resulted in spurious association of sequences belonging to separate elements. Number of sequences classified into each category from each search method. Totals are shown for each ERV supergroup. The total number of unplaced sequences is also shown along with the number of sequences placed on internal branches, which in both cases were between the H/F and W/9 clades. Sequences were considered unplaced if the probability of their placement into the tree was less than 90%. One sequence was placed on an internal node outside these clades. AA alignments are translated from nucleotide (NT) alignments. Phylogenetic trees constructed from genomic sequences (Figure ) shared similar trends as those observed in the reference trees (Figure and Additional Files ). The classifications assigned by placement of reference sequences into the phylogeny (Additional Files ) agrees with the branching of the phylogeny inferred from genomic sequences directly. This can be clearly seen in Figure , in which the leaves of these phylogenies are colored according to their classification by evolutionary placement. Each phylogenetic tree (Additional Files ) constructed from aligned nucleic acid sequences of either PR or RT from reference ERV sequences contained a bipartition segregating ERVK clades, although it should be noted that BLAST PR results include exclusively ERVK sequences. HK1, 2, 4, 9, and 10 commonly formed a clade within ERVK. In all four trees in Figure , ERVK leaves form a distinct clade, and likewise ERVW and ERV9 form a clade which is associated with ERVF. ERVS forms a clade with ERVL. ERVI forms a clade with ERVADP. ERVE forms a clade with ERV3. Figure is intended to allow the reader to quickly see the broad trends; if the reader is interested in the placement of specific elements, the phylogenetic trees output by RAxML are included in Additional Files . Amino acid sequence alignment derived phylogenies of human genomic endogenous retroviral sequences. A tree is shown for each search method (PR, protease BLAST; RT, reverse transcriptase BLAST; RVP, HMMER for RVP; RVT_1, HMMER for RVT_1). Leaves are colored by their classification through the evolutionary placement algorithm implemented by RAxML (see color legend at bottom of figure). Branch coloring represents bootstrap support, where red indicates poor support (closer to zero) and green indicates good support (closer to 100).

Alternate Genome Assemblies Reveal Additional PR Sequences

Although it fails to capture the real diversity of humanity, the reference human genome contains variable regions with multiple valid assemblies. New loci can be distinguished from loci which were duplicated by inclusion in both assemblies by examining the flanking genomic sequences, which should also be identical for technical duplicates. The 15 kilobases flanking the start site of each PR identified in GRCh38 by BLAST and GenomeTools was extracted and compared to alternative assemblies using blastn. Eight regions had a perfect match, of which five were originally identified by BLAST. Many BLAST hits surrounding identical PR sequences had partial matches indicative of closely similar insertions in divergent genomic loci. Moreover, 31 of the 52 regions identified on alternative assemblies had no high scoring matches. The fact that many sequences could not be paired up is consistent with previous observation of unique insertions in a recent analysis of the ERV content within alternative assemblies (Wildschutte et al., 2014).

Diversity of Protease Sequences

The consensus sequences for each ERVK clade from Repbase contain many conserved sites (Figure ). It is clear that PR functional motifs are less diverse than the surrounding sequences (Figure ), and that variants occur with different frequencies (Table ). The α-helix (C2) has only one common variant (GRDLL). The active site loop (B1) has two; the DTGAD motif is most common and occurs in all clades, whereas the second most frequent (DTGVD) occurs only in the HK3, HK5, and HK6 clades. The tree in Figure is displayed to clearly show the abundance of HK3 sequences (top of figure) versus all other clades. Additionally, Figure highlights the distribution of common active site variants (DTGAD in blue, DTGVD in orange). HMM logo of representative mature endogenous retrovirus K proteases. (A) The structure of the HIV-1 protease 3A2O (Hidaka et al., 2009) is shown with the major regions referenced in the text highlighted. (B) HMM logo of the aligned mature protease sequences of each ERVK entry in Repbase (HK1 through 10) with referenced regions highlighted in the same colors as the homologous regions in 3A2O. This figure was made with Pymol, Geneious, and GIMP. Frequency of co-occurrence for observed ERVK active site and associated helix motifs. Phylogeny of human genomic nucleotide sequences derived using BLAST and the ERVK HK2 PR. Potentially active sequences with a DTGAD motif in their active site are colored blue, and those with DTGVD are colored orange. Sequences containing inactivating mutations are colored gray. The classification of each area of the tree is indicated by the name associated with the arc around the perimeter. Branch coloring represents bootstrap support, where red indicates poor support (closer to zero) and green indicates good support (closer to 100). Placement of the root at the branch connecting HK3 is only a matter of visual convenience – the real root branch is more likely that leading to HK8. This alignment is available in Additional File . Active site motifs were extracted from MACSE aligned protease sequences identified by BLAST in the human genome (GRCh38). Sequences containing stop codons and/or frameshift mutations were removed and their frequency was determined. This was done using standard UNIX utilities. Sequences were considered inactive if they contained mutations not observed in active proteases of other XRVs; otherwise, they were considered potentially active.

ERVK HML2 Proteases Exhibit Great Diversity

Sequences classified as HK2 by RAxML were examined in more detail to determine if their variability could have functional consequences if the protease is translated, especially if it could lead to differential drug susceptibility because the variable residue intrudes into the substrate binding site. Several columns of the alignment in Figure represent variable sites which clearly subdivide the clade in two. Aligned protein sequences of human genomic HML2 proteases. The subset of the full protein alignment of BLAST results for the ERVK HK2 PR in the human genome which were classified as HK2 by the evolutionary placement algorithm are shown here. Two clear patterns emerge, suggesting distinct protease sequence groups. Figure produced using Geneious. The most notable variation within HK2 sequences (Figure ) is the variable AA (L, V or I) at position 89 which is predicted to help form the S3 and S1′ binding subsites by homology (Menendez-Arias et al., 1994). Columns 55, 65, and 66 are also notable. The AA (V or I) at position 55 follows the flap interface (columns 51–54) predicted by homology (Kuhelj et al., 2001). Residues 65 and 66 are near the surface of PR in the N-terminal strand of A2; from this position, their variability could impact protein-protein interactions within the viral polyproteins or with host partners, and could potentially influence flap mobility during substrate binding via interactions with α-helix C1. Thus, the cellular complement of HK2 PR variants represent a continuum of enzymatic affinities and protein-protein interactions.

Variant Proteases Are Expressed in Human Health and Disease

We looked in publicly available RNA-Seq datasets (Table ) from diseases with some known association with ERV expression to establish how genomic diversity is transcribed in healthy and disease states (Seifarth et al., 2005; Douville et al., 2011; Ren et al., 2012; Agoni et al., 2013; Wildschutte et al., 2014; Abba et al., 2015; Li et al., 2015; Brohawn et al., 2016). RNA-Seq results were narrowed to examine the expression of the two most highly expressed and well-studied groups HML2 and HML3. Further, ERVK loci (detectable by both BLAST and LTRharvest as previously described) were limited to proteases without gross inactivating mutations. Reanalysis of transcriptomics data from three independent studies focused on ALS, breast cancer and prostate cancer are shown in Figure ; expression of other HML groups are in Additional File . Overall, the expression of loci encoding DTGAD and DTGVD motifs is significantly different, due to low expression of DTGVD loci, and each of these is divergent from the pooled expression of other less common motifs. RNA-Seq Libraries from the Sequence Read Archive analyzed in this study. Expression of HML2 and HML3 by RNA-Seq. The expression of ERV loci in GRCh38 containing an uninterrupted protease was measured with bowtie2 using RNA-Seq data from different conditions (SRA accession numbers for each condition are: ALS – SRP064478, Breast Cancer – SRP058722, Prostate Cancer – ERP000550). Normalization to fragments per kilobase of exon per million mapped reads (FPKM) is relative to the entire locus. This graphic was made using R with the tidyverse library, and GIMP. The N’s of each column are, from left to right, HML2 and then HML3: 104, 8, 0, 7, 29, 7, 118, 9, 0, 10, 35, 8; 369, 27, 0, 17, 108, 25, 146, 11, 0, 4, 40, 5; 195, 14, 0, 4, 64, 13, 177, 14, 0, 2, 52, 13. Counts are also provided in CSV and Excel format in Additional Files . Expression of protease encoding transcripts alone does not prove that these will be properly translated or proteolytically processed (Bauerova-Zabranska et al., 2005; Tien et al., 2018). Additional supporting evidence comes from examination of the upstream gag gene by searching reads aligning to the same locus using tBLASTn for the Gag sequence of ERV-K113. Several, but not all, loci expressed gag as well as pro. Individual experiments are required to confirm Gag-pro-pol polyprotein processing for each transcribed locus. There is a trend toward increased ERVK transcription in cervical spinal tissues of ALS patients versus neuro-normal controls (Figure ); indeed, this enhancement is evident when patients are stratified by sex, with female cases having an increased burden of ERVK (Additional File , p < 0.01). When tissues from cancer patients were examined, breast cancer biopsies revealed increased PR transcripts with DTGAD (p < 0.05) and alternate PR active site motifs (p < 0.05), as compared to cosmetic breast resection controls. Overall, HML-2 expression was enhanced in breast cancer (p < 0.01). This ERVK load difference in breast cancer appears to be driven by the increased expression of three loci in 8p23.1 flanked by LTR5A, and encoding a DTGSD active site motif. Interestingly, an HIV-1 PR variant which encodes a DTGSD motif is known to be active, although less so than the wild-type enzyme (Hong et al., 1998). Similarly, HML-2 transcripts are increased in prostate cancer specimens as compared to autologous non-cancerous adjacent tissue (p <0.01). These findings point toward a mixed profile of PR variants in disease states with elevated ERVK expression.

Discussion

The human genome is home to a diversity of ERVs; this necessitates much effort in discriminating those of biological and/or clinical importance. Our identification of transcribed ERVK loci with a potential to encode functional PR enzymes will inform future studies examining their biological impact and drug susceptibility. There remain technical challenges in assessing ERVs, as the results from LTRharvest depict that this paradigm is not ideal for analysis of more complicated loci containing recombination, serial insertions, or substantial deletions. Theoretically, each ERV should encode exactly 1 pro and 1 pol at the moment of insertion, but over time deletion and insertion events can remove or obscure these signatures. Despite this issue, the use of HMMs to identify ERVs within the human genome was effective. Indeed, RVP and RVT_1 annotations are more commonly identified within elements annotated by LTRharvest, since BLAST did not detect loci distantly related to the query HK2 sequences. The search results were improved by the merging of results from the same paralog and alignment-based extension. These sequences were grouped both by direct phylogenetic inference and by comparison to a reference tree. The phylogenetic patterns apparent in Figures broadly agree with previously recognized relationships between ERVs (Vargiu et al., 2016). Although trees from pro and pol differ in some respects, they generally agree at deeper nodes which are important for classification, despite the lower phylogenetic signal of protease due to its comparatively shorter length. There is value in examining both reference genomes and alternate assemblies. The reference genome was constructed from few people (Osoegawa et al., 1998; Lander et al., 2001), suggesting that the ERV content of individual human genomes could vary considerably and deserves closer scrutiny; especially since unfixed loci are likely to be younger (Belshaw et al., 2005), and could encode functional enzymes. Future use of data from platforms such as the 1000 Genomes Project (2018) will provide an improved appreciation of the diversity of ERV content in human DNA.

ERVK Protease Sequence Diversity

The diversity of human genomic retroviral proteases reflects their evolutionary history and conserved structure-function relationships. Residues flanking the active site and those maintaining the dimer interface are less variable, a pattern clearly discernible in the Pfam family RVP (Finn et al., 2014) and evident within Figure . The β-sheet at the “bottom” of the enzyme is an exception to this rule, as its residues are not as conserved, perhaps because this region of tertiary structure is driven by backbone interactions (Louis et al., 2007). Overall, the most notable variation in genomic protease sequences is the absence of RVP positions 38–42 in 83 PRs, none of which were classified as ERVK. The sequence of the retroviral protease active site (B1) and the active-site associate helix (C2) are conserved, as seen in Figures for ERVK. The most common active site motif of ERVK, DTGAD, is identical to that predominantly observed in exogenous retroviruses. The second most common motif, DTGVD (A29V), is observed in functional retroviral proteases; in fact, there are many examples of HIV proteases bearing this variation (e.g., Uniprot: P15833). Given the wide phylogenetic distribution of A29V and the similar chemical properties of alanine and valine residues, this is probably a functional variant of the ERVK protease. This motif could therefore represent pre-integration variation of the ancestral XRV which gave rise to ERVK, which is further supported by the concentration of DTGVD motif containing sequences in HK3, seen in Figure . Alternatively, the motif could represent a post-integration mutation predating the amplification of HK3 (Mayer and Meese, 2002). We also predict several α-helix (C2) variants could be functional in ERVK. In the HIV-1 PR (e.g., PDB: 1HVC), the C2 arginine residue (R94) fills the space between B1 and the loop connecting A1′ to D2′ while forming hydrogen bonds to residues in both. Another large polar residue might fill this role; indeed, lysine is known to do so in HIV-1 protease (Louis et al., 2007). Several genomic motifs fit these requirements, such as the HK2 R94Q and R94K variants, and the HK8 D95E variant. Systematic experimental verification of the activity of individual PR variants is thus warranted. Nonetheless, we predict that most of the less common active site loop variants are inactive (Figure ). This prediction is pulled from our understanding of the composition of this motif. Many variations are of a residue which catalyzes the hydrolytic reaction, or which is predicted to be essential to the dimer interface. The catalytic residues must be aspartic acid. The threonine residue which helps form the dimer interface is uncommonly replaced by serine (e.g., PDB: 2JYS, 2RSP) (Jaskolski et al., 1990; Hartl et al., 2008). Others could disrupt the active site by substituting a side-chain of very different size. The glycine residue must be small, as distortions to the structure of this region could disrupt the catalytic structure. This is also true for the alanine residue, although it is known to be replaced by serine in HIV-1 without a total loss of activity (Ido et al., 1991). The final aspartic acid residue participates in hydrogen bonds with the nearby arginine residues (Louis et al., 2007), and might be replaced by another small charged or polar residue; however, the character of this residue may not be critical, since it can vary with glycine in HIV-1 protease (Louis et al., 2007). The young, polymorphic HK2 element ERVK113 is well studied and was reported to produce immature virions, perhaps due to a non-functional protease (Boller et al., 2008). This is in contrast to the known active HK2 protease of ERVK-10 (Ono, 1986; Towler et al., 1998). This is further supported by reported differences in the sequences of these two mature proteases; the G56S (reported as G234S) substitution in ERVK113 may disrupt the flap interface, compromising PR activity and conferring an immature particle morphology (Boller et al., 2008).

Implications of ERVK PR Diversity on Immunity

Retroviral proteases have numerous and varied protein targets which are difficult to predict, with correspondingly broad cellular effects stemming from their expression. Many proteases include targets with immune functions, but these vary based on host-virus pairings. It remains to be seen how ERVK PRs may modulate innate immunity signaling in humans, as seen with XRVs (Abudu et al., 2006; Solis et al., 2011; Wagner et al., 2015; Yoshikawa et al., 2017). Based on our knowledge of HIV (Konvalinka et al., 1995; Snasel et al., 2000; Lin et al., 2003), it is even conceivable that each ERVK PR variant may target distinct panels of cellular proteins, due to their different substrate specificities. The potential impact of ERVK PR not limited to the innate immune system. Observations from the recovery of immune function in patients on HAART containing or not containing PIs suggests that HIV PR can directly impact the functioning of adaptive immune cells independently of other viral proteins (Ananworanich et al., 2003; Chiodi, 2006). Furthermore, differential ERVK PR substrate specificities could lead to changes in the peptide fragment population from which MHC class I epitopes are drawn, in addition to the impact that PI use alters this system (Kourjian et al., 2014; Kourjian et al., 2016). Our observation of distinct ERVK PR variants expressed in disease highlights the need for future studies to consider their overlapping and distinct impacts on cell signaling and proteome profiles related to immunity.

Implications for Drug-Based Treatment of ERVK-Associated Disease

Many protease coding sequences could be translated into an active enzyme, and their sequence variability may impact their activity, and likely their susceptibility to antiretroviral drugs. Variation between DTGAD and DTGVD in the active site of sequences classified as HK3 is notable. This variation could have biological relevance when both HK2 and HK3 elements are expressed, such as in ALS (Douville et al., 2011) and schizophrenia (Frank et al., 2005). The HK2 PR is susceptible to some clinically relevant HIV-1 PIs, although not to the same degree as the HIV-1 PR (Towler et al., 1998; Kuhelj et al., 2001; Tyagi et al., 2017). Since ERVK expression is associated with many disease states, and the activity of retroviral enzymes can be pathogenic, it is possible that PIs may be useful in the future treatment of ERVK-associated conditions. In fact, at least one ongoing clinical trial (NCT02437110) plans to administer the PI Darunavir, along with other drugs, to patients with ALS. Another clinical trial (NCT01528865) targeting ERVK in lymphoma was withdrawn – but the intent to explore the clinical utility of ERVK PIs is becoming clear. The rapid development of drug resistance is a major hurdle in the treatment of HIV-1 infections, but since ERVK is fixed in the host genome, it cannot rapidly evolve such drug resistance. However, ERVK elements are diverse and likely respond differently to PIs. This diversity is not represented in the ERVK literature, which is focused on the consensus HK2 PR (Towler et al., 1998; Kuhelj et al., 2001; Dewannieux et al., 2006; Lee and Bieniasz, 2007; Tyagi et al., 2017). One recent paper states that the active site loop of the HIV-1 and ERVK PR are identical (Tyagi et al., 2017), a finding which this report directly contradicts. We predict this diversity may necessitate the simultaneous use of multiple PIs in the treatment of ERVK-associated diseases. Consequently, if the drug regimen in the above-mentioned clinical trial is not effective in treating ALS, this cannot be taken as evidence that such therapy would not be effective using multiple PIs in a combination therapy. We have established that a multitude of ERVK protease variants exist in the human genome, and that some of these variants can reasonably be predicted to have differing drug binding profiles. However, many ERVs are transcriptionally repressed (Bogerd et al., 2006; Schlesinger and Goff, 2013; Schlesinger et al., 2014); our analysis of publicly available RNA-Seq data shows not only that ERVs are expressed, but that the expression of loci with differing biochemical and evolutionary characteristics varies between and within different disease conditions, as well as in healthy controls. Due to the relatively recent insertion of ERVK sequences within the human genome, and the pathogenicity of XRV PRs, it is warranted that future studies consider their role in modulating immunity. Since distinct biochemical and ancestral features may impact the susceptibility of these enzymes to PIs, we propose that patient-specific drug regimens may be required in treatment of ERVK associated disease. Moreover, given the potential importance of protease sequence variability, the sequences of other ERVK proteins (particularly the drug targets reverse transcriptase and integrase) should also be explored.

Conclusion

Within the human genome, ERVK proteases exhibit a high degree of diversity. Specifically, two predominant PR active sites emerge, the DTGAD and DTGVD variants, which are differentially expressed in disease states. This study will also be an asset for inferring how ERVK PRs impact the human proteome, specifically as it pertains to immune function. It is possible to interpret this paper as casting a negative light on the prospect of inhibitor-based treatment of ERVK-associated disease, but this is not our intention. We do, however, caution against treatments and techniques that treat the numerous and diverse elements called ERVK as though they were a single invariable element. This variability is not endless; biomedical and clinical techniques to account for this enzymatic variation exist. We recommend that the results of inhibition assays against the specific ERVK proteases expressed in each disease state should be considered if anti-PR drug regimens are to be implemented for inflammatory disease.

Materials and Methods

Sources for databases and software employed in the course of this study are listed in Table . Databases and software used in this study. Sequences homologous to retroviral protease (pro) and reverse transcriptase (pol) were automatically extracted from the last major build of the reference human genome and classified with published sequences from Repbase. Structure-function analysis was applied to ERV PR sequences using the limited published data on them, and by transferring knowledge of better studied proteases.

Reference Sequences

The 38th official assembly of the human genome (GRCh38/hg38 published December 2013) was obtained from NCBI. Representatives for ERVADP, ERV9, ERV3, ERVE, ERVFRD, ERVF, ERVFXA, ERVFC, ERVFB, ERVH, ERVI, ERVL, ERVRB, ERVP, ERVS, ERVT, ERV16, ERVW, and HML groups 1 through 10 (HK1 through HK10) were derived from Repbase v19.07 (Jurka et al., 2005). HK9 and HK10 are also called K14C and KC4, respectively. The sequences of ERVK Phoenix (Dewannieux et al., 2006), and ERVK-10 (Kuhelj et al., 2001) were taken from their respective publications. These sequences can be found in Additional Files , respectively. Genes were located using BLAST v2.2.28 (Altschul et al., 1990) and HMMER v3.1b1 (Eddy, 1998). Cytological bands were assigned by a perl script using co-ordinates from NCBI.

BLAST

Each GRCh38 chromosome and representative retroviral sequence was searched by tBLASTn for the amino acid sequences of the mature ERVK-10 PR and the RVT_1 domain of Phoenix pol identified by HMMER. BLAST hits from same gene in different frames were merged, aligned by MACSE v1.01b (Ranwez et al., 2011), and then sequences 5′ and/or 3′ of each hit which would make the alignment flush were retrieved. Proteases identified in the reference genome assembly were matched to alternative assemblies using a perl script.

HMMER and GenomeTools

HMMs for which to search were identified in Pfam-A using the search term “retrovirus” (Finn et al., 2014). ERV reference sequences were searched by HMMER with an e-value of 1.0. Putative LTR retroelements were identified in GRCh38 by LTRharvest, and these were subsequently analyzed by LTRdigest, which ran HMMER. LTRharvest and LTRdigest are part of Genometools v1.5.1 (Gremme et al., 2013). Hits from the same locus in different reading frames were merged and aligned with MACSE, then the regions which would make each sequence flush to the alignment were retrieved.

Phylogenetic Inference and Classification

Results from each of the four searches were curated by eliminating HMM matches not associated with another core retroviral domain and by removing BLAST results which induced gaps in ≥95% of alignment rows with MACSE. Sequences were then re-aligned with MACSE and models of evolution were selected using jModelTest (Darriba et al., 2012) and ProtTest (Guindon et al., 2010; Darriba et al., 2011). Maximum likelihood phylogenies were inferred using RAxML v7.2.8 (Stamatakis, 2014) for nucleotide and protein alignments of each search (8 trees in total). GRCh38 results were classified using RAxML’s evolutionary placement algorithm with reference to the inferred phylogeny of representative ERV sequences. Misclassified or recombinant elements were sought by comparing the assignment of search results separated by less than 10 kbp. This locus-based definition is computationally simple, but potentially error prone; a more accurate element-based analysis would require algorithmic reconstruction of insertion architecture, as was undertaken in the construction of Dfam (Wheeler et al., 2013).

Measuring Expression in Publicly Available RNA-Seq Data

Publicly available RNA-Seq data from tissues associated with ERVK expression in human disease were obtained from the SRA using sratools. The resulting FASTQ files were aligned to the human genome (GRCh38) using bowtie2 following examination with FASTQC. The resulting SAM file was then indexed using samtools and finally the resulting indexed BAM file was queried for expression of ERV loci identified using the methods above. The raw expression values thus obtained were normalized by library size and locus length to produce FPKM values. The resulting FPKM values were examined on density and QQ plots to confirm that they did not conform to a normal distribution. Non-parametric tests of statistical significance were used in this paper. For unpaired expression data (breast cancer, ALS) the Mann–Whitney U test was used, and for paired data (prostate cancer) the Wilcoxon Signed-Rank Test was used. In all cases tests were two-sided and p-values were corrected using the Bonferroni procedure, considering each set of contrasts involving the same variables to be a family.

Availability of Data and Material

The datasets used to generate the findings of this paper are available from NCBI by their accession numbers. SRA accession numbers for each RNA-Seq data are: ALS – SRP064478, Breast Cancer – SRP058722, Prostate Cancer – ERP000550. The human genome version 38 was obtained from the NCBI FTP site [79]. Datasets generated during the current study which are not included as additional files are available from the corresponding author upon request. Important scripts used in the course of this study are included in Additional File .

Author Contributions

MT and RD conceived the experiments. MT performed the bioinformatics analysis. MT and RD wrote the manuscript.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Table 1

Retroviral HMMs from Pfam sorted according to the retroviral protein product from which they were derived.

ProteinPfam HMMs
• MA• Gag_p10• Gag_p12• zf-CCHC_5
• CA• zf-CCHC• Gag_p30• Gag_p17
• NC• Gag_p24• zf-H2C2• Gag_p19
• Gag_MA• gag_pre-integrs• Retro_M
• Retrotrans_gag
• PR• Asp_protease_2• RVP• Spuma_A9PTase
• gag-asp_protease• RVP_2
• RT• RVT_1• RVT_3• RVT_N
• RVT_2• RVT_connect• RVT_thumb
• RH• RNase_H
• IN• IN_DBD_C• Rve• Integrase_Zn
• SU• GP41• HERV-K_env_2• Env_gp36
• Avian_gp85• MMTV_SAg
• DU• dUTPase
• Accessory proteins• Deltaretro_Tax• Myc-N• Tax
• F-protein• Myc-LZ• VPR
• Other• SPX• Ribosomal_L22• UBN2_2
• Borrelia_orfX• Chromo• UBN2_3
• Gypsy• PHD• Exo_endo_phos_2
• Peptidase_A17• PHO4• Ets
• MRI• G-patch• Flu_PA
• TYA• PhoH• RdRP_1
• Orbi_VP4• EXS• Mononeg_RNA_pol
• EVI2A• Exo_endo_phos• Flavi_NS5
• Peptidase_A2B• Tctex-1• Bromo_coat
• ALIX_LYPXL_bnd• MRVI1• RdRP4
• Reo_sigma1• Pkinase_Tyr• Peptidase_A3
• TLV_coat• DUF1725• Flavi_glycop_C
• DUF1011• DUF2155• BAF
• HLH• AA_permease_2• RNA_replicase_B
• Pkinase• Cupin_8• Birna_RdRp
• Ras• AA_permease_C• Peptidase_C34
• bZIP_1• DUF4219• BLVR
• UQ_con
Table 2

Distribution of PR and RT search results in the human genome GRCh38.

ChromosometBLASTn ERVK-10 PRLTRdigest RVPtBLASTn Phoenix RVT_1LTRdigest RVT_1


RawCuratedRawCurated
13215973411515
21767936957
328141095713613
435141135314415
5231174391409
630695371049
71835824923
8241391328513
9813913493
101955223808
111812714010412
12151075378210
13523419301
141043513574
1573177263
161042312142
171132117102
1823195282
194267638186
2050150200
2130126100
2271146121
X227108142677
Y3778510484
Alternative52N/A64N/AN/AN/A
Total48015014125721766149
Table 3

Number of sequences classified into each category from each search method.

PRRVPRTRVT_1

NTAANTAANTAANTAA
ERVP121211
ERV934311141142222
ERVW13135353109
ERVH24226261119
ERVF22
ERVFXA77
ERVFB11
ERVRB122
ERVI2018
ERVE33202033
ERV3222211
ERVT4444
Gammaretrovirus total81752832805148
HK1393911161622
HK26260322053533635
HK477226622
HK9 (K14C)1512225522
HK10 (KC4)511
HK3216215222281801414
HK53636383811
HK652541111
HK73322101077
HK834358835341716
Betaretrovirus total46446169572602548279
ERVL6654
UNPLACED161901810191017
INTERNAL11
Total480150572149
Table 4

Frequency of co-occurrence for observed ERVK active site and associated helix motifs.

NumberB1C2NumberB1C2NumberB1C2
Predicted active if translated
52DTGADGRDLL17DTGVDGRDLL
Potentially active
4DTRADGRDLL1DTVVDGRDLL1DTGDDGRDLL
4DTEADGRDLL1DTVADGRDLL1DTGANGRDLL
3DTGSDGRDLL1DTGVDGRHLL1DTGANGKDLL
3DTGEDGRDLL1DTGVDGKDLL1DTGADGRELL
3DTGADGQDLL1DTGTDGRDLL1DTGADGRDIL
3DTGADGKDLL1DTGMDGRDLL1DTGADGQELL
3DTDADGRDLL1DTGIDGRDLL1DTAADGRDLL
2DTEVDGRDLL1DTGGDGRDLL
Predicted inactive if translated
4DTGADRRDLL1DTRSDGRDLL1DTGADGHLL
3DTGADERDLL1DTRMDGKEIY1DTEADGQDLL
2HTGADGRDLL1DTRADGWDPL1DRGMDGRDLL
2DTVVDGGTLL1DTGVDLSPHTFI1DMGADDQDLL
2WAWAVGWDLL1DTGVAGRDLL1DMGAD
1VTGVDGRDLL1DTGPD1DIGVDGRDLL
1ITWGRGVDN1DTGVDRRDLL1DIGADGGDLL
1GWDLL1DTGADVWDLL1DIGADERDLLL
1ETGVDGWDLL1DTGADGVDLL1DAGADGRDLL
1DTVADGGDLL1DTGADGTRPI1ATGADGRDLL
1DTRTDGRDVL1DTGADGRYLL1ALGADRRDLL
Table 5

RNA-Seq Libraries from the Sequence Read Archive analyzed in this study.

AccessionConditionTissueLibrarySequencing platformSamplesReference
SRP064478sALSCervical spinal cordTruseq Stranded Total RNA HT Kit (Illumina)Nextseq 500 from ∼260 bp fragments7 sALS, 8 non-ALS, without technical replicationBrohawn et al., 2016
ERP000550Prostate cancerProstateOligo dT purification, fragmentation, and random hexamer PCRHiSeq 2000 2 × 90 bp from ∼200 bp fragments14 tumor/non-tumor from same patient, without technical replicationRen et al., 2012
SRP058722High grade ductal carcinoma in situBreastScriptSeq v2 RNA-Seq Kit (Epicenter)HiSeq 2000 2 × 76 bp, fragment length unreported25 ductal carcinoma in situ, 10 cultured primary normal breast tissue, without technical replicationAbba et al., 2015
Table 6

Databases and software used in this study.

NameURLVersion or accession date
Bowtie2https://www.github.com/BenLangmead/bowtie22.3.3
BLASTftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.2.28
Cytological bandsftp.ncbi.nih.gov/genomes/MapView/Homo_sapiens/objects/2015-11-20
FASTQChttps://www.bioinformatics.babraham.ac.uk/projects/fastqc/0.11.5
Figtreehttp://tree.bio.ed.ac.uk/software/figtree/1.4.1
Geneioushttps://www.geneious.com/
GenomeToolshttp://genometools.org/pub/1.5.1
GIMPhttp://www.gimp.org/downloads2
GRCh38ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/2014-08-27
HMMERhttp://hmmer.janelia.org3.1b1
Htslibhttps://github.com/samtools1.5-22-g7a6854b
JAVA1.8.0-144
jModelTest(D Darriba) https://github.com/ddarriba/jmodeltest22
MACSE(V Ranwez) http://www.mybiosoftware.com/macse-1-01b-multiple-alignment-of-coding-sequences.html1.01b
Pfam-Aftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz2014-09-05
ProtTest(D Darriba) https://www.github.com/ddarriba/prottest33
RAxMLhttps://github.com/stamatak/standard-RAxML7.2.8
Repbase(J Jurka) http://www.girinst.org/repbase/19.07
RSYNChttps://rsync.samba.org/v3.0.6 protocal v30
Samtoolshttps://github.com/samtools1.5-9-g473d6a4
Sratoolkithttps://github.com/ncbi/sra-tools2.8.2-1
Trimmomatichttps://github.com/timflutre/trimmomatic0.36
  94 in total

Review 1.  The evolution, distribution and diversity of endogenous retroviruses.

Authors:  Robert Gifford; Michael Tristem
Journal:  Virus Genes       Date:  2003-05       Impact factor: 2.332

2.  GenomeTools: a comprehensive software library for efficient processing of structured genome annotations.

Authors:  Gordon Gremme; Sascha Steinbiss; Stefan Kurtz
Journal:  IEEE/ACM Trans Comput Biol Bioinform       Date:  2013 May-Jun       Impact factor: 3.710

3.  Cleavage of vimentin by different retroviral proteases.

Authors:  J Snásel; R Shoeman; M Horejsí; O Hrusková-Heidingsfeldová; J Sedlácek; T Ruml; I Pichová
Journal:  Arch Biochem Biophys       Date:  2000-05-15       Impact factor: 4.013

Review 4.  Does chronic infection in retroviruses have a sense?

Authors:  Benoit Barbeau; Jean-Michel Mesnard
Journal:  Trends Microbiol       Date:  2015-02-17       Impact factor: 17.079

Review 5.  Retroelements and the human genome: new perspectives on an old relation.

Authors:  Norbert Bannert; Reinhard Kurth
Journal:  Proc Natl Acad Sci U S A       Date:  2004-08-13       Impact factor: 11.205

6.  A catalogue of putative HIV-1 protease host cell substrates.

Authors:  Francis Impens; Evy Timmerman; An Staes; Kathleen Moens; Kevin K Ariën; Bruno Verhasselt; Joël Vandekerckhove; Kris Gevaert
Journal:  Biol Chem       Date:  2012-09       Impact factor: 3.915

7.  Characterization of the human endogenous retrovirus K Gag protein: identification of protease cleavage sites.

Authors:  Benjamin Kraus; Klaus Boller; Andreas Reuter; Barbara S Schnierle
Journal:  Retrovirology       Date:  2011-03-23       Impact factor: 4.602

8.  The distribution of insertionally polymorphic endogenous retroviruses in breast cancer patients and cancer-free controls.

Authors:  Julia H Wildschutte; Daniel Ram; Ravi Subramanian; Victoria L Stevens; John M Coffin
Journal:  Retrovirology       Date:  2014-08-12       Impact factor: 4.602

9.  Feline Immunodeficiency Virus Evolutionarily Acquires Two Proteins, Vif and Protease, Capable of Antagonizing Feline APOBEC3.

Authors:  Rokusuke Yoshikawa; Junko S Takeuchi; Eri Yamada; Yusuke Nakano; Naoko Misawa; Yuichi Kimura; Fengrong Ren; Takayuki Miyazawa; Yoshio Koyanagi; Kei Sato
Journal:  J Virol       Date:  2017-05-12       Impact factor: 5.103

Review 10.  The evolution of endogenous viral elements.

Authors:  Edward C Holmes
Journal:  Cell Host Microbe       Date:  2011-10-20       Impact factor: 21.023

View more
  2 in total

1.  Pro-Inflammatory Signaling Upregulates a Neurotoxic Conotoxin-Like Protein Encrypted Within Human Endogenous Retrovirus-K.

Authors:  Domenico Di Curzio; Mamneet Gurm; Matthew Turnbull; Marie-Josée Nadeau; Breanna Meek; Julia D Rempel; Samuel Fineblit; Michael Jonasson; Sherry Hebert; Jennifer Ferguson-Parry; Renée N Douville
Journal:  Cells       Date:  2020-06-30       Impact factor: 6.600

2.  Detection and Characterisation of an Endogenous Betaretrovirus in Australian Wild Deer.

Authors:  Jose L Huaman; Carlo Pacioni; David M Forsyth; Anthony Pople; Jordan O Hampton; Teresa G Carvalho; Karla J Helbig
Journal:  Viruses       Date:  2022-01-27       Impact factor: 5.048

  2 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.