Host-retrovirus interactions influence the genomic landscape and have contributed substantially to mammalian genome evolution. To gain further insights, we analyzed a female boxer (Canis familiaris) genome for the complexity and integration pattern of canine endogenous retroviruses (CfERV). Intriguingly, this first in-depth analysis of a carnivore species identified 407 CfERV proviruses that represent only 0.15% of the dog genome. In comparison, the same detection criteria identified about six times more HERV proviruses in the human genome, which has been estimated to contain a total of 8% retroviral DNA including solitary LTRs. These observed differences between man and dog are likely due to different mechanisms to purge, restrict and protect their genomes against retroviruses. A novel group of gammaretrovirus-like CfERV with high similarity to HERV-Fc1 was found to have potential for active retrotransposition and possibly lateral transmissions between dog and human as a result of close interactions over at least 10,000 years. The CfERV integration landscape showed a non-uniform intra- and inter-chromosomal distribution. As in other species, different densities of ERVs were observed. Some chromosomal regions were essentially devoid of CfERVs whereas other regions had large numbers of integrations, in agreement with distinct selective pressures at different loci. Most CfERVs were integrated in antisense orientation within 100 kb of annotated protein-coding genes. This integration pattern provides evidence for selection against CfERVs in sense orientation relative to chromosomal genes. In conclusion, this ERV analysis of the first carnivorous species supports the notion that different mammals interact distinctively with endogenous retroviruses and suggests that retroviral lateral transmissions between dog and human may have occurred.
We mapped CfERV integrations on all chromosomes and found a high density of integrations in some regions, whereas other regions were nearly empty (Fig. 1A). To assign CfERV integrations to intergenic, intronic, translated and untranslated genomic regions, detected proviruses were mapped to the dog genome and human genes (xref track at UCSC) (Fig. 1C). We confirmed that CfERVs were preferentially located in intergenic and intronic regions, with numerous intronic integrations on chromosomes 6 and 18. With a caveat for sequence quality and ERV detection limitations, all chromosomes but 22 (at 6.8 kbp from the q-telomere) appear to lack telomeric CfERVs, in contrast to both SINEs and LINEs, which appear to be present at telomeres. Apparently, these chains are placed neither in relation to integrations of old elements (such as LINEs) (Fig. 1D) nor in relation to young elements (SINEC_Cf) (Fig. 1E) in dogs.
Figure 1
A) Chromosomal distribution of CfERVs. Every CfERV is placed at its chromosomal position, rescaled to a megabase (Mb) size bin so that it is noticeable in the chromosomal picture. A color code is assigned according to the classified genus. For the non-acrocentric chromosome (chrX), arrows point at the centromere. B) Cumulative histogram distinguishing the numbers and genus categories of CfERVs distributed per chromosome. Breaks in bar plots indicate scale changes. C) Cumulative histogram of the number of CfERV nucleotides contained in exonic, intronic, intergenic or untranslated (UTR) regions per chromosome analyzed. Breaks in bar plots indicate scale changes. D) LINEs (old non-LTR transposable integrations); E) SINEC_Cf (newer non-autonomous integrations in dogs). Repeats are grouped in bins at a resolution of 1 Mbp. The relative chromosomal occupancy of a bin by these elements is symbolized by the shade on a grey scale (darkest, highest). Every CfERV is denoted as specified before. F) Distances to telomeres. The boxplot “Start” describes the distribution of CfERV (or repeat) distances to the telomere at the start of the chromosome; the “End” group refers to the other end of the chromosome. In this graph, the number of elements is represented by ‘n’ (number of chromosomes) and, for the sake of clarity, the minimum value of the distribution is given as ‘min’. A value of 0 indicates that a repeat (or CfERV) integration exists in a telomere. Note the different scales (in bp) between images (from left to right): CfERVs contained in each chromosome, LINEs, SINEC_Cf, and repeats of any class/type annotated by RepeatMasker.
To rule out the possibility that differences in assembly quality account for the higher rate of CfERVs on chromosomes 1 and X, we computed the density of possible ERV-containing sequencing gaps for all chromosomes (Table S2).
We next estimated whether these integrations correlated with chromosome length or with several annotated sets of genes per chromosome, such as protein-encoding genes and non-coding RNA (ncRNA) (Fig. S2A–F). The correlation with chromosome length was the most significant (r² = 0.55; P-value = 3.2×10⁻⁸). However, weak positive linear correlations were also found with both coding genes and ncRNAs annotated by EnsEMBL and with UCSC human-projected genes. The longest chromosomes, 1 and X, did not show linear correlation with any category and were considered outliers (Fig. S2A–F).
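The correlation between CfERV counts and chromosome length reported above can be illustrated with a minimal sketch of a Pearson correlation. The function name `pearson_r` and the toy numbers are assumptions made for illustration only; they are not the counts or the software used in this study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy values, NOT the data of this study:
# chromosome lengths (Mb) and number of proviruses per chromosome.
chrom_length_mb = [123, 85, 92, 88, 89]
cferv_count = [22, 10, 12, 9, 13]

r = pearson_r(chrom_length_mb, cferv_count)
r_squared = r ** 2  # coefficient of determination, as quoted in the text
```

Outlier chromosomes such as 1 and X could be excluded from the input lists before calling the function, which is one simple way to check how strongly they drive the fit.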
Genomic neighborhood
To analyze potential LTR promoter and enhancer effects on adjacent chromosomal genes, which have been described for distances of up to 90 kb [15], we collected 100 kb sequences flanking the CfERV integrations. The presence of genes, including alternative genomic transcripts, was analyzed in histograms according to their distances to the CfERVs. For this analysis, the longest transcripts were selected; if alternative transcripts overlapped and extended the longest one, these extensions were annexed and the resulting pseudo-transcript was taken as the final model. The results were divided into “sense” or “antisense” groups depending on CfERV integration relative to the chromosomal transcription direction. We then compared the CfERV neighborhood against several datasets: the RefSeq genes annotated at the UCSC genome browser (ref track), genes annotated for non-canine species in the UCSC browser (xref track), and human genes from the UCSC browser mapped to the dog genome, both protein coding and non-coding (Fig. 2A–D).
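The pseudo-transcript construction and the sense/antisense split described above can be sketched as follows. This is a minimal illustration under stated assumptions: intervals are half-open `(start, end)` pairs in bp, strands are given as `"+"` or `"-"`, and the helper names `merge_transcripts`, `orientation` and `in_window` are hypothetical, not taken from the original pipeline.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) in bp (assumed convention)


def merge_transcripts(transcripts: List[Interval]) -> List[Interval]:
    """Fuse overlapping transcripts into pseudo-transcripts (interval union)."""
    merged: List[Interval] = []
    for start, end in sorted(transcripts):
        if merged and start < merged[-1][1]:
            # Overlaps the previous model: annex the extension.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def orientation(cferv_strand: str, gene_strand: str) -> str:
    """Label a gene as 'sense' or 'antisense' relative to the provirus."""
    return "sense" if cferv_strand == gene_strand else "antisense"


def in_window(cferv: Interval, gene: Interval, flank: int = 100_000) -> bool:
    """True if the gene overlaps the provirus or its flanks of `flank` bp."""
    win_start, win_end = cferv[0] - flank, cferv[1] + flank
    return gene[0] < win_end and gene[1] > win_start
```

For example, `merge_transcripts([(100, 500), (400, 900), (2000, 2500)])` fuses the first two transcripts into a single model and leaves the third untouched.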
Figure 2
Gene neighboring CfERVs plot.
The x-axis shows the distance in bp 3′ and 5′ of the CfERVs in a 200 kb surrounding window [-100 kb, 100 kb]. Position zero refers to the exact location of the CfERV. The presence of genes within each region is shown for the sense (in blue) and antisense (in red) strands. The region with the highest number of genes is marked with the number of genes indicated. A) UCSC RefSeq dog annotated genes; B) genes annotated for all other UCSC listed species (except dog) (xref); C) protein-coding genes annotated with only UCSC projections of human genes in the dog genome; D) UCSC xref track annotations counting only human ncRNA genes; E) anciently integrated (>10% LTR divergence); F) intermediately integrated (≥5% and ≤10%); G) recently integrated (<5%). A visual explanation of the methodology used to calculate the histogram values is sketched in Fig. 3B.
Figure 3
Gene neighborhood statistics.
A) Histogram based on the gene vicinity graphs of Fig. 2. Plots indicate the total distribution of genes on the antisense and sense strands. The total number of nucleotides from the longest RefSeq transcript composition overlapping within a 200 kb context relative to the CfERV integrations and their orientations was measured. Where a greater number of genes is found on the antisense strand (relative to the CfERV), we classify the position as over-represented, whereas the presence of more genes on the sense strand (relative to the CfERV) is classified as under-represented. This is performed for all regions within 100 kb on both sides of every CfERV. From left to right: a) UCSC RefSeq dog annotated genes, b) UCSC listed species (except dog) annotated genes, c) genes annotated with only UCSC projections of human genes in the dog genome, d) UCSC xref track annotations counting only human ncRNA genes, e) only recent integrations (<5% LTR divergence) against UCSC xref track annotations, f) intermediately aged CfERVs (≥5% and ≤10%), and g) ancient CfERVs (>10%). B) Schematic view explaining the methodology employed to calculate the histogram values of Fig. 2. From top to bottom: for a CfERV integrated in sense (U3-RU5-puteins-U3-RU5), we search the surrounding 100 kb for transcripts in the same transcriptional direction (light blue) as well as in the opposite direction (dark yellow). Another example, with a CfERV integrated in antisense (RU5-U3-puteins-RU5-U3), is also depicted with transcripts in the opposite (dark yellow) and the same relative transcriptional direction (light blue). Each set of overlapping transcripts is composed into a common model of transcripts in antisense (red) and in sense (blue) relative to CfERVs. The blue line has been thickened to highlight the places where both curves take equal values. These models are counted into a −100 kb to +100 kb histogram in which all detected CfERVs are centered at position 0. A value at this position indicates that a transcript overlaps some part of the CfERV, as shown in the example. Finally, for the histogram in Fig. 3A, the x-axis is iterated, counting the number of positions where the red curve (antisense) takes a higher value than the blue curve (sense), depicted in dark blue. These positions sum to the green bars of the histogram in panel A; the opposite situation, where the blue curve (sense) is higher than the red (antisense), is reflected in the green bars. When the two curves take exactly the same value, the resulting positions are summed in the dark yellow bars.
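The position-by-position comparison of the antisense and sense curves described in this caption can be sketched as below. This is only an illustration under assumptions: the two curves are taken to be equal-length lists of per-position counts, and the function name `compare_curves` and the category labels as dictionary keys are hypothetical.

```python
def compare_curves(antisense, sense):
    """Count positions where the antisense curve is above, below or equal to the sense curve."""
    if len(antisense) != len(sense):
        raise ValueError("curves must cover the same positions")
    counts = {"over": 0, "under": 0, "equal": 0}
    for a, s in zip(antisense, sense):
        if a > s:
            counts["over"] += 1   # antisense higher: "over" category
        elif a < s:
            counts["under"] += 1  # sense higher: "under" category
        else:
            counts["equal"] += 1  # both curves take the same value
    return counts


# Hypothetical toy curves over four positions, not data from this study.
toy_counts = compare_curves([3, 1, 2, 2], [1, 3, 2, 0])
```

In a real analysis the two lists would span every position of the −100 kb to +100 kb window.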
Assuming genetic drift and identical LTRs at the time of integration, we separated CfERVs into three groups depending on the age estimated from LTR divergence: “young”, “middle” and “old”, with less than 5%, 5–10%, and more than 10% LTR divergence, respectively. Using a neutral nucleotide substitution rate of 0.2% per million years [16], a limit of 5% divergence would contain integrations that occurred around 12.5 mya, whereas an integration with 10% divergence would have occurred around 25 mya. Histograms for the three age groups of CfERV loci (Fig. 2E–G) were correlated with genes annotated in the UCSC browser for all species other than dog and with genes that have been projected onto the dog genome, i.e., the xref track. Old CfERVs were found mostly in antisense orientation relative to chromosomal genes when integrated within 70–80 kb of the genes (see Fig. 2E). CfERVs of intermediate age also showed an antisense integration pattern, except for a large segment spanning 30 kb (between 42.5 kb and 72.5 kb) (Fig. 2F). Young CfERVs present a more uneven integration pattern with respect to genes in the sense orientation. However, proviruses were more common in antisense orientation 20 kb upstream of the gene, covering the chromosomal promoter region, as well as in sense orientation 6–12 kb downstream of the gene (Fig. 2G).
The numbers of sense and antisense positions of CfERVs at various distances from the chromosomal gene within the 200 kb surrounding context were measured. Positions where the antisense curve takes either a higher (“over” category) or a lower value (“under” category) than the corresponding sense value are summarized in Fig. 3A. We also collected the number of positions for which the values are equal in the sense and antisense directions. We tested for independence between the two categories, “over” and “under”, which contain the numbers of positions where the antisense or the sense value, respectively, exceeds its counterpart. The χ² test yielded a score of 106136.9 with a p-value of 2.2×10⁻¹⁶, rejecting independence between the categories and, therefore, the normality of their distributions.
Our results revealed CfERV integrations adjacent to five genes in the antisense direction and two integrations in the sense direction in the promoter region within 5 kb of two other genes (Fig. 2A). These proximal genes were annotated with the corresponding gene ontology terms (Table S3).
The under-annotated UCSC dog RefSeq set currently contains 998 annotated dog genes [12]. Evidence that this set is incomplete is that the total set of human genes projected onto the dog genome by the UCSC annotation pipeline matched 19,568 loci. With this gene set projected onto the dog genome, and independent of CfERV direction relative to chromosomal genes, we discovered 161 genes containing partial CfERVs and 50 CfERVs within 5 kb upstream of annotated genes, in the promoter region, or within 5 kb downstream of the 3′UTR.
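The age estimates quoted above for the LTR divergence thresholds can be reproduced with a short sketch. It assumes the usual relation T = D / (2r), where the factor of two reflects both LTRs accumulating substitutions independently; this factor is inferred from the quoted numbers (5% giving about 12.5 mya at 0.2% per million years) rather than stated explicitly in the text. The function names `ltr_age_mya` and `age_group` are hypothetical.

```python
def ltr_age_mya(divergence_pct, rate_pct_per_my=0.2):
    """Estimate integration age (million years) from LTR-LTR divergence in percent.

    Assumes both LTRs were identical at integration and diverge independently
    at the same neutral rate, hence the factor of two in the denominator.
    """
    if divergence_pct < 0 or rate_pct_per_my <= 0:
        raise ValueError("divergence must be >= 0 and rate must be > 0")
    return divergence_pct / (2 * rate_pct_per_my)


def age_group(divergence_pct):
    """Assign the age class used in the text: <5% young, 5-10% middle, >10% old."""
    if divergence_pct < 5:
        return "young"
    if divergence_pct <= 10:
        return "middle"
    return "old"
```

With the default rate, divergences of 5% and 10% give approximately 12.5 and 25 million years, matching the boundaries stated in the text.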
Phylogenetic analysis
To perform phylogenetic analyses, we narrowed our CfERV collection to 286 Pol-containing integrations, of which 219 encoded RT motifs. Finally, only 205 passed our putein quality selection (see Materials and Methods). When comparing CfERVs with previously described retroviruses in an unrooted tree (Fig. 4), we identified a group that clustered with HERV-Fc-like elements and one outgroup chain (id: 1098) that clustered with the HERV-FRD/MER4like/HERV-W group. After discarding the outgroup element, 24 HERV-Fc-like CfERVs remained and were grouped in separate phylogenies.
Figure 4
Cluster of the CfERVs with annotated HERV-Fc-like Pol puteins in relation to reference retroviruses.
An unrooted NJ tree showing the putative evolutionary relationship between HERV-Fc-like CfERV proviruses and their external counterparts. Detected chains are grouped by genus with characteristic colors (green for the gamma-like and yellow for the unclassified). Confidence values for the deepest tree branches are based on a bootstrapping set of 1000 repetitions. A black asterisk symbolizes a confidence above 90% and a solid black circle a confidence above 75%.
Canine HERV-Fc-like copy-number variation
In an attempt to classify CfERVs and to reconstruct the intra-population evolutionary history, we performed a detailed Pol-based phylogenetic analysis with the selected elements (above) and four additional Pol puteins (id: 216, 472, 657 and 992) that had been excluded due to missing RT motifs. In order to identify segmental duplications, we plotted the phylogeny (neighbor joining, 1000 bootstraps) next to the multiple alignment in eBioX [17] (summarized in Fig. 5). To increase the power of our analysis, we compared every candidate CfERV against itself and the other HERV-Fc-like proviruses using dot-matrix analyses without finding significant recombinations.
Figure 5
Classification of Fc-like CfERVs with Pol puteins.
Upper panel left, rooted tree on the fish WDSV shows the relationship between aligned CfERVs and bootstrapped values (n = 1000). Upper panel right: alignment window where horizontal white bars indicate the presence of aligned viral sequence, larger squared ends represent open gaps, and vertical blue colors indicate the degree of similarity (i.e. light: high, dark: low). Lower panel: three different un-rooted phylograms with WDSV to approximate a root point (red square joint) and zoomed views over dense branches of the tree. A) genus-labeled phylogram, gamma-like and unclassified elements in green and yellow respectively. B) Age classification phylogram, youngest elements in light blue, ancient in dark, undated CfERVs in black. C) Score classification phylogram, highest scoring elements in bright red color. A color scale to measure the variation in tone is provided for both B) and C).