Literature DB >> 35166831

Endogenous retroviruses co-opted as divergently transcribed regulatory elements shape the regulatory landscape of embryonic stem cells.

Stylianos Bakoulis1, Robert Krautz1, Nicolas Alcaraz1,2, Marco Salvatore1, Robin Andersson1.   

Abstract

Transposable elements are an abundant source of transcription factor binding sites, and favorable genomic integration may lead to their recruitment by the host genome for gene regulatory functions. However, it is unclear how frequent co-option of transposable elements as regulatory elements is, to which regulatory programs they contribute and how they compare to regulatory elements devoid of transposable elements. Here, we report a transcription initiation-centric, in-depth characterization of the transposon-derived regulatory landscape of mouse embryonic stem cells. We demonstrate that a substantial number of transposable element insertions, in particular endogenous retroviral elements, are associated with open chromatin regions that are divergently transcribed into unstable RNAs in a cell-type specific manner, and that these elements contribute to a sizable proportion of active enhancers and gene promoters. We further show that transposon subfamilies contribute differently and distinctly to the pluripotency regulatory program through their repertoires of transcription factor binding site sequences, shedding light on the formation of regulatory programs and the origins of regulatory elements.
© The Author(s) 2022. Published by Oxford University Press on behalf of Nucleic Acids Research.

Entities:  

Mesh:

Substances:

Year:  2022        PMID: 35166831      PMCID: PMC8887488          DOI: 10.1093/nar/gkac088

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Transcriptional regulatory elements are stretches of genomic sequence that exert enhancer and promoter activities essential for the precise spatial and temporal control of gene expression (1–4). The activity of a regulatory element is controlled by the specificity of transcription factors (TFs) to bind the element and the density of their binding sites, both of which are in turn dependent on its DNA sequence (5–8). TF DNA-sequence preferences (9), gene expression (10,11) and the specific regulation of genes by TFs (12,13) are generally well conserved across eukaryotes. On the contrary, regulatory elements with enhancer activity are generally associated with high evolutionary turnover (14–16). Transposable elements (TEs) are an abundant source of TF binding sites that contribute to the spread of sequences with regulatory potential (17). In mammals, the large majority of TEs are dormant, having lost their ability to replicate and are maintained repressed by H3K9me3, H3K27me3 and DNA methylation (18–22). However, favorable genomic integration may lead to the co-option of TEs as endogenous regulatory elements by utilizing their native or acquired TF binding sites (23–32). Consequently, the majority of species-specific open chromatin regions are associated with TEs (14) and TE-derived enhancers are generally not well conserved across evolution (25,33,34). In embryonic stem cells (ESCs) long terminal repeat (LTR) retrotransposons bound by pluripotency TFs, including OCT4 and NANOG, have been shown to possess enhancer activities (29,31) and initiate transcription (35). In addition, species-specific TE-derived regulatory elements cause binding differences of pluripotency TFs between human and mouse ESCs (36). Distribution and fixation of mobile elements with readily available regulatory potential may therefore provide regulatory innovation but also stabilize the gene regulatory functions of TFs. Thus, characterizing the contribution of TEs to transcriptional regulation has the potential to provide insights into the formation of pluripotency regulatory programs, the origins of regulatory elements and thus the basis for their evolutionary turnover. Most studies to date have inferred TE-associated regulatory elements from chromatin-accessible loci flanking nucleosomes with specific histone modifications indicative of enhancers (e.g., H3K4me1, H3K27ac). However, only few of such predicted loci show enhancer activity (31,37). Rather, a growing body of literature points toward divergent transcription initiation as a key property of active regulatory elements with either enhancer or promoter function (4,38–45). While there is a general relationship between histone modifications and the transcriptional output of a regulatory element (4,44–46), the transcriptional status of a regulatory element better reveals its regulatory potential (40,44,47). Characterization of TE-derived enhancers from transcription initiation events therefore provide a more accurate picture of their regulatory contribution. Divergent transcription of regulatory elements is established at closely spaced pairs of divergently oriented core promoters within open chromatin, resulting in long non-coding enhancer RNA (eRNA) transcripts at regulatory elements with enhancer activity, and pairs of mRNAs and promoter upstream transcripts (PROMPTs) at gene promoters (39–42,44,48). Such transcription initiation events are accurately identified and quantified through 5′ end sequencing of capped RNAs (CAGE, Cap Analysis of Gene Expression) (49,50). eRNAs and PROMPTs are generally non-polyadenylated and thus unprotected at their 3′ ends, and as a consequence these RNA species are frequently targeted by the 3′-5′ ribonucleolytic RNA exosome for degradation (39,51,52). Although co-option of TEs as regulatory elements has been established as a mode of regulatory innovation, its extent and how TE-derived regulatory elements compare to non-TE associated regulatory elements (e.g., with regards to divergent transcription initiation, RNA metabolism and TF binding potential) remain unclear. Encouraged by the possibility to study TE-associated regulatory elements through transcription initiation mapping (35), we here investigate how the wide repertoire of mouse TEs contribute to the transcription initiation landscape and thus regulatory elements of mouse ESCs (mESCs). We demonstrate that many dormant TE insertions, in particular endogenous retroviral elements (ERVs), carry open chromatin regions that are divergently transcribed into unstable RNAs targeted by the exosome for degradation. Furthermore, open chromatin regions that are associated with transcribed ERVs show a high degree of species specificity and a large fraction of these either have enhancer function or contribute to gene promoters. This suggests that TE co-option as regulatory elements contributes to a sizable proportion of active species-specific regulatory elements in mESCs. Our transcription-centric approach allows for an unbiased systematic investigation of the regulatory potential across TE subfamilies, indicating that these contribute differently to the TF binding repertoire of the mouse genome, which can be linked to regulatory specificity in mESCs.

MATERIALS AND METHODS

E14 mESC and HeLa S2 CAGE libraries, data processing and mapping

Previously sequenced CAGE libraries for E14 mESCs, embryoid bodies sampled after 3 days of differentiation of mESCs (GEO ID GSE115710) (53) and human HeLa S2 cells (GEO ID GSE62047) (39) were collected. As described in each report, CAGE libraries were prepared from exosome depleted samples as well as from control samples. E14 mESCs and differentiated embryoid bodies were transduced with pLKO vectors encoding the shRNA: SHC002 (scrambled control - referred to as Scr control) and NM_025513.1–909s1c1 (referred to as Rrp40 exosome knockdown). HeLa cells were transfected with enhanced green fluorescent protein (eGFP, referred to as EGFP control) and the hRRP40 (EXOSC3) siRNA (referred to as RRP40 exosome knockdown). The CAGE libraries from mouse and human were processed as in the original publications, with some minor modifications. Reads were trimmed using the FASTX-Toolkit (version 0.013 - http://hannonlab.cshl.edu/fastx_toolkit/) to remove linker sequences (Illumina adaptors) and then filtered for a minimum sequencing quality of 30 in 50% of the bases. Mapping to the mouse reference genome (mm10) was performed using Bowtie (version 1.1.2), applying the following parameters to ensure several (up to 100) good alignments per read, which is essential for the rescue and analysis of TE-derived sequences: -k 100 (report up to 100 good alignments per read), -m 100 (eliminate reads that map > 100 times), –best and –strata (report alignments which have the highest quality). Reads that mapped to unplaced chromosome patches or chrM were discarded. Finally, all reads corresponding to reference rRNA sequences (mouse: BK000964.3, human: U13369.1) with up to two mismatches were discarded using rRNAdust (fantom.gsc.riken.jp/5/suppl/rRNAdust/). Mapping to the human reference genome (hg19) was performed using the exact same parameters and version of Bowtie. For exploratory/comparative purposes, the mapping of HeLa S2 CAGE libraries was performed with three additional alignment approaches (Supplementary Figure S1B,C). BWA (Burrows-Wheeler Aligner) version 0.7.15-r1140 (54) was used with the parameter -n 2 (maximum distance for each alignment) and a subsequent mapping quality MAPQ > 20 threshold, using SAMtools (version 1.3.1) (55). BWA-PSSM, a modification of BWA using position specific scoring matrices (PSSM) (56), was used with parameters -n2, -m 2000 (to allow for more suboptimal partial alignments to be tested) and a downstream MAPQ > 20 threshold. Finally, LAST (version 801) (57) was used with parameters -u NEAR and -R 01 (lastdb) and -Q 1 and -D 100 (lastal), followed by a downstream MAPQ > 20 threshold.

Probabilistic multi-mapping rescue of CAGE tags

Following initial mapping of reads with Bowtie, we employed MUMRescueLite (58) to resolve short multi-mapping CAGE reads that aligned equally well to more than one genomic location. In short, this method examines the information about the local context of potential mapping positions given by uniquely mapping reads. By assuming that multi-mapping reads are more likely to come from regions which already have more uniquely mapping reads, MUMRescueLite probabilistically assigns the true source of a multi-mapping read. The probabilistic alignment of a multi-mapping read is weighted by the abundance and location of uniquely mapping reads. Thus, a nominal window parameter is required for which to identify unique mappers that occur around (upstream and downstream) each locus occupied by a multi-mapper. In addition, a weight threshold is required over which one locus of a multi-mapper is ‘rescued’, referring to the fraction out of the total number of unique mappers proximal to all loci associated with a specific multi-mapper. The window parameter was cautiously selected to be 50 bp after a saturation investigation of how many reads were rescued. The weight threshold was set to 0.7, in order to select one locus as the true source of a multi-mapping read.

TElocal-inferred expression of mESC TE families

For statistical comparison purposes, CAGE unique and multi-mapping reads aligned with Bowtie, as described above, were supplied to TElocal (version 0.1.0) from the TEToolkit suite (59) to quantify transposable element expression at the locus level. The resulting quantification for each TE for both TElocal and multi-mapping rescuing output was normalized to tags per million mapped reads (TPM) and the results of the two approaches were compared using Spearman’s rank-based correlations.

CAGE tag clustering, quantification and normalization

Following multi-mapping rescue, the number of overall CAGE tag 5′ends were counted for each genomic position to obtain base-pair (bp) resolution of CAGE transcription start sites (CTSSs). We then assigned CTSSs to transposable elements (TEs) as defined by RepeatMasker (version 4.0.7; http://www.repeatmasker.org/) and considered only instances with two or more CAGE tags. TEs denoted as Alu elements in RepeatMasker were considered B1 and proto-B1 (PB1) elements. Tag clusters (TCs) were generated from pooled CAGE libraries per condition, and wide TCs were narrowed or decomposed into sub-peak TCs if they contained multiple peaks (https://github.com/anderssonlab/CAGEfightR_extensions), as previously described (44). In short, CTSSs located within 20 bp from each other on the same strand were merged into initial TCs. For each TC, the bp with the most abundant count (summit position) or the median of multiple equal summits were identified. Next, the fraction of total CTSSs of each location within a TC to that of the summit position was calculated. If a position carried <10% of the summit signal, it was discarded and decomposed TCs were merged if positioned within 20 bp from each other on the same strand. Expression quantification of each individual TC in each CAGE replicate was performed by aggregating the read counts of CTSSs falling into its genomic region. Using the CAGE genomic background noise estimation (as described below), all TCs with expression values below the noise threshold were discarded from further analyses. Expression levels of TCs were normalized to tags per million mapped reads (TPM). Finally, we assigned TCs to TEs on the same strand through direct overlap using BEDtools (60).

Footprints of CAGE expression on TEs

To investigate the relationship between transposable element families and/or classes to CTSS locations, we plotted the average binarized (presence or absence of CTSS, regardless of expression value) pooled CTSS signal 500 bp upstream, across the body and 500 bp downstream of each TE instance using deepTools (61). Unique TE instance profiles were averaged for each TE family or class based on their RepeatMasker annotation. Furthermore, we constructed a synthetic CAGE uniqueness track by mapping the mm10 reference genome split in 25 bp long segments back to itself. Localization of the synthetic CAGE tags on a bp resolution was conducted as described above at the CTSS level, assigned to TE instances and the signal was binarized, representing the expected uniquely mapped background signal. The log2 ratio of observed (CAGE libraries) versus expected (as estimated from the synthetic uniqueness track) average binarized signal was calculated in R (version 4.0.3; http://www.R-project.org/).

HeLa RNA-seq data processing

RNA-seq data from HeLa cells depleted of hRRP40 using siRNA-mediated knockdown as described elsewhere (62) (SRA accession: SRX365673) were considered. Briefly, after filtering of low-quality reads, removal of Illumina adaptors and reads <25 bp with Trimmomatic (version 0.36) (63), reads were mapped against the human reference genome (hg19) using HISAT2 (version 2.1.0) (64). Uniquely mapped and properly paired reads were selected with SAMtools (version 1.3.1). Gene-level expression quantification of mapped reads was performed with featureCounts (version 1.6.3) (65). Further analyses and comparison to gene-level expression with CAGE using generalized linear Poisson regression models with backward elimination for variable selection was performed in R using the glm function (see Supplementary Note).

CAGE gene-level expression quantification

For statistical comparison of gene-level HeLa RNA-seq expression to gene-level HeLa expression as measured by CAGE, we quantified abundances of genes using CTSSs within ±500 bp windows from the 5′end of GENCODE (version M10; GRCm38) transcripts (66), as CAGE signal saturates after ∼500 bp from annotated gene TSSs (50). Gene-level abundances were quantified by first merging potentially overlapping TSS-centered windows per transcript belonging to the same gene and then summing the expression levels of all transcript windows for each gene. Similarly, gene-level expression was quantified using CAGE data for mESC and embryoid bodies.

Processing of DNase-seq data and DHSs as focus points for transcription initiation

For identification of TE-associated regulatory elements, sequencing reads from DNase-seq for the mouse ES-E14 cell line (GEO ID GSE37073, GSM1014154) were processed using the ENCODE DNase-HS pipeline. Called hotspot FDR 1% peaks in the mouse reference genome (mm10) were used as DNase I hypersensitive sites (DHSs). DHSs were used as focus points of minus and plus strand expression by defining DHS midpoints as positions optimizing the coverage of proximal CAGE tags within flanking windows of size ±300 bp around them, as previously described (44). The final set of 165 052 transcribed DHS was determined by filtering DHSs to not overlap any other DHS ±300 bp window on the same strand and to be supported by either control or exosome knockdown CAGE expression above the noise threshold (described below). Transcriptional directionality and exosome sensitivity scores were calculated considering this set of DHS regions, as defined previously (39). In short, the directionality score measures the expression level strand bias within transcribed DHSs, it ranges from -1 to 1 (100% minus or plus strand expression), while 0 indicates balanced bidirectional transcription. The exosome sensitivity score measures the relative amount of degraded RNAs by the exosome by quantifying the fraction of exosome-depleted CAGE expression seen only after exosome depletion: exosome sensitivity values closer to 1 are indicative of highly unstable RNAs.

Estimation of CAGE genomic background noise

The estimation of a CAGE genomic background noise threshold (https://github.com/anderssonlab/CAGEfightR_extensions) for robust assessment of lowly expressed regions was based on quantifying the CAGE 5′ ends in randomly selected uniquely mappable regions of 200 bp distal to known TSSs, exons and DHSs, followed by extracting the 99th percentile of the empirical distribution of CAGE expression and using the max value across control libraries as a noise threshold for significant expression in further analyses, described in detail elsewhere (44).

Genomic annotation of transcribed TEs

We annotated TE-associated TCs based on different genomic regions as defined using GENCODE (version M10) and BEDtools, ensuring there were no overlapping regions counted twice. Coordinates for all genic regions (exons, 5′ UTRs, 3′ UTRs) were extracted from the GENCODE annotation. Promoter regions were defined as regions at the starting positions of each transcript ±500 bp. To define intronic regions, we subtracted the exonic regions from the genic regions. Finally, distal/intergenic regions were defined as the remaining parts of the genome in-between annotated genes.

Estimation of evolutionary conservation of TE-associated DHSs

Genomic regions spanning ±150 bp around mESCs DHS signal peaks carrying CAGE tags and overlapping TEs were aligned to rat (rn7) and human (hg38) assemblies using the UCSC liftOver tool (67) with a -minMatch = 0.6 parameter. Similarly, 300 bp regions of nonTE-associated DHSs carrying CAGE tags, the mm10 genome assembly split in 300 bp fragments, the subset of those fragments not overlapping TEs, and all TE instances of RepeatMasker (full length) were aligned to rat and human assemblies. The genomic regions that had a >60% match (coverage) with those in the other species were considered orthologous.

FANTOM enhancers and enhancer peak calling from STARR-seq data

Transcribed enhancers identified by remapped FANTOM5 CAGE libraries to mm10, deposited in Zenodo (68), were associated with TE-associated DHSs by overlap using BEDTools. STARR-seq data for 2iL grown mESCs E14Tg2a (E14) (69) (GEO ID GSE143546) were used to evaluate the enhancer potential of transcribed TEs. STARR-seq data were processed using Bowtie2 (70) and SAMtools (55). Reads were aligned to the mouse reference genome (mm10) using Bowtie2 (–very sensitive). The reads of the two replicates from each sample were sorted and merged and reads falling into regions from the ENCODE blacklist of the mouse reference genome were removed. STARRPeaker (version 1.0) (71) was used to identify potential enhancers with default parameters and an adjusted P-value threshold of 0.05. Potential enhancers called from STARR-seq data were associated with expressed TE-associated DHSs by overlap using BEDTools.

Processing and analysis of histone modification ChIP-seq data

Mouse ENCODE E14Tg2a or E14 mESCs ChIP-seq data for six histone modifications: H3K4me3, H3K4me1, H3K27ac, H3K36me3, H3K9me3 (GEO ID GSE136479) and H3K27ac (GEO ID GSE31039) were processed using the ENCODE ChIP-seq processing pipeline (version 1.3.6). Adapters and low-quality reads were filtered with cutadapt (version 2.5) (72), reads were mapped to the mouse genome assembly mm10 with bwa, duplicate reads were removed with Picard (version 2.20.7) (http://broadinstitute.github.io/picard/), ENCODE mm10 blacklist regions were masked and only reads with mapping quality above 30 were considered for further downstream analyses. For the heatmap and footprint plots, ChIP signal expressed as fold-over input control was averaged across sites in 10 bp bin intervals from the CAGE TC summit position up to a maximum of ±2000 bp, using deepTools (version 3.1.3). Hierarchical clustering, annotation and visualization were conducted in R with ChIPSeeker (73), ggplot2 (https://ggplot2.tidyverse.org) and profileplyr (version 1.6.0), using clustering parameters rowMax for summarizing the ranges across samples and median for defining proximity between clusters. The association of clusters with TE classes and families was done in R, using the Fisher’s exact test with Bonferroni correction for multiple testing.

Chromatin state discovery and characterization using chromHMM

To annotate TE-associated DHSs and characterize which regulatory elements in mESCs (mm10) they are occupying, we constructed a 12-state model chromatin states map to identify genomic regions enriched in specific combinations of histone modifications and TF marks, as previously described (74). The multivariate hidden Markov model framework of chromHMM (75,76) was applied to the mouse reference genome (mm10) and ENCODE ChIP-seq data in E14 cells for the following ten marks: H3K4me1, H3K4me3, H3K27me3, H3K27ac, H3K9me3, H3K9ac, H3K36me3, CTCF, Nanog, Oct4. TE-associated DHSs and non-TE DHSs associated with TCs from control and exosome-depleted CAGE libraries were overlapped with the chromHMM states with BEDtools and the heatmaps were generated in R using heatmap.2 from the gplots package (version 3.1.1).

Transcription factor motif analysis

We scanned full ERV transposons carrying > 1 CAGE tag with known TF motifs in the HOMER motif database using findMotifsGenome.pl (77). For each TF, findMotifsGenome.pl employs a hypergeometric test to compare the number of motifs found in the target set with that found in a specified background set. The tool was run using the set of expressed ERV transposons per subfamily as the target set and the full set of non-TE-associated CAGE TCs (±200 bp regions) as the background set. The significantly enriched TF binding site motifs were selected with three additional conditions: (i) TF genes were expressed in mESCs using gene-level CAGE quantification, (ii) at least 10% of the target set contained the motif and (iii) known TF genes had a match score > 0.9 to the de novo motifs found by HOMER. The motif enrichment score was calculated as log2 (% of target sequences with motif / % of background sequences with motif). Heatmaps of TF motif enrichment were generated in R using the ggplot2 package. In order to account for cell type specific TE expression, as measured by CAGE in mESCs and embryoid bodies, we scanned TE-associated DHSs in mESCs, using scanMotifGenomeWide.pl (77), with the position weight matrices of nine pluripotency TFs that demonstrated enrichment across several transcribed ERV subfamilies (Figure 5A). ERV instances carrying a predicted binding site of at least one out of the nine TFs were considered to be associated with pluripotency TFs. Comparisons between binding sites of selected mouse TFs were performed using TomTom of the MEME suite including the three mouse-specific motif databases: HOCOMOCO Mouse (version 11 FULL), UniPROBE Mouse (Sci09 Cell08), Embryonic Stem Cell TFs (Chen2008) both as query and target motifs.
Figure 5.

ERV subfamilies contribute to distinct enrichments of binding sites for pluripotency factors. (A) Motif enrichments for selected TFs (columns) in transcribed TEs across selected ERV subfamilies (rows) versus a background of non-TE genomic regions ±200 bp around the summits of all CAGE-inferred TSS clusters. White cells indicate no enrichment and cases of complete depletion were assigned the lowest detected score, represented in dark blue according to the scale. The full TF enrichment heatmap is shown in Supplementary Figure S15A. Example genomic insertion sites for ERVK subfamilies are shown to illustrate their differences in carrying putative TF binding sites for Sox2, Esrrb and Oct4. (B) Correlations (Pearson’s r) between TF motif enrichments across all ERV subfamilies (as shown in Supplementary Figure S15A). The full heatmap of correlations is given in Supplementary Figure S15B. (C) Similarity (q-value) between binding motifs for selected TFs. Sequence logos (right) for Oct4, Nanog and Klf4 exemplify differences and similarities of TF binding sites.

Functional enrichment analysis

Gene ontology (GO) and pathway enrichment analyses, using TE-regulated ABC (activity-by-contact) predicted target genes (retrieved from https://osf.io/uhnb4/) for enhancers (78) as target sets and all ABC-predicted target genes as background, were performed in R using clusterProfileR (79). Only statistically significant results (p.adjust < 0.05) were considered after employing false discovery rate for multiple testing correction. ABC region coordinates were lifted from mm9 to mm10 using the UCSC liftOver tool (80) and were associated with CAGE TCs by coordinate overlap (BedTools).

RESULTS

Transcription at transposable elements is divergent and results in unstable RNAs

To investigate the prevalence and co-option of TEs as regulatory elements, we first characterized the association of TEs with divergent transcription initiation, a key property of regulatory elements with either enhancer or promoter activities (4,40,44,45,81,82). To this end, we analyzed the genomic location of sequencing reads of capped RNA 5′ ends in mESCs. 5′ ends of capped RNAs reveal transcription start sites (TSSs) of RNA polymerase II transcripts and can be accurately assayed using CAGE (49). Here, we considered CAGE data from mESCs depleted for the RNA exosome core component Rrp40 (53). TSSs identified by these data thus also include those whose RNAs are targeted by the exosome for degradation. To allow characterization of TSSs at base-pair resolution within TE insertions, we considered both uniquely mapping and multi-mapping rescued (58,83) CAGE reads (see Materials and Methods). Multi-mapping rescue increased the number of identified TE-associated TSSs and improved expression quantification of TE-associated loci in mESCs and, as a confirmation, also in HeLa cells (Supplementary note; Supplementary Figures S1–S4). Comparable expression levels were observed using an alternative strategy based on maximum likelihood alignments to annotated repeats (59), although the sets of detected, expressed TE insertions varied between methods (Supplementary note; Supplementary Figure S5). The ability to infer the TSSs and expression levels of TE-derived RNAs from CAGE prompted us to characterize the transcription initiation patterns across TE families. A total of 82 383 mouse TE insertion events were associated with measurable RNA polymerase II transcripts in mESCs. Among these, endogenous retroviral TE family elements (ERV1, ERVK, ERVL, ERVL-MaLR) were most prevalent (P < 2e-16, Fisher’s exact test; Supplementary Figures S2B and S3). These TE families and L1 LINE elements showed preferential TSS locations close to their repeat body boundaries (Figure 1A,B). Of note, TSS location preferences could in general not be explained by varying mappability over repeat bodies (Supplementary Figure S6). However, a lack of detected expression in DNA transposons could be due to low mappability, suggesting that expression quantification and TSS mapping of this class may require alternative approaches or longer sequencing reads. By taking mappability into account, a strong divergent pattern for L1 elements originating at their 5′ and 3′ ends was revealed (Supplementary Figure S6), suggesting that some of these may utilize their native TSSs. We note, however, that TSSs within TEs often deviate from their native TSSs, as seen for ERVs (Supplementary Figure S6).
Figure 1.

TEs are divergently transcribed into unstable RNAs. (A) Average distribution of CAGE-inferred TSS locations (vertical axis; expression agnostic) ±500 bp upstream/downstream and across the body of major TE families (horizontal axis). TSS locations are visualized separately for the sense (upper panel) and antisense (middle panel) strands. (B) Average distribution of CAGE-inferred TSS locations for the ERVK family. (C) Transcriptional directionality score, describing the strand bias in expression levels (ranges between -1 for 100% minus strand expression and 1 for 100% plus strand expression), for mRNA and non-mRNA (non-protein-coding GENCODE transcripts) as well as TE-associated and non-TE-associated RNAs (regardless of annotation). (D) Exosome sensitivity, measuring the relative amount of exosome degraded RNAs (ranges between 0 for RNAs unaffected by the exosome and 1 for 100% unstable RNAs), for transcripts associated with LTR families ERV1, ERVK, ERVL, and ERVL-MaLR. For comparison, exosome sensitivity is shown for mRNAs and gene-distal loci. (E and F) Genome browser tracks for two loci of unannotated transcripts with characteristic divergent expression patterns falling on TE insertions of ORR1A2 (ERVL-MaLR; (E)) and RMER17B (ERVK; (F)) subfamilies. Pooled replicate CAGE expression levels in control (Scr) and after exosome depletion (Rrp40) split by plus (blue) and minus (red) strands are shown. For visibility reasons, the scales of CAGE signals differ between strands and conditions.

TEs are divergently transcribed into unstable RNAs. (A) Average distribution of CAGE-inferred TSS locations (vertical axis; expression agnostic) ±500 bp upstream/downstream and across the body of major TE families (horizontal axis). TSS locations are visualized separately for the sense (upper panel) and antisense (middle panel) strands. (B) Average distribution of CAGE-inferred TSS locations for the ERVK family. (C) Transcriptional directionality score, describing the strand bias in expression levels (ranges between -1 for 100% minus strand expression and 1 for 100% plus strand expression), for mRNA and non-mRNA (non-protein-coding GENCODE transcripts) as well as TE-associated and non-TE-associated RNAs (regardless of annotation). (D) Exosome sensitivity, measuring the relative amount of exosome degraded RNAs (ranges between 0 for RNAs unaffected by the exosome and 1 for 100% unstable RNAs), for transcripts associated with LTR families ERV1, ERVK, ERVL, and ERVL-MaLR. For comparison, exosome sensitivity is shown for mRNAs and gene-distal loci. (E and F) Genome browser tracks for two loci of unannotated transcripts with characteristic divergent expression patterns falling on TE insertions of ORR1A2 (ERVL-MaLR; (E)) and RMER17B (ERVK; (F)) subfamilies. Pooled replicate CAGE expression levels in control (Scr) and after exosome depletion (Rrp40) split by plus (blue) and minus (red) strands are shown. For visibility reasons, the scales of CAGE signals differ between strands and conditions. The average profiles of TSS locations for ERVs and L1 LINEs (Figure 1A,B) indicated divergent transcription initiation, reminiscent of that of gene promoters and gene-distal enhancers (38–40,43). To investigate the functional relevance of individual genomic TE insertions, we quantified transcriptional directionality in TE-associated open chromatin loci, as measured by DNase I hypersensitive sites (DHSs). The large majority of TE-associated CAGE-derived TSSs were proximal to DHSs (Supplementary Figure S7) and the majority of TE-associated DHSs displayed balanced bidirectional transcription initiation (Figure 1C), a hallmark of gene-distal regulatory elements with enhancer activity (40). This suggests that some of the investigated TEs may act as enhancers. Divergent transcripts from enhancers and gene promoters are frequently associated with nuclear decay by the RNA exosome (39,51,52). Comparing CAGE data from exosome-depleted mESCs with wildtype mESCs (scrambled shRNA control) (53) confirmed that TE-derived RNAs are also degraded by the exosome (Figure 1D; Supplementary Figures S2–S4,8; Table S1), as exemplified by CAGE data at genomic loci containing insertion sites for ORR1A2 (ERVL-MaLR) and RMER17B (ERVK) (Figure 1E,F). The exosome sensitivity of TE-derived RNAs was similar to those of gene-distal loci, in contrast to mRNAs which are mostly protected against decay by the exosome (Figure 1D) (39,51,52). Across TE families, we generally observed more TE-associated TSSs in exosome-depleted mESCs compared to wild-type mESCs (73 246 versus 28 215). Overall, more TEs with measurable transcription initiation (>1 CAGE tag, hereafter referred to as transcribed TEs) were detected in exosome-depleted compared to wildtype mESCs (26 079 in wild-type mESCs; 64 299 in exosome-depleted mESCs; 82 383 in pooled CAGE data of exosome-depleted and wild-type mESCs). Exosome-depleted cells further displayed an increased expression level of TE-derived RNAs (Supplementary Figure S2 and Table S1). Together, these results indicate that, although a prominent number of TEs are transcribed, the majority of derived RNAs are at least partially degraded. Taken together, our results demonstrate that transcribed TEs in mESCs are associated with open chromatin regions accommodating divergent transcription initiation of RNAs that are subject to degradation. These properties imply the potential for TEs to function as enhancers.

Transcription of dormant ERVs reveals co-opted regulatory elements

The characteristics of transcribed TEs and the similarities between TE-derived RNAs, eRNAs and PROMPTs led us to investigate further similarities with transcribed regulatory elements. Genomic annotation of transcribed TEs in mESCs revealed that a substantial fraction was either located in gene-distal intergenic regions or overlapped with gene promoters (Figure 2A). Across all transcribed TE families, ERV and L1 families contained the biggest fractions of TEs in intergenic regions localized at least 10 kb from the nearest gene, indicating their preference for gene-distal regulatory elements (50.6% of ERVs on average across ERV families and 59.8% of L1 LINEs). In contrast, other TE families displayed an elevated proportion of expressed TEs in genic regions. Many CAGE-inferred TSSs of transcribed TEs overlapped with open chromatin regions as defined by DHSs (62 514, ∼65%, of transcribed TEs overlapped via their TSSs with 7431 DHSs). This suggests that a prominent fraction of transcribed TEs may act, alone (34% of DHSs overlapped TSSs of one transcribed TE) or in combination with other transcribed TEs (66% of DHSs overlapped TSSs of multiple transcribed TEs), as gene regulatory elements, but may also reflect a selfish TE tropism for open chromatin regions (14,33,84,85).
Figure 2.

Transcription of transposable elements reveals co-opted regulatory elements. (A) Fraction of expressed TEs in control (Scr) and exosome KD (Rrp40) mESCs as well as all TEs annotated in Repeatmasker (expression agnostic) per genomic annotation group for each TE family. The number of instances of each TE family is shown in parenthesis. (B) Percentages of orthologous TE-associated, ERV-associated, and non-TE-associated expressed DHSs, as well as background regions between mouse and rat or human genomes. (C) The number of transcribed TEs overlapping FANTOM 5 mouse enhancers at the TE family level. TE subfamily counts are displayed in Supplementary Figure S9A. (D) The number of transcribed TEs overlapping STARR-seq mESC enhancers at the TE family level. TE subfamily counts are displayed in Supplementary Figure S9B.

Transcription of transposable elements reveals co-opted regulatory elements. (A) Fraction of expressed TEs in control (Scr) and exosome KD (Rrp40) mESCs as well as all TEs annotated in Repeatmasker (expression agnostic) per genomic annotation group for each TE family. The number of instances of each TE family is shown in parenthesis. (B) Percentages of orthologous TE-associated, ERV-associated, and non-TE-associated expressed DHSs, as well as background regions between mouse and rat or human genomes. (C) The number of transcribed TEs overlapping FANTOM 5 mouse enhancers at the TE family level. TE subfamily counts are displayed in Supplementary Figure S9A. (D) The number of transcribed TEs overlapping STARR-seq mESC enhancers at the TE family level. TE subfamily counts are displayed in Supplementary Figure S9B. In agreement with an association of TEs with species-specific DHSs (14), we observed that the majority of transcribed TE-associated DHSs in mESCs are rodent- or mouse-specific (Figure 2B). Only 16.9% of DHS sequences had human orthologous sequences while 79.4% were orthologous to the rat genome, indicating that they derive from TEs that have accumulated after the primate-rodent split. In contrast, transcribed DHSs devoid of TEs were more conserved. Interestingly, transcribed DHSs associated with ERV subfamilies were even more mouse specific (Figure 2B and Supplementary Figure S10), in agreement with previous reports (25,32,33), suggesting regulatory innovation through TE exaptation and that these elements contribute to the high evolutionary turnover of enhancers. To generally assess whether transcribed dormant TEs have been co-opted as transcriptional regulatory elements, we first evaluated their association with FANTOM enhancers (40,68,86), an extensively validated set of regulatory elements with predicted enhancer function inferred from bidirectional transcription initiation across a large number of cell types and tissues. Notably, 22.1% of FANTOM mouse enhancers with detectable expression in mESCs (1979 out of 8942) overlapped with transcribed TE-associated DHSs in mESCs (1504 out of 7431, 20.2%; P = 1e-5, Fisher’s exact test). This indicates that TEs contribute to a sizable fraction of regulatory elements with enhancer activity. Moreover, half of this set was composed of DHSs with LTR elements, in particular ERVKs and ERVL-MaLRs (Figure 2C and Supplementary Figure S9A), consistent with previously reported bidirectional transcription of ERVs (35). Of note, given that FANTOM enhancers were neither identified from exosome depleted mESCs nor multi-mapping rescued reads, it is conceivable that some TE-associated enhancers are not present in the FANTOM enhancer set. Since FANTOM enhancers were predicted based on the same property that we had observed for transcribed TEs, namely divergent transcription initiation, alternative experimental data are required to evaluate the enhancer potential of TEs (87). To this end, we considered the in vitro enhancer potential of transcribed TEs using genome-wide mESC STARR-seq data (69). About 18.9% of the transcribed TE-associated DHSs (1404 out of 7431) overlapped with open chromatin-associated STARR-seq enhancers (1949 out of 7078, 26.2%; Figure 2D; Supplementary Figure S9B). Again, LTRs, in particular ERVKs, constitute a sizable fraction of these enhancers, in agreement with the FANTOM enhancer overlap. In contrast, transcribed L1 LINE elements, despite frequently residing in gene-distal intergenic loci (Figure 2A), rarely overlapped with FANTOM or STARR-seq enhancers. L1 elements are therefore less likely to act as enhancers in ESCs, in line with human LINEs (29). Interestingly, in agreement with the strong association between transcription initiation and in vitro enhancer potential (40,44,47), we observed that non-transcribed TE-associated DHSs were to a lesser degree validated by STARR-seq enhancers than transcribed ones (P < 2e-16 for ERVKs and ERVL-MaLR, Fisher’s exact test). Combined, the STARR-seq and FANTOM sets indicate that 2378 (∼32%) transcribed TE-associated DHSs in mESCs are likely enhancers. Example loci are displayed in Figure 3, which clearly illustrate the association between divergent transcription initiation and enhancer activity from individual TEs (Figure 3A; RLTR41 insertion) or, in some cases, pairs of TEs (Figure 3B; RLTR41 in pairs with MYSERV-int and RMER10B insertions).
Figure 3.

TE insertions co-opted as divergently transcribed enhancers. (A and B) Genome browser tracks for two intergenic loci showing TPM-normalized CAGE data pooled across replicates and split by plus (blue) and minus (red) strands. Shown are also the locations of FANTOM5 mouse enhancers and STARR-seq mESC enhancers and signal tracks for ENCODE DNase-seq data and H3K4me1 and H3K27ac ChIP-seq data for E14 mESCs. The CAGE signals identify divergent transcription initiation from ERV1 RLTR41 insertions, alone (A) and in pairs (B) with TE insertions of RMER10B and MYSERV-int.

TE insertions co-opted as divergently transcribed enhancers. (A and B) Genome browser tracks for two intergenic loci showing TPM-normalized CAGE data pooled across replicates and split by plus (blue) and minus (red) strands. Shown are also the locations of FANTOM5 mouse enhancers and STARR-seq mESC enhancers and signal tracks for ENCODE DNase-seq data and H3K4me1 and H3K27ac ChIP-seq data for E14 mESCs. The CAGE signals identify divergent transcription initiation from ERV1 RLTR41 insertions, alone (A) and in pairs (B) with TE insertions of RMER10B and MYSERV-int.

Transcribed TEs carry chromatin features of regulatory elements

The strong overlap of ERVK elements and weak overlap of L1 elements with FANTOM and STARR-seq enhancer sets implies differences in the regulatory potential among TE families (Figure 2C,D). To investigate these differences further, we assessed the chromatin states at transcribed TEs. For this, ChIP-seq signal for histone modifications was aggregated around CAGE-inferred TSSs in TE-associated DHSs (88,89). Hierarchical clustering of these aggregated signals revealed six major chromatin states (Figure 4A and Supplementary Figure S11).
Figure 4.

Transcribed TEs exhibit chromatin features of regulatory elements. (A) Hierarchical clustering of histone modification (ChIP-seq) signals ±2000 bp around the summits of TE-associated clusters of CAGE-inferred TSSs (CAGE tag clusters, Materials and Methods). The ChIP-seq signal is shown as fold-change over input control. Clusters are represented in rows (color coded in left legend) and histone modifications in columns. Average distributions of ChIP-seq signals for each cluster are shown (top panel). (B) Annotations of TE-associated CAGE tag clusters based on GENCODE and RepeatMasker TE classes and families. (C and D) Bar plots of odds ratios of enrichments (Fisher’s exact test) of TE classes (C) and TE families (D) in each cluster.

Transcribed TEs exhibit chromatin features of regulatory elements. (A) Hierarchical clustering of histone modification (ChIP-seq) signals ±2000 bp around the summits of TE-associated clusters of CAGE-inferred TSSs (CAGE tag clusters, Materials and Methods). The ChIP-seq signal is shown as fold-change over input control. Clusters are represented in rows (color coded in left legend) and histone modifications in columns. Average distributions of ChIP-seq signals for each cluster are shown (top panel). (B) Annotations of TE-associated CAGE tag clusters based on GENCODE and RepeatMasker TE classes and families. (C and D) Bar plots of odds ratios of enrichments (Fisher’s exact test) of TE classes (C) and TE families (D) in each cluster. The majority of loci (clusters 1, 3, 4) carried histone modification signal associated with active regulatory elements (90–92): H3K4me1/3 and/or H3K27ac (Figure 4A and Supplementary Figure S12). The elements in cluster 3 and 4 were mainly intronic or gene-distal (>80%; Figure 4B) and displayed H3K4me1 signal, associated with weak transcription (4,46), in combination with H3K27ac. We observed a strong association with LTRs (Figure 4B,C), in particular ERVs (Figure 4B,D), in these two clusters, reinforcing the association statistics with STARR-seq and FANTOM enhancers. Although cluster 3 loci were associated with many TE families, a considerable fraction of TEs belong to the ERVL, ERVL-MaLR and B4 TE families (ERVL: 871, 33.9% of clustered ERVLs; ERVL-MaLR: 4106, 37.4% of clustered ERVL-MaLRs; B4: 1597, 35.5% of clustered B4s). ERV1s and ERVLs were enriched among cluster 4 TEs (Figure 4D). In addition, transcribed STARR-seq associated DHSs displayed comparable expression levels and similar histone modifications at TE and non-TE loci (Supplementary Figure S13), suggesting that TE-derived enhancers operate at similar activity levels as non-TE enhancers. A considerable fraction of TE insertions (cluster 1) displayed strong H3K4me3 signal, associated with promoter activity of highly expressed transcription units, e.g., mRNA genes (4,46). The majority of these fell close to GENCODE annotated TSSs and were enriched with SINE (B1 and proto-B1) and LTR (ERVK) elements (Figure 4B,D). Hence, TEs are not only evolutionary co-opted into distal regulatory elements, but also gene promoters. Indeed, out of 7431 transcribed TE-associated DHSs, 1012 (13.6%) were associated with GENCODE annotated mRNA gene TSSs and 558 (7.5%) with GENCODE long non-coding RNA (lncRNA) gene TSSs. Complementary chromatin state segmentation analysis confirmed the strong bias of transcribed TEs having histone modifications indicative of active regulatory elements (Supplementary Figure S14). Interestingly, we found some ERV1s and ERVKs, in addition to L1 LINEs, to be associated with both H3K27ac and H3K9me3 (cluster 6). The latter is present at repressed TEs in heterochromatin (19–22) and the combination of both repressing (H3K9me3) and activating (H3K27ac) histone modifications has been suggested to keep these TE-associated regulatory elements in a partially activated state (93). We also identified a group of transcribed TE loci (cluster 5) that was highly enriched with L1s but overall, only carried low levels of histone modifications. Some of these loci were associated with H3K27me3 signal, indicative of polycomb-mediated repression. TE loci of cluster 5 and 6 thus likely represent repressed regulatory elements with low transcriptional activity, which may facilitate later full activation, although we cannot rule out that this is an effect of mESC clonal heterogeneity. Taken together, through several lines of evidence we demonstrate that dormant TEs, in particular endogenous retroviral elements, have frequently been repurposed into regulatory elements with enhancer and promoter activities in mESCs.

The TF binding repertoires and regulatory topologies of ERV subfamilies indicate involvement in distinct regulatory programs

TF binding to regulatory elements is the key determinant of regulatory activity and the basis of cell-type specificity. To assess how TEs co-opted as regulatory elements may contribute to transcriptional regulatory programs in mESCs, we performed a TF binding site enrichment analysis of transcribed DHSs associated with TE insertions from 224 LTR subfamilies versus all genomic loci of CAGE-inferred TSSs in mESCs (Materials and Methods section). This analysis thus reveals TFs that are enriched or depleted in TEs compared to what we would expect from regulatory elements in general. Interestingly, predicted binding sites for several TFs, including pluripotency factors Nanog, Sox2 and Oct4, were enriched in transcribed LTRs compared to non-TE associated TSSs (Benjamini–Hochberg FDR-adjusted P < 1e-3). Binding site sequences for these TFs were further found in a large number of LTR insertions (in 70.9%, 24.5% and 11.5% of all 19 436 transcribed DHS-associated LTRs versus in 40.3%, 9.6% and 5.2% of CAGE-inferred TSSs for Nanog, Sox2 and Oct4, respectively). These results indicate that LTRs are a major source of regulatory elements controlled by pluripotency factors in mESCs. These results are in line with those observed for human ESCs (29,36), despite the low conservation of LTRs between species (Figure 2B) (34). Given their strong association with regulatory elements, we focused on putative TF binding sites in DHSs of transcribed TEs across ERV subfamilies. In fact, we observe a variability in enrichment of various TF binding site sequences in specific ERV subfamilies indicating diverse regulatory potentials of TEs and specific TF-TE-associations (Figure 5A, Supplementary Figure S15A and Table S2). We observed a positive correlation between enrichments of binding site sequences for pluripotency factors, e.g., Sox2 and Esrrb (Pearson’s r = 0.54), as well as Sox2 and Nanog (Pearson’s r = 0.47; Figure 5B; Supplementary Figure S15B). Differences in enrichments for Sox2 and Nanog were, however, found for some ERV subfamilies, as seen for instance for ORR1A1 and ORR1A2 (Figure 5A), indicating that calculated motif enrichment similarities between these two TFs are not necessarily driven by motif similarities. The overall enrichments of Sox2, Nanog and Esrrb generally agreed with Isl1, Bcl6 and Oct4 (Figure 5B and Supplementary Figure S15B), indicating that certain ERV subfamilies carry binding sites for a wide variety of pluripotency factors. Binding site sequences for all these TFs and the core pluripotency factor Oct4 were enriched in RLTR41, RLTR23, MMERVK9C_I_int, MLTR25A and RLTR11A (mean log2 enrichment > 1). ERV subfamilies contribute to distinct enrichments of binding sites for pluripotency factors. (A) Motif enrichments for selected TFs (columns) in transcribed TEs across selected ERV subfamilies (rows) versus a background of non-TE genomic regions ±200 bp around the summits of all CAGE-inferred TSS clusters. White cells indicate no enrichment and cases of complete depletion were assigned the lowest detected score, represented in dark blue according to the scale. The full TF enrichment heatmap is shown in Supplementary Figure S15A. Example genomic insertion sites for ERVK subfamilies are shown to illustrate their differences in carrying putative TF binding sites for Sox2, Esrrb and Oct4. (B) Correlations (Pearson’s r) between TF motif enrichments across all ERV subfamilies (as shown in Supplementary Figure S15A). The full heatmap of correlations is given in Supplementary Figure S15B. (C) Similarity (q-value) between binding motifs for selected TFs. Sequence logos (right) for Oct4, Nanog and Klf4 exemplify differences and similarities of TF binding sites. We noted that SOX family related motifs, including that for Sox2, were enriched in specific ERV subfamilies (most notably in RLTR9D, RLTR9E, LTRIS2, RLTR13B1, RLTR41). Similarly, we identified a specific enrichment of KLF factors, including Klf4 (most highly enriched in RLTRETN_Mm, ORR1A1, ORR1A2 and RLTR10A). ERVL-MaLRs ORR1A1 and ORR1A2 were, in addition to Klf4, also enriched with Nanog, Isl1, Bcl6, and Lhx3, but interestingly not Esrrb and Oct4 binding site sequences. This suggests that, by carrying different repertoires of TF binding sites, ERV subfamilies may contribute differently and distinctly to the pluripotency regulatory program (Supplementary Figure S15). Similarly, Lhx3 binding site sequences co-occurred with those of Esrrb in RLTR41 ERV1s but were depleted from Esrrb-enriched ERVKs RLTR9D and RLTR13B1. In addition, the top enriched ERVs for putative Oct4 binding sites (LTRIS2, RLTR17, RLTR16B_MM, RLTR41) were depleted from those of Klf4. The specific enrichments or depletions of Klf4 or Oct4 binding site sequences in certain ERV subfamilies indicate that the Klf4 and Oct4 ERV-derived regulatory networks have partially evolved independently from those of Sox2 and Nanog. This is consistent with distinct binding preferences (Figure 5C) and context-dependent cooperativity between Sox2 and Oct4 (94), which is limited during pluripotency maintenance (95). We note that the binding preferences for Oct4 and Nanog could allow for the derivation of new binding sites for one TF from ancestral binding site sequences of the other through mutations (Figure 5C). However, the dissimilarity of the Klf4 motif with those of Oct4 and Nanog indicates that evolutionary acquisition of Klf4 binding sites likely requires new transposition events. In addition to the enrichment of pluripotency TFs, we identified putative ERV-derived binding sites for a broad range of TFs (Supplementary Figure S15A). For instance, binding site sequences for ETS factors, involved in a large variety of gene regulatory programs and across cell types, were highly enriched among ERVL-MaLR subfamilies ORR1E and ORR1D2, which were further depleted of putative TF binding sites for Oct4 and Sox2. This suggests that transposition of TF binding sites native to LTR retrotransposons or the subsequent birth of new TF binding sites through mutations (27) could impact several regulatory programs, including those distinct from the naive pluripotency. To further investigate the regulatory programs contributed to by ERVs co-opted as regulatory elements, we linked each gene-distal, transcribed ERV-associated DHS with its predicted target gene from activity-by-contact (ABC) modeling (78). The resulting gene ontology enrichments of linked genes indicate a highly specific association of individual ERV subfamilies with gene regulatory programs (Figure 6A and Supplementary Figure S16), including regulation of genes involved in immunity by RLTR13B1 elements and regulation of genes involved in transcription, specifically basal TFs in the TFIID complex, by MLTR14 elements (Figure 6B).
Figure 6.

ERV subfamilies contribute to distinct gene regulatory programs. (A) GO term enrichment for putative target genes (ABC) of gene distal ERVs split by ERV subfamily (foreground) versus all ABC-predicted target genes (background). For ease of visualization, gene ontology terms are colored by manually curated process or function, and the underlying gene ontology term enrichments are shown for RLTR13B1 and MLTR14. Full results are provided in Supplementary Figure S16. (B) Predicted enhancer interactions with the promoter of gene Taf2 (TATA-Box Binding Protein Associated Factor 2). The four enhancers are marked by gray boxes and a zoom-in is provided below, showing overlaps with ERV insertions of RMER10B, MLTR14 and RMER17C. Tracks for GENCODE (M19) transcripts, FANTOM5 enhancers, STARR-seq enhancers, and TEs provided by RepeatMasker are shown. (C) Expression fold-change (log2), as measured by CAGE, in mESCs versus EBs at ERV-associated DHSs that carry (right) or not (left) predicted binding sites for pluripotency factors. (D) Gene-level expression (TPM normalized), as measured by CAGE, quantified for ABC-linked genes of transcribed TE-associated DHSs that carry predicted binding sites for pluripotency factors.

ERV subfamilies contribute to distinct gene regulatory programs. (A) GO term enrichment for putative target genes (ABC) of gene distal ERVs split by ERV subfamily (foreground) versus all ABC-predicted target genes (background). For ease of visualization, gene ontology terms are colored by manually curated process or function, and the underlying gene ontology term enrichments are shown for RLTR13B1 and MLTR14. Full results are provided in Supplementary Figure S16. (B) Predicted enhancer interactions with the promoter of gene Taf2 (TATA-Box Binding Protein Associated Factor 2). The four enhancers are marked by gray boxes and a zoom-in is provided below, showing overlaps with ERV insertions of RMER10B, MLTR14 and RMER17C. Tracks for GENCODE (M19) transcripts, FANTOM5 enhancers, STARR-seq enhancers, and TEs provided by RepeatMasker are shown. (C) Expression fold-change (log2), as measured by CAGE, in mESCs versus EBs at ERV-associated DHSs that carry (right) or not (left) predicted binding sites for pluripotency factors. (D) Gene-level expression (TPM normalized), as measured by CAGE, quantified for ABC-linked genes of transcribed TE-associated DHSs that carry predicted binding sites for pluripotency factors. The enrichment of pluripotency factors in transcribed ERV-associated DHSs suggests that the regulatory activities of these elements have a specificity toward cell types in which these TFs are active. To investigate their putative cell-type restricted activity, we quantified their expression using CAGE data for exosome-depleted samples derived after 3 days differentiation of mESCs into embryoid bodies (EBs) (53). In agreement with their predicted role in mESCs, the expression of ERV-associated regulatory elements with inferred binding sites for pluripotency TFs (as given in Figure 5A) were generally reduced in EBs, while those not containing such sites displayed a similar expression level in mESCs and EBs (Figure 6C). Accordingly, genes whose ABC-linked ERV-associated enhancers contained binding site sequences for pluripotency factors displayed lower expression in EBs (Figure 6D). These observations reflect a reduced activity of pluripotency TFs in EBs and demonstrate that expression of ERV-associated DHSs can be used as a marker for their cell-type specificity in regulatory activity. Taken together, our results indicate that ERV-derived regulatory elements transcribed in mESCs contribute in a specific manner to the pluripotency regulatory network through their binding sites for pluripotency TFs. Although we identified putative TF binding sites for multiple TFs for each ERV subfamily, the TF enrichments across ERV subfamilies were highly distinct (Figure 5 and Supplementary Figure S15A). Together with the diversity of functions of predicted target genes (Supplementary Figure S16), this suggests that each ERV subfamily contributes to distinct regulatory programs and pathways.

DISCUSSION

Transcriptional regulation of ESCs is multi-faceted. On one hand, regulatory elements act as binding platforms for transcription factors controlling the transcriptional activities of genes involved in activities not necessarily specific to ESCs, such as metabolism, transcription, stress response and replication. In parallel, ESCs must maintain plasticity to ensure potent differentiation capabilities. At the core of such activities are highly specialized and conserved TFs, including Oct4, Sox2 and Nanog, which, through targeted transcriptional regulation, maintain a naive pluripotent state. While gene regulation in ESCs by these TFs is highly conserved across Metazoa, their respective binding sites are not. TEs have been suggested to contribute to stabilizing the gene regulatory functions of TFs by providing regulatory sequences with the required binding sites. We here demonstrate that retrotransposons contribute to a sizable fraction of such regulatory innovation. Using an accurate and unbiased approach based on genome-wide profiling of TSSs, we systematically investigate the regulatory potential and transcriptional activities of the wide repertoire of mouse TEs in mESCs. We show that a fraction of TEs are transcribed and that these display balanced divergent transcription initiation patterns within sites of open chromatin. We provide, to the best of our knowledge, the first evidence that retrotransposon-derived RNAs are targeted by the exosome for degradation. As many as a third of divergently transcribed TE-associated DHSs are supported by in vitro enhancer potential derived from STARR-seq data or by overlap with, rigorously validated, previously predicted enhancer sets from FANTOM. In addition, we find a large overlap with annotated gene promoters, demonstrating that at least a half of transcribed TE-associated DHSs in mESCs are regulatory elements (enhancers or promoters). Transcription start site profiling by CAGE thus lends itself as an accurate approach for genome-wide surveys of the regulatory activity of TEs, for which predictions based on open chromatin and histone modifications yield considerably lower validation rates (29,31). We observed that a substantial fraction of transcribed TE-derived regulatory elements were non-orthologous to rat or human genomes, demonstrating a high degree of regulatory innovation by TEs. Our TSS-centric approach thus allows for an unbiased, systematic investigation of the regulatory potential across TE subfamilies, beyond a selected few subfamilies. We note that disagreements in validation rates to a recent study (31) could further be due to difference in validation assays used. While CRISPR interference (CRISPRi), used by Todd et al. (31), has the potential to better reflect in vivo activities, the false-negative rate of CRISPRi remains a concern (96). In addition, redundant enhancers may buffer regulatory effects (97,98). Therefore, an enhancer could still have a causal role in gene regulation even though no observable effect can be measured by CRISPRi. As such, in vitro measurements, e.g., derived from STARR-seq, are therefore better suited to reveal the enhancer potential of a regulatory element, even though such an approach may suggest enhancers that are not active in vivo. Further studies are necessary to properly compare the quantified activities of regulatory elements between STARR-seq and CRISPRi. Endogenous retroviral elements were most frequently transcribed, and ERVKs stand out as the largest contributor of regulatory elements with enhancer potential in mESCs. This bias is likely explained by their general enrichment of binding sites (23,27) and binding site sequences (Figure 5A) for TFs regulating naive pluripotency, including Oct4, Sox2 and Nanog. However, we do acknowledge that mapping of sequencing reads might skew our focus to evolutionary older ERV subfamilies (99), which have accumulated more mutations than younger ones and therefore have a higher chance of mapping reads uniquely (59). However, any mapping bias cannot explain differential TF enrichments across ERV subfamilies or their cell-type specific expression patterns. The diverse TF enrichments observed across ERV subfamilies indicate a major contribution of ERVs to the landscape of regulatory elements and a wide variety of gene regulatory programs. Interestingly, we observed varying degrees of over-representation of putative TF binding sites in ERV subfamilies, including those enriched with the wide repertoire of pluripotency TFs, those partially depleted of specific pluripotency TFs, and those enriched with non cell-type specific TFs, like the ETS superfamily. We noted a general strong co-occurrence of binding site sequences for Nanog, Sox2, and Esrrb, while Oct4 and Klf4 were depleted in some subfamilies. While some of these TFs have similar motifs, e.g., Nanog, Sox2, and may therefore overestimate co-occurrences, the specific enrichments and depletions argue for unique contributions by ERV subfamilies to the regulatory programs of mESCs, which is confirmed by the specific functions of putative target genes by ERVs predicted to be co-opted as gene-distal enhancers. Further, our data suggest that the regulatory landscape of naive pluripotency in mESCs has been shaped independently by multiple subfamilies of ERVs. In agreement with previous work (23,29,36), we observed a substantial enrichment of binding site sequences for pluripotency factors in ERV-associated regulatory elements. However, our transcription-centric approach identified many ERV subfamilies with enrichment for Nanog or Oct4 binding site sequences not revealed when focusing on their binding sites as inferred from ChIP-seq (36), including ORR1A1, ORR1A2, RMER17B, LTRIS2, RLTR9E, RMER19B for Nanog and LTRIS2, RLTR16B_MM, MLTR25A, RLTR23, BGLII_B, RMER1B for Oct4. While we acknowledge that such differences can be due to false positives from motif scanning (100), by focusing on sites of transcription initiation we may better capture the regulatory active subset of TF binding events (101). We speculate that utilizing transcription initiation profiling to infer the regulatory activity of TEs specifically identifies TE-derived regulatory elements for which relevant DNA sequences, e.g., binding sites for pluripotency factors, are linked with cell-type specific transcriptional and regulatory activity. This is confirmed by seemingly cell-type restricted expression of pluripotency-associated ERVs and that of their putative target genes, when expression is compared between mESCs and EBs (Figure 6C,D). In addition to the core pluripotency TFs, we observed an enrichment of binding site sequences for multiple TFs in several ERV subfamilies. Since ERVs originate from ancestral genomic insertions of retroviral elements, the ancestral ERV sequences must therefore have either carried the full repertoire of TF binding sites that have been maintained through evolution or have served as substrates for the evolutionary acquisition of diverse TF binding sites through mutations (27,28). That way, ERVs may offer enough sequence diversity to allow encoding for enhancer activity even in the absence of binding sites for specific master regulators of mESCs (102). Taken together, we present a systematic characterization of the transcriptional activities, RNA decay patterns, chromatin signatures and regulatory potential of TEs in mESCs. Our results demonstrate a sizable contribution of TEs to regulatory innovation and the regulatory landscape of naive pluripotency. We further show that expression of TE-associated open chromatin regions is indicative of cell-type restricted regulatory activity. Charting their dynamic activities over development will thus be an important next step to further our understanding of the cell-type specific roles of TEs in transcriptional regulation. Click here for additional data file.
  100 in total

1.  featureCounts: an efficient general purpose program for assigning sequence reads to genomic features.

Authors:  Yang Liao; Gordon K Smyth; Wei Shi
Journal:  Bioinformatics       Date:  2013-11-13       Impact factor: 6.937

2.  Systematic dissection of genomic features determining transcription factor binding and enhancer function.

Authors:  Sharon R Grossman; Xiaolan Zhang; Li Wang; Jesse Engreitz; Alexandre Melnikov; Peter Rogov; Ryan Tewhey; Alina Isakova; Bart Deplancke; Bradley E Bernstein; Tarjei S Mikkelsen; Eric S Lander
Journal:  Proc Natl Acad Sci U S A       Date:  2017-01-30       Impact factor: 11.205

Review 3.  Eukaryotic core promoters and the functional basis of transcription initiation.

Authors:  Vanja Haberle; Alexander Stark
Journal:  Nat Rev Mol Cell Biol       Date:  2018-10       Impact factor: 94.444

4.  The evolution of Great Apes has shaped the functional enhancers' landscape in human embryonic stem cells.

Authors:  Gennadi Glinsky; Tahsin Stefan Barakat
Journal:  Stem Cell Res       Date:  2019-05-03       Impact factor: 2.020

5.  Massively parallel decoding of mammalian regulatory sequences supports a flexible organizational model.

Authors:  Robin P Smith; Leila Taher; Rupali P Patwardhan; Mee J Kim; Fumitaka Inoue; Jay Shendure; Ivan Ovcharenko; Nadav Ahituv
Journal:  Nat Genet       Date:  2013-07-28       Impact factor: 38.330

6.  Five-vertebrate ChIP-seq reveals the evolutionary dynamics of transcription factor binding.

Authors:  Dominic Schmidt; Michael D Wilson; Benoit Ballester; Petra C Schwalie; Gordon D Brown; Aileen Marshall; Claudia Kutter; Stephen Watt; Celia P Martinez-Jimenez; Sarah Mackay; Iannis Talianidis; Paul Flicek; Duncan T Odom
Journal:  Science       Date:  2010-04-08       Impact factor: 47.728

7.  A comparative encyclopedia of DNA elements in the mouse genome.

Authors:  Feng Yue; Yong Cheng; Alessandra Breschi; Jeff Vierstra; Weisheng Wu; Tyrone Ryba; Richard Sandstrom; Zhihai Ma; Carrie Davis; Benjamin D Pope; Yin Shen; Dmitri D Pervouchine; Sarah Djebali; Robert E Thurman; Rajinder Kaul; Eric Rynes; Anthony Kirilusha; Georgi K Marinov; Brian A Williams; Diane Trout; Henry Amrhein; Katherine Fisher-Aylor; Igor Antoshechkin; Gilberto DeSalvo; Lei-Hoon See; Meagan Fastuca; Jorg Drenkow; Chris Zaleski; Alex Dobin; Pablo Prieto; Julien Lagarde; Giovanni Bussotti; Andrea Tanzer; Olgert Denas; Kanwei Li; M A Bender; Miaohua Zhang; Rachel Byron; Mark T Groudine; David McCleary; Long Pham; Zhen Ye; Samantha Kuan; Lee Edsall; Yi-Chieh Wu; Matthew D Rasmussen; Mukul S Bansal; Manolis Kellis; Cheryl A Keller; Christapher S Morrissey; Tejaswini Mishra; Deepti Jain; Nergiz Dogan; Robert S Harris; Philip Cayting; Trupti Kawli; Alan P Boyle; Ghia Euskirchen; Anshul Kundaje; Shin Lin; Yiing Lin; Camden Jansen; Venkat S Malladi; Melissa S Cline; Drew T Erickson; Vanessa M Kirkup; Katrina Learned; Cricket A Sloan; Kate R Rosenbloom; Beatriz Lacerda de Sousa; Kathryn Beal; Miguel Pignatelli; Paul Flicek; Jin Lian; Tamer Kahveci; Dongwon Lee; W James Kent; Miguel Ramalho Santos; Javier Herrero; Cedric Notredame; Audra Johnson; Shinny Vong; Kristen Lee; Daniel Bates; Fidencio Neri; Morgan Diegel; Theresa Canfield; Peter J Sabo; Matthew S Wilken; Thomas A Reh; Erika Giste; Anthony Shafer; Tanya Kutyavin; Eric Haugen; Douglas Dunn; Alex P Reynolds; Shane Neph; Richard Humbert; R Scott Hansen; Marella De Bruijn; Licia Selleri; Alexander Rudensky; Steven Josefowicz; Robert Samstein; Evan E Eichler; Stuart H Orkin; Dana Levasseur; Thalia Papayannopoulou; Kai-Hsin Chang; Arthur Skoultchi; Srikanta Gosh; Christine Disteche; Piper Treuting; Yanli Wang; Mitchell J Weiss; Gerd A Blobel; Xiaoyi Cao; Sheng Zhong; Ting Wang; Peter J Good; Rebecca F Lowdon; Leslie B Adams; Xiao-Qiao Zhou; Michael J Pazin; Elise A Feingold; Barbara Wold; James Taylor; Ali Mortazavi; Sherman M Weissman; John A Stamatoyannopoulos; Michael P Snyder; Roderic Guigo; Thomas R Gingeras; David M Gilbert; Ross C Hardison; Michael A Beer; Bing Ren
Journal:  Nature       Date:  2014-11-20       Impact factor: 49.962

8.  Functional evaluation of transposable elements as enhancers in mouse embryonic and trophoblast stem cells.

Authors:  Christopher D Todd; Özgen Deniz; Darren Taylor; Miguel R Branco
Journal:  Elife       Date:  2019-04-23       Impact factor: 8.140

9.  Nonreciprocal and Conditional Cooperativity Directs the Pioneer Activity of Pluripotency Transcription Factors.

Authors:  Sai Li; Eric Bo Zheng; Li Zhao; Shixin Liu
Journal:  Cell Rep       Date:  2019-09-03       Impact factor: 9.423

10.  Fast and accurate short read alignment with Burrows-Wheeler transform.

Authors:  Heng Li; Richard Durbin
Journal:  Bioinformatics       Date:  2009-05-18       Impact factor: 6.937

View more
  2 in total

Review 1.  Transposable Elements in Pluripotent Stem Cells and Human Disease.

Authors:  Gang Ma; Isaac A Babarinde; Xuemeng Zhou; Andrew P Hutchins
Journal:  Front Genet       Date:  2022-06-02       Impact factor: 4.772

Review 2.  Transcriptional and reverse transcriptional regulation of host genes by human endogenous retroviruses in cancers.

Authors:  Mengwen Zhang; Shu Zheng; Jessie Qiaoyi Liang
Journal:  Front Microbiol       Date:  2022-07-19       Impact factor: 6.064

  2 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.