
Applications of DNA integrating elements: Facing the bias bully.

Johann de Jong¹, Lodewyk F A Wessels², Maarten van Lohuizen³, Jeroen de Ridder⁴, Waseem Akhtar³.

Abstract

Retroviruses and DNA transposons are an important part of molecular biologists' toolbox. The applications of these elements range from functional genomics to oncogene discovery and gene therapy. However, these elements do not integrate uniformly across the genome, which is an important limitation to their use. A number of genetic and epigenetic factors have been shown to shape the integration preference of these elements. Insight into integration bias can significantly enhance the analysis and interpretation of results obtained using these elements. For three different applications, we outline how bias can affect results, and can potentially be addressed.


Keywords: chromatin position effect; epigenomics; gene therapy; insertional mutagenesis; integration bias

Year: 2015 PMID: 26442173 PMCID: PMC4588226 DOI: 10.4161/2159256X.2014.992694

Source DB: PubMed Journal: Mob Genet Elements ISSN: 2159-2543

Introduction

DNA integrating elements are parasitic nucleic acids capable of integrating their DNA into the host genome. These elements can be divided into 2 broad categories, viruses and transposons. The integration process is guided by dedicated sequences at the flanks of these elements called terminal repeats. The rest of the sequence is not required for the integration process and can therefore be replaced by genetic material of interest using molecular engineering techniques. In this way, these elements serve as vectors for the delivery of specialized genetic cargo into the cells of interest. This feature of transposons and viruses has made them ideal tools for studying the function of different genetic components such as genes, promoters and enhancers, by integrating these components into the genome and studying their function in the cellular context. DNA integrating elements can also be used to add a new function or restore a defective function in the cell. This provides the basis for their extensive use in gene therapy. The viruses have strong transcriptional enhancers in their long terminal repeats (LTRs) that can activate the expression of endogenous genes in the vicinity of viral integrations. Additionally, viruses and transposons can be engineered to carry specific sequences that can either activate or disrupt the nearby genes, such as enhancers or transcription stop signals. This has made these elements into powerful tools for forward genetic screens, where they are used to identify the function of endogenous genes. An important example of this is the identification of putative oncogenes and tumor suppressor genes from insertional mutagenesis (IM) screens. If an integration activates a proto-oncogene or disrupts a tumor suppressor gene, this can lead to the development of tumors. Mapping the integration loci in resulting tumors and subsequent identification of integration hot spots allows the discovery of the cancer-related genes. 
In addition to forward genetic screens, engineered transposons are increasingly used in the development of functional methods to study mechanisms of gene regulation at a genome-wide level. Retroviruses and transposons do not integrate uniformly across the genome, which limits their usability in molecular biology applications. In this paper, we briefly review the literature on the integration bias of commonly used retroviruses and DNA transposons. We compare the biases between different vectors with a special emphasis on the genetic and epigenetic features of the host cells, which determine these biases. We provide our perspective on how these biases might affect different applications of these vectors. Furthermore, we discuss how the detailed knowledge of the a priori integration bias of these elements can be harnessed to refine some of their applications. Finally, we provide a brief overview of the efforts to control the integration bias of transposons to expand the potential of these tools in molecular biology research.

Chromatin landscapes of integration bias

A common approach for analyzing target site selection is to characterize integration loci in terms of the local genomic and/or chromatin context, and to compare them to randomly chosen control loci. Since integration sites are often retrieved using restriction enzymes, and restriction sites are distributed non-uniformly across the genome, each integration site is typically matched to a number of random control loci, based on the distance to the nearest restriction site. Using this approach, it has been shown that integration preferences differ across different species of retrovirus. Initially, the analyses focused mostly on genomic marks and gene expression. For example, the Murine Leukemia Virus (MuLV) appeared to strongly favor transcription start sites, and the Human and Simian Immunodeficiency Viruses (HIV and SIV) and the Avian Sarcoma-Leukosis Virus (ASLV) favored actively transcribed genes. Later studies provided more elaborate analyses of integration profiles, such as scale-based analyses, as well as detailed analyses of sequence specificity and epigenetic marks. By analyzing the profiles of a wide range of retroviruses (including MuLV and HIV) and 2 transposons (including the Sleeping Beauty (SB) transposon), it was found that sequence specificity could explain integration bias to a substantial degree, and that the conclusions drawn depend on the genomic scale at which the insertions and features are analyzed. Furthermore, in contrast to HIV, MuLV demonstrated a strong preference for DNase I hypersensitive sites and regions rich in transcription factor binding site motifs. Moreover, its bias appeared to be mostly determined by the MuLV-specific integrase and the enhancer in the LTR. HIV integrations were associated with epigenetic marks such as H3K36me3, consistent with its reported bias for transcriptionally active genes. These and other studies on retroviral integration biases have been reviewed extensively elsewhere.
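The matching of control loci described above can be illustrated with a minimal sketch. It assumes a single chromosome, a non-empty list of restriction site coordinates, and exact distance matching; the function names and parameters are hypothetical and are not taken from any of the studies discussed.

```python
import bisect
import random


def distance_to_nearest(sorted_sites, pos):
    """Distance (bp) from pos to the nearest coordinate in sorted_sites."""
    i = bisect.bisect_left(sorted_sites, pos)
    candidates = []
    if i < len(sorted_sites):
        candidates.append(sorted_sites[i] - pos)
    if i > 0:
        candidates.append(pos - sorted_sites[i - 1])
    return min(candidates)


def matched_controls(integration, restriction_sites, chrom_length,
                     n_controls=3, rng=None, max_tries=10_000):
    """Draw control loci at the same distance from a restriction site
    as the observed integration (illustrative assumption: exact match)."""
    rng = rng or random.Random(0)
    sites = sorted(restriction_sites)
    target = distance_to_nearest(sites, integration)
    controls = []
    tries = 0
    while len(controls) < n_controls and tries < max_tries:
        tries += 1
        # Place a candidate at the target distance from a random site.
        candidate = rng.choice(sites) + rng.choice((-1, 1)) * target
        if not 0 <= candidate < chrom_length or candidate == integration:
            continue
        # Reject candidates that happen to lie closer to another site.
        if distance_to_nearest(sites, candidate) == target:
            controls.append(candidate)
    return controls


sites = [100, 1000, 5000, 9000]
controls = matched_controls(1030, sites, chrom_length=10_000)
```

Features of interest can then be compared between the integrations and their matched controls, so that any bias introduced by the recovery method affects both sets equally.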
Recently, with the explosive growth of available epigenomics datasets, attention has shifted more and more to the epigenetic determinants of target site selection. In a comparison of 3700000 MuLV integration sites in K562 cells with the corresponding ENCODE data, the previously reported bias of MuLV for regulatory elements such as enhancers and promoters was confirmed. An especially broad study characterized integration bias of a wide range of retroviruses, MuLV, HIV, ASLV, Porcine Endogenous Retrovirus (PERV), Xenotropic Murine leukemia virus-related Virus (XMRV), Human T-lymphotropic Virus (HTLV), and Foamy Virus (FV) with respect to histone modifications and transcription factor binding as determined by ChIP-seq. Strong association was observed of MuLV, PERV, and XMRV with STAT1, H3/H4 acetylation, and H2AZ/H3K4/K9 methylation. For MuLV specifically, by combining different ChIP-seq data sets, a supermarker was constructed that was present within 2 kb of 75% of the insertion sites. Compared to MuLV, the integration bias of the Mouse Mammary Tumor Virus (MMTV), another retrovirus that is commonly used in IM, is far less extensively studied. Its integration profile was suggested to be the most random across retroviruses, as no preferences could be demonstrated with respect to genes and CpG islands. Based on a large dataset of ∼180000 MMTV integrations, we recently demonstrated that biases with respect to genes and CpG islands in fact do exist, but are very weak. As an interesting exception to its generally weak bias, we showed that MMTV did have a strong preference for integrating near the interface between topological domains and their boundary regions. Thus, MMTV integration target selection cannot be considered as uniformly random across the genome. Compared to many retroviruses, transposon integration biases have been less well characterized. Two main systems used in molecular biology are the Sleeping Beauty and the piggyBac (PB) transposons. 
SB integrates almost exclusively in TA dinucleotides. Apart from this highly specific recognition sequence, SB did not seem strongly biased. However, in our recent study, based on a much larger number of integrations (∼120000), we found that SB has a strong preference for genes, and integrates almost uniformly across gene bodies (Figure 1).

Figure 1.

Schematic overview of integration bias with respect to genes for 4 different DNA integrating elements. Adapted from.

PB integrates almost exclusively in TTAA sites, and biases were demonstrated with respect to CpG islands, transcription start sites and actively transcribed loci. Interestingly, PB integration profiles are highly similar to those of MuLV (Figure 1). Across SB, PB and MMTV, we recently identified topological domain boundary interfaces as shared integration hotspots. Furthermore, based on a comparison with ∼80 publicly available (epi)genomics data sets in the same cell type, we demonstrated that target site selection is directed at multiple genomic scales. At a large scale, it is directed by macrofeatures, i.e., domain-oriented features that are shared between systems, such as expression of proximal genes, proximity to CpG islands and genic features, chromatin compaction and replication timing. At smaller scales, target site selection is directed by microfeatures, i.e., a diverse range of (epi)genomic features, which are generally less domain-oriented and can differ across systems.

The impact of integration bias on applications

As was briefly outlined in the introduction, the integration biases of retroviruses and transposons can pose problems for the applicability of these elements in many areas of molecular biology. In this section we will go into more detail regarding 3 areas of application, 1) cancer gene discovery through insertional mutagenesis (IM) screens, 2) studying the chromatin position effect, i.e. the influence of the genomic location of a genetic unit on its activity, and 3) gene therapy.

Cancer discovery through IM screens

The analysis and interpretation of IM data can be confounded by the a priori integration bias of the DNA integrating elements used for IM. In IM, putative cancer genes are identified by detecting genomic regions that are recurrently integrated. These genomic regions are called common integration sites (CISs). Identification of CISs is generally done under the assumption that all regions of the genome have an equal probability of hosting an integration event. This can lead to spurious CISs, where integration hot spots may be caused merely by passenger integration events rather than tumor-induced selective pressure (Figure 2). In one study, this problem was addressed by assuming that a true CIS gene should harbor significantly more integrations than its flanking genes. In this way, 3 out of 9 CISs in this study were marked as false positive. Other studies have compared control datasets of integrations that were subjected to minimal selective pressure to integration loci retrieved from tumors. One such study proposed a 47% false positive rate for its MuLV screen. Another study, using SB, found 6 CISs in its control data set, whereas 79 CISs could be found in the tumor screen. We recently compared the integrations from 3 different IM screens utilizing PB, SB and MMTV with the integration profiles of these vectors obtained under unselected conditions. The analysis showed that a substantial fraction of CISs (7%–33%) overlap with the integration hot spots and are therefore likely not related to the process of tumor development. The integration bias was especially strong for CIS regions far from endogenous genes. This warrants higher statistical stringency when calling CISs in gene-distant regions of the genome.

Figure 2.

Integration bias can give rise to spurious Common Integration Sites (CISs). Hypothetical integration profile of a tumor screen ("Selected" in blue) and corresponding unselected background integrations ("Unselected" in red). In a typical effort for retrieving cancer genes from an IM screen, a genomic region is called as a CIS if the local integration density exceeds a certain threshold ("CIS threshold"; black dotted line). However, some of these regions may also reflect an a priori integration bias.

Additionally, the use of restriction enzymes for retrieving integration sites could potentially impact CIS calling. To overcome these biases, a method has been developed that can retrieve integration sites by random shearing of DNA. Further, a method for calling CISs based on a Poisson distribution has been used to computationally address restriction site bias. Depending on the position of integrations relative to endogenous genes and the homogeneity of their orientation, CISs can have either an activating or a repressing influence on their target genes, which makes it possible to distinguish between putative oncogenes and tumor suppressor genes. In this way, PB was shown to be more efficient at finding oncogenes, whereas SB would be a better tool for mining tumor suppressor genes. These observations highlight the significance of generating large integration datasets under non-selective conditions in order to refine and prioritize CISs for downstream validation studies.
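To make the effect of the uniformity assumption concrete, the following sketch calls CISs in fixed genomic windows with a Poisson upper-tail test, once under a uniform background and once with the expected count scaled by an unselected integration profile. This is an illustration under stated assumptions, not the method of any study cited here: the windowing, the pseudocount, the Bonferroni correction and all function names are hypothetical choices.

```python
import math


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), by direct summation of the CDF."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)
    cdf = term
    for i in range(1, k):
        term *= lam / i
        cdf += term
    return max(0.0, 1.0 - cdf)


def call_cis(selected_counts, background_counts, alpha=0.05, pseudocount=1.0):
    """Return, per window, (called under uniform null, called under biased null)."""
    n_sel = sum(selected_counts)
    n_win = len(selected_counts)
    bg_total = sum(background_counts) + pseudocount * n_win
    threshold = alpha / n_win  # Bonferroni correction across windows
    calls = []
    for sel, bg in zip(selected_counts, background_counts):
        # Uniform null: every window is equally likely to host an integration.
        lam_uniform = n_sel / n_win
        # Biased null: expectation follows the unselected integration profile.
        lam_biased = n_sel * (bg + pseudocount) / bg_total
        calls.append((poisson_upper_tail(sel, lam_uniform) < threshold,
                      poisson_upper_tail(sel, lam_biased) < threshold))
    return calls


selected = [2, 1, 30, 2, 1, 2, 25, 1, 2, 2]            # tumor screen
unselected = [10, 10, 300, 10, 10, 10, 10, 10, 10, 10]  # background profile
calls = call_cis(selected, unselected)
```

In this toy input, the third window is a hot spot in both the selected and the unselected data, so it is called only under the uniform null, whereas the seventh window is enriched in the selected data alone and is called under both.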

Studying chromatin position effects

Recently, we presented the TRIP (short for Thousands of Reporters Integrated in Parallel) technology, which depends on mobile genetic elements to study the chromatin position effect in a high-throughput manner. A PB construct with a reporter gene was randomly integrated into the genome, and the expression of individual reporter genes was tracked using barcode technology. However, when integrating into the genome, PB shows substantial biases (Section Chromatin landscapes of integration bias). Given these integration biases, one may wonder how much they matter when computing associations of reporter gene expression with (epi)genomic features. For example, in the TRIP study we computed the association of reporter gene expression with a number of binarized (epi)genomic features, such as lamina-associated domains (LADs). It is known that there is an integration bias against LADs, i.e., there are relatively few integrations within LADs. To demonstrate the influence of this bias on the association of reporter gene expression with LADs we ran a simple simulation. We randomly generated 10,000 integrations in silico, and distributed these in an increasingly uneven fashion across 2 classes, e.g., LADs and inter-LADs (iLADs), from completely even (i.e., 5000 in one class and 5000 in the other) to highly uneven (i.e., 9998 in one class and 2 in the other). For each integration, depending on its class, we simulated an expression value by sampling from a class-specific expression distribution, i.e., a normal distribution with mean 0.1 and standard deviation 1 for class 1, and a normal distribution with mean 0 and standard deviation 1 for class 2. Then, for each distribution, we performed Welch's t-test to distinguish between the 2 classes. The results of the simulation are shown in Figure 3.
It shows 2 measures, 1) the statistical significance of the t-test, expressed as a z-normalized t-statistic, and 2) the effect size, expressed as the difference in mean reporter gene expression between the 2 classes. As could be expected, it clearly illustrates that with an increasingly uneven distribution, the expected effect size remains the same. However, the variance in the effect size increases, and with it the statistical significance reduces. In other words, given a certain distribution of a number of integrations across 2 classes, a more asymmetric distribution will require a larger total number of integrations to detect a certain effect size as statistically significant. Based on our TRIP data, we computationally estimated that approximately 120 PB integrations in total would be sufficient to distinguish between LADs and iLADs in terms of PGK-driven reporter gene expression (in 95% of cases, at a significance level of 5%; data not shown).
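The simulation described above can be approximated with a short sketch using only the Python standard library. The class means (0.1 and 0), the standard deviation (1) and the total of 10,000 integrations follow the text; the random seed, the function names and the use of the raw Welch t-statistic (rather than the z-normalized statistic reported in the figure) are assumptions made for illustration.

```python
import math
import random
import statistics


def welch_t(x, y):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    se = math.sqrt(statistics.variance(x) / len(x)
                   + statistics.variance(y) / len(y))
    return (statistics.fmean(x) - statistics.fmean(y)) / se


def simulate(n_total=10_000, n_class2=5_000, mean1=0.1, mean2=0.0,
             sd=1.0, rng=None):
    """Return (observed effect size, Welch t) for one simulated data set."""
    rng = rng or random.Random(0)
    class1 = [rng.gauss(mean1, sd) for _ in range(n_total - n_class2)]
    class2 = [rng.gauss(mean2, sd) for _ in range(n_class2)]
    effect = statistics.fmean(class1) - statistics.fmean(class2)
    return effect, welch_t(class1, class2)


def expected_z(n_total, n_class2, delta=0.1, sd=1.0):
    """Expected standardized difference when the variance is known."""
    n1 = n_total - n_class2
    return delta / (sd * math.sqrt(1 / n1 + 1 / n_class2))


# From an even split to a highly uneven one.
for n2 in (5000, 1000, 100, 10, 2):
    effect, t = simulate(n_class2=n2)
    print(n2, round(effect, 3), round(t, 2), round(expected_z(10_000, n2), 2))
```

With a known standard deviation of 1, the standard error of the difference in means is sqrt(1/n1 + 1/n2). For the even split this is sqrt(2/5000) = 0.02, giving an expected standardized difference of 0.1/0.02 = 5, whereas for the 9998 versus 2 split it is about 0.71, giving an expected value of about 0.14, well below any conventional significance threshold.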

Figure 3.

Integration bias reduces statistical power in TRIP applications. 10,000 integrations were generated in silico, and distributed across 2 classes (Class 1 and Class 2) in an increasingly uneven fashion. Depending on the assigned class, reporter gene expression was simulated by drawing from a class-specific distribution. Then, (A) the significance of the difference between the 2 classes was determined by Welch's t-test as a function of the size of Class 2 (dashed gray line: 2-sided 5% significance threshold; red solid line: theoretical expected value of the z-normalized t-statistic), and (B) the effect size was determined as the difference in means between the 2 classes as a function of the size of Class 2 (red solid lines: theoretical expected value and standard deviation of the sample distribution of the difference). The x-axis represents the size of Class 2, which indicates how uneven the distribution across the 2 classes is.

It is not equally straightforward for all questions to determine the (lack of) influence of integration bias. In these cases, it may be necessary to regularize the genome-wide integration profile. We provided one such example when inferring PGK domains reflecting genome-wide domains of transcriptional permissiveness, using a hidden Markov model (HMM). Since inferring a standard HMM assumes equidistant spacing of integrations on the genome, we asked to what extent integration bias affected the eventual domain calling. For this purpose, a non-homogeneous HMM was additionally inferred, with the HMM transition probabilities depending on the distance between integrations. The domains inferred using both approaches were highly similar. In conclusion, while it should always be kept in mind when interpreting TRIP results that integration is random but biased, the impact of these biases on the results often seems limited.
However, an important drawback of integration bias is that it reduces statistical power, which can be regained by generating a larger data set of integrations.
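The idea of a non-homogeneous HMM, in which transition probabilities depend on the distance between consecutive integrations, can be sketched as follows. The source does not specify the functional form of this dependence, so the exponential decay toward a stationary distribution used here is an assumption, as are the Gaussian emissions, the parameter values and all function names.

```python
import math


def safe_log(p):
    """Logarithm that maps zero probability to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def transition_matrix(distance, stationary, decay_length):
    """Distance-dependent transitions: identity at distance 0,
    rows equal to the stationary distribution at large distances."""
    keep = math.exp(-distance / decay_length)
    n = len(stationary)
    return [[stationary[j] * (1.0 - keep) + (keep if i == j else 0.0)
             for j in range(n)] for i in range(n)]


def log_normal_pdf(x, mean, sd):
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def viterbi(positions, values, means, sds, stationary, decay_length):
    """Most likely state path given reporter values at sorted positions."""
    states = range(len(means))
    score = [safe_log(stationary[s]) + log_normal_pdf(values[0], means[s], sds[s])
             for s in states]
    back = []
    for t in range(1, len(values)):
        trans = transition_matrix(positions[t] - positions[t - 1],
                                  stationary, decay_length)
        new_score, pointers = [], []
        for s in states:
            prev = max(states, key=lambda p: score[p] + safe_log(trans[p][s]))
            pointers.append(prev)
            new_score.append(score[prev] + safe_log(trans[prev][s])
                             + log_normal_pdf(values[t], means[s], sds[s]))
        score = new_score
        back.append(pointers)
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


positions = [0, 100, 200, 1_000_000, 1_000_100, 1_000_200]
values = [2.1, 1.9, 2.0, -0.1, 0.1, 0.0]
path = viterbi(positions, values, means=[0.0, 2.0], sds=[0.5, 0.5],
               stationary=[0.5, 0.5], decay_length=10_000)
```

A standard HMM corresponds to the special case in which all distances are equal, so that the same transition matrix is used at every step. In the non-homogeneous version, closely spaced integrations are strongly encouraged to share a state, while integrations separated by large gaps are treated as nearly independent.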

Gene therapy

Another important area of research where DNA integrating elements are of great use is gene therapy. Retroviruses and transposons are extensively used in ex vivo gene therapy as a molecular vehicle for introducing a therapeutic gene into cells with genetic defects. For this purpose, sustained expression of the introduced gene is desirable. However, this comes with many complications depending on the site of integration of the vector carrying the therapeutic gene. For example, initial gene therapy trials using MuLV showed that viruses integrated in the proximity of proto-oncogenes led to the formation of tumors in some of the treated patients. The reason for this is that many MuLV integrations occur in the vicinity of endogenous genes, and more specifically near transcription start sites. The use of DNA integrating elements that preferably integrate away from endogenous genes can potentially circumvent this problem. Unfortunately, there are currently no such elements with a distinct preference for integrating away from genes (Section Chromatin landscapes of integration bias). Some insight into this type of bias can be gained by studying large datasets of integrations generated under minimal selective pressure. When considering the bias of 3 currently used integrating elements, one can see that SB has a higher proportion of integrations landing more than 5 kb away from endogenous genes compared to PB (Figure 4). This means that the chance of gene disruption is comparatively smaller when using SB for gene therapy. Note that, when only considering integration with respect to genes, MMTV would be even less likely to disrupt endogenous genes, as MMTV has a mild bias against integrating near genes. However, its limited tropism would restrict its potential use in gene therapy. Hitting cancer-related genes cannot be completely ruled out with any of the integration vectors.
This risk can be substantially reduced by genetically engineering gene therapy vectors capable of integrating the transgene at specific loci away from any endogenous genes (see below).

Figure 4.

Bias for genic regions of 3 DNA integrating elements. (A) For 3 DNA integrating elements, integrations were counted within and outside 5 kb from genes, and p-values were determined by 2-sided binomial tests. (B) For comparison, the same analysis was done for 3 sets of matched random controls. Refer to for a detailed description of how these controls were generated.
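The counting and testing procedure described in this caption can be sketched as below. The caption does not state the null proportion used in the binomial test, so the parameter `p0` (for example, the fraction of matched random controls within 5 kb of a gene) is an assumption, as are the single-chromosome gene model and the function names.

```python
import math


def near_gene(pos, genes, max_dist=5_000):
    """True if pos lies within max_dist bp of any (start, end) gene interval."""
    return any(start - max_dist <= pos <= end + max_dist for start, end in genes)


def binom_log_pmf(k, n, p):
    """Log of the binomial probability mass, stable for large n (0 < p < 1)."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))


def binom_two_sided(k, n, p0):
    """Exact 2-sided binomial p-value: total probability of all outcomes
    that are no more likely than the observed one under the null."""
    observed = binom_log_pmf(k, n, p0)
    total = sum(math.exp(lp) for lp in
                (binom_log_pmf(i, n, p0) for i in range(n + 1))
                if lp <= observed + 1e-7)
    return min(1.0, total)


def genic_bias(integrations, genes, p0, max_dist=5_000):
    """Return (count near genes, count away from genes, 2-sided p-value)."""
    near = sum(near_gene(pos, genes, max_dist) for pos in integrations)
    return near, len(integrations) - near, binom_two_sided(near, len(integrations), p0)


genes = [(10_000, 20_000), (100_000, 120_000)]
integrations = [6_000, 12_000, 24_000, 50_000, 99_000, 110_000, 126_000, 300_000]
near, away, p_value = genic_bias(integrations, genes, p0=0.5)
```

Running the same function on the matched random controls gives the comparison shown in panel B; a difference between the two proportions then reflects the integration preference of the element rather than the recovery method.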

Future perspectives

As has been outlined above, the bias in the integration profile of DNA integrating elements can be an impediment to realizing the full potential of many applications of these elements such as gene therapy, forward genetic screens and massively parallel chromatin sensor assays such as TRIP. One way to circumvent the bias issue is to genetically modify the integration behavior of these elements, for example by redirecting the integration of these elements to gene-poor regions of the genome. Lentiviral integrations could be directed to heterochromatic regions by fusing the integrase binding domain of host cell encoded LEDGF (involved in the integration of lentiviruses) to CBX1β, which binds to heterochromatin. Blocking the activity of BET proteins, which are cellular binding partners of MuLV integrase, reduces the strong preference of MuLV for endogenous promoters. Along similar lines, the bias of the transposons can be altered by genetically engineering the transposases. Attempts at this have already been made by fusing transposase to the adeno-associated virus Rep protein, zinc finger modules targeting specific sequences and custom transcription activator like effector DNA-binding domains. Until now these efforts have yielded only limited success. It is however foreseeable that emerging DNA targeting technologies such as the CRISPR-Cas9 system, as well as a deeper understanding of the mechanism of action of transposases, will lead to the engineering of more effective transposition systems. These systems would be capable of precisely targeting the integrations to safe but nonetheless transcriptionally permissive loci of the genome. Such magic transposons will not only make gene therapeutic approaches safer and more controllable, but will also be valuable in studying the chromatin landscape of genomic regions of interest with TRIP-like approaches, at an unprecedented resolution.
