Matthew C LaFave, Gaurav K Varshney, Derek E Gildea, Tyra G Wolfsberg, Andreas D Baxevanis, Shawn M Burgess.
Abstract
Retroviruses integrate into the host genome in patterns specific to each virus. Understanding the causes of these patterns can provide insight into viral integration mechanisms, pathology and genome evolution, and is critical to the development of safe gene therapy vectors. We generated murine leukemia virus integrations in human HepG2 and K562 cells and subjected them to second-generation sequencing, using a DNA barcoding technique that allowed us to quantify independent integration events. We characterized >3,700,000 unique integration events in two ENCODE-characterized cell lines. We find that integrations were most highly enriched in a subset of strong enhancers and active promoters. In both cell types, approximately half the integrations were found in <2% of the genome, demonstrating genomic influences even narrower than previously believed. The integration pattern of murine leukemia virus appears to be largely driven by regions that have high enrichment for multiple marks of active chromatin; the combination of histone marks present was sufficient to explain why some strong enhancers were more prone to integration than others. The approach we used is applicable to analyzing the integration pattern of any exogenous element and could be a valuable preclinical screen to evaluate the safety of gene therapy vectors.
Year: 2014 PMID: 24464997 PMCID: PMC3985626 DOI: 10.1093/nar/gkt1399
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. GeIST integration site mapping workflow. The GeIST workflow is available to download from http://research.nhgri.nih.gov/software/GeIST/. The details of each step are covered in the ‘Integration Site Mapping’ portion of the ‘Materials and Methods’ section. When the effect of a ‘no’ response is not explicitly stated, it is implied that reads failing to meet the criteria are removed from the analysis.
ENCODE files used in enrichment analysis
| File/data set | GEO ID | Analysis |
|---|---|---|
| Analysis files | | |
| wgEncodeBroadHmmK562HMM.bed | GSM936088 | State segmentation |
| wgEncodeBroadHmmHepg2HMM.bed | GSM936090 | State segmentation |
| wgEncodeOpenChromDnaseHepg2Pk.narrowPeak | GSM816662 | DNAse sensitivity (HepG2) |
| wgEncodeOpenChromDnaseK562PkV2.narrowPeak | GSM816655 | DNAse sensitivity (K562) |
| wgEncodeGisChiaPetK562Pol2InteractionsRep1.bed | GSM970213 | DNA–DNA interaction |
| wgEncodeGisChiaPetK562Pol2InteractionsRep2.bed | GSM970213 | DNA–DNA interaction |
| wgEncodeBroadHistoneHepg2CtcfStdPk.broadPeak | GSM733645 | Chromatin mark enrichment (HepG2) |
| wgEncodeBroadHistoneHepg2H3k04me1StdPk.broadPeak | GSM798321 | Chromatin mark enrichment (HepG2) |
| wgEncodeBroadHistoneHepg2H3k4me2StdPk.broadPeak | GSM733693 | Chromatin mark enrichment (HepG2) |
| wgEncodeBroadHistoneHepg2H3k4me3StdPk.broadPeak | GSM733737 | Chromatin mark enrichment (HepG2) |
| wgEncodeBroadHistoneHepg2H3k9acStdPk.broadPeak | GSM733638 | Chromatin mark enrichment (HepG2) |
| wgEncodeBroadHistoneHepg2H3k27acStdPk.broadPeak | GSM733743 | Chromatin mark enrichment (HepG2) |
| wgEncodeBroadHistoneHepg2H3k27me3StdPk.broadPeak | GSM733754 | Chromatin mark enrichment (HepG2) |
| wgEncodeBroadHistoneHepg2H3k36me3StdPk.broadPeak | GSM733685 | Chromatin mark enrichment (HepG2) |
| wgEncodeBroadHistoneHepg2H4k20me1StdPk.broadPeak | GSM733694 | Chromatin mark enrichment (HepG2) |
| wgEncodeBroadHistoneK562CtcfStdPk.broadPeak | GSM733719 | Chromatin mark enrichment (K562) |
| wgEncodeBroadHistoneK562H3k4me1StdPk.broadPeak | GSM733692 | Chromatin mark enrichment (K562) |
| wgEncodeBroadHistoneK562H3k4me2StdPk.broadPeak | GSM733651 | Chromatin mark enrichment (K562) |
| wgEncodeBroadHistoneK562H3k4me3StdPk.broadPeak | GSM733680 | Chromatin mark enrichment (K562) |
| wgEncodeBroadHistoneK562H3k9acStdPk.broadPeak | GSM733778 | Chromatin mark enrichment (K562) |
| wgEncodeBroadHistoneK562H3k27acStdPk.broadPeak | GSM733656 | Chromatin mark enrichment (K562) |
| wgEncodeBroadHistoneK562H3k27me3StdPk.broadPeak | GSM733658 | Chromatin mark enrichment (K562) |
| wgEncodeBroadHistoneK562H3k36me3StdPk.broadPeak | GSM733714 | Chromatin mark enrichment (K562) |
| wgEncodeBroadHistoneK562H4k20me1StdPk.broadPeak | GSM733675 | Chromatin mark enrichment (K562) |
Figure 2. Integration pattern near LMO2 in K562 and HepG2 cells. (A) Integrations in K562 cells. Bars indicate the sum of unique integration events color-coded to match the genomic state in which they integrated (left axis); there are 471 events represented in this ∼110-kb span. Bar tops are highlighted with a black line and a dot for visibility. The right axis indicates the rate of integration in a 1-kb sliding window for both experimental integrations (blue) and a representative in silico random control (gray). The location of LMO2 is represented in the lower right of the figure. Approximately 0.15% of the 315 810 integrations scored landed in this LMO2 interval, which comprises 0.0036% of the genome. The colored track below the integrations is the chromatin state segmentation track (21); colors correspond to different states, as indicated. (B) Integrations in HepG2 cells. The scales are the same as in (A). There are 19 integrations in this region. Experimental integrations are significantly depleted relative to random control integrations in this region (bootstrapping; P < 0.0001). Note that the number of random integrations per kilobase tends to be higher for HepG2 than K562 cells. This is because there are >10-fold as many integrations in HepG2 relative to K562, resulting in a corresponding increase in the number of random control integrations. This figure clearly shows the cell-type-specific differences in the raw number of experimental integrations in the LMO2 region between K562 and HepG2 cells. These differences can likely be attributed to the presence of active LMO2 enhancers and promoters in K562 cells and their absence in HepG2 cells.
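The proportions quoted in the Figure 2 caption can be checked with a few lines of arithmetic. The sketch below uses only the numbers given in the caption. The "fold over uniform" figure is an illustrative simplification that assumes integrations would otherwise be spread evenly across the genome; it is not the matched random control comparison used in the study.

```python
# Numbers quoted in the Figure 2 caption (K562 cells).
events_in_interval = 471       # unique integration events in the LMO2 interval
total_events = 315_810         # integrations scored
genome_fraction_pct = 0.0036   # percent of the genome covered by the interval

# Share of all scored integrations that fall in the interval, in percent.
observed_pct = 100 * events_in_interval / total_events

# Assumption: a uniform expectation equal to the interval's share of the genome.
fold_over_uniform = observed_pct / genome_fraction_pct

print(f"{observed_pct:.3f}% of integrations, "
      f"about {fold_over_uniform:.0f}-fold over a uniform expectation")
```

Under that uniform assumption the interval holds roughly 40 times as many integrations as its size alone would predict.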
Figure 3. Enrichment of integrations in chromatin segmentation states in HepG2 cells. (A) The mean value of 10 000 ratios of experimental integration versus in silico random integration; error bars represent the standard deviation of these ratios. The dotted line separates entries with more integrations than expected by chance from those with fewer. The experimental integration counts in all states are significantly different from random, as determined by bootstrapping (significance threshold = 0.0017; all differences from random have P < 0.0001). The enrichment values in each state are all significantly different from each other, as determined by analysis of variance (P < 2 × 10⁻¹⁶) and Tukey’s multiple comparisons of means (all pairs differ with adjusted P < 10⁻⁷). (B) Percentage observed frequency of chromatin marks for each of the 15 states across the genome in HepG2 cells [modified from Ernst et al. (21)]. Darker blue cells indicate a higher observed frequency than lighter cells. The states are sorted by mean enrichment versus random; the horizontal dashed line separates states with more integration sites than expected by chance from states with as many or fewer, and the chromatin marks to the right of the vertical dashed line are most associated with strong enhancers. Numerical values for this table are in Ernst et al. (21). Txn, transcription; lo, low signal; CNV, copy number variation.
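The enrichment value described in panel (A) is a mean of ratios, one ratio per random control data set, with the standard deviation of those ratios as the error bar. A minimal sketch of that summary follows. The function name `enrichment_summary` and the toy counts are hypothetical, and the sketch does not reproduce the bootstrapping test itself.

```python
import random
import statistics

def enrichment_summary(experimental_count, random_counts):
    """Return (mean, SD) of experimental/random ratios, one ratio per control set.

    Control sets with zero integrations in the feature are skipped to avoid
    division by zero (an assumption of this sketch, not a documented rule).
    """
    ratios = [experimental_count / c for c in random_counts if c > 0]
    return statistics.mean(ratios), statistics.stdev(ratios)

# Toy data: 10 000 simulated control counts for a single chromatin state.
rng = random.Random(0)
random_counts = [rng.randint(90, 110) for _ in range(10_000)]

mean_ratio, sd_ratio = enrichment_summary(500, random_counts)
print(f"enrichment = {mean_ratio:.2f} ± {sd_ratio:.2f}")
```

A mean ratio above 1 corresponds to a bar above the dotted line in the figure, and a ratio below 1 to a bar beneath it.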
Figure 4. Enrichment of integrations in ChIP-seq peaks of chromatin marks in HepG2 cells. We compared experimental integrations and integrations in the 10 000 matched random control data sets to ChIP-seq peaks from the ENCODE project (29). We calculated enrichment as described in the caption of Figure 3. The error bars represent the standard deviation of the enrichment ratio. The dotted line indicates the level of enrichment expected by chance. The experimental integration counts for all marks are significantly different from random (determined as above, by bootstrapping; significance threshold = 0.0028; all differences from random have P < 0.0001). The enrichment values for each mark are all significantly different from each other, as determined by analysis of variance (P < 2 × 10⁻¹⁶) and Tukey’s multiple comparisons of means (all pairs differ with adjusted P < 10⁻⁷).
Figure 5. Comparison of integration enrichment across various genomic features in HepG2 cells. We compared the enrichment at features that are associated with MLV integration to measure their ability to explain non-random integration. We calculated enrichment as described in the caption of Figure 3. The error bars represent the standard deviation of the enrichment ratio, and the dotted line indicates the level of enrichment expected by chance. The enrichment values for each feature are all significantly different from random (determined as above, by bootstrapping; significance threshold = 0.00625; all differences from random have P < 0.0001), and different from each other, as determined by analysis of variance (P < 2 × 10⁻¹⁶) and Tukey’s multiple comparisons of means (all pairs differ with adjusted P < 10⁻⁷). Labels: state 4, the state 4 strong enhancers; DNAse, DNAse-sensitive regions; TSS, regions within 5 kb of a transcription start site; sequence, sites matching the TNVNNBNA motif.
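The "sequence" feature refers to sites matching the TNVNNBNA motif, which is written in IUPAC nucleotide codes (N, any base; V, A, C or G; B, C, G or T). One way to test a site against such a motif is to expand it into a regular expression, as sketched below. The helper names are hypothetical, and the sketch checks only the strand it is given; how strand was handled in the original analysis is not stated in this caption.

```python
import re

# IUPAC codes needed for this motif, expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "[ACGT]",  # any base
    "V": "[ACG]",   # not T
    "B": "[CGT]",   # not A
}

def motif_to_regex(motif):
    """Compile an IUPAC motif string into a regular expression."""
    return re.compile("".join(IUPAC[code] for code in motif))

MLV_MOTIF = motif_to_regex("TNVNNBNA")

def matches_motif(site_sequence):
    """True if the whole 8-base sequence fits the motif (case-insensitive)."""
    return MLV_MOTIF.fullmatch(site_sequence.upper()) is not None

print(matches_motif("TAACCCGA"))  # third base A satisfies V, sixth base C satisfies B
print(matches_motif("TATCCCGA"))  # third base T violates V
```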
A 20-kb window was used to identify the regions with the most integrations; the top 10 regions are shown
| Chromosome | Start (base pairs) | End (base pairs) | Integrations | Mean random integrations ± SD | Nearby genes |
|---|---|---|---|---|---|
| HepG2 integration hotspots | | | | | |
| chr14 | 31720463 | 31740463 | 2438 | 19.60 ± 4.41 | Downstream of HEATR5A |
| chr14 | 31492094 | 31512094 | 2123 | 26.75 ± 5.23 | STRN3 and AP4S1 |
| chr20 | 48290780 | 48310780 | 1841 | 26.18 ± 5.05 | B4GALT5 |
| chr20 | 46079011 | 46099011 | 1834 | 16.65 ± 4.07 | Upstream of NCOA3 |
| chr19 | 47597773 | 47617773 | 1825 | 19.44 ± 4.43 | ZC3H4 |
| chr20 | 371219 | 391219 | 1731 | 17.46 ± 4.21 | Downstream of TRIB3, upstream of RBCK1 |
| chr20 | 32916489 | 32936489 | 1634 | 15.94 ± 3.99 | Upstream of AHCY and ITCH |
| chr12 | 50915923 | 50935923 | 1623 | 21.62 ± 4.69 | DIP2B |
| chr16 | 15721516 | 15741516 | 1616 | 26.97 ± 5.13 | KIAA0430 and NDE1 |
| chr12 | 69178983 | 69198983 | 1598 | 22.15 ± 4.72 | Upstream of LOC100130075 and MDM2 |
The start and end points are presented as they would be in a BED file (0-based start). Mean random integrations ± SD indicates the mean value and standard deviation of the integration count over 10 000 matched random control data sets, and represents the number of integrations expected by chance. Genes with at least part of one RefSeq transcript within the 20-kb window are displayed in the final column; nearby downstream genes that are outside the window are also indicated.
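A table like this can be produced by counting integrations in 20-kb windows and ranking the windows by count. The sketch below is one possible approach, not the authors' procedure: it anchors a half-open, 0-based window (as in BED) at each integration site, and the function name `densest_windows` and the example coordinates are hypothetical. Windows anchored at neighbouring sites overlap, so a real analysis would also need a rule for merging or de-duplicating them.

```python
from bisect import bisect_left
from collections import defaultdict

WINDOW = 20_000  # window width in base pairs

def densest_windows(sites, window=WINDOW, top=10):
    """Rank windows [start, start + window) anchored at each integration site.

    `sites` is an iterable of (chromosome, 0-based position) pairs.
    Returns up to `top` tuples of (count, chromosome, start, end).
    """
    by_chrom = defaultdict(list)
    for chrom, pos in sites:
        by_chrom[chrom].append(pos)

    results = []
    for chrom, positions in by_chrom.items():
        positions.sort()
        for i, start in enumerate(positions):
            end = start + window
            # Index of the first site at or beyond the window end.
            j = bisect_left(positions, end)
            results.append((j - i, chrom, start, end))

    # Highest count first; ties broken by chromosome name and start.
    results.sort(key=lambda r: (-r[0], r[1], r[2]))
    return results[:top]

example_sites = [("chr1", 100), ("chr1", 150), ("chr1", 30_000), ("chr2", 5)]
print(densest_windows(example_sites, top=1))
```

Running the same function over each random control data set would give the per-window counts needed for a "mean random integrations ± SD" column.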
Figure 6. Bias of primary sequence at MLV integration sites in HepG2 cells. We ascertained the sequence of experimental and random in silico inserts with ‘BEDTools getfasta’, taking strand orientation into account (26). (A) Base composition of random control integrations. Integration occurs in positions 1–4 (black box); positions are relative to the 5′ base of the integration site. The values are the mean proportion of the base in question at that position across 9 920 random control data sets (80 random controls contained a site in which the 5 bp flanking sequence extended off the end of the chromosome; these controls were removed). (B) Base composition of experimental integrations. All proportions are significantly different from random (bootstrapping; significance threshold = 0.0004; P < 0.0001 for each). The values that differ from the mean random value by ≥10% are highlighted (green, 10% more than random; magenta, 10% less; gray, not significantly different from random).
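The per-position base composition shown in this figure can be tabulated from equal-length sequences, such as those returned by ‘BEDTools getfasta’. The sketch below is a minimal illustration with hypothetical names and toy sequences. It assumes the sequences are already strand-corrected and aligned to the integration site, and it does not reproduce the significance test or the 10% highlighting rule.

```python
from collections import Counter

BASES = "ACGT"

def base_proportions(sequences):
    """Return one dict per position giving the proportion of A, C, G and T.

    All sequences must have the same length; other symbols (such as N)
    count toward the denominator but are not reported.
    """
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")

    table = []
    for i in range(length):
        counts = Counter(seq[i].upper() for seq in sequences)
        table.append({base: counts[base] / len(sequences) for base in BASES})
    return table

toy_sites = ["ACGT", "ACGA", "TCGA", "ACCA"]
for position, proportions in enumerate(base_proportions(toy_sites), start=1):
    print(position, proportions)
```

Applying the same function to each random control data set and averaging the resulting tables would give values comparable to panel (A).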