Integration of new DNA into cellular genomes mediates replication of retroviruses and transposons; integration reactions have also been adapted for use in human gene therapy. Tracking the distributions of integration sites is important to characterize populations of transduced cells and to monitor potential outgrow of pathogenic cell clones. Here, we describe a pipeline for quantitative analysis of integration site distributions named INSPIIRED (integration site pipeline for paired-end reads). We describe optimized biochemical steps for site isolation using Illumina paired-end sequencing, including new technology for suppressing recovery of unwanted contaminants, then software for alignment, quality control, and management of integration site sequences. During library preparation, DNAs are broken by sonication, so that after ligation-mediated PCR the number of ligation junction sites can be used to infer abundance of gene-modified cells. We generated integration sites of known positions in silico, and we describe optimization of sample processing parameters refined by comparison to truth. We also present a novel graph-theory-based method for quantifying integration sites in repeated sequences, and we characterize the consequences using synthetic and experimental data. In an accompanying paper, we describe an additional set of statistical tools for data analysis and visualization. Software is available at https://github.com/BushmanLab/INSPIIRED.
Integration of new DNA into cellular genomes mediates replication of retroviruses and transposons; integration reactions have also been adapted for use in human gene therapy. Tracking the distributions of integration sites is important to characterize populations of transduced cells and to monitor potential outgrow of pathogenic cell clones. Here, we describe a pipeline for quantitative analysis of integration site distributions named INSPIIRED (integration site pipeline for paired-end reads). We describe optimized biochemical steps for site isolation using Illumina paired-end sequencing, including new technology for suppressing recovery of unwanted contaminants, then software for alignment, quality control, and management of integration site sequences. During library preparation, DNAs are broken by sonication, so that after ligation-mediated PCR the number of ligation junction sites can be used to infer abundance of gene-modified cells. We generated integration sites of known positions in silico, and we describe optimization of sample processing parameters refined by comparison to truth. We also present a novel graph-theory-based method for quantifying integration sites in repeated sequences, and we characterize the consequences using synthetic and experimental data. In an accompanying paper, we describe an additional set of statistical tools for data analysis and visualization. Software is available at https://github.com/BushmanLab/INSPIIRED.
Integration of new DNA is important in studies in many fields, including retroviral and transposon replication,1, 2, 3, 4 HIV latency,5, 6, 7 and human gene therapy.8, 9, 10, 11, 12, 13 Distributions of integration sites are not random in the host cell genome but differ among different integrating elements.1, 2, 3, 14, 15 For several cases, tethering of integration complexes to cellular proteins has been shown to influence integration target site selection.1, 2, 3, 16, 17, 18, 19 Genomic alterations resulting from integration can contribute to preferential proliferation or survival of the modified cells. Examples include insertional activation by retroviruses in animal models,1, 4 outgrowth of cells in HIV latency,5, 6, 7 accumulation of endogenous retroviruses evolutionarily in metazoan genomes,1, 2, 20 and outgrowth of specific cell clones during human gene therapy.21, 22, 23, 24, 25, 26, 27, 28 Often, it is useful to track the behavior of cells harboring newly integrated DNA longitudinally using next-generation sequencing.Previously, we and others have carried out sequence-based surveys of integration site distributions, using first Sanger sequencing, then 454/Roche pyrosequencing, and today Illumina sequencing.6, 7, 9, 11, 12, 13, 15, 29, 30, 31, 32, 33, 34 The Illumina platform has the advantages of allowing paired-end sequencing and providing larger data volumes. Several reports have described methods for analysis of these data.9, 29, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48 However, to date, none have taken full advantage of all types of paired reads, dealt comprehensively with integration in repeated sequences, or provided a statistical framework for quantitative inference of cell abundances based on integration site data.Here we adapt statistical approaches reported in three previous publications to management of Illumina paired-end data.49, 50, 51 We first describe optimized biochemical methods for integration site isolation, which achieve the critical criteria of suppressing PCR contamination between samples while sampling randomly from the pool of integrated DNAs. We then describe methods for alignment, data management, and quantification of cell clones based on integration site data. The pipeline accommodates analysis of integration in both single-copy and repeated sequences. We generated synthetic integration sites corresponding to known locations on the human genome and used them in tests to optimize performance of our pipeline, including quantifying the influence of error in sequence determination. Performance was then tested over several datasets ranging from experimental infections to human gene therapy samples, allowing analysis of the influence of repeated sequences on site capture. In our accompanying paper in this issue of Molecular Therapy: Methods & Clinical Development, we describe a suite of analytical tools that draws on the data products described here.
Results
Biochemical Methods for Determining Sequences of Integration Acceptor Sites
Biochemical methods for recovering integration sites from DNA samples are diagrammed in Figure 1, and a detailed protocol is available in the Supplemental Materials and Methods and Tables S1–S8. Initially, isolated genomic DNA containing integration sites is randomly sheared by sonication. DNA linkers are then ligated to the sheared DNA ends. These DNA fragments are used as templates for PCR using one primer complementary to the linker and a second complementary to the end of the integrated DNA. In retroviruses and retroviral vectors, the ends of the integrated DNA correspond to the long terminal repeats (LTRs). Two rounds of PCR with nested primers are used to maximize specificity and recovery of sites from samples with small numbers of proviruses. Illumina sequencing adapters are attached to the DNA primers used for the second round of PCR, so that the PCR products generated contain the terminal sequences needed for sequence analysis on the MiSeq or HiSeq platforms.
Figure 1
Diagram of the Biochemical Method for Isolating and Sequencing Sites of New DNA Integration
Genomic DNA containing an integrated retrovirus or retroviral vector (top) is sheared and DNA linkers are ligated onto the resulting ends (middle). The molecules are then subjected to two rounds of PCR amplification and then Illumina paired-end sequencing (bottom). The color code for sequence elements is summarized on the right. Black spheres indicate DNA 5′ ends. -NH2 indicates an amino-modifier group, which prevents polymerase extension from the modified 3′ end.
The LTR sequences of an integrated provirus or retroviral vector are duplicated at each end of the integrated element—as a result, PCR using a primer complementary to the LTR results in amplification of two DNA products. One contains the desired flanking host DNA, and the second contains an unwanted internal sequence (Figure 2A). We thus developed a blocking oligonucleotide to reduce polymerase extension from the internal fragment. To increase affinity, blocking oligonucleotides were synthesized with multiple bases containing a bridging ring between the 2′ and 4′ positions.53, 54, 55 The blocking oligonucleotide terminates with a 3′ amino modification to inhibit polymerase extension from the blocking oligonucleotide itself (Table S9).
Figure 2
Use of a Locked Oligonucleotide to Block Recovery of the Internal Fragment to Improve Integration Site Yield
(A) Diagram of the method. The BNA-containing blocking oligonucleotide is shown in black; DNA polymerase is shown in gray. Other markings are as in Figure 1. (B) Quantification of recovery of the internal fragment. Twelve replicates were compared for each sample. (C) Increase in yield as a result of use of the locked oligonucleotide blocking primer. Error bars are SD.
In experiments comparing results with and without the blocking oligonucleotides (Figures 2B and 2C), inclusion of the blocking oligo reduced capture of the internal fragment from 42% to 1.6% of sequence reads and increased the average sampling of cellular genomes from 765 cells per replicate to 975 cells per replicate (as measured by SonicAbundance; described below).Given that multiple samples are commonly worked up simultaneously, and batches of samples may be analyzed frequently, PCR contamination between samples can be a severe problem. To suppress PCR contamination, each DNA sample is analyzed using 1 of 96 unique DNA linkers, which are paired with unique complementary PCR primers. Thus, any molecule moving between tubes would bear the wrong linker and thus would not be a substrate for PCR amplification.Each sample is also given a unique 12-nucleotide (nt) error-correcting DNA barcode.56, 57, 58 The combination of specific linker and barcode is rotated for each batch of samples processed, and correct pairing between bar code and linker sequences is required during quality filtering of output sequences (below). For all batches, negative controls are included, which are human DNA specimens lacking integrated vectors. Using these precautions, contamination due to PCR cross-over is rare or eliminated, as indicated by a consistent lack of recovery of integration sites from genomic DNA-only controls that lack integrated vectors.To further mark each unique integration site sequence, each linker is synthesized with a random sequence of 12 nucleotides. Thus, linker ligation attaches a unique “primer ID” to each molecule prior to PCR. These tags provide a potential means of abundance estimation by counting primer IDs, but in practice, this is complicated by PCR recombination (unpublished data). Thus, the main use in our pipeline is tracking possible contamination due to PCR cross-over between replicates by tracking primer IDs.
The SonicAbundance Method
We use the SonicAbundance method to infer the abundance of cell clones from integration site data (Figure 3). Simply counting the number of sequence reads per integration site is known to yield distorted abundance estimates32, 60—for example, shorter molecules amplify more efficiently than longer ones. The SonicAbundance method takes advantage of marks introduced into DNA molecules by sonication and linker ligation prior to the PCR amplification steps.
Figure 3
Estimating Abundance Using the SonicAbundance Method
Cells harboring integrated vectors are shown at the top. One cell clone has expanded to comprise 4/6 cells (flanking DNA colored cyan). DNA is then purified and cleaved and linkers are ligated. Note that the cyan expanded clone is present as four distinct fragment lengths. A stacked bar graph (bottom) summarizes the differences seen based on summing the abundance of different length fragments.
In a DNA sample from cells containing integration sites, an integration site from an expanded cell clone will be found in many cell genomes (Figure 3, top). Fragmentation by sonication followed by linker ligation results in many linkers joined near the integrated provirus from the expanded clone (Figure 3, middle). PCR amplification and paired-end sequencing results in recovery of many different sites of linker ligation near the unique integration site in the expanded clone. Sites of linker ligation are recovered in read 1, and LTR-host junctions are recovered in read 2 (Figure 1). The number of these linker positions is tabulated, providing an abundance score (Figure 3, bottom). For statistical analysis, the estimated abundance needs to be corrected to account for the frequency of identical linker ligation positions generated independently, which occurs with increasing frequency as the numbers of linker positions increases per integration site. Numbers of linker ligation sites are recorded along with integration site positions and uploaded into our IntSiteDB database for analysis.
Processing and Aligning Integration Site Sequence Data
INSPIIRED begins by parsing raw Illumina output files (FASTQ format) using both index and linker sequences. Indexes are based on Golay codes with maximized edit distance, so that up to two errors in the index reads can be unambiguously corrected to recover the read.56, 57, 58 Reads are subsequently trimmed to remove primer and LTR sequences (requiring exact matching to predicted sequences), yielding only genomic sequence data. A problem arises due to mispriming in the human genome, which can yield spurious integration sites. For this reason, we require a perfect match for the LTR segment extending between the 3′ end of the amplification primer to the 5′-CA-3′ sequence that defines the edge of the LTR.Reads are next filtered to remove sequences complementary to the vector or virus used, requiring at least 75% global identity, a value chosen based on results with empirical datasets, and aligning in the first 5 nt of the read. Sequences are aligned to the reference genome using BLAT (parameters for alignment are found in the Materials and Methods).Alignment information is then paired between the reads, and the integration site position and DNA fragment breakpoints (linker positions) are returned and stored in the IntSiteDB database (described in detail in our accompanying paper). Read 1 and read 2 are joined based on location in the sequencing flow cell (encoded as the read name). Read pairs that map to identical sites are judged to be PCR duplicates and collapsed into single sites. To pass our quality filter, the genomic coordinates of these positions must lie within the range accessible by the sequencing chemistry—we allow a maximum of 2,500 base pairs (bp) as the default value (Figure 4A). Integration sites for which the read 1 (linker side) and read 2 (integration site side) positions are unreasonably distant, or on different chromosomes, are judged to be chimeras formed during PCR and are removed from the output data (Figure 4B). Paired alignments are additionally filtered for correct relative orientation.
Figure 4
Interpretation of Paired Read Data
(A) Unique integration site. In this case, the two reads are within a short distance of one another on the chromosome and are correctly oriented on opposite DNA strands. (B) An artifactual chimera. In this case, the two reads are on two distinct chromosomes, or found implausibly far apart on the same chromosome, and so are judged to be artifacts formed during construction of the library for sequencing. (C) Multihit. In this case, both reads in the pair have equally good alignments at multiple distinct locations.
Integration Sites in the Human Genome Showing Multiple Equally Good Alignments: “Multihits”
Viral or vector genomes that integrate within repetitive genomic elements often cannot be mapped to a single genomic coordinate, so that both read pairs align nearby, but they can be mapped to multiple locations in the human chromosomes (Figure 4C). For some forms of analysis, the multihits may be ignored—for example, in an analysis of integration site distributions relative to genomic features. However, for monitoring clinical gene therapy samples for possible adverse events, it is not safe to rule out possible insertional activation by integration in a repeated sequence—it is possible that an integration site in a multihit site may be near a cancer-related gene and involved in an adverse event. At least 40% of the human genome is composed of repeated sequences such as L1 retrotransposons, endogenous retroviruses, Alu elements, and others.A complication is that unique sonic fragments of the same parent integration site may map to non-identical lists of genomic coordinates, and even PCR duplicates may show different mapping behavior due to sequencing error. INSPIIRED thus uses a graph-theory-based approach to group alignments into clusters, so that each cluster can be treated as an integration site in downstream analysis. INSPIIRED assigns multihit reads to multihit clusters by creating an undirected graph G = (V,E), where V is the set of reads identified as multihits and E is the set of pairs of multihit reads that share at least one putative integration site in the output list of multiple alignments. Each connected component of G is designated as a unique multihit cluster.When considering the number of reads produced by the Illumina technology, the computational resources required to compare putative integration locations in a pairwise fashion can become prohibitive. To improve the scalability of multihit clustering, reads that have identical genomic DNA sequences across both read 1 and read 2 are combined into a single representative read before executing the pairwise comparison of potential genomic mappings. When building an undirected graph from multihit read alignments, only the first connection of completely connected reads is used, reducing memory demand even further while yielding the same result with improved scalability.
Performance of the Pipeline Analyzed Using Synthetic Data
The performance of the pipeline was analyzed by generation and analysis of synthetic integration site data. Reads were generated with lengths of 179 and 143 nt corresponding to read 1 and read 2, respectively, including addition of the Illumina sequencing primers, DNA barcodes, primer landing pads, and flanking host DNA. A total of 5,000 sites were simulated. The distances between reads 1 and 2 were chosen randomly from a distribution of distances modeled to match empirical data, with 100 different distances between pairs sampled for each of the 5,000 integration sites. Four sets of the 5,000 integration sites were studied, containing no error, 1% error (roughly that expected from the Illumina sequencing method), 2% error, and 4% error.Integration site datasets were trimmed, aligned, and quality filtered using the INSPIIRED pipeline. Results are tabulated in Table 1. Initially we asked whether each integration site could be recovered from at least one of the 100 read pairs (Table 1, top), and we then asked how many of the read pairs were recovered (Table 1, bottom).
Table 1
Processing of Unique In Silico-Generated Integration Sites Using INSPIIRED
R2 (LTR Read) + R1 (Linker Read)
R2 (LTR Read) Only
0% Error
1% Error
2% Error
4% Error
0% Error
1% Error
2% Error
4% Error
Integration Sites
Total simulated unique sites
5,000
5,000
5,000
5,000
5,000
5,000
5,000
5,000
Sites for which the collection of alignments contains the correct site
4,979
4,983
4,985
4,985
4,960
4,979
4,985
4,982
Site with single correct alignment
4,843
4,908
4,926
4,929
4,686
4,838
4,876
4,869
Sites with multiple alignments that include the correct site
136
189
144
96
276
334
264
197
Sites for which some read pairs show unique alignments while others show multiple alignments that include the correct alignment
617
613
532
347
477
547
483
291
Sites with no alignments
21
17
15
15
40
21
15
17
Sites for which individual reads yield different and/or incorrect alignment locations
87
229
278
179
75
222
280
198
Sequencing Reads
Total simulated read pairs
500,000
500,000
500,000
500,000
500,000
500,000
500,000
500,000
Passed primer + LTRbit trimming
500,000
433,814
375,379
278,239
500,000
433,814
375,379
278,239
Passed linker trimming
500,000
433,797
375,059
275,280
500,000
433,797
375,059
275,277
Aligned correctly
489,927
406,195
313,977
113,580
486,939
404,667
313,387
113,617
Aligned unique integration site
458,457
384,179
299,833
109,808
450,702
379,411
296,961
109,155
Aligned multihit
31,470
22,016
14,144
3,772
36,237
25,256
16,426
4,462
For 0% error, 99.6% of sites could be recovered. Twenty-one sites were not aligned, and 87 sites were incorrectly aligned. These latter sites mapped to regions annotated as “low alignability,” as defined by the GEM mappability program. By visual inspection, these regions were rich in multiple repetitive element classes that were often nested within each other. Overall, of the 100 simulated sequence reads for each integration site, on average 98 could be mapped correctly. For the same integration sites containing 1%–4% error, the fractions mapping were 99.7% in each case.The behavior of individual reads is summarized in the bottom of Table 1. Although the majority of sites were recovered, the proportion of the 100 paired sequences per integration site that could be recovered fell with increasing error, from 98% at 0% error to 23% at 4% error. Reassuringly, sequences lost with increasing amounts of error were mostly aligned correctly—error resulted in a lack of alignment, rather than misalignment.We next asked how the sites were distributed among unique locations and repeated sequences. For each integration site in a repeated sequence, the R1 sequence at the linker end of each read has the potential of reaching into flanking unique DNA, resulting in a unique mapping position. At 0% sequence error, multihits accounted for only 6.4% of correctly aligned reads, while at 4% error, multihits accounted for 3.3%. Thus, the proportion of integration sites scoring as multihits was modest.We investigated to what extent use of the linker side read (R1) allow for increased capture of unique sites and decreased recovery of multihits. We thus compared results with paired reads (R1 plus R2; Table 1, left side) to results with the integration site read only (R2; Table 1, right side). Results comparing R1 plus R2 to R2 alone showed a decrease in reads mapping to multiple locations (down 4,767 reads) and reads not aligning to the human genome (down 2,876 reads), as well as an increase in the number of reads that align correctly (2,988 more reads) and align to unique locations in the human genome (7,755 more reads). There was little difference in the number of sites detected, although there were half as many multihit sites when considering R1 plus R2 over R2 alone.
Analysis of Experimental Integration Site Data
We next assessed the performance of the pipeline over three empirical datasets (Table 2). The first consists of humanHAP1 cells infected with an HIV-based vector and then grown for only 24 hr. A total of 38,967 integration sites were recovered, with an average of only 1.36 cells per site (SonicAbundance estimate). The second dataset consists of five purified 293T cell lines, each with a single lentiviral integration site. Cells were pooled and integration sites were determined from DNA purified from the mixture. Thus, only five sites were recovered by sequencing, with an average of 3,582 cells per site. The third dataset was a specimen of blood cells from a patient successfully treated for severe combined immunodeficiency (SCID-X1) using a gammaretroviral vector. A total of 755 integration sites were recovered, ranging from 1 to 9,325 cells per site (average SonicAbundance of 40). This emphasizes that after gene correction, different progenitor cell clones delivered quite different proportions of cells to the periphery.
Table 2
Experimental Datasets and Their SonicAbundance Values
Name
Description
Reads
Unique Sites
Multihits
Average SonicAbundance
LentiAcute
acute infection of HAP1 cells with a lentiviral vector
951,985
36,399
2,568
1.36
ClonedLentia
mixture of cloned cells with five lentiviral integration sites
386,234
5
1
3,582.40
SCID, GTSP0855 p7 m162 PBMC
blood cell sample from the first SCID trial
845,267
666
89
40.17
One of the ClonedLenti sites annotates as either a unique site or a multihit, depending on the length of the fragment recovered. Samples filtered on abundance of 20.
These data allow us to investigate the effects of human repeated sequences on integration site recovery using Illumina paired-end data. Multihits accounted for 6.6% of sites in the HAP1 cells and 11.8% in the SCID gene therapy specimen. Thus, multihits were roughly in the range expected from tests with the in silico-generated data.Quantification was tested using the 293T cell clones with known integration sites mixed in different ratios. Figure 5 shows the expected and observed values (R = 0.852). In all cases, the expected ranking by frequency was observed, although there was some departure from the exact value expected. A clone included as only 1% of the population was readily detected.
Figure 5
Results of a Control Study of Synthetic Mixtures of DNA from Cell Lines with Known Integration Site Distributions
The expected abundance based on the composition of the mixture is shown on the x axis, and the observed abundance (over 4 replicates) is shown on the y axis. Three cell mixtures were compared. In set 1, the clones were mixed in equal amounts. In set 2, the composition was 20% clone 1, 10% clone 2, 45% clone 3, 15% clone 4, and 10% clone 5. In set 3, the composition was 9% clone 1, 15% clone 2, 1% clone 3, 10% clone 4, and 65% clone 5. Error bars are SD.
Discussion
Here, we present a pipeline for the generation and first-stage alignment of integration site data generated using Illumina paired-end reads. Portions of this pipeline have been used in earlier studies.3, 9, 11, 14, 16, 17, 18, 19, 20, 24, 26, 27, 30, 32, 33, 35, 47, 49, 50, 51, 60, 63, 64, 65, 66, 67, 68, 69, 70, 71 The full pipeline presented here introduces several improvements, including (1) use of a blocking oligonucleotide containing non-natural bases to suppress recovery of internal fragments; (2) use of all sets of paired-end reads, not just those that overlap, as in some earlier pipelines43, 45; (3) a graph-theory-based method for tracking integration in repeated sequences; (4) use of synthetic data to assess the effect of read spacing and error; (5) implementation of the SonicAbundance method for quantification; and (6) use of cloned cell lines for empirical characterization of quantification accuracy.Several integration site processing pipelines have been published, including IntegrationSeq/Map (2007), SeqMap (2008), QuickMap (2009), MAVRIC (2012), VISPA (2014), VISA (2015), and GeIST (2015).29, 36, 38, 39, 42, 43, 46 Early pipelines focused on efficient mapping of single reads onto reference genomes29, 36, 38 and included functions such as mapping the distance to the nearest transcription start site. Most pipelines include steps to reduce the total number of reads that must be aligned, so that computational time is used efficiently.Illumina paired-end sequencing yields two reads from each DNA fragment, which may or may not overlap. When adapting paired-end sequencing to integration site pipelines, previous groups have chosen to merge these reads if possible,39, 42, 43 but if they did not overlap, then the linker side read was discarded. INSPIIRED uses both reads, irrespective of overlap, for quantitative analysis of cell populations. That is, INSPIIRED uses all information to determine the location of integrated elements, and it also uses the location of the fragment breakpoint due to sonication for abundance quantification (Figure 3).Determining the location of integration elements can be challenging due to repeated sequences in the human genome. From our synthetic integration site data with 0% error, reads mapping to multiple locations in the human genome, “multihits,” were found within repeated elements such as Alu elements, L1 retrotransposons, endogenous retroviruses, and others. Longer paired alignments to the human genome (that are generated by not requiring overlapping reads) have the potential to align the linker side read outside of these repeated regions, yielding a unique location for the integration site. This is evident in the analysis of in silico data (Table 1), in which more multihit sites were identified in paired reads (R1 plus R2) compared to those using the integration site sequence read only (R2). Multihit reads are grouped based on alignments and may have multiple linker attachment sites, as with unique integration sites. Thus, multihits can also be quantified by the SonicAbundance method and queried for possible clonal expansion.The output data tables generated with INSPIIRED make possible the generation of a series of analytical products. These include (1) interactive heatmaps summarizing relationships of integration site data to genomic and epigenetic features; (2) reproducible reports on gene-corrected patient samples summarizing numerous features of integration site populations, including expansion of clones with integration sites near cancer-related genes; and (3) data frames suitable for statistical analysis based on the SonicAbundance method. These tools are described in our accompanying paper and are available at https://github.com/BushmanLab/INSPIIRED.
Materials and Methods
Recovery of Integration Sites Using PCR and a Blocking Primer
A detailed protocol is available in the Supplemental Materials and Methods, including the method for library preparation and sequencing of sites of new DNA integration in the human genome.Samples of genomic DNA were prepared for Illumina sequencing by random shearing using a Covaris M220 ultrasonicator to achieve an average size distribution of 800–900 bp. DNA fragment ends were repaired (5′ phosphorylated and dA tailed) prior to TA ligation with custom linkers using NEBNext Ultra End Repair/dA-Tailing and NEBNext Ultra Ligation Modules, respectively (see Table S10 for linker design and Table S14 for sequences). Ligated DNA was split into at least four replicates prior to ligation-mediated (LM) PCR amplification (PCR1, 25 total cycles). Using the PCR1 product as a template, a nested LM PCR was conducted (PCR2, 20 total cycles) adding replicate-specific 12-bp Golay barcodes and Illumina adaptor sequences. Portions of PCR1 and PCR2 products were visually examined on ethidium bromideagarose gels. PCR2 products were pooled across replicates and bead purified prior to library construction. Sample concentrations were measured using the Quant-iT PicoGreen dsDNA Assay Kit and sequencing libraries were constructed by pooling samples by equal mass. Average amplicon size and library molarity were measured using an Agilent D1000 ScreenTape System and a KAPA SYBR FAST Universal qPCR Kit, respectively. Sequencing libraries were then sequenced on an Illumina MiSeq instrument. Sequences of oligonucleotides are provided in Tables S11 and S12.Nested PCRs were supplemented with vector-specific blocking oligos complementary to the primer binding site found downstream of the 5′ LTR-U5 sequence. Blocking oligos contained nine blocked nucleic acids (BNAs) with a total length between 27 and 32 nucleotides and an estimated annealing temperature around 80°C. Each of the blocking oligos is terminated with a 3′ amino modification.Paired-end sequencing was performed on the Illumina MiSeq, using 300-cycle V2 reagent kits with nano-, micro-, and standard flowcells. Cycle allocation was conducted as follows: read 1 used 179 cycles, index 1 used 12 cycles, and read 2 used 143 cycles. This allows for approximately 130 nucleotides of host sequence from both sides of the template molecule. Fastq output files were subsequently used as input for INSPIIRED.
Integration Site Quality Control and Read Trimming
Read sequences are trimmed to remove the linker, primer, and viral DNA end (LTRbit) sequences. A read pair is required to have the linker sequence at the beginning of read 1 and primer and LTRbit sequences at the beginning of read 2. The primer and the LTRbit sequences are determined by the corresponding vector sequence and are different for each virus or vector studied. Primer and LTRbit sequences are trimmed off from the beginning of read 2, and the linker sequences are trimmed off from read 1.In cases where the sonic break point is close to the integration site, read 2 may read into the reverse complement of the linker and read 1 may read into the reverse complement of the primer and LTRbit. These sequences at the tails of the reads interfere with alignment and quality control, and thus are detected and trimmed off. About 20% of the reads are affected. These alignments are detected by Bioconductor’s Biostrings pairwiseAlignment function to trim and filter the leading and tailing sequences, while requiring 85% of identity based on edit distances.Although a blocking oligo is employed to reduce pure vector sequence amplification during PCR, the blocking is not 100% efficient. Therefore, trimmed and filtered sequences are aligned to the vector reference and removed if either the read 1 or read 2 aligns with 75% of global identity and within 5 nt of the start of the read.
Sequence Alignment
INSPIIRED uses BLAT for DNA alignments because of its accuracy and the fact that it reports all alignments if a read has multiple hits in the genome with scores above a certain threshold, which makes it useful for handling multihit read pairs. The following BLAT parameters were used to align the read pairs: -tileSize = 11, -stepSize = 9, -minIdentity = 85, -maxIntron = 5, and -minScore = 27. Removing the commonly used -ooc = 11.ooc option led to better alignment in repeated regions, such as LINE, SINE, and LTR regions, since the option -ooc prohibits search in those regions. The option -maxIntron was changed to 5 from the default 750,000, as reads are amplified from genomic DNA and should not align across splicing elements. Tests showed that reducing the stepSize improved the accuracy of the alignments but increased the demand on memory; therefore, a stepSize of 9 represented a workable compromise. As BLAT is a local aligner, alignments were filtered out that only partially match to the host genome. Alignments were also filtered by a global identity score, defined as (matches + repMatches)/qSize for quality control on alignments, requiring a minimum of 95%. After aligning and filtering by the global identity score, the two reads of a template are paired by requiring that the reads (1) align to the same chromosome, (2) align to opposite strands, (3) maintain the correct predicted orientation (the linkered breakpoint read is “downstream” of the LTR read if the orientation is on the positive strand and vice versa for the negative strand), and (4) predicted template length is shorter than 2,500 nt.For the studies reported here, the h18 draft of the human genome was used, to allow direct use of chromatin immunoprecipitation sequencing (ChIP-seq) data that was originally analyzed on this background (see our accompanying paper). However, the organism and draft genome used for analysis can be selected by users and analyzed with any suitable annotation tracks compiled on that sequence.The following definitions were used to describe the locus for each alignment in the format of “chromosome:strand:position.” Here, tName, tStart, tEnd, and strand are columns of the BLAT output in the PSL format and they refer to chromosome, start position, end position, and strand, respectively. The integration site is given by [ site = tName:+:tStart ] if read 2 is mapped to the positive strand and [ site = tName:-:tEnd ] if mapped to the negative strand. Likewise, the break point is given by [ breakpoint = tName:+:tEnd ] if read 1 is mapped to the positive strand and [ breakpoint = tName:-:tStart ] if mapped to the negative strand. If more than one read pair yields the same integration site and breakpoint on the same strand, only one is kept and the others are considered to be PCR duplicates.
Generation of Synthetic Integration Site Data and Introduction of Error
Sequences were simulated from random locations in the human reference genome, then additional technical sequences required for the Illumina technology (Figure 1) were added to each read. Given a locus (composed of chromosome, strand, and location), the downstream sequence was obtained from the human reference genome. To generate simulated templates, the following components were concatenated together: Illumina P7, 12-nt Golay barcode, sequencing primer 2 (for both index and read 2 sequences), primer, LTRbit, reference genome sequence, linker (reverse complement), sequencing primer 1 (for read 1, reverse complement), and lastly, the Illumina P5 sequence (reverse complement). Read 2 sequences were generated by obtaining the 143 nt following the sequencing primer 2 of the above template. Read 1 sequences were generated by obtaining the 179 nt following the sequencing primer 1 after the reverse complementing the template. Should fewer bases exist in the template strand than are required, the ends of the reads are filled with poly(T) sequence, as is observed on the Illumina instrument. Index reads were generated as random 12 nt or a specified Golay DNA sequence. Sequencing primer, PCR primer, and LTRbit sequences are specific to each study.For the synthetic dataset, 5,000 integration sites were generated from the human genome, and 100 lengths between read 1 and read 2 were simulated for each site. Simulated template lengths followed a normal distribution with a mean of 70 (SD of 250), while only keeping lengths greater than 30. In total, there are 500,000 read pairs in the simulation set. To evaluate the performance of INSPIIRED with sequencing error, four simulation datasets (with the same random sites and lengths) were generated containing 0%, 1%, 2%, and 4% read errors, applied after template construction.
Experimental Integration Site Analysis Workflow
After sequencing, Illumina output FASTQ files are used as inputs for INSPIIRED, along with sample information (which linkers and barcodes were used with which samples) and vector sequence information. As Golay DNA barcodes have large edit distances, the index sequence file that contains the barcode sequences for each read is subjected to error correction (up to two bases of correction). With the corrected barcodes, information from read 1 (containing the linker sequence) is used with the barcodes to demultiplex the samples, creating independent FASTQ files for read 1 and read 2. Sequences are then subjected to quality trimming (only keeping information with Q-scores greater than Q30). Following quality trimming, respective reads are filtered and trimmed for linker, primer, LTRbit, and vector-related sequences to yield only genomic DNA. The genomic sequences are then aligned to the host reference genome using BLAT, yielding an output PSL file with the alignment information. The alignment information is then used to determine the locations of alignments and read information is filtered and paired, as previously discussed. Alignments for integration sites are then placed in four output files, including allSites (containing all uniquely mapped integration sites and their corresponding breakpoints), sites.final (contains a condensed form of allSites), multihitData (contains all properly paired and filtered integration sites that could not be uniquely mapped to the reference genome), and chimeraData (contains integration sites that failed proper pairing and are considered artifacts). These outputs are generated for each sample given a specific linker and barcode in the input sample information file.
Software Installation
INSPIIRED is distributed online as a downloadable virtual machine executable on the Windows, Mac, and Linux operating systems as well as a GitHub source code repository supported by a Conda software environment. The virtual machine, software, instructions, and a walkthrough that processes provided sample data are available at https://github.com/BushmanLab/INSPIIRED. As described in the instructions, the software supports two types of databases, MySQL and SQLite.
Data Availability
SRA accession numbers for the datasets studied can be found in Table S13.
Author Contributions
E. Sherman, C.N., C.C.B., E. Six, M.C., and F.D.B. designed the study; E. Sherman, C.N., C.C.B., E. Six, M.C., F.M., S.R., M.J.D., P.B., S.H.-B.-A, L.C., and F.D.B. carried out the study; E. Sherman, C.N., C.C.B., E. Six, Y.W., A.D., N.M., A.B., K.B., J.K.E., M.C., F.M.S.R., M.J.D., P.B., S.H.-B.-A, L.C., and F.D.B. analyzed the data.
Conflicts of Interest
The authors declare that they have no competing interests.
Authors: E S Lander; L M Linton; B Birren; C Nusbaum; M C Zody; J Baldwin; K Devon; K Dewar; M Doyle; W FitzHugh; R Funke; D Gage; K Harris; A Heaford; J Howland; L Kann; J Lehoczky; R LeVine; P McEwan; K McKernan; J Meldrim; J P Mesirov; C Miranda; W Morris; J Naylor; C Raymond; M Rosetti; R Santos; A Sheridan; C Sougnez; Y Stange-Thomann; N Stojanovic; A Subramanian; D Wyman; J Rogers; J Sulston; R Ainscough; S Beck; D Bentley; J Burton; C Clee; N Carter; A Coulson; R Deadman; P Deloukas; A Dunham; I Dunham; R Durbin; L French; D Grafham; S Gregory; T Hubbard; S Humphray; A Hunt; M Jones; C Lloyd; A McMurray; L Matthews; S Mercer; S Milne; J C Mullikin; A Mungall; R Plumb; M Ross; R Shownkeen; S Sims; R H Waterston; R K Wilson; L W Hillier; J D McPherson; M A Marra; E R Mardis; L A Fulton; A T Chinwalla; K H Pepin; W R Gish; S L Chissoe; M C Wendl; K D Delehaunty; T L Miner; A Delehaunty; J B Kramer; L L Cook; R S Fulton; D L Johnson; P J Minx; S W Clifton; T Hawkins; E Branscomb; P Predki; P Richardson; S Wenning; T Slezak; N Doggett; J F Cheng; A Olsen; S Lucas; C Elkin; E Uberbacher; M Frazier; R A Gibbs; D M Muzny; S E Scherer; J B Bouck; E J Sodergren; K C Worley; C M Rives; J H Gorrell; M L Metzker; S L Naylor; R S Kucherlapati; D L Nelson; G M Weinstock; Y Sakaki; A Fujiyama; M Hattori; T Yada; A Toyoda; T Itoh; C Kawagoe; H Watanabe; Y Totoki; T Taylor; J Weissenbach; R Heilig; W Saurin; F Artiguenave; P Brottier; T Bruls; E Pelletier; C Robert; P Wincker; D R Smith; L Doucette-Stamm; M Rubenfield; K Weinstock; H M Lee; J Dubois; A Rosenthal; M Platzer; G Nyakatura; S Taudien; A Rump; H Yang; J Yu; J Wang; G Huang; J Gu; L Hood; L Rowen; A Madan; S Qin; R W Davis; N A Federspiel; A P Abola; M J Proctor; R M Myers; J Schmutz; M Dickson; J Grimwood; D R Cox; M V Olson; R Kaul; C Raymond; N Shimizu; K Kawasaki; S Minoshima; G A Evans; M Athanasiou; R Schultz; B A Roe; F Chen; H Pan; J Ramser; H Lehrach; R Reinhardt; W R McCombie; M de la Bastide; N Dedhia; H Blöcker; K Hornischer; G Nordsiek; R Agarwala; L Aravind; J A Bailey; A Bateman; S Batzoglou; E Birney; P Bork; D G Brown; C B Burge; L Cerutti; H C Chen; D Church; M Clamp; R R Copley; T Doerks; S R Eddy; E E Eichler; T S Furey; J Galagan; J G Gilbert; C Harmon; Y Hayashizaki; D Haussler; H Hermjakob; K Hokamp; W Jang; L S Johnson; T A Jones; S Kasif; A Kaspryzk; S Kennedy; W J Kent; P Kitts; E V Koonin; I Korf; D Kulp; D Lancet; T M Lowe; A McLysaght; T Mikkelsen; J V Moran; N Mulder; V J Pollara; C P Ponting; G Schuler; J Schultz; G Slater; A F Smit; E Stupka; J Szustakowki; D Thierry-Mieg; J Thierry-Mieg; L Wagner; J Wallis; R Wheeler; A Williams; Y I Wolf; K H Wolfe; S P Yang; R F Yeh; F Collins; M S Guyer; J Peterson; A Felsenfeld; K A Wetterstrand; A Patrinos; M J Morgan; P de Jong; J J Catanese; K Osoegawa; H Shizuya; S Choi; Y J Chen; J Szustakowki Journal: Nature Date: 2001-02-15 Impact factor: 49.962
Authors: Astrid R W Schröder; Paul Shinn; Huaming Chen; Charles Berry; Joseph R Ecker; Frederic Bushman Journal: Cell Date: 2002-08-23 Impact factor: 41.582
Authors: Angela Ciuffi; Richard S Mitchell; Christian Hoffmann; Jeremy Leipzig; Paul Shinn; Joseph R Ecker; Frederic D Bushman Journal: Mol Ther Date: 2005-12-01 Impact factor: 11.454
Authors: M K Lewinski; D Bisgrove; P Shinn; H Chen; C Hoffmann; S Hannenhalli; E Verdin; C C Berry; J R Ecker; F D Bushman Journal: J Virol Date: 2005-06 Impact factor: 5.103
Authors: Angela Ciuffi; Manuel Llano; Eric Poeschla; Christian Hoffmann; Jeremy Leipzig; Paul Shinn; Joseph R Ecker; Frederic Bushman Journal: Nat Med Date: 2005-11-27 Impact factor: 53.440
Authors: Rick S Mitchell; Brett F Beitzel; Astrid R W Schroder; Paul Shinn; Huaming Chen; Charles C Berry; Joseph R Ecker; Frederic D Bushman Journal: PLoS Biol Date: 2004-08-17 Impact factor: 8.029
Authors: Rebecca T Veenhuis; Abena K Kwaa; Caroline C Garliss; Rachel Latanich; Maria Salgado; Christopher W Pohlmeyer; Christopher L Nobles; John Gregg; Eileen P Scully; Justin R Bailey; Frederic D Bushman; Joel N Blankson Journal: JCI Insight Date: 2018-09-20
Authors: Christopher L Nobles; Scott Sherrill-Mix; John K Everett; Shantan Reddy; Joseph A Fraietta; David L Porter; Noelle Frey; Saar I Gill; Stephan A Grupp; Shannon L Maude; Donald L Siegel; Bruce L Levine; Carl H June; Simon F Lacey; J Joseph Melenhorst; Frederic D Bushman Journal: J Clin Invest Date: 2020-02-03 Impact factor: 14.808
Authors: Francesco R Simonetti; Hao Zhang; Garshasb P Soroosh; Jiayi Duan; Kyle Rhodehouse; Alison L Hill; Subul A Beg; Kevin McCormick; Hayley E Raymond; Christopher L Nobles; John K Everett; Kyungyoon J Kwon; Jennifer A White; Jun Lai; Joseph B Margolick; Rebecca Hoh; Steven G Deeks; Frederic D Bushman; Janet D Siliciano; Robert F Siliciano Journal: J Clin Invest Date: 2021-02-01 Impact factor: 14.808
Authors: Emma C Morris; Thomas Fox; Ronjon Chakraverty; Rita Tendeiro; Katie Snell; Christine Rivat; Sarah Grace; Kimberly Gilmour; Sarita Workman; Karen Buckland; Katie Butler; Ronnie Chee; Alan D Salama; Hazem Ibrahim; Havinder Hara; Cecile Duret; Fulvio Mavilio; Frances Male; Frederick D Bushman; Anne Galy; Siobhan O Burns; H Bobby Gaspar; Adrian J Thrasher Journal: Blood Date: 2017-07-17 Impact factor: 22.113
Authors: Laura Breda; Valentina Ghiaccio; Naoto Tanaka; Danuta Jarocha; Yasuhiro Ikawa; Osheiza Abdulmalik; Alisa Dong; Carla Casu; Tobias D Raabe; Xiaochuan Shan; Gwenn A Danet-Desnoyers; Aoife M Doto; John Everett; Frederic D Bushman; Enrico Radaelli; Charles A Assenmacher; James C Tarrant; Natalie Hoepp; Ryo Kurita; Yukio Nakamura; Virginia Guzikowski; Kim Smith-Whitley; Janet L Kwiatkowski; Stefano Rivella Journal: Mol Ther Date: 2021-01-29 Impact factor: 11.454
Authors: Laura P Spector; Matthew Tiffany; Nicole M Ferraro; Nathan S Abell; Stephen B Montgomery; Mark A Kay Journal: Mol Ther Date: 2020-11-26 Impact factor: 11.454