Suzanne Scott, Susanna Grigson, Felix Hartkopf, Claus V Hallwirth, Ian E Alexander, Denis C Bauer, Laurence O W Wilson.
Abstract
Viral integration is a complex biological process, and a reference integration dataset with known properties is useful both as a comparison for experimental data and as a benchmark for computational tools that detect integration. To generate such data, we developed a pipeline for simulating integrations of a viral or vector genome into a host genome. Our method reproduces the more complex characteristics of vector and viral integration, including integration of sub-genomic fragments, structural variation of the integrated genomes, and deletions from the host genome at the integration site. Our method [1] takes the form of a snakemake [2] pipeline, in which a Python [3] script using the Biopython [4] module simulates integrations of a viral reference into a host reference. This produces a reference containing integrations, from which sequencing reads are simulated using ART [5]. Another Python script then annotates the IDs of the reads crossing integration junctions to produce the final output: the simulated reads and a table of the integration locations and the reads crossing each integration junction. To illustrate our method, we provide simulated reads and integration locations, as well as the code required to simulate integrations using any virus and host reference. This simulation method was used to investigate the performance of viral integration tools in our research [6].
Keywords: Gene therapy; In silico; Integration; Vector; Virus
Year: 2022 PMID: 35496474 PMCID: PMC9046613 DOI: 10.1016/j.dib.2022.108161
Source DB: PubMed Journal: Data Brief ISSN: 2352-3409
Fig. 1. The workflow for simulating the example data. Boxes represent steps in simulating the data, and arrows represent files that are output by one step and input into the next. First, integrations are simulated (simulate_integrations, aqua), and then reads are generated using ART (green). The resulting SAM file is sorted and converted to BAM format (orange), and then the reads crossing each integration are annotated (red). The locations of the integrations that are crossed by reads are then output in BED3 format (yellow). A summary table of the integration simulation parameters is also written to a file (blue). The resulting summary, fastq files, and tables containing the locations of the integrations are the primary output of the pipeline (all, light green). This figure was generated using snakemake [2] and dot [14].
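The first step of this workflow, inserting viral sequence into the host while recording where each insertion lands, can be illustrated with a minimal sketch. This is not the pipeline's own script: the function name, parameter names, and record fields below are hypothetical, plain Python strings stand in for Biopython sequence objects, and rearrangements, host deletions, and junction gaps or overlaps are left out for brevity.

```python
import random


def simulate_integrations(host, virus, n_int, min_sep=10, p_whole=0.5,
                          min_len=5, max_len=None, seed=0):
    """Insert viral fragments into host; return (new_reference, records).

    Hypothetical simplification: no rearrangements, host deletions,
    or junction gaps/overlaps are modelled.
    """
    rng = random.Random(seed)
    if max_len is None:
        max_len = len(virus) - 1  # sub-genomic means shorter than the genome

    # Pick host insertion sites at least min_sep bases apart.
    sites, attempts = [], 0
    while len(sites) < n_int:
        attempts += 1
        if attempts > 10_000:
            raise ValueError("could not place integrations; relax min_sep")
        pos = rng.randint(0, len(host))
        if all(abs(pos - s) >= min_sep for s in sites):
            sites.append(pos)
    sites.sort()

    pieces, records, offset, prev = [], [], 0, 0
    for pos in sites:
        if rng.random() < p_whole:
            start, stop = 0, len(virus)          # whole viral genome
        else:
            frag_len = rng.randint(min_len, max_len)
            start = rng.randint(0, len(virus) - frag_len)
            stop = start + frag_len              # sub-genomic fragment
        frag = virus[start:stop]
        pieces.append(host[prev:pos])
        pieces.append(frag)
        records.append({
            "host_pos": pos,                     # site in the original host
            "virus_start": start,
            "virus_stop": stop,
            "new_start": pos + offset,           # coordinates in the new reference
            "new_stop": pos + offset + len(frag),
        })
        offset += len(frag)
        prev = pos
    pieces.append(host[prev:])
    return "".join(pieces), records


# Toy sequences, not the AAV2 / chr1 references used for the example data.
host = "ACGT" * 50
virus = "TTGACCGGTAAC"
sim, table = simulate_integrations(host, virus, n_int=3)
```

Keeping both the original host position and the shifted coordinates in each record is what lets later steps relate reads simulated from the new reference back to locations in the host.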
| Subject | Computational Biology |
| Specific subject area | Bioinformatics; Simulation |
| Type of data | Code |
| How the data were acquired | Our simulation pipeline was developed using snakemake 5.27 |
| Data format | Raw, Simulated |
| Description of data collection | The simulation pipeline begins by simulating integrations: pieces of the viral reference are inserted into the host reference, keeping track of which parts of the viral reference were integrated and where in the host genome the integrations occurred. This step is carried out by a Python script that outputs a file in fasta format containing the host reference with integrated viral sequences and a table containing the locations of the integrations. The properties of these integrations can be adjusted by setting the number of integrations, the minimum distance between adjacent integrations, the probability that the whole viral genome will be integrated (rather than a sub-genomic fragment), the minimum and maximum length of the sub-genomic fragments (if appropriate), the probability that the integrated genome will contain a rearrangement or deletion, the probability of a gap or overlap at the host/virus junctions, and the probability of a deletion from the host at each integration site. Sequencing reads are then simulated from the reference containing integrations using ART. Next, the reads that cross each integration junction are identified by a Python script, and the table of integration locations is updated with this information. Finally, a file containing the locations in the host of the integration junctions that are crossed by at least one read is output in BED format. |
| Data source location | AAV2 and human chr1 references (GenBank) |
| Data accessibility | Repository name: GitHub (code only) |
| Related research article | S. Scott, C.V. Hallwirth, F. Hartkopf, S. Grigson, Y. Jain, I.E. Alexander, D.C. Bauer, L.O.W. Wilson, Isling: A Tool for Detecting Integration of Wild-Type Viruses and Clinical Vectors, Journal of Molecular Biology. (2021) 167408. |
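
The junction-annotation step described under "Description of data collection" can be sketched as follows. This is an illustrative approximation, not the pipeline's script: the function names, the `min_overlap` parameter, and the record fields are hypothetical, reads are given as plain `(id, start, end)` tuples in 0-based half-open coordinates rather than parsed from a BAM file, and each crossed integration is written as a zero-length BED3 interval at its host position, which is an assumption about the output layout.

```python
def annotate_junction_reads(integrations, reads, min_overlap=1):
    """Attach the IDs of reads that cross each integration junction.

    A read crosses a junction when it has at least min_overlap bases
    on both sides of the junction coordinate.
    """
    annotated = []
    for integ in integrations:
        row = dict(integ, left_reads=[], right_reads=[])
        junctions = (("left_reads", integ["new_start"]),
                     ("right_reads", integ["new_stop"]))
        for read_id, start, end in reads:
            for side, junction in junctions:
                if start + min_overlap <= junction <= end - min_overlap:
                    row[side].append(read_id)
        annotated.append(row)
    return annotated


def to_bed3(annotated, chrom):
    """Write host locations of integrations crossed by at least one read."""
    lines = []
    for row in annotated:
        if row["left_reads"] or row["right_reads"]:
            # Assumed layout: zero-length interval at the host insertion site.
            lines.append(f"{chrom}\t{row['host_pos']}\t{row['host_pos']}")
    return "\n".join(lines)


# Hypothetical integrations (coordinates in the simulated reference) and reads.
integrations = [
    {"host_pos": 100, "new_start": 100, "new_stop": 112},
    {"host_pos": 150, "new_start": 162, "new_stop": 170},
]
reads = [("r1", 90, 105), ("r2", 105, 120), ("r3", 0, 50), ("r4", 100, 112)]

annotated = annotate_junction_reads(integrations, reads)
bed = to_bed3(annotated, "chr1")
```

In this toy example, `r4` lies entirely within the inserted viral sequence and therefore crosses neither junction, and the second integration is omitted from the BED output because no read spans it.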