DetectIS: a pipeline to rapidly detect exogenous DNA integration sites using DNA or RNA paired-end sequencing data.

Luigi Grassi1, Claire Harris1, Jie Zhu2, Colin Hardman3, Diane Hatton1.   

Abstract

MOTIVATION: Recombinant DNA technology is widely used for different applications in biology, medicine and biotechnology. Viral transduction and plasmid transfection are among the most frequently used techniques to generate recombinant cell lines. Many of these methods result in the random integration of the plasmid into the host genome. Rapid identification of the integration sites is highly desirable in order to characterize these engineered cell lines.
RESULTS: We developed detectIS: a pipeline specifically designed to identify genomic integration sites of exogenous DNA, either a plasmid containing one or more transgenes or a virus. The pipeline is based on a Nextflow workflow combined with a Singularity image containing all the necessary software, ensuring high reproducibility and scalability of the analysis. We tested it on simulated datasets and RNA-seq data from a human sample infected with Hepatitis B virus. Comparisons with other state-of-the-art tools show that our method can identify the integration site in different recombinant cell lines, with accurate results, lower computational demand and shorter execution times.
AVAILABILITY: The Nextflow workflow, the Singularity image and a test dataset are available at https://github.com/AstraZeneca/detectIS.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author(s) 2021. Published by Oxford University Press.

Year:  2021        PMID: 33978747      PMCID: PMC9502153          DOI: 10.1093/bioinformatics/btab366

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.931


1 Introduction

Recombinant DNA technology can be used to generate transgenic animals, plants and cell lines, which are widely used for different applications in biology, medicine and biotechnology (Ghaderi ; Khan ). Therapeutic proteins with complex post-translational modifications are normally expressed in mammalian cell lines (Walsh, 2018; Zhu and Hatton, 2018). Viral transduction and plasmid transfection are widely used methods to establish recombinant cell lines (Kim and Eberwine, 2010; Lee ) and typically result in random integration of the transgene construct into the host genome. The identification of the transgene integration site (IS) is important for the characterization of stable recombinant cell lines and can reveal regulatory features relevant for transgene expression. It can also detect aberrant transgene–host fusion proteins, potentially caused by the plasmid integrating in the proximity of protein-coding genes. Knowledge of ISs can identify integration ‘hot spots’, i.e. genomic sites conferring high expression of the transgene, and data from multiple experiments can be used for the design of targeted ISs. Moreover, as the transgene ISs are unique for an individual transfection event, the IS information can be used to design PCR experiments to assess the clonality of a cell line (Sommeregger et al., 2013). Inverse PCR (Liang et al., 2008; Uemura et al., 2014), splinkerette-PCR (Uren et al., 2009) and targeted locus amplification (de Vree ) are techniques specifically designed to localize ISs in host genomes. High-throughput sequencing (HTS) experiments have been successfully used to localize a similar biological event: the viral ISs in host genomes (Chen ). Moreover, several studies have proved the usefulness of HTS in localizing plasmid ISs in stable cell lines (Brett ; Lambirth et al., 2015; Srivastava et al., 2014).
Although pipelines have been developed for detecting viral integration sites, some of them are specifically designed for the human genome reference sequence. Moreover, all of these tools require the preparation of indexes specific for each host and exogenous DNA element. We present detectIS, a pipeline to detect ISs in paired-end (PE) HTS experiments (either DNA or RNA sequencing data). It can be used directly with different host and exogenous DNA references, without the need to create a specific index. Consequently, it is suitable for different applications, for example detecting ISs of plasmids in stable cell lines, either clones or pools, as well as locating viruses integrated in any host genome. The speed of execution makes the detectIS pipeline well suited to quickly screening HTS data from panels of different cell lines generated during the cell line development process for therapeutic protein manufacture, enabling the detection of cell lines with undesirable transgene fusion sequences.

2 Materials and methods

DetectIS (Supplementary Fig. S1) consists of three main steps. First, PE reads are aligned, in single-end mode, to the exogenous sequence reference (i.e. transgene, plasmid or viral sequences). Second, reads with any overlap with the exogenous reference sequence are aligned, in single-end mode, to the host genome reference. Alignments are performed with the Minimap2 program (Li, 2018). Finally, a Perl script integrates the four alignment results (each read of the pair against each of the two references), looking for potential ISs. ISs can be identified by two kinds of evidence: split reads, i.e. read pairs in which at least one read has a part mapping to the host genome and the remaining part mapping to the plasmid/transgene, and chimeric reads, i.e. read pairs in which one of the two reads is mapped to the host genome and the other one to the plasmid/transgene. The pseudocode of the subroutines used by the Perl script is reported in Supplementary Figures S2–S9. Final results are provided as a txt file detailing all the potential ISs and the number of supporting split and chimeric read pairs. The same information is also reported in a markdown file that can be converted to a pdf and/or html file. All the steps of the detectIS pipeline are embedded in a Nextflow (Di Tommaso ) workflow that, together with the Singularity (Kurtzer ) container, ensures reproducibility and scalability from a single PC/workstation to high-performance computing (HPC) environments.
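The distinction between split and chimeric read pairs can be illustrated with a minimal Python sketch. It is not the Perl implementation described in the Supplementary Figures: the `Mate` structure, the function names and the `min_part` threshold are hypothetical and are introduced here only for illustration, under the assumption that each read is summarized by the stretch of its sequence aligned to each reference.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Span = Tuple[int, int]  # half-open [start, end) interval in read coordinates


@dataclass
class Mate:
    """Alignment summary for one read of a pair (hypothetical structure)."""
    length: int
    exo: Optional[Span] = None   # part of the read aligned to the exogenous reference
    host: Optional[Span] = None  # part of the read aligned to the host genome


def is_split(mate: Mate, min_part: int = 20) -> bool:
    """True if distinct parts of one read map to the host and to the exogenous DNA.

    min_part is an assumed threshold, not a documented detectIS parameter.
    """
    if mate.exo is None or mate.host is None:
        return False
    exo_len = mate.exo[1] - mate.exo[0]
    host_len = mate.host[1] - mate.host[0]
    overlap = max(0, min(mate.exo[1], mate.host[1]) - max(mate.exo[0], mate.host[0]))
    # Each reference must explain its own stretch of the read.
    return exo_len - overlap >= min_part and host_len - overlap >= min_part


def classify_pair(r1: Mate, r2: Mate, min_part: int = 20) -> str:
    """Label a read pair as 'split', 'chimeric' or 'uninformative'."""
    if is_split(r1, min_part) or is_split(r2, min_part):
        return "split"

    def only_exo(m: Mate) -> bool:
        return m.exo is not None and m.host is None

    def only_host(m: Mate) -> bool:
        return m.host is not None and m.exo is None

    if (only_exo(r1) and only_host(r2)) or (only_host(r1) and only_exo(r2)):
        return "chimeric"
    return "uninformative"


# Example: read 1 crosses the junction, read 2 lies entirely in the host.
pair_label = classify_pair(Mate(150, exo=(0, 80), host=(78, 150)),
                           Mate(150, host=(0, 150)))
```

In this sketch a pair counts as split as soon as either read crosses a junction, which mirrors the "at least one read" wording above; counting the supporting pairs per candidate site would be a separate aggregation step.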

3 Usage

To use the workflow, the user has to create a configuration file specifying the host genome reference, the exogenous sequence reference, the directory containing the raw data and the output directory. The analysis can be executed locally or in an HPC environment; in the latter scenario the user also has to specify the cluster executor. A configuration file is provided to analyze a test dataset and can be used as a template for other analyses. The recipe of the Singularity image with all the necessary software is also supplied. A bash script is also provided to analyze a test dataset without Nextflow; it can be used as a template for analyses in local environments.
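As a rough illustration of what a Nextflow-free run involves, the Python sketch below builds the four single-end Minimap2 command lines implied by the method (each read file against each reference), passing the reference FASTA directly so that no index is prepared beforehand. This is an assumption-laden sketch, not the script shipped with detectIS: the function name, the output file names and the intermediate `candidates.fq` files are hypothetical, and the `-x sr` short-read preset is an illustrative choice rather than the documented detectIS setting. The commands are only constructed, not executed.

```python
from pathlib import Path
from typing import List


def alignment_plan(host_fa: str, exo_fa: str, r1_fq: str, r2_fq: str,
                   out_dir: str) -> List[List[str]]:
    """Return the minimap2 command lines for the two alignment steps.

    File names under out_dir are hypothetical placeholders.
    """
    out = Path(out_dir)
    commands: List[List[str]] = []
    for mate, fastq in (("R1", r1_fq), ("R2", r2_fq)):
        # Step 1: each read file on its own (single-end) against the exogenous reference.
        commands.append([
            "minimap2", "-a", "-x", "sr",
            "-o", str(out / f"{mate}.exo.sam"),
            exo_fa, fastq,
        ])
        # Step 2: reads that overlapped the exogenous reference, against the host.
        # Producing this FASTQ from the step 1 output is not shown here.
        candidates = str(out / f"{mate}.candidates.fq")
        commands.append([
            "minimap2", "-a", "-x", "sr",
            "-o", str(out / f"{mate}.host.sam"),
            host_fa, candidates,
        ])
    return commands


plan = alignment_plan("host.fa", "plasmid.fa", "sample_R1.fq", "sample_R2.fq", "results")
for cmd in plan:
    print(" ".join(cmd))
```

Each command list could be handed to `subprocess.run` once the intermediate files exist; the repository's own bash script remains the reference for the actual parameters.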

3.1 Comparison with existing tools for structural variant identification

To test the functionality of detectIS and the accuracy of its results, we simulated random integrations of a plasmid in a Chinese hamster ovary (CHO) scaffold, exploring different combinations of transgene size, sequencing depth and read length. We compared the results of detectIS with those obtained with other tools for viral integration detection that can use non-human host references. SeekSV (Liang et al., 2017) is a program designed to identify ISs and other structural variants in RNA-seq and DNA-seq experiments and was one of the best performing tools for identifying viral integrations in a recent study (Chen ). BatVI (Tennakoon and Sung, 2017) is a sensitive and fast tool for the detection of viral integrations that, similarly to detectIS, uses a subtractive strategy: raw reads are first aligned to the viral reference genomes, and the partially mapped reads are then aligned to the host reference genome to detect viral integrations. SurVirus (Rajaby et al., 2021) is a recently published repeat-aware virus integration caller. DetectIS is among the tools with the highest precision and sensitivity in most of the simulated experiments with read lengths of 250 and 150 bases (Supplementary Figs S10A–F, S11A–F, Supplementary Tables S1–S3). Minimap2 works with read lengths of 100 bases or higher (Li, 2018) and, for this reason, 100 bases is the lowest read length compatible with detectIS. With 100-base reads, detectIS is less precise and sensitive than SurVirus and SeekSV at sequencing coverages of 5× and 10×, but performs similarly at higher coverage (Supplementary Figs S10G–I, S11G–I, Supplementary Tables S1–S3). The execution times of the analyses are similar for detectIS, SurVirus and BatVI and higher for SeekSV in all the simulated experiments (Supplementary Fig. S12).
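The idea behind the simulated integrations can be sketched in a few lines of Python. This is only a toy illustration, assuming a single insertion of the plasmid in the same 5′→3′ orientation as the host; the function name and the random toy sequences are hypothetical, and read simulation at a given coverage and read length is not covered.

```python
import random
from typing import Tuple


def integrate(host: str, plasmid: str, rng: random.Random) -> Tuple[str, int]:
    """Insert the plasmid at a random host position, keeping its orientation.

    Returns the recombinant sequence and the 0-based host coordinate of the
    integration site (the number of host bases preceding the insert).
    """
    pos = rng.randrange(len(host) + 1)  # any position from 0 to len(host)
    return host[:pos] + plasmid + host[pos:], pos


rng = random.Random(0)  # fixed seed so the toy example is reproducible
host = "".join(rng.choice("ACGT") for _ in range(200))
plasmid = "".join(rng.choice("ACGT") for _ in range(30))
recombinant, site = integrate(host, plasmid, rng)

# The two junctions of the insert in recombinant coordinates.
left_junction, right_junction = site, site + len(plasmid)
```

Because the true site is known, the positions reported by each tool can be compared with it directly, which is what the precision, sensitivity and positional discrepancy figures rely on.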
DetectIS has the lowest computational demands, with the lowest CPU times in all the simulated experiments (Supplementary Fig. S13). It is also notable that detectIS can be executed without reference index generation, a time-consuming step required by all the other tools (Supplementary Fig. S14). The integration sites detected by all the tools differ from the original sites by a few nucleotides on average (Supplementary Fig. S15). In the simulated integrations, plasmid and host had the same 5′→3′ orientation, and this feature was captured by all the tools. We extended the comparison to publicly available RNA-seq experiments of four hepatitis B virus (HBV) positive hepatocellular carcinoma cell lines with verified chimeric viral-human transcripts (Lau et al., 2014). In this analysis, SurVirus terminated with a segmentation fault error in all four analyzed experiments and produced an empty final result file in three of them. Similarly, BatVI produced a final result file for only one of the four analyzed experiments. For this reason, we could compare only the results generated by detectIS and SeekSV. We defined true positives as ISs that supported the chimeric viral-human transcripts verified in the study of Lau et al. (2014), with a tolerance of 50 nucleotides (Supplementary Table S4). The two tools gave similar results in terms of precision and sensitivity (Supplementary Fig. S16A, Supplementary Table S5) and of difference from the real data (Supplementary Fig. S16B), with a significantly shorter running time for detectIS (Supplementary Fig. S16C and D). This difference in running times can be explained by the fact that the two pipelines are based on different programs and strategies: SeekSV looks for all potential structural variants, while detectIS uses a subtractive strategy and is designed to specifically identify variants affecting the exogenous DNA (plasmid/virus).
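The precision and sensitivity figures rest on matching each reported IS to a verified site within a positional tolerance. A minimal Python sketch of such a comparison is given below. It assumes that sites are represented as (chromosome, position) pairs and that each verified site can be matched by at most one prediction; the greedy one-to-one matching and the function name are illustrative choices, not the procedure documented in the Supplementary material.

```python
from typing import List, Tuple

Site = Tuple[str, int]  # (chromosome or scaffold name, position)


def precision_sensitivity(predicted: List[Site], truth: List[Site],
                          tolerance: int = 50) -> Tuple[float, float]:
    """Compare predicted integration sites with verified ones.

    A prediction is a true positive if it lies on the same sequence as a
    not-yet-matched verified site and within `tolerance` nucleotides of it.
    """
    unmatched = list(truth)
    true_pos = 0
    for chrom, pos in predicted:
        hit = next((t for t in unmatched
                    if t[0] == chrom and abs(t[1] - pos) <= tolerance), None)
        if hit is not None:
            unmatched.remove(hit)  # each verified site is counted once
            true_pos += 1
    false_pos = len(predicted) - true_pos
    false_neg = len(unmatched)
    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    sensitivity = true_pos / (true_pos + false_neg) if truth else 0.0
    return precision, sensitivity


# Toy example: one of three predictions falls within 50 nt of a verified site.
predicted = [("chr1", 1000), ("chr1", 5000), ("chr2", 300)]
verified = [("chr1", 1030), ("chr2", 400), ("chr3", 10)]
precision, sensitivity = precision_sensitivity(predicted, verified)
```

In the toy example only the first prediction is within the tolerance, so both precision and sensitivity are one third.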
The results presented in this study demonstrate that detectIS is able to identify integration sites in HTS experiments in a short time and without high demands on computational resources. The benchmark analysis indicates that a longer read length improves detectIS precision and sensitivity in experiments performed at lower coverage. Using Minimap2 for the alignment allows the analysis to run without any index preparation step, which makes the pipeline unique among the existing programs for viral integration detection. Owing to its versatility, detectIS can be used to identify viral integration sites in transcriptome or genome sequencing experiments, and to identify the ISs of plasmids inserted into stable cell lines from HTS experiments routinely performed to exclude the presence of variants in transgenic transcripts during clone selection (Harris ; Lin et al., 2019).

Financial Support: none declared.
Conflict of Interest: none declared.
  24 in total

1.  Transgene copy number comparison in recombinant mammalian cell lines: critical reflection of quantitative real-time PCR evaluation.

Authors:  Wolfgang Sommeregger; Bernhard Prewein; David Reinhart; Alexander Mader; Renate Kunert
Journal:  Cytotechnology       Date:  2013-06-27       Impact factor: 2.058

2.  Seeksv: an accurate tool for somatic structural variation and virus integration detection.

Authors:  Ying Liang; Kunlong Qiu; Bo Liao; Wen Zhu; Xuanlin Huang; Lin Li; Xiangtao Chen; Keqin Li
Journal:  Bioinformatics       Date:  2016-09-14       Impact factor: 6.937

3.  Biopharmaceutical benchmarks 2018.

Authors:  Gary Walsh
Journal:  Nat Biotechnol       Date:  2018-12-06       Impact factor: 54.908

Review 4.  Production platforms for biotherapeutic glycoproteins. Occurrence, impact, and challenges of non-human sialylation.

Authors:  Darius Ghaderi; Mai Zhang; Nancy Hurtado-Ziola; Ajit Varki
Journal:  Biotechnol Genet Eng Rev       Date:  2012

5.  Identifying and genotyping transgene integration loci.

Authors:  Zhong Liang; Amy Marie Breman; Brenda R Grimes; Elliot D Rosen
Journal:  Transgenic Res       Date:  2008-07-09       Impact factor: 2.788

6.  Revealing Key Determinants of Clonal Variation in Transgene Expression in Recombinant CHO Cells Using Targeted Genome Editing.

Authors:  Jae Seong Lee; Jin Hyoung Park; Tae Kwang Ha; Mojtaba Samoudi; Nathan E Lewis; Bernhard O Palsson; Helene Faustrup Kildegaard; Gyun Min Lee
Journal:  ACS Synth Biol       Date:  2018-11-14       Impact factor: 5.110

7.  Comprehensive comparative analysis of methods and software for identifying viral integrations.

Authors:  Xun Chen; Jason Kost; Dawei Li
Journal:  Brief Bioinform       Date:  2019-11-27       Impact factor: 11.622

8.  Novel molecular and computational methods improve the accuracy of insertion site analysis in Sleeping Beauty-induced tumors.

Authors:  Benjamin T Brett; Katherine E Berquam-Vrieze; Kishore Nannapaneni; Jian Huang; Todd E Scheetz; Adam J Dupuy
Journal:  PLoS One       Date:  2011-09-13       Impact factor: 3.240

9.  Singularity: Scientific containers for mobility of compute.

Authors:  Gregory M Kurtzer; Vanessa Sochat; Michael W Bauer
Journal:  PLoS One       Date:  2017-05-11       Impact factor: 3.240

10.  Discovery of transgene insertion sites by high throughput sequencing of mate pair libraries.

Authors:  Anuj Srivastava; Vivek M Philip; Ian Greenstein; Lucy B Rowe; Mary Barter; Cathleen Lutz; Laura G Reinholdt
Journal:  BMC Genomics       Date:  2014-05-14       Impact factor: 3.969
