Identifying viral integration sites using SeqMap 2.0.

Troy B Hawkins, Jessica Dantzer, Brandon Peters, Mary Dinauer, Keithanne Mockaitis, Sean Mooney, Kenneth Cornetta.

Abstract

Retroviral integration has been implicated in several biomedical applications, including identification of cancer-associated genes and malignant transformation in gene therapy clinical trials. We introduce an efficient and scalable method for fast identification of viral vector integration sites from long-read high-throughput sequencing. Individual sequence reads are masked to remove non-genomic sequence, aligned to the host genome and assembled into contiguous fragments used to pinpoint the position of integration.
AVAILABILITY AND IMPLEMENTATION: The method is implemented in a publicly accessible web server platform, SeqMap 2.0, containing analysis tools and both private and shared lab workspaces that facilitate collaboration among researchers. Available at http://seqmap.compbio.iupui.edu/.

Year:  2011        PMID: 21245052      PMCID: PMC3042184          DOI: 10.1093/bioinformatics/btq722

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


Retroviruses were first characterized by their ability to cause malignancy. Subsequently, retroviruses were identified that lacked oncogenes but mediated malignancy through a process termed insertional mutagenesis (IM). The molecular mechanisms of IM are varied but most commonly involve upregulation of cellular oncogenes in close proximity to the site of viral integration via cis- and trans-effects of promoter and enhancer sequences within the viral long terminal repeats (LTRs). Because of IM effects, the mapping of retroviral integration sites (RISs) has become a powerful tool for identifying cellular oncogenes. Copeland and Jenkins (Buchberg et al.; Copeland and Jenkins, 1990) used retroviruses to identify potential oncogenes by determining the site of viral integration in tumor tissues. This work led to the development of a database of cancer-associated genes (Akagi et al.). IM has also been associated with malignancy in the setting of human gene therapy applications. While most gene therapy trials have not been associated with the development of cancer, a notable exception was the treatment of X-linked Severe Combined Immuno-Deficiency (SCID-X1), where several patients developed a T-cell leukemia associated with vector integration near the proto-oncogenes LMO2, BMI1 and CCND2 (Hacein-Bey-Abina et al., 2008). The US Food and Drug Administration (FDA) now requires assessment of RISs for any human gene therapy trials utilizing integrating vector systems (USDHHS, 2006). In animal models and human clinical trials, retroviral transduction targets millions of cells. As integration can occur throughout most of the genome, the resulting cell populations can contain extremely large, but unknown, numbers of RISs. Initial methods to identify the RISs utilized PCR-based capture and amplification assays that were inefficient and highly labor intensive.
High-throughput next-generation sequencing technologies have facilitated much more efficient identification of RISs, which presents a new bioinformatics challenge. We (Peters et al.) and others (Appelt et al.; Giordano et al.) previously developed web-based bioinformatics tools that facilitate identification of RISs by mapping sequence data obtained from Sanger sequencing technology, but these tools are not sufficient to quickly map and characterize RISs from high-throughput methods. Here we introduce and explain our new methodology for quickly mapping RISs to a reference genome from extremely large datasets. Depending on the frequency of insertion sites within the cell population, and the number of samples run in parallel, there can be anywhere from 50- to 5000-fold coverage of an individual RIS within the reads generated from a single sequencing run. SeqMap 2.0 provides a scalable method for sequence matching, clustering and alignment, and also addresses challenges specific to 454 pyrosequencing data output, namely base stutter and redundant coverage of each RIS. The SeqMap 2.0 workflow has three stages: (i) sequence processing, including identification and masking of vector features and distribution of sequence reads into multiplex identifier (MID)/barcode-specific groups; (ii) sequence clustering and alignment; and (iii) data visualization and storage for further analysis (Supplementary Fig. 1B).
Fig. 1.

Graphical representation of mapped integration site in sequence viewer. The consensus sequence for a cluster is shown with glyphs for bar code (yellow), vector feature (green) and genomic alignment (blue) at the top of the page, color-coordinated to the sequences shown below at the right. A specific integration site is proposed (black) when the position flanking the user-defined LTR feature aligns to the genome. Details for the integration are shown at the left, including links to a list and MSA of reads contributing to the consensus sequence for the RIS, and details of the genomic alignment linked to the Entrez entry for the closest identified gene. Users can access expanded graphics of local genomic regions from the batch summary page (data not shown).

SeqMap 2.0 is able to analyze data from the major PCR techniques used in RIS analysis: ligase-mediated PCR (LM-PCR) (Smith, 1992), linear-amplification-mediated PCR (LAM-PCR) (Schmidt et al., 2007) and non-restrictive LAM-PCR (nrLAM-PCR) (Gabriel et al.); see Supplementary Material. Each individual sequence read input to SeqMap 2.0 originates from an amplicon with common features. From 5′ to 3′, these are a sequencing adaptor, a nucleotide bar code, the viral LTR, RIS-flanking genomic sequence, a linker cassette and another sequencing adaptor (Supplementary Fig. 1A). The sequence processing phase removes these common features to isolate the genomic portion of the read for clustering and mapping, and groups reads belonging to individual samples by bar code. First, each vector feature is matched to a database of input sequence reads by pairwise alignment (Brudno, 2007). Each base position in a read to which a vector feature maps is then masked. Second, direct regular expression matching is used to ‘read’ the bar code included in each sequence read. At this stage, reads are split into coded groups for further analysis during the clustering and mapping stages.
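The masking and bar code steps described above can be illustrated with a minimal Python sketch. It is not the SeqMap 2.0 implementation: the bar codes, the LTR fragment and the function names are hypothetical, and exact string matching stands in for the pairwise alignment that the method uses to locate vector features.

```python
import re
from collections import defaultdict

# Hypothetical bar codes and LTR fragment, chosen only for illustration.
BARCODES = {"ACGAGTGCGT": "sample_A", "ACGCTCGACA": "sample_B"}
LTR = "TGTGTGACTC"

# One regular expression that matches any known bar code at the 5' end of a read.
BARCODE_RE = re.compile("^(" + "|".join(re.escape(b) for b in BARCODES) + ")")


def mask_feature(read, feature, mask_char="N"):
    """Mask every exact occurrence of a vector feature in a read.

    Assumption: exact matching replaces the alignment-based matching
    described in the text, which would also tolerate sequencing errors.
    """
    return read.replace(feature, mask_char * len(feature))


def demultiplex(reads):
    """Split reads into per-sample groups by their leading bar code.

    The bar code is trimmed from each assigned read; reads without a
    recognised bar code are collected under the key "unassigned".
    """
    groups = defaultdict(list)
    for read in reads:
        match = BARCODE_RE.match(read)
        if match is None:
            groups["unassigned"].append(read)
        else:
            groups[BARCODES[match.group(1)]].append(read[match.end():])
    return dict(groups)
```

Calling `demultiplex([mask_feature(r, LTR) for r in reads])` on a list of read strings returns a dictionary of sample names mapped to masked, bar-code-trimmed reads, which is the form of input the clustering stage expects in this sketch.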
Redundancy in coverage necessitates the use of clustering to group similar sequence reads before mapping and visualization. Rather than using all-by-all pairwise alignment (Niu et al.) or clustering by alignment to dynamically created contiguous sequence fragments, we cluster individual sequence reads by grouping those reads mapped by Blat (Kent, 2002) alignment to an overlapping region in the reference genome of the host cell. All reads mapping to a common genomic region are assigned to one cluster, and the reads in each cluster are aligned by MUSCLE (Edgar, 2004). A simple majority-voting algorithm is used to create a consensus sequence for each RIS. This RIS sequence is then Blat aligned back to the reference genome. Once a genomic location is confirmed, the exact position of the RIS is defined as the genomic position flanking the proviral LTR in the consensus sequence. Since LTR regulatory regions may influence cellular genes within a large distance of the RIS (Hargrove et al.; Kustikova et al.; Lazo et al.; Sadat et al.), genes located within 300 kb of the RIS are identified and reported. The consensus sequence is used as the basis for visualizing each RIS. We map the location of each vector feature, the bar code and the genomic alignment to the sequence using BioPerl graphics. The names of and distances to the closest genes in both Ensembl and UCSC genome builds are reported, and the raw multiple sequence alignment (MSA) of reads contributing to the RIS is linked (Fig. 1). SeqMap 2.0 allows a user to: (i) upload full sets of 454 pyrosequencing reads, (ii) create savable lists of bar codes and identifiers, (iii) create savable lists of vector features to mask from each read and (iv) identify the appropriate reference genomes to which RISs should be mapped. The rest of the process is completely automated and data are returned to the user through secure login to a saved workspace or by email.
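The clustering and consensus steps described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the SeqMap 2.0 code: the input tuples stand in for parsed Blat hits, the aligned rows stand in for MUSCLE output, and the function names, the gap handling and the tie-breaking rule are hypothetical choices.

```python
from collections import Counter


def cluster_by_overlap(hits):
    """Group reads whose genomic alignments overlap.

    hits: iterable of (read_id, chrom, start, end) tuples with start <= end,
    a hypothetical stand-in for parsed Blat output.
    Returns a list of clusters, each a list of read ids.
    """
    clusters = []
    current, cur_chrom, cur_end = [], None, None
    # Sorting by chromosome and start lets one pass merge overlapping intervals.
    for read_id, chrom, start, end in sorted(hits, key=lambda h: (h[1], h[2], h[3])):
        if current and chrom == cur_chrom and start <= cur_end:
            current.append(read_id)
            cur_end = max(cur_end, end)
        else:
            if current:
                clusters.append(current)
            current, cur_chrom, cur_end = [read_id], chrom, end
    if current:
        clusters.append(current)
    return clusters


def majority_consensus(aligned_rows, gap="-"):
    """Column-wise majority vote over equal-length rows of an MSA.

    Assumptions: columns whose most common symbol is a gap are dropped,
    and ties go to the symbol seen first in the column.
    """
    consensus = []
    for column in zip(*aligned_rows):
        symbol, _count = Counter(column).most_common(1)[0]
        if symbol != gap:
            consensus.append(symbol)
    return "".join(consensus)
```

Because the vote is taken per alignment column, an extra or missing base in a minority of reads, as can occur with homopolymer stutter in pyrosequencing data, does not carry over into the consensus. Clustering on genomic coordinates needs one sort and one pass over the hits, rather than a comparison of every read against every other read.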
Investigators are also able to use SeqMap 2.0 as a collaborative research tool by creating lab workspaces accessible to multiple users. SeqMap 2.0 is available at http://seqmap.compbio.iupui.edu/.
Funding: This work was supported by the National Institutes of Health (P40 RR024928, R01LM009722, T32 HL007910, P01 HL53586); Indiana Clinical and Translational Sciences Institute Bioinformatics and Advanced Information Technology Cores (U54 RR025761).
Conflict of Interest: none declared.
  20 in total

1.  Ligation-mediated PCR of restriction fragments from large DNA molecules.

Authors:  D R Smith
Journal:  PCR Methods Appl       Date:  1992-08

2.  MUSCLE: multiple sequence alignment with high accuracy and high throughput.

Authors:  Robert C Edgar
Journal:  Nucleic Acids Res       Date:  2004-03-19       Impact factor: 16.971

3.  Clonal dominance of hematopoietic stem cells triggered by retroviral gene marking.

Authors:  Olga Kustikova; Boris Fehse; Ute Modlich; Min Yang; Jochen Düllmann; Kenji Kamino; Nils von Neuhoff; Brigitte Schlegelberger; Zhixiong Li; Christopher Baum
Journal:  Science       Date:  2005-05-20       Impact factor: 47.728

4.  High-resolution insertion-site analysis by linear amplification-mediated PCR (LAM-PCR).

Authors:  Manfred Schmidt; Kerstin Schwarzwaelder; Cynthia Bartholomae; Karim Zaoui; Claudia Ball; Ingo Pilz; Sandra Braun; Hanno Glimm; Christof von Kalle
Journal:  Nat Methods       Date:  2007-12       Impact factor: 28.547

5.  An introduction to the Lagan alignment toolkit.

Authors:  Michael Brudno
Journal:  Methods Mol Biol       Date:  2007

6.  New bioinformatic strategies to rapidly characterize retroviral integration sites of gene therapy vectors.

Authors:  F A Giordano; A Hotz-Wagenblatt; D Lauterborn; J-U Appelt; K Fellenberg; K Z Nagy; W J Zeller; S Suhai; S Fruehauf; S Laufs
Journal:  Methods Inf Med       Date:  2007       Impact factor: 2.176

7.  Globin lentiviral vector insertions can perturb the expression of endogenous genes in beta-thalassemic hematopoietic cells.

Authors:  Phillip W Hargrove; Steven Kepes; Hideki Hanawa; John C Obenauer; Deiqing Pei; Cheng Cheng; John T Gray; Geoffrey Neale; Derek A Persons
Journal:  Mol Ther       Date:  2008-01-15       Impact factor: 11.454

8.  Automated analysis of viral integration sites in gene therapy research using the SeqMap web resource.

Authors:  B Peters; S Dirscherl; J Dantzer; J Nowacki; S Cross; X Li; K Cornetta; M C Dinauer; S D Mooney
Journal:  Gene Ther       Date:  2008-06-26       Impact factor: 5.250

9.  Long-distance activation of the Myc protooncogene by provirus insertion in Mlvi-1 or Mlvi-4 in rat T-cell lymphomas.

Authors:  P A Lazo; J S Lee; P N Tsichlis
Journal:  Proc Natl Acad Sci U S A       Date:  1990-01       Impact factor: 11.205

10.  Evi-2, a common integration site involved in murine myeloid leukemogenesis.

Authors:  A M Buchberg; H G Bedigian; N A Jenkins; N G Copeland
Journal:  Mol Cell Biol       Date:  1990-09       Impact factor: 4.272
