
ViralFusionSeq: accurately discover viral integration events and reconstruct fusion transcripts at single-base resolution.

Jing-Woei Li, Raymond Wan, Chi-Shing Yu, Ngai Na Co, Nathalie Wong, Ting-Fung Chan.

Abstract

SUMMARY: Insertional mutagenesis from virus infection is an important pathogenic risk for the development of cancer. Despite the advent of high-throughput sequencing, discovery of viral integration sites and expressed viral fusion events is still limited. Here, we present ViralFusionSeq (VFS), which combines soft-clipping information, read-pair analysis and targeted de novo assembly to discover and annotate viral–human fusions. VFS was used in an RNA-Seq experiment, a simulated DNA-Seq experiment and a re-analysis of published DNA-Seq datasets. Our experiments demonstrated that VFS is both sensitive and highly accurate.

AVAILABILITY: VFS is distributed under GPL version 3 at http://hkbic.cuhk.edu.hk/software/viralfusionseq

Year:  2013        PMID: 23314323      PMCID: PMC3582262          DOI: 10.1093/bioinformatics/btt011

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

Viral infection accounts for 15–20% of human cancers (Morissette and Flamand, 2010). Well-known cancer-associated viruses include human papillomavirus (HPV) and Epstein–Barr virus (EBV), which are present in nearly all cervical cancers (Walboomers et al.) and nasopharyngeal carcinomas (NPC) (Young and Rickinson, 2004), respectively. Hepatitis B virus (HBV) infection is a strong etiologic factor for hepatocellular carcinoma (HCC) worldwide (Chemin and Zoulim, 2009). Some viruses, such as HPV and HBV, commonly integrate into the host genome, where they predispose to genome instability and cancer (Zhao et al.). Nevertheless, the ability to locate viral insertion sites precisely has long been hindered by low-resolution techniques, limiting research into the mutagenic effects of such integrations. Viral DNA in the host can either remain episomal or integrate and induce viral–human fusion transcripts (Schmitz et al.). Integration of HBV can be detected in as many as 90% of HCCs, where clonal expansion of the same integration site has been reported (Cougot et al.; Jiang et al.; Sung et al.). It is hence important not only to identify the sites of genomic integration, but also to discover transcribed viral–human sequences, both of which may have a functional role in tumorigenesis.

SeqMap 2.0, an earlier web-based system, uses pre-defined viral features to locate viral integration sites (Hawkins et al.). Unfortunately, its framework is specific to the 454 sequencing platform and does not address the data-privacy concerns that many users have. In addition, the reliability of the putative fusion breakpoints was not evaluated. More importantly, as HBV has no preferential integration sites in the human genome (Kraus et al.; Wentzensen et al.), the framework could not discover novel viral–human integrations. More recently, VirusSeq was proposed for detecting the presence of viral species in sequence data and for finding viral integration events using discordant Read Pair (RP) information. Through alignments, VirusSeq was able to identify regions of a chromosome that fused with a virus (Chen et al.).

Here, we propose a genome-wide viral fusion discovery and annotation pipeline. Our method resembles CREST (Wang et al.) and ClipCrop (Suzuki et al.), both of which use soft-clipping to identify genomic structural variations. What sets our method apart is its focus on viral integration and its use of viral genome(s) as the primary input. Our unified pipeline, ViralFusionSeq (VFS), discovers viral integration events and expressed fusion transcripts from high-throughput sequencing (HTS) data. The most notable difference between VFS and other tools is that VFS uses both RP and Clipped Sequence (CS) information to find viral fusion events and breakpoints (Supplementary Section S1). Using the latter, VFS is able to resolve fusion breakpoints to single-base resolution. Moreover, VFS is generalized to major sequencing platforms and is applicable to both DNA-Seq and RNA-Seq data.

2 METHODS

2.1 RNA-Seq, simulation experiment and re-analysis of real DNA-Seq data

We performed paired-end transcriptome sequencing of an HBV-infected HCC cell line, HKCI-5a, on an Illumina HiSeq 2000 (Supplementary Section S2). We applied VFS to these RNA-Seq data and validated the results by Sanger sequencing. We then demonstrated VFS on our simulated DNA-Seq dataset and re-analyzed a published DNA-Seq dataset (Sung et al.).

2.2 Discovery of putative fusion events

VFS starts by pre-processing the sequence reads according to the trimming algorithm of the Burrows-Wheeler Aligner (BWA) (Li and Durbin, 2009). Quality-trimmed reads are then mapped onto viral sequences. Sensitive mapping is achieved by (i) viral and human decoy sequences that incorporate different haplotypes or reference assemblies, allowing reads that originate from rather divergent strains or sequenced subjects to be mapped, and (ii) the BWA-SW algorithm implemented in BWA (Li and Durbin, 2010), which is optimized for the increasingly common longer sequencing reads. BWA-SW performs Smith–Waterman local alignment. For viral–human chimeric reads, the viral portion is aligned as mapped sequence (MS), leaving the unaligned human CS as an overhang. These overhangs are soft-clipped, and their sequence is retained in the alignment file (Li et al.). VFS extracts all CS and MS and determines breakpoints using the soft-clipping information. The specificity of mapping to viral sequences is evaluated to avoid false mappings, which might arise when an excessive quantity of reads is mapped onto viral sequences simultaneously. This check, implemented with BLAST, scrutinizes both the MS and the CS for significant matches to non-target species. In the process of identifying fusion partners, read-level analysis is performed (Fig. 1).

Fig. 1. (A) Schematic of read alignment. Fusion breakpoints between viral (grey) and human (white) sequences are identified by soft-clipped alignment. Paired-end reads (diagonal) substantiate the fusion event and assist in transcript reconstruction. (B) Overview of VFS.
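
The use of soft-clipping information to locate a breakpoint can be illustrated with a minimal sketch. It is not the VFS implementation: the function name `soft_clip_breakpoints` is hypothetical, and the convention that the breakpoint is the viral reference coordinate of the mapped base adjacent to the clip is an assumption made for illustration. The input fields (POS, CIGAR, SEQ) follow the SAM format.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM format).
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance the position on the reference sequence.
REF_CONSUMING = set("MDN=X")


def soft_clip_breakpoints(pos, cigar, seq):
    """Return (side, breakpoint, clipped_sequence) tuples for one alignment.

    pos   -- 1-based leftmost mapped position on the viral reference (SAM POS)
    cigar -- SAM CIGAR string, e.g. "60M40S"
    seq   -- read sequence as stored in the SAM record
    """
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    # Hard-clipped bases are absent from SEQ, so they are ignored here.
    ops = [(n, op) for n, op in ops if op != "H"]
    results = []
    if ops and ops[0][1] == "S":
        # Left overhang: the first mapped base sits at POS.
        results.append(("left", pos, seq[:ops[0][0]]))
    if len(ops) > 1 and ops[-1][1] == "S":
        # Right overhang: the last mapped base is POS + reference span - 1.
        ref_span = sum(n for n, op in ops if op in REF_CONSUMING)
        results.append(("right", pos + ref_span - 1, seq[-ops[-1][0]:]))
    return results
```

In this sketch the mapped part of the read corresponds to the MS and the returned clipped sequence to the CS; the CS would then be aligned to the human reference to find the fusion partner.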

2.3 Reliability of fusion breakpoint and annotation ranking

VFS uses a simple yet effective empirical statistical method to evaluate the quality of each fusion breakpoint and to rank fusion annotations. The concept is based on the Minimal Match on Either side of Fusion (MMEF) (Wang et al.). For each fusion event, the reliability is computed directly by MMEF using the following equation:

$$\mathrm{MMEF} = \min\left(L_{len} - L_{mm},\; R_{len} - R_{mm}\right)$$

where $L_{len}$ and $R_{len}$ represent the alignment lengths on the left and right sides of the fusion, and $L_{mm}$ and $R_{mm}$ indicate the numbers of mismatches along the respective alignments. The sub-score of each fusion partner is calculated by subtracting mm from len. The best fusion candidate, the one with the highest sub-score, is selected. The composite MMEF score is the minimum of the two sub-scores. The score becomes higher when the sequence lengths on the two sides of the fusion are more balanced, conferring higher uniqueness to the respective genome or gene-set of the target species. Fusion events are annotated using numerous data sources, including the NCBI Nucleotide and RefSeq databases and human repetitive elements identified by RepeatMasker, which were obtained from the UCSC Genome Browser.
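
As a worked illustration of the scoring described above, the sketch below computes each side's sub-score (alignment length minus mismatches) and takes the minimum of the two as the composite score. The function names and the dictionary keys are hypothetical, and the example numbers are invented for illustration only.

```python
def mmef_score(l_len, l_mm, r_len, r_mm):
    """Composite MMEF score: the smaller of the two per-side sub-scores."""
    left_sub = l_len - l_mm    # sub-score of the left fusion partner
    right_sub = r_len - r_mm   # sub-score of the right fusion partner
    return min(left_sub, right_sub)


def rank_candidates(candidates):
    """Sort candidate fusions (dicts with l_len, l_mm, r_len, r_mm) by score, best first."""
    return sorted(
        candidates,
        key=lambda c: mmef_score(c["l_len"], c["l_mm"], c["r_len"], c["r_mm"]),
        reverse=True,
    )
```

For a 100 bp read, a balanced 50/50 split with no mismatches scores 50, whereas a 90/10 split scores only 10, which is why balanced fusions rank higher.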

2.4 Reconstruction of fusion transcript by paired-end information

VFS is capable of exploiting both CS and RP information to reconstruct fusion transcripts. Fusion breakpoint sequences are used as seeds to perform targeted assembly on RNA-Seq data. VFS executes the RP fusion detection method to identify all read pairs with one end mapped onto the viral genome and the other onto the human genome. Sequences mapped onto the respective genomes are then subjected to targeted de novo assembly; these include (i) CS and their paired mates from the CS module; (ii) reads from the RP module; and (iii) RPs with one end mapped in the vicinity (500 bp) of the human regions reported by the RP module.
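
The read selection described above can be sketched as two small predicates. This is an illustrative sketch rather than the VFS code: the `Mate` record, both function names, and the simplification that every non-viral reference is treated as human are assumptions. Both mates are assumed to be mapped.

```python
from collections import namedtuple

# Minimal mate record: reference name and 1-based mapped position.
Mate = namedtuple("Mate", "ref pos")


def is_viral_human_pair(mate1, mate2, viral_refs):
    """True when exactly one mate maps to a viral reference.

    Assumption: any reference not listed in viral_refs is human.
    """
    return (mate1.ref in viral_refs) != (mate2.ref in viral_refs)


def near_region(mate, region, flank=500):
    """True when a mate maps within `flank` bp of a (chrom, start, end) region."""
    chrom, start, end = region
    return mate.ref == chrom and start - flank <= mate.pos <= end + flank
```

Pairs passing `is_viral_human_pair` correspond to item (ii), and pairs with a mate passing `near_region` for a reported human region correspond to item (iii); together with the CS reads they would form the input to the targeted assembly.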

3 RESULTS AND DISCUSSION

3.1 RNA-Seq experiment

Viral integration in HCC often elicits a transcriptional impact on cancer marker genes, suggesting the importance of expressed fusion transcripts (Jiang et al.). We performed RNA-Seq on HKCI-5a to a depth of 11 Gb. VFS identified three candidate fusion events in HKCI-5a, all of which were successfully validated by Sanger sequencing. We highlight the most complicated fusion transcript, formed between the HBV core gene and a region of human chr7 containing CDHR3 and TRRAP (Supplementary Fig. S1). Sequence data have been deposited in the NCBI Sequence Read Archive under accession SRA061758. Other fusion events will be described elsewhere (manuscript in preparation).

3.2 Simulation experiment

To better understand the two modules that form the basis of VFS, the CS module and the RP module, we conducted a set of simulation experiments. Our aim was to determine the sequencing depth required to identify a fusion event using either or both methods. Synthesized data allow us to know beforehand where the virus has fused with the host chromosome. The simulation experiment showed that VFS is highly sensitive and accurate. While the RP module reports fusion events with accuracy equal to the inner insertion length, the CS method was able to identify 90% of the fusion events within an accuracy of ±3 bp. Combining the two methods gave the best overall performance. Our simulation also determined that a sequencing depth of 10× was sufficient for the detection accuracy to saturate (Supplementary Section S3).
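
One way to express the "within ±3 bp" criterion is as the fraction of simulated breakpoints that have at least one predicted breakpoint inside the tolerance window. The sketch below is an assumption about how such a metric could be computed, not the evaluation code used in the paper; the function name is hypothetical and positions are assumed to lie on a single coordinate axis.

```python
def fraction_within(true_bps, predicted_bps, tol=3):
    """Fraction of true breakpoints with a prediction at most `tol` bp away."""
    if not true_bps:
        return 0.0
    hits = sum(
        1 for t in true_bps
        if any(abs(t - p) <= tol for p in predicted_bps)
    )
    return hits / len(true_bps)
```

With invented positions, true breakpoints at 100, 200, 300 and 400 and predictions at 101, 204 and 298 give a fraction of 0.5, because only the breakpoints at 100 and 300 have a prediction within 3 bp.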

3.3 Re-analysis of real DNA-Seq data

We re-analyzed whole-genome sequencing data of HBV-infected HCC samples using VFS. Two samples (198T and 268T) were randomly chosen (Sung et al.). Remarkably, VFS pinpointed all the exact fusion breakpoints reported by Sung et al. Sung et al. reported viral–human integration events only at the genomic DNA level, and it is currently unknown whether those fusion events are transcribed. In contrast, we generated our own RNA-Seq data on one HBV-infected HCC cell line and identified and validated three fusion events that are actively transcribed. In terms of the number of reported integration events, Sung et al. reported an average of two validated integrations per cell line, which is comparable with our findings (Supplementary Section S4).

4 CONCLUSION

To the best of our knowledge, VFS is the first approach for simultaneously discovering novel viral–human fusion events and reconstructing transcript sequences at single-base resolution. VFS represents a methodological improvement that will aid the discovery of viral integration events and expressed fusion transcripts in diseases with viral integration.

REFERENCES (10 of 19 listed)

1.  Human papillomavirus is a necessary cause of invasive cervical cancer worldwide.

Authors:  J M Walboomers; M V Jacobs; M M Manos; F X Bosch; J A Kummer; K V Shah; P J Snijders; J Peto; C J Meijer; N Muñoz
Journal:  J Pathol       Date:  1999-09       Impact factor: 7.996

2.  VirusSeq: software to identify viruses and their integration sites using next-generation sequencing of human cancer tissue.

Authors:  Yunxin Chen; Hui Yao; Erika J Thompson; Nizar M Tannir; John N Weinstein; Xiaoping Su
Journal:  Bioinformatics       Date:  2012-11-17       Impact factor: 6.937

3.  Characterization of viral-cellular fusion transcripts in a large series of HPV16 and 18 positive anogenital lesions.

Authors:  Nicolas Wentzensen; Ruediger Ridder; Ruediger Klaes; Svetlana Vinokurova; Ulrike Schaefer; Magnus von Knebel Doeberitz
Journal:  Oncogene       Date:  2002-01-17       Impact factor: 9.867

Review 4.  Hepatitis B virus induced hepatocellular carcinoma.

Authors:  I Chemin; F Zoulim
Journal:  Cancer Lett       Date:  2009-01-14       Impact factor: 8.679

5.  The majority of viral-cellular fusion transcripts in cervical carcinomas cotranscribe cellular sequences of known or predicted genes.

Authors:  Irene Kraus; Corina Driesch; Svetlana Vinokurova; Eivind Hovig; Achim Schneider; Magnus von Knebel Doeberitz; Matthias Dürst
Journal:  Cancer Res       Date:  2008-04-01       Impact factor: 12.701

6.  The Sequence Alignment/Map format and SAMtools.

Authors:  Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal:  Bioinformatics       Date:  2009-06-08       Impact factor: 6.937

Review 7.  HBV induced carcinogenesis.

Authors:  Delphine Cougot; Christine Neuveut; Marie Annick Buendia
Journal:  J Clin Virol       Date:  2005-12       Impact factor: 3.168

8.  A statistical method for the detection of alternative splicing using RNA-seq.

Authors:  Liguo Wang; Yuanxin Xi; Jun Yu; Liping Dong; Laising Yen; Wei Li
Journal:  PLoS One       Date:  2010-01-08       Impact factor: 3.240

9.  Fast and accurate short read alignment with Burrows-Wheeler transform.

Authors:  Heng Li; Richard Durbin
Journal:  Bioinformatics       Date:  2009-05-18       Impact factor: 6.937

10.  Fast and accurate long-read alignment with Burrows-Wheeler transform.

Authors:  Heng Li; Richard Durbin
Journal:  Bioinformatics       Date:  2010-01-15       Impact factor: 6.937
