Virus-Clip: a fast and memory-efficient viral integration site detection tool at single-base resolution with annotation capability.

Daniel W H Ho, Karen M F Sze, Irene O L Ng.

Abstract

Viral integration into the human genome upon infection is an important risk factor for various human malignancies. We developed a viral integration site detection tool called Virus-Clip, which uses information extracted from soft-clipped sequencing reads to identify the exact positions of the human and virus breakpoints of integration events. With initial read alignment to the virus reference genome and streamlined procedures, Virus-Clip delivers a simple, fast and memory-efficient solution to viral integration site detection. Moreover, it can automatically annotate the integration events with the corresponding affected human genes. Virus-Clip was verified using whole-transcriptome sequencing data, and its detection was validated to have satisfactory sensitivity and specificity. It showed a marked improvement in performance over existing tools. It is applicable to various types of data, including whole-genome sequencing, whole-transcriptome sequencing, and targeted sequencing. Virus-Clip is available at http://web.hku.hk/~dwhho/Virus-Clip.zip.

Keywords:  breakpoint detection; next-generation sequencing; viral integration; viral integration site detection

Year:  2015        PMID: 26087185      PMCID: PMC4673242          DOI: 10.18632/oncotarget.4187

Source DB:  PubMed          Journal:  Oncotarget        ISSN: 1949-2553


INTRODUCTION

Viral infection is a common risk factor for various human malignancies [1]. Certain viruses, e.g. hepatitis B virus (HBV), can integrate into the human genome upon infection and disrupt gene functions in ways that predispose to carcinogenesis. In the past, PCR-based methods were employed to detect viral integration events, but their limited sensitivity and resolution restricted the efficiency of detection. Recent advances in next-generation sequencing (NGS) have removed this obstacle. Because NGS data sets are large, manual inspection is impractical, which creates strong demand for dedicated tools. Existing tools are useful for identifying viral integration events, but some limitations remain unsolved. For instance, VirusSeq [2] cannot report the exact human and virus breakpoint positions. ViralFusionSeq [3] and VirusFinder [4, 5] involve sophisticated installation procedures and long execution times, which hinder their practical use. In addition, not all existing tools can annotate the affected human genes.

Here, we present our viral integration detection tool, Virus-Clip. Virus-Clip uses the virus genome as the primary read alignment target. It then extracts soft-clipped reads from the alignment and maps the soft-clipped segments (potentially containing sequences of HBV-integrated human loci) to the human genome. Using this mapping information, Virus-Clip can report the human and virus integration breakpoints at single-base resolution. In addition, all integration sites are automatically annotated with the affected human genes and their corresponding gene regions. With streamlined procedures involving minimal steps and tools, Virus-Clip delivers a simple, fast and memory-efficient solution to viral integration site detection (Figure 1). Its execution performance showed a marked improvement over existing tools.
Virus-Clip is available at http://web.hku.hk/~dwhho/Virus-Clip.zip.
Figure 1

Workflow of Virus-Clip

RESULTS

To evaluate the performance of Virus-Clip, we applied it to whole-transcriptome sequencing (RNA-seq) data of two human HBV-associated hepatocellular carcinoma (HCC) samples. The RNA-seq data were generated as 101 bp paired-end reads on the Illumina HiSeq 2000 platform. For comparison, viral integration site detection was also performed with VirusFinder and ViralFusionSeq using default parameters. VirusSeq was not included in the comparison because it cannot report exact breakpoint positions. Raw execution result data are available as Supplementary Data. The tools were compared on the basis of speed, computer resource requirements and viral integration site identification outcome (Table 1).
Table 1

Benchmark result for viral integration site detection

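| Sample | Tool | # of reads (M) | Execution time (min) | # of CPU | Memory used (GB) | # of viral integration events | Key affected human gene |
|---|---|---|---|---|---|---|---|
| HBV-associated HCC 1 | Virus-Clip | 63.8 | 35.9 | 1 | 2.1 | 8 | KMT2B |
| | VirusFinder | | 244.6 | 8 | 15.9 | 1 | |
| | ViralFusionSeq | | Execution failure | | | | |
| HBV-associated HCC 2 | Virus-Clip | 70.0 | 36.4 | 1 | 2.4 | 14 | TERT |
| | VirusFinder | | 259.3 | 8 | 15.8 | 3 | |
| | ViralFusionSeq | | Execution failure | | | | |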

Note: Error encountered during ViralFusionSeq execution
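The resource gap in Table 1 can be made concrete with a little arithmetic. The sketch below is an illustration, not part of Virus-Clip: it copies the Table 1 figures into a dictionary and computes how many times more time, CPUs and memory VirusFinder used than Virus-Clip on each sample. The names `BENCHMARK` and `fold_difference` are hypothetical.

```python
# Figures copied from Table 1: execution time (min), CPUs, and memory (GB).
BENCHMARK = {
    "HBV-associated HCC 1": {
        "Virus-Clip": {"time_min": 35.9, "cpus": 1, "mem_gb": 2.1},
        "VirusFinder": {"time_min": 244.6, "cpus": 8, "mem_gb": 15.9},
    },
    "HBV-associated HCC 2": {
        "Virus-Clip": {"time_min": 36.4, "cpus": 1, "mem_gb": 2.4},
        "VirusFinder": {"time_min": 259.3, "cpus": 8, "mem_gb": 15.8},
    },
}


def fold_difference(sample, metric):
    """Return how many times more of `metric` VirusFinder used than Virus-Clip."""
    tools = BENCHMARK[sample]
    return tools["VirusFinder"][metric] / tools["Virus-Clip"][metric]


# Print one line per sample and metric, e.g. the time ratio for each sample.
for sample in BENCHMARK:
    for metric in ("time_min", "cpus", "mem_gb"):
        print(f"{sample}: {metric} x{fold_difference(sample, metric):.1f}")
```

On both samples VirusFinder took roughly 7 times as long and used roughly 7 times as much memory, while running on 8 CPUs instead of 1.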

Virus-Clip identified 8 and 14 HBV integration sites in the two studied samples respectively, whereas VirusFinder found 1 and 3. ViralFusionSeq failed to execute on our dataset, although it ran to completion on its own example data, suggesting that the failure was not caused by an installation error. In the context of HBV integration into the human genome, locations upstream of the TERT gene and inside the KMT2B gene have been frequently reported in HBV-associated HCC [6]. One of these two key HBV integration events was found in each of the studied samples (KMT2B in HCC 1 and TERT in HCC 2), and both were identified by Virus-Clip as well as by VirusFinder. Both tools were therefore able to identify key viral integration events. Nevertheless, Virus-Clip recovered more supporting soft-clipped reads than VirusFinder: 12 versus 6 for the TERT integration event and 17 versus 8 for the KMT2B integration event.

To further evaluate the sensitivity and specificity of detection by Virus-Clip, we selected 17 HBV integration events supported by at least 1 soft-clipped sequencing read and designed primers that flank the identified HBV integration junctions (Table 2). The validity of the integration events was related to the supporting read count. Most of the events supported by more than 2 soft-clipped sequencing reads (9 of 10, or 90%) were successfully validated, and the validated proportion remained high (10 of 14, or 71.4%) when the threshold was lowered to at least 2 soft-clipped sequencing reads (Figure 2 and Table 2). Even with the stringent threshold of more than 2 soft-clipped sequencing reads, Virus-Clip still reported more HBV integration events than VirusFinder, suggesting that it is the more sensitive of the two. More importantly, the validated proportion was concomitantly high, indicating high specificity, i.e. few false-positive reports.

Based on the empirical data, we recommend a minimum of 2 soft-clipped sequencing reads as a sensible threshold for preliminary filtering of viral integration events reported by Virus-Clip. Taken together, these lines of evidence suggest superior sensitivity and specificity of Virus-Clip, which may allow detection of rarer viral integration events that are supported by fewer sequencing reads. In terms of speed, CPU and memory usage, and the total number of viral integration events identified, Virus-Clip outperformed VirusFinder. Hence, Virus-Clip represents a significant improvement on existing viral integration site identification tools.
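The validated proportions quoted above can be recomputed directly from Table 2. The following sketch is an illustration only: it lists each of the 17 events as a pair (supporting read count, validated) copied from Table 2 and counts how many events pass a given minimum read count. The names `EVENTS` and `validated_at` are hypothetical.

```python
# (supporting read count, validated by PCR and Sanger sequencing), from Table 2.
EVENTS = [
    # HBV-associated HCC 1, lanes 1-8
    (17, True), (7, True), (6, True), (5, False),
    (2, True), (1, False), (1, False), (1, False),
    # HBV-associated HCC 2, lanes 1-9
    (37, True), (15, True), (12, True), (9, True), (7, True),
    (4, True), (2, False), (2, False), (2, False),
]


def validated_at(min_reads, events=EVENTS):
    """Return (validated, total) for events with at least `min_reads` reads."""
    kept = [ok for reads, ok in events if reads >= min_reads]
    return sum(kept), len(kept)


for min_reads in (3, 2, 1):
    validated, total = validated_at(min_reads)
    print(f">= {min_reads} reads: {validated} of {total} "
          f"({100 * validated / total:.1f}%)")
```

A minimum of 3 reads corresponds to "more than 2" in the text and gives 9 of 10. A minimum of 2 gives 10 of 14. With no filtering, 10 of all 17 selected events were validated.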
Table 2

Validation experiment on 17 selected HBV integration events

| Sample | Lane in Figure 2 | Integrated genic region | Affected human gene | Supporting read count | Forward and reverse primers (concatenated) | PCR product size (bp) | Validated |
|---|---|---|---|---|---|---|---|
| HBV-associated HCC 1 | 1 | intronic | KMT2B | 17 | GGAGGAGTTGGGGGAGGAGCTGGAAAGTGTCCAAGGAGG | 162 | Yes |
| | 2 | exonic | KMT2B | 7 | CTCAAGAGAGCCAAAGTGCAGCACACAGAATAGCTTGCCTGAG | 170 | Yes |
| | 3 | splicing | KMT2B | 6 | AATTTGTCCTGGTTATCGCTGGCTCCGGCCACCTCCTCCATCTGC | 141 | Yes |
| | 4 | UTR3 | TJAP1 | 5 | GCAACCTGCTCAACTAGGGCCCCTGCTGGATTACATATCCCATGAAGTTAAGG | 147 | No |
| | 5 | intronic | KMT2B | 2 | AGCAGAAGGTGGCAGCTTCCATGCGGGTCAATGTCCATGCCCCAAAGC | 116 | Yes |
| | 6 | UTR5 | ZNF792 | 1 | TCTCGCAGCGCCGCCGCTGCCATCAGACGGGGAGTCCGCGTAAAGAGAG | 101 | No |
| | 7 | intergenic | - | 1 | CTTTAATTAGTATCTTCTACGGCCATTGATCCGTGTTGG | 101 | No |
| | 8 | exonic | KMT2B | 1 | TGGACTTTCAGCAATGTCAACGGATCTGCTTGACATCCCCGGCCAC | 101 | No |
| HBV-associated HCC 2 | 1 | upstream | TERT | 37 | ATCCCAGTAGAGTAGGAGCAAATACTCAAGAACAGTTTC | 148 | Yes |
| | 2 | upstream | TERT | 15 | GGCGAGAAACTTCTGGGTCTCGCATTTGGTGGTCTGTAAGC | 154 | Yes |
| | 3 | exonic | TERT | 12 | GCTGGATGGGTCGGCGGCGGCAGGAACTTGGCCAGGATC | 150 | Yes |
| | 4 | intergenic | - | 9 | ACCAACATTTGAACAGTCACCTACGGGTCAATGTCCATGCC | 134 | Yes |
| | 5 | exonic | TERT | 7 | GCGGCGTTTTATCATCTTCCGCACAGCCTCTGCAGCACTC | 112 | Yes |
| | 6 | intergenic | - | 4 | GAGTTGGGGGAGGAGATTAGGTTTCTGAGCTCTGTCAAAACGG | 154 | Yes |
| | 7 | UTR5 | NUMB | 2 | GTTTTATCATCTTCCTCTTCATCCTGCTTGAATTGTAACAGTGGCTGC | 132 | No |
| | 8 | intergenic | - | 2 | GGAGATTAGGTTAAAGGTCGCCAAAGTTAAGGACACTCTTGTGAC | 113 | No |
| | 9 | exonic | HP | 2 | CTTTGGAAGAGAAACTGTTCTTGAGGGACTGTGCTGCCTTCATAATGCC | 109 | No |

The table summarizes the details of the events and corresponds to the PCR amplification in Figure 2. Integration events are sorted by supporting read count, and those successfully amplified by PCR and confirmed by Sanger sequencing are marked as validated in the rightmost column.

Figure 2

PCR amplification of selected HBV integration events

The lanes correspond to the integration events listed in Table 2. Order of integration events is sorted according to supporting read count with the leftmost one supported by the most.

DISCUSSION

The availability of NGS technology opens up the possibility of systematic and unbiased examination of viral integration events. Although existing analysis tools can screen NGS data at high resolution, the huge data size places heavy demands on computational resources and leads to long execution times. These obstacles make some of the existing tools unsuitable for analyzing very large whole-genome sequencing (WGS) data sets. Virus-Clip addresses these issues by aligning reads initially to the virus reference genome instead of the human reference genome, and by using streamlined procedures that involve only a few essential tools. Because the virus genome is relatively small, alignment to it is considerably more efficient. Virus-Clip uses BWA-MEM for the initial alignment to the virus genome, SAMtools for extraction of soft-clipped reads, BLASTN for local alignment of the human portion of chimeric fragments to the human genome, and ANNOVAR for annotation. This minimal combination of tools keeps the workflow streamlined. As a result, Virus-Clip substantially shortens the analysis of viral integration events and requires far fewer computational resources. Its installation is also simple, owing to the simple overall workflow. Furthermore, the automatic annotation of integration sites facilitates practical use of the viral integration information obtained.

Therefore, to the best of our knowledge, Virus-Clip is a major advance in viral integration site identification. It provides a simple, fast and memory-efficient way to identify viral integration events at single-base resolution, requires minimal computer resources, and is applicable to various types of NGS data, including WGS, RNA-seq and targeted sequencing. Apart from the RNA-seq data described above, we have also applied Virus-Clip to targeted DNA sequencing data, with similarly satisfactory performance (data not shown). One limitation of Virus-Clip is that it requires a virus reference genome as input and is therefore not applicable when no virus reference genome is available (which is unlikely in most circumstances).

MATERIALS AND METHODS

Implementation of Virus-Clip

Virus-Clip consists of a shell script (Virus_Clip.sh) that executes third-party tools and our own Perl program (virus_clip.pl) (Figure 1). It identifies viral integration sites from soft-clipped sequencing reads that represent chimeric fusions of human and virus genomic sequences, and it accepts both single-end and paired-end sequencing reads in FASTQ format. The procedure involves 3 major steps. First, it maps sequencing reads to the virus reference genome with the Burrows-Wheeler Aligner (BWA-MEM) [7], which is capable of soft-clipped alignment. Because the virus reference genome is far smaller than the human reference genome, this step narrows the search space of the initial alignment and leads to a shorter execution time and lower computational resource use than initial alignment to the human reference genome. Second, using SAMtools [8], it examines the alignment in Sequence Alignment/Map (SAM) format and extracts soft-clipped reads by inspecting the CIGAR string. Other information, such as the mapping position and the aligned sequence, is obtained from the other SAM columns, and everything is stored in a temporary file. Finally, the Perl program virus_clip.pl reads the temporary file and obtains the soft-clipped portions of the reads (potentially including the flanking human genomic sequence at the site of virus integration). It then maps them to the human reference genome with the stand-alone version of BLASTN (available at ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/) using default parameters. The top match (if any) is reported as the location of virus integration. The affected human gene and the affected gene region are annotated with ANNOVAR [9].
The result file (virus_clip.out) reports the human and virus integration loci, the corresponding flanking sequences, the number of supporting soft-clipped reads, and the affected human genes and their regions.
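The soft-clip extraction step can be illustrated with a short sketch. This is not the code of virus_clip.pl, which is written in Perl and is not reproduced here. It is a minimal Python illustration of how the CIGAR string, mapping position and read sequence of one SAM record yield the soft-clipped segments and the adjacent virus coordinate. The function names and the breakpoint convention (the first or last aligned virus base next to the clip) are assumptions made for this example.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM specification).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clips(pos, cigar, seq):
    """Return soft-clipped segments of one alignment to the virus genome.

    pos   -- 1-based leftmost mapped position (SAM column 4)
    cigar -- CIGAR string (SAM column 6)
    seq   -- read sequence (SAM column 10)

    Each result is (side, clipped_sequence, virus_position), where
    virus_position is the aligned virus base adjacent to the clip
    (an assumed convention, not necessarily the one Virus-Clip uses).
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    ops = [(n, op) for n, op in ops if op != "H"]  # hard clips are absent from SEQ
    if not ops:
        return []  # unmapped read or "*" CIGAR
    # Operations that consume reference bases: M, D, N, = and X.
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    clips = []
    if ops[0][1] == "S":
        clips.append(("left", seq[:ops[0][0]], pos))
    if len(ops) > 1 and ops[-1][1] == "S":
        clips.append(("right", seq[-ops[-1][0]:], pos + ref_len - 1))
    return clips


def soft_clips_from_sam_line(line):
    """Apply soft_clips to one tab-separated SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    return soft_clips(int(fields[3]), fields[5], fields[9])
```

For a read with CIGAR `60M41S` mapped at virus position 100, the sketch returns the last 41 bases of the read as a right-hand clip next to virus position 159. In the workflow described above, a clipped segment of this kind is what would then be aligned to the human genome with BLASTN.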

Validation experiment on HBV integration events detected by Virus-Clip

We selected 17 HBV integration events supported by at least 1 soft-clipped sequencing read and designed primers that flank the identified HBV integration junctions (Table 2). To confirm the validity of the PCR amplicons, they were subjected to Sanger sequencing and confirmed to match with the detected chimeric fragment sequences.
REFERENCES

1.  Viruses associated with human cancer.

Authors:  Margaret E McLaughlin-Drubin; Karl Munger
Journal:  Biochim Biophys Acta       Date:  2007-12-23

2.  VirusSeq: software to identify viruses and their integration sites using next-generation sequencing of human cancer tissue.

Authors:  Yunxin Chen; Hui Yao; Erika J Thompson; Nizar M Tannir; John N Weinstein; Xiaoping Su
Journal:  Bioinformatics       Date:  2012-11-17

3.  ViralFusionSeq: accurately discover viral integration events and reconstruct fusion transcripts at single-base resolution.

Authors:  Jing-Woei Li; Raymond Wan; Chi-Shing Yu; Ngai Na Co; Nathalie Wong; Ting-Fung Chan
Journal:  Bioinformatics       Date:  2013-01-12

4.  VirusFinder: software for efficient and accurate detection of viruses and their integration sites in host genomes through next generation sequencing data.

Authors:  Qingguo Wang; Peilin Jia; Zhongming Zhao
Journal:  PLoS One       Date:  2013-05-24

5.  VERSE: a novel approach to detect virus integration in host genomes through reference genome customization.

Authors:  Qingguo Wang; Peilin Jia; Zhongming Zhao
Journal:  Genome Med       Date:  2015-01-20

6.  Genome-wide survey of recurrent HBV integration in hepatocellular carcinoma.

Authors:  Wing-Kin Sung; Hancheng Zheng; Shuyu Li; Ronghua Chen; Xiao Liu; Yingrui Li; Nikki P Lee; Wah H Lee; Pramila N Ariyaratne; Chandana Tennakoon; Fabianus H Mulawadi; Kwong F Wong; Angela M Liu; Ronnie T Poon; Sheung Tat Fan; Kwong L Chan; Zhuolin Gong; Yujie Hu; Zhao Lin; Guan Wang; Qinghui Zhang; Thomas D Barber; Wen-Chi Chou; Amit Aggarwal; Ke Hao; Wei Zhou; Chunsheng Zhang; James Hardwick; Carolyn Buser; Jiangchun Xu; Zhengyan Kan; Hongyue Dai; Mao Mao; Christoph Reinhard; Jun Wang; John M Luk
Journal:  Nat Genet       Date:  2012-05-27

7.  Fast and accurate long-read alignment with Burrows-Wheeler transform.

Authors:  Heng Li; Richard Durbin
Journal:  Bioinformatics       Date:  2010-01-15

8.  The Sequence Alignment/Map format and SAMtools.

Authors:  Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal:  Bioinformatics       Date:  2009-06-08

9.  ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data.

Authors:  Kai Wang; Mingyao Li; Hakon Hakonarson
Journal:  Nucleic Acids Res       Date:  2010-07-03