Qingguo Wang, Peilin Jia, Zhongming Zhao.
Abstract
Next generation sequencing (NGS) technologies allow us to explore virus interactions with host genomes that lead to carcinogenesis or other diseases; however, this effort is largely hindered by the dearth of efficient computational tools. Here, we present a new tool, VirusFinder, for the identification of viruses and their integration sites in host genomes using NGS data, including whole transcriptome sequencing (RNA-Seq), whole genome sequencing (WGS), and targeted sequencing data. VirusFinder's unique features include the characterization of insertion loci of viruses of arbitrary type in the host genome, as well as high accuracy and computational efficiency resulting from its well-designed pipeline. The source code and additional data of VirusFinder are publicly available at http://bioinfo.mc.vanderbilt.edu/VirusFinder/.
Year: 2013 PMID: 23717618 PMCID: PMC3663743 DOI: 10.1371/journal.pone.0064465
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. VirusFinder's pipeline to detect viruses and their integration sites in next generation sequencing (NGS) data.
VirusFinder follows a three-step procedure: (1) preprocessing, (2) virus detection, and (3) virus integration site detection. The current version of VirusFinder (release 1.0) uses UCSC hg19 (http://hgdownload.cse.ucsc.edu/downloads.html#human) as the reference human genome. DB: database. SVs: structural variants (especially inter-chromosomal translocations).
Table 1. Detection of viruses in seven NGS samples using VirusFinder and VirusSeq.
| Sample | Sequencing technology | Virus | #Virus integration sites | VirusFinder | VirusSeq |
| HCC sample 198T | WGS | HBV | 2 | √ | – |
| HCC sample 268T | WGS | HBV | 3 | √ | – |
| HCC cell line HKCI-5a | RNA-Seq | HBV | 3 | √ | × |
| HeLa cervical cancer cell line | RNA-Seq | HPV-18 | 1 | √ | × |
| MCC case 27 | Targeted sequencing | MCV | 1 | √ | √ |
| MCC case 36 | Targeted sequencing | MCV | 2 | √ | √ |
| An Indian patient with fever and acute encephalitis | Targeted sequencing | JEV | 0 | √ | √ |
HCC: hepatocellular carcinoma. MCC: Merkel cell carcinoma. WGS: whole genome sequencing. RNA-Seq: whole transcriptome sequencing. √: detected. ×: failed. –: software did not end within allowable time.
Table 2. Detection of virus integration sites in five NGS samples (a).
| Sample | Integration sites | VirusFinder | VirusSeq | ViralFusionSeq |
| HCC sample 198T | chr5:1,269,387 | chr5:1,269,387 | – | chr5:1,269,387 |
| | chr5:1,269,405 | chr5:1,269,405 | – | chr5:1,269,405 |
| HCC sample 268T | chr5:1,292,391 | chr5:1,292,391 | – | chr5:1,292,391 |
| | chr5:1,292,403 | chr5:1,292,403 | – | chr5:1,292,403 |
| | chr19:30,298,787 | chr19:30,298,787 | – | chr19:30,298,787 |
| HCC cell line HKCI-5 (b) | N/A | chr7:98,532,319 | chr7:98,532,182 | ∼chr7:98,532,184-98,532,285 |
| | N/A | chr16:30,407,194 | chr16:30,408,118 | ∼chr16:30,408,324 |
| MCC case 27 | chr9:121,417,276 | chr9:121,417,092 | chr9:121,417,087 | × |
| MCC case 36 | chr13:99,978,184 | chr13:99,978,244 | chr13:99,977,889 | × |
| | chr13:97,820,362 | chr13:97,820,192 | chr13:97,820,189 | × |
N/A: not available. –: software did not end within allowable time. ×: software failure.
(a) Only the samples in Table 1 that harbor virus integration sites are included in this table. The HeLa cervical cancer cell line was excluded because only a large chromosomal region in 8q24, rather than a precise virus insertion position, was provided for this sample [7].
(b) This sample is the test data of ViralFusionSeq [11]. Its virus integration sites were validated but are not publicly available. ViralFusionSeq outputs human-virus fusion sequences instead of fusion breakpoints. Its predictions of virus integration sites for the first two samples, 198T and 268T, were taken from its published paper. When running ViralFusionSeq on the sample HCC cell line HKCI-5, we obtained intermediate results that indicate a virus integration site around chr16:30,408,324. Its user manual provides another position, chr7:98,532,184-98,532,285, for this sample. Both loci are included here for the purpose of comparison.
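To make the integration-site comparison above easier to read, the following minimal sketch computes how far each predicted breakpoint lies from the validated site for the two MCC samples, the only rows where a validated position is listed and the predictions are not exact matches. The absolute distance in base pairs is an illustrative metric chosen here, not one reported in the paper, and the names `SITES` and `offsets` are hypothetical.

```python
# Coordinates copied from the integration-site table (MCC samples only).
# Each entry: (sample, chromosome, validated site, VirusFinder, VirusSeq).
SITES = [
    ("MCC case 27", "chr9", 121_417_276, 121_417_092, 121_417_087),
    ("MCC case 36", "chr13", 99_978_184, 99_978_244, 99_977_889),
    ("MCC case 36", "chr13", 97_820_362, 97_820_192, 97_820_189),
]


def offsets(sites):
    """Return (sample, chrom, VirusFinder distance, VirusSeq distance) in bp.

    Distance is the absolute difference between a predicted breakpoint
    and the validated integration site on the same chromosome.
    """
    return [
        (sample, chrom, abs(vf - ref), abs(vs - ref))
        for sample, chrom, ref, vf, vs in sites
    ]


for sample, chrom, d_vf, d_vs in offsets(SITES):
    print(f"{sample} {chrom}: VirusFinder {d_vf} bp, VirusSeq {d_vs} bp from validated site")
```

Under this reading, both tools place every MCC breakpoint within a few hundred base pairs of the validated site.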
Table 3. Comparison of computation time of three virus integration-detecting methods on whole genome sequencing (WGS) data.
| Sample | Coverage | VirusFinder #CPUs | VirusFinder time (days) | ViralFusionSeq #CPUs | ViralFusionSeq time (days) | VirusSeq #CPUs | VirusSeq time (days) |
| 26T | 65.5× | 8 | 3.1 | 8 | 17.8 | 8 | >7 |
| 71T | 32.2× | 8 | 1.9 | 8 | 11.5 | 8 | >7 |
| 106T | 44.8× | 8 | 2.4 | 8 | 17.1 | 8 | >12.5 |
| 180N | 121.2× | 8 | 7.3 | 8 | >17.4 | 8 | >12.5 |
| 186T | 36.5× | 8 | 2.0 | 8 | 13.0 | 8 | >12.5 |
| 198T | 34.4× | 8 | 1.8 | 8 | 10.8 | 8 | >12.5 |
| 200N | 32.6× | 8 | 1.9 | 8 | 11.5 | 8 | >12.5 |
| 200T | 31.7× | 8 | 2.0 | 8 | 12.5 | 8 | >12.5 |
| 268N | 40.7× | 8 | 2.7 | 8 | 14.5 | 8 | >12.5 |
| 268T | 34.1× | 8 | 2.0 | 8 | 13.5 | 8 | >9.9 |
| Average | | | 2.7 | | 14.0 | | >11.1 |
The computation time of the three methods on these samples was measured on the Vanderbilt Advanced Computing Center for Research & Education (ACCRE, http://www.accre.vanderbilt.edu/), with the same configuration of CPUs in each node.
ViralFusionSeq did not terminate successfully on sample 180N.
We attempted to run VirusSeq three times on these WGS samples. The first trial failed because the size of its intermediate files exceeded our cluster quota. After obtaining more space, we reran VirusSeq. Because we had not initially anticipated the long computation time of VirusSeq on WGS samples, all our jobs were killed by the server after running non-stop for a whole week, having exceeded their allocated time. In our latest trial of VirusSeq, on February 13, 2013, we requested 35 GB of memory, 8 CPUs, and 30 days for each job and resubmitted our jobs to ACCRE. Seven jobs were scheduled to run first. After twelve and a half days, all these jobs were killed by an unexpected internal network outage at ACCRE. Although we were not able to make VirusSeq terminate successfully on these WGS samples because of its computational cost, the lower bounds on its running time indicate that VirusFinder runs much faster than VirusSeq.
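The averages in the timing table can be re-derived from the per-sample values, as in the short sketch below. It recomputes the mean running time of VirusFinder and ViralFusionSeq and their ratio. The ratio is derived here from the table and is not a figure stated in the paper. Because ViralFusionSeq's value for 180N is only a lower bound (>17.4), its mean and the ratio are lower bounds too. VirusSeq is left out since none of its runs finished. The names `VIRUSFINDER_DAYS`, `VIRALFUSIONSEQ_DAYS`, and `speedup` are hypothetical.

```python
from statistics import mean

# Running time in days per sample, in table order (26T ... 268T).
VIRUSFINDER_DAYS = [3.1, 1.9, 2.4, 7.3, 2.0, 1.8, 1.9, 2.0, 2.7, 2.0]
# The 180N entry (17.4) is a lower bound: that run did not terminate.
VIRALFUSIONSEQ_DAYS = [17.8, 11.5, 17.1, 17.4, 13.0, 10.8, 11.5, 12.5, 14.5, 13.5]


def speedup(baseline_days, candidate_days):
    """Ratio of mean running times: how many times faster the candidate is."""
    return mean(baseline_days) / mean(candidate_days)


print(f"VirusFinder mean:     {mean(VIRUSFINDER_DAYS):.1f} days")
print(f"ViralFusionSeq mean: >{mean(VIRALFUSIONSEQ_DAYS):.1f} days")
print(f"Speed-up:            >{speedup(VIRALFUSIONSEQ_DAYS, VIRUSFINDER_DAYS):.1f}x")
```

On these ten samples the recomputed means match the table's Average row, and VirusFinder comes out at least about five times faster than ViralFusionSeq.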