Saurabh Baheti, Xiaojia Tang, Daniel R O'Brien, Nicholas Chia, Lewis R Roberts, Heidi Nelson, Judy C Boughey, Liewei Wang, Matthew P Goetz, Jean-Pierre A Kocher, Krishna R Kalari.
Abstract
BACKGROUND: The transfer of genetic material from microbes or viruses into the host genome is known as horizontal gene transfer (HGT). The integration of viruses into the human genome is associated with multiple cancers, and such integration events can now be detected using next-generation sequencing methods such as whole-genome sequencing and RNA-sequencing.
Keywords: Horizontal gene transfer; Next-generation sequencing; RNA-Seq – Cancer; Viral integration; Whole-genome sequencing
Year: 2018 PMID: 30016933 PMCID: PMC6050683 DOI: 10.1186/s12859-018-2260-9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Overview of the HGT-ID workflow
Fig. 2 Diagram of HGT event and breakpoint identification. (a) The search starts with clustered discordant read pairs. Reads that fall within a search window of twice the library size around the cluster are extracted. (b) If soft-clipped reads are available, an exact integration site can be inferred. (c) If only discordant read pairs are available, only an approximate integration site can be inferred
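The logic described in the Fig. 2 caption can be illustrated with a minimal Python sketch. It is not the HGT-ID implementation. The function names, the choice to pad the cluster by twice the library size on each side, the use of the most frequent clip position as the exact site, and the cluster midpoint as the approximate site are all assumptions made for illustration.

```python
from collections import Counter
from typing import Sequence, Tuple


def search_window(cluster_start: int, cluster_end: int, library_size: int) -> Tuple[int, int]:
    """Return a window around a discordant-read cluster.

    Assumption: "twice the library size around the cluster" is read here as
    padding the cluster by 2 * library_size on each side.
    """
    pad = 2 * library_size
    return max(0, cluster_start - pad), cluster_end + pad


def infer_integration_site(
    cluster_start: int,
    cluster_end: int,
    clip_positions: Sequence[int],
) -> Tuple[int, str]:
    """Classify an integration site as exact or approximate.

    clip_positions: reference coordinates where soft-clipped reads are clipped
    (hypothetical input; how these are extracted is not shown here).
    """
    if clip_positions:
        # Assumption: the most frequently observed clip position is the breakpoint.
        position, _count = Counter(clip_positions).most_common(1)[0]
        return position, "exact"
    # Assumption: with discordant pairs only, report the cluster midpoint.
    return (cluster_start + cluster_end) // 2, "approximate"


window = search_window(1000, 1200, library_size=300)
with_clips = infer_integration_site(1000, 1200, [1187, 1187, 1190])
without_clips = infer_integration_site(1000, 1200, [])
```

With soft-clipped evidence the sketch returns a single coordinate labelled "exact"; without it, the result is only a rough location inside the cluster.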
Sample sets that were used to validate the performance of HGT-ID
| Sample Set | Possible Virus | Data type | No. of Samples | Ref |
|---|---|---|---|---|
| 1. Cervical cell lines and cervical carcinoma | Human papillomavirus | WGS | 4 WGS | [ |
| 2. Hepatocellular carcinoma | Hepatitis B virus | WGS | 13 WGS | [ |
| 3. TCGA Breast invasive carcinoma | NA | WGS | 220 WGS | [ |
| 4. Hepatocellular carcinoma | Hepatitis B virus | WGS + RNA-Seq | 7 WGS + 7 RNA-Seq | |
Performance comparison of HGT-ID, BATVI and VirusFinder2 on simulated data
| Coverage (×) | HGT-ID TP | HGT-ID FP | BATVI TP | BATVI FP | VirusFinder2 TP | VirusFinder2 FP |
|---|---|---|---|---|---|---|
| 40 | 249 | 16 | 244 | 52 | 234 | 3 |
| 30 | 249 | 16 | 244 | 40 | 234 | 1 |
| 20 | 249 | 14 | 246 | 24 | 220 | 4 |
| 10 | 249 | 8 | 246 | 11 | 206 | 2 |
| 4 | 249 | 6 | 230 | 6 | 121 | 1 |
| 2 | 237 | 20 | 190 | 2 | 40 | 2 |
Fig. 3 ROC curves for HGT-ID on the simulated data at different coverages. Each colored line corresponds to one coverage. The false positive ratio (FPR) was calculated as the number of false positives divided by the total number of identified HGT events. The true positive rate (TPR) was calculated as the number of true positives divided by the total number of positives. Coverage was down-sampled from 40× to 30×, 20×, 10×, 4× and 2×
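The two ratios defined in the Fig. 3 caption can be computed from the HGT-ID columns of the performance table, as in the following minimal Python sketch. It assumes that the "total number of identified HGT events" equals TP + FP. The total number of positives in the simulation is not given in this record, so it is left as a caller-supplied parameter. The function names are illustrative.

```python
def false_positive_ratio(tp: int, fp: int) -> float:
    """FP divided by all identified events (assumed to be TP + FP)."""
    identified = tp + fp
    return fp / identified if identified else 0.0


def true_positive_rate(tp: int, total_positives: int) -> float:
    """TP divided by the total number of positives.

    total_positives must be supplied by the caller; it is not stated here.
    """
    if total_positives <= 0:
        raise ValueError("total_positives must be positive")
    return tp / total_positives


# HGT-ID (TP, FP) per coverage, transcribed from the performance table.
hgt_id_counts = {40: (249, 16), 30: (249, 16), 20: (249, 14),
                 10: (249, 8), 4: (249, 6), 2: (237, 20)}

fpr_by_coverage = {cov: false_positive_ratio(tp, fp)
                   for cov, (tp, fp) in hgt_id_counts.items()}

for cov, fpr in fpr_by_coverage.items():
    print(f"{cov}x: FPR = {fpr:.3f}")
```

Note that a ratio of false positives to all identified events is what is more commonly called the false discovery proportion; the sketch follows the caption's definition as written.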
All 11 viral integration sites identified in whole genome sequencing data from two HPV-positive cell lines (SiHa and HeLa) and two cervical carcinomas (T4931 and T6050) using HGT-ID
| Sample ID (coverage) | Affected Gene | Function of integration site | Integrated Position | Score | Reported and validated (a) | Identified by VirusFinder 2.0 |
|---|---|---|---|---|---|---|
| HELA (40×) | CCAT1 | intron | chr8 128,230,630 | 1273.7 | yes | yes |
| | CCAT1 | upstream | chr8 128,233,368 | 121.2 | yes | no |
| | CCAT1 | upstream | chr8 128,234,256 | 180.3 | yes | no |
| | CCAT1 | upstream | chr8 128,241,549 | 235.7 | yes | yes |
| SIHA (37×) | KLF12 | downstream | chr13 74,087,563 | 158.0 | yes | yes |
| | KLF12 | downstream | chr13 73,788,864 | 136.4 | yes | yes |
| T4931 (41×) | GLI2 | intron | chr2 121,670,164 | 2.4 | yes | yes |
| | GLI2 | intron | chr2 121,687,141 | 213.4 | yes | no |
| | GLI2 | intron | chr2 121,688,179 | 48.9 | yes | no |
| T6050 (42×) | KLF12 | downstream | chr13 74,230,820 | 305.1 | yes | no |
| | KLF12 | downstream | chr13 74,231,436 | 342.2 | yes | yes |
(a) Reported and validated in the original paper [28]
Validation of the integration sites in HPV data
| Sample ID and coverage | Affected genes | Function of integration site | Integration breakpoints in the human genome | Integration breakpoints in HBV virus | Score | Identified by HGT-ID? |
|---|---|---|---|---|---|---|
| 145 T (37×) | CCNE1 | intron | chr19: 30303492 | 1053 | 87.2 | yes |
| | CCNE1 | intron | chr19: 30303498 | 1819 | 87.2 | yes |
| 177 T (43×) | SENP5 | intron | chr3: 196625752* | 1827* | – | no |
| 180 N (121×) | FN1 | intron | chr2: 216280279 | 1822 | 11.9 | yes |
| 186 T (36×) | KMT2B | exon | chr19: 36214005 | 2448 | 206.2 | yes |
| | KMT2B | exon | chr19: 36214017 | 1605 | 206.2 | yes |
| 198 T (34×) | TERT | intron | chr5: 1269387 | 821 | 137.5 | yes |
| | TERT | intron | chr5: 1269405 | 1950 | 137.5 | yes |
| 26 T (66×) | DUX4 | intron | chr18: 107920* | 670* | – | no |
| 200 T (32×) | CCNE1 | exon | chr19: 30315003 | 1798 | 51.4 | yes |
| | CCNE1 | downstream | chr19: 30315365 | 316 | 222.751 | yes |
| 268 T (34×) | CCNE1 | upstream | chr19: 30298787 | 1931 | 155.2 | yes |
| | TERT | intron | chr5: 1291758 | 3175 | 134.3 | yes |
| | TERT | intron | chr5: 1292403 | 354 | 134.3 | yes |
| 43 T (33×) | SENP5 | intron | chr3: 196625710* | 1910* | – | no |
| 46 T (32×) | TERT | upstream | chr5: 1295367 | 751 | 34.4 | yes |
| 70 T (114×) | KMT2B | exon | chr19: 36212331 | 1931 | 1015.6 | yes |
| | KMT2B | exon | chr19: 36212311 | 227 | 1015.6 | yes |
| 71 T (32×) | SENP5 | intron | chr3: 196625776* | 417* | – | no |
| | KMT2B | intron | chr19: 36213141 | 1884 | 10 | yes |
| | KMT2B | intron | chr19: 36213136 | 619 | 10 | yes |
| 95 T (35×) | KMT2B | exon | chr19: 36212564 | 2240 | 27.3 | yes |
Eighteen of 22 previously experimentally validated viral integration sites identified in sequencing data from 13 HBV-positive hepatocellular carcinoma samples using the HGT-ID algorithm. Integration breakpoints of the four missing events (noted with *) were obtained from the original publication [10]
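As a quick check on the "eighteen of 22" figure, the following Python sketch tallies the "Identified by HGT-ID?" column of the table above and derives the fraction of validated sites that were recovered. The per-sample flags are transcribed from the table; the variable names are illustrative only.

```python
# One boolean per validated integration site, grouped by sample ID,
# transcribed from the "Identified by HGT-ID?" column (yes -> True).
identified_by_sample = {
    "145 T": [True, True],
    "177 T": [False],
    "180 N": [True],
    "186 T": [True, True],
    "198 T": [True, True],
    "26 T": [False],
    "200 T": [True, True],
    "268 T": [True, True, True],
    "43 T": [False],
    "46 T": [True],
    "70 T": [True, True],
    "71 T": [False, True, True],
    "95 T": [True],
}

n_samples = len(identified_by_sample)
n_sites = sum(len(flags) for flags in identified_by_sample.values())
n_detected = sum(sum(flags) for flags in identified_by_sample.values())
sensitivity = n_detected / n_sites

print(f"{n_detected}/{n_sites} sites in {n_samples} samples "
      f"(sensitivity {sensitivity:.3f})")
```

The tally agrees with the caption: 13 samples, 22 validated sites, 18 of them detected.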
Viral HGT events detected by HGT-ID algorithm between paired TCGA HCC tumor and normal samples via WGS and RNA-Seq datasets
| Sample ID | WGS-T | WGS-N | RNA-T | RNA-N | Common HGT |
|---|---|---|---|---|---|
| TCGA-BW-A5NP | 11 | 0 | 2 | NA | 0 |
| TCGA-CC-5262 | 3 | 0 | 2 | NA | 2 |
| TCGA-CC-A1HT | 5 | 0 | 2 | NA | 2 |
| TCGA-DD-A1EH | 0 | 0 | 0 | 2 | NA |
| TCGA-DD-A1EI | 2 | 0 | 1 | 2 | 1 |
| TCGA-DD-A1EL | 17 | 0 | 1 | 2 | 1 |
| TCGA-G3-A3CK | 4 | 0 | 0 | NA | 0 |
T stands for primary solid tumor and N for matched solid normal tissue. Only 3 of the 7 patients had RNA-Seq data for matched normal tissue. The “Common HGT” column contains the number of events that were identified in both WGS and RNA-Seq for the primary tumor (T)
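One way to count events shared between the WGS and RNA-Seq calls of a tumor is sketched below in Python. How HGT-ID results were actually matched is not described in this record, so the matching rule (same chromosome, same virus, human breakpoints within a tolerance), the `tolerance` default, the `Event` type and the example coordinates are all hypothetical.

```python
from typing import List, NamedTuple


class Event(NamedTuple):
    chrom: str   # human chromosome
    pos: int     # human breakpoint coordinate
    virus: str   # integrated virus name


def count_common(wgs: List[Event], rna: List[Event], tolerance: int = 1000) -> int:
    """Count WGS events that have a not-yet-matched RNA-Seq partner.

    Assumption: two events are "the same" if chromosome and virus agree and
    the human breakpoints lie within `tolerance` bases of each other.
    """
    used = set()
    common = 0
    for w in wgs:
        for i, r in enumerate(rna):
            if i in used:
                continue
            if w.chrom == r.chrom and w.virus == r.virus and abs(w.pos - r.pos) <= tolerance:
                used.add(i)
                common += 1
                break
    return common


# Made-up coordinates, for illustration only.
wgs_events = [Event("chr5", 1_295_000, "HBV"), Event("chr19", 36_212_000, "HBV"),
              Event("chr2", 500_000, "HBV")]
rna_events = [Event("chr5", 1_295_400, "HBV"), Event("chr19", 36_212_300, "HBV")]

n_common = count_common(wgs_events, rna_events)
```

Each RNA-Seq event is matched at most once, so the count cannot exceed the smaller of the two call sets.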