| Literature DB >> 29922260 |
Krithika Bhuvaneshwar1, Lei Song1, Subha Madhavan1, Yuriy Gusev1.
Abstract
An estimated 17% of cancers worldwide are associated with infectious causes. The extent and biological significance of viral presence/infection in actual tumor samples is generally unknown but could be measured using human transcriptome (RNA-seq) data from tumor samples. We present an open source bioinformatics pipeline viGEN, which allows for not only the detection and quantification of viral RNA, but also variants in the viral transcripts. The pipeline includes 4 major modules: The first module aligns and filter out human RNA sequences; the second module maps and count (remaining un-aligned) reads against reference genomes of all known and sequenced human viruses; the third module quantifies read counts at the individual viral-gene level thus allowing for downstream differential expression analysis of viral genes between case and controls groups. The fourth module calls variants in these viruses. To the best of our knowledge, there are no publicly available pipelines or packages that would provide this type of complete analysis in one open source package. In this paper, we applied the viGEN pipeline to two case studies. We first demonstrate the working of our pipeline on a large public dataset, the TCGA cervical cancer cohort. In the second case study, we performed an in-depth analysis on a small focused study of TCGA liver cancer patients. In the latter cohort, we performed viral-gene quantification, viral-variant extraction and survival analysis. This allowed us to find differentially expressed viral-transcripts and viral-variants between the groups of patients, and connect them to clinical outcome. From our analyses, we show that we were able to successfully detect the human papilloma virus among the TCGA cervical cancer patients. We compared the viGEN pipeline with two metagenomics tools and demonstrate similar sensitivity/specificity. We were also able to quantify viral-transcripts and extract viral-variants using the liver cancer dataset. The results presented corresponded with published literature in terms of rate of detection, and impact of several known variants of HBV genome. This pipeline is generalizable, and can be used to provide novel biological insights into microbial infections in complex diseases and tumorigeneses. Our viral pipeline could be used in conjunction with additional type of immuno-oncology analysis based on RNA-seq data of host RNA for cancer immunology applications. The source code, with example data and tutorial is available at: https://github.com/ICBI/viGEN/.Entities:
Keywords: RNA-seq; TCGA; cancer immunology; liver cancer; next-generation sequencing; variant analysis; viral detection
Year: 2018 PMID: 29922260 PMCID: PMC5996193 DOI: 10.3389/fmicb.2018.01172
Source DB: PubMed Journal: Front Microbiol ISSN: 1664-302X Impact factor: 5.640
Comparison of existing pipelines that detect viruses from human transcriptome data.
| Virana (Schelhorn et al., | Yes | Identifies microbial transcripts, does not quantify | Both | No | Yes | Also offers analysis of homologs |
| VirusSeq (Chen et al., | Yes | No | Both | No | Yes | Pre-designed to work on a select set of 18 viruses |
| Viral Fusion Seq (Li et al., | Yes | No | Both | No | Yes | Can also detect fusion events |
| Virus Finder (Wang et al., | Yes | No | Both | No | Yes | Can be applied to samples infected with undiagnosed viruses |
| PathSeq (Kostic et al., | Yes | No | Both | No | No | |
| RINS (Bhaduri et al., | Yes | No | RNA-seq | No | No | Generates contigs with these non-human sequences |
| Kraken (Wood and Salzberg, | Yes | No | No | No | No | Metagenomic analysis tool |
| Cetrifuge (Kim et al., | Yes | No | No | No | No | Metagenomic analysis tool |
| viGEN (our pipeline) | Yes | Yes | Both | Yes | No |
Figure 1viGEN pipeline. Each module has a color, shown in the legend.
Figure 2The HPV viruses detected in cervical cancer patients using the viGEN pipeline.
Estimation of sensitivity and specificity for HPV-16 detection in TCGA cervical cancer samples using the viGEN pipeline.
| RNA-seq from tumor tissue (output of algorithm) | Positive | 10 | 4 | 14 |
| Negative | 2 | 6 | 8 | |
| Total | 12 | 10 | 22 | |
Estimation of sensitivity and specificity for HPV-18 detection in TCGA cervical cancer samples using the viGEN pipeline.
| RNA-seq from tumor tissue (output of algorithm) | Positive | 3 | 1 | 4 |
| Negative | 1 | 17 | 18 | |
| 4 | 18 | 22 | ||
Virus species detected in at-least 5 samples.
| Hepatitis C virus |
| Tick-borne encephalitis virus |
| Hepatitis C virus genotype 2 |
| Cutthroat trout virus |
| Human endogenous retrovirus K113 |
| Lunk virus NKS-1 |
| Hepatitis C virus genotype 6 |
| Hepatitis B virus |
Differential expression analysis of transcript level read counts Liver cancer dataset comparing Dead and Alive samples.
| NC_003977.1_gene_1814_2452 | 1.128 | 13.449 | 1.71 | 0.000243 | Hepatitis B virus | Contains Gene C that produces pre-code protein external core antigen; HBeAg. HBeAg is produced by proteolytic processing of the pre-core protein |
| NC_003977.1_CDS_1814_2452 | 1.128 | 13.449 | 1.71 | 0.000243 | Hepatitis B virus | Contains Gene C that produces pre-code protein external core antigen; HBeAg. |
| NC_003977.1_CDS_1901_2452 | 0.828 | 12.42 | 0.000507 | 0.053928 | Hepatitis B virus | Contains Gene C, encodes core antigen HBcAg |
| NC_003977.1_CDS_155_835 | −2.121 | 16.335 | 1.67 | 1.11 | Hepatitis B virus | Encodes Gene S that produces small envelope protein, S protein; S glycoprotein; S-HBsAg, |
| NC_003977.1_gene_2307_4838 | −2.133 | 12.655 | 2.61 | 1.11 | Hepatitis B virus | Gene P, encodes protein P |
| NC_003977.1_CDS_2307_4838 | −2.133 | 12.655 | 2.61 | 1.11 | Hepatitis B virus | Gene P, encodes protein P |
| NC_003977.1_CDS_3205_4050 | −2.352 | 8.67 | 1.93 | 6.84 | Hepatitis B virus | Gene S, encodes middle envelope protein pre-S2/S |
| NC_003977.1_gene_2848_4050 | −2.75 | 11.741 | 5.84 | 6.20 | Hepatitis B virus | Encodes Gene S that produces a large surface protein/L glycoprotein/L-HBsAG |
| NC_003977.1_CDS_2848_4050 | −2.75 | 11.741 | 5.84 | 6.20 | Hepatitis B virus | Encodes Gene S that produces a large surface protein/L glycoprotein/L-HBsAG |
| NC_003977.1_CDS_155_835 | −2.121 | 16.335 | 1.67 | 1.11 | Hepatitis B virus | Encodes Gene S that produces small envelope protein, S protein; S glycoprotein; S-HBsAg, |
| NC_001405.1_intron_9724_12307 | 2.527 | 6.463 | 4.02 | 1.22 | Human mastadenovirus C | Gene = L1, locus_tag = HAdVC_gp10, note = precedes capsid protein precursor pIIIa CDS |
| NC_001405.1_intron_9724_11039 | 2.5 | 6.451 | 6.65 | 1.28 | Human mastadenovirus C | Gene = L1, locus_tag = HAdVC_gp10, note = precedes encapsidation protein 52K CDS |
| NC_001405.1_gene_10866_11023 | 2.5 | 6.451 | 6.65 | 1.28 | Human mastadenovirus C | Gene = VAII, locus_tag = HAdVC_gs02, GeneID:2653002 |
| NC_001405.1_transcript_10866_11023 | 2.5 | 6.451 | 6.65 | 1.28 | Human mastadenovirus C | Gene = VAII, locus_tag = HAdVC_gs02, GeneID:2653002 |
| NC_001405.1_exon_10866_11023 | 2.5 | 6.451 | 6.65 | 1.28 | Human mastadenovirus C | Gene = VAII, locus_tag = HAdVC_gs02, GeneID:2653002 |
| NC_001405.1_intron_10580_14015 | 2.428 | 6.475 | 9.24 | 1.64 | Human mastadenovirus C | Gene = E2B, locus_tag = HAdVC_gp04 |
| NC_022518.1_STS_7174_7323 | −0.992 | 9.122 | 0.000309 | 0.034527 | Human endogenous retrovirus K113 | Sequence-tagged site (STS), locus_tag = Q779_gp1, standard_name = D6S2277, UniSTS:59918 |
| NC_022518.1_STS_5100_5381 | −1.051 | 9.532 | 0.000118 | 0.0139 | Human endogenous retrovirus K113 | Sequence-tagged site (STS), standard_name = D22S1651, UniSTS: 474031 |
| NC_022518.1_region_1112_6746 | −1.186 | 13.022 | 3.49 | 0.000463 | Human endogenous retrovirus K113 | gag-pro-pol; two−1 frameshifts predicted to occur to produce a fusion protein; the location of frameshifts has not been determined |
| NC_018464.1_region_1_927 | −1.288 | 12.784 | 5.78 | 9.45 | Shamonda virus | mol_type = genomic RNA, isolate = Ib An 5550, taxon:159150, segment = S |
| NC_002645.1_gene_293_20568 | −2.598 | 6.126 | 3.93 | 0.004911 | Human coronavirus 229E | locus_tag = HCoV229Egp1, GeneID: 918764, replicase polyprotein 1ab |
These results shown used the viral-gene data obtained from Module 1 (using alignment tool Bowtie2) + Module 3. The table shows results with q value <0.06 and sorted based on LogFC in the descending order. Table 4A shows transcript level read counts in the Hepatitis B virus while Table 4B shows transcript level read counts in other species.
Cox proportional hazard survival analysis (across 25 HepB samples and 25 HepB + HepC Samples).
| coxph(formula = survObject ~ NC_003977.1_CDS_2848_4050 + NC_003977.1_CDS_1814_2452) | |||||
| (13 observations deleted due to missingness) | |||||
| NC_003977.1_CDS_2848_4050 | −0.5548 | 0.5742 | 0.7434 | −0.746 | 0.456 |
| NC_003977.1_CDS_1814_2452 | 0.5302 | 1.6993 | 0.6145 | 0.863 | 0.388 |
| NC_003977.1_CDS_2848_4050 | 0.5742 | 1.7415 | 0.1337 | 2.465 | |
| NC_003977.1_CDS_1814_2452 | 1.6993 | 0.5885 | 0.5096 | 5.667 | |
| Concordance = 0.654 (se = 0.188) | |||||
| Rsquare = 0.12 (max possible = 0.329) | |||||
| Likelihood ratio test = 4.74 on 2 df, | |||||
| Wald test = 0.75 on 2 df, | |||||
| Score (logrank) test = 10.58 on 2 df, | |||||
These results shown used the viral-gene/CDS data obtained from Module 1 (using alignment tool Bowtie2) + Module 3. Coef: coefficient (Beta) of the model; exp(coef): Hazard Ratio; se(coef): Standard Error; Pr(>|z|): P-value
(a) The Cox PH model shows that assuming other covariant to be constant, unit increase in expression of this region NC_003977.1_CDS_1814_2452, increases the hazard of event (death) by 70%.
(b) On the other hand, that assuming other covariant to be constant, unit increase in expression of this region NC_003977.1_CDS_2848_4050, decreases the hazard of event (death) by 43%.
(c) The overall model is significant with p-value < 0.05 from the Log rank test (also called Score test).
Results of case-control association test applied on the results from viral variant calling.
| gi|21326584|ref|NC_003977.1| | Hepatitis B virus | 1479 | C | 4 | 0 | A | 0.02857 | Gene = X, |
| gi|21326584|ref|NC_003977.1| | Hepatitis B virus | 2573 | C | 0 | 6 | T | 0.03571 | Gene = P, |
| gi|21326584|ref|NC_003977.1| | Hepatitis B virus | 2651 | T | 4 | 0 | C | 0.00476 | product = polymerase, |
| gi|21326584|ref|NC_003977.1| | Hepatitis B virus | 2813 | C | 2 | 0 | T | 0.03571 | protein_id = NP_647604.2 |
| gi|21326584|ref|NC_003977.1| | Hepatitis B virus | 2990 | T | 2 | 0 | A | 0.02222 | Gene = P, product = polymerase, |
| gi|21326584|ref|NC_003977.1| | Hepatitis B virus | 2997 | C | 2 | 0 | T | 0.03571 | protein_id = NP_647604.2; |
| gi|21326584|ref|NC_003977.1| | Hepatitis B virus | 3105 | C | 2 | 0 | A | 0.02222 | Gene = S, product = large envelope |
| gi|21326584|ref|NC_003977.1| | Hepatitis B virus | 3156 | G | 4 | 0 | A | 0.00476 | protein, protein_id = YP_355333.1 |
| gi|21326584|ref|NC_003977.1| | Hepatitis B virus | 1979 | G | 2 | 0 | A | 0.03571 | Gene = C, product = pre-capsid protein, |
| gi|21326584|ref|NC_003977.1| | Hepatitis B virus | 2396 | 0 | 4 | 0 | CG | 0.01499 | protein_id = YP_355335.1, NP_647607.1 |
| gi|21326584|ref|NC_003977.1| | Hepatitis B virus | 115 | C | 2 | 0 | A | 0.02222 | Pre S2 region, ID = id0, Dbxref = taxon: 10407, |
| gi|21326584|ref|NC_003977.1| | Hepatitis B virus | 126 | C | 2 | 0 | T | 0.02222 | Is_circular = true, gbkey = Src, genome = genomic, |
| gi|21326584|ref|NC_003977.1| | Hepatitis B virus | 148 | G | 2 | 0 | A | 0.02222 | mol_type = genomic DNA, strain = ayr |
| gi|21326584|ref|NC_003977.1| | Hepatitis B virus | 1762 | T | 0 | 4 | A | 0.06061 | Gene = X, Name = NP_647606.1, product = X protein, |
| gi|21326584|ref|NC_003977.1| | Hepatitis B virus | 1764 | A | 0 | 4 | G | 0.06061 | protein_id = NP_647606.1 |
| gi|548558394|ref|NC_022518.1| | Human endogenous retrovirus K113 | 7476 | 0 | 10 | 14 | TACTG | 0.00600 | ID = gene0, Name = Q779_gp1; ID = cds0, Name = YP_008603282.1, |
| gi|548558394|ref|NC_022518.1| | Human endogenous retrovirus K113 | 7426 | G | 3 | 0 | A | 0.00714 | product = putative env, |
| gi|548558394|ref|NC_022518.1| | Human endogenous retrovirus K113 | 8086 | T | 3 | 0 | C | 0.00714 | protein_id = YP_008603282.1 |
The table is sorted based on Annotation. Annotation includes gene name, protein name, etc., separated by commas, multiple annotations separated by semi-colon.
Table 6A shows variants in the Hepatitis B virus only while Table 6B shows variants in other species. (Shows only common results between two possible analysis steps).
Comparing the viral detection ability of viGEN with other tools.
| HPV 16 (Alphapapillomavirus 9) | Kraken | 58 |
| HPV 16 | Centrifuge | 55 |
| HPV 16 | viGEN | 53 |
| HPV 18 | Kraken | 2 |
| HPV 18 (Alphapapillomavirus 7) | Centrifuge | 15 |
| HPV 18 | viGEN | 13 |
| HPV 26 (Alphapapillomavirus 5) | Kraken | 1 |
| HPV 26 | Centrifuge | 0.3 |
| HPV 26 | viGEN | 0.3 |