Literature DB >> 29922260

viGEN: An Open Source Pipeline for the Detection and Quantification of Viral RNA in Human Tumors.

Krithika Bhuvaneshwar¹, Lei Song¹, Subha Madhavan¹, Yuriy Gusev¹.

Abstract

An estimated 17% of cancers worldwide are associated with infectious causes. The extent and biological significance of viral presence/infection in actual tumor samples is generally unknown but could be measured using human transcriptome (RNA-seq) data from tumor samples. We present an open source bioinformatics pipeline viGEN, which allows for not only the detection and quantification of viral RNA, but also variants in the viral transcripts. The pipeline includes 4 major modules: The first module aligns and filter out human RNA sequences; the second module maps and count (remaining un-aligned) reads against reference genomes of all known and sequenced human viruses; the third module quantifies read counts at the individual viral-gene level thus allowing for downstream differential expression analysis of viral genes between case and controls groups. The fourth module calls variants in these viruses. To the best of our knowledge, there are no publicly available pipelines or packages that would provide this type of complete analysis in one open source package. In this paper, we applied the viGEN pipeline to two case studies. We first demonstrate the working of our pipeline on a large public dataset, the TCGA cervical cancer cohort. In the second case study, we performed an in-depth analysis on a small focused study of TCGA liver cancer patients. In the latter cohort, we performed viral-gene quantification, viral-variant extraction and survival analysis. This allowed us to find differentially expressed viral-transcripts and viral-variants between the groups of patients, and connect them to clinical outcome. From our analyses, we show that we were able to successfully detect the human papilloma virus among the TCGA cervical cancer patients. We compared the viGEN pipeline with two metagenomics tools and demonstrate similar sensitivity/specificity. We were also able to quantify viral-transcripts and extract viral-variants using the liver cancer dataset. The results presented corresponded with published literature in terms of rate of detection, and impact of several known variants of HBV genome. This pipeline is generalizable, and can be used to provide novel biological insights into microbial infections in complex diseases and tumorigeneses. Our viral pipeline could be used in conjunction with additional type of immuno-oncology analysis based on RNA-seq data of host RNA for cancer immunology applications. The source code, with example data and tutorial is available at: https://github.com/ICBI/viGEN/.

Entities: CellLine Chemical Disease Gene Mutation Species

Keywords: RNA-seq; TCGA; cancer immunology; liver cancer; next-generation sequencing; variant analysis; viral detection

Year: 2018 PMID： 29922260 PMCID： PMC5996193 DOI： 10.3389/fmicb.2018.01172

Source DB: PubMed Journal: Front Microbiol ISSN： 1664-302X Impact factor: 5.640

Introduction

An estimated 17% of cancers worldwide are associated with infectious causes. These infectious agents include viruses, bacteria, parasites and other microbes. Examples of viruses include human papilloma viruses (HPVs) in cervical cancer, epstein-Barr virus (EBV) in nasopharyngeal cancer, hepatitis B and C in liver cancer (HBV and HCV), human herpes virus 8 (HHV-8) in Kaposi sarcoma (KS); human T-lymphotrophic virus-1 (HTLV-1) in adult T cell lymphocytic leukemia (ATL) and non-Hodgkin lymphoma; merkel cell polyomavirus (MCV) in Merkel cell carcinoma (ACS, 2007). Bacteria such as Helicobacter pylori have been implicated in stomach cancer. Parasites have also been associated with cancer, examples are Opisthorchis viverrini and Clonorchis sinensis in bile duct cancer and Schistosoma haematobium in bladder cancer (ACS, 2007). Detection and characterization of these infectious agents in tumor samples can give us better insights into disease mechanisms and their treatment (Hausen, 2000). Vaccines have been developed to help protect against infection from the many cancers. But these vaccines can only be used to help prevent infection and cannot treat existing infections (ACS, 2007). There are several screening methods widely used to detect viral infections, especially for blood borne viruses including HBV, HCV, HIV and HTLV. These include the enzyme linked immunosorbent assay (ELISA or EIA) (Yoshihara, 1995), chemluminescent immunoassay (ChLIA), Indirect fluorescent antibody (IFA), Western blot (WB), Polymerase Chain Reaction (PCR), and Rapid immunoassays. ELISA and WB test detects and measures antibodies in serum taken from the patient's blood, and are typically prescribed after certain symptoms are observed in the patient. There are several challenges in detection of viruses in tumors including loss of viral information in progressed tumors and limited or latent replication resulting in low transcription of tumors (Schelhorn et al., 2013). The extent and biological significance of viral presence/infection in actual tumor samples is generally unknown but could be measured using human transcriptome data from tumor samples. The popularity of next-generation sequencing (NGS) technology has exploded in the last decade. NGS technologies are able to perform rapid sequencing, and in a massively parallel fashion (Datta et al., 2015). In recent years, applications of NGS technologies in clinical diagnostics have been on the rise (Barzon et al., 2011; Byron et al., 2016). Amongst the various NGS technologies, whole-transcriptome sequencing, also called RNA-seq, has been very popular with methods and tools being actively developed. Exploring the genome using RNA-seq gives a different insight than looking at the DNA since the RNA-seq would have captured actively transcribed regions. Every aspect of data output from this technology is now being used for research, including detection of viruses and bacteria (Khoury et al., 2013; Salyakina and Tsinoremas, 2013; Wang et al., 2016). They are also independent of prior sequence information, and require less starting material compared to conventional cloning based methods, making them powerful and exciting new technologies in virology (Datta et al., 2015). These high throughput technologies give us direct evidence of infection in the tissue as compared to ELISA-based assays, which only proves presence of infection somewhere in the human body. RNA-seq technology has hence enabled the exploration and detection of viral infections in human tumor samples. This technology also enables detection of variants in viral genome, which have been connected to clinical outcome (Moyes et al., 2005; Downey et al., 2015). In recent years, US regulators approved a viral based cancer therapy (Ledford, 2015), proving that the study of viruses in the human transcriptome has biomedical interest, and is paving the way for promising research and new opportunities. In this paper, we present our pipeline viGEN to not only detect and quantify read counts at the individual viral-gene level, but also detect viral variants from human RNA-seq data. The characterization of viral variants helps enable better epidemiological analysis. The input file to our pipeline is a fastq (Wikipedia, 2009) file, so our viGEN pipeline can be extended to work with genomic data from any NGS technology. Our pipeline can also be used to detect and explore not only viruses, but other microbes as well, as long as the sequence information is available in NCBI. We applied our viGEN pipeline to two case studies as a proof of concept - a dataset of 304 cervical cancer patients, and a set of 50 liver cancer patients, both from the TCGA collection. We first applied the pipeline to the transcriptome of cervical cancer patients to see if we are able to detect the human papilloma viruses. We also performed additional in-depth analyses on a small focused study of liver cancer patients. In this cohort, we performed viral-gene quantification, viral-variant extraction and survival analysis. From our analyses, we show that we were able to successfully detect the human papilloma virus among the TCGA cervical cancer patients. We compared the viGEN pipeline with two metagenomics tools and demonstrate similar sensitivity/specificity. We were also able to quantify viral-transcripts and extract viral-variants using the liver cancer dataset. This enabled us to perform downstream analysis to give us new insights into disease mechanisms. In addition to the two case studies, we have made available an end-to-end tutorial demonstrated on a publicly available RNA-seq sample from an HBV liver cancer patient from NCBI SRA (http://www.ncbi.nlm.nih.gov/bioproject/PRJNA279878). We also provided step-by-step instructions on how to run our viGEN pipeline on this sample data, along with the code at https://github.com/ICBI/viGEN/ and demonstrated the detection of HBV transcripts in this sample. This allows other users to apply this pipeline to explore viruses in their data and disease of interest. We are currently implementing the viGEN pipeline in the Seven Bridges Cancer Genomics Cloud. There are a number of existing pipelines that detect viruses from human transcriptome data. Of these, very few pipelines offer quantification at the gene expression level. A comprehensive comparison of these pipelines is provided in Table 1. Our goal was not to compete with these other tools, but to offer a convenient and complete end–to-end publicly available pipeline to the bioinformatics community. To the best of our knowledge there are no publicly available pipelines or packages that would provide this type of complete analysis in one package. Customized solutions have been reported in the literature however were not made public.

Table 1

Comparison of existing pipelines that detect viruses from human transcriptome data.

Tool name	Detect viruses from Human RNA-seq data	Perform quantification at viral-gene/CDS level	Works on DNAseq, RNAseq or both	Variant calling at viral-variant level	Discover viral integration sites	Other comments
Virana (Schelhorn et al., 2013)	Yes	Identifies microbial transcripts, does not quantify	Both	No	Yes	Also offers analysis of homologs
VirusSeq (Chen et al., 2013)	Yes	No	Both	No	Yes	Pre-designed to work on a select set of 18 viruses
Viral Fusion Seq (Li et al., 2013)	Yes	No	Both	No	Yes	Can also detect fusion events
Virus Finder (Wang et al., 2013)	Yes	No	Both	No	Yes	Can be applied to samples infected with undiagnosed viruses
PathSeq (Kostic et al., 2011)	Yes	No	Both	No	No
RINS (Bhaduri et al., 2012)	Yes	No	RNA-seq	No	No	Generates contigs with these non-human sequences
Kraken (Wood and Salzberg, 2014)	Yes	No	No	No	No	Metagenomic analysis tool
Cetrifuge (Kim et al., 2016)	Yes	No	No	No	No	Metagenomic analysis tool
viGEN (our pipeline)	Yes	Yes	Both	Yes	No

Comparison of existing pipelines that detect viruses from human transcriptome data. In the future, our plan is to package this pipeline and make it available to users through Bioconductor (Lawrence et al., 2013), allowing users to perform analysis on either their local computer or the cloud.

Materials and methods

In this paper, we applied our viGEN pipeline to two case studies as a proof of concept - a dataset of 304 cervical cancer patients, and a set of 50 liver cancer patients, both from the TCGA collection (NCI, 2011). We first applied the pipeline to the transcriptome of cervical cancer patients to see if we are able to detect the human papilloma viruses. We also performed additional in-depth analyses on a small focused study of liver cancer patients afflicted with Hepatitis B virus. In this cohort, we perform viral-gene quantification, viral-variant extraction and survival analysis. The results from these analyses allowed us to compare experimental and control groups using viral-gene expression data and viral-variant data, and give us insights into their impacts on the tumor, and disease mechanisms. In the following sections, we describe the viGEN pipeline, and the two case studies.

The viGEN pipeline

The viGEN pipeline includes 4 major modules. Figure 1 shows an image of our viGEN pipeline.

Figure 1

viGEN pipeline. Each module has a color, shown in the legend.

Module 1: Viral genome level analysis (filtered human sample input)

In Module 1 (labeled as “filtered human sample input”), the human RNA sequences were aligned to the human-reference genome using the RSEM (Li and Dewey, 2011) tool. One of the outputs of RSEM includes sequences that did not align to the human genome (hence the name “filtered human sample input”). These un-aligned sequences were taken and aligned to the viral reference file using popular alignment tools BWA (Li and Durbin, 2009) and Bowtie2 (Langmead and Salzberg, 2012).

Module 2: Viral genome level analysis (unfiltered human sample input)

In Module 2 (labeled as “unfiltered human sample input”), the RNA seq sequences were directly aligned to the viral reference using Bowtie2 without any filtering. The reason for using two methods to obtain the viral genomes in human RNA-seq data (Module 1 and Module 2) was to allow us to be as comprehensive as possible in viral detection. The aligned reads from Module 1 and 2 were in the form of BAM files (Center-for-Statistical-Genetics, 2013), from which read counts were obtained for each viral genome species (referred to as “genome level counts”) using Samtools idxstats (Li et al., 2009) or Picard BAMIndexStats tools. Using the genome level counts, we estimated the number of reads that covered the genome, a form of viral copy number. Viral copy number was defined as in equation below: Only those viral species with copy number more than a threshold are selected for the next module.

Module 3: Viral gene expression analysis

The BAM files from Module 1 and 2 (from Bowtie2 and BWA) were input into Module 3 (referred to as “viral gene expression level analysis”), which calculated quantitate read counts at the individual viral-gene level. We found existing RNA-seq quantification tools to be not sensitive enough for viruses, and hence developed our own algorithm for this module. Our in-house algorithm used region-based information from the general-feature-format (GFF) files of each viral genome, and the reads from the BAM file. It created a summary file, which had a total count of reads within or on the boundary of each region in the GFF file. This is repeated for each sample and for each viral GFF file. At the end, a matrix is obtained where the features (rows) are regions from the GFF file, and the columns are samples. The read count output from Module 3 (viral gene expression module) allowed for downstream differential expression analysis of viral genes between case and controls groups. The source code for our in-house algorithm, written using the R programming language (R Core Team, 2014), has been made public at available at github.com/ICBI/viGEN.

Module 4: Viral RNA variant calling module

The BAM files from Module 1 and 2 (from Bowtie2) were also input to Module 4 to detect mutations in the transcripts from these viruses (referred to as “viral RNA variant calling module”). The BAM files were first sorted coordinate-wise using Samtools (Li et al., 2009); PCR duplicates were removed using tool Picard4, then the chromosomes in the BAM file were ordered in the same way as the reference file using Picard. The Viral reference file was created from combining all known and sequenced human viruses obtained from NCBI2. Because viral variants are known to be low frequency, we have selected a variant calling tool Varscan2 (Koboldt et al., 2012), which allows detection of low-frequency variants (Spencer et al., 2014). Low quality and low depth variants were flagged, but not filtered out, in case these low values were due to low viral load. Once the variants were obtained, they were merged to form a multi-sample VCF file. Only variants that had a variant in two or more samples were retained. PLINK (Chang et al., 2015) was used to perform case-control association test (Fishers Exact Test) to compare groups.

Tutorial in Github

The viGEN pipeline is easy to implement because our pipeline incorporates existing best practices and tools available. For Module 3, we developed our own algorithm for viral-gene quantification. The major motivation for this paper was to build on existing viral detection tools, and to build a quantification tool in order to quantify, explore and analyse the genes detected in viruses. The source code for the in-house algorithm, along with a tutorial on how to execute the code on sample data has been made public at https://github.com/ICBI/viGEN/. Since access to TCGA raw data is controlled access, we could not use this dataset to create a publicly available tutorial. So we used a publicly available RNA-seq dataset to demonstrate our pipeline with an end-to-end workflow. We chose one sample (SRR1946637) from publicly available HBV liver cancer RNA-seq dataset from NCBI SRA (http://www.ncbi.nlm.nih.gov/bioproject/PRJNA279878). This dataset is also available through EBI SRA (http://www.ebi.ac.uk/ena/data/view/SRR1946637). The dataset consisted of 50 HBV Liver cancer patients, and 5 adjacent normal liver tissues. We downloaded the raw reads for one sample, and applied our viGEN pipeline to it and were able to successfully detect HBV transcripts in this sample. A step-by-step workflow that includes – description of tools, code, intermediate and final analysis results are provided in Github: https://github.com/ICBI/viGEN/. This tutorial has also been provided as Additional File 1.

Custom reference index

We were interested in exploring all viruses existing in humans. So we first obtained reference genomes of all known and sequenced human viruses obtained from NCBI2 (745 viruses) and merged them into one file (referred to as the “viral reference file”) in fasta file format (Wikipedia, 2004). This file has been shared in our Github page.

Case studies

Cervical cancer dataset

Cervical cancer is caused by the Human Papilloma Virus (HPV). This dataset consisted of 304 cervical cancer patients in the TCGA data collection. These samples were primary tumors from either Cervical Squamous Cell Carcinoma or Endocervical Adenocarcinoma where RNA-seq data was available. We applied our viGEN pipeline on these samples using the Seven Bridges platform (https://cgc.sbgenomics.com). Among the 304 cervical cancer patients, 22 patients had virus detection confirmed by PCR or other lab methods and made available through the clinical data. So we used this information from the 22 patients to estimate the sensitivity and specificity of our viGEN pipeline.

Liver cancer dataset

This dataset consisted of 50 liver cancer patients in the TCGA data collection. 25 of these patients were afflicted with Hepatitis B virus (labeled “HepB”), while the rest of the 25 patients had a co-infection of both Hepatitis B and C viruses (labeled “HepB+C”). Information about viral presence was obtained from “Viral Hepatitis Serology” attribute from the clinical information. We first applied the viGEN pipeline on the 50 samples, using the Globus Genomics platform (Bhuvaneshwar et al., 2015). Once the viral genomes were detected, we then chose only the high abundance viral species for the gene quantification step and viral variant detection steps (Module 3 and 4 respectively). We then performed a focused analysis on this dataset. We used the viral-gene expression read counts, to examine the differences between “Dead” and “Alive” samples. The Dead/Alive status of the samples was obtained from the clinical data and refers to patients in the cohort that died or not from cancer. We performed this analysis on the 25 patients in the HepB only group to prevent any confounding with the HepB+HepC group. Out of 25 HepB patients, 16 were alive (baseline group), and 9 dead (comparison group) as per the clinical data. The analysis was performed using a Bioconductor software package called EdgeR (Robinson et al., 2010) in the R programming language (http://www.R-project.org). Cox proportional hazards (Cox PH) regression model (Cox and Oakes, 2000) was then applied to look at the association of viral-gene expression data with overall survival. Thie Cox model was applied on all 50 samples in the cohort (i.e., 25 Hep B and 25 HepB+HepC) samples to maximize power. We also compared the dead and alive samples at the viral RNA variant level in the HepB group using a tool called PLINK to see if it can add valuable information to the tumor landscape in humans.

Results

Detection of HPV in cervical cancer patients

We used our viGEN pipeline to detect viruses in the RNA of human cervical tissue and obtained viral copy number for each species. We used a threshold copy number of 10 as a “positive” viral detection for both HPV-16, HPV-18 and HPV-26 viruses. Based on this criterion, HPV-16 was detected in 53% of the samples, HPV-18 in 13% of the samples and HPV-26 in 0.3 % of the samples (Figure 2). The threshold copy number limit that defines a “positive” detection is one of the parameters of the software which could be set by the user depending on the specifics of the experiment.

Figure 2

The HPV viruses detected in cervical cancer patients using the viGEN pipeline.

The HPV viruses detected in cervical cancer patients using the viGEN pipeline. We obtained the clinical data for this TCGA cervical cancer cohort from the cBio portal (Cerami et al., 2012). Among the 304 patients, 22 patients had virus detection confirmed by PCR or other lab methods and made available through the clinical data. Out of the 22 patients, 12 patients had the HPV-16 virus, 4 patients had HPV-18, and the rest had other HPV viruses. So we used this information from the clinical data to estimate the sensitivity and specificity of our viGEN pipeline. We got a sensitivity of 83% and specificity of 60% for HPV-16 detection (Table 2A); and a sensitivity of 75% and specificity of 94% for HPV-18 detection (Table 2B).

Table 2A

Estimation of sensitivity and specificity for HPV-16 detection in TCGA cervical cancer samples using the viGEN pipeline.

		PCR or other lab techniques for screening viruses (obtained from clinical data)
		Positive	Negative	Total
RNA-seq from tumor tissue (output of algorithm)	Positive	10	4	14
	Negative	2	6	8
	Total	12	10	22

Table 2B

Estimation of sensitivity and specificity for HPV-18 detection in TCGA cervical cancer samples using the viGEN pipeline.

		PCR or other lab techniques for screening viruses (obtained from clinical data)
		Positive	Negative	Total
RNA-seq from tumor tissue (output of algorithm)	Positive	3	1	4
	Negative	1	17	18
		4	18	22

Estimation of sensitivity and specificity for HPV-16 detection in TCGA cervical cancer samples using the viGEN pipeline. Estimation of sensitivity and specificity for HPV-18 detection in TCGA cervical cancer samples using the viGEN pipeline.

Additional analysis in liver cancer patients

Detection of hepatitis B virus at the genome level

We applied our viGEN pipeline (modules 1 and 2) on the RNAseq data from the TCGA liver cancer tumors, and obtained genome-level read counts for each viral species. We used a threshold copy number of 10 to define a positive detection of the Hepatitis B virus. Once the viral genomes were detected, we short-listed the high abundance viral species for the viral-gene quantification step and viral-variant detection steps (Module 3 and 4 respectively). High abundance was defined as those virus species that were detected in at-least 5 samples. In addition to Hepatitis B and C viruses, several other viruses came up in this short list including Human endogenous retrovirus K113 (HERV K113) and others. A complete list is provided in Table 3.

Table 3

Virus species detected in at-least 5 samples.

Name of Virus
Hepatitis C virus
Tick-borne encephalitis virus
Hepatitis C virus genotype 2
Cutthroat trout virus
Human endogenous retrovirus K113
Lunk virus NKS-1
Hepatitis C virus genotype 6
Hepatitis B virus

Virus species detected in at-least 5 samples.

Comparing dead and alive samples in the using viral gene expression data

To get a more detailed overview of the viral landscape, we applied Module 3 of the viGEN pipeline to the liver cancer dataset. This allowed us to quantify viral-gene expression regions in the RNA of liver tumor tissues. We then used those results to examine the differences between dead and alive samples. It is known that these patients were afflicted with the Hepatitis B virus and hence many of the differentially expressed regions were from this viral genome. But as we know, other viruses also coexist in humans. This was confirmed by the presence of differentially expressed viral-regions from other viruses. The differentially expressed regions that were significant among the results are shown in Tables 4A,B. Table 4A lists only the differentially expressed regions from Hepatitis B virus and Table 4B shows the differentially expressed regions from other viruses.

Table 4

Differential expression analysis of transcript level read counts Liver cancer dataset comparing Dead and Alive samples.

Name of region (Name of virus_region_start position of region_end position of region)	Log Fold change (logFC)	Log counts per million (logCPM)	P-Value	Q-Value (FDR)	Name of virus	Region annotation
A
NC_003977.1_gene_1814_2452	1.128	13.449	1.71E−06	0.000243	Hepatitis B virus	Contains Gene C that produces pre-code protein external core antigen; HBeAg. HBeAg is produced by proteolytic processing of the pre-core protein
NC_003977.1_CDS_1814_2452	1.128	13.449	1.71E−06	0.000243	Hepatitis B virus	Contains Gene C that produces pre-code protein external core antigen; HBeAg.
NC_003977.1_CDS_1901_2452	0.828	12.42	0.000507	0.053928	Hepatitis B virus	Contains Gene C, encodes core antigen HBcAg
NC_003977.1_CDS_155_835	−2.121	16.335	1.67E−15	1.11E−12	Hepatitis B virus	Encodes Gene S that produces small envelope protein, S protein; S glycoprotein; S-HBsAg,
NC_003977.1_gene_2307_4838	−2.133	12.655	2.61E−15	1.11E−12	Hepatitis B virus	Gene P, encodes protein P
NC_003977.1_CDS_2307_4838	−2.133	12.655	2.61E−15	1.11E−12	Hepatitis B virus	Gene P, encodes protein P
NC_003977.1_CDS_3205_4050	−2.352	8.67	1.93E−12	6.84E−10	Hepatitis B virus	Gene S, encodes middle envelope protein pre-S2/S
NC_003977.1_gene_2848_4050	−2.75	11.741	5.84E−22	6.20E−19	Hepatitis B virus	Encodes Gene S that produces a large surface protein/L glycoprotein/L-HBsAG
NC_003977.1_CDS_2848_4050	−2.75	11.741	5.84E−22	6.20E−19	Hepatitis B virus	Encodes Gene S that produces a large surface protein/L glycoprotein/L-HBsAG
NC_003977.1_CDS_155_835	−2.121	16.335	1.67E−15	1.11E−12	Hepatitis B virus	Encodes Gene S that produces small envelope protein, S protein; S glycoprotein; S-HBsAg,
B
NC_001405.1_intron_9724_12307	2.527	6.463	4.02E−08	1.22E−05	Human mastadenovirus C	Gene = L1, locus_tag = HAdVC_gp10, note = precedes capsid protein precursor pIIIa CDS
NC_001405.1_intron_9724_11039	2.5	6.451	6.65E−08	1.28E−05	Human mastadenovirus C	Gene = L1, locus_tag = HAdVC_gp10, note = precedes encapsidation protein 52K CDS
NC_001405.1_gene_10866_11023	2.5	6.451	6.65E−08	1.28E−05	Human mastadenovirus C	Gene = VAII, locus_tag = HAdVC_gs02, GeneID:2653002
NC_001405.1_transcript_10866_11023	2.5	6.451	6.65E−08	1.28E−05	Human mastadenovirus C	Gene = VAII, locus_tag = HAdVC_gs02, GeneID:2653002
NC_001405.1_exon_10866_11023	2.5	6.451	6.65E−08	1.28E−05	Human mastadenovirus C	Gene = VAII, locus_tag = HAdVC_gs02, GeneID:2653002
NC_001405.1_intron_10580_14015	2.428	6.475	9.24E−08	1.64E−05	Human mastadenovirus C	Gene = E2B, locus_tag = HAdVC_gp04
NC_022518.1_STS_7174_7323	−0.992	9.122	0.000309	0.034527	Human endogenous retrovirus K113	Sequence-tagged site (STS), locus_tag = Q779_gp1, standard_name = D6S2277, UniSTS:59918
NC_022518.1_STS_5100_5381	−1.051	9.532	0.000118	0.0139	Human endogenous retrovirus K113	Sequence-tagged site (STS), standard_name = D22S1651, UniSTS: 474031
NC_022518.1_region_1112_6746	−1.186	13.022	3.49E−06	0.000463	Human endogenous retrovirus K113	gag-pro-pol; two−1 frameshifts predicted to occur to produce a fusion protein; the location of frameshifts has not been determined
NC_018464.1_region_1_927	−1.288	12.784	5.78E−07	9.45E−05	Shamonda virus	mol_type = genomic RNA, isolate = Ib An 5550, taxon:159150, segment = S
NC_002645.1_gene_293_20568	−2.598	6.126	3.93E−05	0.004911	Human coronavirus 229E	locus_tag = HCoV229Egp1, GeneID: 918764, replicase polyprotein 1ab

These results shown used the viral-gene data obtained from Module 1 (using alignment tool Bowtie2) + Module 3. The table shows results with q value <0.06 and sorted based on LogFC in the descending order. Table 4A shows transcript level read counts in the Hepatitis B virus while Table 4B shows transcript level read counts in other species.

Differential expression analysis of transcript level read counts Liver cancer dataset comparing Dead and Alive samples. These results shown used the viral-gene data obtained from Module 1 (using alignment tool Bowtie2) + Module 3. The table shows results with q value <0.06 and sorted based on LogFC in the descending order. Table 4A shows transcript level read counts in the Hepatitis B virus while Table 4B shows transcript level read counts in other species. From the differential expression analyses, the two most informative results were (1) a region of the Hepatitis B genome that produced the HBeAg and HBcAg proteins were overexpressed in the dead patients and (2) another region of the Hepatitis B genome that produced HBsAg protein was overexpressed in the alive patients. In detail, we saw several important findings as described below: Region NC_003977.1_CDS_1814_2452 of the Hepatitis B genome was 2.18 times overexpressed (log fold change = +1.128) in dead patients. This region contains Gene C that produces pre-code protein external core antigen; HBeAg. HBeAg is produced by proteolytic processing of the pre-core protein Region NC_003977.1_CDS_1901_2452 which was 1.74 times overexpressed (log fold change = +0.8, FDR = 0.053) in dead patients contains Gene C as above, but encodes a different core antigen HBcAg Region NC_003977.1_CDS_2848_4050 of the Hepatitis B genome was 6.73 times over expressed (log fold change = −2.7) in the alive patients of compared to the dead'patients. This region encodes Gene S that produces a large surface protein/L glycoprotein/L-HBsAG We also found several regions of the Human endogenous retrovirus K113 (HERV K113) viral genome (NC_022518.1_region_1112_6746, NC_022518.1_STS_5100_5381 and NC_022518.1_STS_7174_7323) to be about 2 times overexpressed on average in alive patients (log fold change = −1.186, −1.051, −0.992).

Survival analysis (cox regression) using viral gene expression data

Based on the results from previous section, we selected two most informative regions from the Hepatitis B genome (log counts per million from NC_003977.1_CDS_2848_4050, NC_003977.1_CDS_1814_2452) for a Cox Proportional Hazard (Cox PH) model to look at association with overall survival event and time. The result from this model (Table 5), are consistent with the results from differential expression analysis:

Table 5

Cox proportional hazard survival analysis (across 25 HepB samples and 25 HepB + HepC Samples).

Formula:
coxph(formula = survObject ~ NC_003977.1_CDS_2848_4050 + NC_003977.1_CDS_1814_2452)
Results from the model:
n = 37, number of events = 5
(13 observations deleted due to missingness)
Covariate	coef	exp(coef)	se(coef)	Z	Pr(>\|z\|)
NC_003977.1_CDS_2848_4050	−0.5548	0.5742	0.7434	−0.746	0.456
NC_003977.1_CDS_1814_2452	0.5302	1.6993	0.6145	0.863	0.388
Covariate	exp(coef)	exp(coef)	Lower 0.95	Upper 0.95
NC_003977.1_CDS_2848_4050	0.5742	1.7415	0.1337	2.465
NC_003977.1_CDS_1814_2452	1.6993	0.5885	0.5096	5.667
Concordance = 0.654 (se = 0.188)
Rsquare = 0.12 (max possible = 0.329)
Likelihood ratio test = 4.74 on 2 df, p = 0.09343
Wald test = 0.75 on 2 df, p = 0.6856
Score (logrank) test = 10.58 on 2 df, p = 0.00503

These results shown used the viral-gene/CDS data obtained from Module 1 (using alignment tool Bowtie2) + Module 3. Coef: coefficient (Beta) of the model; exp(coef): Hazard Ratio; se(coef): Standard Error; Pr(>|z|): P-value

(a) The Cox PH model shows that assuming other covariant to be constant, unit increase in expression of this region NC_003977.1_CDS_1814_2452, increases the hazard of event (death) by 70%.

(b) On the other hand, that assuming other covariant to be constant, unit increase in expression of this region NC_003977.1_CDS_2848_4050, decreases the hazard of event (death) by 43%.

The Cox PH model shows that assuming other covariant to be constant, unit increase in expression of this region NC_003977.1_CDS_1814_2452, increases the hazard of event (death) by 70%. On the other hand, that assuming other covariant to be constant, unit increase in expression of this region NC_003977.1_CDS_2848_4050, decreases the hazard of event (death) by 43%. The overall model is significant with p-value < 0.05 from the Log rank test (also called Score test). Cox proportional hazard survival analysis (across 25 HepB samples and 25 HepB + HepC Samples). These results shown used the viral-gene/CDS data obtained from Module 1 (using alignment tool Bowtie2) + Module 3. Coef: coefficient (Beta) of the model; exp(coef): Hazard Ratio; se(coef): Standard Error; Pr(>|z|): P-value (a) The Cox PH model shows that assuming other covariant to be constant, unit increase in expression of this region NC_003977.1_CDS_1814_2452, increases the hazard of event (death) by 70%. (b) On the other hand, that assuming other covariant to be constant, unit increase in expression of this region NC_003977.1_CDS_2848_4050, decreases the hazard of event (death) by 43%. (c) The overall model is significant with p-value < 0.05 from the Log rank test (also called Score test).

Comparing dead and alive samples using viral-variant data

We performed variant calling (Module 4) on the liver cancer patients to see if it can add valuable information to the tumor landscape in humans. We then compared the dead and alive samples at the viral-variant level on the 25 HepB patients. For this analysis, the outputs from both Module 1 and 2 were fed into Module 4. Most of the top variants from filtered human sample (Module 1 + Module 4) and unfiltered human sample (Module 2 + Module 4), were the same. We collated the significant common results (p-value ≤ 0.05) in Tables 6A,B. Among these results, we saw several missense and frameshift variants in Gene X of the Hepatitis B genome (nucleotide 1479), Gene P (2573, 2651, 2813), and a region that overlaps Gene P and PreS1 (nucleotides 2990, 2997, 3105, 3156). All these variants were found mutated more in the cases than controls. Other significant common results included variants in Gene C (nucleotide 1979, 2396) and variants in PreS2 region (nucleotide positions 115, 126 and 148) (Table 6A).

Table 6

Results of case-control association test applied on the results from viral variant calling.

CHR (Chromosome)	Species (Name of Virus)	BP (Base pair)	A1 (minor allele)	C_A (Number of cases with A1)	C_U (number of controls with A1)	A2 (major allele)	P (P value)	Annotation from GFF file
A
gi\|21326584\|ref\|NC_003977.1\|	Hepatitis B virus	1479	C	4	0	A	0.02857	Gene = X,product = X protein,protein_id = NP_647606.1
gi\|21326584\|ref\|NC_003977.1\|	Hepatitis B virus	2573	C	0	6	T	0.03571	Gene = P,
gi\|21326584\|ref\|NC_003977.1\|	Hepatitis B virus	2651	T	4	0	C	0.00476	product = polymerase,
gi\|21326584\|ref\|NC_003977.1\|	Hepatitis B virus	2813	C	2	0	T	0.03571	protein_id = NP_647604.2
gi\|21326584\|ref\|NC_003977.1\|	Hepatitis B virus	2990	T	2	0	A	0.02222	Gene = P, product = polymerase,
gi\|21326584\|ref\|NC_003977.1\|	Hepatitis B virus	2997	C	2	0	T	0.03571	protein_id = NP_647604.2;
gi\|21326584\|ref\|NC_003977.1\|	Hepatitis B virus	3105	C	2	0	A	0.02222	Gene = S, product = large envelope
gi\|21326584\|ref\|NC_003977.1\|	Hepatitis B virus	3156	G	4	0	A	0.00476	protein, protein_id = YP_355333.1
gi\|21326584\|ref\|NC_003977.1\|	Hepatitis B virus	1979	G	2	0	A	0.03571	Gene = C, product = pre-capsid protein,
gi\|21326584\|ref\|NC_003977.1\|	Hepatitis B virus	2396	0	4	0	CG	0.01499	protein_id = YP_355335.1, NP_647607.1
gi\|21326584\|ref\|NC_003977.1\|	Hepatitis B virus	115	C	2	0	A	0.02222	Pre S2 region, ID = id0, Dbxref = taxon: 10407,
gi\|21326584\|ref\|NC_003977.1\|	Hepatitis B virus	126	C	2	0	T	0.02222	Is_circular = true, gbkey = Src, genome = genomic,
gi\|21326584\|ref\|NC_003977.1\|	Hepatitis B virus	148	G	2	0	A	0.02222	mol_type = genomic DNA, strain = ayr
gi\|21326584\|ref\|NC_003977.1\|	Hepatitis B virus	1762	T	0	4	A	0.06061	Gene = X, Name = NP_647606.1, product = X protein,
gi\|21326584\|ref\|NC_003977.1\|	Hepatitis B virus	1764	A	0	4	G	0.06061	protein_id = NP_647606.1
B
gi\|548558394\|ref\|NC_022518.1\|	Human endogenous retrovirus K113	7476	0	10	14	TACTG	0.00600	ID = gene0, Name = Q779_gp1; ID = cds0, Name = YP_008603282.1,
gi\|548558394\|ref\|NC_022518.1\|	Human endogenous retrovirus K113	7426	G	3	0	A	0.00714	product = putative env,
gi\|548558394\|ref\|NC_022518.1\|	Human endogenous retrovirus K113	8086	T	3	0	C	0.00714	protein_id = YP_008603282.1

The table is sorted based on Annotation. Annotation includes gene name, protein name, etc., separated by commas, multiple annotations separated by semi-colon.

Table 6A shows variants in the Hepatitis B virus only while Table 6B shows variants in other species. (Shows only common results between two possible analysis steps).

Results of case-control association test applied on the results from viral variant calling. The table is sorted based on Annotation. Annotation includes gene name, protein name, etc., separated by commas, multiple annotations separated by semi-colon. Table 6A shows variants in the Hepatitis B virus only while Table 6B shows variants in other species. (Shows only common results between two possible analysis steps). In addition, there were two missense variants that were common among the top results, but not significant (p-value = 0.06). They were variants in the X gene of the Hepatitis B genome (nucleotides 1762 and 1764) (Table 6A). Among the significant common results to both, were a few variants of the Human endogenous retrovirus K113 complete genome (HERV K113). These include nucleotide positions 7476, 7426, and 8086. These map to frameshift and missense mutations in the putative envelope protein of this virus (Q779_gp1, also called “env”) (Table 6B).

Discussion

The Seven Bridges team used two metagenomic tools, Centrifuge (Kim et al., 2016) and Kraken (Wood and Salzberg, 2014), to detect HPV viruses on the same cohort of TCGA patients (Bridges, 2017; Malhotra et al., 2017), and shared the results with us. They used an abundance of 0.02 as a positive viral detection (Bridges, 2017; Malhotra et al., 2017). We compared viGEN with Kraken and Centrifuge in terms of the percentage of samples where the species was detected (Table 7). We can see that the results are in the same range for all three tools.

Table 7

Comparing the viral detection ability of viGEN with other tools.

Virus Name	Tool Name	% of samples where the species was found
HPV 16 (Alphapapillomavirus 9)	Kraken	58
HPV 16	Centrifuge	55
HPV 16	viGEN	53
HPV 18	Kraken	2
HPV 18 (Alphapapillomavirus 7)	Centrifuge	15
HPV 18	viGEN	13
HPV 26 (Alphapapillomavirus 5)	Kraken	1
HPV 26	Centrifuge	0.3
HPV 26	viGEN	0.3

Comparing the viral detection ability of viGEN with other tools. We also estimated the sensitivity and specificity of these tools using the same 22 patients and compared with that of the viGEN pipeline. The Centrifuge tool had a sensitivity of 83% and specificity of 60% for HPV-16 detection; and a sensitivity of 75% and specificity of 94% for HPV-18 detection. The Kraken tool had a sensitivity of 83% and specificity of 20% for HPV-16 detection; and a sensitivity of 75% and specificity of 17% for HPV-18 detection (detailed in Additional File 2). It shows that our viGEN pipeline was able to match the sensitivity and specificity of Centrifuge tool and surpassed that of Kraken (detailed in Additional Files 2, 3).

Additional analysis on liver cancer patients

We used our viGEN pipeline to get genome-level read counts obtained from viruses detected in the RNA of human liver tissue. In our results, HBV was detected in 20% of the samples. This is similar to earlier analyses of TCGA liver cancer cohort study (Khoury et al., 2013; Tang et al., 2013; The Cancer Genome Atlas Research Network, 2017), which detected the HBV virus in 23 and 32% (with typically low counts range) of cases respectively. It has also been reported that the viral gene X (HBx) was the most predominately expressed viral gene in liver cancer samples (Tang et al., 2013) which is in concordance with our findings where the peak number of reads were observed for gene X region of the HBV genome.

Comparing dead and alive samples in the liver cancer cohort using viral gene expression data

To get a more detailed overview of the viral landscape, we examined the human RNA-seq data to detect and quantify viral gene expression regions. We then examined the differences between dead and alive samples at the viral-transcript level on the Hepatitis B sub-group (Tables 4A,B). From the differential expression analyses, the two most informative results were (1) a region of the Hepatitis B genome that produced the HBeAg protein was overexpressed in the dead patients and (2) another region of the Hepatitis B genome that produced HBsAg protein was overexpressed in the alive patients. Presence of HBeAg or HBcAg is an indicator of active viral replication; this means the person infected with Hepatitis B can likely transmit the virus on to another person. Typically, loss of HBeAg is an indicator of recovery from acute Hepatitis B infection. Active viral replication could allow the virus to persist in infected cells, and increase the risk of disease (Tage-Jensen et al., 1985; Liang, 2009). So our results, showing that antigens HBeAg and HBcAg were overexpressed in dead patients compared to alive patients makes sense, indicating that these patients never recovered from acute infection. The results also indicate a higher level of HBsAg in the alive patients compared to the dead patients. The highest levels of HBsAg in the virus are known to occur in the “immunotolerant phase.” This pattern is seen in patients who are inactive carriers of the virus i.e., they have the wild type DNA, and the virus has been in the host for so long, that the host does not see the virus as a foreign protein in the body, and hence there's no immune reaction against the virus. In this phase, there is known to be minimal liver inflammation and low risk of disease progression (Park, 2004; Tran, 2011; Locarnini and Bowden, 2012). This could explain why we saw higher level of HBsAg in the alive patients compared to the dead patients. Also among the significant results were three regions from the Human endogenous retrovirus K113 (HERV K113) genome (with negative log fold change) that were overexpressed in the alive patients. Two of these regions were Sequence-tagged sites (STS) and the third region was in the gag-pro-pol region that has frameshifts. HERV could protect the host from invasion from related viral agents through either retroviral receptor blockade or immune response to the undesirable agent (Nelson et al., 2003). Overall, we found that our results from viral-gene expression level make biological sense, with much of the results validated through published literature.

Comparing dead and alive samples in the liver cancer cohort using viral-variant data

We performed variant calling on the viral data to see if it can add valuable information to the tumor landscape in humans. We then compared the dead and alive samples at the viral-variant level on the 25 patients in the Hepatitis B sub-group. Among the significant results (Tables 6A,B) included variants in Gene C (nucleotide 1979, 2396) and variants in PreS2 region (nucleotide positions 115, 126 and 148). The Gene C region creates the pre-capsid protein, which plays a role in regulating genome replication (Tan et al., 2015). The mutation in the 2396 position lies in a known CpG island (ranging from 2215 to 2490), whose methylation level is significantly correlated with hepatocarcinogenesis (Jain et al., 2015). Mutations in PreS2 are associated with persistent HBV infection, and emerge in chronic infections. The PreS1 and PreS2 regions are known to play an essential role in the interaction with immune responses because they contain several epitopes for T or B cells (Cao, 2009). Mutations in the 1762/1764 positions of the X gene are known to be associated with greater risk of HCC (Cao, 2009; Wang et al., 2014), and is independent of serum HBV DNA level (Wang et al., 2014). This mutation combination is also known to be associated with hepatitis B related acute-on-chronic liver failure (Xiao et al., 2011). It is predicted that mutations associated with HCC variants are likely generated during HBV-induced pathogenesis. The A1762T/G1764A combined mutations was shown to be a valuable biomarker in the predicting the risk of HCC (Cao, 2009; Wang et al., 2014); and are often detected about 10 years before the diagnosis of HCC (Cao, 2009). Among the significant common results to both, were a few variants of the Human endogenous retrovirus K113 complete genome (HERV K113). These variants map to frameshift and missense mutations in the putative envelope protein of this virus (Q779_gp1, also called “env”). Studies have shown that this envelope protein mediates infections of cells (Robinson and Whelan, 2016). HERV K113 is a provirus and is capable of producing intact viral particles (Boller et al., 2008). Studies have shown a strong association between HERV-K antibodies and clinical manifestation of disease and therapeutic response (Moyes et al., 2005; Downey et al., 2015). It is hypothesized that retroviral gene products can be “reawakened” when genetic damage occurs through mutations, frameshifts and chromosome breaks. Even though the direct oncogenic effects of HERVs in cancer are yet to be completely understood, it has shown potential as diagnostic or prognostic biomarkers and for immunotherapeutic purposes including vaccines (Downey et al., 2015). We compared various viral detection pipeline using the several criteria (Table 1). Our pipeline provides similar functionality as the tools listed in Table 1 for the detection of viruses from human RNAseq data; but also has an advantage of enabling gene-level expression analysis and quantification, as well as variant analysis of viral genomes in a single open source publicly available package.

Limitations

One limitation of our viGEN pipeline is that it is dependent on sequence information from reference genome. This makes it challenging to detect viral strains where reference sequence information is not known. In the future, we plan to explore de novo assembly incorporating more sophisticated methods like Hidden Markov Models (HMM) (Alves et al., 2016). This would enable us to provide in-depth analysis of strain pathogenicity in the context of clinical outcome.

Biological significance

In recent years, US regulators approved a viral based cancer therapy (Ledford, 2015), proving that the study of viruses in the human transcriptome has biomedical interest, and is paving the way for promising research and new opportunities. We show that our viGEN pipeline can thus be used on cancer and non-cancer human NGS data to provide additional insights into the biological significance of viral and other types of infection in complex diseases, and tumorigeneses. Our viral pipeline could be used in conjunction with additional type of immuno-oncology analysis based on RNA-seq data of host RNA for cancer immunology applications. Detection and characterization of these infectious agents in tumor samples can give us better insights into disease mechanisms and their treatment (Hausen, 2000).

Conclusion

With the decreasing costs of NGS analysis, our results show that it is possible to detect viral sequences from whole-transcriptome (RNA-seq) data in humans. Our analysis shows that it is not easy to detect DNA and RNA viruses from tumor tissue, but certainly possible. We were able to not only quantify them at a viral-gene expression level, but also extract variants. Our goal is to facilitate better understanding and gain new insights in the biology of viral presence/infection in actual tumor samples. The results presented in this paper on two case studies are in correspondence with published literature and are a proof of concept of our pipeline. This pipeline is generalizable, and can be used to examine viruses present in genomic data from other next generation sequencing (NGS) technologies. It can also be used to detect and explore other types of microbes in humans, as long as the sequence information is available from the National Center for Biotechnology Information (NCBI) resources. This pipeline can thus be used on cancer and non-cancer human NGS data to provide additional insights into the biological significance of viral and other types of infection in complex diseases, and tumorigeneses. We are planning to package this pipeline and make it open source to the bioinformatics community through Bioconductor.

Availability of data and material

The TCGA liver cancer dataset was used in the analysis and writing of this manuscript. The data can be obtained from https://cancergenome.nih.gov/. Since access to TCGA raw data is controlled access, we could not use this dataset to create a publicly available tutorial. So we looked for publicly available RNA-seq dataset to demonstrate our pipeline with an end-to-end workflow. We chose one sample (SRR1946637) from publicly available liver cancer RNA-seq dataset from NCBI SRA (http://www.ncbi.nlm.nih.gov/bioproject/PRJNA279878). This dataset is also available through EBI SRA (http://www.ebi.ac.uk/ena/data/view/SRR1946637). The dataset consists of 50 Liver cancer patients, and 5 adjacent normal liver tissues. We downloaded the raw reads for one sample, and applied our viGEN pipeline to it. A step-by-step workflow that includes – description of tools, code, intermediate and final analysis results are provided in Github: https://github.com/ICBI/viGEN/. Project details: Project name: viGEN Project home page: https://github.com/ICBI/viGEN/ Operating system(s): The R code is platform independent. The shell scripts can run on Unix, Linux, or iOS environment Programming language: R, bash/shell Other requirements: N/A License: N/A Any restrictions to use by non-academics: N/A

Author contributions

KB and YG designed the pipeline. KB and LS implemented the pipeline. KB and YG wrote the manuscript with editorial comments from SM.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

45 in total

1. Fast gapped-read alignment with Bowtie 2.

Authors: Ben Langmead; Steven L Salzberg
Journal: Nat Methods Date: 2012-03-04 Impact factor: 28.547

2. Immune tolerant hepatitis B: a clinical dilemma.

Authors: Tram T Tran
Journal: Gastroenterol Hepatol (N Y) Date: 2011-08

3. Hepatitis B: the virus and disease.

Authors: T Jake Liang
Journal: Hepatology Date: 2009-05 Impact factor: 17.425

4. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome.

Authors: Bo Li; Colin N Dewey
Journal: BMC Bioinformatics Date: 2011-08-04 Impact factor: 3.307

5. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.

Authors: Mark D Robinson; Davis J McCarthy; Gordon K Smyth
Journal: Bioinformatics Date: 2009-11-11 Impact factor: 6.937

6. ViralFusionSeq: accurately discover viral integration events and reconstruct fusion transcripts at single-base resolution.

Authors: Jing-Woei Li; Raymond Wan; Chi-Shing Yu; Ngai Na Co; Nathalie Wong; Ting-Fung Chan
Journal: Bioinformatics Date: 2013-01-12 Impact factor: 6.937

7. Sensitive detection of viral transcripts in human tumor transcriptomes.

Authors: Sven-Eric Schelhorn; Matthias Fischer; Laura Tolosi; Janine Altmüller; Peter Nürnberg; Herbert Pfister; Thomas Lengauer; Frank Berthold
Journal: PLoS Comput Biol Date: 2013-10-03 Impact factor: 4.475

8. Viral expression associated with gastrointestinal adenocarcinomas in TCGA high-throughput sequencing data.

Authors: Daria Salyakina; Nicholas F Tsinoremas
Journal: Hum Genomics Date: 2013-11-27 Impact factor: 4.639

9. Using Small RNA Deep Sequencing Data to Detect Human Viruses.

Authors: Fang Wang; Yu Sun; Jishou Ruan; Rui Chen; Xin Chen; Chengjie Chen; Jan F Kreuze; ZhangJun Fei; Xiao Zhu; Shan Gao
Journal: Biomed Res Int Date: 2016-03-15 Impact factor: 3.411

Review 10. Translating RNA sequencing into clinical diagnostics: opportunities and challenges.

Authors: Sara A Byron; Kendall R Van Keuren-Jensen; David M Engelthaler; John D Carpten; David W Craig
Journal: Nat Rev Genet Date: 2016-03-21 Impact factor: 53.242

4 in total

1. High-Throughput Sequencing Assists Studies in Genomic Variability and Epidemiology of Little Cherry Virus 1 and 2 infecting Prunus spp. in Belgium.

Authors: Rachid Tahzima; Yoika Foucart; Gertie Peusens; Tim Beliën; Sébastien Massart; Kris De Jonghe
Journal: Viruses Date: 2019-06-29 Impact factor: 5.048

2. Correlation of HPV16 Gene Status and Gene Expression With Antibody Seropositivity and TIL Status in OPSCC.

Authors: Adrian von Witzleben; Eve Currall; Oliver Wood; Lindsey Chudley; Oluyemisi Akinyegun; Jaya Thomas; Kaïdre Bendjama; Gareth J Thomas; Peter S Friedmann; Emma V King; Simon Laban; Christian H Ottensmeier
Journal: Front Oncol Date: 2021-01-26 Impact factor: 6.244

3. Demonstrating the utility of flexible sequence queries against indexed short reads with FlexTyper.

Authors: Phillip Andrew Richmond; Alice Mary Kaye; Godfrain Jacques Kounkou; Tamar Vered Av-Shalom; Wyeth W Wasserman
Journal: PLoS Comput Biol Date: 2021-03-22 Impact factor: 4.475

4. Kmerator Suite: design of specific k-mer signatures and automatic metadata discovery in large RNA-seq datasets.

Authors: Sébastien Riquier; Chloé Bessiere; Benoit Guibert; Anne-Laure Bouge; Anthony Boureux; Florence Ruffle; Jérôme Audoux; Nicolas Gilbert; Haoliang Xue; Daniel Gautheret; Thérèse Commes
Journal: NAR Genom Bioinform Date: 2021-06-23

4 in total