Human endogenous retrovirus-K mRNA expression and genomic alignment data in hepatoblastoma.

David F Grabski1,2, Aakrosh Ratan3, Laurie R Gray2,4, Stefan Bekiranov5, David Rekosh2,4, Marie-Louise Hammarskjold2,4, Sara K Rasmussen2,6.   

Abstract

Human Endogenous Retroviruses are a class of genomic elements that are the result of ancient retroviral infection of the human germline. Many are biologically active elements that have been implicated in multiple diseases including cancer. The most recent class to invade the human genome is the HERV-K(HML-2) (HERV-K) family. Approximately 90 HERV-K proviruses and many smaller elements have been identified to date in the human genome. Additional proviruses are continually being discovered with the rapid advancement of deep-sequencing and long-read sequencing technologies. HERV-K proviruses are poorly annotated in human transcriptome databases, making their analysis in RNA-seq data difficult. To enable analysis, we compiled the sequences of 91 HERV-K proviruses identified in NCBI GenBank (ID JN675007-JN675097) and created a proviral alignment tool for visualizing RNA-seq reads aligned across individual proviruses. This allowed us to analyse publicly available RNA-seq data from 10 hepatoblastoma samples and 3 normal liver controls (GEO Accession ID: GSE89775). This data report includes the raw FASTA sequence files of the HERV-K proviruses from NCBI, a differential gene expression list between hepatoblastoma samples, and genomic alignment figures from 5 HERV-K proviruses identified as differentially expressed in the companion research article "Upregulation of Human Endogenous Retrovirus-K (HML-2) mRNAs in hepatoblastoma: Identification of potential new immunotherapeutic targets and biomarkers" [1]. The data provided here are available for other research groups interested in evaluating individual HERV-K proviral expression using RNA-seq data. Furthermore, the data analysis is highly flexible and will accommodate the addition of other HERV-K proviruses.
© 2020 Published by Elsevier Inc.

Keywords:  Genomic alignment; Hepatoblastoma; Human endogenous retrovirus-K; Transcriptome analysis

Year:  2020        PMID: 32637500      PMCID: PMC7330144          DOI: 10.1016/j.dib.2020.105895

Source DB:  PubMed          Journal:  Data Brief        ISSN: 2352-3409


Specifications table

Value of the data

Human Endogenous Retrovirus-K (HERV-K) elements are biologically active in many cancers and during fetal development, which makes their evaluation in fetal malignancies especially interesting. These data support one of the first investigations of HERV-K mRNA expression in fetal tumors. The data will assist virologists and immunologists investigating HERV-K mRNA expression in human systems, as well as translational oncologists interested in developing HERV-K as a potential neoantigen and therapeutic target for immune therapy. These data characterize HERV-K mRNA expression in hepatoblastoma; additional experimental validation will determine whether this expression has a potential role as either a tumor marker or an immunotherapeutic target. The data are presented in a flexible, easy-to-modify format, making reproducible analyses in other experimental conditions (e.g. other cancers or biological conditions) quickly feasible.

Data description

The data in this investigation relate to the expression of Human Endogenous Retrovirus-K in hepatoblastoma. It is a companion data manuscript to the research article "Upregulation of Human Endogenous Retrovirus-K (HML-2) mRNAs in hepatoblastoma: Identification of potential new immunotherapeutic targets and biomarkers" [1]. Human Endogenous Retroviruses are a class of genomic elements that resulted from ancient retroviral infection of the human germline. Though often transcriptionally silent, HERVs are biologically active in many cancers [2] as well as during fetal development [3]. We developed an approach to measure HERV-K mRNA using RNA-seq data and examined HERV-K mRNA expression in 10 hepatoblastoma (HB) samples and 3 normal liver controls (NC) using a publicly available RNA-seq dataset (NCBI Biorepository: GEO accession ID GSE89775). We report data on the differential gene expression of hepatoblastomas with high and low HERV-K expression and the ensuing Gene Enrichment Analysis. We also report data on RNA-seq read alignment across specific HERV-K proviruses that were found to be differentially expressed. Supplemental File 1 contains the raw HERV-K FASTA file used to create our transcriptome alignments. Supplemental File 2 contains the full differential gene expression list (775 genes), which includes log2 fold change and p-adjusted values from the comparison of high HERV-K expressing tumors to low HERV-K expressing tumors. A Gene Enrichment analysis of the differential gene expression list was conducted using both Gene Ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) terms. Table 1 includes the GO Molecular Function analysis. Table 2 includes the GO Cellular Localization analysis. Table 3 includes the KEGG functional analysis. For the HERV-K proviruses that were differentially expressed between HB and NC (1q21.3, 3q27.2, 7q22.2, 12q24.33 and 17p13.1), we plotted the read distribution of each sample across the respective provirus.
Fig. 1 represents the RNA-seq read alignments from all samples across provirus 17p13.1 (panel A), 12q24.33 (panel B), 1q21.3 (panel C), 3q27.2 (panel D) and 7q22.2 (panel E). Larger images for each individual provirus in Fig. 1 are provided in Supplemental File 3.
Table 1

Gene Ontology (GO) molecular function analysis following differential gene expression analysis of high HERV-K expressing Hepatoblastoma vs low HERV-K expressing Hepatoblastoma.

Functional Category | Genes in list | Total genes | Enrichment False Discovery Rate (adjusted p-value)
Phospholipid binding | 32 | 441 | 0.021578265
Collagen binding | 10 | 72 | 0.023567822
Lipid binding | 45 | 761 | 0.023567822
Identical protein binding | 92 | 1871 | 0.023567822
Extracellular matrix structural constituent | 16 | 179 | 0.040987689
Growth factor binding | 14 | 150 | 0.043859833
Extracellular matrix binding | 8 | 56 | 0.043859833
Protein kinase binding | 39 | 673 | 0.046968545
Table 2

Gene Ontology (GO) cellular localization analysis following differential gene expression analysis of high HERV-K expressing Hepatoblastoma vs low HERV-K expressing Hepatoblastoma (Top 20 terms).

Functional Category | Genes in list | Total genes | Enrichment FDR
Secretory granule | 80 | 946 | 9.06E-12
Vesicle | 225 | 4252 | 1.39E-11
Secretory vesicle | 87 | 1108 | 1.39E-11
Extracellular region part | 201 | 3693 | 2.14E-11
Vesicle lumen | 45 | 386 | 3.07E-11
Extracellular organelle | 141 | 2326 | 6.29E-11
Cytoplasmic vesicle lumen | 44 | 385 | 6.29E-11
Extracellular exosome | 140 | 2300 | 6.29E-11
Extracellular vesicle | 141 | 2324 | 6.29E-11
Extracellular space | 188 | 3479 | 1.41E-10
Secretory granule lumen | 41 | 367 | 6.38E-10
Extracellular region | 228 | 4617 | 2.16E-09
Cytoplasmic vesicle part | 109 | 1761 | 7.46E-09
Cytoplasmic vesicle | 144 | 2625 | 2.85E-08
Intracellular vesicle | 144 | 2628 | 2.88E-08
Collagen-containing extracellular matrix | 38 | 425 | 1.26E-06
Platelet alpha granule lumen | 14 | 70 | 1.92E-06
Extracellular matrix | 44 | 551 | 2.64E-06
Endomembrane system | 225 | 4988 | 6.27E-06
Lysosome | 54 | 797 | 1.78E-05
Table 3

Kyoto Encyclopedia of Genes and Genomes Enrichment Analysis following differential gene expression analysis of high HERV-K expressing Hepatoblastoma vs low HERV-K expressing Hepatoblastoma.

Functional Category | Genes in list | Total genes | Enrichment False Discovery Rate (adjusted p-value)
Amoebiasis | 14 | 96 | 0.000793149
Complement and coagulation cascades | 12 | 78 | 0.001119036
Fatty acid degradation | 8 | 44 | 0.005062775
Legionellosis | 9 | 55 | 0.005062775
Peroxisome | 10 | 82 | 0.012590092
Focal adhesion | 17 | 199 | 0.012590092
Human papillomavirus infection | 24 | 330 | 0.012590092
PI3K-Akt signaling pathway | 24 | 353 | 0.020654396
Rheumatoid arthritis | 10 | 89 | 0.020654396
ECM-receptor interaction | 9 | 82 | 0.034941883
AGE-RAGE signaling pathway in diabetic complications | 10 | 100 | 0.034941883
Epithelial cell signaling in Helicobacter pylori infection | 8 | 68 | 0.034941883
Salmonella infection | 9 | 85 | 0.036101621
Regulation of actin cytoskeleton | 16 | 214 | 0.036577774
Tryptophan metabolism | 6 | 42 | 0.038157306
Oocyte meiosis | 11 | 124 | 0.041198024
IL-17 signaling pathway | 9 | 92 | 0.047460432
Toxoplasmosis | 10 | 111 | 0.04997533
Fig. 1

Graphical representation of uniquely aligned reads across HERV-K proviruses (A) 17p13.1, (B) 12q24.33, (C) 1q21.3, (D) 3q27.2 and (E) 7q22.2, created in the bioinformatics platform Geneious. The x-axis represents the genomic position along the provirus. Major annotated regions of the proviral genome at each provirus are illustrated at the bottom of the panel. Coding regions for the viral proteins Gag, Pro, Pol, Env, Rec or Np9 are represented by green bars, but these do not necessarily imply an open reading frame for the protein. Individual reads from each sample are represented on the y-axis. Abbreviations: FT, fetal tumor (hepatoblastoma); NC, normal control (liver).


Experimental design, materials, and methods

HERV-K database

Approximately 90 HERV-K proviruses have been identified to date in the human genome. However, HERV-K proviruses are currently not well annotated in human transcriptome databases. This makes quantifying HERV-K mRNA expression difficult using standard RNA-seq pipelines, which rely on gene annotation for quantification. We searched the NCBI Data Repository for HERV-K proviral sequences, excluding solo long terminal repeats (LTRs). The search resulted in 91 HERV-K proviruses (GenBank ID JN675007-JN675097) [4]. Using the sequence of each provirus, we created a HERV-K FASTA file. We then employed two separate analytical pipelines for RNA-seq analysis: one for HERV-K mRNA quantification and differential gene expression, and a second for proviral alignment and visualization. Both are described in detail below.
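As a rough illustration of how the accession range quoted above expands to 91 records, the following Python sketch enumerates the GenBank IDs and builds a single NCBI E-utilities efetch request for them in FASTA format. The function names are hypothetical, and the source does not state how the sequences were actually retrieved, so this is only one possible way to assemble the HERV-K FASTA file.

```python
from urllib.parse import urlencode

PREFIX = "JN"
FIRST, LAST = 675007, 675097  # accession range quoted in the text


def herv_k_accessions(first=FIRST, last=LAST, prefix=PREFIX):
    """Return every accession in the inclusive range JN675007 ... JN675097."""
    return [f"{prefix}{number}" for number in range(first, last + 1)]


def efetch_fasta_url(accessions):
    """Build an E-utilities efetch URL that requests the records as FASTA text."""
    query = urlencode({
        "db": "nuccore",
        "id": ",".join(accessions),
        "rettype": "fasta",
        "retmode": "text",
    })
    return "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?" + query


accessions = herv_k_accessions()
url = efetch_fasta_url(accessions)
print(len(accessions), accessions[0], accessions[-1])  # 91 JN675007 JN675097
```

The URL is only constructed here, not fetched; downloading it (for example with `urllib.request.urlopen`) would be expected to return the sequences as one multi-record FASTA file.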

Hepatoblastoma dataset (publicly available)

For the analysis in this investigation, we utilized a publicly available RNA-seq dataset of 10 hepatoblastoma samples and 3 normal liver controls. The data were generated as part of a larger investigation to identify activated cancer pathways in aggressive hepatoblastoma [5]. The raw sequencing data are available from the NCBI biorepository (GEO accession ID GSE89775). The raw .fastq files were downloaded using the NCBI Sequence Read Archive (SRA) Toolkit. Following download, the data were analyzed with the program FASTQC and were confirmed to be from strand-specific, 100 bp paired-end libraries containing approximately 40 M reads per sample (http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc). Trimmomatic was used to remove Illumina adaptors and low-quality reads and to ensure a minimum read length of 50 bp [6].
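The 50 bp minimum-length criterion can be illustrated with a small Python sketch. It mimics only the length check in paired-end mode, where a pair stays "paired" if both mates pass and a lone surviving mate is routed to an unpaired output, as Trimmomatic does. Adapter clipping and quality trimming are not modelled, and the function name and example reads are hypothetical.

```python
MIN_LEN = 50  # minimum read length stated in the text


def split_pairs_by_length(pairs, min_len=MIN_LEN):
    """Sort read pairs into paired and unpaired outputs by mate length.

    `pairs` is an iterable of (read_id, forward_seq, reverse_seq) tuples.
    """
    paired, forward_only, reverse_only = [], [], []
    for read_id, forward_seq, reverse_seq in pairs:
        forward_ok = len(forward_seq) >= min_len
        reverse_ok = len(reverse_seq) >= min_len
        if forward_ok and reverse_ok:
            paired.append(read_id)
        elif forward_ok:
            forward_only.append(read_id)
        elif reverse_ok:
            reverse_only.append(read_id)
        # Pairs in which both mates are too short are dropped entirely.
    return paired, forward_only, reverse_only


# Hypothetical reads: lengths chosen to exercise each outcome.
example = [
    ("r1", "A" * 100, "C" * 100),  # both mates pass
    ("r2", "A" * 100, "C" * 30),   # only the forward mate passes
    ("r3", "A" * 49, "C" * 50),    # only the reverse mate passes
    ("r4", "A" * 10, "C" * 10),    # neither mate passes
]
paired, forward_only, reverse_only = split_pairs_by_length(example)
print(paired, forward_only, reverse_only)  # ['r1'] ['r2'] ['r3']
```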

Proviral quantification and differential expression

We concatenated the HERV-K FASTA file onto the human cDNA transcriptome from Ensembl (GRCh38.95) (available at ftp://ftp.ensembl.org/pub/release-95/fasta/homo_sapiens/cdna/). The concatenated file allowed us to analyze HB and NC RNA-seq reads aligned to the full human transcriptome as well as HERV-K proviral loci. We used the alignment program Salmon [7] in mapping-based mode with the validateMappings flag to create a count matrix over the full human transcriptome including the concatenated HERV-K sequences (example code: salmon quant -i GRCh38_HERVK.fa -l A -1 NC_1_1.fq -2 NC_1_2.fq --validateMappings -o NC_1_quant). Transcript abundance estimates from Salmon were imported into R (version 3.5.1) using tximport [8]. Gene abundance estimates were normalized for sequencing depth using DESeq2 [9]. We then focused on the read counts assigned to HERV-K loci and performed a differential gene expression analysis also using DESeq2. A p-adjusted value less than 0.05 (calculated using Benjamini-Hochberg False Discovery Rate) and an absolute value of log2 fold change greater than 1.5 were considered significant [10].
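The significance rule above (Benjamini-Hochberg adjusted p-value below 0.05 and absolute log2 fold change above 1.5) can be sketched in plain Python. This is not the DESeq2 implementation: DESeq2 also applies independent filtering before adjustment, so its p-adjusted values can differ. The locus names, p-values and fold changes below are invented for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def is_significant(padj, log2_fold_change, alpha=0.05, lfc_cutoff=1.5):
    """Apply the two thresholds quoted in the text."""
    return padj < alpha and abs(log2_fold_change) > lfc_cutoff


# Hypothetical loci with made-up raw p-values and log2 fold changes.
loci = ["locus_A", "locus_B", "locus_C", "locus_D", "locus_E"]
raw_p = [0.001, 0.008, 0.039, 0.041, 0.60]
log2_fc = [2.3, -1.8, 3.0, 0.4, 2.0]

padj = benjamini_hochberg(raw_p)
significant = [
    name for name, p, lfc in zip(loci, padj, log2_fc) if is_significant(p, lfc)
]
print(significant)  # ['locus_A', 'locus_B']
```

In this toy example, locus_C has a large fold change but an adjusted p-value just above 0.05, so it is excluded. Both criteria must hold at once.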

Gene enrichment analysis

Hepatoblastoma samples demonstrated heterogeneity in overall HERV-K expression levels and were sub-classified as high HERV-K expressing tumors and low HERV-K expressing tumors. A differential gene expression analysis between the 3 highest HERV-K expressing tumors and the 3 lowest HERV-K expressing tumors was conducted in DESeq2 as described above. A Gene Enrichment analysis of the differential gene expression list was conducted using both Gene Ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) terms. The analysis was performed using the clusterProfiler package in R [11]. Significantly enriched terms were determined by a False Discovery Rate < 0.05.
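Over-representation analyses of this kind rest on a hypergeometric test: given a gene list of size n drawn from a background of N genes, how surprising is it to see k or more genes from a term that annotates K background genes? The Python sketch below computes that tail probability. The background size of 20,000 genes is an assumption, as is reading the Amoebiasis row of Table 3 as 14 genes in the list out of 96 annotated genes and using all 775 differentially expressed genes as the list size. The result is an unadjusted p-value, not the FDR reported in the tables.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """Return P(X >= k) for X ~ Hypergeometric(N, K, n).

    k: genes in the list annotated to the term
    n: size of the gene list
    K: genes annotated to the term in the background
    N: size of the background
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Small case that can be checked by hand: (6*6 + 4*1) / 120 = 1/3.
small = hypergeom_enrichment_p(k=2, n=3, K=4, N=10)

# Illustrative case; N = 20000 is an assumed background size.
p_value = hypergeom_enrichment_p(k=14, n=775, K=96, N=20000)
print(small, p_value)
```

With these assumed inputs the expected overlap is about 775 × 96 / 20000 ≈ 3.7 genes, so observing 14 gives a small tail probability.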

Proviral alignment and visualization

We utilized the HERV-K FASTA file to create a positional index using the alignment program HISAT2 (example code: hisat2-build HB_Data/HERVK_Genome.FASTA HERVK_Genome_tran) [12]. We aligned the HB and NC samples to the HISAT2-HERV-K index to create .BAM files. Uniquely mapped reads were selected with SAMtools (MAPQ Score >= 50). We imported the uniquely aligned .BAM files into the bioinformatics and genomic visualization platform Geneious (Biomatters, Auckland, New Zealand). For the HERV-K proviruses that were differentially expressed between HB and NC, we plotted read distribution of each sample across the respective provirus.
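The MAPQ-based selection of uniquely mapped reads can be illustrated on SAM-formatted text with a few lines of Python. The same filter on a BAM file is normally done with `samtools view -b -q 50`. The reference names, lengths and reads below are hypothetical, and the helper names are not from the source.

```python
MIN_MAPQ = 50  # threshold quoted in the text


def filter_sam_by_mapq(sam_lines, min_mapq=MIN_MAPQ):
    """Keep header lines and alignment records with MAPQ >= min_mapq."""
    kept = []
    for line in sam_lines:
        if line.startswith("@"):
            kept.append(line)  # header lines pass through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[4]) >= min_mapq:  # column 5 of a SAM record is MAPQ
            kept.append(line)
    return kept


def reads_per_reference(sam_lines):
    """Count alignment records per reference name (SAM column 3)."""
    counts = {}
    for line in sam_lines:
        if line.startswith("@"):
            continue
        reference = line.split("\t")[2]
        counts[reference] = counts.get(reference, 0) + 1
    return counts


# Hypothetical SAM content: two proviral references and five reads.
sam = [
    "@SQ\tSN:provirus_A\tLN:9000",
    "@SQ\tSN:provirus_B\tLN:9000",
    "r1\t0\tprovirus_A\t100\t60\t100M\t*\t0\t0\t*\t*",
    "r2\t0\tprovirus_A\t250\t1\t100M\t*\t0\t0\t*\t*",
    "r3\t16\tprovirus_B\t400\t60\t100M\t*\t0\t0\t*\t*",
    "r4\t0\tprovirus_B\t900\t0\t100M\t*\t0\t0\t*\t*",
    "r5\t0\tprovirus_A\t1200\t60\t100M\t*\t0\t0\t*\t*",
]
unique = filter_sam_by_mapq(sam)
print(reads_per_reference(unique))  # {'provirus_A': 2, 'provirus_B': 1}
```

Reads r2 and r4 carry low MAPQ values, as multi-mapping reads typically do, and are discarded; only confidently placed reads contribute to the per-provirus counts.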

Ethics statement

The RNA-sequencing data utilized in this study are publicly available genomic data from the National Center for Biotechnology Information (Accession number: GSE89775). They were not generated at our institution. They are de-identified data that meet all criteria for exemption as described by Human Subjects Research Exemption 45 CFR 46.101(b)(4) for Existing Data, Documents, Records and Specimens.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.

Specifications table

Subject | Immunology and Microbiology: Virology
Specific subject area | Human Endogenous Retroviruses and Oncology
Type of data | Tables; Figures; Additional data: text file (FASTA) of genomic sequences
How data were acquired | Bioinformatic analysis of HERV-K elements in RNA-seq data (Salmon, HISAT2, DESeq2)
Data format | Raw and Analysed
Parameters for data collection | 91 HERV-K proviral sequences contained in the NCBI Data Repository (GenBank ID JN675007-JN675097) were concatenated into a single FASTA file. The HERV-K FASTA file was used to analyze a publicly available RNA-seq dataset of Hepatoblastoma and Normal Liver Controls.
Description of data collection | The HERV-K FASTA file was used to perform a standard differential gene expression analysis across conditions with ensuing gene enrichment analysis (GO & KEGG) as well as a positional alignment analysis of RNA-seq reads across individual proviruses.
Data source location | University of Virginia School of Medicine, Charlottesville, Virginia, United States
Data accessibility | With the article
Related research article | David F Grabski, Aakrosh Ratan, Laurie R Gray, Stefan Bekiranov, David Rekosh, Marie-Louise Hammarskjold, Sara K Rasmussen; Upregulation of Human Endogenous Retrovirus-K (HML-2) mRNAs in hepatoblastoma: Identification of potential new immunotherapeutic targets and biomarkers; Journal of Pediatric Surgery; Submitted.
  11 in total

1.  clusterProfiler: an R package for comparing biological themes among gene clusters.

Authors:  Guangchuang Yu; Li-Gen Wang; Yanyan Han; Qing-Yu He
Journal:  OMICS       Date:  2012-03-28

2.  DNA microarray data imputation and significance analysis of differential expression.

Authors:  Rebecka Jörnsten; Hui-Yu Wang; William J Welsh; Ming Ouyang
Journal:  Bioinformatics       Date:  2005-08-23       Impact factor: 6.937

3.  Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown.

Authors:  Mihaela Pertea; Daehwan Kim; Geo M Pertea; Jeffrey T Leek; Steven L Salzberg
Journal:  Nat Protoc       Date:  2016-08-11       Impact factor: 13.491

4.  Identification, characterization, and comparative genomic distribution of the HERV-K (HML-2) group of human endogenous retroviruses.

Authors:  Ravi P Subramanian; Julia H Wildschutte; Crystal Russo; John M Coffin
Journal:  Retrovirology       Date:  2011-11-08       Impact factor: 4.602

5.  Close to the Bedside: A Systematic Review of Endogenous Retroviruses and Their Impact in Oncology.

Authors:  David F Grabski; Yinin Hu; Monika Sharma; Sara K Rasmussen
Journal:  J Surg Res       Date:  2019-03-29       Impact factor: 2.417

6.  Intrinsic retroviral reactivation in human preimplantation embryos and pluripotent cells.

Authors:  Edward J Grow; Ryan A Flynn; Shawn L Chavez; Nicholas L Bayless; Mark Wossidlo; Daniel J Wesche; Lance Martin; Carol B Ware; Catherine A Blish; Howard Y Chang; Renee A Reijo Pera; Joanna Wysocka
Journal:  Nature       Date:  2015-04-20       Impact factor: 49.962

7.  Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.

Authors:  Michael I Love; Wolfgang Huber; Simon Anders
Journal:  Genome Biol       Date:  2014       Impact factor: 13.583

8.  Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences.

Authors:  Charlotte Soneson; Michael I Love; Mark D Robinson
Journal:  F1000Res       Date:  2015-12-30

9.  Loss of EGFR-ASAP1 signaling in metastatic and unresectable hepatoblastoma.

Authors:  Sarangarajan Ranganathan; Mylarappa Ningappa; Chethan Ashokkumar; Brandon W Higgs; Jun Min; Qing Sun; Lori Schmitt; Shankar Subramaniam; Hakon Hakonarson; Rakesh Sindhi
Journal:  Sci Rep       Date:  2016-12-02       Impact factor: 4.379

10.  Trimmomatic: a flexible trimmer for Illumina sequence data.

Authors:  Anthony M Bolger; Marc Lohse; Bjoern Usadel
Journal:  Bioinformatics       Date:  2014-04-01       Impact factor: 6.937
