Literature DB >> 30835723

New functionalities in the TCGAbiolinks package for the study and integration of cancer data from GDC and GTEx.

Mohamed Mounir¹, Marta Lucchetta¹, Tiago C Silva², Catharina Olsen^3,4, Gianluca Bontempi^3,4, Xi Chen^5,6, Houtan Noushmehr^2,7, Antonio Colaprico^3,4,6, Elena Papaleo^1,8.

Abstract

The advent of Next-Generation Sequencing (NGS) technologies has opened new perspectives in deciphering the genetic mechanisms underlying complex diseases. Nowadays, the amount of genomic data is massive and substantial efforts and new tools are required to unveil the information hidden in the data. The Genomic Data Commons (GDC) Data Portal is a platform that contains different genomic studies including the ones from The Cancer Genome Atlas (TCGA) and the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) initiatives, accounting for more than 40 tumor types originating from nearly 30000 patients. Such platforms, although very attractive, must make sure the stored data are easily accessible and adequately harmonized. Moreover, they have the primary focus on the data storage in a unique place, and they do not provide a comprehensive toolkit for analyses and interpretation of the data. To fulfill this urgent need, comprehensive but easily accessible computational methods for integrative analyses of genomic data that do not renounce a robust statistical and theoretical framework are required. In this context, the R/Bioconductor package TCGAbiolinks was developed, offering a variety of bioinformatics functionalities. Here we introduce new features and enhancements of TCGAbiolinks in terms of i) more accurate and flexible pipelines for differential expression analyses, ii) different methods for tumor purity estimation and filtering, iii) integration of normal samples from other platforms iv) support for other genomics datasets, exemplified here by the TARGET data. Evidence has shown that accounting for tumor purity is essential in the study of tumorigenesis, as these factors promote confounding behavior regarding differential expression analysis. With this in mind, we implemented these filtering procedures in TCGAbiolinks. Moreover, a limitation of some of the TCGA datasets is the unavailability or paucity of corresponding normal samples. We thus integrated into TCGAbiolinks the possibility to use normal samples from the Genotype-Tissue Expression (GTEx) project, which is another large-scale repository cataloging gene expression from healthy individuals. The new functionalities are available in the TCGAbiolinks version 2.8 and higher released in Bioconductor version 3.7.

Entities: Chemical Disease Gene Species

Mesh：

Year: 2019 PMID： 30835723 PMCID： PMC6420023 DOI： 10.1371/journal.pcbi.1006701

Source DB: PubMed Journal: PLoS Comput Biol ISSN： 1553-734X Impact factor: 4.475

This is a PLOS Computational Biology Software paper.

Introduction

Cancer is among the leading causes of mortality worldwide. It is a complex disease where multiple different mechanisms are at play all at once. This complexity also arises from the fact that cancer is extremely heterogeneous and can exist in distinct forms where each cancer type or subtype can be characterized by different molecular profiles with possible consequences on treatment and prognosis for the patient [1,2]. Advances in next-generation sequencing are currently making a massive amount of data available via the profiling of samples from cancer patients [3-7]. In this context, numerous large-scale studies have been conducted using state-of-the-art genome analysis technologies. One of the most important examples is The Cancer Genome Atlas (TCGA), which started in 2006 as a pilot project aiming to collect and conduct analyses on an unprecedented amount of clinical and molecular data including over 33 tumor types spanning over 11,000 patients. This project has subsequently generated more than 2.5 petabytes of publicly available data over the past decade [8,9]. Publicly funded by The National Institute of Health (NIH), TCGA has made numerous discoveries regarding genomic and epigenomic alterations that are candidate drivers for cancer development. This was achieved through the creation of an "atlas" and by applying large-scale genome-wide sequencing and multidimensional analyses. These efforts have significantly contributed to high-quality oncology studies, either led by the TCGA research network or other independent researchers [10], which recently culminated in 27 original publications from the Pan-Cancer TCGA initiative [11]. In 2016, TCGA was moved under the umbrella of the broader repository Genomic Data Commons (GDC) Data Portal [12] together with other studies. TCGA offers two versions of public data: legacy and harmonized. The legacy data is an unmodified collection of data that was previously maintained by the Data Coordinating Center (DCC) using GRCh36 (hg18) and GRCh37 (hg19) as genome reference assemblies. On the other hand, the harmonized version provides data that has been fully harmonized using GRCh38 (hg38) as a reference genome available through the GDC portal. Many tools have been developed to interface with TCGA data [13-25] and to help with the aggregation, pre- and post-processing of the datasets. Among them, TCGAbiolinks was developed as an R/Bioconductor package to address the challenges of comprehensive analyses of TCGA data [19,20,26]. Software packages such as TCGAbiolinks regularly require enhancements and revisions in light of new biological or methodological evidence from the literature or new computational requirements imposed by the platforms where the data are stored. For example, it is well-recognized that the tumor microenvironment also includes non-cancerous cells of which a large proportion are immune cells or cells that support blood vessels and other normal cells [27,28]. These components can ultimately alter the outcome of genomic analyses and the biological interpretation of the results. Recently, an extensive effort was made to systematically quantify tumor purity with a variety of diverse methods integrated into a consensus approach across TCGA cancer types [29], which the tools for analyses of TCGA data should employ. Other cancer genomic initiatives have been following the TCGA model, such as Therapeutically Applicable Research to Generate Effective Treatments (TARGET), which is an NCI-funded project conducting a large-scale study that seeks to unravel novel therapeutic targets, biomarkers, and drug targets in childhood cancers by comprehensive molecular characterization and understanding of the genomic landscape in pediatric malignancies [30]. Comprehensive support for the analyses of different genomic datasets with the same workflow is thus essential for both reproducibility and harmonization of the results. Lastly, it is common practice to use adjacent tissue showing normal characteristics at a macroscopic or histological level as a control. This advantageous practice concerning time-efficiency and reduction of patient-specific bias is based on the assumption that these samples are truly normal. Nevertheless, a tissue that is in the vicinity of or adjacent to a highly genetically abnormal tumor is likely to show cancer-related molecular aberrations [31], biasing the comparison. Moreover, circulating biomolecules, originating from cancer cells, can be taken in by the surrounding normal-like cells and alter their gene expression and processes. TCGA includes non-tumor samples from the same cancer participants. Furthermore, the pool of TCGA normal samples is often limited or lacking in TCGA projects. In this context, initiatives such as Recount [32], Recount2 [33] and RNASEQDB [34] where TCGA data were integrated with normal healthy samples from the Genotype-Tissue Expression (GTEx) project [35] have the potential to boost the comparative analyses especially for those TCGA datasets where normal samples are underrepresented or unavailable. In light of recent discoveries on the impact of tumor purity quantification on the samples under investigation [29], the need for a more substantial amount of normal samples [33], as well as the implementation of robust and statistically sound workflows for differential expression analyses [36,37] and exploration of potential sources of batch effects [38], we present new key features and enhancements that we implemented in TCGAbiolinks version 2.8 and higher.

Results

Overview of TCGAbiolinks

For the sake of clarity, we will briefly introduce the main functions of TCGAbiolinks that are extensively discussed in the original publication and a recently published workflow [19,20]. We advise referring directly to these publications and to the vignette on Bioconductor for more details about the basic functionalities. The data retrieval is handled by the three main TCGAbiolinks functions: GDCquery, GDCdownload and GDCprepare and allows the user to interface with three main platforms: i) TCGA, ii) TARGET and, iii) The Cancer Genome Characterization Initiative (CGCI) (https://ocg.cancer.gov/programs/cgci). TCGAbiolinks also allows the user to interface with different -omics data including genomics and transcriptomics, clinical and pathological data, information on drug treatments, and subtypes. GDCprepare allows the user to prepare the gene expression data for downstream analyses. This step is done by restructuring the data into a SummarizedExperiment (SE) object [39] that is easily manageable and integrable with other R/Bioconductor packages or just as a dataframe for other forms of data manipulation, which the user can operate even decoupled from the TCGAbiolinks package. Moreover, TCGAbiolinks offers the option to apply normalization methods with the function TCGAanalyze_Normalization adopting the EDASeq protocol [40], to apply between-lane normalization to adjust for distributional differences between samples or within-lane normalization (to account for differences in GC content and gene length). To guide result interpretation, the TCGAvisualize function allows the user to generate the plots required for a comprehensive view of the analyzed data using mostly the ggplot2 package that has incremental layer options (such as principal component analysis, pathway enrichment analysis etc.) [41]. We extended TCGAbiolinks with new functionalities and methods that could boost the analyses of genomic data while at the same time not necessarily limiting these functionalities to just the TCGA initiative.

Towards a more generalized analyses of genomic data in GDC

TCGAbiolinks was initially conceived to interact with TCGA data, but the same workflow could be in principle extended to other datasets if the functions to handle their differences in formats and data availability are properly handled. Thus, we worked to support the SE format for other GDC datasets, such as the ones from the TARGET consortium which is included in TCGAbiolinks version 2.8. The SE object provides the advantage of collecting clinical information on the samples (such as patient gender, age and treatments) and on genes (ENSEMBL and ENTREZ IDs). One of the major problems in the study of genomic data is that they are often stored in unconnected silos which can lead to the of stalling of advancements in the analyses [42]. The design of the GDCprepare function of TCGAbiolinks thus nicely fulfills the need for standardized and harmonized ways to process data from different genomics initiatives which could find common storage in the GDC portal. Moreover, we provide the possibility to integrate data from external sources and carry out joint analyses with the GDC dataset (see the new TCGAbatch_correction function below).

Handling batch corrections in TCGAbiolinks: TCGAbatch_Correction

High-throughput sequencing and other -omics experiments are subject to unwanted sources of variability due to the presence of hidden variables and heterogeneity. Samples are processed through different protocols, depending on the practices followed by each independent laboratory, involving time factors and multiple people orchestrating the genomic experiments. Known as batch effects, these sources of heterogeneity can have severe impacts on the results by statistically or biologically compromising the validity of the research [38,43,44]. Here, we created the TCGAbatch_Correction function to address and correct for different potential sources of batch effects linked to TCGA gene expression data using the sva package in R [38]. The sva package provides a framework for removing artifacts either by (i) estimating surrogate variables that introduce unwanted variability into high-throughput, high-dimensional datasets or (ii) using the ComBat function that employs an empirical Bayesian framework to remove batch effects related to known sources [44]. Modeling for known batch effects significantly helps to improve results by stabilizing error rates and reducing dependence on surrogates. In this context, TCGAbatch_Correction takes GDC gene expression data as input, extracts all the needed metadata by parsing barcodes, corrects for a user-specified batch factor, and also adjusts for any selected cofactor. In cases where the investigator is not interested in correcting for batch effects with ComBat or this step is discouraged for the downstream analyses, the voom (an acronym for variance modeling at the observational level) transformation can be applied to carry out normal-based statistics on RNA-Seq gene counts [36] (see below). The TCGAbatch_Correction function also generates plots to compare the parametric estimates for the distribution of batch effects across genes and their kernel estimates. Moreover, the so-called Q-Q plots can be produced showing the empirical data of ranked batch effects on each gene compared to their parametric estimate. Before applying batch effect corrections, one should investigate if there is any evidence of extreme differences between the kernel and the parametric estimates. Such differences can show up as bimodality or severe skewness and are due to the inability of the parametric estimation to pick up the empirical kernel behavior (an example is provided in the case study on breast cancer below and is discussed in Fig 1).

Fig 1

Example of the exploration of batch effects.

Example of the exploration of batch effects.

Four plots generated by ComBat to correct for batch effects. For the left panel plots, the red lines are the parametric estimates, and the black lines are the kernel estimates for the distribution of effects across genes. The right panel shows Q-Q plots with the red line for the parametric estimate and the ordered batch effects for each gene (black points). The bottom plots show the analyses for the variances and the top plots refers to the means. Plots were generated for batches TSS E9 and E2 to avoid batches containing only one sample. Additionally, to make TCGA data useful in a broader context, we included the possibility of integrating data from external sources or unpublished data in the context of publicly available datasets such as the ones in the GDC portal. To reach this goal, we have provided the possibility within the TCGAbatch_Correction function to integrate gene expression data from external sources (e.g GEO or unpublished datasets) and obtain a merged dataframe that can be used for further analysis within the TCGAbiolinks pipeline such as differential expression analysis. Nevertheless, we recommend the user to proceed with extreme caution with regards to the downstream analyses and to include the proper steps for batch corrections and harmonization of the data when they come from different sources. It is also important to rely on data that have been collected with the same technique and possibly the same instrument. We provide an example for illustrative purposes only to handle the integration of datasets from external sources with TCGA data. The TCGAbatch_Correction function can be used to correct the integrated data for a common batch factor. In this example, we integrated the TCGA Lung Adenocarcinoma (LUAD) with the GEO dataset GSE60052 [45] where RNA-seq data are available for 79 samples from Small Cell Lung Cancer (SCLC) tissues and 7 normal controls. We restricted our analysis to only tumor samples in both datasets since there were no clear annotations for the normal samples on the GEO dataset. We queried, downloaded, and pre-processed the TCGA-LUAD data according to the workflow used in case study 1 and 2 (see below). We log2-transformed the TCGA data to make them comparable with the GEO data, which were released as log-transformed values. We decided to correct the data according to the year when the sample was taken since it is the only factor in common and a suitable candidate to correct for technical variability in this example. We retrieved the sample year from the downloaded TCGA clinical data using the GDCquery_clinic function. The GEO clinical data has been released as supplementary material to the original publication (Table S1 in [45]). In particular, we selected all the tumor samples taken from 2010 to 2012 (three batches in total) in both datasets. We also ensured that more than one sample was available for each batch. The tumor samples which fulfilled the chosen batch criterion were 50 and 21 in TCGA and GEO, respectively. Since TCGA includes 17400 and GEO 15711 genes, we selected only the features in common (15711) by converting the TCGA Ensembl IDs to gene names using the information stored in the SummarizedExperiment object, retrieved through the rowData function. We then merged the two datasets and created the corresponding batch information. This information was then provided as input to the TCGAbatch_Correction function to produce the integrated year-corrected matrix. The script to reproduce this example is available in the GitHub repository associated to this publication (https://github.com/ELELAB/TCGAbiolinks_examples). We would like to stress the fact that this is just an example to show how the function works. In a real case study, the best course of action would be to process the external (GEO) data and the TCGA data through the same pipeline, starting from the external raw data and calculating the read count as it is done in the harmonized or legacy version of the TCGA data, depending on the dataset of interest for the comparison.

TCGA_MolecularSubtype

Although each cancer is believed to be a single disease, advances in the genomic field now indicate that each cancer type is much more heterogeneous than previously thought and that different subtypes can be identified. Bioinformatics applied to genomics data can enable a molecular understanding of the tumors across different cancer subtypes. Instead of binning all cases and patients into a single category, differentiating the intrinsic subtypes of each cancer has provided efficient, targeted, treatment strategies and prognoses. Cancer subtypes can be defined according to histology or molecular profiles. Tables with general annotations from the TCGA publications on classifications of the patients are provided by the TCGAquery_subtype function [19]. However, the format of these data is not so easy to navigate or integrate within other functions. For this reason, we designed a new function TCGA_MolecularSubtype to retrieve information on manually curated molecular subtypes for a total of 24 cancer types (Table 1). Collectively, we have molecular subtype annotations for 7734 individuals. The function also allows fetching of the subtype information not only for each cancer type, but also for each TCGA barcode (i.e. for each individual sample). The information used to classify cancer subtypes is the one used (and most recently published) by the Pan-Cancer works from the TCGA consortium (http://bioinformaticsfmrp.github.io/TCGAbiolinks/subtypes.html#pancanceratlas_subtypes:_curated_molecular_subtypes). As an alternative, there is also the PanCancerAtlas_subtypes function. These new functions have the advantage that the data are manually curated from each TCGA cancer type marker paper and are thus up to date when a new paper from the TCGA research network is published and reported in https://gdc.cancer.gov/about-data/publications.

Table 1

Information on molecular subtypes for TCGA cancer studies as provided by the TCGA_MolecularSubtype function.

TCGA Abbreviation	Cancer type	Number of samples	Subtypes Selected
ACC	Adrenocortical carcinoma	91	ACC.CIMP-high, ACC.CIMP-intermediate, ACC.CIMP-low
AML	Acute Myeloid Leukemia	187	AML.1, AML.2, AML.3, AML.4, AML.5, AML.6, AML.7
BLCA	Bladder Urothelial Carcinoma	129	BLCA.1, BLCA.2, BLCA.3, BLCA.4
BRCA	Breast invasive carcinoma	1218	BRCA.Basal, BRCA.Her2, BRCA.LumA, BRCA.LumB, BRCA.Normal
COAD	Colon adenocarcinoma	341	GI.CIN, GI.GS, GI.HM-indel, GI.HM-SNV
ESCA	Esophageal carcinoma	169	GI.CIN, GI.ESCC, GI.GS, GI.HM-indel, GI.HM-SNV
GBM	Glioblastoma multiforme	606	GBM_LGG.Classic-like, GBM_LGG.Codel, GBM_LGG.G-CIMP-high, GBM_LGG.G-CIMP-low, GBM_LGG.LGm6-GBM, GBM_LGG.Mesenchymal-like
HNSC	Head and Neck squamous cell carcinoma	279	HNSC.Atypical, HNSC.Basal, HNSC.Classical, HNSC.Mesenchymal
KICH	Kidney Chromophobe	66	KICH.Eosin.0, KICH.Eosin.1
KIRC	Kidney renal clear cell carcinoma	442	KIRC.1, KIRC.2, KIRC.3, KIRC.4
KIRP	Kidney renal papillary cell carcinoma	161	KIRP.C1, KIRP.C2a, KIRP.C2b, KIRP.C2c - CIMP
LGG	Brain Lower Grade Glioma	516	GBM_LGG.Classic-like, GBM_LGG.Codel, GBM_LGG.G-CIMP-high, GBM_LGG.G-CIMP-low, GBM_LGG.Mesenchymal-like, GBM_LGG.PA-like
LIHC	Liver hepatocellular carcinoma	196	LIHC.iCluster:1, LIHC.iCluster:2, LIHC.iCluster:3
LUAD	Lung adenocarcinoma	230	LUAD.1, LUAD.2, LUAD.3, LUAD.4, LUAD.5, LUAD.6
LUSC	Lung squamous cell carcinoma	178	LUSC.basal, LUSC.classical, LUSC.primitive, LUSC.secretory
OVCA	Ovarian serous cystadenocarcinoma	489	OVCA.Differentiated, OVCA.Immunoreactive, OVCA.Mesenchymal, OVCA.Proliferative
PCPG	Pheochromocytoma and Paraganglioma	178	PCPG.Cortical admixture, PCPG.Pseudohypoxia, PCPG.Wnt-altered
PRAD	Prostate adenocarcinoma	333	PRAD.1-ERG, PRAD.2-ETV1, PRAD.3-ETV4, PRAD.4-FLI1, PRAD.5-SPOP, PRAD.6-FOXA1, PRAD.7-IDH1, PRAD.8-other
READ	Rectum adenocarcinoma	118	GI.CIN, GI.GS, GI.HM-indel, GI.HM-SNV
SKCM	Skin Cutaneous Melanoma	333	SKCM.-, SKCM.BRAF_Hotspot_Mutants, SKCM.NF1_Any_Mutants, SKCM.RAS_Hotspot_Mutants, SKCM.Triple_WT
STAD	Stomach adenocarcinoma	383	GI.CIN, GI.EBV, GI.GS, GI.HM-indel, GI.HM-SNV
THCA	Thyroid carcinoma	496	THCA.1, THCA.2, THCA.3, THCA.4, THCA.5
UCEC	Uterine Corpus Endometrial Carcinoma	538	UCEC.CN_HIGH, UCEC.CN_LOW, UCEC.MSI, UCEC.POLE
UCS	Uterine Carcinosarcoma	57	UCS.1, UCS.2

Recently, we showed the advantage of using these functions to have a curated matrix in one single place for all of the subtypes. In particular, it has been applied to identify associations between molecular subtypes and the stemness index [46] and the immune subtypes [47] of TCGA samples.

TCGAtumor_purity

The tumor microenvironment encompases cellular and non-cellular units that play a critical role in the initiation, progression, and metastasis of the tumor [27,29,48-50]. An important concept to remember from the TME definition is that tumor purity is described as the proportion of carcinoma cells in a tumor sample. In previous times, tumor purity used to be estimated through visual inspection with the assistance of a pathologist and by image analysis. Nowadays, with the advent of computational methods and the use of genomic features such as somatic mutations, DNA methylation, and somatic copy-number variation (CNV), it is feasible to estimate tumor purity [27]. To account for tumor purity in the TCGAbiolinks workflow, we designed the TCGAtumor_purity function that filters data according to one of the following five methods: i) ESTIMATE (Estimation of Stromal and Immune cells in Malignant Tumor tissues using Expression data) [49]; ii) ABSOLUTE to infer tumor purity from the analysis of somatic DNA aberrations [50]; iii) LUMP (Leukocytes Unmethylation) that uses the average of 44 detected non-methylated immune-specific CpG site; iv) IHC, that uses hematoxylin- and eosin–stained slides, provided by the Nationwide Children’s Hospital Biospecimen Core Resource, which are processed using image analysis techniques to generate a tumor purity estimate; v) Consensus measurement of Purity Estimation (CPE), a consensus estimate from the four methods mentioned above [29]. CPE is calculated as the median purity level after normalization of the values from the four methods and correcting for the means and standard deviations and it is the default option of the TCGAtumor_purity function.

TCGAanalyze_DEA extension

We revised and expanded the pre-existing TCGAbiolinks function TCGAanalyze_DEA that performs differential expression analysis (DEA) by calling the commonly used R package, edgeR [37]. In the former version of TCGAbiolinks, only a pairwise approach (for example, control versus case) was applied to a matrix of count data and samples to extract differentially expressed genes (DEGs). More specifically, the former TCGAanalyze_DEA function implemented two options: (i) the exactTest framework for a simple pairwise comparison or (ii) the GLM (Generalized Linear Model) where a user faces a more complex experimental design involving multiple factors. However, in the latter case, the design of the function allowed the user to provide arguments only for case and control thereby being incompatible with multifactor experiments, for which GLM methods are particularly suited [51]. We thus implemented a different design to improve the functionality of TCGAanalyze_DEA by providing the ability to analyze RNA-Seq data in a more general and comprehensive way. The user is now able to apply edgeR with a more sophisticated design matrix and to use the limma-voom method, an emerging gold standard for RNA-Seq data [52]. Furthermore, modeling multifactor experiments and correcting for batch effects related to TCGA samples is now an option in the updated version of TCGAanalyze_DEA. The new arguments for the function allow to use different sources of batch effects in the design matrix, such as the plates, the TSS (Tissue Source Site), the year in which the sample was taken and the patient factor in the cases of paired normal and tumor samples. Moreover, an option is provided to apply two different pipelines to the study of paired or unpaired samples, namely limma-voom and limma-trend pipelines. A contrast formula is provided to determine coefficients and design contrasts in a customized way, as well as the possibility to model a multifactor experimental design. In particular, the model formula for the edgeR pipeline is designed so that the intercept is set to 0 when there are multiple conditions (such as the molecular subtypes) or contrasts to be explored, following the recommendation of edgeR developers. The function returns two types of objects: i) a table with DEGs containing logFC, logCPM, p-value, and FDR corrected p-values in cases of pairwise comparison for each gene, and/or ii) a list object containing multiple tables for DEGs according to each contrast specified in the contrast.formula argument.

TCGAquery_recount2

The Recount project was created as an online resource that comprises gene count matrices built from 8 billion reads using 475 samples gathered from 18 published studies [32]. This atlas of RNA-Seq count matrices improves the process of data acquisition and allows cross-study comparisons since all of the count matrices were produced from one single pipeline reducing batch effects and promoting alternative normalization. Recount was then extended to Recount2 consisting of more than 4.4 trillion reads using 70,603 human RNA-seq samples from the Sequence Read Archive (SRA), GTEx, and TCGA that were uniformly processed, quantified with Rail-RNA [51], and included in the recent Recount2 interface [33]. For this reason, TCGAquery_recount2 queries GTEx and TCGA data for all tissues available in the Recount2 platform, providing the user with the flexibility to decide which tissue source to use for the calculations. TCGAquery_recount2 integrates normal samples from GTEx and normal samples from TCGA. If the user wants to use GTEx alone as a source of normal samples, an ad hoc curation of the dataset will be needed before applying the functions for pre-processing of the data and downstream analyses with TCGAbiolinks. Below, we illustrate two case studies as an example of the usage of the new functions and the interpretation of their results.

Case study 1—A protocol for pre-processing and differential expression analysis of TCGA-BRCA luminal subtypes

The TCGA Breast Invasive Carcinoma (BRCA) dataset is the ideal case study to illustrate the new functionalities of TCGAbiolinks (see Fig 2 for a workflow illustrating this case study and the new functions).

Fig 2

The workflow illustrates the steps and TCGAbiolinks functions to be used for case study 1 on TCGA-BRCA luminal subtypes.

We carried out the query, download and pre-processing of the TCGA-BRCA RNA-Seq data through the GDC portal with a variation of the workflow suggested for the previous versions of the TCGAbiolinks software (see the script reported in https://github.com/ELELAB/TCGAbiolinks_examples). As an example, out of a possible 1222 BRCA samples available in the GDC portal, we restricted our analysis to 100 tumor (TP) samples and 100 normal (NT) samples respectively. We constructed the SE object as the starting structure displaying information for both genes and samples with gene expression tables of HTSeq-based counts from reads harmonized and aligned to hg38 genome assembly. Afterwards, we applied an Array Array Intensity correlation (AAIC) to pinpoint samples with low correlation (0.6 threshold for this study) using TCGAanalyze_Preprocessing, which generates a count matrix ready to be used as input for the downstream analysis pipeline. In addition, we normalized the gene counts for GC-content using TCGAanalyze_Normalization adopting EDASeq protocol incorporated with TCGAbiolinks. An exploratory data analysis (EDA) step is now possible within TCGAbiolinks to help to understand the quality of the data and to identify possible anomalies or cofounder effects. This can be done by estimating the presence of batch effects through the plots provided by the ComBat function, as described above. We can call the TCGAbatch_Correction function on a log2 transformed instance of the count matrix. For the sake of clarity, we used batch correction on TSS as a cofounder factor along with accounting for one covariate (cancer versus normal) and only two batches were retained. The results are reported in Fig 1. According to the standard defined by the TCGA consortium, 60% tumor purity is the recommended threshold for analyses [29]. Thus, we applied a filtering step using the TCGAtumor_purity function of TCGAbiolinks whereby tumor samples that show a purity of less than 60% median CPE are discarded from the analysis. As a result, a total of 26 samples were discarded with the goal of reducing the confounding effect of tumor purity on genomic analyses. We then applied the new TCGAanalyze_DEA function to exploit the power of generalized linear models beyond the control versus case scheme. As an illustrative case, we queried the PAM50 classification [52] for each of the samples through TCGA_MolecularSubtype. We identified 86 samples with information on subtypes. The output is then provided to the DEA method so the customizable contrast.formula argument can contain the formula for designing the contrasts. Beforehand, the data is normalized for GC-content, as explained above. As a final step, quantile filtering is applied with a cutoff of 25%, as suggested by the original TCGAbiolinks workflow. Within the TCGAanalyze_DEA function, it is also possible to perform a voom transformation of the count data, as detailed above. In Fig 3A, we show the results of the new implementation of the TCGAanalyze_DEA function as a volcano plot. The genes with highest logFC are shown (using logFC higher or lower than 6 in absolute value as a cutoff). We then compared these results to the ones produced using DEA as implemented in edgeR within the TCGAanalyze_DEA (see volcano plot in Fig 3B). We calculated the correlation between the top 500 DE genes identified by the two methods (Fig 3C) which resulted in a Pearson Correlation Coefficient higher than 0.9. We then quantitatively compared the results of the two methods calculating the intersect with UpSetR [53] (Fig 3D). The two methods are in good agreement showing 1629 and 1365 down- and up-regulated genes in common, which account for approximately 90% of the total DE genes. With both methods we identified up-regulated matrix metalloproteinases (such as MMP11 and MMP13) which are a class of enzyme known to be involved in cancer invasion and metastasis and have been linked to breast cancer outcomes [54]. We also identified different collagen proteins (such as COL10A1 and COL11A1) that are up-regulated in luminal versus normal breast cancer samples. Those proteins are important for the composition of the extracellular matrix (ECM). Changes in the amount or composition of the ECM have been considered a hallmark of tumor development [55]. COL11A1 and COL10A1 have recently been proposed as markers to discriminate between breast cancer and healthy tissues and could be helpful in the diagnosis of suspicious breast nodules [56].

Fig 3

DEA analyses of TCGA-BRCA data comparing luminal subtypes with normal samples.

A-B) Volcano plots are shown where only those genes with logFC higher than 6 or lower than -6 are labelled and only the significant up- or down-regulated genes are shown as dots. We carried out DEA using the limma (A) or edgeR pipelines (B) of TCGAbiolinks. C) The correlation plot between the logFC estimated by the two pipelines for the top 500 DE genes is shown. The genes discussed in the main text are highlighted in bold. D) The intersect between all the DE genes estimated by the two pipelines is shown using UpSet.

DEA analyses of TCGA-BRCA data comparing luminal subtypes with normal samples.

Case study 2—Uterine cancer dataset exploiting Recount2

One issue that can be encountered when planning DEA of TCGA data is the fact that some projects on the GDC portal do not contain normal control samples for the comparison with the tumor samples. As explained previously, it is now possible to query data from the Recount2 platform to increase the pool of normal samples and apply the DEA pipelines of TCGAbiolinks (see Fig 4A for a workflow).

Fig 4

DE genes in uterine cancer compared to healthy uterine tissue samples.

DE genes in uterine cancer compared to healthy uterine tissue samples.

A) The workflow illustrates the steps and TCGAbiolinks functions to be used for this case study. B-C) In the volcano plot, the up-regulated genes with logFC higher than 5 (B) or the down-regulated genes with logFC lower than -5 (C) are shown as a result of DEA carried out using the limma pipeline comparing primary tumor samples from TCGA-UCS and normal uterine tissue samples from GTEx. For this case study, we used the TCGA Uterine Carcinosarcoma (UCS) dataset to illustrate this application. We queried, downloaded, and pre-processed the data using a similar workflow to our previous case study, and then GTEx healthy uterine tissues were used as a source of normal samples for DEA. Concerning the type of count data queried, it was similarly harmonized HTSeq counts and aligned to the hg38 genome assembly (see the script reported in https://github.com/ELELAB/TCGAbiolinks_examples). We used the TCGAquery_recount2 function to download tumor and normal uterine samples from the Recount2 platform as Ranged Summarized Experiment (RSE) objects. Before engaging in DEA, one should keep in mind that the Recount2 resource contains reads, some of them soft-clipped, aligned to Gencode version 25 hg38 using the splice-aware Rail-RNA aligner. Moreover, the RSE shows coverage counts instead of standard read count matrices. Since most methods are adapted to read count matrices, there are some highly recommended transformations to perform before commencing with DEA. The user should extract sample metadata from RSE objects regarding read length and mapped read counts to pre-process the data. If one provides a target library size (40 million reads by default), coverage counts can be scaled to read counts usable for classic DEA methods according to Eq (1) (possibly with the need to round the counts since the result might not be of an integer type). The denominator is the sum of the coverage for all base-pairs of the genome which can be replaced by the Area under Curve (AUC) [57]. It is possible to use the function scale_counts from the recount package. After that, we merged the two prepared gene count matrices, normalized for GC-content and applied the quantile filtering with a 25% cut-off. The data were then loaded into the TCGAanalyze_DEA function for comparison of normal samples versus cancer samples using the limma-voom pipeline. Two volcano plots depicting the top up- and down-regulated genes are shown in Fig 4B and 4C, respectively. As an example, we identified the up-regulated gene ADAM28 in the UCS tumor samples when compared to the normal ones (logFC = 3.13, thus not shown in Fig 4B). ADAM28 belongs to the ADAM family of disintegrins and metalloproteinases which are involved in important biological events such as cell adhesion, fusion, migration and membrane protein shedding and proteolysis. They are often overexpressed in tumors and contribute to the promotion of cell growth and invasion [58]. Among the top up-regulated genes in UCS, we also identified other key players in cell adhesion such as the cadherin CDH1 [58] shown in Fig 4B.

Availability and future directions

The functions illustrated in this manuscript are now available in version 2.8 of TCGAbiolinks on Bioconductor version 3.7 (https://bioconductor.org/packages/release/bioc/html/TCGAbiolinks.html ), as well as through the two Github repositories (https://github.com/ELELAB/TCGAbiolinks and https://github.com/BioinformaticsFMRP/TCGAbiolinks/). In addition, we provide daily scientific advice to the Github community within the ‘issues’ forum (https://github.com/BioinformaticsFMRP/TCGAbiolinks/issues) to solve both software bugs and to provide new functionalities needed or requested by the Github community. This forum is also a place where TCGAbiolinks users can share and discuss their experience with their analyses with our team and/or other Github users. The newly developed functions will for the first time allow users to fully appreciate the effect of using genuinely healthy samples or normal tumor-adjacent samples as a control as well as the benefits of correcting for the tumor purity of the samples. We provide a more robust and comprehensive workflow to carry out differential expression analysis with two different methods and a customizable design matrix, as well as the capability to handle batch corrections. Overall, this will provide the community with the possibility to use the same framework for vital analyses such as the benchmarking of differential expression methods. (https://bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/extension.html).

55 in total

1. Machine Learning Identifies Stemness Features Associated with Oncogenic Dedifferentiation.

Authors: Tathiane M Malta; Artem Sokolov; Andrew J Gentles; Tomasz Burzykowski; Laila Poisson; John N Weinstein; Bożena Kamińska; Joerg Huelsken; Larsson Omberg; Olivier Gevaert; Antonio Colaprico; Patrycja Czerwińska; Sylwia Mazurek; Lopa Mishra; Holger Heyn; Alex Krasnitz; Andrew K Godwin; Alexander J Lazar; Joshua M Stuart; Katherine A Hoadley; Peter W Laird; Houtan Noushmehr; Maciej Wiznerowicz
Journal: Cell Date: 2018-04-05 Impact factor: 41.582

Review 2. Cancer whole-genome sequencing: present and future.

Authors: H Nakagawa; C P Wardell; M Furuta; H Taniguchi; A Fujimoto
Journal: Oncogene Date: 2015-03-30 Impact factor: 9.867

Review 3. Orchestrating high-throughput genomic analysis with Bioconductor.

Authors: Wolfgang Huber; Vincent J Carey; Robert Gentleman; Simon Anders; Marc Carlson; Benilton S Carvalho; Hector Corrada Bravo; Sean Davis; Laurent Gatto; Thomas Girke; Raphael Gottardo; Florian Hahne; Kasper D Hansen; Rafael A Irizarry; Michael Lawrence; Michael I Love; James MacDonald; Valerie Obenchain; Andrzej K Oleś; Hervé Pagès; Alejandro Reyes; Paul Shannon; Gordon K Smyth; Dan Tenenbaum; Levi Waldron; Martin Morgan
Journal: Nat Methods Date: 2015-02 Impact factor: 28.547

Review 4. Intra-tumour heterogeneity: a looking glass for cancer?

Authors: Andriy Marusyk; Vanessa Almendro; Kornelia Polyak
Journal: Nat Rev Cancer Date: 2012-04-19 Impact factor: 60.716

5. TCGA Workflow: Analyze cancer genomics and epigenomics data using Bioconductor packages.

Authors: Tiago C Silva; Antonio Colaprico; Catharina Olsen; Fulvio D'Angelo; Gianluca Bontempi; Michele Ceccarelli; Houtan Noushmehr
Journal: F1000Res Date: 2016-06-29

6. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation.

Authors: Davis J McCarthy; Yunshun Chen; Gordon K Smyth
Journal: Nucleic Acids Res Date: 2012-01-28 Impact factor: 16.971

7. GC-content normalization for RNA-Seq data.

Authors: Davide Risso; Katja Schwartz; Gavin Sherlock; Sandrine Dudoit
Journal: BMC Bioinformatics Date: 2011-12-17 Impact factor: 3.169

8. recount workflow: Accessing over 70,000 human RNA-seq samples with Bioconductor.

Authors: Leonardo Collado-Torres; Abhinav Nellore; Andrew E Jaffe
Journal: F1000Res Date: 2017-08-24

9. Facilitating a culture of responsible and effective sharing of cancer genome data.

Authors: Lillian L Siu; Mark Lawler; David Haussler; Bartha Maria Knoppers; Jeremy Lewin; Daniel J Vis; Rachel G Liao; Fabrice Andre; Ian Banks; J Carl Barrett; Carlos Caldas; Anamaria Aranha Camargo; Rebecca C Fitzgerald; Mao Mao; John E Mattison; William Pao; William R Sellers; Patrick Sullivan; Bin Tean Teh; Robyn L Ward; Jean Claude ZenKlusen; Charles L Sawyers; Emile E Voest
Journal: Nat Med Date: 2016-05-05 Impact factor: 53.440

10. Systematic pan-cancer analysis of tumour purity.

Authors: Dvir Aran; Marina Sirota; Atul J Butte
Journal: Nat Commun Date: 2015-12-04 Impact factor: 14.919

111 in total

1. Characterization of the Tumor Immune Microenvironment Identifies M0 Macrophage-Enriched Cluster as a Poor Prognostic Factor in Hepatocellular Carcinoma.

Authors: Mark Farha; Neil K Jairath; Theodore S Lawrence; Issam El Naqa
Journal: JCO Clin Cancer Inform Date: 2020-10

2. A systematic comparison of data- and knowledge-driven approaches to disease subtype discovery.

Authors: Teemu J Rintala; Antonio Federico; Leena Latonen; Dario Greco; Vittorio Fortino
Journal: Brief Bioinform Date: 2021-11-05 Impact factor: 11.622

3. Targeting Tumor-Stromal IL6/STAT3 Signaling through IL1 Receptor Inhibition in Pancreatic Cancer.

Authors: Austin R Dosch; Samara Singh; Xizi Dai; Siddharth Mehra; Iago De Castro Silva; Anna Bianchi; Supriya Srinivasan; Zhen Gao; Yuguang Ban; Xi Chen; Sulagna Banerjee; Nagaraj S Nagathihalli; Jashodeep Datta; Nipun B Merchant
Journal: Mol Cancer Ther Date: 2021-09-13 Impact factor: 6.261

4. Proteogenomic Landscape of Breast Cancer Tumorigenesis and Targeted Therapy.

Authors: Karsten Krug; Eric J Jaehnig; Shankha Satpathy; Lili Blumenberg; Alla Karpova; Meenakshi Anurag; George Miles; Philipp Mertins; Yifat Geffen; Lauren C Tang; David I Heiman; Song Cao; Yosef E Maruvka; Jonathan T Lei; Chen Huang; Ramani B Kothadia; Antonio Colaprico; Chet Birger; Jarey Wang; Yongchao Dou; Bo Wen; Zhiao Shi; Yuxing Liao; Maciej Wiznerowicz; Matthew A Wyczalkowski; Xi Steven Chen; Jacob J Kennedy; Amanda G Paulovich; Mathangi Thiagarajan; Christopher R Kinsinger; Tara Hiltke; Emily S Boja; Mehdi Mesri; Ana I Robles; Henry Rodriguez; Thomas F Westbrook; Li Ding; Gad Getz; Karl R Clauser; David Fenyö; Kelly V Ruggles; Bing Zhang; D R Mani; Steven A Carr; Matthew J Ellis; Michael A Gillette
Journal: Cell Date: 2020-11-18 Impact factor: 41.582

5. Association between lincRNA expression and overall survival for patients with triple-negative breast cancer.

Authors: Jie Ping; Shuya Huang; Jie Wu; Pingping Bao; Timothy Su; Kai Gu; Hui Cai; Xingyi Guo; Loren Lipworth; William J Blot; Wei Zheng; Qiuyin Cai; Xiao-Ou Shu
Journal: Breast Cancer Res Treat Date: 2020-11-27 Impact factor: 4.872

Review 6. Cancer biology as revealed by the research autopsy.

Authors: Christine A Iacobuzio-Donahue; Chelsea Michael; Priscilla Baez; Rajya Kappagantula; Jody E Hooper; Travis J Hollman
Journal: Nat Rev Cancer Date: 2019-09-13 Impact factor: 60.716

Review 7. Aldo Keto Reductases AKR1B1 and AKR1B10 in Cancer: Molecular Mechanisms and Signaling Networks.

Authors: Sreeparna Banerjee
Journal: Adv Exp Med Biol Date: 2021 Impact factor: 2.622

8. Galectin-1 fosters an immunosuppressive microenvironment in colorectal cancer by reprogramming CD8⁺ regulatory T cells.

Authors: Alejandro J Cagnoni; María Laura Giribaldi; Ada G Blidner; Anabela M Cutine; Sabrina G Gatto; Rosa M Morales; Mariana Salatino; Martín C Abba; Diego O Croci; Karina V Mariño; Gabriel A Rabinovich
Journal: Proc Natl Acad Sci U S A Date: 2021-05-25 Impact factor: 11.205

9. eIF4A3 regulates the TFEB-mediated transcriptional response via GSK3B to control autophagy.

Authors: Despoina Sakellariou; Matteo Tiberti; Thomas H Kleiber; Lorea Blazquez; Aida Rodríguez López; Marie Holm Abildgaard; Michal Lubas; Jiri Bartek; Elena Papaleo; Lisa B Frankel
Journal: Cell Death Differ Date: 2021-06-22 Impact factor: 15.828

10. Transcriptomic signatures of tumors undergoing T cell attack.

Authors: Aishwarya Gokuldass; Aimilia Schina; Martin Lauss; Katja Harbst; Christopher Aled Chamberlain; Arianna Draghi; Marie Christine Wulff Westergaard; Morten Nielsen; Krisztian Papp; Zsofia Sztupinszki; Istvan Csabai; Inge Marie Svane; Zoltan Szallasi; Göran Jönsson; Marco Donia
Journal: Cancer Immunol Immunother Date: 2021-07-17 Impact factor: 6.968