Mohamed Mounir, Marta Lucchetta, Tiago C Silva, Catharina Olsen, Gianluca Bontempi, Xi Chen, Houtan Noushmehr, Antonio Colaprico, Elena Papaleo.
Abstract
The advent of Next-Generation Sequencing (NGS) technologies has opened new perspectives in deciphering the genetic mechanisms underlying complex diseases. The amount of genomic data is now massive, and substantial efforts and new tools are required to unveil the information hidden in the data. The Genomic Data Commons (GDC) Data Portal is a platform that hosts different genomic studies, including those from The Cancer Genome Atlas (TCGA) and the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) initiatives, accounting for more than 40 tumor types originating from nearly 30,000 patients. Such platforms, although very attractive, must make sure the stored data are easily accessible and adequately harmonized. Moreover, they focus primarily on storing the data in a single place and do not provide a comprehensive toolkit for analyzing and interpreting the data. To fulfill this urgent need, comprehensive but easily accessible computational methods for integrative analyses of genomic data are required, without renouncing a robust statistical and theoretical framework. In this context, the R/Bioconductor package TCGAbiolinks was developed, offering a variety of bioinformatics functionalities. Here we introduce new features and enhancements of TCGAbiolinks in terms of i) more accurate and flexible pipelines for differential expression analyses, ii) different methods for tumor purity estimation and filtering, iii) integration of normal samples from other platforms, and iv) support for other genomics datasets, exemplified here by the TARGET data. Evidence has shown that accounting for tumor purity is essential in the study of tumorigenesis, as it acts as a confounding factor in differential expression analysis. With this in mind, we implemented these filtering procedures in TCGAbiolinks. Moreover, a limitation of some of the TCGA datasets is the unavailability or paucity of corresponding normal samples. We thus integrated into TCGAbiolinks the possibility to use normal samples from the Genotype-Tissue Expression (GTEx) project, another large-scale repository, which catalogs gene expression from healthy individuals. The new functionalities are available in TCGAbiolinks version 2.8 and higher, released in Bioconductor version 3.7.
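
The purity-filtering idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the TCGAbiolinks implementation or its API: the sample IDs, the purity values, the 0.6 threshold, and the choice to drop samples without an estimate are all assumptions made for illustration.

```python
# Hypothetical purity estimates (fraction of tumor cells) keyed by sample ID.
# None marks a sample for which no estimate is available.
purity = {
    "SAMPLE-A": 0.82,
    "SAMPLE-B": 0.45,
    "SAMPLE-C": 0.60,
    "SAMPLE-D": None,
}


def filter_by_purity(estimates, threshold=0.6):
    """Return the sorted sample IDs whose purity is at least `threshold`.

    Samples without an estimate are dropped (an assumption of this sketch).
    """
    return sorted(
        sample
        for sample, value in estimates.items()
        if value is not None and value >= threshold
    )


kept = filter_by_purity(purity)
print(kept)
```

Only the retained samples would then enter the downstream differential expression analysis, so that low-purity samples do not drive the contrast.
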
Year: 2019 PMID: 30835723 PMCID: PMC6420023 DOI: 10.1371/journal.pcbi.1006701
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 1. Example of the exploration of batch effects.
Four plots generated by ComBat when correcting for batch effects. In the left panels, the red lines are the parametric estimates and the black lines are the kernel estimates of the distribution of batch effects across genes. The right panels show Q-Q plots, with the red line for the parametric estimate and black points for the ordered batch effects of each gene. The top plots refer to the means and the bottom plots to the variances. Plots were generated for batches TSS E9 and E2 to avoid batches containing only one sample.
Information on molecular subtypes for TCGA cancer studies as provided by the TCGA_MolecularSubtype function.
| TCGA Abbreviation | Cancer type | Number of samples | Subtypes Selected |
|---|---|---|---|
| ACC | Adrenocortical carcinoma | 91 | ACC.CIMP-high, ACC.CIMP-intermediate, ACC.CIMP-low |
| AML | Acute Myeloid Leukemia | 187 | AML.1, AML.2, AML.3, AML.4, AML.5, AML.6, AML.7 |
| BLCA | Bladder Urothelial Carcinoma | 129 | BLCA.1, BLCA.2, BLCA.3, BLCA.4 |
| BRCA | Breast invasive carcinoma | 1218 | BRCA.Basal, BRCA.Her2, BRCA.LumA, BRCA.LumB, BRCA.Normal |
| COAD | Colon adenocarcinoma | 341 | GI.CIN, GI.GS, GI.HM-indel, GI.HM-SNV |
| ESCA | Esophageal carcinoma | 169 | GI.CIN, GI.ESCC, GI.GS, GI.HM-indel, GI.HM-SNV |
| GBM | Glioblastoma multiforme | 606 | GBM_LGG.Classic-like, GBM_LGG.Codel, GBM_LGG.G-CIMP-high, GBM_LGG.G-CIMP-low, GBM_LGG.LGm6-GBM, GBM_LGG.Mesenchymal-like |
| HNSC | Head and Neck squamous cell carcinoma | 279 | HNSC.Atypical, HNSC.Basal, HNSC.Classical, HNSC.Mesenchymal |
| KICH | Kidney Chromophobe | 66 | KICH.Eosin.0, KICH.Eosin.1 |
| KIRC | Kidney renal clear cell carcinoma | 442 | KIRC.1, KIRC.2, KIRC.3, KIRC.4 |
| KIRP | Kidney renal papillary cell carcinoma | 161 | KIRP.C1, KIRP.C2a, KIRP.C2b, KIRP.C2c - CIMP |
| LGG | Brain Lower Grade Glioma | 516 | GBM_LGG.Classic-like, GBM_LGG.Codel, GBM_LGG.G-CIMP-high, GBM_LGG.G-CIMP-low, GBM_LGG.Mesenchymal-like, GBM_LGG.PA-like |
| LIHC | Liver hepatocellular carcinoma | 196 | LIHC.iCluster:1, LIHC.iCluster:2, LIHC.iCluster:3 |
| LUAD | Lung adenocarcinoma | 230 | LUAD.1, LUAD.2, LUAD.3, LUAD.4, LUAD.5, LUAD.6 |
| LUSC | Lung squamous cell carcinoma | 178 | LUSC.basal, LUSC.classical, LUSC.primitive, LUSC.secretory |
| OVCA | Ovarian serous cystadenocarcinoma | 489 | OVCA.Differentiated, OVCA.Immunoreactive, OVCA.Mesenchymal, OVCA.Proliferative |
| PCPG | Pheochromocytoma and Paraganglioma | 178 | PCPG.Cortical admixture, PCPG.Pseudohypoxia, PCPG.Wnt-altered |
| PRAD | Prostate adenocarcinoma | 333 | PRAD.1-ERG, PRAD.2-ETV1, PRAD.3-ETV4, PRAD.4-FLI1, PRAD.5-SPOP, PRAD.6-FOXA1, PRAD.7-IDH1, PRAD.8-other |
| READ | Rectum adenocarcinoma | 118 | GI.CIN, GI.GS, GI.HM-indel, GI.HM-SNV |
| SKCM | Skin Cutaneous Melanoma | 333 | SKCM.-, SKCM.BRAF_Hotspot_Mutants, SKCM.NF1_Any_Mutants, SKCM.RAS_Hotspot_Mutants, SKCM.Triple_WT |
| STAD | Stomach adenocarcinoma | 383 | GI.CIN, GI.EBV, GI.GS, GI.HM-indel, GI.HM-SNV |
| THCA | Thyroid carcinoma | 496 | THCA.1, THCA.2, THCA.3, THCA.4, THCA.5 |
| UCEC | Uterine Corpus Endometrial Carcinoma | 538 | UCEC.CN_HIGH, UCEC.CN_LOW, UCEC.MSI, UCEC.POLE |
| UCS | Uterine Carcinosarcoma | 57 | UCS.1, UCS.2 |
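
As an illustration of how the subtype labels in the table might be used, the sketch below groups samples by selected subtypes, for example the two luminal BRCA labels. It is a language-agnostic illustration, not a call into TCGAbiolinks: the sample IDs and the `group_by_subtype` helper are hypothetical, and only the label strings are taken from the table.

```python
from collections import defaultdict

# Hypothetical (sample ID, subtype label) pairs; labels follow the table's naming.
annotations = [
    ("S1", "BRCA.LumA"),
    ("S2", "BRCA.Basal"),
    ("S3", "BRCA.LumB"),
    ("S4", "BRCA.Her2"),
    ("S5", "BRCA.LumA"),
]


def group_by_subtype(pairs, wanted):
    """Collect sample IDs per subtype, keeping only the subtypes in `wanted`."""
    groups = defaultdict(list)
    for sample, subtype in pairs:
        if subtype in wanted:
            groups[subtype].append(sample)
    return dict(groups)


luminal = group_by_subtype(annotations, {"BRCA.LumA", "BRCA.LumB"})
print(luminal)
```

Each resulting group can then be compared against normal samples in its own contrast.
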
Fig 2. The workflow illustrates the steps and TCGAbiolinks functions to be used for case study 1 on TCGA-BRCA luminal subtypes.
Fig 3. DEA of TCGA-BRCA data comparing luminal subtypes with normal samples.
A-B) Volcano plots in which only the significantly up- or down-regulated genes are shown as dots, and only genes with logFC higher than 6 or lower than -6 are labelled. We carried out DEA using the limma (A) or edgeR (B) pipelines of TCGAbiolinks. C) Correlation plot between the logFC values estimated by the two pipelines for the top 500 DE genes. The genes discussed in the main text are highlighted in bold. D) The intersection of all the DE genes estimated by the two pipelines, shown using UpSet.
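
The labelling rule in panels A-B and the overlap in panel D can be sketched as follows. This is an illustrative sketch rather than the code behind the figure: the gene names, logFC values, and adjusted p-values are invented, and the 0.01 significance cutoff is an assumption, since the caption does not state one. Only the |logFC| > 6 labelling threshold comes from the caption.

```python
# Hypothetical DEA results per pipeline: gene -> (logFC, adjusted p-value).
limma_res = {
    "G1": (6.5, 0.001),
    "G2": (-7.1, 0.0005),
    "G3": (2.0, 0.009),
    "G4": (8.0, 0.2),
}
edger_res = {
    "G1": (6.9, 0.002),
    "G2": (-6.4, 0.001),
    "G3": (1.8, 0.08),
    "G5": (-3.0, 0.003),
}


def significant(res, fdr=0.01):
    """Genes drawn as dots: adjusted p-value at or below the assumed cutoff."""
    return {gene for gene, (_, padj) in res.items() if padj <= fdr}


def labelled(res, fdr=0.01, lfc=6.0):
    """Genes that get a text label: significant and |logFC| above `lfc`."""
    return {gene for gene in significant(res, fdr) if abs(res[gene][0]) > lfc}


limma_labels = labelled(limma_res)
# Genes called DE by both pipelines, as summarized in an intersection plot.
shared = significant(limma_res) & significant(edger_res)
print(sorted(limma_labels), sorted(shared))
```

In this toy input, G4 has a large logFC but is not significant, so it is neither drawn nor labelled, which mirrors the two-step rule described in the caption.
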
Fig 4. DE genes in uterine cancer compared to healthy uterine tissue samples.
A) The workflow illustrates the steps and TCGAbiolinks functions to be used for this case study. B-C) Volcano plots showing the up-regulated genes with logFC higher than 5 (B) and the down-regulated genes with logFC lower than -5 (C), from DEA carried out using the limma pipeline comparing primary tumor samples from TCGA-UCS with normal uterine tissue samples from GTEx.