Huey-Miin Chen, Justin A MacDonald.
Abstract
Advances in high-throughput sequencing technologies now yield unprecedented volumes of OMICs data with opportunities to conduct systematic data analyses and derive novel biological insights. Here, we provide protocols to perform differentially expressed gene analysis of TCGA and GTEx RNA-Seq data from human cancers, complete integrative GO and network analyses with a focus on clinical and survival data, and identify differential correlation of trait-associated biomarkers. For complete details on the use and execution of this protocol, please refer to Chen and MacDonald (2021).
Keywords: Bioinformatics; Cancer; Gene Expression; Genomics; RNAseq; Systems biology
Year: 2022 PMID: 35199033 PMCID: PMC8841814 DOI: 10.1016/j.xpro.2022.101168
Source DB: PubMed Journal: STAR Protoc ISSN: 2666-1667
List of primary sites for which complementary GTEx normal tissue samples can be found for TCGA tumor samples. Numbers are counts of sample IDs.
| Primary site | GTEx normal tissue | TCGA primary tumor |
|---|---|---|
| Adrenal gland | 126 | 77 |
| Bladder | 9 | 404 |
| Brain | 1148 | 660 |
| Breast | 178 | 1090 |
| Colon | 307 | 282 |
| Esophagus | 652 | 181 |
| Kidney | 28 | 884 |
| Liver | 110 | 369 |
| Lung | 288 | 1011 |
| Ovary | 88 | 419 |
| Pancreas | 167 | 177 |
| Prostate | 100 | 494 |
| Skin | 555 | 102 |
| Stomach | 174 | 410 |
| Testis | 165 | 132 |
| Uterus | 78 | 57 |
List of values that can be used for arguments “paraCohort” and “paraDatasets” (defined during step 5.b) to retrieve dataset(s) containing the desired cancer type
| Primary site | paraCohort | paraDatasets |
|---|---|---|
| Adrenal gland | TCGA Adrenocortical Cancer | TCGA.ACC.sampleMap/ACC_clinicalMatrix |
| Bladder | TCGA Bladder Cancer | TCGA.BLCA.sampleMap/BLCA_clinicalMatrix |
| Brain | TCGA Glioblastoma | TCGA.GBM.sampleMap/GBM_clinicalMatrix |
| | TCGA Lower Grade Glioma | TCGA.LGG.sampleMap/LGG_clinicalMatrix |
| Breast | TCGA Breast Cancer | TCGA.BRCA.sampleMap/BRCA_clinicalMatrix |
| Colon | TCGA Colon Cancer | TCGA.COAD.sampleMap/COAD_clinicalMatrix |
| Esophagus | TCGA Esophageal Cancer | TCGA.ESCA.sampleMap/ESCA_clinicalMatrix |
| Kidney | TCGA Kidney Chromophobe | TCGA.KICH.sampleMap/KICH_clinicalMatrix |
| | TCGA Kidney Clear Cell Carcinoma | TCGA.KIRC.sampleMap/KIRC_clinicalMatrix |
| | TCGA Kidney Papillary Cell Carcinoma | TCGA.KIRP.sampleMap/KIRP_clinicalMatrix |
| Liver | TCGA Liver Cancer | TCGA.LIHC.sampleMap/LIHC_clinicalMatrix |
| Lung | TCGA Lung Cancer | TCGA.LUNG.sampleMap/LUNG_clinicalMatrix |
| Ovary | TCGA Ovarian Cancer | TCGA.OV.sampleMap/OV_clinicalMatrix |
| Pancreas | TCGA Pancreatic Cancer | TCGA.PAAD.sampleMap/PAAD_clinicalMatrix |
| Prostate | TCGA Prostate Cancer | TCGA.PRAD.sampleMap/PRAD_clinicalMatrix |
| Skin | TCGA Melanoma | TCGA.SKCM.sampleMap/SKCM_clinicalMatrix |
| Stomach | TCGA Stomach Cancer | TCGA.STAD.sampleMap/STAD_clinicalMatrix |
| Testis | TCGA Testicular Cancer | TCGA.TGCT.sampleMap/TGCT_clinicalMatrix |
| Uterus | TCGA Uterine Carcinosarcoma | TCGA.UCS.sampleMap/UCS_clinicalMatrix |
List of values that can be used for arguments “paraPrimarySiteGTEx” and “paraPrimaryTissueGTEx” (defined during step 6.a) to retrieve IDs for GTEx normal samples of desired tissue type(s)
| paraPrimarySiteGTEx | paraPrimaryTissueGTEx | Sample size |
|---|---|---|
| Adrenal Gland | Adrenal Gland | 126 |
| Bladder | Bladder | 9 |
| Brain | Brain - Amygdala | 69 |
| | Brain - Anterior Cingulate Cortex \\(Ba24\\) | 83 |
| | Brain - Caudate \\(Basal Ganglia\\) | 108 |
| | Brain - Cerebellar Hemisphere | 97 |
| | Brain - Cerebellum | 117 |
| | Brain - Cortex | 105 |
| | Brain - Frontal Cortex \\(Ba9\\) | 101 |
| | Brain - Hippocampus | 84 |
| | Brain - Hypothalamus | 82 |
| | Brain - Nucleus Accumbens \\(Basal Ganglia\\) | 104 |
| | Brain - Putamen \\(Basal Ganglia\\) | 81 |
| | Brain - Spinal Cord \\(Cervical C-1\\) | 60 |
| | Brain - Substantia Nigra | 57 |
| Breast | Breast - Mammary Tissue | 178 |
| Colon | Colon - Sigmoid | 141 |
| | Colon - Transverse | 166 |
| Esophagus | Esophagus - Gastroesophageal Junction | 136 |
| | Esophagus - Mucosa | 271 |
| | Esophagus - Muscularis | 245 |
| Kidney | Kidney - Cortex | 28 |
| Liver | Liver | 110 |
| Lung | Lung | 288 |
| Ovary | Ovary | 88 |
| Pancreas | Pancreas | 167 |
| Prostate | Prostate | 100 |
| Skin | Skin - Not Sun Exposed \\(Suprapubic\\) | 232 |
| | Skin - Sun Exposed \\(Lower Leg\\) | 323 |
| Stomach | Stomach | 174 |
| Testis | Testis | 165 |
| Uterus | Uterus | 78 |
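The doubled backslashes before the parentheses in the table above suggest that these values are passed as regular expressions from R string literals, where `"\\("` denotes the pattern `\(`. The protocol itself is run in R; the Python sketch below only illustrates why the escaping matters. The sample IDs and the `select_samples` helper are hypothetical, and the assumption that the protocol filters by pattern matching is inferred from the table, not confirmed here.

```python
import re

# Hypothetical (tissue label, sample ID) rows; the IDs are made up for illustration.
phenotype_rows = [
    ("Brain - Cortex", "GTEX-0001"),
    ("Brain - Frontal Cortex (Ba9)", "GTEX-0002"),
    ("Brain - Anterior Cingulate Cortex (Ba24)", "GTEX-0003"),
    ("Colon - Sigmoid", "GTEX-0004"),
    ("Colon - Transverse", "GTEX-0005"),
]


def select_samples(rows, tissue_pattern):
    """Return sample IDs whose tissue label matches the regular expression."""
    regex = re.compile(tissue_pattern)
    return [sample_id for tissue, sample_id in rows if regex.search(tissue)]


# The R string "Brain - Frontal Cortex \\(Ba9\\)" denotes the regex
# Brain - Frontal Cortex \(Ba9\), written here as a Python raw string.
frontal = select_samples(phenotype_rows, r"Brain - Frontal Cortex \(Ba9\)")

# Unescaped parentheses form a capture group, so the literal "(Ba9)" is not matched.
unescaped = select_samples(phenotype_rows, "Brain - Frontal Cortex (Ba9)")

# An anchored prefix selects every colon sub-site at once.
colon = select_samples(phenotype_rows, "^Colon")
```

Under these assumptions, the escaped pattern selects only the frontal cortex sample, the unescaped pattern selects nothing, and the anchored prefix selects both colon tissues.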
List of values that can be used for arguments “paraPrimarySiteTCGA” and “paraHistologicalType” (defined during step 6.b) to retrieve IDs for TCGA primary tumor samples of desired histological type(s)
| paraPrimarySiteTCGA | paraDatasets | paraHistologicalType | Sample size |
|---|---|---|---|
| Adrenal gland | TCGA.ACC.sampleMap/ACC_clinicalMatrix | Adrenocortical Carcinoma- Myxoid Type | 1 |
| | | Adrenocortical Carcinoma- Oncocytic Type | 3 |
| | | Adrenocortical carcinoma- Usual Type | 73 |
| Bladder | TCGA.BLCA.sampleMap/BLCA_clinicalMatrix | Muscle invasive urothelial carcinoma | 404 |
| Brain | TCGA.GBM.sampleMap/GBM_clinicalMatrix | Glioblastoma Multiforme | 1 |
| | | Treated primary GBM | 1 |
| | | Untreated primary \\(de novo\\) GBM | 150 |
| | TCGA.LGG.sampleMap/LGG_clinicalMatrix | Astrocytoma | 193 |
| | | Oligoastrocytoma | 126 |
| | | Oligodendroglioma | 189 |
| Breast | TCGA.BRCA.sampleMap/BRCA_clinicalMatrix | Infiltrating Carcinoma NOS | 1 |
| | | Infiltrating Ductal Carcinoma | 780 |
| | | Infiltrating Lobular Carcinoma | 203 |
| | | Medullary Carcinoma | 6 |
| | | Metaplastic Carcinoma | 9 |
| | | Mixed Histology | 29 |
| | | Mucinous Carcinoma | 17 |
| | | Other | 45 |
| Colon | TCGA.COAD.sampleMap/COAD_clinicalMatrix | Colon Adenocarcinoma | 244 |
| | | Colon Mucinous Adenocarcinoma | 38 |
| Esophagus | TCGA.ESCA.sampleMap/ESCA_clinicalMatrix | Esophagus Adenocarcinoma, NOS | 89 |
| | | Esophagus Squamous Cell Carcinoma | 92 |
| Kidney | TCGA.KICH.sampleMap/KICH_clinicalMatrix | Kidney Chromophobe | 66 |
| | TCGA.KIRC.sampleMap/KIRC_clinicalMatrix | Kidney Clear Cell Renal Carcinoma | 530 |
| | TCGA.KIRP.sampleMap/KIRP_clinicalMatrix | Kidney Papillary Renal Cell Carcinoma | 288 |
| Liver | TCGA.LIHC.sampleMap/LIHC_clinicalMatrix | Fibrolamellar Carcinoma | 3 |
| | | Hepatocellular Carcinoma | 359 |
| | | Hepatocholangiocarcinoma \\(Mixed\\) | 7 |
| Lung | TCGA.LUNG.sampleMap/LUNG_clinicalMatrix | Lung Acinar Adenocarcinoma | 18 |
| | | Lung Adenocarcinoma Mixed Subtype | 105 |
| | | Lung Adenocarcinoma- Not Otherwise Specified | 320 |
| | | Lung Basaloid Squamous Cell Carcinoma | 14 |
| | | Lung Bronchioloalveolar Carcinoma Mucinous | 5 |
| | | Lung Bronchioloalveolar Carcinoma Nonmucinous | 19 |
| | | Lung Clear Cell Adenocarcinoma | 2 |
| | | Lung Micropapillary Adenocarcinoma | 3 |
| | | Lung Mucinous Adenocarcinoma | 2 |
| | | Lung Papillary Adenocarcinoma | 23 |
| | | Lung Papillary Squamous Cell Carcinoma | 6 |
| | | Lung Signet Ring Adenocarcinoma | 1 |
| | | Lung Small Cell Squamous Cell Carcinoma | 1 |
| | | Lung Solid Pattern Predominant Adenocarcinoma | 5 |
| | | Lung Squamous Cell Carcinoma- Not Otherwise Specified | 477 |
| | | Mucinous \\(Colloid\\) Carcinoma | 10 |
| Ovary | TCGA.OV.sampleMap/OV_clinicalMatrix | Serous Cystadenocarcinoma | 419 |
| Pancreas | TCGA.PAAD.sampleMap/PAAD_clinicalMatrix | Pancreas-Adenocarcinoma Ductal Type | 147 |
| | | Pancreas-Adenocarcinoma-Other Subtype | 25 |
| | | Pancreas-Colloid \\(mucinous non-cystic\\) Carcinoma | 4 |
| | | Pancreas-Undifferentiated Carcinoma | 1 |
| Prostate | TCGA.PRAD.sampleMap/PRAD_clinicalMatrix | Prostate Adenocarcinoma Acinar Type | 479 |
| | | Prostate Adenocarcinoma, Other Subtype | 15 |
| Skin | TCGA.SKCM.sampleMap/SKCM_clinicalMatrix | Not Available | 102 |
| Stomach | TCGA.STAD.sampleMap/STAD_clinicalMatrix | Stomach Adenocarcinoma, Signet Ring Type | 12 |
| | | Stomach, Adenocarcinoma, Diffuse Type | 68 |
| | | Stomach, Adenocarcinoma, Not Otherwise Specified | 155 |
| | | Stomach, Intestinal Adenocarcinoma, Mucinous Type | 19 |
| | | Intestinal Adenocarcinoma, Not Otherwise Specified∗ | 73 |
| | | Stomach, Intestinal Adenocarcinoma, Papillary Type | 7 |
| | | Stomach, Intestinal Adenocarcinoma, Tubular Type | 76 |
| Testis | TCGA.TGCT.sampleMap/TGCT_clinicalMatrix | ^Non-Seminoma; Choriocarcinoma | 1 |
| | | ^Non-Seminoma; Embryonal Carcinoma | 32 |
| | | ^Non-Seminoma; Teratoma \\(Immature\\) | 5 |
| | | ^Non-Seminoma; Teratoma \\(Mature\\) | 16 |
| | | ^Non-Seminoma; Yolk Sac Tumor | 8 |
| | | ^Seminoma; NOS | 70 |
| Uterus | TCGA.UCS.sampleMap/UCS_clinicalMatrix | Uterine Carcinosarcoma/ Malignant Mixed Mullerian Tumor | 24 |
| | | Uterine Carcinosarcoma/ MMMT: Heterologous Type | 20 |
| | | Uterine Carcinosarcoma/MMMT: Homologous Type | 13 |
Figure 1. Impact of removing lowly expressed genes on the distribution of expression values
(A and B) Density plot of log-CPM values before (A) and after (B) removal of genes that are lowly expressed in TCGA primary tumor and GTEx normal colon tissue samples.
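The quantities plotted in Figure 1 can be illustrated with a small sketch. The protocol itself is run in R; the Python below is only a stand-in. The `log_cpm` function is modeled on my understanding of how edgeR's `cpm(log = TRUE)` scales a prior count by library size, and `keep_expressed` is a deliberately simplified filter (a CPM threshold in a minimum number of samples), not a reimplementation of `filterByExpr`. The gene names, counts, and thresholds are hypothetical.

```python
import math


def library_sizes(counts):
    """Sum counts per sample (column) across all genes."""
    n_samples = len(next(iter(counts.values())))
    return [sum(row[j] for row in counts.values()) for j in range(n_samples)]


def log_cpm(counts, prior_count=2.0):
    """Log2 counts per million with a library-size-scaled prior count.

    Assumption: the prior count is scaled by each library size relative to
    the mean library size before being added to the counts.
    """
    lib = library_sizes(counts)
    mean_lib = sum(lib) / len(lib)
    scaled_prior = [prior_count * size / mean_lib for size in lib]
    return {
        gene: [
            math.log2((c + p) / (size + 2 * p) * 1e6)
            for c, p, size in zip(row, scaled_prior, lib)
        ]
        for gene, row in counts.items()
    }


def keep_expressed(counts, min_cpm=1.0, min_samples=2):
    """Simplified filter: keep genes with CPM >= min_cpm in >= min_samples samples."""
    lib = library_sizes(counts)
    kept = {}
    for gene, row in counts.items():
        n_pass = sum(1 for c, size in zip(row, lib) if c * 1e6 / size >= min_cpm)
        if n_pass >= min_samples:
            kept[gene] = row
    return kept


# Hypothetical counts for three genes across three samples.
toy_counts = {
    "GENE_A": [500_000, 450_000, 520_000],
    "GENE_B": [500_000, 549_999, 480_000],
    "GENE_C": [0, 1, 0],
}

filtered = keep_expressed(toy_counts)  # GENE_C is dropped
lcpm = log_cpm(filtered)               # log-CPM values for the remaining genes
```

Removing genes such as the hypothetical `GENE_C` is what eliminates the low-expression peak seen in the density plot before filtering.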
Figure 2. The mean-variance relationship of the input gene expression data
Mean-variance relationship of log-CPM values for the input dataset (TCGA primary tumor and GTEx normal colon tissue gene expression data) is appropriate for subsequent linear modeling with limma-voom since a drop in variance levels at the low end of the expression scale was not observed.
Figure 3. Identification of sample outliers on a density plot for filtered expected count
(A–C) A density plot of log-CPM values for expected count shows distinct distributions of log-CPM values before (A) and after (B) removal of genes using the filterByExpr function. The proportion of genes below lcpm.cutoff (indicated by the vertical dotted lines in A and B) by sample is summarized in a histogram (C), and samples with density (proportion of genes) > 0.8 for log-CPM values < lcpm.cutoff were defined as outliers.
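The outlier rule stated in the caption (more than 0.8 of a sample's genes falling below `lcpm.cutoff`) can be sketched as follows. This is a Python illustration of the rule only, not the protocol's R code; the sample names, log-CPM values, and the cutoff of 0.0 are hypothetical.

```python
def proportion_below(lcpm_by_sample, cutoff):
    """For each sample, the fraction of genes whose log-CPM is below the cutoff."""
    return {
        sample: sum(1 for value in values if value < cutoff) / len(values)
        for sample, values in lcpm_by_sample.items()
    }


def find_outliers(lcpm_by_sample, cutoff, max_proportion=0.8):
    """Samples whose below-cutoff proportion strictly exceeds max_proportion."""
    proportions = proportion_below(lcpm_by_sample, cutoff)
    return sorted(s for s, p in proportions.items() if p > max_proportion)


# Hypothetical log-CPM values for five genes in three samples.
toy_lcpm = {
    "S1": [5.1, 6.3, 4.8, 7.0, 5.5],       # no genes below the cutoff
    "S2": [-1.2, -0.5, -2.0, -0.8, 3.0],   # exactly 0.8 below: not an outlier
    "S3": [-1.5, -0.9, -2.2, -1.1, -0.3],  # all genes below: outlier
}

outliers = find_outliers(toy_lcpm, cutoff=0.0)
```

Because the caption uses a strict inequality, a sample sitting exactly at 0.8 is retained in this sketch.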
Figure 4. Comparison of module-trait relationship matrices
For the Module-Trait Relationships heat maps, the Pearson correlation value and p value (in brackets) are provided; p values below 0.05 were considered significant. Gene significance (GS) vs. module membership (MM) scatterplots were generated with varied minModuleSize values. Red boxes mark modules that were significantly correlated with the clinical trait of interest, lymphatic invasion.
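Each cell of a module-trait heat map pairs a Pearson correlation with a p value. A minimal Python sketch of the correlation and its associated t statistic is given below, assuming the usual Student t test for a correlation coefficient with n - 2 degrees of freedom; the protocol's own R implementation may differ. The eigengene values and the binary coding of lymphatic invasion are hypothetical, and converting t to a p value is left to a statistics library.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Hypothetical module eigengene values for six samples.
module_eigengene = [-0.4, -0.2, -0.1, 0.1, 0.2, 0.4]
# Hypothetical trait coding: 0 = no lymphatic invasion, 1 = lymphatic invasion.
lymphatic_invasion = [0, 0, 0, 1, 1, 1]

r = pearson_r(module_eigengene, lymphatic_invasion)
t = t_statistic(r, len(module_eigengene))
```

A module would be flagged (red box) when the p value derived from `t` falls below 0.05.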
| REAGENT or RESOURCE | SOURCE | IDENTIFIER |
|---|---|---|
| TcgaTargetGtex_gene_expected_count | Xena Toil data hub | |
| TcgaTargetGTEX_phenotype | Xena Toil data hub | |
| COAD_clinicalMatrix | Xena TCGA data hub | |
| TCGA_survival_data | Xena Toil data hub | |
| Windows OS 10 Home, 64-bit | Microsoft | |
| R (4.1.2) | The R Project | |
| RStudio (2021.09.1 Build 372) | RStudio Team | |
| BiocManager (1.30.16) | | |
| UCSCXenaTools (1.4.7) | | |
| data.table (1.14.2) | | |
| R.utils (2.11.0) | | |
| dplyr (1.0.7) | | |
| limma (3.48.3) | | |
| edgeR (3.34.1) | | |
| topGO (2.44.0) | | |
| grex (1.9) | | |
| biomaRt (2.48.3) | | |
| ggplot2 (3.3.5) | | |
| RegParallel (1.10.0) | | |
| survminer (0.4.9) | | |
| Cytoscape (3.9.0) | | |
| stringApp (1.7.0) | | |
| DGCA (1.0.2) | | |
| org.Hs.eg.db (3.13.0) | | |
| GOstats (2.58.0) | | |
| HGNChelper (0.8.1) | | |
| plotrix (3.8-2) | | |
| ID/gene mapping | GENCODE project | |
| Genes.xlsx | | |
| Computing Platform (e.g., Alienware Aurora R12 desktop; 11th Gen Intel® Core™ i7-11700F @ 2.50GHz processor with 32 GB, 2×16GB, 3200 MHz, VMR memory) | Dell Technologies | |