Yuanyuan Li, Kai Kang, Juno M Krahn, Nicole Croutwater, Kevin Lee, David M Umbach, Leping Li.
Abstract
BACKGROUND: The Cancer Genome Atlas (TCGA) has generated comprehensive molecular profiles. We aim to identify a set of genes whose expression patterns can distinguish diverse tumor types. Those features may serve as biomarkers for tumor diagnosis and drug development.
Keywords: Classification; GA/KNN; Pan-cancer; RNA-seq; Sex dimorphism; TCGA
Year: 2017 PMID: 28673244 PMCID: PMC5496318 DOI: 10.1186/s12864-017-3906-0
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Tumor types and number of TCGA RNA-seq samples used in the analysis
| Tumor type | Abbreviation | Pan-cancer samples | Males (%) | Females (%) |
|---|---|---|---|---|
| Adrenocortical carcinoma | ACC | 79 | 31 (0.76) | 48 (1.82) |
| Bladder urothelial carcinoma | BLCA | 408 | 272 (6.67) | 99 (3.75) |
| Breast invasive carcinoma | BRCA | 1102 | Sex-specific (omitted) | |
| Cervical squamous cell carcinoma and endocervical adenocarcinoma | CESC | 306 | Sex-specific (omitted) | |
| Cholangiocarcinoma | CHOL | 36 | Too few (omitted) | |
| Colon adenocarcinoma | COAD | 287 | 156 (3.82) | 129 (4.89) |
| Lymphoid neoplasm diffuse large B-cell lymphoma | DLBC | 48 | Too few (omitted) | |
| Esophageal carcinoma | ESCA | Not available | 159 (3.90) | 26 (0.99) |
| Glioblastoma multiforme | GBM | 169 | 109 (2.67) | 59 (2.24) |
| Head and Neck squamous cell carcinoma | HNSC | 522 | 385 (9.43) | 137 (5.19) |
| Kidney chromophobe | KICH | 66 | Too few (omitted) | |
| Kidney renal clear cell carcinoma | KIRC | 534 | 346 (8.48) | 188 (7.13) |
| Kidney renal papillary cell carcinoma | KIRP | 291 | 214 (5.24) | 77 (2.92) |
| Acute Myeloid Leukemia | LAML | 173 | 93 (2.28) | 80 (3.03) |
| Brain lower grade glioma | LGG | 534 | 292 (7.16) | 241 (9.14) |
| Liver hepatocellular carcinoma | LIHC | 374 | 253 (6.20) | 121 (4.59) |
| Lung adenocarcinoma | LUAD | 517 | 240 (5.88) | 277 (10.50) |
| Lung squamous cell carcinoma | LUSC | 502 | 371 (9.09) | 131 (4.97) |
| Mesothelioma | MESO | 87 | 71 (1.74) | 16 (0.61) |
| Ovarian serous cystadenocarcinoma | OV | 266 | Sex-specific (omitted) | |
| Pancreatic adenocarcinoma | PAAD | 179 | 99 (2.43) | 80 (3.03) |
| Pheochromocytoma and Paraganglioma | PCPG | 184 | 82 (2.01) | 102 (3.87) |
| Prostate adenocarcinoma | PRAD | 498 | Sex-specific (omitted) | |
| Rectum adenocarcinoma | READ | 95 | 52 (1.27) | 42 (1.59) |
| Sarcoma | SARC | 263 | 119 (2.92) | 144 (5.46) |
| Skin cutaneous melanoma | SKCM | 473 | 259 (6.35) | 156 (5.91) |
| Stomach adenocarcinoma | STAD | Not available | 268 (6.57) | 147 (5.57) |
| Testicular germ cell tumors | TGCT | 156 | Sex-specific (omitted) | |
| Thyroid carcinoma | THCA | 513 | 102 (2.50) | 246 (9.33) |
| Thymoma | THYM | 120 | 63 (1.54) | 57 (2.16) |
| Uterine corpus endometrial carcinoma | UCEC | 177 | Sex-specific (omitted) | |
| Uterine carcinosarcoma | UCS | 57 | Too few (omitted) | |
| Uveal melanoma | UVM | 80 | 45 (1.10) | 35 (1.33) |
| Total | | 9096 | 4081 | 2638 |
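The percentages in parentheses appear to be each tumor type's share of all male (or all female) samples, since they reproduce when the count is divided by the column total. A minimal sketch of that arithmetic, using the ACC row and the totals from the table (the function name is only illustrative):

```python
def share_percent(count, total):
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100.0 * count / total, 2)

# Column totals taken from the "Total" row of the table.
MALE_TOTAL = 4081
FEMALE_TOTAL = 2638

# ACC row: 31 male and 48 female samples.
print(share_percent(31, MALE_TOTAL))
print(share_percent(48, FEMALE_TOTAL))
```

Because each percentage is relative to its own sex-specific total, the male and female columns are not directly comparable as raw counts.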
Summary statistics for πcc values when classifying 31 tumor types and ignoring sex of the samples across 1000 GA/KNN runs for each of two training/testing partitions (2000 runs total)
| Type | Minimum | 1st Quartile | Median | Mean | 3rd Quartile | Maximum | Modal Prediction Accuracy |
|---|---|---|---|---|---|---|---|
| ACC | 0.23 | 0.76 | 0.88 | 0.83 | 0.92 | 0.97 | 0.97 |
| BLCA | 0.01 | 0.51 | 0.81 | 0.71 | 0.96 | 1.00 | 0.91 |
| CHOL | 0.00 | 0.01 | 0.40 | 0.37 | 0.50 | 0.66 | 0.73 |
| COAD | 0.18 | 0.77 | 0.85 | 0.83 | 0.91 | 0.98 | 0.99 |
| DLBC | 0.65 | 0.82 | 0.89 | 0.87 | 0.94 | 0.98 | 1.00 |
| GBM | 0.46 | 0.86 | 0.96 | 0.91 | 0.98 | 1.00 | 0.99 |
| HNSC | 0.04 | 0.91 | 0.98 | 0.93 | 1.00 | 1.00 | 0.99 |
| KICH | 0.00 | 0.88 | 0.92 | 0.86 | 0.96 | 0.99 | 0.96 |
| KIRC | 0.00 | 0.98 | 1.00 | 0.93 | 1.00 | 1.00 | 0.96 |
| KIRP | 0.00 | 0.79 | 0.97 | 0.85 | 1.00 | 1.00 | 0.92 |
| LAML | 0.89 | 1.00 | 1.00 | 0.99 | 1.00 | 1.00 | 1.00 |
| LGG | 0.56 | 0.99 | 1.00 | 0.97 | 1.00 | 1.00 | 1.00 |
| LIHC | 0.04 | 0.97 | 0.99 | 0.94 | 1.00 | 1.00 | 0.98 |
| LUAD | 0.00 | 0.88 | 0.96 | 0.88 | 0.99 | 1.00 | 0.96 |
| LUSC | 0.03 | 0.67 | 0.92 | 0.78 | 0.97 | 1.00 | 0.88 |
| MESO | 0.00 | 0.72 | 0.87 | 0.76 | 0.93 | 1.00 | 0.90 |
| PAAD | 0.03 | 0.84 | 0.96 | 0.85 | 0.99 | 1.00 | 0.95 |
| PCPG | 0.71 | 0.98 | 1.00 | 0.98 | 1.00 | 1.00 | 1.00 |
| READ | 0.03 | 0.09 | 0.14 | 0.15 | 0.19 | 0.28 | 0.00 |
| SARC | 0.03 | 0.78 | 0.91 | 0.83 | 0.96 | 1.00 | 0.96 |
| SKCM | 0.00 | 0.93 | 0.97 | 0.90 | 0.99 | 1.00 | 0.97 |
| THCA | 0.37 | 1.00 | 1.00 | 0.99 | 1.00 | 1.00 | 1.00 |
| THYM | 0.08 | 0.90 | 0.99 | 0.89 | 1.00 | 1.00 | 0.94 |
| UCS | 0.01 | 0.06 | 0.26 | 0.27 | 0.41 | 0.62 | 0.62 |
| UVM | 0.52 | 0.95 | 0.99 | 0.95 | 1.00 | 1.00 | 1.00 |
| BRCA | 0.01 | 0.98 | 0.99 | 0.97 | 1.00 | 1.00 | 0.99 |
| CESC | 0.00 | 0.52 | 0.76 | 0.68 | 0.87 | 0.98 | 0.94 |
| OV | 0.36 | 0.95 | 0.98 | 0.95 | 0.99 | 1.00 | 1.00 |
| PRAD | 0.53 | 1.00 | 1.00 | 0.99 | 1.00 | 1.00 | 1.00 |
| TGCT | 0.25 | 0.97 | 1.00 | 0.94 | 1.00 | 1.00 | 1.00 |
| UCEC | 0.04 | 0.52 | 0.71 | 0.68 | 0.86 | 1.00 | 0.96 |
The rightmost column labeled “Modal Prediction Accuracy” is not based on πcc but instead on a prediction using the tumor type to which each sample was assigned most often
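The modal prediction described in the note can be sketched as a majority vote per sample across runs, followed by an ordinary accuracy calculation. This is only an illustration under stated assumptions: the function names and toy predictions are hypothetical, and ties are broken by first occurrence, which the source does not specify.

```python
from collections import Counter

def modal_prediction(predictions):
    """Return the label assigned most often across runs (ties: first seen)."""
    return Counter(predictions).most_common(1)[0][0]

def modal_accuracy(true_labels, per_sample_predictions):
    """Fraction of samples whose modal prediction equals the true label."""
    hits = sum(
        modal_prediction(preds) == truth
        for truth, preds in zip(true_labels, per_sample_predictions)
    )
    return hits / len(true_labels)

# Hypothetical toy data: two READ samples, three runs each.
truth = ["READ", "READ"]
runs = [
    ["COAD", "COAD", "READ"],  # modal call is COAD (wrong)
    ["READ", "READ", "COAD"],  # modal call is READ (right)
]
print(modal_accuracy(truth, runs))
```

A vote of this kind can yield a modal accuracy of 0.00 for a tumor type even when some individual runs classify its samples correctly, which is consistent with the READ row.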
Fig. 1 Proportion of test-set samples predicted to be each of the 31 tumor types. Y-axis lists the 31 actual tumor types; x-axis lists the 32 possible classification categories (31 tumor types plus “unclassified” [UC]). Each bar represents one of the 32 proportions that samples from the actual tumor type were predicted to be. The 32 plotted proportions represent means from the corresponding proportions for all samples of the actual tumor type
Fig. 2 Stem plot of gene selection frequency based on 2000 near-optimal gene selection classifiers from 1000 GA/KNN runs for each of two training/testing partitions
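The selection frequency plotted in Fig. 2 can be thought of as a tally of how many classifiers include each gene. A minimal sketch, assuming each classifier is available as a list of gene symbols; the three classifiers below are hypothetical toy input, not results from the study.

```python
from collections import Counter

def selection_frequency(classifiers):
    """Count, for each gene, the number of classifiers that selected it."""
    counts = Counter()
    for genes in classifiers:
        counts.update(set(genes))  # count a gene at most once per classifier
    return counts.most_common()   # most frequently selected genes first

# Hypothetical classifiers (gene symbols used only as placeholders).
toy_classifiers = [
    ["FOXA1", "KRT5"],
    ["FOXA1", "BNC1"],
    ["FOXA1", "KRT5", "KRT5"],
]
for gene, n in selection_frequency(toy_classifiers):
    print(gene, n)
```

Ranking genes by this count is one way to arrive at a "top N genes" list such as the top 50 or top 200 referred to elsewhere in the record.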
Enriched gene ontology (GO) terms for the top 200 genes from the pan-cancer classification of all 9096 samples, ignoring sex
| Gene ontology (GO) terms | P value |
|---|---|
| Anatomical structure development | 3.2e-10 |
| Anatomical structure morphogenesis | 3.7e-10 |
| Developmental process | 5.0e-10 |
| System development | 1.1e-9 |
| Tissue development | 1.4e-9 |
| Organ development | 2.7e-9 |
| Multicellular organismal development | 3.7e-9 |
| Epithelium development | 2.5e-7 |
| Tube development | 1.2e-6 |
| Regulation of transcription, DNA-dependent | 1.5e-6 |
Fig. 3 Heatmap representation of the expression patterns of the top 50 genes across all 9096 samples. Each row (gene) was centered by the median expression value across all samples. A hierarchical clustering analysis was carried out for both samples and genes using the Euclidean distance as the distance metric
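The two preprocessing steps named in the Fig. 3 caption, row-wise median centering and Euclidean distance, can be sketched as follows. The caption does not state the linkage method used for the hierarchical clustering, so this example stops at the distance matrix that a clustering routine would consume; the expression values are hypothetical.

```python
import math
from statistics import median

def median_center(rows):
    """Subtract each row's median from every value in that row."""
    centered = []
    for row in rows:
        m = median(row)
        centered.append([v - m for v in row])
    return centered

def pairwise_euclidean(rows):
    """Return the symmetric matrix of Euclidean distances between rows."""
    return [[math.dist(a, b) for b in rows] for a in rows]

# Hypothetical expression matrix: 3 genes (rows) x 4 samples (columns).
expr = [
    [1.0, 2.0, 6.0, 3.0],
    [5.0, 5.0, 5.0, 9.0],
    [0.0, 4.0, 8.0, 2.0],
]
centered = median_center(expr)
gene_distances = pairwise_euclidean(centered)
print(centered[0])
print(gene_distances[0][1])
```

Clustering samples instead of genes would use the same distance function on the transposed matrix.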
Quartiles for πcc values when classifying 23 non-sex-specific tumor types separately using male and female samples across 1000 GA/KNN runs for each of five training/testing partitions (5000 runs total)
| Type | Minimum (F) | Minimum (M) | 1st Quartile (F) | 1st Quartile (M) | 3rd Quartile (F) | 3rd Quartile (M) | Maximum (F) | Maximum (M) | Modal Prediction Accuracy (F) | Modal Prediction Accuracy (M) |
|---|---|---|---|---|---|---|---|---|---|---|
| ACC | 0.01 | 0.01 | 0.83 | 0.61 | 0.95 | 0.84 | 1.00 | 0.95 | 0.97 | 0.93 |
| BLCA | 0.01 | 0.04 | 0.46 | 0.66 | 0.91 | 0.96 | 1.00 | 1.00 | 0.89 | 0.93 |
| COAD | 0.22 | 0.12 | 0.83 | 0.83 | 0.93 | 0.91 | 1.00 | 0.96 | 1.00 | 0.99 |
| GBM | 0.35 | 0.59 | 0.91 | 0.96 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| HNSC | 0.54 | 0.06 | 0.93 | 0.96 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| KIRC | 0.00 | 0.00 | 0.98 | 0.98 | 1.00 | 1.00 | 1.00 | 1.00 | 0.95 | 0.98 |
| KIRP | 0.00 | 0.00 | 0.51 | 0.93 | 0.92 | 1.00 | 1.00 | 1.00 | 0.89 | 0.92 |
| LAML | 0.93 | 0.83 | 1.00 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| LGG | 0.72 | 0.16 | 0.99 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| LIHC | 0.01 | 0.01 | 0.93 | 0.98 | 1.00 | 1.00 | 1.00 | 1.00 | 0.99 | 0.99 |
| LUAD | 0.06 | 0.05 | 0.90 | 0.80 | 0.99 | 0.96 | 1.00 | 1.00 | 0.96 | 0.97 |
| LUSC | 0.00 | 0.03 | 0.49 | 0.85 | 0.94 | 0.99 | 1.00 | 1.00 | 0.86 | 0.94 |
| MESO | 0.03 | 0.00 | 0.50 | 0.78 | 0.79 | 0.95 | 0.89 | 1.00 | 0.92 | 0.95 |
| PAAD | 0.00 | 0.08 | 0.84 | 0.83 | 0.99 | 0.99 | 1.00 | 1.00 | 0.95 | 0.93 |
| PCPG | 0.14 | 0.80 | 0.98 | 0.97 | 1.00 | 1.00 | 1.00 | 1.00 | 0.98 | 1.00 |
| READ | 0.01 | 0.03 | 0.08 | 0.09 | 0.15 | 0.17 | 0.33 | 0.23 | 0.00 | 0.00 |
| SARC | 0.17 | 0.06 | 0.88 | 0.83 | 0.98 | 0.95 | 1.00 | 1.00 | 1.00 | 0.99 |
| SKCM | 0.15 | 0.01 | 0.89 | 0.90 | 0.98 | 0.98 | 1.00 | 1.00 | 0.98 | 0.97 |
| THCA | 0.81 | 0.71 | 1.00 | 0.96 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| THYM | 0.09 | 0.08 | 0.85 | 0.89 | 1.00 | 0.99 | 1.00 | 1.00 | 0.98 | 0.92 |
| UVM | 0.76 | 0.28 | 0.95 | 0.92 | 0.99 | 0.99 | 1.00 | 1.00 | 1.00 | 0.97 |
| ESCA | 0.00 | 0.05 | 0.02 | 0.35 | 0.61 | 0.97 | 0.80 | 1.00 | 0.38 | 0.64 |
| STAD | 0.07 | 0.08 | 0.84 | 0.80 | 0.98 | 0.96 | 1.00 | 1.00 | 0.98 | 0.97 |
The rightmost columns labeled “Modal Prediction Accuracy” are not based on πcc but instead on a prediction using the tumor type to which each sample was assigned most often
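The summary tables report the minimum, quartiles, and maximum of πcc across GA/KNN runs for each tumor type. A minimal sketch of such a summary, assuming the per-run values for one tumor type are held in a list; the values below are hypothetical, and the "inclusive" quartile method is an assumption, since the source does not say which quantile definition was used.

```python
from statistics import mean, quantiles

def summarize(values):
    """Return min, quartiles, mean, and max of a list of per-run values."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": med,
        "mean": mean(values),
        "q3": q3,
        "max": max(values),
    }

# Hypothetical per-run values for a single tumor type.
toy_runs = [0.0, 0.25, 0.5, 0.75, 1.0]
print(summarize(toy_runs))
```

Different quantile definitions can shift the quartiles slightly for small samples; with thousands of runs the choice matters little.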
Gene ranks from full female dataset, full male dataset, and the eight “matched” male datasets
| Group | Gene | Rank from full female dataset | Rank from full male dataset | Difference (F-M) | Mean (SD) rank from 8 matched male datasets | Difference (F-meanM) |
|---|---|---|---|---|---|---|
| Genes ranked higher using male samples than female samples | BNC1 | 932 | 45 | 887 | 54 (16) | 878 |
| | FAT2 | 392 | 90 | 302 | 143 (23) | 249 |
| | KRT5 | 328 | 47 | 281 | 165 (57) | 163 |
| | RNF43 | 299 | 94 | 205 | 81 (14) | 218 |
| | S1PR5 | 281 | 99 | 181 | 98 (38) | 183 |
| | ANKS4B | 245 | 96 | 148 | 115 (20) | 130 |
| | CSTA | 218 | 93 | 125 | 129 (33) | 89 |
| | ANXA8 | 161 | 48 | 113 | 121 (36) | 40 |
| | KRT8 | 175 | 65 | 110 | 94 (22) | 81 |
| | CLRN3 | 204 | 98 | 106 | 86 (15) | 118 |
| Genes ranked higher using female samples than male samples | FOXA1 | 82 | 417 | −335 | 237 (92) | −155 |
| | AMY1A | 100 | 370 | −270 | 386 (162) | −286 |
| | HPN | 74 | 336 | −262 | 256 (94) | −182 |
| | LAD1 | 45 | 269 | −224 | 129 (40) | −84 |
| | PDZK1 | 83 | 293 | −210 | 228 (79) | −145 |
| | TMC5 | 55 | 241 | −186 | 139 (50) | −84 |
| | KIF12 | 89 | 249 | −160 | 324 (135) | −235 |
| | STK32A | 79 | 226 | −147 | 123 (28) | −44 |
| | CFAP221 | 81 | 187 | −106 | 94 (21) | −13 |
| | TRIM29 | 86 | 188 | −102 | 143 (25) | −57 |
| | HOXA11 | 84 | 184 | −100 | 291 (77) | −207 |
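The two difference columns can be reproduced from the ranks: F-M subtracts the full-male rank from the full-female rank, and F-meanM subtracts the mean rank over the eight matched male datasets. A minimal sketch follows. The eight matched ranks are hypothetical because the record reports only their mean and SD, and the use of the sample standard deviation and of a rounded mean in F-meanM are assumptions.

```python
from statistics import mean, stdev

def rank_differences(female_rank, male_rank, matched_male_ranks):
    """Compute F-M, mean and SD of matched male ranks, and F-meanM."""
    mean_m = mean(matched_male_ranks)
    return {
        "F-M": female_rank - male_rank,
        "mean_matched": mean_m,
        "sd_matched": stdev(matched_male_ranks),  # sample SD (assumption)
        "F-meanM": female_rank - round(mean_m),   # rounded mean (assumption)
    }

# Female and male ranks for FOXA1 are from the table; the eight
# matched-dataset ranks are hypothetical placeholders.
hypothetical_matched = [200, 220, 240, 260, 210, 250, 230, 270]
result = rank_differences(82, 417, hypothetical_matched)
print(result["F-M"], result["F-meanM"])
```

A negative difference means the gene ranked higher (a smaller rank number) in the female dataset than in the male one.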
Fig. 4 Boxplots of FOXA1 expression data in the 23 non-sex-specific tumor types from males (blue) and females (pink)