Hung-I Harry Chen1,2, Yu-Chiao Chiu2, Tinghe Zhang1, Songyao Zhang1,3, Yufei Huang4, Yidong Chen5,6.
Abstract
BACKGROUND: Bioinformatics tools have been developed to interpret gene expression data at the gene set level, and these gene set based analyses improve biologists' ability to discover the functional relevance of their experimental design. While gene sets are usually elucidated individually, associations between gene sets are rarely taken into consideration. Deep learning, an emerging machine learning technique in computational biology, can be used to generate an unbiased combination of gene sets, and to determine the biological relevance and analysis consistency of these combined gene sets by leveraging large genomic data sets.
Keywords: Autoencoder; Deep learning; Gene superset analysis; Survival analysis
Year: 2018 PMID: 30577835 PMCID: PMC6302374 DOI: 10.1186/s12918-018-0642-2
Source DB: PubMed Journal: BMC Syst Biol ISSN: 1752-0509
Fig. 1 The architecture of the gene superset autoencoder (GSAE). In the gene set layer, each colored node represents a gene set, and edges of the same color connect the associated genes to that gene set
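The sparse connectivity described in the caption can be illustrated with a toy forward pass. This is a minimal sketch, not the authors' implementation: the `tanh` activation, the toy sizes (4 genes, 2 gene sets, 1 superset), and all weights and names (`masked_layer`, `dense_layer`, `mask`) are assumptions chosen for illustration. The only property taken from the source is that a gene set node is connected solely to its member genes, while the following superset layer is treated here as fully connected.

```python
import math

def masked_layer(x, weights, mask, bias):
    """Gene set layer: node s only sees gene g when mask[s][g] == 1."""
    return [
        math.tanh(b + sum(w * m * xi for w, m, xi in zip(w_row, m_row, x)))
        for w_row, m_row, b in zip(weights, mask, bias)
    ]

def dense_layer(h, weights, bias):
    """Superset layer: every superset node sees every gene set node."""
    return [
        math.tanh(b + sum(w * hi for w, hi in zip(w_row, h)))
        for w_row, b in zip(weights, bias)
    ]

# Hypothetical membership matrix: rows are gene sets, columns are genes.
mask = [[1, 1, 0, 0],
        [0, 1, 1, 0]]
# Illustrative weights only; a real model would learn these.
gs_w = [[0.5, -0.3, 0.9, 0.1],
        [0.2, 0.4, -0.6, 0.7]]
gs_b = [0.0, 0.0]
ss_w = [[0.2, -0.15]]
ss_b = [0.0]

x = [1.0, 2.0, 3.0, 4.0]          # toy expression values for 4 genes
gene_sets = masked_layer(x, gs_w, mask, gs_b)
supersets = dense_layer(gene_sets, ss_w, ss_b)

# Gene 4 belongs to no gene set, so changing it leaves the outputs untouched.
x_changed = [1.0, 2.0, 3.0, 99.0]
unchanged = masked_layer(x_changed, gs_w, mask, gs_b) == gene_sets
print(gene_sets, supersets, unchanged)
```

Multiplying by a fixed 0/1 mask is one simple way to keep a layer from being fully connected; a sparse matrix would serve the same purpose at realistic sizes.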
Fig. 2 The t-SNE results of 9806 TCGA samples using (a) logTPM data with 15,975 genes (an initial PCA step was performed), and (b) 200 superset outputs
Evaluation of the clustering performance of the two t-SNE results in Fig. 2. As a reference, the compression rate from 15,975 features down to 200 supersets is about 98.7%
| Index method | t-SNE of genes | t-SNE of supersets | Compression loss^a |
|---|---|---|---|
| Dunn index | 0.247 | 0.189 | 23.48% |
| Silhouette index | 0.355 | 0.358 | −0.85% |
| IID index | 7.924 | 8.125 | −2.54% |
^a Compression loss = (index score of genes − index score of supersets) / index score of genes
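The compression loss formula in the footnote can be checked against the tabulated index scores with a few lines of arithmetic. The sketch below uses only the numbers from the table; the function and variable names are illustrative.

```python
def compression_loss(gene_score, superset_score):
    """(index score of genes - index score of supersets) / index score of genes."""
    return (gene_score - superset_score) / gene_score

# (t-SNE of genes, t-SNE of supersets) as listed in the table above.
scores = {
    "Dunn": (0.247, 0.189),
    "Silhouette": (0.355, 0.358),
    "IID": (7.924, 8.125),
}

losses = {name: 100 * compression_loss(g, s) for name, (g, s) in scores.items()}
for name, loss in losses.items():
    print(f"{name}: {loss:.2f}%")

# Compression rate from 15,975 gene features down to 200 supersets.
compression_rate = 100 * (1 - 200 / 15975)
print(f"compression rate: {compression_rate:.1f}%")
```

A negative loss means the superset representation scored slightly higher than the gene-level one on that index.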
Fig. 3 Subtype analysis in BRCA data set. (a) The t-SNE results of BRCA data, where HDBSCAN classified the samples into two groups. The noisy samples were labeled in black and omitted from further analysis. (b) The density plots of the most significant up-superset and three selected top gene sets. The blue/yellow arrow corresponds to positive/negative weight in the model between the gene set and superset. (c) The density plots of the most significant down-superset and three selected top gene sets. (d) The Venn diagram of the significant gene sets in the top 3 up-supersets
Top 15 gene sets in up-superset #1 in BRCA subtype analysis
| Gene Set Terms | PScore^a | | Weight^b |
|---|---|---|---|
| CUI_TCF21_TARGETS_2_UP | 81.999 | 0.980 | 0.198 |
| PEDERSEN_METASTASIS_BY_ERBB2_ISOFORM_7 | 92.578 | 0.927 | −0.154 |
| GRADE_COLON_AND_RECTAL_CANCER_UP | 68.314 | 0.427 | 0.138 |
| DOANE_BREAST_CANCER_ESR1_DN | 87.537 | 0.374 | 0.066 |
| VANTVEER_BREAST_CANCER_ESR1_DN | 80.186 | 0.366 | 0.083 |
| HATADA_METHYLATED_IN_LUNG_CANCER_UP | 76.111 | 0.333 | −0.103 |
| FARMER_BREAST_CANCER_BASAL_VS_LULMINAL | 90.606 | 0.321 | 0.079 |
| RICKMAN_TUMOR_DIFFERENTIATED_WELL_VS_MODERATELY_UP | 80.636 | 0.247 | −0.113 |
| BOQUEST_STEM_CELL_DN | 10.658 | 0.245 | −0.130 |
| BONOME_OVARIAN_CANCER_SURVIVAL_OPTIMAL_DEBULKING | 54.414 | 0.228 | −0.100 |
| MOREAUX_MULTIPLE_MYELOMA_BY_TACI_UP | 88.073 | 0.206 | −0.059 |
| YANG_BREAST_CANCER_ESR1_DN | 71.886 | 0.194 | 0.088 |
| DOUGLAS_BMI1_TARGETS_DN | 18.273 | 0.177 | −0.127 |
| STEIN_ESRRA_TARGETS_RESPONSIVE_TO_ESTROGEN_UP | 84.141 | 0.174 | −0.076 |
| BERNARD_PPAPDC1B_TARGETS_DN | 88.410 | 0.161 | −0.073 |
^a The PScore of the gene set Mann-Whitney U test with location shift = 0.5
^b The weight in the model corresponding to the connection of a gene set to the corresponding superset
Top 15 gene sets in down-superset #1 in BRCA subtype analysis
| Gene Set Terms | PScore^a | | Weight^b |
|---|---|---|---|
| PEDERSEN_METASTASIS_BY_ERBB2_ISOFORM_7 | 92.578 | 0.997 | 0.166 |
| VANTVEER_BREAST_CANCER_ESR1_DN | 80.186 | 0.811 | −0.185 |
| LEI_MYB_TARGETS | 55.903 | 0.800 | 0.201 |
| DOANE_BREAST_CANCER_ESR1_DN | 87.537 | 0.644 | −0.114 |
| CUI_TCF21_TARGETS_2_UP | 81.999 | 0.511 | −0.103 |
| ACEVEDO_NORMAL_TISSUE_ADJACENT_TO_LIVER_TUMOR_DN | 46.415 | 0.340 | 0.127 |
| DELACROIX_RARG_BOUND_MEF | 55.442 | 0.336 | 0.141 |
| SENGUPTA_NASOPHARYNGEAL_CARCINOMA_UP | 37.512 | 0.230 | −0.074 |
| FRASOR_RESPONSE_TO_ESTRADIOL_DN | 77.02 | 0.218 | 0.108 |
| HATADA_METHYLATED_IN_LUNG_CANCER_UP | 76.111 | 0.215 | 0.066 |
| SMID_BREAST_CANCER_LUMINAL_A_UP | 51.385 | 0.211 | 0.077 |
| FOSTER_KDM1A_TARGETS_UP | 62.021 | 0.188 | 0.091 |
| SHEDDEN_LUNG_CANCER_GOOD_SURVIVAL_A12 | 6.764 | 0.178 | −0.096 |
| SWEET_LUNG_CANCER_KRAS_UP | 6.548 | 0.178 | 0.138 |
| KOYAMA_SEMA3B_TARGETS_DN | 24.534 | 0.176 | 0.094 |
^a The PScore of the gene set Mann-Whitney U test
^b The weight in the model corresponding to the connection of a gene set to the superset
The size of encoder layers and the 10-fold cross-validation accuracy of each neural network classifier
| NN classifier^a | Input | Encoder layer 1^d | Encoder layer 2 | Encoder layer 3 | Encoder layer 4 | Accuracy of 10-fold cross-validation |
|---|---|---|---|---|---|---|
| Superset | Genes^b | 2334 | 200 | | | 88.79% |
| Gene set | Genes | 2334 | | | | 87.69% |
| 2-layer fc | Genes | 2334 | 200 | | | 47.86% |
| 2-layer fc | Genes | 2000 | 500 | | | 37.98% |
| 4-layer fc | Genes | 2000 | 200 | 100 | 50 | 46.06% |
| 2-layer fc | PC^c | 400 | 100 | | | 87.57% |
| 4-layer fc | PC | 200 | 200 | 100 | 25 | 87.57% |
^a 2-layer fc, 4-layer fc: 2- or 4-layer fully connected AE
^b Genes input is the 15,183 genes of TCGA BRCA RNA-seq data
^c PC input is the top 500 principal components from PCA
^d The encoder layer 1 of the superset and gene set classifiers is the gene set layer (not a fully connected layer)
The mean sensitivities and specificities of the superset classifier over ten iterations of 10-fold cross-validation
| BRCA subtype | Sensitivity | Specificity |
|---|---|---|
| Basal | 0.957 | 1.000 |
| HER2 | 0.924 | 0.977 |
| Luminal A | 0.862 | 0.935 |
| Luminal B | 0.835 | 0.907 |
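Per-subtype sensitivity and specificity of a multi-class classifier are usually computed one-vs-rest from a confusion matrix. The sketch below shows that calculation; the confusion matrix is entirely hypothetical and is not derived from the paper's results, and the function name is illustrative.

```python
def one_vs_rest_metrics(confusion, labels):
    """Return {label: (sensitivity, specificity)} from a square confusion matrix.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    metrics = {}
    for i, label in enumerate(labels):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp                    # true class i, predicted otherwise
        fp = sum(row[i] for row in confusion) - tp     # other classes predicted as i
        tn = total - tp - fn - fp
        metrics[label] = (tp / (tp + fn), tn / (tn + fp))
    return metrics

labels = ["Basal", "HER2", "Luminal A", "Luminal B"]
# Hypothetical counts for illustration only (rows: true, columns: predicted).
confusion = [
    [19, 0, 1, 0],
    [0, 9, 0, 1],
    [0, 1, 45, 4],
    [0, 1, 3, 16],
]

metrics = one_vs_rest_metrics(confusion, labels)
for label, (sens, spec) in metrics.items():
    print(f"{label}: sensitivity={sens:.3f}, specificity={spec:.3f}")
```

Averaging these values over repeated cross-validation runs would give mean sensitivities and specificities of the kind reported in the table.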
Top 15 gene sets in the highest ranked superset in LUAD survival analysis
| Gene Set Terms | P-value^a | | Weight^b |
|---|---|---|---|
| CUI_TCF21_TARGETS_2_UP | 1.30 × 10^−4 | 0.446 | −0.211 |
| RICKMAN_TUMOR_DIFFERENTIATED_WELL_VS_POORLY_DN | 2.02 × 10^−4 | 0.254 | −0.201 |
| GROSS_HYPOXIA_VIA_ELK3_UP | 5.21 × 10^−4 | 0.245 | 0.115 |
| KAAB_HEART_ATRIUM_VS_VENTRICLE_DN | 5.10 × 10^−4 | 0.194 | 0.145 |
| MARTINEZ_RESPONSE_TO_TRABECTEDIN_DN | 0.0015 | 0.183 | 0.107 |
| MITSIADES_RESPONSE_TO_APLIDIN_UP | 0.2863 | 0.159 | −0.171 |
| KIM_WT1_TARGETS_DN | 0.0064 | 0.146 | 0.081 |
| ENK_UV_RESPONSE_EPIDERMIS_UP | 1 × 10^−5 | 0.143 | 0.077 |
| SENESE_HDAC1_TARGETS_DN | 0.8285 | 0.138 | −0.162 |
| SENGUPTA_NASOPHARYNGEAL_CARCINOMA_WITH_LMP1_UP | 0.1411 | 0.129 | −0.130 |
| YANG_BCL3_TARGETS_UP | 0.0299 | 0.126 | 0.163 |
| GINESTIER_BREAST_CANCER_ZNF217_AMPLIFIED_DN | 0.9507 | 0.124 | 0.132 |
| CONCANNON_APOPTOSIS_BY_EPOXOMICIN_DN | 0.0264 | 0.112 | 0.147 |
| SPIELMAN_LYMPHOBLAST_EUROPEAN_VS_ASIAN_UP | 0.0048 | 0.110 | −0.051 |
| RHEIN_ALL_GLUCOCORTICOID_THERAPY_DN | 0.0154 | 0.107 | 0.051 |
^a The P-value of the gene set log-rank test. ^b The weight in the model corresponding to the connection of a gene set to the superset
Top 15 gene sets in 4th ranked superset in LUAD survival analysis
| Gene Set Terms | P-value^a | | Weight^b |
|---|---|---|---|
| SWEET_LUNG_CANCER_KRAS_DN | 0.7304 | 0.780 | −0.185 |
| ZHANG_BREAST_CANCER_PROGENITORS_UP | 0.0248 | 0.256 | 0.096 |
| ROZANOV_MMP14_TARGETS_UP | 0.1038 | 0.161 | 0.103 |
| MONNIER_POSTRADIATION_TUMOR_ESCAPE_DN | 0.0058 | 0.157 | −0.117 |
| ACEVEDO_FGFR1_TARGETS_IN_PROSTATE_CANCER_MODEL_DN | 0.0988 | 0.154 | 0.114 |
| YOSHIMURA_MAPK8_TARGETS_DN | 0.0195 | 0.150 | −0.126 |
| DELYS_THYROID_CANCER_DN | 0.0065 | 0.125 | −0.079 |
| SWEET_LUNG_CANCER_KRAS_UP | 0.2762 | 0.122 | 0.141 |
| OSWALD_HEMATOPOIETIC_STEM_CELL_IN_COLLAGEN_GEL_DN | 0.0132 | 0.101 | 0.120 |
| GROSS_HYPOXIA_VIA_ELK3_UP | 5.21 × 10^−4 | 0.100 | 0.058 |
| WATTEL_AUTONOMOUS_THYROID_ADENOMA_UP | 0.1555 | 0.096 | −0.089 |
| VERHAAK_GLIOBLASTOMA_MESENCHYMAL | 0.7314 | 0.095 | 0.113 |
| PHONG_TNF_RESPONSE_NOT_VIA_P38 | 0.0972 | 0.093 | −0.121 |
| RUTELLA_RESPONSE_TO_HGF_UP | 0.7217 | 0.091 | −0.055 |
| IWANAGA_CARCINOGENESIS_BY_KRAS_PTEN_UP | 0.0249 | 0.090 | 0.088 |
^a The P-value of the gene set log-rank test. ^b The weight in the model corresponding to the connection of a gene set to the superset
Fig. 4 The Kaplan-Meier curves of (a) the 1st ranked superset and three selected gene sets from the top 20 associated with that superset, and (b) the 4th ranked superset and three selected gene sets from the top 20 associated with that superset. The blue/yellow arrow corresponds to positive/negative weight in the model between the gene set and superset
The statistical information of GSAE outputs between the training and test TCGA data sets of four cancer types
| TCGA | Jaccard index, superset^a | Jaccard index, gene set^b | Proportion, superset^c | Proportion, gene set^d | Two-proportion z-test P-value^e |
|---|---|---|---|---|---|
| BRCA | 0.344 | 0.124 | 11 / 24 | 31 / 197 | 0.0002 |
| LUAD | 0.182 | 0.113 | 6 / 12 | 32 / 145 | 0.0150 |
| SKCM | 0.179 | 0.069 | 5 / 19 | 17 / 139 | 0.0485 |
| LGG | 0.483 | 0.475 | 29 / 45 | 299 / 481 | 0.3821 |
Supersets/gene sets with log-rank P-value < 0.05 were selected as prognostically significant sets. ^a Jaccard index of significant supersets between training and test data. ^b Jaccard index of significant gene sets between training and test data. ^c Superset proportion: (# of overlapped significant supersets between training and test data) over (# of significant supersets in training data). ^d Gene set proportion: (# of overlapped significant gene sets between training and test data) over (# of significant gene sets in training data). ^e The P-value of the z-test comparing superset and gene set proportions
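The two statistics defined in the footnotes can be sketched in a few lines. The source does not say which variant of the two-proportion z-test was used; the version below assumes a pooled standard error and a one-sided alternative (superset proportion greater than gene set proportion), which for the BRCA counts 11 / 24 and 31 / 197 gives a P-value close to the tabulated 0.0002. The set members passed to `jaccard` are hypothetical identifiers, since the table lists only the resulting indices.

```python
import math

def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def two_proportion_z_test(x1, n1, x2, n2):
    """One-sided pooled z-test for H1: x1/n1 > x2/n2 (an assumed variant)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Hypothetical significant sets in training and test data.
example_jaccard = jaccard({"s1", "s2", "s3"}, {"s2", "s3", "s4"})

# BRCA row of the table: 11 / 24 supersets vs 31 / 197 gene sets.
z_brca, p_brca = two_proportion_z_test(11, 24, 31, 197)
print(f"Jaccard={example_jaccard:.2f}, z={z_brca:.3f}, P={p_brca:.4f}")
```

Under the same assumptions the LUAD row (6 / 12 vs 32 / 145) gives a P-value of about 0.015; the remaining rows come out close to, but not exactly at, the tabulated values, so the original computation may differ in detail.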