Travis S Johnson, Christina Y Yu, Zhi Huang, Siwen Xu, Tongxin Wang, Chuanpeng Dong, Wei Shao, Mohammad Abu Zaid, Xiaoqing Huang, Yijie Wang, Christopher Bartlett, Yan Zhang, Brian A Walker, Yunlong Liu, Kun Huang, Jie Zhang.
Abstract
We propose DEGAS (Diagnostic Evidence GAuge of Single cells), a novel deep transfer learning framework, to transfer disease information from patients to cells. We call such transferrable information "impressions," which allow individual cells to be associated with disease attributes like diagnosis, prognosis, and response to therapy. Using simulated data and ten diverse single-cell and patient bulk tissue transcriptomic datasets from glioblastoma multiforme (GBM), Alzheimer's disease (AD), and multiple myeloma (MM), we demonstrate the feasibility, flexibility, and broad applications of the DEGAS framework. DEGAS analysis on myeloma single-cell transcriptomics identified PHF19high myeloma cells associated with progression. Availability: https://github.com/tsteelejohnson91/DEGAS
Keywords: Alzheimer’s disease; Cox proportional hazards; Deep learning; Machine Learning; Multiple myeloma; Prognostic models; Single-cell RNA sequencing; Survival; Transfer learning; scRNA-seq
Year: 2022 PMID: 35105355 PMCID: PMC8808996 DOI: 10.1186/s13073-022-01012-2
Source DB: PubMed Journal: Genome Med ISSN: 1756-994X Impact factor: 11.117
Fig. 1 A workflow diagram of the DEGAS framework. A The workflow for a typical experiment with DEGAS. Note that DEGAS is not meant to replace the abundant packages available to load, preprocess, select features, cluster, and visualize scRNA-seq data. Rather, it is meant to augment these packages by assigning disease associations to cells. B The scRNA-seq and patient expression data are preprocessed into expression matrices. Next, bootstrap-aggregated DenseNet DEGAS models are trained on both single-cell and patient disease attributes using a multitask learning neural network. The network learns a latent representation that reduces the differences between patients and single cells at the final hidden layer using maximum mean discrepancy (MMD). C The output layer of this model can be used to simultaneously infer disease attribute impressions in single cells and cellular composition impressions in patients
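To make the MMD term in panel B concrete, the following is a minimal sketch of how a squared MMD between two sets of hidden-layer activations could be estimated. It assumes a Gaussian (RBF) kernel with a fixed bandwidth and the biased estimator; the kernel choice, the function names, and the toy activation values are illustrative assumptions, not the DEGAS implementation.

```python
import math

def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors (assumed kernel)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    def mean_kernel(a, b):
        total = sum(rbf_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

# Hypothetical 2-D hidden-layer activations (illustrative values only).
patient_hidden = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2]]
cell_hidden_close = [[0.1, 0.1], [0.0, 0.2], [0.2, 0.1]]
cell_hidden_far = [[2.0, 2.1], [2.2, 1.9], [2.1, 2.0]]

print(mmd_squared(patient_hidden, cell_hidden_close))  # small: distributions overlap
print(mmd_squared(patient_hidden, cell_hidden_far))    # larger: distributions differ
```

Used as a penalty during training, a term of this kind pushes the patient and single-cell representations toward the same distribution, which is what allows labels learned on one data type to be read out on the other.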
Summary of the clinical features of patients in each bulk expression cohort used during model training
| Cohort | Feature | Value |
|---|---|---|
| TCGA GBM | Sex | 74 Male, 37 Female |
| TCGA GBM | Age (years) | Range: 14–83, Mean: 56, Median: 58 |
| TCGA GBM | Clinical GBM subtype | 34 Classical, 33 Mesenchymal, 9 Neural, 35 Proneural |
| MSBB | Sex | 90 Male, 131 Female |
| MSBB | Age (years) | Range: 61–90+, Mean*: > 82, Median: 84 |
| MSBB | AD diagnosis | 135 AD, 86 Control |
| MMRF | Sex | 387 Male, 260 Female |
| MMRF | Age (years) | Range: 27–93, Mean: 64, Median: 64 |
| MMRF | Progression-free survival time (days) | Range: 13–1753, Mean: 665.4, Median: 629; 200 patients progressed |
*Final age category is > 90 years. Abbreviations: The Cancer Genome Atlas (TCGA), Glioblastoma Multiforme (GBM), Mount Sinai/JJ Peters VA Medical Center Brain Bank (MSBB), and Multiple Myeloma Research Foundation (MMRF)
Overview of all datasets used in the analysis
| Study | Dataset | Sample size | Data type | Attribute |
|---|---|---|---|---|
| Simulation | Simulated cells^a | 5000 cells | scRNA-seq | Cell type |
| Simulation | Simulated patients^a | 600 patients | RNA-seq | Disease status |
| GBM | Patel et al., 2014 [ | 532 cells (5 patients) | scRNA-seq (SMART-seq) | None |
| GBM | TCGA GBM [ | 111 patients | Microarray | GBM subtype |
| AD | AIBS | 47,396 cells (11 patients) | scRNA-seq (SMART-seq) | Brain cell types |
| AD | Grubman et al., 2019 [ | 13,214 cells (12 patients) | snRNA-seq (10x Genomics) | AD and normal brain cell types |
| AD | Mathys et al., 2019 [ | 5288 cells^b (48 patients) | snRNA-seq (10x Genomics) | AD and normal brain cell types |
| AD | MSBB [ | 682 samples (221 patients) | RNA-seq | AD diagnosis |
| MM | MMRF [ | 647 patients | RNA-seq | PFS |
| MM | IUSM Chen et al. 2021 [ | 22,968 cells (4 patients) | scRNA-seq (10x Genomics) | Subtype cluster (Subtype 1-5) |
| MM | Ledergor et al., 2019 [ | 13,440 cells (35 patients) | scRNA-seq (MARS-seq) | Malignancy (NHIP, MGUS, SMM, MM) |
| MM | Zhan et al., 2006 [ | 559 patients | Microarray | OS |
^a The simulated patients were generated from the simulated cells by combining known proportions of cell types. “None” denotes the lack of labels for the cells/samples in a given dataset. ^b Cells were down-sampled from the total number of cells because some cell types were over-represented. Abbreviations: The Cancer Genome Atlas (TCGA), Glioblastoma Multiforme (GBM), Allen Institute for Brain Science (AIBS), Mount Sinai/JJ Peters VA Medical Center Brain Bank (MSBB), Multiple Myeloma Research Foundation (MMRF), Indiana University School of Medicine (IUSM), Alzheimer’s disease (AD), progression-free survival (PFS), overall survival (OS), normal hip bone marrow (NHIP), monoclonal gammopathy of undetermined significance (MGUS), smoldering multiple myeloma (SMM), multiple myeloma (MM), RNA sequencing (RNA-seq), single-cell RNA-seq (scRNA-seq), and single nuclei RNA-seq (snRNA-seq)
Patient cellular makeup for simulation experiments. The abbreviations are Simulation (sim), Normal (N), and Disease (D). The high-risk cell types are in bold
| | Cell type 1 | Cell type 2 | Cell type 3 | Cell type 4N | Cell type 4D |
|---|---|---|---|---|---|
| Patients sim1D | 16.6% | 16.6% | 16.6% | 00.0% | |
| Patients sim1N | 25.0% | 25.0% | 25.0% | 25.0% | 00.0% |
| Patients sim2D | 25.0% | 25.0% | 25.0% | 00.0% | |
| Patients sim2N | 25.0% | 25.0% | 25.0% | 25.0% | 00.0% |
| Patients sim3D | 16.6% | 16.6% | 16.6% | ||
| Patients sim3N | 25.0% | 25.0% | 25.0% | 25.0% | 00.0% |
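The idea of building simulated patients from simulated cells at known cell-type proportions can be sketched as follows. This is a hedged illustration only: the function name `simulate_patient`, the cells-per-patient count, and the one-marker-gene-per-type toy cells are hypothetical, and the actual simulation used Splatter-generated cells. The proportions shown are those of the sim1N row.

```python
import random

def simulate_patient(cells_by_type, proportions, n_cells=100, rng=None):
    """Sum the expression of cells drawn per cell type at the given proportions.

    cells_by_type: dict mapping cell type -> list of expression vectors.
    proportions:   dict mapping cell type -> fraction of the patient's cells.
    """
    rng = rng or random.Random(0)
    n_genes = len(next(iter(cells_by_type.values()))[0])
    bulk = [0.0] * n_genes
    for cell_type, fraction in proportions.items():
        for _ in range(round(fraction * n_cells)):
            cell = rng.choice(cells_by_type[cell_type])  # sample with replacement
            for g in range(n_genes):
                bulk[g] += cell[g]
    return bulk

# Toy cells: each type expresses exactly one marker gene (hypothetical data).
types = ["1", "2", "3", "4N", "4D"]
cells_by_type = {
    t: [[1.0 if g == i else 0.0 for g in range(len(types))]]
    for i, t in enumerate(types)
}

# Proportions of the sim1N row: 25% each of types 1, 2, 3, 4N and 0% of 4D.
sim1n = {"1": 0.25, "2": 0.25, "3": 0.25, "4N": 0.25, "4D": 0.0}
bulk_profile = simulate_patient(cells_by_type, sim1n)
print(bulk_profile)
```

Because each toy cell type carries a single marker gene, the resulting bulk profile simply counts how many cells of each type were mixed in, which makes the link between proportions and bulk expression easy to see.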
Fig. 2 Simulation study and baseline comparisons of DEGAS framework. A 5000 simulated cells from Splatter with 4 cell types where one of the cell types has two subtypes. Cell type 4 is composed of two subtypes that are specific to either disease or normal patients. In total, 2000 of these cells were used to generate the 600 simulated patients in B–D and 3000 were used as the cell input to our DEGAS models. E Optimal cluster number (4 clusters) based on average silhouette width for the 3000 cells not used to generate patients. F The same 3000 cells used as the cellular input colored by their cluster. G DEGAS comparison to Augur in simulation 1. H DEGAS comparison with Augur in simulation 2. I DEGAS comparison with Augur in simulation 3. J–L DEGAS-calculated disease association from each simulation overlaid onto 3000 cells. The violin plot in the bottom left corner is deconvolution cell type proportion for cell type 1 in simulation 1 patients (J), cell type 4 proportion in simulation 2 patients (K), and cell type 4 proportion in simulation 3 patients (L)
Fig. 3 DEGAS validation in GBM and AD. DEGAS output of the distribution of GBM subtypes in single cells from five GBM tumors. Four of the five tumors had known GBM subtype information from Patel et al. (MGH26: Proneural, MGH28: Mesenchymal, MGH29: Mesenchymal, and MGH30: Classical, indicated by red boxes), which was recapitulated by DEGAS. The subtype information for tumors MGH26, MGH28, MGH29, and MGH30 was derived from Patel et al., whereas MGH31 did not have a clearly defined GBM subtype. The association of cells assigned to each subtype was plotted for each tumor: A MGH26, B MGH28, C MGH29, D MGH30, and E MGH31. Median values are marked by a diamond in each of the violin plots. F The death association centered around 0 is overlaid on all of the single cells from the five tumors (indicated by color). G DEGAS output of AD association for each single cell. The AD association score is indicated by the color and is overlaid onto AIBS single cells. This plot shows the negative AD association in neurons and positive AD association in microglia. H–I There also appeared to be a subpopulation of astrocytes with positive AD association. The astrocytes were plotted separately and colored by AIBS astrocyte subtypes (H) and GFAP expression, a disease-associated astrocyte marker (I). J Comparison of DEGAS-derived AD associations for single cells from AD and normal control samples from Grubman et al. K–M Targeted analysis of microglia from Grubman et al., including the AD associations overlaid onto microglia (K), AD association compared by the AD status of the patient sample from which the cells were taken (L), and Pearson correlation coefficient (PCC) between AD association and HAM marker genes, comparing up- and downregulated HAM marker genes (M). Significance values: n.s. (not significant), • (0.1), * (0.05), ** (0.01), *** (0.001)
Comparison of AD association scores in single cells between cell types as visualized in Fig. 3G
| Cell type | Cell type mean association | Number of cells | |
|---|---|---|---|
| Oligodendrocyte | 0.05 | 1795 | 3.42 |
| Astrocyte | 0.03 | 809 | 3.94 |
| OPC | −0.12 | 738 | 1.09 |
The DEGAS models were trained using neuron, oligodendrocyte, astrocyte, OPC, and microglia cell types. The single cells were split into groups based on their cell type and the mean AD associations of each cell type were evaluated as a correlation. The neuron and microglia groups are bolded to highlight their much higher mean AD association. P values are calculated by treating the association score as a Pearson correlation coefficient
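As a rough illustration of treating a mean association score as a Pearson correlation coefficient, the sketch below converts a correlation r over n cells into a t statistic and a two-sided P value. It assumes the standard t test for a correlation and, to stay within the standard library, approximates the t distribution by a normal distribution, which is reasonable only for large n. The function name is hypothetical and this is not necessarily the exact procedure used by the authors.

```python
import math

def pearson_p_value(r, n):
    """Two-sided P value for a Pearson correlation r computed over n samples.

    Uses t = r * sqrt((n - 2) / (1 - r^2)) and a large-sample normal
    approximation to the t distribution with n - 2 degrees of freedom.
    """
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    p = math.erfc(abs(t) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|t|))
    return t, p

# Inputs taken from the OPC row: mean association -0.12 over 738 cells.
t_stat, p_value = pearson_p_value(-0.12, 738)
print(t_stat, p_value)
```

With several hundred cells per cell type, even a modest mean association can give a small P value, so the effect size and the cell count are best read together.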
Fig. 4 Association between subtypes and progression risk in MM. IUSM CD138+ scRNA-seq subtype clusters generated from Seurat colored by A cluster, i.e., subtype and B progression association. C Kaplan-Meier curves of PFS from cross-validation for the MMRF patients stratified by median proportional hazard. D Kaplan-Meier curves of OS from Zhan et al. external dataset stratified by median proportional hazard. E Progression association for IUSM CD138+ subtypes. F Progression association for NHIP, MGUS, SMM, and MM in the external dataset Ledergor et al. G Subtype 2 enrichment for NHIP, MGUS, SMM, and MM in the external dataset Ledergor et al. NHIP: normal hip bone marrow, MGUS: monoclonal gammopathy of undetermined significance, SMM: smoldering multiple myeloma, MM: multiple myeloma. Significance values: • (0.1), * (0.05), ** (0.01), *** (0.001). All plots were generated using the default parameters for the DEGAS package described in the Methods section “Transfer learning using DEGAS”
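The stratification in panels C and D can be illustrated with a small sketch: patients are split at the median predicted hazard, and a Kaplan-Meier curve is estimated for each group. The function names and the toy follow-up data are hypothetical and serve only to show the mechanics, not to reproduce the published curves.

```python
from statistics import median

def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    events[i] is 1 if the event (e.g., progression) was observed at times[i]
    and 0 if the patient was censored at that time.
    """
    order = sorted(zip(times, events))
    at_risk = len(order)
    survival = 1.0
    curve = []
    i = 0
    while i < len(order):
        t = order[i][0]
        n_events = 0
        n_removed = 0
        while i < len(order) and order[i][0] == t:
            n_events += order[i][1]
            n_removed += 1
            i += 1
        if n_events:
            survival *= 1.0 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_removed
    return curve

def split_by_median(hazards):
    """Indices of patients at or below, and above, the median hazard."""
    cut = median(hazards)
    low = [i for i, h in enumerate(hazards) if h <= cut]
    high = [i for i, h in enumerate(hazards) if h > cut]
    return low, high

# Hypothetical follow-up times (days), event flags, and predicted hazards.
times = [100, 200, 300, 400]
events = [1, 0, 1, 1]
hazards = [0.9, 0.2, 0.8, 0.1]

low, high = split_by_median(hazards)
print(kaplan_meier([times[i] for i in low], [events[i] for i in low]))
print(kaplan_meier([times[i] for i in high], [events[i] for i in high]))
```

A visible separation between the two curves, usually assessed with a log-rank test, is what indicates that the predicted hazard carries prognostic information.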