Keita Iida1, Jumpei Kondo2,3, Johannes Nicolaus Wibisana1, Masahiro Inoue3, Mariko Okada1,4. 1. Institute for Protein Research, Osaka University, Suita, Osaka, 565-0871, Japan. 2. Division of Health Sciences, Osaka University Graduate School of Medicine, Suita, 565-0871, Japan Osaka. 3. Department of Clinical Bio-resource Research and Development, Graduate School of Medicine Kyoto University, Kyoto, 606-8501, Japan. 4. Center for Drug Design and Research, National Institutes of Biomedical Innovation, Health and Nutrition, Ibaraki, Osaka, 567-0085, Japan.
Abstract
MOTIVATION: Single-cell RNA sequencing (scRNA-seq) analysis reveals heterogeneity and dynamic cell transitions. However, conventional gene-based analyses require intensive manual curation to interpret biological implications of computational results. Hence, a theory for efficiently annotating individual cells remains warranted. RESULTS: We present ASURAT, a computational tool for simultaneously performing unsupervised clustering and functional annotation of disease, cell type, biological process, and signaling pathway activity for single-cell transcriptomic data, using a correlation graph decomposition for genes in database-derived functional terms. We validated the usability and clustering performance of ASURAT using scRNA-seq datasets for human peripheral blood mononuclear cells, which required fewer manual curations than existing methods. Moreover, we applied ASURAT to scRNA-seq and spatial transcriptome datasets for human small cell lung cancer and pancreatic ductal adenocarcinoma, respectively, identifying previously overlooked subpopulations and differentially expressed genes. ASURAT is a powerful tool for dissecting cell subpopulations and improving biological interpretability of complex and noisy transcriptomic data. AVAILABILITY: ASURAT is published on Bioconductor (DOI: 10.18129/B9.bioc.ASURAT). The codes for analyzing data in this article are available at Github (https://github.com/keita-iida/ASURATBI) or figshare (https://doi.org/10.6084/m9.figshare.19200254.v3). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
MOTIVATION: Single-cell RNA sequencing (scRNA-seq) analysis reveals heterogeneity and dynamic cell transitions. However, conventional gene-based analyses require intensive manual curation to interpret biological implications of computational results. Hence, a theory for efficiently annotating individual cells remains warranted. RESULTS: We present ASURAT, a computational tool for simultaneously performing unsupervised clustering and functional annotation of disease, cell type, biological process, and signaling pathway activity for single-cell transcriptomic data, using a correlation graph decomposition for genes in database-derived functional terms. We validated the usability and clustering performance of ASURAT using scRNA-seq datasets for human peripheral blood mononuclear cells, which required fewer manual curations than existing methods. Moreover, we applied ASURAT to scRNA-seq and spatial transcriptome datasets for human small cell lung cancer and pancreatic ductal adenocarcinoma, respectively, identifying previously overlooked subpopulations and differentially expressed genes. ASURAT is a powerful tool for dissecting cell subpopulations and improving biological interpretability of complex and noisy transcriptomic data. AVAILABILITY: ASURAT is published on Bioconductor (DOI: 10.18129/B9.bioc.ASURAT). The codes for analyzing data in this article are available at Github (https://github.com/keita-iida/ASURATBI) or figshare (https://doi.org/10.6084/m9.figshare.19200254.v3). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Single-cell RNA sequencing (scRNA-seq) has deepened our knowledge of biological complexity in terms of heterogeneity and dynamic transition of cell populations, and this knowledge has immense potential for helping elucidate the regulatory principles underlying our body plans (La Manno ). scRNA-seq has been widely used to improve the molecular understanding of malignant cells in lymphoma (Zhang ), intra- and intertumoral heterogeneity in drug-treated cancer populations (Stewart ), ligand–receptor interaction in tumor immune microenvironments (Chen ) and effects of viral infection on immune cell populations (Devitt ). Various clustering methods based on gene expression similarity have been proposed (Pasquini ) and applied to annotate cell types (Kim ). However, conventional gene-based analyses require intensive manual curation to annotate clustering results; hence, efficient and unbiased interpretation of single-cell data remains challenging (Andrews ; Aran ; Kiselev ).Conventionally, single-cell transcriptomes are analyzed and interpreted by means of unsupervised clustering followed by manual curation of marker genes selected from a large number of differentially expressed genes (DEGs) (Andrews ). Here, manual curations are based on literature searches of biological functions of DEGs. Today, several computational tools for cell type inference are available, as detailed in a review by Pasquini . However, manual curation is often difficult because a single gene is generally multifunctional and therefore associated with multiple biological function terms (Cancer Genome Atlas Research Network ). In cancer transcriptomics, this difficulty is exacerbated by the complex interdependence between disease-related genes and their heterogeneous expressions, associated with numerous biological function terms.A possible solution is to realize clustering and interpretation simultaneously. Recently, reference component analysis has been used for accurate clustering of single-cell transcriptomes along with unbiased cell-type annotation based on similarity to reference transcriptome panels (Li ). Yet, these methods require transcriptomic data of well-characterized reference cells as learning datasets, which might not always be available. Another approach is using functionally annotated gene sets for scoring cells, implemented in R packages including PAGODA (Fan ) and ssGSEA (Subramanian ). Given single-cell transcriptome data, these methods use statistical methods, such as principal component analysis (PCA) and gene set enrichment analysis (GSEA), for providing each cell with scores of annotations against functionally annotated gene sets, such as signaling pathway modules. Nevertheless, correlations of gene expressions are complex with positive and negative (Saxena ), strong and weak and non-linear relationships, which can be poorly captured using the existing methods (see also Section 4).To overcome these limitations, a nonlinear framework defining biological terms in a more interpretable way is needed. Therefore, we propose the computational tool, ASURAT (functional annotation-driven unsupervised clustering of single-cell transcriptomes), which simultaneously performs unsupervised clustering and biological interpretation in terms of cell type, disease, biological process and signaling pathway, using a nonlinear correlation graph decomposition for functionally annotated gene sets. In this study, we demonstrate the clustering performance of ASURAT using scRNA-seq datasets for healthy and disease human peripheral blood mononuclear cells (PBMCs), small cell lung cancer (SCLC) and spatial transcriptome (ST) datasets for pancreatic ductal adenocarcinoma (PDAC). Our results suggest that ASURAT can greatly improve functional understanding of single-cell transcriptomes, adding a new layer of biological interpretability to conventional gene-based analyses.
2 Materials and methods
2.1 Overview of ASURAT workflow
ASURAT is a computational tool for simultaneously clustering and interpreting single-cell transcriptomes (Fig. 1) using functionally annotated gene sets collected from knowledge-based databases for cell type, disease, biological process and signaling pathway activity, such as Cell Ontology (CO) (Diehl ), Disease Ontology (DO) (Yu ), Gene Ontology (GO) (Yu ) and Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa and Goto, 2000) (Fig. 1b). ASURAT creates multiple biological terms using single-cell transcriptome data and annotated gene sets (Fig. 1c). We called such biological terms ‘signs’. Then, ASURAT creates sign-by-sample matrices (SSMs), in which rows and columns stand for signs and samples (cells), respectively (Fig. 1d). SSM is analogous to a read count table, where the rows represent signs with biological meaning instead of individual genes and the values contained are ‘sign scores’ instead of read counts. By analyzing SSMs, individual cells can be characterized by various biological terms (Fig. 1e).
Fig. 1.
Workflow of ASURAT. (a) Flowchart of ASURAT. (b) Collection of databases. (c) Creation of signs and (d) SSMs. (e) Analysis of SSMs to infer cell types, diseases, biological processes and signaling pathway activities
Workflow of ASURAT. (a) Flowchart of ASURAT. (b) Collection of databases. (c) Creation of signs and (d) SSMs. (e) Analysis of SSMs to infer cell types, diseases, biological processes and signaling pathway activities
2.2 Sign
Let be a read count table of size from single-cell transcriptomic data, whose rows and columns are genes, represented by , and cells, respectively, and a ‘rapport’ (e.g. correlation matrix) among . Let be a set of ordered pairs, where and are biological description and the annotated gene set, respectively. Consider an -dependent representation , where is an integer, for ; then, the triplet is termed a sign, in particular a parent sign. Our definition is based on Saussure’s semiology. According to Maruyama, the original notion of a signe is a segment of ‘a thing of interest’, created by an arbitrary decomposition based on its relations. For example, ‘rainbow’ is a continuum of varying light input, from which we see distinct colors of red, yellow, green and blue by our subjective decomposition based on their spectral relationships (Couper, 2015).
2.3 Correlated gene set
Let be a correlation matrix of size defined by and a certain measure (e.g. Spearman’s measure), whose diagonal elements are 1s. Let and be positive and negative constants satisfying and , respectively. Let us arbitrarily fix and consider the following subsets of :Hereinafter we omit the arguments and for simplicity. Let be a submatrix of . Let us identify the row vectors of with points in -dimensional Euclidean space and denote the set of those points as . Then, let us identify all the elements in with those in through the subscripts of , . If is not empty, decompose into two disjoint subsets and by Partitioning Around Medoids (PAM) clustering (Schubert and Rousseeuw, 2019), from which we obtainOtherwise, if is empty, let and (empty). Thus, is decomposed into three parts as follows:
where . Let and be submatrices of and let (resp. ) be the mean of off-diagonal elements of (). We assume without loss of generality. If , then , and are termed strongly, variably and weakly correlated gene sets, which are hereafter abbreviated as SCG, VCG and WCG, respectively. Otherwise, correlated gene sets cannot be defined for .Figure 2 shows that the SCG and VCG include KRT18 and ASCL1, which have negative and positive contributions for lung small cell carcinoma, respectively. Thus, we interpret that and for DOID 5409 relate positively and negatively with this cell type, respectively. Though there exist simpler methods for decomposing graphs, such as one-shot PAM clustering, hierarchical clustering and tree cutting (Murtagh and Legendre, 2014), PCA-based methods (Hyvarinen, 1999) and several graph statistical approaches (Blondel ; Bodenhofer ), we found that the VCG definition is critical for clustering cells. In fact, we tried replacing our decomposition method (1) with one-shot PAM clustering, but the results often exhibited deteriorated performance since both VCG and WCG (obtained from the one-shot clustering) included weakly correlated genes.
Fig. 2.
Representation of correlation graph decomposition. From scRNA-seq data and a Disease Ontology term with DOID 5409, which concerns SCLC, three signs , , were produced from their parent sign by decomposing the correlation graph into strongly, variably and weakly correlated gene sets (e.g. positive and negative correlations are observed between KRT18 and CD9, and KRT18 and IGFBP2, respectively). The edge width indicates the strength of the correlation.
Representation of correlation graph decomposition. From scRNA-seq data and a Disease Ontology term with DOID 5409, which concerns SCLC, three signs , , were produced from their parent sign by decomposing the correlation graph into strongly, variably and weakly correlated gene sets (e.g. positive and negative correlations are observed between KRT18 and CD9, and KRT18 and IGFBP2, respectively). The edge width indicates the strength of the correlation.
2.4 Sign-by-sample matrix
Let be a gene-by-cell matrix of size from a single-cell transcriptomic data, whose entries stand for normalized-and-centered gene expression levels. For simplicity, let us assume that annotated gene sets can be decomposed into non-empty , and , for . Let , , be matrices of size , whose entries are defined as follows:
where stands for the number of elements in . Additionally, let , , be matrices, whose entries are defined as follows:
where , , are weight constants. Here, and are termed SSMs for SCG and VCG, respectively, while the entry as sign score of the th sign for th sample (cell). Sign score profiles can be represented as points in a Euclidean space, termed sign space in this article. Notably, the ensemble means of sign scores across cells are zeros since SSMs are derived from the centered gene expression matrix .
2.5 Significant sign
Using ASURAT, one can create multiple signs and SSMs by setting the appropriate parameters (Supplementary Note S1 and Fig. S1). Using SSMs, we can infer cell states by performing unsupervised clustering and investigating significant signs, where ‘significant’ means that the sign scores (2) are specifically upregulated or downregulated at the cluster level. Significant signs are analogous to DEGs. Here, naïve usages of statistical tests and fold change analyses should be avoided because the row vectors of SSMs are centered. Hence, we propose a nonparametric separation index, which quantifies the extent of separation between two sets of random variables (Supplementary Note S2 and Fig. S2).
3 Results
3.1 Clustering single-cell transcriptomes of PBMCs from healthy donors
To validate the clustering and interpretation performances of ASURAT in comparison with existing methods, we analyzed two public scRNA-seq datasets, namely PBMCs 4k and 6k (Supplementary Note S3), in which the cell types were inferred via computational tools based on prior assumptions (Cao ). We first excluded low-quality genes and cells (Supplementary Note S4); the resulting read count tables were supplied to ASURAT and four other methods: scran (Lun ), Seurat (Hao ), Monocle 3 (Trapnell ) and SC3 (Kiselev ). To infer the cell types, we used existing methods, performed unsupervised clustering of cells, and annotated each cluster by manually investigating DEGs based on the adjusted P-values or false discovery rates (Supplementary Note S5). Using ASURAT, we created SSMs from CO, Molecular Signatures Database (MSigDB) (Subramanian ) and CellMarker databases (Zhang ) for cell type, GO database for biological process and KEGG for signaling pathway. Then, cells were clustered via k-nearest neighbor graph generation and the Louvain algorithm (Hao ) based on the SSM for cell type and annotated by significant signs and DEGs.Among all methods used, Seurat, Monocle 3 and ASURAT could robustly reproduce most blood cell types (Fig. 3a and Supplementary Fig. S5), as inferred by Cao . We found that manual annotations provided comparable population ratios with previous results (Cao ). Yet, it was quite laborious to manually investigate marker genes from numerous DEGs (Fig. 3b). To avoid such laborious process, we applied scCATCH (Shao ) to automatically annotate the clustering results of Seurat. Nevertheless, population ratios inferred by scCATCH were less consistent than those by manual annotation (Fig. 3a and Supplementary Fig. S5). We also used ssGSEA (Subramanian ) for providing each cell with enrichment scores against ‘cell type signature gene sets’ defined in MSigDB without relying on DEGs (Supplementary Note S5). Although ssGSEA seemed to have imperfect clustering performance (Supplementary Figs S3e and S4e), the resulting scores were approximately consistent with Seurat annotations with a few exceptions (Supplementary Fig. S6).
Fig. 3.
Identification of cell types in peripheral blood mononuclear cell 4k single-cell transcriptome. (a) Population ratios predicted via seven different methods. (b) Uniform manifold approximation and projection (UMAP) plots, computed using Seurat and Monocle 3, for the manual investigation of DEGs based on the adjusted P-values . (c) SSMs and scaled log-transformed read count table, which are vertically concatenated and clustered based on the SSM for cell type. Only top significant signs and DEGs are shown in rows. (d) UMAP plot of the SSM for cell type. (e) Violin plots showing sign scores of significant signs, in which separation indices (I) for the clusters marked with asterisks against the others show ***. Suffixes ‘-S’ and ‘-V’ after IDs indicate the signs are defined by strongly and variably correlated gene sets, respectively
Identification of cell types in peripheral blood mononuclear cell 4k single-cell transcriptome. (a) Population ratios predicted via seven different methods. (b) Uniform manifold approximation and projection (UMAP) plots, computed using Seurat and Monocle 3, for the manual investigation of DEGs based on the adjusted P-values . (c) SSMs and scaled log-transformed read count table, which are vertically concatenated and clustered based on the SSM for cell type. Only top significant signs and DEGs are shown in rows. (d) UMAP plot of the SSM for cell type. (e) Violin plots showing sign scores of significant signs, in which separation indices (I) for the clusters marked with asterisks against the others show ***. Suffixes ‘-S’ and ‘-V’ after IDs indicate the signs are defined by strongly and variably correlated gene sets, respectivelyUsing ASURAT, we identified six cell types, including dendritic cell and megakaryocyte (Fig. 3a and Supplementary Fig. S5), by primarily investigating significant signs for cell type and secondarily the other signs and DEGs (Fig. 3c–e and Supplementary Fig. S7). The population ratios were approximately consistent with the reported values (Cao ), except for the small dendritic cell population possibly included in PBMCs (Villani ). Such a small discrepancy was unavoidable, since Cao used author-defined DEGs and preselected cell types to identify the most preferable ones. Our results were robust against variations of the number of cells by randomly downsampling the cells from PBMC 4k data and analyzing these data using Seurat and ASURAT with almost the same parameters (Supplementary Note S6). These results demonstrate that ASURAT can perform robust clustering for single-cell transcriptomes.
3.2 Clustering single-cell transcriptomes of PBMCs from control and sepsis donors
ASURAT uses database-derived biological terms for clustering single-cell transcriptomes, which inevitably introduces annotation bias; some biological terms are associated with many genes, while others are associated with few (Gaudet and Dessimoz, 2017). Hence, it is important to validate the clustering performance in terms of cell state granularity. Here, we analyzed scRNA-seq datasets of PBMC published by the clinical cohort study for bacterial sepsis (Reyes ) (Supplementary Note S3), in which 65 subjects with different health conditions were included, total of 106 545 CD45+ cells and 19 806 LIN−CD14−HLA-DR+ dendritic cells were profiled, 16 immune-cell states were defined and the immune signatures of sepsis against bacterial infection were studied.After excluding low-quality genes and cells (Supplementary Note S4), we inferred the cell types, using scran, Seurat, Monocle 3 and ASURAT with almost the same parameters as in the analyses of PBMCs 4k and 6k (Supplementary Note S7). Among all the methods, ASURAT could reproduce most immune cell types (Supplementary Fig. S12), which was consistent with the previous report (Reyes ). ASURAT identified 11 clusters by performing an unsupervised clustering of the SSM for cell type (Fig. 4a). Among these, we identified three monocyte subpopulations (Fig. 4a–c and Supplementary Fig. S11): M1 and M2, opposite subtypes with decreased and increased functions of fatty acid degradation; M3, similar to T cells, are characterized by increased function of cell adhesion, containing CD2, CD8A and CD58 (Fig. 4c).
Fig. 4.
Identification of cell states in peripheral blood mononuclear cell single-cell transcriptomes from control and sepsis donors. (a) t-distributed stochastic neighbor embedding (t-SNE) plots of the SSM for cell type, showing (top) the clustering result and (bottom) reported labels for CD45+ cells and dendritic cells (DCs) by Reyes . (b and c) Violin plots showing sign scores of significant signs, in which separation indices (I) for the clusters marked with asterisks against the others show ***, ** and *. (d) Population ratios of total monocytes (M1, M2 and M3) and subcluster M2 to all cells except for the inferred dendritic cells across each subject type. Control, uninfected and healthy control; Leuk-UTI, subjects with urinary tract infection (UTI) with leukocytosis but no organ dysfunction; Int-URO and URO, subjects with UTI with mild (or transient) and clear (or persistent) organ dysfunction, respectively; Bac-SEP, bacteremic subjects with sepsis in hospital wards; ICU-SEP and ICU-NoSEP, bacteremic subjects admitted to the intensive care unit with and without sepsis, respectively
Identification of cell states in peripheral blood mononuclear cell single-cell transcriptomes from control and sepsis donors. (a) t-distributed stochastic neighbor embedding (t-SNE) plots of the SSM for cell type, showing (top) the clustering result and (bottom) reported labels for CD45+ cells and dendritic cells (DCs) by Reyes . (b and c) Violin plots showing sign scores of significant signs, in which separation indices (I) for the clusters marked with asterisks against the others show ***, ** and *. (d) Population ratios of total monocytes (M1, M2 and M3) and subcluster M2 to all cells except for the inferred dendritic cells across each subject type. Control, uninfected and healthy control; Leuk-UTI, subjects with urinary tract infection (UTI) with leukocytosis but no organ dysfunction; Int-URO and URO, subjects with UTI with mild (or transient) and clear (or persistent) organ dysfunction, respectively; Bac-SEP, bacteremic subjects with sepsis in hospital wards; ICU-SEP and ICU-NoSEP, bacteremic subjects admitted to the intensive care unit with and without sepsis, respectivelyWe investigated the differences in cell state composition across each subject type and found that the fractions of total monocytes in subjects with infections (Leuk-UTI, Int-URO, URO, Bac-SEP, ICU-SEP and ICU-NoSEP) are larger than those in healthy controls (Control) (Fig. 4d), which is consistent with the reported result (Reyes ). Moreover, the fractions of M2 in subjects with organ dysfunction (Int-URO, URO, Bac-SEP and ICU-SEP) are smaller than those in subjects without organ dysfunction (Control and Leuk-UTI) except for severely ill patients without infection (ICU-NoSEP). Our results suggest that sepsis is associated with impairments of lipid metabolism in monocytes, supported by a previous proteomic study (Sharma ). These results demonstrate that ASURAT can cluster cells in fine-grained manners and identify functional subtypes.
3.3 Clustering a single-cell transcriptome of SCLC
SCLC tumors undergo a transition from chemosensitivity to chemoresistance states against platinum-based therapy through changes in transcriptional heterogeneity (Stewart ). The ability of SCLC to change phenotype in response to environmental cues involves multiple physiological states of cells, such as pathological states, cell cycle phases (Dominguez ) and metabolic processes (Jalili ). However, functional states cannot be readily identified using conventional gene-based analysis alone, and hence the mechanism behind chemoresistance remains unclear. To better understand cancer subtypes in chemoresistant tumors, we applied Seurat and ASURAT to the SCLC scRNA-seq data with cisplatin treatments, obtained from circulating tumor cell-derived xenografts generated from treatment-naïve lung cancer patients (Stewart ) (Supplementary Note S3).First, we examined the expression levels of known SCLC marker genes (Ireland ), namely ASCL1, NEUROD1, YAP1 and POU2F3, and confirmed that almost all of the cells are of the ASCL1 single-positive subtype (Supplementary Fig. S13), which is consistent with the previous report (Stewart ). Then, we excluded low-quality genes and cells (Supplementary Note S4). To investigate molecular subtypes and potential resistance pathways, we performed unsupervised clustering of cells, using Seurat (Supplementary Note S8). We found that the populations assigned to G1, S and G2M phases are sequentially distributed in the uniform manifold approximation and projection (UMAP) space (Fig. 5a), while few clusters were observed when we regressed out the cell cycle effects (Supplementary Fig. S14), indicating that cell cycle signals are informative in SCLC heterogeneity. Next, we performed KEGG pathway enrichment analysis based on the DEGs with adjusted P-values <, the approximate highest threshold where meaningful enrichments of KEGG terms were obtained, but chemoresistance-related terms were not primarily enriched (Fig. 5b, Supplementary Note S8).
Fig. 5.
Clustering result of a single-cell transcriptome of SCLCs. (a) Uniform manifold approximation and projection (UMAP) plots, showing (left) the clustering result and (right) cell cycle phases computed using Seurat. (b) Pathway enrichment analysis for the clustering result in (a), based on the DEGs with adjusted P-values, in which top five enriched terms are shown. (c) SSMs and scaled log-transformed read count table, which are vertically concatenated and clustered based on the SSM for disease. Only top significant signs and DEGs are shown in rows. (d) Diffusion map plots of the SSM for disease, showing (top) the clustering result and (bottom) cell cycle phases. (e) Violin plots showing sign scores of significant signs, in which separation indices (I) for the clusters marked with asterisks against the others show ** and *
Clustering result of a single-cell transcriptome of SCLCs. (a) Uniform manifold approximation and projection (UMAP) plots, showing (left) the clustering result and (right) cell cycle phases computed using Seurat. (b) Pathway enrichment analysis for the clustering result in (a), based on the DEGs with adjusted P-values, in which top five enriched terms are shown. (c) SSMs and scaled log-transformed read count table, which are vertically concatenated and clustered based on the SSM for disease. Only top significant signs and DEGs are shown in rows. (d) Diffusion map plots of the SSM for disease, showing (top) the clustering result and (bottom) cell cycle phases. (e) Violin plots showing sign scores of significant signs, in which separation indices (I) for the clusters marked with asterisks against the others show ** and *Subsequently, to investigate functional subtypes of SCLCs, we used ASURAT to create SSMs from DO, GO and KEGG databases for disease, biological process and signaling pathway, respectively (Fig. 5c). We performed a dimensionality reduction using a diffusion map (Coifman and Lafon, 2006), followed by unsupervised clustering of cells using MERLoT (Parra ) based on the SSM for disease (Fig. 5d, Supplementary Note S8). Investigating significant signs and DEGs based on the separation indices and adjusted P-values <, respectively, we found three SCLC subpopulations (Fig. 5c–e): S1, cancer cells characterized by significant signs for hematopoietic system disease and DEGs for ribosomal protein (RPS12, RPL18A, etc.); S2, characterized by significant signs for platinum drug resistance and DEGs for cancer-related genes (HMGB1, PTTG1, etc.); and S3, characterized by significant signs for programmed death ligand 1 (PD-L1) expression-mediated immunosuppression and DEGs for proto-oncogenes (FOS, JUN, etc.). Although SCLC molecular subtypes have been extensively studied (Balanis ; Chen ; Ireland ; Schwendenwein ), these functional subpopulations have been previously overlooked. Identifying de novo SCLC subtypes by future work will validate our clustering results. ASURAT provides a novel clue for the clinical improvements for relapsed SCLC tumors.
3.4 Clustering an ST of PDAC
Recent studies of spatially resolved transcriptomic profiling for human PDAC tumors have uncovered that cancer and non-cancer cells are spatially distributed in the distinct tissue regions of primary tumors, and that PDAC cells are accompanied by inflammatory fibroblasts and immune cells (Elosua-Bayes ; Moncada ). In an original study (Moncada ), the cellular resolutions of the STs were estimated at 20–70 cells per ST spot, far lower than those of scRNA-seq. Thus, computational methods have been proposed to predict existing cell types by integrating ST and scRNA-seq datasets (Elosua-Bayes ; Moncada ). Here, we applied Seurat and ASURAT to the published PDAC ST and scRNA-seq data (Moncada ) (Supplementary Note S3), aiming to compare the clustering results of ASURAT with those of existing methods.First, we combined all scRNA-seq datasets after confirming minimal batch effects (Supplementary Fig. S15). We excluded low-quality genes and cells from the ST and scRNA-seq data (Supplementary Note S4). To cluster the ST with reference to the combined scRNA-seq data, we integrated these two data, using a canonical correlation analysis-based data integration method (Butler ) (Fig. 6a). Then, we used Seurat to perform unsupervised clustering for the integrated data. Unexpectedly, batch effects were not corrected between ST and scRNA-seq datasets after data integration (Fig. 6b); nevertheless, the inferred cancer and non-cancer regions were approximately consistent with the previously annotated histological regions (Moncada ) (Fig. 6c), wherein several marker genes such as S100P and FSCN1 were identified as DEGs for the putative cancer cluster (Supplementary Note S9).
Fig. 6.
Clustering results of a ST of PDAC. (a) Canonical correlation analysis-based data integration of ST and scRNA-seq data using Seurat functions. (b–g) t-distributed stochastic neighbor embedding (t-SNE) plots and clustering results shown in the PDAC tissue, based on the indicated methods: (b), (d) and (f), showing (left) the labels for ST and scRNA-seq data and (right) annotation results by manual investigation based on the DEGs with adjusted P-values and significant signs; (c), (e) and (g), showing the clustering results, in which labels are the same with those in (b), (d) and (f), respectively. The black colored spots pointed by the arrows indicate the spots newly predicted as atypical region, which might be a normal pancreas involved in cancer. (h) Profiles of sign scores in the PDAC tissue, predicting cancer, inflammation and pancreas spots
Clustering results of a ST of PDAC. (a) Canonical correlation analysis-based data integration of ST and scRNA-seq data using Seurat functions. (b–g) t-distributed stochastic neighbor embedding (t-SNE) plots and clustering results shown in the PDAC tissue, based on the indicated methods: (b), (d) and (f), showing (left) the labels for ST and scRNA-seq data and (right) annotation results by manual investigation based on the DEGs with adjusted P-values and significant signs; (c), (e) and (g), showing the clustering results, in which labels are the same with those in (b), (d) and (f), respectively. The black colored spots pointed by the arrows indicate the spots newly predicted as atypical region, which might be a normal pancreas involved in cancer. (h) Profiles of sign scores in the PDAC tissue, predicting cancer, inflammation and pancreas spotsNext, to investigate complex cell state composition of the PDAC tissue, we used ASURAT to create SSMs from DO, CO, MSigDB and CellMarker databases for cell type and disease, GO database for biological process and KEGG for signaling pathway. Then, we performed unsupervised clustering of the integrated data based on the SSMs for cell type and disease, and biological process (Supplementary Note S9). Remarkably, ASURAT could remove the aforementioned batch effects (Fig. 6d and f) and identify cancer and non-cancer populations (Supplementary Figs S16 and S17). Moreover, we noticed that the spots grouped in the same cluster with PDAC cells are dispersed within the normal pancreas region, which might be a normal pancreas involved in cancer (Fig. 6e and g).To further investigate the cell states in these spots, we profiled all sign scores for 4282 signs across the tissue (Supplementary Fig. S18). The sign scores for PDAC were increased in the ST spots approximately matching the reported PDAC region (Moncada ), while those for micro RNAs in cancer were increased both in the previously annotated PDAC spots and the newly predicted atypical spots (Fig. 6h). These newly predicted spots were also annotated by a sign for Th17 cell differentiation, suggesting tumor-associated inflammation or antitumor immunity through intercellular communications between Th17 and cancer cells (Muller-Hubenthal ), which remains to be elucidated in PDAC (Liu ). In more than 90% of PDAC cases, KRAS is mutated at the G domain of the 12th residue (Ischenko ; Luchini ). Hence, we speculated that it is possible to validate our clustering results of cancer and non-cancer spots by comparing the frequencies of KRAS mutations using ST data. Unfortunately, we were unable to detect any read mapped to the specific reported region, possibly owing to the shallow read depth and inherent 3′ bias present in the data. Simultaneous genetic and transcriptional profiling may address this problem in the future (Lee ).
4 Discussion
We developed ASURAT, an original computational tool for simultaneous cell clustering and biological interpretation using database-derived functional terms. ASURAT performs correlation graph decompositions of functionally annotated gene sets to define multiple biological terms, termed signs. The notions of SCG and VCG are critical for capturing complex correlation structures compared with existing methods such as PAGODA2 (Fan ) and ssGSEA (Subramanian ) (Supplementary Note S10 and Fig. S19). ASURAT then transforms scRNA-seq data into SSMs, whose rows and columns represent signs and samples (cells), respectively. This SSM plays a key role in characterizing individual cells by various biological terms. Applying ASURAT to several single-cell and ST datasets for PBMCs, SCLC and PDAC, we robustly reproduced the previously reported blood cell types and identified putative subtypes of chemoresistant SCLC and distinct regions within the PDAC tissue.ASURAT uses database-derived biological terms for clustering single-cell transcriptomes, which inevitably introduces annotation bias (Supplementary Note S11); some biological terms are associated with many genes, whereas others are associated with only a few genes (Gaudet and Dessimoz, 2017). Moreover, in some cases, there might be no functional category for a cell type of interest. Nevertheless, users still have a means to manually characterize cells using combinations of significant signs in different functional categories as well as DEGs. Automatic curation will be a potential addition to ASURAT in the future.Conventionally, single-cell transcriptomes are analyzed and interpreted by means of unsupervised clustering followed by manual curation of marker genes selected from DEGs, which has been a common bottleneck of gene-based analyses (Andrews ; Aran ). The statistical significance of individual genes, typically defined by P-value or fold change, is dependent on clustering results. ASURAT can provide an alternative approach and demonstrates superior performance for identifying functional subtypes even within a fairly homogeneous population such as isolated cancer cells. In practice, complementing ASURAT with existing methods (Butler ; La Manno ) will provide a more comprehensive understanding of single-cell and STs, shedding light on putative transdifferentiation of neuroendocrine cancers (Balanis ; Kubota ), intercellular communication in tumor immune microenvironments (Maynard ) and virus infection on immune cell populations.In omics data analyses, knowledge databases are used to interpret computational results: pathway and motif enrichment analyses are often used for transcriptomic and epigenomic analyses (McLeay and Bailey, 2010; Reimand ). In contrast, we propose a unique computational workflow, in which such databases are used for simultaneous clustering and biological interpretation by defining signs. This framework is potentially applicable to any multivariate data with variables linked with annotation information. We can also find such datasets in studies of T cell receptor sequencing (De Simone ; Rempala ) along with a pan-immune repertoire (Zhang ). We anticipate that ASURAT will allow for the identification of various inter-sample differences among T cell receptor repertoires in terms of cellular subtype, antigen–antibody interaction, genetic and pathological backgrounds.Since ASURAT can create multivariate data (i.e. SSMs) from multiple signs, ranging from cell types to biological functions, it will be valuable to consider graphical models of signs, from which we may infer conditional independence structures. A non-Gaussian Markov random field theory is one of the most promising approaches to address this problem, although requires a large number of samples for achieving true graph edges (Morrison ). As the increase in size and diversity of the available data, biological interpretation will become increasingly important. Hence, future work should improve methods for prioritizing biological terms more efficiently than manual screening. We believe that ASURAT will greatly expand our understanding of various biological data and open new means of general functional annotation-driven data analysis.Click here for additional data file.
Authors: Reuben Moncada; Dalia Barkley; Florian Wagner; Marta Chiodin; Joseph C Devlin; Maayan Baron; Cristina H Hajdu; Diane M Simeone; Itai Yanai Journal: Nat Biotechnol Date: 2020-01-13 Impact factor: 54.908
Authors: Boris Müller-Hübenthal; Marc Azemar; Dirk Lorenzen; Michael Huber; Marina A Freudenberg; Chris Galanos; Clemens Unger; Bernd Hildenbrand Journal: Anticancer Res Date: 2009-11 Impact factor: 2.480
Authors: C Allison Stewart; Carl M Gay; Yuanxin Xi; Santhosh Sivajothi; V Sivakamasundari; Junya Fujimoto; Mohan Bolisetty; Patrice M Hartsfield; Veerakumar Balasubramaniyan; Milind D Chalishazar; Cesar Moran; Neda Kalhor; John Stewart; Hai Tran; Stephen G Swisher; Jack A Roth; Jianjun Zhang; John de Groot; Bonnie Glisson; Trudy G Oliver; John V Heymach; Ignacio Wistuba; Paul Robson; Jing Wang; Lauren Averett Byers Journal: Nat Cancer Date: 2020-02-17
Authors: Miguel Reyes; Michael R Filbin; Roby P Bhattacharyya; Kianna Billman; Thomas Eisenhaure; Deborah T Hung; Bruce D Levy; Rebecca M Baron; Paul C Blainey; Marcia B Goldberg; Nir Hacohen Journal: Nat Med Date: 2020-02-17 Impact factor: 53.440