Thomas A Geddes, Taiyun Kim, Lihao Nan, James G Burchfield, Jean Y H Yang, Dacheng Tao, Pengyi Yang.
Abstract
BACKGROUND: Single-cell RNA-sequencing (scRNA-seq) is a transformative technology, allowing global transcriptomes of individual cells to be profiled with high accuracy. An essential task in scRNA-seq data analysis is the identification of cell types from complex samples or tissues profiled in an experiment. To this end, clustering has become a key computational technique for grouping cells based on their transcriptome profiles, enabling subsequent cell type identification from each cluster of cells. Due to the high feature-dimensionality of the transcriptome (i.e. the large number of measured genes in each cell) and because only a small fraction of genes are cell type-specific and therefore informative for generating cell type-specific clusters, clustering directly on the original feature/gene dimension may lead to uninformative clusters and hinder correct cell type identification.
Keywords: Autoencoder; Cell type identification; Cluster ensemble; Single cells; Single-cell transcriptome; scRNA-seq
Year: 2019 PMID: 31870278 PMCID: PMC6929272 DOI: 10.1186/s12859-019-3179-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Summary of the experimental scRNA-seq datasets used for hyperparameter optimisation and method evaluation
| Repository | Source | # cells | # classes | Ref. | Protocol | Purpose |
|---|---|---|---|---|---|---|
| GSE60361 | Mouse cortex | 3005 | 7 | [ | SMARTer | Optimisation |
| GSE45719 | Mouse embryogenesis | 300 | 8 | [ | Smart-seq2 | Optimisation |
| GSE67835 | Adult and fetal human brain | 466 | 8 | [ | SMARTer | Optimisation |
| E_MTAB_3929 | Human embryogenesis | 1529 | 5 | [ | Smart-seq2 | Optimisation |
| GSE84371 | Mouse neurons | 1402 | 8 | [ | Smart-seq2 | Evaluation |
| GSE82187 | Mouse striatum | 705 | 10 | [ | SMARTer & Smart-seq2 | Evaluation |
| Broad portal | Human archived brain | 14963 | 19 | [ | Drop-seq | Evaluation |
| Broad portal | Mouse archived brain | 13313 | 26 | [ | Drop-seq | Evaluation |
Fig. 1 Hyperparameter optimisation for autoencoders using Pareto analysis. Left panel: PCA visualisation of the four evaluation metrics (i.e. ARI, NMI, FM and Jaccard) on each of the four optimisation datasets. Each point corresponds to a single combination of hyperparameter values including random projection size, encoded feature space size, and autoencoder learning rate during backpropagation; each combination/point is colour-coded by the number of times it was assigned Pareto rank 1 (i.e. the combination that gives best clustering performance) across all possible combinations of the four optimisation datasets. Right panel: Autoencoder architecture as determined by the hyperparameter optimisation procedure
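The notion of Pareto rank 1 used in Fig. 1 can be illustrated with a minimal sketch. It assumes that a hyperparameter combination is rank 1 when no other combination scores at least as well on all four metrics and strictly better on at least one. The combination names and scores below are hypothetical, and the sketch covers a single dataset only; how counts are aggregated across dataset combinations is not reproduced here.

```python
from typing import Dict, List, Tuple

# One score tuple per hyperparameter combination: (ARI, NMI, FM, Jaccard).
Scores = Tuple[float, float, float, float]


def dominates(a: Scores, b: Scores) -> bool:
    """True if `a` is at least as good as `b` on every metric and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_rank_one(scores: Dict[str, Scores]) -> List[str]:
    """Return the combinations that are not dominated by any other combination."""
    return [
        name
        for name, s in scores.items()
        if not any(
            dominates(other, s)
            for other_name, other in scores.items()
            if other_name != name
        )
    ]


# Hypothetical scores for three hyperparameter combinations on one dataset.
example = {
    "combo_A": (0.40, 0.55, 0.50, 0.33),
    "combo_B": (0.38, 0.57, 0.49, 0.32),
    "combo_C": (0.35, 0.50, 0.45, 0.30),
}
front = pareto_rank_one(example)
print(front)
```

Here `combo_A` and `combo_B` trade off ARI against NMI, so neither dominates the other, while `combo_C` is dominated by `combo_A` on every metric.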
Fig. 2 Ensemble of k-means clustering results on the four scRNA-seq datasets. Red boxes represent the ensemble of k-means clustering on the raw input expression matrix without using the autoencoder framework. Light blue boxes represent the autoencoder-based k-means cluster ensemble
Fig. 3 Evaluation of the autoencoder-based SIMLR ensemble. Ensemble sizes ranging from 1 to 100 were tested using four evaluation metrics in two scRNA-seq datasets
Comparison of direct application of k-means and SIMLR clustering on raw gene expression data with autoencoder-based k-means and SIMLR ensemble
| Method | Dataset | Raw ARI | Raw NMI | Raw FM | Raw Jaccard | Autoencoder ARI | Autoencoder NMI | Autoencoder FM | Autoencoder Jaccard |
|---|---|---|---|---|---|---|---|---|---|
| k-means | Mouse neurons | 0.22 ±0.02 | 0.36 ±0.03 | 0.39 ±0.02 | 0.22 ±0.01 | 0.38 ±0.02 | 0.56 ±0.01 | 0.53 ±0.02 | 0.34 ±0.02 |
| k-means | Mouse striatum | 0.36 ±0.07 | 0.69 ±0.06 | 0.51 ±0.06 | 0.31 ±0.05 | 0.45 ±0.01 | 0.75 ±0.01 | 0.58 ±0.01 | 0.37 ±0.01 |
| k-means | Human archived brain | 0.29 ±0.01 | 0.49 ±0.01 | 0.35 ±0.01 | 0.21 ±0.01 | 0.37 ±0.01 | 0.56 ±0 | 0.43 ±0.01 | 0.27 ±0.01 |
| k-means | Mouse archived brain | 0.32 ±0.01 | 0.49 ±0 | 0.35 ±0.01 | 0.21 ±0.01 | 0.43 ±0.01 | 0.58 ±0 | 0.46 ±0.01 | 0.3 ±0.01 |
| SIMLR | Mouse neurons | 0.44 ±0 | 0.65 ±0 | 0.58 ±0 | 0.39 ±0 | 0.71 ±0 | 0.7 ±0 | 0.81 ±0 | 0.67 ±0 |
| SIMLR | Mouse striatum | 0.55 ±0 | 0.81 ±0 | 0.67 ±0 | 0.34 ±0 | 0.8 ±0.02 | 0.87 ±0.01 | 0.85 ±0.02 | 0.74 ±0.03 |
Cell type identification accuracy was quantified by the four evaluation metrics
Fig. 4 Comparison of the autoencoder-based clustering framework with PCA-based dimension reduction and clustering using the four evaluation metrics. Statistical significance (p<0.001; denoted by ⋆) was assessed using a two-sided Wilcoxon rank-sum test, comparing autoencoder with k-means clustering against the rest for the human and mouse archived brain datasets, and autoencoder with SIMLR clustering against the rest for the mouse neurons and striatum datasets
Fig. 5 A schematic illustration of the proposed autoencoder-based cluster ensemble framework. The first step is the sampling of multiple random projections from the original input scRNA-seq dataset. A separate autoencoder artificial neural network is trained on each of these random projections and used to encode the data to a smaller-dimensional space. Subsequently, clustering of each encoded dataset is conducted using an arbitrary clustering method; the final clustering output is produced by integrating individual clustering results using a fixed-point algorithm [31]
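The final integration step in Fig. 5 can be illustrated with a small sketch. Note that it does not implement the fixed-point algorithm [31]; it swaps in a simpler co-association consensus as a stand-in. The assumption is that each base clustering (one per random projection and autoencoder) is given as a list of labels, and that cells which share a cluster in more than half of the base clusterings are merged. The function names and the 0.5 threshold are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def co_association(partitions: List[List[int]]) -> Dict[Tuple[int, int], float]:
    """Fraction of base clusterings in which each pair of cells shares a cluster."""
    n_cells = len(partitions[0])
    return {
        (i, j): sum(p[i] == p[j] for p in partitions) / len(partitions)
        for i, j in combinations(range(n_cells), 2)
    }


def consensus_labels(partitions: List[List[int]], threshold: float = 0.5) -> List[int]:
    """Merge cells co-clustered in more than `threshold` of the base clusterings.

    Stand-in for the fixed-point ensemble step: connected components of the
    thresholded co-association graph, found with a simple union-find.
    """
    n_cells = len(partitions[0])
    parent = list(range(n_cells))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), fraction in co_association(partitions).items():
        if fraction > threshold:
            parent[find(i)] = find(j)

    # Relabel component roots as 0, 1, 2, ... in order of first appearance.
    roots: Dict[int, int] = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n_cells)]


# Three hypothetical base clusterings of five cells (label names are arbitrary).
base_clusterings = [
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
]
consensus = consensus_labels(base_clusterings)
print(consensus)
```

The second base clustering misplaces the third cell, but the other two outvote it, so the consensus recovers the two groups on which the majority agrees.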
Fig. 6 A schematic showing the quantification of concordance of the clustering output with the original 'gold standard' annotation using a panel of evaluation metrics
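The four metrics named in this record (ARI, NMI, FM and Jaccard) can be computed from two label vectors as sketched below. This is a minimal standard-library illustration, not the authors' evaluation code. NMI has several common normalisations; the geometric-mean form used here is an assumption. Degenerate inputs (for example, a single cluster, or every cell in its own cluster) are not handled and would raise a division-by-zero error.

```python
from collections import Counter
from math import comb, log, sqrt
from typing import Sequence, Tuple


def pair_counts(truth: Sequence, pred: Sequence) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, total) over all pairs of cells.

    tp: pairs together in both labelings; fp: together only in `pred`;
    fn: together only in `truth`; total: number of cell pairs.
    """
    both = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    in_truth = sum(comb(c, 2) for c in Counter(truth).values())
    in_pred = sum(comb(c, 2) for c in Counter(pred).values())
    return both, in_pred - both, in_truth - both, comb(len(truth), 2)


def jaccard(truth: Sequence, pred: Sequence) -> float:
    tp, fp, fn, _ = pair_counts(truth, pred)
    return tp / (tp + fp + fn)


def fowlkes_mallows(truth: Sequence, pred: Sequence) -> float:
    tp, fp, fn, _ = pair_counts(truth, pred)
    return tp / sqrt((tp + fp) * (tp + fn))


def adjusted_rand_index(truth: Sequence, pred: Sequence) -> float:
    tp, fp, fn, total = pair_counts(truth, pred)
    expected = (tp + fp) * (tp + fn) / total  # chance-level agreement
    maximum = ((tp + fp) + (tp + fn)) / 2
    return (tp - expected) / (maximum - expected)


def normalised_mutual_information(truth: Sequence, pred: Sequence) -> float:
    """NMI with geometric-mean normalisation (an assumed choice)."""
    n = len(truth)
    n_truth, n_pred = Counter(truth), Counter(pred)
    joint = Counter(zip(truth, pred))
    mutual_info = sum(
        c / n * log(c * n / (n_truth[t] * n_pred[p])) for (t, p), c in joint.items()
    )
    h_truth = -sum(c / n * log(c / n) for c in n_truth.values())
    h_pred = -sum(c / n * log(c / n) for c in n_pred.values())
    return mutual_info / sqrt(h_truth * h_pred)


# Toy example: cluster names differ but the grouping is identical.
gold = ["neuron", "neuron", "glia", "glia"]
clusters = [1, 1, 0, 0]
print(adjusted_rand_index(gold, clusters), jaccard(gold, clusters))
```

All four metrics depend only on how cells are grouped, not on the cluster names, which is why an arbitrary relabelling of a perfect clustering still scores 1.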