Ioana Bica, Helena Andrés-Terré, Ana Cvejic, Pietro Liò.
Abstract
Using machine learning techniques to build representations from biomedical data can help us understand the latent biological mechanism of action and lead to important discoveries. Recent developments in single-cell RNA-sequencing protocols have allowed measuring gene expression for individual cells in a population, thus opening up the possibility of finding answers to biomedical questions about cell differentiation. In this paper, we explore unsupervised generative neural methods, based on the variational autoencoder, that can model cell differentiation by building meaningful representations from the high-dimensional and complex gene expression data. We use disentanglement methods based on information theory to improve the data representation and achieve better separation of the biological factors of variation in the gene expression data. In addition, we use a graph autoencoder consisting of graph convolutional layers to predict relationships between single cells. Based on these models, we develop a computational framework that consists of methods for identifying the cell types in the dataset, finding driver genes for the differentiation process and obtaining a better understanding of relationships between cells. We illustrate our methods on datasets from multiple species and also from different sequencing technologies.
Year: 2020 PMID: 32555334 PMCID: PMC7300092 DOI: 10.1038/s41598-020-66166-8
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Pipeline for identifying the cell types in a dataset using DiffVAE. Illustration on the zebrafish dataset. (a) Train DiffVAE to map the gene expression measurements for each cell to an m-dimensional latent representation z. (b) Apply t-SNE on the latent representation z and clustering to find the different cell clusters in the dataset. (c) Identify which latent dimensions in z encode the differentiation of the cells in each cluster. (d) Find the high-weight genes for the relevant latent dimensions. (e) Map the clusters to cell types based on the high-weight genes for each cluster.
Zebrafish.
| Cluster 1 (HSPCs) | Cluster 2 (Neutrophils) | Cluster 3 (Monocytes) | Cluster 4 (Erythrocytes) | Cluster 5 (Thrombocytes) |
|---|---|---|---|---|
High-weight genes are computed from the high-weight connections to the latent dimensions with the highest percentage for differentiating the corresponding cell type. Using references from the scientific literature, each cluster found using DiffVAE is mapped to a cell type.
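As a rough illustration of the gene-ranking step described in this caption, the sketch below picks the genes with the largest weights into one latent dimension. It assumes a single gene-by-latent weight matrix and ranks by absolute weight; both are simplifying assumptions, and the function name, the weight values and the gene labels are hypothetical rather than taken from DiffVAE.

```python
def top_weight_genes(weights, gene_names, latent_dim, k=3):
    """Return the k genes with the largest absolute weight into one latent dimension.

    weights[g][d] is a hypothetical weight connecting gene g to latent dimension d.
    Ranking by absolute value is an assumption made for this sketch.
    """
    scored = [(abs(row[latent_dim]), name) for row, name in zip(weights, gene_names)]
    # Sort by weight magnitude, largest first; ties keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


# Toy example: 4 placeholder genes, 2 latent dimensions.
weights = [
    [0.1, -0.9],
    [0.8, 0.2],
    [-0.5, 0.05],
    [0.3, 0.7],
]
gene_names = ["gene_a", "gene_b", "gene_c", "gene_d"]
print(top_weight_genes(weights, gene_names, latent_dim=0, k=2))
```

In practice the resulting gene lists would then be compared against marker genes from the literature to assign a cell type to each cluster.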
Figure 2. Methodology proposed for changing the cellular states: HSPCs can be converted into Monocytes by shifting the latent dimensions that differentiate Monocytes by a factor λ multiplied by their standard deviation. Increasing the shifting parameter λ results in more of the HSPCs being subsequently classified as Monocytes.
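The perturbation in this caption can be sketched as adding λ times a per-dimension standard deviation to selected latent coordinates. The version below is a minimal sketch: it assumes the standard deviation is estimated over a reference set of cells and that the shift is always in the positive direction, neither of which is specified in the caption. The function and argument names are hypothetical.

```python
from statistics import pstdev


def shift_latent(cells, reference, dims, lam):
    """Shift selected latent dimensions of each cell by lam * standard deviation.

    cells:     latent vectors to perturb (e.g. those of HSPCs), as lists of floats.
    reference: latent vectors used to estimate the per-dimension standard
               deviation (assumption: any representative set of cells).
    dims:      indices of the latent dimensions assumed to differentiate the
               target cell type.
    lam:       shifting parameter (lambda in the caption).
    """
    stds = {d: pstdev([z[d] for z in reference]) for d in dims}
    shifted = []
    for z in cells:
        z_new = list(z)  # copy so the input vectors are left untouched
        for d in dims:
            z_new[d] += lam * stds[d]
        shifted.append(z_new)
    return shifted


# Toy example: dimension 0 has standard deviation 1 over the reference cells.
reference = [[0.0, 5.0], [2.0, 5.0]]
hspc_latents = [[0.0, 5.0]]
print(shift_latent(hspc_latents, reference, dims=[0], lam=2.0))
```

The shifted vectors would then be passed to whatever classifier or clustering assigns cell types, to count how many cells change label as λ grows.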
Figure 3. Results obtained after performing cell perturbations. We show in colour the cells of interest for each subfigure and in grey the rest of the cells. Each subfigure indicates how many of the HSPCs were converted into each type of mature blood cell after performing perturbations to the latent representations of DiffVAE. Notice that increasing the shifting parameter λ in the perturbations results in more cells being changed.
Zebrafish.
| Clustering method | Dim size (m) | Latent representation | t-SNE embedding of latent representation | ||||||
|---|---|---|---|---|---|---|---|---|---|
| DiffVAE | VAE | AE | PCA | DiffVAE | VAE | AE | PCA | ||
| k-means | 20 | 0.771 | 0.799 | 0.633 | 0.738 | 0.699 | 0.717 | ||
| 50 | 0.775 | 0.811 | 0.629 | 0.801 | 0.759 | 0.709 | |||
| 100 | 0.831 | 0.815 | 0.627 | 0.796 | 0.806 | 0.680 | |||
| DBSCAN | 20 | 0.007 | 0.004 | 0.001 | 0.0002 | 0.717 | 0.556 | 0.506 | |
| 50 | 0.474 | 0.243 | 0.223 | 0.0009 | 0.667 | 0.573 | 0.590 | ||
| 100 | 0.154 | 0.018 | 0.011 | 0.002 | 0.799 | 0.749 | 0.570 | ||
Mean ARI obtained for clustering the latent representation and the t-SNE embedding of the latent representation for three settings of the reduced dimension size m. The clustering algorithms used are k-means and Gaussian Mixture Models.
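For readers unfamiliar with the metric, ARI (adjusted Rand index) compares a cluster assignment with reference labels and corrects for chance agreement: 1 means identical partitions, and values near 0 mean agreement no better than random. The sketch below is a plain-Python implementation of the standard pair-counting formula, not the evaluation code used for the table; the function name is chosen for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # convention for degenerate input
    # Pairs of items placed together in both partitions.
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs placed together within each partition separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # both partitions are trivial (all singletons or one cluster)
    return (sum_ij - expected) / (max_index - expected)


# Label names do not matter, only the grouping.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

A mean ARI, as reported in the table, would average this score over repeated runs.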
Figure 4. Methodology proposed for analyzing links between cells. (a) Graph-DiffVAE uses an initial adjacency matrix and individual node features to predict more links between cells. (b) Projection of cells onto 2-dimensional t-SNE embedding of the latent node features learnt by Graph-DiffVAE and illustration of initial and predicted links between HSPCs and differentiated cells. (c) Adjacency matrix with input links between cells (the colour white indicates an input edge); each cell is connected to the highest positively correlated cell. (d) Adjacency matrix with predicted links between all cells by Graph-DiffVAE (the colour white indicates a predicted edge). (e) Co-expression matrix between all cells; each entry represents the absolute value of the Pearson correlation coefficient.
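Panels (c) and (e) can be illustrated with a small sketch that builds an input adjacency matrix by linking each cell to its most positively correlated neighbour, alongside a co-expression matrix of absolute Pearson correlations. This is a sketch under assumptions: edges are stored symmetrically, cells with no positively correlated partner get no edge, and the helper names are hypothetical. The toy expression values are made up for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def initial_adjacency(expr):
    """Build the input adjacency and co-expression matrices from cell profiles.

    expr[i] is the gene expression profile of cell i.
    """
    n = len(expr)
    corr = [[pearson(expr[i], expr[j]) for j in range(n)] for i in range(n)]
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        if not others:
            continue
        best = max(others, key=lambda j: corr[i][j])
        if corr[i][best] > 0:  # only positive correlations create an edge
            adj[i][best] = 1
            adj[best][i] = 1  # assumption: the graph is undirected
    coexpr = [[abs(c) for c in row] for row in corr]
    return adj, coexpr


# Toy example: cells 0 and 1 are positively correlated, cell 2 is anti-correlated.
expr = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.5], [3.0, 2.0, 1.0]]
adj, coexpr = initial_adjacency(expr)
print(adj)
```

A graph autoencoder such as Graph-DiffVAE would take a matrix like `adj` together with per-cell features as input and predict additional edges.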