Musu Yuan, Liang Chen, Minghua Deng.
Abstract
Single-cell multiomics sequencing techniques have developed rapidly in the past few years. Among these techniques, single-cell cellular indexing of transcriptomes and epitopes (CITE-seq) allows simultaneous quantification of gene expression and surface proteins. Clustering CITE-seq data has great potential to provide a more comprehensive and in-depth view of cell states and interactions. However, CITE-seq data inherit the properties of scRNA-seq data: they are noisy, high-dimensional, and highly sparse. Moreover, the representations of RNA and surface protein are sometimes weakly correlated and contribute unequally to the clustering objective. To overcome these obstacles and find a combined representation well suited for clustering, we proposed scCTClust, a clustering method for multiomics data, especially CITE-seq data. Two omics-specific neural networks are introduced to extract cluster information from the omics data. A deep canonical correlation method is adopted to find the maximally correlated representations of the two omics. A novel decentralized clustering method is applied to the linear combination of the latent representations of the two omics. The fusion weights, which account for the contribution of each omics layer to clustering, are adaptively updated during training. Extensive experiments on both simulated and real CITE-seq data sets demonstrated the power of scCTClust. We also applied scCTClust to transcriptome-epigenome data to illustrate its potential for generalization.
Keywords: artificial intelligence; bioinformatics; data integration; genetics; genomics; omics; statistical; transcriptomics
Year: 2022 PMID: 36072672 PMCID: PMC9441595 DOI: 10.3389/fgene.2022.977968
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.772
FIGURE 1 Structure of the scCTClust model: (1) preprocessed multiomics data are used as input to the omics-specific encoders separately; the outputs are latent features and estimated posterior parameters of the ZINB or NB model. (2) A fusion layer is introduced to linearly fuse the latent features of the different omics data. (3) A CCA loss is introduced to find the maximally correlated omics latent representations. (4) A Cauchy–Schwarz divergence-based clustering module is added after the fusion layer.
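The fusion, CCA, and Cauchy–Schwarz components named in this caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: all function names are hypothetical, the softmax parameterization of the fusion weights and the Gaussian kernel are assumptions made for illustration, and a real model would compute these quantities on encoder outputs inside a deep learning framework.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def fuse(z_rna, z_protein, weight_logits):
    """Linearly fuse two latent matrices with weights that sum to 1.
    Assumption: weights come from a softmax over two free parameters."""
    w = softmax(np.asarray(weight_logits, dtype=float))
    return w[0] * z_rna + w[1] * z_protein, w

def inv_sqrt(mat):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def total_canonical_correlation(x, y, reg=1e-4):
    """Sum of canonical correlations between x and y (rows are cells).
    A CCA loss would minimize the negative of this value."""
    n = x.shape[0]
    xc, yc = x - x.mean(axis=0), y - y.mean(axis=0)
    sxx = xc.T @ xc / (n - 1) + reg * np.eye(x.shape[1])
    syy = yc.T @ yc / (n - 1) + reg * np.eye(y.shape[1])
    sxy = xc.T @ yc / (n - 1)
    t = inv_sqrt(sxx) @ sxy @ inv_sqrt(syy)
    return np.linalg.svd(t, compute_uv=False).sum()

def gaussian_kernel(z, sigma=1.0):
    """Pairwise Gaussian kernel matrix between the rows of z."""
    sq = np.sum(z ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * z @ z.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def cs_divergence(a_i, a_j, kernel):
    """Cauchy-Schwarz divergence between two soft clusters, using
    kernel density estimates weighted by assignment vectors a_i, a_j."""
    cross = a_i @ kernel @ a_j
    norm = np.sqrt((a_i @ kernel @ a_i) * (a_j @ kernel @ a_j))
    return -np.log(cross / norm)
```

The divergence is zero when two clusters have identical assignment vectors and grows as the clusters separate in the fused latent space, which is why a clustering module would try to make it large across cluster pairs.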
Simulation data settings for all experiments; each cluster contains 500 cells; ‘*’ refers to the variable parameter.
| Cells per cluster | Clusters | Proteins | Prob_RNA | Prob_protein |
|---|---|---|---|---|
| 500 | * | 75 | 0.15 | 0.7 |
| 500 | 8 | * | 0.15 | 0.7 |
| 500 | 8 | 75 | * | 0.7 |
| 500 | 8 | 75 | 0.15 | * |
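The table describes a one-factor-at-a-time design: each row keeps the baseline values and varies only the parameter marked with `*`. A minimal sketch of how such settings could be enumerated follows. The baseline values are taken from the table; the function name, the dictionary keys, and especially the sweep values in the usage example are hypothetical placeholders, since the source does not list the values that were swept.

```python
# Baseline values read from the table above.
BASELINE = {
    "cells_per_cluster": 500,
    "n_clusters": 8,
    "n_proteins": 75,
    "prob_rna": 0.15,
    "prob_protein": 0.7,
}

def one_factor_settings(baseline, sweeps):
    """Build one settings dict per swept value, changing a single
    parameter at a time and keeping the rest at baseline."""
    settings = []
    for name, values in sweeps.items():
        if name not in baseline:
            raise KeyError(f"unknown parameter: {name}")
        for value in values:
            setting = dict(baseline)
            setting[name] = value
            setting["varied"] = name  # record which parameter is the '*'
            settings.append(setting)
    return settings

# Usage with placeholder sweep values (NOT the values used in the paper).
example = one_factor_settings(BASELINE, {"n_clusters": [4, 8, 12]})
total_cells = [s["cells_per_cluster"] * s["n_clusters"] for s in example]
```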
FIGURE 2 Simulation experiments. (A) Performance of scCTClust and competing methods, measured by ARI, on simulated CITE-seq data sets. (B) Two-dimensional UMAP visualization of the latent features extracted by scCTClust. Only the number of clusters varied in the corresponding experiments. (C) Behavior of the fusion weights during the simulation experiments.
FIGURE 3 CITE-seq experiments. (A) Performance of scCTClust and competing methods, measured by NMI and ARI, on the real CITE-seq data sets 10X10k, 10XInHouse, Lymph, and Spleen. (B) UMAP visualization of the RNA, protein, and fused features obtained by applying scCTClust to the 10XInHouse data set; the left three panels are colored by true cell type, and the right one is colored by the predictions. (C) UMAP visualization of latent features extracted by competing integrative methods, namely, scCTClust trained without CCA loss, TotalVI, Seurat, and CiteFuse, on the 10XInHouse data set.
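For readers unfamiliar with the two scores reported in these figures, the following NumPy sketch computes the adjusted Rand index (ARI) and normalized mutual information (NMI) from true and predicted labels. The function names are hypothetical, and the NMI here is normalized by the arithmetic mean of the two entropies, which is an assumption: the source does not state which normalization was used.

```python
import numpy as np

def contingency(labels_true, labels_pred):
    """Count matrix: rows are true classes, columns are predicted clusters."""
    _, ti = np.unique(labels_true, return_inverse=True)
    _, pi = np.unique(labels_pred, return_inverse=True)
    table = np.zeros((ti.max() + 1, pi.max() + 1), dtype=np.int64)
    np.add.at(table, (ti, pi), 1)
    return table

def comb2(x):
    """Number of unordered pairs, x choose 2, elementwise."""
    return x * (x - 1) / 2.0

def adjusted_rand_index(labels_true, labels_pred):
    table = contingency(labels_true, labels_pred)
    n = table.sum()
    if n < 2:
        return 1.0
    sum_cells = comb2(table).sum()
    sum_rows = comb2(table.sum(axis=1)).sum()
    sum_cols = comb2(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / comb2(n)  # chance-level agreement
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

def entropy(counts):
    """Shannon entropy (natural log) of a vector of counts."""
    p = counts[counts > 0] / counts.sum()
    return -(p * np.log(p)).sum()

def normalized_mutual_information(labels_true, labels_pred):
    table = contingency(labels_true, labels_pred).astype(float)
    pij = table / table.sum()
    pi = pij.sum(axis=1, keepdims=True)
    pj = pij.sum(axis=0, keepdims=True)
    nz = pij > 0
    mi = (pij[nz] * np.log(pij[nz] / (pi @ pj)[nz])).sum()
    # Assumption: arithmetic-mean normalization of the two entropies.
    h = 0.5 * (entropy(table.sum(axis=1)) + entropy(table.sum(axis=0)))
    return 1.0 if h == 0 else mi / h
```

Both scores are invariant to how cluster labels are numbered, so a prediction that recovers the true partition under different label names still scores 1.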
FIGURE 4 SNARE-seq experiments and ablation studies. (A) Performance of scCTClust and competing methods, measured by NMI and ARI, on the SNARE-seq data set CellLine and the SHARE-seq data set Ma. (B) Ablation study assessing robustness to hyperparameter choices and the advantages of C-S divergence-based clustering over metric-based clustering. (C) Performance of scCTClust on simulated data sets of different scales. The estimated number of clusters and the clustering results were both obtained with scCTClust.