Mingbo Cheng1, Zhijian Li1, Ivan G Costa1. 1. Institute for Computational Genomics, Joint Research Center for Computational Biomedicine, RWTH Aachen University Medical School, 52074 Aachen, Germany.
Abstract
MOTIVATION: The advent of multi-modal single-cell sequencing techniques have shed new light on molecular mechanisms by simultaneously inspecting transcriptomes, epigenomes and proteomes of the same cell. However, to date, the existing computational approaches for integration of multimodal single-cell data are either computationally expensive, require the delineation of parameters or can only be applied to particular modalities. RESULTS: Here we present a single-cell multi-modal integration method, named Multi-mOdal Joint IntegraTion of cOmpOnents (MOJITOO). MOJITOO uses canonical correlation analysis for a fast and parameter free detection of a shared representation of cells from multimodal single-cell data. Moreover, estimated canonical components can be used for interpretation, i.e. association of modality-specific molecular features with the latent space. We evaluate MOJITOO using bi- and tri-modal single-cell datasets and show that MOJITOO outperforms existing methods regarding computational requirements, preservation of original latent spaces and clustering. AVAILABILITY AND IMPLEMENTATION: The software, code and data for benchmarking are available at https://github.com/CostaLab/MOJITOO and https://doi.org/10.5281/zenodo.6348128. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
MOTIVATION: The advent of multi-modal single-cell sequencing techniques have shed new light on molecular mechanisms by simultaneously inspecting transcriptomes, epigenomes and proteomes of the same cell. However, to date, the existing computational approaches for integration of multimodal single-cell data are either computationally expensive, require the delineation of parameters or can only be applied to particular modalities. RESULTS: Here we present a single-cell multi-modal integration method, named Multi-mOdal Joint IntegraTion of cOmpOnents (MOJITOO). MOJITOO uses canonical correlation analysis for a fast and parameter free detection of a shared representation of cells from multimodal single-cell data. Moreover, estimated canonical components can be used for interpretation, i.e. association of modality-specific molecular features with the latent space. We evaluate MOJITOO using bi- and tri-modal single-cell datasets and show that MOJITOO outperforms existing methods regarding computational requirements, preservation of original latent spaces and clustering. AVAILABILITY AND IMPLEMENTATION: The software, code and data for benchmarking are available at https://github.com/CostaLab/MOJITOO and https://doi.org/10.5281/zenodo.6348128. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
The technological advances of high-throughput single-cell sequencing enable us to characterize cellular heterogeneity of complex tissues for distinct molecular players of cells such as transcripts, proteins and chromatin (Efremova and Teichmann, 2020). The advent of multimodal technologies allow us to simultaneously measure two or more modalities at the same cells, i.e. RNA and open chromatin (HuCao ; Ma ); RNA and protein (Stoeckius ); and RNA, open chromatin and protein (Mimitou ; Swanson ). These methods allow us to access how genetic information is associated at distinct molecular levels, i.e. the effect of DNA accessibility changes on gene expression or the expression of genes to proteins. However, data produced by each modality has quite distinct characteristics regarding their numerical values (e.g. low counts for open chromatin and variable count values for RNA and proteins levels), dimensionality (dozens for proteins, tens of thousands for genes, hundreds of thousands for open chromatin) and levels of data sparsity (Argelaguet ; Li ). These make the integrative analysis of multi-modal data a challenging task.Here, we are interested in the problem of estimating a shared latent space from parallel multiomic approaches, where two or more modalities are measured in the same cells. A few methods have been proposed for this problem. These follow two main frameworks: metric learning and latent variable learning. Weighted nearest neighbors (WNN) (Hao )) and Schema (Singh ) explore, respectively, nearest neighbors and quadratic programming to estimate a single distance matrix representing the integrated multimodal data. Both approaches explore efficient algorithms, but do not explicitly provide models associating molecular features to the ‘latent space’. MOFA (Argelaguet ), scAI (Jin ), totalVI (Gayoso ) and LIGER (Kriebel and Welch, 2022) explore distinct methods for matrix factorization and estimation of shared latent spaces between modalities. Moreover, estimated matrices can be used for model interpretation, i.e. decomposed matrices can be used to associate molecular features with the latent space. Overall, these methods have a large number of free parameters including the size of the latent space (or rank of the low dimensional matrices). These methods require the optimization of the size of the latent space, which in turn increases computational costs. Of note, the main application of LIGER is the integration of experiments within a modality (Welch ). The data integration evaluated here refers to a latter version of LIGER (Kriebel and Welch, 2022), which can only be used for two modalities and a subset of the molecular features need to be common in both modalities. Also, the implementation of some methods (totalVI (Gayoso ) and scAI (Jin )) only allow integration of particular modalities (i.e. scRNA-seq and protein abundance for totalVI; scRNA-seq and scATAC-seq for scAI).Two recent works have explored canonical correlation analysis (CCA) for problems close to the one addressed here. Symphony (Kang ), which has as main objective to build reference single-cell atlas, used CCA to obtain a joint space in a case study with CITE-seq data. However, this approach does not support more than two modalities and was not evaluated concerning its interpretability or computational requirements. DIABLO (Singh ) extends generalized canonical analysis for integration of bulk multi-modal datasets with respect to different phenotypes. It requires class labels for the samples, which are used for feature selection. However, such class labels are usually not provided in single-cell multi-modal datasets, as explored here.
2 Approach
Here, we propose Multi-mOdal Joint IntegraTion of cOmpOnents (MOJITOO), an efficient method that is based on CCA to learn a shared latent space for any single-cell multimodal data protocol. The canonical components can be interpreted as factors and be used to characterize feature relevance by relating features across modalities (Fig. 1). In contrast to matrix factorization methods, MOJITOO does not require the definition of parameters such as the rank of the matrix. Furthermore, it provides an approach to estimate the size of the latent space after a single execution of CCA. MOJITOO is provided as an R package and is compatible with common single-cell pipelines for RNA, proteins (Seurat; Hao ) and ATAC modalities (Signac; Stuart ). MOJITOO builds upon a simple CCA procedure presented in a case study in Symphony (Kang ) and differs from DIABLO (Singh ), as it does not require labels for data integration.
Fig. 1.
Schematic MOJITOO. (A) MOJITOO receives as input two (or more) dimensional reduced matrices, where each matrix represents a particular molecular modality describing the same cells. In this example, we assume RNA and open chromatin (as measured by ATAC-seq) modalities are given. The main idea of MOJITOO is to use Canonical Correlation Analysis to find a set of canonical vectors and . Exploring a geometrical interpretation of CCA, MOJITOO finds canonical vectors such that the cosine similarity between latent dimensions in and is maximized. A final representation Z can be obtained by adding the modality-specific latent spaces. In the example, we show vectorial representations of two cells (a and b) in both original and latent spaces. (B) An association between original features for each modality ( and ) can be obtained by multiplying original data representation per modality ( and ) with the shared latent space Z
Schematic MOJITOO. (A) MOJITOO receives as input two (or more) dimensional reduced matrices, where each matrix represents a particular molecular modality describing the same cells. In this example, we assume RNA and open chromatin (as measured by ATAC-seq) modalities are given. The main idea of MOJITOO is to use Canonical Correlation Analysis to find a set of canonical vectors and . Exploring a geometrical interpretation of CCA, MOJITOO finds canonical vectors such that the cosine similarity between latent dimensions in and is maximized. A final representation Z can be obtained by adding the modality-specific latent spaces. In the example, we show vectorial representations of two cells (a and b) in both original and latent spaces. (B) An association between original features for each modality ( and ) can be obtained by multiplying original data representation per modality ( and ) with the shared latent space ZWe evaluate MOJITOO and competing methods (WNN, MOFA, scAI, LIGER, Schema, DIABLO, Symphony) in two bi-modal datasets with RNA and protein measurements (Buus ; Stuart ), two bi-modal datasets with RNA and ATAC-seq measurements (Ma ) and two tri-modal datasets with RNA, proteins and ATAC-seq measurements (Mimitou ; Swanson ) in regards to their ability to recover a shared latent space. The latent spaces are then evaluated with measures regarding the accuracy of clustering (adjusted Rand index), distance (silhouette score) and structure preservation, i.e. relation between shared space and original space of individual modalities (Jain). Altogether, results show a superior performance of MOJITOO in both computational requirements and accuracy of estimated latent spaces. Moreover, we illustrate how estimated canonical components can be used to interpret the underlying single-cell data.
3 Materials and methods
3.1 MOJITOO
MOJITOO takes as input a set of matrices from m modalities:
where represents the data of a particular single-cell modality, n represents the number of cells and represents the number of features in modality i. Here, we focus on multimodal data, where the cells are the same across matrices and there is no direct relation between the features of the distinct modalities.
Reducing the dimension for each modality
We first obtain a dimension reduced matrix for each modality independently using a modality-specific approach:
where represents the low-dimensional matrix for modality i, represents the number of dimensions and represents the specific dimension reduction method for this modality. MOJITOO uses latent semantic indexing (LSI) for scATAC-seq and principal component analysis (PCA) for other modalities, as is usual in the literature (Granja ; Hao ; Stuart ). The rational behind the use of dimension reduction is two-fold. First, low-dimensional matrices reduce the computing time of the CCA analysis without impacting accuracy for a minimum number of dimensions. Moreover, it allows to work directly on batch-corrected data, which is usually represented in a low-dimensional space (Hao ; Korsunsky ). Of note, batch correction is recommended previous to MOJITOO, whenever the data are affected by batch effects.
Learning a shared space with canonical correlation analysis with two modalities
MOJITOO aims to learn a shared latent space Z from the set of low dimensional matrices where represents the cells, n is the number of cells and k is the dimension of this latent space. When has two modalities, we first use CCA (This notation is based on a geometrical interpretation of CCA.) to project the matrices and to vectors and :
where and represent canonical components (CC). The vectors and can be obtained by solving the following optimization problem:
where represent the first canonical weight vectors, and is the cosine similarity between two vectors a and b defined by:This is repeated times, such that new canonical vectors are orthogonal to previously estimated vectors. These provide the matrices:These can be used to estimate the modality transformed space asA unique latent space is obtained as
where and k is the number of canonical variables retained.To further remove the noise from the latent space Z, we only keep highly correlated canonical components and by measuring the Pearson correlation and using a Student’s t-test for significance. The P-values are then corrected using Benjamini Hochberg (BH) (Benjamini and Hochberg, 1995) and only canonical components with adjusted P-values <0.05 are retained.MOJITOO uses an algorithm based on generalized eigenvector decomposition (Ramsay and Silverman, 1997) to estimate the canonical components. MOJITOO has a time complexity of for computing covariance matrices and for the eigenvector decomposition. As n (number of cells) is usually 100 times larger than (number of reduced dimensions in ) the first term dominates the complexity.Of note, CCA is one of the several steps in the integration algorithm of an earlier version of Seurat (Butler ). This had the objective to integrate distinct scRNA-seq experiments and CCA was performed in the common gene space, i.e. on transposed matrices and the objective was to find matching cells.
Learning a shared space for multiple modalities
For the case that has more than two modalities, we perform the pairwise integration of modalities starting with the pair with highest dimensionality. The result of this CCA is then used for integration with the next modality. See Algorithm 1 for a brief description, which receives a set of matrices with increasing dimensions as input. This heuristic algorithm was adopted to avoid the high computational costs of multiple CCA, which grows exponentially with the number of modalities.Multimodal MOJITOO Algorithmprocedure MOJITOO()while
do▹ only consider significantly correlated dimensionend whilereturn Zend procedureend
Association of molecular features with latent space
We can use the estimated latent spaces to associate molecular features to the latent space Z. For example, let be the gene expression matrix and be the peak matrix, where n is the number of cells, s is the number of genes and t is the number of peaks. We can obtain a feature associating molecular features to the latent space by
where and . The ith column of matrix represent the scores of features in the ith canonical component.
3.2 Datasets
We make use of public multimodal datasets with two or tri-modalities in our evaluation. The first dataset is single-cell CITE-seq data which measures single-cell RNA and surface proteins simultaneously. The human bone marrow mononuclear cells (BM-CITE) dataset contains full transcriptomes and 25 surface proteins for over 30 672 cells annotated in 27 cell types (Stuart ). This data were obtained with the ‘LoadData(“bmcite”)’ command from package SeuratData. Next, we applied the pre-processing pipeline. Another CITE-seq data used were the human peripheral blood mononuclear cells from lung (LUNG-CITE) (Buus ) with 52 surface proteins. It contains 10 470 cells annotated in 22 cell types. This data were obtained from here.The next dataset contains human peripheral blood mononuclear cells (PBMC-multiome) generated by the 10× multiome technology to measure gene expression (scRNA-seq) and chromatin accessibility (scATAC-seq) on the same cells. This data contains 11 787 cells with 13 cell types annotated by 10× Genomics. We use the scRNA-seq and scATAC-seq count matrices as provided by 10× genomics after processing with the cellranger pipeline obtained from the here. We also use a dataset based on the SHARE-seq protocol measuring gene expression and chromatin accessibility of mouse skin cells (SKIN-SHARE) (Ma ). This data contains 34 774 cells, which are annotated as 23 cell types. We obtain the skin scRNA-seq and scATAC-seq counts and fragments files from the Gene Expression Omnibus under accession number (GSE140203).A tri-modal dataset of human PBMCs is measured with the DOGMA-seq protocol (Mimitou ). This provides RNA, ATAC and epitope sequencing of the same cells (PBMC-DOGMA). We use data under low-loss lysis condition, which contains 13 763 cells in 27 cell types. We download count matrices as provided by the authors here. A second tri-modal dataset is based on human PBMCs measured with the TEA-seq protocol (Swanson ). It contains transcripts, epitopes and chromatin accessibility of 25 517 PBMCs grouped into 12 cell types (PBMC-TEA). For this dataset, we obtain original matrices and combine data from distinct wells from GEO (GSE158013). For scATAC-seq, we obtain an integrated matrix by combing peaks (allowing an extension of 250 bps). We finally intersect all barcodes from scRNA-seq, protein and scATAC-seq to obtain matrices in the same cell space. Characteristics of each of the six datasets are described in Table 1.
Table 1.
Major characteristics of multiomics datasets
Dataset
Protocol
Species
Organ
Modalities
No. of cells
No. of cell types
No. of features (gene/peak/protein)
BM-CITE
CITE-seq
Human
Bone Marrow
RNA/protein
30 672
27
17 009/—/25
LUNG-CITE
CITE-seq
Human
PBMC&Lung
RNA/protein
10 470
22
33 514/—/52
PBMC-Multiome
Multiome
Human
PBMC
RNA/ATAC
11 787
13
36 610/108 377/—
Skin-SHARE
SHARE-seq
Mouse
Skin
RNA/ATAC
34 774
23
23 296/344 592/—
PBMC-TEA
TEA-seq
Human
PBMC
RNA/ATAC/epitope
25 517
12
36 601/128 853/47
PBMC-DOGMA
DOGMA-seq
Human
PBMC
RNA/ATAC/protein
13 763
27
36 495/68 963/210
Major characteristics of multiomics datasets
Processing of single-cell sequencing data
We perform a uniform pre-processing of all previously datasets starting from their count matrices. For scRNA-seq matrices, we adopt the standard Seurat 4 pipeline. First, we log normalize the data by calling the function NormalizeData with default parameters. Next we use FindVariableFeatures to find top 3000 variable features and run ScaleData. Finally, we use RunPCA to perform dimension reduction (Hao ) by keeping the first 50 PCs. For scATAC-seq, we adopt the standard pipeline from Signac (Stuart ). We first run TF-IDF (term frequency—inverse document frequency) on the peaks. Next, we use RunSVD on the top features calculated by function FindTopFeatures with parameter min.cutoff = ‘q0’, which provides an LSI dimension reduced matrix. We keep the first 50 dimensions, but we discard the first dimension as this is highly correlated to the number of fragments. For protein/epitopes, we adopt the standard Seurat 4 pipeline (Hao ). In short, we call NormalizeData with parameters normalization.method = ‘CLR’ and margin = 2 followed by ScaleData and RunPCA with 30 PCs using default parameters.PBMC-DOGMA is the only dataset evaluated here with two biological conditions (stimulated and unstimulated). To address the presence of batch, we apply the harmony integration (Korsunsky ) for RNA-seq and epitope data independently to integrate control and stimulated samples. For scATAC-seq, integration is performed by ignoring the first LSI dimension, which has a high correlation with the stimulation. MOJITOO results with and without batch correction are shown in Supplementary Figure S1. We provide these input matrices to MOJITOO, WNN, MOFA, DIABLO and Symphony. Other competing methods provide their own functionalities for normalization and dimension reduction, which are used accordingly (see below). Time and memory requirements of pre-processing data are considered for the benchmarking of the respective method.
3.3 Benchmarking of integration methods
We use three distinct metrics to measure the accuracy of the methods. The structure score measures the similarity between two latent space structures (Jain). It is based on the Pearson correlation of the pairwise Euclidean distance estimated on the shared (Z) and latent spaces () for each individual modality. This score indicates how well the shared space is related to the modality and the average values indicate how well integration worked. This metric is also employed by Schema (Singh ). We also evaluate the metrics concerning their distance representation using the silhouette score (Rousseeuw, 1987). For this, we use the labels as provided by the cluster of the respective dataset. We evaluate the use of Euclidean distance as ‘distance’ for the silhouette score. Finally, we evaluate the performance of methods regarding clustering. We perform Louvain clustering with varying resolution (parameter from 0.1 to 2.0) and estimate the adjusted Rand index (ARI) using cell labels (Hubert and Arabie, 1985).
3.4 Execution of competing methods
3.4.1 MOFA
MOFA+ (Argelaguet ) uses Bayesian group factor analysis and variational inference to decompose individual modalities simultaneously by estimating a common latent factor matrix Z, as well as the weights for the transformation of the modalities to the latent space. MOFA+ includes a procedure to determine the optimal number of factors (dimension of the latent space) and has several hyper parameters for model regularization, detection of number of factors and learning rates. We execute MOFA with default parameters and followed their recommendations tutorial for the analysis of all data. However, we provide now PCA/LSI reductions as input for MOFA, as this improved its computational time (Supplementary Table S1), as well the clustering performance on MOFA’s latent space (Supplementary Fig. S2).
3.4.2 Schema
Schema (Singh ) explores metrics learning to re-weigh modality features through maximizing the agreement with other modalities. Specifically, it utilizes quadratic programming (QP) to learn a scaling transformation u for the primary matrix X such that pairwise distances of the transformation (where is coordinate-wise multiplication, for each ) are highly correlated in other modalities. Schema has two main parameters: minimum desired correlation and number of random pairs. We run Schema using default parameters as in schema tutorial.
3.4.3 Seurat4 WNN
Weighted nearest neighbor (WNN) (Hao ) constructs single unified representation across multiple modalities. It first creates k-nearest neighbor (KNN) graphs for each modality based on the latent representation of each feature matrix. Next, it calculates affinities using the exponential kernel between a cell and the average NN for each modality. The latter is used to weigh cells. WNN has two major free parameters: the number of neighbors and scaling factor of the neighborhood kernel. We execute WNN, which is part of Seurat4, using default parameters. WNN does not provide a shared latent space, but we can use the weighted nearest neighbors graph to build a distance matrix that can be used in all benchmarking evaluations.
3.4.4 scAI
scAI simultaneously decomposes transcriptomic and epigenomic data into multiple biologically relevant factors (Clark ). Its framework is similar to MOFA, but it can only cope with two modalities at a time. scAI uses a stability method to define the rank (size of the latent space) and has three main free parameters used for model regularization. We execute scAI in only bi-modal with RNA and ATAC-seq datasets with default parameters.
3.4.5 LIGER
LIGER (Welch ), which is based on non-negative matrix factorization, was originally proposed for data integration whenever modalities are in the same feature space. A newer variant of LIGER (Kriebel and Welch, 2022) is able to perform integration, whenever there is some overlap between the features across modalities (shared features), i.e. protein and RNA expression of the same gene or gene accessibility scores for ATAC-seq. LIGER estimates a gene accessibility (ATAC-seq) matrix by counting the total number of ATAC-seq reads within the gene body and promoter regions(3 kb upstream) for each gene per cell. An additional unshared feature matrix is further produced by binning the genome into bins of 100 000 bps and counting the overlap of these bins with peaks from the respective dataset. LIGER has two major parameters: a regularization term and the number of factors (dimensions of the latent space). Regions associated to ENCODE Blacklist regions (Amemiya ) are removed. Moreover, LIGER can be only executed for bi-modal datasets.
3.4.6 Symphony integration
Symphony (Kang ) is a method to create single-cell reference atlas for subsequent annotation of new single-cell datasets. For a single case study with a multi-ome CITE-seq data, Symphony used canonical correlation analysis to find shared reference. This simple procedure differs from MOJITOO in several ways: it does not use dimension reduction as input; it is not able to cope with more than two modalities and it uses the latent space of only one modality (RNA) as shared space. Moreover, Symphony included the execution of batch correction with Harmony after the execution of CCA. The CCA analysis is not part of Symphony source code, but we implemented it based on a script provided by authors (https://github.com/immunogenomics/TB_Tcell_CITEseq/blob/main/R/cca_analysis.R). Also, due to high computational requirements, we were only able to run this on the dimension reduced space (PCA/LSI) described above. We call this method Symph-Int to reflect the fact that this is not Symphony, but an integration method used in one of the analysis of Symphony manuscript.
3.4.7 DIABLO
DIABLO (Singh ) is based on a generalized canonical analysis to integrate multiple datasets in a supervised way. It was originally designed for bulk multi-omics data, where sample labels are available and explored for feature selection. Since labels are not available for single-cell multi-modal data evaluated here, we provide a distinct label for each cell. DIABLO is executed as indicated in their tutorial. Due to a prohibitive computational costs if raw matrices are used, we provide dimension reduced matrices as input. Another major parameter is the number of CC components, which is usually set to the number of classes—1. Here, we used 30 components. Also, DIABLO does not provide a common latent space. We therefore combined the CCs to obtain a final latent space.
4 Results
4.1 Benchmarking of multimodal integration methods
We evaluate MOJITTO and competing methods using six publicly available multimodal datasets with two or three modalities. These datasets have between 10 000 and 35 000 cells, 12 and 27 cell types and 25 to 344 492 features per modality (Table 1). We compare MOJITOO with MOFA (Argelaguet ), WNN (Hao ), Schema (Singh ), scAI (Jin ), LIGER (Kriebel and Welch, 2022), DIABLO (Singh ) and Symph-Int (Kang ). Of note, some methods (scAI, LIGER and Symph-Int) failed to be executed in some conditions, due to their inability to cope with more than 2 modalities or the lack of raw sequences for some of the evaluated datasets.First, we evaluate algorithms regarding their structure preservation, i.e. the average similarity between the euclidean distances in the shared space and distances in the space of each modality (Jain). Results indicate highest structure scores for MOJITOO (4 out of 6) followed by DIABLO (1 out of 6) and MOFA (1 out of 6). A ranking of the structure scores indicates MOJITOO as the best algorithm followed by DIABLO, MOFA and Symph-Int (Fig. 2A). Interestingly, we observe that MOFA and Symph-Int tend to obtain higher structure scores for RNA and that MOJITOO and DIABLO have lower variance of structure scores across modalities. This suggests that the MOJITOO and DIABLO shared space captures information of all individual modalities more uniformly than competing methods. The lower performance of Symphony integration compared to other CCA methods (MOJITOO and DIABLO) is explained by the fact it only considers CCs from the RNA space as integrated embedding.
Fig. 2.
Benchmarking on data integration methods. (A) We show the average (trace) and modality-specific structure scores (dots) (y-axis) versus methods (x-axis) for the six datasets. The last graph shows the combined ranking of the method over all datasets, where the highest rank indicates the best performer. (B) Barplots showing silhouette score (y-axis) versus methods (x-axis) for six benchmark datasets. The last plot shows the combined ranked per method. (C) Boxplots showing ARI scores (y-axis) versus methods (x-axis) for distinct clustering solutions for all six data-sets. Asterisks indicate P-values of <0.05(*), <0.01(**), <0.001(***), <0.0001(****) obtained via t-test comparing the ARI values of MOJITOO versus other methods. The last boxplot shows the combined ranking for competing methods
Benchmarking on data integration methods. (A) We show the average (trace) and modality-specific structure scores (dots) (y-axis) versus methods (x-axis) for the six datasets. The last graph shows the combined ranking of the method over all datasets, where the highest rank indicates the best performer. (B) Barplots showing silhouette score (y-axis) versus methods (x-axis) for six benchmark datasets. The last plot shows the combined ranked per method. (C) Boxplots showing ARI scores (y-axis) versus methods (x-axis) for distinct clustering solutions for all six data-sets. Asterisks indicate P-values of <0.05(*), <0.01(**), <0.001(***), <0.0001(****) obtained via t-test comparing the ARI values of MOJITOO versus other methods. The last boxplot shows the combined ranking for competing methodsNext, we make use of the cell types reported in the original manuscripts introducing the single-cell datasets as true labels for benchmarking. First, we use these labels to evaluate the silhouette scores by contrasting class labels with Euclidean distance matrices estimates on the shared space. Regarding silhouette, MOFA is best in four out of six dataset, while MOJITTO is best in the other two datasets. MOJITOO obtains second rank in four out of six datasets and is ranked second in the overall ranking (Fig. 2B). Finally, we perform Louvain clustering at distinct resolutions (0.1–2.0) on the shared latent space. We then measure the agreement of clustering results with labels using the Adjusted Rand Index (ARI). WNN and Symphony have the highest ARI in two datasets each, while DIABLO and MOJITOO have highest values in one dataset each (Fig. 2C). MOJITOO has a higher rank than two in all dataset and has the highest overall rank followed by WNN and DIABLO. Moreover, we also perform a sensitivity analysis on MOJITOO to inspect if the dimension of the original PCA/LSI space has impact on its performance (Supplementary Fig. S3). We observe that if 50 or more components are used, MOJITOO obtain similar clustering and structure preservation scores. Examples of low dimensional embeddings obtained by distinct integration methods with the PBMC-Multiome dataset are provided in Supplementary Figure S4.A crucial aspect of single-cell analysis is the computational resources needed for computation on an increasing number of cells. For this, we inspect the time and memory used in the largest datasets in our benchmark (SKIN-SHARE). To obtain curves, we down-sample the number of cells from 30 000 to 3000 (Fig. 3A and B and Supplementary Table S1–S2). We observe that MOJITOO has the overall lowest computational requirement (2.4 min and 6.3 GBs) followed closely by MOFA (3.21 minuts and 13.09 GBs) and WNN (3.74 min and 6.8 GBs). DIABLO, on the other hand, required up to 103 min and 78.3 GBs for 30 000 cells, while scAI required 637 min and 75 GB of memory. Altogether, results indicate MOJITOO has the best recovery of data structure and clustering results, while being the fastest and having the lowest memory footprint among all competing methods.
Fig. 3.
Time and memory consumption on the Skin-SHARE. (A) Line plots showing elapsed time (log of seconds) for each method (y-axis). (B) Line plots showing peak memory (Gigabytes) required by each method (y-axis). In both A–B, the x-axis shows the number of cells used (randomly sampled) from the Skin-SHARE data
Time and memory consumption on the Skin-SHARE. (A) Line plots showing elapsed time (log of seconds) for each method (y-axis). (B) Line plots showing peak memory (Gigabytes) required by each method (y-axis). In both A–B, the x-axis shows the number of cells used (randomly sampled) from the Skin-SHARE data
4.2 Canonical vectors support the interpretation of multiome data
Additionally, we explore the use of the dimensions of the latent space (Z) as factors for interpreting the PBMC multiome data. We denote the latent features as canonical components (CC). As shown in Figure 4, positive or negative values for the top CCs discern well all major cell types (Fig. 4). High values of CC1 are associated to myeloid cells (CD14+ and CD16+ monocytes and dendritic cells), while negative values are associated to T and NK cells (Fig. 4A). CC2 values discern B cell and plasmacytoid dendritic cells (pDC) from other cells, while CC3 differentiates B cells from pDCs (Fig. 4B and C). Further CCs capture subtle changes between major cell sub-types (Fig. 4D and E). CC4 and CC5 capture changes between naive T cells and active T CD8 and active T CD4 cells respectively, while CC5 captures differences between naive monocytes (CD14+) and activated monocytes (CD16+). Other smaller cell types (dendritic cells, platelets, double negative T cells and pre-B and progenitor B cells) can be characterized with further CCs (Supplementary Fig. S5). We also evaluated CCs with low correlation, which were considered as noise by MOJITOO (Supplementary Fig. S6). We observe that these CC have high scores for a few cells and low concordance across modalities. This supports their low biological relevance.
Fig. 4.
(A–F) UMAP with the scores of CC1 to CC6. We highlight major cell types (or sub-types) associated to positive or negative CC scores and arrows indicate directions associated to the activation of particular immune cells
(A–F) UMAP with the scores of CC1 to CC6. We highlight major cell types (or sub-types) associated to positive or negative CC scores and arrows indicate directions associated to the activation of particular immune cellsNext, we explore the U matrices, which provide values associating molecular features with the latent dimensions (CCs). Indeed, the expression of genes with high CC1 values include monocyte genes as LYN and FCN1, while negative CC1 values are associated to T cell genes BCL11B and IL7R (Fig. 5A). Similarly, we observe that top ranked peaks with high or low CC1 scores have monocyte or T-cell-specific open chromatin. These include regions close to the T cell gene BCL11B (Fig. 5B). High CC2 value are associated with B cell genes IGHM and BCL11A, while low CC1 genes (BCBL11B and IL32) are associated with T cells (Fig. 5C). As before, we observe cell-specific open chromatin patterns on top ranked ATAC-seq peaks associated with high and low CC2 values. Altogether, these results indicates that MOJITOO CCs can be used to capture major cell types of peripheral blood cells as well as to detect modality-specific molecular features associated to these.
Fig. 5.
(A) Heatmap with scores for the top 10 positive and negative genes for CC1 (y-axis) versus cells (x-axis). Cells are ordered by CC1 scores (high to low). (B) Genome browser tracks with top 2 positive and negative peaks for CC1. Tracks correspond to normalized cell-specific pseudo bulk ATAC-seq profiles generated by deeptools (Ramírez ). Cell-specific tracks are ordered by CC1 score (high to low). (C and D) show respectively the heatmap of top genes and the genome browser of top peaks for CC2
(A) Heatmap with scores for the top 10 positive and negative genes for CC1 (y-axis) versus cells (x-axis). Cells are ordered by CC1 scores (high to low). (B) Genome browser tracks with top 2 positive and negative peaks for CC1. Tracks correspond to normalized cell-specific pseudo bulk ATAC-seq profiles generated by deeptools (Ramírez ). Cell-specific tracks are ordered by CC1 score (high to low). (C and D) show respectively the heatmap of top genes and the genome browser of top peaks for CC2
5 Conclusion
We present here MOJITOO, which is a fast and parameter free method based on canonical correlation analysis for integration of multimodal single-cell data of any protocol. A comprehensive analysis with six bi-modal and tri-modal multimodal datasets indicates that MOJITOO has the best performance regarding the preservation of the structures across modalities and the recovery of clusters, while it is ranked second regarding distance representation. Moreover, MOJITOO has the lowest time and memory requirements requiring 2.5 min and 6.4 GB in the largest dataset with 30.000 cells. WNN, which is the standard method for integration in Seurat, performed well on the clustering problem (2nd after MOJITOO) and had a low computational time, but had a poor performance in the structure preservation and silhouette scores. Moreover, WNN, which outputs a distance matrix on the shared space, does not provide latent features as MOJITOO, DIABLO or MOFA. MOFA performed well on distance representation, but was ranked fourth in the clustering problem and third at structure preservation score. DIABLO, which is also based in CCA, had an good predictive performance (2nd for structure preservation and 3rd in clustering and distance preservation), but had an a poor computational performance being the 2nd slowest method.An interesting result is the fact the structure preservation scores are more uniform across modalities for CCA-based methods MOJITOO and DIABLO, while runner-up methods (MOFA) obtained highest scores for the RNA modality. This is possibly rooted on the analytical frameworks of these methods. CCA analysis explicitly finds canonical vectors with high correlation across modalities, while matrix factorization methods (MOFA) do not explicitly guarantee factors are uniformly well represented across modalities. Of note, the CCA approach explored in Symphony was also biased toward structure preservation in RNA, due to the fact it only use RNA canonical vectors as a final latent representation. Another finding of this study is that matrix factorization methods (MOJITOO, MOFA, DIABLO), benefit from working with low dimensional matrices as input by having lower computational requirements. For MOFA, a comparison of the results with or without dimension reduction even indicates positive effects of low dimensional inputs in the clustering problem.It is important to point out that measures used in our benchmarking have their own limitations. The distance and clustering evaluation requires labels, which might be biased toward the methods used to derive the labels. Indeed, WNN does well on the clustering problem of the two CITE-seq data, which are originally evaluated with WNN. The Silhouete score has underlying distribution assumptions, which might not be met by the latent space generated by integration methods. Finally, the structure score requires a low-dimensional representation of the spaces. In this study, however, we consider methods performing well in all metrics, which mitigates biases of individual metrics.Finally, we highlight how a simple inspection of CCA derived latent spaces supports the biological interpretation and detection of relevant molecular features, as exemplified in the multiome PBMC dataset. Future work includes further exploring the interpretability of MOJITOO, for example, by finding associations between molecular features across modalities as gene to peak links (Granja ). Here, it would be interesting to explore potential non-linearities of features, as commonly explored in scRNA-seq data (Van Dijk ). Finally, an interesting topic is to investigate if differences in the modality-specific canonical vectors detected by MOJITOO can indicates biological properties of those. For example, in the Skin SHARE-seq data (Ma ), authors show that cells with changes in chromatin preceding changes in gene expression indicates cell differentiation.
Funding
This project was funded by the German Research Foundation (DFG) (project GE 2811/3-2) and the E: MED Consortia Fibromap funded by the German Ministry of Education and Science (BMBF).Conflict of Interest: none declared.
Data availability
Pre-processed publically available data used for benchmarking is available at ZENODO: https://doi.org/10.5281/zenodo.6348128.Click here for additional data file.
Authors: Eleni P Mimitou; Caleb A Lareau; Kelvin Y Chen; Andre L Zorzetto-Fernandes; Yuhan Hao; Yusuke Takeshima; Wendy Luo; Tse-Shun Huang; Bertrand Z Yeung; Efthymia Papalexi; Pratiksha I Thakore; Tatsuya Kibayashi; James Badger Wing; Mayu Hata; Rahul Satija; Kristopher L Nazor; Shimon Sakaguchi; Leif S Ludwig; Vijay G Sankaran; Aviv Regev; Peter Smibert Journal: Nat Biotechnol Date: 2021-06-03 Impact factor: 68.164
Authors: Fidel Ramírez; Devon P Ryan; Björn Grüning; Vivek Bhardwaj; Fabian Kilpert; Andreas S Richter; Steffen Heyne; Friederike Dündar; Thomas Manke Journal: Nucleic Acids Res Date: 2016-04-13 Impact factor: 16.971