Computational strategies for single-cell multi-omics integration.

Nigatu Adossa^1, Sofia Khan^1, Kalle T Rytkönen^1,2, Laura L Elo^1,2.

Abstract

Single-cell omics technologies are currently solving biological and medical problems that earlier have remained elusive, such as discovery of new cell types, cellular differentiation trajectories and communication networks across cells and tissues. Current advances especially in single-cell multi-omics hold high potential for breakthroughs by integration of multiple different omics layers. To pair with the recent biotechnological developments, many computational approaches to process and analyze single-cell multi-omics data have been proposed. In this review, we first introduce recent developments in single-cell multi-omics in general and then focus on the available data integration strategies. The integration approaches are divided into three categories: early, intermediate, and late data integration. For each category, we describe the underlying conceptual principles and main characteristics, as well as provide examples of currently available tools and how they have been applied to analyze single-cell multi-omics data. Finally, we explore the challenges and prospective future directions of single-cell multi-omics data integration, including examples of adopting multi-view analysis approaches used in other disciplines to single-cell multi-omics.

Keywords: Clustering; Integration; Multi-omics; Single-cell

Year: 2021 PMID: 34025945 PMCID: PMC8114078 DOI: 10.1016/j.csbj.2021.04.060

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271

Introduction

Recent developments in single-cell omics technologies to measure different modalities such as genome, transcriptome, epigenome, and proteome have enabled unprecedented insight into, and resolution of, cellular phenotypes, biological processes and developmental stages [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11]. Single-cell studies can resolve the confounding effects of distinct cell types in heterogeneous samples, which cannot be separated with traditional bulk approaches. Recent technological advancements have demonstrated simultaneous assaying of two or more different omics layers [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30]. The multimodal approaches at single-cell resolution are pushing forward a new era of scientific exploration in the field of molecular biology and medicine. The combination of several single-cell omics layers has enabled higher resolution for differentiation processes involved, for instance, in embryonic development [20], [31], development of the immune system [32], [33], [34], [35], [36], cancer biology [37], and neuronal development [38], [39]. Additionally, the potential for new translational applications is high [40]; within known cell types, distinct subpopulations of cells have been discovered to associate with disease versus healthy states, for instance, in the context of somatic cancer evolution [41], heart [42], [43] and neuronal diseases [44], and recurrent miscarriage [45]. Generally, in cancer, tumor heterogeneity plays a crucial role in drug resistance, relapse and metastasis. Therefore, accurately identifying tumor subpopulations using multi-omics approaches holds potential in the field of precision medicine. Further, multimodal omics data enable joint analysis of the different players, such as transcripts and proteins, in complex regulatory processes [46].
Computationally, the multimodal single-cell omics profiling has opened up the way for developing models that can relate the interactions and associations among multiple omics layers at single-cell resolution and allows utilization of complementary evidence from the multimodal data [47], [48]. At the core of the single-cell analysis are clustering algorithms that are used to separate cell types or functional cell states, either static or continuous. Strategically, multimodal single-cell data analysis can be roughly divided into three main approaches based on the stage where the integration of the data layers is conducted: early, intermediate, and late integration (Fig. 1). Similar categories have been described earlier in the context of bulk multi-omics data analysis [49], [50]. Early integration concatenates multiple omics data types into one integrated dataset and performs analysis on this data using the same algorithms typically used for the single omics layers. In late integration, analysis is first performed separately on each omics layer and these results are then integrated to determine the final consensus results. In intermediate integration, the multiple omics layers are analyzed together, including integration of sample similarities, joint dimension reduction techniques, and statistical modeling approaches [49].

Fig. 1

Single-cell multi-omics workflow. The first step in the workflow is sample extraction where cells are harvested, for example, from blood or tissues. Next, the extracted cells are dissociated and used to profile multiple layers of omics data from individual cells. In the computational analysis three data integration strategies can be used: early, intermediate and late data integration. In the end, for instance, distinct cell types and cell states can be recognized by clustering.

In this review, we provide a coarse overview of the recent development in different approaches for integrative single-cell multi-omics analysis and clustering. We focus on the basic principles and strategies and provide examples of the available tools and software utilizing the different strategies. We also briefly discuss the challenges and future directions for the method development and application.

Single-cell multi-omics data

The single-cell omics datasets can either be matched, i.e. different omics layers have been measured simultaneously from the same individual cell with recent techniques such as Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-Seq), RNA expression and protein sequencing assay (REAP-seq), gDNA-mRNA sequencing (DR-seq) or single-cell methylome and transcriptome sequencing (scM&T-seq) [12], [16], [24], [27] with a comprehensive listing in [51], or unmatched, i.e. different omics layers have been measured from different single-cell experimental samples [52]. Compared to matched multimodal data, the unmatched multi-omics datasets have more sources of variation, as the different omics layers originate from different cells and experimental setups [48]. Despite the challenge of addressing different sources of variation and batch effects, unmatched single-cell multi-omics data integration has large potential to reveal novel biological insights because of the high quantity of single-modality single-cell data generated in recent years. Until recently, measurement of one layer of single-cell data has been economically a far more reachable and easier option than matched multi-omics. Hence, in several cases where related data are available, integrating these is still a viable option for the wider research community. Also, several of the current data analysis methods have been developed using unmatched data. The first comparisons to provide details on the increased accuracy of the matched data are only now emerging.
As a very preliminary example, a recent study compared computationally inferred cluster assignments from matched single-cell RNA sequencing (scRNA-seq) and single-cell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) datasets to the (de facto) measured couplings and reported highly variable and dataset-dependent accuracy (37–75%) for the computational inference; however, the clustering misassignments represented related cell types [1]. Single-cell transcriptomic (scRNA-seq) data is by far the most commonly assayed single-cell data type. Epigenomic (scATAC-seq and methylome) data are typically sparser than scRNA-seq data, leading to a situation where the integration strategy should be weighted to take into account the unbalanced information content. One simplistic solution here is to transfer clusters or cell type labels from information-rich scRNA-seq data to another, sparser data layer [53], [54]. On the other hand, when cell surface receptor data is available from CITE-seq, it may be biologically relevant to use the well-known protein markers to guide the clustering of scRNA-seq results [55]. Considering experimental design, scRNA-seq from whole single cells commonly requires fresh tissues, whereas nuclear samples (nuclear snRNA-seq, ATAC-seq and methylome) can be frozen specimens, which greatly facilitates projects with extensive sampling. Promisingly, although nuclear transcriptome sequencing has less coverage and depth compared to full-cell scRNA-seq of mRNA, recent comparisons have suggested that the majority of the expression changes can be retrieved from single-nuclei RNA-seq [56], [57], further motivating the use of matched nuclei samples (RNA + ATAC, RNA + methylome etc.) for gene regulation studies.
Notably, some assays retrieve matched genomic/chromatin and RNA data from nuclei, such as the commercially available (10X Genomics) snRNA/ATAC-seq, whereas others, such as scM&T-seq or single-cell Chromatin Accessibility and Transcriptome sequencing (scCAT-seq) [16], [58], combine nuclear genomic/chromatin collection with a transcriptome assay from the cytoplasm. Further, methods to assay three omics modalities simultaneously from single cells are currently being developed, such as scNOMeRe [59], which collects transcriptome data from cytoplasmic extract and DNA methylation and chromatin accessibility data from nuclei, or scNMT-seq [29], which collects nucleosome data instead of chromatin accessibility. Another developing front is the integration of the above-mentioned liquid omics with spatial transcriptomics and other data that preserve the information of tissue structures [60]. Also, for translational aspects, single-cell diagnostics such as the detection of circulating tumor cells (CTCs) [61], [62], [63] are emerging, and in the future these may also include integration of several omics layers.

Single-cell multi-omics data integration strategies

A primary goal of single-cell analysis is to discover known and novel cell populations. Hence, the data analysis methods used to achieve this goal are most often unsupervised. Additionally, some semi-supervised approaches have been suggested [64]. Here, we describe the general single-cell multi-omics integration strategies, divided into early, intermediate and late integration (Fig. 2). While some single-cell applications have used early or late data integration, the intermediate integration approach has been the most widely used for integrating single-cell multi-omics data.

Fig. 2

Schematic illustration of the early, intermediate and late data integration strategies in single-cell multi-omics analysis. In early data integration, multiple omics datasets are concatenated together for downstream analysis. By default, early integration increases the dimensionality of the data and does not account for the different distribution of the values in each separate omics layer. The intermediate data integration strategy covers a range of techniques to jointly analyze multiple omics datasets. Typically, this is done by transforming the datasets to a single integrated data matrix using, for instance, similarity-based integration, joint dimensionality reduction, or statistical modeling-based approaches. The late data integration strategy first employs the data analysis separately for each omics layer and then integrates these results to create a consensus result.

In early data integration, multiple omics data layers are first concatenated into a single merged data matrix before proceeding to the analysis. The merged data matrix can then be used as an input for machine learning methods that are able to consider any type of dependences between the features [65]. The advantages of this approach include relatively easy application of any method that can utilize a data matrix. However, the merged data matrix increases complexity beyond single omics data and hence early data integration approaches often utilize automatic feature learning, such as dimensionality reduction and representation learning [66]. Feature learning methods, such as autoencoders, combine multiple omics layers with variable numbers of features into a compressed data matrix at the hidden layer to create an integrated representation of the multi-omics data. As such, however, autoencoders are conceptually closer to intermediate integration approaches. In the bulk omics setting, early integration has been applied, for instance, to tumor subtyping that jointly involves all the considered omics [67].
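The concatenation step of early integration can be sketched in a few lines. This is a minimal illustration, not the procedure of any cited tool: the function name `early_integrate`, the toy matrices, and the choice to z-score each feature and divide each layer by the square root of its feature count (so that a layer with many features does not dominate the merged matrix) are all assumptions made for the example.

```python
import numpy as np

def early_integrate(layers):
    """Concatenate per-cell feature matrices (cells x features) column-wise.

    Each feature is z-scored and each layer is divided by sqrt(n_features),
    so every layer contributes the same total variance to the merged matrix.
    This weighting is an illustrative assumption, not a standard.
    """
    blocks = []
    for X in layers:
        X = np.asarray(X, dtype=float)
        mu = X.mean(axis=0)
        sd = X.std(axis=0)
        sd[sd == 0] = 1.0  # constant features stay at zero after centering
        Z = (X - mu) / sd
        blocks.append(Z / np.sqrt(X.shape[1]))
    return np.hstack(blocks)

rng = np.random.default_rng(0)
rna = rng.poisson(5.0, size=(6, 200))            # toy counts: 6 cells x 200 genes
protein = rng.normal(10.0, 2.0, size=(6, 10))    # toy values: 6 cells x 10 proteins
merged = early_integrate([rna, protein])
print(merged.shape)  # (6, 210)
```

Any single-matrix method (clustering, dimensionality reduction) can then be run on `merged`; without the per-layer scaling, the 200 RNA columns would outweigh the 10 protein columns by construction.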
In the single-cell omics setting, early data integration has been most commonly applied to combine multiple datasets of the same omics type from different studies, such as scRNA-seq data from multiple sources coupled with different normalization and scaling steps [68]. A major challenge of early data integration is that the features from multiple omics datasets often differ in both dimension and scale, which may put more weight on an omics layer with more dimensions unless the data are properly normalized [69]. Furthermore, the sparsity and the high-dimensional nature of omics datasets make it challenging to construct a common representation across multiple datasets. This could be addressed by a lower-dimensional embedding of each individual dataset that retains the overall structure of the original data, followed by the subsequent data integration technique; several linear [70] and nonlinear [71] methods are available.
The late integration strategy first performs the analysis separately for each omics data layer and then integrates these results to create a consensus result. These methods have previously been used to integrate separate scRNA-seq experiments or in bulk multi-omics but have not yet been widely applied to single-cell multi-omics. For instance, mixture model ensemble clustering has been applied to combine multiple scRNA-seq clustering results and could readily be applied to create a consensus clustering of multi-omics data [72]. It models the interdependence of local clustering results with the aim of finding a robust and improved global clustering solution across multiple data sources through optimization. The SAME-clustering [72] tool implements a mixture model ensemble method for aggregating clustering solutions generated by different clustering algorithms on scRNA-seq data, aiming to arrive at a robust consensus clustering.
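The idea of combining per-layer clustering results into a consensus can be illustrated with a co-association matrix. This is a deliberately simple heuristic and not the mixture model used by SAME-clustering: the function names, the 0.5 threshold, and the toy labels are assumptions, and the sketch presumes that every layer provides a cluster label for the same set of cells.

```python
import numpy as np

def coassociation(labelings):
    """Fraction of clusterings in which each pair of cells shares a cluster."""
    labelings = [np.asarray(lab) for lab in labelings]
    n = labelings[0].shape[0]
    C = np.zeros((n, n))
    for lab in labelings:
        C += (lab[:, None] == lab[None, :])
    return C / len(labelings)

def consensus_labels(C, threshold=0.5):
    """Connected components of the graph linking cells with C > threshold."""
    n = C.shape[0]
    labels = -np.ones(n, dtype=int)
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in np.flatnonzero(C[i] > threshold):
                if labels[j] == -1:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Toy per-layer results; label names need not agree between layers.
rna_clusters = [0, 0, 0, 1, 1, 1]
atac_clusters = [1, 1, 0, 0, 0, 0]
protein_clusters = [2, 2, 2, 5, 5, 5]
C = coassociation([rna_clusters, atac_clusters, protein_clusters])
consensus = consensus_labels(C)
print(consensus.tolist())  # [0, 0, 0, 1, 1, 1]
```

Here the third cell is placed differently by the ATAC layer, but two of the three layers agree, so the majority grouping wins. Each layer is free to use its own clustering algorithm, which is the flexibility that late integration offers.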
The Graph Partitioning-Based Cluster Ensemble Method Sc-GPE [73] and the Ensemble Clustering Based on Probability Graphical Model With Graph Regularization EC-PGMGR [74] use graph-based ensemble clustering. These approaches can also be applied to single-cell multi-omics clustering, as they give the flexibility to use different omics-specific clustering algorithms to generate the best local clustering solutions. In addition, late integration approaches have been employed in bulk multi-omics, including Cluster-of-clusters analysis (COCA) [75], a two-step integrative clustering algorithm that performs integrative cluster analysis by summarizing the clustering results found from multiple omics datasets. In Kernel Learning Integrative Clustering (KLIC) [76], multiple clustering structures are integrated as a multiple kernel learning problem in which each dataset provides a weighted contribution to the final clustering. As late integration algorithms often take a clustering result as an input, they fit directly into a workflow where unmatched single-cell multi-omics datasets are first analyzed separately.
The intermediate integration covers a range of techniques that aim to jointly analyze the different omics layers using, for instance, similarity-based integration, joint dimension reduction, or statistical modeling. The similarity-based integration approaches include, for instance, spectral clustering approaches and graph fusion algorithms. The joint dimension reduction techniques aim to find a lower-dimensional representation for the single-cell multimodal data layers by projecting them into a common latent space. These include various matrix factorization techniques as well as covariance-based techniques, such as canonical correlation analysis. The statistical modeling techniques for integration utilize, for instance, Bayesian approaches to determine cluster probabilities of cells from multiple omics layers.
Representative examples of different tools that apply these approaches are provided in Table 1.

Table 1

Computational single-cell multi-omics tools applying intermediate integration approaches and their applicable omic data types.

Tool	Methodology	Single-cell omics types (designed for matched/unmatched)	Refs.
Similarity-based approaches
SCHEMA	Metric-learning based method	Multi-omics data (matched)	[77]
Spectrum	Spectral clustering with a density-aware kernel	Multi-omics data (unmatched)	[78]
Seurat4	Weighted-nearest neighbor analysis	Transcriptome and chromatin accessibility or proteome data (matched)	[79]
Dimension reduction-based approaches
bindSC	Canonical correlation analysis	Transcriptome and chromatin accessibility data (matched)	[80]
CoupledNMF	Non-negative matrix factorization	Transcriptome and chromatin accessibility data (unmatched)	[53]
LIGER	Non-negative matrix factorization	Transcriptome and spatial gene expression data or DNA methylation (unmatched)	[81]
MAGAN	Manifold alignment	Multi-omics data (unmatched)	[82]
MATCHER	Manifold alignment	Transcriptome and DNA methylation data (matched)	[83]
MMD-MA	Manifold alignment	Multi-omics data (matched)	[84]
MOFA+	Factor analysis	Multi-omics data (matched)	[85]
scMVAE	Variational autoencoder	Multi-omics data (matched)	[86]
Seurat3	Canonical correlation analysis	Transcriptome and chromatin accessibility data (unmatched)	[87]
totalVI	Deep generative model	Transcriptome and proteome data (matched)	[88]
Unicom	Manifold alignment	Multi-omics data (unmatched)	[89]
Statistical modeling-based approaches
BREM-SC	Bayesian mixture model	Transcriptome and proteome data (matched)	[90]
Clonealign	Statistical model	Transcriptome and genome data (unmatched)	[91]

Computational tools for intermediate integration of single-cell multi-omics data

Similarity-based approaches

Spectral clustering utilizes similarity matrices as a basis for clustering. A multi-view version of spectral clustering can be adopted to deal with multi-omics data. Currently, several methods that are applied in bulk multi-omics data integration are being proposed for single-cell multi-omics integration [92], [93], [94], [95], [96]. For example, Spectrum [78] uses a self-tuning density-aware kernel that enhances the similarity between points that share common nearest neighbours. In addition to bulk data, it has been applied to simulated single-cell data [78]. The Pair-wised Co-regularized Multimodal Spectral Clustering (PC-MSC) [97] implements a co-regularization approach to combine multiple kernels representing the different omics layers. The method has been applied to single-cell transcriptome and protein marker data [94]. SCHEMA [77] implements a metric-learning based method [98], which first determines similarities between cells under each modality and then transforms the primary modality so that it has the maximum level of agreement with the other modalities.
Graph fusion algorithms construct graphs from each omics layer and map them to a single fused graph. Recently, several graph fusion algorithms [92], [93], [94], [95], [96] have been proposed for integrating graphs in multi-view clustering domains. Generally, once the graphs from multiple omics layers are integrated, any conventional clustering method can be applied to partition the joint graph into clusters. Crucial for the accuracy of this approach is that the geometric properties of the single data layers are sufficiently maintained in the global representation. Most notably for single-cell omics, the latest version of the widely used Seurat, Seurat4 [79], implements a weighted-nearest neighbor graph-based integration for cluster analysis. It has been applied, for instance, to CITE-seq data of blood cells to improve the discovery of cell states and cell types.
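A similarity-based integration of two matched layers can be sketched as follows: build one affinity matrix per layer, average them into a fused graph, and split the fused graph with the second eigenvector of its Laplacian. The locally scaled Gaussian kernel used here is a generic self-tuning kernel chosen for illustration; it is an assumption and not the kernel of Spectrum, the weighting scheme of Seurat4, or any other cited tool. Function names and toy data are likewise hypothetical.

```python
import numpy as np

def self_tuning_affinity(X, k=2):
    """Gaussian affinity with a per-cell scale set by the k-th nearest neighbour.

    Assumes no duplicated cells, so that every local scale is positive.
    """
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Column 0 of the sorted distances is the cell itself (distance 0).
    scale = np.sort(d, axis=1)[:, k]
    W = np.exp(-d ** 2 / np.outer(scale, scale))
    np.fill_diagonal(W, 0.0)
    return W

def fused_bipartition(layers, k=2):
    """Average the per-layer affinities and split cells by the Fiedler vector."""
    W = sum(self_tuning_affinity(X, k) for X in layers) / len(layers)
    laplacian = np.diag(W.sum(axis=1)) - W   # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(laplacian)      # eigenvalues in ascending order
    fiedler = vecs[:, 1]
    return (fiedler > 0).astype(int)

# Toy matched data: six cells measured in two layers.
rna = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
protein = [[0.0], [0.2], [0.1], [0.5], [0.7], [0.6]]
labels = fused_bipartition([rna, protein])
print(labels)
```

The sign of an eigenvector is arbitrary, so only the grouping of the cells is meaningful, not which group is called 0 or 1. For more than two clusters, the usual extension is to run k-means on several leading eigenvectors.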

Dimension reduction-based approaches

Canonical correlation analysis (CCA) is a correlation-based multivariate analysis method to examine the linear relationship between two datasets [99], [100]. A set of linear combinations of all variables in each of the two datasets is determined so that it maximizes the correlation between them and best explains both within- and between-dataset variability. The high dimensionality, sparsity and variable feature spaces across the different omics layers pose constraints on the linear combinations, limiting the biological applicability of CCA. Generally, to solve these issues, variants of CCA including sparse CCA [100] and the penalized matrix decomposition (PMD) method [101] have been proposed. For instance, Seurat3 [87] implements CCA in order to integrate two single-cell omics datasets. It first jointly reduces the dimensionality of the two datasets using diagonalized CCA, then searches for mutual nearest neighbors in the lower-dimensional space, and finally uses these pairs as anchors that establish the cellular relationships across the datasets. This has been used, for example, to integrate scRNA-seq and scATAC-seq data from the mouse visual cortex and scRNA-seq and surface protein expression from bone marrow [87]. Another recent adaptation of CCA for single-cell multi-omics clustering is bindSC [80], which utilizes bi-order canonical correlation analysis (bi-CCA) that captures the correlated variables from both cells and features between two modalities to formulate the canonical correlation vectors in a latent space. While Seurat3 or bindSC can only be applied to two datasets at a time, multiset CCA [102] aims to simultaneously find multivariate associations between more than two modalities. In multiset CCA, the canonical coefficients of all variables are optimized to maximize the pairwise canonical correlations [103]. Currently, we are not aware of multiset CCA being applied to single-cell multi-omics.
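For readers unfamiliar with CCA, the classical two-dataset version can be written compactly with a QR decomposition followed by a singular value decomposition. This is a textbook sketch under the assumption that there are more cells than features and that both matrices have full column rank; it is not the diagonalized CCA of Seurat3 or the bi-CCA of bindSC, and the simulated data and variable names are purely illustrative.

```python
import numpy as np

def cca(X, Y):
    """Classical CCA for matched rows (cells) via QR and SVD.

    Returns weight matrices A, B and the canonical correlations rho,
    sorted in decreasing order.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, Rx = np.linalg.qr(Xc)   # orthonormal basis of the centered columns
    Qy, Ry = np.linalg.qr(Yc)
    U, rho, Vt = np.linalg.svd(Qx.T @ Qy, full_matrices=False)
    A = np.linalg.solve(Rx, U)      # weights for the features of X
    B = np.linalg.solve(Ry, Vt.T)   # weights for the features of Y
    return A, B, np.clip(rho, 0.0, 1.0)

# Simulated matched data: one shared latent signal drives both layers.
rng = np.random.default_rng(1)
z = rng.normal(size=200)
X = np.outer(z, [1.0, 0.5, -1.0]) + 0.1 * rng.normal(size=(200, 3))
Y = np.outer(z, [2.0, -1.0]) + 0.1 * rng.normal(size=(200, 2))

A, B, rho = cca(X, Y)
u = (X - X.mean(axis=0)) @ A[:, 0]   # first canonical variate of X
v = (Y - Y.mean(axis=0)) @ B[:, 0]   # first canonical variate of Y
print(rho)
```

The first canonical correlation is close to 1 because both layers share the latent signal `z`, while the second reflects only noise. With thousands of sparse features and comparatively few cells, this plain formulation breaks down, which is the motivation for the sparse and penalized variants mentioned in the text.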
Non-negative matrix factorization (NMF) extracts a low-dimensional non-negative representation of high-dimensional data that is typically sparse. LIGER (linked inference of genomic experimental relationships) [81] is a recently introduced tool for single-cell multi-omics analysis that utilizes integrative non-negative matrix factorization (iNMF) [104] in order to identify the shared and dataset-specific factors across the datasets. It was applied to spatial and scRNA-seq data from mouse brain frontal cortex in order to cluster cell subtypes, and to scRNA-seq and DNA methylation data from mouse cortex to perform integrative cluster analysis [81]. Further, the iNMF implementation of LIGER was recently extended [105] into an online learning algorithm [106], where multiple datasets are used as mini-batches in a continual cycle, allowing fast and memory-efficient integration of large multimodal datasets. Another NMF-based implementation for scRNA-seq and scATAC-seq data, coupledNMF [53], formulates an optimization problem to couple the information from each dataset during the cluster optimization. The factor analysis-based tool MOFA [107] and its improved version MOFA+ [85], on the other hand, use a variational Bayesian inference framework and have been applied to both bulk and single-cell multi-omics analysis.
Manifold alignment is a class of machine learning algorithms that produce projections between sets of data that lie on a common manifold [108]. The idea is to create a low-dimensional representation (or manifold) for each dataset and then align these representations (manifolds) in a common space where the different datasets are directly comparable. Manifold alignment algorithms can be supervised, semi-supervised, or unsupervised based on the level of available correspondence information among disparate datasets.
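The alignment step is easiest to see in the fully supervised case, where every cell has a known counterpart in the other dataset. Under that assumption, two low-dimensional embeddings can be aligned with an orthogonal Procrustes rotation, as sketched below. This only illustrates the idea of mapping two representations into a common space; the unsupervised tools discussed in this review do not have such correspondences and work differently. The embeddings and names are made up for the example.

```python
import numpy as np

def procrustes_rotation(A, B):
    """Orthogonal matrix R minimising ||A @ R - B||_F for row-paired A and B."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

# Toy 2-D embedding of four cells from one layer, centered at the origin.
emb_rna = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5], [0.5, -1.5]])
emb_rna = emb_rna - emb_rna.mean(axis=0)

# Second layer: the same cells, but embedded in a rotated coordinate system.
theta = np.pi / 6
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta), np.cos(theta)]])
emb_meth = emb_rna @ R_true

R = procrustes_rotation(emb_rna, emb_meth)
aligned = emb_rna @ R   # first embedding expressed in the second one's coordinates
print(np.allclose(aligned, emb_meth))  # True
```

After alignment, distances between cells from the two layers become meaningful, so joint clustering or nearest-neighbor matching can be applied in the common space.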
The currently available manifold alignment tools are unsupervised, such as MATCHER [83], which has been applied to matched and unmatched single-cell transcriptome and DNA methylation data. The method assumes that the variation among cells can be explained mainly by a single latent variable. Another tool, Manifold-Aligning GAN (MAGAN) [82], is a generative adversarial network (GAN) based manifold alignment tool for single-cell multi-omics analysis. It has been demonstrated to integrate scRNA-seq and proteomic (mass cytometry) datasets efficiently. Other manifold alignment tools that have been introduced for single-cell multi-omics include, for example, Unicom [89] and MMD-MA [84].
Autoencoders [109] are neural networks that uncover the underlying nonlinear patterns in multiple high-dimensional datasets by compressing them into a unified lower-dimensional subspace. Architecturally, autoencoders have input, hidden and output layers, with the bottleneck in the middle holding the most compressed form of the input data. The encoder part of the neural network compresses the input data into the bottleneck layer, whereas the decoder part decompresses the data to regenerate the original input data as an output. The compressed data can then be used for further analysis. Two variations of autoencoders have recently been applied in single-cell multi-omics: variational autoencoders (VAE) [86], [88], and adversarial autoencoders (AAE) [110]. The advantage of variational autoencoders is that they encode the latent attributes of the input as a probability distribution instead of a single deterministic value. This approach has been used in totalVI [88] for jointly transforming the RNA and protein data into joint lower-dimensional cell states. The single-cell multimodal variational autoencoder (scMVAE) was recently used in integrative analysis of scRNA-seq and scATAC-seq data [86].
Additionally, an adversarial autoencoder method [110] was recently developed and applied to integrate scRNA-seq and imaging data. Adversarial autoencoders take advantage of GANs to more accurately integrate the data layers [111].
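The encoder, bottleneck and decoder structure described above can be made concrete with a very small deterministic autoencoder trained by plain gradient descent. This is a teaching sketch only: it is neither variational nor adversarial, it does not reproduce totalVI, scMVAE or the method of [110], and the simulated data, layer sizes, learning rate and function name are all assumptions.

```python
import numpy as np

# Simulated matched data: two latent factors drive both layers.
rng = np.random.default_rng(0)
n_cells = 50
latent = rng.normal(size=(n_cells, 2))
rna = latent @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(n_cells, 6))
protein = latent @ rng.normal(size=(2, 3)) + 0.05 * rng.normal(size=(n_cells, 3))
X = np.hstack([rna, protein])
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize every feature

def train_autoencoder(X, n_hidden=2, lr=0.05, n_steps=2000, seed=0):
    """One tanh bottleneck layer and a linear decoder, trained on squared error."""
    init = np.random.default_rng(seed)
    n, p = X.shape
    W_enc = 0.1 * init.normal(size=(p, n_hidden))
    W_dec = 0.1 * init.normal(size=(n_hidden, p))
    losses = []
    for _ in range(n_steps):
        H = np.tanh(X @ W_enc)          # encoder: compress to the bottleneck
        X_hat = H @ W_dec               # decoder: reconstruct the input
        err = X_hat - X
        losses.append(float((err ** 2).mean()))
        grad_out = 2.0 * err / err.size             # d(loss)/d(X_hat)
        grad_dec = H.T @ grad_out
        grad_hidden = (grad_out @ W_dec.T) * (1.0 - H ** 2)
        grad_enc = X.T @ grad_hidden
        W_enc -= lr * grad_enc
        W_dec -= lr * grad_dec
    return W_enc, W_dec, losses

W_enc, W_dec, losses = train_autoencoder(X)
embedding = np.tanh(X @ W_enc)   # joint low-dimensional representation of the cells
print(embedding.shape, losses[0], losses[-1])
```

The `embedding` matrix plays the role of the compressed data at the bottleneck and could be passed to any clustering method. A variational autoencoder would replace the deterministic bottleneck with the parameters of a distribution and add a regularization term to the loss.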

Statistical modeling-based approaches

A Bayesian framework allows probabilistic modeling of multi-omics data. For instance, a Dirichlet mixture model can be used to construct a context-dependent Bayesian clustering framework for clustering multiple omics datasets on the level of individual omics, while also simultaneously extracting global multi-omics structure [112]. The probabilistic model-based algorithm BREM-SC [90] utilizes the Dirichlet multinomial distribution and introduces specific random effects in order to model the correlation between different omics layers. It was recently applied to gene expression and surface protein expression data. Clonealign [91] also implements a statistical framework for integrating gene expression and copy number profiles from unmatched single-cell RNA-seq and scDNA-seq data to assign gene expression states to cancer clones. The inference is done using a mean field variational Bayes approach. Other Bayesian frameworks for integrative model-based clustering have been proposed for clustering multi-omics data in bulk studies [113], [114]. Such methods could be worth testing in the context of single-cell multimodal cluster analysis.
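How a probabilistic model combines evidence from several layers can be shown with a much simpler model than those cited above. The sketch assumes that, given the cluster, the counts in each layer follow independent multinomial distributions with known cluster profiles, and it applies Bayes' rule to obtain cluster probabilities for one cell. BREM-SC additionally uses Dirichlet priors and cell-specific random effects, and Clonealign uses variational inference; neither is reproduced here. All numbers and names are invented for illustration.

```python
import numpy as np

def cluster_posterior(counts_by_layer, profiles_by_layer, prior):
    """Posterior cluster probabilities for one cell across several layers.

    counts_by_layer:   list of 1-D count vectors, one per omics layer
    profiles_by_layer: list of (n_clusters x n_features) matrices whose rows
                       are strictly positive feature probabilities
    prior:             prior probability of each cluster
    """
    log_post = np.log(np.asarray(prior, dtype=float))
    for counts, profiles in zip(counts_by_layer, profiles_by_layer):
        # Multinomial log-likelihood, dropping the term shared by all clusters.
        log_post = log_post + np.asarray(counts) @ np.log(np.asarray(profiles)).T
    log_post = log_post - log_post.max()   # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

rna_profiles = [[0.6, 0.3, 0.1],    # cluster 0: expected gene proportions
                [0.3, 0.3, 0.4]]    # cluster 1
protein_profiles = [[0.8, 0.2],     # cluster 0: expected marker proportions
                    [0.2, 0.8]]     # cluster 1
prior = [0.5, 0.5]

rna_counts = [3, 2, 1]      # few RNA reads: weak evidence
protein_counts = [9, 1]     # clear surface-marker signal

rna_only = cluster_posterior([rna_counts], [rna_profiles], prior)
joint = cluster_posterior([rna_counts, protein_counts],
                          [rna_profiles, protein_profiles], prior)
print(rna_only)   # cluster 0 has probability 2/3 from RNA alone
print(joint)
```

With RNA alone the assignment is uncertain, whereas adding the protein layer makes cluster 0 overwhelmingly likely. In a full mixture model the profiles and prior would themselves be estimated from the data, for example by expectation-maximization or variational inference.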

Summary and outlook

Single-cell technology is having an enormous impact on the discovery of novel cell types and on defining more accurate cell differentiation trajectories, and it has translational effects on precision medicine. Clustering is a widely used unsupervised machine learning method for analyzing cellular heterogeneity in both single-cell mono- and multi-omics analysis. In the multi-omics analysis, we discussed early, intermediate and late data integration strategies together with recently introduced single-cell multi-omics analytical tools. These tools apply algorithms and analysis methods that have previously been developed in a wider framework of multi-view analysis [115], [116] in different fields, such as text mining [54], image/video analysis [116], [117] and bulk multi-omics analysis [49], [50], [118]. Many of these methods still remain unexplored in single-cell multi-omics analysis and we expect them to be intensively examined in that context in the near future. Here we expand our previous description of the specific tools already used in the field of single-cell multi-omics by discussing multi-view approaches that have been utilized in other fields but not yet applied to single-cell multi-omics. Currently, the most widely used multi-omics integration approaches reduce multiple omics datasets to a single data matrix using CCA, manifold alignment, graph-based integration techniques, or autoencoders before performing cluster analysis. CCA, which has most often been used via, for instance, Seurat3, could in the future be further developed to take into account the potential advances of sparse CCA [100], [101], [119]. The non-linearity of high-dimensional single-cell multi-omics data could also be dealt with using other CCA variants, such as kernel CCA [103], [120] or deep CCA [121].
Further, and importantly, new unified distributional embedding methods, such as Multi-view Neighborhood Embedding (MvNE) [50], are potentially relevant additions in single-cell omics. In general, data integration approaches in which all of the omics datasets are jointly used for optimization can be considered to have an advantage. For this, there still remains a variety of clustering implementations for multi-view data in a co-training fashion that have not been properly tested for single-cell multi-omics clustering, while their utility in other disciplines such as text mining is more established. For example, multi-view k-means clustering has proved its effectiveness in the field of image analysis [122], [123], [124], [125], [126], [127], whereas Cluster-of-clusters analysis (COCA) [75], Kernel Learning Integrative Clustering (KLIC) [76] and perturbation-based clustering [128] have been used in bulk multi-omics cluster analysis but have not yet been widely applied to single-cell multi-omics. The benefit of late integration approaches, on the other hand, is the flexibility to use different algorithms at each of the individual omics layers before integration into an ensemble solution. Currently, several single-cell multi-omics tools have been developed to address the integration and clustering of multi-omics datasets (Table 1), but a comprehensive and objective comparison and benchmarking of these recent methods is yet to be conducted and is in high demand. Additionally, the current multi-modal analysis tools mostly focus on integrative clustering of multi-omics data with the aim of identifying the shared cell type heterogeneity. More tools are needed that are capable of addressing various biological questions from matched single-cell multi-omics data, such as integrative motif discovery and inference of gene regulatory networks, or combining spatial expression patterns with liquid-based sequencing results.
The future is likely to bring more robust and improved technology in the area of single-cell multi-modal profiling, enabling the measurement of multitudes of omics and other data, such as imaging, from a single cell. This will open up new opportunities for finding novel insights into biological mechanisms, answering key questions related to diseases and advancing personalized medicine. Future developments include advanced simultaneous assays for three [59] or more omics modalities, and more solutions for preserved samples in order to enhance the practicality of wet-lab work and the possibility to study large clinical cohorts. Also, there remain challenges in relation to data storage, management and analysis. In terms of data storage and management, there are a few efforts to aggregate multi-omics data in bulk setups [34], [129], [130], [131], [132]. So far, however, there is no unified platform that gathers multi-modal single-cell omics data in one repository, apart from some efforts under the Human Cell Atlas project [133]. Therefore, gathering the growing body of multi-modal single-cell data in a unified repository would facilitate collaborative work on computational multi-omics analysis. In terms of cluster analytics, multi-modal single-cell analysis has already benefited from recently advanced multi-view machine learning methodologies [54], [134], [135], and these will continue to advance the computational analysis of single-cell multi-omics data.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
