Literature DB >> 32637041

Single-cell ATAC sequencing analysis: From data preprocessing to hypothesis generation.

Abstract

Most genetic variations associated with human complex traits are located in non-coding genomic regions. Therefore, understanding the genotype-to-phenotype axis requires a comprehensive catalog of functional non-coding genomic elements, most of which are involved in epigenetic regulation of gene expression. Genome-wide maps of open chromatin regions can facilitate functional analysis of cis- and trans-regulatory elements via their connections with trait-associated sequence variants. Currently, Assay for Transposase Accessible Chromatin with high-throughput sequencing (ATAC-seq) is considered the most accessible and cost-effective strategy for genome-wide profiling of chromatin accessibility. Single-cell ATAC-seq (scATAC-seq) technology has also been developed to study cell type-specific chromatin accessibility in tissue samples containing a heterogeneous cellular population. However, due to the intrinsic nature of scATAC-seq data, which are highly noisy and sparse, accurate extraction of biological signals and devising effective biological hypothesis are difficult. To overcome such limitations in scATAC-seq data analysis, new methods and software tools have been developed over the past few years. Nevertheless, there is no consensus for the best practice of scATAC-seq data analysis yet. In this review, we discuss scATAC-seq technology and data analysis methods, ranging from preprocessing to downstream analysis, along with an up-to-date list of published studies that involved the application of this method. We expect this review will provide a guideline for successful data generation and analysis methods using appropriate software tools and databases for the study of chromatin accessibility at single-cell resolution.

Entities: Chemical Disease Gene Species

Keywords: ATAC sequencing; Chromatin accessibility; Single-cell ATAC sequencing; Single-cell RNA sequencing; Single-cell biology

Year: 2020 PMID： 32637041 PMCID： PMC7327298 DOI： 10.1016/j.csbj.2020.06.012

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN： 2001-0370 Impact factor: 7.271

Introduction

Assay for Transposase Accessible Chromatin with high-throughput sequencing (ATAC-seq) was designed to identify open chromatin regions in the genome [1]. Due to the use of hyperactive Tn5 transposase, which simultaneously tags and fragments DNA sequences in open chromatin regions, ATAC-seq requires shorter sample preparation times and fewer number of cells for high quality profiling of chromatin accessibility compared to other existing methods [1]. With emergence of single-cell biology and adaptation of various sequencing-based omics technologies, the study of chromatin accessibility at single cell resolution became possible owing to the development of single-cell ATAC sequencing (scATAC-seq). However, computational analysis of scATAC-seq data remains challenging. Moreover, a wide range of potentially functional elements within the accessible genomic regions add more complexity for interpretation of scATAC-seq data, if they are not well understood. Recently, computational algorithms and software tools for scATAC-seq data analysis have been developed. However, algorithmic approaches and parameters for each step of the data analysis pipeline must be carefully selected for reliable translation of chromatin accessibility information into novel biological hypotheses. In this review, we aim to elaborate the overall workflow of scATAC-seq data analysis (Fig. 1) from data preprocessing to various downstream analyses, including integration with other types of genetics and genomics data. The analyses of sequencing read data from scATAC-seq are subject to initial data preprocessing, which is similar to those of other next-generation sequencing data [2]. Sequence files are processed with software tools widely used for quality control of sequence information, read mapping to the reference genome, and identification of read peaks that may indicate open chromatin regions [3], [4]. The generation of cell-by-feature matrix is critical for scATAC-seq data analysis and this is facilitated by the various options available for defining genomic features [5]. The preprocessed data are then used for downstream analysis to elucidate networks among cis-regulatory elements, such as promoters and enhancers, and trans-regulatory elements, such as transcription factors (TFs). Gene activity and accessibility to genetic variants can also be analyzed using scATAC-seq data [6]. Moreover, scATAC-seq can be integrated with single-cell RNA sequencing (scRNA-seq) data [7] and other omics data for multi-omics studies.

Fig. 1

Schematic overview of a typical single-cell ATAC sequencing analysis workflow.

Single-cell ATAC sequencing technologies

Within two years of the development of bulk ATAC-seq technology, two different strategies of single-cell adaptations were introduced: split-and-pool combinatorial cellular indexing such as sci-ATAC-seq [8] and microfluidics approach such as using integrated fluidics circuit (IFC) [9] (Fig. 2). In sci-ATAC-seq, the nuclei of lysed cells are placed in 96-well plates with uniquely barcoded transposases and pooled back together before dispensing into a second 96-well plate using fluorescence-activated cell sorter (FACS). The second barcodes are introduced during amplification. By recognizing unique combinations of two barcodes, sciATAC allows sequencing of about 1500 cells with median reads of 2500 and ~11% collision rate. In contrast, IFC scATAC-seq utilizes a Fluidigm C1 device to capture single cells and to perform transposition and PCR on IFC. While this method can obtain more than 70,000 reads per cell, only up to 96 cells can be processed in parallel. Another microfluidics-based scATAC-seq using 10x Genomics Chromium device is recently gaining popularity. Chromium system captures single transposed nucleus in a Gel bead-in EMulsion (GEM), which involves addition of unique barcodes to DNA fragments [6]. The scalability and high throughput of GEM, combined with the intuitive software called Cell Ranger ATAC, allows for scATAC-seq study on large number of cells.

Fig. 2

Schematic summary of two major strategies for single-cell adaptation of ATAC sequencing library generation: (a) split-pool cellular indexing and (b) microfluidics-based, and (c) their modified methods. Since the initial single-cell adaptations of ATAC-seq technology with cellular indexing and microfluidics, various modifications and improvement have been added to them. Protein-indexed scATAC-seq (Pi-ATAC) [10] profiles protein epitopes in parallel with DNA transposition to quantify protein expression and chromatin accessibility of the same individual cell. The small molecule inhibitor Pitstop2 (scip-ATAC-seq) [11] improves the efficiency of transposase entry into the nucleus and thus enhances library complexity and resolution. Transcript-indexed ATAC-seq (T-ATAC-seq) [12] using microfluidic devices allows sequencing of T cell receptor-encoding genes with ATAC-seq. Perturb-ATAC [13] adds CRISPR single guide RNA (sgRNA) after transposition and sequences both the sgRNA and ATAC DNA to study the relationship among factors regulating chromatin accessibility. Plate-based scATAC-seq facilitates high library complexity with lower amounts of mitochondrial DNA and higher fraction of reads in peaks (FRiP), along with bulk Tn5 tagging and single-nuclei sorting [14]. By combining cellular indexing with microfluidics, droplet microfluidics scATAC-seq with cellular indexing (dsciATAC-seq) [15] maintains read depth of microfluidics-based scATAC-seq while increasing cellular throughput in parallel. Nano-well ATAC-seq (μATAC-seq) employs ICELL8 platform to offer single cell sequencing with high throughput and low library preparation costs [16]. Nevertheless, it is important to consider the availability of experimental devices, compatibility with analysis software, required read depth and cellular throughput, and overall purpose of study before selecting scATAC-seq technologies.

Data preprocessing

Before generating biological hypotheses with downstream analysis, scATAC-seq data must undergo preprocessing steps for accurate interpretation. Preprocessing of scATAC-seq data starts from demultiplexing of sequence files and removal of low-quality cells. Genomic regions used for cell-by-feature matrix, data transformation methods, dimension reduction (DR) approaches, and clustering methods for annotation of cell identities must be carefully selected. Additionally, batch effects must be removed if necessary. Since there is no magic bullet in data analysis, comparisons of multiple methods with complementary algorithms is necessary for obtaining the best result from a given dataset. In Table 1, we summarize 13 software packages available for scATAC-seq data analysis: ChromVAR [17], SCRAT [18], scABC [19], Cicero [20], Scasat [21], cisTopic [22], snapATAC [23], epiScanpy [24], Destin [25], SCALE [26], scATAC-pro [27], Signac [7] and ArchR [28]. Although varying in capability of downstream analysis, they all include unique preprocessing steps. Recently, many of these tools were also evaluated based on the performances in accurately identifying cell types with clustering results [5].

Table 1

Summary of scATAC-seq analysis software packages.

Tool	Platform	Feature Matrix	Preprocessing	Clustering	DAR	Motif/k-mer	Gene activity	Co-accessibility	Trajectory	Pathway	Enrichment analysis	scRNA integration	Reference
ChromVAR	R	TF motifs, k-mer	O	O	X	O	X	X	X	X	X	X	[17]
SCRAT	R/Web	Selectable feature	O	O	O	X	X	X	X	X	X	X	[18]
scABC	R	Peak	O	O	X	O (ChromVAR)	X	X	X	X	X	X	[19]
Cicero	R	TSS	O	O	O	X	O	O	O	X	X	X	[20]
Scasat	Python/R	Peak	O	O	O	X	X	X	X	O (GREAT)	X	X	[21]
cisTopic	R	Peak	O	O	X	X	O	X	X	O	O	X	[22]
snapATAC	Python/R	Bin, peak	O	O	O	O (ChromVAR, Homer)	O	X	X	O (GREAT)	X	O (Seurat)	[23]
epiScanpy	Python	Peak	O	O	X	X	X	X	X	X	X	X	[24]
Destin	R	Peak	O	O	O	X	X	X	X	X	O	X	[25]
SCALE	Python	Peak	O	O	O	O (ChromVAR)	X	X	X	X	X	X	[26]
scATAC-pro	Python/R	Peak	O	O	O	O (ChromVAR)	O	O (Cicero)	X	O (GREAT)	X	X	[27]
Signac	R	Peak	O	O	O	O (ChromVAR)	O	X	X	X	X	O (Seurat)	[7]
ArchR	R	Bin, peak	O	O	O	O (ChromVAR), TF footprinting	O	O	O	X	O	O (Seurat)	[28]

Tools used in junction are indicated in parentheses.

Summary of scATAC-seq analysis software packages. Tools used in junction are indicated in parentheses.

Preprocessing of sequencing reads

If multiple samples are indexed and sequenced in a single reaction through multiplexing, they need to be demultiplexed based on the index adapter sequence by software packages, such as Illumina’s bcl2fastq. Demultiplexed sample files are then processed by adaptor trimming, in which adaptor and primer sequences are trimmed off by Bowtie2 [3] or Trimmomatic [29]. Trimmed reads are then aligned to the genome of the same species to the prepared sample using Bowtie2 [3] or BWA [30] and sorted with Samtools [31].

Quality control

After processing sequencing read data, barcodes that are corresponding to low quality cells or doublets must be filtered out. Generally, quality control (QC) criteria for most of single-cell sequencing technologies are based on the read counts (count depth) and feature counts per barcode [32]. Barcodes with either a low count depth or too high count depth are considered to be low quality cells or doublets, respectively. The same can be applied to the feature counts. However, utilizing unique characteristics of scATAC-seq data may lead to more adequate QC. For example, fraction of reads in peaks (FRiP), ratio of reads in promoter regions, ratio of reads in blacklist sites, or enrichment of transcription start sites (TSS) are often used for barcode selection [9], [23], [28]. Barcodes that do not show nucleosomal banding patterns that are unique to high-quality ATAC-seq data are also excluded [8], [33]. In addition to barcodes, features (e.g., peaks) that are located in blacklist regions or house-keeping genes can be filtered out [23]. It is important to remember that there is no absolute QC standard fitting for all samples. Therefore, combinations of QC criteria must be carefully chosen depending on characteristics of the samples such as overall structure of data, heterogeneity, possibly exiting cell types, batches, or sequencing platforms.

Cell-by-feature matrix formation

The cells that have passed the QC are selected to generate a cell-by-feature matrix for downstream analysis. A major factor that diversifies the data matrices is defined by the genomic regions from raw peak reads and annotation of the defined regions using regulatory elements. While majority of the pipelines employ a single combination of defining and annotating genomic regions, some pipelines adapt various suited matrices for different purposes of downstream analysis. Largely, the definition of genomic regions can be classified by the use of sample-specific information and feature annotation can be varied with regulatory elements of interest. The use of sample-specific information includes utilization of bulk ATAC-seq peaks from public data or analyzing aggregated or merged peaks from scATAC-seq data [17], [19], [20], [21], [22], [34]. Single cell aggregation can be carried out using cells from the entire sample or from every cluster (which represents a distinct cell type) obtained from initial temporary clustering results [23]. In most cases, MACS2 [4] is used for peak identification. The other definition of genomic regions is fixed-size bins or windows of the genome along with scores based on the relative abundance of sequence reads for the regions [23], [33]. After defining genomic regions by either peaks or bins/windows of fixed-size, regulatory elements, such as TF motifs and TSS, are used to generate cell-by-feature matrix. Since motifs and k-mers for TF binding are specific for cell types, cell type annotation based on the information is included in some data analysis pipelines [17], [18], [35]. The genomic regions are annotated with either the known TF motifs from public databases, such as cisBP [36], JASPAR [37], and HOMER [38], or k-mers for unsupervised annotation using motifmatchr [39]. Moreover, accessibility to TSS can be used as cell type-specific features [20]. Frequently, these genomic features are combined together to form a feature set for accurate analysis of cell heterogeneity [18]. Some tools simply merge nearby peaks or use them directly as features for matrix formation without annotation of genomic elements [22], [23].

Batch correction and data integration

When we analyze scATAC-seq data of multiple batches collectively, non-biological factors such as technical variance can lead to wrong biological hypotheses. Batch effects can occur by differences in experimenters, sample preparation protocols, sample harvest time, sequencing lanes, and sequencing technologies [32], [40]. Batch effect correction for scATAC-seq data are often indirectly carried out without specific computational tools. With careful examination, batch-specific peaks or features can be removed [21], [22], [26]. Batch effects are often corrected during other preprocessing steps such as selecting variable peaks or dimension reduction [6], [28], [33]. Batch effects of single-cell omics data can be more systematically corrected with data integration approaches based on non-linear algorithms. These methods assume that all batches share at least one cell type with another and differences between batches are smaller than those between cell types [40]. However, these methods may also remove biological variance, thus resulting in overcorrection. Therefore, both capability of batch removal and conservation of biological variation need to be considered [41]. Although there is no designated tool for integrating scATAC-seq data, those developed for scRNA-seq may be utilized. A benchmarking study of the data integration tools with atlas-level scATAC-seq data showed that most of them performed poorly, which may be attributable to the sparsity and binary nature of the data [41]. Nevertheless, Harmony [42], Seurat v3 [7] and scVI [43] showed best trade-off between batch removal and conservation of biological variation for integrating scATAC-seq data in the benchmarking study. Data integration tools for batch correction can also be used for integrating multi-modal single-cell omics data (e.g., integration of scRNA-seq and scATAC-seq data generated from the same tissue source). They are further described in the later part of this review.

Data transformation

Though various experimental technologies are being attempted to increase the sequencing output, peak reads from a single cell have been reported to represent only about 1~10% of overall detectable peaks in scATAC-seq analysis [5]. Therefore, instead of using the initial cell-to-feature matrices directly for downstream analysis, data transformation can be applied to compensate for the limitation from data sparsity. Due to the binary nature of scATAC-seq profiles (1 for presence and 0 for absence of sequence read, respectively), classical text mining methods of topic modeling can be used for data transformation [22], [33]. Term-frequency inverse-document-frequency (TF-IDF) method transforms a cell-to-feature matrix to give more weight to rarer peaks in the cell population [33]. The transformed data matrix tends to capture peaks that are more variable (i.e., more informative) for distinct cell types. Jaccard distances can also be used to measure dissimilarity of two cells in accessibility matrix to signify unique peaks in one cell against all the other peaks [21]. Based on the assumption that higher sequencing depth attributes to the better capture of important features, some methods weigh features of each cell by its sequencing depth [19].

Dimension reduction, visualization and clustering

After transforming data to overcome inherit sparsity, the cell-by-feature matrix undergoes DR which can mitigate redundant information and potential noise of high dimensional data, and may reduce the computational time for downstream analysis [5]. Principle Component Analysis (PCA) is a widely used linear DR technique and the number of principal components to be chosen is determined based on the elbow of scree plot analysis or Jackstraw test [44]. Topic modeling methods (e.g., cisTopic) reduce dimensions of feature matrix by choosing top topics based on topic-cell distribution generated by latent Dirichlet allocation (LDA) [22]. While LDA is relatively time-consuming, it can capture cell type-specific characteristics which might improve clustering accuracy [5]. Latent Semantic indexing (LSI) is performed by using TF-IDF followed by singular value decomposition (SVD) [33]. Multidimensional scaling (MDS) is also used to reduce dimensions based on the profile similarity among cells [21]. Diffusion map is a nonlinear method of DR processing and it tends to be robust to sequencing noise [23]. While some data analysis pipelines omit a linear DR step, its application is shown to improve overall clustering results during downstream analysis [5]. Overall, the results from these DR methods are used as input for both visualization and clustering. In order to visualize the data in a 2- or 3-dimensional space, non-linear DR techniques such as t-distributed stochastic neighborhood embedding (t-SNE) [45] and uniform manifold approximation and projection (UMAP) [46] are used. These techniques are often called as embeddings. UMAP visualization tends to preserve global structures better while t-SNE visualization preserves local neighborhoods [46]. However, there are still debates over which methods need to be used for single cell analysis, and frequently the choice of method depends on the properties of each dataset and the data preprocessing method used. Therefore, it is highly recommended to apply multiple visualization methods for given datasets and make a choice based on the results obtained. Cells with similar accessibility profiles can be organized into clusters. For scATAC-seq data, there are several clustering methods which are widely used: hierarchical, k-means, k-medoids, and Louvain algorithm. Hierarchical clustering is useful for understanding overall relationships among clusters and the result is often visualized with a dendrogram to capture the hierarchical relationship. K-means and k-medoids clustering use parametric algorithms with predetermined number of clusters. K-medoids clustering is known to be more robust to noise but requires more computational power. Louvain clustering is a graph-based method which often takes the results of k-nearest neighbor (KNN) method as input data [7], [23]. Some analytic tools might have preferred methods for clustering but in most cases they are interchangeable. Recent benchmark results for clustering scATAC-seq data showed the most favorable results with Louvain clustering [5].

Downstream analysis for hypothesis generation

The main purpose of the single-cell omics study is to generate biological hypotheses about distinct subsets of complex mixtures of heterogeneous cellular population. Thus, downstream analysis generally begins with assigning cellular identity of the clusters obtained from the preprocessed scATAC-seq data. Peak calling is often repeated for each cluster to identify accessible chromatin regions for distinct cellular populations, which are then subject to a statistical test for association with various pre-defined genomic features, such as, cis- and trans-regulatory elements and genetic variants, such as, disease-associated SNPs. Downstream analysis methods mainly aim to uncover novel regulatory elements and to understand their functional roles in a cell type-specific manner. In addition, the dynamics of chromatin accessibility during cellular development can be studied during downstream analysis.

Cell identity annotation

For the analysis of single-cell omics data, cell identity annotation of clusters is preliminary yet must be carried out with care. Incorrect cell identity information can mislead to a wrong biological hypothesis during downstream analysis of scATAC-seq data. While there are a number of tools for automated cell type annotation for scRNA-seq data [47] and an extensive list of cell type-specific genes are available from various databases [32], [48], there are only a limited set of tools for scATAC-seq data and a few reference datasets for cell type specific chromatin accessibility [33]. Therefore, combined use of complementary approaches for cluster annotation is necessary for scATAC-seq data. Largely, there are two approaches to cell identity annotation; the first one is based on feature annotation of ATAC peaks, and the second one utilizes integration with reference scRNA-seq data. After cells are assigned to distinct clusters based on profile similarities, each cluster can have Differentially Accessible Regions (DARs) which might contain various regulatory elements. The first approach to cell identity annotation can vary in genomic features to be used for identification of such cluster-specific peaks. Supervised or manual annotation of cluster identity requires databases or literature references of cell type specific genomic features, such as TF motifs, enhancers, promoters, and TSSs [6]. Promoters and TSSs are most widely used for cluster annotation due to the extensive list of cell type-specific genes. In simpler approaches, accessibility to the cell type-specific genes can be defined by the existence of ATAC peaks within certain distance from upstream of cell type-specific promoters or TSSs. More advanced analysis takes various distal and proximal regulatory elements into consideration. ‘Gene activity scores’ weigh co-accessible elements to a gene’s promoter region differently to infer gene expression from chromatin accessibility profiles more accurately [33]. As a result, gene activity scores correlate better with gene expression profiles than simple profiles of promoter accessibility [33]. A software called Garnett also employs gene activity scores and a priori profile with known cell types along with their marker genes for supervised classification of cell types [49]. The second approach takes advantage of extensively available scRNA-seq data for diverse cell types. Gene expression matrix from scRNA-seq data can be integrated with gene activity score matrix from scATAC-seq data for the same cell types. After projecting them onto the maximally correlated dimensions, mutual nearest neighbors (MNN) algorithm is used to transfer cell-labels from the scRNA-seq data to the scATAC-seq data [7], [33]. While samples with a highly dominant cell type or non-matching cell types to the other omics data show limitations in accuracy, overall results of cell identity annotations are concordant with matching datasets [33]. With semi-supervised identification of cell populations in scATAC-seq data (SSIPs), existing reference scRNA-seq and bulk ATAC-seq data are used to generate a network of scATAC-seq data for the sample of interest, with reference cells from external data sources to transfer cell-labels [50].

Study of chromatin accessibility dynamics

Annotated clusters proceed to the study of chromatin accessibility dynamics. Hypotheses about regulation of cellular development can be generated using various genomic elements associated with DARs, pseudotime-dependent changes, and co-accessibility. DAR analysis is used to identify regulatory elements specific for each cell type. In general, cell type-specific DARs are identified by comparing chromatin accessibility in cells for a particular cluster with all the other cells in the dataset. Various statistical tests have been employed for DAR analysis, including a binomial test [33], negative binomial generalized linear model [20], a Wald test [19], Fisher’s exact test [23], unequal variances t-test [17], and information gain [21] along with 1% or 5% false discovery rate (FDR) adjustment with Benjamini-Hochberg [6], [23], [33] or Bonferroni correction [21]. Single-cell trajectory analysis utilizes pseudotemporal ordering of cells to reconstruct differentiation processes or cell lineages. Trajectory analysis is useful if chromatin accessibility changes continuously within cell population. Cicero [20] is an extension of Monocle2 [51], a widely used trajectory analysis tool for scRNA-seq data, for scATAC-seq data. Nearby peaks are aggregated for dealing with sparsity and DARs are selected to define temporal states. After cells are ordered in pseudotime using DDRTree [52] method, accessibility kinetics at selected genomic regions can be depicted. STREAM [53] is a trajectory analysis tool that can handle both transcriptomic and epigenomic data. For analyzing scATAC-seq data, k-mer score matrix in accessibility variable regions is used to construct pseudotime trajectories. The strength of STREAM lies in an unbiased end-to-end pipeline starting with unprocessed raw data files. Trajectory analysis with such tools can be used for identification of cell type-specific regulatory elements associated with cellular development from one cell type to another [6], [20], [54], [55], [56]. For example, if accessibility of TF motifs changes significantly during differentiation process, the matching TFs can be studied further for their involvement in activation or initiation of differentiation [11], [53], [57]. Interactions between different genomic elements are important for understanding regulatory networks. Such interactions can be analyzed with co-accessibility of different genomic regions. Cicero groups similar cells to generate cell accessibility matrix to calculate covariance between each pair of genomic elements in overlapping genomic windows. Co-accessibility is used for analysis of interactions between TSS and enhancers [8], [11], [57], [58], [59], promoters [20], and other genomic elements.

TF motif-based hypothesis generation

TFs are major trans-acting regulators of gene expression. Analysis of scATAC-seq enables identifying specific TFs for different cell types within heterogeneous cell population [17]. Since TFs are highly involved in the developmental process, the analysis of cell-to-cell variation of TF expression will facilitate understanding of their roles during cellular differentiation [35]. Furthermore, scATAC-seq allows for simultaneous analysis of cis-regulatory elements that are associated with the activities of relevant TFs. Study of TFs with scATAC-seq data requires both software packages and databases for TFs and their binding motifs. Initially, methods for scATAC-seq data analysis mainly utilized the known TF motifs [8], [9]. Though not invented solely for analysis of scATAC-seq data, bioinformatics tools, such as, Homer [38] and FIMO [60], are useful in identifying TF motifs within open chromatin regions. A software package chromVAR, which was developed for scATAC-seq analysis, is a widely used algorithm for calculating bias-corrected deviations and z-scores of TF motifs and k-mers [17]. TFs related to various cell types, such as immune cells [12], cardiac progenitor cells [55], and neuronal cells [61], have been analyzed with chromVAR deviations and z-scores. Furthermore, TF motif accessibility can be compared with TF expression values calculated from scRNA-seq data [62]. For identification of cell type specific TFs and prediction of cell types from those TF motifs, several models, such as convolutional neural network [33] and random forest classification [57], can be used.

Gene-based hypothesis generation

scRNA-seq has been widely used for studying gene expression profiles of heterogeneous cell populations [63]. Gene expression can be inferred from chromatin accessibility information at TSSs, gene body, and other regulatory elements. TSSs and transcription termination sites of active genes are located at open chromatin regions or nucleosome-depleted regions [64] and so, accessibility profiles at TSSs can be utilized for gene-based downstream analysis of scATAC-seq data. UROPA [65] can assign TSSs to scATAC-seq peaks using genomic annotation databases. Peaks annotated by TSSs can be used for further analysis, such as comparison of opening and closing of chromatin at TSSs [55], calculation of TSS gene set deviation [58], and identification of chromatin accessibility at known marker genes to identify cell types and states [61], [66], [67]. However, considering only the chromatin states of TSSs might not fully indicate gene expression [20], calculation of ‘gene activity score’, which takes information from regulatory elements, can improve translation of accessibility information into gene expression [20]. Cicero gene activity scores consider accessibility at sites proximal and distal to TSS of a gene and weigh them by their co-accessibility. Gene activity scores have been used to compare TF motif accessibility to TF gene activity scores from the same scATAC-seq data [11], to annotate cells using cell type specific marker genes [6], [68], and to transfer cell-labels from scRNA-seq datasets to matched scATAC-seq datasets [69]. Lastly, for visualization of DARs at gene bodies, Deeptools [70] and MACS2 [4] generates bigwig files, which can be displayed with genomic browsers, such as Gviz [71], Integrative Genomics Viewer (IGV) [72], and UCSC Genome Browser [73]. Gene set enrichment analysis for a distinct cell population is useful for identifying pathways relevant to the cell identity. Gene Ontology (GO) [74] and KEGG [75] are the most widely used databases for pathway gene sets. Pathways associated with a cell population are analyzed based on genes associated with cell type-specific accessible (peak) regions. Peaks within upstream and downstream extension of gene body [61], [76], TSS [55], or with gene activity scores [57] are used as input data for pathway analysis. Various gene set enrichment tools, such as GREAT [77] or clusterProfiler [78], can be applied to scATAC-seq data.

Enhancer-based hypothesis generation

Enhancers are cis-regulatory elements distal from their regulatory target genes. Proximal and distal interactions of enhancers with other regulatory elements have been identified by analyzing 3D structures of chromatin [79]. Moreover, enhancer dense regions, called super-enhancers, are known to be cell type and state specific [80] and are involved in disease-associated regulatory nodes [81]. Studies on enhancers at single cell resolution are useful in predicting specific cell types, as it has higher accuracy than other cis-regulatory elements and transcriptomes [82]. Various studies have focused on identifying cell type-specific enhancers and their involvement in developmental processes. The most common types of enhancer analyses include identification of cell type-specific distal and proximal enhancers [76] and relative enrichment of enhancer activities [54], [83], [84]. Notably, various enhancer databases, such as VISTA [85], CRM Activity Database (CAD) [54], Redfly Enhancer [86], and Vienna Tiles library [87], can be utilized for such analysis. Furthermore, evaluating the interactions of enhancers to promoters or genes with co-accessibility [57], paired scRNA-seq data [88], virtual latent space [69], and Activity-by-Contact model [83] have been suggested in several data analysis pipelines.

Hypothesis generation with disease-associated genetic variants

Disease-associated SNPs detected via genome-wide association study (GWAS) and expression quantitative trait loci (eQTL) analysis are useful resources for understanding genomic regulation in diseases. Since most SNPs are located in non-coding regions [68], it is anticipated that many GWAS SNPs and eQTLs are associated with cis-regulatory elements; therefore, the study of open chromatin regions are useful in identifying their functional effects [89], [90]. In addition, the identification of cell types relevant to disease-associated variants is crucial for in-depth understanding of these variants [91]. Using scATAC-seq, genetic variants can be linked to their cellular and functional targets through the identification of both DNA sequences and chromatin accessibility of regulatory elements at single cell resolution. While relating various epigenetic features to GWAS signals through various bulk sequencing methods have provided useful results, single cell resolution analysis additionally enables us to overcome the limitations imposed by cell-type heterogeneity [33]. Indeed, several studies have demonstrated the importance of providing enrichment profiles of GWAS SNPs in cell-type specific peaks [25], [33], [67], [92]. The modified version of chromVAR, called gchromVAR, scores each single cell for GWAS enrichment to identify causal variants in genomic regions and putative target genes of those variants in a cell type specific manner [92]. By utilizing co-accessibility measurements, interconnected peaks overlapping GWAS SNPs and GTEx eQTL to other peaks containing regulatory elements can be analyzed [6]. GREGOR [93] is also used to annotate enrichment of disease-associated SNPs from various databases [67]. More complex models using deep learning and machine learning framework to identify cell type-specific functional SNPs and associated novel functional genes were also implemented in some recent studies [67], [68].

Integrative analysis with single-cell transcriptome data

Integration of single-cell gene expression and chromatin accessibility data may improve cell identity annotation. More importantly, joint analysis of multimodal data will facilitate detecting correlations between trans- and cis-regulatory elements underlying the cellular state of interest. Integrative analysis of single-cell transcriptome and chromatin accessibility can be achieved by both experimental and computational approaches (Fig. 3).

Fig. 3

Integration of single-cell ATAC sequencing data with single-cell RNA sequencing data via experimental approaches and computational approaches. Integrative analysis of gene expression and chromatin accessibility for the same cell types can be used for confirming cell identity annotation and for facilitating generating new hypotheses for regulatory elements. For example, identification of peak-to-gene interactions can infer enhancer-promoter interactions; comparison between expression of a gene and accessibility of its TF-enriched regions across pseudotime can reveal kinetic relationship between transcription and regulatory regions; comparison between expression of a gene and accessibility of its TF-enriched regions across cell types or sample groups can reveal expression and accessibility signature associated with a cell type or subpopulation. Experimental approaches to integrative analysis focus on obtaining transcriptome and epigenome data from the same cells simultaneously. The multimodal single-cell analysis method sci-CAR employs combinatorial indexing method for both scRNA-seq and scATAC-seq to increase throughput [94]. Another method, single-cell chromatin accessibility and transcriptome sequencing (scCAT-seq), separates cytoplasm components and nuclei for scRNA-seq and scATAC-seq, respectively [95]. Single-Nucleus chromatin Accessibility and mRNA Expression sequencing (SNARE-seq) method utilizes linked barcodes for capturing both gDNA from transposed DNA fragments and mRNA from a nucleus in a single droplet for parallel sequencing using the same barcodes for each cell [96]. There is a method that involves cell fixation with chemical reagents, followed by bulk transposition for single cell sorting to reduce the cost and simplify overall procedures [97]. Using multimodal single-cell technologies, chromatin accessibility can be directly compared to gene expression for understanding the functional relationships between cis/trans-regulatory elements and associated gene expressions. At present, there are algorithmic approaches for computational integration of single-cell genomics data derived from different sample groups, experiments, or even technologies. Methods based on non-negative matrix factorization (NMF), such as CoupledNMF [62] and LIGER [98], have been proven useful in multimodal single-cell data integration. Seurat v3 is a widely used method for scRNA-seq and scATAC-seq integration [7]. Seurat v3 integrates multimodal single-cell data by projecting two different datasets into a sub-space defined by correlated variables and then identifying anchors between datasets. Harmony is a fast and scalable algorithm of single-cell data integration based on iterative adjustment of data-specific clusters [42]. Recently, more approaches for data integration were reported, including the maximum mean discrepancy manifold alignment (MMD-MA) algorithm [99] and De-Convolution and Coupled-Clustering (DC3) [100]. The single-cell multi-omics integration has been used for validation of cell identity assignments [57], [58], [69], linking differentially expressed genes (DEGs) to DARs for inference of enhancer-promoter (E-P) interactions [88], observation of a trend for accessibility of enhancers predicted by TF-motif to precede changes in gene expression [83] and identification of conserved chromatin accessibility and transcription across cell types or sample groups [101].

Conclusion and outlook

Despite potentially wide applications in the study of cellular systems, a relatively high cost of single-cell sequencing technologies and high complexity of the data might limit accessibility to single-cell biology for many researchers. However, there have been many community-wide efforts for improving both experimental and computational methods of single-cell omics, including scATAC-seq data analysis. While a reasonable consensus in data analysis pipelines has not been achieved yet, the number of publications on data generation technology and data analysis methods for scATAC-seq are growing exponentially during recent times. Benchmarking studies utilizing different methods for data generation and analysis would provide useful information to the community for establishing the best practices of scATAC-seq data analysis [27]. Moreover, integration with other types of single-cell and bulk omics data, as well as genomic variation data, will greatly potentiate scATAC-seq studies aimed at elucidating complex circuits of gene regulation involved in disease progression. Especially, integration of scATAC-seq with other epigenetic technologies, such as ChIP-seq and Hi-C, will unravel 3D chromatin structures [68], [102]. Such integrative multimodal analysis will facilitate identification of key regulators involved in disease progression, which are often potential therapeutic targets and biomarkers for diagnosis. Conclusively, we anticipate that scATAC-seq will promote a holistic view of epigenetic regulation and regulatory networks involved in the development of normal cells and disease progression in human and other multi-cellular organisms.

CRediT authorship contribution statement

Seungbyn Baek: Conceptualization, Writing - original draft, Writing - review & editing. Insuk Lee: Conceptualization, Writing - review & editing, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

15 in total

Review 1. Transcription factor chromatin profiling genome-wide using uliCUT&RUN in single cells and individual blastocysts.

Authors: Benjamin J Patty; Sarah J Hainer
Journal: Nat Protoc Date: 2021-04-28 Impact factor: 13.491

Review 2. Ontogenetic rules for the molecular diversification of hypothalamic neurons.

Authors: Marco Benevento; Tomas Hökfelt; Tibor Harkany
Journal: Nat Rev Neurosci Date: 2022-07-29 Impact factor: 38.755

3. scFeatures: multi-view representations of single-cell and spatial data for disease outcome prediction.

Authors: Yue Cao; Yingxin Lin; Ellis Patrick; Pengyi Yang; Jean Yee Hwa Yang
Journal: Bioinformatics Date: 2022-10-14 Impact factor: 6.931

Review 4. Breaking the Immune Complexity of the Tumor Microenvironment Using Single-Cell Technologies.

Authors: Simone Caligola; Francesco De Sanctis; Stefania Canè; Stefano Ugel
Journal: Front Genet Date: 2022-05-16 Impact factor: 4.772

Review 5. Understanding the Adult Mammalian Heart at Single-Cell RNA-Seq Resolution.

Authors: Ernesto Marín-Sedeño; Xabier Martínez de Morentin; Jose M Pérez-Pomares; David Gómez-Cabrero; Adrián Ruiz-Villalba
Journal: Front Cell Dev Biol Date: 2021-05-12

6. Single Cell Technologies to Dissect Heterogenous Immune Cell Therapy Products.

Authors: Katherine Mueller; Krishanu Saha
Journal: Curr Opin Biomed Eng Date: 2021-09-15

7. Cell type-specific analysis by single-cell profiling identifies a stable mammalian tRNA-mRNA interface and increased translation efficiency in neurons.

Authors: William Gao; Carlos J Gallardo-Dodd; Claudia Kutter
Journal: Genome Res Date: 2021-12-02 Impact factor: 9.438

Review 8. Bench-to-Bedside in Vascular Medicine: Optimizing the Translational Pipeline for Patients With Peripheral Artery Disease.

Authors: Tom Alsaigh; Belinda A Di Bartolo; Jocelyne Mulangala; Gemma A Figtree; Nicholas J Leeper
Journal: Circ Res Date: 2021-06-10 Impact factor: 23.213

Review 9. Annotating the Insect Regulatory Genome.

Authors: Hasiba Asma; Marc S Halfon
Journal: Insects Date: 2021-06-29 Impact factor: 2.769

10. ShinyArchR.UiO: User-friendly, integrative and open-source tool for visualisation of single-cell ATAC-seq data using ArchR.

Authors: Ankush Sharma; Akshay Akshay; Marie Rogne; Ragnhild Eskeland
Journal: Bioinformatics Date: 2021-09-29 Impact factor: 6.937