
Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data.

George C Linderman¹, Manas Rachh¹, Jeremy G Hoskins¹, Stefan Steinerberger², Yuval Kluger³,⁴.

Abstract

t-distributed stochastic neighbor embedding (t-SNE) is widely used for visualizing single-cell RNA-sequencing (scRNA-seq) data, but it scales poorly to large datasets. We dramatically accelerate t-SNE, obviating the need for data downsampling, and hence allowing visualization of rare cell populations. Furthermore, we implement a heatmap-style visualization for scRNA-seq based on one-dimensional t-SNE for simultaneously visualizing the expression patterns of thousands of genes. Software is available at https://github.com/KlugerLab/FIt-SNE and https://github.com/KlugerLab/t-SNE-Heatmaps .

Substances:
Genetic Markers
RNA

Year: 2019 PMID: 30742040 PMCID: PMC6402590 DOI: 10.1038/s41592-018-0308-4

Source DB: PubMed Journal: Nat Methods ISSN: 1548-7091 Impact factor: 28.547

Main

scRNA-seq enables high-throughput transcriptome profiling at the individual cell level and is increasingly being used to study cell-to-cell heterogeneity in both physiologic and disease processes. Data visualization techniques have played a pivotal role in both analyzing the expression of different marker genes in known cell populations and in identifying new cell types. Over the last decade data visualization using t-SNE has become a cornerstone of scRNA-seq analysis. t-SNE is used to embed a scRNA-seq dataset into a low-dimensional space such that proximal pairs of single cells in the high-dimensional transcriptome space remain proximal in the low dimensional space. The embedding is often colored by the expression levels of a gene of interest, one gene at a time. Several difficulties arise when applying t-SNE to scRNA-seq data. The number of cells profiled in scRNA-seq experiments has been growing exponentially,[1] with recent datasets measuring the expression of 30,000 genes in over 1,000,000 cells.[2] Profiling such large numbers of cells facilitates the characterization of rare and moderately-sized subpopulations not apparent in smaller samples. However, existing algorithms for constructing t-SNE embeddings are computationally expensive, often necessitating downsampling of the cells prior to running t-SNE, which can in turn result in rare cell populations being missed. Furthermore, removal of the few cells which may express a given marker gene can make even moderately sized populations difficult to identify. An additional difficulty with applying t-SNE to scRNA-seq data is that overlaying the expression levels of marker genes on separate 2D t-SNE plots is cumbersome owing to the large number of marker genes for each dataset. Practically, only a modest number of such plots can be visually compared. In this paper, we present two improvements for the application of t-SNE to scRNA-seq data visualization. 
First, we present FFT-accelerated Interpolation-based t-SNE (FIt-SNE), an algorithm for rapid computation of one- and two-dimensional t-SNE based on polynomial interpolation and further accelerated using the fast Fourier transform. We also present t-SNE heatmaps, a heatmap-style visualization method based on one-dimensional t-SNE, which simultaneously visualizes expression patterns of hundreds to thousands of genes.

FIt-SNE.

t-SNE is often run many times with different parameters and initializations, so that the embedding most consistent with prior knowledge can be chosen. FIt-SNE is a dramatically accelerated implementation of t-SNE, allowing practitioners to analyze entire datasets rather than first downsampling. By doing so, FIt-SNE allows practitioners to identify known populations using marker genes which may not be expressed in sufficiently many cells after downsampling. For example, we used FIt-SNE to embed a dataset consisting of 1.3 million mouse brain cells[2] and identified two known cell types from the Allen Brain Atlas[3] which cannot be identified using a random subset of 50,000 cells (Figure 1), as the latter does not have enough cells expressing both markers. Specifically, GABAergic neurons from the caudal ganglionic eminence, which express the marker genes Sncg and Slc17a8, and a population of vascular leptomeningeal cells (VLMC) expressing the marker genes Spp1 and Col15a1 can both be identified using only the full embedding, as opposed to a random subset.

Figure 1.

FIt-SNE allows for embedding of the full 1.3 million mouse brain cell dataset (left), enabling the identification of known cell populations that cannot be identified when downsampling to a random 50,000 cells (right). (For the left figure, instead of plotting all 1.3 million embedded points, only 100,000 of the cells not expressing the marker genes are shown, whereas all the cells expressing the marker genes are shown.)

The t-SNE algorithm solves an optimization problem for embedding the cells (points) in a low-dimensional space based on their transcriptome similarities. Formally, this problem is equivalent to a physical system of particles (points) in which particles exert repulsive and attractive forces on each other. Naively implemented, computing the force each particle exerts on all the other particles is prohibitively slow; we devise approximation schemes for evaluating the repulsive and attractive forces that can scale to millions of points.

Computation of the repulsive forces between every pair of the N points is the most time-consuming step in t-SNE. Instead of calculating the interaction of each point with all the other points (which requires N² computations), Barnes-Hut (BH) t-SNE,[4] the fastest published t-SNE implementation, uses a tree structure to compress the interaction between distant cells, hence requiring N log N computations. We take a different approach by defining a small number p of interpolation nodes, which "mediate" the interaction between the points. First, we calculate the interaction of each point with those nodes (p · N computations). Then we compute the interaction of those nodes with each other (p² naively, p log p using FFTs). Finally, we interpolate from the interpolation nodes to all of the original points (also p · N computations). Hence, we can approximate the repulsive force in ~2p · N computations, as opposed to N² or N log N (Tables 1 and S1). We prove rigorous bounds on the approximation error in the Online Methods; in particular, we show that the number of interpolation nodes p required for a certain level of accuracy is independent of N. We set the default FIt-SNE parameters to give an approximation at least as accurate as BH t-SNE's default setting (Figure S1 and Section §8.3.3).

Table 1.

Time taken for 1000 iterations of the gradient descent phase of 2D t-SNE using Barnes-Hut t-SNE (BH t-SNE) compared to our implementation (FIt-SNE), measured on a 2017 Macbook Pro for a given number of points N. See Section §8.3.5 for more details.

N	BH t-SNE	FIt-SNE
10,000	1 min.	< 1 min.
100,000	11 min.	< 1 min.
500,000	1 hr. 10 min.	3 min.
1,000,000	3 hr. 9 min.	15 min.

The attractive force between two points decays exponentially fast as a function of the distance between them, so that a point only exerts a significant attractive force on its nearest neighbors. In BH t-SNE, the k-nearest neighbors of each point are identified using vantage-point (VP) trees,[5] which tend to be prohibitively expensive for high-dimensional datasets. In FIt-SNE, there are two options for identifying nearest neighbors: multithreaded VP trees and approximate nearest neighbors using Annoy[6] (Tables 2 and S2). Multithreaded VP trees are exactly as accurate as the VP tree implementation of BH t-SNE, just substantially faster. The use of approximate nearest neighbors is even faster, but could theoretically obscure subtle detail. In practice, however, we find the resulting embedding quality to be essentially indistinguishable (Figures S2, S3, S4, and S5).

Table 2.

Time taken to compute input similarities in Barnes-Hut t-SNE (vptree) compared to FIt-SNE using either multithreaded vantage-point trees (vptreeMT) or a multi-threaded approximate nearest neighbor (annMT) approach on a 2017 Macbook Pro for a given number of points N.

	50 Dimensions			100 Dimensions
N	vptree	vptreeMT	annMT	vptree	vptreeMT	annMT
10,000	< 1 min.	< 1 min.	< 1 min.	< 1 min.	< 1 min.	< 1 min.
100,000	2 min.	< 1 min.	< 1 min.	3 min.	< 1 min.	< 1 min.
500,000	56 min.	15 min.	3 min.	1 hr. 30 min.	20 min.	4 min.
1,000,000	4 hr. 45 min.	1 hr. 15 min.	6 min.	7 hr. 9 min.	1 hr. 40 min.	8 min.

Although FIt-SNE makes it practical to run t-SNE on datasets with millions of points, the choice of parameters which lead to an ideal embedding is an active area of research. For example, when the number of points is large, the attractive forces must be exaggerated during the beginning stages of t-SNE in order to ensure optimal embedding of large numbers of points[7] (Supplemental Figure S6). While this paper was in revision, a new paper by Belkina and colleagues (2018)[8] proposed an approach for automatically determining the step size and the optimal number of iterations for which to exaggerate the attractive forces, which they validate using CyTOF and scRNA-seq datasets. In another very recent work, Kobak and Berens (2018)[9] proposed a protocol for exploratory analysis of scRNA-seq data using FIt-SNE (including suggested parameter choices), which leads to dramatically improved embedding quality, particularly with regard to preservation of multi-scale and global structure.

Heatmaps.

Exploration of scRNA-seq data using t-SNE consists of tiling two-dimensional t-SNE plots, each colored by the expression pattern of a different marker gene. Although this information is presented in two dimensions, users are most interested in which genes are associated with which clusters, not the shape or relative locations of the clusters. It has been shown that t-SNE preserves the cluster structure of well-clustered data regardless of the embedding dimension,[7] and thus one-dimensional t-SNEs usually contain the same information as two-dimensional t-SNEs. Furthermore, multiple one-dimensional t-SNEs, each using different groups of markers, have previously been used to visualize CyTOF data.[10] We develop a related approach which exploits the compactness of a single one-dimensional embedding to enable simultaneous exploration of expression patterns of hundreds to thousands of genes in heatmap form. This approach also allows us to discover new marker genes and organize the genes based on their smoothed expression patterns along the one-dimensional t-SNE representation of the cells.

In t-SNE Heatmaps, we first construct a one-dimensional t-SNE embedding of the cells. Next, we discretize the one-dimensional t-SNE embedding into b bins, where b is user specified, and represent each gene by the sum of its expression in the cells contained in each bin. We then visualize these vectors in heatmap format (i.e. each row is a gene and each column is a bin) using an interactive visualization tool called heatmaply.[12] Notably, unlike dotplots, which present the average expression of genes in each cluster (e.g. Figure 2A of Shekhar et al. (2016)[11]), this approach does not require pre-clustering, and hence can discover patterns in poorly clustered data that might be missed if averaging across clusters.
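The binning step can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the code of the t-SNE Heatmaps package: the function name `bin_expression` and the choice of equal-width bins are assumptions, since the text only states that the embedding is discretized into b bins.

```python
import numpy as np

def bin_expression(embedding_1d, expr, n_bins):
    """Sum each gene's expression over the cells in each of n_bins bins.

    embedding_1d: shape (n_cells,), the one-dimensional t-SNE coordinates.
    expr: shape (n_genes, n_cells), the expression matrix.
    Returns an array of shape (n_genes, n_bins). Equal-width bins are an
    assumption made for this sketch.
    """
    embedding_1d = np.asarray(embedding_1d, dtype=float)
    expr = np.asarray(expr, dtype=float)
    edges = np.linspace(embedding_1d.min(), embedding_1d.max(), n_bins + 1)
    # digitize is 1-based and puts the maximum one past the last bin, so clip.
    bins = np.clip(np.digitize(embedding_1d, edges) - 1, 0, n_bins - 1)
    binned = np.zeros((n_bins, expr.shape[0]))
    np.add.at(binned, bins, expr.T)   # accumulate every cell into its bin
    return binned.T

# Toy data: 4 cells, 2 genes, 2 bins.
embedding = np.array([0.0, 0.1, 0.9, 1.0])
expression = np.array([[1.0, 2.0, 3.0, 4.0],
                       [0.0, 1.0, 0.0, 1.0]])
heat = bin_expression(embedding, expression, n_bins=2)
```

Each row of `heat` is the vector that would be drawn as one row of the heatmap.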

Figure 2.

Schematic and demo of t-SNE Heatmaps. Starting with the expression matrix (A), we compute 1D t-SNE, which is plotted in (B) colored by the expression of each gene (with added jitter). We bin the 1D t-SNE, and represent each gene by its average expression in each bin (C), and then generate a heatmap of these vectors, so that genes with similar expression patterns in the t-SNE are grouped together (D). In (E), we demonstrate t-SNE heatmaps using retinal bipolar cells.[11]

Various strategies can be used to select the genes presented in the heatmap. If the user has prior knowledge as to genes of interest, these genes can be presented, along with genes whose one-dimensional t-SNE binned representations are most similar, allowing for marker gene discovery. If the user wants to identify genes specific to clusters, a "metagene" can be constructed, which is 1 on cells in a cluster and 0 elsewhere. Then genes whose one-dimensional t-SNE binned representations are most similar to these "metagenes" (i.e. specific to a cluster) can be presented in the heatmap. "Metagenes" for combinations of clusters can also be constructed.

Figure 2 demonstrates t-SNE heatmaps using retinal bipolar cells from Shekhar et al. (2016).[11] In this work, scRNA-seq was used to profile ~25,000 mouse retinal bipolar cells and classify them into 15 types. Using graph-based clustering techniques, cells were clustered, and marker genes corresponding to each of the putative subtypes of bipolar cells were subsequently identified. We embedded these bipolar cells using 1D t-SNE and found the 25 genes most associated with the marker genes listed in Table S2 of Shekhar et al. (2016). We also found the 25 genes most associated with "metagenes" for each cluster in the 2D t-SNE. The resulting t-SNE heatmap (Figure 2 and Supplementary Figures S7 and S8) identified all 16 of the new bipolar cell markers listed in Figure 2A of Shekhar et al. (2016). The clustered structure of the dataset is evident in the heatmap, and the user can zoom in to identify the genes that characterize and distinguish different regions of the embedding. We note that the structure is substantially clearer than a heatmap of the same genes binned using standard hierarchical clustering, even when the rows are ordered as in the t-SNE heatmaps (Figure S9).
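The metagene strategy can be illustrated as follows. The text does not say how similarity between binned representations is measured, so Pearson correlation is used here purely as an assumption; `most_similar_genes`, `bin_sum`, and the toy data are hypothetical names and values introduced for the example.

```python
import numpy as np

def most_similar_genes(binned_genes, target, k):
    """Indices of the k rows of binned_genes most correlated with target.

    Pearson correlation is an assumed similarity measure for this sketch.
    Rows with zero variance get correlation 0.
    """
    G = binned_genes - binned_genes.mean(axis=1, keepdims=True)
    t = target - target.mean()
    denom = np.linalg.norm(G, axis=1) * np.linalg.norm(t)
    corr = np.divide(G @ t, denom, out=np.zeros(G.shape[0]), where=denom > 0)
    return np.argsort(-corr)[:k], corr

# Hypothetical toy data: 6 cells already assigned to 3 bins, 2 clusters.
cell_bin = np.array([0, 0, 1, 1, 2, 2])
cluster = np.array([0, 0, 0, 1, 1, 1])
expr = np.array([[5.0, 4.0, 3.0, 0.0, 0.0, 0.0],   # gene 0: specific to cluster 0
                 [0.0, 0.0, 0.0, 2.0, 3.0, 4.0],   # gene 1: specific to cluster 1
                 [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]])  # gene 2: uniform

def bin_sum(values):
    """Sum per-cell values within each bin of the 1D embedding."""
    return np.bincount(cell_bin, weights=values, minlength=3)

binned_genes = np.vstack([bin_sum(row) for row in expr])
metagene = (cluster == 0).astype(float)            # 1 on cluster 0, 0 elsewhere
top, corr = most_similar_genes(binned_genes, bin_sum(metagene), k=1)
```

A metagene for a combination of clusters would simply use `np.isin(cluster, [...])` in place of the equality test.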

Methods

R, Python, and Matlab implementations of FIt-SNE and an R implementation of t-SNE heatmaps are available from https://github.com/KlugerLab/. Methods, including statements of data availability and any associated accession codes and references, are available in the online version of the paper. The Life Sciences Reporting Summary was also completed.

Online Methods

We first briefly review the t-SNE approach and then present FIt-SNE's method for optimizing the computation of the repulsive force in Section §8.3. Section §8.4 presents an implementation of out-of-core PCA for the analysis of datasets too large to fit in memory. Finally, Section §8.5 provides details of the embedding of 1.3 million mouse brain cells (Figure 1), Section §8.6 describes the demonstration of t-SNE heatmaps (Figure 2), and Section §8.7 provides details about our comparison of VP trees to approximate nearest neighbors on three scRNA-seq datasets.

t-distributed Stochastic Neighbor Embedding.

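Given a $d$-dimensional dataset $X = \{x_1, x_2, \ldots, x_N\} \subset \mathbb{R}^d$, t-SNE aims to compute the low-dimensional embedding $Y = \{y_1, y_2, \ldots, y_N\} \subset \mathbb{R}^s$, where $s \ll d$, such that if two points $x_i$ and $x_j$ are close in the input space, then their corresponding points $y_i$ and $y_j$ are also close. Affinities between points $x_i$ and $x_j$ in the input space, $p_{ij}$, are defined as

$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}.$$

Here $\sigma_i$, the bandwidth of the Gaussian distribution, is computed so that the perplexity of $P_i$ (the conditional distribution of all other points given $x_i$) matches the user-specified perplexity. Similarly, the affinity between points $y_i$ and $y_j$ in the embedding space is defined using the Cauchy kernel

$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}}.$$

t-SNE finds the points $\{y_1, \ldots, y_N\}$ that minimize the Kullback-Leibler divergence between the joint distribution of points in the input space $P$ and the joint distribution of the points in the embedding space $Q$,

$$C(Y) = KL(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}.$$

Starting with a random initialization, the cost function is minimized by gradient descent, with the gradient[13]

$$\frac{\partial C}{\partial y_i} = 4 \sum_{j \neq i} (p_{ij} - q_{ij})\, q_{ij}\, Z\, (y_i - y_j),$$

where $Z$ is a global normalization constant

$$Z = \sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}.$$

We split the gradient into two parts

$$\frac{1}{4} \frac{\partial C}{\partial y_i} = \sum_{j \neq i} p_{ij}\, q_{ij}\, Z\, (y_i - y_j) - \sum_{j \neq i} q_{ij}^2\, Z\, (y_i - y_j),$$

where the first sum, $F_{\mathrm{attr},i}$, corresponds to an attractive force between points and the second sum, $F_{\mathrm{rep},i}$, corresponds to a repulsive force

$$\frac{1}{4} \frac{\partial C}{\partial y_i} = F_{\mathrm{attr},i} - F_{\mathrm{rep},i}.$$

The computation of the gradient at each step is an N-body simulation, where the position of each point is determined by the forces exerted on it by all other points. Exact computation of N-body simulations scales as $O(N^2)$, making exact t-SNE computationally prohibitive for datasets with tens of thousands of points. It should be noted that since the input similarities do not change they can be precomputed and hence do not dominate the computational time.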
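For small N the exact gradient can be evaluated directly, which is useful as a reference when testing faster approximations. The following sketch uses the standard t-SNE gradient, 4 Σ_j (p_ij − q_ij) q_ij Z (y_i − y_j), split into its attractive and repulsive sums; the function names are hypothetical and the input similarities P are random placeholders rather than perplexity-calibrated values.

```python
import numpy as np

def tsne_forces(Y, P):
    """Exact O(N^2) attractive and repulsive sums for an embedding Y (N x s).

    P is assumed symmetric, zero on the diagonal, and summing to 1.
    """
    diff = Y[:, None, :] - Y[None, :, :]             # y_i - y_j, shape (N, N, s)
    W = 1.0 / (1.0 + np.sum(diff ** 2, axis=2))      # Cauchy kernel
    np.fill_diagonal(W, 0.0)                         # exclude self-interactions
    Z = W.sum()                                      # global normalization
    Q = W / Z
    F_attr = np.einsum("ij,ijm->im", P * Q * Z, diff)
    F_rep = np.einsum("ij,ijm->im", Q ** 2 * Z, diff)
    return F_attr, F_rep, Q

def kl_divergence(Y, P):
    """KL(P || Q) summed over all ordered pairs i != j."""
    _, _, Q = tsne_forces(Y, P)
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

rng = np.random.default_rng(0)
N, s = 6, 2
A = rng.random((N, N))
A = A + A.T
np.fill_diagonal(A, 0.0)
P = A / A.sum()                                      # placeholder similarities
Y = rng.normal(size=(N, s))

F_attr, F_rep, _ = tsne_forces(Y, P)
grad = 4.0 * (F_attr - F_rep)                        # gradient of the KL cost
```

Both sums cost O(N²) here; the attractive sum becomes cheap once P is restricted to nearest neighbors, which leaves the repulsive sum as the bottleneck.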

Early Exaggeration.

In the expression for the gradient, written as the sum of attractive and repulsive forces,

$$\frac{1}{4} \frac{\partial C}{\partial y_i} = \alpha \sum_{j \neq i} p_{ij}\, q_{ij}\, Z\, (y_i - y_j) - \sum_{j \neq i} q_{ij}^2\, Z\, (y_i - y_j),$$

the numerical quantity $\alpha > 0$ plays a substantial role as it determines the strength of attraction between points that are similar (in the sense of pairs $x_i$, $x_j$ with $p_{ij}$ large). In early exaggeration, $\alpha$ is set to 12 for the first several hundred iterations, after which it is set to 1.[13] One of the main results of Linderman and Steinerberger (2017)[7] is that $\alpha$ plays a crucial role and that when it is set large enough, t-SNE is guaranteed to separate well-clustered data and also successfully embed various synthetic datasets (e.g. a swiss roll) that were previously thought to be poorly embedded by t-SNE.
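As a small illustration, the exaggeration schedule amounts to a step function of the iteration number. The cutoff of 250 iterations below is an assumed placeholder for "several hundred", and the function names are hypothetical. The sign convention takes the gradient as 4(α·F_attr − F_rep), with α multiplying only the attractive term.

```python
def exaggeration_factor(iteration, alpha_early=12.0, stop_iter=250):
    """Return alpha for a given iteration: alpha_early at first, then 1.

    stop_iter=250 is an assumed value; the text says only that exaggeration
    lasts for the first several hundred iterations.
    """
    return alpha_early if iteration < stop_iter else 1.0

def exaggerated_gradient(f_attr, f_rep, alpha):
    """Gradient component 4 * (alpha * F_attr - F_rep) for one coordinate."""
    return 4.0 * (alpha * f_attr - f_rep)
```

For example, with `f_attr = 1.0` and `f_rep = 2.0` the early-phase component is `4 * (12 - 2)`, while after the cutoff it is `4 * (1 - 2)`.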

Accelerating computation of repulsive forces in FIt-SNE.

In existing methods, the repulsive forces $F_{\mathrm{rep},k}$ are approximated at each iteration using the Barnes-Hut algorithm,[17] a tree-based algorithm which scales as $O(N \log N)$, where $N$ is the total number of data points. In this work, we present an interpolation-based, fast Fourier transform accelerated algorithm for computing $F_{\mathrm{rep},k}$, which scales as $O(N)$. Moreover, empirical tests show a significant improvement over the Barnes-Hut approach for systems of any size.

Recall that $\{y_1, y_2, \ldots, y_N\}$ is the $s$-dimensional embedding of a collection of $d$-dimensional vectors $\{x_1, \ldots, x_N\}$. At each step of gradient descent, the repulsive forces are given by

$$F_{\mathrm{rep},k}(m) = \left( \sum_{\substack{\ell = 1 \\ \ell \neq k}}^{N} \frac{y_k(m) - y_\ell(m)}{\left(1 + \|y_k - y_\ell\|^2\right)^2} \right) \Bigg/ \left( \sum_{j=1}^{N} \sum_{\substack{\ell = 1 \\ \ell \neq j}}^{N} \frac{1}{1 + \|y_\ell - y_j\|^2} \right), \tag{1}$$

where $k = 1, 2, \ldots, N$, $m = 1, 2, \ldots, s$, and $y_i(j)$ denotes the $j$th component of $y_i$. Evidently, the repulsive force between the vectors $\{y_1, \ldots, y_N\}$ consists of $N^2$ pairwise interactions, and were it computed directly, would require CPU-time scaling as $O(N^2)$. Even for datasets consisting of a few thousand points, this cost becomes prohibitively expensive. Our approach enables the accurate computation of these pairwise interactions in $O(N)$ time. Since the majority of applications of t-SNE are for at most two-dimensional embeddings, in the following we focus our attention on the cases where $s = 1$ or $2$. However, we note that our algorithm extends naturally to arbitrary dimensions. In such cases, though the constants in the computational cost will vary, our approach will still yield an algorithm with a CPU-time which scales as $O(N)$.

We begin by observing that the repulsive forces $F_{\mathrm{rep},k}$ defined in eq. (1) can be expressed as $s + 2$ sums of the form

$$\phi(y_i) = \sum_{j=1}^{N} K(y_i, y_j)\, q_j, \tag{2}$$

where the kernel $K(y, z)$ is either

$$K_1(y, z) = \frac{1}{1 + \|y - z\|^2} \qquad \text{or} \qquad K_2(y, z) = \frac{1}{\left(1 + \|y - z\|^2\right)^2}, \tag{3}$$

for $y, z \in \mathbb{R}^s$. (Expanding the numerator of eq. (1) as $y_k(m) \sum_\ell K_2(y_k, y_\ell) - \sum_\ell K_2(y_k, y_\ell)\, y_\ell(m)$ gives $s + 1$ sums with kernel $K_2$, and the denominator is obtained from one further sum with kernel $K_1$.) Note that both of the kernels $K_1$ and $K_2$ are smooth functions of $y, z$ for all $y, z \in \mathbb{R}^s$. The key idea of our approach is to use polynomial interpolants of the kernel $K$ in order to accelerate the evaluation of the N-body interactions defined in eq. (2).

Mathematical Preliminaries.

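First, we demonstrate with a simple example how polynomial interpolation can be used to accelerate the computation of the N-body interactions with a smooth kernel. Suppose that $y_1, \ldots, y_M \in (y_0, y_0 + R)$ and $z_1, \ldots, z_N \in (z_0, z_0 + R)$. Let $I_{y_0}$ and $I_{z_0}$ denote the intervals $(y_0, y_0 + R)$ and $(z_0, z_0 + R)$, respectively. Note that no assumptions are made regarding the relative locations of $y_0$ and $z_0$; in particular, the case $y_0 = z_0$ is also permitted. Now consider the sums

$$\phi(y_i) = \sum_{j=1}^{N} K(y_i, z_j)\, q_j, \qquad i = 1, 2, \ldots, M. \tag{4}$$

Let $p$ be a positive integer. Suppose that $\tilde{y}_1, \ldots, \tilde{y}_p$ are a collection of $p$ points on the interval $I_{y_0}$ and that $\tilde{z}_1, \ldots, \tilde{z}_p$ are a collection of $p$ points on the interval $I_{z_0}$. Let $K_p(y, z)$ denote a bivariate polynomial interpolant of the kernel $K(y, z)$ satisfying

$$K_p(\tilde{y}_j, \tilde{z}_\ell) = K(\tilde{y}_j, \tilde{z}_\ell), \qquad j, \ell = 1, 2, \ldots, p.$$

A simple calculation shows that $K_p(y, z)$ is given by

$$K_p(y, z) = \sum_{\ell=1}^{p} \sum_{j=1}^{p} K(\tilde{y}_j, \tilde{z}_\ell)\, L_{j,\tilde{y}}(y)\, L_{\ell,\tilde{z}}(z), \tag{5}$$

where $L_{\ell,\tilde{y}}$ and $L_{\ell,\tilde{z}}$ are the Lagrange polynomials

$$L_{\ell,\tilde{y}}(y) = \prod_{j \neq \ell} \frac{y - \tilde{y}_j}{\tilde{y}_\ell - \tilde{y}_j}, \qquad L_{\ell,\tilde{z}}(z) = \prod_{j \neq \ell} \frac{z - \tilde{z}_j}{\tilde{z}_\ell - \tilde{z}_j}, \qquad \ell = 1, 2, \ldots, p.$$

In the following we will refer to the points $\tilde{y}_1, \ldots, \tilde{y}_p$ and $\tilde{z}_1, \ldots, \tilde{z}_p$ as interpolation points. Let $\tilde{\phi}(y_i)$ denote the approximation to $\phi(y_i)$ obtained by replacing the kernel $K$ in eq. (4) by its polynomial interpolant $K_p$, i.e.

$$\tilde{\phi}(y_i) = \sum_{j=1}^{N} K_p(y_i, z_j)\, q_j,$$

for $i = 1, 2, \ldots, M$. Clearly the error in approximating $\phi(y_i)$ via $\tilde{\phi}(y_i)$ is bounded (up to a constant) by the error in approximating $K(y, z)$ via $K_p(y, z)$. In particular, if the polynomial interpolant satisfies the inequality

$$\sup_{y \in I_{y_0},\, z \in I_{z_0}} \left| K(y, z) - K_p(y, z) \right| \leq \varepsilon, \tag{6}$$

then the error is bounded by

$$\left| \tilde{\phi}(y_i) - \phi(y_i) \right| \leq \varepsilon \sum_{j=1}^{N} |q_j|.$$

A direct computation of $\phi(y_1), \ldots, \phi(y_M)$ requires $O(M \cdot N)$ operations. On the other hand, the values $\tilde{\phi}(y_i)$, $i = 1, 2, \ldots, M$, can be computed in $O((M + N) \cdot p + p^2)$ operations as follows. Using eq. (5), $\tilde{\phi}(y_i)$ can be rewritten as

$$\tilde{\phi}(y_i) = \sum_{\ell=1}^{p} L_{\ell,\tilde{y}}(y_i) \left( \sum_{m=1}^{p} K(\tilde{y}_\ell, \tilde{z}_m) \left( \sum_{j=1}^{N} L_{m,\tilde{z}}(z_j)\, q_j \right) \right)$$

for $i = 1, 2, \ldots, M$. The values $\tilde{\phi}(y_i)$ are computed in three steps.

Step 1: Compute the coefficients $w_m$ defined by the formula
$$w_m = \sum_{j=1}^{N} L_{m,\tilde{z}}(z_j)\, q_j$$
for each $m = 1, 2, \ldots, p$. This step requires $O(N \cdot p)$ operations.

Step 2: Compute the values $v_\ell$ at the interpolation nodes defined by the formula
$$v_\ell = \sum_{m=1}^{p} K(\tilde{y}_\ell, \tilde{z}_m)\, w_m$$
for all $\ell = 1, 2, \ldots, p$. This step requires $O(p^2)$ operations.

Step 3: Evaluate the potential $\tilde{\phi}(y_i)$ using the formula
$$\tilde{\phi}(y_i) = \sum_{\ell=1}^{p} L_{\ell,\tilde{y}}(y_i)\, v_\ell$$
for all $i = 1, 2, \ldots, M$. This step requires $O(M \cdot p)$ operations.

See Figure S10 for an illustration of the above procedure.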
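The three steps can be sketched directly in NumPy for a single pair of intervals. This is an illustrative sketch, not the FIt-SNE source: the function names and the parameter choices (p = 8, R = 0.5, equispaced midpoint nodes, normalized non-negative weights) are assumptions made for the demonstration.

```python
import numpy as np

def lagrange_basis(nodes, x):
    """Return L with L[i, l] = l-th Lagrange polynomial on `nodes` at x[i]."""
    x = np.asarray(x, dtype=float)
    L = np.ones((x.size, nodes.size))
    for l in range(nodes.size):
        for j in range(nodes.size):
            if j != l:
                L[:, l] *= (x - nodes[j]) / (nodes[l] - nodes[j])
    return L

def interpolated_sum(y, z, q, kernel, y_nodes, z_nodes):
    """Approximate phi(y_i) = sum_j kernel(y_i, z_j) q_j in O((M + N) p + p^2)."""
    w = lagrange_basis(z_nodes, z).T @ q                    # Step 1: O(N p)
    v = kernel(y_nodes[:, None], z_nodes[None, :]) @ w      # Step 2: O(p^2)
    return lagrange_basis(y_nodes, y) @ v                   # Step 3: O(M p)

def k1(y, z):
    """One-dimensional Cauchy kernel 1 / (1 + (y - z)^2)."""
    return 1.0 / (1.0 + (y - z) ** 2)

rng = np.random.default_rng(0)
M, N, p, R = 300, 400, 8, 0.5            # assumed demo sizes
y0, z0 = 0.0, 0.2
y = y0 + R * rng.random(M)
z = z0 + R * rng.random(N)
q = rng.random(N)
q /= q.sum()                             # weights normalized so sum |q_j| = 1
y_nodes = y0 + (np.arange(p) + 0.5) * R / p
z_nodes = z0 + (np.arange(p) + 0.5) * R / p

exact = k1(y[:, None], z[None, :]) @ q   # direct O(M N) evaluation
approx = interpolated_sum(y, z, q, k1, y_nodes, z_nodes)
max_err = np.max(np.abs(exact - approx))
```

Because the points enter only through Steps 1 and 3, the cost grows linearly in M + N for fixed p, while the accuracy is controlled by p and the interval length alone.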

Algorithm.

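In this section, we present the main algorithm for the rapid evaluation of the repulsive forces, eq. (2). The central strategy is to use piecewise polynomial interpolants of the kernel with equispaced points, together with the procedure described in Section §8.3.1. Specifically, suppose that the points $y_i$, $i = 1, 2, \ldots, N$, are all contained in the interval $[y_{\min}, y_{\max}]$. We subdivide the interval $[y_{\min}, y_{\max}]$ into $N_{\mathrm{int}}$ intervals of equal length, $I_\ell$, $\ell = 1, 2, \ldots, N_{\mathrm{int}}$. Let $\tilde{y}_{j,\ell}$ denote $p$ equispaced nodes on the interval $I_\ell$ given by

$$\tilde{y}_{j,\ell} = \frac{h}{2} + \big( (j - 1) + (\ell - 1) \cdot p \big) \cdot h, \tag{7}$$

where $h = 1/(N_{\mathrm{int}} \cdot p)$, $j = 1, 2, \ldots, p$, and $\ell = 1, 2, \ldots, N_{\mathrm{int}}$. (Eq. (7) is written for the interval normalized to $[0, 1]$; for a general interval the nodes are shifted by $y_{\min}$ and $h$ is scaled by $y_{\max} - y_{\min}$.)

Remark 1. The nodes $\tilde{y}_{j,\ell}$, $j = 1, 2, \ldots, p$, and $\ell = 1, 2, \ldots, N_{\mathrm{int}}$, defined in eq. (7), are also equispaced on the whole interval $[y_{\min}, y_{\max}]$.

The interaction between any two intervals $I_\ell$, $I_n$, i.e.

$$\phi_{\ell,n}(y_i) = \sum_{y_j \in I_n} K(y_i, y_j)\, q_j \qquad \text{for } y_i \in I_\ell,$$

can be accelerated via the algorithm discussed in Section §8.3.1. This procedure amounts to using a piecewise polynomial interpolant of the kernel $K(y, z)$ on the domain $y, z \in [y_{\min}, y_{\max}]$ as opposed to using an interpolant on the whole interval. We summarize the procedure below.

Step 1: For each interval $I_\ell$, $\ell = 1, 2, \ldots, N_{\mathrm{int}}$, compute the coefficients $w_{m,\ell}$ defined by the formula
$$w_{m,\ell} = \sum_{y_i \in I_\ell} L_{m,\tilde{y}^\ell}(y_i)\, q_i$$
for each $m = 1, 2, \ldots, p$. This step requires $O(N \cdot p)$ operations.

Step 2: Compute the values $v_{m,n}$ at the equispaced nodes $\tilde{y}_{m,n}$ defined by the formula
$$v_{m,n} = \sum_{\ell=1}^{N_{\mathrm{int}}} \sum_{j=1}^{p} K(\tilde{y}_{m,n}, \tilde{y}_{j,\ell})\, w_{j,\ell} \tag{8}$$
for all $m = 1, 2, \ldots, p$, $n = 1, 2, \ldots, N_{\mathrm{int}}$. This step requires $O((N_{\mathrm{int}} \cdot p)^2)$ operations.

Step 3: For each interval $I_\ell$, $\ell = 1, 2, \ldots, N_{\mathrm{int}}$, compute the potential $\phi(y_i)$ via the formula
$$\phi(y_i) = \sum_{j=1}^{p} L_{j,\tilde{y}^\ell}(y_i)\, v_{j,\ell}$$
for all points $y_i \in I_\ell$. This step requires $O(N \cdot p)$ operations.

In this procedure, the functions $L_{j,\tilde{y}^\ell}$ are the Lagrange polynomials corresponding to the equispaced interpolation nodes $\tilde{y}_{1,\ell}, \ldots, \tilde{y}_{p,\ell}$ on interval $I_\ell$.

In Step 2 of the above procedure, we are evaluating N-body interactions on equispaced grid points. For notational convenience, we rewrite the sum in eq. (8) as

$$v_i = \sum_{j=1}^{N_{\mathrm{int}} \cdot p} K(\tilde{y}_i, \tilde{y}_j)\, w_j, \tag{9}$$

$i = 1, 2, \ldots, N_{\mathrm{int}} \cdot p$. The kernels of interest ($K_1$ and $K_2$ defined in eq. (3)) are translationally invariant, i.e., the kernels satisfy $K(y, z) = K(y + \delta, z + \delta)$ for any $\delta$. The combination of using equispaced points, along with the translational invariance of the kernel, implies that the matrix associated with the evaluation of the sums in eq. (9) is Toeplitz. This computation can thus be accelerated via the fast Fourier transform (FFT), which reduces the computational complexity of evaluating the sums in eq. (9) from $O((N_{\mathrm{int}} \cdot p)^2)$ operations to $O(N_{\mathrm{int}} \cdot p \log(N_{\mathrm{int}} \cdot p))$. Algorithm 1 describes the fast algorithm for evaluating the repulsive forces, eq. (2), in one dimension ($s = 1$), which has computational complexity $O(N \cdot p + (N_{\mathrm{int}} \cdot p) \log(N_{\mathrm{int}} \cdot p))$.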
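The one-dimensional procedure, including the FFT-accelerated Toeplitz product in Step 2, can be sketched as below. This is a hedged illustration and not the FIt-SNE implementation: the function name is hypothetical, the kernel is assumed to be symmetric and translation invariant, and the Toeplitz matrix is embedded in a circulant matrix of twice the size, which is one standard way to apply it with FFTs. The sums include the self term j = i, which would have to be subtracted for the t-SNE sums.

```python
import numpy as np

def k1(y, z):
    """One-dimensional Cauchy kernel 1 / (1 + (y - z)^2)."""
    return 1.0 / (1.0 + (y - z) ** 2)

def fft_interpolated_sums(y, q, kernel=k1, n_int=50, p=3):
    """Approximate phi(y_i) = sum_j kernel(y_i, y_j) q_j (self term included)."""
    y_min = y.min()
    width = (y.max() - y_min) / n_int            # length of each interval
    h = width / p                                # spacing between nodes
    n = n_int * p
    nodes = y_min + h / 2 + h * np.arange(n)     # equispaced on the whole range

    # Interval index of every point and its offset inside that interval.
    box = np.minimum(((y - y_min) / width).astype(int), n_int - 1)
    t = y - (y_min + box * width)
    local = h / 2 + h * np.arange(p)
    L = np.ones((y.size, p))                     # Lagrange basis values
    for m in range(p):
        for j in range(p):
            if j != m:
                L[:, m] *= (t - local[j]) / (local[m] - local[j])
    idx = box[:, None] * p + np.arange(p)        # global node index per basis value

    # Step 1: spread the weights onto the nodes, O(N p).
    w = np.zeros(n)
    np.add.at(w, idx, L * q[:, None])

    # Step 2: Toeplitz product via a circulant embedding of size 2n and FFTs.
    col = kernel(nodes, nodes[0])                # first column of the Toeplitz matrix
    circ = np.concatenate([col, [0.0], col[:0:-1]])
    v = np.fft.irfft(np.fft.rfft(circ) * np.fft.rfft(w, 2 * n), 2 * n)[:n]

    # Step 3: interpolate from the nodes back to the points, O(N p).
    return np.sum(L * v[idx], axis=1)

rng = np.random.default_rng(1)
y = 10.0 * rng.random(500)
q = rng.random(500)
q /= q.sum()
exact = k1(y[:, None], y[None, :]) @ q           # direct O(N^2) reference
approx = fft_interpolated_sums(y, q)
max_err = np.max(np.abs(exact - approx))
```

With p = 3 and 50 intervals of length about 0.2, the interpolation error stays small even though only 150 nodes mediate all 250,000 pairwise interactions.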

Optimal choice of p and Nint.

Recall that the computational complexity of Algorithm 1 is $O(N \cdot p + N_{\mathrm{int}} \cdot p \log(N_{\mathrm{int}} \cdot p))$. We remark that the choice of the parameters $N_{\mathrm{int}}$ and $p$ depends solely on the specified tolerance $\varepsilon$ and is independent of the number of points $N$. Generally, increasing $p$ will reduce the number of intervals $N_{\mathrm{int}}$ required to obtain the same accuracy in the computation. However, we observe that the reduction in $N_{\mathrm{int}}$ for an increased $p$ is not advantageous from a computational perspective, since, as the number of points $N$ increases, the computational cost is independent of $N_{\mathrm{int}}$ and is only a function of $p$. Moreover, for the t-SNE kernels $K_1$ and $K_2$ defined in eq. (3), it turns out that for a fixed accuracy the product $N_{\mathrm{int}} \cdot p$ remains nearly constant for $p \geq 3$. Thus, it is optimal to use $p = 3$ for all t-SNE calculations. In a more general environment, when higher accuracy is required and for other translationally invariant kernels $K$, the choice of the number of nodes per interval $p$ and the total number of intervals $N_{\mathrm{int}}$ can be optimized based on the accuracy of computation required.

Remark 2. Special care must be taken when increasing $p$ in order to achieve higher accuracy due to the Runge phenomenon associated with equispaced nodes. In fact, the kernels that arise in t-SNE are archetypical examples of this phenomenon. Since we use only low-order piecewise polynomial interpolation ($p = 3$), we encounter no such difficulties.

In our simulations, we set the values of $p = 3$ and $N_{\mathrm{int}} = \max(50, \lceil y_{\max} - y_{\min} \rceil)$. These values are chosen to ensure that the computation of $F_{\mathrm{rep},k}$ is at least as accurate as the Barnes-Hut approximation at its default setting ($\theta = 0.5$). We test the accuracy of the two methods by comparing the repulsive forces computed using BH t-SNE and FIt-SNE to the exact repulsive forces computed using the direct algorithm on a dataset with 4000 points. In Figure S1, we report the relative error of the BH t-SNE and FIt-SNE approximations at default values and note that the latter achieves the same (or better) accuracy. Since the approximation error is independent of the number of points (Section §8.3.6), this error analysis applies to datasets of any size.
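The Runge phenomenon mentioned in Remark 2 is easy to reproduce numerically. The sketch below interpolates f(x) = 1/(1 + x²), which has the same form as the kernel K_1, on [−5, 5]. The interval, the node counts, and the evaluation grid are choices made for this demonstration and do not come from the paper. The global interpolant uses equispaced nodes that include the endpoints (the classical setting), while the piecewise interpolant uses p = 3 midpoint-type nodes per interval.

```python
import numpy as np

def f(x):
    """Runge-type function with the same form as the kernel K_1."""
    return 1.0 / (1.0 + x ** 2)

def lagrange_eval(nodes, values, x):
    """Evaluate the Lagrange interpolant through (nodes, values) at x."""
    out = np.zeros_like(x)
    for l in range(nodes.size):
        basis = np.ones_like(x)
        for j in range(nodes.size):
            if j != l:
                basis *= (x - nodes[j]) / (nodes[l] - nodes[j])
        out += values[l] * basis
    return out

x = np.linspace(-5.0, 5.0, 2001)                 # evaluation grid

def global_error(p):
    """Max error of one degree-(p - 1) interpolant on p equispaced nodes."""
    nodes = np.linspace(-5.0, 5.0, p)
    return np.max(np.abs(lagrange_eval(nodes, f(nodes), x) - f(x)))

def piecewise_error(n_int, p=3):
    """Max error of piecewise interpolation with p nodes per interval."""
    edges = np.linspace(-5.0, 5.0, n_int + 1)
    h = (edges[1] - edges[0]) / p
    err = 0.0
    for a, b in zip(edges[:-1], edges[1:]):
        nodes = a + h / 2 + h * np.arange(p)
        xs = x[(x >= a) & (x <= b)]
        err = max(err, np.max(np.abs(lagrange_eval(nodes, f(nodes), xs) - f(xs))))
    return err

err_11 = global_error(11)
err_21 = global_error(21)
err_piecewise = piecewise_error(50)
```

Raising the number of global equispaced nodes from 11 to 21 makes the maximum error larger rather than smaller, whereas the low-order piecewise interpolant on 50 intervals remains accurate.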

Extension to two dimensions.

The above algorithm naturally extends to two-dimensional embeddings ($s = 2$). In this case, we divide the computational square $[y_{\min}, y_{\max}] \times [y_{\min}, y_{\max}]$ into a collection of $N_{\mathrm{int}} \times N_{\mathrm{int}}$ squares with equal side length, and for polynomial interpolation, we use tensor product $p \times p$ equispaced nodes on each square. The matrix mapping the coefficients $w$ to the coefficients $v$, which is of size $(N_{\mathrm{int}} \cdot p)^2 \times (N_{\mathrm{int}} \cdot p)^2$, is not a Toeplitz matrix; however, it can be embedded into a Toeplitz matrix of twice its size. The computational complexity of the algorithm analogous to Algorithm 1 for two-dimensional t-SNE is $O(N \cdot p^2 + (N_{\mathrm{int}} \cdot p)^2 \log(N_{\mathrm{int}} \cdot p))$.

Performance comparison.

The datasets for comparing the CPU-time performance of BH t-SNE and FIt-SNE in Tables 1, 2, S1, and S2 are generated in the following manner. For each N, we sample N/10 points from 10 gaussians in d–dimensions with mean and fixed variance σ = 0.0001. The experiments were performed on two systems—a 2017 Macbook Pro laptop with 2.9 GHz (Turbo up to 3.6GHz) Intel i7 CPU with 2 cores (each supporting 4 threads) and 16GB RAM; and a server with Intel Xeon CPUs with 24 cores clocked at 2.4 GHz and 500GB RAM. In FIt-SNE, the computation of nearest neighbors when computing input similarities, the summing of attractive forces at each iteration of gradient descent, and step 3 of the interpolation scheme outlined above are all multithreaded using C++11 threads, whereas the rest of the computation of the repulsive forces is done via single thread FFTs owing to the small size of FFTs involved. The poorer performance of both BH t-SNE and FIt-SNE on the server as compared to the Macbook can be attributed to the slower single processor clock speed.

Approximation error estimates.

In this section we prove error estimates related to interpolation by equispaced points on a subinterval of the computational domain. First we fix $x_0$ and suppose that $K(x_0, y)$ is to be approximated on the interval $[a, b]$ by the $p$-point Lagrange interpolant $w(y)$. For ease of exposition, let $f(y) = K(x_0, y)$, where $K(x, y)$ is either $K_1$ or $K_2$ given by eq. (3). Then a classical theorem in approximation theory (see Dahlquist and Björck (2008)[18] for example) states that for all $y \in (a, b)$ there exists a $\zeta \in (a, b)$ such that

$$f(y) - w(y) = \frac{f^{(p)}(\zeta)}{p!}\, \pi(y),$$

where $f^{(p)}$ denotes the $p$th derivative of $f$, and

$$\pi(y) = \prod_{j=1}^{p} (y - y_j).$$

Let $h = (b - a)/p$ and let the interpolation nodes on the interval $(a, b)$ be $y_j = a + (j - 1/2)h$, $j = 1, \ldots, p$. We bound $\pi(y)$ in the following way (see Trefethen (2013)[19] for example). Suppose that $y_k < y < y_{k+1}$. Then […]. Clearly this is bounded by […]. Similarly, if $y < y_1$ or $y > y_p$ then […].

In order to bound $f^{(p)}(\zeta)$ we first consider the case where $f(y) = K_1(x_0, y)$. Then […]. Taking $p$ derivatives we obtain […] and hence […]. Similarly, if $f(y) = K_2(x_0, y)$ then […], from which it follows that […]. Putting the above estimates together gives […], which holds for both $K_1$ and $K_2$. Using Stirling's approximation (see Abramowitz and Stegun (1965),[20] for example) it follows that […].

We now use this estimate to construct an error bound of the form given in eq. (6). First, for fixed $x \in [a, b]$ let K(x, y) denote the polynomial interpolant for $y \in [c, d]$. Then […]. Similarly, for fixed $y \in [c, d]$ let K(x, y) denote the polynomial interpolant for $x \in [a, b]$, in which case […]. Note that by construction, […] and […], where $L_j$, $j = 1, \ldots, p$, are the Lagrange polynomials for the nodes $y_1, \ldots, y_p \in [c, d]$. As above, let K(x, y) denote the polynomial interpolant of $K(x, y)$ which is degree $p$ in both $x$ and $y$ for $x \in [a, b]$ and $y \in [c, d]$. Evidently, […]. Hence […]. A slight modification of the argument presented in Trefethen and Weideman (1991)[21] yields the following bound, […], from which it follows that […]. Then […], which is the estimate we require.

In particular, if $L = b - a = d - c$ we obtain the bound […]. Note that if […] then the error will decay exponentially in $p$. In two dimensions an almost identical analysis shows that the error is bounded by […]. In principle this guarantees convergence only when […]. In practice, extensive numerical evidence suggests that the error decays exponentially in $p$ provided that $L < 1.4$.

Out-of-Core PCA.

The methods for t-SNE presented above allow for the embedding of millions of points, but can only be used to reduce the dimensionality of datasets that fit in memory. For many large, high-dimensional datasets, specialized servers must be used simply in order to load the data. In order to allow for visualization and analysis of such datasets on resource-limited machines, we present an out-of-core implementation of randomized PCA, which can be used to compute the top few (e.g. 50) principal components of a dataset to high accuracy, without ever loading it in its entirety.[22] Note that out-of-core PCA was not used in the analysis above, but we include it as it can be useful for users interested in running t-SNE on large datasets using a resource-limited machine.

Randomized Methods for PCA.

The goal of PCA is to approximate the matrix being analyzed (after mean centering of its columns) with a low-rank matrix. PCA is primarily useful when such an approximation makes sense; that is, when the matrix being analyzed is approximately low-rank. If the input matrix is low-rank, then by definition, its range is low-dimensional. As such, when the input matrix is applied to a small number of random vectors, the resulting vectors nearly span its range. This observation is the core idea behind randomized algorithms for PCA: applying the input matrix to a small number of random vectors results in vectors that approximate the range of the matrix. Then, simple linear algebra techniques can be used to compute the principal components. Notably, the only operations involving the large input matrix are matrix-vector multiplications, which are easily parallelized, and for which highly optimized implementations exist. Randomized algorithms have been rigorously proven to be remarkably accurate with extremely high probability,[25,26] because for a rank-k matrix, as few as l = k + 2 random vectors are sufficient for the probability of missing a significant part of the range to be negligible. The algorithm and its underlying theory are covered in detail in Halko et al. (2011).[25] An easy-to-use “black box” implementation of randomized PCA is available and described in Li et al. (2017),[23] but it requires the entire matrix to be loaded in the memory. We present an out-of-core implementation of PCA in C++/R, oocPCA, allowing for decomposition of matrices which cannot fit in the memory.
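The core idea can be sketched in NumPy as follows. This is a generic randomized range-finder with power iterations in the spirit of Halko et al. (2011), not the oocPCA code; the function name, the default oversampling of 2, the number of power iterations, and the QR re-orthonormalization between applications are assumptions of this sketch.

```python
import numpy as np

def randomized_pca(A, k, oversample=2, n_power_iter=2, seed=0):
    """Top-k singular values and principal directions of column-centered A."""
    rng = np.random.default_rng(seed)
    A = A - A.mean(axis=0)                           # mean-center the columns
    omega = rng.standard_normal((A.shape[1], k + oversample))
    Q, _ = np.linalg.qr(A @ omega)                   # basis for the sampled range
    for _ in range(n_power_iter):                    # power iterations sharpen it
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    B = Q.T @ A                                      # small (k + oversample) x n matrix
    _, S, Vt = np.linalg.svd(B, full_matrices=False)
    return S[:k], Vt[:k]                             # rows of Vt: principal directions

# A random matrix of rank 5: l = k + 2 = 7 random vectors capture its range.
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 40))
S_approx, components = randomized_pca(A, k=5)
S_exact = np.linalg.svd(A - A.mean(axis=0), compute_uv=False)[:5]
```

The large matrix appears only in products with thin matrices, which is what makes a block-wise, out-of-core variant possible.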

Implementation.

Our implementation is described in Algorithm 2. Given an m × n matrix of doubles A, stored in row-major format on the disk of a machine with M bytes of available memory, the number of rows b that can fit in memory is calculated from M and n (each row of n doubles occupies 8n bytes, so b can be at most ⌊M/(8n)⌋). The only operations performed using A are matrix multiplications, which can be performed block-wise. Specifically, the matrix product AB, where B is an n × p matrix stored in fast memory, can be computed by loading the first b rows of A and forming the inner product of each row with the columns of B. The process can be continued with the remaining blocks of the matrix, essentially “filling in” the product AB with each new block. In this manner, left multiplication by A can be computed without ever loading the full matrix A. By simply replacing the matrix multiplications in the implementation of Li et al. (2017)[23] with block-wise matrix multiplication, an out-of-core algorithm can be obtained. However, significant optimization is possible. The run-time of an out-of-core algorithm is almost entirely determined by disk access time; namely, the number of times the matrix must be loaded into memory. As suggested in Li et al. (2017),[23] the renormalization step between the application of A and A* is not necessary in most cases, and in the out-of-core setting, it doubles the number of times A must be loaded per power iteration. In our implementation, we remove this renormalization step and apply AA* simultaneously, hence requiring that the matrix be loaded only once per iteration. Our implementation is in C++ with an R wrapper. For maximum optimization of linear algebra operations, we use the highly parallelized Intel MKL for all BLAS functions (e.g. matrix multiplications). The R wrapper provides functions for PCA of matrices in CSV and in binary format. Furthermore, basic preprocessing steps including log transformation and mean centering of rows and/or columns can also be performed prior to decomposition, so that the matrix need not ever be fully stored in memory.

To demonstrate oocPCA’s performance, we generated a random 1,000,000 × 30,000 rank-50 matrix stored as doubles, which would require 240 GB simply to store in memory, far exceeding the memory capacity of a personal computer. Using oocPCA, we can compute the top principal components of the matrix with much less memory. Using a 2017 MacBook Pro laptop with 16 GB RAM, a solid state drive, and a 2.9 GHz Intel i7 CPU, the rank-50 approximation was computed in 38 minutes.
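The block-wise product described above can be sketched as follows. This is a minimal Python illustration, not the C++/R implementation: the helper names `rows_per_block` and `blockwise_matmul`, the use of `numpy.memmap`, and the raw row-major float64 file layout are all assumptions.

```python
import numpy as np

def rows_per_block(mem_bytes, n_cols, itemsize=8):
    """Upper bound on how many rows of n_cols doubles fit in mem_bytes."""
    return max(1, mem_bytes // (itemsize * n_cols))

def blockwise_matmul(path, m, n, B, b):
    """Compute A @ B where A (m x n, float64, row-major) is stored in the file at `path`.

    Only b rows of A are held in memory at any time.
    """
    A = np.memmap(path, dtype=np.float64, mode="r", shape=(m, n))
    out = np.empty((m, B.shape[1]))
    for start in range(0, m, b):
        stop = min(start + b, m)
        block = np.asarray(A[start:stop])     # read only these rows from disk
        out[start:stop] = block @ B           # fill in the matching rows of A @ B
    return out
```

As a consistency check on the figures in the text, 1,000,000 rows of 30,000 doubles occupy 1,000,000 × 30,000 × 8 = 2.4 × 10^11 bytes, i.e. 240 GB, so `rows_per_block(240 * 10**9, 30000)` returns 1,000,000.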

FIt-SNE of 1.3 million mouse brain cells.

The scRNA-seq dataset consisting of 1.3 million cells from the cortex, hippocampus, and ventricular zones of embryonic day 18 mouse brains was downloaded from the 10X Genomics website and processed using the normalization and filtering steps of Zheng et al.,[14] as implemented by the Python package scanpy.[15] Scanpy was also used to compute a neighborhood graph of the observations using a Gaussian kernel with adaptive widths, and the points were then clustered using the Louvain method. Subsequent analysis of this dataset was performed in R. FIt-SNE of all 1,306,127 cells was computed with 4,000 iterations of gradient descent (2,000 of them being early exaggeration iterations) and other parameters set to defaults. FIt-SNE with the same parameters was also run on a random subset of 50,000 cells. We sought to identify known cell types from the Allen Brain Atlas (http://celltypes.brain-map.org/rnaseq/mouse) in the embedding, and gave two examples of cell populations (see Supplementary Table 9 of Tasic et al. (2018)[3]) that could be identified in the full dataset, but not in the downsampled embedding.

t-SNE heatmap of retinal cells.

The scRNA-seq retinal cell data of Shekhar et al. (2016)[11] were downloaded from GEO (GSE81905). The digital expression matrix was preprocessed using the code provided by the authors of the original publication (https://github.com/broadinstitute/BipolarCell2016). In short, libraries containing more than 10% mitochondrially derived transcripts were removed, cells with ≤ 500 genes were removed, as were genes with expression in ≤ 30 cells or having ≥ 60 transcripts, resulting in 13,166 genes and 27,499 cells. Finally, the data were median normalized and log-transformed, and the genes were Z-scored. The top 37 principal components were computed and used as input to 1D FIt-SNE with perplexity 30 for 1,000 iterations. The t-SNE heatmap (Figure 2) was then computed as described in the main text, with the marker genes (Tacr3, Rcvrn, Syt2, Irx5, Irx6, Vsx1, Hcn4, Grik1, Gria1, Kcng4, Hcn1, Cabp5, Grm6, Isl1, Scgn, Otx2, Vsx2, Car8, Sebox, Prkca) from Shekhar et al. (2016)[11] listed in Supplemental Table 2. Each marker gene was enriched with the 25 genes with the most similar expression patterns. Genes associated with each cluster in the 2D embedding were obtained by running dbscan on the 2D t-SNE with the settings ϵ = 2 and a minimum number of points of 40. For each cluster i, a “metagene” c of length 27,499 was generated, where c(k) = 1 if the kth cell is in the ith cluster and c(k) = 0 otherwise. These vectors were then treated as “genes” and enriched in the same fashion as the marker genes.
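The construction of the cluster “metagenes” can be illustrated with a short Python sketch. The function name `cluster_metagenes` is hypothetical, and the convention that the label -1 marks unclustered noise points is an assumption borrowed from scikit-learn's DBSCAN; other implementations may encode noise differently.

```python
import numpy as np

def cluster_metagenes(labels):
    """Return one 0/1 indicator vector per cluster.

    Assumes the label -1 marks noise points that belong to no cluster.
    """
    labels = np.asarray(labels)
    clusters = [c for c in np.unique(labels) if c != -1]
    # c(k) = 1 if the k-th cell is in the cluster, 0 otherwise
    return {int(c): (labels == c).astype(int) for c in clusters}
```

Each returned vector has one entry per cell and can be treated like a gene expression vector when searching for genes with similar patterns.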

Comparing approximate nearest neighbors and VP trees on scRNA-seq data.

To evaluate the effect of approximate nearest neighbors on the embedding quality of scRNA-seq data, we compared the resulting embeddings on several scRNA-seq datasets where labels are predetermined by other sources. For each dataset, we also computed the 1-nearest neighbor error (1N error), defined as the percentage of cells for which the closest cell in the embedding has a different label. We did the comparison on the 1.3 million mouse brain cells from above, purified PBMC populations from Zheng et al. (2017),[14] and mouse visual cortex cells from Hrvatin et al. (2018).[16] Filtered expression matrices for FACS-purified peripheral blood mononuclear cell (PBMC) populations were downloaded from the 10X website[14] and concatenated into a single expression matrix. The matrix was filtered to include cells expressing more than 400 genes and genes expressed in more than 100 cells, resulting in a matrix with 83,992 cells and 12,776 genes. Purified CD4 helper T cells and cytotoxic T cells were removed, as they (by definition) are supersets of some of the other subtypes, leaving 64,664 cells. After library and log normalization, the top 25 principal components (PCs) were computed using randomized SVD.[24] FIt-SNE using VP trees and approximate nearest neighbors was computed on the PCs, and the embeddings were qualitatively compared in Figure S4. The scRNA-seq expression matrix of mouse visual cortex cells from Hrvatin et al.[16] was obtained from GEO (GSE102827). Genes with mean expression less than 0.00003 and non-zero expression in fewer than 4 cells were excluded, resulting in a matrix with 65,539 cells and 19,155 genes. The cells were further subsetted to those assigned to subtypes, resulting in 48,266 cells. After library and log normalization, the top 25 principal components were computed using randomized SVD. FIt-SNE using VP trees and approximate nearest neighbors was then computed on the PCs, and the embeddings were compared in Figure S5.
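The 1N error defined above can be computed with a brute-force sketch such as the following. The function name `one_nn_error` is hypothetical, and the dense pairwise distance matrix makes this suitable only for small examples, not for the dataset sizes discussed here.

```python
import numpy as np

def one_nn_error(embedding, labels):
    """Percentage of points whose nearest neighbor in the embedding has a different label."""
    emb = np.asarray(embedding, dtype=float)
    labels = np.asarray(labels)
    # Dense matrix of squared Euclidean distances between all pairs of points
    d2 = ((emb[:, None, :] - emb[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)              # a point is not its own neighbor
    nearest = d2.argmin(axis=1)
    return 100.0 * np.mean(labels[nearest] != labels)
```

For datasets with many cells, a tree-based or approximate nearest neighbor search would be needed in place of the dense distance matrix.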

Code Availability

FIt-SNE is available at https://github.com/KlugerLab/FIt-SNE. The code for all experiments is available on request and will be publicly available at https://github.com/KlugerLab/FIt-SNE-paper on publication.

Data Availability

The 1.3 million mouse brain cell dataset and the FACS-purified PBMCs of Zheng et al.[14] can be downloaded from the 10X Genomics website (https://support.10xgenomics.com/single-cell-gene-expression/datasets/). Two other public scRNA-seq datasets from the NCBI Gene Expression Omnibus (GEO) were used: Hrvatin et al. (GSE102827) and Shekhar et al. (GSE81905).

Algorithm 1: FFT-accelerated Interpolation-based t-SNE (FIt-SNE)

Input: Collection of points $\{y_i\}_{i=1}^{N}$, source strengths $\{q_i\}_{i=1}^{N}$, number of intervals $N_{\mathrm{int}}$, number of interpolation points per interval $p$

Output: $\phi(y_i) = \sum_{j=1}^{N} K(y_i, y_j)\, q_j$ for $i = 1, 2, \ldots, N$

1. For each interval $I_\ell$, form the equispaced nodes $\tilde{y}_{j,\ell}$, $j = 1, 2, \ldots, p$, given by eq. (7)
2. for $\ell \leftarrow 1$ to $N_{\mathrm{int}}$ do
3.     Compute the coefficients $w_{m,\ell}$ given by $w_{m,\ell} = \sum_{y_i \in I_\ell} L_{m,\tilde{y}_\ell}(y_i)\, q_i$, $m = 1, 2, \ldots, p$.
4. end
5. Use the fast Fourier transform to compute the values of $v_{m,n}$ given by

   $$\begin{bmatrix} v_{1,1} \\ v_{2,1} \\ \vdots \\ v_{p-1,N_{\mathrm{int}}} \\ v_{p,N_{\mathrm{int}}} \end{bmatrix} = \tilde{K} \cdot \begin{bmatrix} w_{1,1} \\ w_{2,1} \\ \vdots \\ w_{p-1,N_{\mathrm{int}}} \\ w_{p,N_{\mathrm{int}}} \end{bmatrix}, \tag{10}$$

   where $\tilde{K}$ is the Toeplitz matrix given by

   $$\tilde{K}_{i,j} = K(\tilde{y}_i, \tilde{y}_j), \quad i, j = 1, 2, \ldots, N_{\mathrm{int}} \cdot p. \tag{11}$$

6. for $\ell \leftarrow 1$ to $N_{\mathrm{int}}$ do
7.     Compute $\phi(y_i)$ at all points $y_i \in I_\ell$ via $\phi(y_i) = \sum_{j=1}^{p} L_{j,\tilde{y}_\ell}(y_i)\, v_{j,\ell}$
8. end
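The steps of Algorithm 1 can be traced in a small one-dimensional Python sketch. Several details here are assumptions made for illustration: the function names `interp_kernel_sum` and `cauchy`, the placement of the equispaced nodes at the midpoints of p equal subintervals (eq. (7) is not reproduced in this section), the choice of kernel, and the use of a dense matrix-vector product in step 5 where the algorithm uses the FFT.

```python
import numpy as np

def cauchy(a, b):
    """Illustrative kernel K(y, z) = 1 / (1 + |y - z|^2)."""
    return 1.0 / (1.0 + (a - b) ** 2)

def interp_kernel_sum(y, q, n_int, p, kernel):
    """Approximate phi(y_i) = sum_j K(y_i, y_j) q_j by polynomial interpolation."""
    y = np.asarray(y, dtype=float)
    q = np.asarray(q, dtype=float)
    lo, hi = y.min(), y.max()
    width = (hi - lo) / n_int
    if width == 0.0:
        width = 1.0                               # all points coincide
    h = width / p                                 # spacing between adjacent nodes
    # Step 1: equispaced nodes; node j of interval ell has global index ell * p + j
    nodes = lo + h / 2 + h * np.arange(n_int * p)
    ell = np.minimum(((y - lo) / width).astype(int), n_int - 1)
    local = nodes.reshape(n_int, p)[ell]          # (N, p) nodes of each point's interval
    # Lagrange basis values L[i, m] of point i on the nodes of its interval
    L = np.ones((len(y), p))
    for m in range(p):
        for j in range(p):
            if j != m:
                L[:, m] *= (y - local[:, j]) / (local[:, m] - local[:, j])
    idx = ell[:, None] * p + np.arange(p)         # (N, p) global node indices
    # Steps 2-4: spread the source strengths onto the nodes
    w = np.zeros(n_int * p)
    np.add.at(w, idx, L * q[:, None])
    # Step 5: node-to-node interaction (dense product here; FFT in the algorithm)
    v = kernel(nodes[:, None], nodes[None, :]) @ w
    # Steps 6-8: interpolate from the nodes back to the points
    return (L * v[idx]).sum(axis=1)
```

Because the nodes are equispaced across all intervals, the node-to-node matrix is Toeplitz for any kernel that depends only on the difference of its arguments, which is what allows the dense product in this sketch to be replaced by an FFT-based one.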

Algorithm 2: Out-of-Core PCA (oocPCA)

Input: Matrix $A$ of size $m \times n$ stored in slow memory, non-negative integers its, $k$, $l$, $b$, where $0 < k \le l < \min(m, n)$, and $l$ defaults to $k + 2$

Output: Orthonormal $U$ of size $m \times k$, non-negative diagonal matrix $\Sigma$ of size $k \times k$, orthonormal $V$ of size $n \times k$, such that $A \approx U \Sigma V^*$

1. Generate uniform random matrix $\Omega$ of size $n \times l$
2. Form $Y_0 = A\Omega$ block-wise, $b$ rows at a time
3. Renormalize with LU factorization $L_0 U_0 = Y_0$
4. for $i \leftarrow 1$ to its do
5.     Form $Y_i = A A^* L_{i-1}$ block-wise, $b$ rows at a time
6.     if $i <$ its then
7.         Renormalize with LU factorization $L_i U_i = Y_i$
8.     end
9. end
10. Renormalize with QR factorization $QR = Y_i$
11. Compute SVD of small matrix $U' \Sigma V^* = Q^* A$
12. Set $U = QU'$

9 in total

1. Exponential scaling of single-cell RNA-seq in the past decade.

Authors: Valentine Svensson; Roser Vento-Tormo; Sarah A Teichmann
Journal: Nat Protoc Date: 2018-03-01 Impact factor: 13.491

2. Comprehensive Classification of Retinal Bipolar Neurons by Single-Cell Transcriptomics.

Authors: Karthik Shekhar; Sylvain W Lapan; Irene E Whitney; Nicholas M Tran; Evan Z Macosko; Monika Kowalczyk; Xian Adiconis; Joshua Z Levin; James Nemesh; Melissa Goldman; Steven A McCarroll; Constance L Cepko; Aviv Regev; Joshua R Sanes
Journal: Cell Date: 2016-08-25 Impact factor: 41.582

3. Algorithm 971: An Implementation of a Randomized Algorithm for Principal Component Analysis.

Authors: Huamin Li; George C Linderman; Arthur Szlam; Kelly P Stanton; Yuval Kluger; Mark Tygert
Journal: ACM Trans Math Softw Date: 2017-01 Impact factor: 1.704

4. Categorical Analysis of Human T Cell Heterogeneity with One-Dimensional Soli-Expression by Nonlinear Stochastic Embedding.

Authors: Yang Cheng; Michael T Wong; Laurens van der Maaten; Evan W Newell
Journal: J Immunol Date: 2015-12-14 Impact factor: 5.422

5. Shared and distinct transcriptomic cell types across neocortical areas.

Authors: Bosiljka Tasic; Zizhen Yao; Lucas T Graybuck; Kimberly A Smith; Thuc Nghi Nguyen; Darren Bertagnolli; Jeff Goldy; Emma Garren; Michael N Economo; Sarada Viswanathan; Osnat Penn; Trygve Bakken; Vilas Menon; Jeremy Miller; Olivia Fong; Karla E Hirokawa; Kanan Lathia; Christine Rimorin; Michael Tieu; Rachael Larsen; Tamara Casper; Eliza Barkan; Matthew Kroll; Sheana Parry; Nadiya V Shapovalova; Daniel Hirschstein; Julie Pendergraft; Heather A Sullivan; Tae Kyung Kim; Aaron Szafer; Nick Dee; Peter Groblewski; Ian Wickersham; Ali Cetin; Julie A Harris; Boaz P Levi; Susan M Sunkin; Linda Madisen; Tanya L Daigle; Loren Looger; Amy Bernard; John Phillips; Ed Lein; Michael Hawrylycz; Karel Svoboda; Allan R Jones; Christof Koch; Hongkui Zeng
Journal: Nature Date: 2018-10-31 Impact factor: 49.962

6. Massively parallel digital transcriptional profiling of single cells.

Authors: Grace X Y Zheng; Jessica M Terry; Phillip Belgrader; Paul Ryvkin; Zachary W Bent; Ryan Wilson; Solongo B Ziraldo; Tobias D Wheeler; Geoff P McDermott; Junjie Zhu; Mark T Gregory; Joe Shuga; Luz Montesclaros; Jason G Underwood; Donald A Masquelier; Stefanie Y Nishimura; Michael Schnall-Levin; Paul W Wyatt; Christopher M Hindson; Rajiv Bharadwaj; Alexander Wong; Kevin D Ness; Lan W Beppu; H Joachim Deeg; Christopher McFarland; Keith R Loeb; William J Valente; Nolan G Ericson; Emily A Stevens; Jerald P Radich; Tarjei S Mikkelsen; Benjamin J Hindson; Jason H Bielas
Journal: Nat Commun Date: 2017-01-16 Impact factor: 14.919

7. Single-cell analysis of experience-dependent transcriptomic states in the mouse visual cortex.

Authors: Sinisa Hrvatin; Daniel R Hochbaum; M Aurel Nagy; Marcelo Cicconet; Keiramarie Robertson; Lucas Cheadle; Rapolas Zilionis; Alex Ratner; Rebeca Borges-Monroy; Allon M Klein; Bernardo L Sabatini; Michael E Greenberg
Journal: Nat Neurosci Date: 2017-12-11 Impact factor: 24.884

8. SCANPY: large-scale single-cell gene expression data analysis.

Authors: F Alexander Wolf; Philipp Angerer; Fabian J Theis
Journal: Genome Biol Date: 2018-02-06 Impact factor: 13.583

9. heatmaply: an R package for creating interactive cluster heatmaps for online publishing.

Authors: Tal Galili; Alan O'Callaghan; Jonathan Sidi; Carson Sievert
Journal: Bioinformatics Date: 2018-05-01 Impact factor: 6.937
