
Using the Kriging Correlation for unsupervised feature selection problems.

Cheng-Han Chua, Meihui Guo, Shih-Feng Huang.

Abstract

This paper proposes a KC Score to measure feature importance in clustering analysis of high-dimensional data. The KC Score evaluates the contribution of features based on the correlation between the original features and the reconstructed features in the low dimensional latent space. A KC Score-based feature selection strategy is further developed for clustering analysis. We investigate the performance of the proposed strategy by conducting a study of four single-cell RNA sequencing (scRNA-seq) datasets. The results show that our strategy effectively selects important features for clustering. In particular, in three datasets, our proposed strategy selected less than 5% of the features and achieved the same or better clustering performance than when using all of the features.
© 2022. The Author(s).


Year:  2022        PMID: 35798813      PMCID: PMC9263137          DOI: 10.1038/s41598-022-15529-4

Source DB:  PubMed          Journal:  Sci Rep        ISSN: 2045-2322            Impact factor:   4.996


Introduction

Feature selection for unsupervised learning is a challenging problem. In this study, we propose a Kriging-Correlation (KC) Score, which integrates the Automatic Fixed Rank Kriging (AutoFRK)[1] method with a correlation analysis, to measure feature importance in clustering analysis. A KC Score-based feature selection strategy is further developed for extracting important features for high-dimensional clustering analysis. The feature selection procedure includes three main steps: calculating the importance score of each feature, ordering the features by their scores, and deciding the number of features to be selected. In addition, the proposed strategy suggests an appropriate kernel to enhance clustering accuracy and efficiency.

To investigate the performance of the proposed strategy, we study four single-cell RNA sequencing (scRNA-seq) datasets to extract the critical features for cell type clustering analysis. Cell type identification has many applications, including helping to understand how different cells function and interact. A classical and straightforward approach is to assign cell types by micromanipulation, but this method is usually time-consuming and risks mislabeling. Recently, a data-driven approach, single-cell interpretation via multi-kernel learning (SIMLR)[2], was proposed to cluster cells based on single-cell RNA sequences and then identify the cell type of each cluster. This approach not only saves time by not requiring cell-by-cell identification, but it is also able to identify some previously undiscovered cells associated with cancer[3].

For scRNA-seq data, the number of samples n is very small compared to the number of genes (features) p. A dimension reduction method is often employed to circumvent this low n/p ratio when conducting clustering analysis. The SIMLR study accomplished dimension reduction through the well-known t-distributed stochastic neighbor embedding (t-SNE)[4] method. Since gene expression in scRNA-seq data usually contains dropout events (zero measurements), SIMLR adopts a multi-resolution Gaussian kernel to construct the similarity matrix used in t-SNE. After conducting clustering analysis in the latent space obtained by t-SNE with SIMLR, we propose a strategy to find the genes that are important for cell clustering. The results show that our strategy effectively selects important features for clustering. In particular, in three datasets, less than 5% of the features are selected by our proposed strategy, yet the classification accuracy and Normalized Mutual Information (NMI) are the same as or better than when using all features. Furthermore, for all four datasets, the KC Score achieves comparable or better NMI than the Laplacian Score[5], one of the well-known locality-preserving filtering methods for unsupervised feature selection.

Results

Datasets

In this study, we apply the proposed method to four published scRNA-seq datasets, which record gene expression for different kinds of cells. The number of subjects n, the number of genes (features) p, the number of classes, and a description of each dataset are given in Table 1, from which one can see the aforementioned low n/p ratio in each dataset. The datasets analyzed in this study are available in the Supplementary information section at https://www.nature.com/articles/nmeth.4207.
Table 1

Description of four scRNA-seq datasets.

Data name   n     p        Class #   Description
mECS        182   8989     3         Embryonic stem cells under different cell cycle stages
Kolod       704   13,473   3         Pluripotent cells under different environment conditions
Pollen      249   6982     11        Eleven cell populations including neural cells and blood cells
Usoskin     622   17,772   4         Neuronal cells with sensory subtypes

Feature selection procedure

Let Z^M denote the projection of the original data X (see Eq. (1)) onto the 2-dimensional latent space by t-SNE with the similarity matrix obtained via SIMLR in Eq. (3), where the superscript M represents results generated from SIMLR. Let ŷ^M be the clustering label vector of the cells obtained by applying k-means to Z^M. For each gene, we calculate its Laplacian Score from the SIMLR similarity matrix and its KC Score from Z^M; details are given in the "Methods" section. Let V_k be the collection of features with the k highest KC Scores; hence X(V_k) is a submatrix of X with dimension n x k. Let S^G(V_k) be the similarity matrix estimated by the single-Gaussian-kernel method based on X(V_k), where the superscript G represents results generated from a single Gaussian kernel. Our goal is to find the genes that play important roles in clustering by their KC Scores. The details (Strategy A) are given below.

[Strategy A]
1. For each k, apply t-SNE to X(V_k) with S^G(V_k) and denote the associated latent space projection by Z^G(V_k). Then apply k-means to Z^G(V_k) to obtain the clustering label vector ŷ^G(V_k).
2. For each k, calculate Pillai's trace in MANOVA for Z^G(V_k) and ŷ^G(V_k) and denote the result by T_k. Let k_1 = argmax_k T_k.
3. To further prune V_{k1}, find the smallest k_2 <= k_1 such that the NMI between ŷ^G(V_k) and ŷ^G(V_{k1}) remains greater than 0.95 for all k_2 <= k <= k_1.
4. If the NMI between ŷ^G(V_{k2}) and ŷ^M exceeds the threshold, output V_{k2} and Z^G(V_{k2}); otherwise, go back to steps 1-3 with the superscript G replaced by M and output V_{k2} and Z^M(V_{k2}).

Figure 1 presents the flow chart of the above procedure. In step 1, the reasons for using the single Gaussian kernel are twofold: it requires lower computational costs than SIMLR, and the clustering performances of the two are comparable when the number of critical genes (k_2 in step 3) is small. In step 2, we decide the initial feature set V_{k1} by maximizing Pillai's trace statistic in MANOVA, which corresponds to the ratio of between-group to within-group variance. Since V_{k1} may still contain some irrelevant or non-significant features, we adopt a pruning step in Strategy A. Pruning is a widely used technique to further refine the selected features in model selection[6] and in tree-based machine learning methods[7]. It aims to reduce variance and avoid overfitting by deleting irrelevant or non-significant features. Therefore, in step 3 of Strategy A, we prune the set V_{k1} to V_{k2} such that, after dropping unimportant genes from V_{k1}, the NMIs between the resulting clusterings and ŷ^G(V_{k1}) remain greater than 0.95. In step 4, we check the adequacy of the single-Gaussian-kernel results by the NMI between ŷ^G(V_{k2}) and ŷ^M, which decides whether to repeat steps 1-3 with the superscript G replaced by M. Similarly, to find the set of critical genes selected by the Laplacian Scores, denoted by V*_{k2}, we only need to replace the KC Scores with the Laplacian Scores in the above procedure.
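The score-ranking and k-scan at the core of steps 1-2 of Strategy A can be sketched in a few lines. Here `quality` stands in for the full t-SNE / k-means / Pillai's-trace evaluation of the top-k features; the function names and toy data are illustrative, not the paper's code:

```python
import numpy as np

def select_features(scores, quality, ks):
    """Rank features by importance score, then keep the top-k1 features,
    where k1 maximizes a cluster-separation statistic over candidate k's
    (Pillai's trace in Strategy A)."""
    order = np.argsort(scores)[::-1]          # features, best score first
    stats = [quality(order[:k]) for k in ks]  # statistic of each top-k subset
    k1 = ks[int(np.argmax(stats))]
    return order[:k1]

# toy run: the statistic peaks when the three highest-scored features are kept
scores = np.array([0.9, 0.1, 0.8, 0.7, 0.2])
quality = lambda idx: 1.0 if len(idx) == 3 else 0.5
print(select_features(scores, quality, ks=[1, 2, 3, 4, 5]))
```

In the paper's procedure, `quality` would embed t-SNE on X(V_k), k-means, and MANOVA, and the pruning step and the kernel check of steps 3-4 would follow.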
Figure 1

Flow chart of Strategy A.


Performance of Strategy A

We mainly use two metrics, NNA[2] (under a supervised setting) and NMI[2] (under an unsupervised setting), to compare the classification and clustering performance based on the latent space projections of the selected feature subsets and of the full data. Since a good latent space projection should facilitate distance-based classifiers, NNA measures the goodness of the distance induced by a latent space projection: when one projection has a greater NNA than another, the former is more efficient for classification, and vice versa. We use NMI to evaluate the consistency between the obtained clustering and the true labels; a higher NMI indicates a better clustering result. In addition to NNA in a supervised setting, we also adopt a random forest to evaluate the classification performances of different methods: we calculate a random forest classifier's average classification accuracy, denoted by RFA, under a 5-fold cross-validation framework. Table 2 reports the number of critical genes after pruning, k_2, the ratios k_2/k_1 and k_2/p, and the corresponding NNA, RFA, and NMI based on Z(V_{k2}), Z(V*_{k2}), and Z^S(X) for the four datasets. The results in Table 2 show that, in most cases, Strategy A selects only a small percentage of the features, yet achieves NNA, RFA, and NMI comparable to or even better than using all features. In view of k_2/k_1, Strategy A pruned over 25% of the features in V_{k1} for the mECS, Kolod, and Usoskin datasets. In particular, for the Kolod dataset, Strategy A requires only 2 genes, accounting for 0.01% of the features, to achieve the same performance as that based on all features. Also, for the Usoskin dataset, the NNA, RFA, and NMI based on the KC and Laplacian Scores are higher than the benchmark.
Table 2

k_2, k_2/k_1, k_2/p, NNA, RFA and NMI based on Z(V_{k2}), Z(V*_{k2}) and Z^S(X) for the four data sets.

Data set              Latent space projection   k2     k2/k1 (%)   k2/p (%)   NNA    RFA    NMI
mECS (p = 8989)       Z^S(V_{k2})               2860   50.49       31.82      0.97   0.96   0.84
                      Z^S(V*_{k2})              5595   69.89       62.24      0.95   0.96   0.85
                      Z^S(X)                    -      -           -          0.95   0.95   0.89
Kolod (p = 13473)     Z^G(V_{k2})               2      8           0.01       1.00   1.00   1.00
                      Z^G(V*_{k2})              10     28.57       0.07       1.00   1.00   1.00
                      Z^S(X)                    -      -           -          1.00   1.00   0.99
Pollen (p = 6982)     Z^G(V_{k2})               225    100         3.22       0.98   0.98   0.94
                      Z^G(V*_{k2})              115    100         1.65       0.98   0.98   0.91
                      Z^S(X)                    -      -           -          0.98   0.95   0.95
Usoskin (p = 17772)   Z^G(V_{k2})               65     41.94       0.37       0.99   0.99   0.96
                      Z^G(V*_{k2})              55     73.33       0.31       0.98   0.98   0.93
                      Z^S(X)                    -      -           -          0.94   0.96   0.74
To further evaluate the performance of the KC and Laplacian Scores, we compare the NMIs based on the two gene sets selected respectively by the KC and Laplacian Scores in Table 2; see Fig. 2. The results show that the KC Score attains comparable or better NMI than the Laplacian Score for all four datasets and both selected gene sets.
Figure 2

The NMIs based on the two gene sets selected respectively by the KC and Laplacian Scores in Table 2.


Comparison of single and multi-resolution Gaussian kernels

Figure 3 illustrates why the first step of Strategy A adopts the similarity matrix estimated by the single-Gaussian-kernel method (S^G) rather than the one estimated by the multi-resolution Gaussian kernels (S^M). Figure 3 presents the curves of NNA in (a)-(d) and NMI in (e)-(h) versus the feature size k (log scale) for the four datasets. Each subfigure plots the curves of the KC Score with a single Gaussian kernel (red), the KC Score with SIMLR (yellow), the Laplacian Score with a single Gaussian kernel (blue), and the Laplacian Score with SIMLR (green). For the Kolod, Pollen, and Usoskin datasets, the KC Score and Laplacian Score with a single Gaussian kernel perform better than their counterparts with SIMLR. The KC Score with a single Gaussian kernel reaches the highest NNA and NMI much faster than the other three methods, especially for the Kolod dataset. The corresponding highest NNA and NMI occur at k*, where k* denotes the minimal k at which the NNA or NMI attains the highest peak of each method in Fig. 3. In contrast, the KC Score and Laplacian Score with SIMLR perform better than the single Gaussian kernel for the mECS dataset, and the highest NNA and NMI of the KC Score with SIMLR are reached at k* values larger than those of the other three methods. Hence, the KC Score with a single Gaussian kernel is recommended when the number of critical genes is small, and the KC Score with SIMLR is recommended when the number is large. One possible explanation of this phenomenon is that the multi-resolution Gaussian kernels are designed to collect a larger set of features than a single Gaussian kernel, since different kernels might highlight different critical features. Nevertheless, if the number of features helpful for classification is small, a single kernel might suffice to identify these genes. This might explain why our numerical experiments reveal that the KC Score with a single Gaussian kernel performs better than the multi-kernel approach when k* is small. Moreover, each subfigure in Fig. 3 is also marked with the k_2 values of the KC Score and the Laplacian Score obtained from Table 2. The NMI and NNA at k_2 are higher than those at most of the other k values in each dataset, which indicates that the features recommended by Strategy A produce satisfactory clustering performances for the four datasets.
Figure 3

The NNA and NMI curves of Strategy A against different numbers (log scale) of genes based on the KC Score (single Gaussian kernel; SIMLR) versus the Laplacian Score (single Gaussian kernel; SIMLR) for the four scRNA-seq datasets, where the circles in each subplot denote the locations of k_2.


Discussion

This study proposes a KC Score to measure feature importance. The KC Score measures the correlation between the original gene expressions and the associated reconstructed gene expressions based on the latent space obtained from SIMLR and t-SNE. A feature selection strategy is also developed for the KC Score or Laplacian Score to select the critical genes, and it is applied to four datasets. The results show that, when there are few critical genes, the latent space based on the KC Score and a single Gaussian kernel performs best; in contrast, the latent space based on SIMLR is recommended when the number of critical genes is relatively large. In particular, for the Kolod dataset, Strategy A with the KC Score selects only two critical genes and achieves perfect clustering, meaning that the corresponding NMI is 1. To gain more insight into how these two selected genes produce perfect clustering, Fig. 4 shows the scatter plot of the expressions of the first two critical genes (the 9708th and 11221st). In Fig. 4, if both gene expressions exhibit dropout, the cells fall in Class 1 (red points); if only the 9708th gene expression is dropped, the cells fall in Class 2 (green points); and if neither gene expression is dropped, the cells fall in Class 3 (blue points). As the figure shows, the three classes are separated perfectly by these three dropout patterns. This phenomenon shows that dropout patterns can provide helpful and informative signals for scRNA-seq clustering. Other studies have reported similar findings that dropout patterns may be a helpful signal in single-cell data analysis[8-10]. Nonetheless, dropout patterns may be less informative when they are very dispersed, as is common in other areas such as microbiome data.
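The three dropout patterns described above amount to a two-gene decision rule. The following is an illustrative reimplementation of that rule, not the paper's code:

```python
import numpy as np

def dropout_class(g1, g2):
    """Class from the dropout (zero) pattern of two gene expressions:
    both dropped -> Class 1, only the first dropped -> Class 2,
    neither dropped -> Class 3 (rule as described for the Kolod data;
    a cell with only the second gene dropped is not described in the
    paper, and this rule assigns it Class 3)."""
    if g1 == 0 and g2 == 0:
        return 1
    if g1 == 0:
        return 2
    return 3

# toy cells: rows are (gene 9708, gene 11221) expression values
cells = np.array([[0.0, 0.0], [0.0, 5.2], [3.1, 4.4]])
print([dropout_class(g1, g2) for g1, g2 in cells])
```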
Figure 4

Scatter plot of the expressions of the first two critical genes (the 9708th and 11221st) for the Kolod dataset.

Moreover, concerning the NMIs in Table 2 for the Usoskin dataset, the clustering performance improves markedly, from 0.74 to over 0.93, after feature selection. To visualize this finding, Fig. 5 shows the projections and clustering results in the three 2-dimensional latent spaces obtained by t-SNE: (a) Z^G(V_{k2}), (b) Z^G(V*_{k2}), and (c) Z^S(X). In Fig. 5, the clustering performances of (a) and (b) are comparable. However, compared to (a) and (b), the separability among the four classes in (c) is relatively low, especially for the cells in Class 4 (purple points), which are incorrectly divided into two groups. This finding highlights that identifying and using critical features for clustering is more effective than using all features.
Figure 5

Three 2-dimensional latent spaces obtained by t-SNE for the Usoskin dataset: (a) Z^G(V_{k2}), (b) Z^G(V*_{k2}), and (c) Z^S(X).

In the procedure of Strategy A, we adopted the unsupervised learning method SIMLR to cluster the cells. SIMLR requires many matrix calculations of size n x n, where n denotes the number of cells, so the computational cost becomes very expensive for large n, which is a limitation of Strategy A. Table 3 presents the running hours spent in steps 1 and 2 of Strategy A when applying different methods to the four datasets. All computations were conducted on servers with a 2.40 GHz CPU, an NVIDIA-SMI 430.50 GPU, and 126 GB RAM. One can see that the running hours of computing the KC Score and Laplacian Score with SIMLR are roughly proportional to n^2. Hence, the running hours for the Kolod (n = 704) and Usoskin (n = 622) datasets increase dramatically compared to the other two datasets with smaller n. In addition, the running hours of SIMLR are substantially longer than those of the associated single-Gaussian-kernel method. Consequently, if the heavy computational burden of SIMLR for large n can be overcome in the future, this limitation of Strategy A could be removed.
Table 3

The running hours spent in steps 1 and 2 of Strategy A when applying different methods to the four datasets.

Dataset   n     KC Score (single Gaussian kernel)   KC Score (SIMLR)   Laplacian Score (single Gaussian kernel)   Laplacian Score (SIMLR)
mECS      182   0.09                                5.90               0.09                                       5.93
Kolod     704   2.76                                64.17              2.70                                       68.09
Pollen    249   0.12                                7.21               0.12                                       7.33
Usoskin   622   3.39                                76.02              3.46                                       77.43
Figure 6 presents the MANOVA Pillai's trace statistic curves for the four datasets. From the figure, one can see that the statistic is not a monotonic function of k; in particular, the curves tend to decrease once the number of features is large enough in the Kolod, Pollen, and Usoskin datasets. Moreover, the curves in Fig. 6 are not smooth around their peaks and fluctuate severely beyond them in the mECS, Pollen, and Usoskin datasets. Suppose we pruned genes via a method similar to step 3 of Strategy A, with the Pillai's trace statistic in place of the NMI. In that case, we would need to develop a new test statistic to decide the critical value for selecting k_2. Doing so is beyond the scope of this study, and we leave it to future work.
Figure 6

The MANOVA Pillai's trace statistic curves of Strategy A against different numbers (log scale) of genes based on the KC Score (single Gaussian kernel; SIMLR) versus the Laplacian Score (single Gaussian kernel; SIMLR) for the four scRNA-seq datasets, where the circles denote the locations of the selected k's for each method.

In addition, we investigate the performance of Strategy A on imbalanced data by examining the most imbalanced of the four datasets, Pollen. The Pollen dataset consists of 11 classes, one of which contains only 3% of the cells in the dataset. Figure 7 shows the confusion table of the clustering result of Z^G(V_{k2}) in Table 2, where the clusters are rearranged to attain the highest accuracy. The results in Fig. 7 reveal that the sensitivity of the 5th class is indeed affected by its low proportion of cells in the dataset. In general, the classification problem for imbalanced data is challenging, even in supervised learning. Since this study considers an unsupervised learning task, identifying a class with a small percentage of observations is even more difficult. Therefore, the low sensitivity for the class with few cells revealed in Fig. 7 is not surprising. Further studies to improve the sensitivity in this scenario, in both supervised and unsupervised learning, are needed.
Figure 7

The confusion table of the clustering result of Z^G(V_{k2}) for the Pollen dataset, where the clusters are rearranged to attain the highest accuracy.

Note that the performance of Strategy A with the KC Score relies heavily on the classes being separable in the projected latent space, because the KC Score is calculated from reconstructions of the projections in the latent space. Overcoming this limitation is necessary to extend the applicability of the proposed method. Also, step 3 of Strategy A filters out the most non-essential genes iteratively until a stopping criterion is met, so there is still room for further refinement in future studies.

Methods

Feature ranking criteria

Laplacian Score

The following notation is used for the data matrix X = [x_{(1)}, ..., x_{(p)}], where x_{(k)} denotes the n observations of the kth feature and x_i denotes the p features of the ith observation. The Laplacian Score is a well-known unsupervised feature ranking method based on a similarity matrix[5]. Given a similarity matrix S, the Laplacian Score of the kth feature is defined as

L_k = (f_k' L f_k) / (f_k' D f_k),

where D = diag(S 1_n), L = D - S, and f_k = x_{(k)} - (x_{(k)}' D 1_n / 1_n' D 1_n) 1_n. The smaller the Laplacian Score, the more important the feature.

The similarity matrix of the SIMLR method, denoted by S^M, is constructed via the following optimization objective function and constraints:

min over S, w, L of  sum_{i,j} sum_l w_l ||phi_l(x_i) - phi_l(x_j)||^2 S_{ij} + beta ||S||_F^2 + gamma tr(L'(I_n - S)L) + rho sum_l w_l log w_l,
subject to  sum_l w_l = 1, w_l >= 0;  L'L = I_C;  sum_j S_{ij} = 1 and S_{ij} >= 0 for all i,

where phi_l(x_i) is the lth kernel-induced implicit mapping of the ith observation. The kernel-induced distance can be rewritten in terms of the kernel function,

||phi_l(x_i) - phi_l(x_j)||^2 = K_l(x_i, x_i) + K_l(x_j, x_j) - 2 K_l(x_i, x_j),

where the kernel functions are defined as

K_l(x_i, x_j) = (1 / (sqrt(2 pi) eps_ij)) exp(-||x_i - x_j||^2 / (2 eps_ij^2)),

with eps_ij = sigma (mu_i + mu_j) / 2 and mu_i the average distance from x_i to KNN(x_i), the set of the top k neighbors of the ith observation. In practice, the parameters k and sigma in K_l are obtained from the combinations of k in {10, 12, ..., 30} and sigma in {1, 1.25, ..., 2}, which results in 55 different kernels.
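As a concrete reference, the Laplacian Score can be computed directly from any similarity matrix. The following is a minimal numpy sketch; the toy similarity matrix and features are illustrative:

```python
import numpy as np

def laplacian_scores(X, S):
    """Laplacian Score of each column of X (n x p, observations by
    features) for an n x n similarity matrix S; smaller = more important."""
    d = S.sum(axis=1)                  # degrees
    D = np.diag(d)
    L = D - S                          # graph Laplacian
    scores = np.empty(X.shape[1])
    for k in range(X.shape[1]):
        f = X[:, k]
        f = f - (f @ d) / d.sum()      # remove the D-weighted mean
        scores[k] = (f @ L @ f) / (f @ D @ f)
    return scores

# observations 1 and 2 are highly similar; a feature that respects this
# similarity graph (column 0) scores lower than one that violates it
S = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
X = np.column_stack([[1.0, 1.1, 5.0],
                     [2.0, -3.0, 1.0]])
ls = laplacian_scores(X, S)
print(ls)
```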

Kriging-Correlation Score (KC Score)

The Kriging-Correlation Score (KC Score) proposed in this study aims to improve the efficiency of feature selection. We now introduce the algorithm for the KC Score. Let Z = [z_1, ..., z_n]' be the n projections in a d-dimensional latent space obtained by a dimension reduction method. Based on Z, we adopt the Automatic Fixed Rank Kriging (AutoFRK)[1] method to define the KC Score. The details are as follows:

Step 1: Use z_1, ..., z_n as the location inputs to generate the multi-resolution thin-plate spline basis matrix G of AutoFRK[1]. For each feature x_{(k)} in the data matrix X, fit the spatial random effect model

x_{(k)} = G eta_k + xi_k,

where eta_k is a vector of random effects and xi_k is a white-noise term, and obtain the fitted values.

Step 2: The KC Score of the kth feature is defined as the correlation between x_{(k)} and its fitted values obtained from Step 1. The larger the KC Score, the more important the feature.
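Since AutoFRK is an R package, a faithful reproduction is out of scope here; the sketch below conveys the idea of the KC Score with a plain quadratic polynomial basis standing in for the multi-resolution thin-plate spline basis (an assumed simplification, not the paper's model):

```python
import numpy as np

def kc_scores(X, Z):
    """KC-style score: correlation between each original feature (column
    of X) and its reconstruction from the 2-D latent coordinates Z.
    The basis G is a quadratic polynomial stand-in for AutoFRK's
    multi-resolution thin-plate spline basis."""
    z1, z2 = Z[:, 0], Z[:, 1]
    G = np.column_stack([np.ones(len(Z)), z1, z2, z1 * z2, z1**2, z2**2])
    coef, *_ = np.linalg.lstsq(G, X, rcond=None)   # fit all features at once
    Xhat = G @ coef                                # reconstructed features
    return np.array([abs(np.corrcoef(X[:, k], Xhat[:, k])[0, 1])
                     for k in range(X.shape[1])])

# a feature driven by the latent coordinates scores high; pure noise scores low
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 2))
X = np.column_stack([Z[:, 0] + Z[:, 1] ** 2, rng.normal(size=200)])
s = kc_scores(X, Z)
print(np.round(s, 2))
```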

MANOVA Pillai’s Trace statistic

Let G be the number of classes and ŷ_i be the cluster of observation i. The total Sum of Squares and Cross-Products (SSCP) matrix can be decomposed into 'between'- and 'within'-group parts. That is, T = B + W, where

B = sum_{g=1}^{G} n_g (zbar_g - zbar)(zbar_g - zbar)'  and  W = sum_{g=1}^{G} sum_{i: ŷ_i = g} (z_i - zbar_g)(z_i - zbar_g)',

with n_g the number of observations in cluster g, zbar_g the mean of cluster g, and zbar the grand mean. The Pillai's trace statistic is defined as

V = tr(B (B + W)^{-1}).
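The decomposition and the statistic are a few lines of numpy. In the toy projection below, the two clusters are well separated, so Pillai's trace approaches its maximum of min(d, G - 1) = 1:

```python
import numpy as np

def pillai_trace(Z, labels):
    """Pillai's trace tr(B (B + W)^{-1}) of a latent projection Z (n x d)
    against cluster labels, with B/W the between/within-group SSCP."""
    grand = Z.mean(axis=0)
    d = Z.shape[1]
    B, W = np.zeros((d, d)), np.zeros((d, d))
    for g in np.unique(labels):
        Zg = Z[labels == g]
        diff = (Zg.mean(axis=0) - grand)[:, None]
        B += len(Zg) * diff @ diff.T       # between-group SSCP
        C = Zg - Zg.mean(axis=0)
        W += C.T @ C                       # within-group SSCP
    return np.trace(B @ np.linalg.inv(B + W))

Z = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])
t = pillai_trace(Z, labels)
print(round(t, 3))
```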

Evaluation criteria

NNA. We consider a performance metric in a supervised setting, the Nearest Neighbor Accuracy (NNA). We use 5-fold cross-validation on the transformed matrix Z and its true labels y. In each trial, four folds serve as the training set and the remaining fold as the validation set. Each cell in the validation set is assigned the label of the training-set object with the smallest Euclidean distance to the target cell. The NNA is defined as the average accuracy over the five validation sets.

NMI. Normalized Mutual Information (NMI) evaluates the clustering consistency between two clusterings U and V. Let P and Q denote the numbers of labels in U and V, respectively, and let n_p = |U_p|, n_q = |V_q|, and n_{pq} = |U_p ∩ V_q|. The NMI is defined as

NMI(U, V) = I(U, V) / sqrt(H(U) H(V)),

where

I(U, V) = sum_{p=1}^{P} sum_{q=1}^{Q} (n_{pq}/n) log(n n_{pq} / (n_p n_q)),
H(U) = - sum_{p=1}^{P} (n_p/n) log(n_p/n),  H(V) = - sum_{q=1}^{Q} (n_q/n) log(n_q/n).
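The NMI can be implemented directly from two label vectors; the sqrt(H(U)H(V)) normalization below is the one given above:

```python
import numpy as np

def nmi(u, v):
    """Normalized Mutual Information I(U,V)/sqrt(H(U)H(V)) between two
    label vectors of length n."""
    n = len(u)
    I = 0.0
    for p in np.unique(u):
        for q in np.unique(v):
            npq = np.sum((u == p) & (v == q))
            if npq > 0:
                I += npq / n * np.log(n * npq / (np.sum(u == p) * np.sum(v == q)))
    H = lambda y: -sum(np.mean(y == c) * np.log(np.mean(y == c))
                       for c in np.unique(y))
    return I / np.sqrt(H(u) * H(v))

# identical partitions up to relabeling have NMI 1
u = np.array([0, 0, 1, 1])
v = np.array([1, 1, 0, 0])
print(nmi(u, v))
```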
References (5 in total)

1.  Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning.

Authors:  Bo Wang; Junjie Zhu; Emma Pierson; Daniele Ramazzotti; Serafim Batzoglou
Journal:  Nat Methods       Date:  2017-03-06       Impact factor: 28.547

2.  A revised airway epithelial hierarchy includes CFTR-expressing ionocytes.

Authors:  Daniel T Montoro; Adam L Haber; Moshe Biton; Vladimir Vinarsky; Brian Lin; Susan E Birket; Feng Yuan; Sijia Chen; Hui Min Leung; Jorge Villoria; Noga Rogel; Grace Burgin; Alexander M Tsankov; Avinash Waghray; Michal Slyper; Julia Waldman; Lan Nguyen; Danielle Dionne; Orit Rozenblatt-Rosen; Purushothama Rao Tata; Hongmei Mou; Manjunatha Shivaraju; Hermann Bihler; Martin Mense; Guillermo J Tearney; Steven M Rowe; John F Engelhardt; Aviv Regev; Jayaraj Rajagopal
Journal:  Nature       Date:  2018-08-01       Impact factor: 49.962

3.  M3Drop: dropout-based feature selection for scRNASeq.

Authors:  Tallulah S Andrews; Martin Hemberg
Journal:  Bioinformatics       Date:  2019-08-15       Impact factor: 6.937

4.  Embracing the dropouts in single-cell RNA-seq analysis.

Authors:  Peng Qiu
Journal:  Nat Commun       Date:  2020-03-03       Impact factor: 14.919

5.  Demystifying "drop-outs" in single-cell UMI data.

Authors:  Tae Hyun Kim; Xiang Zhou; Mengjie Chen
Journal:  Genome Biol       Date:  2020-08-06       Impact factor: 13.583

