
iterClust: a statistical framework for iterative clustering analysis.

Hongxu Ding, Wanxin Wang, Andrea Califano.

Abstract

Motivation: In a scenario where populations A, B1 and B2 (subpopulations of B) exist, pronounced differences between A and B may mask subtle differences between B1 and B2.
Results: Here we present iterClust, an iterative clustering framework, which can separate more pronounced differences (e.g. A and B) in starting iterations, followed by relatively subtle differences (e.g. B1 and B2), providing a comprehensive clustering trajectory.
Availability and implementation: iterClust is implemented as a Bioconductor R package.
Supplementary information: Supplementary data are available at Bioinformatics online.

Year:  2018        PMID: 29579153      PMCID: PMC6084607          DOI: 10.1093/bioinformatics/bty176

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

When two clusters exist (A and B), and B is further divided into two sub-clusters (B1 and B2), the more pronounced differences between A and B may prevent the subtle differences between B1 and B2 from being revealed. To solve this problem and to better describe the sub-cluster hierarchy, we propose to perform cluster analysis iteratively, such that individual clusters are subdivided into smaller ones until further subdivisions are no longer statistically significant. For example, the differences between A and B would lead to the identification of two clusters in the first iteration, while B1 and B2 would be identified in the second iteration. A previous effort in iterative clustering analysis (Usoskin et al.) lacks systematic criteria for determining key clustering parameters, e.g. the optimal number of clusters at each iteration. The iterClust Bioconductor R package provides an unsupervised statistical framework for iterative clustering analysis. It can be used to discover biological heterogeneity, especially in single-cell analyses of heterogeneous tissues, where cell lineages impose a relatively strong hierarchical structure, or to solve general clustering problems.

2 Results

The R function iterClust() performs iterative clustering analysis by organizing user-defined functions in the following workflow, which runs from the start to the end of the ith iteration:

1.  featureSelect(): select the clustering features for this iteration.
2.  clustHetero(): confirm that the observation sets to be split in this iteration are heterogeneous.
3.  coreClust(): for the heterogeneous observation sets confirmed by clustHetero(), generate several clustering schemes.
4.  clustEval(): choose the optimal scheme among those generated by coreClust().
5.  obsEval(): evaluate how well each observation is clustered.
6.  obsOutlier(): remove poorly clustered observations.

iterClust accommodates diverse feature selection methods (Saeys et al.); clustering algorithms, e.g. partition-based, hierarchy-based (Kaufman and Rousseeuw, 2009), density-based (Ester et al.) and graph-based (Newman and Girvan, 2004); and cluster/observation evaluation methods, e.g. the sampling-based consensus score (Monti et al.) or the regular silhouette score (Rousseeuw, 1987).

In addition, parameters for all user-defined functions can be set as a function of the iteration. For instance, clustHetero() can be set up so that looser thresholds are used as the iteration depth increases, to deal with increasingly subtle heterogeneity. Likewise, featureSelect() can select clustering features based on previous iterations; for instance, it can exclude features used to identify coarser clusters in prior iterations, to unveil novel, more subtle heterogeneity at the current iteration. Taken together, these two functions make iterClust a highly flexible statistical framework for iterative cluster analysis.

The results of iterClust are organized by iteration. Within a specific iteration, for each cluster, the corresponding observation names and clustering features are recorded, providing a comprehensive clustering trajectory.
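The iterative idea can be illustrated with a minimal, language-agnostic sketch in Python. This is not the iterClust R API: the names two_means, silhouette, iter_clust and threshold are hypothetical, the data are toy 1-D values, and the split-and-score step collapses clustHetero(), coreClust() and clustEval() into one test, while feature selection and outlier removal are omitted. The threshold is passed as a function of the iteration, so that it loosens with depth.

```python
from statistics import mean


def two_means(points, n_rounds=20):
    """Candidate split: a tiny 1-D k-means with k = 2 (toy stand-in for a clustering step)."""
    lo, hi = min(points), max(points)
    left, right = list(points), []
    for _ in range(n_rounds):
        left = [p for p in points if abs(p - lo) <= abs(p - hi)]
        right = [p for p in points if abs(p - lo) > abs(p - hi)]
        if not left or not right:
            break
        lo, hi = mean(left), mean(right)
    return left, right


def silhouette(clusters):
    """Average silhouette width over all points in the given clusters."""
    scores = []
    for i, own in enumerate(clusters):
        others = [c for j, c in enumerate(clusters) if j != i]
        for p in own:
            if len(own) == 1:
                scores.append(0.0)  # usual convention for singleton clusters
                continue
            a = sum(abs(p - q) for q in own) / (len(own) - 1)
            b = min(mean([abs(p - q) for q in c]) for c in others)
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(scores)


def iter_clust(points, threshold, min_size=4, max_iter=10):
    """Split clusters round by round; stop when no split passes the threshold."""
    current = [sorted(points)]
    history = []  # one list of clusters per completed iteration
    for iteration in range(1, max_iter + 1):
        next_round, split_any = [], False
        for cluster in current:
            if len(cluster) >= min_size:
                left, right = two_means(cluster)
            else:
                left, right = cluster, []
            # Accept the split only if it is well separated at this depth.
            if right and silhouette([left, right]) >= threshold(iteration):
                next_round += [left, right]
                split_any = True
            else:
                next_round.append(cluster)
        if not split_any:
            break
        current = next_round
        history.append(current)
    return history


# Toy data: A is far from B, while B1 and B2 differ only subtly.
A = [0.0, 0.1, 0.2, 0.3]
B1 = [10.0, 10.1, 10.2, 10.3]
B2 = [11.0, 11.1, 11.2, 11.3]

# Assumed schedule: the threshold loosens as the iteration depth increases.
history = iter_clust(A + B1 + B2, threshold=lambda it: max(0.5, 0.9 - 0.1 * it))
for depth, clusters in enumerate(history, start=1):
    print(depth, clusters)
```

On this toy input, the first recorded iteration separates A from B and the second separates B1 from B2, after which no candidate split passes the threshold and the loop stops. Recording the clusters per iteration mirrors the clustering trajectory described above.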
Because iterClust is a statistical framework, its running time and the factors that influence it depend mainly on the clustering algorithm that the user specifies in coreClust(). As an example, we benchmarked iterClust on a public human PBMC (Peripheral Blood Mononuclear Cell) scRNA-Seq dataset. The original dataset was subsampled to different sizes, and the pam() function (Partitioning Around Medoids, in the R package cluster) was used in coreClust(). In this case, running time increases exponentially with the number of cells and linearly with the number of genes, in agreement with the properties of pam() (Supplementary Fig. S1).

We further tested the performance of iterClust in heterogeneity detection. As shown in Figure 1, in the first iteration iterClust identified T-cell and APC (Antigen Presenting Cell) clusters within the PBMC dataset. In the second iteration, the algorithm further separated these two clusters into sub-clusters: monocytes and B-cells in the APC cluster (monocytes and B-cells are two major types of APC), and effector T-cells and naïve/memory T-cells in the T-cell cluster. Critically, all clusters identified by the analysis were characterized by well-established cell-type-specific gene expression markers (Supplementary Fig. S2). The finer-grained subdivision was not the optimal solution in a single-pass analysis (Supplementary Fig. S3).

Taken together, iterClust can correctly elucidate the complex hierarchical substructures that contribute to tissue heterogeneity in the PBMC single-cell dataset, separating more pronounced differences in early iterations, followed by relatively subtle differences, and providing a comprehensive clustering trajectory. We further confirmed these conclusions on independent scRNA-Seq datasets (Supplementary Figs S4 and S5), as well as on general benchmarking datasets for clustering analysis (Supplementary Figs S6 and S7).
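Why a single pass may not select the finer subdivision can be seen in a small Python sketch. It uses toy 1-D values rather than the PBMC data, assumes the average silhouette width as the evaluation criterion (one of the options mentioned above), and scores hand-specified partitions instead of running a clustering algorithm; all names are hypothetical.

```python
from statistics import mean


def silhouette(clusters):
    """Average silhouette width over all points in the given clusters."""
    scores = []
    for i, own in enumerate(clusters):
        others = [c for j, c in enumerate(clusters) if j != i]
        for p in own:
            # a: mean distance to the other members of the point's own cluster
            a = sum(abs(p - q) for q in own) / (len(own) - 1)
            # b: mean distance to the nearest other cluster
            b = min(mean([abs(p - q) for q in c]) for c in others)
            scores.append((b - a) / max(a, b))
    return mean(scores)


# Toy data: A is far from B, while B1 and B2 differ only subtly.
A = [0.0, 0.1, 0.2, 0.3]
B1 = [10.0, 10.1, 10.2, 10.3]
B2 = [11.0, 11.1, 11.2, 11.3]

coarse = silhouette([A, B1 + B2])  # two clusters: A versus B
fine = silhouette([A, B1, B2])     # three clusters: A, B1, B2
print(f"k=2: {coarse:.3f}  k=3: {fine:.3f}")
```

In this toy case the coarse two-cluster partition scores higher than the three-cluster one, so a single pass that picks the best-scoring scheme would stop at A versus B, even though B1 and B2 are clearly separable once B is analyzed on its own.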
Fig. 1.

Revealing cell types within human PBMC using iterClust. For illustration purposes, the data were projected onto 2D space with t-SNE (Maaten and Hinton, 2008), and the clusters discovered by iterClust were colored. In the first round (A), iterClust separated two major cell types, T-cells and APCs (Antigen Presenting Cells). In the second round (B), it further dissected these clusters, separating monocytes and B-cells within the APC cluster, and effector T-cells and naïve/memory T-cells within the T-cell cluster.

References

1.  Finding and evaluating community structure in networks.

Authors:  M E J Newman; M Girvan
Journal:  Phys Rev E Stat Nonlin Soft Matter Phys       Date:  2004-02-26

2.  A review of feature selection techniques in bioinformatics.

Authors:  Yvan Saeys; Iñaki Inza; Pedro Larrañaga
Journal:  Bioinformatics       Date:  2007-08-24

3.  Unbiased classification of sensory neuron types by large-scale single-cell RNA sequencing.

Authors:  Dmitry Usoskin; Alessandro Furlan; Saiful Islam; Hind Abdo; Peter Lönnerberg; Daohua Lou; Jens Hjerling-Leffler; Jesper Haeggström; Olga Kharchenko; Peter V Kharchenko; Sten Linnarsson; Patrik Ernfors
Journal:  Nat Neurosci       Date:  2014-11-24
