Literature DB >> 31658262

SCMarker: Ab initio marker selection for single cell transcriptome profiling.

Fang Wang¹, Shaoheng Liang¹, Tapsi Kumar², Nicholas Navin^1,2, Ken Chen¹.

Abstract

Single-cell RNA-sequencing data generated by a variety of technologies, such as Drop-seq and SMART-seq, can reveal simultaneously the mRNA transcript levels of thousands of genes in thousands of cells. It is often important to identify informative genes or cell-type-discriminative markers to reduce dimensionality and achieve informative cell typing results. We present an ab initio method that performs unsupervised marker selection by identifying genes that have subpopulation-discriminative expression levels and are co- or mutually-exclusively expressed with other genes. Consistent improvements in cell-type classification and biologically meaningful marker selection are achieved by applying SCMarker on various datasets in multiple tissue types, followed by a variety of clustering algorithms. The source code of SCMarker is publicly available at https://github.com/KChen-lab/SCMarker.

Entities: Chemical Disease Gene Species

Mesh：

Substances：
Biomarkers
RNA

Year: 2019 PMID： 31658262 PMCID： PMC6837541 DOI： 10.1371/journal.pcbi.1007445

Source DB: PubMed Journal: PLoS Comput Biol ISSN： 1553-734X Impact factor: 4.475

Introduction

Current single-cell RNA-sequencing (scRNA-seq) data generated by a variety of technologies such as Drop-seq and SMART-seq, can reveal simultaneously the mRNA transcript levels of thousands of genes in thousands of cells [1-3]. However, the increased dimensionality makes it challenging to delineate cell types, due to complex and often undefined associations between individual genes and cell-types [4,5]. It is well accepted that genes are not equally informative in delineating cell types [6,7]. Certain genes are only expressed in certain cell types, but not others [8]. Moreover, the expression levels of certain genes cannot be robustly measured (e.g., zero inflated), due to technological bias [9-11]. Thus, it has become a common practice to retain only highly expressed or highly variable genes for cell population analysis [12-15]. Several scRNA-seq data clustering packages () perform marker selection through dimensionality reduction techniques such as principal component analysis and tSNE [16], which are equivalent to identifying the set of highly variable genes. Unfortunately, the biological implications and the technical optimality of these gene selection strategies retain unclear, despite their wide use in cell-type clustering. Here, we propose an ab initio method, named SCMarker, which applies information-theoretic principles to determine the optimal gene subsets for cell-type identification, without referencing to any known transcriptomic profiles or cell ontologies. The central idea of our method is to select genes that are individually discriminative across underlying cell types, based on a mixture distribution model, and are co- or mutually exclusively expressed with some other genes, due to cell-type specific functional constraints. Although the techniques of applying a mixture distribution model for a set of continuous data points have been widely used in clustering analysis of gene expressions, it is unclear whether this approach can benefit this problem context [17,18]. In particular, because single-cell gene expression measurements have vast dimensions (>20,000 genes), are highly noisy (e.g., zero-inflated, drop-off errors), and are generated by technologies of varied properties [19,20]. For example, SMART-seq is aimed at sequencing the entire RNA transcript, while Drop-seq only the 3’ end using unique molecular indices (UMI) to track individual transcript [2,21,22]. Part of our investigation here is to examine whether the previously applied data analytical techniques can be reapplied in the single-cell data-type that have different properties and population structures. Our main goal is to identify not only cell-types, but also biologically meaningful cell-type markers from scRNA-seq data at accuracies higher than results derived using canonical gene selection strategies.

Materials and methods

Discriminativeness of gene expressions for subpopulation clustering

By definition, cell-type-discriminative markers should have distinctive expression levels across cell subpopulations. Therefore, in a dataset with mixed cell subpopulations, the expression level of a marker should follow a bi- or multi-modal, instead of a unimodal distribution () [23-25]. Following this assumption, we quantify the degree of modality based on the probability density distribution (f) of each gene expression using a Gaussian kernel function, instead of a mixture model which requires knowing the number of mixture components: where g is the expression level of gene g in cell j, L is the number of cells and h>0 is a smoothing parameter called bandwidth, which can be leveraged to alleviate biases introduced by uneven sequencing depths. is a scaled kernel function, where K(x) is a standard Gaussian density function. The optimal value for h can be calculate from where the standard deviation , g is the expression level of gene g in cell i and is the average expression level of g across all the cells [26]. We estimated h from the datasets used in this study [27-30] and obtained a mean h = 0.3 on the datasets generated by the SMART-seq platform and 0.05 on the datasets generated by the Drop-seq platform. We set these as the default values for subsequent analyses. For each gene, we count the number (T) of peaks in the estimated probability density function . A peak is found at the density value c, if there exists a 2 times h long interval I centred at c such that for all g in I. A gene expression level follows a multi-modal distribution, if it has multiple (T≥2) local maximum probability density values. Only genes with multimodal probability density distributions are considered as markers.

Illustration of SCMarker.

Plotted as examples are (A) a bimodally distributed gene expression and (B) a unimodally distributed gene expression. From a binarized gene-cell expression matrix (C), a k-nearest co-occurrence neighbour (KNCON) graph and a k-nearest mutually exclusive neighbour (KNMEN) graph are constructed (D), based on which co- or mutually exclusively expressing gene pairs (CMEGPs) in the KNCON (node 1, 2 and 3, node 6 and 7, connected by red double arrows) and in the KNMEM (node 1, 2, 6 and 7, connected by green double arrows) can be identified. Marker genes (node 1, 2, 3, 6 and 7) are subsequently selected based on the CMEGPs.

Co- or mutually exclusively expressing gene pairs (CMEGPs)

Cell-type-discriminative markers are often co- or mutually exclusively expressed, due to modularized regulatory interactions specific to cell types [31]. Consequently, identifying these CMEGPs, will help identify markers which delineate cell type. Because scRNA-seq data are often sparse with limited sequencing depth [32], binarization of the counts would help mitigate technical artifacts and improve robustness over different sequencing platforms (e.g., whole transcript vs 3’ sequencing protocols). To identify CMEGPs, we only consider genes with multimodal distribution and discretize a gene-cell expression matrix with N genes and L cells into an N×L binary matrix X∈{0,1}, with x = 1 designating an expressed gene i in cell j, if the expression level is above the average and x = 0 otherwise (). For gene i, x. = (x,x,…,x) is a binary string. We can calculate a co-occurrence matrix (S) that measures the pair-wise co-occurrence between all the gene pairs, S can also be represented as a directed graph (), in which a node denotes a gene and an edge from gene A to gene B represents that B co-occurs with gene A in at least n cells. Among the connected nodes, the k genes that co-occurred with A in k largest sets of cells are termed the k-nearest co-occurrence neighbours (KNCONs). In addition, we calculate a mutually exclusive matrix (M) that measures the pair-wise mutual exclusivity between all the gene pairs through Eq (4), where M represents a directed k-nearest mutually exclusive neighbour (KNMEN) graph (). Similar to KNCON, but in opposite ways, the KNMENs of a gene A are the k genes that occur mutually exclusively with A in k largest sets of cells. The detailed algorithm for constructing these graphs is shown in . Under these definitions, an CMEGP is identified as two genes bi-directionally connected in the KNCEN or the KNMEN graph. We selected as markers the genes that belong to at least one CMEGP, because these genes are more likely associated with cell-type specific functions than those that do not have any CMEGP (likely due to random, non-function-related fluctuation). This concept has been previously examined in the RNA microarray data analysis, but has not been successfully applied in the context of single-cell RNA-seq data analysis, due to vastly different properties between the technologies [17,18].

Results

Comparison with other marker selection strategies

We applied SCMarker to the scRNA-seq data obtained from 1) 19 melanoma patients, which include 4,645 cells; and 2) 18 head and neck cancer patients, which include 5,902 cells sequenced by the SMART-seq2 platform [27,28]. In the original studies, each cell in the sets was labelled as a malignant or non-malignant cell through copy number analysis. The expression levels of known marker genes were used to further classify the non-malignant cells, such as T cells, B/plasma cells, macrophages, dendritic cells, mast cells, endothelial cells, fibroblasts, and myocytes. We found that most known marker genes (96%) demonstrated bi/multi-modal distributions across cells (). Overall, around 6% of genes with bi/multi-modal distributions are identified as marker genes, among which half are the known marker genes. We assessed SCMarker results with those obtained under two canonical strategies: selecting genes with A) the highest average expression levels and B) the highest variance across cells. In our experiments, the highest variable genes were determined using Seurat [12]. We used five clustering methods: k-means, Clara, hierarchical clustering, DBSCAN and Seurat to cluster single cells based on the selected markers [33,34]. Same numbers of clusters are specified for DBSCAN, k-means, Clara, and hierarchical clustering. The adjusted rand index (ARI), which measures the similarity of two sets of clustering results, was used to quantify the consistency between the clustering results and the known cell labels [35]. Compared to marker sets A and B, selected by the canonical strategies, the marker set selected by SCMarker (equal numbers of markers) resulted in a higher ARI with fairly evident margins (). The conclusion appeared to be robust over a range of k and n parameters and were unaffected by using different clustering methods ().These experiments indicated that setting k between 100 and 300 resulted in the most accurate cell type identification results irrespectively to n (). Hence, we select k = 300 and n = 30 as the default parameters for applying SCMarker.

Comparison of 3 marker selection methods for cell-type identification over a range of k (from 50 to 1000 at a step-size of 50) and n (from 5 to 100 at a step-size of 5).

Accuracy of cell-type identification (in terms of adjusted rand index) are compared across 3 marker sets selected respectively by SCMarker, the highest expressed and the highest variable gene approaches, using two scRNA-seq datasets from (A) melanoma and (B) head-and-neck cancer samples by 5 clustering algorithms: k-means, Clara, hierarchical clustering (hc), DBSCAN, and Seurat. We obtained 902 markers from the melanoma data and more distinguishable cell types using SCMarker than using the canonical strategies (, ). Better performance of SCMarker was also obtained in analysing the head and neck cancer data (). Moreover, the genes selected by SCMarker had substantially higher degrees of overlap with the known cell-type markers reported in the original publications than the sets returned by other approaches (the same number of 902 top scoring genes were selected for fair comparison), including the “FindMarker” approach in Seurat (). Notably, SCMarker selected significantly more immune cell surface markers specific to T cytotoxic, T helper, B lymphocyte, and macrophage cells that are likely present in the tumour microenvironment, as indicated by gene set enrichment analysis () [36].

Fig 3

Results on the melanoma data.

Plotted in tSNE space are 4,645 melanoma cells with markers selected respectively by (A) SCMarker, (B) the highest expressed and (C) the highest variable genes. Also plotted are the Venn diagrams between the known cell-type markers and the marker sets determined respectively by (D) SCMarker, (E) the highest expressed, (F) the highest variable genes and (G) Seurat FindMarker in the melanoma data.

Results on the melanoma data.

Application of SCMarker to 3’ UMI count data

To avoid introducing biases due to potential overfitting and assess the utility of our approaches on other platforms, we further assessed the utility of SCMarker in analysing the 3’ UMI count scRNA-seq data generated by the droplet platforms. We determined the optimal values of k and n under various sample sizes via resampling the melanoma data (). When the sample size was 500, the optimal values were k = 100 and n = 20 (or 30). When the sample size was 2,500 or 25,000, the optimal values were k = 300, n = 30, the same as what we obtained from the original dataset (4,645 cells). When sample size increased to 50,000, the optimal values increased to k = 400, n = 100. Overall, although the sample size did affect the optimal choices of k and n, their influences were relatively modest. We first analysed a set of 5,602 cells from the cerebellar hemisphere of normal brain tissues generated by Drop-seq under k = 300 and n = 30 [29]. SCMarker selected 699 genes as markers, which differentially expressed across cell subpopulations under the default parameters (). Alternatively, the default mode of Seurat led to the selection of 6,111 highest variable genes (HVGs). For comparison, we selected 699 highest expressed genes (HEGs). Although SCMarker selected less markers than Seurat, the clustering result showed a clearer separation than that based on the Seurat HVGs and on the HEGs (). In particular, SCMarker successfully delineated Purkinje neurons into purk1 (cluster4, ) and purk2 (cluster7, ) and recapitulated the differential levels of SORCS3 between two clusters (), which are consistent with the results in the original paper. In contrast, although the Purkinje neurons were clustered into four groups by Seurat (), purk1 and purk2 were not well separated (), and the expression levels of SORCS3 showed mosaic patterns across the 4 groups (cluster4, 6, 11 and 12, ). As additional controls, we performed clustering using the top 500 and 1000 Seurat HVGs. That did not result in any improvement ().We then analysed the scRNA-seq data of 52,698 cells from 5 lung tumours generated by the 10X Chromium platform (10X Genomics) under k = 400 and n = 100 [30]. SCMarker identified 950 markers under the parameters (), while Seurat identified 1,832 HVGs. We also selected 950 HEGs for comparison. SCMarker led to 23 clearly distinguishable clusters, while the Seurat HVGs led to 12 and the HEGs led to 19 (). The 10 highest expressed markers per cluster derived by SCMarker showed a high degree of cluster-specificity in the heatmap (). Among the selected markers were the 17 known markers reported by the original study (). SCMarker also discovered multiple putative subtypes for some cell types, such as the T, B and myeloid cells (). For example, cluster 4 and 23 are the B cells expressing known surface marker CD79A (), yet cells in cluster 4 are evidently different from cells in cluster 23, due to differential IGHG1 and BANK1 expression levels (). For comparison, we selected the 10 highest expressed genes from the clusters determined by the Seurat HVGs and by the HEGs, respectively (). They appeared non-specifically distributed across clusters (). These genes also contained fewer known markers (). For example, cluster 3 determined by the Seurat HVGs contained markers (CLDN5, CAV1 and IFITM3) from 3 cell-types (endothelia, alveolar and B cell, respectively). Most clusters expressed IFITM3, except for clusters 1 and 6 (). Only T cell and fibroblast markers appeared to be cluster-specific. As additional controls, we also performed analysis using fewer (i.e., 500 and 1000) Seurat HVGs. That resulted in worse results with fewer known markers and marker-specific clusters ().

Fig 4

Results on the human brain data.

Plotted in tSNE space are 5,602 cells in the cerebellar hemisphere of human brain tissue based on markers selected respectively by (A) SCMarker, (B) the highest variable genes and (C) the highest expressed genes, colored by performing clustering using Seurat. Cell types were labelled consistently as they were in the original paper. Also plotted are the heatmaps of the top 10 gene expression levels derived respectively from (D) SCMarker in cluster 4 and 7, (E) the highest variable genes in cluster 4, 6 11 and 12 and (F) the highest expressed genes in cluster 3, 4 and 6.

Fig 5

Results on the lung cancer data generated by Dropseq.

Results on the human brain data.

Results on the lung cancer data generated by Dropseq.

Plotted in tSNE space are 52,698 cells of 6 different lung cancer patients, clustered based on markers selected respectively by (A) SCMarker, (B) the highest variable genes and (C) the highest expressed genes. Colors correspond to clusters determined by DBSCAN. Heatmaps of the average expression levels of the 10 highest expressed genes per cluster identified respectively by SCMarker (D), the highest variable genes (E) and the highest expressed genes (F). Cell types in (A) to (C) are labelled based on the known cell-type specific markers (Table ), which are highlighted in red boxes in (D) to (F). Only the top 10 highest expressed markers per cluster are included in the comparison. Overall, SCMarker demonstrated higher sensitivity and specificity for cell type and cell-type specific marker identification than the alternative approaches. Moreover, markers selected by SCMarker were more significant among genes which were identified by Seurat to define clusters ().

Discussion

In this manuscript, we reported a new bioinformatics tool, SCMarker, which performs ab initio cell-type discriminative marker selection from scRNA-seq data. SCMarker operates based on two new information-theoretic metrics: 1) bi/multi-modal distribution of subpopulation-discriminative gene expression in mixed cell populations and 2) co- or mutually-exclusively expressing gene pairs, which quantifies populational structural properties intrinsic to single-cell RNA-seq data. We found that SCMarker can consistently significantly boost cell-type identification accuracy in datasets from a variety of tissues such as cancer and brain, generated by both SMART-seq and Drop-seq platforms. Because SCMarker does not depend on any prior knowledge, we anticipate that it will prove most useful in discovery settings for analysing cell populations of a high degree of plasticity and heterogeneity [37]. SCMarker can potentially be expanded to analyse other types of single-cell data, including mass cytometry and single cell ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) data [38,39]. It can be easily incorporated as a module into current scRNA-seq data analysis workflows to pre-process the cell-gene count/expression matrix before performing further downstream analysis.

Clustering methods.

(DOCX) Click here for additional data file.

Marker gene list.

(XLSX) Click here for additional data file.

Algorithm for constructing the k-nearest mutually co/anti-occurrence neighbour graph.

(JPG) Click here for additional data file.

The expression levels of most known marker genes (CD3G, CD8A, IL7R, MS4A1, CD19, CD79A, CD79B and PMEL) follow a bi/multi-modal distribution in the melanoma data.

(JPG) Click here for additional data file.

Comparison of 3 marker selection methods for cell-type identification.

Tested were a range of parameters and 5 clustering algorithms: k-means, Clara, hierarchical clustering (hc), DBSCAN, and Seurat. Plotted in heatmaps are the ARI values calculated based on markers selected respectively by SCMarker, the highest expressed genes and the highest variable genes from (A) the melanoma and (B) the head and neck cancer data. X and Y axes in the SCMarker panel indicate the n and k parameters used by SCMarker and the corresponding (equal number of markers) results in the highest expressed or the highest variable gene panels. (JPG) Click here for additional data file.

Determining the optimal parameters.

Plotted in the heatmaps are the number of selected markers for (A) the melanoma and (B) the head and neck cancer data over a range of n (X-axis) and k (Y-axis) parameters. Bars on the side and the top are the mean values in the corresponding rows and columns. Also plotted are clustering accuracy measured by the adjusted rand index (ARI), a metric that measures the similarity of two clustering results, for (C) the melanoma and (D) the head and neck cancer data over various n and k parameters. (JPG) Click here for additional data file.

Validation of genes selected by SCMarker.

Plotted in tSNE space are 5,902 cells from the head and neck cancer data, based on genes selected respectively by (A) SCMarker, (B) the highest expressed and (C) the highest variable genes. (JPG) Click here for additional data file. Gene set enrichment analysis (GSEA) of markers selected by 3 methods: SCMaker, the highest expressed and the highest variable genes from the (A) melanoma; and (B) the head and neck cancer data, respectively. Only the top 15 terms are shown. The darkness of the colors corresponds to -log10 P values. (JPG) Click here for additional data file.

Estimation of k and n through resampling of the melanoma data.

Plotted in heatmap are clustering accuracy measured by the adjusted rand index (ARI). The sample sizes of each dataset were labelled above each of the figures. (JPG) Click here for additional data file.

Results on the human brain tissue data.

Plotted in tSNE space are 5,602 cells in the cerebellar hemisphere of human brain tissue based on the highest 500 (A) and 1000 (B) variable genes, colored by cell types from the original paper. (JPG) Click here for additional data file.

Results on the lung cancer data.

Plotted in tSNE space are 52,698 cells of 6 different lung cancer patients, clustered based on the highest 500 (A) and 1000 (B) variable genes. Colors correspond to clusters determined by DBSCAN. Heatmaps of the average expression levels of the 10 highest expressed genes per cluster identified respectively by the highest 500 (C) and 1000 (D) variable genes. Cell types in (A) and (B) are labelled based on the known cell-type specific markers, which are highlighted in red box in (C) and (D). (JPG) Click here for additional data file.

The distribution of Significance of markers identified by Seurat and overlaps with SCMarker in melanoma, head and neck cancer (HNSCC), brain tissue and lung cancer.

(JPG) Click here for additional data file.

Table 1

Known cell-type specific markers identified respectively by SCMarker and the highest variable gene approach.

Known markers	Cell type	Cluster ID(SCMarker)	Cluster ID(highest variable)	Cluster ID(highest expressed)
CD3D	T cell	1	1	1
S100A2	Myeloid	8, 9
IGHG1	B cell	23
MS4A2	Mast	2
LYZ	Myeloid	2, 15, 17, 19		3, 4
CLDN5	Endothelial	3	3
COL1A1	Fibroblast	5
CD27	T cell	6
CAPS	Epithelial	14
CAV1	Alveolar	11	3
GZMH	T cell	1
GZMA	T cell	1		11
IFITM3	Myeloid		3
COL6A2	Fibroblast		5
MS4A	B cell		6
CD79A	B cell		6
SFTPC	Alveolar		10

Only the top 10 highest expressed markers per cluster are included in the comparison.

33 in total

Review 1. Computational analysis of microarray data.

Authors: J Quackenbush
Journal: Nat Rev Genet Date: 2001-06 Impact factor: 53.242

2. Full-length RNA-seq from single cells using Smart-seq2.

Authors: Simone Picelli; Omid R Faridani; Asa K Björklund; Gösta Winberg; Sven Sagasser; Rickard Sandberg
Journal: Nat Protoc Date: 2014-01-02 Impact factor: 13.491

3. Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets.

Authors: Evan Z Macosko; Anindita Basu; Rahul Satija; James Nemesh; Karthik Shekhar; Melissa Goldman; Itay Tirosh; Allison R Bialas; Nolan Kamitaki; Emily M Martersteck; John J Trombetta; David A Weitz; Joshua R Sanes; Alex K Shalek; Aviv Regev; Steven A McCarroll
Journal: Cell Date: 2015-05-21 Impact factor: 41.582

Review 4. The technology and biology of single-cell RNA sequencing.

Authors: Aleksandra A Kolodziejczyk; Jong Kyoung Kim; Valentine Svensson; John C Marioni; Sarah A Teichmann
Journal: Mol Cell Date: 2015-05-21 Impact factor: 17.970

5. Comparative Analysis of Single-Cell RNA Sequencing Methods.

Authors: Christoph Ziegenhain; Beate Vieth; Swati Parekh; Björn Reinius; Amy Guillaumet-Adkins; Martha Smets; Heinrich Leonhardt; Holger Heyn; Ines Hellmann; Wolfgang Enard
Journal: Mol Cell Date: 2017-02-16 Impact factor: 17.970

6. MATCHER: manifold alignment reveals correspondence between single cell transcriptome and epigenome dynamics.

Authors: Joshua D Welch; Alexander J Hartemink; Jan F Prins
Journal: Genome Biol Date: 2017-07-24 Impact factor: 13.583

7. SC3: consensus clustering of single-cell RNA-seq data.

Authors: Vladimir Yu Kiselev; Kristina Kirschner; Michael T Schaub; Tallulah Andrews; Andrew Yiu; Tamir Chandra; Kedar N Natarajan; Wolf Reik; Mauricio Barahona; Anthony R Green; Martin Hemberg
Journal: Nat Methods Date: 2017-03-27 Impact factor: 28.547

8. Cell type discovery using single-cell transcriptomics: implications for ontological representation.

Authors: Brian D Aevermann; Mark Novotny; Trygve Bakken; Jeremy A Miller; Alexander D Diehl; David Osumi-Sutherland; Roger S Lasken; Ed S Lein; Richard H Scheuermann
Journal: Hum Mol Genet Date: 2018-05-01 Impact factor: 6.150

9. A descriptive marker gene approach to single-cell pseudotime inference.

Authors: Kieran R Campbell; Christopher Yau
Journal: Bioinformatics Date: 2019-01-01 Impact factor: 6.937

Review 10. Single-cell RNA sequencing technologies and bioinformatics pipelines.

Authors: Byungjin Hwang; Ji Hyun Lee; Duhee Bang
Journal: Exp Mol Med Date: 2018-08-07 Impact factor: 8.718

8 in total

1. FSCAM: CAM-Based Feature Selection for Clustering scRNA-seq.

Authors: Yan Wang; Jie Gao; Chenxu Xuan; Tianhao Guan; Yujie Wang; Gang Zhou; Tao Ding
Journal: Interdiscip Sci Date: 2022-01-14 Impact factor: 2.233

2. Future of biomedical, agricultural, and biological systems research using domesticated animals.

Authors: Thomas E Spencer; Kevin D Wells; Kiho Lee; Bhanu P Telugu; Peter J Hansen; Frank F Bartol; LeAnn Blomberg; Lawrence B Schook; Harry Dawson; Joan K Lunney; John P Driver; Teresa A Davis; Sharon M Donovan; Ryan N Dilger; Linda J Saif; Adam Moeser; Jodi L McGill; George Smith; James J Ireland
Journal: Biol Reprod Date: 2022-04-26 Impact factor: 4.161

Introduction

Materials and methods

Discriminativeness of gene expressions for subpopulation clustering

Illustration of SCMarker.

Co- or mutually exclusively expressing gene pairs (CMEGPs)

Results

Comparison with other marker selection strategies

Comparison of 3 marker selection methods for cell-type identification over a range of k (from 50 to 1000 at a step-size of 50) and n (from 5 to 100 at a step-size of 5).

Results on the melanoma data.

Application of SCMarker to 3’ UMI count data

Results on the human brain data.

Results on the lung cancer data generated by Dropseq.

Discussion

Clustering methods.

Marker gene list.

Algorithm for constructing the k-nearest mutually co/anti-occurrence neighbour graph.

The expression levels of most known marker genes (CD3G, CD8A, IL7R, MS4A1, CD19, CD79A, CD79B and PMEL) follow a bi/multi-modal distribution in the melanoma data.

Comparison of 3 marker selection methods for cell-type identification.

Determining the optimal parameters.

Validation of genes selected by SCMarker.

Estimation of k and n through resampling of the melanoma data.

Results on the human brain tissue data.

Results on the lung cancer data.

The distribution of Significance of markers identified by Seurat and overlaps with SCMarker in melanoma, head and neck cancer (HNSCC), brain tissue and lung cancer.

Review 1. Computational analysis of microarray data.

Review 4. The technology and biology of single-cell RNA sequencing.

Review 10. Single-cell RNA sequencing technologies and bioinformatics pipelines.

Review 5. Feature selection revisited in the single-cell era.

Review 8. Statistics or biology: the zero-inflation controversy about scRNA-seq data.