Discriminative feature of cells characterizes cell populations of interest by a small subset of genes.

Takeru Fujii¹,², Kazumitsu Maehara¹, Masatoshi Fujita², Yasuyuki Ohkawa¹.

Abstract

Organisms are composed of various cell types with specific states. To obtain a comprehensive understanding of the functions of organs and tissues, cell types have been classified and defined by identifying specific marker genes. Statistical tests are critical for identifying marker genes, which often involve evaluating differences in the mean expression levels of genes. Differentially expressed gene (DEG)-based analysis has been the most frequently used method of this kind. However, in association with increases in sample size such as in single-cell analysis, DEG-based analysis has faced difficulties associated with the inflation of P-values. Here, we propose the concept of discriminative feature of cells (DFC), an alternative to using DEG-based approaches. We implemented DFC using logistic regression with an adaptive LASSO penalty to perform binary classification for discriminating a population of interest and variable selection to obtain a small subset of defining genes. We demonstrated that DFC prioritized gene pairs with non-independent expression using artificial data and that DFC enabled characterization of the muscle satellite/progenitor cell population. The results revealed that DFC well captured cell-type-specific markers, specific gene expression patterns, and subcategories of this cell population. DFC may complement DEG-based methods for interpreting large data sets. DEG-based analysis uses lists of genes with differences in expression between groups, while DFC, which can be termed a discriminative approach, has potential applications in the task of cell characterization. Upon recent advances in the high-throughput analysis of single cells, methods of cell characterization such as scRNA-seq can be effectively subjected to the discriminative methods.

Entities: Chemical

Substances: Genetic Markers

Year: 2021 PMID: 34797848 PMCID: PMC8641884 DOI: 10.1371/journal.pcbi.1009579

Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475

Introduction

Organisms are composed of various cell types with specific functions. The cell types also include undifferentiated cells, such as stem cells and progenitor cells, or cells in transition during differentiation. Understanding cell types as the functional or structural units of an organism can confer a comprehensive understanding of the functions of organs and tissues, as well as their origins. The classification and definition of cell types have been based on the identification of specific marker genes that define them. Marker genes have been identified by comprehensive analysis of gene expression. Statistical tests are particularly important in the identification of marker genes, for which evaluation of differences in the mean expression levels of genes is often used. Cell-type-specific genes/proteins are responsible for cell-type-specific functions. Therefore, to identify marker genes, comparisons have been performed between the cell types of interest and control groups to extract specifically expressed genes. The use of differentially expressed genes (DEGs) is a widely accepted way of defining marker gene candidates, the validity of which has been confirmed by biological experiments. However, the risk of false positives is increased by the tens of thousands of statistical tests associated with comprehensive analysis. Therefore, methods of correcting for multiple testing, such as Benjamini–Hochberg’s false discovery rate (FDR) [1], Storey’s q-value [2,3], and Efron’s local FDR [4], have been widely employed. Along with the increasing demand for multiple testing and correction methods, DEG detection methods in the field of biostatistics for high-dimensional data have also been developed. In particular, limma [5], which uses Bayesian statistics, edgeR [6], and DESeq and DESeq2 [7,8] have been developed to improve the statistical power (sensitivity) of DEG detection while suppressing false positives.
These methods have often been applied to cases in which only a small number of samples can be obtained because of the experimental scale and cost limitations [9]. Nevertheless, with the development of comprehensive gene expression analysis, especially single-cell analysis [10], new challenges have arisen as a result of the rapid increase in the number of samples [11] and the involvement of exploratory analysis schemes. A large sample size, combined with exploratory analysis that induces selection bias, can lead to overly small P-values, impeding the application of conventional methods developed to call DEGs from bulk RNA-seq [12]. In particular, because a larger sample size can detect increasingly small differences, a larger sample size usually gives smaller P-values for a fixed difference, resulting in the unnecessary expansion of candidate genes (e.g., for a fixed difference and variance, the t-statistic is proportional to √n). Therefore, efforts to improve the ability to detect DEGs are still being made to adapt to scRNA-seq data, which are characterized by low coverage and large sample size. That is, one of the major problems to be solved in this field is gene prioritization, namely, the selection of a small list of genes that should be validated and interpreted as a priority. Here, we propose discriminative feature of cells (DFC), an alternative approach to the use of DEGs for characterizing cell populations by discrimination and variable selection. Ntranos et al. [13] incorporated logistic regression to detect isoform changes of transcripts from scRNA-seq data. We further pursue the applicability of a discriminative approach for characterizing cell groups, especially for tissue scRNA-seq data in which gene expression levels are correlated, and for revealing the heterogeneous subpopulations within a specified population of interest (POI).
We demonstrated that DFC succeeded in selecting a small set of genes that characterize a group of cells of interest, while avoiding the problem of a large candidate gene list due to the large sample size of scRNA-seq. DFC is also shown to have the potential to provide biological insights that are difficult to make using DEGs, such as detecting genes that characterize small subpopulations and specific combinations of genes that are functionally linked to each other.
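
The sample-size effect described above can be illustrated with a minimal sketch. It assumes a two-sample t-statistic with equal group sizes and a common standard deviation; the function name and the numbers (a mean difference of 0.1 and a standard deviation of 1.0) are hypothetical and are not taken from the study.

```python
import math


def two_sample_t(mean_a, mean_b, sd, n_per_group):
    """t-statistic for two groups of equal size sharing one standard deviation."""
    # Standard error of the difference between the two group means.
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    return (mean_a - mean_b) / standard_error


# Hypothetical fixed effect: the same small mean difference at growing sample sizes.
for n in (10, 100, 1_000, 10_000):
    print(n, round(two_sample_t(0.1, 0.0, 1.0, n), 3))
```

Because the statistic grows with √n, quadrupling the number of cells per group doubles t for the same difference, which is why ever smaller differences become significant in large scRNA-seq data sets.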

Results

We focused on the discriminative method to characterize a POI in cells (e.g., stem cell population), using scRNA-seq data from samples that contain a large number of cell types, such as tissues. In the discriminative method, each cell is considered as a data point in the gene expression space, and the decision boundary surface that separates the two groups is determined. The boundary surface is confined to a small dimension of the space by variable selection, which suggests the idea of characterizing cell populations by a selected subset of gene expression patterns as DFC (Fig 1A). While the conventional concept of using a DEG-based approach involves the comparison of group means of individual gene expression levels in two cell populations, the POI and a control group, DFC obtains a group of genes that are useful to discriminate the POI from the control group. Thus, DFC is supposed to provide a highly selective gene list of only the number of genes needed for discrimination, while simultaneously using information from all genes. In this study, we implemented the method for determining DFC using binary classification by logistic regression and variable selection by adaptive LASSO [14]. Specifically, we performed adaptive LASSO–logistic regression with the objective variable of belonging to the POI (1 or 0) and the explanatory variable of gene expression level; the genes whose weights were not 0 were considered as DFC.
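
For reference, the estimation problem described above can be written in the standard form of adaptive LASSO–logistic regression. This is a sketch in generic notation, not a formula quoted from the study: y_i ∈ {0, 1} is POI membership of cell i, x_i is its gene expression vector, λ is the sparsity parameter, γ is the exponent of the adaptive weights, and β̃ is an initial coefficient estimate (how β̃ is obtained is an assumption left open here).

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}\;
  -\frac{1}{n}\sum_{i=1}^{n}
  \Bigl[\, y_i\,(\beta_0 + x_i^{\top}\beta)
         - \log\bigl(1 + e^{\beta_0 + x_i^{\top}\beta}\bigr) \Bigr]
  + \lambda \sum_{j=1}^{p} w_j\,\lvert \beta_j \rvert,
\qquad
w_j = \frac{1}{\lvert \tilde{\beta}_j \rvert^{\gamma}},
\qquad
\mathrm{DFC} = \{\, j : \hat{\beta}_j \neq 0 \,\}
```

Genes with a small initial estimate receive a large weight w_j and are penalized more strongly, so only the genes that contribute most to discrimination keep nonzero weights.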

Fig 1

The dependent pairs of gene expression selected as the DFC.

(a) Different concepts for gene selection of DFC and DEG. The common goal is to extract a set of genes that characterizes the population of interest (left). A DEG-based approach involves a list of genes with statistically significant differences between the studied groups. In contrast, the DFC-based approach involves a subset of genes that distinguish between two populations (top-right). DFC is expected to feature a small set of genes selected by taking into account the relationships among genes (bottom-right). (b–d) Artificially generated data set in which DFC has priority over DEG; case 1: correlation. (b) Schematic of the synthesized data design. Only the pair X3 and X4 has intra-group correlation; the other pairs are independent. All variables have the same variance, and the differences in means are the same for all pairs (see Materials and Methods for details). (c) Pairs that are easier to classify are given priority to become DFC. The lower triangle shows the plot of each pair of variables; the diagonal elements show the distribution of each variable and the upper triangle shows the correlation coefficient within the cluster of each two variables. The decision boundary in the plane of the selected variable pair X3 and X4 is shown as a solid line. (d) The process of selecting discriminative variables; solution path. This indicates the transition of the weights (partial regression coefficients) of each variable when the regularization parameter λ (sparsity) is varied. (e–g) Synthesized data set in which DFC has priority over DEG; case 2: exclusive. (e) Schematic of the synthesized data design. In one-third of the group B cells, the expression of X1 and that of X2 are mutually exclusive. The variances and the means of variables are designed as in case 1. In other words, this simulates a logical product relationship such that cells that express X1 and X2 simultaneously are equivalent to the population of group A.
(f) An example of logical relationships of case 2, shown in a scatter plot as in (c). (g) The solution path in case 2 as shown in (d).

DFC calls a pair of genes with dependences in their expression

In the first case, we tested the possibility that adaptive LASSO would prioritize correlated gene pairs. As the gene expression pattern of a simulated cell population, the expression levels of four genes were generated from a normal distribution with equal differences in group means and identical variances (Fig 1B). Therefore, all genes are equivalent as DEGs (i.e., they have the same P-value in the two-tailed t-test). However, only the gene pair X3–X4 is correlated (r = 0.7) within the two groups, while all other pairs are uncorrelated (r = 0). The expression of these hypothetical genes is shown in Fig 1C as a scatter plot. The process of variable selection (LASSO solution path) is shown in Fig 1D: as the parameter λ, which adjusts the sparsity (the strength of variable selection), is increased, the correlated X3–X4 pair is the pair that ultimately remains selected. This correlated pair has the clearest boundary between the groups (Fig 1C, bold box), and it can be intuited that the two selected variables are useful for differentiating groups A and B. To calculate the slope and intercept of the decision boundary in Fig 1C, we used the minimum value of λ at which only two variables are selected. In addition, the accuracy of the discrimination using all four variables is 0.999, while the accuracy using only the two selected variables is 0.995, indicating that the model maintains adequate performance. These results indicate that gene pairs that are correlated within a group are likely to be selected as DFC. This suggests that DFC can select a set of genes that have direct or indirect dependences as useful features for discriminating groups. We also investigated the nature of DFC using simulated data that more closely resemble real scRNA-seq data. We used ESCO [15], a simulator that takes drop-out into account, to generate the data. We designed the data with two cell populations and 15 genes. The two cell populations each have two marker genes.
Two sets were generated and combined to create a data set containing gene sets derived from the two gene correlation networks (GCNs) (500 cells for group A and 493 cells for group B) (S1A Fig). Gene1_1, Gene6_1, Gene7_1, and Gene8_1 were marker genes generated from GCN1, while Gene1_2, Gene4_2, Gene5_2, and Gene7_2 were those generated from GCN2. The correlation between marker genes generated from different GCNs is low. Using these data, we extracted the DFC for Group1. In this case, we controlled the sparsity (λ) and limited the number of variables to be extracted to four. As a result, Gene1_1, Gene6_1, Gene1_2, and Gene4_2 were extracted. From the scatter plot, we can see that the pairs of Gene1_1 and Gene6_1, and Gene1_2 and Gene4_2, are highly correlated within the class and have high separability. Thus, the ESCO simulation shows that gene pairs with the same relationship as in Fig 1B, 1C and 1D are preferentially extracted. Next, we examined the ability of DFC to detect mixed subpopulations. Here, we assume a situation that includes multiple subpopulations with different expression patterns in one group (Fig 1E). As in the previous scenario, all genes are equivalent as DEGs, that is, they have the same variance, and the differences in group means are the same for all variables (see Materials and Methods). Here, genes X1 and X2 are strongly expressed in only one-third of the cells in group B. Furthermore, the expression of X1 and that of X2 are mutually exclusive. Therefore, X1 and X2 are not statistically independent. In addition, because group B is composed of multiple subpopulations, the distributions of gene expression of both X1 and X2 are multimodal (Fig 1F). The slope and intercept of the decision boundary were obtained in the same way as in Fig 1C. In this example, adaptive LASSO also prioritized the non-independent variable pair X1 and X2 (Fig 1G). This suggests that DFC is generally prone to selecting non-independent pairs of genes.
In this scenario, the condition for being in cell group A is the expression of both X1 and X2 at the same time, that is, the logical AND (&) relation, which cannot be realized by considering either X1 or X2 alone. These results suggest that DFC may provide useful insights when considering cellular functions, not simply as a candidate list of differentially expressed genes, but as a set of genes that well defines characteristics of a cell population, including dependences such as correlations among multiple genes and mixtures of different populations.
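
One way to see why a correlated pair can be easier to separate in the first simulated case, even though every gene has the same mean difference and variance, is to compare the squared Mahalanobis distance between the group means for different pairs. The sketch below is an illustration only. It assumes that the mean shift of the correlated pair runs across the correlation axis (opposite signs for the two genes); the actual signs used in the simulation are given in Materials and Methods and are not reproduced here. The function name and all numbers except r = 0.7 are hypothetical.

```python
def mahalanobis_sq(delta_x, delta_y, sigma, r):
    """Squared Mahalanobis distance between two group means for a gene pair.

    Both genes share the standard deviation sigma and have within-group
    correlation r; (delta_x, delta_y) is the difference in group means.
    """
    s2 = sigma ** 2
    # Determinant of the 2x2 covariance matrix [[s2, r*s2], [r*s2, s2]].
    det = s2 * s2 * (1.0 - r * r)
    # delta^T * inverse(covariance) * delta, written out for the 2x2 case.
    return (s2 * delta_x ** 2 - 2.0 * r * s2 * delta_x * delta_y + s2 * delta_y ** 2) / det


# Same per-gene mean difference and variance in every pair (assumed values).
print("uncorrelated pair:          ", mahalanobis_sq(1.0, -1.0, 1.0, 0.0))
print("correlated, shift across:   ", mahalanobis_sq(1.0, -1.0, 1.0, 0.7))
print("correlated, shift along:    ", mahalanobis_sq(1.0, 1.0, 1.0, 0.7))
```

Under this assumption the correlated pair is more separable than an uncorrelated pair, although each gene alone is an equally strong DEG. If the shift instead ran along the correlation axis, the same correlation would reduce separability, so the direction of the shift matters.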

Small gene set of DFC determined by a unique criterion that differs from DEG-based analysis

Next, to demonstrate the practicability of DFC, we performed scRNA-seq data analysis and compared the yielded gene lists of DEG and DFC. The data were obtained from scRNA-seq data of mouse tibialis anterior (TA) muscle tissue injured by notexin at days 0, 2, 5, and 7 during the regeneration of skeletal muscle [16]. Skeletal muscle satellite cells (MuSCs) are known to be an essential cell population for muscle regeneration, and are activated upon muscle injury and undergo multiple progenitor cell stages until differentiating into myofibers. In this process of muscle regeneration, there is also a self-renewing state in which MuSCs again transition to a quiescent state and refill the MuSC pool [17,18]. Therefore, we attempted to characterize heterogeneous cell populations of MuSCs with multiple transient states by DFC, which have been difficult to capture by a DEG-based approach. Fig 2A shows the procedure for defining the POI and the determination of DFC. First, public data of scRNA-seq (GSE143437) [16] were obtained. Then, the genes expressed in very few cells (fewer than 10 cells) were filtered (see Materials and Methods for details). Next, UMAP was used to visualize the data in two dimensions. The 12th cluster obtained by Louvain clustering (Fig 2B) was designated as the POI for this analysis, and all other clusters were designated as the control group (Fig 2C). Cluster 12 was selected according to the expression levels of Pax7, a marker gene for MuSCs, and Myod1, a transcription factor that functions in progenitor cells and activated satellite cells. The specific expression of the genes indicated that cluster 12 represents the satellite/progenitor cell cluster (Fig 2D and 2E).

Fig 2

Smaller gene set of DFC was selected by a unique selection criterion.

(a) Procedure of DEG and DFC extraction from scRNA-seq data. (b–e) The determined POI is compared with all other cell clusters in the muscle tissue. Embedding the scRNA-seq data into two-dimensional space with UMAP. (b) The clusters determined by the Louvain algorithm. The 12th cluster corresponds to the cluster of muscle stem cells and their progenitors. (c) The 12th cluster is set as the POI, and the other clusters are assigned as the control group, “Others.” (d, e) Single-cell expression levels for Pax7 and Myod1. (f) Some of the DEGs are selected as DFC. Venn diagram indicating the overlap of DEGs and DFC. (g) Genes in DFC not selected by the DEGs’ criteria. Volcano plot of DEGs and (h) MA plot of DEGs.

Next, to elucidate the differences in the criteria for selecting genes in DFC and DEG in the data, we compared their gene lists. For DEGs, the criterion of FDR < 0.01 in DESeq2 was used. As a result of applying this criterion, the number of DEGs was 7,495, which was nearly half (46%) of the total of 16,351 mouse genes. Among the DFC, most of the genes (96/108) overlapped with the DEGs, but there were also 12 DFC-specific genes (Fig 2F). Next, we examined whether genes in DFC could be obtained by adjusting the gene selection criteria such as the P-value and log2 fold change in DEGs. Fig 2G shows the position of genes selected as DFC among all DEGs by a volcano plot. The large sample size of scRNA-seq and the statistical test on the clusters, which were also determined using the same scRNA-seq data, resulted in overly small FDRs. In addition, the genes selected as DFC among the DEGs were scattered irregularly. The results suggested that genes in DFC were selected independently of the DEG criteria. Similarly, in the MA (log ratio vs. abundance) plot (Fig 2H), genes in DFC were found to be scattered among the DEGs, indicating that the DFC selection was also independent of the gene expression levels.
These results indicate that DFC selects genes according to its own criteria and also selects genes that are more useful for discrimination of the POI among the DEG candidates. S1 Table summarizes the averages of the computational time and the number of extracted genes for DFC and DEG extraction for each method. In addition to the Sampling + adaptive LASSO–logistic regression and Wald test (DESeq2) used in this study, the results of the same analysis using five other methods are also shown. For DFC extraction, we repeated the analyses ten times to average out the effect of subsampling and cross-validation. With limma [5], another option for detecting DEGs that is known to be effective for DE analysis in single-cell RNA-seq data [10], a large number of genes was extracted, as with DESeq2. For the DFC set extracted by each method, we created UpSet plots (S1C Fig) and analyzed the association between genes using the STRING database (S1 Table). The UpSet plots show that LASSO extracted a large number of genes and contained many specific genes, and that there were many genes in common between Sampling + adaptive LASSO and SIS + adaptive LASSO. In addition, Sampling + adaptive LASSO showed the highest average number of edges (associations), while conversely this number was relatively low in SCAD. These results suggest that adaptive LASSO can extract a more biologically meaningful gene set in terms of the gene associations (e.g., co-expression, PPIs, pathways) than other discriminative methods. In the extraction of DFC, the performance of SIS + adaptive LASSO was better than that of Sampling + adaptive LASSO in terms of computational time and the cross-validation error. Therefore, the use of SIS may be a promising option to reduce computational costs. However, in this study, we decided to use all genes by sampling cells to search a wide range of feature genes.
The best hyperparameter γ between 0.5 and 2.0 [14], which controls the size of the ridge penalty for adaptive LASSO, was determined using the faster SIS + adaptive LASSO. The number of selected genes decreased as γ increased, but the selected genes showed consistency (S1C Fig). In addition, the cross-validation error was minimized at γ = 1. Therefore, we fixed γ at 1. The sample size ratio of POI to Other is 11.5, reflecting imbalanced data. This imbalance may cause a decrease in the performance of the discriminative model. We therefore evaluated the AUC of the precision–recall (PR) curve. The AUC was ~1 (S1D Fig), indicating that an appropriate model for discrimination had been created. We further examined the effect of increasing the imbalance between the sizes of POI and Other by training the discriminative model (extraction of DFC) using SIS + adaptive LASSO–logistic regression with a subsample of gradually decreasing POI size (S1D Fig). Even with a 10% subsample, the AUC was about 0.98, indicating that the model has sufficient ability to discriminate between groups. This can be attributed to the fact that we selected well-separable populations in this case of comparison after clustering. In line with the principles of logistic regression analysis for imbalanced data, it is worth checking the predictive performance using PR curves to be aware of any problem of overfitting. Such an overfitted model would overlook the characteristics of the POI.
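
As a rough illustration of the evaluation described above, the area under the PR curve can be approximated by average precision. The sketch below is a generic, dependency-free version and is not the code used in the study; the function name, labels, and scores are hypothetical, and tied scores are not given any special treatment.

```python
def average_precision(labels, scores):
    """Approximate the area under the precision-recall curve (average precision).

    labels: 1 for POI cells, 0 for all other cells.
    scores: predicted probability of belonging to the POI.
    """
    n_positive = sum(labels)
    if n_positive == 0:
        raise ValueError("at least one positive label is required")
    # Rank cells from the highest to the lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Precision at the rank where this positive cell is recovered.
            total += hits / rank
    return total / n_positive


# Hypothetical imbalanced toy example: 2 POI cells among 10 cells.
toy_labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05, 0.01]
print(average_precision(toy_labels, toy_scores))
```

Unlike accuracy, this summary is not inflated by the large number of easily classified non-POI cells, which is why PR-based evaluation suits imbalanced comparisons.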

DFC is useful to identify the combinatorial patterns of gene expression and minor subpopulations

Next, we investigated whether DFC can extract the biological function of the POI. To evaluate this, we interpreted how genes in DFC help to discriminate the POI by referring to known marker genes of MuSC or muscle progenitors. First, we classified the DFC genes into three groups according to the specificity of expression in each cluster: The “Strong” feature refers to the genes expressed in 25% or more of the cells in one or two clusters. The “Weak” feature refers to the genes expressed in three or more clusters, as depicted in Fig 3A. The “Niche” feature is defined as genes with minor expression in less than 25% of the cells in all clusters (Fig 3B for the annotation of clusters, S2 Fig for the original annotation by the authors of the scRNA-seq data, and S4 and S5 Figs for highlighted expression of all genes in DFC as visualized using UMAP).
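
The three-way grouping above can be expressed as a small rule. The sketch below is one reading of that rule, not the code used in the study: it assumes that "expressed in a cluster" means expressed in at least 25% of the cells of that cluster, and that Weak features meet this threshold in three or more clusters. The function name and the example fractions are hypothetical.

```python
def classify_feature(fraction_by_cluster, threshold=0.25):
    """Assign a DFC gene to the Strong, Weak, or Niche group.

    fraction_by_cluster maps a cluster label to the fraction of cells in that
    cluster that express the gene.
    """
    # Count clusters in which the gene reaches the expression threshold.
    n_expressing = sum(1 for fraction in fraction_by_cluster.values() if fraction >= threshold)
    if n_expressing == 0:
        return "Niche"   # below the threshold in every cluster
    if n_expressing <= 2:
        return "Strong"  # specific to one or two clusters
    return "Weak"        # expressed in three or more clusters


# Hypothetical expression fractions for three genes across four clusters.
print(classify_feature({"c1": 0.02, "c2": 0.01, "c3": 0.90, "c4": 0.05}))
print(classify_feature({"c1": 0.40, "c2": 0.55, "c3": 0.60, "c4": 0.10}))
print(classify_feature({"c1": 0.01, "c2": 0.00, "c3": 0.12, "c4": 0.03}))
```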

Fig 3

Biological significance of genes in DFC revealed by the discriminative ability.

(a) According to the specificity of the expression, the genes in DFC are classified into three groups. The three groups are named Strong (specific to 1–2 clusters), Weak (>2), and Niche features (none of them). (b) The data are from samples collected at 0, 2, 5, and 7 days after skeletal muscle injury. In addition, the clusters of fibro/adipogenic progenitors (FAPs), mature skeletal muscle (SKM), and lymphocytes (LYM) are shown. (c) Genes specifically expressed in the POI are assigned to the Strong features. For the Strong feature Cdh15, its expression level for each cluster shown in Fig 2B is plotted. The medians, 25th/75th percentiles, and 1.5 × interquartile range (IQR) are employed to draw the box plots. (d) The Strong features contain many genes that act as markers of skeletal muscle. The results of GO enrichment analysis for the Strong features. GOs are ordered by the proportion of their inclusion in the Strong features. (e) Genes expressed in some clusters are assigned to the Weak features. For the Weak feature Col6a2, its expression level is plotted in the upper panel, and the single-cell expression level visualized by UMAP is plotted in the lower panel. (f, g) The expression levels of Sepw1 and Des, two of the Weak features, are plotted as in (e). (h) Genes with low expression levels that are expressed in a minor subpopulation of the POI are assigned to the Niche features. For the Niche feature Calcr, its expression level is plotted as in (c). (i) Cells expressing Niche features (Calcr, Edn3, and Gm12603) are highlighted. (j) DFC has the property of capturing interrelated genes. STRING is used to connect related DFC. (k) Ribosomal proteins are a notable example of Weak features in DFC that are difficult to interpret in the context of binary combinations. Eighteen ribosomal protein genes in DFC are averaged in each cluster as a heat map.

Genes belonging to the Strong feature included many of the genes known as markers of MuSCs and activated satellite cells (Figs 3C and S4A). M-cadherin (Cdh15) [19] was expressed almost universally in the POI. The results of GO enrichment analysis using only the Strong feature showed that many genes are related to skeletal muscle cells in muscle tissue (Fig 3D). In addition, Myf5 is a representative feature of the group of cells that are not represented by Myog and Myod1 in the POI (S4A Fig). Thus, it can be interpreted that Myf5 plays a different role from other myogenic regulatory factors [20]. These results indicate that DFC can select genes that correspond to single biomarkers, similar to DEGs. The Weak feature (S5 Fig) included Kai (Cd82) [21-23] and a group of genes used as quiescent satellite cell markers such as syndecan-4 (Sdc4). In contrast, the negative LASSO weight of Col6a2, which is selectively expressed in fibro/adipogenic progenitors (FAPs), helps to characterize the POI as non-FAPs (Fig 3E). Furthermore, a notable property of the Weak feature involved the combined expression pattern using multiple genes. For example, Des is expressed in both a part of the POI and the mature SKM population, which is a part of Others. This means that Des alone cannot characterize the POI (Figs 3F and S5). Therefore, to exclude the character of mature SKM from the POI, the condition of Sepw1(-), which is significantly expressed in mature SKM, is additionally imposed (Fig 3G). Thus, DFC can be used as a molecular marker to identify specific cell types for immunostaining and cell sorting, for example, Des(+)Sepw1(-)Col6a2(-). The Niche feature detects minor subpopulations scattered within the POI (Figs 3H and S4B). These genes in DFC are difficult to prioritize in the list of DEGs (i.e., they tend to have larger P-values).
In fact, Calcr, which is ranked lower than 5,000th in the order of P-values in the DEGs, is known to be expressed transiently in a quiescent state [24,25]. We also detected Edn3 [24], which has prominent localization on day 7 after injury in the POI, and Gm12603 (Wincr1; WNT-induced noncoding RNA), a gene expressed in a different cell group from Calcr and Edn3. These binary combinations (+/-) of multiple genes and minor cell groups would appear as the representative cases in Fig 1F. Therefore, DFC can characterize the best combinations of genes for determining cell type and even small subpopulations caused by transient expression or state changes, even using a small list of genes, and even upon comparison with the heterogeneous control group of cells in tissue. To validate this discriminative approach of cell characterization with other data, we obtained the data of bone marrow, a tissue composed of relatively similar cells with the same lineage, as most of the cells in the bone marrow are differentiated from hematopoietic stem cells. Therefore, we thought that DFC could provide a more detailed evaluation. After visualization by UMAP and clustering by the Louvain algorithm, the hematopoietic precursor cell population was set as the POI (target) for analysis (S3A and S3B Fig). Thirty-one genes were extracted, of which 14 were Strong features, 14 were Weak features, and three were Niche features. Next, we attempted DE analysis by limma, and found 877 DEGs, of which two genes overlapped with DFC. Compared to the analysis of muscle tissue, the number of DEGs is small, but this may be due to the similarity of the cells that make up bone marrow. Furthermore, these extracted genes were analyzed in detail by GO enrichment analysis. The GO enrichment analysis using Strong features showed accumulation of lymphoid cell differentiation-related terms such as interferon gamma production and T cell differentiation (S3C Fig).
This result also suggests that the POI is a population of hematopoietic progenitors related to the lymphoid lineage. In addition, three genes were classified as Niche features, and Socs1 was included as a characteristic gene (S3D Fig). Socs1 has been reported to be a negative regulator of the JAK-STAT pathway [26], which is involved in the regulation of myeloid cell differentiation and proliferation [27,28]. It was thus suggested that Socs1 is a factor that characterizes a small subpopulation with suppressed differentiation and proliferation.

DFC extracts genes with functional associations

Finally, we attempted to elucidate more complex functional associations between genes from the obtained DFC. The artificial data in Fig 1 show that DFC selects the pair of genes with dependences among DEG-equivalent genes. This suggests that DFC may tend to contain the network of many-to-many gene relations that forms the unique characteristics of each cell type. First, we examined the correlation matrix of expression levels for the genes in DFC, and found that the POI showed a more distinct hierarchical structure than the Others group (S6 Fig). We further evaluated the functional associations of the 108 genes by STRING [29] (Figs 3J, S7 and S8). Of these, 104 genes in DFC were contained in the STRING database and had 257 association edges, which is more than would occur by chance (87 edges), indicating that these are a set of genes that are strongly related to each other (PPI enrichment P-value < 10⁻⁶). Next, we attempted to interpret the network by referring to the characteristics of the adaptive LASSO–logistic regression. First, we examined the correspondence with the weights of adaptive LASSO. We found that the two clusters of genes related to the function of FAPs with negative weights and the clusters of genes related to the activation of MuSCs with positive weights, such as MyoD and Myog, were linked via the proteoglycan Sdc4 (Fig 3J). Among them, the most remarkable result was obtained for the ribosomal protein-coding genes, which formed a dense cluster of ribosome subunit components. All of these genes are included in the Weak feature. The expression patterns of these genes are not clearly segregated into clusters and are difficult to interpret by binary combinations (S4 Fig), suggesting that they reflect a certain composition of ribosomal protein-coding genes that may have critical functions in MuSCs and their progenitors.
To confirm the discriminative ability of the genes, we extracted only the ribosomal protein-coding genes in the DFC, and visualized them by principal component analysis. The results confirmed that the subset of DFC was sufficient to discriminate the POI from the Others (S9A and S9B Fig). In more detail, the composition of these genes was characterized by higher overall expression levels compared with the Others, with Rpl31 and Rps37 being particularly highly expressed (Fig 3K). In contrast, the POI has a ribosomal protein profile closest to that of lymphocytes (LYM), but the negative weight of Rpl34 appears to differentiate the POI from the lymphocyte population. Furthermore, these ribosomal protein-coding genes clearly captured the temporal changes after muscle injury in the POI (S9C and S9D Fig). This suggests that some ribosomal protein-coding genes are critical factors in stem cell functions, such as self-renewal [30-32], and supports their role in muscle regeneration [33]. To further extend understanding of DFC for biological functions, we performed pathway analysis with the Reactome database [34,35] using all DFC (S2 Table). We found pathways enriched in ribosomal proteins such as rRNA processing and translation, and pathways enriched in skeletal muscle marker genes such as myogenesis. These results are consistent with the results of DFC set analysis by STRING (Fig 3J) and GO enrichment analysis using the Strong feature. Therefore, pathway analysis is also useful for investigating the nature of DFC or POI. We also examined how many DFC belong to essential genes, which are defined by Dickinson et al. as genes essential for survival. According to this definition, 5,280 genes were essential genes [36], and 26.58% of the genes in the data used in this study were essential genes. By comparison, the number of essential genes in DFC was 32 (32/108 ≈ 0.30). Therefore, there was no tendency for essential genes to be selected as DFC.
Genes that are essential for survival do not need to be present as a characteristic of a particular POI. In conclusion, DFC can extract a small set of genes that characterize a POI, including functional associations between genes.
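The comparison of essential-gene proportions above can be checked with a few lines of arithmetic. The following is only a minimal sketch: it assumes that 26.58% is the background fraction of essential genes among all genes in the data, and it uses a one-sided binomial tail as a rough yardstick. The binomial model and the names `binom_upper_tail`, `background`, and `p_upper` are illustrative assumptions, not part of the original analysis.

```python
from math import comb


def binom_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_dfc = 108          # number of genes in DFC (from the text)
n_essential = 32     # essential genes among DFC (from the text)
background = 0.2658  # ASSUMPTION: background fraction of essential genes

observed_fraction = n_essential / n_dfc
p_upper = binom_upper_tail(n_essential, n_dfc, background)

print(f"observed fraction in DFC: {observed_fraction:.3f}")
print(f"P(X >= {n_essential}) under the background rate: {p_upper:.3f}")
```

Under these assumptions, a tail probability far above conventional significance thresholds would be in line with the statement that essential genes show no tendency to be selected as DFC.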

Discussion

In this paper, we have proposed a new concept for characterizing the POI, which is an alternative to the DEG-based approach that uses lists of genes with differences in expression between groups. In addition to the conventional identification of expression markers in cell populations, DFC has the potential to identify subpopulations within the population of interest and to extract the gene networks that regulate these subpopulations, and is thus expected to significantly advance our understanding of cell populations from gene expression analysis. Our method, which can be termed a discriminative approach, has potential applications in the task of cell characterization. In particular, given the recent developments in high-throughput biological measurements, statistical models based on discrimination can be effective for the increased sample size (number of cells) of scRNA-seq. Because our aim is to select a small number of genes that characterize a cell, as in a DEG-based approach, rather than to determine the cell type itself, we employed a variable selection procedure to obtain a small set of genes that are effective for discrimination. Variable selection is a methodology that selects an optimal combination of a small number M < N of variables from N input variables, while preserving the predictive performance of the statistical model. Several methods of variable selection with discriminative models have been developed, such as SVM [37] and logistic regression with a LASSO penalty. Among them, LASSO–logistic regression can construct an interpretable linear model and perform variable selection in one step. However, the gene clusters obtained by these discriminative methods have mostly been used as gene signatures in cell-type classification [38,39] (e.g., whether a cell is normal or malignant), and no attempt has been made to interpret these gene signatures themselves biologically on the basis of rationales that come from the employed statistical method.
In this study, we compared two population groups: the POI, which is specified after nonlinear dimensionality reduction and clustering, and the rest of the population. We assumed this comparison to be the most frequently used procedure for profiling unknown cell populations using scRNA-seq data. However, DE analysis after clustering has been criticized for introducing selection bias, which results in excessively low P-values [12]. This exploratory style of scRNA-seq data analysis makes the proper use of P-values more difficult, whereas our method bypasses the use of P-values. In addition, it can be assumed that a control group including heterogeneous populations will increase the variance in the group and lead to large P-values. For this reason, DE analysis may miss subtle changes of state or fail to discover minor subpopulations within the cell groups of interest. In other words, calling DEGs may not be the best strategy for exploratory discovery in a mixed cell population such as a tissue. Furthermore, a simple two-group comparison of one vs. others is practical enough, which is one of the major advantages of our method. This simple comparison works well because DFC combines multiple Weak/Niche features to improve discriminative performance and pick up even small populations. The advantage is derived from the linearity of the model adopted in DFC; that is, the results can be interpreted as a superposition of features, as described above. As an implementation of the concept of DFC, we employed the framework of binary classification with logistic regression and variable selection with adaptive LASSO. In addition to its several beneficial statistical properties (e.g., consistency in variable selection), adaptive LASSO has superior practical performance among the improved versions of the original LASSO [40].
Although there are many methods for variable selection, such as best subset selection (L0) [41], random forest [42], and SVM [37], we did not provide benchmark tests of each method in this study. However, we believe that how the mathematical properties of each method can be exploited in the various scenarios of scRNA-seq data analysis is an important topic. In this paper, we have discussed the usefulness of the discriminative method when the dependence among genes is taken into account. We also found the discriminative approach to be useful especially in the analysis of tissue scRNA-seq data, where gene expression correlations and subpopulations within the same population are expected to be mixed. The further development of methods that focus on the interpretation of large-scale data is anticipated.

Materials and methods

Setting POI of scRNA-seq data

The normalized count matrix and the cell annotations of the scRNA-seq data were downloaded from GEO (GSE143437). We filtered out genes that were expressed (> 0) in fewer than 10 cells. UMAP visualization was performed using the uwot R package (version 0.1.10) [43]. In the embedded two-dimensional space, the clusters were determined by the Louvain algorithm [44] implemented in the igraph R package (version 1.2.6) [45]. In our analysis of scRNA-seq data from muscle tissue, the POI was set as the cluster in which the majority of cells expressed both Pax7 (53.3%) and Myod1 (42.6%).
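The gene filter and the marker-based choice of the POI described above can be sketched on toy data as follows. This is a hedged illustration in plain Python, not the authors' R code: the function names, the toy matrix, and in particular the rule used in `pick_poi` (take the cluster with the highest value of the smaller of the two marker fractions) are assumptions made for the example.

```python
def filter_genes(counts, min_cells=10):
    """Keep genes expressed (> 0) in at least `min_cells` cells.

    counts: dict mapping gene name -> list of expression values (one per cell).
    """
    return {gene: values for gene, values in counts.items()
            if sum(1 for v in values if v > 0) >= min_cells}


def fraction_expressing(values, clusters, cluster_id):
    """Fraction of cells in `cluster_id` with expression > 0."""
    in_cluster = [v for v, c in zip(values, clusters) if c == cluster_id]
    return sum(1 for v in in_cluster if v > 0) / len(in_cluster)


def pick_poi(counts, clusters, markers=("Pax7", "Myod1")):
    """HYPOTHETICAL rule: choose the cluster whose weakest marker fraction is largest."""
    def score(cluster_id):
        return min(fraction_expressing(counts[m], clusters, cluster_id)
                   for m in markers)
    return max(sorted(set(clusters)), key=score)


# Toy data: 8 cells in two clusters (the threshold is lowered to 2 cells here).
counts = {
    "Pax7":  [3, 2, 0, 1, 0, 0, 0, 0],
    "Myod1": [1, 0, 2, 4, 0, 0, 1, 0],
    "GeneX": [0, 0, 0, 0, 0, 0, 0, 5],
}
clusters = ["c1", "c1", "c1", "c1", "c2", "c2", "c2", "c2"]

filtered = filter_genes(counts, min_cells=2)
poi = pick_poi(filtered, clusters)
print(sorted(filtered), poi)
```

With real data, `min_cells=10` would match the threshold stated in the text, and the cluster labels would come from the Louvain clustering of the UMAP embedding.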

Adaptive LASSO–logistic regression

The adaptive LASSO–logistic regression was performed using the glmnet R package (version 4.1) [46]. The adaptive LASSO was fitted in two steps of parameter estimation: first a ridge regression, and then a LASSO regression with a penalty factor. The ridge (first step) and LASSO (second step) regressions were performed by setting the hyperparameter α in the glmnet::cv.glmnet function to 0 and 1, respectively. The sparsity parameter λ was determined by 10-fold cross-validation (cv.glmnet) of the binomial deviance. The penalty factor was set to 1/|β̂_ridge|, where β̂_ridge is the coefficient estimated by ridge regression in the first step. To reduce the computational cost for real scRNA-seq data, we used 30% subsamples of cells from each cluster (10,324 cells extracted from 34,438 cells in total) in the estimation of β̂_ridge. In the performance comparison of SIS, LASSO and SCAD, the SIS [47] and ncvreg [48] R packages were used (S1 Table).
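The two-step logic described above (ridge fit, then an L1 fit whose per-gene penalty is the inverse of the absolute ridge coefficient) can be illustrated with a small self-contained sketch. It is not the glmnet implementation: it swaps in a plain proximal-gradient solver written in pure Python, fixes λ instead of choosing it by cross-validation, and uses toy Gaussian data. All function names, the toy data, and the values of `lam`, `ridge`, `lr`, and `n_iter` are assumptions for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, l2=0.0, l1_weights=None, lr=0.2, n_iter=400):
    """Logistic regression by (proximal) gradient descent on the mean log-loss.

    l2: ridge strength; l1_weights: per-feature L1 strengths (lambda * penalty factor).
    The intercept is never penalized.
    """
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        g0, grad = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            r = sigmoid(b0 + sum(b * x for b, x in zip(beta, xi))) - yi
            g0 += r
            for j in range(p):
                grad[j] += r * xi[j]
        b0 -= lr * g0 / n
        for j in range(p):
            bj = beta[j] - lr * (grad[j] / n + l2 * beta[j])
            if l1_weights is not None:
                # Soft-thresholding: the proximal step for the weighted L1 penalty.
                bj = math.copysign(max(abs(bj) - lr * l1_weights[j], 0.0), bj)
            beta[j] = bj
    return b0, beta


def adaptive_lasso_logistic(X, y, lam=0.05, ridge=0.01):
    """Step 1: ridge fit. Step 2: L1 fit with penalty factor 1/|beta_ridge|."""
    _, beta_ridge = fit_logistic(X, y, l2=ridge)
    penalty_factor = [1.0 / (abs(b) + 1e-8) for b in beta_ridge]
    _, beta = fit_logistic(X, y, l1_weights=[lam * w for w in penalty_factor])
    return beta_ridge, beta


# Toy data: feature 0 separates the two groups, features 1 and 2 are noise.
rng = random.Random(0)
y = [i % 2 for i in range(200)]
X = [[rng.gauss(2.0 * yi, 1.0), rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for yi in y]

beta_ridge, beta = adaptive_lasso_logistic(X, y)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("ridge coefficients:", [round(b, 3) for b in beta_ridge])
print("adaptive LASSO coefficients:", [round(b, 3) for b in beta])
print("selected features:", selected)
```

The design point the sketch is meant to show is that a feature with a small ridge coefficient receives a large penalty factor and is therefore more likely to be shrunk to exactly zero in the second step, while a strongly discriminative feature is penalized only lightly.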

Differential expression analysis

A raw count matrix was downloaded from GEO (GSE143437). The Wald test was performed using the DESeq2 [8] (version 1.28.1) R package to identify genes that were differentially expressed (DEGs) between the POI and the Others. The parameters were used with the default settings. Genes with FDR < 1% were considered significantly differentially expressed.
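As a hedged illustration of the FDR threshold used above, the sketch below applies the Benjamini–Hochberg adjustment, which is the default adjustment behind the adjusted P-values reported by DESeq2, to a list of P-values and keeps those below 1%. It is a plain Python re-implementation for illustration only; the function names and the example P-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def call_degs(pvalues, fdr=0.01):
    """Indices of tests whose adjusted P-value is below the FDR threshold."""
    padj = benjamini_hochberg(pvalues)
    return [i for i, q in enumerate(padj) if q < fdr]


example_p = [0.001, 0.5, 0.02]  # made-up P-values for three genes
padj = benjamini_hochberg(example_p)
degs = call_degs(example_p, fdr=0.01)
print(padj, degs)
```

In this toy case only the first gene would be called differentially expressed at FDR < 1%.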

Synthetic data generation

The artificial data set consists of randomly generated data points (cells) with four variables (genes). We set the variables to be equivalent genes in terms of DE. Because the statistical significance of a DEG estimated by the z- or t-statistic is uniquely determined by the variances and the difference of means between groups, we modified only the relationships between the genes, while maintaining the variance and the difference of means. Specifically, the two cases of DE-equivalent genes were generated as follows.

Case I: Correlated expression. All four variables follow the Gaussian distribution with constant variance $\sigma^2 = 1$ and difference of means $|\mu_A - \mu_B| = 2$, where A and B indicate groups of cells (1000 cells each). All pairs of variables are independent ($r = 0$), except for the pair $(X_3, X_4)$, which has a strong correlation ($r = 0.7$) in both groups.

Case II: Heterogeneous population. All four variables have the same variance $\sigma^2$ and the same group means $\mu_A$, $\mu_B$. We set $X_1$ and $X_2$ of group B as having an exclusive relationship. We divided group B into three subgroups (B1–3: 333 cells each). Group B1 expresses $X_1$, group B2 expresses $X_2$, and group B3 expresses neither of them. The other variables are independent Gaussian variables. To equalize the variance and means in group B to those of the other variables, we used a mixture distribution as the marginal distribution of $X_1$ and $X_2$ in B:

$$f(x) = p\, g(x; \mu_1, \sigma_1^2) + (1 - p)\, g(x; \mu_2, \sigma_2^2),$$

where $g$ is the Gaussian probability density function and $p$ is the proportion of the subgroup relative to the size of group B. In general, the mean and variance of $f$ are calculated as follows:

$$\mu = p\mu_1 + (1 - p)\mu_2, \qquad \sigma^2 = p\sigma_1^2 + (1 - p)\sigma_2^2 + p(1 - p)(\mu_1 - \mu_2)^2.$$

We used the parameters $\sigma_1^2 = \sigma_2^2 = 1$, $\mu_1 = 0$, $\mu_2 = 5$, $p = 2/3$, and hence $\mu_B = 5/3$ and $\sigma^2 = 59/9$ for all variables. We set $\mu_A = 5$.
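The stated values for group B can be verified with the standard moment formulas for a two-component Gaussian mixture (mean p·μ1 + (1 − p)·μ2, variance p·σ1² + (1 − p)·σ2² + p(1 − p)(μ1 − μ2)²), and the correlated pair of Case I can be simulated directly. The following is a minimal sketch in plain Python, not the authors' generator: the function names, the seed, and the construction of the correlated pair from two independent standard normals are illustrative assumptions.

```python
import math
import random
from fractions import Fraction


def mixture_mean_var(p, mu1, var1, mu2, var2):
    """Mean and variance of the mixture p*N(mu1, var1) + (1-p)*N(mu2, var2)."""
    mean = p * mu1 + (1 - p) * mu2
    var = p * var1 + (1 - p) * var2 + p * (1 - p) * (mu1 - mu2) ** 2
    return mean, var


# Case II parameters from the text: p = 2/3, mu1 = 0, mu2 = 5, unit variances.
mu_B, var_B = mixture_mean_var(Fraction(2, 3), 0, 1, 5, 1)
print(mu_B, var_B)  # 5/3 59/9


def sample_case1_pair(rng, r=0.7):
    """Draw (X3, X4) with unit variances and correlation r (centered at 0)."""
    x3 = rng.gauss(0.0, 1.0)
    x4 = r * x3 + math.sqrt(1.0 - r * r) * rng.gauss(0.0, 1.0)
    return x3, x4


def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # arbitrary seed for reproducibility
pairs = [sample_case1_pair(rng) for _ in range(1000)]
r_hat = pearson([a for a, _ in pairs], [b for _, b in pairs])
print(f"sample correlation of (X3, X4): {r_hat:.2f}")
```

Exact fractions are used for the mixture moments so that the values 5/3 and 59/9 quoted in the text are reproduced without rounding; group means would be added to the centered pair to obtain the two groups of Case I.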

Computational environment

All calculations in this study were performed under the following environment. CPU: Intel Xeon Skylake-SP (2.3 GHz, 18 cores) × 2. Memory: 384 GB.

(a) Pairs of marker genes that are simulated by ESCO. The lower triangle shows the plot of each pair of variables; the diagonal elements show the distribution of each variable; and the upper triangle shows the within-cluster correlation coefficient of each pair of variables. The highlighted pairs are prioritized in DFC selection. (b) UpSet plot to compare DFC sets extracted by each of the four methods (SCAD, SIS + adaptive LASSO, Sampling + adaptive LASSO and LASSO). Each column represents the number of genes that are shared by only the marked sets. (c) UpSet plot to compare DFC sets extracted by SIS + adaptive LASSO with γ = 0.5, 1, 2. (d) As the ratio of A to B increases, the performance of the generated model deteriorates. The PR curve when the sample size of POI is gradually decreased (100%, 90%,…, 10%). (PDF)

Original annotations of scRNA-seq data by the authors [GSE143437].

(a) Days after muscle injury. (b) Cell type annotations. (PDF)

(a) The scRNA-seq data embedded into two-dimensional space with UMAP. (b) “POI” and “Others” determined by the Louvain algorithm. POI corresponds to the cluster of muscle satellite cells (MuSCs) and their progenitors. (c) The Strong features contain many genes that are the markers of POI. The results of GO enrichment analysis for the Strong features. GOs are ordered by the contained proportion of Strong feature genes. (d) Zoom in on the area indicated in Fig S3b. Cells expressing Niche features (Calcr, Edn3, and Gm12603) are highlighted. (PDF)

UMAP visualizations of (a) Strong and (b) Niche features.

The markers colored in red/blue indicate the sign of the weight (coefficient) estimated by adaptive LASSO. (PDF)

UMAP visualizations of Weak feature.

The markers colored in red/blue indicate the sign of the weight (coefficient) estimated by adaptive LASSO. (PDF)

The correlation coefficient matrix of genes in DFC shows a distinct hierarchical structure within the POI.

The matrices for (a) the POI and (b) the Others are shown. Types of feature (Strong, Weak, or Niche) and the signs of the LASSO weights are also indicated. (PDF)

Functional association of genes in DFC.

The functional associations among 108 DFC genes were annotated using STRING. (PDF)

Functional association of DEGs.

The functional associations of the top 108 DEGs in terms of P-value were annotated using STRING. (PDF)

Some of the ribosomal protein-coding genes characterizing the POI.

Scatter plot showing the results of PCA performed using (a) ribosomal protein-coding genes in DFC and (b) all ribosomal protein-coding genes. (c) PCA performed in the POI using genes in (a), and in all cells (POI and others) using genes in (b). (PDF)

Summary of the performance of the DFC and DEG extraction methods and the results of analyzing the associations between DFC sets using the STRING database.

(XLSX)

Pathways enriched in DFC analyzed with Reactome database.

(XLSX)

13 Jul 2021

Dear Prof. Ohkawa,

Thank you very much for submitting your manuscript "Discriminative feature of cells characterizes cell populations of interest by a small subset of genes" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission.
We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Lilia M. Iakoucheva, Ph.D.
Associate Editor
PLOS Computational Biology

Ilya Ioshikhes
Deputy Editor
PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors have proposed using the Elastic Net package in R to address the limitation of differentially expressed gene identification. The idea looks legitimate. However, the technical contribution to methodology could be limited. There are several key questions which have to be addressed to fit into the preference of computational biology, on top of bioinformatics.

1. It is straightforward to expect that the adaptive LASSO can always identify a small set of key genes for cell type / population identification. Therefore, the DFC set looks smaller than the DEG. I was wondering what are the biological significance of that DFC set? Are those genes only used to differentiate cell populations? What are the implications behind the DFC set. Is the DFC set reflecting / overlapping with the "essential genes" ? https://www.nature.com/articles/nrg.2017.75

2. The technical contribution to the methodology looks limited as it is only based on the off-the-book-shelf adaptive LASSO method from the well-known Elastic Net package in R. Comparisons to other feature selection or model regularization methods in machine learning could be beneficial.

3. The underlying pathways behind those DFC genes could be interesting.

4. The concept of DEG has been well-established in the past years. It is therefore important to tell the difference between DFC and DEG. The overlapping and non-overlapping genes between DFC and DEG should be analysed further to reveal its biological difference / similarity.

5. The running time and computing environment should be reported as gene expression data can be very big in scale.

Reviewer #2: In this manuscript, authors proposed a new concept of discriminative feature of cells (DFC), which was an alternative to the DEG-based approach that uses lists of genes with differences in expression. DFC was a discriminative approach implemented using adaptive LASSO logistic regression. The results revealed that DFC well captured cell-type-specific markers, specific gene expression patterns, and subcategories of the cell population. Some comments are given below.

1. For section “Adaptive LASSO–logistic regression” staring from line 380, the adaptive Lasso used was a little different from the method in reference [13]. For computations of adaptive LASSO, the gamma parameter in the weights or penalty factor was directly set to be 1. In the reference, the optimal pair of gamma and lambda parameters was obtained by cross-validation. It is suggested to explain why gamma was directly set to be 1.

2. Ridge regression estimates were used in the first step with 30% subsamples of cells from each cluster. Please give more details about the sample size and the number of variables involved. For ultrahigh dimensional statistical models, (I)SIS is suggested. And based on the code available on Github, Ridge regression estimates were chosen by cross-validation (cv.glmnet), which is not shown in the manuscript. It is suggested to give more details of the implemented adaptive LASSO.

Diego Franco Saldana and Yang Feng (2018) SIS: An R package for Sure Independence Screening in Ultrahigh Dimensional Statistical Models, Journal of Statistical Software, 83, 2, 1-25.

Jianqing Fan and Jinchi Lv (2008) Sure Independence Screening for Ultrahigh Dimensional Feature Space (with discussion). Journal of Royal Statistical Society B, 70, 849-911.

3. Since logistic regression was involved, it is suggested to consider the impact of imbalanced binary response data.

4. It is suggested to compare adaptive LASSO with other variable selection methods, e.g. SCAD.

5. It is suggested to demonstrate the computation efficiency of DFC methods. For example, the computation time comparison between DEG and DFC methods.

6. It is suggested to do another real data analysis to validate the results for DFC.

7. For the code on Github, string.nb.html is not displayed correctly.

Reviewer #3: This article develops a new method to identify cell-type-specific genes based on single-cell RNA sequencing (scRNA-seq) data. The basic idea is to fit a classification model between the population of interest and other groups, and the identification of marker genes is viewed as a variable selection problem. In this article, the authors promoted the use of a logistic regression model with adaptive Lasso penalty, and genes with nonzero regression coefficients are defined to be discriminative features of cells (DFC). Overall, this article introduces a tool that could be useful in practice, but I would hope that the authors can address the issues I list below.

1. The authors should have provided a more comprehensive review of existing works. The use of discriminative models for differential expression analysis is not completely new. For example, [1] fits a logistic regression to predict cell membership, although their predictors are transcripts instead of genes. The authors need to explain the overlap with prior art and highlight the novelty.

2. The proposed method was mainly compared with DESeq2. However, DESeq2 was originally designed for bulk RNA-seq data, and scRNA-seq data have certain characteristics that may harm the performance of these methods. This phenomenon was explained in [2] in more details. Therefore, the authors may want to consider some methods that are tailored for scRNA-seq, for example the ones listed in Table 1 of [2].

3. The simulation of synthetic data can be made more realistic, as scRNA-seq data in real world contain more noise than the simple normal distribution used in the article. For example, the drop-out effect is one characteristic of scRNA-seq that should not be ignored. The authors may consider existing simulators such as Splatter [3] and ESCO [4].

4. Minor comment: the text "in principle the P-value decreases with increasing sample size" in line 79-80 is not very accurate. Under the null hypothesis (no difference), P-value follows a uniform distribution, so it does not decrease to zero as sample size increases. The actual meaning is that a larger sample size can detect smaller and smaller differences, so with a fixed difference, a larger sample size usually gives smaller P-values.

5. Minor comment: in Figure 1d and 1g, what are the actual lambda values? Does cross validation correctly select the sparse model?

[1] Ntranos, V., Yi, L., Melsted, P., & Pachter, L. (2019). A discriminative learning approach to differential expression analysis for single-cell RNA-seq. Nature methods, 16(2), 163-166.

[2] Wang, T., Li, B., Nelson, C. E., & Nabavi, S. (2019). Comparative analysis of differential gene expression analysis tools for single-cell RNA sequencing data. BMC bioinformatics, 20(1), 1-16.

[3] Zappia, L., Phipson, B., & Oshlack, A. (2017). Splatter: simulation of single-cell RNA sequencing data. Genome biology, 18(1), 1-15.

[4] Tian, J., Wang, J., & Roeder, K. (2020). ESCO: single cell expression simulation incorporating gene co-expression. bioRxiv.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #1: None

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

14 Sep 2021

Submitted filename: Response_to_Reviewers.docx

9 Oct 2021

Dear Prof. Ohkawa,

Thank you very much for submitting your manuscript "Discriminative feature of cells characterizes cell populations of interest by a small subset of genes" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Lilia M. Iakoucheva, Ph.D.
Associate Editor
PLOS Computational Biology

Ilya Ioshikhes
Deputy Editor
PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately.

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors have addressed my comments.

Reviewer #2: This manuscript has been improved a lot after revision. Some comments are given below.

1. In line 467, it is said that “We obtained 30% subsamples of cells from each cluster (16,351 cells in total ).” However, in the response for referee#2-2, it is said that “The total number of cells after subsampling was 10,324. Genes with fewer than 10 cells expressed in the subsampled state were filtered out, and 16,351 genes were used as variables.” Please double check.

2. For UpSet plots (Fig. S1b&c), results are confusing. For example, in Fig. S1b, the number of interaction features of all four methods is 32, but the number of interaction features of SCAD and Lasso is 27<32. Please check.

Reviewer #3: The authors have addressed my previous questions.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?
Reviewer #1: None

Reviewer #2: Yes

Reviewer #3: Yes

**********

Do you want your identity to be public for this peer review?

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

References: Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

11 Oct 2021

Submitted filename: Responce_to_Reviewers_v2.docx

19 Oct 2021

Dear Prof. Ohkawa,

We are pleased to inform you that your manuscript 'Discriminative feature of cells characterizes cell populations of interest by a small subset of genes' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.
Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated. IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript. Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS. Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. Best regards, Lilia M. Iakoucheva, Ph.D. Associate Editor PLOS Computational Biology Ilya Ioshikhes Deputy Editor PLOS Computational Biology *********************************************************** Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #2: The authors have addressed my previous comments. ********** Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. 
If there are restrictions on publicly sharing data or code (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #2: Yes

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

1 Nov 2021

PCOMPBIOL-D-21-00960R2
Discriminative feature of cells characterizes cell populations of interest by a small subset of genes

Dear Dr Ohkawa,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,
Zsofia Freund
PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom
ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol


1 in total