Takeru Fujii, Kazumitsu Maehara, Masatoshi Fujita, Yasuyuki Ohkawa.
Abstract
Organisms are composed of various cell types with specific states. To obtain a comprehensive understanding of the functions of organs and tissues, cell types have been classified and defined by identifying specific marker genes. Statistical tests, which often evaluate differences in the mean expression levels of genes, are critical for identifying marker genes. Differentially expressed gene (DEG)-based analysis has been the most frequently used method of this kind. However, as sample sizes have grown, for example in single-cell analysis, DEG-based analysis has faced difficulties associated with the inflation of P-values. Here, we propose the concept of discriminative feature of cells (DFC), an alternative to DEG-based approaches. We implemented DFC using logistic regression with an adaptive LASSO penalty to perform binary classification for discriminating a population of interest and variable selection to obtain a small subset of defining genes. Using artificial data, we demonstrated that DFC prioritized gene pairs with non-independent expression, and we showed that DFC enabled characterization of the muscle satellite/progenitor cell population. The results revealed that DFC captured cell-type-specific markers, specific gene expression patterns, and subcategories of this cell population well. DFC may complement DEG-based methods for interpreting large data sets. DEG-based analysis uses lists of genes with differences in expression between groups, while DFC, which can be termed a discriminative approach, has potential applications in the task of cell characterization. With recent advances in the high-throughput analysis of single cells, data from cell characterization methods such as scRNA-seq can be effectively analyzed with discriminative methods.
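The abstract names logistic regression with an adaptive LASSO penalty as the way DFC is implemented. The sketch below is one minimal way such a fit could look, not the authors' implementation: the proximal-gradient solver, the ridge pilot fit used to build the adaptive weights, the exponent gamma = 1, and the toy data generator `make_toy_data` are all assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lam=0.0, weights=None, ridge=0.0, lr=0.5, n_iter=500):
    """Proximal-gradient fit of penalized logistic regression.

    Minimizes: mean log-loss + (ridge / 2) * sum(b_j^2)
               + lam * sum(weights[j] * |b_j|).
    The intercept b0 is not penalized.
    """
    n, p = len(X), len(X[0])
    weights = weights or [1.0] * p
    b0, b = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Gradient of the mean log-loss at the current coefficients.
        g0, g = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            r = sigmoid(b0 + sum(bj * xj for bj, xj in zip(b, xi))) - yi
            g0 += r / n
            for j in range(p):
                g[j] += r * xi[j] / n
        b0 -= lr * g0
        for j in range(p):
            z = b[j] - lr * (g[j] + ridge * b[j])
            t = lr * lam * weights[j]  # soft-threshold level for the L1 term
            b[j] = math.copysign(max(abs(z) - t, 0.0), z)
    return b0, b


def adaptive_lasso(X, y, lam, gamma=1.0, ridge=0.01):
    """Two-step adaptive LASSO (assumed scheme, for illustration only)."""
    # Step 1: ridge-penalized pilot fit gives initial coefficient estimates.
    _, pilot = fit_logistic(X, y, ridge=ridge)
    # Step 2: features with small pilot coefficients get a heavier L1 penalty.
    w = [1.0 / (abs(bj) ** gamma + 1e-8) for bj in pilot]
    return fit_logistic(X, y, lam=lam, weights=w)


def make_toy_data(n=200, seed=0):
    """Hypothetical data: feature 0 separates the groups, features 1-3 are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        x0 = rng.gauss(1.0 if label else -1.0, 1.0)
        X.append([x0] + [rng.gauss(0.0, 1.0) for _ in range(3)])
        y.append(label)
    return X, y
```

Calling `adaptive_lasso(*make_toy_data(), lam=0.1)` returns an intercept and a coefficient list. On this toy design the informative feature should keep a positive weight while the noise features are shrunk to zero, which is the "small subset of defining genes" behaviour the abstract describes.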
Year: 2021 PMID: 34797848 PMCID: PMC8641884 DOI: 10.1371/journal.pcbi.1009579
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 1. The dependent pairs of gene expression selected as the DFC.
(a) Different concepts for gene selection of DFC and DEG. The common goal is to extract a set of genes that characterizes the population of interest (left). A DEG-based approach involves a list of genes with statistically significant differences between the studied groups. In contrast, the DFC-based approach involves a subset of genes that distinguish between two populations (top-right). DFC is expected to feature a small set of genes selected by taking into account the relationships among genes (bottom-right). (b–d) Artificially generated data set in which DFC has priority over DEG; case 1: correlation. (b) Schematic of the synthesized data design. Only the pair X3 and X4 has intra-group correlation; the other pairs are independent. All variables have the same variance, and the differences in means are the same for all pairs (see Materials and Methods for details). (c) Pairs that are easier to classify are given priority to become DFC. The lower triangle shows the plot of each pair of variables, the diagonal elements show the distribution of each variable, and the upper triangle shows the within-cluster correlation coefficient for each pair of variables. The decision boundary in the plane of the selected variable pair X3 and X4 is shown as a solid line. (d) The process of selecting discriminative variables; solution path. This shows how the weights (partial regression coefficients) of each variable change as the regularization parameter λ (sparsity) is varied. (e–g) Synthesized data set in which DFC has priority over DEG; case 2: exclusive. (e) Schematic of the synthesized data design. In one-third of the group A cells, the expression of X1 and that of X2 are mutually exclusive. The variances and the means of variables are designed as in case 1. In other words, this simulates a logical product relationship such that cells that express X1 and X2 simultaneously are equivalent to the population of group A.
(f) An example of logical relationships of case 2, shown in a scatter plot as in (c). (g) The solution path in case 2 as shown in (d).
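Case 1 states that a pair with intra-group correlation can be easier to classify even when variances and mean differences match those of the independent pairs. A small worked calculation can make this concrete. The sketch below assumes both groups are bivariate Gaussian with a shared covariance and equal group sizes, which the caption does not state, and the specific mean shifts and correlation value are hypothetical. Under those assumptions, the squared Mahalanobis distance between group means determines the best achievable accuracy.

```python
import math


def mahalanobis_sq(d1, d2, sigma=1.0, rho=0.0):
    """Squared Mahalanobis distance between two group means.

    d1, d2: mean differences of the two variables between groups.
    sigma:  common standard deviation of both variables.
    rho:    within-group correlation of the pair.
    """
    return (d1 * d1 - 2.0 * rho * d1 * d2 + d2 * d2) / (
        sigma ** 2 * (1.0 - rho ** 2)
    )


def ideal_accuracy(dist_sq):
    """Best achievable accuracy for two equally sized Gaussian groups
    with shared covariance: Phi(D / 2), with Phi the standard normal CDF."""
    half_d = math.sqrt(dist_sq) / 2.0
    return 0.5 * (1.0 + math.erf(half_d / math.sqrt(2.0)))


# Hypothetical settings: unit variance, unit mean differences.
independent = mahalanobis_sq(1.0, 1.0, rho=0.0)
# Correlated pair whose mean shift runs across the correlation axis.
across = mahalanobis_sq(1.0, -1.0, rho=0.8)
# Correlated pair whose mean shift runs along the correlation axis.
along = mahalanobis_sq(1.0, 1.0, rho=0.8)
```

In this sketch the correlated pair is more separable than the independent pair only when the mean shift runs across the correlation axis. When the shift runs along it, the pair is harder to separate. Which geometry the published design uses is specified in the Materials and Methods, not in this caption.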
Fig 2. Smaller gene set of DFC was selected by a unique selection criterion.
(a) Procedure of DEG and DFC extraction from scRNA-seq data. (b–e) The determined POI is compared with all other cell clusters in the muscle tissue. The scRNA-seq data are embedded into two-dimensional space with UMAP. (b) The clusters determined by the Louvain algorithm. The 12th cluster corresponds to the cluster of muscle stem cells and their progenitors. (c) The 12th cluster is set as the POI, and the other clusters are assigned as the control group, “Others.” (d, e) Single-cell expression levels for Pax7 and Myod1. (f) Some of the DEGs are selected as DFC. Venn diagram indicating the overlap of DEGs and DFC. (g, h) Genes in DFC not selected by the DEGs’ criteria. (g) Volcano plot of DEGs and (h) MA plot of DEGs.
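Panels (g) and (h) place each gene on a volcano plot and an MA plot. As a reminder of what those axes conventionally mean, here is a minimal sketch of how one gene's coordinates could be computed from group means and a P-value. The pseudocount and the use of plain group means are assumptions. The normalization actually used for the figure is not given in this caption.

```python
import math


def ma_point(mean_poi, mean_others, pseudocount=1.0):
    """Return (M, A) for one gene.

    M is the log2 fold change (POI over Others) and A is the average
    log2 expression. The pseudocount (an assumed choice) avoids log2(0).
    """
    a = math.log2(mean_poi + pseudocount)
    b = math.log2(mean_others + pseudocount)
    return a - b, 0.5 * (a + b)


def volcano_point(mean_poi, mean_others, p_value, pseudocount=1.0):
    """Return (log2 fold change, -log10 P) for one gene."""
    m, _ = ma_point(mean_poi, mean_others, pseudocount)
    return m, -math.log10(p_value)
```

For example, a gene with mean expression 7 in the POI and 1 in Others gives M = 2 and A = 2 with the default pseudocount. Because the volcano plot's vertical axis depends on the P-value, it grows with sample size for a fixed fold change, which is the P-value inflation the abstract refers to.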
Fig 3. Biological significance of genes in DFC revealed by the discriminative ability.
(a) According to the specificity of the expression, the genes in DFC are classified into three groups. The three groups are named Strong (specific to 1–2 clusters), Weak (>2), and Niche features (none of them). (b) The data are from samples collected at 0, 2, 5, and 7 days after skeletal muscle injury. In addition, the clusters of fibro/adipogenic progenitors (FAPs), mature skeletal muscle (SKM), and lymphocytes (LYM) are shown. (c) Genes specifically expressed in the POI are assigned to the Strong features. For the Strong feature Cdh15, its expression level for each cluster shown in Fig 2B is plotted. The medians, 25th/75th percentiles, and 1.5 × interquartile range (IQR) are employed to draw the box plots. (d) The Strong features contain many genes that act as markers of skeletal muscle. The results of GO enrichment analysis for the Strong features. GOs are ordered by the proportion of their inclusion in the Strong features. (e) Genes expressed in some clusters are assigned to the Weak features. For the Weak feature Col6a2, its expression level is plotted in the upper panel, and the single-cell expression level visualized by UMAP is plotted in the lower panel. (f, g) The expression levels of Sepw1 and Des, two of the Weak features, are plotted as in (e). (h) Genes with low expression levels that are expressed in a minor subpopulation of the POI are assigned to the Niche features. For the Niche feature Calcr, its expression level is plotted as in (c). (i) Cells expressing Niche features (Calcr, Edn3, and Gm12603) are highlighted. (j) DFC has the property of capturing interrelated genes. STRING is used to connect related DFC. (k) Ribosomal proteins are a notable example of Weak features in DFC that are difficult to interpret in the context of binary combinations. The expression levels of eighteen ribosomal protein genes in DFC are averaged in each cluster and shown as a heat map.
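The three-way grouping in panel (a) can be written as a simple rule on the number of clusters in which a gene is specifically expressed. The sketch below encodes only the thresholds given in the caption. How a gene is judged "specific" to a cluster is not stated here, so the count is taken as an input, and the function name is hypothetical.

```python
def classify_dfc_gene(n_specific_clusters):
    """Assign a DFC gene to Strong, Weak, or Niche features.

    n_specific_clusters: number of clusters in which the gene is judged
    specifically expressed (the criterion itself is not defined here).
    """
    if n_specific_clusters < 0:
        raise ValueError("cluster count must be non-negative")
    if n_specific_clusters == 0:
        return "Niche"   # specific to none of the clusters
    if n_specific_clusters <= 2:
        return "Strong"  # specific to 1-2 clusters
    return "Weak"        # specific to more than 2 clusters
```

Under this rule a gene specific to a single cluster is a Strong feature, while a gene specific to no cluster is a Niche feature.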