M E Nelson, S G Riva, A Cvejic.
Abstract
BACKGROUND: Single-cell RNA-sequencing is revolutionising the study of cellular and tissue-wide heterogeneity in a large number of biological scenarios, from highly tissue-specific studies of disease to human-wide cell atlases. A central task in single-cell RNA-sequencing analysis design is the calculation of cell type-specific genes in order to study the differential impact of different replicates (e.g. tumour vs. non-tumour environment) on the regulation of those genes and their associated networks. The crucial task is the efficient and reliable calculation of such cell type-specific 'marker' genes, which optimise the ability of the experiment to isolate highly specific cell phenotypes of interest to the analyser. However, while methods exist that can calculate marker genes from single-cell RNA-sequencing data, no such method places emphasis on specific cell phenotypes for downstream study, e.g. in differential gene expression or other experimental protocols such as spatial transcriptomics. Here we present SMaSH, a general computational framework for extracting key marker genes from single-cell RNA-sequencing data which reliably characterise highly specific and niche populations of cells in numerous different biological data-sets.
Keywords: Feature selection; Marker genes; Single-cell RNA-sequencing
Year: 2022 PMID: 35941549 PMCID: PMC9361618 DOI: 10.1186/s12859-022-04860-2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Fig. 1 The SMaSH framework. A SMaSH works directly from the counts matrix, producing a dictionary relating the user-defined classes of interest (e.g. cell type annotations) to top marker genes for each class (default top 5). B SMaSH filters and ranks genes according to an ensemble learning model or a deep neural network
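The input/output shape described in the Fig. 1 caption (a counts matrix in, a dictionary from each class to its top-ranked genes out) can be illustrated with a minimal sketch. This is not the SMaSH implementation or its API: the function name `top_markers`, the toy data, and the gene names are hypothetical, and a simple one-vs-rest difference in mean counts stands in for the ensemble or neural-network ranking that SMaSH actually uses.

```python
from collections import defaultdict


def top_markers(counts, genes, labels, top_n=5):
    """Map each class label to its top_n genes.

    Assumption: genes are scored by a one-vs-rest difference in mean
    counts, a stand-in for the model-based ranking described in Fig. 1.
    """
    # Group the rows (cells) of the counts matrix by class label.
    by_class = defaultdict(list)
    for row, label in zip(counts, labels):
        by_class[label].append(row)

    markers = {}
    for label, rows in by_class.items():
        rest = [r for r, l in zip(counts, labels) if l != label]
        scores = []
        for j, gene in enumerate(genes):
            mean_in = sum(r[j] for r in rows) / len(rows)
            mean_out = sum(r[j] for r in rest) / len(rest) if rest else 0.0
            scores.append((mean_in - mean_out, gene))
        # Highest score first; keep only the gene names.
        scores.sort(key=lambda s: s[0], reverse=True)
        markers[label] = [gene for _, gene in scores[:top_n]]
    return markers


# Toy counts matrix: 4 cells (rows) x 4 genes (columns), hypothetical values.
counts = [
    [9, 0, 1, 2],
    [8, 1, 0, 2],
    [0, 7, 1, 2],
    [1, 9, 0, 2],
]
genes = ["GeneA", "GeneB", "GeneC", "GeneD"]
labels = ["T cell", "T cell", "B cell", "B cell"]

markers = top_markers(counts, genes, labels, top_n=1)
print(markers)
```

The `top_n` default of 5 mirrors the default stated in the caption; everything else in the sketch is illustrative only.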
Single-cell RNA-sequencing data-sets in this study
| Data-set | Technology | Cells | Genes | Classes | References |
|---|---|---|---|---|---|
| Human lung cancer (broad) | 10X scRNA-seq | 54,574 | 18,612 | 7 | N.A. |
| Human lung cancer (cell sub-types) | 10X scRNA-seq | 54,574 | 18,612 | 34 | N.A. |
| Mouse brain (broad) | Single nucleus RNA-seq | 40,532 | 31,053 | 9 | [ |
| Mouse brain (cell sub-types) | Single nucleus RNA-seq | 40,532 | 31,053 | 31 | [ |
| Zeisel | 10X scRNA-seq | 3005 | 4000 | 7 | [ |
| CITE-seq | CITE-seq | 8617 | 500 | 13 | [ |
| Paul15 | MARS-seq | 2730 | 3451 | 10 | [ |
| Human foetal liver | 10X scRNA-seq | 65,712 | 19,572 | 18 | [ |
| Human foetal organs | 10X scRNA-seq | 211,754 | 23,054 | 40 | [ |
The different data-sets considered in the benchmarking of SMaSH
Fig. 2 Classifying broad cell types based on SMaSH-specific marker genes. A Confusion matrices for the top 30 marker genes per cell type in the lung broad cell classification data-set for scGeneFit, RankCorr, SMaSH using the network mode, and SMaSH using the ensemble mode (using XGBoost). B Cell misclassification and scores for the two SMaSH modes against scGeneFit and RankCorr. C Benchmarking different SMaSH ensemble learning models across biological scRNA-seq data and related modalities
Fig. 3 Marker genes for the broad mouse brain cell types. A The mean |Shapley value| for the top 30 ranked marker genes across all broad cell types of the mouse brain, before additional filtering and sorting, using SMaSH’s network mode. Different colours indicate the different class contributions which that particular gene explains. B The final three markers for each class/broad cell type are shown, with the colour profile corresponding to the mean logarithm of the gene expression and a pattern uniquely matching specific markers to specific cell types
Fig. 4 Marker gene misclassification rates and scores in cell types in the lung and mouse brain. Performance for each human lung cancer cell sub-type and framework, including the two modes in SMaSH. HLC: Human lung cancer; MB: Mouse brain
Fig. 5 Marker genes for the mouse brain cell sub-types from the Inhibitory neuron broad types, and human foetal organ of origin classification. The mean logarithm of gene expression for mouse brain Inhibitory neuron cell sub-type markers. A Markers from scGeneFit; B markers from RankCorr; C markers from SMaSH’s network mode. Particularly in the case of SMaSH, unique patterns can still be identified in this highly granular cell-type identification problem, whereas approaches such as scGeneFit are not able to identify many markers which uniquely resolve the sub-types present. D SMaSH is able to select statistically significant markers for a highly imbalanced problem of distinguishing organs of origin in foetal scRNA-seq
Marker gene misclassification rates in organs of origin in early foetal development
| Data-set | scGeneFit | RankCorr | SMaSH (DNN) | SMaSH (RF) | SMaSH (BRF) | SMaSH (XGBoost) |
|---|---|---|---|---|---|---|
| HFO skin versus kidney versus liver | (13.9, 0.85) | (5.2, 0.95) | (1.1, 0.99) | (1.4, 0.99) | (1.8, 0.98) | (1.2, 0.99) |
Performance in early foetal organ data, including the four different models implemented in SMaSH. All metrics are summarised as (M, ) tuples. The top 2 performing models are indicated in bold for each data-set
HFO: Human foetal organs
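As a rough illustration of how a misclassification rate such as the first entry of each tuple above could be obtained from a confusion matrix like those in Fig. 2A, the sketch below computes the percentage of off-diagonal cells. The record does not define M, so treating it as a percentage of misclassified cells is an assumption; the function name and the confusion matrix values are hypothetical, and the second entry of each tuple is not reproduced because its definition is missing from the extracted text.

```python
def misclassification_rate(confusion):
    """Percentage of cells that fall off the diagonal of a square
    confusion matrix (rows: true class, columns: predicted class).

    Assumption: this is how M in the table is defined; the source
    record does not spell it out.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return 100.0 * (total - correct) / total


# Hypothetical 3-class confusion matrix (e.g. skin, kidney, liver).
confusion = [
    [95, 3, 2],
    [4, 90, 6],
    [1, 2, 97],
]

rate = misclassification_rate(confusion)
print(f"{rate:.1f}% of cells misclassified")
```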