Piu Upadhyay, Sumanta Ray.
Abstract
Cell type prediction is one of the most challenging goals in the analysis of single-cell RNA sequencing (scRNA-seq) data. Existing methods use unsupervised learning to identify signature genes in each cluster, followed by a literature survey to look up those genes and assign cell types. However, finding potential marker genes in each cluster is cumbersome, which impedes the systematic analysis of scRNA-seq data. To address this challenge, we propose a framework based on regularized multi-task learning (RMTL) that simultaneously learns the subpopulations associated with particular cell types. Learning the structure of each subpopulation is treated as a separate task in the multi-task learner. Regularization is used to modulate the task models (e.g., $W_1, W_2, \ldots, W_t$) jointly, according to a specific prior. To validate the model, we trained it on reference data constructed from a single-cell RNA sequencing experiment and applied it to a query dataset. We also used the model trained on the reference data to predict cell types in a completely independent query dataset. We assessed the efficacy of the proposed method by comparing it with other state-of-the-art techniques for cell type detection. The results show that the proposed method accurately detects cell types in scRNA-seq data and can therefore serve as a useful tool in the scRNA-seq pipeline.
Keywords: cell type detection; manual annotation; marker genes; regularized multi-task learning (RMTL); scRNA-seq data; supervised learning
Year: 2022 PMID: 35495159 PMCID: PMC9043858 DOI: 10.3389/fgene.2022.788832
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.772
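The abstract does not spell out the objective, but regularized multi-task learning is commonly posed as a joint minimization over the task weight vectors. A generic form, given here as an assumption rather than the authors' exact formulation, is:

$$
\min_{W_1,\ldots,W_t} \; \sum_{k=1}^{t} \frac{1}{n_k} \sum_{i=1}^{n_k} \ell\big(y_{k,i},\, x_{k,i}^{\top} W_k\big) \;+\; \lambda\, \Omega(W_1,\ldots,W_t)
$$

Here $n_k$, $x_{k,i}$, $y_{k,i}$, the loss $\ell$, the penalty $\Omega$, and the weight $\lambda$ are illustrative symbols that the source does not define. The penalty $\Omega$ is where the prior enters: it couples the otherwise independent tasks so that they are learned jointly.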
Details of the datasets used.
| Dataset | No. of cells | No. of genes | No. of cell types |
|---|---|---|---|
| CBMC | 7,895 | 2,000 | 13 |
| Goolam | 124 | 40,315 | 5 |
| Melanoma | 5,038 | 3,546 | 8 |
| PBMC | 68,793 | 32,738 | 11 |
| Yan | 90 | 20,514 | 7 |
| Klein | 2,717 | 24,175 | 4 |
FIGURE 1. Workflow of the proposed approach for cell type identification. The data are randomly divided into training and test sets. The cell types present in the training set are used to train the multi-task learning classifier with cross-validation. The learned model is then evaluated on the test set, and accuracy is measured with a confusion matrix.
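The workflow in Figure 1 can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' pipeline: a nearest-centroid rule stands in for the RMTL classifier, cross-validation is omitted, and the function names and toy data are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def split_cells(cells, labels, test_fraction=0.2, seed=0):
    """Randomly divide cells into training and test sets."""
    order = list(range(len(cells)))
    random.Random(seed).shuffle(order)
    n_test = round(test_fraction * len(order))

    def pick(indices):
        return [cells[i] for i in indices], [labels[i] for i in indices]

    return pick(order[n_test:]), pick(order[:n_test])


def fit_centroids(cells, labels):
    """Placeholder classifier (NOT RMTL): mean expression profile per cell type."""
    counts = Counter(labels)
    sums = {}
    for profile, label in zip(cells, labels):
        total = sums.setdefault(label, [0.0] * len(profile))
        for j, value in enumerate(profile):
            total[j] += value
    return {lab: [v / counts[lab] for v in total] for lab, total in sums.items()}


def predict(centroids, cells):
    """Assign each cell to the cell type with the nearest centroid."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return [min(centroids, key=lambda lab: sq_dist(centroids[lab], cell))
            for cell in cells]


def confusion_matrix(true_labels, predicted_labels):
    """Nested counts indexed as matrix[true][predicted]."""
    matrix = defaultdict(Counter)
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


def accuracy(matrix):
    """Fraction of cells on the diagonal of the confusion matrix."""
    correct = sum(row[label] for label, row in matrix.items())
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total


# Toy data (hypothetical): two well-separated "cell types", two genes each.
rng = random.Random(1)
cells = [[c + rng.uniform(-1, 1), c + rng.uniform(-1, 1)]
         for c in (0, 10) for _ in range(20)]
labels = [name for name in ("type_a", "type_b") for _ in range(20)]

(train_x, train_y), (test_x, test_y) = split_cells(cells, labels)
model = fit_centroids(train_x, train_y)
cm = confusion_matrix(test_y, predict(model, test_x))
print(accuracy(cm))
```

Swapping in a real classifier only requires replacing `fit_centroids` and `predict`; the split and the confusion-matrix bookkeeping stay the same.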
Prediction accuracy on test data for different datasets and different methods. Results refer to integrated data representations for all datasets. The test accuracy is displayed as mean ± standard deviation, referring to 100 randomly initialized training runs. Percentages refer to the relative amount of training data used during training. Maximum values for means and minimum values for standard deviation of the test accuracy are highlighted in bold.
| Method | 40% | 60% | 80% | 100% | 40% | 60% | 80% | 100% |
|---|---|---|---|---|---|---|---|---|
| | | | | | | | | |
| scPred | 71.87 ± 0.29 | 77.92 ± 0.20 | 84.81 ± 0.09 | 90.26 ± 0.01 | 61.25 ± 0.31 | 69.32 ± 0.20 | 75.46 ± 0.21 | 78.71 ± 0.10 |
| ACTINN | 70.78 ± 0.25 | | | 96.03 ± 0.10 | 62.14 ± 0.35 | 68.42 ± 0.31 | 73.81 ± 0.22 | 77.85 ± 0.10 |
| CHETAH | 66.97 ± 0.11 | 73.71 ± 0.15 | 87.91 ± 0.10 | 94.34 ± 0.01 | 68.91 ± 0.30 | 72.34 ± 0.15 | 77.63 ± 0.11 | 81.29 ± 0.10 |
| Garnett | 69.75 ± 0.28 | 79.68 ± 0.19 | 85.59 ± 0.19 | 96.01 ± 0.18 | 64.81 ± 0.30 | | | 81.61 ± 0.10 |
| RMTL | | 80.53 ± 0.01 | 89.58 ± 0.06 | | | 70.34 ± 0.03 | 76.85 ± 0.02 | |
| | | | | | | | | |
| scPred | 60.29 ± 0.31 | 68.91 ± 0.30 | | | 65.86 ± 0.05 | 69.87 ± 0.17 | 71.48 ± 0.13 | 78.25 ± 0.10 |
| ACTINN | 62.30 ± 0.43 | 67.51 ± 0.35 | 73.90 ± 0.19 | 78.31 ± 0.09 | 63.21 ± 0.31 | 73.91 ± 0.29 | 74.56 ± 0.19 | 81.29 ± 0.19 |
| CHETAH | 62.38 ± 0.20 | 65.19 ± 0.10 | 77.35 ± 0.11 | 81.82 ± 0.07 | 61.37 ± 0.28 | 63.45 ± 0.22 | 72.71 ± 0.19 | 81.19 ± 0.10 |
| Garnett | 66.27 ± 0.05 | 68.31 ± 0.05 | 71.81 ± 0.01 | 79.72 ± 0.01 | | 72.53 ± 0.10 | 77.81 ± 0.05 | 82.10 ± 0.01 |
| RMTL | | | 70.61 ± | 82.12 ± 0.01 | 64.50 ± 0.10 | | | |
| | | | | | | | | |
| scPred | | | 77.10 ± 0.25 | 86.57 ± 0.19 | 77.16 ± 0.05 | | | 97.25 ± 0.10 |
| ACTINN | 62.10 ± 0.33 | 64.51 ± 0.35 | 68.90 ± 0.19 | 70.31 ± 0.09 | 76.11 ± 0.21 | 88.91 ± 0.29 | 90.16 ± 0.19 | 94.29 ± 0.01 |
| CHETAH | 60.38 ± 0.20 | 66.29 ± 0.10 | 72.15 ± 0.11 | 85.82 ± 0.07 | 71.37 ± 0.28 | 78.45 ± 0.22 | 83.71 ± 0.29 | 92.29 ± 0.10 |
| Garnett | 66.27 ± 0.05 | 72.21 ± 0.05 | | 83.72 ± 0.01 | 75.50 ± 0.10 | 82.53 ± 0.10 | 85.81 ± 0.05 | 92.10 ± 0.01 |
| RMTL | 63.27 ± 0.05 | 68.31 ± 0.05 | 72.41 ± 0.01 | | | 80.33 ± 0.10 | 87.81 ± 0.05 | |
The percentages in the column headers give the amount of training data used during the training phase.
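The "mean ± standard deviation" entries in the table above summarize repeated training runs. A minimal sketch of that summary step follows; the method names and accuracy values are hypothetical, and the use of the sample standard deviation is an assumption, since the source does not say which estimator was used.

```python
from statistics import mean, stdev


def summarize(accuracies):
    """Format per-run test accuracies (in %) as 'mean ± std'."""
    # stdev() is the sample standard deviation (n - 1 denominator);
    # this choice is an assumption, not taken from the source.
    return f"{mean(accuracies):.2f} ± {stdev(accuracies):.2f}"


def best_by_mean(runs):
    """Name of the method with the highest mean accuracy."""
    return max(runs, key=lambda method: mean(runs[method]))


# Hypothetical accuracies from three runs of two placeholder methods.
runs = {
    "method_a": [80.0, 82.0, 81.0],
    "method_b": [79.0, 79.5, 80.0],
}
table = {method: summarize(accs) for method, accs in runs.items()}
print(table)
print(best_by_mean(runs))
```

With 100 runs per setting, as in the caption, the same two functions apply unchanged; only the length of each list grows.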
FIGURE 2. Prediction results on the Melanoma dataset. (A) Two-dimensional t-SNE plot of the original and predicted labels of the melanoma data. (B) t-SNE visualization of original and predicted labels for individual cells. Each column shows three panels: the first and second show the original and predicted labels in a two-dimensional t-SNE embedding, and the third is a donut plot of the proportions of true-positive and false-positive samples among the predicted labels. (C) Donut plot of the proportions of original cell types in the data. (D) Donut plot of the proportions of predicted labels.
Percentage of correct predictions (recall and precision) for all competing models on the CBMC and Melanoma datasets.
| Cell type | No. of samples in the dataset | RMTL recall (%) | RMTL precision (%) | Garnett recall (%) | Garnett precision (%) | scPred recall (%) | scPred precision (%) | ACTINN recall (%) | ACTINN precision (%) | CHETAH recall (%) | CHETAH precision (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CBMC | | | | | | | | | | | |
| Eryth | 105 | 94 | 93.1 | 92.6 | 86.5 | 81.8 | 78.7 | 93.9 | 91.8 | 89.8 | 84.7 |
| NK | 1,089 | 87.77 | 94.8 | 88.7 | 89.7 | 84 | 80.8 | 89.2 | 88.1 | 86 | 80.3 |
| CD14+ mono | 2,293 | 97.7 | 99.1 | 98.8 | 98.1 | 93.8 | 92.7 | 99 | 97.5 | 97.8 | 96 |
| MK | 96 | 92.1 | 89.6 | 85.3 | 81.2 | 78.6 | 71.9 | 88.1 | 85.3 | 83.7 | 78.9 |
| CD34+ | 119 | 89.04 | 88.8 | 87.8 | 82.9 | 82.3 | 81.8 | 96.4 | 91.5 | 84.3 | 80.2 |
| DC | 70 | 91.1 | 90.8 | 82.8 | 79.8 | 79 | 78.6 | 92.7 | 90.6 | 82 | 80 |
| Memory CD4 T | 1,781 | 97 | 95.1 | 97.6 | 91.6 | 90 | 91.7 | 97 | 96.4 | 93.9 | 90.8 |
| CD8 T | 273 | 90.2 | 89.7 | 86 | 81.8 | 83.8 | 79.0 | 90.4 | 82.8 | 89 | 81.2 |
| CD16+ mono | 230 | 87.7 | 88.5 | 90 | 86.8 | 80.7 | 78.4 | 85.8 | 80.9 | 81.8 | 80 |
| B | 350 | 93.3 | 91.07 | 92.7 | 90.5 | 88.6 | 88.1 | 96 | 94.6 | 92.7 | 89.2 |
| T/mono doublets | 182 | 92.7 | 91.5 | 88.7 | 81.3 | 85.1 | 81.7 | 97.2 | 90 | 91.8 | 88.3 |
| pDCs | 49 | 91.8 | 90 | 93.2 | 85.5 | 81.2 | 75.2 | 93 | 88.8 | 86.5 | 78.6 |
| Naive CD4 T | 1,248 | 98.2 | 96.6 | 97.8 | 89 | 88.3 | 81.8 | 98 | 86.7 | 88 | 79.1 |
| Melanoma | | | | | | | | | | | |
| B cells | 729 | 96.5 | 93.6 | 91.8 | 89 | 88.3 | 81.8 | 95 | 86.7 | 88 | 79.1 |
| Macrophages | 225 | 89.3 | 88.1 | 81.7 | 87.3 | 86.8 | 89.1 | 88.7 | 81.5 | 84.7 | 83.9 |
| NK | 87 | 83.9 | 85.6 | 80.9 | 84.3 | 80.1 | 81.9 | 82.7 | 82.3 | 83.6 | 85.1 |
| CAF | 685 | 95.8 | 97.7 | 91.7 | 95.8 | 93.2 | 94.4 | 96.8 | 96.9 | 95.7 | 93.4 |
| Endothelial cells | 360 | 90.3 | 92.8 | 88.8 | 90.6 | 90.1 | 90.5 | 91.5 | 91.7 | 89.4 | 90.2 |
| CD4 T cells | 1,044 | 98.1 | 97.3 | 96.1 | 97.9 | 95.8 | 98.4 | 98.2 | 95.1 | 92.9 | 95.8 |
| CD8 T cells | 1,643 | 97.1 | 97.7 | 95.6 | 96.3 | 96.8 | 92.9 | 96.1 | 96.0 | 93.9 | 97.1 |
| Treg cells | 225 | 90.6 | 89.1 | 88.2 | 88.1 | 89.0 | 86.8 | 89.2 | 86.9 | 87.1 | 88.9 |
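Per-cell-type recall and precision, as reported in the table above, can be computed directly from the true and predicted labels. The sketch below is illustrative only: the function name and the five-cell example are hypothetical, and returning 0.0 for a type that is never predicted is a convention chosen here.

```python
from collections import Counter


def per_class_recall_precision(true_labels, predicted_labels):
    """Return {cell_type: (recall %, precision %)}, rounded to one decimal."""
    actual = Counter(true_labels)
    predicted = Counter(predicted_labels)
    true_positive = Counter()
    for t, p in zip(true_labels, predicted_labels):
        if t == p:
            true_positive[t] += 1

    scores = {}
    for cell_type in actual:
        # Recall: correctly predicted cells / all cells of this type.
        recall = 100.0 * true_positive[cell_type] / actual[cell_type]
        # Precision: correctly predicted cells / all cells predicted as this type.
        # It is undefined when the type is never predicted; report 0.0 here.
        n_pred = predicted[cell_type]
        precision = 100.0 * true_positive[cell_type] / n_pred if n_pred else 0.0
        scores[cell_type] = (round(recall, 1), round(precision, 1))
    return scores


# Hypothetical labels for five cells.
true_labels = ["NK", "NK", "NK", "B", "B"]
predicted_labels = ["NK", "NK", "B", "B", "B"]
scores = per_class_recall_precision(true_labels, predicted_labels)
print(scores)
```

In this example one NK cell is mislabeled as B, which lowers recall for NK and precision for B while leaving the other two scores at 100%.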
p-values obtained from the Wilcoxon rank-sum test for the five competing methods.
| Method | CBMC | Klein | Melanoma | PBMC68k | Goolam | Yan |
|---|---|---|---|---|---|---|
| scPred | 2.01E-02 | 1.09E-03 | 3.87E-02 | 4.6E-02 | 1.87E-03 | 1.98E-02 |
| ACTINN | 1.08E-02 | 2.8E-03 | 2.08E-02 | 2.96E-02 | 1.09E-03 | 1.87E-02 |
| CHETAH | 1.78E-03 | 1.6E-02 | 1.78E-02 | 2.98E-02 | 1.98E-03 | 2.89E-02 |
| Garnett | 2.86E-02 | 1.76E-02 | 2.10E-02 | 1.76E-02 | 1.65E-03 | 1.87E-02 |
| RMTL | 1.05E-03 | 2.56E-03 | 1.89E-03 | 1.87E-02 | 1.09E-03 | 1.07E-03 |
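For readers unfamiliar with the test, the following is a minimal pure-Python sketch of a two-sided Wilcoxon rank-sum test using the normal approximation, without tie or continuity correction. The source does not state which samples were compared, so the per-fold accuracies below are hypothetical, as are the function names.

```python
import math


def midranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average rank of the tied block
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation)."""
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    w = sum(ranks[:n1])                       # rank sum of the first sample
    expected = n1 * (n1 + n2 + 1) / 2         # mean of w under the null
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - expected) / std
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Hypothetical per-fold accuracies for two placeholder methods.
method_a = [0.91, 0.93, 0.92, 0.94, 0.95]
method_b = [0.85, 0.88, 0.86, 0.90, 0.87]
z, p_value = rank_sum_test(method_a, method_b)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

The normal approximation is rough for samples this small; an exact test would be preferable when only a handful of folds are available.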
FIGURE 3. Test accuracy across all folds for the five competing methods on the CBMC data.
Execution time (in minutes) for the five competing methods.
| Dataset | No. of features | No. of cells | No. of classes | scPred (min) | ACTINN (min) | Garnett (min) | CHETAH (min) | RMTL (min) |
|---|---|---|---|---|---|---|---|---|
| Data 1 | 2,000 | 500 | 2 | 2 | 1 | 2 | 3 | 1 |
| Data 2 | 2,000 | 1,000 | 3 | 4 | 1 | 4 | 7 | 1 |
| Data 3 | 2,000 | 1,500 | 4 | 10 | 5 | 10 | 11 | 4 |
| Data 4 | 2,000 | 2,000 | 5 | 15 | 10 | 16 | 16 | 9 |