Yuval Lieberman, Lior Rokach, Tal Shay.
Abstract
Single-cell RNA sequencing (scRNA-seq) is an emerging technology for profiling the gene expression of thousands of cells at single-cell resolution. Currently, the labeling of cells in an scRNA-seq dataset is performed by manually characterizing clusters of cells or by fluorescence-activated cell sorting (FACS). Both methods have inherent drawbacks: the first depends on the clustering algorithm used and on the knowledge and arbitrary decisions of the annotator, and the second involves an experimental step in addition to the sequencing and cannot be incorporated into the higher-throughput scRNA-seq methods. We therefore suggest a different approach for cell labeling, namely, classifying cells from scRNA-seq datasets by using a model transferred from different (previously labeled) datasets. This approach can complement existing methods and, in some cases, even replace them. Such a transfer-learning framework requires selecting informative features and training a classifier. The specific implementation of the framework that we propose, designated "CaSTLe: classification of single cells by transfer learning," is based on a robust feature engineering workflow and an XGBoost classification model built on these features. Evaluation of CaSTLe against two benchmark feature-selection and classification methods showed that it outperformed the benchmark methods in most cases and consistently yielded satisfactory classification accuracy. CaSTLe has the additional advantage of being parallelizable and well suited to large datasets. We showed that it was possible to classify cell types using transfer learning, even when the databases contained a very small number of genes, and our study thus indicates the potential applicability of this approach for analysis of scRNA-seq datasets.
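The framework described in the abstract (restrict to shared genes, select informative features, train on a labeled source dataset, classify an unlabeled target dataset) can be illustrated with a minimal sketch. This is not the CaSTLe implementation: variance-based selection stands in for the paper's feature engineering workflow, a nearest-centroid rule stands in for the XGBoost model, and all function names, gene names and toy values are hypothetical.

```python
from statistics import mean, pvariance

def common_genes(source_genes, target_genes):
    """Genes measured in both datasets, kept in source order."""
    target_set = set(target_genes)
    return [g for g in source_genes if g in target_set]

def select_features(source_expr, genes, k):
    """Keep the k genes with the highest variance across source cells.
    Assumption: a simple stand-in for the paper's feature engineering."""
    variances = {g: pvariance([cell[g] for cell in source_expr]) for g in genes}
    return sorted(genes, key=lambda g: variances[g], reverse=True)[:k]

def fit_centroids(source_expr, source_labels, features):
    """Nearest-centroid stand-in for the XGBoost classifier:
    one mean expression profile per source label."""
    centroids = {}
    for label in set(source_labels):
        cells = [c for c, l in zip(source_expr, source_labels) if l == label]
        centroids[label] = [mean(c[g] for c in cells) for g in features]
    return centroids

def predict(cell, centroids, features):
    """Assign the label whose centroid is closest (squared Euclidean distance)."""
    profile = [cell[g] for g in features]
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(profile, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

# Toy data: each cell is a dict mapping a (hypothetical) gene name to expression.
source_expr = [
    {"g1": 9.0, "g2": 1.0, "g3": 5.0},
    {"g1": 8.0, "g2": 0.0, "g3": 5.0},
    {"g1": 1.0, "g2": 7.0, "g3": 5.0},
    {"g1": 0.0, "g2": 8.0, "g3": 5.0},
]
source_labels = ["A", "A", "B", "B"]
target_expr = [
    {"g1": 7.5, "g2": 0.5, "g4": 2.0},
    {"g1": 0.5, "g2": 6.0, "g4": 0.0},
]

shared = common_genes(["g1", "g2", "g3"], ["g1", "g2", "g4"])
features = select_features(source_expr, shared, k=2)
centroids = fit_centroids(source_expr, source_labels, features)
target_predictions = [predict(cell, centroids, features) for cell in target_expr]
```

The key point of the sketch is that the model only ever sees genes present in both datasets, so the trained classifier can be applied to the target without retraining.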
Year: 2018 PMID: 30304022 PMCID: PMC6179251 DOI: 10.1371/journal.pone.0205499
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Test datasets.
| Dataset number | Group¹ | Accession | Classes | Number of cells | Number of genes | Imbalance² | Sparsity (%)³ |
|---|---|---|---|---|---|---|---|
| 1 | HSCs | GSE59114 | 3 | 1,428 | 8,422 | 1.5 | 57.4 |
| 2 | HSCs | GSE59114 | 3 | 564 | 19,586 | 1.8 | 62.8 |
| 3 | HSCs | GSE81682 | 3 | 1,920 | 34,892 | 3.9 | 24.9 |
| 4 | Retina | GSE63473 | 3 | 37,309 | 23,288 | 18.1 | 95.3 |
| 5 | Retina | GSE81904 | 3 | 26,530 | 13,166 | 258.2 | 93.1 |
| 6 | Embryo | GSE45719 | 5 | 256 | 22,431 | 9.5 | 40.8 |
| 7 | Embryo | E-MTAB-3321 | 5 | 124 | 41,427 | 10.7 | 28.4 |
| 8 | Pancreas | GSE81608 | 2 | 1,358 | 39,851 | 1.9 | 71.7 |
| 9 | Pancreas | E-MTAB-5061 | 2 | 1,156 | 25,525 | 3.3 | 64.5 |

¹ See Materials and Methods for an explanation of HSCs, Retina, Embryo and Pancreas.
² Calculated as the number of samples in the majority class divided by the number of samples in the minority class.
³ Calculated as the percentage of zero values, before any preprocessing and feature selection.
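The imbalance and sparsity definitions in the footnotes above translate directly into code. The sketch below is a minimal illustration on toy data; the function names and the list-of-rows matrix layout are assumptions, not taken from the paper.

```python
from collections import Counter

def imbalance(labels):
    """Majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

def sparsity_percent(matrix):
    """Percentage of zero entries in a raw (unprocessed) expression matrix."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return 100.0 * zeros / total

# Toy example: 6 cells of one class and 4 of another; a 2 x 4 count matrix.
toy_labels = ["a"] * 6 + ["b"] * 4
toy_matrix = [[0, 3, 0, 0], [1, 0, 0, 2]]

toy_imbalance = imbalance(toy_labels)        # 6 / 4
toy_sparsity = sparsity_percent(toy_matrix)  # 5 zeros out of 8 entries
```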
Dataset pairs used for transfer learning.
| Dataset pair¹ | Organism | Classes | Class type | Number of common genes | Distribution similarity (cosine similarity of class distributions) |
|---|---|---|---|---|---|
| HSCs 1 to 2 | Mouse | 3 | Ordinal | 7,423 | 0.97 |
| HSCs 1 to 3 | Mouse | 3 | Ordinal | 7,423 | 0.85 |
| HSCs 2 to 3 | Mouse | 3 | Ordinal | 7,423 | 0.77 |
| Retina | Mouse | 3 | Nominal | 12,307 | 0.22 |
| Embryo | Mouse | 5 | Ordinal | 12,783 | 0.32 |
| Pancreas | Human | 2 | Nominal | 16,299 | 0.98 |

¹ See Materials and Methods for an explanation of HSCs, Retina, Embryo and Pancreas.
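The distribution similarity column is described as the cosine similarity of the class distributions of the two datasets. A minimal sketch of that calculation follows, assuming each dataset is summarized as a vector of class fractions over a shared, ordered list of classes; the function names and toy labels are hypothetical, and the paper may construct the vectors differently.

```python
import math
from collections import Counter

def class_distribution(labels, classes):
    """Fraction of cells in each class, in the order given by `classes`."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in classes]

def cosine_similarity(p, q):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm

# Toy example: the two datasets have mirrored class proportions.
classes = ["rod", "bipolar"]
source_dist = class_distribution(["rod"] * 8 + ["bipolar"] * 2, classes)
target_dist = class_distribution(["rod"] * 2 + ["bipolar"] * 8, classes)
similarity = cosine_similarity(source_dist, target_dist)
```

A value near 1 means the source and target have similar class proportions; a low value, as for the Retina pair, means the classifier is trained on a class mix very different from the one it is applied to.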
Fig 1. Classification accuracy of transfer learning by CaSTLe compared to the cross-validation accuracy (upper bound) and majority vote.
Results are shown for the 12 source-target dataset pairs tested (X-axis). Error bars reflect the standard deviation across ten runs (not applicable to majority vote).
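For readers unfamiliar with the majority-vote baseline, the sketch below shows one plausible reading: every target cell is assigned the most frequent source label, which is deterministic and therefore has no spread across runs. Whether the paper takes the majority class from the source or the target is an assumption here, as is the use of the sample standard deviation for the error bars; all names and values are hypothetical.

```python
from collections import Counter
from statistics import mean, stdev

def majority_baseline_accuracy(source_labels, target_labels):
    """Accuracy obtained by predicting the most common source label for every target cell."""
    majority = Counter(source_labels).most_common(1)[0][0]
    hits = sum(1 for label in target_labels if label == majority)
    return hits / len(target_labels)

def summarize_runs(accuracies):
    """Mean and sample standard deviation of accuracy over repeated runs."""
    return mean(accuracies), stdev(accuracies)

# Toy example with made-up labels and run accuracies.
baseline = majority_baseline_accuracy(
    ["rod", "rod", "rod", "cone"],
    ["rod", "cone", "cone", "rod", "rod"],
)
run_mean, run_sd = summarize_runs([0.90, 0.92, 0.91, 0.89, 0.93])
```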
Fig 2. Classification accuracy of transfer learning by CaSTLe compared to two benchmark methods.
Results are shown for the 12 source-target dataset pairs tested (X-axis). Error bars reflect the standard deviation across ten runs.
Fig 3. Classification of a target dataset by multiple binary classifiers.
(A) The label of each cell in the target dataset (left), the label each cell was given by the classifiers, with black for unclassified (center), and the correctness of classification (right). (B) Heatmap of the scores given to each cell by each classifier, from zero (blue) to one (red). The horizontal black line separates the labels for which classifiers were trained from the labels for which they were not. (a) Classification of target dataset GSE81904 by multiple binary classifiers built on source dataset GSE63473. The top bar shows which classifiers correspond to labels that are present in the target dataset. Note that many unknown cells were classified as bipolar or Müller, which may be correct. (b) Classification of target dataset GSE81608 by multiple binary classifiers built on source dataset E-MTAB-5061. The top bar shows which classifiers correspond to labels that are present in the target dataset. Note that some of the seemingly incorrect results are very likely correct: the cells labeled 'alpha contaminated' are classified as alpha or ductal, and likewise for the beta, gamma and delta contaminated cells. (c) Classification of target dataset E-MTAB-5061 by multiple binary classifiers built on source dataset GSE81608. Note that many cells of the 'novel' cell types (acinar, ductal) were not classified and were thus 'identified as novel'. Many of the cells counted as incorrect are labeled 'coexpression', 'not applicable' or 'unclassified', meaning that their classification may be correct. (d) Classification of target dataset GSE63473 by multiple binary classifiers built on source dataset GSE81904.
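The caption describes one binary classifier per source cell type, each giving every target cell a score between zero and one, with some cells left unclassified. A minimal sketch of how such scores could be turned into labels follows. The decision rule (take the highest-scoring type, and leave the cell unclassified if no score reaches a threshold) and the 0.5 threshold are assumptions for illustration, not the rule stated in the paper.

```python
def assign_labels(scores, threshold=0.5):
    """Turn per-classifier scores into one label per cell.

    scores: list of dicts, one per cell, mapping cell type -> score in [0, 1].
    Cells whose best score is below `threshold` (an assumed cutoff) are
    reported as "unclassified", which lets novel cell types stay unlabeled.
    """
    assigned = []
    for cell_scores in scores:
        best_type = max(cell_scores, key=cell_scores.get)
        if cell_scores[best_type] >= threshold:
            assigned.append(best_type)
        else:
            assigned.append("unclassified")
    return assigned

# Toy scores from three hypothetical binary classifiers.
toy_scores = [
    {"alpha": 0.95, "beta": 0.10, "delta": 0.02},
    {"alpha": 0.20, "beta": 0.70, "delta": 0.05},
    {"alpha": 0.10, "beta": 0.15, "delta": 0.08},  # no confident classifier
]
toy_assignments = assign_labels(toy_scores)
```

Under this rule, a cell type absent from the source dataset tends to end up unclassified, which matches the "identified as novel" behavior noted for acinar and ductal cells.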
CaSTLe performance per cell type.
| Cell type | Instances in source | Instances in target | Accuracy | Sensitivity | Specificity | AUC |
|---|---|---|---|---|---|---|
| Amacrine cells | 9.9% | 0.9% | 99.9% | 99.9% | 94% | 99.7% |
| Astrocytes | 0.1% | None | 99.9% | 99.9% | - | - |
| Bipolar cells | 14% | 85.4% | 97.5% | 84.8% | 99.7% | 95% |
| Cones | 4.2% | 0.2% | 99.9% | 100% | 93.7% | 100% |
| Fibroblasts | 0.2% | None | 99.8% | 99.8% | - | - |
| Ganglion cells | 0.9% | None | 99.9% | 99.9% | - | - |
| Horizontal cells | 0.6% | None | 99.9% | 99.9% | - | - |
| Microglia | 0.1% | None | 99.9% | 99.9% | - | - |
| Müller cells | 3.6% | 10.7% | 99.2% | 99.1% | 99.6% | 99.9% |
| Pericytes | 0.1% | None | 99.9% | 99.9% | - | - |
| Rods | 65.6% | 0.3% | 99.9% | 99.9% | 96.7% | 99.2% |
| Vascular endothelial cells | 0.6% | None | 99.9% | 99.9% | - | - |
| Amacrine cells | 0.9% | 9.9% | 95.9% | 97.9% | 77.7% | 96.2% |
| Bipolar cells | 87.6% | 14% | 94.1% | 95.2% | 87.1% | 96.4% |
| Cones | 0.2% | 4.2% | 98.7% | 99.8% | 71.4% | 99.5% |
| Müller cells | 10.9% | 3.6% | 99.4% | 99.5% | 96.4% | 99.6% |
| Rods | 0.3% | 65.6% | 65.2% | 90.3% | 52.1% | 88.6% |
| Alpha cells | 59.4% | 25.2% | 88.8% | 85.8% | 97.7% | 96.9% |
| Beta cells | 31.6% | 7.7% | 94.1% | 93.8% | 97.8% | 98.8% |
| Delta cells | 3.3% | 3.2% | 95.7% | 96.3% | 85.1% | 98.5% |
| Gamma cells | 5.7% | 5.6% | 85.4% | 85.6% | 82.2% | 94.9% |
| MHC class II cells | 0.2% | None | 100% | 100% | - | - |
| Pancreatic stellate cells | 2.5% | None | 99.1% | 99.1% | - | - |
| Acinar cells | 8.5% | None | 99.8% | 99.8% | - | - |
| Alpha cells | 40.9% | 55.4% | 90.6% | 97.5% | 85.1% | 98.7% |
| Beta cells | 12.5% | 29.5% | 97.3% | 98.8% | 93.9% | 99.5% |
| Co-expression | 1.8% | None | 99.6% | 99.6% | - | - |
| Delta cells | 5.3% | 3.1% | 98.6% | 99.6% | 65.3% | 99.1% |
| Ductal cells | 17.8% | None | 98.7% | 98.7% | - | - |
| Endothelial cells | 0.7% | None | 100% | 100% | - | - |
| Epsilon cells | 0.3% | None | 100% | 100% | - | - |
| Gamma cells | 9.1% | 5.3% | 97.7% | 99.5% | 64.7% | 99.7% |
| Mast cells | 0.3% | None | 100% | 100% | - | - |
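The accuracy, sensitivity and specificity columns can be understood through a one-vs-rest confusion count per cell type. The sketch below uses the common convention in which the named cell type is the positive class. In the table, rows with no target instances report sensitivity but not specificity, which suggests the authors may treat the complementary class as positive, so the convention here is an assumption; the function name and toy labels are hypothetical.

```python
def one_vs_rest_metrics(true_labels, predicted_labels, cell_type):
    """Accuracy, sensitivity and specificity for one cell type against all others.

    Returns None for a metric whose denominator is zero (for example,
    sensitivity when the cell type never occurs in the true labels).
    """
    tp = fp = tn = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        is_positive = truth == cell_type
        called_positive = predicted == cell_type
        if is_positive and called_positive:
            tp += 1
        elif is_positive:
            fn += 1
        elif called_positive:
            fp += 1
        else:
            tn += 1
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return accuracy, sensitivity, specificity

# Toy example: one rod is misclassified as a cone.
toy_true = ["rod", "rod", "cone", "cone", "rod"]
toy_pred = ["rod", "cone", "cone", "cone", "rod"]
rod_accuracy, rod_sensitivity, rod_specificity = one_vs_rest_metrics(
    toy_true, toy_pred, "rod"
)
```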
Method runtimes.
| Source dataset | Target dataset | Multi-class classification time (seconds) | Per-cell-type binary classification time (seconds) |
|---|---|---|---|
| 1 | 2 | 9 | 21 |
| 2 | 1 | 6 | 13 |
| 1 | 3 | 12 | 26 |
| 3 | 1 | 12 | 27 |
| 2 | 3 | 7 | 15 |
| 3 | 2 | 10 | 23 |
| 4 | 5 | 847 | 3055 |
| 5 | 4 | 794 | 1162 |
| 6 | 7 | 8 | 26 |
| 7 | 6 | 5 | 20 |
| 8 | 9 | 57 | 72 |
| 9 | 8 | 73 | 194 |