Mohit Goyal, Guillermo Serrano, Josepmaria Argemi, Ilan Shomorony, Mikel Hernaez, Idoia Ochoa.
Abstract
MOTIVATION: An important step in the transcriptomic analysis of individual cells involves manually determining the cellular identities. To ease this labor-intensive annotation of cell-types, there has been a growing interest in automated cell annotation, which can be achieved by training classification algorithms on previously annotated datasets. Existing pipelines employ dataset integration methods to remove potential batch effects between source (annotated) and target (unannotated) datasets. However, the integration and classification steps are usually independent of each other and performed by different tools. We propose JIND, a neural-network-based framework for automated cell-type identification that performs integration in a space suitably chosen to facilitate cell classification. To account for batch effects, JIND performs a novel asymmetric alignment in which unseen cells are mapped onto the previously learned latent space, avoiding the need to retrain the classification model for new datasets. JIND also learns cell-type-specific confidence thresholds to identify cells that cannot be reliably classified.
Year: 2022 PMID: 35253844 PMCID: PMC9278043 DOI: 10.1093/bioinformatics/btac140
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931
Fig. 1. Overview of JIND. (a) We assume access to a source batch containing the gene expression matrix accompanied by the corresponding cell-types. (b) A neural-network-based prediction model, consisting of an encoder and a classifier, is trained on the source batch. The low-dimensional representation output by the encoder subnetwork is denoted as the latent code. Note that this prediction model should not be used directly to annotate the target batch because of batch effects. (c) JIND uses adversarial training via a generator and discriminator pair to align the source and target latent codes. The discriminator is trained to classify an input latent code either as a latent code produced by the generator (negative label) or as the source latent code produced by the encoder (positive label). In contrast, the generator is trained to fool the discriminator into misclassifying the generator's output as source latent code. Finally, the output of the trained generator (the aligned latent code) is used by the classifier subnetwork to infer the cell-types of the target batch.
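The labeling scheme in panel (c) can be illustrated with a minimal sketch. It assumes the standard binary cross-entropy form of the adversarial objectives; the exact loss used by JIND is not stated here, and the function names and the discriminator outputs below are hypothetical. Each input value is the discriminator's estimated probability that a latent code comes from the source batch.

```python
import math


def bce(p, label):
    """Binary cross-entropy for one predicted probability p and a 0/1 label."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def discriminator_loss(d_on_source, d_on_generated):
    """Source latent codes carry the positive label (1), generator outputs the negative label (0)."""
    losses = [bce(p, 1) for p in d_on_source] + [bce(p, 0) for p in d_on_generated]
    return sum(losses) / len(losses)


def generator_loss(d_on_generated):
    """The generator is rewarded when the discriminator labels its outputs as source (1)."""
    return sum(bce(p, 1) for p in d_on_generated) / len(d_on_generated)


# Hypothetical discriminator outputs, for illustration only.
confident_d = discriminator_loss([0.9, 0.8], [0.1, 0.2])  # discriminator separates the batches
fooled_d = discriminator_loss([0.5, 0.5], [0.5, 0.5])     # discriminator cannot tell them apart
```

Under this assumed objective, a discriminator that outputs 0.5 everywhere incurs a loss of log 2, which corresponds to source and aligned target latent codes being indistinguishable.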
Benchmarking: tabular comparison for (a) batched datasets and (b) non-batched datasets based on four metrics: raw is the initial accuracy of the classifier, rej is the percentage of cells rejected by the classifier (if supported by the method), eff is the effective accuracy after rejecting unconfident predictions, and wf1 is the weighted F1 score based on the predicted probabilities.
| Datasets | Metrics | JIND | JIND+ | Seurat-LT | ItClust | SVMRej (HInt) | SVMRej (SInt) | scPred (HInt) | scPred (SInt) | ACTINN (HInt) | ACTINN (SInt) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Pancreas Bar16-Mur16 | raw | | | 0.870 | 0.945 | 0.959 | 0.932 | 0.856 | 0.914 | 0.955 | 0.932 |
| | rej | 0.05 | 0.02 | — | — | 0.07 | 0.02 | 0.39 | 0.08 | 0.05 | 0.05 |
| | eff | 0.979 | 0.971 | — | — | 0.986 | 0.944 | 0.945 | 0.942 | 0.976 | 0.954 |
| | wf1 | 0.964 | 0.965 | 0.853 | 0.944 | 0.960 | 0.933 | 0.834 | 0.914 | 0.955 | 0.933 |
| Pancreas Bar16-Seg16 | raw | 0.995 | | 0.929 | 0.978 | 0.970 | 0.964 | 0.726 | 0.945 | 0.976 | 0.963 |
| | rej | 0.05 | 0.02 | — | — | 0.10 | 0.02 | 0.43 | 0.11 | 0.06 | 0.04 |
| | eff | 1.000 | 1.000 | — | — | 0.981 | 0.973 | 0.906 | 0.971 | 0.992 | 0.982 |
| | wf1 | 0.995 | 0.997 | 0.915 | 0.977 | 0.968 | 0.962 | 0.661 | 0.942 | 0.975 | 0.960 |
| PBMC 10x_v3-10x_v5 | raw | 0.966 | 0.973 | | 0.969 | 0.939 | 0.962 | 0.341 | 0.946 | 0.943 | 0.965 |
| | rej | 0.08 | 0.06 | — | — | 0.99 | 0.05 | 0.49 | 0.10 | 0.49 | 0.05 |
| | eff | 0.968 | 0.984 | — | — | 1.000 | 0.975 | 0.545 | 0.971 | 0.990 | 0.980 |
| | wf1 | 0.966 | 0.972 | 0.980 | 0.968 | 0.940 | 0.962 | 0.223 | 0.942 | 0.943 | 0.964 |
Note: On batched datasets, for SVMRej, scPred and ACTINN, we report results with batch alignment prior to classification using Seurat (SInt) and Harmony (HInt). Best raw accuracy rates are bold faced and rejection rates above 0.1 are colored red.
Since ItClust is designed to run on raw datasets and the Mouse Atlas and PBMC datasets are already processed, we modified the preprocessing step in ItClust to annotate these datasets.
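To make the relationship between the raw, rej and eff metrics concrete, the following minimal sketch computes them from predicted class probabilities and per-cell-type confidence thresholds. It assumes that a cell is rejected when the probability of its predicted cell-type falls below that cell-type's threshold, and that eff is the accuracy over the cells that are not rejected; the function name, toy probabilities and threshold values are hypothetical and are not taken from JIND. The wf1 metric is omitted for brevity.

```python
def evaluate(probs, labels, thresholds):
    """Compute raw accuracy, rejection rate and effective accuracy.

    probs: per-cell lists of class probabilities.
    labels: true class index for each cell.
    thresholds: per-class confidence thresholds (assumed rejection rule).
    """
    n = len(labels)
    # Predicted class = index of the highest probability.
    preds = [max(range(len(p)), key=p.__getitem__) for p in probs]
    raw = sum(pred == y for pred, y in zip(preds, labels)) / n
    # Keep a cell only if its top probability reaches the threshold of the predicted class.
    kept = [(pred, y) for pred, y, p in zip(preds, labels, probs)
            if p[pred] >= thresholds[pred]]
    rej = 1 - len(kept) / n
    eff = sum(pred == y for pred, y in kept) / len(kept) if kept else float("nan")
    return {"raw": raw, "rej": rej, "eff": eff}


# Toy example with two cell-types and four cells (illustrative values only).
probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.55, 0.45]]
labels = [0, 1, 1, 0]
metrics = evaluate(probs, labels, thresholds=[0.7, 0.7])
print(metrics)  # {'raw': 0.75, 'rej': 0.5, 'eff': 1.0}
```

In this toy case the two low-confidence cells are rejected, one of which was misclassified, so the effective accuracy rises above the raw accuracy at the cost of a 0.5 rejection rate.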
Fig. 2. Performance evaluation and differential expression analysis on two datasets. The alluvial plots (left) reflect the performance of JIND+ on (a) PBMC 10x_v3-10x_v5 and (b) Pancreas Bar16-Mur16 datasets. The tSNE plots (middle) illustrate the cell-type clusters of the target batch, and highlight the two cell-types with the highest misclassification rates: (a) Monocyte_FCGR3A and Monocyte_CD14 and (b) Acinar and Ductal. The heatmaps (right) show the top 25 differentially expressed genes between (a) Monocyte_FCGR3A cells classified as Monocyte_FCGR3A (G1) and Monocyte_FCGR3A cells classified as Monocyte_CD14 (G2), and between (b) Ductal cells classified as Ductal (G1) and Ductal cells classified as Acinar (G2). The hierarchical clustering shown is performed using all the differentially expressed genes.