Adam Chan, Wei Jiang, Emily Blyth, Jean Yang, Ellis Patrick.
Abstract
High-throughput single-cell technologies hold the promise of discovering novel cellular relationships with disease. However, analytical workflows constructed for these technologies to associate cell proportions with disease often employ unsupervised clustering techniques that overlook the valuable hierarchical structures that have been used to define cell types. We present treekoR, a framework that empirically recapitulates these structures, facilitating multiple quantifications and comparisons of cell type proportions. Our results from twelve case studies reinforce the importance of quantifying proportions relative to parent populations in the analysis of cytometry data, as failing to do so can lead to missing important biological insights.
Year: 2021 PMID: 34844647 PMCID: PMC8628061 DOI: 10.1186/s13059-021-02526-5
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 treekoR helps to extract insight from cytometry data by deriving a hierarchy of cell clusters and measuring proportions to parent. a An example t-SNE plot showing clustering of single-cell data. b Hierarchical tree constructed using the HOPACH algorithm on the cluster median marker expressions. c Definition of proportions to parent and proportions to all, defined according to the organization of the hierarchical tree. d Significance testing is performed using both types of proportions, testing for differences across the patient clinical endpoint of interest. e Visualization of the significance testing results. On the left, a scatterplot of each node in the hierarchical tree with the test statistic calculated using %total (x-axis) vs. the test statistic calculated using %parent (y-axis). To the right of the scatterplot, the hierarchical tree is colored with the test statistics: the nodes are colored by the test statistic using %total and the branches of the nodes by the test statistic using %parent. An example of a corresponding node between the two graphs is highlighted in blue. The heatmap plots the median marker expression of the leaf nodes to assist in identifying the corresponding cell clusters
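The two quantities defined in panel c can be illustrated with a minimal sketch. It is not the treekoR implementation: the tree, the cluster names, and the cell counts below are hypothetical, and the only assumption is that %total divides a node's cell count by all cells in the sample while %parent divides it by the count of the node's immediate parent in the tree.

```python
from collections import defaultdict

# Hypothetical hierarchy: child -> parent (None marks the root).
parent_of = {
    "root": None,
    "T cells": "root",
    "B cells": "root",
    "CD4 T": "T cells",
    "CD8 T": "T cells",
}

# Hypothetical cell counts per leaf cluster for a single sample.
leaf_counts = {"CD4 T": 300, "CD8 T": 100, "B cells": 600}


def node_counts(parent_of, leaf_counts):
    """Propagate leaf counts up the tree so each node holds the cells beneath it."""
    totals = defaultdict(int)
    for leaf, n in leaf_counts.items():
        node = leaf
        while node is not None:
            totals[node] += n
            node = parent_of[node]
    return dict(totals)


def proportions(parent_of, leaf_counts):
    """Return %total and %parent for every non-root node."""
    totals = node_counts(parent_of, leaf_counts)
    n_all = sum(leaf_counts.values())
    out = {}
    for node, parent in parent_of.items():
        if parent is None:
            continue  # the root has no parent to be measured against
        n_node = totals.get(node, 0)
        n_parent = totals.get(parent, 0)
        out[node] = {
            "pct_total": 100.0 * n_node / n_all if n_all else float("nan"),
            "pct_parent": 100.0 * n_node / n_parent if n_parent else float("nan"),
        }
    return out


props = proportions(parent_of, leaf_counts)
for name, p in props.items():
    print(f"{name}: %total={p['pct_total']:.1f}, %parent={p['pct_parent']:.1f}")
```

In this toy sample the CD8 T cluster is 10% of all cells but 25% of its parent T cell population, which is the kind of difference the %parent measure is meant to expose.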
Fig. 2 Measuring %parent can provide additional insight over %total. a Scatterplot of test statistics for the cell clusters when differentiating patients with and without latent CMV infection. Highlighted clusters are significant using %parent but not using %total. b Comparative boxplot of the proportions of the highlighted cell clusters in patients with and without CMV, using %total (upper panel) and %parent (lower panel). c A heatmap generated using treekoR on a CD8+ T cell compartment to predict healthy vs. COVID-19, containing a hierarchical tree of cell clusters colored by the test statistic using the corresponding %total (nodes) and %parent (branches). The heatmap is colored by the scaled cluster median expression values that characterize the leaf nodes in the tree. d Scatterplot of the test statistics of each cell cluster, with the test statistic from %total (x-axis) vs. the test statistic from %parent (y-axis). The highlighted HLA-DR+ CD38+ cluster has a larger test statistic when differentiating between COVID-19 patients and healthy controls using %parent than using %total. e Comparison of the −log10 p values of an HLA-DR+ CD38+ subset for %total, %parent, and manually gated proportions from De Biasi et al., from a t-test (pink) and a Wilcoxon test (green). f Comparative boxplot of an HLA-DR+ CD38+ subset, with %total (upper left panel), %parent (lower panel), and manually gated proportions (upper right panel), between COVID-19 patients and healthy controls
Benchmark datasets. Eleven published datasets were used to compare %total and %parent in significance testing and classification using the treekoR workflow. “Name” is used to refer to each dataset throughout the manuscript
| Name | Technology | Description | Number of cells | Number of samples | Outcome or response variable | References |
|---|---|---|---|---|---|---|
| Age chronic | CyTOF | Age chronic inflammation predicting young vs old | 1036209 | 29 | Young / old | Shen-Orr et al. 2016; ImmPort |
| Anti-CTLA-4 and anti-PD-1 | CyTOF | Predicting response vs non-response in anti-CTLA-4 and anti-PD-1 treatments | 7264780 | 24 | Response / non-response to treatment | Subrahmanyam et al. 2018 |
| Anti-PD-1 | CyTOF | Predicting response vs non-response in anti-PD-1 treatment | 85718 | 20 | Response / non-response to treatment | Krieg et al. 2018 |
| BCR-XL-sim | CyTOF | Detecting samples with stimulated B cells | 88435 | 16 | Spiked / non-spiked | Weber et al. 2019 |
| Breast cancer tumor | CyTOF | Predicting tumor in breast cancer samples | 855914 | 194 | Tumor / non-tumor breast cancer samples | Wagner et al. 2019 |
| CMV | CyTOF | Predicting positive vs negative CMV titer results in influenza patients | 18153877 | 69 | Positive / negative results from CMV titer | Tomic et al. 2019; ImmPort |
| COVID-19 whole blood CyTOF | CyTOF | Profiling whole blood to predict COVID-19 vs. healthy patients | 4747543 | 21 | COVID-19 / healthy control | Geanon et al. 2021 |
| COVID-19 PBMCs | Flow cytometry | Predicting between ICU vs. hospital ward COVID-19 patients | 4790053 | 38 | ICU / ward | Humblet-Baron et al. 2021 |
| COVID-19 PBMC CD8+ non-naive T cells | Flow cytometry | Profile of CD8+ non-naive T cells to distinguish recovered from COVID-19 vs. healthy | 11591741 (60% of cells were sampled and analyzed) | 168 | COVID-19 recovered / healthy | Mathew et al. 2020 |
| COVID-19 T cells | Flow cytometry | T cell compartment samples (CD4 and CD8) to predict healthy vs COVID-19 | 5000 | 31 | COVID-19 / healthy control | De Biasi et al. 2020 |
| Melanoma | scRNA-seq | Predicting response to checkpoint immunotherapy in melanoma | 5928 | 19 | Responder / non-responder | Sade-Feldman et al. 2019 |
Fig. 3 treekoR provides stronger associations with patient clinical outcomes. Cell clusters, constructed using both average-linkage hierarchical clustering and HOPACH, were tested between patient conditions using %total and %parent. Q-Q plots were drawn for each dataset by plotting the ordered negative log p values obtained using %total (x-axis) against those obtained using %parent (y-axis)
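The Q-Q construction described in this caption can be sketched as follows. The p values are made up for illustration, the function name `qq_pairs` is hypothetical, and base-10 logarithms are an assumption (the caption says only "negative log"). Each list of p values is transformed to −log10, sorted independently, and paired by rank; points above the diagonal indicate stronger evidence under %parent.

```python
import math


def qq_pairs(p_total, p_parent):
    """Pair the ordered -log10 p values from %total (x) and %parent (y)."""
    if len(p_total) != len(p_parent):
        raise ValueError("both lists must cover the same set of clusters")
    x = sorted(-math.log10(p) for p in p_total)
    y = sorted(-math.log10(p) for p in p_parent)
    return list(zip(x, y))


# Hypothetical p values for three clusters under each proportion type.
p_total = [0.5, 0.1, 0.01]
p_parent = [0.05, 0.001, 0.2]

pairs = qq_pairs(p_total, p_parent)
above_diagonal = sum(1 for x, y in pairs if y > x)
print(f"{above_diagonal} of {len(pairs)} points lie above the diagonal")
```

Because each axis is sorted separately, a point on the plot does not correspond to a single cluster; the plot compares the two distributions of evidence rather than cluster-by-cluster results.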
Fig. 4 Measuring %parent offers improvements in patient classification performance. a Comparative boxplots (lower panel) of balanced accuracy rates for each dataset and feature set: %total, %parent using average-linkage hierarchical clustering, and %parent using HOPACH. Values plotted are from a 5-fold CV with 20 repetitions, averaged across each repetition. The rank of each feature set within each dataset is shown in the bubble plot (upper panel), with rank 1 being the best (highest mean / lowest variance) and rank 3 being the worst (lowest mean / highest variance)
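For readers unfamiliar with the metric, balanced accuracy is the mean of the per-class recalls, which keeps a majority class from dominating the score. The sketch below uses hypothetical labels and predictions and a hypothetical helper name; it shows only the metric, not the classifier or the repeated cross-validation scheme used for the figure.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical fold: four responders (R) and two non-responders (NR).
y_true = ["R", "R", "R", "R", "NR", "NR"]
y_pred = ["R", "R", "R", "NR", "NR", "R"]

score = balanced_accuracy(y_true, y_pred)
print(f"balanced accuracy = {score:.3f}")
```

Here recall is 3/4 for responders and 1/2 for non-responders, giving 0.625, whereas plain accuracy would be 4/6.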