
Palo: Spatially-aware color palette optimization for single-cell and spatial data.

Abstract

SUMMARY: In the exploratory data analysis of single-cell or spatial genomic data, single cells or spatial spots are often visualized using a two-dimensional plot where cell clusters or spot clusters are marked with different colors. With tens of clusters, current visualization methods often assign visually similar colors to spatially neighboring clusters, making it hard to distinguish between clusters. To address this issue, we developed Palo, which optimizes the color palette assignment for single-cell and spatial data in a spatially aware manner. Palo identifies pairs of clusters that are spatially neighboring and assigns visually distinct colors to those neighboring pairs. We demonstrate that Palo leads to improved visualization in real single-cell and spatial genomic datasets.

AVAILABILITY: The Palo R package is freely available on GitHub (https://github.com/Winnie09/Palo) and Zenodo (https://doi.org/10.5281/zenodo.6562505).


Year: 2022 PMID: 35642896 PMCID: PMC9272793 DOI: 10.1093/bioinformatics/btac368

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931

1 Introduction

Data visualization is a key step in exploring the underlying structure of single-cell and spatial genomic data. For single-cell sequencing data [e.g. single-cell RNA-seq (Tang et al., 2009)], cells are commonly projected into a low-dimensional space using methods such as Uniform Manifold Approximation and Projection (UMAP, Becht et al.) or t-Distributed Stochastic Neighbor Embedding (t-SNE, Van der Maaten and Hinton, 2008) and visualized by a 2-D scatterplot where the two axes represent two reduced dimensions. Cells with the same cell type or cluster are shown with the same color. For spatial transcriptomics data (Ståhl et al., 2016), spatial spots are visualized by a 2-D spatial map where the two axes represent the two spatial coordinates of the tissue slide. Similarly, spots with the same cluster are shown with the same color. The visualization guides downstream analyses such as cell type identification (Abdelaal et al.) and trajectory reconstruction (Hou et al.; Ji and Ji, 2016; Trapnell et al., 2014).

In many cases, cells or spots are grouped into tens of clusters to reflect their heterogeneity, so tens of different colors are needed to visualize the clusters. A palette of that size inevitably contains similar colors that are hard for human eyes to differentiate. As existing methods [e.g. ggplot2 (Wickham, 2016)] assign colors to clusters either alphabetically or in a random order, it is highly likely that some spatially neighboring clusters are assigned similar colors.

Figure 1A shows an example of visualizing a single-cell RNA-seq dataset with different T cell subsets (Caushi et al., 2021). UMAP coordinates were obtained using the standard Seurat (Stuart et al., 2019) pipeline and cell type information was taken from the original publication. The geom_point() function in the ggplot2 R package (Wickham, 2016) was used to generate the plot with the default color palette and settings. Multiple neighboring clusters [e.g. CD4-Treg and CD4-Tfh(2)] share similar colors that are hard to differentiate. This problem cannot be solved by randomly permuting and reassigning colors to clusters (Fig. 1B). Figure 1C shows an example of visualizing Visium spatial transcriptomics data of a mouse brain (10X Genomics, 2020). Spot clusters were obtained using the standard Seurat (Stuart et al., 2019) pipeline. The plot was generated using the SpatialDimPlot() function in the Seurat R package (Stuart et al., 2019) with the default color palette and settings. Similarly, there are neighboring clusters (e.g. clusters 8, 9, 10) that share similar colors and are not visually distinct. Randomly permuting and reassigning the color palette cannot resolve the issue (Fig. 1D).

Fig. 1.

Visualization of single-cell RNA-seq data with the default ggplot2 palette (A) or a randomly permuted palette (B). Neighboring clusters with visually similar colors are circled. Visualization of spatial transcriptomics data with the default ggplot2 palette (C) or a randomly permuted palette (D). Neighboring clusters with visually similar colors are circled. (E) Schematic of Palo. (F) Visualization of single-cell RNA-seq data with the Palo palette. (G) Visualization of spatial transcriptomics data with the Palo palette.

This visualization issue may create false impressions of cell type abundances or spatial interactions between spot clusters. It cannot be directly addressed by existing visualization methods such as ASAP (Gardeux et al., 2017), dittoSeq (Bunis et al., 2020), SPRING (Weinreb et al., 2018) and SCUBI (Hou and Ji, 2022), which focus on other aspects of visualization. A simple solution is to manually exchange the colors assigned to different cell clusters multiple times. However, this manual process is tedious and time-consuming when there are many colors to be exchanged or when each cell cluster is spatially close to numerous other clusters. In addition, the manual process does not fit into automatic analysis pipelines and cannot efficiently handle a large number of samples or datasets.

To address this issue, we developed Palo to optimize the color palette assignments to cell or spot clusters in a spatially aware manner. Palo first calculates the spatial overlap score between each pair of clusters. It then identifies a color palette that assigns visually distinct colors to cluster pairs with high spatial overlap scores (Fig. 1E). We applied Palo to both the single-cell RNA-seq dataset (Fig. 1F) and the spatial dataset (Fig. 1G). The results show that Palo resolves the visualization issue: spatially neighboring clusters are assigned visually distinct colors. The palette optimized by Palo improves the visualization and identification of boundaries between spatially neighboring clusters.

2 Materials and methods

The inputs to Palo are (i) the 2-D coordinates of cells or spots; (ii) a vector indicating clusters of the cells or spots; (iii) a vector of user-defined colors. For single-cell genomic data, the coordinates are usually obtained by dimension reduction. For spatial data, the coordinates are the spatial locations of spots in a tissue slide. The output of Palo is the optimized permutation of the user-defined input color vector assigned to the clusters. The Palo method consists of the following steps.

Step 1: for each cluster, a 2-D kernel density function [MASS::kde2d() in R] with 100 × 100 grid points is fitted using the 2-D coordinates of all cells or spots in the cluster.

Step 2: for each cluster, all grid points with density values larger than a cutoff are treated as the hot grid points. To identify the cutoff, the cluster labels for all cells or spots are randomly permuted once, and the 2-D kernel density function is refitted for each permuted cluster. For each cluster, the cutoff is the 95th percentile of the density values across all grid points obtained in the permutation.

Step 3: for a pair of clusters $a$ and $b$, an overlap score is calculated as the Jaccard index $J_{ab} = |S_a \cap S_b| / |S_a \cup S_b|$, where $S_a$ and $S_b$ are the sets of hot grid points of $a$ and $b$, respectively.

Step 4: for a pair of colors $e$ and $f$, the color dissimilarity $D_{ef}$ is defined as the Euclidean distance between the red, green, and blue (RGB) values of the two colors. Different weights can be specified for each of the RGB channels to better match how human eyes perceive the actual colors. For colorblind-friendly visualizations, Palo can also convert the colors to simulate how they are perceived by people with color blindness, and the RGB distances are then calculated with the converted colors.

Step 5: let $P$ be a permutation of the user-defined color vector and $P_k$ be the color assigned to cluster $k$. A color score is defined as $\sum_{i<j} J_{ij} D_{P_i P_j}$, where $1 \le i < j \le C$ and $C$ is the total number of clusters. Palo finds the $P$ that maximizes the color score. To do that, Palo first randomly permutes the user-defined color vector multiple times (1000 times by default) and takes the permutation with the highest color score as the initial permutation. It then fine-tunes the permutation by repeatedly exchanging colors between a pair of randomly selected clusters. If the exchange results in an increased color score, the exchange is kept. The exchange is repeated multiple times (2000 times by default). An early stopping rule ends the exchange process when the color score remains unchanged for a number of consecutive exchanges (500 by default). Supplementary Figure S1 shows how the color score changes with iterations for the two datasets analyzed in this study.
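
Steps 1 to 3 can be sketched in a few lines of R. This is a minimal illustration under stated assumptions, not the code of the Palo package: the toy data, the helper cluster_density() and the use of one shared set of grid limits for all clusters are choices made here so that grid points are comparable across clusters.

```r
library(MASS)  # provides kde2d()

set.seed(1)
# Toy data (hypothetical): three 2-D clusters, A and B adjacent, C far away.
n <- 200
position <- rbind(
  cbind(rnorm(n, 0, 1),   rnorm(n, 0, 1)),   # cluster A
  cbind(rnorm(n, 1.5, 1), rnorm(n, 0, 1)),   # cluster B, overlaps A
  cbind(rnorm(n, 10, 1),  rnorm(n, 10, 1))   # cluster C
)
cluster <- rep(c("A", "B", "C"), each = n)

# Assumption: one common grid for all clusters.
lims <- c(range(position[, 1]), range(position[, 2]))

# Step 1: 100 x 100 kernel density per cluster, one column per cluster.
cluster_density <- function(labels) {
  sapply(sort(unique(labels)), function(k) {
    idx <- labels == k
    as.vector(kde2d(position[idx, 1], position[idx, 2],
                    n = 100, lims = lims)$z)
  })
}
dens <- cluster_density(cluster)

# Step 2: per-cluster cutoff = 95th percentile of the densities obtained
# after permuting the cluster labels once.
perm_dens <- cluster_density(sample(cluster))
cutoff <- apply(perm_dens, 2, quantile, probs = 0.95)
hot <- sweep(dens, 2, cutoff, ">") * 1   # 1 marks a hot grid point

# Step 3: Jaccard index of the hot grid points for every pair of clusters.
inter <- crossprod(hot)                  # sizes of pairwise intersections
size <- diag(inter)                      # number of hot points per cluster
uni <- outer(size, size, "+") - inter    # sizes of pairwise unions
jaccard <- inter / uni
diag(jaccard) <- 0
round(jaccard, 2)
```

In this toy setting the adjacent clusters A and B receive a positive overlap score, while the distant cluster C shares no hot grid points with either of them.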

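The color score and the two-stage search (random initialization followed by pairwise exchanges with early stopping) can be sketched as follows. This is an illustrative reading of Steps 4 and 5, not the package source: the overlap matrix, the four-color palette and the equal RGB weights in rgb_weight are hypothetical inputs, and the score is summed over unordered cluster pairs.

```r
set.seed(1)
# Hypothetical overlap scores: c1/c2 are neighbors, c3/c4 are neighbors.
ids <- paste0("c", 1:4)
jaccard <- matrix(0, 4, 4, dimnames = list(ids, ids))
jaccard["c1", "c2"] <- jaccard["c2", "c1"] <- 0.6
jaccard["c3", "c4"] <- jaccard["c4", "c3"] <- 0.5
palette <- c("#FF0000", "#EE1111", "#0000FF", "#1111EE")  # two reds, two blues

# Step 4: weighted Euclidean distance between RGB values.
rgb_weight <- c(1, 1, 1)   # placeholder weights (assumption)
rgb_mat <- t(col2rgb(palette))
color_dist <- as.matrix(dist(sweep(rgb_mat, 2, sqrt(rgb_weight), "*")))

# Step 5: perm[k] is the index of the color given to cluster k.
# Dividing by 2 counts each unordered cluster pair once.
color_score <- function(perm) sum(jaccard * color_dist[perm, perm]) / 2

init_iter <- 1000; refine_iter <- 2000; early_stop <- 500
C <- nrow(jaccard)

# Initialization: best of init_iter random permutations.
inits <- replicate(init_iter, sample(C))          # C x init_iter matrix
best <- inits[, which.max(apply(inits, 2, color_score))]
best_score <- color_score(best)

# Refinement: exchange the colors of two random clusters, keep improvements,
# stop early after early_stop consecutive exchanges without improvement.
unchanged <- 0
for (iter in seq_len(refine_iter)) {
  swap <- sample(C, 2)
  cand <- best
  cand[swap] <- cand[rev(swap)]
  cand_score <- color_score(cand)
  if (cand_score > best_score) {
    best <- cand; best_score <- cand_score; unchanged <- 0
  } else {
    unchanged <- unchanged + 1
    if (unchanged >= early_stop) break
  }
}

pal <- setNames(palette[best], rownames(jaccard))
pal
```

With these inputs the search gives each pair of neighboring clusters one red and one blue, so that the two near-identical reds never end up on adjacent clusters.
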
3 Implementations

Palo is implemented as an open-source R package. The package has one function, Palo(), that performs the color palette optimization. The following R command runs Palo:

    pal <- Palo(position, cluster, palette)

Here, position is a cell by reduced dimension coordinate matrix with two columns (single-cell data) or a spot by spatial coordinate matrix with two columns (spatial transcriptomics data); cluster is a vector of cell or spot clusters; and palette is a user-defined color vector. The output pal is a named vector containing the optimized color palette, which can be passed directly to other plotting functions in R.

For ggplot2:

    ggplot(…) + geom_point() + scale_color_manual(values = pal)

For spatial maps in Seurat:

    SpatialDimPlot(…) + scale_fill_manual(values = pal)
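
Because the output is a named vector, colors are matched to clusters by name rather than by position. The short sketch below shows that lookup outside ggplot2; the cluster labels and the vector pal are hypothetical stand-ins for real data and for the value returned by Palo().

```r
# Hypothetical stand-ins for real cluster labels and for the output of Palo().
cluster <- c("B cell", "T cell", "B cell", "NK", "T cell")
pal <- c("T cell" = "#E41A1C", "B cell" = "#377EB8", "NK" = "#4DAF4A")

# Every cluster label needs an entry in the named palette.
stopifnot(all(unique(cluster) %in% names(pal)))

# Index the palette by label to obtain one color per cell or spot.
point_col <- unname(pal[as.character(cluster)])
point_col
```

A vector such as point_col can then be passed as the col argument of base-graphics plotting functions, for example plot(position, col = point_col, pch = 16).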

Funding

Z.J. was supported by the National Institutes of Health [1U54AG075936-01]. W.H. was supported by the National Institutes of Health [1K99HG011468]. W.H. would like to acknowledge Dr. Hongkai Ji, Dr. Stephanie C. Hicks and Dr. Andrew P. Feinberg for their mentorship.

Conflict of Interest: none declared.

Data availability

The T cell single-cell RNA-seq dataset was obtained from Gene Expression Omnibus (GSE176022). The spatial transcriptomics dataset was obtained from 10X Genomics website (https://www.10xgenomics.com/resources/datasets/mouse-brain-serial-section-1-sagittal-anterior-1-standard-1-1-0).

References

1. Comprehensive Integration of Single-Cell Data.

Authors: Tim Stuart; Andrew Butler; Paul Hoffman; Christoph Hafemeister; Efthymia Papalexi; William M Mauck; Yuhan Hao; Marlon Stoeckius; Peter Smibert; Rahul Satija
Journal: Cell Date: 2019-06-06 Impact factor: 41.582

2. Visualization and analysis of gene expression in tissue sections by spatial transcriptomics.

Authors: Patrik L Ståhl; Fredrik Salmén; Sanja Vickovic; Anna Lundmark; José Fernández Navarro; Jens Magnusson; Stefania Giacomello; Michaela Asp; Jakub O Westholm; Mikael Huss; Annelie Mollbrink; Sten Linnarsson; Simone Codeluppi; Åke Borg; Fredrik Pontén; Paul Igor Costea; Pelin Sahlén; Jan Mulder; Olaf Bergmann; Joakim Lundeberg; Jonas Frisén
Journal: Science Date: 2016-07-01 Impact factor: 47.728

3. TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis.

Authors: Zhicheng Ji; Hongkai Ji
Journal: Nucleic Acids Res Date: 2016-05-13 Impact factor: 16.971

4. mRNA-Seq whole-transcriptome analysis of a single cell.

Authors: Fuchou Tang; Catalin Barbacioru; Yangzhou Wang; Ellen Nordman; Clarence Lee; Nanlan Xu; Xiaohui Wang; John Bodeau; Brian B Tuch; Asim Siddiqui; Kaiqin Lao; M Azim Surani
Journal: Nat Methods Date: 2009-04-06 Impact factor: 28.547

5. dittoSeq: Universal User-Friendly Single-Cell and Bulk RNA Sequencing Visualization Toolkit.

Authors: Daniel G Bunis; Jared Andrews; Gabriela K Fragiadakis; Trevor D Burt; Marina Sirota
Journal: Bioinformatics Date: 2020-12-12 Impact factor: 6.937

6. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells.

Authors: Cole Trapnell; Davide Cacchiarelli; Jonna Grimsby; Prapti Pokharel; Shuqiang Li; Michael Morse; Niall J Lennon; Kenneth J Livak; Tarjei S Mikkelsen; John L Rinn
Journal: Nat Biotechnol Date: 2014-03-23 Impact factor: 54.908

7. ASAP: a web-based platform for the analysis and interactive visualization of single-cell RNA-seq data.

Authors: Vincent Gardeux; Fabrice P A David; Adrian Shajkofci; Petra C Schwalie; Bart Deplancke
Journal: Bioinformatics Date: 2017-10-01 Impact factor: 6.937

8. SPRING: a kinetic interface for visualizing high dimensional single-cell expression data.

Authors: Caleb Weinreb; Samuel Wolock; Allon M Klein
Journal: Bioinformatics Date: 2018-04-01 Impact factor: 6.937

9. Unbiased visualization of single-cell genomic data with SCUBI.

Authors: Wenpin Hou; Zhicheng Ji
Journal: Cell Rep Methods Date: 2022-01-04

10. Transcriptional programs of neoantigen-specific TIL in anti-PD-1-treated lung cancers.

Authors: Justina X Caushi; Jiajia Zhang; Zhicheng Ji; Ajay Vaghasia; Boyang Zhang; Emily Han-Chung Hsiue; Brian J Mog; Wenpin Hou; Sune Justesen; Richard Blosser; Ada Tam; Valsamo Anagnostou; Tricia R Cottrell; Haidan Guo; Hok Yee Chan; Dipika Singh; Sampriti Thapa; Arbor G Dykema; Poromendro Burman; Begum Choudhury; Luis Aparicio; Laurene S Cheung; Mara Lanis; Zineb Belcaid; Margueritta El Asmar; Peter B Illei; Rulin Wang; Jennifer Meyers; Kornel Schuebel; Anuj Gupta; Alyza Skaist; Sarah Wheelan; Jarushka Naidoo; Kristen A Marrone; Malcolm Brock; Jinny Ha; Errol L Bush; Bernard J Park; Matthew Bott; David R Jones; Joshua E Reuss; Victor E Velculescu; Jamie E Chaft; Kenneth W Kinzler; Shibin Zhou; Bert Vogelstein; Janis M Taube; Matthew D Hellmann; Julie R Brahmer; Taha Merghoub; Patrick M Forde; Srinivasan Yegnasubramanian; Hongkai Ji; Drew M Pardoll; Kellie N Smith
Journal: Nature Date: 2021-07-21 Impact factor: 49.962