Aparna Bhaduri, Tomasz J Nowakowski, Alex A Pollen, Arnold R Kriegstein.
Abstract
BACKGROUND: High-throughput methods for profiling the transcriptomes of single cells have recently emerged as transformative approaches for large-scale population surveys of cellular diversity in heterogeneous primary tissues. However, the efficient generation of such atlases will depend on sufficient sampling of diverse cell types while remaining cost-effective to enable a comprehensive examination of organs, developmental stages, and individuals.
Keywords: Bioinformatics; Cell atlas studies; Downsampling; Single-cell analysis
Year: 2018 PMID: 30309354 PMCID: PMC6180488 DOI: 10.1186/s12915-018-0580-x
Source DB: PubMed Journal: BMC Biol ISSN: 1741-7007 Impact factor: 7.431
Fig. 1 Downsampling of cell number preserves major cell type distinctions. a t-SNE plots of the full dataset and five smaller downsampled subsets. Each dataset is shown in the t-SNE space of the full dataset. Clustering was performed independently in every subset. b Cluster preservation is a key metric to evaluate similarities and differences between clusters from different analyses, measuring preservation as the fraction of the original cluster that remains in analyzed subsets. The diagram depicts a simplified cluster preservation calculation (see also the “Methods” section). c Cluster preservation represents the best instance of the fraction of a cluster that is represented during downsampling. Nine original subsets and a total of 56 data points are represented; the cell number is shown on a log2(number of cells) scale for ease of interpretation
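The preservation calculation described in panels b and c can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `cluster_preservation`, the dictionary inputs, and the choice of denominator (only the cells of an original cluster that survive downsampling) are assumptions, since the exact definition is given in the paper's Methods.

```python
from collections import Counter, defaultdict


def cluster_preservation(original, subset):
    """Return {original_cluster: best fraction kept together in one subset cluster}.

    original: dict mapping cell id -> cluster label in the full analysis
    subset:   dict mapping cell id -> cluster label in the downsampled analysis
              (only cells that survived downsampling appear here)
    """
    # For each original cluster, count how its surviving cells are
    # distributed across the independently derived subset clusters.
    overlap = defaultdict(Counter)
    for cell, sub_label in subset.items():
        overlap[original[cell]][sub_label] += 1

    scores = {}
    for orig_label, counts in overlap.items():
        total = sum(counts.values())  # assumed denominator: surviving cells
        # "Best instance": the subset cluster holding the largest share.
        scores[orig_label] = max(counts.values()) / total
    return scores


# Toy data (hypothetical): six cells, five of which survive downsampling.
original = {"c1": "A", "c2": "A", "c3": "A", "c4": "A", "c5": "B", "c6": "B"}
subset = {"c1": "x", "c2": "x", "c3": "y", "c5": "y", "c6": "y"}
scores = cluster_preservation(original, subset)
```

In this toy case, cluster B stays together in subset cluster y, while cluster A is divided between x and y, so A scores lower than B.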
Fig. 2 Downsampling of cell complexity preserves major cell type distinctions. a Cell complexity is calculated in the PCA space of the largest reference cell set analyzed. A hierarchical tree of clusters is calculated for each subset in the PCA space, and the total distance between the branches defines the cell complexity (see also the “Methods” section). b Cell complexity downsampling was performed by selecting branches of a larger tree with varied cell numbers and distances between groups. c Plot of complexity versus cell preservation. Each dot represents a point from 9 original subsets, and a total of 56 datasets are analyzed. Log2 (cell diversity index) is used to make the dots at lower cell diversity numbers easier to interpret. d Number of clusters derived from subset analyses as a function of cell complexity. The graph begins to plateau at a cell complexity of ~ 100,000, suggesting there is a maximal number of clusters that can be derived from a sample even as cell number and complexity increase. e Complexity calculated by cell class annotations shows that neurons are the most complex of the cell types retrieved
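One way to read the complexity definition in panel a is as the summed merge distances of a hierarchical tree built over cluster centroids in PCA space. The sketch below illustrates that reading only. The centroid linkage rule, the function names `centroid` and `complexity_index`, and the input layout are all assumptions; the authors' tree construction is defined in their Methods and may differ.

```python
import math


def centroid(points):
    """Mean coordinate of a list of equal-length coordinate tuples."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def complexity_index(cells_by_cluster):
    """Sum of merge distances of an agglomerative tree over cluster centroids.

    cells_by_cluster: dict mapping cluster label -> list of PCA coordinates
    """
    # Each node is (centroid, number of cells); start with one node per cluster.
    nodes = [(centroid(pts), len(pts)) for pts in cells_by_cluster.values()]
    total = 0.0
    while len(nodes) > 1:
        # Find the closest pair of nodes (assumed rule: centroid linkage).
        i, j = min(
            ((a, b) for a in range(len(nodes)) for b in range(a + 1, len(nodes))),
            key=lambda ab: math.dist(nodes[ab[0]][0], nodes[ab[1]][0]),
        )
        (ci, ni), (cj, nj) = nodes[i], nodes[j]
        total += math.dist(ci, cj)  # branch distance contributed by this merge
        # Size-weighted centroid of the merged node.
        merged = tuple((x * ni + y * nj) / (ni + nj) for x, y in zip(ci, cj))
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)]
        nodes.append((merged, ni + nj))
    return total


# Toy data (hypothetical): three single-cell clusters on a line in 2-D PCA space.
demo = {"A": [(0, 0)], "B": [(1, 0)], "C": [(5, 0)]}
index = complexity_index(demo)
```

Under this reading, a subset containing widely separated clusters yields a larger index than one containing the same number of closely related clusters, which matches the intent of separating complexity from raw cell number.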
Fig. 3 Cluster conservation from downsampled datasets. a Cluster conservation is an alternative metric to evaluate similarities and differences between clusters from different analyses, measuring conservation as the fraction of the subset cluster that originates from the same original cluster. The diagram depicts a simplified cluster conservation calculation (see also the “Methods” section). b Cluster conservation as a function of cell number. Points are averaged within a sample from 56 downsampled subsets. c Cluster conservation as a function of complexity index. Points are averaged within a sample from 56 downsampled subsets. d When grouping clusters by cell type, cluster conservation is nearly perfect for most cell types. e The split of a single cluster can be measured by counting the number of clusters that share ≥ 1 cell with either the original or subset cluster, as depicted in the diagram. f Cluster split number of subset clusters as a function of complexity index, divided by cell type. Again, a plateau can be seen regardless of cell type at ~ 100,000. More complex cell types are split more, but complexity rather than cell type appears to indicate the number of splits that may occur
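The conservation metric in panel a and the split count in panel e can be sketched from the same overlap table. This is an illustrative sketch under stated assumptions: the function name `conservation_and_split` and the dictionary inputs are hypothetical, and the split is computed here only from the subset-cluster side (number of original clusters sharing at least one cell with each subset cluster).

```python
from collections import Counter, defaultdict


def conservation_and_split(original, subset):
    """Return (conservation, split), each keyed by subset cluster label.

    original: dict mapping cell id -> cluster label in the full analysis
    subset:   dict mapping cell id -> cluster label in the downsampled analysis
    """
    # For each subset cluster, count which original clusters its cells came from.
    origins = defaultdict(Counter)
    for cell, sub_label in subset.items():
        origins[sub_label][original[cell]] += 1

    # Conservation: largest share of a subset cluster drawn from one original cluster.
    conservation = {
        s: max(c.values()) / sum(c.values()) for s, c in origins.items()
    }
    # Split: number of original clusters sharing >= 1 cell with the subset cluster.
    split = {s: len(c) for s, c in origins.items()}
    return conservation, split


# Toy data (hypothetical): subset cluster "y" mixes cells from two original clusters.
original = {"c1": "A", "c2": "A", "c3": "A", "c5": "B", "c6": "B"}
subset = {"c1": "x", "c2": "x", "c3": "y", "c5": "y", "c6": "y"}
conservation, split = conservation_and_split(original, subset)
```

Swapping the roles of the two label dictionaries gives the split count from the original-cluster side.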
Fig. 4 Downsampling of Cajal-Retzius cells. a t-SNE plot depicting the iterative clustering result of all 20,550 Cajal-Retzius (CR) cells from the full dataset. b Regional origin is a well-studied classifier of CR subtypes, and two of these markers feature prominently in the iteratively clustered dataset: Foxg1 is enriched in three clusters while Lhx9 is enriched in seven clusters. c Violin plots of regional markers in the full dataset and CR subsets of downsampled datasets indicate that these markers are enriched in one or more clusters until 1/24 of the dataset is sampled, after which Foxg1 enrichment is diluted across multiple clusters. Lhx9 enrichment is conserved down to even the smallest downsampled subset. One subset for each downsampling is used. d Enrichment metrics of CR cells in the context of previously shown metrics indicate that, informatically, saturation of this cell type has not yet been achieved. e Framework to evaluate whether technical saturation has been achieved. f Examination of R² values when incrementally decreasing the maximum number of cells used in the analysis shows that a plateau emerges around an R² value of 0.6