Lei Li1,2, Linda Yu-Ling Lan1,2, Lei Huang3, Congting Ye4, Jorge Andrade3,5, Patrick C Wilson1,2.
Abstract
Rapid growth of single-cell sequencing techniques enables researchers to investigate up to millions of cells with diverse properties in a single experiment. It also makes it challenging to select representative samples from massive single-cell populations for further experimental characterization, a task that requires robust, compact sampling that balances diverse properties of different priority levels. Conventional sampling methods fail to generate representative and generalizable subsets from a massive single-cell population or from more complicated ensembles. Here, we present a toolkit called Cookie that efficiently selects the most representative samples from a massive single-cell population with diverse properties. The method vectorizes all given properties and quantifies the relationships (similarities) among samples by their Manhattan distances. It then determines an appropriate sample size by evaluating the coverage of key properties across multiple candidate sizes. Finally, it applies k-medoids clustering to group the samples into clusters and selects the center (medoid) of each cluster as a representative. Comparison of Cookie with conventional sampling methods on a single-cell atlas dataset, epidemiology surveillance data, and a simulated dataset shows the high efficacy, efficiency, and flexibility of Cookie. The Cookie toolkit is implemented in R and is freely available at https://wilsonimmunologylab.github.io/Cookie/.
Keywords: R; antibody candidate selection; k-medoids; sampling; single cell
Year: 2022 PMID: 35910222 PMCID: PMC9335369 DOI: 10.3389/fgene.2022.954024
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.772
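The sampling procedure summarized in the abstract (vectorize the properties, compute Manhattan distances, test candidate sample sizes by coverage, then keep the k-medoids centers) can be illustrated with a minimal Python sketch. This is not the Cookie R implementation or its API: the function names, the one-hot encoding of categorical factors, the toy factors `isotype` and `subject`, and the simple alternating k-medoids update (used here in place of PAM or FastPAM) are all assumptions made for illustration.

```python
import random


def vectorize(samples, factors):
    """One-hot encode categorical factors so each sample becomes a 0/1 vector."""
    levels = {f: sorted({s[f] for s in samples}) for f in factors}
    return [
        [1.0 if s[f] == level else 0.0 for f in factors for level in levels[f]]
        for s in samples
    ]


def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def distance_matrix(vectors):
    """Symmetric matrix of pairwise Manhattan distances."""
    n = len(vectors)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = manhattan(vectors[i], vectors[j])
    return dist


def k_medoids(dist, k, max_iter=100, seed=0):
    """Alternating k-medoids (assign, then update); simpler than PAM/FastPAM."""
    n = len(dist)
    medoids = random.Random(seed).sample(range(n), k)
    for _ in range(max_iter):
        clusters = {m: [m] for m in medoids}  # each medoid stays in its own cluster
        for i in range(n):
            if i not in clusters:
                nearest = min(medoids, key=lambda m: dist[i][m])
                clusters[nearest].append(i)
        # New medoid of a cluster: the member with the smallest total distance.
        updated = [
            min(members, key=lambda c: sum(dist[c][j] for j in members))
            for members in clusters.values()
        ]
        if set(updated) == set(medoids):
            break
        medoids = updated
    return sorted(medoids)


def coverage_rate(samples, selected, factor):
    """Percentage of a factor's distinct levels present among the selected samples."""
    all_levels = {s[factor] for s in samples}
    seen = {samples[i][factor] for i in selected}
    return 100.0 * len(seen) / len(all_levels)


# Toy population: 60 cells with two hypothetical categorical factors.
rng = random.Random(1)
cells = [
    {"isotype": rng.choice(["IgG", "IgA", "IgM"]),
     "subject": rng.choice(["S1", "S2", "S3", "S4"])}
    for _ in range(60)
]
dist = distance_matrix(vectorize(cells, ["isotype", "subject"]))
for size in (2, 4, 8):  # candidate sample sizes, compared by coverage
    picked = k_medoids(dist, size)
    rates = [round(coverage_rate(cells, picked, f), 1) for f in ("isotype", "subject")]
    print(size, picked, rates)
```

The loop at the end mirrors the sample-size test: each candidate size is scored by how many levels of each factor the chosen medoids cover, and the smallest size with acceptable coverage would be kept.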
FIGURE 1. Workflow of k-medoids–based sampling. (A) Workflow of the Cookie pipeline. (B) Selection of representative samples using the k-medoids clustering method.
FIGURE 2. Selecting representative samples from a large single-cell population. (A) Determining an appropriate sample size by quantifying coverage on all factors. The line for the subject factor is drawn dashed to avoid overlap. (B) Coverage on each factor of the selected samples. (C) Comparison of the distributions on each factor between the original population and the selected samples.
FIGURE 3. Selecting representative samples from human influenza H1N1 surveillance data. (A) Coverage rate on each factor for different sample sizes. (B) Selected and unselected samples in a 2D visualization (t-SNE). (C) Distributions of each factor in the original population and in the selected samples.
Runtime of the major steps of the Cookie pipeline on different population sizes. All tests were performed on a simulated dataset using a 2015 Apple MacBook Pro (Core i5, 2.7 GHz, 8 GB DDR3 memory). N denotes population size.
| Processing stage | Step | Runtime (s), N = 1,000 | Runtime (s), N = 2,500 | Runtime (s), N = 5,000 | Runtime (s), N = 10,000 |
|---|---|---|---|---|---|
| Preprocess | Create object | 0.001 | 0.004 | 0.004 | 0.05 |
| | Normalization | 0.003 | 0.005 | 0.012 | 0.027 |
| | Distance calculation | 0.347 | 1.77 | 7.375 | 32.511 |
| | Nonlinear reduction (t-SNE) | 6.759 | 13.731 | 50.701 | 142.668 |
| Prime factor mode* (PAM algorithm) | Sample size test | 0.279 | 1.414 | 9.097 | 54.703 |
| | Sampling | 0.04 | 0.195 | 1.523 | 6.134 |
| Prime factor mode* (FastPAM algorithm) | Sample size test | 0.663 | 1.036 | 4.276 | 43.047 |
| | Sampling | 0.123 | 0.154 | 0.578 | 3.089 |
| Non-prime factor mode** (PAM algorithm) | Sample size test | 26.258 | 301.05 | 1750.68 | >3,000 |
| | Sampling | 3.517 | 36.251 | 261.614 | >3,000 |
| Non-prime factor mode** (FastPAM algorithm) | Sample size test | 4.403 | 27.656 | 115.79 | 598.854 |
| | Sampling | 0.493 | 3.31 | 13.522 | 68.22 |
*A prime factor is determined in this run. The algorithm used for k-medoids clustering is indicated in parentheses.
**No prime factor is determined in this run. The algorithm used for k-medoids clustering is indicated in parentheses.
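To read the runtime table in terms of scaling, one can estimate an empirical exponent p in runtime ∝ N^p from the first and last columns. The sketch below does this arithmetic for three rows. The helper name `scaling_exponent` is hypothetical, and a two-point estimate on a single machine is only a rough indication, not a complexity analysis.

```python
import math


def scaling_exponent(n_small, t_small, n_large, t_large):
    """Exponent p in t ~ N**p, estimated from two (N, runtime) measurements."""
    return math.log(t_large / t_small) / math.log(n_large / n_small)


# Runtimes in seconds at N = 1,000 and N = 10,000, copied from the table above.
steps = {
    "Distance calculation": (0.347, 32.511),
    "Nonlinear reduction (t-SNE)": (6.759, 142.668),
    "Non-prime factor mode sampling (FastPAM)": (0.493, 68.22),
}
for name, (t_1k, t_10k) in steps.items():
    p = scaling_exponent(1_000, t_1k, 10_000, t_10k)
    print(f"{name}: p = {p:.2f}")
```

For the distance calculation the estimate comes out close to 2, which is what one would expect from computing all pairwise distances.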
Coverage rates of k-medoids sampling on different population sizes. All tests were performed on a simulated dataset using a 2015 Apple MacBook Pro (Core i5, 2.7 GHz, 8 GB DDR3 memory). N denotes population size. Results were generated using the Cookie package with the FastPAM method. The sample size for prime factor mode is set to 10 (from each level of the prime factor), and that for non-prime factor mode is set to 100.
| Mode | Factor | Coverage rate (%), N = 1,000 | Coverage rate (%), N = 2,500 | Coverage rate (%), N = 5,000 | Coverage rate (%), N = 10,000 |
|---|---|---|---|---|---|
| Prime factor | Factor 1 | 84.00 | 82.00 | 78.00 | 80.00 |
| | Factor 2 | 100.00 | 100.00 | 100.00 | 100.00 |
| | Factor 3 | 100.00 | 100.00 | 100.00 | 100.00 |
| | Factor 4 | 90.91 | 100.00 | 90.91 | 90.91 |
| | Factor 5 | 81.82 | 72.7 | 72.7 | 72.73 |
| Non-prime factor | Factor 1 | 92.00 | 86.00 | 88.00 | 84.00 |
| | Factor 2 | 100.00 | 100.00 | 100.00 | 100.00 |
| | Factor 3 | 100.00 | 100.00 | 100.00 | 100.00 |
| | Factor 4 | 90.91 | 100.00 | 90.91 | 90.91 |
| | Factor 5 | 81.82 | 72.73 | 72.7 | 72.73 |
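As a reading aid for the coverage table, the sketch below computes a coverage rate under the assumption that it means the percentage of a factor's distinct levels that appear at least once among the selected samples. The record does not spell out this definition, and the 11-level factor in the example is hypothetical, chosen only because values such as 72.73 and 90.91 are consistent with 8 of 11 and 10 of 11 levels being covered.

```python
def coverage_rate(population_levels, sample_levels):
    """Percentage of the population's distinct levels that occur in the sample."""
    population = set(population_levels)
    covered = population & set(sample_levels)
    return 100.0 * len(covered) / len(population)


# Hypothetical factor with 11 levels; the selected samples hit 8 of them.
levels = [f"level_{i}" for i in range(11)]
sample = levels[:8]
print(round(coverage_rate(levels, sample), 2))  # 72.73
```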
FIGURE 4. Comparison of k-medoids sampling with a probability sampling method (stratified sampling). (A) Coverage rate on each factor for ten independent runs of stratified sampling. (B) Distributions of each factor in the original population and in the samples selected by k-medoids sampling and by stratified sampling.
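For readers unfamiliar with the baseline in Figure 4, stratified sampling can be sketched as follows. The record does not state how the stratified runs were configured, so the proportional allocation, the minimum of one sample per stratum, and the function name `stratified_sample` are assumptions for illustration only.

```python
import random
from collections import defaultdict


def stratified_sample(samples, stratum_key, n_total, seed=0):
    """Draw a random sample from each stratum, sized in proportion to the stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for index, sample in enumerate(samples):
        strata[sample[stratum_key]].append(index)
    selected = []
    for members in strata.values():
        # Proportional allocation with at least one pick per stratum;
        # rounding means the total can differ slightly from n_total.
        quota = max(1, round(n_total * len(members) / len(samples)))
        quota = min(quota, len(members))
        selected.extend(rng.sample(members, quota))
    return sorted(selected)


# Toy population with one hypothetical stratifying factor.
population = [{"group": "a"}] * 50 + [{"group": "b"}] * 30 + [{"group": "c"}] * 20
picked = stratified_sample(population, "group", n_total=10)
print(picked)
```

Stratifying guarantees coverage only of the stratifying factor. Levels of the other factors are covered by chance, which is why repeated runs can give different coverage rates.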