Felix Raimundo, Celine Vallot, Jean-Philippe Vert.
Abstract
BACKGROUND: Many computational methods have been developed recently to analyze single-cell RNA-seq (scRNA-seq) data. Several benchmark studies have compared these methods on their ability for dimensionality reduction, clustering, or differential analysis, often relying on default parameters. Yet, given the biological diversity of scRNA-seq datasets, parameter tuning might be essential for the optimal usage of methods, and determining how to tune parameters remains an unmet need.
Year: 2020 PMID: 32831127 PMCID: PMC7444048 DOI: 10.1186/s13059-020-02128-7
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 Overview of the benchmark protocol. We ran five representative DR methods, systematically varying their parameters on a large grid of values, on ten scRNA-seq datasets with known cell identity. We evaluate their ability to map cells of a given identity near other cells of the same identity, as measured by the silhouette score and the AMI after k-means clustering in the representation space
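The evaluation step in this protocol can be sketched with scikit-learn. This is a minimal illustration, not the authors' code: the function name `evaluate_embedding`, the choice of k equal to the number of known cell types, and the use of the true labels for the silhouette score are all assumptions, and the toy data are synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score, silhouette_score


def evaluate_embedding(embedding, cell_types, seed=0):
    """Score a low-dimensional representation against known cell identities."""
    cell_types = np.asarray(cell_types)
    n_types = len(np.unique(cell_types))
    # Assumption: k-means is run with k equal to the number of known cell types.
    kmeans = KMeans(n_clusters=n_types, n_init=10, random_state=seed)
    cluster_labels = kmeans.fit_predict(embedding)
    # AMI compares the k-means partition with the true cell types.
    ami = adjusted_mutual_info_score(cell_types, cluster_labels)
    # Assumption: the silhouette is computed with the true cell types as labels.
    sil = silhouette_score(embedding, cell_types)
    return {"ami": ami, "silhouette": sil}


# Synthetic stand-in for a 2-D representation of three well-separated cell types.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
toy_embedding = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])
toy_types = np.repeat(["A", "B", "C"], 50)
scores = evaluate_embedding(toy_embedding, toy_types)
```

Both scores approach 1 when cells of the same identity sit close together and far from other identities, which is the property the benchmark rewards.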
Benchmark datasets
| Dataset | Technology | Organism | N cells | Cell types | N types | Ref. |
|---|---|---|---|---|---|---|
| Zhengmix4eq | 10x | Human | 3,909 | B (24.5%), Monocytes (24.5%), cytoT (25.5%), rT (25.5%) | 4 | [ |
| Zhengmix4uneq | 10x | Human | 6,345 | B (15%), Monocytes (30%), cytoT (8%), rT (46%) | 4 | [ |
| Zhengmix5eq | 10x | Human | 4,876 | hT (20%), mT (20%), cytoT (20%), nT (20%), rT (20%) | 5 | [ |
| Zhengmix8eq | 10x | Human | 3,908 | B (12.5%), Monocytes (14.5%), hT (10%), NK (15%), mT (12.5%), cytoT (10%), nT (13%), rT (12.5%) | 8 | [ |
| Zhengmix8uneq | 10x | Human | 6,350 | B (7.5%), Monocytes (15%), hT (8%), NK (4%), mT (15%), cytoT (4%), nT (23%), rT | 8 | [ |
| sc_10x | 10x | Human | 902 | H1975 (34.5%), H2228 (35%), HCC827 (30.5%) | 3 | [ |
| sc_10x_5cl | 10x | Human | 3,918 | A549 (32%), H1975 (11%), H2228 (19.5%), H838 (22.5%), HCC827 (15%) | 5 | [ |
| sc_celseq2 | CEL-Seq2 | Human | 274 | H1975 (41%), H2228 (29.5%), HCC827 (29.5%) | 3 | [ |
| sc_celseq2_5cl | CEL-Seq2 | Human | 895 | A549 (36%), H1975 (14.5%), H2228 (14%), H838 (22%), HCC827 (13.5%) | 5 | [ |
| TabulaMuris | Smart-Seq2 | Murine | 12,081 | Brain (36.5%), intestine (31.5%), skin (19%), spleen (13%) | 4 | [ |
The first five datasets are derived from [21] and [11] and contain CD19+ B cells (B), CD14+ monocytes (Monocytes), CD4+ helper T cells (hT), CD56+ natural killer cells (NK), CD4+/CD45RO+ memory T cells (mT), CD8+/CD45RA+ naive cytotoxic T cells (cytoT), CD4+/CD45RA+/CD25− naive T cells (nT), and CD4+/CD25+ regulatory T cells (rT). The next four are from [12] and contain the following five cell lines: A549, H1975, H2228, H838, and HCC827. The last one is from [22]
Fig. 2 UMAP representation of the ten scRNA-seq datasets, run after processing of the count matrices with Seurat with default parameters
Fig. 3 Performance of five DR pipelines (scran, Seurat, ZinbWave, DCA, and scVI) with default parameters and a dimension of 10 (legend “default”) or after parameter optimization (legend “best”) on our benchmark of ten datasets. a AMI (left) and silhouette (right) reached by each method on each dataset. b UMAP representation of Zhengmix8eq after DR by each method (in columns) using default parameters (top two rows) or after parameter optimization (bottom two rows). In each row, cells are colored either based on their true cell type (rows 1 and 3) or based on a k-means clustering
Mean performance on the ten datasets of each method in terms of AMI and silhouette
| Method | Mean AMI: Default | Mean AMI: Best | Mean AMI: ANOVA AMI heuristic | Mean AMI: Silhouette heuristic | Mean silhouette: Default | Mean silhouette: Best | Mean silhouette: ANOVA AMI heuristic | Mean silhouette: Silhouette heuristic |
|---|---|---|---|---|---|---|---|---|
| scran | 0.868 | 0.841 | 0.741 | 0.362 | 0.547 | 0.396 | 0.494 | |
| Seurat | 0.788 | 0.860 | 0.814 | 0.683 | 0.369 | 0.543 | 0.490 | 0.373 |
| ZinbWave | 0.780 | 0.249 | 0.609 | |||||
| DCA | 0.758 | 0.885 | 0.837 | 0.583 | 0.403 | 0.381 | ||
| scVI | 0.560 | 0.872 | 0.709 | 0.510 | 0.384 | 0.621 | 0.482 | 0.318 |
The “Default” columns correspond to the performance of each method using its default parameters, with a dimension of 10. The “Best” columns correspond to the best performance reached after varying the parameters. The “ANOVA AMI heuristic” columns correspond to the performance of the new default parameters described in the “Influence of parameters on performance” section. The “Silhouette heuristic” columns correspond to the performance of the heuristic described in the “Tuning parameters in practice” section
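The section that defines the silhouette heuristic is not reproduced here, so the following is only one plausible reading: candidate parameter settings are ranked by the silhouette of a clustering obtained without ground-truth labels, and the highest-scoring setting is kept. The helper names `silhouette` and `pick_by_silhouette`, the candidate names, and the toy points are hypothetical.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distance (pure Python)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


def pick_by_silhouette(candidates):
    """Return the candidate name whose (points, cluster_labels) pair scores highest."""
    return max(candidates, key=lambda name: silhouette(*candidates[name]))


# Hypothetical embeddings from two parameter settings, with cluster labels
# that would come from an unsupervised step such as k-means.
tight = ([(0, 0), (0, 1), (10, 0), (10, 1)], [0, 0, 1, 1])
loose = ([(0, 0), (0, 4), (5, 0), (5, 4)], [0, 0, 1, 1])
best = pick_by_silhouette({"config_a": tight, "config_b": loose})
```

Because this criterion needs no known cell identities, it can be applied to a new dataset, which is why a label-free score is attractive for tuning in practice.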
Description of parameter sweep
| Method | Parameters | Values |
|---|---|---|
| scran | Size factors normalization | { |
| | ERCC counts normalization | { |
| | Assay type | { |
| | High variance genes | { 100, 300, |
| | Dimension of latent space | { 2, 8, 10, 16, 32, |
| Seurat | Normalization method | { |
| | Criteria for high variance genes | { |
| | High variance genes | { 100, 300, 500, 1000, |
| | Dimension of latent space | { 2, 8, 10, 16, 32, |
| ZinbWave | Gene covariates | { True, |
| | Epsilon (regularizer) | { 200, 500, |
| | High variance genes | { |
| | Dimension of latent space | { |
| DCA | Dispersion and reconstruction | { |
| | Batch normalization | { |
| | Dimension of the latent space | { 2, 8, 10, 16, |
| | Number of training epochs | { 20, 50, 100, 200, |
| | Normalize counts | { |
| | Scale variance | { |
| | Log normalization | { |
| | Dropout rate | { |
| | Number of hidden neurons | { |
| | Random seed | { |
| scVI | Number of hidden neurons | { 64, |
| | Number of training epochs | { |
| | Learning rate | { 1e-2, |
| | Dropout rate | { 0, |
| | Layers | { |
| | Dimension of the latent space | { 2, 8, |
| | Dispersion | { |
| | Reconstruction loss | { nb, |
| | Random seed | { |
For each method (first column), we vary a number of tuneable parameters (second column) systematically over a grid of values (third column). The bold value in the third column is the default value
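A systematic sweep of this kind amounts to enumerating the Cartesian product of the per-parameter value lists. The sketch below is an illustration under stated assumptions: the helper `iter_grid` and the key names are hypothetical, and the example grid uses only the scran values that are visible in the table, whose value lists are incomplete.

```python
from itertools import product


def iter_grid(grid):
    """Yield one dict per combination of parameter values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


# Partial, illustrative grid: hypothetical key names, values taken from the
# visible part of the scran rows only.
example_grid = {
    "n_high_variance_genes": [100, 300],
    "latent_dim": [2, 8, 10, 16, 32],
}
configs = list(iter_grid(example_grid))
```

Each resulting dict would be passed to one run of the DR method. The number of runs grows multiplicatively with every added parameter, so a full grid over many parameters becomes expensive quickly.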
Fig. 4 UMAP representation of Zhengmix8eq after DR by each method (in columns) using the ANOVA AMI (top two rows) or empirical silhouette (bottom two rows) heuristic to tune parameters. In each row, cells are colored either based on their true cell type (rows 1 and 3) or based on a k-means clustering