David Lähnemann1,2,3, Johannes Köster1,4, Ewa Szczurek5, Davis J McCarthy6,7, Stephanie C Hicks8, Mark D Robinson9, Catalina A Vallejos10,11, Kieran R Campbell12,13,14, Niko Beerenwinkel15,16, Ahmed Mahfouz17,18, Luca Pinello19,20,21, Pavel Skums22, Alexandros Stamatakis23,24, Camille Stephan-Otto Attolini25, Samuel Aparicio13,26, Jasmijn Baaijens27, Marleen Balvert27,28, Buys de Barbanson29,30,31, Antonio Cappuccio32, Giacomo Corleone33, Bas E Dutilh28,34, Maria Florescu29,30,31, Victor Guryev35, Rens Holmer36, Katharina Jahn15,16, Thamar Jessurun Lobo35, Emma M Keizer37, Indu Khatri38, Szymon M Kielbasa39, Jan O Korbel40, Alexey M Kozlov23, Tzu-Hao Kuo3, Boudewijn P F Lelieveldt41,42, Ion I Mandoiu43, John C Marioni44,45,46, Tobias Marschall47,48, Felix Mölder1,49, Amir Niknejad50,51, Lukasz Raczkowski5, Marcel Reinders17,18, Jeroen de Ridder29,30, Antoine-Emmanuel Saliba52, Antonios Somarakis42, Oliver Stegle40,46,53, Fabian J Theis54, Huan Yang55, Alex Zelikovsky22,56, Alice C McHardy3, Benjamin J Raphael57, Sohrab P Shah58, Alexander Schönhuth59,60.
Abstract
The recent boom in microfluidics and combinatorial indexing strategies, combined with low sequencing costs, has empowered single-cell sequencing technology. Thousands, or even millions, of cells analyzed in a single experiment amount to a data revolution in single-cell biology and pose unique data science problems. Here, we outline eleven challenges that will be central to bringing this emerging field of single-cell data science forward. For each challenge, we highlight motivating research questions, review prior work, and formulate open problems. This compendium is for established researchers, newcomers, and students alike, highlighting interesting and rewarding problems for the coming years.
Year: 2020 PMID: 32033589 PMCID: PMC7007675 DOI: 10.1186/s13059-020-1926-6
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 Different levels of resolution are of interest, depending on the research question and the data available. Thus, analysis tools and reference systems (such as cell atlases) will have to accommodate multiple levels of resolution, from whole organs and tissues through discrete cell types to continuously mappable intermediate cell states, which are indistinguishable even at the microscopic level. A graph abstraction that enables such multiple levels of focus is provided by PAGA [14], a structure that allows for discretely grouping cells, as well as inferring trajectories as paths through a graph
Whole genome amplification: recent improvements
| Recent improvements of whole genome amplification (WGA) methods promise to reduce amplification biases and errors, while scaling throughput to larger cell numbers: |
| 1. Improved coverage uniformity for multiple displacement amplification (MDA) has been achieved using droplet microfluidics-based methods (eWGA [ |
| 2. One way to reduce the amplification error rate of the polymerase chain reaction (PCR)-based methods (including multiple annealing and looping-based amplification cycles (MALBAC)) would be to employ a thermostable polymerase (necessary for use in PCR) with proof-reading activity similar to |
| 3. Three newer methods use an entirely different approach: they randomly insert transposons into the whole genome and then leverage these as priming sites for amplification and library preparation. Transposon Barcoded (TnBC) library preparation (with a PCR amplification, [ |
Fig. 2 Measurement error requires denoising methods or approaches that quantify uncertainty and propagate it down analysis pipelines. Where methods cannot deal with abundant missing values, imputation approaches may be useful. While the true population manifold that generated the data is never known, one can usually obtain an estimate of it that can be used for both denoising and imputation
Fig. 6 Approaches for integrating single-cell measurement datasets across measurement types, samples, and experiments, as also described in Table 4. 1S: clustering of cells from one sample from one experiment requires no data integration. +S: integration of one measurement type across samples requires the linking of cell populations/clusters. +X+S: integration of one measurement type across experiments conducted in separate laboratories requires stable reference systems like cell atlases (compare Fig. 1). +M1C: integration of multiple measurement types obtained from the same cell highlights the problem of data sparsity of all available measurement types and the dependency of measurement types that needs to be accounted for. +M+C: integration of different measurement types from different cells of the same cell population requires special care in matching cells through meaningful profiles. +all: one possibility for easing data integration across measurement types from separate cells would be to have a stable reference (cell atlas) across multiple measurement types, capturing different cell states, cell populations, and organisms. Effectively, this combines the challenges and promises of the approaches +X+S, +M1C, and +M+C
Short description of methods for the imputation of missing data in scRNA-seq data
| A: model-based imputation | ||
| bayNorm | Binomial model, empirical Bayes prior | [ |
| BISCUIT | Gaussian model of log counts, cell- and cluster-specific parameters | [ |
| CIDR | Decreasing logistic model (DO), non-linear least-squares regression (imp) | [ |
| SAVER | NB model, Poisson LASSO regression prior | [ |
| ScImpute | Mixture model (DO), non-negative least squares regression (imp) | [ |
| scRecover | ZINB model (DO identification only) | [ |
| VIPER | Sparse non-negative regression model | [ |
| B: data smoothing | ||
| DrImpute | [ | |
| knn-smooth | [ | |
| LSImpute | Locality sensitive imputation | [ |
| MAGIC | Diffusion across nearest neighbor graph | [ |
| netSmooth | Diffusion across PPI network | [ |
| C: data reconstruction, matrix factorization | ||
| ALRA | SVD with adaptive thresholding | [ |
| ENHANCE | Denoising PCA with aggregation step | [ |
| scRMD | Robust matrix decomposition | [ |
| consensus NMF | Meta-analysis approach to NMF | [ |
| f-scLVM | Sparse Bayesian latent variable model | [ |
| GPLVM | Gaussian process latent variable model | [ |
| pCMF | Probab. count matrix factorization with Poisson model | [ |
| scCoGAPS | Extension of NMF | [ |
| SDA | Sparse decomposition of arrays (Bayesian) | [ |
| ZIFA | ZI factor analysis | [ |
| ZINB-WaVE | ZINB factor model | [ |
| C: data reconstruction, machine learning | ||
| AutoImpute | AE, no error back-propagation for zero counts | [ |
| BERMUDA | AE for cluster batch correction (MMD and MSE loss function) | [ |
| DeepImpute | AE, parallelized on gene subsets | [ |
| DCA | Deep count AE (ZINB / NB model) | [ |
| DUSC / DAWN | Denoising AE (PCA determines hidden layer size) | [ |
| EnImpute | Ensemble learning consensus of other tools | [ |
| Expression Saliency | AE (Poisson negative log-likelihood loss function) | [ |
| LATE | Non-zero value AE (MSE loss function) | [ |
| Lin_DAE | Denoising AE (imputation across | [ |
| SAUCIE | AE (MMD loss function) | [ |
| scScope | Iterative AE | [ |
| scVAE | Gaussian-mixture VAE (NB / ZINB / ZIP model) | [ |
| scVI | VAE (ZINB model) | [ |
| scvis | VAE (objective function based on latent variable model and t-SNE) | [ |
| VASC | VAE (denoising layer; ZI layer, double-exponential and Gumbel distribution) | [ |
| Zhang_VAE | VAE (MMD loss function) | [ |
| T: using external information | ||
| ADImpute | Gene regulatory network information | [ |
| netSmooth | PPI network information | [ |
| SAVER-X | Transfer learning with atlas-type resources | [ |
| SCRABBLE | Matched bulk RNA-seq data | [ |
| TRANSLATE | Transfer learning with atlas-type resources | [ |
| URSM | Matched bulk RNA-seq data | [ |
Imputation methods using only data from within a dataset are roughly categorized into approaches A (model-based), B (data smoothing), and C (data reconstruction), with the latter further differentiated into matrix factorization and machine learning approaches. In contrast to these methods, those in category T (for transfer learning) also use information external to the dataset to be analyzed
AE autoencoder, DO dropout, imp imputation, MMD maximum mean discrepancy, MSE mean squared error, NB negative binomial, NMF non-negative matrix factorization, P Poisson, PC principal component, PCA principal component analysis, PPI protein-protein interaction, SVD singular value decomposition, VAE variational autoencoder, ZI zero-inflated
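To make category B (data smoothing) concrete, the following is a minimal sketch of nearest-neighbor smoothing on a toy count matrix. It is not the algorithm of any tool listed in the table: the function name `knn_smooth`, the Euclidean distance on raw counts, and the toy values are all assumptions chosen for illustration, and real methods typically work on normalized or transformed data.

```python
import math


def knn_smooth(counts, k=2):
    """Average each cell's profile with those of its k nearest cells.

    counts: list of cells, each a list of per-gene counts (toy input).
    Hypothetical helper for illustration; not taken from any listed tool.
    """
    def dist(a, b):
        # Euclidean distance between two expression profiles.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    smoothed = []
    for i, cell in enumerate(counts):
        # Rank all other cells by distance to the current cell, keep the k closest.
        neighbors = sorted(
            (j for j in range(len(counts)) if j != i),
            key=lambda j: dist(cell, counts[j]),
        )[:k]
        group = [cell] + [counts[j] for j in neighbors]
        # Per-gene mean over the cell and its neighbors.
        smoothed.append([sum(col) / len(group) for col in zip(*group)])
    return smoothed


# Toy matrix (rows: cells, columns: genes). Cells 0-2 and 3-5 form two groups;
# the zero for gene 1 in cell 0 mimics a dropout event (an assumption).
counts = [
    [5, 0, 0],
    [4, 6, 0],
    [6, 5, 1],
    [0, 1, 9],
    [1, 0, 8],
    [0, 0, 10],
]
smoothed = knn_smooth(counts, k=2)
```

In this toy case, the zero in cell 0 is replaced by an average borrowed from the two most similar cells. The same mechanism would also overwrite a true biological zero, which is one reason smoothing results need careful interpretation.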
Approaches for data integration, highlighting their promises and challenges
| Approach | Integration | Example MT combination | Example AMs | Promises | Challenges |
|---|---|---|---|---|---|
| 1S | None | scDNA-seq | Clustering/unsupervised | Discover new subclones, cell types, or cell states | Technical noise ↓; data sparsity ↓ |
| +S | Within 1 MT, within 1 exp, across >1 smps | scRNA-seq | Differential analyses, time series, spatial sampling | Identify effects across sample groups, time, and space | Batch effects ↓; validate cell type assignments ↓ |
| +X+S | Within 1 MT, across >1 exp, across >1 smps | merFISH | Map cells to stable reference (cell atlas) | Accelerate analyses, increase sample size, generalize observations | Standards across experimental centers |
| +M1C | Across >1 MTs, within 1 exp, within 1 cell | scM&T-seq (scRNA-seq + methylome) | MOFA, DIABLO, MINT | Holistic view of cell state; quantify dependency of MTs | Scaling cell throughput; MT combinations limited; dependency of MTs ↓ |
| +M+C | Across >1 MTs, within 1 exp, across >1 cells, within 1 cell pop | scDNA-seq + scRNA-seq | Cardelino, Clonealign, MATCHER | Use existing datasets (faster than +M1C); flexible experimental design | Validate cell/data matching; test assumptions for integrating data |
| +all | Across >1 MTs, across >1 exps, across >1 smps, within cells | Hypothetical (any combination) | Hypothetical (map cells to multi-omic HCA, single-cell TCGA) | Holistic view of biological systems | All from approaches +X+S, +M1C, and +M+C |
The labeling corresponds to Fig. 6. For each approach, one (combination of) measurement type(s) that is available is given, but more exist and several are discussed in the text. As example analysis methods, actual tool names are given where few tools exist to date; otherwise, broader categories or imaginable methodologies are described
Abbreviations: “↓” same challenge also applies to all approaches below, AM analysis method, exp(s) experiment(s), HCA human cell atlas, MT measurement type, smps samples, TCGA The Cancer Genome Atlas
Fig. 3 Differential expression of a gene or transcript between cell populations. The top row labels the specific gene or transcript, as is also done in Fig. 6. A difference in mean gene expression manifests in a consistent difference of gene expression across all cells of a population (e.g., high vs. low). A difference in variability of gene expression means that in one population, all cells have a very similar expression level, whereas in another population, some cells have a much higher expression and some a much lower expression. The resulting average expression level may be the same, and in such cases, only single-cell measurements can find the difference between populations. A difference across pseudotime is a change of expression within a population, for example, along a developmental trajectory (compare Fig. 1). This also constitutes a difference between cell populations that is not apparent from population averages, but requires a pseudo-temporal ordering of measurements on single cells
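The point about variability can be illustrated with a small numeric sketch. The expression values below are invented for illustration, and `summarize` is a hypothetical helper; the example only shows that two populations can share a mean while differing strongly in variance, so that a bulk-style average would not separate them.

```python
from statistics import mean, pvariance


def summarize(expression):
    """Return (mean, population variance) of per-cell expression values."""
    return mean(expression), pvariance(expression)


# Invented per-cell expression levels of one gene in two populations.
pop_a = [10, 10, 11, 9, 10, 10]  # all cells express at a similar level
pop_b = [2, 18, 1, 19, 3, 17]    # some cells much lower, some much higher

mean_a, var_a = summarize(pop_a)
mean_b, var_b = summarize(pop_b)

# A bulk measurement would report roughly the population mean only.
same_mean = mean_a == mean_b
more_variable = var_b > var_a
```

Here both populations have the same mean, while the variance of `pop_b` is far larger. Detecting this difference requires the per-cell values rather than the population average.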
Published cell atlases of whole tissues or whole organisms
| Organism | Scale of cell atlas | Citation |
|---|---|---|
| Nematode ( | Whole organism at larval stage L2 | [ |
| Planaria ( | Whole organism of the adult animal | [ |
| Fruit fly ( | Whole organism at embryonic stage | [ |
| Zebrafish | Whole organism at embryonic stage | [ |
| Frog ( | Whole organism at embryonic stage | [ |
| Mouse | Whole adult brain | [ |
| Mouse | Whole adult organism | [ |
Fig. 4 A tumor evolves somatically, from initiation to detection, to resection, and to possible metastasis. New genomic mutations can confer a selective advantage to the resulting new subclone that allows it to outperform other tumor subclones (subclone competition). At the same time, the acting selection pressures can change over time (e.g., due to new subclones arising, the immune system detecting certain subclones, or as a result of therapy). Understanding such selective regimes, and how specific mutations alter a subclone’s susceptibility to changes in selection pressures, will help construct an evolutionary model of tumorigenesis. Only within such an evolutionary model can more efficient and more patient-specific treatments be developed. For such a model, unambiguously identifying mutation profiles of subclones via scDNA-seq of resected or biopsied single cells is crucial
Fig. 5 Mutations (colored stars) accumulate in cells during somatic cell divisions and can be used to reconstruct the developmental lineages of individual cells within an organism (leaf nodes of the tree with mutational presence/absence profiles attached). However, insufficient or unbalanced WGA can lead to the dropout of one or both alleles at a genomic site. This can be mitigated by better amplification methods, but also by computational and statistical methods that can account for or impute the missing values
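The effect of allelic dropout on a mutation profile can be sketched with a small simulation. This is a toy model under stated assumptions, not a description of any WGA chemistry: each allele of a truly heterozygous site is assumed to be lost independently with the same probability, and the names `observe_het_site` and `dropout_rate` are hypothetical.

```python
import random
from collections import Counter


def observe_het_site(dropout_rate, rng):
    """Simulate the observed genotype at a truly heterozygous site.

    Assumption: the reference and the mutated allele each fail to amplify
    independently with probability `dropout_rate`.
    """
    ref_seen = rng.random() >= dropout_rate
    alt_seen = rng.random() >= dropout_rate
    if ref_seen and alt_seen:
        return "het"       # both alleles amplified: correct call
    if ref_seen:
        return "hom_ref"   # mutated allele lost: mutation is missed
    if alt_seen:
        return "hom_alt"   # reference allele lost: false homozygous call
    return "missing"       # both alleles lost: no data at this site


def simulate(n_sites, dropout_rate, seed=0):
    """Tally observed genotypes over n_sites heterozygous sites."""
    rng = random.Random(seed)
    return Counter(observe_het_site(dropout_rate, rng) for _ in range(n_sites))


# Illustrative dropout rate of 20% per allele (an arbitrary choice).
tally = simulate(n_sites=10_000, dropout_rate=0.2)
```

Under this model, a fraction of true mutations appears as absent (`hom_ref`) and another fraction as missing entirely, which is why lineage reconstruction methods need to account for false negatives and missing values rather than treat observed profiles as exact.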