Single-Cell Computational Strategies for Lineage Reconstruction in Tissue Systems.

Charles A Herring^1,2, Bob Chen^1,3, Eliot T McKinley^1,4, Ken S Lau^1,2,3.

Abstract

Function at the organ level manifests itself from a heterogeneous collection of cell types. Cellular heterogeneity emerges from developmental processes by which multipotent progenitor cells make fate decisions and transition to specific cell types through intermediate cell states. Although genetic experimental strategies such as lineage tracing have provided insights into cell lineages, recent developments in single-cell technologies have greatly increased our ability to interrogate distinct cell types, as well as transitional cell states in tissue systems. From single-cell data that describe these intermediate cell states, computational tools have been developed to reconstruct cell-state transition trajectories that model cell developmental processes. These algorithms, although powerful, are still in their infancy, and attention must be paid to their strengths and weaknesses when they are used. Here, we review some of these tools, also referred to as pseudotemporal ordering algorithms, and their associated assumptions and caveats. We hope to provide a rational and generalizable workflow for single-cell trajectory analysis that is intuitive for experimental biologists.

Keywords: Cell State Transition; Differentiation; MST, minimum spanning tree; PCA, principal component analysis; Pseudotime; Single-Cell Analysis; Stem Cells; Trajectory; scRNA-seq, single-cell RNA-sequencing; t-SNE, t-distributed stochastic neighbor embedding

Year: 2018 PMID: 29713661 PMCID: PMC5924749 DOI: 10.1016/j.jcmgh.2018.01.023

Source DB: PubMed Journal: Cell Mol Gastroenterol Hepatol ISSN: 2352-345X

Recent developments in single-cell technologies have stimulated growth in analysis techniques, in particular, computational tools for ordering cell states as a function of pseudotemporal progression. We provide a review of current algorithms and a generalized single-cell workflow tailored for trajectory analysis, with a focus on underlying assumptions and caveats.

Cellular heterogeneity, defined by a diversity of co-occurring cell types in a tissue, is characteristic of practically every organ in the human body. The organs of the digestive system also comprise specialized cell populations that play important but diverse roles in absorption, secretion, and barrier function. For instance, the pancreatic islet contains distinct cell types that secrete different hormones, including insulin-secreting β cells, glucagon-secreting α cells, and somatostatin-expressing δ cells. Likewise, the small and large intestines exist in a dynamic equilibrium of heterogeneous stem, transitional, and differentiated cell populations, with the latter responsible for nutrient absorption, antimicrobial peptide secretion, and formation and maintenance of the mucus layer in the gut. A fundamental question in developmental biology is the origin of cellular heterogeneity, which arises from a specification process initiated from multipotent cells.

Recent developments in multiplex single-cell experimental tools have greatly facilitated the interrogation of individual cells; data on single cells then can be grouped into relevant cell populations. In digestive organ systems, populational analysis of single-cell data has been used for discovering previously unidentified β cell subpopulations in the pancreatic islet, novel markers of intestinal tuft cells,2, 3 endocrine progenitor cell heterogeneity, and signaling mechanisms between neighboring intestinal epithelial cells, among others.
Populational analysis using single-cell tools is a powerful approach for dissecting tissue-level heterogeneity, and has been reviewed extensively elsewhere.6, 7 Beyond defining cell populations, such as stem and differentiated cell types, single-cell experimental tools also can be used to characterize transitional intermediate cell states in various tissues and organoid systems. Thus, it theoretically should be possible, using single-cell data, to trace terminal cell types through intermediate cell states back to their roots of differentiation in a series of progenitor–progeny relationships. Here, we review current computational tools by which a “virtual lineage trace,” also known as a pseudotemporal order, can be extracted from multidimensional single-cell data.

Single-Cell Experimental Technologies to Interrogate Cell States From Tissues

The theoretical basis of pseudotemporal ordering is that asynchronous sampling from multiple time points over development, or snapshot sampling at a single time point of a continually renewing tissue (such as the intestine), can result in a dense sampling of transitional states that can be aligned to reflect a time course of state transitions (Figure 1). A cell state is represented by the position of a cell in a data space defined by multiple molecular markers that describe the identity and behavior of cells (Figure 1). Ordering is conducted on the basis of similarity in cell states; dense sampling of these states is required to obtain a continuum of data from which the relationship between cell states can be inferred. Because transitional cell states often are rare compared with differentiated cells in tissue, single-cell technologies must be able to query a large number of cells and simultaneously measure multiple markers to fully depict a continuum of cell states. Here, we briefly review common single-cell tools that can evaluate many cells in a multiplex fashion, classified as either suspension approaches or in situ approaches.

Figure 1

General workflow of trajectory analysis algorithms. Beginning with data in multidimensional space, feature selection is first performed to include relevant analytes and exclude noise. From the selected feature set, dimension reduction is applied to best emphasize the part of the data most relevant to cell-state transitions. Trajectories then are reconstructed in this reduced space and analyzed as pseudotime courses.

Suspension approaches involve cellular dissociation and then separate processing and analysis of individual cells, with the major caveat that the spatial context of the tissue is lost. Suspension approaches include protein-based techniques such as mass and multiparameter flow cytometry, and transcript-based techniques such as single-cell RNA-sequencing (scRNA-seq) and gene expression assays. The advantage of these approaches is in their high-throughput capacity to produce data. Flow and mass cytometry can analyze hundreds of thousands of cells in a multiplex fashion (20–40 protein analytes per cell) on the order of minutes, while scRNA-seq can quantify gene expression in an unbiased, genome-wide manner (thousands of gene analytes). Multiple platforms of scRNA-seq exist, with variations in cell-containment strategies ranging from microwells14, 15, 16 to liquid-oil emulsion droplets17, 18, 19; many of the current iterations can query up to thousands of cells.

A factor to consider when applying suspension approaches, especially on organs of the digestive system, is the perturbation imposed on cells when they are disaggregated from tissue. Cells of the hematopoietic system exist either as single-cell suspensions or in loosely connected tissues, which are readily amenable to single-cell analysis. For intestinal cells, specifically for those in the lamina propria, protocols have been developed such that the correct numbers and types of cells can be retrieved for single-cell analysis, providing critical insights into biological and disease processes.
For epithelial tissues that are tightly connected, additional factors must be considered so as to not introduce technical artifacts during the single-cell dissociation process. Disaggregation for Intracellular Signaling of Single Epithelial Cells from Tissue was developed as a fixation approach for preserving the intact state of epithelial cells for single-cell signaling analysis using mass and flow cytometry. Disaggregation for Intracellular Signaling of Single Epithelial Cells from Tissue can be applied to formalin-fixed paraffin-embedded tissues, for instance, to observe signaling state alterations in human colorectal cancer specimens. On the scRNA-seq side, Adam et al adapted psychrophilic proteases for single-cell dissociation in the cold, which drastically reduces artifacts and maintains native cell states. Adaptation of a similar strategy to fixed24, 25 or frozen tissues may enable scRNA-seq of preserved cell states. For cells that cannot be dissociated without compromising integrity, such as neurons with long and fragile axonal processes, single nucleus profiling of fresh and preserved tissues is a viable strategy to obtain a glimpse of cell state.27, 28, 29 It should be noted, however, that transcriptomes obtained from the nucleus may be drastically different from those obtained from the entire cell. These efforts highlight recent developments into suspension approaches to enable high-throughput evaluation of native cell states for characterizing cellular heterogeneity and developmental events. Unlike suspension approaches, in situ imaging techniques allow cells and their niche components to be analyzed in their native spatial context. Because of the lack of tissue dissociation, communication mechanisms between niche cells and epithelial cells can be directly visualized and quantified. 
Recent advances have improved the multiplex capabilities of microscopy approaches, enabling detection and quantification of dozens of markers leading to accurate identification of cell types that reside within certain niches. Current multiplex imaging technologies for proteins can be classified either as mass-based or iterative. Mass-based imaging approaches, including imaging mass cytometry and multiplex ion beam imaging, rely on metal-tagged antibodies coupled to mass spectrometry, while iterative approaches, including Multiplex Immunofluorescence, Cyclic Immunofluorescence, and others,34, 35, 36 rely on cycles of staining, imaging, and de-staining to enable multiplexity. Iterative approaches also are available for in situ gene expression analysis, including iterative RNA fluorescence in situ hybridization37, 38 and in situ sequencing. More recent work also has combined the barcoding capabilities of nucleic acids with protein antibody staining in an approach called nucleic acid exchange to achieve higher levels of efficiency and multiplexity. Microscopy approaches also can query thousands of cells if whole tissues are imaged at the appropriate resolution, although the time for acquisition of such data sets can be large on a per-sample basis. Currently, multiplex approaches are developed for 2-dimensional imaging, but future efforts may combine tissue clearing41, 42, 43 along with intravital techniques to enable 3-dimensional imaging of cells in real time. Although a variety of techniques can generate intricate multiplex images of intact tissue, challenges in the automatic identification of objects hinder quantitative analysis of spatial relationships among cells and niche components. Although these tools are in their infancy, in situ multiplex approaches hold the promise for understanding cell-to-environment interactions in the context of cell-state transitions. 
The choice between suspension and in situ techniques depends heavily on the experimental question being asked, and the 2 classes of techniques often are complementary. Suspension approaches are much higher throughput in terms of the number of cells and analytes analyzed, whereas in situ techniques can afford spatial resolution. We have previously coupled the 2 classes of tools, using suspension-based signaling analysis and in situ microscopy to define neighbor cell signaling mechanisms. An integrative strategy of using suspension-based analysis to deeply profile cell populations and in situ approaches to define spatial relationships between identified populations is one of many powerful strategies for delineating functionally meaningful relationships in tissue systems.

Feature Selection: A Preprocessing Step for Trajectory Analysis of scRNA-Seq Data

Multiplex cytometry and scRNA-seq techniques both attempt to capture extremely complex cell states in the form of high-dimensional data, in proteomic or transcriptomic spaces, respectively. scRNA-seq is known to produce noisy data on a per-feature basis, especially for lowly expressed genes, owing to the processing and amplification of small amounts of nucleic acids and the biological phenomenon of bursting transcription. The effects of noise are compounded in multidimensional space in a phenomenon known as the curse of dimensionality, which greatly affects downstream trajectory analysis when using the full ensemble of features. A way to mitigate this effect is to select and analyze only a subset of the most important features that maximally captures the phenomenon of interest, while ignoring uninformative or noisy features. The feature selection step is implicitly performed in candidate-based approaches, such as Cytometry Time-of-Flight and multiplex microscopy, because the user is picking the most important markers to measure. How to pick informative features while eliminating uninformative ones from genome-scale scRNA-seq experiments is still an active area of research. One intuitive method for feature selection is a supervised approach that only includes genes of interest. For instance, candidate genes can be selected from a differentially expressed gene set from a bulk RNA-seq experiment that uses a time course or genetic perturbation experimental design. 
Pipelines such as Single-cell Topological Data Analysis and Single Cell Lineage Inference Using Cell Expression Similarity and Entropy incorporate annotated gene sets from gene ontology resources such as Protein ANalysis THrough Evolutionary Relationships or the Database for Annotation, Visualization and Integrated Discovery to select features in a semi-supervised fashion.47, 48 For studies with minimal or unreliable prior knowledge, completely unsupervised methods that leverage general gene expression patterns may be used. Different unsupervised feature selection methods vary in their assumptions as well as complexity. For example, a commonly used method in analyzing scRNA-seq data involves identifying transcriptomic features with highly variable expression across the entire data set of single cells. Here, the assumption is that variance in gene expression between cells corresponds to meaningful gene regulation. This method calculates the variance of each gene across all data points (cells), and filters the features to capture only those with the highest variances. In a way, this method is analogous to principal component analysis (PCA) in selecting the dimensions with the highest variances. Technical variation can potentially exceed meaningful biological variation, and filtering methods can be confounded by the simultaneous occurrence of these 2 sources of variation. However, because of their computational tractability, variance ranking methods can provide a quick evaluation of data quality by enumerating the number of biologically relevant genes returned, which can be collected to potentially reveal both known and unknown cellular relationships. More sophisticated methods based on different patterns of gene expression have been developed to identify biologically relevant features. Qiu et al developed dpFeature, a method that selects differentially expressed genes between cell populations described by unsupervised clustering for downstream trajectory analysis.
Clusters of cells automatically identified are representative of distinct cell states, and differentially expressed genes represent likely regulators of these states. However, data sets that depict transitions are generally continuously distributed and do not form distinct clusters. Clustering in these cases is based on arbitrary cut-off values, and, thus, how dpFeature performs on these types of data sets remains to be tested. To handle continuous data distributions, Welch et al developed a metric called neighborhood variance. Implementing a K-nearest neighbors graph approach with each cell represented as a node, this method defines neighborhoods of locally varying cell states. Variance of a feature is analyzed over each defined neighborhood and compared with the global variance of that feature, with a threshold of selection for downstream analysis. Selected features exhibit small local variance with gradual and monotonic changes, consistent with progressively transitioning cell states. In addition, Furchtgott et al developed a Bayesian approach for identifying subsets of gene expression patterns over 3 cell states that are useful for defining lineage relationships. These feature selection methods use unique patterns of gene expression present in single-cell data sets to filter out genes whose variances are either owing to noise or irrelevant to the phenomenon of interest. More refined gene expression patterns perhaps can be identified in the future for more sophisticated feature selection.
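The two unsupervised ideas described in this section, ranking features by global variance and comparing local (neighborhood) variance with global variance, can be illustrated with a minimal sketch. It is not the published implementation of any of the tools named above. The function names, the toy data generator, the neighborhood size k = 10, and the selection threshold of 0.25 are all assumptions made for illustration, and the neighborhood statistic is a simplified reading of the prose description rather than the exact formula of Welch et al.

```python
import math
import random


def variance(values):
    """Population variance of a sequence of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def rank_by_global_variance(cells, n_keep):
    """Indices of the n_keep features with the highest variance across all cells."""
    n_features = len(cells[0])
    global_var = [variance([c[j] for c in cells]) for j in range(n_features)]
    order = sorted(range(n_features), key=lambda j: global_var[j], reverse=True)
    return order[:n_keep]


def nearest_neighbors(cells, i, k):
    """Indices of the k cells closest to cell i (Euclidean distance, all features)."""
    others = [j for j in range(len(cells)) if j != i]
    others.sort(key=lambda j: math.dist(cells[i], cells[j]))
    return others[:k]


def local_to_global_variance(cells, k):
    """Per feature: mean variance within k-NN neighborhoods divided by global variance."""
    n_features = len(cells[0])
    hoods = [[i] + nearest_neighbors(cells, i, k) for i in range(len(cells))]
    ratios = []
    for j in range(n_features):
        global_var = variance([c[j] for c in cells])
        local_var = sum(variance([cells[m][j] for m in hood]) for hood in hoods) / len(hoods)
        # A constant feature carries no information; give it a ratio of 1 so it is not selected.
        ratios.append(local_var / global_var if global_var > 0 else 1.0)
    return ratios


def select_smooth_features(cells, k, max_ratio):
    """Keep features whose local variance is small relative to their global variance."""
    ratios = local_to_global_variance(cells, k)
    return [j for j, r in enumerate(ratios) if r < max_ratio]


def make_toy_cells(n_cells=200, n_noise=8, seed=0):
    """Hypothetical data: 2 features follow a hidden progression t, the rest are noise."""
    rng = random.Random(seed)
    cells = []
    for _ in range(n_cells):
        t = rng.random()
        signal = [50.0 * t, 50.0 * (1.0 - t)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
        cells.append(signal + noise)
    return cells


cells = make_toy_cells()
highly_variable = rank_by_global_variance(cells, n_keep=2)
smooth = select_smooth_features(cells, k=10, max_ratio=0.25)
print("highest-variance features:", sorted(highly_variable))
print("features with small neighborhood variance:", smooth)
```

In this toy setting both criteria recover the two progression-linked features, because the signal was deliberately given a much larger scale than the noise. With real data, where technical variance can exceed biological variance, the two criteria can disagree, which is the motivation given above for neighborhood-based selection.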

t-Distributed Stochastic Neighbor Embedding: A Technique for Cell Population Analysis

A challenge of the analysis of highly multiplexed single-cell data is the inherent difficulty of visualizing high-dimensional data spaces (Figure 1). Thus, multiple methods, such as PCA, have been developed to represent high-dimensional data in a lower-dimensional space while best retaining the underlying relationships among data points in the original data space. In principle, cell-state transition relationships, based on a continuum of similar states, can be visualized in 2- or 3-dimensional space given the correct information within the data is retained. In practice, however, all dimensionality reduction techniques result in information loss because some parts of the data are discarded for lower-dimension representations. For instance, PCA represents high-dimensional data with linear combinations of variables with the highest variances while discarding low variance variables as “noise.” This optimization strategy may not have retained the relevant variables for depicting state transitions. One of the primary objectives of many trajectory analysis techniques is thus to find and retain the necessary information from a multidimensional data space relevant for mapping transitory relationships in a different data space. t-Distributed stochastic neighbor embedding (t-SNE), a nonlinear dimensionality reduction approach, has emerged as a popular and powerful technique for the analysis of single-cell data generated by a wide variety of experimental platforms.55, 56, 57, 58 t-SNE focuses on preserving the local structure while de-emphasizing the global structure of high-dimensional data, resulting in similar data points clustering together in an unsupervised manner. Because t-SNE allows user definition of the number of axes for analysis, cell populations can be unbiasedly shown in 2 or 3 dimensions. Although useful for defining divergent cell populations, the prospect for using t-SNE for trajectory analysis remains undefined. 
Because t-SNE is a stochastic algorithm emphasizing local data structure, the membership of each t-SNE–defined cluster is robust whereas the positions of the clusters are randomized in every run of the same data. Of note, the relative distances and positions between t-SNE–defined clusters may not be meaningful and should be evaluated carefully. Thus, using t-SNE to establish relationships between cell populations to model transition from one cell population to another (such as from a stem cell population to a differentiated cell population) may not be appropriate. Nevertheless, t-SNE can be used as a gating strategy before trajectory analysis to identify cells that are related in the same lineage continuum for further analysis, as opposed to those that are in separate lineages. This step is crucial because most trajectory alignment algorithms (noted later) will try to establish relationships between all cells in the input data, even though such relationships do not exist biologically.

Established Algorithms for Trajectory Reconstruction

Trajectory analysis algorithms generally can be categorized into 2 groups: minimum spanning tree (MST)-based approaches and nonlinear embedding approaches. An MST is an acyclic graph with all the nodes connected in such a way as to minimize the total edge weight, which in many cases represents the distance in data space between nodes. The idea is that nodes of the MST, which represent cells or clusters of cells, and their connections approximate the geometric shape of the data cloud when laid out in 2 dimensions. Multiple MST algorithms (eg, Spanning-tree Progression Analysis of Density-normalized Events, Monocle1, Tools for Single Cell ANalysis, Waterfall) exist and they differ by their applications on different experimental platforms, and the type and degree of clustering of data that occurs before MST construction.10, 60, 61, 62 MSTs represent the first algorithms that attempt to map transition trajectories from single-cell data. In addition to the general problem with clustering continuous data, MST-based algorithms are well known to be unstable, such that multiple applications on the same data set result in multiple, seemingly random solutions.63, 64 MST algorithms also tend to overfit smaller data sets, producing topologies with superfluous branches.65, 66 Thus, MST-based tools have shown utility mostly in well-defined systems such as hematopoiesis, in which a previously determined correct solution can be selected from an ensemble of solutions that include incorrect ones. Some MST-based algorithms have developed strategies to mitigate some of these issues. For instance, Monocle1 allows the user to set a parameter to limit the number of branches present in the final graph, but this parameter requires prior knowledge as to how many independent differentiated cell types are present, which may not be known in less-defined systems.
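As a concrete illustration of the MST definition given above, the following minimal sketch builds an MST over a handful of points using Prim's algorithm with Euclidean distances as edge weights. It is a generic textbook construction, not the procedure of any specific tool named in this section. The 7 coordinates are hypothetical cluster centroids arranged in a Y shape (a trunk that splits into 2 branches).

```python
import math


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns (parent, child, weight) edges."""
    n = len(points)
    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known connection of each node to the tree
    best_parent = [-1] * n
    best_dist[0] = 0.0           # start growing the tree from node 0
    edges = []
    for _ in range(n):
        # Add the node that is currently cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_parent[u] >= 0:
            edges.append((best_parent[u], u, best_dist[u]))
        # Update attachment costs of the remaining nodes.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_parent[v] = u
    return edges


def node_degrees(n_nodes, edges):
    """Number of tree edges touching each node."""
    degree = [0] * n_nodes
    for i, j, _ in edges:
        degree[i] += 1
        degree[j] += 1
    return degree


# Hypothetical centroids: trunk (nodes 0-2), upper branch (3-4), lower branch (5-6).
centroids = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 2), (3, -1), (4, -2)]
tree = minimum_spanning_tree(centroids)
degree = node_degrees(len(centroids), tree)
branch_points = [i for i, d in enumerate(degree) if d >= 3]
end_states = [i for i, d in enumerate(degree) if d == 1]
print("edges:", [(i, j) for i, j, _ in tree])
print("branch points:", branch_points, "end states:", end_states)
```

In this noise-free example node 2 is the only branch point and nodes 0, 4, and 6 are the end states. Adding or removing a few noisy points can change which edges are cheapest, which gives some intuition for the instability discussed above.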
Other approaches such as Ensemble Cell Lineage Analysis with Improved Robustness take a cohort of MSTs generated from the same data set and attempt to extract a consensus tree from the most common connections. However, given the general instability of MSTs, the common connections may only generate the most rudimentary topology that may or may not provide new biological insights. Thus, the field has adopted other algorithms that are more robust and provide consistent results when applied to the same data.

The second class of algorithms, nonlinear embedding, incorporates nonlinear dimensionality reduction techniques to deconvolute difficult-to-interpret, high-dimensional data into more approachable 2- to 3-dimensional representation. Unlike PCA, which assumes linear combinations of features can approximate the original data, nonlinear embedding assumes the data cloud in mathematical space lies on a nonlinear manifold, which is a mathematical topologic space (sphere, torus, and so forth) that preserves the distances of points in close proximity. t-SNE is one such nonlinear embedding approach, but different classes of algorithms have different assumptions regarding the nature, distribution, and shape of the data cloud. Unlike t-SNE, which nonlinearly transforms data into distinct clusters, trajectory analysis on continuous data aims for embedding of data into elongated and compressed shapes to capture major structures and progressive trends in the data. Multiple such embedding approaches have been adopted for single-cell data analysis, including diffusion maps used in various algorithms such as Wishbone,65, 68 local linear embedding used in Selective Locally Linear Inference of Cellular Expression Relationships, and multidimensional scaling and mapper in scTDA. Nonlinear embedding algorithms were not originally designed for biological data, and their adoption poses accessibility issues for biologists.
Specifically, the parameters for tuning these algorithms are mathematical in nature, but can have dramatic effects in shrinking or expanding the data such that local resolution may be gained or lost. Thus, nonlinear embedding algorithms are largely used for depicting simple topologies that can be described by the largest variation in the data most insensitive to parameter changes. One of the major goals of newer algorithms is for complex, multibranching trajectories to be depicted robustly.

The Next Generation of Algorithms to Reconstruct Cell-State Transition Trajectories

Next-generation algorithms that do not fall within the MST or nonlinear embedding categories have been developed recently. Force-directed layouts, such as those used in FLOW-MAP and SPRING, are a graph visualization strategy in which a densely connected network in multidimensional space is redistributed in a lower-dimensional space (eg, in 2 dimensions) by considering edges as weighted springs and using physical laws to simulate the equilibrium position of nodes as an energy minimization problem. These algorithms differ in whether cells are clustered and in whether, and what type of, prior dimension reduction has been performed. Force-directed layout resolves the problem of stochasticity of MST algorithms by using multiple connections to guide the layout. However, the interconnectedness of the graphs makes it difficult to analyze cellular transitions beyond visualization, given that all cell states in the graph will be connected to multiple other cell states. A significant advantage, however, is the possibility to represent cyclic structures, such as loops that occur in cell-cycle state transitions.47, 69

Another new algorithm, Monocle2, uses a process called reverse graph embedding to construct pseudotemporal trajectories in an unsupervised fashion. Monocle2 is currently the most widely used next-generation algorithm for trajectory analysis capable of producing multibranching trees. In principle, Monocle2 iteratively embeds data points, in a process similar to k-means clustering, into multiple principal curves. Instead of learning clusters of cells, Monocle2 learns multiple principal curves connecting into a spanning tree that reflects a transitional hierarchy (Figure 2A). As with other techniques, Monocle2 works best with expert guidance because multiple parameters that significantly affect the output must be specified. These parameters tune the fit of the principal curves in mathematical space.
An example of how user input can alter interpretations is the fact that including 2 (default) or 10 principal components greatly altered the number of cell lineages that can be identified. Monocle2 results have been shown to be robust on multiple runs and different parameters on singly-bifurcating trajectories.
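The force-directed layout idea described earlier in this section (edges as weighted springs, node positions found by relaxing toward an energy minimum) can be sketched as follows. This is a generic spring-and-repulsion simulation written for illustration, not the layout engine of FLOW-MAP or SPRING. The parameter names and values (rest_length, repulsion, step, max_move) and the 8-state ring graph, which stands in for a cyclic process such as the cell cycle, are assumptions.

```python
import math
import random


def force_directed_layout(n_nodes, edges, n_iter=500, rest_length=1.0,
                          repulsion=1.0, step=0.05, max_move=0.1, seed=0):
    """Place nodes in 2D by letting edge springs and pairwise repulsion relax."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(n_nodes)]
    for _ in range(n_iter):
        force = [[0.0, 0.0] for _ in range(n_nodes)]
        # Every pair of nodes repels, which spreads the graph out.
        for i in range(n_nodes):
            for j in range(i + 1, n_nodes):
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                dist = max(math.hypot(dx, dy), 1e-6)
                f = repulsion / dist ** 2
                force[i][0] += f * dx / dist
                force[i][1] += f * dy / dist
                force[j][0] -= f * dx / dist
                force[j][1] -= f * dy / dist
        # Each edge acts as a spring whose stiffness is the edge weight.
        for i, j, weight in edges:
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            dist = max(math.hypot(dx, dy), 1e-6)
            f = weight * (dist - rest_length)
            force[i][0] += f * dx / dist
            force[i][1] += f * dy / dist
            force[j][0] -= f * dx / dist
            force[j][1] -= f * dy / dist
        # Move each node a small, capped step along its net force.
        for i in range(n_nodes):
            move_x = step * force[i][0]
            move_y = step * force[i][1]
            size = math.hypot(move_x, move_y)
            if size > max_move:
                move_x *= max_move / size
                move_y *= max_move / size
            pos[i][0] += move_x
            pos[i][1] += move_y
    return pos


def mean_distance(pairs, pos):
    """Average Euclidean distance over a list of node pairs."""
    return sum(math.dist(pos[i], pos[j]) for i, j in pairs) / len(pairs)


# Hypothetical cyclic process: 8 states connected in a ring.
n_states = 8
ring_edges = [(i, (i + 1) % n_states, 1.0) for i in range(n_states)]
layout = force_directed_layout(n_states, ring_edges)

connected = {frozenset((i, j)) for i, j, _ in ring_edges}
all_pairs = [(i, j) for i in range(n_states) for j in range(i + 1, n_states)]
edge_pairs = [p for p in all_pairs if frozenset(p) in connected]
other_pairs = [p for p in all_pairs if frozenset(p) not in connected]
print("mean distance, connected states:  ", mean_distance(edge_pairs, layout))
print("mean distance, unconnected states:", mean_distance(other_pairs, layout))
```

Because a loop has no root or leaves, a tree-based method would have to cut it somewhere, whereas the layout above keeps every ring edge. The sketch also shows why interpretation beyond visualization is hard: the output is a set of coordinates, not an ordered trajectory.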

Figure 2

New approaches for trajectory analysis from single-cell data. (A) Monocle2 embeds the data cloud into a graph composed of principal curves. (B) p-Creode learns the most likely path through the data cloud as a function of density and shape. Arrows represent data embedding into the graph.

Although most algorithms aim to produce one output representation of cell-state transition processes, few evaluate the quality of such output by its statistical support by data. In many cases, the output of an algorithm is solely evaluated based on its fit to a known differentiation hierarchy, which raises the possibility of overfitting. Although cross-validation and bootstrapping are useful methods of evaluation, the difficulty lies in the current inability to compare overall topologic structures of graph outputs with both differing nodes and edges, which are produced over multiple different runs on the same data set. The p-Creode algorithm is unique in this respect by leveraging an ensemble of N resampled topologies to lessen the effects of overfitting. p-Creode uses a unique hierarchical placement strategy for generating cell-state transition trajectories from end states identified in an unsupervised manner (Figure 2B). Instead of placing data points on leaves on a dendrogram as in hierarchical clustering, hierarchical placement allows tiered assignment of data points as ancestor-descendant relationships. Multiple resampled runs then are evaluated by a graph dissimilarity metric called the p-Creode score to identify the number of different classes of topologies as well as the most representative topology from the ensemble. The parameters required to run p-Creode also are designed to be robust and accessible to nonexperts, and can be tuned according to how the data cloud visually appears. p-Creode also has been shown to generate robust and accurate results on complex multibranching trajectories even with noisy data.
Despite these positives, p-Creode's reliance on a downsampling preprocessing step may pose a problem for the automatic identification of rare cells, which cannot be distinguished from noise at the current time. Rare cell detection from relatively noisy single-cell data is a necessary and important area of development for all types of single-cell data analysis, and we anticipate rapid advances in this field.13, 71
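The general idea of scoring an ensemble of resampled topologies against each other and picking the most representative one can be sketched as below. This is explicitly not the p-Creode score. As a stand-in, the sketch uses the Jaccard distance between edge sets, which only works under the simplifying assumption that every run shares the same labeled nodes. As noted above, real outputs differ in both nodes and edges, which is what makes the comparison difficult. The 4 small graphs and their node labels are hypothetical.

```python
def edge_set(edges):
    """Undirected edges as a set, so (a, b) and (b, a) count as the same edge."""
    return {frozenset(edge) for edge in edges}


def jaccard_distance(a, b):
    """1 - |intersection| / |union| of two edge sets (0 = identical, 1 = disjoint)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def most_representative(ensemble):
    """Index of the graph with the smallest total dissimilarity to all others (the medoid)."""
    sets = [edge_set(graph) for graph in ensemble]
    totals = [sum(jaccard_distance(s, t) for t in sets) for s in sets]
    best = min(range(len(sets)), key=lambda i: totals[i])
    return best, totals


# Hypothetical topologies from 4 resampled runs over the same labeled states.
ensemble = [
    [("stem", "prog"), ("prog", "A"), ("prog", "B")],
    [("prog", "stem"), ("prog", "B"), ("A", "prog")],   # same topology as run 0
    [("stem", "prog"), ("prog", "A"), ("A", "B")],
    [("stem", "A"), ("A", "prog"), ("prog", "B")],
]
best, totals = most_representative(ensemble)
print("total dissimilarity per run:", totals)
print("most representative run:", best)
```

Runs 0 and 1 describe the same branching topology and therefore have the lowest total dissimilarity, so one of them is returned as the representative. A spread of totals into distinct groups would hint at more than one class of topology in the ensemble.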

Downstream Analysis of Reconstructed Trajectories

Once trajectories are generated by various reconstruction algorithms, there are a substantial number of methods to extract biological insight, many of which are borrowed from bulk analyses such as RNA-seq. We will mention a few of the most common and insightful here. First, the topology of a cell-state transition trajectory may indicate when and where developmental decisions are made. For instance, a deep hierarchical topology may reflect a process by which a series of branching cell-fate decisions are made through identifiable progenitor states, whereas a shallow, star-shaped topology can be interpreted as prepatterning, in which individuals from a seemingly homogeneous pool of progenitor cells (identified by RNA or protein) are already fated toward cell types73, 74 by mechanisms not evaluated (such as epigenetics). The analysis of network topologies can be formalized by graph theory, such as those used for identifying motifs, degree distribution, and transience of hubs from protein–protein interaction networks.75, 76, 77 Second, a common analysis is to plot and visualize relative changes in analyte expression values over a pseudotime course.9, 65 This type of analysis can be performed over separate branches to identify mechanisms of maturation or over branch points to show mechanisms of cell-fate decisions in which a cell must choose between 2 or more unique differentiation routes. Manifold alignment algorithms, such as Manifold Alignment to CHaracterize Experimental Relationships, facilitate integrated comparisons between different trajectories (different routes/different data types depicting the same route, and so forth) with different cell state and temporal units. Third, differentially expressed gene analysis along trajectories can be performed. 
In this case, however, instead of looking at genes differentially expressed between 2 conditions, one would group genes together on the basis that they show similar dynamics over a pseudotime course (eg, transient vs sustained expression). The hypothesis is that genes that are expressed in a correlated fashion may share common biological functions. As such, higher-level meta-analyses such as gene ontology enrichment, gene set enrichment, transcription factor–gene correlation analysis, and mathematical logic modeling have been used for constructing regulatory networks and models that are postulated to directly control cell decision making and/or progression.47, 62, 79, 80
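Grouping genes by their dynamics over a pseudotime course, as described above, can be sketched in a few lines. The approach below (average each gene within pseudotime bins, then group genes whose binned profiles are highly correlated) is one simple possibility chosen for illustration, not the method of any cited study. The gene names, the synthetic expression curves, the number of bins, and the correlation threshold of 0.8 are assumptions.

```python
import math


def binned_profile(pseudotime, expression, n_bins):
    """Mean expression of one gene within equal-width pseudotime bins on [0, 1]."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for t, value in zip(pseudotime, expression):
        b = min(int(t * n_bins), n_bins - 1)
        sums[b] += value
        counts[b] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either profile is constant."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def group_by_dynamics(pseudotime, genes, n_bins=5, min_corr=0.8):
    """Greedy grouping: a gene joins the first group whose founding gene it correlates with."""
    profiles = {name: binned_profile(pseudotime, values, n_bins)
                for name, values in genes.items()}
    groups = []
    for name in genes:
        for group in groups:
            if pearson(profiles[name], profiles[group[0]]) >= min_corr:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Hypothetical cells ordered along a pseudotime scaled to [0, 1].
pseudotime = [i / 99 for i in range(100)]
genes = {
    "sustained_up_a": [t for t in pseudotime],
    "sustained_up_b": [t ** 2 for t in pseudotime],
    "transient": [1.0 - (2.0 * t - 1.0) ** 2 for t in pseudotime],
    "sustained_down": [1.0 - t for t in pseudotime],
}
groups = group_by_dynamics(pseudotime, genes)
print(groups)
```

The two rising genes are grouped together, whereas the transient and falling genes each form their own group. Each group could then be passed to an enrichment analysis of the kind mentioned above.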

Notes on Using and Evaluating Trajectory Reconstruction Algorithms

As outlined in the previous sections, there are multiple algorithmic options for reconstructing trajectories from single-cell data. We leave you with a few points of consideration when applying these methods.

Many algorithms are developed by showing that existing algorithms do not perform well on a synthetic data set or a newly generated data set, thus motivating the development of a new algorithm. Overfitting is a concern when the data set used for building the algorithm also is used to show its effectiveness. The effectiveness of an algorithm also should be shown on existing data sets for which previous algorithms generated well-behaved results.

Pseudotime currently has no real correspondence to real time. The number of cell states that recapitulate a trajectory can reflect either the frequency of a transition event or the rate of transition. For instance, a longer branch can reflect a lineage that produces many cells compared with a shorter one.

The distribution of the input data matters. Tissue-level data sets, which are expected to contain multiple cellular phenotypes, usually contain both common and rare cell subsets. The power of droplet-based scRNA-seq approaches lies in their ability to query thousands of cells, which reduces the need for flow-sorting enrichment of rare cell populations before analysis. Uncommon cells can be extracted computationally after the data have been collected. However, results will be more reliable for common cell types than for rare ones. For instance, a rare cell type with 0.1% representation will contribute only 4 data points even in a 4000-cell data set. Although down-sampling and other strategies can be applied to normalize the distribution of data post hoc, a better strategy is to tackle this issue during data collection. Experimental strategies for enriching target populations, and/or methods for removing overly abundant or uninteresting cells, may be considered depending on the biological question and the cell type under study.

All computational modeling approaches are hypothesis-generating tools that require assumptions to be fulfilled and results to be validated. For trajectory analysis algorithms, the key assumption is that transitioning cell states are represented within the collected data. Thus, whether tissue is harvested during embryonic development or in the adult will greatly affect the interpretation of results. For instance, pancreatic islet development is completed by embryonic day 16.5. Thus, an adult pancreatic data set collected at homeostasis will contain very few transitioning cells and will be unsuitable for trajectory analysis. Furthermore, results generated in silico should always be confirmed experimentally by methods such as conventional lineage tracing or lineage perturbation experiments. More recently, next-generation approaches that leverage mutational scars, such as those induced by Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR), have been developed for accurately determining whether individual cells belong to the same lineage in the classic, parent-child sense.81, 82 These approaches potentially can be integrated with single-cell approaches to combine cell-state transitional information with parent-child lineage data.

Future development of trajectory analysis algorithms probably will improve scalability to meet the demands of even higher-throughput technologies, adopt approaches to reduce the impact of overfitting, and ideally be more user-friendly to nonexpert biologists, either by being completely unsupervised or by incorporating more intuitive methods and visualization for the parameter tuning process.
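The rare-population arithmetic in the notes above (0.1% of a 4000-cell data set yields about 4 cells) can be worked through with a short sketch. The function names are hypothetical, and the binomial model assumes that every cell is captured independently with the same probability, which real dissociation and capture protocols may violate.

```python
from math import comb


def expected_rare_cells(n_cells, frequency):
    """Expected number of cells of a type present at the given frequency."""
    return n_cells * frequency


def prob_at_least(n_cells, frequency, k):
    """P(at least k rare cells) under an assumed binomial sampling model."""
    p_fewer = sum(
        comb(n_cells, i) * frequency**i * (1 - frequency) ** (n_cells - i)
        for i in range(k)
    )
    return 1.0 - p_fewer


if __name__ == "__main__":
    n, f = 4000, 0.001  # values taken from the example in the text
    print("expected rare cells:", expected_rare_cells(n, f))
    for k in (1, 4, 10):
        print(f"P(at least {k} rare cells) = {prob_at_least(n, f, k):.3f}")
```

Under these assumptions, observing even a handful of rare cells is far from guaranteed at this depth, which is one way to see why enrichment during data collection is preferable to post hoc normalization.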

