| Literature DB >> 33286070 |
Luca Albergante1,2,3,4, Evgeny Mirkes5,6, Jonathan Bac1,2,3,7, Huidong Chen8,9, Alexis Martin1,2,3,10, Louis Faure1,2,3,11, Emmanuel Barillot1,2,3, Luca Pinello8,9, Alexander Gorban5,6, Andrei Zinovyev1,2,3.
Abstract
Multidimensional datapoint clouds representing large datasets are frequently characterized by non-trivial low-dimensional geometry and topology which can be recovered by unsupervised machine learning approaches, in particular, by principal graphs. Principal graphs approximate the multivariate data by a graph injected into the data space with some constraints imposed on the node mapping. Here we present ElPiGraph, a scalable and robust method for constructing principal graphs. ElPiGraph exploits and further develops the concept of elastic energy, the topological graph grammar approach, and a gradient descent-like optimization of the graph topology. The method is able to withstand high levels of noise and is capable of approximating data point clouds via principal graph ensembles. This strategy can be used to estimate the statistical significance of complex data features and to summarize them into a single consensus principal graph. ElPiGraph deals efficiently with large datasets in various fields such as biology, where it can be used for example with single-cell transcriptomic or epigenomic datasets to infer gene expression dynamics and recover differentiation landscapes.Entities:
Keywords: data approximation; principal graphs; principal trees; software; topological grammars
Year: 2020 PMID: 33286070 PMCID: PMC7516753 DOI: 10.3390/e22030296
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
Elements of the principal graph methodology.
| Element | Initial Publication | Principal Advances |
|---|---|---|
| Principal curves | Hastie and Stuelze, 1989 [ | Definition of principal curves based on self-consistency |
| Piece-wise linear principal curves | Kégl et al., 1999 [ | Length constrained principal curves, polygonal line algorithm |
| Elastic energy functional for elastic principal curves and manifolds | Gorban, Rossiev, Wunsch II, 1999 [ | Fast iterative splitting algorithm to minimize the elastic energy, based on sequence of solutions of simple quadratic minimization problems |
| Method of elastic maps | Zinovyev, 2000 [ | Construction of principal manifold approximations possessing various topologies |
| Principal Oriented Points | Delicado, 2001 [ | Principal curves passing through a set of principal oriented points |
| Principal graphs specialized for image skeletonization | Kégl and Krzyzak, 2002 [ | Coining the term principal graph, an algorithm extending the polygonal line algorithm, specialized on image skeletonization |
| Self-assembling principal graphs | Gorban and Zinovyev, 2005 [ | Simple principal graph algorithm, based on application of elastic map method, specialized on image skeletonization |
| General purpose elastic principal graphs | Gorban, Sumner, Zinovyev, 2007 [ | Suggesting the principle of (pluri-)harmonic graph embedding, coining the terms ‘principal tree’ and ‘principal cubic complex’ with algorithms for their construction |
| Topological grammars | Gorban, Sumner, Zinovyev, 2007 [ | Exploring multiple principal graph topologies via gradient descent-like search in the space of admissible structures |
| Explicit control of principal graph complexity | Gorban and Zinovyev, 2009 [ | Introducing three types of principal graph complexity and ways to constrain it |
| Regularized principal graphs | Mao et al., 2015 [ | Formulating reverse graph embedding problem. Suggesting SimplePPT algorithm. Further development in Mao et al., 2017 [ |
| Robust principal graphs | Gorban, Mirkes, Zinovyev, 2015 [ | Using trimmed version of the mean squared error, resulting in the ‘local’ growth of the principal graphs and robustness to background noise |
| Domain-specific adaptations of principal graphs | Qiu et al., 2017 [ | Use of principal graphs in single cell data analysis as a part of pipelines Monocle, STREAM, MERLoT. Introducing heuristics for problem-specific initializations of principal graphs. Benchmarked in Saelens et al., 2019 [ |
| Partition-based graph abstraction | Wolf et al., 2019 [ | Dealing with non-tree like topologies, large-scale and multi-scale data analysis using graphs |
| Excessive branching control | This publication | Introducing a penalty for excessive branching of elastic principal graphs (α parameter) |
| Principal graph ensemble approach | This publication | Estimating confidence of branching point positions, constructing consensus principal graphs |
| Reducing the computational complexity of elastic principal graphs | This publication | Accelerated procedures and heuristics in order to enable constructing large principal graphs |
| Use of GPUs for constructing elastic principal graphs | This publication | Scalable Python and R implementations of elastic principal graphs (ElPiGraph), improved scalability and introducing various plotting functions |
List of used abbreviations: GDA—geometrical data analysis; TDA—topological data analysis; ElPiGraph—elastic principal graphs; GPU—graphical processing unit; CPU—central processing unit; kNN—k neareast neighbors; SOM—self-organizing maps; LLE—local linear embedding; UMAP—Uniform Manifold Approximation and Projection for Dimension Reduction.
Figure 1Basic principles and examples of ElPiGraph usage. (A) Schematic workflow of the ElPiGraph method. Left, construction of the elastic graph starts by defining the initial graph topology and embedding it into the data space. The graph structure is fit to the data, using minimization of the mean square error regularized by the elastic energy. The elastic energy includes a term reflecting the overall stretching of the graph (symbolically shown as contractive red springs) and a term reflecting the overall bending of the graph branches and the harmonicity of branching points (shown as repulsive green springs). Middle, ElPiGraph explores a large region of the structural space by exhaustively applying a set of graph rewriting rules (graph grammars) and selecting, at each step, the structure leading to the minimum overall energy of the graph embedding. (B) Example of elastic matrix for a simple graph and transforming this matrix into several parts used for optimization (see details in the Methods). (C) Left and middle, illustration of the robust local workflow of ElPiGraph, which makes it possible to deal with the presence of noise. In the global version, the graph structure is fit to all data points at the same time, while in the local version, the structure is fit to the points in the local graph neighborhood, which expands as the graph grows. Right, an illustration of principal graph ensemble approach: 100 elastic principle graphs are superimposed, each constructed on a fraction of data points randomly sampled at each execution. (D) Application of a set of graph grammar rules to generate a set of tree topologies with up to seven nodes.
Figure 2Synthetic examples showing features of ElPiGraph. (A) ElPiGraph is robust with respect to downsampling and oversampling of a dataset. Here, a reference branching dataset [32] is downsampled to 50 points (middle) or oversampled by sampling 20 points randomly placed around each point of the original dataset. More systematic study of the down/over-sampling effect on robustness of the principal graph is shown in Supplementary Figure S2. (B) ElPiGraph is able to capture non-tree like topologies. Here, the standard set of principal tree graph grammars was applied to a graph initialized by four nodes connected to a ring. (C) ElPiGraph is robust to large amounts of uniformly distributed background noise. Here, the initial dataset from Figure 1 is shown as black points, and the uniformly distributed noisy points are shown as grey points. ElPiGraph is blindly applied to the union of black and grey points. (D) ElPiGraph is able to solve the problem of learning intersecting manifolds. On the left, a toy dataset of three uniformly sampled curves intersecting in many points is shown. ElPiGraph starts by learning a principal curve using the local version several times, each time on a complete dataset. However, for each iteration, ElPiGraph is initialized by a pair of neighboring points not belonging to points already captured by a principal curve. The fitted curves are shown in the middle of the point distribution by using different colors, and the clustering of the dataset based on curve approximation is shown on the right. (E) Approximating a synthetic ten-dimensional dataset with known branch structure (with different colors indicating different branches), where one of the branches (blue one) extrude into higher dimensions and collapses with other branches when projected in 2D by principal component analysis (left). Middle, being applied in the space of two first principal components, ElPiGraph does not recover the branch, while it is captured when the ElPiGraph is applied in the complete 10-dimensional dataset (right). In both cases, the principal tree is visualized using metro map layout [21], and a pie chart associated to each node of the graph indicates the percentage of points belonging to the different populations. The size of the pie chart indicates the number of points associated with each node.
Figure 3Computational performance of ElPiGraph. (A) Time required to build principal curves and trees (y-axis) with a different number of nodes (x-axis) using the default parameters across synthetic datasets containing different numbers of points having a different number of dimensions (color scale), using single CPU. (B) Demonstrating computational advantages of using a combination of GPU and CPU in ElPiGraph computations. Starting from 10k data points, using GPU becomes advantageous. The computational experiment was done using Google Colab environment. (C) Empirical performance scaling of ElPiGraph on a 12k data points in 5D example. The exact ElPiGraph algorithm scales as polynome of degree 3, while the approximate reduced algorithm with a limitation on the number of tested candidate structures scales as s2.5 in the s = 50…1000 range, where s is the number of nodes in the principal graph.
Figure 4Quantification of branching cellular trajectories from single cell scRNASeq data using ElPiGraph. (A) Diagrammatic representation of the concept behind branching cellular trajectories and biological pseudotime in an arbitrary 2D space associated with gene expression: as cells progress from Stage 1 they differentiate (Stages 2 and 3) and branch (Stage 4) into two different subpopulations (Stages 5 and 6). Local distances between the cells indicate epigenetic similarity. Note how embedding a tree into the data allow recovering genetic changes associated with cell progression into the two differentiated states. (B) Application of ElPiGraph to scRNA-Seq data [59]. Each point indicates a cell and is color-coded using the cellular phenotype specified in the original paper. One hundred bootstrapped trees are represented (in black), along with the tree fitted on all the data (black nodes and edges). Projection on ElPiGraph principal components 2 and 3 is shown enlarged as the most informative. The fraction of variance of the original data explained by the projection on the nodes (FVE) and on the edges (FVEP) is reported on top of the plot. (C) Diagrammatic representation of the distribution of cells across the branches of the tree reconstructed by ElPiGraph with the same color scheme as in panel B. Pie charts indicate the distribution of populations associated with each node. The black arrows highlight a particular cell fate decision trajectory from CMP to GMP and DC. (D) Single-cell dynamics of gene expression along the path from the root of the tree (at the top of panel C) to the branch corresponding to DC commitment and GMP commitment. The expression profiles have been fitted by a LOESS smoother, colored according to the majority cell type in the branch, with a 95% confidence interval (in grey). The vertical grey area represents a 95% confidence interval obtained by projecting the relevant branching points of the resampled tree showed in panel B on the path of the principal graph.
Figure 5Using ElPiGraph to approximate complex datasets describing developing embryos (xenopus) or adult organisms (planarian). (A) A kNN graph constructed using the gene expression of 7936 cells of Stage 22 Xenopus embryo has been projected on a 3D space using a force directed layout. The color in this and the related panels indicate the population assigned to the cells by the source publication. (B) The coordinates of the points in panel A have been used to fit 1280 principal trees with different parameters, hence obtaining a principal forest. (C) The principal forest shown in panel B has been used to produce 10 consensus graphs (one for each parameter set). (D) A final consensus graph has been produced using the consensus graphs shown in panel C. (E) A final principal graph has been obtained by applying standard grammar operations to the consensus graph shown in panel D. (F) The associations of the different cell types to the nodes of the consensus graph shown in panel E is reported on a plane with a pie chart for each node. Note the complexity of the graph and the predominance of different cell types in different branches, as indicated the predominance of one or few colors. (G) The dynamics of notable genes had been derived by deriving a pseudotime for a branching structure (top) and a linear structure (bottom) present in the principal graph of panel E (see black polygons). Each point represents the gene expression of a cell and their color indicate either their associated path (top) of the cell type (bottom). The gene expression profiles have been fitted with a LOESS smoother which include a 95% confidence interval. In the top panel the smoother has been colored to highlight the different paths, with the color indicated in the text of panel F. (H–J) The same approach described by panels A–G has been used to study the single-cell transcriptome of planarians. In panel J, the color of the smoother indicates the predominant cell type on the path. Interactive versions of key panels are available at https://sysbio-curie.github.io/elpigraph/.