Chieh Lin, Jun Ding, Ziv Bar-Joseph.
Abstract
Methods for the analysis of time series single cell expression data (scRNA-Seq) either do not utilize information about transcription factors (TFs) and their targets or only study these as a post-processing step. Using such information can both improve the accuracy of the reconstructed model and cell assignments and provide information on how and when the process is regulated. We developed the Continuous-State Hidden Markov Models TF (CSHMM-TF) method, which integrates probabilistic modeling of scRNA-Seq data with the ability to assign TFs to specific activation points in the model. TFs are assumed to influence the emission probabilities for cells assigned to later time points, allowing us to identify not just the TFs controlling each path but also their order of activation. We tested CSHMM-TF on several mouse and human datasets. As we show, the method identified known and novel TFs for all processes, the assigned activation times agree with both expression information and prior knowledge, and the combinatorial predictions are supported by known interactions. We also show that CSHMM-TF improves upon prior methods that do not utilize TF-gene interaction information.
Year: 2020 PMID: 32069291 PMCID: PMC7048296 DOI: 10.1371/journal.pcbi.1007644
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 1. CSHMM-TF model structure and parameters.
The figure presents the assignments of cells and TFs to the reconstructed branching model for the process studied. Each edge (path) represents an infinite set of states parameterized by the path number and the location along the path. We use a function based on parameters learned for the split nodes (nodes at the start and end of each path) and TF assignments to define an emission probability. The emission probability for a gene along a path is a function of the location of the state and prior TFs (t and t) and a gene-specific parameter k, which controls the rate of change of its expression along the path. Split nodes are locations where paths split and are associated with a branch (transition) probability. The t_start parameter defines the activation time for a specific TF associated with the path. Cell assignment to paths is determined by the emission probabilities and the expression of specific TF targets for the TFs associated with the path. w is a vector of gene-specific mixture weights, where the weights are a nonlinear function that depends on (t and t). See text for more details.
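To make the roles of the location along a path, the rate parameter k, and t_start concrete, the following is a minimal sketch. The functional form is an assumption, since the caption does not give it: the mean is taken to interpolate exponentially between the expression values at the two split nodes, with the change starting only after t_start, and the emission is taken to be Gaussian. The names `expected_expression`, `log_emission`, `g_start`, and `g_end` are hypothetical.

```python
import math

def expected_expression(t, g_start, g_end, k, t_start=0.0):
    """Mean expression of one gene at location t along a path.

    Assumed form (not stated in the caption): the mean stays at the value of
    the path's starting split node until the TF activation time t_start, then
    moves exponentially toward the ending split node's value at rate k.
    """
    elapsed = max(0.0, t - t_start)
    return g_end + (g_start - g_end) * math.exp(-k * elapsed)

def log_emission(x, t, g_start, g_end, k, variance, t_start=0.0):
    """Gaussian log-density of observed expression x at location t (assumed)."""
    mean = expected_expression(t, g_start, g_end, k, t_start)
    return -0.5 * (math.log(2 * math.pi * variance) + (x - mean) ** 2 / variance)

# A gene moving from 2.0 to 5.0 whose regulating TF activates at t = 0.4.
for t in (0.0, 0.4, 0.7, 1.0):
    print(t, round(expected_expression(t, 2.0, 5.0, k=3.0, t_start=0.4), 3))
```

Under this sketch a larger k brings the gene closer to its end-node value by the end of the path, and a later t_start delays the onset of the change, which is how an activation time could be read off from target expression.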
Fig 2. CSHMM-TF result for the liver dataset.
(a) CSHMM-TF structure and continuous cell assignment for the liver dataset. D nodes are split nodes and p edges are paths, as shown in Fig 1. Each circle on a path represents cells assigned to a state on that path. The bigger the circle, the more cells are assigned to that state. Cells are colored based on the cell type / time point assigned to them in the original paper. (b) TF assignments by CSHMM-TF for the liver dataset. We highlight known functional roles for several TFs. Path names (DE, LB etc.) are based on the annotated cells assigned to that path in panel (a). Full names of cell types can be found in S1 Appendix (Supporting methods of data collection and processing).
Fig 3. CSHMM-TF result for the lung development dataset.
(a) CSHMM-TF structure and continuous cell assignment for the lung development dataset. Notation is the same as in Fig 2. (b) TF assignments to each path by CSHMM-TF. We highlight known functional roles for several TFs. Path names (Ciliated, AT1 etc.) are based on the annotated cells assigned to that path in panel (a).
Fig 4. Expression profiles for top TFs assigned by the method to the lung, neuron, and liver reconstructed models.
Each panel plots the expression of TFs predicted to co-regulate a specific path. Each panel legend gives the color and the time assignment for each TF. Profiles are the MLE estimates of the TFs' expression values based on the learned model parameters. (a-d) Co-regulating TF expression in lung paths. (e-i) Co-regulating TF expression in neuron paths. (j-l) Co-regulating TF expression in liver paths. See text for details.
Analysis of predicted TF-TF interactions based on the TcoF database.
Abbreviations: total: all possible interactions in a dataset, A: all TFs assigned to each path, E: early TFs in each of the paths, L: late TFs. For each dataset we present 3 rows: number of combinations, ratio and p-value.
| Dataset | # of TFs | # total | # A vs A | # E vs E | # L vs L | # E vs L |
|---|---|---|---|---|---|---|
| Liver #comb | 252 | 1021/31626 | 20/342 | 11/166 | 2/48 | 7/128 |
| Liver ratio | | 0.032 | 0.058 | 0.066 | 0.042 | 0.055 |
| Liver p-value | | X | 3.99E-03 | 7.85E-03 | 2.02E-01 | 5.60E-02 |
| Lung #comb | 257 | 960/32896 | 30/315 | 8/119 | 5/47 | 17/149 |
| Lung ratio | | 0.029 | 0.095 | 0.067 | 0.106 | 0.114 |
| Lung p-value | | X | 4.56E-09 | 8.24E-03 | 2.35E-03 | 3.91E-07 |
| Cortical #comb | 157 | 423/12246 | 19/291 | 9/144 | 0/33 | 10/114 |
| Cortical ratio | | 0.035 | 0.065 | 0.063 | 0.000 | 0.088 |
| Cortical p-value | | X | 2.72E-03 | 2.76E-02 | X | 1.93E-03 |
| Neuron #comb | 208 | 873/21528 | 30/351 | 16/90 | 8/85 | 6/176 |
| Neuron ratio | | 0.040 | 0.085 | 0.17 | 0.094 | 0.034 |
| Neuron p-value | | X | 4.47E-05 | 1.07E-07 | 7.47E-03 | X |
| Myoblast #comb | 230 | 875/26335 | 49/447 | 45/408 | 0/3 | 4/36 |
| Myoblast ratio | | 0.033 | 0.109 | 0.111 | 0.000 | 0.111 |
| Myoblast p-value | | X | 7.18E-14 | 5.50E-13 | X | 6.42E-03 |
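The ratio rows are the #comb counts divided out, and each is compared against the background ratio in the # total column. The sketch below recomputes one cell (Liver, A vs A) and attaches an enrichment p-value. The choice of a one-sided hypergeometric test and its tail convention are assumptions, as the table does not state how the p-values were obtained, so the computed p-value need not match the published one. The function name `enrichment` is hypothetical.

```python
from math import comb

def enrichment(hits, pairs, total_hits, total_pairs):
    """Return (ratio, background_ratio, p_value) for one table cell.

    p_value is the hypergeometric probability of seeing at least `hits`
    known interactions among `pairs` TF pairs drawn from `total_pairs`
    possible pairs, of which `total_hits` are known interactions.
    The test itself is an assumption, not taken from the source.
    """
    denom = comb(total_pairs, pairs)
    tail = sum(
        comb(total_hits, k) * comb(total_pairs - total_hits, pairs - k)
        for k in range(hits, min(pairs, total_hits) + 1)
    )
    # Exact integer arithmetic; only the final division produces a float.
    return hits / pairs, total_hits / total_pairs, tail / denom

# Liver, A vs A: 20 known interactions among 342 pairs; background 1021/31626.
ratio, background, p = enrichment(20, 342, 1021, 31626)
print(f"ratio={ratio:.3f} background={background:.3f} p={p:.2e}")
```

Working with exact integers avoids the underflow that floating-point binomial coefficients would hit at these sizes.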
Parameters of the CSHMM-TF model: θ = (V, π, S, A, E′).
| symbol | definition |
|---|---|
| | the observation alphabet |
| | the initial probability for each state, |
| | the set of states (each path has infinitely many states) |
| | the branch probability defined on each pair of paths, ∑ |
| | the transition probability defined on any pair of states |
| | the parameters associated with emission probability for a given state |
| | the variance vector for genes |
| | the matrix where each entry Ω |
| | the matrix where each entry Φ |
| | the set of split points |
| | the set of paths |
| | the number of genes (dimension of data) |
| | the set of TFs |
| λ | the hyperparameter for the L1 regularization that controls the sparsity of Δ |
Fig 5. Flow chart of the iterative learning procedure for CSHMM-TF.