Will Macnair, Revant Gupta, Manfred Claassen.
Abstract
MOTIVATION: Improvements in single-cell RNA-seq technologies mean that studies measuring multiple experimental conditions, such as time series, have become more common. At present, few computational methods exist to infer time series-specific transcriptome changes, and such studies have therefore typically used unsupervised pseudotime methods. While these methods identify cell subpopulations and the transitions between them, they are not appropriate for identifying the genes that vary coherently along the time series. In addition, the orderings they estimate are based only on the major sources of variation in the data, which may not correspond to the processes related to the time labels.
Year: 2022 PMID: 35758781 PMCID: PMC9235474 DOI: 10.1093/bioinformatics/btac227
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931
Fig. 1. (A) Inputs to psupertime are single-cell RNA-seq data, where the cells have sequential labels associated with them. psupertime then identifies a sparse set of ordering coefficients for the genes. Multiplying the gene expression values by this vector of coefficients gives a pseudotime value for each cell, and these values place the labels approximately in sequence. (B) Cartoon of the statistical model used by psupertime, including thresholds between labels. Where there is a sequence of K condition labels, psupertime learns K−1 simultaneous (i.e. sharing coefficients) logistic regressions, each seeking to separate the earlier labels from the later ones at one of the K−1 thresholds. (C) Dimensionality reduction of data from 411 human acinar cells, with donor ages ranging from 1 to 54 (Enge et al.). Two-dimensional representation calculated with the non-linear dimensionality reduction technique UMAP. Colours indicate donor age. (D) Distributions of donor ages for acinar cells over the pseudotime learned by psupertime. Vertical lines indicate thresholds learned by psupertime distinguishing between earlier and later sets of labels; colour corresponds to the next later label. (E) Expression values of selected genes (the five with the largest absolute coefficients; see Supplementary Fig. S2 for the 20 largest). The x-axis is the psupertime value learned for each cell; the y-axis is the z-scored gene expression value. Gene labels also show the Kendall’s τ correlation between the sequential labels (treated as a sequence of integers) and gene expression.
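The model described in Fig. 1A and B can be illustrated with a short Python sketch. It is not the psupertime implementation: the function names, the toy expression matrix, the coefficients and the thresholds are all assumptions chosen for illustration, and the cumulative-logit form used here is only one common way of writing K−1 logistic regressions that share coefficients.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pseudotime(X, beta):
    """Project each cell (row of X) onto the gene coefficient vector."""
    return X @ beta

def label_probabilities(t, thresholds):
    """Probabilities of each of K ordered labels, given K-1 increasing thresholds.

    Each threshold defines one logistic regression, P(label is later than k);
    all of them share the same pseudotime t, i.e. the same coefficients.
    """
    p_later = sigmoid(t[:, None] - thresholds[None, :])      # (n_cells, K-1)
    n = t.shape[0]
    # Pad with P(later than nothing) = 1 and P(later than last label) = 0,
    # then take differences to obtain per-label probabilities.
    padded = np.hstack([np.ones((n, 1)), p_later, np.zeros((n, 1))])
    return padded[:, :-1] - padded[:, 1:]                    # (n_cells, K)

# Toy data (hypothetical): 4 cells x 3 genes, z-scored expression.
X = np.array([
    [-2.0,  0.3,  0.0],
    [ 0.0,  1.0,  0.0],
    [ 0.5, -1.0, -1.0],
    [ 2.0,  0.0, -2.0],
])
beta = np.array([1.0, 0.0, -0.5])    # sparse: the second gene is not used
thresholds = np.array([-1.0, 1.5])   # K - 1 = 2 thresholds, so K = 3 labels

t = pseudotime(X, beta)
probs = label_probabilities(t, thresholds)
predicted = probs.argmax(axis=1)     # most probable label index per cell
```

In this sketch a zero coefficient removes a gene from the ordering entirely, which is why a sparse coefficient vector can be read as a short list of genes relevant to the label sequence.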
Table 1. Details of datasets used in benchmarking of pseudotime cell orderings
| Dataset name | Source | Accession | Labels used | No. of labels | No. of cells | No. of highly varying genes |
|---|---|---|---|---|---|---|
| Acinar cells | Enge et al. | GSE81547 | Donor age | 8 | 411 | 827 |
| Human germline, F | | GSE86146 | Age (weeks) | 12 | 992 | 1081 |
| Embryonic beta cells | | GSE87375 | Developmental stage | 7 | 575 | 2666 |
| Human ESCs | | E-MTAB-3929 | Embryonic day | 5 | 1529 | 2876 |
| MEF to neurons | | GSE67310 | Days since induction | 5 | 315 | 1698 |
| Colon cells | Herring et al. | GSE102698 | User-selected clusters | 4, 5 | 1894 | 1515 |
| iPSCs | | GSE106340 | Days during reprogramming | 11 | 3600 | 731 |
Fig. 2. Performance of psupertime against benchmark methods. See Section 2.8 for details of data processing and use of benchmark methods. All results for (A–C) are based on data from 411 aging human acinar cells, with donor ages ranging from 1 to 54 (Enge et al.), using 827 highly variable genes. Colours indicate donor age. (A) Projection of acinar cells into the first two principal components (% of variance explained shown). Curves learned by slingshot are shown (note that here we show the projection of these curves into the first two principal components). (B) Projection of acinar cells into the dimensionality reduction calculated by Monocle 2, annotated with the pseudotime learned by Monocle 2 (Qiu et al.). (C) Results of benchmark pseudotime methods applied to the acinar data. For each method, the x-axis is a one-dimensional representation of each cell (see Section 2.8), scaled to a common range and oriented in the direction with the highest positive correlation with the label sequence. The y-axis is the density of the distribution for each label used as input, as calculated by the function geom_density in the R package ggplot2. (D) Performance of psupertime and benchmark classifiers in identifying simulated time-series genes. Precision-recall curves are based on identification of time-series genes via variable importance measures for each method (see Section 2.7). Line and area show the mean and ±2 standard errors, respectively, over 20 simulations. Recall is limited to the range 0–10%. Panels correspond to simulations with different proportions of time-series (TS) genes; all panels include 10% batch effect genes, which are sample-specific. (E) Absolute Kendall’s τ correlation coefficient between label sequences (treated as sequences of integers) and calculated pseudotimes. Error bars show the 95% confidence interval over 1000 bootstraps, calculated with the boot package in R. For Tempora, this calculation was performed using the scipy package in Python. Datasets are specified in Table 1.
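The statistic in Fig. 2E can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: the helper names and the simulated labels and pseudotimes are hypothetical, and a plain percentile bootstrap is used here, which may differ from the interval type produced by the R boot package. Only `scipy.stats.kendalltau` and standard NumPy calls are used.

```python
import numpy as np
from scipy.stats import kendalltau

def abs_kendall_tau(labels, pseudotimes):
    """Absolute Kendall's tau between integer-coded labels and pseudotimes."""
    tau, _ = kendalltau(labels, pseudotimes)
    return abs(tau)

def bootstrap_ci(labels, pseudotimes, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval, resampling cells with replacement."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        stats[b] = abs_kendall_tau(labels[idx], pseudotimes[idx])
    alpha = (1.0 - level) / 2.0
    low, high = np.quantile(stats, [alpha, 1.0 - alpha])
    return low, high

# Simulated example (hypothetical): 4 ordered labels, 50 cells each,
# with pseudotimes equal to the label index plus Gaussian noise.
rng = np.random.default_rng(1)
labels = np.repeat(np.arange(4), 50)
pseudotimes = labels + rng.normal(scale=0.5, size=labels.size)

tau_abs = abs_kendall_tau(labels, pseudotimes)
ci_low, ci_high = bootstrap_ci(labels, pseudotimes)
```

Taking the absolute value makes the score insensitive to the direction of an unsupervised ordering, so a method that recovers the sequence in reverse is not penalised.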
Table 2. psupertime performance and timings on comparison datasets
| Dataset name | Accuracy (%) | Time taken (s) | Sparsity (%) |
|---|---|---|---|
| Acinar cells | 75.7 ± 1.1 | 5.5 ± 0.41 | 90.6 ± 1.8 |
| Human germline, F | 43.4 ± 1.5 | 25 ± 0.73 | 80.4 ± 4.9 |
| Embryonic beta cells | 78.5 ± 0.9 | 19 ± 0.67 | 96.4 ± 0.4 |
| Human ESCs | 97.6 ± 0.2 | 35 ± 0.80 | 90.0 ± 1.1 |
| MEF to neurons | 89.6 ± 1.7 | 4.7 ± 0.062 | 96.6 ± 0.4 |
Note: Mean and standard deviation of psupertime accuracy, timing and sparsity calculated over 10 random seeds.