Xiutao Pan1, Zhong Li2, Shengwei Qin1, Minzhe Yu1, Hang Hu1.
Abstract
BACKGROUND: Single-cell RNA sequencing (scRNA-seq) reveals gene expression patterns at single-cell resolution. However, owing to current technical limitations, dropout events in scRNA-seq introduce missing data and noise into the gene-cell expression matrix and adversely affect downstream analyses. The true gene expression levels should therefore be recovered before downstream analysis is carried out.
Keywords: Data imputation; Low-rank tensor; Single-cell RNA-seq
Year: 2021 PMID: 34844559 PMCID: PMC8628418 DOI: 10.1186/s12864-021-08101-3
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Parameter settings corresponding to respective datasets in our experiment
| Dataset | K | P | epsilon | | |
|---|---|---|---|---|---|
| Pollen | 10 | 10 | [1 1e-1 2e-3] | 1e-5 | 1e-2 |
| Usoskin | 8 | 8 | [1 1e-2 1e-3] | 1e-5 | 1e-2 |
| Yan | 5 | 5 | [1 1e-2 2e-3] | 1e-5 | 1e-2 |
| Zeisel | 20 | 20 | [1 1e-2 1e-3] | 1e-5 | 1e-2 |
| Mouse | 10 | 10 | [1 1e-2 1e-3] | 1e-5 | 1e-2 |
| PBMC | 10 | 10 | [1 1e-2 1e-3] | 1e-5 | 1e-3 |
| Chen | 10 | 10 | [1 1e-2 1e-3] | 1e-5 | 1e-3 |
| Loh | 11 | 11 | [1 1e-2 2e-3] | 2e-4 | 1e-2 |
| Petropoulos | 10 | 10 | [1 1e-2 1e-3] | 1e-5 | 1e-2 |
| Simulation dataset | 15 | 8 | [1 1e-1 1e-3] | 1e-5 | 1e-5 |
Parameter settings of simulation datasets generated from Splatter
| Parameter | Simulation dataset | Parameter | Simulation dataset |
|---|---|---|---|
| version | 1.10.1 | dropout.type | “group” |
| nGenes | 1000 | method | “groups” |
| nCells | 500 | de.prob | c(0.05, 0.08, 0.01) |
| group.prob | c(0.3, 0.3, 0.4) | de.facLoc | 0.5 |
| dropout.shape | c(ds, ds, ds), ds ∈ {−0.3, 0, 0.05, 0.25} | de.facScale | 0.8 |
| dropout.mid | Default | dropout.present | Null |
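
To illustrate how `dropout.shape` (the `ds` values above) changes the amount of simulated dropout, the following is a minimal Python sketch. It assumes that Splatter models the dropout probability as a logistic function of the log mean expression, with `dropout.mid` as the midpoint and `dropout.shape` as the slope, and that the default midpoint is 0. The function name `dropout_probability` and the example means are illustrative and do not come from the paper or from Splatter itself.

```python
import numpy as np

def dropout_probability(mean_expr, mid=0.0, shape=-1.0):
    """Assumed logistic dropout model: p = 1 / (1 + exp(-shape * (log(mean) - mid)))."""
    log_mean = np.log(np.asarray(mean_expr, dtype=float))  # requires strictly positive means
    return 1.0 / (1.0 + np.exp(-shape * (log_mean - mid)))

# Hypothetical mean expression levels, evaluated for each ds value in the table.
means = np.array([0.5, 1.0, 5.0, 50.0])
for ds in (-0.3, 0.0, 0.05, 0.25):
    print(ds, dropout_probability(means, mid=0.0, shape=ds).round(3))
```

Under this assumed model, a negative shape makes highly expressed genes less likely to drop out, a shape of 0 gives every entry the same probability of 0.5, and a positive shape increases dropout with expression.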
Fig. 1 SC3 clustering comparison of scLRTC and other methods on different datasets. (A) Comparison of ARI values obtained by SC3 clustering on 6 scRNA-seq datasets using different algorithms. (B) Comparison of NMI values obtained by SC3 clustering on 6 scRNA-seq datasets using different algorithms.
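
The ARI used in panel (A) compares a predicted clustering with reference labels by counting agreeing pairs and correcting for chance. The sketch below is a minimal Python version of the standard pair-counting formula; the function name `adjusted_rand_index` and the toy labels are illustrative, and this is not the code used in the paper (SC3 itself is an R package).

```python
import numpy as np
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index from the contingency table of two labelings."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    # Map arbitrary labels to consecutive integers 0..k-1.
    _, t = np.unique(labels_true, return_inverse=True)
    _, p = np.unique(labels_pred, return_inverse=True)
    contingency = np.zeros((t.max() + 1, p.max() + 1), dtype=int)
    np.add.at(contingency, (t, p), 1)
    # Pair counts within cells, rows, and columns of the contingency table.
    sum_cells = sum(comb(int(n), 2) for n in contingency.ravel())
    sum_rows = sum(comb(int(n), 2) for n in contingency.sum(axis=1))
    sum_cols = sum(comb(int(n), 2) for n in contingency.sum(axis=0))
    total = comb(len(labels_true), 2)  # assumes at least two samples
    expected = sum_rows * sum_cols / total
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:  # degenerate case, e.g. a single cluster in both labelings
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

# Toy example: the same partition under different label names.
print(adjusted_rand_index([0, 0, 1, 1], ["b", "b", "a", "a"]))
```

The score depends only on the partition, not on the label names, so relabeled but identical clusterings score 1.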
Fig. 2 Performance analysis and comparison of scLRTC and other methods using t-SNE + K-means clustering. (A) ARI obtained by t-SNE + K-means clustering on the Pollen dataset using different algorithms. (B) ARI obtained by t-SNE + K-means clustering on the Usoskin dataset using different algorithms. In (A) and (B), an asterisk indicates a statistically significant difference (P < 0.05) between scLRTC and the imputation method of interest according to the Wilcoxon rank-sum test. (C) Running time of scLRTC for datasets with different sample sizes. Different tensor size settings are represented by different colors, where n is the number of genes.
Fig. 3 UMAP visualization and SC comparison on Yan and Pollen datasets. (A) UMAP visualization and SC of the Yan dataset. (B) UMAP visualization and SC of the Pollen dataset.
Fig. 4 UMAP visualization and SC comparison on Usoskin and Zeisel datasets. (A) UMAP visualization and SC of the Usoskin dataset. (B) UMAP visualization and SC of the Zeisel dataset.
Fig. 5 Imputation accuracy analysis and comparison of scLRTC and other methods in the real-data masking experiment. (A) Scatter plots of the masked imputation data and original data. The x-axis corresponds to the true value of the masked data point, and the y-axis represents the imputed value. The closer the points are to the red centerline, the higher the accuracy of imputation. SSE and PCC are shown in the upper left corner (data is transformed by log10(X + 1)). (B) PCC values computed with 7 imputation methods in 5 repeated experiments. (C) SSE values computed with 7 imputation methods in 5 repeated experiments.
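
The two accuracy measures in this figure can be sketched as follows in Python. This is a minimal illustration and makes two assumptions: that PCC and SSE are computed only over the masked positions, and that both are computed after the log10(X + 1) transform mentioned in the caption. The function name `masked_accuracy` and the toy matrices are hypothetical.

```python
import numpy as np

def masked_accuracy(original, imputed, mask):
    """PCC and SSE between log10(X + 1)-transformed values at the masked positions."""
    truth = np.log10(np.asarray(original, dtype=float)[mask] + 1.0)
    guess = np.log10(np.asarray(imputed, dtype=float)[mask] + 1.0)
    sse = float(np.sum((truth - guess) ** 2))        # sum of squared errors
    pcc = float(np.corrcoef(truth, guess)[0, 1])     # Pearson correlation coefficient
    return pcc, sse

# Toy example: three entries were masked and then imputed with small errors.
original = np.array([[1.0, 5.0], [10.0, 0.0]])
imputed = np.array([[2.0, 4.0], [9.0, 0.0]])
mask = np.array([[True, True], [True, False]])
pcc, sse = masked_accuracy(original, imputed, mask)
print(pcc, sse)
```

A perfect imputation gives an SSE of 0 and a PCC of 1, which corresponds to all points lying on the centerline in panel (A).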
Fig. 6 Visual distinguishability and comparison of scLRTC and other methods on simulation datasets with various dropout rates. We use t-SNE to visualize the gene-cell expression matrix after imputation with different algorithms. Each column represents a ds value, which controls the ratio of zeros.
Fig. 7 Imputation accuracy analysis and comparison of scLRTC and other methods under different dropout rates. (A) PCC values computed between the Full data (without dropout) and both the raw data (with dropout) and the imputed data. (B) SSE values computed between the Full data and both the raw data and the imputed data.
Fig. 8 Correlation analysis and comparison of scLRTC and other methods. The more similar the heat map is to the raw heat map, the better the imputation. (A) Heat map of the cell-cell correlation matrix. (B) Heat map of the gene-gene correlation matrix.
Fig. 9 Violin plots of the expression distribution and accuracy measurements of DE genes for scLRTC and other methods. (A) Violin plots of the expression distribution after imputation when the dropout rate is 40% (data is transformed by log10(X + 1)). The more similar the shape of the violin is to FULL, the more effective the imputation. (B) ROC curves and AUC scores of DE genes with different imputation methods. AUC combines the recall rate and precision rate, and a value closer to 1 indicates a better imputation method. Here, the recall rate is defined as the number of true positives divided by the total number of samples that actually belong to the positive class, and the precision rate is the number of true positives divided by the total number of samples labelled as belonging to the positive class.
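
The recall and precision definitions given in this caption translate directly into code. The Python sketch below applies them to sets of gene identifiers; the function name `recall_precision` and the gene names are hypothetical and serve only to make the two ratios concrete.

```python
def recall_precision(true_de, called_de):
    """Recall = TP / (actual positives); precision = TP / (samples labelled positive)."""
    true_de, called_de = set(true_de), set(called_de)
    tp = len(true_de & called_de)                        # true positives
    recall = tp / len(true_de) if true_de else 0.0
    precision = tp / len(called_de) if called_de else 0.0
    return recall, precision

# Toy example: 4 truly DE genes, 3 genes called DE, 2 of them correct.
recall, precision = recall_precision(
    true_de=["g1", "g2", "g3", "g4"],
    called_de=["g2", "g3", "g5"],
)
print(recall, precision)
```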
Fig. 10 Cell trajectory inference analysis of scLRTC and other methods. Lineage reconstruction on the Petropoulos dataset is performed and visualized with TSCAN. Lines represent the developmental trajectory of cells, and each cell type (E3 to E7) represents a stage of cell development. Cells should be distributed along the cell trajectory. The POS and Kendall’s rank correlation scores, which quantify this process, are also provided.
A summary of nine real scRNA-seq datasets used in our experiment
| Dataset | Number of clusters | Number of cells | Number of genes | Standard |
|---|---|---|---|---|
| Pollen | 11 | 301 | 23,730 | Gold |
| Usoskin | 4 | 622 | 25,334 | Silver |
| Yan | 7 | 90 | 20,214 | Gold |
| Zeisel | 9 | 3005 | 19,972 | Gold |
| Mouse | 16 | 2100 | 20,670 | Gold |
| PBMC | 8 | 4340 | 33,694 | Silver |
| Chen | 46 | 12,089 | 23,284 | Silver |
| Loh | 8 | 429 | 23,794 | Gold |
| Petropoulos | 5 | 1529 | 21,749 | Copper |
Fig. 11 The overall framework of scLRTC. For the scRNA-seq dataset A, it uses the PCC to select the closest K cells and construct N K × M low-rank matrices B. Then it applies the Euclidean, Cosine, and Chebyshev distances to select the closest P low-rank matrices and construct N K × M × P low-rank tensors C. Next, it uses the ADMM algorithm to impute the low-rank tensors C and obtain the updated tensors D. Finally, it extracts the cell vector from each low-rank tensor in D and integrates these vectors to obtain the imputed scRNA-seq expression matrix E.
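
The construction steps of this framework (A to B to C) can be sketched in Python as below. This is a rough illustration under several assumptions, not the authors' implementation: A is taken to be a cells × genes array (N × M), each cell's K neighbours are chosen by the largest PCC, and the P nearest matrices are chosen with the Euclidean (Frobenius) distance only, because the caption does not say how the Euclidean, Cosine, and Chebyshev distances are combined. The function names and the random toy data are hypothetical, and the ADMM imputation step is omitted.

```python
import numpy as np

def build_neighbor_matrices(A, K):
    """For each cell (row of A), stack its K most PCC-similar cells into a K x M matrix."""
    corr = np.corrcoef(A)                       # N x N Pearson correlation between cells
    order = np.argsort(-corr, axis=1)[:, :K]    # indices of the K highest-PCC cells per cell
    return A[order]                             # shape (N, K, M)

def build_tensors(B, P):
    """Group each K x M matrix with its P nearest matrices into a K x M x P tensor.

    Assumption: only the Euclidean (Frobenius) distance is used here.
    """
    N = B.shape[0]
    flat = B.reshape(N, -1)
    sq = np.sum(flat ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * flat @ flat.T   # squared pairwise distances
    nearest = np.argsort(dist2, axis=1)[:, :P]                # P closest matrices per matrix
    return np.stack([np.moveaxis(B[idx], 0, -1) for idx in nearest])  # shape (N, K, M, P)

# Toy data: 12 cells, 30 genes, Poisson counts (no real dataset is used here).
rng = np.random.default_rng(0)
A = rng.poisson(2.0, size=(12, 30)).astype(float)
B = build_neighbor_matrices(A, K=4)
C = build_tensors(B, P=3)
print(B.shape, C.shape)
```

Cells with constant expression would yield undefined correlations, so a real pipeline would need to filter or handle them before this step.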
Algorithm listing not recoverable from the source; the surviving step labels are "Construction of low rank tensor" and "extract".