Tianyu Wang1, Boyang Li2, Craig E Nelson3, Sheida Nabavi4.
Abstract
BACKGROUND: The analysis of single-cell RNA sequencing (scRNAseq) data plays an important role in understanding the intrinsic and extrinsic cellular processes in biological and biomedical research. One significant effort in this area is the detection of differentially expressed (DE) genes. scRNAseq data, however, are highly heterogeneous and have a large number of zero counts, which introduces challenges in detecting DE genes. Addressing these challenges requires employing new approaches beyond the conventional ones, which are based on a nonzero difference in average expression. Several methods have been developed for differential gene expression analysis of scRNAseq data. To provide guidance on choosing an appropriate tool or developing a new one, it is necessary to evaluate and compare the performance of differential gene expression analysis methods for scRNAseq data.
Keywords: Comparative analysis; Differential gene expression analysis; RNAseq; Single-cell
Year: 2019 PMID: 30658573 PMCID: PMC6339299 DOI: 10.1186/s12859-019-2599-6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Distributions of gene expression values of a total of 92 cells in two groups (ES and MEF) using real data show that scRNAseq data exhibit (a) different types of multimodality (DU, DP, DM, and DB) and (b) large amounts of zero counts. The X axis represents log-transformed expression values. To clearly show the multimodality of scRNAseq data, zero counts are removed from the distribution plots in (a)
Software tools for identifying DE genes using scRNAseq data
| Tool | Programming language | Input format | Model | Year/version | URL |
|---|---|---|---|---|---|
| SCDE | R | Read counts | Poisson and negative binomial model | 2014/2.2.0 | |
| MAST | R | TPM/FPKM | Generalized linear model | 2015/1.0.5 | |
| scDD | R | TPM/FPKM | Conjugate Dirichlet process mixture | 2016/0.99.0 | |
| EMDomics | R | TPM/FPKM | Non-parametric earth mover’s distance | 2016/2.4.0 | |
| D3E | Python | Read counts | Cramér-von Mises test, Kolmogorov-Smirnov test, likelihood ratio test | 2016/ | |
| Monocle2 | R | TPM/FPKM | Generalized additive model | 2014/2.2.0 | |
| SINCERA | R | TPM/FPKM/Read counts | Welch’s t-test and Wilcoxon rank sum test | 2015/ | |
| edgeR | R | Read counts | Negative binomial model, Exact test | 2010/3.16.5 | |
| DESeq2 | R | Read counts | Negative binomial model, Exact test | 2014/1.14.1 | |
| DEsingle | R | Read counts | Zero inflated negative binomial | 2018/1.2.0 | |
| SigEMD | R | TPM/FPKM | Non-parametric earth mover’s distance | 2018/0.21.1 | |
Fig. 2 ROC curves for the eleven differential gene expression analysis tools using simulated data
Numbers of detected DE genes, sensitivities, false positive rates, precisions, accuracies, and F1 scores of the eleven tools using simulated data for an adjusted p-value or FDR threshold of 0.05
| Tool | Number of detected DE genes | Sensitivity | False positive rate | Precision | Accuracy | F1 score |
|---|---|---|---|---|---|---|
| Monocle2 | 4664.6 | 0.785 | 0.172 | 0.337 | 0.824 | 0.472 |
| EMDomics | 2465.8 | 0.666 | 0.063 | 0.540 | 0.910 | 0.596 |
| DESeq2 | 2182.6 | 0.739 | 0.039 | 0.677 | 0.939 | 0.707 |
| D3E | 1683.4 | 0.565 | 0.031 | 0.671 | 0.929 | 0.613 |
| scDD | 1155.8 | 0.505 | 0.008 | 0.875 | 0.943 | 0.640 |
| MAST | 954.4 | 0.470 | 0.001 | 0.986 | 0.946 | 0.637 |
| edgeR | 1161.2 | 0.557 | 0.003 | 0.959 | 0.953 | 0.705 |
| SCDE | 842 | 0.419 | 0.0003 | 0.994 | 0.942 | 0.590 |
| SINCERA | 633.6 | 0.312 | 0.001 | 0.984 | 0.931 | 0.474 |
| DEsingle | 1448.8 | 0.697 | 0.003 | 0.962 | 0.967 | 0.808 |
| SigEMD | 1456 | 0.682 | 0.005 | 0.937 | 0.964 | 0.790 |
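The F1 score column can be cross-checked against the precision and sensitivity columns, since F1 is the harmonic mean of the two. The sketch below is illustrative only: the function name `f1_score` is an assumption, not something taken from the paper. Because the tabulated values are rounded, a recomputed score may differ from the reported one in the third decimal place.

```python
def f1_score(precision: float, sensitivity: float) -> float:
    """Harmonic mean of precision and sensitivity (recall)."""
    if precision + sensitivity == 0:
        return 0.0
    return 2 * precision * sensitivity / (precision + sensitivity)


# (precision, sensitivity, reported F1) copied from the table above
reported = {
    "DEsingle": (0.962, 0.697, 0.808),
    "edgeR": (0.959, 0.557, 0.705),
    "MAST": (0.986, 0.470, 0.637),
    "SigEMD": (0.937, 0.682, 0.790),
}

for tool, (precision, sensitivity, f1_reported) in reported.items():
    f1 = f1_score(precision, sensitivity)
    print(f"{tool}: recomputed F1 = {f1:.3f}, reported = {f1_reported:.3f}")
```

The same check can be run on any other row by adding its precision, sensitivity, and reported F1 to the `reported` dictionary.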
Fig. 3 True detection rates for different scenarios of DE genes and non-DE genes using simulated data. (a) True positive rates for DE genes under the DU, DP, DM, and DB scenarios. (b) True negative rates for non-DE genes under the EP and EE scenarios
Numbers of detected DE genes and sensitivities of the eleven tools using positive control real data for an adjusted p-value or FDR threshold of 0.05
| Tool | Number of detected DE genes | Sensitivity (TP/1000 gold standard) |
|---|---|---|
| Monocle2 | 8674 | 0.765 |
| EMDomics | 8437 | 0.762 |
| DESeq2 | 7612 | 0.695 |
| D3E | 8401 | 0.722 |
| scDD | 2638 | 0.351 |
| MAST | 734 | 0.198 |
| edgeR | 4447 | 0.58 |
| SCDE | 2414 | 0.392 |
| SINCERA | 8366 | 0.73 |
| DEsingle | 9031 | 0.797 |
| SigEMD | 3702 | 0.488 |
Fig. 4 Total numbers of significantly DE genes detected by each tool at a p-value or FDR threshold of 0.05, and their overlaps with the 1000 gold standard genes
Numbers of detected DE genes and false positive rates of the eleven tools using negative control real data for an adjusted p-value or FDR threshold of 0.05
| Tool | Number of detected DE genes | False positive rate (FP/(FP + TN)) |
|---|---|---|
| Monocle2 | 917 | 0.126 |
| EMDomics | 733 | 0.101 |
| DESeq2 | 19 | 0.003 |
| D3E | 160 | 0.022 |
| scDD | 5 | 0.0007 |
| MAST | 0 | 0 |
| edgeR | 0 | 0 |
| SCDE | 0 | 0 |
| SINCERA | 0 | 0 |
| DEsingle | 4 | 0.0005 |
| SigEMD | 50 | 0.007 |
Fig. 5 Numbers of pairwise common DE genes among the top 1000 genes identified by each tool in real data
Fig. 6 Numbers of pairwise common DE genes identified by each tool at an adjusted p-value < 0.05 in real data
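The pairwise comparison summarized in Figs. 5 and 6 amounts to intersecting the gene sets reported by each pair of tools. A minimal sketch follows, assuming each tool's output is available as a list of gene identifiers ordered from most to least significant. The function name `pairwise_overlap`, the tool names, and the gene identifiers are all hypothetical and serve only to illustrate the counting.

```python
from itertools import combinations


def pairwise_overlap(ranked_genes, top_k=1000):
    """Count genes shared by each pair of tools among their top_k ranked genes.

    ranked_genes maps a tool name to a list of gene IDs, most significant first.
    """
    top_sets = {tool: set(genes[:top_k]) for tool, genes in ranked_genes.items()}
    return {
        (a, b): len(top_sets[a] & top_sets[b])
        for a, b in combinations(sorted(top_sets), 2)
    }


# Toy rankings (hypothetical data, not from the paper)
toy_rankings = {
    "toolA": ["g1", "g2", "g3", "g4"],
    "toolB": ["g2", "g3", "g5", "g1"],
    "toolC": ["g6", "g1", "g2", "g7"],
}

overlaps = pairwise_overlap(toy_rankings, top_k=3)
for pair, n_common in overlaps.items():
    print(pair, n_common)
```

For a threshold-based comparison like the one in Fig. 6, the ranked lists would be replaced by the genes each tool calls significant, with `top_k` set large enough to keep them all.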
Fig. 7 Effect of sample size (number of cells) on detecting DE genes. The sample size is on the horizontal axis, from 10 to 400 cells in each condition. Effect of sample size on (a) TPR, (b) FPR, (c) accuracy (= (TP + TN)/(TP + FP + TN + FN)), and (d) precision (= TP/(TP + FP)). A threshold of 0.05 is used for FDR or adjusted p-value
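The four quantities in this caption follow directly from the confusion counts. The sketch below uses the accuracy and precision definitions given in the caption together with the standard definitions TPR = TP/(TP + FN) and FPR = FP/(FP + TN). The function name `detection_metrics` and the toy gene sets are assumptions for illustration, and the called and true DE genes are assumed to be subsets of the tested genes.

```python
def detection_metrics(called, truth, all_genes):
    """Compute TPR, FPR, accuracy, and precision from sets of gene IDs.

    called:    genes a tool reports as DE
    truth:     genes that are truly DE
    all_genes: every gene that was tested
    """
    called, truth, all_genes = set(called), set(truth), set(all_genes)
    tp = len(called & truth)
    fp = len(called - truth)
    fn = len(truth - called)
    tn = len(all_genes - called - truth)
    total = tp + fp + tn + fn
    return {
        "TPR": tp / (tp + fn) if tp + fn else 0.0,
        "FPR": fp / (fp + tn) if fp + tn else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


# Toy example (hypothetical data): 10 genes, 4 truly DE, 3 called DE
all_genes = [f"g{i}" for i in range(10)]
truth = ["g0", "g1", "g2", "g3"]
called = ["g0", "g1", "g4"]

metrics = detection_metrics(called, truth, all_genes)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Repeating this calculation for each sample size and tool would give the points for one curve per panel.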
Number of KEGG gene sets and GO terms enriched by the top 300 DE genes identified by each tool under an FDR threshold of 0.05
| Methods | KEGG | GO Term |
|---|---|---|
| EMDomics | 53 | 19 |
| MAST | 10 | 5 |
| D3E | 49 | 10 |
| SCDE | 21 | 9 |
| Monocle2 | 42 | 24 |
| SINCERA | 39 | 16 |
| scDD | 26 | 1 |
| DESeq2 | 39 | 19 |
| edgeR | 39 | 17 |
| SigEMD | 23 | 15 |
| DEsingle | 41 | 21 |
Scores from word and phrase significance analysis of each tool to recover biologically relevant terms and phrases
| Methods | Score (phrase) | Score (word) | Overall score (word+phrase) |
|---|---|---|---|
| Monocle2 | 3 | 3 | 6 |
| MAST | 3 | 3 | 6 |
| DESeq2 | 2 | 3 | 5 |
| D3E | 2 | 3 | 5 |
| DEsingle | 2 | 3 | 5 |
| SigEMD | 3 | 2 | 5 |
| EMDomics | 2 | 2 | 4 |
| edgeR | 2 | 2 | 4 |
| SINCERA | 2 | 2 | 4 |
| SCDE | 1 | 1 | 2 |
| scDD | 1 | 1 | 2 |
Average runtime of identifying DE genes in real data by each tool
| Methods | Platform | Time consumption in minutes |
|---|---|---|
| DESeq2 | R | 4.2 |
| edgeR | R | 0.41 |
| scDD | R | 85.13 |
| EMDomics | R | 14.64 |
| MAST | R | 1.47 |
| D3E | Python | 38.43 |
| Monocle2 | R | 2.6 |
| SCDE | R | 10.39 |
| SINCERA | R | 0.3 |
| DEsingle | R | 14.97 |
| SigEMD | R | 14.86 |