Rui Hong1,2, Yusuke Koga1,2, Shruthi Bandyadka1,3, Anastasia Leshchyk1,2, Yichen Wang2, Vidya Akavoor2,4, Xinyun Cao4, Irzam Sarfraz2, Zhe Wang1,2, Salam Alabdullatif2, Frederick Jansen4, Masanao Yajima5, W Evan Johnson1,2, Joshua D Campbell6,7.
Abstract
Single-cell RNA sequencing (scRNA-seq) can be used to gain insights into cellular heterogeneity within complex tissues. However, various technical artifacts can be present in scRNA-seq data and should be assessed before performing downstream analyses. While several tools have been developed to perform individual quality control (QC) tasks, they are scattered in different packages across several programming environments. Here, to streamline the process of generating and visualizing QC metrics for scRNA-seq data, we built the SCTK-QC pipeline within the singleCellTK R package. The SCTK-QC workflow can import data from several single-cell platforms and preprocessing tools and includes steps for empty droplet detection, generation of standard QC metrics, prediction of doublets, and estimation of ambient RNA. It can run on the command line, within the R console, on a cloud platform, or with an interactive graphical user interface. Overall, the SCTK-QC pipeline streamlines and standardizes the process of performing QC for scRNA-seq data.
Year: 2022 PMID: 35354805 PMCID: PMC8967915 DOI: 10.1038/s41467-022-29212-9
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1 Overview of the SCTK-QC pipeline.
The SCTK-QC pipeline is developed in R and can import datasets generated by various preprocessing tools. The pipeline incorporates various software tools to perform QC on the Droplet and/or Cell matrices within each sample. Tools are included for calculation of standard metrics such as the number of Unique Molecular Identifiers (UMIs) per cell, detection of empty droplets, prediction of doublets, and estimation of contamination from ambient RNA. The pipeline uses the SingleCellExperiment (SCE) R object to store assay data and the derived QC metrics. Data visualization and report generation can subsequently be performed on the imported dataset based on user-specified parameters. All data can be exported to a Seurat object, a Python AnnData object, or Market Exchange Format (MEX) and .txt flat files to facilitate analysis in downstream workflows.
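To make the "standard metrics" mentioned above concrete, the following is a minimal conceptual sketch in plain Python of how per-cell total counts, the number of detected features, and a gene-set percentage could be derived from a count matrix. It is an illustration under stated assumptions and not the singleCellTK or scater implementation: the function name `per_cell_qc`, the `MT-` prefix convention for mitochondrial genes, and the toy matrix are all hypothetical.

```python
# Conceptual sketch only: this is NOT the singleCellTK or scater API.
# Rows are genes, columns are cells; all names and values are hypothetical.

def per_cell_qc(counts, gene_names, gene_set_prefix="MT-"):
    """Return per-cell total counts, detected features, and gene-set percentage."""
    n_cells = len(counts[0])
    metrics = []
    for cell in range(n_cells):
        # Extract one column (one cell) from the gene-by-cell matrix.
        column = [row[cell] for row in counts]
        total = sum(column)
        # A feature counts as "detected" if it has at least one count.
        detected = sum(1 for value in column if value > 0)
        # Counts falling in the gene set (assumed here: names starting with "MT-").
        subset = sum(
            value
            for value, name in zip(column, gene_names)
            if name.startswith(gene_set_prefix)
        )
        # Guard against division by zero for droplets with no counts.
        percent = 100.0 * subset / total if total > 0 else 0.0
        metrics.append(
            {"total": total, "detected": detected, "subset_percent": percent}
        )
    return metrics


gene_names = ["ACTB", "CD3E", "MT-CO1", "MT-ND1"]
counts = [
    [5, 0, 2],  # ACTB
    [3, 0, 0],  # CD3E
    [1, 0, 6],  # MT-CO1
    [1, 0, 2],  # MT-ND1
]
qc = per_cell_qc(counts, gene_names)
```

In this toy input the second column has no counts at all, which is the kind of barcode that empty droplet detection is meant to flag, and the third column has a high gene-set percentage.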
Functions available in the singleCellTK package and the SCTK-QC pipeline along with the corresponding wrapper functions.
| SCTK QC modules | Methods | Goal | Packages integrated | Function |
|---|---|---|---|---|
| runDropletQC | runBarcodeRankDrops | Calculate barcode ranks | DropletUtils | barcodeRanks |
| runDropletQC | runEmptyDrops | Detection of empty droplets | DropletUtils | emptyDrops |
| runDropletQC | runPerCellQC | Compute general quality control metrics | scater | addPerCellQC |
| runCellQC | runPerCellQC | Compute general quality control metrics | scater | addPerCellQC |
| runCellQC | runScrublet | Doublet detection | Scrublet | scrub_doublets* |
| runCellQC | runScDblFinder | Doublet detection | scDblFinder | scDblFinder |
| runCellQC | runDoubletFinder | Doublet detection | DoubletFinder | doubletFinder_v3 |
| runCellQC | runCxds | Doublet detection | scds | cxds |
| runCellQC | runBcds | Doublet detection | scds | bcds |
| runCellQC | runCxdsBcdsHybrid | Doublet detection | scds | cxds_bcds_hybrid |
| runCellQC | runDecontX | Detect ambient RNA contamination | celda | decontX |
The algorithms used to generate quality control (QC) metrics in the SCTK-QC pipeline and their corresponding SCTK-QC wrapper functions. The asterisk denotes Python functions.
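As a rough illustration of the "Calculate barcode ranks" goal listed in the table, the sketch below ranks droplets by their total UMI count, with the highest count receiving rank 1 and tied droplets sharing their average rank. This is a simplified stand-in written in plain Python, not the DropletUtils `barcodeRanks` implementation; the function name `barcode_ranks`, the tie-handling choice, and the example totals are assumptions made for illustration.

```python
# Conceptual sketch only: a simplified ranking, NOT the DropletUtils algorithm.
# The function name and the example totals are hypothetical.

def barcode_ranks(totals):
    """Rank droplets by total UMI count, highest first; ties share an average rank."""
    # Indices of droplets sorted from the largest total to the smallest.
    order = sorted(range(len(totals)), key=lambda i: -totals[i])
    ranks = [0.0] * len(totals)
    pos = 0
    while pos < len(order):
        # Extend `end` over the run of droplets tied with the one at `pos`.
        end = pos
        while end + 1 < len(order) and totals[order[end + 1]] == totals[order[pos]]:
            end += 1
        # Ranks are 1-based; tied droplets receive the mean of their positions.
        average = (pos + 1 + end + 1) / 2.0
        for k in range(pos, end + 1):
            ranks[order[k]] = average
        pos = end + 1
    return ranks


totals = [1200, 15, 980, 15, 3]
ranks = barcode_ranks(totals)
```

Plotting total counts against these ranks on log-log axes gives the familiar barcode rank plot, in which cell-containing droplets typically sit on the high-count plateau.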
Fig. 2 Generation of HTML reports for visualization and assessment of QC metrics.
The functions reportDropletQC() and reportCellQC() generate extensive HTML reports that display the data produced by the QC tools applied by runDropletQC() and runCellQC(), respectively. The reportDropletQC() report contains figures visualizing identified empty droplets. The reportCellQC() report contains visualizations of total read counts, total genes detected, doublet scores, doublet calls, percentages of ambient RNA detected, and cell clusters identified by decontX. These reports are generated automatically by the SCTK-QC pipeline. Examples of reportDropletQC() (left) and reportCellQC() (right) reports are shown.
Fig. 3 Interactive QC of single cell data using a Graphical User Interface (GUI).
An R/Shiny GUI can be used to interactively run QC algorithms in the singleCellTK package. A screenshot of the “Data QC & Filtering” tab from the interactive GUI is shown. After importing the data, quality control is performed within the “QC & Filtering” tab (red) of the user interface. QC algorithms are chosen from a list (blue), and their parameters can be adjusted (green). Plots displaying the metrics generated by each QC tool appear in a tab on the right.
Comparison of features in the SCTK-QC pipeline with other single-cell analysis toolkits.
| | SCTK | PIVOT | Seurat | ascend | scRNABatchQC | Adobo | SCONE | SCHNAPPs | iS-CellR | Granatum | ASAP browser |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10x CellRanger | ✓ | ✓ | ✓ | ✓ | |||||||
| SCE Object | ✓ | ✓ | ✓ | ✓ | ✓ | ||||||
| Seurat Object | ✓ | ✓ | |||||||||
| AnnData | ✓ | ✓ | |||||||||
| LOOM | |||||||||||
| BUStools | ✓ | ||||||||||
| SEQC | ✓ | ||||||||||
| STARSolo | ✓ | ||||||||||
| Optimus | ✓ | ||||||||||
| DropEst | ✓ | ||||||||||
| CSV, TXT, and MTX | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
| RSEM | ✓ | ||||||||||
| ✓ | |||||||||||
| Total counts | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Number of features detected | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Gene set count (e.g mitochondrial) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |||
| scDblFinder | ✓ | ||||||||||
| Scrublet | ✓ | ||||||||||
| doubletFinder | ✓ | ||||||||||
| cxds | ✓ | ||||||||||
| bcds | ✓ | ||||||||||
| cxds/bcds hybrid | ✓ | ||||||||||
| ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |||||
| ✓ | ✓ | ✓ | ✓ | ||||||||
| ✓ | ✓ | ✓ | ✓ | ||||||||
| RDS | ✓ | ✓ | ✓ | ✓ | |||||||
| AnnData | ✓ | ||||||||||
| hdf5 | ✓ | ✓ | |||||||||
| .txt Flatfile | ✓ | ✓ | |||||||||
| pickle | ✓ | ||||||||||
| joblib | ✓ | ||||||||||
The SCTK-QC pipeline supports various input types, provides a full scRNA-seq quality control workflow, and supports common data structures for data storage.
Summary of QC metrics for each PBMC sample. A total of six PBMC datasets were analyzed with the SCTK-QC pipeline.
| | GENCODE GRCh38 v27, PBMC1k V2 | GENCODE GRCh38 v27, PBMC1k V3 | GENCODE GRCh38 v34, PBMC1k V2 | GENCODE GRCh38 v34, PBMC1k V3 | SMART-Seq2, Replicate 1 | SMART-Seq2, Replicate 2 |
|---|---|---|---|---|---|---|
| Total number of genes detected | 58,347 | 60,669 | 58,347 | 60,669 | 33,694 | 33,694 |
| Number of droplets, Droplet matrix | 737,280 | 6,794,880 | 737,280 | 6,794,880 | NA | NA |
| Number of Cells, Cell matrix | 995 | 1223 | 996 | 1222 | 311 | 273 |
| Mean counts | 3559 | 7576 | 3553 | 7576 | 390,058 | 292,971 |
| Median counts | 3374 | 6637 | 3375 | 6640 | 388,420 | 290,819 |
| Mean features detected | 1133 | 2088 | 1140 | 2104 | 2436 | 2795 |
| Median features detected | 1106 | 1957 | 1109 | 1978 | 2406 | 2632 |
| Scrublet, Number of doublets | 12 | 16 | 12 | 18 | 0 | 3 |
| Scrublet, Percentage of doublets | 1.21 | 1.31 | 1.2 | 1.47 | 0 | 1.1 |
| ScDblFinder, Number of doublets | 13 | 16 | 14 | 20 | 3 | 13 |
| ScDblFinder, Percentage of doublets | 1.31 | 1.31 | 1.41 | 1.64 | 0.97 | 4.76 |
| DoubletFinder, Number of doublets, Resolution 1.5 | 75 | 92 | 75 | 92 | 23 | 20 |
| DoubletFinder, Percentage of doublets, Resolution 1.5 | 7.54 | 7.52 | 7.53 | 7.53 | 7.4 | 7.33 |
| CXDS—Number of doublets | 51 | 195 | 53 | 183 | 19 | 4 |
| CXDS—Percentage of doublets | 5.13 | 15.9 | 5.32 | 15 | 6.11 | 1.47 |
| BCDS—Number of doublets | 91 | 91 | 69 | 71 | 17 | 8 |
| BCDS—Percentage of doublets | 9.15 | 7.44 | 6.93 | 5.81 | 5.47 | 2.93 |
| SCDS Hybrid—Number of doublets | 65 | 119 | 77 | 90 | 20 | 13 |
| SCDS Hybrid—Percentage of doublets | 6.53 | 9.73 | 7.73 | 7.36 | 6.43 | 4.76 |
| DecontX—Mean contamination percentage | 5.4 | 3.7 | 5.8 | 3.0 | 2.1 | 2.9 |
| DecontX—Median contamination percentage | 1.7 | 0.9 | 1.8 | 0.7 | 0.7 | 1.3 |
Two 10x Genomics PBMC 1k datasets generated with different chemistries (v2 and v3) were each analyzed with GENCODE v27 and v34, resulting in a total of four datasets. Two SMART-Seq2 datasets from PBMC replicates were also included. A per-sample summary table is automatically generated by the pipeline.
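The doublet percentages in the summary table above can be related to the doublet counts with simple arithmetic. The sketch below assumes that each percentage is the number of predicted doublets divided by the number of cells in the Cell matrix, multiplied by 100; this assumption is not stated in the source, but it appears consistent with the Scrublet rows. The sample labels and the helper name `doublet_percentage` are illustrative, while the counts are copied from the table.

```python
# Worked arithmetic using the Scrublet rows of the summary table.
# Assumption: percentage = 100 * doublets / cells in the Cell matrix.

def doublet_percentage(n_doublets, n_cells):
    """Return the share of cells called as doublets, in percent."""
    return 100.0 * n_doublets / n_cells


# (label, Scrublet doublets, cells in the Cell matrix), values from the table.
samples = [
    ("v27 PBMC1k V2", 12, 995),
    ("v27 PBMC1k V3", 16, 1223),
    ("v34 PBMC1k V2", 12, 996),
    ("v34 PBMC1k V3", 18, 1222),
    ("SMART-Seq2 Replicate 1", 0, 311),
    ("SMART-Seq2 Replicate 2", 3, 273),
]

# Round to two decimals, matching the precision shown in the table.
percentages = [round(doublet_percentage(d, n), 2) for _, d, n in samples]
```

Under this assumption the computed values are 1.21, 1.31, 1.2, 1.47, 0.0, and 1.1, matching the "Scrublet, Percentage of doublets" row.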
Fig. 4 Application of SCTK-QC to PBMC datasets.
(A) QC metrics were generated by the SCTK-QC pipeline for 1K healthy donor Peripheral Blood Mononuclear Cell (PBMC) datasets from 10X Genomics. Violin plots generated by the pipeline demonstrate the higher capture sensitivity of the 10x v3 Chromium chemistry. Furthermore, lower ambient RNA contamination was observed in the samples run with v3 chemistry than in the samples profiled with v2 chemistry. (B) The SCTK-QC pipeline was applied similarly to a PBMC dataset generated by SMART-Seq2. More features were detected per cell in the SMART-Seq2 datasets than in either of the 10X Genomics datasets.