Quality control of single-cell RNA-seq by SinQC.

Peng Jiang, James A Thomson, Ron Stewart.

Abstract

Single-cell RNA-seq (scRNA-seq) is emerging as a promising technology for profiling cell-to-cell variability in cell populations. However, the combination of technical noise and intrinsic biological variability makes detecting technical artifacts in scRNA-seq samples particularly challenging. Proper detection of technical artifacts is critical to prevent spurious results during downstream analysis. In this study, we present 'Single-cell RNA-seq Quality Control' (SinQC), a method and software tool to detect technical artifacts in scRNA-seq samples by integrating both gene expression patterns and data quality information. We apply SinQC to nine different scRNA-seq datasets, and show that SinQC is a useful tool for controlling scRNA-seq data quality.
AVAILABILITY AND IMPLEMENTATION: SinQC software and documents are available at http://www.morgridge.net/SinQC.html
CONTACT: PJiang@morgridge.org or RStewart@morgridge.org
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author 2016. Published by Oxford University Press.

Year:  2016        PMID: 27153613      PMCID: PMC4978927          DOI: 10.1093/bioinformatics/btw176

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

Single-cell RNA-seq (scRNA-seq) provides a relatively unbiased approach to characterize and dissect the heterogeneity of cells in complex mixtures (Eberwine ). However, one of the major challenges of this technology is distinguishing biological heterogeneity from technical artifacts (cells with substantial technical noise that makes their gene expression patterns distinct from those of other cells) (Sandberg, 2014; Stegle et al., 2015). To detect potential technical artifacts in scRNA-seq, previous studies have used various strategies that can be generally grouped into three categories. The first category involves using housekeeping genes to perform QC. For example, cells that do not express housekeeping genes (e.g. Actb, Gapdh), or that express them abnormally, are filtered out (Ting et al., 2014; Treutlein et al., 2014). The assumption of methods in this category is that housekeeping genes are highly and consistently expressed, which is not necessarily true for single cells. For example, a study using single-cell qPCR showed not only that the expression of housekeeping genes varied widely between individual cells but also that the expression of housekeeping genes can even distinguish cell types (Oyolu et al., 2012). Thus, a reliance on housekeeping genes to perform QC can result in removing cells with real biological variation. The second category involves using overall gene expression patterns to define technical artifacts. For example, cells are excluded from further analysis if they cluster separately from the rest of the cells (Zeisel et al., 2015) or if their median expression values fall below a certain threshold (Pollen et al., 2014). The major problem with the methods in this category is that they can potentially discard cells with real biological variation. The third category involves using the number of genes detected (per some defined expression threshold) and/or the read mapping rate to define technical artifacts (Kumar et al., 2014).
However, the number of genes detected and the mapping rate vary among experiments depending on the quality of a particular library, the cell type, or the RNA-protocol. Hence, the cutoff settings are typically arbitrary. Thus, although single-cell approaches hold great promise in exploring heterogeneity within a cell population or complex mixture, QC remains a major challenge (Stegle et al., 2015). In this study, we present ‘Single-cell RNA-seq Quality Control’ (SinQC), a method and software tool for detecting technical artifacts in scRNA-seq samples. SinQC assumes that if gene expression outliers are also associated with poor sequencing library quality (poor data quality, e.g. few mapped reads, low mapping rate or low library complexity), then they are more likely to be technical artifacts than cells with real biological variation.

2 Method

A detailed description of the SinQC algorithm can be found in Supplementary Data. Briefly, given a batch of scRNA-seq data, SinQC first uses gene expression patterns to detect outliers. SinQC assumes that gene expression outliers contain both cells with real biological variation and technical artifacts, whereas the remaining cells (the main population) are, in general, more likely to be of good quality. Thus, SinQC uses cells of the main population as controls to estimate data quality cutoffs and a corresponding false positive rate (FPR). For each sample, SinQC calculates two data quality meta-scores, the Minimal Quantile Score (MQS) and the Weighted Combined Quality Score (WCQS), by combining a set of data quality metrics (total number of mapped reads, mapping rate and library complexity). These two meta-scores indicate, respectively, whether a sample has a significant deficiency in any one of the three quality metrics and whether its overall quality is low. SinQC determines the cutoffs for these two meta-scores by allowing a small, user-defined fraction of main-population cells to fail them. Technical artifacts are defined as gene expression outliers with poor data quality (Supplementary Fig. S1A). A more detailed and comprehensive study of SinQC (e.g. comparison with other QC methods) can be found in Supplementary Data (Supplementary Results and Discussion).
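
The logic of this section can be illustrated with a minimal Python sketch. It is not the SinQC implementation: the exact definitions are given only in the Supplementary Data, so the quantile-based scoring, the equal weights, the rule that failing either meta-score counts as poor quality, and all function names and input values below are assumptions made for illustration.

```python
from bisect import bisect_right

def quantile_scores(values):
    """Empirical quantile of each cell: fraction of cells with a value <= its own."""
    ordered = sorted(values)
    n = len(ordered)
    return [bisect_right(ordered, v) / n for v in values]

def meta_scores(mapped_reads, mapping_rate, complexity, weights=(1/3, 1/3, 1/3)):
    """Return (MQS, WCQS) per cell. Equal weights are an assumption."""
    per_metric = [quantile_scores(m) for m in (mapped_reads, mapping_rate, complexity)]
    per_cell = list(zip(*per_metric))
    mqs = [min(qs) for qs in per_cell]  # worst single metric
    wcqs = [sum(w * q for w, q in zip(weights, qs)) for qs in per_cell]  # overall quality
    return mqs, wcqs

def cutoff_from_controls(control_scores, allowed_fail_fraction):
    """Cutoff such that at most the allowed fraction of controls score strictly below it."""
    ordered = sorted(control_scores)
    k = min(int(allowed_fail_fraction * len(ordered)), len(ordered) - 1)
    return ordered[k]

def flag_artifacts(is_outlier, mqs, wcqs, allowed_fail_fraction=0.05):
    """Artifact = expression outlier whose MQS or WCQS falls below the control-derived cutoff."""
    controls = [i for i, out in enumerate(is_outlier) if not out]
    mqs_cut = cutoff_from_controls([mqs[i] for i in controls], allowed_fail_fraction)
    wcqs_cut = cutoff_from_controls([wcqs[i] for i in controls], allowed_fail_fraction)
    return [out and (m < mqs_cut or w < wcqs_cut)
            for out, m, w in zip(is_outlier, mqs, wcqs)]

# Made-up example: 10 cells; cells 7, 8 and 9 are gene expression outliers,
# but only cell 7 also has poor data quality.
mapped_reads = [5.0, 5.2, 4.9, 5.1, 5.3, 4.8, 5.4, 0.4, 5.5, 5.05]  # millions
mapping_rate = [0.90, 0.88, 0.91, 0.87, 0.92, 0.89, 0.86, 0.30, 0.93, 0.885]
complexity = [0.70, 0.72, 0.69, 0.71, 0.73, 0.68, 0.74, 0.20, 0.75, 0.705]
is_outlier = [False] * 7 + [True, True, True]

mqs, wcqs = meta_scores(mapped_reads, mapping_rate, complexity)
artifacts = flag_artifacts(is_outlier, mqs, wcqs)
print([i for i, flagged in enumerate(artifacts) if flagged])  # prints [7]
```

In this toy example, cells 8 and 9 are expression outliers with good data quality, so they are retained as candidates for real biological variation; only cell 7 is flagged.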

3 Results and discussion

We applied SinQC to a highly heterogeneous scRNA-seq dataset containing 301 cells (a mixture of 11 different cell types) (Pollen et al., 2014). SinQC detected 12 technical artifacts (FPR < 5%). Compared with the QC pass cells, these 12 artifacts showed a significantly lower mapping rate but neither fewer mapped reads nor lower library complexity (Supplementary Fig. S2). We calculated the number of genes detected (TPM > 1) for each cell. The artifacts have significantly fewer genes detected than the QC pass cells (P = 3.83e-07, one-sided Wilcoxon rank sum test; Supplementary Fig. S3). As shown in Figure 1, technical artifacts detected by SinQC overall have fewer genes detected and/or lower mapping rates, which is similar to the result of using a ‘genes detected and/or mapping rate’ method for quality control (Kumar et al., 2014). However, SinQC has the advantage of not requiring arbitrarily chosen thresholds for the number of genes detected or the mapping rate. Moreover, SinQC uses data quality meta-scores (Section 2) instead of only the mapping rate to represent data quality for each sample. The additional datasets analyzed below further demonstrate that integrating more universal data quality metrics helps to detect technical artifacts.
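
The comparison of genes detected between artifacts and QC pass cells can be sketched as follows. This is an illustrative reimplementation, not the authors' code: the counts are made up, the function names are hypothetical, and the P-value uses a normal approximation without tie or continuity correction, which is crude for samples this small.

```python
import math

def genes_detected(tpm_by_gene, threshold=1.0):
    """Count genes whose TPM exceeds the threshold in one cell."""
    return sum(1 for tpm in tpm_by_gene if tpm > threshold)

def average_ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of ranks start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def rank_sum_less(x, y):
    """One-sided Wilcoxon rank sum test of 'x tends to be smaller than y'.

    Returns (U, p), where U counts (x, y) pairs with x > y and p comes
    from a normal approximation.
    """
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    p = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # P(Z <= z)
    return u, p

# TPM > 1 is a strict threshold, so a gene at exactly 1.0 is not counted.
print(genes_detected([0.0, 0.5, 1.0, 1.5, 20.0]))  # prints 2

# Made-up genes-detected counts for artifacts and QC pass cells.
artifact_counts = [1200, 1500, 900, 1100]
pass_counts = [4000, 4200, 3900, 4500, 4100, 3800]
u, p = rank_sum_less(artifact_counts, pass_counts)
print(u, p)  # U is 0.0; p is roughly 0.005
```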
Fig. 1. Technical artifacts detected by SinQC (FPR < 5%) in a highly heterogeneous dataset containing a mixture of 11 cell types

We next applied SinQC to another eight scRNA-seq datasets, including six batches of human H1 ES cells (H1-Exp1 = 72 cells, H1-Exp2 = 81 cells, H1-Exp3 = 75 cells, G1 = 91 cells, S = 80 cells and G2 = 76 cells) (Leng et al., 2015) and two mouse datasets (ES = 48 cells and MEF = 44 cells) (Islam et al., 2011). We ran SinQC separately for each dataset. As shown in Supplementary Figure S4, the technical artifacts (FPR < 5%) identified by SinQC have significantly fewer mapped reads, and/or a lower mapping rate, and/or lower library complexity than cells that pass QC. Among the eight low-heterogeneity scRNA-seq datasets tested, all except Human G1 show a significantly lower number of genes detected in artifacts than in the QC pass cells (Supplementary Table S1). For Human G1, the 95% confidence intervals of the artifacts and the QC pass cells overlap marginally (Supplementary Table S1). This indicates that artifacts identified by SinQC overall have a lower number of genes detected and are also associated with poor data quality.

The technical artifacts detected by SinQC overall have fewer genes detected than QC pass cells, but this does not mean that cells with fewer genes detected are technical artifacts. The number of genes detected is determined by both data quality and cell type (a detailed discussion can be found in Supplementary Data, Supplementary Results and Discussion). By integrating both gene expression and data quality information, SinQC maximizes the probability that the technical artifacts are correctly detected while also minimizing false positives by using cells of the main population as data quality controls.

If a single-cell RNA-seq experiment contains hundreds or thousands of cells, it is likely that they are processed in several experimental batches.
For our lab-generated human embryonic stem (ES) cell datasets (Leng et al., 2015), we processed the cells in three different experimental batches (H1-Exp1 = 72 cells, H1-Exp2 = 81 cells and H1-Exp3 = 75 cells). We compared the technical artifacts detected (FPR < 5%) when running SinQC on each individual experiment alone with those detected when running SinQC on the pooled experimental batches. Applying SinQC to the three pooled batches of human H1 single-cell RNA-seq data detected 15 technical artifacts (H1-Exp1 = 12, H1-Exp2 = 3, H1-Exp3 = 0) (Supplementary Fig. S9). However, running SinQC batch by batch detected not only these 15 artifacts but also 11 additional ones (H1-Exp1 = 12, H1-Exp2 = 11, H1-Exp3 = 3; Supplementary Table S1). This suggests that SinQC is more sensitive when run batch by batch, because pooling batches increases the diversity of the population being studied owing to batch effects in scRNA-seq datasets. Since SinQC uses relative measurements to determine data quality cutoffs, the increased diversity in pooled batches relaxes the absolute data quality cutoffs, thus allowing more gene expression outliers to pass these cutoffs.

We then further investigated the sensitivity and specificity of SinQC when different types of cells are mixed. We mixed two mouse datasets (48 ES cells and 44 MEF cells) (Islam et al., 2011) in different ways to simulate datasets containing different proportions of subpopulations. As shown in Supplementary Figure S5, the artifacts detected by SinQC using different combinations of datasets are overall consistent with each other. However, we observe that SinQC increases specificity at the cost of sensitivity when the extent of heterogeneity in a dataset is high. For example, running SinQC on each individual ES or MEF dataset detects more artifacts than running SinQC on pooled mixture datasets (e.g. ‘All’).
However, the two artifacts (ESC_46 and ESC_32) that were detected in the pooled mixture dataset (‘All’) can be robustly detected by running SinQC either on the ES and MEF datasets separately or on ‘ES + 1/5 (MEF)’ or ‘1/5 (ES) + MEF’. In highly heterogeneous cell populations, detecting technical artifacts carries a higher risk of discarding cells with real biological variation. The increased specificity and decreased sensitivity of SinQC for highly heterogeneous cell populations can therefore minimize false positives.
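
The effect of pooling on relative cutoffs, discussed above for the H1 batches, can be seen in a small Python sketch. It assumes a single quality metric (mapping rate) in place of the meta-scores and a simple order-statistic cutoff; the numbers and names are invented for illustration and do not come from the datasets analyzed here.

```python
def relative_cutoff(control_values, allowed_fail_fraction=0.05):
    """Cutoff such that at most the allowed fraction of controls fall strictly below it."""
    ordered = sorted(control_values)
    k = min(int(allowed_fail_fraction * len(ordered)), len(ordered) - 1)
    return ordered[k]

# Made-up mapping rates of main-population (control) cells in two batches.
# A batch effect shifts the overall quality level of batch B downwards.
batch_a_controls = [0.90, 0.91, 0.89, 0.92, 0.88, 0.93, 0.90, 0.91]
batch_b_controls = [0.62, 0.60, 0.65, 0.61, 0.63, 0.64, 0.66, 0.59]

# A gene expression outlier from batch A with a mapping rate of 0.70.
suspect = 0.70

cut_a = relative_cutoff(batch_a_controls)
cut_pooled = relative_cutoff(batch_a_controls + batch_b_controls)

flagged_alone = suspect < cut_a        # judged against its own batch
flagged_pooled = suspect < cut_pooled  # judged against the pooled controls

print(cut_a, flagged_alone)        # prints 0.88 True
print(cut_pooled, flagged_pooled)  # prints 0.59 False
```

Because the cutoff is derived from the controls themselves, adding a lower-quality batch lowers the pooled cutoff, and an outlier that fails within its own batch passes in the pooled analysis.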
References

1.  Distinguishing human cell types based on housekeeping gene signatures.

Authors:  Chuba Oyolu; Fouad Zakharia; Julie Baker
Journal:  Stem Cells       Date:  2012-03       Impact factor: 6.277

2.  Characterization of the single-cell transcriptional landscape by highly multiplex RNA-seq.

Authors:  Saiful Islam; Una Kjällquist; Annalena Moliner; Pawel Zajac; Jian-Bing Fan; Peter Lönnerberg; Sten Linnarsson
Journal:  Genome Res       Date:  2011-05-04       Impact factor: 9.043

3.  Computational and analytical challenges in single-cell transcriptomics (review).

Authors:  Oliver Stegle; Sarah A Teichmann; John C Marioni
Journal:  Nat Rev Genet       Date:  2015-01-28       Impact factor: 53.242

4.  Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq.

Authors:  Amit Zeisel; Ana B Muñoz-Manchado; Simone Codeluppi; Peter Lönnerberg; Gioele La Manno; Anna Juréus; Sueli Marques; Hermany Munguba; Liqun He; Christer Betsholtz; Charlotte Rolny; Gonçalo Castelo-Branco; Jens Hjerling-Leffler; Sten Linnarsson
Journal:  Science       Date:  2015-02-19       Impact factor: 47.728

5.  Entering the era of single-cell transcriptomics in biology and medicine.

Authors:  Rickard Sandberg
Journal:  Nat Methods       Date:  2014-01       Impact factor: 28.547

6.  Deconstructing transcriptional heterogeneity in pluripotent stem cells.

Authors:  Roshan M Kumar; Patrick Cahan; Alex K Shalek; Rahul Satija; AJay DaleyKeyser; Hu Li; Jin Zhang; Keith Pardee; David Gennert; John J Trombetta; Thomas C Ferrante; Aviv Regev; George Q Daley; James J Collins
Journal:  Nature       Date:  2014-12-04       Impact factor: 49.962

7.  Oscope identifies oscillatory genes in unsynchronized single-cell RNA-seq experiments.

Authors:  Ning Leng; Li-Fang Chu; Chris Barry; Yuan Li; Jeea Choi; Xiaomao Li; Peng Jiang; Ron M Stewart; James A Thomson; Christina Kendziorski
Journal:  Nat Methods       Date:  2015-08-24       Impact factor: 28.547

8.  Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq.

Authors:  Barbara Treutlein; Doug G Brownfield; Angela R Wu; Norma F Neff; Gary L Mantalas; F Hernan Espinoza; Tushar J Desai; Mark A Krasnow; Stephen R Quake
Journal:  Nature       Date:  2014-04-13       Impact factor: 49.962

9.  Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex.

Authors:  Alex A Pollen; Tomasz J Nowakowski; Joe Shuga; Xiaohui Wang; Anne A Leyrat; Jan H Lui; Nianzhen Li; Lukasz Szpankowski; Brian Fowler; Peilin Chen; Naveen Ramalingam; Gang Sun; Myo Thu; Michael Norris; Ronald Lebofsky; Dominique Toppani; Darnell W Kemp; Michael Wong; Barry Clerkson; Brittnee N Jones; Shiquan Wu; Lawrence Knutsson; Beatriz Alvarado; Jing Wang; Lesley S Weaver; Andrew P May; Robert C Jones; Marc A Unger; Arnold R Kriegstein; Jay A A West
Journal:  Nat Biotechnol       Date:  2014-08-03       Impact factor: 54.908

10.  Single-cell RNA sequencing identifies extracellular matrix gene expression by pancreatic circulating tumor cells.

Authors:  David T Ting; Ben S Wittner; Matteo Ligorio; Nicole Vincent Jordan; Ajay M Shah; David T Miyamoto; Nicola Aceto; Francesca Bersani; Brian W Brannigan; Kristina Xega; Jordan C Ciciliano; Huili Zhu; Olivia C MacKenzie; Julie Trautwein; Kshitij S Arora; Mohammad Shahid; Haley L Ellis; Na Qu; Nabeel Bardeesy; Miguel N Rivera; Vikram Deshpande; Cristina R Ferrone; Ravi Kapur; Sridhar Ramaswamy; Toshi Shioda; Mehmet Toner; Shyamala Maheswaran; Daniel A Haber
Journal:  Cell Rep       Date:  2014-09-18       Impact factor: 9.423
