OEFinder: a user interface to identify and visualize ordering effects in single-cell RNA-seq data.

Ning Leng1, Jeea Choi2, Li-Fang Chu1, James A Thomson1, Christina Kendziorski3, Ron Stewart1.   

Abstract

A recent article identified an artifact in multiple single-cell RNA-seq (scRNA-seq) datasets generated by the Fluidigm C1 platform. Specifically, Leng et al. showed significantly increased gene expression in cells captured from sites with small or large plate output IDs. We refer to this artifact as an ordering effect (OE). Including OE genes in downstream analyses could lead to biased results. To address this problem, we developed a statistical method and software called OEFinder to identify a sorted list of OE genes. OEFinder is available as an R package along with user-friendly graphical interface implementations that allow users to check for potential artifacts in scRNA-seq data generated by the Fluidigm C1 platform.
AVAILABILITY AND IMPLEMENTATION: OEFinder is freely available at https://github.com/lengning/OEFinder
CONTACT: rstewart@morgridge.org or lengning1@gmail.com
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author 2016. Published by Oxford University Press.

Year:  2016        PMID: 26743507      PMCID: PMC4848403          DOI: 10.1093/bioinformatics/btw004

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

Single-cell RNA-seq (scRNA-seq) has led to important findings in many fields and is becoming increasingly popular in studies of transcriptome-wide expression (Deng et al., 2014; Leng et al., 2015; Shalek et al., 2014; Trapnell et al., 2014; Treutlein et al., 2014). To facilitate scRNA-seq, the majority of studies utilize the Fluidigm C1 platform for cell capture, reverse transcription and cDNA amplification, as this platform allows for rapid and reliable isolation and processing of individual cells. In spite of these advantages, Leng et al. (2015) identified an artifact in multiple datasets generated by C1 and confirmed that the artifact was present in the cDNA processed by the C1 machine. In particular, these datasets contain genes that show higher expression in cells captured at specific capture sites, namely those with small or large plate output IDs. We refer to this artifact as an ordering effect (OE), which has been shown to be independent of organism and laboratory (Leng et al., 2015). As detailed in Leng et al. (2015), accurate identification of OE genes is important to ensure unbiased downstream analyses. Leng et al. (2015) used an ANOVA-based approach to detect OE genes. The ANOVA-based approach performs well in many cases, but has reduced power when few cells are available. (Note that in empirical data, cells can be missing owing to the random effects of capture failure, which leaves a capture site empty, or of doublets, in which more than one cell is captured in one capture site.) To improve power for identifying OE genes, we developed an approach, OEFinder, based on orthogonal polynomial regression. Results show that OEFinder is less sensitive to sample size and outperforms the ANOVA-based approach when few cells are available. OEFinder is implemented in R, a free and open source language, with a vignette that provides working examples. The graphical user interface (GUI) implementations of OEFinder allow users with little computing background to easily identify and characterize OE genes in scRNA-seq data (Fig. 1a and Supplementary Fig. S4).
Fig. 1

(a) OEFinder GUI for identifying OE genes (shown is implementation using R/RGtk2 package; an implementation using R/shiny package is also available, see Supplementary Fig. S4). (b, c) Operating characteristics in simulated datasets. The x-axes show the number of available cells. The y-axis shows TPR and FDR. (d, e) The OE genes identified in the first experiment of Trapnell et al. data and Leng et al. data, respectively. The cells were ordered following the capture site ID. The y-axis shows scaled gene expression (z-score). Each line represents one OE gene


2 Analysis input and output

2.1 Input

2.1.1 Expression estimates

OEFinder requires a genes-by-cells expression matrix. The expression matrix can be either normalized or unnormalized. If the input matrix is unnormalized, OEFinder applies the Median-by-Ratio normalization method introduced by Anders and Huber (2010) prior to OE detection.
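The normalization step can be illustrated with a minimal Python sketch of median-by-ratio size factors in the spirit of Anders and Huber (2010). This is not OEFinder's R code: the function names `median_by_ratio_size_factors` and `normalize` are hypothetical, and the sketch assumes the matrix is a list of gene rows and that only genes with positive counts in every cell contribute to the size factors.

```python
import math
from statistics import median


def median_by_ratio_size_factors(counts):
    """Return one size factor per cell for a genes-by-cells count matrix.

    Assumption of this sketch: only genes with a positive count in every
    cell are used, so that each gene's geometric mean is defined.
    """
    n_cells = len(counts[0])
    usable = [row for row in counts if all(v > 0 for v in row)]
    if not usable:
        raise ValueError("no gene has a positive count in every cell")
    # Log of the geometric mean of each usable gene across cells.
    log_geo_means = [sum(math.log(v) for v in row) / n_cells for row in usable]
    factors = []
    for j in range(n_cells):
        # Log ratio of the cell's count to the gene-wise geometric mean.
        log_ratios = [math.log(row[j]) - lg
                      for row, lg in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalize(counts):
    """Divide every cell (column) by its size factor."""
    factors = median_by_ratio_size_factors(counts)
    return [[v / f for v, f in zip(row, factors)] for row in counts]


if __name__ == "__main__":
    # The second cell was sequenced twice as deeply as the first.
    toy = [[10, 20], [100, 200], [5, 10], [0, 3]]
    print(median_by_ratio_size_factors(toy))
    print(normalize(toy))
```

In sparse scRNA-seq matrices few genes may be positive in every cell, so a real implementation may need a different rule for choosing the genes that enter the median.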

2.1.2 Capture site group definitions

As detailed in Leng et al. (2015), the capture sites are labeled as A01,…, A12, B01,…, B12,…, H01,…, H12. If the capture site IDs are provided, OEFinder groups cells from sites with the same starting letter. When the capture site information is not available, OEFinder groups cells based on their input order. By default, OEFinder groups cells into eight even-sized groups. The number of groups may be changed by the user.
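The two grouping rules described above can be sketched as follows. The helper names `group_by_site_letter` and `group_by_input_order` are hypothetical and do not come from the OEFinder package; the sketch assumes site IDs of the form A01 to H12 and, for the fallback, contiguous groups of near-equal size in input order.

```python
def group_by_site_letter(site_ids):
    """Map capture site IDs such as 'A01' or 'H12' to group codes 1..8.

    The group code is the position of the leading letter (A=1, ..., H=8).
    """
    codes = []
    for sid in site_ids:
        letter = sid[0].upper()
        if not "A" <= letter <= "H":
            raise ValueError(f"unexpected capture site ID: {sid!r}")
        codes.append(ord(letter) - ord("A") + 1)
    return codes


def group_by_input_order(n_cells, n_groups=8):
    """Assign cells, in input order, to contiguous near-equal-sized groups."""
    if not 1 <= n_groups <= n_cells:
        raise ValueError("n_groups must be between 1 and n_cells")
    return [i * n_groups // n_cells + 1 for i in range(n_cells)]


if __name__ == "__main__":
    print(group_by_site_letter(["A01", "A12", "B03", "H12"]))
    print(group_by_input_order(16))
```

When the number of cells is not a multiple of the number of groups, this sketch lets group sizes differ by one; how OEFinder itself handles that case is not stated here.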

2.2 Method

The normalized expression values of each gene are scaled to z-scores. For each gene, OEFinder applies an orthogonal polynomial regression of the z-scores on the group code. To infer whether gene $g$ follows the OE trend, OEFinder calculates the P-value $p_{g,2}$ of a one-tailed test of whether the coefficient of the quadratic term is positive. To account for the goodness of fit of the polynomial model, OEFinder defines an aggregate statistic $S_g = -\log(p_{g,2}) - \log(p_g)$, in which $p_g$ denotes the F test P-value of the full model. OEFinder then generates 10 000 simulated genes from permuted data to evaluate the significance of the observed aggregate statistic. By default, genes with permutation P-value <0.01 are identified as OE genes. The number of simulated genes and the P-value cutoff may be changed by the user (for further details, see Supplementary Section S2).
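The core idea, a positive quadratic coefficient in an orthogonal polynomial fit of z-scores on group code, can be illustrated with a simplified Python sketch. It is not the OEFinder procedure: instead of the aggregate statistic built from the one-tailed and F test P-values, it permutes the cells of a single gene and compares the quadratic coefficient directly. All function names are hypothetical, and the default of 999 permutations is chosen only to keep the example fast.

```python
import random
from statistics import mean, pstdev


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def zscore(values):
    """Scale one gene's expression to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        raise ValueError("expression is constant across cells")
    return [(v - m) / s for v in values]


def orthogonal_quadratic(x):
    """Quadratic basis vector made orthogonal to the constant and linear
    terms in x by Gram-Schmidt."""
    if len(set(x)) < 3:
        raise ValueError("need at least three distinct group codes")
    mx = mean(x)
    lin = [xi - mx for xi in x]            # linear term, centered
    sq = [xi * xi for xi in x]
    msq = mean(sq)
    sq_c = [s - msq for s in sq]           # remove the constant component
    b = dot(sq_c, lin) / dot(lin, lin)     # remove the linear component
    return [s - b * l for s, l in zip(sq_c, lin)]


def quadratic_coefficient(z, quad):
    """Least-squares coefficient of the orthogonal quadratic term."""
    return dot(z, quad) / dot(quad, quad)


def permutation_pvalue(expr, groups, n_perm=999, seed=0):
    """One-tailed permutation P-value for a positive quadratic coefficient.

    Simplification: the test statistic is the coefficient itself, not the
    aggregate statistic described in the text.
    """
    z = zscore(expr)
    quad = orthogonal_quadratic(groups)
    observed = quadratic_coefficient(z, quad)
    rng = random.Random(seed)
    shuffled = list(z)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if quadratic_coefficient(shuffled, quad) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    groups = [g for g in range(1, 9) for _ in range(2)]
    u_shaped = [(g - 4.5) ** 2 for g in groups]   # high at both ends
    print(permutation_pvalue(u_shaped, groups))
```

Because the constant, linear and quadratic basis vectors are mutually orthogonal, the quadratic coefficient can be computed on its own, without refitting the other terms.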

2.3 Output

2.3.1 List of OE genes

OEFinder outputs two .csv files: one contains a sorted list of OE genes and the other contains P-values for all genes.

2.3.2 Expression matrix for downstream analysis

OEFinder outputs a normalized expression matrix that can be directly input to downstream analyses. The user has the option to choose either removal of the OE genes, or imputation of the OE genes with adjusted values.

2.3.3 Visualization of OE genes

OEFinder generates a .pdf file containing expression plots of the top N OE genes, where N is user specified. An example is shown in Supplementary Fig. S1.

3 Evaluations

3.1 Simulation studies

We conducted eight simulation studies to evaluate the performance of the OE detection algorithms. In each simulation, we generated 5000 expressed genes with 500 OE genes. Expression of OE genes was generated based on expression profiles of OE genes detected in empirical data. (Details of the simulations may be found in Supplementary Section S3.) The eight simulation studies evaluate cases with varying numbers of available cells (20–90 cells). Each simulation study contains 100 repeated simulations. Figure 1b and c shows the true positive rate (TPR) and false discovery rate (FDR) of the ANOVA-based method introduced in Leng et al. (2015) and of OEFinder. Results indicate that when >60 cells are available, both methods have TPR >90% while FDR is controlled at <10%. When fewer cells are available, OEFinder has a higher TPR than the ANOVA-based approach and the FDR is still well controlled. The improved power of OEFinder is likely because its polynomial regression fits the OE trend more specifically than the ANOVA-based approach. Additional simulation results may be found in Supplementary Section S4.
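For reference, the two operating characteristics reported for the simulations can be computed from a list of called genes and the list of truly simulated OE genes as in the following sketch. The function name `tpr_fdr` and the toy gene names are hypothetical; an empty call set is assigned an FDR of 0 by convention.

```python
def tpr_fdr(called, truth):
    """Return (TPR, FDR) for a set of called genes against the true OE genes.

    TPR = TP / (TP + FN): the fraction of true OE genes that were called.
    FDR = FP / (TP + FP): the fraction of called genes that are not OE genes.
    """
    called, truth = set(called), set(truth)
    if not truth:
        raise ValueError("the set of true OE genes is empty")
    tp = len(called & truth)
    fp = len(called - truth)
    fn = len(truth - called)
    tpr = tp / (tp + fn)
    fdr = fp / (tp + fp) if called else 0.0
    return tpr, fdr


if __name__ == "__main__":
    truth = {"g1", "g2", "g3", "g4", "g5"}
    called = {"g1", "g2", "g3", "g9"}
    print(tpr_fdr(called, truth))
```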

3.2 Case studies

We applied OEFinder to two publicly available datasets with capture site ID information. The Trapnell et al. (2014) and Leng et al. (2015) datasets contain four and three experiments, respectively. Figure 1d and e shows the 187 and 451 OE genes identified in the first experiment of each dataset, respectively. Genes detected by OEFinder show a clear OE pattern. Results of other experiments may be found in Supplementary Section S5.

4 Discussion

We developed an R package, OEFinder, which can robustly detect OE genes in scRNA-seq data generated by the Fluidigm C1 platform. OEFinder provides user-friendly graphical interface implementations that facilitate use by investigators.

Funding

This work was funded in part by GM102756, 4UH3TR000506, 5U01HL099773, U54 AI117924, Charlotte Geyer Foundation and Morgridge Institute for Research.

Conflict of Interest: none declared.

References

1.  Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells.

Authors:  Qiaolin Deng; Daniel Ramsköld; Björn Reinius; Rickard Sandberg
Journal:  Science       Date:  2014-01-10       Impact factor: 47.728

2.  Differential expression analysis for sequence count data.

Authors:  Simon Anders; Wolfgang Huber
Journal:  Genome Biol       Date:  2010-10-27       Impact factor: 13.583

3.  The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells.

Authors:  Cole Trapnell; Davide Cacchiarelli; Jonna Grimsby; Prapti Pokharel; Shuqiang Li; Michael Morse; Niall J Lennon; Kenneth J Livak; Tarjei S Mikkelsen; John L Rinn
Journal:  Nat Biotechnol       Date:  2014-03-23       Impact factor: 54.908

4.  Oscope identifies oscillatory genes in unsynchronized single-cell RNA-seq experiments.

Authors:  Ning Leng; Li-Fang Chu; Chris Barry; Yuan Li; Jeea Choi; Xiaomao Li; Peng Jiang; Ron M Stewart; James A Thomson; Christina Kendziorski
Journal:  Nat Methods       Date:  2015-08-24       Impact factor: 28.547

5.  Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq.

Authors:  Barbara Treutlein; Doug G Brownfield; Angela R Wu; Norma F Neff; Gary L Mantalas; F Hernan Espinoza; Tushar J Desai; Mark A Krasnow; Stephen R Quake
Journal:  Nature       Date:  2014-04-13       Impact factor: 49.962

6.  Single-cell RNA-seq reveals dynamic paracrine control of cellular variation.

Authors:  Alex K Shalek; Rahul Satija; Joe Shuga; John J Trombetta; Dave Gennert; Diana Lu; Peilin Chen; Rona S Gertner; Jellert T Gaublomme; Nir Yosef; Schraga Schwartz; Brian Fowler; Suzanne Weaver; Jing Wang; Xiaohui Wang; Ruihua Ding; Raktima Raychowdhury; Nir Friedman; Nir Hacohen; Hongkun Park; Andrew P May; Aviv Regev
Journal:  Nature       Date:  2014-06-11       Impact factor: 49.962
