
M3Drop: dropout-based feature selection for scRNASeq.

Tallulah S Andrews, Martin Hemberg

Abstract

MOTIVATION: Most genomes contain thousands of genes, but for most functional responses, only a subset of those genes is relevant. To facilitate many single-cell RNASeq (scRNASeq) analyses, the set of genes is often reduced through feature selection, i.e. by removing genes only subject to technical noise.
RESULTS: We present M3Drop, an R package that implements popular existing feature selection methods and two novel methods which take advantage of the prevalence of zeros (dropouts) in scRNASeq data to identify features. We show these new methods outperform existing methods on simulated and real datasets.
AVAILABILITY AND IMPLEMENTATION: M3Drop is freely available on GitHub as an R package and is compatible with other popular scRNASeq tools: https://github.com/tallulandrews/M3Drop.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author(s) 2018. Published by Oxford University Press.

Year:  2019        PMID: 30590489      PMCID: PMC6691329          DOI: 10.1093/bioinformatics/bty1044

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

Single-cell RNASeq (scRNASeq) has made it possible to analyze the transcriptome of individual cells. In a typical scRNASeq experiment for human or mouse, ∼10 000 genes will be detected. Most genes, however, are not relevant for understanding the underlying biological processes, and an important computational challenge is to select the most relevant features. Feature selection improves the signal-to-noise ratio and the computational efficiency of downstream analyses, such as clustering or pseudotime inference, by reducing the number of genes under consideration. However, unsupervised feature selection remains difficult due to the high technical variability and the low detection rates of scRNASeq experiments.

In this work, we introduce M3Drop, a software package for performing feature selection on scRNASeq data. M3Drop implements several existing feature selection methods, including identifying highly variable genes (HVG; Brennecke et al.), GiniClust (Jiang et al.) and PCA-based selection (e.g. Macosko et al.), and introduces two novel feature selection methods which identify genes with unusually high numbers of zeros, also called ‘dropouts’, among their observations. The advantage of using the dropout rate over variance is that the former can be estimated more accurately due to much lower sampling noise (Supplementary Fig. S1). These novel methods exploit the observation that dropout rates per gene are strongly correlated with gene expression level (Pierson and Yau, 2015; Kharchenko et al.). Due to the non-linear nature of this relationship, averaging expression level and dropout rate across a heterogeneous cell population results in differentially expressed (DE) genes being shifted above the expected curve. Hence, biologically relevant features can be identified as outliers above the null expectation.
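The shift of DE genes above the curve can be illustrated with a small numerical sketch. It assumes the Michaelis-Menten form of the dropout curve used in the Materials and methods, P_dropout = 1 - S / (K + S); the value K = 10, the two expression levels and the equal group sizes are hypothetical numbers chosen only for illustration.

```python
# Sketch: pooling a heterogeneous population pushes a DE gene above the
# dropout curve. The curve shape and all numbers are illustrative assumptions.

def mm_dropout(mean_expr, k):
    """Expected dropout rate at a given mean expression (Michaelis-Menten form)."""
    return 1.0 - mean_expr / (k + mean_expr)

K = 10.0  # hypothetical curve parameter

# A DE gene: lowly expressed in population A, highly expressed in population B.
expr_a, expr_b = 1.0, 99.0

# What is observed after pooling two equally sized populations.
pooled_mean = (expr_a + expr_b) / 2
pooled_dropout = (mm_dropout(expr_a, K) + mm_dropout(expr_b, K)) / 2

# What the curve predicts for a non-DE gene with the same pooled mean.
expected_dropout = mm_dropout(pooled_mean, K)

print(f"pooled mean expression: {pooled_mean:.1f}")
print(f"observed dropout rate:  {pooled_dropout:.3f}")
print(f"expected (null) rate:   {expected_dropout:.3f}")
```

With these numbers the pooled gene has a dropout rate of about 0.50, whereas a gene expressed uniformly at the same mean would be expected to have a rate of about 0.17, so the DE gene sits well above the curve.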

2 Materials and methods

M3Drop implements two dropout-based feature selection methods with specific models for the null expectation that are tailored to either read counts from full-transcript sequencing protocols, such as Smartseq2 (Picelli et al.), or unique molecular identifier (UMI) counts from tag-based protocols, such as 10X Chromium (Zheng et al.).

The first method, M3Drop, fits a Michaelis-Menten function to the relationship between mean expression (S) and dropout rate (P_dropout). Since the Michaelis-Menten function has a single parameter (K_M), we can test the hypothesis that the gene-specific K_i is equal to the K_M that was fit for the whole transcriptome. This can be done by propagating errors on both the observed dropout rate and the observed mean expression to estimate the error of each K_i. The significance can then be evaluated using a t-test (see Supplementary Methods). We confirmed that the M3Drop model fits diverse Smartseq/2 scRNASeq datasets (Supplementary Fig. S3d–f).

The second method fits a library-size adjusted negative binomial model (see Supplementary Methods) similar to those used previously to model variability in UMI-tagged data (Grün et al.) and bulk RNASeq data (Anders et al.). Genes with high dropout rates (NBDrop) or high dispersions (NBDisp) can be identified as features. Unlike M3Drop, NBDrop does not account for errors in the estimated mean expression levels. Thus, it is not as well suited for data with small sample sizes and/or high amplification noise, as is typical of full-transcript, plate-based protocols (Islam et al.). We confirmed that the NBDrop model fits diverse tag-based scRNASeq datasets (Supplementary Fig. S3a–c).

Since M3Drop integrates multiple feature selection methods into a single package, we are also able to calculate consensus features by averaging gene rankings across all six implemented feature selection methods.
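As a rough illustration of the two null models, the sketch below computes the expected dropout rate under each and inverts the Michaelis-Menten curve to obtain a gene-specific K_i. It is not the package's implementation: the function names are hypothetical, the Michaelis-Menten curve is written in its standard form P_dropout = 1 - S / (K_M + S), and the library-size adjustment (mean for a gene in a cell proportional to that cell's total counts) is an assumption about the model described in the Supplementary Methods. Error propagation and the t-test are omitted.

```python
def mm_expected_dropout(mean_expr, k_m):
    """Michaelis-Menten null: P_dropout = 1 - S / (K_M + S)."""
    return 1.0 - mean_expr / (k_m + mean_expr)


def gene_specific_k(mean_expr, dropout_rate):
    """Invert the Michaelis-Menten curve for one gene: K_i = P * S / (1 - P)."""
    if not 0.0 <= dropout_rate < 1.0:
        raise ValueError("dropout_rate must be in [0, 1)")
    return dropout_rate * mean_expr / (1.0 - dropout_rate)


def nb_expected_dropout(gene_total, cell_totals, size):
    """Library-size adjusted negative binomial null for one gene.

    Assumption: the expected count in cell j is
    mu_j = gene_total * cell_total_j / grand_total,
    and the NB zero probability is (1 + mu / size) ** (-size).
    """
    grand_total = sum(cell_totals)
    zero_probs = []
    for cell_total in cell_totals:
        mu = gene_total * cell_total / grand_total
        zero_probs.append((1.0 + mu / size) ** (-size))
    return sum(zero_probs) / len(zero_probs)


# Hypothetical values, for illustration only.
k_m = 10.0
k_i = gene_specific_k(mean_expr=50.0, dropout_rate=0.5)
print("null dropout rate at S = 50:", mm_expected_dropout(50.0, k_m))
print("gene-specific K_i:", k_i, "(candidate feature)" if k_i > k_m else "")
print("NB null dropout rate:", nb_expected_dropout(30, [1000, 2000, 3000], size=2.0))
```

A gene whose K_i is much larger than the global K_M has more dropouts than its mean expression predicts; whether the difference is significant depends on the propagated errors, which this sketch does not model.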

3 Results

We evaluated the performance of dropout-based feature selection compared to existing feature selection methods on data simulated from either a zero-inflated negative binomial (ZINB) model fit to one of three different Smartseq/2 datasets (Fig. 1A) or a negative binomial model with variability in library size (LS-NB) fit to one of three different UMI-tagged datasets (Fig. 1B). We generated a total of 108 simulated datasets and tested each method’s ability to identify DE genes. For a fair comparison, we ranked genes by significance (if available) or effect size and calculated the area under the ROC curve (AUC) from these rankings.
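The AUC of a gene ranking can be computed directly from per-gene scores and the known DE labels of a simulation. The following is a minimal sketch rather than the evaluation code used for the paper; the function name, the scores and the labels are hypothetical, and higher scores are assumed to mean stronger evidence for DE (for example a negated P-value or an effect size).

```python
def auc_from_scores(scores, is_de):
    """Area under the ROC curve for a ranking of genes.

    Uses the rank-based identity: the AUC equals the probability that a
    randomly chosen DE gene scores higher than a randomly chosen non-DE
    gene, with ties counted as one half.
    """
    de_scores = [s for s, d in zip(scores, is_de) if d]
    null_scores = [s for s, d in zip(scores, is_de) if not d]
    if not de_scores or not null_scores:
        raise ValueError("need at least one DE and one non-DE gene")
    wins = 0.0
    for de in de_scores:
        for null in null_scores:
            if de > null:
                wins += 1.0
            elif de == null:
                wins += 0.5
    return wins / (len(de_scores) * len(null_scores))


# Hypothetical scores for five genes, two of which are truly DE.
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
is_de = [True, False, True, False, False]
print(auc_from_scores(scores, is_de))
```

In this toy ranking, five of the six DE versus non-DE pairs are ordered correctly, giving an AUC of 5/6; a random ranking would give about 0.5. The pairwise loop is quadratic in the number of genes, which is acceptable for a sketch but would normally be replaced by a sort-based computation.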
Fig. 1.

Comparison of feature selection methods. (A and B) Accuracy in identifying DE genes in simulated data. (C and D) Reproducibility of features across five mouse embryo and four human pancreas datasets. (E and F) Average fold-change in expression of reproducible features. (C–F) Each point represents a pair of datasets and the horizontal lines indicate the mean across all pairs. PCA scored genes by their loadings for the top components, Gini is the method used by GiniClust (Jiang et al.), Cons is the consensus across all other methods.

Both dropout-based feature selection methods, NBDrop and M3Drop, performed significantly better than variance-based feature selection on the ZINB simulations, as did the consensus features (Fig. 1A). Furthermore, NBDrop and consensus features significantly outperformed other feature selection methods on LS-NB simulations. Notably, the popular HVG method was only marginally better than random chance (AUC < 0.6). One potential disadvantage of dropout-based feature selection is that it may be unable to detect highly expressed genes, since these may have no dropouts even when they are DE across cell populations. However, when we binned data by expression level, we found that the performance of dropout-based feature selection only dropped below that of variance-based feature selection for the top 5% most highly expressed genes in our simulations (Supplementary Fig. S4). This corresponds to a mean expression level of >1000 reads/cell or >64 UMIs/cell, which is rare in most datasets.

To demonstrate unsupervised feature selection in real datasets, we considered five datasets examining early mouse embryo development and four datasets examining human pancreas (Supplementary Table S1). Since these datasets are derived from the same biological system, we expect the most significant features to be reproducible (Fig. 1C and D). To ensure reproducibility was not due to technical biases, we also considered the magnitude of the log-fold-change in expression across the annotated cell types for the selected features (Fig. 1E and F). Both dropout-based feature selection methods were more reproducible and identified genes with larger fold changes than other methods in the mouse embryo datasets (Fig. 1C and E). Reproducibility was highly dependent on the paired datasets for the pancreas data, since half are UMI-tagged data and half are full-transcript data. However, genes identified by NBDrop had larger fold changes than those of the other methods, except for PCA. This result is consistent with our previous findings that high-dropout genes are superior to high-variance genes for mapping across datasets (Kiselev et al.). The advantage of dropout-based feature selection holds for both discrete clustering and pseudotime analysis, as it is the only method that preserves both distinct biological stages and the developmental trajectory when combining developmental datasets (Supplementary Fig. S8).

Funding

This work has been supported by the Wellcome Trust Sanger Core Funding and the Chan Zuckerberg Initiative DAF, Grant Reference 183501.

Conflict of Interest: none declared.

References

1.  Quantitative single-cell RNA-seq with unique molecular identifiers.

Authors:  Saiful Islam; Amit Zeisel; Simon Joost; Gioele La Manno; Pawel Zajac; Maria Kasper; Peter Lönnerberg; Sten Linnarsson
Journal:  Nat Methods       Date:  2013-12-22       Impact factor: 28.547

2.  Count-based differential expression analysis of RNA sequencing data using R and Bioconductor.

Authors:  Simon Anders; Davis J McCarthy; Yunshun Chen; Michal Okoniewski; Gordon K Smyth; Wolfgang Huber; Mark D Robinson
Journal:  Nat Protoc       Date:  2013-08-22       Impact factor: 13.491

3.  Accounting for technical noise in single-cell RNA-seq experiments.

Authors:  Philip Brennecke; Simon Anders; Jong Kyoung Kim; Aleksandra A Kołodziejczyk; Xiuwei Zhang; Valentina Proserpio; Bianka Baying; Vladimir Benes; Sarah A Teichmann; John C Marioni; Marcus G Heisler
Journal:  Nat Methods       Date:  2013-09-22       Impact factor: 28.547

4.  Smart-seq2 for sensitive full-length transcriptome profiling in single cells.

Authors:  Simone Picelli; Åsa K Björklund; Omid R Faridani; Sven Sagasser; Gösta Winberg; Rickard Sandberg
Journal:  Nat Methods       Date:  2013-09-22       Impact factor: 28.547

5.  Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets.

Authors:  Evan Z Macosko; Anindita Basu; Rahul Satija; James Nemesh; Karthik Shekhar; Melissa Goldman; Itay Tirosh; Allison R Bialas; Nolan Kamitaki; Emily M Martersteck; John J Trombetta; David A Weitz; Joshua R Sanes; Alex K Shalek; Aviv Regev; Steven A McCarroll
Journal:  Cell       Date:  2015-05-21       Impact factor: 41.582

6.  Validation of noise models for single-cell transcriptomics.

Authors:  Dominic Grün; Lennart Kester; Alexander van Oudenaarden
Journal:  Nat Methods       Date:  2014-04-20       Impact factor: 28.547

7.  Bayesian approach to single-cell differential expression analysis.

Authors:  Peter V Kharchenko; Lev Silberstein; David T Scadden
Journal:  Nat Methods       Date:  2014-05-18       Impact factor: 28.547

8.  Massively parallel digital transcriptional profiling of single cells.

Authors:  Grace X Y Zheng; Jessica M Terry; Phillip Belgrader; Paul Ryvkin; Zachary W Bent; Ryan Wilson; Solongo B Ziraldo; Tobias D Wheeler; Geoff P McDermott; Junjie Zhu; Mark T Gregory; Joe Shuga; Luz Montesclaros; Jason G Underwood; Donald A Masquelier; Stefanie Y Nishimura; Michael Schnall-Levin; Paul W Wyatt; Christopher M Hindson; Rajiv Bharadwaj; Alexander Wong; Kevin D Ness; Lan W Beppu; H Joachim Deeg; Christopher McFarland; Keith R Loeb; William J Valente; Nolan G Ericson; Emily A Stevens; Jerald P Radich; Tarjei S Mikkelsen; Benjamin J Hindson; Jason H Bielas
Journal:  Nat Commun       Date:  2017-01-16       Impact factor: 14.919

9.  ZIFA: Dimensionality reduction for zero-inflated single-cell gene expression analysis.

Authors:  Emma Pierson; Christopher Yau
Journal:  Genome Biol       Date:  2015-11-02       Impact factor: 13.583

10.  GiniClust: detecting rare cell types from single-cell gene expression data with Gini index.

Authors:  Lan Jiang; Huidong Chen; Luca Pinello; Guo-Cheng Yuan
Journal:  Genome Biol       Date:  2016-07-01       Impact factor: 13.583
