Literature DB >> 30445607

pyNVR: investigating factors affecting feature selection from scRNA-seq data for lineage reconstruction.

Bob Chen^1,2, Charles A Herring^1,3, Ken S Lau^1,2,3.

Abstract

MOTIVATION: The emergence of single-cell RNA-sequencing has enabled analyses that leverage transitioning cell states to reconstruct pseudotemporal trajectories. Multidimensional data sparsity, zero inflation and technical variation necessitate the selection of high-quality features that feed downstream analyses. Despite the development of numerous algorithms for the unsupervised selection of biologically relevant features, their differential performance remains largely unaddressed.
RESULTS: We implemented the neighborhood variance ratio (NVR) feature selection approach as a Python package with substantial improvements in performance. In comparing NVR with multiple unsupervised algorithms such as dpFeature, we observed striking differences in features selected. We present evidence that quantifiable dataset properties have observable and predictable effects on the performance of these algorithms.
AVAILABILITY AND IMPLEMENTATION: pyNVR is freely available at https://github.com/KenLauLab/NVR. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Entities: Chemical

Mesh：

Substances：
RNA, Small Cytoplasmic

Year: 2019 PMID： 30445607 PMCID： PMC6596893 DOI： 10.1093/bioinformatics/bty950

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.931

1 Introduction

Complex tissue systems consist of heterogeneous cell populations, and single-cell RNA-sequencing (scRNA-seq) is capable of extracting transcriptomic information while preserving this complexity. Each cell and its respective state become a high-dimensional data point (Tang ). These data encompass nuanced transitional cell states, and through pseudotemporal trajectory reconstruction, these transitional states can be ordered to describe developmental dynamics (Trapnell ). Computational techniques such as density-dependent k-Nearest Neighbors (k-NN) network traversal, minimum spanning trees and reverse graph embedding are examples of trajectory reconstruction approaches utilizing high-dimensional data (Herring ,b; Qiu , 2017). However, these algorithmic approaches perform best when provided with high-quality features, which are often confounded by artifacts such as gene dropouts, resulting from incomplete transcriptomic sampling, and stochasticity, arising from the amplification of single-cell scale reaction materials (Kim and Marioni, 2013). Detectable cell-to-cell variation can also originate from stochastic gene expression, where an underlying level of randomness is captured at the time of sample processing (Elowitz, 2002; Raj and van Oudenaarden, 2008). These sources of variation necessitate machine learning strategies for the selection of biologically meaningful features (Herring ). Unsupervised feature selection algorithms such as neighborhood variance ratio (NVR), dpFeature (dpF), FindVariableGenes (FVG) and PCA-Based Feature Extraction (PCAFE) are distinct strategies for achieving this goal in the context of pseudotemporal analysis (Butler ; Qiu ; Taguchi, 2018; Welch ). Although feature selection is essential, the assumptions and performance of these algorithms have not been systematically evaluated, confounding the applicability of these methods to different datasets. Here, we examine an underlying characteristic of high-dimensional data that interacts with these algorithms with a focus on NVR and dpFeature.

2 Software description

2.1 Implementation

pyNVR (Welch ) was implemented in Python 2.7 with 3.7 compatibility. It is available through Github with installation instructions for multiple operating systems and tutorials with example datasets (https://github.com/KenLauLab/NVR). This package works as an unsupervised feature selection pipeline on quality controlled scRNA-seq data. scRNA-seq count or FPKM data act as inputs into this pipeline. The pseudocode for NVR is documented (Supplementary Material S1.1). We benchmarked our Python implementation of NVR against a previous R implementation across multiple machines.

2.2 Subject datasets

Datasets used in this study can be found on the NCBI GEO repository as GSE102698, GSE52529 and GSE60781 (Herring ; Schlitzer ; Trapnell ). These data were generated using different platforms and biological contexts (Supplementary Table S1).

2.3 Comparative performance

To compare the performance of NVR and dpFeature (Supplementary Material S1.2) in a controlled manner, we examined GSE102698, a dataset generated from the colonic epithelium, and sampled the data in two ways. First, we sampled the dataset to different cell numbers. We surmised that relationships within high-density subspaces would be more robust to data sampling, thus affecting how extensively their metric space neighborhoods are defined. Second, we directly controlled the distribution of the data by imposing closeness centrality thresholds on cell sampling (Supplementary Material S1.3, Supplementary Fig. S1). Closeness centrality, as defined by the normalized sum of the length of the shortest paths between a given node and all other nodes, is calculated on a density-based k-NN graph (Lever ; Pearson, 1901). Nodes with low closeness represent disjointed cells comprising separable cell subpopulations, while those with high closeness occupy more central and inseparable positions within the graph. As a correlative output, we calculated gene set similarities by taking their Jaccard indices, describing divergence between algorithm performance (Levandowsky and Winter, 1971) (Supplementary Material S1.4). This workflow is further described in Supplementary Figure S2.

2.4 Pseudotemporal analysis and gene ontology term enrichment

Given that cell closeness is a controllable parameter affecting data distribution, we further investigated its effects on downstream analyses. We performed pseudotemporal analysis on GSE102698 using feature-selected gene sets with p-Creode (Herring ) (Supplementary Material S1.5). To better understand the gene sets each algorithm selected for, we performed gene ontology (GO) term enrichment analysis using WebGestalt as described by the Zhang group (Supplementary Material S1.6) (Wang ).

3 Results

We noted significant improvements in speed, with an average 14-fold decrease in runtime, when comparing our Python implementation against the R implementation (Fig. 1A, Supplementary Table S2). In applying NVR and dpFeature to distinct scRNA-seq datasets, we observed striking inconsistencies in the genes selected (Fig. 1B, Supplementary Table S1). We observed significant, positive linear relationships (Supplementary Table S3) between gene set Jaccard index, cell number (≤ 2.2e-16, Fig. 1C), and cell closeness sampling thresholds (P = 1.227e-06, Fig. 1D). We observed another performance divergence in examining their robustness, as defined by the resistance to performance decay given decreasing input data quality (Supplementary Fig. S3). Given these gene sets, we also observed distinct differences in GO term enrichment (Supplementary Fig. S4, Supplementary Table S4). Downstream p-Creode trajectory reconstruction using low closeness dpFeature gene sets resulted in the underrepresentation of stem-like developmental projections (Fig. 1E). Additionally, we analyzed two more methods in parallel, findVariableGenes and PCAFE (Supplementary Material S1.7–S1.8), and observed significant differences in performance (Supplementary Figs S5–S7).

Fig. 1.

Evaluation of pyNVR performance. (A) Fold difference in runtime between the Python and R implementations of NVR. (B) Gene set similarity given different datasets. (C) Gene set similarity and its relationship with cell number. (D) Gene set similarity and its relationship with closeness threshold-imposed sampling. (E) Representative p-Creode graphs generated using genes selected from closeness-thresholded samples. Heatmap overlay and gating depicts Myc and putative stem-like cell states, respectively

4 Discussion

We reasoned that the algorithms examined were differentially impacted by data distributions, through examinations of cell number and closeness centrality. NVR and dpFeature rely on calculating metric space neighbors, in the context of graphs and t-Distributed Stochastic Neighbor Embedding (t-SNE) embeddings, respectively (Qiu ; van der Maaten and Hinton, 2008; Welch ). Both cell number and closeness affect the calculation of neighborhood-defining radial distances. NVR uses these distances to generate a k-NN graph. t-SNE, instead, uses these distances to calculate a probabilistic representation of neighborhoods. Beyond neighborhood representations, the algorithms also have unique selection criteria.

5 Conclusion

We created an accessible and significantly faster implementation of NVR feature selection. We present evidence suggesting that the performance of different unsupervised feature selection algorithms diverge based on dataset properties such as cell number and closeness. Downstream pseudotemporal or gene ontological analyses are demonstrably affected by the feature selection algorithm used. Click here for additional data file.

13 in total

1. Stochastic gene expression in a single cell.

Authors: Michael B Elowitz; Arnold J Levine; Eric D Siggia; Peter S Swain
Journal: Science Date: 2002-08-16 Impact factor: 47.728

2. Identification of cDC1- and cDC2-committed DC progenitors reveals early lineage priming at the common DC progenitor stage in the bone marrow.

Authors: Andreas Schlitzer; V Sivakamasundari; Jinmiao Chen; Hermi Rizal Bin Sumatoh; Jaring Schreuder; Josephine Lum; Benoit Malleret; Sanqian Zhang; Anis Larbi; Francesca Zolezzi; Laurent Renia; Michael Poidinger; Shalin Naik; Evan W Newell; Paul Robson; Florent Ginhoux
Journal: Nat Immunol Date: 2015-06-08 Impact factor: 25.606

Review 3. Nature, nurture, or chance: stochastic gene expression and its consequences.

Authors: Arjun Raj; Alexander van Oudenaarden
Journal: Cell Date: 2008-10-17 Impact factor: 41.582

4. Integrating single-cell transcriptomic data across different conditions, technologies, and species.

Authors: Andrew Butler; Paul Hoffman; Peter Smibert; Efthymia Papalexi; Rahul Satija
Journal: Nat Biotechnol Date: 2018-04-02 Impact factor: 54.908

5. Inferring the kinetics of stochastic gene expression from single-cell RNA-sequencing data.

Authors: Jong Kyoung Kim; John C Marioni
Journal: Genome Biol Date: 2013-01-28 Impact factor: 13.583

6. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells.

Authors: Cole Trapnell; Davide Cacchiarelli; Jonna Grimsby; Prapti Pokharel; Shuqiang Li; Michael Morse; Niall J Lennon; Kenneth J Livak; Tarjei S Mikkelsen; John L Rinn
Journal: Nat Biotechnol Date: 2014-03-23 Impact factor: 54.908

7. WebGestalt 2017: a more comprehensive, powerful, flexible and interactive gene set enrichment analysis toolkit.

Authors: Jing Wang; Suhas Vasaikar; Zhiao Shi; Michael Greer; Bing Zhang
Journal: Nucleic Acids Res Date: 2017-07-03 Impact factor: 16.971

8. SLICER: inferring branched, nonlinear cellular trajectories from single cell RNA-seq data.

Authors: Joshua D Welch; Alexander J Hartemink; Jan F Prins
Journal: Genome Biol Date: 2016-05-23 Impact factor: 13.583

9. Reversed graph embedding resolves complex single-cell trajectories.

Authors: Xiaojie Qiu; Qi Mao; Ying Tang; Li Wang; Raghav Chawla; Hannah A Pliner; Cole Trapnell
Journal: Nat Methods Date: 2017-08-21 Impact factor: 47.990

Review 10. Single-Cell Computational Strategies for Lineage Reconstruction in Tissue Systems.

Authors: Charles A Herring; Bob Chen; Eliot T McKinley; Ken S Lau
Journal: Cell Mol Gastroenterol Hepatol Date: 2018-02-13

5 in total

1. Single cell RNA-sequencing reveals cellular heterogeneity and trajectories of lineage specification during murine embryonic limb development.

Authors: Natalie H Kelly; Nguyen P T Huynh; Farshid Guilak
Journal: Matrix Biol Date: 2019-12-24 Impact factor: 11.583

Review 2. Use of Single-Cell -Omic Technologies to Study the Gastrointestinal Tract and Diseases, From Single Cell Identities to Patient Features.

Authors: Mirazul Islam; Bob Chen; Jeffrey M Spraggins; Ryan T Kelly; Ken S Lau
Journal: Gastroenterology Date: 2020-05-14 Impact factor: 22.682