Gennaro Gambardella, Diego di Bernardo.
Abstract
Gene expression in individual cells can now be measured for thousands of cells in a single experiment thanks to innovative sample-preparation and sequencing technologies. State-of-the-art computational pipelines for single-cell RNA-sequencing (scRNA-seq) data, however, still employ methods that were developed for traditional bulk RNA-sequencing data and therefore do not account for the peculiarities of single-cell data, such as sparseness and zero-inflated counts. Here, we present a ready-to-use pipeline named gf-icf (gene frequency-inverse cell frequency) for normalization of raw counts, feature selection, and dimensionality reduction of scRNA-seq data for visualization and subsequent analyses. Our work is based on a data transformation model named term frequency-inverse document frequency (TF-IDF), which has been extensively used in the field of text mining, where extremely sparse and zero-inflated data are common. Using benchmark scRNA-seq datasets, we show that the gf-icf pipeline outperforms existing state-of-the-art methods in terms of improved visualization and ability to separate and distinguish different cell types.
Keywords: cell type; enrichment analysis; feature extraction; single-cell transcriptomics; term frequency–inverse document frequency
Year: 2019 PMID: 31447887 PMCID: PMC6696874 DOI: 10.3389/fgene.2019.00734
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. GF-ICF improves visualization of single-cell RNA-sequencing data. (A) The gf-icf pipeline. Starting from the transcriptional profiles of a set of cells C1…CN, the pipeline consists of the following steps: (i) normalization of the gene expression profile of each cell to sum to one (GF step); (ii) cross-cell normalization, to score rarely expressed genes higher than commonly expressed genes (ICF step); (iii) L2 normalization of each cell to obtain normalized gf-icf weights; and (iv) principal component analysis (PCA) to reduce the number of feature (gene) dimensions before (v) projecting cells into an embedded space. (B) Comparison between t-SNE projections obtained with the gf-icf pipeline (left) and the Seurat tool (right) on single-cell transcriptional profiles of 40k human PBMCs. Cells are colored according to their cell type of origin identified by FACS analysis by Grace et al. (C) Average Euclidean distance among PBMCs of the same type using either the gf-icf pipeline or the Seurat tool. (D) Distribution of the average Euclidean distance among PBMCs of the same type using either the gf-icf pipeline or the Seurat tool. Legend: TCY, cytotoxic T-cells; TH, helper T-cells; TREG, regulatory T-cells; TMEM, memory T-cells; TNCY, naïve cytotoxic T-cells; TN, naïve T-cells; NK cells, natural killer cells.
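Steps (i) to (iv) of panel (A) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `gficf`, the smoothed ICF form log((N + 1) / (n_g + 1)) + 1 (borrowed from a common TF-IDF variant), and the default of two components are all assumptions, and the final embedding step (v) is left out.

```python
import numpy as np

def gficf(counts, n_components=2):
    """Hypothetical sketch of a gf-icf style transform.

    counts: array of shape (cells, genes) holding raw, non-negative counts.
    Returns the L2-normalized weights and the PCA scores of each cell.
    """
    counts = np.asarray(counts, dtype=float)
    n_cells = counts.shape[0]

    # (i) GF step: scale each cell's profile so that it sums to one.
    cell_totals = counts.sum(axis=1, keepdims=True)
    gf = counts / np.where(cell_totals == 0, 1.0, cell_totals)

    # (ii) ICF step: genes detected in few cells receive larger weights.
    # ASSUMPTION: smoothed form log((N + 1) / (n_g + 1)) + 1.
    cells_per_gene = (counts > 0).sum(axis=0)
    icf = np.log((n_cells + 1) / (cells_per_gene + 1)) + 1.0
    weights = gf * icf

    # (iii) L2-normalize each cell (row) to unit length.
    norms = np.linalg.norm(weights, axis=1, keepdims=True)
    weights = weights / np.where(norms == 0, 1.0, norms)

    # (iv) PCA via SVD of the gene-centered matrix.
    centered = weights - weights.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    return weights, scores

# Toy input: 4 cells x 3 genes of made-up counts.
toy_counts = np.array([[5, 0, 1],
                       [0, 3, 1],
                       [2, 2, 0],
                       [0, 0, 4]])
toy_weights, toy_scores = gficf(toy_counts, n_components=2)
```

The dense SVD is only practical for small matrices; for tens of thousands of cells, a sparse matrix representation and a truncated decomposition would presumably be needed.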
Figure 2. Relevant genes extracted from the gf-icf pipeline enable cell type prediction. (A) Pipeline for the identification of cell type using gf-icf. Single-cell transcriptional profiles are normalized by gf-icf in order to score genes in each single cell, and then (i) cells are projected with t-SNE into a two-dimensional space; (ii) cells are divided into small groups using Louvain–Jaccard clustering; (iii) the gene signature of each cluster is identified; and (iv) the signature is used to predict the cell type of origin by gene set enrichment analysis (GSEA) against a set of bulk transcriptomic data of pure cell types. (B) Comparison between FACS-sorted cell type (left) and predicted cell type (right) of about 40k PBMCs. (C) Cell type prediction accuracy as a percentage of correctly predicted cells using either gf-icf or normalized counts. (D) Distribution of cell type prediction accuracy using either gf-icf or normalized counts. (E) Adjusted Rand index of cell type prediction using either gf-icf or normalized counts. (F) Expression of CD14, CD16, and CD34 marker genes for a small subpopulation of HSCs predicted instead to be monocytes and macrophages.
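The graph-building half of step (ii) in panel (A) can be illustrated as follows. This sketch assumes that "Louvain–Jaccard" means a k-nearest-neighbor graph whose edges are weighted by the Jaccard similarity of the two endpoints' neighbor sets; the function name `jaccard_knn_edges` and the value of `k` are hypothetical, and the Louvain community-detection step that would run on the resulting graph is omitted.

```python
import numpy as np

def jaccard_knn_edges(points, k=2):
    """Hypothetical sketch: Jaccard-weighted k-nearest-neighbor graph.

    points: array of shape (cells, dims), e.g. embedded coordinates.
    Returns a dict mapping an undirected edge (a, b), a < b, to its weight.
    """
    points = np.asarray(points, dtype=float)
    n = points.shape[0]

    # Pairwise squared Euclidean distances (dense, small inputs only).
    diff = points[:, None, :] - points[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    np.fill_diagonal(dist, np.inf)  # a cell is not its own neighbor

    # Indices of the k closest cells for every cell.
    neighbors = np.argsort(dist, axis=1)[:, :k]
    neighbor_sets = [set(row.tolist()) for row in neighbors]

    edges = {}
    for i in range(n):
        for j in neighbor_sets[i]:
            a, b = (i, j) if i < j else (j, i)
            if (a, b) in edges:
                continue
            # Jaccard similarity: shared neighbors over all neighbors.
            shared = len(neighbor_sets[a] & neighbor_sets[b])
            total = len(neighbor_sets[a] | neighbor_sets[b])
            edges[(a, b)] = shared / total
    return edges

# Toy input: two well-separated groups of three points each.
toy_points = np.array([[0, 0], [0, 1], [1, 0],
                       [10, 10], [10, 11], [11, 10]])
toy_edges = jaccard_knn_edges(toy_points, k=2)
```

On this toy input every edge stays within one of the two groups, which is the property a community-detection method would then exploit to separate them.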