Chunxuan Shao (1,2), Thomas Höfer (1,2)
1. Division of Theoretical Systems Biology, German Cancer Research Center (DKFZ), 69120 Heidelberg, Germany.
2. Bioquant Center, University of Heidelberg, 69120 Heidelberg, Germany.
Abstract
MOTIVATION: Single-cell transcriptome data provide unprecedented resolution for studying heterogeneity in cell populations, but they pose a challenge for unsupervised classification. Popular methods such as principal component analysis (PCA) often suffer from the high level of noise in the data.
RESULTS: Here we adapt nonnegative matrix factorization (NMF) to the problem of identifying subpopulations in single-cell transcriptome data. In contrast to the conventional gene-centered view of NMF, which identifies metagenes, we apply NMF in a cell-centered direction to identify cell subtypes ('metacells'). Using three different datasets (based on RT-qPCR and single-cell RNA-seq data), we show that NMF outperforms PCA in identifying subpopulations accurately and robustly, without the need for prior feature selection. Moreover, on a large dataset (thousands of single-cell transcriptomes), NMF successfully recovered the broad classes identified by a computationally sophisticated method. NMF also allows feature genes to be identified in a direct, unbiased manner. We propose novel approaches for determining a biologically meaningful number of subpopulations based on minimizing the ambiguity of classification. In conclusion, our study shows that NMF is a robust, informative and simple method for the unsupervised learning of cell subtypes from single-cell gene expression data.
AVAILABILITY AND IMPLEMENTATION: https://github.com/ccshao/nimfa
CONTACTS: c.shao@Dkfz-Heidelberg.de or t.hoefer@Dkfz-Heidelberg.de
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
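The cell-centered use of NMF can be illustrated with a minimal sketch. It is not the authors' implementation (that is in the nimfa repository linked in the abstract); it assumes a nonnegative genes x cells matrix `V`, uses the standard multiplicative updates for the Frobenius-norm objective, and assigns each cell to the factor with the largest coefficient. The function names `nmf_multiplicative` and `assign_cells`, the toy data, and the argmax assignment rule are all assumptions made for illustration.

```python
import numpy as np


def nmf_multiplicative(V, k, n_iter=500, seed=0, eps=1e-9):
    """Approximate a nonnegative matrix V (genes x cells) as W @ H.

    W (genes x k) holds one expression profile per factor ("metacell");
    H (k x cells) holds each cell's coefficients on those factors.
    Uses multiplicative updates for the squared-error objective.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_cells = V.shape
    W = rng.random((n_genes, k)) + eps
    H = rng.random((k, n_cells)) + eps
    for _ in range(n_iter):
        # Each update keeps the factors nonnegative by construction.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    # Rescale so every column of W sums to 1; W @ H is unchanged and the
    # rows of H become comparable with each other.
    scale = W.sum(axis=0)
    return W / scale, H * scale[:, None]


def assign_cells(H):
    """Assumed assignment rule: each cell goes to its largest coefficient."""
    return H.argmax(axis=0)


# Toy data (hypothetical): 6 genes, 10 cells, two subpopulations that
# each express a different block of three genes.
profiles = np.array([[10, 10, 10, 0, 0, 0],
                     [0, 0, 0, 10, 10, 10]], dtype=float).T  # genes x 2
truth = np.array([0] * 5 + [1] * 5)
noise = np.random.default_rng(1).random((6, 10))  # nonnegative noise
V = profiles[:, truth] + noise

W, H = nmf_multiplicative(V, k=2)
labels = assign_cells(H)
print(labels)
```

In this orientation the columns of `W` play the role of metacell profiles, so genes with large weights in one column and small weights in the others are natural candidates for feature genes. The factor indices are arbitrary, so the labels are only defined up to a permutation.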