Xin Wei (1), Ziyi Li (2), Hongkai Ji (3), Hao Wu (1)
(1) Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA 30322, USA.
(2) Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA.
(3) Department of Biostatistics, Johns Hopkins University, Baltimore, MD 21205, USA.
Abstract
MOTIVATION: Single-cell RNA sequencing (scRNA-seq) has revolutionized biological research by enabling the measurement of transcriptomic profiles at the single-cell level. With the increasing application of scRNA-seq in larger-scale studies, the problem of appropriately clustering cells emerges when the scRNA-seq data are from multiple subjects. One challenge is the subject-specific variation; systematic heterogeneity from multiple subjects may have a significant impact on clustering accuracy. Existing methods seeking to address such effects suffer from several limitations.

RESULTS: We develop a novel statistical method, EDClust, for multi-subject scRNA-seq cell clustering. EDClust models the sequence read counts by a mixture of Dirichlet-multinomial distributions and explicitly accounts for cell-type heterogeneity, subject heterogeneity and clustering uncertainty. An EM-MM hybrid algorithm is derived for maximizing the data likelihood and clustering the cells. We perform a series of simulation studies to evaluate the proposed method and demonstrate the outstanding performance of EDClust. Comprehensive benchmarking on four real scRNA-seq datasets with various tissue types and species demonstrates the substantial accuracy improvement of EDClust compared to existing methods.

AVAILABILITY AND IMPLEMENTATION: The R package is freely available at https://github.com/weix21/EDClust.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
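To make the modelling idea in RESULTS concrete, the following is a minimal sketch of a Dirichlet-multinomial mixture and the posterior cluster probabilities that an EM-style E-step would compute for one cell. It is an illustration under stated assumptions, not the EDClust implementation: the function names `dm_log_pmf` and `responsibilities`, the toy parameter values, and the three-gene example are all hypothetical, and the subject-specific effects and the MM update for the Dirichlet parameters described in the abstract are deliberately omitted.

```python
from math import lgamma, log, exp


def dm_log_pmf(counts, alpha):
    """Log-probability of a count vector under a Dirichlet-multinomial.

    counts: non-negative integer read counts, one per gene.
    alpha:  positive Dirichlet concentration parameters, one per gene.
    """
    n = sum(counts)
    a = sum(alpha)
    # Multinomial coefficient numerator plus the ratio of Dirichlet normalizers.
    out = lgamma(n + 1) + lgamma(a) - lgamma(n + a)
    for x, al in zip(counts, alpha):
        out += lgamma(x + al) - lgamma(al) - lgamma(x + 1)
    return out


def responsibilities(counts, weights, alphas):
    """Posterior probability of each cluster for one cell (an E-step quantity).

    weights: mixing proportions, one per cluster (assumed to sum to 1).
    alphas:  one Dirichlet parameter vector per cluster.
    """
    log_post = [log(w) + dm_log_pmf(counts, al)
                for w, al in zip(weights, alphas)]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_post)
    unnorm = [exp(v - m) for v in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Toy example (hypothetical values): two clusters over three genes.
alphas = [[8.0, 1.0, 1.0],   # cluster 0 favours gene 0
          [1.0, 1.0, 8.0]]   # cluster 1 favours gene 2
weights = [0.5, 0.5]
cell = [9, 1, 0]             # reads concentrated on gene 0
post = responsibilities(cell, weights, alphas)
```

Because the posterior is a probability vector rather than a hard label, a sketch like this also shows one way a mixture model can express clustering uncertainty: a cell whose counts fit two clusters comparably well receives comparable weights for both.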