Emanuele Aliverti (1), Jeffrey L Tilson (2), Dayne L Filer (2,3), Benjamin Babcock (3,4), Alejandro Colaneri (3), Jennifer Ocasio (4,5), Timothy R Gershon (4,5,6,7), Kirk C Wilhelmsen (2,3,4), David B Dunson (8).
1. Department of Statistical Sciences, University of Padova, Padova 35121, Italy.
2. RENCI, University of North Carolina, Chapel Hill, NC 27517, USA.
3. Department of Genetics.
4. Department of Neurology.
5. UNC Neuroscience Center.
6. Carolina Institute for Developmental Disabilities.
7. Lineberger Comprehensive Cancer Center, University of North Carolina School of Medicine, Chapel Hill, NC 27599, USA.
8. Department of Statistical Science, Duke University, Durham, NC 27708, USA.
Abstract
MOTIVATION: Low-dimensional representations of high-dimensional data are routinely employed in biomedical research to visualize, interpret and communicate results from different pipelines. Without correction, batch effects can obscure interesting structure in the data. In this article, we propose a novel procedure to directly estimate t-SNE embeddings that are not driven by batch effects. The proposed algorithm can therefore significantly aid visualization of high-dimensional data.

RESULTS: The proposed methods are based on linear algebra and constrained optimization, leading to efficient algorithms and fast computation in many high-dimensional settings. Results on artificial single-cell transcription profiling data show that the proposed procedure successfully removes multiple batch effects from t-SNE embeddings while retaining fundamental information on cell types. When applied to single-cell gene expression data from mouse medulloblastoma, the proposed method successfully removes batch effects associated with mouse identifiers and the date of the experiment, while preserving clusters of oligodendrocytes, astrocytes, endothelial cells and microglia, which are expected to lie in the stroma within or adjacent to the tumours.

AVAILABILITY AND IMPLEMENTATION: Source code implementing the proposed approach is available as an R package at https://github.com/emanuelealiverti/BC_tSNE, including a tutorial to reproduce the simulation studies.

CONTACT: aliverti@stat.unipd.it.