Miroslav Kratochvíl [1,2], Oliver Hunewald [3], Laurent Heirendt [4], Vasco Verissimo [4], Jiří Vondrášek [1], Venkata P Satagopam [4,5], Reinhard Schneider [4,5], Christophe Trefois [4,5], Markus Ollert [3,6].
1. Institute of Organic Chemistry and Biochemistry, Flemingovo náměstí 542/2, 160 00 Prague, Czech Republic.
2. Charles University, Department of Software Engineering, Malostranské náměstí 25, 118 00 Prague, Czech Republic.
3. Luxembourg Institute of Health, Department of Infection and Immunity, 29 rue Henri Koch, L-4354 Esch-sur-Alzette, Luxembourg.
4. University of Luxembourg, Luxembourg Centre for Systems Biomedicine, 6 avenue du Swing, Campus Belval, L-4367 Belvaux, Luxembourg.
5. ELIXIR Luxembourg, University of Luxembourg, 6 avenue du Swing, Campus Belval, L-4367 Belvaux, Luxembourg.
6. Odense Research Center for Anaphylaxis, Department of Dermatology and Allergy Center, Odense University Hospital, University of Southern Denmark, Kløvervænget 15, DK-5000 Odense C, Denmark.
Abstract
BACKGROUND: The amount of data generated in large clinical and phenotyping studies that use single-cell cytometry is constantly growing. Recent technological advances allow the easy generation of datasets with hundreds of millions of single-cell data points with >40 parameters, originating from thousands of individual samples. Analyzing that amount of high-dimensional data places heavy demands on both the hardware and the software of high-performance computational resources. Current software tools often do not scale to datasets of such size; users are thus forced to downsample the data to manageable sizes, in turn losing accuracy and the ability to detect many underlying complex phenomena.

RESULTS: We present GigaSOM.jl, a fast and scalable implementation of clustering and dimensionality reduction for flow and mass cytometry data. The implementation of GigaSOM.jl in the high-level and high-performance programming language Julia makes it accessible to the scientific community and allows for efficient handling and processing of datasets with billions of data points using distributed computing infrastructures. We describe the design of GigaSOM.jl, measure its performance and horizontal scaling capability, and showcase the functionality on a large dataset from a recent study.

CONCLUSIONS: GigaSOM.jl facilitates the use of commonly available high-performance computing resources to process the largest available datasets within minutes, while producing results of the same quality as the current state-of-the-art software. Measurements indicate that the performance scales to much larger datasets. The example use on the data from a massive mouse phenotyping effort confirms the applicability of GigaSOM.jl to huge-scale studies.