Ben Blamey1, Salman Toor1, Martin Dahlö2,3, Håkan Wieslander1, Philip J Harrison2,3, Ida-Maria Sintorn1,3,4, Alan Sabirsh5, Carolina Wählby1,3, Ola Spjuth2,3, Andreas Hellander1. 1. Department of Information Technology, Uppsala University, Lägerhyddsvägen 2, 75237 Uppsala, Sweden. 2. Department of Pharmaceutical Biosciences, Uppsala University, Husargatan 3, 75237, Uppsala, Sweden. 3. Science for Life Laboratory, Uppsala University, Husargatan 3, 75237 Uppsala, Sweden. 4. Vironova AB, Gävlegatan 22, 11330 Stockholm, Sweden. 5. Advanced Drug Delivery, Pharmaceutical Sciences, R&D, AstraZeneca, Pepparedsleden 1, 43183 Mölndal, Sweden.
Abstract
BACKGROUND: Large streamed datasets, characteristic of life science applications, are often resource-intensive to process, transport, and store. We propose a pipeline model, a design pattern for scientific pipelines, in which an incoming stream of scientific data is organized into a tiered or ordered "data hierarchy". We introduce the HASTE Toolkit, a proof-of-concept cloud-native software toolkit based on this pipeline model, which partitions and prioritizes data streams to optimize the use of limited computing resources. FINDINGS: In our pipeline model, an "interestingness function" assigns an interestingness score to each data object in the stream, inducing a data hierarchy. Based on this score, a "policy" guides decisions on how to prioritize computational resource use for a given object. The HASTE Toolkit is a collection of tools for adopting this approach. We evaluate it with 2 microscopy imaging case studies. The first is a high-content screening experiment, in which images are analyzed in an on-premise container cloud to prioritize storage and subsequent computation. The second considers edge processing of images for upload into the public cloud, for real-time control of a transmission electron microscope. CONCLUSIONS: Through our evaluation, we created smart data pipelines capable of effective use of storage, compute, and network resources, enabling more efficient data-intensive experiments. We note a beneficial separation between the scientific concerns of data priority and the implementation of this behaviour for different resources in different deployment contexts. The toolkit allows intelligent prioritization to be "bolted on" to new and existing systems, and is intended for use with a range of technologies in different deployment scenarios.
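The pipeline model described in the abstract can be illustrated with a minimal Python sketch. This is not the HASTE Toolkit API: the `DataObject` type, the variance-based `variance_interestingness` function, the `tier_policy` thresholds, and the tier names are all hypothetical and chosen only for illustration. The sketch shows the separation the abstract highlights, in which the scoring function captures the scientific notion of priority while the policy decides how resources are spent.

```python
from dataclasses import dataclass
from statistics import pvariance
from typing import Callable, Iterable, Iterator, Sequence, Tuple


@dataclass
class DataObject:
    """Hypothetical stand-in for one object in the stream (e.g. an image)."""
    name: str
    pixels: Sequence[float]


def variance_interestingness(obj: DataObject) -> float:
    """Hypothetical interestingness function.

    Uses pixel variance as a crude proxy for image content and squashes
    it into the range [0, 1). A real function would be domain-specific.
    """
    v = pvariance(obj.pixels)
    return v / (1.0 + v)


def tier_policy(score: float) -> str:
    """Hypothetical policy: map a score to a resource decision.

    The thresholds are arbitrary and exist only for this illustration.
    """
    if score >= 0.75:
        return "process_now"
    if score >= 0.25:
        return "store"
    return "discard"


def prioritize(
    stream: Iterable[DataObject],
    interestingness: Callable[[DataObject], float],
    policy: Callable[[float], str],
) -> Iterator[Tuple[str, float, str]]:
    """Score each object as it arrives and let the policy pick its tier."""
    for obj in stream:
        score = interestingness(obj)
        yield obj.name, score, policy(score)


if __name__ == "__main__":
    stream = [
        DataObject("flat", [5.0, 5.0, 5.0, 5.0]),
        DataObject("mid", [0.0, 2.0, 0.0, 2.0]),
        DataObject("noisy", [0.0, 10.0, 0.0, 10.0]),
    ]
    for name, score, tier in prioritize(stream, variance_interestingness, tier_policy):
        print(f"{name}: score={score:.3f} -> {tier}")
```

Because `prioritize` takes the scoring function and the policy as separate arguments, either one can be swapped without touching the other, for example keeping the same interestingness function while using a different policy for storage than for network upload.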