Rapid development of cloud-native intelligent data pipelines for scientific data streams using the HASTE Toolkit.

Ben Blamey, Salman Toor, Martin Dahlö, Håkan Wieslander, Philip J Harrison, Ida-Maria Sintorn, Alan Sabirsh, Carolina Wählby, Ola Spjuth, Andreas Hellander.

Abstract

BACKGROUND: Large streamed datasets, characteristic of life science applications, are often resource-intensive to process, transport and store. We propose a pipeline model, a design pattern for scientific pipelines, where an incoming stream of scientific data is organized into a tiered or ordered "data hierarchy". We introduce the HASTE Toolkit, a proof-of-concept cloud-native software toolkit based on this pipeline model, to partition and prioritize data streams to optimize use of limited computing resources.
FINDINGS: In our pipeline model, an "interestingness function" assigns an interestingness score to each data object in the stream, inducing a data hierarchy. Based on this score, a "policy" guides decisions on how to prioritize computational resource use for a given object. The HASTE Toolkit is a collection of tools for adopting this approach. We evaluate the toolkit with 2 microscopy imaging case studies. The first is a high-content screening experiment, where images are analyzed in an on-premise container cloud to prioritize storage and subsequent computation. The second considers edge processing of images for upload into the public cloud for real-time control of a transmission electron microscope.
CONCLUSIONS: Through our evaluation, we created smart data pipelines capable of effective use of storage, compute, and network resources, enabling more efficient data-intensive experiments. We note a beneficial separation between the scientific concerns of data priority and the implementation of this behaviour for different resources in different deployment contexts. The toolkit allows intelligent prioritization to be "bolted on" to new and existing systems, and is intended for use with a range of technologies in different deployment scenarios.
© The Author(s) 2021. Published by Oxford University Press GigaScience.
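
The pipeline model described in the abstract (an interestingness function scores each object, and a policy turns the score into a resource decision) can be illustrated with a minimal sketch. This is not the HASTE Toolkit API: the names `DataObject`, `interestingness`, `tier_policy`, and `prioritize`, the byte-counting score, the thresholds, and the tier labels are all hypothetical and chosen only to make the idea concrete.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Tuple


@dataclass
class DataObject:
    """One item in the incoming stream (hypothetical structure)."""
    name: str
    payload: bytes


def interestingness(obj: DataObject) -> float:
    """Toy interestingness function: fraction of non-zero bytes, in [0, 1].

    A stand-in only; a real function would encode domain knowledge,
    for example an image quality metric.
    """
    if not obj.payload:
        return 0.0
    return sum(1 for b in obj.payload if b != 0) / len(obj.payload)


def tier_policy(score: float) -> str:
    """Toy policy: map a score to a storage tier using assumed thresholds."""
    if score >= 0.75:
        return "fast"
    if score >= 0.25:
        return "standard"
    return "archive"


def prioritize(
    stream: Iterable[DataObject],
    score_fn: Callable[[DataObject], float] = interestingness,
    policy: Callable[[float], str] = tier_policy,
) -> Iterator[Tuple[str, float, str]]:
    """Score each object as it arrives and let the policy pick its tier."""
    for obj in stream:
        score = score_fn(obj)
        yield obj.name, score, policy(score)


# Usage: three synthetic objects with decreasing shares of non-zero bytes.
stream = [
    DataObject("a", bytes([1, 2, 3, 4])),
    DataObject("b", bytes([0, 0, 1, 1])),
    DataObject("c", bytes([0, 0, 0, 0])),
]
results = list(prioritize(stream))
for name, score, tier in results:
    print(f"{name}: score={score:.2f} tier={tier}")
```

Keeping the scoring function and the policy as separate callables mirrors the separation the abstract notes between the scientific question of data priority and the deployment-specific handling of each resource: either one can be swapped without touching the other.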

Keywords:  HASTE; image analysis; interestingness functions; stream processing; tiered storage

Year:  2021        PMID: 33739401      PMCID: PMC7976223          DOI: 10.1093/gigascience/giab018

Source DB:  PubMed          Journal:  Gigascience        ISSN: 2047-217X            Impact factor:   6.524


