Rapid development of cloud-native intelligent data pipelines for scientific data streams using the HASTE Toolkit.

Ben Blamey^1, Salman Toor^1, Martin Dahlö^2,3, Håkan Wieslander^1, Philip J Harrison^2,3, Ida-Maria Sintorn^1,3,4, Alan Sabirsh^5, Carolina Wählby^1,3, Ola Spjuth^2,3, Andreas Hellander^1.

Abstract

BACKGROUND: Large streamed datasets, characteristic of life science applications, are often resource-intensive to process, transport and store. We propose a pipeline model, a design pattern for scientific pipelines, where an incoming stream of scientific data is organized into a tiered or ordered "data hierarchy". We introduce the HASTE Toolkit, a proof-of-concept cloud-native software toolkit based on this pipeline model, to partition and prioritize data streams to optimize use of limited computing resources.
FINDINGS: In our pipeline model, an "interestingness function" assigns an interestingness score to data objects in the stream, inducing a data hierarchy. From this score, a "policy" guides decisions on how to prioritize computational resource use for a given object. The HASTE Toolkit is a collection of tools for adopting this approach. We evaluate the toolkit with 2 microscopy imaging case studies. The first is a high content screening experiment, where images are analyzed in an on-premise container cloud to prioritize storage and subsequent computation. The second considers edge processing of images for upload into the public cloud for real-time control of a transmission electron microscope.
CONCLUSIONS: Through our evaluation, we created smart data pipelines capable of effective use of storage, compute, and network resources, enabling more efficient data-intensive experiments. We note a beneficial separation between scientific concerns of data priority, and the implementation of this behaviour for different resources in different deployment contexts. The toolkit allows intelligent prioritization to be "bolted on" to new and existing systems, and is intended for use with a range of technologies in different deployment scenarios.
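The pipeline model described in the abstract (an interestingness function scoring each object, and a policy mapping that score to a resource decision) can be illustrated with a minimal Python sketch. This is not the HASTE Toolkit API: every name here (`DataObject`, `example_interestingness`, `threshold_policy`, `prioritize`), the `sharpness` feature, the tier labels, and the thresholds are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class DataObject:
    """Hypothetical stand-in for one object in the stream (e.g. one image)."""
    name: str
    features: Dict[str, float] = field(default_factory=dict)


def example_interestingness(obj: DataObject) -> float:
    """Assumed scoring rule: use a 'sharpness' feature, clamped to [0, 1]."""
    raw = float(obj.features.get("sharpness", 0.0))
    return max(0.0, min(1.0, raw))


def threshold_policy(score: float, low: float = 0.3, high: float = 0.7) -> str:
    """Assumed policy: map a score to one of three illustrative storage tiers."""
    if score >= high:
        return "tier_fast"
    if score >= low:
        return "tier_standard"
    return "tier_archive"


def prioritize(
    stream: Iterable[DataObject],
    interestingness: Callable[[DataObject], float],
    policy: Callable[[float], str],
) -> Iterator[Tuple[DataObject, float, str]]:
    """Score each object as it arrives and attach the policy's decision."""
    for obj in stream:
        score = interestingness(obj)
        yield obj, score, policy(score)
```

Keeping the scoring function and the policy as separate callables mirrors the separation the abstract notes between the scientific question of data priority and how that priority is acted on in a given deployment: either can be swapped without touching the other.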

Keywords: HASTE; image analysis; interestingness functions; stream processing; tiered storage

Year: 2021
PMID: 33739401
PMCID: PMC7976223
DOI: 10.1093/gigascience/giab018
Source DB: PubMed
Journal: GigaScience
ISSN: 2047-217X
Impact factor: 6.524
