Predicting runtimes of bioinformatics tools based on historical data: five years of Galaxy usage.

Anastasia Tyryshkina, Nate Coraor, Anton Nekrutenko.

Abstract

MOTIVATION: One of the many technical challenges that arise when scheduling bioinformatics analyses at scale is determining the appropriate amount of memory and processing resources. Both over- and under-allocation lead to inefficient use of computational infrastructure. Over-allocation locks resources that could otherwise be used for other analyses. Under-allocation causes job failure and requires analyses to be repeated with a larger memory or runtime allowance. We address this challenge by using a historical dataset of bioinformatics analyses run on the Galaxy platform to demonstrate the feasibility of an online service for resource requirement estimation.
RESULTS: Here we introduce the Galaxy job run dataset and test popular machine learning models on the task of resource usage prediction. We include three popular forest models: the extra trees regressor, the gradient boosting regressor and the random forest regressor, and find that random forests perform best in the runtime prediction task. We also present two methods of choosing walltimes for previously unseen jobs. Quantile regression forests are more accurate in their predictions and allow performance to be tuned by changing the confidence of the estimates. However, the sizes of the confidence intervals are variable and cannot be absolutely constrained. Random forest classifiers address this problem by providing control over the size of the prediction intervals with an accuracy that is comparable to that of the regressor. We show that estimating the memory requirements of a job is possible using the same methods, which, as far as we know, has not been done before. Such estimation can be highly beneficial for accurate resource allocation.
AVAILABILITY AND IMPLEMENTATION: Source code available at https://github.com/atyryshkina/algorithm-performance-analysis, implemented in Python.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author(s) 2019. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
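
The trade-off the abstract describes, where a higher confidence level means fewer under-allocated jobs but more locked-up walltime, can be illustrated with a much simpler stand-in than the models in the paper. The sketch below does not implement a quantile regression forest or a random forest classifier. It assumes a single tool with a list of past runtimes and picks the walltime as an empirical nearest-rank quantile of that history. The function names `choose_walltime` and `evaluate` are hypothetical, and the runtimes are synthetic, not drawn from the Galaxy dataset.

```python
import math
import random


def choose_walltime(past_runtimes, confidence=0.95):
    """Return the smallest observed runtime covering `confidence` of past runs.

    Hypothetical helper: an empirical nearest-rank quantile, used here as a
    simplified stand-in for a model-based quantile estimate.
    """
    if not past_runtimes:
        raise ValueError("need at least one historical runtime")
    if not 0.0 < confidence <= 1.0:
        raise ValueError("confidence must be in (0, 1]")
    ordered = sorted(past_runtimes)
    # Nearest-rank quantile: position of the first value that covers the
    # requested share of the historical runs.
    rank = math.ceil(confidence * len(ordered))
    return ordered[rank - 1]


def evaluate(walltime, new_runtimes):
    """Return (share of jobs exceeding the walltime, mean unused seconds).

    The first value approximates the under-allocation (failure) rate; the
    second approximates over-allocation among jobs that finish in time.
    """
    failures = sum(1 for r in new_runtimes if r > walltime)
    slack = [walltime - r for r in new_runtimes if r <= walltime]
    mean_slack = sum(slack) / len(slack) if slack else 0.0
    return failures / len(new_runtimes), mean_slack


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic, right-skewed runtimes in seconds (made-up data).
    history = [rng.lognormvariate(5.0, 0.8) for _ in range(2000)]
    incoming = [rng.lognormvariate(5.0, 0.8) for _ in range(500)]
    for confidence in (0.80, 0.95, 0.99):
        limit = choose_walltime(history, confidence)
        fail_rate, slack = evaluate(limit, incoming)
        print(f"confidence={confidence:.2f} walltime={limit:.1f}s "
              f"fail_rate={fail_rate:.3f} mean_slack={slack:.1f}s")
```

Running the script prints one line per confidence level. Raising the confidence can only raise the chosen walltime or leave it unchanged, which is the same lever the abstract attributes to quantile-based estimates, here without any dependence on job attributes such as input size.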

Year:  2019        PMID: 30698642      PMCID: PMC6931352          DOI: 10.1093/bioinformatics/btz054

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


