Transcriptome annotation in the cloud: complexity, best practices, and cost.

Roberto Vera Alvarez, Leonardo Mariño-Ramírez, David Landsman.

Abstract

BACKGROUND: The NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) initiative provides NIH-funded researchers with cost-effective access to commercial cloud providers, such as Amazon Web Services (AWS) and Google Cloud Platform (GCP). These cloud providers represent an alternative for the execution of large computational biology experiments like transcriptome annotation, which is a complex analytical process that requires the interrogation of multiple biological databases with several advanced computational tools. The core components of annotation pipelines published since 2012 are BLAST sequence alignments against annotated databases of nucleotide or protein sequences, run almost exclusively on networked on-premises compute systems.
FINDINGS: We compare multiple BLAST sequence alignments using AWS and GCP. We prepared several Jupyter Notebooks with all the code required to submit computing jobs to the batch system on each cloud provider. We consider how the number of query transcripts in each input file affects cost and processing time. We tested compute instances with 16, 32, and 64 vCPUs on each cloud provider. Four classes of timing results were collected: the total run time, the time for transferring the BLAST databases to the instance's local solid-state drive, the time to execute the CWL script, and the time for the creation, set-up, and release of an instance. This study aims to establish an estimate of the cost and compute time needed for the execution of multiple BLAST runs in a cloud environment.
CONCLUSIONS: We demonstrate that public cloud providers are a practical alternative for the execution of advanced computational biology experiments at low cost. Using our cloud recipes, the BLAST alignments required to annotate a transcriptome with ∼500,000 transcripts can be processed in <2 hours with a compute cost of ∼$200-$250. In our opinion, for BLAST-based workflows, the choice of cloud platform is not dependent on the workflow but, rather, on the specific details and requirements of the cloud provider. These choices include the accessibility for institutional use, the technical knowledge required for effective use of the platform services, and the availability of open source frameworks such as APIs to deploy the workflow.
Published by Oxford University Press on behalf of GigaScience 2021.
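The relationship between the number of query transcripts per input file, the four timing classes, and the compute cost can be sketched as simple arithmetic. The following is a minimal illustration, not the authors' code: the names `JobTiming`, `number_of_jobs`, `overhead_fraction`, and `estimate_cost` are hypothetical, the billing model (cost proportional to each job's total run time at a flat hourly price) is an assumption, and every number in the usage lines is a placeholder rather than a result from the study.

```python
import math
from dataclasses import dataclass


@dataclass
class JobTiming:
    """Timing classes named in the abstract, in seconds, for one batch job."""
    total: float              # total run time of the job
    db_transfer: float        # copying BLAST databases to the instance's local SSD
    cwl_execution: float      # executing the CWL script
    instance_overhead: float  # creating, setting up, and releasing the instance


def number_of_jobs(total_transcripts: int, transcripts_per_file: int) -> int:
    """Number of input files (and batch jobs) after splitting the queries."""
    return math.ceil(total_transcripts / transcripts_per_file)


def overhead_fraction(timing: JobTiming) -> float:
    """Share of the total run time not spent executing the workflow itself."""
    return (timing.db_transfer + timing.instance_overhead) / timing.total


def estimate_cost(total_transcripts: int, transcripts_per_file: int,
                  timing: JobTiming, hourly_price_usd: float) -> float:
    """Assumed model: every job is billed for its total run time at a flat rate."""
    jobs = number_of_jobs(total_transcripts, transcripts_per_file)
    billed_hours = jobs * timing.total / 3600.0
    return billed_hours * hourly_price_usd


# Placeholder values only; none of these are measurements from the paper.
timing = JobTiming(total=3600.0, db_transfer=360.0,
                   cwl_execution=2880.0, instance_overhead=360.0)
jobs = number_of_jobs(500_000, 10_000)
cost = estimate_cost(500_000, 10_000, timing, hourly_price_usd=2.0)
print(f"{jobs} jobs, overhead {overhead_fraction(timing):.0%}, cost ${cost:.2f}")
```

Under this assumed model, smaller input files mean more jobs, so the fixed database-transfer and instance overheads are paid more often; larger files reduce that overhead but lengthen each job.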

Year:  2021        PMID: 33511996      PMCID: PMC7845158          DOI: 10.1093/gigascience/giaa163

Source DB:  PubMed          Journal:  Gigascience        ISSN: 2047-217X            Impact factor:   6.524


  30 in total

1.  Why Jupyter is data scientists' computational notebook of choice.

Authors:  Jeffrey M Perkel
Journal:  Nature       Date:  2018-11       Impact factor: 49.962

2.  Science in the cloud (SIC): A use case in MRI connectomics.

Authors:  Gregory Kiar; Krzysztof J Gorgolewski; Dean Kleissas; William Gray Roncal; Brian Litt; Brian Wandell; Russel A Poldrack; Martin Wiener; R Jacob Vogelstein; Randal Burns; Joshua T Vogelstein
Journal:  Gigascience       Date:  2017-05-01       Impact factor: 6.524

3.  CGtag: complete genomics toolkit and annotation in a cloud-based Galaxy.

Authors:  Saskia Hiltemann; Hailiang Mei; Mattias de Hollander; Ivo Palli; Peter van der Spek; Guido Jenster; Andrew Stubbs
Journal:  Gigascience       Date:  2014-01-24       Impact factor: 6.524

4.  RNA-seq analysis of Quercus pubescens Leaves: de novo transcriptome assembly, annotation and functional markers development.

Authors:  Sara Torre; Massimiliano Tattini; Cecilia Brunetti; Silvia Fineschi; Alessio Fini; Francesco Ferrini; Federico Sebastiani
Journal:  PLoS One       Date:  2014-11-13       Impact factor: 3.240

5.  Accumulating computational resource usage of genomic data analysis workflow to optimize cloud computing instance selection.

Authors:  Tazro Ohta; Tomoya Tanjo; Osamu Ogasawara
Journal:  Gigascience       Date:  2019-04-01       Impact factor: 6.524

6.  PhenoMeNal: processing and analysis of metabolomics data in the cloud.

Authors:  Kristian Peters; James Bradbury; Sven Bergmann; Marco Capuccini; Marta Cascante; Pedro de Atauri; Timothy M D Ebbels; Carles Foguet; Robert Glen; Alejandra Gonzalez-Beltran; Ulrich L Günther; Evangelos Handakas; Thomas Hankemeier; Kenneth Haug; Stephanie Herman; Petr Holub; Massimiliano Izzo; Daniel Jacob; David Johnson; Fabien Jourdan; Namrata Kale; Ibrahim Karaman; Bita Khalili; Payam Emami Khonsari; Kim Kultima; Samuel Lampa; Anders Larsson; Christian Ludwig; Pablo Moreno; Steffen Neumann; Jon Ander Novella; Claire O'Donovan; Jake T M Pearce; Alina Peluso; Marco Enrico Piras; Luca Pireddu; Michelle A C Reed; Philippe Rocca-Serra; Pierrick Roger; Antonio Rosato; Rico Rueedi; Christoph Ruttkies; Noureddin Sadawi; Reza M Salek; Susanna-Assunta Sansone; Vitaly Selivanov; Ola Spjuth; Daniel Schober; Etienne A Thévenot; Mattia Tomasoni; Merlijn van Rijswijk; Michael van Vliet; Mark R Viant; Ralf J M Weber; Gianluigi Zanetti; Christoph Steinbeck
Journal:  Gigascience       Date:  2019-02-01       Impact factor: 6.524

7.  Banana (Musa acuminata) transcriptome profiling in response to rhizobacteria: Bacillus amyloliquefaciens Bs006 and Pseudomonas fluorescens Ps006.

Authors:  Rocío M Gamez; Fernando Rodríguez; Newton Medeiros Vidal; Sandra Ramirez; Roberto Vera Alvarez; David Landsman; Leonardo Mariño-Ramírez
Journal:  BMC Genomics       Date:  2019-05-14       Impact factor: 3.969

8.  De novo transcriptome assembly and annotation for gene discovery in avocado, macadamia and mango.

Authors:  Tinashe G Chabikwa; Francois F Barbier; Milos Tanurdzic; Christine A Beveridge
Journal:  Sci Data       Date:  2020-01-08       Impact factor: 6.444

9.  Full-length transcriptome assembly from RNA-Seq data without a reference genome.

Authors:  Manfred G Grabherr; Brian J Haas; Moran Yassour; Joshua Z Levin; Dawn A Thompson; Ido Amit; Xian Adiconis; Lin Fan; Raktima Raychowdhury; Qiandong Zeng; Zehua Chen; Evan Mauceli; Nir Hacohen; Andreas Gnirke; Nicholas Rhind; Federica di Palma; Bruce W Birren; Chad Nusbaum; Kerstin Lindblad-Toh; Nir Friedman; Aviv Regev
Journal:  Nat Biotechnol       Date:  2011-05-15       Impact factor: 54.908

10.  SV-plaudit: A cloud-based framework for manually curating thousands of structural variants.

Authors:  Jonathan R Belyeu; Thomas J Nicholas; Brent S Pedersen; Thomas A Sasani; James M Havrilla; Stephanie N Kravitz; Megan E Conway; Brian K Lohman; Aaron R Quinlan; Ryan M Layer
Journal:  Gigascience       Date:  2018-07-01       Impact factor: 6.524

  3 in total

Review 1.  A simple guide to de novo transcriptome assembly and annotation.

Authors:  Venket Raghavan; Louis Kraft; Fantin Mesny; Linda Rigerte
Journal:  Brief Bioinform       Date:  2022-03-10       Impact factor: 11.622

2.  Understanding enterprise data warehouses to support clinical and translational research: enterprise information technology relationships, data governance, workforce, and cloud computing.

Authors:  Boyd M Knosp; Catherine K Craven; David A Dorr; Elmer V Bernstam; Thomas R Campion
Journal:  J Am Med Inform Assoc       Date:  2022-03-15       Impact factor: 7.942

3.  Transcriptome annotation in the cloud: complexity, best practices, and cost.

Authors:  Roberto Vera Alvarez; Leonardo Mariño-Ramírez; David Landsman
Journal:  Gigascience       Date:  2021-01-29       Impact factor: 6.524
