
Dissemination of scientific software with Galaxy ToolShed.

Daniel Blankenberg, Gregory Von Kuster, Emil Bouvier, Dannon Baker, Enis Afgan, Nicholas Stoler, James Taylor, Anton Nekrutenko.   

Abstract

The proliferation of web-based integrative analysis frameworks has enabled users to perform complex analyses directly through the web. Unfortunately, it also revoked the freedom to easily select the most appropriate tools. To address this, we have developed Galaxy ToolShed.

Year:  2014        PMID: 25001293      PMCID: PMC4038738          DOI: 10.1186/gb4161

Source DB:  PubMed          Journal:  Genome Biol        ISSN: 1474-7596            Impact factor:   13.583


Previously, our group investigated the persistence of mitochondrial variants (heteroplasmies) through mother-child transmissions [1]. Many disease-causing mitochondrial variants are heteroplasmic and their clinical manifestations depend on the relative proportion of normal to mutant alleles [2-4]. Because almost all of the mitochondrial genome is transcribed [5], the next important question is whether the relative frequencies of heteroplasmic alleles are maintained in transcripts. We turned to published studies to find an appropriate dataset that would include matched genomic and transcriptomic data. The initial analysis of DNA/RNA differences by Li et al. [6] omitted the mitochondrial transcriptome, but a much more comprehensive dataset by Chen et al. [7] has since become available. The latter contains both whole genome and RNA sequencing data from a single individual and is therefore ideally suited for our purpose. To perform this analysis, we started with a ‘clean’ Galaxy Amazon EC2 instance [8-10], mapped the reads against the latest version of the human genome, retained properly mapped pairs, removed reads mapping to multiple locations, added readgroup information, and combined all results into a single binary version of the sequence alignment/map format (BAM) dataset for further analysis (Additional file 1) [11]. At this point in the analysis, we ran into the first roadblock: the Galaxy instance we were using did not contain any tools for detecting sequence variants. This is exactly the type of situation where the ToolShed is most useful, as it already contains a collection of utilities for variant detection such as FreeBayes [12]. Installing the FreeBayes tool along with the required dependencies into Galaxy using the ToolShed is accomplished through the web-based graphical user interface [11].
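The read-filtering steps described above (retaining properly mapped pairs and removing reads that map to multiple locations) can be sketched in plain Python on SAM-format text. This is only an illustration, not the Galaxy workflow used in the study: the function names are hypothetical, and the use of a mapping-quality cutoff (`MIN_MAPQ = 20`) as a stand-in for "maps to multiple locations" is an assumption, since the text does not say how multi-mapped reads were identified.

```python
# Illustrative read filter on SAM text; NOT the workflow used in the study.
# Flag bit values are those defined by the SAM format specification.
PROPER_PAIR = 0x2       # read mapped in a proper pair
UNMAPPED = 0x4          # read itself is unmapped
SECONDARY = 0x100       # secondary alignment
SUPPLEMENTARY = 0x800   # supplementary alignment

# Assumption: a MAPQ cutoff is used here as a proxy for multi-mapped reads.
MIN_MAPQ = 20


def keep_read(flag, mapq, min_mapq=MIN_MAPQ):
    """Return True if an alignment is a primary, properly paired, confident hit."""
    if flag & UNMAPPED:
        return False
    if not flag & PROPER_PAIR:
        return False
    if flag & (SECONDARY | SUPPLEMENTARY):
        return False
    return mapq >= min_mapq


def filter_sam_lines(lines, min_mapq=MIN_MAPQ):
    """Yield header lines unchanged and alignment lines that pass keep_read."""
    for line in lines:
        if line.startswith("@"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        flag, mapq = int(fields[1]), int(fields[4])  # SAM columns 2 (FLAG) and 5 (MAPQ)
        if keep_read(flag, mapq, min_mapq):
            yield line
```

In practice this kind of filtering would be done on BAM files with a dedicated tool; the sketch only makes the filtering criteria explicit.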
Behind the scenes, the ToolShed fetches source code from the FreeBayes GitHub repository, compiles it, and registers all necessary components with the Galaxy instance, making it accessible to the user [13]. Application of FreeBayes to our dataset identified two potential heteroplasmic sites with minor allele frequencies >2% (a heteroplasmy detection threshold derived from empirical and simulation data [1,14]): 2,619 and 13,636 (Figure 1a,b). Site 13,636 is a textbook example of a heteroplasmy - it is biallelic (T/C) with an average minor allele frequency of 22% across the 21 samples in our study. However, the other site, 2,619, is different and represents a potential RNA modification reported recently by our group [15]. Within genomic DNA it is represented by an invariable A, while in all RNA-seq datasets it is scored by FreeBayes as a heterozygous locus with the major allele being a T. Moreover, while the total coverage at this site across all samples was 40,132, the numbers of reference and alternative observations were 11,086 and 20,584, respectively (summing to a total of 31,670), suggesting that the site is multiallelic. FreeBayes as used here reports only two possibilities: reference and alternative. However, in many cases, such as genotyping of pooled, bacterial or viral samples, it is necessary to report exact counts for all variants. In a typical sequence analysis experiment this is the point where custom scripts are often developed. While we did exactly that - developed two custom Python-based tools, ‘Naïve Variant Caller’ (NVC) and ‘Variant Annotator’ - we went a step further and deposited these tools into the ToolShed. By doing so, we not only made them accessible to any Galaxy instance, but also ensured reproducibility of our experiment, which is almost universally lacking in studies utilizing custom scripts [16].
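The arithmetic behind the multiallelic inference at site 2,619 can be checked directly from the three counts quoted above. The helper below is a hypothetical name introduced for illustration; it only subtracts the reported reference and alternative observations from total coverage. Treating the remainder as evidence of additional alleles follows the reasoning in the text, although other explanations (for example, observations excluded by the caller's quality filters) cannot be ruled out from these numbers alone.

```python
def unaccounted_observations(total_coverage, ref_obs, alt_obs):
    """Return (reported, unaccounted, unaccounted_fraction) for one site."""
    if total_coverage <= 0:
        raise ValueError("total_coverage must be positive")
    reported = ref_obs + alt_obs
    unaccounted = total_coverage - reported
    return reported, unaccounted, unaccounted / total_coverage


# Counts for site 2,619 as quoted in the text.
reported, unaccounted, fraction = unaccounted_observations(40132, 11086, 20584)
print(reported, unaccounted, round(fraction, 3))  # prints: 31670 8462 0.211
```

Roughly a fifth of the observations at this site are therefore neither the reference nor the single reported alternative allele, which is why per-nucleotide counts are needed.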
The NVC produces Variant Call Format (VCF) output [17] containing counts for all observed variants from multisample BAM datasets (Additional file 1), while Variant Annotator converts VCF data into allele counts stratified by sample. To deposit the tools into the ToolShed, we created a version-controlled repository and uploaded all software components, including the tool configuration file, NVC Python script, information about necessary software dependencies, and a set of functional tests. At this point, the tool becomes ‘visible’ to any Galaxy installation, including the cloud-based instance we use in this study. After installing the NVC from the ToolShed [18], we applied it to the original BAM dataset to obtain the counts shown in Figure 1c,d. Here the multiallelic nature of site 2,619 is clearly visible, as is the fact that this variation appears only in transcriptome data.
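The idea of reporting counts for all four nucleotides, stratified by sample, can be sketched as follows. This is a minimal illustration and not the NVC or Variant Annotator code: the function names, the input representation (a list of `(sample, base)` pairs observed at one site), and the sample labels in the usage note are all assumptions made for the example.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"


def count_alleles(observations):
    """Tally A/C/G/T per sample from (sample, base) pairs at a single site.

    Bases other than A, C, G or T (for example N) are ignored.
    """
    counts = {}
    for sample, base in observations:
        base = base.upper()
        if base not in NUCLEOTIDES:
            continue
        counts.setdefault(sample, Counter())[base] += 1
    # Always report all four nucleotides, including zero counts.
    return {s: {b: c[b] for b in NUCLEOTIDES} for s, c in counts.items()}


def minor_allele_frequency(base_counts):
    """Frequency of the second most common base; 0.0 if there is no coverage."""
    total = sum(base_counts.values())
    if total == 0:
        return 0.0
    ordered = sorted(base_counts.values(), reverse=True)
    return ordered[1] / total
```

For instance, with hypothetical observations of ten A in a sample labelled `dna`, and six T, three A and one G in a sample labelled `rna`, `count_alleles` reports three distinct alleles for `rna` only, and `minor_allele_frequency` returns 0.3 for `rna` and 0.0 for `dna`. A two-allele report would hide the third allele.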
Figure 1

Frequency of the four possible nucleotides across genomic DNA (accession number SRR345592) and RNA-seq (accession numbers SRR353635-SRR353654) samples for sites 13,636 and 2,619. NVC, Naïve Variant Caller. Data is deposited in the Short Read Archive at the National Center for Biotechnology Information (NCBI).

This short example has illustrated that the ToolShed behaves as a de facto AppStore: when users need an analysis tool that is not present in a given Galaxy instance, it can be easily fetched and installed. Just like a brand new iPad, Galaxy comes with a small number of preinstalled applications providing basic functionality. Additional tools may subsequently be installed from the ToolShed to create a ‘flavor’ of Galaxy suitable for a particular analysis. An expanded discussion of the ToolShed can be found in the online supplement.

Abbreviations

BAM: Binary version of the sequence alignment/map format; NVC: Naïve Variant Caller; VCF: Variant call format.

Competing interests

The authors declare that they have no competing interests.

Additional file 1

Contains examples of tools deposited to ToolShed and discusses implications of this system for improving the reproducibility of biomedical research.
  15 in total

1.  A general approach to single-nucleotide polymorphism discovery.

Authors:  G T Marth; I Korf; M D Yandell; R T Yeh; Z Gu; H Zakeri; N O Stitziel; L Hillier; P Y Kwok; W R Gish
Journal:  Nat Genet       Date:  1999-12       Impact factor: 38.330

Review 2.  Making mitochondrial mutants.

Authors:  H T Jacobs
Journal:  Trends Genet       Date:  2001-11       Impact factor: 11.639

3.  A framework for collaborative analysis of ENCODE data: making large-scale analyses biologist-friendly.

Authors:  Daniel Blankenberg; James Taylor; Ian Schenck; Jianbin He; Yi Zhang; Matthew Ghent; Narayanan Veeraraghavan; Istvan Albert; Webb Miller; Kateryna D Makova; Ross C Hardison; Anton Nekrutenko
Journal:  Genome Res       Date:  2007-06       Impact factor: 9.043

4.  The human mitochondrial transcriptome.

Authors:  Tim R Mercer; Shane Neph; Marcel E Dinger; Joanna Crawford; Martin A Smith; Anne-Marie J Shearwood; Eric Haugen; Cameron P Bracken; Oliver Rackham; John A Stamatoyannopoulos; Aleksandra Filipovska; John S Mattick
Journal:  Cell       Date:  2011-08-19       Impact factor: 41.582

5.  Widespread RNA and DNA sequence differences in the human transcriptome.

Authors:  Mingyao Li; Isabel X Wang; Yun Li; Alan Bruzel; Allison L Richards; Jonathan M Toung; Vivian G Cheung
Journal:  Science       Date:  2011-05-19       Impact factor: 47.728

6.  Detecting heteroplasmy from high-throughput sequencing of complete human mitochondrial DNA genomes.

Authors:  Mingkun Li; Anna Schönberg; Michael Schaefer; Roland Schroeder; Ivane Nasidze; Mark Stoneking
Journal:  Am J Hum Genet       Date:  2010-08-13       Impact factor: 11.025

Review 7.  Mitochondrial diseases.

Authors:  Salvatore DiMauro
Journal:  Biochim Biophys Acta       Date:  2004-07-23

8.  Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences.

Authors:  Jeremy Goecks; Anton Nekrutenko; James Taylor
Journal:  Genome Biol       Date:  2010-08-25       Impact factor: 13.583

9.  The variant call format and VCFtools.

Authors:  Petr Danecek; Adam Auton; Goncalo Abecasis; Cornelis A Albers; Eric Banks; Mark A DePristo; Robert E Handsaker; Gerton Lunter; Gabor T Marth; Stephen T Sherry; Gilean McVean; Richard Durbin
Journal:  Bioinformatics       Date:  2011-06-07       Impact factor: 6.937

10.  RNA-DNA differences in human mitochondria restore ancestral form of 16S ribosomal RNA.

Authors:  Dan Bar-Yaacov; Gal Avital; Liron Levin; Allison L Richards; Naomi Hachen; Boris Rebolledo Jaramillo; Anton Nekrutenko; Raz Zarivach; Dan Mishmar
Journal:  Genome Res       Date:  2013-08-02       Impact factor: 9.043

