Thomas Stropp, Timothy McPhillips, Bertram Ludäscher, Mark Bieda.
Abstract
BACKGROUND: Microarray data analysis has been the subject of extensive and ongoing pipeline development due to its complexity, the availability of several options at each analysis step, and the development of new analysis demands, including integration with new data sources. Bioinformatics pipelines are usually custom built for different applications, making them typically difficult to modify, extend and repurpose. Scientific workflow systems are intended to address these issues by providing general-purpose frameworks in which to develop and execute such pipelines. The Kepler workflow environment is a well-established system under continual development that is employed in several areas of scientific research. Kepler provides a flexible graphical interface, featuring clear display of parameter values, for design and modification of workflows. It has capabilities for developing novel computational components in the R, Python, and Java programming languages, all of which are widely used for bioinformatics algorithm development, along with capabilities for invoking external applications and using web services.
Year: 2012 PMID: 22594911 PMCID: PMC3431220 DOI: 10.1186/1471-2105-13-102
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Screenshot of a complex workflow for designing PCR primers. Using locally installed software and online resources, this workflow can design primers for any genome available at the UCSC browser; the user need only provide genome assembly information and genomic coordinates. See RESULTS for details.
Figure 2. Changing Parameter Values in Kepler. A screenshot showing a parameter value being changed in a workflow. Parameter names and values are displayed clearly on the Kepler canvas. Parameters can easily be moved into groups, and font type, size, and color can be adjusted to provide visual cues as to importance or group relationships. A parameter is changed by simply clicking on it. As shown here, clicking on a parameter marks it with a small yellow box and opens a text box in which the new value may be entered. The displayed workflow is the gene expression workflow shown in Figure 5. See RESULTS for details.
Figure 5. A full Affymetrix gene expression microarray analysis workflow in Kepler. This workflow uses well-established R/BioConductor modules following the steps recommended in a published pipeline [9]. Several resulting graphs and files are output. See RESULTS for details.
Microarray Workflow Listing
| Workflow | Description |
| DisplayRegion.xml | Create a graphical display of the value field of a GFF file (like output provided by NimbleGen SignalMap) |
| GeneralHist.xml | Create a histogram of a given column of a text file. Useful for microarray GFF files. |
| gffFreqPoly_python.xml | Make several frequency polygons superimposed on one another for comparison. |
| gffFullDescription.xml | Display information about the GFF file specified. |
| gffQuickLook.xml | Displays first few lines of a GFF file. |
| gffStats_gffread_simple.xml | Calculate min, max, mean, median, num of lines, and various percentiles of a specified field. (Python version) |
| gffStats_Rbased_simple.xml | Calculate min, max, mean, median, num of lines, and various percentiles of a specified field. (R version) |
| ProbeSpacings.xml | Make a histogram of the probe spacings of a GFF file. |
| | |
| AddComments.xml | Add comments to the beginning of a GFF file. |
| gffMakeTiny.xml | Greatly reduce the size of a GFF file so that loading and processing are much faster. Reduces file size by replacing the second, third, and last fields of the file with placeholders. Assumes that these fields are the same in all lines. |
| gffModThirdField.xml | Modify the third field of a GFF file. |
| | |
| gffSmooth.xml | Median smooth (length 3) the 6th column of GFF files. |
| gffSort.xml | Sort a GFF file in chromosome + start point order (actually field 1 then field 4 order). |
| QuantNorm.xml | Quantile normalize the 6th field (ratio field) of a series of GFF files. |
| gffQN_SM3_TINY.xml | Quantile Normalize, Smooth, and Tiny-ize a set of GFF files. See gffMakeTiny.xml for explanation of Tiny-ize. |
| gffSubtract.xml | Subtract one GFF file from another GFF file (result based on subtraction of values in field 6). |
| gffSplit.xml | Split a GFF file containing the strings ‘tiled region’, ‘transcription_start_site’, and ‘primary_transcript’ into 3 separate files. |
| | |
| RunDetection.xml | Calculates runs of ratios (6th field) that are greater than or equal to the specified percentile of that column. Can be used for binding site detection for ChIP-chip as in [ |
| RunDetection_with_annotation.xml | RunDetection workflow with added annotation of the resulting binding sites (e.g. nearest gene) using the R/BioConductor ChIPpeakAnno package. |
| AMDA.xml | Perform Affymetrix gene expression microarray analysis. |
| AMDA_limmafinal.xml | Variant of AMDA workflow using limma package [ |
| PrimerDesign.xml | Pick sets of primers, given a chromosome range from user. Uses UCSC genome browser for outputs. |
| Regex_R.xml | Simple example of finding a substring within a string using regular expressions in R. |
| kepler_cut.xml | Clone of the UNIX ‘cut’ command. |
| kepler_paste.xml | Clone of the UNIX ‘paste’ command. |
| kepler_sort.xml | Clone of the UNIX ‘sort’ command. |
These workflows are further described in Additional file 2: Table S1. Each workflow is displayed in Additional file 1: Figures S1-S26.
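Several entries in the listing above describe simple numeric operations on the ratio (6th) field of a GFF file: length-3 median smoothing, quantile normalization across files, and detection of runs at or above a percentile. The Python sketch below illustrates those three operations on plain lists of ratios. It is not the code embedded in gffSmooth.xml, QuantNorm.xml, or RunDetection.xml; the function names, the endpoint and tie handling, the nearest-rank percentile definition, and the `min_length` parameter are all assumptions made for illustration.

```python
import math
import statistics


def median_smooth3(values):
    """Running median with window length 3.

    Assumption: the two endpoints are copied unchanged because they
    have no full window; the actual workflow may treat them differently.
    """
    if len(values) < 3:
        return list(values)
    smoothed = [values[0]]
    for i in range(1, len(values) - 1):
        smoothed.append(statistics.median(values[i - 1:i + 2]))
    smoothed.append(values[-1])
    return smoothed


def quantile_normalize(columns):
    """Quantile normalize several equal-length lists of ratios.

    Each value is replaced by the mean, across all lists, of the values
    sharing its rank. Ties keep their original order (an assumption).
    """
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all columns must have the same length")
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [sum(col[r] for col in sorted_cols) / len(columns)
                  for r in range(n)]
    result = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # stable sort
        normalized = [0.0] * n
        for rank, idx in enumerate(order):
            normalized[idx] = rank_means[rank]
        result.append(normalized)
    return result


def percentile_threshold(values, pct):
    """Nearest-rank percentile (an assumed definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def find_runs(values, threshold, min_length=1):
    """Return inclusive (start, end) index pairs of consecutive values
    that are >= threshold. min_length is a hypothetical filter."""
    runs = []
    start = None
    for i, value in enumerate(values):
        if value >= threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_length:
                runs.append((start, i - 1))
            start = None
    if start is not None and len(values) - start >= min_length:
        runs.append((start, len(values) - 1))
    return runs
```

A typical use would chain the helpers: compute `percentile_threshold(ratios, 95)` for one array, then pass the result to `find_runs` to obtain candidate regions by probe index. Mapping indices back to genomic coordinates is left out of the sketch.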
Figure 3. One output of the PCR primer design workflow. Screenshot of partial graphical output of the PCR primer design workflow as displayed in a web browser. The output is truncated for clarity. This figure displays the first two primer sets generated for the region chr16:23,597,600-23,597,933 of the human genome (assembly hg18). The primers and PCR product are shown as text, followed by a graphical representation of the PCR region derived from the UCSC genome browser. The browser tracks displayed may be changed by adjusting the workflow.
Figure 4. Screenshot of a simple Kepler workflow for calculating statistics of the “ratio” column of microarray GFF files. Python (Jython) and R implementations are shown. Note that the red arrows and italicized red text are not part of the original screen view but were added to label the basic parts of a Kepler workflow screen. All other text is part of the screenshot. See RESULTS for details.
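To give a rough sense of the computation the Figure 4 workflow performs, the following Python sketch computes the minimum, maximum, mean, median, line count, and a few percentiles of the sixth ("ratio") field of tab-delimited GFF lines. It is not the Jython code embedded in the workflow; the function names `read_gff_field`, `percentile`, and `summarize`, the interpolation rule, the default percentile list, and the sample lines are assumptions for illustration only.

```python
import statistics


def read_gff_field(lines, field=6):
    """Return the numeric values of a 1-based GFF field.

    Blank lines and comment lines starting with '#' are skipped.
    """
    values = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t")
        values.append(float(parts[field - 1]))
    return values


def percentile(sorted_values, pct):
    """Percentile by linear interpolation between closest ranks
    (an assumed definition; other conventions exist)."""
    if len(sorted_values) == 1:
        return sorted_values[0]
    rank = (len(sorted_values) - 1) * pct / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (rank - lo)


def summarize(values, percentiles=(5, 25, 75, 95)):
    """Summary statistics for a non-empty list of numbers."""
    ordered = sorted(values)
    stats = {
        "n": len(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
    }
    for p in percentiles:
        stats["p%d" % p] = percentile(ordered, p)
    return stats


# Hypothetical example lines, not data from the paper.
SAMPLE_GFF = [
    "# sample track",
    "chr1\tsrc\tprobe\t100\t150\t0.5\t.\t.\tid=1",
    "chr1\tsrc\tprobe\t200\t250\t1.5\t.\t.\tid=2",
    "chr1\tsrc\tprobe\t300\t350\t-0.5\t.\t.\tid=3",
    "chr1\tsrc\tprobe\t400\t450\t2.5\t.\t.\tid=4",
    "chr1\tsrc\tprobe\t500\t550\t1.0\t.\t.\tid=5",
]

if __name__ == "__main__":
    for name, value in summarize(read_gff_field(SAMPLE_GFF)).items():
        print(name, value)
```

In a real run the list of lines would come from an open file handle, for example `read_gff_field(open(path))`, where `path` is whatever GFF file the user supplies.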
Figure 6. One output graph from the AMDA workflow applied to NCBI GEO dataset GSE7181. This is one output of the Affymetrix gene expression workflow shown in Figure 5: a display of the file Data_norm_HeatMap.png, which is produced and stored by the workflow (see RESULTS). It is a heatmap representation of the clustering of differentially expressed genes from six gene expression microarrays, each from a different line of brain tumor stem cells. The three adherent lines cluster together on the left; the three lines that grow as neurospheres cluster together on the right. Details of the experimental methods and goals are given in [31] and [32], and details of the heatmap generation methods in [9].