Hugh P Shanahan, Anne M Owen, Andrew P Harrison.
Abstract
We discuss the applicability of the Microsoft cloud computing platform, Azure, for bioinformatics. We focus on the usability of the resource rather than its performance. We provide an example of how R can be used on Azure to analyse a large amount of microarray expression data deposited at the public database ArrayExpress. In Appendix S1 we provide a walk-through that demonstrates explicitly how Azure can be used to perform these analyses, and we offer a comparison with a local computation. We note that the Platform as a Service (PaaS) offering of Azure can present a steep learning curve for bioinformatics developers, who will usually have a Linux and scripting language background. On the other hand, the presence of an additional set of libraries makes it easier to deploy software in a parallel (scalable) fashion and to manage such a production run explicitly with only a few hundred lines of code, most of which can be incorporated from a template. We propose that this environment is best suited for running stable bioinformatics software by users not involved with its development.
Year: 2014 PMID: 25050811 PMCID: PMC4106841 DOI: 10.1371/journal.pone.0102642
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Definitions of Cloud Computing Terms.
| Term | Explanation | Example |
| VM | Virtual Machine - a piece of software that emulates the behaviour of a separate computer running an Operating System. | |
| IaaS | Infrastructure as a Service - VMs can be accessed directly via a command line interface. | EC2, RackSpace, OpenStack |
| PaaS | Platform as a Service - VMs can only be accessed programmatically. | Azure, AppEngine, Elastic Beanstalk, Heroku |
| Job manager | Software which manages the submission of an arbitrary number of executables (jobs) over a large number of computers which typically vary in their parameters. Job Management software will typically include the creation of log files for each run in a systematic fashion and deal with failures in an orderly way. | StarCluster, Generic Worker, Condor, Oracle Grid Engine |
| Software stack | A set of software components that communicate with each other in a hierarchical fashion. In the context of cloud computing, this allows issues that are relevant to each local computer to be decoupled from global issues such as their overall management. | |
| Image | Bit-for-bit copy of the state of a particular VM which can then be deployed elsewhere. As a result, one can use a VM which runs locally or on a cloud which is configured precisely with the software and data the user requires. | |
| MapReduce | A protocol for distributed systems based on the observation that the analysis of large data sets distributed over many VMs requires one step (Map) that is executed by every VM on the data it holds, followed by another step (Reduce) in which the results of the Map step are collated in some fashion on one VM. | Hadoop, HDInsight, Greenplum |
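The MapReduce entry above can be illustrated with a minimal single-process sketch. It is not Hadoop or HDInsight code: the partitions, the probe names and the functions `map_step` and `reduce_step` are hypothetical, and each inner list merely stands in for the data held by one VM.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical data: each inner list stands for the records held by one VM.
partitions = [
    ["probe_a", "probe_b", "probe_a"],
    ["probe_b", "probe_c"],
    ["probe_a", "probe_c", "probe_c"],
]

def map_step(records):
    """Map: executed by every VM on its own data; counts each key locally."""
    counts = defaultdict(int)
    for key in records:
        counts[key] += 1
    return dict(counts)

def reduce_step(left, right):
    """Reduce: executed on one VM; merges two partial results into one."""
    merged = dict(left)
    for key, value in right.items():
        merged[key] = merged.get(key, 0) + value
    return merged

# Map produces one partial result per partition (per VM).
partial_results = [map_step(p) for p in partitions]
# Reduce collates all partial results into a single dictionary.
totals = reduce(reduce_step, partial_results, {})
print(totals)
```

In a real deployment the Map calls would run concurrently on separate VMs and only the small partial results would travel over the network to the reducing VM.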
Figure 1. Batch mode operation schematic.
Comparison of Cloud Computing Features.
| Feature | Microsoft Azure | Amazon EC2 |
| Infrastructure provision | PaaS (Cloud Service) and IaaS (Virtual Machines) | IaaS, also PaaS via Elastic Beanstalk |
| Job Manager? | Via Generic Worker libraries | Yes. |
| Operating Systems available | Windows Server 2008 on PaaS; Windows and Linux on IaaS | Linux and Windows |
| Data Storage | Mass store | S3 Storage |
| MapReduce available? | Yes | Yes |
| SQL available? | Yes | Yes |
| Ease of use for Linux developer | Learning curve to get familiar with C#; authentication methods not yet trivial | Provision of excellent tutorials plus extensive community support. |
| Ease of use for user | Generic Worker (GW) allows development of tailored tools | Requires experience of scripting or workflow software such as Galaxy or Taverna. |
Figure 2. Using Azure with the Generic Worker.
The number of Virtual Machines (VMs) created for the worker roles can be scaled up and down as needed.
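The job-manager behaviour described in the definitions table (many jobs with varying parameters, a log for each run, orderly handling of failures) and the scalable worker pool of Figure 2 can be sketched as follows. This is only an illustration under stated assumptions, not the Generic Worker API: `analyse`, `run_job` and `run_batch` are hypothetical names, a local thread pool stands in for the worker-role VMs, and a named logger stands in for a per-run log file.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

logging.basicConfig(level=logging.INFO)

def analyse(n_files):
    """Hypothetical stand-in for one job, e.g. analysing one experiment."""
    if n_files <= 0:
        raise ValueError("experiment has no files")
    return n_files * 2

def run_job(job_id, n_files, max_attempts=2):
    """Run one job, log every attempt, and report failure instead of crashing."""
    log = logging.getLogger(f"job-{job_id}")  # stands in for a per-run log file
    for attempt in range(1, max_attempts + 1):
        try:
            result = analyse(n_files)
            log.info("attempt %d succeeded", attempt)
            return job_id, "ok", result
        except Exception as exc:
            log.warning("attempt %d failed: %s", attempt, exc)
    return job_id, "failed", None

def run_batch(jobs, workers):
    """Submit all jobs to a pool; `workers` plays the role of the VM count."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_job, job_id, n) for job_id, n in jobs.items()]
        outcomes = [future.result() for future in futures]
    return {job_id: (status, result) for job_id, status, result in outcomes}

# Hypothetical batch: experiment id -> number of input files.
results = run_batch({"exp1": 4, "exp2": 0, "exp3": 16}, workers=2)
print(results)
```

Changing `workers` scales the pool up or down without touching the job code, which is the separation of concerns the table attributes to a job manager.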
Cost of some features of Azure and Amazon Cloud Computing.
| Feature | Microsoft Azure | Amazon EC2 |
| VM (Small Instance) | $ | $ |
| $ | ||
| Ingress | Nothing | Nothing (from Internet) |
| Egress | $ | $ |
| Storage | $ | $ |
Figure 3. Time taken to load microarray data from Azure mass storage to R working storage.
Plot showing the time in seconds taken to load each of 576 datasets from Azure blob storage to local VM disk space, in terms of the number of CEL files in each GSE experiment.
Figure 4. Time taken to analyse data with R script.
Shows the time in seconds taken to analyse each of 576 datasets, in terms of the number of CEL files in each GSE experiment.
Figure 5. Comparison of analysis times between the cloud and two local machines.
Shows the time in seconds taken to analyse each of 6 particular experiments, in terms of the number of CEL files in each experiment. The experiments were chosen because they had 4, 8, 16, 32, 64 and 128 CEL files, giving a range of data volumes. The machine labelled Local1 had a CPU clock speed of 2.13 GHz [...]. A CPU cap was added to the Local2 machine to crudely approximate the slower 1.60 GHz stated clock speed of the Azure VM.
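One plausible way to choose such a cap is the ratio of the target clock speed to the machine's native clock speed. The sketch below assumes this rule; the function name `cpu_cap_percent` and the 3.20 GHz native speed are hypothetical and are not taken from the paper. Only the 1.60 GHz Azure figure comes from the caption.

```python
def cpu_cap_percent(native_ghz, target_ghz):
    """Return the CPU cap (in percent) that scales native_ghz down to target_ghz.

    Assumes performance scales linearly with clock speed, which is only a
    crude approximation, as the caption itself notes.
    """
    if native_ghz <= 0 or target_ghz <= 0:
        raise ValueError("clock speeds must be positive")
    # A cap can never exceed 100% of the native machine.
    return min(100.0, 100.0 * (target_ghz / native_ghz))

# Hypothetical native speed of 3.20 GHz; 1.60 GHz is the stated Azure VM speed.
cap = cpu_cap_percent(native_ghz=3.20, target_ghz=1.60)
print(f"{cap:.0f}% cap")
```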