| Literature DB >> 20482786 |
Dennis P Wall, Parul Kudtarkar, Vincent A Fusaro, Rimma Pivovarov, Prasad Patil, Peter J Tonellato.
Abstract
BACKGROUND: Large comparative genomics studies and tools are becoming increasingly more compute-expensive as the number of available genome sequences continues to rise. The capacity and cost of local computing infrastructures are likely to become prohibitive with the increase, especially as the breadth of questions continues to rise. Alternative computing architectures, in particular cloud computing environments, may help alleviate this increasing pressure and enable fast, large-scale, and cost-effective comparative genomics strategies going forward. To test this, we redesigned a typical comparative genomics algorithm, the reciprocal smallest distance algorithm (RSD), to run within Amazon's Elastic Computing Cloud (EC2). We then employed the RSD-cloud for ortholog calculations across a wide selection of fully sequenced genomes.Entities:
Year: 2010 PMID: 20482786 PMCID: PMC3098063 DOI: 10.1186/1471-2105-11-259
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The reciprocal smallest distance algorithm (RSD). Arrows denote bidirectional BLAST runs. After each run, hits are paired with the query to calculate evolutionary distances. If the same pair produces the smallest distance in both search directions, it is assumed to be orthologous. The specifics of the algorithm are provided in the Introduction.
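The reciprocal check at the heart of the algorithm can be sketched in Python. This is an illustrative reduction: the `dist_ab`/`dist_ba` inputs are hypothetical stand-ins for the maximum-likelihood distance estimates RSD computes after each BLAST direction, not the package's actual data structures.

```python
def reciprocal_smallest_distance(dist_ab, dist_ba):
    """Return putative ortholog pairs (a, b) such that b is the
    smallest-distance hit for a in the A-vs-B direction AND a is the
    smallest-distance hit for b in the B-vs-A direction.

    dist_ab: {query_a: {hit_b: distance}} from the A-vs-B search
    dist_ba: {query_b: {hit_a: distance}} from the B-vs-A search
    """
    orthologs = []
    for a, hits in dist_ab.items():
        if not hits:
            continue
        b = min(hits, key=hits.get)            # closest hit of a in genome B
        back_hits = dist_ba.get(b, {})
        if back_hits and min(back_hits, key=back_hits.get) == a:
            orthologs.append((a, b))           # reciprocal smallest distance
    return orthologs
```

For example, if `a1`'s closest hit is `b1` and `b1`'s closest hit is `a1`, the pair is reported; a one-directional best hit is not.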
Elastic Map Reduce commands
| Argument | Description | Input |
|---|---|---|
| --stream | Activates the "streaming" module | N/A |
| --input | File(s) to be processed by EMR | hdfs:///home/hadoop/blast_runner hdfs:///home/hadoop/ortho_runner |
| --mapper | Name of mapper file | s3n://rsd_bucket/blast_mapper.py s3n://rsd_bucket/ortho_mapper.py |
| --reducer | None required, reduction done within RSD algorithm | N/A |
| --cache-archive | Archive of the executables and genomes, exposed to each node through individual symlinks | s3n://rsd_bucket/executables.tar.gz #executables, #genomes, #RSD_standalone, #blastinput, #results |
| --output | Location of the EMR output directory in HDFS | hdfs:///home/hadoop/outl |
| --jobconf mapred.map.tasks | Number of BLAST and ortholog calculation processes | = N |
| --jobconf mapred.tasktracker.map.tasks.maximum | Total number of task trackers | = 8 |
| --jobconf mapred.task.timeout | Time after which a process is considered failed and restarted | = 86400000 ms |
| --jobconf mapred.tasktracker.expiry.interval | Time after which an unresponsive instance is declared dead | = 3600000 ms (set large to avoid instance shutdown during long-running jobs) |
| --jobconf mapred.map.tasks.speculative.execution | If true, EMR speculates that a slow-running job is failing and runs the same job in parallel | = false (because the time for each genome-vs-genome run varied widely, we set this argument to false to ensure maximal availability of the cluster) |
Specific commands passed through the Ruby command-line client to the Elastic MapReduce program (EMR) from Amazon Web Services. The inputs specified correspond to (1) the BLAST step and (2) the ortholog computation step of the RSD cloud algorithm. These configuration settings apply to both the EMR and Hadoop frameworks, with two exceptions: in EMR, a --j parameter can be used to provide an identifier for the entire cluster, useful only when more than one cloud cluster is needed simultaneously; in Hadoop, these commands are passed directly to the streaming.jar program, obviating the need for the --stream argument.
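For readers assembling such a step themselves, the arguments in Table 1 can be built programmatically before being passed to the client. A minimal sketch (the function name and parameter defaults are ours; the bucket and path names follow the table):

```python
def emr_stream_args(input_path, mapper, output_path, n_map_tasks,
                    cache_archive="s3n://rsd_bucket/executables.tar.gz#executables"):
    """Build the argument list for one Elastic MapReduce streaming step,
    mirroring the configuration in Table 1 (BLAST or ortholog step)."""
    return [
        "--stream",
        "--input", input_path,
        "--mapper", mapper,
        "--output", output_path,
        "--cache-archive", cache_archive,
        "--jobconf", f"mapred.map.tasks={n_map_tasks}",
        "--jobconf", "mapred.tasktracker.map.tasks.maximum=8",
        "--jobconf", "mapred.task.timeout=86400000",
        "--jobconf", "mapred.tasktracker.expiry.interval=3600000",
        "--jobconf", "mapred.map.tasks.speculative.execution=false",
    ]

# BLAST step of the RSD-cloud run, with one map task per BLAST process:
blast_args = emr_stream_args("hdfs:///home/hadoop/blast_runner",
                             "s3n://rsd_bucket/blast_mapper.py",
                             "hdfs:///home/hadoop/out", 21945)
```

The ortholog step would substitute the ortho_runner input and ortho_mapper.py, with no other changes.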
Summary of time and cost for Elastic MapReduce runs.
| Process type | Processes | Instances | Time | Total cost ($) |
|---|---|---|---|---|
| BLAST | 21,945 | 100 | 40 hours 0 min | $3,680 |
| Ortholog estimation | 281,160 | 100 | 28 hours 21 min | $2,622 |
These cost estimates are based on the use of the High-CPU Extra Large Instance at $0.80 per hour and use of EMR at $0.12 per hour. These costs assume constant processing without node failures. Total cost = $6,302.
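These totals follow directly from instance-hours multiplied by the combined EC2 and EMR hourly rates quoted above; a quick sanity check in Python (the function name is ours):

```python
def emr_run_cost(instances, hours, ec2_rate=0.80, emr_rate=0.12):
    """Total cost of a run: instance-hours times the combined per-hour
    rate, assuming constant processing and no node failures."""
    return instances * hours * (ec2_rate + emr_rate)

blast_cost = emr_run_cost(100, 40)   # BLAST step: 100 instances for 40 hours
```

At $0.92 per combined instance-hour, the 40-hour BLAST step on 100 instances comes to $3,680, matching the table.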
Figure 2. Example of the Compute Cloud user interface for monitoring the health of the cluster and progress of mapped cloud tasks. (A) The Cluster summary provided an overview of the compute cloud. (B) Running jobs listed the job ID of the current running task, the root user, the job name, and map task progress updates. (C) Completed Jobs provided an up-to-date summary of completed tasks. This user interface also provided information about failed steps as well as links to individual job logs and histories. Access to this user interface was through FoxyProxy, described in the Methods.
Figure 3. Example of the Job user interface for monitoring the status of individual jobs. (A) The Job summary provided job information such as the user, job start time, and duration of the job. (B) Job status gave the task completion rate and failure reporting. (C) The Job Counter indicated job progress and additional counters. The progression of the mapper was also displayed graphically at the bottom of the web UI page (not shown here). Access to this user interface was through FoxyProxy, described in the Methods.
Cost comparison of Amazon's cloud computing instance types.
| Instance Type | # Instances | Time * (hours) | Cost ($) |
|---|---|---|---|
| Standard Small (single core) | 50 | 1088 | 6256 ($0.115 per hour per instance) |
| Standard Large (dual core) | 50 | 544 | 12512 ($0.46 per hour per instance) |
| Standard Extra Large (quad core) | 50 | 272 | 12512 ($0.92 per hour per instance) |
| High-CPU Medium (dual core) | 50 | 544 | 6256 ($0.23 per hour per instance) |
| High-CPU Extra Large (eight core) | 50 | 136 | 6256 ($0.92 per hour per instance) |
Amazon's Elastic Compute Cloud (EC2) can be accessed via a number of different instance types. For the purposes of our comparative genomics problem, we elected to utilize the extra-large high-CPU instance. Note that the total cost for the small instance equals that of the extra-large instance, despite the large difference in computing time.
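The equal-cost observation falls out of the arithmetic: halving wall-clock time while doubling the hourly rate leaves the product unchanged. A sketch using the table's figures (the short instance-type keys are ours):

```python
# Per-instance-hour EC2 rates ($) and wall-clock hours for the full job
# on 50 instances, taken from the cost-comparison table.
RATE = {"standard_small": 0.115, "standard_large": 0.46,
        "standard_xl": 0.92, "highcpu_medium": 0.23, "highcpu_xl": 0.92}
HOURS = {"standard_small": 1088, "standard_large": 544,
         "standard_xl": 272, "highcpu_medium": 544, "highcpu_xl": 136}

def total_cost(instance_type, n_instances=50):
    """Total run cost: number of instances x wall-clock hours x hourly rate."""
    return n_instances * HOURS[instance_type] * RATE[instance_type]
```

The small instance at 1088 hours and the high-CPU extra-large at 136 hours both total $6,256, an eightfold difference in time at identical cost.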
Programs associated with the reciprocal smallest distance algorithm.
| Program name | Description |
|---|---|
| ReadFasta.py | a module used by RSD.py |
| RSD.py | the main program, which executes the reciprocal smallest distance (RSD) ortholog detection algorithm |
| BioUtilities.py | a suite of utilities, many of which wrap external programs such as clustalW and PAML |
| Utility.py | a package used by BioUtilities.py |
| Blast_compute.py | the main program that builds all-against-all BLAST databases for fast execution of RSD |
| clustal2phylip | a small Perl script that converts ClustalW alignment files into the format recognized by PAML |
| codeml.ctl_cp | the control file required by RSD to properly calculate the maximum likelihood estimates of distance between two protein sequences |
| execute.py | an error reporter used by RSD |
| RSD_common.py | the directive file used by RSD |
| examples | a directory containing examples of inputs and outputs to RSD.py and Blast_compute.py |
These programs are required for running the RSD package on a cloud computing platform such as Amazon's Elastic MapReduce. These programs are packaged and available for download at http://roundup.hms.harvard.edu and are also provided as additional files associated with the manuscript.
Figure 4. Workflow for establishment and execution of the reciprocal smallest distance algorithm using the Elastic MapReduce framework on the Amazon Elastic Compute Cloud (EC2). (1) Preconfiguration involves the general setup and porting of the RSD program and genomes to Amazon S3, and configuration of the mappers for executing the BLAST and RSD runs within the cluster. (2) Instantiation specifies the Amazon EC2 instance type (e.g. small, medium, or large), logging of cloud cluster performance, and preparation of the runner files as described in the Methods. (3) Job Flow Execution launches the processes across the cluster using the command-line arguments indicated in Table 1. This is done for the BLAST and RSD steps separately. (4) The all-vs-all BLAST utilizes the BLAST runner and BLAST mapper to generate a complete set of results for all genomes under consideration. (5) The ortholog computation step utilizes the RSD runner file and RSD mapper to estimate orthologs and evolutionary distances for all genomes under study. This step utilizes the stored BLAST results from step 4 and can be run asynchronously, at any time after the BLAST processes complete. The Amazon S3 storage bucket was used for persistent storage of BLAST and RSD results. The Hadoop Distributed File System (HDFS) was used for local storage of genomes and genome-specific BLAST results for faster I/O when running the RSD step. Additional details are provided in the Methods.
Figure 5. Example of the mapper program used to run the BLAST and ortholog estimation steps required by the reciprocal smallest distance algorithm (RSD). This example assumes a runner file containing precise command-line arguments for executing the separate steps of the RSD algorithm. The programs were written in Python.
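A Hadoop-streaming mapper of this kind reads one runner line per map task from standard input and shells out to the command it names. A minimal sketch under that assumption (this is not the authors' exact mapper; the runner-line format is illustrative):

```python
import shlex
import subprocess

def run_commands(lines):
    """Execute each non-empty runner line as an external command,
    emitting a tab-separated (command, return-code) pair per line on
    stdout, as Hadoop streaming expects from a mapper."""
    results = []
    for line in lines:
        cmd = line.strip()
        if not cmd:
            continue                          # skip blank runner lines
        rc = subprocess.call(shlex.split(cmd))
        results.append((cmd, rc))
        print(f"{cmd}\t{rc}")
    return results
```

In a real job flow this function would be driven by `sys.stdin`, with each map task receiving the slice of the runner file that Hadoop assigns to it.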