| Literature DB >> 24464852 |
Allison P Heath1, Matthew Greenway1, Raymond Powell1, Jonathan Spring1, Rafael Suarez1, David Hanley1, Chai Bandlamudi1, Megan E McNerney2, Kevin P White3, Robert L Grossman4.
Abstract
BACKGROUND: As large genomics and phenotypic datasets are becoming more common, it is increasingly difficult for most researchers to access, manage, and analyze them. One possible approach is to provide the research community with several petabyte-scale cloud-based computing platforms containing these data, along with tools and resources to analyze it.Entities:
Keywords: biomedical clouds; cloud computing; genomic clouds
Mesh:
Year: 2014 PMID: 24464852 PMCID: PMC4215034 DOI: 10.1136/amiajnl-2013-002155
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497
Figure 1Screenshot of the Tukey console from the Bionimbus Protected Data Cloud. A user can click an image and then click the button labeled ‘Launch’ to start one or more virtual machines. Different images contain different tools, utilities, and pipelines. Users can also create their own custom images containing specific pipelines, tools, and applications of interest.
Figure 2Major components of the Open Science Data Cloud (OSDC). The OpenFISMA-based application for monitoring, compliance and security is only part of the Bionimbus Protected Data Cloud, and not used by the OSDC in general. FISMA, Federal Information Security Management Act.
Figure 3Moving large genomics datasets over wide area networks can be difficult. The Open Science Data Cloud (OSDC) supports several protocols for moving datasets, including the open-source UDR protocol, which integrates UDT with rsync and is designed to synchronize large datasets over wide-area, high-performance networks. The figure shows the relative comparison of UDR with rsync when synchronizing the Encyclopedia of DNA Elements (ENCODE) repository between the OSDC in Chicago and the ENCODE Data Coordination Center in Santa Cruz. The transfer speed varies, but UDR consistently has at least four to five times the performance of rsync. ENCODE is open access and UDR without encryption can be used. UDR with encryption enabled (for moving controlled-access data) achieves about 660 Mb/s when transferring data between the Bionimbus Protected Data Cloud in Chicago with a server at the Ontario Institute for Cancer Research in Toronto, Canada. The speed can be increased using disks with higher throughput or using multiple flows to multiple disks.
The Open Science Data Cloud (OSDC; in October 2013) consists of 7994 cores, 9.2 PB of raw storage and 5.7 PB of usable storage
| All OSDC resources | Cores | Compute cores | Total RAM (TB) | Computed RAM (TB) | Raw storage (PB) | Usable storage (PB) |
|---|---|---|---|---|---|---|
| Open access clouds | 3632 | 3168 | 13.9 | 12.9 | 4.4 | 2.6 |
| Bionimbus PDC | 2816 | 2664 | 11.0 | 10.9 | 2.5 | 1.5 |
| Hadoop | 1090 | 1030 | 3.5 | 3.2 | 1.5 | 1.1 |
| Other resources | 456 | 352 | 1.5 | 1.4 | 0.8 | 0.5 |
| Total | 7994 | 7214 | 29.9 | 28.4 | 9.2 | 5.7 |
This includes both the open-access Bionimbus Community Cloud and the controlled-access Bionimbus Protected Data Cloud (PDC). There is a data commons that is part of the open-access OSDC that consists of 1.4 PB of raw storage and 0.9 PB of usable storage. The Bionimbus PDC contains a data commons that contains about 0.5 PB of usable storage. The Bionimbus PDC can also access data from the open-access data commons. The other OSDC resources consist of a test bed, development machines, web servers, etc. During October 2013, there were about 150 active users out of a total of about 360 users. During October 2013, an average of 47% of the available core hours were used for the resources that were available for general users. Older resources were more highly used (86%), while newer resources were more lightly used (18%). The amount of usable space is less than the raw storage, since we make use of redundant array of independent disks (RAID) 5 and RAID 6 for GlusterFS and keep disks as hot spares.
Figure 4This figure compares the charges for Amazon Web Services (AWS) S3 storage, with the costs incurred as the Open Science Data Cloud (OSDC) adds 1 PB of storage. The AWS costs were computed using the Simple Monthly Calculator on the AWS web site (/calculator.s3.amazonaws.com/calc5.html) and does not include the cost of accessing the data. We used a simple cost model for the OSDC that includes the capital charges for equipment amortized over 3 years, the operating costs of supplying power, space and cooling, and the operating costs of the staff required to manage the OSDC. This cost model assumes that we are operating a minimum of 20 racks and that we refresh one-third of the racks each year.