Literature DB >> 20622843

Cloud computing and the DNA data race.

Michael C Schatz¹, Ben Langmead, Steven L Salzberg.

Abstract

Entities: Disease Gene Species

Mesh：

Substances：
DNA

Year: 2010 PMID： 20622843 PMCID： PMC2904649 DOI： 10.1038/nbt0710-691

Source DB: PubMed Journal: Nat Biotechnol ISSN： 1087-0156 Impact factor: 54.908

× No keyword cloud information.

In the race between DNA sequencing throughput and computer speed, sequencing is winning by a mile. Sequencing throughput has recently been improving at a rate of about 5-fold per year1, while computer performance generally follows “Moore's Law,” doubling only every 18 or 24 months2. As this gap widens, the question of how to design higher-throughput analysis pipelines becomes critical. If analysis throughput fails to turn the corner, research projects will continually stall until analyses catch up. How do we close the gap? One option is to invent algorithms that make better use of a fixed amount of computing power. Unfortunately, algorithmic breakthroughs of this kind, like scientific breakthroughs, are difficult to plan or foresee. A more practical option is to concentrate on developing methods that make better use of multiple computers and processers. When many computer processors work together in parallel, a software program can often finish in significantly less time. While parallel computing has existed for decades in various forms3–5, a recent manifestation called “cloud computing” holds particular promise. Cloud computing is a model whereby users access compute resources from a vendor over the Internet1, such as from the commercial Amazon Elastic Compute Cloud6, or the academic DOE Magellan Cloud7. The user can then apply the computers to any task, such as serving web sites, or even running computationally intensive parallel bioinformatics pipelines. Vendors benefit from vast economies of scale8, allowing them to set fees that are competitive with what users would otherwise have spent building an equivalent facility, and potentially saving all the ongoing costs incurred by a facility that consumes space, electricity, cooling, and staff support. Finally, because the pool of resources available “in the cloud” is so large, customers have substantial leeway to “elastically” grow and shrink their allocations. Cloud computing is not a panacea: it poses problems for developers and users of cloud software, requires large data transfers over precious low-bandwidth Internet uplinks, raises new privacy and security issues, and is an inefficient solution for some types of problems. On balance, though, cloud computing is an increasingly valuable tool for processing large datasets, and it is already used by the US federal government9, pharmaceutical10 and Internet companies11, as well as scientific labs12 and bioinformatics services13, 14. Furthermore, several bioinformatics applications and resources have been developed to specifically address the challenges of working with the very large volumes of data generated by second-generation sequencing technology (Table 1).

Table 1

Bioinformatics Cloud Resources

Applications
CloudBLAST34	Scalable BLAST in the Clouds http://www.acis.ufl.edu/~ammatsun/mediawiki-1.4.5/index.php/CloudBLAST_Project
CloudBurst19	Highly Sensitive Short Read Mapping http://cloudburst-bio.sf.net
Cloud RSD26	Reciprocal Smalest Distance Ortholog Detection http://roundup.hms.harvard.edu
Contrail27	De novo assembly of large genomes http://contrail-bio.sf.net
Crossbow22	Alignment and SNP Genotyping http://bowtie-bio.sf.net/crossbow/
Myrna25	Differential expression analysis of mRNA-seq http://bowtie-bio.sf.net/myrna/
Quake35	Quality guided correction of short reads http://github.com/davek44/error_correction/

MapReduce and Genomics

Parallel programs run atop a parallel “framework” to enable efficient, fault-tolerant parallel computation without making the developer's job too difficult. The Message Passing Interface (MPI) framework3, for example, gives the programmer ample power to craft parallel programs, but requires relatively complicated software development. Batch processing systems such as Condor4, are very effective for running many independent computations in parallel, but are not expressive enough for more complicated parallel algorithms. In between, the MapReduce framework15 is efficient for many (although not all) programs, and makes the programmer's job simpler by automatically handling duties such as job scheduling, fault tolerance, and distributed aggregation. MapReduce was originally developed at Google to streamline analyses of very large collections of webpages. Google's implementation is proprietary, but Hadoop16 is a popular open source alternative maintained by the Apache Software Foundation. Hadoop/MapReduce programs comprise a series of parallel computational steps (Map and Reduce), interspersed with aggregation steps (Shuffle). Despite its simplicity, Hadoop/MapReduce has been successfully applied to many large-scale analyses within and outside of DNA sequence analysis17–21. In a genomics context, Hadoop/MapReduce is particularly well suited for common “Map-Shuffle-Scan” pipelines (Figure 1) that use the following paradigm:

Figure 1

Map-Shuffle-Scan framework used by Crossbow

Users begin by uploading the sequencing reads into the cloud storage. Hadoop, running on a cluster of virtual machines in the cloud, then maps the unaligned reads to the reference genome using many parallel instances of Bowtie. Hadoop then automatically shuffles the alignments into sorted bins determined by chromosome region. Finally, many parallel instances of SOAPsnp scan the sorted alignments in each bin. The final output is a stream of SNP calls stored within the cloud that can be downloaded back to the user's local computer.

Map: many reads are mapped to the reference genome in parallel on multiple machines. Shuffle: the alignments are aggregated so that all alignments on the same chromosome or locus are grouped together and sorted by position. Scan: the sorted alignments are scanned to identify biological events such as polymorphisms or differential expression within each region. For example, the Crossbow22 genotyping program leverages Hadoop/MapReduce to launch many copies of the short read aligner Bowtie23 in parallel. After Bowtie has aligned the reads (which may number in the billions for a human re-sequencing project) to the reference genome, Hadoop automatically sorts and aggregates the alignments by chromosomal region. It then launches many parallel instances of the Bayesian SNP caller SOAPsnp24 to accurately call SNPs from the alignments. In our benchmark test on the Amazon cloud, Crossbow genotyped a human sample comprising 2.7 billion reads in ~4 hours, including the time required for uploading the raw data, for a total cost of $85 USD22. Programs with abundant parallelism tend to scale well to larger clusters; i.e., increasing the number of processors proportionally decreases the running time, less any additional overhead or non-parallel components. Several comparative genomics pipelines have been shown to scale well using Hadoop19, 22, 25, 26, but not all genomics software is likely to follow suit. Hadoop, and cloud computing in general, tends to reward “loosely coupled” programs where processors work independently for long periods and rarely coordinate with each other. But some algorithms are inherently “tightly coupled,” requiring substantial coordination and making them less amenable to cloud computing. That being said, PageRank20 (Google's algorithm for ranking web pages) and Contrail27 (a large-scale genome assembler) are examples of relatively tightly coupled algorithms that have been successfully adapted to MapReduce in the cloud.

Cloud computing obstacles

To run a cloud program over a large dataset, the input must first be deposited in a cloud resource. Depending on data size and network speed, transfers to and from the cloud can pose a significant barrier. Some institutions and repositories connect to the Internet via high-speed backbones such as Internet2 and JANET, but each potential user should assess whether their data generation schedule is compatible with transfer speeds achievable in practice. A reasonable alternative is to physically ship hard drives to the cloud vendor28. Another obstacle is usability. The rental process is complicated by technical questions of geographic zones, instance types, and which software image the user plans to run. Fortunately, efforts such as the Galaxy project29 and Amazon's Elastic MapReduce service30 enhance usability by allowing customers to launch and manage resources and analyses through a point-and-click web interface. Data security and privacy are also concerns. Whether storing and processing data in the cloud is more or less secure than doing so locally is a complicated question, depending as much on local policy as on cloud policy. That said, regulators and Institutional Review Boards are still adapting to this trend, and local computation is still the safer choice when privacy mandates apply. An important exception is HIPAA; several HIPAA-compliant companies already operate cloud-based services31. Finally, cloud computing often requires re-designing applications for parallel frameworks like Hadoop. This takes expertise and time. A mitigating factor is that Hadoop's “streaming mode” allows existing non-parallel tools to be used as computational steps. For instance, Crossbow uses the non-cloud programs Bowtie and SOAPsnp, albeit with some small changes to format intermediate data for the Hadoop framework. New parallel programming frameworks, such as DryadLINQ32 and Pregel33 can also help in some cases by providing richer programming abstractions. But for problems where the underlying parallelism is sufficiently complex, researchers may have to develop sophisticated new algorithms.

Recommendations

With biological datasets accumulating at ever faster rates, it is better to prepare for distributed and multi-core computing sooner rather than later. The cloud provides a vast, flexible source of computing power at a competitive cost, potentially allowing researchers to analyze ever-growing sequencing databases while relieving them of the burden of maintaining large computing facilities. On the other hand, the cloud requires large, possibly network-clogging data transfers, it can be challenging to use, and it isn't suitable for all types of analysis tasks. For any research group considering the use of cloud computing for large-scale DNA sequence analysis, we recommend a few concrete steps: Verify that your DNA sequence data will not overwhelm your network connection, taking into account expected upgrades for any sequencing instruments. Determine whether cloud computing is compatible with any privacy or security requirements associated with your research. Determine whether necessary software tools exist and can run efficiently in a cloud context. Is new software needed, or can existing software be adapted to a parallel framework? Consider the time and expertise required. Consider cost: what is the total cost of each alternative? Consider the alternative: is it justified to build and maintain, or otherwise gain access to a sufficiently powerful non-cloud computing resource? If these prerequisites are met, then computing “in the cloud” can be a viable option to keep pace with the enormous data streams produced by the newest DNA sequencing.

8 in total

1. Galaxy: a platform for interactive large-scale genome analysis.

Authors: Belinda Giardine; Cathy Riemer; Ross C Hardison; Richard Burhans; Laura Elnitski; Prachi Shah; Yi Zhang; Daniel Blankenberg; Istvan Albert; James Taylor; Webb Miller; W James Kent; Anton Nekrutenko
Journal: Genome Res Date: 2005-09-16 Impact factor: 9.043

2. SNP detection for massively parallel whole-genome resequencing.

Authors: Ruiqiang Li; Yingrui Li; Xiaodong Fang; Huanming Yang; Jian Wang; Karsten Kristiansen; Jun Wang
Journal: Genome Res Date: 2009-05-06 Impact factor: 9.043

3. The case for cloud computing in genome informatics.

Authors: Lincoln D Stein
Journal: Genome Biol Date: 2010-05-05 Impact factor: 13.583

4. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome.

Authors: Ben Langmead; Cole Trapnell; Mihai Pop; Steven L Salzberg
Journal: Genome Biol Date: 2009-03-04 Impact factor: 13.583

5. Cloud computing for comparative genomics.

Authors: Dennis P Wall; Parul Kudtarkar; Vincent A Fusaro; Rimma Pivovarov; Prasad Patil; Peter J Tonellato
Journal: BMC Bioinformatics Date: 2010-05-18 Impact factor: 3.169

6. MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees.

Authors: Suzanne J Matthews; Tiffani L Williams
Journal: BMC Bioinformatics Date: 2010-01-18 Impact factor: 3.169

7. Searching for SNPs with cloud computing.

Authors: Ben Langmead; Michael C Schatz; Jimmy Lin; Mihai Pop; Steven L Salzberg
Journal: Genome Biol Date: 2009-11-20 Impact factor: 13.583

8. CloudBurst: highly sensitive read mapping with MapReduce.

Authors: Michael C Schatz
Journal: Bioinformatics Date: 2009-04-08 Impact factor: 6.937

8 in total

78 in total

1. Opportunities and challenges for the life sciences community.

Authors: Eugene Kolker; Elizabeth Stewart; Vural Ozdemir
Journal: OMICS Date: 2012-03

2. Compressive genomics.

Authors: Po-Ru Loh; Michael Baym; Bonnie Berger
Journal: Nat Biotechnol Date: 2012-07-10 Impact factor: 54.908

Review 3. The long tail and rare disease research: the impact of next-generation sequencing for rare Mendelian disorders.

Authors: Tony Shen; Ariel Lee; Carol Shen; C Jimmy Lin
Journal: Genet Res (Camb) Date: 2015-09-14 Impact factor: 1.588

4. Optimizing high performance computing workflow for protein functional annotation.

Authors: Larissa Stanberry; Bhanu Rekepalli; Yuan Liu; Paul Giblock; Roger Higdon; Elizabeth Montague; William Broomall; Natali Kolker; Eugene Kolker
Journal: Concurr Comput Date: 2014-09-10 Impact factor: 1.536

5. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks.

Authors: Cole Trapnell; Adam Roberts; Loyal Goff; Geo Pertea; Daehwan Kim; David R Kelley; Harold Pimentel; Steven L Salzberg; John L Rinn; Lior Pachter
Journal: Nat Protoc Date: 2012-03-01 Impact factor: 13.491

6. Flow cytometric chromosome sorting from diploid progenitors of bread wheat, T. urartu, Ae. speltoides and Ae. tauschii.

Authors: István Molnár; Marie Kubaláková; Hana Šimková; András Farkas; András Cseh; Mária Megyeri; Jan Vrána; Márta Molnár-Láng; Jaroslav Doležel
Journal: Theor Appl Genet Date: 2014-02-20 Impact factor: 5.699