Pankaj Agarwal, Kouros Owzar.
Abstract
Advances in next generation sequencing (NGS) and mass spectrometry (MS) technologies have provided many new opportunities and angles for extending the scope of translational cancer research while creating tremendous challenges in data management and analysis. The resulting informatics challenge is invariably not amenable to the use of traditional computing models. Recent advances in scalable computing and associated infrastructure, particularly distributed computing for Big Data, can provide solutions for addressing these challenges. In this review, the next generation of distributed computing technologies that can address these informatics problems is described from the perspective of three key components of a computational platform, namely computing, data storage and management, and networking. A broad overview of scalable computing is provided to set the context for a detailed description of Hadoop, a technology that is being rapidly adopted for large-scale distributed computing. A proof-of-concept Hadoop cluster, set up for performance benchmarking of NGS read alignment, is described as an example of how to work with Hadoop. Finally, Hadoop is compared with a number of other current technologies for distributed computing.
Keywords: NGS; big data; cancer; cloud computing; cluster; data management; data storage; genomics; gpu; hadoop; high performance computing; informatics; scalable computing
Year: 2015 PMID: 25983539 PMCID: PMC4412427 DOI: 10.4137/CIN.S16344
Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351
MPI-based applications for bioinformatics.
| CATEGORY | APPLICATION | DESCRIPTION |
|---|---|---|
| Alignment | mpiBLAST | Implementation of BLAST on MPI for parallel execution. |
| | ClustalW-MPI | Implementation of Clustal-W, a multiple alignment tool, on MPI. |
| | mrNA | Short read alignment of NGS reads. |
| | MrBayes 3 | Bayesian phylogenetic analysis using MPI for parallelizing Markov chain Monte Carlo convergence. |
| Proteomics | Parallel Tandem | Implementation of X!Tandem for MS/MS spectra search against a protein database. |
GPU applications for bioinformatics.
| CATEGORY | APPLICATION | DESCRIPTION |
|---|---|---|
| Alignment | CUSHAW | Short read alignment based on the Burrows-Wheeler transform (BWT) |
| | SOAP3-dp | Short read alignment based on BWT with native BAM support |
| Proteomics | FastPaSS | Spectra matching using spectral library searching implemented in CUDA |
| | Tempest | Spectral matching using GPU-CPU |
| Motif Discovery | GPUmotif | Motif scan and de novo motif finding using GPU |
| Epigenetics | GPU-BSM | GPU-based tool for mapping whole genome bisulfite sequencing reads and estimating methylation levels |
| Systems Biology (see Ref. | ABC-SysBio | Simulate models written in the Systems Biology Markup Language (SBML) |
| | PMCGPU | Parallel simulators for Membrane Computing on the GPU |
| Genome-wide Inference | permGPU | Permutation resampling analysis for binary, quantitative, and censored time-to-event outcomes |
Cloud-based applications for bioinformatics.
| CATEGORY | APPLICATION | DESCRIPTION |
|---|---|---|
| Alignment | CloudBrush | Distributed de novo genome assembler based on MapReduce which can be run on a cloud. |
| | CloudBurst | Short read mapping software for Hadoop based on RMAP |
| | CloudBLAST | MapReduce-based BLAST which can be run on a cloud. |
| Proteomics | Integrated Proteomics Pipeline (IP2) | Proteomics data analysis pipeline also available on Amazon Web Services. |
| | ProteoCloud | Proteomics computing pipeline system on the cloud for peptide and protein identifications, available on Amazon Web Services. |
Figure 1. Core components of the Hadoop architecture. Adapted from Ref. 58.
Figure 2. Typical MapReduce algorithm workflow.
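Since the figure itself is not reproduced here, the map, shuffle, and reduce stages of such a workflow can be illustrated with a minimal single-process sketch. This is not Hadoop code and not the authors' pipeline: the k-mer counting task, the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `map_reduce`), and the sample reads are all assumptions chosen for illustration. A real Hadoop job would distribute the map and reduce tasks across cluster nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(read, k=3):
    """Map: emit a (k-mer, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def map_reduce(reads, k=3):
    """Run map, shuffle, and reduce in sequence in a single process."""
    mapped = chain.from_iterable(map_phase(read, k) for read in reads)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, vals) for key, vals in grouped.items())


if __name__ == "__main__":
    # Hypothetical toy reads, for illustration only.
    print(map_reduce(["ACGTAC", "GTACGT"]))
```

Each read is mapped independently and each key is reduced independently, which is what allows a framework such as Hadoop to run both stages in parallel across many machines.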
Figure 3. Representative subset of the Hadoop ecosystem.
Bioinformatics applications for Hadoop.
| CATEGORY | APPLICATION | DESCRIPTION |
|---|---|---|
| Alignment and Genome Assembly | Jnomics | Command-line-driven alignment on a Hadoop cluster, supporting a number of alignment programs |
| | Contrail | De novo assembly without a reference genome |
| | DistMap | Perl tool, works with 9 different aligners, very easy client-only installation |
| | Seal | Alignment tool based on Pydoop (a Python-based API for developing Hadoop applications) |
| RNA-seq | Eoulsan | Complete RNA-seq pipeline in addition to alignment |
| | Myrna | Complete RNA-seq pipeline, from alignment to differential expression, written in Perl, integrates R |
| Variant Calling | Crossbow | Performs alignment and SNP genotyping |
Figure 4. Cisco FlexPod cluster architecture.
Figure 5. Genomics data analysis pipeline.