Kary Ocaña, Daniel de Oliveira.
Abstract
Today's genomic experiments must process so-called "biological big data," which now reaches terabytes and petabytes in size. Processing this volume of data on their own workstations may take scientists weeks or months. Parallelism techniques and high-performance computing (HPC) environments can reduce the total processing time and ease the management, treatment, and analysis of these data. However, running bioinformatics experiments in HPC environments such as clouds, grids, clusters, and graphics processing units requires scientists to have the expertise to integrate computational, biological, and mathematical techniques and technologies. Several solutions have already been proposed to allow scientists to process their genomic experiments using HPC capabilities and parallelism techniques. This article presents a systematic literature review that surveys the most recently published research involving genomics and parallel computing. Our objective is to gather the main characteristics, benefits, and challenges that scientists can consider when running their genomic experiments so as to benefit from parallelism techniques and HPC capabilities.
Keywords: cloud computing; cluster computing; genomic research; grid computing; high-performance computing; parallel computing
Year: 2015 PMID: 26604801 PMCID: PMC4655901 DOI: 10.2147/AABC.S64482
Source DB: PubMed Journal: Adv Appl Bioinform Chem ISSN: 1178-6949
Figure 1. Defined search string with genomics and HPC-related key terms.
Abbreviation: HPC, high-performance computing.
Electronic scientific databases selected as sources
| Database | URL |
|---|---|
| PubMed | |
| ACM Digital Library | |
| IEEE Xplore | |
| Scopus | |
| Google Scholar | |
Abbreviations: ACM, Association for Computing Machinery; IEEE, Institute of Electrical and Electronics Engineers; URL, Uniform Resource Locator.
Main information about publications related to genomics and parallel computing
| Software/Developer name | Bioinformatics field | Bioinformatics applications | HPC infrastructure | Execution main findings | Publication source | Type of study | Year | Research country |
|---|---|---|---|---|---|---|---|---|
| Bernardes et al | Comparative and evolutionary genomics | HMMER | Clusters | SE | BMC Bioinformatics | RA | 2007 | Brazil |
| AMPHORA | Phylogenomics | BLAST, ClustalW, HMMER, PhyML, MEGAN | Clusters and grids | SE | Bioinformatics | SA | 2012 | USA |
| Ahmed et al | Comparative genomics | Some of the most used assembly approaches | Clusters | PT | Interdiscip Sci | MA | 2011 | USA |
| Hadoop-BAM | Genomics | Picard SAM JDK, SAMtools | Clusters-Hadoop | PT | Bioinformatics | MA | 2012 | Finland |
| EDGAR | Comparative genomics | BLAST | Clusters or Sun Grid Engine (SGE) | PT | BMC Bioinformatics | SA | 2009 | Germany |
| Armadillo | Phylogenomics | BLAST, PAML, PROTML, PHYLIP | Not enough information about the HPC infrastructure used | PT | PLoS One | SA | 2012 | Canada |
| eHive | Comparative genomics | BLAST, BLAT | Portable Batch System (PBS) or SGE | HPC, PT | BMC Bioinformatics | MA | 2010 | UK |
| Tavaxy II | Genomics, NGS, assembly, variants, metagenomics, phylogeny | BLAST, MegaBLAST, SAMtools, ClustalW, Muscle | Local, clouds | HPC, PT | BMC Bioinformatics | SA | 2012 | Egypt |
| Bioconductor | Genomics, NGS, assembly, variants, metagenomics, proteomics, phylogeny | R packages for more than 1,024 bioinformatics software packages | Local, clouds | HPC, PT | Nat Methods | RA | 2015 | USA |
| Kleftogiannis | Assembly | Overlap-layout-consensus (OLC) and de Bruijn graph (DBG) assembly approaches | Clouds | HPC, PT | PLoS One | RA | 2013 | Saudi Arabia |
| Nefedov and Sadygov | Proteomics | Method for enumerating all amino acid compositions up to a given length | Cluster | HPC, PT | BMC Bioinformatics | SA | 2011 | USA |
| Yabi | Genomics, transcriptomics, proteomics | Repeatmasker, genscan, MzXML2Search, Peptide Prophet, Mascot | Local, grids, clouds | HPC, PT | Source Code Biol Med | RA | 2012 | Australia |
| Crossbow | Genomics | Bowtie, SOAPsnp | Local, cluster, clouds (Amazon EC2-Hadoop) | HPC, PT | Current Protocols in Bioinformatics | MA | 2012 | USA |
| Rainbow | Genomics | Some tools for NGS | Clouds (Amazon EC2) | HPC, PT | BMC Genomics | SA | 2013 | USA |
| BioNode | Evolutionary genomics | PAML, Muscle, MAFFT, MrBayes, BLAST | Networked PCs, Clouds | HPC, PT | Methods Mol Biol | MA | 2012 | the Netherlands |
| ProteinSPA | Genomics | mpiBLAST | Clusters, grids | HPC, PT | Parallel and Distributed Processing and Applications | MA | 2005 | People’s Republic of China |
| Bionimbus | Comparative genomics | Comparative genomics-related applications | Grids | HPC, PT | J Am Med Inform Assoc | RA | 2014 | USA |
| PheGee | Comparative genomics | BLAST | Grids | HPC, PT | IEEE Trans Inf Technol Biomed | MA | 2008 | Singapore |
| iTree | Phylogenomics | BLAST, PhyML | Grids | HPC, PT | Cairo International Biomedical Engineering Conference | MA | 2010 | USA |
| elasticHPC | Genomics | Similar to CloudBioLinux | Clouds | HPC, PT | BMC Bioinformatics | MA | 2012 | Egypt |
| Mercury | Genomics | Tools for NGS pipeline | Clouds (Amazon EC2) | HPC, PT | BMC Bioinformatics | MA | 2014 | USA |
| CloudMap | Analysis of mutant genome sequences | PHRED, GATK, Bowtie, BWA | Clouds (Amazon EC2-Galaxy) | HPC, PT | Genetics | RA | 2012 | USA |
| Roundup | Comparative genomics | BLAST, ClustalW, PAML, RSD algorithm | Clouds (Amazon EC2-Hadoop) | HPC, PT | BMC Bioinformatics | MA | 2010 | USA |
| CloudBioLinux | Genomics | More than 135 tools | Clouds (Amazon EC2, Eucalyptus) | HPC, PT | BMC Bioinformatics | SA | 2012 | USA |
| SciHmm | Genomics | HMMER | Clouds (Amazon EC2-SciCumulus) | HPC, PT | Future Generation Computer Syst | RA | 2013 | Brazil |
| SciPhy | Phylogeny | RAxML | | HPC, PT | Advances in Bioinformatics and Computational Biology | MA | 2011 | Brazil |
| SciPhylomics | Phylogenomics | RAxML | | HPC, PT | Future Generation Computer Syst | RA | 2013 | Brazil |
| SciEvol | Evolution | PAML | | HPC, PT | Advances in Bioinformatics and Computational Biology | MA | 2012 | Brazil |
| SciDock | Docking | AutoDock | | HPC, PT | HiComb | MA | 2014 | Brazil |
| SciSamma | Homology modeling | AutoDock Vina, MODELLER, PROCHECK | | HPC, PT | ICSOC | MA | 2014 | Brazil |
Notes:
HPC, HPC approaches used for genomic analysis; PT, parallel techniques coupled to these approaches.
Abbreviations: RSD, reciprocal smallest distance; NGS, next-generation sequencing; HPC, high-performance computing; SE, standalone/serial execution; MA, methodology article; SA, software article; RA, research article; PT, parallel techniques; FASTA, Fast-All; BLAST, basic local alignment search tool; PhyML, phylogenetic estimation using maximum likelihood; MEGAN, MEta Genome ANalyzer; SAM, sequence alignment/map; JDK, Java development kit; PAML, phylogenetic analysis by maximum likelihood; PROTML, maximum likelihood inference of protein phylogeny; PHYLIP, phylogeny inference package; BLAT, BLAST-like alignment tool; Muscle, MUltiple sequence comparison by log-expectation; MAFFT, multiple alignment using fast Fourier transform; mpiBLAST, MPI basic local alignment search tool; GATK, genome analysis toolkit; BWA, Burrows-Wheeler aligner; RAxML, randomized axelerated maximum likelihood.