Jon Hill, Matthew Hambley, Thorsten Forster, Muriel Mewissen, Terence M Sloan, Florian Scharinger, Arthur Trew, Peter Ghazal.
Abstract
BACKGROUND: Microarray analysis allows the simultaneous measurement of thousands to millions of genes or sequences across tens to thousands of different samples. The analysis of the resulting data tests the limits of existing bioinformatics computing infrastructure. A solution to this issue is to use High Performance Computing (HPC) systems, which contain many processors and more memory than desktop computer systems. Many biostatisticians use R to process the data gleaned from microarray analysis and there is even a dedicated group of packages, Bioconductor, for this purpose. However, to exploit HPC systems, R must be able to utilise the multiple processors available on these systems. There are existing modules that enable R to use multiple processors, but these are either difficult to use for the HPC novice or cannot be used to solve certain classes of problems. A method of exploiting HPC systems, using R, but without recourse to mastering parallel programming paradigms is therefore necessary to analyse genomic data to its fullest.
Year: 2008 PMID: 19114001 PMCID: PMC2628907 DOI: 10.1186/1471-2105-9-558
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Parallel architectures. A) A block diagram of a shared memory system in which all processors access the same memory by means of a network. B) A block diagram of a message passing system in which memory is private to each processor. Processors share data by explicit communication over the network.
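The two architectures in Figure 1 can be contrasted with a minimal Python sketch. It is an illustration only and is not taken from the paper: threads stand in for processors, a `queue.Queue` stands in for the network, and the function names `sum_shared` and `sum_message_passing` are hypothetical.

```python
import queue
import threading


def sum_shared(data, n_workers=4):
    """Shared-memory style: every worker updates the same total."""
    total = [0]              # one memory location visible to all workers
    lock = threading.Lock()  # serialises concurrent updates to total

    def worker(chunk):
        partial = sum(chunk)
        with lock:
            total[0] += partial

    threads = [threading.Thread(target=worker, args=(data[i::n_workers],))
               for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]


def sum_message_passing(data, n_workers=4):
    """Message-passing style: workers keep private data and send results."""
    channel = queue.Queue()  # stands in for the network (assumption)

    def worker(chunk):
        private = list(chunk)      # private copy, never read by other workers
        channel.put(sum(private))  # explicit communication of the result

    threads = [threading.Thread(target=worker, args=(data[i::n_workers],))
               for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The receiver combines exactly one message per worker.
    return sum(channel.get() for _ in range(n_workers))
```

In the shared-memory version the workers coordinate through a lock on common state. In the message-passing version no state is shared, so the only coordination is the explicit send and receive.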
Figure 2. A task farm in which data is sent from the master to the slave processes. The slaves process the data and return a result. The slaves do not communicate with each other, only with the master.
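The task farm pattern in Figure 2 can be sketched in a few lines of Python. This is a hedged illustration rather than SPRINT's implementation: threads stand in for the worker processes, queues stand in for the network, and `task_farm` is a hypothetical name. A worker function that raises an exception is not handled here.

```python
import queue
import threading


def task_farm(tasks, func, n_workers=3):
    """Master hands tasks to workers and collects results in task order."""
    tasks = list(tasks)
    task_q = queue.Queue()    # master -> workers
    result_q = queue.Queue()  # workers -> master

    def worker():
        # Workers talk only to the master, via the two queues.
        while True:
            item = task_q.get()
            if item is None:  # sentinel: no more work
                break
            index, payload = item
            result_q.put((index, func(payload)))

    workers = [threading.Thread(target=worker) for _ in range(n_workers)]
    for w in workers:
        w.start()

    # Master distributes the work, then one sentinel per worker.
    for index, payload in enumerate(tasks):
        task_q.put((index, payload))
    for _ in workers:
        task_q.put(None)

    # Master gathers one result per task and restores the original order.
    results = [None] * len(tasks)
    for _ in range(len(tasks)):
        index, value = result_q.get()
        results[index] = value

    for w in workers:
        w.join()
    return results
```

For example, `task_farm([1, 2, 3, 4], lambda x: x * x)` returns the squares in input order, whichever worker happened to compute each one.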
Figure 3. The SPRINT framework. SPRINT runs across all processors. The R application (which runs the R script) runs on the master processor and links to the SPRINT library via the R to C interface. The compute farm library uses files to communicate with the compute farm, which can then execute functions from the library across its processors. The pcor library is a parallel function library (in this case, parallel correlation). Other libraries can be added, such as "hello", which is a simple "Hello World" function.
Figure 4. Altering the R script. The upper box shows the original R script used to carry out the correlation. The lower box shows the modified R script. Only two additional lines are needed, and the function call (cor) is changed to pcor.
Figure 5. Performance and strong scaling of the parallel correlation function. The top graph shows the execution time of the correlation function. The dashed horizontal line is the time taken for R to execute the correlation on a single processor. The solid line shows the time for the parallelised version within SPRINT. The bottom graph shows the strong scaling (same data, different number of processors) for the parallel correlation function within SPRINT. The straight, dashed line shows linear scaling based on the execution time of R running on a single processor.
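Strong scaling as plotted in Figure 5 is usually summarised by speedup, S(p) = T(1) / T(p), and parallel efficiency, E(p) = S(p) / p, where T(1) is the single-processor time and T(p) the time on p processors. Linear scaling corresponds to S(p) = p. The sketch below shows the arithmetic; the function name `strong_scaling` is hypothetical and the timings are invented for illustration, not taken from the paper.

```python
def strong_scaling(t_serial, timings):
    """Return {p: (speedup, efficiency)} for a fixed problem size.

    t_serial -- execution time on a single processor (the baseline)
    timings  -- mapping of processor count p to execution time on p processors
    """
    table = {}
    for p, t_parallel in sorted(timings.items()):
        speedup = t_serial / t_parallel   # S(p) = T(1) / T(p)
        table[p] = (speedup, speedup / p) # E(p) = S(p) / p
    return table


# Hypothetical timings in seconds (illustrative only, not measured data).
example = strong_scaling(100.0, {2: 50.0, 4: 30.0, 8: 20.0})
for p, (speedup, efficiency) in example.items():
    print(f"p={p}: speedup={speedup:.2f}, efficiency={efficiency:.2f}")
```

An efficiency of 1.0 means the run sits on the linear-scaling line. Values below 1.0 show how far the measured curve falls beneath it.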