Xiao-Lin Wu, Timothy M Beissinger, Stewart Bauck, Brent Woodward, Guilherme J M Rosa, Kent A Weigel, Natalia de Leon Gatti, Daniel Gianola.
Abstract
High-throughput computing (HTC) uses computer clusters to solve advanced computational problems, with the goal of achieving high throughput over relatively long periods of time. In genomic selection, for example, a set of markers covering the entire genome is used to train a model based on known data, and the resulting model is used to predict the genetic merit of selection candidates. Sophisticated models are very computationally demanding and, with several traits to be evaluated sequentially, computing time is long and output is low. In this paper, we present scenarios and basic principles of how HTC can be used in genomic selection, implemented using various techniques from simple batch processing to pipelining in distributed computer clusters. Various scripting languages, such as shell scripting, Perl, and R, are also very useful for devising pipelines. By pipelining, we can reduce total computing time and consequently increase throughput. In comparison to the traditional data processing pipeline residing on the central processors, performing general-purpose computation on a graphics processing unit provides a new-generation approach to massively parallel computing in genomic selection. While the concept of HTC may still be new to many researchers in animal breeding, plant breeding, and genetics, HTC infrastructures have already been built in many institutions, such as the University of Wisconsin-Madison, and these can be leveraged for genomic selection in terms of central processing unit capacity, network connectivity, storage availability, and middleware connectivity. Exploring existing HTC infrastructures as well as general-purpose computing environments will further expand our capability to meet the increasing computing demands posed by the unprecedented genomic data that we have today.
We anticipate that HTC will impact genomic selection via better statistical models, faster solutions, and more competitive products (e.g., from design of marker panels to realized genetic gain). Eventually, HTC may change our view of data analysis as well as decision-making in the post-genomic era of selection programs in animals and plants, or in the study of complex diseases in humans.
Keywords: Bayesian models; general-purpose computing; genomic selection; high-throughput computing; parallel programming; pipelining
Year: 2011 PMID: 22303303 PMCID: PMC3268564 DOI: 10.3389/fgene.2011.00004
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. Running multiple chains for Bayesian LASSO in a parallel setting: (A) list of processes showing 10 running jobs; (B) computing time for the 10 jobs (numbers on the .
Figure 2. Pipelining with sequential (upper) and parallel (middle and lower) execution. Here, S1–S4 can be viewed as four stages of the computing job, and each of them can consist of sub-stages such as S2a, S2b, and S2c.
Figure 3. A DAG with three sequential jobs.
Figure 4. A DAG with four nodes (among them jobs B and C run in parallel).
Figure 5. Workflow of a high-throughput computing pipeline for predicting genetic merit using candidate gene panels.
Figure 6. Running and waiting time in parallel computing for prediction of genetic merit using candidate genes for 15 quantitative traits, each with three alternative methods for feature selection. Here, the x-axis represents computing time, and the y-axis represents the number of jobs pending (yellow bars) and the number of jobs running (green bars).
Figure 7. Results stored on the submit machine (A) and also deployed in web-accessible folders (B).
Figure 8. Illustration of GPU-enabled parallel computing: (A) graphic representation of summing two vector .
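
The four-node DAG described in the Figure 4 caption (jobs B and C run in parallel) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the dependency layout (A first, then B and C, then D), the names `DAG`, `run_job`, and `run_dag`, and the use of a local thread pool in place of a cluster scheduler are all assumptions made for illustration.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Assumed layout: each key maps a job to the set of jobs it depends on.
# A has no parents, B and C each depend on A, and D depends on both B and C.
DAG = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}


def run_job(name):
    """Hypothetical placeholder for a real computing job (e.g. one model fit)."""
    return name


def run_dag(dag, max_workers=2):
    """Run each job once, starting a job as soon as all of its parents finish.

    Returns the groups of jobs that became ready together, in order.
    """
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    pending = {}  # maps each in-flight future to its job name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            ready = sorter.get_ready()
            if ready:
                waves.append(sorted(ready))
            for node in ready:
                pending[pool.submit(run_job, node)] = node
            # Block until at least one running job finishes.
            done, _ = wait(list(pending), return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # re-raise any exception from the job
                sorter.done(pending.pop(future))
    return waves


if __name__ == "__main__":
    for group in run_dag(DAG):
        print(" and ".join(group))
```

In this sketch B and C are submitted together once A completes, and D is held back until both have finished, which is the dependency pattern the caption describes. On a real cluster the same dependency table would be handed to a workflow manager rather than a thread pool.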