Nicholas R Wheeler, Penelope Benchek, Brian W Kunkle, Kara L Hamilton-Nelson, Mike Warfe, Jeremy R Fondran, Jonathan L Haines, William S Bush.
Abstract
Modern genomic studies are rapidly growing in scale, and the analytical approaches used to analyze genomic data are increasing in complexity. Genomic data management poses logistic and computational challenges, and analyses are increasingly reliant on genomic annotation resources that create their own data management and versioning issues. As a result, genomic datasets are increasingly handled in ways that limit the rigor and reproducibility of many analyses. In this work, we examine the use of the Spark infrastructure for the management, access, and analysis of genomic data in comparison to traditional genomic workflows on typical cluster environments. We validate the framework by reproducing previously published results from the Alzheimer's Disease Sequencing Project. Using the framework and analyses designed using Jupyter notebooks, Spark provides improved workflows, reduces user-driven data partitioning, and enhances the portability and reproducibility of distributed analyses required for large-scale genomic studies.
Year: 2020 PMID: 31797624 PMCID: PMC6956992
Source DB: PubMed Journal: Pac Symp Biocomput ISSN: 2335-6928
Figure 1. Illustration of UDFs and UDAFs for producing typical GWAS association results (A) and gene-based aggregation test results (B). User Defined Aggregation Functions (UDAFs) partition the genotype data into frames that are programmatically accessible to User Defined Functions (UDFs), which can implement R- and Python-based code. In this example, genotypes are aggregated by gene to produce results from the seqMeta R package.
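The aggregate-then-apply pattern described in the Figure 1 caption can be illustrated with a minimal plain-Python sketch. It does not use Spark or seqMeta and is not the authors' implementation. The record layout, the gene and variant names, and the functions group_by_gene, burden_per_sample, and gene_burden are all hypothetical, and a simple per-sample burden sum stands in for the gene-based test.

```python
from collections import defaultdict

# Hypothetical toy records: (gene, variant_id, alternate-allele dosages per sample).
RECORDS = [
    ("GENE_A", "var1", [0, 1, 0]),
    ("GENE_A", "var2", [1, 0, 0]),
    ("GENE_B", "var3", [0, 0, 2]),
]


def group_by_gene(records):
    """Aggregation step: collect all variant rows of a gene into one frame."""
    frames = defaultdict(list)
    for gene, variant_id, dosages in records:
        frames[gene].append((variant_id, dosages))
    return dict(frames)


def burden_per_sample(frame):
    """Per-gene function: sum dosages across the gene's variants for each sample.

    This is a stand-in for a real gene-based test, chosen only for brevity.
    """
    n_samples = len(frame[0][1])
    return [sum(dosages[i] for _, dosages in frame) for i in range(n_samples)]


def gene_burden(records):
    """Apply the per-gene function to every aggregated frame."""
    return {
        gene: burden_per_sample(frame)
        for gene, frame in group_by_gene(records).items()
    }


if __name__ == "__main__":
    for gene, burden in gene_burden(RECORDS).items():
        print(gene, burden)
```

In a distributed setting the grouping step would be handled by the framework rather than by an in-memory dictionary. The sketch only shows how a per-gene function sees one frame at a time.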
Figure 2. Comparison of the Spark-based versus Traditional workflows for conducting rare-variant analyses. Approximate timings are noted in red (timings for individual steps of the traditional framework were unavailable). Green labels denote advantages of the Spark-based over the traditional framework.
Figure 3. Manhattan plots of SKAT-O meta-analysis results from the Spark-based implementation versus the results published in Bis et al.