| Literature DB >> 26651996 |
Aidan R O'Brien1,2, Neil F W Saunders1, Yi Guo3, Fabian A Buske4,5, Rodney J Scott6, Denis C Bauer7.
Abstract
BACKGROUND: Genomic information is increasingly used in medical practice, giving rise to the need for efficient analysis methodology able to cope with thousands of individuals and millions of variants. The widely used Hadoop MapReduce architecture and its associated machine learning library, Mahout, provide the means for tackling computationally challenging tasks. However, many genomic analyses do not fit the MapReduce paradigm. We therefore utilise the recently developed Spark engine, along with its associated machine learning library, MLlib, which offers more flexibility in the parallelisation of population-scale bioinformatics tasks. The resulting tool, VariantSpark, provides an interface from MLlib to the standard variant format (VCF), offers seamless genome-wide sampling of variants and provides a pipeline for visualising results.
Year: 2015 PMID: 26651996 PMCID: PMC4676146 DOI: 10.1186/s12864-015-2269-7
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
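The interface from VCF to MLlib described in the abstract rests on converting each variant's genotype calls into numeric feature vectors. As a rough illustration only (the function names and the simplified VCF line below are invented, not taken from VariantSpark's code), a single VCF data line can be mapped to per-individual alternate-allele counts like this:

```python
# Illustrative sketch of the VCF-to-feature-vector step: genotype calls such
# as "0|1" are converted to alternate-allele counts (0, 1 or 2) so that a
# clustering library can consume them as numeric vectors.
# All names and the toy VCF line are hypothetical.

def genotype_to_count(gt: str) -> int:
    """Convert a genotype call like '0|1' or '1/1' to an alternate-allele count."""
    alleles = gt.replace("|", "/").split("/")
    return sum(1 for a in alleles if a not in ("0", "."))

def vcf_line_to_vector(line: str) -> list[int]:
    """Extract one feature per individual from a single VCF data line."""
    fields = line.rstrip("\n").split("\t")
    # The first 9 columns are fixed VCF metadata; genotypes start at column 10.
    return [genotype_to_count(sample.split(":")[0]) for sample in fields[9:]]

line = "22\t16050075\t.\tA\tG\t.\tPASS\t.\tGT\t0|0\t0|1\t1|1"
vcf_line_to_vector(line)  # -> [0, 1, 2]
```

Stacking these per-variant vectors across all variants yields, for each individual, the high-dimensional point that the clustering step operates on.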
The resource consumption of the six compared methods, as well as the accuracy (measured as adjusted Rand index), on chromosome 22
| Tool | Pre-processing threads | Pre-processing memory (GB) | Pre-processing time | Clustering threads | Clustering memory (GB) | Clustering time | Accuracy |
|---|---|---|---|---|---|---|---|
|  | 8 | 32 | 2 min 58 sec | 8 | 32 | 1 min 20 sec | 0.84 |
|  | 8 | 32 | 12 min 48 sec | 8 | 32 | 1 min 52 sec | 0.84 |
| Hadoop | 8 | 32 | 14 min 22 sec | 8 | 32 | 14 min 23 sec | 0.84 |
| R | 1 | 32 | 34 min 30 sec | 8 | 32 | 7 min 25 sec | 0.84 |
| Python | 1 | 32 | 34 min 15 sec | 8 | 32 | 11 min 29 sec | 0.84 |
|  | 1 | 32 | 10 min 08 sec | 8 | 32 | 8 min 19 sec | 0.25 |
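The accuracy column above is the adjusted Rand index (ARI), which scores how well the predicted clusters agree with the true population labels while correcting for chance agreement. The paper would have computed it with an existing library; the self-contained reference implementation below is only for illustration:

```python
# Adjusted Rand index from the contingency table of two labelings.
# ARI = 1.0 for identical partitions (up to renaming), ~0 for random ones.
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    n = len(labels_true)
    pairs = Counter(zip(labels_true, labels_pred))   # contingency table counts
    a = Counter(labels_true)                         # row sums
    b = Counter(labels_pred)                         # column sums
    sum_comb = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)            # chance-level agreement
    max_index = (sum_a + sum_b) / 2
    return (sum_comb - expected) / (max_index - expected)

# Perfect agreement up to label renaming yields 1.0:
adjusted_rand_index([0, 0, 1, 1], ["x", "x", "y", "y"])  # -> 1.0
```

The chance correction is what makes the 0.25 in the last row meaningful: that method's partition agrees with the true populations only slightly better than what relabeling individuals at random would produce.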
Fig. 1 Comparison of methods and genome-wide scaling experiment. Left: Runtime for clustering variants from chromosome 22, given in seconds, with 32 GB of memory on 8 threads (except for the pre-processing in R and Python, where multi-threading was not supported). Right: Scaling from 20 % to 100 % of variants in the genome with the maximal number of executors and the lowest possible memory assignment
Fig. 2 Visualisation of VariantSpark predicted clusters. The figure shows the four clusters predicted for the 1000 Genomes data. Individuals from the super-populations AFR, AMR and EAS are accurately grouped into distinct clusters. The fourth cluster contains predominantly EUR and AMR individuals, potentially reflecting migrational backgrounds
The resource consumption on different subsets of the entire autosome (chromosomes 1–22) of phase 1, as well as all of phase 3. The memory specified is the memory allocated to each executor
| Data | Portion | Pre-processing executors | Pre-processing memory (GB) | Pre-processing time | Clustering executors | Clustering memory (GB) | Clustering time |
|---|---|---|---|---|---|---|---|
| Phase 1 | 20 % | 64 | 2 | 11 min 53 sec | 64 | 6 | 1 h 10 min |
| Phase 1 | 40 % | 64 | 2 | 19 min 09 sec | 64 | 12 | 2 h 19 min |
| Phase 1 | 60 % | 64 | 2 | 26 min 34 sec | 64 | 17 | 3 h 33 min |
| Phase 1 | 100 % | 64 | 2 | 40 min 48 sec | 40 | 24 | 14 h 44 min |
| Phase 3 | 100 % | 64 | 2 | 3 h 54 min 24 sec | 40 | 24 | 27 h 46 min |
Fig. 3 Schematic overview of VariantSpark. The image shows the flow from the input VCF file to the machine learning library and on to the visualisation. It highlights the differences between the Hadoop and Spark implementations for converting data in VCF format to a data structure readable by Mahout and MLlib, respectively
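In the pipeline sketched above, the clustering step is MLlib's distributed k-means; the following tiny, single-machine Lloyd's-algorithm sketch (with invented names and toy data, not the paper's code) only illustrates how vectors of allele counts end up as cluster assignments:

```python
# Toy stand-in for the k-means clustering step: Lloyd's algorithm.
# Each point is a vector of per-variant allele counts for one individual.
# All names and data below are hypothetical.

def kmeans(points, centers, iters=10):
    """Assign each point to its nearest centre, recompute centres as means,
    repeat, then return the final cluster label for each point."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            dists = [sum((x - c) ** 2 for x, c in zip(p, centre)) for centre in centers]
            clusters[dists.index(min(dists))].append(p)
        centers = [
            [sum(xs) / len(cl) for xs in zip(*cl)] if cl else centre
            for cl, centre in zip(clusters, centers)
        ]
    labels = []
    for p in points:
        dists = [sum((x - c) ** 2 for x, c in zip(p, centre)) for centre in centers]
        labels.append(dists.index(min(dists)))
    return labels

# Two obvious groups of allele-count vectors:
points = [[0, 0, 1], [0, 1, 0], [2, 2, 1], [2, 1, 2]]
kmeans(points, centers=[points[0], points[2]])  # -> [0, 0, 1, 1]
```

The distributed MLlib version follows the same logic, but the assignment and mean-recomputation steps run in parallel across executors, which is what the executor counts in the scaling table refer to.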