| Literature DB >> 31703553 |
Luca Nanni, Pietro Pinoli, Arif Canakoglu, Stefano Ceri.
Abstract
BACKGROUND: With the growth of available sequenced datasets, analysis of heterogeneous processed data can answer increasingly relevant biological and clinical questions. Scientists are challenged in performing efficient and reproducible data extraction and analysis pipelines over heterogeneously processed datasets. Available software packages are suitable for analyzing experimental files from such datasets one by one, but do not scale to thousands of experiments. Moreover, they lack proper support for metadata manipulation.Entities:
Keywords: Data scalability; Distribution transparency; Genomic data; Python; Tertiary data analysis
Year: 2019 PMID: 31703553 PMCID: PMC6842186 DOI: 10.1186/s12859-019-3159-9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Schematic representation of the software components of PyGMQL. In the front-end, the GMQLDataset is a data structure associated with a query, referring directly to the DAG expressing the query operations. The GDataframe stores the query result and enables in-memory manipulation of the data. The front-end also provides a module for loading and storing data, and a RemoteManager module, used for message interchange between the package and an external GMQL service. The back-end interacts with the front-end through a Manager module, which maps the operations specified in Python to the GMQL operators implemented in Spark
Mapping between PyGMQL methods and GMQL operators or utilities
| PyGMQL function | Description | GMQL operator |
|---|---|---|
| load_from_path | UTIL, loads a dataset from local repository | SELECT |
| load_from_remote | UTIL, loads a dataset from remote repository | SELECT |
| load_from_file | UTIL, loads a bed file from local repository | |
| select, reg_select, meta_select | UNOP, filters samples using region and/or metadata predicates | SELECT |
| project, reg_project, meta_project | UNOP, projects (in/out) attributes of regions or metadata. Creates new attributes by means of expressions | PROJECT |
| extend | UNOP, creates a new metadata attribute by aggregation of region data | EXTEND |
| cover, normal_cover, flat_cover, summit_cover, histogram_cover | UNOP, collapses regions from several samples into regions of a single sample, based on min/max accumulation indexes | COVER |
| order | UNOP, orders the samples of a dataset based on regions and/or metadata attributes | ORDER |
| merge | UNOP, merges all the samples of a dataset into a single one | MERGE |
| group, meta_group, reg_group | UNOP, groups regions and/or metadata with the same values | GROUP |
| join | BINOP, joins the regions of two datasets based on distance-based predicates | JOIN |
| map | BINOP, computes aggregate values from overlapping regions of two datasets | MAP |
| union | BINOP, builds the union of regions and metadata of two datasets | UNION |
| difference | BINOP, keeps the regions of a dataset not intersecting with regions of another one | DIFFERENCE |
| materialize | UTIL, triggers the query execution for the specified dataset and stores the result after query completion | MATERIALIZE |
| head | UTIL, shows the first lines of a dataset | |
For every method we provide a concise explanation (UNOP stands for unary operator, BINOP stands for binary operator, and UTIL identifies a utility function)
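The methods above follow a lazy-evaluation model: each call on a GMQLDataset only appends an operator node to the query DAG, and no data is touched until materialize is invoked. The following pure-Python sketch illustrates this deferred-execution pattern; the class and method bodies are illustrative stand-ins, not the actual PyGMQL implementation.

```python
# Illustrative sketch of deferred query building: method calls record
# operator nodes in a DAG; evaluation happens only on materialize().
# LazyDataset is a hypothetical name, not part of PyGMQL.

class LazyDataset:
    def __init__(self, records, dag=None):
        self._records = records          # source data (stands in for a GDM dataset)
        self._dag = dag or []            # ordered list of (operator, payload) nodes

    def _extend(self, op, payload):
        # Each operation returns a new dataset sharing the same source,
        # with one more node appended to the DAG.
        return LazyDataset(self._records, self._dag + [(op, payload)])

    def select(self, predicate):         # UNOP: filter records
        return self._extend("SELECT", predicate)

    def project(self, fields):           # UNOP: keep a subset of attributes
        return self._extend("PROJECT", fields)

    def materialize(self):
        # Only here is the DAG walked and the data actually processed.
        result = self._records
        for op, payload in self._dag:
            if op == "SELECT":
                result = [r for r in result if payload(r)]
            elif op == "PROJECT":
                result = [{k: r[k] for k in payload} for r in result]
        return result

regions = [
    {"chr": "chr1", "start": 100, "stop": 200, "score": 0.9},
    {"chr": "chr2", "start": 300, "stop": 400, "score": 0.2},
]
query = LazyDataset(regions).select(lambda r: r["score"] > 0.5).project(["chr", "start"])
# No work has happened yet; the DAG simply holds two nodes.
print(query.materialize())   # [{'chr': 'chr1', 'start': 100}]
```

This mirrors how the front-end can ship the whole DAG to the Spark back-end (or to a remote GMQL service) as a single plan, rather than executing operators one at a time.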
Fig. 2 Relationships between GMQLDataset and GDataframe. Data can be imported into a GMQLDataset from a local GDM dataset with the load_from_path function. Using load_from_file, it is possible to load generic BED files, while load_from_remote enables the loading of GDM datasets from an external GMQL repository, accessible through a TCP connection. The user applies operations on the GMQLDataset and triggers the computation of the result with the materialize function. At the end of the computation, the result is stored in-memory in a GDataframe, which can then be manipulated in Python. It is possible to import data directly from Pandas with from_pandas. Finally, it is possible to transform a GDataframe structure back into a GMQLDataset using the to_GMQLDataset function
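The key idea behind the GDataframe is that a materialized result holds two aligned in-memory tables, one for regions and one for metadata, joined by a sample identifier. A pure-Python sketch of that two-table layout (all field and sample names here are hypothetical, chosen only for illustration):

```python
# Illustrative sketch of the GDataframe idea: a query result as two
# aligned tables keyed by sample id. Not the actual PyGMQL structures.

regs = [  # one row per genomic region, tagged with its sample id
    {"sample": "S1", "chr": "chr1", "start": 100, "stop": 200},
    {"sample": "S1", "chr": "chr1", "start": 500, "stop": 650},
    {"sample": "S2", "chr": "chr2", "start": 300, "stop": 400},
]
meta = {  # one entry per sample: metadata attribute -> value
    "S1": {"cell_line": "GM12878", "antibody": "CTCF"},
    "S2": {"cell_line": "HepG2", "antibody": "CTCF"},
}

def regions_of(sample_id):
    """Navigate from a sample to all of its regions."""
    return [r for r in regs if r["sample"] == sample_id]

def samples_with(attribute, value):
    """Navigate from a metadata predicate to the matching sample ids."""
    return [s for s, m in meta.items() if m.get(attribute) == value]

print(len(regions_of("S1")))               # 2
print(samples_with("cell_line", "HepG2"))  # ['S2']
```

Keeping regions and metadata as separate but linked tables is what makes both directions of navigation cheap, and it is why the round trip to and from Pandas described in the caption is natural: each table maps to one dataframe.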
Fig. 3 Deployment modes and executor options of the library. When the library is in remote mode, it interfaces with an external GMQL service hosting a GMQL repository, accessible by the Python program and deployable on several file systems. When the mode is set to local, the library can operate on various file systems, based on the selected master
Fig. 4 Schematic representation of the deployment strategies adopted in the three applications. a Local/Remote system interaction for the analysis of ENCODE histone marks signal on promotorial regions. The gene dataset is stored in the local file system; the ENCODE BroadPeak database is hosted in the GMQL remote repository, deployed on the Hadoop file system with three slaves. b Configuration for the interactive analysis of the GWAS dataset against the whole set of enhancers from ENCODE. The library interacts directly with the YARN cluster, and the data is stored in the Google Cloud File System with a fixed configuration of three slaves, accessed through the Hadoop engine. The gwas.tsv file is downloaded from the web and stored in the file system before executing the query. c Distributed setup for running the TICA query. Three datasets (from ENCODE and GENCODE) are in GDM format and stored in HDFS, and the query runs on Amazon Web Services with a variable number of slave nodes, for evaluating the scalability of the system
Sizes of inputs and outputs for three different cell lines, and execution times (in minutes) for the TICA query over four cluster configurations
| | GM12878 | HepG2 | K562 |
|---|---|---|---|
| Input samples | 164 | 224 | 347 |
| Distinct TFs | 116 | 192 | 268 |
| Input regions | 3,003,121 | 4,384,181 | 6,101,933 |
| Output samples | 13,454 | 36,330 | 71,612 |
| Output regions | 109,858,355 | 213,499,617 | 381,255,507 |
| Output size (MB) | 3,122 | 6,064 | 10,921 |
| 1 node e. t. ∗ | 26.73 | 73.05 | 246.85 |
| 3 nodes e. t. ∗ | 10.40 | 26.28 | 91.27 |
| 5 nodes e. t. ∗ | 7.21 | 16.67 | 59.12 |
| 10 nodes e. t. ∗ | 4.75 | 9.67 | 32.92 |
Fig. 5 Execution time for the TICA query on three different cell lines, with four different cluster configurations
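The execution times in the table above can be turned directly into speedup and parallel-efficiency figures. The following snippet uses exactly the minutes reported in the table (no new measurements):

```python
# Parallel speedup and efficiency computed from the execution times
# reported in the table (minutes, TICA query per cell line).
times = {  # cell line -> {number of nodes: execution time in minutes}
    "GM12878": {1: 26.73, 3: 10.40, 5: 7.21, 10: 4.75},
    "HepG2":   {1: 73.05, 3: 26.28, 5: 16.67, 10: 9.67},
    "K562":    {1: 246.85, 3: 91.27, 5: 59.12, 10: 32.92},
}

for cell, t in times.items():
    base = t[1]                      # single-node baseline
    for n in (3, 5, 10):
        speedup = base / t[n]        # how many times faster than 1 node
        efficiency = speedup / n     # fraction of ideal linear scaling
        print(f"{cell}: {n} nodes -> speedup {speedup:.2f}x, efficiency {efficiency:.2f}")
```

The larger workloads scale better: on 10 nodes, HepG2 and K562 both reach roughly a 7.5x speedup (efficiency about 0.75), while the smaller GM12878 query reaches about 5.6x, consistent with the trend shown in Fig. 5.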