Simone Pallotta, Silvia Cascianelli, Marco Masseroli.
Abstract
BACKGROUND: Heterogeneous omics data, increasingly collected through high-throughput technologies, can contain hidden answers to very important and still unsolved biomedical questions. Their integration and processing are crucial mostly for tertiary analysis of Next Generation Sequencing data, although suitable big data strategies still address mainly primary and secondary analysis. Hence, there is a pressing need for algorithms specifically designed to explore big omics datasets, capable of ensuring scalability and interoperability, possibly relying on high-performance computing infrastructures.
Keywords: Data scalability; Distribution transparency; Heterogeneous omics big data; Tertiary data analysis
Year: 2022 PMID: 35392801 PMCID: PMC8991469 DOI: 10.1186/s12859-022-04648-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Genometric RGMQL functions with their extension over already existing R functions and mapping to corresponding GMQL operators
| R package of origin | RGMQL function | GMQL operator | Brief description |
|---|---|---|---|
| dplyr | arrange() | ORDER | It orders samples or sample regions based on metadata or region attributes |
| dplyr | collect() | MATERIALIZE | It persistently saves the content of any dataset obtained after query completion |
| dplyr | filter() | SELECT | It extracts a subset of samples or sample regions using metadata or region predicates |
| dplyr | group_by() | GROUP | It groups samples or sample regions based on metadata or region attributes with the same value |
| dplyr | select() | PROJECT | It selects the metadata or region attributes to be kept and can update or create metadata or region attributes |
| dplyr | setdiff() | DIFFERENCE | It discards the regions of the first dataset intersecting regions of the second one |
| dplyr | union() | UNION | It puts together samples of two datasets keeping as region attributes those of the first one |
| base | merge() | JOIN | It returns a dataset by joining the regions of two datasets based on distance or region predicates |
| stats | aggregate() | MERGE | It combines all the samples of a dataset into a single sample |
| – | cover() | COVER | It collapses the samples of a dataset into a single sample based on specified rules |
| – | execute() | – | It launches the query execution |
| – | extend() | EXTEND | It generates new metadata attributes for each sample from aggregations applied to region attributes |
| – | map() | MAP | It computes aggregated values from overlapping regions of two datasets |
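The DIFFERENCE and MAP descriptions in the table can be made concrete with a small sketch. The code below is plain Python, not RGMQL or GMQL code: the `Region` type and the `overlaps`, `map_count` and `difference` helpers are hypothetical names introduced only for illustration, regions are assumed to be half-open intervals, and counting is used as the only aggregate.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """A genomic region; coordinates are assumed half-open: [start, end)."""
    chrom: str
    start: int
    end: int


def overlaps(a: Region, b: Region) -> bool:
    """True if the two regions share at least one position on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def map_count(reference, experiment):
    """MAP-like aggregation: for each reference region, count the overlapping experiment regions."""
    return [(ref, sum(overlaps(ref, exp) for exp in experiment)) for ref in reference]


def difference(first, second):
    """DIFFERENCE-like filter: keep the regions of `first` that intersect no region of `second`."""
    return [r for r in first if not any(overlaps(r, s) for s in second)]


# Toy data, invented for this example only.
reference = [Region("chr1", 100, 200), Region("chr1", 500, 600)]
experiment = [
    Region("chr1", 150, 160),
    Region("chr1", 190, 250),
    Region("chr2", 100, 200),
]

counts = map_count(reference, experiment)
kept = difference(reference, experiment)
```

The quadratic scan keeps the sketch short; it is not meant to reflect how a scalable engine evaluates these operators on large datasets.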
Fig. 1 Representation of the RGMQL package within the R/Bioconductor environment. The REST Web services and Sequential execution modules handle the alternative RGMQL processing environments; they depend on the httr and rJava R packages, respectively
Additional RGMQL functions to handle initialization, remote data exploration, processing and result conversions
| Function type | RGMQL function | Brief description | Input dataset | Output dataset | Remote processing required |
|---|---|---|---|---|---|
| FUNCTIONS TO HANDLE, READ AND ANALYZE LOCAL AND REMOTE DATASETS, PROVIDING ALSO USEFUL CONVERSIONS | delete_dataset() | It deletes a private dataset from the remote repository | Remote dataset | – | YES |
| | download_dataset() | It downloads a private dataset from the remote repository to a local path | Remote dataset | Local dataset | YES |
| | download_as_GRangesList() | It downloads a private dataset into the R environment as a GRangesList | Remote dataset | GRangesList | YES |
| | export_gmql() | It creates a GDM-like dataset from a GRangesList | GRangesList | Local dataset | NO |
| | filter_and_extract() | It filters based on metadata predicates and generates a new GRanges with a chosen list of region attributes. It works if samples have their region coordinates (chr, ranges, strand) in the same order | Local dataset / GRangesList | GRanges | NO |
| | import_gmql() | It creates a GRangesList from a GDM-like dataset | Local dataset | GRangesList | NO |
| | read_gmql() | It reads a GMQLDataset from a dataset (with a valid format) on disk, or from the remote repository in case of remote processing | Local/Remote dataset | GMQLDataset | YES, if is_local = FALSE |
| | read_GRangesList() | It reads a GMQLDataset from a GRangesList | GRangesList | GMQLDataset | NO |
| | sample_metadata() | It retrieves the metadata of a specific sample in a dataset | Remote dataset | – | YES |
| | sample_region() | It retrieves the region data of a specific sample in a dataset | Remote dataset | – | YES |
| | semijoin() | It supports the filter method by defining semijoin conditions on metadata | – | – | NO |
| | show_datasets_list() | It shows all GMQL datasets in the remote repository, both public and privately stored by the user | – | – | YES |
| | show_all_metadata() | It shows all metadata of a given GMQL dataset, either locally or in the remote repository | – | – | NO |
| | show_samples_list() | It shows all samples of a GMQL dataset on the remote repository | – | – | YES |
| | show_schema() | It shows the region attribute schema of a GMQL dataset on the remote repository | – | – | YES |
| | take() | It saves as a GRangesList any dataset resulting from local processing. If invoked after collect(), the dataset is also materialized in the local file system | GMQLDataset | GRangesList | NO, only for local processing |
| | upload_dataset() | It uploads a dataset (GDM or not), and a corresponding GMQL dataset is created on the remote repository | Local dataset | Remote dataset | YES |
| FUNCTIONS TO HANDLE GMQL SERVER AND MONITOR REMOTE JOBS, IF NEEDED | init_gmql() | It initializes and runs the GMQL server to execute any processing, and also performs a login to the GMQL REST services suite, if needed | – | – | NO |
| | login_gmql() | It logs in to the GMQL REST services suite as a registered user, specifying username and password, or as a guest | – | – | YES |
| | logout_gmql() | It logs out from the GMQL REST services suite | – | – | YES |
| | register_gmql() | It registers the user to the GMQL REST services suite | – | – | YES |
| | remote_processing() | It enables or disables remote processing | – | – | YES |
| | show_jobs_log() | It shows the log of a specific job | – | – | YES |
| | trace_job() | It traces a specific job | – | – | YES |
| | show_job_list() | It shows all jobs (running, succeeded or failed) invoked by the user on the remote GMQL server | – | – | YES |
| | show_queries_list() | It shows all the GMQL queries saved by the user on the remote repository | – | – | YES |
| | stop_gmql() | It stops the GMQL server processing | – | – | NO |
| | stop_job() | It stops a specific job | – | – | YES |
| FUNCTIONS USING QUERIES IN GMQL SYNTAX | compile_query() | It compiles a GMQL query inserted as a text string | – | – | YES |
| | compile_query_fromfile() | It compiles a GMQL query taken from a file | – | – | YES |
| | run_query() | It runs a GMQL query inserted as a text string | – | – | YES |
| | run_query_fromfile() | It runs a GMQL query taken from a file | – | – | YES |
| | save_query() | It saves into the remote repository a GMQL query inserted as a text string | – | – | YES |
| | save_query_fromfile() | It saves into the remote repository a GMQL query taken from a file | – | – | YES |
For each function, we report whether it requires remote resources and processing, as well as the formats of its input and output data
Fig. 2 Representation of RGMQL functions for data import/export both locally and remotely. A GMQLDataset is created by the read_gmql() function from a local dataset (in GDM or a different tab-delimited format), or from a remote dataset (specifying is_local = FALSE). Any processing is applied to the GMQLDataset objects involved, and the computation and materialization of any result (remotely or locally) is deferred until the collect() and execute() functions are called. A GMQLDataset can also be created by the read_GRangesList() function from a GRangesList. Similarly, a GRangesList can be obtained from a remote dataset through the download_as_GRangesList() function, from a local dataset through the import_gmql() function and, in local processing only, directly from a GMQLDataset through the take() function
Fig. 3 Top 20 genes ranked by the percentage of the 217 patients under analysis in which the gene is mutated
Fig. 4 Top 20 genes ranked by number of mutations per gene length across the 217 patients considered
Fig. 5 Clusters from patient-wise hierarchical clustering on the first two dimensions of the principal component analysis of the data. The fraction of variance explained by each dimension is reported as a percentage in the corresponding axis label
Fig. 6 Mosaic plot of the three clusters that emerged from patient-wise hierarchical clustering, compared with the published clustering results obtained in [48] using the K4 gene signature
Fig. 7 Mosaic plot of the three clusters that emerged from patient-wise hierarchical clustering, compared with the patient overall survival status annotations
Fig. 8 Plot of the transcription factor accumulation for chromosome 21 and of the 186 HOT zones (in red) identified according to the found accumulation threshold of 5.6 (red line)
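As a rough illustration of how zones lying above an accumulation threshold could be read off a per-position signal, the following Python sketch extracts the maximal runs of consecutive values that exceed a threshold. It is only a sketch under stated assumptions and not the authors' procedure: the `zones_above_threshold` function is a hypothetical helper, the signal values are invented toy data, and how the threshold of 5.6 was determined is not reproduced here.

```python
def zones_above_threshold(accumulation, threshold):
    """Return (start, end) index pairs, end exclusive, of maximal runs with value > threshold."""
    zones = []
    start = None  # index where the current run began, or None if outside a run
    for i, value in enumerate(accumulation):
        if value > threshold and start is None:
            start = i  # a new run begins here
        elif value <= threshold and start is not None:
            zones.append((start, i))  # the run ended at the previous position
            start = None
    if start is not None:
        zones.append((start, len(accumulation)))  # the run reaches the end of the signal
    return zones


# Toy accumulation values per bin, invented for this example only.
signal = [1.0, 6.2, 7.5, 3.0, 5.6, 9.1, 8.0]
zones = zones_above_threshold(signal, 5.6)
```

A value exactly equal to the threshold is treated here as below it; whether the original analysis used a strict or non-strict comparison is not stated in the caption.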