Modeling and interoperability of heterogeneous genomic big data for integrative processing and querying.

Marco Masseroli, Abdulrahman Kaitoua, Pietro Pinoli, Stefano Ceri.

Abstract

Next Generation Sequencing (NGS) technologies are making a huge amount of (epi)genomic data of multiple types available, and the most important emerging problem is the so-called tertiary analysis, which is concerned with sense making, e.g., discovering how different (epi)genomic regions and their products interact and cooperate with each other. We propose a paradigm shift in tertiary analysis, based on the Genomic Data Model (GDM), a simple data model that links genomic feature data to their associated experimental, biological and clinical metadata. GDM encompasses all the data formats that have been produced for feature extraction from (epi)genomic datasets. We specifically describe the mapping to GDM of the SAM (Sequence Alignment/Map), VCF (Variant Call Format), NARROWPEAK (for called peaks produced by NGS ChIP-seq or DNase-seq methods), and BED (Browser Extensible Data) formats, but GDM also supports all the formats describing experimental datasets (e.g., copy number variations, DNA somatic mutations, or gene expressions) and annotations (e.g., transcription start sites, genes, enhancers or CpG islands). We downloaded and integrated samples of all the above-mentioned data types and formats from multiple sources. GDM describes semantically heterogeneous data homogeneously and lays the groundwork for data interoperability, e.g., as achieved through the GenoMetric Query Language (GMQL), a high-level, declarative query language for genomic big data. The combined use of the data model and the query language allows comprehensive processing of multiple heterogeneous data, and supports the development of domain-specific data-driven computations and bio-molecular knowledge discovery. Copyright © 2016 Elsevier Inc. All rights reserved.
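
The idea of mapping different region formats onto one model that pairs region data with sample metadata can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Region`, `Sample`, `parse_region` and `overlaps`, and the choice of fixed coordinate fields plus a free-form value dictionary, are assumptions made for illustration only. The column orders used for BED and narrowPeak follow the standard tab-separated layouts of those formats.

```python
from dataclasses import dataclass, field

# Hypothetical classes for illustration; not the GDM schema from the paper.
@dataclass
class Region:
    chrom: str
    start: int
    end: int
    strand: str
    values: dict = field(default_factory=dict)  # format-specific columns

@dataclass
class Sample:
    sample_id: str
    regions: list   # region data of the sample
    metadata: dict  # attribute-value pairs describing the sample

# Columns that follow chrom/start/end in each format.
BED_EXTRA = ("name", "score", "strand")
NARROWPEAK_EXTRA = ("name", "score", "strand",
                    "signalValue", "pValue", "qValue", "peak")

def parse_region(line, extra_fields):
    """Map one tab-separated line onto the shared Region layout."""
    cols = line.rstrip("\n").split("\t")
    chrom, start, end = cols[0], int(cols[1]), int(cols[2])
    extras = dict(zip(extra_fields, cols[3:]))  # values are kept as strings
    strand = extras.pop("strand", ".")          # "." means no strand given
    return Region(chrom, start, end, strand, extras)

def overlaps(a, b):
    """True if two half-open regions on the same chromosome intersect."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end

# Two samples from different formats, now comparable through one structure.
bed_sample = Sample(
    "s1",
    [parse_region("chr1\t100\t200\tgeneA\t0\t+", BED_EXTRA)],
    {"source": "annotation"},
)
peak_sample = Sample(
    "s2",
    [parse_region("chr1\t150\t250\tpeak1\t900\t.\t12.5\t-1\t3.2\t40",
                  NARROWPEAK_EXTRA)],
    {"assay": "ChIP-seq"},
)
hit = overlaps(bed_sample.regions[0], peak_sample.regions[0])
```

Once both formats share the same coordinate fields, a generic operation such as `overlaps` works across them without knowing which file format each region came from, while the metadata dictionary stays attached to its sample.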

Keywords:  Data interoperability; Data modeling; Genomic data management; Metadata management; Operations for genomics; Query languages

Year:  2016        PMID: 27637471     DOI: 10.1016/j.ymeth.2016.09.002

Source DB:  PubMed          Journal:  Methods        ISSN: 1046-2023            Impact factor:   3.608


  6 in total

1.  PyGMQL: scalable data extraction and analysis for heterogeneous genomic datasets.

Authors:  Luca Nanni; Pietro Pinoli; Arif Canakoglu; Stefano Ceri
Journal:  BMC Bioinformatics       Date:  2019-11-08       Impact factor: 3.169

2.  Implementing the FAIR Data Principles in precision oncology: review of supporting initiatives.

Authors:  Charles Vesteghem; Rasmus Froberg Brøndum; Mads Sønderkær; Mia Sommer; Alexander Schmitz; Julie Støve Bødker; Karen Dybkær; Tarec Christoffer El-Galaly; Martin Bøgsted
Journal:  Brief Bioinform       Date:  2020-05-21       Impact factor: 11.622

3.  RGMQL: scalable and interoperable computing of heterogeneous omics big data and metadata in R/Bioconductor.

Authors:  Simone Pallotta; Silvia Cascianelli; Marco Masseroli
Journal:  BMC Bioinformatics       Date:  2022-04-07       Impact factor: 3.169

4.  Genomic data integration and user-defined sample-set extraction for population variant analysis.

Authors:  Tommaso Alfonsi; Anna Bernasconi; Arif Canakoglu; Marco Masseroli
Journal:  BMC Bioinformatics       Date:  2022-09-29       Impact factor: 3.307

5.  Combining DNA methylation and RNA sequencing data of cancer for supervised knowledge extraction.

Authors:  Eleonora Cappelli; Giovanni Felici; Emanuel Weitschek
Journal:  BioData Min       Date:  2018-10-25       Impact factor: 2.522

6.  A Large-Scale and Serverless Computational Approach for Improving Quality of NGS Data Supporting Big Multi-Omics Data Analyses.

Authors:  Dariusz Mrozek; Krzysztof Stępień; Piotr Grzesik; Bożena Małysiak-Mrozek
Journal:  Front Genet       Date:  2021-07-13       Impact factor: 4.599
