Processing of big heterogeneous genomic datasets for tertiary analysis of Next Generation Sequencing data.

Marco Masseroli, Arif Canakoglu, Pietro Pinoli, Abdulrahman Kaitoua, Andrea Gulino, Olha Horlova, Luca Nanni, Anna Bernasconi, Stefano Perna, Eirini Stamoulakatou, Stefano Ceri.

Abstract

MOTIVATION: We previously proposed a paradigm shift in genomic data management, based on the Genomic Data Model (GDM) for mediating existing data formats and on the GenoMetric Query Language (GMQL) for supporting, at a high level of abstraction, data extraction and the most common data-driven computations required by tertiary data analysis of Next Generation Sequencing datasets. Here, we present a new GMQL-based system with enhanced accessibility, portability, scalability and performance.
RESULTS: The new system has a well-designed modular architecture featuring: (i) an intermediate representation supporting many different implementations (including Spark, Flink and SciDB); (ii) a high-level technology-independent repository abstraction, supporting different repository technologies (e.g., local file system, Hadoop File System or database); (iii) several system interfaces, including a user-friendly Web-based interface, a Web Service interface, and a programmatic interface for the Python language. Biological use case examples, using public ENCODE, Roadmap Epigenomics and TCGA datasets, demonstrate the relevance of our work.
AVAILABILITY AND IMPLEMENTATION: The GMQL system is freely available for non-commercial use as an open source project at: http://www.bioinformatics.deib.polimi.it/GMQLsystem/.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
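
Illustrative sketch (not part of the published abstract): points (i) and (ii) under RESULTS describe a query layer decoupled from both the execution engine and the storage technology. The Python fragment below shows that general pattern only. Every name in it (Repository, InMemoryRepository, Engine, LocalEngine, run_query) is hypothetical and is not taken from the GMQL code base; the in-memory classes merely stand in for real backends such as Spark or a Hadoop file system.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple

# A genomic region as (chromosome, start, stop). Hypothetical, simplified type.
Region = Tuple[str, int, int]


class Repository(ABC):
    """Technology-independent storage interface (illustrative only)."""

    @abstractmethod
    def save(self, name: str, regions: List[Region]) -> None:
        ...

    @abstractmethod
    def load(self, name: str) -> List[Region]:
        ...


class InMemoryRepository(Repository):
    """Stand-in for a concrete backend (local files, HDFS, a database)."""

    def __init__(self) -> None:
        self._datasets: Dict[str, List[Region]] = {}

    def save(self, name: str, regions: List[Region]) -> None:
        self._datasets[name] = list(regions)

    def load(self, name: str) -> List[Region]:
        return list(self._datasets[name])


class Engine(ABC):
    """Execution interface; one subclass per computing technology."""

    @abstractmethod
    def select_chromosome(self, regions: List[Region], chrom: str) -> List[Region]:
        ...


class LocalEngine(Engine):
    """Plain-Python stand-in for a distributed implementation."""

    def select_chromosome(self, regions: List[Region], chrom: str) -> List[Region]:
        return [r for r in regions if r[0] == chrom]


def run_query(repo: Repository, engine: Engine,
              dataset: str, chrom: str, output: str) -> List[Region]:
    """Load a dataset, filter it with the given engine, store the result."""
    result = engine.select_chromosome(repo.load(dataset), chrom)
    repo.save(output, result)
    return result


repo = InMemoryRepository()
repo.save("peaks", [("chr1", 100, 200), ("chr2", 50, 80), ("chr1", 300, 450)])
print(run_query(repo, LocalEngine(), "peaks", "chr1", "peaks_chr1"))
```

Because run_query depends only on the two abstract interfaces, a different storage or execution class could be substituted without changing the query code, which is the property the abstract attributes to the modular design.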

Year:  2019        PMID: 30101316     DOI: 10.1093/bioinformatics/bty688

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


  11 in total

1.  Proposal of Smith-Waterman algorithm on FPGA to accelerate the forward and backtracking steps.

Authors:  Fabio F de Oliveira; Leonardo A Dias; Marcelo A C Fernandes
Journal:  PLoS One       Date:  2022-06-30       Impact factor: 3.752

2.  GeMI: interactive interface for transformer-based Genomic Metadata Integration.

Authors:  Giuseppe Serna Garcia; Michele Leone; Anna Bernasconi; Mark J Carman
Journal:  Database (Oxford)       Date:  2022-06-03       Impact factor: 4.462

3.  Accurate and highly interpretable prediction of gene expression from histone modifications.

Authors:  Fabrizio Frasca; Matteo Matteucci; Michele Leone; Marco J Morelli; Marco Masseroli
Journal:  BMC Bioinformatics       Date:  2022-04-26       Impact factor: 3.307

4.  PyGMQL: scalable data extraction and analysis for heterogeneous genomic datasets.

Authors:  Luca Nanni; Pietro Pinoli; Arif Canakoglu; Stefano Ceri
Journal:  BMC Bioinformatics       Date:  2019-11-08       Impact factor: 3.169

5.  Computational analysis of fused co-expression networks for the identification of candidate cancer gene biomarkers.

Authors:  Sara Pidò; Gaia Ceddia; Marco Masseroli
Journal:  NPJ Syst Biol Appl       Date:  2021-03-12

6.  RGMQL: scalable and interoperable computing of heterogeneous omics big data and metadata in R/Bioconductor.

Authors:  Simone Pallotta; Silvia Cascianelli; Marco Masseroli
Journal:  BMC Bioinformatics       Date:  2022-04-07       Impact factor: 3.169

7.  GenoSurf: metadata driven semantic search system for integrated genomic datasets.

Authors:  Arif Canakoglu; Anna Bernasconi; Andrea Colombo; Marco Masseroli; Stefano Ceri
Journal:  Database (Oxford)       Date:  2019-01-01       Impact factor: 3.451

8.  Application of BERT to Enable Gene Classification Based on Clinical Evidence.

Authors:  Yuhan Su; Hongxin Xiang; Haotian Xie; Yong Yu; Shiyan Dong; Zhaogang Yang; Na Zhao
Journal:  Biomed Res Int       Date:  2020-10-07       Impact factor: 3.411

9.  A Large-Scale and Serverless Computational Approach for Improving Quality of NGS Data Supporting Big Multi-Omics Data Analyses.

Authors:  Dariusz Mrozek; Krzysztof Stępień; Piotr Grzesik; Bożena Małysiak-Mrozek
Journal:  Front Genet       Date:  2021-07-13       Impact factor: 4.599

10.  BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data.

Authors:  Jinxiang Chen; Fuyi Li; Miao Wang; Junlong Li; Tatiana T Marquez-Lago; André Leier; Jerico Revote; Shuqin Li; Quanzhong Liu; Jiangning Song
Journal:  Front Big Data       Date:  2022-01-18