SUMMARY: Efficient processing of large-scale genomic datasets has recently become possible due to the application of 'big data' technologies in bioinformatics pipelines. We present SeQuiLa, a distributed, ANSI SQL-compliant solution for fast querying and processing of genomic intervals that is available as an Apache Spark package. The proposed range join strategy is significantly (∼22×) faster than the default Apache Spark implementation and outperforms other state-of-the-art tools for genomic interval processing. AVAILABILITY AND IMPLEMENTATION: The project is available at http://biodatageeks.org/sequila/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
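To make the notion of a range join on genomic intervals concrete, the following is a minimal single-machine sketch in Python. It is not SeQuiLa's implementation and does not reflect its internal strategy. The function names `overlaps` and `range_join`, the tuple layout `(chromosome, start, end)`, and the use of closed (inclusive) coordinates are all assumptions made for illustration. The sketch only shows the overlap condition such a join evaluates, plus one simple way to avoid comparing every pair: sort one side by start position and use binary search to skip intervals that start too late to overlap.

```python
from bisect import bisect_right


def overlaps(a_start, a_end, b_start, b_end):
    """True if closed intervals [a_start, a_end] and [b_start, b_end] share a position."""
    return a_start <= b_end and b_start <= a_end


def range_join(reads, targets):
    """Return (read, target) pairs on the same chromosome whose intervals overlap.

    Both inputs are lists of (chromosome, start, end) tuples with inclusive
    coordinates (an assumption of this sketch).
    """
    # Group targets by chromosome and sort each group by start position.
    by_chrom = {}
    for target in targets:
        by_chrom.setdefault(target[0], []).append(target)
    for group in by_chrom.values():
        group.sort(key=lambda t: t[1])
    starts = {chrom: [t[1] for t in group] for chrom, group in by_chrom.items()}

    pairs = []
    for read in reads:
        chrom, r_start, r_end = read
        group = by_chrom.get(chrom, [])
        # Targets starting after r_end cannot overlap, so binary search
        # finds the cut-off and everything beyond it is skipped.
        cutoff = bisect_right(starts.get(chrom, []), r_end)
        for target in group[:cutoff]:
            if overlaps(r_start, r_end, target[1], target[2]):
                pairs.append((read, target))
    return pairs


if __name__ == "__main__":
    reads = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 60)]
    targets = [("chr1", 150, 350), ("chr1", 500, 600), ("chr2", 10, 49)]
    for read, target in range_join(reads, targets):
        print(read, target)
```

The binary search prunes only on the start coordinate, so in the worst case (many long targets) this sketch still scans most of a chromosome's targets per read. Dedicated interval structures reduce that cost further, at the price of more code than is useful here.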