Jayaram Kancherla1,2,3, Yifan Yang3, Hyeyun Chae4, Hector Corrada Bravo1,2,3.
Abstract
MOTIVATION: Genomic data repositories such as The Cancer Genome Atlas, the Encyclopedia of DNA Elements, and Bioconductor's AnnotationHub and ExperimentHub provide public access to large amounts of genomic data as flat files. Researchers often download a subset of data files from these repositories to perform exploratory data analysis. We developed Epiviz File Server, a Python library that implements an in situ data query system for local or remotely hosted indexed genomic files, supporting not only visualization but also data transformation. The File Server library decouples data retrieval and transformation from specific visualization and analysis tools, and provides an abstract interface to define computations independent of the location, format or structure of the file. We demonstrate the File Server in two use cases: (i) integration with Galaxy workflows and (ii) using Epiviz to create a custom genome browser from the Epigenome Roadmap dataset.
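The central idea of in situ querying is that only the bytes covering a requested genomic range are read from an indexed flat file, with no import into a database. As a minimal illustration (not the actual EFS API), the sketch below stores per-bin float values in a fixed-width binary file, keeps a small chromosome-to-offset index, and answers range queries with a single seek and read:

```python
import os
import struct
import tempfile

RECORD = struct.Struct("<f")  # one float32 per fixed-size genomic bin

def write_track(path, chrom_values):
    """Write per-bin signal values; return an index: chrom -> (byte offset, n_bins)."""
    index = {}
    with open(path, "wb") as fh:
        for chrom, values in chrom_values.items():
            index[chrom] = (fh.tell(), len(values))
            for v in values:
                fh.write(RECORD.pack(v))
    return index

def query(path, index, chrom, start_bin, end_bin):
    """Read only the bytes covering [start_bin, end_bin) -- an in situ range query."""
    offset, n = index[chrom]
    end_bin = min(end_bin, n)
    with open(path, "rb") as fh:
        fh.seek(offset + start_bin * RECORD.size)
        data = fh.read((end_bin - start_bin) * RECORD.size)
    return [RECORD.unpack_from(data, i * RECORD.size)[0]
            for i in range(end_bin - start_bin)]

fd, path = tempfile.mkstemp(suffix=".bin")
os.close(fd)
idx = write_track(path, {"chr1": [0.0, 1.0, 2.0, 3.0], "chr2": [5.0, 6.0]})
vals = query(path, idx, "chr1", 1, 3)  # touches only 8 bytes of the file
```

Real indexed formats (BigWig, tabix-indexed BED, etc.) use the same seek-based pattern, with richer indexes that map genomic intervals to compressed file blocks.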
Year: 2020 PMID: 32618995 PMCID: PMC7695125 DOI: 10.1093/bioinformatics/btaa591
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. A high-level overview of the EFS Library. The EFS library supports directly querying indexed genomic files. The measurements module describes data files and provides a programmatic interface to parse, query and define transformations over files using any NumPy(-like) function. Transformations are computed lazily at query time using Dask, and the cache layer ensures that only bytes not already accessed are requested. Datasets and their transformations can be accessed through a REST API, allowing developers to build interactive visualization and exploration tools
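The lazy-evaluation pattern the figure describes can be sketched in a few lines of pure Python (a stand-in for Dask's deferred task graph, not the EFS implementation): building a transformation creates a graph node, and no file is read or computed until the result is actually requested.

```python
class Lazy:
    """A deferred computation node: nothing runs until .compute() is called."""
    def __init__(self, fn, *deps):
        self.fn, self.deps = fn, deps

    def compute(self):
        # Recursively evaluate dependencies, then apply this node's function.
        return self.fn(*(d.compute() if isinstance(d, Lazy) else d
                         for d in self.deps))

    def __sub__(self, other):
        # Element-wise subtraction, deferred like a NumPy-style ufunc.
        return Lazy(lambda a, b: [x - y for x, y in zip(a, b)], self, other)

def read_track(values):
    # Stand-in for querying an indexed file; the read itself is also deferred.
    return Lazy(lambda: list(values))

tissue_a = read_track([4.0, 2.0, 7.0])
tissue_b = read_track([1.0, 2.0, 3.0])
diff = tissue_a - tissue_b   # no data read or computed yet
result = diff.compute()      # evaluation happens only at query time
```

Dask generalizes this idea with chunked arrays and parallel scheduling, so a transformation over large files only materializes the chunks a query touches.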
Comparison of features across in situ file querying and transformation tools
| Features | IGV | RawVis | HiGlass | UCSC | EFS |
|---|---|---|---|---|---|
| Transformations | Supports add, subtract, multiply and divide-by operations. | Limited: computes data aggregations at query time for each dimension in a visualization. | Supports client-side divide-by operations. | Does not support transformations across files. | Provides an interface to apply transformations across files using NumPy-like definitions. |
| Coupling of visualization and data workflows | Visualization and data workflows are coupled together. | Converts user-interaction queries into data queries, hence tightly couples visualization with data workflows. | Separates the user interface and the server. | Visualization and data workflows are coupled together in the browser. | Decouples data from visualization workflows, allowing tool developers to build on the REST API. |
|  | Yes | No | Yes | Yes | Yes |
| API access | No | No | HiGlass-server provides an API to get tileset information and query by tiles. | REST API supports querying files from a TrackHub. | REST API supports querying both files and transformations. |
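A REST API for range queries, as the table's last row describes, typically encodes the genomic region and the requested measurement as query parameters. The endpoint name, parameter names and coordinates below are hypothetical, chosen only to illustrate the shape of such a request; consult the EFS documentation for the actual routes:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint and parameters for illustration only.
base = "http://localhost:8000/getData"
params = {
    "measurement": "H3K36me3_brain",  # dataset or transformation id
    "seqName": "chr6",                # chromosome
    "start": 151690496,               # region start (illustrative coordinates)
    "end": 152103274,                 # region end
}
url = base + "?" + urlencode(params)

# A client can round-trip the parameters back out of the URL.
decoded = parse_qs(urlsplit(url).query)
```

Because transformations are addressable the same way as raw files, a client asking for a derived track issues the same kind of request and the server computes the result on the fly.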
Comparison of features between a traditional database server and EFS
| Feature | Traditional DB (MySQL) | EFS |
|---|---|---|
| Time to query | Time to query is longer since the data needs to be imported into the database. Often requires custom indexing to improve query time. | EFS performs in situ queries over indexed files, so no import step is required. |
| Cache | Provides cache support for repeated query processing. | Implements cache for repeated query processing. |
| SQL | Provides an easy-to-use query language to search tables. | Does not provide a query language (but provides a REST API). |
| Schema | Schema can be changed on the fly. | Changes in schema result in regenerating the index file. |
| Interval Overlap Operations | Does not provide interval overlap operations to apply transformations across datasets. | EFS supports overlap operations to apply transformations across files. |
| Transformations | SQL supports basic mathematical operations (average, min, max etc.) and often needs middleware to support any other transformations over query results. | EFS supports any transformation over the query results. |
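The interval overlap operations contrasted in the table above can be sketched as a join on genomic overlap followed by a transformation over the overlapping pairs. This naive O(n·m) version is for illustration only; production tools use interval trees or sorted sweeps:

```python
def overlap_transform(track_a, track_b):
    """Join two interval lists [(start, end, value), ...] on genomic overlap
    and apply a transformation (here: signal difference) to each overlapping pair."""
    out = []
    for s1, e1, v1 in track_a:
        for s2, e2, v2 in track_b:
            if s1 < e2 and s2 < e1:  # half-open interval overlap test
                out.append((max(s1, s2), min(e1, e2), v1 - v2))
    return out

track_a = [(0, 100, 5.0), (150, 300, 2.0)]
track_b = [(50, 200, 1.0)]
result = overlap_transform(track_a, track_b)
```

This is the kind of cross-file operation a traditional SQL database cannot express directly without middleware, since rows must be matched on interval overlap rather than key equality.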
Fig. 2. Overview of Epiviz integration with Galaxy. Users can include the Epiviz Galaxy Tool in a workflow to choose files and define annotations to generate an Epiviz configuration file. A Galaxy Interactive Environment (IE) uses the Epiviz configuration to spin up an Epiviz Docker instance. Once the Docker image loads, Galaxy embeds the user interface from the instance within its own user interface, as shown on the right
Fig. 3. Interactive visualization of data from the NIH Roadmap Epigenomics project. This figure demonstrates the EFS library querying and computing transformations over data available from the NIH Roadmap Epigenomics project. We chose the ESR1 gene and its neighboring region for this example. (Top to bottom) The first track is an hg19 genome annotation track. The line track in the middle visualizes the H3K36me3 binding signal from ChIP-seq experiments across three different brain tissues. This track queries the data directly from the files. The last line track is a transformation over files to compute the difference in histone binding across different tissues
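The difference track in Fig. 3 is, in essence, a NumPy operation applied across per-bin signal arrays queried from separate files. A hedged sketch with made-up signal values (the tissue names and numbers are illustrative, not Roadmap data):

```python
import numpy as np

# Binned H3K36me3 signal over the same region for two hypothetical brain tissues.
signals = {
    "tissue_1": np.array([2.0, 4.0, 1.0, 0.5]),
    "tissue_2": np.array([1.0, 1.0, 1.0, 1.0]),
}

# A NumPy(-like) transformation across tracks, analogous to the last track in Fig. 3.
diff = np.subtract(signals["tissue_1"], signals["tissue_2"])
```

In EFS this definition is evaluated lazily, so the subtraction runs only over the bins covered by the genomic range currently visible in the browser.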
Impact of cache on processing requests
| Implementation | Average Latency (in ms; ± SD) | Average Requests (per s; ± SD) |
|---|---|---|
| EFS—no Cache (remote file) | 1152 (± 201.32) | 8.2 (± 0.44) |
| EFS—cache (remote file) | 68.41 (± 83.55) | 179.4 (± 3.2) |
| EFS (local file) | 36.05 (± 8.74) | 284.1 (± 41.31) |
| PyBigWig (remote file) | 121.864 (± 40.67) | — |
| PyBigWig (local file) | 0.52 (± 0.28) | — |
Note: This table displays the average latency and the requests processed per second measured when benchmarking the File Server API with and without a cache implementation. With the cache, the library was able to process a significantly higher number of genomic range queries, resulting in higher throughput and lower latency. The extra overhead in EFS stems from using intermediate data representations, so that transformations can be performed across files, and from using JSON as a portable output format so that multiple clients can query the system.
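The cache layer the benchmark exercises can be sketched as memoization keyed on the requested byte range: repeated range requests for the same file region are served from memory instead of re-fetching. This toy version (not the EFS implementation) counts underlying fetches so the cache's effect is observable:

```python
from functools import lru_cache

FETCHES = []  # records every simulated remote fetch, so cache hits are visible

@lru_cache(maxsize=None)
def fetch_bytes(url, offset, length):
    """Stand-in for an HTTP byte-range request; repeated ranges hit the cache."""
    FETCHES.append((url, offset, length))
    return b"\x00" * length  # placeholder payload

fetch_bytes("http://example.org/track.bw", 0, 4096)
fetch_bytes("http://example.org/track.bw", 0, 4096)     # served from cache
fetch_bytes("http://example.org/track.bw", 4096, 4096)  # new range: real fetch
```

For remote files this is where the table's ~17x latency gap comes from: a cache hit avoids a network round trip entirely.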
Fig. 4. Impact of computing transformations at run time. We measure the overhead in computing transformations lazily (on-the-fly) versus querying a file that stores the pre-computed result. We measure the average latency and requests processed per second by the system across five different runs; the figure shows the mean and standard deviation for these metrics. The results indicate that the latency of the system increases as we increase the number of files involved in the computation, hence lowering the number of requests processed per second
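The scaling the figure reports follows from a simple accounting argument: a lazy transformation must read every input file on each query, while a pre-computed result is a single read. A toy sketch that counts file reads under both strategies (assumed behavior, not a benchmark of EFS itself):

```python
READS = {"count": 0}  # global counter of simulated file reads

def read_file(values):
    READS["count"] += 1
    return values

def lazy_diff(files):
    # On-the-fly: every query reads all input files before transforming.
    tracks = [read_file(f) for f in files]
    return [a - b for a, b in zip(*tracks)]

file_a, file_b = [3.0, 5.0], [1.0, 1.0]

# Pre-compute once, then reset the counter to compare per-query costs.
precomputed = [a - b for a, b in zip(read_file(file_a), read_file(file_b))]
READS["count"] = 0

lazy_diff([file_a, file_b])      # 2 reads for this query
lazy_diff([file_a, file_b])      # 2 more for the next query
queries_lazy = READS["count"]    # lazy: reads scale with files x queries

READS["count"] = 0
read_file(precomputed)           # pre-computed: 1 read per query
read_file(precomputed)
queries_pre = READS["count"]
```

So with k input files, each lazy query costs roughly k range reads plus the transformation itself, which matches the figure's trend of latency growing with the number of files involved.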