The GCTx format and cmap{Py, R, M, J} packages: resources for optimized storage and integrated traversal of annotated dense matrices.

Oana M Enache¹, David L Lahr¹, Ted E Natoli¹, Lev Litichevskiy¹, David Wadden², Corey Flynn³, Joshua Gould¹, Jacob K Asiedu¹, Rajiv Narayan¹, Aravind Subramanian¹.

Abstract

MOTIVATION: Facilitated by technological improvements, pharmacologic and genetic perturbational datasets have grown in recent years to include millions of experiments. Sharing and publicly distributing these diverse data creates many opportunities for discovery, but the unprecedented size of the data generated and the complexity of the associated metadata have also created data storage and integration challenges.
RESULTS: We present the GCTx file format and a suite of open-source packages for the efficient storage, serialization and analysis of dense two-dimensional matrices. We have extensively used the format in the Connectivity Map to assemble and share massive datasets currently comprising 1.3 million experiments, and we anticipate that the format's generalizability, paired with code libraries that we provide, will lower barriers for integrated cross-assay analysis and algorithm development.
AVAILABILITY AND IMPLEMENTATION: Software packages (available in Python, R, Matlab and Java) are freely available at https://github.com/cmap. Additional instructions, tutorials and datasets are available at clue.io/code.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2019 PMID: 30203022 PMCID: PMC6477971 DOI: 10.1093/bioinformatics/bty784

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Computational analysis of datasets generated by treating diverse cell types with pharmacological and genetic perturbagens has proven useful for functional relationship discovery (Hughes ; Lamb ; Weinstein ). To enable such discovery, the NIH Common Fund’s Library of Network-Based Cellular Signatures (LINCS) has brought together several high-dimensional assays to systematically characterize the effects of perturbagens on human cells (Keenan, 2018). Additionally, the scale of the perturbation-based datasets produced has multiplied in recent years to encompass millions of samples. Such large-scale, holistic representations of perturbation provide an incredible opportunity for systems-based research on health and disease, but also introduce data management challenges unique to perturbation-based functional compendia. In particular, two challenges must be addressed to facilitate analysis of these initially heterogeneous data: standardized formatting of assay output and ease of access to arbitrary ranges of output datasets. While the more mature field of DNA sequencing has largely converged on a standard set of file formats and data types, raw forms of perturbational data are more diverse and can range from flow cytometry readouts for mRNA (Subramanian ) to mass spectrometry traces for protein phosphorylation (Abelin ) to quantitative data extracted from microscopy images for morphological profiling (Bray ). Each of these diverse data types has associated metadata, and additional metadata annotations, such as literature pathways (Liberzon ) and drug targets and mechanisms of action (Corsello ), are also key to interpreting analysis results.
Although the LINCS consortium thoughtfully considered the challenges in integrating heterogeneous data in its establishment of standards for metadata prior to public data deposition (Vempati et al., 2014), adoption of a standard for data deposition in itself does not necessarily ease access for computation during exploratory data analysis. To address this, we present the GCTx file format along with open-source software packages that we have developed. GCTx relies on robust HDF5 technology to make large, dense matrices of data and metadata annotations easy to store and explore. Importantly, the format’s utility is not just theoretical: to date, we have aggregated, analyzed, and publicly distributed millions of profiles’ worth of data from LINCS and other large compendia (Supplementary Material S1).

2 The GCTx format and code libraries

Text-based formats like GCT (Supplementary Material S2) have long been used in gene expression analysis (Eisen ). However, the dramatically increasing size of datasets in more recent years has made storage as plain text impractical (Fig. 1B). In addition, without a governing data model, text formats cannot efficiently represent relationships between rows, columns and metadata. Furthermore, text formats make it cumbersome to retrace the provenance of an element in the data matrix as it passes through multiple stages of a data processing pipeline, which is an important requirement for reproducibility.

Fig. 1. (A) Schematic of a GCTx file. (B) Parse times are faster for GCTx files compared with text-based files; more details in Supplementary Material S3

The format we developed to address these issues is a schema built on HDF5 that we term GCTx. HDF5 supports a platform-independent file format capable of unlimited size, rapid read/write capabilities and selective parsing of a subset of a dataset without loading the entire file into memory first (The HDF Group, 1997–2018). In addition, HDF5 has a vibrant developer community that supports multiple programming languages and operating systems. GCTx adopts a lightweight, shallow hierarchy optimized for representing matrices and associated annotations while retaining several of the key benefits of HDF5’s infrastructure. This shallow hierarchy of component nodes decreases random access time when compared to deeply nested alternatives and also enables simple and efficient extension in either of two dimensions (Fig. 1A): the data matrix (by appending to the /MATRIX node) or the dataset (by incrementing the numerically indexed group name to ‘/1’, ‘/2’, etc. and then appending data and metadata).
More broadly, the standardization of data and metadata representation that GCTx provides frees developers from having to repeatedly customize their analytical code and from having to think about how to optimally use or represent data in HDF5. This is important because the HDF5 format only provides a generic data model; standardizing the representation of data and metadata consequently encourages reproducibility of data analyses.
To facilitate adoption of the GCTx format with existing bioinformatics and data science tools, we also developed four open-source software packages in Python (‘cmapPy’), R (‘cmapR’), Matlab (‘cmapM’) and Java (‘cmapJ’), which simplify input, output, conversion and analysis of GCTx files by representing these file inputs as native data structures readily compatible with powerful data analysis tools.
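The data model described above (a dense matrix whose rows and columns each carry annotations, stored in numerically indexed dataset groups) can be illustrated with a small in-memory sketch. This is not the cmapPy API and it does not touch HDF5; the names `AnnotatedMatrix`, `Container`, `subset` and `append_dataset` are hypothetical, and the assumption that the first dataset sits at index 0 is inferred from the ‘/1’, ‘/2’ naming in the text. The sketch only shows the two ideas the format relies on: subsetting the matrix while keeping annotations aligned, and extending a container by incrementing a numeric key.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AnnotatedMatrix:
    """Hypothetical in-memory stand-in for one annotated dense matrix."""
    row_ids: List[str]
    col_ids: List[str]
    values: List[List[float]]  # values[i][j] belongs to row_ids[i], col_ids[j]
    row_meta: Dict[str, dict] = field(default_factory=dict)
    col_meta: Dict[str, dict] = field(default_factory=dict)

    def subset(self, rids: List[str], cids: List[str]) -> "AnnotatedMatrix":
        """Return only the requested rows and columns, in the requested order,
        carrying along the annotations that belong to them."""
        r_idx = [self.row_ids.index(r) for r in rids]
        c_idx = [self.col_ids.index(c) for c in cids]
        return AnnotatedMatrix(
            row_ids=list(rids),
            col_ids=list(cids),
            values=[[self.values[i][j] for j in c_idx] for i in r_idx],
            row_meta={r: self.row_meta[r] for r in rids if r in self.row_meta},
            col_meta={c: self.col_meta[c] for c in cids if c in self.col_meta},
        )


@dataclass
class Container:
    """Shallow container: each dataset lives under a numeric key ('0', '1', ...).
    Starting at '0' is an assumption made for this sketch."""
    datasets: Dict[str, AnnotatedMatrix] = field(default_factory=dict)

    def append_dataset(self, ds: AnnotatedMatrix) -> str:
        key = str(len(self.datasets))  # next unused numeric index
        self.datasets[key] = ds
        return key


# Toy data (made up for illustration): 3 features x 2 samples.
full = AnnotatedMatrix(
    row_ids=["gene_a", "gene_b", "gene_c"],
    col_ids=["s1", "s2"],
    values=[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    row_meta={"gene_a": {"symbol": "A"}, "gene_c": {"symbol": "C"}},
    col_meta={"s1": {"dose": "10 uM"}, "s2": {"dose": "1 uM"}},
)

# Pull out two rows and one column; metadata follows the selection.
small = full.subset(["gene_c", "gene_a"], ["s2"])

# Store both the full matrix and the derived subset side by side.
store = Container()
first_key = store.append_dataset(full)
second_key = store.append_dataset(small)
```

In an HDF5-backed file the same subset request would be served by reading only the selected slices from disk, which is the property the text attributes to HDF5; the list-based version here loads everything and is meant only to make the alignment of data and metadata concrete.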

3 Conclusions

We present GCTx, an HDF5-based file format designed for efficient storage and rapid access of dense data matrices paired with metadata annotations. The format’s ability to store multiple distinct datasets and annotations enables a single file to contain an entire workflow’s worth of content, which aids reproducibility in analyses and collaboration. Importantly, the format’s utility is not just theoretical: to date, we have compiled ∼1.3 million samples and made them freely available in the GCTx format (Supplementary Material S3). HDF5-based formats have previously been used in genomics (Millard ; Sommer ); however, these prior formats differ from GCTx in that most of them use deep hierarchies to store a variety of experimental design and modeling data alongside assay output. While this can be a useful structure, our primary needs deviated sufficiently from the features of other HDF5-based formats to merit the development of our own format. Additionally, although relational databases and cloud-based object stores are also capable of storing and efficiently serving massive datasets, we have found that, even as cloud-based object stores become more commonplace, users still request downloadable file-based representations of data for use on their personal computers or traditional login servers. Although this may change over time, we consequently decided that a file-based format would best address the majority of current user needs. To ease adoption of GCTx, we also present four open-source packages that make GCTx straightforward to incorporate with existing tools.
