Nezar Abdennur (1), Leonid A Mirny (1,2)
1. Institute for Medical Engineering and Science, Cambridge, MA 02139, USA.
2. Department of Physics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
Abstract
MOTIVATION: Most existing coverage-based (epi)genomic datasets are one-dimensional, but newer technologies probing interactions (physical, genetic, etc.) produce quantitative maps with two-dimensional genomic coordinate systems. Storage and computational costs mount sharply with data resolution when such maps are stored in dense form. Hence, there is a pressing need to develop data storage strategies that handle the full range of useful resolutions in multidimensional genomic datasets by taking advantage of their sparse nature, while supporting efficient compression and providing fast random access to facilitate development of scalable algorithms for data analysis.

RESULTS: We developed a file format called cooler, based on a sparse data model, that can support genomically labeled matrices at any resolution. It has the flexibility to accommodate various descriptions of the data axes (genomic coordinates, tracks and bin annotations), resolutions, data density patterns and metadata. Cooler is based on HDF5 and is supported by a Python library and command line suite to create, read, inspect and manipulate cooler data collections. The format has been adopted as a standard by the NIH 4D Nucleome Consortium.

AVAILABILITY AND IMPLEMENTATION: Cooler is cross-platform, BSD-licensed and can be installed from the Python Package Index or the bioconda repository. The source code is maintained on GitHub at https://github.com/mirnylab/cooler.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
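To make the idea of a sparse data model for a genomically labeled matrix concrete, the following is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the cooler library's API or its on-disk schema: the toy bins, the pixel records and the helper names `bin_range` and `fetch` are all hypothetical. It assumes a symmetric matrix whose nonzero entries are stored once (upper triangle) as records sorted by their first bin, so that a per-chromosome query can locate its rows by binary search instead of scanning a dense array.

```python
from bisect import bisect_left

# Hypothetical toy axis: genomic bins at 1000-bp resolution on two chromosomes.
# Each bin is (chromosome, start, end); its list index serves as the bin ID.
bins = [
    ("chr1", 0, 1000),
    ("chr1", 1000, 2000),
    ("chr1", 2000, 3000),
    ("chr2", 0, 1000),
    ("chr2", 1000, 2000),
]

# Hypothetical sparse records (bin1_id, bin2_id, count): only nonzero entries
# of the upper triangle are kept, sorted by bin1_id and then bin2_id.
pixels = [
    (0, 0, 12),
    (0, 1, 5),
    (1, 1, 9),
    (1, 4, 2),
    (2, 2, 7),
    (3, 4, 4),
]


def bin_range(chrom):
    """Return the half-open range [lo, hi) of bin IDs belonging to chrom."""
    ids = [i for i, (c, _, _) in enumerate(bins) if c == chrom]
    return ids[0], ids[-1] + 1


def fetch(chrom):
    """Rebuild the dense symmetric submatrix for a single chromosome."""
    lo, hi = bin_range(chrom)
    # Because records are sorted by bin1_id, binary search finds the slice of
    # records whose first bin lies inside the requested range.
    keys = [p[0] for p in pixels]
    start = bisect_left(keys, lo)
    stop = bisect_left(keys, hi)
    n = hi - lo
    dense = [[0] * n for _ in range(n)]
    for b1, b2, count in pixels[start:stop]:
        if lo <= b2 < hi:  # skip records that pair with another chromosome
            dense[b1 - lo][b2 - lo] = count
            dense[b2 - lo][b1 - lo] = count  # mirror into the lower triangle
    return dense


if __name__ == "__main__":
    for row in fetch("chr1"):
        print(row)
```

In this toy case 6 records stand in for a 5 x 5 dense matrix of 25 cells. The gap grows with resolution, because the number of dense cells scales with the square of the number of bins while the number of nonzero entries typically does not.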