Balancing efficient analysis and storage of quantitative genomics data with the D4 format and d4tools.

Hao Hou, Brent Pedersen, Aaron Quinlan.

Abstract

Modern DNA sequencing is used as a readout for diverse assays, with the count of aligned sequences (read depth) representing the quantitative signal for each underlying cellular phenomenon. Existing data formats for quantitative genomics assays are, however, limited in the analysis speeds they enable, the disk space they require, or both. We have developed the dense depth data dump (D4) format and tool suite, with the goal of balancing improved analysis speeds with file size. The D4 format is adaptive in that it profiles a random sample of aligned sequence depth from the input sequence file to determine an optimal encoding that enables fast data access. We demonstrate that the D4 format offers substantial speed improvements over existing formats for random access, aggregation and summarization, while also achieving better or comparable file sizes. This performance enables scalable downstream analyses that would otherwise be difficult.
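The adaptive step described above (profile a random sample of depths, then pick an encoding) can be illustrated with a minimal sketch. The abstract does not specify the encoding itself, so the sketch assumes one: each position is stored in a fixed number of bits k, and any depth too large for k bits is stored separately at a higher cost. The function name `choose_bit_width`, the parameter names and the overflow cost of 80 bits are all hypothetical and chosen for illustration; none of this is the d4tools API.

```python
import random


def choose_bit_width(depths, sample_size=1000, max_bits=16,
                     overflow_cost_bits=80, seed=0):
    """Return the bit width k with the lowest estimated bits per position.

    Assumed cost model (not taken from the abstract): every position costs
    k bits, and each depth >= 2**k costs an extra `overflow_cost_bits`.
    """
    rng = random.Random(seed)
    # Profile a random sample rather than the full depth vector.
    if len(depths) <= sample_size:
        sample = list(depths)
    else:
        sample = rng.sample(depths, sample_size)

    best_k, best_cost = 0, float("inf")
    for k in range(max_bits + 1):
        limit = 1 << k  # depths 0 .. limit-1 fit in k bits
        overflow = sum(1 for d in sample if d >= limit)
        cost = k + overflow_cost_bits * overflow / len(sample)
        if cost < best_cost:  # strict comparison keeps the smallest k on ties
            best_k, best_cost = k, cost
    return best_k


# Mostly ~30x coverage with a few high-depth outliers.
depths = [30] * 990 + [1000] * 10
print(choose_bit_width(depths))
```

Under this assumed cost model, a few extreme depths do not force a wide encoding for every position, which is one plausible way an adaptive format could trade file size against access speed.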

Year:  2021        PMID: 35936573      PMCID: PMC9355464          DOI: 10.1038/s43588-021-00085-0

Source DB:  PubMed          Journal:  Nat Comput Sci        ISSN: 2662-8457

