
A novel compression tool for efficient storage of genome resequencing data.

Congmao Wang, Dabing Zhang.

Abstract

With the advent of DNA sequencing technologies, more and more reference genome sequences are available for many organisms. Analyzing sequence variation and understanding its biological importance are becoming a major research aim. However, how to store and process the huge amount of eukaryotic genome data, such as those of the human, mouse and rice, has become a challenge to biologists. Currently available bioinformatics tools used to compress genome sequence data have some limitations, such as the requirement of the reference single nucleotide polymorphisms (SNPs) map and information on deletions and insertions. Here, we present a novel compression tool for storing and analyzing Genome ReSequencing data, named GRS. GRS is able to process the genome sequence data without the use of the reference SNPs and other sequence variation information and automatically rebuild the individual genome sequence data using the reference genome sequence. When its performance was tested on the first Korean personal genome sequence data set, GRS was able to achieve ∼159-fold compression, reducing the size of the data from 2986.8 to 18.8 MB. While being tested against the sequencing data from rice and Arabidopsis thaliana, GRS compressed the 361.0 MB rice genome data to 4.4 MB, and the A. thaliana genome data from 115.1 MB to 6.5 KB. This de novo compression tool is available at http://gmdd.shgmo.org/Computational-Biology/GRS.

Year:  2011        PMID: 21266471      PMCID: PMC3074166          DOI: 10.1093/nar/gkr009

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

The development of new DNA sequencing technologies, such as next-generation sequencing (NGS) and single-molecule sequencing, has enabled genomics and functional genomics research to advance to new levels (1,2). Owing to the dramatic reduction in sequencing cost and the increase in sequencing efficiency, these high-throughput technologies are now routinely applied to the ‘resequencing’ of individual genomes to detect sequence variation between an individual and the reference genome (3). Resequencing individual genomes can facilitate the investigation of the relationship between sequence and phenotypic variations. To date, several personal human genome sequencing data sets have been released (2,4–6). Sequencing of individual human genomes is believed to provide a molecular basis for personalized medicine. Furthermore, more resequencing data are being generated from various organisms. Individual genome resequencing in animals such as mouse and pig, and in plants such as rice, maize and soybean, has proven to be extremely powerful in investigating genome variations, such as single nucleotide polymorphisms (SNPs), deletions, insertions and rearrangements. However, storing and managing the huge amount of sequencing data has become a challenge for biologists. For example, storing one 2009 human reference genome (i.e. the UCSC hg19 assembly) requires up to 905 MB in the tar.gz compression format (7). Thus, a total of 90 500 MB (nearly 88.38 GB) of hard disk space would be required to store tar.gz-compressed data for 100 individuals in a genetic disease study. It is noteworthy that individuals within one species share most of their nucleotide sequence; for example, human genomes are ∼99.9% identical, with DNA sequencing errors of 0.01% (8,9).
Moreover, the electronic transfer of sequencing data is a bottleneck, even though some tools have been developed to compress the files and increase the network bandwidth. Several methods for compressing genomic sequence data have been reported (10–13). However, these compression tools cannot process genome sequence data without a reference SNPs map or information about sequence variations, such as insertions or deletions. Here, we present a general Genome ReSequencing (GRS) tool for storing and managing individual genome resequencing data without relying on known reference SNPs maps or other information on sequence variation. We demonstrate the power of GRS in processing genome resequencing data, using whole genome sequencing data sets from human and the model plants Arabidopsis and rice.

MATERIALS AND METHODS

Data set

Data sets used to test GRS include KOREF_20090131 and KOREF_20090224, two of the first Korean personal genome sequences (4), and two different versions of the reference genome sequence from Arabidopsis thaliana (TAIR8 and TAIR9) and rice (TIGR5 and TIGR6) (14–16), respectively.

Software availability

GRS is implemented in C and Shell. It will be freely available for non-profit use. The source code and its executable file are available at http://gmdd.shgmo.org/Computational-Biology/GRS.

Architecture of the GRS tool

The main modules in GRS connect the input chromosome file, the intermediate data and the final compressed file. The architecture of GRS is shown in Figure 1. When an individual genome sequence needs to be compressed, GRS first evaluates the varied sequence percentage (δ) of each chromosome against the reference chromosome. If δ ≤ 0.03, it filters out the longest identical nucleotide sequence, extracts the different sequence, and precodes the resulting different sequence file to reduce its size. GRS then uses the Huffman coding strategy to compress the reduced different sequence file to a bz2-type file and generates the command file needed to decompress it. If 0.03 < δ ≤ 0.1, GRS cuts the chromosome into n pieces and calculates δ for each piece to find the position with minimal ∑δ, then compresses each piece with the same strategy as that used for δ ≤ 0.03. Individual genome sequence data compressed with GRS can later be decoded easily with the GRS decoding tool.
Figure 1.

Architecture of the GRS Tool. The main modules in GRS connect the input chromosome file, the intermediate data and the final compressed file. Details of the processing procedure are described in the text.
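The branching on δ described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `choose_strategy` and the returned labels are hypothetical, the thresholds are taken in the same unit as the 0.03 and 0.1 quoted in the text, and the handling of δ above 0.1 follows the Discussion's remark that highly varied data can be left to a general-purpose compressor.

```python
def choose_strategy(delta, low=0.03, high=0.1):
    """Pick a compression route from the varied sequence percentage.

    Hypothetical helper: the labels are illustrative, not GRS options.
    """
    if delta < 0:
        raise ValueError("delta must be non-negative")
    if delta <= low:
        # Small difference: diff the whole chromosome against the reference.
        return "whole-chromosome"
    if delta <= high:
        # Moderate difference: cut into pieces and treat each piece separately.
        return "piecewise"
    # Assumption: beyond the upper threshold, use a general-purpose compressor.
    return "general-purpose"


for value in (0.008, 0.05, 0.4):
    print(value, choose_strategy(value))
```

Treating exactly 0.03 as the whole-chromosome case is a choice made here, because the text assigns that boundary value to both branches.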


Evaluation of individual genome sequence variation

A higher percentage of nucleotide sequence variation from the reference genome leads to a longer GRS running time and a larger compressed file for an individual genome. Before an individual genome sequence is compressed, GRS checks whether the user has supplied the correct reference chromosome data, so the percentage of varied nucleotide sequence in the individual genome must be quantified. We calculated δ with the formula

$$\delta = \frac{\sum_{i} d_i}{t}$$

where $d_i$ is the number of different nucleotides of type $i$ between the individual and the reference; $i$ is the type of nucleotide, one of A, T, C, G, N, a, t, c, g and n; and $t$ is the total number of DNA bases in the reference genome sequence data.
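As a rough illustration of this calculation, the sketch below computes δ from per-nucleotide counts. It rests on an assumption that the text does not settle: the "number of different" nucleotides of each type is taken to be the absolute difference between the count of that symbol in the individual and in the reference. The function name `varied_fraction` is hypothetical.

```python
from collections import Counter

# The ten symbols listed in the text.
SYMBOLS = "ATCGNatcgn"


def varied_fraction(reference, individual):
    """Return delta as a fraction of the reference length.

    Assumption: the per-type difference is the absolute difference in
    symbol counts, which is cheap to compute but ignores positions.
    """
    if not reference:
        raise ValueError("reference sequence is empty")
    ref_counts = Counter(reference)
    ind_counts = Counter(individual)
    differing = sum(abs(ref_counts[s] - ind_counts[s]) for s in SYMBOLS)
    return differing / len(reference)


# One T replaced by an A: the A count rises by 1 and the T count falls by 1.
print(varied_fraction("AACCGGTT", "AACCGGTA"))  # 0.25
```

A count-based measure like this needs only one pass over each file, which fits its use as a quick check before the more expensive comparison.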

Recording the longest common local nucleotide sequence and the changed sequence

It has been reported that finding and recording the varied nucleotide sequence between two sequences is equivalent to finding their longest common local sequence (17). Using a matrix graph, the longest common sequence can be extracted effectively. Taking the two sequences ‘gaNGCTA’ and ‘gNGTNA’ as an example, their longest common sequence is ‘gNGTA’. Each nucleotide of ‘gaNGCTA’ on the x-axis is compared with the whole sequence of ‘gNGTNA’ on the y-axis, and each common nucleotide in the y-axis direction is marked with a red circle; after the base-by-base comparison, the longest common sequence is marked in the whole matrix (Supplementary Figure S1). GRS finds the minimal changes between two genome sequences using a modified UNIX diff program (18).
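The matrix comparison in this example corresponds to the textbook dynamic-programming solution of the longest common subsequence problem. The sketch below is that textbook version, not the modified diff program GRS uses; the quadratic table would be far too large for whole chromosomes, so it is meant only to make the small example concrete. The function name is hypothetical.

```python
def longest_common_subsequence(x, y):
    """Return one longest common subsequence of x and y (O(len(x)*len(y)))."""
    rows, cols = len(x), len(y)
    # table[i][j] holds the LCS length of x[:i] and y[:j].
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to recover the characters.
    common = []
    i, j = rows, cols
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            common.append(x[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(common))


print(longest_common_subsequence("gaNGCTA", "gNGTNA"))  # gNGTA
```

Everything outside the common subsequence (here ‘a’ and ‘C’ in the first sequence and the second ‘N’ in the other) is what has to be recorded as changes.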

RESULTS

Huffman encoding for varied nucleotide sequence information and individual genome sequence rebuilding

Figure 2a shows the raw information on the sequence differences between the individual and the reference sequences, as generated by the modified UNIX diff program. When processing the varied sequence information, GRS removes the ‘>’, ‘<’ and the ‘\n’ between adjacent nucleotides (Figure 2b). Then ‘a’ (add) is converted to ‘i’ and ‘c’ (change) to ‘h’. In addition, the base information below the ‘d’ (deletion) can be removed, since the deleted sequence can be recovered from the sequence positions, such as N5 and N6 (Figure 2a and b). The ‘—’ and the highlighted bases can also be removed, because these sequences can be recovered by using the sequence at N9 and N10 (Figure 2a and b).
Figure 2.

Processing changes file of DNA base, genome position and recover language. (a) Raw changes between two sequences generated based on the modified UNIX diff program. N1 to N12 indicate the nucleotide sequence position ranging from N1 to N12; ‘a’ is the insertion of nucleotide(s); ‘d’ is the deletion of nucleotide(s) and ‘c’ is the changed nucleotide(s). In addition, symbol ‘,’ between N1 and N2 means positions start from N1 to N2. Symbol ‘>’, ‘<’ and ‘—’ are the keywords when the whole individual genome sequence is rebuilt on basis of the reference genome sequence. (b) Changes file with redundant information deleted. (c) Changes file generated based on the subtracted number, which is more readable to the computer.

Next, GRS changes ‘,’ to ‘\t’ and adds ‘\t’ to each side of ‘i’, ‘d’ and ‘h’ to make the rebuilding language easier for the computer to parse. If there are two numbers on one side of ‘i’, ‘d’ or ‘h’, the second nucleotide position is recorded as the difference between the second position and the first. Therefore, on each side of ‘i’, ‘d’ and ‘h’, the numbers N5, N7, N9 and N11 are replaced by their differences from the corresponding nucleotide positions N1, N3, N5 and N7 to reduce the file size (Figure 2b and c). Eventually, the individual genome sequence information is recorded in the format shown in Figure 2c using Huffman coding (19). To encode the processed individual sequence data more effectively, each nucleotide and its relevant number are recorded with the same binary value, since they can be decoded uniquely with the help of ‘i’, ‘d’ and ‘h’. Table 1 shows an example of the encoding strategy using the varied sequence information of A. thaliana chromosome 1 with TAIR8 as the reference and TAIR9 as the individual genome; symbols with larger counts receive shorter encoding values.
The resulting bit stream is then written out as characters; for instance, the bits ‘01000001’ correspond to the ASCII character ‘A’.
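The rewriting of the position lines can be illustrated on command lines in the normal output format of UNIX diff (for example `5,7c8,10`). The sketch below is an assumption-laden illustration: Figure 2 is not reproduced here, so the exact output layout is a guess, and the function name `precode_command` is hypothetical. It applies the three rules stated in the text: ‘a’ becomes ‘i’ and ‘c’ becomes ‘h’, commas and the command letter are separated by tabs, and the second number of a range is stored as its offset from the first.

```python
import re

# Normal-format diff command: START[,END] LETTER START[,END]
_COMMAND = re.compile(r"^(\d+)(?:,(\d+))?([acd])(\d+)(?:,(\d+))?$")
_LETTER = {"a": "i", "c": "h", "d": "d"}


def _side(start, end):
    """Render one side of the command; a range keeps only start and offset."""
    if end is None:
        return start
    return f"{start}\t{int(end) - int(start)}"


def precode_command(line):
    """Rewrite one diff command line into the tab-separated form (hypothetical)."""
    match = _COMMAND.match(line.strip())
    if match is None:
        raise ValueError(f"not a diff command line: {line!r}")
    left_start, left_end, letter, right_start, right_end = match.groups()
    return "\t".join(
        [_side(left_start, left_end), _LETTER[letter], _side(right_start, right_end)]
    )


print(repr(precode_command("5,7c8,10")))  # '5\t2\th\t8\t2'
```

Storing offsets instead of absolute end positions keeps the numbers short, which matters when positions run into the hundreds of millions.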
Table 1.

Huffman encoding for DNA base, genome position and recover language

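| DNA base | Relevant number | Counts | Encoding value |
|---|---|---|---|
| A | 0 | 95 | 0110 |
| T | 1 | 155 | 000 |
| C | 2 | 132 | 1110 |
| G | 3 | 105 | 1100 |
| N | 4 | 101 | 1010 |
| a | 5 | 110 | 1111 |
| t | 6 | 98 | 0111 |
| c | 7 | 80 | 0010 |
| g | 8 | 106 | 1101 |
| n | 9 | 83 | 0011 |
| d | | 31 | 101110 |
| h | | 15 | 101111 |
| i | | 54 | 10110 |
| \t | | 210 | 100 |
| \n | | 168 | 010 |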

Each DNA base and its relevant number are encoded with the same binary value based on the Huffman encoding strategy. Shown here is the encoding table for changes file generated for chromosome 1 of the A. thaliana genome using TAIR8 as reference and TAIR9 as the individual genome. Character d means delete sequence, h means change sequence and i means insert sequence.
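To show how such a table arises, the sketch below builds a Huffman code from the symbol counts listed in Table 1 and packs a bit string into bytes. It is a generic textbook Huffman construction, not the GRS source code; the function names are hypothetical, and the individual bit patterns it produces may differ from those in Table 1 even though the code lengths come out the same.

```python
import heapq
from itertools import count

# Symbol counts as listed in Table 1 (A. thaliana chromosome 1 changes file).
TABLE1_COUNTS = {
    "A": 95, "T": 155, "C": 132, "G": 105, "N": 101,
    "a": 110, "t": 98, "c": 80, "g": 106, "n": 83,
    "d": 31, "h": 15, "i": 54, "\t": 210, "\n": 168,
}


def huffman_code(frequencies):
    """Return {symbol: bit string} for a dict of symbol frequencies."""
    tiebreak = count()  # keeps heap entries comparable when weights are equal
    heap = [(freq, next(tiebreak), {sym: ""}) for sym, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        weight0, _, codes0 = heapq.heappop(heap)
        weight1, _, codes1 = heapq.heappop(heap)
        # Prefix the two lightest subtrees with 0 and 1 and merge them.
        merged = {sym: "0" + bits for sym, bits in codes0.items()}
        merged.update({sym: "1" + bits for sym, bits in codes1.items()})
        heapq.heappush(heap, (weight0 + weight1, next(tiebreak), merged))
    return heap[0][2]


def pack_bits(bits):
    """Pack a string of '0'/'1' into bytes, zero-padding the last byte."""
    padded = bits + "0" * (-len(bits) % 8)
    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))


codes = huffman_code(TABLE1_COUNTS)
for symbol in ("\t", "T", "i", "h"):
    print(repr(symbol), len(codes[symbol]))
print(pack_bits("01000001"))  # b'A'
```

With these counts the three most frequent symbols (‘\t’, ‘\n’ and ‘T’) receive 3-bit codes, ‘i’ receives 5 bits, ‘d’ and ‘h’ receive 6 bits and everything else 4 bits, which agrees with the code lengths in Table 1.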


Performance of GRS

The performance of GRS was tested in three cases. With the two Korean genome sequence data sets (KOREF_20090131 and KOREF_20090224) (4), the 2986.8 MB raw file (KOREF_20090224) was reduced to an 18.8 MB compressed file, a ∼159-fold compression rate (Table 2). We also compressed the raw rice genome file from 361.0 MB to 4.4 MB, a compression rate of ∼82-fold (Table 3), and the 115.1 MB A. thaliana genome to 6.5 KB, a nearly 18 133-fold compression (Table 4). The compression and decompression times measured for these three genomes are given in Supplementary Table S1.
Table 2.

Performance of GRS in compressing the KOREF_20090224 human genome using KOREF_20090131 as the reference

| Chromosome number | Varied sequence percentage (%) | Raw file size | Compressed file size | Compression rate |
|---|---|---|---|---|
| 1 | 0.656 929 | 239.7 MB | 1.3 MB | 184.4 |
| 2 | 0.716 863 | 235.6 MB | 1.3 MB | 181.2 |
| 3 | 0.630 572 | 193.4 MB | 987.4 KB | 200.6 |
| 4 | 0.762 314 | 185.5 MB | 1.1 MB | 168.6 |
| 5 | 0.711 956 | 175.4 MB | 964.9 KB | 186.1 |
| 6 | 0.649 071 | 165.7 MB | 884.9 KB | 191.7 |
| 7 | 0.912 855 | 154.0 MB | 1.0 MB | 154.0 |
| 8 | 0.639 359 | 141.8 MB | 746.4 KB | 194.5 |
| 9 | 0.774 539 | 136.0 MB | 844.0 KB | 165.0 |
| 10 | 0.705 819 | 131.3 MB | 750.4 KB | 179.2 |
| 11 | 0.720 238 | 130.4 MB | 738.0 KB | 180.9 |
| 12 | 0.638 779 | 128.3 MB | 685.6 KB | 191.6 |
| 13 | 0.550 377 | 110.7 MB | 508.4 KB | 223.0 |
| 14 | 0.529 220 | 103.1 MB | 473.4 KB | 223.0 |
| 15 | 0.589 095 | 97.3 MB | 484.6 KB | 205.6 |
| 16 | 0.808 032 | 86.1 MB | 554.7 KB | 158.9 |
| 17 | 0.818 430 | 76.4 MB | 494.1 KB | 158.3 |
| 18 | 0.666 472 | 73.8 MB | 399.0 KB | 189.4 |
| 19 | 0.744 553 | 61.9 MB | 390.4 KB | 162.4 |
| 20 | 0.493 781 | 60.5 MB | 276.0 KB | 224.5 |
| 21 | 0.579 505 | 45.5 MB | 221.2 KB | 210.6 |
| 22 | 0.632 448 | 48.2 MB | 256.3 KB | 192.6 |
| M | 0.108 715 | 16.5 KB | 183.0 B | 94 543.8 |
| X | 3.299 049 | 150.2 MB | 3.1 MB | 48.5 |
| Y | 1.768 076 | 56.0 MB | 578.9 KB | 99.1 |
| The whole genome | 0.804 282 | 2986.8 MB | 18.8 MB | 158.9 |

The varied sequence percentage of each chromosome, the sizes of the raw sequence file and the compressed file, and the compression rate are shown.

Table 3.

Performance of GRS in compressing rice genome of TIGR6 using TIGR5 as the reference

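| Chromosome number | Varied sequence percentage (%) | Raw file size (MB) | Compressed file size | Compression rate |
|---|---|---|---|---|
| 1 | 0.757 801 | 42.0 | 1.4 MB | 30.0 |
| 2 | 0.013 898 | 34.8 | 1.4 KB | 25 453.7 |
| 3 | 0.168 381 | 35.3 | 46.6 KB | 775.7 |
| 4 | 0.096 345 | 34.2 | 35.3 KB | 992.1 |
| 5 | 0.069 046 | 29.0 | 6.0 KB | 4949.3 |
| 6 | 0.000 000 | 30.3 | 0 | |
| 7 | 0.027 041 | 28.8 | 4.0 KB | 7372.8 |
| 8 | 0.479 452 | 27.6 | 115.5 KB | 244.7 |
| 9 | 0.000 000 | 22.3 | 0 | |
| 10 | 1.128 503 | 22.4 | 770.1 KB | 29.8 |
| 11 | 0.188 992 | 27.6 | 2.3 MB | 12.0 |
| 12 | 0.000 000 | 26.7 | 0 | |
| The whole genome | 0.244 122 | 361.0 | 4.4 MB | 82.0 |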

The varied sequence percentage of each chromosome, the sizes of the raw sequence file and the compressed file, and the compression rate are shown.

Table 4.

Performance of GRS in compressing A. thaliana genome of TAIR9 using TAIR8 as the reference

| Chromosome number | Varied sequence percentage (%) | Raw file size (MB) | Compressed file size | Compression rate |
|---|---|---|---|---|
| 1 | 0.016 314 | 29.4 | 715.0 B | 43 116.3 |
| 2 | 0.036 145 | 19.0 | 385.0 B | 51 747.9 |
| 3 | 0.046 910 | 22.7 | 2.9 KB | 6709.0 |
| 4 | 0.000 301 | 17.9 | 1.9 KB | 9647.2 |
| 5 | 0.063 888 | 26.1 | 604.0 B | 45 311.0 |
| The whole genome | 0.032 712 | 115.1 | 6.5 KB | 18 132.7 |

The varied sequence percentage of each chromosome, the sizes of the raw sequence file and the compressed file, and the compression rate are shown.
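The whole-genome compression rates quoted in the text can be checked with a few lines of arithmetic. The sketch below assumes binary units (1 MB = 1024 KB = 1024² B), an assumption that reproduces the reported whole-genome figures; the helper name `compression_rate` is hypothetical.

```python
# Assumption: binary units, which reproduce the reported whole-genome rates.
UNIT_BYTES = {"B": 1, "KB": 1024, "MB": 1024 ** 2}


def compression_rate(raw_size, raw_unit, packed_size, packed_unit):
    """Return raw size divided by compressed size, both converted to bytes."""
    raw_bytes = raw_size * UNIT_BYTES[raw_unit]
    packed_bytes = packed_size * UNIT_BYTES[packed_unit]
    return raw_bytes / packed_bytes


print(round(compression_rate(2986.8, "MB", 18.8, "MB"), 1))  # human
print(round(compression_rate(361.0, "MB", 4.4, "MB"), 1))    # rice
print(round(compression_rate(115.1, "MB", 6.5, "KB"), 1))    # A. thaliana
```

The three printed values round to 158.9, 82.0 and 18132.7, matching the whole-genome rows of Tables 2, 3 and 4.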


DISCUSSION

With the advance of DNA sequencing technologies, more and more genome resequencing projects, such as the International HapMap Project and the 1000 Genomes Project, have been initiated (20,21). As a result, compression of the huge amount of genome sequencing data has become an important issue (10,11). Currently available tools have limitations in effectively processing large amounts of genome resequencing data. For example, the tools developed by Brandon et al. (10) and Christley et al. (11) are limited not only by their dependence on a known reference SNPs map, but also by the possible loss of sequence information, such as large structural variations (SVs) including sequence rearrangements and segment duplications. Even though advances in sequencing technologies facilitate the processing of individual genome sequences, such as reassembling genome sequences from sequencing reads on the basis of the reference genome (3,4), the current sequence compression tools are not well suited to this purpose. Moreover, comprehensive reference SNPs maps are unavailable for many organisms, such as rice, A. thaliana and other species, making it hard to compress their genome resequencing data with the available tools. In this study, we show that GRS is a de novo genome compression approach for resequencing data that is applicable to the genome data management of many species. The varied sequence percentage plays a critical role in compressing genome resequencing data. GRS employs a novel approach to deal with resequencing data, especially data sets with higher variation between the reference genome and the individual genome. The key point of GRS is to splice the reference chromosome and the input chromosome into the same intervals, and then calculate the varied sequence percentage δ of each corresponding pair of pieces based on each nucleotide frequency.
Subsequently, concatenating piece i to make the modified reference chromosome and modified input chromosome creates the minimum varied sequence percentage based on the δ value (Figure 3). Then the minimum change file can be compressed using GRS and the chromosome piece with a higher value of varied sequence percentage can be compressed using the general and routine file compression method such as 7-Zip. When the chromosome size is too big or the computer memory capability is limited, it is useful to splice the reference chromosome and the input chromosome data. In this study, GRS grouped each chromosome sequencing data of the Korean genome (4), into 50, 25 or 10 million per piece, respectively. Similar compression capabilities were obtained (i.e. file size is ∼19 MB), demonstrating the flexibility and reliability of GRS.
Figure 3.

Method to resolve the minimum varied sequence percentage between the reference and input chromosomes and assemble the pieces together. In this example, two chromosomes are spliced into nine parts, each with its own δ. Parts 1, 3, 4, 5, 6, 7 and 8 are put together because δ2 and δ9 are higher than the others according to the threshold of varied sequence percentage.
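The piece selection illustrated in Figure 3 can be sketched as follows. This is a simplified illustration under assumptions, not the GRS implementation: both chromosomes are cut at the same fixed coordinates, δ for each pair of pieces is taken as the summed absolute difference in per-symbol counts divided by the reference piece length, and the names `piece_delta` and `partition_pieces` as well as the toy sequences are hypothetical.

```python
from collections import Counter

SYMBOLS = "ATCGNatcgn"


def piece_delta(ref_piece, ind_piece):
    """Count-based delta for one pair of pieces (assumed definition)."""
    if not ref_piece:
        return float("inf")  # nothing in the reference to compare against
    ref_counts, ind_counts = Counter(ref_piece), Counter(ind_piece)
    differing = sum(abs(ref_counts[s] - ind_counts[s]) for s in SYMBOLS)
    return differing / len(ref_piece)


def partition_pieces(reference, individual, piece_len, threshold):
    """Return (low, high): 1-based piece numbers at or below / above threshold."""
    low, high = [], []
    longest = max(len(reference), len(individual))
    for number, start in enumerate(range(0, longest, piece_len), start=1):
        delta = piece_delta(
            reference[start:start + piece_len],
            individual[start:start + piece_len],
        )
        (low if delta <= threshold else high).append(number)
    return low, high


# Toy data: nine pieces of four bases; pieces 2 and 9 differ strongly.
reference = "ACGT" * 9
individual = "".join("TTTT" if n in (2, 9) else "ACGT" for n in range(1, 10))
print(partition_pieces(reference, individual, piece_len=4, threshold=0.1))
# ([1, 3, 4, 5, 6, 7, 8], [2, 9])
```

In this toy case the low-δ pieces would be concatenated and compressed against the reference, while pieces 2 and 9 would be handed to a general-purpose compressor, mirroring the grouping shown in the figure.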


CONCLUSIONS

In this article, we designed and implemented a generic tool, GRS, for de novo compression of genome resequencing data. GRS is simple to use and does not need a reference SNPs map, and can therefore be used for many genomes, especially those without reference SNPs. Case studies using the sequencing data of the human, rice and A. thaliana genomes have demonstrated the good performance of GRS in sequencing data compression.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

National Key Basic Research Developments Program, Ministry of Science and Technology, P. R. China (2009CB941500, 2006CB101700); National ‘863’ High-Tech Project (2006AA10A102); National Natural Science Foundation of China (30725022 and 30600347); Shanghai Leading Academic Discipline Project (B205). Funding for open access charge: Special Funding for Transgenic Organisms (2008ZX08012-002). Conflict of interest statement. None declared.
  18 in total

1.  G-SQZ: compact encoding of genomic sequence and quality data.

Authors:  Waibhav Tembe; James Lowey; Edward Suh
Journal:  Bioinformatics       Date:  2010-07-06       Impact factor: 6.937

2.  Data structures and compression algorithms for genomic sequence data.

Authors:  Marty C Brandon; Douglas C Wallace; Pierre Baldi
Journal:  Bioinformatics       Date:  2009-05-15       Impact factor: 6.937

3.  Human genomes as email attachments.

Authors:  Scott Christley; Yiming Lu; Chen Li; Xiaohui Xie
Journal:  Bioinformatics       Date:  2008-11-07       Impact factor: 6.937

4.  The complete genome of an individual by massively parallel DNA sequencing.

Authors:  David A Wheeler; Maithreyan Srinivasan; Michael Egholm; Yufeng Shen; Lei Chen; Amy McGuire; Wen He; Yi-Ju Chen; Vinod Makhijani; G Thomas Roth; Xavier Gomes; Karrie Tartaro; Faheem Niazi; Cynthia L Turcotte; Gerard P Irzyk; James R Lupski; Craig Chinault; Xing-zhi Song; Yue Liu; Ye Yuan; Lynne Nazareth; Xiang Qin; Donna M Muzny; Marcel Margulies; George M Weinstock; Richard A Gibbs; Jonathan M Rothberg
Journal:  Nature       Date:  2008-04-17       Impact factor: 49.962

5.  A lossless compression algorithm for DNA sequences.

Authors:  Taysir H A Soliman; Tarek F Gharib; Alshaimaa Abo-Alian; M A El Sharkawy
Journal:  Int J Bioinform Res Appl       Date:  2009

Review 6.  Bioinformatics approaches for genomics and post genomics applications of next-generation sequencing.

Authors:  David Stephen Horner; Giulio Pavesi; Tiziana Castrignanò; Paolo D'Onorio De Meo; Sabino Liuni; Michael Sammeth; Ernesto Picardi; Graziano Pesole
Journal:  Brief Bioinform       Date:  2009-10-27       Impact factor: 11.622

7.  Personal genome sequencing: current approaches and challenges.

Authors:  Michael Snyder; Jiang Du; Mark Gerstein
Journal:  Genes Dev       Date:  2010-03-01       Impact factor: 11.361

8.  Single-molecule sequencing of an individual human genome.

Authors:  Dmitry Pushkarev; Norma F Neff; Stephen R Quake
Journal:  Nat Biotechnol       Date:  2009-08-10       Impact factor: 54.908

9.  The diploid genome sequence of an Asian individual.

Authors:  Jun Wang; Wei Wang; Ruiqiang Li; Yingrui Li; Geng Tian; Laurie Goodman; Wei Fan; Junqing Zhang; Jun Li; Juanbin Zhang; Yiran Guo; Binxiao Feng; Heng Li; Yao Lu; Xiaodong Fang; Huiqing Liang; Zhenglin Du; Dong Li; Yiqing Zhao; Yujie Hu; Zhenzhen Yang; Hancheng Zheng; Ines Hellmann; Michael Inouye; John Pool; Xin Yi; Jing Zhao; Jinjie Duan; Yan Zhou; Junjie Qin; Lijia Ma; Guoqing Li; Zhentao Yang; Guojie Zhang; Bin Yang; Chang Yu; Fang Liang; Wenjie Li; Shaochuan Li; Dawei Li; Peixiang Ni; Jue Ruan; Qibin Li; Hongmei Zhu; Dongyuan Liu; Zhike Lu; Ning Li; Guangwu Guo; Jianguo Zhang; Jia Ye; Lin Fang; Qin Hao; Quan Chen; Yu Liang; Yeyang Su; A San; Cuo Ping; Shuang Yang; Fang Chen; Li Li; Ke Zhou; Hongkun Zheng; Yuanyuan Ren; Ling Yang; Yang Gao; Guohua Yang; Zhuo Li; Xiaoli Feng; Karsten Kristiansen; Gane Ka-Shu Wong; Rasmus Nielsen; Richard Durbin; Lars Bolund; Xiuqing Zhang; Songgang Li; Huanming Yang; Jian Wang
Journal:  Nature       Date:  2008-11-06       Impact factor: 49.962

10.  The UCSC Genome Browser database: update 2010.

Authors:  Brooke Rhead; Donna Karolchik; Robert M Kuhn; Angie S Hinrichs; Ann S Zweig; Pauline A Fujita; Mark Diekhans; Kayla E Smith; Kate R Rosenbloom; Brian J Raney; Andy Pohl; Michael Pheasant; Laurence R Meyer; Katrina Learned; Fan Hsu; Jennifer Hillman-Jackson; Rachel A Harte; Belinda Giardine; Timothy R Dreszer; Hiram Clawson; Galt P Barber; David Haussler; W James Kent
Journal:  Nucleic Acids Res       Date:  2009-11-11       Impact factor: 16.971
