
GReEn: a tool for efficient compression of genome resequencing data.

Armando J. Pinho, Diogo Pratas, Sara P. Garcia

Abstract

Research in the genomic sciences is confronted with the volume of sequencing and resequencing data increasing at a higher pace than that of data storage and communication resources, shifting a significant part of research budgets from the sequencing component of a project to the computational one. Hence, being able to efficiently store sequencing and resequencing data is a problem of paramount importance. In this article, we describe GReEn (Genome Resequencing Encoding), a tool for compressing genome resequencing data using a reference genome sequence. It overcomes some drawbacks of the recently proposed tool GRS, offering the possibility of compressing sequences that cannot be handled by GRS, faster running times and compression gains of over 100-fold for some sequences. This tool is freely available for non-commercial use at ftp://ftp.ieeta.pt/~ap/codecs/GReEn1.tar.gz.


Year:  2011        PMID: 22139935      PMCID: PMC3287168          DOI: 10.1093/nar/gkr1124

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Inspired by the Biocompress algorithm of Grumbach and Tahi (1), researchers have proposed, over the last two decades, a myriad of algorithms for compressing genomic sequences [(2–16) for a recent review]. The acquired knowledge regarding genome structure that these compression algorithms have been providing, through their representation of genomic sequences using probabilistic models, is likely to surpass in relevance the benefits of the effective storage space reduction provided. One of the most successful compression algorithms specifically designed for genomic sequences is XM, a statistical method proposed by Cao et al. (14), though other approaches may present competitive or even superior results for some classes of genomes (15,17). XM relies on a mixture of experts to provide symbol-by-symbol probability estimates that are fed to an arithmetic encoder. The XM algorithm comprises three types of experts: (i) order-2 Markov models; (ii) order-1 context Markov models, i.e. Markov models that rely on statistical information from a recent past (typically, the 512 previous symbols); and (iii) the copy expert, which considers the next symbol to be part of a region copied from a particular offset. The probability estimates provided by the set of experts are then combined using Bayesian averaging and sent to the arithmetic encoder.
Despite these specialized algorithms, common practice continues to rely on standard, general-purpose data compression methods, e.g. gzip or bzip2. However, this practice may be close to a turning point, as the rate at which genomic data are being produced is clearly overtaking the rate of increase in storage resources and communication bandwidth. The development of high-throughput sequencing technologies that offer dramatically reduced sequencing costs enables possibilities hardly foreseeable a decade ago (18).
Large-scale projects such as the 1000 Genomes Project (http://www.1000genomes.org/) and The Cancer Genome Atlas (http://cancergenome.nih.gov/), as well as prizes that reward cheaper, faster, less error-prone and higher-throughput sequencing methodologies (e.g. http://genomics.xprize.org/), are paving the way to individual genomics and personalized medicine (19). As such, huge volumes of genomic data will be produced in the near future. However, because a very significant part of the genome is shared among individuals of the same species, these data will be mostly redundant. Some ideas for storing and communicating redundant genomic data have already been put forward, based on, for example, single nucleotide polymorphism (SNP) databases (20) or insert and delete operations (21). Recently, Wang et al. (22) proposed a compression tool, GRS, that is able to compress a sequence using another one as reference, without requiring any additional information about those sequences, such as a reference SNP map. The RLZ algorithm proposed by Kuruppu et al. (23) is also able to perform relative Lempel–Ziv compression of DNA sequences, though its current implementation cannot fully handle sequences that have characters outside the {a,c,g,t,n} set. Other approaches address the encoding of the sequence reads output by massively parallel sequencing experiments (24–27), which is also a very important problem. The compression of short reads shares some points with the problem addressed here, though it must cope with additional requirements, such as the efficient representation of base-calling quality information. In this article, we describe GReEn (Genome Resequencing Encoding), a new tool for compressing genome resequencing data using a reference genome sequence. As such, it addresses the same problem as GRS (22), RLZ (23) or XM (14).
However, as will be demonstrated, GReEn generally outperforms GRS in both storage space requirements and running times, even though GRS handles some sequences very effectively, and it overcomes the restricted alphabet support and inferior performance of RLZ and XM. GReEn is a compression tool based on arithmetic coding that handles arbitrary alphabets, and its running time depends only on the size of the sequence being compressed. Moreover, it provides compression gains of over 100-fold for some sequences when compared to GRS, and even larger gains when compared to RLZ. Finally, GReEn handles without restriction sequences that cannot be compressed with GRS because they differ too much from the reference sequence.

MATERIALS AND METHODS

Dataset

We use the same data as in (22), for ease of comparison with GRS: two versions of the first individual Korean genome sequenced, KOREF_20090131 and KOREF_20090224 (28); two versions of the genome of the thale cress Arabidopsis thaliana, TAIR8 and TAIR9 (29,30); and two versions of the rice Oryza sativa genome, TIGR5.0 and TIGR6.0 (31). We also present results for four additional human genome assemblies, namely, the genome of J. Craig Venter referred to as HuRef (32), the Celera alternate assembly referred to as Celera (33), the genome of a Han Chinese individual referred to as YH (34), and the human genome reference assembly build 37.p2, as made available by the National Center for Biotechnology Information and referred to as NCBI37 (35).

Software availability

The codec (encoder/decoder) is implemented in the C programming language and is freely available for non-commercial purposes. It can be downloaded at ftp://ftp.ieeta.pt/~ap/codecs/GReEn1.tar.gz.

The compression method

As with GRS (22), GReEn relies on a reference sequence for compressing the target sequence. The reference sequence is generally only slightly different from the target sequence, although this is not mandatory. In fact, it is possible to use a sequence from a different species as reference, though, as expected, the compression efficiency depends on the degree of similarity between the reference and target sequences. Moreover, in order to recover the target sequence, the decoder needs access to exactly the same reference sequence as that used by the encoder. The codec developed in GReEn is able to handle arbitrary alphabets, although it automatically ignores all lines beginning with the ‘>’ character, as well as all newline characters. We denote by A the set of all different characters, or symbols, found in the target sequence, and by |A| the number of elements in A, i.e. the alphabet size. Each character of the target sequence is encoded by an arithmetic encoder (36). As with any arithmetic encoder, besides the symbol to encode, it is necessary to provide the probability distribution of the symbols. One major advantage of arithmetic coding is its ability to adjust the probabilistic model as the encoding proceeds, in response to the changing probability distribution from one encoded symbol to the next. We denote by θ(c) the relative frequency of character c in the target sequence, and by P(c) the estimated probability of character c when encoding the character at position n in the target sequence. The set of probabilities {P(c), c in A} is passed to the arithmetic coder. Note that, whereas the θ(c) values are fixed for a given target sequence, the P(c) values usually change along the coding process. For a sequence x_1 x_2 ... x_N, with N characters, the arithmetic coder produces a bitstream of approximately -sum_{n=1..N} log2 P(x_n) bits, which demonstrates the importance of providing good probability estimates to the arithmetic coder.
The probability distribution, P(c), can be provided by two different sources: (i) an adaptive model (the copy model) which assumes that the characters of the target sequence are an exact copy of (parts of) the reference sequence; (ii) a static model that relies on the frequencies of the characters in the target sequence, i.e. θ(c). The adaptive model is the main statistical model, as it allows a high compression rate of the target sequence, particularly in areas where the target and reference sequences are highly similar. However, this adaptive, or copy, model will at times not be used (the reasons why will be detailed shortly), and the static model will act as a fallback mechanism, feeding the arithmetic coder with the required probability distribution.

The copy model

The copy model is inspired by the copy expert of the XM DNA compression method (14), relying on a pointer to a position in the reference sequence that has a ‘good chance’ of containing a character identical to the one being encoded. As the encoding of the target sequence proceeds, the pointer associated with the copy model may be repositioned to different locations of the reference sequence. When this repositioning occurs, all parameters of the model are reset. Besides accounting for the number of times, t, that the copy model was used after the previous repositioning, two more counters are maintained: one stores the number of times the model guessed the correct character, including the correct case (uppercase or lowercase), and the other records the number of times the model guessed the character but failed the case (for example, it guessed ‘A’ when the correct character was ‘a’). Figure 1 exemplifies the operation of the copy model. Consider that the most recent repositioning occurred at position 341 587 of the reference sequence, corresponding to position 327 829 of the target sequence (in this example, the reference is ahead of the target, but this may be different in other cases). Assuming the codec is about to compress the character marked with ‘?’, the character predicted by the copy model would be ‘G’ (the one under the ‘Current position’ arrow), with t = 12; according to Figure 1, 11 of those predictions were correct when case is ignored, 5 of them also matching the case. The characters linked by the dashed arrow indicate a prediction error (the predicted character was ‘A’, whereas the correct one was ‘G’).
Figure 1.

The copy model. In this example, the copy model was restarted at position 341 587 of the reference sequence, corresponding to position 327 829 of the target sequence. Since then, it has correctly predicted 5 characters, if the case is considered, and a total of 11 characters if the case is ignored. The dashed arrow indicates a failed prediction. According to this example, the next character to be predicted is ‘G’.


Computing the probabilities

Let us denote by ĉ the character predicted by the copy model (‘G’ in the example of Figure 1) and by č its case-converted counterpart (‘g’ in the same example). If both ĉ and č belong to A (note that characters of the reference sequence that do not appear in the target sequence do not belong to A), the probabilities passed to the arithmetic coder are

P(ĉ) = (n(ĉ) + 1) / (t + 3),
P(č) = (n(č) + 1) / (t + 3),
P(c) = [1 − P(ĉ) − P(č)] θ(c) / Σ_{c′ ∉ {ĉ, č}} θ(c′), for c ∉ {ĉ, č},   (2)

where n(ĉ) and n(č) are the numbers of times, since the last repositioning, that the copy model guessed the character with and without matching the case, respectively. The first two branches of Equation (2) are Laplace probability estimators of the form

P(e_k) = (n(e_k) + 1) / (Σ_{j=1..K} n(e_j) + K),

where the e_k form a set of K collectively exhaustive and mutually exclusive events, and n(e_k) denotes the number of times that event e_k has occurred in the past. In Equation (2) we consider three events, namely {ĉ}, {č} and A \ {ĉ, č}. The third branch of Equation (2) defines how the probability assigned to A \ {ĉ, č} is distributed among its individual characters. This distribution is proportional to the relative frequencies of the characters, θ(c), after discounting the effect of treating ĉ and č differently. If only one of ĉ and č belongs to A, denote that character by c*; the probabilities are then given by

P(c*) = (n(c*) + 1) / (t + 2),
P(c) = [1 − P(c*)] θ(c) / Σ_{c′ ≠ c*} θ(c′), for c ≠ c*,

i.e. only two events are considered, {c*} and A \ {c*}, with the probability of the latter distributed among its characters as before. Finally, if neither ĉ nor č belongs to A, the probabilities communicated to the arithmetic coder are simply the character frequencies of the target sequence, i.e. P(c) = θ(c).

Starting and stopping the copy model

Typically, the codec starts by constructing a hash table with the occurrences and corresponding positions in the reference sequence of all k-mers of a given size (the default size is k = 11, but it can be changed using a command line option). Figure 2 shows an example where k = 8 and k-mers ‘CTNANGTC’ and ‘AAAGTTGG’ have been mapped by the hashing function into the same index (index 4 529 821). As usual in hashing schemes, disambiguation is achieved by direct comparison of the k-mers that originated the index, which have to be stored in the data structure in order to be compared. Using the hash table, it is easy to find in the reference sequence the characters that come right after all occurrences of a given k-mer.
Figure 2.

Data organized in a hash table.

Before encoding a new character from the target sequence, the performance of the copy model, if one is in use, is checked. If the number of prediction failures exceeds m, where m is a parameter that indicates the maximum number of prediction failures allowed, the copy model is stopped. The default value of m is zero, but this may be changed through a command line option. Following this performance check, if the copy model is not in use, an attempt is made to restart it before compressing the character. This is accomplished by looking for the positions in the reference sequence where the k-mer composed of the k most recently encoded characters occurs. If more than one position is found, the one closest to the current encoding position is chosen. If none is found, the current character is encoded using the static model and a new attempt to start a copy model is made after advancing one position in the target sequence.

Special case for equal size sequences

When the reference and target sequences have the same size, the codec assumes that both sequences are aligned. Therefore, whenever the copy model is restarted, it is forced to use the current encoding position as reference. This avoids constructing the hash table, hence increasing the codec speed, and generally produces better results. However, this mode of operation may be overridden by a command line option, as it may lead to poor performance for same-sized sequences that are not aligned.

RESULTS

We compare the performance of the method proposed here, GReEn, to that of GRS (22), the most recently proposed approach for compressing genome resequencing data that handles sequences drawn from arbitrary alphabets. We also include results for the RLZ algorithm (23), though only for some sequences, due to its restriction to sequences drawn from the alphabet {a, c, g, t, n}. Tables 1–3 present both the number of bytes produced and the time taken by the respective methods for compressing the sequences used in (22). The results for both the GRS and RLZ methods were obtained using the software publicly provided by the authors. All experimental results were obtained using an Intel Core i7-2620M laptop computer at 2.7 GHz with 8 GB of memory, running Ubuntu 11.04. The best results are highlighted in boldface.
Table 1.

Arabidopsis thaliana genome: compression of TAIR9 using TAIR8 as reference

Chr   | Alphabet | Size        | GRS bytes | GRS secs | GReEn bytes | GReEn secs
1     | 11       | 30 427 671  | 715       | 7        | 1551        | 13
2     | 11       | 19 698 289  | 385       | 4        | 937         | 8
3     | 10       | 23 459 830  | 2989      | 6        | 1097        | 9
4     | 7        | 18 585 056  | 1951      | 5        | 2356        | 7
5     | 5        | 26 975 502  | 604       | 6        | 618         | 11
Total |          | 119 146 348 | 6644      | 28       | 6559        | 48

Size of the compressed target sequences (in bytes) and corresponding compression time (in seconds). The original sequence alphabets have been preserved. The ‘Alphabet’ column indicates the size of the alphabet of the target sequence.

Table 3.

Homo sapiens genome: compression of KOREF_20090224 using KOREF_20090131 as reference

Chr   | Size          | GRS bytes  | GRS secs | GReEn bytes | GReEn secs
1     | 247 249 719   | 1 336 626  | 222      | 1 225 767   | 32
2     | 242 951 149   | 1 354 059  | 230      | 1 272 105   | 31
3     | 199 501 827   | 1 011 124  | 165      | 971 527     | 26
4     | 191 273 063   | 1 139 225  | 193      | 1 074 357   | 25
5     | 180 857 866   | 988 070    | 173      | 947 378     | 23
6     | 170 899 992   | 906 116    | 146      | 865 448     | 22
7     | 158 821 424   | 1 096 646  | 167      | 998 482     | 20
8     | 146 274 826   | 764 313    | 125      | 729 362     | 19
9     | 140 273 252   | 864 222    | 134      | 773 716     | 18
10    | 135 374 737   | 768 364    | 122      | 717 305     | 17
11    | 134 452 384   | 755 708    | 119      | 716 301     | 17
12    | 132 349 534   | 702 040    | 114      | 668 455     | 17
13    | 114 142 980   | 520 598    | 87       | 490 888     | 15
14    | 106 368 585   | 484 791    | 81       | 451 018     | 14
15    | 100 338 915   | 496 215    | 79       | 453 301     | 13
16    | 88 827 254    | 567 989    | 91       | 510 254     | 11
17    | 78 774 742    | 505 979    | 81       | 464 324     | 10
18    | 76 117 153    | 408 529    | 71       | 378 420     | 10
19    | 63 811 651    | 399 807    | 62       | 369 388     | 8
20    | 62 435 964    | 282 628    | 48       | 266 562     | 8
21    | 46 944 323    | 226 549    | 40       | 203 036     | 6
22    | 49 691 432    | 262 443    | 41       | 230 049     | 6
M     | 16 571        | 183        | 1        | 127         | 1
X     | 154 913 754   | 3 231 776  | 500      | 2 712 153   | 20
Y     | 57 772 954    | 592 791    | 96       | 481 307     | 7
Total | 3 080 436 051 | 19 666 791 | 3188     | 17 971 030  | 396

Size of the compressed target sequences (in bytes) and corresponding compression time (in seconds). The original sequence alphabets have been preserved. The size of the alphabet in the target sequence is 21 for all chromosomes, except for the M chromosome, where it is 11.

Table 1 displays the compression results for the TAIR9 version of the thale cress genome using the TAIR8 version as reference. Globally, GReEn required 6559 bytes for storing the sequences, whereas GRS needed a little more (6644 bytes). While GReEn took 48 s to encode the data, GRS needed only 28 s. Therefore, in this case, GRS is equivalent to the proposed method in terms of storage space, but faster. Table 2 displays the compression results for the TIGR6.0 version of the rice genome using the TIGR5.0 version as reference. In this case, the outcome differs dramatically from the previous results (Table 1). The first significant difference can be observed in both the compressed size and compression time of chromosome 1: 1 502 040 bytes in 708 s using GRS versus 4972 bytes in 18 s (more than a 300-fold improvement) using GReEn. A similarly significant difference can be observed in chromosome 11 (with a gain of over 160-fold). Globally, GRS required 4 901 902 bytes and 2290 s, whereas GReEn was able to store the entire genome in just 125 535 bytes (a 39-fold improvement) using only 123 s of computing time.
Table 2.

Oryza sativa genome: compression of TIGR6.0 using TIGR5.0 as reference

Chr   | Alphabet | Size        | RLZ bytes | RLZ secs | GRS bytes | GRS secs | GReEn bytes | GReEn secs
1     | 5        | 43 268 879  | 185 715   | 35       | 1 502 040 | 708      | 4972        | 18
2     | 5        | 35 930 381  | 210 295   | 28       | 1409      | 5        | 1906        | 14
3     | 6        | 36 406 689  | –         | –        | 47 764    | 28       | 17 890      | 15
4     | 5        | 35 278 225  | 175 663   | 27       | 36 145    | 20       | 6750        | 14
5     | 5        | 29 894 789  | 120 625   | 21       | 6177      | 5        | 5539        | 12
6     | 5        | 31 246 789  | 61 038    | 23       | 14        | 4        | 482         | 2
7     | 5        | 29 696 629  | 167 822   | 21       | 4067      | 8        | 2448        | 12
8     | 5        | 28 439 308  | 109 608   | 20       | 118 246   | 43       | 9507        | 11
9     | 5        | 23 011 239  | 44 953    | 16       | 14        | 4        | 366         | 2
10    | 9        | 23 134 759  | –         | –        | 788 542   | 339      | 60 449      | 9
11    | 11       | 28 512 666  | –         | –        | 2 397 470 | 1122     | 14 797      | 12
12    | 5        | 27 497 214  | 53 714    | 19       | 14        | 4        | 429         | 2
Total |          | 372 317 567 | –         | –        | 4 901 902 | 2290     | 125 535     | 123

Size of the compressed target sequences (in bytes) and corresponding compression time (in seconds). The original sequence alphabets have been preserved. The ‘Alphabet’ column indicates the size of the alphabet of the target sequence. The missing RLZ values correspond to sequences with characters that cannot be handled by the current implementation of this algorithm.

The main conclusion from these results is that, under certain conditions not yet investigated, GRS fails to find large-scale similarities between the two sequences. Therefore, the number of bytes generated is much larger than necessary and, probably as a consequence, the running time explodes. Moreover, when the target sequence is exactly equal to the reference sequence (as in chromosomes 6, 9 and 12), GRS reports a number of bytes that is essentially zero [in (22) they are shown as zero, although we opted to display the number of bytes effectively used], while GReEn uses a few hundred bytes. However, if critical, this could easily be reduced to almost zero by comparing the sequences before starting the encoding (note that, because the probabilities communicated to the arithmetic coder must be represented as integers, there is a lower bound on the minimum number of bits that can be generated in each coding step). Table 3 displays the compression results for the KOREF_20090224 version of the human genome using the KOREF_20090131 version as reference. In this case, GReEn gives consistently better results, both in terms of storage requirements and computing time. In fact, this latter aspect deserves a special note because, contrary to GRS, the running time of GReEn varies linearly with the size of the sequence. Therefore, GReEn allows for a good a priori estimate of the time required to compress a given sequence.
Besides considering the datasets used in (22), we also investigated four human genome assemblies, in order to provide a more comprehensive comparison of the GRS and GReEn compression approaches. However, this comparison fell short of our intention, because GRS failed to compress most of the sequences due to an excessive difference between the reference and target sequences. Table 4 displays the results obtained when the YH genome was compressed using KOREF_20090224 as reference. For the few chromosomes that GRS could compress, it gave unacceptable results, both regarding the size of the compressed sequences and the time required to compress them.
Table 4.

Homo sapiens genome: compression of YH using KOREF_20090224 as reference

Chr | Size        | GRS bytes  | GRS secs | GReEn bytes | GReEn secs
1   | 247 249 719 | –          | –        | 2 349 124   | 22
2   | 242 951 149 | –          | –        | 2 420 007   | 22
3   | 199 501 827 | 17 410 946 | 2879     | 1 730 477   | 18
4   | 191 273 063 | –          | –        | 1 877 056   | 17
5   | 180 857 866 | –          | –        | 1 792 278   | 16
6   | 170 899 992 | 25 815 446 | 7526     | 1 588 739   | 15
7   | 158 821 424 | –          | –        | 1 820 425   | 14
8   | 146 274 826 | –          | –        | 1 358 770   | 13
9   | 140 273 252 | –          | –        | 1 476 495   | 13
10  | 135 374 737 | –          | –        | 1 353 193   | 12
11  | 134 452 384 | –          | –        | 1 274 433   | 12
12  | 132 349 534 | 16 136 610 | 2120     | 1 174 966   | 12
13  | 114 142 980 | 11 227 954 | 3181     | 866 266     | 10
14  | 106 368 585 | –          | –        | 826 672     | 10
15  | 100 338 915 | –          | –        | 892 429     | 9
16  | 88 827 254  | –          | –        | 1 015 246   | 8
17  | 78 774 742  | –          | –        | 864 710     | 7
18  | 76 117 153  | 13 187 892 | 4061     | 713 787     | 7
19  | 63 811 651  | –          | –        | 589 422     | 6
20  | 62 435 964  | 8 409 776  | 1449     | 493 404     | 6
21  | 46 944 323  | 726 269    | 664      | 374 383     | 4
22  | 49 691 432  | –          | –        | 444 932     | 5
M   | 16 571      | 321        | 1        | 127         | 1
X   | 154 913 754 | –          | –        | 3 258 188   | 11
Y   | 57 772 954  | –          | –        | 859 688     | 4

Size of the compressed target sequences (in bytes) and corresponding compression time (in seconds). The original sequence alphabets have been preserved. The missing values are due to the inability of GRS to compress sequences that differ from the reference by more than a predefined amount.

Table 5 displays the compression results, using GReEn, for four different human genome assemblies (HuRef, Celera, YH and KOREF_20090224), using the NCBI37 version as reference. As this article is about sequence compression, not sequence analysis, we refrain from elaborating on the differences observed. Nevertheless, we hint at what we believe may be possible explanations. First, the HuRef and Celera assemblies are not resequencing assemblies and this, per se, accounts for greater compression differences with respect to the reference assembly.
Table 5.

Homo sapiens genome: compression with GReEn of the HuRef, Celera, YH and KOREF_20090224 versions using the NCBI37 as reference

Chr | HuRef     | Celera    | YH        | KOREF
1   | 6 652 184 | 5 106 720 | 1 979 661 | 2 074 258
2   | 4 109 606 | 3 271 105 | 2 205 102 | 1 833 388
3   | 1 718 683 | 1 125 544 | 2 868 462 | 2 808 941
4   | 2 440 255 | 1 675 878 | 1 815 309 | 1 844 448
5   | 2 084 630 | 1 962 869 | 1 327 235 | 1 289 709
6   | 1 926 853 | 1 846 101 | 1 460 666 | 1 436 168
7   | 2 216 643 | 2 345 859 | 1 381 234 | 1 511 664
8   | 1 755 512 | 1 084 584 | 1 323 845 | 1 310 275
9   | 3 939 856 | 2 906 969 | 1 049 456 | 1 152 997
10  | 2 235 388 | 2 025 459 | 1 075 899 | 1 237 129
11  | 1 565 536 | 1 459 854 | 1 068 335 | 1 104 478
12  | 1 495 696 | 1 559 635 | 1 199 709 | 1 260 183
13  | 4 429 154 | 3 023 681 | 1 065 006 | 1 052 608
14  | 3 480 676 | 2 325 885 | 803 902   | 854 166
15  | 3 358 239 | 2 944 889 | 946 244   | 958 050
16  | 1 848 172 | 2 319 629 | 747 166   | 802 956
17  | 1 091 917 | 1 163 879 | 955 918   | 905 359
18  | 893 600   | 625 364   | 726 165   | 765 927
19  | 697 898   | 621 943   | 2 777 894 | 2 832 746
20  | 611 521   | 433 253   | 468 215   | 490 498
21  | 884 601   | 415 412   | 434 679   | 481 691
22  | 929 001   | 655 089   | 404 354   | 431 417
X   | 3 159 205 | 3 259 716 | 492 893   | 740 530
Y   | 565 746   | 1 157 801 | 138 838   | 279 461

Number of bytes after compressing each sequence. For ease of comparison, we transformed all characters to lowercase and mapped all unknown nucleotides to ‘n’ before compression. Therefore, after this transformation, all sequences were composed only of characters from the alphabet {a,c,g,t,n}.

The HuRef assembly is an individual genome sequenced with capillary-based whole-genome shotgun technologies and de novo assembled with the Celera Assembler. Hence, this assembly is the farthest apart from the reference NCBI37 assembly, i.e. it requires the largest number of bytes for its compression. The Celera assembly represents one of the two pioneering efforts in sequencing a human genome. Its consensus sequence is derived from the genomes of five individuals, using a capillary-based whole-genome shotgun sequencing approach. The reference assembly generated by the International Human Genome Sequencing Consortium (here represented by the NCBI37 assembly) used a clone-based hierarchical shotgun strategy, which is more likely to output a high-quality finished genome sequence because the sequence assembly is local and anchored to the genome. In contrast, the Celera Genomics Sequencing Team opted for a whole-genome shotgun strategy, in which sequence contigs and scaffolds must be individually anchored to the genome, rendering assembly more complex and more prone to long-range misassembly. Moreover, this whole-genome shotgun assembly resulted from a combined analysis of the genomic data generated by the Celera Genomics Sequencing Team and some data generated by the International Human Genome Sequencing Consortium; hence, it has been claimed that this Celera assembly is not a totally independent human genome assembly (37).
We believe this may partly explain why the compression values in Table 5 are smaller for this assembly than for the HuRef assembly. The YH assembly is an individual genome based on resequencing data from massively parallel sequencing technologies and assembled with the Short Oligonucleotide Alignment Program, using the NCBI human genome assembly as reference. Essentially, it is a map of SNPs with respect to the reference assembly, hence it displays very low compression values in Table 5. The KOREF_20090224 assembly is also an individual genome based on resequencing data from massively parallel sequencing technologies, assembled with the Mapping and Assembly with Qualities program, using the NCBI human genome assembly as reference. As with the YH assembly, resequencing renders the resulting assembly very redundant with respect to the reference (NCBI37) assembly, hence it also displays very low compression values in Table 5. The compression values for chromosome 19 in the YH and KOREF_20090224 assemblies are unexpectedly high. This chromosome has the highest GC content (48.4%) and the lowest (median) sequence depth (28-fold) in the YH genome (34), constraining the quality of the final sequence. Not surprisingly, chromosome 19 in the YH genome has a very large number of unsequenced bases (‘N’ symbols in our encoding), more than twice that of the reference NCBI37 assembly. Chromosome 19 in the KOREF_20090224 assembly faces the same hurdles, which we assume to be a consequence of the similar sequencing methodology. Finally, Table 6 displays again the compression results for the KOREF_20090224 version of the human genome using the KOREF_20090131 version as reference. However, to allow the comparison of GReEn with GRS and RLZ on a larger genome, we converted the sequences to the {a,c,g,t,n} alphabet.
Table 6.

Homo sapiens genome: compression of KOREF_20090224 using KOREF_20090131 as reference

Chr   | RLZ bytes | GRS bytes | GReEn bytes
1     | 591 629   | 152 388   | 90 555
2     | 576 769   | 146 754   | 89 440
3     | 472 814   | 117 544   | 72 708
4     | 471 157   | 134 628   | 83 611
5     | 428 287   | 108 407   | 66 597
6     | 411 404   | 109 866   | 67 264
7     | 395 524   | 119 223   | 71 898
8     | 350 337   | 94 139    | 56 650
9     | 357 584   | 119 647   | 68 607
10    | 335 464   | 101 486   | 60 303
11    | 326 836   | 91 380    | 54 966
12    | 320 444   | 89 170    | 55 408
13    | 266 378   | 64 313    | 36 962
14    | 248 165   | 58 865    | 34 245
15    | 235 094   | 56 569    | 32 693
16    | 217 748   | 60 580    | 35 315
17    | 193 700   | 55 582    | 33 836
18    | 182 604   | 48 098    | 29 191
19    | 162 826   | 53 355    | 30 505
20    | 149 403   | 38 114    | 22 969
21    | 112 822   | 29 048    | 16 620
22    | 119 791   | 32 562    | 18 423
M     | 56        | 75        | 54
X     | 428 878   | 224 997   | 129 497
Y     | 150 901   | 61 306    | 33 312
Total | 7 506 615 | 2 168 096 | 1 291 629

Number of bytes after compressing each sequence. To allow the comparison with RLZ and GRS, all characters were transformed to lowercase before compression and all unknown nucleotides were mapped to ‘n’. Therefore, after this transformation, all sequences were composed only of characters from the set {a,c,g,t,n}.


DISCUSSION

The GRS tool recently introduced by Wang et al. (22) for compressing DNA resequencing data using a reference sequence makes it possible to significantly reduce data storage space requirements. However, this tool seems to be effective only when the target sequence is very similar to the reference sequence, preventing the compression of many sequences of interest. Moreover, as we have shown, for example, for chromosomes 1 and 11 of the TIGR6.0 version of the rice genome, it may fail to give reasonable results even for similar sequences. Another drawback of GRS is that the encoding time depends not only on the sequence size, but mainly on the similarity between the target and reference sequences (lower similarity implying longer compression times), making the time required to compress a given sequence largely unpredictable. To overcome these limitations, we propose a statistical compression method that uses a probabilistic copy model. The probabilities are estimated for every character of the target sequence and are used to feed an arithmetic coder. The compression tool has two control parameters, namely the size of the k-mer used for searching copies (with a default value of k = 11) and the number of prediction failures tolerated by the copy model before it is restarted (with a default value of 0). Changing these parameters may change the performance of the codec, degrading it for some sequences while improving it for others. The decision to try to optimize these parameters, or, as we have done when producing the experimental results in this article, to use the default values, is left to the user.

CONCLUSION

In this article, we described GReEn, a computational tool for compressing genome resequencing data using another sequence as reference. The tool handles arbitrary alphabets and places no restrictions or requirements on the sequences to be compressed. We included several examples of its efficiency in compressing genomic data and of its improvements over other recently proposed tools, making evident the practical interest of the proposed tool. With ever larger volumes of genome sequencing and resequencing data being generated, and with the increasing costs associated with storing and transmitting those data, compression tools that efficiently recognize redundancies are in demand. The interest in such compression methodologies, however, goes beyond data storage and communication. Because they embody a probabilistic model of the underlying genomic sequence(s), compression tools reveal similarities and differences that are paramount for studies of human genomic variation between individuals and, hence, key for progress in personalized medicine efforts.

FUNDING

European Fund for Regional Development (FEDER) through the Operational Program Competitiveness Factors (COMPETE); Portuguese Foundation for Science and Technology (FCT) through project grants FCOMP-01-0124-FEDER-010099 (FCT reference PTDC/EIA-EIA/103099/2008) and FCOMP-01-0124-FEDER-022682 (FCT reference PEst-C/EEI/UI0127/2011); European Social Fund (to S.P.G.); Portuguese Ministry of Education and Science (to S.P.G.). Funding for open access charge: Portuguese Foundation for Science and Technology (FCT) project grant FCOMP-01-0124-FEDER-022682 (FCT reference PEst-C/EEI/UI0127/2011). Conflict of interest statement. None declared.

REFERENCES (21 in total)

1.  A compression algorithm for DNA sequences.

Authors:  Xin Chen; Sam Kwong; Ming Li
Journal:  IEEE Eng Med Biol Mag       Date:  2001 Jul-Aug

2.  DNACompress: fast and effective DNA sequence compression.

Authors:  Xin Chen; Ming Li; Bin Ma; John Tromp
Journal:  Bioinformatics       Date:  2002-12       Impact factor: 6.937

3.  Efficient storage of high throughput DNA sequencing data using reference-based compression.

Authors:  Markus Hsi-Yang Fritz; Rasko Leinonen; Guy Cochrane; Ewan Birney
Journal:  Genome Res       Date:  2011-01-18       Impact factor: 9.043

4.  Compressing genomic sequence fragments using SlimGene.

Authors:  Christos Kozanitis; Chris Saunders; Semyon Kruglyak; Vineet Bafna; George Varghese
Journal:  J Comput Biol       Date:  2011-03       Impact factor: 1.479

5.  On the sequencing of the human genome.

Authors:  Robert H Waterston; Eric S Lander; John E Sulston
Journal:  Proc Natl Acad Sci U S A       Date:  2002-03-05       Impact factor: 11.205

6.  Compression of DNA sequence reads in FASTQ format.

Authors:  Sebastian Deorowicz; Szymon Grabowski
Journal:  Bioinformatics       Date:  2011-01-19       Impact factor: 6.937

7.  The Arabidopsis Information Resource (TAIR): a comprehensive database and web-based information retrieval, analysis, and visualization system for a model plant.

Authors:  E Huala; A W Dickerman; M Garcia-Hernandez; D Weems; L Reiser; F LaFond; D Hanley; D Kiphart; M Zhuang; W Huang; L A Mueller; D Bhattacharyya; D Bhaya; B W Sobral; W Beavis; D W Meinke; C D Town; C Somerville; S Y Rhee
Journal:  Nucleic Acids Res       Date:  2001-01-01       Impact factor: 16.971

8.  The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community.

Authors:  Seung Yon Rhee; William Beavis; Tanya Z Berardini; Guanghong Chen; David Dixon; Aisling Doyle; Margarita Garcia-Hernandez; Eva Huala; Gabriel Lander; Mary Montoya; Neil Miller; Lukas A Mueller; Suparna Mundodi; Leonore Reiser; Julie Tacklind; Dan C Weems; Yihe Wu; Iris Xu; Daniel Yoo; Jungwon Yoon; Peifen Zhang
Journal:  Nucleic Acids Res       Date:  2003-01-01       Impact factor: 16.971

9.  The sequence of the human genome.

Authors:  J C Venter; M D Adams; E W Myers; P W Li; R J Mural; G G Sutton; H O Smith; M Yandell; C A Evans; R A Holt; J D Gocayne; P Amanatides; R M Ballew; D H Huson; J R Wortman; Q Zhang; C D Kodira; X H Zheng; L Chen; M Skupski; G Subramanian; P D Thomas; J Zhang; G L Gabor Miklos; C Nelson; S Broder; A G Clark; J Nadeau; V A McKusick; N Zinder; A J Levine; R J Roberts; M Simon; C Slayman; M Hunkapiller; R Bolanos; A Delcher; I Dew; D Fasulo; M Flanigan; L Florea; A Halpern; S Hannenhalli; S Kravitz; S Levy; C Mobarry; K Reinert; K Remington; J Abu-Threideh; E Beasley; K Biddick; V Bonazzi; R Brandon; M Cargill; I Chandramouliswaran; R Charlab; K Chaturvedi; Z Deng; V Di Francesco; P Dunn; K Eilbeck; C Evangelista; A E Gabrielian; W Gan; W Ge; F Gong; Z Gu; P Guan; T J Heiman; M E Higgins; R R Ji; Z Ke; K A Ketchum; Z Lai; Y Lei; Z Li; J Li; Y Liang; X Lin; F Lu; G V Merkulov; N Milshina; H M Moore; A K Naik; V A Narayan; B Neelam; D Nusskern; D B Rusch; S Salzberg; W Shao; B Shue; J Sun; Z Wang; A Wang; X Wang; J Wang; M Wei; R Wides; C Xiao; C Yan; A Yao; J Ye; M Zhan; W Zhang; H Zhang; Q Zhao; L Zheng; F Zhong; W Zhong; S Zhu; S Zhao; D Gilbert; S Baumhueter; G Spier; C Carter; A Cravchik; T Woodage; F Ali; H An; A Awe; D Baldwin; H Baden; M Barnstead; I Barrow; K Beeson; D Busam; A Carver; A Center; M L Cheng; L Curry; S Danaher; L Davenport; R Desilets; S Dietz; K Dodson; L Doup; S Ferriera; N Garg; A Gluecksmann; B Hart; J Haynes; C Haynes; C Heiner; S Hladun; D Hostin; J Houck; T Howland; C Ibegwam; J Johnson; F Kalush; L Kline; S Koduru; A Love; F Mann; D May; S McCawley; T McIntosh; I McMullen; M Moy; L Moy; B Murphy; K Nelson; C Pfannkoch; E Pratts; V Puri; H Qureshi; M Reardon; R Rodriguez; Y H Rogers; D Romblad; B Ruhfel; R Scott; C Sitter; M Smallwood; E Stewart; R Strong; E Suh; R Thomas; N N Tint; S Tse; C Vech; G Wang; J Wetter; S Williams; M Williams; S Windsor; E Winn-Deen; K Wolfe; J Zaveri; K Zaveri; J F Abril; R Guigó; M J Campbell; K V Sjolander; B Karlak; A Kejariwal; H Mi; B Lazareva; T Hatton; A Narechania; K Diemer; A Muruganujan; N Guo; S Sato; V Bafna; S Istrail; R Lippert; R Schwartz; B Walenz; S Yooseph; D Allen; A Basu; J Baxendale; L Blick; M Caminha; J Carnes-Stine; P Caulk; Y H Chiang; M Coyne; C Dahlke; A Deslattes Mays; M Dombroski; M Donnelly; D Ely; S Esparham; C Fosler; H Gire; S Glanowski; K Glasser; A Glodek; M Gorokhov; K Graham; B Gropman; M Harris; J Heil; S Henderson; J Hoover; D Jennings; C Jordan; J Jordan; J Kasha; L Kagan; C Kraft; A Levitsky; M Lewis; X Liu; J Lopez; D Ma; W Majoros; J McDaniel; S Murphy; M Newman; T Nguyen; N Nguyen; M Nodell; S Pan; J Peck; M Peterson; W Rowe; R Sanders; J Scott; M Simpson; T Smith; A Sprague; T Stockwell; R Turner; E Venter; M Wang; M Wen; D Wu; M Wu; A Xia; A Zandieh; X Zhu
Journal:  Science       Date:  2001-02-16       Impact factor: 47.728

10.  On the representability of complete genomes by multiple competing finite-context (Markov) models.

Authors:  Armando J Pinho; Paulo J S G Ferreira; António J R Neves; Carlos A C Bastos
Journal:  PLoS One       Date:  2011-06-30       Impact factor: 3.240

