Armando J Pinho, Diogo Pratas, Sara P Garcia.
Abstract
Research in the genomic sciences is confronted with a volume of sequencing and resequencing data that is growing faster than data storage and communication resources, shifting a significant part of research budgets from the sequencing component of a project to the computational one. Hence, being able to store sequencing and resequencing data efficiently is a problem of paramount importance. In this article, we describe GReEn (Genome Resequencing Encoding), a tool for compressing genome resequencing data using a reference genome sequence. It overcomes some drawbacks of the recently proposed tool GRS: it can compress sequences that GRS cannot handle, it runs faster, and it achieves compression gains of over 100-fold for some sequences. This tool is freely available for non-commercial use at ftp://ftp.ieeta.pt/~ap/codecs/GReEn1.tar.gz.
Year: 2011 PMID: 22139935 PMCID: PMC3287168 DOI: 10.1093/nar/gkr1124
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. The copy model. In this example, the copy model was restarted at position 341 587 of the reference sequence, corresponding to position 327 829 of the target sequence. Since then, it has correctly predicted 5 characters if case is considered, and a total of 11 characters if case is ignored. The dashed arrow indicates a failed prediction. According to this example, the next character to be predicted is ‘G’.
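The bookkeeping described in this caption can be illustrated with a toy sketch. It is not GReEn's implementation: the class name `CopyModel`, the helper `run`, and the example strings are hypothetical, and the real tool feeds such predictions into a statistical coder rather than just counting them. The sketch only shows a pointer into the reference that predicts the next target character and tallies case-sensitive hits, case-insensitive hits, and failed predictions.

```python
from dataclasses import dataclass


@dataclass
class CopyModel:
    """Toy copy model: a pointer into the reference that predicts the target."""
    reference: str
    pointer: int          # index of the reference character used as the next prediction
    hits_case: int = 0    # predictions that match exactly, case included
    hits_nocase: int = 0  # predictions that match when case is ignored
    misses: int = 0       # failed predictions

    def predict(self) -> str:
        """Return the reference character currently under the pointer."""
        return self.reference[self.pointer]

    def update(self, actual: str) -> None:
        """Score the prediction against the real target character, then advance."""
        predicted = self.predict()
        if predicted == actual:
            self.hits_case += 1
        if predicted.lower() == actual.lower():
            self.hits_nocase += 1
        else:
            self.misses += 1
        self.pointer += 1


def run(reference: str, target: str, ref_start: int, tgt_start: int) -> CopyModel:
    """Start the model at ref_start and score it on target[tgt_start:]."""
    model = CopyModel(reference, ref_start)
    for ch in target[tgt_start:]:
        if model.pointer >= len(reference):
            break  # ran off the end of the reference
        model.update(ch)
    return model


# Made-up sequences, not the ones shown in the figure.
model = run("ttACGTacgtGG", "ACGTACGAG", ref_start=2, tgt_start=0)
print(model.hits_case, model.hits_nocase, model.misses, model.predict())
```

In this made-up run the model scores 5 exact matches, 8 matches ignoring case, and 1 failed prediction, and its next prediction is the reference character under the pointer.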
Figure 2. Data organized in a hash table.
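The caption does not say what the hash table stores. As a hedged illustration only, one plausible use in reference-based compression is to index short substrings (k-mers) of the reference so that a copy pointer can be repositioned after it stops predicting well. The function names `build_kmer_index` and `restart_position`, the value of `k`, and the choice of the most recent occurrence are all assumptions made for this sketch.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def build_kmer_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map each k-mer of the reference (case-folded) to the positions right after it."""
    index: Dict[str, List[int]] = defaultdict(list)
    # Stop one position early so every stored position has a character to predict.
    for i in range(len(reference) - k):
        index[reference[i:i + k].lower()].append(i + k)
    return index


def restart_position(index: Dict[str, List[int]], recent_target: str, k: int) -> Optional[int]:
    """Look up the last k target characters; return a reference position or None."""
    if len(recent_target) < k:
        return None
    positions = index.get(recent_target[-k:].lower())
    # Assumption: prefer the most recent occurrence in the reference.
    return positions[-1] if positions else None


reference = "ACGTACGA"
index = build_kmer_index(reference, k=3)
pos = restart_position(index, "TTACG", k=3)
print(pos, reference[pos] if pos is not None else None)
```

A dictionary keyed by k-mer keeps the lookup cost independent of the reference length, which is the usual reason for choosing a hash table here.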
Arabidopsis thaliana genome: compression of TAIR9 using TAIR8 as reference
| Chr | Alphabet size | Size | GRS (bytes) | GRS (secs) | GReEn (bytes) | GReEn (secs) |
|---|---|---|---|---|---|---|
| 1 | 11 | 30 427 671 | 715 | 7 | 1551 | 13 |
| 2 | 11 | 19 698 289 | 385 | 4 | 937 | 8 |
| 3 | 10 | 23 459 830 | 2989 | 6 | 1097 | 9 |
| 4 | 7 | 18 585 056 | 1951 | 5 | 2356 | 7 |
| 5 | 5 | 26 975 502 | 604 | 6 | 618 | 11 |
| Total | – | 119 146 348 | 6644 | 28 | 6559 | 48 |
Size of the compressed target sequences (in bytes) and corresponding compression time (in seconds). The original sequence alphabets have been preserved. The second column gives the size of the alphabet of the target sequence.
Homo sapiens genome: compression of KOREF_20090224 using KOREF_20090131 as reference
| Chr | Size | GRS (bytes) | GRS (secs) | GReEn (bytes) | GReEn (secs) |
|---|---|---|---|---|---|
| 1 | 247 249 719 | 1 336 626 | 222 | 1 225 767 | 32 |
| 2 | 242 951 149 | 1 354 059 | 230 | 1 272 105 | 31 |
| 3 | 199 501 827 | 1 011 124 | 165 | 971 527 | 26 |
| 4 | 191 273 063 | 1 139 225 | 193 | 1 074 357 | 25 |
| 5 | 180 857 866 | 988 070 | 173 | 947 378 | 23 |
| 6 | 170 899 992 | 906 116 | 146 | 865 448 | 22 |
| 7 | 158 821 424 | 1 096 646 | 167 | 998 482 | 20 |
| 8 | 146 274 826 | 764 313 | 125 | 729 362 | 19 |
| 9 | 140 273 252 | 864 222 | 134 | 773 716 | 18 |
| 10 | 135 374 737 | 768 364 | 122 | 717 305 | 17 |
| 11 | 134 452 384 | 755 708 | 119 | 716 301 | 17 |
| 12 | 132 349 534 | 702 040 | 114 | 668 455 | 17 |
| 13 | 114 142 980 | 520 598 | 87 | 490 888 | 15 |
| 14 | 106 368 585 | 484 791 | 81 | 451 018 | 14 |
| 15 | 100 338 915 | 496 215 | 79 | 453 301 | 13 |
| 16 | 88 827 254 | 567 989 | 91 | 510 254 | 11 |
| 17 | 78 774 742 | 505 979 | 81 | 464 324 | 10 |
| 18 | 76 117 153 | 408 529 | 71 | 378 420 | 10 |
| 19 | 63 811 651 | 399 807 | 62 | 369 388 | 8 |
| 20 | 62 435 964 | 282 628 | 48 | 266 562 | 8 |
| 21 | 46 944 323 | 226 549 | 40 | 203 036 | 6 |
| 22 | 49 691 432 | 262 443 | 41 | 230 049 | 6 |
| M | 16 571 | 183 | 1 | 127 | 1 |
| X | 154 913 754 | 3 231 776 | 500 | 2 712 153 | 20 |
| Y | 57 772 954 | 592 791 | 96 | 481 307 | 7 |
| Total | 3 080 436 051 | 19 666 791 | 3 188 | 17 971 030 | 396 |
Size of the compressed target sequences (in bytes) and corresponding compression time (in seconds). The original sequence alphabets have been preserved. The size of the alphabet in the target sequence is 21 for all chromosomes, except for the M chromosome where it is 11.
Oryza sativa genome: compression of TIGR6.0 using TIGR5.0 as reference
| Chr | Alphabet size | Size | RLZ (bytes) | RLZ (secs) | GRS (bytes) | GRS (secs) | GReEn (bytes) | GReEn (secs) |
|---|---|---|---|---|---|---|---|---|
| 1 | 5 | 43 268 879 | 185 715 | 35 | 1 502 040 | 708 | 4 972 | 18 |
| 2 | 5 | 35 930 381 | 210 295 | 28 | 1 409 | 5 | 1 906 | 14 |
| 3 | 6 | 36 406 689 | – | – | 47 764 | 28 | 17 890 | 15 |
| 4 | 5 | 35 278 225 | 175 663 | 27 | 36 145 | 20 | 6 750 | 14 |
| 5 | 5 | 29 894 789 | 120 625 | 21 | 6 177 | 5 | 5 539 | 12 |
| 6 | 5 | 31 246 789 | 61 038 | 23 | 14 | 4 | 482 | 2 |
| 7 | 5 | 29 696 629 | 167 822 | 21 | 4 067 | 8 | 2 448 | 12 |
| 8 | 5 | 28 439 308 | 109 608 | 20 | 118 246 | 43 | 9 507 | 11 |
| 9 | 5 | 23 011 239 | 44 953 | 16 | 14 | 4 | 366 | 2 |
| 10 | 9 | 23 134 759 | – | – | 788 542 | 339 | 60 449 | 9 |
| 11 | 11 | 28 512 666 | – | – | 2 397 470 | 1 122 | 14 797 | 12 |
| 12 | 5 | 27 497 214 | 53 714 | 19 | 14 | 4 | 429 | 2 |
| Total | – | 372 317 567 | – | – | 4 901 902 | 2 290 | 125 535 | 123 |
Size of the compressed target sequences (in bytes) and corresponding compression time (in seconds). The original sequence alphabets have been preserved. The second column gives the size of the alphabet of the target sequence. The missing RLZ values correspond to sequences containing characters that the current implementation of this algorithm cannot handle.
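One way to read these figures is as a size ratio between the two tools. The short sketch below computes that ratio for a few rows copied from the table above. The helper name `gain` and the choice of rows are illustrative, and treating "compression gain" as GRS bytes divided by GReEn bytes is an assumption about how the term is meant.

```python
# (GRS bytes, GReEn bytes) per chromosome, copied from the TIGR6.0/TIGR5.0 table above.
compressed_bytes = {
    "1": (1_502_040, 4_972),
    "6": (14, 482),
    "10": (788_542, 60_449),
    "11": (2_397_470, 14_797),
    "Total": (4_901_902, 125_535),
}


def gain(grs_bytes: int, green_bytes: int) -> float:
    """Size ratio GRS/GReEn; values above 1 mean GReEn produced the smaller file."""
    return grs_bytes / green_bytes


gains = {chrom: gain(*sizes) for chrom, sizes in compressed_bytes.items()}
for chrom, g in gains.items():
    print(f"chr {chrom}: {g:.2f}x")
```

On these rows the ratio exceeds 300 for chromosome 1 and 160 for chromosome 11, while for chromosome 6 the GRS output is the smaller one (14 bytes against 482).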
Homo sapiens genome: compression of YH using KOREF_20090224 as reference
| Chr | Size | GRS (bytes) | GRS (secs) | GReEn (bytes) | GReEn (secs) |
|---|---|---|---|---|---|
| 1 | 247 249 719 | – | – | 2 349 124 | 22 |
| 2 | 242 951 149 | – | – | 2 420 007 | 22 |
| 3 | 199 501 827 | 17 410 946 | 2879 | 1 730 477 | 18 |
| 4 | 191 273 063 | – | – | 1 877 056 | 17 |
| 5 | 180 857 866 | – | – | 1 792 278 | 16 |
| 6 | 170 899 992 | 25 815 446 | 7526 | 1 588 739 | 15 |
| 7 | 158 821 424 | – | – | 1 820 425 | 14 |
| 8 | 146 274 826 | – | – | 1 358 770 | 13 |
| 9 | 140 273 252 | – | – | 1 476 495 | 13 |
| 10 | 135 374 737 | – | – | 1 353 193 | 12 |
| 11 | 134 452 384 | – | – | 1 274 433 | 12 |
| 12 | 132 349 534 | 16 136 610 | 2120 | 1 174 966 | 12 |
| 13 | 114 142 980 | 11 227 954 | 3181 | 866 266 | 10 |
| 14 | 106 368 585 | – | – | 826 672 | 10 |
| 15 | 100 338 915 | – | – | 892 429 | 9 |
| 16 | 88 827 254 | – | – | 1 015 246 | 8 |
| 17 | 78 774 742 | – | – | 864 710 | 7 |
| 18 | 76 117 153 | 13 187 892 | 4061 | 713 787 | 7 |
| 19 | 63 811 651 | – | – | 589 422 | 6 |
| 20 | 62 435 964 | 8 409 776 | 1449 | 493 404 | 6 |
| 21 | 46 944 323 | 726 269 | 664 | 374 383 | 4 |
| 22 | 49 691 432 | – | – | 444 932 | 5 |
| M | 16 571 | 321 | 1 | 127 | 1 |
| X | 154 913 754 | – | – | 3 258 188 | 11 |
| Y | 57 772 954 | – | – | 859 688 | 4 |
Size of the compressed target sequences (in bytes) and corresponding compression time (in seconds). The original sequence alphabets have been preserved. The missing values are due to the inability of GRS to compress sequences that differ from the reference by more than a predefined threshold.
Homo sapiens genome: compression with GReEn of the HuRef, Celera, YH and KOREF_20090224 versions using the NCBI37 as reference
| Chr | HuRef | Celera | YH | KOREF |
|---|---|---|---|---|
| 1 | 6 652 184 | 5 106 720 | 1 979 661 | 2 074 258 |
| 2 | 4 109 606 | 3 271 105 | 2 205 102 | 1 833 388 |
| 3 | 1 718 683 | 1 125 544 | 2 868 462 | 2 808 941 |
| 4 | 2 440 255 | 1 675 878 | 1 815 309 | 1 844 448 |
| 5 | 2 084 630 | 1 962 869 | 1 327 235 | 1 289 709 |
| 6 | 1 926 853 | 1 846 101 | 1 460 666 | 1 436 168 |
| 7 | 2 216 643 | 2 345 859 | 1 381 234 | 1 511 664 |
| 8 | 1 755 512 | 1 084 584 | 1 323 845 | 1 310 275 |
| 9 | 3 939 856 | 2 906 969 | 1 049 456 | 1 152 997 |
| 10 | 2 235 388 | 2 025 459 | 1 075 899 | 1 237 129 |
| 11 | 1 565 536 | 1 459 854 | 1 068 335 | 1 104 478 |
| 12 | 1 495 696 | 1 559 635 | 1 199 709 | 1 260 183 |
| 13 | 4 429 154 | 3 023 681 | 1 065 006 | 1 052 608 |
| 14 | 3 480 676 | 2 325 885 | 803 902 | 854 166 |
| 15 | 3 358 239 | 2 944 889 | 946 244 | 958 050 |
| 16 | 1 848 172 | 2 319 629 | 747 166 | 802 956 |
| 17 | 1 091 917 | 1 163 879 | 955 918 | 905 359 |
| 18 | 893 600 | 625 364 | 726 165 | 765 927 |
| 19 | 697 898 | 621 943 | 2 777 894 | 2 832 746 |
| 20 | 611 521 | 433 253 | 468 215 | 490 498 |
| 21 | 884 601 | 415 412 | 434 679 | 481 691 |
| 22 | 929 001 | 655 089 | 404 354 | 431 417 |
| X | 3 159 205 | 3 259 716 | 492 893 | 740 530 |
| Y | 565 746 | 1 157 801 | 138 838 | 279 461 |
Number of bytes after compressing each sequence. For ease of comparison we transformed all characters to lowercase and mapped all unknown nucleotides to ‘n’ before compression. Therefore, after this transformation, all sequences were composed only of characters from the alphabet {a,c,g,t,n}.
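The normalization described in this note can be sketched in a few lines. The function name `normalize` is hypothetical, and treating every symbol other than a, c, g, t as an "unknown nucleotide" is an assumption; the note does not list which symbols were remapped. The input is assumed to be a raw sequence string with headers and line breaks already removed.

```python
def normalize(seq: str) -> str:
    """Lowercase the sequence and map anything outside {a,c,g,t} to 'n'."""
    known = set("acgt")
    return "".join(ch if ch in known else "n" for ch in seq.lower())


print(normalize("ACgTNxR"))
```

After this step every sequence uses the same five-symbol alphabet, so compressed sizes can be compared without differences in case or ambiguity codes affecting the result.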
Homo sapiens genome: compression of KOREF_20090224 using KOREF_20090131 as reference
| Chr | RLZ | GRS | GReEn |
|---|---|---|---|
| 1 | 591 629 | 152 388 | 90 555 |
| 2 | 576 769 | 146 754 | 89 440 |
| 3 | 472 814 | 117 544 | 72 708 |
| 4 | 471 157 | 134 628 | 83 611 |
| 5 | 428 287 | 108 407 | 66 597 |
| 6 | 411 404 | 109 866 | 67 264 |
| 7 | 395 524 | 119 223 | 71 898 |
| 8 | 350 337 | 94 139 | 56 650 |
| 9 | 357 584 | 119 647 | 68 607 |
| 10 | 335 464 | 101 486 | 60 303 |
| 11 | 326 836 | 91 380 | 54 966 |
| 12 | 320 444 | 89 170 | 55 408 |
| 13 | 266 378 | 64 313 | 36 962 |
| 14 | 248 165 | 58 865 | 34 245 |
| 15 | 235 094 | 56 569 | 32 693 |
| 16 | 217 748 | 60 580 | 35 315 |
| 17 | 193 700 | 55 582 | 33 836 |
| 18 | 182 604 | 48 098 | 29 191 |
| 19 | 162 826 | 53 355 | 30 505 |
| 20 | 149 403 | 38 114 | 22 969 |
| 21 | 112 822 | 29 048 | 16 620 |
| 22 | 119 791 | 32 562 | 18 423 |
| M | 56 | 75 | 54 |
| X | 428 878 | 224 997 | 129 497 |
| Y | 150 901 | 61 306 | 33 312 |
| Total | 7 506 615 | 2 168 096 | 1 291 629 |
Number of bytes after compressing each sequence. To allow comparison with RLZ and GRS, all characters were transformed to lowercase before compression and all unknown nucleotides were mapped to ‘n’. Therefore, after this transformation, all sequences were composed only of characters from the alphabet {a,c,g,t,n}.