W Timothy J White, Michael D Hendy.
Abstract
BACKGROUND: Publicly available DNA sequence databases such as GenBank are large, and are growing at an exponential rate. The sheer volume of data being dealt with presents serious storage and data communications problems. Currently, sequence data is usually kept in large "flat files," which are then compressed using standard Lempel-Ziv (gzip) compression, an approach that rarely achieves good compression ratios. While much research has been done on compressing individual DNA sequences, surprisingly little has focused on the compression of entire databases of such sequences. In this study we introduce the sequence database compression software coil.
Year: 2008 PMID: 18489794 PMCID: PMC2426707 DOI: 10.1186/1471-2105-9-242
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Edit-tree coding of similar sequence groups. Circles represent DNA sequences in a database; the straight-line distance between circles represents the edit distance between sequences. Initially (a) we are presented with the input database. In the first step (b), groups of similar sequences are discovered. In the second step (c), each group is edit-tree coded independently by determining a reasonable tree, selecting a root sequence (coloured black) and recording the necessary edits along each edge. Some sequences are not sufficiently similar to any other sequence to be delta-encoded; these sequences are recorded verbatim.
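The two steps in the caption can be sketched in Python. This is a minimal illustration under stated assumptions, not coil's actual implementation: the grouping step is skipped (the input is treated as one group of similar sequences), the "reasonable tree" is taken to be a minimum spanning tree over Levenshtein distance, the root is simply the first sequence, and edits are derived with `difflib.SequenceMatcher`. All function names here are hypothetical.

```python
from difflib import SequenceMatcher


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def build_tree(seqs, root=0):
    """Prim's algorithm; returns {child index: parent index}."""
    parent, in_tree = {}, {root}
    while len(in_tree) < len(seqs):
        _, p, c = min((edit_distance(seqs[p], seqs[c]), p, c)
                      for p in in_tree
                      for c in range(len(seqs)) if c not in in_tree)
        parent[c] = p
        in_tree.add(c)
    return parent


def delta(src: str, dst: str):
    """Edits turning src into dst: (start, end, replacement) per changed span."""
    ops = SequenceMatcher(None, src, dst, autojunk=False).get_opcodes()
    return [(i1, i2, dst[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]


def apply_delta(src: str, edits):
    """Apply edits right to left so earlier offsets remain valid."""
    out = src
    for i1, i2, repl in reversed(edits):
        out = out[:i1] + repl + out[i2:]
    return out


def encode_group(seqs, root=0):
    """Return the root sequence verbatim plus one edit list per tree edge."""
    parent = build_tree(seqs, root)
    edges = {c: (p, delta(seqs[p], seqs[c])) for c, p in parent.items()}
    return seqs[root], edges


def decode_group(root_seq, edges, root=0):
    """Rebuild every sequence by walking the tree outwards from the root."""
    seqs, pending = {root: root_seq}, dict(edges)
    while pending:
        for c, (p, edits) in list(pending.items()):
            if p in seqs:
                seqs[c] = apply_delta(seqs[p], edits)
                del pending[c]
    return [seqs[i] for i in range(len(seqs))]


group = ["ACGTACGTAC", "ACGTACGAAC", "ACGTTACGAAC", "ACGACGTAC"]
root_seq, edges = encode_group(group)
assert decode_group(root_seq, edges) == group  # lossless round trip
```

Only the root is stored in full; every other sequence is stored as a short edit list relative to its parent, which is where the saving comes from when the group members are close in edit distance.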
Figure 2. Example k-tuple index structure for k = 4.
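Since the figure itself is not reproduced here, the following Python sketch shows one plausible shape for a k-tuple index with k = 4. It assumes that every overlapping k-tuple is indexed and that each entry records a (sequence number, offset) pair; the layout actually used by coil may differ, and the function names are hypothetical.

```python
from collections import defaultdict


def build_ktuple_index(seqs, k=4):
    """Map each k-tuple to the (sequence number, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_no, seq in enumerate(seqs):
        for offset in range(len(seq) - k + 1):
            index[seq[offset:offset + k]].append((seq_no, offset))
    return index


def shared_ktuples(index, a, b):
    """Count distinct k-tuples that occur in both sequence a and sequence b."""
    return sum(1 for hits in index.values()
               if any(s == a for s, _ in hits) and any(s == b for s, _ in hits))


seqs = ["ACGTACGT", "TTACGTAA", "GGGGCCCC"]
index = build_ktuple_index(seqs, k=4)
print(index["ACGT"])                 # occurrences of the 4-tuple ACGT
print(shared_ktuples(index, 0, 1))   # sequences 0 and 1 share several 4-tuples
print(shared_ktuples(index, 0, 2))   # sequences 0 and 2 share none
```

An index of this kind lets candidate pairs of similar sequences be found by looking up shared k-tuples rather than comparing every pair of sequences directly.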
Execution time for find_edges variations on a small dataset
| Variation | Parameters | Execution time |
| SSAHA | maxGap = 0, maxInsert = 0 | 118.58 |
| | maxGap = 20, maxInsert = 20 | 118.72 |
| Basic | | 97.98 |
| Batch merging | | 113.03 |
| | | 84.35 |
| | | 71.09 |
| Recursive merging | | 80.55 |
| Hashtable | | 56.02 |
| | | 56.24 |
| | | 57.92 |
The dataset used, month.est_mouse, is a monthly update of the GenBank Mouse EST dataset comprising 31,401 sequences with an average length of 438 nucleotides.
Figure 3. Using coil to compress a FASTA database. As few two-file k-tuple index segments are produced as memory allows.
Compressed file sizes
| ems1 | 23292780 | 5876747 | 5871445 | 5331953 | 4990193 | |
| 23199910 | 5853780 | 5852865 | 5311350 | 4981279 | ||
| 23201245 | 5852837 | 5852772 | 5312411 | 4988747 | ||
| ems2 | 46519702 | 11576074 | 11574420 | 10475531 | 9432789 | |
| 46428669 | 11557030 | 11556573 | 10454826 | 9410376 | ||
| 46390115 | 11547516 | 11549117 | 10445740 | 9426594 | ||
| ems3 | 69631679 | 17211495 | 17205793 | 15537092 | 13607729 | |
| 69647486 | 17212318 | 17208461 | 15543489 | 13592739 | ||
| 69715954 | 17231912 | 17225610 | 15558845 | 13623246 | ||
| ems4 | 92905691 | 22841127 | 22810035 | 20601712 | 17625302 | |
| 93012024 | 22868732 | 22849091 | 20629724 | 17655369 | ||
| 92850447 | 22813494 | 22799324 | 20585812 | 17584008 | ||
| ems5 | 116125238 | 28428297 | 28415051 | 25636473 | 21509065 | |
| 116249077 | 28451622 | 28426520 | 25663621 | 21547174 | ||
| 116117128 | 28413464 | 28397742 | 25630456 | 21496207 | ||
| ems10 | 232365230 | 56136032 | 56054164 | 50662993 | 39774087 | |
| 232226017 | 56101887 | 56085818 | 50643774 | 39711435 | ||
| 232230440 | 56099503 | 56030860 | 50622855 | 39685294 | ||
| ems15 | 348404276 | 83539894 | 83461996 | 75411889 | 56758484 | |
| 348435883 | 83529794 | 83463158 | 75435650 | 56771053 | ||
| 348292392 | 83453434 | 83396104 | 75374710 | 56768193 | ||
| ems20 | 464825178 | 110838776 | 110755872 | 72989089 | 100113255 | |
| 464778933 | 110777795 | 110650470 | 100083039 | 73004561 | ||
| 464532828 | 110766213 | 110653180 | 100046482 | 72978434 | ||
| ems25 | 581105516 | 137940393 | 137814551 | 89636246 | 124600275 | |
| 580758935 | 137898843 | 137748733 | 89647136 | 124521398 | ||
| 580693026 | 137884675 | 137756070 | 89594767 | 124526386 | ||
| ems50 | 1161787240 | 272394718 | 271857439 | 169833915 | 244302747 | |
| 1161908810 | 272481687 | 271896055 | 169824808 | 244355206 | ||
| 1161582289 | 272310746 | 271844248 | 169812165 | 244255108 | ||
| ems75 | 1742471477 | 405262890 | 404293340 | 247835911 | 362403056 | |
| 1742664959 | 405243466 | 404268271 | 247921410 | 362419128 | ||
| 1742458336 | 405281768 | 404397179 | 247684455 | 362394809 | ||
| ems100 | 2323234744 | 533757352 | 324292321 | 478735224 | ||
| 2323234744 | 533757352 | 324292321 | 478735224 | |||
| 2323234744 | 533757352 | 324292321 | 478735224 | |||
| ems100* | 2323234744 | 308212275 | ||||
| rfam_full | 140518668 | 4413613 | 4113889 | 9504648 | ||
| 140518668 | 4413613 | 4113889 | 9504648 | |||
| 140518668 | 4413613 | 4113889 | 9504648 |
All sizes are in bytes. The FASTA column shows the size of the original uncompressed FASTA file. The smallest file in each row is shown in bold. *This row shows the result of using a version of find_edges optimised for the Pentium 4. nrdb+bz2 failed to compress the ems100 dataset because the size of the FASTA file exceeded 2 Gb. All coil runs performed on the rfam_full dataset used the -x option to enable in-order recovery of sequences. nrdb+bz2 was not used with the rfam_full dataset because it is incapable of restoring this order.
Figure 4. Compression ratio vs. DB size. The compression ratios of all tested algorithms increase as the input size increases; those of coil and 7-Zip increase faster than the rest.
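The trend in the caption can be checked against the "Compressed file sizes" table with a few lines of Python. The sketch below assumes that compression ratio means original size divided by compressed size (the definition is not given in this excerpt) and uses the smallest compressed file in the first row of three datasets; which program produced each of those files is not identified here.

```python
# (dataset, FASTA size in bytes, smallest compressed size in bytes), copied
# from the first row of each dataset in the "Compressed file sizes" table.
rows = [
    ("ems1", 23292780, 4990193),
    ("ems10", 232365230, 39774087),
    ("ems100", 2323234744, 324292321),
]


def compression_ratio(original_bytes, compressed_bytes):
    """Assumed definition: original size divided by compressed size."""
    return original_bytes / compressed_bytes


ratios = {name: compression_ratio(orig, comp) for name, orig, comp in rows}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

Under that definition, the best ratio rises from roughly 4.7 on ems1 to roughly 7.2 on ems100, consistent with the caption.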
Compression execution time
| find_edges | encode | tar+bz | other | ||||||
| ems1 | 5.6 | 7.7 | 43.3 | 5.6 | 3.5 | 9.3 | 1.2 | 10.8 | 24.8 |
| 5.5 | 9.2 | 48.2 | 4.7 | 3.6 | 8.1 | 1.0 | 10.4 | 23.1 | |
| 5.7 | 10.2 | 43.5 | 4.3 | 3.6 | 8.2 | 1.2 | 12.5 | 25.5 | |
| ems2 | 10.3 | 15.1 | 96.5 | 9.9 | 9.5 | 20.0 | 1.0 | 19.1 | 49.5 |
| 10.3 | 17.7 | 101.4 | 8.5 | 9.7 | 20.2 | 1.0 | 19.0 | 49.8 | |
| 10.1 | 16.8 | 95.6 | 8.3 | 9.7 | 20.3 | 1.1 | 20.8 | 51.9 | |
| ems3 | 15.2 | 22.6 | 154.8 | 14.7 | 17.0 | 36.5 | 1.9 | 27.7 | 83.1 |
| 16.7 | 24.6 | 162.9 | 12.7 | 17.3 | 35.4 | 2.2 | 28.0 | 82.9 | |
| 15.4 | 22.4 | 154.2 | 12.8 | 17.3 | 34.5 | 3.1 | 29.2 | 84.2 | |
| ems4 | 20.3 | 32.0 | 216.5 | 20.0 | 25.8 | 50.6 | 3.6 | 34.3 | 114.3 |
| 20.5 | 33.4 | 221.1 | 17.4 | 26.7 | 50.3 | 3.0 | 37.1 | 117.1 | |
| 20.0 | 31.1 | 215.0 | 17.0 | 26.0 | 49.1 | 3.3 | 39.3 | 117.7 | |
| ems5 | 30.3 | 39.1 | 276.7 | 25.3 | 35.5 | 65.4 | 4.2 | 42.7 | 147.7 |
| 25.5 | 43.5 | 280.4 | 21.5 | 35.7 | 65.5 | 4.1 | 45.2 | 150.5 | |
| 25.3 | 38.7 | 275.9 | 21.4 | 35.8 | 64.6 | 4.1 | 46.2 | 150.7 | |
| ems10 | 62.5 | 85.1 | 573.8 | 49.8 | 100.5 | 179.3 | 8.4 | 100.4 | 388.6 |
| 60.7 | 88.0 | 580.8 | 45.5 | 102.1 | 176.0 | 9.0 | 84.4 | 371.4 | |
| 50.6 | 80.1 | 575.4 | 43.5 | 100.5 | 160.8 | 9.4 | 87.7 | 358.4 | |
| ems15 | 94.3 | 117.8 | 871.1 | 69.0 | 197.9 | 271.8 | 12.6 | 118.2 | 600.6 |
| 76.7 | 136.5 | 876.5 | 64.7 | 198.5 | 276.9 | 13.7 | 130.5 | 619.5 | |
| 89.8 | 119.6 | 869.5 | 64.5 | 196.1 | 275.8 | 13.8 | 133.4 | 619.1 | |
| ems20 | 101.5 | 169.9 | 1163.0 | 92.8 | 317.1 | 393.5 | 16.7 | 176.0 | 903.5 |
| 101.7 | 179.7 | 1169.3 | 86.9 | 321.6 | 393.5 | 18.5 | 212.2 | 945.8 | |
| 120.5 | 158.8 | 1161.3 | 84.9 | 319.6 | 399.7 | 16.9 | 215.6 | 951.8 | |
| ems25 | 133.0 | 207.7 | 1482.2 | 116.0 | 471.7 | 503.2 | 22.5 | 280.8 | 1278.1 |
| 152.3 | 220.7 | 1438.9 | 105.9 | 470.9 | 467.2 | 22.3 | 218.4 | 1178.8 | |
| 171.0 | 196.3 | 1456.4 | 106.0 | 468.0 | 504.4 | 23.4 | 248.0 | 1243.8 | |
| ems50 | 306.2 | 411.2 | 2882.0 | 215.4 | 1657.4 | 1172.3 | 105.4 | 716.3 | 3651.4 |
| 340.0 | 452.4 | 2893.3 | 209.2 | 1658.2 | 1170.7 | 104.5 | 583.7 | 3517.1 | |
| 291.1 | 411.6 | 2888.3 | 207.2 | 1655.5 | 1174.7 | 107.9 | 671.8 | 3609.9 | |
| ems75 | 500.7 | 712.4 | 4328.8 | 314.4 | 3517.1 | 1814.9 | 167.8 | 1173.4 | 6673.2 |
| 506.8 | 618.1 | 4304.5 | 311.9 | 3502.1 | 1810.2 | 164.7 | 992.9 | 6469.8 | |
| 508.7 | 593.5 | 4298.9 | 317.7 | 3490.7 | 1798.6 | 165.0 | 1116.4 | 6570.6 | |
| ems100 | 668.6 | 5760.8 | 408.5 | 6064.4 | 2552.4 | 223.9 | 1421.8 | 10262.6 | |
| 634.1 | 5707.5 | 404.1 | 6042.3 | 2524.8 | 219.3 | 1429.0 | 10215.3 | ||
| 689.2 | 5773.6 | 403.3 | 6114.1 | 2496.4 | 217.5 | 1546.2 | 10374.1 | ||
| ems100* | 6446.3 | 2515.4 | 218.6 | 1505.8 | 10686.1 | ||||
| rfam_full | 32.8 | 75.8 | 7.9 | 114.8 | 12.8 | 4.0 | 40.7 | 172.3 | |
| 29.6 | 75.3 | 7.9 | 113.9 | 12.3 | 4.3 | 38.2 | 168.7 | ||
| 29.6 | 75.5 | 7.8 | 114.5 | 12.4 | 4.2 | 36.0 | 167.1 | ||
All durations are in seconds. The rightmost five columns break down the execution of coil by its main component programs; the "other" column includes the time needed for the programs extract_seqs, make_index and select_lines.
*This row shows the result of using the Pentium 4-optimised version of find_edges – surprisingly, this version of find_edges is actually about 6% slower than the original version on this CPU.
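The "about 6%" figure can be reproduced from the table. The sketch below assumes that the first of the five coil columns is the find_edges time, following the header order, and compares the ems100* value with the mean of the three ems100 runs.

```python
# find_edges times in seconds, read from the "Compression execution time"
# table. Treating the first of the five coil columns as find_edges is an
# assumption based on the order of the column headers.
original_runs = [6064.4, 6042.3, 6114.1]   # three ems100 runs
pentium4_run = 6446.3                      # ems100* run

mean_original = sum(original_runs) / len(original_runs)
slowdown = pentium4_run / mean_original - 1.0
print(f"{slowdown:.1%}")
```

With these values the optimised build comes out a little over 6% slower than the mean of the original runs.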
Decompression execution time
| ems1 | 1.8 | 3.0 | 1.5 | 4.6 | 3.6 |
| 1.8 | 3.1 | 1.5 | 3.9 | 4.5 | |
| 1.8 | 3.1 | 1.5 | 4.3 | 4.5 | |
| ems2 | 3.5 | 6.2 | 3.2 | 8.7 | 6.5 |
| 3.5 | 6.1 | 3.2 | 7.9 | 6.6 | |
| 3.4 | 6.0 | 3.1 | 8.6 | 7.1 | |
| ems3 | 5.2 | 9.0 | 4.8 | 13.2 | 10.4 |
| 5.2 | 9.3 | 4.9 | 11.7 | 11.0 | |
| 5.2 | 9.0 | 4.9 | 13.2 | 10.3 | |
| ems4 | 6.9 | 12.0 | 6.5 | 17.5 | 14.3 |
| 7.0 | 12.3 | 6.4 | 15.6 | 13.6 | |
| 6.9 | 12.0 | 6.5 | 17.6 | 13.7 | |
| ems5 | 8.7 | 15.2 | 7.8 | 22.1 | 16.8 |
| 8.6 | 15.3 | 7.3 | 19.7 | 17.7 | |
| 8.6 | 14.9 | 8.1 | 22.0 | 18.3 | |
| ems10 | 17.0 | 29.8 | 15.1 | 44.9 | 36.3 |
| 17.0 | 30.6 | 14.8 | 40.1 | 36.8 | |
| 17.1 | 29.9 | 17.6 | 44.5 | 36.9 | |
| ems15 | 25.2 | 44.6 | 22.1 | 65.9 | 52.3 |
| 25.5 | 46.0 | 22.0 | 59.1 | 53.7 | |
| 25.5 | 44.4 | 24.4 | 65.5 | 52.8 | |
| ems20 | 33.8 | 59.5 | 30.3 | 87.1 | 68.2 |
| 34.0 | 60.7 | 29.8 | 78.1 | 70.2 | |
| 34.1 | 59.4 | 32.0 | 86.2 | 69.1 | |
| ems25 | 41.9 | 74.4 | 36.7 | 107.6 | 91.1 |
| 42.1 | 76.3 | 38.9 | 100.9 | 85.9 | |
| 42.4 | 74.0 | 36.2 | 106.7 | 86.8 | |
| ems50 | 126.6 | 147.7 | 71.7 | 210.4 | 286.1 |
| 128.6 | 152.0 | 71.9 | 202.8 | 274.9 | |
| 129.0 | 148.4 | 74.5 | 209.7 | 280.9 | |
| ems75 | 187.7 | 223.2 | 110.3 | 312.0 | 511.3 |
| 190.5 | 228.3 | 111.3 | 306.7 | 471.3 | |
| 191.2 | 221.5 | 116.3 | 352.4 | 464.9 | |
| ems100 | 247.1 | 142.1 | 324.2 | 646.1 | |
| 248.6 | 137.6 | 404.9 | 674.2 | ||
| 252.4 | 143.5 | 531.9 | 700.8 | ||
| ems100* | 649.9 | ||||
| rfam_full | 9.6 | 7.9 | 9.1 | 60.7 | |
| 6.3 | 6.4 | 9.1 | 59.9 | ||
| 6.1 | 6.2 | 9.1 | 60.2 |
All durations are in seconds. *This row shows the result of using the Pentium 4-optimised version of find_edges.