Shengwang Du, Junyi Li, Naizheng Bian.
Abstract
The development of high-throughput sequencing technology has generated huge amounts of DNA data. Many general-purpose compression algorithms, such as LZ77, are not ideal for compressing DNA data. Building on Nour and Sharawi's method, we propose a new lossless, reference-free method that improves compression performance. The original sequences are converted into eight intermediate files and six final files, and the LZ77 algorithm is then used to compress the six final files. The results show that, on average, compression time decreases by 83% and decompression time decreases by 54%. The compression ratio is almost the same as that of Nour and Sharawi's method, which is the fastest method so far. Moreover, our method has a wider range of application than Nour and Sharawi's method. Compared with advanced compression tools such as XM and FCM-Mx, the compression time of our method is much shorter, decreasing by more than 90% on average.
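The abstract does not spell out how the intermediate and final files are built, so the following is only a minimal sketch of the general idea of transforming a sequence before handing it to an LZ77-based compressor. It assumes a plain 2-bit packing of the four bases (which alone removes 75% of a one-byte-per-base file) followed by Python's `zlib`, whose DEFLATE format combines LZ77 with Huffman coding. The names `pack`, `unpack`, `compress` and `decompress` are hypothetical, and this is not the authors' pipeline.

```python
import zlib

# Assumption: only uppercase A, C, G, T occur; any other symbol raises KeyError.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack four bases into each byte, two bits per base."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a final chunk that has fewer than four bases.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, n_bases: int) -> str:
    """Reverse pack(); n_bases trims the padding in the last byte."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:n_bases])


def compress(seq: str) -> tuple[int, bytes]:
    """Return the base count and the DEFLATE-compressed packed bytes."""
    return len(seq), zlib.compress(pack(seq), 9)


def decompress(n_bases: int, blob: bytes) -> str:
    """Losslessly recover the original sequence."""
    return unpack(zlib.decompress(blob), n_bases)
```

The base count has to be stored next to the compressed bytes because the last packed byte may contain padding.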
Year: 2020 PMID: 33237908 PMCID: PMC7688149 DOI: 10.1371/journal.pone.0238220
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Files tree.
The datasets.
| Accession Number | Number of Bases |
|---|---|
| NC_017526 | 2682626 |
| NC_002942 | 3397754 |
| NZ_CP015934 | 3453407 |
| NZ_CP015935 | 3409361 |
| NZ_CP015938 | 3359444 |
| NC_013929 | 10148695 |
| NC_014318 | 1036715 |
| NC_013595 | 10341314 |
| NC_013131 | 10467782 |
| NC_010162 | 13033779 |
Comparison of results between our method and NSM.
| SM | NSM CR | NSM CT | NSM DCT | OM CR | OM CT | OM DCT |
|---|---|---|---|---|---|---|
| NC_017526 | 75.35 | 21.227 | 10.109 | 75.00 | 6.004 | 5.311 |
| NC_002942 | 75.41 | 29.772 | 12.612 | 75.02 | 5.351 | 4.947 |
| NZ_CP015934 | 75.41 | 28.130 | 12.543 | 75.05 | 5.985 | 5.073 |
| NZ_CP015935 | 75.40 | 28.507 | 13.264 | 75.02 | 5.529 | 5.733 |
| NZ_CP015938 | 75.42 | 23.882 | 11.726 | 75.07 | 5.133 | 5.060 |
| NC_013929 | 76.43 | 67.131 | 44.774 | 75.17 | 9.018 | 17.929 |
| NC_014318 | 76.42 | 63.687 | 33.048 | 75.15 | 9.870 | 15.742 |
| NC_013595 | 76.35 | 63.695 | 34.395 | 75.17 | 11.217 | 14.430 |
| NC_013131 | 76.22 | 64.265 | 34.553 | 75.06 | 8.987 | 18.589 |
| NC_010162 | 76.28 | 79.472 | 50.957 | 75.06 | 12.564 | 25.590 |
| Average | 75.87 | 46.977 | 25.798 | 75.08 | 7.966 | 11.840 |
SM: Sequence name.
NSM: Nour and Sharawi's method.
OM: Our method.
CR: compression ratio (%).
CT: compression time (s).
DCT: decompression time (s).
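The reported time savings can be checked against the Average row of the table above. The snippet below is a small sketch that assumes the percentage decrease is computed as (baseline - new) / baseline; the helper name `percent_reduction` is hypothetical.

```python
def percent_reduction(baseline: float, new: float) -> float:
    """Percentage by which `new` is smaller than `baseline`."""
    return (baseline - new) / baseline * 100.0


# Average times in seconds, copied from the Average row of the table.
nsm_avg = {"CT": 46.977, "DCT": 25.798}
om_avg = {"CT": 7.966, "DCT": 11.840}

reductions = {key: percent_reduction(nsm_avg[key], om_avg[key]) for key in nsm_avg}

for key, value in reductions.items():
    print(f"{key}: {value:.1f}% shorter with OM than with NSM")
```

Under that assumption the averages give roughly 83% for compression time and 54% for decompression time, in line with the figures quoted in the abstract.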
Compression benchmarks for state-of-the-art pure genomic compression tools.
| SM | XM CR | XM CT | FCM-Mx CR | FCM-Mx CT | OM CR | OM CT |
|---|---|---|---|---|---|---|
| NC_017526 | 77.45 | 41.227 | 76.12 | 35.102 | 75.00 | 6.004 |
| NC_002942 | 77.41 | 49.772 | 76.15 | 41.722 | 75.02 | 5.351 |
| NZ_CP015934 | 77.41 | 48.130 | 76.39 | 39.603 | 75.05 | 5.985 |
| NZ_CP015935 | 77.40 | 48.507 | 76.44 | 39.541 | 75.02 | 5.529 |
| NZ_CP015938 | 77.42 | 43.882 | 76.21 | 35.796 | 75.07 | 5.133 |
| NC_013929 | 77.43 | 107.131 | 76.33 | 99.305 | 75.17 | 9.018 |
| NC_014318 | 77.42 | 103.687 | 76.67 | 95.850 | 75.15 | 9.870 |
| NC_013595 | 77.35 | 103.695 | 76.15 | 95.208 | 75.17 | 11.217 |
| NC_013131 | 77.22 | 114.265 | 76.13 | 103.209 | 75.06 | 8.987 |
| NC_010162 | 77.28 | 139.472 | 76.17 | 124.167 | 75.06 | 12.564 |
| Average | 77.34 | 79.977 | 76.28 | 70.950 | 75.08 | 7.966 |
SM: Sequence name.
OM: Our method.
CR: compression ratio (%).
CT: compression time (s).
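To see how the compression-time saving varies from sequence to sequence, the per-row values of the table above can be compared directly. This is a sketch under the assumption that the saving is defined as (baseline - new) / baseline; `percent_reduction` and `per_sequence` are hypothetical names.

```python
def percent_reduction(baseline: float, new: float) -> float:
    """Percentage by which `new` is smaller than `baseline`."""
    return (baseline - new) / baseline * 100.0


# (sequence, XM CT, FCM-Mx CT, OM CT) in seconds, copied from the table.
rows = [
    ("NC_017526", 41.227, 35.102, 6.004),
    ("NC_002942", 49.772, 41.722, 5.351),
    ("NZ_CP015934", 48.130, 39.603, 5.985),
    ("NZ_CP015935", 48.507, 39.541, 5.529),
    ("NZ_CP015938", 43.882, 35.796, 5.133),
    ("NC_013929", 107.131, 99.305, 9.018),
    ("NC_014318", 103.687, 95.850, 9.870),
    ("NC_013595", 103.695, 95.208, 11.217),
    ("NC_013131", 114.265, 103.209, 8.987),
    ("NC_010162", 139.472, 124.167, 12.564),
]

# Map each sequence to its saving relative to XM and to FCM-Mx.
per_sequence = {
    name: (percent_reduction(xm, om), percent_reduction(fcm, om))
    for name, xm, fcm, om in rows
}

for name, (vs_xm, vs_fcm) in per_sequence.items():
    print(f"{name}: {vs_xm:.1f}% vs XM, {vs_fcm:.1f}% vs FCM-Mx")
```

Applying the same formula to the Average row gives about 90.0% relative to XM and about 88.8% relative to FCM-Mx.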