You Tang, Min Li, Jing Sun, Tao Zhang, Jicheng Zhang, Ping Zheng.
Abstract
BACKGROUND: The massive quantities of genetic data generated by high-throughput sequencing pose challenges to data storage, transmission and analyses. These problems are effectively solved through data compression, in which the size of data storage is reduced and the speed of data transmission is improved. Several options are available for compressing and storing genetic data. However, most of these options either do not provide sufficient compression rates or require a considerable length of time for decompression and loading.
Year: 2018 PMID: 30395579 PMCID: PMC6218042 DOI: 10.1371/journal.pone.0206521
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Application of TRCMGene to the toy example.
Fig 2. Index structure of the toy example.
Fig 3. Compressed file structure.
The general structure is shown on the left. The detailed description of every compressed sequence is shown on the right. Refer to the text for details.
Details of the data files used.
| Species | Original File Size | Number of Individuals | Number of SNPs | Number of Marks |
|---|---|---|---|---|
| Maize | 321 MB | 115 | 73157 | 8413055 |
| Maize | 3.42 GB | 201 | 459446 | 92348646 |
| Maize | 44.3 GB | 702 | 1692698 | 1188273996 |
| Maize | 100.6 GB | 1398 | 1928450 | 2695973100 |
| Arabidopsis | 4.35 GB | 219 | 759270 | 166280130 |
| Mice | 611 MB | 59 | 144782 | 8542138 |
Performance of TRCMGene compared with that of two other compression methods.
| Species | Original File Size | Item | TRCMGene | PLINK | GZIP |
|---|---|---|---|---|---|
| Maize | 3.42 GB | File size after compression (MB) | 201.5 | 219.6 | 419.7 |
| Maize | 3.42 GB | Compression factor | 16.97 | 15.57 | 8.15 |
| Maize | 3.42 GB | Compression time (s) | 428 | 549 | 407 |
| Arabidopsis | 4.35 GB | File size after compression (MB) | 212.2 | 286.1 | 498.3 |
| Arabidopsis | 4.35 GB | Compression factor | 20.50 | 15.20 | 8.73 |
| Arabidopsis | 4.35 GB | Compression time (s) | 539 | 719 | 489 |
| Mice | 611 MB | File size after compression (MB) | 34.6 | 40.3 | 83.9 |
| Mice | 611 MB | Compression factor | 17.68 | 15.16 | 7.28 |
| Mice | 611 MB | Compression time (s) | 89 | 73 | 71 |
Fig 4. Compression capabilities of TRCMGene compared with the ORCM method.
The compression factor is calculated as the original (uncompressed) file size divided by the compressed file size (greater is better).
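The compression factor defined in this caption can be checked against the performance table with a few lines of Python. The sketch below is illustrative only: the function name `compression_factor` is hypothetical, and the conversion 1 GB = 1000 MB is an assumption (it is the convention that reproduces the factors reported for the 3.42 GB maize file).

```python
def compression_factor(original_mb, compressed_mb):
    """Original file size divided by compressed file size (greater is better)."""
    if compressed_mb <= 0:
        raise ValueError("compressed size must be positive")
    return original_mb / compressed_mb


# Maize data set from the performance table.
# Assumption: 1 GB = 1000 MB, which matches the reported factors.
original_mb = 3.42 * 1000

# Compressed sizes in MB, copied from the table.
compressed_mb = {"TRCMGene": 201.5, "PLINK": 219.6, "GZIP": 419.7}

factors = {
    method: round(compression_factor(original_mb, size), 2)
    for method, size in compressed_mb.items()
}

for method, factor in factors.items():
    print(f"{method}: {factor}")
```

Run on the maize row, this gives 16.97, 15.57 and 8.15, the values listed in the table.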
Time needed to compress by TRCMGene (compression time in seconds, by number of clusters).
| Dataset size | 3 clusters | 5 clusters | 10 clusters | 15 clusters | 20 clusters |
|---|---|---|---|---|---|
| 321 MB | 40 | 45 | 56 | 63 | 83 |
| 3.42 GB | 428 | 449 | 462 | 488 | 561 |
| 44.3 GB | 4656 | 4678 | 4689 | 4756 | 4810 |
| 100.6 GB | 13968 | 14789 | 15239 | 15887 | 16342 |
Fig 5. Factors that influence the performance of TRCMGene.
Fig 6. Reading capabilities of TRCMGene.
The reading time ratio is defined as the time needed to read the uncompressed file divided by the time needed to read the compressed file (greater is better).
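One way such a ratio might be measured is sketched below in Python. This is not the authors' benchmark: GZIP stands in for the compressed format because TRCMGene's reader is not described here, the payload is a made-up placeholder, and the helper names `timed_read` and `reading_time_ratio` are hypothetical.

```python
import gzip
import os
import tempfile
import time


def timed_read(opener, path):
    """Return the seconds taken to read the whole file in 1 MiB chunks."""
    start = time.perf_counter()
    with opener(path, "rb") as handle:
        while handle.read(1 << 20):
            pass
    return time.perf_counter() - start


def reading_time_ratio(uncompressed_seconds, compressed_seconds):
    """Uncompressed read time divided by compressed read time (greater is better)."""
    if compressed_seconds <= 0:
        raise ValueError("compressed read time must be positive")
    return uncompressed_seconds / compressed_seconds


# Placeholder payload (about 1 MB), not real genotype data.
payload = b"ACGT" * 250_000

with tempfile.TemporaryDirectory() as workdir:
    plain_path = os.path.join(workdir, "toy.txt")
    gz_path = plain_path + ".gz"
    with open(plain_path, "wb") as handle:
        handle.write(payload)
    # GZIP is only a stand-in for the compressed format.
    with gzip.open(gz_path, "wb") as handle:
        handle.write(payload)

    ratio = reading_time_ratio(
        timed_read(open, plain_path),
        timed_read(gzip.open, gz_path),
    )

print(f"reading time ratio: {ratio:.2f}")
```

A ratio above 1 would mean the compressed file is read faster than the uncompressed one. With a general-purpose compressor and a small file the ratio may well fall below 1, and timings vary between runs and machines.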