Literature DB >> 31725736

LFastqC: A lossless non-reference-based FASTQ compressor.

Sultan Al Yami1,2, Chun-Hsi Huang1.   

Abstract

The cost-effectiveness of next-generation sequencing (NGS) has led to the advancement of genomic research, thereby regularly generating a large amount of raw data that often requires efficient infrastructures such as data centers to manage the storage and transmission of such data. The generated NGS data are highly redundant and need to be efficiently compressed to reduce the cost of storage space and transmission bandwidth. We present a lossless, non-reference-based FASTQ compression algorithm, known as LFastqC, an improvement over the LFQC tool, to address these issues. LFastqC is compared with several state-of-the-art compressors, and the results indicate that LFastqC achieves better compression ratios for important datasets such as the LS454, PacBio, and MinION. Moreover, LFastqC has a better compression and decompression speed than LFQC, which was previously the top-performing compression algorithm for the LS454 dataset. LFastqC is freely available at https://github.uconn.edu/sya12005/LFastqC.

Entities:  

Mesh:

Year:  2019        PMID: 31725736      PMCID: PMC6855649          DOI: 10.1371/journal.pone.0224806

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

Next-generation sequencing (NGS) technologies have accelerated genomic research, thereby producing significant amount of data at a fast pace and low cost. However, the storage technology is evolving at a much slower pace compared with the NGS technologies, thereby posing challenges for data storage. Data centers are often used as a solution, while incurring considerable costs for storage space and transmission bandwidth. The time required for data transmission over network can be reduced by compressing the highly redundant genomics data. Most data centers use generic compressors, such as the gzip and bzip2. However, these are not ideal solutions for compressing NGS data since both were designed for general-purposes compression and have been shown to perform inadequately when compressing genomic data. Therefore, it is important for researchers to have an appropriate tool that is specifically developed for NGS data compression. The NGS-generated data are stored in a FASTQ format; FASTQ files comprise millions-to-billions of records, each of which containing the following four lines: Line 1 stores a sequence identifier and begins with @. Line 2 represents the read sequence. Line 3 begins with the + character, optionally followed by the same sequence identifier as Line 1. Line 4 represents the quality score value for the sequence in Line 2 and has the same length as Line 2. As indicated above, the NGS produces DNA sequences and a corresponding quality score value for each base. This value is the probability that the corresponding base call is incorrect. These error values are then converted into integers. The Ewing and Green algorithm [1] transformed these error values into PHRED quality scores (QS) [2], which can be used to characterize the quality of DNA sequences. The algorithm uses the following equation to calculate QS, i.e. Q = −10 log10p. Then, the resulting values are shortened to fit within 0–93; this range represents the values for the base score in the ASCII form. Each integer is then incremented by 33 so that all values are in the range of printable ASCII characters. In recent years, DNA sequence compression has been actively investigated, and many tools have been specifically developed for this application. One example is the MFCompress [3], which uses the finite-context model for compression. The finite-context model is a probabilistic model that utilizes a probability distribution and estimates the probability of the next symbol in a sequence based on the k previous symbols. MFCompress encodes the DNA sequences using multiple competing finite-context models [4], as well as arithmetic encoding. Another algorithm is the DNA-COMPACT, which is based on a pattern-aware contextual modeling technique [5]. This algorithm exploits complementary contextual models and works in two phases. In the first phase, the algorithm searches for the exact repeats and palindromes and then represents them by a compact quadruplet. In the second phase, the algorithm introduces the non-sequential contextual models to utilize the DNA sequence characteristics, and then a logistic regression model is used to synthesize the predictions of these models. Another algorithm is the SeqCompress [6], which is based on a statistical model and arithmetic coding and can efficiently compress both repetitive and non-repetitive sequences.

Related work

A similar problem to DNA sequence compression is the compression of FASTQ data, which consists of a DNA sequence of varying-length, a quality score, and an identifier. Compressing FASTQ requires new techniques that compress each stream of FASTQ independently and efficiently. Recently, several domain-specific data compressors have been developed for NGS data compression. DSRC [7,8], for example, is a fast FASTQ data compressor designed for industry. Quip [9] is another tool that uses a de novo assembly algorithm and was the first assembly-based compressor. Another tool is the fqzcomp [10], which was the winner of the Sequence Squeeze competition organized by the Pistoia Alliance. Fastqz is another compressor which uses context modeling to reduce the data volume to maximize the compression gain [10]. LFQC [11], which was developed recently and yielded the best compression ratio on the LS454 and SOLEXA datasets. Another recently developed tool is the FaStore, which is optimized to handle files generated by sequencing machines that generate a large number of short reads at a low sequencing error rate [12]. SPRING is another FASTQ compressor which was developed recently and provides high compression and decompression speed [13]. However, some of the previous algorithms have failed to compress some datasets due to the lack of support for reading variable-length read or space encoding, such as the color space encoding used in SOLiD. This study presents the LFastqC, a lossless non-reference-based compression algorithm, which is an extension of the LFQC but performs better when using only the quality score as a length reference for the sequence stream. The algorithm is discussed and evaluated in the following sections. The compression result is compared with results from other methods that also adopt a lossless non-reference-based compression technique.

Materials and methods

FASTQ data compression

NGS files are stored in the FASTQ format, which typically consist of millions-to-billions of records with four lines each. Line 1 represents the record identifier, Line 2 stores the record nucleotide, Line 3 represents a constant “+” optionally followed by record identifier, and Line 4 represents the corresponding quality scores for the nucleotide sequences from Line 2. Each line comprises four different alphabet sets and has its own structure; therefore, several existing FASTQ compressors compress each line separately. The proposed algorithm follows the literature by splitting the FASTQ file into three data streams; each stream is then preprocessed independently for compression. A regular data compressor and a special-purpose Fasta file compressor, i.e. lpaq8 (http://mattmahoney.net/dc/#lpaq) and MFCompress (http://bioinformatics.ua.pt/software/mfcompress), are used at the compression stages.

lpaq8 compressor

lpaq8 is a part of the PAQ series, which are lossless data-compression archivers that adopt a context mixing algorithm. These algorithms divided the compressor into a predictor and an arithmetic coder, and work just like prediction by partial matching (PPM). However, they are different from PPM in two ways. First of all, context mixing algorithms predict the next symbol by computing a weighted combination of probability estimates from many models on different contexts. Secondly, context mixing algorithms use many models, unlike PPM which uses a single model. Context mixing algorithms yield the best performance on many benchmarks in terms of compression ratio. These benchmarks vary in size and alphabet, rendering lpaq8 the best option when compressing quality score and read identifiers.

MFCompress compressor

MFCompress relies on finite-context models, which is probabilistic and works by estimating the probability of the next symbol of the source based on the frequencies of the symbols that occurred before the current symbol. MFCompress uses multiple competing finite-context models to compress the DNA sequences and uses a single finite-context models to compress the file header. MFCompress compresses Fasta files efficiently in term of time and compression gain, making it suitable for compressing the DNA sequence stream in a FASTQ file. The following subsections explain how LFastqC pre-processes each stream before sending it to the corresponding compression tool.

Identifier compression

The main goal of the identifier field is to uniquely identify the read. The reads are identified using an integer value, but the identifiers have more information than what is needed to identify each read. For example, the identifier field contains the instrument’s name, run identifier, flow cell identifier, and tile coordinates. Most of this information is the same for every read. This redundancy increases the file size, but it can also be utilized to achieve better compression of the identifier stream. The identifier fields can be classified as one of three types: fields with data that do not change over the whole records, fields with the same data value over a set of consecutive records, and fields with integer values that are either incremented or decremented over consecutive records. LFastqC takes this information into consideration when preprocessing this stream. The algorithm first scans the identifier for one of these delimiters: a dot (.), underscore (_), space (), hyphen (–), colon (:), equal sign (=), or slash (/). It then splits the identifiers into fields based on these delimiters. This process leads to the creation of new files with one column (the field column) and N rows, where N is the number of records. The following example explains how the algorithm splits a record identifier using the following record identifier: The algorithm first returns the identifier’s delimiters, which are the following in this case: a dot (.), a space (), a space (), and then an equal sign (=). Thus, the identifier is split into five fields: @SRR001471 1 E96DJWM01D47CS length 79 At this point, some FASTQ compressors in the literature add a layer of compression by compressing each field using one of the following compression techniques: delta encoding, run length encoding, or reversing the fields (reading them from right to left) for further compression. We observed that this compression layer did not improve the compression ratio in general, but increased the running time altogether. Instead, for our algorithm feeds the identifier fields to the lpaq8 compressor at this point, which is a standard context mixing compression algorithm. We use lpaq8 with option “9” which yielded the best compression ratio.

Sequence compression

The nucleotide sequences are arranged in a small string of five alphabetic characters, namely A, C, G, T, and N. The N base contains unknown nucleotides and always has “!” as its corresponding quality score, which indicates the lowest probability and is equal to zero. Some FASTQ algorithms eliminate “N” in the record sequences or “!” in the record quality score because they can be easily reconstructed from one another. Our algorithm does not follow this approach as we simply use the quality score as a read-length reference. Some other datasets use color space encoding, which means that the read sequence has more than five characters. The color-space read sequence starts with any of A, C, G, or T, followed by numbers 0–3, which represent the relationship between the current base and the previous one. Our algorithm supports these datasets because it uses MFCompress, a FASTA and multi-FASTA special-purpose compressor that accepts FASTA files with more than five characters. To compress the record sequences, our algorithm first converts the stream into a single FASTA file by adding the header of the first sequence as the first line, then deleting all sequence reads’ new lines to get a long single sequence read. LFastqC then feeds the converted stream to MFCompress for compression. We use MFCompress with a parameter of -3 and obtain the best compression ratio.

Quality score compression

Each record has a quality score, which has the same length as the record sequence. We use it as a length reference for the sequence reconstruction since we converted the record sequence to a single FASTA file. According to the literature, there is a correlation between any quality score and the score at the preceding position. This correlation tends to decrease along the length of the sequence and behaves randomly for different FASTQ files. This behavior makes it difficult to predict the nature of the quality scores and hence achieve better compression. We found that the best way to compress this stream is by using a context mixing algorithm, which yielded the best compression ratio for a number of benchmarks. We feed the quality score stream as is to the lpaq8 compressor with option “9” to achieve the best compression ratio. When conducting experiments both MFCompress and lpaq8 are run in parallel to speed up the process.

FASTQ data decompression

LFastqC decompresses FASTQ data using the same tools used to compress them in the first place. First, LFastqC regenerates the identifier by decompressing the files of the identifier stream and then merges them to create a single file. Next, lpaq8 and MFCompress decompress both the quality score and DNA sequence in parallel. LFastqC then regenerates the compressed FASTQ data by combining all streams together and uses the quality score file as a reference for the length of each record.

Results and discussion

We compared our algorithm with two general-purpose compression tools, Gzip [14] and bzip2 [15], as well as other state-of-the-art FASTQ file-compression tools, namely SPRING, LFQC, DSRC2, fqzcomp, SeqSqueeze1 [16], and Quip. FaStore was excluded from this study because it did not work after trying on different platforms. Moreover, a recent study showed that FaStore was outperformed by SPRING on different datasets [13]. For each selected tool, we used their recommended parameters to obtain the best possible compression, as shown in Table 1.
Table 1

Compression tools adopted and their parameters.

AlgorithmParameters
SPRING-c -i -t 16 / -c -l -i -t 16
LFQC-
DSRC2c -m2
FqzcompSOLEXA: -n2 -s7þ -b -q3
LS454: -n1 -s7þ -b -q2
SOLiD: -S -n2 -s5þ -q1
SeqSqueeze1-h 4 1/5 -hs 5 -b 1:3 -b 1:7 -b 1:12 1/10 -bg 0.9 -s 1:2 1/5 -s 1:3 1/10 -ss 10 -sg 0.95
Quip-
Gzip-9
Bzip2-9
Since our tool is a lossless reference-free algorithm, we compared it only with other tools that compress FASTQ data in a lossless manner without using a reference genome. Lossy compression tools are excluded from comparison. Also, tools adopted for comparison are used without any extra information besides the FASTQ file. All the experiments were carried out on a machine running UBUNTU 16.04 64-bit powered by an Intel core i7 processor with 8 GB of RAM. In our comparison, we used publicly available datasets that can be downloaded from the 1,000 Genome Project. These datasets have the same data as those used in [7] as well as in [11]. In addition, three datasets used in [13] as well as four others, such as the PacBio and MinION, are tested on, as shown in Table 2.
Table 2

Datasets.

DatasetsTypeOrganismCoverageRead LengthSize (Mb)
SRR001471LS454Homo sapiens0.07x188216
SRR003177LS454Homo sapiens0.27x5641196
SRR003186LS454Homo sapiens0.21x581886
SRR007215SOLiDHomo sapiens0.07x25695
SRR010637SOLiDHomo sapiens0.14x352086
SRR013951SOLEXAHomo sapiens0.89x763190
SRR027520_1SOLEXAHomo sapiens1.19x764808
SRR027520_2SOLEXAHomo sapiens1.19x764808
SRR554369GAIIxP.aeruginosa50x100384
SRR327342GAIIS.cerevisiae175x632812
SRR1284073PacBioE.coli140x29421302
SRR9046049PacBioA. brasilense136x30782622
SRR8858470PacBioHomo sapiens0.67x139644288
ERR3307082MinIONC.freundii367x40023632
ERR637420MinIONE. coli118x6232264
While selecting datasets, we ensure to include data that were used by previous tools for the sake of a fair comparison, and to include new data for further comparisons. Our datasets incorporate data from different technologies, have different coverages, and different read lengths. The datasets used include three files from LS454, two from SOLiD, as well as five files are from Illumina (three in SOLEXA format, one in GAIIx, and one in GAII format). Our datasets also include three files from PacBio and two from MinION, both widely used. Experimental results revealed the four winners in terms of compression ratio, namely LFastqC, LFQC, SPRING and Fqzcomp, as shown in Table 3. The winning condition is for the compression tool to be able to compress as many different datatypes as possible. Individual winning tools, however, perform better on different datasets, e.g. LFastqC on LS454, PacBio, and MinION; LFQC on SOLiD dataset; SPRING on Illumina GAIIX and GAII; and Fqzcomp on the Illumina SOLEXA dataset. Although fqzcomp performed the best on Illumina SOLEXA, the results show instability in its performance since it cannot compress the SOLiD datasets due to their color space encoding. Additionally, fqzcomp was not able to compress both PacBio and MinION datasets because it was not able to recognize the file format. Moreover, Quip and SeqSqueeze1were not able to compress both PacBio and MinION datasets except for SRR1284073, which was compressed successfully by Quip. Also noticed was that Quip and SPRING were not able to compress the SOLiD datasets due to the lack of color space encoding support. Our comparison shows that SPRING performs the best when compressing files with short read and medium to high coverage. Fastqz does not work when the read lengths vary so we excluded it from the comparison. The comparisons also revealed the poor performance of the general-purpose compressors when it comes to compression ratio, despite their competitive performance at compression and decompression speed.
Table 3

Compression ratios for each tool.

DatasetCompression Ratio
LFastqCLFQCSPRINGDSRC2SeqSqueeze1QuipFQZCompGzipBzip2
SRR0014715.295.244.584.845.154.475.023.233.93
SRR0031775.155.114.464.604.904.454.773.163.81
SRR0031864.714.644.174.344.634.174.492.973.59
SRR0072156.607.26-6.767.07--4.185.20
SRR0106375.305.59-5.315.56--3.484.25
SRR0139513.463.483.293.393.463.483.572.402.80
SRR027520_14.284.364.144.334.444.484.552.873.41
SRR027520_24.254.274.044.244.354.384.452.803.33
SRR5543696.125.906.484.325.374.344.942.823.38
SRR3273425.905.846.454.745.645.246.083.073.65
SRR12840733.213.203.10--3.10-2.392.82
SRR90460493.093.092.98----2.742.36
SRR88584703.113.022.85----2.502.32
ERR33070822.752.702.60----2.022.32
ERR6374202.882.882.812.85---2.212.59

Table 3: Compression ratio is defined as the ratio of the original file size to the compressed file size. Best performance is indicated in bold.

Table 3: Compression ratio is defined as the ratio of the original file size to the compressed file size. Best performance is indicated in bold. In terms of compression speed, DSRC2 and SPRING show an outstanding performance. Both obtain the best compression speed on different datasets and outperform the other tools in all cases except for two times where they came behind Quip in SRR554369 and SRR1284073. Table 4 summarizes the results.
Table 4

Compression speed.

DatasetCompression Time
LFastqCLFQCSPRINGDSRC2SeqSqueeze1QuipFQZCompGzipBzip2
SRR0014712m00s3m19s0m50m12s1m45s0m17s0m11s0m41s0m17s
SRR00317710m13s18m04s0m220m31s10m03s0m39s1m02s4m35s1m35s
SRR0031867m15s12m06s0m160m17s7m29s0m29s0m59s3m41s1m13s
SRR0072156m18s6m00s-0m11s2m23s--0m46s1m10s
SRR01063721m18s18m05s-0m41s8m21s--3m30s3m30s
SRR01395137m20s41m04s0m57s0m48s25m30s1m41s3m06s8m53s5m27s
SRR027520_144m37s68m01s1m16s2m03s33m44s2m24s4m34s11m17s7m35s
SRR027520_246m42s59m08s1m23s0m58s34m00s2m22s4m31s11m07s7m37s
SRR5543695m34s6m38s0m15s0m23s3m56s0m11s0m25s1m12s0m32s
SRR32734241m40s45m0s2m17s0m35s20m31s1m14s2m20s6m35s4m33s
SRR128407315m11s21m21s1m7s--0m38s-3m49s2m7s
SRR904604940m21s46m52s2m56s----4m37s8m10s
SRR885847070m49s74m47s3m35s----7m56s22m9s
ERR330708266m35s69m23s3m01----9m35s6m53s
ERR6374203m44s4m56s0m240m11s---0m41s0m26s
DSRC2 and SPRING also attain the best performance in decompression speed in most cases except for five times where the first place was claimed by either gzip or bzip2. Gzip has the best decompression speed on SRR001471, SRR1284073, and ERR637420. On the other hand, bzip2 has the best decompression speed on SRR9046049 and SRR8858470 as shown in Table 5. Although both DSRC2 and SPRING have demonstrated an impressive speed, neither was able to compress all the different types of data we have in our datasets. SPRING was not able to compress SOLiD. On the other hand, DSRC2 does not support long reads on medium to large sized data. DSRC2 shows an obvious trade-off between the compression ratio and speed. DSRC2, in most cases, was not among the top four in terms of compression ratio except in some cases when other tools failed. This has left us with only four reliable tools that were able to compress all the files within our datasets, which are LFastqC, LFQC, gzip and bzip2. Among these four tools, LFastqC has the best compression ratio on most of the data except for four times when it comes second after LFQC and two times when it shares the same result with LFQC as shown in Table 3. In those cases where LFastqC came after LFQC, we noticed that LFQC performed better in compressing Quality Score, which is largely due to the fact that the back-end compressor ZPAQ performs better on highly randomly generated data than the lpaq8 used in LFastqC. In general, when compressing small to medium sized data, LFastqC performed well both in speed and compression ratio. LFastqC has a better speed and compression ratio when compressing LS454, PacBio, and MinION than LFQC, which has the second-best performance for these types of data. Additionally, LFastqC came second, behind SPRING, in terms of compression ratio on GAIIx and GAII dataset. Nevertheless, LFastqC fell behind when compressing SOLiD and SOLEXA due to the color space encoding in SOLiD dataset and the randomness of the rate of change in correlation between scores in the quality score of SOLEXA.
Table 5

Decompression speed.

DatasetDecompression Speed
LFastqCLFQCSPRINGDSRC2SeqSqueeze1QuipFQZCompGzipBzip2
SRR0014712m16s3m20s0m40m10s1m45s0m47s0m13s0m08s0m30s
SRR00317710m43s14m48s0m160m22s10m03s3m40s1m21s0m34s3m00s
SRR0031867m59s11m40s0m130m20s7m29s2m38s0m58s0m43s2m08s
SRR007215_16m08s7m14s-0m11s2m23s--0m23s1m16s
SRR01063720m59s23m28s-0m26s8m21s--1m27s4m12s
SRR013951_235m27s37m27s0m54s0m57s25m30s9m39s3m12s2m34s8m28s
SRR027520_148m27s56m12s1m09s2m34s33m44s15m38s5m01s3m51s13m57s
SRR027520_255m49s56m59s1m09s4m01s34m00s16m03s5m24s4m13s13m05s
SRR554369_15m54s4m46s0m5s0m27s5m16s1m24s0m26s0m6s0m48
SRR327342_140m30s44m38s0m33s0m42s32m12s8m48s2m49s1m56s5m49s
SRR128407316m37s18m44s0m31s--2m18s-0m19s2m46s
SRR904604935m52s40m29s2m45s----6m18s1m10s
SRR885847062m59s68m12s2m36s----8m50s2m12s
ERR330708256m40s59m38s1m30----2m13s9m16s
ERR6374202m17s4m3s0m120m14s---0m4s0m34s
LFastqC memory usage is calculated by summing the memory usage of both compressors, i.e. lpaq8 and MFcompress since they are running in parallel and the result of their summation is the worst-case scenario. Lpaq8 memory usage is based on argument N, where N can be an integer from 1 to 9. Larger numbers yield a better compression. LFastqC uses lpaq8 with N = 9. The memory usage then can be calculated as follows for both compression and decompression. On the other hand, LFastqC uses MFcompress with option -3 which uses more memory for better compression. MFcompress with option -3 uses around 2,433 MB for both compression and decompression. This sums up to 4GB in total for each dataset as the worst-case scenario, as shown in Table 6.
Table 6

Memory consumption.

DatasetsSize (Mb)Memory Usage
SRR0014712164 GB
SRR00317711964 GB
SRR0031868864 GB
SRR0072156954 GB
SRR01063720864 GB
SRR01395131904 GB
SRR027520_148084 GB
SRR027520_248084 GB
SRR5543693844 GB
SRR32734228124 GB
SRR128407313024 GB
SRR904604926224 GB
SRR885847042884 GB
ERR330708236324 GB
ERR6374202644 GB

Conclusions

We have developed a specialized FASTQ compressor that achieves the best compression ratio on the LS454, PacBio, and MinION datasets with a faster compression and decompression speed than LFQC. LFastqC was compared against two general-purpose compressor and six specialized FASTQ-file compressors. LFastqC outperformed all other tools in eight out of fifteen cases while for the other seven cases the winning spot was shared among FQZComp, LFQC, and SPRING. To sum up, LFastqC was competitive on all datasets due to the elegant preprocessing method and the strength of the two compressors chosen to compress different streams, namely lpaq8 and MFCompress. As of now, LFastqC does not support color space encoding, as well as lpaq8 performance degrades when the quality score stream presents a high degree of randomness, so the performance of LFastqC fell behind the LFQC in some cases. In the future, we will add a feature that supports converting the color space encoding into base space to gain better compression ratios on the SOLiD dataset as well as working on improving the compression ratio of quality score stream.
  13 in total

1.  LFQC: a lossless compression algorithm for FASTQ files.

Authors:  Marius Nicolae; Sudipta Pathak; Sanguthevar Rajasekaran
Journal:  Bioinformatics       Date:  2015-06-20       Impact factor: 6.937

2.  SPRING: a next-generation compressor for FASTQ data.

Authors:  Shubham Chandak; Kedar Tatwawadi; Idoia Ochoa; Mikel Hernaez; Tsachy Weissman
Journal:  Bioinformatics       Date:  2019-08-01       Impact factor: 6.937

3.  FaStore: a space-saving solution for raw sequencing data.

Authors:  Lukasz Roguski; Idoia Ochoa; Mikel Hernaez; Sebastian Deorowicz
Journal:  Bioinformatics       Date:  2018-08-15       Impact factor: 6.937

4.  Compression of DNA sequence reads in FASTQ format.

Authors:  Sebastian Deorowicz; Szymon Grabowski
Journal:  Bioinformatics       Date:  2011-01-19       Impact factor: 6.937

Review 5.  The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants.

Authors:  Peter J A Cock; Christopher J Fields; Naohisa Goto; Michael L Heuer; Peter M Rice
Journal:  Nucleic Acids Res       Date:  2009-12-16       Impact factor: 16.971

6.  DSRC 2--Industry-oriented compression of FASTQ files.

Authors:  Lukasz Roguski; Sebastian Deorowicz
Journal:  Bioinformatics       Date:  2014-04-18       Impact factor: 6.937

7.  On the representability of complete genomes by multiple competing finite-context (Markov) models.

Authors:  Armando J Pinho; Paulo J S G Ferreira; António J R Neves; Carlos A C Bastos
Journal:  PLoS One       Date:  2011-06-30       Impact factor: 3.240

8.  Compression of FASTQ and SAM format sequencing data.

Authors:  James K Bonfield; Matthew V Mahoney
Journal:  PLoS One       Date:  2013-03-22       Impact factor: 3.240

9.  Compression of next-generation sequencing reads aided by highly efficient de novo assembly.

Authors:  Daniel C Jones; Walter L Ruzzo; Xinxia Peng; Michael G Katze
Journal:  Nucleic Acids Res       Date:  2012-08-16       Impact factor: 16.971

10.  DNA-COMPACT: DNA COMpression based on a pattern-aware contextual modeling technique.

Authors:  Pinghao Li; Shuang Wang; Jihoon Kim; Hongkai Xiong; Lucila Ohno-Machado; Xiaoqian Jiang
Journal:  PLoS One       Date:  2013-11-25       Impact factor: 3.240

View more
  4 in total

1.  ACO:lossless quality score compression based on adaptive coding order.

Authors:  Yi Niu; Mingming Ma; Fu Li; Xianming Liu; Guangming Shi
Journal:  BMC Bioinformatics       Date:  2022-06-07       Impact factor: 3.307

Review 2.  Efficient compression of SARS-CoV-2 genome data using Nucleotide Archival Format.

Authors:  Kirill Kryukov; Lihua Jin; So Nakagawa
Journal:  Patterns (N Y)       Date:  2022-07-07

3.  Sequence Compression Benchmark (SCB) database-A comprehensive evaluation of reference-free compressors for FASTA-formatted sequences.

Authors:  Kirill Kryukov; Mahoko Takahashi Ueda; So Nakagawa; Tadashi Imanishi
Journal:  Gigascience       Date:  2020-07-01       Impact factor: 6.524

4.  FCLQC: fast and concurrent lossless quality scores compressor.

Authors:  Minhyeok Cho; Albert No
Journal:  BMC Bioinformatics       Date:  2021-12-20       Impact factor: 3.169

  4 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.