
CHAPAO: Likelihood and hierarchical reference-based representation of biomolecular sequences and applications to compressing multiple sequence alignments.

Md Ashiqur Rahman¹, Abdullah Aman Tutul¹, Sifat Muhammad Abdullah¹, Md Shamsuzzoha Bayzid¹.

Abstract

BACKGROUND: High-throughput experimental technologies are generating tremendous amounts of genomic data, offering valuable resources to answer important questions and extract biological insights. Storing this sheer amount of genomic data has become a major concern in bioinformatics. General purpose compression techniques (e.g. gzip, bzip2, 7-zip) are being widely used due to their pervasiveness and relatively good speed. However, they are not customized for genomic data and may fail to leverage special characteristics and redundancy of the biomolecular sequences.
RESULTS: We present a new lossless compression method, CHAPAO (COmpressing Alignments using Hierarchical and Probabilistic Approach), which is specifically designed for multiple sequence alignments (MSAs) of biomolecular data and offers very good compression gain. We introduce a novel hierarchical referencing technique to represent biomolecular sequences, which combines likelihood based analyses of the sequence similarities with graph theoretic algorithms. We performed an extensive evaluation study using a collection of real biological data from the avian phylogenomics project, the 1000 plants project (1KP), and 16S and 23S rRNA datasets. We report the performance of CHAPAO in comparison with general purpose compression techniques as well as with MFCompress and Nucleotide Archival Format (NAF), two of the best known methods specifically designed for FASTA files. Experimental results suggest that CHAPAO offers significant improvements in compression gain over most other alternative methods. CHAPAO is freely available as open source software at https://github.com/ashiq24/CHAPAO.
CONCLUSION: CHAPAO advances the state-of-the-art in compression algorithms and represents a potential alternative to the general purpose compression techniques as well as to the existing specialized compression techniques for biomolecular sequences.

Year: 2022 PMID: 35436292 PMCID: PMC9015123 DOI: 10.1371/journal.pone.0265360

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752

Background

One of the major tasks of bioinformatics is to collect, analyze and interpret large volumes of biomolecular data. The amount of available genomic data is increasing approximately tenfold every year, at a much faster rate than Moore’s Law for computational power [1, 2]. This advancement in sequencing technologies demands more efficient ways to store and analyze large genomic datasets. Numerous general purpose compression algorithms, such as zip and gzip based on DEFLATE algorithm [3], bzip2 using Burrows-Wheeler transform [4], 7-zip [5] are being widely used to deal with the genomic data deluge. However, these general purpose compression techniques are agnostic about the special characteristics and redundancy existing in the biomolecular sequences. Thus, due to the growing awareness about the challenges posed by the genomic data deluge and the inability of the general purpose compression techniques to take advantage of the redundancy in genomic data, developing specialized compression techniques for biomolecular sequences has drawn substantial attention from the bioinformatics community. Biomolecular sequence compression has been an active research area over the last decade. Most of these works are focused on directly compressing individual DNA/RNA sequences, such as BioCompress [6, 7], GenCompress [8], the CTW+LZ algorithm [9], DNACompress [10], MFCompress [11], DELIMINATE [12], XM [13], Pinho et al. [14] and Tabus and Korodi [15]. This class of methods utilizes various properties of genomic sequences such as small alphabet size and repetitive regions. There is another class of compression techniques, known as reference-based methods, that takes advantage of the redundancy in the biomolecular sequences. Here, a reference sequence is used to encode a “target sequence”, resulting in substantial compression when there are significant similarities between the reference sequence and the target sequence. 
Reference-based approach is a popular technique for genomic data compression, and has been used in many methods including RLZ [16], GRS [17], GReEn [18], coil [6], Fritz et al. [19], Christley et al. [20], Brandon et al. [21], Wang and Zhang [17], Kozanitis et al. [22] and Popitsch et al. [23]. These methods are useful in compressing sequence databases or storing millions of reads produced by next generation sequencing technologies. Multiple sequence alignment (MSA) is the alignment of biological sequences, inferring homologies by reflecting basic evolutionary events (insertion, deletion, and substitution). Constructing an MSA is a basic step in many analyses in computational biology such as phylogenetic tree construction, orthology identification, predicting the structure, and function of proteins. Therefore, an exponentially increasing number of MSA files are being generated and analyzed in various domains of computational biology. This underscores the need for developing methods for the efficient storage of MSA files. However, there has not been notable advancement in developing compression techniques that are especially customized to consider the special characteristics and redundancy of MSAs. Fundamental to the recent advancements in compressing sequence data is the ability to leverage the redundancy of the biomolecular sequences [6, 19]. Likewise, MSA files have specific formats and characteristics. Hickey et al. [24] proposed a way for saving MSA files on the basis of phylogenetic hierarchy. Matos et al. presented a model using a special arithmetic coding for DNA multiple sequence alignment blocks [25]. Many of these existing studies aimed more at presenting a concept than at providing publicly available usable compression tools. Moreover, many of them are dependent on external reference sequences [18–20, 22] which limits their practical use. 
Furthermore, some of them can handle only the four-letter alphabet (A, T, C, G), preventing their applications to protein sequences. Thus, although the last two decades have witnessed the proposal of many algorithms for compressing genomic sequences, this community is still dependent on the general purpose compressors. In this study, we present CHAPAO, a reference-based technique for compressing MSA files. This is to our knowledge the first application of the reference-based technique for compressing MSAs. Unlike conventional reference-based methods where an “extra” sequence (not included in the input sequence to compress) is used as a reference [17–20, 22], we proposed a novel hierarchical referencing technique where a suitable subset of the input sequences in the MSA file is used as reference sequences. Our referencing technique is hierarchical in a sense that a subset S1 of sequences can be used to encode a subset S2 of sequences, and S2 can subsequently be used to encode another subset of sequences. Thus, we aim to keep an optimal subset of input sequences that can encode all the sequences in the MSA in a hierarchical manner. We have proposed a likelihood based technique to model the sequence similarity and “representability”, and subsequently apply a minimum spanning arborescence [26-28] based algorithm on a graph modeled from the MSA in order to find an optimal set of reference sequences and an optimal order of hierarchical referencing. We performed an extensive evaluation study to assess the performance of CHAPAO on the MSA files from the Avian Phylogenomics [29, 30] and 1000 plants (1KP) [31, 32] projects (two of the largest phylogenomics projects to date) containing various types of gene sequences (introns, exons, and UCEs). We also analyzed a collection of large and challenging ribosomal RNA datasets (16S and 23S) obtained from the Gutell Lab [33, 34]. 
In addition to the general purpose compressors (zip, gzip, bzip2, and LZMA [35] (implemented in the 7-zip archiver [36])), we compared with MFCompress [11], which is the best known alternative method for compressing FASTA files, and Nucleotide Archival Format (NAF) [37]. Experimental results suggest that CHAPAO offers notable compression gain and significantly outperforms the best alternate methods except for 7-zip.

Methods

Overview of CHAPAO

In conventional reference-based techniques, the target sequences (sequences to be compressed) are represented in terms of a reference sequence and some additional metadata. The additional information may be the insertions, substitutions, or deletions that convert the reference sequence into the target sequence. Fig 1 shows two sequences that evolved with substitutions, insertions, and deletions and the corresponding multiple sequence alignment. We denote a pair of reference r and target t sequences by a tuple <r, t>. Usually, only one reference sequence is used for all the target sequences. This works well when the sequences come from the same or closely related species (as is the case for Fritz et al. [19], who used a reference genome sequence to map the short reads). For compressing an MSA with sequences from a collection of species with higher degrees of dissimilarity between them, single-reference techniques may result in larger amounts of metadata, and may lead to a lower compression ratio.

Fig 1

Character evolution and multiple sequence alignment.

(a) Two observed sequences, (b) Character evolution with substitution and indels which can change the sequence length and blur the homology, and (c) Multiple sequence alignment of the two sequences capturing the underlying character evolution where each site consists of homologous characters.

CHAPAO finds a suitable subset of sequences in the MSA as reference. Unlike other reference-based techniques [18–20, 22], CHAPAO does not depend on any external reference sequence. An efficient statistical and graph theoretic algorithm has been incorporated in CHAPAO to find an optimal set R of reference sequences so that other (target) sequences T can be hierarchically obtained from R with minimum representational cost. This is a hierarchical approach where a subset T_1 ⊆ T is encoded using R, and subsequently T_i (i > 1) is encoded using sequences from R ∪ T_1 ∪ … ∪ T_{i−1}. In order to find an optimal set of reference sequences, we create an encodability graph where each vertex corresponds to a sequence and the weight w_{ij} of a directed edge (S_i, S_j) from S_i to S_j represents the cost of representing sequence S_j by sequence S_i. Next, we find a minimum spanning arborescence in the encodability graph using Edmonds' algorithm [38, 39]. This defines an optimal set of reference sequences and hierarchical referencing (<r, t>) relationships among the sequences (see Theorem 0.1). Appropriate metadata are generated to decode the sequences hierarchically from the reference sequences. Note that the full encodability graph would be a very dense graph: a directed complete graph where every pair of vertices is connected by a pair of edges (one in each direction). To keep the graph relatively sparse, edges are established only between the nodes that correspond to “similar” sequences. We used a likelihood based approach to find similar sequences. Fig 2 shows an overview of the algorithmic workflow of CHAPAO. We used bzip2 at the final stage of our algorithm to compress the reference sequences and the metadata.
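The hierarchical referencing described above can be illustrated with a small round-trip sketch. It is not CHAPAO's actual file format: the function names (`diff`, `patch`, `decode_all`), the metadata layout (a parent name plus a list of position and replacement-string pairs), and the toy sequences are all assumptions made for illustration.

```python
def diff(reference, target):
    """Return (start, replacement) runs that turn `reference` into `target`.

    Both sequences come from the same alignment, so they have equal length.
    """
    edits, start = [], None
    for pos in range(len(target) + 1):
        mismatch = pos < len(target) and reference[pos] != target[pos]
        if mismatch and start is None:
            start = pos                      # a new run of mismatches begins
        elif not mismatch and start is not None:
            edits.append((start, target[start:pos]))
            start = None
    return edits


def patch(reference, edits):
    """Apply the runs produced by `diff` to recover the target sequence."""
    out = list(reference)
    for start, run in edits:
        out[start:start + len(run)] = run
    return "".join(out)


def decode_all(references, metadata):
    """Decode every sequence by following parent links back to a reference.

    references: {name: sequence} stored verbatim (hypothetical layout)
    metadata:   {name: (parent_name, edits)} for every non-reference sequence
    """
    decoded = dict(references)

    def resolve(name):
        if name not in decoded:
            parent, edits = metadata[name]
            decoded[name] = patch(resolve(parent), edits)
        return decoded[name]

    for name in metadata:
        resolve(name)
    return decoded


# Toy alignment: S3 is stored as a reference, S1 is encoded from S3,
# and S2 is encoded from S1 (a two-level hierarchy).
msa = {"S1": "gcaagtcgatcc", "S2": "gcttgtcgattt", "S3": "gcaagtcgatca"}
references = {"S3": msa["S3"]}
metadata = {
    "S1": ("S3", diff(msa["S3"], msa["S1"])),
    "S2": ("S1", diff(msa["S1"], msa["S2"])),
}
restored = decode_all(references, metadata)
```

Because each target is rebuilt from an already decoded sequence, the order in which entries appear in `metadata` does not matter; the recursion resolves parents first.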

Fig 2

Overview of the compression and decompression techniques in CHAPAO.

A directed weighted “Encodability Graph” is constructed where each vertex corresponds to a sequence in the MSA, except for the dummy node (shown in red), which is used as a “source” vertex. Next, a minimum spanning arborescence in the graph is constructed. Sequences that are children of the dummy node in the arborescence will be used as reference sequences. Appropriate metadata are generated to hierarchically represent all other sequences. Finally, the reference sequences along with the metadata are compressed using existing compression techniques. The pipeline is completely reversible, allowing lossless decompression of the original MSAs.

Representational cost

The cost C_{i,j} of representing a target sequence S_j using a reference sequence S_i depends on the metadata that must be stored in order to retrieve S_j from S_i. C_{i,j} includes the cost of storing the indices of the positions where S_i and S_j differ, and the cost of storing the mismatched bases. Thus, C_{i,j} = f(N, I), where N is the number of bits required to store a base, and I is the number of bits required to store an index. Fig 3(b) shows the cost matrix, giving the cost of representing every pair of sequences (in both directions) in Fig 3(a). Note that C2,1 = 2I + 4N, but C1,2 = 2I + 2N. These two sequences differ at indices 2, 3, 10, and 11. To represent S2 using S1, we need to store “tt” and indices 2 and 10, whereas we need to store “aa” at 2 and “cc” at 10 for representing S1 using S2. Therefore, C_{i,j} is not necessarily identical to C_{j,i}, and thus the encodability graph is a directed complete graph. The cost of storing a reference sequence is I + l*N, where l is the length of the sequence, and I is the cost to store the index of the sequence. We store the index of a sequence to ensure that the sequences in the decompressed MSA are in exactly the same order as in the original uncompressed MSA.
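The asymmetry of the cost can be reproduced with a short sketch. The text only states that the cost is some function f(N, I), so the concrete rule used here is an assumption inferred from the worked example: one index per run of consecutive mismatches, and each distinct replacement string stored once. The function names, the toy sequences, and the bit widths (N = 3, I = 8) are likewise hypothetical.

```python
def mismatch_runs(reference, target):
    """List (start, replacement) runs where `target` differs from `reference`."""
    if len(reference) != len(target):
        raise ValueError("aligned sequences must have equal length")
    runs, start = [], None
    for pos, (r, t) in enumerate(zip(reference, target)):
        if r != t:
            if start is None:
                start = pos
        elif start is not None:
            runs.append((start, target[start:pos]))
            start = None
    if start is not None:
        runs.append((start, target[start:]))
    return runs


def representation_cost(reference, target, base_bits, index_bits):
    """Assumed cost rule: one index per run, each distinct string stored once."""
    runs = mismatch_runs(reference, target)
    distinct_strings = {replacement for _, replacement in runs}
    stored_bases = sum(len(s) for s in distinct_strings)
    return index_bits * len(runs) + base_bits * stored_bases


def reference_cost(sequence, base_bits, index_bits):
    """Cost of storing a sequence verbatim as a reference: I + l*N."""
    return index_bits + len(sequence) * base_bits


# Toy pair differing at positions 2, 3, 10 and 11 (0-based here).
s1 = "gcaagtcgatcc"
s2 = "gcttgtcgattt"
N, I = 3, 8                                      # hypothetical bit widths
cost_s2_from_s1 = representation_cost(s1, s2, N, I)  # "tt" once, two indices
cost_s1_from_s2 = representation_cost(s2, s1, N, I)  # "aa" and "cc", two indices
```

With these inputs the first cost equals 2I + 2N and the second equals 2I + 4N, matching the pattern of C1,2 and C2,1 in the example.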

Fig 3

Directed graph based modeling.

(a) A multiple sequence alignment with three sequences, (b) the corresponding cost matrix, (c) the encodability graph, and (d) the corresponding minimum spanning arborescence.

Modeling the encodability graph

Given an MSA with n sequences, we create an encodability graph with n + 1 vertices, where n vertices correspond to the n sequences in the MSA. The (n + 1)-th vertex is a dummy vertex v which does not correspond to any sequence in the MSA. There is an edge from v to each vertex i (1 ≤ i ≤ n), whose weight represents the cost of storing sequence S_i as a reference sequence. Thus, the dummy node acts as the “source” vertex in the encodability graph, and a minimum spanning arborescence is constructed with the dummy node as the root. In this way, in addition to the representational costs w_{ij}, the cost of storing a reference sequence is considered in the encodability graph, and therefore the minimum spanning arborescence in the encodability graph defines the optimal set of reference sequences and an optimal order of hierarchical referencing (see Theorem 0.1). A sequence S_i is considered to be a reference if there is an edge (v, S_i) from v to S_i in the arborescence. The hierarchical referencing is defined by the directed edges in the arborescence. Fig 3(c) shows the encodability graph of the MSA shown in Fig 3(a), and the corresponding minimum spanning arborescence is shown in Fig 3(d). Here, the cost of storing all three DNA sequences is the sum of the cost to save sequence S3 as a reference, the cost to represent S1 using S3 as a reference, and the cost to represent S2 using S1 as a reference. Thus, we only need to store S3 and appropriate metadata to hierarchically decode S1 and S2. White and Hendy [6] previously used an undirected-graph based technique to model the similarity among the sequences in a database. They used edit tree distance as an approximation of the maximum parsimony distance to find groups of similar sequences. They split the whole database into multiple undirected similarity graphs composed of highly similar sequences. Next, (undirected) minimum spanning trees are computed for each of the similarity graphs. The smallest sequence in each similarity graph is considered as the only reference sequence for all the target sequences.
Therefore, despite some parallels, our technique (a directed encodability graph built from likelihood based similarity, followed by the computation of a minimum spanning arborescence and hierarchical referencing) is significantly different from the one used in White and Hendy [6].

Theorem 0.1. The minimum spanning arborescence in an encodability graph defines an optimal set of reference sequences and reference-target (<r, t>) relationships.

Proof. Let A be the minimum spanning arborescence of the encodability graph G, and let R be the set of reference sequences and P the set of reference-target pairs suggested by A. Let C be the total weight of A, meaning that C is the cost of storing R and P. Assume that C is not optimal, meaning that there exists another set of reference sequences R′ and a set of reference-target pairs P′ that can be stored with cost C′, and C′ < C. Let us build a graph A′ where a vertex v_i corresponds to a sequence S_i in the MSA and there is an edge from S_i to S_j if <S_i, S_j> ∈ P′. The weight of an edge (v_i, v_j) represents the cost of storing S_j using S_i as a reference. Finally, we add a dummy node v to A′ and add edges from v to all the reference sequences in R′, where the cost of an edge represents the cost of storing a reference sequence. It is easy to see that A′ is a spanning arborescence of G, rooted at v, with cost C′. Therefore, since C′ < C, A cannot be a minimum spanning arborescence, which leads to a contradiction. This completes the proof.
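To make the role of the dummy root concrete, the following is a minimal pure-Python sketch of the Chu-Liu/Edmonds procedure for a minimum spanning arborescence, applied to a three-sequence encodability graph. It is an illustrative recursive version, not the authors' implementation, and the edge weights are invented for the example rather than taken from Fig 3.

```python
def _solve(nodes, edges, root):
    """edges: list of (u, v, weight, payload). Returns payloads of chosen edges."""
    # Step 1: cheapest incoming edge for every non-root node.
    best = {}
    for e in edges:
        u, v, w, _ = e
        if v != root and u != v and (v not in best or w < best[v][2]):
            best[v] = e
    if any(v != root and v not in best for v in nodes):
        raise ValueError("some vertex is unreachable from the root")
    # Step 2: look for a cycle among the chosen edges.
    cycle = None
    for start in best:
        path, v = [], start
        while v != root and v not in path:
            path.append(v)
            v = best[v][0]
        if v != root:
            cycle = path[path.index(v):]
            break
    if cycle is None:
        return [e[3] for e in best.values()]
    # Step 3: contract the cycle into one super node and recurse.
    in_cycle = set(cycle)
    super_node = ("cycle", tuple(cycle))
    contracted = []
    for e in edges:
        u, v, w, _ = e
        if u in in_cycle and v in in_cycle:
            continue
        if v in in_cycle:      # entering edge: pay only the extra over best[v]
            contracted.append((u, super_node, w - best[v][2], e))
        elif u in in_cycle:
            contracted.append((super_node, v, w, e))
        else:
            contracted.append((u, v, w, e))
    remaining = [n for n in nodes if n not in in_cycle] + [super_node]
    chosen = _solve(remaining, contracted, root)
    # Step 4: expand. Keep every cycle edge except the one displaced by the
    # edge that enters the cycle.
    entering = next(e for e in chosen if e[1] in in_cycle)
    result = [e[3] for e in chosen]
    result += [best[v][3] for v in cycle if v != entering[1]]
    return result


def minimum_spanning_arborescence(nodes, weights, root):
    """weights: {(u, v): cost}. Returns {child: parent} for the optimum."""
    edges = [(u, v, w, (u, v)) for (u, v), w in weights.items()]
    return {v: u for u, v in _solve(list(nodes), edges, root)}


# Hypothetical costs: edges out of "root" are reference-storage costs,
# all other edges are representational costs.
weights = {
    ("root", "S1"): 40, ("root", "S2"): 40, ("root", "S3"): 40,
    ("S3", "S1"): 10, ("S1", "S3"): 12,
    ("S1", "S2"): 8,  ("S2", "S1"): 14,
    ("S2", "S3"): 30, ("S3", "S2"): 30,
}
parents = minimum_spanning_arborescence(["root", "S1", "S2", "S3"], weights, "root")
total_cost = sum(weights[(p, c)] for c, p in parents.items())
```

In this toy graph the children of `root` are the reference sequences, so S3 is stored verbatim, S1 is encoded from S3, and S2 from S1, mirroring the structure described for Fig 3(d).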

Log-likelihood based similarity modeling

For a directed complete graph, calculating the cost matrix is expensive and requires O(n² l) time, where n is the number of sequences and l is the length of each sequence. For computational efficiency, CHAPAO tries to keep the encodability graph relatively sparse by considering only those edges that are incident on reasonably similar sequences. However, finding pairwise similar sequences is computationally expensive as well. Therefore, we have introduced an efficient heuristic using the likelihood values of the sequences. Let M be a multiple sequence alignment with n sequences. A sequence S_i in M can be considered as an l-dimensional random vector, S_i = [S_{i1}, S_{i2}, …, S_{il}], where S_{ij} ∈ {a, t, g, c, −} refers to the j-th base (character) in S_i. Thus, M is a collection of n l-dimensional random vectors. We assume that the occurrence of a base at a position j in a sequence S_i is independent of any other base in S_i. Therefore, the likelihood of the sequence S_i in M can be computed as follows.

$$L(S_i) = \prod_{j=1}^{l} P_j(S_{ij}) \tag{1}$$

Here, P_j(S_{ij}) is the probability of the occurrence of a particular base S_{ij} ∈ {a, t, c, g, −} at column j in M. Let c_j(S_{ij}) be the number of times S_{ij} appears at column j in M. Then P_j(S_{ij}) can be calculated as follows.

$$P_j(S_{ij}) = \frac{c_j(S_{ij})}{n} \tag{2}$$

As the individual probability values are very small, we take the log-likelihood as follows.

$$\log L(S_i) = \sum_{j=1}^{l} \log P_j(S_{ij}) \tag{3}$$

We sort the sequences in an MSA according to their likelihood values so that the adjacent sequences in the sorted list have a relatively low cost for representing each other. Next, we take a sliding window of a preferred length l_w (which is a tunable parameter), and slide it over the sorted list. The step size (sliding length) l_s is chosen appropriately to ensure a certain amount of overlap between two windows. The sequences within a window will form a clique (i.e., every pair of vertices will be connected with each other) in the encodability graph. This reduces the time complexity of computing the cost matrix to O(nl).
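The heuristic above can be sketched in a few lines: score each sequence by the sum of the logs of its column-wise base frequencies, sort by that score, and connect only sequences that share a sliding window. The function names and the `window` and `step` parameters are illustrative choices, and the handling of the final, possibly shorter window is an assumption.

```python
import math
from collections import Counter


def log_likelihoods(msa):
    """Log-likelihood of each aligned sequence under column-wise independence."""
    n = len(msa)
    column_counts = [Counter(column) for column in zip(*msa)]
    return [
        sum(math.log(column_counts[j][base] / n) for j, base in enumerate(seq))
        for seq in msa
    ]


def window_edges(msa, window, step):
    """Directed candidate edges between sequences sharing a sliding window.

    Sequences are sorted by log-likelihood; each window becomes a clique.
    Returned pairs are indices into `msa`.
    """
    scores = log_likelihoods(msa)
    order = sorted(range(len(msa)), key=lambda i: scores[i])
    edges = set()
    start = 0
    while True:
        block = order[start:start + window]
        edges.update((a, b) for a in block for b in block if a != b)
        if start + window >= len(order):
            break                      # the last window reached the end
        start += step
    return edges


toy_msa = ["acgt", "acgt", "acga", "tcga"]
scores = log_likelihoods(toy_msa)
sparse = window_edges(toy_msa, window=2, step=1)
dense = window_edges(toy_msa, window=4, step=4)
```

With `window` equal to the number of sequences the result is the complete directed graph (12 edges for four sequences), whereas `window=2, step=1` keeps only 6 edges between neighbours in the sorted order.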

Time complexity

The time complexity of our algorithm depends on the cost of computing the cost matrix and computing the minimum spanning arborescence. For each edge e in the encodability graph, we have to calculate its weight, which takes O(l) time. Thus, the time complexity for computing the cost matrix will be O(El), where E is the number of edges in the graph. For an MSA with n sequences, a sliding window of length l_w and step size (sliding amount) l_s, there will be O(n / l_s) cliques in the graph. The time complexity to compute the cost matrix for a clique is O(l_w² l), as there are l_w(l_w − 1) edges in a clique with l_w nodes. Considering all the cliques, the time complexity is O((n / l_s) l_w² l). Note that l_w = c l_s, where c is a positive real number. Thus, the time complexity of computing the cost matrix is O(c n l_w l). As l_w is a constant which is usually much smaller than the length of the sequences (l), and does not depend on n, computing the cost matrix takes O(nl) time. This also implies that the number of edges in the encodability graph, constructed by considering the overlapping windows of sequences with similar likelihood values, is O(n). We implemented Edmonds' algorithm [38, 39] to find the minimum spanning arborescence, which takes O(VE) time. Thus, the time complexity of our compression pipeline is O(El + VE). Therefore, without the log-likelihood based heuristic with sliding windows, CHAPAO takes O(n² l + n³) time. With our proposed sparse graph representation, the compression time is reduced to O(nl + n²), saving a factor of O(n).

Experimental studies

We evaluated the performance of CHAPAO on a collection of real and challenging biological datasets. We used data from two of the largest phylogenomics projects to date: 1) the Avian phylogenomics project [29, 30] and 2) the 1000 plants (1KP) project [31, 32]. We also analyzed two other widely used large biological datasets (16S and 23S) from the Gutell Lab [33, 34, 40], containing alignment files with large numbers of sequences from 16S and 23S ribosomal RNA sampled from bacteria. S1 Table in S1 File shows the summary of various alignments in these datasets. We assessed the performance of CHAPAO on DNA sequence alignments. We compared CHAPAO with several popular general purpose compression techniques, namely zip, bzip2, gzip, and LZMA [35] (implemented in the 7-zip archiver [36]). We also compared CHAPAO with the special purpose compression techniques MFCompress [11] and NAF [37], which are specifically targeted at compressing biomolecular sequences in FASTA files. MFCompress was previously compared with gzip, bzip2, ppmd (a variant of ppm [41]), and LZMA as well as with the recent special purpose compressor DELIMINATE [12], and was shown to be the best method for compressing FASTA files. NAF is based on zstd [42] and was shown to achieve a compression ratio close to DELIMINATE. In order to compare various compression techniques, we report the average compression gains (over all the MSAs in a particular dataset) attained by different methods. We also show the size of the MSAs after compression by different methods, and we divide the MSA files in a particular dataset into different bins based on the size of the MSAs to better assess the performance of different methods across varying file sizes. We performed the Wilcoxon signed-rank test (with α = 0.05) to measure the statistical significance of the differences between two methods. The experiments were performed on a Windows machine with an Intel Core i7-7500U processor (3.5 GHz), 8 GB DDR4 RAM, and a 128 GB SSD (SATA 3).
The performance of CHAPAO may vary depending on the window size. Longer window sizes are expected to achieve better compression gain at the cost of more compression time. We used window sizes ranging from 20 to 40 on various datasets. These relatively small window sizes provided enough compression gain to significantly outperform most other methods. The particular window size and overlap size that we used to generate the results are mentioned in each subsequent figure.
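As a rough illustration of how the general purpose baselines can be measured, the sketch below computes compression ratios (original size divided by compressed size) with the DEFLATE, bzip2, and LZMA implementations in the Python standard library. It is not the evaluation script used in the study: the toy FASTA text is synthetic and far more repetitive than real alignments, and library defaults may differ from the command line tools that were benchmarked.

```python
import bz2
import lzma
import zlib


def compression_ratios(fasta_text):
    """Return original_size / compressed_size for three stdlib compressors."""
    raw = fasta_text.encode("ascii")
    compressed_sizes = {
        "zlib (DEFLATE)": len(zlib.compress(raw, 9)),
        "bz2": len(bz2.compress(raw, 9)),
        "lzma": len(lzma.compress(raw)),
    }
    return {name: len(raw) / size for name, size in compressed_sizes.items()}


# Synthetic, highly redundant toy alignment (50 identical 400-base sequences).
toy_fasta = "".join(f">seq{i}\n{'acgtacgtta' * 40}\n" for i in range(50))
ratios = compression_ratios(toy_fasta)
```

Running the same function over every MSA in a dataset and averaging per size bin would give a table comparable in shape to the figures that follow, though the absolute values depend on the data.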

Results on avian dataset

Avian phylogenomics project is the largest vertebrate phylogenomics project [29, 30], which assembled or collected the genomes of 48 avian species spanning most orders of birds. This dataset contains exons from 8251 syntenic protein-coding genes, introns from 2516 of these genes, and a nonoverlapping set of 3769 ultraconserved elements (UCEs). The exon gene set was prepared based on synteny-defined orthologs chosen from the assembled genomes of chicken and zebra finch. The intron gene set consists of 2516 genes that are orthologous subset of introns from the 8295 protein-coding genes. The UCE dataset has 3679 genes with ∼1000 bp of flanking sequences. The UCE dataset was filtered to remove overlap with the exon and intron datasets. Fig 4 shows the relative performance of various methods on avian dataset. Since we have thousands of alignments covering a wide range of file sizes, we show the results for various bins of different file size limits. CHAPAO consistently achieved a significantly higher compression ratio than all other methods, except LZMA and NAF, regardless of the file size and sequence type. LZMA, despite being a general purpose compression technique, achieved the best compression gains on all the model conditions on avian datasets, followed by CHAPAO and NAF.

Fig 4

Performance of various compression techniques on avian datasets.

To better understand the relative performance of different methods across different file sizes, we distribute the MSA files into various bins based on their sizes. For each bin (file-size range), we show the average size of the compressed files produced by various methods. (a) UCEs. (b) Introns. (c) Exons.

CHAPAO and NAF achieved competitive compression gain on the intron datasets, while NAF was slightly better than CHAPAO on the UCEs and CHAPAO was slightly better than NAF on the exons. On the intron alignments, CHAPAO achieved 38.8%, 31.3%, 14.3% and 16.6% more compression than zip, gzip, bzip2, and MFCompress respectively (see Fig 4(b)). CHAPAO achieved the second best compression ratio on exon MSAs, where it achieved 20.28% more compression than MFCompress, 6.2% more compression than NAF, and 24.64% more compression than bzip2 (see Fig 4(c)). On the UCE dataset, CHAPAO achieved 21.6% more compression than MFCompress and 15.9% more than bzip2, and NAF achieved slightly better compression (2.29%) than CHAPAO (see Fig 4(a)). To assess the applicability and performance of our method on very large alignments, we analyzed the concatenated alignments resulting from concatenating the alignments of introns, exons and UCEs. We did not analyze the ultra-large alignments, as the likelihood based analysis is computationally intensive for very large alignments, and restricted our analyses to the files not exceeding 300 MB [43] (see S2 and S3 Tables in S1 File). Although concatenation (also known as combined analysis) can be problematic as it is agnostic to the topological differences among the gene trees [44-49], it is one of the most widely used methods for species tree estimation from multi-locus data. Therefore, there is intrinsic value in storing data of this nature. Similar to individual gene sequence alignments, LZMA achieved the best compression gain on concatenated alignments as well, except for MSA-7 and MSA-8, where CHAPAO achieved the best compression gains (Fig 5).
However, the performance of NAF substantially degraded on these large concatenated alignments. CHAPAO and MFCompress achieved second best compression gains. On an average, CHAPAO achieved 45.15% better compression than bzip2 which is the second best performing general purpose compressor on the concatenated alignments. Unlike other datasets, MFCompress outperformed CHAPAO on six (out of 10) MSAs. However, among the largest three MSAs (MSA-7, -8, and -9), CHAPAO outperformed MFCompress on two of them (MSA-7 and MSA-8). On the other large file (MSA-9), MFCompress is better than CHAPAO. We investigated the average p-distance of the sequences in these large MSA files and observed that the average p-distance of MSA-9 is 0.15 which is much higher than those of MSA-7 and MSA-8 (0.043 and 0.029 respectively), indicating a lower level of similarity/redundancy in MSA-9 compared to MSA-7 and MSA-8. This could explain why CHAPAO did not achieve the same level of compression gain on MSA-9 as it did on MSA-7 and MSA-8. Note also that both CHAPAO and MFCompress achieved substantially better compression gain than bzip2 on MSA-7 and MSA-8, but the gain is not that substantial on MSA-9. These results suggest that CHAPAO can effectively capture the similarities/redundancies in the sequences, and hence underscore the importance of using special purpose compressor for large scale MSAs in order to effectively leverage the redundancies in biomolecular sequences. CHAPAO significantly outperformed NAF on seven (out of 10) MSAs. The compression gains of CHAPAO and NAF are competitive on the remaining three files. Notably, on an average, CHAPAO achieved 40.81% more compression than NAF on these concatenated alignments.

Fig 5

Performance of various compression techniques on 10 concatenated alignments in avian dataset.

The avian dataset is distributed as gzip-compressed files which consume 923 MB (considering only the MSAs analyzed in this study) [43]. However, CHAPAO can archive these files using 604 MB of data, saving 34.56% of the storage requirement. Moreover, this compression gain has been achieved using a window size of 20 and can be further improved by using longer window sizes at the cost of more computation time. There is a notable correlation between the similarity/redundancy in MSA files and the compression ratio obtained by CHAPAO, which is in line with our objective of leveraging the similarity in the biomolecular sequences. We investigated this on avian datasets. We computed the Hamming distances between the pairs of sequences in every MSA file in the avian datasets. We have defined the average Hamming distance of an MSA file according to Eq 4. Here, N and L are the number of sequences and the length of each sequence in an MSA file, respectively. Hamming(S_i, S_j) denotes the Hamming distance between two sequences S_i and S_j. CHAPAO achieved more compression on the MSAs with smaller average pairwise Hamming distance among the sequences (Fig 6). As the level of dissimilarity between the pairs of sequences in an MSA file increases, the compression gain of CHAPAO gradually decreases. The experimental results also suggest that CHAPAO may provide better compression gain for MSAs with relatively large numbers of sequences. For MSAs with large numbers of sequences, relatively larger proportions of the sequences may be expressed as non-reference sequences, which subsequently leads to higher compression gain.
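A pairwise dissimilarity summary of this kind can be computed as sketched below. The exact normalisation of Eq 4 is not reproduced in this text, so the choice here (the mean of the raw Hamming distance over all unordered pairs, with an optional division by the alignment length) is an assumption, as are the function names.

```python
from itertools import combinations


def hamming(a, b):
    """Number of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def average_pairwise_hamming(msa, per_site=False):
    """Mean Hamming distance over all unordered sequence pairs (assumed form).

    With per_site=True the mean is divided by the alignment length.
    """
    pairs = list(combinations(msa, 2))
    if not pairs:
        return 0.0
    mean = sum(hamming(a, b) for a, b in pairs) / len(pairs)
    return mean / len(msa[0]) if per_site else mean


toy_msa = ["acgt", "acga", "tcga"]
avg = average_pairwise_hamming(toy_msa)                 # (1 + 2 + 1) / 3
avg_per_site = average_pairwise_hamming(toy_msa, True)  # avg / 4
```

The computation is quadratic in the number of sequences, which is acceptable for a one-off analysis but is the same all-pairs cost that the sliding-window heuristic avoids during compression.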

Fig 6

Performance of CHAPAO with varying levels of dissimilarity/divergence of the 14,490 MSAs in avian datasets.

The average Hamming distance (as defined in Eq 4) of these files ranges from 0 to 13. The box plots show the compression ratio (ratio of the size of the original file to that of the compressed file) of CHAPAO on the MSAs in avian datasets (here MSAs are sorted in ascending order of their average Hamming distance).

One of the notable observations from these results is the superior performance of the general purpose compression technique LZMA compared to various special purpose compression techniques (e.g., MFCompress, CHAPAO, NAF). The LZ algorithms approach the data sequentially and keep track of all the strings that appeared in the past up to a certain limit (window size) [50, 51]. If the current substring has been seen previously, it is replaced by a reference to the previous occurrence. LZMA is an improvement of LZ coding which can detect repeats that are further apart. As a result, it can capture both intra-sequence and inter-sequence similarities [52], whereas CHAPAO tries to leverage only the inter-sequence similarities. We believe that this could be a reason why the performance of CHAPAO is not better than LZMA in most cases.

Results on 16S and 23S datasets

Results on the 16S and 23S datasets are shown in Fig 7(a) and 7(b). LZMA and CHAPAO achieved extraordinary compression gain on these datasets. MFCompress performed worse than bzip2 even though it is a special purpose tool for compressing FASTA files. LZMA was the best performing method on the 23S dataset, and CHAPAO was the second best method, achieving 80%, 78.68%, 55.38%, 65.47%, and 45.84% more compression than zip, gzip, bzip2, MFCompress, and NAF respectively. Similar trends hold for the 16S dataset, where CHAPAO obtained 76.76%, 73.26%, 32.82%, 68.25%, and 33.92% more compression than zip, gzip, bzip2, MFCompress and NAF respectively. CHAPAO was able to compress the 16S and 23S datasets, originally occupying 361.66 MB and 58.79 MB respectively, to only 7.2 MB and 1.3 MB, which can easily be transmitted as an email attachment.

Fig 7

Comparison of various compression techniques on 16S and 23S datasets and 1KP dataset.

(a) 16S. (b) 23S. (c) 1KP.

Results on 1KP dataset

The 1000 plants (1KP) initiative has generated large-scale gene sequencing data for over 1000 species of plants, representing approximately one billion years of evolution, including flowering plants, conifers, ferns, mosses, and streptophyte green algae [31, 32]. This dataset comprises 9,609 multiple sequence alignments, each containing sequences from 1000 different plant species with a wide range of sequence lengths (303 to 63,150 bp). The performance of CHAPAO along with other compression methods is shown in Fig 7(c). LZMA and CHAPAO achieved the best compression gains on this dataset, and CHAPAO obtained 42.7%, 40.3%, 7.3%, 24.1%, and 6.26% more compression than zip, gzip, bzip2, MFCompress, and NAF, respectively. Notably, CHAPAO was significantly better than LZMA on the larger MSAs (243 MB to 284 MB range), and competitive with LZMA on the other size ranges.

Impact of sliding window lengths on compression gain

Sliding window length l and the sliding amount are important hyper-parameters of our algorithm. With a sliding window of length l, the encodability graph will be composed of a series of cliques, each of size l. Smaller sliding windows may result in relatively lower compression gain. Maximum compression is expected when the window length is equal to the number of sequences in an MSA. We investigated the impact of varying the lengths of the window and the overlap. Fig 8 shows the impact of various hyper-parameter settings on the 16S and 23S datasets. These results suggest that as we increase the window size, the compression gain tends to improve. A sliding window of length 30 with an overlap of 20 (CHAPAO (window = 30, overlap = 20)) achieved almost 17.71% and 16.1% more compression than CHAPAO with a sliding window of length 5 and an overlap of 3 (CHAPAO (window = 5, overlap = 3)) on the 16S and 23S datasets, respectively. This impact is even more prominent on the MSAs with larger numbers of sequences. For example, CHAPAO (window = 30, overlap = 20) is 20.6% better than CHAPAO (window = 5, overlap = 3) on 16S.B.ALL, which contains 27,643 sequences, whereas the improvement is 10.13% on 16S.M, which contains 901 sequences. However, this improved compression gain comes with an additional computational burden. CHAPAO (window = 30, overlap = 20) took almost 3.8 times more compression time than CHAPAO (window = 5, overlap = 3) on the 16S and 23S datasets. The average running times for different lengths of window and overlap on the 16S and 23S datasets are shown in Table 1 (see also S11 Table in S1 File).
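
The window and overlap parameters can be made concrete with a small sketch. It assumes that consecutive windows advance by window minus overlap sequences and that the last window is truncated at the end of the alignment; both are readings of the text above, not details confirmed by the CHAPAO source code. The candidate pairs are treated as unordered for simplicity, and the function names are hypothetical.

```python
from itertools import combinations


def sliding_windows(n_sequences: int, window: int, overlap: int):
    """Return index ranges of consecutive windows over n_sequences rows.

    Assumption: each window starts (window - overlap) rows after the previous one.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window, n_sequences)
        windows.append(range(start, end))
        if end == n_sequences:
            break
        start += step
    return windows


def clique_edges(n_sequences: int, window: int, overlap: int):
    """Candidate reference pairs: every pair of rows that share a window."""
    edges = set()
    for rows in sliding_windows(n_sequences, window, overlap):
        edges.update(combinations(rows, 2))
    return edges


small = clique_edges(10, window=5, overlap=3)    # local cliques only
full = clique_edges(10, window=10, overlap=0)    # one clique over all rows
```

In this sketch a window that spans the whole alignment yields every possible pair of sequences, which matches the statement that maximum compression is expected in that setting, while smaller windows restrict each sequence to nearby candidates and so reduce the amount of work.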

Fig 8

Impact of the lengths of sliding window and overlap on compression ratio.

We show the performance of various variants of CHAPAO on 16S and 23S datasets. (a) 16S. (b) 23S.

Table 1

Impact of sliding window and overlap lengths on compression time.

We show the running time of various variants of CHAPAO on 16S and 23S datasets.

Window	Overlap	Average Running Time
5	3	83.91
10	7	136.05
20	15	238.82
30	20	323
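
As a quick arithmetic check, the running times in Table 1 can be expressed relative to the smallest setting. The values below are copied from the table; the unit of time is not stated there, so only unit-free ratios are computed. The variable names are illustrative.

```python
# (window, overlap) -> average running time, as listed in Table 1.
running_time = {
    (5, 3): 83.91,
    (10, 7): 136.05,
    (20, 15): 238.82,
    (30, 20): 323.0,
}

baseline = running_time[(5, 3)]

# Slowdown of each setting relative to window = 5, overlap = 3.
slowdown = {setting: t / baseline for setting, t in running_time.items()}

for (window, overlap), factor in slowdown.items():
    print(f"window={window:>2} overlap={overlap:>2} -> {factor:.2f}x")
```

The largest setting comes out at roughly 3.85 times the baseline, in line with the figure of almost 3.8 times quoted in the text.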

Compression and decompression time

Compression times of CHAPAO (for the window sizes used in this study) and other methods are shown in S4-S12 Tables in S1 File. CHAPAO tends to take more time for compression than other methods, especially on the larger files, although shorter window sizes can substantially reduce the compression time. The decompression step, however, is much faster and takes less computational time than MFCompress (see S15 Table in S1 File). Even for the largest files, such as 16S.B.ALL, CHAPAO takes only around 9 seconds to decompress, whereas MFCompress takes around 24 seconds to compress and 26 seconds to decompress (S15 Table in S1 File). NAF and LZMA (7-zip) are faster than CHAPAO in terms of both compression and decompression speed.
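
For readers who want to reproduce this kind of timing comparison on their own alignments, a minimal harness for the general purpose compressors might look as follows. It uses only the Python standard library (`gzip`, `bz2`, `lzma`); note that Python's `lzma` module writes the .xz format by default, which is related to but not identical to a 7-zip archive, and that none of this measures CHAPAO, MFCompress, or NAF. The toy alignment generator and all names are assumptions made for this example.

```python
import bz2
import gzip
import lzma
import random
import time

CODECS = {
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def toy_msa(rows: int = 50, length: int = 200, mutations: int = 3, seed: int = 0) -> bytes:
    """Build a small FASTA-like alignment of near-identical sequences."""
    rng = random.Random(seed)
    base = [rng.choice("ACGT") for _ in range(length)]
    records = []
    for r in range(rows):
        seq = base[:]
        for _ in range(mutations):
            seq[rng.randrange(length)] = rng.choice("ACGT")
        records.append(f">seq{r}\n{''.join(seq)}")
    return ("\n".join(records) + "\n").encode("ascii")


def benchmark(data: bytes) -> dict:
    """Time compression and decompression and record the compression ratio."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        t0 = time.perf_counter()
        packed = compress(data)
        t1 = time.perf_counter()
        restored = decompress(packed)
        t2 = time.perf_counter()
        if restored != data:
            raise RuntimeError(f"{name} did not round-trip losslessly")
        results[name] = {
            "ratio": len(data) / len(packed),  # original size / compressed size
            "compress_s": t1 - t0,
            "decompress_s": t2 - t1,
        }
    return results


results = benchmark(toy_msa())
```

Timings from a toy input of this size are dominated by overhead, so a meaningful comparison would need real MSA files and repeated runs.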

Conclusions

Given the huge number of multiple sequence alignments that can be harvested from various comparative genomics projects, there is a need for efficient tools to archive them. General purpose techniques are widely used to archive MSA files. However, these methods are agnostic to the specificity of MSAs. In this paper, we have attempted to advance the state-of-the-art in MSA compression by taking the redundancy and specificity of MSAs into account. We have presented CHAPAO, a new lossless compression technique that is tailored to leverage the redundancy of genomic sequences by exploiting a novel hierarchical reference-based sequence representation to allow parsimonious storage of MSAs. Extensive experimental studies on a variety of real biological datasets suggest that CHAPAO can achieve substantially higher compression gain than the existing general purpose compression techniques (except 7-zip) as well as the special purpose techniques for genomic data, at the cost of more compression time.

This study can be extended in several directions. Although CHAPAO can handle reasonably large MSAs, it is not yet scalable to ultra-large whole genome alignments due to its computational requirements (time and memory). Designing appropriate divide-and-conquer frameworks, which would operate on smaller blocks of the ultra-large alignments, to boost the performance of CHAPAO in terms of both scalability and compression gain would be interesting. Unlike LZMA (7-zip), CHAPAO cannot take intra-sequence similarity into account. Capturing the similarity within a particular sequence may improve the performance of CHAPAO, and future studies need to investigate this. CHAPAO, in its current form, can run on protein sequence alignments, but it is not particularly tailored for protein alphabets. As the number of different characters present in protein sequences is significantly higher than the alphabet size of DNA sequences, special customization is required to handle alignments of protein families (as was done in CoMSA [53]). Besides the proposed likelihood-based technique, other techniques for capturing the similarity and redundancy in protein MSAs need to be explored to identify appropriate approaches for protein MSAs. Although we have performed an extensive experimental study on various datasets with a wide range of model conditions, future studies need to analyze more datasets to further investigate the relative strengths and weaknesses of various methods and to guide users in choosing the right compressors for different datasets. Our proposed hierarchical referencing technique is also expected to be of interest for efficiently compressing short reads generated by next generation sequencing technologies, which we leave as future work. Finally, CHAPAO represents a notable contribution towards designing compression algorithmic frameworks for biomolecular sequences and should be considered as a potential alternative to the widely used general purpose compression techniques.

Supplementary results and data.

(PDF)

9 Nov 2021

PONE-D-21-27907

CHAPAO: likelihood and hierarchical reference based representation of biomolecular sequences and applications to compressing multiple sequence alignments

PLOS ONE

Dear Dr. Bayzid,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Dec 24 2021 11:59PM.

Kind regards,
Ram Kumar Sharma, Ph.D
Academic Editor
PLOS ONE

Additional Editor Comments:

Both the reviewers indicated the concerns on the current version of the manuscript.

Reviewers' comments:

Reviewer's Responses to Questions

1. Is the manuscript technically sound, and do the data support the conclusions?
Reviewer #1: Yes
Reviewer #2: Yes

2. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #1: Yes
Reviewer #2: Yes

3. Have the authors made all data underlying the findings in their manuscript fully available?
Reviewer #1: Yes
Reviewer #2: Yes

4. Is the manuscript presented in an intelligible fashion and written in standard English?
Reviewer #1: Yes
Reviewer #2: No

5. Review Comments to the Author

Reviewer #1: The manuscript entitled "CHAPAO: likelihood and hierarchical reference based representation of biomolecular sequences and applications to compressing multiple sequence alignment", by Rahman et al, seems to be very interesting, wherein authors have put good efforts in developing a tool for compressing multiple sequence alignment. The manuscript is nicely written incorporating proper statistical tools in support of CHAPAO. I would like to recommend the manuscript, however, some typos should be corrected in the manuscript, as an instance 'arborescenc' in keywords.

Reviewer #2: Comments to the Author

This is an interesting article presenting a new compression method CHAPAO (Compressing Alignments using Hierarchical and Probabilistic Approach) especially designed for compressing MSAs. I think it contributes towards further alternatives to easily compress and decompress multiple sequence alignments (MSAs) of biomolecular data using hierarchical referencing technique combined likelihood-based analysis. Also, the algorithm was evaluated using various real biological datasets, and can be used as alternative to other compressing techniques. Moreover, several issues described below need to be addressed to improve the quality of this manuscript.

Major comments:

1. In section 3 Experimental studies: What was the optimum window size used to get better compression gain and also less compression time as the authors used 20-40 window size on various datasets
2. In section 3.1 please explain more about the datasets used such as UCS ultra-conserved elements locus sets.
3. In result section intron, exon, UCE alignment in various real datasets could you please explain the how the percentage gain achieved in CHAPAO, on comparing with four other programs. Have you considered each file bin size for deducing percentage?
4. Could you please define how this method will perform if the sequences are highly divergent, and consisting of insertion and deletion which are larger than 50bp.
5. Is this method also considering the protein sequences alignments also?
6. Edit the references according to journal format.
7. Please add author contributions section.

Minor comments:

English language: The grammar throughout the manuscript needs editing and the language in general needs some streamlining.

Introduction:
1. Page 2, Paragraph-1, Line-2, Reference based should be reference-based.
2. Paragraph-3, Line-4, etc should be etc,
3. Paragraph-4, Line-10, format should be formats.
4. Paragraph-4, Line-17, last two decades should be the last two decades.
5. Paragraph-6, Line-7, dataset should be datasets.

Methods:
1. Section 2.1, Line-1, reference based should be reference-based.
2. Line 3, substitution should be substitution,

Experimental studies:
1. Paragraph-3, Line-3. Ram should be RAM,

Section 3.1 Results on avian dataset: Paragraph 1, Line-3, set should be set,

Section 3.4 Impact of sliding window lengths on compression gain: Paragraph-1, Line-7, tend should be tends. Paragraph-1, Line-14, additional should be an additional.

Conclusion: Line 7, reference based should be reference-based. Line-21, DNA sequence should be DNA sequences.

All supplementary data and figures should be quoted in same format.

6. Do you want your identity to be public for this peer review?
Reviewer #1: No
Reviewer #2: No

28 Jan 2022

December 23, 2021

Dr. Emily Chenette
Editor-in-Chief
PLoS ONE

Dear Dr. Chenette,

Thank you and the Academic Editor, Dr. Ram Kumar Sharma for handling our manuscript and for the constructive reviews by the reviewers. We have revised the manuscript by addressing the reviewers’ comments. The new/modified material in the manuscript is provided in blue text to make it easy to identify. Detailed responses to the individual reviews are provided inline. The reviewers' questions led us to perform more experiments and provide additional details, and – we think – the manuscript is much improved now. We very much thank you and the reviewers for pointing out these issues. We hope the revised version will satisfy the reviewers and also meet PLoS ONE’s requirements.

Yours sincerely,
Dr. Md. Shamsuzzoha Bayzid, Corresponding Author
Department of Computer Science and Engineering
Bangladesh University of Engineering and Technology
Email: shams_bayzid@cse.buet.ac.bd

Reviewer #1: The manuscript entitled "CHAPAO: likelihood and hierarchical reference based representation of biomolecular sequences and applications to compressing multiple sequence alignment", by Rahman et al, seems to be very interesting, wherein authors have put good efforts in developing a tool for compressing multiple sequence alignment. The manuscript is nicely written incorporating proper statistical tools in support of CHAPAO. I would like to recommend the manuscript, however, some typos should be corrected in the manuscript, as an instance 'arborescenc' in keywords.

Response: Thank you very much for appreciating our effort and considering this manuscript publishable. We are very sorry for the typos. We made a sincere effort to fix this type of typos.

Reviewer #2: Comments to the Author

This is an interesting article presenting a new compression method CHAPAO (Compressing Alignments using Hierarchical and Probabilistic Approach) especially designed for compressing MSAs. I think it contributes towards further alternatives to easily compress and decompress multiple sequence alignments (MSAs) of biomolecular data using hierarchical referencing technique combined likelihood-based analysis. Also, the algorithm was evaluated using various real biological datasets, and can be used as alternative to other compressing techniques. Moreover, several issues described below need to be addressed to improve the quality of this manuscript.

Response: Thank you very much for your encouraging comments and for considering CHAPAO a useful tool for compressing MSAs. We appreciate your nice suggestions, which we have taken into account (please see our responses below). I hope you will find the revised manuscript publishable.

Major comments:

1. In section 3 Experimental studies: What was the optimum window size used to get better compression gain and also less compression time as the authors used 20-40 window size on various datasets

Response: We have already indicated, in each figure, the particular window size that we used to generate the reported results. We have now added the following sentence in Section 3 to make it clearer. “The particular window size and overlap size that we used to generate the results are indicated in each subsequent figure.”

2. In section 3.1 please explain more about the datasets used such as UCS ultra-conserved elements locus sets.

Response: Thank you for this suggestion. We now have added more details on these datasets.

3. In result section intron, exon, UCE alignment in various real datasets could you please explain the how the percentage gain achieved in CHAPAO, on comparing with four other programs. Have you considered each file bin size for deducing percentage?

Response: We are sorry that it was not clear. The reported performance gains on a particular dataset are for the entire dataset (not for individual bins). We computed the compression gain by considering the entire size of a dataset after it had been compressed using CHAPAO and other methods. We now have made it clear in Section 3. We computed the improvement, in compression gain, of one method over another as follows.

Improvement of method M1 in comparison with another method M2 = (size of the file compressed by M2 - size of the file compressed by M1) / size of the file compressed by M2

4. Could you please define how this method will perform if the sequences are highly divergent, and consisting of insertion and deletion which are larger than 50bp.

Response: Since CHAPAO aims for capturing the similarity of the sequences, the performance is expected to degrade for highly divergent sequences. We have already demonstrated the impact of divergence/dissimilarity in our original manuscript (please see Section 3.1 (pages 11,12) and Figure 6). As for sequences with larger than 50bp insertions and deletions, we believe that there is no direct association between the quantity of insertion/deletion and the performance of CHAPAO. The length of the sequences, the number of sequences, the divergence between the sequences, and other factors collectively have an impact on CHAPAO's effectiveness. For longer sequences (as in whole genome sequences), 50 bp insertion/deletion should not cause any noticeable problem. However, for shorter sequences, it may affect the performance given the insertions and deletions are in different positions in different sequences of an MSA. But if the indels are aligned (meaning that there is not much divergence in their positions in different MSAs), the performance of CHAPAO should not degrade.

Acknowledging the importance of your suggestions, we investigated the amounts of insertions and deletions in the sequences analyzed in our study. On the exon gene set (in the Avian dataset), around 25% of the MSAs contain sequences with (on average) less than 50 bp indels. The average length of these sequences is 831 bp and the compression ratio (compressed size/original size) attained by CHAPAO is 0.0734. The rest of the MSA files in the exon dataset have more than 50 bp indels (per sequence). These sequences are 1876 bp long (on average), and the compression ratio of CHAPAO is 0.059. This may lead to the incorrect conclusion that CHAPAO performs better on MSAs with > 50 bp indels than on MSAs with fewer than 50 bp indels. However, we suspect that this is attributable to the file size and the level of similarity among the sequences, not the number of indels. In comparison to shorter sequences, CHAPAO was able to capture more similarity in longer sequences, resulting in a greater compression ratio (although by a small margin). Therefore, we believe that there is no notable impact of the number of indels on the performance of CHAPAO. However, without more rigorous and systematic analyses and investigation, we do not want to discuss this topic in the manuscript, and so we did not include these results in the revised version.

5. Is this method also considering the protein sequences alignments also?

Response: Thank you for raising this point. Indeed, this is an important thing to mention/discuss. In fact, we have already discussed this in Sec 4 (Conclusions). CHAPAO, in its current form, can handle protein sequences, but CHAPAO is not particularly tailored for protein alphabets. As the number of different characters present in protein sequences is significantly higher than the alphabet size of DNA sequences, special customization is required to handle alignments of protein families. We leave this as future work.

6. Edit the references according to journal format.

Response: We have formatted the references accordingly.

7. Please add author contributions section.

Response: Thank you for this suggestion. We have added an author contributions section.
Minor comments:

English language: The grammar throughout the manuscript needs editing and the language in general needs some streamlining.

Response: We have fixed all the issues that you reported below. Thank you for pointing out these issues. In addition, we have carefully edited the manuscript for grammatical errors. We believe you will find this manuscript more convincing.

Introduction:
1. Page 2, Paragraph-1, Line-2, Reference based should be reference-based.
2. Paragraph-3, Line-4, etc should be etc,
3. Paragraph-4, Line-10, format should be formats.
4. Paragraph-4, Line-17, last two decades should be the last two decades.
5. Paragraph-6, Line-7, dataset should be datasets.

Methods:
1. Section 2.1, Line-1, reference based should be reference-based.
2. Line 3, substitution should be substitution,

Experimental studies:
1. Paragraph-3, Line-3. Ram should be RAM,

Section 3.1 Results on avian dataset: Paragraph 1, Line-3, set should be set,

Section 3.4 Impact of sliding window lengths on compression gain: Paragraph-1, Line-7, tend should be tends. Paragraph-1, Line-14, additional should be an additional.

Conclusion: Line 7, reference based should be reference-based. Line-21, DNA sequence should be DNA sequences.

All supplementary data and figures should be quoted in same format.

Response: We have fixed this.

Submitted filename: Reviewers-response-PloS-ONE-CHAPAO.pdf

1 Mar 2022

CHAPAO: likelihood and hierarchical reference-based representation of biomolecular sequences and applications to compressing multiple sequence alignments

PONE-D-21-27907R1

Dear Dr. Bayzid,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Kind regards,
Ram Kumar Sharma, Ph.D
Academic Editor
PLOS ONE

Additional Editor Comments (optional):

Since all the queries have been reasonably responded to and the necessary suggested changes have been included, the current version can be accepted for publication.

Reviewers' comments:

Reviewer's Responses to Questions

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here.
Reviewer #2: All comments have been addressed

2. Is the manuscript technically sound, and do the data support the conclusions?
Reviewer #2: Yes

3. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #2: Yes

4. Have the authors made all data underlying the findings in their manuscript fully available?
Reviewer #2: Yes

5. Is the manuscript presented in an intelligible fashion and written in standard English?
Reviewer #2: Yes

6. Review Comments to the Author
Reviewer #2: (No Response)

7. Do you want your identity to be public for this peer review?
Reviewer #2: No

29 Mar 2022

PONE-D-21-27907R1

CHAPAO: likelihood and hierarchical reference-based representation of biomolecular sequences and applications to compressing multiple sequence alignments

Dear Dr. Bayzid:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of Dr. Ram Kumar Sharma
Academic Editor
PLOS ONE
