Suzanne J Matthews, Tiffani L Williams.
Abstract
BACKGROUND: Biologists require new algorithms to efficiently compress and store their large collections of phylogenetic trees. Our previous work showed that TreeZip is a promising approach for compressing phylogenetic trees. In this paper, we extend our TreeZip algorithm by handling trees with weighted branches. Furthermore, by using the compressed TreeZip file as input, we have designed an extensible decompressor that can extract subcollections of trees, compute majority and strict consensus trees, and merge tree collections using set operations such as union, intersection, and set difference.
Year: 2011 PMID: 22165819 PMCID: PMC3236838 DOI: 10.1186/1471-2105-12-S10-S16
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Example trees. A collection of three evolutionary trees on six taxa labeled A to F. Each edge e represents an evolutionary relationship (or bipartition) along with a value that represents the length of the branch.
Figure 2. Example Newick string representations. Newick representations for the phylogenetic trees shown in Figure 1. Two different, but equivalent, Newick representations are given for each tree.
Figure 3. Internal hash table. Our hash table data structure for the phylogenetic trees shown in Figure 1.
Two Sample Files of Weighted Trees
| File | Tree | Newick string |
|---|---|---|
| File 1 | 1 | (((A : 0.12, B : 0.13) : 0.14, C : 0.15) : 0.16, (D : 0.17, (E : 0.18, F : 0.19) : 0.20) : 0.21); |
| File 1 | 2 | (((A : 0.11, B : 0.34) : 0.29, D : 0.23) : 0.22, (C : 0.24, (E : 0.25, F : 0.26) : 0.27) : 0.28); |
| File 1 | 3 | (((A : 0.29, B : 0.11) : 0.31, E : 0.33) : 0.15, (D : 0.38, (C : 0.36, F : 0.37) : 0.32) : 0.31); |
| File 2 | 4 | (((E : 0.18, F : 0.19) : 0.20, D : 0.17) : 0.21, (C : 0.15, (A : 0.12, B : 0.13) : 0.14) : 0.16); |
| File 2 | 5 | (((A : 0.34, B : 0.23) : 0.21, C : 0.53) : 0.24, (F : 0.41, (E : 0.13, D : 0.51) : 0.21) : 0.33); |
| File 2 | 6 | (((A : 0.12, B : 0.43) : 0.21, C : 0.06) : 0.20, (E : 0.04, (D : 0.28, F : 0.33) : 0.02) : 0.41); |
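Tree 1 in File 1 and tree 4 in File 2 are written differently but describe the same tree, which is why comparing Newick strings as text is not enough for set operations on tree collections. The following is a minimal sketch, not TreeZip's actual implementation: it reduces each tree to its set of non-trivial bipartitions and uses that set as a canonical key. The function names `parse_newick`, `bipartitions`, and `topologies` are hypothetical, and the sketch assumes topology-only comparison, so branch lengths are parsed and then discarded.

```python
def parse_newick(text):
    """Parse a Newick string into nested tuples; leaves are taxon names.

    Assumption: branch lengths are skipped, only the topology is kept.
    """
    s = "".join(text.split()).rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        if s[pos] == "(":
            pos += 1                      # consume "("
            children = [node()]
            while s[pos] == ",":
                pos += 1                  # consume ","
                children.append(node())
            pos += 1                      # consume ")"
            result = tuple(children)
        else:
            start = pos
            while pos < len(s) and s[pos] not in ",():":
                pos += 1
            result = s[start:pos]         # leaf label
        if pos < len(s) and s[pos] == ":":
            pos += 1                      # skip the branch length
            while pos < len(s) and s[pos] not in ",()":
                pos += 1
        return result

    return node()


def bipartitions(newick):
    """Return the non-trivial bipartitions of a tree as a frozenset.

    Each bipartition is stored as the side that does NOT contain the
    alphabetically smallest taxon, so equivalent trees give equal sets.
    """
    clades = set()

    def leaves(n):
        if isinstance(n, str):
            return frozenset([n])
        below = frozenset().union(*(leaves(child) for child in n))
        clades.add(below)
        return below

    taxa = leaves(parse_newick(newick))
    anchor = min(taxa)
    result = set()
    for side in clades:
        if anchor in side:
            side = taxa - side            # normalize to the anchor-free side
        if len(side) >= 2 and len(taxa) - len(side) >= 2:
            result.add(side)
    return frozenset(result)


def topologies(trees):
    """Map a collection of Newick strings to a set of canonical topologies."""
    return {bipartitions(t) for t in trees}


FILE1 = [
    "(((A : 0.12, B : 0.13) : 0.14, C : 0.15) : 0.16, (D : 0.17, (E : 0.18, F : 0.19) : 0.20) : 0.21);",
    "(((A : 0.11, B : 0.34) : 0.29, D : 0.23) : 0.22, (C : 0.24, (E : 0.25, F : 0.26) : 0.27) : 0.28);",
    "(((A : 0.29, B : 0.11) : 0.31, E : 0.33) : 0.15, (D : 0.38, (C : 0.36, F : 0.37) : 0.32) : 0.31);",
]
FILE2 = [
    "(((E : 0.18, F : 0.19) : 0.20, D : 0.17) : 0.21, (C : 0.15, (A : 0.12, B : 0.13) : 0.14) : 0.16);",
    "(((A : 0.34, B : 0.23) : 0.21, C : 0.53) : 0.24, (F : 0.41, (E : 0.13, D : 0.51) : 0.21) : 0.33);",
    # Tree 6, written with a colon between B and its branch length.
    "(((A : 0.12, B : 0.43) : 0.21, C : 0.06) : 0.20, (E : 0.04, (D : 0.28, F : 0.33) : 0.02) : 0.41);",
]

if __name__ == "__main__":
    print(bipartitions(FILE1[0]) == bipartitions(FILE2[0]))   # trees 1 and 4
    print(len(topologies(FILE1) & topologies(FILE2)))         # intersection
    print(len(topologies(FILE1) | topologies(FILE2)))         # union
    print(len(topologies(FILE1) - topologies(FILE2)))         # set difference
```

Because the key is a set of bipartitions rather than a string, reordering sibling subtrees in the Newick text does not change it. A version that also tracks weights would need to attach a branch length to each bipartition and decide how to treat trees that share a topology but differ in lengths; that choice is left open here.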
Figure 4. Compression performance. Compression performance for our biological datasets. In this figure, (a) shows running time of compression approaches, while (b) shows space savings.
Figure 5. Compression performance on different, but equivalent Newick strings. Compression performance for our biological datasets using different, but equivalent Newick strings. (a) Unlike TreeZip and TreeZip+7zip, 7zip experiences an increase in compressed file size when different, but equivalent Newick strings are introduced (100% commuted). TreeZip and TreeZip+7zip experience no change. (b) A closer look at how the percentage of different, but equivalent Newick strings affects the increase in file size of 7zip (p% commuted). As more Newick strings are randomly commuted, the performance of 7zip steadily worsens.
Figure 6. Decompression performance. Decompression performance for our biological datasets.
Figure 7. Set operations performance. Performance of set operations on our biological datasets. (a) The running time of a random collection of set operations run on different file formats. (b) The amount of disk space required by the result of the set operations.