| Literature DB >> 21569267 |
Krishna M Roskin, Benedict Paten, David Haussler.
Abstract
BACKGROUND: Continuing research into the global multiple sequence alignment problem has resulted in more sophisticated and principled alignment methods. Unfortunately, these new algorithms often require large amounts of time and memory to run, making it nearly impossible to apply them to large datasets. As a solution, we present two general methods, Crumble and Prune, for breaking a phylogenetic alignment problem into smaller, more tractable sub-problems. We call Crumble and Prune meta-alignment methods because they use existing alignment algorithms and can be used with many current alignment programs. Crumble breaks long alignment problems into shorter sub-problems. Prune divides the phylogenetic tree into a collection of smaller trees to reduce the number of sequences in each alignment problem. These methods are orthogonal: they can be applied together to provide better scaling in terms of both sequence length and sequence depth. Both methods partition the problem such that many of the sub-problems can be solved independently, and the results are then combined to form a solution to the full alignment problem.
Year: 2011 PMID: 21569267 PMCID: PMC3114744 DOI: 10.1186/1471-2105-12-144
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Figure 1. A set of constraints visualized as a sparse alignment. In each species, positions increase from left to right with respect to the partial order. Sequence positions in the same column are equal under the partial order. A set of (x, y) separations breaks the sparse alignment into a set of blocks. Positions (x_k, y_k) and (x_{k+1}, y_{k+1}) define block k, which is composed of core k (light gray) and the adjacent separation (dark gray).
Figure 2. The Crumble pipeline. The pipeline is used after the formation of semi-independent blocks (A). Blocks are aligned (B) and trimmed to remove overlap (C). Overlaps are aligned (D), and the final alignment is formed by concatenation (E). Note that the alignments in (B) and (D) can be performed in parallel.
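The block/trim/concatenate flow of this pipeline can be sketched in miniature. The fixed-length cores, the fixed overlap, and the `align` stand-in below are illustrative assumptions, not the paper's constraint-derived separations; the re-alignment of trimmed overlaps (D) is omitted because it is a no-op when the toy "aligner" preserves columns.

```python
def crumble(seq, core, overlap, align):
    """Toy Crumble pipeline: form overlapping blocks (A), 'align' each
    block independently (B), trim each result back to its core (C), and
    concatenate (E). Fixed-length cores are an illustrative assumption;
    the real method derives block boundaries from alignment constraints."""
    pieces = []
    start = 0
    while start < len(seq):
        end = min(start + core, len(seq))            # core k
        a = max(0, start - overlap)                  # block = core + separations
        b = min(len(seq), end + overlap)
        aligned = align(seq[a:b])                    # independent sub-problem
        pieces.append(aligned[start - a:end - a])    # trim back to the core
        start = end
    return "".join(pieces)

# With an identity "aligner", trimming and concatenation reconstruct the input.
print(crumble("ACGTACGTACGT" * 4, core=7, overlap=3, align=lambda s: s))
```

In the real pipeline each `align(...)` call is a job handed to a separate cluster node; the loop structure makes the independence of those jobs explicit.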
Figure 3. Prune partitioning of a phylogenetic tree of 44 species. Prune run with a maximum sub-tree size of 10 sequences breaks the tree into seven sub-trees. Six of the sub-trees (light gray) can be aligned in parallel because they contain only known leaf sequences (filled squares; out-groups for sub-trees are not shown). Once these six sub-trees are aligned and the sequence of the roots (filled circles) is inferred, the internal sub-tree (dark gray) can be aligned. Note that this sub-tree includes both leaf sequences and inferred sequences. The alignments from the light gray sub-trees are merged with the alignment of the dark gray sub-tree to form the alignment of the entire tree.
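The staged decomposition shown in the figure can be sketched with a greedy bottom-up cut. This is an illustrative heuristic, not the paper's exact partitioning algorithm; here trees are nested tuples and strings stand in for leaf sequences.

```python
def prune_partition(tree, max_leaves):
    """Toy Prune: cut a binary tree into sub-problems of at most
    max_leaves leaves each. A sub-tree that would exceed the limit is
    emitted as an independent alignment problem and replaced by a
    placeholder for its inferred root sequence, so later stages align
    a mix of leaf and inferred sequences, as in Figure 3."""
    parts = []

    def walk(node):
        if isinstance(node, str):                # a known leaf sequence
            return node, 1
        left, n_left = walk(node[0])
        right, n_right = walk(node[1])
        if n_left + n_right > max_leaves:
            parts.append((left, right))          # align this sub-tree in an earlier stage
            return f"root{len(parts)}", 1        # inferred root joins the parent problem
        return (left, right), n_left + n_right

    top, _ = walk(tree)
    parts.append(top)                            # final stage: tree over inferred roots
    return parts

tree = ((("a", "b"), ("c", "d")), (("e", "f"), ("g", "h")))
print(prune_partition(tree, max_leaves=3))
```

With a limit of 3 leaves, the eight-leaf tree yields two leaf-only sub-problems (alignable in parallel) and a final internal problem over the two inferred roots, mirroring the light gray/dark gray staging in the figure.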
Figure 4. Example of the Maximal root inference method. Every alignment column is assigned the most frequently occurring base in the column. Thus Maximal infers the longest possible root sequence that fits within the alignment.
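A minimal sketch of this per-column consensus follows, assuming gap characters are excluded from the counts (consistent with inferring the longest possible root sequence); the paper's exact tie-breaking rule is not specified here.

```python
from collections import Counter

def maximal_root(alignment):
    """Hedged sketch of the Maximal method: per column, keep the most
    frequent base. Gaps ('-') are ignored when counting (an assumption),
    so any column containing at least one base contributes a base to the
    inferred root sequence."""
    root = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c != "-")
        if counts:                                # all-gap columns emit nothing
            root.append(counts.most_common(1)[0][0])
    return "".join(root)

print(maximal_root(["AC-T", "AG-T", "ACGT"]))  # prints "ACGT"
```

Note the third column: two gaps and one G still yield a G in the root, which is why Maximal produces the longest root sequence consistent with the alignment.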
Comparison between root inference methods.

| Method | Max sub-tree size (nodes) | Maximal | Ortheus |
|---|---|---|---|
| Prune w/Pecan | 30 | 0.880 | 0.579 |
| Prune w/Pecan | 15 | 0.909 | 0.560 |
| Prune w/Pecan | 7 | 0.912 | 0.555 |
| Prune w/FSA | 30 | 0.912 | 0.574 |
| Prune w/FSA | 15 | 0.893 | 0.523 |
| Prune w/FSA | 7 | 0.885 | 0.495 |
| Prune w/MUSCLE | 30 | 0.899 | 0.579 |
| Prune w/MUSCLE | 15 | 0.896 | 0.555 |
| Prune w/MUSCLE | 7 | 0.905 | 0.501 |
The average agreement score of Prune alignments when the Maximal and Ortheus root inference methods are used. Fifty alignment problems, each with fifty leaf species and ~10 kilobases of sequence, were used. Three underlying alignment algorithms and three different maximum sub-tree sizes were used in the comparison. For this application, the faster Maximal method outperformed Ortheus across all comparisons.
Figure 5. Schema of the Job-tree job system. In Job-tree, a job *a* creates a set of jobs that perform a task in parallel; these jobs are collectively called the children of job *a*. Job *a* also creates a follow-on job to be performed after all of its children have successfully completed. The follow-on job is responsible for cleaning up the input files created for the children and for any further processing. After job *a* ends successfully, the batch system runs the children; these jobs may, in turn, have children and follow-on jobs of their own. Once all descendants have completed, the follow-on job is run, and it may create more children.
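The child/follow-on contract can be sketched as below. The `(children, follow_on)` return convention and the sequential loop are illustrative assumptions; Job-tree itself hands children to a batch system to run in parallel.

```python
def run(job):
    """Toy Job-tree runner: a job returns (children, follow_on). Every
    child, and all of its descendants, completes before the follow-on
    runs; a real batch system would dispatch the children in parallel."""
    children, follow_on = job()
    for child in children:       # sequential here; parallel under Job-tree
        run(child)
    if follow_on is not None:    # runs only after all descendants finish
        run(follow_on)

log = []

def make_job(name, children=(), follow_on=None):
    def job():
        log.append(name)
        return list(children), follow_on
    return job

# Job "a" spawns two parallelizable children and a follow-on for cleanup.
root = make_job("a",
                children=[make_job("child1"), make_job("child2")],
                follow_on=make_job("follow-on"))
run(root)
print(log)  # ['a', 'child1', 'child2', 'follow-on']
```

The key invariant, preserved in this sketch, is that a follow-on never starts until every descendant of its parent job has succeeded.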
Crumble results for different-sized simulated datasets and underlying alignment methods.
| Method | Sub-problem size | Time (60 kb) | Agreement (60 kb) | Time (150 kb) | Agreement (150 kb) | Time (500 kb) | Agreement (500 kb) | Time (1000 kb) | Agreement (1000 kb) |
|---|---|---|---|---|---|---|---|---|---|
| Pecan¹ | — | 3.43 | 0.896 | 10.6 | 0.905 | 46.9 | 0.906 | 100 | 0.906 |
| Crumble w/Pecan | 60% | 3.29 | 0.894 | 7.18 | 0.904 | 21.5 | 0.905 | 51.9 | 0.906 |
| Crumble w/Pecan | 30% | 2.56 | 0.889 | 4.66 | 0.903 | 11.9 | 0.905 | 23.5 | 0.905 |
| Crumble w/Pecan | 15% | 2.39 | 0.859 | 3.77 | 0.893 | 8.29 | 0.903 | 13.9 | 0.905 |
| FSA² | — | 37.4 | 0.886 | — | — | — | — | — | — |
| Crumble w/FSA | 60% | 25.8 | 0.881 | 69.8 | 0.903 | — | — | — | — |
| Crumble w/FSA | 30% | 21.0 | 0.873 | 39.2 | 0.898 | — | — | — | — |
| Crumble w/FSA | 15% | 17.7 | 0.849 | 25.5 | 0.893 | 104 | 0.811 | — | — |
| MUSCLE³ | — | — | — | — | — | — | — | — | — |
| Crumble w/MUSCLE | 60% | — | — | — | — | — | — | — | — |
| Crumble w/MUSCLE | 30% | 128 | 0.707 | — | — | — | — | — | — |
| Crumble w/MUSCLE | 15% | 63.1 | 0.679 | 251 | 0.705 | — | — | — | — |
1 Pecan was run with default parameters.
2 FSA was run with the --exonerate, --anchored, and --softmasked flags.
3 MUSCLE was run with default parameters.
Entries marked "—": the majority of these problems could not be aligned because the underlying aligner ran out of memory.
The run-time and average agreement score of Crumble alignments of different-sized datasets. Several sets of simulated alignment problems were generated using root sequences of 60, 150, 500, and 1000 kilobases. The neutral evolution of each root sequence was simulated over a nine-species tree. Fifty problems were generated per root size, for a total of two hundred test alignment problems. The agreement and run-time (in minutes) for each problem size is the average over the fifty simulated alignments. Crumble was used to break each problem into sub-problems with an approximate core size of 60%, 30%, or 15% of the length of the original problem; each block was allowed to be at most 4 kb larger, as measured in any of the sequences. Pecan, FSA, and MUSCLE were used as the underlying alignment methods, and PrePecan was used to generate the constraints. We were unable to apply FSA directly (without Crumble) to problems of 150 kb or larger because FSA required more than the 4 GB of memory available per cluster node. Using Crumble, we were able to run FSA on problems as large as half a megabase. MUSCLE had more memory issues, but with Crumble we were able to use it on problems as large as 150 kb. For Pecan, Crumble achieved more than a sevenfold speedup with almost no loss of accuracy on the largest problem size.
Prune results for different-sized datasets and underlying alignment methods.
| Method | Sub-tree size | Time (50 leaves) | Agreement (50 leaves) | Time (100 leaves) | Agreement (100 leaves) | Time (500 leaves) | Agreement (500 leaves) | Time (1000 leaves) | Agreement (1000 leaves) |
|---|---|---|---|---|---|---|---|---|---|
| Pecan¹ | — | 21.9 | 0.914 | 297 | 0.879 | — | — | — | — |
| Prune w/Pecan | 60% | 7.26 | 0.880 | 39.2 | 0.862 | — | — | — | — |
| Prune w/Pecan | 30% | 3.13 | 0.909 | 19.6 | 0.839 | — | — | — | — |
| Prune w/Pecan | 15% | 7.26 | 0.912 | 13.3 | 0.878 | 125 | 0.844 | — | — |
| Prune w/Pecan | 7% | 4.24 | 0.909 | 13.5 | 0.849 | 29.1 | 0.907 | 122 | 0.877 |
| FSA² | — | 63.1 | 0.933 | 266 | 0.856 | — | — | — | — |
| Prune w/FSA | 60% | 33.8 | 0.912 | 78.9 | 0.838 | 589 | 0.871 | — | — |
| Prune w/FSA | 30% | 10.5 | 0.893 | 23.8 | 0.838 | 142 | 0.879 | — | — |
| Prune w/FSA | 15% | 4.25 | 0.885 | 17.1 | 0.857 | 40.8 | 0.877 | 150 | 0.861 |
| Prune w/FSA | 7% | 3.00 | 0.866 | 4.23 | 0.842 | 12.7 | 0.903 | 34.8 | 0.887 |
| MUSCLE³ | — | 55.6 | 0.905 | 138 | 0.799 | — | — | — | — |
| Prune w/MUSCLE | 60% | 40.7 | 0.899 | 77.9 | 0.777 | 886 | 0.862 | — | — |
| Prune w/MUSCLE | 30% | 24.7 | 0.896 | 42.8 | 0.777 | 368 | 0.883 | — | — |
| Prune w/MUSCLE | 15% | 15.1 | 0.905 | 29.1 | 0.828 | 185 | 0.899 | 440 | 0.900 |
| Prune w/MUSCLE | 7% | 24.7 | 0.905 | 18.8 | 0.841 | 114 | 0.924 | 228 | 0.928 |
| MAFFT⁴ | — | 3.17 | 0.897 | 5.39 | 0.806 | 20.1 | 0.886 | 25.2 | 0.912 |
| SATé⁵ | — | 101 | 0.915 | 301 | 0.840 | — | — | — | — |
1 Pecan was run with default parameters.
2 FSA was run with the --exonerate, --anchored, --softmasked, and --fast flags.
3 MUSCLE was run with default parameters.
4 MAFFT was run with the --treein option.
5 SATé was run with the -t option but limited to two iterations; we found that additional iterations yielded almost no improvement in accuracy.
Entries marked "—" could not be completed: in most cases the underlying aligner ran out of memory, or the run took longer than three days and was aborted.
The run-time and average agreement score of Prune alignments of different-sized datasets. Several sets of simulated alignment problems were generated using a root sequence of 10 kilobases. The neutral evolution of each root sequence was simulated over trees of 50, 100, 500, and 1000 species. Fifty problems were generated per tree size, for a total of two hundred test alignment problems. The agreement and run-time (in minutes) for each problem size is the average over the fifty simulated alignments. Each underlying alignment method (Pecan, FSA, MUSCLE) was tested directly on the dataset. Prune was then used to break the problems down into sub-trees containing at most 60%, 30%, 15%, or 7% of the nodes in the entire tree. The largest number of stages was six, but most of the problems required no more than three stages. Pecan, FSA, and MUSCLE were used as the underlying alignment methods for Prune. We also aligned with MAFFT and SATé for comparison. To ensure a fair comparison, the true tree topology was passed to SATé (using the -t option) and to MAFFT (using the poorly documented --treein option). We were unable to apply some alignment algorithms to large problems because of very long run-times and memory issues. Using Prune, we were able to use Pecan, FSA, and MUSCLE to solve alignment problems much deeper than could be solved without it. Prune achieved a very large speedup with little loss of accuracy, and sometimes with an increase in accuracy.
Crumble results for 90 kb of genomic DNA from seven species.
| Method | Sub-problem size | Time | Log-likelihood¹ |
|---|---|---|---|
| Pecan² | — | 11.3 | -0.354 |
| Crumble w/Pecan | 60% | 7.42 | -0.355 |
| Crumble w/Pecan | 30% | 4.67 | -0.357 |
| Crumble w/Pecan | 15% | 5.42 | -0.357 |
| FSA³ | — | 38.3 | -0.374 |
| Crumble w/FSA | 60% | 20.4 | -0.375 |
| Crumble w/FSA | 30% | 12.2 | -0.375 |
| Crumble w/FSA | 15% | 9.68 | -0.376 |
| MUSCLE⁴ | — | —⁵ | —⁵ |
| Crumble w/MUSCLE | 60% | — | — |
| Crumble w/MUSCLE | 30% | 153 | -0.363 |
| Crumble w/MUSCLE | 15% | 59.2 | -0.367 |
1 The log-likelihood of the alignment as calculated by phyloFit, in millions of nats.
2 Pecan was run with default parameters.
3 FSA was run with the --exonerate, --anchored, --softmasked, and --fast flags.
4 MUSCLE was run with default parameters.
5 This problem could not be aligned because MUSCLE ran out of memory.
The run-time and log-likelihood score of Crumble alignments. Each underlying alignment method (Pecan, FSA, MUSCLE) was tested directly on the dataset. Crumble was then used to break the problem into sub-problems that were approximately 60%, 30%, and 15% of the length of the original problem. While MUSCLE was unable to align this problem directly, Crumble made it possible to apply MUSCLE to it.
Prune results for twelve alignment problems from the Rfam database.
| Method | Sub-tree size | Time | Agreement |
|---|---|---|---|
| Pecan¹ | — | — | — |
| Prune w/Pecan | 60% | — | — |
| Prune w/Pecan | 30% | 14.6 | 0.651 |
| Prune w/Pecan | 15% | 5.35 | 0.649 |
| Prune w/Pecan | 7% | 2.57 | 0.643 |
| FSA² | — | 13.6 | 0.792 |
| Prune w/FSA | 60% | 10.3 | 0.669 |
| Prune w/FSA | 30% | 4.30 | 0.615 |
| Prune w/FSA | 15% | 2.39 | 0.636 |
| Prune w/FSA | 7% | 2.17 | 0.636 |
| MUSCLE³ | — | 3.67 | 0.709 |
| Prune w/MUSCLE | 60% | 3.03 | 0.704 |
| Prune w/MUSCLE | 30% | 1.23 | 0.649 |
| Prune w/MUSCLE | 15% | 1.03 | 0.672 |
| Prune w/MUSCLE | 7% | 1.42 | 0.659 |
| MAFFT⁴ | — | 0.04 | 0.693 |
| SATé⁵ | — | 93.9 | 0.753 |
1 Pecan was run with default parameters.
2 FSA was run with the --exonerate, --anchored, --softmasked, and --fast flags.
3 MUSCLE was run with default parameters.
4 MAFFT was run with --treein option.
5 SATé was run with the -t option but limited to two iterations; we found that additional iterations yielded almost no improvement in accuracy.
Entries marked "—": the majority of these problems could not be aligned because the underlying aligner ran out of memory.
The run-time and agreement score of Prune alignments of twelve RNA alignment problems from the Rfam database. The average time and agreement over all twelve problems are shown. Pecan, FSA, and MUSCLE were used as the underlying alignment methods for Prune; MAFFT and SATé were also tested for comparison. We were unable to apply Pecan without Prune because of memory issues; with Prune, Pecan could solve these alignment problems. Prune achieved a very large speedup with little loss of accuracy. The other alignment methods also achieved large speedups, but lost more accuracy.