Shixiang Wan, Quan Zou.
Abstract
BACKGROUND: Multiple sequence alignment (MSA) plays a key role in biological sequence analyses, especially in phylogenetic tree construction. The rapid growth of next-generation sequencing data has exposed a shortage of efficient alignment approaches that can handle ultra-large sets of biological sequences of different types.
Keywords: Distributed computing; Multiple sequence alignment; Phylogenetic trees; Spark
Year: 2017 PMID: 29026435 PMCID: PMC5622559 DOI: 10.1186/s13015-017-0116-x
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Fig. 1 A simple Spark workflow
Fig. 2 Traceback procedure and pairwise alignment results of Smith–Waterman algorithm
Fig. 3 MSA procedures based on Spark distributed framework
Fig. 4 Constructing phylogenetic trees based on distance measure
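Fig. 2 concerns the traceback step of the Smith–Waterman algorithm. As a rough illustration of that general technique (not the paper's Spark implementation), a minimal single-machine sketch in Python might look as follows. The function name `smith_waterman`, the scoring values (match 2, mismatch -1) and the linear gap penalty (-1) are assumptions chosen for the example; the source does not state its scoring scheme.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment of strings a and b.

    Returns (best_score, aligned_a, aligned_b). Scoring values are
    illustrative assumptions, not taken from the paper.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best local alignment score ending at a[i-1], b[j-1]
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                        # start a new local alignment
                score[i - 1][j - 1] + sub,  # match or mismatch
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Traceback: walk back from the best cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTACGTT", "GGACGGG"))
```

With these assumed scores, the two example strings share only the local segment `ACG`, so the call reports that segment for both sequences. The quadratic score matrix makes this sketch suitable only for short inputs.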
Original dataset and datasets after threshold removal
| Dataset | Number of sequences | Minimum length | Maximum length | Average length | File size |
|---|---|---|---|---|---|
| Φ | 672 | 16,556 | 16,579 | 16,569.7 | 10 MB |
| Φ | 67,200 | As above | As above | As above | 1.1 GB |
| Φ | 672,000 | As above | As above | As above | 11 GB |
| Φ | 108,453 | 807 | 1599 | 1442.8 | 156 MB |
| Φ | 1,011,621 | 807 | 1629 | 1388.5 | 1.4 GB |
| Φ | 17,892 (218 families) | 19 | 4895 | 459.0 | 15 MB |
| Φ | 1,789,200 (218 families) | As above | As above | As above | 1.5 GB |
| Φ | 17,892,000 (218 families) | As above | As above | As above | 15 GB |
Running time and average SPS with genome MSA
| Method | 1st Φ: Time | 1st Φ: Memory | 1st Φ: Avg SPS | 2nd Φ: Time | 2nd Φ: Memory (GB) | 2nd Φ: Avg SPS | 3rd Φ: Time | 3rd Φ: Memory (GB) | 3rd Φ: Avg SPS |
|---|---|---|---|---|---|---|---|---|---|
| MUSCLE | 45 m 23 s | ~ 8 GB | 0.951 | – | – | – | – | – | – |
| MAFFT | 1 m 20 s | ~ 100 MB | 0.926 | 13 m 21 s | ~ 8 | 0.926 | – | – | – |
| Clustal-Omega | 1 h 25 m | ~ 3 GB | 0.913 | 15 h 56 m | ~ 30 | 0.913 | – | – | – |
| HAlign | 2 m 12 s | ~ 300 MB | 0.722 | 26 m 35 s | ~ 8 | 0.722 | 5 h 28 m | ~ 40 | 0.722 |
| HAlign-II | 14 s | ~ 100 MB | 0.723 | 10 m 24 s | ~ 2 | 0.723 | 1 h 25 m | ~ 15 | 0.723 |
Memory is the maximum memory usage (the same as below)
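The running times in these tables are given as strings such as `45 m 23 s` or `1 h 25 m`, which makes direct comparison awkward. A small sketch, assuming only that format, converts them to seconds and computes a speedup ratio; the helper names `to_seconds` and `speedup` are hypothetical and not part of any tool mentioned here.

```python
import re

# Seconds per unit for the "h", "m", "s" suffixes used in the tables.
_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def to_seconds(text):
    """Convert a duration like '1 h 25 m' or '45 m 23 s' to seconds."""
    parts = re.findall(r"(\d+)\s*([hms])", text)
    if not parts:
        raise ValueError(f"no duration found in {text!r}")
    return sum(int(value) * _UNIT_SECONDS[unit] for value, unit in parts)


def speedup(baseline, candidate):
    """How many times faster `candidate` is than `baseline`."""
    return to_seconds(baseline) / to_seconds(candidate)


# HAlign vs HAlign-II on the first dataset of the genome MSA table above.
print(round(speedup("2 m 12 s", "14 s"), 1))
```

For the first genome dataset, HAlign's 2 m 12 s (132 s) against HAlign-II's 14 s gives a ratio of about 9.4.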
Running time and average SPS with RNA MSA
| Method | 1st Φ: Time | 1st Φ: Memory (GB) | 1st Φ: Avg SPS | 2nd Φ: Time | 2nd Φ: Memory (GB) | 2nd Φ: Avg SPS |
|---|---|---|---|---|---|---|
| MUSCLE | 1 h 25 m | ~ 13 | 0.821 | – | – | – |
| MAFFT | 45 m 33 s | ~ 10 | 0.815 | – | – | – |
| Clustal-Omega | 4 h 16 m | ~ 10 | 0.835 | – | – | – |
| HAlign | 1 h 32 s | ~ 2 | 0.631 | 3 h 15 m | ~ 10 | 0.631 |
| HAlign-II | 23 m 34 s | ~ 1 | 0.633 | 59 m 42 s | ~ 2 | 0.633 |
Running time and average SPS with protein MSA
| Method | 1st Φ: Time | 1st Φ: Memory (MB) | 1st Φ: Avg SPS | 2nd Φ: Time | 2nd Φ: Memory (GB) | 2nd Φ: Avg SPS | 3rd Φ: Time | 3rd Φ: Memory (GB) | 3rd Φ: Avg SPS |
|---|---|---|---|---|---|---|---|---|---|
| MUSCLE | 3 m 13 s | ~ 300 | 0.892 | ~ 34 h | ~ 10 | 0.892 | – | – | – |
| MAFFT | 5 m 53 s | ~ 100 | 0.878 | 37 m 51 s | ~ 5 | 0.878 | 7 h 22 m | ~ 30 | 0.878 |
| Clustal-Omega | 54 m 08 s | ~ 100 | 0.912 | 28 h 15 m | ~ 5 | 0.912 | – | – | – |
| SparkSW | 3 m 23 s | ~ 100 | 0.716 | 1 h 12 m | ~ 2 | 0.716 | 5 h 30 m | ~ 18 | 0.716 |
| HAlign-II | 1 m 45 s | ~ 100 | 0.695 | 26 m 36 s | ~ 1 | 0.695 | 2 h 52 m | ~ 10 | 0.695 |
Running time during phylogenetic trees construction
| Dataset | IQ-TREE: Time | IQ-TREE: Memory | HPTree: Time | HPTree: Memory | HAlign-II: Time | HAlign-II: Memory |
|---|---|---|---|---|---|---|
| Φ | 9 m 52 s | ~ 100 MB | 1 m 25 s | ~ 300 MB | 27 s | ~ 100 MB |
| Φ | 1 h 2 m | ~ 5 GB | 45 m 32 s | ~ 8 GB | 17 m 45 s | ~ 1 GB |
| Φ | – | – | – | – | 1 h 45 m | ~ 10 GB |
| Φ | – | – | 6 h 23 m | ~ 2 GB | 52 m 39 s | ~ 1 GB |
| Φ | – | – | 28 h 36 m | ~ 10 GB | 8 h 20 m | ~ 2 GB |
| Φ | 22 m 12 s | ~ 100 MB | Not supported | Not supported | 2 m 24 s | ~ 100 MB |
| Φ | 5 h 05 m | ~ 5 GB | Not supported | Not supported | 33 m 32 s | ~ 1 GB |
| Φ | 38 h 52 m | ~ 30 GB | Not supported | Not supported | 3 h 36 m | ~ 10 GB |
Fig. 5 Average maximum memory usage of various experiments
Fig. 6 Running time with increasing worker nodes