Enzo Rucci, Carlos Garcia, Guillermo Botella, Armando De Giusti, Marcelo Naiouf, Manuel Prieto-Matias.
Abstract
BACKGROUND: The Smith-Waterman (SW) algorithm is the best choice for searching similar regions between two DNA or protein sequences. However, it may become impracticable in some contexts due to its high computational demands. Consequently, the computer science community has focused on the use of modern parallel architectures such as Graphics Processing Units (GPUs), Xeon Phi accelerators and Field Programmable Gate Arrays (FPGAs) to speed up large-scale workloads.
Keywords: DNA; FPGA; High-performance computing; OpenCL; Smith-Waterman
Year: 2018 PMID: 30458766 PMCID: PMC6245597 DOI: 10.1186/s12918-018-0614-6
Source DB: PubMed Journal: BMC Syst Biol ISSN: 1752-0509
Fig. 1 Data dependences in the alignment matrix H. Red arrows indicate the data dependences among cells while green arrows denote cells that can be computed simultaneously
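The dependence pattern described in the Fig. 1 caption can be illustrated with a minimal sketch. It assumes a linear gap penalty and illustrative scoring values (match = 1, mismatch = -1, gap = -2); the function name and these parameters are assumptions for illustration, not the paper's kernel. Each cell H[i][j] depends only on its upper, left and upper-left neighbours, so all cells on the same anti-diagonal (i + j constant) are independent and could be computed simultaneously.

```python
def sw_score_antidiagonal(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score, filling H one anti-diagonal at a time.

    Illustrative scoring scheme (linear gaps); not the scheme used in the paper.
    """
    m, n = len(a), len(b)
    # Row 0 and column 0 stay at zero, as in the standard SW recurrence.
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for d in range(2, m + n + 1):          # d = i + j identifies one anti-diagonal
        lo = max(1, d - n)
        hi = min(m, d - 1)
        # Cells of this anti-diagonal only read values from diagonals d-1 and d-2,
        # so the iterations of this inner loop are independent of each other.
        for i in range(lo, hi + 1):
            j = d - i
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,
                H[i - 1][j - 1] + s,       # upper-left neighbour
                H[i - 1][j] + gap,         # upper neighbour
                H[i][j - 1] + gap,         # left neighbour
            )
            best = max(best, H[i][j])
    return best
```

In a sequential language the inner loop still runs cell by cell; the point of the sketch is only that its iterations carry no dependence on one another, which is what a parallel device can exploit.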
OpenCL memory model for the Intel Arria 10 FPGA and the resources available in the Arria 10 FPGA
| OpenCL | FPGA | Intel Arria 10 FPGA |
|---|---|---|
| Global | External | 2 GB DDR3 |
| Constant | Cache | 32 KB DDR3 |
| Local | Embedded | 67 Mbits |
| Private | Registers | 67244 Kbits |
Fig. 2 Graphic representation of our OpenCL kernel implementation
Experimental platforms used in the tests
| Platform | FPGA | GPU | Xeon Phi |
|---|---|---|---|
| Host CPU | 2 × Intel Xeon E5-2670 2.60 GHz (16 cores, 32 GB RAM) | 2 × Intel Xeon E5-2695 v3 2.30 GHz (28 cores, 64 GB RAM) | 2 × Intel Xeon E5-2695 v3 2.30 GHz (28 cores, 128 GB RAM) |
| Accelerator | Intel Arria 10 GX (2 GB RAM) | NVIDIA GTX 980 (Maxwell architecture, 2048 CUDA cores, 4 GB RAM); NVIDIA GTX 1080 (Pascal architecture, 2560 CUDA cores, 8 GB RAM) | Intel Xeon Phi 3120P (Knights Corner generation, 57 cores, 6 GB RAM) |
| Operating system | CentOS release 6.5 | Debian release 8.0 | CentOS release 6.5 |
| Software | Intel ICC 17.0.1.132; Intel FPGA OpenCL SDK 16.0 | Intel ICC 17.0.1.132; CUDA SDK 7.5 | Intel ICC 17.0.1.132 |
Information of the sequences used in the tests
| Set | Sequence 1 accession | Sequence 1 size | Sequence 2 accession | Sequence 2 size | Matrix size (cells) | Score |
|---|---|---|---|---|---|---|
| Small | AF133821.1 | 10K | AY352275.1 | 10K | 100K | 5027 |
| Small | NC_001715.1 | 57K | AF494279.1 | 57K | 3M | 51 |
| Small | NC_000898 | 162K | NC_007605 | 172K | 28M | 18 |
| Small | NC_003064.2 | 543K | NC_000914.1 | 536K | 291M | 48 |
| Medium | CP000051.1 | 1M | AE002160.2 | 1M | 1G | 82091 |
| Medium | BA000035.2 | 3M | BX927147.1 | 3M | 9G | 3888 |
| Medium | AE016879.1 | 5M | AE017225.1 | 5M | 25G | 5220775 |
| Medium | NC_005027.1 | 7M | NC_003997.3 | 5M | 35G | 157 |
| Medium | NC_017186.1 | 10M | NC_014318.1 | 10M | 100G | 10235056 |
| Medium | NT_033779.4 | 23M | NT_037436.3 | 25M | 575G | 9059 |
| Large | NC_000021.9 | 48M | NC_006488.4 | 34M | 1.6T | 24922392 |
| Large | NC_000022.11 | 51M | NC_006489.4 | 38M | 1.9T | 20133752 |
| Large | NC_000019.10 | 59M | NC_006486.4 | 62M | 3.7T | 23570332 |
| Large | NC_000020.11 | 65M | NC_006487.4 | 67M | 4.4T | 35488641 |
Performance and resource usage comparison for the different OpenCL kernel implementations
| Kernel integer type | int (32 bits) | int (32 bits) | int (32 bits) | int (32 bits) | short (16 bits) | short (16 bits) | short (16 bits) | char (8 bits) | char (8 bits) | char (8 bits) |
|---|---|---|---|---|---|---|---|---|---|---|
| Maximum value | 2147483647 | 2147483647 | 2147483647 | 2147483647 | 32767 | 32767 | 32767 | 127 | 127 | 127 |
| BW | 256 | 512 | 1024 | 1152 | 512 | 1024 | 1536 | 512 | 1024 | 1536 |
| Resource usage: ALMs | 29% | 49% | 87% | 94% | 32% | 52% | 73% | 21% | 31% | 41% |
| Resource usage: Regs | 3% | 3% | 4% | 4% | 3% | 4% | 5% | 3% | 4% | 4% |
| Resource usage: RAM | 8% | 8% | 20% | 22% | 7% | 18% | 27% | 7% | 18% | 23% |
| Resource usage: DSPs | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| Matrix size (cells) | Performance (GCUPS) | | | | | | | | | |
| 100K | 24.15 | 31.57 | 44.99 | 49.81 | 48.00 | 52.35 | 56.92 | - | - | - |
| 3M | 34.94 | 61.59 | 101.89 | 105.14 | 80.71 | 122.72 | 160.44 | 93.03 | 152.75 | 223.1 |
| 28M | 36.70 | 68.11 | 119.15 | 122.91 | 85.96 | 146.80 | 186.74 | 102.50 | 173.23 | 255.49 |
| 291M | 37.32 | 69.23 | 122.32 | 126.95 | 87.18 | 149.90 | 195.17 | 105.14 | 181.16 | 268.83 |
| 1G | 37.42 | 70.13 | 124.93 | 129.44 | - | - | - | - | - | - |
| 9G | 37.84 | 70.80 | 126.96 | 131.45 | 88.40 | 155.85 | 202.56 | - | - | - |
| 25G | 37.91 | 70.92 | 127.49 | 131.96 | - | - | - | - | - | - |
| 35G | 37.93 | 70.94 | 127.47 | 131.98 | 88.71 | 156.43 | 203.51 | - | - | - |
| 100G | 37.98 | 70.99 | 127.68 | 132.15 | - | - | - | - | - | - |
| 575G | 38.03 | 71.09 | 127.85 | 132.33 | 88.87 | 156.83 | 204.06 | - | - | - |
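The "-" entries in the table above line up with the alignment scores listed for each sequence pair: a narrower integer type has no reported rate whenever the optimal score exceeds that type's maximum value. This is an inference from the tables rather than a statement in the source. Under that assumption, choosing the narrowest usable type could be sketched as follows (the function and dictionary names are hypothetical):

```python
# Maximum representable values, as listed in the kernel comparison table.
INT_MAX = {"char": 127, "short": 32767, "int": 2147483647}

def narrowest_type(optimal_score):
    """Return the narrowest integer type able to hold the given alignment score.

    Assumes (as an inference from the tables) that a kernel is usable only
    when the optimal score does not exceed the maximum of its integer type.
    """
    for name in ("char", "short", "int"):   # narrowest first
        if optimal_score <= INT_MAX[name]:
            return name
    raise ValueError("score does not fit in a 32-bit signed integer")
```

For instance, the pair with score 51 fits in `char`, the pair with score 5027 needs `short`, and the pair with score 82091 needs `int`, matching which columns report a rate for those rows.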
Performance comparison among SW implementations using the small and medium sets
| Implementation | SWIFOLD | SWAPHI-LS | SW# | CUDAlign | SW# | CUDAlign |
|---|---|---|---|---|---|---|
| Accelerator | Intel Arria 10 GX | Intel Xeon Phi 3120P | NVIDIA GTX 980 | NVIDIA GTX 980 | NVIDIA GTX 1080 | NVIDIA GTX 1080 |
| Matrix size (cells) | Performance (GCUPS) | | | | | |
| 100K | 49.81 (56.92) | 0.42 | 0.3 | 0.03 | 0.23 | 0.03 |
| 3M | 105.14 (223.1) | 7.69 | 7.62 | 1.08 | 7.55 | 1.08 |
| 28M | 122.91 (255.49) | 21.24 | 33.33 | 8.18 | 41.47 | 8.63 |
| 291M | 126.95 (268.83) | 30.67 | 64.53 | 45.89 | 111.60 | 58.24 |
| 1G | 129.44 | 32.84 | 75.24 | 79.21 | 144.97 | 117.97 |
| 9G | 131.45 (202.56) | 33.9 | 69.54 | 84.05 | 143.50 | 152.63 |
| 25G | 131.96 | 34.16 | 120.92 | 160.79 | 255.89 | 295.43 |
| 35G | 131.98 (203.51) | 34.38 | 68.84 | 84.43 | 142.12 | 155.19 |
| 100G | 132.15 | 33.19 | 118.81 | 163.77 | 253.13 | 297.05 |
| 575G | 132.33 (204.06) | 30.36 | 67.55 | 84.84 | 143.51 | 158.13 |
SWIFOLD performance rates correspond to the best 32-bit kernel version; faster rates obtained with smaller data types are also reported (in parentheses) where applicable
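The rates in these tables are given in GCUPS (giga cell updates per second): the number of alignment-matrix cells computed, divided by the runtime in seconds and by 10^9. A minimal sketch of that arithmetic follows; the function names and the example inputs are illustrative assumptions, not measurements from the paper.

```python
def gcups(len1, len2, seconds):
    """Giga cell updates per second for a len1 x len2 alignment matrix."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    cells = len1 * len2                     # one cell per pair of residues
    return cells / seconds / 1e9

def runtime_seconds(len1, len2, rate_gcups):
    """Runtime implied by a given GCUPS rate for a len1 x len2 matrix."""
    if rate_gcups <= 0:
        raise ValueError("rate_gcups must be positive")
    return (len1 * len2) / (rate_gcups * 1e9)

# Hypothetical example: two sequences of 10,000 residues aligned in 1 second.
example_rate = gcups(10_000, 10_000, 1.0)
```

Because the metric normalises by matrix size, it allows rates for very different sequence lengths to be compared on one scale.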
Performance comparison among SW implementations using the large set
| Implementation | SWIFOLD | SWAPHI-LS | SW# | CUDAlign | SW# | CUDAlign |
|---|---|---|---|---|---|---|
| Accelerator | Intel Arria 10 GX | Intel Xeon Phi 3120P | NVIDIA GTX 980 | NVIDIA GTX 980 | NVIDIA GTX 1080 | NVIDIA GTX 1080 |
| Matrix size (cells) | Performance (GCUPS) | | | | | |
| 1.6T | 132.41 | 31.03 | 91.54 | 122.14 | 193.56 | 224.15 |
| 1.9T | 132.41 | 27.86 | 84.93 | 110.77 | 180.34 | 231.91 |
| 3.7T | 132.42 | 33.66 | 89.02 | 119.47 | 191.59 | 232.54 |
| 4.4T | 132.43 | 30.41 | 95.61 | 132.22 | 138.22 | 250.78 |
Categorized options of SW implementations on different accelerator devices
| Implementation | SSW | SWIFOLD | SWAPHI-LS | SW# | CUDAlign |
|---|---|---|---|---|---|
| Device | Intel multicore | Intel FPGA | Intel Xeon Phi | NVIDIA GPU | NVIDIA GPU |
| Matrix size (cells) | Performance (GCUPS) | | | | |
| Small | - | +++ | + | ++ | + |
| Medium | - | +++ | + | +++ | +++ |
| Large | - | ++ | + | ++ | +++ |
(+) and (-) mean better and worse options, respectively