Richard Wilton, Alexander S Szalay.
Abstract
In large DNA sequence repositories, archival data storage is often coupled with computers that provide 40 or more CPU threads and multiple GPU (general-purpose graphics processing unit) devices. This presents an opportunity for DNA sequence alignment software to exploit high-concurrency hardware to generate short-read alignments at high speed. Arioc, a GPU-accelerated short-read aligner, can compute WGS (whole-genome sequencing) alignments ten times faster than comparable CPU-only alignment software. When two or more GPUs are available, Arioc's speed increases proportionately because the software executes concurrently on each available GPU device. We have adapted Arioc to recent multi-GPU hardware architectures that support high-bandwidth peer-to-peer memory accesses among multiple GPUs. By modifying Arioc's implementation to exploit this GPU memory architecture, we obtained a further 1.8x-2.9x increase in overall alignment speeds. With this additional acceleration, Arioc computes two million short-read alignments per second in a four-GPU system; it can align the reads from a human WGS sequencer run (over 500 million 150nt paired-end reads) in less than 15 minutes. As WGS data accumulates exponentially and high-concurrency computational resources become widespread, Arioc addresses a growing need for timely computation in the short-read data analysis toolchain.
Year: 2020 PMID: 33166275 PMCID: PMC7676696 DOI: 10.1371/journal.pcbi.1008383
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 1. Lookup table memory layouts supported by Arioc.
The H lookup table is a hash table that contains J-table offsets; the J table contains reference-sequence locations. (a) H and J tables both reside in page-locked system RAM; all memory accesses from the GPU traverse the PCIe bus. (b) A copy of the H table resides in device RAM on each GPU; only J-table data traverses the PCIe bus. (c) The H and J tables are partitioned across device RAM on all available GPUs; GPU peer-to-peer memory accesses use the NVlink interconnect.
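The two-level lookup described above can be sketched as a plain-C, CPU-side simulation. All names, table sizes, and the modulus partitioning rule here are illustrative assumptions for exposition, not Arioc's actual implementation; on real hardware, reading a bucket owned by another GPU in layout (c) would be a peer-to-peer access over NVlink rather than a local array read.

```c
/* Toy model of the H/J lookup from Fig 1 (illustrative only, not Arioc's code).
   H is a hash table whose buckets hold offsets into J; J holds
   reference-sequence locations. In the partitioned layout (c), each bucket
   has an owning GPU, chosen here by a simple modulus so any device can
   compute where a remote bucket lives. */
#include <stdint.h>

#define NUM_GPUS  4        /* partitions, one per GPU device */
#define H_BUCKETS 16       /* toy H-table size */

/* H[b] is the starting offset of bucket b's entries in J;
   H[b+1] - H[b] is the bucket's entry count. */
static uint32_t H[H_BUCKETS + 1];
static uint32_t J[64];

/* Fill the toy tables: bucket b holds (b % 3) reference positions. */
static void build_toy_tables(void) {
    uint32_t off = 0, pos = 1000;
    for (uint32_t b = 0; b < H_BUCKETS; b++) {
        H[b] = off;
        for (uint32_t k = 0; k < b % 3; k++) J[off++] = pos++;
    }
    H[H_BUCKETS] = off;
}

/* Which partition (GPU) owns bucket b in layout (c). */
static int owner_of(uint32_t b) { return (int)(b % NUM_GPUS); }

/* Resolve a seed hash to its list of reference locations.
   Returns the GPU whose device RAM would serve the request. */
static int lookup(uint32_t seed_hash, const uint32_t **hits, uint32_t *n) {
    uint32_t b = seed_hash % H_BUCKETS;
    *hits = &J[H[b]];          /* J-table slice for this bucket */
    *n = H[b + 1] - H[b];      /* number of reference locations */
    return owner_of(b);
}
```

In this sketch, layout (a) corresponds to both arrays living in page-locked host RAM, layout (b) to replicating `H` on every device, and layout (c) to splitting both arrays by `owner_of` so that most lookups stay in local device RAM and the rest travel over the peer-to-peer interconnect.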
Table 1. Computers used for Arioc performance testing.
(See S1 Text for GPU peer-to-peer memory topology).
| computer | CPU threads | GPUs | GPU interconnect |
|---|---|---|---|
| Dell PowerEdge C4140 (a) | 40 @ 2.4GHz | 4 × Nvidia V100 | NVlink 2.0 (SXM2) |
| AWS p3dn.24xlarge instance (b) | 96 @ 2.5GHz | 8 × Nvidia V100 | NVlink 2.0 (SXM2) |
| Nvidia DGX-2 (c) | 96 @ 2.7GHz | 16 × Nvidia V100 | NVlink 3.0 (SXM3) |
(a) Dell EMC [13].
(b) AWS [14].
(c) PSC [15, 16].
Table 2. Whole genome sequencing (WGS) and whole genome bisulfite sequencing (WGBS) runs used for Arioc performance testing.
| sample:run | type | pairs (total reads) | read length | properly mapped |
|---|---|---|---|---|
| ERP010710:ERR1347703 | WGS | 681,380,865 (1,362,761,730) | 2×100nt | 79.31% |
| ERP010710:ERR1419128 | WGS | 596,611,242 (1,193,222,484) | 2×100nt | 96.56% |
| SRP117159:SRR6020687 | WGBS | 534,647,118 (1,069,294,236) | 2×150nt | 89.95% |
| SRP117159:SRR6020688 | WGS | 419,380,558 (838,761,116) | 2×150nt | 96.52% |
(a) See [18].
(b) See [19].
Fig 2. Speed versus sensitivity for three GPU memory-layout techniques.
Speed (reads/second) is greater when the GPU peer-to-peer memory interconnect is used, across a range of sensitivity (% concordant) settings. Speeds are highest when the H and J tables are partitioned across device RAM on all available GPUs and GPU peer-to-peer memory accesses use the direct P2P memory interconnect. Speeds are lower with H in device RAM on each GPU and J in page-locked system memory. Speeds are lowest when H and J both reside in page-locked system RAM. H table: 25GB; J table: 52GB. Data from SRR6020688 (S1 Data).
Fig 3. Sensitivity (as overall percentage of concordantly mapped pairs), overall elapsed time, and dollar cost for WGS and WGBS alignment on Amazon Web Services virtual machine instances.
WGS results for SRR6020688 (human, 419,380,558 2×150nt pairs); WGBS results for SRR6020687 (human, 534,647,118 2×150nt pairs). Arioc: EC2 p3dn.24xlarge instance ($31.212/hour). Bowtie 2 and Bismark: EC2 m5.12xlarge instance ($2.304/hour).