Literature DB >> 33166275

Arioc: High-concurrency short-read alignment on multiple GPUs.

Richard Wilton, Alexander S. Szalay.

Abstract

In large DNA sequence repositories, archival data storage is often coupled with computers that provide 40 or more CPU threads and multiple GPU (general-purpose graphics processing unit) devices. This presents an opportunity for DNA sequence alignment software to exploit high-concurrency hardware to generate short-read alignments at high speed. Arioc, a GPU-accelerated short-read aligner, can compute WGS (whole-genome sequencing) alignments ten times faster than comparable CPU-only alignment software. When two or more GPUs are available, Arioc's speed increases proportionately because the software executes concurrently on each available GPU device. We have adapted Arioc to recent multi-GPU hardware architectures that support high-bandwidth peer-to-peer memory accesses among multiple GPUs. By modifying Arioc's implementation to exploit this GPU memory architecture we obtained a further 1.8x-2.9x increase in overall alignment speeds. With this additional acceleration, Arioc computes two million short-read alignments per second in a four-GPU system; it can align the reads from a human WGS sequencer run (over 500 million 150nt paired-end reads) in less than 15 minutes. As WGS data accumulates exponentially and high-concurrency computational resources become widespread, Arioc addresses a growing need for timely computation in the short-read data analysis toolchain.

Entities:  

Mesh:

Year:  2020        PMID: 33166275      PMCID: PMC7676696          DOI: 10.1371/journal.pcbi.1008383

Source DB:  PubMed          Journal:  PLoS Comput Biol        ISSN: 1553-734X            Impact factor:   4.475


This is a PLOS Computational Biology Software paper.

Introduction

Short-read DNA sequencing technology has been estimated to generate 35 petabases of DNA sequencer data per year [1], with the amount of new data increasing exponentially [2]. Much of this data resides in commercial and academic repositories that may contain thousands or tens of thousands of DNA sequencer runs. Copying this quantity of data remotely is costly and time-consuming. For this reason, it can be more practical to analyze and reduce the data without transferring it out of the repository where it is archived. As a consequence, high-throughput computational resources are becoming an integral part of the local computing environment in data centers that archive sequencing data.

A fundamental step in analyzing short-read DNA sequencer data is read alignment, the process of establishing the point of origin of each sequencer read with respect to a reference genome. Read alignment is algorithmically complex and time-consuming; it can represent a bottleneck in the prompt analysis of rapidly accumulating sequencer data.

A fruitful approach to improving read-aligner throughput is to compute alignments using the parallel processing capability of general-purpose graphics processing units, or GPUs. GPU-accelerated read-alignment software such as Arioc [3, 4] and SOAP3-dp [5] can provide order-of-magnitude speed increases compared with CPU-only implementations such as BWA-MEM [6], Bowtie 2 [7], and Bismark [8]. In recent years, the speed of CPU-only implementations has increased with improvements in CPU speed, multithreaded concurrency, and hardware memory management, but GPU-accelerated implementations also run faster on newer GPUs that support a greater number of GPU threads and higher-bandwidth memory-access capabilities (S1 Table).

Increased speed due to more capable GPU memory architecture

In this regard, Arioc benefits specifically from larger GPU device memory and high-bandwidth peer-to-peer (P2P) memory-access topology among multiple GPUs [9, 10]. This is because Arioc's implementation relies on a set of large in-memory lookup tables (LUTs) to identify candidate locations in the reference genome sequence at which to compute alignments. For example, these LUTs can occupy up to 77GB for the human reference genome. When GPU memory is insufficient to contain the LUTs, Arioc places them in page-locked system RAM that the GPU must access across the PCIe bus. When GPU memory is large enough to contain these tables and P2P memory interconnect is supported, LUT data accesses execute 10 or more times faster and the overall speed of the software increases accordingly.
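The LUT design is easiest to see as a two-level structure: a hash table (the H table, in Arioc's terminology; see Fig 1) maps each seed to an offset and count in a flat array of reference-genome locations (the J table). The Python sketch below is illustrative only; the function names and the policy of skipping over-represented seeds are simplified assumptions, not Arioc's actual C++ implementation.

```python
# Illustrative two-level lookup: H maps a k-mer seed to (offset, count)
# into a flat array J of reference-genome positions.

def build_lut(reference, k):
    """Build simplified H and J tables for all k-mers in `reference`."""
    buckets = {}
    for pos in range(len(reference) - k + 1):
        buckets.setdefault(reference[pos:pos + k], []).append(pos)
    H, J = {}, []
    for seed, positions in buckets.items():
        H[seed] = (len(J), len(positions))  # where this seed's hits start, and how many
        J.extend(positions)
    return H, J

def candidate_locations(H, J, seed, maxJ=None):
    """Return candidate reference locations for `seed`; skip highly
    repetitive seeds whose hit count exceeds maxJ (trading sensitivity
    for speed, analogous to Arioc's maxJ runtime parameter)."""
    if seed not in H:
        return []
    offset, count = H[seed]
    if maxJ is not None and count > maxJ:
        return []
    return J[offset:offset + count]

H, J = build_lut("ACGTACGT", 4)
print(candidate_locations(H, J, "ACGT"))          # both occurrences of the seed
print(candidate_locations(H, J, "ACGT", maxJ=1))  # repetitive seed skipped
```

Because J is a single flat array indexed by offsets stored in H, each lookup touches at most two memory regions, which is what makes the access pattern amenable to placement in GPU device memory or page-locked system RAM.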

Lookup-table layouts adapted to available GPU features

Arioc can be configured to use one of three memory-layout configurations for its lookup tables (Fig 1) according to the total amount of installed GPU memory and the availability of GPU P2P memory access:
Fig 1

Lookup table memory layouts supported by Arioc.

The H lookup table is a hash table that contains J-table offsets; the J table contains reference-sequence locations. (a) H and J tables both reside in page-locked system RAM; all memory accesses from the GPU traverse the PCIe bus. (b) A copy of the H table resides in device RAM on each GPU; only J-table data traverses the PCIe bus. (c) The H and J tables are partitioned across device RAM on all available GPUs; GPU peer-to-peer memory accesses use the NVLink interconnect.

A lookup-table layout in page-locked system RAM relies on slower memory accesses through the PCIe bus; a layout in which the smaller of the LUTs is copied to each GPU uses page-locked system RAM only for the larger LUT; and a fully partitioned layout places the LUT data entirely in GPU memory, where data-access speed is greatest.
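The choice among the three layouts of Fig 1 reduces to a decision rule based on installed GPU memory and P2P availability. The following sketch expresses that rule with made-up names and a made-up per-GPU memory reserve; it is a plausible policy consistent with the text, not Arioc's configuration code.

```python
GB = 1 << 30

def choose_lut_layout(h_bytes, j_bytes, gpu_mem_bytes, n_gpus,
                      p2p_available, reserve_bytes):
    """Pick a lookup-table layout (illustrative policy, not Arioc's code).

    reserve_bytes approximates per-GPU memory needed for everything else
    (read batches, alignment working storage)."""
    usable_per_gpu = gpu_mem_bytes - reserve_bytes
    total_usable = usable_per_gpu * n_gpus
    if p2p_available and h_bytes + j_bytes <= total_usable:
        return "partitioned"   # Fig 1(c): H and J split across all GPUs
    if h_bytes <= usable_per_gpu:
        return "H-per-GPU"     # Fig 1(b): H copied to each GPU, J in system RAM
    return "system-RAM"        # Fig 1(a): H and J in page-locked system RAM

# e.g. human genome: H ~ 25GB, J ~ 52GB; four 32GB GPUs with ~8GB reserved each
print(choose_lut_layout(25 * GB, 52 * GB, 32 * GB, 4, True, 8 * GB))
```

With four interconnected 32GB GPUs, the combined usable device memory comfortably exceeds the 77GB of human-genome LUT data, so the fully partitioned layout becomes possible; remove P2P support or GPUs and the rule falls back to the slower layouts.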

Software design and implementation

The Arioc aligner is written in C++. The compiled program runs on a single computer with one or more Nvidia GPUs and a minimum of two concurrent CPU threads per GPU. The program uses 38 different CUDA kernels written in C++ (for management and prioritization of alignment tasks and for alignment computation) and about 150 calls to various CUDA Thrust APIs (sort, set reduction, set difference, string compaction).

The software is implemented as a pipeline through which batches of reads are processed [3]. For each batch of reads, Arioc computes alignments in parallel on each available GPU device, so the number of reads in a batch is limited by available GPU memory. The software uses concurrently executing CPU threads for disk I/O and for computing per-read metadata reported in SAM format [11], including alignment scores, mapping quality scores, and methylation context for WGBS reads.

The current version of Arioc uses the same basic alignment algorithms as earlier versions [3]: a fast spaced-seed nongapped alignment implementation for reads with few differences from the reference sequence, and a slower gapped-alignment implementation for reads with multiple differences from the reference. The accuracy of the current version was validated against SOAP3-dp and Bowtie 2 using simulated human short-read data (S5 Text, S1 Data) at error rates of 0.25% (to approximate typical Illumina sequencer error rates [12]) and 7.0% (to simulate read sequences with multiple differences from the reference genome).
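The pipeline structure described above, with batches of reads dispatched to one worker per GPU, can be sketched schematically. This is an illustrative Python model with hypothetical names; the real implementation is C++/CUDA, and its batch size is governed by GPU memory rather than a fixed count.

```python
import queue
import threading

def run_pipeline(reads, n_gpus, batch_size, align_batch):
    """Feed read batches through per-GPU workers and collect results.
    `align_batch(batch, gpu_id)` stands in for the GPU alignment kernels."""
    batches = queue.Queue()
    results, lock = [], threading.Lock()

    def worker(gpu_id):
        while True:
            batch = batches.get()
            if batch is None:                   # sentinel: no more batches
                break
            mapped = align_batch(batch, gpu_id)
            with lock:                          # writer step (e.g., SAM output)
                results.extend(mapped)

    workers = [threading.Thread(target=worker, args=(g,)) for g in range(n_gpus)]
    for w in workers:
        w.start()
    for i in range(0, len(reads), batch_size):  # batch size bounded by GPU memory
        batches.put(reads[i:i + batch_size])
    for _ in workers:
        batches.put(None)
    for w in workers:
        w.join()
    return results

# Toy run: "alignment" = reversing each read, two simulated GPUs.
out = run_pipeline(["ACGT", "TTAA", "GGCC"], n_gpus=2, batch_size=2,
                   align_batch=lambda batch, gpu: [r[::-1] for r in batch])
print(sorted(out))
```

The essential property the sketch captures is that batches are independent, so adding GPUs adds workers without changing the pipeline's logic, which is why Arioc's throughput can scale with the number of devices.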

Results

We assessed Arioc's performance on three different computers provisioned with multiple Nvidia V100 GPUs (Table 1). The computer systems represented three different operating environments: an experimental high-performance computing cluster with exclusive single-user access to a single machine, an academic research cluster with shared access to computing and network resources, and a commercial cloud environment. To avoid contention from other concurrently executing applications for CPU, GPU, or disk I/O resources, we measured performance only when no other user programs were executing, and we used local (as opposed to network) filesystems for output.
Table 1

Computers used for Arioc performance testing.

(See S1 Text for GPU peer-to-peer memory topology).

computer | CPU threads | system RAM | GPUs | GPU RAM | GPU interconnect
Dell PowerEdge C4140 (a) | 40 @ 2.4GHz (2 x Intel Xeon Gold 6148) | 384GB | 4 x Nvidia V100 | 32GB | NVLink 2.0 (SXM2)
AWS p3dn.24xlarge instance (b) | 96 @ 2.5GHz (2 x Intel Xeon Platinum 8175M) | 768GB | 8 x Nvidia V100 | 32GB | NVLink 2.0 (SXM2)
Nvidia DGX-2 (c) | 96 @ 2.7GHz (2 x Intel Xeon Platinum 8168) | 1.5TB | 16 x Nvidia V100 | 32GB | NVLink 3.0 (SXM3) + NVswitch

(a) Dell EMC [13].

(b) AWS [14].

(c) PSC [15, 16].

Since LUT accesses represent a significant portion of Arioc's execution time, we measured the extent to which the use of GPU P2P interconnect hardware affects Arioc's overall speed. We did this by comparing the speed of each of the three LUT-layout patterns supported by the software. In each computer, we ensured that hardware and CUDA driver software were configured for GPU P2P memory access (S1 Text). On computers whose GPU P2P topology did not include all possible GPU pairs, we ran Arioc only on a subset of GPU devices that supported mutual P2P interconnect.

We used Arioc with human reference genome release 38 [17] to align four public-domain paired-end whole-genome sequencer runs (Table 2) and filtered the results to exclude all but proper mappings, that is, alignments that were concordant with the criteria for Illumina paired-end mappings (forward-reverse orientation of the mates, inferred fragment length no greater than 500). Each of the four sequencer runs provided at least 40x sequencing coverage, but each differed from the others in at least one of the following characteristics:
Table 2

Whole genome sequencing (WGS) and whole genome bisulfite sequencing (WGBS) runs used for Arioc performance testing.

sample:run | type | pairs (reads) | read length | properly mapped
ERP010710:ERR1347703 (a) | WGS | 681,380,865 (1,362,761,730) | 2×100nt | 79.31%
ERP010710:ERR1419128 (a) | WGS | 596,611,242 (1,193,222,484) | 2×100nt | 96.56%
SRP117159:SRR6020687 (b) | WGBS | 534,647,118 (1,069,294,236) | 2×150nt | 89.95%
SRP117159:SRR6020688 | WGS | 419,380,558 (838,761,116) | 2×150nt | 96.52%

(a) See [18].

(b) See [19].

read length: 100nt or 150nt
sequencing technology: WGS or WGBS
percentage of reads with proper (concordant) mappings: below 80% or above 90%

To determine the speed-versus-sensitivity pattern for the alignment of each sequencer run, we recorded speed (overall throughput) and sensitivity for a variety of settings of maxJ, a runtime parameter that specifies the maximum number of candidate reference-sequence locations per seed. When Arioc is configured with higher maxJ settings, speed decreases and sensitivity increases (S2 Text). We carried out speed-versus-sensitivity experiments using all four sets of sequencer reads and with all three LUT layouts in GPU memory. We characterized speed (throughput) as the number of read sequences (paired-end mates) per second for which the software computes alignments. Sensitivity was recorded as the percentage of pairs for which the aligner reported at least one proper (concordant) mapping.

Overall, aligner speed varied by a factor of 10 across all the short-read data and all the hardware and software configurations we tested, with wide variations attributable to the characteristics of the read-sequence data as well as to the runtime software configuration. With each whole-genome sequencing sample, speed always decreased with increasing sensitivity, with a pronounced drop-off in speed as the aligner approached maximum sensitivity.
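The proper-mapping filter described above (forward-reverse orientation of the mates, inferred fragment length no greater than 500) can be written as a simple predicate. The parameter names and the fragment-length computation below are illustrative assumptions, not Arioc's code.

```python
def is_proper_pair(pos1, strand1, pos2, strand2, read_len, max_frag=500):
    """True if two mapped mates form an Illumina-style proper pair:
    the upstream mate maps to the forward strand, the downstream mate
    to the reverse strand, and the inferred fragment length (end of the
    downstream mate minus start of the upstream mate) is <= max_frag."""
    (up_pos, up_strand), (down_pos, down_strand) = sorted(
        [(pos1, strand1), (pos2, strand2)])
    fragment_length = (down_pos + read_len) - up_pos
    return up_strand == "+" and down_strand == "-" and fragment_length <= max_frag

print(is_proper_pair(100, "+", 300, "-", read_len=150))  # FR, fragment 350
print(is_proper_pair(100, "-", 300, "+", read_len=150))  # RF orientation
print(is_proper_pair(100, "+", 600, "-", read_len=150))  # fragment 650, too long
```

Sensitivity as reported here counts a pair only if at least one mapping satisfies this predicate, which is why it is stricter than an "overall mapped" percentage.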

Computing environments

With a four-GPU computer in a high-performance computing research cluster [13], the maximum throughput across all samples and speed-versus-sensitivity settings was 2,195,046 reads/second for 100nt paired-end reads and 1,686,477 reads/second for 150nt paired-end reads. On a commercial cloud compute instance [14], speeds were generally about 10% slower. With an Nvidia DGX-2 computer in an academic research cluster [20], speeds were generally 5–10% faster (S1 Data).

GPU memory

Speed increased in proportion to the amount of LUT data residing in GPU memory as opposed to page-locked system memory (Fig 2). The relative increase in speed varied from about 1.5x to about 2.5x. The difference was most apparent when Arioc was parameterized for maximum speed and lower sensitivity.
Fig 2

Speed versus sensitivity for three GPU memory-layout techniques.

Speed (reads/second) is greater when GPU peer-to-peer memory interconnect is used, for a range of sensitivity (% concordant) settings. Speeds are highest when the H and J tables are partitioned across device RAM on all available GPUs and GPU peer-to-peer memory accesses use the direct P2P memory interconnect. Speeds are lower with H in device RAM on each GPU and J in page-locked system memory. Speeds are lowest when H and J both reside in page-locked system RAM. H table: 25GB; J table: 52GB. Data from SRR6020688 (S1 Data).


Read-sequence characteristics

In general, throughput for 150nt WGBS reads was about 2/3 of the throughput for 150nt WGS reads (S1 Data). With 150nt WGS reads, throughput was about 3/4 of the throughput for 100nt WGS reads. There was no evident relationship between throughput and the percentage of proper (concordant) mappings.

Scaling with additional GPUs

Throughput increased when additional GPU devices were available (S1 Data). Scaling was nearly ideal with 8 or fewer GPUs. With 9 or more GPUs, speed continued to increase but the gain in speed was proportionally smaller with each additional GPU.
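One simple way to see why per-GPU gains shrink beyond 8 GPUs is Amdahl's law: a small fixed fraction of the work (batch I/O, CPU-GPU transfers, shared interconnect bandwidth) does not parallelize. The serialized fraction used below is an assumed illustrative number, not a measured Arioc value.

```python
def amdahl_speedup(n_gpus, serial_fraction):
    """Speedup over one device when a fixed fraction of the work
    remains serialized (Amdahl's law)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_gpus)

# With an assumed 1% serialized fraction, scaling stays close to ideal
# through 8 GPUs but falls visibly short of ideal at 16 GPUs.
for n in (1, 4, 8, 16):
    print(n, round(amdahl_speedup(n, 0.01), 2))
```

Under this toy model, each added GPU shrinks only the parallel term, so the serialized term claims a progressively larger share of runtime, matching the pattern of proportionally smaller gains per additional GPU.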

Disk write bandwidth

With multiple GPUs, Arioc runs up to 10% faster when writing output to a filesystem on a dedicated high-bandwidth disk device than when using a network filesystem (S1 Data). The speed increase attributable to higher-performance disk storage was greater when Arioc was configured for higher speed and lower sensitivity.

Comparison with CPU-only software

We performed WGS and WGBS speed-versus-sensitivity comparisons with a widely used WGS aligner, Bowtie 2 [7], and with a well-known WGBS aligner, Bismark [8] (S1 Data). These observations confirmed the order-of-magnitude difference in speed that we had observed in previous experiments [3, 4]. We also estimated system memory utilization with each aligner (S6 Text). Arioc uses more system memory than CPU-only implementations do, to contain its reference-genome index structures; its maximal memory utilization was about 1/3 of available system memory on the most conservatively provisioned computers we used.

Advantages and limitations

The most important element in the design of the Arioc software is that it uses algorithms and implementation techniques that are amenable to GPU acceleration. This design approach is validated by the data throughputs we have observed on computers provisioned with four or more GPU devices. With Arioc, using GPU P2P memory interconnect leads to higher speeds throughout the usable range of speed-versus-sensitivity configurations. The speed increase is greater with commonly used runtime parameterizations that favor speed over sensitivity. With a four-GPU system, Arioc can align the reads from a human whole-genome sequencing run with 40x coverage in less than 15 minutes.

However, although Arioc supports reference genomes up to 2^34 base pairs (about 17 gigabases) in size, the lookup tables for such genomes are proportionately larger. The use of a partitioned LUT memory configuration to obtain maximum speed for a given reference genome thus depends on the size of the LUTs for that genome as well as on the amount of interconnected GPU memory.

High-concurrency GPU interconnect architectures are increasingly available not only in dedicated high-performance computing environments but also in academic and commercial computing clusters, where they are typically used as shared resources. In the latter case, however, maximum throughput is obtainable only when the P2P topology mutually interconnects all the GPUs on which Arioc executes and when no other concurrently executing programs contend for GPU memory-access bandwidth.

We observed, in both the WGS and the WGBS samples, the same pattern of decreasing speed with increasing sensitivity.
At maximum sensitivity, all short-read aligners display a sharp fall-off in speed due to the amount of work performed in searching for mappings for "hard to map" read sequences that contain multiple differences from the reference genome. Nevertheless, the speed-versus-sensitivity pattern provides insight into the behavior of each of the aligners we evaluated and suggests an optimal choice of speed versus sensitivity for a given data sample. For Arioc in particular, the range of possible speeds depended on how many GPUs were used (that is, on the maximum number of concurrently executing GPU threads), the layout of lookup tables in GPU memory, the sequencing technology (WGS or WGBS), and read length.

Although alignment speeds improve with additional GPUs, there may be limited practical value in using more than four GPUs concurrently. When alignment throughput exceeds one million reads per second, factors other than computation speed become increasingly important. These may include the time required to initialize GPU lookup tables, data-transfer speeds between CPU and GPU memory, available GPU interconnect bandwidth, and available disk I/O bandwidth.

A common approach to aligning multiple WGS samples is to use a CPU-only aligner such as BWA-MEM or Bowtie 2 on dozens or hundreds of CPU cores, either by splitting the work across multiple computers or by using systems that support a large number of CPU threads [21]. Our results suggest that using Arioc in a single multi-GPU computer is more effective, in terms of both sensitivity and throughput, than either CPU-only strategy. Any short-read analysis protocol could benefit from this kind of read-aligner performance, but the potential time and cost savings would be substantial in a large-scale repository of DNA sequencing data. In recent years an increasing number of large WGS datasets have become available in shared-access data centers and in the commercial cloud [22, 23, 24, 25, 26].
In these computing environments, high-throughput short-read alignment is likely to be carried out on every WGS sample, either as an initial step in a short-read alignment analysis pipeline or as preparation for data archival. When we compared Arioc performance with CPU-only short-read aligners where a monetary price is associated with computational resource utilization, Arioc costs half as much and executes ten or more times faster (Fig 3). Although hardware-resource sharing in a cloud environment degrades software performance [27], and pricing for cloud-based computing services, data storage [28], and data transfer [29] is complex and may fluctuate, our results imply that Arioc has significant potential for rapidly and economically aligning large aggregates of short-read sequencer data in a commercial cloud setting.
Fig 3

Sensitivity (as overall percentage of concordantly mapped pairs), overall elapsed time, and dollar cost for WGS and WGBS alignment on Amazon Web Services virtual machine instances.

WGS results for SRR6020688 (human, 419,380,558 150nt pairs); WGBS results for SRR6020687 (human, 534,647,118 150nt pairs). Arioc: EC2 p3dn.24xlarge instance ($31.212/hour). Bowtie 2, Bismark: EC2 m5.12xlarge instance ($2.304/hour).


Future directions and availability

Improving the performance of one component in a pipelined application inevitably accentuates potential bottlenecks in other components of the pipeline. For example, alleviating the bottleneck related to GPU memory management in Arioc emphasized the inverse relationship between read length and throughput. Arioc's seeding strategy is a simple divide-and-conquer procedure that starts by examining the set of contiguous seeds that span the read sequence. With longer read sequences, a strategy of initially examining fewer seeds might increase throughput without sacrificing sensitivity.

Ideally, Arioc's throughput would increase in direct proportion to the number of available GPU devices. That it does not is most likely related to a limitation in memory bandwidth, either in the transfer of data between CPU-addressable memory and on-device GPU memory or in randomly accessing large amounts of data in GPU memory. In either case, achieving near-ideal scaling beyond 8 GPUs would require nontrivial software re-engineering to further decrease the amount of data transferred between CPU and GPU and to optimize the utilization of GPU memory bandwidth [9].

Short-read alignment remains an essential step in DNA sequence analysis at the petabyte and exabyte scale. In an analysis workflow that includes software such as samtools [30], Picard tools [31], or Bismark tools [8], short-read alignment may represent a comparatively small proportion of the time required to complete the workflow. For this reason, a number of software tools are being developed [32, 33] that use high-concurrency computing hardware to process large quantities of WGS and WGBS data. The full potential for rapid analysis of WGS samples with Arioc may eventually be realized only when the entire software toolchain utilizes concurrent CPU and GPU resources. Arioc's current implementation is fast, but advances in hardware and software technology will inevitably lead to even faster implementations in the future.
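The seeding strategy mentioned above (contiguous seeds spanning the read) and the proposed refinement (initially examining fewer seeds on longer reads) can be illustrated with a simple seed extractor. The stride policy below is a hypothetical example, not Arioc's actual heuristic.

```python
def contiguous_seeds(read, k, stride):
    """Extract (offset, k-mer) seeds spanning the read at a given stride."""
    return [(i, read[i:i + k]) for i in range(0, len(read) - k + 1, stride)]

def initial_seed_count(read_len, k):
    """Hypothetical policy: start with fewer, more widely spaced seeds
    on longer reads, refining only if no mapping is found."""
    stride = k if read_len <= 100 else 2 * k
    return len(range(0, read_len - k + 1, stride))

read = "ACGT" * 25                          # a 100nt read
print(len(contiguous_seeds(read, 20, 20)))  # non-overlapping 20-mer seeds
```

Under this policy a 150nt read would initially generate fewer seeds per base than a 100nt read, so the extra lookup work that longer reads impose on the LUTs would be deferred until a read actually fails to map.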
Nevertheless, given the trend toward associating high-concurrency CPU and GPU hardware with large repositories of whole genome sequencing data, Arioc can fill the need for a high-throughput general-purpose short-read alignment tool that exploits the capabilities of the hardware. The Arioc software is available at https://github.com/rwilton/arioc. It is released under a BSD open-source license.

GPU peer-to-peer memory interconnect topology.

(DOCX) Click here for additional data file.

Arioc configuration parameters for WGS and WGBS alignments.

(DOCX) Click here for additional data file.

Bowtie 2 configuration parameters for WGS alignments.

(DOCX) Click here for additional data file.

Bismark configuration parameters for WGBS alignments.

(DOCX) Click here for additional data file.

Correct vs. incorrect mappings, classified by MAPQ.

(DOCX) Click here for additional data file.

Arioc hardware resource utilization.

(DOCX) Click here for additional data file.

Representative Nvidia GPU devices since 2012.

(DOCX) Click here for additional data file.

Short-read aligner versions and distribution websites.

(DOCX) Click here for additional data file. (XLSX) Click here for additional data file. 7 Apr 2020 Dear Dr Wilton, Thank you very much for submitting your manuscript "Arioc:  high-concurrency short-read alignment on multiple GPUs" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation. When you are ready to resubmit, please upload the following: [1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. [2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file). Important additional instructions are given below your reviewer comments. Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts. Thank you again for your submission. 
We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments. Sincerely, Manja Marz Software Editor PLOS Computational Biology Manja Marz Software Editor PLOS Computational Biology *********************** Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: Authors present a new implementation of their short-read aligner Arioc. The proposed implementation takes advantage of the most recent GPU hw architecture supporting high-bandwidth P2P memory access among multiple GPUs to increase its performance in terms of computing time. Although the work is interesting there are several points that need to be clarified. Major comments 1. My main concern is related to the results. 1.a - Authors did not perform experiments on simulated libraries with the aim to assess the reliability of the new version of ARIOC. These analyses were carried out with previous versions of the tool. However, as this is a new implementation it must be validated. Authors should perform these experiments and compare the results with those of the other tools. Simulated libraries should also be permanently accessible through a DOI or according to the journal policies. 1.b - Experiments on real data are not complete. For instance, for Bowtie2 results were only reported for the library SRR6020688. Experiments must be performed for all libraries. 1.c - Unbalanced hardware configurations were used to analyze the performance ARIOC and the other tools. For instance using AWS (Suppl Data D2) ARIOC was run using 4 NVIDIA V100 and 96 threads whereas Bowtie 2 was run using 40 threads. Experiments should be performed using an identical hardware configuration and using modern processors supporting hundreds of cores. 
A comparison between Bowtie2/Bismark and ARIOC with a single GPU and identical modern processor should also be performed. 1.d - As done for Bowtie2 and Bismark comparative experiments should also be performed for SOAP3-dp with the aim to confirm the best performances of ARIOC. As a reader, I would be curios to read about the behaviour of SOAP3-dp with the new V100 equipped with 32GB of memory. Also in this case experiments should be performed using an identical hardware configuration. 1.e - Results in supplementary data (D1 - D5) should be report the same (common) information for all tools. For instance, the table reporting results for Bowtie2 (Suppl Data D2) reports a column labeled “overall mapped %”, but the same column is not present in tables for Arioc. Moreover, for an easy reading of the results, each column of the tables should be described. 1.f - In Supplementary Data D1 are reported performance obtained for an unpublished library “LIBD1373”. I don't think the library is mentioned in the article nor is it downloadable from the ARIOC repository at https://github.com/rwilton/arioc. 1.g - Figure 4 show speed vs sensitivity for the LUT layouts. I observe that increasing the sensitivity (par. maxJ) the speed tends to converge for all LUT layouts. It would seem that by increasing the sensitivity there are not more advantages related to the device memory and NVLINK. 1.h - Can you also comment about the overall host (and device) memory consumption? Is the host (device) memory consumption comparable with Bowtie2/Bismark/SOAP3-dp? 2. All experiments were carried out with the human genome. Today the scientific community is also heavily involved in the study of more complex genomes. My question is whether ARIOC and its LUT are suitable for very big and highly repetitive genomes. 3. Line 41: “The first step in analyzing short-read DNA sequencer data is read alignment …” It would be more correct to write that alignment is at the base of the NGS analyses. 
Quality control and filtering are mandatory steps before alignment. 4. Finally, I suggest (it is not mandatory) to dockerize the tool installation and deposit it on DockerHub and/or BioContainers. Containers allow a fast deployment on local clusters as well as on the cloud. Minor comments Labels “Figure 2” and “Figure 3” refer to tables. Line 58: (For example, these LUTs can occupy up to …): remove the brackets Reviewer #2: This manuscript addresses an important area of short-read alignment. A new version of Arioc has been developed by utilising the multiple GPUs on the same machine to achieve a substantial speed up in short-read alignment. This new version of Arioc can benefit from larger GPU memories and high-bandwidth peer-to-peer memory access between GPUs. The manuscript is well written. I have a few comments: The authors should explicitly mention that this is a newer version of Arioc. According to the results, Arioc has a better performance compared with the other aligners like Bowtie2 (for WGS) and Bismark (for WGBS) in terms to accuracy and running time. It is understandable that Arior performs faster due to the utilisation of multiple GPU technology. However, why does Arior have a better accuracy? Is it due to the algorithm used in Arioc? The authors should briefly explain this in the manuscript. The authors need to explain what "concordant" alignment means. This is important in order to understand how the accuracies of different aligners were evaluated. Figure 5 mentions "concordantly-mapped pairs". Does this figure only consider the aligned pairs? How about the case that one of reads in the pair is properly aligned but another cannot? If Figure 5 only considers the aligned pairs, it is necessary to have another figure showing, for different aligners, the % of reads concordantly aligned, including those either with only one read or both reads in the pairs being aligned. This manuscript does not mention about the alignment algorithm used in Arioc. 
The authors should briefly mention this and refer to their previous papers for details. ********** Have all data underlying the figures and results presented in the manuscript been provided? Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information. Reviewer #1: None Reviewer #2: Yes ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at . Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5. 
Reproducibility: To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future.

9 Jun 2020
Submitted filename: Response to Reviewers.3.pdf

30 Jul 2020

Dear Dr Wilton,

Thank you very much for submitting your manuscript "Arioc: high-concurrency short-read alignment on multiple GPUs" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note, while forming your response, that if your article is accepted you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days.
If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Manja Marz
Software Editor
PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors: Please note here if the review is uploaded as an attachment.

Reviewer #1: Although the authors have worked to improve the manuscript, I must point out that some of my indications have not been fully implemented. In the following I report my comments on the authors' responses.

1.a The authors performed tests on synthetic data as required. However, they did not provide a link to download the simulated libraries as required. Readers must be able to reproduce the experiments. Please provide a DOI to share the libraries with the readers.

1.c The considerations made by the authors are interesting and should be reported in the manuscript. However, a comparison of performance in terms of computing time must be based on an identical hardware configuration. I have no doubt that Arioc outperforms the other tools, but what information is conveyed by performance figures obtained with unbalanced hardware configurations for different tools?

1.g Please report in the manuscript the considerations given in your reply.

1.h As requested, the authors report a comparison in terms of memory consumption between Arioc and the other tools. The results of the comparison are reported in Appendix A6. However, no discussion of memory consumption is provided in the manuscript, and no reference is made to Appendix A6.
Please discuss in the manuscript the performance in terms of memory consumption. What are your considerations about the high RAM requirement of Arioc with respect to the other tools?

2. Please report in the manuscript the considerations given in your reply. Constraints such as the maximum genome size, as well as other limitations on the use of Arioc, must be reported and discussed in the manuscript. This information is very important for researchers who plan to use a tool.

4. As I wrote in my previous comment, I only suggested dockerizing Arioc. I can understand the difficulties of building an optimized container. In any case, using your Ferrari/Volkswagen comparison, observe that only a few people drive a Ferrari whereas many drive a Volkswagen. Since the installation and configuration take hours, the risk is that an unskilled user will abandon the tool in favour of others that are easier to install. For the future, I suggest considering building a container for Arioc.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Reviewer #1: None

**********

Do you want your identity to be public for this peer review?
Reviewer #1: No

16 Aug 2020
Submitted filename: Response to Reviewers.4.docx

10 Sep 2020

Dear Dr Wilton,

We are pleased to inform you that your manuscript "Arioc: high-concurrency short-read alignment on multiple GPUs" has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow-up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.
IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Manja Marz
Software Editor
PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors: Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors properly addressed all comments.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Reviewer #1: None

**********

Do you want your identity to be public for this peer review?
Reviewer #1: No

26 Oct 2020

PCOMPBIOL-D-20-00150R2
Arioc: High-concurrency short-read alignment on multiple GPUs

Dear Dr Wilton,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Laura Mallard
PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom
ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol
References: 13 in total

1.  Fast gapped-read alignment with Bowtie 2.

Authors:  Ben Langmead; Steven L Salzberg
Journal:  Nat Methods       Date:  2012-03-04       Impact factor: 28.547

2.  Arioc: GPU-accelerated alignment of short bisulfite-treated reads.

Authors:  Richard Wilton; Xin Li; Andrew P Feinberg; Alexander S Szalay
Journal:  Bioinformatics       Date:  2018-08-01       Impact factor: 6.937

3.  (Review) Cloud computing for genomic data analysis and collaboration.

Authors:  Ben Langmead; Abhinav Nellore
Journal:  Nat Rev Genet       Date:  2018-01-30       Impact factor: 53.242

4.  SOAP3-dp: fast, accurate and sensitive GPU-based short read aligner.

Authors:  Ruibang Luo; Thomas Wong; Jianqiao Zhu; Chi-Man Liu; Xiaoqian Zhu; Edward Wu; Lap-Kei Lee; Haoxiang Lin; Wenjuan Zhu; David W Cheung; Hing-Fung Ting; Siu-Ming Yiu; Shaoliang Peng; Chang Yu; Yingrui Li; Ruiqiang Li; Tak-Wah Lam
Journal:  PLoS One       Date:  2013-05-31       Impact factor: 3.240

5.  Arioc: high-throughput read alignment with GPU-accelerated exploration of the seed-and-extend search space.

Authors:  Richard Wilton; Tamas Budavari; Ben Langmead; Sarah J Wheelan; Steven L Salzberg; Alexander S Szalay
Journal:  PeerJ       Date:  2015-03-03       Impact factor: 2.984

6.  Big Data: Astronomical or Genomical?

Authors:  Zachary D Stephens; Skylar Y Lee; Faraz Faghri; Roy H Campbell; Chengxiang Zhai; Miles J Efron; Ravishankar Iyer; Michael C Schatz; Saurabh Sinha; Gene E Robinson
Journal:  PLoS Biol       Date:  2015-07-07       Impact factor: 8.029

7.  SAMBLASTER: fast duplicate marking and structural variant read extraction.

Authors:  Gregory G Faust; Ira M Hall
Journal:  Bioinformatics       Date:  2014-05-07       Impact factor: 6.937

8.  Systematic evaluation of error rates and causes in short samples in next-generation sequencing.

Authors:  Franziska Pfeiffer; Carsten Gröber; Michael Blank; Kristian Händler; Marc Beyer; Joachim L Schultze; Günter Mayer
Journal:  Sci Rep       Date:  2018-07-19       Impact factor: 4.379

9.  Scaling read aligners to hundreds of threads on general-purpose processors.

Authors:  Ben Langmead; Christopher Wilks; Valentin Antonescu; Rone Charles
Journal:  Bioinformatics       Date:  2019-02-01       Impact factor: 6.937

10.  Cloud computing applications for biomedical science: A perspective.

Authors:  Vivek Navale; Philip E Bourne
Journal:  PLoS Comput Biol       Date:  2018-06-14       Impact factor: 4.475

Cited by: 2 in total

1.  Data-Rich Spatial Profiling of Cancer Tissue: Astronomy Informs Pathology.

Authors:  Alexander S Szalay; Janis M Taube
Journal:  Clin Cancer Res       Date:  2022-08-15       Impact factor: 13.801

2.  Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression.

Authors:  Yuansheng Liu; Jinyan Li
Journal:  PLoS Comput Biol       Date:  2021-07-19       Impact factor: 4.475

