Justin Chu, Hamid Mohamadi, René L. Warren, Chen Yang, Inanç Birol.
Abstract
Identifying overlaps between error-prone long reads, specifically those from Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PB), is essential for certain downstream applications, including error correction and de novo assembly. Though akin to the read-to-reference alignment problem, read-to-read overlap detection is a distinct problem that can benefit from specialized algorithms that perform efficiently and robustly on high-error-rate long reads. Here, we review the current state-of-the-art read-to-read overlap tools for error-prone long reads, including BLASR, DALIGNER, MHAP, GraphMap and Minimap. These specialized bioinformatics tools differ not just in their algorithmic designs and methodology, but also in their robustness of performance on a variety of datasets, time and memory efficiency, and scalability. We highlight the algorithmic features of these tools, as well as the potential issues and biases of each method. To supplement our review of the algorithms, we benchmarked these tools, tracking their resource needs and computational performance, and assessed the sensitivity and specificity of each. In the versions of the tools tested, we observed that Minimap is the most computationally efficient, specific and sensitive method on the ONT datasets tested, whereas GraphMap and DALIGNER are the most specific and sensitive methods on the tested PB datasets. The concepts surveyed may apply to future sequencing technologies, as scalability is becoming more relevant with increased sequencing throughput. CONTACT: cjustin@bcgsc.ca, ibirol@bcgsc.ca. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Year: 2017 PMID: 28003261 PMCID: PMC5408847 DOI: 10.1093/bioinformatics/btw811
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. An overview of possible outcomes from an overlap detection algorithm. Each level has a computational cost associated with it, with the general trend being A < B
Fig. 2. Visualization of partial and full overlaps in dovetail or contained (containment) forms. The grey portion between the reads indicates the range of the overlap region; note that partial overlaps do not extend to the ends of the reads
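The distinction drawn in Figure 2 can be made precise with read coordinates. Below is a minimal sketch in Python (a hypothetical helper, not code from any of the reviewed tools) that classifies an overlap as contained, dovetail or partial; it assumes forward-strand, 0-based half-open coordinates and exact read-end matches, whereas real tools also handle reverse-complement orientation and tolerate small unaligned overhangs:

```python
def classify_overlap(a_len, b_len, a_start, a_end, b_start, b_end):
    """Classify the overlap between reads A and B.

    (a_start, a_end) and (b_start, b_end) are the overlap coordinates
    on each read; a_len and b_len are the read lengths.
    """
    # Containment: the overlap spans one read entirely.
    if (a_start == 0 and a_end == a_len) or (b_start == 0 and b_end == b_len):
        return "contained"
    # Dovetail: the overlap runs off the end of one read and into the
    # start of the other, as expected for reads drawn from adjacent
    # positions on the genome.
    if (a_end == a_len and b_start == 0) or (b_end == b_len and a_start == 0):
        return "dovetail"
    # Partial: the aligned region stops short of the read ends, e.g. due
    # to repeats or chimeric reads.
    return "partial"
```

For example, a 600 bp match covering the last 600 bp of read A and the first 600 bp of read B is a dovetail overlap.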
Fig. 3. Visual overview of overlap detection algorithms. At the least, each method produces overlap regions. They may also generate auxiliary information, such as alignment trace points or full alignments. We show different seed identification approaches from leading overlap detection tools in the central box. BLASR utilizes the FM-index data structure for seed identification. MHAP employs the MinHash sketch for seed selection. Minimap uses a minimizer sketch, a sketching approach similar in spirit to the MinHash sketch used by MHAP. GraphMap uses gapped q-grams (spaced seeds) to find seeds. DALIGNER takes advantage of a cache-efficient k-mer sorting approach for rapid seed detection
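To make the sketch-based seed selection concrete, here is a minimal (w,k)-minimizer sketch in Python, in the style used by Minimap: for every window of w consecutive k-mers, keep the position and hash of the smallest k-mer, and treat shared minimizer hashes between two reads as candidate seeds. This is an illustrative toy (the function names, the use of `blake2b` as a hash, and the tiny default k and w are our choices); real implementations use fast invertible integer hashes, index both strands, and chain the resulting seeds:

```python
from hashlib import blake2b

def kmer_hash(kmer):
    # Stable toy hash mapping a k-mer string to an integer.
    return int.from_bytes(blake2b(kmer.encode(), digest_size=8).digest(), "big")

def minimizers(seq, k=5, w=4):
    """Return the (w,k)-minimizer sketch of seq as {(position, hash)}."""
    hashes = [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    sketch = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        pos = start + window.index(min(window))  # leftmost minimum in the window
        sketch.add((pos, hashes[pos]))
    return sketch

def shared_seeds(a, b, k=5, w=4):
    """Candidate seeds between reads a and b: minimizer hashes they share."""
    ha = {h for _, h in minimizers(a, k, w)}
    hb = {h for _, h in minimizers(b, k, w)}
    return ha & hb
```

Because any window that lies entirely inside a region shared by two reads selects the same minimizer in both, truly overlapping reads are guaranteed to share seeds, while the sketch stays much smaller than the full k-mer set; this is the source of the speed and memory advantages noted for Minimap in the benchmarks below.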
Summary of overlap tools output formats, associated pipelines and availability
| Software | Algorithm features | Associated assembly tools | Output | Availability |
|---|---|---|---|---|
| BLASR | FM-index, anchor clusters | PBcR | SAM alignment, other proprietary formats (overlap regions) | |
| DALIGNER | Cache-efficient k-mer sorting | DAZZLER, MARVEL, FALCON | Local alignments, LAS format (alignment trace points) | |
| MHAP | MinHash sketch | PBcR, Canu | MHAP output format (overlap regions) | |
| GraphMap | Gapped q-gram (spaced seeds), colinear clustering | Ra | SAM alignment, MHAP output format (overlap regions) | |
| Minimap | Minimizer sketch, colinear clustering | Miniasm | PAF (overlap regions) | |
Fig. 4. ROC-like plots for BLASR, DALIGNER, GraphMap, MHAP and Minimap. Top left: PB P6-C4 E. coli simulated with PBsim. Top right: PB P6-C4 E. coli dataset. Bottom left: ONT SQK-MAP-006 E. coli simulated with NanoSim. Bottom right: ONT SQK-MAP-006 E. coli dataset
An overview of sensitivity and precision on simulated and real error-prone long read datasets
| Tool | Simulated PB | Simulated ONT | PB P6-C4 | ONT SQK-MAP-006 | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sens. (%) | Prec. (%) | F1 (%) | Sens. (%) | Prec. (%) | F1 (%) | Sens. (%) | Prec. (%) | F1 (%) | Sens. (%) | Prec. (%) | F1 (%) | |
| BLASR | 91.0 | 81.9 | 86.2 | 75.1 | 84.0 | 66.0 | 78.3 | 89.9 | 73.0 | 80.6 | ||
| DALIGNER | 91.9 | 92.1 | 94.9 | 97.6 | 95.9 | 85.8 | 91.0 | 91.9 | ||||
| MHAP | 91.5 | 88.0 | 89.8 | 86.5 | 90.6 | 79.8 | 79.8 | 79.8 | 91.2 | 82.0 | 86.3 | |
| GraphMap | 90.1 | 90.4 | 96.0 | 93.1 | 71.7 | 94.0 | 81.4 | 90.6 | 93.4 | 92.0 | ||
| Minimap | 88.9 | 94.8 | 91.8 | 94.6 | 59.6 | 83.8 | 69.7 | 91.2 | ||||
In both the PB and ONT simulated datasets, the best values, shown in boldface, are statistically significantly better than the other values. We derived these values from the best settings of each tool (according to the best F1 score) after a parameter search. We calculated confidence intervals for the sensitivity, precision and F1 scores using three standard deviations around the observed values. In the worst case, the error never exceeded ±0.1%, ±0.1% and ±0.2%, respectively.
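The F1 scores in the table are the harmonic mean of sensitivity and precision. A one-line sketch (hypothetical helper name) reproduces the table entries from the other two columns, e.g. Minimap on the simulated PB dataset:

```python
def f1(sensitivity_pct, precision_pct):
    """Harmonic mean of sensitivity and precision, both in percent."""
    return 2 * sensitivity_pct * precision_pct / (sensitivity_pct + precision_pct)

# Minimap on the simulated PB dataset: Sens. 88.9%, Prec. 94.8%
print(round(f1(88.9, 94.8), 1))  # -> 91.8, matching the table
```

Small discrepancies in the last digit can occur because the published F1 values were computed from unrounded sensitivity and precision.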
Fig. 5. Wall-clock time and memory on the PB P6-C4 (top) and simulated ONT (bottom) C. elegans datasets of 100 000 randomly sampled reads. Each tool was parameterized using default settings (left), or using settings from runs yielding the highest F1 score on the simulated E. coli datasets (right)