| Literature DB >> 28806977 |
Ben Langmead1,2.
Abstract
Read alignment is the first step in most sequencing data analyses. Because a read's point of origin can be ambiguous, aligners report a mapping quality, which is the probability that the reported alignment is incorrect. Despite its importance, there is no established and general method for calculating mapping quality. I describe a framework for predicting mapping qualities that works by simulating a set of tandem reads. These are like the input reads in important ways, but the true point of origin is known. I implement this method in an accurate and low-overhead tool called Qtip, which is compatible with popular aligners.Entities:
Keywords: Mapping; Quality; Read alignment; Sequencing
Mesh:
Year: 2017 PMID: 28806977 PMCID: PMC5557537 DOI: 10.1186/s13059-017-1290-3
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Relative change in area under CID (RCA) and relative change in sum of squared error (RCE) when running Qtip and Bowtie 2 on Mason-simulated Illumina-like samples of various lengths
| End-to-end | Local | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| RCA | RCE | RCA | RCE | ||||||
| Read length | Mean | SD | Mean | SD | Mean | SD | Mean | SD | |
| Unpaired | 50 | −9.03 | 0.26 | −24.61 | 0.19 | −2.49 | 0.47 | −15.30 | 0.57 |
| 100 | −7.43 | 1.82 | −18.53 | 2.61 | −10.26 | 1.52 | −27.87 | 1.59 | |
| 150 | −9.11 | 1.15 | −16.22 | 0.62 | −14.77 | 2.15 | −29.27 | 1.03 | |
| 250 | −16.82 | 2.01 | −19.97 | 0.44 | −22.51 | 1.77 | −28.85 | 0.62 | |
| 500 | −33.77 | 0.60 | −27.26 | 0.37 | −37.75 | 1.32 | −31.65 | 1.01 | |
| Paired | 50 | −13.11 | 0.44 | −19.60 | 0.44 | −18.60 | 0.28 | −33.63 | 0.24 |
| 100 | −15.80 | 1.29 | −21.79 | 0.33 | −36.84 | 0.69 | −45.42 | 0.53 | |
| 150 | −22.39 | 1.74 | −25.94 | 0.28 | −46.87 | 0.54 | −52.76 | 0.56 | |
| 250 | −37.65 | 0.97 | −33.48 | 0.22 | −58.39 | 0.91 | −58.75 | 1.38 | |
| 500 | −54.59 | 0.58 | −44.91 | 0.34 | −68.54 | 0.95 | −70.37 | 3.45 | |
Relative change is expressed as a percentage. Each sample consists of 4 million reads/pairs. Samples are either unpaired or paired-end, and Bowtie 2 is run in either end-to-end or local alignment mode as indicated. Results are means and standard deviations over ten random trials, repeated starting from the input modeling step
CID cumulative incorrect difference
RCA relative change in area under CID
RCE relative change in sum of squared errors
SD standard deviation
Fig. 1CSED for various lengths. Cumulative squared-error difference plot from running Qtip and Bowtie 2 on Mason-simulated Illumina-like samples of various lengths. Each sample consists of 4 million reads or pairs. The horizontal axis indicates the cumulative number of reads/ends passing the threshold, with the left-hand extreme corresponding to a high mapping-quality threshold and the right-hand extreme corresponding to a low threshold. Results for unpaired samples are on top, paired on bottom. Bowtie 2 is run in its (default) end-to-end alignment mode for the left-hand plots, and in local alignment mode for the right-hand plots. CSED cumulative squared-error difference
Relative change in area under CID (RCA) and relative change in sum of squared error (RCE) for various aligners and reference genomes, expressed as percentage change
| 100 nt | 250 nt | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| RCA | RCE | RCA | RCE | |||||||
| Mean | SD | Mean | SD | Mean | SD | Mean | SD | |||
| Unpaired | GRCh37 | Bowtie 2 | −11.22 | 1.08 | −24.43 | 0.66 | −15.02 | 0.39 | −28.37 | 0.38 |
| BWA-MEM | −14.49 | 2.29 | −49.54 | 0.43 | −9.14 | 2.09 | −52.31 | 0.38 | ||
| SNAP | −15.94 | 0.32 | −36.86 | 0.23 | −9.53 | 3.88 | −28.57 | 0.47 | ||
| GRCh38 | Bowtie 2 | −7.43 | 1.82 | −18.53 | 2.61 | −16.82 | 2.01 | −19.97 | 0.44 | |
| BWA-MEM | −15.49 | 0.58 | −47.42 | 0.37 | −15.78 | 0.57 | −51.14 | 0.31 | ||
| SNAP | −19.58 | 0.18 | −36.58 | 0.47 | −14.74 | 0.27 | −25.47 | 0.32 | ||
| Mouse | Bowtie 2 | −5.60 | 0.24 | −17.19 | 0.45 | −7.05 | 0.33 | −17.73 | 0.37 | |
| BWA-MEM | −13.50 | 0.15 | −46.25 | 0.27 | −16.39 | 0.38 | −51.12 | 0.30 | ||
| SNAP | −9.02 | 0.17 | −31.07 | 0.33 | −10.61 | 0.20 | −31.78 | 0.43 | ||
|
| Bowtie 2 | −6.63 | 0.32 | −19.56 | 0.25 | −17.09 | 0.38 | −25.89 | 0.44 | |
| BWA-MEM | −19.26 | 0.11 | −58.32 | 0.26 | −25.14 | 0.19 | −66.76 | 0.23 | ||
| SNAP | −13.02 | 0.24 | −38.48 | 0.53 | −24.01 | 0.43 | −53.80 | 0.42 | ||
| Paired | GRCh37 | Bowtie 2 | −25.86 | 0.33 | −30.26 | 0.50 | −36.31 | 2.16 | −38.29 | 0.60 |
| BWA-MEM | −13.33 | 0.23 | −45.70 | 0.27 | −10.08 | 0.75 | −47.58 | 0.31 | ||
| SNAP | −56.53 | 1.99 | 1.39 | 2.34 | −42.89 | 7.63 | 13.17 | 3.95 | ||
| GRCh38 | Bowtie 2 | −15.80 | 1.29 | −21.79 | 0.33 | −38.22 | 0.31 | −33.63 | 0.30 | |
| BWA-MEM | −14.19 | 0.16 | −41.35 | 0.19 | −12.36 | 0.46 | −42.78 | 0.29 | ||
| SNAP | −51.36 | 0.98 | −11.16 | 1.12 | −51.32 | 1.45 | 4.34 | 2.29 | ||
| Mouse | Bowtie 2 | −10.10 | 0.26 | −18.93 | 0.31 | −19.03 | 0.21 | −29.07 | 0.32 | |
| BWA-MEM | −11.86 | 0.12 | −36.18 | 0.37 | −13.30 | 0.19 | −39.91 | 0.21 | ||
| SNAP | −29.90 | 0.67 | −17.04 | 0.76 | −30.16 | 0.30 | −15.79 | 0.47 | ||
|
| Bowtie 2 | −17.92 | 0.21 | −26.95 | 0.27 | −43.19 | 0.18 | −51.69 | 0.38 | |
| BWA-MEM | −17.04 | 0.15 | −47.48 | 0.29 | −21.45 | 0.20 | −56.58 | 0.08 | ||
| SNAP | −36.28 | 0.55 | −17.08 | 0.79 | −26.45 | 4.44 | −20.05 | 0.52 | ||
The experiments used 100 or 250 nt reads, and unpaired or paired-end reads, as indicated. Results are means and standard deviations over ten random trials, repeated starting from the input modeling step
CID cumulative incorrect difference
RCA relative change in area under CID
RCE relative change in sum of squared errors
SD standard deviation
Fig. 2CSED for various aligners and references. Cumulative squared-error difference plot from running Qtip with various reference genomes and read aligners. The input reads are Mason-simulated Illumina-like 100 nt (left) and 250 nt (right) samples, each consisting of 4 million reads/pairs. The horizontal axis indicates the cumulative number of reads/ends passing the threshold, with the left-hand extreme corresponding to a high mapping-quality threshold and the right-hand extreme corresponding to a low threshold. Results for unpaired samples are shown on top, and paired on bottom. CSED cumulative squared-error difference
Single-nucleotide variant (SNV) F scores for various β’s with original mapping qualities and with Qtip-generated qualities
| Original | Qtip |
| |||||||
|---|---|---|---|---|---|---|---|---|---|
|
|
| QUAL |
|
| QUAL |
|
| TP | FP |
| 0.250 | 0.9925 | 194 | 2 | 0.9924 | 213 | 2 | -2.2e-05 | +20,189 | +1505 |
| 0.333 | 0.9906 | 150 | 3 | 0.9908 | 178 | 2 | +2.6e-04 | +16,001 | +911 |
| 0.500 | 0.9872 | 87 | 3 | 0.9881 | 125 | 3 | +8.3e-04 | +15,143 | +311 |
| 0.750 | 0.9843 | 10.6 | 3 | 0.9854 | 70.1 | 4 | +1.1e-03 | -413 | -6537 |
| 1.000 | 0.9835 | 0.0158 | 3 | 0.9845 | 13.6 | 4 | +9.4e-04 | +3999 | -2745 |
| 1.500 | 0.9846 | 1.79e-06 | 3 | 0.9856 | 0.000675 | 5 | +1.0e-03 | +4392 | -1832 |
| 2.000 | 0.9860 | 5.06e-08 | 3 | 0.9870 | 1.38e-05 | 5 | +1.0e-03 | +3110 | -5692 |
| 3.000 | 0.9880 | 1.16e-09 | 3 | 0.9889 | 8.64e-08 | 4 | +8.6e-04 | +2583 | -7937 |
| 4.000 | 0.9892 | 1.06e-10 | 3 | 0.9899 | 6.58e-09 | 4 | +7.1e-04 | +1891 | -13,600 |
Paired-end reads from ERR194147, a female, were aligned with Bowtie 2 together with Qtip. SNV variants were called with Freebayes for chromosomes 1–22 and X. Variant-quality (QUAL) and mapping-quality (Q) thresholds yielding the greatest F score are reported. Platinum variants were used as the true callset. Before calculating F , calls outside Platinum Genomes high-confidence regions were excluded. The three rightmost columns show differences in F , the number of true positive SNVs, and the number of false positive SNVs
FP false positive
SNV single-nucleotide variant
TP true positive
Overhead of the Qtip tool
| Time (minutes) | Peak memory (gigabytes) | |||||||
|---|---|---|---|---|---|---|---|---|
| Time | +Qtip | % inc | Memory | +Qtip | % inc | |||
| ERR050082 | Unpaired | Bowtie 2 | 23.58 | 25.20 | 6.89 | 3.26 | 3.52 | 8.25 |
| BWA-MEM | 22.18 | 23.75 | 7.08 | 7.53 | 7.79 | 3.40 | ||
| SNAP | 12.13 | 13.75 | 13.32 | 29.26 | 29.51 | 0.85 | ||
| Paired | Bowtie 2 | 57.52 | 61.35 | 6.67 | 3.27 | 3.66 | 12.02 | |
| BWA-MEM | 57.93 | 63.38 | 9.41 | 7.87 | 8.26 | 4.87 | ||
| SNAP | 11.28 | 14.42 | 27.69 | 30.21 | 30.55 | 1.14 | ||
| ERR050083 | Unpaired | Bowtie 2 | 23.02 | 24.75 | 7.55 | 3.26 | 3.53 | 8.28 |
| BWA-MEM | 24.60 | 26.08 | 6.03 | 7.75 | 8.01 | 3.33 | ||
| SNAP | 12.23 | 13.73 | 12.31 | 29.26 | 29.51 | 0.85 | ||
| Paired | Bowtie 2 | 63.60 | 67.40 | 6.00 | 3.27 | 3.66 | 11.96 | |
| BWA-MEM | 61.58 | 67.52 | 9.62 | 7.92 | 8.31 | 4.83 | ||
| SNAP | 11.95 | 14.68 | 22.86 | 30.21 | 30.55 | 1.14 | ||
This is measured as the increase in running time (left) and peak memory footprint (right) from when the aligner runs by itself (Time) to when the aligner runs in combination with Qtip (+Qtip). % inc columns give the percentage increase. Times are in minutes and memory footprints are in gigabytes
Fig. 3Stages of the Qtip pipeline. Computational steps and intermediate results in Qtip. Numbers denote ordering of steps and arrows denote the flow of data. The input (upper left) is a collection of sequencing reads and the ultimate output (upper right) is a SAM file containing alignments, where each aligned read’s MAPQ field is set according to Qtip’s prediction