Christopher R Cabanski, Keary Cavin, Chris Bizon, Matthew D Wilkerson, Joel S Parker, Kirk C Wilhelmsen, Charles M Perou, J S Marron, D Neil Hayes.
Abstract
BACKGROUND: Next-generation sequencing technologies have become important tools for genome-wide studies. However, the quality scores that are assigned to each base have been shown to be inaccurate. If the quality scores are used in downstream analyses, these inaccuracies can have a significant impact on the results.
Year: 2012 PMID: 22946927 PMCID: PMC3447716 DOI: 10.1186/1471-2105-13-221
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1 Recalibration of U87 cell line replicate 1 with ReQON. Plot A shows the distribution of sequencing errors by read position. Plot B shows frequency distributions of quality scores before (solid blue) and after (dashed red) recalibration. Reported quality scores versus empirical quality scores are shown before (plot C) and after (plot D) recalibration. The points are shaded according to the frequency of bases assigned that quality score, corresponding to the values shown in plot B. Plots C and D also report the Frequency-Weighted Squared Error (FWSE), a measure of quality score accuracy. The large decrease in FWSE confirms that the recalibrated quality scores more accurately represent the probability of a sequencing error than the original quality scores.
Comparison of Frequency-Weighted Squared Error (FWSE)
| Chromosome | Replicate | Original | ReQON | GATK | BAQ |
|---|---|---|---|---|---|
| Chromosome 10 | Replicate 1 | 69.36 | 3.61 | 11.55 | 13.93 |
| | Replicate 2 | 62.89 | 5.31 | 14.91 | 21.76 |
| Chromosome 20 | Replicate 1 | 71.09 | 3.04 | 12.28 | 15.93 |
| | Replicate 2 | 64.34 | 5.82 | 17.43 | 24.38 |
Comparison of FWSE for two cell line replicates between the original quality scores reported by the sequencer and the scores after recalibration with ReQON, GATK and BAQ. ReQON quality scores have the lowest FWSE values, corresponding to the highest accuracy. ReQON does not overfit the model to the training set, as shown by the roughly equivalent FWSE values for the training (chromosome 10) and testing (chromosome 20) sets after recalibration.
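The record above does not state the FWSE formula, so the following is only a minimal sketch under an assumed definition: for each reported quality score, the squared difference between the reported score and the Phred-scaled empirical score (-10 log10 of the observed error rate), weighted by the relative frequency of bases assigned that score. The function names, the input layout and the handling of bins with no observed errors are illustrative assumptions, not the authors' implementation.

```python
import math


def empirical_quality(n_errors: int, n_bases: int) -> float:
    """Phred-scaled quality implied by an observed error rate."""
    return -10.0 * math.log10(n_errors / n_bases)


def fwse(bins):
    """Frequency-weighted squared error under the assumed definition.

    bins: iterable of (reported_quality, n_bases, n_errors) tuples,
    one per distinct reported quality score.
    """
    # Assumption: bins with zero observed errors are skipped, because their
    # empirical quality would be infinite on the Phred scale.
    usable = [(q, n, e) for q, n, e in bins if n > 0 and e > 0]
    total = sum(n for _, n, _ in usable)
    if total == 0:
        raise ValueError("no bins with observed errors")
    # Weight each squared difference by the share of bases in that bin.
    return sum(
        (n / total) * (q - empirical_quality(e, n)) ** 2
        for q, n, e in usable
    )
```

Under this definition, perfectly calibrated scores give an FWSE of 0, and a single bin reported as Q30 whose observed error rate is 1 in 100 (empirical Q20) gives an FWSE of 100.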
Figure 2 Discrimination performance of original and ReQON-recalibrated quality scores. Relative frequency distributions of quality scores for bases not matching the reference sequence in chromosome 20 of cell line replicate 2. These non-reference bases are separated into positions listed in dbSNP version 132 (known variants, red curve) and all other positions (sequencing errors, blue curve). Plot A shows the distribution of original quality scores and plot B shows the distribution after recalibration with ReQON. The area under the ROC curve (AUC) is reported. The increased AUC demonstrates that the recalibrated quality scores better distinguish sequencing errors from non-errors.
Comparison of the area under the ROC curve (AUC)
| Replicate | Original | ReQON | GATK | BAQ |
|---|---|---|---|---|
| Replicate 1 | 0.673 | 0.806 | 0.824 | 0.798 |
| Replicate 2 | 0.764 | 0.881 | 0.874 | 0.814 |
Comparison of AUC for two cell line replicates recalibrated with ReQON, GATK and BAQ. Bases from chromosome 20 that do not match the reference sequence are separated as belonging to positions in dbSNP version 132 or not. Overall, all three recalibration methods outperform the original quality scores. ReQON and GATK have similar AUC values, with both methods outperforming BAQ.
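The record does not say how the AUC was computed. As a hedged illustration, the sketch below uses the standard rank-based (Mann-Whitney) form of the AUC: the probability that a randomly chosen base at a known-variant position has a higher quality score than a randomly chosen base at another position, with ties counted as one half. Treating dbSNP positions as the positive class is an assumption made here for illustration, and the function and argument names are hypothetical.

```python
def auc(variant_scores, error_scores):
    """Rank-based AUC for separating known variants from presumed errors.

    variant_scores: quality scores of non-reference bases at dbSNP positions.
    error_scores: quality scores of non-reference bases at other positions.
    """
    if not variant_scores or not error_scores:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for v in variant_scores:
        for e in error_scores:
            if v > e:
                wins += 1.0   # variant ranked above error: correct ordering
            elif v == e:
                wins += 0.5   # ties count as half
    return wins / (len(variant_scores) * len(error_scores))
```

An AUC of 0.5 means the scores carry no information about which bases are errors, and 1.0 means every known variant scores above every presumed error. The pairwise loop is quadratic in the number of bases, so it is only suitable for small illustrative inputs.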
Figure 3 Example position where bases are identified as sequencing errors by GATK but not ReQON. Plot A shows an Integrative Genomics Viewer (IGV) visualization of chr10:75,531,679-75,531,712 for cell line replicate 1, highlighting a position where the reference sequence is T but all of the bases mapped to this position are C. This position (chr10:75,531,700) is not listed as a known variant in dbSNP version 132. The bases at this position are removed from the training set by ReQON but are called as sequencing errors by GATK. Plot B shows box plots comparing the quality scores of the bases at this position after recalibration with GATK and ReQON. Overall, ReQON assigns higher quality scores to these non-reference bases than GATK.