Gagan A Pandya, Michael H Holmes, Sirisha Sunkara, Andrew Sparks, Yun Bai, Kathleen Verratti, Kelly Saeed, Pratap Venepally, Behnam Jarrahi, Robert D Fleischmann, Scott N Peterson.
Abstract
DNA resequencing arrays enable rapid acquisition of high-quality sequence data. This technology represents a promising platform for rapid high-resolution genotyping of microorganisms. Traditional array-based resequencing methods have relied on the use of specific PCR-amplified fragments from the query samples as hybridization targets. While this specificity in the target DNA population reduces the potential for artifacts caused by cross-hybridization, the subsampling of the query genome limits the sequence coverage that can be obtained and therefore reduces the technique's resolution as a genotyping method. We have developed and validated an Affymetrix Inc. GeneChip® array-based, whole-genome resequencing platform for Francisella tularensis, the causative agent of tularemia. We developed a set of bioinformatic filters targeting systematic base-calling errors caused by cross-hybridization between the whole-genome sample and the array probes and by deletions in the sample DNA relative to the chip reference sequence. Our approach eliminated 91% of the false-positive single-nucleotide polymorphism calls identified in the SCHU S4 query sample, at the cost of 10.7% of the true positives, yielding a total base-calling accuracy of 99.992%.
Year: 2007 PMID: 18006572 PMCID: PMC2175352 DOI: 10.1093/nar/gkm918
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Representation of the ‘alternate homology effect’. The query location is shown in bold and mismatches are shown in red. The alignment of chip oligonucleotides and sample DNA at the SNP location is shown. The top pair represents a sample DNA sequence perfectly matching a reference probe. The next pair illustrates a sample DNA sequence partially matching a SNP probe and therefore capable of hybridizing with high efficiency to the SNP probe pair.
Figure 2. ROC curve showing the effect of different delta binding energy threshold values on the true positive and false positive rates. The values on the line graph are the delta energy values.
Figure 3. ROC curve illustrating the effect of different quality threshold values on the true positive and false positive rates. The GSEQ quality score threshold was set to 3.0, and our quality filter was applied using different threshold values shown on the line graph.
Figure 4. Representation of the ‘footprint effect’. Query locations are in bold and mismatches are shown in red. Chip oligonucleotides and sample DNA alignments at SNP location (central 13th position) and SNP location plus two bases are shown.
Figure 5. Schematic representation of whole genome resequencing array set design. Blue vertical lines indicate repeats in the genomes. Unique sequences for LVS and SCHU S4 are shown as red and green vertical lines, respectively. Similarly, yellow and purple vertical lines represent unique sequences from plasmids pOM1 and pFNL10, respectively.
Raw resequencing results for F. tularensis LVS query against F. tularensis LVS reference
| Expt. No. | Array | Bases/array | Bases called | Call rate (%) | SNPs | % SNPs of called bases | True-positive SNPs (expected/detected) |
|---|---|---|---|---|---|---|---|
| 007 | A | 301 470 | 300 490 | 99.675 | 49 | 0.016 | 0/0 |
| | B | 302 018 | 297 231 | 98.415 | 23 | 0.008 | 0/0 |
| | C | 301 394 | 296 530 | 98.386 | 27 | 0.009 | 0/0 |
| | D | 296 905 | 291 127 | 98.054 | 33 | 0.011 | 0/0 |
| | E | 290 100 | 282 102 | 97.243 | 18 | 0.006 | 1/1 |
| | F | 234 779 | 227 722 | 96.994 | 17 | 0.007 | 2/2 |
| | Total | 1 726 666 | 1 695 202 | 98.178 | 167 | 0.010 | 3/3 |
| 013 | A | 301 470 | 300 267 | 99.601 | 54 | 0.018 | 0/0 |
| | B | 302 018 | 296 718 | 98.245 | 25 | 0.008 | 0/0 |
| | C | 301 394 | 296 372 | 98.334 | 29 | 0.010 | 0/0 |
| | D | 296 905 | 290 472 | 97.833 | 32 | 0.011 | 0/0 |
| | E | 290 100 | 284 220 | 97.973 | 20 | 0.007 | 1/1 |
| | F | 234 779 | 229 583 | 97.787 | 17 | 0.007 | 2/2 |
| | Total | 1 726 666 | 1 697 632 | 98.318 | 177 | 0.010 | 3/3 |
The results shown are for the LVS sample, using the Affymetrix-recommended batch analysis parameters, including a quality score threshold of 12 and a call rate cutoff of 0.5. Those portions of chip F that represent the SCHU S4 reference and the plasmids were excluded from this analysis. Therefore, this represents the performance of the system under ideal circumstances: the chips are challenged with a sample that is essentially identical to the chip reference.
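The call rate and SNP percentage columns follow directly from the per-array counts. The following is a minimal sketch that recomputes the totals row for experiment 007 from the table values; the function and variable names are illustrative assumptions and are not part of the Affymetrix software.

```python
# Per-array counts for LVS experiment 007, copied from the table above:
# array -> (bases on array, bases called, SNP calls).
EXPT_007 = {
    "A": (301_470, 300_490, 49),
    "B": (302_018, 297_231, 23),
    "C": (301_394, 296_530, 27),
    "D": (296_905, 291_127, 33),
    "E": (290_100, 282_102, 18),
    "F": (234_779, 227_722, 17),
}


def call_rate(bases_called: int, bases_on_array: int) -> float:
    """Percentage of array positions that received a base call."""
    return 100.0 * bases_called / bases_on_array


def snp_percentage(snps: int, bases_called: int) -> float:
    """SNP calls as a percentage of all called bases."""
    return 100.0 * snps / bases_called


def totals(arrays: dict) -> tuple:
    """Sum positions, called bases and SNP calls over all arrays."""
    positions = sum(row[0] for row in arrays.values())
    called = sum(row[1] for row in arrays.values())
    snps = sum(row[2] for row in arrays.values())
    return positions, called, snps


positions, called, snps = totals(EXPT_007)
print(f"call rate: {call_rate(called, positions):.3f}%")
print(f"SNPs of called bases: {snp_percentage(snps, called):.3f}%")
```

The summed counts match the Total row for experiment 007 (1 726 666 positions, 1 695 202 called bases, 167 SNPs).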
Raw resequencing data for F. tularensis LVS and F. tularensis SCHU S4 samples
| Expt. No. | Array | Bases/array | Bases called | Call rate (%) | SNPs | % SNPs of called bases |
|---|---|---|---|---|---|---|
| Raw data for F. tularensis LVS | | | | | | |
| 007 | A | 301 470 | 298 283 | 98.943 | 30 | 0.010 |
| | B | 302 018 | 298 072 | 98.693 | 30 | 0.010 |
| | C | 301 394 | 297 350 | 98.658 | 35 | 0.012 |
| | D | 296 905 | 292 333 | 98.460 | 43 | 0.015 |
| | E | 290 100 | 283 408 | 97.693 | 21 | 0.007 |
| | F | 273 824 | 230 750 | 84.269 | 1087 | 0.471 |
| | Total | 1 765 711 | 1 700 196 | 96.290 | 1246 | 0.073 |
| 013 | A | 301 470 | 297 426 | 98.659 | 30 | 0.010 |
| | B | 302 018 | 297 614 | 98.542 | 28 | 0.009 |
| | C | 301 394 | 297 381 | 98.669 | 38 | 0.013 |
| | D | 296 905 | 291 534 | 98.191 | 45 | 0.015 |
| | E | 290 100 | 285 828 | 98.527 | 27 | 0.009 |
| | F | 273 824 | 234 169 | 85.518 | 1688 | 0.721 |
| | Total | 1 765 711 | 1 703 952 | 96.502 | 1856 | 0.109 |
| Raw data for F. tularensis SCHU S4 | | | | | | |
| 008 | A | 301 470 | 288 171 | 95.589 | 1331 | 0.462 |
| | B | 302 018 | 291 499 | 96.517 | 1293 | 0.444 |
| | C | 301 394 | 291 988 | 96.879 | 1571 | 0.538 |
| | D | 296 905 | 282 940 | 95.296 | 1545 | 0.546 |
| | E | 290 100 | 280 992 | 96.860 | 1306 | 0.465 |
| | F | 273 824 | 258 411 | 94.371 | 1326 | 0.513 |
| | Total | 1 765 711 | 1 694 001 | 95.939 | 8372 | 0.494 |
| 014 | A | 301 470 | 292 313 | 96.963 | 1383 | 0.473 |
| | B | 302 018 | 290 452 | 96.170 | 1298 | 0.447 |
| | C | 301 394 | 290 080 | 96.246 | 1532 | 0.528 |
| | D | 296 905 | 282 768 | 95.239 | 1539 | 0.544 |
| | E | 290 100 | 280 557 | 96.710 | 1293 | 0.461 |
| | F | 273 824 | 256 803 | 93.784 | 1259 | 0.490 |
| | Total | 1 765 711 | 1 692 973 | 95.881 | 8304 | 0.490 |
For these results, a quality score threshold of 12 and a call rate cutoff of zero were used, as explained in the text. All base positions on the chip set were considered (LVS, SCHU S4 and plasmid reference sequences). This accounts for the much higher SNP count from chip F in the LVS experiments.
Effects of filtering steps on base calling accuracy
| Filter steps | True positives | False positives | Accuracy (%) | True-positive retention (%) | False-positive rejection (%) |
|---|---|---|---|---|---|
| LVS (007) | | | | | |
| None (raw unfiltered) | 3 | 1243 | 99.927 | 100.000 | 0.000 |
| Low homology | 3 | 179 | 99.989 | 100.000 | 85.599 |
| Alternate homology | 3 | 25 | 99.999 | 100.000 | 97.989 |
| Footprint effect | 3 | 23 | 99.999 | 100.000 | 98.150 |
| Replicate combination | 3 | 19 | 99.999 | 100.000 | 98.471 |
| LVS (013) | | | | | |
| None (raw unfiltered) | 3 | 1853 | 99.891 | 100.000 | 0.000 |
| Low homology | 3 | 190 | 99.989 | 100.000 | 89.746 |
| Alternate homology | 3 | 30 | 99.998 | 100.000 | 98.381 |
| Footprint effect | 3 | 29 | 99.998 | 100.000 | 98.435 |
| Replicate combination | 3 | 19 | 99.999 | 100.000 | 98.975 |
| SCHU S4 (008) | | | | | |
| None (raw unfiltered) | 6908 | 1464 | 99.914 | 100.000 | 0.000 |
| Low homology | 6878 | 816 | 99.951 | 99.566 | 44.262 |
| Alternate homology | 6529 | 388 | 99.977 | 94.514 | 73.497 |
| Footprint effect | 6327 | 200 | 99.988 | 91.589 | 86.339 |
| Replicate combination | 6172 | 126 | 99.992 | 89.346 | 91.393 |
| SCHU S4 (014) | | | | | |
| None (raw unfiltered) | 6902 | 1402 | 99.917 | 100.000 | 0.000 |
| Low homology | 6859 | 777 | 99.954 | 99.377 | 44.579 |
| Alternate homology | 6515 | 363 | 99.978 | 94.393 | 74.108 |
| Footprint effect | 6317 | 198 | 99.988 | 91.524 | 85.877 |
| Replicate combination | 6172 | 126 | 99.992 | 89.423 | 91.013 |
The true positive retention and false positive rejection rates are calculated relative to the number of true and false positive results in the raw, unfiltered data. The accuracy is calculated relative to the number of base calls remaining after the specified filtering step, where reference calls and true positive SNP calls are considered correct, and no-calls (‘N’) are not considered.
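The three derived columns can be reproduced from the counts with simple ratios. The sketch below is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, every called base that is not a false-positive SNP is treated as correct, and the base-call totals (1 694 001 raw and 1 674 222 filtered for SCHU S4 experiment 008) are taken from the comparison table later in this document.

```python
def true_positive_retention(tp_after: int, tp_raw: int) -> float:
    """Percentage of raw true-positive SNPs that survive filtering."""
    return 100.0 * tp_after / tp_raw


def false_positive_rejection(fp_after: int, fp_raw: int) -> float:
    """Percentage of raw false-positive SNPs removed by filtering."""
    return 100.0 * (fp_raw - fp_after) / fp_raw


def accuracy(base_calls: int, false_positives: int) -> float:
    """Percentage of remaining base calls that are correct.

    Assumption: reference calls and true-positive SNP calls are correct,
    so the only incorrect calls are false-positive SNPs; no-calls are
    already excluded from base_calls.
    """
    return 100.0 * (base_calls - false_positives) / base_calls


# SCHU S4, experiment 008: raw counts versus counts after all filters.
print(f"{true_positive_retention(6172, 6908):.3f}")   # retention after filtering
print(f"{false_positive_rejection(126, 1464):.3f}")   # rejection after filtering
print(f"{accuracy(1_694_001, 1464):.3f}")             # raw accuracy
print(f"{accuracy(1_674_222, 126):.3f}")              # filtered accuracy
```

The 10.7% loss of true positives quoted in the abstract is the complement of the final retention value for this experiment.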
SCHU S4 SNP validation summary
| Category | Number of locations |
|---|---|
| Total locations attempted | 562 |
| Results obtained | 484 |
| False-positive validation results | 320 |
| False-negative validation results | 164 |
| ‘False positive’ calls revealed as ‘True positive’ | 5 |
| ‘False positive’ calls confirmed as ‘False positive’ | 315 |
| ‘False negative’ calls revealed as ‘True negative’ | 6 |
| ‘False negative’ calls confirmed as ‘False negative’ | 158 |
Causes of false-positive SNP calls in SCHU S4
| Category | Number of SNPs |
|---|---|
| Total false positives after filtering | 126 |
| False-positive SNPs within 12 bases of a rearrangement boundary | 61 |
| False-positive SNPs within 12 bases of a predicted SNP | 29 |
| Unexplained false-positive SNPs | 42 |
A total of six false-positive SNPs were found to be both within 12 bases of a rearrangement boundary and within 12 bases of a predicted SNP.
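Because six SNPs fall into both explained categories, the category counts must be combined by inclusion-exclusion rather than added directly. A small sketch of that arithmetic, with a hypothetical function name:

```python
def unexplained_false_positives(total: int,
                                near_rearrangement: int,
                                near_predicted_snp: int,
                                near_both: int) -> int:
    """Count false positives not attributable to either cause.

    SNPs close to both a rearrangement boundary and a predicted SNP
    appear in both category counts, so they are subtracted once.
    """
    explained = near_rearrangement + near_predicted_snp - near_both
    return total - explained


# Values from the table and footnote above.
print(unexplained_false_positives(126, 61, 29, 6))
```

With the table values, 61 + 29 - 6 = 84 false positives are explained, leaving the 42 unexplained calls reported in the table.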
Comparison of raw (unfiltered) versus filtered resequencing results
| Results | LVS (007) | LVS (013) | SCHU S4 (008) | SCHU S4 (014) |
|---|---|---|---|---|
| Raw | | | | |
| Raw positions | 1 765 711 | 1 765 711 | 1 765 711 | 1 765 711 |
| Raw base calls | 1 700 196 | 1 703 952 | 1 694 001 | 1 692 973 |
| Raw call rate | 96.290% | 96.502% | 95.939% | 95.881% |
| Raw accuracy | 99.927% | 99.891% | 99.914% | 99.917% |
| False-positive SNPs | 1243 | 1853 | 1464 | 1402 |
| True-positive SNPs | 3 | 3 | 6908 | 6902 |
| Genome-adjusted | | | | |
| Genome-adjusted positions | 1 725 937 | 1 725 937 | 1 743 224 | 1 743 224 |
| Filtered base calls | 1 689 733 | 1 689 733 | 1 674 222 | 1 674 222 |
| Filtered call rate | 97.902% | 97.902% | 96.042% | 96.042% |
| Filtered accuracy | 99.999% | 99.999% | 99.992% | 99.992% |
| False-positive SNPs | 19 | 19 | 126 | 126 |
| True-positive SNPs | 3 | 3 | 6172 | 6172 |
| False-negative SNPs | 0 | 0 | 1292 | 1292 |
| False-positive SNP rate | 0.001% | 0.001% | 0.007% | 0.007% |
| False-negative SNP rate | 0.000% | 0.000% | 17.310% | 17.310% |
The genome-adjusted results are calculated relative to the portions of the chip set that performed well with the DNA samples under consideration. The regions identified by our low-homology filter are excluded from the genome-adjusted positions. The false-negative SNP counts represent the number of expected SNPs that were missing from the final, filtered SNP set. For SCHU S4, 7464 SNPs were expected, on the basis of in silico alignment of the LVS and SCHU S4 genome sequences. In the false-positive SNP rate calculation, the denominator is the number of genome-adjusted base positions that were not expected to be SNPs. In the false-negative SNP rate calculation, the denominator is the number of genome-adjusted positions that were expected to be SNPs.
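The two rate definitions above can be checked against the SCHU S4 column of the table. The following sketch uses hypothetical function names and assumes that all 7464 expected SNPs lie within the genome-adjusted positions, an assumption that reproduces the tabulated rates.

```python
def false_negative_count(expected_snps: int, true_positives: int) -> int:
    """Expected SNPs that are missing from the final, filtered SNP set."""
    return expected_snps - true_positives


def false_positive_rate(false_positives: int,
                        adjusted_positions: int,
                        expected_snps: int) -> float:
    """False positives as a percentage of positions not expected to be SNPs."""
    non_snp_positions = adjusted_positions - expected_snps
    return 100.0 * false_positives / non_snp_positions


def false_negative_rate(false_negatives: int, expected_snps: int) -> float:
    """False negatives as a percentage of positions expected to be SNPs."""
    return 100.0 * false_negatives / expected_snps


# SCHU S4 filtered results from the table above.
EXPECTED_SNPS = 7464
missed = false_negative_count(EXPECTED_SNPS, 6172)
print(missed)
print(f"{false_positive_rate(126, 1_743_224, EXPECTED_SNPS):.3f}%")
print(f"{false_negative_rate(missed, EXPECTED_SNPS):.3f}%")
```

The same functions applied to the LVS column (19 false positives, 1 725 937 positions, 3 expected SNPs, all detected) give the 0.001% and 0.000% rates in the table.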