Manuel Allhoff, Alexander Schönhuth, Marcel Martin, Ivan G Costa, Sven Rahmann, Tobias Marschall.
Abstract
BACKGROUND: Elevated sequencing error rates are the predominant obstacle in single-nucleotide polymorphism (SNP) detection, which is a major goal in the bulk of current studies using next-generation sequencing (NGS). Beyond routinely handled generic sources of errors, certain base calling errors relate to specific sequence patterns. Statistically principled ways to associate sequence patterns with base calling errors have not been previously described. Extant approaches either incur decisive losses in power, because they relate errors to individual genomic positions rather than to motifs, or do not properly distinguish between motif-induced and sequence-unspecific sources of errors.
Year: 2013 PMID: 23735080 PMCID: PMC3622629 DOI: 10.1186/1471-2105-14-S5-S1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Aligned reads with strand-biased errors. Hypothetical reads of two directions (red: forward; blue: backward) are aligned to a reference genome shown on top. Nucleotides within reads indicate mismatches to the forward reference. Three genome positions with extreme strand bias are marked by arrows. CSE-causing motifs described in [8] (GGC, inverted repeats) and [9] (GGT) are highlighted in yellow. Created with the Integrative Genomics Viewer (IGV) [18].
Figure 2. Statistical power analysis. Statistical power (probability, y-axis) to detect significant strand bias at a human exome position with Fisher's exact test, depending on read coverage (x-axis) and on the assumed position-specific error rate (color), which is higher than the assumed background error rate of 0.01. Example: even at an extremely high error rate of 0.5 (cyan), a coverage of 100 yields a discovery chance of only 40%. Fluctuations are caused by the finite sample size of the simulations.
2 × 2 contingency table
| | Match | Mismatch | Total |
|---|---|---|---|
| Forward | a | b | f |
| Backward | c | d | k |
| Total | m | s | n |
2 × 2 contingency table; a, b, c, d: numbers of reads; f, k, m, s: marginals; n = a + b + c + d = f + k = m + s.
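Figure 2 names Fisher's exact test as the strand-bias test for a table of this shape. The following is a minimal sketch in plain Python, not the authors' implementation. It assumes that rows are read directions and columns are match/mismatch counts, and it uses the common two-sided convention of summing the probabilities of all tables that are no more likely than the observed one; the source does not say whether a one-sided or two-sided test was used. The function and argument names are illustrative.

```python
from math import comb


def fisher_exact_two_sided(fwd_match, fwd_mismatch, bwd_match, bwd_mismatch):
    """Two-sided Fisher's exact test p-value for a 2 x 2 strand-bias table.

    Assumed layout (illustrative): rows = forward/backward reads,
    columns = match/mismatch.
    """
    fwd_total = fwd_match + fwd_mismatch      # forward row total
    bwd_total = bwd_match + bwd_mismatch      # backward row total
    match_total = fwd_match + bwd_match       # match column total
    n = fwd_total + bwd_total
    denom = comb(n, match_total)

    def prob(x):
        # Hypergeometric probability of x forward matches given fixed marginals.
        return comb(fwd_total, x) * comb(bwd_total, match_total - x) / denom

    lowest = max(0, match_total - bwd_total)
    highest = min(fwd_total, match_total)
    p_observed = prob(fwd_match)
    # Sum over all tables at most as probable as the observed one
    # (small relative tolerance guards against floating-point ties).
    total = sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return min(1.0, total)


if __name__ == "__main__":
    # Hypothetical position: all 10 mismatches occur on forward reads.
    print(fisher_exact_two_sided(40, 10, 50, 0))
```

At low coverage the smallest attainable p-value is bounded by the marginals, which is one way to see why per-position testing loses power.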
Figure 3. Contingency table construction for the motif CCAGACT. The forward reference (5' to 3') is displayed at the top; below, its complement is shown (3' to 5'). F-reads are indicated as red arrows, R-reads as blue arrows. F-intervals are marked in the forward reference, R-intervals are marked in the reverse complementary reference. Two F-positions (last position in an F-interval) and one R-position (first position in an R-interval) are indicated by vertical boxes. The corresponding individual contingency tables and the resulting joint contingency table for the motif CCAGACT are shown below the alignments. Note that F-positions and R-positions both contribute to the motif's contingency table, as described in the Algorithm section.
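The position conventions in the Figure 3 caption can be sketched as follows. This is a sketch under stated assumptions, not the authors' code: coordinates are 0-based on the forward reference, matching is exact (no N wildcards), and R-intervals are located as occurrences of the motif's reverse complement in the forward reference. The function names are hypothetical, and how the per-position tables are merged is left to the Algorithm section.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string over A, C, G, T, N."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def find_all(text, pattern):
    """Start indices of all (possibly overlapping) occurrences of pattern."""
    starts = []
    i = text.find(pattern)
    while i != -1:
        starts.append(i)
        i = text.find(pattern, i + 1)
    return starts


def motif_positions(reference, motif):
    """Return (F-positions, R-positions) in forward-reference coordinates.

    F-position: last position of an F-interval (motif on the forward strand).
    R-position: first position of an R-interval (motif on the reverse strand,
    found here via its reverse complement; this is an assumption).
    """
    f_positions = [s + len(motif) - 1 for s in find_all(reference, motif)]
    r_positions = find_all(reference, reverse_complement(motif))
    return f_positions, r_positions


if __name__ == "__main__":
    # Made-up reference containing CCAGACT once on each strand.
    ref = "AACCAGACTTTAGTCTGGAA"
    print(motif_positions(ref, "CCAGACT"))
```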
Overview of datasets
| Name | Organism | Reads Accession | Genome Accession | Ref. |
|---|---|---|---|---|
| GAIIx-bs | | DRA DRX000504 | NCBI | [ |
| GAIIx-hg | Human chr. 1 | HG00131 | GRCh37 | [ |
| MiSeq-ec | | Illumina (*) | NCBI | |
| HiSeq-hg | Human chr. 1 | HG00108 | GRCh37 | [ |
Overview of datasets. Names refer to Illumina platform and organism. DRA: DDBJ sequence read archive.
Illumina reads (*): http://www.illumina.com/systems/miseq/scientific_data.ilmn.
Overview of filter settings
| Search space | Thresholds | Number of motifs per dataset | |||||
|---|---|---|---|---|---|---|---|
| (4,1) | 4 | 0 | 6 | 5 | |||
| (8,4) | 8 | 13 | 26 | 74 | |||
Overview of filter settings and number of remaining significant motifs after filtering per dataset and per search space.
A selection of CSE-causing motifs
| (q, n) | Context | Rk. | Occ. | FM | RM | FMM | RMM | -log(p) | FER | RER | ERD |
|---|---|---|---|---|---|---|---|---|---|---|---|
| (8, 4) | NGGCGGGT | 3 | 264 | 5857 | 6867 | 859 | 40 | 180.0 | 12.8 | 0.6 | 12.2 |
| | CGGNGGGT | 4 | 136 | 3366 | 3930 | 477 | 22 | 121.2 | 12.4 | 0.6 | 11.9 |
| | GGCGGGGT | 5 | 62 | 1318 | 1624 | 180 | 5 | 52.0 | 12.0 | 0.3 | 11.7 |
| | ACGGCGGG | 6 | 84 | 1690 | 2065 | 241 | 17 | 58.3 | 12.5 | 0.8 | 11.7 |
| (4, 1) | GGGT | 1 | 13478 | 374933 | 384732 | 10002 | 2643 | ∞ | 2.6 | 0.7 | 1.9 |
| | CGGT | 2 | 25144 | 716801 | 730328 | 14765 | 5071 | ∞ | 2.0 | 0.7 | 1.3 |
| | AGGT | 3 | 20146 | 581562 | 584578 | 12086 | 4237 | ∞ | 2.0 | 0.7 | 1.3 |
| | NGGT | 4 | 79810 | 2272988 | 2317196 | 46304 | 16224 | ∞ | 2.0 | 0.7 | 1.3 |
| (8, 4) | CGGCGGGT | 1 | 532 | 731 | 1330 | 169 | 7 | 60.7 | 18.8 | 0.5 | 18.3 |
| | TGGCGGGT | 2 | 3232 | 5715 | 6410 | 1128 | 37 | 229.3 | 16.5 | 0.6 | 15.9 |
| | CGGCAGGT | 3 | 1396 | 2788 | 3522 | 409 | 19 | 110.8 | 12.8 | 0.5 | 12.3 |
| | NGGCGGGT | 10 | 13712 | 24040 | 30886 | 3029 | 158 | ∞ | 11.2 | 0.5 | 10.7 |
| (4, 1) | No motifs passed filter | | | | | | | | | | |
| (8, 4) | TGGCGGGT | 1 | 3232 | 3803 | 5547 | 1475 | 53 | ∞ | 27.9 | 0.9 | 27.0 |
| | CGGCGGGT | 2 | 532 | 418 | 777 | 152 | 4 | 56.1 | 26.7 | 0.5 | 26.2 |
| | CGGCAGGT | 4 | 1396 | 1935 | 2820 | 567 | 23 | 167.5 | 22.7 | 0.8 | 21.9 |
| | NGGCGGGT | 10 | 13712 | 17251 | 26924 | 4432 | 177 | ∞ | 20.4 | 0.7 | 19.8 |
| | GTGGCTTG | 17 | 7568 | 12047 | 18583 | 2526 | 67 | ∞ | 17.3 | 0.4 | 17.0 |
| (4, 1) | GGGT | 1 | 1366400 | 3208669 | 3340323 | 82048 | 15104 | ∞ | 2.5 | 0.5 | 2.0 |
| | AGGT | 2 | 1836218 | 4530889 | 4740634 | 87166 | 20448 | ∞ | 1.9 | 0.4 | 1.5 |
| | NGGT | 3 | 5261516 | 13265123 | 13614878 | 239748 | 57694 | ∞ | 1.8 | 0.4 | 1.4 |
| | CGGG | 4 | 460830 | 876560 | 861233 | 16336 | 4710 | ∞ | 1.8 | 0.5 | 1.3 |
| | CGGT | 5 | 232662 | 516547 | 521942 | 9306 | 2544 | ∞ | 1.8 | 0.5 | 1.3 |
| (8, 4) | GGCGGGGT | 1 | 102 | 16780 | 24956 | 5809 | 88 | ∞ | 25.7 | 0.4 | 25.4 |
| | GGCGCCTC | 4 | 4 | 349 | 506 | 84 | 1 | 28.7 | 19.4 | 0.2 | 19.2 |
| | NGGCGGGT | 5 | 762 | 122922 | 171199 | 28401 | 879 | ∞ | 18.8 | 0.5 | 18.3 |
| | CGGNGGGT | 11 | 444 | 74979 | 95226 | 12415 | 568 | ∞ | 14.2 | 0.6 | 13.6 |
| | CGGCGGGN | 12 | 942 | 158741 | 205881 | 25090 | 1187 | ∞ | 13.6 | 0.6 | 13.1 |
| (4, 1) | GGGT | 1 | 24802 | 5324301 | 5495475 | 145090 | 24701 | ∞ | 2.7 | 0.4 | 2.2 |
| | AGGT | 2 | 27414 | 5979767 | 6104684 | 121330 | 29230 | ∞ | 2.0 | 0.5 | 1.5 |
| | NGGT | 3 | 146116 | 32813986 | 33422161 | 604790 | 162298 | ∞ | 1.8 | 0.5 | 1.3 |
| | CGGT | 4 | 49530 | 10934765 | 11081037 | 184200 | 54762 | ∞ | 1.7 | 0.5 | 1.2 |
| | GGGN | 5 | 78504 | 20903313 | 21323544 | 338589 | 114360 | ∞ | 1.6 | 0.5 | 1.1 |
| | CGGG | 6 | 32740 | 7089342 | 7227334 | 115433 | 42523 | ∞ | 1.6 | 0.6 | 1.0 |
A selection of CSE-causing motifs for each combination of dataset and parameters. For each motif, we give the rank (Rk.) in the original list sorted by ERD; number of occurrences in the respective genome (Occ.); the contingency table entries FM, RM, FMM, and RMM; the forward error rate FER = FMM/(FM + FMM); the reverse error rate RER = RMM/(RM + RMM); and the error rate difference ERD = FER - RER. If a motif's p-value cannot be numerically distinguished from zero within double precision, we report a - log(p) score of ∞.
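The three rate definitions in the caption can be checked against a table row with a few lines of Python. This is a small illustrative sketch; the function name is hypothetical. It assumes the tabulated FER, RER, and ERD values are percentages rounded to one decimal, which is consistent with the rows shown but is not stated in the caption.

```python
def error_rates(fm, rm, fmm, rmm):
    """Return (FER, RER, ERD) as fractions, following the caption:

    FER = FMM / (FM + FMM), RER = RMM / (RM + RMM), ERD = FER - RER.
    """
    if fm + fmm == 0 or rm + rmm == 0:
        raise ValueError("each strand needs at least one covering read")
    fer = fmm / (fm + fmm)
    rer = rmm / (rm + rmm)
    return fer, rer, fer - rer


if __name__ == "__main__":
    # First table row (context NGGCGGGT): FM=5857, RM=6867, FMM=859, RMM=40.
    fer, rer, erd = error_rates(5857, 6867, 859, 40)
    # Printed as percentages with one decimal (assumed table convention).
    print(f"FER={100 * fer:.1f} RER={100 * rer:.1f} ERD={100 * erd:.1f}")
```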
Top 10 discovered motifs after alignment postprocessing
| Rank | Context | FER | RER | ERD |
|---|---|---|---|---|
| 1 | ACGGCGGT | 26.1 | 0.5 | 25.6 |
| 2 | GTGGCGGT | 25.1 | 0.7 | 24.4 |
| 3 | GCGGCGGT | 22.9 | 0.7 | 22.2 |
| 4 | GTGGCTGT | 22.4 | 0.6 | 21.8 |
| 5 | ATGGCGGT | 21.2 | 1.0 | 20.3 |
| 6 | NCGGCGGT | 20.0 | 0.7 | 19.3 |
| 7 | GTGGCTTG | 20.2 | 1.2 | 19.0 |
| 8 | GNGGCGGT | 19.2 | 0.7 | 18.5 |
| 9 | GCGGCTGT | 18.8 | 0.7 | 18.1 |
| 10 | ACGGCTGT | 18.6 | 0.8 | 17.7 |
Top 10 (based on ERD) contexts on dataset GAIIx-bs with (q, n) = (8, 4) after GATK postprocessing and duplicate removal.