Michael Smith1,2, Rachel Chan1, Paul Gordon3,4.
Abstract
Nucleotides ratcheted through the biomolecular pores of nanopore sequencers generate raw picoamperage currents, which are segmented into step-current level signals representing the nucleotide sequence. These 'squiggles' are a noisy, distorted representation of the underlying true stepped current levels due to experimental and algorithmic factors. We were interested in developing a simulation model to support a white-box approach to identify common distortions, rather than relying on commonly used black box neural network techniques for basecalling nanopore signals. Dynamic time warped-space averaging (DTWA) techniques can generate a consensus from multiple noisy signals without introducing key feature distortions that occur with standard averaging. As a preprocessing tool, DTWA could provide cleaner and more accurate current signals for direct RNA or DNA analysis tools. However, DTWA approaches need modification to take advantage of the a-priori knowledge regarding a common, underlying gold-standard RNA / DNA sequence. Using experimental data, we derive a simulation model to provide known squiggle distortion signals to assist in validating the performance of analysis tools such as DTWA. Simulation models were evaluated by comparing mocked and experimental squiggle characteristics from one Enolase mRNA squiggle group produced by an Oxford MinION nanopore sequencer, and cross-validated using other Enolase, Sequin R1_71_1 and Sequin R2_55_3 mRNA studies. New techniques identified high inserted but low deleted base rates, generating consistent x1.7 squiggle event to base called ratios. Similar probability density and cumulative distribution functions, PDF and CDF, were found across all studies. Experimental PDFs were not the normal distributions expected if squiggle distortion arose from segmentation algorithm artefacts, or through individual nucleotides randomly interacting with individual nanopores. 
Matching experimental and mocked CDFs required the assumption that there are unique features associated with individual raw-current data streams. Z-normalized signal-to-noise ratios suggest that intrinsic sensor limitations are responsible for half of the DTW differences between the gold standard and noisy squiggles.
Year: 2019 PMID: 31318901 PMCID: PMC6638935 DOI: 10.1371/journal.pone.0219495
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Thirty representative squiggles from the A) Enolase and B) Sequin R2_55_3 studies. The leaders, shown in red, can be identified by differences between the local leader and global stream characteristics (intensity means and standard deviations). Further data pruning removed stream outliers that varied by more than 5 standard deviations from the mean ensemble characteristics.
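The 5 standard deviation pruning step described in this caption could be sketched as follows. This is a minimal illustration, assuming each stream is summarized by a single scalar characteristic (for example its mean intensity); the function name `prune_outlier_streams` and the choice of summary value are hypothetical and not taken from the study.

```python
import math


def prune_outlier_streams(characteristics, max_sd=5.0):
    """Return the indices of streams whose summary value lies within
    max_sd population standard deviations of the ensemble mean.

    `characteristics` holds one scalar per stream (an assumption of this
    sketch; the study may combine several characteristics).
    """
    n = len(characteristics)
    mean = sum(characteristics) / n
    std = math.sqrt(sum((c - mean) ** 2 for c in characteristics) / n)
    if std == 0.0:
        # All streams are identical, so there is nothing to prune.
        return list(range(n))
    return [i for i, c in enumerate(characteristics)
            if abs(c - mean) <= max_sd * std]
```

With the population standard deviation used here, no value can sit more than sqrt(n - 1) standard deviations from the mean, so a 5 SD threshold only removes anything once the ensemble holds at least 27 streams.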
Fig 2. The proposed simulation model uses either an experimentally determined or proposed cumulative distribution function to generate a stream length model for the mocked squiggle.
Each gold-stream base is either inserted into, or deleted from, the mocked stream based on probability functions determined experimentally or from an assumed theoretical model.
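One possible reading of this two-stage model is sketched below: a target length is drawn from an empirical CDF by inverse-transform sampling, and each gold-standard event is then deleted or repeated at random. The names `sample_stream_length` and `mock_squiggle`, the repeat-until-failure insertion rule, and the toy gold stream are all assumptions made for illustration, not the authors' implementation.

```python
import random


def sample_stream_length(sorted_lengths, rng):
    """Draw a target stream length from an empirical CDF
    (inverse-transform sampling over sorted observed lengths)."""
    u = rng.random()  # uniform in [0, 1)
    index = min(int(u * len(sorted_lengths)), len(sorted_lengths) - 1)
    return sorted_lengths[index]


def mock_squiggle(gold_levels, p_insert, p_delete, rng):
    """Distort a gold-standard stream of current levels.

    Each gold event is dropped with probability p_delete. A surviving
    event is emitted once and then repeated for as long as a draw with
    probability p_insert succeeds (assumed insertion rule).
    """
    if not 0.0 <= p_insert < 1.0:
        raise ValueError("p_insert must be in [0, 1)")
    mocked = []
    for level in gold_levels:
        if rng.random() < p_delete:
            continue  # deleted event
        mocked.append(level)
        while rng.random() < p_insert:
            mocked.append(level)  # inserted (repeated) event
    return mocked


rng = random.Random(0)
gold = [float(i % 7) for i in range(2000)]  # toy gold-standard levels
target_length = sample_stream_length([3000, 3400, 3800], rng)
lsf = target_length / len(gold)  # length scaling factor
mocked = mock_squiggle(gold, p_insert=1.0 - 1.0 / lsf, p_delete=0.0, rng=rng)
```

Under this insertion rule and with no deletions, the expected mocked length is the gold length divided by (1 - p_insert), so choosing p_insert = 1 - 1/LSF reproduces the target length on average. That is the same functional form as the P(INSERT) = 100 (1 - 1/LSF) model listed among the theoretical rates.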
Fig 3. The original length distribution from SG2001-4000 (red line) has a broad PDF peak at about 1.7 times the gold standard length, with a long high-length tail.
Other stream groupings (dotted and dashed red lines) have similar, but shifted, PDFs. The GEV PDF for SG2001-4000 (solid black line) is shown for comparison.
Fig 4. Experimental insertion and deletion rates can be characterized by expressions of the form A + B (1 − 1/LSF), where the length scaling factor LSF = squiggle_length / gold_standard_length.
The experimentally determined insertion and deletion probabilities for a distorted squiggle’s length scaling factor (LSF) can be modelled as A + B (1 − 1/LSF).
| A | B | |
|---|---|---|
| 90.2 | -56.1 | |
| 88.7 | 17.7 | |
| 11.0 | 14.0 | |
| 9.7 | -15.0 | |
| 2.1 | 12.5 | |
| 1.1 | -1.8 | |
| -0.2 | 8.8 | |
| 0.3 | -0.4 | |
| -0.7 | 6.2 |
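The A + B (1 − 1/LSF) form can be evaluated with a few lines of code. The sketch below assumes the coefficients are percentages, as the magnitudes in the table suggest, and clamps the result to [0, 100]; the clamping and the function names are assumptions of this example. Because the row labels of the table were lost, the example uses the theoretical P(INSERT) coefficients that appear elsewhere in this record rather than any specific table row.

```python
def length_scaling_factor(squiggle_length, gold_standard_length):
    """LSF = squiggle_length / gold_standard_length."""
    return squiggle_length / gold_standard_length


def event_probability_percent(lsf, a, b):
    """Evaluate A + B * (1 - 1/LSF), clamped to the range [0, 100].

    The clamp is an assumption of this sketch so that the result can
    always be used as a probability in percent.
    """
    value = a + b * (1.0 - 1.0 / lsf)
    return max(0.0, min(100.0, value))


lsf = length_scaling_factor(3400, 2000)  # 1.7, the typical ratio reported
p_simple = event_probability_percent(lsf, a=0.0, b=100.0)
p_scaled = event_probability_percent(lsf, a=0.0, b=114.4)
p_offset = event_probability_percent(lsf, a=8.6, b=100.0)
```

At LSF = 1 the second term vanishes and the probability reduces to A, which gives a quick sanity check on any fitted coefficient pair.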
Fig 5. Comparison of the original Enolase SG2001-4000 E-CDF (red line) against M-CDFs generated using experimental insertion and deletion probabilities applied to A) base-specific and B) squiggle-specific simulation models.
Comparison of insertion and deletion probabilities between the original SG2001-4000 squiggle ensemble with local and global models using experimental probability rates.
| INSERTION PROBABILITIES | DELETION PROBABILITIES | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | 5 | 0 | 1 | 2 | 3 | |
| 0.0±0.0 | ||||||||||
| 0.0±0.0 | ||||||||||
| 0.0±0.0 | ||||||||||
| 0.0±0.0 | ||||||||||
| 0.0±0.0 | ||||||||||
| 0.0±0.0 | ||||||||||
| 0.1±0.1 | ||||||||||
| 0.0±0.0 | ||||||||||
| 0.1±0.1 | ||||||||||
1. The empirical insertion probabilities were chosen to better match the experimental and mocked CDFs.
2. The empirical deletion probabilities were chosen to match those of the original SG2001-4000 ensemble.
Fig 6. Comparison of the original Enolase SG2001-4000 CDF (red line) against CDFs from proposed theoretical insertion and deletion rate models.
Comparison of the fits of the original and mocked probability density functions to the long-tailed loglogistic, lognormal and generalized extreme value distributions.
| | Loglogistic | Lognormal | Generalized extreme value |
|---|---|---|---|
| P(INSERT) | mu = 7.73; | mu = 7.73; | mu = 2190; [2179, 2202] |
| P(INSERT) | mu = 7.73; | mu = 7.74; | mu = 2180; [2167, 2193] |
| P(INSERT) | mu = 7.74; | mu = 7.74; | mu = 2187; [2174, 2200] |
| Experimental insertion | mu = 7.68; | mu = 7.68; | mu = 2111; [2102, 2120] |
| Experimental insertion | mu = 7.68; | mu = 7.68; | mu = 2136; [2134, 2139] |
Comparison of the DTW Fréchet distance between the original and mocked squiggles and the gold standard.
As the complexity of the model increases, the Fréchet distances for the mocked data sets move closer to those of the original data sets.
| | Mean DTW Distance |
|---|---|
| Experimental insertions only | 27 ± 18 |
| Experimental full model (insertions and deletions) | 76 ± 28 |
| Experimental Insertions only | 29 ± 17 |
| Experimental full model (insertions and deletions) | 70 ± 28 |
| P(INSERT) = 100 (1–1 / LSF) | 35 ± 23 |
| P(INSERT) = 100 (1–1 / LSF) | 109 ± 28 |
| P(INSERT) = 114.4 (1–1 / LSF) | 117 ± 34 |
| P(INSERT) = 8.6 + 100 (1–1 / LSF) | 149 ± 34 |
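For readers unfamiliar with how such distances are computed, here is a minimal sketch of a classic dynamic time warping (DTW) distance using absolute differences as the local cost. It is a generic textbook formulation, offered as an assumption; the exact DTW Fréchet variant, normalization, and step constraints behind the values in the table are not specified in this record.

```python
def dtw_distance(a, b):
    """Cumulative DTW cost between two sequences of current levels.

    Uses |a[i] - b[j]| as the local cost and allows the three standard
    moves (match, stretch a, stretch b). Only two rows of the cost
    matrix are kept in memory.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    prev = [inf] * (m + 1)
    prev[0] = 0.0  # empty prefix aligned with empty prefix
    for i in range(1, n + 1):
        curr = [inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[m]


gold = [1.0, 2.0, 3.0]
with_insertions = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]  # every event repeated
with_deletion = [1.0, 3.0]                         # middle event missing
d_insert = dtw_distance(gold, with_insertions)
d_delete = dtw_distance(gold, with_deletion)
```

In this formulation a purely repeated event can be absorbed by the warping path at no cost, whereas a missing event always leaves a residual mismatch.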
Comparison of the DTW Fréchet distance between the original and mocked squiggles and the gold standard as the simulated SNR level deteriorates.
| | P(INSERT) = 114.4 (1–1 / LSF) | P(INSERT) = 8.6 + 100 (1–1 / LSF) |
|---|---|---|
| No noise | 116 ± 33 | 149 ± 34 |
| | 128 ± 30 | 160 ± 28 |
| | 169 ± 27 | 199 ± 24 |
| | 247 ± 33 | 275 ± 29 |
| | 397 ± 50 | 421 ± 43 |
| | 573 ± 69 | 588 ± 63 |
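A deteriorating SNR of this kind can be simulated by adding zero-mean Gaussian noise scaled to the signal variance. The sketch below is an illustration under stated assumptions: the SNR is expressed in decibels, the noise is white and Gaussian, and the names `z_normalize` and `add_noise_at_snr` are hypothetical. The SNR levels used in the table are not recoverable from this record, so none are reproduced here.

```python
import math
import random


def z_normalize(signal):
    """Rescale a signal to zero mean and unit (population) variance."""
    n = len(signal)
    mean = sum(signal) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in signal) / n)
    if std == 0.0:
        return [0.0] * n  # a constant signal carries no shape information
    return [(x - mean) / std for x in signal]


def add_noise_at_snr(signal, snr_db, rng):
    """Add white Gaussian noise so that signal variance / noise variance
    matches the requested SNR in decibels (assumed unit)."""
    n = len(signal)
    mean = sum(signal) / n
    signal_power = sum((x - mean) ** 2 for x in signal) / n
    noise_std = math.sqrt(signal_power / (10.0 ** (snr_db / 10.0)))
    return [x + rng.gauss(0.0, noise_std) for x in signal]


rng = random.Random(0)
clean = z_normalize([float(i % 5) for i in range(1000)])  # toy stepped signal
noisy = add_noise_at_snr(clean, snr_db=10.0, rng=rng)
```

Z-normalizing before adding noise keeps the noise level comparable across streams whose raw current ranges differ.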
Fig 7. The proposed simulation model provides individual matches between the E-CDF and M-CDF of the three Enolase data streams, SG1-2000, SG2001-4000 and SG5001-7000.
However, additional simulation terms need to be introduced to model the experimental reason behind the increased mean LSF value for the SG2001-4000 and SG5001-7000 Enolase streams compared with the Enolase SG1-2000 and Sequin squiggles.
Fig 8. The simulation model provides comparable E-CDF and M-CDF for the noisier Sequin R1_71_1 and Sequin R2_55_3 squiggles.
Given the low number of experimentally available data streams, 116 and 122 respectively, the simulation model is used to generate 2000 mocked streams to provide a smoother M-CDF to compare with the original E-CDF.
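Comparing an E-CDF built from roughly a hundred streams with an M-CDF built from 2000 mocked streams only needs an empirical CDF that works for samples of different sizes. The sketch below shows one way to do this, and uses the largest vertical gap between the two curves as a summary number. That gap measure (the two-sample Kolmogorov-Smirnov statistic) is a choice made for this example, not a metric reported in the study, and the function names are hypothetical.

```python
import bisect


def ecdf(samples):
    """Return a function giving the fraction of samples <= x."""
    ordered = sorted(samples)
    n = len(ordered)

    def cdf(x):
        return bisect.bisect_right(ordered, x) / n

    return cdf


def max_cdf_gap(experimental, mocked):
    """Largest vertical distance between the two empirical CDFs,
    evaluated at every observed value from either sample."""
    e_cdf, m_cdf = ecdf(experimental), ecdf(mocked)
    points = sorted(set(experimental) | set(mocked))
    return max(abs(e_cdf(x) - m_cdf(x)) for x in points)


# Toy stream lengths: a small experimental set and a larger mocked set.
experimental_lengths = [3100, 3350, 3400, 3600, 4200]
mocked_lengths = [3000 + 10 * i for i in range(100)]
gap = max_cdf_gap(experimental_lengths, mocked_lengths)
```

Because both step functions only change at observed values, checking the gap at the pooled sample points is sufficient.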