Hiroki Konishi1, Rui Yamaguchi2, Kiyoshi Yamaguchi3, Yoichi Furukawa3, Seiya Imoto1,2.
Abstract
MOTIVATION: In recent years, nanopore sequencing technology has enabled inexpensive long-read sequencing, which promises reads longer than a few thousand bases. Such long-read sequences contribute to the precise detection of structural variations and accurate haplotype phasing. However, deciphering precise DNA sequences from noisy and complicated nanopore raw signals remains a crucial demand for downstream analyses based on higher-quality nanopore sequencing, although various basecallers have been introduced to date.
Year: 2021 PMID: 33165508 PMCID: PMC8189681 DOI: 10.1093/bioinformatics/btaa953
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Metrics of all 11 runs of MinION sequencing
| MinION run | Number of reads | Raw signal length (mean ± SD) | Nucleotide length (mean ± SD) |
|---|---|---|---|
| RUN 1 | 198 318 | 50 408 ± 37 154 | 4762 ± 3577 |
| RUN 2 | 90 619 | 53 587 ± 42 081 | 4700 ± 4220 |
| RUN 3 | 720 885 | 65 579 ± 41 308 | 6724 ± 4345 |
| RUN 4 | 605 642 | 72 292 ± 74 283 | 7040 ± 7380 |
| RUN 5 | 541 783 | 76 314 ± 76 775 | 7123 ± 7347 |
| RUN 6 | 255 240 | 75 450 ± 79 599 | 6795 ± 7372 |
| RUN 7 | 665 879 | 82 656 ± 82 728 | 7503 ± 7715 |
| RUN 8 | 1 016 413 | 72 082 ± 42 833 | 6413 ± 3905 |
| RUN 9 | 946 914 | 72 807 ± 44 096 | 6299 ± 3929 |
| RUN 10 | 569 715 | 72 186 ± 43 109 | 6316 ± 3866 |
| RUN 11 | 220 199 | 70 420 ± 46 432 | 5825 ± 3905 |
Note: For each run, the table shows the number of reads obtained, together with the mean and standard deviation of the raw signal lengths and of the lengths of the nucleotide sequences basecalled by Guppy (as used by Taiyaki).
Read metrics for reads basecalled by five different basecallers
| Basecaller | Total reads | Total yield (Gb) | Read length | Read identity | Insertion rate | Deletion rate |
|---|---|---|---|---|---|---|
| Halcyon | 3 225 205 | 20.5 | 6359 ± 5702 | 0.894 ± 0.084 | 0.028 ± 0.023 | 0.041 ± 0.043 |
| Guppy | 3 150 600 | 20.5 | 6519 ± 5748 | 0.905 ± 0.080 | 0.021 ± 0.018 | 0.041 ± 0.044 |
| Bonito | 3 160 225 | 20.3 | 6410 ± 5664 | 0.902 ± 0.080 | 0.020 ± 0.016 | 0.045 ± 0.050 |
| Chiron | 2 129 764 | 17.4 | 8161 ± 5384 | 0.800 ± 0.061 | 0.047 ± 0.019 | 0.072 ± 0.033 |
| Deepnano | 2 783 926 | 18.4 | 6606 ± 5616 | 0.805 ± 0.055 | 0.042 ± 0.014 | 0.075 ± 0.030 |
Note: Except for total reads and total yield, each measurement is reported as mean ± standard deviation. Read identity, insertion rate and deletion rate were obtained by aligning the basecalled reads to the reference with minimap2.
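The alignment-derived metrics in the table above can be illustrated with a small sketch. The source does not state the exact formulas, so the definitions here are assumptions: identity is the number of matching bases divided by the number of alignment columns, and the insertion and deletion rates are the inserted and deleted bases divided by the same denominator. The function name `alignment_metrics` is hypothetical, and the input is assumed to be an extended CIGAR string with `=` and `X` operators, which minimap2 emits when run with `--eqx`.

```python
import re

# One CIGAR operation: a run length followed by an operator character.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def alignment_metrics(cigar):
    """Per-read identity, insertion rate and deletion rate from a CIGAR.

    Assumed definitions (not taken from the paper): every rate is divided
    by the number of alignment columns, i.e. matches + mismatches +
    inserted bases + deleted bases. Clipping (S/H) is ignored.
    """
    counts = {"=": 0, "X": 0, "I": 0, "D": 0}
    for length, op in CIGAR_OP.findall(cigar):
        if op == "M":
            # "M" mixes matches and mismatches, so identity is undefined.
            raise ValueError("extended CIGAR (=/X) required")
        if op in counts:
            counts[op] += int(length)
    columns = sum(counts.values())
    if columns == 0:
        raise ValueError("no aligned columns in CIGAR")
    return {
        "identity": counts["="] / columns,
        "insertion_rate": counts["I"] / columns,
        "deletion_rate": counts["D"] / columns,
    }


# Toy read: 5 soft-clipped bases, 90 matches, 3 mismatches,
# 2 inserted bases and 5 deleted bases (100 alignment columns).
metrics = alignment_metrics("5S90=3X2I5D")
```

Averaging these per-read values over all aligned reads would give a mean ± standard deviation of the kind reported in the table, under the assumed definitions.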
Fig. 1. Overview of the network architecture of Halcyon from the input (nanopore raw signals) to the output (nucleotide sequence). Each convolution component consists of a one-dimensional convolution layer with a rectified linear unit (ReLU) activation function, followed by a batch normalization layer. Monotonic attention captures the relationship between the last of the five stacked bidirectional LSTM encoding layers and the first of the five stacked LSTM decoding layers.
Fig. 2. (a) Overview of the preparation of training datasets using ONT’s retraining model, Taiyaki. Labeled reads obtained by Taiyaki are then split into fixed-length raw signals and the corresponding nucleotide sequences. (b) Overview of the evaluation of different basecallers in terms of SNV-detection performance, assuming short-read sequencing as the ground truth.
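The splitting step described in panel (a) can be sketched as follows. This is only an illustration under stated assumptions: the labeled read is represented by a hypothetical `base_starts` list giving the signal index at which each base begins, a base is assigned to the window that contains its start, and a trailing partial window is dropped. The function name `split_labeled_read` and the chunk length are likewise hypothetical and not taken from the source.

```python
from bisect import bisect_left


def split_labeled_read(signal, sequence, base_starts, chunk_len):
    """Split one labeled read into fixed-length signal chunks with labels.

    signal      -- list of raw signal values
    sequence    -- nucleotide string labeling the read
    base_starts -- sorted signal indices where each base starts
                   (hypothetical representation of the signal-to-base map)
    chunk_len   -- number of signal values per chunk
    """
    if len(sequence) != len(base_starts):
        raise ValueError("one start index is required per base")
    chunks = []
    # Only full windows are kept; a trailing partial window is dropped.
    for start in range(0, len(signal) - chunk_len + 1, chunk_len):
        end = start + chunk_len
        first = bisect_left(base_starts, start)  # first base starting in window
        last = bisect_left(base_starts, end)     # first base starting after it
        chunks.append((signal[start:end], sequence[first:last]))
    return chunks


# Toy read: 10 signal values labeled with 4 bases.
toy_signal = list(range(10))
chunks = split_labeled_read(toy_signal, "ACGT", [0, 3, 5, 8], chunk_len=5)
```

Each element of `chunks` pairs a fixed-length signal window with the bases assigned to it, which is the shape of training example the caption describes.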
Fig. 3. Individual read statistics obtained by aligning basecalled reads to the reference sequence with minimap2. Distributions of (a) read identities, (b) insertion error rates and (c) deletion error rates calculated over all basecalled reads are illustrated using letter-value plots. The SNV detection rate was measured by comparing SNVs detected by LongShot with those detected by Strelka2 using short-read sequences: (d) SNV detection rate for each chromosome, and (e) true positive rate of SNV detection for each read depth (≤ 20). Basecalling speed was measured as the number of nucleotides basecalled per second: (f) speed measured using a CPU with a single thread, and (g) speed measured using a single GPU and a CPU with five threads.
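The per-chromosome comparison in panel (d) can be sketched as a set comparison. The definition used here is an assumption, since the caption does not give a formula: the detection rate for a chromosome is the fraction of short-read (ground-truth) SNVs that also appear among the long-read calls. SNVs are represented as hypothetical `(chrom, pos, ref, alt)` tuples, and `snv_detection_rate` is an illustrative name; parsing the variant callers' actual output files is out of scope.

```python
from collections import defaultdict


def snv_detection_rate(truth_snvs, called_snvs):
    """Fraction of ground-truth SNVs recovered, per chromosome.

    Both arguments are iterables of (chrom, pos, ref, alt) tuples.
    Assumed definition: recovered truth SNVs / all truth SNVs.
    """
    called = set(called_snvs)
    totals = defaultdict(int)
    hits = defaultdict(int)
    for snv in set(truth_snvs):
        chrom = snv[0]
        totals[chrom] += 1
        if snv in called:
            hits[chrom] += 1
    return {chrom: hits[chrom] / totals[chrom] for chrom in totals}


# Toy example: short-read calls act as the ground truth.
truth = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")]
called = [("chr1", 100, "A", "G"), ("chr2", 50, "G", "A"), ("chr2", 75, "T", "C")]
rates = snv_detection_rate(truth, called)
```

In the toy data, one of the two `chr1` truth SNVs is recovered and the single `chr2` truth SNV is recovered; the extra `chr2` call does not affect the rate under this definition.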
Fig. 4. Actual raw signal input (top) and the attention matrix obtained in the basecalling phase to infer nucleotides from the given signals (bottom). The number of signal values measured while a single nucleotide passes through the pore changes rapidly at a certain point (indicated by a circle in the figure). In the corresponding part of the attention matrix, the gradient of the attention transition speed also changes rapidly.