S M Hossein Tabatabaei Yazdi, Ryan Gabrys, Olgica Milenkovic.
Abstract
DNA-based data storage is an emerging nonvolatile memory technology of potentially unprecedented density, durability, and replication efficiency. The basic system implementation steps include synthesizing DNA strings that contain user information and subsequently retrieving them via high-throughput sequencing technologies. Existing architectures enable reading and writing but do not offer random-access and error-free data recovery from low-cost, portable devices, which is crucial for making the storage technology competitive with classical recorders. Here we show for the first time that a portable, random-access platform may be implemented in practice using nanopore sequencers. The novelty of our approach is to design an integrated processing pipeline that encodes data to avoid costly synthesis and sequencing errors, enables random access through addressing, and leverages efficient portable sequencing via new iterative alignment and deletion error-correcting codes. Our work represents the only known random access DNA-based data storage system that uses error-prone nanopore sequencers, while still producing error-free readouts with the highest reported information rate/density. As such, it represents a crucial step towards practical employment of DNA molecules as storage media.
Year: 2017 PMID: 28694453 PMCID: PMC5503945 DOI: 10.1038/s41598-017-05188-1
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Comparison of features/properties of current DNA-based storage platforms.
| Work | Random access | Portability | Sequencing technology | Sequencer error rate | Error correction/detection | Net density (bits/bp) |
|---|---|---|---|---|---|---|
| Church | No | No | HiSeq | 0.1–0.3% | None | 0.83 |
| Goldman | No | No | HiSeq | 0.1% | Detection | 0.33 |
| Yazdi | | No | Sanger | 0.05% | Correction | 1.575 |
| Grass | No | No | MiSeq | 0.1% | Correction | 1.14 |
| Bornholt | Yes | No | MiSeq | 0.1% | None | 0.88 |
| Erlich | No | No | MiSeq | 0.1% | None | 1.55 |
| This work | Yes | Yes | Nanopore | | Correction | |
Figure 1. The encoding stage. This stage involves compression, representation conversion, encoding into DNA, and subsequent synthesis. Each synthesized DNA codeword is equipped with one or two addresses. The encoding phase entails constrained coding, which limits the occurrence of the address block to one predefined position in the codeword only, and GC-content balancing of each substring of eight bases. Additional homopolymer checks are added directly into the string or stored on classical media; they correspond to only 0.02% of the data content.
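The two constraints named in the caption can be illustrated with a minimal Python sketch. It is not the authors' encoder: it only checks a candidate codeword, and it assumes that "balancing" means exactly half G/C in consecutive, non-overlapping blocks of eight bases. The function names, the default address position, and the example sequences are all hypothetical.

```python
def gc_balanced_blocks(strand: str, block: int = 8) -> bool:
    """Return True if every full, non-overlapping block is exactly half G/C.

    Assumption: blocks start at index 0; a trailing partial block is ignored.
    """
    for start in range(0, len(strand) - block + 1, block):
        chunk = strand[start:start + block]
        gc = sum(base in "GC" for base in chunk)
        if 2 * gc != block:
            return False
    return True


def address_only_at(strand: str, address: str, position: int = 0) -> bool:
    """Return True if `address` occurs at `position` and nowhere else in `strand`."""
    hits = [i for i in range(len(strand) - len(address) + 1)
            if strand.startswith(address, i)]
    return hits == [position]


# Hypothetical 16-base codeword, for illustration only.
codeword = "ACGTACGT" + "GGAATTCC"
print(gc_balanced_blocks(codeword))            # both 8-base blocks hold 4 G/C bases
print(address_only_at(codeword, "ACGTACG"))    # address occurs only at position 0
print(address_only_at(codeword, "ACGT"))       # "ACGT" also occurs at position 4
```

A checker of this kind only verifies the constraints; a real constrained code must also guarantee that every data string can be mapped to a codeword that satisfies them.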
Figure 2. Post-processing via sequence alignment and homopolymer correction. In the first phase, estimates of the DNA codewords are obtained by running several MSA algorithms on high-quality reads that contain an exact match with the address sequence. The second phase improves the estimate by employing an iterative method that includes BWA alignment and an error-correcting scheme.
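As a rough illustration of the first phase, the sketch below selects reads that contain an exact copy of the address and then takes a column-wise majority vote. It is a simplified stand-in, not the pipeline described in the caption: it assumes the reads have already been aligned into equal-length rows with `-` marking gaps, whereas the actual method relies on MSA algorithms and iterative BWA alignment. All names and sequences are hypothetical.

```python
from collections import Counter


def reads_with_address(reads, address):
    """Keep only reads that contain an exact copy of the address sequence."""
    return [read for read in reads if address in read]


def majority_consensus(aligned_rows):
    """Column-wise majority vote over equal-length aligned rows ('-' is a gap)."""
    if not aligned_rows:
        return ""
    width = len(aligned_rows[0])
    if any(len(row) != width for row in aligned_rows):
        raise ValueError("aligned rows must have equal length")
    consensus = []
    for column in zip(*aligned_rows):
        symbol, _ = Counter(column).most_common(1)[0]
        if symbol != "-":  # a majority of gaps means no base at this column
            consensus.append(symbol)
    return "".join(consensus)


# Hypothetical pre-aligned rows, for illustration only.
rows = ["ACGT-A", "ACGTTA", "AC-T-A"]
print(majority_consensus(rows))
```

Plain majority voting handles substitutions well but is weak against the insertions and deletions that dominate nanopore reads, which is presumably why the pipeline adds a second, iterative phase.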
Summary of the readout data, along with the number and type of errors encountered in the reads.
| Block (length) | Number of reads | Coverage depth (average) | Coverage depth (maximum) | Errors per read, average (substitution, insertion, deletion) | Consensus errors, Nanopolish (substitution, insertion, deletion) | Consensus errors, our method (substitution, insertion, deletion) |
|---|---|---|---|---|---|---|
| 1 (1,000) | 201 | 176.145 | 192 | (107, 14, 63) | (14, 32, 5) | (0, 0, 2) |
| 2 (1,000) | 407 | 315.521 | 349 | (123, 12, 70) | (75, 99, 40) | (0, 0, 0) |
| 3 (1,000) | 490 | 460.375 | 482 | (80, 23, 42) | (10, 45, 0) | (0, 0, 0) |
| 4 (1,000) | 100 | 81.763 | 87 | (69, 18, 37) | (1, 54, 1) | (0, 0, 0) |
| 5 (1,000) | 728 | 688.663 | 716 | (88, 20, 48) | (4, 45, 3) | (0, 0, 0) |
| 6 (1,000) | 136 | 120.907 | 129 | (79, 21, 42) | (390, 102, 61) | (0, 0, 0) |
| 7 (1,000) | 577 | 542.78 | 566 | (83, 26, 41) | (3, 31, 3) | (0, 0, 0) |
| 8 (1,000) | 217 | 199.018 | 207 | (83, 20, 46) | (18, 51, 1) | (0, 0, 0) |
| 9 (1,000) | 86 | 56.828 | 75 | (60, 16, 30) | (404, 92, 54) | (0, 0, 0) |
| 10 (1,000) | 442 | 396.742 | 427 | (91, 18, 52) | (388, 100, 59) | (0, 0, 0) |
| 11 (1,000) | 114 | 101.826 | 110 | (79, 23, 42) | (16, 23, 18) | (0, 0, 0) |
| 12 (1,000) | 174 | 162.559 | 169 | (94, 23, 50) | (14, 59, 1) | (0, 0, 0) |
| 13 (1,060) | 378 | 352.35 | 366 | (88, 26, 44) | (7, 55, 4) | (0, 0, 0) |
| 14 (1,000) | 222 | 189.918 | 203 | (69, 22, 34) | (15, 34, 3) | (0, 0, 0) |
| 15 (1,000) | 236 | 222.967 | 232 | (92, 24, 45) | (15, 46, 2) | (0, 0, 0) |
| 16 (1,000) | 198 | 182.99 | 195 | (103, 16, 61) | (15, 62, 4) | (0, 0, 1) |
| 17 (880) | 254 | 240.273 | 250 | (77, 19, 42) | (359, 95, 44) | (0, 0, 0) |
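To put the per-read error counts in perspective, they can be converted into approximate error rates. The sketch below assumes that each triple is the average number of substitutions, insertions, and deletions per read, and that it can be normalized by the block length; the helper name is hypothetical, and the three rows are copied from the table above.

```python
def per_read_error_rate(length, errors):
    """Total (substitution + insertion + deletion) errors divided by block length."""
    substitutions, insertions, deletions = errors
    return (substitutions + insertions + deletions) / length


# (block length, per-read average errors) for blocks 1, 4 and 17 of the table.
rows = {
    1: (1000, (107, 14, 63)),
    4: (1000, (69, 18, 37)),
    17: (880, (77, 19, 42)),
}

for block, (length, errors) in rows.items():
    print(f"block {block}: {per_read_error_rate(length, errors):.1%}")
```

Under these assumptions, the selected blocks show raw per-read error rates of roughly 12% to 18%, while the consensus produced by the authors' method is error-free for most blocks and contains only deletions in the remaining ones.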
Figure 3. Image files used in our experiment. (a,b) show the raw images, which were compressed, encoded, and synthesized into DNA blocks. The Citizen Kane poster[19] (photographed by Kahle, A., date of access: 17/11/2016; RKO Radio Pictures, not copyrighted per claim of Wikipedia repository) and the Smiley Face emoji were of size 9,592 and 130.2 bytes, and had dimensions of 88 × 109 and 56 × 56 pixels, respectively. (c,d) show the recovered images after sequencing of the DNA blocks and the post-processing phase without homopolymer error correction. Despite having only two errors in the Citizen Kane file, we were not able to recover any detail in the image. On the other hand, one error in the Smiley Face emoji did not cause any visible distortion. (e,f) show the image files obtained after homopolymer error correction, leading to an error-free reconstruction of the original file.
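One way to see why homopolymer errors can be isolated is to split a strand into its sequence of distinct bases and the lengths of its homopolymer runs. A deletion inside a run leaves the base sequence intact and changes only one run length, so a stored record of run lengths can point to the affected run. The sketch below illustrates that idea only; it is not the deletion-correcting code used in the paper, and the function names and sequences are hypothetical.

```python
from itertools import groupby


def run_length_split(strand):
    """Split a strand into its run bases and the matching homopolymer run lengths."""
    runs = [(base, len(list(group))) for base, group in groupby(strand)]
    bases = "".join(base for base, _ in runs)
    lengths = [count for _, count in runs]
    return bases, lengths


def homopolymer_mismatches(reference_lengths, observed_lengths):
    """Indices of runs whose lengths differ (assumes the same number of runs)."""
    if len(reference_lengths) != len(observed_lengths):
        raise ValueError("run counts differ; not a pure homopolymer-length error")
    return [i for i, (ref, obs) in enumerate(zip(reference_lengths, observed_lengths))
            if ref != obs]


# Hypothetical strands: the read has lost one 'A' from the first run.
original = "AAACCGTT"
read = "AACCGTT"
ref_bases, ref_lengths = run_length_split(original)
obs_bases, obs_lengths = run_length_split(read)
print(ref_bases == obs_bases)                            # base order is unchanged
print(homopolymer_mismatches(ref_lengths, obs_lengths))  # index of the shortened run
```

In this toy case the check stores every run length, which would be far too costly in practice; the captions indicate that the actual homopolymer checks amount to only 0.02% of the data content.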