S M Hossein Tabatabaei Yazdi, Yongbo Yuan, Jian Ma, Huimin Zhao, Olgica Milenkovic.
Abstract
We describe the first DNA-based storage architecture that enables random access to data blocks and rewriting of information stored at arbitrary locations within the blocks. The newly developed architecture overcomes drawbacks of existing read-only methods that require decoding the whole file in order to read one data fragment. Our system is based on new constrained coding techniques and accompanying DNA editing methods that ensure data reliability, specificity and sensitivity of access, and at the same time provide exceptionally high data storage capacity. As a proof of concept, we encoded parts of the Wikipedia pages of six universities in the USA, and selected and edited parts of the text written in DNA corresponding to three of these schools. The results suggest that DNA is a versatile medium suitable for both ultrahigh-density archival and rewritable storage applications.
Year: 2015 PMID: 26382652 PMCID: PMC4585656 DOI: 10.1038/srep14138
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. (a) The scheme of [4] uses a storage format consisting of DNA strings that cover the encoded compressed text in fragments of length 100 bps. The fragments overlap by 75 bps, thereby providing 4-fold coverage for all except the flanking end bases. This particular fragmenting procedure prevents efficient file editing: if one were to rewrite the "shaded" block, all four fragments containing this block would need to be selected and rewritten at different positions to record the new "shaded" block. (b) The proposed address sequence construction process, which uses the notions of autocorrelation and cross-correlation of sequences [13]. A sequence is uncorrelated with itself if no proper prefix of the sequence is also a suffix of the same sequence. Equivalently, no shift of the sequence overlaps with the sequence itself. Similarly, two different sequences are uncorrelated if no prefix of one sequence matches a suffix of the other. Addresses are chosen to be mutually uncorrelated, and each 1000 bps block is flanked by an address of length 20 on the left and by another address of length 20 on the right (colored ends). (c) Content rewriting via DNA editing: the gBlock method [10] for short rewrites, and the cost-efficient OE-PCR (Overlap Extension PCR) method [11] for sequential rewriting of longer blocks.
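The two ideas in this caption can be checked mechanically. The sketch below is illustrative only and is not the authors' construction: it tests the prefix/suffix definitions of uncorrelated sequences from panel (b) by brute force, and counts which overlapping fragments cover a given base under the panel (a) layout. The function names, the 0-based indexing, and the assumption that fragments start every 25 bps (100 bps length minus 75 bps overlap) are choices made for this example.

```python
from itertools import combinations


def is_self_uncorrelated(seq: str) -> bool:
    """True if no proper prefix of seq is also a suffix of seq."""
    return not any(seq[:k] == seq[-k:] for k in range(1, len(seq)))


def are_uncorrelated(a: str, b: str) -> bool:
    """True if no prefix of one sequence matches a suffix of the other."""
    def prefix_matches_suffix(x: str, y: str) -> bool:
        longest = min(len(x), len(y))
        return any(x[:k] == y[-k:] for k in range(1, longest + 1))

    return not prefix_matches_suffix(a, b) and not prefix_matches_suffix(b, a)


def are_mutually_uncorrelated(addresses: list[str]) -> bool:
    """True if every address is self-uncorrelated and all pairs are uncorrelated."""
    return all(is_self_uncorrelated(s) for s in addresses) and all(
        are_uncorrelated(a, b) for a, b in combinations(addresses, 2)
    )


def fragments_covering(pos: int, total_len: int,
                       frag_len: int = 100, step: int = 25) -> list[int]:
    """Start indices of the overlapping fragments that contain base `pos`.

    Assumes fragments of `frag_len` bases starting every `step` bases,
    i.e. an overlap of frag_len - step = 75 bases between neighbours.
    """
    starts = range(0, total_len - frag_len + 1, step)
    return [s for s in starts if s <= pos < s + frag_len]


if __name__ == "__main__":
    print(is_self_uncorrelated("ACGT"))                 # no prefix equals a suffix
    print(is_self_uncorrelated("ACGA"))                 # "A" is both prefix and suffix
    print(are_mutually_uncorrelated(["AACC", "AACG"]))
    print(fragments_covering(500, 1000))                # an interior base
    print(fragments_covering(10, 1000))                 # a flanking base
```

Under these assumptions an interior base falls in four fragments, so rewriting it touches four strands, whereas a block flanked by its own pair of addresses can be selected and rewritten on its own. Real 20-base addresses would be checked the same way; the short strings here only keep the example readable.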
Figure 2. (a) Gel electrophoresis results for three blocks, indicating that the length of the three selected and amplified sequences is tightly concentrated around 1000 bps, and hence correct. (b) Output of the Sanger sequencer, where all bases shaded in yellow correspond to correct readouts. The sequencing results confirmed that the desired sequences were selected, amplified, and rewritten with 100% accuracy.
Selection, rewriting and sequencing results.
| Sequence name | # of samples | Length of edits (bps) | (Correct/total) / error rate |
|---|---|---|---|
| B1-M-gBlock | 5 | 20 | (5/5)/0% |
| B1-M-PCR | 5 | 20 | (5/5)/0% |
| B2-M-gBlock | 5 | 28 | (5/5)/0% |
| B2-M-PCR | 5 | 28 | (5/5)/0% |
| B3-M-gBlock | 5 | 41 + 29 | (5/5)/0% |
| B3-M-PCR | 5 | 41 + 29 | (5/5)/0% |
Each rewritten 1000 bps sequence was ligated to a linearized pCR™-Blunt vector using the Zero Blunt PCR Cloning Kit and was transformed into E. coli. The E. coli strains with correct plasmids were sequenced at ACGT, Inc. Sequencing was performed using two universal primers, M13F_20 (in the reverse direction) and M13R (in the forward direction), to ensure that the entire block of 1000 bps is covered.
Comparison of storage densities for the DNA encoded information expressed in B/g (bytes per gram), file size, synthesis cost, and random access features of three known DNA storage technologies.
| | Church et al. | Goldman et al. | This work |
|---|---|---|---|
| Density | 0.7 × 10^15 B/g | 2.2 × 10^15 B/g | 4.9 × 10^20 B/g |
| File size | 5.27 Mb | 739 KB | 17 KB |
| Cost | Not available | $12,600 | $4,023 |
| Features | Archival, no random-access | Archival, no random-access | Rewritable, random-access |
Note that the density does not reflect the entropy of the information source, as the text files are encoded in ASCII format, which is a redundant representation system.
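To illustrate why ASCII is a redundant representation, the following sketch compares the 8 bits that ASCII spends per character with a simple zeroth-order estimate of the entropy of a text, computed from character frequencies alone. This is an assumption-laden illustration rather than a calculation from the paper: the function name and the sample sentence are made up for the example, and a frequency-based estimate ignores dependencies between characters, so the true entropy rate of natural-language text is lower still.

```python
import math
from collections import Counter


def empirical_entropy_bits_per_char(text: str) -> float:
    """Zeroth-order Shannon entropy of `text`, in bits per character.

    Uses only single-character frequencies, so it overestimates the
    entropy rate of text with dependencies between characters.
    """
    n = len(text)
    if n == 0:
        return 0.0
    counts = Counter(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog " * 20  # hypothetical input
    h = empirical_entropy_bits_per_char(sample)
    print(f"estimated entropy: {h:.2f} bits/char, ASCII uses 8 bits/char")
    print(f"fraction of each stored byte carrying information: {h / 8:.2f}")
```

A density figure quoted in bytes of ASCII text per gram therefore counts stored symbols, not information content; compressing the text before encoding would change the number of bases needed without changing what is stored.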
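[Algorithm listings: two side-by-side pseudocode procedures, the first with 14 numbered steps and the second with 9. The expressions did not survive extraction. Surviving keywords, first procedure: "begin", "if (" at step 2, "while" at step 5, "end;" at step 8, "return" at step 11, "else" at step 12, "return" at step 13, "end;" at step 14, "end;". Second procedure: "begin", "assume that the input is" at step 3, "if (" at step 4, "return" at step 5, "else" at step 6, "find (" at step 7, "return" at step 8, "end;" at step 9, "end;".]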