Chisom Ezekannagha, Anke Becker, Dominik Heider, Georges Hattab.
Abstract
Deoxyribonucleic acid (DNA) is emerging as a serious medium for long-term archival data storage because of its high capacity, high storage density, and ability to retain data for thousands of years. Encoding algorithms are required to store digital information in DNA and to maintain data integrity. Because DNA is the information carrier, its performance under different processing and storage conditions significantly impacts the capabilities of the data storage system. The design of a DNA storage system must therefore meet specific design considerations to be robust, reliable, and less error-prone. In this work, we summarize the general processes and technologies employed when using synthetic DNA as a storage medium. We also share the design considerations for sustainable engineering to include viability. We expect this work to provide insight into how sustainable design can be used to develop an efficient and robust synthetic DNA-based storage system for long-term archiving.
Keywords: Data; Data storage; Design considerations; Encoding; Synthetic DNA
Year: 2022; PMID: 35677811; PMCID: PMC9167972; DOI: 10.1016/j.mtbio.2022.100306
Source DB: PubMed; Journal: Mater Today Bio; ISSN: 2590-0064
Fig. 1. Evolution of the amount of data stored in DNA. The fitted line shows the trend in data size from related work.
Summary of notable related work on synthetic DNA-based data storage, listed in descending order of stored data size. The encoding alphabet is either binary or alphanumeric. Storage refers to the mechanism used to store the DNA: in an organism, in buffer solution, or in silica beads. Sequencing technologies include Illumina's technology, based on the sequencing-by-synthesis (SBS) principle, and ONT's nanopore technology. Error correction indicates whether the work used codes that are able to detect and correct errors. Information density is the average number of bits encoded per nucleotide (nt), counting the data payload together with the additional sequences for index, error correction, and primers. ∗ Synthetic DNA stored in silica beads; ∗∗ synthetic DNA stored in an organism (in vivo).
| Related work | Encoding alphabet | Sequencing | Error correction (1 = yes, 0 = no) | Data size | Information density (bits/nt) |
| --- | --- | --- | --- | --- | --- |
| Organick et al. | [0, 1] | SBS | 1 | 150 GB | 0.003 |
| Appuswamy et al. | [0, 1] | SBS | 1 | 400 MB | 1 |
| Organick et al. | [0, 1] | SBS/Nanopore | 1 | 200.2 MB | 0.81 |
| Blawat et al. | [0, 1] | SBS | 1 | 22 MB | 0.89 |
| Erlich and Zielinski | [0, 1] | SBS | 1 | 2.15 MB | 1.18 |
| Goldman et al. | [0, 1] | SBS | 1 | 739 kB | 0.19 |
| Church et al. | [0, 1] | SBS | 0 | 650 kB | 0.60 |
| Bornholt et al. | [0, 1] | SBS | 0 | 150 kB | 0.57 |
| Grass et al.∗ | [0, 1] | SBS | 1 | 80 kB | 0.86 |
| Yazdi et al. | [0, 1] | Nanopore | 0 | 3 kB | 1.72 |
| Clelland et al. | [A-Z, a-z, !] | SBS | 0 | 4.625 B | 1.27 |
| Davis∗∗ | [0, 1] | – | 0 | 1.125 B | 1.25 |
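The information density column can be illustrated with a short calculation. The sketch below is not taken from the paper: it assumes one possible reading of the caption, in which the bits of stored data are divided by every synthesized nucleotide, including index, error-correction, and primer sequences. The function names and the oligo layout (100 nt payload, 10 nt index, 10 nt error correction, 40 nt primers) are hypothetical and chosen only to show how overhead lowers the density below the 2 bits/nt ceiling of a four-letter alphabet.

```python
def information_density(data_bytes: float, total_nucleotides: int) -> float:
    """Return the number of data bits stored per nucleotide."""
    if total_nucleotides <= 0:
        raise ValueError("total_nucleotides must be positive")
    return data_bytes * 8 / total_nucleotides


def oligo_pool_nucleotides(num_oligos: int, payload_nt: int, index_nt: int,
                           ecc_nt: int, primer_nt: int) -> int:
    """Total nucleotides in a pool where every oligo has the same layout."""
    return num_oligos * (payload_nt + index_nt + ecc_nt + primer_nt)


# Hypothetical pool: 1,000 oligos, each with 100 nt of payload plus overhead.
total_nt = oligo_pool_nucleotides(
    1000, payload_nt=100, index_nt=10, ecc_nt=10, primer_nt=40
)

# At 2 bits per payload nucleotide the pool carries
# 1000 * 100 * 2 = 200,000 bits = 25,000 bytes of data.
density = information_density(25_000, total_nt)
print(f"{density:.2f} bits/nt")  # prints "1.25 bits/nt"
```

In this hypothetical layout, 60 of every 160 nucleotides are overhead, so the density drops from 2 bits/nt to 1.25 bits/nt.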
Fig. 2. Overview of the process of a synthetic DNA-based storage system. Digital files are converted into binary data. The binary data is encoded into DNA sequences, and additional DNA sequences, including the index, error correction, and primers for DNA amplification, are appended to the oligos. Oligos are synthesized chemically or enzymatically. Synthetic DNA from chemical or enzymatic synthesis is usually single-stranded. It is often stored in single-stranded form, or the complementary strand is enzymatically synthesized to generate double-stranded DNA before storage. PCR enables the selective extraction or amplification of the desired oligos from the pool. The obtained oligos are sequenced using a DNA sequencer. The resulting oligo sequences are then decoded to recover the original binary data.
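The encode and decode ends of this pipeline can be sketched in a few lines. The example below is a minimal illustration, not the scheme used by any of the cited works: it assumes a naive fixed mapping of two bits per nucleotide (00 to A, 01 to C, 10 to G, 11 to T) and a one-byte index prepended to each oligo. Primers and error correction are omitted, and the mapping ignores practical sequence constraints such as homopolymer runs and GC content. All function names are hypothetical.

```python
# Assumed mapping: 2 bits per nucleotide (illustrative only).
BITS_TO_NT = {"00": "A", "01": "C", "10": "G", "11": "T"}
NT_TO_BITS = {nt: bits for bits, nt in BITS_TO_NT.items()}


def bytes_to_dna(data: bytes) -> str:
    """Convert bytes to a nucleotide string, 4 nt per byte."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_NT[bits[i:i + 2]] for i in range(0, len(bits), 2))


def dna_to_bytes(seq: str) -> bytes:
    """Inverse of bytes_to_dna; expects a length that is a multiple of 4."""
    bits = "".join(NT_TO_BITS[nt] for nt in seq)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def make_oligos(data: bytes, payload_bytes: int = 4) -> list[str]:
    """Split data into chunks and prefix each with a 1-byte index (4 nt)."""
    oligos = []
    for index, start in enumerate(range(0, len(data), payload_bytes)):
        if index > 255:
            raise ValueError("a 1-byte index supports at most 256 oligos")
        chunk = data[start:start + payload_bytes]
        oligos.append(bytes_to_dna(bytes([index])) + bytes_to_dna(chunk))
    return oligos


def read_oligos(oligos: list[str]) -> bytes:
    """Reorder sequenced oligos by index and concatenate their payloads."""
    indexed = {dna_to_bytes(o[:4])[0]: dna_to_bytes(o[4:]) for o in oligos}
    return b"".join(indexed[i] for i in sorted(indexed))


message = b"DNA storage"
oligos = make_oligos(message)
# A pool has no inherent order, so simulate reads arriving reversed.
recovered = read_oligos(list(reversed(oligos)))
print(recovered == message)  # prints "True"
```

The index is what allows the decoder to restore the original order, since oligos in a pool are unordered.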
Fig. 3. Error correction methods for synthetic DNA-based storage systems. (A) Huffman codes based on overlapping DNA fragments, resulting in a fourfold redundancy to prevent error and data loss. (B) Code based on the exclusive-or operation, where any two of the three oligos can recover the original information. (C) Two Reed-Solomon codes added to the original information in orthogonal directions alongside the indices. (D) Fountain codes group data into ‘resource packets’ and restore the original information once a sufficient number of packets has been obtained. An additional screening step was added to exclude unqualified fragments.
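The exclusive-or scheme in panel (B) can be illustrated at the byte level. The sketch below is a simplified illustration under stated assumptions, not the published implementation: two equal-length payloads A and B are stored together with a third payload A XOR B, so that the loss of any one of the three can be tolerated. The function names and the dictionary keys are hypothetical, and the conversion of payloads to nucleotides is left out.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise exclusive-or of two equal-length payloads."""
    if len(a) != len(b):
        raise ValueError("payloads must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def encode_triplet(a: bytes, b: bytes) -> dict[str, bytes]:
    """Return the three payloads that would be written to three oligos."""
    return {"A": a, "B": b, "A^B": xor_bytes(a, b)}


def recover_pair(received: dict[str, bytes]) -> tuple[bytes, bytes]:
    """Recover (A, B) from any two of the three payloads."""
    if "A" in received and "B" in received:
        return received["A"], received["B"]
    if "A" in received and "A^B" in received:
        a = received["A"]
        return a, xor_bytes(a, received["A^B"])  # B = A ^ (A ^ B)
    if "B" in received and "A^B" in received:
        b = received["B"]
        return xor_bytes(b, received["A^B"]), b  # A = B ^ (A ^ B)
    raise ValueError("at least two of the three payloads are required")


stored = encode_triplet(b"ACGT", b"data")

# Simulate losing each payload in turn; recovery succeeds every time.
for lost in ("A", "B", "A^B"):
    surviving = {k: v for k, v in stored.items() if k != lost}
    assert recover_pair(surviving) == (b"ACGT", b"data")
print("recovered from every single loss")
```

This protects against the loss of a whole oligo at the cost of 50% extra synthesis; it does not by itself locate or correct errors inside a surviving oligo.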